- Questions & Answers
- Accounting
- Computer Science
- Automata or Computationing
- Computer Architecture
- Computer Graphics and Multimedia Applications
- Computer Network Security
- Data Structures
- Database Management System
- Design and Analysis of Algorithms
- Information Technology
- Linux Environment
- Networking
- Operating System
- Software Engineering
- Big Data
- Android
- iOS
- Matlab

- Economics
- Engineering
- Finance
- Thesis
- Management
- Science/Math
- Statistics
- Writing
- Dissertations
- Essays
- Programming
- Healthcare
- Law

- Log in | Sign up

https://courses.grainger.illinois.edu/cs361/sp2021/secure/project/CS361Project

CS 361: Probability and Statistics for Computer Science (Spring 2021)

Stochastic Optimization Project

1 Stochastic Optimization Theory Part

1.1 A common stochastic optimization task

In many machine learning problems, we are trying to minimize function f(✓) in the following format.

f(✓) =

1

k

kX

j=1

Q(✓, j) (1.1)

where Q(✓, j) is the loss function for jth data point where we have k training data points in the data set D. Here,

✓ is the set of parameters of the model that we try to optimize for, that is we want to find ✓⇤ = argmin✓ f(✓).

However, f is either unknown or very expensive to collect, but we have access to the noisy version g(✓). Using

g(✓) to find the optimized ✓⇤ for f(✓) would be a stochastic optimization task.

Exercise 1.

(4 points) Define the random variable g(✓) specifically here to be Q(✓, i), where i is a random variable that

is of uniform distribution in the integer set [1 : k]. So, rg(✓) is to be rQ(✓, i). Further, we define the noise

z = Q(✓, i)� f(✓), hence g(✓) = Q(✓, i) = f(✓) + z. That’s why we call g is the noisy version of f . Given this

definition of f and g, prove that equations E[g(✓)] = f(✓), E[rg(✓)] = rf(✓), and E[z] = 0 are satisfied.

1.2 Review of GD and SGD

The goal of optimization is to find the ✓⇤ that minimizes f(✓), which could be more general than that in (1.1).

Again sometimes f is either unknown or very expensive to collect, but we have access to the noisy version g(✓).

The following are two approximation methods to find ✓⇤ iteratively.

1.2.1 Gradient Descent Algorithm (GD)

• Initialize ✓1

• For t = 1, 2, · · · Then do the following update

✓t+1 = ✓t � ⌘trf(✓t) (1.2)

1.2.2 Stochastic Gradient Descent Algorithm (SGD)

• Initialize ✓1

• For t = 1, 2, · · · . Obtain a noisy version of the gradient (i.e. rg(✓t)). Then do the following update

✓t+1 = ✓t � ⌘trg(✓t) (1.3)

1.3 Convergence rates for SGD and GD

Suppose we want to minimize function f with learning rate sequence ⌘t, We have the following theorems:

Theorem 1.1 Assume rf has a unique root ✓⇤. If f is twice continuously di↵erentiable and strongly convex,

and ⌘t = O(t�1), then to reach an approximation error ✏ = |E[f(✓t)] � f(✓⇤)|, stochastic gradient descent

requires t = O( 1✏ ) updates.

1

– Stochastic Optimization Project 2

Theorem 1.2 Assume rf has a unique root ✓⇤. If either the smoothness assumptions about f in theorem 1.1

or ⌘t = O(t�1) is not met, then to reach an approximation error ✏ = |E[f(✓t)] � f(✓⇤)|, stochastic gradient

descent requires t = O( 1✏2 ) updates.

Theorem 1.3 Assume rf has a unique root ✓⇤. To reach an approximation error ✏ = |E[f(✓t)] � f(✓⇤)|,

gradient descent requires t = O(ln( 1✏ )) updates.

2 Comparing SGD vs GD for their running time

Let us consider the f minimization task discussed earlier in (1.1).

• f(✓) and g(✓) are as described in exercise 1, rf(✓⇤) = 0.

• Let’s assume that f is strongly convex, and is twice continuously di↵erentiable.

• We use the ⌘t = 1t learning rate sequence.

• We have k data points.

• We want to achieve a training precision of ✏ = |E[f(✓t)]� f(✓⇤)|.

• Assume finding rQ(✓, j) takes c time, and finding rf(✓) takes kc time.

Exercise 2.

(15 points) Fill in table 1, with complexity orders in terms of k and ✏

Table 1 Comparing SGD vs GD in terms of training precision

SGD GD

Computational Cost per Update O(1)

Number of updates to reach ✏

Total Cost

Explain your answer for each element of the second row by providing a reference to the theorem

mentioned in this document. For every other element, explain your answer.

3 Comparing SGD vs GD with respect to test loss

The last exercise was about finding the computational complexity required to reach a specific precision with

respect to the optimal training loss. However, a specific precision of training loss does not translate to the

same precision of test loss. Here are some helpful facts:

• To achieve a specific test loss precision ⇢, you need a large enough number of training samples k. The

relation

⇢ = O(k��)

must hold for most loss functions where 0 < � 1 is an unknown constant.

• Let’s assume ✏ = ⇥(⇢); for simplicity, you can assume that it is as if ✏ is c⇥⇢ where c is a positive constant.

Exercise 3.

(16 points) 1) Fill in table 2, with complexity orders in terms of ⇢ and �. You can refer to the previous table,

and replace the elements with appropriate values. Make sure you state the reason for each element even

if it looks obvious.

– Stochastic Optimization Project 3

Table 2 Comparing SGD vs GD in terms of testing precision

SGD GD

Computational Cost per Update O(1)

Number of updates to reach ⇢

Total Cost

2) For a typical � = 0.2, Explain why choosing GD over SGD can be very unwise.

CS 361: Probability and Statistics for Computer Science (Spring 2021)

Stochastic Optimization Implementation

Objective: Implement state-of-the-art stochastic optimization algorithms for Machine Learning Problems,

Linear Regression and Classification (using Neural Networks).

3.1 Adaptive Momentum Estimation (ADAM) Gradient Descent Algorithm

SGD can be ine↵ective when the function is highly non-convex. Unfortunately, most modern applications of

Machine Learning involve non-convex optimization problems. One stochastic optimization algorithm that works

well even for non-convexity is ADAM [KB14]. ADAM uses past gradient information to “speed” up optimization

by smoothing, and the algoirthm has been further improved [SSS18]. You will implement the ADAM stochastic

optimization algorithm for a Linear Regression problem.

The pseudo-code for ADAM has been reproduced here from this paper. Credits to [KB14].

Disclaimer: The textbook, in Chapter 13, uses � for parameters but we will be using ✓.

Algorithm 1: g

2

t indicates the elementwise square gt�gt. Good default settings for the tested machine learning

problems are ↵ = 0.001, �1 = 0.9,�2 = 0.999 and ✏ = 10�8. All operations on vectors are element-wise. With

�

t

1 and �

t

2, we denote �1 and �2 to the power t.

Require: ↵: Stepsize

Require: ✏: Division-by-zero control parameter

Require: �1,�2 2 [0, 1): Exponential decay rates for the moment estimates

Require: f(✓): Stochastic objective function with parameters ✓.

Require: ✓0: Initial parameter vector

m0 0 (Initialize 1st moment vector)

v0 0 (Initialize 2nd moment vector)

t 0 (Initialize timestep)

while ✓t not converged do

t t+ 1

gt r✓ft(✓t�1) (Get gradients w.r.t. stochastic objective at timestep t)

mt �1 ·mt�1 + (1� �1) · gt (Update biased first moment estimate)

vt �2 · vt�1 + (1� �2) · g2t (Update biased second raw moment estimate)

m̂t mt/(1� �t1) (Compute bias-corrected first moment estimate).

v̂t vt/(1� �t2) (Compute bias-correct second raw moment estimate).

✓̂t ✓t�1 � ↵ · m̂t/(

p

v̂t + ✏) (Update parameters)

end while

return ✓t (Resulting parameters)

Exercise 4.

Consider the following problem setting:

• The number of training data points is k = 1000. The number of test data points is 50.

• Generation of the data set to fit for the model: the jth data point is represented by xj and is a 10-

dimensional vector. Each element of xj is being generated from a uniform distribution over the interval

[�1, 1].

• ✓true (i.e. the true parameter set) and ✓ (i.e. the variable indicating a candidate parameter set) are also

10-dimensional vectors. Assume that ✓true is all ones. However, we will pretend that we do not know it

and we want to estimate it using the training data.

• Data is being generated according to yj = xTj ✓true+0.1 ·⇠ (where ⇠ ⇠ N (0, 1) is a sample from the normal

distribution). The label yj is a scalar.

1

– Stochastic Optimization Implementation 2

• Let us assume that all the algorithm variants start from the same initial ✓, whose elements are picked

uniformly in the interval [0, 0.1].

• Use 1000 updates.

Since this is an optimization problem, we need to define the loss function. Let’s use the following format

F (✓) =

1

k

kX

j=1

Q(✓, j) (3.1)

Q(✓, j) =

����x

T

j ✓ � yj

����

�

(3.2)

Where � is a hyper-parameter that we use to control and define the loss function.

For the following problems, state findings, provide analysis, and substantiate them in your write-up with screen-

shots of the plots from the Gradient Descent Python notebook code. Please save the notebook to your

own Google drive so that your work could be preserved after the runtime expires.

1. (5 points) For � = 2, the problem becomes the least squares regression that you learned from Chapter

13. State the closed form solution (i.e. ✓ = ...) in your report, and then implement the solution to solve

for ✓. Provide the results of the experiment and state whether it is close to the true value.

2. (5 points) For Q(✓, j), find an expression that gives you the gradient rQ(✓, j). Report this expression,

and implement it in the appropriate function.

hint: For r(✓) = h(e(✓)) = h(xTj ✓ � yj), the gradient can be written as rr(✓) = @[email protected] · re(✓) =

@h

@e · xj

according to the chain rule.

hint: The sign function, sgn(x), may be useful.

3. (5 points) Code the ADAM optimization algorithm (with default hyper-parameters such as the learning

rate as mentioned in the pseudocode above) and SGD to find the best parameter ✓. Use a batch size of 1

for this problem, and perform 1000 parameter updates. Report the final set of parameters.

4. (5 points) Update your code to compute the average test loss at every 50th update, and plot these values.

You might notice that the error bars of ADAM and SGD overlap. This is due to the inherent randomness

from sampling values. To avoid this probabilistic overlap, increase the number of replicates (num rep in

the starter code) until the error bars between ADAM and SGD do not overlap. Report this curve.

5. (9 points) Run ADAM method for each of � = 0.4, 0.7, 1, 2, 3, and 5. Report your observations clearly,

and analyze the trends you are seeing. State whether ADAM consistently out performs SGD. Your analysis

should include the reason why one method outperforms the other under each setting.

Example Plot

Note: Your plots will probably not end up looking exactly like this one.

– Stochastic Optimization Implementation 3

3.2 Classifying Handwritten Digits with Neural Networks

Next, we’ll use Neural Networks to classify handwritten digits from MNIST dataset (dataset of handwritten

digits). The objective is to use di↵erent stochastic optimization algorithms that you have seen so far and

compare their performances (GD, SGD, ADAM). You will train the model and then classify handwritten digits.

Fun fact: One of the co-creators of MNIST dataset (Dr. Yann LeCun) is also the co-recicpient of 2018 ACM

Turing award for his work on Neural Networks.

Exercise 5.

For the following problems, please modify the Neural Networks Python notebook code.

1. To run the Gradient Descent (GD) Algorithm: Set the (mini) batch size to XXXXXXXXXXthe size of the MNIST

dataset). Run the SGD optimizer with a learning rate of 0.003.

(1 points) Why is this the same as the GD algorithm if we are using SGD optimizer?

(2 points) What do you observe? Report the accuracy.

(2 points) List at least 2 ways to improve the accuracy.

2. To run the Stochastic Gradient Descent (SGD) Algorithm: Set the (mini) batch size to 64. Run the SGD

optimizer with a learning rate of 0.003.

(2 points) What do you observe? Report the accuracy.

(1 points) List at least one way to improve accuracy further.

3. To run the ADAM algorithm: Set the (mini) batch size to 64. Run the ADAM optimizer with default

learning rates (Algorithm 1)

Hint: You can use Pytorch (or any other library in any language) for setting up the ADAM optimizer.

(2 points) What do you observe? Report the accuracy.

(1 points) Why does ADAM converge faster than SGD?

– Stochastic Optimization Implementation 4

References

[RM51] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: The annals of

mathematical statistics (1951), pp. 400–407.

[Sac58] Jerome Sacks. “Asymptotic distribution of stochastic approximation procedures”. In: The Annals

of Mathematical Statistics XXXXXXXXXX), pp. 373–405.

[NY83] Semenovich Nemirovsky Arkadi and David Borisovich Yudin. “Problem complexity and method

e�ciency in optimization.” In: Wiley-Interscience series in discrete mathematics. (1983).

[CZ07] Felipe Cucker and Ding Xuan Zhou. Learning theory: an approximation theory viewpoint. Vol. 24.

Cambridge University Press, 2007.

[Nem+09] Arkadi Nemirovski et al. “Robust stochastic approximation approach to stochastic programming”.

In: SIAM Journal on optimization XXXXXXXXXX), pp. 1574–1609.

[Bot10] Léon Bottou. “Large-scale machine learning with stochastic gradient descent”. In: Proceedings of

COMPSTAT’2010. Springer, 2010, pp. 177–186.

[Bot12] Léon Bottou. “Stochastic gradient descent tricks”. In: Neural networks: Tricks of the trade. Springer,

2012, pp. 421–436.

[DGL13] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition. Vol. 31.

Springer Science & Business Media, 2013.

[JZ13] Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive variance

reduction”. In: Advances in neural information processing systems. 2013, pp. 315–323.

[Vap13] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization XXXXXXXXXXarXiv:

XXXXXXXXXX [cs.LG].

[MV18] Pierre Moulin and Venugopal Veeravalli. Statistical inference for engineers and data scientists.

Cambridge University Press, 2018.

[SSS18] Sashank, Satyen, and Sanjiv. On the Convergence of Adam and Beyond. 2018.

Acknowledgements

Ehsan Saleh, Anay Pattanik created the first draft of the project outline.

Hongye Liu, Ajay Fewell, Chenhui Zhang, Muhammed Imran, Brian Yang, and Jinglin contributed to the newest

edition.

This document was compiled and inspired by the work and ideas shown in [RM51; Sac58; NY83; Nem+09;

Bot10; JZ13; Bot12; Vap13; DGL13; CZ07; MV18]

https://arxiv.org/abs/ XXXXXXXXXX

CS 361: Probability and Statistics for Computer Science (Spring 2021)

Stochastic Optimization Project

1 Stochastic Optimization Theory Part

1.1 A common stochastic optimization task

In many machine learning problems, we are trying to minimize function f(✓) in the following format.

f(✓) =

1

k

kX

j=1

Q(✓, j) (1.1)

where Q(✓, j) is the loss function for jth data point where we have k training data points in the data set D. Here,

✓ is the set of parameters of the model that we try to optimize for, that is we want to find ✓⇤ = argmin✓ f(✓).

However, f is either unknown or very expensive to collect, but we have access to the noisy version g(✓). Using

g(✓) to find the optimized ✓⇤ for f(✓) would be a stochastic optimization task.

Exercise 1.

(4 points) Define the random variable g(✓) specifically here to be Q(✓, i), where i is a random variable that

is of uniform distribution in the integer set [1 : k]. So, rg(✓) is to be rQ(✓, i). Further, we define the noise

z = Q(✓, i)� f(✓), hence g(✓) = Q(✓, i) = f(✓) + z. That’s why we call g is the noisy version of f . Given this

definition of f and g, prove that equations E[g(✓)] = f(✓), E[rg(✓)] = rf(✓), and E[z] = 0 are satisfied.

1.2 Review of GD and SGD

The goal of optimization is to find the ✓⇤ that minimizes f(✓), which could be more general than that in (1.1).

Again sometimes f is either unknown or very expensive to collect, but we have access to the noisy version g(✓).

The following are two approximation methods to find ✓⇤ iteratively.

1.2.1 Gradient Descent Algorithm (GD)

• Initialize ✓1

• For t = 1, 2, · · · Then do the following update

✓t+1 = ✓t � ⌘trf(✓t) (1.2)

1.2.2 Stochastic Gradient Descent Algorithm (SGD)

• Initialize ✓1

• For t = 1, 2, · · · . Obtain a noisy version of the gradient (i.e. rg(✓t)). Then do the following update

✓t+1 = ✓t � ⌘trg(✓t) (1.3)

1.3 Convergence rates for SGD and GD

Suppose we want to minimize function f with learning rate sequence ⌘t, We have the following theorems:

Theorem 1.1 Assume rf has a unique root ✓⇤. If f is twice continuously di↵erentiable and strongly convex,

and ⌘t = O(t�1), then to reach an approximation error ✏ = |E[f(✓t)] � f(✓⇤)|, stochastic gradient descent

requires t = O( 1✏ ) updates.

1

– Stochastic Optimization Project 2

Theorem 1.2 Assume rf has a unique root ✓⇤. If either the smoothness assumptions about f in theorem 1.1

or ⌘t = O(t�1) is not met, then to reach an approximation error ✏ = |E[f(✓t)] � f(✓⇤)|, stochastic gradient

descent requires t = O( 1✏2 ) updates.

Theorem 1.3 Assume rf has a unique root ✓⇤. To reach an approximation error ✏ = |E[f(✓t)] � f(✓⇤)|,

gradient descent requires t = O(ln( 1✏ )) updates.

2 Comparing SGD vs GD for their running time

Let us consider the f minimization task discussed earlier in (1.1).

• f(✓) and g(✓) are as described in exercise 1, rf(✓⇤) = 0.

• Let’s assume that f is strongly convex, and is twice continuously di↵erentiable.

• We use the ⌘t = 1t learning rate sequence.

• We have k data points.

• We want to achieve a training precision of ✏ = |E[f(✓t)]� f(✓⇤)|.

• Assume finding rQ(✓, j) takes c time, and finding rf(✓) takes kc time.

Exercise 2.

(15 points) Fill in table 1, with complexity orders in terms of k and ✏

Table 1 Comparing SGD vs GD in terms of training precision

SGD GD

Computational Cost per Update O(1)

Number of updates to reach ✏

Total Cost

Explain your answer for each element of the second row by providing a reference to the theorem

mentioned in this document. For every other element, explain your answer.

3 Comparing SGD vs GD with respect to test loss

The last exercise was about finding the computational complexity required to reach a specific precision with

respect to the optimal training loss. However, a specific precision of training loss does not translate to the

same precision of test loss. Here are some helpful facts:

• To achieve a specific test loss precision ⇢, you need a large enough number of training samples k. The

relation

⇢ = O(k��)

must hold for most loss functions where 0 < � 1 is an unknown constant.

• Let’s assume ✏ = ⇥(⇢); for simplicity, you can assume that it is as if ✏ is c⇥⇢ where c is a positive constant.

Exercise 3.

(16 points) 1) Fill in table 2, with complexity orders in terms of ⇢ and �. You can refer to the previous table,

and replace the elements with appropriate values. Make sure you state the reason for each element even

if it looks obvious.

– Stochastic Optimization Project 3

Table 2 Comparing SGD vs GD in terms of testing precision

SGD GD

Computational Cost per Update O(1)

Number of updates to reach ⇢

Total Cost

2) For a typical � = 0.2, Explain why choosing GD over SGD can be very unwise.

CS 361: Probability and Statistics for Computer Science (Spring 2021)

Stochastic Optimization Implementation

Objective: Implement state-of-the-art stochastic optimization algorithms for Machine Learning Problems,

Linear Regression and Classification (using Neural Networks).

3.1 Adaptive Momentum Estimation (ADAM) Gradient Descent Algorithm

SGD can be ine↵ective when the function is highly non-convex. Unfortunately, most modern applications of

Machine Learning involve non-convex optimization problems. One stochastic optimization algorithm that works

well even for non-convexity is ADAM [KB14]. ADAM uses past gradient information to “speed” up optimization

by smoothing, and the algoirthm has been further improved [SSS18]. You will implement the ADAM stochastic

optimization algorithm for a Linear Regression problem.

The pseudo-code for ADAM has been reproduced here from this paper. Credits to [KB14].

Disclaimer: The textbook, in Chapter 13, uses � for parameters but we will be using ✓.

Algorithm 1: g

2

t indicates the elementwise square gt�gt. Good default settings for the tested machine learning

problems are ↵ = 0.001, �1 = 0.9,�2 = 0.999 and ✏ = 10�8. All operations on vectors are element-wise. With

�

t

1 and �

t

2, we denote �1 and �2 to the power t.

Require: ↵: Stepsize

Require: ✏: Division-by-zero control parameter

Require: �1,�2 2 [0, 1): Exponential decay rates for the moment estimates

Require: f(✓): Stochastic objective function with parameters ✓.

Require: ✓0: Initial parameter vector

m0 0 (Initialize 1st moment vector)

v0 0 (Initialize 2nd moment vector)

t 0 (Initialize timestep)

while ✓t not converged do

t t+ 1

gt r✓ft(✓t�1) (Get gradients w.r.t. stochastic objective at timestep t)

mt �1 ·mt�1 + (1� �1) · gt (Update biased first moment estimate)

vt �2 · vt�1 + (1� �2) · g2t (Update biased second raw moment estimate)

m̂t mt/(1� �t1) (Compute bias-corrected first moment estimate).

v̂t vt/(1� �t2) (Compute bias-correct second raw moment estimate).

✓̂t ✓t�1 � ↵ · m̂t/(

p

v̂t + ✏) (Update parameters)

end while

return ✓t (Resulting parameters)

Exercise 4.

Consider the following problem setting:

• The number of training data points is k = 1000. The number of test data points is 50.

• Generation of the data set to fit for the model: the jth data point is represented by xj and is a 10-

dimensional vector. Each element of xj is being generated from a uniform distribution over the interval

[�1, 1].

• ✓true (i.e. the true parameter set) and ✓ (i.e. the variable indicating a candidate parameter set) are also

10-dimensional vectors. Assume that ✓true is all ones. However, we will pretend that we do not know it

and we want to estimate it using the training data.

• Data is being generated according to yj = xTj ✓true+0.1 ·⇠ (where ⇠ ⇠ N (0, 1) is a sample from the normal

distribution). The label yj is a scalar.

1

– Stochastic Optimization Implementation 2

• Let us assume that all the algorithm variants start from the same initial ✓, whose elements are picked

uniformly in the interval [0, 0.1].

• Use 1000 updates.

Since this is an optimization problem, we need to define the loss function. Let’s use the following format

F (✓) =

1

k

kX

j=1

Q(✓, j) (3.1)

Q(✓, j) =

����x

T

j ✓ � yj

����

�

(3.2)

Where � is a hyper-parameter that we use to control and define the loss function.

For the following problems, state findings, provide analysis, and substantiate them in your write-up with screen-

shots of the plots from the Gradient Descent Python notebook code. Please save the notebook to your

own Google drive so that your work could be preserved after the runtime expires.

1. (5 points) For � = 2, the problem becomes the least squares regression that you learned from Chapter

13. State the closed form solution (i.e. ✓ = ...) in your report, and then implement the solution to solve

for ✓. Provide the results of the experiment and state whether it is close to the true value.

2. (5 points) For Q(✓, j), find an expression that gives you the gradient rQ(✓, j). Report this expression,

and implement it in the appropriate function.

hint: For r(✓) = h(e(✓)) = h(xTj ✓ � yj), the gradient can be written as rr(✓) = @[email protected] · re(✓) =

@h

@e · xj

according to the chain rule.

hint: The sign function, sgn(x), may be useful.

3. (5 points) Code the ADAM optimization algorithm (with default hyper-parameters such as the learning

rate as mentioned in the pseudocode above) and SGD to find the best parameter ✓. Use a batch size of 1

for this problem, and perform 1000 parameter updates. Report the final set of parameters.

4. (5 points) Update your code to compute the average test loss at every 50th update, and plot these values.

You might notice that the error bars of ADAM and SGD overlap. This is due to the inherent randomness

from sampling values. To avoid this probabilistic overlap, increase the number of replicates (num rep in

the starter code) until the error bars between ADAM and SGD do not overlap. Report this curve.

5. (9 points) Run ADAM method for each of � = 0.4, 0.7, 1, 2, 3, and 5. Report your observations clearly,

and analyze the trends you are seeing. State whether ADAM consistently out performs SGD. Your analysis

should include the reason why one method outperforms the other under each setting.

Example Plot

Note: Your plots will probably not end up looking exactly like this one.

– Stochastic Optimization Implementation 3

3.2 Classifying Handwritten Digits with Neural Networks

Next, we’ll use Neural Networks to classify handwritten digits from MNIST dataset (dataset of handwritten

digits). The objective is to use di↵erent stochastic optimization algorithms that you have seen so far and

compare their performances (GD, SGD, ADAM). You will train the model and then classify handwritten digits.

Fun fact: One of the co-creators of MNIST dataset (Dr. Yann LeCun) is also the co-recicpient of 2018 ACM

Turing award for his work on Neural Networks.

Exercise 5.

For the following problems, please modify the Neural Networks Python notebook code.

1. To run the Gradient Descent (GD) Algorithm: Set the (mini) batch size to XXXXXXXXXXthe size of the MNIST

dataset). Run the SGD optimizer with a learning rate of 0.003.

(1 points) Why is this the same as the GD algorithm if we are using SGD optimizer?

(2 points) What do you observe? Report the accuracy.

(2 points) List at least 2 ways to improve the accuracy.

2. To run the Stochastic Gradient Descent (SGD) Algorithm: Set the (mini) batch size to 64. Run the SGD

optimizer with a learning rate of 0.003.

(2 points) What do you observe? Report the accuracy.

(1 points) List at least one way to improve accuracy further.

3. To run the ADAM algorithm: Set the (mini) batch size to 64. Run the ADAM optimizer with default

learning rates (Algorithm 1)

Hint: You can use Pytorch (or any other library in any language) for setting up the ADAM optimizer.

(2 points) What do you observe? Report the accuracy.

(1 points) Why does ADAM converge faster than SGD?

– Stochastic Optimization Implementation 4

References

[RM51] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: The annals of

mathematical statistics (1951), pp. 400–407.

[Sac58] Jerome Sacks. “Asymptotic distribution of stochastic approximation procedures”. In: The Annals

of Mathematical Statistics XXXXXXXXXX), pp. 373–405.

[NY83] Semenovich Nemirovsky Arkadi and David Borisovich Yudin. “Problem complexity and method

e�ciency in optimization.” In: Wiley-Interscience series in discrete mathematics. (1983).

[CZ07] Felipe Cucker and Ding Xuan Zhou. Learning theory: an approximation theory viewpoint. Vol. 24.

Cambridge University Press, 2007.

[Nem+09] Arkadi Nemirovski et al. “Robust stochastic approximation approach to stochastic programming”.

In: SIAM Journal on optimization XXXXXXXXXX), pp. 1574–1609.

[Bot10] Léon Bottou. “Large-scale machine learning with stochastic gradient descent”. In: Proceedings of

COMPSTAT’2010. Springer, 2010, pp. 177–186.

[Bot12] Léon Bottou. “Stochastic gradient descent tricks”. In: Neural networks: Tricks of the trade. Springer,

2012, pp. 421–436.

[DGL13] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition. Vol. 31.

Springer Science & Business Media, 2013.

[JZ13] Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive variance

reduction”. In: Advances in neural information processing systems. 2013, pp. 315–323.

[Vap13] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization XXXXXXXXXXarXiv:

XXXXXXXXXX [cs.LG].

[MV18] Pierre Moulin and Venugopal Veeravalli. Statistical inference for engineers and data scientists.

Cambridge University Press, 2018.

[SSS18] Sashank, Satyen, and Sanjiv. On the Convergence of Adam and Beyond. 2018.

Acknowledgements

Ehsan Saleh, Anay Pattanik created the first draft of the project outline.

Hongye Liu, Ajay Fewell, Chenhui Zhang, Muhammed Imran, Brian Yang, and Jinglin contributed to the newest

edition.

This document was compiled and inspired by the work and ideas shown in [RM51; Sac58; NY83; Nem+09;

Bot10; JZ13; Bot12; Vap13; DGL13; CZ07; MV18]

https://arxiv.org/abs/ XXXXXXXXXX

May 04, 2021

- fsdfdsds Probability Models in Operations Research Chapter 4 Discrete Time Markov Chains Chapter 4. Discrete Time Markov Chains Table of Contents 4.1 Introduction 4.2 Chapman-Kolmogorov Equations 4.3...SolvedJun 17, 2021
- Problem Set 3.2 MPCS 58020 Spring 2021 Logistics and Submission Solve the following problems and show your work. All problems are worth the same number of points. • You may use any programming...SolvedMay 12, 2021
- please do my assignmentSolvedApr 17, 2021
- STAT4207/5207 Homework assignment 4 Due by April 15, 2021. Write down the complete derivations for full credits. Question 3 and 4 are bonus questions (worth 12 and 13 points respectively). Exercise 1...Apr 06, 2021
- Assignment 10: Stochastic Processes Probability Theory and Stochastic Processes The method by which you arrive at an answer is as important as the answer itself. Therefore, a portion of your grade for...Apr 06, 2021

- HIST E 1672 Long 1960s: Pop Music, Counterculture, Black Awakening Choose TWO prompts to respond to and write a response using specific examples from the readings on our syllabus. You are welcome to...Oct 21, 2021
- The Partnership Agreement In the interests of clarity, develop a partnership agreement with your partners. Please do not make this agreement excessively formal and refrain from using a legal template....Oct 21, 2021
- Chapter 3 Objectives:• Show how economists conceptualize individual preferences in words, graphs, and math.• Describe the ‘preference relation,’ and show its use for predicting individual choice.•...Oct 21, 2021
- Hi I need this assignment in 5 hours. Its 5 programming question and it does not need a whole program. as long as it has functionality. I need 5 separate .py files with names Q1, Q2, Q3, Q4, and Q5. I...Oct 21, 2021
- Based on Chapter 11 in your textbook, create a Paper that thoroughly answers in written paper / paragraph format one of the following questions: What is a walled garden? Facebook has been called a...Oct 21, 2021

Copy and Paste Your Assignment Here

Copyright © 2021. All rights reserved.