optimization methods for machine learning · 2018-09-10 · optimization methods for machine...
TRANSCRIPT
![Page 1: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/1.jpg)
Optimization Methods for Machine LearningPart II – The theory of SG
Leon BottouFacebook AI Research
Frank E. CurtisLehigh University
Jorge NocedalNorthwestern University
![Page 2: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/2.jpg)
Summary
1. Setup2. Fundamental Lemmas3. SG for Strongly Convex Objectives4. SG for General Objectives5. Work complexity for Large-Scale Learning6. Comments
2
![Page 3: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/3.jpg)
1- Setup
3
![Page 4: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/4.jpg)
The generic SG algorithm
The SG algorithm produces successive iterates 𝑤" ∈ ℝ&with the goal to minimize a certain function 𝐹 ∶ ℝ& → ℝ.
We assume that we have access to three mechanisms1. Given an iteration number 𝑘 ,
a mechanism to generate a realization of a random variable 𝜉".The 𝜉" form a sequence of jointly independent random variables
2. Given an iterate 𝑤" and a realization 𝜉", a mechanism to compute a stochastic vector 𝑔 𝑤", 𝜉" ∈ ℝ&
3. Given an iteration number,a mechanism to compute a scalar stepsize 𝛼" > 0
4
![Page 5: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/5.jpg)
The generic SG algorithm
Algorithm 4.1 (Stochastic Gradient (SG) Method)
5
![Page 6: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/6.jpg)
The generic SG algorithm
The function 𝐹 ∶ ℝ& → ℝ could be
The stochastic vector could be
the gradient for one example,
the empirical risk.
the gradient for a minibatch,
possibly rescaled
the expected risk,
6
![Page 7: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/7.jpg)
The generic SG algorithm
Stochastic processes
• We assume that the 𝜉" are jointly independent to avoid the full machinery of stochastic processes. But everything still holds if the 𝜉"form an adapted stochastic process, where each 𝜉" can depend on the previous ones.
Active learning
• We can handle more complex setups by view 𝜉" as a “random seed”.For instance, in active learning, 𝑔(𝑤", 𝜉") firsts construct a multinomial distribution on the training examples in a manner that depends on 𝑤", then uses the random seed 𝜉" to pick one according to that distribution.
The same mathematics cover all these cases.
7
![Page 8: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/8.jpg)
2- Fundamental lemmas
8
![Page 9: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/9.jpg)
Smoothness
Smoothness• Our analysis relies on a smoothness assumption.
We chose this path because it also gives results for the nonconvex case.We’ll discuss other paths in the commentary section.
Well known consequence
9
![Page 10: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/10.jpg)
Smoothness
• 𝔼45 is the expectation with respect to the distribution of 𝜉" only.• 𝔼45 𝐹(𝑤"67) is meaningful because 𝑤"67 depends on 𝜉" (step 6 of SG)
Expected decrease Noise
10
![Page 11: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/11.jpg)
Smoothness
11
![Page 12: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/12.jpg)
Moments
12
![Page 13: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/13.jpg)
Moments
• In expectation 𝑔(𝑤", 𝜉") is a sufficient descent direction.• True if 𝔼45 𝑔(𝑤", 𝜉") = 𝛻𝐹 𝑤" with 𝜇 = 𝜇; = 1.• True if 𝔼45 𝑔(𝑤", 𝜉") = 𝐻"𝛻𝐹(𝑤") with bounded spectrum.
13
![Page 14: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/14.jpg)
Moments
• 𝕍45 denotes the variance w.r.t. 𝜉"• Variance of the noise must be bounded in a mild manner.
14
![Page 15: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/15.jpg)
Moments
• Combining (4.7b) and (4.8) gives
with
15
![Page 16: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/16.jpg)
Moments
• The convergence of SG depends on the balance between these two terms.
Expected decrease Noise
16
![Page 17: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/17.jpg)
Moments
17
![Page 18: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/18.jpg)
3- SG for Strongly Convex Objectives
18
![Page 19: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/19.jpg)
Strong convexity
Known consequence
Why does strong convexity matter?• It gives the strongest results.• It often happens in practice (one regularizes to facilitate optimization!)• It describes any smooth function near a strong local minimum.
19
![Page 20: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/20.jpg)
Total expectation
Different expectations• 𝔼45 is the expectation with respect to the distribution of 𝜉" only.• 𝔼 is the total expectation w.r.t. the joint distribution of all 𝜉".
For instance, since 𝑤" depends only on 𝜉7, 𝜉?,… , 𝜉"A7,
𝔼 𝐹 𝑤" = 𝔼4B𝔼4C …𝔼45DB[𝐹 𝑤" ]
Results in expectation
• We focus on results that characterize the properties of SG in expectation.
• The stochastic approximation literature usually relies on rather complex martingale techniques to establish almost sure convergence results. We avoid them because they do not give much additional insight.
20
![Page 21: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/21.jpg)
SG with fixed stepsize
• Only converges to a neighborhood of the optimal value.• Both (4.13) and (4.14) describe well the actual behavior.
21
![Page 22: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/22.jpg)
SG with fixed stepsize (proof)
22
![Page 23: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/23.jpg)
SG with fixed stepsize
Note the interplay between the stepsize𝛼G and the variance bound 𝑀.• If 𝑀 = 0, one recovers the linear convergence of batch gradient descent.• If 𝑀 > 0, one reaches a point where the noise prevents further progress.
23
![Page 24: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/24.jpg)
Diminishing the stepsizes
• If we wait long enough, halving the stepsizeα eventually halves 𝐹 𝑤" − 𝐹∗.• We can even estimate 𝐹∗ ≈ 2𝐹N/? − 𝐹N
k
𝐹𝑤 "
Stepsize α
Stepsize α/2
𝐹∗
=
24
𝐹N/?
![Page 25: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/25.jpg)
Diminishing the stepsizes faster
• Divide 𝛼 by 2 whenever 𝔼 𝐹 𝑤" reaches 𝛼𝐿𝑀/𝑐𝜇.• Time 𝜏S between changes : 1 − 𝛼𝑐𝜇 TU = 1/3 means 𝜏S ∝ 1/𝛼.• Whenever we halve 𝛼 we must wait twice as long to halve 𝐹(𝑤) − 𝐹∗.• Overall convergence rate in 𝒪 1 𝑘⁄ .
k
𝔼𝐹𝑤 "
𝐹∗
α
α/2
α/4
25
Divide 𝛼 by 2 whenever 𝔼 𝐹 𝑤" − 𝐹∗ reaches 2S[\?]^ .
![Page 26: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/26.jpg)
SG with diminishing stepsizes
26
![Page 27: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/27.jpg)
SG with diminishing stepsizes
27
Same maximal stepsize
Stepsize decreases in 1/k
Not too slow…
gap ∝ stepsize…otherwise
![Page 28: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/28.jpg)
SG with diminishing stepsizes (proof)
28
![Page 29: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/29.jpg)
Mini batching
Using minibatches with stepsize𝛼G :
Using single example with stepsize𝛼G / 𝑛`a :
29
Computation Noise
1 𝑀
𝑛`a 𝑀/𝑛`a
𝑛`a times more iterations that are 𝑛`a times cheaper. same
![Page 30: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/30.jpg)
Minibatching
Ignoring implementation issues• We can match minibatch SG with stepsize𝛼G
using single example SG with stepsize𝛼G / 𝑛`a .• We can match single example SG with stepsize 𝛼G
using minibatch SG with stepsize 𝛼G × 𝑛`aprovided 𝛼G × 𝑛`a is smaller than the max stepsize.
With implementation issues• Minibatch implementations use the hardware better.• Especially on GPU.
30
![Page 31: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/31.jpg)
4- SG for General Objectives
31
![Page 32: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/32.jpg)
Nonconvex objectives
Nonconvex training objectives are pervasive in deep learning.
Nonconvex landscape in high dimension can be very complex.• Critical points can be local minima or saddle points.• Critical points can be first order of high order.• Critical points can be part of critical manifolds.• A critical manifold can contain both local minima and saddle points.
We describe meaningful (but weak) guarantees• Essentially, SG goes to critical points.
The SG noise plays an important role in practice• It seems to help navigating local minima and saddle points.• More noise has been found to sometimes help optimization.• But the theoretical understanding of these facts is weak.
32
![Page 33: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/33.jpg)
Nonconvex SG with fixed stepsize
33
![Page 34: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/34.jpg)
Nonconvex SG with fixed stepsize
34
Same max stepsize
This goes to zero like 1/K
This does not
If the average norm of the gradient is small, then the
norm of the gradient cannot be often large…
![Page 35: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/35.jpg)
Nonconvex SG with fixed stepsize (proof)
35
![Page 36: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/36.jpg)
Nonconvex SG with diminishing step sizes
36
![Page 37: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/37.jpg)
5- Work complexity for Large-Scale Learning
37
![Page 38: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/38.jpg)
Large-Scale Learning
Assume that we are in the large data regime• Training data is essentially unlimited.• Computation time is limited.
The good• More training data ⇒ less overfitting• Less overfitting ⇒ richer models.
The bad• Using more training data or rich models quickly exhausts the time budget.
The hope• How thoroughly do we need to optimize 𝑅d(𝑤)
when we actually want another function 𝑅(𝑤) to be small ?
38
![Page 39: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/39.jpg)
Expected risk versus training time
• When we vary the number of examples
39
Expected Risk
![Page 40: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/40.jpg)
Expected risk versus training time
• When we vary the number of examples, the model, and the optimizer…
40
Expected Risk
![Page 41: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/41.jpg)
Expected risk versus training time
• The optimal combination depends on the computing time budget
41
Expected Risk
Good combinations
![Page 42: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/42.jpg)
Formalization
The components of the expected risk
Question• Given a fixed model ℋ and a time budget 𝒯 gh, choose 𝑛, 𝜖…
Approach• Statistics tell us ℰklm(𝑛) decreases with a rate in range 1/ 𝑛� … 1/𝑛.• For now, let’s work with the fastest rate compatible with statistics
42
![Page 43: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/43.jpg)
Batch versus Stochastic
Typical convergence rates• Batch algorithm:• Stochastic algorithm:
Rate analysis
43
Processing more training examples beats
optimizing more thoroughly.
This effect only grows if ℰklm(𝑛) decreases
slower than 1/𝑛.
![Page 44: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/44.jpg)
6- Comments
44
![Page 45: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/45.jpg)
Asymptotic performance of SG is fragile
Diminishing stepsizes are tricky• Theorem 4.7 (strongly convex function) suggests
Constant stepsizes are often used in practice• Sometimes with a simple halving protocol.
45
SG converges very slowly if 𝛽 < 7
]^
SG usually diverges when 𝛼 is above ?^[\q
![Page 46: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/46.jpg)
Condition numbers
The ratios 𝑳𝒄
and 𝑴𝒄
appear in critical places
• Theorem 4.6. With 𝜇 = 1, 𝑀x = 0, the optimal stepsize is 𝛼G = 7[
46
![Page 47: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/47.jpg)
Distributed computing
SG is notoriously hard to parallelize• Because it updates the parameters 𝑤 with high frequency• Because it slows down with delayed updates.
SG still works with relaxed synchronization• Because this is just a little bit more noise.
Communication overhead give room for new opportunities• There is ample time to compute things while communication takes place.• Opportunity for optimization algorithms with higher per-iteration costs ➔ SG may not be the best answer for distributed training.
47
![Page 48: Optimization Methods for Machine Learning · 2018-09-10 · Optimization Methods for Machine Learning Part II – The theory of SG Leon Bottou Facebook AI Research Frank E. Curtis](https://reader034.vdocuments.site/reader034/viewer/2022050409/5f869fe3549acc09a77c8b65/html5/thumbnails/48.jpg)
Smoothness versus Convexity
Analyses of SG that only rely on convexity• Bounding 𝑤" −𝑤∗ ? instead of 𝐹 𝑤" − 𝐹∗
and assuming 𝔼45 𝑔(𝑤", 𝜉") = 𝑔y 𝑤" ∈ 𝜕𝐹(𝑤")gives a result similar to Lemma 4.4.
• Ways to bound the expected decrease
• Proof does not easily support second order methods.48
Expected decrease Noise