CS109A Introduction to Data Science, Pavlos Protopapas and Kevin Rader
Lecture 19 Additional Material: Optimization
CS109A, PROTOPAPAS, RADER
Outline
Optimization
• Challenges in Optimization
• Momentum
• Adaptive Learning Rate
• Parameter Initialization
• Batch Normalization
Learning vs. Optimization
Goal of learning: minimize generalization error

$$J(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\left[L(f(x;\theta),\, y)\right]$$

In practice, empirical risk minimization:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta),\, y^{(i)})$$

The quantity optimized is different from the quantity we care about
Batch vs. Stochastic Algorithms
Batch algorithms
• Optimize empirical risk using exact gradients
Stochastic algorithms
• Estimate the gradient from a small random sample
Large mini-batch: gradient computation is expensive
Small mini-batch: greater variance in the estimate, so more steps are needed to converge

$$\nabla J(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\left[\nabla L(f(x;\theta),\, y)\right]$$
Critical Points
Points with zero gradient
The second derivative (Hessian) determines the curvature
Goodfellow et al. (2016)
Stochastic Gradient Descent
Take small steps in direction of negative gradient
Sample m examples from training set and compute:
Update parameters:
In practice: shuffle the training set once and pass through it multiple times

$$g = \frac{1}{m}\sum_{i} \nabla_\theta L(f(x^{(i)};\theta),\, y^{(i)})$$

$$\theta = \theta - \epsilon_k g$$
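The update above can be sketched in NumPy on a toy least-squares problem (the data, loss, and hyper-parameters below are illustrative assumptions, not from the slides):

```python
import numpy as np

# Hypothetical toy problem: minimize (1/2m) * ||X @ theta - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
theta_true = np.array([3.0, -2.0])
y = X @ theta_true

def sgd(X, y, lr=0.1, epochs=50, batch_size=10, seed=1):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)              # shuffle, then pass through
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # g = (1/m) * sum of per-example gradients over the mini-batch
            g = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta = theta - lr * g            # theta = theta - eps * g
    return theta

theta_hat = sgd(X, y)
```

Since this toy data is noiseless, SGD recovers the true parameters; on real data the iterates fluctuate around the minimizer.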
Stochastic Gradient Descent
Oscillations occur because updates do not exploit curvature information
Goodfellow et al. (2016)
Local Minima
Goodfellow et al. (2016)
Local Minima
Old view: local minima are a major problem in neural network training
Recent view:
• For sufficiently large neural networks, most local minima incur low cost
• Not important to find true global minimum
Saddle Points
Recent studies indicate that in high dimensions, saddle points are more likely than local minima
The gradient can be very small near saddle points
A saddle point is a local minimum along some directions and a local maximum along others
Goodfellow et al. (2016)
Saddle Points
SGD is empirically seen to escape saddle points
– it moves downhill and uses noisy gradients
Second-order methods get stuck
– they solve for a point with zero gradient
Goodfellow et al. (2016)
Poor Conditioning
A poorly conditioned Hessian matrix
– high curvature: even small steps can lead to a large increase in cost
Learning is slow despite strong gradients
Oscillations slow down progress
No Critical Points
Some cost functions have no critical points; this is common in classification.
No Critical Points
Gradient norm increases, but validation error decreases
Convolutional nets for object detection
Goodfellow et al. (2016)
Exploding and Vanishing Gradients
Linear activations (deeplearning.ai)

$$h^{1} = Wx, \qquad h^{i} = W h^{i-1}, \quad i = 2,\dots,n$$

$$y = \sigma\!\left(h^{n}_{1} + h^{n}_{2}\right), \quad \text{where } \sigma(s) = \frac{1}{1+e^{-s}}$$
Exploding and Vanishing Gradients
Suppose $W = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$. Then:

$$\begin{pmatrix} h^{1}_{1} \\ h^{1}_{2} \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad \dots, \quad \begin{pmatrix} h^{n}_{1} \\ h^{n}_{2} \end{pmatrix} = \begin{pmatrix} a^{n} & 0 \\ 0 & b^{n} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

$$y = \sigma\!\left(a^{n} x_1 + b^{n} x_2\right)$$

$$\nabla y = \sigma'\!\left(a^{n} x_1 + b^{n} x_2\right) \begin{pmatrix} n a^{n-1} x_1 \\ n b^{n-1} x_2 \end{pmatrix}$$
Exploding and Vanishing Gradients
Suppose $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$

Case 1: $a = 1,\ b = 2$: $\;y \to 1$, and the factor $\begin{pmatrix} n \\ n\,2^{n-1} \end{pmatrix}$ in $\nabla y$ explodes with depth.

Case 2: $a = 0.5,\ b = 0.9$: the pre-activation tends to 0, so $y \to \sigma(0) = \tfrac{1}{2}$ and $\nabla y \to \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ vanishes.
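The scaling behavior in the two cases above can be checked numerically. A minimal sketch of the deep linear network with the diagonal weight matrix from the example (depth $n = 100$ is an illustrative choice):

```python
import numpy as np

def deep_linear(W, x, n):
    """Activations of an n-layer linear network with shared weights: h^n = W^n x."""
    h = np.array(x, dtype=float)
    for _ in range(n):
        h = W @ h
    return h

x = np.array([1.0, 1.0])
# Case 1: a = 1, b = 2 -- the second component grows as 2^n (explodes)
h_explode = deep_linear(np.diag([1.0, 2.0]), x, n=100)
# Case 2: a = 0.5, b = 0.9 -- both components shrink toward 0 (vanishes)
h_vanish = deep_linear(np.diag([0.5, 0.9]), x, n=100)
```

The gradient factors $n a^{n-1}$ and $n b^{n-1}$ scale the same way, which is the source of exploding and vanishing gradients.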
Exploding and Vanishing Gradients
Exploding gradients lead to cliffs
Can be mitigated using gradient clipping
Goodfellow et al. (2016)
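Gradient clipping can be sketched as a generic norm-clipping helper (this is an illustrative implementation, not code from the course):

```python
import numpy as np

def clip_by_norm(g, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)   # keep the direction, cap the magnitude
    return g

g = np.array([3.0, 4.0])            # norm 5: gets rescaled to norm 1
clipped = clip_by_norm(g, max_norm=1.0)
```

Near a cliff the clipped update keeps the descent direction but takes a bounded step, so a single huge gradient cannot throw the parameters far away.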
Momentum
SGD is slow when there is high curvature
The averaged gradient presents a faster path to the optimum:
– the vertical components cancel out
Momentum
Uses past gradients for update
Maintains a new quantity: ‘velocity’
Exponentially decaying average of gradients:
$$v = \alpha v - \epsilon g$$

$\alpha$ controls how quickly the effect of past gradients decays; $-\epsilon g$ is the current gradient update.
Momentum
Compute gradient estimate:
Update velocity:
Update parameters:
$$g = \frac{1}{m}\sum_{i} \nabla_\theta L(f(x^{(i)};\theta),\, y^{(i)})$$

$$v = \alpha v - \epsilon g$$

$$\theta = \theta + v$$
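The three steps above can be sketched on a poorly conditioned quadratic (the quadratic and the hyper-parameters are illustrative assumptions):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.005, alpha=0.9):
    """One momentum update: v = alpha * v - eps * g; theta = theta + v."""
    v = alpha * v - lr * grad
    return theta + v, v

# Hypothetical poorly conditioned quadratic: f(theta) = 0.5 * sum(A * theta**2)
A = np.array([1.0, 100.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(400):
    theta, v = momentum_step(theta, v, grad=A * theta)
```

The velocity accumulates the consistent (gently sloped) direction while oscillating components partially cancel between steps.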
Momentum
Damped oscillations: gradients in opposite directions cancel out
Goodfellow et al. (2016)
Nesterov Momentum
Apply an interim update:

$$\tilde{\theta} = \theta + \alpha v$$

Perform a correction based on the gradient at the interim point:

$$g = \frac{1}{m}\sum_{i} \nabla_\theta L(f(x^{(i)};\tilde{\theta}),\, y^{(i)})$$

$$v = \alpha v - \epsilon g$$

$$\theta = \theta + v$$

Momentum based on the look-ahead slope
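A sketch of Nesterov momentum on the same kind of toy quadratic, following the common presentation in Goodfellow et al. (2016) with look-ahead point $\theta + \alpha v$ (the quadratic and hyper-parameters are illustrative assumptions):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.005, alpha=0.9):
    """Nesterov momentum: the gradient is evaluated at the look-ahead point."""
    theta_ahead = theta + alpha * v   # interim update
    g = grad_fn(theta_ahead)          # correction based on the look-ahead slope
    v = alpha * v - lr * g
    return theta + v, v

# Hypothetical quadratic: f(theta) = 0.5 * sum(A * theta**2)
A = np.array([1.0, 100.0])
grad_fn = lambda t: A * t
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(400):
    theta, v = nesterov_step(theta, v, grad_fn)
```

The only change from plain momentum is where the gradient is evaluated, which lets the update correct itself before overshooting.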
Adaptive Learning Rates
Oscillations along the vertical direction
– learning must be slower along parameter 2
Use a different learning rate for each parameter?
AdaGrad
• Accumulate squared gradients:
• Update each parameter:
• Greater progress along gently sloped directions
$$r_i = r_i + g_i^2$$

$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, g_i$$

The step size is inversely proportional to the cumulative squared gradient.
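A minimal AdaGrad sketch on a quadratic with very different curvatures per parameter (the toy problem and learning rate are illustrative assumptions):

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.5, delta=1e-7):
    """AdaGrad: accumulate squared gradients; per-parameter step ~ 1/sqrt(r)."""
    r = r + grad ** 2
    theta = theta - lr / (delta + np.sqrt(r)) * grad
    return theta, r

# Hypothetical quadratic: f(theta) = 0.5 * sum(A * theta**2)
A = np.array([1.0, 100.0])
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, r = adagrad_step(theta, r, grad=A * theta)
```

The steeply curved parameter accumulates a large $r_i$ and automatically gets a small step, while the gently sloped one keeps a larger step.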
RMSProp
• For non-convex problems, AdaGrad can prematurely decrease the learning rate
• Use an exponentially weighted average for gradient accumulation
$$r_i = \rho r_i + (1-\rho)\, g_i^2$$

$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, g_i$$
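The same toy setup with RMSProp: only the accumulation rule changes, so $r_i$ forgets old gradients instead of growing forever (hyper-parameters are illustrative assumptions):

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.01, rho=0.9, delta=1e-7):
    """RMSProp: exponentially weighted average of squared gradients."""
    r = rho * r + (1 - rho) * grad ** 2
    theta = theta - lr / (delta + np.sqrt(r)) * grad
    return theta, r

# Hypothetical quadratic: f(theta) = 0.5 * sum(A * theta**2)
A = np.array([1.0, 100.0])
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(1000):
    theta, r = rmsprop_step(theta, r, grad=A * theta)
```

Because $r$ tracks only recent gradients, the effective learning rate does not decay to zero, which helps on non-convex surfaces; the iterates hover near the minimum rather than freezing far from it.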
Adam
• RMSProp + Momentum
• Estimate first moment:
• Estimate second moment:
• Update parameters:
$$v_i = \rho_1 v_i + (1-\rho_1)\, g_i$$

$$r_i = \rho_2 r_i + (1-\rho_2)\, g_i^2$$

$$\theta_i = \theta_i - \frac{\epsilon}{\delta + \sqrt{r_i}}\, v_i$$

Also applies bias correction to $v$ and $r$
Works well in practice and is fairly robust to hyper-parameters
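The Adam update, including the bias correction mentioned above, can be sketched as (toy problem and hyper-parameters are illustrative assumptions; $\rho_1, \rho_2$ follow the common defaults):

```python
import numpy as np

def adam_step(theta, v, r, grad, t, lr=0.01, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: first/second moment estimates with bias correction."""
    v = rho1 * v + (1 - rho1) * grad           # first moment (momentum)
    r = rho2 * r + (1 - rho2) * grad ** 2      # second moment (RMSProp)
    v_hat = v / (1 - rho1 ** t)                # bias correction
    r_hat = r / (1 - rho2 ** t)
    theta = theta - lr * v_hat / (delta + np.sqrt(r_hat))
    return theta, v, r

# Hypothetical quadratic: f(theta) = 0.5 * sum(A * theta**2)
A = np.array([1.0, 100.0])
theta, v, r = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 2001):                       # t starts at 1 for bias correction
    theta, v, r = adam_step(theta, v, r, A * theta, t)
```

The bias correction matters early on, when $v$ and $r$ are still dominated by their zero initialization.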
Parameter Initialization
• Goal: break symmetry between units
• so that each unit computes a different function
• Initialize all weights (not biases) randomly
• Gaussian or uniform distribution
• Scale of initialization?
• Too large: gradient explosion; too small: gradient vanishing
Xavier Initialization
• Heuristic for all outputs to have unit variance
• For a fully-connected layer with m inputs:

$$W_{ij} \sim N\!\left(0,\ \tfrac{1}{m}\right)$$

• For ReLU units, it is recommended:

$$W_{ij} \sim N\!\left(0,\ \tfrac{2}{m}\right)$$
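Both heuristics can be sketched directly, and the unit-variance claim checked on random inputs (layer sizes below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(m, n_units, rng):
    """Xavier: W_ij ~ N(0, 1/m) for a fully-connected layer with m inputs."""
    return rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n_units))

def he_init(m, n_units, rng):
    """He: W_ij ~ N(0, 2/m), the variant recommended for ReLU units."""
    return rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n_units))

# With unit-variance inputs, Xavier keeps pre-activations near unit variance
x = rng.normal(size=(10000, 512))
h = x @ xavier_init(512, 512, rng)
```

Each pre-activation is a sum of 512 terms with variance $1/512$, so its variance is close to 1 regardless of the layer width.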
Normalized Initialization
• Fully-connected layer with m inputs and n outputs:

$$W_{ij} \sim U\!\left(-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right)$$

• This heuristic trades off between all layers having the same activation variance and the same gradient variance
• Sparse variant when m is large
– initialize k nonzero weights in each unit
Bias Initialization
• Output unit bias
• Marginal statistics of the output in the training set
• Hidden unit bias
• Avoid saturation at initialization
• E.g. in ReLU, initialize bias to 0.1 instead of 0
• Units controlling participation of other units
• Set bias to allow participation at initialization
Feature Normalization
Good practice to normalize features before applying learning algorithm:
Features on the same scale: mean 0 and variance 1
– speeds up learning

$$\tilde{x} = \frac{x - \mu}{\sigma}$$

where $x$ is the feature vector, $\mu$ is the vector of mean feature values, and $\sigma$ is the vector of feature standard deviations.
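The normalization above is a one-liner per column (the example features on very different scales are an illustrative assumption):

```python
import numpy as np

def normalize_features(X):
    """Standardize each feature: x' = (x - mu) / sigma, computed per column."""
    mu = X.mean(axis=0)      # vector of mean feature values
    sigma = X.std(axis=0)    # vector of feature standard deviations
    return (X - mu) / sigma, mu, sigma

rng = np.random.default_rng(0)
# Hypothetical features on very different scales
X = rng.normal(loc=[10.0, -5.0], scale=[100.0, 0.1], size=(1000, 2))
X_norm, mu, sigma = normalize_features(X)
```

At prediction time the same $\mu$ and $\sigma$ from the training set must be reused, which is why the function returns them.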
Feature Normalization
[Figure: feature distribution before normalization vs. after normalization]
Internal Covariate Shift
Each hidden layer changes distribution of inputs to next layer: slows down learning
Normalize the inputs to layer 2, …, normalize the inputs to layer n
Batch Normalization
Training time:– Mini-batch of activations for layer to normalize
N data points in the mini-batch, K hidden-layer activations:

$$H = \begin{pmatrix} H_{11} & \cdots & H_{1K} \\ \vdots & \ddots & \vdots \\ H_{N1} & \cdots & H_{NK} \end{pmatrix}$$
Batch Normalization
Training time: – Mini-batch of activations for layer to normalize
where
$$H' = \frac{H - \mu}{\sigma}$$

$$\mu = \frac{1}{m}\sum_{i} H_{i,:} \qquad \sigma = \sqrt{\frac{1}{m}\sum_{i}\left(H - \mu\right)_i^{2} + \delta}$$

$\mu$ is the vector of mean activations across the mini-batch; $\sigma$ is the vector of standard deviations of each unit across the mini-batch.
Batch Normalization
Training time: – Normalization can reduce expressive power
– Instead use:
– Allows network to control range of normalization
$$\gamma H' + \beta$$

$\gamma$ and $\beta$ are learnable parameters.
Batch Normalization
Batch 1, …, Batch N: add normalization operations for layer 1

$$\mu^{1} = \frac{1}{m}\sum_{i} H_{i,:} \qquad \sigma^{1} = \sqrt{\frac{1}{m}\sum_{i}\left(H - \mu\right)_i^{2} + \delta}$$
Batch Normalization

$$\mu^{2} = \frac{1}{m}\sum_{i} H_{i,:} \qquad \sigma^{2} = \sqrt{\frac{1}{m}\sum_{i}\left(H - \mu\right)_i^{2} + \delta}$$

Batch 1, …, Batch N: add normalization operations for layer 2, and so on
Batch Normalization
Differentiate the joint loss over the N mini-batches
Back-propagate through the normalization operations
Test time:
– the model needs to be evaluated on a single example
– replace μ and σ with running averages collected during training
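The training-time and test-time behaviors above can be sketched together (shapes, $\delta$, and the toy activations are illustrative assumptions; a real implementation would also update running averages of $\mu$ and $\sigma$ during training):

```python
import numpy as np

def batchnorm_train(H, gamma, beta, delta=1e-5):
    """Training time: normalize each unit over the mini-batch, then scale/shift."""
    mu = H.mean(axis=0)                              # per-unit batch mean
    sigma = np.sqrt(((H - mu) ** 2).mean(axis=0) + delta)
    return gamma * (H - mu) / sigma + beta, mu, sigma

def batchnorm_test(H, gamma, beta, running_mu, running_sigma):
    """Test time: replace batch statistics with running averages from training."""
    return gamma * (H - running_mu) / running_sigma + beta

rng = np.random.default_rng(0)
H = rng.normal(5.0, 3.0, size=(64, 8))               # N=64 examples, K=8 units
gamma, beta = np.ones(8), np.zeros(8)
out, mu, sigma = batchnorm_train(H, gamma, beta)
single = batchnorm_test(H[:1], gamma, beta, mu, sigma)  # works on one example
```

With $\gamma = 1$ and $\beta = 0$ the training-time output has per-unit mean 0 and variance (approximately) 1; the test-time path needs no batch at all, which is the point of the running averages.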