TRANSCRIPT
1/67
CS7015 (Deep Learning) : Lecture 9
Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization
Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
2/67
Module 9.1 : A quick recap of training deep neural networks
3/67

[Figure: a single sigmoid neuron σ with input x, weight w, and output y; next to it, a wider sigmoid neuron with inputs x1, x2, x3 and weights w1, w2, w3]

We already saw how to train this network:

w = w − η∇w, where

∇w = ∂L(w)/∂w = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ x

What about a wider network with more inputs:

w1 = w1 − η∇w1
w2 = w2 − η∇w2
w3 = w3 − η∇w3

where ∇wi = (f(x) − y) ∗ f(x) ∗ (1 − f(x)) ∗ xi
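The update rule above can be sketched in a few lines of Python. This is a toy sketch: the data point, initial weight, and learning rate are assumed values, there is no bias term (matching the slide's single-weight neuron), and the loss is taken to be L = ½(f(x) − y)², which is what the stated gradient corresponds to.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data point, initial weight, and learning rate (all assumed)
x, y = 1.5, 0.2
w = 0.5
eta = 0.5

for _ in range(500):
    f = sigmoid(w * x)                   # forward pass: f(x) = sigma(w * x)
    grad = (f - y) * f * (1 - f) * x     # the slide's gradient dL/dw
    w -= eta * grad                      # update: w = w - eta * grad

print(f"f(x) after training: {sigmoid(w * x):.3f}")
```

After enough updates, f(x) approaches the target y, which is all gradient descent on this one-parameter network amounts to.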
4/67

[Figure: a three-layer network x = h0 → σ → h1 → σ → h2 → σ → y, with weights w1, w2, w3 and pre-activations a1, a2, a3]

ai = wi hi−1 ; hi = σ(ai)
a1 = w1 ∗ x = w1 ∗ h0

What if we have a deeper network?

We can now calculate ∇w1 using the chain rule:

∂L(w)/∂w1 = ∂L(w)/∂y · ∂y/∂a3 · ∂a3/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂w1
          = ∂L(w)/∂y ∗ ............... ∗ h0

In general,

∇wi = ∂L(w)/∂y ∗ ............... ∗ hi−1

Notice that ∇wi is proportional to the corresponding input hi−1 (we will use this fact later)
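The chain-rule product above can be checked numerically. A minimal sketch, with scalar weights and data values assumed, and the loss taken to be L = ½(ŷ − y)² (consistent with the earlier gradient): the middle factors elided by the dots on the slide are written out explicitly, and the product is compared against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Scalar three-layer network from the slide: a_i = w_i * h_{i-1}, h_i = sigmoid(a_i)
# Toy values (assumed)
x, y = 1.0, 0.5
w1, w2, w3 = 0.4, -0.3, 0.7

# Forward pass
h0 = x
a1 = w1 * h0; h1 = sigmoid(a1)
a2 = w2 * h1; h2 = sigmoid(a2)
a3 = w3 * h2; y_hat = sigmoid(a3)

# Backward pass: one local derivative per chain-rule factor
dL_dy   = (y_hat - y)            # dL/dy_hat for L = 0.5 * (y_hat - y)^2
dy_da3  = y_hat * (1 - y_hat)
da3_dh2 = w3
dh2_da2 = h2 * (1 - h2)
da2_dh1 = w2
dh1_da1 = h1 * (1 - h1)
da1_dw1 = h0                     # this factor makes grad_w1 proportional to h0

grad_w1 = dL_dy * dy_da3 * da3_dh2 * dh2_da2 * da2_dh1 * dh1_da1 * da1_dw1

# Finite-difference check of the chain-rule product
eps = 1e-6
def loss(w1_):
    yh = sigmoid(w3 * sigmoid(w2 * sigmoid(w1_ * x)))
    return 0.5 * (yh - y) ** 2
fd = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
assert abs(grad_w1 - fd) < 1e-8
```

The last factor in the product is h0, which is exactly the "proportional to the corresponding input" observation on the slide.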
5/67

[Figure: a deep and wide network with inputs x1, x2, x3, two hidden layers of sigmoid units, an output unit y, and weights w1, w2, w3 into the first hidden layer]

What happens if we have a network which is deep and wide?

How do you calculate ∇w2?

It will be given by the chain rule applied across multiple paths (we saw this in detail when we studied backpropagation)
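The "multiple paths" idea can be made concrete with a tiny network in which the unit fed by one weight fans out into two downstream units: the gradient is then the sum of the per-path chain-rule products. All weights and data below are assumed toy values, and the loss is again L = ½(ŷ − y)².

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy values (assumed): weight w2 feeds hidden unit h, whose output reaches
# y_hat through two second-layer units u1 and u2
x, y = 1.0, 0.3
w2 = 0.5
v1, v2 = 0.8, -0.6           # fan-out weights from h to u1 and u2
c1, c2 = 0.4, 0.9            # weights from u1 and u2 into the output unit

h  = sigmoid(w2 * x)
u1 = sigmoid(v1 * h)
u2 = sigmoid(v2 * h)
y_hat = sigmoid(c1 * u1 + c2 * u2)

dL_dyhat = (y_hat - y)
dyhat_da = y_hat * (1 - y_hat)   # derivative of the output sigmoid
dh_dw2   = h * (1 - h) * x       # shared factor at the start of both paths

# Path 1: w2 -> h -> u1 -> y_hat ;  Path 2: w2 -> h -> u2 -> y_hat
path1 = dL_dyhat * dyhat_da * c1 * u1 * (1 - u1) * v1 * dh_dw2
path2 = dL_dyhat * dyhat_da * c2 * u2 * (1 - u2) * v2 * dh_dw2
grad_w2 = path1 + path2          # chain rule summed across the two paths

# Finite-difference check that the path sum is the true derivative
eps = 1e-6
def loss(w):
    h_ = sigmoid(w * x)
    yh = sigmoid(c1 * sigmoid(v1 * h_) + c2 * sigmoid(v2 * h_))
    return 0.5 * (yh - y) ** 2
fd = (loss(w2 + eps) - loss(w2 - eps)) / (2 * eps)
assert abs(grad_w2 - fd) < 1e-8
```

Neither path's product alone matches the finite-difference estimate; only their sum does, which is what backpropagation computes implicitly.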
6/67

Things to remember

Training Neural Networks is a Game of Gradients (played using any of the existing gradient-based approaches that we discussed)

The gradient tells us the responsibility of a parameter towards the loss

The gradient w.r.t. a parameter is proportional to the input to the parameter (recall the "..... ∗ x" term or the "..... ∗ hi−1" term in the formula for ∇wi)
7/67

[Figure: the same deep and wide network as before]

Backpropagation was made popular by Rumelhart et al. in 1986

However, when used for really deep networks it was not very successful

In fact, till 2006 it was very hard to train very deep networks

Typically, even after a large number of epochs the training did not converge
8/67
Module 9.2 : Unsupervised pre-training
9/67

What has changed now? How did Deep Learning become so popular despite this problem with training large networks?

Well, until 2006 it wasn't so popular

The field got revived after the seminal work of Hinton and Salakhutdinov in 2006¹

¹G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
10/67

Let's look at the idea of unsupervised pre-training introduced in this paper ... (note that in this paper they introduced the idea in the context of RBMs but we will discuss it in the context of Autoencoders)
![Page 41: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/41.jpg)
11/67
Consider the deep neural networkshown in this figure
Let us focus on the first two layers ofthe network (x and h1)
We will first train the weightsbetween these two layers using an un-supervised objective
Note that we are trying to reconstructthe input (x) from the hidden repres-entation (h1)
We refer to this as an unsupervisedobjective because it does not involvethe output label (y) and only uses theinput data (x)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
![Page 42: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/42.jpg)
11/67
Consider the deep neural networkshown in this figure
Let us focus on the first two layers ofthe network (x and h1)
We will first train the weightsbetween these two layers using an un-supervised objective
Note that we are trying to reconstructthe input (x) from the hidden repres-entation (h1)
We refer to this as an unsupervisedobjective because it does not involvethe output label (y) and only uses theinput data (x)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
![Page 43: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/43.jpg)
11/67
Consider the deep neural networkshown in this figure
Let us focus on the first two layers ofthe network (x and h1)
We will first train the weightsbetween these two layers using an un-supervised objective
Note that we are trying to reconstructthe input (x) from the hidden repres-entation (h1)
We refer to this as an unsupervisedobjective because it does not involvethe output label (y) and only uses theinput data (x)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
![Page 44: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/44.jpg)
11/67
x
h1
x
reconstruct x
min1
m
m∑i=1
n∑j=1
(xij − xij)2
Consider the deep neural networkshown in this figure
Let us focus on the first two layers ofthe network (x and h1)
We will first train the weightsbetween these two layers using an un-supervised objective
Note that we are trying to reconstructthe input (x) from the hidden repres-entation (h1)
We refer to this as an unsupervisedobjective because it does not involvethe output label (y) and only uses theinput data (x)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
![Page 45: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/45.jpg)
11/67
x
h1
x
reconstruct x
min1
m
m∑i=1
n∑j=1
(xij − xij)2
Consider the deep neural networkshown in this figure
Let us focus on the first two layers ofthe network (x and h1)
We will first train the weightsbetween these two layers using an un-supervised objective
Note that we are trying to reconstructthe input (x) from the hidden repres-entation (h1)
We refer to this as an unsupervisedobjective because it does not involvethe output label (y) and only uses theinput data (x)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
![Page 46: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/46.jpg)
11/67
x
h1
x
reconstruct x
min1
m
m∑i=1
n∑j=1
(xij − xij)2
Consider the deep neural networkshown in this figure
Let us focus on the first two layers ofthe network (x and h1)
We will first train the weightsbetween these two layers using an un-supervised objective
Note that we are trying to reconstructthe input (x) from the hidden repres-entation (h1)
We refer to this as an unsupervisedobjective because it does not involvethe output label (y) and only uses theinput data (x)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
![Page 47: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/47.jpg)
11/67
x
h1
x
reconstruct x
min1
m
m∑i=1
n∑j=1
(xij − xij)2
At the end of this step, the weightsin layer 1 are trained such that h1captures an abstract representationof the input x
We now fix the weights in layer 1 andrepeat the same process with layer 2
At the end of this step, the weights inlayer 2 are trained such that h2 cap-tures an abstract representation of h1
We continue this process till the lasthidden layer (i.e., the layer before theoutput layer) so that each successivelayer captures an abstract represent-ation of the previous layer
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
[Figure: the next autoencoder in the stack; h1 is encoded into h2, which is then decoded to reconstruct h1]

min (1/m) ∑_{i=1}^{m} ∑_{j=1}^{n} (ĥ_{1,ij} − h_{1,ij})²
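The greedy layerwise procedure described above can be sketched in code. This is a minimal numpy illustration of the idea only, not the lecture's exact setup: the layer sizes, learning rate, number of epochs, and the choice of a sigmoid encoder with a linear decoder are all assumptions made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(H, n_hidden, lr=0.1, epochs=200, seed=0):
    """Train one autoencoder layer: encode H -> h, decode h -> H_hat,
    minimizing the squared reconstruction error (1/m) * sum (H_hat - H)^2."""
    rng = np.random.default_rng(seed)
    m, n = H.shape
    W = rng.normal(0, 0.1, (n, n_hidden))   # encoder weights (kept after pre-training)
    V = rng.normal(0, 0.1, (n_hidden, n))   # decoder weights (discarded afterwards)
    for _ in range(epochs):
        h = sigmoid(H @ W)          # hidden representation
        H_hat = h @ V               # linear reconstruction of the layer's input
        err = H_hat - H
        grad_V = h.T @ err / m
        grad_h = err @ V.T * h * (1 - h)
        grad_W = H.T @ grad_h / m
        V -= lr * grad_V
        W -= lr * grad_W
    return W, sigmoid(H @ W)

# Pre-train a stack of layers, fixing each layer before training the next
X = np.random.default_rng(1).normal(size=(100, 8))
sizes = [6, 4]                      # illustrative hidden-layer sizes
H, weights = X, []
for n_hidden in sizes:
    W, H = pretrain_layer(H, n_hidden)
    weights.append(W)
```

Each iteration of the outer loop treats the previous layer's activations as the "input" to be reconstructed, exactly mirroring the slides' move from reconstructing x to reconstructing h1.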
[Figure: the pre-trained stack with inputs x1, x2, x3 and the output layer added on top]

min_θ (1/m) ∑_{i=1}^{m} (y_i − f(x_i))²

After this layerwise pre-training, we add the output layer and train the whole network using the task-specific objective

Note that, in effect, we have initialized the weights of the network using the greedy unsupervised objective and are now fine-tuning these weights using the supervised objective
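The fine-tuning step can be sketched as ordinary backpropagation on the supervised squared-error objective, updating all layers. A minimal numpy sketch, assuming a two-hidden-layer network; the "pre-trained" weights W1 and W2 are random stand-ins here so the snippet is self-contained (in practice they would come from the layerwise procedure), and the data, shapes, and learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n_in, n_h1, n_h2 = 100, 8, 6, 4
X = rng.normal(size=(m, n_in))
y = rng.normal(size=(m, 1))

# W1 and W2 would come from layerwise pre-training; random stand-ins here
W1 = rng.normal(0, 0.1, (n_in, n_h1))
W2 = rng.normal(0, 0.1, (n_h1, n_h2))
W3 = rng.normal(0, 0.1, (n_h2, 1))   # newly added output layer

def loss(W1, W2, W3):
    f = sigmoid(sigmoid(X @ W1) @ W2) @ W3
    return np.mean((y - f) ** 2)     # (1/m) sum_i (y_i - f(x_i))^2

lr = 0.1
before = loss(W1, W2, W3)
for _ in range(200):                 # fine-tune ALL layers on the supervised loss
    h1 = sigmoid(X @ W1)
    h2 = sigmoid(h1 @ W2)
    f = h2 @ W3
    err = f - y
    gW3 = h2.T @ err * (2 / m)
    gh2 = err @ W3.T * h2 * (1 - h2)
    gW2 = h1.T @ gh2 * (2 / m)
    gh1 = gh2 @ W2.T * h1 * (1 - h1)
    gW1 = X.T @ gh1 * (2 / m)
    W1 -= lr * gW1
    W2 -= lr * gW2
    W3 -= lr * gW3
after = loss(W1, W2, W3)
```

The only difference from training from scratch is the starting point: the hidden-layer weights begin where pre-training left them rather than at random values.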
Why does this work better?

Is it because of better optimization?

Is it because of better regularization?

Let's see what these two questions mean and try to answer them based on some (among many) existing studies [1, 2]

[1] The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al., 2009
[2] Exploring Strategies for Training Deep Neural Networks - Larochelle et al., 2009
What is the optimization problem that we are trying to solve?

minimize L(θ) = (1/m) ∑_{i=1}^{m} (y_i − f(x_i))²

Is it the case that in the absence of unsupervised pre-training we are not able to drive L(θ) to 0 even for the training data (hence poor optimization)?

Let us see this in more detail ...
The error surface of the supervised objective of a deep neural network is highly non-convex, with many hills, plateaus, and valleys

Given the large capacity of DNNs, it is still easy to land in one of these 0-error regions

Indeed, Larochelle et al. [1] show that if the last layer has large capacity then L(θ) goes to 0 even without pre-training

However, if the capacity of the network is small, unsupervised pre-training helps

[1] Exploring Strategies for Training Deep Neural Networks - Larochelle et al., 2009
What does regularization do? It constrains the weights to certain regions of the parameter space

L1 regularization: constrains most weights to be 0

L2 regularization: prevents most weights from taking large values

[Image source: The Elements of Statistical Learning - T. Hastie, R. Tibshirani, and J. Friedman, Pg 71]
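The penalty view of L1 and L2 can be made concrete with a small sketch. This is a generic illustration for a linear model, not something from the lecture; the function name, data, and penalty strengths are all hypothetical.

```python
import numpy as np

def regularized_loss(w, X, y, l1=0.0, l2=0.0):
    """Mean squared error plus optional L1/L2 penalties on the weights."""
    err = y - X @ w
    return np.mean(err ** 2) + l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w = np.array([1.0, 0.0, -2.0])
y = X @ w                                    # targets generated by w itself

plain = regularized_loss(w, X, y)            # perfect fit, no penalty -> 0
l2_pen = regularized_loss(w, X, y, l2=0.01)  # adds 0.01 * (1 + 0 + 4) = 0.05
```

Because the penalty grows with the size of the weights, the minimizer of the penalized loss is pulled toward smaller (L2) or sparser (L1) weight vectors, which is exactly the "constrained region of parameter space" picture on the slide.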
Unsupervised objective:

Ω(θ) = (1/m) ∑_{i=1}^{m} ∑_{j=1}^{n} (x̂_{ij} − x_{ij})²

We can think of this unsupervised objective as an additional constraint on the optimization problem

Supervised objective:

L(θ) = (1/m) ∑_{i=1}^{m} (y_i − f(x_i))²

Indeed, pre-training constrains the weights to lie in only certain regions of the parameter space

Specifically, it constrains the weights to lie in regions where the characteristics of the data are captured well (as governed by the unsupervised objective)

This unsupervised objective ensures that the learning is not greedy w.r.t. the supervised objective (and also satisfies the unsupervised objective)
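One way to make this "additional constraint" view concrete is to write the two objectives as a single penalized loss L(θ) + λ·Ω(θ). Note this joint form is only an interpretation: the procedure on the slides is sequential (pre-train, then fine-tune), not joint. The names W, V, U, the weight λ, and the single-hidden-layer setup below are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_objective(W, V, U, X, y, lam=0.5):
    """L(theta) + lam * Omega(theta): supervised squared error plus a
    weighted reconstruction penalty sharing the encoder weights W."""
    h = sigmoid(X @ W)                             # hidden representation
    L = np.mean((y - h @ U) ** 2)                  # (1/m) sum_i (y_i - f(x_i))^2
    omega = np.sum((X - h @ V) ** 2) / X.shape[0]  # (1/m) sum_i sum_j (x_ij - xhat_ij)^2
    return L + lam * omega

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=(20, 1))
W = rng.normal(0, 0.1, (5, 3))   # encoder
V = rng.normal(0, 0.1, (3, 5))   # reconstruction (decoder) head
U = rng.normal(0, 0.1, (3, 1))   # supervised output head
total = joint_objective(W, V, U, X, y)
```

Minimizing this combined quantity would favor weights that both predict y and capture the characteristics of x, which is the regularization interpretation of pre-training.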
Some other experiments have also shown that pre-training is more robust to random initializations

One accepted hypothesis is that pre-training leads to better weight initializations (so that the layers capture the internal characteristics of the data)

[1] The difficulty of training deep architectures and effect of unsupervised pre-training - Erhan et al., 2009
So what has happened since 2006-2009?
Deep Learning has evolved

Better optimization algorithms

Better regularization methods

Better activation functions

Better weight initialization strategies
Module 9.3 : Better activation functions
Before we look at activation functions, let's try to answer the following question: "What makes Deep Neural Networks powerful?"
[Figure: a deep network with input h0 = x, pre-activations a1, a2, a3, hidden activations h1, h2 (each computed with a sigmoid σ), weights w1, w2, w3, and output y]

Consider this deep neural network

Imagine if we replace the sigmoid in each layer by a simple linear transformation

y = w4 ∗ (w3 ∗ (w2 ∗ (w1 ∗ x)))

Then we will just learn y as a linear transformation of x

In other words, we will be constrained to learning linear decision boundaries

We cannot learn arbitrary decision boundaries
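The collapse of a purely linear stack into a single linear map can be checked numerically: composing the layer-by-layer forward pass gives the same output as multiplying by the single product matrix w1·w2·w3·w4. The weight shapes below are arbitrary, chosen only for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Weights of a 4-layer network whose "activations" are the identity
w1 = rng.normal(size=(5, 4))
w2 = rng.normal(size=(4, 3))
w3 = rng.normal(size=(3, 3))
w4 = rng.normal(size=(3, 2))

x = rng.normal(size=(10, 5))
deep = (((x @ w1) @ w2) @ w3) @ w4    # layer-by-layer linear forward pass
collapsed = x @ (w1 @ w2 @ w3 @ w4)   # one equivalent linear map
```

No matter how many linear layers are stacked, the whole network is equivalent to one matrix multiplication, hence it can only realize linear decision boundaries.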
In particular, a deep linear neural network cannot learn such boundaries
But a deep non-linear neural network can indeed learn such boundaries (recall the Universal Approximation Theorem)
Now let’s look at some non-linear activation functions that are typically used in deep neural networks (much of this material is taken from Andrej Karpathy’s lecture notes 1)
1. http://cs231n.github.io
Sigmoid
σ(x) = 1/(1 + e^(−x))
As is obvious, the sigmoid function compresses all its inputs to the range [0, 1]
Since we are always interested in gradients, let us find the gradient of this function
∂σ(x)/∂x = σ(x)(1 − σ(x)) (you can easily derive it)
Let us see what happens if we use sigmoid in a deep network
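The derivative formula above can be sanity-checked against a finite-difference approximation; a small numerical sketch (not from the lecture, just a check):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    # closed form derived above: sigma(x) * (1 - sigma(x))
    return sigma(x) * (1.0 - sigma(x))

x = np.linspace(-5, 5, 101)
eps = 1e-6
# central finite difference as an independent estimate of the derivative
numeric = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)

assert np.allclose(numeric, dsigma(x), atol=1e-6)
print(dsigma(0.0))  # 0.25 -- the largest the sigmoid's gradient can ever be
```

Note that the gradient never exceeds 0.25, which already hints at why products of such factors shrink in a deep network.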
[Figure: a deep network with input h0 = x, pre-activations a1…a4, hidden layers h1…h4, and a sigmoid activation at each layer]
a3 = w2h2, h3 = σ(a3)
While calculating ∇w2, at some point in the chain rule we will encounter
∂h3/∂a3 = ∂σ(a3)/∂a3 = σ(a3)(1 − σ(a3))
What is the consequence of this?
To answer this question let us first understand the concept of a saturated neuron
[Plot: the sigmoid curve y = σ(x), which flattens out at both ends]
A sigmoid neuron is said to have saturated when σ(x) = 1 or σ(x) = 0
What would the gradient be at saturation?
Well, it would be 0 (you can see it from the plot or from the formula that we derived)
Saturated neurons thus cause the gradient to vanish
Saturated neurons thus cause the gradient to vanish
[Figure: a sigmoid neuron computing σ(Σi wixi) over four inputs x1…x4, with the sigmoid plotted against the pre-activation Σi wixi]
But why would the neurons saturate?
Consider what would happen if we use sigmoid neurons and initialize the weights to very high values
The neurons will saturate very quickly
The gradients will vanish and the training will stall (more on this later)
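This effect is easy to demonstrate with a toy setup (the all-ones input and weights below are made up, purely for illustration): scaling the weights up drives the sigmoid output toward 1 and its local gradient toward 0.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.ones(4)  # hypothetical all-ones input
for scale in [0.1, 1.0, 10.0]:
    w = scale * np.ones(4)               # all weights set to `scale`
    a = w @ x                            # pre-activation grows with the weights
    grad = sigma(a) * (1 - sigma(a))     # local gradient sigma'(a)
    print(f"scale={scale:5.1f}  sigma(a)={sigma(a):.6f}  grad={grad:.2e}")
```

With scale 10 the pre-activation is 40, the output is indistinguishable from 1, and the local gradient is numerically zero: the neuron has saturated.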
Saturated neurons cause the gradient to vanish
Sigmoids are not zero-centered
Consider the gradient w.r.t. w1 and w2
∇w1 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h21
∇w2 = ∂L(w)/∂y · ∂y/∂h3 · ∂h3/∂a3 · h22
Note that h21 and h22 are between [0, 1] (i.e., they are both positive)
So if the first common term (shared by both expressions) is positive (negative) then both ∇w1 and ∇w2 are positive (negative)
Why is this a problem?
[Figure: an output neuron y = σ(a3) with a3 = w1 ∗ h21 + w2 ∗ h22, where h21 and h22 are sigmoid outputs from the previous layer]
Essentially, either all the gradients at a layer are positive or all the gradients at a layer are negative
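A quick sketch of this sign constraint, assuming a squared-error loss and made-up values for h21, h22 and the weights (none of these numbers are from the lecture):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sigmoid outputs from the previous layer are always positive
h21, h22 = 0.8, 0.3
w1, w2 = 0.5, -1.2
target = 1.0                          # some hypothetical label

a3 = w1 * h21 + w2 * h22
y = sigma(a3)

# For squared-error loss L = (y - target)^2 / 2, the factor shared by
# both gradients is dL/dy * dy/da3:
common = (y - target) * y * (1 - y)
grad_w1 = common * h21                # da3/dw1 = h21 > 0
grad_w2 = common * h22                # da3/dw2 = h22 > 0

# Both gradients inherit the sign of the common term
assert np.sign(grad_w1) == np.sign(grad_w2)
```

Because h21 and h22 are both positive, the two gradients can never disagree in sign, whatever the loss or the data.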
Saturated neurons cause the gradient to vanish
Sigmoids are not zero-centered
This restricts the possible update directions
[Figure: the (∇w1, ∇w2) plane. The quadrant in which all gradients are +ve and the quadrant in which all gradients are −ve are allowed; the other two quadrants are not possible. Now imagine the optimal w lies in one of the disallowed directions]
Saturated neurons cause the gradient to vanish
Sigmoids are not zero-centered
And lastly, sigmoids are computationally expensive (because of exp(x))
[Figure: starting from an initial position in the (∇w1, ∇w2) plane, the only way to reach the optimal w is by taking a zigzag path]
tanh(x)
[Plot: the curve f(x) = tanh(x), ranging from −1 to 1]
f(x) = tanh(x)
Compresses all its inputs to the range [−1, 1]
Zero-centered
What is the derivative of this function?
∂tanh(x)/∂x = 1 − tanh²(x)
The gradient still vanishes at saturation
Also computationally expensive
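As with the sigmoid, the derivative can be checked against a finite difference, and the zero-centered property verified on a symmetric range; a small numerical sketch (not part of the lecture):

```python
import numpy as np

x = np.linspace(-5, 5, 101)
closed_form = 1 - np.tanh(x) ** 2     # the derivative derived above

eps = 1e-6
# central finite difference as an independent estimate
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
assert np.allclose(numeric, closed_form, atol=1e-6)

# tanh is odd, so its outputs over a symmetric input range average to zero,
# unlike the sigmoid, whose outputs are always positive
print(abs(np.tanh(x).mean()) < 1e-12)  # True
```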
ReLU
f(x) = max(0, x)
f(x) = max(0, x + 1) − max(0, x − 1)
Is this a non-linear function?
Indeed it is!
In fact, we can combine two ReLU units to recover a piecewise linear approximation of the sigmoid function
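The combination f(x) = max(0, x + 1) − max(0, x − 1) can be evaluated directly: it is flat at 0 on the left, rises linearly in between, and saturates at 2 on the right, a piecewise-linear version of the sigmoid's S-shape (up to scaling and shifting). A quick sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 13)          # step of 0.5
f = relu(x + 1) - relu(x - 1)       # the two-ReLU combination from the slide

# Flat at 0 for x <= -1, equal to x + 1 in between, flat at 2 for x >= 1
print(f[0], f[6], f[-1])  # 0.0 1.0 2.0
```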
ReLU
f(x) = max(0, x)
Advantages of ReLU
Does not saturate in the positive region
Computationally efficient
In practice, converges much faster than sigmoid/tanh 1
1. ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
38/67

[Figure: inputs x1, x2 and bias 1, with weights w1, w2, b feeding pre-activation a1 → h1, then weight w3 feeding a2 → output y]

In practice there is a caveat.

Let's see what the derivative of ReLU(x) is:

∂ReLU(x)/∂x = 0 if x < 0
            = 1 if x > 0

Now consider the given network.

What would happen if at some point a large gradient causes the bias b to be updated to a large negative value?
39/67

[Figure: same network as above]

w1x1 + w2x2 + b < 0 [if b << 0]

The neuron would output 0 [dead neuron].

Not only would the output be 0, but during backpropagation even the gradient ∂h1/∂a1 would be zero.

The weights w1, w2 and b will not get updated [∵ there will be a zero term in the chain rule]:

∇w1 = (∂L(θ)/∂y) · (∂y/∂a2) · (∂a2/∂h1) · (∂h1/∂a1) · (∂a1/∂w1)

The neuron will now stay dead forever!
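A minimal numerical sketch of the dead-neuron scenario (the specific weight and input values here are made up for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative of ReLU: 0 for x < 0, 1 for x > 0
    return (np.asarray(x) > 0).astype(float)

# Tiny network from the slide: a1 = w1*x1 + w2*x2 + b, h1 = ReLU(a1)
w1, w2, b = 0.5, 0.5, -10.0   # b has been pushed to a large negative value
x1, x2 = 1.0, 1.0
a1 = w1 * x1 + w2 * x2 + b    # -9.0 < 0, so the neuron outputs 0
h1 = relu(a1)

# The factor dh1/da1 in the chain rule is 0, so the gradients of
# w1, w2 and b all vanish -- the neuron cannot recover
grad_w1 = relu_grad(a1) * x1
print(h1, grad_w1)  # 0.0 0.0
```

Since the gradient stays 0 for any input that keeps a1 negative, no update ever moves b back into the positive region.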
40/67

[Figure: same network as above]

In practice, a large fraction of ReLU units can die if the learning rate is set too high.

It is advised to initialize the bias to a small positive value (e.g., 0.01).

Use other variants of ReLU (as we will soon see).
41/67

Leaky ReLU

f(x) = max(0.01x, x)

No saturation
Will not die (0.01x ensures that at least a small gradient will flow through)
Computationally efficient
Close to zero-centered outputs

Parametric ReLU

f(x) = max(αx, x)

α is a parameter of the model
α will get updated during backpropagation
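Both variants above are one-liners; a minimal sketch of Leaky ReLU (Parametric ReLU is the same form, with α learned during backpropagation instead of fixed):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha*x, x): identity for x > 0, small slope alpha for x < 0,
    # so the gradient in the negative region is alpha rather than 0
    return np.maximum(alpha * x, x)

print(leaky_relu(np.array([-5.0, 0.0, 3.0])))  # [-0.05  0.    3.  ]
```

Because the negative-side slope never reaches 0, a unit with a large negative pre-activation still receives a (small) gradient and can recover.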
42/67

Exponential Linear Unit (ELU)

f(x) = x            if x > 0
     = a(e^x − 1)   if x ≤ 0

All benefits of ReLU
a(e^x − 1) ensures that at least a small gradient will flow through
Close to zero-centered outputs
Expensive (requires computation of exp(x))
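A minimal sketch of ELU (using a = 1, a common default; the slide leaves a unspecified):

```python
import numpy as np

def elu(x, a=1.0):
    # x for x > 0, a*(exp(x) - 1) for x <= 0: the negative side
    # saturates smoothly toward -a instead of being exactly 0
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

print(elu(np.array([-2.0, 0.0, 2.0])))
```

The negative outputs pull the mean activation toward zero, at the cost of an exp(x) evaluation per negative input.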
43/67

Maxout Neuron

max(w1ᵀx + b1, w2ᵀx + b2)

Generalizes ReLU and Leaky ReLU
No saturation! No death!
Doubles the number of parameters
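A minimal sketch of a single maxout unit (the toy weights below are made up; note how ReLU falls out as a special case):

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # max(W1 x + b1, W2 x + b2), taken elementwise over the two affine pieces
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

# ReLU is the special case W2 = 0, b2 = 0: max(w^T x + b, 0)
x = np.array([1.0, -2.0])
W1 = np.array([[1.0, 1.0]]); b1 = np.array([0.0])
W2 = np.zeros((1, 2));       b2 = np.zeros(1)
print(maxout(x, W1, b1, W2, b2))  # [0.] -- same as ReLU(1 - 2)
```

Each maxout unit carries two weight vectors and two biases, which is where the doubled parameter count comes from.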
44/67

Things to Remember

Sigmoids are bad
ReLU is more or less the standard unit for Convolutional Neural Networks
Can explore Leaky ReLU / Maxout / ELU
tanh and sigmoid are still used in LSTMs/RNNs (we will see more on this later)
45/67

Module 9.4 : Better initialization strategies
46/67

Deep Learning has evolved:
Better optimization algorithms
Better regularization methods
Better activation functions
Better weight initialization strategies
47/67

[Figure: a feedforward network with inputs x1, x2; three sigmoid hidden neurons (pre-activations a11, a12, a13; outputs h11, h12, h13); and a sigmoid output neuron (a21 → h21) producing y]

What happens if we initialize all weights to 0?

a11 = w11x1 + w12x2
a12 = w21x1 + w22x2
∴ a11 = a12 = 0
∴ h11 = h12

All neurons in layer 1 will get the same activation.
Now what will happen during backpropagation?

∇w11 = (∂L(w)/∂y) · (∂y/∂h11) · (∂h11/∂a11) · x1
∇w21 = (∂L(w)/∂y) · (∂y/∂h12) · (∂h12/∂a12) · x1

but h11 = h12 and a11 = a12
∴ ∇w11 = ∇w21
Hence both the weights will get the same update and remain equal.
In fact, this symmetry will never break during training.

The same is true for w12 and w22.

And for all weights in layer 2 (in fact, work out the math and convince yourself that all the weights in this layer will remain equal).

This is known as the symmetry breaking problem.

This will happen if all the weights in a network are initialized to the same value.
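The symmetry argument can be verified numerically; a minimal sketch of one forward/backward pass for the toy network above (the squared-error loss and the omission of biases are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 3 sigmoid hidden units -> 1 sigmoid output, all weights 0
rng = np.random.default_rng(0)
x = rng.standard_normal(2)
W1 = np.zeros((3, 2))
W2 = np.zeros((1, 3))

a1 = W1 @ x
h1 = sigmoid(a1)          # every hidden unit computes the same value (0.5)
a2 = W2 @ h1
y = sigmoid(a2)

# One backward pass against a target t
t = 1.0
delta2 = (y - t) * y * (1.0 - y)            # dL/da2
dW2 = delta2[:, None] * h1[None, :]         # identical entries: h1 is uniform
delta1 = (W2.T @ delta2) * h1 * (1.0 - h1)  # dL/da1
dW1 = delta1[:, None] * x[None, :]

print(np.allclose(dW2, dW2[0, 0]))  # True: all layer-2 gradients are equal
print(np.allclose(dW1, dW1[0]))     # True: all rows of dW1 are identical
```

Equal gradients mean equal updates, so the weights stay equal after every step: the symmetry never breaks.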
![Page 195: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/195.jpg)
48/67
We will now consider a feedforwardnetwork with:
input: 1000 points, each ∈ R500
input data is drawn from unit Gaus-sian
−3 −2 −1 0 1 2 3
0.1
0.2
0.3
0.4
the network has 5 layers
each layer has 500 neurons
we will run forward propagation onthis network with different weight ini-tializations
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
![Page 196: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/196.jpg)
49/67
Let's try to initialize the weights to small random numbers
We will see what happens to the activations across different layers
[Figure: activation histograms across layers, with tanh and sigmoid activation functions]
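The experiment above can be sketched in a few lines of NumPy (a minimal sketch, not the lecture's actual code; the 0.01 scale for "small random numbers" is an assumed choice, while the width, depth, and input distribution follow the setup described above):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 input points, each in R^500, drawn from a unit Gaussian
x = rng.standard_normal((1000, 500))

# 5 layers of 500 neurons each, weights initialized to small random numbers
h = x
stds = []
for layer in range(5):
    W = 0.01 * rng.standard_normal((500, 500))  # small random init (assumed scale)
    h = np.tanh(h @ W)
    stds.append(h.std())

# The spread of the activations collapses toward 0 with depth
print([round(s, 4) for s in stds])
```

Printing the per-layer standard deviations shows them shrinking by roughly a constant factor at every layer, which is exactly the histogram collapse the slides illustrate.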
50/67
What will happen during backpropagation?
Recall that ∇w1 is proportional to the activation passing through it
If all the activations in a layer are very close to 0, what will happen to the gradient of the weights connecting this layer to the next layer?
They will all be close to 0 (the vanishing gradient problem)
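The claim that the gradient carries the incoming activation can be seen on a single linear unit with squared-error loss, where dL/dw = (y − t)·h (a toy illustration, not from the lecture; the names grad_w, h, and t are hypothetical):

```python
import numpy as np

# For a single linear unit y = w * h with loss L = (y - t)^2 / 2,
# the gradient dL/dw = (y - t) * h carries the incoming activation h
def grad_w(w, h, t):
    y = w * h
    return (y - t) * h

# A near-zero activation from the previous layer gives a near-zero gradient,
# regardless of how large the error (y - t) is
print(grad_w(0.5, 1e-4, 1.0))  # tiny activation -> tiny gradient
print(grad_w(0.5, 1.0, 1.0))   # healthy activation -> usable gradient
```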
51/67
Let us try to initialize the weights to large random numbers
[Figure: tanh and sigmoid activation histograms with large weights]
Most activations have saturated
What happens to the gradients at saturation?
They will all be close to 0 (the vanishing gradient problem)
52/67
[Network diagram: inputs x_1, ..., x_n feeding into pre-activations s_11, ..., s_1n]
Let us try to arrive at a more principled way of initializing weights

s_{11} = \sum_{i=1}^{n} w_{1i} x_i

Var(s_{11}) = Var\left(\sum_{i=1}^{n} w_{1i} x_i\right) = \sum_{i=1}^{n} Var(w_{1i} x_i)

            = \sum_{i=1}^{n} \left[ (E[w_{1i}])^2 Var(x_i) + (E[x_i])^2 Var(w_{1i}) + Var(x_i) Var(w_{1i}) \right]

            = \sum_{i=1}^{n} Var(x_i) Var(w_{1i})   [Assuming zero-mean inputs and weights]

            = (n Var(w)) (Var(x))   [Assuming Var(x_i) = Var(x) ∀i and Var(w_{1i}) = Var(w) ∀i]
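The final identity can be checked numerically (a minimal Monte Carlo sketch under the derivation's assumptions of zero-mean, independent inputs and weights; the variances and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 500, 10_000
var_w, var_x = 0.04, 1.0

# Fresh zero-mean weights and inputs for every trial, as in the derivation
w = rng.normal(0.0, np.sqrt(var_w), size=(trials, n))
x = rng.normal(0.0, np.sqrt(var_x), size=(trials, n))
s11 = (w * x).sum(axis=1)  # s11 = sum_i w_1i * x_i

# Var(s11) should be close to n * Var(w) * Var(x) = 500 * 0.04 * 1 = 20
expected = n * var_w * var_x
print(s11.var(), expected)
```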
53/67
In general,
Var(s_{1i}) = (n Var(w))(Var(x))
What would happen if n Var(w) ≫ 1?
The variance of s_{1i} will be large
What would happen if n Var(w) → 0?
The variance of s_{1i} will be small
54/67
[Network diagram: inputs x_1, ..., x_n, a first layer s_11, ..., s_1n, and a second-layer unit s_21]
Let us see what happens if we add one more layer
Using the same procedure as above we will arrive at

Var(s_{21}) = \sum_{i=1}^{n} Var(s_{1i}) Var(w_{2i}) = n Var(s_{1i}) Var(w_2)

Since Var(s_{1i}) = n Var(w_1) Var(x), we get

Var(s_{21}) ∝ [n Var(w_2)][n Var(w_1)] Var(x) ∝ [n Var(w)]^2 Var(x)

[Assuming weights across all layers have the same variance]
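A quick numerical check of this compounding (an illustrative sketch with assumed settings: linear layers only, unit-Gaussian inputs, and n Var(w) = 2 chosen deliberately to make the blow-up visible):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.standard_normal((1000, n))  # Var(x) = 1

# n * Var(w) = 2 here, so variance should grow roughly as 2^k with depth k
var_w = 2.0 / n
h = x
variances = []
for k in range(1, 6):
    W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
    h = h @ W  # linear layers, matching the derivation above
    variances.append(h.var())

# Compare against [n Var(w)]^k * Var(x) = 2^k
print([round(v, 2) for v in variances])
```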
55/67
In general,

Var(s_{ki}) = [n Var(w)]^k Var(x)

To ensure that the variance in the output of any layer does not blow up or shrink we want:

n Var(w) = 1

If we draw the weights from a unit Gaussian and scale them by 1/√n, then we have:

n Var(w) = n Var\left(\frac{z}{\sqrt{n}}\right) = n \cdot \frac{1}{n} Var(z) = 1

[Using Var(az) = a^2 Var(z), and Var(z) = 1 since z is a unit Gaussian]
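With the 1/√n scaling the pre-activation variance indeed stays near 1 across layers. A minimal sketch of the linear analysis above (nonlinearity deliberately omitted, since the derivation is for the pre-activations; width and depth are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.standard_normal((1000, n))  # unit-Gaussian inputs

# Draw weights from a unit Gaussian and scale by 1/sqrt(n), so n * Var(w) = 1
h = x
for layer in range(5):
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    h = h @ W  # pre-activations only, matching the linear analysis

# The variance neither blows up nor shrinks across the 5 layers
print(h.var())
```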
56/67
Let's see what happens if we use this initialization
[Figure: activation histograms across layers with tanh and sigmoid activations]
57/67
However this does not work for ReLU neurons
Why?
Intuition: He et al. argue that a factor of 2 is needed when dealing with ReLU neurons
Intuitively this happens because the range of ReLU neurons is restricted to only the positive half of the space
58/67
Indeed, when we account for this factor of 2 we see better performance
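The factor of 2 can be seen empirically by comparing the 1/√n scaling against √(2/n) on a ReLU network (a minimal sketch, not the experiment from He et al.'s paper; the depth and width are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.standard_normal((1000, n))

def forward(scale):
    h = x
    for _ in range(10):
        W = rng.standard_normal((n, n)) * scale
        h = np.maximum(h @ W, 0.0)  # ReLU
    return h.std()

# 1/sqrt(n): ReLU keeps only the positive half, so the second moment of the
# activations shrinks by roughly a factor of 2 at every layer
xavier_std = forward(1.0 / np.sqrt(n))
# sqrt(2/n): He et al.'s factor of 2 compensates for the halved variance
he_std = forward(np.sqrt(2.0 / n))

print(xavier_std, he_std)
```

With the factor of 2 the activation spread stays of order 1 after 10 layers, while without it the activations have decayed by orders of magnitude.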
59/67
Module 9.5 : Batch Normalization
60/67
We will now see a method called batch normalization which allows us to be less careful about initialization
61/67
[Network diagram: inputs x_1, x_2, x_3 and hidden layers h_0 through h_4]
To understand the intuition behind Batch Normalization let us consider a deep network
Let us focus on the learning process for the weights between these two layers
Typically we use mini-batch algorithms
What would happen if there is a constant change in the distribution of h_3?
In other words, what would happen if across mini-batches the distribution of h_3 keeps changing?
![Page 254: CS7015 (Deep Learning) : Lecture 9 · CS7015 (Deep Learning) : Lecture 9 Greedy Layerwise Pre-training, Better activation functions, Better weight initialization methods, Batch Normalization](https://reader033.vdocuments.site/reader033/viewer/2022052408/5f25053057272b6f37586dff/html5/thumbnails/254.jpg)
61/67
x1 x2 x3
h0
h1
h2
h3
h4
To understand the intuition behind Batch Nor-malization let us consider a deep network
Let us focus on the learning process for the weightsbetween these two layers
Typically we use mini-batch algorithms
What would happen if there is a constant changein the distribution of h3
In other words what would happen if across mini-batches the distribution of h3 keeps changing
Would the learning process be easy or hard?
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 9
It would help if the pre-activations at each layer were unit Gaussians.

Why not explicitly ensure this by standardizing the pre-activations?

ŝ_ik = (s_ik − E[s_ik]) / √(Var[s_ik])

But how do we compute E[s_ik] and Var[s_ik]? We compute them from a mini-batch.

Thus we are explicitly ensuring that the distribution of the inputs at different layers does not change across batches.
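The standardization step above can be sketched in a few lines of NumPy. This is a minimal illustration (not from the lecture); the helper name `standardize` and the batch shape are assumptions for the example.

```python
import numpy as np

def standardize(s, eps=1e-8):
    """Standardize pre-activations s of shape (batch, features)
    using the mini-batch mean and variance (eps avoids division by zero)."""
    mu = s.mean(axis=0)   # E[s_ik], estimated over the mini-batch
    var = s.var(axis=0)   # Var[s_ik], estimated over the mini-batch
    return (s - mu) / np.sqrt(var + eps)

# A random mini-batch of pre-activations with non-zero mean, non-unit variance
rng = np.random.default_rng(0)
s = rng.normal(loc=3.0, scale=2.0, size=(64, 5))
s_hat = standardize(s)

# Each feature of s_hat now has (approximately) zero mean and unit variance
print(np.allclose(s_hat.mean(axis=0), 0.0, atol=1e-6))
print(np.allclose(s_hat.var(axis=0), 1.0, atol=1e-2))
```

Note that the statistics are computed per feature, over the batch dimension: each unit's pre-activation is normalized using its own mini-batch mean and variance.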
This is what the deep network will look like with Batch Normalization.

Is this legal? Yes, it is: just as the tanh layer is differentiable, the Batch Normalization layer is also differentiable. Hence we can backpropagate through this layer.
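We can check this differentiability claim numerically. Below is a sketch (not from the lecture) of the standard backward pass for the standardization step, verified against finite differences; the function names and shapes are assumptions for the example.

```python
import numpy as np

def bn_forward(x, eps=1e-5):
    """Forward pass of the standardization step over a mini-batch."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std
    return x_hat, (x_hat, inv_std)

def bn_backward(dy, cache):
    """Backward pass: gradient of the loss w.r.t. x, given dy = dL/dx_hat."""
    x_hat, inv_std = cache
    N = dy.shape[0]
    return (inv_std / N) * (N * dy - dy.sum(axis=0)
                            - x_hat * (dy * x_hat).sum(axis=0))

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 3))
w = rng.normal(size=(8, 3))   # arbitrary upstream gradient dL/dx_hat

x_hat, cache = bn_forward(x)
dx = bn_backward(w, cache)

# Finite-difference check on the scalar loss L = sum(x_hat * w)
h = 1e-5
num = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        xp = x.copy(); xp[i, j] += h
        xm = x.copy(); xm[i, j] -= h
        num[i, j] = (np.sum(bn_forward(xp)[0] * w)
                     - np.sum(bn_forward(xm)[0] * w)) / (2 * h)

print(np.allclose(dx, num, atol=1e-5))
```

The analytic gradient matches the numeric one, so gradients flow through the normalization just as they do through tanh.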
γ^(k) and β^(k) are additional parameters of the network.

Catch: do we necessarily want to force a unit Gaussian input to the tanh layer? Why not let the network learn what is best for it?

After the Batch Normalization step, add the following step:

y^(k) = γ^(k) ŝ_ik + β^(k)

What happens if the network learns

γ^(k) = √(Var[s_ik]), β^(k) = E[s_ik] ?

We will recover s_ik. In other words, by adjusting these additional parameters the network can learn to recover s_ik if that is more favourable.
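The recovery argument can be verified directly. A minimal sketch (not from the lecture), setting γ and β to exactly the values discussed above and checking that the original pre-activations come back:

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(2)
s = rng.normal(loc=-1.0, scale=4.0, size=(32, 4))   # pre-activations s_ik

# Batch Normalization: standardize using mini-batch statistics
mu, var = s.mean(axis=0), s.var(axis=0)
s_hat = (s - mu) / np.sqrt(var + eps)

# Scale-and-shift step with the learnable parameters set to
# gamma = sqrt(Var[s_ik]), beta = E[s_ik]
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * s_hat + beta

print(np.allclose(y, s))   # the original pre-activations are recovered
```

In practice γ and β are learned by gradient descent along with the other weights; the point of this check is only that the identity mapping is within the layer's representable family.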
We will now compare the performance with and without batch normalization on MNIST data using 2 layers....
2016-17: Still exciting times

Even better optimization methods

Data driven initialization methods

Beyond batch normalization