Deep Learning tutorial
Nicolas Le Roux
Adam Coates, Yoshua Bengio, Yann LeCun, Andrew Ng
DL/UFL workshop
16/12/11
Disclaimer
A lot of the slides in this tutorial have been taken from presentations by:
Honglak Lee
Yann LeCun
Geoff Hinton
Andrew Ng
Yoshua Bengio
Outline
1 Introduction
What is deep?
Training problems
2 Unsupervised pretraining
Pretraining
The denoising autoencoder
Predictive sparse coding
The deep belief network
3 Conclusion
What does deep mean?
Properties of deep models
Many non-linearities ≠ one non-linearity
More constrained space of transformations
Deep model = prior on the type of transformations
May need fewer computational units for the same function
Features for face recognition
Image courtesy of Honglak Lee
“First-level” features are not task-dependent
“First-level” features are easy to learn
“Higher-level” features are easier to learn given the low-level features
A deep neural network
H1 = g(W1X)
H2 = g(W2H1)
Y = W3H2
g = sigmoid or tanh
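As a concrete illustration (not from the slides), here is a minimal numpy sketch of this forward pass; the layer sizes are arbitrary choices for the example.

```python
import numpy as np

def g(a):
    """Sigmoid non-linearity; tanh would work just as well here."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical layer sizes, chosen only for the example.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(256, 784))
W2 = rng.normal(scale=0.01, size=(64, 256))
W3 = rng.normal(scale=0.01, size=(10, 64))

X = rng.normal(size=784)   # one input vector
H1 = g(W1 @ X)             # H1 = g(W1 X)
H2 = g(W2 @ H1)            # H2 = g(W2 H1)
Y = W3 @ H2                # Y = W3 H2 (linear output layer)
```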
Training issues
W3 is a lot easier to learn than W2 and W1
Compare
Given set of features, find combination
Given combination, find set of features
Practical results
Parameters randomly initialized
They do not change much (bad local minima)
A random transformation is not very good
Bengio et al., 2007
Practical results - TIMIT
Mohamed et al., Deep Belief Networks for phone recognition
The gist of pretraining
1 We want (potentially) deep networks
2 Training the lower layers is hard.
Solution: optimize each layer without relying on the layers above.
The gist of pretraining
1 Learn W1 first
2 Given W1, learn W2
3 Given W1 and W2, learn W3
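A minimal sketch of this greedy scheme; `train_layer` is a hypothetical placeholder for whatever single-layer building block is used (a DAE, PSD, or RBM, as described next).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_greedily(X, layer_sizes, train_layer):
    """Learn W1 on X, then W2 on the features produced by W1, and so on.
    `train_layer(H, n_out)` is a hypothetical stand-in for any single-layer
    learner; it returns an (n_out x n_in) weight matrix."""
    weights, H = [], X
    for n_out in layer_sizes:
        W = train_layer(H, n_out)   # trained without touching the layers above
        weights.append(W)
        H = sigmoid(H @ W.T)        # these features are the next layer's data
    return weights
```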
Greedy learning
1 Train each layer in a supervised way
2 Use an alternate cost, hoping that it will be useful.
Alternate cost: modeling the data well.
Denoising autoencoders
Vincent et al, 2008
Corrupt the input (e.g. set 25% of inputs to 0)
Reconstruct the uncorrupted input
Use the encoding of the uncorrupted input as input to the next level
A representation is good if it works well with noisy data
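A minimal numpy sketch of one DAE update, under assumptions not in the slides (tied weights, squared-error loss, full-batch gradient, illustrative learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(X, W, b, c, lr=0.1, corruption=0.25):
    """One gradient step of a tied-weight denoising autoencoder on a batch X
    (one example per row)."""
    X_tilde = X * (rng.random(X.shape) > corruption)  # zero out ~25% of inputs
    H = sigmoid(X_tilde @ W + b)                      # encode the corrupted input
    X_hat = H @ W.T + c                               # linear decoder, tied weights
    err = X_hat - X                                   # compare to the CLEAN input
    dA = (err @ W) * H * (1.0 - H)                    # backprop through the encoder
    W -= lr * (X_tilde.T @ dA + err.T @ H) / len(X)   # both uses of the tied W
    b -= lr * dA.mean(axis=0)
    c -= lr * err.mean(axis=0)
    return W, b, c

# The features passed to the next level are sigmoid(X @ W + b), computed on
# the UNCORRUPTED input.
```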
DAE - A graphic explanation
Projects a “corrupted point” onto its position on the manifold
Two corrupted versions of the same point have the same representation
Stacking DAEs
1 Start with the lowest level and stack upwards
2 Train each layer on the intermediate code (features) from the layer below
3 Top layer can have a different output (e.g., a softmax non-linearity) to provide an output for classification
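Continuing the DAE sketch above, stacking is then a short loop (epoch counts and initialization scales are illustrative):

```python
def stack_daes(X, layer_sizes, epochs=10):
    """Greedy stacking of DAEs, reusing dae_step, sigmoid, and rng from the
    previous sketch; a softmax classifier can then be put on top of H."""
    params, H = [], X
    for n_hid in layer_sizes:
        W = rng.normal(scale=0.01, size=(H.shape[1], n_hid))
        b, c = np.zeros(n_hid), np.zeros(H.shape[1])
        for _ in range(epochs):
            W, b, c = dae_step(H, W, b, c)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # clean encoding becomes the next layer's input
    return params, H
```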
Predictive sparse coding
Objective function for sparse coding:
Z(X) = argmin_Z ½‖X − DZ‖₂² + λ‖Z‖₁
Predictive sparse coding adds a prediction penalty:
Z(X) = argmin_Z ½‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F(X, θ)‖₂²
F(X, θ) = G tanh(W_e X)
Z must:
Be sparse
Reconstruct X well
Be well predicted by a feedforward model
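One way to carry out the minimization over Z is ISTA (iterative soft-thresholding). This is my own choice for the sketch, not necessarily the optimizer used in the original work, and the step size and iteration count are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def psd_infer(x, D, G, We, lam=0.1, alpha=1.0, eta=0.01, n_steps=100):
    """Minimize (1/2)||x - Dz||^2 + lam*||z||_1 + alpha*||z - F(x)||^2 over z."""
    f = G @ np.tanh(We @ x)       # the feedforward prediction F(X, theta)
    z = f.copy()                  # warm-start at the prediction
    for _ in range(n_steps):
        grad = D.T @ (D @ z - x) + 2.0 * alpha * (z - f)  # smooth terms
        z = soft_threshold(z - eta * grad, eta * lam)     # L1 term via prox
    return z
```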
Stacking PSDs
First layer:
Z(X) = argmin_Z ½‖X − DZ‖₂² + λ‖Z‖₁ + α‖Z − F1(X, θ1)‖₂²
F1(X, θ1) = G tanh(W_e¹ X)
Second layer:
Z(H1) = argmin_Z ½‖H1 − DZ‖₂² + λ‖Z‖₁ + α‖Z − F2(H1, θ2)‖₂²
F2(H1, θ2) = G tanh(W_e² H1)
1 Train the first layer using PSD on X
2 Use H1 = |F1(X, θ1)| as features
3 Train the second layer using PSD on H1
4 Use H2 = |F2(H1, θ2)| as features
5 Keep going...
6 Use H37 as input to your final classifier
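The stacking recipe itself, as a sketch; `psd_fit` is a hypothetical placeholder that trains one PSD layer (dictionary and predictor) on its input and returns the trained predictor F:

```python
import numpy as np

def stack_psds(X, n_layers, psd_fit):
    """Greedy PSD stacking; `psd_fit(H)` is hypothetical and returns the
    trained feedforward predictor F for one layer."""
    H = X
    for _ in range(n_layers):
        F = psd_fit(H)        # learns D, G, W_e for this layer
        H = np.abs(F(H))      # H_k = |F_k(H_{k-1})|, as in the recipe above
    return H                  # top-level features for the final classifier
```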
The mothership: the DBN
Unsupervised pretraining works by reconstructing the data
What if we tried to have a full generative model of the data?
Restricted Boltzmann Machine (RBM)
E(x, h) = −∑_ij W_ij x_i h_j = −xᵀWh
P(x, h) = exp[xᵀWh] / ∑_{x0,h0} exp[x0ᵀWh0]
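In code, the energy and the two conditionals look as follows (binary units, no biases, matching the slide; the factorized conditionals are the standard consequence of the bipartite structure):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def energy(x, h, W):
    """E(x, h) = -x^T W h for binary vectors x and h."""
    return -x @ W @ h

def p_h_given_x(x, W):
    """P(h_j = 1 | x): the hidden units are conditionally independent."""
    return sigmoid(W.T @ x)

def p_x_given_h(h, W):
    """P(x_i = 1 | h): same factorization on the visible side."""
    return sigmoid(W @ h)
```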
Type of variables in an RBM
P(x, h) = exp[xᵀWh] / ∑_{x0,h0} exp[x0ᵀWh0]
This formulation is correct when x and h are binary variables
You have no choice over the type of x (this is your data)
You can choose what type h is
There are other formulations when x is not binary
They typically do not work that well (with notable exceptions)
Training an RBM
P(x, h) = exp[xᵀWh] / ∑_{x0,h0} exp[x0ᵀWh0]
We have a joint probability of data and “features”
Maximizing P(x) will give us meaningful features
There are efficient ways to do that (check my webpage)
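One standard efficient method is contrastive divergence (CD-1); the slide does not name its method, so this is only one plausible instantiation, with an illustrative learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(X, W, lr=0.05):
    """One CD-1 update on a batch of binary rows X; an approximation to
    the gradient of log P(x)."""
    ph = sigmoid(X @ W)                     # P(h = 1 | data)
    h = (rng.random(ph.shape) < ph) * 1.0   # sample the hidden units
    px = sigmoid(h @ W.T)                   # reconstruction of the visibles
    ph_rec = sigmoid(px @ W)                # hidden probabilities, reconstruction
    # positive statistics minus negative statistics, averaged over the batch
    W += lr * (X.T @ ph - px.T @ ph_rec) / len(X)
    return W
```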
Stacking RBMs
Learn the first layer by maximizing P(x) = ∑_{h1} P(x, h1)
Once trained, the features of x are E_P[h1 | x]
Learn the second layer by maximizing P(h1) = ∑_{h2} P(h1, h2)
Keep going...
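Stacking then mirrors the DAE case; a sketch reusing cd1_step, sigmoid, and rng from the previous block (epoch count and initialization are illustrative):

```python
def stack_rbms(X, layer_sizes, epochs=10):
    """Greedy DBN-style pretraining: once a layer is trained, the features
    E_P[h | x] = sigmoid(X @ W) become the data for the next layer."""
    weights, H = [], X
    for n_hid in layer_sizes:
        W = rng.normal(scale=0.01, size=(H.shape[1], n_hid))
        for _ in range(epochs):
            W = cd1_step(H, W)
        weights.append(W)
        H = sigmoid(H @ W)   # E_P[h1 | x], the features of x
    return weights
```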
Results - 1
Results - 2
Take-home messages - Deep learning
Building a hierarchy of features can be beneficial
Training all layers at once is very hard
Pretraining alleviates the problem of bad local minima
Regularizes the model
Take-home messages - Pretraining
Pretraining requires building blocks
ALL pretraining methods share the same idea: model the input while having somewhat invariant features
By no means is this necessarily the best method
And now for some applications...
This is the end...
Questions?