Greedy Layer-Wise Training of Deep Networks
Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle
NIPS 2007
Presented by Ahmed Hefny
Story so far …
• Deep neural nets are more expressive: they can learn wider classes of functions with fewer hidden units (parameters) and fewer training examples.
• Unfortunately, they are not easy to train with gradient-based methods starting from a random initialization.
Story so far …
• Hinton et al. (2006) proposed greedy unsupervised layer-wise training:
  • Greedy layer-wise: train layers sequentially, starting from the bottom (input) layer.
  • Unsupervised: each layer learns a higher-level representation of the layer below; the training criterion does not depend on the labels.
• Each layer is trained as a Restricted Boltzmann Machine (the RBM is the building block of Deep Belief Networks).
• The trained model can be fine-tuned using a supervised method.
(Figure: a stack of RBMs — RBM 0, RBM 1, RBM 2 — trained bottom-up.)
This paper
• Extends the concept to:
  • Continuous variables
  • Uncooperative input distributions
  • Simultaneous layer training
• Explores variations to better understand the training method:
  • What if we use greedy supervised layer-wise training?
  • What if we replace RBMs with auto-encoders?
Outline
• Review
  • Restricted Boltzmann Machines
  • Deep Belief Networks
  • Greedy Layer-wise Training
• Supervised Fine-tuning
• Extensions
  • Continuous Inputs
  • Uncooperative Input Distributions
  • Simultaneous Training
• Analysis Experiments
Restricted Boltzmann Machine
• Undirected bipartite graphical model with connections between visible nodes v and hidden nodes h (no connections within a layer).
• Corresponds to a joint probability distribution over (v, h).
• The conditionals factorize: given v, the hidden units are independent, and vice versa.
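For reference, the standard formulation of a binary RBM behind these two bullets (visible units v, hidden units h, weights W, biases b and c) can be written as:

```latex
% Energy and joint distribution of a binary RBM
E(v, h) = -\,b^{\top} v \;-\; c^{\top} h \;-\; h^{\top} W v,
\qquad
P(v, h) = \frac{e^{-E(v, h)}}{Z},
\quad Z = \sum_{v', h'} e^{-E(v', h')}.

% Factorized conditionals: each unit is an independent sigmoid given the other layer
P(h \mid v) = \prod_j P(h_j \mid v), \qquad P(h_j = 1 \mid v) = \operatorname{sigm}\!\big(c_j + (W v)_j\big),
\qquad
P(v \mid h) = \prod_i P(v_i \mid h), \qquad P(v_i = 1 \mid h) = \operatorname{sigm}\!\big(b_i + (W^{\top} h)_i\big).
```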
Restricted Boltzmann Machine (Training)
• Given input vectors v, adjust the parameters θ = (W, b, c) to increase the data log-likelihood.
• To estimate the gradient: sample h given v (from the data), and sample v and h from the model using Gibbs sampling.
• We can then perform stochastic gradient ascent on the data log-likelihood.
• Stop based on some criterion (e.g. reconstruction error).
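A minimal numerical sketch of this procedure (one-step contrastive divergence, CD-1, on a binary RBM); variable names and the full-batch update are illustrative simplifications, not the paper's exact settings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1, rng=np.random):
    """One CD-1 update for a binary RBM.
    v0: (batch, n_visible) data; W: (n_hidden, n_visible); b, c: visible/hidden biases."""
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W.T + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(v0.dtype)
    # Negative phase: one Gibbs step down to the visibles and back up.
    pv1 = sigmoid(h0 @ W + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(v0.dtype)
    ph1 = sigmoid(v1 @ W.T + c)
    # Approximate log-likelihood gradient: positive minus negative statistics.
    n = v0.shape[0]
    W += lr * (ph0.T @ v0 - ph1.T @ v1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    # Reconstruction error, a common heuristic stopping criterion.
    return np.mean((v0 - pv1) ** 2)
```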
Deep Belief Network
• A DBN is a model of the form shown below, where x denotes the input variables and g^1, …, g^ℓ denote the hidden layers of causal variables.
• The top-level prior P(g^(ℓ-1), g^ℓ) is an RBM.
• An RBM is equivalent to an infinitely deep directed network with tied weights.
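In the paper's notation (x the input, g^i the hidden variables at level i, ℓ levels), the joint distribution of a DBN factorizes as:

```latex
P(x, g^1, g^2, \ldots, g^{\ell})
  = P(x \mid g^1)\, P(g^1 \mid g^2) \cdots P(g^{\ell-2} \mid g^{\ell-1})\, P(g^{\ell-1}, g^{\ell}),
```

where each P(g^i | g^(i+1)) is a factorized sigmoid conditional and the top-level prior P(g^(ℓ-1), g^ℓ) is an RBM.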
Greedy layer-wise training
• Exact training of the full model is intractable.
• Approximation: treat the bottom two layers (x and g^1) as an RBM and fit its parameters using contrastive divergence.
• That gives an approximate marginal over g^1; we need to match it with the model defined by the layers above.
• Higher levels work the same way: treat layers g^i and g^(i+1) as an RBM, fit its parameters using contrastive divergence, and obtain training samples of g^i by recursively sampling Q(g^j | g^(j-1)) starting from the data x (a minimal sketch of this loop follows below).
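A minimal sketch of the greedy loop, reusing the sigmoid and cd1_update helpers from the RBM training sketch above; layer sizes, epoch count, and learning rate are illustrative, not the paper's settings:

```python
import numpy as np

def greedy_pretrain(X, hidden_sizes, epochs=15, lr=0.1):
    """Greedy layer-wise pre-training of a stack of RBMs (sketch).
    X: (n_samples, n_features) data in [0, 1]; hidden_sizes: one size per layer."""
    rbms, inputs = [], X
    for n_hidden in hidden_sizes:
        n_visible = inputs.shape[1]
        W = 0.01 * np.random.randn(n_hidden, n_visible)
        b = np.zeros(n_visible)
        c = np.zeros(n_hidden)
        for _ in range(epochs):                 # train this layer to completion
            cd1_update(inputs, W, b, c, lr=lr)  # full-batch CD-1 for simplicity
        rbms.append((W, b, c))
        # Input to the next RBM: mean hidden activations (samples also work).
        inputs = sigmoid(inputs @ W.T + c)
    return rbms
```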
Supervised Fine-Tuning (in this paper)
• Use greedy layer-wise training to initialize the weights of all layers except the output layer.
• For fine-tuning, use stochastic gradient descent on a cost function defined on the outputs, where the conditional expected values of the hidden nodes are approximated using mean-field.
• The resulting deterministic network is trained with backpropagation (see the sketch below).
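A sketch of this setup (function and variable names are illustrative): the pre-trained RBM weights initialize an ordinary sigmoid MLP whose hidden activations are the mean-field expectations, a randomly initialized output layer is added on top, and the whole network is trained by backpropagation on the supervised cost. It reuses sigmoid from the RBM sketch above.

```python
import numpy as np

def init_mlp_from_rbms(rbms, n_outputs):
    """Each hidden layer reuses the RBM's (W, c); the output layer starts at random."""
    layers = [(W.copy(), c.copy()) for (W, b, c) in rbms]
    n_top = rbms[-1][0].shape[0]
    layers.append((0.01 * np.random.randn(n_outputs, n_top), np.zeros(n_outputs)))
    return layers

def forward(layers, x):
    """Deterministic forward pass: each hidden unit is replaced by its conditional
    expectation (mean-field), giving an ordinary sigmoid network."""
    a = x
    for W, c in layers[:-1]:
        a = sigmoid(a @ W.T + c)
    W_out, c_out = layers[-1]
    return a @ W_out.T + c_out   # feed to the cost; train with SGD + backprop
```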
Continuous Inputs
• Recall that in an RBM, the unnormalized probability of a visible unit's value v given h is exp(a(h)·v), where a(h) is the unit's total input; the domain of v determines the normalized conditional.
• If we restrict v to {0, 1}, normalization gives a binomial unit with probability given by the sigmoid of a(h).
• Instead, if v ranges over [0, ∞), we get an exponential density.
• If the domain of v is a closed interval, we get a truncated exponential.
Continuous Inputs (case of the truncated exponential on [0, 1])
• Sampling: for the truncated exponential, the inverse CDF can be used, where U is sampled uniformly from [0, 1].
• Conditional expectation: available in closed form.
(Both formulas are written out below.)
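For reference, the truncated-exponential conditionals for a unit on [0, 1] with total input a = a(h); these follow directly from normalizing exp(a·v) on [0, 1]:

```latex
p(v \mid h) = \frac{a\, e^{a v}}{e^{a} - 1}, \quad v \in [0, 1];
\qquad
F^{-1}(U) = \frac{\log\!\big(1 - U\,(1 - e^{a})\big)}{a}, \quad U \sim \mathrm{Uniform}[0, 1];
\qquad
\mathbb{E}[\,v \mid h\,] = \frac{1}{1 - e^{-a}} - \frac{1}{a}.
```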
Continuous Inputs
• To handle Gaussian inputs, we need to augment the energy function with a term quadratic in v.
• For a diagonal covariance matrix, this gives a Gaussian conditional for each visible unit (see below).
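A minimal way to write the Gaussian case (diagonal covariance, per-unit coefficients d_i; this is the standard Gaussian-visible construction):

```latex
\text{added energy term: } \sum_i d_i^2\, v_i^2
\quad\Longrightarrow\quad
p(v_i \mid h) \propto \exp\!\big(a_i(h)\, v_i - d_i^2\, v_i^2\big),
\qquad
v_i \mid h \;\sim\; \mathcal{N}\!\Big(\tfrac{a_i(h)}{2 d_i^2},\; \tfrac{1}{2 d_i^2}\Big).
```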
Continuous Hidden Nodes?
• The same construction applies to hidden nodes:
  • Truncated exponential
  • Gaussian
Uncooperative Input Distributions
• Setting: a regression task y = f(x) + noise, with inputs x drawn from p(x).
• No particular relation between p and f (e.g. a Gaussian input distribution and a sinusoidal target).
• Problem: unsupervised pre-training may not help prediction.
Uncooperative Input Distributions
• Proposal: mix unsupervised and supervised training for each layer (partially supervised pre-training).
• Attach a temporary output layer to the layer being trained.
• Combined update: add the stochastic gradient of the input log-likelihood (by contrastive divergence) and the stochastic gradient of the prediction error.
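A sketch of the combined update, reusing sigmoid and cd1_update from above. The temporary linear output layer (V, d), the squared-error loss, and the alpha weighting between the two gradients are illustrative assumptions, not the paper's exact recipe:

```python
def partially_supervised_update(x, y, W, b, c, V, d, lr=0.1, alpha=0.5):
    """One combined update for a single layer: unsupervised (CD-1) gradient
    plus the gradient of a prediction error through a temporary output layer."""
    # Unsupervised part: CD-1 on this layer's RBM (updates W, b, c in place).
    cd1_update(x, W, b, c, lr=lr * alpha)
    # Supervised part: mean-field hidden activations -> temporary linear output.
    h = sigmoid(x @ W.T + c)
    y_hat = h @ V.T + d
    err = y_hat - y                              # gradient of 0.5 * ||y_hat - y||^2
    dh = (err @ V) * h * (1.0 - h)               # backprop into this layer
    n = x.shape[0]
    V -= lr * (1 - alpha) * (err.T @ h) / n
    d -= lr * (1 - alpha) * err.mean(axis=0)
    W -= lr * (1 - alpha) * (dh.T @ x) / n
    c -= lr * (1 - alpha) * dh.mean(axis=0)
```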
Simultaneous Layer Training
• Greedy layer-wise training:
  • For each layer, repeat until a criterion is met:
    • Sample the layer's input (by recursively applying the already-trained layers to the data).
    • Update the parameters using contrastive divergence.
• Simultaneous training (see the sketch below):
  • Repeat until a criterion is met:
    • Sample the input to all layers.
    • Update the parameters of all layers using contrastive divergence.
  • Simpler: one criterion for the entire network.
  • Takes more time.
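A sketch of the simultaneous variant, reusing the same helpers; unlike the greedy loop, every iteration updates all layers once, and a single criterion for the whole network decides when to stop:

```python
def simultaneous_train(X, rbms, lr=0.1, n_iters=1000):
    """rbms: list of (W, b, c) for every layer, e.g. randomly initialized."""
    for _ in range(n_iters):                      # replace with a global stopping criterion
        inputs = X
        for (W, b, c) in rbms:
            cd1_update(inputs, W, b, c, lr=lr)    # one CD-1 step for this layer
            inputs = sigmoid(inputs @ W.T + c)    # propagate to the next layer
    return rbms
```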
Experiments
• Does greedy unsupervised pre-training help?
• What if we replace RBMs with auto-encoders?
• What if we do greedy supervised pre-training?
• Does continuous variable modeling help?
• Does partially supervised pre-training help?
Experiment 1 (MSE and training errors)
• Partially supervised < unsupervised pre-training < no pre-training (lower error is better).
• Gaussian < binomial (explicitly modeling continuous inputs wins).
Experiment 2
• Auto-encoders: learn a compact representation from which X can be reconstructed.
• Trained to minimize the reconstruction cross-entropy (see the sketch below).
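A sketch of one auto-encoder layer trained to minimize reconstruction cross-entropy, assuming tied encoder/decoder weights (a common simplification; the paper's exact architecture may differ) and reusing sigmoid from the RBM sketch above:

```python
import numpy as np

def autoencoder_update(x, W, b, c, lr=0.1):
    """Encoder: h = sigm(W x + c); decoder: x_hat = sigm(W^T h + b);
    one gradient step on the reconstruction cross-entropy."""
    h = sigmoid(x @ W.T + c)           # compact code
    x_hat = sigmoid(h @ W + b)         # reconstruction of x
    n = x.shape[0]
    d_out = x_hat - x                  # cross-entropy gradient w.r.t. decoder pre-activation
    dh = (d_out @ W.T) * h * (1.0 - h)
    W -= lr * (h.T @ d_out + dh.T @ x) / n   # tied weights: decoder + encoder contributions
    b -= lr * d_out.mean(axis=0)
    c -= lr * dh.mean(axis=0)
    eps = 1e-8                         # mean reconstruction cross-entropy, for monitoring
    return -np.mean(np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps), axis=1))
```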
(Results figure: hidden layer widths of 500–1000, with only 20 nodes in the last two layers.)
• Auto-encoder pre-training outperforms supervised pre-training, but is still outperformed by RBM pre-training.
• Without pre-training, deep nets do not generalize well, but they can still fit the training data if the top layers are wide enough.
Conclusions
• Unsupervised pre-training is important for deep networks.
• Partial supervision further enhances results, especially when the input distribution and the function to be estimated are not closely related.
• Explicitly modeling continuous inputs is better than using binomial models.
Thanks