unsupervised deep learning stats 306b - stanford...
TRANSCRIPT
![Page 1: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/1.jpg)
Unsupervised Deep Learning
Stats 306B
Richard Socher
Some slides from ACL 2012 Tutorial by Richard Socher, Yoshua Bengio and Chris Manning
![Page 2: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/2.jpg)
Deep Learning
Most current machine learning workswell because of human-designedrepresentations and input features
Machine learning becomes just optimizingweights to best make a final prediction
Representation learning attempts to automatically learn good features or representations
Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction
NER WordNet
SRL Parser
2
![Page 3: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/3.jpg)
A Deep Architecture
Mainly, work has explored deep belief networks (DBNs), Markov Random Fields with multiple layers, and various types of multiple-layer neural networks
Output layer
Here predicting a supervised target
Hidden layers
These learn more abstract representations as you head up
Input layer
Raw sensory inputs (roughly)3
![Page 4: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/4.jpg)
Five Reasons to Explore
Deep and Representation Learning
Part 1.1: The Basics
4
![Page 5: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/5.jpg)
#1 Learning representations
5
Handcrafting features is time-consuming
The features are often both over-specified and incomplete
The work has to be done again for each task/domain/…
We must move beyond handcrafted features and simple ML
Humans develop representations for learning and reasoning
Our computers should do the same
Deep learning provides a way of doing this
![Page 6: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/6.jpg)
#2 The need for distributed representations
Current ML/NLP systems are incredibly fragile because of their atomic symbol representations, very sparse
Crazy sentential complement, such as for “likes [(being) crazy]”6
![Page 7: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/7.jpg)
Learning features that are not mutually exclusive can be exponentially more efficient than nearest-neighbor-like or clustering-like models
#2 The need for distributed representations
Multi-Clustering
Clustering
7
C1 C2 C3
input
![Page 8: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/8.jpg)
Distributed representations deal with the curse
of dimensionality
Generalizing locally (e.g., nearest neighbors) requires representative examples for all relevant variations!
Classic solutions:
• Manual feature design
• Assuming a smooth target function (e.g., linear models)
• Kernel methods (linear in terms of kernel based on data points)
Neural networks parameterize and learn a “similarity” kernel
8
![Page 9: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/9.jpg)
#3 Unsupervised feature and weight
learning – Focus Today!
Today, most practical, good NLP& ML methods require labeled training data (i.e., supervised learning)
But almost all data is unlabeled
Most information must be acquired unsupervised
Fortunately, a good model of observed data can really help you learn classification decisions
9
![Page 10: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/10.jpg)
We need good intermediate representations that can be shared across tasks
Multiple levels of latent variables allow combinatorial sharing of statistical strength
#4 Learning multiple levels of representation
Biologically inspired generic learning algorithm.
Task 1 Output
Linguistic Input
Task 2 Output Task 3 Output
10
Layer 1
Layer 2
Layer 3
![Page 11: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/11.jpg)
#5 Why now?Despite prior investigation and understanding of many of the algorithmic techniques …
Before 2006 training deep architectures was unsuccessful
What has changed?
• Faster and multicore CPUs and GPUs and more data
• New methods for unsupervised pre-training have been developed (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.)
• More efficient parameter estimation methods
• Better understanding of model regularization
Great performance in speech, vision and language tasks
![Page 12: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/12.jpg)
Outline
1. Motivation
2. Neural Networks: Feed-forward, Autoencoders
3. Probabilistic - Directed: PCA, Sparse Coding
4. Probabilistic – Undirected: MRFs and RBMs
12
![Page 13: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/13.jpg)
From logistic regression to neural nets
Part 2.2: Neural Nets Feed-forward Intro
13
![Page 14: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/14.jpg)
Demystifying neural networks
Neural networks come with their own terminological baggage
… just like SVMs
But if you understand how maxent/logistic regression models work
Then you already understand the operation of a basic neural network neuron!
A single neuronA computational unit with n (3) inputs
and 1 outputand parameters W, b
Activation function
Inputs
Bias unit corresponds to intercept term
Output
14
![Page 15: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/15.jpg)
This is exactly what a neuron computes
hw,b(x) = f (wTx+ b)
f (z) =1
1+ e-z
w, b are the parameters of this neuroni.e., this logistic regression model15
b: We can have an “always on” feature, which gives a class prior, or separate it out, as a bias term
![Page 16: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/16.jpg)
A neural network = running several logistic
regressions at the same time
If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs
But we don’t have to decide ahead of time what variables these logistic regressions are trying to predict!
16
![Page 17: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/17.jpg)
A neural network = running several logistic
regressions at the same time
… which we can feed into another logistic regression function
and it is the training criterion that will decide what those intermediate binary target variables should be, so as to do a good job of predicting the targets for the next layer, etc.
17
![Page 18: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/18.jpg)
A neural network = running several logistic
regressions at the same time
Before we know it, we have a multilayer neural network….
18
![Page 19: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/19.jpg)
Matrix notation for a layer
We have
In matrix notation
where f is applied element-wise:
a1
a2
a3
a1 = f (W11x1 +W12x2 +W13x3 +b1)
a2 = f (W21x1 +W22x2 +W23x3 +b2 )
etc.
z =Wx+ b
a = f (z)
f ([z1, z2, z3]) = [ f (z1), f (z2 ), f (z3)]19
W12
b3
![Page 20: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/20.jpg)
How do we train the weights W?
• For a supervised single layer neural net, we can train the model just like a maxent model – we calculate and use gradients
• Stochastic gradient descent (SGD)
• Conjugate gradient or L-BFGS
• Adagrad!
20
![Page 21: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/21.jpg)
Non-linearities: Why they’re needed
• For logistic regression: map to probabilities
• Here: function approximation, e.g., regression or classification• Without non-linearities, deep neural networks
can’t do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform
• Probabilistic interpretation unnecessary except in the Boltzmann machine/graphical models
• People often use other non-linearities, such as tanh, as we’ll discuss in part 3
21
![Page 22: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/22.jpg)
Learning Word Representations with a
Simple “Unsupervised” Model
Part 2.2: Neural Nets Feed-forward Example Model
22
![Page 23: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/23.jpg)
Neural word embeddings - visualization
23
![Page 24: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/24.jpg)
Advantages of the neural word embedding
approach
24
Compared to a method like LSA, neural word embeddings can become more meaningful through adding supervision from one or multiple tasks
For instance, sentiment is usually not captured in unsupervised word embeddings but can be in neural word vectors
![Page 25: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/25.jpg)
A neural network for learning word vectors (Collobert et al. JMLR 2011)
Idea: A word and its context is a positive training sample; a random word in that same context gives a negative training sample:
cat chills on a mat cat chills Jeju a mat
Similar: Implicit negative evidence in Contrastive Estimation, (Smith and Eisner 2005)
25
![Page 26: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/26.jpg)
Summary: Feed-forward Computation
26
Computing a window’s score with a 3-layer Neural Net: s = score(cat chills on a mat)
cat chills on a mat
![Page 27: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/27.jpg)
Summary: Feed-forward Computation
• s = score(cat chills on a mat)
• sc = score(cat chills Jeju a mat)
• Idea for training objective: make score of true window larger and corrupt window’s score lower (until they’re good enough): minimize
• This is continuous, can perform SGD27
![Page 28: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/28.jpg)
Training with Backpropagation
Assuming cost J is > 0, it is simple to see that we can compute the derivatives of s and sc wrt all the involved variables: U, W, b, x
28
![Page 29: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/29.jpg)
Training with Backpropagation
• Let’s consider the derivative of a single weight Wij
• This only appears inside ai
• For example: W23 is only used to compute a2
x1 x2 x3 +1
a1 a2
s U2
W23
29
![Page 30: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/30.jpg)
Training with Backpropagation
Derivative of weight Wij:
30
x1 x2 x3 +1
a1 a2
s U2
W23
![Page 31: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/31.jpg)
Training with Backpropagation
Derivative of single weight Wij :
Local error signal
Local input signal
31
x1 x2 x3 +1
a1 a2
s U2
W23
![Page 32: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/32.jpg)
• We want all combinations ofi = 1, 2 and j = 1, 2, 3
• Solution: Outer product:where is the “responsibility” coming from each activation a
Training with Backpropagation
• From single weight Wij to full W:
32
x1 x2 x3 +1
a1 a2
s U2
W23
![Page 33: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/33.jpg)
Training with Backpropagation
• For biases b, we get:
33
x1 x2 x3 +1
a1 a2
s U2
W23
![Page 34: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/34.jpg)
Training with Backpropagation
34
That’s almost backpropagation
It’s simply taking derivatives and using the chain rule!
Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers
Example: last derivatives of model, the word vectors in x
![Page 35: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/35.jpg)
Training with Backpropagation
• Take derivative of score with respect to single word vector (for simplicity a 1d vector, but same if it was longer)
• Now, we cannot just take into consideration one ai
because each xj is connected to all the neurons above and hence xj influences the overall score through all of these, hence:
Re-used part of previous derivative35
![Page 36: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/36.jpg)
Training with Backpropagation: softmax
36
What is the major benefit of learned word vectors?
Ability to also propagate labeled information into them, via softmax/maxent and hidden layer:
S
c1 c2 c3
x1 x2 x3 +1
a1 a2
P(c | d,l) =elT f (c,d )
elT f ( ¢c ,d )
¢cå
![Page 37: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/37.jpg)
Autoencoder and their variants
Part 2.2: Autoencoders
37
![Page 38: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/38.jpg)
Sharing Statistical Strength
• Besides very fast prediction, the main advantage of deep learning is statistical
• Potential to learn from less labeled examples because of sharing of statistical strength:
• Unsupervised pre-training & Multi-task learning
• Semi-supervised learning
38
![Page 39: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/39.jpg)
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)
purely supervised
39
![Page 40: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/40.jpg)
Semi-Supervised Learning
• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)
semi-supervised
40
![Page 41: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/41.jpg)
Deep autoencoders
• Alternative to contrastive unsupervised word learning
• Another is RBMs (Hinton et al. 2006), next section
1. Definition, intuition and variants of autoencoders
2. Stacking for deep autoencoders
41
![Page 42: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/42.jpg)
Auto-Encoders
• Multilayer neural net with target output = input
• Reconstruction=decoder(encoder(input))
• Probable inputs have small reconstruction error
…
code= latent features
…
encoder
decoder
input
reconstruction
42
![Page 43: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/43.jpg)
PCA = Linear Manifold = Linear Auto-Encoder
reconstruction error vector
Linear manifold
reconstruction(x)
x
input x, 0-meanfeatures=code=h(x)=W xreconstruction(x)=WT h(x) = WT W xW = principal eigen-basis of Cov(X)
LSA example:x = (normalized) distribution of co-occurrence frequencies
43
![Page 44: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/44.jpg)
The Manifold Learning Hypothesis
• Examples concentrate near a lower dimensional “manifold” (region of high density where small changes are only allowed in certain directions)
44
![Page 45: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/45.jpg)
45
Auto-Encoders Learn Salient Variations, like a
non-linear PCA
Minimizing reconstruction error
forces latent representation of
“similar inputs” to stay on
manifold
![Page 46: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/46.jpg)
Auto-Encoder Variants
• Discrete inputs: cross-entropy or log-likelihood reconstruction criterion (similar to used for discrete targets for MLPs)
• Preventing them to learn the identity everywhere:
• Undercomplete (eg PCA): bottleneck code smaller than input
• Sparsity: penalize hidden unit activations so at 0 or near (0.05)
[Goodfellow et al 2009]
• Contractive: force encoder to have small derivatives
[Rifai et al 2011]
• Denoising: see next slide
[Vincent et al 2008]
46
![Page 47: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/47.jpg)
Denoising Autoencoder
47
![Page 48: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/48.jpg)
Stacking Auto-Encoders
• Can be stacked successfully (Bengio et al NIPS’2006) to form highly non-linear representations
48
![Page 49: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/49.jpg)
Layer-wise Unsupervised Learning
…input
49
![Page 50: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/50.jpg)
Layer-wise Unsupervised Pre-training
…
…
input
features
50
![Page 51: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/51.jpg)
Layer-wise Unsupervised Pre-training
…
…
…
input
features
reconstruction
of input=?
… input
51
![Page 52: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/52.jpg)
Layer-wise Unsupervised Pre-training
…
…
input
features
52
![Page 53: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/53.jpg)
Layer-wise Unsupervised Pre-training
…
…
input
features
…More abstract
features
53
![Page 54: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/54.jpg)
…
…
input
features
…More abstract
features
reconstruction
of features=?
… ………
Layer-Wise Unsupervised Pre-trainingLayer-wise Unsupervised Learning
54
![Page 55: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/55.jpg)
…
…
input
features
…More abstract
features
Layer-wise Unsupervised Pre-training
55
![Page 56: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/56.jpg)
…
…
input
features
…More abstract
features
…Even more abstract
features
Layer-wise Unsupervised Learning
56
![Page 57: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/57.jpg)
…
…
input
features
…More abstract
features
…Even more abstract
features
Output
f(X) sixTarget
Ytwo!=
?
Supervised Fine-Tuning
57
![Page 58: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/58.jpg)
Outline
1. Motivation
2. Neural Networks: Feed-forward*, Autoencoders
3. Probabilistic - Directed: PCA, Sparse Coding
4. Probabilistic – Undirected: MRFs and RBMs
58
![Page 59: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/59.jpg)
Probabilistic Models
• Learn a joint probability distribution over p(x,h)
• Maximize likelihood
• See posterior p(h|x) as the feature representation
59
![Page 60: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/60.jpg)
Directed Graphical Models: PCA
• Factorization of p(x,h) = p(x|h) p(h) = likelihood x prior
• Probabilistic interpretation of PCA:
• Get W via iterative maximization of likelihood (i.e. EM)
60
![Page 61: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/61.jpg)
Sparse Coding: Intuition for images
Natural Images
Learned bases: “Edges”
50 100 150 200 250 300 350 400 450 500
50
100
150
200
250
300
350
400
450
500
50 100 150 200 250 300 350 400 450 500
50
100
150
200
250
300
350
400
450
500
50 100 150 200 250 300 350 400 450 500
50
100
150
200
250
300
350
400
450
500
0.8 * + 0.3 * + 0.5 *
[a1, …, a64] = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, 0] (feature representation)
Test example
61
![Page 62: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/62.jpg)
Directed Graphical Models: Sparse Coding
• Sparse Coding (non-probabilistic)
• Training:
• Testing:
62
![Page 63: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/63.jpg)
Directed Graphical Models: Sparse Coding
• Sparse Coding (probabilistic)
• Similar to pPCA but with Laplace prior on hidden variables
• In practice Adam Coates showed that dictionary D can be learned with k-meansand encoding (test time) can use soft threshold:
63
![Page 64: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/64.jpg)
Undirected Graphical Models:
MRFs and Boltzmann machines
• Generally p(x,h) defined via non-negative clique potentials
• Special case: Boltzmann machine, energy
• Gives sigmoid conditional for hidden:
• Totally intractable since single hidden unit requires marginalization over all others:
64
![Page 65: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/65.jpg)
Undirected Graphical Models:
Restricted Boltzmann machines (RBMs)
• RBM energy only hasvisible-hidden connections
• Bipartite graph
• Conditionals nicelyfactorize
65
![Page 66: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/66.jpg)
Undirected Graphical Models:
Restricted Boltzmann machines (RBMs)
• Detail: Factorization
66
![Page 67: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/67.jpg)
Undirected Graphical Models:
Restricted Boltzmann machines (RBMs)
• Detail: Training RBMs
• Maximize marginal
67
![Page 68: Unsupervised Deep Learning Stats 306B - Stanford …web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring...Unsupervised Deep Learning Stats 306B Richard Socher Some slides from](https://reader031.vdocuments.site/reader031/viewer/2022013110/5b05aa8b7f8b9ad1768bb2ae/html5/thumbnails/68.jpg)
Undirected Graphical Models:
Restricted Boltzmann machines (RBMs)
• Detail: Training RBMs
• Conditional of empirical distributions is easy
• Expected value wrt joint distribution tricky!
• Use average over samples from MCMC approximation
• MAGIC: Contrastive Divergence, Hinton 2006:
• Instead of many samples, use only 1
• Instead of waiting for burn in, take just one step & start with data sample68