Deep Learning
Restricted Boltzmann Machines (RBM)
Ali Ghodsi
University of Waterloo
December 15, 2015
Slides are partially based on the book in preparation, Deep Learning by Bengio, Goodfellow, and Courville, 2015.
Restricted Boltzmann Machines
Restricted Boltzmann machines are some of the most common building blocks of deep probabilistic models. They are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables.
Restricted Boltzmann Machines
$$p(\mathbf{v},\mathbf{h}) = \frac{1}{Z}\exp\{-E(\mathbf{v},\mathbf{h})\},$$

where $E(\mathbf{v},\mathbf{h})$ is the energy function

$$E(\mathbf{v},\mathbf{h}) = -\mathbf{b}^T\mathbf{v} - \mathbf{c}^T\mathbf{h} - \mathbf{v}^T W\mathbf{h},$$

and $Z$ is the normalizing constant (partition function):

$$Z = \sum_{\mathbf{v}}\sum_{\mathbf{h}} \exp\{-E(\mathbf{v},\mathbf{h})\}.$$
Restricted Boltzmann Machine (RBM)
Energy function:
$$E(\mathbf{v},\mathbf{h}) = -\mathbf{b}^T\mathbf{v} - \mathbf{c}^T\mathbf{h} - \mathbf{v}^T W\mathbf{h} = -\sum_k b_k v_k - \sum_j c_j h_j - \sum_j \sum_k W_{jk} h_j v_k$$

Distribution: $p(\mathbf{v},\mathbf{h}) = \frac{1}{Z}\exp\{-E(\mathbf{v},\mathbf{h})\}$

Partition function: $Z = \sum_{\mathbf{v}}\sum_{\mathbf{h}} \exp\{-E(\mathbf{v},\mathbf{h})\}$
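For a toy-sized RBM these definitions can be checked by brute force, since the sums over configurations are small. The following is a minimal sketch (NumPy; the sizes, variable names, and the (visible, hidden) layout of W are illustrative assumptions, not from the slides):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d, n = 3, 2                      # toy sizes: 3 visible units, 2 hidden units
b = rng.normal(size=d)           # visible biases
c = rng.normal(size=n)           # hidden biases
W = rng.normal(size=(d, n))      # weights, stored as (visible, hidden)

def energy(v, h):
    """E(v,h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

# Enumerate all 2^d visible and 2^n hidden configurations.
configs_v = [np.array(v) for v in product([0, 1], repeat=d)]
configs_h = [np.array(h) for h in product([0, 1], repeat=n)]

# Partition function: sum of exp{-E} over every joint configuration.
Z = sum(np.exp(-energy(v, h)) for v in configs_v for h in configs_h)

def p_joint(v, h):
    """p(v,h) = exp{-E(v,h)} / Z."""
    return np.exp(-energy(v, h)) / Z

# Sanity check: the joint distribution sums to 1.
total = sum(p_joint(v, h) for v in configs_v for h in configs_h)
print(f"Z = {Z:.4f}, total probability = {total:.6f}")
```

The double sum has $2^{d+n}$ terms, which is exactly why $Z$ becomes intractable at realistic layer sizes.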
Conditional Distributions
The partition function Z is intractable: it sums over exponentially many joint configurations of v and h.
Therefore the joint probability distribution p(v, h) is also intractable.
But P(h|v) is simple to compute and sample from.
Deriving the conditional distributions from the joint distribution

$$
\begin{aligned}
p(\mathbf{h}\mid\mathbf{v}) &= \frac{p(\mathbf{h},\mathbf{v})}{p(\mathbf{v})} \\
&= \frac{1}{p(\mathbf{v})}\,\frac{1}{Z}\exp\{\mathbf{b}^T\mathbf{v} + \mathbf{c}^T\mathbf{h} + \mathbf{v}^T W\mathbf{h}\} \\
&= \frac{1}{Z'}\exp\{\mathbf{c}^T\mathbf{h} + \mathbf{v}^T W\mathbf{h}\} \\
&= \frac{1}{Z'}\exp\Big\{\sum_{j=1}^{n} c_j h_j + \sum_{j=1}^{n} \mathbf{v}^T W_{:j} h_j\Big\} \\
&= \frac{1}{Z'}\prod_{j=1}^{n}\exp\{c_j h_j + \mathbf{v}^T W_{:j} h_j\}
\end{aligned}
$$

The factors that depend only on $\mathbf{v}$ are absorbed into the new normalizer $Z'$, so the conditional factorizes over the hidden units $h_j$.
The distributions over the individual binary $h_j$

$$
P(h_j = 1 \mid \mathbf{v}) = \frac{P(h_j = 1, \mathbf{v})}{P(h_j = 0, \mathbf{v}) + P(h_j = 1, \mathbf{v})}
= \frac{\exp\{c_j + \mathbf{v}^T W_{:j}\}}{\exp\{0\} + \exp\{c_j + \mathbf{v}^T W_{:j}\}}
= \operatorname{sigmoid}(c_j + \mathbf{v}^T W_{:j})
$$

$$
P(\mathbf{h}\mid\mathbf{v}) = \prod_{j=1}^{n}\operatorname{sigmoid}(c_j + \mathbf{v}^T W_{:j}), \qquad
P(\mathbf{v}\mid\mathbf{h}) = \prod_{i=1}^{d}\operatorname{sigmoid}(b_i + W_{i:}\mathbf{h})
$$
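Both conditionals are a single matrix-vector product followed by an elementwise sigmoid. A minimal sketch, reusing the toy parameters above (the helper names are my own, and the vectorization over all units assumes the (visible, hidden) layout of W):

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """Vector of P(h_j = 1 | v) = sigmoid(c_j + v^T W_:j) for all hidden units."""
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):
    """Vector of P(v_i = 1 | h) = sigmoid(b_i + W_i: h) for all visible units."""
    return sigmoid(b + W @ h)
```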
RBM Gibbs Sampling
Step 1: Sample $\mathbf{h}^{(\ell)} \sim P(\mathbf{h}\mid\mathbf{v}^{(\ell)})$.

We can simultaneously and independently sample all the elements of $\mathbf{h}^{(\ell)}$ given $\mathbf{v}^{(\ell)}$.

Step 2: Sample $\mathbf{v}^{(\ell+1)} \sim P(\mathbf{v}\mid\mathbf{h}^{(\ell)})$.

We can simultaneously and independently sample all the elements of $\mathbf{v}^{(\ell+1)}$ given $\mathbf{h}^{(\ell)}$.
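With the conditionals above, one block-Gibbs step is two vectorized Bernoulli draws. A hedged sketch, reusing W, b, c, d and the helpers from the earlier blocks:

```python
def gibbs_step(v, W, b, c, rng):
    """One block-Gibbs step: sample h ~ P(h|v), then v' ~ P(v|h)."""
    h = (rng.random(c.shape) < p_h_given_v(v, W, c)).astype(float)
    v_next = (rng.random(b.shape) < p_v_given_h(h, W, b)).astype(float)
    return v_next, h

# Run a short chain from a random binary start.
rng = np.random.default_rng(1)
v = (rng.random(d) < 0.5).astype(float)
for _ in range(100):
    v, h = gibbs_step(v, W, b, c, rng)
```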
Training Restricted Boltzmann Machines
The log-likelihood is given by:
$$
\begin{aligned}
\ell(W,\mathbf{b},\mathbf{c}) &= \sum_{t=1}^{n}\log P(\mathbf{v}^{(t)}) \\
&= \sum_{t=1}^{n}\log\sum_{\mathbf{h}} P(\mathbf{v}^{(t)},\mathbf{h}) \\
&= \sum_{t=1}^{n}\log\sum_{\mathbf{h}}\exp\{-E(\mathbf{v}^{(t)},\mathbf{h})\} - n\log Z \\
&= \sum_{t=1}^{n}\log\sum_{\mathbf{h}}\exp\{-E(\mathbf{v}^{(t)},\mathbf{h})\} - n\log\sum_{\mathbf{v},\mathbf{h}}\exp\{-E(\mathbf{v},\mathbf{h})\}
\end{aligned}
$$
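For the toy model this expression can be evaluated exactly by enumeration, which is useful for testing; realistic models cannot enumerate h or Z like this. A sketch reusing energy and the configuration lists from the earlier block:

```python
def log_likelihood(V):
    """Exact ell = sum_t log sum_h exp{-E(v^(t),h)} - n log Z, by enumeration."""
    logZ = np.log(sum(np.exp(-energy(v, h))
                      for v in configs_v for h in configs_h))
    ll = sum(np.log(sum(np.exp(-energy(v_t, h)) for h in configs_h))
             for v_t in V)
    return ll - len(V) * logZ
```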
Maximizing the likelihood
With $\theta = \{\mathbf{b},\mathbf{c},W\}$:

$$
\ell(\theta) = \sum_{t=1}^{n}\log\sum_{\mathbf{h}}\exp\{-E(\mathbf{v}^{(t)},\mathbf{h})\} - n\log\sum_{\mathbf{v},\mathbf{h}}\exp\{-E(\mathbf{v},\mathbf{h})\}
$$

$$
\begin{aligned}
\nabla_\theta\,\ell(\theta) &= \nabla_\theta\sum_{t=1}^{n}\log\sum_{\mathbf{h}}\exp\{-E(\mathbf{v}^{(t)},\mathbf{h})\} - n\,\nabla_\theta\log\sum_{\mathbf{v},\mathbf{h}}\exp\{-E(\mathbf{v},\mathbf{h})\} \\
&= \sum_{t=1}^{n}\frac{\sum_{\mathbf{h}}\exp\{-E(\mathbf{v}^{(t)},\mathbf{h})\}\,\nabla_\theta(-E(\mathbf{v}^{(t)},\mathbf{h}))}{\sum_{\mathbf{h}}\exp\{-E(\mathbf{v}^{(t)},\mathbf{h})\}} - n\,\frac{\sum_{\mathbf{v},\mathbf{h}}\exp\{-E(\mathbf{v},\mathbf{h})\}\,\nabla_\theta(-E(\mathbf{v},\mathbf{h}))}{\sum_{\mathbf{v},\mathbf{h}}\exp\{-E(\mathbf{v},\mathbf{h})\}} \\
&= \sum_{t=1}^{n}\mathbb{E}_{P(\mathbf{h}\mid\mathbf{v}^{(t)})}\big[\nabla_\theta(-E(\mathbf{v}^{(t)},\mathbf{h}))\big] - n\,\mathbb{E}_{P(\mathbf{h},\mathbf{v})}\big[\nabla_\theta(-E(\mathbf{v},\mathbf{h}))\big]
\end{aligned}
$$
The gradient of the negative energy function
$$
\begin{aligned}
\nabla_W(-E(\mathbf{v},\mathbf{h})) &= \frac{\partial}{\partial W}\big(\mathbf{b}^T\mathbf{v} + \mathbf{c}^T\mathbf{h} + \mathbf{v}^T W\mathbf{h}\big) = \mathbf{h}\mathbf{v}^T \\
\nabla_{\mathbf{b}}(-E(\mathbf{v},\mathbf{h})) &= \frac{\partial}{\partial\mathbf{b}}\big(\mathbf{b}^T\mathbf{v} + \mathbf{c}^T\mathbf{h} + \mathbf{v}^T W\mathbf{h}\big) = \mathbf{v} \\
\nabla_{\mathbf{c}}(-E(\mathbf{v},\mathbf{h})) &= \frac{\partial}{\partial\mathbf{c}}\big(\mathbf{b}^T\mathbf{v} + \mathbf{c}^T\mathbf{h} + \mathbf{v}^T W\mathbf{h}\big) = \mathbf{h}
\end{aligned}
$$
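These identities are easy to sanity-check numerically against the toy energy function. A hedged sketch: note that with the (visible, hidden) layout used in the toy code the W-gradient is the outer product $\mathbf{v}\mathbf{h}^T$, the same quantity as the slides' $\mathbf{h}\mathbf{v}^T$ transposed for their (hidden, visible) indexing.

```python
def neg_energy_grads(v, h):
    """Closed-form gradients of -E: outer(v,h) for W (our layout), v for b, h for c."""
    return np.outer(v, h), v, h

# Finite-difference check on the W gradient (energy reads the global W).
v, h, eps = configs_v[5], configs_h[1], 1e-6
gW, gb, gc = neg_energy_grads(v, h)
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W[i, j] += eps
        f_plus = -energy(v, h)
        W[i, j] -= 2 * eps
        f_minus = -energy(v, h)
        W[i, j] += eps                       # restore the original weight
        num[i, j] = (f_plus - f_minus) / (2 * eps)
print(np.allclose(gW, num))                  # True: analytic matches numeric
```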
$$
\nabla_\theta\,\ell(\theta) = \sum_{t=1}^{n}\mathbb{E}_{P(\mathbf{h}\mid\mathbf{v}^{(t)})}\big[\nabla_\theta(-E(\mathbf{v}^{(t)},\mathbf{h}))\big] - n\,\mathbb{E}_{P(\mathbf{h},\mathbf{v})}\big[\nabla_\theta(-E(\mathbf{v},\mathbf{h}))\big]
$$

$$
\begin{aligned}
\nabla_W\,\ell(W,\mathbf{b},\mathbf{c}) &= \sum_{t=1}^{n}\hat{\mathbf{h}}^{(t)}\mathbf{v}^{(t)T} - n\,\mathbb{E}_{P(\mathbf{v},\mathbf{h})}[\mathbf{h}\mathbf{v}^T] \\
\nabla_{\mathbf{b}}\,\ell(W,\mathbf{b},\mathbf{c}) &= \sum_{t=1}^{n}\mathbf{v}^{(t)} - n\,\mathbb{E}_{P(\mathbf{v},\mathbf{h})}[\mathbf{v}] \\
\nabla_{\mathbf{c}}\,\ell(W,\mathbf{b},\mathbf{c}) &= \sum_{t=1}^{n}\hat{\mathbf{h}}^{(t)} - n\,\mathbb{E}_{P(\mathbf{v},\mathbf{h})}[\mathbf{h}]
\end{aligned}
$$

where $\hat{\mathbf{h}}^{(t)} = \mathbb{E}_{P(\mathbf{h}\mid\mathbf{v}^{(t)})}[\mathbf{h}] = \operatorname{sigmoid}(\mathbf{c} + \mathbf{v}^{(t)T}W)$.

The first ("positive") term involves only the data and is easy to compute; the second ("negative") term is an expectation under the model's joint distribution, so it is impractical to compute the exact log-likelihood gradient.
Contrastive Divergence
Idea:

1. Replace the expectation by a point estimate at $\tilde{\mathbf{v}}$.
2. Obtain the point $\tilde{\mathbf{v}}$ by Gibbs sampling.
3. Start the sampling chain at $\mathbf{v}^{(t)}$.

$$
\mathbb{E}_{P(\mathbf{h},\mathbf{v})}\big[\nabla_\theta(-E(\mathbf{v},\mathbf{h}))\big] \approx \nabla_\theta(-E(\mathbf{v},\mathbf{h}))\,\Big|_{\mathbf{v}=\tilde{\mathbf{v}},\,\mathbf{h}=\tilde{\mathbf{h}}}
$$
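As a sketch, the negative sample for one training example could be produced like this (reusing gibbs_step and p_h_given_v from the earlier blocks; the function name is hypothetical):

```python
def cd_negative_sample(v0, W, b, c, rng, k=1):
    """Run k Gibbs steps starting from the data point v0; return (v_tilde, h_tilde)."""
    v = v0
    for _ in range(k):
        v, _ = gibbs_step(v, W, b, c, rng)
    # Resample h given the final v so the pair (v_tilde, h_tilde) is consistent.
    h = (rng.random(c.shape) < p_h_given_v(v, W, c)).astype(float)
    return v, h
```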
Contrastive divergence training (CD-k):

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(v; θ) to mix when initialized from p_data (perhaps 1 to 20 to train an RBM on a small image patch).
while not converged do
    Sample a minibatch of m examples {v^(1), ..., v^(m)} from the training set.
    ĥ^(t) ← sigmoid(c + v^(t)T W) for t = 1, ..., m
    ∇_W ← (1/m) ∑_{t=1}^{m} ĥ^(t) v^(t)T
    ∇_b ← (1/m) ∑_{t=1}^{m} v^(t)
    ∇_c ← (1/m) ∑_{t=1}^{m} ĥ^(t)
    for t = 1 to m do
        ṽ^(t) ← v^(t)
    end for
    for ℓ = 1 to k do
        for t = 1 to m do
            h̃^(t) sampled from ∏_{j=1}^{n} sigmoid(c_j + ṽ^(t)T W_:j)
            ṽ^(t) sampled from ∏_{i=1}^{d} sigmoid(b_i + W_i: h̃^(t))
        end for
    end for
    h̃^(t) ← sigmoid(c + ṽ^(t)T W) for t = 1, ..., m
    ∇_W ← ∇_W − (1/m) ∑_{t=1}^{m} h̃^(t) ṽ^(t)T
    ∇_b ← ∇_b − (1/m) ∑_{t=1}^{m} ṽ^(t)
    ∇_c ← ∇_c − (1/m) ∑_{t=1}^{m} h̃^(t)
    W ← W + ε ∇_W
    b ← b + ε ∇_b
    c ← c + ε ∇_c
end while
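A runnable NumPy translation of this loop might look like the following. This is a sketch under my earlier layout assumption (W stored as (visible, hidden), so the positive statistic is v0.T @ h0 rather than h v^T); the hyperparameters and names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd(V, n_hidden, k=1, step=0.05, epochs=10, batch=64, seed=0):
    """Train a Bernoulli RBM with CD-k. V: (num_examples, d) binary array."""
    rng = np.random.default_rng(seed)
    num, d = V.shape
    W = 0.01 * rng.normal(size=(d, n_hidden))  # weights, (visible, hidden)
    b = np.zeros(d)                            # visible biases
    c = np.zeros(n_hidden)                     # hidden biases
    for _ in range(epochs):
        for start in range(0, num, batch):
            v0 = V[start:start + batch]        # minibatch, shape (m, d)
            m = v0.shape[0]
            # Positive phase: hidden probabilities given the data.
            h0 = sigmoid(c + v0 @ W)           # shape (m, n_hidden)
            gW = v0.T @ h0 / m
            gb = v0.mean(axis=0)
            gc = h0.mean(axis=0)
            # Negative phase: k steps of block Gibbs starting from the data.
            v = v0
            for _ in range(k):
                h = (rng.random((m, n_hidden)) < sigmoid(c + v @ W)).astype(float)
                v = (rng.random((m, d)) < sigmoid(b + h @ W.T)).astype(float)
            hk = sigmoid(c + v @ W)            # final h as a mean, not a sample
            gW -= v.T @ hk / m
            gb -= v.mean(axis=0)
            gc -= hk.mean(axis=0)
            # Gradient ascent step on the CD approximation of the log-likelihood.
            W += step * gW
            b += step * gb
            c += step * gc
    return W, b, c
```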
Pseudocode

1. For each training example v^(t):
   i. Generate a negative sample ṽ using k steps of Gibbs sampling, starting at v^(t).
   ii. Update the parameters:

$$
\begin{aligned}
W &\Leftarrow W + \alpha\big(\hat{\mathbf{h}}(\mathbf{v}^{(t)})\,\mathbf{v}^{(t)T} - \hat{\mathbf{h}}(\tilde{\mathbf{v}})\,\tilde{\mathbf{v}}^T\big) \\
\mathbf{b} &\Leftarrow \mathbf{b} + \alpha\big(\mathbf{v}^{(t)} - \tilde{\mathbf{v}}\big) \\
\mathbf{c} &\Leftarrow \mathbf{c} + \alpha\big(\hat{\mathbf{h}}(\mathbf{v}^{(t)}) - \hat{\mathbf{h}}(\tilde{\mathbf{v}})\big)
\end{aligned}
$$

where $\hat{\mathbf{h}}(\mathbf{v}) = \operatorname{sigmoid}(\mathbf{c} + \mathbf{v}^T W)$.

2. Go back to 1 until a stopping criterion is met.
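Putting it together, a hedged usage example with the training sketch above. The data, sizes, and the use of reconstruction error as a stopping signal are my own illustrative choices (reconstruction error is a common heuristic, not a true likelihood measure):

```python
# Toy binary data: 200 noisy copies of two template patterns.
rng = np.random.default_rng(2)
templates = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
V = templates[rng.integers(0, 2, size=200)]
V = np.abs(V - (rng.random(V.shape) < 0.05))   # flip 5% of bits as noise

W, b, c = train_rbm_cd(V, n_hidden=4, k=1, epochs=50)

# Reconstruction error as a rough progress/stopping heuristic.
h = sigmoid(c + V @ W)
recon = sigmoid(b + h @ W.T)
print("mean reconstruction error:", np.mean((V - recon) ** 2))
```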
Example
Samples from the MNIST digit recognition data set. Here, a black pixel corresponds to an input value of 0 and a white pixel
corresponds to 1 (the inputs are scaled between 0 and 1).
Example
The input weights of a random subset of the hidden units. The activation of units of the first hidden layer is obtained by a dot product of such a weight "image" with the input image. In these images, a black pixel corresponds to a weight smaller than −3 and a white pixel to a weight larger than 3, with the different shades of gray corresponding to weight values uniformly between −3 and 3. (Larochelle et al., JMLR 2009)