Contrastive Divergence
Training Products of Experts by Minimizing Contrastive Divergence (Hinton, 2002)
Helmut Puhr
Institute for Theoretical Computer Science, TU Graz
June 9, 2010
Contents
1 Theory
2 Argument
3 Contrastive divergence
4 Applications
5 Summary
Definitions
Training data: $X = \{x_1, \ldots, x_K\}$
Model parameters: $\Theta$

$$p(x \mid \Theta) = \frac{1}{Z(\Theta)}\, f(x \mid \Theta) \tag{1}$$

where

$$Z(\Theta) = \int f(x \mid \Theta)\, dx \tag{2}$$
Estimating model parameters
Find $\Theta$ which maximises the probability of the training data:

$$p(X \mid \Theta) = \prod_{k=1}^{K} \frac{1}{Z(\Theta)}\, f(x_k \mid \Theta) \tag{3}$$

or, equivalently, which minimises $-\frac{1}{K}\log p(X \mid \Theta)$, denoted the energy function:

$$E(X \mid \Theta) = \log Z(\Theta) - \frac{1}{K} \sum_{k=1}^{K} \log f(x_k \mid \Theta) \tag{4}$$
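Expanding the negative log of (3) makes the step to (4) explicit:

$$-\frac{1}{K}\log p(X \mid \Theta) = -\frac{1}{K}\sum_{k=1}^{K}\Bigl[\log f(x_k \mid \Theta) - \log Z(\Theta)\Bigr] = \log Z(\Theta) - \frac{1}{K}\sum_{k=1}^{K}\log f(x_k \mid \Theta)$$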
Model function: Gaussian
Choose the probability model function as the PDF of a normal distribution,

$$f(x \mid \Theta) = \mathcal{N}(x \mid \mu, \sigma) \tag{5}$$

so that $\Theta = \{\mu, \sigma\}$.

Setting $\partial E(X \mid \Theta)/\partial \mu = 0$ → the optimal $\mu$ is the mean of the training data.
Setting $\partial E(X \mid \Theta)/\partial \sigma = 0$ → the optimal $\sigma$ is the standard deviation of the training data.
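A minimal numpy sketch of these closed-form estimates (the sample data is made up for illustration):

```python
import numpy as np

# Hypothetical 1-D training data
X = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 1.3])

# Closed-form maximum-likelihood estimates for a single Gaussian:
# mu* is the sample mean; sigma* is the (1/K-normalised) standard deviation.
mu = X.mean()
sigma = X.std()   # ddof=0 by default, matching the ML solution

print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
```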
Model function: Mixture of Gaussians
Choose the probability model function as a sum of $N$ normal distributions, so that $\Theta = \{\mu_1, \ldots, \mu_N, \sigma_1, \ldots, \sigma_N\}$:

$$f(x \mid \Theta) = \sum_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i) \tag{6}$$

$$\log Z(\Theta) = \log N \tag{7}$$

(each component integrates to 1, so $Z(\Theta) = N$ regardless of the parameters)

$\partial E(X \mid \Theta)/\partial \Theta_i$ depends on the other parameters → no closed-form solution; use expectation maximisation or gradient ascent.
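A minimal sketch of one EM iteration for this equal-weight mixture; the data and initialisation are made up, and the updates are the standard EM responsibility-weighted estimates:

```python
import numpy as np

def em_step(X, mu, sigma):
    """One EM update for an equal-weight 1-D mixture of Gaussians."""
    # E-step: responsibility r[k, i] = posterior that x_k came from component i
    d = (X[:, None] - mu[None, :]) / sigma[None, :]
    pdf = np.exp(-0.5 * d**2) / (sigma[None, :] * np.sqrt(2 * np.pi))
    r = pdf / pdf.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted mean and standard deviation
    w = r.sum(axis=0)
    mu_new = (r * X[:, None]).sum(axis=0) / w
    var_new = (r * (X[:, None] - mu_new[None, :])**2).sum(axis=0) / w
    return mu_new, np.sqrt(var_new)

# Hypothetical two-cluster data and a rough starting point
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
mu, sigma = np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    mu, sigma = em_step(X, mu, sigma)
print(mu, sigma)   # should approach (-2, 2) and (0.5, 0.5)
```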
Model function: Product of Gaussians
Choose the probability model function as a product of $N$ normal distributions:

$$f(x \mid \Theta) = \prod_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i) \tag{8}$$

$Z(\Theta)$ is no longer a constant.
$Z(\Theta) = \int f(x \mid \Theta)\, dx$ is not tractable in general.
Numerical integration of $E(X \mid \Theta)$ is too costly.
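A quick 1-D illustration (cheap here only because the space is one-dimensional): the normaliser of a product of Gaussians changes when the parameters change, unlike the mixture case. The parameter values are made up:

```python
import numpy as np

def unnorm_product(x, mus, sigmas):
    """Unnormalised product of 1-D Gaussian PDFs, f(x|Theta) from (8)."""
    f = np.ones_like(x)
    for mu, s in zip(mus, sigmas):
        f *= np.exp(-0.5 * ((x - mu) / s)**2) / (s * np.sqrt(2 * np.pi))
    return f

x = np.linspace(-10, 10, 20001)
for mus in ([0.0, 0.0], [0.0, 3.0]):          # move one expert's mean
    Z = np.trapz(unnorm_product(x, mus, [1.0, 1.0]), x)
    print(mus, "Z =", Z)
# Z shrinks as the experts disagree: the normaliser depends on Theta.
```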
Why use a Product of Gaussians?
Mixture models:
very inefficient in high-dimensional spaces
the posterior distribution cannot be sharper than the individual models
so the individual models have to be broadly tuned

Product of Gaussians:
cannot approximate arbitrary smooth distributions
but if each individual model contains at least one latent variable → an expert that imposes a constraint
derivatives are hard to calculate
Contrastive divergence
Minimize an energy function that cannot be evaluated directly.
Use CD to estimate the gradient of the energy function.
Find a minimum by taking small steps in the direction of steepest descent.
CD: Energy function
$$\frac{\partial E(X \mid \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \frac{1}{K} \sum_{i=1}^{K} \frac{\partial \log f(x_i \mid \Theta)}{\partial \Theta} \tag{9}$$

$$= \frac{\partial \log Z(\Theta)}{\partial \Theta} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_X \tag{10}$$

where $\langle \cdot \rangle_X$ denotes the expectation of $\cdot$ over the data $X$.
CD: Derivation of $\log Z(\Theta)$

$$\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial Z(\Theta)}{\partial \Theta} \tag{11}$$

$$= \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta)\, dx \tag{12}$$

$$\vdots$$

$$= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)} \tag{13}$$
(full derivation: see Proof 1 in the appendix)
CD: Sampling
Approximate the derivative of $\log Z(\Theta)$ by drawing samples from $p(x \mid \Theta)$.
These cannot be drawn directly ($Z(\Theta)$ is unknown), so use MCMC sampling instead (e.g. Gibbs sampling).
Gibbs Sampling
a special case of the Metropolis–Hastings algorithm
it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution
e.g. to sample from $p(x, y)$, start with $y_0$ and iterate for $i = 1, 2, \ldots$:

$$x_i \sim p(x \mid y = y_{i-1})$$
$$y_i \sim p(y \mid x = x_i)$$
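A minimal sketch of this scheme on a toy joint distribution, a standard bivariate Gaussian with correlation $\rho$ (an illustrative choice, not from the slides, picked because both conditionals are known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8              # correlation of the (hypothetical) bivariate Gaussian
n_samples = 5000

# For a standard bivariate Gaussian with correlation rho:
#   x | y ~ N(rho * y, 1 - rho^2),  y | x ~ N(rho * x, 1 - rho^2)
x, y = 0.0, 0.0        # arbitrary initialisation (y_0 = 0)
samples = []
for _ in range(n_samples):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # x_i ~ p(x | y = y_{i-1})
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # y_i ~ p(y | x = x_i)
    samples.append((x, y))

samples = np.array(samples)
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])  # close to 0.8
```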
CD: MCMC Sampling
use many cycles of MCMC sampling to transform the training data (drawn from the target distribution) into data drawn from the proposed distribution
given the data, all hidden states can be updated in parallel (conditional independence)
CD: Gibbs sampling
time 0: all hidden variables are updated with samples from their posterior distribution given the visible variables
time 1: the visible variables are updated to produce a reconstruction of the original data vector, then all hidden variables are updated again
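For a binary RBM (an instance of this scheme), the two steps look as follows; the network size, weights, and data vector are made-up placeholders, and the sigmoid conditionals are the standard RBM ones:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical small RBM: 6 visible units, 4 hidden units
W = rng.normal(0, 0.1, (6, 4))             # visible-hidden weights
b = np.zeros(6)                            # visible biases
c = np.zeros(4)                            # hidden biases
v0 = rng.integers(0, 2, 6).astype(float)   # a (fake) data vector

# time 0: sample all hidden units in parallel from p(h | v0)
h0 = (rng.random(4) < sigmoid(v0 @ W + c)).astype(float)

# time 1: sample a reconstruction v1 from p(v | h0),
#         then resample all hidden units in parallel from p(h | v1)
v1 = (rng.random(6) < sigmoid(h0 @ W.T + b)).astype(float)
h1 = (rng.random(4) < sigmoid(v1 @ W + c)).astype(float)
```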
CD: Energy function 2
$X^n$ ... training data transformed by $n$ cycles of MCMC, such that $X^0 \equiv X$.

Substituting leads to

$$\frac{\partial E(X \mid \Theta)}{\partial \Theta} = \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^\infty} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0} \tag{14}$$
CD: MCMC Sampling length
many cycles of MCMC sampling are still too costly
Hinton's "intuition": a few MCMC cycles ought to be enough, since after a few iterations the data has already moved from the target distribution towards the proposed distribution
empirically, one cycle suffices
CD-1
To minimize the energy function, update

$$\Theta_{t+1} = \Theta_t + \eta \left( \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^1} \right) \tag{15}$$

with step size $\eta$ (chosen experimentally).
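As a concrete instance, a sketch of one CD-1 update for a binary RBM, where $\partial \log f / \partial W$ reduces to the familiar $v h^\top$ statistics; the shapes and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, eta=0.1):
    """One CD-1 step for a binary RBM; returns updated (W, b, c)."""
    # Positive phase: hidden probabilities given the data, <v h>_{X^0}
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs cycle: reconstruction and its hidden probabilities, <v h>_{X^1}
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Eq. (15): step along the difference of the two expectations
    # (probabilities rather than samples are used in the statistics,
    #  a common variance-reduction choice)
    W = W + eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b = b + eta * (v0 - v1)
    c = c + eta * (ph0 - ph1)
    return W, b, c

# Hypothetical usage with one 6-pixel "data vector"
W, b, c = rng.normal(0, 0.1, (6, 4)), np.zeros(6), np.zeros(4)
v0 = rng.integers(0, 2, 6).astype(float)
W, b, c = cd1_update(v0, W, b, c)
```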
Simple example
15 "unigauss" experts (each expert mixes a Gaussian with a uniform); one CD-1 step works as follows (see the sketch after this list):
given the data, calculate each expert's posterior probability of selecting its Gaussian rather than the uniform, and calculate $\langle \cdot \rangle_{X^0}$
stochastically select the Gaussian or the uniform for each expert according to that posterior; compute the normalised product of the selected Gaussians and sample from it to get a reconstructed vector in data space
calculate $\langle \cdot \rangle_{X^1}$
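A sketch of the reconstruction step above for unigauss experts, in 1-D for simplicity (the slides' example is 2-D); the mixing weight and the uniform's support and density are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameters: 15 unigauss experts on a 1-D data space
N = 15
mu = rng.normal(0.0, 2.0, N)     # expert means
sigma = np.ones(N)               # expert standard deviations
mix = 0.5                        # assumed prior weight of each Gaussian
lo, hi = -10.0, 10.0             # assumed support of the uniform component
u0 = 1.0 / (hi - lo)             # its density

def reconstruct(x):
    """One CD-1 reconstruction for a single data point x."""
    # Posterior probability that each expert's Gaussian generated x
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    post = mix * g / (mix * g + (1 - mix) * u0)
    # Stochastically select Gaussian (True) or uniform (False) per expert
    pick = rng.random(N) < post
    if not pick.any():                     # all experts chose the uniform
        return rng.uniform(lo, hi)
    # The normalised product of the selected Gaussians is itself a Gaussian:
    # precisions add, and the mean is the precision-weighted average.
    tau = (1.0 / sigma[pick] ** 2).sum()
    m = (mu[pick] / sigma[pick] ** 2).sum() / tau
    return rng.normal(m, 1.0 / np.sqrt(tau))

print(reconstruct(0.7))
```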
Simple example 2
[Figure: each dot is a data point, fitted with 15 uni-gauss experts; the ellipses show the $\sigma$ of each expert]
RBM with CD: digits
1 500 hidden units
2 16×16 visible units
3 pixel intensities in [0, 1]
4 8000 examples
5 weights of 100 units shown
6 almost perfect reconstructions
Summary
PoE may lead to a better approximation than mixture models
the learning gradient is intractable to calculate; it is estimated via CD
the procedure is similar to RBM learning
Thank you for your attention!
"Training Products of Experts by Minimizing Contrastive Divergence" by Geoffrey E. Hinton, 2002
"Notes on Contrastive Divergence" by Oliver Woodford
Proof 1
$$\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta)\, dx \tag{16}$$

$$= \frac{1}{Z(\Theta)} \int \frac{\partial f(x \mid \Theta)}{\partial \Theta}\, dx \tag{17}$$

$$= \frac{1}{Z(\Theta)} \int f(x \mid \Theta)\, \frac{\partial \log f(x \mid \Theta)}{\partial \Theta}\, dx \tag{18}$$

$$= \int p(x \mid \Theta)\, \frac{\partial \log f(x \mid \Theta)}{\partial \Theta}\, dx \tag{19}$$

$$= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)} \tag{20}$$

Step (17) interchanges differentiation and integration; step (18) uses the log-derivative identity $\partial f / \partial \Theta = f \cdot \partial \log f / \partial \Theta$; step (19) uses $p(x \mid \Theta) = f(x \mid \Theta)/Z(\Theta)$ from (1).
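A quick numerical sanity check of (16)–(20), using a toy unnormalised model $f(x \mid \theta) = \exp(-x^4 + \theta x^2)$ (an illustrative choice, not from the slides); the two sides should agree:

```python
import numpy as np

# Toy unnormalised model: f(x|theta) = exp(-x^4 + theta * x^2)
f = lambda x, th: np.exp(-x**4 + th * x**2)

x = np.linspace(-5, 5, 200001)
theta, eps = 0.7, 1e-5

# Left side: d log Z / d theta, by central finite differences
Z = lambda th: np.trapz(f(x, th), x)
lhs = (np.log(Z(theta + eps)) - np.log(Z(theta - eps))) / (2 * eps)

# Right side: < d log f / d theta >_{p(x|theta)}; here d log f/d theta = x^2
p = f(x, theta) / Z(theta)
rhs = np.trapz(p * x**2, x)

print(lhs, rhs)   # should agree to several decimal places
```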