Contrastive Divergence
Training Products of Experts by Minimizing Contrastive Divergence (Hinton, 2002)
Helmut Puhr
Institute for Theoretical Computer Science, TU Graz
June 9, 2010
Contents
1 Theory
2 Argument
3 Contrastive divergence
4 Applications
5 Summary
Definitions
Training data: $X = \{x_1, \ldots, x_K\}$
Model parameters: $\Theta$

$$p(x \mid \Theta) = \frac{1}{Z(\Theta)}\, f(x \mid \Theta) \tag{1}$$

where

$$Z(\Theta) = \int f(x \mid \Theta)\, dx \tag{2}$$
Estimating model parameters
Find $\Theta$ which maximises the probability of the training data:

$$p(X \mid \Theta) = \prod_{k=1}^{K} \frac{1}{Z(\Theta)}\, f(x_k \mid \Theta) \tag{3}$$

or, equivalently, which minimises $-\frac{1}{K}\log p(X \mid \Theta)$, denoted the energy function:

$$E(X \mid \Theta) = \log Z(\Theta) - \frac{1}{K} \sum_{k=1}^{K} \log f(x_k \mid \Theta) \tag{4}$$
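Expanding the negative log of (3) makes the step to (4) explicit:

$$-\frac{1}{K}\log p(X \mid \Theta) = -\frac{1}{K}\sum_{k=1}^{K}\Bigl[\log f(x_k \mid \Theta) - \log Z(\Theta)\Bigr] = \log Z(\Theta) - \frac{1}{K}\sum_{k=1}^{K}\log f(x_k \mid \Theta)$$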
Model function: Gaussian
Choose the probability model function as the PDF of a normal distribution,

$$f(x \mid \Theta) = \mathcal{N}(x \mid \mu, \sigma) \tag{5}$$

so that $\Theta = \{\mu, \sigma\}$.

Setting $\partial E(X \mid \Theta)/\partial \mu = 0$ → the optimal $\mu$ is the mean of the training data.
Setting $\partial E(X \mid \Theta)/\partial \sigma = 0$ → the optimal $\sigma$ is the standard deviation of the training data.
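A minimal numpy sketch of these closed-form estimates (the sample data is made up for illustration):

```python
import numpy as np

# Hypothetical 1-D training data
X = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 1.3])

# Closed-form maximum-likelihood estimates for a single Gaussian:
# mu* is the sample mean; sigma* is the (1/K-normalised) standard deviation.
mu = X.mean()
sigma = X.std()   # ddof=0 by default, matching the ML solution

print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
```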
Model function: Mixture of Gaussians
Choose the probability model function as a sum of $N$ normal distributions, so that $\Theta = \{\mu_1, \ldots, \mu_N, \sigma_1, \ldots, \sigma_N\}$:

$$f(x \mid \Theta) = \sum_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i) \tag{6}$$

$$\log Z(\Theta) = \log N \tag{7}$$

(each component integrates to 1, so $Z(\Theta) = N$ regardless of the parameters)

$\partial E(X \mid \Theta)/\partial \Theta_i$ depends on the other parameters → no closed-form solution; use expectation maximisation or gradient ascent.
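A minimal sketch of one EM iteration for this equal-weight mixture; the data and initialisation are made up, and the updates are the standard EM responsibility-weighted estimates:

```python
import numpy as np

def em_step(X, mu, sigma):
    """One EM update for an equal-weight 1-D mixture of Gaussians."""
    # E-step: responsibility r[k, i] = posterior that x_k came from component i
    d = (X[:, None] - mu[None, :]) / sigma[None, :]
    pdf = np.exp(-0.5 * d**2) / (sigma[None, :] * np.sqrt(2 * np.pi))
    r = pdf / pdf.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted mean and standard deviation
    w = r.sum(axis=0)
    mu_new = (r * X[:, None]).sum(axis=0) / w
    var_new = (r * (X[:, None] - mu_new[None, :])**2).sum(axis=0) / w
    return mu_new, np.sqrt(var_new)

# Hypothetical two-cluster data and a rough starting point
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
mu, sigma = np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    mu, sigma = em_step(X, mu, sigma)
print(mu, sigma)   # should approach (-2, 2) and (0.5, 0.5)
```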
Model function: Product of Gaussians
Choose the probability model function as a product of $N$ normal distributions:

$$f(x \mid \Theta) = \prod_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i) \tag{8}$$

$Z(\Theta)$ is no longer a constant.
$Z(\Theta) = \int f(x \mid \Theta)\, dx$ is not tractable in general.
Numerical integration of $E(X \mid \Theta)$ is too costly.
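A quick 1-D illustration (cheap here only because the space is one-dimensional): the normaliser of a product of Gaussians changes when the parameters change, unlike the mixture case. The parameter values are made up:

```python
import numpy as np

def unnorm_product(x, mus, sigmas):
    """Unnormalised product of 1-D Gaussian PDFs, f(x|Theta) from (8)."""
    f = np.ones_like(x)
    for mu, s in zip(mus, sigmas):
        f *= np.exp(-0.5 * ((x - mu) / s)**2) / (s * np.sqrt(2 * np.pi))
    return f

x = np.linspace(-10, 10, 20001)
for mus in ([0.0, 0.0], [0.0, 3.0]):          # move one expert's mean
    Z = np.trapz(unnorm_product(x, mus, [1.0, 1.0]), x)
    print(mus, "Z =", Z)
# Z shrinks as the experts disagree: the normaliser depends on Theta.
```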
Why use a Product of Gaussians?
Mixture models:
very inefficient in high-dimensional spaces
the posterior distribution cannot be sharper than the individual models
so the individual models have to be broadly tuned

Product of Gaussians:
cannot approximate arbitrary smooth distributions
but if each individual model contains at least one latent variable → an expert that imposes a constraint
derivatives are hard to calculate
Contrastive divergence
Minimize an energy function that cannot be evaluated directly.
Use CD to estimate the gradient of the energy function.
Find a minimum by taking small steps in the direction of steepest descent.
CD: Energy function
$$\frac{\partial E(X \mid \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \frac{1}{K} \sum_{i=1}^{K} \frac{\partial \log f(x_i \mid \Theta)}{\partial \Theta} \tag{9}$$

$$= \frac{\partial \log Z(\Theta)}{\partial \Theta} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_X \tag{10}$$

where $\langle \cdot \rangle_X$ denotes the expectation of $\cdot$ over the data $X$.
CD: Derivation of $\log Z(\Theta)$

$$\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial Z(\Theta)}{\partial \Theta} \tag{11}$$

$$= \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta)\, dx \tag{12}$$

$$\vdots$$

$$= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)} \tag{13}$$
(full derivation: see Proof 1 in the appendix)
CD: Sampling
Approximate the derivative of $\log Z(\Theta)$ by drawing samples from $p(x \mid \Theta)$.
These cannot be drawn directly ($Z(\Theta)$ is unknown), so use MCMC sampling instead (e.g. Gibbs sampling).
Gibbs Sampling
a special case of the Metropolis–Hastings algorithm
it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution
e.g. to sample from $p(x, y)$, start with $y_0$ and iterate for $i = 1, 2, \ldots$:

$$x_i \sim p(x \mid y = y_{i-1})$$
$$y_i \sim p(y \mid x = x_i)$$
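A minimal sketch of this scheme on a toy joint distribution, a standard bivariate Gaussian with correlation $\rho$ (an illustrative choice, not from the slides, picked because both conditionals are known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8              # correlation of the (hypothetical) bivariate Gaussian
n_samples = 5000

# For a standard bivariate Gaussian with correlation rho:
#   x | y ~ N(rho * y, 1 - rho^2),  y | x ~ N(rho * x, 1 - rho^2)
x, y = 0.0, 0.0        # arbitrary initialisation (y_0 = 0)
samples = []
for _ in range(n_samples):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # x_i ~ p(x | y = y_{i-1})
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # y_i ~ p(y | x = x_i)
    samples.append((x, y))

samples = np.array(samples)
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])  # close to 0.8
```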
CD: MCMC Sampling
use many cycles of MCMC sampling to transform the training data (drawn from the target distribution) into data drawn from the proposed distribution
given the data, all hidden states can be updated in parallel (conditional independence)
CD: Gibbs sampling
time 0: all hidden variables are updated with samples from their posterior distribution given the visible variables
time 1: the visible variables are updated to produce a reconstruction of the original data vector, then all hidden variables are updated again
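For a binary RBM (an instance of this scheme), the two steps look as follows; the network size, weights, and data vector are made-up placeholders, and the sigmoid conditionals are the standard RBM ones:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical small RBM: 6 visible units, 4 hidden units
W = rng.normal(0, 0.1, (6, 4))             # visible-hidden weights
b = np.zeros(6)                            # visible biases
c = np.zeros(4)                            # hidden biases
v0 = rng.integers(0, 2, 6).astype(float)   # a (fake) data vector

# time 0: sample all hidden units in parallel from p(h | v0)
h0 = (rng.random(4) < sigmoid(v0 @ W + c)).astype(float)

# time 1: sample a reconstruction v1 from p(v | h0),
#         then resample all hidden units in parallel from p(h | v1)
v1 = (rng.random(6) < sigmoid(h0 @ W.T + b)).astype(float)
h1 = (rng.random(4) < sigmoid(v1 @ W + c)).astype(float)
```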
CD: Energy function 2
$X^n$ ... training data transformed by $n$ cycles of MCMC, such that $X^0 \equiv X$.

Substituting leads to

$$\frac{\partial E(X \mid \Theta)}{\partial \Theta} = \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^\infty} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0} \tag{14}$$
CD: MCMC Sampling length
many cycles of MCMC sampling are still too costly
Hinton's "intuition": a few MCMC cycles ought to be enough, since after a few iterations the data has already moved from the target distribution towards the proposed distribution
empirically, one cycle suffices
CD-1
To minimize the energy function, update

$$\Theta_{t+1} = \Theta_t + \eta \left( \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^0} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X^1} \right) \tag{15}$$

with step size $\eta$ (chosen experimentally).
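As a concrete instance, a sketch of one CD-1 update for a binary RBM, where $\partial \log f / \partial W$ reduces to the familiar $v h^\top$ statistics; the shapes and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, eta=0.1):
    """One CD-1 step for a binary RBM; returns updated (W, b, c)."""
    # Positive phase: hidden probabilities given the data, <v h>_{X^0}
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs cycle: reconstruction and its hidden probabilities, <v h>_{X^1}
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Eq. (15): step along the difference of the two expectations
    # (probabilities rather than samples are used in the statistics,
    #  a common variance-reduction choice)
    W = W + eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b = b + eta * (v0 - v1)
    c = c + eta * (ph0 - ph1)
    return W, b, c

# Hypothetical usage with one 6-pixel "data vector"
W, b, c = rng.normal(0, 0.1, (6, 4)), np.zeros(6), np.zeros(4)
v0 = rng.integers(0, 2, 6).astype(float)
W, b, c = cd1_update(v0, W, b, c)
```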
Simple example
15 "unigauss" experts (each expert mixes a Gaussian with a uniform); one CD-1 step works as follows (see the sketch after this list):
given the data, calculate each expert's posterior probability of selecting its Gaussian rather than the uniform, and calculate $\langle \cdot \rangle_{X^0}$
stochastically select the Gaussian or the uniform for each expert according to that posterior; compute the normalised product of the selected Gaussians and sample from it to get a reconstructed vector in data space
calculate $\langle \cdot \rangle_{X^1}$
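A sketch of the reconstruction step above for unigauss experts, in 1-D for simplicity (the slides' example is 2-D); the mixing weight and the uniform's support and density are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameters: 15 unigauss experts on a 1-D data space
N = 15
mu = rng.normal(0.0, 2.0, N)     # expert means
sigma = np.ones(N)               # expert standard deviations
mix = 0.5                        # assumed prior weight of each Gaussian
lo, hi = -10.0, 10.0             # assumed support of the uniform component
u0 = 1.0 / (hi - lo)             # its density

def reconstruct(x):
    """One CD-1 reconstruction for a single data point x."""
    # Posterior probability that each expert's Gaussian generated x
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    post = mix * g / (mix * g + (1 - mix) * u0)
    # Stochastically select Gaussian (True) or uniform (False) per expert
    pick = rng.random(N) < post
    if not pick.any():                     # all experts chose the uniform
        return rng.uniform(lo, hi)
    # The normalised product of the selected Gaussians is itself a Gaussian:
    # precisions add, and the mean is the precision-weighted average.
    tau = (1.0 / sigma[pick] ** 2).sum()
    m = (mu[pick] / sigma[pick] ** 2).sum() / tau
    return rng.normal(m, 1.0 / np.sqrt(tau))

print(reconstruct(0.7))
```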
Simple example 2
[Figure: each dot is a data point, fitted with 15 uni-gauss experts; the ellipses show the $\sigma$ of each expert]
RBM with CD: digits
1 500 hidden units
2 16×16 visible units
3 pixel intensities in [0, 1]
4 8000 examples
5 weights of 100 units shown
6 almost perfect reconstructions
Summary
PoE may lead to a better approximation than mixture models
the learning gradient is intractable to calculate; it is estimated via CD
the procedure is similar to RBM learning
Thank you for your attention!
"Training Products of Experts by Minimizing Contrastive Divergence" by Geoffrey E. Hinton, 2002
"Notes on Contrastive Divergence" by Oliver Woodford
Proof 1
$$\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta)\, dx \tag{16}$$

$$= \frac{1}{Z(\Theta)} \int \frac{\partial f(x \mid \Theta)}{\partial \Theta}\, dx \tag{17}$$

$$= \frac{1}{Z(\Theta)} \int f(x \mid \Theta)\, \frac{\partial \log f(x \mid \Theta)}{\partial \Theta}\, dx \tag{18}$$

$$= \int p(x \mid \Theta)\, \frac{\partial \log f(x \mid \Theta)}{\partial \Theta}\, dx \tag{19}$$

$$= \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)} \tag{20}$$

Step (17) interchanges differentiation and integration; step (18) uses the log-derivative identity $\partial f / \partial \Theta = f \cdot \partial \log f / \partial \Theta$; step (19) uses $p(x \mid \Theta) = f(x \mid \Theta)/Z(\Theta)$ from (1).
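A quick numerical sanity check of (16)–(20), using a toy unnormalised model $f(x \mid \theta) = \exp(-x^4 + \theta x^2)$ (an illustrative choice, not from the slides); the two sides should agree:

```python
import numpy as np

# Toy unnormalised model: f(x|theta) = exp(-x^4 + theta * x^2)
f = lambda x, th: np.exp(-x**4 + th * x**2)

x = np.linspace(-5, 5, 200001)
theta, eps = 0.7, 1e-5

# Left side: d log Z / d theta, by central finite differences
Z = lambda th: np.trapz(f(x, th), x)
lhs = (np.log(Z(theta + eps)) - np.log(Z(theta - eps))) / (2 * eps)

# Right side: < d log f / d theta >_{p(x|theta)}; here d log f/d theta = x^2
p = f(x, theta) / Z(theta)
rhs = np.trapz(p * x**2, x)

print(lhs, rhs)   # should agree to several decimal places
```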