
Contrastive Divergence: Training Products of Experts by Minimizing CD

Hinton, 2002

Helmut Puhr

Institute for Theoretical Computer Science, TU Graz

June 9, 2010

Page 2: Contrastive Divergence - Training Products of … · Contrastive Divergence Training Products of Experts by Minimizing CD Hinton, 2002 Helmut Puhr Institute for Theoretical Computer

Theory Argument Contrastive divergence Applications Summary

Contents

1 Theory

2 Argument

3 Contrastive divergence

4 Applications

5 Summary

Page 3: Contrastive Divergence - Training Products of … · Contrastive Divergence Training Products of Experts by Minimizing CD Hinton, 2002 Helmut Puhr Institute for Theoretical Computer

Theory Argument Contrastive divergence Applications Summary

Definitions

Training data: X = {x_1, ..., x_K}
Model parameters: Θ

\[ p(x \mid \Theta) = \frac{1}{Z(\Theta)} f(x \mid \Theta) \tag{1} \]

where

\[ Z(\Theta) = \int f(x \mid \Theta) \, dx \tag{2} \]


Estimating model parameters

Find the Θ which maximises the probability of the training data,

\[ p(X \mid \Theta) = \prod_{k=1}^{K} \frac{1}{Z(\Theta)} f(x_k \mid \Theta) \tag{3} \]

or, equivalently, which minimises the scaled negative log-likelihood −(1/K) log p(X|Θ), called the energy function:

\[ E(X \mid \Theta) = \log Z(\Theta) - \frac{1}{K} \sum_{k=1}^{K} \log f(x_k \mid \Theta) \tag{4} \]
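Spelling out the step from (3) to (4): taking −log of (3) and scaling by 1/K (which does not move the minimum) gives

\[ -\frac{1}{K} \log p(X \mid \Theta) = -\frac{1}{K} \sum_{k=1}^{K} \left[ \log f(x_k \mid \Theta) - \log Z(\Theta) \right] = \log Z(\Theta) - \frac{1}{K} \sum_{k=1}^{K} \log f(x_k \mid \Theta), \]

which is exactly (4).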


Model function: Gaussian

Choose the probability model function as the PDF of a normal distribution,

\[ f(x \mid \Theta) = \mathcal{N}(x \mid \mu, \sigma) \tag{5} \]

so that Θ = {µ, σ}.

Setting ∂E(X|Θ)/∂µ = 0 → the optimal µ is the mean of the training data.
Setting ∂E(X|Θ)/∂σ = 0 → the optimal σ is the standard deviation of the training data.
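A minimal numerical check of this claim (a sketch, not from the slides; the synthetic data and step size are my own choices): minimise the energy by gradient descent and compare with the sample mean and standard deviation.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic training data

# For a single Gaussian, Z is tractable, so E(X|Theta) is just the mean
# negative log-likelihood; gradient-descend it directly.
mu, log_sigma = 0.0, 0.0   # parametrise sigma = exp(log_sigma) > 0
eta = 0.1                  # step size, chosen experimentally
for _ in range(2000):
    sigma = np.exp(log_sigma)
    d_mu = -np.mean(x - mu) / sigma**2                      # dE/d mu
    d_log_sigma = 1.0 - np.mean((x - mu) ** 2) / sigma**2   # dE/d log sigma
    mu -= eta * d_mu
    log_sigma -= eta * d_log_sigma

print(mu, np.exp(log_sigma))   # converges to ~2.0, ~1.5
print(x.mean(), x.std())       # the closed-form optimum from the slide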


Model function: Mixture of Gaussians

Choose the probability model function as a sum of N normal distributions, so that Θ = {µ_1, ..., µ_N, σ_1, ..., σ_N}:

\[ f(x \mid \Theta) = \sum_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i) \tag{6} \]

\[ \log Z(\Theta) = \log N \tag{7} \]

(each component integrates to 1, so Z(Θ) = N, a constant)

∂E(X|Θ)/∂Θ_i depends on the other parameters → use expectation maximisation or gradient ascent.
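A small sketch of this model in Python (the parameter values are made up): because each component density integrates to 1, Z(Θ) = N regardless of Θ, which is why log Z is the constant log N in (7).

import numpy as np
from scipy.stats import norm

mus = np.array([-2.0, 0.0, 3.0])    # hypothetical component means
sigmas = np.array([1.0, 0.5, 2.0])  # hypothetical component sigmas
N = len(mus)

def f(x):
    # unnormalised model function (6): a sum of Gaussian densities;
    # x is a 1-D numpy array of data points
    return norm.pdf(x[:, None], mus, sigmas).sum(axis=1)

def energy(x):
    # E(X|Theta) from (4), with log Z(Theta) = log N as in (7)
    return np.log(N) - np.mean(np.log(f(x)))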


Model function: Product of Gaussians

Choose the probability model function as a product of N normal distributions:

\[ f(x \mid \Theta) = \prod_{i=1}^{N} \mathcal{N}(x \mid \mu_i, \sigma_i) \tag{8} \]

Z(Θ) is no longer a constant: Z(Θ) = ∫ f(x|Θ) dx is not tractable, and numerical integration of E(X|Θ) is too costly.
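To see that Z(Θ) now varies with the parameters, here is a one-dimensional toy check (illustrative only, with made-up parameters: in 1-D a product of Gaussians is still Gaussian up to scale, so Z happens to be computable here; it is the same integral over a high-dimensional data space that becomes intractable).

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def f(x, mus, sigmas):
    # unnormalised product of experts (8)
    return np.prod([norm.pdf(x, m, s) for m, s in zip(mus, sigmas)])

# The normaliser changes when Theta changes:
Z1, _ = quad(lambda x: f(x, [0.0, 1.0], [1.0, 1.0]), -np.inf, np.inf)
Z2, _ = quad(lambda x: f(x, [0.0, 3.0], [1.0, 1.0]), -np.inf, np.inf)
print(Z1, Z2)   # two different values: Z is a function of Theta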


Why use a Product of Gaussians?

Mixture models

are very inefficient in high-dimensional spaces

the posterior distribution cannot be sharper than the individual models

so the individual models have to be broadly tuned

Product of Gaussians

cannot approximate arbitrary smooth distributions on its own

but if each individual model contains at least one latent variable, it becomes an expert that poses a constraint

derivatives are hard to calculate

Page 9: Contrastive Divergence - Training Products of … · Contrastive Divergence Training Products of Experts by Minimizing CD Hinton, 2002 Helmut Puhr Institute for Theoretical Computer

Theory Argument Contrastive divergence Applications Summary

Contrastive divergence

Minimise an energy function that cannot be evaluated directly.

Use CD to estimate the gradient of the energy function.

Find a minimum by taking small steps along the direction of steepest descent.


CD: Energy function

\[ \frac{\partial E(X \mid \Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \frac{1}{K} \sum_{i=1}^{K} \frac{\partial \log f(x_i \mid \Theta)}{\partial \Theta} \tag{9} \]

\[ = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X} \tag{10} \]

where ⟨·⟩_X denotes the expectation of · over the data X.
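The second term of (10) is just an average of per-example gradients. A sketch using the Gaussian f from (5) for concreteness, where ∂ log f/∂µ = (x − µ)/σ² (my choice of expert, not the slide's):

import numpy as np

def data_term(x, mu, sigma):
    # < d log f / d mu >_X : average the per-example gradient over the data
    return np.mean((x - mu) / sigma**2)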


CD: Derivation of log Z(Θ)

\[ \frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial Z(\Theta)}{\partial \Theta} \tag{11} \]

\[ = \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta) \, dx \tag{12} \]

...

\[ = \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)} \tag{13} \]

(the intermediate steps are given in Proof 1)


CD: Sampling

Approximate the derivative of log Z(Θ) by drawing samples from p(x|Θ). They cannot be drawn directly, since Z(Θ) is unknown, so use MCMC sampling (e.g. Gibbs) instead.


Gibbs Sampling

a special case of the Metropolis-Hastings algorithm

it is simpler to sample from a conditional distribution than to marginalise by integrating over a joint distribution

e.g. to sample from p(x, y), start with y_0, i = 1, and alternate

x_i ∼ p(x | y = y_{i−1})

y_i ∼ p(y | x = x_i)
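A sketch of this alternating scheme for a correlated bivariate Gaussian, whose conditionals are known in closed form (the target distribution and its parameters are my choice, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8          # correlation of the target p(x, y), standard margins
x, y = 0.0, 0.0    # arbitrary start (y_0 in the slide's notation)
samples = []
for _ in range(10_000):
    # x_i ~ p(x | y = y_{i-1}), a Gaussian with mean rho*y, variance 1 - rho^2
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    # y_i ~ p(y | x = x_i), same form by symmetry
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

# after burn-in the draws follow the joint: empirical correlation ~ rho
print(np.corrcoef(np.array(samples[1000:]).T)[0, 1])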


CD: MCMC Sampling

use many cycles of MCMC sampling to transform the training data (drawn from the target distribution) into data drawn from the proposed distribution

given the data, all hidden states can be updated in parallel (conditional independence)


CD: Gibbs sampling

time 0: all hidden variables are updated with samples from their posterior distribution given the visible variables

time 1: the visible variables are updated to produce a reconstruction of the original data vector, then all hidden variables are updated again
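A sketch of these two updates for a binary RBM with logistic units, the standard setting in Hinton (2002); the array shapes and names here are my own. Given the visibles, the hiddens are conditionally independent, so each step samples all of them in parallel.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, b_h):
    # time 0: sample every hidden unit from its posterior given the visibles
    p = sigmoid(v @ W + b_h)                           # (batch, n_hid)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b_v):
    # time 1: resample the visibles (the reconstruction); the hiddens are
    # then resampled from this reconstruction with sample_h_given_v
    p = sigmoid(h @ W.T + b_v)                         # (batch, n_vis)
    return (rng.random(p.shape) < p).astype(float), p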


CD: Energy function 2

Let X_n denote the training data transformed by n cycles of MCMC, such that X_0 ≡ X. Substituting into (10) leads to

\[ \frac{\partial E(X \mid \Theta)}{\partial \Theta} = \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X_\infty} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X_0} \tag{14} \]


CD: MCMC Sampling length

many cycles of MCMC sampling are still too costly

Hinton's "intuition": a few MCMC cycles ought to be enough, since after a few iterations the data has already moved from the target distribution towards the proposed distribution

empirically, one cycle suffices


CD-1

To minimise the energy function, update

\[ \Theta_{t+1} = \Theta_t + \eta \left( \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X_0} - \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{X_1} \right) \tag{15} \]

with step size η (chosen experimentally).
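Combining (15) with the samplers from the previous sketch gives a CD-1 weight update for a binary RBM, where ⟨∂ log f/∂W⟩ reduces to visible-hidden correlations. This RBM-specific form is standard but not spelled out on the slide; bias updates are omitted for brevity.

def cd1_update(v0, W, b_v, b_h, eta=0.1):
    # positive phase: statistics under the data, <v h>_X0
    h0, p_h0 = sample_h_given_v(v0, W, b_h)
    # one Gibbs cycle: X0 -> X1 (the reconstruction)
    v1, _ = sample_v_given_h(h0, W, b_v)
    _, p_h1 = sample_h_given_v(v1, W, b_h)
    # negative phase: statistics under the reconstruction, <v h>_X1
    pos = v0.T @ p_h0 / len(v0)
    neg = v1.T @ p_h1 / len(v0)
    return W + eta * (pos - neg)   # the update rule (15)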


Simple example

15 "unigauss" experts (each expert mixes a Gaussian with a uniform)

given the data, calculate the posterior probability of selecting the Gaussian or the uniform, and compute ⟨·⟩_{X_0}

stochastically select the Gaussian or the uniform according to the posterior; compute the normalised product of the selected Gaussians and sample from it to get a reconstructed vector in data space

calculate ⟨·⟩_{X_1}


Simple example 2

[Figure: each dot is a data point; the data is fitted with 15 unigauss experts; the ellipses show the σ of each expert's Gaussian]


RBM with CD: digits

1 500 hidden units
2 16x16 visible units
3 pixel intensities in [0, 1]
4 8000 examples
5 weights of 100 units shown
6 almost perfect reconstructions


Summary

a PoE may lead to a better approximation than a mixture model

the learning gradient is intractable to calculate exactly; it is estimated via CD

the procedure is similar to RBM learning


Thank you for your attention!

"Training Products of Experts by Minimizing Contrastive Divergence" by Geoffrey E. Hinton, 2002

"Notes on Contrastive Divergence" by Oliver Woodford


Proof 1

\[ \frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)} \frac{\partial}{\partial \Theta} \int f(x \mid \Theta) \, dx \tag{16} \]

\[ = \frac{1}{Z(\Theta)} \int \frac{\partial f(x \mid \Theta)}{\partial \Theta} \, dx \tag{17} \]

\[ = \frac{1}{Z(\Theta)} \int f(x \mid \Theta) \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \, dx \tag{18} \]

\[ = \int p(x \mid \Theta) \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \, dx \tag{19} \]

\[ = \left\langle \frac{\partial \log f(x \mid \Theta)}{\partial \Theta} \right\rangle_{p(x \mid \Theta)} \tag{20} \]
