1. motivation

1
1. Motivation Undirected models are useful in many settings. Consider models in exponential family form: Task: Given i.i.d. data, estimate parameters accurately and quickly Maximum likelihood estimation (MLE): Need to resort to approximate techniques: Pseudolikelihood / composite likelihoods Sampling-based techniques (e.g. MCMC-MLE) Contrastive divergence (CD) learning We propose particle filtered MCMC-MLE Particle Filtered MCMC-MLE with Connections to Contrastive Divergence Arthur Asuncion, Qiang Liu, Alexander Ihler, Padhraic Smyth Department of Computer Science, University of California, Irvine 3. Contrastive Divergence (CD) Widely-used machine learning algorithm for learning undirected models [Hinton, 2002] CD can be motivated by taking gradient of log-likelihood directly: CD-n samples from current model (approx.): Initialize chains at empirical data distribution Only run n MCMC steps Persistent CD: initialize chains at samples at previous iteration [Tieleman, 2008] 5. Experimental Analysis Visible Boltzmann machines: Exponential random graph models (ERGMs): Conditional random fields (CRFs): Restricted Boltzmann machines (RBMs): 2. MCMC-MLE Widely used in statistics [Geyer, 1991] Idea: draw samples from alternate distribution p(x|θ 0 ) using MCMC, to approximate the likelihood: To optimize approximate likelihood, use gradient: Degeneracy problems if θ moves far from initial θ 0 4. Particle Filtered MCMC-MLE (PF) Use sampling-importance-resampling (SIR) with MCMC rejuvenation to estimate gradient Monitor effective sample size (ESS): If ESS (“health” of particles) is low: Resample particles in proportion to w Rejuvenate with n MCMC steps based on θ PF can avoid MCMC-MLE’s degeneracy issues PF can be potentially faster than CD since it only “rejuvenates” when ESS is low As the number of particles approaches infinity, PF recovers the exact log- likelihood gradient 6. Conclusions Particle filtered MCMC-MLE can avoid the degeneracy issues of MCMC-MLE by performing resampling and rejuvenation Particle filtered MCMC-MLE is sometimes faster than CD since it only rejuvenates when needed There is a unified view of all these algorithms Partition function usually intractable MCMC-MLE uses importance sampling to estimate gradient Run MCMC under p(x|θ 0 ) until equilibrium Calculate new weight Update θ using approximate gradient Run MCMC under θ for n steps Calculate weight and check ESS If ESS is low, resample and rejuvenate Run MCMC under θ for n steps # edges # 2- stars # triangle s Network statistics: Experiments on MNIST data. 500 hidden units used. Monte Carlo approximat ion PF can be viewed as a “hybrid” between MCMC-MLE and CD

Upload: todd

Post on 18-Mar-2016

41 views

Category:

Documents


0 download

DESCRIPTION

Particle Filtered MCMC-MLE with Connections to Contrastive Divergence. Arthur Asuncion, Qiang Liu, Alexander Ihler, Padhraic Smyth Department of Computer Science, University of California, Irvine. 1. Motivation - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: 1.  Motivation

1. Motivation

• Undirected models are useful in many settings. Consider models in exponential family form:

• Task: Given i.i.d. data, estimate parameters accurately and quickly

• Maximum likelihood estimation (MLE):

• Need to resort to approximate techniques:• Pseudolikelihood / composite likelihoods• Sampling-based techniques (e.g. MCMC-MLE)• Contrastive divergence (CD) learning

• We propose particle filtered MCMC-MLE

Particle Filtered MCMC-MLE with Connections to Contrastive Divergence

Arthur Asuncion, Qiang Liu, Alexander Ihler, Padhraic Smyth Department of Computer Science, University of California, Irvine

3. Contrastive Divergence (CD)

• Widely-used machine learning algorithm for learning undirected models [Hinton, 2002]

• CD can be motivated by taking gradient of log-likelihood directly:

• CD-n samples from current model (approx.):• Initialize chains at empirical data distribution• Only run n MCMC steps

• Persistent CD: initialize chains at samples at previous iteration [Tieleman, 2008]

5. Experimental Analysis

• Visible Boltzmann machines:

• Exponential random graph models (ERGMs):

• Conditional random fields (CRFs):

• Restricted Boltzmann machines (RBMs):

2. MCMC-MLE

• Widely used in statistics [Geyer, 1991]

• Idea: draw samples from alternate distribution p(x|θ0) using MCMC, to approximate the likelihood:

• To optimize approximate likelihood, use gradient:

• Degeneracy problems if θ moves far from initial θ0

4. Particle Filtered MCMC-MLE (PF)

• Use sampling-importance-resampling (SIR) with MCMC rejuvenation to estimate gradient

• Monitor effective sample size (ESS):

• If ESS (“health” of particles) is low:• Resample particles in proportion to w• Rejuvenate with n MCMC steps based on θ

• PF can avoid MCMC-MLE’s degeneracy issues

• PF can be potentially faster than CD since it only “rejuvenates” when ESS is low

• As the number of particles approaches infinity, PF recovers the exact log-likelihood gradient

6. Conclusions

• Particle filtered MCMC-MLE can avoid the degeneracy issues of MCMC-MLE by performing resampling and rejuvenation

• Particle filtered MCMC-MLE is sometimes faster than CD since it only rejuvenates when needed

• There is a unified view of all these algorithms

Partition function usually intractable

MCMC-MLE uses importance sampling to estimate gradient

Run MCMC under p(x|θ0) until equilibrium

Calculate new weight

Update θ using approximate gradient

Run MCMC under θ for n steps

Calculate weight and check ESS

If ESS is low,resample and rejuvenate

Run MCMC under θ for n steps

# edges # 2-stars # triangles

Network statistics:

Experiments on MNIST data. 500 hidden units used.

Monte Carlo approximation

PF can be viewed as a “hybrid” between MCMC-MLE and CD