Markov Chain Monte Carlo Methods (CSE 676 Deep Learning, 17.3 MCMC)
Deep Learning Srihari
Topics in Monte Carlo Methods
1. Sampling and Monte Carlo Methods
2. Importance Sampling
3. Markov Chain Monte Carlo Methods
4. Gibbs Sampling
5. Mixing between separated modes
Topics in Markov Chain Monte Carlo
• Limitations of plain Monte Carlo methods
• Markov Chains
• Metropolis-Hastings Algorithm
• MCMC and Energy-based models
Limitations of plain Monte Carlo
• Direct (unconditional) sampling
– Hard to get rare events in high-dimensional spaces
– Infeasible for MRFs unless we know the partition function Z
• Rejection sampling, importance sampling
– Do not work well if the proposal q(x) is very different from p(x)
– Yet constructing a q(x) similar to p(x) can be difficult
– Making a good proposal usually requires knowledge of the analytic form of p(x); but if we had that, we wouldn't even need to sample!
• Intuition of MCMC
– Instead of a fixed proposal q(x), use an adaptive proposal
MCMC and Adaptive Proposals
• MCMC: instead of q(x), use q(x'|x), where x' is the new state being sampled and x is the previous sample
• As x changes, q(x'|x) changes as a function of x'
[Figure: Left, importance sampling from p(x) with a bad (fixed) proposal q(x). Right, MCMC sampling from p(x) with an adaptive proposal q(x'|x) that re-centers on each successive sample x1, x2, x3.]
Markov Chain Monte Carlo Methods
• In many cases we wish to use a Monte Carlo technique, but there is no tractable method for drawing exact samples from pmodel(x) or from a good (low-variance) importance sampling distribution q(x)
• In deep learning this happens most often when pmodel(x) is represented by an undirected model
• In this case we use a mathematical tool called a Markov chain to sample from pmodel(x)
• Algorithms that use Markov chains to perform Monte Carlo estimates are called MCMC methods
Markov Chain
• A sequence of random variables S0, S1, S2, …, with each Si ∈ {1, 2, …, d} taking one of d possible values representing the state of a system
– Initial state distributed according to p(S0)
– Subsequent states generated from a conditional distribution that depends only on the previous state, i.e. Si is distributed according to p(Si | Si−1)

[Figure: A Markov chain over three states. The weighted directed edges indicate probabilities of transitioning to a different state.]
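A chain like the one in the figure can be simulated directly. This is a minimal sketch: the 3-state transition matrix below is a hypothetical example, not the one from the figure.

```python
import random

# Hypothetical 3-state transition matrix: T[i][j] = p(S_t = j | S_{t-1} = i)
# Each row sums to 1 (a valid conditional distribution over next states).
T = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

def simulate_chain(n_steps, s0=0, seed=0):
    """Generate a state sequence S0, S1, ..., S_{n_steps} from the chain."""
    rng = random.Random(seed)
    states = [s0]
    for _ in range(n_steps):
        prev = states[-1]
        # The next state depends only on the previous state (Markov property)
        states.append(rng.choices(range(3), weights=T[prev])[0])
    return states

states = simulate_chain(10_000)
# Empirical state frequencies approximate the chain's stationary distribution
freqs = [states.count(i) / len(states) for i in range(3)]
```

After many steps, the fraction of time spent in each state settles down regardless of the start state; this is the stationary distribution discussed later in these slides.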
Idea of MCMC
• Construct a Markov chain whose states are joint assignments to the variables in the model
• And whose stationary distribution equals the model probability p
Metropolis-Hastings
• The user specifies a transition kernel q(x'|x) and an acceptance probability A(x'|x)
• M-H Algorithm
– Draw a sample x' from q(x'|x), where x is the previous sample
– The new sample x' is accepted or rejected with probability A(x'|x)
• This acceptance probability is

A(x'|x) = min{ 1, [p(x') q(x|x')] / [p(x) q(x'|x)] }

• It encourages us to move towards more likely points in the distribution
Acceptance probability
• A(x'|x) is like a ratio of importance sampling weights

A(x'|x) = min{ 1, [p(x') q(x|x')] / [p(x) q(x'|x)] }

– p(x')/q(x'|x) is the importance weight for x'; p(x)/q(x|x') is the importance weight for x
– We divide the importance weight for x' by that of x
– Notice that we only need to compute the ratio p(x')/p(x), rather than p(x') or p(x) separately
• A(x'|x) ensures that, after sufficient draws, samples will come from the true distribution p(x)
The M-H Algorithm: an example run
• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)
• The sampler proceeds step by step:
– Initialize x(0)
– Draw, accept x(1)
– Draw, accept x(2)
– Draw, but reject; set x(3) = x(2)
– Draw, accept x(4)
– Draw, accept x(5)
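The run above can be sketched in code. This is a minimal sketch: the bimodal target (an equal mixture of two unit-variance Gaussians at ±2) and the proposal width are hypothetical illustration choices, not the ones in the slides.

```python
import math
import random

def p(x):
    """Hypothetical bimodal target: equal mixture of N(-2, 1) and N(2, 1),
    up to a constant factor (normalization is irrelevant for M-H)."""
    return 0.5 * math.exp(-0.5 * (x + 2) ** 2) + 0.5 * math.exp(-0.5 * (x - 2) ** 2)

def metropolis_hastings(n_samples, x0=0.0, step=1.0, seed=0):
    rng = random.Random(seed)
    x = x0                      # initialize x(0)
    samples = [x]
    for _ in range(n_samples):
        # Proposal q(x'|x): a Gaussian centered on the current sample x
        x_new = rng.gauss(x, step)
        # q is symmetric, so q(x|x') = q(x'|x) and A reduces to min(1, p(x')/p(x))
        a = min(1.0, p(x_new) / p(x))
        if rng.random() < a:
            x = x_new           # accept: move to x'
        # else: reject; x stays put, and the repeated value still counts as a sample
        samples.append(x)
    return samples

samples = metropolis_hastings(20_000)
```

Because the Gaussian proposal is symmetric, q cancels out of the acceptance ratio; for an asymmetric proposal the full ratio p(x')q(x|x') / [p(x)q(x'|x)] must be used.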
MCMC and Energy-Based Models
• Guarantees for MCMC hold when the model does not assign zero probability to any state
• Thus it is convenient to present these techniques as sampling from an energy-based model (EBM): p(x) ∝ exp(−E(x))
• Notation:

p(x) = (1/Z) p̃(x),  p̃(x) = Π_{C∈G} φ(C),  Z = ∫ p̃(x) dx,  p̃(x) = exp(−E(x))
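Because M-H only needs the ratio p(x')/p(x), an EBM can be sampled from the unnormalized p̃(x) = exp(−E(x)) alone; the partition function Z cancels. A minimal sketch with a hypothetical quadratic energy (for which p(x) happens to be a standard Gaussian):

```python
import math
import random

def energy(x):
    """Hypothetical energy function; p(x) ∝ exp(-E(x)) is N(0, 1) here."""
    return 0.5 * x * x

def p_tilde(x):
    """Unnormalized probability; the partition function Z is never computed."""
    return math.exp(-energy(x))

def sample_ebm(n_samples, x0=0.0, step=1.0, seed=0):
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = rng.gauss(x, step)   # symmetric Gaussian proposal
        # Z cancels: p(x')/p(x) = p_tilde(x')/p_tilde(x)
        if rng.random() < min(1.0, p_tilde(x_new) / p_tilde(x)):
            x = x_new
        samples.append(x)
    return samples

samples = sample_ebm(20_000)
```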
Need more than ancestral sampling
• Ancestral sampling for directed acyclic graphs:
– Start with the lowest numbered node
– Draw a sample from the distribution p(x1), which we call x̂1
– Work through each of the nodes in order: for node n we draw a sample from the conditional distribution p(xn | pan)
– This defines an efficient single-pass algorithm
• Not so simple in EBMs: a chicken-and-egg problem
– In order to sample from A we need to draw from p(A|B,D)
– In order to sample from B we need to sample from p(B|A,C)
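The single-pass procedure can be sketched for a hypothetical three-node chain DAG x1 → x2 → x3, where each variable is binary and each conditional probability below is an illustration value, not from the slides:

```python
import random

def ancestral_sample(seed=None):
    """One pass through the nodes in topological order:
    sample x1, then x2 given x1, then x3 given x2.
    All probabilities here are hypothetical illustration values."""
    rng = random.Random(seed)
    x1 = rng.random() < 0.6                    # draw x̂1 from p(x1)
    x2 = rng.random() < (0.9 if x1 else 0.2)   # draw x̂2 from p(x2 | x1)
    x3 = rng.random() < (0.7 if x2 else 0.3)   # draw x̂3 from p(x3 | x2)
    return x1, x2, x3

sample = ancestral_sample(seed=0)
```

Each node is visited exactly once, after all its parents, which is why a single pass suffices for a DAG; an EBM's cyclic conditioning (A needs B, B needs A) has no such ordering.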
Avoiding chicken-and-egg in EBMs
• We avoid the chicken-and-egg problem using a Markov chain
• Core idea of a Markov chain:
– Have a state x that begins with an arbitrary value
– Over time we repeatedly update x
– Eventually x becomes a fair sample from p(x)
• A Markov chain is defined by a random state x and a transition distribution T(x'|x)
Theoretical understanding of MCMC
• Reparameterize the problem
– Restrict attention to the case where the random variable x has countably many states
• Represent the state as an integer x; different integer values of x map back to different states x in the original problem
– Consider running infinitely many Markov chains in parallel
• The states of these Markov chains are drawn from some distribution q(t)(x), where t is the number of time steps
• q(0) is some distribution that we use to initialize the chains
• Our goal is for q(t)(x) to converge to p(x)
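The evolution of q(t) can be sketched as repeated matrix multiplication: if A is the transition matrix, then q(t+1) = A q(t). The 3-state matrix below is a hypothetical example (column i holds the distribution over next states given current state i):

```python
import numpy as np

# Hypothetical transition matrix: A[j, i] = p(next state = j | current state = i)
# Each column sums to 1.
A = np.array([
    [0.5, 0.1, 0.2],
    [0.3, 0.6, 0.2],
    [0.2, 0.3, 0.6],
])

q = np.array([1.0, 0.0, 0.0])   # q(0): all chains start in state 0
for t in range(50):
    q = A @ q                   # q(t+1) = A q(t)

# After enough steps q(t) stops changing: it has converged to the
# stationary distribution, independent of the starting distribution q(0).
```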
Equilibrium Distribution
• Because we have reparameterized the problem in terms of a positive integer x, we can describe the probability distribution q using a vector v with q(x = i) = vi
• The stationary distribution, also called the equilibrium distribution, is given by the eigenvector equation v' = Av = v
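The eigenvector equation v' = Av = v says the stationary distribution is the eigenvector of A with eigenvalue 1, so it can be computed directly. A sketch, reusing the same kind of hypothetical column-stochastic transition matrix:

```python
import numpy as np

# Hypothetical column-stochastic transition matrix (columns sum to 1)
A = np.array([
    [0.5, 0.1, 0.2],
    [0.3, 0.6, 0.2],
    [0.2, 0.3, 0.6],
])

# Find the eigenvector of A with eigenvalue 1, i.e. solve Av = v
eigvals, eigvecs = np.linalg.eig(A)
idx = np.argmin(np.abs(eigvals - 1.0))
v = np.real(eigvecs[:, idx])
v = v / v.sum()   # rescale so the entries form a probability distribution

# v is stationary: applying the transition matrix leaves it unchanged
```

A column-stochastic matrix always has eigenvalue 1, so such a v exists; whether the chain actually converges to it from any q(0) depends on further conditions (e.g. no zero-probability states), as noted in the EBM slide above.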
Choice of Transition Distribution
• If we have chosen T correctly, then the stationary distribution q will be equal to the distribution p we wish to sample from
• We describe how to choose T next, when we discuss Gibbs sampling