Markov Chain Monte Carlo Methods (CSE 676 Deep Learning, 17.3 MCMC)
Deep Learning Srihari
Topics in Monte Carlo Methods
1. Sampling and Monte Carlo Methods
2. Importance Sampling
3. Markov Chain Monte Carlo Methods
4. Gibbs Sampling
5. Mixing between separated modes
Topics in Markov Chain Monte Carlo
• Limitations of plain Monte Carlo methods
• Markov Chains
• Metropolis-Hastings Algorithm
• MCMC and Energy-based models
Limitations of plain Monte Carlo
• Direct (unconditional) sampling
– Hard to get rare events in high-dimensional spaces
– Infeasible for MRFs unless we know the partition function Z
• Rejection sampling, importance sampling
– Do not work well if the proposal q(x) is very different from p(x)
– Yet constructing a q(x) similar to p(x) can be difficult
– Making a good proposal usually requires knowledge of the analytic form of p(x); but if we had that, we wouldn't even need to sample!
• Intuition of MCMC
– Instead of a fixed proposal q(x), use an adaptive proposal
MCMC and Adaptive Proposals
• MCMC: instead of q(x), use q(x'|x), where x' is the new state being sampled and x is the previous sample
• As x changes, q(x'|x) changes as a function of x'
[Figure: Left, importance sampling from p(x) with a bad (fixed) proposal q(x). Right, MCMC sampling from p(x) with an adaptive proposal q(x'|x) that re-centers on each successive sample x1, x2, x3.]
Markov Chain Monte Carlo Methods
• In many cases we wish to use a Monte Carlo technique, but there is no tractable method for drawing exact samples from pmodel(x) or from a good (low-variance) importance sampling distribution q(x)
• In deep learning this happens most often when pmodel(x) is represented by an undirected model
• In this case we use a mathematical tool called a Markov chain to sample from pmodel(x)
• Algorithms that use Markov chains to perform Monte Carlo estimates are called MCMC methods
Markov Chain
• A sequence of random variables S0, S1, S2, …, with each Si ∈ {1, 2, …, d} taking one of d possible values representing the state of a system
– Initial state distributed according to p(S0)
– Subsequent states generated from a conditional distribution that depends only on the previous state, i.e. Si is distributed according to p(Si | Si−1)

[Figure: A Markov chain over three states. The weighted directed edges indicate probabilities of transitioning to a different state.]
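A chain like the one in the figure can be simulated directly. This is a minimal sketch: the 3-state transition matrix below is a hypothetical example, not the one from the figure.

```python
import random

# Hypothetical 3-state transition matrix: T[i][j] = p(S_t = j | S_{t-1} = i)
# Each row sums to 1 (a valid conditional distribution over next states).
T = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

def simulate_chain(n_steps, s0=0, seed=0):
    """Generate a state sequence S0, S1, ..., S_{n_steps} from the chain."""
    rng = random.Random(seed)
    states = [s0]
    for _ in range(n_steps):
        prev = states[-1]
        # The next state depends only on the previous state (Markov property)
        states.append(rng.choices(range(3), weights=T[prev])[0])
    return states

states = simulate_chain(10_000)
# Empirical state frequencies approximate the chain's stationary distribution
freqs = [states.count(i) / len(states) for i in range(3)]
```

After many steps, the fraction of time spent in each state settles down regardless of the start state; this is the stationary distribution discussed later in these slides.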
Idea of MCMC
• Construct a Markov chain whose states are joint assignments to the variables in the model
• And whose stationary distribution equals the model probability p
Metropolis-Hastings
• The user specifies a transition kernel q(x'|x) and an acceptance probability A(x'|x)
• M-H Algorithm
– Draw a sample x' from q(x'|x), where x is the previous sample
– The new sample x' is accepted or rejected with probability A(x'|x)
• This acceptance probability is

A(x'|x) = min{ 1, [p(x') q(x|x')] / [p(x) q(x'|x)] }

• It encourages us to move towards more likely points in the distribution
Acceptance probability
• A(x'|x) is like a ratio of importance sampling weights

A(x'|x) = min{ 1, [p(x') q(x|x')] / [p(x) q(x'|x)] }

– p(x')/q(x'|x) is the importance weight for x'; p(x)/q(x|x') is the importance weight for x
– We divide the importance weight for x' by that of x
– Notice that we only need to compute the ratio p(x')/p(x), rather than p(x') or p(x) separately
• A(x'|x) ensures that, after sufficient draws, samples will come from the true distribution p(x)
The M-H Algorithm: an example run
• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)
• The sampler proceeds step by step:
– Initialize x(0)
– Draw, accept x(1)
– Draw, accept x(2)
– Draw, but reject; set x(3) = x(2)
– Draw, accept x(4)
– Draw, accept x(5)
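The run above can be sketched in code. This is a minimal sketch: the bimodal target (an equal mixture of two unit-variance Gaussians at ±2) and the proposal width are hypothetical illustration choices, not the ones in the slides.

```python
import math
import random

def p(x):
    """Hypothetical bimodal target: equal mixture of N(-2, 1) and N(2, 1),
    up to a constant factor (normalization is irrelevant for M-H)."""
    return 0.5 * math.exp(-0.5 * (x + 2) ** 2) + 0.5 * math.exp(-0.5 * (x - 2) ** 2)

def metropolis_hastings(n_samples, x0=0.0, step=1.0, seed=0):
    rng = random.Random(seed)
    x = x0                      # initialize x(0)
    samples = [x]
    for _ in range(n_samples):
        # Proposal q(x'|x): a Gaussian centered on the current sample x
        x_new = rng.gauss(x, step)
        # q is symmetric, so q(x|x') = q(x'|x) and A reduces to min(1, p(x')/p(x))
        a = min(1.0, p(x_new) / p(x))
        if rng.random() < a:
            x = x_new           # accept: move to x'
        # else: reject; x stays put, and the repeated value still counts as a sample
        samples.append(x)
    return samples

samples = metropolis_hastings(20_000)
```

Because the Gaussian proposal is symmetric, q cancels out of the acceptance ratio; for an asymmetric proposal the full ratio p(x')q(x|x') / [p(x)q(x'|x)] must be used.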
MCMC and Energy-Based Models
• Guarantees for MCMC hold when the model does not assign zero probability to any state
• Thus it is convenient to present these techniques as sampling from an energy-based model (EBM): p(x) ∝ exp(−E(x))
• Notation:

p(x) = (1/Z) p̃(x),  p̃(x) = Π_{C∈G} φ(C),  Z = ∫ p̃(x) dx,  p̃(x) = exp(−E(x))
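Because M-H only needs the ratio p(x')/p(x), an EBM can be sampled from the unnormalized p̃(x) = exp(−E(x)) alone; the partition function Z cancels. A minimal sketch with a hypothetical quadratic energy (for which p(x) happens to be a standard Gaussian):

```python
import math
import random

def energy(x):
    """Hypothetical energy function; p(x) ∝ exp(-E(x)) is N(0, 1) here."""
    return 0.5 * x * x

def p_tilde(x):
    """Unnormalized probability; the partition function Z is never computed."""
    return math.exp(-energy(x))

def sample_ebm(n_samples, x0=0.0, step=1.0, seed=0):
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = rng.gauss(x, step)   # symmetric Gaussian proposal
        # Z cancels: p(x')/p(x) = p_tilde(x')/p_tilde(x)
        if rng.random() < min(1.0, p_tilde(x_new) / p_tilde(x)):
            x = x_new
        samples.append(x)
    return samples

samples = sample_ebm(20_000)
```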
Need more than ancestral sampling
• Ancestral sampling for directed acyclic graphs:
– Start with the lowest numbered node
– Draw a sample from the distribution p(x1), which we call x̂1
– Work through each of the nodes in order: for node n we draw a sample from the conditional distribution p(xn | pan)
– This defines an efficient single-pass algorithm
• Not so simple in EBMs: a chicken-and-egg problem
– In order to sample from A we need to draw from p(A|B,D)
– In order to sample from B we need to sample from p(B|A,C)
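The single-pass procedure can be sketched for a hypothetical three-node chain DAG x1 → x2 → x3, where each variable is binary and each conditional probability below is an illustration value, not from the slides:

```python
import random

def ancestral_sample(seed=None):
    """One pass through the nodes in topological order:
    sample x1, then x2 given x1, then x3 given x2.
    All probabilities here are hypothetical illustration values."""
    rng = random.Random(seed)
    x1 = rng.random() < 0.6                    # draw x̂1 from p(x1)
    x2 = rng.random() < (0.9 if x1 else 0.2)   # draw x̂2 from p(x2 | x1)
    x3 = rng.random() < (0.7 if x2 else 0.3)   # draw x̂3 from p(x3 | x2)
    return x1, x2, x3

sample = ancestral_sample(seed=0)
```

Each node is visited exactly once, after all its parents, which is why a single pass suffices for a DAG; an EBM's cyclic conditioning (A needs B, B needs A) has no such ordering.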
Avoiding chicken-and-egg in EBMs
• We avoid the chicken-and-egg problem using a Markov chain
• Core idea of a Markov chain:
– Have a state x that begins with an arbitrary value
– Over time we repeatedly update x
– Eventually x becomes a fair sample from p(x)
• A Markov chain is defined by a random state x and a transition distribution T(x'|x)
Theoretical understanding of MCMC
• Reparameterize the problem
– Restrict attention to the case where the random variable x has countably many states
• Represent the state as an integer x; different integer values of x map back to different states x in the original problem
– Consider running infinitely many Markov chains in parallel
• The states of these Markov chains are drawn from some distribution q(t)(x), where t is the number of time steps
• q(0) is some distribution that we use to initialize the chains
• Our goal is for q(t)(x) to converge to p(x)
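The evolution of q(t) can be sketched as repeated matrix multiplication: if A is the transition matrix, then q(t+1) = A q(t). The 3-state matrix below is a hypothetical example (column i holds the distribution over next states given current state i):

```python
import numpy as np

# Hypothetical transition matrix: A[j, i] = p(next state = j | current state = i)
# Each column sums to 1.
A = np.array([
    [0.5, 0.1, 0.2],
    [0.3, 0.6, 0.2],
    [0.2, 0.3, 0.6],
])

q = np.array([1.0, 0.0, 0.0])   # q(0): all chains start in state 0
for t in range(50):
    q = A @ q                   # q(t+1) = A q(t)

# After enough steps q(t) stops changing: it has converged to the
# stationary distribution, independent of the starting distribution q(0).
```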
Equilibrium Distribution
• Because we have reparameterized the problem in terms of a positive integer x, we can describe the probability distribution q using a vector v with q(x = i) = vi
• The stationary distribution, also called the equilibrium distribution, is given by the eigenvector equation v' = Av = v
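The eigenvector equation v' = Av = v says the stationary distribution is the eigenvector of A with eigenvalue 1, so it can be computed directly. A sketch, reusing the same kind of hypothetical column-stochastic transition matrix:

```python
import numpy as np

# Hypothetical column-stochastic transition matrix (columns sum to 1)
A = np.array([
    [0.5, 0.1, 0.2],
    [0.3, 0.6, 0.2],
    [0.2, 0.3, 0.6],
])

# Find the eigenvector of A with eigenvalue 1, i.e. solve Av = v
eigvals, eigvecs = np.linalg.eig(A)
idx = np.argmin(np.abs(eigvals - 1.0))
v = np.real(eigvecs[:, idx])
v = v / v.sum()   # rescale so the entries form a probability distribution

# v is stationary: applying the transition matrix leaves it unchanged
```

A column-stochastic matrix always has eigenvalue 1, so such a v exists; whether the chain actually converges to it from any q(0) depends on further conditions (e.g. no zero-probability states), as noted in the EBM slide above.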
Choice of Transition Distribution
• If we have chosen T correctly, then the stationary distribution q will be equal to the distribution p we wish to sample from
• We describe how to choose T next, when we discuss Gibbs sampling