
Markov chain Monte Carlo (MCMC)

Jin Young Choi

1

Outline

Monte Carlo: sample from a distribution to estimate the distribution

Markov Chain Monte Carlo (MCMC)

• Applied to clustering, unsupervised learning, Bayesian inference

Importance Sampling

Metropolis-Hastings Algorithm

Gibbs Sampling

Markov Blanket in Sampling for Bayesian Network

Example: Estimation of Gaussian Mixture Model

2

𝑝 π‘₯ 𝐷 =𝑧,πœƒπ‘ π‘₯, 𝑧, πœƒ 𝐷 = ? , 𝑝 𝑧|π‘₯, πœƒ =? , 𝑝 πœƒ|π‘₯, 𝑧 =?

Markov chain Monte Carlo (MCMC)

Monte Carlo: sample from a distribution

- to estimate the distribution, for GMM estimation and clustering

(labeling, unsupervised learning)

- to compute quantities such as the max or mean

Markov Chain Monte Carlo: sampling using "local" information

- a generic "problem-solving technique"

- for decision/inference/optimization/learning problems

- generic, but not necessarily very efficient

3

Monte Carlo Integration

General problem: evaluating

$\mathbb{E}_P[h(X)] = \int h(x)\,P(x)\,dx$ can be difficult. $\left(\int h(x)\,P(x)\,dx < \infty\right)$

If we can draw samples $x^{(s)} \sim P(x)$, then we can estimate

$\mathbb{E}_P[h(X)] \approx \bar{h}_N = \frac{1}{N} \sum_{s=1}^{N} h\!\left(x^{(s)}\right).$

Monte Carlo integration is great if you can sample from the target distribution

• But what if you can't sample from the target?

• Importance sampling: use a simpler proposal distribution

4

Importance Sampling

Idea of importance sampling:

Draw the sample from a proposal distribution $Q(\cdot)$ and re-weight the integral using importance weights so that the correct distribution is targeted

$\mathbb{E}_P[h(X)] = \int h(x)\,\frac{P(x)}{Q(x)}\,Q(x)\,dx = \mathbb{E}_Q\!\left[\frac{h(X)\,P(X)}{Q(X)}\right].$

Hence, given an i.i.d. sample $x^{(s)}$ from $Q$, our estimator becomes

$\mathbb{E}_Q\!\left[\frac{h(X)\,P(X)}{Q(X)}\right] \approx \frac{1}{N} \sum_{s=1}^{N} \frac{h\!\left(x^{(s)}\right) P\!\left(x^{(s)}\right)}{Q\!\left(x^{(s)}\right)}$

5

Limitations of Monte Carlo

Direct (unconditional) sampling

• Hard to get rare events in high-dimensional spaces → Gibbs sampling

Importance sampling

• Does not work well if the proposal $Q(x)$ is very different from the target $P(x)$

• Yet constructing a $Q(x)$ similar to $P(x)$ can be difficult → Markov chain

Intuition: instead of a fixed proposal $Q(x)$, what if we could use an adaptive proposal?

• $X_{t+1}$ depends only on $X_t$, not on $X_0, X_1, \ldots, X_{t-1}$ → a Markov chain

6

Markov Chains: Notation & Terminology

Countable (finite) state space $\Omega$ (e.g. $\mathbb{N}$)

Sequence of random variables $X_t$ on $\Omega$ for $t = 0, 1, 2, \ldots$

Definition: $X_t$ is a Markov chain if

$P(X_{t+1} = y \mid X_t = x_t, \ldots, X_0 = x_0) = P(X_{t+1} = y \mid X_t = x_t)$

Notation: $P(X_{t+1} = i \mid X_t = j) = p_{ji}$

- Random Walks

Example.

$p_{AA} = P(X_{t+1} = A \mid X_t = A) = 0.6$
$p_{AE} = P(X_{t+1} = E \mid X_t = A) = 0.4$
$p_{EA} = P(X_{t+1} = A \mid X_t = E) = 0.7$
$p_{EE} = P(X_{t+1} = E \mid X_t = E) = 0.3$

7

Markov Chains: Notation & Terminology

Let $\mathbf{P} = (p_{ij})$ be the transition probability matrix

- dimension $|\Omega| \times |\Omega|$

Let $\pi_t(j) = P(X_t = j)$

- $\pi_0$: initial probability distribution

Then $\pi_t(j) = \sum_i \pi_{t-1}(i)\, p_{ij} = [\pi_{t-1}\mathbf{P}](j) = [\pi_0\mathbf{P}^t](j)$, i.e.

$\pi_t = \pi_{t-1}\mathbf{P} = \pi_{t-2}\mathbf{P}^2 = \cdots = \pi_0\mathbf{P}^t$
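A small sketch (added for illustration) that iterates $\pi_t = \pi_{t-1}\mathbf{P}$ for the two-state example above ($p_{AA}=0.6$, $p_{AE}=0.4$, $p_{EA}=0.7$, $p_{EE}=0.3$):

```python
import numpy as np

# Transition matrix for the two-state example above (rows: current state A, E).
P = np.array([[0.6, 0.4],    # p_AA, p_AE
              [0.7, 0.3]])   # p_EA, p_EE

pi = np.array([1.0, 0.0])    # arbitrary initial distribution pi_0 (start in state A)
for t in range(50):          # pi_t = pi_{t-1} P = pi_0 P^t
    pi = pi @ P

print(pi)  # approaches the stationary distribution satisfying pi P = pi
# Solving pi P = pi with sum(pi) = 1 gives pi = (7/11, 4/11) ≈ (0.636, 0.364).
```

Regardless of the initial $\pi_0$, the iterates approach the same stationary distribution, which is the content of the theorem on the next slide.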

8

Markov Chains: Fundamental Properties

Theorem:

- If the limit $\lim_{t\to\infty} \mathbf{P}^t$ exists and $\Omega$ is finite, then

$[\pi\mathbf{P}](j) = \pi(j)$ and $\sum_j \pi(j) = 1$,

and such a $\pi$ is the unique solution of $\pi\mathbf{P} = \pi$ ($\pi$ is called the stationary distribution)

- No matter where we start, after some time we will be in state $j$ with probability $\approx \pi(j)$

9

Markov Chain Monte Carlo

MCMC algorithms feature adaptive proposals

- Instead of $Q(x')$, they use $Q(x' \mid x)$, where $x'$ is the new state being sampled and $x$ is the previous sample

- As $x$ changes, $Q(x' \mid x)$ can also change (as a function of $x'$)

- The acceptance probability is set to $A(x' \mid x) = \min\!\left(1,\ \dfrac{P(x')/Q(x' \mid x)}{P(x)/Q(x \mid x')}\right)$

- No matter where we start, after some time we will be in state $j$ with probability $\approx \pi(j)$

10

(Diagram: a two-state Markov chain with transition probabilities $p_{11}, p_{12}, p_{21}, p_{22}$.)

𝑄 π‘₯β€²|π‘₯ = 𝑄 π‘₯β€²|π‘₯ for Gaussian Why?

Metropolis-Hastings

Draws a sample $x'$ from $Q(x' \mid x)$, where $x$ is the previous sample

The new sample $x'$ is accepted or rejected with some probability $A(x' \mid x)$

• This acceptance probability is $A(x' \mid x) = \min\!\left(1,\ \dfrac{P(x')/Q(x' \mid x)}{P(x)/Q(x \mid x')}\right)$

• $A(x' \mid x)$ is like a ratio of importance sampling weights

• $\dfrac{P(x')}{Q(x' \mid x)}$ is the importance weight for $x'$, and $\dfrac{P(x)}{Q(x \mid x')}$ is the importance weight for $x$

• We divide the importance weight for $x'$ by that of $x$

• Notice that we only need to compute the ratio $P(x')/P(x)$ rather than $P(x')$ or $P(x)$ separately

• $A(x' \mid x)$ ensures that, after sufficiently many draws, our samples will come from the true distribution $P(x)$

11


𝑄 π‘₯β€²|π‘₯ = 𝑄 π‘₯β€²|π‘₯ for Gaussian Why?

The MH Algorithm

Initialize starting state $x^{(0)}$ and set $t = 0$

Burn-in: while samples have "not converged"

• $x = x^{(t)}$

• $t = t + 1$

• Sample $x^* \sim Q(x^* \mid x)$   // draw from proposal

• Sample $u \sim \mathrm{Uniform}(0,1)$   // draw acceptance threshold

• If $u < A(x^* \mid x) = \min\!\left(1,\ \dfrac{P(x^*)\,Q(x \mid x^*)}{P(x)\,Q(x^* \mid x)}\right)$, then $x^{(t)} = x^*$   // transition

• Else $x^{(t)} = x$   // stay in current state

• Repeat until convergence
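A minimal Python sketch of this loop (an added illustration, not the authors' code): the bimodal target below is an assumed stand-in for the $P(x)$ used in the example on the following slides, and the Gaussian proposal is symmetric, so the $Q$ terms cancel in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnormalized(x):
    """Unnormalized bimodal target P(x): mixture of two Gaussians (illustrative choice)."""
    return np.exp(-0.5 * (x + 2.0) ** 2) + np.exp(-0.5 * (x - 2.0) ** 2)

def metropolis_hastings(n_samples=10_000, x0=0.0, proposal_std=1.0):
    """Metropolis-Hastings with a symmetric Gaussian proposal Q(x*|x) = N(x, proposal_std^2)."""
    x = x0
    samples = []
    for _ in range(n_samples):
        x_star = rng.normal(x, proposal_std)      # sample x* ~ Q(x*|x)
        u = rng.uniform(0.0, 1.0)                 # acceptance threshold
        A = min(1.0, p_unnormalized(x_star) / p_unnormalized(x))  # Q cancels (symmetric)
        if u < A:                                 # accept: transition to x*
            x = x_star
        samples.append(x)                         # else: stay at the current state
    return np.array(samples)

chain = metropolis_hastings()
# After burn-in, the histogram of `chain` should approximate the bimodal P(x).
```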

12

The MH Algorithm

Example:

• Let $Q(x' \mid x)$ be a Gaussian centered on $x$

• We're trying to sample from a bimodal distribution $P(x)$

𝐴 π‘₯β€²|π‘₯ = min 1,𝑃 π‘₯β€² /𝑄 π‘₯β€²|π‘₯

𝑃 π‘₯ /𝑄 π‘₯|π‘₯β€²

13

(Slides 14-20 repeat the same example text; the accompanying figures are not included in this transcript.)

Gibbs Sampling

Gibbs Sampling is an MCMC algorithm that samples each random variable of a graphical model, one at a time

• GS is a special case of the MH algorithm

Consider a factored state space

β€’ π‘₯ ∈ Ξ© is a vector π‘₯ = π‘₯1, … , π‘₯π‘šβ€’ Notation: π‘₯βˆ’π‘– = π‘₯1, … , π‘₯π‘–βˆ’1, π‘₯𝑖+1, … , π‘₯π‘š

21

Gibbs Sampling

The GS algorithm:

1. Suppose the graphical model contains variables $x_1, \ldots, x_n$
2. Initialize starting values for $x_1, \ldots, x_n$
3. Do until convergence:

1. Pick a component $i \in \{1, \ldots, n\}$

2. Sample $z \sim P(x_i \mid x_{-i})$, and update $x_i \leftarrow z$

When we update $x_i$, we immediately use its new value for sampling the other variables $x_j$

Using $P(x_i \mid x_{-i})$ as the proposal makes the MH acceptance probability equal to 1 (see the sketch below).

22

𝐴 π‘₯β€²|π‘₯ = min 1,𝑃 π‘₯β€² /𝑄 π‘₯β€²|π‘₯

𝑃 π‘₯ /𝑄 π‘₯|π‘₯β€²

Markov Blankets

The conditional $P(x_i \mid x_{-i})$ can be obtained using the Markov blanket

• Let $\mathrm{MB}(x_i)$ be the Markov blanket of $x_i$; then

$P(x_i \mid x_{-i}) = P(x_i \mid \mathrm{MB}(x_i))$

For a Bayesian network, the Markov blanket of $x_i$ is the set containing its parents, children, and co-parents

23

Gibbs Sampling: An Example

Consider the GMM

• The data $x$ (positions) are drawn from two Gaussian distributions

• We do NOT know the class $y$ of each data point, nor the parameters of the Gaussian distributions

• Initialize the class of each data point randomly at $t = 0$

24

(Figure: data drawn from a Gaussian with mean $(3,-3)$ and variance 3, and from a Gaussian with mean $(1,2)$ and variance 2.)

Gibbs Sampling: An Example

Sampling $P(y_i \mid x_{-i}, y_{-i})$ at $t = 1$, we compute:

$P(y_i = 0 \mid x_{-i}, y_{-i}) \propto \mathcal{N}(x_i \mid \mu_{x_{-i},0}, \sigma_{x_{-i},0})$
$P(y_i = 1 \mid x_{-i}, y_{-i}) \propto \mathcal{N}(x_i \mid \mu_{x_{-i},1}, \sigma_{x_{-i},1})$

where $\mu_{x_{-i},K} = \mathrm{MEAN}(X_{iK})$, $\sigma_{x_{-i},K} = \mathrm{VAR}(X_{iK})$, and $X_{iK} = \{x_j \mid x_j \in x_{-i},\ y_j = K\}$

Then update $y_i$ by sampling from $P(y_i \mid x_{-i}, y_{-i})$, and repeat for all data points
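A rough sketch of this label-resampling sweep (added for illustration; the 1-D two-cluster synthetic data and the small variance floor are assumptions, not the slides' 2-D setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D synthetic data from two Gaussians; the sampler never sees the true labels.
x = np.concatenate([rng.normal(-3.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
y = rng.integers(0, 2, size=x.size)              # random class labels at t = 0

def normal_pdf(v, mu, var):
    return np.exp(-0.5 * (v - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

for t in range(100):                             # Gibbs sweeps
    for i in range(x.size):                      # resample each label y_i in turn
        probs = np.empty(2)
        for k in (0, 1):
            # X_iK = {x_j in x_{-i} : y_j = K}: points other than i currently assigned to class k
            others = np.delete(x, i)[np.delete(y, i) == k]
            if others.size < 2:                  # degenerate cluster: give it negligible weight
                probs[k] = 1e-12
                continue
            mu, var = others.mean(), others.var() + 1e-6   # empirical mean / variance of class k
            probs[k] = normal_pdf(x[i], mu, var)           # P(y_i = k | ...) up to a constant
        y[i] = rng.choice(2, p=probs / probs.sum())        # sample the new label, use it immediately

# After enough sweeps, most y_i settle on (or oscillate around) the label of their own cluster.
```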

25

(Figure: iteration over $i$ within the same sweep $t$; class labels 0 and 1.)

Gibbs Sampling: An Example

Now $t = 2$, and we repeat the procedure to sample a new class for each data point

And similarly for $t = 3, 4, \ldots$

26


Gibbs Sampling: An Example

Data point $i$'s class can be chosen based on the tendency of its sampled values $y_i$

• The classes of the data may keep oscillating even after sufficiently many iterations

• We can take the class of each datum to be its most frequently selected class

In the simulation, the final class is correct with probability 94.9% at $t = 100$

27

Interim Summary

Markov Chain Monte Carlo methods use adaptive proposals $Q(x' \mid x)$ to sample from the true distribution $P(x)$

Metropolis-Hastings allows you to specify any proposal $Q(x' \mid x)$

• But choosing a good $Q(x' \mid x)$ requires care

Gibbs sampling sets the proposal $Q(x_i' \mid x_{-i})$ to the conditional distribution $P(x_i' \mid x_{-i})$

• Acceptance rate is always 1.

• But remember that high acceptance usually entails slow exploration

• In fact, there are better MCMC algorithms for certain models

28