Source: lsong/teaching/8803ml/lecture7.pdf · 2012-02-02
TRANSCRIPT
Approximate Inference in GMs: Sampling
Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Le Song
What we have learned so far
Variable elimination and junction tree
Exact inference
Exponential in tree-width
Mean field, loopy belief propagation
Approximate inference for marginals/conditionals
Fast, but can get inaccurate estimates
Sample-based inference
Approximate inference for marginals/conditionals
With "enough" samples, will converge to the right answer
Variational Inference
What is the approximating structure?
How to measure the goodness of the approximation of $Q(X_1, \dots, X_n)$ to the original $P(X_1, \dots, X_n)$?
Reverse KL-divergence $KL(Q \| P)$
How to compute the new parameters?
Optimization: $Q^* = \arg\min_Q KL(Q \| P)$
[Figure: the original distribution $P$ with several candidate approximating structures $Q$; mean field → new parameters]
Deriving mean field fixed point condition I
Approximate $P(X) \propto \prod_j \Psi(D_j)$ with $Q(X) = \prod_i q_i(X_i)$
s.t. $\forall X_i:\ q_i(X_i) > 0,\ \int q_i(X_i)\, dX_i = 1$
$KL(Q \| P) = -\int Q(X) \ln P(X)\, dX + \int Q(X) \ln Q(X)\, dX + \text{const.}$
$= -\sum_j \int \prod_i q_i(X_i) \ln \Psi(D_j)\, dX + \sum_i \int q_i(X_i) \ln q_i(X_i)\, dX_i$
$= -\sum_j \int \big(\prod_{i: X_i \in D_j} q_i(X_i)\big) \ln \Psi(D_j)\, dD_j + \sum_i \int q_i(X_i) \ln q_i(X_i)\, dX_i$
[Figure: grid MRF over $X_1, \dots, X_7$; the potentials involving $X_1$ are $\Psi(X_1, X_2)$, $\Psi(X_1, X_3)$, $\Psi(X_1, X_4)$, $\Psi(X_1, X_5)$]
E.g., $\int Q(X) \ln \Psi(X_1, X_5)\, dX = \int q_1(X_1)\, q_5(X_5) \ln \Psi(X_1, X_5)\, dX_1\, dX_5$
Deriving mean field fixed point condition II
Let $L$ be the Lagrangian function (what about $q_i(X_i) > 0$?)
$L = -\sum_j \int \big(\prod_{i: X_i \in D_j} q_i(X_i)\big) \ln \Psi(D_j)\, dD_j + \sum_i \int q_i(X_i) \ln q_i(X_i)\, dX_i - \sum_i \beta_i \big(1 - \int q_i(X_i)\, dX_i\big)$
For KL divergence, the nonnegativity constraint comes for free
If $Q$ is an optimum of the original constrained problem, then there exist multipliers $\beta_i$ such that $(Q, \beta_i)$ is a stationary point of the Lagrange function (derivative equal to 0).
Deriving mean field fixed point condition III
$L = -\sum_j \int \big(\prod_{i: X_i \in D_j} q_i(X_i)\big) \ln \Psi(D_j)\, dD_j + \sum_i \int q_i(X_i) \ln q_i(X_i)\, dX_i - \sum_i \beta_i \big(1 - \int q_i(X_i)\, dX_i\big)$
$\dfrac{\partial L}{\partial q_i(X_i)} = -\sum_{j: X_i \in D_j} \int \big(\prod_{k \in D_j \setminus i} q_k(X_k)\big) \ln \Psi(D_j)\, dD_{j \setminus i} + \ln q_i(X_i) + 1 + \beta_i = 0$
$\Rightarrow q_i(X_i) = \exp\Big(\sum_{j: X_i \in D_j} \int \big(\prod_{k \in D_j \setminus i} q_k(X_k)\big) \ln \Psi(D_j)\, dD_{j \setminus i} - \beta_i - 1\Big)$
$q_i(X_i) = \dfrac{1}{Z_i} \exp\Big(\sum_{D_j: X_i \in D_j} E_{Q(D_j \setminus X_i)}\big[\ln \Psi(D_j)\big]\Big)$
Mean Field Algorithm
Initialize $Q(X_1, \dots, X_n) = \prod_i q_i(X_i)$ (e.g., randomly or smartly)
Set all variables to unprocessed
Pick an unprocessed variable $X_i$
Update $q_i$:
$q_i(X_i) = \dfrac{1}{Z_i} \exp\Big(\sum_{D_j: X_i \in D_j} E_{Q(D_j \setminus X_i)}[\ln \Psi(D_j)]\Big)$
Set variable $X_i$ as processed
If $q_i$ changed
Set neighbors of $X_i$ to unprocessed
Guaranteed to converge (to a local optimum)
Understanding mean field algorithm
$q_1(X_1) = \dfrac{1}{Z_1} \exp\big( E_{q(X_2)}[\ln \Psi(X_1, X_2)] + E_{q(X_3)}[\ln \Psi(X_1, X_3)] + E_{q(X_4)}[\ln \Psi(X_1, X_4)] + E_{q(X_5)}[\ln \Psi(X_1, X_5)] \big)$
$q_2(X_2) = \dfrac{1}{Z_2} \exp\big( E_{q(X_1)}[\ln \Psi(X_1, X_2)] + E_{q(X_6)}[\ln \Psi(X_2, X_6)] + E_{q(X_7)}[\ln \Psi(X_2, X_7)] \big)$
[Figure: grid MRF over $X_1, \dots, X_7$; the update for $q_1$ averages the log potentials over its neighbors' marginals $q(X_2), q(X_3), q(X_4), q(X_5)$, and the update for $q_2$ over $q(X_1), q(X_6), q(X_7)$]
Example of Mean Field
Mean field equation
$q_i(X_i) = \dfrac{1}{Z_i} \exp\Big(\sum_{D_j: X_i \in D_j} E_{Q(D_j \setminus X_i)}[\ln \Psi(D_j)]\Big)$
Pairwise Markov random field
ð ð1, ⊠, ðð â exp ððððððð + ððððððð
= exp (ððððððð)ðð exp (ðððð)ð
The mean field equation has simple form:
$q_i(X_i) = \dfrac{1}{Z_i} \exp\Big(\sum_{j \in N(i)} \theta_{ij} X_i\, E_{q_j}[X_j] + \theta_i X_i\Big) = \dfrac{1}{Z_i} \exp\Big(X_i \sum_{j \in N(i)} \theta_{ij} \langle X_j \rangle_{q_j} + \theta_i X_i\Big)$
[Figure: pairwise MRF with $\ln \Psi(X_i, X_j) = \theta_{ij} X_i X_j$ and $\ln \Psi(X_i) = \theta_i X_i$; the update for $q_i(X_i)$ uses the neighbor means $\langle X_j \rangle_{q_j}$]
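The pairwise update can be sketched in code. For states $X_i \in \{-1, +1\}$, the mean field equation reduces to $\mu_i = \tanh\big(\theta_i + \sum_j \theta_{ij} \mu_j\big)$ with $\mu_i = \langle X_i \rangle_{q_i}$; a minimal sketch, where `mean_field_ising` and its arguments are illustrative names, not from the lecture:

```python
import numpy as np

def mean_field_ising(theta_edge, theta_node, n_iters=100):
    """Coordinate mean-field updates for a pairwise MRF with states in {-1, +1}.

    theta_edge: symmetric (n, n) array of edge parameters theta_ij
    (zero where no edge exists); theta_node: length-n array of theta_i.
    Returns the mean-field means mu_i = <X_i>_{q_i}, from which
    q_i(X_i = +1) = (1 + mu_i) / 2.
    """
    n = len(theta_node)
    mu = np.zeros(n)
    for _ in range(n_iters):
        for i in range(n):
            # q_i(x_i) ∝ exp(x_i * (theta_i + sum_j theta_ij <X_j>)),
            # so the mean of x_i in {-1, +1} is a tanh of the local field.
            mu[i] = np.tanh(theta_node[i] + theta_edge[i] @ mu)
    return mu
```

With all couplings zero the updates decouple and return $\mu_i = \tanh(\theta_i)$ exactly, which is a handy sanity check.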
Example of generalized mean field
$Q(C_1) \propto \exp\Big( \sum_{i \in C_1} \theta_i X_i + \sum_{(i,j) \in E:\, i, j \in C_1} \theta_{ij} X_i X_j + \sum_{i \in C_1,\, j \in C_k \cap MB(C_1)} \theta_{ij} X_i \langle X_j \rangle_{Q(C_k)} \Big)$
[Figure: variables grouped into clusters $C_1, \dots, C_4$; $MB(C_1)$ is the Markov blanket of $C_1$]
Node potential within $C_1$
Edge potential within $C_1$
Mean of variables in the Markov blanket
Edge potential across $C_1$ and $C_k$
Why Sampling
Previous inference tasks focused on obtaining the entire posterior distribution $P(X_i \mid e)$
Often we want to take expectations
Mean: $\mu_{X_i \mid e} = E[X_i \mid e] = \int x_i\, P(x_i \mid e)\, dx_i$
Variance: $\sigma_{X_i \mid e}^2 = E[(X_i - \mu_{X_i \mid e})^2 \mid e] = \int (x_i - \mu_{X_i \mid e})^2\, P(x_i \mid e)\, dx_i$
More generally, $E[f] = \int f(x)\, P(x \mid e)\, dx$, which can be difficult to compute analytically
Key idea: approximate expectation by sample average
$E[f] \approx \dfrac{1}{N} \sum_{n=1}^{N} f(x_n)$
where $x_1, \dots, x_N \sim P(X \mid e)$ independently and identically
Sampling
Samples: points from the domain of a distribution $P(X)$
The higher $P(x)$ is, the more likely we are to see $x$ in the sample
Mean: $m = \dfrac{1}{N} \sum_n x_n$
Variance: $s^2 = \dfrac{1}{N} \sum_n (x_n - m)^2$
[Figure: density $P(X)$ over $X$, with samples $x_1, \dots, x_6$ drawn from it]
For $n$ independent Gaussian variables with 0 mean and unit variance, use x = randn(n, 1) in Matlab
For a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$:
Draw $y \sim$ randn(n, 1)
Transform the sample: $x = \mu + \Sigma^{1/2} y$
$X$ takes a finite number of values, e.g., $\{1, 2, 3\}$, with $P(X = 1) = 0.7$, $P(X = 2) = 0.2$, $P(X = 3) = 0.1$
Draw a sample from the uniform distribution: $u \sim U[0, 1]$
If ð¢ falls into interval [0, 0.7), output sample ð¥ = 1
If ð¢ falls into interval [0.7, 0.9), output sample ð¥ = 2
If ð¢ falls into interval [0.9, 1], output sample ð¥ = 3
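Both recipes can be sketched in a few lines of NumPy; `sample_discrete` and `sample_mvn` are illustrative names, and a Cholesky factor is used as one valid choice of the matrix square root $\Sigma^{1/2}$:

```python
import numpy as np

def sample_discrete(probs, rng):
    """Inverse-CDF sampling: draw u ~ U[0,1] and return the first value
    (1, 2, 3, ...) whose cumulative-probability interval contains u."""
    u = rng.uniform()
    return int(np.searchsorted(np.cumsum(probs), u)) + 1

def sample_mvn(mu, Sigma, rng):
    """Draw y ~ N(0, I) and transform: x = mu + L y, where L L^T = Sigma
    (the Cholesky factor, one valid square root of Sigma)."""
    y = rng.standard_normal(len(mu))
    return mu + np.linalg.cholesky(Sigma) @ y
```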
Sampling Example
[Figure: the interval $[0, 1]$ partitioned into segments of length 0.7 (value 1), 0.2 (value 2), and 0.1 (value 3)]
Generate Samples from Bayesian Networks
A BN describes a generative process for observations
First, sort the nodes in topological order
Then, generate sample using this order according to the CPTs
Generate a set of samples for (A, F, S, N, H):
Sample $a_n \sim P(A)$
Sample $f_n \sim P(F)$
Sample $s_n \sim P(S \mid a_n, f_n)$
Sample $n_n \sim P(N \mid s_n)$
Sample $h_n \sim P(H \mid s_n)$
[Figure: Bayesian network with Allergy and Flu as parents of Sinus, and Sinus as the parent of Nose and Headache]
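The ancestral-sampling procedure can be sketched as below. The lecture gives only the network structure, so the CPT numbers here are made up purely for illustration, with all variables binary 0/1:

```python
import numpy as np

# Hypothetical CPTs (the slides give only the structure): Allergy (A)
# and Flu (F) are parents of Sinus (S); S is the parent of Nose (N)
# and Headache (H).
P_A = 0.3                                   # P(A = 1)
P_F = 0.1                                   # P(F = 1)
P_S = {(0, 0): 0.05, (0, 1): 0.7,           # P(S = 1 | A, F)
       (1, 0): 0.6, (1, 1): 0.9}
P_N = {0: 0.1, 1: 0.8}                      # P(N = 1 | S)
P_H = {0: 0.05, 1: 0.6}                     # P(H = 1 | S)

def sample_once(rng):
    """Sample (a, f, s, n, h) following the topological order A, F, S, N, H."""
    a = int(rng.uniform() < P_A)
    f = int(rng.uniform() < P_F)
    s = int(rng.uniform() < P_S[(a, f)])
    n = int(rng.uniform() < P_N[s])
    h = int(rng.uniform() < P_H[s])
    return a, f, s, n, h
```

Each variable is sampled only after its parents, which is exactly why the topological sort comes first.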
Challenge in sampling
How to draw sample from a given distribution?
Not all distributions can be trivially sampled
How to make better use of samples (not all samples are equally useful)?
How do we know we've sampled enough?
Sampling Methods
Direct Sampling
Simple
Works only for easy distributions
Rejection Sampling
Create samples like direct sampling
Only count samples consistent with given evidence
Importance Sampling
Create samples like direct sampling
Assign weights to samples
Gibbs Sampling
Often used for high-dimensional problems
Uses a variable and its Markov blanket for sampling
Rejection sampling I
Want to sample from the distribution $P(X) = \frac{1}{Z} f(X)$
$P(X)$ is difficult to sample from, but $f(X)$ is easy to evaluate
Key idea: sample from a simpler distribution Q(X), and then select samples
Rejection sampling (choose $M$ s.t. $f(X) \le M\, Q(X)$):
$n = 1$
Repeat until $n = N$:
$x \sim Q(X)$, $u \sim U[0, 1]$
If $u \le \frac{f(x)}{M\, Q(x)}$, accept $x$ and $n = n + 1$; otherwise, reject $x$
Reject a sample with probability $1 - \frac{f(x)}{M\, Q(x)}$
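The loop above, as a sketch (`rejection_sample` is an illustrative name). The usage example targets the unnormalized density $f(x) = x(1 - x)$ on $[0, 1]$ with proposal $Q = U[0, 1]$ and $M = 1/4$, so $f \le M\, Q$ everywhere:

```python
import numpy as np

def rejection_sample(f, sample_q, q_pdf, M, n_samples, rng):
    """Accept x ~ Q with probability f(x) / (M * Q(x)); valid only when
    f(x) <= M * Q(x) for all x."""
    accepted = []
    while len(accepted) < n_samples:
        x = sample_q(rng)
        u = rng.uniform()
        if u <= f(x) / (M * q_pdf(x)):
            accepted.append(x)
    return np.array(accepted)

rng = np.random.default_rng(0)
# Target f(x) = x (1 - x), i.e., Beta(2, 2) up to normalization; the
# uniform proposal has density 1, and max f = 1/4, so M = 0.25 works.
xs = rejection_sample(lambda x: x * (1 - x), lambda r: r.uniform(),
                      lambda x: 1.0, 0.25, 5000, rng)
```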
Rejection sampling II
Sample $x \sim Q(X)$ and reject with probability $1 - \frac{f(x)}{M\, Q(x)}$
The likelihood of seeing a particular $x$:
$p(x) = \dfrac{\frac{f(x)}{M\, Q(x)}\, Q(x)}{\int \frac{f(x')}{M\, Q(x')}\, Q(x')\, dx'} = \dfrac{f(x)}{\int f(x')\, dx'} = \dfrac{1}{Z} f(x)$
[Figure: target $f(X)$ under the envelope $M\, Q(X)$; for a draw $x_1 \sim Q(X)$ and $u_1 \sim U[0, 1]$, the region between the $f$ and $M\, Q$ curves is the rejection region]
Drawback of rejection sampling
Can have very small acceptance rate
Adaptive rejection sampling: use an envelope function for Q
[Figure: a poorly matched envelope $M\, Q(X)$ far above $f(X)$ leaves a big rejection region]
Importance Sampling I
Suppose sampling from $P(X)$ is hard
We can sample from a simpler proposal distribution $Q(X)$
If $Q$ dominates $P$ (i.e., $Q(X) > 0$ whenever $P(X) > 0$), we can sample from $Q$ and reweight:
$E_P[f(X)] = \int f(X)\, P(X)\, dX = \int f(X)\, \dfrac{P(X)}{Q(X)}\, Q(X)\, dX$
Sample $x_n \sim Q(X)$
Reweight sample $x_n$ with $w_n = \dfrac{P(x_n)}{Q(x_n)}$
Approximate the expectation with $\dfrac{1}{N} \sum_n f(x_n)\, w_n$
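A sketch with concrete illustrative choices: target $P = N(0, 1)$, proposal $Q = N(0, 2)$ (which dominates $P$), and $f(x) = x^2$, whose exact expectation under $P$ is 1:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated elementwise."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Sample from the proposal Q = N(0, 2), weight each draw by w = P/Q
# for the target P = N(0, 1), and average f(x) * w with f(x) = x^2.
x = rng.normal(0.0, 2.0, size=200_000)
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)
estimate = float(np.mean(x ** 2 * w))
```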
Importance Sampling II
Instead of rejecting samples, we reweight them
The efficiency of importance sampling depends on how close the proposal Q is to the target P
[Figure: target $P(X)$ and proposal $Q(X)$; samples $x_1, x_2 \sim Q(X)$ receive weights $w_1 = P(x_1)/Q(x_1)$ and $w_2 = P(x_2)/Q(x_2)$]
Unnormalized importance sampling
Suppose we can only evaluate $P(X) = \frac{1}{Z} f(X)$ (e.g., a grid MRF, where $Z$ is difficult to compute), and we don't know $Z$
We can get around the normalization $Z$ by normalizing the importance weights:
$r(X) = \dfrac{f(X)}{Q(X)} \;\Rightarrow\; E_Q[r(X)] = \int \dfrac{f(X)}{Q(X)}\, Q(X)\, dX = \int f(X)\, dX = Z$
$E_P[g(X)] = \int g(X)\, P(X)\, dX = \dfrac{1}{Z} \int g(X)\, \dfrac{f(X)}{Q(X)}\, Q(X)\, dX = \dfrac{\int g(X)\, \frac{f(X)}{Q(X)}\, Q(X)\, dX}{\int \frac{f(X)}{Q(X)}\, Q(X)\, dX} \approx \dfrac{\sum_n g(x_n)\, r(x_n)}{\sum_n r(x_n)}$
Define $w_n = \dfrac{r(x_n)}{\sum_{n'} r(x_{n'})}$ as the new importance weight
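A sketch of the self-normalized estimator with illustrative choices: the unnormalized target is $f(x) = \exp(-x^2/2)$ (so $P = N(0, 1)$ and $Z = \sqrt{2\pi}$, but the code never uses $Z$), the proposal is $Q = N(0, 2)$, and $g(x) = x^2$; the mean of the ratios $r(x_n)$ also estimates $Z$, matching $E_Q[r(X)] = Z$:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated elementwise."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Unnormalized target f(x) = exp(-x^2 / 2); proposal Q = N(0, 2).
x = rng.normal(0.0, 2.0, size=200_000)
r = np.exp(-x ** 2 / 2) / normal_pdf(x, 0.0, 2.0)   # r(x) = f(x) / Q(x)
w = r / np.sum(r)                                   # normalized weights
estimate = float(np.sum(x ** 2 * w))                # approximates E_P[g], g(x) = x^2
z_hat = float(np.mean(r))                           # approximates Z = sqrt(2 * pi)
```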
Difference between normalized and unnormalized
Unnormalized importance sampling is unbiased
Normalized importance sampling is biased; e.g., for $N = 1$:
$E_Q\left[\dfrac{g(x_1)\, w_1}{w_1}\right] = E_Q[g(x_1)] \neq E_P[g(x_1)]$
However, the normalized importance sampler usually has lower variance
But most importantly, the normalized importance sampler works for unnormalized distributions
For the normalized sampler, one can also do resampling based on $w_n$ (a multinomial distribution with probability $w_n$ for $x_n$)
Gibbs Sampling
Both rejection sampling and importance sampling do not scale well to high dimensions
Markov Chain Monte Carlo (MCMC) is an alternative
Key idea: Construct a Markov chain whose stationary distribution is the target distribution ð ð
Sampling process: random walk in the Markov chain
Gibbs sampling is a very special and simple MCMC method.
Markov Chain Monte Carlo
Want to sample from $P(X)$; start with a random initial vector $X$
$X^t$: the value of $X$ at time step $t$
$X^t$ transitions to $X^{t+1}$ with probability
$T(X^{t+1} \mid X^t, \dots, X^1) = T(X^{t+1} \mid X^t)$
The stationary distribution of $T(X^{t+1} \mid X^t)$ is our $P(X)$
Run the chain for some initial samples (burn-in time) until it converges/mixes/reaches the stationary distribution
Then collect $N$ (correlated) samples as $x_n$
Key issues: designing the transition kernel, and diagnosing convergence
Gibbs Sampling
A very special transition kernel, works nicely with Markov blanket in GMs.
The procedure
We have a set of variables $X = \{X_1, \dots, X_K\}$ in a GM.
At each step, one variable $X_i$ is selected (at random or in some fixed sequence); denote the remaining variables as $X_{-i}$ and their current values as $x_{-i}^t$
Compute the conditional distribution $P(X_i \mid x_{-i}^t)$
A value $x_i^t$ is sampled from this distribution
This sample $x_i^t$ replaces the previous sampled value of $X_i$ in $x$
Gibbs Sampling in formula
Gibbs sampling:
$x = x^0$
For $t = 1$ to $N$:
$x_1^t \sim P(X_1 \mid x_2^{t-1}, \dots, x_K^{t-1})$
$x_2^t \sim P(X_2 \mid x_1^t, x_3^{t-1}, \dots, x_K^{t-1})$
$\dots$
$x_K^t \sim P(X_K \mid x_1^t, \dots, x_{K-1}^t)$
Variants:
Randomly pick a variable to sample
Sample block by block
[Figure: MRF over $X_1, \dots, X_5$]
For graphical models, we only need to condition on the variables in the Markov blanket
Gibbs Sampling: Examples
Pairwise Markov random field
ð ð1, ⊠, ðð â exp ððððððð + ððððððð
In the Gibbs sampling step, we need conditional ð(ð1|ð2, âŠðð)
ð(ð1|ð2, âŠðð) âexp ð12ð1ð2 + ð13ð1ð3 + ð14ð1ð4 + ð15ð1ð5 + ð1ð1
[Figure: grid MRF over $X_1, \dots, X_7$; the Markov blanket of $X_1$ is $\{X_2, X_3, X_4, X_5\}$]
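A Gibbs sampler for this pairwise model, as a sketch (states in $\{-1, +1\}$; `gibbs_ising` and its arguments are illustrative names). For $x_i \in \{-1, +1\}$ the conditional above reduces to $P(X_i = +1 \mid \text{rest}) = \sigma\big(2(\theta_i + \sum_j \theta_{ij} x_j)\big)$, a logistic function of the Markov-blanket field:

```python
import numpy as np

def gibbs_ising(theta_edge, theta_node, n_sweeps, rng):
    """Gibbs sampling for a pairwise MRF with states in {-1, +1}.

    Each update resamples one variable from its conditional, which
    depends only on its Markov blanket (the j with theta_ij != 0):
    P(X_i = +1 | rest) = 1 / (1 + exp(-2 * (theta_i + sum_j theta_ij x_j))).
    Returns one sample per full sweep (no burn-in is discarded here).
    """
    n = len(theta_node)
    x = rng.choice(np.array([-1, 1]), size=n)
    samples = []
    for _ in range(n_sweeps):
        for i in range(n):
            # Local field from the Markov blanket (exclude any self term).
            field = theta_node[i] + theta_edge[i] @ x - theta_edge[i, i] * x[i]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            x[i] = 1 if rng.uniform() < p_plus else -1
        samples.append(x.copy())
    return np.array(samples)
```

With zero couplings, the long-run mean of each variable should approach $\tanh(\theta_i)$, which gives a simple correctness check.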
Diagnose convergence
Good chain
[Trace plot: sampled value vs. iteration number, mixing well]
Diagnose convergence
Bad chain
[Trace plot: sampled value vs. iteration number, mixing poorly]
Sampling Methods
Direct Sampling
Simple
Works only for easy distributions
Rejection Sampling
Create samples like direct sampling
Only count samples consistent with given evidence
Importance Sampling
Create samples like direct sampling
Assign weights to samples
Gibbs Sampling
Often used for high-dimensional problems
Uses a variable and its Markov blanket for sampling