UAI MCMC Tutorial
TRANSCRIPT
-
8/7/2019 Uai Mcmc Tutorial
1/61
Inference on Relational Models Using
Markov Chain Monte Carlo
Brian Milch
Massachusetts Institute of Technology
UAI Tutorial
July 19, 2007
-
Example 1: Bibliographies
Two noisy citations of the same book:
  S. Russel and P. Norvig (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.
  Russell, Stuart and Norvig, Peter. Articial Intelligence. Prentice-Hall, 1995.
Underlying objects: researchers Stuart Russell and Peter Norvig; book "Artificial Intelligence: A Modern Approach"
-
Example 2: Aircraft Tracking
[Figure: radar blips at t=1, t=2, t=3, each labeled with an estimated position/state vector such as (1.9, 6.1, 2.2), (0.6, 5.9, 3.2), (1.8, 7.4, 2.3), ...]
-
Inference on Relational Structures
[Figure: candidate assignments of truncated citation strings ("Rus...", "AI: A Mod...", "Haml...", ...) to underlying researchers (Russell, Norvig, Roberts, Seuss, Shakespeare) and their works, each hypothesis scored with a posterior probability: 1.2 x 10^-12, 2.3 x 10^-12, 4.5 x 10^-14, 6.7 x 10^-16, 8.9 x 10^-16, 5.0 x 10^-20]
-
Markov Chain Monte Carlo (MCMC)
* Run a Markov chain s1, s2, ... over worlds where the evidence E is true
* Approximate P(Q | E) as the fraction of s1, s2, ... that satisfy the query Q
[Figure: outcome space with evidence region E and query region Q]
-
Outline
* Probabilistic models for relational structures
  - Modeling the number of objects
  - Three mistakes that are easy to make
* Markov chain Monte Carlo (MCMC)
  - Gibbs sampling
  - Metropolis-Hastings
  - MCMC over events
* Case studies
  - Citation matching
  - Multi-target tracking
-
Simple Example: Clustering
[Figure: bird wingspans (cm) on an axis from 10 to 100, forming clusters around θ = 22, θ = 49, θ = 80]
-
Simple Bayesian Mixture Model
* Number of latent objects is known to be k
* For each latent object i, a parameter:
    θ_i ~ Uniform[0, 100]
* For each data point j, an object selector and an observable value:
    C_j ~ Uniform({1, ..., k})
    X_j ~ Normal(θ_{C_j}, 25)
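As a concrete sketch, the generative process above can be written in a few lines of Python (hypothetical code: 0-based cluster labels for convenience, and standard deviation 5, i.e. variance 25):

```python
import random

random.seed(0)

def sample_mixture(n, k=3):
    """theta_i ~ Uniform[0, 100]; C_j ~ Uniform over the k labels (0-based here);
    X_j ~ Normal(theta_{C_j}, 25), i.e. standard deviation 5."""
    theta = [random.uniform(0.0, 100.0) for _ in range(k)]   # one parameter per latent object
    c = [random.randrange(k) for _ in range(n)]              # object selector per data point
    x = [random.gauss(theta[cj], 5.0) for cj in c]           # observed wingspans
    return theta, c, x

theta, c, x = sample_mixture(10)
```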
-
BN for Mixture Model
[Figure: Bayesian network with parameter nodes θ1 ... θk, selector nodes C1 ... Cn, and observations X1 ... Xn; every θi and the selector Cj are parents of Xj]
-
Context-Specific Dependencies
[Figure: the same network, with selector values shown (C1 = 2, C2 = 1, C3 = 2, ...); given its selector's value, each Xj depends on only one θi]
-
Extensions to Mixture Model
* Random number of latent objects k, with a distribution p(k) such as:
  - Uniform({1, ..., 100})
  - Geometric(0.1)
  - Poisson(10)  ← unbounded!
* Random distribution π for selecting objects:
    p(π | k) ~ Dirichlet(α1, ..., αk)   (Dirichlet: a distribution over probability vectors)
  Still symmetric: each αi = α/k
-
Existence versus Observation
* A latent object can exist even if no observations correspond to it
  - Bird species may not be observed yet
  - Aircraft may fly over without yielding any blips
* Two questions:
  - How many objects correspond to observations?
  - How many objects are there in total?
* Observed 3 species, each 100 times: probably no more
* Observed 200 species, each 1 or 2 times: probably more exist
-
Expecting Additional Objects
* P(ever observe a new species | seen r so far) is bounded by P(k > r)
* So as the number of observed species → ∞, the probability of ever seeing more → 0
* What if we don't want this?
[Figure: r observed species; will we observe more later?]
-
Dirichlet Process Mixtures
* Set k = ∞ and let π be an infinite-dimensional probability vector with a stick-breaking prior [Ferguson 1983; Sethuraman 1994]
* Another view: define the prior directly on partitions of data points, allowing an unbounded number of blocks
* Drawback: can't ask about the number of unobserved latent objects (always infinite)
[Figure: stick-breaking weights π1, π2, π3, π4, π5, ...]
[tutorials: Jordan 2005; Sudderth 2006]
-
Outline
* Probabilistic models for relational structures
  - Modeling the number of objects
  - Three mistakes that are easy to make
* Markov chain Monte Carlo (MCMC)
  - Gibbs sampling
  - Metropolis-Hastings
  - MCMC over events
* Case studies
  - Citation matching
  - Multi-target tracking
-
Mistake 1: Ignoring Interchangeability
* Which birds are in species S1? Latent object indices are interchangeable:
  - The posterior on a selector variable such as C_B1 is uniform
  - The posterior on θ_S1 has a peak for each cluster of birds
* What we really care about is the partition of the observations, e.g. {{1, 3}, {2}, {4, 5}}
* A partition with r blocks corresponds to k! / (k - r)! instantiations of the C_j variables:
  (1, 2, 1, 3, 3), (1, 2, 1, 4, 4), (1, 4, 1, 3, 3), (2, 1, 2, 3, 3), ...
-
Ignoring Interchangeability, Cont'd
* Say k = 4. What's the prior probability that B1 and B3 are in one species and B2 in another?
* Multiply the probabilities for C_B1, C_B2, C_B3: (1/4) × (1/4) × (1/4)?
* Not enough! The partition {{B1, B3}, {B2}} corresponds to 12 instantiations of the Cs:
  (S1, S2, S1), (S1, S3, S1), (S1, S4, S1), (S2, S1, S2), (S2, S3, S2), (S2, S4, S2),
  (S3, S1, S3), (S3, S2, S3), (S3, S4, S3), (S4, S1, S4), (S4, S2, S4), (S4, S3, S4)
* A partition with r blocks corresponds to kPr = k! / (k - r)! instantiations
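The count of 12 can be checked by brute-force enumeration (a small sketch; the 0-based species labels are just a convenient encoding):

```python
from itertools import product

k = 4
# Assignments (c1, c2, c3) of species labels to birds B1, B2, B3 that
# induce the partition {{B1, B3}, {B2}}: B1 and B3 share a species,
# B2 sits in a different one.
count = sum(1 for c1, c2, c3 in product(range(k), repeat=3)
            if c1 == c3 and c2 != c1)
print(count)   # 12, i.e. kPr = 4 * 3 with r = 2 blocks
```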
-
Mistake 2: Underestimating the Bayesian Ockham's Razor Effect
* Say k = 4. Are B1 and B2 in the same species?
* Maximum-likelihood estimation would yield one species with θ = 50 and another with θ = 52
* But a Bayesian model trades off the likelihood against the prior probability of getting those θ values
[Figure: wingspan axis (cm), 10 to 100, with observations X_B1 = 50 and X_B2 = 52]
-
Bayesian Ockham's Razor
[Figure: wingspan axis (cm) with X_B1 = 50, X_B2 = 52]
H1: Partition is {{B1, B2}}
  p(H1, data) = (1/4) ∫₀¹⁰⁰ p(θ) p(x_B1 | θ) p(x_B2 | θ) dθ ≈ 1.3 × 10⁻⁴
H2: Partition is {{B1}, {B2}}
  p(H2, data) = (3/4) [∫₀¹⁰⁰ p(θ₁) p(x_B1 | θ₁) dθ₁] [∫₀¹⁰⁰ p(θ₂) p(x_B2 | θ₂) dθ₂] ≈ 7.5 × 10⁻⁵
  (the prior density p(θ) of Uniform(0, 100) is 0.01)
Don't use more latent objects than necessary to explain your data
[MacKay 1992]
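These two marginal likelihoods can be reproduced numerically (a sketch assuming the mixture model from the earlier slides: Uniform(0, 100) prior, Normal likelihood with standard deviation 5, and partition priors 1/4 and 3/4 for k = 4):

```python
from math import exp, pi, sqrt

def npdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def integrate(f, a, b, n=20000):
    """Trapezoidal rule; good enough for these smooth 1-D integrands."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

prior = 0.01                    # density of the Uniform(0, 100) prior on theta
x1, x2, sd = 50.0, 52.0, 5.0

# H1: {{B1, B2}} share one theta; partition prior 1/4 when k = 4
h1 = 0.25 * integrate(lambda t: prior * npdf(x1, t, sd) * npdf(x2, t, sd), 0.0, 100.0)

# H2: {{B1}, {B2}} have independent thetas; partition prior 3/4
marg = lambda x: integrate(lambda t: prior * npdf(x, t, sd), 0.0, 100.0)
h2 = 0.75 * marg(x1) * marg(x2)

print(h1, h2, h1 > h2)   # h1 ≈ 1.4e-4, h2 ≈ 7.5e-5: the one-species hypothesis wins
```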
-
Mistake 3: Comparing Densities Across Dimensions
[Figure: wingspan axis (cm) with X_B1 = 50, X_B2 = 52]
H1: Partition is {{B1, B2}}, θ = 51
  p(H1, data) = (1/4) × 0.01 × N(50; 51, 5²) × N(52; 51, 5²) ≈ 1.5 × 10⁻⁵
H2: Partition is {{B1}, {B2}}, θ_B1 = 50, θ_B2 = 52
  p(H2, data) = (3/4) × 0.01² × N(50; 50, 5²) × N(52; 52, 5²) ≈ 4.8 × 10⁻⁷
H1 wins by a greater margin
-
What If We Change the Units?
[Figure: wingspan axis (m), 0.1 to 1.0, with X_B1 = 0.50, X_B2 = 0.52]
H1: Partition is {{B1, B2}}, θ = 0.51
  p(H1, data) = (1/4) × 1 × N(0.50; 0.51, 0.05²) × N(0.52; 0.51, 0.05²) ≈ 15
H2: Partition is {{B1}, {B2}}, θ_B1 = 0.50, θ_B2 = 0.52
  p(H2, data) = (3/4) × 1² × N(0.50; 0.50, 0.05²) × N(0.52; 0.52, 0.05²) ≈ 48
  (the density of Uniform(0, 1) is 1!)
Now H2 wins by a landslide
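A short sketch makes the unit dependence concrete (same model as above; the 1/4 and 3/4 partition priors for k = 4 are taken from the preceding slides):

```python
from math import exp, pi, sqrt

def npdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def scores(x1, x2, sd, prior_density):
    """Peak 'joint densities' for the two hypotheses, as on the slide:
    H1 plugs in the shared theta at the midpoint; H2 puts one theta on
    each observation."""
    mid = (x1 + x2) / 2.0
    h1 = 0.25 * prior_density * npdf(x1, mid, sd) * npdf(x2, mid, sd)
    h2 = 0.75 * prior_density ** 2 * npdf(x1, x1, sd) * npdf(x2, x2, sd)
    return h1, h2

h1_cm, h2_cm = scores(50.0, 52.0, 5.0, 0.01)   # centimetres: Uniform(0, 100) prior
h1_m,  h2_m  = scores(0.50, 0.52, 0.05, 1.0)   # metres: Uniform(0, 1) prior

print(h1_cm > h2_cm)   # True: in cm, H1 "wins"
print(h1_m > h2_m)     # False: in m, H2 "wins" on exactly the same data
```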
-
Lesson: Comparing Densities Across Dimensions
* Densities don't behave like probabilities (e.g., they can be greater than 1)
* Heights of density peaks in spaces of different dimension are not comparable
* Work-arounds:
  - Find the most likely partition first, then the most likely parameters given that partition
  - Find the region in parameter space where most of the posterior probability mass lies
-
Outline
* Probabilistic models for relational structures
  - Modeling the number of objects
  - Three mistakes that are easy to make
* Markov chain Monte Carlo (MCMC)
  - Gibbs sampling
  - Metropolis-Hastings
  - MCMC over events
* Case studies
  - Citation matching
  - Multi-target tracking
-
Why Not Exact Inference?
* The number of possible partitions is superexponential in n
* Variable elimination?
  - Summing out θ_i couples all the C_j's
  - Summing out C_j couples all the θ_i's
[Figure: mixture-model Bayesian network with θ1 ... θk, C1 ... Cn, X1 ... Xn]
-
Markov Chain Monte Carlo (MCMC)
* Start in an arbitrary state (possible world) s1 satisfying the evidence E
* Sample s2, s3, ... according to a transition kernel T(si, si+1), yielding a Markov chain
* Approximate p(Q | E) by the fraction of s1, s2, ..., sL that are in Q
[Figure: outcome space with evidence region E and query region Q]
-
Why a Markov Chain?
Why use a Markov chain rather than sampling independently?
* Stochastic local search for high-probability s
* Once we find such an s, explore around it
-
Convergence
* A stationary distribution π is one satisfying:
    π(s) = Σ_{s'} π(s') T(s', s)
* If the chain is ergodic (can get to anywhere from anywhere*), then:
  - It has a unique stationary distribution π
  - The fraction of s1, s2, ..., sL in Q converges to π(Q) as L → ∞
* We'll design T so that π(s) = p(s | E)
(* and it's aperiodic)
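The stationarity condition π(s) = Σ_{s'} π(s') T(s', s) can be checked numerically on a toy chain (the 3-state kernel below is a made-up example, not from the tutorial):

```python
# Toy 3-state transition kernel T (rows sum to 1); made-up numbers.
T = [[0.5, 0.25, 0.25],
     [0.2, 0.6,  0.2 ],
     [0.3, 0.3,  0.4 ]]

# Find the stationary distribution by iterating pi <- pi T.
pi = [1.0 / 3.0] * 3
for _ in range(1000):
    pi = [sum(pi[s] * T[s][t] for s in range(3)) for t in range(3)]

# Verify pi(s) = sum over s' of pi(s') T(s', s) holds at the fixed point.
resid = max(abs(pi[t] - sum(pi[s] * T[s][t] for s in range(3))) for t in range(3))
print(resid < 1e-12)   # True
```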
-
Gibbs Sampling
* Order the non-evidence variables V1, V2, ..., Vm
* Given state s, sample from T as follows:
    Let s' = s
    For i = 1 to m:
      Sample v'_i from p(Vi | s'_{-i})   ← the conditional for Vi given the other vars in s'
      Let s' = (s'_{-i}, Vi = v'_i)
    Return s'
* Theorem: the stationary distribution is p(s | E)
[Geman & Geman 1984]
-
Gibbs on a Bayesian Network
* The conditional for V depends only on the factors that contain V
* So condition on V's Markov blanket mb(V): parents, children, and co-parents:
    p(v | s_{-V}) ∝ p(v | s[Pa(V)]) ∏_{Y ∈ ch(V)} p(s[Y] | s[Pa(Y)], V = v)
-
Gibbs on the Bayesian Mixture Model
Given the current state s:
* Resample each θ_i given the prior and {X_j : C_j = i in s}
* Resample each C_j given X_j and θ_{1:k}   (context-specific Markov blanket)
[Figure: mixture-model Bayesian network]
[Neal 2000]
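A minimal Gibbs sweep for this model might look as follows (a sketch only: the Uniform(0, 100) prior on θ is treated as flat, ignoring its truncation, so the conditional for θ_i is a plain Normal):

```python
import random
from math import exp

def gibbs_sweep(x, c, theta, sd=5.0):
    """One Gibbs sweep for the mixture model. Sketch: with a flat prior,
    the conditional for theta_i is Normal(mean of assigned points, sd^2/n)."""
    k, n = len(theta), len(x)
    # Resample each theta_i given the points currently assigned to it.
    for i in range(k):
        pts = [x[j] for j in range(n) if c[j] == i]
        if pts:
            mu = sum(pts) / len(pts)
            theta[i] = random.gauss(mu, sd / len(pts) ** 0.5)
        else:   # no assigned points: fall back to the prior
            theta[i] = random.uniform(0.0, 100.0)
    # Resample each selector C_j given x_j and theta (context-specific blanket).
    for j in range(n):
        w = [exp(-0.5 * ((x[j] - theta[i]) / sd) ** 2) for i in range(k)]
        c[j] = random.choices(range(k), weights=w)[0]
    return c, theta
```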
-
Sampling Given the Markov Blanket
    p(v | s_{-V}) ∝ p(v | s[Pa(V)]) ∏_{Y ∈ ch(V)} p(s[Y] | s[Pa(Y)], V = v)
* If V is discrete, just iterate over the values, normalize, and sample from the discrete distribution
* If V is continuous:
  - Simple if the child distributions are conjugate to V's prior: the posterior has the same form as the prior, with different parameters
  - In general, even sampling from p(v | s_{-V}) can be hard
[See BUGS software: http://www.mrc-bsu.cam.ac.uk/bugs]
-
Convergence Can Be Slow
[Figure: wingspan axis (cm); the data should form two clusters, but θ1 = 20 while θ2 = 90 (species 2) is far away]
* The C_j's won't change until θ2 is in the right area
* θ2 does an unguided random walk as long as no observations are associated with it
* Especially bad in high dimensions
-
Outline
* Probabilistic models for relational structures
  - Modeling the number of objects
  - Three mistakes that are easy to make
* Markov chain Monte Carlo (MCMC)
  - Gibbs sampling
  - Metropolis-Hastings
  - MCMC over events
* Case studies
  - Citation matching
  - Multi-target tracking
-
Metropolis-Hastings
Define T(si, si+1) as follows:
* Sample s' from a proposal distribution q(s' | s)
* Compute the acceptance probability:
    α = min(1, [p(s' | E) q(s | s')] / [p(s | E) q(s' | s)])
  (relative posterior probabilities × backward/forward proposal probabilities)
* With probability α, let si+1 = s'; else let si+1 = si
Can show that p(s | E) is the stationary distribution for T
[Metropolis et al. 1953; Hastings 1970]
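The transition kernel can be sketched generically in log space (hypothetical helper names; the 1-D standard-normal target with a symmetric random-walk proposal is just an illustration, not from the tutorial):

```python
import random
from math import exp

def mh_step(s, log_p, propose, log_q):
    """One Metropolis-Hastings transition. log_p(s) is the (unnormalized)
    log posterior; propose(s) samples s' ~ q(. | s); log_q(a, b) = log q(a | b)."""
    s_new = propose(s)
    log_alpha = (log_p(s_new) - log_p(s)) + (log_q(s, s_new) - log_q(s_new, s))
    if log_alpha >= 0.0 or random.random() < exp(log_alpha):
        return s_new        # accept
    return s                # reject: the chain stays where it is

# Illustration: standard-normal target; the q-ratio cancels for a symmetric q.
log_p = lambda x: -0.5 * x * x
propose = lambda x: x + random.gauss(0.0, 1.0)
log_q = lambda a, b: 0.0

random.seed(0)
s, samples = 0.0, []
for _ in range(20000):
    s = mh_step(s, log_p, propose, log_q)
    samples.append(s)
mean = sum(samples) / len(samples)   # should be near 0
```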
-
Metropolis-Hastings
Benefits:
* The proposal distribution can propose big steps involving several variables
* Only need to compute the ratio p(s' | E) / p(s | E), ignoring normalization factors
* Don't need to sample from conditional distributions
Limitations:
* Proposals must be reversible, else q(s | s') = 0
* Need to be able to compute q(s | s') / q(s' | s)
-
Split-Merge Proposals
Choose two observations i, j:
* If C_i = C_j = c, then split cluster c:
  - Get an unused latent object c'
  - For each observation m such that C_m = c, change C_m to c' with probability 0.5
  - Propose new values for θ_c, θ_c'
* Else merge clusters C_i and C_j:
  - For each m such that C_m = C_j, set C_m = C_i
  - Propose a new value for θ_c
[Jain & Neal 2004]
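A stripped-down version of the proposal over the selector variables only (a sketch: the θ resampling and the Hastings correction from Jain & Neal are omitted, and `k_max` is a hypothetical cap on labels):

```python
import random

def propose_split_merge(c, k_max):
    """Sketch of the selector-variable part of a split-merge proposal."""
    c = list(c)                              # work on a copy
    i, j = random.sample(range(len(c)), 2)
    if c[i] == c[j]:
        # Split: move members of the shared cluster to a fresh label w.p. 0.5
        old, new = c[i], max(c) + 1
        if new >= k_max:
            return c                         # no unused latent object available
        for m in range(len(c)):
            if c[m] == old and random.random() < 0.5:
                c[m] = new
        c[i], c[j] = old, new                # make sure i and j end up apart
    else:
        # Merge: relabel j's whole cluster with i's label
        old, tgt = c[j], c[i]
        for m in range(len(c)):
            if c[m] == old:
                c[m] = tgt
    return c
```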
-
Split-Merge Example
[Figure: wingspan axis (cm); θ1 = 20 explains all the data, θ2 = 90 is far away]
* Split two birds from species 1
* Resample θ2 to match these two birds (e.g., θ2 = 27)
* The move is likely to be accepted
-
Mixtures of Kernels
* If T1, ..., Tm all have stationary distribution π, then so does the mixture:
    T(s, s') = Σ_{i=1}^{m} w_i T_i(s, s')
* Example: a mixture of split-merge and Gibbs moves
* Point: faster convergence
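In code, a mixture of kernels is just a weighted random choice among transition functions (a hypothetical sketch; `split_merge_step` and `gibbs_step` in the comment are assumed names):

```python
import random

def mixture_kernel(s, kernels, weights):
    """Apply one of several transition kernels, chosen with fixed weights.
    If every kernel leaves pi stationary, the mixture does too."""
    k = random.choices(kernels, weights=weights)[0]
    return k(s)

# e.g. mixture_kernel(state, [split_merge_step, gibbs_step], [0.3, 0.7])
```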
-
Outline
* Probabilistic models for relational structures
  - Modeling the number of objects
  - Three mistakes that are easy to make
* Markov chain Monte Carlo (MCMC)
  - Gibbs sampling
  - Metropolis-Hastings
  - MCMC over events
* Case studies
  - Citation matching
  - Multi-target tracking
-
MCMC States in Split-Merge
* Not complete instantiations! No parameters for unobserved species
* States are partial instantiations of random variables:
    k = 12, C_B1 = S2, C_B2 = S8, θ_S2 = 31, θ_S8 = 84
* Each state corresponds to an event: the set of outcomes satisfying the description
-
MCMC over Events
* Markov chain over events ω, with stationary distribution proportional to p(ω)
* Theorem: the fraction of visited events in Q converges to p(Q | E) if:
  - Each ω is either a subset of Q or disjoint from Q
  - The events form a partition of E
[Figure: evidence region E partitioned into events; query region Q]
[Milch & Russell 2006]
-
Computing Probabilities of Events
* The engine needs to compute p(ω') / p(ω_n) efficiently (without summations)
* Use instantiations that include all active parents of the variables they instantiate
* Then the probability is a product of CPDs:
    p(ω) = ∏_{X ∈ vars(ω)} p(ω[X] | ω[Pa_ω(X)])
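Under that condition, the product-of-CPDs computation is straightforward (hypothetical representation: CPDs as callables keyed by variable name):

```python
def event_prob(inst, cpds, parents):
    """p(omega) as a product of CPDs. Valid only when inst includes all
    active parents of every variable it instantiates. A CPD here is a
    callable (parent_values_tuple, value) -> probability."""
    prob = 1.0
    for var, val in inst.items():
        pa_vals = tuple(inst[p] for p in parents[var])
        prob *= cpds[var](pa_vals, val)
    return prob

# Tiny two-variable example: A ~ Bernoulli(0.6); B matches A with prob. 0.9
parents = {"A": [], "B": ["A"]}
cpds = {
    "A": lambda pa, v: 0.6 if v else 0.4,
    "B": lambda pa, v: 0.9 if v == pa[0] else 0.1,
}
p = event_prob({"A": True, "B": True}, cpds, parents)
print(p)   # ≈ 0.54 (0.6 * 0.9)
```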
-
States That Are Even More Abstract
* Typical partial instantiation:
    k = 12, C_B1 = S2, C_B2 = S8, θ_S2 = 31, θ_S8 = 84
  This specifies particular species numbers, even though species are interchangeable
* Let states be abstract partial instantiations:
    ∃ distinct x, y [k = 12, C_B1 = x, C_B2 = y, θ_x = 31, θ_y = 84]
* See [Milch & Russell 2006] for conditions under which we can compute the probabilities of such events
-
Outline
* Probabilistic models for relational structures
  - Modeling the number of objects
  - Three mistakes that are easy to make
* Markov chain Monte Carlo (MCMC)
  - Gibbs sampling
  - Metropolis-Hastings
  - MCMC over events
* Case studies
  - Citation matching
  - Multi-target tracking
-
Representative Applications
* Tracking cars with cameras [Pasula et al. 1999]
* Segmentation in computer vision [Tu & Zhu 2002]
* Citation matching [Pasula et al. 2003]
* Multi-target tracking with radar [Oh et al. 2004]
-
Citation Matching Model
#Researcher ~ NumResearchersPrior();
Name(r) ~ NamePrior();

#Paper ~ NumPapersPrior();
FirstAuthor(p) ~ Uniform({Researcher r});
Title(p) ~ TitlePrior();

PubCited(c) ~ Uniform({Paper p});
Text(c) ~ NoisyCitationGrammar
             (Name(FirstAuthor(PubCited(c))), Title(PubCited(c)));

[Pasula et al. 2003; Milch & Russell 2006]
-
Citation Matching
* An elaboration of the generative model shown earlier
* Parameter estimation:
  - Priors for names, titles, and citation formats learned offline from labeled data
  - String corruption parameters learned with Monte Carlo EM
* Inference:
  - MCMC with split-merge proposals
  - Guided by "canopies" of similar citations
  - Accuracy stabilizes after ~20 minutes
[Pasula et al., NIPS 2002]
-
Citation Matching Results
Four data sets of ~300-500 citations, referring to ~150-300 papers
[Chart: error (fraction of clusters not recovered correctly) on the Reinforce, Face, Reason, and Constraint data sets, comparing phrase matching [Lawrence et al. 1999], the generative model + MCMC [Pasula et al. 2002], and a conditional random field [Wellner et al. 2004]]
-
Cross-Citation Disambiguation
Wauchope, K. Eucalyptus: Integrating Natural Language Input with a Graphical User Interface. NRL Report NRL/FR/5510-94-9711 (1994).
Is "Eucalyptus" part of the title, or is the author named K. Eucalyptus Wauchope?
Kenneth Wauchope (1994). Eucalyptus: Integrating natural language input with a graphical user interface. NRL Report NRL/FR/5510-94-9711, Naval Research Laboratory, Washington, DC, 39pp.
The second citation makes it clear how to parse the first one.
-
Preliminary Experiments: Information Extraction
* P(citation text | title, author names) modeled with a simple HMM
* For each paper: recover the title, author surnames, and given names
* Fraction whose attributes are recovered perfectly in the last MCMC state:
  - among papers with one citation: 36.1%
  - among papers with multiple citations: 62.6%
* Can use the inferred knowledge for disambiguation
-
Multi-Object Tracking
[Figure: blips over time with inferred tracks, including a false detection and an unobserved object]
-
State Estimation for Aircraft
#Aircraft ~ NumAircraftPrior();

State(a, t)
  if t = 0 then ~ InitState()
  else ~ StateTransition(State(a, Pred(t)));

#Blip(Source = a, Time = t)
  ~ NumDetectionsCPD(State(a, t));

#Blip(Time = t)
  ~ NumFalseAlarmsPrior();

ApparentPos(r)
  if (Source(r) = null) then ~ FalseAlarmDistrib()
  else ~ ObsCPD(State(Source(r), Time(r)));
-
Aircraft Entering and Exiting
#Aircraft(EntryTime = t) ~ NumAircraftPrior();

Exits(a, t)
  if InFlight(a, t) then ~ Bernoulli(0.1);

InFlight(a, t)
  if t < EntryTime(a) then = false
  elseif t = EntryTime(a) then = true
  else = (InFlight(a, Pred(t)) & !Exits(a, Pred(t)));

State(a, t)
  if t = EntryTime(a) then ~ InitState()
  elseif InFlight(a, t) then ~ StateTransition(State(a, Pred(t)));

#Blip(Source = a, Time = t)
  if InFlight(a, t) then ~ NumDetectionsCPD(State(a, t));

...plus the last two statements from the previous slide
-
MCMC for Aircraft Tracking
* Uses the generative model from the previous slide (although not with BLOG syntax)
* Examples of Metropolis-Hastings proposals
[Figures by Songhwai Oh]
[Oh et al., CDC 2004]
-
Aircraft Tracking Results
[Charts: estimation error and running time]
* MCMC has the smallest error, and it hardly degrades at all as tracks get dense
* MCMC is nearly as fast as the greedy algorithm, and much faster than MHT
[Oh et al., CDC 2004] [Figures by Songhwai Oh]
-
Toward General-Purpose Inference
Currently, each new application requires new code for:
* Proposing moves
* Representing MCMC states
* Computing acceptance probabilities
Goal:
* The user specifies the model and the proposal distribution
* General-purpose code does the rest
-
General MCMC Engine
* Model (in a declarative language): defines p(s)
* Custom proposal distribution (Java class): proposes an MCMC state s' given s_n; computes the ratio q(s_n | s') / q(s' | s_n)
* General-purpose engine (Java code): computes the acceptance probability based on the model; sets s_{n+1}; MCMC states are partial worlds
* Handles arbitrary proposals efficiently using context-specific structure
[Milch & Russell 2006]
-
Summary
* Models for relational structures go beyond standard probabilistic inference settings
* MCMC provides a feasible path for inference
* Open problems:
  - More general inference
  - Adaptive MCMC
  - Integrating discriminative methods
-
References
Blei, D. M. and Jordan, M. I. (2005) Variational inference for Dirichlet process mixtures. J. Bayesian
Analysis 1(1):121-144.
Casella, G. and Robert, C. P. (1996) Rao-Blackwellisation of sampling schemes. Biometrika 83(1):81-94.
Ferguson T. S. (1983) Bayesian density estimation by mixtures of normal distributions. In Rizvi, M. H.
et al., eds. Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday.
Academic Press, New York, pages 287-302.
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence 6:721-741.
Gilks, W. R., Thomas, A. and Spiegelhalter, D. J. (1994) A language and program for complex Bayesian
modelling. The Statistician 43(1):169-177.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., eds. (1996) Markov Chain Monte Carlo in Practice.
Chapman and Hall.
Green, P. J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination. Biometrika 82(4):711-732.
-
References
Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications.Biometrika 57:97-109.
Jain, S. and Neal, R. M. (2004) A split-merge Markov chain Monte Carlo procedure for the Dirichletprocess mixture model. J. Computational and Graphical Statistics 13(1):158-182.
Jordan M. I. (2005) Dirichlet processes, Chinese restaurant processes, and all that. Tutorial at theNIPS Conference, available at http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
MacKay, D. J. C. (1992) Bayesian interpolation. Neural Computation 4(3):414-447.
MacEachern, S. N. (1994) Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation 23:727-741.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953) Equations ofstate calculations by fast computing machines. J. Chemical Physics 21:1087-1092.
Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. L., and Kolobov, A. (2005) BLOG: ProbabilisticModels with Unknown Objects. In Proc. 19th Intl Joint Conf. on AI, pages 1352-1359.
Milch, B. and Russell, S. (2006) General-purpose MCMC inference over relational structures. In Proc. 22nd Conf. on Uncertainty in AI, pages 349-358.
-
References
Neal, R. M. (2000) Markov chain sampling methods for Dirichlet process mixture models. J. Computational and Graphical Statistics 9:249-265.
Oh, S., Russell, S. and Sastry, S. (2004) Markov chain Monte Carlo data association for general multi-target tracking problems. In Proc. 43rd IEEE Conf. on Decision and Control, pages 734-742.
Pasula, H., Russell, S. J., Ostland, M., and Ritov, Y. (1999) Tracking many objects with many sensors.In Proc. 16th Intl Joint Conf. on AI, pages 1160-1171.
Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. (2003) Identity uncertainty and citationmatching. In Advances in Neural Information Processing Systems 15, MIT Press, pages 1401-1408.
Richardson, S. and Green, P. J. (1997) On Bayesian analysis of mixtures with an unknown number of components. J. Royal Statistical Society B 59:731-792.
Sethuraman, J. (1994) A constructive definition of Dirichlet priors. Statistica Sinica 4:639-650.
Sudderth, E. (2006) Graphical models for visual object recognition and tracking. Ph.D. thesis, Dept. of
EECS, Massachusetts Institute of Technology, Cambridge, MA.
Tu, Z. and Zhu, S.-C. (2002) Image segmentation by data-driven Markov chain Monte Carlo. IEEE
Trans. Pattern Analysis and Machine Intelligence 24(5):657-673.