

    Understanding the Metropolis-Hastings Algorithm

    Siddhartha Chib; Edward Greenberg

    The American Statistician, Vol. 49, No. 4. (Nov., 1995), pp. 327-335.

    Stable URL:

    http://links.jstor.org/sici?sici=0003-1305%28199511%2949%3A4%3C327%3AUTMA%3E2.0.CO%3B2-2


Understanding the Metropolis-Hastings Algorithm

Siddhartha CHIB and Edward GREENBERG

Siddhartha Chib is at the John Olin School of Business, Washington University, St. Louis, MO 63130. Edward Greenberg is at the Department of Economics, Washington University, St. Louis, MO 63130. The authors express their thanks to the editor of the journal, the associate editor, and four referees for many useful comments on the paper, and to Michael Ogilvie and Pin-Huang Chou for helpful discussions.

We provide a detailed, introductory exposition of the Metropolis-Hastings algorithm, a powerful Markov chain method to simulate multivariate distributions. A simple, intuitive derivation of this method is given along with guidance on implementation. Also discussed are two applications of the algorithm, one for implementing acceptance-rejection sampling when a blanketing function is not available and the other for implementing the algorithm with block-at-a-time scans. In the latter situation, many different algorithms, including the Gibbs sampler, are shown to be special cases of the Metropolis-Hastings algorithm. The methods are illustrated with examples.

KEY WORDS: Gibbs sampling; Markov chain Monte Carlo; Multivariate density simulation; Reversible Markov chains.

1. INTRODUCTION

In recent years statisticians have been increasingly drawn to Markov chain Monte Carlo (MCMC) methods to simulate complex, nonstandard multivariate distributions. The Gibbs sampling algorithm is one of the best known of these methods, and its impact on Bayesian statistics, following the work of Tanner and Wong (1987) and Gelfand and Smith (1990), has been immense, as detailed in many articles, for example, Smith and Roberts (1993), Tanner (1993), and Chib and Greenberg (1993). A considerable amount of attention is now being devoted to the Metropolis-Hastings (M-H) algorithm, which was developed by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) and subsequently generalized by Hastings (1970). This algorithm is extremely versatile and gives rise to the Gibbs sampler as a special case, as pointed out by Gelman (1992). The M-H algorithm has been used extensively in physics, yet despite the paper by Hastings, it was little known to statisticians until recently. Papers by Müller (1993) and Tierney (1994) were instrumental in exposing the value of this algorithm and stimulating interest among statisticians in its use. Because of the usefulness of the M-H algorithm, applications are appearing steadily in the current literature (see

Müller (1993), Chib and Greenberg (1994), and Phillips and Smith (1994) for recent examples). Despite its obvious importance, however, no simple or intuitive exposition of the M-H algorithm, comparable to that of Casella and George (1992) for the Gibbs sampler, is available. This article is an attempt to fill this gap. We provide a tutorial introduction to the algorithm, deriving the algorithm from first principles. The article is self-contained

since it includes the relevant Markov chain theory, issues related to implementation and tuning, and empirical illustrations. We also discuss applications of the method, one for implementing acceptance-rejection sampling when a blanketing function is not available, developed by Tierney (1994), and the other for applying the algorithm one block at a time. For the latter situation, we present an important principle that we call the product of kernels principle and explain how it is the basis of many other algorithms, including the Gibbs sampler. In each case we emphasize the intuition for the method and present proofs of the main results. For mathematical convenience, our entire discussion is phrased in the context of simulating an absolutely continuous target density, but the same ideas apply to discrete and mixed continuous-discrete distributions.

The rest of the article is organized as follows. In Section 2 we briefly review the acceptance-rejection (A-R) method of simulation. Although not an MCMC method, it uses some concepts that also appear in the Metropolis-Hastings algorithm and is a useful introduction to the topic. Section 3 introduces the relevant Markov chain theory for continuous state spaces, along with the general philosophy behind MCMC methods. In Section 4 we derive the M-H algorithm by exploiting the notion of reversibility defined in Section 3, and discuss some important features of the algorithm and the mild regularity conditions that justify its use. Section 5 contains issues related to the choice of the candidate-generating density and guidance on implementation. Section 6 discusses how the algorithm can be used in an acceptance-rejection scheme when a dominating density is not available. This section also explains how the algorithm can be applied when the variables to be simulated are divided into blocks. The final section contains two numerical examples, the first involving the simulation of a bivariate normal distribution, and the second the Bayesian analysis of an autoregressive model.

2. ACCEPTANCE-REJECTION SAMPLING

In contrast to the MCMC methods described below, classical simulation techniques generate non-Markov (usually independent) samples; that is, the successive observations are statistically independent unless correlation is artificially introduced as a variance reduction device. An important method in this class is the A-R method, which can be described as follows. The objective is to generate samples from the absolutely continuous target density π(x) = f(x)/K, where f(x) is the unnormalized density and K is the (possibly unknown) normalizing constant. Let h(x) be a density that can be simulated by some known method, and suppose there is a known constant c such that f(x) ≤ c h(x) for all x. Then, to obtain a random variate from π(·):

1. Generate a candidate Z from h(·) and a value u from U(0, 1), the uniform distribution on (0, 1).



2. If u ≤ f(Z)/(c h(Z)), return y = Z; else go to step 1.

It is easily shown that the accepted value y is a random variate from π(·). For this method to be efficient, c must be carefully selected. Because the expected number of iterations of steps 1 and 2 required to obtain a draw is proportional to c, the rejection method is optimized by setting

$$c = \sup_x \frac{f(x)}{h(x)}.$$

Even this choice, however, may result in an undesirably

large number of rejections. The notion of a generating density also appears in the M-H algorithm, but before considering the differences and similarities, we turn to the rationale behind MCMC methods.
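As a concrete illustration of this recipe, here is a minimal sketch in Python (ours, not part of the original article); the target f, the proposal h, and the constant c are illustrative assumptions chosen so that c h dominates f.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Unnormalized target density: a standard normal kernel (illustrative choice).
    return np.exp(-0.5 * x**2)

def h(x):
    # Proposal density that can be sampled directly: standard Cauchy.
    return 1.0 / (np.pi * (1.0 + x**2))

c = 3.82  # chosen so that f(x) <= c*h(x) for all x (sup f/h is about 3.81 here)

def ar_draw():
    """Acceptance-rejection: repeat steps 1 and 2 until a candidate is accepted."""
    while True:
        z = rng.standard_cauchy()        # step 1: candidate from h(.)
        u = rng.uniform()                #         and u from U(0, 1)
        if u <= f(z) / (c * h(z)):       # step 2: accept with probability f(z)/(c h(z))
            return z                     # accepted value is a draw from pi(.)

sample = np.array([ar_draw() for _ in range(5000)])
print(sample.mean(), sample.std())       # should be near 0 and 1 for this target
```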

3. MARKOV CHAIN MONTE CARLO SIMULATION

The usual approach to Markov chain theory on a continuous state space is to start with a transition kernel P(x, A) for x in the state space X and A ∈ B, where B is the Borel σ-field on X. The transition kernel is a conditional distribution function that represents the probability of moving from x to a point in the set A. By virtue of its being a distribution function, P(x, X) = 1, and it is permitted that the chain can make a transition from the point x to x, that is, P(x, {x}) is not necessarily zero.

A major concern of Markov chain theory [see Nummelin (1984), Billingsley (1986), Bhattacharya and Waymire (1990), and, especially, Meyn and Tweedie (1993)] is to determine conditions under which there exists an invariant distribution π* and conditions under which iterations of the transition kernel converge to the invariant distribution. The invariant distribution satisfies

$$\pi^*(dy) = \int P(x, dy)\, \pi(x)\, dx, \qquad (1)$$

where π is the density of π* with respect to Lebesgue measure (thus π*(dy) = π(y) dy). The nth iterate is given by P^(n)(x, A) = ∫ P^(n-1)(x, dy) P(y, A), where P^(1)(x, dy) = P(x, dy). Under conditions discussed in the following, it can be shown that the nth iterate converges to the invariant distribution as n → ∞.

MCMC methods turn the theory around: the invariant density is known (perhaps up to a constant multiple); it is π(·), the target density from which samples are desired, but the transition kernel is unknown. To generate samples from π(·), the methods find and utilize a transition kernel P(x, dy) whose nth iterate converges to π(·) for large n. The process is started at an arbitrary x and iterated a large number of times. After this large number, the distribution of the observations generated from the simulation is approximately the target distribution.

The problem then is to find an appropriate P(x, dy). What might appear to be a search for the proverbial needle in a haystack is somewhat simplified by the following considerations. Suppose that the transition kernel, for some function p(x, y), is expressed as

$$P(x, dy) = p(x, y)\, dy + r(x)\, \delta_x(dy), \qquad (2)$$

where p(x, x) = 0, δ_x(dy) = 1 if x ∈ dy and 0 otherwise, and r(x) = 1 − ∫ p(x, y) dy is the probability that the chain remains at x. From the possibility that r(x) ≠ 0, it should be clear that the integral of p(x, y) over y is not necessarily 1.

Now, if the function p(x, y) in (2) satisfies the reversibility condition (also called detailed balance, microscopic reversibility, and time reversibility)

$$\pi(x)\, p(x, y) = \pi(y)\, p(y, x), \qquad (3)$$

then π(·) is the invariant density of P(x, ·) (Tierney 1994). To verify this we evaluate the right-hand side of (1):

$$
\begin{aligned}
\int P(x, A)\,\pi(x)\,dx
&= \int \left[ \int_A p(x, y)\, dy \right] \pi(x)\, dx + \int r(x)\, \delta_x(A)\, \pi(x)\, dx \\
&= \int_A \left[ \int p(x, y)\, \pi(x)\, dx \right] dy + \int_A r(x)\, \pi(x)\, dx \\
&= \int_A \left[ \int p(y, x)\, \pi(y)\, dx \right] dy + \int_A r(x)\, \pi(x)\, dx \\
&= \int_A (1 - r(y))\, \pi(y)\, dy + \int_A r(x)\, \pi(x)\, dx \\
&= \int_A \pi(y)\, dy \\
&= \pi^*(A). \qquad (4)
\end{aligned}
$$

Intuitively, the left-hand side of the reversibility condition is the unconditional probability of moving from x to y, where x is generated from π(·), and the right-hand side is the unconditional probability of moving from y to x, where y is also generated from π(·). The reversibility condition says that the two sides are equal, and the above result shows that π*(·) is the invariant distribution for P(·, ·).
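As a quick numerical check of why reversibility yields invariance (our illustration, not part of the original article), consider the discrete analogue: if a transition matrix P satisfies π_i P_ij = π_j P_ji for all i and j, then π P = π.

```python
import numpy as np

# Target distribution on three states (illustrative values).
pi = np.array([0.2, 0.3, 0.5])

# Build a transition matrix satisfying detailed balance via a Metropolis rule:
# propose j uniformly among the other states, accept with probability min(1, pi[j]/pi[i]).
n = len(pi)
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
    P[i, i] = 1.0 - P[i].sum()     # probability of remaining at i (the r(x) term)

# Detailed balance: pi_i P_ij == pi_j P_ji for all i, j.
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)

# Invariance: pi P == pi, the discrete analogue of equation (1).
assert np.allclose(pi @ P, pi)
print(pi @ P)
```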

This result gives us a sufficient condition (reversibility) that must be satisfied by p(x, y). We now show how the Metropolis-Hastings algorithm finds a p(x, y) with this property.

4. THE METROPOLIS-HASTINGS ALGORITHM

As in the A-R method, suppose we have a density that can generate candidates. Since we are dealing with Markov chains, however, we permit that density to depend on the current state of the process. Accordingly, the candidate-generating density is denoted q(x, y), where ∫ q(x, y) dy = 1. This density is to be interpreted as saying that when a process is at the point x, the density generates a value y from q(x, y). If it happens that q(x, y) itself satisfies the reversibility condition (3) for all x, y, our search is over. But most likely it will not. We might find, for example, that for some x, y,

$$\pi(x)\, q(x, y) > \pi(y)\, q(y, x). \qquad (5)$$

In this case, speaking somewhat loosely, the process moves from x to y too often and from y to x too rarely. A convenient way to correct this condition is to reduce the number of moves from x to y by introducing a probability



α(x, y) < 1 that the move is made. We refer to α(x, y) as the probability of move. If the move is not made, the process again returns x as a value from the target distribution. (Note the contrast with the A-R method in which, when a y is rejected, a new pair (y, u) is drawn independently of the previous value of y.) Thus transitions from x to y (y ≠ x) are made according to

$$p_{MH}(x, y) = q(x, y)\, \alpha(x, y), \qquad x \neq y,$$

where α(x, y) is yet to be determined.

Consider again inequality (5). It tells us that the movement from y to x is not made often enough. We should therefore define α(y, x) to be as large as possible, and since it is a probability, its upper limit is 1. But now the probability of move α(x, y) is determined by requiring that p_MH(x, y) satisfies the reversibility condition, because then

$$\pi(x)\, q(x, y)\, \alpha(x, y) = \pi(y)\, q(y, x)\, \alpha(y, x) = \pi(y)\, q(y, x).$$

We now see that α(x, y) = π(y)q(y, x)/[π(x)q(x, y)]. Of course, if the inequality in (5) is reversed, we set α(x, y) = 1 and derive α(y, x) as above. The probabilities α(x, y) and α(y, x) are thus introduced to ensure that the two sides of (5) are in balance or, in other words, that p_MH(x, y) satisfies reversibility. Thus we have shown that in order for p_MH(x, y) to be reversible, the probability of move must be set to

$$
\alpha(x, y) =
\begin{cases}
\min\left[ \dfrac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)},\; 1 \right] & \text{if } \pi(x)\, q(x, y) > 0, \\
1 & \text{otherwise.}
\end{cases}
$$

To complete the definition of the transition kernel for the Metropolis-Hastings chain, we must consider the possibly nonzero probability that the process remains at x. As defined above, this probability is

$$r(x) = 1 - \int q(x, y)\, \alpha(x, y)\, dy.$$

Consequently, the transition kernel of the M-H chain, denoted by P_MH(x, dy), is given by

$$P_{MH}(x, dy) = q(x, y)\, \alpha(x, y)\, dy + \left[ 1 - \int q(x, y)\, \alpha(x, y)\, dy \right] \delta_x(dy),$$

a particular case of (2). Because p_MH(x, y) is reversible by construction, it follows from the argument in (4) that the M-H kernel has π(x) as its invariant density.

Several remarks about this algorithm are in order. First, the M-H algorithm is specified by its candidate-generating density q(x, y), whose selection we take up in the next section. Second, if a candidate value is rejected, the current value is taken as the next item in the sequence. Third, the calculation of α(x, y) does not require knowledge of the normalizing constant of π(·), because it appears in both the numerator and denominator. Fourth, if the candidate-generating density is symmetric, an important special case, q(x, y) = q(y, x), and the probability of move reduces to π(y)/π(x); hence, if π(y) > π(x), the chain moves to y; otherwise, it moves with probability given by π(y)/π(x). In other words, if the jump goes uphill, it is always accepted; if downhill, it is accepted with a nonzero probability. [See Figure 1 where, from the current point x, a move to candidate y1 is made with certainty, while a move to candidate y2 is made with probability π(y2)/π(x).] This is the algorithm proposed by Metropolis et al. (1953). Interestingly, it also forms the basis for several optimization algorithms, notably the method of simulated annealing.

[Figure 1. Calculating Probabilities of Move With Symmetric Candidate-Generating Function (see text).]

We now summarize the M-H algorithm in algorithmic form, initialized with the (arbitrary) value x^(0):

Repeat for i = 1, 2, . . . , N:
  Generate y from q(x^(i), ·) and u from U(0, 1).
  If u ≤ α(x^(i), y), set x^(i+1) = y;
  else, set x^(i+1) = x^(i).
Return the values {x^(1), x^(2), . . . , x^(N)}.
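The listing above translates directly into a short program. The following sketch is ours rather than the authors': it uses a one-dimensional target known only up to a normalizing constant and a symmetric random-walk candidate, so that the probability of move reduces to π(y)/π(x); the specific target and scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_f(x):
    # Unnormalized log target density (illustrative choice): a bimodal kernel.
    return np.logaddexp(-0.5 * (x - 2.0)**2, -0.5 * (x + 2.0)**2)

def metropolis_hastings(log_f, x0, n_iter, scale=1.0):
    """Random-walk M-H: q(x, y) = q(y, x), so alpha(x, y) = min{f(y)/f(x), 1}."""
    x = x0
    draws = np.empty(n_iter)
    n_accept = 0
    for i in range(n_iter):
        y = x + scale * rng.standard_normal()              # candidate from q(x, .)
        if np.log(rng.uniform()) <= log_f(y) - log_f(x):   # u <= alpha(x, y)
            x = y                                          # move to the candidate
            n_accept += 1
        # else: the current value is repeated as the next item in the chain
        draws[i] = x
    return draws, n_accept / n_iter

draws, acc_rate = metropolis_hastings(log_f, x0=0.0, n_iter=20000, scale=2.5)
print("acceptance rate:", acc_rate)
print("summaries after a burn-in of 500:", draws[500:].mean(), draws[500:].std())
```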

As in any MCMC method, the draws are regarded as a sample from the target density π(x) only after the chain has passed the transient stage and the effect of the fixed starting value has become so small that it can be ignored. In fact, this convergence to the invariant distribution occurs under mild regularity conditions. The regularity conditions required are irreducibility and aperiodicity [see Smith and Roberts (1993)]. What these mean is that, if x and y are in the domain of π(·), it must be possible to move from x to y in a finite number of iterations with nonzero probability, and the number of moves required to move from x to y is not required to be a multiple of some integer. These conditions are usually satisfied if q(x, y) has a positive density on the same support as that of π(·). They are usually also satisfied by a q(x, y) with a restricted support (e.g., a uniform distribution around the current point with finite width). These conditions, however, do not determine the rate of convergence [see Roberts and Tweedie (1994)], so there is an empirical question of how large an initial sample of size n0 (say) should be discarded and how long the sampling should be run. One possibility, due to Gelman and Rubin (1992), is to start multiple chains from dispersed initial values and compare the within and between variation of the sampled draws. A simple heuristic that works in some situations is to make n0 and N increasing functions of the first-order serial correlation in the output.


This entire area, however, is quite unsettled and is being actively researched. For more details the reader should consult Gelman and Rubin (1992) and the accompanying discussion.
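The following sketch illustrates the multiple-chain comparison mentioned above; it is our simplified rendering of the Gelman-Rubin idea, not their full method, and it reuses the `metropolis_hastings` sampler and `log_f` target from the sketch in Section 4.

```python
import numpy as np

def psrf(chains):
    """A simplified potential scale reduction factor for several chains of equal length."""
    chains = np.asarray(chains)                    # shape (m chains, n draws)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    B = n * chain_means.var(ddof=1)                # between-chain variance
    var_hat = (n - 1) / n * W + B / n              # pooled variance estimate
    return np.sqrt(var_hat / W)

# Run a few chains from dispersed starting values (sampler and target from Section 4).
chains = [metropolis_hastings(log_f, x0=start, n_iter=5000, scale=2.5)[0]
          for start in (-10.0, 0.0, 10.0)]
print("PSRF:", psrf([c[500:] for c in chains]))    # values near 1 indicate adequate mixing
```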

5. IMPLEMENTATION ISSUES: CHOICE OF q(x, y)

To implement the M-H algorithm, it is necessary that a suitable candidate-generating density be specified. Typically, this density is selected from a family of distributions that requires the specification of such tuning parameters as the location and scale. Considerable recent work is being devoted to the question of how these choices should be made and, although the theory is far from complete, enough is known to conduct most practical simulation studies.

One family of candidate-generating densities, which appears in the work of Metropolis et al. (1953), is given by q(x, y) = q1(y − x), where q1(·) is a multivariate density [see Müller (1993)]. The candidate y is thus drawn according to the process y = x + z, where z is called the increment random variable and follows the distribution q1. Because the candidate is equal to the current value plus noise, this case is called a random walk chain. Possible choices for q1 include the multivariate normal density and the multivariate-t with the parameters specified according to the principles described below. Note that when q1 is symmetric, the usual circumstance, q1(z) = q1(−z); the probability of move then reduces to

$$\alpha(x, y) = \min\left[ \frac{\pi(y)}{\pi(x)},\; 1 \right].$$

As mentioned earlier, the same reduction occurs if q(x, y) = q(y, x).

A second family of candidate-generating densities is given by the form q(x, y) = q2(y) [see Hastings (1970)]. In contrast to the random walk chain, the candidates are drawn independently of the current location x; this is an independence chain in Tierney's (1994) terminology. As in the first case, we can let q2 be a multivariate normal or multivariate-t density, but now it is necessary to specify the location of the generating density as well as the spread.

A third choice, which seems to be an efficient solution when available, is to exploit the known form of π(·) to specify a candidate-generating density [see Chib and Greenberg (1994)]. For example, if π(t) can be written as π(t) ∝ ψ(t)h(t), where h(t) is a density that can be sampled and ψ(t) is uniformly bounded, then set q(x, y) = h(y) (as in the independence chain) to draw candidates. In this case, the probability of move requires only the computation of the ψ function (not π or h) and is given by

$$\alpha(x, y) = \min\left[ \frac{\psi(y)}{\psi(x)},\; 1 \right].$$
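A small sketch of this third choice (ours, with an assumed target): take π(t) ∝ ψ(t)h(t) with h the standard normal density and ψ(t) = 1/(1 + t²), which is uniformly bounded; candidates are drawn from h and moved to with probability min{ψ(y)/ψ(x), 1}.

```python
import numpy as np

rng = np.random.default_rng(2)

def psi(t):
    # Bounded weight function; the target is pi(t) proportional to psi(t) times the N(0, 1) density.
    return 1.0 / (1.0 + t**2)

x = 0.0
draws = np.empty(10000)
for i in range(draws.size):
    y = rng.standard_normal()                         # candidate from h(.), independent of x
    if rng.uniform() <= min(1.0, psi(y) / psi(x)):    # probability of move uses psi only
        x = y
    draws[i] = x
print(draws.mean(), draws.var())
```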

A fourth method of drawing candidates is to use the A-R method with a pseudodominating density. This method was developed in Tierney (1994), and because it is of independent interest as an M-H acceptance-rejection method, we explain it in Section 6.1.

A fifth family, also suggested by Tierney (1994), is represented by a vector autoregressive process of order 1. These autoregressive chains are produced by letting

y = a + B(x − a) + z, where a is a vector and B is a matrix (both conformable with x) and z has q as its density. Then, q(x, y) = q(y − a − B(x − a)). Setting B = −I produces chains that are reflected about the point a and is a simple way to induce negative correlation between successive elements of the chain.

We now return to the critical question of choosing the spread, or scale, of the candidate-generating density. This is an important matter that has implications for the efficiency of the algorithm. The spread of the candidate-generating density affects the behavior of the chain in at least two dimensions: one is the acceptance rate (the percentage of times a move to a new point is made), and the other is the region of the sample space that is covered by the chain. To see why, consider the situation in which the chain has converged and the density is being sampled around the mode. Then, if the spread is extremely large, some of the generated candidates will be far from the current value and will therefore have a low probability of being accepted (because the ordinate of the candidate is small relative to the ordinate near the mode). Reducing the spread will correct this problem, but if the spread is chosen too small, the chain will take longer to traverse the support of the density, and low probability regions will be undersampled. Both of these situations are likely to be reflected in high autocorrelations across sample values. Recent work by Roberts, Gelman, and Gilks (1994) discussed this issue in the context of q1 (the random walk proposal density). They show that if the target and proposal densities are normal, then the scale of the latter should be tuned so that the acceptance rate is approximately .45 in one-dimensional problems and approximately .23 as the number of dimensions approaches infinity, with the optimal acceptance rate being around .25 in as low as six dimensions. This is similar to the recommendation of Müller (1993), who argues that the acceptance rate should be around .5 for the random walk chain.

The choice of spread of the proposal density in the case of q2 (the independence proposal density) has also come under recent scrutiny. Chib and Geweke [work in progress] show that it is important to ensure that the tails of the proposal density dominate those of the target density, which is similar to a requirement on the importance sampling function in Monte Carlo integration with importance sampling [see Geweke (1989)]. It is important to mention the caveat that a chain with the optimal acceptance rate may still display high autocorrelations. In such circumstances it is usually necessary to try a different family of candidate-generating densities. A short numerical sketch of the scale-tuning idea for the random walk chain is given below.
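The next sketch (ours, with an assumed 10-dimensional standard normal target) runs a random walk chain at several proposal scales and reports the acceptance rates; consistent with the guidance above, scales that give rates roughly between .2 and .5 tend to mix best for this kind of target.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 10

def log_target(x):
    # Standard multivariate normal target (illustrative assumption).
    return -0.5 * np.dot(x, x)

def rw_acceptance_rate(scale, n_iter=5000):
    """Acceptance rate of a random-walk M-H chain with N(0, scale^2 I) increments."""
    x = np.zeros(dim)
    accepted = 0
    for _ in range(n_iter):
        y = x + scale * rng.standard_normal(dim)
        if np.log(rng.uniform()) <= log_target(y) - log_target(x):
            x = y
            accepted += 1
    return accepted / n_iter

for scale in (0.1, 0.5, 1.0, 2.0):
    print(f"scale={scale:4.1f}  acceptance rate={rw_acceptance_rate(scale):.2f}")
```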

6. APPLICATIONS OF THE M-H ALGORITHM

We hope that the reader is now convinced that the M-H algorithm is a useful and straightforward device with which to sample an arbitrary multivariate distribution. In this section we explain two uses of the algorithm, one involving the A-R method, and the other for implementing the algorithm with block-at-a-time scans. In the latter situation many different algorithms, including the Gibbs sampler, are shown to arise as special cases of the M-H algorithm.


6.1 An M-H Acceptance-Rejection Algorithm

Recall that in the A-R method described earlier, a constant c and a density h(x) are needed such that ch(x) dominates, or blankets, the (possibly) unnormalized target density f(x). Finding a c that does the trick may be difficult in some applications; moreover, if f(x) depends on parameters that are revised during an iterative cycle, finding a new value of c for each new set of the parameters

may significantly slow the computations. For these reasons it is worthwhile to have an A-R method that does not require a blanketing function. Tierney's (1994) remarkable algorithm does this by using an A-R step to generate candidates for an M-H algorithm. This algorithm, which seems complicated at first, can be derived rather easily using the intuition we have developed for the M-H algorithm.

To fix the context again: we are interested in sampling the target density π(x) = f(x)/K, where K may be unknown, and a density h(x) is available for sampling. Suppose c > 0 is a known constant, but that f(x) is not necessarily less than ch(x) for all x; that is, ch(x) does not necessarily dominate f(x). It is convenient to define the set C where domination occurs:

C = {x : f(x) ≤ c h(x)}.

In this algorithm, given x^(j) = x, the next value x^(j+1) is obtained as follows. First, a candidate value y is obtained, independent of the current value x, by applying the A-R algorithm with ch(·) as the dominating density. The A-R step is implemented through steps 1 and 2 of Section 2. What is the density of the random variable y that comes through this step? Following Rubinstein (1981, pp. 45-46), we have

$$q(y) = \frac{h(y)\, P(U \le f(Z)/c h(Z) \mid Z = y)}{\Pr(U \le f(Z)/c h(Z))}.$$

But because P(U ≤ f(Z)/ch(Z) | Z = y) = min{f(y)/ch(y), 1}, it follows that

$$q(y) = \frac{\min\{ f(y)/[c\, h(y)],\, 1 \}\; h(y)}{d},$$

where d = Pr(U ≤ f(Z)/ch(Z)). By simplifying the numerator of this density we obtain a more useful representation for the candidate-generating density:

$$q(y) = \begin{cases} f(y)/(c\, d) & \text{if } y \in C, \\ h(y)/d & \text{if } y \notin C. \end{cases} \qquad (7)$$

(Note that there is no need to write q(x, y) for this density because the candidate y is drawn independently of x.) Because ch(y) does not dominate the target density outside C (by definition), it follows that the target density is not adequately sampled there. See Figure 2 for an illustration of a nondominating density and the C region. This can be corrected with an M-H step applied to the y values that come through the A-R step. Since x and y can each be in C or in C^c, there are four possible cases: (a) x ∈ C, y ∈ C; (b) and (c) x ∉ C, y ∈ C, or x ∈ C, y ∉ C; and (d) x ∉ C, y ∉ C.

The objective now is to find the M-H moving probability α(x, y) such that q(y)α(x, y) satisfies reversibility.

[Figure 2. Acceptance-Rejection Sampling With Pseudodominating Density ch(x).]

To proceed, we derive α(x, y) in each of the four possible cases given above. As before, we consider π(x)q(y) and π(y)q(x) [or, equivalently, f(x)q(y) and f(y)q(x)] to see how the probability of moves should be defined to ensure reversibility. That is, we need to find α(x, y) and α(y, x) such that

$$\pi(x)\, q(y)\, \alpha(x, y) = \pi(y)\, q(x)\, \alpha(y, x)$$

in each of the cases (a)-(d), where q(y) is chosen from (7).

Case (a): x ∈ C, y ∈ C. In this case it is easy to verify that π(x)q(y) = π(x)f(y)/cd is equal to π(y)q(x). Accordingly, setting α(x, y) = α(y, x) = 1 satisfies reversibility.

Cases (b) and (c): x ∉ C, y ∈ C, or x ∈ C, y ∉ C. In the first case f(x) > ch(x), or h(x) < f(x)/c, which implies (on multiplying both sides by π(y)/d) that

$$\frac{\pi(y)\, h(x)}{d} < \frac{\pi(y)\, f(x)}{c\, d},$$

or, from (7), π(y)q(x) < π(x)q(y). We now see that there are relatively too few transitions from y to x and too many in the opposite direction. By setting α(y, x) = 1 the first problem is alleviated, and then α(x, y) is determined from

$$\pi(x)\, q(y)\, \alpha(x, y) = \pi(y)\, q(x),$$

which gives α(x, y) = ch(x)/f(x). If x ∈ C, y ∉ C, reverse the roles of x and y above to find that α(x, y) = 1 and α(y, x) = ch(y)/f(y).

Case (d): x ∉ C, y ∉ C. In this case we have f(x)q(y) = f(x)h(y)/d and f(y)q(x) = f(y)h(x)/d, and there are two possibilities. There are too few transitions from y to x to satisfy reversibility if

$$f(y)\, h(x) < f(x)\, h(y).$$

In that case set α(y, x) = 1 and determine α(x, y) from

$$f(x)\, q(y)\, \alpha(x, y) = f(y)\, q(x),$$



which implies

$$\alpha(x, y) = \frac{f(y)\, h(x)}{f(x)\, h(y)}.$$

If there are too few transitions from x to y, just interchange x and y in the above discussion. We thus see that in two of the cases, those where x ∈ C, the probability of move to y is 1, regardless of where y lies. To summarize, we have derived the following probability

of move to the candidates y that are produced from the A-R step:

Let C1 = I[x ∈ C] and C2 = I[y ∈ C]. Generate u from U(0, 1), and
  if C1 = 1, then let α = 1;
  if C1 = 0 and C2 = 1, then let α = ch(x)/f(x);
  if C1 = 0 and C2 = 0, then let α = min{f(y)h(x)/[f(x)h(y)], 1}.

If u ≤ α, return y; else, return x.
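Below is a minimal sketch (ours, not from the article) of one step of this M-H acceptance-rejection scheme for a one-dimensional target; the functions f and h and the constant c are illustrative assumptions, with c deliberately too small for c h to dominate f everywhere.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    # Unnormalized target density: a Laplace (double-exponential) kernel.
    return np.exp(-abs(x))

def h(x):
    # Pseudodominating proposal density: standard normal.
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

c = 2.0   # too small for c*h(x) to dominate f(x) everywhere

def mh_ar_step(x):
    # A-R step: generate a candidate y, treating c*h(.) as if it dominated f(.).
    while True:
        z = rng.standard_normal()
        if rng.uniform() <= min(1.0, f(z) / (c * h(z))):
            y = z
            break
    # M-H correction using the four cases; C is the region where f <= c*h.
    x_in_C = f(x) <= c * h(x)
    y_in_C = f(y) <= c * h(y)
    if x_in_C:
        alpha = 1.0
    elif y_in_C:
        alpha = min(1.0, c * h(x) / f(x))
    else:
        alpha = min(1.0, (f(y) * h(x)) / (f(x) * h(y)))
    return y if rng.uniform() <= alpha else x

x = 0.0
draws = np.empty(20000)
for i in range(draws.size):
    x = mh_ar_step(x)
    draws[i] = x
print(np.mean(np.abs(draws)))   # should be near 1, the mean of |X| for this Laplace target
```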

6.2 Block-at-a-Time Algorithms

Another interesting situation arises when the M-H algorithm is applied in turn to subblocks of the vector x, rather than simultaneously to all elements of the vector. This block-at-a-time or variable-at-a-time possibility, which is discussed in Hastings (1970, sec. 2.4), often simplifies the search for a suitable candidate-generating density and gives rise to several interesting hybrid algorithms obtained by combining M-H updates.

The central idea behind these algorithms may be illustrated with two blocks, x = (x1, x2), where x_i ∈ R^{d_i}. Suppose that there exists a conditional transition kernel P1(x1, dy1 | x2) with the property that, for a fixed value of x2, π*_{1|2}(· | x2) is its invariant distribution (with density π_{1|2}(· | x2)), that is,

$$\pi^*_{1|2}(dy_1 \mid x_2) = \int P_1(x_1, dy_1 \mid x_2)\, \pi_{1|2}(x_1 \mid x_2)\, dx_1. \qquad (8)$$

Also, suppose the existence of a conditional transition kernel P2(x2, dy2 | x1) with the property that, for a given x1, π*_{2|1}(· | x1) is its invariant distribution, analogous to (8). For example, P1 could be the transition kernel generated by a Metropolis-Hastings chain applied to the block x1 with x2 fixed for all iterations. Now, somewhat surprisingly, it turns out that the product of the transition kernels has π(x1, x2) as its invariant density. The practical significance of this principle (which we call the product of kernels principle) is enormous because it allows us to take draws in succession from each of the kernels, instead of having to run each of the kernels to convergence for every value of the conditioning variable. In addition, as suggested above, this principle is extremely useful because it is usually far easier to find several conditional kernels that converge to their respective conditional densities than to find one kernel that converges to the joint.

To establish the product of kernels principle it is necessary to specify the nature of the scan through the elements of x (Hastings mentions several possibilities). Suppose the transition kernel P1(·, · | x2) produces y1 given

x1 and x2, and the transition kernel P2(·, · | y1) generates y2 given x2 and y1. Then the kernel formed by multiplying the conditional kernels has π*(·, ·) as its invariant distribution:

$$
\begin{aligned}
\int\!\!\int & P_1(x_1, dy_1 \mid x_2)\, P_2(x_2, dy_2 \mid y_1)\, \pi(x_1, x_2)\, dx_1\, dx_2 \\
&= \int P_2(x_2, dy_2 \mid y_1) \left[ \int P_1(x_1, dy_1 \mid x_2)\, \pi_{1|2}(x_1 \mid x_2)\, dx_1 \right] \pi_2(x_2)\, dx_2 \\
&= \int P_2(x_2, dy_2 \mid y_1)\, \pi^*_{1|2}(dy_1 \mid x_2)\, \pi_2(x_2)\, dx_2 \\
&= \int P_2(x_2, dy_2 \mid y_1)\, \pi_{2|1}(x_2 \mid y_1)\, dx_2\; \pi^*_1(dy_1) \\
&= \pi^*_1(dy_1) \int P_2(x_2, dy_2 \mid y_1)\, \pi_{2|1}(x_2 \mid y_1)\, dx_2 \\
&= \pi^*_1(dy_1)\, \pi^*_{2|1}(dy_2 \mid y_1) \\
&= \pi^*(dy_1, dy_2),
\end{aligned}
$$

where the third line follows from (8), the fourth from Bayes theorem, the sixth from the assumed invariance of P2, and the last from the law of total probability.

With this result in hand, several important special cases of the M-H algorithm can be mentioned. The first special case is the so-called Gibbs sampler. This algorithm is obtained by letting the transition kernel P1(x1, dy1 | x2) = π*_{1|2}(dy1 | x2) and P2(x2, dy2 | y1) = π*_{2|1}(dy2 | y1), that is, the samples are generated directly from the full conditional distributions. Note that this method requires that it be possible to generate independent samples from each of the full conditional densities. The calculations above demonstrate that this algorithm is a special case of the M-H algorithm. Alternatively, it may be checked that the M-H acceptance probability α(x, y) = 1 for all x, y.

Another special case of the M-H algorithm is the so-called "M-H within Gibbs" algorithm (but see our comments on terminology below), in which an intractable full conditional density [say π_{1|2}(y1 | x2)] is sampled with the general form of the M-H algorithm described in Section 4 and the others are sampled directly from their full conditional distributions. Many other algorithms can be similarly developed that arise from multiplying conditional kernels.

We conclude this section with a brief digression on terminology. It should be clear from the discussion in this subsection that the M-H algorithm can take many different forms, one of which is the Gibbs sampler. Because much of the literature has overlooked Hastings's discussion of M-H algorithms that scan one block at a time, some unfortunate usage ("M-H within Gibbs," for example) has arisen that should be abandoned. In addition, it may be desirable to define the Gibbs sampler rather narrowly, as we have done above, as the case in which all full conditional kernels are sampled by independent algorithms in a fixed order. Although a special case of the M-H algorithm, it is an extremely important special case.
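To make the two-block idea concrete, here is a small sketch (ours, with an assumed bivariate normal target): each block is updated in turn by a one-dimensional random-walk M-H step whose invariant distribution is the corresponding full conditional, and the alternating scan then has the joint density as its invariant distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
rho = 0.9   # correlation of the assumed bivariate normal target (zero means, unit variances)

def log_cond(u, v):
    # Log of the full conditional of one coordinate given the other: N(rho*v, 1 - rho^2).
    return -0.5 * (u - rho * v)**2 / (1.0 - rho**2)

def block_update(u, v, scale=1.0):
    """One random-walk M-H step on coordinate u, holding v fixed (invariant for pi(u | v))."""
    u_new = u + scale * rng.standard_normal()
    if np.log(rng.uniform()) <= log_cond(u_new, v) - log_cond(u, v):
        return u_new
    return u

x1, x2 = 0.0, 0.0
draws = np.empty((10000, 2))
for i in range(draws.shape[0]):
    x1 = block_update(x1, x2)   # apply P1(., . | x2)
    x2 = block_update(x2, x1)   # then P2(., . | y1), using the updated first block
    draws[i] = x1, x2
print(np.corrcoef(draws[1000:].T)[0, 1])   # should be near rho
```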

7. EXAMPLES

We next present two examples of the use of the M-H

algorithm. In the first we simulate the bivariate normal to illustrate the effects of various choices of q(x, y); the


second example illustrates the value of setting up blocks of variables in the Bayesian posterior analysis of a second-order autoregressive time series model.

7.1 Simulating a Bivariate Normal

To illustrate the M-H algorithm we consider the simulation of the bivariate normal distribution N2(μ, Σ), where μ = (1, 2)′ is the mean vector and Σ = (σij) is the 2 × 2 covariance matrix given by

$$\Sigma = \begin{pmatrix} 1 & 0.90 \\ 0.90 & 1 \end{pmatrix}.$$

Because of the high correlation the contours of this distribution are cigar-shaped, that is, thin and positively inclined. Although this distribution can be simulated directly by the Choleski approach by letting y = μ + P′u, where u ∼ N2(0, I2) and P satisfies P′P = Σ, this well-known problem is useful for illustrating the M-H algorithm. From the expression for the multivariate normal density, the probability of move (for a symmetric candidate-generating density) is

$$\alpha(x, y) = \min\left\{ \frac{\exp\left[ -\frac{1}{2}(y - \mu)' \Sigma^{-1} (y - \mu) \right]}{\exp\left[ -\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu) \right]},\; 1 \right\}. \qquad (9)$$

We use the following candidate-generating densities, for which the parameters are adjusted by experimentation to achieve an acceptance rate of 40% to 50%:

1. Random walk generating density (y = x + z), where the increment random variable z is distributed as bivariate uniform, that is, the ith component of z is uniform on the interval (−δi, δi). Note that δ1 controls the spread along the first coordinate axis and δ2 the spread along the second. To avoid excessive moves we let δ1 = .75 and δ2 = 1.

2. Random walk generating density (y = x + z) with z distributed as independent normal N2(0, D), where D = diagonal(.6, .4).

3. Pseudorejection sampling generating density with dominating function ch(x) = c (2π)^{-1} |D|^{-1/2} exp[−(1/2)(x − μ)′ D^{-1} (x − μ)], where D = diagonal(2, 2) and c = .9. The trial draws, which are passed through the A-R step, are thus obtained from a bivariate, independent normal distribution.

4. The autoregressive generating density y = μ − (x − μ) + z, where z is independent uniform with δ1 = 1 = δ2. Thus values of y are obtained by reflecting the current point around μ and then adding the increment.

Note that the probability of move in cases 1, 2, and 4 is given by (9). In addition, the first two generating densities do not make use of the known value of Σ, although the values of the δi are related to Σ. In the third generating density we have set the value of the constant c to be smaller than that which leads to true domination. For domination it is necessary to let all diagonal entries of D be equal to 1.9 (the largest eigenvalue of Σ) and to choose c accordingly [see Dagpunar (1988, p. 159)].

Each of these four candidate-generating densities reproduces the shape of the bivariate normal distribution being

simulated, although overall the best result is obtained from the fourth generating density. To illustrate the characteristics of the output, the top panel of Figure 3 contains the scatter plot of N = 4,000 simulated values from the Choleski approach and the bottom panel the scatter plot of N = 6,000 simulated values using the fourth candidate-generating density. More observations are taken from the M-H algorithm to make the two plots comparable. The plots of the output with the other candidate-generating densities are similar to this and are therefore omitted. At the suggestion of a referee, points that repeat in the M-H chain are jittered to improve clarity. The figure clearly reveals that the sampler does a striking job of visiting the entire support of the distribution. This is confirmed by the estimated tail probabilities computed from the M-H output, for which the estimates are extremely close to the true values. Details are not reported to save space.

For the third generating density we found that reductions in the elements of D led to an erosion in the number of times the sampler visited the tails of the distribution. In addition, we found that the first-order serial correlation of the sampled values with the first and second candidate-generating densities is of the order .9, and with the other two it is .30 and .16, respectively. The high serial correlation with the random walk generating densities is not unexpected and stems from the long memory in the candidate draws. Finally, by reflecting the candidates we see that it is possible to obtain a beneficial reduction in the serial correlation of the output with little cost.
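A minimal sketch of the first candidate-generating density (the random walk with uniform increments), using the mean vector and spreads quoted above, might look as follows; this is our illustration rather than the authors' code, and the covariance matrix is the one assumed earlier.

```python
import numpy as np

rng = np.random.default_rng(6)

mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
delta = np.array([0.75, 1.0])      # spreads of the uniform increments

def log_kernel(x):
    # Log of the bivariate normal kernel; the normalizing constant is not needed.
    d = x - mu
    return -0.5 * d @ Sigma_inv @ d

x = np.zeros(2)
draws = np.empty((6000, 2))
accepted = 0
for i in range(draws.shape[0]):
    y = x + rng.uniform(-delta, delta)                           # random walk candidate
    if np.log(rng.uniform()) <= log_kernel(y) - log_kernel(x):   # probability of move (9)
        x = y
        accepted += 1
    draws[i] = x
print("acceptance rate:", accepted / draws.shape[0])
print("sample mean:", draws[1000:].mean(axis=0))
```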

7.2 Simulating a Bayesian Posterior

We now illustrate the use of the M-H algorithm to sample an intractable distribution that arises in a stationary second-order autoregressive [AR(2)] time series model. Our presentation is based on Chib and Greenberg (1994), which contains a more detailed discussion and results for the general ARMA(p, q) model.

For our illustration, we simulated 100 observations from the model

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t,$$

where φ1 = 1, φ2 = −.5, and ε_t ∼ N(0, 1). The values of φ = (φ1, φ2) lie in the region S ⊂ R² that satisfies the stationarity restrictions

$$\phi_1 + \phi_2 < 1, \qquad \phi_2 - \phi_1 < 1, \qquad |\phi_2| < 1.$$

Following Box and Jenkins (1976), we express the (exact or unconditional) likelihood function for this model, given the n = 100 data values Y_n = (y1, y2, . . . , yn)′, as

$$l(\phi, \sigma^2) = \Psi(\phi, \sigma^2) \times (\sigma^2)^{-(n-2)/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{t=3}^{n} (y_t - \phi' w_t)^2 \right\}, \qquad (11)$$

where w_t = (y_{t-1}, y_{t-2})′,

$$\Psi(\phi, \sigma^2) \propto (\sigma^2)^{-1}\, |V|^{1/2} \exp\left\{ -\frac{1}{2\sigma^2}\, Y_2' V Y_2 \right\} \qquad (12)$$

is the density of Y2 = (y1, y2)′ (here V = V(φ) is the 2 × 2 matrix determined by the stationary covariance of the first two observations),




and the third term in (11) is proportional to the density of the observations y3, . . . , yn given Y2.

If the only prior information available is that the process is stationary, then the posterior distribution of the parameters is

$$\pi(\phi, \sigma^2 \mid Y_n) \propto l(\phi, \sigma^2)\, I[\phi \in S],$$

where I[φ ∈ S] is 1 if φ ∈ S and 0 otherwise.

[Figure 3. Scatter Plots of Simulated Draws. Top panel: generated by the Choleski approach. Bottom panel: generated by M-H with the reflection candidate-generating density.]

How can this posterior density be simulated? The answer lies in recognizing two facts. First, the blocking strategy is useful for this problem, taking φ and σ² as blocks. Second, from the regression ANOVA decomposition, the exponential term of (11) is proportional to

$$\exp\left\{ -\frac{1}{2\sigma^2} (\phi - \hat{\phi})'\, G\, (\phi - \hat{\phi}) \right\},$$

where φ̂ = G^{-1} Σ_t w_t y_t and G = Σ_t w_t w_t′ (sums over t = 3, . . . , n). This is the kernel of the normal density with mean φ̂ and covariance matrix σ² G^{-1}. These observations immediately lead to the following full conditional densities for σ² and φ:

1. The density of σ² given φ and Y_n is inverted gamma with parameters n/2 and {Y2′ V Y2 + Σ_t (y_t − φ′w_t)²}/2.

2. The density of φ given σ² and Y_n is

$$\pi(\phi \mid Y_n, \sigma^2) \propto \Psi(\phi, \sigma^2) \times \left\{ f_N(\phi \mid \hat{\phi}, \sigma^2 G^{-1})\, I[\phi \in S] \right\}, \qquad (13)$$

where f_N is the normal density function.

A sample of draws from π(σ², φ | Y_n) can now be obtained by successively sampling φ from π(φ | Y_n, σ²) and, given this value of φ, simulating σ² from π(σ² | Y_n, φ). The latter simulation is straightforward. For the former,


Table 1. Summaries of the Posterior Distribution for the Simulated AR(2) Model

Param.   Mean     Num. SE   SD      Median   Lower    Upper    Corr.
φ1       1.044    .002      .082    1.045    .882     1.203
φ2       −.608    .001      .082    −.610    −.763    −.445    .109
σ        1.160    .003      .170    1.143    .877     1.544    .020

because it can be shown that |V|^{1/2} is bounded for all values of φ in the stationary region, we generate candidates from the density in curly braces of (13), following the idea described in Section 5. Then, the value of φ is simulated as follows: at the jth iteration (given the current value σ²(j)), draw a candidate φ(j+1) from a normal density with mean φ̂ and covariance σ²(j) G^{-1}; if it satisfies stationarity, move to this point with probability

$$\min\left\{ \frac{\Psi(\phi^{(j+1)}, \sigma^{2(j)})}{\Psi(\phi^{(j)}, \sigma^{2(j)})},\; 1 \right\},$$

and otherwise set φ(j+1) = φ(j), where Ψ(·, ·) is defined in (12). The A-R method of Section 2 can also be applied to this problem by drawing candidates φ(j+1) from the normal density in (13) until U ≤ Ψ(φ(j+1), σ²(j)). Many draws of φ may be necessary, however, before one is accepted, because Ψ(φ, σ²) can become extremely small. Thus the direct A-R method, although available, is not a competitive substitute for the M-H scheme described above.

In the sampling process we ignore the first n0 = 500 draws and collect the next N = 5,000. These are used to approximate the posterior distributions of φ and σ². It is worth mentioning that the entire sampling process took just 2 minutes on a 50 MHz PC. For comparison we obtained samples from the A-R method, which took about 4 times as long as the M-H algorithm.

The posterior distributions are summarized in Table 1, where we report the posterior mean (the average of the simulated values), the numerical standard error of the posterior mean (computed by the batch means method), the posterior standard deviation (the standard deviation of the simulated values), the posterior median, the lower 2.5 and upper 97.5 percentiles of the simulated values, and the sample first-order serial correlation in the simulated values (which is low and not of concern). From these results it is clear that the M-H algorithm has quickly and accurately produced a posterior distribution concentrated on the values that generated the data.

8. CONCLUDING REMARKS

Our goal in this article is to provide a tutorial exposition of the Metropolis-Hastings algorithm, a versatile, efficient, and powerful simulation technique. It borrows from the well-known A-R method the idea of generating candidates that are either accepted or rejected, but it retains the current value when rejection takes place. The

Markov chain thus generated can be shown to have the target distribution as its limiting distribution. Simulating from the target distribution is then accomplished by

running the chain a large number of times. We provide a simple, intuitive justification for the form taken by the probability of move in the M-H algorithm by showing its relation to reversibility. We also discuss implementation issues and two applications, the M-H acceptance-rejection algorithm and the use of the algorithm in a block-at-a-time setting. Finally, the procedures are illustrated with two examples.

[Received April 1994. Revised January 1995.]

REFERENCES

Bhattacharya, R. N., and Waymire, E. C. (1990), Stochastic Processes with Applications, New York: John Wiley.
Billingsley, P. (1986), Probability and Measure (2nd ed.), New York: John Wiley.
Box, G. E. P., and Jenkins, G. M. (1976), Time Series Analysis: Forecasting and Control (rev. ed.), San Francisco: Holden-Day.
Casella, G., and George, E. (1992), "Explaining the Gibbs Sampler," The American Statistician, 46, 167-174.
Chib, S., and Greenberg, E. (1993), "Markov Chain Monte Carlo Simulation Methods in Econometrics," Econometric Theory, in press.
Chib, S., and Greenberg, E. (1994), "Bayes Inference for Regression Models with ARMA(p, q) Errors," Journal of Econometrics, 64, 183-206.
Dagpunar, J. (1988), Principles of Random Variate Generation, New York: Oxford University Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Gelman, A. (1992), "Iterative and Non-Iterative Simulation Algorithms," Computing Science and Statistics (Interface Proceedings), 24, 433-438.
Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-511.
Geweke, J. (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration," Econometrica, 57, 1317-1340.
Hastings, W. K. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97-109.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), "Equations of State Calculations by Fast Computing Machines," Journal of Chemical Physics, 21, 1087-1092.
Meyn, S. P., and Tweedie, R. L. (1993), Markov Chains and Stochastic Stability, London: Springer-Verlag.
Müller, P. (1993), "A Generic Approach to Posterior Integration and Gibbs Sampling," manuscript.
Nummelin, E. (1984), General Irreducible Markov Chains and Non-Negative Operators, Cambridge: Cambridge University Press.
Phillips, D. B., and Smith, A. F. M. (1994), "Bayesian Faces via Hierarchical Template Modeling," Journal of the American Statistical Association, 89, 1151-1163.
Roberts, G. O., Gelman, A., and Gilks, W. R. (1994), "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms," Technical Report, University of Cambridge.
Roberts, G. O., and Tweedie, R. L. (1994), "Geometric Convergence and Central Limit Theorems for Multidimensional Hastings and Metropolis Algorithms," Technical Report, University of Cambridge.
Rubinstein, R. Y. (1981), Simulation and the Monte Carlo Method, New York: John Wiley.
Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 55, 3-24.
Tanner, M. A. (1993), Tools for Statistical Inference (2nd ed.), New York: Springer-Verlag.
Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation," Journal of the American Statistical Association, 82, 528-549.
Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions" (with discussion), Annals of Statistics, 22, 1701-1762.