
Chapter 57

MARKOV CHAIN MONTE CARLO METHODS: COMPUTATION AND INFERENCE

SIDDHARTHA CHIB *

John M. Olin School of Business, Washington University, Campus Box 1133, 1 Brookings Dr.,
St. Louis, MO 63130, USA

Contents

Abstract 3570
Keywords 3570
1 Introduction 3571
  1.1 Organization 3573
2 Classical sampling methods 3573
  2.1 Inverse transform method 3573
  2.2 Accept-reject algorithm 3575
  2.3 Method of composition 3576
3 Markov chains 3576
  3.1 Definitions and results 3577
  3.2 Computation of numerical accuracy and inefficiency factor 3579
4 Metropolis-Hastings algorithm 3580
  4.1 The algorithm 3581
  4.2 Convergence results 3584
  4.3 Example 3585
  4.4 Multiple-block M-H algorithm 3587
5 The Gibbs sampling algorithm 3589
  5.1 The algorithm 3590
  5.2 Connection with the multiple-block M-H algorithm 3591
  5.3 Invariance of the Gibbs Markov chain 3592
  5.4 Sufficient conditions for convergence 3592
  5.5 Estimation of density ordinates 3592
  5.6 Example: simulating a truncated multivariate normal 3594
6 Sampler performance and diagnostics 3595
7 Strategies for improving mixing 3596
  7.1 Choice of blocking 3597

* email: chib@olin.wustl.edu

Handbook of Econometrics, Volume 5, Edited by J.J. Heckman and E. Leamer. © 2001 Elsevier Science B.V. All rights reserved.


  7.2 Tuning the proposal density 3597
  7.3 Other strategies 3598

8 MCMC algorithms in Bayesian estimation 3599
  8.1 Overview 3599
  8.2 Notation and assumptions 3600
  8.3 Normal and student-t regression models 3602
  8.4 Binary and ordinal probit 3604
  8.5 Tobit censored regression 3607
  8.6 Regression with change point 3608
  8.7 Autoregressive time series 3610
  8.8 Hidden Markov models 3612
  8.9 State space models 3614

  8.10 Stochastic volatility model 3616
  8.11 Gaussian panel data models 3619
  8.12 Multivariate binary data models 3620

9 Sampling the predictive density 3623
10 MCMC methods in model choice problems 3626

  10.1 Background 3626
  10.2 Marginal likelihood computation 3627
  10.3 Model space-parameter space MCMC algorithms 3633
  10.4 Variable selection 3636
  10.5 Remark 3639

11 MCMC methods in optimization problems 3639
12 Concluding remarks 3641
References 3642

Abstract

This chapter reviews the recent developments in Markov chain Monte Carlo simulation methods. These methods, which are concerned with the simulation of high dimensional probability distributions, have gained enormous prominence and revolutionized Bayesian statistics. The chapter provides background on the relevant Markov chain theory and provides detailed information on the theory and practice of Markov chain sampling based on the Metropolis-Hastings and Gibbs sampling algorithms. Convergence diagnostics and strategies for implementation are also discussed. A number of examples drawn from Bayesian statistics are used to illustrate the ideas. The chapter also covers in detail the application of MCMC methods to the problems of prediction and model choice.

Keywords

JEL classification: C1, C4


1 Introduction

This chapter is concerned with the theory and practice of Markov chain Monte Carlo (MCMC) simulation methods. These methods, which deal with the simulation of high dimensional probability distributions, have over the last decade gained enormous prominence, sparked intense research interest, and energized Bayesian statistics [Tanner and Wong (1987), Casella and George (1992), Gelfand and Smith (1990, 1992), Smith and Roberts (1993), Tierney (1994), Chib and Greenberg (1995a, 1996), Besag, Green, Higdon and Mengersen (1995), Albert and Chib (1996), Tanner (1996), Gilks, Richardson and Spiegelhalter (1996), Carlin and Louis (2000), Geweke (1997), Gammerman (1997), Brooks (1998), Robert and Casella (1999)]. The idea behind these methods is simple and extremely general. In order to sample a given probability distribution, referred to as the target distribution, a suitable Markov chain is constructed with the property that its limiting, invariant distribution is the target distribution. Depending on the specifics of the problem, the Markov chain can be constructed by the Metropolis-Hastings algorithm, the Gibbs sampling method (a special case of the Metropolis method), or hybrid mixtures of these two algorithms. Once the Markov chain has been constructed, a sample of (correlated) draws from the target distribution can be obtained by simulating the Markov chain a large number of times and recording its values. In many situations, Markov chain Monte Carlo simulation provides the only practical way of obtaining samples from high dimensional probability distributions.

Markov chain sampling methods originated with the work of Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953), who proposed an algorithm to simulate a high dimensional discrete distribution. This algorithm found wide application in statistical physics but was mostly unknown to statisticians until the paper of Hastings (1970). Hastings generalized the Metropolis algorithm and applied it to the simulation of discrete and continuous probability distributions such as the normal and Poisson. Outside of statistical physics, Markov chain methods first found applications in spatial statistics and image analysis [Besag (1974)]. The more recent interest in MCMC methods can be traced to the papers of Geman and Geman (1984), who developed an algorithm that later came to be called the Gibbs sampler to sample a discrete distribution, Tanner and Wong (1987), who proposed a MCMC scheme involving "data augmentation" to sample posterior distributions in missing data problems, and Gelfand and Smith (1990), where the value of the Gibbs sampler was demonstrated for general Bayesian inference with continuous parameter spaces.

In Bayesian applications, the target distribution is typically the posterior distribution of the parameters, given the data. If M denotes a particular model, p(ψ | M), ψ ∈ ℝ^d, the prior density of the parameters in that model, and f(y | ψ, M) the assumed sampling density (likelihood function) for a vector of observations y, then the posterior density is given by

π(ψ | y, M) ∝ p(ψ | M) f(y | ψ, M),    (1)


where the normalizing constant of the density, called the marginal likelihood,

m(y | M) = ∫_{ℝ^d} p(ψ | M) f(y | ψ, M) dψ,

is almost never known in analytic form. As may be expected, an important goal of the Bayesian analysis is to summarize the posterior density. Particular summaries, such as the posterior mean and posterior covariance matrix, are especially important, as are interval estimates (called credible intervals) with specified posterior probabilities. The calculation of these quantities reduces to the evaluation of the integral

∫_{ℝ^d} h(ψ) π(ψ | y, M) dψ,

under various choices of the function h. For example, to get the posterior mean one lets h(ψ) = ψ, and for the second moment matrix one lets h(ψ) = ψψ′, from which the posterior covariance matrix and posterior standard deviations may be computed.

In the pre-MCMC era, posterior summaries were usually obtained either by analytic approximations, such as the method of Laplace for integrals [Tierney and Kadane (1986)], or by the method of importance sampling [Kloek and van Dijk (1978), Geweke (1989)]. Although both techniques continue to have uses (for example, the former in theoretical, asymptotic calculations), neither method is sufficiently flexible to be used routinely for the kinds of high-dimensional problems that arise in practice. A shift in thinking was made possible by the advent of MCMC methods. Instead of focusing on the question of moment calculation directly, one may consider the more general question of drawing sample variates from the distribution whose summaries are sought. For example, to summarize the posterior density π(ψ | y, M) one can produce a simulated sample {ψ^(1), ..., ψ^(M)} from this posterior density, and from this simulated sample the posterior expectation of h(ψ) can be estimated by the average

M⁻¹ Σ_{j=1}^{M} h(ψ^(j)).    (2)

Under independent sampling from the posterior, which is rarely feasible, this calculation would be justified by classical laws of large numbers. In the context of MCMC sampling the draws are correlated but, nonetheless, a suitable law of large numbers for Markov chains, presented below, can be used to establish the fact that

M⁻¹ Σ_{j=1}^{M} h(ψ^(j)) → ∫ h(ψ) π(ψ | y, M) dψ  as  M → ∞.

It is important to bear in mind that the convergence specified here is in terms of the simulation sample size M and not in terms of the data sample size N, which is fixed.
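As a quick illustration of the averaging in Equation (2), the following sketch (not from the chapter; it uses a standard normal as a stand-in posterior that can be sampled directly) estimates a posterior expectation by a sample average:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in posterior: a standard normal plays the role of pi(psi | y, M).
M = 200_000
draws = rng.standard_normal(M)   # psi^(1), ..., psi^(M)

# Sample-average estimate of E[h(psi)] with h(psi) = psi^2 (true value 1).
h_bar = np.mean(draws ** 2)
```

With M this large the average is close to the exact second moment; the simulation precision is controlled entirely by M, not by any data sample size.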


This means that one can achieve any desired precision by taking M to be as large as required, subject to the constraint on computing time.

The Monte Carlo approach to inference also provides elegant solutions to the Bayesian problems of prediction and model choice. For the latter, algorithms are available that proceed to sample over both model space and parameter space, such as in the methods of Carlin and Chib (1995) and Green (1995), or those that directly compute the evidential quantities that are required for Bayesian model comparisons, namely marginal likelihoods and their ratios, Bayes factors [Jeffreys (1961)]; these approaches are developed by Gelfand and Dey (1994), Chib (1995), Verdinelli and Wasserman (1995), Meng and Wong (1996), Di Ciccio, Kass, Raftery and Wasserman (1997), Chib and Jeliazkov (2001), amongst others. Discussion of these techniques is provided in detail below.

1.1 Organization

The rest of the chapter is organized as follows. Section 2 provides a brief review of three classical sampling methods that are discussed or used in the sequel. Section 3 summarizes the relevant Markov chain theory that justifies simulation by MCMC methods. In particular, we provide the conditions under which discrete-time and continuous state space Markov chains satisfy a law of large numbers and a central limit theorem. The Metropolis-Hastings algorithm is discussed in Section 4, followed by the Gibbs sampling algorithm in Section 5. Methods for diagnosing convergence are considered in Section 6 and strategies for improving the mixing of the Markov chains in Section 7. In Section 8 we discuss how MCMC methods can be applied to simulate the posterior distributions that arise in various canonical statistical models. Bayesian prediction and model choice problems are presented in Sections 9 and 10, respectively, and the MCMC-based EM algorithm is considered in Section 11. Section 12 concludes with brief comments about new and emerging directions in MCMC methods.

2 Classical sampling methods

We now briefly review three sampling methods, which we refer to as classical methods, that deliver independent and identically distributed draws from the target density. Authoritative surveys of these and other such methods are provided by Devroye (1985), Ripley (1987) and Gentle (1998). Although these methods are technically outside the scope of this chapter, the separation is somewhat artificial because, in practice, all MCMC methods in one way or another make some use of classical simulation methods. The ones we have chosen to discuss here are those that are mentioned or used explicitly in the sequel.

2.1 Inverse transform method

This method is particularly useful in the context of discrete distribution functions and is based on taking the inverse transform of the cumulative distribution function (hence its name). Suppose we want to generate the value of a discrete random variable with mass function

Pr(ψ = ψ_j) = p_j,   j = 1, 2, ...,   Σ_j p_j = 1,

and cumulative mass function

Pr(ψ ≤ ψ_j) ≡ F(ψ_j) = p_1 + p_2 + ··· + p_j.

The function F is a right-continuous step function that has a jump at the point ψ_j equal to p_j and is constant otherwise. It is not difficult to see that its inverse takes the form

F⁻¹(u) = ψ_j   if   p_1 + ··· + p_{j−1} < u ≤ p_1 + ··· + p_j.    (3)

A random variate from this distribution is obtained by generating U uniform on (0, 1) and computing F⁻¹(U), where F⁻¹ is the inverse function in Equation (3). This method samples ψ_j with probability p_j because

Pr(F⁻¹(U) = ψ_j) = Pr(p_1 + ··· + p_{j−1} < U ≤ p_1 + ··· + p_j) = p_j.
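The discrete inverse transform can be sketched as follows (an illustration, not the chapter's code; the mass function values and the use of `numpy.searchsorted` to evaluate F⁻¹ are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.2, 0.5, 0.3])   # mass function p_1, p_2, p_3 (arbitrary example)
cdf = np.cumsum(p)
cdf[-1] = 1.0                   # guard against floating-point error in the last value

def sample_discrete(n):
    u = rng.uniform(size=n)
    # F^{-1}(u): smallest j with p_1 + ... + p_j >= u, as in Equation (3)
    return np.searchsorted(cdf, u, side='left')

draws = sample_discrete(100_000)
freq = np.bincount(draws, minlength=3) / len(draws)
```

The empirical frequencies `freq` approach (p_1, p_2, p_3) as the number of draws grows.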

An equivalent version is available for continuous random variables. An important application is to the sampling of a truncated normal distribution. Suppose, for example, that

ψ ∼ TN_{(a,b)}(μ, σ²),

a univariate normal distribution truncated to the interval (a, b), with distribution function

F(t) = 0                                  if t < a,
F(t) = [Φ((t − μ)/σ) − p_1] / (p_2 − p_1)   if a ≤ t < b,    (4)
F(t) = 1                                  if b ≤ t,

where

p_1 = Φ((a − μ)/σ),   p_2 = Φ((b − μ)/σ).

To generate a sample variate from this distribution one must solve the equation F(t) = U, where U is uniform on (0, 1). Algebra yields

t = μ + σ Φ⁻¹(p_1 + U(p_2 − p_1)).    (5)

Although the inverse distribution method is useful, it is rather difficult to apply in the setting of multi-dimensional distributions.
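Equation (5) can be sketched in code as follows (an illustration, not the chapter's code; the parameter values are arbitrary and `scipy.stats.norm` supplies Φ and Φ⁻¹):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

mu, sigma, a, b = 1.0, 2.0, 0.0, 3.0        # arbitrary illustrative parameters
p1 = norm.cdf((a - mu) / sigma)             # Phi((a - mu)/sigma)
p2 = norm.cdf((b - mu) / sigma)             # Phi((b - mu)/sigma)

# Equation (5): map uniforms through the inverse cdf of the truncated normal.
U = rng.uniform(size=50_000)
t = mu + sigma * norm.ppf(p1 + U * (p2 - p1))
```

Every draw lands in (a, b) by construction, which is the point of the transform.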


2.2 Accept-reject algorithm

The accept-reject method is the basis for many of the well known univariate random number generators that are provided in software programs. This method is characterized by a source density h(ψ), which is used to supply candidate values, and a constant c, determined by analysis, such that for all ψ

π(ψ) ≤ c h(ψ).

Note that the accept-reject method does not require knowledge of the normalizing constant of π because that constant can be absorbed in c. Then, in the accept-reject method, one draws a variate from h, accepting it with probability π(ψ)/{c h(ψ)}. If the particular proposal is rejected, a new one is drawn and the process continued until one is accepted. The accepted draws constitute an independent and identically distributed (i.i.d.) sample from π.

In algorithmic form, the accept-reject method can be described as follows.

Algorithm 1: Accept-Reject
(1) Repeat for j = 1, 2, ..., M.
    (a) Generate
        ψ′ ∼ h(ψ);   U ∼ Unif(0, 1).
    (b) Let ψ^(j) = ψ′ if
        U ≤ π(ψ′)/{c h(ψ′)};
        otherwise go to step 1(a).
(2) Return the values {ψ^(1), ψ^(2), ..., ψ^(M)}.

The idea behind this algorithm may be explained quite simply using Figure 1. Imagine drawing random bivariate points in the region bounded above by the function c h(ψ) and below by the x-axis. A point in this region may be drawn by first drawing ψ′ from h(ψ), which fixes the x-coordinate of the point, and then drawing the y-coordinate of the point as U c h(ψ′). Now, if U c h(ψ′) ≤ π(ψ′), the point lies below π and is accepted; but the latter is simply the acceptance condition of the accept-reject method, which completes the justification.
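In code, Algorithm 1 might look as follows (a sketch under assumed choices: the target is the Beta(2, 2) density 6ψ(1 − ψ) on (0, 1), the source h is Uniform(0, 1), and c = 1.5 bounds π/h; the repeat-until-accept loop is vectorized):

```python
import numpy as np

rng = np.random.default_rng(3)

def pi(x):
    return 6.0 * x * (1.0 - x)      # Beta(2, 2) density on (0, 1)

c = 1.5                             # pi(x) <= 1.5 = c * h(x) for h = Uniform(0, 1)
N = 200_000
psi = rng.uniform(size=N)           # candidates from the source density h
u = rng.uniform(size=N)
draws = psi[u <= pi(psi) / c]       # accept with probability pi(psi)/(c h(psi))
```

The overall acceptance rate is 1/c, so a tight bound c keeps the method efficient; the accepted draws have the Beta(2, 2) mean 1/2 and variance 1/20.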

Below we shall discuss a Markov chain Monte Carlo version of the accept-reject method that can be used when the condition π(ψ) ≤ c h(ψ) does not hold for all values of ψ.


Fig. 1. Graphical illustration of the accept-reject method.

2.3 Method of composition

This method is based on the observation that if the joint density π(ψ_1, ψ_2) is expressed as

π(ψ_1, ψ_2) = π(ψ_1) π(ψ_2 | ψ_1),

and each density on the right hand side is easily sampled, then a draw from the joint distribution may be obtained by
(1) drawing ψ_1^(j) from π(ψ_1), and then
(2) drawing ψ_2^(j) from π(ψ_2 | ψ_1^(j)).
Because (ψ_1^(j), ψ_2^(j)) is a draw from the joint distribution, it follows that the second component of the simulated vector is a draw from the marginal distribution of ψ_2:

ψ_2^(j) ∼ π(ψ_2) = ∫ π(ψ_2 | ψ_1) π(ψ_1) dψ_1.

Thus, to obtain a draw ψ_2^(j) from π(ψ_2), it is sufficient to produce a sample from the joint distribution and retain the second component. This method is quite important and arises frequently in the setting of MCMC methods.
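A small sketch of the method of composition (illustrative choices, not from the chapter: ψ_1 ∼ N(0, 1) and ψ_2 | ψ_1 ∼ N(ψ_1, 1), so the retained second components are marginally N(0, 2)):

```python
import numpy as np

rng = np.random.default_rng(4)

M = 200_000
psi1 = rng.standard_normal(M)              # step (1): psi1 ~ pi(psi1) = N(0, 1)
psi2 = rng.normal(loc=psi1, scale=1.0)     # step (2): psi2 ~ pi(psi2 | psi1) = N(psi1, 1)
# Retaining psi2 alone gives draws from the marginal, here N(0, 2).
```

The marginal variance 2 = 1 + 1 reflects integrating the conditional N(ψ_1, 1) over ψ_1 ∼ N(0, 1).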

3 Markov chains

Markov chain Monte Carlo is a method to sample a given multivariate distribution π* by constructing a suitable Markov chain with the property that its limiting, invariant distribution is the target distribution π*. In most problems of interest, the distribution π* is absolutely continuous and, as a result, the theory of MCMC methods is based on that of Markov chains on continuous state spaces outlined, for example, in Nummelin (1984) and Meyn and Tweedie (1993). Tierney (1994) is the fundamental reference for drawing the connections between this elaborate Markov chain theory and MCMC methods. Basically, the goal of the analysis is to specify conditions under which the constructed Markov chain converges to the invariant distribution, and conditions under which sample path averages based on the output of the Markov chain satisfy a law of large numbers and a central limit theorem.

3.1 Definitions and results

A Markov chain is a collection of random variables (or vectors) Φ = {φ_i : i ∈ T}, where T = {0, 1, 2, ...}. The evolution of the Markov chain on a space Ω ⊆ ℝ^d is governed by the transition kernel

P(x, A) ≡ Pr(φ_{i+1} ∈ A | φ_i = x, φ_j, j < i)
        = Pr(φ_{i+1} ∈ A | φ_i = x),   x ∈ Ω,  A ⊆ Ω,

which embodies the Markov assumption that the distribution of each succeeding state in the sequence, given the current and the past states, depends only on the current state.

In general, in the context of Markov chain simulations, the transition kernel has both a continuous and a discrete component. For some function p(x, y): Ω × Ω → ℝ⁺, the kernel can be expressed as

P(x, dy) = p(x, y) dy + r(x) δ_x(dy),    (6)

where p(x, x) = 0, δ_x(dy) = 1 if x ∈ dy and 0 otherwise, and r(x) = 1 − ∫_Ω p(x, y) dy. This transition kernel specifies that transitions from x to y occur according to p(x, y) and transitions from x to x occur with probability r(x).

The transition kernel is thus the distribution of φ_{i+1} given that φ_i = x. The nth-step-ahead transition kernel is given by

P^(n)(x, A) = ∫_Ω P(x, dy) P^(n−1)(y, A),

where P^(1)(x, dy) = P(x, dy) and

P(x, A) = ∫_A P(x, dy).    (7)

The objective is to elucidate the conditions under which the nth iterate of the transition kernel converges to the invariant distribution π* as n → ∞. The invariant distribution satisfies

π*(dy) = ∫_Ω P(x, dy) π(x) dx,    (8)

where π is the density of π* with respect to the Lebesgue measure (thus, π*(dy) = π(y) dy). The invariance condition states that if φ_i is distributed according to π*, then all subsequent elements of the chain are also distributed as π*. It should be noted that Markov chain samplers are invariant by construction and therefore the existence of the invariant distribution does not have to be checked in any particular application of MCMC methods.

A Markov chain is said to be reversible if the function p(x, y) in Equation (6) satisfies

f(x) p(x, y) = f(y) p(y, x),    (9)

for a density f(·). If this condition holds, it can be shown that f(·) = π(·), so a reversible chain has π* as an invariant distribution [see Tierney (1994)]. To verify this we evaluate the right hand side of Equation (8):

∫ P(x, A) π(x) dx = ∫ {∫_A p(x, y) dy} π(x) dx + ∫ r(x) δ_x(A) π(x) dx
                  = ∫_A {∫ p(x, y) π(x) dx} dy + ∫_A r(x) π(x) dx
                  = ∫_A {∫ p(y, x) π(y) dx} dy + ∫_A r(x) π(x) dx    (10)
                  = ∫_A (1 − r(y)) π(y) dy + ∫_A r(x) π(x) dx
                  = ∫_A π(y) dy.
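The reversibility argument can be checked numerically on a small discrete chain (an illustration, not from the chapter: a three-state Metropolis kernel with a symmetric uniform proposal, for which detailed balance as in Equation (9), and hence invariance as in Equation (8), holds exactly):

```python
import numpy as np

pi_t = np.array([0.2, 0.3, 0.5])   # target distribution on three states
n = len(pi_t)
P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if x != y:
            # propose y uniformly (prob 1/n), accept with min{1, pi(y)/pi(x)}
            P[x, y] = (1.0 / n) * min(1.0, pi_t[y] / pi_t[x])
    P[x, x] = 1.0 - P[x].sum()     # rejection mass r(x) stays at x

flows = pi_t[:, None] * P          # flows[x, y] = pi(x) p(x, y)
balance_ok = np.allclose(flows, flows.T)   # detailed balance, Equation (9)
invariant = pi_t @ P               # should reproduce pi_t, Equation (8)
```

Here pi(x) p(x, y) = min{pi(x), pi(y)}/n off the diagonal, which is symmetric in x and y, so the flow matrix equals its transpose and pi P = pi.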

A minimal requirement to ensure that the Markov chain satisfies a law of large numbers is π*-irreducibility. This is the requirement that the chain is able to visit all sets with positive probability under π* from any starting point in Ω. Formally, a Markov chain is said to be π*-irreducible if for every x ∈ Ω,

π*(A) > 0  ⇒  Pr(φ_i ∈ A | φ_0 = x) > 0,

for some i ≥ 1. If the space Ω is connected and the function p(x, y) is positive and continuous, then the Markov chain with transition kernel given by Equation (7) and invariant distribution π* is π*-irreducible.

Another important property of a chain is aperiodicity, which ensures that the chain does not cycle through a finite number of sets. A Markov chain is aperiodic if there exists no partition of Ω = (D_0, D_1, ..., D_{p−1}) for some p ≥ 2 such that Pr(φ_i ∈ D_{i mod p} | φ_0 ∈ D_0) = 1 for all i.

These definitions allow us to state the following results [see Tierney (1994)], which form the basis for Markov chain Monte Carlo methods. The first of these results gives conditions under which a strong law of large numbers holds and the second gives conditions under which the probability density of the Mth iterate of the Markov chain converges to its unique, invariant density.


Theorem 1. Suppose {φ_i} is a π*-irreducible Markov chain with transition kernel P(·, ·) and invariant distribution π*. Then π* is the unique invariant distribution of P(·, ·) and, for all π*-integrable real-valued functions h,

M⁻¹ Σ_{i=1}^{M} h(φ_i) → ∫ h(x) π(x) dx  as  M → ∞,  a.s.

Theorem 2. Suppose {φ_i} is a π*-irreducible, aperiodic Markov chain with transition kernel P(·, ·) and invariant distribution π*. Then for π*-almost every x ∈ Ω and all sets A,

‖P^M(x, A) − π*(A)‖ → 0  as  M → ∞,

where ‖·‖ denotes the total variation distance.

A further strengthening of the conditions is required to obtain a central limit theorem for sample-path averages. A key requirement is that of an ergodic chain, i.e., a chain that is irreducible, aperiodic and positive Harris-recurrent [for a definition of the latter, see Tierney (1994)]. In addition, one needs the notion of geometric ergodicity. An ergodic Markov chain with invariant distribution π* is geometrically ergodic if there exists a non-negative real-valued function C (bounded in expectation under π*) and a positive constant r < 1 such that

‖P^n(x, A) − π*(A)‖ ≤ C(x) rⁿ,

for all x, all n, and sets A. Chan and Geyer (1994) show that if the Markov chain is ergodic, has invariant distribution π*, and is geometrically ergodic, then for all L² measurable functions h, taken to be scalar-valued for simplicity, and any initial distribution, the distribution of √M (h̄_M − E h) converges weakly to a normal distribution with mean zero and variance σ_h² > 0, where

h̄_M = M⁻¹ Σ_{i=1}^{M} h(φ_i),   E h = ∫ h(φ) π(φ) dφ,

and

σ_h² = Var h(φ_0) + 2 Σ_{k=1}^{∞} Cov{h(φ_0), h(φ_k)}.    (11)

3.2 Computation of numerical accuracy and inefficiency factor

Let φ_1, φ_2, ..., φ_M denote the output from a Markov chain, possibly collected after discarding the iterates from an initial burn-in period, and suppose that, as above, h̄_M = M⁻¹ Σ_{i=1}^{M} h(φ_i) denotes the sample average of the scalar function h. Then, in this context, the variance of h̄_M based on {h(φ_1), ..., h(φ_M)} is an estimate of σ_h²/M, where the square root of the variance of h̄_M is referred to as the numerical standard error.

To describe estimators of σ_h² that are consistent in M, let Z_i = h(φ_i), i ≤ M. Then, due to the fact that {Z_i} is a dependent sequence,

Var(h̄_M) = M⁻² Σ_{j,k} Cov(Z_j, Z_k)
          = s² M⁻² Σ_{j,k=1}^{M} ρ_{|j−k|}
          = s² M⁻¹ {1 + 2 Σ_{s=1}^{M−1} (1 − s/M) ρ_s},

where s² is the sample variance of {Z_i} and ρ_s is the estimated autocorrelation at lag s [see Ripley (1987, Ch. 6)]. If ρ_s > 0 for each s, then this variance is larger than s²/M, which is the variance under independence. Another estimate of the variance can be found by consistently estimating the spectral density f of {Z_i} at frequency zero and using the fact that Var(h̄_M) = τ²/M, where τ² = 2π f(0). Finally, a traditional approach to finding the variance is by the method of "batch means". In this approach, the data (Z_1, ..., Z_M) are divided into k batches of length m with means B_i = m⁻¹[Z_{(i−1)m+1} + ··· + Z_{im}], and the variance of h̄_M is estimated as

Var(h̄_M) = {k(k − 1)}⁻¹ Σ_{i=1}^{k} (B_i − h̄_M)²,    (12)

where the batch size m is chosen to ensure that the first order serial correlation of the batch means is less than 0.05.

Given the numerical variance, it is common to calculate the inefficiency factor, which is also called the autocorrelation time, defined as

Var(h̄_M) / (s²/M).    (13)

This quantity is interpreted as the ratio of the numerical variance of h̄_M to the variance of h̄_M based on independent draws, and its inverse is the relative numerical efficiency defined in Geweke (1992). The inefficiency factor serves to quantify the relative efficiency loss in the computation of h̄_M from correlated versus independent samples.
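The batch-means estimate (12) and the inefficiency factor (13) can be sketched as follows (illustrative, not from the chapter: the "chain" is an AR(1) sequence with coefficient 0.5, whose true inefficiency factor is (1 + 0.5)/(1 − 0.5) = 3; the batch length m = 500 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)

phi = 0.5
M, m = 100_000, 500                 # chain length, batch length
eps = rng.standard_normal(M)
z = np.empty(M)                     # z_i = h(phi_i): here an AR(1) sequence
z[0] = eps[0]
for i in range(1, M):
    z[i] = phi * z[i - 1] + eps[i]

k = M // m                          # number of batches
batch_means = z[: k * m].reshape(k, m).mean(axis=1)
var_hbar = batch_means.var(ddof=1) / k        # batch-means estimate, Equation (12)
ineff = var_hbar / (z.var(ddof=1) / M)        # inefficiency factor, Equation (13)
```

For this chain `ineff` comes out near 3, i.e., roughly three correlated draws carry the information of one independent draw.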

4 Metropolis-Hastings algorithm

The Metropolis-Hastings (M-H) method is a general MCMC method to produce sample variates from a given multivariate density [Tierney (1994), Chib and Greenberg (1995a)]. It is based on a candidate generating density that is used to supply a proposal value and a probability of move that is used to determine if the proposal value should be taken as the next item of the chain. The probability of move is based on the ratio of the target density (evaluated at the proposal value in the numerator and the current value in the denominator) times the ratio of the proposal density (at the current value in the numerator and the proposal value in the denominator). Because ratios of the target density are involved, knowledge of the normalizing constant of the target density is not required. There are a number of special cases of this method, each defined either by the form of the proposal density or by the form in which the components of ψ are revised, say in one block or in several blocks. The method is extremely general and powerful, it being possible in principle to view almost any MCMC algorithm, in one way or another, as a variant of the M-H algorithm.

4.1 The algorithm

The goal is to simulate the d-dimensional distribution π*(ψ), ψ ∈ Ψ ⊆ ℝ^d, that has density π(ψ) with respect to some dominating measure. To define the algorithm, let q(ψ, ψ′) denote the candidate generating density, also called a proposal density, that is used to supply a candidate value ψ′ given the current value ψ, and let α(ψ, ψ′) denote the function

α(ψ, ψ′) = min{ π(ψ′) q(ψ′, ψ) / [π(ψ) q(ψ, ψ′)], 1 }   if π(ψ) q(ψ, ψ′) > 0;    (14)
α(ψ, ψ′) = 1   otherwise.

Then, in the M-H algorithm, a candidate value ψ′ is drawn from the proposal density and taken to be the next item of the chain with probability α(ψ, ψ′). If the proposal value is rejected, then the next sampled value is taken to be the current value. In algorithmic form, the simulated values are obtained by the following recursive procedure.

Algorithm 2: Metropolis-Hastings
(1) Specify an initial value ψ^(0).
(2) Repeat for j = 1, 2, ..., M.
    (a) Propose
        ψ′ ∼ q(ψ^(j), ·).
    (b) Let
        ψ^(j+1) = ψ′ if Unif(0, 1) ≤ α(ψ^(j), ψ′);
        ψ^(j+1) = ψ^(j) otherwise.
(3) Return the values {ψ^(1), ψ^(2), ..., ψ^(M)}.
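A minimal sketch of Algorithm 2 in its random-walk form (illustrative choices, not from the chapter: a standard normal target and a normal increment proposal with scale 2.5; since only the ratio of target densities enters the acceptance step, the target is coded up to a constant, on the log scale for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(6)

def log_pi(psi):
    # log of the standard normal target, up to an additive constant
    return -0.5 * psi ** 2

M, scale = 100_000, 2.5
psi = 0.0                           # initial value psi^(0)
chain = np.empty(M)
for j in range(M):
    prop = psi + scale * rng.standard_normal()   # symmetric random-walk proposal
    # accept with probability min{pi(prop)/pi(psi), 1}, computed on the log scale
    if np.log(rng.uniform()) <= log_pi(prop) - log_pi(psi):
        psi = prop
    chain[j] = psi                  # on rejection the current value is repeated
```

The recorded chain has the target's moments in the limit, even though the normalizing constant of π was never used.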



Fig. 2. Original Metropolis algorithm: the higher density proposal is accepted with probability one and the lower density proposal with probability α.

The M-H algorithm delivers variates from π under general conditions. Of course, the variates are from π only in the limit as the number of iterations becomes large but, in practice, after an initial burn-in phase consisting of (say) n_0 iterations, the chain is assumed to have converged and subsequent values are taken as approximate draws from π. Because theoretical calculation of the burn-in is not easy, it is important that the proposal density be chosen to ensure that the chain makes large moves through the support of the invariant distribution without staying at one place for many iterations. Generally, the empirical behavior of the M-H output is monitored by the autocorrelation time of each component of ψ and by the acceptance rate, which is the proportion of times a move is made as the sampling proceeds.

One should observe that the target density appears as a ratio in the probability α(ψ, ψ′) and therefore the algorithm can be implemented without knowledge of the normalizing constant of π(·). Furthermore, if the candidate-generating density is symmetric, i.e., q(ψ, ψ′) = q(ψ′, ψ), the acceptance probability only contains the ratio π(ψ′)/π(ψ); hence, if π(ψ′) ≥ π(ψ), the chain moves to ψ′, otherwise it moves with probability given by π(ψ′)/π(ψ). The latter is the algorithm originally proposed by Metropolis et al. (1953). This version of the algorithm is illustrated in Figure 2.

Different proposal densities give rise to specific versions of the M-H algorithm,each with the correct invariant distribution ar One family of candidate-generatingdensities is given by q(VP, ') = q(ip' lp) The candidate ip' is thus drawn accordingto the process ip' = ip + z, where z follows the distribution q Since the candidate isequal to the current value plus noise, this case is called a random walk M-H chain.Possible choices for q include the multivariate normal density and the multivariate-t.The random walk M-H chain is perhaps the simplest version of the M-H algorithm




[and was the one used by Metropolis et al. (1953)] and is quite popular in applications. One has to be careful, however, in setting the variance of z; if it is too large it is possible that the chain may remain stuck at a particular value for many iterations, while if it is too small the chain will tend to make small moves and move inefficiently through the support of the target distribution. Both circumstances will tend to generate draws that are highly serially correlated. Note that when q is symmetric, the usual circumstance, q(z) = q(−z) and the probability of move only contains the ratio π(ψ′)/π(ψ). As mentioned earlier, the same reduction occurs if q(ψ, ψ′) = q(ψ′, ψ).
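As a concrete sketch of this discussion (the code and all names are illustrative, not from the chapter), a random walk M-H chain for a univariate target known only up to a normalizing constant can be written as follows:

```python
import numpy as np

def random_walk_mh(log_target, x0, scale, n_iter, rng):
    """Random walk M-H chain: propose x' = x + scale * z, z ~ N(0, 1)."""
    x = x0
    draws = np.empty(n_iter)
    accepted = 0
    for i in range(n_iter):
        prop = x + scale * rng.standard_normal()
        # Symmetric proposal, so the probability of move reduces to pi(x')/pi(x)
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
            accepted += 1
        draws[i] = x
    return draws, accepted / n_iter

rng = np.random.default_rng(0)
# Target: standard normal kernel; the normalizing constant is never needed
draws, acc_rate = random_walk_mh(lambda x: -0.5 * x * x, 0.0, 2.4, 20000, rng)
```

Setting `scale` too large or too small produces the highly serially correlated output described above; monitoring the acceptance rate and the autocorrelation of the draws is the usual guide.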

Hastings (1970) considers a second family of candidate-generating densities that are given by the form q(ψ, ψ′) = q(ψ′). Tierney (1994) refers to this as an independence M-H chain because, in contrast to the random walk chain, the candidates are drawn independently of the current location ψ. In this case, the probability of move becomes

α(ψ, ψ′) = min { w(ψ′)/w(ψ), 1 },

where w(ψ) = π(ψ)/q(ψ) is the ratio of the target and proposal densities. For this method to work and not get stuck in the tails of π, it is important that the proposal density have thicker tails than π. A similar requirement is placed on the importance sampling function in the method of importance sampling [Geweke (1989)]. In fact, Mengersen and Tweedie (1996) show that if w(ψ) is uniformly bounded then the resulting Markov chain is ergodic.

Chib and Greenberg (1994) discuss a way of formulating proposal densities in the context of time series autoregressive-moving average models that has a bearing on the choice of proposal density for the independence M-H chain. They suggest matching the proposal density to the target at the mode by a multivariate normal or multivariate-t distribution with location given by the mode of the target and dispersion given by the inverse of the Hessian evaluated at the mode. Specifically, the parameters of the proposal density are taken to be

m = arg max_ψ log π(ψ) and V = τ { −∂² log π(ψ)/∂ψ ∂ψ′ }⁻¹ evaluated at ψ = m,    (15)

where τ is a tuning parameter that is adjusted to control the acceptance rate. The proposal density is then specified as q(ψ′) = f(ψ′ | m, V), where f is some multivariate density. This may be called a tailored M-H chain.
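A sketch of this tailoring recipe for a univariate Gamma-type target; the finite-difference Newton step, the tuning values and all names below are illustrative assumptions, not the chapter's implementation:

```python
import numpy as np

def tailored_mh(log_target, psi0, n_iter, rng, tau=1.0, df=15):
    # Locate the mode m by Newton's method with finite-difference derivatives
    m, h = psi0, 1e-4
    for _ in range(50):
        g = (log_target(m + h) - log_target(m - h)) / (2 * h)
        H = (log_target(m + h) - 2 * log_target(m) + log_target(m - h)) / h ** 2
        m -= g / H
    v = tau * (-1.0 / H)          # dispersion: tau times inverse negative Hessian
    s = np.sqrt(v)

    def log_q(x):                 # log Student-t(m, v, df) proposal density, up to a constant
        return -0.5 * (df + 1) * np.log1p((x - m) ** 2 / (df * v))

    x = m
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = m + s * rng.standard_t(df)
        # Independence chain: alpha = min{ w(prop)/w(x), 1 } with w = pi/q
        log_w_ratio = (log_target(prop) - log_q(prop)) - (log_target(x) - log_q(x))
        if np.log(rng.uniform()) < log_w_ratio:
            x = prop
        draws[i] = x
    return m, draws

rng = np.random.default_rng(1)
# Target: Gamma(5, 1) kernel, log pi = 4 log(psi) - psi, with mode 4 and mean 5
log_target = lambda p: 4 * np.log(p) - p if p > 0 else -np.inf
m, draws = tailored_mh(log_target, 3.0, 20000, rng)
```

The t proposal has thicker tails than the target on the right, in line with the boundedness requirement on w(ψ) noted above.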

Another way to generate proposal values is through a Markov chain version of the accept-reject method. In this version, due to Tierney (1994), a pseudo accept-reject step is used to generate candidates for an M-H algorithm. Suppose c > 0 is a known constant and h(ψ) a source density. Let C = {ψ : π(ψ) ≤ c h(ψ)} denote the set of values for which c h(ψ) dominates the target density and assume that this set has high probability under π. Now given ψ^(n) = ψ, the next value ψ^(n+1)


is obtained as follows: First, a candidate value ψ′ is obtained, independent of the current value ψ, by applying the accept-reject algorithm with c h(·) as the "pseudo dominating" density. The candidates ψ′ that are produced under this scheme have density q(ψ′) ∝ min{π(ψ′), c h(ψ′)}. If we let w(ψ) = π(ψ)/{c h(ψ)} then it can be shown that the M-H probability of move is given by

α(ψ, ψ′) =
    1                          if ψ ∈ C,
    1/w(ψ)                     if ψ ∉ C, ψ′ ∈ C,    (16)
    min {w(ψ′)/w(ψ), 1}        if ψ ∉ C, ψ′ ∉ C.
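A sketch of this pseudo accept-reject scheme and the probability of move in Equation (16); the standard normal target kernel, the Laplace source density and the value of c are illustrative assumptions only:

```python
import numpy as np

def armh(log_pi, log_h, sample_h, c, x0, n_iter, rng):
    """Accept-reject M-H with pseudo-dominating density c*h (Tierney 1994)."""
    log_c = np.log(c)
    in_C = lambda x: log_pi(x) <= log_c + log_h(x)    # region where c*h dominates pi
    log_w = lambda x: log_pi(x) - log_c - log_h(x)    # w = pi / (c*h)

    def candidate():
        # Accept-reject step: candidates have density proportional to min{pi, c*h}
        while True:
            y = sample_h()
            if np.log(rng.uniform()) < min(log_w(y), 0.0):
                return y

    x = x0
    draws = np.empty(n_iter)
    for i in range(n_iter):
        y = candidate()
        if in_C(x):
            log_alpha = 0.0                            # first case of Equation (16)
        elif in_C(y):
            log_alpha = -log_w(x)                      # second case
        else:
            log_alpha = min(log_w(y) - log_w(x), 0.0)  # third case
        if np.log(rng.uniform()) < log_alpha:
            x = y
        draws[i] = x
    return draws

rng = np.random.default_rng(2)
log_pi = lambda x: -0.5 * x * x                        # N(0, 1) target kernel
log_h = lambda x: -abs(x) - np.log(2.0)                # Laplace(0, 1) source density
draws = armh(log_pi, log_h, rng.laplace, 2.0, 0.0, 20000, rng)
```

With c = 2 the domination fails for 0 < |x| < 2 in this setup, so all three cases of Equation (16) are exercised.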

The choices mentioned above are not exhaustive. Other proposal densities can be generated by mixing over a set of proposal densities, for example using one proposal density for a certain number of iterations before switching to another.

4.2 Convergence results

In the M-H algorithm the transition kernel of the chain is given by

P(ψ, dψ′) = q(ψ, ψ′) α(ψ, ψ′) dψ′ + r(ψ) δ_ψ(dψ′),    (17)

where δ_ψ(dψ′) = 1 if ψ ∈ dψ′ and 0 otherwise, and

r(ψ) = 1 − ∫ q(ψ, ψ′) α(ψ, ψ′) dψ′.

Thus, transitions from ψ to ψ′ (ψ′ ≠ ψ) are made according to the density

p(ip, 4 ") q(i', 4 ") a(V,, l'), 4 ' , ',

while transitions from ψ to ψ occur with probability r(ψ). In other words, the density function implied by this transition kernel is of mixed type,

K(tp, 4 ") = q( 4 ', 4 ") a( 4 ', V') + r(') b 6 p(V'"), ( 18)

having both a continuous and a discrete component where now, with a change of notation, δ_ψ(ψ′) is the Dirac delta function defined as δ_ψ(ψ′) = 0 for ψ′ ≠ ψ and

∫ δ_ψ(ψ′) dψ′ = 1.

Chib and Greenberg (1995a) provide a way to derive and interpret the probability of move α(ψ, ψ′). Consider the proposal density q(ψ, ψ′). This proposal density is not likely to be reversible for π (if it were then we would be done and M-H sampling would not be necessary). Without loss of generality, suppose that π(ψ) q(ψ, ψ′) > π(ψ′) q(ψ′, ψ), implying that the rate of transitions from ψ to ψ′ exceeds that in the reverse direction. To reduce the transitions from ψ to ψ′ one can introduce a function 0 < α(ψ, ψ′) ≤ 1 such that


π(ψ) q(ψ, ψ′) α(ψ, ψ′) = π(ψ′) q(ψ′, ψ). Solving for α(ψ, ψ′) yields the probability of move in the M-H algorithm. This calculation reveals the important point that the function p(ψ, ψ′) = q(ψ, ψ′) α(ψ, ψ′) is reversible by construction, i.e., it satisfies the condition

q(ψ, ψ′) α(ψ, ψ′) π(ψ) = q(ψ′, ψ) α(ψ′, ψ) π(ψ′).    (19)

It immediately follows, therefore, from the argument in Equation (10) that the M-H kernel has π(ψ) as its invariant density.
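The reversibility condition (19) and the resulting invariance can be checked numerically on a finite state space, where the M-H kernel is an ordinary transition matrix; this discrete illustration is not part of the chapter:

```python
import numpy as np

rng = np.random.default_rng(3)
pi = np.array([0.10, 0.30, 0.20, 0.15, 0.25])    # target distribution on five states
q = rng.uniform(0.1, 1.0, size=(5, 5))
q /= q.sum(axis=1, keepdims=True)                # row i is the proposal distribution q(i, .)

# M-H acceptance probabilities: alpha_ij = min{ pi_j q_ji / (pi_i q_ij), 1 }
alpha = np.minimum((pi[None, :] * q.T) / (pi[:, None] * q), 1.0)

P = q * alpha                                    # off-diagonal transition probabilities
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))         # rejected proposals leave the chain in place

detailed_balance = np.allclose(pi[:, None] * P, pi[None, :] * P.T)   # pi_i P_ij = pi_j P_ji
invariant = np.allclose(pi @ P, pi)                                   # pi P = pi
```

Both checks hold for any strictly positive proposal matrix, mirroring the fact that p(ψ, ψ′) is reversible by construction.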

It is not difficult to provide conditions under which the Markov chain generated by the M-H algorithm satisfies the conditions of Propositions 1-2. The conditions of Proposition 1 are satisfied by the Metropolis-Hastings chain if q(ψ, ψ′) is positive for (ψ, ψ′) and continuous and the set Ψ is connected. In addition, the conditions of Proposition 2 are satisfied if q is not reversible (which is the usual situation), which leads to a chain that is aperiodic. Conditions for ergodicity, required for use of the central limit theorem, are satisfied if in addition π is bounded. Other similar conditions are provided by Robert and Casella (1999).

4.3 Example

To illustrate the M-H algorithm consider count data taken from Hand et al. (1994) on the number of seizures for 58 epilepsy patients, measured first over an eight-week baseline period and then over four subsequent two-week intervals. At the end of the baseline, each patient is randomly assigned to either a treatment group, which is given the drug Progabide, or a control group, which is given a placebo. The model for these data on the ith patient at the jth occasion is taken to be

y_ij | λ_ij ~ Poisson(λ_ij),

ln(λ_ij) = β_0 + β_1 x_{1ij} + β_2 x_{2ij} + β_3 x_{3ij} + ln t_ij,

β ~ N_4(0, 10 I_4),

where x_1 is an indicator for treatment status, x_2 is an indicator of period, equal to zero for the baseline and one otherwise, x_3 = x_1 x_2, and t is the offset that is equal to eight in the baseline period and two otherwise. Because the purpose of this example is illustrative, the model does not incorporate the obvious intra-cluster dependence that is likely to be present in the counts.

The target density in this case is the Bayesian posterior density

π(β | y, M) ∝ π(β) ∏_{i=1}^{58} ∏_{j=0}^{4} λ_{ij}^{y_ij} exp(−λ_ij),

where β = (β_0, β_1, β_2, β_3) and π(β) is the density of the N_4(0, 10 I_4) distribution. To draw sample variates on β from this density we apply the AR-M-H chain.



Fig 3. Marginal posterior distribution of β_1 in Poisson count example. Top left, simulated values by iteration; top right, autocorrelation function of simulated values; bottom left, histogram and superimposed kernel density estimate of marginal density; bottom right, empirical cdf with 2.5th percentile, 50th percentile and 97.5th percentile marked.

Let β̂ and V denote the maximum likelihood estimate and the inverse of the observed information matrix, respectively. Then, the source density h(β) for the accept-reject method is specified as f_T(β | β̂, V, 15), a multivariate-t density with fifteen degrees of freedom. The constant c is set equal to 1.5, which implies that the probability of move in Equation (16) is defined in terms of the weight

w(β) = π(β) ∏_{i=1}^{58} ∏_{j=0}^{4} λ_{ij}^{y_ij} exp(−λ_ij) / [1.5 f_T(β | β̂, V, 15)].

The MCMC sampler is now run for 10 000 iterations beyond a burn-in of 200 iterations. Of interest in this case is the marginal posterior of β_1, which is summarized in Figure 3.

The figure includes a time series plot of the sampled values, against iteration, and the associated autocorrelation function. These indicate that there is no sign of serial correlation in the sampled values. Although mixing of this kind is often not achieved, this example shows that it is sometimes possible to have a MCMC algorithm produce virtually i.i.d. draws from the target distribution. We also summarize the marginal posterior distribution by a histogram/kernel smoothed plot and the empirical cumulative distribution function. Because the entire distribution is concentrated on negative values, it appears that the drug Progabide tends to lower the seizure counts, conditional on the specified model.


4.4 Multiple-block M-H algorithm

In applications when the dimension of ψ is quite large it is preferable to construct the Markov chain simulation by first grouping the variables ψ into p blocks (ψ_1, …, ψ_p), with ψ_k ∈ Ω_k ⊆ R^{d_k}, and sampling each block, conditioned on the rest, by the M-H algorithm. Hastings (1970) considers this general situation and mentions different possibilities for constructing a Markov chain on the product space Ω = Ω_1 × ⋯ × Ω_p.

Let ψ_{−k} = (ψ_1, …, ψ_{k−1}, ψ_{k+1}, …, ψ_p) denote the blocks excluding ψ_k, in order to describe the multiple-block M-H algorithm. Also let π(ψ_k, ψ_{−k}) denote the joint density of ψ, regardless of where ψ_k appears in the list (ψ_1, …, ψ_p). Furthermore, let {q_k(ψ_k, ψ′_k | ψ_{−k}), k ≤ p} denote a collection of proposal densities, one for each block ψ_k, where the proposal density q_k may depend on the current value of the remaining blocks and is specified along the lines mentioned in connection with the single-block M-H algorithm. Finally, define

α_k(ψ_k, ψ′_k | ψ_{−k}) = min { π(ψ′_k, ψ_{−k}) q_k(ψ′_k, ψ_k | ψ_{−k}) / [π(ψ_k, ψ_{−k}) q_k(ψ_k, ψ′_k | ψ_{−k})], 1 }    (20)

as the probability of move for block ψ_k conditioned on ψ_{−k}. Then, in the multiple-block M-H algorithm, one cycle of the algorithm is completed by updating each block, say sequentially in fixed order, using an M-H step with the above probability of move, given the most current value of the remaining blocks. The algorithm may be summarized as follows.

Algorithm 3: Multiple-block Metropolis-Hastings
(1) Specify an initial value ψ^(0) = (ψ_1^(0), …, ψ_p^(0)).
(2) Repeat for j = 1, 2, …, M
    (a) Repeat for k = 1, 2, …, p
        (i) Propose ψ′_k ~ q_k(ψ_k^(j), · | ψ_{−k}).
        (ii) Calculate
            α_k(ψ_k^(j), ψ′_k | ψ_{−k}) = min { π(ψ′_k, ψ_{−k}) q_k(ψ′_k, ψ_k^(j) | ψ_{−k}) / [π(ψ_k^(j), ψ_{−k}) q_k(ψ_k^(j), ψ′_k | ψ_{−k})], 1 }.
        (iii) Set ψ_k^(j+1) = ψ′_k if Unif(0, 1) ≤ α_k(ψ_k^(j), ψ′_k | ψ_{−k}), and ψ_k^(j+1) = ψ_k^(j) otherwise.
(3) Return the values {ψ^(1), ψ^(2), …, ψ^(M)}.


Before we examine this algorithm, some features of this method should be noted. First, the version of the algorithm presented above assumes that the blocks are revised sequentially in fixed order. This is not necessary and the blocks may be updated in random order. Second, at the moment block k is updated in this algorithm, the blocks (ψ_1, …, ψ_{k−1}) have already been revised while the blocks (ψ_{k+1}, …, ψ_p) have not. Thus, at each step of the algorithm one must be sure to condition on the most current value of the blocks in ψ_{−k}. Finally, if the proposal density q_k is determined by tailoring to π(ψ_k, ψ_{−k}), as in Chib and Greenberg (1994), then the proposal density is not fixed but varies across iterations.
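A sketch of Algorithm 3 for p = 2 blocks with symmetric random walk proposals for each block; the bivariate normal target and all tuning constants are illustrative assumptions:

```python
import numpy as np

def two_block_mh(log_target, x0, scales, n_iter, rng):
    """One cycle revises each block in fixed order, conditioning on the
    most current value of the other block (Algorithm 3 with p = 2)."""
    x = np.array(x0, dtype=float)
    draws = np.empty((n_iter, 2))
    for i in range(n_iter):
        for k in range(2):
            prop = x.copy()
            prop[k] += scales[k] * rng.standard_normal()
            # Symmetric q_k, so alpha_k reduces to a ratio of joint densities
            if np.log(rng.uniform()) < log_target(prop) - log_target(x):
                x = prop
        draws[i] = x
    return draws

rng = np.random.default_rng(4)
rho = 0.7
def log_target(v):  # unnormalized bivariate normal, unit variances, correlation rho
    return -(v[0] ** 2 - 2 * rho * v[0] * v[1] + v[1] ** 2) / (2 * (1 - rho ** 2))

draws = two_block_mh(log_target, [0.0, 0.0], [1.0, 1.0], 20000, rng)
```

Each block update conditions on the most current value of the other block, exactly as the second remark above requires.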

To understand the multiple-block M-H algorithm, first note that the transition kernel of the kth block, conditioned on ψ_{−k}, may be expressed as

P_k(ψ_k, dψ′_k | ψ_{−k}) = q_k(ψ_k, ψ′_k | ψ_{−k}) α_k(ψ_k, ψ′_k | ψ_{−k}) dψ′_k + r(ψ_k | ψ_{−k}) δ_{ψ_k}(dψ′_k),    (21)

where the notation is similar to that of Equation (17). It can be readily shown that, for a given ψ_{−k}, this kernel satisfies what may be called the local reversibility condition

π(ψ_k | ψ_{−k}) q_k(ψ_k, ψ′_k | ψ_{−k}) α_k(ψ_k, ψ′_k | ψ_{−k}) = π(ψ′_k | ψ_{−k}) q_k(ψ′_k, ψ_k | ψ_{−k}) α_k(ψ′_k, ψ_k | ψ_{−k}).    (22)

As a consequence, the transition kernel of the move from ψ = (ψ_1, …, ψ_p) to ψ′ = (ψ′_1, …, ψ′_p), under the assumption that the blocks are revised sequentially in fixed order, is given by the product of transition kernels

P(ψ, dψ′) = ∏_{k=1}^{p} P_k(ψ_k, dψ′_k | ψ_{−k}).    (23)

This transition kernel is not reversible, as can be easily checked, because under fixed sequential updating of the blocks updating in the reverse order never occurs. The multiple-block M-H algorithm, however, satisfies the weaker condition of invariance. To show this, we follow Chib and Greenberg (1995a). Consider for notational simplicity the case of two blocks, ψ = (ψ_1, ψ_2), where ψ_k : d_k × 1. Now, due to the fact that the local moves satisfy the local reversibility condition (22), the transition kernel P_1(ψ_1, dψ′_1 | ψ_2) has π_{1|2}(· | ψ_2) as its local invariant distribution (with density π_{1|2}(ψ_1 | ψ_2)), i.e.,

π_{1|2}(dψ′_1 | ψ_2) = ∫ P_1(ψ_1, dψ′_1 | ψ_2) π_{1|2}(ψ_1 | ψ_2) dψ_1.    (24)


Similarly, the conditional transition kernel P_2(ψ_2, dψ′_2 | ψ_1) has π_{2|1}(· | ψ_1) as its invariant distribution, for a given value of ψ_1. Then, the kernel formed by multiplying the conditional kernels is invariant for π(·, ·):

∫∫ P_1(ψ_1, dψ′_1 | ψ_2) P_2(ψ_2, dψ′_2 | ψ′_1) π(ψ_1, ψ_2) dψ_1 dψ_2

= ∫ P_2(ψ_2, dψ′_2 | ψ′_1) [ ∫ P_1(ψ_1, dψ′_1 | ψ_2) π_{1|2}(ψ_1 | ψ_2) dψ_1 ] π_2(ψ_2) dψ_2

= ∫ P_2(ψ_2, dψ′_2 | ψ′_1) π_{1|2}(dψ′_1 | ψ_2) π_2(ψ_2) dψ_2

= ∫ P_2(ψ_2, dψ′_2 | ψ′_1) π_{2|1}(ψ_2 | ψ′_1) π_1(dψ′_1) dψ_2

= π_1(dψ′_1) ∫ P_2(ψ_2, dψ′_2 | ψ′_1) π_{2|1}(ψ_2 | ψ′_1) dψ_2

= π_1(dψ′_1) π_{2|1}(dψ′_2 | ψ′_1)

= π(dψ′_1, dψ′_2),

where the third line follows from Equation ( 24), the fourth from Bayes theorem, thesixth from assumed invariance of P 2, and the last from the law of total probability.

The implication of this "product of kernels" result is that it allows us to take draws in succession from each of the kernels, instead of having to run each to convergence for every value of the conditioning variable.

5 The Gibbs sampling algorithm

Another MCMC method, which is a special case of the multiple-block Metropolis-Hastings method, is called the Gibbs sampling method and was brought into statistical prominence by Gelfand and Smith (1990). An elementary introduction to Gibbs sampling is provided by Casella and George (1992). In this algorithm the parameters are grouped into p blocks (ψ_1, …, ψ_p) and each block is sampled according to the full conditional distribution of block ψ_k, defined as the conditional distribution under π of ψ_k given all the other blocks ψ_{−k} and denoted as π(ψ_k | ψ_{−k}). In parallel with the multiple-block M-H algorithm, the most current value of the remaining blocks is used in deriving the full conditional distribution of each block. Derivation of the full conditional distributions is usually quite simple since, by Bayes theorem, π(ψ_k | ψ_{−k}) ∝ π(ψ_k, ψ_{−k}), the joint distribution of all the blocks. In addition, the powerful device of data augmentation, due to Tanner and Wong (1987), in which latent or auxiliary variables are artificially introduced into the sampling, is often used to simplify the derivation and sampling of the full conditional distributions.


5.1 The algorithm

To define the Gibbs sampling algorithm, let the set of full conditional distributionsbe

{π(ψ_1 | ψ_2, …, ψ_p); π(ψ_2 | ψ_1, ψ_3, …, ψ_p); …; π(ψ_p | ψ_1, …, ψ_{p−1})}.

Now one cycle of the Gibbs sampling algorithm is completed by simulating {ψ_k}_{k=1}^p from these distributions, recursively updating the conditioning variables as one moves through each distribution. When p = 2 one obtains the two block Gibbs sampler that is featured in the work of Tanner and Wong (1987). The Gibbs sampler in which each block is revised in fixed order is defined as follows.

Algorithm 4: Gibbs sampling
(1) Specify an initial value ψ^(0) = (ψ_1^(0), …, ψ_p^(0)).
(2) Repeat for j = 1, 2, …, M
    Generate ψ_1^(j+1) from π(ψ_1 | ψ_2^(j), ψ_3^(j), …, ψ_p^(j)).
    Generate ψ_2^(j+1) from π(ψ_2 | ψ_1^(j+1), ψ_3^(j), …, ψ_p^(j)).
    ⋮
    Generate ψ_p^(j+1) from π(ψ_p | ψ_1^(j+1), …, ψ_{p−1}^(j+1)).
(3) Return the values {ψ^(1), ψ^(2), …, ψ^(M)}.

Thus, the transition of ψ_k from ψ_k^(j) to ψ_k^(j+1) is effected by taking a draw from the conditional distribution

π(ψ_k | ψ_1^(j+1), …, ψ_{k−1}^(j+1), ψ_{k+1}^(j), …, ψ_p^(j)),

where the conditioning elements reflect the fact that when the kth block is reached, the previous (k − 1) blocks have already been updated. The transition density of the chain, again under the maintained assumption that π is absolutely continuous, is therefore given by the product of transition kernels for each block:

K(ψ^(j), ψ^(j+1)) = ∏_{k=1}^{p} π(ψ_k^(j+1) | ψ_1^(j+1), …, ψ_{k−1}^(j+1), ψ_{k+1}^(j), …, ψ_p^(j)).    (25)

To illustrate the manner in which the blocks are revised, we consider a two block case, each with a single component, and trace out in Figure 4 a possible trajectory of the sampling algorithm. The contours in the plot represent the joint distribution of ψ and the labels "(0)", "(1)", etc., denote the simulated values. Note that one iteration of the algorithm is completed after both components are revised. Also notice that each component is revised along the direction of the coordinate axes. This feature can be a source of problems if the two components are highly correlated because then


Fig 4. Gibbs sampling algorithm in two dimensions, starting from an initial point and then completing three iterations.

the contours become compressed and movements along the coordinate axes tend to produce only small moves. We return to this issue below.
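For the two block, single-component case just described, the Gibbs sampler can be sketched directly on a bivariate normal target (the correlation value is an illustrative choice):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng):
    """Two block Gibbs sampler for N(0, [[1, rho], [rho, 1]]); the full
    conditionals are psi_1 | psi_2 ~ N(rho*psi_2, 1 - rho^2) and symmetrically."""
    x1 = x2 = 0.0
    s = np.sqrt(1.0 - rho ** 2)
    draws = np.empty((n_iter, 2))
    for i in range(n_iter):
        x1 = rho * x2 + s * rng.standard_normal()   # draw from pi(psi_1 | psi_2)
        x2 = rho * x1 + s * rng.standard_normal()   # condition on the updated psi_1
        draws[i] = (x1, x2)
    return draws

rng = np.random.default_rng(5)
draws = gibbs_bivariate_normal(0.6, 20000, rng)
```

Each line of the loop is one coordinate-axis move of the kind traced out in Figure 4.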

5.2 Connection with the multiple-block M-H algorithm

A connection with the M-H algorithm can be drawn by noting that the full conditional distribution, by Bayes theorem, is proportional to the joint distribution, i.e.,

π(ψ_k | ψ_{−k}) ∝ π(ψ_k, ψ_{−k}).

Now recall that the probability of move in the multiple-block M-H algorithm from Equation (20) is

α_k(ψ_k, ψ′_k | ψ_{−k}) = min { π(ψ′_k, ψ_{−k}) q(ψ′_k, ψ_k | ψ_{−k}) / [π(ψ_k, ψ_{−k}) q(ψ_k, ψ′_k | ψ_{−k})], 1 },

so if one substitutes

q(ψ_k, ψ′_k | ψ_{−k}) = π(ψ′_k | ψ_{−k}),

q(ψ′_k, ψ_k | ψ_{−k}) = π(ψ_k | ψ_{−k}),

in this expression, all the terms cancel, implying that the probability of accepting the proposal is one. Thus, the Gibbs sampling algorithm is a special case of the multiple-block M-H algorithm.

It should be noted that a multiple-block M-H algorithm in which only some of the blocks are sampled using the full conditional distributions is sometimes called a hybrid sampler or a Metropolis-within-Gibbs sampler. These names are not very informative or precise and it is preferable to continue to refer to such algorithms as multiple-block M-H algorithms. The only algorithm that should properly be referred to as the Gibbs algorithm is the one in which each block is sampled directly from its full conditional distribution.


5.3 Invariance of the Gibbs Markov chain

The Gibbs transition kernel is invariant by construction. This is a consequence of the fact that the Gibbs algorithm is a special case of the multiple-block M-H algorithm, which was established to be invariant in the last section. A direct calculation also reveals the same result. Consider for simplicity the situation of two blocks, when the transition kernel density is

K(ψ, ψ′) = π(ψ′_1 | ψ_2) π(ψ′_2 | ψ′_1).

To check invariance we need to show that

∫∫ K(ψ, ψ′) π(ψ_1, ψ_2) dψ_1 dψ_2 = ∫∫ π(ψ′_1 | ψ_2) π(ψ′_2 | ψ′_1) π(ψ_1, ψ_2) dψ_1 dψ_2,

is equal to π(ψ′_1, ψ′_2). This is easily verified because π(ψ′_2 | ψ′_1) comes out of the integral, and the integral over ψ_1 and ψ_2 produces π_1(ψ′_1). This calculation can be extended to any number of blocks in the same way. In addition, the Gibbs Markov chain is not reversible. Reversible Gibbs samplers are discussed by Liu, Wong and Kong (1995).

5.4 Sufficient conditions for convergence

Under rather general conditions, which are easy to verify, the Markov chain generated by the Gibbs sampling algorithm converges to the target density as the number of iterations becomes large. Formally, if we let K(ψ, ψ′) represent the transition density of the Gibbs algorithm and let K^(M)(ψ_0, ψ′) be the density of the draw ψ′ after M iterations given the starting value ψ_0, then

‖ K^(M)(ψ_0, ψ′) − π(ψ′) ‖ → 0 as M → ∞.    (26)

Roberts and Smith (1994) [see also Chan (1993)] have shown that the conditions of Proposition 2 are satisfied under the following conditions: (i) π(ψ) > 0 implies there exists an open neighborhood N_ψ containing ψ and ε > 0 such that, for all ψ′ ∈ N_ψ, π(ψ′) ≥ ε > 0; (ii) ∫ π(ψ) dψ_k is locally bounded for all k, where ψ_k is the kth block of parameters; and (iii) the support of ψ is arc connected.

It is difficult to find non-pathological problems where these conditions are not satisfied.

5.5 Estimation of density ordinates

We mention that if the full conditional densities are available, whether in the context of the multiple-block M-H algorithm or that of the Gibbs sampler, then the MCMC output can be used to estimate posterior marginal density functions, as in Tanner and Wong (1987)


and Gelfand and Smith (1990). One possibility is to use a non-parametric kernel smoothing method which, however, suffers from the curse of dimensionality problem. A more efficient possibility is to exploit the fact that the marginal density of ψ_k at the point ψ_k* is

π(ψ_k*) = ∫ π(ψ_k* | ψ_{−k}) π(ψ_{−k}) dψ_{−k},

where as before ψ_{−k} = ψ \ ψ_k. Provided the normalizing constant of π(ψ_k* | ψ_{−k}) is known, we can estimate the marginal density as an average of the full conditional density over the simulated values of ψ_{−k}:

π̂(ψ_k*) = M⁻¹ ∑_{j=1}^{M} π(ψ_k* | ψ_{−k}^(j)).

Then, under the assumptions of Proposition 1,

M⁻¹ ∑_{j=1}^{M} π(ψ_k* | ψ_{−k}^(j)) → π(ψ_k*), as M → ∞.

Gelfand and Smith (1990) refer to this approach as "Rao-Blackwellization" because of the connections with the Rao-Blackwell theorem in classical statistics. That connection is more clearly seen in the context of estimating (say) the mean of ψ_k, E(ψ_k) = ∫ ψ_k π(ψ_k) dψ_k. By the law of the iterated expectation,

E(ψ_k) = E{E(ψ_k | ψ_{−k})},

and therefore the estimates

M⁻¹ ∑_{j=1}^{M} ψ_k^(j)

and

M⁻¹ ∑_{j=1}^{M} E(ψ_k | ψ_{−k}^(j))

both converge to E(ψ_k) as M → ∞. Under i.i.d. sampling, and under Markov sampling provided some conditions are satisfied [see Liu, Wong and Kong (1994), Geyer (1995), Casella and Robert (1996) and Robert and Casella (1999)], it can be shown that the variance of the latter estimate is smaller than that of the former. Thus, it can help to average the conditional mean E(ψ_k | ψ_{−k}), if that were available, rather than average


the draws directly. Gelfand and Smith appeal to this analogy to argue that the Rao-Blackwellized estimate of the density is preferable to that based on the method of kernel smoothing. Chib (1995) extends the Rao-Blackwellization approach to estimate "reduced conditional ordinates", defined as the density of ψ_k conditioned on one or more of the remaining blocks. More discussion of this is provided below in Section 10 on Bayesian model choice. Finally, Chen (1994) provides an importance weighted estimate of the marginal density for cases where the conditional posterior density does not have a known normalizing constant. Chen's estimator is based on the identity

π(ψ_k*) = ∫ w(ψ_k | ψ_{−k}) [π(ψ_k*, ψ_{−k}) / π(ψ_k, ψ_{−k})] π(ψ_k, ψ_{−k}) dψ,

where w(ψ_k | ψ_{−k}) is a completely known conditional density whose support is equal to the support of the full conditional density π(ψ_k | ψ_{−k}). In this form, the normalizing constant of the full conditional density is not required and, given a sample of draws {ψ^(1), …, ψ^(M)} from π(ψ), a Monte Carlo estimate of the marginal density is given by

π̂(ψ_k*) = M⁻¹ ∑_{j=1}^{M} w(ψ_k^(j) | ψ_{−k}^(j)) [π(ψ_k*, ψ_{−k}^(j)) / π(ψ_k^(j), ψ_{−k}^(j))].

Chen (1994) discusses the choice of the conditional density w. Since it depends on ψ_{−k}, the choice of w will vary from one sampled draw to the next.
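The Rao-Blackwellized ordinate estimate can be sketched in a case where the marginal is known exactly; here i.i.d. draws of ψ_2 stand in for MCMC output, and the bivariate normal setting is an illustrative assumption, not from the chapter:

```python
import numpy as np

rng = np.random.default_rng(6)
rho, M = 0.6, 20000
s2 = 1.0 - rho ** 2

# Draws of psi_2 from its N(0, 1) marginal (a stand-in for sampler output)
psi2 = rng.standard_normal(M)

def cond_density(x, psi2):
    # Full conditional pi(psi_1 | psi_2) = N(rho * psi_2, 1 - rho^2), fully normalized
    return np.exp(-0.5 * (x - rho * psi2) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

x_star = 0.5
rb_estimate = cond_density(x_star, psi2).mean()               # average of full conditionals
true_value = np.exp(-0.5 * x_star ** 2) / np.sqrt(2 * np.pi)  # exact N(0, 1) ordinate
```

The average of the normalized full conditional over the draws recovers the marginal ordinate, as the identity above implies.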

5.6 Example: simulating a truncated multivariate normal

To illustrate the Gibbs sampling algorithm, consider the question of sampling a trivariate normal distribution truncated to the positive orthant. In particular, let the target distribution be

π(ψ) = f_N(ψ | μ, Σ) I(ψ ∈ A) / Pr(ψ ∈ A) ∝ f_N(ψ | μ, Σ) I(ψ ∈ A),

where μ = (0.5, 1, 1.5)′, Σ is in equicorrelated form with units on the diagonal and 0.7 on the off-diagonal, A = (0, ∞) × (0, ∞) × (0, ∞) and Pr(ψ ∈ A) is the normalizing constant, which is difficult to compute. Following Geweke (1991), one may define the Gibbs sampler with the blocks ψ_1, ψ_2, ψ_3 and the full conditional distributions

JT(V 1 2, "); ( I V, "); Jr( I , "),

where each of these full conditional distributions is univariate truncated normal restricted to the interval (0, ∞):

π(ψ_k | ψ_{−k}) ∝ f_N(ψ_k | μ_k + C_k Σ_{−k}⁻¹(ψ_{−k} − μ_{−k}), Σ_{kk} − C_k Σ_{−k}⁻¹ C_k′) I(ψ_k ∈ (0, ∞)).    (27)

In this expression we have utilized the well known result about conditional normal distributions and have let C_k = Cov(ψ_k, ψ_{−k}), Σ_{−k} = Var(ψ_{−k}) and μ_{−k} = E(ψ_{−k}). Note




Fig 5. Marginal distributions of ψ in truncated multivariate normal example (top panel). Histograms of the sampled values and Rao-Blackwellized estimates of the densities are shown. Autocorrelation plots of the Gibbs MCMC chain are in the bottom panel. Graphs are based on 10 000 iterations following a burn-in of 500 cycles.

that, unfortunately, the use of singleton block sizes is unavoidable in this problembecause the conditional distribution of any two components given the third is not easyto simulate.

Figure 5 gives the marginal distribution of each component of ψ from a Gibbs sampling run of M = 10 000 iterations with a burn-in of 100 cycles. The figure includes both the histograms of the sampled values and the Rao-Blackwellized estimates of the marginal densities based on the averaging of Equation (27) over the simulated values of ψ_{−k}. The agreement between the two density estimates is close. In the bottom panel of Figure 5 we plot the autocorrelation function of the sampled draws. The rapid decline in the autocorrelations for higher lags indicates that the sampler is mixing well.
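A sketch of the sampler of this example, with each full conditional in Equation (27) drawn by a simple rejection step (the rejection device, the seed and the function names are implementation choices, not from the chapter):

```python
import numpy as np

def gibbs_truncated_mvn(mu, Sigma, n_iter, rng, burn=500):
    """Gibbs sampler for N(mu, Sigma) truncated to the positive orthant;
    full conditionals are univariate normals restricted to (0, inf)."""
    d = len(mu)
    # Precompute the conditional-normal coefficients for each full conditional
    idxs, coefs, sds = [], [], []
    for k in range(d):
        idx = [j for j in range(d) if j != k]
        B = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, k])
        idxs.append(idx)
        coefs.append(B)
        sds.append(np.sqrt(Sigma[k, k] - Sigma[k, idx] @ B))
    psi = mu.astype(float).copy()            # start inside the positive orthant
    draws = np.empty((n_iter, d))
    for it in range(burn + n_iter):
        for k in range(d):
            m = mu[k] + coefs[k] @ (psi[idxs[k]] - mu[idxs[k]])
            while True:                      # rejection step for truncation to (0, inf)
                z = m + sds[k] * rng.standard_normal()
                if z > 0:
                    psi[k] = z
                    break
        if it >= burn:
            draws[it - burn] = psi
    return draws

mu = np.array([0.5, 1.0, 1.5])
Sigma = 0.7 * np.ones((3, 3)) + 0.3 * np.eye(3)   # equicorrelated, 0.7 off-diagonal
rng = np.random.default_rng(7)
draws = gibbs_truncated_mvn(mu, Sigma, 10000, rng)
```

The rejection step is adequate here because the conditional means are rarely far below zero; a dedicated truncated normal sampler would be preferable in more extreme truncation problems.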

6 Sampler performance and diagnostics

In implementing a MCMC method it is important to assess the performance of the sampling algorithm to determine the rate of mixing and the size of the burn-in, both having implications for the number of iterations required to get reliable answers. A large literature has now emerged on these issues, for example, Robert (1995), Tanner (1996, Section 6.3), Cowles and Carlin (1996), Gammerman (1997, Section 5.4),


Brooks, Dellaportas and Roberts ( 1997) and Robert and Casella ( 1999), but the ideas,although related in many ways, have not coalesced into a single prescription.

One approach for determining sampler performance and the size of the burn-in time is to apply analytical methods to the specified Markov chain, prior to sampling. This approach is exemplified in the work of, for example, Meyn and Tweedie (1994), Polson (1996), Roberts and Tweedie (1996) and Rosenthal (1995). Two factors have inhibited the growth and application of these methods. The first is that the calculations are difficult and problem-specific and, second, the upper bounds for the burn-in that emerge from such calculations are usually highly conservative.

At this time the more popular approach is to utilize the sampled draws to assess both the performance of the algorithm and its approach to the stationary, invariant distribution. Several such relatively informal methods are now available. Gelfand and Smith (1990) recommend monitoring the evolution of the quantiles as the sampling proceeds. Another quite useful diagnostic, and perhaps the simplest and most direct, is the autocorrelation plot (and autocorrelation time) of the sampled output. Slowly decaying correlations indicate problems with the mixing of the chain. It is also useful in connection with M-H Markov chains to monitor the acceptance rate of the proposal values, with low rates implying "stickiness" in the sampled values and thus a slower approach to the invariant distribution.
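The autocorrelation time (the inefficiency factor of Section 3.2) can be estimated directly from the sampled output; the AR(1) test chain below has known autocorrelation time (1 + φ)/(1 − φ), and the truncation lag is an illustrative choice:

```python
import numpy as np

def autocorr_time(x, max_lag=100):
    """Estimate the autocorrelation time 1 + 2 * sum_k rho_k from MCMC output."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    denom = (xc * xc).sum()
    acf = np.array([1.0] + [(xc[:-k] * xc[k:]).sum() / denom
                            for k in range(1, max_lag + 1)])
    return 1.0 + 2.0 * acf[1:].sum(), acf

rng = np.random.default_rng(8)
phi, n = 0.8, 200000
x = np.empty(n)
x[0] = rng.standard_normal()
for t in range(1, n):                    # AR(1) chain; true autocorrelation time is 9
    x[t] = phi * x[t - 1] + rng.standard_normal()
tau, acf = autocorr_time(x)
```

Slowly decaying entries of `acf`, and hence a large `tau`, are the numerical signature of the poor mixing described above.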

Somewhat more formal sample-based diagnostics are also available in the literature, as summarized in the CODA routines provided by Best, Cowles and Vines (1995). Although these diagnostics often go under the name "convergence diagnostics" they are in principle approaches that detect lack of convergence. Detection of convergence based entirely on the sampled output, without analysis of the target distribution, is extremely difficult and perhaps impossible. Cowles and Carlin (1996) discuss and evaluate thirteen such diagnostics [for example, those proposed by Geweke (1992), Raftery and Lewis (1992), Ritter and Tanner (1992), Gelman and Rubin (1992), Zellner and Min (1995), amongst others] without arriving at a consensus. Difficulties in evaluating these methods stem from the fact that some of these methods apply only to Gibbs Markov chains [for example, those of Ritter and Tanner (1992) and Zellner and Min (1995)] while others are based on the output not just of a single chain but on that of multiple chains specifically run from "disparate starting values", as in the method of Gelman and Rubin (1992). Finally, some methods assess the behavior of univariate moment estimates [as in the approach of Geweke (1992) and Gelman and Rubin (1992)] while others are concerned with the behavior of the entire transition kernel [as in Ritter and Tanner (1992) and Zellner and Min (1995)]. Further developments in this area are ongoing.

7 Strategies for improving mixing

In practice, while implementing MCMC methods it is important to construct samplers that mix well, where mixing is measured by the autocorrelation time, because such samplers can be expected to converge more quickly to the invariant distribution. Over the years a number of different recipes for designing samplers with low autocorrelation times have been proposed, although it may sometimes be difficult, because of the complexity of the problem, to apply any of these recipes.

7.1 Choice of blocking

As a general rule, sets of parameters that are highly correlated should be treated as one block when applying the multiple-block M-H algorithm. Otherwise, it would be difficult to develop proposal densities that lead to large moves through the support of the target distribution, and the sampled draws would tend to display autocorrelations that decay slowly. To get a sense of the problem, it may be worthwhile for the reader to use the Gibbs sampler to simulate a bivariate normal distribution with unit variances and covariance (correlation) of 0.95.
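The suggested experiment can be sketched as follows (an added numerical illustration, assuming numpy): a two-block Gibbs sampler for the bivariate normal with correlation 0.95, whose output for each component is an AR(1) sequence with lag-one autocorrelation near 0.95² ≈ 0.90.

```python
import numpy as np

# Gibbs sampler for (x1, x2) ~ N2(0, [[1, rho], [rho, 1]]) with rho = 0.95.
# Full conditionals: x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for x2.
rho = 0.95
s = np.sqrt(1.0 - rho * rho)
rng = np.random.default_rng(1)
x1 = x2 = 0.0
draws = np.empty((20000, 2))
for i in range(draws.shape[0]):
    x1 = rho * x2 + s * rng.normal()
    x2 = rho * x1 + s * rng.normal()
    draws[i] = x1, x2

x = draws[:, 0] - draws[:, 0].mean()
acf1 = (x[:-1] @ x[1:]) / (x @ x)   # slowly decaying: near rho**2 = 0.9025
print(acf1)
```

Treating (x1, x2) as a single block (i.e., drawing directly from the joint normal) would instead give independent draws, which is the point of the blocking rule above.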

The importance of coarse, or highly grouped, blocking has been highlighted in a number of different problems, for example, the state space model, the hidden Markov model and longitudinal data models with random effects. In each of these situations, which are discussed further in detail below, the parameter space is quite large on account of the fact that auxiliary variables are included in the sampling (the latent states in the case of the state space model and the random effects in the case of the longitudinal data model). These latent variables tend to be highly correlated either amongst themselves, as in the case of the state space model, or with a different set of variables, as in the case of the panel model.

Blocks can be combined by the method of composition. For example, suppose that ψ1, ψ2 and ψ3 denote three blocks and that the distribution ψ1 | ψ3 is tractable (i.e., can be sampled directly). Then, the blocks (ψ1, ψ2) can be collapsed by first sampling ψ1 from ψ1 | ψ3 followed by ψ2 from ψ2 | ψ1, ψ3. This amounts to a two-block MCMC algorithm. In addition, if it is possible to sample (ψ1, ψ2) marginalized over ψ3, then the number of blocks is reduced to one. Liu (1994) and Liu, Wong and Kong (1994) discuss the value of these strategies in the context of a three-block Gibbs MCMC chain. Roberts and Sahu (1997) provide further discussion of the role of blocking in the context of Gibbs Markov chains used to sample multivariate normal target distributions.

7.2 Tuning the proposal density

As mentioned above, the proposal density in a M-H algorithm has an important bearing on the mixing of the MCMC chain. Fortunately, one has great flexibility in the choice of candidate generating density and it is possible to adapt the choice to the specific context of a given problem. For example, Chib, Greenberg and Winkelmann (1998) develop and compare four different choices in the context of longitudinal random effects for count data. In this problem, each cluster (or individual) has its own random effects and each of these has to be sampled from an intractable target distribution.


If one lets N denote the number of clusters, where N is typically large, say in excess of a thousand, then the number of blocks in the MCMC implementation is N + 3 (N for the random effect distributions, two for the fixed effects and one for the variance components matrix). For this problem, the multiple-block M-H algorithm requires N + 1 M-H steps within one iteration of the algorithm. Tailored proposal densities are therefore computationally quite expensive, but one can use a mixture of proposal densities in which a less demanding proposal, for example a random walk proposal, is combined with the tailored proposal to sample each of the N random effect target distributions. Further discussion of mixture proposal densities for the purpose of improving mixing is contained in Tierney (1994).

7.3 Other strategies

In some problems it is possible to reparameterize the variables to make the blocks less correlated. See Hills and Smith (1992) and Gelfand, Sahu and Carlin (1995), where under certain circumstances reparameterization is shown to be beneficial for simple one-way analysis of variance models and for general hierarchical normal linear models.

Another strategy that can prove useful is importance resampling, in which the MCMC sampler is applied not to the target distribution π but to a modified distribution π*, for which a well mixing sampler can be designed, and which is close to π. Now suppose {ψ(1), ..., ψ(M)} are draws from the distribution π*. These can be made to correspond to the target distribution π by attaching the weight w_j = π(ψ(j))/π*(ψ(j)) to each draw and then re-sampling the sampled values with probability given by {w_j / Σ_g w_g}. This strategy was introduced for a different purpose by Rubin (1988) and then employed by Gelfand and Smith (1992) and Albert (1993) to study the sensitivity of the posterior distribution to small changes in the prior without involving a new MCMC calculation. Its use for improving mixing in the MCMC context is illustrated by Kim, Shephard and Chib (1998), where a nonlinear state space model of stochastic volatility is approximated accurately by a mixture of state space models; an efficient MCMC algorithm is then developed for the latter target distribution and the draws are finally re-sampled to correspond to the original nonlinear model.
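A toy version of this reweight-and-resample step can be sketched as follows (an added illustration, assuming numpy; here π* is an overdispersed normal surrogate, N(0, 1.5²), for a standard normal target π):

```python
import numpy as np

rng = np.random.default_rng(2)

# Draws from a tractable surrogate pi* = N(0, 1.5^2); the target is pi = N(0, 1).
draws = rng.normal(0.0, 1.5, size=50000)

# Weights w_j proportional to pi(psi_j) / pi*(psi_j); normalizing constants
# cancel when the weights are normalized.
log_w = -0.5 * draws**2 + 0.5 * (draws / 1.5) ** 2
w = np.exp(log_w - log_w.max())
w /= w.sum()

# Resample with probabilities {w_j / sum_g w_g}: the result is (approximately)
# a sample from the target pi.
resampled = rng.choice(draws, size=50000, p=w)
print(draws.std(), resampled.std())
```

The raw draws have standard deviation near 1.5 while the resampled draws have standard deviation near 1, the target value.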

Other approaches have also been discussed in the literature. Marinari and Parisi (1992) develop the simulated tempering method whereas Geyer and Thompson (1995) develop a related technique that they call the Metropolis-coupled MCMC method. Both these approaches rely on a series of transition kernels {K_1, ..., K_m} where only K_1 has π* as the stationary distribution. The other kernels have equilibrium distributions π_i, which Geyer and Thompson take to be π_i(ψ) = π*(ψ)^{1/i}, i = 2, ..., m. This specification produces a set of target distributions that have higher variance than π*. Once the transition kernels and equilibrium distributions are specified, the Metropolis-coupled MCMC method requires that each of the m kernels be used in parallel. At each iteration, after the m draws have been obtained, one randomly selects two chains to see if the states should be swapped. The probability of swap is based on the M-H acceptance condition. At the conclusion of the sampling, inference is based on the sequence of draws that correspond to the distribution π*. These methods promote rapid mixing because draws from the various "flatter" target densities have a chance of being swapped with the draws from the base kernel K_1. Thus, variates that are unlikely under the transition K_1 have a chance of being included in the chain, leading to more rapid exploration of the parameter space.
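The swap step can be illustrated with a two-kernel toy version of the Metropolis-coupled method (an added sketch, assuming numpy; the bimodal target, the random-walk step size and the tempering exponent 1/8 are all arbitrary choices for the illustration): the cold chain targets a bimodal density, the second chain targets its flattened power, and proposed swaps are accepted with the M-H probability.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(x):
    # Bimodal target with well separated modes near +2 and -2.
    return -0.5 * (x * x - 4.0) ** 2

temps = [1.0, 8.0]        # K1 targets pi; K2 targets the flatter pi**(1/8)
state = [2.0, -2.0]
cold = np.empty(20000)
for it in range(cold.size):
    for j, T in enumerate(temps):
        # One random-walk M-H step for each chain against its own target.
        prop = state[j] + rng.normal(0.0, 0.5)
        if np.log(rng.uniform()) < (log_target(prop) - log_target(state[j])) / T:
            state[j] = prop
    # Propose swapping the two states; accept with the M-H probability.
    log_a = (1.0 / temps[0] - 1.0 / temps[1]) * (log_target(state[1]) - log_target(state[0]))
    if np.log(rng.uniform()) < log_a:
        state[0], state[1] = state[1], state[0]
    cold[it] = state[0]

frac = (cold > 0).mean()
print(frac)   # the cold chain visits both modes
```

Without the swap step, the cold chain would remain stuck in one mode for very long stretches, since the random walk rarely crosses the low-probability region between the modes.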

8 MCMC algorithms in Bayesian estimation

8.1 Overview

Markov chain Monte Carlo methods have proved enormously popular in Bayesian statistics [for wide-ranging discussions of the Bayesian paradigm see, for example, Zellner (1971), Leamer (1978), Berger (1985), O'Hagan (1994), Bernardo and Smith (1994), Poirier (1995), Gelman, Meng, Stern and Rubin (1995)], where these methods have opened up vistas that were unimaginable fifteen years ago. Within the Bayesian framework, where both parameters and data are treated as random variables and inferences about the parameters are conducted conditioned on the data, the posterior distribution of the parameters provides a natural target for MCMC methods. Sometimes the target distribution is the posterior distribution of the parameters augmented by latent data, in which case the MCMC scheme operates on a space that is considerably larger than the parameter space. This strategy, which goes under the name of data augmentation, is illustrated in several models below; its main virtue is that it allows one to conduct the MCMC simulation without having to evaluate the likelihood function of the parameters. The latter feature is of considerable importance, especially when the model of interest has a complicated likelihood function and likelihood-based inference is difficult. Admittedly, in standard problems such as the linear regression model, there may be little to be gained by utilizing MCMC methods or in fact by adopting the Bayesian approach, but the important point is that MCMC methods provide a complete computational toolkit for conducting Bayesian inference in models that are both simple and complicated. This is the central reason for the current growing appeal of Bayesian methods in theoretical and practical work, and this appeal is likely to increase once MCMC Bayesian software, presently under development at various sites, becomes readily available.

Papers that develop some of the important general MCMC ideas for Bayesian inference appeared early in the 1990's. Categorized by topic, these include: normal and student-t data models [Gelfand et al. (1990), Carlin and Polson (1991)]; binary and ordinal response models [Albert and Chib (1993a, 1995)]; tobit censored regression models [Chib (1992)]; generalized linear models [Dellaportas and Smith (1993), Mallick and Gelfand (1994)]; change point models [Carlin et al. (1992), Stephens (1994)]; autoregressive models [Chib (1993), McCulloch and Tsay (1994)]; autoregressive-moving average models [Chib and Greenberg (1994)]; hidden Markov models [Albert and Chib (1993b), Robert et al. (1993), McCulloch and Tsay (1994), Chib (1996)]; state space models [Carlin, Polson and Stoffer (1992), Carter and Kohn (1994, 1996), Chib and Greenberg (1995b), de Jong and Shephard (1995)]; measurement error models [Mallick and Gelfand (1996)]; mixture models [Diebolt and Robert (1994), Escobar and West (1995), Muller, Erkanli and West (1996)]; longitudinal data models [Zeger and Karim (1991), Wakefield et al. (1994)].

More recently, other model and inference situations have also come under scrutiny. Examples include: ARMA models with switching [Billio, Monfort and Robert (1999)]; CART models [Chipman, George and McCulloch (1998), Denison, Mallick and Smith (1998)]; conditionally independent hierarchical models [Albert and Chib (1997)]; estimation of HPD intervals [Chen and Shao (1999)]; item response models [Patz and Junker (1999)]; selection models [Chib and Hamilton (2000)]; partially linear and additive regression models [Lenk (1999), Shively, Kohn and Wood (1999)]; sequential Monte Carlo for state space models [Liu and Chen (1998), Pitt and Shephard (1999)]; stochastic differential equation models [Elerian, Chib and Shephard (1999)]; models with symmetric stable distributions [Tsionas (1999)]; neural network models [Muller and Insua (1998)]; spatial models [Waller, Carlin, Xia and Gelfand (1997)].

MCMC methods have also been extended to the realm of Bayesian model choice. Problems related to variable selection in regression models, hypothesis testing in nested models and the general problem of model choice are now all amenable to analysis by MCMC methods. The basic strategies are developed in the following papers: variable selection in regression [George and McCulloch (1993)]; hypothesis testing in nested models [Verdinelli and Wasserman (1995)]; predictive model comparison [Gelfand and Dey (1994)]; marginal likelihood and Bayes factor computation [Chib (1995)]; composite model space and parameter space MCMC [Carlin and Chib (1995), Green (1995)]. These developments are discussed in Section 10.

We now provide a set of applications of MCMC methods to models largely drawn from the list above. These models serve to illustrate a number of general techniques, for example, derivation of full conditional distributions, use of latent variables in the sampling (data augmentation) to avoid computation of the likelihood function, and issues related to blocking. Because of the modular nature of MCMC methods, the algorithms presented below can serve as the building blocks for other models not considered here. In some instances one would only need to combine different pieces of these algorithms to fit a new model.

8.2 Notation and assumptions

To streamline the discussion we collect some of the notation that is used in the rest of the paper.

The d-variate normal distribution with mean vector μ and covariance matrix Σ is denoted by N_d(μ, Σ). Its density at the point t ∈ R^d is denoted by φ_d(t | μ, Σ). The univariate normal density truncated to the interval (a, b) is denoted by TN_[a, b](μ, σ²), with density at the point t ∈ (a, b) given by φ(t | μ, σ²)/[Φ((b − μ)/σ) − Φ((a − μ)/σ)], where φ is the univariate normal density and Φ(·) is the c.d.f. of the standard normal random variable.

A d-variate random vector distributed according to the multivariate-t distribution with mean vector μ, dispersion matrix Σ and degrees of freedom ν has density f_T(t | μ, Σ, ν) given by

[Γ((ν + d)/2) / (Γ(ν/2) (νπ)^{d/2} |Σ|^{1/2})] {1 + ν^{−1}(t − μ)' Σ^{−1}(t − μ)}^{−(ν + d)/2}.

The gamma distribution is denoted by G(a, b) with density at the point t given by f_G(t | a, b) ∝ t^{a−1} exp(−bt) I[t > 0], where I[A] is the indicator function of the event A. The inverse gamma distribution is the distribution of the inverse of a gamma variate.

A random symmetric positive definite matrix W : p × p is said to follow a Wishart distribution W_p(ν, R) if the density of W is given by

c |W|^{(ν − p − 1)/2} exp{ −(1/2) tr(R^{−1} W) },

where c is a normalizing constant, R is a hyperparameter matrix and "tr" is the trace function. To simulate the Wishart distribution, one utilizes the expression W = L T T' L', where R = LL' and T = (t_ij) is a lower triangular matrix with t_ii ~ sqrt(χ²(ν − i + 1)) and t_ij ~ N(0, 1), i > j.
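This construction can be coded directly (an added sketch assuming numpy; the degrees-of-freedom convention is χ²(ν − i + 1) for the ith diagonal element, 1-indexed):

```python
import numpy as np

def sample_wishart(rng, nu, R):
    # Bartlett-type construction: W = L T T' L' with R = L L',
    # t_ii^2 ~ chi-square(nu - i + 1) (1-indexed) and t_ij ~ N(0, 1) below
    # the diagonal.
    p = R.shape[0]
    L = np.linalg.cholesky(R)
    T = np.zeros((p, p))
    for i in range(p):
        T[i, i] = np.sqrt(rng.chisquare(nu - i))   # 0-indexed: df = nu - i
        T[i, :i] = rng.normal(size=i)
    A = L @ T
    return A @ A.T

rng = np.random.default_rng(4)
nu, R = 10, np.array([[2.0, 0.3], [0.3, 1.0]])
mean = np.mean([sample_wishart(rng, nu, R) for _ in range(20000)], axis=0)
print(mean)   # E[W] = nu * R = [[20, 3], [3, 10]]
```

Averaging many draws recovers the Wishart mean νR, a quick check of the simulator.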

In connection with the sampling design of the observations and the error terms we use "ind" to denote independent and "i.i.d." to denote independent and identically distributed. The response variable (or vector) of the model is denoted by either y_i or y_t, the sample size by n and the entire collection of sample data by y = (y_1, ..., y_n). In some instances, we let Y_t = (y_1, ..., y_t) denote the data up to time t and Y^t = (y_t, ..., y_n) the values from t to the end of the sample. The covariates are denoted by x_i if the corresponding response is a scalar and by X_i or X_t if the response is a vector. The regression coefficients are denoted by β, the error variance (if y_i is a scalar) by σ², and the error covariance by Ω if y_i is a vector. The parameters of the model are denoted by θ and the variables used in the MCMC simulation by ψ (consisting of θ and other quantities).

When denoting conditional distributions only dependence on random quantities, such as parameters and random effects, is included in the conditioning set. Covariates are never included in the conditioning. The symbol p is used to denote the prior density if general notation is required.

It is always assumed that each distinct set of parameters, for example, regression coefficients and covariance elements, is a priori independent. The joint prior distribution is therefore specified through the marginal distribution of each distinct set of parameters. Distributions for the parameters are chosen from the class of conditionally conjugate distributions, in keeping with the existing literature on these models. The parameters of the prior distributions, called hyperparameters, are assumed known. These are indicated by the subscript "0". In some cases, when the hyperparameters are unknown, hierarchical priors, defined by placing prior distributions on the prior hyperparameters, are used.

8.3 Normal and student-t regression models

Consider the univariate regression model defined by the specification

y_i | M, β, σ² ~ N(x_i'β, σ²), i ≤ n,

β ~ N_k(β_0, B_0),

σ² ~ IG(ν_0/2, δ_0/2).

The target distribution is

π(β, σ² | y, M) ∝ p(β) p(σ²) ∏_{i=1}^{n} f_N(y_i | x_i'β, σ²),

and MCMC simulation proceeds by a Gibbs chain defined through the full conditional distributions

β | y, M, σ²; σ² | y, M, β.

Each of these distributions is straightforward to derive because, conditioned on σ², both the prior and the likelihood have Gaussian forms (and hence the updated distribution is Gaussian with moments found by completing the square for the terms in the exponential function), while conditioned on β, the updated distribution of σ² is inverse gamma with parameters found by adding the exponents of the prior and the likelihood.

Algorithm 5: Gaussian multiple regression

(1) Sample

β ~ N_k(β̂, B_n), β̂ = B_n(B_0^{−1} β_0 + σ^{−2} Σ_{i=1}^{n} x_i y_i), B_n = (B_0^{−1} + σ^{−2} Σ_{i=1}^{n} x_i x_i')^{−1}.

(2) Sample

σ² ~ IG( (ν_0 + n)/2 , (δ_0 + Σ_{i=1}^{n} (y_i − x_i'β)²)/2 ).

(3) Goto 1.
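Algorithm 5 translates almost line for line into code. The following sketch (an added illustration assuming numpy; the sample size, true parameter values and hyperparameters are arbitrary) checks the sampler on simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sig2_true = np.array([1.0, -2.0]), 0.25
y = X @ beta_true + np.sqrt(sig2_true) * rng.normal(size=n)

# Hyperparameters (diffuse but proper).
beta0, B0inv = np.zeros(k), np.eye(k) / 100.0
nu0, delta0 = 4.0, 1.0

beta, sig2 = np.zeros(k), 1.0
draws_b, draws_s = [], []
for it in range(3000):
    # Step 1: beta | y, sigma^2 is Gaussian (complete the square).
    Bn = np.linalg.inv(B0inv + (X.T @ X) / sig2)
    bhat = Bn @ (B0inv @ beta0 + (X.T @ y) / sig2)
    beta = rng.multivariate_normal(bhat, Bn)
    # Step 2: sigma^2 | y, beta is inverse gamma (1/gamma draw).
    resid = y - X @ beta
    sig2 = 1.0 / rng.gamma((nu0 + n) / 2.0, 2.0 / (delta0 + resid @ resid))
    if it >= 500:
        draws_b.append(beta)
        draws_s.append(sig2)

print(np.mean(draws_b, axis=0), np.mean(draws_s))
```

The posterior means recover the generating values, up to posterior and Monte Carlo variability.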


This algorithm can be easily modified to permit the observations y_i to follow a Student-t distribution. The modification, proposed by Carlin and Polson (1991), utilizes the fact that if

y_i | M, β, σ², λ_i ~ N(x_i'β, λ_i^{−1} σ²) and λ_i ~ G(ν/2, ν/2),

then

y_i | M, β, σ² ~ f_T(y_i | x_i'β, σ², ν), i ≤ n.

Hence, if one defines ψ = (β, σ², {λ_i}), then conditioned on {λ_i} the model is Gaussian and a variant of Algorithm 5 can be used. Furthermore, conditioned on (β, σ²), the full conditional distribution of {λ_i} factors into a product of independent Gamma distributions.

Algorithm 6: Student-t multiple regression

(1) Sample

β ~ N_k(β̂, B_n), β̂ = B_n(B_0^{−1} β_0 + σ^{−2} Σ_{i=1}^{n} λ_i x_i y_i), B_n = (B_0^{−1} + σ^{−2} Σ_{i=1}^{n} λ_i x_i x_i')^{−1}.

(2) Sample

σ² ~ IG( (ν_0 + n)/2 , (δ_0 + Σ_{i=1}^{n} λ_i (y_i − x_i'β)²)/2 ).

(3) Sample

λ_i ~ G( (ν + 1)/2 , (ν + σ^{−2}(y_i − x_i'β)²)/2 ), i ≤ n.

(4) Goto 1.
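Algorithm 6 can be sketched as follows (an added illustration assuming numpy; n, ν and the hyperparameters are arbitrary, and the scale-mixture representation above is used both to simulate the data and in the sampler):

```python
import numpy as np

rng = np.random.default_rng(8)
n, nu = 300, 5.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
lam_true = rng.gamma(nu / 2.0, 2.0 / nu, size=n)            # lambda_i ~ G(nu/2, nu/2)
y = X @ beta_true + rng.normal(size=n) / np.sqrt(lam_true)  # Student-t errors, sigma^2 = 1

beta0, B0inv = np.zeros(2), np.eye(2) / 100.0
nu0, delta0 = 4.0, 1.0
beta, sig2, lam = np.zeros(2), 1.0, np.ones(n)
draws = []
for it in range(2000):
    # Step 1: weighted Gaussian update, weights lambda_i / sigma^2.
    W = lam / sig2
    Bn = np.linalg.inv(B0inv + (X.T * W) @ X)
    beta = rng.multivariate_normal(Bn @ (B0inv @ beta0 + (X.T * W) @ y), Bn)
    # Step 2: inverse gamma with weighted residuals.
    e = y - X @ beta
    sig2 = 1.0 / rng.gamma((nu0 + n) / 2.0, 2.0 / (delta0 + lam @ (e * e)))
    # Step 3: independent Gamma full conditionals for the lambda_i.
    lam = rng.gamma((nu + 1.0) / 2.0, 2.0 / (nu + (e * e) / sig2))
    if it >= 200:
        draws.append(beta)

print(np.mean(draws, axis=0))
```

Conditioning on the mixing variables {λ_i} reduces each sweep to the Gaussian updates of Algorithm 5, which is exactly the point of the representation.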

Another modification of Algorithm 5 applies to Zellner's seemingly unrelated regression (SUR) model. In this case a vector of p observations is generated from the model

y_t | M, β, Ω ~ N_p(X_t β, Ω), t ≤ n,

β ~ N_k(β_0, B_0),

Ω^{−1} ~ W_p(ν_0, R_0),

where y_t = (y_{1t}, ..., y_{pt})', X_t = diag(x_{1t}', ..., x_{pt}'), β = (β_1', ..., β_p')' : k × 1, and k = Σ_i k_i.


To deal with this model, a two-block MCMC approach can be used, as proposed by Blattberg and George (1991) and Percy (1992). Chib and Greenberg (1995b) extend that algorithm to SUR models with hierarchical priors and time-varying parameters of the type considered by Gammerman and Migon (1993).

For the SUR model, the posterior density of the parameters is proportional to

p(β) p(Ω^{−1}) |Ω|^{−n/2} exp{ −(1/2) Σ_{t=1}^{n} (y_t − X_t β)' Ω^{−1} (y_t − X_t β) },

and the MCMC algorithm is defined by the full conditional distributions

β | y, M, Ω^{−1}; Ω^{−1} | y, M, β.

These are both tractable, with the former a normal distribution and the latter a Wishart distribution.

Algorithm 7: Gaussian SUR

(1) Sample

β ~ N_k(B_n(B_0^{−1} β_0 + Σ_{t=1}^{n} X_t' Ω^{−1} y_t), B_n), B_n = (B_0^{−1} + Σ_{t=1}^{n} X_t' Ω^{−1} X_t)^{−1}.

(2) Sample

Ω^{−1} ~ W_p( ν_0 + n , [R_0^{−1} + Σ_{t=1}^{n} (y_t − X_t β)(y_t − X_t β)']^{−1} ).

(3) Goto 1.

8.4 Binary and ordinal probit

Suppose that each y_i is binary and the model of interest is

Pr(y_i = 1 | M, β) = Φ(x_i'β), i ≤ n; β ~ N_k(β_0, B_0).

The posterior distribution does not belong to a named family of distributions. To deal with the problem, Albert and Chib (1993a) introduce a technique that has formed the basis for a unified methodology for univariate and multivariate binary and ordinal response models and has led to many applications. The Albert-Chib algorithm capitalizes on the simplifications afforded by introducing latent or auxiliary data into the sampling.


Instead of the specification above, the model of interest is specified in equivalent form as

z_i | M, β ~ N(x_i'β, 1), y_i = I[z_i > 0], i ≤ n; β ~ N_k(β_0, B_0).

Now the MCMC Gibbs algorithm proceeds with the sampling of the full conditional distributions

β | y, M, {z_i}; {z_i} | y, M, β,

where

β | y, M, {z_i} ≡ β | M, {z_i}

has the same form as in the linear regression model with σ² set equal to one and y_i replaced by z_i, and

{z_i} | y, M, β = ∏_{i=1}^{n} z_i | y_i, M, β

factors into a set of n independent distributions, each depending on the data only through y_i. The distributions z_i | y_i, M, β are obtained by reasoning as follows. Suppose that y_i = 0; then from Bayes theorem

f(z_i | y_i = 0, M, β) ∝ f_N(z_i | x_i'β, 1) f(y_i = 0 | z_i, M, β) ∝ f_N(z_i | x_i'β, 1) I[z_i ≤ 0],

because f(y_i = 0 | z_i, M, β) is equal to one if z_i is negative and equal to zero otherwise, which is the definition of I[z_i ≤ 0]. Hence, the information y_i = 0 simply serves to truncate the support of z_i. By a similar argument the support of z_i is (0, ∞) when conditioned on the event y_i = 1. Each of these truncated distributions is simulated by the formula given in Equation (5). This leads to the following algorithm.

Algorithm 8: Binary probit

(1) Sample

β ~ N_k(B_n(B_0^{−1} β_0 + Σ_{i=1}^{n} x_i z_i), B_n), B_n = (B_0^{−1} + Σ_{i=1}^{n} x_i x_i')^{−1}.

(2) Sample

z_i ~ TN_(−∞, 0](x_i'β, 1) if y_i = 0; z_i ~ TN_(0, ∞)(x_i'β, 1) if y_i = 1, i ≤ n.

(3) Goto 1.
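Algorithm 8 can be sketched as follows (an added illustration assuming numpy; for simplicity the truncated normal draws use rejection sampling rather than the inverse-c.d.f. formula of Equation (5), and the sample size and hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

beta0, B0inv = np.zeros(2), np.eye(2) / 100.0
Bn = np.linalg.inv(B0inv + X.T @ X)      # fixed: the latent variance is one
beta = np.zeros(2)
draws = []
for it in range(2000):
    # Step 2: latent z_i truncated normal, the sign fixed by y_i.
    mu = X @ beta
    z = mu + rng.normal(size=n)
    bad = (z > 0) != (y == 1)
    while bad.any():                      # simple rejection sampler
        z[bad] = mu[bad] + rng.normal(size=int(bad.sum()))
        bad = (z > 0) != (y == 1)
    # Step 1: Gaussian regression update with sigma^2 = 1 and z in place of y.
    beta = rng.multivariate_normal(Bn @ (B0inv @ beta0 + X.T @ z), Bn)
    if it >= 200:
        draws.append(beta)

post_mean = np.mean(draws, axis=0)
print(post_mean)
```

Note that B_n is computed once outside the loop, since the latent-data variance is fixed at one.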


Albert and Chib (1993a) also extend this algorithm to the ordinal categorical data case where y_i can take one of the values {0, 1, ..., J} according to the probabilities

Pr(y_i ≤ j | M, β, γ) = Φ(γ_j − x_i'β), j = 0, 1, ..., J. (28)

In this model the {γ_j} are category-specific cut-points with γ_0 normalized to zero and γ_J equal to infinity. The remaining cut-points γ = (γ_1, ..., γ_{J−1}) are assumed to satisfy the order restriction γ_1 ≤ · · · ≤ γ_{J−1}, which ensures that the cumulative probabilities are non-decreasing. For given data y_1, ..., y_n from this model, the likelihood function is given by

f(y | M, β, γ) = ∏_{j=0}^{J} ∏_{i: y_i = j} [Φ(γ_j − x_i'β) − Φ(γ_{j−1} − x_i'β)], (29)

and the posterior density, under the prior p(β, γ), is proportional to p(β, γ) f(y | M, β, γ). Posterior simulation is again feasible with the introduction of latent variables z_1, ..., z_n, where z_i ~ N(x_i'β, 1). We observe y_i = j if the latent variable z_i falls in the interval [γ_{j−1}, γ_j). Now the basic Albert and Chib MCMC scheme draws the latent data, regression parameters and cut-points in sequence. Given y_i = j, the sampling of the latent data z_i is from TN_[γ_{j−1}, γ_j)(x_i'β, 1) and the sampling of the parameters β is as in Algorithm 8. For the cut-points, Cowles (1996) and Nandram and Chen (1996) proposed that the cut-points be generated by the M-H algorithm, marginalized over z. Subsequently, Albert and Chib (1998) simplified the latter step by transforming the cut-points γ so as to remove the ordering constraint. The transformation is defined by the one-to-one map

δ_1 = log γ_1; δ_j = log(γ_j − γ_{j−1}), 2 ≤ j ≤ J − 1. (30)

The advantage of working with δ instead of γ is that the parameters of the tailored proposal density in the M-H step for δ can be obtained by an unconstrained optimization and the prior p(δ) on δ can be an unrestricted multivariate normal. The algorithm is defined as follows.

Algorithm 9: Ordinal probit

(1) M-H

(a) Calculate

m = arg max_δ log f(y | M, β, δ),

and V = {−∂² log f(y | M, β, δ)/∂δ ∂δ'}^{−1}, the negative inverse of the Hessian at m.

(b) Propose

δ' ~ f_T(δ | m, V, ν).

(c) Calculate

α = min{ 1, [p(δ') f(y | M, β, δ') f_T(δ | m, V, ν)] / [p(δ) f(y | M, β, δ) f_T(δ' | m, V, ν)] }.

(d) Move to δ' with probability α. Transform the new δ to γ via the inverse map γ_j = Σ_{i=1}^{j} exp(δ_i), 1 ≤ j ≤ J − 1.

(2) Sample

z_i ~ TN_[γ_{j−1}, γ_j)(x_i'β, 1) if y_i = j, i ≤ n.

(3) Sample

β ~ N_k(B_n(B_0^{−1} β_0 + Σ_{i=1}^{n} x_i z_i), B_n), B_n = (B_0^{−1} + Σ_{i=1}^{n} x_i x_i')^{−1}.

(4) Goto 1.

8.5 Tobit censored regression

Consider now a model in the Tobit class in which the data y_i are generated by

z_i | M, β, σ² ~ N(x_i'β, σ²),

y_i = max(0, z_i), 1 ≤ i ≤ n,

indicating that the latent variable z_i is observed only when it is positive. This model gives rise to a mixed discrete-continuous distribution with a point mass of [1 − Φ(x_i'β/σ)] at zero and a density f_N(y_i | x_i'β, σ²) on (0, ∞). The likelihood function is given by

f(y | M, β, σ²) = ∏_{i ∈ C} [1 − Φ(x_i'β/σ)] × ∏_{i ∉ C} (2πσ²)^{−1/2} exp{ −(1/(2σ²))(y_i − x_i'β)² },

where C is the set of censored observations and Φ is the c.d.f. of the standard normal random variable.

An MCMC procedure for this model is developed by Chib (1992), while Wei and Tanner (1990a) discuss a related approach for a model that arises in survival analysis. A set of tractable full conditional distributions is obtained by including the vector z = (z_i), i ∈ C, in the sampling. Let yz = (yz_i) be an n × 1 vector with ith component y_i if the ith observation is not censored and z_i if it is censored. Now apply the Gibbs sampling algorithm with blocks (β, σ², z) and associated full conditional distributions

β | y, M, z, σ²; σ² | y, M, z, β; and z | y, M, β, σ².

The first two of these distributions follow from the results for linear regression with Gaussian errors (with yz_i used in place of y_i) and the third distribution, analogous to the probit case, is truncated normal on the interval (−∞, 0].

Algorithm 10: Tobit censored regression

(1) Sample

β ~ N_k(B_n(B_0^{−1} β_0 + σ^{−2} Σ_{i=1}^{n} x_i yz_i), B_n), B_n = (B_0^{−1} + σ^{−2} Σ_{i=1}^{n} x_i x_i')^{−1}.

(2) Sample

σ² ~ IG( (ν_0 + n)/2 , (δ_0 + Σ_{i=1}^{n} (yz_i − x_i'β)²)/2 ).

(3) Sample

z_i ~ TN_(−∞, 0](x_i'β, σ²), i ∈ C.

(4) Goto 1.
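Algorithm 10 differs from the probit sampler only in the imputation step, and can be sketched as follows (an added illustration assuming numpy; the truncated normal draws again use rejection sampling, and the data-generating values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
z_star = X @ beta_true + rng.normal(size=n)
y = np.maximum(0.0, z_star)                  # observations censored at zero
cens = y <= 0.0

beta0, B0inv = np.zeros(2), np.eye(2) / 100.0
nu0, delta0 = 4.0, 1.0
beta, sig2 = np.zeros(2), 1.0
yz = y.copy()
draws = []
for it in range(2000):
    # Step 3: impute z_i on (-inf, 0] for the censored observations.
    mu = X @ beta
    idx = np.where(cens)[0]
    z = mu[idx] + np.sqrt(sig2) * rng.normal(size=idx.size)
    bad = z > 0
    while bad.any():                         # rejection sampler for the truncation
        z[bad] = mu[idx][bad] + np.sqrt(sig2) * rng.normal(size=int(bad.sum()))
        bad = z > 0
    yz[idx] = z
    # Steps 1-2: Gaussian regression updates with yz in place of y.
    Bn = np.linalg.inv(B0inv + (X.T @ X) / sig2)
    beta = rng.multivariate_normal(Bn @ (B0inv @ beta0 + (X.T @ yz) / sig2), Bn)
    resid = yz - X @ beta
    sig2 = 1.0 / rng.gamma((nu0 + n) / 2.0, 2.0 / (delta0 + resid @ resid))
    if it >= 200:
        draws.append(beta)

print(np.mean(draws, axis=0))
```

Once the censored observations are imputed, the parameter updates are exactly those of Algorithm 5 applied to the completed data yz.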

8.6 Regression with change point

Suppose that y = {y_1, y_2, ..., y_n} is a time series such that the density of y_t given Y_{t−1} = (y_1, ..., y_{t−1}) is specified as

y_t | M, Y_{t−1}, β_1, β_2, σ_1², σ_2², τ ~ N(x_t'β_1, σ_1²) if t ≤ τ, and N(x_t'β_2, σ_2²) if τ < t,

where τ is an unknown change point. The objective is to estimate the parameter vectors β = (β_1, β_2), the regression variances σ² = (σ_1², σ_2²) and the change point τ.

An analysis of such models from an MCMC perspective was initiated by Carlin, Gelfand and Smith (1992). It is based on the inclusion of the change point τ in the MCMC sampling. Stephens (1994) generalized the approach of Carlin, Gelfand and Smith for models with multiple change points by including each of the unobserved change points in the sampling. In this generalization, however, the step that involves the simulation of the change points conditioned on the parameters and the data can be computationally very demanding when the sample size n is large. A different approach to multiple change point problems which is computationally simpler is developed by Chib (1998). An important aspect of the MCMC approach for change point problems is that it can be easily adapted for binary and count data.

Assume that

β_j ~ N_k(β_0, B_0); σ_j² ~ IG(ν_0/2, δ_0/2); τ ~ Unif{a_0, a_0 + 1, ..., b_0},

where τ follows a discrete uniform distribution on the integers {a_0, ..., b_0}. Then the posterior density is

π(τ, β, σ² | y, M) ∝ p(β) p(σ²) p(τ) ∏_{t ≤ τ} φ(y_t | x_t'β_1, σ_1²) ∏_{τ < t} φ(y_t | x_t'β_2, σ_2²).

Conditional on τ the data split into two parts and the conditional distributions of the regression parameters are obtained from the regression updates of Algorithm 5. On the other hand, given the regression parameters, the full conditional distribution of τ is concentrated on {a_0, ..., b_0} with mass function

Pr(τ = k | y, M, β, σ²) ∝ ∏_{t ≤ k} φ(y_t | x_t'β_1, σ_1²) ∏_{k < t} φ(y_t | x_t'β_2, σ_2²).

The normalizing constant of this mass function is the sum of the right hand side over k.

Algorithm 11: Regression with change point

(1) Sample, for j = 1, 2,

β_j ~ N_k(β̂_j, B_j),

β̂_j = B_j(B_0^{−1} β_0 + σ_j^{−2} Σ_{t=l}^{u} x_t y_t),

B_j = (B_0^{−1} + σ_j^{−2} Σ_{t=l}^{u} x_t x_t')^{−1},

l = 1 + (j − 1)τ; u = τ + (j − 1)(n − τ).

(2) Sample, for j = 1, 2,

σ_j² ~ IG( (ν_0 + n_j)/2 , (δ_0 + Σ_{t=l}^{u} (y_t − x_t'β_j)²)/2 ), n_j = τ + (j − 1)(n − 2τ).

(3) Calculate, for k = a_0, a_0 + 1, ..., b_0,

p_k ∝ ∏_{t ≤ k} φ(y_t | x_t'β_1, σ_1²) ∏_{k < t} φ(y_t | x_t'β_2, σ_2²).

(4) Sample

τ ~ {p_{a_0}, p_{a_0+1}, ..., p_{b_0}}.

(5) Goto 1.
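Algorithm 11 can be sketched for a simplified mean-shift special case (an added illustration assuming numpy: intercept-only regimes, known unit variances, a N(0, 10²) prior for each regime mean and a discrete uniform prior for τ; cumulative sums make the discrete full conditional of τ cheap to evaluate):

```python
import numpy as np

rng = np.random.default_rng(7)
n, tau_true = 200, 80
y = np.concatenate([rng.normal(0.0, 1.0, tau_true),
                    rng.normal(1.5, 1.0, n - tau_true)])

# Sufficient statistics for fast evaluation of Pr(tau = k | y, mu1, mu2).
S1 = np.concatenate([[0.0], np.cumsum(y)])
S2 = np.concatenate([[0.0], np.cumsum(y * y)])
ks = np.arange(10, n - 9)        # support {a0, ..., b0} of the change point

tau, tau_draws = n // 2, []
for it in range(3000):
    # Step 1: regime means given the split (conjugate normal updates).
    v1 = 1.0 / (0.01 + tau)
    mu1 = rng.normal(v1 * S1[tau], np.sqrt(v1))
    v2 = 1.0 / (0.01 + n - tau)
    mu2 = rng.normal(v2 * (S1[n] - S1[tau]), np.sqrt(v2))
    # Steps 3-4: tau from its discrete full conditional, proportional to
    # the two-regime Gaussian likelihood.
    logp = -0.5 * (S2[ks] - 2.0 * mu1 * S1[ks] + ks * mu1 ** 2) \
           - 0.5 * ((S2[n] - S2[ks]) - 2.0 * mu2 * (S1[n] - S1[ks]) + (n - ks) * mu2 ** 2)
    p = np.exp(logp - logp.max())
    tau = rng.choice(ks, p=p / p.sum())
    if it >= 500:
        tau_draws.append(tau)

print(np.mean(tau_draws))
```

The posterior for τ concentrates near the true break at observation 80.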

8.7 Autoregressive time series

Consider the model

y_t = x_t'β + ε_t, 1 ≤ t ≤ n,

where the error is generated by the stationary AR(p) process

ε_t − φ_1 ε_{t−1} − · · · − φ_p ε_{t−p} = u_t, or φ(L) ε_t = u_t,

where u_t ~ i.i.d. N(0, σ²) and φ(L) = 1 − φ_1 L − · · · − φ_p L^p is a polynomial in the lag operator L. One interesting complication in this model is that the parameters φ = (φ_1, ..., φ_p), due to the stationarity assumption, are restricted to lie in the region S_φ of R^p where the roots of φ(L) are all outside the unit circle. Chib and Greenberg (1994), based on Chib (1993), derive a multiple-block Metropolis-Hastings MCMC algorithm for this model in which the proposal densities for β and σ² are the respective full conditional densities while that of φ is a normal density constructed from the observations y_t, t ≥ p + 1.

Denote the first p observations as Y_p = (y_1, ..., y_p)' and X_p = (x_1, ..., x_p)', and let y_t* = φ(L) y_t and x_t* = φ(L) x_t, t ≥ p + 1. Also define the p-dimensional matrix Σ_p through the matrix equation

Σ_p = Φ Σ_p Φ' + e_1(p) e_1(p)',

where e_1(p) = (1, 0, ..., 0)' and Φ is the p × p companion matrix of φ = (φ_1, ..., φ_p), with first row (φ_1, ..., φ_p) and the identity I_{p−1} in the lower-left block. Let the Cholesky factorization of Σ_p be QQ' and define Y_p* = Q^{−1} Y_p and X_p* = Q^{−1} X_p, which are functions of φ. Finally define e_t = y_t* − x_t*'β, t ≥ p + 1.


One can now proceed by noting that, given φ, updates of β and σ² follow from the model

y_t* | M, β, σ² ~ N(x_t*'β, σ²), t ≥ 1,

β ~ N_k(β_0, B_0), σ² ~ IG(ν_0/2, δ_0/2),

while conditioned on (β, σ²), and under the assumption that the prior density of φ is N(φ_0, G_0) truncated to the region S_φ, the full conditional of φ is

π(φ | y, M, β, σ²) ∝ Ψ(φ) × φ_p(φ | φ̂, V) I[φ ∈ S_φ],

where

Ψ(φ) = |Σ_p|^{−1/2} exp{ −(1/(2σ²)) (Y_p − X_p β)' Σ_p^{−1} (Y_p − X_p β) },

φ̂ = V(G_0^{−1} φ_0 + σ^{−2} Σ_{t=p+1}^{n} E_t e_t), V = (G_0^{−1} + σ^{−2} Σ_{t=p+1}^{n} E_t E_t')^{−1}, E_t = (e_{t−1}, ..., e_{t−p})'.

To sample this density the proposal density is specified as

q(φ | y, M, β, σ²) = φ_p(φ | φ̂, V).

With this tailored proposal density the probability of move involves only Ψ(φ), leading to an M-H step that is both fast (because it entails the calculation of a function based on the first p observations and not the entire sample) and highly efficient (because the proposal density is matched to the target).

Algorithm 12: Regression with autoregressive errors
(1) Calculate (y_t*, x_t*), t ≤ n.
(2) Sample

β ~ N_k(β̂, B_n), β̂ = B_n(B₀⁻¹β₀ + σ⁻² Σ_{t=1}^n x_t* y_t*), B_n = (B₀⁻¹ + σ⁻² Σ_{t=1}^n x_t* x_t*')⁻¹.

(3) Sample

σ² ~ IG( (ν₀ + n)/2, (δ₀ + Σ_{t=1}^n (y_t* − x_t*'β)²)/2 ).

(4) M-H
(a) Calculate

φ̂ = V(G₀⁻¹φ₀ + σ⁻² Σ_{t=p+1}^n E_t e_t); V = (G₀⁻¹ + σ⁻² Σ_{t=p+1}^n E_t E_t')⁻¹.


(b) Propose

φ' ~ N_p(φ̂, V).

(c) Calculate

α = min{ 1, Ψ(φ')/Ψ(φ) } I[φ' ∈ S_φ].

(d) Move to φ' with probability α.
(5) Goto 1.
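Step (4) of Algorithm 12 can be sketched in code. The following is a minimal illustration (not from the chapter): it checks the stationarity restriction via the companion matrix, evaluates log Ψ(φ) by solving the matrix equation for Ψ_p through vectorization, and carries out the accept/reject decision. All function names are hypothetical.

```python
import numpy as np

def companion(phi):
    """Companion matrix of phi(L) = 1 - phi_1 L - ... - phi_p L^p."""
    p = len(phi)
    F = np.zeros((p, p))
    F[0, :] = phi
    if p > 1:
        F[1:, :-1] = np.eye(p - 1)
    return F

def stationary(phi):
    """True if all roots of phi(L) lie outside the unit circle,
    i.e., the companion matrix has spectral radius < 1."""
    return np.max(np.abs(np.linalg.eigvals(companion(phi)))) < 1.0

def log_Psi(phi, Yp, Xp, beta, sig2):
    """log Psi(phi): the contribution of the first p observations,
    with Psi_p solving Psi_p = F Psi_p F' + e1 e1'."""
    p = len(phi)
    F = companion(phi)
    e1 = np.zeros(p); e1[0] = 1.0
    # vec(Psi_p) = (I - F kron F)^{-1} vec(e1 e1')
    vecP = np.linalg.solve(np.eye(p * p) - np.kron(F, F),
                           np.outer(e1, e1).ravel())
    Psi = vecP.reshape(p, p)
    r = Yp - Xp @ beta
    _, logdet = np.linalg.slogdet(Psi)
    return -0.5 * logdet - 0.5 / sig2 * r @ np.linalg.solve(Psi, r)

def mh_step_phi(phi, phi_prop, Yp, Xp, beta, sig2, rng):
    """Accept/reject: alpha = min{1, Psi(phi')/Psi(phi)} I[phi' in S_phi]."""
    if not stationary(phi_prop):
        return phi                       # alpha = 0 outside S_phi
    la = (log_Psi(phi_prop, Yp, Xp, beta, sig2)
          - log_Psi(phi, Yp, Xp, beta, sig2))
    return phi_prop if np.log(rng.uniform()) < min(0.0, la) else phi
```

For p = 1 the matrix equation reduces to Ψ₁ = 1/(1 − φ²), so the function can be checked against that closed form.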

8.8 Hidden Markov models

In this subsection we consider the MCMC-based analysis of hidden Markov models (also called Markov mixture models or Markov switching models). The general model is described as

y_t | Y_{t−1}, M, s_t = k, θ ~ f(y_t | Y_{t−1}, M, θ_k), k = 1,…,m,

s_t | s_{t−1}, P ~ Markov(P, π₁),

θ_k ~ π(θ_k),

p_i ~ Dirichlet(α_{i1},…,α_{im}), i ≤ m,

where s_t ∈ {1,…,m} is an unobservable random variable which evolves according to a Markov process with transition matrix P = {p_ij}, with p_ij = Pr(s_t = j | s_{t−1} = i), and initial distribution π₁ at t = 1; f is a density or mass function; θ = (θ₁,…,θ_m) are the parameters of f under each possible value of s_t; and p_i is the ith row of P, which is assumed to have a Dirichlet prior distribution with parameters (α_{i1},…,α_{im}). For identifiability reasons, the Markov chain of s_t is assumed to be time-homogeneous, irreducible, and aperiodic.

The MCMC analysis of such models was initiated by Albert and Chib (1993b) in the context of a more general model than the one above, where the conditional density of the data depends not just on s_t but also on the previous values {s_{t−1},…,s_{t−r}}, as in the model of Hamilton (1989). The approach relies on augmenting the parameter space to include the unobserved states and simulating π(S_n, θ, P | y, M) via the conditional distributions

s_t | y, M, S_{(−t)}, θ, P (t ≤ n); θ | y, M, S_n, P; {p_i} | y, M, θ, S_n,

where S_n = (s₁,…,s_n) denotes the entire collection of states. Robert, Celeux and Diebolt (1993) and McCulloch and Tsay (1994) developed a similar approach for the simpler model in which only the current state s_t appears in the density of y_t, while Billio, Monfort and Robert (1999) consider ARMA models with Markov switching.


Chib (1996), whose approach we now follow, modifies the first set of blocks of the above scheme to sample the states jointly from

S_n | y, M, θ, P,

in one block. This leads to a more efficient MCMC algorithm. The sampling of S_n is achieved by one forward and one backward pass through the data. In the forward pass, one recursively produces the sequence of mass functions {p(s_t | Y_t, M, θ, P)} (t ≤ n) as follows: assume that the function p(s_{t−1} | Y_{t−1}, M, θ, P) is available. Then, one obtains p(s_t | Y_t, M, θ, P) by calculating

p(s_t | Y_{t−1}, M, θ, P) = Σ_{l=1}^m p(s_t | s_{t−1} = l, P) × p(s_{t−1} = l | Y_{t−1}, M, θ, P),

followed by

p(s_t | Y_t, M, θ, P) = [ p(s_t | Y_{t−1}, M, θ, P) × f(y_t | Y_{t−1}, M, θ_{s_t}, P) ] / [ Σ_{l=1}^m p(s_t = l | Y_{t−1}, M, θ, P) × f(y_t | Y_{t−1}, M, θ_l, P) ].

These forward recursions can be initialized at t = 1 by setting p(s₁ | Y₀, M, θ, P) to be the stationary distribution of the chain (the left eigenvector of P corresponding to the eigenvalue of one).

Then, in the backward pass one simulates S_n by the method of composition, first simulating s_n from s_n | y, M, θ, P and then the remaining states using the probability mass functions

Pr(s_t = k | Y_t, M, S^{t+1}, θ, P) ∝ p(s_t = k | Y_t, M, θ, P) × p(s_{t+1} | s_t = k, P), k ≤ m, t ≤ n − 1,

where S^{t+1} = (s_{t+1},…,s_n) consists of the simulated values from the earlier steps, and the second term of the product is the Markov transition probability, which is picked off from the column of P determined by the simulated value of s_{t+1}.

Given the simulated vector S_n, the data separates into m non-contiguous pieces, and the simulation of θ_k is from the full conditional distribution

π(θ_k) Π_{t: s_t = k} f(y_t | Y_{t−1}, M, θ_k, P).

Depending on the forms of f and π, this may belong to a named distribution. Otherwise, this distribution is sampled by an M-H step. Finally, the last distribution depends simply on S_n, with each row p_i of P independently an updated Dirichlet distribution:

p_i | S_n ~ Dirichlet(α_{i1} + n_{i1},…, α_{im} + n_{im}), (i ≤ m),

where n_{ik} is the total number of one-step transitions from state i to state k in the vector S_n.


Algorithm 13: Hidden Markov model
(1) Calculate and store for t = 1, 2,…, n

p(s_t | Y_t, M, θ, P).

(2) Sample

s_n ~ p(s_n | Y_n, M, θ, P).

(3) Sample for t = n − 1, n − 2,…, 1

s_t ~ p(s_t | y, M, S^{t+1}, θ, P).

(4) Sample for k = 1,…, m

θ_k ∝ π(θ_k) Π_{t: s_t = k} f(y_t | Y_{t−1}, M, θ_k, P).

(5) Sample for i = 1, 2,…, m

p_i ~ Dirichlet(α_{i1} + n_{i1},…, α_{im} + n_{im}).

(6) Goto 1.
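Steps (1)-(3) of Algorithm 13, the joint forward-filter/backward-sample draw of the states, can be sketched as follows. This is a minimal illustration (not from the chapter), with hypothetical names: `lik[t, k]` holds f(y_t | Y_{t−1}, θ_k), `P` is the transition matrix, and `pi1` the initial distribution.

```python
import numpy as np

def ffbs_states(lik, P, pi1, rng):
    """One joint draw of S_n = (s_1,...,s_n): forward filtering of
    p(s_t | Y_t), then backward sampling by composition."""
    n, m = lik.shape
    filt = np.zeros((n, m))
    pred = pi1                            # p(s_1 | Y_0)
    for t in range(n):
        f = pred * lik[t]                 # prediction x likelihood
        filt[t] = f / f.sum()             # p(s_t | Y_t)
        pred = filt[t] @ P                # p(s_{t+1} | Y_t)
    s = np.empty(n, dtype=int)
    s[-1] = rng.choice(m, p=filt[-1])
    for t in range(n - 2, -1, -1):
        # Pr(s_t = k | Y_t, s_{t+1}) propto p(s_t = k | Y_t) P[k, s_{t+1}]
        w = filt[t] * P[:, s[t + 1]]
        s[t] = rng.choice(m, p=w / w.sum())
    return s
```

With sharply informative likelihoods the sampled path tracks the states that generated the data, which gives a quick sanity check.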

8.9 State space models

Consider next a linear state space model in which a scalar observation y_t is generated as

y_t | M, θ_t ~ N(x_t'θ_t, σ²),

θ_t | θ_{t−1} ~ N_m(G θ_{t−1}, Ω), 1 ≤ t ≤ n,

σ² ~ IG(ν₀/2, δ₀/2),

Ω⁻¹ ~ Wishart_m(ρ₀, R₀),

where θ_t is an m × 1 state vector and G is assumed known. For nonlinear versions of this model, an MCMC fitting approach is provided by Carlin, Polson and Stoffer (1992). It is based on the inclusion of the variables {θ_t} in the sampling, followed by one-at-a-time sampling of θ_t given θ_{−t} (the remaining θ's) and (σ², Ω). For the linear version presented above, Carter and Kohn (1994) and Fruhwirth-Schnatter (1994) show that a reduced blocking scheme involving the joint simulation of {θ_t} is possible and desirable, because the θ_t's are correlated by construction, while de Jong and Shephard (1995) provide an important alternative procedure called the simulation smoother that is particularly useful if Ω is not positive definite or if the dimension m of the state


vector is large. Carter and Kohn (1996) and Shephard (1994) also consider models, called conditionally Gaussian state space models, that have Gaussian observation densities conditioned on a discrete or continuous variable s. An example of this is provided below in Section 8.10. Chib and Greenberg (1995b) consider hierarchical and vector versions of the above model, while additional issues related to the fitting and parameterization of state space models are considered by Pitt and Shephard (1997).

The MCMC implementation for this model is based on the distributions

θ₁,…,θ_n | y, M, σ², Ω; σ² | y, M, {θ_t}, Ω; Ω⁻¹ | y, M, {θ_t}.

To see how the θ_t's are sampled, write the joint distribution as

p(θ_n | y, M, σ², Ω) × p(θ_{n−1} | y, M, θ_n, σ², Ω) × ⋯ × p(θ₁ | y, M, θ^2, σ², Ω),

where, on letting θ^s = (θ_s,…,θ_n), Y_s = (y₁,…,y_s) and Y^s = (y_s,…,y_n) for s ≤ n, the typical term is

p(θ_t | y, M, θ^{t+1}, σ², Ω) ∝ p(θ_t | Y_t, M, σ², Ω) p(θ_{t+1} | θ_t, M, σ², Ω),

due to the fact that (Y^{t+1}, θ^{t+2}) is independent of θ_t given (θ_{t+1}, σ², Ω). The first density on the right hand side is Gaussian with moments given by the Kalman filter recursions. The second density is Gaussian with mean G θ_t and variance Ω. By completing the square in θ_t, the moments of p(θ_t | y, M, θ^{t+1}, σ², Ω) can be derived. Then, the joint distribution θ₁,…,θ_n | y, M, σ², Ω can be sampled by the method of composition.

Algorithm 14: Gaussian state space
(1) Kalman filter
(a) Calculate for t = 1, 2,…, n

θ̂_{t|t−1} = G θ̂_{t−1|t−1}, R_{t|t−1} = G R_{t−1|t−1} G' + Ω,
f_{t|t−1} = x_t' R_{t|t−1} x_t + σ², K_t = R_{t|t−1} x_t f_{t|t−1}⁻¹,
θ̂_{t|t} = θ̂_{t|t−1} + K_t (y_t − x_t' θ̂_{t|t−1}), R_{t|t} = (I − K_t x_t') R_{t|t−1},
M_t = R_{t|t} G' R_{t+1|t}⁻¹, R_t = R_{t|t} − M_t R_{t+1|t} M_t'.

(b) Store

θ̂_{t|t}; M_t; R_t.

(2) Simulation step
(a) Sample

θ_n ~ N_m(θ̂_{n|n}, R_{n|n}).


(b) Sample for t = n − 1, n − 2,…, 1

θ_t ~ N_m(θ̂_t, R_t), θ̂_t = θ̂_{t|t} + M_t (θ_{t+1} − G θ̂_{t|t}).

(3) Sample

σ² ~ IG( (ν₀ + n)/2, (δ₀ + Σ_{t=1}^n (y_t − x_t'θ_t)²)/2 ).

(4) Sample

Ω⁻¹ ~ Wishart_m( ρ₀ + n, [R₀⁻¹ + Σ_t (θ_t − G θ_{t−1})(θ_t − G θ_{t−1})']⁻¹ ).

( 5) Goto 1.
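Steps (1)-(2) of Algorithm 14 (forward Kalman filter, backward simulation of the states) can be sketched as follows. This is a minimal illustration, not from the chapter; the initialization θ̂_{0|0} = 0, R_{0|0} = I is an assumption made here, since the text does not specify starting values.

```python
import numpy as np

def ffbs_gaussian(y, x, G, Omega, sig2, rng):
    """One joint draw of (theta_1,...,theta_n) for the linear Gaussian
    state space model: y: (n,), x: (n, m), G and Omega: (m, m)."""
    n, m = x.shape
    tf = np.zeros((n, m))        # theta_hat_{t|t}
    Rf = np.zeros((n, m, m))     # R_{t|t}
    t0, R0 = np.zeros(m), np.eye(m)     # assumed starting values
    for t in range(n):
        tp = G @ t0                          # theta_hat_{t|t-1}
        Rp = G @ R0 @ G.T + Omega            # R_{t|t-1}
        f = x[t] @ Rp @ x[t] + sig2          # f_{t|t-1}
        K = Rp @ x[t] / f                    # Kalman gain
        t0 = tp + K * (y[t] - x[t] @ tp)
        R0 = Rp - np.outer(K, x[t]) @ Rp     # (I - K x_t') R_{t|t-1}
        tf[t], Rf[t] = t0, R0
    th = np.zeros((n, m))
    th[-1] = rng.multivariate_normal(tf[-1], Rf[-1])
    for t in range(n - 2, -1, -1):
        Rp1 = G @ Rf[t] @ G.T + Omega        # R_{t+1|t}
        Mt = Rf[t] @ G.T @ np.linalg.inv(Rp1)
        mean = tf[t] + Mt @ (th[t + 1] - G @ tf[t])
        var = Rf[t] - Mt @ Rp1 @ Mt.T        # R_t
        th[t] = rng.multivariate_normal(mean, var)
    return th
```

With a nearly constant state (tiny Ω) and a small observation variance, the sampled path should hug the data, which provides a simple check.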

8.10 Stochastic volatility model

Suppose that time series observations {y_t} are generated by the stochastic volatility (SV) model [see, for example, Taylor (1994), Shephard (1996), and Ghysels, Harvey and Renault (1996)]

y_t = exp(h_t/2) u_t, h_t = μ + φ(h_{t−1} − μ) + σ η_t, t ≤ n,

where {h_t} is the latent log-volatility of y_t, and {u_t} and {η_t} are white noise standard normal random variables. This is an example of a state space model in which the state variable h_t appears non-linearly in the observation equation. The model can be extended to include covariates in the observation and evolution equations and to include a heavy-tailed, non-Gaussian distribution for u_t. The MCMC analysis of this model was initiated by Jacquier, Polson and Rossi (1994) based on the general approach of Carlin, Polson and Stoffer (1992). If we let θ = (μ, φ, σ²), then the algorithm of Jacquier, Polson and Rossi (1994) is based on the (n + 3) full conditional distributions

h_t | y, M, h_{−t}, θ, t = 1, 2,…, n,

μ | y, M, {h_t}, φ, σ²; φ | y, M, {h_t}, μ, σ²; σ² | y, M, {h_t}, μ, φ,

where the latent variables h_t are sampled by a sequence of Metropolis-Hastings steps. Subsequently, Kim, Shephard and Chib (1998) discussed an alternative approach that leads to considerable improvements in the mixing of the Markov chain. The latter approach has been further refined by Chib, Nardari and Shephard (1998, 1999).

The idea behind the Kim, Shephard and Chib approach is to approximate the SV model by a conditionally Gaussian state space model with the introduction of


Table 1
Parameters of the seven-component Gaussian mixture to approximate the distribution of log χ²₁

s_t     q          m_{s_t}       v²_{s_t}
1      0.00730    −11.40039     5.79596
2      0.10556     −5.24321     2.61369
3      0.00002     −9.83726     5.17950
4      0.04395      1.50746     0.16735
5      0.34001     −0.65098     0.64009
6      0.24566      0.52478     0.34023
7      0.25750     −2.35859     1.26261

multinomial random variables {s_t} that follow a seven-point discrete distribution. Conditioned on {s_t}, the model is Gaussian and the variables h_t appear linearly in the observation equation. Then, the entire set of {h_t} are sampled jointly, conditioned on θ and {s_t}, by either the simulation smoother of de Jong and Shephard (1995) or by the algorithm for simulating the states given in Algorithm 14. Once the MCMC simulation is concluded, the parameter draws are reweighted to correspond to the original non-linear model.

To begin with, reexpress the SV model as

y_t* = h_t + z_t, h_t = μ + φ(h_{t−1} − μ) + σ η_t,

where y_t* = ln(y_t²) and z_t = ln u_t² is distributed as the log of a chi-squared random variable with one degree of freedom. Now approximate the distribution of y_t* | h_t by a mixture of normal distributions. A very accurate representation is given by the mixture distribution

y_t* | h_t, s_t = i ~ N(m_i + h_t, v_i²), Pr(s_t = i) = q_i, i ≤ 7, t ≤ n,

where s_t ∈ {1, 2,…, 7} is an unobserved component indicator with probability mass function q = {q_i}, and the parameters {q_i, m_i, v_i²} are as reported in Table 1. Now the parameters and the latent variables can be simulated by a two-block MCMC algorithm defined by the distributions

(θ, h₁,…,h_n) | {y_t*}, {s_t};    {s_t} | {y_t*}, {h_t}, θ,

where the first block is sampled by the method of composition, by first drawing θ from π(θ | {y_t*}, {s_t}) by an M-H step, followed by a draw of {h_t} by the simulation smoother. In the former step the target distribution is

π(θ | {y_t*}, {s_t}) ∝ p(θ) f(y₁*,…, y_n* | {s_t}, θ)

= p(θ) Π_{t=1}^n f(y_t* | Y*_{t−1}, {s_t}, θ),


where each one-step-ahead density f(y_t* | Y*_{t−1}, {s_t}, θ) can be derived from the output of the Kalman filter recursions, adapted to the differing components as indicated by the component vector {s_t}, and p(θ) is the prior density. For φ the prior can be taken to be the scaled beta density

p(φ) = c {0.5(1 + φ)}^(φ⁽¹⁾ − 1) {0.5(1 − φ)}^(φ⁽²⁾ − 1), φ⁽¹⁾, φ⁽²⁾ > 0.5, (31)

where

c = 0.5 Γ(φ⁽¹⁾ + φ⁽²⁾) / {Γ(φ⁽¹⁾) Γ(φ⁽²⁾)},

with prior mean 2φ⁽¹⁾/(φ⁽¹⁾ + φ⁽²⁾) − 1, while the priors on μ and σ² can be normal and inverse gamma densities, respectively.

Algorithm 15: Stochastic volatility
(1) Initialize {s_t}.
(2) M-H
(a) Calculate m = arg max_θ l(θ), where (up to an additive constant)

l(θ) = −(1/2) Σ_{t=1}^n ln f_{t|t−1} − (1/2) Σ_{t=1}^n (y_t* − m_{s_t} − h_{t|t−1})² / f_{t|t−1},

and V = {−∂²l(θ)/∂θ∂θ'}⁻¹, the negative inverse of the Hessian at m, where f_{t|t−1} and h_{t|t−1} are computed from the Kalman filter recursions

h_{t|t−1} = μ + φ(h_{t−1|t−1} − μ), R_{t|t−1} = φ² R_{t−1|t−1} + σ²,
f_{t|t−1} = R_{t|t−1} + v²_{s_t}, K_t = R_{t|t−1} f_{t|t−1}⁻¹,
h_{t|t} = h_{t|t−1} + K_t (y_t* − m_{s_t} − h_{t|t−1}), R_{t|t} = (1 − K_t) R_{t|t−1}.

(b) Propose

θ' ~ f_T(θ | m, V, ξ).

(c) Calculate

α = min{ 1, [p(θ') exp{l(θ')} f_T(θ | m, V, ξ)] / [p(θ) exp{l(θ)} f_T(θ' | m, V, ξ)] }.

(d) Move to θ' with probability α.
(3) Sample {h_t} using Algorithm 14, or the simulation smoother algorithm, modified to include the components of the mixture selected by {s_t}.


(4) Sample

s_t ~ Pr(s_t | y_t*, h_t, θ) ∝ Pr(s_t) f_N(y_t* | m_{s_t} + h_t, v²_{s_t}).

(5) Goto 2.
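Step (4) of Algorithm 15 can be sketched directly from Table 1. The following is a minimal illustration (not from the chapter); `sample_indicators` is a hypothetical name, and the constants are those reported in Table 1.

```python
import numpy as np

# Seven-component mixture approximation to log chi^2_1 (Table 1)
q = np.array([0.00730, 0.10556, 0.00002, 0.04395, 0.34001, 0.24566, 0.25750])
m = np.array([-11.40039, -5.24321, -9.83726, 1.50746, -0.65098, 0.52478, -2.35859])
v2 = np.array([5.79596, 2.61369, 5.17950, 0.16735, 0.64009, 0.34023, 1.26261])

def sample_indicators(ystar, h, rng):
    """Draw each s_t from Pr(s_t = i | y_t*, h_t) propto
    q_i N(y_t* | m_i + h_t, v_i^2)."""
    s = np.empty(len(ystar), dtype=int)
    for t, (yt, ht) in enumerate(zip(ystar, h)):
        logw = np.log(q) - 0.5 * np.log(v2) - 0.5 * (yt - m - ht) ** 2 / v2
        w = np.exp(logw - logw.max())       # stabilize before normalizing
        s[t] = rng.choice(7, p=w / w.sum())
    return s
```

As a check, the mixture weights sum to one and the mixture mean Σ q_i m_i reproduces E[log χ²₁] ≈ −1.2704.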

8.11 Gaussian panel data models

For continuous clustered or panel data, a common model formulation is that of Laird and Ware (1982):

y_i | M, β, b_i, σ² ~ N_{n_i}(X_i β + W_i b_i, σ² I_{n_i}), b_i | D ~ N_q(0, D),

D⁻¹ ~ Wishart_q(ρ₀, R₀), β ~ N_k(β₀, B₀), σ² ~ IG(ν₀/2, δ₀/2),

where y_i is an n_i vector of observations and the matrix W_i is a subset of X_i. If W_i is a vector of ones, then the model reduces to a panel model with intercept heterogeneity. If W_i = X_i, then the model becomes the random coefficient panel model.

Zeger and Karim (1991) and Wakefield et al (1994) propose a Gibbs MCMC approach for this model that is based on including {b_i} in the sampling in conjunction with full blocking. This blocking scheme is not very desirable because the random effects {b_i} and the fixed effects β tend to be highly correlated, and treating them as separate blocks creates problems with mixing [Gelfand, Sahu and Carlin (1995)]. To deal with this problem, Chib and Carlin (1999) suggest a number of reduced blocking schemes. One of the simplest proceeds by sampling β and {b_i} in one block by the method of composition: first sampling β marginalized over {b_i} and then sampling {b_i} conditioned on β. What makes reduced blocking possible is the fact that the distribution of y_i marginalized over b_i is also Gaussian:

y_i | M, β, D, σ² ~ N_{n_i}(X_i β, V_i), V_i = σ² I_{n_i} + W_i D W_i'.

The updated distribution of β, marginalized over {b_i}, is therefore easy to derive. The rest of the algorithm follows the steps of Wakefield et al (1994). In particular, the sampling of the random effects is from independent normal distributions that are derived by treating (y_i − X_i β) as the "data," b_i as the regression coefficient and b_i ~ N_q(0, D) as the prior. The sampling of D⁻¹ is from a Wishart distribution and that of σ² from an inverse gamma distribution.

Algorithm 16: Gaussian panel
(1) Sample

β ~ N_k( B_n(B₀⁻¹β₀ + Σ_{i=1}^n X_i' V_i⁻¹ y_i), B_n = (B₀⁻¹ + Σ_{i=1}^n X_i' V_i⁻¹ X_i)⁻¹ ).


(2) Sample

b_i ~ N_q( D_i W_i' σ⁻²(y_i − X_i β), D_i = (D⁻¹ + σ⁻² W_i' W_i)⁻¹ ), i ≤ n.

(3) Sample

D⁻¹ ~ Wishart_q( ρ₀ + n, (R₀⁻¹ + Σ_{i=1}^n b_i b_i')⁻¹ ).

(4) Sample

σ² ~ IG( (ν₀ + Σ_{i=1}^n n_i)/2, (δ₀ + Σ_{i=1}^n ||y_i − X_i β − W_i b_i||²)/2 ).

( 5) Goto 1.
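The key reduced-blocking step (1) of Algorithm 16, the draw of β marginalized over the random effects, can be sketched as follows. This is a minimal illustration (not from the chapter); `draw_beta` is a hypothetical name, and `ys`, `Xs`, `Ws` are lists holding the cluster-level arrays.

```python
import numpy as np

def draw_beta(ys, Xs, Ws, D, sig2, beta0, B0, rng):
    """Draw beta from N_k(B_n(B0^{-1} beta0 + sum X_i' V_i^{-1} y_i), B_n)
    with V_i = sig2 I + W_i D W_i' (random effects integrated out)."""
    prec = np.linalg.inv(B0)             # accumulates B_n^{-1}
    mean_part = prec @ beta0
    for y, X, W in zip(ys, Xs, Ws):
        Vi = sig2 * np.eye(len(y)) + W @ D @ W.T
        XtVi = X.T @ np.linalg.inv(Vi)
        prec += XtVi @ X
        mean_part += XtVi @ y
    Bn = np.linalg.inv(prec)
    return rng.multivariate_normal(Bn @ mean_part, Bn)
```

With a tiny random-effects variance and a small σ², the draw collapses onto the least squares value, which gives a quick check.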

8.12 Multivariate binary data models

To model correlated binary data, a canonical model is the multivariate probit (MVP). Let y_ij denote the binary response on the ith observation unit and jth variable, and let y_i = (y_{i1},…, y_{iJ}), 1 ≤ i ≤ n, denote the collection of responses on all J variables. Then, under the MVP model the marginal probability that y_ij = 1 is

Pr(y_ij = 1 | M, β) = Φ(x_ij'β_j),

and the joint probability that Y_i = y_i, conditioned on the parameters (β, Σ), is

Pr(Y_i = y_i | M, β, Σ) = ∫_{A_iJ} ⋯ ∫_{A_i1} φ_J(t | 0, Σ) dt,

where, as in the SUR model, β' = (β₁',…, β_J') ∈ ℝ^k, k = Σ k_j, but unlike the SUR model, the J × J matrix Σ = {σ_jk} is in correlation form (with units on the diagonal), and A_ij is the interval

A_ij = (−∞, x_ij'β_j) if y_ij = 1, [x_ij'β_j, ∞) if y_ij = 0.

To simplify the MCMC implementation for this model, Chib and Greenberg (1998) follow the general approach of Albert and Chib (1993a) and employ latent variables. Let

z_i ~ N_J(X_i β, Σ),

with the observed data given by the sign of z_ij:

y_ij = I(z_ij > 0), j = 1,…, J,

where I(A) is the indicator function of the event A. If we let σ = (σ₂₁, σ₃₁, σ₃₂,…, σ_{J,J−1}) denote the J(J − 1)/2 distinct elements of Σ, and let z = (z₁,…, z_n) denote the latent


values corresponding to the observed data Y = {y_i}_{i=1}^n, then the algorithm proceeds with the sampling of the augmented posterior density

π(β, σ, z | y, M) ∝ p(β) p(σ) f(z | M, β, Σ) Pr(y | z, β, Σ)

∝ p(β) p(σ) Π_{i=1}^n { φ_J(z_i | X_i β, Σ) Pr(y_i | z_i, β, Σ) }, β ∈ ℝ^k, σ ∈ C,

where

Pr(y_i | z_i, β, Σ) = Π_{j=1}^J { I(z_ij > 0) I(y_ij = 1) + I(z_ij ≤ 0) I(y_ij = 0) },

p(σ) is a normal density truncated to the region C, and C is the set of values of σ that produce a positive definite correlation matrix.

Conditioned on {z_i} and Σ, the update for β is as in the SUR model, while conditioned on (β, Σ), the z_ij can be sampled one at a time, conditioned on the other latent values, from truncated normal distributions, where the region of truncation is either (0, ∞) or (−∞, 0) depending on whether the corresponding y_ij is one or zero. The key step in the algorithm is the sampling of σ, the unrestricted elements of Σ, from the full conditional density π(σ | M, z, β) ∝ p(σ) Π_{i=1}^n φ_J(z_i | X_i β, Σ). This density, which is truncated to the complicated region C, is sampled by an M-H step with tailored proposal density q(σ | M, z, β) = f_T(σ | m, V, ξ), where

m = arg max_σ Σ_{i=1}^n ln φ_J(z_i | X_i β, Σ),

V = { −∂² Σ_{i=1}^n ln φ_J(z_i | X_i β, Σ) / ∂σ ∂σ' }⁻¹

are the mode and curvature of the target distribution, given the current values of the conditioning variables. Note that, as in Algorithm 12, no truncation is enforced on the proposal density.

Algorithm 17: Multivariate probit
(1) Sample for i ≤ n, j ≤ J

z_ij ~ TN_{(0,∞)}(μ_ij, v_ij) if y_ij = 1, TN_{(−∞,0)}(μ_ij, v_ij) if y_ij = 0,

μ_ij = E(z_ij | z_{i(−j)}, β, Σ),

v_ij = Var(z_ij | z_{i(−j)}, β, Σ).


(2) Sample

β ~ N_k( B_n(B₀⁻¹β₀ + Σ_{i=1}^n X_i' Σ⁻¹ z_i), B_n = (B₀⁻¹ + Σ_{i=1}^n X_i' Σ⁻¹ X_i)⁻¹ ).

(3) M-H
(a) Calculate the parameters (m, V).
(b) Propose

σ' ~ f_T(σ | m, V, ξ).

(c) Calculate

α = min{ 1, [p(σ') Π_{i=1}^n φ_J(z_i | X_i β, Σ') f_T(σ | m, V, ξ)] / [p(σ) Π_{i=1}^n φ_J(z_i | X_i β, Σ) f_T(σ' | m, V, ξ)] } I[σ' ∈ C].

(d) Move to σ' with probability α.
(4) Goto 1.

As an application of this algorithm, consider a data set in which the multivariate binary responses are generated by a panel structure. The data are concerned with the health effects of pollution on 537 children in Steubenville, Ohio, each observed at ages 7, 8, 9 and 10 years, and the response variable is an indicator of wheezing status [Diggle, Liang and Zeger (1995)]. Suppose that the marginal probability of wheeze status of the ith child at the jth time point is specified as

Pr(y_ij = 1 | β) = Φ(β₀ + β₁ x_{1ij} + β₂ x_{2ij} + β₃ x_{3ij}), i ≤ 537, j ≤ 4,

where β is constant across categories, x₁ is the age of the child centered at nine years, x₂ is a binary indicator variable representing the mother's smoking habit during the first year of the study, and x₃ = x₁x₂. Suppose that the Gaussian prior on β = (β₀, β₁, β₂, β₃) is centered at zero with a variance of 10 I_k, and let p(σ) be the density of a normal distribution, with mean zero and variance I₆, restricted to the region that leads to a positive-definite correlation matrix, where σ = (σ₂₁, σ₃₁, σ₃₂, σ₄₁, σ₄₂, σ₄₃). From 10000 cycles of Algorithm 17 one obtains the following covariate effects and posterior distributions of the correlations.

Notice that the summary tabular output in Table 2 contains not only the posterior means and standard deviations of the parameters but also the 95% credibility intervals, all computed from the sampled draws. It may be seen from Figure 6 that the posterior distributions of the correlations are similar, suggesting that an equicorrelated correlation structure might be appropriate for these data. This issue is considered more formally in Section 10.2 below.


Table 2
Covariate effects in the Ohio wheeze data: MVP model with unrestricted correlations¹

β_j     Prior                 Posterior²
        Mean     Std dev      Mean     NSE      Std dev   Lower     Upper
β₁      0.000    3.162       −1.108   0.001    0.062    −1.231    −0.985
β₂      0.000    3.162       −0.077   0.001    0.030    −0.136    −0.017
β₃      0.000    3.162        0.155   0.002    0.101    −0.043     0.352
β₄      0.000    3.162        0.036   0.001    0.049    −0.058     0.131

¹ The results are based on 10000 draws from Algorithm 17.
² NSE denotes the numerical standard error; lower is the 2.5th percentile and upper is the 97.5th percentile of the simulated draws.


Fig 6 Posterior boxplots of the correlations in the Ohio wheeze data: MVP model.

9 Sampling the predictive density

A fundamental goal of any statistical analysis is to predict a set of future or unobserved observations y_f given the current data y and the assumed model M. In the Bayesian context this problem is solved by the calculation of the Bayes prediction density, which is defined as the distribution of y_f conditioned on (y, M) but marginalized over the parameters θ. More formally, the predictive density is defined as

f(y_f | y, M) = ∫ f(y_f | y, M, θ) π(θ | y, M) dθ, (32)


where f(y_f | y, M, θ) is the conditional density of y_f given (y, M, θ) and the marginalization is with respect to the posterior density π(θ | y, M) of θ. In general, the predictive density is not available in closed form. However, in the context of MCMC problems that deliver a sample of (correlated) draws

θ^(1),…, θ^(M) ~ π(θ | y, M),

this is hardly a problem. One can utilize the posterior draws in conjunction with the method of composition to produce a sample of draws from the predictive density. This is done by appending a step at the end of the MCMC iterations where, for each value θ^(j), one simulates

y_f^(j) ~ f(y_f | y, M, θ^(j)), j ≤ M, (33)

from the density of the observations, conditioned on θ^(j). The collection of simulated values {y_f^(1),…, y_f^(M)} is a sample from the Bayes prediction density f(y_f | y, M). The simulated sample can be summarized in the usual way by the computation of sample averages and quantiles. Thus, to sample the prediction density one simply has to simulate the data generating process for each simulated value of the parameters.
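The method of composition in Equation (33) can be sketched with a toy example, not from the chapter: y ~ N(μ, 1), where the posterior draws of μ are faked as N(2, 0.1²) draws standing in for MCMC output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MCMC output: posterior draws of mu (assumed here)
mu_draws = rng.normal(2.0, 0.1, size=10_000)

# Method of composition: for each posterior draw, simulate y_f from
# the data density given that draw.
yf = rng.normal(mu_draws, 1.0)

# {yf} is a sample from f(y_f | y); its variance blends the data
# variance (1) with the posterior variance (0.1^2).
yf_mean, yf_var = yf.mean(), yf.var()
```

The predictive spread exceeds the data spread by exactly the posterior uncertainty, which the sample moments confirm.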

In some problems that have a latent data structure, a modified procedure to sample the predictive density may be necessary. Suppose that z_f denotes the latent data in the prediction period and z denotes the latent data in the sample period. Let ψ = (θ, z) and suppose that the MCMC sampler produces the draws

ψ^(1),…, ψ^(M) ~ π(ψ | y, M).

In this situation, the predictive density can be expressed as

f(y_f | y, M) = ∫∫ f(y_f | y, M, z_f, ψ) π(z_f | y, M, ψ) π(ψ | y, M) dz_f dψ, (34)

which may again be sampled by the method of composition, where for each value ψ^(j) ~ π(ψ | y, M) one simulates

z_f^(j) ~ π(z_f | y, M, ψ^(j)); y_f^(j) ~ f(y_f | y, M, z_f^(j), ψ^(j)).

The simulated values of y_f from this two-step process are again from the predictive density.


To illustrate the one-step procedure, suppose that one is interested in predicting y_f = (y_{n+1}, y_{n+2}) from a regression model with autoregressive errors of order two, where

y_t | Y_{t−1}, M, β, φ, σ² ~ N( φ₁y_{t−1} + φ₂y_{t−2} + (x_t − φ₁x_{t−1} − φ₂x_{t−2})'β, σ² ).

Then, for each draw (β^(j), φ^(j), σ²^(j)) from Algorithm 12, one simulates y_f by sampling

y_{n+1}^(j) ~ N( φ₁^(j)y_n + φ₂^(j)y_{n−1} + (x_{n+1} − φ₁^(j)x_n − φ₂^(j)x_{n−1})'β^(j), σ²^(j) )

and

y_{n+2}^(j) ~ N( φ₁^(j)y_{n+1}^(j) + φ₂^(j)y_n + (x_{n+2} − φ₁^(j)x_{n+1} − φ₂^(j)x_n)'β^(j), σ²^(j) ).

The sample of simulated values {y_{n+1}^(j), y_{n+2}^(j)} from repeating this process is a sample from the (joint) predictive density.

As an example of the two-step procedure, consider a specific hidden Markov model in which

y_t | Y_{t−1}, M, β₀, γ, σ² ~ N( β₀ + γ s_t, σ² ),

where s_t ∈ {0, 1} is an unobserved state variable that follows a two-state Markov process with unknown transition probabilities

P = ( p₀₀ p₀₁
      p₁₀ p₁₁ ).

In this case, suppose that Algorithm 13 has been used to deliver draws on ψ = (β₀, γ, σ², p₁₁, p₂₂, S_n). As described by Albert and Chib (1993b), to predict y_{n+1} we take each draw of ψ^(j) and sample

s_{n+1}^(j) ~ p(s_{n+1} | s_n^(j), p₁₁^(j), p₂₂^(j))

from the Markov chain (this is just a two-point discrete distribution), and then sample

y_{n+1}^(j) ~ N( β₀^(j) + γ^(j) s_{n+1}^(j), σ²^(j) ).

The next value y_{n+2}^(j) is drawn in the same way after s_{n+2}^(j) is simulated from the Markov chain p(s_{n+2} | s_{n+1}^(j), p₁₁^(j), p₂₂^(j)). These two steps can be iterated for any number of periods into the future, and the whole process repeated for each simulated value of ψ.


10 MCMC methods in model choice problems

10.1 Background

Consider the situation in which there are K possible models M₁,…, M_K for the observed data, defined by the sampling densities {f(y | θ_k, M_k)} and proper prior densities {p(θ_k | M_k)}, and the objective is to find the evidence in the data for the different models. In the Bayesian approach this question is answered by placing prior probabilities Pr(M_k) on each of the K models and using the Bayes calculus to find the posterior probabilities {Pr(M₁ | y),…, Pr(M_K | y)}, conditioned on the data but marginalized over the unknowns θ_k. Specifically, the posterior probability of M_k is given by the expression

Pr(M_k | y) = Pr(M_k) m(y | M_k) / Σ_{l=1}^K Pr(M_l) m(y | M_l)

∝ Pr(M_k) m(y | M_k), (k ≤ K),

where

m(y | M_k) = ∫ f(y | θ_k, M_k) p(θ_k | M_k) dθ_k, (35)

is the marginal density of the data and is called the marginal likelihood of M_k. In words, the posterior probability of M_k is proportional to the prior probability of M_k times the marginal likelihood of M_k. The evidence provided by the data about the models under consideration is summarized by the posterior probability of each model.

Often the posterior probabilities are summarized in terms of the posterior odds

Pr(M_i | y) / Pr(M_j | y) = [Pr(M_i) / Pr(M_j)] × [m(y | M_i) / m(y | M_j)],

which provides the relative support for the two models. The ratio of marginal likelihoods in this expression is the Bayes factor of M_i vs Mj.

If interest centers on the prediction of observables, then it is possible to mix over the alternative predictive densities by utilizing the posterior probabilities as weights. More formally, the prediction density of a set of observations y_f, marginalized over both {θ_k} and {M_k}, is given by

f(y_f | y) = Σ_{k=1}^K Pr(M_k | y) f(y_f | y, M_k),

where f(y_f | y, M_k) is the prediction density in Equation (34).


10.2 Marginal likelihood computation

A central problem in estimating the marginal likelihood is that it is an integral of the sampling density over the prior distribution of θ_k. Thus, MCMC methods, which deliver sample values from the posterior density, cannot be used to directly average the sampling density, because that estimate would converge to

∫ f(y | θ_k, M_k) p(θ_k | y, M_k) dθ_k,

which is not the marginal likelihood. In addition, taking draws from the prior density to do the averaging produces an estimate that is simulation-consistent but highly inefficient, because draws from the prior density are not likely to be in high density regions of the sampling density f(y | θ_k, M_k). A natural way to correct this problem is by the method of importance sampling. If we let h(θ_k | M_k) denote a suitable importance sampling function, then the marginal likelihood can be estimated as

m̂(y | M_k) = M⁻¹ Σ_{j=1}^M f(y | θ_k^(j), M_k) p(θ_k^(j) | M_k) / h(θ_k^(j) | M_k),

θ_k^(j) ~ h(θ_k | M_k).
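The importance sampling estimator can be sketched on a toy model where the marginal likelihood is known in closed form, so the estimate can be checked. This example is not from the chapter; the model is y_i ~ N(θ, 1) with prior θ ~ N(0, 1), for which y ~ N_n(0, I + 11') marginally, and the importance function is an arbitrary (but adequate) normal.

```python
import numpy as np

rng = np.random.default_rng(1)

y = np.array([0.3, -0.1, 0.5])
n = len(y)

# Importance function h: normal centered at ybar (an assumption of
# this sketch, not a recommendation from the text)
mu_h, sd_h = y.mean(), 1.0
draws = rng.normal(mu_h, sd_h, size=200_000)

log_lik = (-0.5 * n * np.log(2 * np.pi)
           - 0.5 * ((y[None, :] - draws[:, None]) ** 2).sum(axis=1))
log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * draws ** 2
log_h = (-0.5 * np.log(2 * np.pi * sd_h ** 2)
         - 0.5 * (draws - mu_h) ** 2 / sd_h ** 2)
m_hat = np.exp(log_lik + log_prior - log_h).mean()   # IS estimate

# Exact marginal likelihood: y ~ N_n(0, I + 11')
S = np.eye(n) + np.ones((n, n))
m_exact = np.exp(-0.5 * (n * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]
                         + y @ np.linalg.solve(S, y)))
```

Because the importance ratio is bounded here, the estimator is stable and converges quickly to the exact value.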

This method is useful when it can be shown that the ratio is bounded, which can be difficult to check in practice, and when the sampling density is not expensive to compute, which, unfortunately, is often not true. We mention that if the importance sampling function is taken to be the unnormalized posterior density, then that leads to

m̂_NR(y | M_k) = { M⁻¹ Σ_{j=1}^M f(y | θ_k^(j), M_k)⁻¹ }⁻¹, θ_k^(j) ~ π(θ_k | y, M_k),

the harmonic mean of the likelihood values. This estimate, proposed by Newton and Raftery (1994), can be unstable because the inverse likelihood does not have finite variance. Gelfand and Dey (1994) propose a modified stable estimator

m̂_GD(y | M_k) = { M⁻¹ Σ_{j=1}^M h(θ_k^(j)) / [f(y | θ_k^(j), M_k) p(θ_k^(j) | M_k)] }⁻¹,

where h(θ) is a density with tails thinner than the product of the prior and the likelihood. Unfortunately, this estimator is difficult to apply in models with latent or missing data.

The Laplace method for integrals can be used to provide a non-simulation based estimate of the marginal likelihood. Let d_k denote the dimension of θ_k and let θ̂_k


denote the posterior mode of θ_k, and Σ_k the inverse of the negative Hessian of ln{f(y | θ_k, M_k) p(θ_k | M_k)} evaluated at θ̂_k. Then the Laplace estimate of the marginal likelihood is given by

log m̂_L(y | M_k) = (d_k/2) log(2π) + (1/2) log det(Σ_k) + log f(y | θ̂_k, M_k) + log p(θ̂_k | M_k).

The Laplace estimate has a large sample justification and can be shown to equal the true value up to an error that goes to zero in probability at the rate n⁻¹.
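The Laplace formula can be sketched on the same conjugate toy model used above (y_i ~ N(θ, 1), θ ~ N(0, 1)), where the posterior kernel is exactly Gaussian, so the Laplace estimate equals the true marginal likelihood. This example is not from the chapter; the Hessian is obtained by a central difference.

```python
import numpy as np

y = np.array([0.3, -0.1, 0.5])
n = len(y)

def log_post_kernel(theta):
    # log f(y | theta) + log p(theta), both fully normalized
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((y - theta) ** 2)
            - 0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2)

theta_hat = y.sum() / (n + 1)          # posterior mode (known here)
eps = 1e-5
hess = (log_post_kernel(theta_hat + eps) - 2 * log_post_kernel(theta_hat)
        + log_post_kernel(theta_hat - eps)) / eps ** 2
Sigma = -1.0 / hess                    # inverse of negative Hessian
log_m = (0.5 * np.log(2 * np.pi) + 0.5 * np.log(Sigma)
         + log_post_kernel(theta_hat))
```

Comparing against the closed-form log marginal likelihood of y ~ N_n(0, I + 11') confirms the formula (exactly so here, since the posterior is Gaussian).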

Both the importance sampling method and the Laplace estimate may be considered the traditional methods for computing the marginal likelihood. More recent methods exploit two additional facts about the marginal likelihood. The first is that the marginal likelihood is the normalizing constant of the posterior density, and therefore under this view the Bayes factor can be interpreted as the ratio of two normalizing constants. There is a large literature in physics (in a quite different context, however) on precisely the latter problem, stemming from Bennett (1976). This literature was adapted in the mid 1990's for statistical problems by Meng and Wong (1996), utilizing the bridge sampling method, and by Chen and Shao (1997), based on umbrella sampling. The techniques presented in these papers, although based on the work in physics, contain modifications of the ideas to handle problems such as models with differing dimensions. DiCiccio, Kass, Raftery and Wasserman (1997) present a comparative analysis of the bridge sampling method in relation to other competing methods of computing the marginal likelihood. At this time, however, the bridge sampling method and its refinements have not found significant use in applications, perhaps because the methods are quite involved and because simpler methods are available.

Another approach that deals with the estimation of Bayes factors, again in the context of nested models, is due to Verdinelli and Wasserman (1995) and is called the Savage-Dickey density ratio method. Suppose a model is defined by a parameter θ = (ω, ψ), the first model M_1 is defined by the restriction ω = ω_0, and the second model M_2 by letting ω be unrestricted. Then, it can be shown that the Bayes factor is given by

B_12 = [π(ω_0 | y, M_2) / π(ω_0 | M_2)] × E{ p(ψ | M_1) / p(ψ | M_2, ω_0) },

where the expectation is with respect to π(ψ | y, M_2, ω_0). If π(ω | y, M_2, ψ) is available in closed form, then π(ω_0 | y, M_2) can be estimated by the Rao-Blackwell method, and the second expectation by taking draws from the posterior π(ψ | y, M_2, ω_0), which can be obtained by the method of reduced runs discussed below, and averaging the ratio of prior densities. This method provides a simple approach for nested models, but it is not efficient if the dimensions of the two models are substantially different, because then the ordinate π(ω_0 | y, M_2) tends to be small and the simulated values used to average the ratio tend to lie in low-density regions.
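The mechanics can be seen in the special case with no nuisance block ψ, where the expectation term drops out and the Bayes factor reduces to the posterior-to-prior ordinate ratio at ω_0. The sketch below (a hypothetical normal-mean test, not the chapter's example) verifies the identity against the directly computed marginal likelihood ratio.

```python
import math
import random

random.seed(5)

# Assumed nested pair: M1 fixes omega = 0 in y_i ~ N(omega, 1);
# M2 lets omega be free with prior omega ~ N(0, tau2).  No nuisance psi.
n, tau2 = 10, 4.0
y = [random.gauss(0.3, 1.0) for _ in range(n)]
ybar = sum(y) / n

def log_norm(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

# Posterior of omega under M2 is N(m, v) by the usual conjugate update
v = 1.0 / (n + 1.0 / tau2)
m = v * n * ybar

# Savage-Dickey: B_12 = posterior ordinate / prior ordinate at omega_0 = 0
log_b12_sd = log_norm(0.0, m, v) - log_norm(0.0, 0.0, tau2)

# Direct check: B_12 = m(y | M1) / m(y | M2), with m(y | M2) obtained from
# y ~ N(0, I + tau2 11') via Sherman-Morrison
log_m1 = sum(log_norm(yi, 0.0, 1.0) for yi in y)
S1 = sum(y)
S2 = sum(yi ** 2 for yi in y)
quad = S2 - tau2 * S1 ** 2 / (1 + n * tau2)
log_m2 = (-0.5 * n * math.log(2 * math.pi)
          - 0.5 * math.log(1 + n * tau2) - 0.5 * quad)
log_b12_direct = log_m1 - log_m2
```

With a genuine nuisance block, the extra expectation would be estimated by averaging p(ψ | M_1)/p(ψ | M_2, ω_0) over reduced-run draws of ψ.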


The second fact about marginal likelihoods, highlighted in a paper by Chib (1995), is that the marginal likelihood, by virtue of being the normalizing constant of the posterior density, can be expressed as

m(y | M_k) = f(y | θ_k, M_k) p(θ_k | M_k) / π(θ_k | y, M_k).   (36)

This expression is an identity in θ_k because the left-hand side is free of θ_k. Chib (1995) refers to it as the basic marginal likelihood identity (BMI). Based on this expression, an estimate of the marginal likelihood on the log scale is given by

log m̂(y | M_k) = log f(y | θ*_k, M_k) + log p(θ*_k | M_k) − log π̂(θ*_k | y, M_k),   (37)

where θ*_k denotes an arbitrarily chosen point and π̂(θ*_k | y, M_k) is the estimate of the posterior density at that single point. Two points should be noted. First, this estimate requires only one evaluation of the likelihood function, which is particularly useful in situations where repeated evaluation of the likelihood function is computationally expensive. Second, to increase the computational efficiency, the point θ*_k should be taken to be a high-density point under the posterior.

To estimate the posterior ordinate one utilizes the MCMC output in conjunction with a marginal/conditional decomposition. To simplify notation, drop the model subscript k and suppose that the parameter vector is blocked into B blocks as θ_1, ..., θ_B. In addition, let z denote additional variables (latent or missing data) that may be included in the simulation to clarify the structure of the full conditional distributions. Also let ψ_i = (θ_1, ..., θ_i) and ψ^i = (θ_i, ..., θ_B) denote the list of blocks up to i and the set of blocks from i to B, respectively. Now write the posterior ordinate at the point θ* by the law of total probability as

π(θ* | y, M) = π(θ*_1 | y, M) × ... × π(θ*_i | y, M, ψ*_{i−1}) × ... × π(θ*_B | y, M, ψ*_{B−1}),   (38)

where the first term in this expression is the marginal density of θ_1 evaluated at θ*_1, and the typical term is of the form

π(θ*_i | y, M, ψ*_{i−1}) = ∫ π(θ*_i | y, M, ψ*_{i−1}, ψ^{i+1}, z) π(ψ^{i+1}, z | y, M, ψ*_{i−1}) d(ψ^{i+1}, z).

This may be called a reduced conditional ordinate. It is important to bear in mind that in finding the reduced conditional ordinate one must integrate only over (ψ^{i+1}, z) and that the integrating measure is conditioned on ψ*_{i−1}.

Assume that the normalizing constant of each full conditional density is known, an assumption that is relaxed below. Then, the first term of Equation (38) can be estimated by the Rao-Blackwell method. To estimate the typical reduced conditional ordinate, Chib (1995) defines a reduced MCMC run consisting of the full conditional distributions

{π(θ_i | y, M, ψ*_{i−1}, ψ^{i+1}, z); ...; π(θ_B | y, M, ψ*_{i−1}, θ_i, ..., θ_{B−1}, z); π(z | y, M, ψ*_{i−1}, ψ^i)},   (39)

where the blocks in ψ_{i−1} are set equal to ψ*_{i−1}. By MCMC theory, the draws on (ψ^{i+1}, z) from this run are from the distribution π(ψ^{i+1}, z | y, M, ψ*_{i−1}), and so the reduced conditional ordinate can be estimated as the average

π̂(θ*_i | y, M, ψ*_{i−1}) = M^{-1} Σ_{j=1}^{M} π(θ*_i | y, M, ψ*_{i−1}, ψ^{i+1,(j)}, z^{(j)})

over the simulated values of ψ^{i+1} and z from the reduced run. Each subsequent reduced conditional ordinate that appears in the decomposition (38) can be estimated in the same way though, conveniently, with fewer and fewer distributions appearing in the reduced runs. Given the marginal and reduced conditional ordinates, the Chib estimate of the marginal likelihood on the log scale is defined as

log m̂(y | M) = log f(y | θ*, M) + log p(θ* | M) − Σ_{i=1}^{B} log π̂(θ*_i | y, M, ψ*_{i−1}),   (40)

where f(y | θ*, M) is the density of the data marginalized over the latent data z.

It is worth noting that an alternative approach to estimating the posterior ordinate is developed by Ritter and Tanner (1992) in the context of Gibbs MCMC chains with fully known full conditional distributions. If one lets

K_G(θ, θ* | y, M) = Π_{i=1}^{B} π(θ*_i | y, M, θ*_1, ..., θ*_{i−1}, θ_{i+1}, ..., θ_B)

denote the Gibbs transition kernel, then by virtue of the fact that the Gibbs chain satisfies the invariance condition π(θ* | y, M) = ∫ K_G(θ, θ* | y, M) π(θ | y, M) dθ, one can obtain the posterior ordinate by averaging the transition kernel over draws from the posterior distribution:

π̂(θ* | y, M) = M^{-1} Σ_{g=1}^{M} K_G(θ^{(g)}, θ* | y, M).

This estimate requires only draws from the full Gibbs run, but when θ is high-dimensional and the model contains latent variables, it is less accurate than Chib's posterior density decomposition method.
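The decomposition (38)-(40) can be illustrated with a minimal two-block example (B = 2), coded in Python below under assumed settings (a conjugate normal model with unknown mean and variance, not an example from the chapter). With two blocks, the first ordinate π(μ* | y) is obtained by Rao-Blackwellization over the full-run draws of σ², and the second ordinate π(σ²* | y, μ*) is available in closed form, so no reduced run is needed.

```python
import math
import random

random.seed(7)

# Assumed data and priors: y_i ~ N(mu, sig2), mu ~ N(mu0, tau0sq),
# sig2 ~ InvGamma(a0, b0) (shape/scale parameterization)
n = 20
y = [random.gauss(0.0, 1.0) for _ in range(n)]
mu0, tau0sq, a0, b0 = 0.0, 10.0, 3.0, 3.0
ybar = sum(y) / n

def log_norm(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

def log_invgamma(x, a, b):
    return a * math.log(b) - math.lgamma(a) - (a + 1) * math.log(x) - b / x

# Two-block Gibbs sampler for (mu, sig2)
M, burn = 5000, 500
mu, sig2 = ybar, 1.0
draws = []
for it in range(M + burn):
    v = 1.0 / (n / sig2 + 1.0 / tau0sq)
    m = v * (n * ybar / sig2 + mu0 / tau0sq)
    mu = random.gauss(m, math.sqrt(v))
    a = a0 + n / 2
    b = b0 + sum((yi - mu) ** 2 for yi in y) / 2
    sig2 = 1.0 / random.gammavariate(a, 1.0 / b)   # InvGamma(a, b) draw
    if it >= burn:
        draws.append((mu, sig2))

# theta* taken as a high-density point (here the posterior means)
mu_star = sum(d[0] for d in draws) / M
sig2_star = sum(d[1] for d in draws) / M

# Ordinate 1: pi(mu* | y) by Rao-Blackwell averaging over the full run
ords = []
for _, s2 in draws:
    v = 1.0 / (n / s2 + 1.0 / tau0sq)
    m = v * (n * ybar / s2 + mu0 / tau0sq)
    ords.append(math.exp(log_norm(mu_star, m, v)))
log_ord1 = math.log(sum(ords) / M)

# Ordinate 2: pi(sig2* | y, mu*) is a known inverse-gamma density
b_star = b0 + sum((yi - mu_star) ** 2 for yi in y) / 2
log_ord2 = log_invgamma(sig2_star, a0 + n / 2, b_star)

# BMI estimate, Equation (40): one likelihood evaluation at theta*
loglik = sum(log_norm(yi, mu_star, sig2_star) for yi in y)
logprior = log_norm(mu_star, mu0, tau0sq) + log_invgamma(sig2_star, a0, b0)
log_marglik = loglik + logprior - log_ord1 - log_ord2
```

With more blocks, each further ordinate would be estimated from its own reduced run, as in Equation (39).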


It should be observed that the above methods of estimating the posterior ordinate require knowledge of the normalizing constants of each full conditional density. What can be done when this condition does not hold? DiCiccio, Kass, Raftery and Wasserman (1997) and Chib and Greenberg (1998) suggest the use of kernel smoothing in this case. Suppose, for example, that the problem occurs in the distribution of the ith block. Then, the draws on θ_i from the reduced MCMC run in Equation (39) can be smoothed by kernel methods to find the ordinate at θ*_i. This approach should only be used when the dimension of the recalcitrant block is not large. A more general technique has recently been developed by Chib and Jeliazkov (2001). The first main result of the paper is that if sampling is done in one block by the M-H algorithm, then the posterior ordinate can be written as

π(θ* | y, M) = E_1{α(θ, θ* | y, M) q(θ, θ* | y, M)} / E_2{α(θ*, θ | y, M)},

where the numerator expectation E_1 is with respect to the distribution π(θ | y, M), the denominator expectation E_2 is with respect to the proposal density of θ conditioned on θ*, q(θ*, θ | y, M), and α(θ, θ* | y, M) is the probability of move in the M-H step. This expression implies that a simulation-consistent estimate of the posterior ordinate can be defined as

π̂(θ* | y, M) = [M^{-1} Σ_{g=1}^{M} α(θ^{(g)}, θ* | y, M) q(θ^{(g)}, θ* | y, M)] / [J^{-1} Σ_{j=1}^{J} α(θ*, θ^{(j)} | y, M)],   (41)

where {θ^{(g)}} are the given draws from the posterior distribution, while the draws {θ^{(j)}} in the denominator are from q(θ*, θ | y, M), given the fixed value θ*. The second main result of the paper is that in the context of the multiple-block M-H algorithm the reduced conditional ordinate can be expressed as

π(θ*_i | y, M, ψ*_{i−1})
= E_1{α(θ_i, θ*_i | y, M, ψ*_{i−1}, ψ^{i+1}) q_i(θ_i, θ*_i | y, M, ψ*_{i−1}, ψ^{i+1})} / E_2{α(θ*_i, θ_i | y, M, ψ*_{i−1}, ψ^{i+1})},   (42)

where E_1 is the expectation with respect to π(θ_i, ψ^{i+1} | y, M, ψ*_{i−1}) and E_2 that with respect to the product measure π(ψ^{i+1} | y, M, ψ*_i) q_i(θ*_i, θ_i | y, M, ψ*_{i−1}, ψ^{i+1}). The quantity α(θ_i, θ*_i | y, M, ψ*_{i−1}, ψ^{i+1}) is the usual conditional M-H probability of move. The two expectations can be estimated from the output of the reduced runs in an obvious way. An example of this technique in action is provided next.

Consider the data set that was introduced in Section 8 in connection with the multivariate probit model. In this setting, the full conditional density of the correlations is not in tractable form. Assume as before that the marginal probability of wheeze is given by

Pr(y_ij = 1 | M_k, β) = Φ(β_0 + β_1 x_1ij + β_2 x_2ij + β_3 x_3ij),   i ≤ 537, j ≤ 4,

where, as before, the dependence of β on the model is suppressed for convenience, x_1 is the age of the child centered at nine years, x_2 is a binary indicator variable representing the mother's smoking habit during the first year of the study, and x_3 = x_1 x_2. Now suppose that interest centers on three alternative models generated by three alternative correlation matrices. Let these models be defined as:
* M_1: unrestricted Σ except for the unit constraints on the diagonal. In this case σ consists of six unknown elements.
* M_2: equicorrelated Σ, where the correlations are all equal and described by a single parameter ρ.
* M_3: Toeplitz Σ, wherein the correlations depend on a single parameter φ under the restriction that Corr(Z_ik, Z_il) = φ^{|k−l|}.

Assume that the prior on β = (β_0, β_1, β_2, β_3) is independent Gaussian with a mean of zero and a variance of ten. Also let the prior on the correlations σ be normal with mean of zero and covariance equal to the identity matrix (truncated to the region C), and that on ρ and φ be normal truncated to the interval (−1, 1).

For each model, 10 000 iterations of Algorithm 17 are used to obtain the posterior sample, and the posterior ordinate, using M_1 for illustration, is computed as

π̂(σ*, β* | y, M_1) = π̂(σ* | y, M_1) π̂(β* | y, M_1, σ*).

To estimate the marginal ordinate one can apply Equation (42), leading to the estimate

π̂(σ* | y, M_1) = [M^{-1} Σ_{g=1}^{M} α(σ^{(g)}, σ* | y, β^{(g)}, {z^{(g)}}) q(σ^{(g)}, σ* | y, β^{(g)}, {z^{(g)}})] / [J^{-1} Σ_{j=1}^{J} α(σ*, σ^{(j)} | y, β^{(j)}, {z^{(j)}})],   (43)

where α is the probability of move defined in Algorithm 17, {β^{(g)}, {z^{(g)}}, σ^{(g)}} are values drawn from the full MCMC run, and the values {β^{(j)}, {z^{(j)}}, σ^{(j)}} in the denominator are from a reduced run consisting of the densities

π(β | y, M_1, {z_i}, σ*);   π({z_i} | y, M_1, β, σ*),   (44)

after σ is fixed at σ*. In particular, the draws for the denominator are from the distributions

(β^{(j)}, z^{(j)}) ~ π(β, z | y, M_1, σ*),
σ^{(j)} ~ q(σ*, σ | y, M_1, β^{(j)}, z^{(j)}),   j ≤ J.

The sampled variates {β^{(j)}, z^{(j)}} from this reduced run are also used to estimate the second ordinate as

π̂(β* | y, M_1, σ*) = M^{-1} Σ_{j=1}^{M} φ(β* | β̂^{(j)}, B_n),   (45)

where β̂^{(j)} = B_n(B_0^{-1} β_0 + Σ_{i=1}^{n} X_i' Σ*^{-1} z_i^{(j)}) and B_n = (B_0^{-1} + Σ_{i=1}^{n} X_i' Σ*^{-1} X_i)^{-1}. It should be noted that estimates of both ordinates are available at the conclusion of the single reduced run.


Table 3
Log-likelihood and log marginal likelihood by the Chib method for three models fit to the Ohio wheeze data¹

                       M_1          M_2          M_3
ln f(y | M, θ*)    −795.1869    −798.5567    −804.4102
ln m(y | M)        −823.9188    −818.009     −824.0001

¹ M_1, MVP with unrestricted correlations; M_2, MVP with an equicorrelated correlation matrix; M_3, MVP with a Toeplitz correlation structure.

The marginal likelihood computation is completed by evaluating the likelihood function at the point (σ*, β*) by the Geweke-Hajivassiliou-Keane method. The resulting marginal likelihoods of the three alternative models are reported in Table 3. On the basis of these marginal likelihoods we conclude that the data tend to support the MVP model with equicorrelated correlations.

10.3 Model space-parameter space MCMC algorithms

When one is presented with a large collection of candidate models {M_1, ..., M_K}, each with parameters θ_k ∈ B_k ⊆ R^{d_k}, direct fitting of each model to find the marginal likelihood can be computationally expensive. In such cases it may be more fruitful to utilize model space-parameter space MCMC algorithms that eschew direct fitting of each model in favor of a simulation of a "mega model" in which a model index random variable, denoted M and taking values on the integers from 1 to K, is sampled in tandem with the parameters. The posterior distribution of M is then computed as the frequency with which each model is visited.

In this section we discuss two general model space-parameter space algorithms that have been proposed in the literature: the algorithm of Carlin and Chib (1995) and the reversible jump method of Green (1995).

To explain the Carlin and Chib (1995) algorithm, write θ = {θ_1, ..., θ_K} and assume that each model is defined by the likelihood f(y | θ_k, M = k) and (proper) prior p(θ_k | M = k). Note that the models are non-nested. Now by the law of total probability the joint distribution of the data, the parameters and the model index is given by

f(y, θ, M = k) = f(y | θ_k, M = k) p(θ_k | M = k) p(θ_{−k} | θ_k, M = k) Pr(M = k).   (46)

Thus, in addition to the usual inputs, the joint probability model requires the specification of the densities {p(θ_{−k} | θ_k, M = k), k ≤ K}. These are called pseudo priors or linking densities; they are necessary to complete the probability model but play no role in determining the marginal likelihood of M = k since

m(y, M = k) = ∫ f(y, θ, M = k) dθ,


regardless of what pseudo priors are chosen. Hence, the linking densities may be chosen in any convenient way that promotes the working of the MCMC sampling procedure. The goal now is to sample the posterior distribution on model space and parameter space

π(θ_1, ..., θ_K, M | y) ∝ f(y, θ, M),

by MCMC methods.

Algorithm 18: Model space MCMC
(1) Sample
    θ_k ~ π(θ_k | y, M = k) ∝ f(y | θ_k, M = k) p(θ_k | M = k)   if M = k,
    θ_{−k} ~ p(θ_{−k} | θ_k, M = k)   if M ≠ k.
(2) Model jump:
    (a) Calculate
        p_k = f(y | θ_k, M = k) p(θ_k | M = k) p(θ_{−k} | θ_k, M = k) Pr(M = k) / Σ_{l=1}^{K} f(y | θ_l, M = l) p(θ_l | M = l) p(θ_{−l} | θ_l, M = l) Pr(M = l),   k = 1, ..., K.
    (b) Sample
        M ~ {p_1, ..., p_K}.
(3) Goto 1.

Thus, when M = k, we sample θ_k from its full conditional distribution and the remaining parameters from their pseudo priors, and the model index is sampled from a discrete point distribution with probabilities {p_k}.
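The algorithm can be sketched for a pair of toy models (an assumed illustration, not an example from the chapter). Here the pseudo prior for the M_2 parameter is taken equal to its model-specific posterior, the ideal choice discussed below, and the simulated model frequencies are checked against the exact posterior model probabilities.

```python
import math
import random

random.seed(2)

# Assumed models for y_1, ..., y_n (equal prior odds):
#   M1: y_i ~ N(0, 1)               (no free parameters)
#   M2: y_i ~ N(mu, 1), mu ~ N(0, tau2)
n, tau2 = 10, 1.0
y = [random.gauss(0.4, 1.0) for _ in range(n)]
ybar = sum(y) / n

def log_norm(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

# Pseudo prior for mu (used when M = 1), set equal to the M2 posterior
v_post = 1.0 / (n + 1.0 / tau2)
m_post = v_post * n * ybar

ll1 = sum(log_norm(yi, 0.0, 1.0) for yi in y)   # M1 log-likelihood (fixed)

visits2, iters = 0, 20000
for it in range(iters):
    # Step 1: draw mu from its full conditional (M = 2) or the pseudo
    # prior (M = 1); with this ideal pseudo prior the two coincide.
    mu = random.gauss(m_post, math.sqrt(v_post))
    # Step 2: model jump probabilities p_k built from Equation (46)
    a1 = ll1 + log_norm(mu, m_post, v_post)      # f1 * pseudo(mu)
    a2 = (sum(log_norm(yi, mu, 1.0) for yi in y)
          + log_norm(mu, 0.0, tau2))             # f2(.|mu) * prior(mu)
    p2 = 1.0 / (1.0 + math.exp(a1 - a2))
    model = 2 if random.random() < p2 else 1
    visits2 += model == 2
freq2 = visits2 / iters

# Exact Pr(M2 | y) for comparison (m2 via Sherman-Morrison)
S1, S2 = sum(y), sum(yi ** 2 for yi in y)
log_m2 = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * tau2)
          - 0.5 * (S2 - tau2 * S1 ** 2 / (1 + n * tau2)))
exact2 = 1.0 / (1.0 + math.exp(ll1 - log_m2))
```

Because the pseudo prior equals the model-specific posterior here, p_2 reduces to the exact posterior model probability at every iteration, which is the point made in Equation (47) below.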

Algorithm 18 is conceptually quite simple and can be used without any difficulties when the number of models under consideration is small. When K is large, however, the specification of the pseudo priors and the requisite generation of each θ_k within each cycle of the MCMC algorithm can be a computational burden. We also mention that the pseudo priors should be chosen to be close to the model-specific posterior distributions. To understand the rationale for this recommendation, suppose that the pseudo priors can be set exactly equal to the model-specific posterior distributions as

p(θ_{−k} | θ_k, M = k) = Π_{l ≠ k} π(θ_l | y, M = l).

Substituting this choice into the equation for p_k and simplifying, we get

p_k = m(y | M = k) Pr(M = k) / Σ_{l=1}^{K} m(y | M = l) Pr(M = l),   (47)

which is Pr(M = k | y). Therefore, under this choice of pseudo priors, the Carlin-Chib algorithm generates the model move at each iteration of the sampling according to the posterior model probabilities, without any required burn-in. Thus, by utilizing pseudo priors that are close to the model-specific posterior distributions one promotes mixing on model space and more rapid convergence to the invariant target distribution π(θ_1, θ_2, ..., θ_K, M | y).

Another point in connection with the above algorithm is that the joint distribution

over parameter space and model space can be sampled by the M-H algorithm. For example, Dellaportas, Forster and Ntzoufras (1998) suggest that the discrete conditional distribution on the models be sampled by the M-H algorithm in order to avoid the calculation of the denominator of p_k. Godsill (1998) considers the sampling of the entire joint distribution in Equation (46) by the M-H algorithm. Suppose that the proposal density on the joint space is specified as

q{(M = k, θ_k, θ_{−k}), (M = k', θ_{k'}, θ_{−k'})} = q_1(k, k') q_2(θ_k, θ_{k'} | k, k') p(θ_{−k'} | θ_{k'}, M = k'),   (48)

where the pseudo prior is the proposal density of the parameters θ_{−k'} not in the proposed model k'. It is important that q_1 not depend on the current value (k, θ_k) and that q_2 not depend on the current value of θ_{k'} in the model being proposed. Then, the probability of move from (M = k, θ_k, θ_{−k}) to (M = k', θ_{k'}, θ_{−k'}) in the M-H step, after substitutions and cancellations, reduces to

min{1, [f(y | θ_{k'}, M = k') p(θ_{k'} | M = k') Pr(M = k') q_1(k', k) q_2(θ_{k'}, θ_k | k', k)] / [f(y | θ_k, M = k) p(θ_k | M = k) Pr(M = k) q_1(k, k') q_2(θ_k, θ_{k'} | k, k')]},   (49)

which is completely independent of the pseudo priors. Thus, the sampling, or specification, of pseudo priors is not required in this version of the algorithm, but the requirement that the parameters of each model be proposed in one block rules out many important problems.

We now turn to the reversible jump algorithm of Green (1995), which is designed primarily for nested models. In this algorithm, model space and parameter space moves from the current point (M = k, θ_k) to a new point (M = k', θ_{k'}) are made by a Metropolis-Hastings step in conjunction with a dimension-matching condition that ensures that the resulting Markov chain is reversible. An application of the reversible jump method to choosing the number of components in a finite mixture of distributions model is provided by Richardson and Green (1997). The parameter space in this method is based on the union of the parameter spaces B_k. To describe the algorithm we let q_1 denote a discrete mass function that gives the probability of each possible model given the current model, and we let u' denote an increment/decrement random variable that takes one from the current point θ_k to the new point θ_{k'}.

Algorithm 19: Reversible jump model space MCMC
(1) Propose a new model k':
    k' ~ q_1(k, k').
(2) Dimension matching:
    (a) Propose
        u' ~ q_2(u' | θ_k, k, k').
    (b) Set
        (θ_{k'}, u) = g_{k,k'}(θ_k, u'),
    where g_{k,k'} is a bijection between (θ_{k'}, u) and (θ_k, u'), and dim(θ_k) + dim(u') = dim(θ_{k'}) + dim(u).
(3) M-H step:
    (a) Calculate
        α = min{1, [f(y | θ_{k'}, M = k') p(θ_{k'} | M = k') Pr(M = k') q_1(k', k) q_2(u | θ_{k'}, k', k)] / [f(y | θ_k, M = k) p(θ_k | M = k) Pr(M = k) q_1(k, k') q_2(u' | θ_k, k, k')] × J},
    where the Jacobian term is
        J = |∂ g_{k,k'}(θ_k, u') / ∂ (θ_k, u')|.
    (b) Move to (M = k', θ_{k'}) with probability α.
(4) Goto 1.

In the reversible jump method most of the tuning is in the specification of the proposal distribution q_2; a different proposal distribution is required when k' is a model with more parameters than model k than when model k' has fewer parameters. This is the reason for the dependence of q_2 on not just θ_k but also on (k, k'). In addition, the algorithm as stated by Green (1995) is designed for the situation where the competing models are nested and obtained by the removal or addition of different parameters, as for example in a variable selection problem.
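A minimal reversible jump sketch is given below for an assumed nested pair (not an example from the chapter), where the smaller model fixes a mean at zero. The dimension difference is one, so the bijection g is the identity (θ_{k'} = u') and the Jacobian is 1; the simulated model frequencies are again checked against the exact posterior model probabilities.

```python
import math
import random

random.seed(9)

# Assumed nested pair, equal prior odds:
#   M1: y_i ~ N(0, 1);  M2: y_i ~ N(mu, 1), mu ~ N(0, tau2)
n, tau2 = 10, 1.0
y = [random.gauss(0.4, 1.0) for _ in range(n)]
ybar = sum(y) / n

def log_norm(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

ll1 = sum(log_norm(yi, 0.0, 1.0) for yi in y)

# q2: proposal for the new coordinate u' when jumping 1 -> 2
# (posterior-shaped here, a deliberate but not required choice)
v_q = 1.0 / (n + 1.0 / tau2)
m_q = v_q * n * ybar

model, mu, visits2 = 1, 0.0, 0
iters = 20000
for it in range(iters):
    if model == 1:
        # birth move: propose mu' ~ q2; its ordinate enters the denominator
        mu_p = random.gauss(m_q, math.sqrt(v_q))
        log_a = (sum(log_norm(yi, mu_p, 1.0) for yi in y)
                 + log_norm(mu_p, 0.0, tau2)
                 - ll1 - log_norm(mu_p, m_q, v_q))     # Jacobian = 1
        if random.random() < math.exp(min(0.0, log_a)):
            model, mu = 2, mu_p
    else:
        # death move: the reverse-move q2 ordinate enters the numerator
        log_a = (ll1 + log_norm(mu, m_q, v_q)
                 - sum(log_norm(yi, mu, 1.0) for yi in y)
                 - log_norm(mu, 0.0, tau2))
        if random.random() < math.exp(min(0.0, log_a)):
            model = 1
        else:
            # within-model refresh; valid here because q2 equals the
            # exact full conditional of mu under M2
            mu = random.gauss(m_q, math.sqrt(v_q))
    visits2 += model == 2
freq2 = visits2 / iters

S1, S2 = sum(y), sum(yi ** 2 for yi in y)
log_m2 = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * tau2)
          - 0.5 * (S2 - tau2 * S1 ** 2 / (1 + n * tau2)))
exact2 = 1.0 / (1.0 + math.exp(ll1 - log_m2))
```

In larger problems q_2 would be tuned separately for up-moves and down-moves, which is exactly the dependence on (k, k') noted above.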

10.4 Variable selection

The model space MCMC methods described above can be specialized to the problem of variable selection in regression. We first focus on this problem in the context of linear regression models with conjugate priors before discussing a more general situation.

Consider then the question of building a multiple regression model for a vector of n observations y in terms of a given set of covariates X = {x_1, ..., x_p}. The goal is to find the "best" model of the form

M_k: y = X_k β_k + σ ε,

where X_k is an n × d_k matrix composed of some or all variables from X, σ² is a variance parameter and ε is N(0, I_n). Under the assumption that any subset of the variables in


X can be used to form X_k, it follows that the number of possible models is K = 2^p, which is a large number even if p is as small as fifteen. Thus, unless p is small enough that the marginal likelihoods can be computed for each possible X_k, it is helpful to use simulation-based methods that traverse the space of possible models to determine the subsets that are most supported by the data.

Raftery, Madigan and Hoeting (1997) develop one approach that is based on the use of conjugate priors. Let the parameters θ_k = (β_k, σ²) of model M_k follow the conjugate prior distributions

β_k | M = k, σ² ~ N_{d_k}(0, σ² B_{0k});   σ² | M = k ~ IG(ν_0/2, δ_0/2),   (50)

which implies, after some algebra, that the marginal likelihood of M_k is

m(y | M = k) = [Γ((ν_0 + n)/2) / (Γ(ν_0/2) π^{n/2})] δ_0^{ν_0/2} |B_k|^{1/2} (δ_0 + y' B_k y)^{-(ν_0+n)/2},

where

B_k = I_n − X_k (B_{0k}^{-1} + X_k' X_k)^{-1} X_k'.

Raftery, Madigan and Hoeting (1997) specify an MCMC chain to sample model space in which the target distribution is the univariate discrete distribution with probabilities

Pr(M = k | y) = p_k ∝ m(y | M = k) Pr(M = k),   k ≤ K.   (51)

Although this distribution can in principle be normalized, the normalization constant is computationally expensive to calculate when K is large (but one can argue that expending the necessary computational effort is always desirable). This motivates the sampling of Equation (51) by the Metropolis-Hastings algorithm. For each model M = k, define a neighborhood nbd(M = k) which consists of the model M = k and the models with either one more variable or one fewer variable than M = k. Define a transition matrix q_1(k, k') which puts uniform probability over the models k' in nbd(M = k) and zero probability on all other models. Given that the chain is currently at the point M = k, a move to the proposed model k' is made with probability

min{ [m(y | M = k') Pr(M = k') q_1(k', k)] / [m(y | M = k) Pr(M = k) q_1(k, k')], 1 }.   (52)

If the proposed move is rejected, the chain stays at M = k.

When conjugate priors are not assumed for β_k, or when the model is more complicated than multiple regression, it is not possible to find the marginal likelihood of each model analytically. It then becomes necessary to sample both the parameters and the model index jointly, as in the general model space-parameter space algorithms


mentioned above. The approaches that have been developed for this case, however, treat the various models as nested.
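The conjugate scheme of Equations (50)-(52) can be sketched for a small assumed problem (p = 2, so all four subsets can also be enumerated exactly; the data, priors and chain length are illustrative choices). The sketch uses the identities |B_k|^{1/2} = (|B_{0k}| |B_{0k}^{-1} + X_k'X_k|)^{-1/2} and y'B_k y = y'y − b'A^{-1}b with A = B_{0k}^{-1} + X_k'X_k and b = X_k'y, which keep the linear algebra at dimension d_k.

```python
import math
import random

random.seed(11)

n, nu0, d0 = 30, 6.0, 4.0       # assumed IG(nu0/2, d0/2) prior on sigma^2
g = 10.0                        # assumed prior scale: B0k = g * I_dk
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 * a + random.gauss(0, 1) for a in x1]   # x1 matters, x2 is noise
cov = [x1, x2]

def log_marglik(cols):
    # log m(y | M = k) for the model whose design has the given columns
    d = len(cols)
    yy = sum(v * v for v in y)
    c0 = (math.lgamma((nu0 + n) / 2) - math.lgamma(nu0 / 2)
          - (n / 2) * math.log(math.pi) + (nu0 / 2) * math.log(d0))
    if d == 0:
        return c0 - ((nu0 + n) / 2) * math.log(d0 + yy)
    X = list(zip(*cols))        # n rows of length d
    A = [[(1 / g if i == j else 0.0) + sum(r[i] * r[j] for r in X)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(d)]
    if d == 1:
        detA, quad = A[0][0], b[0] ** 2 / A[0][0]
    else:                        # explicit 2x2 inverse for b'A^{-1}b
        detA = A[0][0] * A[1][1] - A[0][1] ** 2
        quad = (A[1][1] * b[0] ** 2 - 2 * A[0][1] * b[0] * b[1]
                + A[0][0] * b[1] ** 2) / detA
    return (c0 - 0.5 * (d * math.log(g) + math.log(detA))
            - ((nu0 + n) / 2) * math.log(d0 + yy - quad))

models = [(), (0,), (1,), (0, 1)]
lm = {k: log_marglik([cov[i] for i in k]) for k in models}

# Exact posterior model probabilities (feasible because K = 4)
mx = max(lm.values())
w = {k: math.exp(v - mx) for k, v in lm.items()}
Z = sum(w.values())
post = {k: v / Z for k, v in w.items()}

# MC3-style M-H over model space with one-variable neighbourhoods
def nbd(k):
    out = [k]
    for i in range(2):
        out.append(tuple(sorted(set(k) ^ {i})))
    return out                   # always size 3, so q_1 is symmetric

cur, counts, iters = (), {k: 0 for k in models}, 30000
for it in range(iters):
    prop = random.choice(nbd(cur))
    if random.random() < math.exp(min(0.0, lm[prop] - lm[cur])):
        cur = prop
    counts[cur] += 1
freq = {k: c / iters for k, c in counts.items()}
```

The chain's visit frequencies reproduce the enumerated posterior probabilities; for large p only the chain-based estimate remains feasible.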

Suppose that the coefficients attached to the p possible covariates in the model are denoted by η = {η_1, ..., η_p}, where any common error variances or other common parameters are suppressed from the notation and the discussion. Now associate with each coefficient η_j an indicator variable δ_j, which takes the value one if the coefficient is in the model and the value zero otherwise; let η_δ denote the set of active η_j's given a configuration δ and let η_{−δ} denote the complementary η_j's. For example, if p = 5 and δ = {1, 0, 0, 1, 1}, then η_δ = {η_1, η_4, η_5} and η_{−δ} = {η_2, η_3}. A variable selection MCMC algorithm can now be developed by sampling the joint posterior distribution π(δ_1, η_1, ..., δ_p, η_p | y). Particular implementations representing different blocking schemes to sample this joint distribution are discussed by Kuo and Mallick (1998), Geweke (1996) and Smith and Kohn (1996). For example, in the algorithm of Kuo and Mallick (1998), the posterior distribution is sampled by recursively simulating the {η_1, ..., η_p} from the distributions

π(η_j | y, η_{−j}, δ) ∝ f(y | η_δ, δ) p(η_j | δ)   if δ_j = 1,
π(η_j | y, η_{−j}, δ) ∝ p(η_j | δ)   if δ_j = 0,

where p(η_j | δ) with δ_j = 0 is a pseudo prior, because it represents the distribution of η_j when η_j is not in the current configuration. Next, the variable indicators {δ_1, ..., δ_p} are sampled one at a time from the two-point mass function

Pr(δ_j | y, η, δ_{−j}) ∝ f(y | η_δ, δ) p(η_δ | δ) p(η_{−δ} | η_δ, δ) p(δ_j),

where p(η_{−δ} | η_δ, δ) is the pseudo prior. These two steps are iterated. Procedures to sample (δ_j, η_j) in one block, given all the other blocks, are presented by Geweke (1996) and Smith and Kohn (1996).

George and McCulloch (1993, 1997) develop an important alternative simulation-based approach to the variable selection problem that has been extensively studied and refined. In their approach, the variable selection problem is cast in terms of a hierarchical model of the type

y = Xβ + σε,   β_j | γ_j ~ (1 − γ_j) N(0, τ_j²) + γ_j N(0, c_j² τ_j²),
Pr(γ_j = 1) = 1 − Pr(γ_j = 0) = p_j,

where τ_j² is a small positive number and c_j a large positive number. In this specification each component of β is assumed to come from a mixture of two normal distributions such that γ_j = 0 corresponds to the case where β_j can be assumed to be zero. It should be noted that in this framework a particular covariate is never strictly removed from the model; exclusion from the model corresponds to a high posterior probability of the event γ_j = 0. George and McCulloch (1993) sample the posterior distribution of (β, {γ_j}) by the Gibbs sampling algorithm.
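The two-block Gibbs scheme can be sketched as follows for p = 2 with σ² treated as known; all settings (data, τ², c², p_j = 1/2) are illustrative assumptions, not values from George and McCulloch. The β block is drawn from its bivariate normal full conditional, and each γ_j is then drawn from the two-point distribution implied by the spike-and-slab mixture prior.

```python
import math
import random

random.seed(4)

# Assumed setup: y = 2*x1 + noise, x2 inert; sigma^2 = 1 known;
# beta_j | gamma_j ~ (1-gamma_j) N(0, tau2) + gamma_j N(0, c2*tau2)
n, tau2, c2 = 50, 0.01, 100.0
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * a + random.gauss(0, 1) for a in x1]

def log_norm(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

xx11 = sum(a * a for a in x1)
xx22 = sum(a * a for a in x2)
xx12 = sum(a * b for a, b in zip(x1, x2))
xy1 = sum(a * b for a, b in zip(x1, y))
xy2 = sum(a * b for a, b in zip(x2, y))

gamma, beta, count = [1, 1], [0.0, 0.0], [0, 0]
M, burn = 4000, 500
for it in range(M + burn):
    # beta | gamma, y ~ N(A^{-1} X'y, A^{-1}) with A = X'X + D_gamma^{-1}
    d = [c2 * tau2 if gj else tau2 for gj in gamma]
    A11, A22, A12 = xx11 + 1 / d[0], xx22 + 1 / d[1], xx12
    det = A11 * A22 - A12 ** 2
    m1 = (A22 * xy1 - A12 * xy2) / det
    m2 = (A11 * xy2 - A12 * xy1) / det
    # sample via a 2x2 Cholesky factor of the covariance A^{-1}
    v11, v22, v12 = A22 / det, A11 / det, -A12 / det
    L11 = math.sqrt(v11)
    L21 = v12 / L11
    L22 = math.sqrt(v22 - L21 ** 2)
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    beta = [m1 + L11 * z1, m2 + L21 * z1 + L22 * z2]
    # gamma_j | beta_j: two-point draw from the mixture components
    for j in range(2):
        ls = log_norm(beta[j], 0.0, c2 * tau2)   # slab
        lp = log_norm(beta[j], 0.0, tau2)        # spike
        pj = 1.0 / (1.0 + math.exp(lp - ls))     # p_j = 1/2 cancels
        gamma[j] = 1 if random.random() < pj else 0
    if it >= burn:
        for j in range(2):
            count[j] += gamma[j]
p_inc = [c / M for c in count]   # posterior inclusion probabilities
```

The active covariate receives an inclusion probability near one, while the inert one is mostly assigned to the spike, mirroring the "never strictly removed" interpretation above.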


10.5 Remark

We conclude this discussion by pointing out that checking convergence of the Markov chain in model space algorithms is quite difficult and has not been satisfactorily addressed in the literature. When the model space is large, as for example in the variable selection problem, one cannot be sure that all models supported by the data have been visited according to their posterior probabilities. Of course, if the model space is reduced to ensure better coverage of the various models, it may happen that direct computation of the marginal likelihood becomes feasible, thereby removing any justification for considering a model space algorithm in the first place. This tension in the choice between direct computation and model space algorithms is real and cannot be adjudicated in the absence of a concrete problem.

11 MCMC methods in optimization problems

Suppose that we are given a particular function h(θ), say the log-likelihood of a given model, and interest lies in the value of θ that maximizes this function. In some cases, this optimization problem can be quite effectively solved by MCMC methods. One somewhat coarse possibility is to obtain draws {θ^{(j)}} from a density proportional to h(θ) and to find the value of θ that corresponds to the maximum of {h(θ^{(j)})}. Another, more precise, technique goes by the name of simulated annealing, which appears in Metropolis et al. (1953) and is closely related to the Metropolis simulation method. In the simulated annealing method, which is most typically used to maximize a function on a finite but large set, one uses the Metropolis method to sample the distribution

π(θ) ∝ exp{h(θ)/T},

where T is referred to as the temperature. The temperature is gradually reduced as the sampling proceeds [for example, see Geman and Geman (1984)]. It can be shown that in the finite case, the values of θ produced by the simulated annealing method concentrate around the global maximum of the function h(θ).
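A small sketch of the method on a finite grid is given below; the objective function, proposal and cooling schedule are all illustrative choices, not taken from the text. Each step is a Metropolis accept/reject for π(θ) ∝ exp{h(θ)/T} at the current temperature, which is then lowered geometrically.

```python
import math
import random

random.seed(1)

# Illustrative multimodal objective on the finite set {0, 1, ..., 200}
def h(x):
    return -((x - 137) ** 2) / 50.0 + 3.0 * math.cos(x / 7.0)

state, best = 0, 0
T = 10.0
for it in range(20000):
    prop = state + random.choice([-1, 1])        # random-walk proposal
    if 0 <= prop <= 200:
        # Metropolis acceptance for pi(x) proportional to exp{h(x)/T}
        log_a = (h(prop) - h(state)) / T
        if random.random() < math.exp(min(0.0, log_a)):
            state = prop
    if h(state) > h(best):
        best = state                             # best value visited so far
    T = max(0.01, T * 0.9995)                    # gradual geometric cooling

brute = max(range(201), key=h)                   # exact maximizer for comparison
```

At high temperature the chain crosses the local modes induced by the cosine term; as T falls, the draws concentrate on the top of the dominant mode.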

Another method of interest is an MCMC version of the EM algorithm, which can be used to find the maximum likelihood estimate in certain situations. Suppose that z represents missing data and f(y | M, θ) denotes the likelihood function. Also suppose that

f(y | M, θ) = ∫ f(y, z | M, θ) dz

is difficult to compute, but that the complete-data likelihood f(y, z | M, θ) is available, as in the models with a missing data structure in Section 8. For this problem, the standard EM algorithm [Dempster, Laird and Rubin (1977)] requires the recursive implementation of two steps: the expectation or E-step and the maximization or M-step. In the E-step, given the current guess of the maximizer θ^{(j)}, one computes

Q(θ^{(j)}, θ) = ∫ ln f(y, z | M, θ) f(z | y, M, θ^{(j)}) dz,

while in the M-step the Q function is maximized to obtain a revised guess of the maximizer, i.e.,

θ^{(j+1)} = arg max_θ Q(θ^{(j)}, θ).

Wu (1983) has shown that under regularity conditions the sequence of values {θ^{(j)}} generated by these steps converges to the maximizer of the function f(y | M, θ).

The MCEM algorithm is a variant of the EM algorithm, proposed by Wei and Tanner (1990b), in which the E-step, which is often intractable, is computed by Monte Carlo averaging over values of z drawn from f(z | y, M, θ), which in the MCMC context is the full conditional distribution of the latent data. The revised value of θ is then obtained by maximizing the Monte Carlo estimate of the Q function. Specifically, the MCEM algorithm is defined by iterating on the following steps:

Q_M(θ^{(j)}, θ) = M^{-1} Σ_{l=1}^{M} ln f(y, z^{(l)} | M, θ),   z^{(l)} ~ f(z | y, M, θ^{(j)}),

θ^{(j+1)} = arg max_θ Q_M(θ^{(j)}, θ).

As suggested by Wei and Tanner (1990b), these iterations are started with a small value of M that is increased as the maximizer is approached. One point to note is that, in general, the MCEM algorithm, like the EM algorithm, can be slow to converge to the mode, but it should be possible to adapt the ideas described in Liu, Rubin and Wu (1998) to address this problem. Another point to note is that the computation of the Q_M function can be expensive when M is large. Despite these potential difficulties, a number of applications of the MCEM algorithm have now appeared in the literature. These include Chan and Ledolter (1995), Chib (1996, 1998), Chib and Greenberg (1998), Chib, Greenberg and Winkelmann (1998) and Booth and Hobert (1999).
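The scheme can be sketched for an assumed right-censored normal-mean problem (not one of the cited applications), where the latent data z are the censored observations, the E-step is simulated by drawing z from its truncated-normal full conditional, and the M-step is the mean of the completed data. The exact E-step is also available here, so the MCEM iterate can be compared with the deterministic EM limit.

```python
import math
import random

random.seed(8)

# Assumed data: y_i ~ N(mu, 1), values above c are censored (count known)
mu_true, c, n = 1.0, 1.5, 200
full = [random.gauss(mu_true, 1.0) for _ in range(n)]
obs = [v for v in full if v <= c]     # fully observed values
n_cens = n - len(obs)                 # number censored at c

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def draw_trunc(mu):
    # rejection sampling from N(mu, 1) truncated to (c, inf);
    # adequate while c - mu stays moderate, as it does here
    while True:
        z = random.gauss(mu, 1.0)
        if z > c:
            return z

# MCEM: Monte Carlo E-step, closed-form M-step (mean of completed data)
mu, M = 0.0, 25
for j in range(20):
    zsum = sum(sum(draw_trunc(mu) for _ in range(M)) / M
               for _ in range(n_cens))
    mu = (sum(obs) + zsum) / n
    M = min(2 * M, 500)               # grow M as the maximizer nears

# Exact EM for comparison: E[z | z > c] = mu + phi(a)/(1 - Phi(a)), a = c - mu
mu_em = 0.0
for j in range(200):
    a = c - mu_em
    ez = mu_em + phi(a) / (1 - Phi(a))
    mu_em = (sum(obs) + n_cens * ez) / n
```

The MCEM iterate fluctuates around the EM fixed point, with the fluctuation shrinking as M grows, which is the rationale for the increasing-M schedule noted above.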

Given the modal value θ*, the standard errors of the MLE are obtained by the formula of Louis (1982). In particular, the observed information matrix is given by

−E[∂² ln f(y, z | M, θ)/∂θ ∂θ'] − Var[∂ ln f(y, z | M, θ)/∂θ],


where the expectation and variance are with respect to the distribution z | y, M, θ*. This expression is estimated by taking an additional J draws {z^{(1)}, ..., z^{(J)}} from z | y, M, θ*

and computing

−J^{-1} Σ_{k=1}^{J} ∂² ln f(y, z^{(k)} | M, θ*)/∂θ ∂θ' − J^{-1} Σ_{k=1}^{J} (m_k − m̄)(m_k − m̄)',

where

m_k = ∂ ln f(y, z^{(k)} | M, θ*)/∂θ   and   m̄ = J^{-1} Σ_{k=1}^{J} m_k.

Standard errors are equal to the square roots of the diagonal elements of the inverseof the estimated information matrix.

12 Concluding remarks

In this survey we have provided an outline of Markov chain Monte Carlo methods with emphasis on techniques that prove useful in Bayesian statistical inference. Further developments of these methods continue to occur, but the ideas and details presented in this survey should provide a reasonable starting point for understanding the current and emerging literature. Two recent developments are the slice sampling method, discussed by Mira and Tierney (1998), Damien et al. (1999) and Roberts and Rosenthal (1999), and the perfect sampling method proposed by Propp and Wilson (1996). The slice sampling method is based on the introduction of auxiliary uniform random variables to simplify the sampling and improve mixing, while the perfect sampling method uses Markov chain coupling to generate an exact draw from the target distribution. These methods are in their infancy and can currently be applied only under rather restrictive assumptions on the target distribution, but it is possible that more general versions of these methods will eventually become available.

Other interesting developments are now occurring in the field of applied Bayesian inference as practical problems are being addressed by the methods summarized in this survey. These applications are appearing at a steady rate in various areas. For example, a partial list of fields and papers within fields includes: biostatistical time series analysis [West, Prado and Krystal (1999)]; economics [Chamberlain and Hirano (1997), Filardo and Gordon (1998), Gawande (1998), Lancaster (1997), Li (1998), Kiefer and Steel (1998), Kim and Nelson (1999), Koop and Potter (1999), Martin (1999), Paap and van Dijk (1999), So, Lam and Li (1998)]; finance [Jones (1999), Pastor and Stambaugh (1999)]; marketing [Allenby, Leone and Jen (1999), Bradlow and Zaslavsky (1999), Manchanda, Ansari and Gupta (1999), Montgomery and Rossi (1999), Young, DeSarbo and Morwitz (1998)]; political science [King, Rosen and Tanner (1999), Quinn, Martin and Whitford (1999), Smith (1999)]; and many others.

One can claim that with the ever-increasing power of computing hardware, and the experience of the past ten years, the future of simulation-based inference using MCMC methods is secure.

References

Albert, J. (1993), "Teaching Bayesian statistics using sampling methods and MINITAB", American Statistician 47:182-191.
Albert, J., and S. Chib (1993a), "Bayesian analysis of binary and polychotomous response data", Journal of the American Statistical Association 88:669-679.
Albert, J., and S. Chib (1993b), "Bayes inference via Gibbs sampling of autoregressive time series subject to Markov mean and variance shifts", Journal of Business and Economic Statistics 11:1-15.
Albert, J., and S. Chib (1995), "Bayesian residual analysis for binary response models", Biometrika 82:747-759.
Albert, J., and S. Chib (1996), "Computation in Bayesian econometrics: an introduction to Markov chain Monte Carlo", in: T. Fomby and R. C. Hill, eds., Advances in Econometrics, Vol. 11A (JAI Press, Greenwich, CT) 3-24.
Albert, J., and S. Chib (1997), "Bayesian tests and model diagnostics in conditionally independent hierarchical models", Journal of the American Statistical Association 92:916-925.
Albert, J., and S. Chib (1998), "Sequential ordinal modeling with applications to survival data", Biometrics, in press.
Allenby, G. M., R. P. Leone and L. Jen (1999), "A dynamic model of purchase timing with application to direct marketing", Journal of the American Statistical Association 94:365-374.
Bennett, C. H. (1976), "Efficient estimation of free energy differences from Monte Carlo data", Journal of Computational Physics 22:245-268.
Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis, 2nd edition (Springer, New York).
Bernardo, J. M., and A. F. M. Smith (1994), Bayesian Theory (Wiley, New York).
Besag, J. (1974), "Spatial interaction and the statistical analysis of lattice systems (with discussion)", Journal of the Royal Statistical Society B 36:192-236.
Besag, J., P. Green, D. Higdon and K. L. Mengersen (1995), "Bayesian computation and stochastic systems (with discussion)", Statistical Science 10:3-66.
Best, N. G., M. K. Cowles and S. K. Vines (1995), "CODA: convergence diagnostics and output analysis software for Gibbs sampling", Technical report (Cambridge MRC Biostatistics Unit).
Billio, M., A. Monfort and C. P. Robert (1999), "Bayesian estimation of switching ARMA models", Journal of Econometrics 93:229-255.
Blattberg, R. C., and E. I. George (1991), "Shrinkage estimation of price and promotional elasticities: seemingly unrelated equations", Journal of the American Statistical Association 86:304-315.
Booth, J. G., and J. P. Hobert (1999), "Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm", Journal of the Royal Statistical Society B 61:265-285.
Bradlow, E., and A. M. Zaslavsky (1999), "A hierarchical latent variable model for ordinal data from a customer satisfaction survey with "no answer" responses", Journal of the American Statistical Association 94:43-52.
Brooks, S. P. (1998), "Markov chain Monte Carlo and its application", Statistician 47:69-100.




Brooks, S. P., P. Dellaportas and G. O. Roberts (1997), "A total variation method for diagnosing convergence of MCMC algorithms", Journal of Computational and Graphical Statistics 6:251-265.
Carlin, B., A. E. Gelfand and A. F. M. Smith (1992), "Hierarchical Bayesian analysis of changepoint problems", Applied Statistics 41:389-405.
Carlin, B. P., and S. Chib (1995), "Bayesian model choice via Markov chain Monte Carlo methods", Journal of the Royal Statistical Society B 57:473-484.
Carlin, B. P., and T. A. Louis (2000), Bayes and Empirical Bayes Methods for Data Analysis, 2nd edition (Chapman and Hall, London).
Carlin, B. P., and N. G. Polson (1991), "Inference for non-conjugate Bayesian models using the Gibbs sampler", Canadian Journal of Statistics 19:399-405.
Carlin, B. P., N. G. Polson and D. S. Stoffer (1992), "A Monte Carlo approach to nonnormal and nonlinear state-space modeling", Journal of the American Statistical Association 87:493-500.
Carter, C., and R. Kohn (1994), "On Gibbs sampling for state space models", Biometrika 81:541-553.
Carter, C., and R. Kohn (1996), "Markov chain Monte Carlo for conditionally Gaussian state space models", Biometrika 83:589-601.
Casella, G., and E. I. George (1992), "Explaining the Gibbs sampler", American Statistician 46:167-174.
Casella, G., and C. P. Robert (1996), "Rao-Blackwellization of sampling schemes", Biometrika 83:81-94.
Chamberlain, G., and K. Hirano (1997), "Predictive distributions based on longitudinal earnings data", Manuscript (Department of Economics, Harvard University).
Chan, K. S. (1993), "Asymptotic behavior of the Gibbs sampler", Journal of the American Statistical Association 88:320-326.
Chan, K. S., and C. J. Geyer (1994), "Discussion of Markov chains for exploring posterior distributions", Annals of Statistics 22:1747-1758.
Chan, K. S., and J. Ledolter (1995), "Monte Carlo EM estimation for time series models involving counts", Journal of the American Statistical Association 90:242-252.
Chen, M.-H. (1994), "Importance-weighted marginal Bayesian posterior density estimation", Journal of the American Statistical Association 89:818-824.
Chen, M.-H., and Q.-M. Shao (1997), "On Monte Carlo methods for estimating ratios of normalizing constants", Annals of Statistics 25:1563-1594.
Chen, M.-H., and Q.-M. Shao (1999), "Monte Carlo estimation of Bayesian credible and HPD intervals", Journal of Computational and Graphical Statistics 8:69-92.
Chib, S. (1992), "Bayes regression for the Tobit censored regression model", Journal of Econometrics 51:79-99.
Chib, S. (1993), "Bayes regression with autocorrelated errors: a Gibbs sampling approach", Journal of Econometrics 58:275-294.
Chib, S. (1995), "Marginal likelihood from the Gibbs output", Journal of the American Statistical Association 90:1313-1321.
Chib, S. (1996), "Calculating posterior distributions and modal estimates in Markov mixture models", Journal of Econometrics 75:79-97.
Chib, S. (1998), "Estimation and comparison of multiple change point models", Journal of Econometrics 86:221-241.
Chib, S., and B. P. Carlin (1999), "On MCMC sampling in hierarchical longitudinal models", Statistics and Computing 9:17-26.
Chib, S., and E. Greenberg (1994), "Bayes inference for regression models with ARMA(p,q) errors", Journal of Econometrics 64:183-206.
Chib, S., and E. Greenberg (1995a), "Understanding the Metropolis-Hastings algorithm", American Statistician 49:327-335.
Chib, S., and E. Greenberg (1995b), "Hierarchical analysis of SUR models with extensions to correlated serial errors and time-varying parameter models", Journal of Econometrics 68:339-360.
Chib, S., and E. Greenberg (1996), "Markov chain Monte Carlo simulation methods in econometrics", Econometric Theory 12:409-431.



Chib, S., and E. Greenberg (1998), "Analysis of multivariate probit models", Biometrika 85:347-361.
Chib, S., and B. Hamilton (2000), "Bayesian analysis of cross section and clustered data treatment models", Journal of Econometrics 97:25-50.
Chib, S., and I. Jeliazkov (2001), "Marginal likelihood from the Metropolis-Hastings output", Journal of the American Statistical Association 96:270-281.
Chib, S., E. Greenberg and R. Winkelmann (1998), "Posterior simulation and Bayes factors in panel count data models", Journal of Econometrics 86:33-54.
Chib, S., F. Nardari and N. Shephard (1998), "Markov chain Monte Carlo analysis of generalized stochastic volatility models", Journal of Econometrics, under review.
Chib, S., F. Nardari and N. Shephard (1999), "Analysis of high dimensional multivariate stochastic volatility models", Technical report (John M. Olin School of Business, Washington University, St. Louis).
Chipman, H. A., E. I. George and R. E. McCulloch (1998), "Bayesian CART model search (with discussion)", Journal of the American Statistical Association 93:935-948.
Cowles, M. K. (1996), "Accelerating Monte Carlo Markov chain convergence for cumulative-link generalized linear models", Statistics and Computing 6:101-111.
Cowles, M. K., and B. Carlin (1996), "Markov chain Monte Carlo convergence diagnostics: a comparative review", Journal of the American Statistical Association 91:883-904.
Damien, P., J. Wakefield and S. Walker (1999), "Gibbs sampling for Bayesian nonconjugate and hierarchical models using auxiliary variables", Journal of the Royal Statistical Society B 61:331-344.
de Jong, P., and N. Shephard (1995), "The simulation smoother for time series models", Biometrika 82:339-350.
Dellaportas, P., and A. F. M. Smith (1993), "Bayesian inference for generalized linear and proportional hazards models via Gibbs sampling", Applied Statistics 42:443-459.
Dellaportas, P., J. J. Forster and I. Ntzoufras (1998), "On Bayesian model and variable selection using MCMC", Technical report (University of Economics and Business, Greece).
Dempster, A. P., N. M. Laird and D. B. Rubin (1977), "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society B 39:1-38.
Denison, D. G. T., B. K. Mallick and A. F. M. Smith (1998), "A Bayesian CART algorithm", Biometrika 85:363-377.
Devroye, L. (1985), Non-Uniform Random Variate Generation (Springer, New York).
DiCiccio, T. J., R. E. Kass, A. E. Raftery and L. Wasserman (1997), "Computing Bayes factors by combining simulation and asymptotic approximations", Journal of the American Statistical Association 92:903-915.
Diebolt, J., and C. P. Robert (1994), "Estimation of finite mixture distributions through Bayesian sampling", Journal of the Royal Statistical Society B 56:363-375.
Diggle, P., K.-Y. Liang and S. L. Zeger (1995), Analysis of Longitudinal Data (Oxford University Press, Oxford).
Elerian, O., S. Chib and N. Shephard (1999), "Likelihood inference for discretely observed nonlinear diffusions", Econometrica, in press.
Escobar, M. D., and M. West (1995), "Bayesian density estimation and inference using mixtures", Journal of the American Statistical Association 90:577-588.
Filardo, A. J., and S. E. Gordon (1998), "Business cycle durations", Journal of Econometrics 85:99-123.
Fruhwirth-Schnatter, S. (1994), "Data augmentation and dynamic linear models", Journal of Time Series Analysis 15:183-202.
Gamerman, D. (1997), Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference (Chapman and Hall, London).
Gamerman, D., and H. S. Migon (1993), "Dynamic hierarchical models", Journal of the Royal Statistical Society B 55:629-642.
Gawande, K. (1998), "Comparing theories of endogenous protection: Bayesian comparison of Tobit models using Gibbs sampling output", Review of Economics and Statistics 80:128-140.




Gelfand, A. E., and D. Dey (1994), "Bayesian model choice: asymptotics and exact calculations", Journal of the Royal Statistical Society B 56:501-514.
Gelfand, A. E., and A. F. M. Smith (1990), "Sampling-based approaches to calculating marginal densities", Journal of the American Statistical Association 85:398-409.
Gelfand, A. E., and A. F. M. Smith (1992), "Bayesian statistics without tears: a sampling-resampling perspective", American Statistician 46:84-88.
Gelfand, A. E., S. Hills, A. Racine-Poon and A. F. M. Smith (1990), "Illustration of Bayesian inference in normal data models using Gibbs sampling", Journal of the American Statistical Association 85:972-982.
Gelfand, A. E., S. K. Sahu and B. P. Carlin (1995), "Efficient parameterizations for normal linear mixed models", Biometrika 82:479-488.
Gelman, A., and D. B. Rubin (1992), "Inference from iterative simulation using multiple sequences", Statistical Science 7:457-472.
Gelman, A., J. B. Carlin, H. S. Stern and D. B. Rubin (1995), Bayesian Data Analysis (Chapman and Hall, London).
Geman, S., and D. Geman (1984), "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images", IEEE Transactions on Pattern Analysis and Machine Intelligence 6:721-741.
Gentle, J. E. (1998), Random Number Generation and Monte Carlo Methods (Springer, New York).
George, E. I., and R. E. McCulloch (1993), "Variable selection via Gibbs sampling", Journal of the American Statistical Association 88:881-889.
George, E. I., and R. E. McCulloch (1997), "Approaches to Bayesian variable selection", Statistica Sinica 7:339-373.
Geweke, J. (1989), "Bayesian inference in econometric models using Monte Carlo integration", Econometrica 57:1317-1340.
Geweke, J. (1991), "Efficient simulation from the multivariate normal and student-t distributions subject to linear constraints", in: E. Keramidas and S. Kaufman, eds., Computing Science and Statistics: Proceedings of the 23rd Symposium (Interface Foundation of North America) 571-578.
Geweke, J. (1992), "Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments", in: J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds., Bayesian Statistics (Oxford University Press, New York) 169-193.
Geweke, J. (1996), "Variable selection and model comparison in regression", in: J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds., Bayesian Statistics (Oxford University Press, New York) 609-620.
Geweke, J. (1997), "Posterior simulators in econometrics", in: D. M. Kreps and K. F. Wallis, eds., Advances in Economics and Econometrics: Theory and Applications, 7th World Congress (Cambridge University Press, Cambridge) 128-165.
Geyer, C. (1995), "Conditioning in Markov chain Monte Carlo", Journal of Computational and Graphical Statistics 4:148-154.
Geyer, C. J., and E. A. Thompson (1995), "Annealing Markov chain Monte Carlo with applications to ancestral inference", Journal of the American Statistical Association 90:909-920.
Ghysels, E., A. C. Harvey and E. Renault (1996), "Stochastic volatility", in: C. R. Rao and G. S. Maddala, eds., Statistical Methods in Finance (North-Holland, Amsterdam) 119-191.
Gilks, W. R., S. Richardson and D. J. Spiegelhalter (1996), Markov Chain Monte Carlo in Practice (Chapman and Hall, London).
Godsill, S. J. (1998), "On the relationship between model uncertainty methods", Technical report (Signal Processing Group, Cambridge University).
Green, P. J. (1995), "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination", Biometrika 82:711-732.
Hamilton, J. D. (1989), "A new approach to the economic analysis of nonstationary time series subject to changes in regime", Econometrica 57:357-384.
Hand, D. J., F. Daly, A. D. Lunn, K. J. McConway and E. Ostrowski (1994), A Handbook of Small Data Sets (Chapman and Hall, London).



Hastings, W. K. (1970), "Monte Carlo sampling methods using Markov chains and their applications", Biometrika 57:97-109.
Hills, S. E., and A. F. M. Smith (1992), "Parameterization issues in Bayesian inference", in: J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds., Proceedings of the Fourth Valencia International Conference on Bayesian Statistics (Oxford University Press, New York) 641-649.
Jacquier, E., N. G. Polson and P. E. Rossi (1994), "Bayesian analysis of stochastic volatility models (with discussion)", Journal of Business and Economic Statistics 12:371-417.
Jeffreys, H. (1961), Theory of Probability, 3rd edition (Oxford University Press, New York).
Jones, C. S. (1999), "The dynamics of stochastic volatility", Manuscript (University of Rochester).
Kiefer, N. M., and M. F. J. Steel (1998), "Bayesian analysis of the prototypal search model", Journal of Business and Economic Statistics 16:178-186.
Kim, C.-J., and C. R. Nelson (1999), "Has the US economy become more stable? A Bayesian approach based on a Markov-switching model of the business cycle", The Review of Economics and Statistics 81:608-616.
Kim, S., N. Shephard and S. Chib (1998), "Stochastic volatility: likelihood inference and comparison with ARCH models", Review of Economic Studies 65:361-393.
King, G., O. Rosen and M. A. Tanner (1999), "Binomial-beta hierarchical models for ecological inference", Sociological Methods and Research 28:61-90.
Kloek, T., and H. K. van Dijk (1978), "Bayesian estimates of equation system parameters: an application of integration by Monte Carlo", Econometrica 46:1-20.
Koop, G., and S. M. Potter (1999), "Bayes factors and nonlinearity: evidence from economic time series", Journal of Econometrics 88:251-281.
Kuo, L., and B. Mallick (1998), "Variable selection for regression models", Sankhya B 60:65-81.
Laird, N. M., and J. H. Ware (1982), "Random-effects models for longitudinal data", Biometrics 38:963-974.
Lancaster, T. (1997), "Exact structural inference in optimal job-search models", Journal of Business and Economic Statistics 15:165-179.
Leamer, E. E. (1978), Specification Searches: Ad Hoc Inference with Nonexperimental Data (Wiley, New York).
Lenk, P. J. (1999), "Bayesian inference for semiparametric regression using a Fourier representation", Journal of the Royal Statistical Society B 61:863-879.
Li, K. (1998), "Bayesian inference in a simultaneous equation model with limited dependent variables", Journal of Econometrics 85:387-400.
Liu, C., D. B. Rubin and Y. N. Wu (1998), "Parameter expansion to accelerate EM: the PX-EM algorithm", Biometrika 85:755-770.
Liu, J. S. (1994), "The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem", Journal of the American Statistical Association 89:958-966.
Liu, J. S., and R. Chen (1998), "Sequential Monte Carlo methods for dynamic systems", Journal of the American Statistical Association 93:1032-1044.
Liu, J. S., W. H. Wong and A. Kong (1994), "Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and data augmentation schemes", Biometrika 81:27-40.
Liu, J. S., W. H. Wong and A. Kong (1995), "Covariance structure and convergence rate of the Gibbs sampler with various scans", Journal of the Royal Statistical Society B 57:157-169.
Louis, T. A. (1982), "Finding the observed information matrix when using the EM algorithm", Journal of the Royal Statistical Society B 44:226-232.
Mallick, B., and A. E. Gelfand (1994), "Generalized linear models with unknown link function", Biometrika 81:237-246.
Mallick, B., and A. E. Gelfand (1996), "Semiparametric errors-in-variables models: a Bayesian approach", Journal of Statistical Planning and Inference 52:307-321.
Manchanda, P., A. Ansari and S. Gupta (1999), "The "shopping basket": a model for multicategory purchase incidence decisions", Marketing Science 18:95-114.




Marinari, E., and G. Parisi (1992), "Simulated tempering: a new Monte Carlo scheme", Europhysics Letters 19:451-458.
Martin, G. (1999), "US deficit sustainability: a new approach based on multiple endogenous breaks", Journal of Applied Econometrics, in press.
McCulloch, R. E., and R. Tsay (1994), "Statistical analysis of macroeconomic time series via Markov switching models", Journal of Time Series Analysis 15:523-539.
Meng, X.-L., and W. H. Wong (1996), "Simulating ratios of normalizing constants via a simple identity: a theoretical exploration", Statistica Sinica 6:831-860.
Mengersen, K. L., and R. L. Tweedie (1996), "Rates of convergence of the Hastings and Metropolis algorithms", Annals of Statistics 24:101-121.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller (1953), "Equation of state calculations by fast computing machines", Journal of Chemical Physics 21:1087-1092.
Meyn, S. P., and R. L. Tweedie (1993), Markov Chains and Stochastic Stability (Springer, London).
Meyn, S. P., and R. L. Tweedie (1994), "Computable bounds for convergence rates of Markov chains", Annals of Applied Probability 4:981-1011.
Mira, A., and L. Tierney (1998), "On the use of auxiliary variables in Markov chain Monte Carlo methods", Technical report (University of Minnesota).
Montgomery, A. L., and P. E. Rossi (1999), "Estimating price elasticities with theory-based priors", Journal of Marketing Research 36:413-423.
Muller, P., and D. R. Insua (1998), "Issues in Bayesian analysis of neural network models", Neural Computation 10:749-770.
Muller, P., A. Erkanli and M. West (1996), "Curve fitting using multivariate normal mixtures", Biometrika 83:63-79.
Nandram, B., and M.-H. Chen (1996), "Accelerating Gibbs sampler convergence in the generalized linear models via a reparameterization", Journal of Statistical Computation and Simulation 54:129-144.
Newton, M. A., and A. E. Raftery (1994), "Approximate Bayesian inference by the weighted likelihood bootstrap (with discussion)", Journal of the Royal Statistical Society B 56:1-48.
Nummelin, E. (1984), General Irreducible Markov Chains and Non-Negative Operators (Cambridge University Press, Cambridge).
O'Hagan, A. (1994), Kendall's Advanced Theory of Statistics, Vol. 2B, Bayesian Inference (Halsted Press, New York).
Paap, R., and H. K. van Dijk (1999), "Bayes estimates of Markov trends in possibly cointegrated series: an application to US consumption and income", Manuscript (RIBES, Erasmus University).
Pastor, L., and R. F. Stambaugh (1999), "Costs of equity capital and model mispricing", Journal of Finance 54:67-121.
Patz, R. J., and B. W. Junker (1999), "A straightforward approach to Markov chain Monte Carlo methods for item response models", Journal of Educational and Behavioral Statistics 24:146-178.
Percy, D. F. (1992), "Prediction for seemingly unrelated regressions", Journal of the Royal Statistical Society B 54:243-252.
Pitt, M. K., and N. Shephard (1997), "Analytic convergence rates and parameterization issues for the Gibbs sampler applied to state space models", Journal of Time Series Analysis 20:63-85.
Pitt, M. K., and N. Shephard (1999), "Filtering via simulation: auxiliary particle filters", Journal of the American Statistical Association 94:590-599.
Poirier, D. J. (1995), Intermediate Statistics and Econometrics: A Comparative Approach (MIT Press, Cambridge).
Polson, N. G. (1996), "Convergence of Markov chain Monte Carlo algorithms", in: J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds., Proceedings of the Fifth Valencia International Conference on Bayesian Statistics (Oxford University Press, Oxford) 297-323.
Propp, J. G., and D. B. Wilson (1996), "Exact sampling with coupled Markov chains and applications to statistical mechanics", Random Structures and Algorithms 9:223-252.



Quinn, K. M., A. D. Martin and A. B. Whitford (1999), "Voter choice in multi-party democracies: a test of competing theories and models", American Journal of Political Science 43:1231-1247.
Raftery, A. E., and S. M. Lewis (1992), "How many iterations in the Gibbs sampler?", in: J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds., Proceedings of the Fourth Valencia International Conference on Bayesian Statistics (Oxford University Press, New York) 763-774.
Raftery, A. E., D. Madigan and J. A. Hoeting (1997), "Bayesian model averaging for linear regression models", Journal of the American Statistical Association 92:179-191.
Richardson, S., and P. J. Green (1997), "On Bayesian analysis of mixtures with an unknown number of components (with discussion)", Journal of the Royal Statistical Society B 59:731-792.
Ripley, B. (1987), Stochastic Simulation (Wiley, New York).
Ritter, C., and M. A. Tanner (1992), "Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs sampler", Journal of the American Statistical Association 87:861-868.
Robert, C. P. (1995), "Convergence control methods for Markov chain Monte Carlo algorithms", Statistical Science 10:231-253.
Robert, C. P., and G. Casella (1999), Monte Carlo Statistical Methods (Springer, New York).
Robert, C. P., G. Celeux and J. Diebolt (1993), "Bayesian estimation of hidden Markov models: a stochastic implementation", Statistics and Probability Letters 16:77-83.
Roberts, G. O., and J. S. Rosenthal (1999), "Convergence of slice sampler Markov chains", Journal of the Royal Statistical Society B 61:643-660.
Roberts, G. O., and S. K. Sahu (1997), "Updating schemes, correlation structure, blocking, and parameterization for the Gibbs sampler", Journal of the Royal Statistical Society B 59:291-317.
Roberts, G. O., and A. F. M. Smith (1994), "Some simple conditions for the convergence of the Gibbs sampler and Metropolis-Hastings algorithms", Stochastic Processes and their Applications 49:207-216.
Roberts, G. O., and R. L. Tweedie (1996), "Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms", Biometrika 83:95-110.
Rosenthal, J. S. (1995), "Minorization conditions and convergence rates for Markov chain Monte Carlo", Journal of the American Statistical Association 90:558-566.
Rubin, D. B. (1988), "Using the SIR algorithm to simulate posterior distributions", in: J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds., Proceedings of the Fourth Valencia International Conference on Bayesian Statistics (Oxford University Press, New York) 395-402.
Shephard, N. (1994), "Partial non-Gaussian state space", Biometrika 81:115-131.
Shephard, N. (1996), "Statistical aspects of ARCH and stochastic volatility", in: D. R. Cox, D. V. Hinkley and O. E. Barndorff-Nielsen, eds., Time Series Models in Econometrics, Finance and Other Fields (Chapman and Hall, London) 1-67.
Shively, T. S., R. Kohn and S. Wood (1999), "Variable selection and function estimation in additive nonparametric regression using a data-based prior", Journal of the American Statistical Association 94:777-794.
Smith, A. (1999), "Testing theories of strategic choice: the example of crisis escalation", American Journal of Political Science 43:1254-1283.
Smith, A. F. M., and G. O. Roberts (1993), "Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods", Journal of the Royal Statistical Society B 55:3-24.
Smith, M., and R. Kohn (1996), "Nonparametric regression using Bayesian variable selection", Journal of Econometrics 75:317-343.
So, M. K. P., K. Lam and W. K. Li (1998), "A stochastic volatility model with Markov switching", Journal of Business and Economic Statistics 16:244-253.
Stephens, D. A. (1994), "Bayesian retrospective multiple-changepoint identification", Applied Statistics 43:159-178.
Tanner, M. A. (1996), Tools for Statistical Inference, 3rd edition (Springer, New York).
Tanner, M. A., and W. H. Wong (1987), "The calculation of posterior distributions by data augmentation", Journal of the American Statistical Association 82:528-549.
Taylor, S. J. (1994), "Modelling stochastic volatility", Mathematical Finance 4:183-204.




Tierney, L. (1994), "Markov chains for exploring posterior distributions (with discussion)", Annals of Statistics 22:1701-1762.
Tierney, L., and J. Kadane (1986), "Accurate approximations for posterior moments and marginal densities", Journal of the American Statistical Association 81:82-86.
Tsionas, E. G. (1999), "Monte Carlo inference in econometric models with symmetric stable disturbances", Journal of Econometrics 88:365-401.
Verdinelli, I., and L. Wasserman (1995), "Computing Bayes factors using a generalization of the Savage-Dickey density ratio", Journal of the American Statistical Association 90:614-618.
Wakefield, J. C., A. F. M. Smith, A. Racine-Poon and A. E. Gelfand (1994), "Bayesian analysis of linear and non-linear population models by using the Gibbs sampler", Applied Statistics 43:201-221.
Waller, L. A., B. P. Carlin, H. Xia and A. E. Gelfand (1997), "Hierarchical spatio-temporal mapping of disease rates", Journal of the American Statistical Association 92:607-617.
Wei, G. C. G., and M. A. Tanner (1990a), "Posterior computations for censored regression data", Journal of the American Statistical Association 85:829-839.
Wei, G. C. G., and M. A. Tanner (1990b), "A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm", Journal of the American Statistical Association 85:699-704.
West, M., R. Prado and A. D. Krystal (1999), "Evaluation and comparison of EEG traces: latent structure in nonstationary time series", Journal of the American Statistical Association 94:375-387.
Wu, C. F. J. (1983), "On the convergence properties of the EM algorithm", Annals of Statistics 11:95-103.
Young, M. R., W. S. DeSarbo and V. G. Morwitz (1998), "The stochastic modeling of purchase intentions and behavior", Management Science 44:188-202.
Zeger, S. L., and M. R. Karim (1991), "Generalized linear models with random effects: a Gibbs sampling approach", Journal of the American Statistical Association 86:79-86.
Zellner, A. (1971), Introduction to Bayesian Inference in Econometrics (Wiley, New York).
Zellner, A., and C. Min (1995), "Gibbs sampler convergence criteria", Journal of the American Statistical Association 90:921-927.
