


arXiv:1011.6217v1 [stat.ME] 29 Nov 2010

Statistical Science
2010, Vol. 25, No. 2, 172–190
DOI: 10.1214/10-STS327
© Institute of Mathematical Statistics, 2010

The Random Walk Metropolis: Linking Theory and Practice Through a Case Study

Chris Sherlock, Paul Fearnhead and Gareth O. Roberts

Abstract. The random walk Metropolis (RWM) is one of the most common Markov chain Monte Carlo algorithms in practical use today. Its theoretical properties have been extensively explored for certain classes of target, and a number of results with important practical implications have been derived. This article draws together a selection of new and existing key results and concepts and describes their implications. The impact of each new idea on algorithm efficiency is demonstrated for the practical example of the Markov modulated Poisson process (MMPP). A reparameterization of the MMPP which leads to a highly efficient RWM-within-Gibbs algorithm in certain circumstances is also presented.

Key words and phrases: Random walk Metropolis, Metropolis–Hastings, MCMC, adaptive MCMC, MMPP.

1. INTRODUCTION

Markov chain Monte Carlo (MCMC) algorithms provide a framework for sampling from a target random variable with a potentially complicated probability distribution π(·) by generating a Markov chain X^{(1)}, X^{(2)}, ... with stationary distribution π(·). The single most widely used subclass of MCMC algorithms is based around the random walk Metropolis (RWM).

Theoretical properties of RWM algorithms for certain special classes of target have been investigated extensively. Reviews of RWM theory have, for example, dealt with optimal scaling and posterior shape (Roberts and Rosenthal, 2001), and convergence (Roberts, 2003). This article does not set out to be a comprehensive review of all theoretical results pertinent to the RWM. Instead the article reviews and develops specific aspects of the theory of RWM efficiency in order to tackle an important and difficult problem: inference for the Markov modulated Poisson process (MMPP). It includes sections on RWM within Gibbs, hybrid algorithms, and adaptive MCMC, as well as optimal scaling, optimal shaping, and convergence. A strong emphasis is placed on developing an intuitive understanding of the processes behind the theoretical results, and then on using these ideas to improve the implementation. All of the RWM algorithms described in this article are tested against datasets arising from MMPPs. Realized changes in efficiency are then compared with theoretical predictions.

Chris Sherlock is Lecturer, Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK e-mail: [email protected]. Paul Fearnhead is Professor, Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK. Gareth O. Roberts is Professor, Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2010, Vol. 25, No. 2, 172–190. This reprint differs from the original in pagination and typographic detail.

Observed event times of an MMPP arise from a Poisson process whose intensity varies with the state of an unobserved continuous-time Markov chain. The MMPP has been used to model a wide variety of clustered point processes, for example, requests for web pages from users of the World Wide Web (Scott and Smyth, 2003), arrivals of photons from single-molecule fluorescence experiments (Burzykowski, Szubiakowski and Ryden, 2003; Kou, Xie and Liu, 2005), and occurrences of a rare DNA motif along a genome (Fearnhead and Sherlock, 2006).

In common with mixture models and other hidden Markov models, inference for the MMPP is greatly complicated by a lack of knowledge of the hidden data. The likelihood function often possesses many minor modes since the data might be approximately described by a hidden process with fewer states. For this same reason the likelihood often does not approach zero as certain combinations of parameters approach zero and/or infinity, and so improper priors lead to improper posteriors (e.g., Sherlock, 2005). Further, as with many hidden data models the likelihood is invariant under permutation of the states, and this "labeling" problem leads to posteriors with several equal modes.

This article focuses on generic concepts and techniques for improving the efficiency of RWM algorithms whatever the statistical model. The MMPP provides a nontrivial testing ground for them. All of the RWM algorithms described in this article are tested against two simulated MMPP datasets with very different characteristics. This allows us to demonstrate the influence on performance of posterior attributes such as shape and orientation near the mode and lightness or heaviness of tails.

Section 2 introduces RWM algorithms and then describes theoretical and practical measures of algorithm efficiency. Next the two main theoretical approaches to determining efficiency are described, and the section ends with a brief overview of the MMPP and a description of the data analyzed in this article. Section 3 introduces a series of concepts which allow potential improvements in the efficiency of a RWM algorithm. The intuition behind each concept is described, followed by theoretical justification and then details of one or more RWM algorithms motivated by the theory. Actual results are described and compared with theoretical predictions in Section 4, and the article is summarized in Section 5.

2. BACKGROUND

In this section we introduce the background material on which the remainder of this article draws. We describe the random walk Metropolis algorithm and a variation, the random walk Metropolis-within-Gibbs. Both practical issues and theoretical approaches to algorithm efficiency are then discussed. We conclude with an introduction to the Markov modulated Poisson process and to the datasets used later in the article.

2.1 Random Walk Metropolis Algorithms

The random walk Metropolis (RWM) updating scheme was first applied by Metropolis et al. (1953) and proceeds as follows. Given a current value of the d-dimensional Markov chain, X, a new value X^* is obtained by proposing a jump Y^* := X^* − X from the prespecified Lebesgue density

r(y^*; λ) := (1/λ^d) r(y^*/λ),   (1)

with r(y) = r(−y) for all y. Here λ > 0 governs the overall size of the proposed jump and (see Section 3.1) plays a crucial role in determining the efficiency of any algorithm. The proposal is then accepted or rejected according to acceptance probability

α(x, y^*) = min(1, π(x + y^*)/π(x)).   (2)

If the proposed value is accepted it becomes the next current value (X′ ← X + Y^*); otherwise the current value is left unchanged (X′ ← X).

An intuitive interpretation of the above formula is that "uphill" proposals (proposals which take the chain closer to a local mode) are always accepted, whereas "downhill" proposals are accepted with probability exactly equal to the relative "heights" of the posterior at the proposed and current values. It is precisely this rejection of some "downhill" proposals which acts to keep the Markov chain in the main posterior mass most of the time.

More formally, denote by P(x, ·) the transition kernel of the chain, which represents the combined process of proposal and acceptance/rejection leading from one element of the chain (x) to the next. The acceptance probability (2) is chosen so that the chain is reversible at equilibrium with stationary distribution π(·). Reversibility [that π(x)P(x, x′) = π(x′)P(x′, x)] is an important property precisely because it is so easy to construct reversible chains which have a prespecified stationary distribution. It is also possible to prove a slightly stronger central limit theorem for reversible (as opposed to nonreversible) geometrically ergodic chains (e.g., Section 2.2.1).
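The updating scheme above is compact enough to state in code. The following sketch (a minimal illustration in Python; the spherical Gaussian proposal and the standard Gaussian target are our choices, not the article's) implements proposal (1) and acceptance step (2), working on the log scale for numerical stability:

```python
import numpy as np

def rwm(log_pi, x0, n_iter, lam, rng):
    """Random walk Metropolis with a spherical Gaussian proposal.

    log_pi : function returning log pi(x) up to an additive constant
    lam    : overall proposal scale, lambda in equation (1)
    """
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    d = x.size
    chain = np.empty((n_iter, d))
    log_px = log_pi(x)
    accepted = 0
    for i in range(n_iter):
        y = lam * rng.standard_normal(d)        # proposed jump Y* ~ r(.; lambda)
        log_py = log_pi(x + y)
        # accept with probability min(1, pi(x+y)/pi(x)), equation (2)
        if np.log(rng.uniform()) < log_py - log_px:
            x = x + y
            log_px = log_py
            accepted += 1
        chain[i] = x
    return chain, accepted / n_iter

# Illustration on a one-dimensional standard Gaussian target:
rng = np.random.default_rng(1)
log_pi = lambda x: -0.5 * np.sum(x * x)
chain, acc = rwm(log_pi, 0.0, 20000, 2.4, rng)
```

With λ = 2.4 the acceptance rate for this one-dimensional target is roughly 0.44, which is close to optimal in the sense discussed in Section 3.1.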


We now describe a generalization of the RWM which acts on a target whose components have been split into k sub-blocks. In general we write X = (X_1, ..., X_k), where X_i is the ith sub-block of components of the current element of the chain. Starting from value X, a single iteration of this algorithm cycles through all of the sub-blocks, updating each in turn. It will therefore be convenient to define the shorthand

x_i^{(B)} := (x′_1, ..., x′_{i−1}, x_i, x_{i+1}, ..., x_k),
x_i^{(B)*} := (x′_1, ..., x′_{i−1}, x_i + y^*_i, x_{i+1}, ..., x_k),

where x′_j is the updated value of the jth sub-block. For the ith sub-block a jump Y^*_i is proposed from symmetric density r_i(y; λ_i) and accepted or rejected according to acceptance probability min(1, π(x_i^{(B)*})/π(x_i^{(B)})). Since this algorithm is in fact a generalization of both the RWM and the Gibbs sampler (for a description of the Gibbs sampler see, e.g., Gamerman and Lopes, 2006) we follow, for example, Neal and Roberts (2006) and call this the random walk Metropolis-within-Gibbs or RWM-within-Gibbs.

The most commonly used random walk Metropolis-within-Gibbs algorithm, and also the simplest, is that employed in this article: here all blocks have dimension 1, so that each component of the parameter vector is updated in turn.

As mentioned earlier in this section, the RWM is reversible; but even though each stage of the RWM-within-Gibbs is reversible, the algorithm as a whole is not. Reversible variations include the random scan RWM-within-Gibbs, wherein at each iteration a single component is chosen at random and updated conditional on all the other components.

Convergence of the Markov chain to its stationary distribution can be guaranteed for all of the above algorithms under quite general circumstances (e.g., Gilks, Richardson and Spiegelhalter, 1996).
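The one-dimensional-block version used in this article can be sketched as follows (the correlated bivariate Gaussian target and the proposal scales are our illustrative choices, not the article's):

```python
import numpy as np

def rwm_within_gibbs(log_pi, x0, n_iter, lams, rng):
    """Deterministic-scan RWM-within-Gibbs with blocks of dimension 1:
    each component is updated in turn with its own scale lams[i]."""
    x = np.asarray(x0, dtype=float).copy()
    d = x.size
    chain = np.empty((n_iter, d))
    for it in range(n_iter):
        for i in range(d):
            prop = x.copy()
            prop[i] += lams[i] * rng.standard_normal()   # jump for block i
            # accept with probability min(1, pi(prop)/pi(x))
            if np.log(rng.uniform()) < log_pi(prop) - log_pi(x):
                x = prop
        chain[it] = x
    return chain

# Illustration: a bivariate Gaussian target with correlation 0.5
rho = 0.5
prec = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
log_pi = lambda x: -0.5 * x @ prec @ x
rng = np.random.default_rng(2)
chain = rwm_within_gibbs(log_pi, np.zeros(2), 20000, [2.4, 2.4], rng)
```

A random scan variant would replace the inner loop over i with a single randomly chosen component per iteration.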

2.2 Algorithm Efficiency

Consecutive draws of an MCMC Markov chain are correlated and the sequence of marginal distributions converges to π(·). Two main (and related) issues arise with regard to the efficiency of MCMC algorithms: convergence and mixing.

2.2.1 Convergence. In this article we will be concerned with practical determination of a point at which a chain has converged. The method we employ is simple heuristic examination of the trace plots for the different components of the chain. Note that since the state space is multidimensional it is not sufficient to simply examine a single component. Alternative techniques are discussed in Chapter 7 of the book by Gilks, Richardson and Spiegelhalter (1996).

Theoretical criteria for ensuring convergence (ergodicity) of MCMC Markov chains are examined in detail in Chapters 3 and 4 of the book by Gilks, Richardson and Spiegelhalter (1996) and references therein, and will not be discussed here. We do, however, wish to highlight the concepts of geometric and polynomial ergodicity. A Markov chain with transition kernel P is geometrically ergodic with stationary distribution π(·) if

‖P^n(x, ·) − π(·)‖_1 ≤ M(x) r^n   (3)

for some positive r < 1 and M(·) ≥ 0; if M(·) is bounded above, then the chain is uniformly ergodic. Here ‖F(·) − G(·)‖_1 denotes the total variational distance between measures F(·) and G(·) (see, e.g., Meyn and Tweedie, 1993), and P^n is the n-step transition kernel. Efficiency of a geometrically ergodic algorithm is measured by the geometric rate of convergence, r, which over a large number of iterations is well approximated by the second largest eigenvalue of the transition kernel [the largest eigenvalue being 1, and corresponding to the stationary distribution π(·)]. Geometric ergodicity is usually a purely qualitative property since in general the constants M(x) and r are not known. Crucially for practical MCMC, however, any geometrically ergodic reversible Markov chain satisfies a central limit theorem for all functions with finite second moment with respect to π(·). Thus there is a σ_f^2 < ∞ such that

n^{1/2} (f̄_n − E_π[f(X)]) ⇒ N(0, σ_f^2),   (4)

where ⇒ denotes convergence in distribution. The central limit theorem (4) not only guarantees convergence of the Monte Carlo estimate (5) but also supplies its standard error, which decreases as n^{−1/2}.

When the second largest eigenvalue is also 1, a Markov chain is termed polynomially ergodic if

‖P^n(x, ·) − π(·)‖_1 ≤ M(x) n^{−r}.

Clearly polynomial ergodicity is a weaker condition than geometric ergodicity. Central limit theorems for polynomially ergodic MCMC are much more delicate; see the article by Jarner and Roberts (2002) for details.


In this article a chain is referred to as having "reached stationarity" or "converged" when the distribution from which an element is sampled is as close to the stationary distribution as to make no practical difference to any Monte Carlo estimates.

An estimate of the expectation of a given function f(X), which is more accurate than a naive Monte Carlo average over all the elements of the chain, is likely to be obtained by discarding the portion of the chain X_0, ..., X_m up until the point at which it was deemed to have reached stationarity; iterations 1, ..., m are commonly termed "burn in." Using only the remaining elements X_{m+1}, ..., X_{m+n} (with m + n = N) our Monte Carlo estimator becomes

f̄_n := (1/n) Σ_{i=m+1}^{m+n} f(X_i).   (5)

Convergence and burn in are not discussed any further here, and for the rest of this section the chain is assumed to have started at stationarity and continued for n further iterations.

2.2.2 Practical measures of mixing efficiency. For a stationary chain, X_0 is sampled from π(·), and so for all k > 0 and i ≥ 0

Cov[f(X_k), f(X_{k+i})] = Cov[f(X_0), f(X_i)].

This is the autocovariance at lag i. Therefore at stationarity, from the definition in (4),

σ_f^2 := lim_{n→∞} n Var[f̄_n] = Var[f(X_0)] + 2 Σ_{i=1}^{∞} Cov[f(X_0), f(X_i)]

provided the sum exists (e.g., Geyer, 1992). If elements of the stationary chain were independent, then σ_f^2 would simply be Var[f(X_0)], and so a measure of the inefficiency of the Monte Carlo estimate f̄_n relative to the perfect i.i.d. sample is

σ_f^2 / Var[f(X_0)] = 1 + 2 Σ_{i=1}^{∞} Corr[f(X_0), f(X_i)].   (6)

This is the integrated autocorrelation time (ACT) and represents the effective number of dependent samples that is equivalent to a single independent sample. Alternatively n* = n/ACT may be regarded as the effective equivalent sample size if the elements of the chain had been independent.

To estimate the ACT in practice one might examine the chain from the point at which it is deemed to have converged and estimate the lag-i autocorrelation Corr[f(X_0), f(X_i)] by

γ_i = (1/(n − i)) Σ_{j=1}^{n−i} (f(X_j) − f̄_n)(f(X_{j+i}) − f̄_n).   (7)

Naively, substituting these into (6) gives an estimate of the ACT. However, contributions from all terms with very low theoretical autocorrelation in a real run are effectively random noise, and the sum of such terms can dominate the deterministic effect in which we are interested (e.g., Geyer, 1992). For this article we employ the simple solution suggested by Carlin and Louis (2009): the sum (6) is truncated from the first lag, l, for which the estimated autocorrelation drops below 0.05. This gives the (slightly biased) estimator

ACT_est := 1 + 2 Σ_{i=1}^{l−1} γ_i.   (8)
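The truncated window estimator (7)–(8) is straightforward to implement. In the sketch below (our own illustration) the autocovariances in (7) are scaled by the sample variance so that they estimate autocorrelations, as (6) requires:

```python
import numpy as np

def act_est(f, trunc=0.05):
    """Window estimator (8): sum lag autocorrelations, stopping at the
    first lag whose estimate falls below `trunc`."""
    f = np.asarray(f, dtype=float)
    n = f.size
    fbar = f.mean()
    var = np.mean((f - fbar) ** 2)
    act = 1.0
    for i in range(1, n):
        # lag-i estimate, equation (7), divided by the sample variance
        rho_i = np.mean((f[:n - i] - fbar) * (f[i:] - fbar)) / var
        if rho_i < trunc:
            break                      # first lag l below 0.05: stop
        act += 2.0 * rho_i
    return act

# For independent draws the ACT should be close to 1:
rng = np.random.default_rng(3)
act_iid = act_est(rng.standard_normal(20000))
```

For a well-mixing chain the loop exits after a few lags; for a sticky chain many lags contribute and the estimate grows accordingly.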

Given the potential for relatively large variance in estimates of integrated ACT howsoever they might be obtained (e.g., Sokal, 1997), this simple estimator should be adequate for comparing the relative efficiencies of the different algorithms in this article. Geyer (1992) provided a number of more complex window estimators and provided references for regularity conditions under which they are consistent.

A given run will have a different ACT associated with each parameter. An alternative efficiency measure, which is aggregated over all parameters, is provided by the Mean Squared Euclidean Jump Distance (MSEJD)

S_Euc^2 := (1/(n − 1)) Σ_{i=1}^{n−1} ‖x^{(i+1)} − x^{(i)}‖_2^2.

The expectation of this quantity at stationarity is referred to as the Expected Squared Euclidean Jump Distance (ESEJD). Consider a single component of the target with variance σ_i^2 := Var[X_i] = Var[X′_i], and note that E[X′_i − X_i] = 0, so

E[(X′_i − X_i)^2] = Var[X′_i − X_i] = 2σ_i^2 (1 − Corr[X_i, X′_i]).

Thus when the chain is stationary and the posterior variance is finite, maximizing the ESEJD is equivalent to minimizing a weighted sum of the lag-1 autocorrelations.

If the target has finite second moments and is roughly elliptical in shape with (known) covariance matrix Σ, then an alternative measure of efficiency is the Mean Squared Jump Distance (MSJD)

S_d^2 := (1/(n − 1)) Σ_{i=1}^{n−1} (x^{(i+1)} − x^{(i)})^T Σ^{−1} (x^{(i+1)} − x^{(i)}),

which is proportional to the unweighted sum of the lag-1 autocorrelations over the principal components of the ellipse. The theoretical expectation of the MSJD at stationarity is known as the expected squared jump distance (ESJD).

Figure 1 shows traceplots for three different Markov chains. Estimates of the autocorrelation from lag-0 to lag-40 for each Markov chain appear alongside the corresponding traceplot. The simple window estimator for integrated ACT provides estimates of, respectively, 39.7, 5.5, and 35.3. The MSEJDs are, respectively, 0.027, 0.349, and 0.063, and are equal to the MSJDs since the stationary distribution has a variance of 1.
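Both jump-distance diagnostics can be computed directly from a stored chain. The sketch below (the array layout and the i.i.d. example chain are our illustrative choices) mirrors the definitions of S_Euc^2 and S_d^2:

```python
import numpy as np

def msejd(chain):
    """Mean squared Euclidean jump distance S_Euc^2 of a stored chain
    (an n x d array of successive states)."""
    diffs = np.diff(chain, axis=0)
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

def msjd(chain, sigma):
    """Mean squared jump distance S_d^2 for (known) target covariance
    matrix sigma."""
    diffs = np.diff(chain, axis=0)
    prec = np.linalg.inv(sigma)
    # quadratic form diffs[i]^T Sigma^{-1} diffs[i] for each jump
    return float(np.mean(np.einsum('ij,jk,ik->i', diffs, prec, diffs)))

# For a perfectly mixing "chain" of i.i.d. N(0,1) draws the ESEJD is
# E[(X' - X)^2] = 2 Var[X] = 2; a sticky RWM chain gives a smaller value.
rng = np.random.default_rng(4)
iid_chain = rng.standard_normal((20000, 1))
s2_euc = msejd(iid_chain)
```

When Σ is the identity the two measures coincide, as in the Figure 1 example where the stationary variance is 1.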

2.2.3 Assessing accuracy. An MCMC algorithm might efficiently explore an unimportant part of the parameter space and never find the main posterior mass. ACTs will be low, therefore, but the resulting posterior estimate will be wildly inaccurate. In most practical examples it is not possible to determine the accuracy of the posterior estimate, though consistency between several independent runs or between different portions of the same run can be tested.

For the purposes of this article it was important to have a relatively accurate estimate of the posterior, not determined by a RWM algorithm. Fearnhead and Sherlock (2006) detailed a Gibbs sampler for the MMPP; this Gibbs sampler was run for 100,000 iterations on each of the datasets analyzed in this article. A "burn in" of 1000 iterations was allowed for, and a posterior estimate from the last 99,000 iterations was used as a reference for comparison with posterior estimates from RWM runs of 10,000 iterations (after burn in).

2.2.4 Theoretical approaches for algorithm efficiency. To date, theoretical results on the efficiency of RWM algorithms have been obtained through two very different approaches. We wish to quote, explain, and apply theory from both, and so we give a heuristic description of each and define associated notation. Both approaches link some measure of efficiency to the expected acceptance rate: the expected proportion of proposals accepted at stationarity.

The first approach was pioneered by Roberts, Gelman and Gilks (1997) for targets with independent identically distributed components and then generalized by Roberts and Rosenthal (2001) to targets of the form

π(x) = ∏_{i=1}^{d} C_i f(C_i x_i).

The inverse scale parameters, C_i, are assumed to be drawn from some distribution with a given (finite) mean and variance. A single component of the d-dimensional chain (without loss of generality the first) is then examined; at iteration i of the algorithm it is denoted X^{(d)}_{1,i}. A scaleless, speeded up, continuous-time process which mimics the first component of the chain is defined as

W^{(d)}_t := C_1 X^{(d)}_{1,[td]},

where [u] denotes the nearest integer less than or equal to u. Finally, proposed jumps are assumed to be Gaussian:

Y^{(d)} ∼ N(0, λ_d^2 I).

Subject to conditions on the first two derivatives of f(·), Roberts and Rosenthal (2001) showed that if E[C_i] = 1 and E[C_i^2] = b, and provided λ_d = µ/d^{1/2} for some fixed µ (the scale parameter but "rescaled" according to dimension), then as d → ∞, W^{(d)}_t approaches a Langevin diffusion process with speed

h(µ) = (C_1^2 µ^2 / b) α_d,   where α_d := 2Φ(−(1/2) µ J^{1/2}).   (9)

Here Φ(x) is the cumulative distribution function of a standard Gaussian, J := E[((log f)′)^2] is a measure of the roughness of the target, and α_d corresponds to the acceptance rate.

Bedard (2007) proved a similar result for a triangular sequence of inverse scale parameters c_{i,d}, which are assumed to be known. A necessary and sufficient condition equivalent to (11) below is attached to this result. In effect this requires the scale over which the smallest component varies to be "not too much smaller" than the scales of the other components.

Fig. 1. Traceplots [(a), (b), and (c)] and corresponding autocorrelation plots [(d), (e), and (f)], for exploration of a standard Gaussian initialized from x = 0 and using the random walk Metropolis algorithm with Gaussian proposal for 1000 iterations. Proposal scale parameters for the three scenarios are, respectively, (a) and (d) 0.24, (b) and (e) 2.4, and (c) and (f) 24.

The second technique (e.g., Sherlock and Roberts, 2009) uses expected squared jump distance (ESJD) as a measure of efficiency. Exact analytical forms for ESJD (denoted S_d^2) and expected acceptance rate are derived for any unimodal elliptically symmetric target and any proposal density. Many standard sequences of d-dimensional targets (d = 1, 2, ...), such as the Gaussian, satisfy the condition that as d → ∞ the probability mass becomes concentrated in a spherical shell which itself becomes infinitesimally thin relative to its radius. Thus the random walk on a rescaling of the target is, in the limit, effectively confined to the surface of this shell. Sherlock and Roberts (2009) considered a sequence of targets which satisfies such a "shell" condition, and a sequence of proposals which satisfies a slightly stronger condition. Specifically it is required that there exist sequences of positive real numbers, {k_x^{(d)}} and {k_y^{(d)}}, such that

‖X^{(d)}‖ / k_x^{(d)} →_p 1   and   ‖Y^{(d)}‖ / (λ_d k_y^{(d)}) →_{m.s.} 1.

For such combinations of target and proposal, as d → ∞,

(d / (k_x^{(d)})^2) S_d^2(µ) → µ^2 α_d   with α_d(µ) := 2Φ(−µ/2).   (10)

Here α_d is the limiting expected acceptance rate, and µ := d^{1/2} λ_d k_y^{(d)} / k_x^{(d)}. For target and proposal distributions with independent components, such as are used in the diffusion results, k_x^{(d)} = k_y^{(d)} = d^{1/2}, and hence (consistently) µ = d^{1/2} λ_d.


It is also required that the elliptical target not be too eccentric. Specifically, for a sequence of target densities π_d(x) := f_d(Σ_{i=1}^{d} c_{i,d}^2 x_i^2) (for some appropriate sequence of functions {f_d}),

max_i c_{i,d}^2 / Σ_{i=1}^{d} c_{i,d}^2 → 0   as d → ∞.   (11)

Theoretical results from the two techniques are remarkably similar and, as will be seen, lead to identical strategies for optimizing algorithm efficiency. It is worth noting, however, that results from the first approach apply only to targets with independent components, and results from the second only to targets which are unimodal and elliptically symmetric. That they lead to identical strategies indicates a certain potential robustness of these strategies to the form of the target. This potential, as we shall see, is borne out in practice.

2.3 The Markov Modulated Poisson Process

Let X_t be a continuous-time Markov chain on discrete state space {1, ..., d} and let ψ := [ψ_1, ..., ψ_d] be a d-dimensional vector of (nonnegative) intensities. The linked but stochastically independent Poisson process Y_t whose intensity is ψ_{X_t} is a Markov modulated Poisson process: it is a Poisson process whose intensity is modulated by a continuous-time Markov chain.

The idea is best illustrated through two examples, which also serve to introduce the notation and datasets that will be used throughout this article. Consider a two-dimensional Markov chain X_t with generator Q with q_12 = q_21 = 1.

Figure 2 shows realizations from two such chains over a period of 10 seconds. Now consider a Poisson process Y_t which has intensity 10 when X_t is in state 1 and intensity 30 when X_t is in state 2. This is an MMPP with event intensity vector ψ = [10, 30]. A realization (obtained via the realization of X_t) is shown as a rug plot underneath the chain in the upper graph. The lower graph shows a realization from an MMPP with event intensities [10, 17]. It can be shown (e.g., Fearnhead and Sherlock,

2006) that the likelihood for data from an MMPP which starts from a distribution ν over its states is

L(Q, Ψ, t) = ν′ e^{(Q−Ψ)t_1} Ψ ··· e^{(Q−Ψ)t_n} Ψ e^{(Q−Ψ)t_{n+1}} 1.   (12)

Here Ψ := diag(ψ), 1 is a vector of 1's, n is the number of observed events, t_1 is the time from the start of the observation window until the first event, t_{n+1} is the time from the last event until the end of the observation window, and t_i (2 ≤ i ≤ n) is the time between the (i − 1)th and ith events. In the absence of further information, the initial distribution ν is often taken to be the stationary distribution of the underlying Markov chain.

Fig. 2. Two 2-state continuous-time Markov chains simulated for 10 seconds from generator Q with q_12 = q_21 = 1; the rug plots show events from an MMPP simulated from these chains, with intensity vectors ψ = [10, 30] (upper graph) and ψ = [10, 17] (lower graph).

The likelihood of an MMPP is invariant to a relabeling of the states. Hence if the prior is similarly invariant, then so too is the posterior: if the posterior for a two-dimensional MMPP has a mode at (ψ_1, ψ_2, q_12, q_21), then it has an identical mode at (ψ_2, ψ_1, q_21, q_12). In this article our overriding interest is in the efficiency of the MCMC algorithms rather than the exact meaning of the parameters, and so we choose the simplest solution to this identifiability problem: the state with the lower Poisson intensity ψ is always referred to as state 1.
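The likelihood (12) is a product of matrix exponentials and so can be evaluated in a few lines. In the sketch below the event times, the initial distribution, and the toy expm routine (a simple stand-in for a library routine such as scipy.linalg.expm) are our own illustrative choices; only the formula itself comes from the article:

```python
import numpy as np

def expm(a, squarings=20, terms=18):
    """Matrix exponential by Taylor series with scaling and squaring
    (adequate for the small matrices used here)."""
    b = a / (2.0 ** squarings)
    out = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms):
        term = term @ b / k
        out = out + term
    for _ in range(squarings):
        out = out @ out
    return out

def mmpp_likelihood(q, psi, event_times, t_end, nu):
    """Likelihood (12): nu' e^{(Q-Psi)t_1} Psi ... Psi e^{(Q-Psi)t_{n+1}} 1,
    for ordered event times observed on [0, t_end]."""
    big_psi = np.diag(psi)
    a = q - big_psi
    vec = np.asarray(nu, dtype=float)
    prev = 0.0
    for t in event_times:                      # intervals t_1, ..., t_n
        vec = vec @ expm(a * (t - prev)) @ big_psi
        prev = t
    vec = vec @ expm(a * (t_end - prev))       # final interval t_{n+1}
    return float(vec.sum())                    # the product with 1

# Hypothetical event times, with the generator and intensities of dataset D1:
q = np.array([[-1.0, 1.0], [1.0, -1.0]])
lik = mmpp_likelihood(q, np.array([10.0, 30.0]),
                      [0.12, 0.31, 0.35, 0.77], 1.0, [0.5, 0.5])
```

A useful sanity check: when all entries of ψ equal some common value the MMPP collapses to an ordinary Poisson process, and the likelihood reduces to ψ^n e^{−ψ t_end} whatever the generator Q.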

2.3.1 MMPP data in this article. The two datasets of event times used in this article arose from two independent MMPPs simulated over an observation window of 100 seconds. Both underlying Markov chains have q_12 = q_21 = 1; dataset D1 has event intensity vector ψ = [10, 30] whereas dataset D2 has ψ = [10, 17], so that the overall intensity of events in D2 is lower than in D1. As mentioned in Section 2.2.3, a posterior sample from a long run of the Gibbs sampler of Fearnhead and Sherlock (2006) was used to approximate the true posterior. Figure 3 shows estimates of the marginal posterior distribution for (ψ_1, ψ_2) and for (ψ_1, q_12) for D1 (top) and D2 (bottom).


Fig. 3. Estimated marginal posteriors for ψ_1 and ψ_2 and for ψ_1 and q_12 from long runs of the Gibbs sampler for datasets D1 (top) and D2 (bottom).

Because the difference in intensity between the states is so much larger in D1 than in D2, it is easier with D1 than D2 to distinguish the state of the underlying Markov chain, and thus the values of the Markov and Poisson parameters. Further, in the limit of the underlying chain being known precisely, for example as ψ_2 → ∞ with ψ_1 finite, and provided the priors are independent, the posteriors for the Poisson intensity parameters ψ_1 and ψ_2 are completely independent of each other and of the Markov parameters q_12 and q_21. Dependence between the Markov parameters is also small, being O(1/T) (e.g., Fearnhead and Sherlock, 2006).

In Section 4, differences between D1 and D2 will be related directly to observed differences in efficiency of the various RWM algorithms between the two datasets.

3. IMPLEMENTATIONS OF THE RWM: THEORY AND PRACTICE

This section describes several theoretical results for the RWM or for MCMC in general. Intuitive explanation of the principle behind each result is emphasized and the manner in which it informs the RWM implementation is made clear. Each algorithm was run three times on each of the two datasets.

3.1 Optimal Scaling of the RWM

Intuition: Consider the behavior of the RWM as a function of the overall scale parameter of the proposed jump, λ, in (1). If most proposed jumps are small compared with some measure of the scale of variability of the target distribution, then, although these jumps will often be accepted, the chain will move slowly and exploration of the target distribution will be relatively inefficient. If the jumps proposed are relatively large compared with the target distribution's scale, then many will not be accepted, the chain will rarely move, and will again explore the target distribution inefficiently. This suggests that given a particular target and form for the jump proposal distribution, there may exist a finite scale parameter for the proposal with which the algorithm will explore the target as efficiently as possible. These ideas are clearly demonstrated in Figure 1, which shows traceplots for a one-dimensional Gaussian target explored using a Gaussian proposal with scale parameter an order of magnitude smaller (a) and larger (c) than is optimal, and (b) with a close to optimal scale parameter.

Theory: Equation (9) gives algorithm efficiency for

a target with independent and identical (up to a scaling) components as a function of the "rescaled" scale parameter µ = d^{1/2} λ_d of a Gaussian proposal. Equation (10) gives algorithm efficiency for a unimodal elliptically symmetric target explored by a spherically symmetric proposal with µ = d^{1/2} λ_d k_y^{(d)} / k_x^{(d)}. Efficiencies are therefore optimal at µ ≈ 2.38/J^{1/2} and µ ≈ 2.38, respectively. These correspond to actual scale parameters of, respectively,

    λ_d = 2.38 / (J^{1/2} d^{1/2})   and   λ_d = 2.38 k_x^{(d)} / (d^{1/2} k_y^{(d)}).

The equivalence between these two expressions for Gaussian data explored with a Gaussian target is clear from Section 2.2.4. However, the equations offer little direct help in choosing a scale parameter for a target which is neither elliptical nor possesses components which are i.i.d. up to a scale parameter. Substitution of each expression into the corresponding acceptance rate equation, however, leads to the same optimal acceptance rate, α ≈ 0.234. This justifies the relatively well-known adage that for random walk algorithms with a large number of parameters, the scale parameter of the proposal should be chosen so that the acceptance rate is approximately 0.234.


On a graph of asymptotic efficiency against acceptance rate (e.g., Roberts and Rosenthal, 2001), the curvature near the mode is slight, especially to its right, so that an acceptance rate of anywhere between 0.2 and 0.3 should lead to an algorithm of close to optimal efficiency.

In practice updates are performed on a finite number of parameters; for example, a two-dimensional MMPP has four parameters (ψ1, ψ2, q12, q21). A block update involves all of these, while each update of a simple Metropolis-within-Gibbs step involves just one parameter. In finite dimensions the optimal acceptance rate can in fact take any value between 0 and 1. Sherlock and Roberts (2009) provided analytical formulas for calculating the ESJD and the expected acceptance rate for any proposal and any elliptically symmetric unimodal target. In one dimension, for example, the optimal acceptance rate for a Gaussian target explored by a Gaussian proposal is 0.44, while the optimum for a Laplace target (π(x) ∝ e^{−|x|}) explored with a Laplace proposal is exactly α = 1/3. Sherlock (2006) considered several simple examples of spherically symmetric proposal and target across a range of dimensions and found that in all cases curvature at the optimal acceptance rate is small, so that a range of acceptance rates is nearly optimal. Further, the optimal acceptance rate is itself between 0.2 and 0.3 for d ≥ 6 in all the cases considered.

Sherlock and Roberts (2009) also weakened the

"shell" condition of Section 2.2.4 and considered sequences of spherically symmetric targets for which the (rescaled) radius converges to some random variable R rather than a point mass at 1. It is shown that, provided the sequence of proposals still satisfies the shell condition, the limiting optimal acceptance rate is strictly less than 0.234. Acceptance rate tuning should thus be seen as only a guide, though a guide which has been found to be robust in practice.

Algorithm 1 (Blk). The first algorithm (Blk) used to explore datasets D1 and D2 is a four-dimensional block updating RWM with proposal Y ∼ N(0, λ²I) and λ tuned so that the acceptance rate is approximately 0.3.
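The tuning loop behind an algorithm such as Blk can be sketched as follows. This is an illustrative Python sketch, not code from the study: the MMPP log-posterior is replaced by a stand-in correlated Gaussian log-density, and the names (`rwm_block`, `log_target`) are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # Stand-in log-density (a correlated 2-d Gaussian); in the paper's
    # setting this would be the MMPP log-posterior instead.
    cov_inv = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 2.0]]))
    return -0.5 * x @ cov_inv @ x

def rwm_block(log_target, x0, lam, n_iter):
    """Block RWM with proposal Y ~ N(0, lam^2 I)."""
    x = np.asarray(x0, dtype=float)
    lp = log_target(x)
    chain, accepted = [x.copy()], 0
    for _ in range(n_iter):
        y = x + lam * rng.standard_normal(x.size)   # propose a block jump
        lp_y = log_target(y)
        if np.log(rng.uniform()) < lp_y - lp:        # Metropolis accept/reject
            x, lp = y, lp_y
            accepted += 1
        chain.append(x.copy())
    return np.array(chain), accepted / n_iter

chain, acc = rwm_block(log_target, np.zeros(2), lam=1.7, n_iter=5000)
print(f"acceptance rate: {acc:.2f}")  # adjust lam until this is roughly 0.2-0.3
```

In practice λ is adjusted over short trial runs until the printed acceptance rate falls in the 0.2–0.3 band discussed above.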

3.2 Optimal Scaling of the RWM-Within-Gibbs

Intuition: Consider first a target either spherically symmetric, or with i.i.d. components, and let the overall scale of variability of the target be η. For full block proposals the optimal scale parameter should be O(η/d^{1/2}) so that the square of the magnitude of the total proposal is O(η²). If a Metropolis-within-Gibbs update is to be used with k sub-blocks and d* = d/k of the components updated at each stage, then the optimal scale parameter should be larger, O(η/d*^{1/2}). However, only one of the k stages of the RWM-within-Gibbs algorithm updates any given component whereas with k repeats of a block RWM that component is updated k times. Considering the squared jump distances it is easy to see that, given the additivity of squared jump distances, the larger size of the RWM-within-Gibbs updates is exactly canceled by their lower frequency, and so (in the limit) there is no difference in efficiency when compared with a block update. The same intuition applies when comparing a random scan Metropolis-within-Gibbs scheme with a single block update.

Now consider a target for which different components vary on different scales. If sub-blocks are chosen so as to group together components with similar scales, then a Metropolis-within-Gibbs scheme can apply suitable scale parameters to each block whereas a single block update must choose one scale parameter that is adequate for all components. In this scenario, Metropolis-within-Gibbs updates should therefore be more efficient.

Theory: Neal and Roberts (2006) considered a random scan RWM-within-Gibbs algorithm on a target distribution with i.i.d. components and using i.i.d. Gaussian proposals all having the same scale parameter λ_d = µ/d^{1/2}. At each iteration a fraction, γ_d, of the d components are chosen uniformly at random and updated as a block. It is shown [again subject to differentiability conditions on f(·)] that the process W_t^{(d)} := X_{1,[td]}^{(d)} approaches a Langevin diffusion with speed

    h_γ(µ) = 2γµ² Φ(−(1/2) µ (γJ)^{1/2}),

where γ := lim_{d→∞} γ_d. The optimal scaling is therefore larger than for a standard block update (by a factor of γ^{−1/2}) but the optimal speed and the optimal acceptance rate (0.234) are identical to those found by Roberts, Gelman and Gilks (1997).

Sherlock (2006) considered sequential Metropolis-within-Gibbs updates on a unimodal elliptically symmetric target, using spherical proposal distributions but allowing different scale parameters for the proposals in each sub-block. The k sub-blocks are assumed to correspond to disjoint subsets of the principal axes of the ellipse and updates for each are


assumed to be optimally tuned. Efficiency is considered in terms of ESEJD and is again found to be optimal (as d → ∞) when the acceptance rate for each sub-block is 0.234. For equal sized sub-blocks, the relative efficiency of the Metropolis-within-Gibbs scheme compared to k optimally scaled single block updates is shown to be

    r = [(1/k) Σ_i c_i²] / [((1/k) Σ_i 1/c_i²)^{−1}],        (13)

where c_i² is the mean of the squares of the inverse scale parameters for the ith block. Since r is the ratio of an arithmetic mean to a harmonic mean, it is greater than or equal to 1 and thus the Metropolis-within-Gibbs step is always at least as efficient as the block Metropolis. However, the more similar the blocks, the less the potential gain in efficiency.

In practice, parameter blocks do not generally correspond to disjoint subsets of the principal axes of the posterior or, in terms of single parameter updates, the parameters are not generally orthogonal. Equation (13) therefore corresponds to a limiting maximum efficiency gain, obtainable only when the parameter sub-blocks are orthogonal.

Algorithm 2 (MwG). Our second algorithm (MwG) is a sequential Metropolis-within-Gibbs algorithm with proposed jumps Y_i ∼ N(0, λ_i²). Each scale parameter is tuned separately to give an acceptance rate of between 0.4 and 0.45 (approximately the optimum for a one-dimensional Gaussian target and proposal).
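A sequential Metropolis-within-Gibbs sweep of this kind can be sketched as below. Again this is an illustrative sketch rather than the study's code: the target is a stand-in product of Gaussians whose coordinates vary on very different scales, and the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(x):
    # Stand-in: independent Gaussians with very different scales.
    scales = np.array([1.0, 50.0])
    return -0.5 * np.sum((x / scales) ** 2)

def mwg_sweep(log_target, x, lams):
    """One sequential Metropolis-within-Gibbs sweep: each coordinate is
    updated in turn with its own scale parameter lams[i]."""
    x = x.copy()
    lp = log_target(x)
    accepts = np.zeros(x.size)
    for i in range(x.size):
        y = x.copy()
        y[i] += lams[i] * rng.standard_normal()   # single-coordinate proposal
        lp_y = log_target(y)
        if np.log(rng.uniform()) < lp_y - lp:
            x, lp = y, lp_y
            accepts[i] = 1.0
    return x, accepts

x = np.zeros(2)
lams = np.array([2.4, 120.0])   # each scale tuned to its own coordinate
acc = np.zeros(2)
for _ in range(2000):
    x, a = mwg_sweep(log_target, x, lams)
    acc += a
print(acc / 2000)  # each component's rate can be tuned toward ~0.44
```

Because each coordinate carries its own λ_i, a scale mismatch in one component no longer constrains the others, which is the point made in the intuition above.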

3.3 Tailoring the Shape of a Block Proposal

Intuition: First consider a two-dimensional target with roughly elliptical contours and with the scale of variation along one of the principal axes much larger than the scale of variation along the other (e.g., the two right-hand panels of Figure 3). The size of updates from a proposal of the type used in Algorithm 1 is constrained by the smaller of the two scales of variation. Thus, even when Algorithm 1 is optimally tuned, the efficiency of exploration along the larger axis depends on the ratio of the two scales and so can be arbitrarily low in targets where this ratio is large. Now consider a general target with roughly elliptical contours and covariance matrix Σ. It seems intuitively sensible that a "tailored" block proposal distribution with the same shape and orientation as the target will tend to produce larger jumps along the target's major axes and smaller jumps along its minor axes and should therefore allow for more efficient exploration of the target.

Theory: Sherlock (2006) considered exploration of a unimodal elliptically symmetric target with either a spherically symmetric proposal or a tailored elliptically symmetric proposal in the limit as d → ∞. Subject to condition (11) (and a "shell"-like condition similar to that mentioned in Section 2.2.4), it is shown that with each proposal shape it is in fact possible to achieve the same optimal expected squared jump distance. However, if a spherically symmetric proposal is used on an elliptical target, some components are explored better than others and in some sense the overall efficiency is reduced. This becomes clear on considering the ratio, r, of the expected squared Euclidean jump distance for an optimal spherically symmetric proposal to that of an optimal tailored proposal. Sherlock (2006) showed that for a sequence of targets, where the target with dimension d has elliptical axes with inverse scale parameters c_{d,1}, ..., c_{d,d}, the limiting ratio is

    r = [lim_{d→∞} ((1/d) Σ_{i=1}^d c_{d,i}^{−2})^{−1}] / [lim_{d→∞} (1/d) Σ_{i=1}^d c_{d,i}²].

The numerator is the limiting harmonic mean of the squared inverse scale parameters, which is less than or equal to their arithmetic mean (the denominator), with equality if and only if (for a given d) all the c_{d,i} are equal. Roberts and Rosenthal (2001) examined similar relative efficiencies but for targets and proposals with independent components with inverse scale parameters C sampled from some distribution. In this case the derived measure of relative efficiency is the relative speeds of the diffusion limits for the first component of the target,

    r* = E[C]² / E[C²].

This is again less than or equal to 1, with equality when all the scale parameters are equal. Hence efficiency is indeed directly related to the relative compatibility between target and proposal shapes. Furthermore, Bedard (2008) showed that if a proposal has i.i.d. components yet the target (assumed to have independent components) is wildly asymmetric, as measured by (11), then the limiting optimal acceptance rate can be anywhere between 0 and 1. However, even at this optimum, some components will be explored infinitely more slowly than others.


In practice the shape Σ of the posterior is not known and must be estimated, for example by numerically finding the posterior mode and the Hessian matrix H at the mode, and setting Σ = H^{−1}. We employ a simple alternative which uses an earlier MCMC run.

Algorithm 3 (BlkShp). Our third algorithm first uses an optimally scaled block RWM algorithm (Algorithm 1), which is run for long enough to obtain a "reasonable" estimate of the covariance from the posterior sample. A fresh run is then started and tuned to give an acceptance rate of about 0.3 but using proposals

    Y ∼ N(0, λ²Σ).

For each dataset, so that our implementation would reflect likely statistical practice, each of the three replicates of this algorithm estimated the Σ matrix from iterations 1000–2000 of the corresponding replicate of Algorithm 1 (i.e., using 1000 iterations after "burn in"). In all, therefore, six different variance matrices were used.

3.4 Improving Tail Exploration

Intuition: A posterior with relatively heavy polynomial tails such as the one-dimensional Cauchy distribution has considerable mass some distance from the origin. Proposal scalings which efficiently explore the body of the posterior are thus too small to explore much of the tail mass in a "reasonable" number of iterations. Further, polynomial tails become flatter with distance from the origin so that for unit vector u, π(x + λu)/π(x) → 1 as ‖x‖_2 → ∞. Hence the acceptance rate for a random walk algorithm approaches 1 in the tails, whatever the direction of the proposed jump. The algorithm therefore loses almost all sense of the direction to the posterior mass.

Theory: Roberts (2003) brought together literature relating the tails of the d-dimensional posterior and proposal to the ergodicity of the Markov chain and hence its convergence properties. Three important cases are noted:

(1) If ∃ s > 0 such that π(x) ∝ e^{−s‖x‖_2}, at least outside some compact set, then the random walk algorithm is geometrically ergodic.

(2) If ∃ r > 0 such that the tails of the proposal are bounded by some multiple of ‖x‖_2^{−(r+d)} and if π(x) ∝ ‖x‖_2^{−(r+d)}, at least outside some compact set, then the algorithm is polynomially ergodic with rate r/2.

(3) If ∃ r > 0 and η ∈ (0, 2) such that π(x) ∝ ‖x‖_2^{−(r+d)}, at least for large enough x, and the proposal has tails q(x) ∝ ‖x‖_2^{−(d+η)}, then the algorithm is polynomially ergodic with rate r/η.

Thus posterior distributions with exponential or lighter tails lead to a geometrically ergodic Markov chain, whereas polynomially tailed posteriors can lead to polynomially ergodic chains, and even this is only guaranteed if the tails of the proposal are at least as heavy as the tails of the posterior. However, by using a proposal with tails so heavy that it has infinite variance, the polynomial convergence rate can be made as large as is desired.

Algorithm 4 (BlkShpCau). Our fourth algorithm is identical to BlkShp but samples the proposed jump from the heavy-tailed multivariate Cauchy. Proposals are generated by simulating V ∼ N(0, Σ) and Z ∼ N(0, 1) and setting Y* = V/Z. No acceptance rate criteria exist for proposals with infinite variance and so the optimal scaling parameter for this algorithm was found (for each dataset and Σ) by repeating several small runs with different scale parameters and noting which produced the best ACTs for each dataset.

Algorithm 5 (BlkShpMul). The fifth algorithm relies on the fact that taking logarithms of parameters shifts mass from the tails to the center of the distribution. It uses a random walk on the posterior of θ := (log ψ1, log ψ2, log q12, log q21). Shape matrices Σ were estimated as for Algorithm 3, but using the logarithms of the posterior output from Algorithm 1. In the original parameter space this algorithm is equivalent to a proposal with components X*_i = X_i e^{Y*_i} and so has been called the multiplicative random walk (see, e.g., Dellaportas and Roberts, 2003). In the original parameter space the acceptance probability is

    α(x, x*) = min(1, [∏_{i=1}^d x*_i / ∏_{i=1}^d x_i] × π(x*)/π(x)).

Since the algorithm is simply an additive random walk on the log parameter space, the usual acceptance rate optimality criteria apply.
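Equivalently, the update can be implemented as an additive Gaussian walk on log x, in which case the product term in the acceptance probability above is the Jacobian of the transformation. A sketch under stated assumptions (our names; a stand-in Gamma-like target on positive parameters, not the MMPP posterior):

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(x):
    # Stand-in log-density on positive parameters (independent Gamma(3, 1)).
    return np.sum(2.0 * np.log(x) - x)

def multiplicative_step(x, lam):
    """One multiplicative RWM step: X*_i = X_i * exp(Y_i) with Gaussian Y.
    The sum of log(x*/x) below is the log of the prod(x*)/prod(x) Jacobian
    factor in the acceptance probability."""
    y = lam * rng.standard_normal(x.size)
    x_star = x * np.exp(y)
    log_alpha = (log_target(x_star) - log_target(x)
                 + np.sum(np.log(x_star) - np.log(x)))
    if np.log(rng.uniform()) < log_alpha:
        return x_star, True
    return x, False

x = np.ones(4)
for _ in range(1000):
    x, _ = multiplicative_step(x, lam=0.5)
print(np.all(x > 0))  # multiplicative updates keep every parameter positive
```

A side benefit visible in the sketch is that positivity constraints are respected automatically, so no proposals are wasted on negative parameter values.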

A logarithmic transformation is clearly only appropriate for positive parameters and can in fact lead to a heavy left-hand tail if a parameter (in the original space) has too much mass close to zero. The transformation θ_i ↦ sign(θ_i) log(1 + |θ_i|) circumvents both of these problems.


3.5 Additional Strategies

Scaling and shaping of the proposal, the choice of proposal distribution (here Gaussian or Cauchy), and an informed choice between RWM and Metropolis-within-Gibbs updates can all lead to a more efficient algorithm. Building on these possibilities, we now consider two further mechanisms for improving efficiency: adaptive MCMC, and utilizing problem-specific knowledge.

3.5.1 Adaptive MCMC

Intuition: Algorithm 3 used the output from a previous MCMC run to estimate the shape matrix Σ. An overall scaling parameter was then varied to give an acceptance rate of around 0.3. With adaptive MCMC a single chain is run, and this chain gradually alters its own proposal distribution (e.g., changing Σ), by learning about the posterior from its own output. This simple idea has a major potential pitfall, however.

If the algorithm is started away from the main posterior mass, for example in a tail or a minor mode, then it initially learns about that region. It therefore alters the proposal so that it efficiently explores this region of minor importance. Worse, in so altering the proposal the algorithm may become even less efficient at finding the main posterior mass, remain in an unimportant region for longer, and become even more influenced by that unimportant region. Since the transition kernel is continually changing, potentially with this positive feedback mechanism, it is no longer guaranteed that the overall stationary distribution of the chain is π(·).

A simple solution is so-called finite adaptation wherein the algorithm is only allowed to evolve for the first n_0 iterations, after which time the transition kernel is fixed. Such a scheme is equivalent to running a shorter "tuning" chain and then a longer subsequent chain (e.g., Algorithm 3). If the tuning portion of the chain has only explored a minor mode or a tail, this still leads to an inefficient algorithm. We would prefer to allow the chain to eventually correct for any errors made at early iterations and yet still lead to the intended stationary distribution. It seems sensible that this might be achieved provided changes to the kernel become smaller and smaller as the algorithm proceeds and provided the above-mentioned positive feedback mechanism can never pervert the entire algorithm.

Theory: At the nth iteration let Γ_n represent the

choice of transition kernel; for the RWM it might represent the current shape matrix Σ and the overall scaling λ. Denote the corresponding transition kernel P_{Γ_n}(x, ·). Roberts and Rosenthal (2007) derived two conditions which together guarantee convergence to the stationary distribution. A key concept is that of diminishing adaptation, wherein changes to the kernel must become vanishingly small as n → ∞:

    sup_x ‖P_{Γ_{n+1}}(x, ·) − P_{Γ_n}(x, ·)‖_1 → 0 in probability as n → ∞.

A second, containment, condition considers the ε-convergence time under repeated application of a fixed kernel, γ, and starting point x,

    M_ε(x, γ) := inf{n ≥ 1 : ‖P_γ^n(x, ·) − π(·)‖_1 ≤ ε},

and requires that for all δ > 0 there is an N such that for all n,

    P(M_ε(X_n, Γ_n) ≤ N | X_0 = x_0, Γ_0 = γ_0) ≥ 1 − δ.

The containment condition is difficult to check in practice; some criteria are provided in the work of Bai, Roberts and Rosenthal (2009).

Adaptive MCMC is a highly active research area

and so we confine ourselves to an adaptive version of Algorithm 5. Roberts and Rosenthal (2010) described an adaptive RWM algorithm for which the proposal at the nth iteration is sampled from a mixture of an adaptive N(0, (2.38²/d) Σ_n) and a nonadaptive Gaussian distribution; here Σ_n is the variance matrix calculated from the previous n − 1 iterations of the scheme. Changes to the variance matrix are O(1/n) at the nth iteration and so the algorithm satisfies the diminishing adaptation condition.

Choice of the overall scaling factor 2.38²/d follows directly from the optimal scaling limit results reviewed in Section 3.1, with J = 1 or k_x^{(d)} = k_y^{(d)}. In general, therefore, a different scaling might be appropriate, and so our scheme extends that of Roberts and Rosenthal (2010) by allowing the overall scaling factor to adapt.

Algorithm 6 (BlkAdpMul). Our adaptive MCMC algorithm is a block multiplicative random walk which samples jump proposals on the log-posterior from the mixture

    Y ∼ N(0, m_n² Σ_n)        with probability 1 − δ,
    Y ∼ N(0, (1/d) λ_0² I)    with probability δ.

Here δ = 0.05, d = 4, and Σ_n is the variance matrix of the logarithms of the posterior sample to date.


A few minutes were spent tuning the block multiplicative random walk with proposal variance (1/4) λ_0² I to give at least a reasonable value for λ_0 (acceptance rate ≈ 0.3), although this is not strictly necessary.

To ensure a sensible nonsingular Σ_n, proposals from the adaptive part of the mixture were only allowed once there had been at least 10 proposed jumps accepted. The overall scaling factor for the adaptive part of the kernel, m_n, was initialized to m_0 = 2.38/d^{1/2} and an adaptation quantity ∆ = m_0/100 was defined. If iteration i was from the nonadaptive part of the kernel, then m_{i+1} ← m_i; otherwise:

• If the proposal was rejected, then m_{i+1} ← m_i − ∆/i^{1/2}.
• If the proposal was accepted, then m_{i+1} ← m_i + 2.3∆/i^{1/2}.

This leads to an equilibrium acceptance rate of 1/3.3 ≈ 30%, the target acceptance rate for the other block updating algorithms which use Gaussian proposals (Algorithms 1, 3, and 5). Changes to m shrink as 1/i^{1/2} since they must be large enough to adapt to changes in the covariance matrix yet small enough that an equilibrium value is established relatively quickly. As with the variance matrix, such a value would then only change noticeably if there were consistent evidence that it should.
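The scaling recursion above can be sketched as a small stochastic-approximation routine. This is an illustrative sketch with our names; the accept/reject stream is simulated at a fixed rate rather than produced by a real chain.

```python
import numpy as np

rng = np.random.default_rng(6)

d = 4
m = 2.38 / np.sqrt(d)   # m_0, the theoretical optimal-scaling starting point
delta = m / 100.0       # adaptation quantity, Delta = m_0 / 100

def update_scaling(m, accepted, i):
    """Decrement m by Delta/i^(1/2) on rejection and increment it by
    2.3*Delta/i^(1/2) on acceptance; the expected change vanishes when
    the acceptance rate is 1/3.3 (about 30%), and the i^(-1/2) step
    sizes let m settle down as the run proceeds."""
    if accepted:
        return m + 2.3 * delta / np.sqrt(i)
    return m - delta / np.sqrt(i)

# Illustrate with an artificial accept/reject stream at rate 0.30:
for i in range(1, 5001):
    m = update_scaling(m, rng.uniform() < 0.30, i)
print(m > 0)  # m stays near its starting value rather than drifting away
```

Setting the expected increment to zero, p · 2.3∆ − (1 − p) · ∆ = 0, recovers the equilibrium rate p = 1/3.3 quoted above.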

3.5.2 Utilizing problem-specific knowledge

Intuition: Algorithms are always applied to specific datasets with specific forms for the likelihood and prior. Combining techniques such as optimal scaling and shape adjustment with problem-specific knowledge can often markedly improve efficiency. In the case of the MMPP we define a reparameterization based on the intuition that for an MMPP with ψ1 ≈ ψ2 (as in D2) the data contain a great deal of information about the average intensity but relatively little information about the difference between the intensities.

Theory: For a two-dimensional MMPP define an

overall transition intensity, stationary distribution, mean intensity at stationarity, and a measure of the difference between the two event intensities as follows:

    q := q12 + q21,   ν := (1/q)[q21, q12],
    ψ̄ := ν^t ψ   and   δ := (ψ2 − ψ1)/ψ̄.        (14)

Let t_obs be the total observation time and t the vector of observed event times. If the Poisson event intensities are similar, δ is small, and Taylor expansion of the log-likelihood in δ (see Sherlock, 2006) gives

    l(ψ̄, q, δ, ν1) = n log ψ̄ − ψ̄ t_obs + 2δ² ν1 ν2 f(ψ̄t, qt)
                      + δ³ ν1 ν2 (ν2 − ν1) g(ψ̄t, qt) + O(δ⁴)        (15)

for some f(·, ·) and g(·, ·). Consider a reparameterization from (ψ1, ψ2, q12, q21) to (ψ̄, q, α, β) with

    α := 2δ(ν1 ν2)^{1/2}   and   β := δ(ν2 − ν1).        (16)

Parameters ψ̄; q and α; and β (in this order) capture decreasing amounts of variation in the log-likelihood and so, conversely, it might be anticipated that correspondingly decreasing amounts of information about these parameters are contained in the likelihood. Hence very different scalings might be required for each.
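The map defined by (14) and (16) can be written out directly. The following is an illustrative sketch (the function name is ours), evaluated at the D2-style values q12 = q21 = 1, ψ = [10, 17]:

```python
import numpy as np

def reparam(psi1, psi2, q12, q21):
    """Map (psi1, psi2, q12, q21) to (psi_bar, q, alpha, beta)
    following equations (14) and (16)."""
    q = q12 + q21
    nu = np.array([q21, q12]) / q            # stationary distribution of the chain
    psi_bar = nu @ np.array([psi1, psi2])    # mean intensity at stationarity
    delta = (psi2 - psi1) / psi_bar          # scaled intensity difference
    alpha = 2.0 * delta * np.sqrt(nu[0] * nu[1])
    beta = delta * (nu[1] - nu[0])
    return psi_bar, q, alpha, beta

psi_bar, q, alpha, beta = reparam(10.0, 17.0, 1.0, 1.0)
print(psi_bar, q, beta)  # a symmetric chain (q12 = q21) makes beta exactly 0
```

Note that when q12 = q21 the stationary distribution is uniform, so β vanishes identically; this is the regime in which the likelihood carries least information about β.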

Algorithm 7 (MwGRep). A Metropolis-within-Gibbs update scheme was applied to the reparameterization (ψ̄, q, α, β). A multiplicative random walk was used for each of the first three parameters (since they are positive) and an additive update was used for β. Scalings for each of the four parameters were chosen to give acceptance rates of between 0.4 and 0.45.

Algorithm 8 (MwGRepCau). Our final algorithm is identical to MwGRep except that additive updates for β are proposed from a Cauchy distribution. The Cauchy scaling was optimized to give the best ACT over the first 1000 iterations.

4. RESULTS

The eight algorithms described in Section 3 are summarized in Table 1. The table includes two further algorithms, an independence sampler (Algorithm 9: IndShp) and the Gibbs sampler of Fearnhead and Sherlock (2006) (Algorithm 10: Gibbs); these were included to benchmark the efficiency of RWM algorithms against some sensible alternatives. The independence sampler used a multivariate t distribution with five degrees of freedom and the same set of covariance matrices as Algorithm 3.

Each RWM variation was tested against datasets D1 and D2 as described in Section 2.3.1. For each dataset, each algorithm was started from the known "true" parameter values and was run three times


with three different random seeds (referred to as Replicates 1–3). All algorithms were run for 11,000 iterations; a burn in of 1000 iterations was sufficient in all cases.

Priors were independent and exponential with means the known "true" parameter values. The likelihood of an MMPP with maximum and minimum Poisson intensities ψ_max and ψ_min and with n events observed over a time window of length t_obs is bounded above by ψ_max^n e^{−ψ_min t_obs}. In this article only MMPP parameters and their logarithms are considered for estimation. Since exponential priors are employed, the parameters and their logarithms therefore have finite variance, and geometric ergodicity is guaranteed.

The accuracy of posterior simulations is assessed

via QQ plot comparison with the output from a very long run of a Gibbs sampler (see Section 2.2.3). QQ plots for almost all replicates were almost entirely within their 95% confidence bounds. Figure 4 shows such plots for Algorithms 1–3 and 9 (the independence sampler) on dataset D2 (Replicate 1). In general these combinations produced the least accurate performance, and only with the independence sampler is there reason to doubt that the posterior sample is a reasonable representation of the true posterior. The relatively poor performance on D2 of Algorithms 1–3 and especially Algorithm 9 is repeated for the other two replicates. The third replicate of Algorithm 4 on D2 also showed an imperfect fit in the tails.

The integrated ACT was estimated for each parameter and each replicate using the final 10,000 iterations from that replicate. Calculation of the likelihood is by far the most computationally intensive operation (taking approximately 99.8% of the total CPU time) and is performed four times for each Metropolis-within-Gibbs iteration (once for each parameter) and only once for each block update; a similar calculation is performed once for each update of the Gibbs sampler. To give a truer indication of overall efficiency the ACTs for each Metropolis-within-Gibbs replicate have therefore been multiplied by 4. Table 2 shows the mean adjusted ACT for each algorithm, parameter, and dataset. For each set of three replicates most of the ACTs lay within 20% of their mean, and for the exceptions (Blk and BlkShpCau for datasets D1 and D2, and BlkShp and BlkShpMul for dataset D2) full sets of ACTs are given in Table 3 in the Appendix.

In general all algorithms performed better on D1

than on D2 because, as discussed in Section 2.3.1, dataset D1 contains more information on the parameters than D2; it therefore has lighter tails and is more easily explored by the chain.

The simple block additive algorithm using Gaussian proposals with variance matrix proportional to the identity matrix (Blk) performs relatively poorly on both datasets. In absolute terms there is much less uncertainty about the transition intensities q12 and q21 (both are close to 1) than in the Poisson intensities ψ1 (10) and ψ2 (30 for D1 and 17 for D2) since the variance of the output from a Poisson process is proportional to its value. The optimal single-scale parameter necessarily tunes to the smallest variance and hence explores q12 and q21 much more efficiently than ψ1 and ψ2.

Overall performance improves enormously once block proposals are from a Gaussian with approximately the correct shape (BlkShp). The efficiency

Table 1
Summary of the algorithms used in this paper

No.  Abbreviation  Description
1    Blk           Block additive with tuned proposal N(0, λ²I).
2    MwG           Sequential additive with tuned proposals N(0, λ_i²) (i = 1, ..., 4).
3    BlkShp        Block additive with tuned proposal N(0, λ²Σ).
4    BlkShpCau     Block additive with tuned proposal Cauchy(0, λ²Σ).
5    BlkShpMul     Block multiplicative with tuned proposal N(0, λ²Σ).
6    BlkAdpMul     Block multiplicative with adaptively tuned mixture proposal.
7    MwGRep        Sequential multiplicative/additive Gaussian; reparameterization.
8    MwGRepCau     Sequential multiplicative Gaussian and additive Cauchy; reparameterization.
9    IndShp        Block independence sampler with tuned proposal t_5(0, Σ).
10   Gibbs         Hidden data Gibbs sampler of Fearnhead and Sherlock (2006).


Fig. 4. QQ plots for algorithms Blk, MwG, BlkShp, and IndShp, on D2 (Replicate 1). Dashed lines are approximate 95% confidence limits obtained by repeated sampling from iterations 1000 to 100,000 of a Gibbs sampler run; sample sizes were 10,000/ACT, which is the effective sample size of the data being compared to the Gibbs run.

of the Metropolis-within-Gibbs algorithm with additive Gaussian updates (MwG) lies somewhere between the efficiencies of Blk and BlkShp but the improvement over Blk is larger for dataset D1 than for dataset D2. As discussed in Section 2.3.1 the parameters in D1 are more nearly independent than the parameters in D2. Thus for dataset D1 the principal axes of an elliptical approximation to the posterior are more nearly parallel to the cartesian axes. Metropolis-within-Gibbs updates are (by definition) parallel to each of the cartesian axes and so can make large updates almost directly along the major axis of the ellipse for dataset D1.

For the heavy-tailed posterior of dataset D2 we

would expect block updates resulting from a Cauchy proposal (BlkShpCau) to be more efficient than those from a Gaussian proposal. However, for both datasets Cauchy proposals are slightly less efficient than Gaussian proposals. It is likely that the heaviness of the Cauchy tails leads to more proposals with at least one negative parameter, such proposals being automatically rejected. Moreover, Σ represents the main posterior mass, yet some large Cauchy jump proposals from this mass will be in the posterior tail. It may be that Σ does not accurately represent the shape of the posterior tails.

Table 2
Mean estimated integrated autocorrelation time for the four parameters over three independent replicates for datasets D1 and D2

                           D1                                  D2
Algorithm      ψ1    ψ2    log(q12)  log(q21)     ψ1    ψ2    log(q12)  log(q21)
Blk            66    126   15        19           176   175   80        70
MwG*           22    22    33        33           103   90    114       99
BlkShp         13    18    13        15           46    25    37        36
BlkShpCau      19    32    25        24           63    50    56        38
BlkShpMul      13    17    13        15           33    26    22        16
BlkAdpMul      12    12    14        14           20    20    17        23
MwGRep*        13    14    32        44           20    23    23        21
MwGRepCau*     14    15    37        42           24    233   25        23
IndShp+        3.7   5.5   3.5       3.7
Gibbs          4.2   3.2   5.7       5.9          26    19    32        27

Notes: *Estimates for MwG replicates have been multiplied by 4 to provide figures comparable with full block updates in terms of CPU time. +ACT results for the independence sampler for D2 are irrelevant since the MCMC sample was not an accurate representation of the posterior.

Multiplicative updates (BlkShpMul) make little

difference for D1, but for the relatively heavy-tailedD2 there is a definite improvement over BlkShp.The adaptive multiplicative algorithm (BlkAdpMul)is slightly more efficient still, since the estimatedvariance matrix and the overall scaling are refinedthroughout the run.As was noted earlier in this section, due to our

choice of exponential priors the quantities estimatedin this article have exponential or lighter posteriortails and so all the nonadaptive algorithms in thisarticle are geometrically ergodic. The theory in Sec-tion 3.4 suggests ways to improve tail explorationfor polynomially ergodic algorithms and so, strictlyspeaking, need not apply here. However, the expo-nential decay only becomes dominant some distancefrom the posterior mass, especially for dataset D2.Polynomially increasing terms in the likelihood en-sure that initial decay is slower than exponential,and that the multiplicative random walk is there-fore more efficient than the additive random walk.The adaptive overall scaling m showed variability

of O(0.1) over the first 1000 iterations after whichtime it quickly settled down to 1.2 for all three repli-cates on D1 and to 1.1 for all three replicates on D2.Both of these values are very close to the scaling of1.19 that would be used for a four-dimensional up-date in the scheme of Roberts and Rosenthal (2010).
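The multiplicative random walk, an additive Gaussian random walk on the logarithm of the (positive) parameters, can be sketched as below. This is an illustrative implementation, not the authors' code; `log_post`, the scaling `scale`, and the shape matrix `Sigma` are placeholders to be supplied by the user.

```python
import numpy as np

def multiplicative_rwm(log_post, theta0, scale, Sigma, n_iter, rng=None):
    """Random walk Metropolis on log(theta) for a positive parameter vector.

    Equivalent to proposing theta' = theta * exp(eps) with
    eps ~ N(0, scale^2 * Sigma); the Jacobian of the log transform
    contributes sum(log theta) to the transformed log-density.
    """
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(Sigma)
    x = np.log(np.asarray(theta0, dtype=float))   # work on the log scale
    lp = log_post(np.exp(x)) + x.sum()            # log-target plus Jacobian
    chain = np.empty((n_iter, x.size))
    for i in range(n_iter):
        y = x + scale * L @ rng.standard_normal(x.size)
        lp_y = log_post(np.exp(y)) + y.sum()
        if np.log(rng.uniform()) < lp_y - lp:     # Metropolis accept/reject
            x, lp = y, lp_y
        chain[i] = np.exp(x)                      # report on the original scale
    return chain
```

Because proposals are symmetric on the log scale, negative parameter values can never be proposed, which removes the automatic rejections that hurt the additive Cauchy proposal above.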

Fig. 5. Traceplots for the first 2000 iterations of BlkAdpMul on dataset D2 (Replicate 1).

The algorithm similarly learned very quickly about the variance matrix Σ, with individual terms settling down after less than 2000 iterations, and with exploration close to optimal after less than 500 iterations. This can be seen clearly in Figure 5, which shows traceplots for the first 2000 iterations of the first replicate of BlkAdpMul on D2.

The adaptive algorithm uses its own history to learn about d(d + 1)/2 covariance terms and a best overall scaling. One would therefore expect that the larger the number of parameters, d, the more iterations are required for the scheme to learn about all of the adaptive terms and hence reach a close to optimal efficiency. To test this a dataset (D3) was simulated from a three-dimensional MMPP with ψ = [10, 17, 30]^t and q12 = q13 = q21 = q23 = q31 = q32 = 0.5. The following adaptive algorithm was then run three times, each for 20,000 iterations.

Algorithm 6b [BlkAdpMul(b)]. This adaptive algorithm is identical to BlkAdpMul (with d = 9) except that no adaptive proposals were used until at least 100 nonadaptive proposals had been accepted, and that if an adaptive proposal was accepted then the overall scaling was updated with m ← m + 3∆/i^{1/2} so that the equilibrium acceptance rate was approximately 0.25.
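The acceptance-rate-driven update of the overall scaling is a Robbins–Monro-type recursion. A minimal sketch is given below; the rejection step and the step-size constant `delta` are illustrative assumptions (the text specifies only the acceptance update), chosen so that the expected drift vanishes exactly at the target acceptance rate of 0.25, matching the m ← m + 3∆/i^{1/2} update on acceptance.

```python
import numpy as np

def adapt_scaling(m, accepted, i, delta=0.05, target=0.25):
    """One Robbins-Monro-style update of the overall scaling m at iteration i.

    On acceptance m grows by (1/target - 1) * step; on rejection it shrinks
    by step.  The drift is zero when the acceptance rate equals `target`,
    and the O(i^{-1/2}) step size mirrors m <- m + 3*delta/i^{1/2} since
    1/0.25 - 1 = 3.
    """
    step = delta / np.sqrt(i)
    return m + (1.0 / target - 1.0) * step if accepted else m - step
```

The shrinking step size ensures diminishing adaptation, one of the standard conditions for ergodicity of adaptive MCMC (Roberts and Rosenthal, 2007).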

Figure 6 shows the evolution of four of the 46 adaptive parameters (Replicate 1). All parameters seem close to their optimal values after 10,000 iterations, although covariance parameters appear to be still slowly evolving even after 20,000 iterations. In contrast, traceplots of parameters (not shown) reveal that the speed of exploration of the posterior is close to its final optimum after only 1500 iterations. This behavior was repeated across the other two replicates, indicating that, as with the two-dimensional adaptive and nonadaptive runs, even a very rough approximation to the variance matrix improves efficiency considerably. Over the full 20,000 iterations, all three replicates showed a definite multimodality with λ2 often close to either λ1 or λ3, indicating that the data might reasonably be explained by a two-dimensional MMPP. In all three replicates the optimal scaling settled between 0.25 and 0.3, noticeably lower than the Roberts and Rosenthal (2010) value of 2.38/√9. With reference to Section 3.1 this is almost certainly due to the roughness inherent in a multimodal posterior.

The reparameterization of Section 3.5.2 was designed for datasets similar to D2, and on this dataset the resulting Metropolis-within-Gibbs algorithm (MwGRep) is at least as efficient as the adaptive multiplicative random walk. On dataset D1, however, exploration of q12 and q21 is arguably less efficient than for the Metropolis-within-Gibbs algorithm with the original parameter set. The lack of improvement when using a Cauchy proposal for β (MwGRepCau) suggests that this inefficiency is not due to poor exploration of the potentially heavy-tailed β. Further investigation in the (ψ, q, α, β) parameter space showed that for dataset D1 only q was

Fig. 6. Plots of the adaptive scaling parameter m and three estimated covariance parameters Var[ψ1], Var[q12], and Cov[ψ1, q12] for BlkAdpMul(b) on dataset D3 (Replicate 1).

explored efficiently; the posteriors of ψ and β were strongly positively correlated (ρ ≈ 0.8), and both ψ and β were strongly negatively correlated with α (ρ ≈ −0.65). Posterior correlations were small (|ρ| < 0.3) for all parameters with dataset D2 and for all correlations involving q for dataset D1.

The optimal scaling for the one-dimensional additive Cauchy proposal in MwGRepCau was approximately two thirds of the optimal scaling for the one-dimensional additive Gaussian proposal in MwGRep. In four dimensions the ratio was approximately one half. These ratios allow the Cauchy proposals to produce a similar number of small to medium sized jumps to the Gaussian proposals.

The independence sampler is arguably the most

efficient of all of the algorithms considered for D1. However, as discussed earlier in this section, there are doubts about the accuracy of its exploration of D2. Mengersen and Tweedie (1996) showed that an independence sampler is uniformly ergodic if and only if the ratio of the proposal density to the target density is bounded below, and that one minus this ratio gives the geometric rate of convergence. To ensure the lower bound it is advisable to propose from a relatively heavy-tailed distribution, such as the t5 used here. The problem in this instance arises because dataset D2 could, just possibly, have been generated by a single Poisson process with intensity ψ ≈ (ψ1 + ψ2)/2. The resulting minor mode (or,


more precisely, ridge) is some distance from the center of the distribution, resulting in a low ratio of proposal and target densities.

The Gibbs sampler of Fearnhead and Sherlock (2006) is accurate, with its efficiency directly related to the amount of information about the hidden Markov chain that is available from the data (Sherlock, 2006). Thus for D1 the Gibbs sampler is more efficient than the best RWM algorithms, but this is not the case for D2.
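The independence sampler discussed above can be sketched as follows. This is an illustrative implementation rather than the IndShp algorithm itself: the proposal density `log_q` and sampler `sample_q` are user-supplied placeholders (IndShp uses a tuned t5(0, Σ)), and normalizing constants cancel in the acceptance ratio, so unnormalized log-densities suffice.

```python
import numpy as np

def independence_sampler(log_post, log_q, sample_q, x0, n_iter, rng=None):
    """Independence sampler: proposals are drawn from a fixed density q,
    ignoring the current state, and accepted using the ratio of importance
    weights pi(y)/q(y) over pi(x)/q(x).

    Uniform ergodicity requires q/pi to be bounded below (Mengersen and
    Tweedie, 1996), which is why a heavy-tailed proposal is advisable.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    lw = log_post(x) - log_q(x)              # current log importance weight
    chain = np.empty((n_iter, x.size))
    for i in range(n_iter):
        y = sample_q(rng)
        lw_y = log_post(y) - log_q(y)
        if np.log(rng.uniform()) < lw_y - lw:
            x, lw = y, lw_y
        chain[i] = x
    return chain
```

If the target has mass (such as D2's minor ridge) in a region where q is tiny relative to π, the weight ratio is unbounded, the chain can stick there for long periods, and exploration becomes unreliable, exactly the failure mode described above.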

5. DISCUSSION

We have described the theory and intuition behind a number of techniques for improving the efficiency of random walk Metropolis algorithms and tested these on two datasets generated from Markov modulated Poisson processes (MMPPs). Tests on these datasets also showed a sensibly implemented RWM to be at least as good as some of the other available MCMC algorithms. Some RWM implementations were uniformly successful at improving efficiency, while for others success depended on the shape and/or tails of the posterior. All of the underlying concepts discussed here are quite general and easily applied to statistical models other than the MMPP.

Simple acceptance rate tuning to obtain the optimal overall variance term for a symmetric Gaussian proposal can increase efficiency by many orders of magnitude. However, with our datasets, even after such tuning, the RWM algorithm was very inefficient. The effectiveness of the sampling increased enormously once the shape of the posterior was taken into account by proposing from a Gaussian with variance proportional to an estimate of the posterior variance. For Algorithms 3, 4, and 5 the posterior variance was estimated through a short "training run": the first 1000 iterations after burn-in of Algorithm 1.

As expected, use of the "multiplicative random

walk" (Algorithm 5), a random walk on the posterior of the logarithm of the parameters, improved efficiency most noticeably on the posterior with the heavier tails. However, contrary to expectation, even on the heavier-tailed posterior an additive Cauchy proposal (Algorithm 4) was, if anything, less efficient than a Gaussian. Tuning of Cauchy proposals was also more time-consuming since simple acceptance rate criteria could not be used.

Algorithm 6 combined the successful strategies of

optimal scaling, shape tuning, and transforming the

data, to create a multiplicative random walk which learned the most efficient shape and scale parameters from its own history as it progressed. This adaptive scheme was easy to implement and was arguably the most efficient RWM for each of the datasets. A slight variant of this algorithm was used to explore the posterior of a three-dimensional MMPP, and showed that in higher dimensions such algorithms take longer to discover close to optimal values for the adaptive parameters. These runs also confirmed the finding for the two-dimensional MMPP that RWM efficiency improves enormously with knowledge of the posterior variance, even if this knowledge is only approximate. For a multimodal posterior such as that found for the three-dimensional MMPP it might be argued that a different variance matrix should be used for each mode. Such "regionally adaptive" algorithms present additional problems, such as the definition of the different regions, and are discussed further by Roberts and Rosenthal (2010).

Metropolis-within-Gibbs updates performed better when the parameters were close to orthogonal, at which point the algorithms were almost as efficient as an equivalent block-updating algorithm with tuned shape matrix. The best Metropolis-within-Gibbs scheme for dataset D2 arose from a new reparameterization devised specifically for the two-dimensional MMPP with parameter orthogonality in mind. On D2 this performed nearly as well as the best scheme, the adaptive multiplicative random walk.

The adaptive schemes discussed here provide a

significant step toward a goal of completely automated algorithms. However, as already discussed, for d model parameters a posterior variance matrix has O(d^2) components. Hence the length of any "training run" or of the adaptive "learning period" increases quickly with dimension. For high dimensions it is therefore especially important to utilize to the full any problem-specific knowledge that is available, so as to provide as efficient a starting algorithm as possible.
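The shape-tuned block proposal that underpins several of the algorithms above, a Gaussian with variance proportional to an estimate of the posterior variance, can be sketched as follows. This is an illustrative implementation under stated assumptions: `Sigma_hat` might come from a short training run, and the default scaling 2.38/√d is the Roberts, Gelman and Gilks (1997) starting point, to be refined by acceptance-rate tuning.

```python
import numpy as np

def shaped_rwm(log_post, x0, Sigma_hat, n_iter, scale=None, rng=None):
    """Block RWM with Gaussian proposal shaped by an estimate Sigma_hat of
    the posterior variance matrix, e.g. from a short training run.

    The overall scaling defaults to 2.38/sqrt(d), which is optimal in the
    Gaussian-target limit (Roberts, Gelman and Gilks, 1997).
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    d = x.size
    if scale is None:
        scale = 2.38 / np.sqrt(d)
    L = np.linalg.cholesky(Sigma_hat)          # proposal shape factor
    lp = log_post(x)
    chain = np.empty((n_iter, d))
    for i in range(n_iter):
        y = x + scale * L @ rng.standard_normal(d)
        lp_y = log_post(y)
        if np.log(rng.uniform()) < lp_y - lp:  # Metropolis accept/reject
            x, lp = y, lp_y
        chain[i] = x
    return chain
```

Even a rough `Sigma_hat` helps considerably, as the two- and three-dimensional MMPP runs above demonstrate; the adaptive schemes simply replace the fixed training-run estimate with a running estimate from the chain's own history.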

APPENDIX: RUNS WITH HIGHLY VARIABLE ACTS

Three replicates were performed for each dataset and algorithm, and ACTs are summarized by their mean in Table 2. However, for certain combinations of the algorithms and datasets the ACTs varied considerably; full sets of ACTs for these replicates are given in Table 3.


Table 3

Estimated ACT for the four parameters, on three independent replicates for Blk and BlkShpCau on dataset D1 and Blk, BlkShp, BlkShpCau, and BlkShpMul on dataset D2

Algorithm         ψ1              ψ2              log(q12)       log(q21)
Blk (D1)          59, 64, 75      120, 155, 104   12, 15, 17     19, 21, 17
BlkShpCau (D1)    28, 16, 12      36, 29, 31      20, 20, 35     26, 23, 24
Blk (D2)          121, 259, 146   107, 262, 157   41, 139, 61    51, 110, 48
BlkShp (D2)       54, 51, 34      23, 24, 29      40, 45, 27     50, 35, 23
BlkShpCau (D2)    46, 51, 92      46, 57, 48      31, 42, 94     39, 41, 34
BlkShpMul (D2)    53, 24, 23      22, 33, 25      20, 23, 24     17, 18, 13
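The integrated autocorrelation times reported in Tables 2 and 3 can be estimated from a scalar chain. Below is a simple initial-positive-sequence estimator, one of several standard choices (see, e.g., Geyer, 1992; Sokal, 1997); it is not necessarily the exact estimator used by the authors.

```python
import numpy as np

def act(x, max_lag=None):
    """Estimate the integrated autocorrelation time
    ACT = 1 + 2 * sum_k rho_k of a scalar chain, truncating the sum at the
    first non-positive sample autocorrelation (a simple
    initial-positive-sequence rule)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if max_lag is None:
        max_lag = n // 3
    xc = x - x.mean()
    var = xc @ xc / n
    tau = 1.0
    for k in range(1, max_lag):
        rho = (xc[:-k] @ xc[k:]) / (n * var)   # lag-k autocorrelation
        if rho <= 0:
            break                              # truncate the sum here
        tau += 2.0 * rho
    return tau
```

With this quantity in hand, the effective sample size used for the QQ-plot confidence limits of Figure 4 is simply (number of iterations)/ACT.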

REFERENCES

Bai, Y., Roberts, G. O. and Rosenthal, J. S. (2009). On the containment condition for adaptive Markov chain Monte Carlo algorithms. Preprint.

Bedard, M. (2007). Weak convergence of Metropolis algorithms for non-i.i.d. target distributions. Ann. Appl. Probab. 17 1222–1244. MR2344305

Bedard, M. (2008). Optimal acceptance rates for Metropolis algorithms: Moving beyond 0.234. Stochastic Process. Appl. 118 2198–2222. MR2474348

Burzykowski, T., Szubiakowski, J. and Ryden, T. (2003). Analysis of photon count data from single-molecule fluorescence experiments. Chem. Phys. 288 291–307.

Carlin, B. P. and Louis, T. A. (2009). Bayesian Methods for Data Analysis, 3rd ed. CRC Press, Boca Raton, FL. MR2442364

Dellaportas, P. and Roberts, G. O. (2003). An introduction to MCMC. In Spatial Statistics and Computational Methods (J. Moller, ed.). Lecture Notes in Statistics 173 1–41. Springer, Berlin. MR2001384

Fearnhead, P. and Sherlock, C. (2006). An exact Gibbs sampler for the Markov modulated Poisson process. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 767–784. MR2301294

Gamerman, D. and Lopes, H. F. (2006). Markov Chain Monte Carlo, 2nd ed. Chapman and Hall/CRC, Boca Raton, FL. MR2260716

Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statist. Sci. 7 473–483.

Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London. MR1397966

Jarner, S. F. and Roberts, G. O. (2002). Polynomial convergence rates of Markov chains. Ann. Appl. Probab. 12 224–247. MR1890063

Kou, S. C., Xie, X. S. and Liu, J. S. (2005). Bayesian analysis of single-molecule experimental data. Appl. Statist. 54 1–28. MR2137252

Mengersen, K. L. and Tweedie, R. L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24 101–121. MR1389882

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equations of state calculations by fast computing machines. J. Chem. Phys. 21 1087–1091.

Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer, London. MR1287609

Neal, P. and Roberts, G. (2006). Optimal scaling for partially updating MCMC algorithms. Ann. Appl. Probab. 16 475–515. MR2244423

Roberts, G. O. (2003). Linking theory and practice of MCMC. In Highly Structured Stochastic Systems. Oxford Statist. Sci. Ser. 27 145–178. Oxford Univ. Press, Oxford. MR2082409

Roberts, G. O. and Rosenthal, J. S. (2001). Optimal scaling for various Metropolis–Hastings algorithms. Statist. Sci. 16 351–367. MR1888450

Roberts, G. O. and Rosenthal, J. S. (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Probab. 44 458–475. MR2340211

Roberts, G. O. and Rosenthal, J. S. (2010). Examples of adaptive MCMC. J. Comput. Graph. Statist. 18 349–367.

Roberts, G. O., Gelman, A. and Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7 110–120. MR1428751

Scott, S. L. and Smyth, P. (2003). The Markov modulated Poisson process and Markov Poisson cascade with applications to web traffic modelling. Bayesian Statist. 7 1–10.

Sherlock, C. (2005). In discussion of 'Bayesian analysis of single-molecule experimental data.' J. Roy. Statist. Soc. Ser. C 54 500. MR2137252

Sherlock, C. (2006). Methodology for inference on the Markov modulated Poisson process and theory for optimal scaling of the random walk Metropolis. Ph.D. thesis, Lancaster Univ. Available at http://eprints.lancs.ac.uk/850/.

Sherlock, C. and Roberts, G. (2009). Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli 15 774–798. MR2555199

Sokal, A. (1997). Monte Carlo methods in statistical mechanics: Foundations and new algorithms. In Functional Integration (Cargese, 1996). NATO Adv. Sci. Inst. Ser. B Phys. 361 131–192. Plenum, New York. MR1477456