Source: probability.ca/jeff/ftpdir/determ.pdf

Scaling Limits for the Transient Phase

of Local Metropolis-Hastings Algorithms

by

Ole F. Christensen* and Gareth O. Roberts** and Jeffrey S. Rosenthal***

(January 2003; revised September 2004.)

Abstract. This paper considers high-dimensional Metropolis and Langevin algorithms in

their initial transient phase. In stationarity, these algorithms are well-understood and it is

now well-known how to scale their proposal distribution variances. For the random walk

Metropolis algorithm, convergence during the transient phase is extremely regular, to

the extent that the algorithm’s sample path actually resembles a deterministic trajectory.

In contrast, the Langevin algorithm, with variance scaled to be optimal for stationarity,

performs rather erratically. We give weak convergence results which explain both of these

types of behaviour, and give practical guidance on implementation based on our theory.

1. Introduction.

Markov chain Monte Carlo (MCMC) algorithms are a very popular method for sam-

pling from complicated probability distributions π(·) (see e.g. Gilks, Richardson and Spiegel-

halter, 1996). One very common MCMC algorithm is the Metropolis-Hastings algorithm

(Metropolis, Rosenbluth, Rosenbluth, Teller and Teller, 1953; Hastings, 1970). This al-

gorithm requires that we choose a proposal distribution. A fundamental question is what

* BiRC - Bioinformatics Research Center, Aarhus University, Hoegh-Guldbergs Gade 10, 8000 Aarhus C, Denmark. Email: [email protected].

** Department of Mathematics and Statistics, Fylde College, Lancaster University, Lancaster, LA1 4YF, England. Email: [email protected].

*** Department of Statistics, University of Toronto, Toronto, Ontario, Canada M5S 3G3. Email: [email protected]. Supported in part by NSERC of Canada.


scaling should be used for the proposal. This question has recently received considerable

attention (Gelman, Roberts and Gilks, 1996; Roberts, Gelman and Gilks, 1997; Roberts

and Rosenthal, 1998; Roberts, 1998; Breyer and Roberts, 2000; Roberts and Rosenthal,

2001; Roberts and Yuen, 2004; Bedard, 2004).

The result of Roberts et al. (1997) says that, for random-walk Metropolis algorithms,

to achieve optimal mixing speed of the algorithm, the proposal variance should scale with

dimension d like O(d^{-1}), and the optimal acceptance rate should be 0.234. By contrast,

the result of Roberts and Rosenthal (1998) says that, for Langevin Metropolis-Hastings

algorithms, the proposal variance should scale with dimension d like O(d^{-1/3}), and the optimal

acceptance rate should be 0.574. These results provide very useful practical guidance.

However, these results do have limitations. Firstly, these results are limiting results

as the dimension of the problem goes to infinity, and this restricts the class of target

distributions for which formal scaling limits can be demonstrated. For instance, Roberts et

al. (1997) and Roberts and Rosenthal (1998) only prove results rigorously when π(·) has

components which are i.i.d. (at least asymptotically), though subsequent theoretical and

empirical work shows that these results apply somewhat more generally (see Roberts and

Rosenthal, 2001, for a survey). Secondly, and sometimes overlooked, the results assume

that the chain is started in stationarity, i.e. they consider mixing properties only in the

stationary phase of the chain.

As a further motivation for the study in this paper, it is empirically well-known that

while MCMC sample path trajectories in stationarity typically resemble diffusion pro-

cesses, in their transient phases, algorithms often look almost deterministic, in that the

characteristic fluctuations of MCMC trajectories are absent, or dominated by a systematic

drift term. Figure 1 is an example of this type of behaviour from a Bayesian posterior

distribution for inference for partially observed diffusions (see Beskos et al., 2004, for

details). In this example, parameters are being updated using simple Metropolis jumping

rules. Further examples of this deterministic initial transient phase are considered later in

the paper.


Figure 1. A sample path of a Metropolis algorithm for a Bayesian posterior distribution for partially observed diffusions, as in Beskos et al. (2004).

So while the diffusion approximation is effective for studying algorithms in stationarity,

it seems likely that a different theory is needed to describe the transient phase. In this

paper, we consider what happens in the transient (pre-stationary) phase of the chain.

Here our focus will be on Metropolis and Langevin algorithms. Our investigation will

involve theory, numerical investigation, and Bayesian applications. Most of the theory

just covers high-dimensional Gaussian distributions, where clear-cut explicit results can

be given to explain almost deterministic transient behaviour. However, we shall see in

examples that the conclusions we draw from the theoretical study extend well beyond

the Gaussian case, providing useful practical guidance for MCMC users in complex high-

dimensional problems.

For the theory results, we consider chains which are started far out in the tails of

π(·), and study their approach to the “center” of π(·). Asymptotically (i.e., as d → ∞),

this happens deterministically as long as the proposal variance is scaled appropriately. In

particular, for Langevin algorithms, the proposal variance must only scale as

O(d^{-1/2}) or smaller. If the proposal variance recedes to zero more slowly than O(d^{-1/2}),


then the convergence time becomes exponentially large as a function of dimension, ex-

hibiting erratic behaviour and rejecting a large proportion of proposed moves in the tail

region.

To illustrate that these scaling problems for the Langevin algorithm are relevant in

practice, in Section 6 we will consider an example of a high-dimensional target density

related to Bayesian inference for a spatial generalised linear mixed model (GLMM). Our

example will involve inference for a latent log-Gaussian Cox point-process, given point

process data. For this example, and as predicted by the Gaussian theory, we demonstrate

that for two natural starting values (both being near the mode of the distribution) the

algorithm is stuck at the starting value, whereas for a starting value chosen approximately

from the stationary distribution, the algorithm mixes very well.

Metropolis methods are of course very commonly used, while Langevin methods are

less well-established in the statistical literature. However, Langevin algorithms offer huge

computational advantages in many problems (particularly for spatial models) and deserve

to be more widely adopted. For example, for the model considered in Section 6, Metropolis

methods are now established to be prohibitively slow, while Langevin methods offer a fea-

sible and practical alternative. In fact, theory predicts (see Roberts and Rosenthal, 1998)

that Langevin algorithms will almost always massively outperform Metropolis alternatives

in the stationary phase, whenever they are practically feasible to run. For further examples

of the use of Langevin methods for high-dimensional Bayesian problems, see Grenander

and Miller (1994), Neal (1996) and Møller, Syversveen and Waagepetersen (1998).

2. Definitions.

Given a d-dimensional probability density function of interest, π (with respect to

the Lebesgue measure), and a proposal kernel Q(x, ·), the Metropolis-Hastings algorithm

proceeds as follows. Suppose that at the t’th iteration, the current state of the algorithm

is given by Xt. Then the algorithm proposes a new value Yt+1 ∼ Q(Xt, ·), which has

proposal density with respect to the Lebesgue measure given by q(Xt, ·). The proposal

Yt+1 is accepted as the new value (and we therefore set Xt+1 = Yt+1) with probability


α(Xt, Yt+1), where

    α(x, y) = min{ 1, [π(y) q(y, x)] / [π(x) q(x, y)] } ,

and otherwise rejected, in which case we set Xt+1 = Xt.

Throughout this paper we will consider two algorithms. The symmetric random-walk

Metropolis algorithm takes q(x,y) to be a spherically symmetric function of ‖y−x‖, often

denoted by q(‖y − x‖). In this case, α(x, y) simplifies to

    α(x, y) = min{ 1, π(y)/π(x) } .

Therefore left to its own devices, Q would carry out a random walk with increment density

q(·). Most of what we describe below requires only that q(·) be a density of a square

integrable random variable. However for simplicity we shall assume that q is multivariate

Gaussian, i.e. Q(x, ·) ∼ MVN(x, hId) where Id is the d-dimensional identity matrix and h

is a proposal variance parameter.

The second algorithm we consider is the Langevin algorithm (see for example Roberts

and Tweedie, 1996; Roberts and Rosenthal, 1998), which is motivated by a discrete ap-

proximation to a Langevin diffusion, and takes Q(x, ·) ∼ MVN(x +∇ log π(x)h/2, hId).
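Both algorithms fit one accept/reject template. The following is a minimal sketch, not the authors' code: the helper names mh_step, make_rwm and make_mala are our own, and we work with log densities for numerical stability.

```python
import numpy as np

def mh_step(x, log_pi, sample, log_q, rng):
    """One Metropolis-Hastings iteration: propose y ~ Q(x, .), accept w.p. alpha(x, y)."""
    y = sample(x, rng)
    # log alpha(x, y) = log min{1, pi(y) q(y, x) / (pi(x) q(x, y))}
    log_alpha = min(0.0, log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y))
    return y if np.log(rng.uniform()) < log_alpha else x

def make_rwm(h):
    """Symmetric random-walk proposal Q(x, .) = MVN(x, h I_d)."""
    sample = lambda x, rng: x + np.sqrt(h) * rng.standard_normal(x.shape)
    log_q = lambda a, b: 0.0  # symmetric, so the q-terms cancel in alpha
    return sample, log_q

def make_mala(h, grad_log_pi):
    """Langevin proposal Q(x, .) = MVN(x + (h/2) grad log pi(x), h I_d)."""
    mean = lambda x: x + 0.5 * h * grad_log_pi(x)
    sample = lambda x, rng: mean(x) + np.sqrt(h) * rng.standard_normal(x.shape)
    log_q = lambda a, b: -float(np.sum((b - mean(a)) ** 2)) / (2.0 * h)  # up to a constant
    return sample, log_q
```

For the symmetric random-walk proposal the q-terms cancel, so make_rwm returns a zero log-density correction and only π(y)/π(x) enters the acceptance ratio.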

3. A Concrete Example.

To further motivate the results we shall describe later, we consider the case where π(·)

is a d-dimensional standard normal distribution, so that π(x) ∝ exp(−‖x‖²/2).

We consider a random-walk Metropolis algorithm with Gaussian proposal distribution

where the proposal variance h is scaled to be proportional to d^{-1} in each dimension. Thus

Yt+1 ∼ MVN(Xt, (ℓ²/d) Id). We also consider the Langevin algorithm with proposal

variance h scaled to be proportional to d^{-1/3}. Thus in this case Yt+1 ∼ MVN(Xt +

∇ log π(Xt) ℓ²/(2d^{1/3}), (ℓ²/d^{1/3}) Id). (Note that there is no relationship between ℓ as used

in the Metropolis case and ℓ as used in the Langevin case; in both cases, ℓ is just a

generic scale constant.) These variance scalings are optimally efficient in the stationary

phase (see Roberts et al., 1997, and Roberts and Rosenthal, 1998), and we also choose the

constant ℓ in the proposal variance to be the stationary-optimal value.
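The Metropolis half of this experiment takes only a few lines to reproduce. This is a sketch under our own choices (dimension d = 100 rather than the d = 1000 of Figure 2, ℓ = 2.38, and a fixed seed); the rescaled norm W_t = ‖X_t‖²/d should climb smoothly from 0 towards 1 and then fluctuate.

```python
import numpy as np

rng = np.random.default_rng(42)
d, ell, n_iter = 100, 2.38, 4000
h = ell ** 2 / d                       # proposal variance O(1/d)

x = np.zeros(d)                        # start at the origin, far out in the "tails"
w_trace = []
for _ in range(n_iter):
    y = x + np.sqrt(h) * rng.standard_normal(d)
    # symmetric proposal: alpha = min{1, pi(y)/pi(x)} with pi(x) proportional to exp(-||x||^2/2)
    if np.log(rng.uniform()) < 0.5 * (x @ x - y @ y):
        x = y
    w_trace.append(float(x @ x) / d)
```

The early part of w_trace rises almost smoothly, while the later part fluctuates around 1, mirroring the left-hand panels of Figure 2.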


Figure 2 below shows output from simulating the target density π(x) ∝ exp(−‖x‖²/2)

where x has dimension d = 1000. Plots on the left correspond to two independent ran-

dom walk Metropolis runs while plots on the right correspond to Langevin runs. For the

Metropolis traces, we have superimposed a deterministic function which is formally defined

in (3) below, and which we will discuss in more detail later.

Figure 2. Trace plots of ‖x‖² for simulating a 1000-dimensional normal distribution, when starting at the origin. Left: random walk Metropolis, together with a solid line giving the function f defined in (3) below. Right: Langevin. Two independent simulations are made for each algorithm.

For the random walk Metropolis method, the initial convergence appears to be al-

most deterministic, governed by the function f . However once the algorithm reaches the

stationary region, its trajectory becomes more obviously stochastic, displaying behaviour


characteristic of a diffusion process. Note that convergence to the stationary region is

fairly quick, but subsequent mixing displays high serial correlation.

In contrast, the Langevin method takes a large number of iterations to move at all,

and then proceeds in a sequence of unpredictable irregular jumps to move towards the

stationary region. The algorithm displays ‘sticky patches’ when in tail regions, where large

numbers of successive iterations are rejected. However once the algorithm has succeeded

in finding the stationary region, its mixing is very rapid indeed.

The theory in Roberts and Rosenthal (1998) accurately predicts the comparative

performance of these two algorithms in stationarity: the mixing time for the Langevin

algorithm ought to be O(d^{1/3}), comparing favourably with O(d) for the random walk Metropolis

algorithm. However, from this output it is clear that the transient phase of the algorithms

needs further investigation, since here, the random walk Metropolis method appears to be

doing better.

4. Scaling limit for random-walk Metropolis algorithms.

We now consider the random-walk Metropolis algorithm on the above Gaussian ex-

ample analytically. The Gaussian context here allows a detailed theoretical study to be

performed. It appears to be very difficult to extend these theoretical arguments

substantially beyond the context we shall present here, although we emphasise that, in practice,

the phenomenon we shall describe here is observed empirically far more generally.

We let W^d_t = (1/d)‖X_{⌊td⌋}‖², so that W^d is a process which keeps track of the

norm-squared of X, shrunk by a factor d, with time speeded up by a factor d. The functional

W^d is the natural one to consider in this problem, for two reasons. Firstly, W^d_t

is a Markov process, because of the spherical symmetry of the problem, and this allows

relatively straightforward calculations to establish its limiting behaviour as d goes to

infinity. Secondly, since we shall be interested in particular in the starting value X0 = 0,

the norm-squared is in some sense the only interesting functional: all angular components

mix immediately, in one iteration. Thus, the study of W^d is sufficient to characterise the

convergence of the entire distribution of X.

We have the following calculation. (The proofs of this result, and of most others in


this paper, have been placed in the appendix, so as not to interrupt the flow of the paper.)

Lemma 1. For t = 0, 1, 2, . . .,

    lim_{d→∞} d E[ W^d_{(t+1)/d} − W^d_{t/d} | W^d_{t/d} = w ] = a_ℓ(w) ,   (1)

where

    a_ℓ(w) = ℓ² Φ(N*) + exp((ℓ²/2)(w − 1)) (1 − 2w) ℓ² Φ(−N* − ℓ w^{1/2}) ,   (2)

with N* = −ℓ w^{−1/2}/2, φ(s) = (2π)^{−1/2} exp(−s²/2), and Φ(s) = ∫_{−∞}^{s} φ(u) du. Moreover,

the convergence in (1) is uniform for w ∈ [0, K] for all K > 0.
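The drift (2) is straightforward to evaluate; the sketch below uses only the standard library, with the w = 0 case handled via the limit a_ℓ(0) = ℓ² exp(−ℓ²/2) (the function names are ours).

```python
import math

def Phi(s):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))

def a(w, ell):
    """The limiting drift a_ell(w) of equation (2)."""
    if w <= 0.0:
        return ell ** 2 * math.exp(-ell ** 2 / 2.0)  # limit as w -> 0
    n_star = -ell / (2.0 * math.sqrt(w))             # N* = -ell w^{-1/2} / 2
    return (ell ** 2 * Phi(n_star)
            + math.exp((ell ** 2 / 2.0) * (w - 1.0)) * (1.0 - 2.0 * w)
              * ell ** 2 * Phi(-n_star - ell * math.sqrt(w)))
```

Sanity checks consistent with Figure 3: a_ℓ(1) = 0 (stationarity is a fixed point), a_ℓ(w) > 0 for w < 1, and a_ℓ(w) < 0 for w > 1.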

We shall require one more technical result which ensures that fluctuations of W^d are

not too ‘severe’. This will essentially allow us to show that the sample paths resemble a

deterministic trajectory in an appropriate sense.

Lemma 2. For t = 0, 1, 2, . . . , and for all K ≥ 0,

    lim sup_{d→∞} sup_{0≤w≤K} d² E[ ( W^d_{(t+1)/d} − W^d_{t/d} )² | W^d_{t/d} = w ] < ∞ .

We are now in a position to state formally the main result of this section. From

Lemmas 1 and 2, we obtain the following.

Theorem 3. When W^d_0 = w0, then as d → ∞, we have W^d ⇒ f , where ⇒ denotes weak

convergence, and where f is the deterministic function satisfying f(0) = w0 and

    f′(t) = a_ℓ(f(t)) ,   (3)

with a_ℓ(·) as in (2).

Theorem 3 therefore explains the deterministic sample path behaviour of the random-

walk Metropolis algorithm during its transient phase, as illustrated in Figure 2.

Figure 3 shows the function a_ℓ(w) for different values of ℓ. Convergence of the algorithm

is quicker when the modulus of a_ℓ is large. We see that there exists no ℓ which

uniformly maximises the speed of convergence.


Figure 3. A collection of the a_ℓ(·) functions defined in (2) for various different values

of ℓ. Convergence is quicker when a_ℓ is as large as possible in modulus: positive for

w < 1, negative for w > 1. The thick solid curve, for ℓ = 2.38, represents the scaling

corresponding to optimal mixing for a chain started in stationarity. The dashed curve, for

ℓ = √2, represents the scaling produced by optimising a_ℓ(0). Both the thick solid and

dashed curves perform close to optimally for all values of w. The four remaining curves

are for ℓ = 1, . . . , 4.
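The limiting trajectory of Theorem 3 can be traced by integrating (3) with Euler steps. A self-contained sketch (the step size, horizon, and the two compared ℓ values are our choices):

```python
import math

def Phi(s):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))

def a(w, ell):
    """The limiting drift a_ell(w) of equation (2)."""
    if w <= 0.0:
        return ell ** 2 * math.exp(-ell ** 2 / 2.0)  # limit as w -> 0
    n_star = -ell / (2.0 * math.sqrt(w))
    return (ell ** 2 * Phi(n_star)
            + math.exp((ell ** 2 / 2.0) * (w - 1.0)) * (1.0 - 2.0 * w)
              * ell ** 2 * Phi(-n_star - ell * math.sqrt(w)))

def euler_path(ell, w0=0.0, dt=0.01, T=30.0):
    """Euler integration of f'(t) = a_ell(f(t)), f(0) = w0."""
    f, path = w0, [w0]
    for _ in range(int(T / dt)):
        f += dt * a(f, ell)
        path.append(f)
    return path
```

Both ℓ = 2.38 and ℓ = √2 drive f from 0 to the fixed point at 1, matching the observation that both curves in Figure 3 perform close to optimally.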

Started in stationarity, the mixing time of optimally scaled random walk Metropolis

algorithms, on reasonably behaved target distributions, is known to be O(d) as d → ∞.

We argue informally that the convergence time of the random walk Metropolis algorithm,

when started from the transient phase, is still O(d). Indeed, from Theorem 3, the time

taken for W to reach 1−1/d is O(log d). However, once W reaches 1−1/d, then it behaves

as in its stationary phase, and thus converges in distribution in O(d) further iterations.

Thus, W, and hence also ‖X‖², converges to stationarity in time O(d + log d) = O(d).


5. Scaling limits for Langevin algorithms.

We now move on to study in more detail the erratic behaviour of the Langevin

algorithm in its transient phase, as illustrated in Figure 2. For the example in Section 3,

we first provide theoretical justification for the problematic behaviour observed when the

proposal variance is h = ℓ²d^{-1/3}. Secondly, we motivate a different scaling limit for the

algorithm.

For the Langevin algorithm with scaling O(d^{-1/3}), we obtain a result analogous to,

but qualitatively different from, Lemma 1.

Lemma 4. Consider again the d-dimensional standard normal target density, and

its exploration using the Langevin algorithm with scaling h = ℓ²d^{-1/3}. Let W^d_t =

(1/d)‖X_{⌊td^{1/3}⌋}‖². Then as d → ∞, for 0 ≤ w ≤ 1, and t = 0, 1, 2, . . .,

    d^{1/3} E[ W^d_{(t+1)/d^{1/3}} − W^d_{t/d^{1/3}} | W^d_{t/d^{1/3}} = w ] ≈ ℓ²(1 − w) acc_d(w) ,

where acc_d(w) = min{1, exp(−d^{1/3} ℓ⁴ (1 − w)/8)} is the acceptance probability of moves

from w. In particular, for w < 1, as d → ∞, acc_d(w) decreases as O(e^{−C d^{1/3}}) for some

C > 0.
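The rate at which acc_d(w) collapses can be tabulated directly from the formula in Lemma 4. A quick sketch (taking ℓ = 1.65, the stationary-optimal constant used in our simulations):

```python
import math

def acc(d, w, ell=1.65):
    """Lemma 4 acceptance probability: min{1, exp(-d^{1/3} ell^4 (1 - w) / 8)}."""
    return min(1.0, math.exp(-d ** (1 / 3) * ell ** 4 * (1.0 - w) / 8.0))

for d in (100, 1000, 4096):
    print(d, acc(d, 0.0))   # moves from the origin (w = 0)
```

For w ≥ 1 the exponent is non-negative, so the min returns 1 and proposed moves are accepted; the problem is confined to w < 1.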

From Lemma 4, we see the reason for the problems reported in Section 3. The

acceptance probability of moves from the origin decays exponentially in d^{1/3}, leading to

severe mixing problems. In particular, this explains why the Langevin algorithm remains

at 0 for so long in the runs of Figure 2.

We also observe that for w ≥ 1, the algorithm behaves well (when using the variance

scaling O(d^{-1/3})). Therefore, it is only starting values too close to the mode (the origin)

which lead to severe convergence problems.

On the other hand, we may avoid the problem of acceptance probabilities decreasing

to 0 by choosing the variance scaling to be O(d^{-1/2}) instead of O(d^{-1/3}):

Lemma 5. Consider the Langevin algorithm with scaling h = ℓ²d^{-1/2}. Let W^d_t =

(1/d)‖X_{⌊td^{1/2}⌋}‖². Then

    lim_{d→∞} d^{1/2} E[ W^d_{(t+1)/d^{1/2}} − W^d_{t/d^{1/2}} | W^d_{t/d^{1/2}} = w ] = b_ℓ(w) ,


where

    b_ℓ(w) = ℓ²(1 − w) min{1, exp(−ℓ⁴(1 − w)/8)} .   (4)

Also, for all K > 0,

    lim sup_{d→∞} sup_{0≤w≤K} d E[ ( W^d_{(t+1)/d^{1/2}} − W^d_{t/d^{1/2}} )² | W^d_{t/d^{1/2}} = w ] < ∞ .
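The drift b_ℓ of (4) is simple enough to check numerically; a short sketch (the function name is ours):

```python
import math

def b(w, ell):
    """Limiting drift b_ell(w) of (4) for the O(d^{-1/2})-scaled Langevin algorithm."""
    return ell ** 2 * (1.0 - w) * min(1.0, math.exp(-ell ** 4 * (1.0 - w) / 8.0))
```

For w < 1 the min picks the exponential factor (moves from the tail towards the mode are penalised), while for w ≥ 1 it equals 1; b_ℓ(1) = 0, so w = 1 is again the fixed point.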

Using Lemma 5, it follows that for the Langevin algorithm using a scaling

O(d^{-1/2}), we have a result similar to Theorem 3, again providing deterministic convergence

from the transient phase to the stationary phase.

Theorem 6. When W^d_0 = w0, then as d → ∞, we have W^d ⇒ f , where ⇒ denotes weak

convergence, and where f is the deterministic function satisfying f(0) = w0 and

    f′(t) = b_ℓ(f(t)) ,   (5)

with b_ℓ(·) as in (4).

We therefore should see deterministic sample path behaviour of the Langevin algorithm

during its transient phase, when using a variance scaling of O(d^{-1/2}).

To use Theorem 6 for simulations, we first need to decide how to choose the constant

ℓ in the proposal variance h = ℓ²/d^{1/2}. Figure 4 shows the function b_ℓ(w) for different

choices of ℓ. We see that, as in the case of a_ℓ(w), there exists no ℓ which uniformly

maximises the speed of convergence.
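One defensible choice is the ℓ that maximises the initial drift b_ℓ(0) = ℓ² exp(−ℓ⁴/8); setting the derivative to zero gives ℓ⁴ = 4, i.e. ℓ = √2. A quick grid search confirms this (a sketch; the grid is our choice):

```python
import math

def b0(ell):
    """Initial drift b_ell(0) = ell^2 * exp(-ell^4 / 8), from (4)."""
    return ell ** 2 * math.exp(-ell ** 4 / 8.0)

# brute-force search over a fine grid on (0, 4]
ells = [i / 1000.0 for i in range(1, 4001)]
best = max(ells, key=b0)
```

The maximiser lands at ℓ ≈ 1.414, and the stationary-optimal ℓ = 1.65 gives a strictly smaller initial drift.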


Figure 4. A collection of the b_ℓ(·) functions defined in (4) for various different values

of ℓ. Convergence is quicker when b_ℓ is as large as possible in modulus, and positive for

w < 1, negative for w > 1. The dashed curve is for ℓ = √2, and the five solid curves are

for ℓ = 1.25, 1.75, 2, 2.5, 3.

The value ℓ = √2 maximises b_ℓ(0), and in Figure 5 we show a trace plot for the target

density (for the example in Section 3) with this choice of ℓ.

Figure 5. Trace plots of ‖x‖² for simulating a 1000-dimensional normal distribution starting at the origin, using the Langevin algorithm with proposal variance h = ℓ²/d^{1/2} with ℓ = √2. Left: trace plot. Right: trace plot for the first 500 iterations, together with a solid line giving the solution to f′(t) = b_ℓ(f(t)), where b_ℓ(·) is as in (4).

The initial convergence from the transient phase now appears to be quick, and almost

deterministic. Also, once the algorithm has reached the stationary region, its mixing is

rapid. However, the mixing in the stationary region of the Langevin algorithm scaled as

O(d^{-1/2}) is still slower than when scaled as O(d^{-1/3}), since O(d^{-1/3}) is the optimal choice

in stationarity (Roberts and Rosenthal, 1998).

An obvious and important question is whether or not the result in Lemma 5 is specific

to the normal target density. Clearly, for other target densities, the form of (4) is different,


since in the derivation we have used extensively that ∇ log π(x) = −x for the normal

density. However, our conclusion about using the scaling h = ℓ²d^{-1/2} during the transient

phase for the Langevin algorithm also appears to hold for other distributions. Whilst

proving results as general as Theorem 6 is difficult, we can nevertheless prove the following

weaker statement.

Theorem 7. Suppose that π(x) = exp(Σ_{i=1}^{d} g(x_i)), where g is any four-times

differentiable function. Consider the Langevin algorithm for π(·), with variance scaling

h = ℓ²d^{-1/2}. Assume the algorithm’s initial point X0 is at a mode of π(·). Let p_d be the

probability that the first proposed move is accepted, i.e. that X1 ≠ X0. Then lim_{d→∞} p_d

exists and is strictly positive.

Theorem 7 shows that for a general class of target distributions π(·), the Langevin

algorithm with scaling O(d^{-1/2}), in contrast with O(d^{-1/3}), will not get “stuck” when

starting from the mode.

6. Log-Gaussian Cox point-process example.

We now consider an example of a high-dimensional target density related to infer-

ence for a log-Gaussian Cox point-process. The example is from Møller, Syversveen and

Waagepetersen (1998) and consists of locations of 126 Scots pine saplings in a natural

forest in Finland. The locations are shown in the left plot in Figure 6.

The discretised version of the model used in the paper can be defined as follows. First,

the area of interest, [0, 1]², is discretised into a 64 × 64 regular grid, where the random

variables X = {X_{i,j}} are the numbers of points in grid cells (i, j), i, j = 1, . . . , 64. The

dimension of the problem is thus d = 64² = 4096. Note that due to the fine discretisation

used, most of the grid cells contain no points, and only a few contain more than one point.

Given an unobserved intensity process Λ(·) = {Λ(i, j) : i, j = 1, . . . , 64}, the random

variables X_{i,j}, i, j = 1, . . . , 64, are assumed to be conditionally independent and Poisson

distributed with means mΛ(i, j), i, j = 1, . . . , 64, where m = 1/4096 is the area of each

grid cell. The prior assumed for Λ(·) is

    Λ(i, j) = exp(Y_{i,j}),


where Y = (Y_{i,j}, i, j = 1, . . . , 64) is multivariate Gaussian with mean E[Y] = µ1 and

covariance matrix Cov(Y) = Σ, where

    Σ_{(i,j),(i′,j′)} = σ² exp(−((i − i′)² + (j − j′)²)^{1/2}/(64β)).

Møller et al. (1998) estimated the parameter values β = 1/33, σ² = 1.91 and µ = log(126) − σ²/2,

which we will use here.

Møller et al. (1998) considered simulation of the intensity Λ(·) given the data X = x,

or equivalently Y given X = x. Using Bayes’ formula, the target density of interest is

    f(y | x) ∝ ∏_{i,j=1}^{64} exp(x_{i,j} y_{i,j} − m exp(y_{i,j})) · exp(−0.5 (y − µ1)^T Σ^{−1} (y − µ1)) ,

where y = (y_{i,j}, i, j = 1, . . . , 64). As in Papaspiliopoulos et al. (2003) and Christensen et

al. (2003), we reparameterise Y = µ1 + LΓ, where L is obtained by Cholesky factorisation

such that Ω = LL^T, where Ω = (Σ^{−1} + diag(y))^{−1}. The target density is now

    f(γ | x) ∝ ∏_{i,j=1}^{64} exp(x_{i,j} y_{i,j} − m exp(y_{i,j})) · exp(−0.5 γ^T L^T Σ^{−1} L γ) ,

where the vector y = µ1 + Lγ. The gradient of the log target density is

    ∇ log f(γ | x) = −L^T Σ^{−1} L γ + L^T {x_{i,j} − m exp(y_{i,j})}_{i,j} .
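The reparameterised target and its gradient can be checked on a toy version of the model. The sketch below is ours, not the authors' code: it uses a small 8 × 8 grid instead of 64 × 64, synthetic Poisson counts in place of the data, and a constant diagonal inside Ω (the diag term in the text depends on the latent field, which we sidestep here purely for illustration).

```python
import numpy as np

g = 8                                  # small grid for illustration (64 in the paper)
d = g * g
beta, sigma2 = 1 / 33, 1.91            # parameter values from the paper
mu = np.log(126) - sigma2 / 2
m = 1.0 / d                            # grid-cell area (1/4096 in the paper)

# covariance Sigma_{(i,j),(i',j')} = sigma^2 exp(-dist / (g * beta))
coords = np.array([(i, j) for i in range(g) for j in range(g)], dtype=float)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = sigma2 * np.exp(-dist / (g * beta))
Sigma_inv = np.linalg.inv(Sigma)

rng = np.random.default_rng(7)
x = rng.poisson(0.5, size=d).astype(float)   # synthetic counts (stand-in for the data)

# reparameterisation Y = mu 1 + L Gamma with Omega = L L^T; a constant diagonal
# is used inside Omega = (Sigma^{-1} + diag(c))^{-1} purely for this sketch
c = np.full(d, m * np.exp(mu))
L = np.linalg.cholesky(np.linalg.inv(Sigma_inv + np.diag(c)))

def log_target(gamma):
    """log f(gamma | x), up to an additive constant."""
    y = mu + L @ gamma
    return float(np.sum(x * y - m * np.exp(y)) - 0.5 * gamma @ (L.T @ Sigma_inv @ L) @ gamma)

def grad_log_target(gamma):
    """Gradient -L^T Sigma^{-1} L gamma + L^T (x - m exp(y))."""
    y = mu + L @ gamma
    return L.T @ (x - m * np.exp(y)) - L.T @ Sigma_inv @ (L @ gamma)
```

The gradient function implements the displayed formula directly, and can be verified against finite differences of log_target.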

As in Møller et al. (1998), we use a Langevin algorithm for MCMC simulation. The

high dimensionality of the problem makes mixing of a Metropolis alternative prohibitively

slow. (For this model, that was noticed by Christensen and Waagepetersen, 2002, and

theoretical arguments explaining this can be found in Roberts and Rosenthal, 2001. We

therefore omit any comparison with the random walk Metropolis algorithm for this prob-

lem.)

The right-hand plot in Figure 6 shows the estimated intensity E[Λ(·) | x] based on

100,000 iterations, subsampling every 10th iterate for storage reasons and using

starting value II below; the plot is similar to the upper-left plot of Figure 12 in Møller et al.

(1998), the only difference being that more grey-scale colours are used here.


Figure 6. Scots pine saplings. Left: locations of trees. Right: the estimated intensity

E[Λ(·) | x].

Note that using the Cholesky factorisation as above is computationally slow, and is

only used here for simplicity of presentation. Møller et al. (1998) use a different reparam-

eterisation and a circulant embedding technique, where they extend the grid to a torus

and use the two-dimensional fast Fourier transform (FFT) to reduce the computational

burden. We refer to their paper for further details about this.

Now we compare the performance of the Langevin algorithm above for three different

starting values. The starting values expressed in terms of Y (which have to be transformed

to starting values for Γ) are

I: Y_{i,j} = µ for i, j = 1, . . . , 64.

II: a random starting value for Γ, simulated from Γ ∼ MVN(0, I_{4096}).

III: a starting value near the posterior mode: let Y_{i,j} solve the equation 0 = x_{i,j} −

exp(Y_{i,j}) − (Y_{i,j} − β)/σ².

In all three cases we use the optimal scaling for a vector of independent standard

normal random variates in stationarity, ℓ̂²/(4096)^{1/3} = 0.1701563, where ℓ̂ = 1.65.
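The quoted numbers follow from 4096^{1/3} = 16 and 4096^{1/2} = 64; a quick arithmetic check (variable names ours):

```python
# stationary-optimal Langevin scaling of Section 6, with ell-hat = 1.65
h_stationary = 1.65 ** 2 / 4096 ** (1 / 3)
# transient-friendly scaling of Section 5, with ell-hat = sqrt(2), so (ell-hat)^2 = 2
h_transient = 2.0 / 4096 ** 0.5
print(h_stationary, h_transient)
```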


Figure 7. Scots pine saplings. Trace plots of log f(γ | x) when using the scaling 1.65²/(4096)^{1/3}.

Left: starting value I. Middle: starting value II. Right: starting value III.

For starting value II, we observed that convergence to equilibrium was fast, taking

less than a hundred iterations. The overall acceptance rate was approximately 54%,

which is close to the asymptotically optimal value of 57.4% (see Roberts and Rosenthal,

1998). For starting values I and III, all proposals were rejected. Figure 7 shows the

trace plots of log f(γ | x) from the algorithm with these three different starting values.

It can be seen that the cases where the algorithm rejects all proposed moves (I and III)

correspond to starting the algorithm way out in the tail of the target distribution.

For comparison, Figure 8 shows trace plots for the starting values I, II and III, where

we instead used the scaling ℓ̂²/(4096)^{1/2} = 0.03125, where ℓ̂ = √2 was found in Section 5

to maximise the speed of convergence when starting at the mode for the standard normal

distribution, i.e. to maximise b_ℓ(0) with b_ℓ(·) as in (4). We see that the algorithm

approaches the equilibrium distribution very rapidly for all three starting values. The

acceptance rate was approximately 96% in all three cases; the scaling is therefore too

small in stationarity, as expected.


Figure 8. Scots pine saplings. Trace plots of log f(γ | x) when using the scaling (√2)²/(4096)^{1/2}.

Left: starting value I. Middle: starting value II. Right: starting value III.

In practice, an analysis of this type will need to incorporate parameter uncertainty,

and include MCMC update steps for the parameters as well as hidden Gaussian field

updates. Since the dimensionality of the parameters is usually small (2 or 3) this can be

done using fairly routine methods once careful reparameterisation has been carried out.

See Christensen et al. (2003) for details.

7. Discussion.

The results above introduce the intriguing suggestion that proposals for Langevin algorithms should be scaled very differently in the transient and stationary phases: once stationarity has been reached, the proposal variance can increase from $O(d^{-1/2})$ to $O(d^{-1/3})$ (where for instance an $O(d^{-1/3})$ variance is defined as described in our example in

Section 6). In practice, implementing a purely adaptive strategy which does this automat-

ically is not a feasible solution, since it could destroy the stationarity of π(·). One simple

solution to this problem is to alternate O(d−1/2) and O(d−1/3) moves in order to cover

the possibility of being in either regime at each time-point. This strategy would resemble

somewhat that of using a heavy tailed proposal distribution, a suggestion considered in

Stramer and Tweedie (1999a,b) to improve the convergence of MCMC algorithms.
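A minimal sketch of this alternating strategy, again on the standard normal target (our own illustration; the simplified Gaussian log-ratio is exact only for this target, and the function name is ours). Since each move is an ordinary Metropolis-Hastings step with its own proposal kernel, deterministically cycling the two step sizes leaves $\pi(\cdot)$ invariant:

```python
import math
import random

def mala_alternating(h_small, h_large, d, n_iter, seed=2):
    """Alternate O(d^{-1/2}) and O(d^{-1/3}) Langevin proposals on pi = N(0, I_d).

    Each step is a valid Metropolis-Hastings move with its own step size h,
    so the composition still has pi as its stationary distribution.
    """
    rng = random.Random(seed)
    x = [0.0] * d                       # start at the mode (transient regime)
    sq_x = 0.0
    accepted = 0
    for t in range(n_iter):
        h = h_small if t % 2 == 0 else h_large   # alternate the two regimes
        y = [(1.0 - h / 2.0) * xi + math.sqrt(h) * rng.gauss(0.0, 1.0) for xi in x]
        sq_y = sum(yi * yi for yi in y)
        if math.log(rng.random()) < -h * (sq_y - sq_x) / 8.0:
            x, sq_x = y, sq_y
            accepted += 1
    return accepted, sq_x / d

d = 4096
accepted, w_final = mala_alternating(2.0 / d ** 0.5, 1.65 ** 2 / d ** (1.0 / 3.0), d, 400)
```

Started at the mode, the $O(d^{-1/2})$ moves carry the chain through the transient phase while the $O(d^{-1/3})$ moves are initially rejected; once the chain is near stationarity the larger moves begin to be accepted and provide faster mixing.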


In contrast, optimal scalings of Metropolis algorithms have rather stable properties in the tails, and it is always optimal to set the proposal variance to be $O(d^{-1})$. However, even in this case, the optimal acceptance rate of the algorithm can vary. To consider this further, we note that to maximise the speed of movement towards convergence, we need to maximise $a_\ell(w)$ as a function of $\ell$ (separately for each $0 \le w < 1$). The resulting optimal value, $\ell^*(w)$, can be numerically computed, but it depends explicitly on $w$. Fortunately, reasonable choices of $\ell$, including both $\ell = \ell^*(0)$ and $\ell$ equal to the optimal scaling in stationarity, are not too far from being optimal for all values of $0 \le w < 1$ (see Figure 3).
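Combining $I_1$ and $I_2$ from the proof of Lemma 1 in the Appendix gives, under our reading, the closed form $a_\ell(w) = \ell^2\Phi(N^*) + \ell^2(1 - 2w)\,e^{\ell^2(w-1)/2}\,\Phi(-N^* - \ell w^{1/2})$ with $N^* = -\ell w^{-1/2}/2$. The sketch below (our own; all function names are ours, standard library only) evaluates this and grid-searches for $\ell^*(w)$:

```python
import math

def std_normal_cdf(x):
    # Phi(x) via the error function (no external dependencies)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def drift_a(ell, w):
    """Limiting drift a_ell(w) for the random-walk Metropolis chain,
    assembled from I_1 + I_2 in the proof of Lemma 1 (valid for 0 < w <= 1)."""
    n_star = -ell / (2.0 * math.sqrt(w))
    return (ell ** 2 * std_normal_cdf(n_star)
            + ell ** 2 * (1.0 - 2.0 * w) * math.exp(ell ** 2 * (w - 1.0) / 2.0)
              * std_normal_cdf(-n_star - ell * math.sqrt(w)))

def ell_star(w, grid=None):
    """Grid search for the ell maximising the drift at a given w."""
    if grid is None:
        grid = [0.01 * k for k in range(1, 501)]   # ell in (0, 5]
    return max(grid, key=lambda ell: drift_a(ell, w))

# The optimal scaling depends explicitly on w:
opts = {w: ell_star(w) for w in (0.1, 0.25, 0.5, 0.75)}
```

At $w = 1$ the drift vanishes identically in $\ell$, as it must in stationarity, so the optimisation is only meaningful for $w < 1$.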

These results also raise the question of how to choose default starting values for algorithms. This is a very practical issue, clearly of interest for routine use of MCMC algorithms. In particular, for the Langevin algorithm we have seen that natural candidates

for default starting values are too near the mode of the distribution, leading to difficulties

regarding the scaling of the algorithm.

Most of the rigorous results in this paper are proved for concrete Gaussian target

distribution examples. Analogues of Theorems 3 and 6, and also Lemma 5, will no doubt

hold for suitably smooth spherically symmetric densities, and Theorem 7 covers a more

general class of target distributions. It is unrealistic to hope that blanket results might

hold to cover all distributions obtained from complex Bayesian analyses. However, we

believe that the broad conclusions of our main results hold rather generally in realistic and

complex statistical problems, and we believe this is illustrated through the point-process

example of Section 6.

Deterministic behaviour of sample paths might seem surprising, given the stochastic nature of the algorithms we consider. However, stochasticity is still present; it is just being masked by a drift term of relatively overwhelming magnitude. In stationarity, of course, this drift is not present, so that the stochastic behaviour prevails. Also, in the transient phase, it is still possible to observe stochasticity by considering different functionals. For instance, in the standard Gaussian example of Section 3, observing $X_1$ instead of $\|X\|$ will give diffusion-type behaviour rather than deterministic smooth sample paths. Because our work here is based on starting distributions which are non-stationary, there is no inconsistency between this and the diffusion limit type results of, for example, Roberts


and Rosenthal (1998).

As has been seen in other contexts (see for example Roberts and Rosenthal, 2001),

our results indicate that simple Metropolis algorithms have rather robust properties in the

following sense. They require no special adaptive scaling strategies within the transient

phase, and optimal scaling for stationary algorithms turns out to be close to optimal in

all regions of the state space. In contrast, the use of Langevin methods can always be

quicker, and we have seen that properly scaled Langevin algorithms do hugely outperform

competitor Metropolis alternatives (which is crucial in high dimensional examples such as

that in Section 6), but great care is required in scaling.

Thus a broad conclusion is that Langevin algorithms are very powerful alternatives to

Metropolis algorithms, but their implementation can sometimes require great care. These

findings support the use of hybrid strategies which alternate between different update

schemes, for instance using both Metropolis and Langevin updates, or by interspersing

complicated update schemes with collections of computationally cheaper single site moves.

The focus in this paper is on global scaling of high-dimensional proposals. It should be

noted however, that reparameterisation of target densities is also frequently needed for the

methods we discuss to be effective. In fact, this is particularly true for Langevin methods

as can be seen from Roberts and Rosenthal (2001), and our example in Section 6 uses such

a reparameterisation as well.

8. Appendix: Proofs of Results.

Proof of Lemma 1. We write $Y_{t+1} = X_t + (\ell^2/d)^{1/2} Z$, where $Z$ is standard normal. Then the proposed value for $W^d_{(t+1)/d}$ (given $W^d_{t/d}$) is given by
\[
\|X_t + (\ell^2/d)^{1/2} Z\|^2 = \|X_t\|^2 + 2(\ell^2/d)^{1/2} X_t^T Z + (\ell^2/d)\|Z\|^2 = \|X_t\|^2 + V_d + \varepsilon_d,
\]
say, where we set $\varepsilon_d = \ell^2\left(d^{-1}\|Z\|^2 - 1\right)$ and $V_d = 2(\ell^2/d)^{1/2} X_t^T Z + \ell^2$. Thus $V_d \sim N(\ell^2, 4\ell^2\|X_t\|^2/d)$, i.e. $N(\ell^2, 4\ell^2 w)$. Notice that $\varepsilon_d \to 0$ in $L^1$ by a simple law of large numbers argument.


Now, this proposed value is then accepted with probability equal to the minimum of 1 and
\[
\frac{\pi(Y_{t+1})}{\pi(X_t)} = \frac{\exp(-\|Y_{t+1}\|^2/2)}{\exp(-\|X_t\|^2/2)} = \exp\left(-(\|Y_{t+1}\|^2 - \|X_t\|^2)/2\right) .
\]
Therefore, by the $L^1$ convergence of $\varepsilon_d$ and the fact that the function $x \mapsto x \min(1, e^{-x/2})$ is a contraction,
\[
d \, E\left[W^d_{(t+1)/d} - W^d_{t/d} \mid W^d_{t/d} = w\right] = E\left[(V_d + \varepsilon_d) \min\left(1, \exp(-(V_d + \varepsilon_d)/2)\right)\right] \qquad (6)
\]
\[
\to \; E\left[(\ell^2 + 2\ell w^{1/2} U) \min\left(1, \exp(-(\ell^2 + 2\ell w^{1/2} U)/2)\right)\right],
\]
with the last expectation taken with respect to $U \sim N(0,1)$. Since $\varepsilon_d$ is independent of $w$, it is easy to see that this convergence takes place uniformly for $w \in [0,K]$ for any $K > 0$.

The remainder of the proof is concerned with computing this expectation, which is straightforward but tedious.

We first recall that $N^* = -\ell w^{-1/2}/2$, and note that the above "min" equals 1 for $U \le N^*$ only. Hence, as $d \to \infty$,
\[
d \, E\left[W^d_{(t+1)/d} - W^d_{t/d} \mid W^d_{t/d} = w\right] \approx \int_{N^*}^{\infty} \phi(u) \exp(-\ell^2/2 - \ell w^{1/2} u)(\ell^2 + 2\ell w^{1/2} u) \, du + \int_{-\infty}^{N^*} \phi(u)(\ell^2 + 2\ell w^{1/2} u) \, du \equiv I_1 + I_2 \, .
\]
Then
\[
I_1 = \int_{N^*}^{\infty} \phi(u + \ell w^{1/2}) \exp(\ell^2 w/2 - \ell^2/2) \left(2\ell w^{1/2}(u + \ell w^{1/2}) - 2\ell^2 w + \ell^2\right) du
\]
\[
= \exp(\ell^2 w/2 - \ell^2/2) \left[ 2\ell w^{1/2}\, \phi(N^* + \ell w^{1/2}) + (-2\ell^2 w + \ell^2)\left(1 - \Phi(N^* + \ell w^{1/2})\right) \right]
\]
\[
= 2\ell w^{1/2}\, \phi(N^*) + \exp(\ell^2 w/2 - \ell^2/2)(-2\ell^2 w + \ell^2)\, \Phi\left(-N^* - \ell w^{1/2}\right),
\]
where we first used $\phi(u)\exp(-\ell w^{1/2} u) = \phi(u + \ell w^{1/2}) \exp(\ell^2 w/2)$, second that $\int_{u^*}^{\infty} u \, \phi(u) \, du = \phi(u^*)$, and third that $\phi(N^* + \ell w^{1/2}) = \phi(N^*) \exp(\ell^2/2 - \ell^2 w/2)$.

Similarly,
\[
I_2 = \ell^2 \Phi(N^*) - 2\ell w^{1/2}\, \phi(N^*) \, .
\]


Combining the expressions for $I_1$ and $I_2$, the result follows.
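As an independent check of the algebra above (our own addition, not part of the original proof), the closed form $I_1 + I_2$ can be compared against a direct trapezoidal quadrature of the limiting expectation $E[(\ell^2 + 2\ell w^{1/2} U)\min(1, \exp(-(\ell^2 + 2\ell w^{1/2} U)/2))]$ over $U \sim N(0,1)$:

```python
import math

def phi(u):
    # standard normal density
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(u):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def closed_form(ell, w):
    """I_1 + I_2 from the proof of Lemma 1."""
    n_star = -ell / (2.0 * math.sqrt(w))
    i1 = (2.0 * ell * math.sqrt(w) * phi(n_star)
          + math.exp(ell ** 2 * (w - 1.0) / 2.0) * (ell ** 2 - 2.0 * ell ** 2 * w)
            * Phi(-n_star - ell * math.sqrt(w)))
    i2 = ell ** 2 * Phi(n_star) - 2.0 * ell * math.sqrt(w) * phi(n_star)
    return i1 + i2

def quadrature(ell, w, lo=-12.0, hi=12.0, n=50000):
    """Trapezoidal approximation of the expectation over U ~ N(0, 1)."""
    du = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        u = lo + k * du
        v = ell ** 2 + 2.0 * ell * math.sqrt(w) * u
        f = phi(u) * v * min(1.0, math.exp(-v / 2.0))
        total += f * (0.5 if k in (0, n) else 1.0)
    return total * du

err = max(abs(closed_form(ell, w) - quadrature(ell, w))
          for ell in (0.5, 1.0, 2.0) for w in (0.1, 0.5, 0.9))
```

Across this small grid of $(\ell, w)$ values the two expressions agree to within quadrature error.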

Proof of Lemma 2. Letting $U \sim N(0,1)$, then for $d$ large
\[
d^2 \, E\left[(W^d_{(t+1)/d} - W^d_{t/d})^2 \mid W^d_{t/d} = w\right] = E\left[(V_d + \varepsilon_d)^2 \min\left(1, \exp(-(V_d + \varepsilon_d)/2)\right)\right] \le E\left[(V_d + \varepsilon_d)^2\right],
\]
which is easily seen to be bounded as a function of $w$ as $d \to \infty$.

Proof of Theorem 3. Let $G_d$ denote the discrete time generator of $W^d$, and let $C$ be the set of $C_c^\infty$ functions $h : [0,\infty) \to \mathbb{R}$. (A function $h \in C_c^\infty$ if it is infinitely differentiable with compact support.) Then for $h \in C$,
\[
G_d h(w) = d \, E\left[h(W^d_{(t+1)/d}) - h(w) \mid W^d_{t/d} = w\right] = d \, E\left[h'(w)(W^d_{(t+1)/d} - w) + h''(W^*)(W^d_{(t+1)/d} - w)^2/2 \mid W^d_{t/d} = w\right],
\]
where $W^*$ represents a value in between $w$ and $W^d_{(t+1)/d}$. However, the second term on the right hand side converges to 0 uniformly in $w$ by Lemma 2, so that by Lemma 1
\[
\lim_{d \to \infty} \; \sup_{0 \le w \le K} \left| G_d h(w) - a_\ell(w) h'(w) \right| = 0 \, . \qquad (7)
\]

Thus the infinitesimal behaviour of the speeded up W d converges to that of the limiting

diffusion. The remaining technicalities involve deducing that convergence of infinitesimals

is sufficient to ensure weak convergence of the process. This final step follows from the

following argument which is typical of that used for weak convergence results in Ethier

and Kurtz (1986).

$C$ is a core for the generator of the deterministic process described by (3) (see Section 3 of Chapter 1, and Theorem 2.1 of Chapter 8, of Ethier and Kurtz (1986)). So by Theorem 6.5 of Chapter 1 and (7), the finite dimensional distributions of $W^d$ converge to the required limit. Moreover, from Corollary 8.7 of Chapter 4 of Ethier and Kurtz (1986),


this is therefore sufficient to ensure the weak convergence statement in (3) thus proving

Theorem 3.

Proof of Lemma 4. We use techniques similar to the proof of Lemma 1. For the Langevin algorithm, $Y_{t+1} = (1 - h/2)X_t + \sqrt{h}\, Z$, with $Z$ being a standard multivariate normal random vector. From this we get
\[
\|Y_{t+1}\|^2 = (1 - h/2)^2 \|X_t\|^2 + 2(1 - h/2) h^{1/2} X_t^T Z + h \|Z\|^2 \approx (1 - h + h^2/4)\|X_t\|^2 + 2h^{1/2}(\|X_t\|^2)^{1/2} U + hd,
\]
where $U \sim N(0,1)$. Using that $h \to 0$ and $hd \to \infty$, we can write
\[
\|Y_{t+1}\|^2 - \|X_t\|^2 \approx hd(1 - w) \, .
\]
The acceptance probability for $Y_{t+1}$ becomes the minimum of 1 and
\[
\frac{\pi(Y_{t+1})}{\pi(X_t)} \, \frac{q(Y_{t+1}, X_t)}{q(X_t, Y_{t+1})} = \frac{\exp(-\|Y_{t+1}\|^2/2)}{\exp(-\|X_t\|^2/2)} \cdot \frac{\exp(-\|X_t - (1 - h/2)Y_{t+1}\|^2/2h)}{\exp(-\|Y_{t+1} - (1 - h/2)X_t\|^2/2h)} = \exp\left(-h(\|Y_{t+1}\|^2 - \|X_t\|^2)/8\right) \approx \exp(-h^2 d(1 - w)/8) .
\]
Thus for large $d$,
\[
E\left[W^d_{(t+1)/d^{1/3}} - W^d_{t/d^{1/3}} \mid W^d_{t/d^{1/3}} = w\right] \approx hd(1 - w) \min\left(1, \exp(-h^2 d(1 - w)/8)\right) / d \, ,
\]
proving (by substituting $h = \ell^2 d^{-1/3}$) the result.
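The cancellation used above, namely that for the Gaussian target the full Metropolis-Hastings log-ratio collapses exactly to $-h(\|Y_{t+1}\|^2 - \|X_t\|^2)/8$, is easy to verify numerically. The following sketch (our own; names are illustrative) computes both sides for a random state and proposal:

```python
import math
import random

def full_log_ratio(x, y, h):
    """log [pi(y) q(y, x)] - log [pi(x) q(x, y)] for pi = N(0, I_d) with the
    Langevin proposal q(x, .) = N((1 - h/2) x, h I_d), computed term by term."""
    log_pi = -0.5 * (sum(yi * yi for yi in y) - sum(xi * xi for xi in x))
    log_q = sum((-(xi - (1.0 - h / 2.0) * yi) ** 2
                 + (yi - (1.0 - h / 2.0) * xi) ** 2) / (2.0 * h)
                for xi, yi in zip(x, y))
    return log_pi + log_q

d, h = 1000, 0.05
rng = random.Random(3)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]                      # a typical state
y = [(1.0 - h / 2.0) * xi + math.sqrt(h) * rng.gauss(0.0, 1.0) for xi in x]
shortcut = -h * (sum(yi * yi for yi in y) - sum(xi * xi for xi in x)) / 8.0
gap = abs(full_log_ratio(x, y, h) - shortcut)
```

The two computations agree to floating-point precision, confirming that the identity is exact (not merely asymptotic) for this target.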

Proof of Lemma 5. The first part of the lemma follows from the proof of Lemma 4, using $h = \ell^2 d^{-1/2}$.

The second part of the lemma follows in a similar manner to the proof of Lemma 2,

and the details are therefore omitted.


Proof of Theorem 6. This theorem follows from Lemma 4 and Lemma 5, similar to

how Theorem 3 followed from Lemmas 1 and 2. The details are therefore omitted.

Proof of Theorem 7. The acceptance probability for $Y_{t+1}$ equals the minimum of 1 and
\[
\frac{\pi(Y_{t+1})}{\pi(X_t)} \, \frac{q(Y_{t+1}, X_t)}{q(X_t, Y_{t+1})} = \frac{\exp\left(\sum_{i=1}^d g(Y_{t+1,i})\right)}{\exp\left(\sum_{i=1}^d g(X_{t,i})\right)} \cdot \frac{\exp\left(-\sum_{i=1}^d (X_{t,i} - Y_{t+1,i} - (h/2)\, g'(Y_{t+1,i}))^2 / 2h\right)}{\exp\left(-\sum_{i=1}^d (Y_{t+1,i} - X_{t,i} - (h/2)\, g'(X_{t,i}))^2 / 2h\right)}
\]
\[
= \exp\left(-(h/8) \sum_{i=1}^d \left(g'(Y_{t+1,i})^2 - g'(X_{t,i})^2\right)\right) \times \exp\left(\sum_{i=1}^d \left(g(Y_{t+1,i}) - g(X_{t,i}) - (g'(Y_{t+1,i}) + g'(X_{t,i}))(Y_{t+1,i} - X_{t,i})/2\right)\right).
\]
The first term is similar to the term we got for the normal target density. Making a third and fourth order Taylor expansion of $g'(Y_{t+1,i})$ and $g(Y_{t+1,i})$, respectively, we get that the last term is approximately
\[
\exp\left(-\sum_{i=1}^d \left(g'''(X_{t,i})(Y_{t+1,i} - X_{t,i})^3/12 + g''''(X_{t,i})(Y_{t+1,i} - X_{t,i})^4/24\right)\right).
\]
Now we use the fact that $X_t$ is equal to a mode of the distribution (assumed for notational simplicity to be the origin, 0). Thus $g'(X_{t,i}) = 0$ and $X_{t,i} = 0$, for $i = 1, \ldots, d$. This gives us
\[
\frac{\pi(Y_{t+1})}{\pi(0)} \, \frac{q(Y_{t+1}, 0)}{q(0, Y_{t+1})} \approx \exp\left(-(h/8)\, g''(0)^2 \sum_{i=1}^d Y_{t+1,i}^2\right) \times \exp\left(-g'''(0) \sum_{i=1}^d Y_{t+1,i}^3/12 - g''''(0) \sum_{i=1}^d Y_{t+1,i}^4/24\right).
\]
The proposal in this case is $Y_{t+1} = h^{1/2} Z$, i.e. $Y_{t+1,i} = h^{1/2} Z_i$, with $Z$ being a standard multivariate normal random vector. Now observe that $\sum_{i=1}^d Y_{t+1,i}^3$ has mean zero, and therefore by the Central Limit Theorem is of order $d^{1/2} h^{3/2}$. Using the Law of Large Numbers we see that the terms $\sum_{i=1}^d Y_{t+1,i}^2$ and $\sum_{i=1}^d Y_{t+1,i}^4$ are asymptotically $dh$ and $3dh^2$, respectively. Therefore in the acceptance probability expression, the $Y_{t+1,i}^3$ terms are negligible as $d \to \infty$, and
\[
\frac{\pi(Y_{t+1})}{\pi(0)} \, \frac{q(Y_{t+1}, 0)}{q(0, Y_{t+1})} \approx \exp\left(-h^2 d\, (g''(0)^2 + g''''(0))/8\right) = \exp\left(-\ell^4 (g''(0)^2 + g''''(0))/8\right) = c, \text{ say},
\]
when $d$ is large and $h = \ell^2 d^{-1/2}$. Thus, asymptotically as $d \to \infty$, all proposed moves will be accepted with the same positive acceptance probability $c$. This proves Theorem 7.
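As a numerical illustration of this constant-acceptance phenomenon (our own sketch, not from the paper; the test density is our choice), take the product density with $g(x) = -x^2/2 - x^4/48$, so that $g''(0) = -1$, $g''''(0) = -1/2$, and the predicted limit is $c = \exp(-\ell^4(1 - 1/2)/8)$. Starting at the mode, the exact Metropolis-Hastings acceptance probabilities of independent proposals concentrate around $c$:

```python
import math
import random

def g(x):      # log target density per coordinate (unnormalised), our test case
    return -x * x / 2.0 - x ** 4 / 48.0

def g1(x):     # g'
    return -x - x ** 3 / 12.0

def accept_prob_from_mode(d, h, rng):
    """Exact MH acceptance probability of one Langevin proposal from the mode.

    At the mode X = 0 the proposal is simply Y = sqrt(h) Z."""
    y = [math.sqrt(h) * rng.gauss(0.0, 1.0) for _ in range(d)]
    log_pi = sum(g(yi) for yi in y) - d * g(0.0)
    log_q = sum((-(0.0 - yi - (h / 2.0) * g1(yi)) ** 2
                 + (yi - 0.0 - (h / 2.0) * g1(0.0)) ** 2) / (2.0 * h) for yi in y)
    return min(1.0, math.exp(log_pi + log_q))

d, ell = 4096, 1.5
h = ell ** 2 / math.sqrt(d)                           # h = l^2 d^{-1/2}
rng = random.Random(4)
probs = [accept_prob_from_mode(d, h, rng) for _ in range(100)]
mean_p = sum(probs) / len(probs)
spread = max(probs) - min(probs)
c_theory = math.exp(-ell ** 4 * (1.0 - 0.5) / 8.0)    # g''(0)^2 + g''''(0) = 1/2
```

The empirical acceptance probabilities cluster tightly around the theoretical constant $c$, with only $O(d^{-1/2})$ corrections visible at this dimension.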

Acknowledgements. The authors are grateful to the referees, associate editor and editor for constructive comments and suggestions. The authors also thank Alexandros Beskos for producing the plot in Figure 1.

REFERENCES

Bedard, M. (2004), On the robustness of optimal scaling for Metropolis-Hastings algorithms. Ph.D. dissertation, University of Toronto. Work in progress.

Beskos, A., Papaspiliopoulos, O., Roberts, G.O. and Fearnhead, P.N. (2004), Exact likelihood-based estimation of diffusion processes. In progress.

Breyer, L. and Roberts, G.O. (2000), From Metropolis to diffusions: Gibbs states and optimal scaling. Stoch. Proc. Appl. 90, 181–206.

Christensen, O.F. and Waagepetersen, R. (2002), Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics 58, 280–286.

Christensen, O.F., Roberts, G.O. and Sköld, M. (2003), Robust MCMC methods for spatial GLMM's. Technical Report 23, Centre for Mathematical Sciences, Lund University.

Ethier, S.N. and Kurtz, T.G. (1986), Markov Processes: Characterization and Convergence. Wiley, New York.

Gelman, A., Roberts, G.O. and Gilks, W.R. (1996), Efficient Metropolis jumping rules. In: Bayesian Statistics 5 (eds. J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith), Oxford University Press, 599–608.

Gilks, W.R., Richardson, S. and Spiegelhalter, D.J., eds. (1996), Markov Chain Monte Carlo in Practice. Chapman and Hall, London.

Grenander, U. and Miller, M.I. (1994), Representations of knowledge in complex systems. J. Roy. Stat. Soc. B 56, 549–603.

Hastings, W.K. (1970), Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E. (1953), Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1091.

Møller, J., Syversveen, A.R. and Waagepetersen, R. (1998), Log Gaussian Cox processes. Scand. J. Statist. 25, 451–482.

Neal, R. (1996), Bayesian Learning for Neural Networks. Lecture Notes in Statistics No. 118, Springer-Verlag, New York.

Papaspiliopoulos, O., Roberts, G.O. and Sköld, M. (2003), Non-centered parameterisations for hierarchical models and data augmentation. In: Bayesian Statistics 7 (eds. J.M. Bernardo, S. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith and M. West), Oxford University Press, 307–326.

Roberts, G.O. (1998), Optimal Metropolis algorithms for product measures on the vertices of a hypercube. Stochastics and Stochastic Reports 62, 275–283.

Roberts, G.O., Gelman, A. and Gilks, W.R. (1997), Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Applied Probability 7, 110–120.

Roberts, G.O. and Rosenthal, J.S. (1998), Optimal scaling of discrete approximations to Langevin diffusions. J. Roy. Stat. Soc. B 60, 255–268.

Roberts, G.O. and Rosenthal, J.S. (2001), Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science 16, 351–367.

Roberts, G.O. and Tweedie, R.L. (1996), Exponential convergence of Langevin diffusions and their discrete approximations. Bernoulli 2, 341–363.

Roberts, G.O. and Yuen, W.K. (2004), Optimal scaling of Metropolis algorithms for discontinuous densities. Work in progress.

Stramer, O. and Tweedie, R.L. (1999a), Langevin-type models I: Diffusions with given stationary distributions, and their discretizations. Methodology and Computing in Applied Probability 1, 283–306.

Stramer, O. and Tweedie, R.L. (1999b), Langevin-type models II: Self-targeting candidates for MCMC algorithms. Methodology and Computing in Applied Probability 1, 307–328.