STA3431 (Monte Carlo Methods) Lecture Notes, Sept–Dec 2017

by Jeffrey S. Rosenthal, University of Toronto

(Last updated: November 17, 2017)

Note: I will update these notes regularly (on the course web page). However, they are just rough, point-form notes, with no guarantee of completeness or accuracy. They should in no way be regarded as a substitute for attending the lectures, doing the homework exercises, studying for the test, or reading the reference books.

INTRODUCTION:

• Introduction to course, handout, references, prerequisites, etc.

− Course web page: probability.ca/sta3431

− Sidney Smith Hall room 2120, Mondays 10:10–12:00.

− If not Stat Dept grad student, must REQUEST enrolment (by e-mail);

need advanced undergraduate probability/statistics background, plus

basic computer programming experience.

− Conversely, if you already know lots about MCMC etc., then this

course might not be right for you since it’s an INTRODUCTION to

these topics.

− How many of you are stat grad students? undergrads? math? computer science? physics? economics? management? engineering? other? Auditing??

• Theme of the course: use (pseudo)randomness on a computer to simulate

(and hence estimate) important/interesting quantities.

• Example: Suppose want to estimate $m := E[Z^4 \cos(Z)]$, where $Z \sim \text{Normal}(0, 1)$.

− “Classical” Monte Carlo solution: replicate a large number $z_1, \ldots, z_n$ of Normal(0,1) random variables, and let $x_i = z_i^4 \cos(z_i)$.

− Their mean $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i$ is an (unbiased) estimate of $E[X] \equiv E[Z^4 \cos(Z)]$.

− R: Z = rnorm(100); X = Z^4 * cos(Z); mean(X) [file “RMC”]

− unstable . . . but if replace “100” with “1000000” then $\bar{x}$ close to $-1.213$ . . .

− Variability??

− Well, can estimate standard deviation of $\bar{x}$ by the (estimated) “standard error” of $\bar{x}$, which is:

$$se = n^{-1/2}\, \mathrm{sd}(x) \approx n^{-1/2} \sqrt{\mathrm{var}(x)} = n^{-1/2} \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}\,.$$

[file “RMCse”]

• Then what is, say, a 95% confidence interval for m?

• Well, by central limit theorem (CLT), for large $n$, have $\bar{x} \approx N(m, v) \approx N(m, se^2)$.

− (Strictly speaking, should use “t” distribution, not normal distribution

. . . but if n large that doesn’t really matter – ignore it for now.)

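− So $\frac{m - \bar{x}}{se} \approx N(0, 1)$.

− So, $P\big(-1.96 < \frac{m - \bar{x}}{se} < 1.96\big) \approx 0.95$.

− So, $P(\bar{x} - 1.96\, se < m < \bar{x} + 1.96\, se) \approx 0.95$.

− i.e., approximate 95% confidence interval is [file “RMCci”]

$$(\bar{x} - 1.96\, se,\ \bar{x} + 1.96\, se)\,.$$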
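The estimate, standard error, and 95% interval described above can be sketched in a few lines. This is a minimal Python illustration, not the course's R files “RMC”, “RMCse”, or “RMCci”; the function name `mc_estimate` and the sample size are assumptions made for the example.

```python
import math
import random

def mc_estimate(n, seed=0):
    """Classical Monte Carlo for m = E[Z^4 cos(Z)], Z ~ Normal(0,1).

    Returns the sample mean, its estimated standard error,
    and the approximate 95% confidence interval.
    """
    rng = random.Random(seed)
    xs = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        xs.append(z ** 4 * math.cos(z))
    xbar = sum(xs) / n
    # Sample variance with the 1/(n-1) normalisation used in the notes.
    var = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    se = math.sqrt(var / n)
    return xbar, se, (xbar - 1.96 * se, xbar + 1.96 * se)

xbar, se, ci = mc_estimate(200_000)
print("estimate:", xbar, "se:", se, "95% CI:", ci)
```

Re-running with different seeds gives a feel for the variability that the standard error is meant to quantify.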

• Alternatively, could compute expectation as

$$\int_{-\infty}^{\infty} z^4 \cos(z)\, \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz\,.$$

Analytic? Numerical? Better? Worse? [file “RMCcomp”: −1.213]

• (Aside: In fact, by considering it as the real part of $E(Z^4 e^{iZ})$, this can be computed exactly, to be $-2/\sqrt{e} \doteq -1.213061$. But what about an even harder example?)

• What about higher-dimensional versions? (Can’t do numerical integration!)

• Note: In this course we will often use R to automatically sample from

simple distributions like Normal, Uniform, Exponential, etc.

− But how does it work? (See below.)

• What if distribution too complicated to sample from?

− (MCMC! . . . Metropolis, Gibbs, tempered, trans-dimensional, . . . )

HISTORICAL EXAMPLE – BUFFON’S NEEDLE:

− Have series of parallel lines . . . line spacing $w$, needle length $\ell \le w$ (say $\ell = w$) . . . what is prob that needle lands touching line?

[http://www.metablake.com/pi.swf]

− Let θ be angle counter-clockwise from line direction, and h distance

of top end above nearest line.

− Then h ∼ Uniform[0, w] and θ ∼ Uniform[0, π], independent.

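− Touches line iff $h < \ell \sin(\theta)$.

− So, prob $= \frac{1}{\pi} \int_0^{\pi} \frac{1}{w} \int_0^{w} \mathbf{1}_{h < \ell \sin(\theta)}\, dh\, d\theta = \frac{1}{\pi} \int_0^{\pi} \frac{1}{w}\, \ell \sin(\theta)\, d\theta = \frac{2\ell}{w\pi}$.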

− Hence, by LLN, if throw needle $n$ times, of which it touches a line $m$ times, then for $n$ large, $m/n \approx 2\ell/(w\pi)$, so $\pi \approx 2n\ell/(mw) = 2n/m$ (if $\ell = w$).

− [e.g. recuperating English Captain O.C. Fox, 1864: $\ell = 3$, $w = 4$, $n = 530$, $m = 253$, so $\pi \approx 2n\ell/(mw) \doteq 3.1423$.]

− But for modern simulations, use computer. How to randomise??
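As one possible answer to “use computer”, here is a minimal Python sketch of the needle experiment, sampling $h \sim \text{Uniform}[0, w]$ and $\theta \sim \text{Uniform}[0, \pi]$ as in the derivation above. The function name `buffon_pi` is an assumption for illustration. Note the sketch uses the built-in value of $\pi$ to draw $\theta$, so it demonstrates the law of large numbers rather than being a practical way to compute $\pi$.

```python
import math
import random

def buffon_pi(n, ell=1.0, w=1.0, seed=0):
    """Throw n needles of length ell on lines spaced w apart (ell <= w).

    Returns (estimate of pi, number of needles touching a line).
    """
    rng = random.Random(seed)
    touches = 0
    for _ in range(n):
        h = rng.uniform(0.0, w)            # height of top end above nearest line
        theta = rng.uniform(0.0, math.pi)  # angle from the line direction
        if h < ell * math.sin(theta):
            touches += 1
    # m/n ~ 2 ell / (w pi), so pi ~ 2 n ell / (m w).
    return 2.0 * n * ell / (touches * w), touches

pi_hat, m = buffon_pi(200_000)
print("simulated estimate of pi:", pi_hat)

# The 1864 figures quoted in the notes (ell=3, w=4, n=530, m=253):
fox = 2 * 530 * 3 / (253 * 4)
print("Fox's estimate:", round(fox, 4))
```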

PSEUDORANDOM NUMBERS:

• Goal: generate an i.i.d. sequence $U_1, U_2, U_3, \ldots \sim \text{Uniform}[0, 1]$.

• One method: LINEAR CONGRUENTIAL GENERATOR (LCG).

− Choose (large) positive integers m, a, and b.

− Start with a “seed” value, $x_0$. (e.g., the current time in milliseconds)

− Then, recursively, $x_n = (a x_{n-1} + b) \bmod m$, i.e. $x_n$ = remainder when $a x_{n-1} + b$ is divided by $m$.

− So, $0 \le x_n \le m - 1$.

− Then let $U_n = x_n / m$.

− Then $\{U_n\}$ will “seem” to be approximately i.i.d. $\sim \text{Uniform}[0, 1]$.

(file “Rrng”)

• Choice of m, a, and b?

• Many issues:

− need m large (so many possible values);

− need $a$ large enough that no obvious “pattern” between $U_{n-1}$ and $U_n$.

− need b to avoid short “cycles” of numbers.

− many statistical tests, to try to see which choices provide good randomness, avoid correlations, etc. (e.g. “diehard tests”, “dieharder”: www.phy.duke.edu/~rgb/General/dieharder.php)

− One common “good” choice: $m = 2^{32}$, $a = 69{,}069$, $b = 23{,}606{,}797$.

• Theorem: the LCG has full period ($m$) if and only if both (i) $\gcd(b, m) = 1$, and (ii) every “prime or 4” divisor of $m$ also divides $a - 1$.

− So, if $m = 2^{32}$, then if $b$ odd and $a - 1$ is a multiple of 4, then the LCG has full period $m = 2^{32} \doteq 4.3 \times 10^9$; good.

− Many other choices, e.g. C programming language (glibc) uses $m = 2^{32}$, $a = 1{,}103{,}515{,}245$, $b = 12{,}345$.

− One bad choice: $m = 2^{31}$, $a = 65539 = 2^{16} + 3$, $b = 0$ (“RANDU”) . . . used for many years (esp. early 1970s) . . . but then $x_{n+2} = 6 x_{n+1} - 9 x_n \bmod m$ . . . too much serial correlation. [Proof: $x_{n+2} = (2^{16} + 3)^2 x_n = (2^{32} + 6(2^{16}) + 9)\, x_n \equiv (0 + 6(2^{16} + 3) - 9)\, x_n \pmod{2^{31}} = 6 x_{n+1} - 9 x_n$.]

http://cerebro.xu.edu/math/Sources/Buffon/Hall%20on%20Buffon%20Needle.pdf

− (Microsoft Excel pre-2003: period $< 10^6$, too small . . . Excel 2003 used floating-point “version” of LCG, which sometimes gave negative numbers – bad!)
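The LCG recursion, the full-period theorem, and the RANDU identity can all be checked numerically. The following is a small Python sketch, not the course file “Rrng”; the helper names `lcg`, `period`, and `randu_states` are hypothetical, and the tiny modulus $m = 16$ is chosen only so that a brute-force period count is feasible.

```python
def lcg(seed, m=2**32, a=69069, b=23606797):
    """Generator of U_n = x_n / m, with x_n = (a x_{n-1} + b) mod m."""
    x = seed % m
    while True:
        x = (a * x + b) % m
        yield x / m

def period(m, a, b, x0=0):
    """Brute-force cycle length from x0 (intended for tiny m only)."""
    x, n = (a * x0 + b) % m, 1
    while x != x0 and n <= m:  # cap guards against never returning to x0
        x, n = (a * x + b) % m, n + 1
    return n

def randu_states(seed, count):
    """Integer states of RANDU: m = 2^31, a = 65539, b = 0."""
    m, a = 2**31, 65539
    xs = [seed % m]
    for _ in range(count - 1):
        xs.append((a * xs[-1]) % m)
    return xs

gen = lcg(seed=12345)
us = [next(gen) for _ in range(5)]
print(us)

# Theorem check with m = 16: b = 3 is odd; a - 1 = 4 is a multiple of 4,
# whereas a - 1 = 2 is not.
print(period(16, 5, 3), period(16, 3, 3))

# RANDU: x_{n+2} - 6 x_{n+1} + 9 x_n is always a multiple of 2^31.
xs = randu_states(seed=1, count=1000)
randu_ok = all((xs[i + 2] - 6 * xs[i + 1] + 9 * xs[i]) % 2**31 == 0
               for i in range(len(xs) - 2))
print(randu_ok)
```

Python integers never overflow, so the modular arithmetic here is exact; in a language with fixed-width integers the multiplication would need more care.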

• Not “really” random, just “pseudorandom” . . .

− Can cause problems!

− Will fail certain statistical tests . . .

− Some implementations also use external randomness, e.g. current temperature of computer’s CPU / entropy of kernel (e.g. Linux’s “urandom”).

− Or the randomness of quantum mechanics, e.g. www.fourmilab.ch/hotbits

− Or of atmospheric noise, e.g. random.org.

− But for most purposes, standard pseudorandom numbers are pretty

good . . .

• We’ll consider LCG’s “good enough for now”, but:

− Other generators include “Multiply-with-Carry” [$x_n = (a x_{n-r} + b_{n-1}) \bmod m$ where $b_n = \lfloor (a x_{n-r} + b_{n-1})/m \rfloor$]; and “Kiss” [$y_n = (x_n + J_n + K_n) \bmod 2^{32}$, where $x_n$ as above, and $J_n$ and $K_n$ are “shift register generators”, given in bit form by $J_{n+1} = (I + L^{15})(I + R^{17}) J_n \bmod 2^{32}$, and $K_{n+1} = (I + L^{13})(I + R^{18}) K_n \bmod 2^{31}$]; and “Mersenne Twister” [$x_{n+k} = x_{n+s} \oplus (x_n^{(\text{upper})} \,|\, x_{n+1}^{(\text{lower})}) A$, where $1 \le s < k$ where $2^{kw-r} - 1$ is Mersenne prime, and $A$ is $w \times w$ (e.g. $32 \times 32$) with $(w-1) \times (w-1)$ identity in upper-right, with matrix mult. done bit-wise mod 2], and many others too.

− (R implementation: see “?.Random.seed” . . . default is Mersenne

Twister.)

• So, just need computer to do simple arithmetic. No problem, right?

LIMITATIONS OF COMPUTER ARITHMETIC:

• Consider the following computations in R:

− > 2 + 1 - 2

[1] 1

> 2^10 + 1 - 2^10

> 2^100 + 1 - 2^100

• Why??

• Homework question: for what values of n does:

> 2^n + 1 - 2^n

give 0 instead of 1??

• (Similarly in many other computer languages too, e.g. C (powertest.c), Java (powertest.java) . . . and Python with floating numbers . . . but not Python with integer variables (powertest.py), because it then does dynamic memory allocation . . . )

• So, numerical computations are just approximations, with their own errors!

• We’ll usually ignore this, but MUST BE CAREFUL!

− Then can simulate uniform random variables.

− What about other random variables?
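Before moving on, the contrast noted above between floating-point and integer arithmetic in Python can be seen directly. This is a minimal sketch and not the course file “powertest.py”; it only looks at the two exponents used in the R session above, and the helper names are hypothetical.

```python
def float_version(n):
    # Double-precision floats: 2^n + 1 is rounded to the nearest
    # representable value before the subtraction happens.
    return 2.0 ** n + 1.0 - 2.0 ** n

def int_version(n):
    # Python integers have arbitrary precision, so this is exact.
    return 2 ** n + 1 - 2 ** n

for n in (10, 100):
    print(n, float_version(n), int_version(n))
```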

SIMULATING OTHER DISTRIBUTIONS:

• Once we have $U_1, U_2, \ldots$ i.i.d. $\sim \text{Uniform}[0, 1]$ (at least approximately), how do we generate other distributions?

• With transformations, using “change-of-variable” theorem!

• e.g. to make $X \sim \text{Uniform}[L, R]$, set $X = (R - L)\, U_1 + L$.

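• e.g. to make $X \sim \text{Bernoulli}(p)$, set

$$X = \begin{cases} 1, & U_1 \le p \\ 0, & U_1 > p \end{cases}$$

• e.g. to make $Y \sim \text{Binomial}(n, p)$, either set $Y = X_1 + \ldots + X_n$ where

$$X_i = \begin{cases} 1, & U_i \le p \\ 0, & U_i > p \end{cases}\,,$$

or set

$$Y = \max\Big\{ j : \sum_{k=0}^{j-1} \binom{n}{k} p^k (1-p)^{n-k} \le U_1 \Big\}$$

(where by convention $\sum_{k=0}^{-1} (\cdots) = 0$). (“Inverse CDF method”)

• More generally, to make $P(Y = x_i) = p_i$ for some $x_1 < x_2 < x_3 < \ldots$, where $p_i \ge 0$ and $\sum_i p_i = 1$, simply set

$$Y = \max\Big\{ x_j \,;\ \sum_{k=1}^{j-1} p_k \le U_1 \Big\}\,.$$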

• e.g. to make $Z \sim \text{Exponential}(1)$, set $Z = -\log(U_1)$.

− Then $P(Z > x) = P(-\log(U_1) > x) = P(\log(U_1) < -x) = P(U_1 < e^{-x}) = e^{-x}$.

− Then, to make $W \sim \text{Exponential}(\lambda)$, set $W = Z/\lambda = -\log(U_1)/\lambda$.

• What if want $X$ to have density $6 x^5\, \mathbf{1}_{0<x<1}$?

− Let $X = U_1^{1/6}$.

− Then for $0 < x < 1$, $P(X \le x) = P(U_1^{1/6} \le x) = P(U_1 \le x^6) = x^6$.

− Hence, $f_X(x) = \frac{d}{dx}\big[P(X \le x)\big] = \frac{d}{dx}\, x^6 = 6 x^5$ for $0 < x < 1$.

− More generally, for $r > 1$, if $X = U_1^{1/r}$, then $f_X(x) = r\, x^{r-1}$ for $0 < x < 1$. [CHECK!]
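The transformations above translate almost directly into code. The following Python sketch is an illustration with hypothetical function names; each function takes an already-drawn uniform value `u`. One assumption worth flagging: `random.random()` can return exactly 0, so the usage lines pass `1 - rng.random()` (also uniform, but never 0) wherever a logarithm is taken.

```python
import math
import random

def bernoulli(u, p):
    """X = 1 if u <= p, else 0."""
    return 1 if u <= p else 0

def discrete_inverse_cdf(u, values, probs):
    """Y = max{ x_j : p_1 + ... + p_{j-1} <= u }, for sorted values."""
    cum = 0.0
    y = values[0]
    for x, p in zip(values, probs):
        if cum > u:
            break
        y = x
        cum += p
    return y

def exponential(u, lam=1.0):
    """W = -log(u) / lam has the Exponential(lam) distribution."""
    return -math.log(u) / lam

def power_density(u, r):
    """X = u^(1/r) has density r x^(r-1) on (0, 1)."""
    return u ** (1.0 / r)

rng = random.Random(0)
n = 100_000
mean_exp = sum(exponential(1.0 - rng.random(), 2.0) for _ in range(n)) / n
mean_pow = sum(power_density(rng.random(), 6) for _ in range(n)) / n
mean_bin = sum(sum(bernoulli(rng.random(), 0.3) for _ in range(10))
               for _ in range(10_000)) / 10_000
print(mean_exp, mean_pow, mean_bin)
```

For comparison, the theoretical means are $1/\lambda = 0.5$, $r/(r+1) = 6/7$, and $np = 3$ respectively.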

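• What about normal dist.? Fact: If

$$X = \sqrt{2 \log(1/U_1)}\, \cos(2\pi U_2)\,, \qquad Y = \sqrt{2 \log(1/U_1)}\, \sin(2\pi U_2)\,,$$

then $X, Y \sim N(0, 1)$ (independent!). [“Box-Muller transformation”: Ann Math Stat 1958, 29, 610-611]

− Proof: By multidimensional change-of-variable theorem, if $(x, y) = h(u_1, u_2)$ and $(u_1, u_2) = h^{-1}(x, y)$, then $f_{X,Y}(x, y) = f_{U_1,U_2}(h^{-1}(x, y)) \,/\, |J(h^{-1}(x, y))|$. Here $f_{U_1,U_2}(u_1, u_2) = 1$ for $0 < u_1, u_2 < 1$ (otherwise 0), and

$$J(u_1, u_2) = \det \begin{pmatrix} \frac{\partial x}{\partial u_1} & \frac{\partial x}{\partial u_2} \\ \frac{\partial y}{\partial u_1} & \frac{\partial y}{\partial u_2} \end{pmatrix} = \det \begin{pmatrix} \frac{-\cos(2\pi u_2)}{u_1 \sqrt{2 \log(1/u_1)}} & -2\pi \sin(2\pi u_2) \sqrt{2 \log(1/u_1)} \\ \frac{-\sin(2\pi u_2)}{u_1 \sqrt{2 \log(1/u_1)}} & 2\pi \cos(2\pi u_2) \sqrt{2 \log(1/u_1)} \end{pmatrix} = -2\pi / u_1\,.$$

But $u_1 = e^{-(x^2+y^2)/2}$, so density of $(X, Y)$ is

$$f_{X,Y}(x, y) = 1 / |J(h^{-1}(x, y))| = 1 \,/\, \big| -2\pi / e^{-(x^2+y^2)/2} \big| = \frac{e^{-(x^2+y^2)/2}}{2\pi} = \Big( \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \Big) \Big( \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2} \Big)\,,$$

i.e. $X \sim N(0, 1)$ and $Y \sim N(0, 1)$ are independent.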
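The Box-Muller formulas can be checked empirically by looking at sample means, variances, and the sample correlation. The sketch below is an illustration only; the function name `box_muller` is an assumption, and `1 - rng.random()` is used for $U_1$ so that the logarithm never sees 0.

```python
import math
import random

def box_muller(u1, u2):
    """Map two independent Uniform(0,1) values to two N(0,1) values."""
    r = math.sqrt(2.0 * math.log(1.0 / u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

rng = random.Random(0)
pairs = [box_muller(1.0 - rng.random(), rng.random()) for _ in range(100_000)]
n = len(pairs)

mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / (n - 1)
var_y = sum((y - mean_y) ** 2 for _, y in pairs) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)
corr_xy = cov_xy / math.sqrt(var_x * var_y)
print(mean_x, mean_y, var_x, var_y, corr_xy)
```

Zero sample correlation is of course weaker than independence; the proof above is what establishes the latter.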

• Another approach: “INVERSE CDF METHOD”:

− Suppose want P(X ≤ x) = F (x). (“CDF”)

− For $0 < t < 1$, set $F^{-1}(t) = \min\{x \,;\ F(x) \ge t\}$. (“inverse CDF”)

− Then set $X = F^{-1}(U_1)$.

− Then $X \le x$ if and only if $U_1 \le F(x)$. [Subtle; see e.g. Rosenthal, A First Look at Rigorous Probability Theory, 2nd ed., Lemma 7.1.2.]

− So, $P(X \le x) = P(U_1 \le F(x)) = F(x)$.

− Very general, but computing $F^{-1}(t)$ could be difficult . . .

• So, generating (pseudo)random numbers for most “standard” one-dimensional

distributions is pretty easy . . .

− So, can get Monte Carlo estimates of expectations involving standard one-dimensional distributions, e.g. $E[Z^4 \cos(Z)]$ where $Z \sim \text{Normal}(0, 1)$.

• But what if distribution is complicated, multidimensional, etc.? Simulate!

SIMULATION EXAMPLE: QUEUEING THEORY:

− Q(t) = number of people in queue at time t ≥ 0.

• Suppose service times ∼ Exponential(µ) [mean 1/µ], and interarrival

times ∼ Exponential(λ) (“M/M/1 queue”), so {Q(t)} Markovian. Then

well known [e.g. STA447/2006]:

− If µ ≤ λ, then Q(t)→∞ as t→∞.

− If µ > λ, then Q(t) converges in distribution as t→∞:

− $P(Q(t) = i) \to \big(1 - \frac{\lambda}{\mu}\big) \big(\frac{\lambda}{\mu}\big)^i$, for $i = 0, 1, 2, \ldots$.

− Easy! (e.g. µ = 3, λ = 2, t = 1000) [file “Rqueue”]

• Now suppose instead that service times ∼ Uniform[0, 1], and interarrival

times have distribution of |Z| where Z ∼ Normal(0, 1). Limits not easily

computed. Now what?

− Simulate it! [file “Rqueue2”]

• Or, to make the means the same as the first example, suppose service times $\sim \text{Uniform}[0, 2/3]$, and interarrival times have distribution of $Z^2/2$ where $Z \sim \text{Normal}(0, 1)$. Now what? [file “Rqueue3”]
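A single-server queue with arbitrary service and interarrival distributions can be simulated event by event. The Python sketch below is one way to do it, not the course files “Rqueue”, “Rqueue2”, or “Rqueue3”; the function name `queue_length_at`, the horizon $t = 100$, and the number of replications are assumptions chosen to keep the run short. It is shown on the M/M/1 case ($\mu = 3$, $\lambda = 2$), where the limiting value $P(Q = 0) = 1 - \lambda/\mu = 1/3$ is known, and the other examples only require swapping the two sampling functions.

```python
import math
import random

def queue_length_at(T, draw_interarrival, draw_service):
    """Simulate a single-server queue from empty; return Q(T)."""
    q = 0
    next_arrival = draw_interarrival()
    next_departure = math.inf          # nobody in service yet
    while True:
        t_next = min(next_arrival, next_departure)
        if t_next > T:
            return q
        if next_arrival <= next_departure:
            q += 1                      # arrival
            if q == 1:                  # server was idle: start service
                next_departure = t_next + draw_service()
            next_arrival = t_next + draw_interarrival()
        else:
            q -= 1                      # departure
            next_departure = (t_next + draw_service()) if q > 0 else math.inf

rng = random.Random(0)
lam, mu = 2.0, 3.0
reps = 2000
lengths = [queue_length_at(100.0,
                           lambda: rng.expovariate(lam),
                           lambda: rng.expovariate(mu))
           for _ in range(reps)]
frac_empty = sum(1 for q in lengths if q == 0) / reps
print("P(Q = 0) estimate:", frac_empty, "theory:", 1 - lam / mu)

# For the third example in the notes one could instead pass, e.g.:
#   draw_interarrival = lambda: rng.gauss(0.0, 1.0) ** 2 / 2.0
#   draw_service      = lambda: rng.uniform(0.0, 2.0 / 3.0)
```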

EXAMPLE – CODE BREAKING:

• Try it out: “decipherdemo”. [uses file “decipher.c”]

• Data is the coded message text: $s_1 s_2 s_3 \ldots s_N$, where $s_i \in \mathcal{A} = \{A, B, C, \ldots, Z, \text{space}\}$.

• State space X is set of all bijections (for now) of A, i.e. one-to-one onto

mappings f : A → A, subject to f(space) = space.

− [“substitution cipher”]

• Use a reference text (e.g. “War and Peace”) to get matrix M(x, y) = 1+

number of times y follows x, for x, y ∈ A.

• Then for $f \in \mathcal{X}$, let $\pi(f) = \prod_{i=1}^{N-1} M\big(f(s_i), f(s_{i+1})\big)$.

− (Or raise this all to a power, e.g. 0.25.)

• Idea: if π(f) is larger, then f leads to pair frequencies which more closely

match the reference text, so f is a “better” choice.

• Would like to find the choice of f which maximises π(f).

• To do this, run a Metropolis algorithm for π:

− Choose $a, b \in \mathcal{A} \setminus \{\text{space}\}$, uniformly at random.

− Propose to replace $f$ by $g$, where $g(a) = f(b)$, $g(b) = f(a)$, and $g(x) = f(x)$ for all $x \ne a, b$.

− Accept with probability $\min\big(1, \frac{\pi(g)}{\pi(f)}\big)$.

• Easily seen to be an irreducible, aperiodic, reversible Markov chain [later!].

• So, converges (quickly!) to correct answer, breaking the code.

• References: S. Conner (2003), “Simulation and solving substitution codes”.

P. Diaconis (2008), “The Markov Chain Monte Carlo Revolution”.

• We later extended this, to transposition ciphers and more: J. Chen and

J.S. Rosenthal (2010), “Decrypting Classical Cipher Text Using Markov

Chain Monte Carlo” (Statistics and Computing 22(2), 397–413, 2011).

—————————– END WEEK #1 —————————–

EXAMPLE – PATTERN DETECTION:

• Try it out: faces.html

• Data is an image, given in terms of a grid of pixels (each “on” or “off”).

• Want to “find” the face in the image.

− (Harder for computers than for humans!)

• Define the face location by a vector θ of various parameters (face center,

eye width, nose height, etc.).

• Then define a score function S(θ) indicating how well the image agrees

with having a face in the location corresponding to the parameters θ.

• Then run a “mixed” Monte Carlo search (sometimes updating by small RWM moves, sometimes starting fresh from a random vector) over the entire parameter space, searching for $\arg\max_\theta S(\theta)$, i.e. for the parameter values which maximise the score function.

− Keep track of the best θ so far – this allows for greater flexibility in

trying different search moves without needing to preserve a stationary

distribution.

− Works pretty well, and fast! (“faces.html” Java applet)

− For details, see Java applet source code file “faces.java”, or the paper J.S. Rosenthal, Optimising Monte Carlo Search Strategies for Automated Pattern Detection. F. E. J. Math. Sci. 2009.

• In both of these examples, wanted to MAXIMISE (i.e., OPTIMISE) π,

rather than SAMPLE from π.

− General method? Simulated Annealing – later.

http://probability.ca/jeff/ftpdir/decipherartpub.pdf

http://probability.ca/jeff/java/faces.html

http://probability.ca/jeff/java/vision/faces-2008-06-26/faces.java

http://probability.ca/jeff/ftpdir/faces.pdf

MONTE CARLO INTEGRATION:

• How to compute an integral with Monte Carlo?

− Re-write it as an expectation!

• EXAMPLE: Want to compute $\int_0^1 \int_0^1 g(x, y)\, dx\, dy$.

− Regard this as $E[g(X, Y)]$, where $X, Y$ i.i.d. $\sim \text{Uniform}[0, 1]$.

− e.g. $g(x, y) = \cos(\sqrt{xy})$. (file “RMCint”) Easy!

− Get about $0.88 \pm 0.003$ . . . Mathematica gives 0.879544.

• e.g. estimate $I = \int_0^5 \int_0^4 g(x, y)\, dy\, dx$, where $g(x, y) = \cos(\sqrt{xy})$.

− Here

$$\int_0^5 \int_0^4 g(x, y)\, dy\, dx = \int_0^5 \int_0^4 5 \cdot 4 \cdot g(x, y)\, (1/4)\, dy\, (1/5)\, dx = E[5 \cdot 4 \cdot g(X, Y)]\,,$$

where $X \sim \text{Uniform}[0, 5]$ and $Y \sim \text{Uniform}[0, 4]$.

− So, let $X_i \sim \text{Uniform}[0, 5]$, and $Y_i \sim \text{Uniform}[0, 4]$ (all independent).

− Estimate $I$ by $\frac{1}{M} \sum_{i=1}^M (5 \cdot 4 \cdot g(X_i, Y_i))$. (file “RMCint2”)

− Standard error: $se = M^{-1/2}\, \mathrm{sd}(5 \cdot 4 \cdot g(X_1, Y_1), \ldots, 5 \cdot 4 \cdot g(X_M, Y_M))$.

− With $M = 10^6$, get about $-4.11 \pm 0.01$ . . . Mathematica gives $-4.11692$.

• e.g. estimate $\int_0^1 \int_0^\infty h(x, y)\, dy\, dx$, where $h(x, y) = e^{-y^2} \cos(\sqrt{xy})$.

− (Can’t use “Uniform” expectations.)

− Instead, write this as $\int_0^1 \int_0^\infty (e^{y}\, h(x, y))\, e^{-y}\, dy\, dx$.

− This is the same as $E[e^{Y} h(X, Y)]$, where $X \sim \text{Uniform}[0, 1]$ and $Y \sim \text{Exponential}(1)$ are independent.

− So, estimate it by $\frac{1}{M} \sum_{i=1}^M e^{Y_i} h(X_i, Y_i)$, where $X_i \sim \text{Uniform}[0, 1]$ and $Y_i \sim \text{Exponential}(1)$ (i.i.d.). (file “RMCint3”)

− With $M = 10^6$ get about $0.767 \pm 0.0004$ . . . Small error! Mathematica: 0.767211.

• Alternatively, could write this as

$$\int_0^1 \int_0^\infty \big(\tfrac{1}{5}\, e^{5y}\, h(x, y)\big)\, (5\, e^{-5y})\, dy\, dx = E\big[\tfrac{1}{5}\, e^{5Y} h(X, Y)\big]$$

where $X \sim \text{Uniform}[0, 1]$ and $Y \sim \text{Exponential}(5)$ (indep.).

− Then, estimate it by $\frac{1}{M} \sum_{i=1}^M \frac{1}{5}\, e^{5 y_i} h(x_i, y_i)$, where $x_i \sim \text{Uniform}[0, 1]$ and $y_i \sim \text{Exponential}(5)$ (i.i.d.).

− With $M = 10^6$, get about $0.767 \pm 0.0016$ . . . larger standard error . . . (file “RMCint4”).

− If replace 5 by 1/5, get about $0.767 \pm 0.0015$ . . . about the same.

• So which choice is best?

− Whichever one minimises the standard error! ($\lambda \approx 1.5$, $se \approx 0.00025$?)
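The effect of the exponential rate on the standard error can be explored with a short experiment. This Python sketch is an illustration rather than the course files “RMCint3” or “RMCint4”; the function name `estimate`, the smaller sample size $M = 10^5$, and the particular rates tried are assumptions. Each run uses $Y \sim \text{Exponential}(\lambda)$ and averages $\frac{1}{\lambda} e^{\lambda Y} h(X, Y)$.

```python
import math
import random

def h(x, y):
    return math.exp(-y * y) * math.cos(math.sqrt(x * y))

def estimate(lam, M, seed=0):
    """Estimate the double integral of h using Y ~ Exponential(lam).

    Returns (estimate, standard error).
    """
    rng = random.Random(seed)
    vals = []
    for _ in range(M):
        x = rng.random()
        y = rng.expovariate(lam)
        # Integrand divided by the Exponential(lam) density lam*exp(-lam*y).
        vals.append(math.exp(lam * y) / lam * h(x, y))
    mean = sum(vals) / M
    var = sum((v - mean) ** 2 for v in vals) / (M - 1)
    return mean, math.sqrt(var / M)

results = {lam: estimate(lam, 100_000) for lam in (0.2, 1.0, 1.5, 5.0)}
for lam, (est, se) in results.items():
    print("lambda =", lam, "estimate =", est, "se =", se)
```

All four rates target the same integral (Mathematica's value above is 0.767211); only the standard errors differ.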

• In general, to evaluate $I \equiv E[h(Y)] = \int h(y)\, \pi(y)\, dy$, where $Y$ has density $\pi$, could instead re-write this as $I = \int h(x)\, \frac{\pi(x)}{f(x)}\, f(x)\, dx$, where $f$ is easily sampled from, with $f(x) > 0$ whenever $\pi(x) > 0$.

− Then $I = E\big(h(X)\, \frac{\pi(X)}{f(X)}\big)$, where $X$ has density $f$.

− (“Importance Sampling”)

− Can then do classical (iid) Monte Carlo integration, get standard errors etc.

− Good if easier to sample from $f$ than $\pi$, and/or if the function $h(x)\, \frac{\pi(x)}{f(x)}$ is less variable than $h$ itself.

• In general, best to make $h(x)\, \frac{\pi(x)}{f(x)}$ approximately constant.

− e.g. extreme case: if $I = \int_0^\infty e^{-3x}\, dx$, then $I = \int_0^\infty (1/3)(3 e^{-3x})\, dx = E[1/3]$ where $X \sim \text{Exponential}(3)$, so $I = 1/3$ (error = 0, no MC needed).

UNNORMALISED DENSITIES:

• Suppose now that π(y) = c g(y), where we know g but don’t know c or

π. (“Unnormalised density”, e.g. Bayesian posterior.)

− Obviously, $c = \frac{1}{\int g(y)\, dy}$, but this might be hard to compute.

− Still, $I = \int h(x)\, \pi(x)\, dx = \int h(x)\, c\, g(x)\, dx = \frac{\int h(x)\, g(x)\, dx}{\int g(x)\, dx}$.

− If sample $\{x_i\} \sim f$ (i.i.d.), then $\int h(x)\, g(x)\, dx = \int \big(h(x)\, g(x) / f(x)\big)\, f(x)\, dx = E[h(X)\, g(X) / f(X)]$ where $X \sim f$.

− So, $\int h(x)\, g(x)\, dx \approx \frac{1}{M} \sum_{i=1}^M \big(h(x_i)\, g(x_i) / f(x_i)\big)$.

− Similarly, $\int g(x)\, dx \approx \frac{1}{M} \sum_{i=1}^M \big(g(x_i) / f(x_i)\big)$.

− So, $I \approx \frac{\sum_{i=1}^M (h(x_i)\, g(x_i) / f(x_i))}{\sum_{i=1}^M (g(x_i) / f(x_i))}$. (“Importance Sampling”: weighted average)

− (Because we are taking ratios of (unbiased) estimates, the resulting

estimate is not unbiased, and its standard errors are less clear – but

it is still consistent as M →∞.)

− (Good to use same sample {xi} for both numerator and denominator:

easier computationally, and leads to smaller variance.)

• Example: compute $I \equiv E(Y^2)$ where $Y$ has density $c\, y^3 \sin(y^4) \cos(y^5)\, \mathbf{1}_{0<y<1}$, where $c > 0$ unknown (and hard to compute!).

− Here $g(y) = y^3 \sin(y^4) \cos(y^5)\, \mathbf{1}_{0<y<1}$, and $h(y) = y^2$.

− Let $f(y) = 6\, y^5\, \mathbf{1}_{0<y<1}$. [Recall: if $U \sim \text{Uniform}[0, 1]$, then $X \equiv U^{1/6} \sim f$.]

− Then $I \approx \frac{\sum_{i=1}^M (h(x_i)\, g(x_i) / f(x_i))}{\sum_{i=1}^M (g(x_i) / f(x_i))} = \frac{\sum_{i=1}^M \sin(x_i^4) \cos(x_i^5)}{\sum_{i=1}^M \sin(x_i^4) \cos(x_i^5) / x_i^2}$. (file “Rimp” . . . get about 0.766 . . . )

− Or, let $f(y) = 4\, y^3\, \mathbf{1}_{0<y<1}$. [Then if $U \sim \text{Uniform}[0, 1]$, then $U^{1/4} \sim f$.]

− Then $I \approx \frac{\sum_{i=1}^M (h(x_i)\, g(x_i) / f(x_i))}{\sum_{i=1}^M (g(x_i) / f(x_i))} = \frac{\sum_{i=1}^M \sin(x_i^4) \cos(x_i^5)\, x_i^2}{\sum_{i=1}^M \sin(x_i^4) \cos(x_i^5)}$. (file “Rimp”)

• With importance sampling, is it important to use the same samples $\{x_i\}$ in both numerator and denominator?

− What if independent samples are used instead?

− Let’s try it! (file “Rimpind”)

− Both ways work, but usually the same samples work better.
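The weighted-average estimator for an unnormalised density is short to code. The sketch below is a Python illustration, not the course files “Rimp” or “Rimpind”; the function name `importance_ratio` and the sample size are assumptions. It uses the same samples in numerator and denominator, with proposal density $f(y) = r\, y^{r-1}$ on $(0, 1)$ sampled as $U^{1/r}$, for $r = 6$ and $r = 4$ as in the example.

```python
import math
import random

def g(y):
    """Unnormalised target density on (0, 1)."""
    if 0.0 < y < 1.0:
        return y ** 3 * math.sin(y ** 4) * math.cos(y ** 5)
    return 0.0

def importance_ratio(M, r, seed=0):
    """Estimate E(Y^2) under density c*g, proposing from f(y) = r y^(r-1)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(M):
        x = (1.0 - rng.random()) ** (1.0 / r)   # x in (0, 1], so f(x) > 0
        w = g(x) / (r * x ** (r - 1))           # weight g(x) / f(x)
        num += x * x * w                        # h(x) = x^2
        den += w
    return num / den

est6 = importance_ratio(100_000, 6)
est4 = importance_ratio(100_000, 4)
print(est6, est4)
```

Both choices of $f$ should give values near the 0.766 quoted above; comparing their run-to-run spread is one way to see which proposal is better matched to the target.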

• What other methods are available to iid sample from π?

REJECTION SAMPLER:

• Assume π(x) = c g(x), with π and c unknown, g known but hard to

sample from.

• Want to sample X ∼ π.

− Then if $X_1, X_2, \ldots, X_M \sim \pi$ iid, then can estimate $E_\pi(h)$ by $\frac{1}{M} \sum_{i=1}^M h(X_i)$, etc.

• Find some other, easily-sampled density f , and known K > 0, such that

K f(x) ≥ g(x) for all x. (i.e., K f(x) ≥ π(x) / c, i.e. cK f(x) ≥ π(x))

• Sample X ∼ f , and U ∼ Uniform[0, 1] (indep.).

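− If $U \le \frac{g(X)}{K f(X)}$, then accept $X$ (as a draw from $\pi$).

− Otherwise, reject $X$ and start over again.

• Now, since $0 \le \frac{g(x)}{K f(x)} \le 1$, therefore $P\big(U \le \frac{g(X)}{K f(X)} \,\big|\, X = x\big) = \frac{g(x)}{K f(x)}$.

− Hence, $P\big(U \le \frac{g(X)}{K f(X)}\big) = E\big[P\big(U \le \frac{g(X)}{K f(X)} \,\big|\, X\big)\big] = E\big[\frac{g(X)}{K f(X)}\big] = \int_{-\infty}^{\infty} \frac{g(x)}{K f(x)}\, f(x)\, dx = \frac{1}{K} \int_{-\infty}^{\infty} g(x)\, dx$.

− Similarly, for any $y \in \mathbf{R}$, $P\big(X \le y,\ U \le \frac{g(X)}{K f(X)}\big) = E\big[\mathbf{1}_{X \le y}\, P\big(U \le \frac{g(X)}{K f(X)} \,\big|\, X\big)\big] = E\big[\mathbf{1}_{X \le y}\, \frac{g(X)}{K f(X)}\big] = \int_{-\infty}^{y} \frac{g(x)}{K f(x)}\, f(x)\, dx = \frac{1}{K} \int_{-\infty}^{y} g(x)\, dx$.

− So, conditional on accepting, we have for any $y \in \mathbf{R}$ that

$$P\Big(X \le y \,\Big|\, U \le \frac{g(X)}{K f(X)}\Big) = \frac{P\big(X \le y,\ U \le \frac{g(X)}{K f(X)}\big)}{P\big(U \le \frac{g(X)}{K f(X)}\big)} = \frac{\int_{-\infty}^{y} \frac{g(x)}{K f(x)}\, f(x)\, dx}{\int_{-\infty}^{\infty} \frac{g(x)}{K f(x)}\, f(x)\, dx} = \frac{\int_{-\infty}^{y} g(x)\, dx}{\int_{-\infty}^{\infty} g(x)\, dx} = \int_{-\infty}^{y} \pi(x)\, dx\,.$$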

− So, conditional on accepting, X ∼ π. Good! iid!

− However, prob. of accepting may be very small.

− If so, then get very few samples – bad.

—————————– END WEEK #2 —————————–

• Example: $\pi = N(0, 1)$, i.e. $g(x) = \pi(x) = (2\pi)^{-1/2} \exp(-x^2/2)$.

− Want: $E_\pi(X^4)$, i.e. $h(x) = x^4$. (Should be 3.)

− Let $f$ be double-exponential (Laplace) distribution, i.e. $f(x) = \frac{1}{2}\, e^{-|x|}$.

• If $K = 8$, then:

− For $|x| \le 2$, $K f(x) = 8 \cdot \frac{1}{2} \exp(-|x|) \ge 8 \cdot \frac{1}{2} \exp(-2) \ge (2\pi)^{-1/2} \ge \pi(x) = g(x)$.

− For $|x| \ge 2$, $K f(x) = 8 \cdot \frac{1}{2} \exp(-|x|) \ge 8 \cdot \frac{1}{2} \exp(-x^2/2) \ge (2\pi)^{-1/2} \exp(-x^2/2) = \pi(x) = g(x)$.

− See graph: file “Rrejgraph”.

• So, can apply rejection sampler with this $f$ and $K$, to get samples, estimate of $E[X]$, estimate of $E[h(X)]$, estimate of $P[X < -1]$, etc.

− Try it: file “Rrej”

• For Rejection Sampler, $P(\text{accept}) = E[P(\text{accept} \,|\, X)] = E\big[\frac{g(X)}{K f(X)}\big] = \int \frac{g(x)}{K f(x)}\, f(x)\, dx = \frac{1}{K} \int g(x)\, dx = \frac{1}{cK}$. (Only depends on $K$, not $f$.)

− So, in $M$ attempts, get about $M/cK$ iid samples.

− (“Rrej” example: $c = 1$, $K = 8$, $M = 10{,}000$, so get about $M/8 = 1250$ samples.)

− Since c fixed, try to minimise K.

− Extreme case: f(x) = π(x), so g(x) = π(x)/c = f(x)/c, and can take

K = 1/c, whence P (accept) = 1, iid sampling: optimal.
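The Normal/Laplace example can be run directly to check both the acceptance rate $1/(cK) = 1/8$ and the target value $E_\pi(X^4) = 3$. This Python sketch is an illustration, not the course file “Rrej”; the function name `rejection_sample` and the number of attempts are assumptions. A Laplace draw is generated as an Exponential(1) value with a random sign.

```python
import math
import random

K = 8.0

def g(x):
    """Target: standard normal density (so c = 1 here)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def f(x):
    """Proposal: double-exponential (Laplace) density."""
    return 0.5 * math.exp(-abs(x))

def rejection_sample(attempts, seed=0):
    """Run the rejection sampler; return the list of accepted draws."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(attempts):
        x = rng.expovariate(1.0) * rng.choice((-1.0, 1.0))  # X ~ f
        u = rng.random()
        if u <= g(x) / (K * f(x)):
            accepted.append(x)
    return accepted

attempts = 200_000
samples = rejection_sample(attempts)
accept_rate = len(samples) / attempts
fourth_moment = sum(x ** 4 for x in samples) / len(samples)
print("acceptance rate:", accept_rate, "E[X^4] estimate:", fourth_moment)
```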

• Note: these algorithms all work in discrete case too.

− Can let π, f , etc. be “probability functions”, i.e. probability densities

with respect to counting measure.

− Then the algorithms proceed exactly as before.

AUXILIARY VARIABLE APPROACH:

• (related: “slice sampler”)

• Suppose π(x) = c g(x), and (X, Y ) chosen uniformly under graph of g.

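− i.e., $(X, Y) \sim \text{Uniform}\{(x, y) \in \mathbf{R}^2 : 0 \le y \le g(x)\}$.

− Then $X \sim \pi$, i.e. we have sampled from $\pi$.

− Why? For $a < b$, $P(a < X < b) = \frac{\text{area with } a < X < b}{\text{total area}} = \frac{\int_a^b g(x)\, dx}{\int_{-\infty}^{\infty} g(x)\, dx} = \int_a^b \pi(x)\, dx$.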

− So, if repeat, get i.i.d. samples from π, can estimate Eπ(h) etc.

• Auxiliary Variable rejection sampler:

− If support of $g$ contained in $[L, R]$, and $|g(x)| \le K$, then can first sample $(X, Y) \sim \text{Uniform}([L, R] \times [0, K])$, then reject if $Y > g(X)$, otherwise accept as sample with $(X, Y) \sim \text{Uniform}\{(x, y) : 0 \le y \le g(x)\}$, hence $X \sim \pi$.

• Example: $g(y) = y^3 \sin(y^4) \cos(y^5)\, \mathbf{1}_{0<y<1}$.

− Then L = 0, R = 1, K = 1.

− So, sample X, Y ∼ Uniform[0, 1], then keep X iff Y ≤ g(X).

− If h(y) = y2, could compute e.g. Eπ(h) as the mean of the squares of

the accepted samples. (file “Raux”)
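The auxiliary variable rejection sampler for this example takes only a few lines. The following Python sketch is an illustration, not the course file “Raux”; the function name `auxiliary_sample` and the number of attempts are assumptions. It uses $L = 0$, $R = 1$, $K = 1$ as stated above.

```python
import math
import random

def g(y):
    """Unnormalised target density on (0, 1)."""
    if 0.0 < y < 1.0:
        return y ** 3 * math.sin(y ** 4) * math.cos(y ** 5)
    return 0.0

def auxiliary_sample(attempts, seed=0):
    """Sample (X, Y) uniformly on [0,1] x [0,1]; keep X iff Y <= g(X)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(attempts):
        x = rng.random()
        y = rng.random()
        if y <= g(x):
            kept.append(x)
    return kept

attempts = 200_000
kept = auxiliary_sample(attempts)
estimate = sum(x * x for x in kept) / len(kept)   # mean of squares, h(y) = y^2
print("kept", len(kept), "of", attempts, "; E(h) estimate:", estimate)
```

Only a small fraction of attempts is kept here, since the area under $g$ is a small part of the unit square.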

• Can iid / importance / rejection / auxiliary sampling solve all problems?

No!

− Many challenging cases arise, e.g. from Bayesian statistics (later).

− Some are high-dimensional, and the above algorithms fail.

− Alternative algorithm: MCMC!

MARKOV CHAIN MONTE CARLO (MCMC):

• Suppose have complicated, high-dimensional density π = c g.

• Want samples $X_1, X_2, \ldots \sim \pi$. (Then can do Monte Carlo.)

• Define a Markov chain (dependent random process) $X_0, X_1, X_2, \ldots$ in such a way that for large enough $n$, $X_n \approx \pi$.

• METROPOLIS ALGORITHM (1953):

− Choose some initial value $X_0$ (perhaps random).

− Then, given $X_{n-1}$, choose a proposal move $Y_n \sim \text{MVN}(X_{n-1}, \sigma^2 I)$ (say).

− Let $A_n = \pi(Y_n) / \pi(X_{n-1}) = g(Y_n) / g(X_{n-1})$, and $U_n \sim \text{Uniform}[0, 1]$.

− Then, if $U_n < A_n$, set $X_n = Y_n$ (“accept”), otherwise set $X_n = X_{n-1}$ (“reject”).

− Repeat, for $n = 1, 2, 3, \ldots, M$.

− (Note: only need to compute $\pi(Y_n) / \pi(X_{n-1})$, so multiplicative constants cancel.)

− Try it: “www.probability.ca/rwm.html” Java applet.

• Fact: Then, for large $n$, have $X_n \approx \pi$. (Markov chain theory: later.)

• Then can estimate $E_\pi(h) \equiv \int h(x)\, \pi(x)\, dx$ by:

$$E_\pi(h) \approx \frac{1}{M - B} \sum_{i=B+1}^{M} h(X_i)\,,$$

where $B$ (“burn-in”) chosen large enough so $X_B \approx \pi$, and $M$ chosen large enough to get good Monte Carlo estimates.

• Note: This is called “random walk Metropolis” (RWM). Why? Because

the proposals (if we always accepted them) would form a random walk.

• How large B? Difficult to say! Some theory (later) . . . usually just use

trial-and-error / statistical analysis of output, and hope for the best . . .

• What initial value X0?

− Virtually any one will do, but “central” ones best.

− Can also use an “overdispersed starting distribution”: choose $X_0$ randomly from some distribution that “covers” the “important” parts of the state space. Good for checking consistency . . .

• EXAMPLE: $g(y) = y^3 \sin(y^4) \cos(y^5)\, \mathbf{1}_{0<y<1}$.

− Want to compute (again!) $E_\pi(h)$ where $h(y) = y^2$.

− Use Metropolis algorithm with proposal $Y \sim N(X, 1)$. [file “Rmet”]

− Works pretty well, but lots of variability!

− Plot: appears to have “good mixing” . . .

− acf: has some serial autocorrelation. Important! (Soon.)

− What if we change σ? How does that affect estimate? plot? acf?
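A random-walk Metropolis run for this one-dimensional example can be sketched as follows. This is a Python illustration, not the course file “Rmet”; the function name `metropolis`, the starting value $X_0 = 0.5$, and the values of $M$ and $B$ are assumptions. Changing `sigma` and watching the acceptance rate and the estimate is one way to explore the question just raised.

```python
import math
import random

def g(y):
    """Unnormalised target density on (0, 1)."""
    if 0.0 < y < 1.0:
        return y ** 3 * math.sin(y ** 4) * math.cos(y ** 5)
    return 0.0

def metropolis(M, B, sigma=1.0, x0=0.5, seed=0):
    """Random-walk Metropolis for pi = c*g, estimating E(h) with h(y) = y^2.

    Returns (estimate from iterations after burn-in B, acceptance rate).
    """
    rng = random.Random(seed)
    x = x0                      # must satisfy g(x0) > 0
    total = 0.0
    accepts = 0
    for n in range(1, M + 1):
        y = x + sigma * rng.gauss(0.0, 1.0)   # proposal Y ~ N(x, sigma^2)
        a = g(y) / g(x)                        # constants cancel in the ratio
        if rng.random() < a:
            x = y
            accepts += 1
        if n > B:
            total += x * x
    return total / (M - B), accepts / M

est, rate = metropolis(M=100_000, B=1_000, sigma=1.0)
print("estimate:", est, "acceptance rate:", rate)
```

Proposals that land outside $(0, 1)$ have $g(y) = 0$ and are always rejected, so the chain never leaves the support.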

• EXAMPLE: $\pi(x_1, x_2) = C\, |\cos(\sqrt{x_1 x_2})|\, I(0 \le x_1 \le 5,\ 0 \le x_2 \le 4)$.

− Want to compute $E_\pi(h)$, where $h(x_1, x_2) = e^{x_1} + (x_2)^2$.

− Metropolis algorithm (file “Rmet2”) . . . works, but large uncertainty.

− Gets between about 34 and 44 . . . (Mathematica gets 38.7044)

− Individual plots appear to have “good mixing” . . .

− Joint plot shows fewer samples where $x_1 x_2 \approx (\pi/2)^2 \doteq 2.5$ . . .

• OPTIMAL SCALING:

− Can change proposal distribution to $Y_n \sim \text{MVN}(X_{n-1}, \sigma^2 I)$ for any choice of $\sigma > 0$. Which is best?

− If σ too small, then usually accept, but chain won’t move much.

− If σ too large, then will usually reject proposals, so chain still won’t

move much.

− Optimal: need σ “just right” to avoid both extremes. (“Goldilocks

Principle”)

− Can experiment . . . (“rwm.html” applet, files “Rmet”, “Rmet2”) . . .

− Some theory . . . limited . . . active area of research . . .

− General principle: the acceptance rate should be far from 0 and far

from 1.

− In a certain idealised high-dimensional limit, optimal acceptance rate is 0.234 (!). [Roberts et al., Ann Appl Prob 1997; Roberts and Rosenthal, Stat Sci 2001]

MCMC STANDARD ERROR:

• What about standard error, i.e. uncertainty?

− It’s usually larger than in iid case (due to correlations), and harder to

quantify.

• Simplest: re-run the chain many times, with same $M$ and $B$, with different initial values drawn from some overdispersed starting distribution, and compute standard error of the sequence of estimates.

− Then can analyse the estimates obtained as iid . . .

• But how to estimate standard error from a single run?

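• i.e., how to estimate $v \equiv \mathrm{Var}\big(\frac{1}{M-B} \sum_{i=B+1}^{M} h(X_i)\big)$?

− Let $\bar{h}(x) = h(x) - E_\pi(h)$, so $E_\pi(\bar{h}) = 0$.

− And, assume $B$ large enough that $X_i \approx \pi$ for $i > B$.

− Then, for large $M - B$,

$$v \approx E_\pi\Big[\Big(\Big(\frac{1}{M-B} \sum_{i=B+1}^{M} h(X_i)\Big) - E_\pi(h)\Big)^2\Big] = E_\pi\Big[\Big(\frac{1}{M-B} \sum_{i=B+1}^{M} \bar{h}(X_i)\Big)^2\Big]$$

—————————– END WEEK #3 —————————–

$$= \frac{1}{(M-B)^2} \Big[(M-B)\, E_\pi(\bar{h}(X_i)^2) + 2(M-B-1)\, E_\pi(\bar{h}(X_i)\, \bar{h}(X_{i+1})) + 2(M-B-2)\, E_\pi(\bar{h}(X_i)\, \bar{h}(X_{i+2})) + \ldots\Big]$$

$$\approx \frac{1}{M-B} \Big(E_\pi(\bar{h}(X_i)^2) + 2\, E_\pi(\bar{h}(X_i)\, \bar{h}(X_{i+1})) + 2\, E_\pi(\bar{h}(X_i)\, \bar{h}(X_{i+2})) + \ldots\Big)$$

$$= \frac{1}{M-B} \Big(\mathrm{Var}_\pi(h) + 2\, \mathrm{Cov}_\pi(h(X_i), h(X_{i+1})) + 2\, \mathrm{Cov}_\pi(h(X_i), h(X_{i+2})) + \ldots\Big)$$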

http://probability.ca/jeff/java/rwm.html

http://probability.ca/jeff/ftpdir/scalerev.pdf

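$$= \frac{1}{M-B}\, \mathrm{Var}_\pi(h)\, \Big(1 + 2\, \mathrm{Corr}_\pi(h(X_i), h(X_{i+1})) + 2\, \mathrm{Corr}_\pi(h(X_i), h(X_{i+2})) + \ldots\Big) \equiv \frac{1}{M-B}\, \mathrm{Var}_\pi(h)\, (\text{varfact}) = (\text{iid variance})\, (\text{varfact})\,,$$

where

$$\text{varfact} = 1 + 2 \sum_{k=1}^{\infty} \mathrm{Corr}_\pi\big(h(X_0), h(X_k)\big) \equiv 1 + 2 \sum_{k=1}^{\infty} \rho_k = \sum_{k=-\infty}^{\infty} \rho_k$$

(also called “integrated auto-correlation time” or “ACT”).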

− Note that $\rho_0 = 1$ and $\rho_{-k} = \rho_k$, so also $\text{varfact} = 2 \big(\sum_{k=0}^{\infty} \rho_k\big) - 1$.

− Then can estimate both iid variance, and varfact, from the sample run, as usual.

− Note: to compute varfact, don’t sum over all $k$, just e.g. until, say, $|\rho_k| < 0.05$ or $\rho_k < 0$ or . . .

− (Can use R’s built-in “acf” function, or can write your own – better.)

− Then standard error $= se = \sqrt{v} = (\text{iid-se})\, \sqrt{\text{varfact}}$.

• e.g. the files Rmet and Rmet2. (Recall: true answers are about 0.766 and

38.7, respectively.)

− Usually $\text{varfact} \gg 1$; try to get “better” chains so varfact smaller.

− Sometimes even try to design chain to get varfact < 1 (“antithetic”).
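Writing your own varfact estimate, as suggested above, might look like the following Python sketch. The function names `varfact` and `mcmc_standard_error`, the truncation threshold 0.05, and the `max_lag` cap are assumptions for illustration. To have a case with a known answer, the usage lines apply it to an AR(1) sequence $X_n = \phi X_{n-1} + \text{noise}$ rather than to actual MCMC output; for that process $\rho_k = \phi^k$, so varfact $= (1+\phi)/(1-\phi) = 3$ when $\phi = 0.5$.

```python
import math
import random

def varfact(xs, threshold=0.05, max_lag=1000):
    """Estimate 1 + 2 * sum_k rho_k, stopping once |rho_k| < threshold."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    total = 1.0
    for k in range(1, min(max_lag, n - 1) + 1):
        cov = sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k)) / n
        rho = cov / var
        if abs(rho) < threshold:
            break
        total += 2.0 * rho
    return total

def mcmc_standard_error(xs):
    """se = (iid standard error) * sqrt(varfact)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var / n) * math.sqrt(varfact(xs))

# Illustrative stand-in for chain output: AR(1) with phi = 0.5.
rng = random.Random(0)
chain = [0.0]
for _ in range(19_999):
    chain.append(0.5 * chain[-1] + rng.gauss(0.0, 1.0))

vf = varfact(chain)
se = mcmc_standard_error(chain)
print("varfact estimate:", vf, "standard error:", se)
```

Because the sum is truncated, the estimate will typically sit a little below the theoretical value.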

CONFIDENCE INTERVALS:

• Suppose we estimate $u \equiv E_\pi(h)$ by the quantity $e = \frac{1}{M-B} \sum_{i=B+1}^{M} h(X_i)$, and obtain an estimate $e$ and an approximate variance (as above) $v$.

• Then what is, say, a 95% confidence interval for $u$?

• Well, if have central limit theorem (CLT), then for large $M - B$, $e \approx N(u, v)$.

− So $(e - u)\, v^{-1/2} \approx N(0, 1)$.

− So, $P(-1.96 < (e - u)\, v^{-1/2} < 1.96) \approx 0.95$.

− So, $P(-1.96 \sqrt{v} < e - u < 1.96 \sqrt{v}) \approx 0.95$.

− i.e., with prob 95%, the interval $(e - 1.96 \sqrt{v},\ e + 1.96 \sqrt{v})$ will contain $u$.

− (Again, strictly speaking, should use “t” distribution, not normal distribution . . . but if $M - B$ large that doesn’t really matter – ignore it for now.)

• e.g. the files Rmet and Rmet2. (Recall: true answers are about 0.766 and

38.7, respectively.)

• But does a CLT even hold??

− Does not follow from classical i.i.d. CLT. Does not always hold. But

often does.

− For example, CLT holds if chain is “geometrically ergodic” (later!) and $E_\pi(|h|^{2+\delta}) < \infty$ for some $\delta > 0$.

− (If chain also reversible then don’t need δ: Roberts and Rosenthal,

“Geometric ergodicity and hybrid Markov chains”, ECP 1997.)

• So MCMC is more complicated than standard Monte Carlo.

− Why should we bother?

− Some problems too challenging for other methods! (e.g. Bayesian –

later) In fact, need other MCMC algorithms too. For example . . .

METROPOLIS-HASTINGS ALGORITHM:

• (Hastings [Canadian!], Biometrika 1970; see www.probability.ca/hastings)

• Previous Metropolis algorithm works provided proposal distribution is

symmetric, i.e. q(x, y) = q(y, x). But what if it isn’t?

• FACT: if we replace “$A_n = \pi(Y_n) / \pi(X_{n-1})$” by $A_n = \frac{\pi(Y_n)\, q(Y_n, X_{n-1})}{\pi(X_{n-1})\, q(X_{n-1}, Y_n)}$, then it’s still valid (justification later); everything else remains the same.

− i.e., still accept if $U_n < A_n$, otherwise reject.

− (Intuition: if q(x, y) >> q(y, x), then Metropolis chain would spend

too much time at y and not enough at x, so need to accept fewer

moves x→ y.)

− Do require that q(x, y) > 0 iff q(y, x) > 0.

• EXAMPLE: again $\pi(x_1, x_2) = C\, |\cos(\sqrt{x_1 x_2})|\, I(0 \le x_1 \le 5,\ 0 \le x_2 \le 4)$, and $h(x_1, x_2) = e^{x_1} + (x_2)^2$. (Recall: Mathematica gives $E_\pi(h) \doteq 38.7044$.)

− Proposal distribution: $Y_n \sim \text{MVN}(X_{n-1},\ \sigma^2 (1 + |X_{n-1}|^2)^2\, I)$.

− (Intuition: larger proposal variance if farther from center.)

− So, $q(x, y) = C\, (1 + |x|^2)^{-2} \exp\big(-|y - x|^2 \,/\, \big(2\sigma^2 (1 + |x|^2)^2\big)\big)$.

− So, can run Metropolis-Hastings algorithm for this example. (file

“RMH”)

− Usually get between 34 and 43, with claimed standard error ≈ 2.

(Recall: Mathematica gets 38.7044.)

LANGEVIN ALGORITHM:

− Special case of Metropolis-Hastings algorithm.

− $Y_n \sim \text{MVN}\big(X_{n-1} + \frac{1}{2} \sigma^2\, \nabla \log \pi(X_{n-1}),\ \sigma^2 I\big)$.

− Intuition: tries to move in direction where π increasing.

http://probability.ca/jeff/ftpdir/hybrid.pdf

− Based on discrete approximation to “Langevin diffusion”.

− Usually more efficient, but requires knowledge and computation of

∇ log π. (Hard. Homework?)

− For theory, see e.g. Roberts & Tweedie, Bernoulli 2(4), 341–363, 1996;

Roberts & Rosenthal, JRSSB 60, 255–268, 1998.

INDEPENDENCE SAMPLER:

• Propose $\{Y_n\} \sim q(\cdot)$, i.e. the $\{Y_n\}$ are i.i.d. from some fixed density $q$, independent of $X_{n-1}$. (e.g. $Y_n \sim \text{MVN}(0, I_d)$)

− Then accept if $U_n < A_n$ where $U_n \sim \text{Uniform}[0, 1]$ and $A_n = \frac{\pi(Y_n)\, q(X_{n-1})}{\pi(X_{n-1})\, q(Y_n)}$.

− Special case of the Metropolis-Hastings algorithm, where $Y_n \sim q(X_{n-1}, \cdot)$, and $A_n = \frac{\pi(Y_n)\, q(Y_n, X_{n-1})}{\pi(X_{n-1})\, q(X_{n-1}, Y_n)}$.

− Very special case: if $q(y) \equiv \pi(y)$, i.e. propose exactly from target density $\pi$, then $A_n \equiv 1$, i.e. make great proposals, and always accept them (iid).

• e.g. independence sampler with $\pi(x) = e^{-x}$ and $q(y) = k e^{-ky}$ for $x > 0$.

− Then if $X_{n-1} = x$ and $Y_n = y$, then $A_n = \frac{e^{-y}\, k e^{-kx}}{e^{-x}\, k e^{-ky}} = e^{(k-1)(y-x)}$. (file “Rind”)

− k = 1: iid sampling (great).

− k = 0.01: proposals way too large (so-so).

− k = 5: proposals somewhat too small (terrible).

− And with k = 5, confidence intervals often miss 1. (file “Rind2”)

− Why is large k so much worse than small k?
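
A minimal Python sketch of this independence sampler, for experimenting with different k (the course files “Rind” and “Rind2” are in R and are not reproduced; the function name, starting value and run length are assumptions). The acceptance test is done on the log scale so that $e^{(k-1)(y-x)}$ cannot overflow.

```python
import math
import random


def independence_sampler(k, n_iter=20000, seed=1):
    """Independence sampler for pi(x) = exp(-x), x > 0, with proposals q(y) = k exp(-k y).

    Returns (estimate of E_pi(X), acceptance rate); the true mean is 1.
    """
    rng = random.Random(seed)
    x = 1.0  # assumed starting value
    total = 0.0
    n_acc = 0
    for _ in range(n_iter):
        y = rng.expovariate(k)          # i.i.d. proposal with rate k
        log_a = (k - 1.0) * (y - x)     # log A_n = (k - 1)(y - x)
        if log_a >= 0.0 or rng.random() < math.exp(log_a):
            x = y
            n_acc += 1
        total += x
    return total / n_iter, n_acc / n_iter


if __name__ == "__main__":
    for k in (1.0, 0.01, 5.0):
        print(k, independence_sampler(k))
```

With k = 1 every proposal is accepted (log A_n = 0), which is the i.i.d. case noted above.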

VARIABLE-AT-A-TIME MCMC:

• Propose to move just one coordinate at a time, leaving all the other

coordinates fixed (since changing all coordinates at once may be difficult).

− e.g. proposal $Y_n$ has $Y_{n,i} \sim N(X_{n-1,i}, \sigma^2)$, with $Y_{n,j} = X_{n-1,j}$ for $j \ne i$.

− (Here $Y_{n,i}$ is the ith coordinate of $Y_n$.)

• Then accept/reject with usual Metropolis rule (symmetric case: “Metropolis-within-Gibbs”, or “Componentwise Metropolis”) or Metropolis-Hastings rule (general case: “Metropolis-Hastings-within-Gibbs”, or “Componentwise Metropolis-Hastings”).

• Need to choose which coordinate to update each time . . .

− Could choose coordinates in sequence 1, 2, . . . , d, 1, 2, . . . (“systematic-

scan”).

− Or, choose coordinate ∼ Uniform{1, 2, . . . , d} each time (“random-

scan”).


− Note: one systematic-scan iteration corresponds to d random-scan

ones . . .

• EXAMPLE: again $\pi(x_1, x_2) = C\,|\cos(\sqrt{x_1 x_2})|\; I(0 \le x_1 \le 5,\ 0 \le x_2 \le 4)$, and $h(x_1, x_2) = e^{x_1} + (x_2)^2$. (Recall: Mathematica gives $E_\pi(h) \doteq 38.7044$.)

− Works with systematic-scan (file “Rmwg”) or random-scan (file “Rmwg2”).
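
A rough Python sketch of the componentwise (Metropolis-within-Gibbs) version, with a switch between the two scans. It is not the course's “Rmwg”/“Rmwg2” code; the proposal scale σ = 1, the starting point and the function names are assumptions.

```python
import math
import random


def target(x):
    """Unnormalised pi(x1, x2) = |cos(sqrt(x1 x2))| on [0,5] x [0,4], zero elsewhere."""
    if 0.0 <= x[0] <= 5.0 and 0.0 <= x[1] <= 4.0:
        return abs(math.cos(math.sqrt(x[0] * x[1])))
    return 0.0


def metropolis_within_gibbs(n_iter=100000, burn=10000, sigma=1.0,
                            random_scan=False, seed=1):
    """Estimate E_pi(h), h = exp(x1) + x2^2, updating one coordinate per iteration."""
    rng = random.Random(seed)
    x = [2.5, 2.0]  # assumed starting point inside the support
    total = 0.0
    for n in range(n_iter):
        # choose the coordinate: uniformly at random, or in sequence 0, 1, 0, 1, ...
        i = rng.randrange(2) if random_scan else n % 2
        y = list(x)
        y[i] = rng.gauss(x[i], sigma)   # symmetric proposal for coordinate i only
        if rng.random() < target(y) / target(x):
            x = y
        if n >= burn:
            total += math.exp(x[0]) + x[1] ** 2
    return total / (n_iter - burn)


if __name__ == "__main__":
    print(metropolis_within_gibbs(random_scan=False))
    print(metropolis_within_gibbs(random_scan=True))
```

Here one "iteration" means one single-coordinate update, so two systematic-scan iterations make one full sweep.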

• So, lots of MCMC algorithms to choose from.

− Why do we need them all?

− To compute with complicated models! For example . . .

BAYESIAN STATISTICS:

• Have unknown parameter(s) θ, and a statistical model (likelihood func-

tion) for how the distribution of the data Y depends on θ: L(Y | θ).

• Have a prior distribution, representing our “initial” (subjective?) proba-

bilities for θ: L(θ).

• Combining these gives a full joint distribution for θ and Y , i.e. L(θ, Y ).

• The posterior distribution of θ, π(θ), is then the conditional distribution of θ, conditioned on the observed data y, i.e. π(θ) = L(θ | Y = y).

− In terms of densities, if have prior density $f_\theta(\theta)$, and likelihood $f_{Y|\theta}(y, \theta)$, then joint density is $f_{\theta,Y}(\theta, y) = f_\theta(\theta)\, f_{Y|\theta}(y, \theta)$, and posterior density is

$$\pi(\theta) = \frac{f_{\theta,Y}(\theta, y)}{f_Y(y)} = C\, f_{\theta,Y}(\theta, y) = C\, f_\theta(\theta)\, f_{Y|\theta}(y, \theta) .$$

—————————– END WEEK #4 —————————–

• Bayesian Statistics Example: VARIANCE COMPONENTS MODEL (a.k.a.

“random effects model”):

                              µ
                         ↙    ↓    ↘
              θ1       . . . . . .       θK                    θi ∼ N(µ, V)
            ↙    ↓                     ↓    ↘
   Y11, . . . , Y1J1    . . . . . .    YK1, . . . , YKJK       Yij ∼ N(θi, W) [observed]

− Suppose some population has overall mean µ (unknown).

− Population consists of K groups.

− Observe Yi1, . . . , YiJi from group i, for 1 ≤ i ≤ K.

− Assume Yij ∼ N(θi,W ) (cond. ind.), where θi and W unknown.


− Assume the different θi are “linked” by θi ∼ N(µ, V ) (cond. ind.),

with µ and V also unknown.

− Want to estimate some or all of V,W, µ, θ1, . . . , θK .

− Bayesian approach: use prior distributions, e.g. (“conjugate”):

$$V \sim IG(a_1, b_1); \quad W \sim IG(a_2, b_2); \quad \mu \sim N(a_3, b_3) ,$$

where $a_i, b_i$ known constants, and $IG(a, b)$ is the “inverse gamma” distribution, with density $\frac{b^a}{\Gamma(a)}\, e^{-b/x}\, x^{-a-1}$ for $x > 0$.

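• Combining the above dependencies, we see that the joint density is (for V, W > 0):

$$f(V, W, \mu, \theta_1, \ldots, \theta_K, Y_{11}, Y_{12}, \ldots, Y_{KJ_K})$$

$$= \Big(\frac{b_1^{a_1}}{\Gamma(a_1)}\, e^{-b_1/V} V^{-a_1-1}\Big) \Big(\frac{b_2^{a_2}}{\Gamma(a_2)}\, e^{-b_2/W} W^{-a_2-1}\Big) \Big(\frac{1}{\sqrt{2\pi b_3}}\, e^{-(\mu-a_3)^2/2b_3}\Big) \times \Big(\prod_{i=1}^{K} \frac{1}{\sqrt{2\pi V}}\, e^{-(\theta_i-\mu)^2/2V}\Big) \Big(\prod_{i=1}^{K} \prod_{j=1}^{J_i} \frac{1}{\sqrt{2\pi W}}\, e^{-(Y_{ij}-\theta_i)^2/2W}\Big)$$

$$= C_2\, e^{-b_1/V} V^{-a_1-1}\, e^{-b_2/W} W^{-a_2-1}\, e^{-(\mu-a_3)^2/2b_3}\, V^{-K/2}\, W^{-\frac12 \sum_{i=1}^{K} J_i} \times \exp\Big[-\sum_{i=1}^{K} (\theta_i-\mu)^2/2V\Big] \exp\Big[-\sum_{i=1}^{K} \sum_{j=1}^{J_i} (Y_{ij}-\theta_i)^2/2W\Big] .$$

• Then

$$\pi(V, W, \mu, \theta_1, \ldots, \theta_K) = f(V, W, \mu, \theta_1, \ldots, \theta_K, Y_{11}, Y_{12}, \ldots, Y_{KJ_K}) \,/\, f_Y(Y_{11}, Y_{12}, \ldots, Y_{KJ_K})$$

$$\propto f(V, W, \mu, \theta_1, \ldots, \theta_K, Y_{11}, Y_{12}, \ldots, Y_{KJ_K})$$

$$= C_3\, e^{-b_1/V} V^{-a_1-1}\, e^{-b_2/W} W^{-a_2-1}\, e^{-(\mu-a_3)^2/2b_3}\, V^{-K/2}\, W^{-\frac12 \sum_{i=1}^{K} J_i} \times \exp\Big[-\sum_{i=1}^{K} (\theta_i-\mu)^2/2V\Big] \exp\Big[-\sum_{i=1}^{K} \sum_{j=1}^{J_i} (Y_{ij}-\theta_i)^2/2W\Big] .$$

• NOTE: Many applications of variance components model, e.g.: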

− Predicting success at law school (D. Rubin, JASA 1980), K = 82 schools.

− Melanoma (skin cancer) recurrence (http://www.mssanz.org.au/MODSIM07/papers/52_s24/Analysing_Clinicals24_Bartolucci_.pdf), with K = 19 different patient categories.

− Comparing baseball home-run hitters (J. Albert, The American Statistician 1992), K = 12 players.

− Analysing fabric dyes (Davies 1967; Box/Tiao 1973; Gelfand/Smith JASA 1990), K = 6 batches of dyestuff. (data in file “Rdye”: http://probability.ca/sta3431/supp/Rdye)

• Here, the dimension is d = K + 3, e.g. K = 19, d = 22. High!

• How to compute/estimate, say, Eπ(W/V ), or the effect of changing b1?

− Numerical integration? No, too high-dimensional!

− Importance sampling? Perhaps, but what “f”? Too inefficient!

− Rejection sampling? What “f”? What “K”? Virtually no samples!

− Perhaps MCMC can work!

− But need clever, useful MCMC algorithms!

− Perhaps Metropolis, or . . .

• ASIDE: For big complicated π, often better to work with logarithms, e.g.

accept iff log(Un) < log(An) = log(π(Yn))− log(π(Xn−1)).

− Then only need to compute log(π(x)), which might work better.

− So, better to program on log scale: log π(V,W, µ, θ1, . . . , θK) = . . ..

− Can avoid “overflow” problems.

− Also sometimes simpler, e.g. if $\pi(x) = \exp\big(\sum_{i<j} |x_j - x_i|\big)$, then $\log(\pi(x)) = \sum_{i<j} |x_j - x_i|$. (Best to type in the log formula directly.)
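
A small Python illustration of why the log scale helps. The target here (an unnormalised N(0, I) density in 2000 dimensions) and the function names are assumptions chosen only to force floating-point underflow; the same idea guards against overflow.

```python
import math


def log_pi(x):
    """log of an unnormalised N(0, I) density: -(1/2) sum_i x_i^2."""
    return -0.5 * sum(v * v for v in x)


def accept_on_log_scale(log_pi_y, log_pi_x, u):
    """Metropolis rule 'accept iff log(U_n) < log(A_n)', for a uniform draw 0 < u <= 1."""
    return math.log(u) < log_pi_y - log_pi_x


x = [1.0] * 2000     # log pi(x) = -1000
y = [0.999] * 2000   # log pi(y) is about -998.001

naive_x = math.exp(log_pi(x))     # underflows to 0.0, so pi(y)/pi(x) would be 0/0
naive_y = math.exp(log_pi(y))     # also 0.0
log_a = log_pi(y) - log_pi(x)     # about 1.999, perfectly well behaved

if __name__ == "__main__":
    print(naive_x, naive_y, log_a, accept_on_log_scale(log_pi(y), log_pi(x), 0.5))
```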

GIBBS SAMPLER:

• (Special case of Metropolis-Hastings-within-Gibbs.)

• Proposal distribution for ith coordinate is equal to the conditional dis-

tribution of that coordinate (according to π), conditional on the current

values of all the other coordinates.

− Then, always accept. (Reason later.)

− Can use either systematic or random scan, just like above.

• EXAMPLE: Variance Components Model:

− Update of µ (say) should be from conditional density of µ, conditional on current values of all the other coordinates: $L(\mu \mid V, W, \theta_1, \ldots, \theta_K, Y_{11}, \ldots, Y_{KJ_K})$.

− This conditional density is proportional to the full joint density, but with all variables besides µ treated as constant.

− Recall: full joint density is:

$$C_3\, e^{-b_1/V} V^{-a_1-1}\, e^{-b_2/W} W^{-a_2-1}\, e^{-(\mu-a_3)^2/2b_3}\, V^{-K/2}\, W^{-\frac12 \sum_{i=1}^{K} J_i} \times \exp\Big[-\sum_{i=1}^{K} (\theta_i-\mu)^2/2V\Big] \exp\Big[-\sum_{i=1}^{K} \sum_{j=1}^{J_i} (Y_{ij}-\theta_i)^2/2W\Big] .$$

− So, combining “constants”, the conditional density of µ is

$$C_4\, e^{-(\mu-a_3)^2/2b_3} \exp\Big[-\sum_{i=1}^{K} (\theta_i-\mu)^2/2V\Big] .$$

− This equals (verify this! HW?)

$$C_5 \exp\Big(-\mu^2 \Big(\frac{1}{2b_3} + \frac{K}{2V}\Big) + \mu \Big(\frac{a_3}{b_3} + \frac{1}{V} \sum_{i=1}^{K} \theta_i\Big)\Big) .$$

− Side calculation: if $\mu \sim N(m, v)$, then density $\propto e^{-(\mu-m)^2/2v} \propto e^{-\mu^2 (1/2v) + \mu (m/v)}$.

− Hence, here $\mu \sim N(m, v)$, where $1/2v = \frac{1}{2b_3} + \frac{K}{2V}$ and $m/v = \frac{a_3}{b_3} + \frac{1}{V} \sum_{i=1}^{K} \theta_i$.

− Solve: $v = b_3 V / (V + K b_3)$, and $m = \big(a_3 V + b_3 \sum_{i=1}^{K} \theta_i\big) / (V + K b_3)$.

− So, in Gibbs Sampler, each time µ is updated, we sample it from

N(m, v) for this m and v (and always accept).

• Similarly (HW?), conditional distribution for V is:

$$C_6\, e^{-b_1/V} V^{-a_1-1} V^{-K/2} \exp\Big[-\sum_{i=1}^{K} (\theta_i-\mu)^2/2V\Big], \quad V > 0 .$$

− Recall that “IG(r, s)” has density $\frac{s^r}{\Gamma(r)}\, e^{-s/x}\, x^{-r-1}$ for $x > 0$.

− So, conditional distribution for V equals $IG\big(a_1 + K/2,\ b_1 + \tfrac12 \sum_{i=1}^{K} (\theta_i-\mu)^2\big)$.

• Can similarly compute conditional distributions for W and θi (HW?).

• The systematic-scan Gibbs sampler then proceeds (HW?) by:

− Update V from its conditional distribution IG(. . . , . . .).

− Update W from its conditional distribution IG(. . . , . . .).

− Update µ from its conditional distribution N(. . . , . . .).

− Update θi from its conditional distribution N(. . . , . . .), for i = 1, 2, . . . , K.

− Repeat all of the above M times.

• Or, the random-scan Gibbs sampler proceeds by choosing one of V, W, µ, θ1, . . . , θK uniformly at random, and then updating that coordinate from its corresponding conditional distribution.

− Then repeat this step M times [or M(K + 3) times?].

− How well does it work? HW?
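
A hedged Python sketch of the systematic-scan Gibbs sampler for this model. The conditionals for µ and V are the ones derived above; the conditionals for W and θi are left as homework in these notes, so the forms used below, $W \sim IG(a_2 + \tfrac12 \sum_i J_i,\ b_2 + \tfrac12 \sum_{i,j} (Y_{ij}-\theta_i)^2)$ and $\theta_i \sim N\big((V \sum_j Y_{ij} + W\mu)/(J_i V + W),\ VW/(J_i V + W)\big)$, are my own derivation by the same "collect terms" argument and should be verified. Function names, starting values and run length are assumptions.

```python
import random


def gibbs_variance_components(Y, a1, b1, a2, b2, a3, b3, n_iter=5000, seed=1):
    """Systematic-scan Gibbs sampler; Y is a list of K lists (group i has J_i values).

    Returns a list of (V, W, mu) draws, one per sweep.
    """
    rng = random.Random(seed)
    K = len(Y)
    n_obs = sum(len(g) for g in Y)
    theta = [sum(g) / len(g) for g in Y]   # assumed start: group means
    mu = sum(theta) / K
    draws = []
    for _ in range(n_iter):
        # V | rest ~ IG(a1 + K/2, b1 + (1/2) sum_i (theta_i - mu)^2);
        # if G ~ Gamma(shape a, rate b) then 1/G ~ IG(a, b); gammavariate takes a scale.
        rate = b1 + 0.5 * sum((t - mu) ** 2 for t in theta)
        V = 1.0 / rng.gammavariate(a1 + 0.5 * K, 1.0 / rate)
        # W | rest ~ IG(a2 + (1/2) sum_i J_i, b2 + (1/2) sum_ij (Y_ij - theta_i)^2)
        rate = b2 + 0.5 * sum((y - t) ** 2 for g, t in zip(Y, theta) for y in g)
        W = 1.0 / rng.gammavariate(a2 + 0.5 * n_obs, 1.0 / rate)
        # mu | rest ~ N(m, v) with m, v as solved above
        v = b3 * V / (V + K * b3)
        m = (a3 * V + b3 * sum(theta)) / (V + K * b3)
        mu = rng.gauss(m, v ** 0.5)
        # theta_i | rest ~ N((V sum_j Y_ij + W mu)/(J_i V + W), V W/(J_i V + W))
        for i, g in enumerate(Y):
            J = len(g)
            mean_i = (V * sum(g) + W * mu) / (J * V + W)
            var_i = V * W / (J * V + W)
            theta[i] = rng.gauss(mean_i, var_i ** 0.5)
        draws.append((V, W, mu))
    return draws
```

A usage example would pass the observed groups as `Y` (for instance the six dyestuff batches) together with chosen prior constants, then average functions of the draws after a burn-in.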

MCMC CONVERGENCE RATES THEORY:

• {Xn} : Markov chain on X , with stationary distribution Π(·).

• Let $P^n(x, S) = \mathbf{P}[X_n \in S \mid X_0 = x]$.

− Hope that for large n, $P^n(x, S) \approx \Pi(S)$.

• Let $D(x, n) = \|P^n(x, \cdot) - \Pi(\cdot)\| \equiv \sup_{S \subseteq \mathcal{X}} |P^n(x, S) - \Pi(S)|$.

• DEFN: chain is ergodic if $\lim_{n\to\infty} D(x, n) = 0$, for Π-a.e. $x \in \mathcal{X}$.

• DEFN: chain is geometrically ergodic if there is ρ < 1, and $M : \mathcal{X} \to [0, \infty]$ which is Π-a.e. finite, such that $D(x, n) \le M(x)\, \rho^n$ for all $x \in \mathcal{X}$ and $n \in \mathbf{N}$.

• DEFN: a quantitative bound on convergence is an actual number $n^*$ such that $D(x, n^*) < 0.01$ (say). [Then sometimes say chain “converges in $n^*$ iterations”.]

• Quantitative bounds often difficult (though I’ve worked on them a lot,

see e.g. Rosenthal, “Quantitative convergence rates of Markov chains: A

simple account”, Elec Comm Prob 2002 and the references therein), but

“geometric ergodicity” is often easier to verify.

• Fact (mentioned earlier): CLT holds for $\frac1n \sum_{i=1}^{n} h(X_i)$ if chain is geometrically ergodic and $E_\pi(|h|^{2+\delta}) < \infty$ for some δ > 0.

− (If chain also reversible then don’t need δ: Roberts and Rosenthal, “Geometric ergodicity and hybrid Markov chains”, ECP 1997.)

− If CLT holds, then (as before) have 95% confidence interval $(e - 1.96\sqrt{v},\ e + 1.96\sqrt{v})$.

• First Question: What do we know about ergodicity?

− Theorem (later): if chain is irreducible and aperiodic and Π(·) is

stationary, then chain is ergodic.

• But what about convergence rates?

• Special Case: INDEPENDENCE SAMPLER.

− By Thm, independence sampler is ergodic provided q(x) > 0 whenever

π(x) > 0.

− But is that sufficient?

− No, e.g. previous “Rind” example with k = 5: ergodic (of course), but

performs terribly.

− FACT: independence sampler is geometrically ergodic IF AND ONLY IF there is δ > 0 such that $q(x) \ge \delta\,\pi(x)$ for π-a.e. $x \in \mathcal{X}$, in which case $D(x, n) \le (1 - \delta)^n$ for π-a.e. $x \in \mathcal{X}$.

• EXAMPLE: Independence sampler with $\pi(x) = e^{-x}$ and $q(x) = k e^{-kx}$ for x > 0.

− If 0 < k ≤ 1, then setting δ = k, we have that $q(x) = k e^{-kx} \ge k e^{-x} = k\,\pi(x) = \delta\,\pi(x)$ for all x > 0, so it’s geometrically ergodic, and furthermore $D(x, n) \le (1 - k)^n$.

− e.g. if k = 0.01, then $D(x, 459) \le (0.99)^{459} \doteq 0.0099 < 0.01$ for all x > 0, i.e. “converges after 459 iterations”.




− But if k > 1, then cannot find any δ > 0 such that q(x) ≥ δπ(x) for

all x, so it is not geometrically ergodic.

− If k > 2, then no CLT (Roberts, J. Appl. Prob. 36, 1210–1217, 1999).

− So, if k = 5 (as in “Rind”), then not geometrically ergodic, and CLT

does not hold. Indeed, confidence intervals often miss 1. (file “Rind2”)

− Fact: if k = 5, then D(0, n) > 0.01 for all n ≤ 4,000,000, while D(0, n) < 0.01 for all n ≥ 14,000,000, i.e. “convergence” takes between 4 million and 14 million iterations. Slow! [Roberts and Rosenthal, “Quantitative Non-Geometric Convergence Bounds for Independence Samplers”, MCAP 2011.]
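
For the geometrically ergodic case 0 < k < 1 of this example, the bound $D(x, n) \le (1 - k)^n$ can be inverted numerically. A tiny Python sketch (the function name and the 0.01 threshold argument are assumptions for illustration):

```python
def iterations_needed(k, eps=0.01):
    """Smallest n with (1 - k)^n < eps, for 0 < k < 1 (bound D(x, n) <= (1 - k)^n)."""
    n = 1
    while (1.0 - k) ** n >= eps:
        n += 1
    return n


if __name__ == "__main__":
    print(iterations_needed(0.01))  # the k = 0.01 case discussed above
```

This only applies when the bound holds; for k > 1 there is no such δ, so no value of n is implied by this argument.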

• What about other chains (besides independence sampler)?

• FACT: if state space is finite, and chain is irreducible and aperiodic, then

always geometrically ergodic. (See e.g. J.S. Rosenthal, SIAM Review

37:387-405, 1995.)

• What about for the “random-walk Metropolis algorithm” (RWM), i.e.

where {Yn −Xn−1} ∼ q (i.i.d.) for some fixed symmetric density q?

− e.g. Yn ∼ N(Xn−1, σ2I), or Yn ∼ Uniform[Xn−1 − δ, Xn−1 + δ].

• FACT: RWM is geometrically ergodic essentially if and only if π has exponentially light tails, i.e. there are a, b, c > 0 such that $\pi(x) \le a\, e^{-b|x|}$ whenever |x| > c. (Requires a few technical conditions: π and q continuous and positive; q has finite first moment; and π non-increasing in the tails, with (in higher dims) bounded Gaussian curvature.) [Mengersen and Tweedie, Ann Stat 1996; Roberts and Tweedie, Biometrika 1996]

• EXAMPLES: RWM on R with usual proposals: Yn ∼ N(Xn−1, σ2).

− CASE #1: $\Pi = N(5, 4^2)$, and functional $h(y) = y^2$, so $E_\pi(h) = 5^2 + 4^2 = 41$. (file “Rnorm” . . . σ = 1 v. σ = 4 v. σ = 16)

− Does CLT hold? Yes! (geometrically ergodic, and $E_\pi(|h|^p) < \infty$ for all p.)

− Indeed, confidence intervals “usually” contain 41. (file “Rnorm2”)

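− CASE #2: $\pi(y) = c\, \frac{1}{1+y^4}$, and functional $h(y) = y^2$ (file “Rheavy”), so

$$E_\pi(h) = \frac{\int_{-\infty}^{\infty} y^2\, \frac{1}{1+y^4}\, dy}{\int_{-\infty}^{\infty} \frac{1}{1+y^4}\, dy} = \frac{\pi/\sqrt{2}}{\pi/\sqrt{2}} = 1 .$$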

− Not exponentially light tails, so no CLT; estimates less stable, confi-

dence intervals often miss 1.

− CASE #3: $\pi(y) = \frac{1}{\pi(1+y^2)}$ (Cauchy), and functional $h(y) = 1_{-10<y<10}$, so $E_\pi(h) = \Pi(|X| < 10) = 2\arctan(10)/\pi = 0.93655$. [$\Pi(0 < X < x) = \arctan(x)/\pi$] (file “Rcauchy”)

− Not geometrically ergodic.

− Confidence intervals often miss 0.93655.

− CASE #4: $\pi(y) = \frac{1}{\pi(1+y^2)}$ (Cauchy), and functional $h(y) = \min(y^2, 100)$. [Numerical integration: $E_\pi(h) \doteq 11.77$] (file “Rcauchy2”)

− Again, not geometrically ergodic, and 95% CI often miss 11.77, though

iid MC does better.

• NOTE: Even when CLT holds, it can be rather unstable, e.g. it requires

that chain has converged to Π, so it might underestimate v.

− Estimate of v is very important! And “varfact” is not always reliable!

− Repeated runs?

− Another approach is “batch means”, whereby chain is broken into m

large “batches”, which are assumed to be approximately i.i.d.
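
A minimal Python sketch of the batch means idea: split the run into m batches, treat the batch averages as approximately i.i.d., and use their sample variance to get a standard error for the overall mean. The function name and the choice to drop leftover values are assumptions.

```python
import math


def batch_means_se(values, m):
    """Standard error of the mean of `values`, estimated from m equal-length batches."""
    b = len(values) // m                      # batch length; any leftover tail is dropped
    means = [sum(values[j * b:(j + 1) * b]) / b for j in range(m)]
    grand = sum(means) / m
    s2 = sum((bm - grand) ** 2 for bm in means) / (m - 1)   # sample variance of batch means
    return math.sqrt(s2 / m)


if __name__ == "__main__":
    # batch means are 1, 3, 5, so the result is sqrt(4 / 3)
    print(batch_means_se([1, 1, 3, 3, 5, 5], 3))
```

In practice `values` would be h(X_i) along the chain after burn-in, with batches long enough that their means are nearly uncorrelated.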

—————————– END WEEK #5 —————————–

JUSTIFICATION: WHY DOES METROPOLIS ALG WORK?:

• (Uses Markov chain theory . . . e.g. STA447/2006 . . . already know?)

• Basic fact: if a Markov chain is “irreducible” and “aperiodic”, with “stationary distribution” π, then L(Xn) → π as n → ∞. More precisely:

• THEOREM: If Markov chain is irreducible, with stationary probability density π, then for π-a.e. initial value X0 = x,

(a) if $E_\pi(|h|) < \infty$, then $\lim_{n\to\infty} \frac1n \sum_{i=1}^{n} h(X_i) = E_\pi(h) \equiv \int h(x)\, \pi(x)\, dx$; and

(b) if chain aperiodic, then also $\lim_{n\to\infty} \mathbf{P}(X_n \in S) = \int_S \pi(x)\, dx$ for all $S \subseteq \mathcal{X}$.

− Let’s figure out what this all means . . .

− Notation: $P(i, j) = \mathbf{P}(X_{n+1} = j \mid X_n = i)$ (discrete case), or $P(x, A) = \mathbf{P}(X_{n+1} \in A \mid X_n = x)$ (general case). Also $\Pi(A) = \int_A \pi(x)\, dx$.

• Well, irreducible means that you have positive probability of eventually

getting from anywhere to anywhere else.

− Discrete case: for all i, j ∈ X (the state space), there is n ∈ N such

that P (Xn = j |X0 = i) > 0.

− Actually, we only need to require this for states j such that π(j) > 0.

− General case: for all x ∈ X , and for all A ⊆ X with Π(A) > 0, there

is n ∈ N such that P (Xn ∈ A |X0 = x) > 0. (“π-irreducible”)

− (Since usually P (Xn = y |X0 = x) = 0 for all y.)

− Irreducibility is usually satisfied for MCMC.


• And, aperiodic means there are no forced cycles, i.e. there do not exist disjoint non-empty subsets $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_d$ for d ≥ 2, such that $P(x, \mathcal{X}_{i+1}) = 1$ for all $x \in \mathcal{X}_i$ (1 ≤ i ≤ d−1), and $P(x, \mathcal{X}_1) = 1$ for all $x \in \mathcal{X}_d$. [Diagram.]

− This is true for virtually any Metropolis algorithm, e.g. it’s true if

P (x, {x}) > 0 for any one state x ∈ X , e.g. if positive prob of rejection.

− Also true if P (x, ·) has positive density throughout S, for all x ∈ S,

for some S ⊆ X with Π(S) > 0. (e.g. Normal proposals)

− Not quite guaranteed, e.g. X = {0, 1, 2, 3}, and π uniform on X , and

Yn = Xn−1 ± 1 (mod 4). [Diagram.] But almost always holds.

• What about Π being a stationary distribution?

• Begin with DISCRETE CASE (e.g. rwm.html).

• Assume for simplicity that π(x) > 0 for all x ∈ X .

− Let q(x, y) = P(Yn = y |Xn−1 = x) be proposal distribution, e.g.

q(x, x + 1) = q(x, x − 1) = 1/2. Assume symmetric, i.e. q(x, y) =

q(y, x) for all x, y ∈ X .

− Let α(x, y) be probability of accepting a proposed move from x to y, i.e.

$$\alpha(x, y) = \mathbf{P}(U_n < A_n \mid X_{n-1} = x,\ Y_n = y) = \mathbf{P}\Big(U_n < \frac{\pi(y)}{\pi(x)}\Big) = \min\Big[1, \frac{\pi(y)}{\pi(x)}\Big] .$$

− State space is X, e.g. X ≡ {1, 2, 3, 4, 5, 6}.

• Then, for i, j ∈ X with i ≠ j,

$$P(i, j) = q(i, j)\, \alpha(i, j) = q(i, j) \min\Big(1, \frac{\pi(j)}{\pi(i)}\Big) .$$

• Follows that chain is “(time) reversible”: for all i, j ∈ X , by symmetry,

π(i)P (i, j) = q(i, j) min(π(i), π(j)) = q(j, i) min(π(i), π(j)) = π(j)P (j, i) .

− (Case i ≠ j is proved above, and case i = j is trivial.)

− (Intuition: if X0 ∼ π, i.e. P(X0 = i) = π(i) for all i ∈ X , then

P(X0 = i, X1 = j) = π(i)P (i, j) = P(X0 = j, X1 = i) . . . )

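• We then compute that if X0 ∼ π, i.e. that P(X0 = i) = π(i) for all i ∈ X, then:

$$\mathbf{P}(X_1 = j) = \sum_{i \in \mathcal{X}} \mathbf{P}(X_0 = i)\, P(i, j) = \sum_{i \in \mathcal{X}} \pi(i)\, P(i, j) = \sum_{i \in \mathcal{X}} \pi(j)\, P(j, i) = \pi(j) \sum_{i \in \mathcal{X}} P(j, i) = \pi(j) ,$$

i.e. X1 ∼ π too!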


− So, the Markov chain “preserves” π, i.e. π is a stationary distribution.

− This is true for any Metropolis algorithm!

• It then follows from the Theorem (i.e., “Basic Fact”) that as n → ∞,

L(Xn) → π, i.e. limn→∞ P (Xn = i) = π(i) for all i ∈ X . (applet

“rwm.html”)

− Also follows that if $E_\pi(|h|) < \infty$, then $\lim_{n\to\infty} \frac1n \sum_{i=1}^{n} h(X_i) = E_\pi(h) \equiv \int h(x)\, \pi(x)\, dx$. (“LLN”)
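
The discrete argument can be checked numerically. The Python sketch below builds the Metropolis transition matrix on a six-point state space with q(x, x±1) = 1/2, then verifies reversibility and stationarity. The particular π (proportional to 1, 2, ..., 6), the 0-based state labels and the function names are assumptions for illustration.

```python
def metropolis_matrix(pi):
    """Transition matrix of the Metropolis chain on {0, ..., n-1} with q(x, x +/- 1) = 1/2."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:  # proposals outside the state space have pi = 0: always rejected
                P[i][j] = 0.5 * min(1.0, pi[j] / pi[i])
        P[i][i] = 1.0 - sum(P[i])  # all rejected proposals leave the chain at i
    return P


def max_balance_gap(pi, P):
    """Largest |pi(i) P(i,j) - pi(j) P(j,i)|; zero (up to rounding) means reversible."""
    n = len(pi)
    return max(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) for i in range(n) for j in range(n))


def step(dist, P):
    """Distribution of X_{n+1} given that X_n has distribution `dist` (row vector times P)."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


if __name__ == "__main__":
    pi = [w / 21.0 for w in (1, 2, 3, 4, 5, 6)]
    P = metropolis_matrix(pi)
    print(max_balance_gap(pi, P))        # reversibility check
    dist = [1.0, 0, 0, 0, 0, 0]          # start at one fixed state
    for _ in range(500):
        dist = step(dist, P)
    print(dist)                          # compare with pi
```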

JUSTIFICATION: GENERAL CONTINUOUS CASE:

• Some notation:

− Let X be the state space of all possible values. (Usually $\mathcal{X} \subseteq \mathbf{R}^d$, e.g. for Variance Components Model, $\mathcal{X} = (0,\infty) \times (0,\infty) \times \mathbf{R} \times \mathbf{R}^K \subseteq \mathbf{R}^{K+3}$.)

− Let q(x, y) be the proposal density for y given x. (e.g. $q(x, y) = (2\pi\sigma^2)^{-d/2} \exp\big(-\sum_{i=1}^{d} (y_i - x_i)^2 / 2\sigma^2\big)$.) Symmetric: q(x, y) = q(y, x).

− Let $\alpha(x, y) = \min\big[1, \frac{\pi(y)}{\pi(x)}\big]$ be probability of accepting a proposed move from x to y.

− Let P (x, S) = P(X1 ∈ S |X0 = x) be the transition probabilities.

− (Don’t use P (x, y) since that is usually 0.)

• Then if x ∉ S, then

$$P(x, S) = \mathbf{P}(Y_1 \in S,\ U_1 < A_1 \mid X_0 = x) = \int_S q(x, y) \min[1, \pi(y)/\pi(x)]\, dy .$$

− Shorthand: for x ≠ y, P(x, dy) = q(x, y) min[1, π(y)/π(x)] dy.

− Then for x ≠ y, P(x, dy) π(x) dx = q(x, y) min[1, π(y)/π(x)] dy π(x) dx = q(x, y) min[π(x), π(y)] dy dx = P(y, dx) π(y) dy. (symmetric)

− Follows that P(x, dy) π(x) dx = P(y, dx) π(y) dy for all x, y ∈ X.

(“reversible”)

− Shorthand: P (x, dy) Π(dx) = P (y, dx) Π(dy).

• How does “reversible” help? Just like for discrete chains!

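• Indeed, suppose X0 ∼ Π, i.e. we “start in stationarity”. Then

$$\mathbf{P}(X_1 \in S) = \int_{x \in \mathcal{X}} \mathbf{P}(X_1 \in S \mid X_0 = x)\, \pi(x)\, dx = \int_{x \in \mathcal{X}} \int_{y \in S} P(x, dy)\, \pi(x)\, dx = \int_{x \in \mathcal{X}} \int_{y \in S} P(y, dx)\, \pi(y)\, dy = \int_{y \in S} \pi(y)\, dy \equiv \Pi(S) ,$$

so also X1 ∼ Π. So, chain “preserves” Π, i.e. Π is stationary distribution.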

• So, again, the Theorem applies.



• Note: key facts about q(x, y) are symmetry, and irreducibility.

JUSTIFICATION OF METROPOLIS-HASTINGS:

• Previous Metropolis algorithm works provided proposal distribution is

symmetric, i.e. q(x, y) = q(y, x). But what if it isn’t?

• For Metropolis, key was that q(x, y)α(x, y) π(x) was symmetric (to make

the Markov chain be reversible).

• If instead $A_n = \dfrac{\pi(Y_n)\, q(Y_n, X_{n-1})}{\pi(X_{n-1})\, q(X_{n-1}, Y_n)}$, i.e. acceptance prob. $\equiv \alpha(x, y) = \min\Big[1, \dfrac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)}\Big]$, then:

$$q(x, y)\, \alpha(x, y)\, \pi(x) = q(x, y) \min\Big[1, \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)}\Big] \pi(x) = \min\big[\pi(x)\, q(x, y),\ \pi(y)\, q(y, x)\big] .$$

So, still symmetric, even if q wasn’t.

− So, for Metropolis-Hastings algorithm, replace “$A_n = \pi(Y_n) / \pi(X_{n-1})$” by $A_n = \dfrac{\pi(Y_n)\, q(Y_n, X_{n-1})}{\pi(X_{n-1})\, q(X_{n-1}, Y_n)}$, then still reversible, and everything else remains the same: still accept if $U_n < A_n$, otherwise reject.

− We require only that q(x, y) > 0 iff q(y, x) > 0.

• INDEPENDENCE SAMPLER (mentioned earlier):

− Proposals {Yn} i.i.d. from some fixed distribution (say, Yn ∼MVN(0, I)).

− Another special case of Metropolis-Hastings algorithm.

− Then q(x, y) = q(y), depends only on y.

− So, now $A_n = \dfrac{\pi(Y_n)\, q(X_{n-1})}{\pi(X_{n-1})\, q(Y_n)}$. (files “Rind”, “Rind2” from before)

• VARIABLE-AT-A-TIME: The exact same justification works if we up-

date the variables one-at-a-time (e.g. Metropolis-within-Gibbs, Metropolis-

Hastings-within-Gibbs, etc.); each individual step is still reversible (for

the same reason), so π is still stationary.

JUSTIFICATION OF GIBBS SAMPLER:

• Special case of Metropolis-Hastings-within-Gibbs.

• Proposal distribution for ith coordinate is equal to the conditional dis-

tribution of that coordinate (according to π), conditional on the current

values of all the other coordinates.

− That is, $q_i(x, y) = C(x^{(-i)})\, \pi(y)$ whenever $x^{(-i)} = y^{(-i)}$, where $x^{(-i)}$ means all coordinates except the ith one.

− Here $C(x^{(-i)})$ is the appropriate normalising constant (which depends on $x^{(-i)}$). (So $C(x^{(-i)}) = C(y^{(-i)})$.)

− Then $A_n = \dfrac{\pi(Y_n)\, q_i(Y_n, X_{n-1})}{\pi(X_{n-1})\, q_i(X_{n-1}, Y_n)} = \dfrac{\pi(Y_n)\, C(Y_n^{(-i)})\, \pi(X_{n-1})}{\pi(X_{n-1})\, C(X_{n-1}^{(-i)})\, \pi(Y_n)} = 1$.

− So, always accept (i.e., can ignore the accept-reject step).

− (Intuition: if start in stationary distribution, then update one coordi-

nate from its conditional stationary distribution (and always accept),

then the distribution remains the same, i.e. stationarity is preserved.)

EXAMPLES RE WHY DOES MCMC WORK:

• EXAMPLE #1: Metropolis algorithm where $\mathcal{X} = \mathbf{Z}$, $\pi(x) = 2^{-|x|}/3$, and $q(x, y) = \frac12$ if |x − y| = 1, otherwise 0.

− Reversible? Yes, it’s a Metropolis algorithm!

− π stationary? Yes, follows from reversibility!

− Aperiodic? Yes, since P (x, {x}) > 0!

− Irreducible? Yes: π(x) > 0 for all x ∈ X , so can get from x to y in

|x− y| steps.

− So, by theorem, probabilities and expectations converge to those of π

– good.

• EXAMPLE #2: Same as #1, except now $\pi(x) = 2^{-|x|-1}$ for x ≠ 0, with π(0) = 0.

− Still reversible, π stationary, aperiodic, same as before.

− Irreducible? No – can’t go from positive to negative!

• EXAMPLE #3: Same as #2, except now $q(x, y) = \frac14$ if 1 ≤ |x − y| ≤ 2, otherwise 0.

− Still reversible, π stationary, aperiodic, same as before.

− Irreducible? Yes – can “jump over 0” to get from positive to negative,

and back!

• EXAMPLE #4: Metropolis algorithm with $\mathcal{X} = \mathbf{R}$, and $\pi(x) = C\, e^{-x^6}$, and proposals $Y_n \sim \mathrm{Uniform}[X_{n-1} - 1,\ X_{n-1} + 1]$.

− Reversible? Yes since it’s Metropolis, and q(x, y) still symmetric.

− π stationary? Yes since reversible!

− Irreducible? Yes, since the n-step transitions $P^n(x, dy)$ have positive density whenever |y − x| < n.

− Aperiodic? Yes since if periodic, then if e.g. $\mathcal{X}_1 \cap [0, 1]$ has positive measure, then possible to go from $\mathcal{X}_1$ directly to $\mathcal{X}_1$, i.e. if $x \in \mathcal{X}_1 \cap [0, 1]$, then $P(x, \mathcal{X}_1) > 0$. (Or, even simpler: since P(x, {x}) > 0 for all x ∈ X.)

− So, by theorem, probabilities and expectations converge to those of π

– good.

• EXAMPLE #5: Same as #4, except now $\pi(x) = C_1\, e^{-x^6} (1_{x<2} + 1_{x>4})$.

− Still reversible and stationary and aperiodic, same as before.

− But no longer irreducible: cannot jump from [4,∞) to (−∞, 2] or

back.

− So, does not converge.

• EXAMPLE #6: Same as #5, except now proposals are

Yn ∼ Uniform[Xn−1 − 5, Xn−1 + 5].

− Still reversible and stationary and aperiodic, same as before.

− And now irreducible, too: now can jump from [4,∞) to (−∞, 2] or

back.

• EXAMPLE #7: Same as #6, except now

Yn ∼ Uniform[Xn−1 − 5, Xn−1 + 10].

− Makes no sense – proposals not symmetric, so not a Metropolis al-

gorithm! (Not even symmetrically zero, for a Metropolis-Hastings

algorithm, e.g. have positive density 3→ 9 but not 9→ 3.)

• ASIDE: Why does Theorem say “π-a.e.” X0 = x?

• Example: X = {1, 2, 3, . . .}, and P(1, {1}) = 1, and for x ≥ 2, $P(x, \{1\}) = 1/x^2$ and $P(x, \{x+1\}) = 1 - (1/x^2)$.

− Stationary distribution: $\Pi(\cdot) = \delta_1(\cdot)$, i.e. $\Pi(S) = 1_{1 \in S}$ for S ⊆ X.

− Irreducible, since if Π(S) > 0 then 1 ∈ S so P(x, S) ≥ P(x, {1}) > 0 for all x ∈ X.

− Aperiodic since P(1, {1}) > 0.

− So, by Theorem, for π-a.e. X0, have $\lim_{n\to\infty} \mathbf{P}(X_n \in S) = \Pi(S)$, i.e. $\lim_{n\to\infty} \mathbf{P}(X_n = 1) = 1$.

− But if X0 = x ≥ 2, then $\mathbf{P}[X_n = x + n \text{ for all } n] = \prod_{j=x}^{\infty} (1 - (1/j^2)) > 0$ (since $\sum_{j=x}^{\infty} (1/j^2) < \infty$), so $\lim_{n\to\infty} \mathbf{P}(X_n = 1) \ne 1$.

− Convergence holds if X0 = 1, which is π-a.e. since Π(1) = 1, but not

from X0 = x ≥ 2.

• So, convergence subtle. But usually holds from any x ∈ X . (“Harris

recurrent”, see e.g. http://probability.ca/jeff/ftpdir/harris.pdf)
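
The escape probability in the aside can be made concrete. The partial product telescopes, $\prod_{j=x}^{N} (1 - 1/j^2) = \frac{(x-1)(N+1)}{xN} \to \frac{x-1}{x}$, so from X0 = 2 the chain never reaches state 1 with probability 1/2. A short Python check (function name and truncation length are assumptions):

```python
def prob_never_hits_one(x, n_terms=100000):
    """Partial product prod_{j=x}^{x+n_terms-1} (1 - 1/j^2): chance of moving x -> x+1 -> ... ."""
    p = 1.0
    for j in range(x, x + n_terms):
        p *= 1.0 - 1.0 / (j * j)
    return p


if __name__ == "__main__":
    for x in (2, 3, 10):
        print(x, prob_never_hits_one(x), (x - 1) / x)  # numeric product vs. the limit (x-1)/x
```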

—————————– END WEEK #6 —————————–

MONTE CARLO IN FINANCE [brief]:

• Xt = stock price at time t

• Assume that X0 = a > 0, and dXt = bXtdt + σXtdBt, where {Bt} is

Brownian motion.


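− i.e., for small h > 0,

$$(X_{t+h} - X_t \mid X_t) \approx b X_t (t + h - t) + \sigma X_t (B_{t+h} - B_t) \sim b X_t (t + h - t) + \sigma X_t\, N(0, h) ,$$

so

$$(X_{t+h} \mid X_t) \sim N\big(X_t + b X_t h,\ \sigma^2 (X_t)^2 h\big). \qquad (∗)$$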

• A “European call option” is the option to purchase one share of the stock

at a fixed time T > 0 for a fixed price q > 0.

• Question: what is a fair price for this option?

− At time T, its value is $\max(0, X_T - q)$.

− So, at time 0, its value is $e^{-rT} \max(0, X_T - q)$, where r is the “risk-free interest rate”.

− But at time 0, $X_T$ is unknown! So, what is fair price??

• FACT: the fair price is equal to $E\big(e^{-rT} \max(0, X_T - q)\big)$, but only after replacing b by r.

− (Proof: transform to risk-neutral martingale measure . . . )

− Intuition: if b very large, might as well just buy stock itself.

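• If σ and r constant, then there’s a formula (“Black-Scholes eqn”) for this price, in terms of Φ = cdf of N(0, 1):

$$a\, \Phi\Big(\frac{1}{\sigma\sqrt{T}} \Big(\log(a/q) + T\big(r + \tfrac12 \sigma^2\big)\Big)\Big) - q\, e^{-rT}\, \Phi\Big(\frac{1}{\sigma\sqrt{T}} \Big(\log(a/q) + T\big(r - \tfrac12 \sigma^2\big)\Big)\Big)$$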

• But we can also estimate it through (iid) Monte Carlo!

− Use (∗) above (for fixed small h > 0, e.g. h = 0.05) to generate samples from the diffusion.

− Any one run is highly variable. (file “RBS”, with M = 1)

− But many runs give good estimate. (file “RBS”, with M = 1000)

− Note that it’s iid replications, so varfact ≡ 1.

• An “Asian call option” is similar, but with $X_T$ replaced by $X_{k,t} \equiv \frac1k \sum_{i=1}^{k} X_{iT/k}$, for some fixed positive integer k (e.g., k = 8).

− Above “FACT” still holds (again with $X_T$ replaced by $X_{k,t}$).

− Now there is no simple formula . . . but can still simulate! (file “RAO”)
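
A Python sketch comparing the Black-Scholes formula with the i.i.d. Monte Carlo estimate for the European call, stepping the price forward with (∗) and b replaced by r. This is not the course's “RBS” file; the parameter values in the usage lines (a = q = 100, r = 0.05, σ = 0.2, T = 1) and the function names are assumptions.

```python
import math
import random


def norm_cdf(z):
    """Phi(z), the N(0,1) cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def black_scholes_call(a, q, r, sigma, T):
    """Closed-form price: a Phi(d1) - q exp(-rT) Phi(d2)."""
    d1 = (math.log(a / q) + T * (r + 0.5 * sigma ** 2)) / (sigma * math.sqrt(T))
    d2 = (math.log(a / q) + T * (r - 0.5 * sigma ** 2)) / (sigma * math.sqrt(T))
    return a * norm_cdf(d1) - q * math.exp(-r * T) * norm_cdf(d2)


def mc_call(a, q, r, sigma, T, h=0.05, M=20000, seed=1):
    """i.i.d. Monte Carlo price and its standard error, using the discretisation (*)."""
    rng = random.Random(seed)
    steps = int(round(T / h))
    payoffs = []
    for _ in range(M):
        x = a
        for _ in range(steps):
            # (X_{t+h} | X_t) ~ N(X_t + r X_t h, sigma^2 X_t^2 h), i.e. (*) with b = r
            x = rng.gauss(x + r * x * h, sigma * x * math.sqrt(h))
        payoffs.append(math.exp(-r * T) * max(0.0, x - q))
    mean = sum(payoffs) / M
    var = sum((p - mean) ** 2 for p in payoffs) / (M - 1)
    return mean, math.sqrt(var / M)   # replications are i.i.d., so varfact = 1


if __name__ == "__main__":
    print(black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0))
    print(mc_call(100.0, 100.0, 0.05, 0.2, 1.0))
```

The Monte Carlo value carries both sampling error and a small discretisation bias from the fixed step h, so it should only be expected to agree with the formula approximately.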

MONTE CARLO OPTIMISATION – Simulated Annealing:

• General method to find highest mode of π.

• Idea: mode of π is same as mode of a flatter or a more peaked version $\pi_\tau$, for any τ > 0.

− e.g. $\pi_\tau \equiv \pi^{1/\tau}$. Flatter if τ > 1, more peaked if τ < 1. (“tempered”)

− For large τ , MCMC explores a lot; good at beginning of search.

− For small τ , MCMC narrows in on local mode; good at end of search.

• So, use tempered MCMC, but where $\tau = \tau_n \searrow 0$, so $\pi_{\tau_n}$ becomes more and more concentrated at mode as n → ∞.

• Need to choose $\{\tau_n\}$, the “cooling schedule”.

− e.g. geometric ($\tau_n = \tau_0\, r^n$ for some r < 1).

− or linear ($\tau_n = \tau_0 - dn$ for some d > 0, chosen so $\tau_M = \tau_0 - dM \ge 0$).

− or logarithmic ($\tau_n = \tau_0 / \log(1 + n)$).

− or . . .

− Theorem: if c ≥ sup π, then simulated annealing with $\tau_n = c / \log(1 + n)$ will converge to the global maximum as n → ∞. (But very slow.)

• EXAMPLE: $\Pi_\tau = 0.3\, N(0, \tau^2) + 0.7\, N(20, \tau^2)$. (file “Rsimann”)

− Highest mode is at 20 (for any τ).

− If run usual Metropolis algorithm, it will either jump forever between

modes (if τ large), or get stuck in one mode or the other with equal

probability (if τ small) – bad.

− But if τn ↘ 0 slowly, then can usually find the highest mode (20) –

good.

− Try both geometric and linear (better?) cooling . . . (file “Rsimann”)
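
A rough Python sketch of simulated annealing on this example with geometric cooling. It is not the course's “Rsimann” file: the schedule (τ from 10 down to 0.1), the proposal scale, the starting point and the function names are all assumptions. Which mode a given run ends in depends on the seed and on how slowly τ_n decreases; the sketch only shows the mechanics.

```python
import math
import random


def log_pi_tau(x, tau):
    """log of 0.3 N(0, tau^2) + 0.7 N(20, tau^2), up to a constant that depends only on tau.

    Computed with log-sum-exp so that far-away points do not underflow to log(0).
    """
    t0 = math.log(0.3) - 0.5 * (x / tau) ** 2
    t1 = math.log(0.7) - 0.5 * ((x - 20.0) / tau) ** 2
    hi = max(t0, t1)
    return hi + math.log(math.exp(t0 - hi) + math.exp(t1 - hi))


def simulated_annealing(n_iter=20000, tau0=10.0, tau_end=0.1, x0=10.0,
                        prop_sd=1.0, seed=1):
    """Metropolis steps for pi_{tau_n}, with tau_n = tau0 * r^n (geometric cooling)."""
    rng = random.Random(seed)
    r = (tau_end / tau0) ** (1.0 / n_iter)
    x, tau = x0, tau0
    for _ in range(n_iter):
        y = rng.gauss(x, prop_sd)
        # both densities are evaluated at the same tau, so the tau-only constant cancels
        log_a = log_pi_tau(y, tau) - log_pi_tau(x, tau)
        if log_a >= 0.0 or rng.random() < math.exp(log_a):
            x = y
        tau *= r
    return x


if __name__ == "__main__":
    print([round(simulated_annealing(seed=s), 2) for s in range(1, 6)])
```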

OPTIMAL RWM PROPOSALS:

• Consider RWM on $\mathcal{X} = \mathbf{R}^d$, where $Y_n \sim \mathrm{MVN}(X_{n-1}, \Sigma)$ for some d × d proposal covariance matrix Σ.

• What is best choice of Σ?

− Usually we take Σ = σ2 Id for some σ > 0, and then choose σ so

acceptance rate not too small, not too large (e.g. 0.234).

− But can we do better?

• Suppose for now that Π = MVN(µ0, Σ0) for some fixed µ0 and Σ0, in

dim=5. Try RWM with various proposal distributions (file “Ropt”):

− first version: $Y_n \sim \mathrm{MVN}(X_{n-1},\ I_d)$. (acc ≈ 0.06; varfact ≈ 220)

− second version: $Y_n \sim \mathrm{MVN}(X_{n-1},\ 0.1\, I_d)$. (acc ≈ 0.234; varfact ≈ 300)

− third version: $Y_n \sim \mathrm{MVN}(X_{n-1},\ \Sigma_0)$. (acc ≈ 0.31; varfact ≈ 15)

− fourth version: $Y_n \sim \mathrm{MVN}(X_{n-1},\ 1.4\, \Sigma_0)$. (acc ≈ 0.234; varfact ≈ 7)

• Or in dim=20 (file “Ropt2”, with file “Rtarg20”):

− $Y_n \sim \mathrm{MVN}(X_{n-1},\ 0.025\, I_d)$. (acc ≈ 0.234; varfact ≈ 400 or more)

− $Y_n \sim \mathrm{MVN}(X_{n-1},\ 0.283\, \Sigma_0)$. (acc ≈ 0.234; varfact ≈ 50)

• Conclusion: acceptance rates near 0.234 are better.

• But also, proposals shaped like the target are better.

− Indeed, best is when proposal covariance $= ((2.38)^2 / d)\, \Sigma_0$.

− This has been proved for targets which are orthogonal transformations

of independent components (Roberts et al., Ann Appl Prob 1997;

Roberts and Rosenthal, Stat Sci 2001 ; Bedard, Ann Appl Prob 2007).

− And it’s “approximately” true for most unimodal targets . . .

• Problem: Σ0 would usually be unknown; then what?

− Can perhaps “adapt”!

ADAPTIVE MCMC:

• What if target covariance Σ0 is unknown??

• Can estimate target covariance based on run so far, to get empirical

covariance Σn.

• Then update proposal covariance “on the fly”.

• “Learn as you go”: see e.g. the Java applet “adapt.html”

—————————– END WEEK #7 —————————–

• For Adaptive MCMC, could use proposal $Y_n \sim \mathrm{MVN}(X_{n-1},\ ((2.38)^2 / d)\, \Sigma_n)$.

− Hope that for large n, Σn ≈ Σ0, so proposals “nearly” optimal.

− (Usually also add εId to proposal covariance, to improve stability, e.g.

ε = 0.05.)

• Try R version, for the same MVN example as in Ropt (file “Radapt”):

− Need much longer burn-in, e.g. B = 20, 000, for adaption to work.

− Get varfact of last 4000 iterations of about 18 . . . “competitive” with

Ropt optimal . . .

− The longer the run, the more benefit from adaptation.

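− Can also compute “slow-down factor”, $s_n \equiv d \Big( \sum_{i=1}^{d} \lambda_{in}^{-2} \,\big/\, \big(\sum_{i=1}^{d} \lambda_{in}^{-1}\big)^2 \Big)$, where $\{\lambda_{in}\}$ eigenvals of $\Sigma_n^{1/2} \Sigma_0^{-1/2}$. Starts large, should converge to 1. [Motivation: if $\Sigma_n = \Sigma_0$, then $\lambda_{in} \equiv 1$, so $s_n = d\,(d/d^2) \equiv 1$.] See Roberts and Rosenthal, Examples of Adaptive MCMC, JCGS 2009.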

• Higher dimensions: figure “RplotAMx200.png” (dim=200). (beautiful!)


− Works well, but it takes many iterations before the adaption is helpful.

• BUT IS “ADAPTIVE MCMC” A VALID ALGORITHM??

• Not in general: see e.g. “adapt.html”

• Algorithm now non-Markovian, doesn’t preserve stationarity at each step.

• However, still guaranteed to converge to Π under various technical con-

ditions.

• For example, it suffices (see Roberts & Rosenthal, “Coupling and Con-

vergence of Adaptive MCMC” (J. Appl. Prob. 2007)) that the adaption

satisfies:

− (a) Diminishing Adaptation: Adapt less and less as the algorithm proceeds. Formally, $\sup_{x \in \mathcal{X}} \|P_{\Gamma_{n+1}}(x, \cdot) - P_{\Gamma_n}(x, \cdot)\| \to 0$ in prob. [Can always be made to hold, since adaption is user controlled.]

− (b) Containment: For all ε > 0, the time to converge to within ε of stationary from $x = X_n$, if fix $\gamma = \Gamma_n$, remains bounded in probability as n → ∞. [Technical condition, to avoid “escape to infinity”. Holds if e.g. the state space and adaption spaces are both compact, etc. And always seems to hold in practice.]

− (This also guarantees WLLN for bounded functionals. Various other

results about LLN / CLT under stronger assumptions.)

− There are various “checkable” sufficient conditions which guarantee Containment, e.g. Y. Bai, G.O. Roberts, and J.S. Rosenthal, Adv. Appl. Stat. 2011 and Craiu, Gray, Latuszynski, Madras, Roberts, and Rosenthal, Ann. Appl. Prob. 2015 and J.S. Rosenthal and J. Yang, Ergodicity of Combocontinuous Adaptive MCMC Algorithms, MCAP, to appear.

• So, some “reasonable” theory, but you have to be careful!

TEMPERED MCMC:

• Suppose Π(·) is multi-modal, i.e. has distinct “parts” (e.g., $\Pi = \frac12 N(0, 1) + \frac12 N(20, 1)$).

• Usual RWM with $Y_n \sim N(X_{n-1}, 1)$ (say) can explore well within each mode, but how to get from one mode to the other?

• Idea: if Π(·) were flatter, e.g. $\frac12 N(0, 10^2) + \frac12 N(20, 10^2)$, then much easier to get between modes.

• So: define a sequence $\Pi_1, \Pi_2, \ldots, \Pi_m$ where $\Pi_1 = \Pi$ (“cold”), and $\Pi_\tau$ is flatter for larger τ (“hot”). (e.g. $\Pi_\tau = \frac12 N(0, \tau^2) + \frac12 N(20, \tau^2)$; file “Rtempered”)

• In the end, only “count” those samples where τ = 1.




• Proceed by defining a joint Markov chain (x, τ) on $\mathcal{X} \times \{1, 2, \ldots, m\}$, with stationary distribution Π defined by $\Pi(S \times \{\tau\}) = \frac1m \Pi_\tau(S)$.

− (Can also use other weights besides $\frac1m$.)

• The Markov chain should have both spatial moves (change x) and tem-

perature moves (change τ).

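− e.g. perhaps chain alternates between:

(a) propose $x' \sim N(x, 1)$, accept with prob $\min\Big(1, \dfrac{\pi(x', \tau)}{\pi(x, \tau)}\Big) = \min\Big(1, \dfrac{\pi_\tau(x')}{\pi_\tau(x)}\Big)$.

(b) propose $\tau' = \tau \pm 1$ (prob $\frac12$ each), accept with prob $\min\Big(1, \dfrac{\pi(x, \tau')}{\pi(x, \tau)}\Big) = \min\Big(1, \dfrac{\pi_{\tau'}(x)}{\pi_\tau(x)}\Big)$.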

• Chain should converge to Π.

• Then, as above, only “count” those samples where τ = 1. (red)

• EXAMPLE: $\Pi = \tfrac{1}{2} N(0, 1) + \tfrac{1}{2} N(20, 1)$

− Assume proposals are Yn ∼ N(Xn−1, 1).

− Mixing for Π: terrible! (file “Rtempered” with dotempering=FALSE

and temp=1; note the small claimed standard error!)

− Define $\Pi_\tau = \tfrac{1}{2} N(0, \tau^2) + \tfrac{1}{2} N(20, \tau^2)$, for τ = 1, 2, . . . , 10.

− Mixing better for larger τ ! (file “Rtempered” with dotempering=FALSE

and temp=1,2,3,4,...,10)

− (Compare graphs of $\pi_1$ and $\pi_8$: plot commands at bottom of “Rtempered” . . . )

− So, use above “(a)–(b)” algorithm; converges fairly well to Π. (file

“Rtempered”, with dotempering=TRUE)

− So, conditional on τ = 1, converges to Π. (“points” command at end

of file “Rtempered”)

− So, average of those h(x) with τ = 1 gives good estimate of Eπ(h).
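The “(a)–(b)” algorithm above can be sketched in a few lines. This is a minimal illustration in Python (standard library only), not a transcription of the file “Rtempered”. The function names, the seed, and the convention that proposals with τ′ outside {1, …, m} are simply rejected are all assumptions made for this sketch.

```python
import math
import random

def log_pi_tau(x, tau):
    """Log density of 0.5*N(0, tau^2) + 0.5*N(20, tau^2), computed stably."""
    la = -0.5 * (x / tau) ** 2
    lb = -0.5 * ((x - 20.0) / tau) ** 2
    top = max(la, lb)
    log_sum = top + math.log(math.exp(la - top) + math.exp(lb - top))
    return log_sum - math.log(2.0 * tau * math.sqrt(2.0 * math.pi))

def simulated_tempering(n_iter, m=10, seed=0):
    """Alternate spatial and temperature moves; return the x values seen at tau = 1."""
    rng = random.Random(seed)
    x, tau = 0.0, 1
    cold = []
    for _ in range(n_iter):
        # (a) spatial move: propose x' ~ N(x, 1) at the current temperature
        y = x + rng.gauss(0.0, 1.0)
        if math.log(1.0 - rng.random()) < log_pi_tau(y, tau) - log_pi_tau(x, tau):
            x = y
        # (b) temperature move: propose tau' = tau +/- 1 with prob 1/2 each
        tau_new = tau + rng.choice((-1, 1))
        if 1 <= tau_new <= m:  # assumption: out-of-range proposals are rejected
            if math.log(1.0 - rng.random()) < log_pi_tau(x, tau_new) - log_pi_tau(x, tau):
                tau = tau_new
        # only "count" the samples where tau = 1
        if tau == 1:
            cold.append(x)
    return cold
```

For example, `cold = simulated_tempering(100000)` followed by `sum(cold) / len(cold)` gives an estimate of the mean of Π. Because the densities Πτ here are fully normalised, the temperature ratio in step (b) can be computed directly.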

• HOW TO FIND THE TEMPERED DENSITIES πτ?

• Usually won’t “know” about e.g. $\Pi_\tau = \tfrac{1}{2} N(0, \tau^2) + \tfrac{1}{2} N(20, \tau^2)$.

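• Instead, can e.g. let $\pi_\tau(x) = c_\tau (\pi(x))^{1/\tau}$. (Sometimes write β = 1/τ.)

− Then $\Pi_1 = \Pi$, and $\pi_\tau$ flatter for larger τ – good.

− (e.g. if π(x) is the density of $N(\mu, \sigma^2)$, then $c_\tau (\pi(x))^{1/\tau}$ is the density of $N(\mu, \tau\sigma^2)$.)

− Then temperature acceptance probability is:

\[
\min\left(1, \frac{\pi_{\tau'}(x)}{\pi_\tau(x)}\right) = \min\left(1, \frac{c_{\tau'}}{c_\tau} (\pi(x))^{(1/\tau') - (1/\tau)}\right).
\]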

− This depends on the cτ , which are usually unknown – bad.
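As a worked instance of the Normal remark above, here is one of the few cases where $c_\tau$ can be computed explicitly. It is a short check rather than anything needed by the algorithm. Take π(x) to be the $N(\mu, \sigma^2)$ density.

```latex
\[
(\pi(x))^{1/\tau}
  = (2\pi\sigma^2)^{-1/(2\tau)} \exp\left(-\frac{(x-\mu)^2}{2\tau\sigma^2}\right),
\qquad
\int_{-\infty}^{\infty} (\pi(x))^{1/\tau}\,dx
  = (2\pi\sigma^2)^{-1/(2\tau)} \sqrt{2\pi\tau\sigma^2},
\]
\[
\text{so}\quad
c_\tau = \frac{(2\pi\sigma^2)^{1/(2\tau)}}{\sqrt{2\pi\tau\sigma^2}},
\qquad
c_\tau (\pi(x))^{1/\tau}
  = \frac{1}{\sqrt{2\pi\tau\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\tau\sigma^2}\right),
\]
```

which is the $N(\mu, \tau\sigma^2)$ density. For a general π the integral has no closed form, which is why the ratio $c_{\tau'}/c_\tau$ is usually unavailable.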

• What to do?

PARALLEL TEMPERING:

• (a.k.a. Metropolis-Coupled MCMC, or MCMCMC: Geyer, 1991)

• Alternative to tempered MCMC.

• Instead, use state space $\mathcal{X}^m$, with m chains, i.e. one chain for each temperature.

• So, state at time n is $X_n = (X_{n1}, X_{n2}, \dots, X_{nm})$, where $X_{n\tau}$ is “at” temperature τ.

• Stationary distribution is now $\Pi = \Pi_1 \times \Pi_2 \times \dots \times \Pi_m$, i.e. $\Pi(X_1 \in S_1, X_2 \in S_2, \dots, X_m \in S_m) = \Pi_1(S_1)\, \Pi_2(S_2) \cdots \Pi_m(S_m)$.

• Then, can update the chain $X_{n-1,\tau}$ at temperature τ (for each 1 ≤ τ ≤ m), by proposing e.g. $Y_{n,\tau} \sim N(X_{n-1,\tau}, 1)$, and accepting with probability $\min\left(1, \frac{\pi_\tau(Y_{n,\tau})}{\pi_\tau(X_{n-1,\tau})}\right)$.

• And, can also choose temperatures τ and τ′ (e.g., at random), and propose to “swap” the values $X_{n,\tau}$ and $X_{n,\tau'}$, and accept this with probability

\[
\min\left(1, \frac{\pi_\tau(X_{n,\tau'})\, \pi_{\tau'}(X_{n,\tau})}{\pi_\tau(X_{n,\tau})\, \pi_{\tau'}(X_{n,\tau'})}\right).
\]

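− Now, normalising constants cancel, e.g. if $\pi_\tau(x) = c_\tau (\pi(x))^{1/\tau}$, then acceptance probability is:

\[
\min\left(1, \frac{c_\tau \pi(X_{n,\tau'})^{1/\tau}\, c_{\tau'} \pi(X_{n,\tau})^{1/\tau'}}{c_\tau \pi(X_{n,\tau})^{1/\tau}\, c_{\tau'} \pi(X_{n,\tau'})^{1/\tau'}}\right)
= \min\left(1, \frac{\pi(X_{n,\tau'})^{1/\tau}\, \pi(X_{n,\tau})^{1/\tau'}}{\pi(X_{n,\tau})^{1/\tau}\, \pi(X_{n,\tau'})^{1/\tau'}}\right),
\]

so $c_\tau$ and $c_{\tau'}$ are not required.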

• EXAMPLE: suppose again that $\Pi_\tau = \tfrac{1}{2} N(0, \tau^2) + \tfrac{1}{2} N(20, \tau^2)$, for τ = 1, 2, . . . , 10.

− Can run parallel tempering . . . works pretty well. (file “Rpara”)
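The swap step, and the way the constants cancel, can be sketched as follows. This is an illustrative Python version (standard library only), not the file “Rpara”. As an assumption for the sketch, it uses the power form $\pi_\tau \propto \pi^{1/\tau}$ with an unnormalised π, so that no $c_\tau$ ever appears. The temperature ladder `temps`, the function names and the seed are likewise hypothetical choices.

```python
import math
import random

def log_pi(x):
    """Unnormalised log density of 0.5*N(0,1) + 0.5*N(20,1), computed stably."""
    la = -0.5 * x * x
    lb = -0.5 * (x - 20.0) ** 2
    top = max(la, lb)
    return top + math.log(math.exp(la - top) + math.exp(lb - top))

def parallel_tempering(n_iter, temps, seed=0):
    """Run one chain per temperature; return the path of the coldest chain (temps[0])."""
    rng = random.Random(seed)
    m = len(temps)
    xs = [0.0] * m
    cold = []
    for _ in range(n_iter):
        # within-temperature RWM update for each chain, targeting pi^(1/tau)
        for i, tau in enumerate(temps):
            y = xs[i] + rng.gauss(0.0, 1.0)
            if math.log(1.0 - rng.random()) < (log_pi(y) - log_pi(xs[i])) / tau:
                xs[i] = y
        # propose to swap the values held at two randomly chosen temperatures
        i, j = rng.sample(range(m), 2)
        log_ratio = (log_pi(xs[j]) - log_pi(xs[i])) * (1.0 / temps[i] - 1.0 / temps[j])
        if math.log(1.0 - rng.random()) < log_ratio:
            xs[i], xs[j] = xs[j], xs[i]
        cold.append(xs[0])
    return cold
```

A possible call is `parallel_tempering(30000, [t * t for t in range(1, 11)])`. The squared ladder is a choice made here because $\pi^{1/T}$ has per-mode variance T, so T = τ² roughly matches the spread of the Πτ in the example. Unlike in tempered MCMC, every iteration yields a usable sample from the cold chain.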

TRANSDIMENSIONAL MCMC:

• (a.k.a. “reversible-jump MCMC”: Green, Biometrika 1995)

• What if the state space is a union of parts of different dimension?

− Can we still apply Metropolis-Hastings then??

• (EXAMPLE: autoregressive process: suppose $Y_n = a_1 Y_{n-1} + a_2 Y_{n-2} + \dots + a_k Y_{n-k}$, but we don’t know what k should be.)

• OUR EXAMPLE: suppose $\{y_j\}_{j=1}^{J}$ are known data which are assumed to come from a mixture distribution: $\frac{1}{k}\left(N(a_1, 1) + N(a_2, 1) + \dots + N(a_k, 1)\right)$.

• Want to estimate the unknown k, a1, . . . , ak.

− Here the number of parameters is also unknown, i.e. the dimension is

unknown and variable, which makes MCMC more challenging!

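• The state space is $\mathcal{X} = \{(k, a) : k \in \mathbb{N},\ a \in \mathbb{R}^k\}$.

• Prior distributions: $k - 1 \sim \mathrm{Poisson}(2)$, and $a \mid k \sim \mathrm{MVN}(0, I_k)$ (say).

• Define a reference measure λ by: $\lambda(\{k\} \times A) = \lambda_k(A)$ for $k \in \mathbb{N}$ and (measurable) $A \subseteq \mathbb{R}^k$, where $\lambda_k$ is Lebesgue measure on $\mathbb{R}^k$.

− i.e., $\lambda = \delta_1 \times \lambda_1 + \delta_2 \times \lambda_2 + \delta_3 \times \lambda_3 + \dots$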

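• Then in our mixture example, posterior density (with respect to λ) is:

\[
\pi(k, a) = C\, \frac{e^{-2}\, 2^{k-1}}{(k-1)!}\, (2\pi)^{-k/2} \exp\left(-\frac{1}{2} \sum_{i=1}^{k} a_i^2\right) (2\pi)^{-J/2} \prod_{j=1}^{J} \left( \sum_{i=1}^{k} \frac{1}{k} \exp\left(-\frac{1}{2} (y_j - a_i)^2\right) \right).
\]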

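• So, on a log scale,

\[
\log \pi(k, a) = \log C + \log \frac{e^{-2}\, 2^{k-1}}{(k-1)!} - \frac{k}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^{k} a_i^2 - \frac{J}{2} \log(2\pi) + \sum_{j=1}^{J} \log\left( \sum_{i=1}^{k} \frac{1}{k} \exp\left(-\frac{1}{2} (y_j - a_i)^2\right) \right).
\]

(Can ignore $\log C$ and $\frac{J}{2} \log(2\pi)$, but not $\frac{k}{2} \log(2\pi)$.)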
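The log-scale formula translates almost term by term into code. Here is a minimal Python sketch (standard library only) that drops the two ignorable constants and keeps the $\frac{k}{2}\log(2\pi)$ term. The function name `log_posterior` and the list-based representation of a are assumptions for illustration.

```python
import math

def log_posterior(a, y):
    """log pi(k, a) up to an additive constant, with k = len(a).

    Drops log C and (J/2) log(2 pi), which do not depend on (k, a),
    but keeps (k/2) log(2 pi), which changes with k.
    """
    k = len(a)
    # log( e^{-2} 2^{k-1} / (k-1)! ), using lgamma(k) = log((k-1)!)
    log_prior_k = -2.0 + (k - 1) * math.log(2.0) - math.lgamma(k)
    # log of the MVN(0, I_k) prior density for a
    log_prior_a = -0.5 * k * math.log(2.0 * math.pi) - 0.5 * sum(ai * ai for ai in a)
    # log-likelihood of the equally weighted mixture, via log-sum-exp per data point
    log_lik = 0.0
    for yj in y:
        exps = [-0.5 * (yj - ai) ** 2 for ai in a]
        top = max(exps)
        log_lik += top + math.log(sum(math.exp(e - top) for e in exps)) - math.log(k)
    return log_prior_k + log_prior_a + log_lik
```

Working on the log scale with the log-sum-exp step avoids underflow when some $y_j$ is far from every $a_i$.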

• How to “explore” this posterior distribution??

• For fixed k, can move around Rk in usual way with RWM (say).

• But how to change k?

• Can propose to replace k with, say, k′ = k ± 1 (prob 1/2 each).

• Then have to correspondingly change a. One possibility:

− If k′ = k+1, then a′ = (a1, . . . , ak, Z) where Z ∼ N(0, 1) (“elongate”).

− If k′ = k − 1, then a′ = (a1, . . . , ak−1) (“truncate”).

• Then accept with usual probability,

\[
\min\left(1, \frac{\pi(k', a')\, q((k', a'), (k, a))}{\pi(k, a)\, q((k, a), (k', a'))}\right).
\]

− Here if k′ = k + 1, then $q((k', a'), (k, a)) = \frac{1}{2}$, while $q((k, a), (k', a')) = \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-(a'_{k'})^2/2}$.

− Or, if k′ = k − 1, then $q((k, a), (k', a')) = \frac{1}{2}$, while $q((k', a'), (k, a)) = \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-(a_k)^2/2}$.

• Seems to work okay; final k usually between 5 and 9 . . . (file “Rtrans”)
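The “elongate”/“truncate” move and its acceptance probability can be sketched as below. This is an illustrative Python version (standard library only), not the file “Rtrans”. The helper names are hypothetical, and rejecting any proposal with k′ = 0 (which lies outside the state space) is an assumption of this sketch.

```python
import math
import random

def log_posterior(a, y):
    """log pi(k, a) up to an additive constant (log C and (J/2) log(2 pi) dropped)."""
    k = len(a)
    out = -2.0 + (k - 1) * math.log(2.0) - math.lgamma(k)
    out += -0.5 * k * math.log(2.0 * math.pi) - 0.5 * sum(ai * ai for ai in a)
    for yj in y:
        exps = [-0.5 * (yj - ai) ** 2 for ai in a]
        top = max(exps)
        out += top + math.log(sum(math.exp(e - top) for e in exps)) - math.log(k)
    return out

def log_norm_pdf(z):
    """Log of the N(0,1) density at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def dimension_move(a, y, rng):
    """One k' = k +/- 1 proposal with the Metropolis-Hastings accept/reject step."""
    if rng.random() < 0.5:
        # "elongate": q(reverse) = 1/2, q(forward) = (1/2) * phi(z)
        z = rng.gauss(0.0, 1.0)
        a_new = a + [z]
        log_q_ratio = -log_norm_pdf(z)
    else:
        if len(a) == 1:
            return a  # assumption: k' = 0 is not in the state space, so reject
        # "truncate": q(forward) = 1/2, q(reverse) = (1/2) * phi(a_k)
        a_new = a[:-1]
        log_q_ratio = log_norm_pdf(a[-1])
    log_accept = log_posterior(a_new, y) - log_posterior(a, y) + log_q_ratio
    return a_new if math.log(1.0 - rng.random()) < log_accept else a

def rwm_move(a, y, rng, step=0.5):
    """Fixed-k random-walk Metropolis update of the whole vector a."""
    a_new = [ai + step * rng.gauss(0.0, 1.0) for ai in a]
    accept = math.log(1.0 - rng.random()) < log_posterior(a_new, y) - log_posterior(a, y)
    return a_new if accept else a
```

A run would alternate `dimension_move` and `rwm_move` and record `len(a)` at each step. With this particular prior and proposal, the N(0, 1) factor for the new coordinate in π(k′, a′) cancels against the proposal density, so the acceptance ratio for an elongation reduces to the Poisson prior ratio times the likelihood ratio.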

• (NOTE: We didn’t really have time to cover the remaining material in

detail. So, I still hope you read it, but you don’t need it for the test.)

• ALTERNATIVE method for the “correspondingly change a” step:

− If k′ = k + 1, then $a' = (a_1, \dots, a_{k-1}, a_k - Z, a_k + Z)$ where Z ∼ N(0, 1) (“split”).

− If k′ = k − 1, then $a' = (a_1, \dots, a_{k-2}, \frac{1}{2}(a_{k-1} + a_k))$ (“merge”).

− What about the densities q((k′, a′), (k, a))?

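− Well, if k′ = k + 1, then $q((k', a'), (k, a)) = \frac{1}{2}$, while roughly speaking,

\[
q((k, a), (k', a')) = \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} = \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-\left(\frac{1}{2}(a'_{k'} - a'_k)\right)^2/2}.
\]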

− One subtle additional point: The map $(a, Z) \mapsto a' = (a_1, \dots, a_{k-1}, a_k - Z, a_k + Z)$ has “Jacobian” term:

\[
\det\left(\frac{\partial a'}{\partial (a, Z)}\right) = \det\begin{pmatrix} I_{k-1} & 0 & 0 \\ 0 & 1 & -1 \\ 0 & 1 & 1 \end{pmatrix} = 1 - (-1) = 2,
\]

i.e. the split moves “spread out” the mass by a factor of 2.

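− So by Change-of-Variable Thm, actually

\[
q((k, a), (k', a')) = \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-\left(\frac{1}{2}(a'_{k'} - a'_k)\right)^2/2} \Big/ 2.
\]

− Similarly, if k′ = k − 1, then $q((k, a), (k', a')) = \frac{1}{2}$, while

\[
q((k', a'), (k, a)) = \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-\left(\frac{1}{2}(a_k - a_{k'})\right)^2/2} \Big/ 2.
\]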

− Algorithm still seems to work okay . . . (file “Rtrans2”)
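The split/merge maps and the Jacobian-corrected proposal densities can be checked with a small sketch. This is illustrative Python (standard library only), not the file “Rtrans2”. All function names are hypothetical, and only the last 2×2 block of the Jacobian matrix is computed, since the $I_{k-1}$ block contributes a factor of 1.

```python
import math

def split(a, z):
    """(a_1..a_k), z -> (a_1..a_{k-1}, a_k - z, a_k + z)."""
    return a[:-1] + [a[-1] - z, a[-1] + z]

def merge(a):
    """(a_1..a_k) -> (a_1..a_{k-2}, (a_{k-1} + a_k)/2)."""
    return a[:-2] + [0.5 * (a[-2] + a[-1])]

def recovered_z(a_long):
    """The z that a split would have used to produce the last two coordinates."""
    return 0.5 * (a_long[-1] - a_long[-2])

def split_jacobian():
    """Determinant of d(a_k - Z, a_k + Z)/d(a_k, Z) = [[1, -1], [1, 1]]."""
    m = [[1.0, -1.0], [1.0, 1.0]]
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def log_q_split(a_long):
    """log of (1/2) * phi(z) / 2, the density of proposing a_long by a split."""
    z = recovered_z(a_long)
    log_phi = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    return math.log(0.5) + log_phi - math.log(split_jacobian())

def log_q_merge():
    """log of 1/2: a merge is deterministic once k' = k - 1 is chosen."""
    return math.log(0.5)
```

For instance, `split([1.0, 3.0], 0.5)` gives `[1.0, 2.5, 3.5]`, and `merge` maps that back to `[1.0, 3.0]`, so the two moves are inverses of each other as reversibility requires.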

• For more complicated transformations, need to include more complicated

“Jacobian” term (but above it equals 1 or 2).

• Check: if we start the algorithms with, say, k = 24, then they don’t

manage to reduce k enough!

− They might be trying to remove the “wrong” ai.

• So, try another MODIFICATION, this time where any coordinate can be

added/removed, not just the last one.

− While we’re at it, change “new ai distribution” from Z ∼ N(0, 1) to

Z ∼ Uniform(−20, 30), with corresponding change to the q((k, a), (k′, a′))

formulae.

− file “Rtrans3” – now works well even if started with k = 24.

− Seems to settle on k = 6 regardless of starting value.

− This seems to indicate rapid mixing – good!

• FINAL SUMMARY: Monte Carlo can be used for nearly everything!
