
Aula 8. Optimization Methods III. Exercises. 0

Optimization Methods III.

The MCMC. Exercises.

Anatoli Iambartsev

IME-USP


[RC] A generic Markov chain Monte Carlo algorithm.

The Metropolis–Hastings algorithm associated with the objective (target) density f and the conditional density q produces a Markov chain (X(t)) through the following transition kernel:

Algorithm (Metropolis–Hastings). Given x(t),

1. Generate Yt ∼ q(y | x(t)).

2. Take

X(t+1) = Yt with probability ρ(x(t), Yt), and X(t+1) = x(t) with probability 1 − ρ(x(t), Yt),

where

ρ(x, y) = min{ (f(y) q(x | y)) / (f(x) q(y | x)), 1 }.


[RC] A generic Markov chain Monte Carlo algorithm.

A generic R implementation is straightforward, assuming a generator for q(y|x) is available as geneq(x). If x[t] denotes the value of X(t),

Again, we stress the remarkable feature of the Metropolis–Hastings algorithm: for every given q, we can construct a Metropolis–Hastings kernel such that f is its stationary distribution.


> y=geneq(x[t])

> if (runif(1)<f(y)*q(y,x[t])/(f(x[t])*q(x[t],y))){

+ x[t+1]=y

+ }else{

+ x[t+1]=x[t]

+ }

since the value y is always accepted when the ratio is larger than one. The distribution q is called the instrumental (or proposal, or candidate) distribution, and the probability ρ(x, y) the Metropolis–Hastings acceptance probability. It is to be distinguished from the acceptance rate, which is the average of the acceptance probability over iterations,

ρ̄ = lim_{T→∞} (1/T) Σ_{t=0}^{T} ρ(X(t), Yt) = ∫∫ ρ(x, y) f(x) q(y|x) dy dx.

This quantity allows an evaluation of the performance of the algorithm, asdiscussed in Section 6.5.

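The fragment above presupposes that f and geneq already exist. The following self-contained sketch fills them in with the Beta(2.7, 6.3) target and uniform proposal of Example 6.1 below (the seed, chain length, and function names qdens/geneq are illustrative choices), and also records the empirical acceptance rate:

```r
# Generic Metropolis-Hastings; target and proposal are illustrative choices.
set.seed(42)
f     <- function(x) dbeta(x, 2.7, 6.3)  # target density
geneq <- function(x) runif(1)            # generator for q(.|x); independent of x here
qdens <- function(y, x) dunif(y)         # proposal density q(y|x)

Nsim <- 5000
x <- numeric(Nsim)
x[1] <- runif(1)
rho <- numeric(Nsim - 1)
for (t in 1:(Nsim - 1)) {
  y <- geneq(x[t])
  rho[t] <- min(f(y) * qdens(x[t], y) / (f(x[t]) * qdens(y, x[t])), 1)
  x[t + 1] <- if (runif(1) < rho[t]) y else x[t]
}
mean(x)    # close to E[Beta(2.7, 6.3)] = 2.7/(2.7 + 6.3) = 0.3
mean(rho)  # empirical acceptance rate
```

Averaging rho[t] along the run is exactly the Monte Carlo estimate of the acceptance rate ρ̄ defined above.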


[RC] Example 6.1.

We can use a Metropolis–Hastings algorithm to simulate a beta distribution, where the target density f is the Beta(2.7, 6.3) density and the candidate q is uniform over [0, 1], which means that it does not depend on the previous value of the chain. A Metropolis–Hastings sample is then generated with the following R code:


While, at first glance, Algorithm 4 does not seem to differ from Algorithm 2, except for the notation, there are two fundamental differences between the two algorithms. The first difference is in their use, since Algorithm 2 aims at maximizing a function h(x), while the goal of Algorithm 4 is to explore the support of the density f according to its probability. The second difference is in their convergence properties. With the proper choice of a temperature schedule Tt in Algorithm 2, the simulated annealing algorithm converges to the maxima of the function h, while the Metropolis–Hastings algorithm converges to the distribution f itself. Finally, modifying the proposal q along iterations may have drastic consequences on the convergence pattern of this algorithm, as discussed in Section 8.5.

Algorithm 4 satisfies the so-called detailed balance condition,

(6.3) f(x)K(y|x) = f(y)K(x|y) ,

from which we can deduce that f is the stationary distribution of the chain {X(t)} by integrating each side of the equality in x (see Exercise 6.8).
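On a finite state space the kernel K can be written out explicitly and the detailed balance condition checked directly. A small sketch (the three-state target and the uniform proposal are arbitrary illustrative choices, not from the text):

```r
# Metropolis-Hastings kernel on states 1..3 and a direct detailed-balance check
f <- c(0.2, 0.5, 0.3)            # target probabilities
m <- length(f)
qprop <- matrix(1 / m, m, m)     # uniform (symmetric) proposal
K <- matrix(0, m, m)
for (x in 1:m) for (y in 1:m) if (y != x) {
  K[x, y] <- qprop[x, y] * min(1, f[y] * qprop[y, x] / (f[x] * qprop[x, y]))
}
diag(K) <- 1 - rowSums(K)        # rejected mass stays at x
# f(x)K(y|x) - f(y)K(x|y) should vanish for every pair (x, y)
max(abs(f * K - t(f * t(K))))    # ~0
# and f is stationary: fK = f
max(abs(as.vector(f %*% K) - f)) # ~0
```

The second check is the integration-in-x argument of Exercise 6.8 carried out as a finite sum.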

That Algorithm 4 is naturally associated with f as its stationary distribution thus comes quite easily as a consequence of the detailed balance condition for an arbitrary choice of the pair (f, q). In practice, the performance of the algorithm will obviously strongly depend on this choice of q, but consider first a straightforward example where Algorithm 4 can be compared with iid sampling.


> a=2.7; b=6.3; c=2.669 # initial values

> Nsim=5000

> X=rep(runif(1),Nsim) # initialize the chain

> for (i in 2:Nsim){

+ Y=runif(1)

+ rho=dbeta(Y,a,b)/dbeta(X[i-1],a,b)

+ X[i]=X[i-1] + (Y-X[i-1])*(runif(1)<rho)

+ }

A representation of the sequence (X(t)) by plot does not produce any pattern in the simulation since the chain explores the same range at different periods. If we zoom in on the final period, for 4500 ≤ t ≤ 4800, Figure 6.1 exhibits some characteristic features of Metropolis–Hastings sequences, namely that, for some intervals of time, the sequence (X(t)) does not change because all corresponding proposals Yt are rejected.


[RC] Example 6.1.

> ks.test(jitter(X),rbeta(5000,a,b))

Two-sample Kolmogorov-Smirnov test

data: jitter(X) and rbeta(5000, a, b)

D = 0.031, p-value = 0.01638

alternative hypothesis: two-sided

# jitter() adds a small amount of noise to a numeric vector. We need this function because of the presence of repeated values in the vector X: the KS test assumes samples from continuous random variables, which implies distinct values within both samples; that is, ks.test does not accept ties.
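A quick illustration of the problem and the fix (the vector below is an arbitrary example):

```r
set.seed(7)
x  <- c(0.2, 0.2, 0.2, 0.5, 0.9)  # repeated values, as in an MH sample
xj <- jitter(x)                   # adds small uniform noise
anyDuplicated(x) > 0              # TRUE: ties present
anyDuplicated(xj) > 0             # FALSE: ties broken, ks.test no longer warns
```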


[RC] The independent Metropolis-Hastings algorithm.

The Metropolis–Hastings algorithm allows a candidate distribution q that depends only on the present state of the chain. If we now require the candidate q to be independent of this present state of the chain (that is, q(y|x) = g(y)), we do get a special case of the original algorithm:

Algorithm (Independent Metropolis–Hastings). Given x(t),

1. Generate Yt ∼ g(y).

2. Take

X(t+1) = Yt with probability min{ (f(Yt) g(x(t))) / (f(x(t)) g(Yt)), 1 }, and X(t+1) = x(t) otherwise.


[RC] The independent Metropolis-Hastings algorithm.

This method then appears as a straightforward generalization of the Accept–Reject method in the sense that the instrumental distribution is the same density g as in the Accept–Reject method. Thus, the proposed values Yt are the same, if not the accepted ones.

Metropolis–Hastings algorithms and Accept–Reject methods (Section 2.3), both being generic simulation methods, have similarities that allow comparison, even though it is rather rare to consider using a Metropolis–Hastings solution when an Accept–Reject algorithm is available.


[RC] The independent Metropolis-Hastings algorithm.

a. The Accept–Reject sample is iid, while the Metropolis–Hastings sample is not. Although the Yt's are generated independently, the resulting sample is not iid, if only because the probability of acceptance of Yt depends on X(t) (except in the trivial case when f = g).

b. The Metropolis–Hastings sample will involve repeated occurrences of the same value, since rejection of Yt leads to repetition of X(t) at time t + 1. This has an impact on tests like ks.test that do not accept ties.

c. The Accept–Reject acceptance step requires the calculation of the upper bound M ≥ sup_x f(x)/g(x), which is not required by the Metropolis–Hastings algorithm. This is an appealing feature of Metropolis–Hastings if computing M is time-consuming or if the existing M is inaccurate and thus induces a waste of simulations.


[RC] The independent Metropolis–Hastings algorithm. Exercise 6.3.

Compute the acceptance probability ρ(x, y) in the case q(y|x) = g(y). Deduce that, for a given value x(t), the Metropolis–Hastings algorithm associated with the same pair (f, g) as an Accept–Reject algorithm accepts the proposed value Yt more often than the Accept–Reject algorithm.


Exercise 6.1

A simple R program to simulate this chain (the AR(1) chain x_i = r x_{i−1} + ε_i with r = 0.9, whose stationary distribution is N(0, 1/(1 − r²))) is

# (C.) Jiazi Tang, 2009

x=1:10^4

x[1]=rnorm(1)

r=0.9

for (i in 2:10^4){

x[i]=r*x[i-1]+rnorm(1) }

hist(x,freq=F,col="wheat2",main="")

curve(dnorm(x,sd=1/sqrt(1-r^2)),add=T,col="tomato")

Exercise 6.3

When q(y|x) = g(y), we have

ρ(x, y) = min{ (f(y) q(x|y)) / (f(x) q(y|x)), 1 } = min{ (f(y) g(x)) / (f(x) g(y)), 1 }.

Since the acceptance probability satisfies

(f(y) g(x)) / (f(x) g(y)) ≥ (f(y)/g(y)) / max_x (f(x)/g(x)),

it is larger for Metropolis–Hastings than for Accept–Reject.
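The inequality can also be verified numerically, for instance for the Beta(2.7, 6.3) target and uniform proposal of Example 6.1 (a sketch; here M = sup f/g is evaluated at the beta mode (a − 1)/(a + b − 2), and the test points are random):

```r
a <- 2.7; b <- 6.3
f <- function(x) dbeta(x, a, b)
g <- function(x) dunif(x)                 # uniform proposal on [0, 1]
M <- dbeta((a - 1) / (a + b - 2), a, b)   # sup_x f(x)/g(x), attained at the mode

set.seed(3)
x <- runif(100); y <- runif(100)
rho_mh <- pmin(f(y) * g(x) / (f(x) * g(y)), 1)  # Metropolis-Hastings
p_ar   <- f(y) / (M * g(y))                     # Accept-Reject
all(rho_mh >= p_ar)                             # TRUE: MH accepts more often
```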


[RC] The M-H algorithm. Symmetric conditional.

Here we simulate Yt according to Yt = X(t) + εt, where εt is a random perturbation with distribution g independent of X(t), for instance a uniform or a normal distribution, meaning that Yt ∼ U[X(t) − δ, X(t) + δ] or Yt ∼ N(X(t), τ²) in unidimensional settings. The proposal density q(y | x) is now of the form g(y − x). The Markov chain associated with q is a random walk when the density g is symmetric around zero, that is, satisfying g(−t) = g(t). But, due to the additional Metropolis–Hastings acceptance step, the Metropolis–Hastings Markov chain {X(t)} is not a random walk.


[RC] The M-H algorithm. Random walk.

Algorithm (Random Walk Metropolis–Hastings). Given x(t),

1. Generate Yt ∼ g(y − x(t)).

2. Take

X(t+1) = Yt with probability min{ f(Yt)/f(x(t)), 1 }, and X(t+1) = x(t) otherwise.


[RC] The M-H algorithm. Random walk.

Example 6.4. The historical example of Hastings (1970) considers the formal problem of generating the normal distribution N(0, 1) based on a random walk proposal equal to the uniform distribution on [−δ, δ]. The probability of acceptance is then

ρ(x(t), yt) = exp{ ((x(t))² − yt²) / 2 } ∧ 1.
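A sketch of this sampler in R (the choices δ = 1, the seed, and the chain length are arbitrary):

```r
set.seed(2)
delta <- 1
Nsim <- 10^4
x <- numeric(Nsim)                            # chain started at x = 0
for (t in 2:Nsim) {
  y <- x[t - 1] + runif(1, -delta, delta)     # random walk proposal
  rho <- min(exp((x[t - 1]^2 - y^2) / 2), 1)  # acceptance probability above
  x[t] <- if (runif(1) < rho) y else x[t - 1]
}
c(mean(x), var(x))  # should approach the N(0, 1) moments (0, 1)
```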


Fig. 6.7. Outcomes of random walk Metropolis–Hastings algorithms for Example 6.4. The left panel has a U(−.1, .1) candidate, the middle panel has U(−1, 1), and the right panel has U(−10, 10). The upper graphs represent the last 500 iterations of the chains, the middle graphs indicate how the histograms fit the target, and the lower graphs give the respective autocovariance functions.



[RC] The M-H algorithm. Random walk.

The figure describes three samples of 5000 points produced by this method for δ = 0.1, 1, and 10 and clearly shows the difference in the produced chains: too narrow or too wide a candidate (that is, a smaller or a larger value of δ) results in higher autocovariance and slower convergence. Note the distinct patterns for δ = 0.1 and δ = 10 in the upper graphs: in the former case, the Markov chain moves at each iteration but very slowly, while in the latter it remains constant over long periods of time.

As noted in this formal example, calibrating the scale δ of the random walk is crucial to achieving a good approximation to the target distribution in a reasonable number of iterations.
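The calibration can be monitored through the empirical acceptance rate, which is close to one when δ is too small and drops sharply when δ is too large. A sketch for the N(0, 1) target of Example 6.4 (the helper rwmh_rate and the seed are illustrative, not from the text):

```r
# Empirical acceptance rate of random walk MH on N(0,1), as a function of delta
rwmh_rate <- function(delta, Nsim = 10^4) {
  x <- 0; acc <- 0
  for (t in 1:Nsim) {
    y <- x + runif(1, -delta, delta)
    if (runif(1) < exp((x^2 - y^2) / 2)) { x <- y; acc <- acc + 1 }
  }
  acc / Nsim
}
set.seed(4)
rates <- sapply(c(0.1, 1, 10), rwmh_rate)
rates  # decreasing in delta: near 1 for delta = 0.1, low for delta = 10
```

Neither extreme is desirable: near-certain acceptance of tiny moves and long runs of rejections both inflate the autocovariance seen in Fig. 6.7.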


References.

[RC] Christian P. Robert and George Casella. Introducing Monte Carlo Methods with R. Use R! Series. Springer.

[RC1] Christian P. Robert and George Casella. Introducing Monte Carlo Methods with R: Solutions to Odd-Numbered Exercises.