
CHAPTER 10

RANDOM WALK AND MARKOV CHAIN

Random walks and Markov chains play a central role in modern metaheuristic algorithms and stochastic optimization. In essence, a metaheuristic search algorithm is a procedure that combines randomization and deterministic components, together with the use of memory or search history. Randomization is often achieved by some form of random walk in the search space. In this chapter, we will introduce the fundamentals of random walks and Markov chains and their relevance to optimization.

10.1 RANDOM PROCESS

Generally speaking, a random variable can be considered as an expression whose value is the realization or outcome of events associated with a random process, such as the noise level on a street. The values of a random variable are real, though some variables, such as the number of cars on a road, can only take discrete values, and such random variables are called discrete random variables. If a random variable such as the noise at a particular location can take any real value in an interval, it is called continuous. If a random variable can take both continuous and discrete values, it is called a mixed


type. Mathematically speaking, a random variable is a function that maps events to real numbers. The domain of this mapping is called the sample space.

There is a probability distribution associated with each random variable, and this distribution is often expressed in terms of a probability density function. For example, the number of phone calls per minute and the number of users of a web server per day both obey the Poisson distribution

p(n; λ) = λ^n e^{−λ} / n!,   (n = 0, 1, 2, ...),    (10.1)

where λ > 0 is a parameter which is the mean or expectation of the occurrence of the event during a unit interval. This distribution is often called the law of rare events because it is the small-probability limit of the binomial distribution

p(k; n, p) = C(n, k) p^k (1 − p)^{n−k},   C(n, k) = n! / [k!(n − k)!],    (10.2)

which is the probability distribution for the number k of successes in a sequence of n independent success-or-fail experiments. Here p is the probability of success in each event. Poisson's distribution is the limit of the binomial distribution when n → ∞ while λ = pn remains constant.
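As a quick numerical illustration (a sketch, not from the original text), the following Python fragment compares the binomial probabilities p(k; n, p) with the Poisson limit for a fixed λ = pn; the chosen values λ = 3 and k = 5 are arbitrary. As n grows, the binomial value approaches the Poisson value.

import math

lam = 3.0                    # fixed mean lambda = p*n (illustrative value)
k = 5                        # number of successes to compare (illustrative)
for n in (10, 100, 1000):    # increasing n with p = lam/n kept constant
    p = lam / n
    binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = lam**k * math.exp(-lam) / math.factorial(k)
    print(n, round(binom, 6), round(poisson, 6))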

Sometimes, it is easier to use the moment-generating function Φ to calculate the mean, variance and other quantities. For example, the moment-generating function for the binomial distribution is

Φ(t) = (p e^t + 1 − p)^n,    (10.3)

where t ∈ ℝ is the parameter. The mean of the binomial distribution is

μ = Φ'(t)|_{t=0} = n(p e^t + 1 − p)^{n−1} p e^t |_{t=0} = n(p e^0 + 1 − p)^{n−1} p e^0 = np,    (10.4)

which is exactly the first moment μ'_1. The second moment μ'_2 is defined as

μ'_2 = Φ''(t)|_{t=0} = { n(n − 1)[p e^t + (1 − p)]^{n−2} (p e^t)² + n[p e^t + (1 − p)]^{n−1} (p e^t) }|_{t=0}

= n(n − 1)p² + np.    (10.5)

The variance is the second moment about the mean, that is

σ² = μ'_2 − μ² = [n(n − 1)p² + np] − (np)² = np(1 − p).    (10.6)

We will use these results later in the discussion of the random walk on a straight line.


The Gaussian distribution, or normal distribution, is by far the most popular distribution, because many physical variables, including light intensity, errors and uncertainty in measurements, and many other processes, obey the normal distribution

p(x; μ, σ²) = (1/(σ√(2π))) exp[−(x − μ)²/(2σ²)],   −∞ < x < ∞,    (10.7)

where μ is the mean and σ > 0 is the standard deviation. This distribution is often denoted by N(μ, σ²). In the special case when μ = 0 and σ = 1, it is called a standard normal distribution, denoted by N(0, 1).

For a random variable with discrete values x_i, the entropy of its distribution is given by

S = −Σ_{i=1}^{K} p(x_i) log_b p(x_i),    (10.8)

where b is the base, often b = 2, e or 10. Here K is the number of all possible outcomes. If the distribution is continuous, then the entropy becomes an integral

S = −∫ p(x) ln p(x) dx.    (10.9)

For example, the entropy of the normal distribution is

S = −∫_{−∞}^{∞} (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} ln[ (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} ] dx

= ln[√(2πeσ²)],    (10.10)

which is independent of the mean μ.
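As a small sanity check (a sketch, not from the book), one can estimate the entropy S = E[−ln p(X)] of N(0, σ²) by Monte Carlo sampling and compare it with the closed form ln√(2πeσ²); the sample size and the value σ = 2 below are arbitrary choices.

import math, random

sigma = 2.0                                   # illustrative standard deviation

def pdf(x):                                   # normal density with mean 0 and std sigma
    return math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

n = 100000
samples = [random.gauss(0.0, sigma) for _ in range(n)]
entropy_mc = -sum(math.log(pdf(x)) for x in samples) / n   # estimate of E[-ln p(X)]
entropy_exact = math.log(math.sqrt(2 * math.pi * math.e * sigma**2))
print(round(entropy_mc, 4), round(entropy_exact, 4))       # the two values should be close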

10.2 RANDOM WALK

A random walk is a random process which consists of taking a series of consecutive random steps. Mathematically speaking, let S_N denote the sum of consecutive random steps X_i; then S_N forms a random walk

S_N = Σ_{i=1}^{N} X_i = X_1 + ... + X_N,    (10.11)

where X_i is a random step drawn from a random distribution. This relationship can also be written as a recursive formula

S_N = Σ_{i=1}^{N−1} X_i + X_N = S_{N−1} + X_N,    (10.12)


Figure 10.1: Random walk on a one-dimensional line. At any point, the probability of moving to the left or to the right equals 1/2.

which means that the next state S_N depends only on the current state S_{N−1} and on the motion or transition X_N from the current state to the next. This is typically the main property of a Markov chain, to be introduced later.

Here the step size or length in a random walk can be fixed or varying. Random walks have many applications in physics, economics, statistics, computer science, environmental science and engineering.

10.2.1 1D Random Walk

Consider a scenario in which a drunkard walks on a street: at each step, he can randomly go forward or backward. This forms a one-dimensional random walk. If this drunkard walks on a football pitch, he can walk in any direction randomly, and this becomes a 2D random walk. Mathematically speaking, a random walk is given by the following equation

S_{t+1} = S_t + w_t,    (10.13)

where S_t is the current location or state at time t, and w_t is a step or random variable with a known distribution.

For the 1D grid line shown in Figure 10.1, a particle can jump to the right or to the left with equal probability 1/2, and each jump can take only one step. This jump probability, often called the transition probability, can be written as

P(w_t = s) = { 1/2 if s = +1;  1/2 if s = −1;  0 otherwise }.    (10.14)

A particle starting at S_0 = 0 jumps along a straight line; let us follow its

first few steps. Suppose we flip a coin: the particle moves to the right (or up) if it is a head; otherwise, the particle moves to the left (or down) when the coin is a tail. First, we flip the coin and get, say, a head, so the particle moves to the right by a fixed unit step, giving S_1 = S_0 + 1 = 1. Then, a tail leads to a move to the left, that is, S_2 = S_1 − 1 = 0. By flipping the coin again, we get a tail, so S_3 = S_2 − 1 = −1. We continue the process in a similar manner, and the path of 100 steps or jumps is shown in Figure 10.2. It has been proved


Figure 10.2: Random walk and the path of 100 consecutive steps starting at position 0.

theoretically that the probability of returning to the origin, or of reaching any given point, approaches 1 as the number N of steps approaches infinity.
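The coin-flipping walk just described is straightforward to simulate. The following Python sketch (an illustration, not taken from the book) reproduces the procedure behind Figure 10.2: each of the 100 unit steps is +1 or −1 with probability 1/2, and the random seed is an arbitrary choice.

import random

random.seed(1)                   # arbitrary seed for reproducibility
S = 0                            # start at the origin, S_0 = 0
path = [S]
for t in range(100):             # 100 coin flips, one unit step each
    step = 1 if random.random() < 0.5 else -1   # head -> +1, tail -> -1
    S += step
    path.append(S)
print(path[:6], "... final position:", path[-1])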

Suppose the probability of moving to the right is p, and thus the probability of moving to the left is q = 1 − p. The probability of taking k steps to the right among N steps obeys the binomial distribution

p(k; N, p) = C(N, k) p^k (1 − p)^{N−k}.    (10.15)

From the results (10.4) and (10.6), we know that the mean number of steps to the right is

⟨k⟩ = pN,    (10.16)

which means that the mean number of steps to the left is simply N − pN = (1 − p)N = qN. The variance associated with k is

σ_k² = p(1 − p)N = pqN.    (10.17)

As time increases, the number of steps also increases; that is, N = t if each step or jump takes a unit of time. This means that the variance increases linearly with t, or

σ² ∝ t.    (10.18)
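The linear growth of the variance can also be checked empirically. The sketch below (illustrative only, with an arbitrary ensemble size) simulates many independent walkers with p = q = 1/2 and prints the sample variance of the position S_t at a few times t; since S_N = 2k − N, the variance is 4pqN = N, so the printed values should be close to t.

import random

def walk(t_max):
    s, positions = 0, []
    for _ in range(t_max):
        s += 1 if random.random() < 0.5 else -1
        positions.append(s)
    return positions

runs = [walk(1000) for _ in range(2000)]          # 2000 independent walkers
for t in (100, 400, 900):
    vals = [r[t - 1] for r in runs]               # positions S_t across walkers
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    print(t, round(var, 1))                       # roughly var ~ t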


Figure 10.3: Brownian motion in 2D: random walk with a Gaussian step-size distribution and the path of 100 steps starting at the origin (0,0) (marked with ·).

10.2.2 Random Walk in Higher Dimensions

If each step or jump is carried out in the n-dimensional space, the random walk discussed earlier

S_N = Σ_{i=1}^{N} X_i,    (10.19)

becomes a random walk in higher dimensions. In addition, there is no reason why each step length should be fixed. In fact, the step size can also vary according to a known distribution. If the step length obeys the Gaussian distribution, the random walk becomes Brownian motion (see Figure 10.3). Similar to the one-dimensional case, the variance σ² also increases linearly with time t or the total number of steps N (see Exercise 10.1 for details).

In theory, as the number of steps N increases, the central limit theorem implies that the random walk (10.19) should approach a Gaussian distribution. As the mean of the particle locations shown in Figure 10.3 is obviously zero, their variance will increase linearly with t. Therefore, the Brownian motion B(t) essentially obeys a Gaussian distribution with zero mean and time-dependent variance. That is,

B(t) ~ N(0, σ² = t),    (10.20)

where ~ means that the random variable obeys the distribution on the right-hand side, or that samples should be drawn from that distribution. The diffusion process can be viewed as a series of Brownian motions, and the motion obeys the Gaussian distribution. For this reason, standard diffusion is often referred to


as Gaussian diffusion. If the motion at each step is not Gaussian, then the diffusion is called non-Gaussian diffusion.
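A 2D Brownian-motion path like the one in Figure 10.3 can be generated by drawing each step component from a Gaussian distribution. The sketch below is only an illustration; the unit step standard deviation and the seed are arbitrary choices.

import random

random.seed(3)                       # arbitrary seed
x, y = 0.0, 0.0                      # start at the origin (0, 0)
path = [(x, y)]
for _ in range(100):                 # 100 steps with Gaussian step sizes
    x += random.gauss(0.0, 1.0)      # step components drawn from N(0, 1)
    y += random.gauss(0.0, 1.0)
    path.append((x, y))
print("final position:", (round(x, 2), round(y, 2)))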

If the step length obeys some other distribution, we have to deal with a more generalized random walk. A very special case occurs when the step length obeys the Levy distribution; such a random walk is called a Levy flight.

10.3 LEVY FLIGHTS

Broadly speaking, Levy flights are random walks whose step lengths are drawn from the Levy distribution, often in terms of a simple power-law formula L(s) ~ |s|^{−1−β}, where 0 < β < 2 is an index. The Levy distribution is the distribution of the sum of N identically and independently distributed random variables whose Fourier transform takes the following form

F_N(k) = exp[−N|k|^β].    (10.21)

The inverse to get the actual distribution L(s) is not straightforward, as the integral

L(s) = (1/π) ∫_0^∞ cos(rs) e^{−α r^β} dr,   (0 < β < 2),    (10.22)

does not have an analytical form except for a few special cases. Here L(s) is called the Levy distribution with an index β. For most applications, we can set α = 1 for simplicity.

Two special cases are β = 1 and β = 2. When β = 1, the above integral becomes the Cauchy distribution

L(s) = (1/π) · 1/(1 + s²).    (10.23)

When β = 2, it becomes the normal distribution. In this case, Levy flights become the standard Brownian motion. However, it is possible to express the integral (10.22) as a series

L(s) ≈ −(1/π) Σ_{j=1}^{∞} [(−1)^j / j!] sin(jπβ/2) Γ(βj + 1) / s^{βj+1},    (10.24)

which suggests that the leading-order approximation (j = 1) for large flight lengths s is a power-law distribution

L(s) ~ |s|^{−1−β},    (10.25)

and it is heavy-tailed. The variance of such a power-law distribution is infinite for 0 < β < 2. Figure 10.4 shows the path of Levy flights of 100 steps starting from (0,0) with β = 1. It is worth pointing out that a power-law distribution is often linked to some scale-free characteristics, and Levy flights can thus show self-similarity and fractal behavior in the flight patterns.
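One simple way to generate a 2D Levy-flight-like path is to draw step lengths from a power law with the heavy tail (10.25) and pick an isotropic random direction for each step. The sketch below uses the Pareto-type inverse transform s = u^{−1/β} (for u uniform in (0, 1]), which reproduces the tail L(s) ~ s^{−1−β} but is only an approximation of the full Levy distribution (10.22); the value β = 1 and the seed are arbitrary choices.

import math, random

beta = 1.0                                 # Levy index, 0 < beta < 2 (illustrative)
random.seed(5)
x, y = 0.0, 0.0
for _ in range(100):
    u = 1.0 - random.random()              # uniform in (0, 1], avoids division by zero
    s = u ** (-1.0 / beta)                 # power-law step length, density ~ s^(-1-beta) for s >= 1
    angle = random.uniform(0.0, 2.0 * math.pi)   # isotropic random direction
    x += s * math.cos(angle)
    y += s * math.sin(angle)
print("end point after 100 flights:", (round(x, 2), round(y, 2)))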


Figure 10.4: Levy flights in 2D setting starting at the origin (0,0) (marked with ·).

If we calculate the variance of Levy flights, though this is very difficult to do, we obtain the following relationship

σ²(t) ~ { t² if 0 < β < 1;  t^{3−β} if 1 < β < 2;  t if β ≥ 2 },    (10.26)

which is in contrast with the linear behavior σ² ~ t obtained earlier for Brownian motion and the 1D random walk. In the special case of β = 2, it indeed becomes standard Brownian motion. It is worth pointing out that in the case of β = 1, σ² ~ t²/ln t is expected. Therefore, we can say that Levy flights are a deviation from the behavior governed by the central limit theorem.

Levy flights have many applications. In fact, Levy flights have been observed in the foraging patterns of albatrosses, fruit flies, spider monkeys, and even humans such as the Ju/'hoansi hunter-gatherers. In addition, many physical phenomena such as the diffusion of fluorescent molecules, cooling behavior and noise could show Levy-flight-style characteristics under the right conditions.

The non-Gaussian diffusion process for 1 < β < 2 is called superdiffusion; for example, turbulence can be considered a case of superdiffusion. The case β > 2 is called subdiffusion, and obviously β = 2 is standard diffusion. It is worth pointing out that quantum tunneling can be thought of as the case 0 < β < 1. The moments diverge (or are infinite) for 0 < β < 2, which is a stumbling block for mathematical analysis.

If we impose a velocity associated with each flight step or segment, then the potential difficulty may be avoided by looking at the distance wandered by a Levy walker. Conventionally, the Levy flight with an associated velocity is


often referred to as a Levy walk. However, as a Levy flight is a type of random walk, the terms Levy flight and Levy walk are interchangeable in most cases.

10.4 MARKOV CHAIN

Briefly speaking, a random variable U is a Markov process if the transition probability, from state U_t = S_i at time t to another state U_{t+1} = S_j, depends only on the current state U_t. That is,

P(i, j) = P(U_{t+1} = S_j | U_0 = S_p, ..., U_t = S_i) = P(U_{t+1} = S_j | U_t = S_i),    (10.27)

which is independent of the states before t. In addition, the sequence of random variables (U_0, U_1, ..., U_n) generated by a Markov process is subsequently called a Markov chain. The transition probability P(i, j) = P(i → j) = P_{ij} is also referred to as the transition kernel of the Markov chain.

If we rewrite the random walk relationship (10.12) with a random move governed by w_t, which depends on the transition probability P, we have

S_{t+1} = S_t + w_t,    (10.28)

which indeed has the properties of a Markov chain. Therefore, a random walk is a Markov chain.

In order to solve an optimization problem, we can search for the solution by performing a random walk starting from a good initial but random guess solution. However, simple or blind random walks are not efficient. To be computationally efficient and effective in searching for new solutions, we have to keep the best solutions found so far and to increase the mobility of the random walk so as to explore the search space more effectively. Most importantly, we have to find a way to control the walk in such a way that it moves towards the optimal solutions more quickly, rather than wandering away from the potential best solutions. These are the challenges for most metaheuristic algorithms, and the same issues are also important for Monte Carlo simulations and Markov chain sampling techniques. Markov chain Monte Carlo methods are a class of sampling techniques that control how a random walk behaves.

10.5 MARKOV CHAIN MONTE CARLO

Markov chain Monte Carlo (MCMC) is a class of sample-generating methods, which attempt to draw samples directly from some highly complex multidimensional distribution using a Markov chain with a known transition probability. The basic idea of MCMC methods can be traced back to the classic Metropolis algorithm developed by Metropolis et al. in 1953. Since the 1990s,


Markov chain Monte Carlo has become a powerful tool for Bayesian statistical analysis, Monte Carlo simulations, and potentially for optimization with high nonlinearity.

We use π_i(t) to denote the probability that the chain is in state i (or more accurately S_i) at time t. This means that π(t) = (π_1, ..., π_m)^T is a vector over the state space. At time t = 0, π(0) is the initial vector.

The k-step transition probability P_{ij}^{(k)} from state i to state j can be calculated by

P_{ij}^{(k)} = P(U_{t+k} = S_j | U_t = S_i),    (10.29)

where k > 0 is an integer. The matrix P = [P_{ij}^{(1)}] = [P_{ij}] is the transition matrix, which is a right stochastic matrix. A right stochastic matrix is defined as a probability (square) matrix whose entries are non-negative, with each row summing to 1. That is,

P_{ij} ≥ 0,   Σ_{j=1}^{m} P_{ij} = 1,   i = 1, 2, ..., m.    (10.30)

It is worth pointing out that a left transition matrix, though less widely used, is a stochastic matrix with each column summing to 1.

A Markov chain is regular if some power of the transition matrix P has only positive elements. That is, there exists a positive integer K such that P_{ij}^{(K)} > 0 for all i, j. This means that there is a non-zero probability of going from any state i to any other state j; in other words, every state is accessible in a finite number of steps (not necessarily a few steps). If the number of steps K is not a multiple of some integer, then the chain is called aperiodic, which means there is no fixed-length cycle between certain states of the chain. In addition, a Markov chain is said to be ergodic or irreducible if it is possible to go from every state to every state.

In general, for a Markov chain starting with an initial π_0 and a transition matrix P, we have after n steps

π_n = π_0 P^n,   or   π_n = π_{n−1} P,    (10.31)

where π_n is a vector whose jth entry is the probability that the chain is in state S_j after n steps.

There is a fundamental theorem about a regular Markov chain. That is

lim_{n→∞} P^n = W,    (10.32)

where W is a matrix with all rows equal and all entries strictly positive. As the number of steps n increases, it is possible for a Markov chain to

reach a stationary distribution π* defined by

π* = π* P.    (10.33)


From the definition of the eigenvalues of a matrix A,

A u = λ u,    (10.34)

we know that the above equation implies that π* is the (left) eigenvector of P associated with the eigenvalue λ = 1. A unique stationary distribution can be ensured by imposing the following detailed balance condition on the transition probabilities,

P_{ij} π_i* = P_{ji} π_j*,    (10.35)

which is often referred to as the reversibility condition. A Markov chain that satisfies this reversibility condition is said to be reversible.

■ EXAMPLE 10.1

To see if a Markov chain is regular or not, we have to use the transition matrix P. For example, the chain with

P = ( 0.2  0.7  0.1
      0.5  0.1  0.4
      0.5  0.2  0.3 ),

is regular because all entries of P are positive. For another transition matrix

P = ( 1/2  0    1/2
      0    1/4  3/4
      1/3  2/3  0   ),

it has zero entries. However, we have

P² = P P = ( 5/12  1/3   1/4
             1/4   9/16  3/16
             1/6   1/6   2/3  ),

whose entries are all positive. So this is a regular Markov chain. If we have an initial probability vector u_0 = (1, 0, 0), then at the next step we have

u_1 = u_0 P = (1/2, 0, 1/2),

and u_2 = u_1 P = u_0 P² = (5/12, 1/3, 1/4).

In fact, as n increases, we have

lim_{n→∞} u_n ≈ (0.26, 0.35, 0.39),

which is independent of u_0. It is easy to check that u_∞ will be the same even starting from u_0 = (0, 1, 0) or (0, 0, 1).
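This limit is easy to verify numerically. The following sketch (not part of the original example) iterates u_{n+1} = u_n P for the second matrix above; the exact stationary distribution works out to (6/23, 8/23, 9/23) ≈ (0.26, 0.35, 0.39), and the same limit is reached from any of the initial vectors mentioned.

P = [[1/2, 0,   1/2],
     [0,   1/4, 3/4],
     [1/3, 2/3, 0  ]]

u = [1.0, 0.0, 0.0]                         # initial vector u_0 = (1, 0, 0)
for _ in range(200):                        # repeated transitions u_{n+1} = u_n P
    u = [sum(u[i] * P[i][j] for i in range(3)) for j in range(3)]
print([round(v, 4) for v in u])             # ~ [0.2609, 0.3478, 0.3913] = (6/23, 8/23, 9/23)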


Furthermore, for

P = ( 1/2  1/2
      1    0   ),

there is a zero entry. However, we have

P² = ( 3/4  1/4
      1/2  1/2 ),

whose entries are strictly positive. So this chain is also regular. On the other hand,

P = ( 0  1
      1  0 )

is not regular. This is because P² = I (the 2 × 2 identity matrix), P³ = P, and P^n = I if n is even, while P^n = P if n is odd, so there are always two entries which are zero.

The above discussion is mainly for the case when the states are discrete. We can generalize the above results to a continuous state Markov chain with a transition probability P(u, v) and the corresponding stationary distribution

π*(v) = ∫_Ω π*(u) P(u, v) du,    (10.36)

where Ω is the probability state space. There are many ways to choose the transition probabilities, and different

choices will result in different behaviour of the Markov chain. In essence, the characteristics of the transition kernel largely determine how the Markov chain of interest behaves, which also determines the efficiency and convergence of MCMC sampling. There are several widely used sampling algorithms, including the Metropolis algorithm, Metropolis-Hastings algorithms, independence sampling, random-walk sampling, and of course the Gibbs sampler. We will introduce the most popular, the Metropolis-Hastings algorithm.

10.5.1 Metropolis-Hastings Algorithms

To draw samples from the target distribution, we may write π(θ) = β p(θ), where β is just a normalizing constant which is either difficult to estimate or not known. We will see later that the normalizing factor β disappears in the expression for the acceptance probability.

The Metropolis-Hastings algorithm essentially expresses an arbitrary transition probability from state θ to state φ as the product of an arbitrary transition kernel q(θ, φ) and a probability α(θ, φ). That is,

P(θ, φ) = P(θ → φ) = q(θ, φ) α(θ, φ).    (10.37)


The Metropolis-Hastings algorithm

Begin with any initial θ_0 at time t ← 0 such that p(θ_0) > 0;
for i = 1 to n (number of samples)
    Generate a candidate θ_* ~ q(θ_t, ·) from a proposal distribution;
    Evaluate the acceptance probability α(θ_t, θ_*);
    Generate a uniformly distributed random number u ~ Unif[0, 1];
    if α > u, accept θ_*, that is, θ_{t+1} ← θ_*;
    else θ_{t+1} ← θ_t;
    end if
end

Figure 10.5: Metropolis-Hastings algorithm.

Here q is the proposal distribution function, while α(θ, φ) can be considered as the acceptance probability of a move from θ to φ, and can be determined by

α(θ, φ) = min[ π(φ) q(φ, θ) / (π(θ) q(θ, φ)), 1 ] = min[ p(φ) q(φ, θ) / (p(θ) q(θ, φ)), 1 ].    (10.38)

The essence of the Metropolis-Hastings algorithm is first to propose a candidate θ_*, and then to accept it with probability α. That is, θ_{t+1} ← θ_* if α > u, where u is a random value drawn from a uniform distribution in [0, 1]; otherwise θ_{t+1} ← θ_t. The Metropolis-Hastings algorithm can be summarized as the pseudo code shown in Figure 10.5.

It is straightforward to verify that the reversibility condition is satisfied by the Metropolis-Hastings kernel

q(θ, φ) α(θ, φ) π(θ) = q(φ, θ) α(φ, θ) π(φ),    (10.39)

for all θ, φ. Consequently, the Markov chain will converge to a stationary distribution which is the target distribution π(θ).

■ EXAMPLE 10.2

Let us try to draw some samples from the Rayleigh distribution

f(x) = x e^{−x²/2},   x ∈ [0, ∞),

using the Metropolis-Hastings algorithm. Suppose we use a uniform distribution on (0, 10) as the proposal distribution q. Obviously, there is a probability that x exceeds 10, but such a probability is extremely small.

First, starting with θ_0 = 1 at t = 0, we draw a random sample from the uniform distribution on (0, 10), and we get a candidate θ_* = 0.5.


Then, we have

α = min[ f(θ_*)/f(θ_0), 1 ] = min[ (0.5 e^{−0.5²/2}) / (1 · e^{−1²/2}), 1 ] ≈ 0.7275.

Second, we draw a random number u from the uniform distribution on (0, 1); suppose we get u = 0.69. Since α > u, we accept θ_* = 0.5 as a successfully drawn sample, and we update θ_1 = 0.5.

Third, we draw another candidate θ_* = 2.40. Now we have

α = min[ f(θ_*)/f(θ_1), 1 ] = min[ (2.40 e^{−2.40²/2}) / (0.5 e^{−0.5²/2}), 1 ] ≈ 0.305.

We then draw another uniformly distributed random number, say u = 0.90. Since α < u, we should reject this sample, so θ_2 = θ_1 = 0.5. We proceed in a similar manner to draw as many samples as we want.
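The whole procedure of Example 10.2 can be coded in a few lines. The following Python sketch (an illustration, not the book's code) uses the uniform proposal on (0, 10); because this proposal density is constant, the q terms cancel in (10.38) and the acceptance probability reduces to the ratio of f values used above. The sample size and the seed are arbitrary choices.

import math, random

def f(x):                                    # Rayleigh target f(x) = x exp(-x^2/2), x >= 0
    return x * math.exp(-x * x / 2.0)

random.seed(7)
theta = 1.0                                  # theta_0 = 1
samples = []
for _ in range(10000):
    cand = random.uniform(0.0, 10.0)         # proposal drawn from Unif(0, 10)
    alpha = min(f(cand) / f(theta), 1.0)     # acceptance probability (10.38); q cancels
    if random.random() < alpha:              # accept with probability alpha
        theta = cand
    samples.append(theta)
print("sample mean:", round(sum(samples) / len(samples), 3))   # Rayleigh mean is sqrt(pi/2) ~ 1.2533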

In a special case when the transition kernel is symmetric in its arguments, or

q(θ, φ) = q(φ, θ),    (10.40)

for all θ,φ, then equation (10.38) becomes

α(θ, φ) = min[ π(φ)/π(θ), 1 ],    (10.41)

and the Metropolis-Hastings algorithm reduces to the classic Metropolis algorithm. In this case, the associated Markov chain is called a symmetric chain. In the special case when α = 1 is used, that is, the acceptance probability is always 1, the Metropolis-Hastings algorithm degenerates into the classic, widely used Gibbs sampling algorithm. However, the Gibbs sampler becomes very inefficient for distributions that are non-normal or highly nonlinear.

10.5.2 Random Walk

As shown above, the choice of the proposal kernel is very important. An efficient variation for generating a candidate sample θ_{t+1} is to use a random walk process. The proposal can be generated from θ_t by

θ_* = θ_t + w_t,    (10.42)

where w_t is a random walk variable with a distribution density g that is independent of the Markov chain. The transition kernel then becomes a special case,

q(θ, φ) = g(φ − θ),    (10.43)

which implies that only the difference φ − θ matters. If g is an even function of φ − θ (or symmetric around 0), then the kernel is symmetric and thus the Markov chain is also symmetric. In this special case, the classic Metropolis algorithm can be considered as a special case of a random walk.
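For comparison with the independence-style proposal of Example 10.2, the sketch below (illustrative only) uses a random-walk proposal θ_* = θ_t + w_t with Gaussian w_t for the same Rayleigh target; since the Gaussian density is symmetric, the kernel cancels and only the ratio of target values enters the acceptance probability. The step size 0.5 and the seed are arbitrary choices.

import math, random

def target(x):                                  # same Rayleigh target as in Example 10.2
    return x * math.exp(-x * x / 2.0) if x > 0 else 0.0

random.seed(9)
theta, accepted, n = 1.0, 0, 10000
for _ in range(n):
    cand = theta + random.gauss(0.0, 0.5)       # theta_* = theta_t + w_t, w_t ~ N(0, 0.5^2)
    alpha = min(target(cand) / target(theta), 1.0)
    if random.random() < alpha:                 # symmetric proposal: Metropolis acceptance
        theta = cand
        accepted += 1
print("acceptance rate:", accepted / n)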


10.6 MARKOV CHAIN AND OPTIMISATION

An important link between MCMC and optimisation is that most heuristic and metaheuristic search algorithms, such as the simulated annealing to be introduced later, use a trajectory-based approach. They start with some initial (random) state and propose a new state (solution) randomly; the move is then accepted or not with some probability. This is similar to a Markov chain. In fact, standard simulated annealing is a random walk.

In fact, a great leap in understanding metaheuristic algorithms is to view Markov chain Monte Carlo as an optimization procedure. If we want to find the minimum of an objective function f(θ) at θ = θ_*, so that f_* = f(θ_*) ≤ f(θ), we can convert it to a target distribution for a Markov chain,

π(θ) = e^{−β f(θ)},    (10.44)

where β > 0 is a parameter which acts as a normalizing factor. The value of β should be chosen so that the probability is close to 1 when θ → θ_*.

At θ = θ_*, π(θ) should reach a maximum, π_* = π(θ_*) ≥ π(θ). This often requires that f(θ) be non-negative, which means that some objective functions can be shifted by a large constant A > 0, for example f ← f + A, if necessary.

Using the Markov transition probability p(θ, θ') so that

α = π(θ')/π(θ) = e^{−β[f(θ') − f(θ)]},    (10.45)

we can update the chain if f(θ') ≤ f(θ). Ideally, we should choose the transition probabilities so as to make α close to 1 when a new move produces a better solution. This is equivalent to

α → 1,   for f(θ') < f(θ),    (10.46)

and

α → 0,   for f(θ') > f(θ).    (10.47)

Such optimisation via MCMC can be extended to a generic framework outlined by Ghate and Smith in 2008, as shown in Figure 10.6.

■ EXAMPLE 10.3

Simulated annealing for minimization proposes a potential move and then decides whether to accept it with an acceptance probability

p_t = min[ e^{−Δf/(k_B T_t)}, 1 ],

where k_B is the Boltzmann constant, which can often be taken as k_B = 1, T_t is the current temperature at time t, and Δf = f_{t+1} − f_t is the change


Markov Chain Algorithm for Optimization

Objective function f(x)
Start with U_0 ∈ S, at t = 0
while (criterion)
    Generate Y_{t+1} using an appropriate candidate kernel
    Generate a random number 0 ≤ P_t ≤ 1
    U_{t+1} = { Y_{t+1} with probability P_t;  U_t with probability 1 − P_t }    (10.48)
end

Figure 10.6: The Ghate-Smith Markov chain algorithm for optimization.

of the objective function f(x), where f_{t+1} and f_t are the values of the objective function at the two consecutive time steps, respectively.

In this framework, simulated annealing and its many variants are simply a special case of a Markov chain with

p_t = { exp[−Δf/T_t] if f_{t+1} > f_t;  1 if f_{t+1} ≤ f_t },

In this case, only the difference Δf between the function values is relevant.

In addition, proper control over the temperature can have a significant effect on how the algorithm converges, and this control is often referred to as the cooling schedule.
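To make the example concrete, the sketch below implements this acceptance rule with k_B = 1 and a simple geometric cooling schedule for a one-dimensional test function; the function f(x) = x² + 10 sin(3x), the move size, the cooling factor and the temperatures are all arbitrary illustrative choices rather than anything prescribed by the text.

import math, random

def f(x):                                       # illustrative multimodal test function
    return x * x + 10.0 * math.sin(3.0 * x)

random.seed(11)
x, fx = 5.0, f(5.0)                             # initial solution
T = 10.0                                        # initial temperature
best_x, best_f = x, fx
for t in range(5000):
    cand = x + random.gauss(0.0, 0.5)           # random-walk move
    fc = f(cand)
    delta = fc - fx
    if delta <= 0 or random.random() < math.exp(-delta / T):   # acceptance probability p_t
        x, fx = cand, fc
        if fx < best_f:
            best_x, best_f = x, fx
    T *= 0.999                                  # geometric cooling schedule
print("best solution:", round(best_x, 3), "objective:", round(best_f, 3))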

Algorithms such as simulated annealing use a single Markov chain, which may not be very efficient. In practice, it is usually advantageous to use multiple Markov chains in parallel to increase the overall efficiency. Algorithms such as simulated tempering use multiple Markov chains with different temperatures.

Furthermore, there is no reason why different chains should not interact. In fact, algorithms such as particle swarm optimization, to be introduced in Part II, can be viewed as multiple interacting Markov chains, though such theoretical analysis remains almost intractable. The theory of interacting Markov chains is complicated and still under development; however, any progress in this area will play a central role in understanding how population- and trajectory-based metaheuristic algorithms perform under various conditions.

It is worth pointing out that this generic optimization framework may have good convergence under appropriate conditions. However, its computational efficiency is not always practical for large-scale problems. There is no free lunch in optimization, and consequently a lot of effort has been devoted to the development of efficient algorithms fit for purpose. In Part II, we will introduce various metaheuristic algorithms in great detail.


EXERCISES

10.1 Consider the case of a 2D random walk where each step is of fixed unit length. Derive the distance traveled after N steps and its variance, using complex numbers in the plane.

10.2 Write a simple program to carry out some Levy flights and demonstrate how the exponent β affects the distance traveled (10.26).

10.3 Write a simple program to generate a random walk.

10.4 The convergence to the stationary distribution π* of a Markov chain is typically related to the second eigenvalue of the transition matrix P. For the simple matrix

P = ( 1/4  3/4
      3/4  1/4 ),

what is its second eigenvalue? Show that P^n (where n = 1, 2, ...) will converge to

P_∞ = ( 1/2  1/2
        1/2  1/2 ).

10.5 Verify whether a Markov chain with a given transition matrix P is regular or not.

10.6 Markov chains have many applications. For example, Google's page-ranking engine uses a Markov chain with a transition probability p = α/n_i + (1 − α)/N for page i with n_i links among N known webpages. Here α lies between 0 and 1. How will the choice of α affect the ranking?

REFERENCES

1. W. J. Bell, Searching Behaviour: The Behavioural Ecology of Finding Resources, Chapman & Hall, London, 1991.

2. C. Blum and A. Roli, "Metaheuristics in combinatorial optimization: Overview and conceptual comparison", ACM Comput. Surv., 35, 268-308 (2003).

3. M. G. Cox, A. B. Forbes, P. M. Harris, Discrete Modelling, SSfM Best Practice Guide No.4, NPL, UK, 2002.

4. S. R. Finch, Mathematical Constants, Cambridge University Press, (2003).

5. D. Gamerman, Markov Chain Monte Carlo, Chapman & Hall/CRC, 1997.

6. L. Gerencser, S. D. Hill, Z. Vago, and Z. Vincze, "Discrete optimization, SPSA, and Markov chain Monte Carlo methods", Proc. 2004 Am. Contr. Conf., 3814-3819 (2004).


7. C. J. Geyer, "Practical Markov Chain Monte Carlo", Statistical Science, 7, 473-511 (1992).

8. A. Ghate and R. Smith, "Adaptive search with stochastic acceptance probabil­ities for global optimization", Operations Research Lett, 36, 285-290 (2008).

9. W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Markov Chain Monte Carlo in Practice, Chapman & Hall/CRC, 1996.

10. M. Gutowski, "Levy flights as an underlying mechanism for global optimization algorithms", ArXiv Mathematical Physics e-Prints, June, 2001.

11. W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications", Biometrika, 57, 97-109 (1970).

12. S. Kirkpatrick, C. D. Gellat and M. P. Vecchi, "Optimization by simulated annealing", Science, 220, 670-680 (1983).

13. E. Marinari and G. Parisi, "Simulated tempering: a new Monte Carlo scheme", Europhysics Lett, 19, 451-458 (1992).

14. W. H. McCrea and F. J. Whipple, "Random paths in two and three dimensions", Proc. Roy. Soc. Edinburgh, 60, 281-298 (1940).

15. S. P. Meyn, and R. L. Tweedie, Markov Chains and Stochastic Stability, Springer-Verlag, London, 1993.

16. N. Metropolis and S. Ulam, "The Monte Carlo method", J. Amer. Stat. Assoc., 44, 335-341 (1949).

17. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of state calculations by fast computing machines", J. Chem. Phys., 21, 1087-1092 (1953).

18. D. J. Murdoch and P. J. Green, "Exact sampling from a continuous state space", Scand. J. Statist, 25, 483-502 (1998).

19. I. Pavlyukevich, "Levy flights, non-local search and simulated annealing", J. Computational Physics, 226, 1830-1844 (2007).

20. I. Pavlyukevich, "Cooling down Levy flights", J. Phys. A:Math. Theor., 40, 12299-12313 (2007).

21. J. Propp and D. Wilson, "Exact sampling with coupled Markov chains and applications to statistical mechanics", Random Structures and Algorithms, 9, 223-252 (1996).

22. G. Ramos-Fernandez, J. L. Mateos, O. Miramontes, G. Cocho, H. Larralde, B. Ayala-Orozco, "Levy walk patterns in the foraging movements of spider monkeys (Ateles geoffroyi)", Behav. Ecol. Sociobiol., 55, 223-230 (2004).

23. A. M. Reynolds and M. A. Frye, "Free-flight odor tracking in Drosophila is consistent with an optimal intermittent scale-free search", PLoS One, 2, e354 (2007).

24. M. E. Tipping, "Bayesian inference: An introduction to principles and practice in machine learning", in: Advanced Lectures on Machine Learning, O. Bousquet, U. von Luxburg and G. Ratsch (Eds), pp. 41-62 (2004).

25. G. M. Viswanathan, S. V. Buldyrev, S. Havlin, M. G. E. da Luz, E. P. Raposo, and H. E. Stanley, "Levy flight search patterns of wandering albatrosses", Nature, 381, 413-415 (1996).

26. E. Weisstein, http://mathworld.wolfram.com