8/4/2019 Rapid Mixing Book
Chapter 1
Sampling and Counting
1.1 Introduction
The classical Monte Carlo method is an approach to estimating quantities that are hard to compute exactly. The quantity z of interest is expressed as the expectation z = E(Z) of a random variable (r.v.) Z over a probability space (Ω, π). It is assumed that some efficient procedure for sampling from (Ω, π) is available. By taking the mean of a sufficiently large set of independent samples of Z, one may obtain an approximation to z. For example, suppose

    S = {(x, y) ∈ [0, 1]² : p_i(x, y) ≥ 0 for all i}

is some region of the unit square defined by a system of polynomial inequalities p_i(x, y) ≥ 0. Let Z be the r.v. defined by the following experiment or trial: choose a point (x, y) uniformly at random (u.a.r.) from [0, 1]²; let Z = 1 if p_i(x, y) ≥ 0 for all i, and Z = 0 otherwise. Then the area a of S is equal to E(Z), and an estimate of a may be obtained from the sample mean of a sufficiently long sequence of trials. In this example, the use of the Monte Carlo method is perhaps avoidable, at the expense of a more complex algorithm; for more essential uses see, for example, Knuth's proposal for estimating the size of a tree by taking a random path from the root to a leaf, or Rasmussen's for estimating the permanent of a 0,1-matrix.
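The trial just described is easy to simulate. The sketch below is illustrative only: the system of inequalities is made up (a single inequality cutting out the quarter disc, whose true area is π/4 ≈ 0.785), and the function names are ours, not the book's.

```python
import random

def area_estimate(polys, trials, rng):
    """Monte Carlo estimate of the area of
    S = {(x, y) in [0,1]^2 : p(x, y) >= 0 for all p in polys}."""
    hits = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if all(p(x, y) >= 0 for p in polys):
            hits += 1           # Z = 1 for this trial
    return hits / trials        # sample mean approximates E(Z) = area(S)

# Quarter disc: the single polynomial inequality 1 - x^2 - y^2 >= 0.
a = area_estimate([lambda x, y: 1 - x * x - y * y], 100_000, random.Random(0))
```

With 100,000 trials the estimate is within a few thousandths of π/4 with overwhelming probability.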
The main focus of this book is the Markov chain Monte Carlo (MCMC) method, a development of the foregoing approach which is sometimes applicable when Z cannot be sampled directly. z will often be the cardinality of some combinatorially defined set S. We design a Markov chain M with state space Ω (often S itself) whose steady state distribution is π. Efficient sampling now rests on the rapid convergence of the chain to its steady state. These ideas will be made more explicit in Chapter 2, but for the moment we focus on the relationship between near uniform generation and approximate counting.
As a first example of the approach, we consider the problem of estimating the number of independent sets of a graph G with small maximum degree Δ. In Section 1.2 we show how sampling independent sets of G, generated independently and almost uniformly, can be used to obtain an estimate for their number. This step of the MCMC programme (how the samples are used) is often, though not always, rather routine.

We then consider the reverse process, i.e. we show how good estimates of the number of independent sets can be used to generate a near uniform sample. This illustrates a sort of equivalence between the problems of generation and counting. Section 1.4 discusses a formal framework within which this can be made precise.
1.2 Approximate counting, uniform sampling and their relationship
1.2.1 An example: Independent Sets
What do we mean precisely by (efficient) approximate counting and uniform sampling?
Let N = N(G) denote the number of independent sets of G. A randomised approximation scheme for N is a randomised algorithm that takes as input a graph G and an error bound ε > 0, and produces as output a number Y (a random variable) such that

    Pr( (1 − ε)N ≤ Y ≤ (1 + ε)N ) ≥ 3/4.    (1.1)

A randomised approximation scheme is said to be fully polynomial if it runs in time polynomial in n (the input length) and ε⁻¹. We shall abbreviate the rather unwieldy phrase "fully polynomial randomised approximation scheme" to FPRAS.
There is no significance in the constant 3/4 appearing in the definition, beyond its lying strictly between 1/2 and 1. Any success probability greater than 1/2 may be boosted to 1 − δ for any desired δ > 0 by performing a small number of trials and taking the median of the results; the number of trials required is O(ln δ⁻¹). Indeed, let Y₁, Y₂, ..., Y_m be independent samples satisfying (1.1), and suppose that Y is the median of Y₁, Y₂, ..., Y_m. Then

    Pr(Y ≥ (1 + ε)N) ≤ Pr( |{i : Y_i ≥ (1 + ε)N}| ≥ m/2 ) ≤ e^{−m/12},

using the Chernoff bounds. Similarly,

    Pr(Y ≤ (1 − ε)N) ≤ e^{−m/8}.

Putting m = 12 ln(2/δ) we get

    Pr( (1 − ε)N ≤ Y ≤ (1 + ε)N ) ≥ 1 − δ.    (1.2)
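The median amplification is a few lines of code. In the sketch below the "approximation scheme" is a stand-in (a toy estimator that is within 10% of the truth with probability 3/4 and wildly wrong otherwise), not a real FPRAS.

```python
import math
import random

def boost_by_median(estimator, delta, rng):
    """Run an estimator whose success probability exceeds 1/2
    independently m = 12*ln(2/delta) times; return the median."""
    m = max(1, math.ceil(12 * math.log(2 / delta)))
    results = sorted(estimator(rng) for _ in range(m))
    return results[len(results) // 2]   # median

# Toy stand-in: within 10% of N = 1000 with probability 3/4.
def toy_estimator(rng, N=1000):
    return N * rng.uniform(0.9, 1.1) if rng.random() < 0.75 else N * 100

boosted = boost_by_median(toy_estimator, delta=0.01, rng=random.Random(1))
```

The median is bad only if at least half of the m trials fail, which by the Chernoff bound happens with probability at most δ.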
For any two probability distributions μ and ν on a countable set Ω, define the total variation distance between μ and ν to be

    D_tv(μ, ν) := max_{A⊆Ω} |μ(A) − ν(A)| = (1/2) Σ_{x∈Ω} |μ(x) − ν(x)|.    (1.3)
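Both expressions in (1.3) are easy to compute for small finite distributions, and checking them against each other is a useful sanity test (an illustrative sketch; the function names are ours).

```python
def tv_max_event(mu, nu):
    """D_tv as the maximum of |mu(A) - nu(A)| over events A.
    The maximising A is the set of points where mu(x) > nu(x)."""
    support = set(mu) | set(nu)
    return sum(mu.get(x, 0) - nu.get(x, 0)
               for x in support if mu.get(x, 0) > nu.get(x, 0))

def tv_half_l1(mu, nu):
    """D_tv as half the L1 distance between the densities."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0) - nu.get(x, 0)) for x in support)

mu = {'a': 0.5, 'b': 0.3, 'c': 0.2}
nu = {'a': 0.25, 'b': 0.25, 'c': 0.5}
```

For these two distributions both formulas give 0.3.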
In our example Ω = Ω(G) will be the set of independent sets of graph G and π(I) = 1/|Ω| for each I ∈ Ω, i.e. π is the uniform distribution over Ω. We will let μ be the distribution of the output of some randomised algorithm that generates a random independent set of G.
A good sampler for Ω is a randomised algorithm that takes as input a graph G and a tolerance δ > 0, and produces an independent set I (a random variable) such that the probability distribution of I is within variation distance δ of the uniform distribution on Ω. An almost uniform sampler is said to be fully polynomial if it runs in time polynomial in n (the input length) and log δ⁻¹.
From good sampling to approximate counting
Theorem 1.2.1 Suppose we have a good sampler for the independent sets of a graph, which works for graphs G with maximum degree bounded by Δ, and suppose that the sampler has time complexity T(n, δ), where n is the number of vertices in G and δ is the allowed deviation from uniformity in the sampling distribution. Then we may construct an FPRAS for the number of independent sets of a graph, which works for graphs G with maximum degree bounded by Δ, and which has time complexity

    O( (m²/ε²) T(n, ε/(6m)) ),

where m is the number of edges in G, and ε is the specified error bound.
Proof Let G = G_m > G_{m−1} > ··· > G_1 > G_0 = (V, ∅) be any sequence of graphs in which each graph G_{i−1} is obtained from the previous graph G_i by removing a single edge. We may express the quantity we wish to estimate as a product of ratios:

    |Ω(G)| = (|Ω(G_m)| / |Ω(G_{m−1})|) · (|Ω(G_{m−1})| / |Ω(G_{m−2})|) ··· (|Ω(G_1)| / |Ω(G_0)|) · |Ω(G_0)|,    (1.4)

where, it will be observed, |Ω(G_0)| = 2^n. Our strategy is to estimate the ratio

    ρ_i = |Ω(G_i)| / |Ω(G_{i−1})|

for each i in the range 1 ≤ i ≤ m, and by substituting these quantities into identity (1.4), obtain an estimate for the number of independent sets of G:

    |Ω(G)| = 2^n ρ_1 ρ_2 ··· ρ_m.    (1.5)
To estimate the ratio ρ_i we use the almost uniform sampler to obtain a sufficiently large sample of independent sets from Ω(G_{i−1}) and compute the proportion of samples that lie in Ω(G_i).
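The whole strategy can be exercised end-to-end on toy graphs. In the sketch below brute-force enumeration stands in for the almost uniform sampler (so it is illustrative only, and exponential in n); the function names and sample sizes are ours.

```python
import itertools
import random

def independent_sets(edges, n):
    """Brute-force list of the independent sets of a graph on {0,...,n-1}."""
    out = []
    for r in range(n + 1):
        for s in itertools.combinations(range(n), r):
            if all(not (u in s and v in s) for u, v in edges):
                out.append(frozenset(s))
    return out

def estimate_count(edges, n, samples_per_ratio, rng):
    """Estimate |Omega(G)| via the telescoping product (1.4)-(1.5):
    strip edges one at a time; estimate each ratio rho_i as the
    proportion of uniform samples from Omega(G_{i-1}) lying in Omega(G_i)."""
    estimate = 2.0 ** n                       # |Omega(G_0)| = 2^n
    current = list(edges)
    while current:
        u, v = current.pop()                  # G_{i-1} = G_i minus this edge
        bigger = independent_sets(current, n) # Omega(G_{i-1})
        hits = sum(1 for _ in range(samples_per_ratio)
                   if not {u, v} <= rng.choice(bigger))
        estimate *= hits / samples_per_ratio  # estimate of rho_i
    return estimate

edges = [(0, 1), (1, 2), (2, 3)]              # a path on 4 vertices
true_count = len(independent_sets(edges, 4))  # 8 independent sets
approx = estimate_count(edges, 4, 4000, rng=random.Random(0))
```

Since each ρ_i ≥ 1/2, a few thousand samples per ratio already give a small relative error on this toy instance.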
The following lemma gives the basic probabilistic inequality we need.
Lemma 1.2.1 For i = 1, 2, ..., m let 0 ≤ Z_i ≤ 1 be independent random variables on the probability space (Ω_i, π_i), where E(Z_i) = μ_i and μ_min = min_i μ_i > 0.

For i = 1, 2, ..., m let Ẑ_i denote the same random variable on the probability space (Ω_i, π̂_i), where

    d_TV(π_i, π̂_i) ≤ εμ_min/(3m).

For i = 1, 2, ..., m let μ̂_i = E(Ẑ_i), let Ẑ_i^(1), ..., Ẑ_i^(s) be a sequence of

    s = ⌈17 m ε⁻² μ_min⁻²⌉

independent copies of the random variable Ẑ_i, and let Z̄_i = s⁻¹ Σ_{j=1}^s Ẑ_i^(j) be their mean. Let

    W = (Z̄_1 Z̄_2 ··· Z̄_m)/(μ_1 μ_2 ··· μ_m).

Then, for ε sufficiently small,

    Pr(|W − 1| ≥ ε) ≤ 1/4.
Proof Note first that for i = 1, 2, ..., m,

    |μ_i − μ̂_i| ≤ εμ_min/(3m)  and  Var(Ẑ_i) ≤ 1.    (1.6)

Let

    W̄ = (Z̄_1 Z̄_2 ··· Z̄_m)/(μ̂_1 μ̂_2 ··· μ̂_m).

Now E(W̄) = 1, and (1.6) implies

    (1 − ε/(3m))^m ≤ W/W̄ ≤ (1 + ε/(3m))^m.

So,

    |W/W̄ − 1| ≤ 2ε/5.    (1.7)
Furthermore,

    Var(W̄) = E( Π_{i=1}^m Z̄_i²/μ̂_i² ) − 1    (1.8)
           = Π_{i=1}^m ( 1 + Var(Z̄_i)/μ̂_i² ) − 1
           = Π_{i=1}^m ( 1 + Var(Ẑ_i)/(s μ̂_i²) ) − 1
           ≤ Π_{i=1}^m ( 1 + 1/(s μ̂_i²) ) − 1
           ≤ ( 1 + ε²/(17m) )^m − 1
           ≤ ε²/16.    (1.9)

Thus by (1.7) and (1.9), using Chebyshev's inequality,

    Pr(|W − 1| ≥ ε) ≤ Pr(|W̄ − 1| ≥ ε/2) ≤ 4ε⁻² Var(W̄) ≤ 1/4.  □
Suppose that the graphs G_i and G_{i−1} differ in the edge {u, v}, which is present in G_i but absent from G_{i−1}. Clearly, Ω(G_i) ⊆ Ω(G_{i−1}). Any independent set in Ω(G_{i−1}) \ Ω(G_i) contains u and v, and may be perturbed to an independent set in G_i by deleting vertex u. (To resolve ambiguity, let u be the smaller of the two vertices.) On the other hand, each independent set in G_i can be obtained in at most one way as the result of such a perturbation; hence |Ω(G_{i−1}) \ Ω(G_i)| ≤ |Ω(G_i)| and

    1/2 ≤ ρ_i ≤ 1.    (1.10)

To avoid trivialities, assume 0 < ε ≤ 1 and m ≥ 1. Let Z_i ∈ {0, 1} denote the random variable which results from choosing a random independent set from G_{i−1} and returning one if the resulting independent set is also independent in G_i and zero otherwise. Note that μ_i = E(Z_i) = ρ_i for i = 1, 2, ..., m. Let Ẑ_i denote the random variable which results from running the postulated almost uniform sampler on the graph G_{i−1} and returning one if the resulting independent set is also independent in G_i and zero otherwise. We take δ = ε/(6m) (in the sampler) and s = ⌈68 m ε⁻²⌉. Let Ẑ_i^(1), ..., Ẑ_i^(s) be a sequence of s independent copies of the random variable Ẑ_i. As our estimator for |Ω(G)|, we use the random variable Y = 2^n Z̄_1 Z̄_2 ··· Z̄_m. Applying Lemma 1.2.1 we see immediately that

    Pr( |Y/|Ω(G)| − 1| ≥ ε ) ≤ 1/4.
We use s = O(mε⁻²) samples to estimate each ρ_i, and the time bound claimed in the theorem follows. □
From approximate counting to good sampling
Theorem 1.2.2 Suppose that we have an FPRAS approxcount(G, ε, δ) for the number of independent sets of a graph G = (V, E) with maximum degree Δ, and suppose that approxcount(G, ε, δ) has time complexity T(n, ε, δ), where n = |V|, ε is the required maximum relative error and δ is the allowed probability of failure. Then we can construct a good sampler Ugen(G, δ) for the independent sets of G with maximum degree Δ, which has expected time complexity

    O( T(n, O(1/n), O(δ/n)) ).    (1.11)
Proof We will call our sampling procedure Ugen(G, δ): let

    δ₁ = δ/(2n + 1)  and  ε₁ = (log 2)/(3n).

Ugen(G, δ)
begin
    N = approxcount(G, ε₁, δ₁)
    Repeat I = Ugenx(G, ε₁, 1/(4N)) until I ≠ failure
    Output I
end
The procedure Ugenx has an extra parameter κ which is needed to control the rate of some rejection sampling. We define Ugenx recursively.
Ugenx(G, ε₁, κ)
begin
    If κ > 1 then output I = failure.
    If V = ∅ then output
        I = ∅          with probability κ
        I = failure    with probability 1 − κ
    else begin
        v = max V and X is the set of neighbours of v in G.
        G₁ = G − v − X and G₂ = G − v.
        N₁ = approxcount(G₁, ε₁, δ₁) and N₂ = approxcount(G₂, ε₁, δ₁).
        Output I =
            {v} ∪ Ugenx(G₁, ε₁, κ(N₁ + N₂)/N₁)    with probability N₁/(N₁ + N₂)
            Ugenx(G₂, ε₁, κ(N₁ + N₂)/N₂)          with probability N₂/(N₁ + N₂)
    end
end
For I ∈ Ω let p_I denote the probability that Ugenx(G, ε₁, κ) generates I, conditional on all calls to approxcount being successful. Then we will see that p_I ≥ κ and, at the bottom of the recursion, κ will have become κ/p_I, and so I will be output with (conditional) probability p_I · κ/p_I = κ, i.e. the conditional output is uniform.
Lemma 1.2.2 (a) The probability that approxcount gives a bad estimate during the execution of Ugen is at most (2n + 1)δ₁.

(b) If approxcount gives no bad estimates then κ ≤ 1 throughout the execution of Ugen.

(c) If approxcount gives no bad estimates then the probability that Ugenx outputs failure is at most 2/3.

(d) If approxcount gives no bad estimates then the output I is such that for any independent set I₀ of G we have Pr(I = I₀) = κ.

(e) Let μ be the distribution of the output I of Ugen and let π denote the uniform distribution on Ω. Then D_tv(μ, π) ≤ δ.
Proof (a) This is clear from the fact that we call approxcount at most 2n + 1 times during the execution of Ugen.

(b) If there is no bad estimate from approxcount then we claim, by induction on the depth of recursion d, that whenever we invoke Ugenx on a graph H, say, the current value of κ satisfies

    κ_d ≤ (1 + ε₁)^d / ( 4(1 − ε₁)^{d+1} |Ω(H)| ).

This is trivially true for d = 0, and assuming, say, that we recurse on H₁, we have in this call

    κ_{d+1} ≤ (1 + ε₁)^d / ( 4(1 − ε₁)^{d+1} |Ω(H)| ) · (N₁ + N₂)/N₁
            ≤ (1 + ε₁)^d / ( 4(1 − ε₁)^{d+1} |Ω(H)| ) · (1 + ε₁)|Ω(H)| / ( (1 − ε₁)|Ω(H₁)| )
            = (1 + ε₁)^{d+1} / ( 4(1 − ε₁)^{d+2} |Ω(H₁)| ),

as required. Thus throughout the execution of Ugen we have

    κ ≤ (1 + ε₁)^n / ( 4(1 − ε₁)^{n+1} ) < 1.
(c) We prove by induction on |V| that Pr(I = failure) = 1 − κ|Ω(G)|. This is clearly true if V = ∅. Otherwise

    Pr(I = failure) = ( N₁/(N₁ + N₂) ) ( 1 − κ ((N₁ + N₂)/N₁) |Ω(G₁)| ) + ( N₂/(N₁ + N₂) ) ( 1 − κ ((N₁ + N₂)/N₂) |Ω(G₂)| )
                    = 1 − κ|Ω(G)|.

Thus Pr(I = failure) ≤ 2/3 as required.

(d) This is clearly true if V = ∅. If V ≠ ∅ and v = max V ∈ I₀ then, by induction,

    Pr(I = I₀) = ( N₁/(N₁ + N₂) ) · κ(N₁ + N₂)/N₁ = κ,

and similarly Pr(I = I₀) = κ if v ∉ I₀.

(e) Let E denote the event that some output of approxcount is bad in the iteration that produces the output. Then for A ⊆ Ω,

    μ(A) − π(A) ≤ Pr(I ∈ A | ¬E) + Pr(E) − π(A) ≤ |A|/|Ω| + δ − |A|/|Ω| = δ.  □
We have therefore shown that by running Ugenx a constant expected number of times, we will with probability at least 1 − δ output a randomly chosen independent set. The expected running time of Ugen is clearly as given in (1.11), which is small enough to make it a good sampler.
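The self-reducibility behind Ugenx is perhaps easiest to see with exact counts in place of approxcount: with exact counts no rejection is needed, and the recursion below (a simplified sketch, not the theorem's algorithm; names are ours) outputs an exactly uniform independent set.

```python
import random

def count_is(adj, vertices):
    """Exact number of independent sets of the graph induced on `vertices`
    (stands in for approxcount; exponential time in general)."""
    vertices = list(vertices)
    if not vertices:
        return 1
    v, rest = vertices[0], vertices[1:]
    without_v = count_is(adj, rest)                               # I avoiding v
    with_v = count_is(adj, [w for w in rest if w not in adj[v]])  # I containing v
    return with_v + without_v

def uniform_is(adj, vertices, rng):
    """Sample a u.a.r. independent set by descending the recursion,
    branching with probability proportional to the two counts."""
    vertices = list(vertices)
    if not vertices:
        return frozenset()
    v, rest = vertices[0], vertices[1:]
    n1 = count_is(adj, [w for w in rest if w not in adj[v]])
    n2 = count_is(adj, rest)
    if rng.random() < n1 / (n1 + n2):
        return frozenset([v]) | uniform_is(adj, [w for w in rest if w not in adj[v]], rng)
    return uniform_is(adj, rest, rng)

# Triangle: the independent sets are {}, {0}, {1}, {2}.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
rng = random.Random(0)
counts = {}
for _ in range(8000):
    s = uniform_is(adj, [0, 1, 2], rng)
    counts[s] = counts.get(s, 0) + 1
```

The empirical frequencies should all be close to 1/4, since the triangle has exactly 4 independent sets.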
Having dealt with a specific example, we now see how to put the above ideas into a formal framework. Before doing this we enumerate some basic facts about Markov chains.
1.3 Markov Chains
Throughout, N = {0, 1, 2, ...}, N⁺ = N \ {0}, Q⁺ = {q ∈ Q : q > 0}, and [n] = {1, 2, ..., n} for n ∈ N⁺.

A Markov chain M on the finite state space Ω, with transition matrix P, is a sequence of random variables X_t, t = 0, 1, 2, ..., which satisfy

    Pr(X_t = σ | X_{t−1} = ω, X_{t−2}, ..., X_0) = P(ω, σ)    (t = 1, 2, ...).

We sometimes write P_{ω,σ}. The value of X_t is referred to as the state of M at time t.

Consider the digraph D_M = (Ω, A) where A = {(ω, σ) ∈ Ω × Ω : P(ω, σ) > 0}. We will by and large be concerned with chains that satisfy the following assumptions:

M1 The digraph D_M is strongly connected.

M2 gcd{ |C| : C is a directed cycle of D_M } = 1.
Under these assumptions, M is ergodic and therefore has a unique stationary distribution π, i.e.

    lim_{t→∞} Pr(X_t = σ | X_0 = ω) = π(σ),    (1.12)

i.e. the limit does not depend on the starting state X_0. Furthermore, π is the unique left eigenvector of P with eigenvalue 1, i.e. satisfying

    Pᵀπ = π.    (1.13)

Another useful fact is that if τ_ω denotes the expected number of steps between successive visits to state ω, then

    τ_ω = 1/π(ω).    (1.14)
In most cases of interest, M is reversible, i.e.

    Q(ω, σ) = π(ω)P(ω, σ) = π(σ)P(σ, ω)    (∀ ω, σ ∈ Ω).    (1.15)

The central role of reversible chains in applications rests on the fact that π can be deduced from (1.15). If π′ : Ω → R satisfies (1.15), then it determines π up to normalisation. Indeed, if (1.15) holds and Σ_ω π′(ω) = 1, then

    Σ_ω π′(ω)P(ω, σ) = Σ_ω π′(σ)P(σ, ω) = π′(σ),

which proves that π′ is a left eigenvector with eigenvalue 1.

In fact, we often design the chain to satisfy (1.15). Without reversibility, there is no apparent method of determining π, other than to explicitly construct the transition matrix, an exponential time (and space) computation in our setting.
As a canonical example of a reversible chain we have a random walk on a graph. A random walk on the undirected graph G = (V, E) is a Markov chain with state space V associated with a particle that moves from vertex to vertex according to the following rule: the probability of a transition from vertex i, of degree d_i, to vertex j is 1/d_i if {i, j} ∈ E, and 0 otherwise. Its stationary distribution is given by

    π(v) = d_v/(2|E|)    (v ∈ V).    (1.16)

To see this note that Q(v, w) = Q(w, v) = 0 if v, w are not adjacent and otherwise

    Q(v, w) = 1/(2|E|) = Q(w, v),

verifying the detailed balance equations (1.15).
Note that if G is a regular graph then the steady state is uniform over V.
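The stationary distribution (1.16) can be checked numerically: iterate p ← pP from any start until it stops changing. The sketch below uses a small hand-coded non-bipartite graph (our choice, purely for illustration).

```python
def walk_matrix(adj):
    """Transition matrix of the random walk: P(v, w) = 1/d_v for {v,w} in E."""
    return {v: {w: 1.0 / len(adj[v]) for w in adj[v]} for v in adj}

def iterate_distribution(adj, steps):
    """Power iteration p <- pP starting from a point mass."""
    P = walk_matrix(adj)
    p = {v: 1.0 if v == min(adj) else 0.0 for v in adj}
    for _ in range(steps):
        q = {v: 0.0 for v in adj}
        for v in adj:
            for w, pr in P[v].items():
                q[w] += p[v] * pr
        p = q
    return p

# Path 0-1-2-3 plus the edge {1, 3}: degrees 1, 3, 2, 2 (non-bipartite).
adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
p = iterate_distribution(adj, 500)
two_m = sum(len(adj[v]) for v in adj)        # = 2|E|
pi = {v: len(adj[v]) / two_m for v in adj}   # d_v / 2|E|
```

After 500 steps the iterated distribution agrees with π to high precision.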
If G is bipartite then the walk as described is not ergodic. This is because all cycles are of even length. This is usually handled by adding d_v loops to vertex v, for each vertex v. (Each loop counts as a single exit from v.) The net effect of this is to make the particle stay put with probability 1/2 at each step. The steady state is unaffected. The chain is now lazy.

A chain is lazy if P(ω, ω) ≥ 1/2 for all ω ∈ Ω.

If p_0(ω) = Pr(X_0 = ω), then p_t(σ) = Σ_ω p_0(ω)P^t(ω, σ) is the distribution at time t. As a measure of convergence, the natural choice in this context is variation distance.
The mixing time of the chain is then

    τ(ε) = max_{p_0} min{ t : D_tv(p_t, π) ≤ ε },

and it is easy to show that the maximum occurs when X_0 = ω_0 with probability one, for some state ω_0. This is because D_tv(p_t, π) is a convex function of p_0, and so the maximum of D_tv(p_t, π) occurs at an extreme point of the set of probabilities p_0.
We now provide a simple lemma which indicates that the variation distance D_tv(p_t, π) goes to zero exponentially. We define several related quantities: p_t^(i) denotes the t-step distribution, conditional on X_0 = i, and

    d_i(t) = D_tv(p_t^(i), π),   d(t) = max_i d_i(t),   d̄(t) = max_{i,j} D_tv(p_t^(i), p_t^(j)).

[Marginal note: I think this should be moved to the next chapter.]
Lemma 1.3.1 For all s, t ≥ 0:

(a) d̄(s + t) ≤ d̄(s) d̄(t).
(b) d(s + t) ≤ 2 d(s) d(t).
(c) d̄(s) ≤ 2 d(s).
(d) d(s) ≤ d(t) for s ≥ t.
Proof We will use the characterisation of variation distance as

    D_tv(μ_1, μ_2) = min Pr(X_1 ≠ X_2),    (1.17)

where the minimum is taken over pairs of random variables X_1, X_2 such that X_i has distribution μ_i, i = 1, 2.

Fix states i_1, i_2 and times s, t, and let Y¹, Y² denote the chains started at i_1, i_2 respectively. By (1.17) we can construct a joint distribution for (Y¹_s, Y²_s) such that

    Pr(Y¹_s ≠ Y²_s) = D_tv(p_s^(i_1), p_s^(i_2)) ≤ d̄(s).
Now for each pair j_1, j_2 we can use (1.17) to construct a joint distribution for (Y¹_{s+t}, Y²_{s+t}) such that

    Pr(Y¹_{s+t} ≠ Y²_{s+t} | Y¹_s = j_1, Y²_s = j_2) = D_tv(p_t^(j_1), p_t^(j_2)).

The RHS is 0 if j_1 = j_2 and otherwise at most d̄(t). So, unconditionally,

    Pr(Y¹_{s+t} ≠ Y²_{s+t}) ≤ d̄(s) d̄(t),

and (1.17) establishes part (a) of the lemma.

For part (b), the same argument, with Y² now being the stationary chain, shows

    d(s + t) ≤ d(s) d̄(t),    (1.18)

and so (b) will follow from (c), which follows from the triangle inequality for variation distance. Finally note that (d) follows from (1.18). □
We will for the most part use carefully defined Markov chains as our good samplers. As an example, we now define a simple chain with state space Ω equal to the collection of independent sets of a graph G. The chain is ergodic and its steady state is uniform over Ω. So, running the chain for sufficiently long will produce a near uniformly chosen independent set; see (1.12). Unfortunately, this chain does not have a small enough mixing time to qualify as a good sampler, unless Δ(G) ≤ 4.

We define the chain as follows: suppose X_t = I. Then we choose a vertex v of G uniformly at random. If v ∈ I then we put X_{t+1} = I \ {v}. If v ∉ I and I ∪ {v} is an independent set then we put X_{t+1} = I ∪ {v}. Otherwise we let X_{t+1} = X_t = I. Thus, with n = |V| and I, J independent sets of G, the transition matrix can be described as follows: for J ≠ I,

    P(I, J) = 1/n if |I ⊕ J| = 1, and 0 otherwise.

Here I ⊕ J denotes the symmetric difference (I \ J) ∪ (J \ I).

The chain satisfies M1 and M2: in D_M every vertex can reach, and is reachable from, ∅, implying that M1 holds. Also, D_M contains loops unless G has no edges. In both cases M2 holds trivially.

Note finally that P(I, J) = P(J, I), and so (1.15) holds with π(I) = 1/|Ω|. Thus the chain is reversible and the steady state is uniform.
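The insert/delete chain just defined is a few lines of code. The sketch below runs it on the 4-cycle (our choice of example graph) and compares the empirical distribution with uniform; the burn-in length is an arbitrary illustrative choice.

```python
import random

def is_step(I, adj, n, rng):
    """One step of the insert/delete chain on independent sets."""
    v = rng.randrange(n)                    # choose v u.a.r.
    if v in I:
        return I - {v}                      # delete v
    if all(w not in I for w in adj[v]):
        return I | {v}                      # insert v if still independent
    return I                                # blocked: stay put

# 4-cycle 0-1-2-3-0; it has 7 independent sets.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
rng = random.Random(0)
I = frozenset()
counts = {}
for t in range(70_000):
    I = frozenset(is_step(set(I), adj, 4, rng))
    if t >= 1000:                           # crude burn-in
        counts[I] = counts.get(I, 0) + 1
```

All 7 independent sets should appear with frequency close to 1/7, in line with the uniform steady state derived above.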
1.4 A formal computational framework
The sample spaces we have in mind are sets of combinatorial objects. However, in order to discuss the computational complexity of generation, it is necessary to consider a sequence of instances of increasing size. We therefore work within the following formal
framework. The models of computation are the Turing Machine (TM) for deterministic computations and the Probabilistic Turing Machine (PTM) for randomized computations. (A PTM is a TM with a source of uniform and independent random bits.) We must confine ourselves to some class of distributions which are easily described, from a computational viewpoint, in large instances. We identify this below with a class of unnormalized measures which we call weight functions.

Let Σ be a fixed alphabet of at least two symbols, and let W : Σ* × Σ* → N be such that, for some polynomial b, W(x, w) = 0 unless |w| ≤ b(|x|). Moreover W(x, w) must be computable in time polynomial in |x| whenever W(x, w) > 0. (If the TM for W may ignore part of its input, this implies that W is always computable in polynomial time.) Let us call W a weight function. Here x may be thought of as an encoding of an instance of some combinatorial problem, and the w of interest are encodings of the structures we wish to generate.
Let Ω_x = {w : W(x, w) > 0}. Then the sequence of discrete probability spaces determined by W is (Ω_x, π_x), where π_x is the density

    π_x(w) = W(x, w)/Z(x),   with   Z(x) = Σ_w W(x, w)

being the corresponding normalising function. It is easy to see that the class of normalising functions so defined is essentially Valiant's class #P. The definition implies that, for some fixed c ∈ N, |Ω_x| ≤ Z(x) ≤ 2^{|x|^c}. If Z(x) = 0, then Ω_x = ∅ and π_x is the unique (improper) measure on Ω_x.
In our definition, two distinct weight functions may define the same sequence of spaces. Therefore let us say weight functions W₁, W₂ are equivalent if there exists φ : Σ* → Q⁺ so that W₂(x, w) = φ(x)W₁(x, w) for all x, w ∈ Σ*. Then there is a bijection between sequences of probability spaces (Ω_x, π_x) and equivalence classes of weight functions. Thus, if we write W̃ for the equivalence class containing W, we may identify it with the sequence (Ω_x, π_x).
We insist that sample spaces are discrete, and weight functions are integer valued. Computationally, discrete spaces are essential. If we wish to work with continuous spaces, then approximations must be made to some predetermined number of bits. The same is true if we are interested in real-valued densities (as in some statistical applications). However, the effect of such approximations can be absorbed into the variation distance of the sampling procedure. The reader may still wonder why we require W to have codomain N rather than Q, which would seem more natural. This is because we use unnormalised measures, and we wish to avoid the following technical difficulty. In a large sample space it is possible to specify polynomial size rationals for the unnormalised measure which result in exponential size rationals for the probabilities. An example is the set [2^n], with the measure assigning probability proportional to 1/i to i ∈ [2^n]. In such spaces there is no possibility of exact sampling in sub-exponential expected time, and
we must accept approximations. We prefer not to deal with these anomalous spaces, but to insist that these approximations be made explicit. Thus, in this example we could use weights ⌈K/i⌉ for some suitably large integer K.

A fully polynomial approximate sampler (which we shorten to good sampler) for (Ω_x, π_x) is a PTM which, on inputs x ∈ Σ* and ε ∈ Q⁺ (0 < ε ≤ 1), outputs w ∈ Σ*, according to a measure μ_x satisfying D_tv(μ_x, π_x) ≤ ε, in time bounded by a bivariate polynomial in |x|, log ε⁻¹. We allow w ∉ Ω_x. If Ω_x = ∅, the algorithm does not terminate within its time bound. However, this can be detected, and we may construct a polynomial time algorithm which terminates either with a random w ∈ Ω_x or a proof that Ω_x is empty.
Our real interest here is in combinatorial Markov chains, which we define as follows. Let M : Σ* × Σ* × Σ* → N and define

    R_x = {(w, w′) : M(x, w, w′) > 0},   Ω_x = {w : ∃w′ with (w, w′) ∈ R_x}.

Let M have the following properties.

(a) There is a polynomial b such that |w|, |w′| ≤ b(|x|) if M(x, w, w′) > 0, and M is computable in time polynomial in |x| whenever M(x, w, w′) > 0.

(b) There exist constants K(x) ∈ N⁺, of polynomial size, such that

    Σ_{w′} M(x, w, w′) = K(x)    (w ∈ Ω_x).

(c) The transitive closure of R_x is Ω_x × Ω_x, and for some w ∈ Ω_x, (w, w) ∈ R_x.

(d) Writing M_w(x, w′) = M(x, w, w′) (w ∈ Ω_x), it follows from (a) that M_w is a weight function. We require that there is a good sampler for M_w (w ∈ Ω_x).
We call M a density matrix, and associate with it a sequence of Markov chains M̃ = (Ω_x, P_x), with transition matrices

    P_x(w₁, w₂) = M(x, w₁, w₂)/K(x)    (w₁, w₂ ∈ Ω_x).

Properties (a) and (c) ensure that M̃ is finite and ergodic. Property (d) ensures that we can efficiently simulate M̃ to a close approximation for any given number of steps. Property (b) ensures that polynomial powers of the transition matrix cannot generate rationals of superpolynomial size, and hence the state probabilities at any polynomial time cannot be rationals of superpolynomial size. We include this property since we do not wish to preclude exact generation using Markov chains. In any case, this condition can always be satisfied to any desired approximation, and is usually satisfied naturally. There is little loss in restricting K(x) to be a power of 2. If any such K(x) exist, it is easy to show that there is a chain with the same stationary distribution and K a
power of 2, simply by increasing the self-loop probability on all states. Since we areinterested in the stationary distribution, we can use this slightly slower chain. Thus wemay insist on K being a power of 2 where convenient.
Density matrices M₁, M₂ are equivalent if there exists φ : Σ* → Q⁺ such that M₂(x, w, w′) = φ(x)M₁(x, w, w′) for all x, w, w′. We can identify the equivalence class M̃ with the sequence of Markov chains M_x. We say that M̃ is a rapidly mixing Markov chain if its mixing time τ_x(ε) is bounded by a polynomial in |x|, log ε⁻¹.

If M̃ is a Markov chain sequence, let π_x denote the stationary distribution of M_x. Then, if W is a weight function, M is a Monte Carlo Markov chain (MCMC) for W if both W and M determine the same sequence of probability spaces (Ω_x, π_x). (This slight overloading of the MCMC abbreviation should not cause confusion.) The usual way to establish this is by reversibility, i.e. W(x, w)M(x, w, w′) = W(x, w′)M(x, w′, w) for all x and w, w′ ∈ Σ*. Clearly we have a good sampler for W if M̃ is a rapidly mixing Markov chain.
One of the main applications of sampling is to approximate integration. In our setting this means estimating Z(x) to some specified relative error. In the important case where W is a characteristic function, we call the approximate integration problem approximate counting. Specifically, a fully polynomial randomized approximation scheme (FPRAS) for Z(x) is a PTM which on input x, ε outputs Ẑ so that

    Pr( 1/(1 + ε) ≤ Ẑ/Z ≤ 1 + ε ) ≥ 3/4,

and which runs in time polynomial in |x| and 1/ε.
The success probability can be increased to 1 − δ by taking the median of O(log δ⁻¹) samples; see (1.2).
Let size : Σ* → N be such that size(x) is polynomially bounded in |x|.

Chapter 2
Bounding the Mixing Time

Let the eigenvalues of P be 1 = λ_0 > λ_1 ≥ ··· ≥ λ_{N−1}; they are all real valued. Let λ_max = max{|λ_i| : i > 0}. The fact that λ_max < 1 is a classical result of the theory of non-negative matrices. The spectral gap 1 − λ_max determines the mixing rate of the chain in an essential way. The larger it is, the more rapidly does the chain mix. For U ⊆ Ω let

    Δ_U(t) = max_{i,j∈U} |P^t(i, j) − π(j)| / π(j).
Theorem 2.1.1 For all U ⊆ Ω and t ≥ 0,

    Δ_U(t) ≤ λ_max^t / min_{i∈U} π(i).
Proof Let D^{1/2} be the diagonal matrix with diagonal entries √π(ω), ω ∈ Ω, and let D^{−1/2} be its inverse. Then the reversibility of the chain (1.15) implies that the matrix S = D^{1/2} P D^{−1/2} is symmetric. It has the same eigenvalues as P, and its symmetry means that these are all real. We can select an orthonormal basis of column vectors e^(i), i ∈ Ω, for R^Ω consisting of left eigenvectors of S, where e^(i) has associated eigenvalue λ_i and (e^(0))ᵀ = πᵀD^{−1/2}. S has the spectral decomposition

    S = Σ_{i=0}^{N−1} λ_i e^(i) (e^(i))ᵀ = Σ_{i=0}^{N−1} λ_i E^(i).
Theorem 2.1.2 If α_i, i = 1, 2, ..., m and β_j, j = 1, 2, ..., n are the eigenvalues of the matrices A_{G₁}, A_{G₂} respectively, then the eigenvalues of A_G are {α_i + β_j : 1 ≤ i ≤ m, 1 ≤ j ≤ n}.

Proof A_G can be obtained from A_{G₁} by replacing each 1 by the |V₂| × |V₂| identity matrix I₂, each off-diagonal 0 by the |V₂| × |V₂| matrix of 0s, and each diagonal entry by A_{G₂}. So if p_G(λ) = det(λI − A_G) then

    p_G(λ) = det p_{G₁}(λI₂ − A_{G₂}).

This follows from the following: suppose the mn × mn matrix A is decomposed into an m × m matrix of n × n blocks A_{i,j}, and suppose also that the A_{i,j} commute among themselves. Then

    det A = det( Σ_{σ∈S_m} sgn(σ) Π_{i=1}^m A_{i,σ(i)} ),

i.e. one can produce an n × n matrix by a determinant calculation on the blocks and then take its determinant. [Needs a proof.]

So

    p_G(λ) = det Π_{i=1}^m (λI₂ − A_{G₂} − α_i I₂) = Π_{i=1}^m p_{G₂}(λ − α_i) = Π_{i=1}^m Π_{j=1}^n (λ − α_i − β_j).  □
The eigenvalues of K₂ are {−1, 1}, and applying (2.2) we see that the eigenvalues of Q_n are {n − 2i : i = 0, 1, 2, ..., n} (ignoring multiplicities). To get the eigenvalues for our random walk we (i) divide by n and then (ii) replace each eigenvalue λ by (1 + λ)/2 to account for adding loops. Thus the second eigenvalue of the walk is 1 − 1/n.

Applying Corollary 2.1.1 we obtain τ(ε) ≤ n log ε⁻¹ + O(n²). This is a poor estimate, due to our use of the Cauchy-Schwarz inequality in the proof of Theorem 2.1.1. We get an easier and better estimate by using coupling.
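The eigenvalue claim can be verified directly without any linear algebra library: for each A ⊆ [n], the parity function χ_A(x) = (−1)^{|x ∩ A|} is an eigenvector of the lazy walk on Q_n with eigenvalue 1 − |A|/n. The brute-force check below (our construction, here with n = 4) applies one step of the walk to each parity function and reads off the ratio.

```python
from itertools import product

n = 4
cube = list(product([0, 1], repeat=n))

def lazy_step(f):
    """Apply the lazy-walk transition matrix of Q_n to a function f:
    (Pf)(x) = f(x)/2 + (1/2n) * sum of f over the n neighbours of x."""
    g = {}
    for x in cube:
        nbr_sum = sum(f[x[:i] + (1 - x[i],) + x[i + 1:]] for i in range(n))
        g[x] = 0.5 * f[x] + nbr_sum / (2 * n)
    return g

def parity(A):
    """chi_A(x) = (-1)^{|x intersect A|}."""
    return {x: (-1) ** sum(x[i] for i in A) for x in cube}

eigs = []
for k in range(n + 1):
    chi = parity(tuple(range(k)))   # any A with |A| = k gives the same eigenvalue
    Pchi = lazy_step(chi)
    eigs.append({Pchi[x] / chi[x] for x in cube})
```

Each ratio set collapses to the single value 1 − k/n, confirming the spectrum 1, 1 − 1/n, ..., 0 of the lazy walk.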
2.1.1 Decomposition Theorem
2.2 Conductance

The conductance of M is defined by

    Φ = min{ Φ_S : S ⊆ Ω, 0 < π(S) ≤ 1/2 },

where, with Q(ω, σ) = π(ω)P(ω, σ) and S̄ = Ω \ S,

    Φ_S = π(S)⁻¹ Q(S, S̄).
Thus Φ_S is the steady state probability of moving from S to S̄ in one step of the chain, conditional on being in S. Clearly Φ ≤ 1/2 if M is lazy.
Note that

    Φ_S π(S) = Q(S, S̄) = Q(S̄, S) = Φ_{S̄} π(S̄).    (2.3)

Indeed,

    Q(S̄, S) = Q(Ω, S) − Q(S, S) = π(S) − Q(S, S) = Q(S, S̄).

Let π_min = min{π(ω) : ω ∈ Ω} > 0 and π_max = max{π(ω) : ω ∈ Ω}.
2.2.1 Reversible Chains
In this section we show how conductance gives us an estimate of the spectral gap of a reversible chain.
Lemma 2.2.1 If M is lazy and ergodic then all eigenvalues are positive.

Proof Q = 2P − I ≥ 0 is stochastic and has eigenvalues μ_i = 2λ_i − 1, i = 0, 1, ..., N − 1. The result follows from μ_i > −1, i = 0, 1, ..., N − 1. □

For y ∈ R^N let

    E(y, y) = Σ_{i<j} π_i P_{i,j} (y_i − y_j)².
Now

    yᵀD(I − P)y = −Σ_{i≠j} y_i y_j π_i P_{i,j} + Σ_i π_i (1 − P_{i,i}) y_i²
                = −Σ_{i≠j} y_i y_j π_i P_{i,j} + Σ_{i≠j} π_i P_{i,j} (y_i² + y_j²)/2
                = Σ_{i<j} π_i P_{i,j} (y_i − y_j)² = E(y, y).
We verify (2.6) later.
Proof Lemma 2.2.1 implies that λ_1 = λ_max, and then, by Theorem 2.2.1,

    1/log λ_max⁻¹ ≤ 1/log(1 − Φ²/2)⁻¹ ≤ 2Φ⁻².  □
Now consider the conductance of a random walk on a graph G = (V, E). For S, T ⊆ V let E(S, T) = {(v, w) ∈ E : v ∈ S, w ∈ T} and e(S, T) = |E(S, T)|. Then, by definition,

    Φ_S = ( Σ_{(v,w)∈E(S,S̄)} (d_v/2|E|)(1/d_v) ) / ( Σ_{v∈S} d_v/2|E| ) = e(S, S̄) / Σ_{v∈S} d_v.

In particular, when G is an r-regular graph,

    Φ = r⁻¹ min_{|S| ≤ |V|/2} e(S, S̄)/|S|.    (2.8)
The minimand above is referred to as the expansion of G. Thus graphs with good expansion (expander graphs) have large conductance, and random walks on them mix rapidly.
As an example consider the n-cube Q_n. For S ⊆ X_n let in(S) denote the number of edges of Q_n which are wholly contained in S.

Lemma 2.2.3 If ∅ ≠ S ⊆ X_n then in(S) ≤ (1/2)|S| log₂ |S|.
Proof We prove this by induction on n. It is trivial for n = 1. For n > 1 let S_i = {x ∈ S : x_n = i} for i = 0, 1. Then

    in(S) ≤ in(S_0) + in(S_1) + min{|S_0|, |S_1|},

since the term min{|S_0|, |S_1|} bounds the number of edges which are contained in S and join S_0 to S_1. The lemma follows from the inequality

    x log₂ x + y log₂ y + 2y ≤ (x + y) log₂(x + y)

for all x ≥ y ≥ 0. The proof is left as a simple exercise in calculus. □

By summing the degrees at each vertex of S we see that

    e(S, S̄) + 2 in(S) = n|S|.

By the above lemma we have

    e(S, S̄) ≥ n|S| − |S| log₂ |S| ≥ |S|,
assuming |S| ≤ 2^{n−1}. It follows from (2.8) that Φ ≥ 1/n. Adding the self-loops to delay the walk will halve the conductance: the denominator Σ_{v∈S} d_v doubles without changing the numerator in the definition of Φ_S. This gives us the estimate of 1/(8n²) for the spectral gap, which is off by a factor of n; see Section 2.1.
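Lemma 2.2.3 and the resulting bound e(S, S̄) ≥ |S| can be checked exhaustively for small n (a brute-force sketch of our own; the loop is over all non-empty subsets of the vertex set, so keep n tiny).

```python
from itertools import combinations, product
import math

n = 3
cube = list(product([0, 1], repeat=n))

def neighbours(x):
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(n)]

def internal_edges(S):
    """in(S): edges of Q_n wholly contained in S (each counted once)."""
    return sum(1 for x in S for y in neighbours(x) if y in S) // 2

ok_lemma = ok_cut = True
for r in range(1, 2 ** n + 1):
    for S in combinations(cube, r):
        S = set(S)
        # Lemma 2.2.3: in(S) <= (1/2)|S| log2 |S|
        if internal_edges(S) > 0.5 * len(S) * math.log2(len(S)):
            ok_lemma = False
        # e(S, S-bar) = n|S| - 2 in(S) >= |S| whenever |S| <= 2^(n-1)
        cut = n * len(S) - 2 * internal_edges(S)
        if len(S) <= 2 ** (n - 1) and cut < len(S):
            ok_cut = False
```

Equality in the lemma is attained exactly by subcubes, which is why the bound is tight.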
We finish this section by proving a sort of converse to Theorem 2.2.1.
Theorem 2.2.2 If M is a reversible chain then

    λ_1 ≥ 1 − 2Φ.

Proof We use Lemma 2.2.2. Let S be a set of states which minimises Φ_S, and define y by y_j = 1/π(S) if j ∈ S and y_j = −1/π(S̄) if j ∈ S̄. It is easy to check that πᵀy = 0. Then

    E(y, y) = ( 1/π(S) + 1/π(S̄) )² Q(S, S̄)   and   Σ_i π_i y_i² = 1/π(S) + 1/π(S̄).

Thus

    1 − λ_1 ≤ Φ_S π(S) ( 1/π(S) + 1/π(S̄) ) ≤ 2Φ_S = 2Φ.  □
2.2.2 General Chains
Theorem 2.2.3 Suppose that M is lazy. Then

    |p_t(ω) − π(ω)| ≤ π_min^{−1/2} (1 − Φ²/2)^t.
Proof For 0 ≤ x ≤ 1 let

    h_t(x) = max{ Σ_ω (p_t(ω) − π(ω)) φ(ω) : φ : Ω → [0, 1], Σ_ω π(ω)φ(ω) = x }.

By putting φ = 1_S in the above definition we see that

    p_t(S) − π(S) ≤ h_t(π(S))

for all S ⊆ Ω.
So, in particular, p_t(ω) − π(ω) ≤ h_t(π(ω)) (taking x = π(ω)) and π(ω) − p_t(ω) ≤ h_t(1 − π(ω)) (taking x = 1 − π(ω)), and so

    |p_t(ω) − π(ω)| ≤ min{ h_t(π(ω)), h_t(1 − π(ω)) }.    (2.9)

Order the elements Ω = {ω_1, ω_2, ..., ω_N} so that

    p_t(ω_1)/π(ω_1) ≥ p_t(ω_2)/π(ω_2) ≥ ··· ≥ p_t(ω_N)/π(ω_N),

and let x_k = Σ_{i=1}^k π(ω_i). Given x, find the index k such that x_{k−1} ≤ x < x_k. Then

    h_t(x) = Σ_{i=1}^{k−1} (p_t(ω_i) − π(ω_i)) + ( (x − x_{k−1})/π(ω_k) ) (p_t(ω_k) − π(ω_k)).

This is because putting

    φ(ω_i) = 1 (i < k),   φ(ω_k) = (x − x_{k−1})/π(ω_k),   φ(ω_i) = 0 (i > k)

yields an optimal basic feasible solution to the linear program in the definition of h_t(x).
It follows that ht(x) is a concave piece-wise linear function on the interval [0, 1] withbreakpoints at 0 = 0 < 1 < < N = 1. Trivially, ht(0) = ht(1) = 0 and 0 ht(x) 1 for all t and x.
Now let
C = max
h0(x)
min
x,
1 x : 0 < x < 1
.
If a,b,c,d 0 then the function f() = (a + b)/c + d is monotone on [0, 1] and sothe value of x defining C must occur at one of the breakpoints of ht. It follows easilythat
C maxS
|0(S) (S)|min
(S),
1 (S)
= max
S
(S)1/2
|0(S) (S)|
(S) (2.10) 1
min.
(The second equation comes from considering \ S when (S) 1/2.)We now prove that for t 1 and x {0, 1, . . . , N},
ht(x) 12 (ht1(x 2min{x, 1 x}) + ht1(x + 2min{x, 1 x})) (2.11)
Fix $k$ and let $u_i = \sum_{j \le k}p_{i,j}$, where $p_{i,j} = P(\omega_i, \omega_j)$. Clearly
$$1 \ge u_i \ge p_{i,i} \ge \tfrac12 \ (i \le k) \quad\text{and}\quad 0 \le u_i \le 1 - p_{i,i} \le \tfrac12 \ (i > k).$$
Now
$$\pi(\omega_j) = \sum_{i=1}^{N}\pi(\omega_i)p_{i,j} \quad\text{and}\quad \sum_{i=1}^{N}p_{t-1}(\omega_i)p_{i,j} = p_t(\omega_j),$$
and so if $x = \xi_k$,
$$h_t(x) = \sum_{j=1}^{k}(p_t(\omega_j) - \pi(\omega_j)) = \sum_{j=1}^{k}\sum_{i=1}^{N}p_{i,j}(p_{t-1}(\omega_i) - \pi(\omega_i)) = \sum_{i=1}^{N}(p_{t-1}(\omega_i) - \pi(\omega_i))u_i. \tag{2.12}$$
Moreover, $0 \le u_i \le 1$ and
$$\sum_{i=1}^{N}\pi(\omega_i)u_i = \sum_{i=1}^{N}\sum_{j=1}^{k}\pi(\omega_i)p_{i,j} = \sum_{j=1}^{k}\sum_{i=1}^{N}\pi(\omega_i)p_{i,j} = \sum_{j=1}^{k}\pi(\omega_j) = x. \tag{2.13}$$
Now let
$$u_i' = \begin{cases} 2u_i - 1 & i \le k \\ 0 & i > k \end{cases} \qquad\text{and}\qquad u_i'' = \begin{cases} 1 & i \le k \\ 2u_i & i > k. \end{cases}$$
Then $0 \le u_i', u_i'' \le 1$ and $u_i' + u_i'' = 2u_i$. Let $x' = \sum_{i=1}^{N}\pi(\omega_i)u_i'$ and $x'' = \sum_{i=1}^{N}\pi(\omega_i)u_i''$. Then (2.13) implies $x' + x'' = 2x$ and so by (2.12)
$$h_t(x) = \frac12\left(\sum_{i=1}^{N}(p_{t-1}(\omega_i) - \pi(\omega_i))u_i' + \sum_{i=1}^{N}(p_{t-1}(\omega_i) - \pi(\omega_i))u_i''\right) \le \frac12 h_{t-1}(x') + \frac12 h_{t-1}(x'').$$
Furthermore,
$$x - x' = \sum_{i=1}^{N}\pi(\omega_i)(u_i - u_i') = \sum_{i=1}^{k}\pi(\omega_i)(1 - u_i) + \sum_{i=k+1}^{N}\pi(\omega_i)u_i = \sum_{i=1}^{k}\sum_{j=k+1}^{N}\pi(\omega_i)p_{i,j} + \sum_{i=k+1}^{N}\sum_{j=1}^{k}\pi(\omega_i)p_{i,j} \ge 2\Phi\min\{x, 1-x\},$$
using (2.3) and the definition of $\Phi$. So $x' \le x - 2\Phi\min\{x, 1-x\}$, and similarly $x'' \ge x + 2\Phi\min\{x, 1-x\}$. (2.11) now follows from the concavity of $h_{t-1}$.

Now consider $x, k$ such that $\xi_{k-1} < x < \xi_k \le \frac12$, and write $x = \beta\xi_{k-1} + (1-\beta)\xi_k$. Then
$$h_t(x) = \beta h_t(\xi_{k-1}) + (1-\beta)h_t(\xi_k) \le \tfrac12\big(\beta(h_{t-1}(\xi_{k-1}(1-2\Phi)) + h_{t-1}(\xi_{k-1}(1+2\Phi))) + (1-\beta)(h_{t-1}(\xi_k(1-2\Phi)) + h_{t-1}(\xi_k(1+2\Phi)))\big)$$
$$= \tfrac12\big(\beta h_{t-1}(\xi_{k-1}(1-2\Phi)) + (1-\beta)h_{t-1}(\xi_k(1-2\Phi))\big) + \tfrac12\big(\beta h_{t-1}(\xi_{k-1}(1+2\Phi)) + (1-\beta)h_{t-1}(\xi_k(1+2\Phi))\big)$$
$$\le \tfrac12\big(h_{t-1}(x(1-2\Phi)) + h_{t-1}(x(1+2\Phi))\big)$$
from the concavity of $h_{t-1}$. Thus (2.11) holds for such an $x$. A similar argument shows that (2.11) holds for $x, k$ such that $\frac12 \le \xi_{k-1} < x < \xi_k$. So let $k^*$ be such that $\xi_{k^*-1} \le \frac12 < \xi_{k^*}$ and suppose $\xi_{k^*-1} < x < \xi_{k^*}$. For such $x$ we can only prove
$$h_t(x) \le \tfrac12\big(h_{t-1}(x - \Phi_x x) + h_{t-1}(x + \Phi_x x)\big) \tag{2.14}$$
where $\Phi_x \ge 2\Phi - 4\pi_{\max}$. Indeed, writing $x = \beta\xi_{k^*-1} + (1-\beta)\xi_{k^*}$,
$$h_t(x) = \beta h_t(\xi_{k^*-1}) + (1-\beta)h_t(\xi_{k^*}) \le \tfrac12\big(\beta(h_{t-1}(\xi_{k^*-1}(1-2\Phi)) + h_{t-1}(\xi_{k^*-1}(1+2\Phi))) + (1-\beta)(h_{t-1}(\xi_{k^*} - 2\Phi(1-\xi_{k^*})) + h_{t-1}(\xi_{k^*} + 2\Phi(1-\xi_{k^*})))\big)$$
$$\le \tfrac12\big(h_{t-1}(x - 2\Phi(x - (1-\beta)(2\xi_{k^*}-1))) + h_{t-1}(x + 2\Phi(x - (1-\beta)(2\xi_{k^*}-1)))\big).$$
Thus (2.14) holds with
$$\Phi_x = 2\Phi\,\frac{x - (1-\beta)(2\xi_{k^*}-1)}{x} \ge 2\Phi\left(1 - \frac{2\xi_{k^*}-1}{x}\right) \ge 2\Phi - 4\pi_{\max},$$
since $2\xi_{k^*} - 1 \le 2\pi(\omega_{k^*}) \le 2\pi_{\max}$ and $x > \xi_{k^*-1} \ge \frac12 - \pi_{\max}$, and (2.14) follows.

Combining (2.11) and (2.14) we get that for $0 \le x \le 1$,
$$h_t(x) \le \tfrac12\big(h_{t-1}(x - \Phi_x\min\{x, 1-x\}) + h_{t-1}(x + \Phi_x\min\{x, 1-x\})\big). \tag{2.15}$$
We now prove inductively that
$$h_t(x) \le C\min\{\sqrt{x}, \sqrt{1-x}\}\left(1 - \frac{\Phi^2}{2}\right)^t. \tag{2.16}$$
For $t = 0$, (2.16) follows trivially from the definition of $C$. Let $t \ge 1$ and suppose for example $0 \le x \le \frac12$. Then (2.15) implies
$$h_t(x) \le \tfrac12\,C\left(1 - \frac{\Phi^2}{2}\right)^{t-1}\left(\sqrt{x - \Phi_x x} + \sqrt{x + \Phi_x x}\right) = C\left(1 - \frac{\Phi^2}{2}\right)^{t-1}\sqrt{x}\cdot\frac{\sqrt{1 - \Phi_x} + \sqrt{1 + \Phi_x}}{2}.$$
The last factor can be estimated by
$$\tfrac12\left(\sqrt{1-\Phi_x} + \sqrt{1+\Phi_x}\right) = \tfrac12\left(\sum_{r=0}^{\infty}(-1)^r\binom{1/2}{r}\Phi_x^r + \sum_{r=0}^{\infty}\binom{1/2}{r}\Phi_x^r\right) = \sum_{r=0}^{\infty}\binom{1/2}{2r}\Phi_x^{2r} = 1 - \tfrac18\Phi_x^2 - \tfrac{5}{128}\Phi_x^4 - \cdots \le 1 - \frac{\Phi^2}{2}.$$
This completes the induction for $x \le \frac12$. For $x > \frac12$ we put $x = 1 - y$ and define $\hat h_t(y) = h_t(1-y)$. Then (2.15) gives
$$\hat h_t(y) \le \tfrac12\big(\hat h_{t-1}(y - \Phi_x y) + \hat h_{t-1}(y + \Phi_x y)\big),$$
from which we obtain
$$\hat h_t(y) \le C\sqrt{y}\left(1 - \frac{\Phi^2}{2}\right)^t$$
as before. $\Box$
Suppose now that we define the following distance $M$ between measures $\sigma$ and $\pi$ on the space $\Omega$:
$$M(\sigma, \pi) = \max_{\emptyset \ne A \subseteq \Omega}\frac{|\sigma(A) - \pi(A)|}{\sqrt{\pi(A)}}. \tag{2.17}$$

Corollary 2.2.1 Let a lazy ergodic Markov chain with steady state $\pi$ be started with distribution $\sigma_0$, and let $\sigma_t$ denote the distribution after $t$ steps. If $\pi_{\max} \le \frac{\Phi^2}{20}$ then
$$M(\sigma_t, \pi) \le M(\sigma_0, \pi)\left(1 - \frac{\Phi^2}{2}\right)^t.$$

Proof Fix $S \subseteq \Omega$. It follows from (2.10) and (2.16) that
$$\frac{|\sigma_t(S) - \pi(S)|}{\sqrt{\pi(S)}} \le M(\sigma_0, \pi)\left(1 - \frac{\Phi^2}{2}\right)^t. \qquad\Box$$
2.2.3 Path Congestion
Suppose that for each pair $(x, y) \in \Omega \times \Omega$ we have a canonical path $\gamma_{xy}$ from $x$ to $y$ in the digraph $D_{\mathcal M} = (\Omega, A)$ (defined in Section 1.3). Let
$$\rho = \max_{e \in A}\frac{1}{Q(e)}\sum_{\gamma_{xy} \ni e}\pi(x)\pi(y)|\gamma_{xy}|,$$
where if $e = (\eta, \eta')$ then $Q(e) = \pi(\eta)P(\eta, \eta')$, and $|\gamma_{xy}|$ is the number of arcs in $\gamma_{xy}$.

Theorem 2.2.4 Assume that $\mathcal M$ is reversible. Then
$$\lambda_1 \le 1 - \frac{1}{\rho}.$$
Proof We use Lemma 2.2.2. Assume $\sum_i \pi_i y_i = 0$. Then
$$2\sum_{i=1}^{N}\pi_i y_i^2 = \sum_{i=1}^{N}\sum_{j=1}^{N}\pi_i\pi_j(y_i - y_j)^2 = \sum_{i=1}^{N}\sum_{j=1}^{N}\pi_i\pi_j\left(\sum_{e \in \gamma_{ij}}(y_{e^+} - y_{e^-})\right)^2,$$
where edge $e = (e^-, e^+)$,
$$\le \sum_{i=1}^{N}\sum_{j=1}^{N}\pi_i\pi_j|\gamma_{ij}|\sum_{e \in \gamma_{ij}}(y_{e^+} - y_{e^-})^2 \qquad\text{by Cauchy--Schwarz}$$
$$= \sum_{e \in A}(y_{e^+} - y_{e^-})^2\sum_{\gamma_{xy} \ni e}\pi_x\pi_y|\gamma_{xy}| \le \rho\sum_{e \in A}(y_{e^+} - y_{e^-})^2 Q(e) = 2\rho\,\mathcal E(y, y).$$
Thus $\sum_i\pi_i y_i^2 \le \rho\,\mathcal E(y, y)$, and the theorem follows from Lemma 2.2.2. $\Box$
This theorem often gives stronger bounds on the spectral gap than Theorem 2.2.1. We apply it now to our example of a random walk $W_n$ on the cube.

Let $x = (x_0, x_1, \ldots, x_{n-1})$ and $y = (y_0, y_1, \ldots, y_{n-1})$ be arbitrary members of $X_n$. The canonical path $\gamma_{xy}$ from $x$ to $y$ is composed of $n$ edges, numbered $0$ to $n-1$, where edge $i$ is simply
$$\big((y_0, \ldots, y_{i-1}, x_i, x_{i+1}, \ldots, x_{n-1}),\ (y_0, \ldots, y_{i-1}, y_i, x_{i+1}, \ldots, x_{n-1})\big),$$
i.e., we change the $i$th component from $x_i$ to $y_i$. Note that some of the edges may be loops (if $x_i = y_i$). To compute $\rho$, fix attention on a particular (oriented) edge
$$t = (w, w') = \big((w_0, \ldots, w_i, \ldots, w_{n-1}),\ (w_0, \ldots, \bar w_i, \ldots, w_{n-1})\big),$$
and consider the number of canonical paths $\gamma_{xy}$ that include $t$. The number of possible choices for $x$ is $2^i$, as the final $n - i$ positions are determined by $x_j = w_j$ for $j \ge i$; and by a similar argument the number of possible choices for $y$ is $2^{n-i-1}$. Thus the total number of canonical paths using a particular edge $t$ is $2^{n-1}$; furthermore, $Q(w, w') = \pi(w)P(w, w') \ge 2^{-n}(2n)^{-1}$, and the length of every canonical path is exactly $n$. Plugging all these bounds into the definition of $\rho$ yields $\rho \le n^2$. Thus, by Theorem 2.2.4, the mixing time of $W_n$ is $\tau(\varepsilon) \le n^2(n\ln 2 + \ln\varepsilon^{-1})$.
2.2.4 Comparison Theorems
2.2.5 Decomposition Theorem
2.3 Coupling
A coupling $C(\mathcal M)$ for $\mathcal M$ is a stochastic process $(X_t, Y_t)$ on $\Omega \times \Omega$ such that each of $X_t$, $Y_t$ is marginally a copy of $\mathcal M$,
$$\Pr(X_t = \sigma_1' \mid X_{t-1} = \sigma_1) = P(\sigma_1, \sigma_1'), \qquad \Pr(Y_t = \sigma_2' \mid Y_{t-1} = \sigma_2) = P(\sigma_2, \sigma_2') \qquad (t > 0). \tag{2.18}$$
The following simple but powerful inequality then follows easily from these definitions.

Lemma 2.3.1 (Coupling Lemma) Let $X_t, Y_t$ be a coupling for $\mathcal M$ such that $Y_0$ has the stationary distribution $\pi$. Then, if $X_t$ has distribution $p_t$,
$$D_{tv}(p_t, \pi) \le \Pr(X_t \ne Y_t). \tag{2.19}$$

Proof Suppose $A_t$ maximizes in (1.3). Then, since $Y_t$ has distribution $\pi$,
$$D_{tv}(p_t, \pi) = \Pr(X_t \in A_t) - \Pr(Y_t \in A_t) \le \Pr(X_t \in A_t,\ Y_t \notin A_t) \le \Pr(X_t \ne Y_t). \qquad\Box$$
It is important to remember that the Markov chain $Y_t$ is simply a proof construct, and $X_t$ the chain we actually observe. We also require that $X_t = Y_t$ implies $X_{t+1} = Y_{t+1}$, so that the chains remain together once they have met.
The requirements on the joint transition probabilities $Q$ are then
$$Q_{\sigma\sigma,\tau_1\tau_2} \ne 0 \text{ implies } \tau_1 = \tau_2 \qquad (\sigma, \tau_1, \tau_2 \in \Omega), \tag{2.20}$$
$$\sum_{\tau_2}Q_{\sigma_1\sigma_2,\tau_1\tau_2} = P_{\sigma_1\tau_1}, \qquad \sum_{\tau_1}Q_{\sigma_1\sigma_2,\tau_1\tau_2} = P_{\sigma_2\tau_2}. \tag{2.21}$$
Here (2.20) implies equality after coalescence, and (2.21) implies the marginals are copies of $\mathcal M$. Our goal is to design $Q$ so that $\Pr(X_t \ne Y_t)$ quickly becomes small. We need only specify $Q$ to satisfy (2.21) for $\sigma_1 \ne \sigma_2$. The other entries are completely determined by (2.20) and (2.21).

In general, to prove rapid mixing using coupling, it is usual to map $C(\mathcal M)$ to a process on $\mathbb N$ by defining a function $\psi : \Omega \times \Omega \to \mathbb N$ such that $\psi(\sigma_1, \sigma_2) = 0$ implies $\sigma_1 = \sigma_2$. We call this a proximity function. Then $\Pr(X_t \ne Y_t) \le E(\psi(X_t, Y_t))$, by Markov's inequality, and we need only show that $E(\psi(X_t, Y_t))$ converges quickly to zero.
2.4 Path coupling
A major difficulty with coupling is that we are obliged to specify it, and show improvement in the proximity function, for every pair of states. The idea of path coupling, where applicable, can be a major saving in this respect. We describe the approach below.

As a simple example of this approach, consider a Markov chain where $\Omega \subseteq S^m$ for some set $S$ and positive integer $m$. Suppose also that if $\sigma, \tau \in \Omega$ and $h(\sigma, \tau) = d$ (Hamming distance) then there exists a sequence $\sigma = x_0, x_1, \ldots, x_d = \tau$ such that (i) $\{x_0, x_1, \ldots, x_d\} \subseteq \Omega$, (ii) $h(x_i, x_{i+1}) = 1$ for $i = 0, 1, \ldots, d-1$, and (iii) $P(x_i, x_{i+1}) > 0$. Now suppose we define a coupling of the chains $(X_t, Y_t)$ only for the case where $h(X_t, Y_t) = 1$. Suppose then that
$$E(h(X_{t+1}, Y_{t+1}) \mid h(X_t, Y_t) = 1) \le \beta \tag{2.22}$$
for some $\beta < 1$. Then
$$E(h(X_{t+1}, Y_{t+1})) \le \beta\,h(X_t, Y_t) \tag{2.23}$$
in all cases. It then follows that
$$d_{TV}(p_t, \pi) \le \Pr(X_t \ne Y_t) \le m\beta^t.$$
Equation (2.23) is shown by choosing a sequence $X_t = Z_0, Z_1, \ldots, Z_d = Y_t$, $d = h(X_t, Y_t)$, where $Z_0, Z_1, \ldots, Z_d$ satisfy (i), (ii), (iii) above. Then we can couple $Z_i$ and $Z_{i+1}$, $0 \le i < d$, so that $X_{t+1} = Z_0', Z_1', \ldots, Z_d' = Y_{t+1}$, where (i) $\Pr(Z_i' = \tau \mid Z_i = \sigma) = P(\sigma, \tau)$ and (ii)
$E(h(Z_i', Z_{i+1}')) \le \beta$. Therefore
$$E(h(X_{t+1}, Y_{t+1})) \le \sum_{i=0}^{d-1}E(h(Z_i', Z_{i+1}')) \le \beta d,$$
and (2.23) follows.
As an example, let $G = (V, E)$ be a graph with maximum degree $\Delta$ and let $k \ge 2\Delta + 1$ be an integer. Let $\Omega_k$ be the set of proper $k$-vertex-colourings of $G$, i.e. those $c : V \to [k]$ such that $(v, w) \in E$ implies $c(v) \ne c(w)$. We describe a chain which provides a good sampler for the uniform distribution over $\Omega_k$. We let $\Omega = [k]^V$ be all $k$-colourings, including improper ones, and describe a chain on $\Omega$ for which only proper colourings have positive steady state probability.

To describe a general step of the chain assume $X_t \in \Omega$.

Step 1 Choose $w$ uniformly from $V$ and $x$ uniformly from $[k]$.

Step 2 Let $X_{t+1}(v) = X_t(v)$ for $v \in V \setminus \{w\}$.

Step 3 If no neighbour of $w$ in $G$ has colour $x$ then put $X_{t+1}(w) = x$; otherwise put $X_{t+1}(w) = X_t(w)$.

Note that $P(\sigma, \tau) = P(\tau, \sigma) = \frac{1}{nk}$ for two proper colourings $\sigma \ne \tau$ which can be obtained from each other by a single move of the chain. It follows from (1.15) that the steady state is uniform over $\Omega_k$.
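The chain just described is easy to implement directly. The following sketch (my illustration; the function names are mine) runs it on a small graph and returns an approximately uniform proper colouring.

```python
import random

def colouring_step(c, adj, k, rng):
    """One step of the colouring chain: pick vertex w and colour x u.a.r.;
    recolour w with x unless some neighbour of w already has colour x."""
    w = rng.randrange(len(adj))
    x = rng.randrange(k)
    if all(c[u] != x for u in adj[w]):
        c[w] = x

def sample_colouring(adj, k, steps=20000, seed=0):
    rng = random.Random(seed)
    n = len(adj)
    # Greedy proper initial colouring (valid since k >= Delta + 1).
    c = [-1] * n
    for v in range(n):
        used = {c[u] for u in adj[v]}
        c[v] = next(x for x in range(k) if x not in used)
    for _ in range(steps):
        colouring_step(c, adj, k, rng)
    return c

# 5-cycle, Delta = 2; k = 5 > 2*Delta, so the coupling analysis applies.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
c = sample_colouring(adj, 5)
assert all(c[v] != c[u] for v in adj for u in adj[v])  # colouring stays proper
print(c)
```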
We first describe a coupling which is extremely simple but needs $k > 3\Delta$ in order for (2.22) to be satisfied. Let $h(X_t, Y_t) = 1$ and let $v_0$ be the unique vertex of $V$ such that $X_t(v_0) \ne Y_t(v_0)$. In our coupling we choose $w, x$ as in Step 1 and try to colour $w$ with $x$ in both chains.

We claim that
$$E(h(X_{t+1}, Y_{t+1})) \le 1 - \frac{1}{n}\left(1 - \frac{\Delta}{k}\right) + \frac{\Delta}{n}\cdot\frac{2}{k} = 1 - \frac{k - 3\Delta}{kn}, \tag{2.24}$$
and so we can take $\beta = 1 - \frac{1}{kn}$ in (2.23) if $k > 3\Delta$. The term $\frac{1}{n}(1 - \frac{\Delta}{k})$ in (2.24) lower bounds the probability that $w = v_0$ and that $x$ is not used in the neighbourhood of $v_0$, in which case we will have $X_{t+1} = Y_{t+1}$. Next let $c_X \ne c_Y$ be the colours of $v_0$ in $X_t, Y_t$ respectively. The term $\frac{\Delta}{n}\cdot\frac{2}{k}$ in (2.24) is an upper bound for the probability that $w$ is in the neighbourhood of $v_0$ and $x \in \{c_X, c_Y\}$, in which case we might have $h(X_{t+1}, Y_{t+1}) = 2$. In all other cases we find that $h(X_{t+1}, Y_{t+1}) = h(X_t, Y_t) = 1$.
A better coupling gives the desired result. We proceed as above except for the case where $w$ is a neighbour of $v_0$ and $x \in \{c_X, c_Y\}$. In this case, with probability $\frac12$ we try to colour $w$ with $c_X$ in $X_t$ and colour $w$ with $c_Y$ in $Y_t$, and fail in both cases. With probability $\frac12$ we try to colour $w$ with $c_Y$ in $X_t$ and colour $w$ with $c_X$ in $Y_t$, in which case the Hamming distance may increase by one. Thus for this coupling we have
$$E(h(X_{t+1}, Y_{t+1})) \le 1 - \frac{1}{n}\left(1 - \frac{\Delta}{k}\right) + \frac12\cdot\frac{\Delta}{n}\cdot\frac{2}{k} = 1 - \frac{k - 2\Delta}{kn},$$
and we can take $\beta = 1 - \frac{1}{kn}$ in (2.23) if $k > 2\Delta$.

We now give a more general framework for the definition of path coupling. Recall that a quasi-metric satisfies the conditions for a metric except possibly the symmetry condition. Any metric is a quasi-metric, but a simple example of a quasi-metric which is not a metric is directed edge distance in a digraph.
Suppose we have a relation $S \subseteq \Omega \times \Omega$ whose transitive closure is $\Omega \times \Omega$, and suppose that we have a proximity function defined for all pairs in $S$, i.e. $\psi : S \to \mathbb N$. Then we may lift $\psi$ to a quasi-metric $\delta(\sigma, \tau)$ on $\Omega \times \Omega$ as follows. For each pair $(\sigma, \tau) \in \Omega \times \Omega$, consider the set $P(\sigma, \tau)$ of all sequences
$$\sigma = \sigma_1, \sigma_2, \ldots, \sigma_{r-1}, \sigma_r = \tau \quad\text{with}\quad (\sigma_i, \sigma_{i+1}) \in S \ (i = 1, \ldots, r-1). \tag{2.25}$$
Then we set
$$\delta(\sigma, \tau) = \min_{P(\sigma,\tau)}\sum_{i=1}^{r-1}\psi(\sigma_i, \sigma_{i+1}). \tag{2.26}$$
It is easy to prove that $\delta$ is a quasi-metric. We call a sequence minimizing (2.26) geodesic. We now show that, without any real loss, we may define the (Markovian) coupling only on pairs in $S$. Such a coupling is called a path coupling. We give a detailed development below. Clearly $S = \Omega \times \Omega$ is always a relation whose transitive closure is $\Omega \times \Omega$, but path coupling is only useful when we can define a suitable $S$ which is much smaller than $\Omega \times \Omega$. A relation of particular interest is $R$ from Section 1.4, but this is not always the best choice.
As in Section 2.3, we use $\sigma'$ (or $\sigma_i'$) to denote a state obtained by performing a single transition of the chain from the state $\sigma$ (or $\sigma_i$). Let $P_{\sigma\tau}$ denote the probability of a transition from state $\sigma$ to state $\tau$ in the Markov chain, and let $Q_{\sigma\tau,\sigma'\tau'}$ denote the probability of a joint transition from $(\sigma, \tau)$ to $(\sigma', \tau')$, where $(\sigma, \tau) \in S$, as specified by the path coupling. Since this coupling has the correct marginals, we have
$$\sum_{\tau'}Q_{\sigma\tau,\sigma'\tau'} = P_{\sigma\sigma'}, \qquad \sum_{\sigma'}Q_{\sigma\tau,\sigma'\tau'} = P_{\tau\tau'} \qquad ((\sigma, \tau) \in S). \tag{2.27}$$
We extend this to all pairs $(\sigma, \tau) \in \Omega \times \Omega$ as follows. For each pair, fix a sequence $(\sigma_1, \sigma_2, \ldots, \sigma_r) \in P(\sigma, \tau)$. We do not assume the sequence is geodesic here, or indeed
the existence of any proximity function, but this is our eventual purpose. The implied global coupling $Q_{\sigma_1\sigma_r,\sigma_1'\sigma_r'}$ is then defined along this sequence by successively conditioning on the previous choice. Using (2.27), this can be written explicitly as
$$Q_{\sigma_1\sigma_r,\sigma_1'\sigma_r'} = \sum_{\sigma_2'}\sum_{\sigma_3'}\cdots\sum_{\sigma_{r-1}'}Q_{\sigma_1\sigma_2,\sigma_1'\sigma_2'}\,\frac{Q_{\sigma_2\sigma_3,\sigma_2'\sigma_3'}}{P_{\sigma_2\sigma_2'}}\cdots\frac{Q_{\sigma_{r-1}\sigma_r,\sigma_{r-1}'\sigma_r'}}{P_{\sigma_{r-1}\sigma_{r-1}'}}. \tag{2.28}$$
Summing (2.28) over $\sigma_r'$ or $\sigma_1'$, and again applying (2.27), causes the right side to successively simplify, giving
$$\sum_{\sigma_r'}Q_{\sigma_1\sigma_r,\sigma_1'\sigma_r'} = P_{\sigma_1\sigma_1'}, \qquad \sum_{\sigma_1'}Q_{\sigma_1\sigma_r,\sigma_1'\sigma_r'} = P_{\sigma_r\sigma_r'}. \tag{2.29}$$
Hence the global coupling satisfies (2.21), as we would anticipate from the properties of conditional probabilities.
Now suppose the global coupling is determined by geodesic sequences. We bound the expected value of $\delta(\sigma_1', \sigma_r')$. This is
$$E(\delta(\sigma_1', \sigma_r')) = \sum_{\sigma_1'}\cdots\sum_{\sigma_r'}\delta(\sigma_1', \sigma_r')\,Q_{\sigma_1\sigma_2,\sigma_1'\sigma_2'}\,\frac{Q_{\sigma_2\sigma_3,\sigma_2'\sigma_3'}}{P_{\sigma_2\sigma_2'}}\cdots\frac{Q_{\sigma_{r-1}\sigma_r,\sigma_{r-1}'\sigma_r'}}{P_{\sigma_{r-1}\sigma_{r-1}'}}$$
$$\le \sum_{\sigma_1'}\cdots\sum_{\sigma_r'}\left(\sum_{i=1}^{r-1}\delta(\sigma_i', \sigma_{i+1}')\right)Q_{\sigma_1\sigma_2,\sigma_1'\sigma_2'}\,\frac{Q_{\sigma_2\sigma_3,\sigma_2'\sigma_3'}}{P_{\sigma_2\sigma_2'}}\cdots\frac{Q_{\sigma_{r-1}\sigma_r,\sigma_{r-1}'\sigma_r'}}{P_{\sigma_{r-1}\sigma_{r-1}'}}$$
$$= \sum_{i=1}^{r-1}\sum_{\sigma_i'}\sum_{\sigma_{i+1}'}\delta(\sigma_i', \sigma_{i+1}')\,Q_{\sigma_i\sigma_{i+1},\sigma_i'\sigma_{i+1}'}, \tag{2.30}$$
where we have used the triangle inequality for a quasi-metric and the same observation as that leading from (2.28) to (2.29).

Suppose we can find $\beta \le 1$ such that, for all $(\sigma, \tau) \in S$,
$$E(\delta(\sigma', \tau')) = \sum_{\sigma'}\sum_{\tau'}\delta(\sigma', \tau')\,Q_{\sigma\tau,\sigma'\tau'} \le \beta\,\delta(\sigma, \tau). \tag{2.31}$$
Then, from (2.30), (2.31) and (2.26) we have
$$E(\delta(\sigma_1', \sigma_r')) \le \beta\sum_{i=1}^{r-1}\delta(\sigma_i, \sigma_{i+1}) = \beta\sum_{i=1}^{r-1}\psi(\sigma_i, \sigma_{i+1}) = \beta\,\delta(\sigma_1, \sigma_r). \tag{2.32}$$
Thus we can show (2.31) for every pair, merely by showing that this holds for all pairs in $S$. To apply path coupling to a particular problem, we must find a relation $S$ and
proximity function $\psi$ so that this is possible. In particular we need $\delta(\sigma, \tau)$ for $(\sigma, \tau) \in S$ to be easily deducible from $\psi$.

Suppose that $\delta$ has diameter $D$, i.e. $\delta(\sigma, \tau) \le D$ for all $\sigma, \tau \in \Omega$. Then $\Pr(X_t \ne Y_t) \le \beta^t D$, and so, if $\beta < 1$, we have, since $\log\beta^{-1} \ge 1 - \beta$,
$$D_{tv}(p_t, \pi) \le \varepsilon \quad\text{for}\quad t \ge \frac{\log(D\varepsilon^{-1})}{1 - \beta}. \tag{2.33}$$
This bound is polynomial even when $D$ is exponential in the problem size. It is also possible to prove a bound when $\beta = 1$, provided we know the quasi-metric cannot get stuck. Specifically, we need an $\alpha > 0$ (inversely polynomial in the problem size) such that, in the above notation,
$$\Pr(\delta(\sigma', \tau') \ne \delta(\sigma, \tau)) \ge \alpha. \tag{2.34}$$
Observe that it is not sufficient simply to establish (2.34) for pairs in $S$. However, the structure of the path coupling can usually help in proving it. In this case, we can show that
$$D_{tv}(p_t, \pi) \le \varepsilon \quad\text{for}\quad t \ge \frac{eD^2}{\alpha}\,\ln(\varepsilon^{-1}). \tag{2.35}$$
This is most easily shown using a martingale argument. Here we need $D$ to be polynomial in the problem size.
Consider a sequence $(\sigma_0, \tau_0), (\sigma_1, \tau_1), \ldots, (\sigma_t, \tau_t), \ldots$ and define the random time
$$T_{\sigma,\tau} = \min\{t : \delta(\sigma_t, \tau_t) = 0\},$$
assuming that $\sigma_0 = \sigma$, $\tau_0 = \tau$. We prove that
$$E(T_{\sigma,\tau}) \le D^2/\alpha. \tag{2.36}$$
Let
$$Z(t) = \delta(\sigma_t, \tau_t)^2 - 2D\,\delta(\sigma_t, \tau_t) - \alpha t$$
and let
$$\Delta(t) = \delta(\sigma_{t+1}, \tau_{t+1}) - \delta(\sigma_t, \tau_t).$$
Then
$$E(Z(t+1) \mid Z(0), Z(1), \ldots, Z(t)) - Z(t) = 2(\delta(\sigma_t, \tau_t) - D)\,E(\Delta(t) \mid \sigma_t, \tau_t) + \big(E(\Delta(t)^2 \mid \sigma_t, \tau_t) - \alpha\big) \ge 0.$$
Hence $Z(t)$ is a submartingale. The stopping time $T_{\sigma,\tau}$ has finite expectation and $|Z(t+1) - Z(t)| \le D^2$. We can therefore apply the Optional Stopping Theorem for submartingales to obtain
$$E(Z(T_{\sigma,\tau})) \ge Z(0).$$
This implies
$$\alpha\,E(T_{\sigma,\tau}) \le 2D\,\delta(\sigma_0, \tau_0) - \delta(\sigma_0, \tau_0)^2 \le D^2,$$
and (2.36) follows.
So for any $\sigma, \tau$, by Markov's inequality and (2.36),
$$\Pr(T_{\sigma,\tau} \ge eD^2/\alpha) \le e^{-1},$$
and by considering $k$ consecutive time intervals of length $eD^2/\alpha$ we obtain
$$\Pr(T_{\sigma,\tau} \ge keD^2/\alpha) \le e^{-k},$$
and (2.35) follows.
2.5 Hitting Time Lemmas
For a finite Markov chain $\mathcal M$ let $\Pr_i$, $E_i$ denote probability and expectation given that $X_0 = i$.

For a set $A \subseteq \Omega$ let
$$T_A = \min\{t \ge 0 : X_t \in A\}.$$
Then for $i \ne j$ the hitting time
$$H_{i,j} = E_i(T_j)$$
is the expected number of steps needed to get from state $i$ to state $j$. The commute time is
$$C_{i,j} = H_{i,j} + H_{j,i}.$$

Lemma 2.5.1 Assume $X_0 = i$ and $S$ is a stopping time with $X_S = i$. Let $j$ be an arbitrary state. Then
$$E_i(\text{number of visits to state } j \text{ before time } S) = \pi_j\,E_i(S).$$

Proof Consider the renewal process whose inter-renewal time is distributed as $S$. The renewal-reward theorem states that the asymptotic proportion of time spent in state $j$ is given by
$$E_i(\text{number of visits to } j \text{ before time } S)/E_i(S).$$
This is also equal to $\pi_j$, by the ergodic theorem. $\Box$

Lemma 2.5.2
$$E_j(\text{number of visits to } j \text{ before } T_i) = \pi_j C_{i,j}.$$

Proof Let $S$ be the time of the first return to $i$ after the first visit to $j$. Apply Lemma 2.5.1. $\Box$

The cover time $C(\mathcal M)$ of $\mathcal M$ is $\max_i C_i(\mathcal M)$, where $C_i(\mathcal M) = E_i(\max_j T_j)$ is the expected time to visit all states starting at $i$.
Let $\mathcal M_G$ denote a random walk on the connected graph $G = (V, E)$. Here $|V| = n$ and $|E| = m$.

Lemma 2.5.3 For Markov chain $\mathcal M_G$ and $e = \{u, v\} \in E$, $C_{u,v} \le 2m$.

Proof The random walk on $G$ induces a Markov chain on the set $A = \{(x, y) : \{x, y\} \in E\}$ of oriented edges, obtained by replacing each edge $\{x, y\} \in E$ by a pair of oppositely oriented edges. It can easily be checked that the all-1s vector satisfies (1.13), and hence the steady state of the induced walk is uniform. It follows from (1.14) that the expected time between traversals of $(v, u)$ is $2m$. So, conditional on entering $u$ from $v$, the expected time to visit $v$ and subsequently visit $u$ is at most $2m$. Conditioning on initially traversing $(v, u)$ is irrelevant to the time to subsequently visit $v$ and then $u$, and the lemma follows. $\Box$
We can use this to obtain a bound on the cover time of $\mathcal M_G$.

Lemma 2.5.4 $C(\mathcal M_G) \le 2m(n-1)$.

Proof Let $T$ be any spanning tree of $G$ and let $v_0, v_1, \ldots, v_{2n-2} = v_0$ be a traversal of $T$ which crosses each edge of $T$ once in each direction. Now consider the expected time for the random walk, started at $v_0$, to make journeys from $v_0$ to $v_1$, then from $v_1$ on to $v_2$, and so on until $v_0, v_1, \ldots, v_{2n-2}$ have been visited. This journey visits every vertex of $G$ and so its expected length is an upper bound on the cover time from $v_0$. Thus
$$C_{v_0}(\mathcal M_G) \le \sum_{i=0}^{2n-3}H_{v_i,v_{i+1}} = \sum_{\{u,v\} \in T}C_{u,v}.$$
The result now follows from Lemma 2.5.3. $\Box$
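Both lemmas are easy to check empirically. The sketch below (my illustration) estimates the cover time of a small graph by simulation and compares it with the bound $2m(n-1)$.

```python
import random

def cover_time_estimate(adj, trials=300, seed=0):
    """Average time for the simple random walk to visit every vertex,
    started from vertex 0."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        v, seen, t = 0, {0}, 0
        while len(seen) < len(adj):
            v = rng.choice(adj[v])
            seen.add(v)
            t += 1
        total += t
    return total / trials

# Path on 4 vertices: n = 4, m = 3, so Lemma 2.5.4 gives C(M_G) <= 2*3*3 = 18.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
est = cover_time_estimate(adj)
print(est)  # close to 9 (the exact cover time from an endpoint of the path)
```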
2.6 Optimal Stopping Rules

Lovász and Winkler (see for example [?]) have studied optimal stopping rules. We need a little of that theory here. For us a stopping rule is a function $\Phi : \Omega^* \to [0, 1]$, where $\Omega^* = \{(X_0, X_1, \ldots, X_t) : t \ge 0\}$ is the set of possible sequences of states generated by our Markov chain; $\Phi(X_0, X_1, \ldots, X_t)$ is the probability that we stop the chain at time $t$. If $X_0$ is chosen with probability distribution $\sigma$ and $\tau$ is the distribution of the state where we stop, then we say that $\Phi$ is a stopping rule from $\sigma$ to $\tau$ and write $\Phi \in SR(\sigma, \tau)$. We are naturally mainly interested in the case where $\tau = \pi$.

For a stopping rule $\Phi$ let $T_\Phi$ be the random number of steps taken until we stop. Let
$$H(\sigma, \tau) = \inf\{E(T_\Phi) : \Phi \in SR(\sigma, \tau)\}$$
denote the minimum expected number of steps in a stopping rule from $\sigma$ to $\tau$. If $\sigma$ is concentrated on a single state $s$ then we write $H(s, \tau)$.

For a stopping rule $\Phi$ and $j \in \Omega$ let $x_j = x_j(\Phi)$ be the expected number of exits from $j$ before stopping, i.e. the expected number of times that the chain leaves $j$.

Lemma 2.6.1 If $\Phi \in SR(\sigma, \tau)$ then
$$x_j + \tau_j = \sum_i x_i P(i, j) + \sigma_j.$$

Proof Let $T = T_\Phi$ and consider the identity
$$\sum_{t=0}^{T-1}1_{X_t=j} + 1_{X_T=j} = 1_{X_0=j} + \sum_{t=1}^{T}1_{X_t=j},$$
and take expectations. $\Box$
Theorem 2.6.1 A stopping rule $\Phi \in SR(\sigma, \tau)$ has minimum expected stopping time iff there is a halting state.

Proof If there exists $j$ such that $x_j = 0$ then Corollary 2.6.1 implies that for $\Phi' \in SR(\sigma, \tau)$,
$$E(T_{\Phi'}) - E(T_\Phi) = \frac{x_j(\Phi')}{\pi_j} \ge 0,$$
implying that $E(T_\Phi)$ is minimal. It only remains to show that there exists a stopping rule in $SR(\sigma, \tau)$ which has at least one halting state.

The rule we define has a particular format. We define a nested sequence of sets $S_i = \{v_i, v_{i+1}, \ldots, v_n\}$, where $\Omega = \{v_1, v_2, \ldots, v_n\}$. For each $i$ we will define $q^{(i)}$ by
$$q^{(i)}_j = \Pr(v_j \text{ is the first vertex of } S_i \text{ visited}).$$
In particular $q^{(1)} = \sigma$. We choose $S_1, S_2, \ldots, S_n$ so that we can write
$$\tau = \alpha_1 q^{(1)} + \alpha_2 q^{(2)} + \cdots + \alpha_n q^{(n)}, \tag{2.37}$$
where $\alpha \ge 0$ and $\alpha_1 + \alpha_2 + \cdots + \alpha_n = 1$. Our stopping rule is then:

(i) Choose $i$ with probability $\alpha_i$.

(ii) Choose $X_0$ with probability $\sigma$ and then run the chain until $S_i$ is reached, and then stop.

It should be clear that $\Phi \in SR(\sigma, \tau)$. If $S_1, S_2, \ldots, S_n$ can be constructed so that (2.37) holds then we are done: $v_n$ is a halting state.

Assume inductively that we have found $S_1, S_2, \ldots, S_i$ and $\alpha_1, \alpha_2, \ldots, \alpha_{i-1} \ge 0$ such that
$$\tau^{(i-1)} = \tau - (\alpha_1 q^{(1)} + \alpha_2 q^{(2)} + \cdots + \alpha_{i-1}q^{(i-1)}) \ge 0 \tag{2.38}$$
and
$$\alpha_1 + \alpha_2 + \cdots + \alpha_{i-1} \le 1.$$
Putting $S_1 = \Omega$ does this for $i = 1$; then for general $i$ let
$$\alpha_i = \min_{j \in S_i}\frac{\tau^{(i-1)}_j}{q^{(i)}_j}$$
and let $v_i$ be a state of $S_i$ which achieves the minimum. Clearly $\alpha_i \ge 0$ and
$$\tau^{(i)} = \tau^{(i-1)} - \alpha_i q^{(i)} \ge 0 \tag{2.39}$$
from the definition of $\alpha_i$.

Furthermore,
$$\sum_{j=1}^{i}\alpha_j = \sum_{j=1}^{i}\alpha_j\sum_{k=1}^{n}q^{(j)}_k \le \sum_{k=1}^{n}\tau_k = 1, \tag{2.40}$$
completing the induction.

Finally note that when $i = n$ the construction yields equality in (2.39); then (2.37) holds and we obtain equality in (2.40). $\Box$
We now relate optimal stopping rules and mixing time. Let
$$T_{\mathrm{mix}} = \max_{s \in \Omega}H(s, \pi).$$

Theorem 2.6.2
$$\tau(\varepsilon) \le 8\,T_{\mathrm{mix}}\log_2(1/\varepsilon).$$

Proof Let $s \in \Omega$ and let $\Phi$ be an optimal stopping rule from $s$ to $\pi$. Consider a modification: follow $\Phi$ until it stops after $T = T_\Phi$ steps, then generate $\eta \in \{0, 1, \ldots, t-1\}$ uniformly and independently of the previous walk, and then walk $\eta$ more steps. Let the walk be $v_1, v_2, \ldots, v_{T+\eta}$. Then let $\rho = T + \eta \pmod t$ and note that $\rho$ is uniformly distributed over $\{0, 1, \ldots, t-1\}$. Then for $i \in \Omega$,
$$\Pr(v_\rho = i) \ge \Pr(v_{T+\eta} = i) - \Pr(v_{T+\eta} = i,\ v_\rho \ne i) \ge \pi_i - \Pr(T + \eta \ge t),$$
since $v_{T+\eta}$ is in the stationary distribution and $T + \eta < t$ implies $\rho = T + \eta$.

Hence, for every $A \subseteq \Omega$,
$$\pi(A) - \Pr(v_\rho \in A) \le \Pr(v_{T+\eta} \in A,\ T + \eta \ge t) \le \Pr(T + \eta \ge t).$$
Now for any fixed value of $T$, $\Pr(T + \eta \ge t) \le \frac{T}{t}$, and so
$$\Pr(T + \eta \ge t) \le \frac{E(T)}{t} = \frac{H(s, \pi)}{t}$$
and
$$\pi(A) - \Pr(v_\rho \in A) \le \frac{H(s, \pi)}{t}.$$
It follows from Lemma 1.3.1(d) that
$$d(t) \le \frac{T_{\mathrm{mix}}}{t}$$
and so $d(4T_{\mathrm{mix}}) \le \frac14$.
Applying Lemma 1.3.1(b) we see that
$$d(8\,T_{\mathrm{mix}}\log_2\varepsilon^{-1}) \le \varepsilon. \qquad\Box$$
We can now prove a refinement of the usual conductance bound on the mixing time (Corollary 2.2.1) due to Kannan and Lovász [?]. For $0 \le x \le \frac12$ let
$$\Phi(x) = \min_{S:\,\pi(S) \le x}\frac{Q(S, \bar S)}{\pi(S)\pi(\bar S)}$$
and let $\Phi(x) = \Phi(\frac12)$ for $\frac12 < x \le 1$. Note that $\Phi(x) \le 2$.

Theorem 2.6.3 If $0 \le \varepsilon \le 1$ then
$$T_{\mathrm{mix}} \le \frac{30}{\varepsilon^2} + 30\int_{x=\pi^*}^{1}\frac{dx}{x\,\Phi(x)^2},$$
where $\pi^* = \inf\{y : \exists S \subseteq \Omega \text{ such that } \pi(S) \le y \text{ and } Q(S, \bar S)/(\pi(S)\pi(\bar S)) < \varepsilon\}$.
Proof Let $s \in \Omega$ and $\Phi$ be an optimal stopping rule from $s$ to $\pi$. Let $y_i = x_i/\pi_i$, $i = 1, 2, \ldots, n$, be the scaled exit frequencies of $\Phi$. Now order the states so that $y_1 \le y_2 \le \cdots \le y_n$. We first claim that with this ordering
$$y_1 = 0 \quad\text{and}\quad n = s. \tag{2.41}$$
The first assertion comes from Theorem 2.6.1. For the second we use Lemma 2.6.1 and write, for $j \in \Omega$,
$$\sum_i\pi_i P(i, j)y_i - \pi_j y_j = \pi_j - 1_{j=s}.$$
Putting $j = n$ we obtain
$$\pi_n - 1_{n=s} = \sum_i\pi_i P(i, n)y_i - \pi_n y_n \le y_n\sum_i\pi_i P(i, n) - \pi_n y_n = 0,$$
and (2.41) follows.

Now fix $1 \le k < m \le n$ and let $A = \{1, 2, \ldots, k\}$, $B = \{k+1, k+2, \ldots, m-1\}$ and $C = \{m, m+1, \ldots, n\}$. We show next that
$$y_m - y_k \le \frac{\pi(A)}{Q(C, A)}. \tag{2.42}$$
We start with the identity
$$\sum_{i=1}^{k}\sum_{j=k+1}^{n}y_j Q(j, i) - \sum_{i=1}^{k}\sum_{j=k+1}^{n}y_i Q(i, j) = \pi(A). \tag{2.43}$$
The left hand side counts the expected number of steps from $\Omega \setminus A$ to $A$, less the expected number of steps from $A$ to $\Omega \setminus A$, when following an optimal rule. Since we do not start in $A$ ($s = n$) and stop in $A$ with probability $\pi(A)$, (2.43) follows.

Now we estimate the left hand side of (2.43) as follows:
$$\sum_{i=1}^{k}\sum_{j=k+1}^{n}y_j Q(j, i) \ge \sum_{i=1}^{k}\sum_{j=m}^{n}y_m Q(j, i) + \sum_{i=1}^{k}\sum_{j=k+1}^{m-1}y_k Q(j, i) = y_m Q(C, A) + y_k Q(B, A)$$
and
$$\sum_{i=1}^{k}\sum_{j=k+1}^{n}y_i Q(i, j) \le \sum_{i=1}^{k}\sum_{j=k+1}^{n}y_k Q(i, j) = y_k Q(A, B \cup C) = y_k Q(B \cup C, A).$$
Substituting into (2.43) we get
$$y_m Q(C, A) + y_k Q(B, A) - y_k Q(B \cup C, A) = (y_m - y_k)Q(C, A) \le \pi(A),$$
which proves (2.42).

We now observe that since $y_1 = 0$,
$$H(s, \pi) = \sum_{i=1}^{n}\pi_i y_i = \sum_{j=1}^{n-1}(y_{j+1} - y_j)\pi_{>j}, \tag{2.44}$$
where $\pi_{>j} = \sum_{r=j+1}^{n}\pi_r$.

We now define a sequence $1 = m_0 < m_1 < \cdots < m_k < m_{k+1}$ so that if $T_i = \{1, 2, \ldots, m_i\}$, $\bar T_i = \Omega \setminus T_i$ and $a_i = \pi(T_i)$, then
$$a_{i+1} - \pi_{m_{i+1}} < a_i\left(1 + \frac{\Phi(a_i)}{4}\right) \le a_{i+1} \tag{2.45}$$
and
$$a_k \le \tfrac12 < a_{k+1}. \tag{2.46}$$
This definition can be justified as follows: given $m_i$ with $a_i \le \frac12$ we let $m_{i+1}$ be the first integer such that (2.45) holds. Since $a_n = 1$ and $a_i(1 + \Phi(a_i)/4) \le \frac32 a_i$, such an $m_{i+1}$ exists; $k$ exists for the same reason.

We bound a portion of the sum on the right hand side of (2.44) by
$$\sum_{j=m_i}^{m_{i+1}-1}(y_{j+1} - y_j)\pi_{>j} \le (1 - a_i)(y_{m_{i+1}} - y_{m_i}) \le \frac{a_i(1 - a_i)}{Q(\bar T_{i+1} \cup \{m_{i+1}\},\, T_i)} \tag{2.47}$$
where the second inequality follows from (2.42). Now,
$$Q(\bar T_{i+1} \cup \{m_{i+1}\},\, T_i) = Q(\bar T_i, T_i) - Q(\bar T_i \setminus (\bar T_{i+1} \cup \{m_{i+1}\}),\, T_i) \ge Q(\bar T_i, T_i) - \pi(\bar T_i \setminus (\bar T_{i+1} \cup \{m_{i+1}\}))$$
$$\ge \Phi(a_i)a_i(1 - a_i) - a_{i+1} + \pi_{m_{i+1}} + a_i > \Phi(a_i)a_i(1 - a_i)/2.$$
Hence we obtain from (2.47) that
$$\sum_{j=m_i}^{m_{i+1}-1}(y_{j+1} - y_j)\pi_{>j} \le \frac{2}{\Phi(a_i)}. \tag{2.48}$$
Now define $i_0$ so that $\Phi(a_i) \ge \varepsilon$ iff $i \le i_0$. It follows from (2.45) that
$$i_0 \le \frac{\ln 2}{\ln(1 + \varepsilon/4)} \le \frac{5}{\varepsilon}.$$
So from (2.48) we see that
$$\sum_{j=1}^{m_{i_0+1}-1}(y_{j+1} - y_j)\pi_{>j} \le \sum_{i=1}^{i_0}\frac{2}{\Phi(a_i)} \le \frac{10}{\varepsilon^2}. \tag{2.49}$$
In general we have
$$\int_{a_i}^{a_{i+1}}\frac{dx}{x\,\Phi(x)^2} \ge \frac{1}{\Phi(a_i)^2}\int_{a_i}^{a_{i+1}}\frac{dx}{x} = \frac{1}{\Phi(a_i)^2}\ln(a_{i+1}/a_i) \ge \frac{1}{\Phi(a_i)^2}\ln\left(1 + \frac{\Phi(a_i)}{4}\right) \ge \frac{1}{5\,\Phi(a_i)}, \tag{2.50}$$
since $\Phi(a_i) \le 2$. So from (2.48), (2.49) and (2.50) we have
$$\sum_{j=1}^{m_{k+1}-1}(y_{j+1} - y_j)\pi_{>j} \le \frac{10}{\varepsilon^2} + 10\int_{\pi^*}^{1}\frac{dx}{x\,\Phi(x)^2}. \tag{2.51}$$
The estimate for the other half of the sum on the right hand side of (2.44) is similar. We define a sequence $n_0 = n > n_1 > \cdots > n_r$ and sets $S_i = \{n_i, n_i+1, \ldots, n\}$, $\bar S_i = \Omega \setminus S_i$, and $b_i = \pi(\bar S_i)$ for $i = 1, 2, \ldots, r+1$, such that
$$b_{i+1} - \pi_{n_{i+1}} < b_i\left(1 + \frac{\Phi(b_i)}{4}\right) \le b_{i+1}$$
and
$$b_r \le \tfrac12 < b_{r+1}.$$
As before we consider the partial sum
$$\sum_{j=n_{i+1}}^{n_i-1}(y_{j+1} - y_j)\pi_{>j} \le (b_{i+1} - \pi_{n_{i+1}})(y_{n_i} - y_{n_{i+1}}) \le \frac{(b_{i+1} - \pi_{n_{i+1}})(1 - b_{i+1} + \pi_{n_{i+1}})}{Q(S_i,\, \bar S_{i+1} \cup \{n_{i+1}\})},$$
where the second inequality follows from (2.42). Now
$$Q(S_i,\, \bar S_{i+1} \cup \{n_{i+1}\}) = Q(\bar S_{i+1} \cup \{n_{i+1}\},\, S_i) = Q(\bar S_i, S_i) - Q(\bar S_i \setminus (\bar S_{i+1} \cup \{n_{i+1}\}),\, S_i)$$
$$\ge \Phi(b_i)b_i(1 - b_i) - b_{i+1} + \pi_{n_{i+1}} + b_i > \Phi(b_i)b_i(1 - b_i)/2.$$
Hence
$$\sum_{j=n_{i+1}}^{n_i-1}(y_{j+1} - y_j)\pi_{>j} \le \frac{2(b_{i+1} - \pi_{n_{i+1}})}{b_i}\cdot\frac{1}{\Phi(b_i)} \le \frac{4}{\Phi(b_i)},$$
since $b_{i+1} - \pi_{n_{i+1}} \le b_i(1 + \Phi(b_i)/4) \le 2b_i$. So as before we get
$$\sum_{j=n_{r+1}}^{n-1}(y_{j+1} - y_j)\pi_{>j} \le \frac{20}{\varepsilon^2} + 20\int_{\pi^*}^{1}\frac{dx}{x\,\Phi(x)^2},$$
and combined with (2.51) and (2.44) we have the theorem. $\Box$
Of particular interest to us is the case where for some $A = A(n) < B = B(n)$, $\Phi(x)$ satisfies
$$\Phi(x) \ge \min\left\{A\log\frac{1}{x(1-x)},\ B\right\} \tag{2.52}$$
for $x \le 1/2$.

Theorem 2.6.4 If (2.52) holds then the mixing time satisfies $\tau \le cA^{-2}$ for some absolute constant $c > 0$.
Proof It follows from (2.52) that for $\varepsilon \le B$ we have $\pi^* \ge e^{-\varepsilon/A}$. ($\pi^* < e^{-\varepsilon/A}$ implies that there exists $S$ with $x = \pi(S) < e^{-\varepsilon/A}$ and $\Phi(x) < \varepsilon$, which implies $\min\{A\log\frac{1}{x(1-x)}, B\} < \varepsilon$, a contradiction.) Define $x_0$ by $A\log\frac{1}{x_0(1-x_0)} = B$. Then for $\varepsilon \le B$ we have, by Theorems 2.6.2 and 2.6.3, that
$$\tau = O\left(\frac{1}{\varepsilon^2} + \frac{1}{B^2}\int_{e^{-\varepsilon/A}}^{x_0}\frac{dx}{x} + \frac{1}{A^2}\int_{x_0}^{1/2}\frac{dx}{x(\log x)^2} + \frac{1}{A^2(\log 4)^2}\int_{1/2}^{1}\frac{dx}{x}\right)$$
$$= O\left(\frac{1}{\varepsilon^2} + \frac{1}{B^2}\left(\log x_0 + \frac{\varepsilon}{A}\right) + \frac{1}{A^2}\left(\frac{1}{\log(1/x_0)} + \frac{1}{\log 2}\right) + \frac{1}{A^2}\right) = O(A^{-2}),$$
where we use $\log(1/x_0) = \Theta(B/A)$, take $\varepsilon = (AB^2)^{1/3}$, and absorb terms of order $(AB)^{-1}$ or $B^{-2}$. $\Box$
2.7 Coupling from the Past
Chapter 3
Matchings and related structures
A problem that has played a historically important role in the development both of complexity theory and algorithm design is that of evaluating the permanent function. The permanent of an $n \times n$ integer matrix $A = (a_{ij} : 0 \le i, j \le n-1)$ is defined by
$$\operatorname{per}A = \sum_{\sigma}\prod_{i=0}^{n-1}a_{i,\sigma(i)},$$
where the sum is over all permutations $\sigma$ of $[n] = \{0, \ldots, n-1\}$. Evaluating the permanent of a 0,1-matrix is complete for the class #P; thus, we cannot expect to find an algorithm that solves the problem exactly in polynomial time. Interest has therefore centred on finding computationally feasible approximation algorithms. In contrast, as is well known, the superficially related determinant of an $n \times n$ matrix can be evaluated in $O(n^3)$ arithmetic operations using Gaussian elimination.
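For tiny matrices the permanent can of course be evaluated directly from the definition; the sketch below (my illustration) does exactly that, and makes the $n!$ cost of the naive approach plain.

```python
from itertools import permutations

def permanent(A):
    """Naive O(n! * n) evaluation of per A straight from the definition;
    fine for tiny n, hopeless in general (the problem is #P-complete)."""
    n = len(A)
    total = 0
    for sigma in permutations(range(n)):
        p = 1
        for i in range(n):
            p *= A[i][sigma[i]]
        total += p
    return total

# The all-ones 3x3 matrix: every one of the 3! = 6 permutations contributes 1.
print(permanent([[1, 1, 1], [1, 1, 1], [1, 1, 1]]))  # 6
```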
A matching in a graph $G = (V, E)$ is any subset $A \subseteq E$ of edges that are pairwise vertex disjoint. A matching is said to be perfect if it covers every vertex; clearly a perfect matching, which can exist only if $|V|$ is even, has size $|V|/2$. Specialised to the case when $A$ is a 0,1-matrix, $\operatorname{per}A$ is equal to the number of perfect matchings in the bipartite graph $G = (V_1, V_2, E)$, where $V_1 = V_2 = [n]$ and $(i, j) \in E$ iff $a_{ij} = 1$.
In the light of the above connection, a promising approach to computing an approxima-tion of the permanent of A, at least when A is a 0,1-matrix, is through sampling perfectmatchings in the related bipartite graph G. We shall immediately generalise the situa-tion to that of sampling a (weighted) matching in a general graph. In the next sectionwe attack that sampling problem through Markov chain simulation; then in subsequentsections we shall apply the methods we develop there to related problems, including theapproximation of the permanent.
3.1 Weighted matchings (the monomer-dimer model)
Let $G$ be a graph, not necessarily bipartite, with an even number $2n = |V|$ of vertices. The assumption that the number of vertices in $G$ is even is inessential and is made for notational convenience. To each matching $M$, a weight $w(M) = \lambda^{|M|}$ is assigned, where $\lambda$ is a positive real parameter. The generating (or partition) function of matchings in $G$ is
$$Z(\lambda) \equiv Z_G(\lambda) = \sum_M w(M) = \sum_{k=0}^{n}m_k\lambda^k, \tag{3.1}$$
where $m_k \equiv m_k(G)$ is the number of $k$-matchings in $G$. In statistical physics, a matching is termed a monomer-dimer configuration: the edges in $M$ are the dimers and the unmatched (uncovered) vertices are monomers. Thus $m_k(G)$ counts the number of monomer-dimer configurations with $k$ dimers. The weight parameter $\lambda$ reflects the contribution of a dimer to the energy of the system.
Our main goal in this section is the development of an algorithm for approximating $Z_G$ at an arbitrary point $\lambda \ge 0$. The running time of the algorithm is $\operatorname{poly}(n, \varepsilon^{-1}, \max\{\lambda, 1\})$, where $\varepsilon$, as usual, controls the relative error that will be tolerated in the output. Thus the algorithm will meet the specification of an FPRAS for $Z_G$, provided $\lambda$ is specified in unary notation. Our approach is to simulate a suitable Markov chain $\mathcal M_{\mathrm{match}}(\lambda)$, parameterised on the graph $G$ and edge weight $\lambda$. The state space, $\Omega$, is the set of all matchings in $G$, and the transitions are constructed so that the chain is ergodic with stationary distribution given by
$$\pi(M) = \frac{\lambda^{|M|}}{Z(\lambda)}. \tag{3.2}$$
(Since $G$ is fixed from now on, we drop the subscript from $Z$.) In other words, the stationary probability of each matching (monomer-dimer configuration) is proportional to its weight in the partition function (3.1). The Markov chain $\mathcal M_{\mathrm{match}}(\lambda)$, if simulated for sufficiently many steps, provides a method of sampling matchings from the distribution $\pi$.
It is not hard to construct a Markov chain $\mathcal M_{\mathrm{match}}(\lambda)$ with the right asymptotic properties. Consider the chain in which transitions from any matching $M$ are made according to the following rule:

1. with probability $\frac12$ let $M' = M$; otherwise,
2. select an edge $e = \{u, v\} \in E$ u.a.r. and set
$$M' = \begin{cases} M - e & \text{if } e \in M; \\ M + e & \text{if both } u \text{ and } v \text{ are unmatched in } M; \\ M + e - e' & \text{if exactly one of } u, v \text{ is matched in } M \text{ and } e' \text{ is the matching edge;} \\ M & \text{otherwise;} \end{cases}$$

3. go to $M'$ with probability $\min\{1, \pi(M')/\pi(M)\}$.
It is helpful to view this chain as follows. There is an underlying graph defined on the set of matchings $\Omega$ in which the neighbours of matching $M$ are all matchings $M'$ that differ from $M$ via one of the following local perturbations: an edge is removed from $M$ (a $\downarrow$-transition); an edge is added to $M$ (an $\uparrow$-transition); or a new edge is exchanged with an edge in $M$ (a $\leftrightarrow$-transition). Transitions from $M$ are made by first selecting a neighbour $M'$ u.a.r., and then actually making, or accepting, the transition with probability $\min\{1, \pi(M')/\pi(M)\}$. Note that the ratio appearing in this expression is easy to compute: it is just $\lambda^{-1}$, $\lambda$ or $1$ respectively, according to the type of the transition.

As the reader may easily verify, this acceptance probability is constructed so that the transition probabilities $P(M, M')$ satisfy the detailed balance condition
$$Q(M, M') = \pi(M)P(M, M') = \pi(M')P(M', M), \quad\text{for all } M, M' \in \Omega,$$
i.e., $\mathcal M_{\mathrm{match}}(\lambda)$ is reversible. This fact, together with the observation that $\mathcal M_{\mathrm{match}}(\lambda)$ is irreducible (i.e., all states communicate, for example via the empty matching) and aperiodic (by step 1, the self-loop probabilities $P(M, M)$ are all non-zero), ensures that $\mathcal M_{\mathrm{match}}(\lambda)$ is ergodic with stationary distribution $\pi$, as required. The device of performing random walk on a connected graph with acceptance probabilities of this form is well known in Monte Carlo physics under the name of the Metropolis process. Clearly, it can be used to achieve any desired stationary distribution $\pi$ for which the ratio $\pi(u)/\pi(v)$ for neighbours $u, v$ can be computed easily. It is also the standard mechanism used in combinatorial optimisation by simulated annealing.
Having constructed a family of Markov chains with stationary distribution π_λ, our next task is to explain how samples from this distribution can be used to obtain a reliable statistical estimate of Z(λ) at a specified point λ > 0. Our strategy is to express Z(λ) as the product

Z(λ) = [Z(λ_r)/Z(λ_{r−1})] × [Z(λ_{r−1})/Z(λ_{r−2})] × ⋯ × [Z(λ_2)/Z(λ_1)] × [Z(λ_1)/Z(λ_0)] × Z(λ_0), (3.3)

where 0 = λ_0 < λ_1 < λ_2 < ⋯ < λ_{r−1} < λ_r = λ is a suitably chosen sequence of values. Note that Z(λ_0) = Z(0) = 1. We will then estimate each factor Z(λ_i)/Z(λ_{i−1}) in this
product by sampling from the distribution π_{λ_i}. This approach is analogous to that used in the context of independent sets in the proof of Theorem 1.2.1; refer in particular to equation (1.4). For reasons that will become clear shortly, we will use the sequence of values λ_1 = (2|E|)⁻¹ and λ_i = (1 + 1/n)^{i−1} λ_1 for 1 ≤ i < r. The length r of the sequence is taken to be minimal such that (1 + 1/n)^{r−1} λ_1 ≥ λ, so we have the bound

r ≤ 2n(ln λ + ln(2|E|)) + 1. (3.4)
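For a small graph, both the sequence and the telescoping identity (3.3) behind it can be checked directly against the partition function computed by brute force. The function names below are illustrative only:

```python
import math
from itertools import combinations

def Z(edges, lam):
    """Partition function: sum of lam**|M| over all matchings M of the graph."""
    total = 0.0
    for k in range(len(edges) + 1):
        for sub in combinations(edges, k):
            verts = [v for e in sub for v in e]
            if len(verts) == len(set(verts)):   # sub is a matching
                total += lam ** k
    return total

def lam_sequence(n, num_edges, lam):
    """The values 0 = lam_0 < lam_1 < ... < lam_r = lam from the text."""
    lams = [0.0]
    lam1 = 1.0 / (2 * num_edges)
    i = 1
    while lam1 * (1 + 1.0 / n) ** (i - 1) < lam:
        lams.append(lam1 * (1 + 1.0 / n) ** (i - 1))
        i += 1
    lams.append(lam)                            # lam_r = lam
    return lams
```

On the path with four vertices and λ = 2, for example, the sequence has length r = 13, comfortably within the bound (3.4), the product of the ratios Z(λ_{i−1})/Z(λ_i) telescopes to Z(λ_0)/Z(λ) exactly, and each ratio lies in [e⁻¹, 1].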
To estimate the ratio Z(λ_i)/Z(λ_{i−1}), we will express it, or rather its reciprocal, as the expectation of a suitable random variable. Specifically, define the random variable Z_i(M) = (λ_{i−1}/λ_i)^{|M|}, where M is a matching chosen from the distribution π_{λ_i}. Then we have

E(Z_i) = Σ_M (λ_{i−1}/λ_i)^{|M|} λ_i^{|M|}/Z(λ_i) = (1/Z(λ_i)) Σ_M λ_{i−1}^{|M|} = Z(λ_{i−1})/Z(λ_i).
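This identity is easy to confirm by exact enumeration on a small graph; the sketch below computes E(Z_i) directly from the definition of π_{λ_i} (helper names are our own):

```python
from itertools import combinations

def all_matchings(edges):
    """All matchings of the graph given by its edge list."""
    out = []
    for k in range(len(edges) + 1):
        for sub in combinations(edges, k):
            verts = [v for e in sub for v in e]
            if len(verts) == len(set(verts)):
                out.append(sub)
    return out

def expected_Zi(edges, lam_prev, lam_cur):
    """E(Z_i) under pi_{lam_cur}, with Z_i(M) = (lam_prev/lam_cur)**|M|."""
    ms = all_matchings(edges)
    Zcur = sum(lam_cur ** len(M) for M in ms)
    return sum((lam_prev / lam_cur) ** len(M) * lam_cur ** len(M) for M in ms) / Zcur
```

The weights λ_cur^|M| cancel against the factor (λ_prev/λ_cur)^|M| term by term, leaving Z(λ_prev)/Z(λ_cur), exactly as in the displayed calculation.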
Thus the ratio ρ_i = Z(λ_{i−1})/Z(λ_i) can be estimated by sampling matchings from the distribution π_{λ_i} and computing the sample mean of Z_i. Following (3.3), our estimator of Z(λ) will be the product of the reciprocals of these estimated ratios. Summarising this discussion, our algorithm can be written down as follows:
Step 1 Compute the sequence λ_1 = (2|E|)⁻¹ and λ_i = (1 + 1/n)^{i−1} λ_1 for 1 ≤ i < r, where r is the least integer such that (1 + 1/n)^{r−1} λ_1 ≥ λ. Set λ_0 = 0 and λ_r = λ.

Step 2 For each value λ = λ_1, λ_2, …, λ_r in turn, compute an estimate X_i of the ratio ρ_i as follows: by performing S independent simulations of the Markov chain Mmatch(λ_i), each of length T_i, obtain an independent sample of size S from (close to) the distribution π_{λ_i}; let X_i be the sample mean of the quantity (λ_{i−1}/λ_i)^{|M|}.

Step 3 Output the product Y = Π_{i=1}^{r} X_i⁻¹.
Figure 3.1: Algorithm MatchSample
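The whole of Algorithm MatchSample can be sketched in a few dozen lines of Python. This is an illustrative implementation under our own choices of data representation, with the sample size S and simulation length T supplied by the caller rather than set from Propositions 3.1.1 and 3.1.2 below:

```python
import random

def mm_step(M, edges, lam, rng):
    """One transition of Mmatch(lambda), following steps 1-3 of the rule above."""
    if rng.random() < 0.5:
        return M
    e = rng.choice(edges)
    u, v = e
    mu = next((f for f in M if u in f), None)
    mv = next((f for f in M if v in f), None)
    if e in M:
        Mp = M - {e}
    elif mu is None and mv is None:
        Mp = M | {e}
    elif (mu is None) != (mv is None):
        Mp = (M - {mu or mv}) | {e}
    else:
        Mp = M
    if rng.random() < min(1.0, lam ** (len(Mp) - len(M))):
        return Mp
    return M

def match_sample(n, edges, lam, S, T, rng):
    """Estimate Z(lam) by the telescoping product of Steps 1-3."""
    # Step 1: the sequence lam_0 = 0 < lam_1 < ... < lam_r = lam
    lams = [0.0]
    lam1 = 1.0 / (2 * len(edges))
    i = 1
    while lam1 * (1 + 1.0 / n) ** (i - 1) < lam:
        lams.append(lam1 * (1 + 1.0 / n) ** (i - 1))
        i += 1
    lams.append(lam)
    # Step 2: estimate each rho_i = Z(lam_{i-1})/Z(lam_i) by a sample mean
    Y = 1.0
    for i in range(1, len(lams)):
        total = 0.0
        for _ in range(S):
            M = set()
            for _ in range(T):              # T chain steps from the empty matching
                M = mm_step(M, edges, lams[i], rng)
            ratio = lams[i - 1] / lams[i]
            total += ratio ** len(M)        # note 0.0 ** 0 == 1.0 when i == 1
        Y /= total / S                      # Step 3: multiply the reciprocals
    return Y
```

On the three-vertex path with λ = 1, where Z(λ) = 1 + 2λ = 3 exactly, a short seeded run lands close to the true value.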
To complete the description of the algorithm, we need to specify the sample size S in Step 2, and the number of simulation steps T_i required for each sample. Our goal is to show that, with suitable values for these quantities, Algorithm MatchSample is an FPRAS for Z(λ).

The issue of the sample size S is straightforward. Now e⁻¹ ≤ E(Z_i) ≤ 1, and so, using Lemma 1.2.1 of Chapter 1, we see:
Proposition 3.1.1 In Algorithm MatchSample, suppose the sample size S in Step 2 is S = ⌈17e²ε⁻²r⌉, and that the simulation length T_i is large enough that the variation distance of Mmatch(λ_i) from its stationary distribution π_{λ_i} is at most ε/(3er). Then the output random variable Y satisfies

Pr[(1 − ε)Z(λ) ≤ Y ≤ (1 + ε)Z(λ)] ≥ 3/4.
Since r is a relatively small quantity (essentially linear in n: see (3.4)), this result means that a modest sample size at each stage suffices to ensure a good final estimate Y, provided of course that the samples come from a distribution that is close enough to π_{λ_i}.
It is in determining the number of simulation steps, T_i, required to achieve this that the meat of the analysis lies: of course, this is tantamount to investigating the mixing time of the Markov chain Mmatch(λ_i). Our main task in this section will be to show:
Proposition 3.1.2 The mixing time of the Markov chain Mmatch(λ) satisfies

τ_X(δ) ≤ 4|E|nλ̂²(n(ln n + ln λ̂) + ln δ⁻¹),

where λ̂ = max{λ, 1}.
The proof of this result will make use of the full power of the machinery introduced in Section 2.2.3 of Chapter 2. Note that Proposition 3.1.2 is a very strong statement: it says that we can sample from (close to) the complex distribution π_λ over the exponentially large space of matchings in G, by performing a Markov chain simulation of length only a low-degree polynomial in the size of G.¹
According to Proposition 3.1.1, we require a variation distance of ε/(3er), so Proposition 3.1.2 tells us that it suffices to take

T_i = ⌈4|E|nλ̂_i²(n(ln n + ln λ̂_i) + ln(3er/ε))⌉, where λ̂_i = max{λ_i, 1}. (3.5)
This concludes our specification of the Algorithm MatchSample.
Before proceeding to prove the above statements, let us convince ourselves that together they imply that Algorithm MatchSample is an FPRAS for Z(λ). First of all, Proposition 3.1.1 ensures that the output of Algorithm MatchSample satisfies the requirements of an FPRAS for Z. It remains only to verify that the running time is bounded by a polynomial in n, λ and ε⁻¹. Evidently the running time is dominated by the number of Markov chain simulation steps, which is Σ_{i=1}^{r} S T_i; since T_i increases with i, this is at most rST_r. Substituting the upper bound for r from (3.4), and values
¹Incidentally, we should point out that Proposition 3.1.2 immediately tells us that we can sample monomer-dimer configurations from the canonical distribution π_λ, in time polynomial in n and λ. This is in itself an interesting result, and allows estimation of the expectation of many quantities associated with monomer-dimer configurations.
for S from Proposition 3.1.1 and T_r from (3.5), we see that the overall running time of Algorithm MatchSample is bounded by²

O(n⁴|E|λ̂²(ln(nλ̂))³ε⁻²),

which grows only polynomially with n, λ and ε⁻¹. We have therefore proved

Theorem 3.1.1 Algorithm MatchSample is an FPRAS for the partition function of an arbitrary monomer-dimer system.
We turn now to the question of proving Proposition 3.1.2. Our strategy will be to carefully choose a collection of canonical paths Γ = {γ_XY : X, Y ∈ Ω} in the Markov chain Mmatch(λ) for which the bottleneck measure ρ̄(Γ) of Section 2.2.3 is small. We can then appeal to Theorem 2.2.4 and Corollary 2.1.1 to bound the mixing time. Specifically, we shall show that our paths satisfy

ρ̄(Γ) ≤ 4|E|nλ̂². (3.6)
Since the number of matchings in G is certainly bounded above by (2n)!, the stationary probability π(X) of any matching X is bounded below by π(X) ≥ ((2n)! λ̂ⁿ)⁻¹. Using (3.6) and the fact that ln n! ≤ n ln n, the bound on the mixing time in Proposition 3.1.2 can now be read off from Theorem 2.2.4 and Corollary 2.1.1.
It remains for us to find a set of canonical paths satisfying (3.6). For a pair of matchings X, Y in G, we define the canonical path γ_XY as follows. Consider the symmetric difference X ⊕ Y. A moment's reflection should convince the reader that this consists of a disjoint collection of paths in G (some of which may be closed cycles), each of which has edges that belong alternately to X and to Y. Now suppose that we have fixed some arbitrary ordering on all simple paths in G, and designated in each of them a so-called start vertex, which is arbitrary if the path is a closed cycle but must be an endpoint otherwise. This ordering induces a unique ordering P_1, P_2, …, P_m on the paths appearing in X ⊕ Y. The canonical path from X to Y involves unwinding each of the P_i in turn as follows. There are two cases to consider:
1. P_i is not a cycle. Let P_i consist of the sequence (v_0, v_1, …, v_l) of vertices, with v_0 the start vertex. If (v_0, v_1) ∈ Y, perform a sequence of ↔-transitions replacing (v_{2j+1}, v_{2j+2}) by (v_{2j}, v_{2j+1}) for j = 0, 1, …, and finish with a single ↑-transition if l is odd. If on the other hand (v_0, v_1) ∈ X, begin with a ↓-transition removing (v_0, v_1) and proceed as before for the reduced path (v_1, …, v_l).
²In deriving the O-expression, we have assumed w.l.o.g. that T_r = O(|E|n²λ̂² ln(nλ̂)). This follows from (3.5) with the additional assumption that ln ε⁻¹ = O(n ln(nλ̂)). This latter assumption is justified since the problem can always be solved exactly by exhaustive enumeration in time O(n(2n)!), which is O(ε⁻²) if ln ε⁻¹ exceeds the above bound.
[Figure: four rows of small graphs depicting the matchings X, M, M′ and Y as the sequence of paths P_1, …, P_{i−1}, P_i, P_{i+1}, …, P_m of X ⊕ Y; the start vertex of the (closed) path P_i is marked.]
Figure 3.2: A transition t in the canonical path from X to Y
2. P_i is a cycle. Let P_i consist of the sequence (v_0, v_1, …, v_{2l+1}) of vertices, where l ≥ 1, v_0 is the start vertex, and (v_{2j}, v_{2j+1}) ∈ X for 0 ≤ j ≤ l, the remaining edges belonging to Y. Then the unwinding begins with a ↓-transition to remove (v_0, v_1). We are left with an open path O with endpoints v_0, v_1, one of which must be the start vertex of O. Suppose v_k, k ∈ {0, 1}, is not the start vertex. Then we unwind O as in case 1 above, but treating v_k as the start vertex. This trick serves to distinguish paths from cycles, as will prove convenient shortly.
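The decomposition of X ⊕ Y into alternating paths and cycles that underlies this construction is easy to compute: walk each component from a degree-one endpoint when one exists, and otherwise around its cycle. The sketch below is our own helper, with edges represented as frozensets of two vertices:

```python
def components_of_difference(X, Y):
    """Split the symmetric difference X ^ Y into its paths and cycles.

    X and Y are matchings given as sets of frozenset edges; each component
    of X ^ Y alternates between X-edges and Y-edges, and is returned as
    ("path", vertex_list) or ("cycle", vertex_list).
    """
    diff = X ^ Y
    adj = {}
    for e in diff:
        u, v = tuple(e)
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    def walk(start, seen):
        order = [start]
        seen.add(start)
        cur = start
        while True:
            nxt = [w for w in adj[cur] if w not in seen]
            if not nxt:
                return order
            cur = nxt[0]
            seen.add(cur)
            order.append(cur)

    seen, pieces = set(), []
    for v in sorted(adj):               # open paths start at degree-1 vertices
        if v not in seen and len(adj[v]) == 1:
            pieces.append(("path", walk(v, seen)))
    for v in sorted(adj):               # every remaining vertex lies on a cycle
        if v not in seen:
            pieces.append(("cycle", walk(v, seen)))
    return pieces
```

For example, if X and Y disagree on a four-cycle and on a path of two edges, the routine returns exactly one "cycle" piece and one "path" piece.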
This concludes our definition of the family of canonical paths Γ. Figure 3.2 will help the reader picture a typical transition t on a canonical path from X to Y. The path P_i (which happens to be a cycle) is the one currently being unwound; the paths P_1, …, P_{i−1} to the left have already been processed, while the ones P_{i+1}, …, P_m are yet to be dealt with.
We now proceed to bound the bottleneck measure ρ̄(Γ) for these paths. Let t be an arbitrary edge in the Markov chain, i.e., a transition from M to M′ ≠ M, and let cp(t) = {(X, Y) : γ_XY ∋ t} denote the set of all canonical paths that use t. (We use the notation t in place of e here to avoid confusion with edges of G.) We will obtain a bound on the total weight of all paths that pass through t by defining an injective mapping η_t : cp(t) → Ω. What we would like to do is to set η_t(X, Y) = X ⊕ Y ⊕ (M ∪ M′); the intuition for this is that η_t(X, Y) should agree with X on paths that have already
[Figure: the single matching η_t(X, Y), drawn as the sequence of paths P_1, …, P_{i−1}, P_i, P_{i+1}, …, P_m.]
Figure 3.3: The corresponding encoding η_t(X, Y)
been unwound, and with Y on paths that have not yet been unwound. However, there is a minor complication concerning the path that we are currently processing: in order to ensure that η_t(X, Y) is indeed a matching, we may, as we shall see, have to remove from it the edge of X adjacent to the start vertex of the path currently being unwound: we shall call this edge e_{XYt}. This leads us to the following definition of the mapping η_t:

η_t(X, Y) = X ⊕ Y ⊕ (M ∪ M′) − e_{XYt}, if t is a ↔-transition and the current path is a cycle;
η_t(X, Y) = X ⊕ Y ⊕ (M ∪ M′), otherwise.
Figure 3.3 illustrates the encoding η_t(X, Y) that would result from the transition t on the canonical path sketched in Figure 3.2.
Let us check that η_t(X, Y) is always a matching. To see this, consider the set of edges A = X ⊕ Y ⊕ (M ∪ M′), and suppose that some vertex, u say, has degree two in A. (Since A ⊆ X ∪ Y, no vertex degree can exceed two.) Then A contains edges {u, v_1}, {u, v_2} for distinct vertices v_1, v_2, and since A ⊆ X ∪ Y, one of these edges must belong to X and the other to Y. Hence both edges belong to X ⊕ Y, which means that neither can belong to M ∪ M′. Following the form of M ∪ M′ along the canonical path, however, it is clear that there can be at most one such vertex u; moreover, this happens precisely when the current path is a cycle, u is its start vertex, and t is a ↔-transition. Our definition of η_t removes one of the edges adjacent to u in this case, so all vertices in η_t(X, Y) have degree at most one, i.e., η_t(X, Y) is indeed a matching.
We now have to check that η_t is injective. It is immediate from the definition of η_t that the symmetric difference X ⊕ Y can be recovered from η_t(X, Y) using the relation

X ⊕ Y = η_t(X, Y) ⊕ (M ∪ M′) + e_{XYt}, if t is a ↔-transition and the current path is a cycle;
X ⊕ Y = η_t(X, Y) ⊕ (M ∪ M′), otherwise.
Note that, once we have formed the set η_t(X, Y) ⊕ (M ∪ M′), it will be apparent whether the current path is a cycle from the sense of unwinding. (Note that e_{XYt} is the unique edge that forms a cycle when added to the path.) Given X ⊕ Y, we can at once infer the sequence of paths P_1, P_2, …, P_m that have to be unwound along the canonical path from X to Y, and the transition t tells us which of these, P_i say, is the path currently being unwound. The partition of X ⊕ Y into X and Y is now straightforward: X has the same parity as η_t(X, Y) on paths P_1, …, P_{i−1}, and the same parity as M on
paths P_{i+1}, …, P_m. Finally, the reconstruction of X and Y is completed by noting that X ∩ Y = M − (X ⊕ Y), which is immediate from the definition of the paths. Hence X and Y can be uniquely recovered from η_t(X, Y), so η_t is injective.
We are almost done. What we now require in addition is that η_t be weight-preserving, in the sense that Q(t)π(η_t(X, Y)) is not much smaller than π(X)π(Y). More precisely, we will show in a moment that

π(X)π(Y) ≤ 2|E|λ̂²Q(t)π(η_t(X, Y)). (3.7)

First, let us see why we need a bound of this form in order to estimate ρ̄. We have
(1/Q(t)) Σ_{γ_XY ∋ t} π(X)π(Y)|γ_XY| ≤ 2|E|λ̂² Σ_{γ_XY ∋ t} π(η_t(X, Y))|γ_XY|
≤ 4|E|nλ̂² Σ_{γ_XY ∋ t} π(η_t(X, Y))
≤ 4|E|nλ̂², (3.8)

where the second inequality follows from the fact that the length of any canonical path is bounded by 2n, and the last inequality from the facts that η_t is injective and π is a probability distribution.
It remains for us to prove inequality (3.7). Before we do so, it is helpful to notice that Q(t) = (2|E|)⁻¹ min{π(M), π(M′)}, as may easily be verified from the definition of Mmatch(λ). We now distinguish four cases:
1. t is a ↓-transition. Suppose M′ = M − e. Then η_t(X, Y) = X ⊕ Y ⊕ M, so, viewed as multisets, M ∪ η_t(X, Y) and X ∪ Y are identical. Hence we have

π(X)π(Y) = π(M)π(η_t(X, Y))
= 2|E|Q(t) (π(M)/min{π(M), π(M′)}) π(η_t(X, Y))
= 2|E|Q(t) max{1, π(M)/π(M′)} π(η_t(X, Y))
≤ 2|E|λ̂Q(t)π(η_t(X, Y)),

from which (3.7) follows.
2. t is an ↑-transition. This is handled by an argument symmetrical to case 1 above, with the roles of M and M′ interchanged.
3. t is a ↔-transition and the current path is a cycle. Suppose M′ = M + e − e′, and consider the multiset M ∪ η_t(X, Y). Then η_t(X, Y) = X ⊕ Y ⊕ (M + e) − e_{XYt}, so the multiset M ∪ η_t(X, Y) differs from X ∪ Y only in that e and e_{XYt} are missing from it. Thus we have

π(X)π(Y) ≤ λ̂²π(M)π(η_t(X, Y)) = 2|E|λ̂²Q(t)π(η_t(X, Y)),
since in this case π(M) = π(M′), and so Q(t) = (2|E|)⁻¹π(M).