Source: users.stat.ufl.edu/~jhobert/papers/sdsu_reu_2016.pdf
Honest MCMC via Drift and Minorization
Jim Hobert, U. of Florida
Outline:
I. iid sequences vs. Markov chains
II. What I mean by “Honest MCMC”
III. A little Markov chain theory
IV. Drift + Minorization ⇒ Geometric Ergodicity
V. A toy example
VI. An incomplete list of applications
Loosely based on an old Stat Sci paper with Galin Jones
I. An iid sequence versus a Markov chain

Let {X_n}_{n=0}^∞ be a sequence of random vectors on X ⊂ R^p

The joint pdf of (X_0, X_1, X_2, ..., X_n) can be expressed as

$$f_0(x_0)\, f_{1|0}(x_1|x_0)\, f_{2|0,1}(x_2|x_0,x_1) \cdots f_{n|0,1,\dots,n-1}(x_n|x_0,x_1,\dots,x_{n-1})$$

If the sequence is iid with common pdf π : X → [0,∞), this reduces to

$$\pi(x_0)\,\pi(x_1)\,\pi(x_2)\cdots\pi(x_n)$$

If the sequence is a Markov chain with Markov transition density (Mtd) k : X × X → [0,∞), it reduces to

$$f_0(x_0)\, k(x_1|x_0)\, k(x_2|x_1) \cdots k(x_n|x_{n-1})$$

Now assume a connection between π and k, namely that π is invariant for k:

$$\int_{\mathsf{X}} k(x'|x)\,\pi(x)\,dx = \pi(x')$$
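This invariance condition is easy to check numerically when the state space is finite, where the integral becomes a sum. Below is a minimal sketch (not from the talk; the 3-state transition matrix is made up for illustration) verifying that the stationary distribution π satisfies πK = π:

```python
# Numerical check of invariance on a finite state space, where
# \int k(x'|x) pi(x) dx  becomes the vector-matrix product  pi @ K.
# The 3-state transition matrix K is hypothetical, chosen for illustration.
import numpy as np

K = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])   # row x holds k(.|x); each row sums to 1

# The stationary pi solves pi K = pi, i.e. it is the left eigenvector
# of K associated with eigenvalue 1.
vals, vecs = np.linalg.eig(K.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()

print(pi)        # the stationary distribution
print(pi @ K)    # identical to pi: the chain leaves pi invariant
```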
II. What I Mean By “Honest MCMC”
We have u : R^p → [0,∞) satisfying $0 < \int_{\mathbb{R}^p} u(y)\,dy < \infty$

Define the pdf: $\pi(x) = u(x)\big/\int_{\mathbb{R}^p} u(y)\,dy$

The problem: We want to know the value of

$$E_\pi g := \int_{\mathbb{R}^p} g(x)\,\pi(x)\,dx = \frac{\int_{\mathbb{R}^p} g(x)\,u(x)\,dx}{\int_{\mathbb{R}^p} u(y)\,dy}\,,$$

but top and bottom are both intractable.
We could approximate Eπg using:
• analytical approximations
• numerical integration
• Monte Carlo methods (classical Monte Carlo & MCMC)
An example from Bayesian probit regression
$$\pi : \mathbb{R}^p \to (0,\infty) \quad \text{and} \quad \pi(x) = \frac{\prod_{i=1}^m \big[\Phi(v_i^T x)\big]^{z_i}\big[1-\Phi(v_i^T x)\big]^{1-z_i}}{\int_{\mathbb{R}^p}\prod_{i=1}^m \big[\Phi(v_i^T y)\big]^{z_i}\big[1-\Phi(v_i^T y)\big]^{1-z_i}\,dy}$$

where
• m ∈ N is known
• z_i ∈ {0, 1} is known
• v_i ∈ R^p is known
• $\Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,dy$
We want the value of $E_\pi g := \int_{\mathbb{R}^p} g(x)\,\pi(x)\,dx$

Classical Monte Carlo Solution

Theory: Let X_1, X_2, X_3, ... be iid π and form $\bar{g}_n := \frac{1}{n}\sum_{i=1}^n g(X_i)$

SLLN: If $E_\pi|g| < \infty$, then $\bar{g}_n \to E_\pi g$ a.s. as n → ∞

CLT: If $E_\pi g^2 < \infty$, then $\sqrt{n}\,(\bar{g}_n - E_\pi g)/\sigma_n \stackrel{d}{\to} N(0,1)$, where

$$\sigma_n^2 = \frac{1}{n-1}\sum_{i=1}^n \big(g(X_i) - \bar{g}_n\big)^2$$

So, for large n,

$$\Pr\Big(\bar{g}_n - \tfrac{2\sigma_n}{\sqrt{n}} < E_\pi g < \bar{g}_n + \tfrac{2\sigma_n}{\sqrt{n}}\Big) \approx 0.95$$

Application: Fix n, simulate X_1, ..., X_n iid π

Asymptotic 95% CI for $E_\pi g$: $\bar{g}_n \pm 2\sigma_n/\sqrt{n}$
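As a concrete illustration (not from the talk), here is a minimal sketch of this recipe, with π taken to be the standard normal and g(x) = x², so the true answer E_π g = 1 is known:

```python
# Classical Monte Carlo: iid draws from pi, the sample mean gbar_n, and
# the asymptotic 95% CI  gbar_n +/- 2*sigma_n/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)    # X_1, ..., X_n iid pi = N(0, 1)
g = x**2                      # g(X_i); E_pi g = 1 exactly

gbar = g.mean()               # gbar_n
sigma_n = g.std(ddof=1)       # sigma_n, the sample sd of the g(X_i)
half = 2 * sigma_n / np.sqrt(n)

print(f"estimate {gbar:.4f}, 95% CI ({gbar - half:.4f}, {gbar + half:.4f})")
```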
Back to the example. We want $\int_{\mathbb{R}^p} g(x)\,\pi(x)\,dx$ where

$$\pi(x) \propto \prod_{i=1}^m \big[\Phi(v_i^T x)\big]^{z_i}\big[1-\Phi(v_i^T x)\big]^{1-z_i}$$

We cannot simulate iid rvs from π(x)
⇒ Classical Monte Carlo method is not applicable

We can simulate a Markov chain that converges to π(x)

If X_n = x, simulate X_{n+1} as follows:
Step 1: Draw Y_1, ..., Y_m independently such that $Y_i \sim TN(v_i^T x, 1, z_i)$
Step 2: Draw $X_{n+1} \sim N_p\big((V^T V)^{-1} V^T y,\ (V^T V)^{-1}\big)$

"TN" = truncated normal and $V = [v_1 \cdots v_m]^T$

Albert & Chib (1993, JASA)
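The two steps translate directly into code. Here is a minimal sketch of the Albert & Chib sampler, assuming numpy/scipy; the data set (V, z) below is made up purely for illustration:

```python
# Data augmentation sampler for Bayesian probit regression (Albert & Chib 1993).
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def albert_chib(V, z, n_iter=5000):
    m, p = V.shape
    G = np.linalg.inv(V.T @ V)       # (V^T V)^{-1}
    L = np.linalg.cholesky(G)
    x = np.zeros(p)                  # start the chain at X_0 = 0
    chain = np.empty((n_iter, p))
    for n in range(n_iter):
        mu = V @ x
        # Step 1: Y_i ~ N(v_i^T x, 1) truncated to (0, inf) if z_i = 1,
        # and to (-inf, 0] if z_i = 0; bounds are in standardized units.
        lo = np.where(z == 1, -mu, -np.inf)
        hi = np.where(z == 1, np.inf, -mu)
        y = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # Step 2: X_{n+1} ~ N_p((V^T V)^{-1} V^T y, (V^T V)^{-1})
        x = G @ V.T @ y + L @ rng.standard_normal(p)
        chain[n] = x
    return chain

# Hypothetical toy data: m = 100 binary observations, p = 2 covariates.
V = np.column_stack([np.ones(100), rng.standard_normal(100)])
z = (V @ np.array([0.5, 1.0]) + rng.standard_normal(100) > 0).astype(int)
print(albert_chib(V, z).mean(axis=0))   # MCMC estimate of E_pi[X]
```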
Can we honestly replace the iid sequence with a Markov chain?

Let X_0, X_1, X_2, ... be a well-behaved MC converging to π(x)

As in the iid case, let $\bar{g}_n := \frac{1}{n}\sum_{i=0}^{n-1} g(X_i)$

Ergodic Theorem: If $E_\pi|g| < \infty$, then $\bar{g}_n \to E_\pi g$ a.s. as n → ∞

So the MCMC estimator also converges a.s. to $E_\pi g$, but...

There is no free lunch!

In the MC context: $E_\pi g^2 < \infty \;\not\Rightarrow\; \sqrt{n}\,(\bar{g}_n - E_\pi g) \stackrel{d}{\to} N(0,\gamma^2)$

In general, CLTs will not exist unless the chain is fast mixing
III. A Little Markov Chain Theory

Define $\mathsf{X} = \{x \in \mathbb{R}^p : \pi(x) > 0\}$

Let {X_n}_{n=0}^∞ be a Markov chain on X such that
• $\Pr(X_{n+1} \in A \,|\, X_n = x) = \int_A k(x'|x)\,dx'$
• $\int_{\mathsf{X}} k(x'|x)\,\pi(x)\,dx = \pi(x')$

n-step Markov transition function: $P^n(x, A) := \Pr(X_n \in A \,|\, X_0 = x)$

If the chain is Harris ergodic, then, as n → ∞,

$$\|P^n(x,\cdot) - \pi(\cdot)\| := \sup_A \big|P^n(x,A) - \pi(A)\big| \downarrow 0$$

and, if $E_\pi|g| < \infty$, then $\bar{g}_n \to E_\pi g$ a.s. as n → ∞

Two closely related questions:
• Can we find n such that $\|P^n(x,\cdot) - \pi(\cdot)\| < 0.01$?
• Does $\bar{g}_n$ satisfy a CLT?
We want to know: $E_\pi g = \int_{\mathbb{R}^p} g(x)\,\pi(x)\,dx$

n-step Mtf: $P^n(x, A) = \Pr(X_n \in A \,|\, X_0 = x)$

Ergodic Theorem: If $E_\pi|g| < \infty$, then $\bar{g}_n \to E_\pi g$ a.s. as n → ∞

To be honest, we need a CLT: $\sqrt{n}\,(\bar{g}_n - E_\pi g) \stackrel{d}{\to} N(0,\gamma^2)$

Definition: {X_n}_{n=0}^∞ is geometrically ergodic (GE) if, for some ρ ∈ [0,1), some M : X → [0,∞), and all n ∈ N,

$$\|P^n(x,\cdot) - \pi(\cdot)\| \le M(x)\,\rho^n$$

Theorem: If {X_n}_{n=0}^∞ is GE, then the CLT holds for g : X → R as long as $E_\pi|g|^{2+\delta} < \infty$ for some δ > 0

Rosenthal (1995, JASA) shows how to construct M(·) and ρ using drift and minorization conditions...
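Given such a CLT, an honest interval still requires an estimate of γ². The talk does not prescribe one; a common choice is the batch means estimator, sketched below under the assumption that the values g(X_0), ..., g(X_{n-1}) are stored in an array:

```python
# Batch means sketch: split the chain into roughly sqrt(n) batches and
# use the variability of the batch means to estimate the CLT variance
# gamma^2. This is a standard estimator, not one prescribed in the talk.
import numpy as np

def batch_means_ci(gvals):
    n = len(gvals)
    b = int(np.sqrt(n))                      # batch length
    a = n // b                               # number of batches
    means = gvals[: a * b].reshape(a, b).mean(axis=1)
    gbar = gvals.mean()
    gamma2_hat = b * np.var(means, ddof=1)   # estimates gamma^2
    half = 2 * np.sqrt(gamma2_hat / n)
    return gbar, (gbar - half, gbar + half)

# Usage with a stored chain (hypothetical array g_of_chain):
# gbar, ci = batch_means_ci(g_of_chain)
```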
IV. Drift + Minorization ⇒ Geometric Ergodicity

Recall that {X_n}_{n=0}^∞ is a MC on X such that

$$\Pr(X_{n+1} \in A \,|\, X_n = x) = \int_A k(x'|x)\,dx'$$

A minorization condition holds if, for some density function q(·), some ε ∈ (0,1), and some set C ⊂ X,

$$k(x'|x) \ge \varepsilon\, q(x') \quad \text{for all } x' \in \mathsf{X} \text{ and } x \in C$$

For fixed x ∈ C, define the residual density as

$$r(x'|x) = \frac{k(x'|x) - \varepsilon\, q(x')}{1 - \varepsilon}$$

So, for x ∈ C, we can split k(x'|x) into a mixture:

$$k(x'|x) = \varepsilon\, q(x') + (1 - \varepsilon)\, r(x'|x)$$
For x ∈ C: $k(x'|x) = \varepsilon\, q(x') + (1-\varepsilon)\, r(x'|x)$ (∗)

Use (∗) to construct two "coupled" Markov chains

X_0, X_1, X_2, ... and Z_0, Z_1, Z_2, ...

with X_0 = x and Z_0 ∼ π

[Figure: the state space X containing the small set C, with the current states X_n and Z_n marked]

If (X_n, Z_n) ∉ C × C, then X_{n+1} ∼ k(·|X_n) and Z_{n+1} ∼ k(·|Z_n)

If (X_n, Z_n) ∈ C × C, flip an ε-coin, and
  if the coin is tails: X_{n+1} ∼ r(·|X_n) and Z_{n+1} ∼ r(·|Z_n)
  if the coin is heads: X_{n+1} = Z_{n+1} ∼ q(·)

Let T denote the "coupling time"
We have two coupled Markov chains

X_0, X_1, X_2, ... and Z_0, Z_1, Z_2, ...

with X_0 = x and Z_0 ∼ π. The rv T is the coupling time.

Goal: Establish that $\|P^n(x,\cdot) - \pi(\cdot)\| \le M(x)\,\rho^n$

The coupling inequality:

$$\|P^n(x,\cdot) - \pi(\cdot)\| := \sup_A \big|P^n(x,A) - \pi(A)\big| = \sup_A \big|\Pr(X_n \in A) - \Pr(Z_n \in A)\big| \le \Pr(X_n \ne Z_n) \le \Pr_x(T > n)$$

New goal: Establish that $\Pr_x(T > n) \le M(x)\,\rho^n$

Easy case: C = X ⇒ T ∼ Geometric(ε) and $\Pr_x(T > n) = (1-\varepsilon)^n$

Hard case: C ≠ X. Intuitively, the chain must return to C quickly
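In the easy case the coupling is simple enough to simulate. The sketch below uses a hypothetical kernel that is exactly of the mixture form (∗) on all of X, with q = N(0,1), r(·|x) = N(0.9x, 1), and ε = 0.3, and confirms that Pr_x(T > n) = (1 − ε)^n:

```python
# Coupling in the easy case C = X: at each step both chains share one
# eps-coin; on heads they coalesce with a common draw from q, on tails
# each moves with its own residual draw. The kernel here is hypothetical.
import numpy as np

rng = np.random.default_rng(2)
eps = 0.3

def coupling_time(x0, max_n=1000):
    x, z = x0, rng.standard_normal()   # Z_0 ~ pi ideally; N(0,1) stand-in
    for n in range(1, max_n + 1):
        if rng.random() < eps:         # heads: X_n = Z_n ~ q = N(0,1)
            return n
        x = rng.normal(0.9 * x, 1.0)   # tails: residual r(.|x) = N(0.9x, 1)
        z = rng.normal(0.9 * z, 1.0)
    return max_n

T = np.array([coupling_time(5.0) for _ in range(50_000)])
for n in [1, 3, 5, 10]:
    print(f"Pr(T > {n:2d}): simulated {np.mean(T > n):.4f}, "
          f"(1 - eps)^{n} = {(1 - eps)**n:.4f}")
```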
A drift condition guarantees that the chain returns to C quickly

Let V : X → [0,∞) be a function such that, for some d > 0,

$$C = \{x \in \mathsf{X} : V(x) \le d\}$$

A drift condition holds if

$$E\big[V(X_1) \,|\, X_0 = x\big] \le \lambda\, V(x) + b$$

with λ ∈ [0,1) and b > 0 such that $\frac{2b}{1-\lambda} < d$

Intuition: If x ∉ C, then

$$E\big[V(X_1) \,|\, X_0 = x\big] - V(x) \le -(1-\lambda)\, V(x) + b < 0,$$

so we "expect" the chain to "drift" towards C

Drift & minorization yield M and ρ such that $\|P^n(x,\cdot) - \pi(\cdot)\| \le M(x)\,\rho^n$
V. A Toy Example: $\pi(x) = \frac{3}{8}\big(1 + \frac{x^2}{4}\big)^{-5/2}$

$$k_{DA}(x'|x) = \int_{\mathsf{Y}} \pi_{X|Y}(x'|y)\, \pi_{Y|X}(y|x)\,dy$$

Joint density for DA: $\pi(x,y) = \frac{4\, y^{3/2}}{\sqrt{2\pi}} \exp\big\{-y\big(\tfrac{x^2}{2} + 2\big)\big\}\, I_{(0,\infty)}(y)$

⇒ $X|Y \sim N\big(0, \tfrac{1}{Y}\big)$ and $Y|X \sim \mathrm{Gamma}\big(\tfrac{5}{2}, \tfrac{X^2}{2} + 2\big)$

Drift: Take V(x) = x². Then

$$E\big[X_1^2 \,|\, X_0 = x\big] = \int_{\mathsf{Y}} \Big[\int_{\mathsf{X}} u^2\, \pi_{X|Y}(u|y)\,du\Big] \pi_{Y|X}(y|x)\,dy = \frac{x^2}{3} + \frac{4}{3}$$

Minorization: $k_{DA}(x'|x) \ge 0.343\, q(x')$ for $x \in C = \{x \in \mathbb{R} : x^2 \le 10\}$

Combining drift and minorization: $\|P^n(x,\cdot) - \pi(\cdot)\| \le (x^2 + 4)\,(0.943)^n$
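The drift computation is easy to sanity-check by simulation. A minimal sketch (assuming numpy) that runs one DA update from several starting points and compares the simulated E[X₁² | X₀ = x] with x²/3 + 4/3:

```python
# One step of the toy DA chain: Y | X = x ~ Gamma(5/2, rate = x^2/2 + 2),
# then X' | Y = y ~ N(0, 1/y). Checks E[X_1^2 | X_0 = x] = x^2/3 + 4/3.
import numpy as np

rng = np.random.default_rng(1)

def da_step(x, size):
    y = rng.gamma(shape=2.5, scale=1.0 / (x**2 / 2 + 2), size=size)
    return rng.normal(0.0, 1.0 / np.sqrt(y))

for x in [0.0, 2.0, 5.0]:
    sim = np.mean(da_step(x, size=200_000) ** 2)
    print(f"x = {x}: simulated {sim:.3f} vs x^2/3 + 4/3 = {x**2/3 + 4/3:.3f}")
```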
VI. Applications of the drift & minorization method
• Linear mixed models: Roman & H (2012, AoS)
Abrahamsen & H (2017, Bernoulli)
• Binary data models: Roy & H (2008, JRSSB)
Choi & H (2013, EJS)
Chakraborty & Khare (2016+, EJS)
• Binary mixed models: Choi & Roman (2016, wp)
• Shrinkage models: Pal & Khare (2014, EJS)
Pal, Khare & H (2016+, SJS)
• Shrinkage mixed models: Abrahamsen & H (2016, wp)