Source: users.stat.ufl.edu/~jhobert/papers/sdsu_reu_2016.pdf
Honest MCMC via Drift and Minorization
Jim Hobert, U. of Florida
Outline:
I. iid sequences vs. Markov chains
II. What I mean by “Honest MCMC”
III. A little Markov chain theory
IV. Drift + Minorization ⇒ Geometric Ergodicity
V. A toy example
VI. An incomplete list of applications
Loosely based on an old Stat Sci paper with Galin Jones
I. An iid sequence versus a Markov chain

Let {X_n}_{n=0}^∞ be a sequence of random vectors on X ⊂ R^p

The joint pdf of (X_0, X_1, X_2, ..., X_n) can be expressed as

$$f_0(x_0)\, f_{1|0}(x_1|x_0)\, f_{2|0,1}(x_2|x_0,x_1) \cdots f_{n|0,1,\dots,n-1}(x_n|x_0,x_1,\dots,x_{n-1})$$

If the sequence is iid with common pdf π : X → [0,∞), this reduces to

$$\pi(x_0)\,\pi(x_1)\,\pi(x_2)\cdots\pi(x_n)$$

If the sequence is a Markov chain with Markov transition density (Mtd) k : X × X → [0,∞), it reduces to

$$f_0(x_0)\, k(x_1|x_0)\, k(x_2|x_1) \cdots k(x_n|x_{n-1})$$

Now assume a connection between π and k, namely that π is invariant for k:

$$\int_{\mathsf{X}} k(x'|x)\,\pi(x)\,dx = \pi(x')$$
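This invariance condition is easy to check numerically when the state space is finite, where the integral becomes a sum. Below is a minimal sketch (not from the talk; the 3-state transition matrix is made up for illustration) verifying that the stationary distribution π satisfies πK = π:

```python
# Numerical check of invariance on a finite state space, where
# \int k(x'|x) pi(x) dx  becomes the vector-matrix product  pi @ K.
# The 3-state transition matrix K is hypothetical, chosen for illustration.
import numpy as np

K = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])   # row x holds k(.|x); each row sums to 1

# The stationary pi solves pi K = pi, i.e. it is the left eigenvector
# of K associated with eigenvalue 1.
vals, vecs = np.linalg.eig(K.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()

print(pi)        # the stationary distribution
print(pi @ K)    # identical to pi: the chain leaves pi invariant
```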
II. What I Mean By “Honest MCMC”
We have u : R^p → [0,∞) satisfying $0 < \int_{\mathbb{R}^p} u(y)\,dy < \infty$

Define the pdf: $\pi(x) = u(x)\big/\int_{\mathbb{R}^p} u(y)\,dy$

The problem: We want to know the value of

$$E_\pi g := \int_{\mathbb{R}^p} g(x)\,\pi(x)\,dx = \frac{\int_{\mathbb{R}^p} g(x)\,u(x)\,dx}{\int_{\mathbb{R}^p} u(y)\,dy}\,,$$

but top and bottom are both intractable.
We could approximate Eπg using:
• analytical approximations
• numerical integration
• Monte Carlo methods (classical Monte Carlo & MCMC)
An example from Bayesian probit regression
$$\pi : \mathbb{R}^p \to (0,\infty) \quad \text{and} \quad \pi(x) = \frac{\prod_{i=1}^m \big[\Phi(v_i^T x)\big]^{z_i}\big[1-\Phi(v_i^T x)\big]^{1-z_i}}{\int_{\mathbb{R}^p}\prod_{i=1}^m \big[\Phi(v_i^T y)\big]^{z_i}\big[1-\Phi(v_i^T y)\big]^{1-z_i}\,dy}$$

where
• m ∈ N is known
• z_i ∈ {0, 1} is known
• v_i ∈ R^p is known
• $\Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,dy$
We want the value of $E_\pi g := \int_{\mathbb{R}^p} g(x)\,\pi(x)\,dx$

Classical Monte Carlo Solution

Theory: Let X_1, X_2, X_3, ... be iid π and form $\bar{g}_n := \frac{1}{n}\sum_{i=1}^n g(X_i)$

SLLN: If $E_\pi|g| < \infty$, then $\bar{g}_n \to E_\pi g$ a.s. as n → ∞

CLT: If $E_\pi g^2 < \infty$, then $\sqrt{n}\,(\bar{g}_n - E_\pi g)/\sigma_n \stackrel{d}{\to} N(0,1)$, where

$$\sigma_n^2 = \frac{1}{n-1}\sum_{i=1}^n \big(g(X_i) - \bar{g}_n\big)^2$$

So, for large n,

$$\Pr\Big(\bar{g}_n - \tfrac{2\sigma_n}{\sqrt{n}} < E_\pi g < \bar{g}_n + \tfrac{2\sigma_n}{\sqrt{n}}\Big) \approx 0.95$$

Application: Fix n, simulate X_1, ..., X_n iid π

Asymptotic 95% CI for $E_\pi g$: $\bar{g}_n \pm 2\sigma_n/\sqrt{n}$
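As a concrete illustration (not from the talk), here is a minimal sketch of this recipe, with π taken to be the standard normal and g(x) = x², so the true answer E_π g = 1 is known:

```python
# Classical Monte Carlo: iid draws from pi, the sample mean gbar_n, and
# the asymptotic 95% CI  gbar_n +/- 2*sigma_n/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)    # X_1, ..., X_n iid pi = N(0, 1)
g = x**2                      # g(X_i); E_pi g = 1 exactly

gbar = g.mean()               # gbar_n
sigma_n = g.std(ddof=1)       # sigma_n, the sample sd of the g(X_i)
half = 2 * sigma_n / np.sqrt(n)

print(f"estimate {gbar:.4f}, 95% CI ({gbar - half:.4f}, {gbar + half:.4f})")
```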
Back to the example. We want $\int_{\mathbb{R}^p} g(x)\,\pi(x)\,dx$ where

$$\pi(x) \propto \prod_{i=1}^m \big[\Phi(v_i^T x)\big]^{z_i}\big[1-\Phi(v_i^T x)\big]^{1-z_i}$$

We cannot simulate iid rvs from π(x)
⇒ Classical Monte Carlo method is not applicable

We can simulate a Markov chain that converges to π(x)

If X_n = x, simulate X_{n+1} as follows:
Step 1: Draw Y_1, ..., Y_m independently such that $Y_i \sim TN(v_i^T x, 1, z_i)$
Step 2: Draw $X_{n+1} \sim N_p\big((V^T V)^{-1} V^T y,\ (V^T V)^{-1}\big)$

"TN" = truncated normal and $V = [v_1 \cdots v_m]^T$

Albert & Chib (1993, JASA)
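The two steps translate directly into code. Here is a minimal sketch of the Albert & Chib sampler, assuming numpy/scipy; the data set (V, z) below is made up purely for illustration:

```python
# Data augmentation sampler for Bayesian probit regression (Albert & Chib 1993).
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def albert_chib(V, z, n_iter=5000):
    m, p = V.shape
    G = np.linalg.inv(V.T @ V)       # (V^T V)^{-1}
    L = np.linalg.cholesky(G)
    x = np.zeros(p)                  # start the chain at X_0 = 0
    chain = np.empty((n_iter, p))
    for n in range(n_iter):
        mu = V @ x
        # Step 1: Y_i ~ N(v_i^T x, 1) truncated to (0, inf) if z_i = 1,
        # and to (-inf, 0] if z_i = 0; bounds are in standardized units.
        lo = np.where(z == 1, -mu, -np.inf)
        hi = np.where(z == 1, np.inf, -mu)
        y = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # Step 2: X_{n+1} ~ N_p((V^T V)^{-1} V^T y, (V^T V)^{-1})
        x = G @ V.T @ y + L @ rng.standard_normal(p)
        chain[n] = x
    return chain

# Hypothetical toy data: m = 100 binary observations, p = 2 covariates.
V = np.column_stack([np.ones(100), rng.standard_normal(100)])
z = (V @ np.array([0.5, 1.0]) + rng.standard_normal(100) > 0).astype(int)
print(albert_chib(V, z).mean(axis=0))   # MCMC estimate of E_pi[X]
```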
Can we honestly replace the iid sequence with a Markov chain?

Let X_0, X_1, X_2, ... be a well-behaved MC converging to π(x)

As in the iid case, let $\bar{g}_n := \frac{1}{n}\sum_{i=0}^{n-1} g(X_i)$

Ergodic Theorem: If $E_\pi|g| < \infty$, then $\bar{g}_n \to E_\pi g$ a.s. as n → ∞

So the MCMC estimator also converges a.s. to $E_\pi g$, but...

There is no free lunch!

In the MC context: $E_\pi g^2 < \infty \;\not\Rightarrow\; \sqrt{n}\,(\bar{g}_n - E_\pi g) \stackrel{d}{\to} N(0,\gamma^2)$

In general, CLTs will not exist unless the chain is fast mixing
III. A Little Markov Chain Theory

Define $\mathsf{X} = \{x \in \mathbb{R}^p : \pi(x) > 0\}$

Let {X_n}_{n=0}^∞ be a Markov chain on X such that
• $\Pr(X_{n+1} \in A \,|\, X_n = x) = \int_A k(x'|x)\,dx'$
• $\int_{\mathsf{X}} k(x'|x)\,\pi(x)\,dx = \pi(x')$

n-step Markov transition function: $P^n(x, A) := \Pr(X_n \in A \,|\, X_0 = x)$

If the chain is Harris ergodic, then, as n → ∞,

$$\|P^n(x,\cdot) - \pi(\cdot)\| := \sup_A \big|P^n(x,A) - \pi(A)\big| \downarrow 0$$

and, if $E_\pi|g| < \infty$, then $\bar{g}_n \to E_\pi g$ a.s. as n → ∞

Two closely related questions:
• Can we find n such that $\|P^n(x,\cdot) - \pi(\cdot)\| < 0.01$?
• Does $\bar{g}_n$ satisfy a CLT?
We want to know: $E_\pi g = \int_{\mathbb{R}^p} g(x)\,\pi(x)\,dx$

n-step Mtf: $P^n(x, A) = \Pr(X_n \in A \,|\, X_0 = x)$

Ergodic Theorem: If $E_\pi|g| < \infty$, then $\bar{g}_n \to E_\pi g$ a.s. as n → ∞

To be honest, we need a CLT: $\sqrt{n}\,(\bar{g}_n - E_\pi g) \stackrel{d}{\to} N(0,\gamma^2)$

Definition: {X_n}_{n=0}^∞ is geometrically ergodic (GE) if, for some ρ ∈ [0,1), some M : X → [0,∞), and all n ∈ N,

$$\|P^n(x,\cdot) - \pi(\cdot)\| \le M(x)\,\rho^n$$

Theorem: If {X_n}_{n=0}^∞ is GE, then the CLT holds for g : X → R as long as $E_\pi|g|^{2+\delta} < \infty$ for some δ > 0

Rosenthal (1995, JASA) shows how to construct M(·) and ρ using drift and minorization conditions...
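Given such a CLT, an honest interval still requires an estimate of γ². The talk does not prescribe one; a common choice is the batch means estimator, sketched below under the assumption that the values g(X_0), ..., g(X_{n-1}) are stored in an array:

```python
# Batch means sketch: split the chain into roughly sqrt(n) batches and
# use the variability of the batch means to estimate the CLT variance
# gamma^2. This is a standard estimator, not one prescribed in the talk.
import numpy as np

def batch_means_ci(gvals):
    n = len(gvals)
    b = int(np.sqrt(n))                      # batch length
    a = n // b                               # number of batches
    means = gvals[: a * b].reshape(a, b).mean(axis=1)
    gbar = gvals.mean()
    gamma2_hat = b * np.var(means, ddof=1)   # estimates gamma^2
    half = 2 * np.sqrt(gamma2_hat / n)
    return gbar, (gbar - half, gbar + half)

# Usage with a stored chain (hypothetical array g_of_chain):
# gbar, ci = batch_means_ci(g_of_chain)
```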
IV. Drift + Minorization ⇒ Geometric Ergodicity

Recall that {X_n}_{n=0}^∞ is a MC on X such that

$$\Pr(X_{n+1} \in A \,|\, X_n = x) = \int_A k(x'|x)\,dx'$$

A minorization condition holds if, for some density function q(·), some ε ∈ (0,1), and some set C ⊂ X,

$$k(x'|x) \ge \varepsilon\, q(x') \quad \text{for all } x' \in \mathsf{X} \text{ and } x \in C$$

For fixed x ∈ C, define the residual density as

$$r(x'|x) = \frac{k(x'|x) - \varepsilon\, q(x')}{1 - \varepsilon}$$

So, for x ∈ C, we can split k(x'|x) into a mixture:

$$k(x'|x) = \varepsilon\, q(x') + (1 - \varepsilon)\, r(x'|x)$$
For x ∈ C: $k(x'|x) = \varepsilon\, q(x') + (1-\varepsilon)\, r(x'|x)$ (∗)

Use (∗) to construct two "coupled" Markov chains

X_0, X_1, X_2, ... and Z_0, Z_1, Z_2, ...

with X_0 = x and Z_0 ∼ π

[Figure: the state space X containing the small set C, with the current states X_n and Z_n marked]

If (X_n, Z_n) ∉ C × C, then X_{n+1} ∼ k(·|X_n) and Z_{n+1} ∼ k(·|Z_n)

If (X_n, Z_n) ∈ C × C, flip an ε-coin, and
  if the coin is tails: X_{n+1} ∼ r(·|X_n) and Z_{n+1} ∼ r(·|Z_n)
  if the coin is heads: X_{n+1} = Z_{n+1} ∼ q(·)

Let T denote the "coupling time"
We have two coupled Markov chains

X_0, X_1, X_2, ... and Z_0, Z_1, Z_2, ...

with X_0 = x and Z_0 ∼ π. The rv T is the coupling time.

Goal: Establish that $\|P^n(x,\cdot) - \pi(\cdot)\| \le M(x)\,\rho^n$

The coupling inequality:

$$\|P^n(x,\cdot) - \pi(\cdot)\| := \sup_A \big|P^n(x,A) - \pi(A)\big| = \sup_A \big|\Pr(X_n \in A) - \Pr(Z_n \in A)\big| \le \Pr(X_n \ne Z_n) \le \Pr_x(T > n)$$

New goal: Establish that $\Pr_x(T > n) \le M(x)\,\rho^n$

Easy case: C = X ⇒ T ∼ Geometric(ε) and $\Pr_x(T > n) = (1-\varepsilon)^n$

Hard case: C ≠ X. Intuitively, the chain must return to C quickly
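In the easy case the coupling is simple enough to simulate. The sketch below uses a hypothetical kernel that is exactly of the mixture form (∗) on all of X, with q = N(0,1), r(·|x) = N(0.9x, 1), and ε = 0.3, and confirms that Pr_x(T > n) = (1 − ε)^n:

```python
# Coupling in the easy case C = X: at each step both chains share one
# eps-coin; on heads they coalesce with a common draw from q, on tails
# each moves with its own residual draw. The kernel here is hypothetical.
import numpy as np

rng = np.random.default_rng(2)
eps = 0.3

def coupling_time(x0, max_n=1000):
    x, z = x0, rng.standard_normal()   # Z_0 ~ pi ideally; N(0,1) stand-in
    for n in range(1, max_n + 1):
        if rng.random() < eps:         # heads: X_n = Z_n ~ q = N(0,1)
            return n
        x = rng.normal(0.9 * x, 1.0)   # tails: residual r(.|x) = N(0.9x, 1)
        z = rng.normal(0.9 * z, 1.0)
    return max_n

T = np.array([coupling_time(5.0) for _ in range(50_000)])
for n in [1, 3, 5, 10]:
    print(f"Pr(T > {n:2d}): simulated {np.mean(T > n):.4f}, "
          f"(1 - eps)^{n} = {(1 - eps)**n:.4f}")
```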
A drift condition guarantees that the chain returns to C quickly

Let V : X → [0,∞) be a function such that, for some d > 0,

$$C = \{x \in \mathsf{X} : V(x) \le d\}$$

A drift condition holds if

$$E\big[V(X_1) \,|\, X_0 = x\big] \le \lambda\, V(x) + b$$

with λ ∈ [0,1) and b > 0 such that $\frac{2b}{1-\lambda} < d$

Intuition: If x ∉ C, then

$$E\big[V(X_1) \,|\, X_0 = x\big] - V(x) \le -(1-\lambda)\, V(x) + b < 0,$$

so we "expect" the chain to "drift" towards C

Drift & minorization yield M and ρ such that $\|P^n(x,\cdot) - \pi(\cdot)\| \le M(x)\,\rho^n$
V. A Toy Example: $\pi(x) = \frac{3}{8}\big(1 + \frac{x^2}{4}\big)^{-5/2}$

$$k_{DA}(x'|x) = \int_{\mathsf{Y}} \pi_{X|Y}(x'|y)\, \pi_{Y|X}(y|x)\,dy$$

Joint density for DA: $\pi(x,y) = \frac{4\, y^{3/2}}{\sqrt{2\pi}} \exp\big\{-y\big(\tfrac{x^2}{2} + 2\big)\big\}\, I_{(0,\infty)}(y)$

⇒ $X|Y \sim N\big(0, \tfrac{1}{Y}\big)$ and $Y|X \sim \mathrm{Gamma}\big(\tfrac{5}{2}, \tfrac{X^2}{2} + 2\big)$

Drift: Take V(x) = x². Then

$$E\big[X_1^2 \,|\, X_0 = x\big] = \int_{\mathsf{Y}} \Big[\int_{\mathsf{X}} u^2\, \pi_{X|Y}(u|y)\,du\Big] \pi_{Y|X}(y|x)\,dy = \frac{x^2}{3} + \frac{4}{3}$$

Minorization: $k_{DA}(x'|x) \ge 0.343\, q(x')$ for $x \in C = \{x \in \mathbb{R} : x^2 \le 10\}$

Combining drift and minorization: $\|P^n(x,\cdot) - \pi(\cdot)\| \le (x^2 + 4)\,(0.943)^n$
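The drift computation is easy to sanity-check by simulation. A minimal sketch (assuming numpy) that runs one DA update from several starting points and compares the simulated E[X₁² | X₀ = x] with x²/3 + 4/3:

```python
# One step of the toy DA chain: Y | X = x ~ Gamma(5/2, rate = x^2/2 + 2),
# then X' | Y = y ~ N(0, 1/y). Checks E[X_1^2 | X_0 = x] = x^2/3 + 4/3.
import numpy as np

rng = np.random.default_rng(1)

def da_step(x, size):
    y = rng.gamma(shape=2.5, scale=1.0 / (x**2 / 2 + 2), size=size)
    return rng.normal(0.0, 1.0 / np.sqrt(y))

for x in [0.0, 2.0, 5.0]:
    sim = np.mean(da_step(x, size=200_000) ** 2)
    print(f"x = {x}: simulated {sim:.3f} vs x^2/3 + 4/3 = {x**2/3 + 4/3:.3f}")
```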
VI. Applications of the drift & minorization method
• Linear mixed models: Roman & H (2012, AoS)
Abrahamsen & H (2017, Bernoulli)
• Binary data models: Roy & H (2008, JRSSB)
Choi & H (2013, EJS)
Chakraborty & Khare (2016+, EJS)
• Binary mixed models: Choi & Roman (2016, wp)
• Shrinkage models: Pal & Khare (2014, EJS)
Pal, Khare & H (2016+, SJS)
• Shrinkage mixed models: Abrahamsen & H (2016, wp)