Submitted to the Annals of Statistics
arXiv: arXiv:1803.10282
REGULARIZATION AND COMPUTATION WITH
HIGH-DIMENSIONAL SPIKE-AND-SLAB POSTERIOR
DISTRIBUTIONS∗
By Yves Atchade† and Anwesha Bhattacharyya†
University of Michigan†
We consider the Bayesian analysis of a high-dimensional statistical model with a spike-and-slab prior, and we study the forward-backward envelope of the posterior distribution – which we denote Πγ for some regularization parameter γ > 0. Viewing Πγ as a pseudo-posterior distribution, we work out a set of sufficient conditions under which it contracts towards the true value of the parameter as γ ↓ 0 and p (the dimension of the parameter space) diverges to ∞. In linear regression models the contraction rate matches the contraction rate of the true posterior distribution.
In the second part of the paper, we study a practical Markov Chain Monte Carlo (MCMC) algorithm to sample from Πγ. In the particular case of the linear regression model, and focusing on models with high signal-to-noise ratios, we show that the mixing time of the MCMC algorithm depends crucially on the coherence of the design matrix and on the initialization of the Markov chain. In the most favorable cases, we show that the computational complexity of the algorithm scales with the dimension p as O(p e^{s?^2}), where s? is the number of nonzero components of the true parameter. We provide some simulation results to illustrate the theory. Our simulation results also suggest that the proposed algorithm (as well as a version of the Gibbs sampler of [31]) mixes poorly when poorly initialized, or when the design matrix has high coherence.
1. Introduction. Suppose that we wish to infer a parameter θ ∈ Rp from a
random sample Z ∈ Z, based on the statistical model Z ∼ fθ(z)dz, where fθ is a
density on a sample space Z equipped with a reference sigma-finite measure dz. The
∗This work is partially supported by the NSF grant DMS 1513040.
MSC 2010 subject classifications: Primary 62F15, 60K35; secondary 60K35
Keywords and phrases: High-dimensional Bayesian inference, Variable selection, Posterior contrac-
tion, Forward-backward envelope, Markov Chain Monte Carlo Mixing, High-dimensional linear re-
gression models
log-likelihood function
ℓ(θ; z) def= log fθ(z), θ ∈ R^p, z ∈ Z,
is assumed to be a concave function of θ and a jointly measurable function on R^p × Z.
We take a Bayesian approach and consider a setting where p is very large and it is
statistically appealing to perform a variable selection step in the estimation process.
This problem has attracted a lot of attention in recent years, and several approaches
are available. One of the most effective solutions – at least in theory – relies on spike-and-slab priors ([29, 14]), and can be described as follows. For δ ∈ ∆ def= {0, 1}^p, let
µδ(dθ) denote the product measure on R^p given by
µδ(dθ) def= ∏_{j=1}^p µδj(dθj),
where µ0(dz) is the Dirac mass at 0, and µ1(dz) is the Lebesgue measure on R. With
{ωδ, δ ∈ ∆} denoting a prior distribution on ∆, we consider the spike-and-slab prior
distribution on ∆× Rp given by1
(1.1) ωδ (ρ/2)^{‖δ‖_1} e^{−ρ‖θ‖_1} µδ(dθ),
for a parameter ρ > 0. Given Z = z, the resulting posterior distribution on ∆×Rp is
(1.2) Π(δ, dθ|z) ∝ fθ(z) ωδ (ρ/2)^{‖δ‖_1} e^{−ρ‖θ‖_1} µδ(dθ).
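To make the prior concrete, here is a minimal sketch of a draw from (1.1), assuming the independent Bernoulli weights for ωδ later adopted in H3; the values of `p`, `q`, and `rho` below are arbitrary illustrations, not values from the paper.

```python
import numpy as np

def sample_spike_and_slab_prior(p, q, rho, rng):
    """Draw (delta, theta) from the prior (1.1) with independent
    Bernoulli(q) weights on delta: active components theta_j are iid
    Laplace with rate rho, inactive ones sit exactly at zero (the
    Dirac spike)."""
    delta = rng.binomial(1, q, size=p)
    theta = np.where(delta == 1, rng.laplace(scale=1.0 / rho, size=p), 0.0)
    return delta, theta

rng = np.random.default_rng(0)
delta, theta = sample_spike_and_slab_prior(p=200, q=0.05, rho=2.0, rng=rng)
assert np.all(theta[delta == 0] == 0.0)  # spike components are exact zeros
```

The exact zeros produced by the spike are what distinguishes (1.1) from the Gaussian "weak" spike-and-slab approximation discussed below.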
The posterior distribution (1.2) has been recently studied in the high-dimensional
regime (see e.g. [12, 11, 1]), where it is shown to contract towards the true value of
the parameter at an optimal rate. In practice, however, this posterior is computationally difficult to handle and typically requires specialized MCMC techniques such as reversible jump ([17]) and related methods ([16, 38]). However, these MCMC algorithms are often difficult to design and tune, particularly in high-dimensional problems
(see for instance [38] and [2] for some numerical comparisons). In the particular case
1The use of the Laplace distribution is not fundamental. We use it here partly for mathematical
and computational convenience, and partly because of its widespread use in applications. Many
of the results below can also be worked out if the Laplace distribution is replaced by a density g
such that g(0) = 1, and log g is Lipschitz – note however that this condition is not satisfied by the
Gaussian distribution.
of the Gaussian linear model with a Gaussian slab density, it is sometimes possible to side-step these computational difficulties by integrating out the parameter θ. One can then explore the resulting discrete distribution δ ↦ Π(δ|z) using standard MCMC algorithms (see for instance [14, 8, 45] and the references therein). However, this strategy is not widely applicable.
A very popular approximation to Π – sometimes also referred to as spike-and-slab
– is obtained by replacing the point mass at zero by a Gaussian distribution with a
small variance γ, say, ([14, 20, 36, 31]). The resulting posterior distribution is
(1.3) Π̃γ(δ, du|z) ∝ ωδ (2πγ)^{‖δ‖_0/2} (ρ/2)^{‖δ‖_0} e^{ℓ(u;z) − ρ‖uδ‖_1 − (1/(2γ))‖u − uδ‖_2^2} du.
This approximation was recently studied in [31], where consistency in the high-dimensional linear regression setting is established. Here we consider another approximation scheme: the variational approximation of Π proposed in [2] – denoted Πγ – obtained by taking its forward-backward envelope (an envelope function similar to the Moreau envelope). One advantage of working with Πγ is that it is computationally and mathematically tractable. Indeed, the definition of Πγ leads
to tight approximation bounds that we leverage in the analysis. In our numerical experiments we found that the weak spike-and-slab approximation (1.3) and Πγ behave very similarly, and to some extent we expect the results – and the techniques – derived here to carry over to (1.3).
To define Πγ , we endow the Euclidean space Rp with its Lebesgue measure denoted
dθ or du, etc. For a set A ⊆ Rp, let ιA denote its characteristic function defined here as
ιA(u) = 0 if u ∈ A, and ιA(u) = +∞ otherwise. Given δ ∈ ∆, let R^p_δ def= {δ · u : u ∈ R^p} (where δ · u def= (u1δ1, . . . , upδp) ∈ R^p). The basic idea is to replace the function u ↦ −ℓ(u; z) + ρ‖u‖_1 + ι_{R^p_δ}(u) by its forward-backward envelope ([34]), defined as
(1.4) hγ(δ, θ; z) def= min_{v ∈ R^p_δ} [ −ℓ(θ; z) − ⟨∇ℓ(θ; z), v − θ⟩ + ρ‖v‖_1 + (1/(2γ))‖v − θ‖_2^2 ],
for some parameter γ > 0. This envelope function is closely related to the more widely known Moreau envelope ([30, 4, 33]). We choose to work with the forward-backward envelope because it is typically available in closed form, whereas the Moreau envelope is not. It is easily seen from the definition and the convexity of −ℓ(·; z) that
(1.5) hγ(δ, ·; z) ≤ −ℓ(·; z) + ρ‖ · ‖_1 + ι_{R^p_δ}(·),
and furthermore hγ(δ, ·; z) converges pointwise to −ℓ(·; z) + ρ‖ · ‖_1 + ι_{R^p_δ}(·) as γ ↓ 0 (see for instance [34, 2]). Assuming that ∫_{R^p} e^{−hγ(δ,u;z)} du is finite for all δ ∈ ∆, we are then naturally led to the probability distribution on ∆ × R^p given by
(1.6) Πγ(δ, du|z) ∝ ωδ (2πγ)^{‖δ‖_0/2} (ρ/2)^{‖δ‖_0} e^{−hγ(δ,u;z)} du,
that we take as an approximation of Π. We note however that it is not automatic
that Πγ is a well-defined probability distribution. Since hγ is known to approximate
−`(·; z) + ρ‖ · ‖1 + ιRpδ(·), we naturally expect Πγ to behave like Π, for γ small, and
results of this type can be found in [2].
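What makes hγ tractable can be seen in a small sketch for a least-squares log-likelihood: the minimizing v in (1.4) is obtained componentwise by soft-thresholding one gradient step, restricted to the support encoded by δ. The data and tuning values below are arbitrary illustrations, not taken from the paper.

```python
import numpy as np

def soft_threshold(x, t):
    # Componentwise minimizer of t*|v| + (v - x)^2 / 2.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fb_envelope(delta, theta, grad_ll, neg_ll, gamma, rho):
    """Forward-backward envelope h_gamma(delta, theta; z) of (1.4).
    neg_ll and grad_ll are -l(theta; z) and the gradient of l at theta.
    The minimizing v in R^p_delta is one soft-thresholded gradient step,
    restricted to the support encoded by delta."""
    v = delta * soft_threshold(theta + gamma * grad_ll, gamma * rho)
    return (neg_ll - grad_ll @ (v - theta) + rho * np.abs(v).sum()
            + np.sum((v - theta) ** 2) / (2.0 * gamma))

# Least-squares example: l(theta; z) = -0.5 * ||z - X theta||^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
z = rng.standard_normal(20)
theta = np.array([1.0, -0.5, 0.0, 0.0, 0.0])
delta = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
neg_ll = 0.5 * np.sum((z - X @ theta) ** 2)
grad_ll = X.T @ (z - X @ theta)
h = fb_envelope(delta, theta, grad_ll, neg_ll, gamma=0.01, rho=1.0)
# Inequality (1.5): h_gamma <= -l + rho*||.||_1 on R^p_delta.
assert h <= neg_ll + 1.0 * np.abs(theta).sum() + 1e-9
```

The final assertion checks the upper bound (1.5) at a point of R^p_δ, where it holds because v = θ is feasible in the minimization.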
1.1. Main contributions. The contribution of this work is two-fold. Firstly, viewing
Πγ as a pseudo/quasi-posterior distribution, we study its statistical properties as the
dimension p increases. We derive some sufficient conditions under which Πγ is well
defined and puts most of its probability mass around the true value of the parameter
as p → ∞. More precisely, Theorem 5 can be used to show that, with high probability, a draw (δ, θ) from Πγ(·|Z) produces a sparse vector δ and a non-sparse vector θ. Although θ is not sparse, its components for which δj = 0 are typically small (O(√γ)), and its sparsified version θδ = θ · δ is typically close to θ?, the true value of the parameter.
We also study the contraction rate of Πγ , and using the linear regression model
as an example, we show that the rate of contraction of Πγ matches that of Π as
derived in [11]. Furthermore, we show that Πγ enjoys a Bernstein-von Mises approximation, and again in the particular case of the linear regression model, we recover the Bernstein-von Mises theorem established by [11], up to a small difference in the Fisher information matrix due to the approximate nature of Πγ.
Practical use of Bayesian procedures typically hinges on the ability to draw samples
from the posterior distribution. In the second part of the paper we develop an efficient
Metropolis-within-Gibbs algorithm to sample from Πγ . The algorithm is similar to the
Gibbs sampler used in [31]. Building on the posterior contraction properties developed
in the first part, we analyze the computational complexity of the algorithm as p→∞,
focusing on the linear regression case. The behavior of the resulting Markov chain
depends on the signal-to-noise ratio of the underlying regression problem, and in this
work we focus solely on problems with high signal-to-noise ratios. Even in this case,
the mixing of the Markov chain depends on the initial distribution of the Markov
chain, and on a form of coherence of the design matrix X ∈ R^{n×p} defined as
C(X) def= sup_{j: δ?,j=1} (1/√n) ‖X'_{δc?} Xj‖_2,
where Xj is the j-th column of X, and Xδc? is the sub-matrix of X obtained by removing the columns of X for which δ?,j = 1. Figure 1-(a)-(c) show the estimated mixing time
of the algorithm (truncated at 2 × 10^4) as a function of p under different scenarios,
and illustrate the main conclusions of the paper. For comparison we also present – in
dashed lines – the estimated mixing times of a similar algorithm to sample from the
weak spike-and-slab posterior distribution (1.3) discussed above. The results presented here deal only with the high signal-to-noise ratio setting, and we refer the reader to
Section 3.4 for a detailed description of the experiment. Figure 1-(a) shows the mixing
times in a setting where the design matrix X has low coherence, and a good initial
value is used to start the algorithm2. In Figure 1-(b) the initial value remains the
same, but the design matrix X has a higher coherence parameter (see Section 3.4 for details on how such a design matrix is produced). Finally, in Figure 1-(c), the design matrix is the same as in Figure 1-(a) but the initial value of δ is “much worse”: it has 20% false-negatives, but no false-positives. In this latter case most of the observed mixing times are greater than the 20,000-iteration mark.
[Figure 1 comprises three panels, (a)-(c), each plotting estimated mixing time against the dimension p (from 0 to 5000), with the Moreau approximation shown as a solid line and the weak spike-and-slab sampler as a dashed line.]
Fig 1. Estimated mixing times as a function of the dimension p. (a) low coherence, good initialization. (b) high coherence, good initialization. (c) low coherence, poor initialization.
2Here a “good initial value” for the Markov chain is a model δ that has no false-negatives and 10% false-positives.
Two observations stand out from these results. The algorithm seems to mix quickly
when the design matrix X has low coherence and the initial distribution of the Markov
chain has no false-negative. The mixing seems to degrade when the coherence in-
creases, and the algorithm seems to mix very poorly when the initial distribution has
false-negatives. We show in Theorem 15 a result that partly supports these empirical
findings. More precisely, under some additional regularity conditions we show that
with an initial distribution without false-negatives, the mixing time of the algorithm
scales with p as
O( p exp( c s?^2 [ 1 + (C(X) / (s?^{1/2} log(p))) √(p/n) ]^2 ) ),
where s? = ‖θ?‖_0, and c is an absolute constant. The result implies that when p is of the same order as n, and the coherence C(X) is comparable to s?^{1/2} log(p), the mixing time of the algorithm is essentially linear in p.
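The coherence C(X) entering this bound is straightforward to evaluate directly from its definition; the following sketch does exactly that (the orthogonal design in the usage lines is only a sanity check, not an example from the paper).

```python
import numpy as np

def coherence(X, delta_star):
    """C(X) = max over signal columns j (delta*_j = 1) of
    ||X_{delta*^c}' X_j||_2 / sqrt(n), where X_{delta*^c} keeps the
    columns of X lying outside the true support."""
    n = X.shape[0]
    X_off = X[:, delta_star == 0]
    signal = np.where(delta_star == 1)[0]
    return max(np.linalg.norm(X_off.T @ X[:, j]) / np.sqrt(n) for j in signal)

n = 4
X = np.sqrt(n) * np.eye(n)              # orthogonal design with ||X_j||_2^2 = n
delta_star = np.array([1, 0, 0, 0])
assert coherence(X, delta_star) == 0.0  # orthogonal columns: zero coherence
```

Low-coherence designs (off-support columns nearly orthogonal to the signal columns) are the favorable regime in the bound above.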
Our result is related to the recent work of [45], which studied a Metropolis-Hastings algorithm to sample from the marginal distribution of δ in a linear regression model with a Gaussian spike-and-slab g-prior. Unlike this work, where the fast mixing of our MCMC algorithm hinges on a high signal-to-noise ratio, good initialization, and low coherence, these authors showed that their algorithm has a worst-case (with respect to the initial distribution) mixing time that scales as O(s?^2 (n + s?) p log(p)). We note that the posterior distribution considered by [45] can be viewed as a “collapsed” version of ours (where the regression parameters are integrated out). The robustness of the algorithm of [45] can perhaps be understood as a result of this collapsing – the benefits of collapsing variables in a block-update sampling problem are well known ([22]). We feel that more research is needed on this question. At the very least, we stress again that collapsing parameters from the posterior distribution is not always feasible, particularly in generalized linear regression models.
1.2. Outline of the paper. The paper is organized as follows. We study the statis-
tical properties of Πγ in Section 2. We derive two main theorems. Theorem 5 deals
with the contraction and the contraction rate of Πγ as p → ∞, whereas Theorem 8
studies variable selection and the Bernstein-von Mises theorem. We illustrate these
results in the particular case of the linear regression model in Section 2.2, leading to
Corollary 10. Since the proofs of the results of Section 2 follow similar techniques as
in [11] and [1], the details are placed in the appendix. In Section 3 we study the prob-
lem of sampling from Πγ , using a slightly modified version of the MCMC algorithm
of [2]. In the particular case of the linear regression model, we study the mixing of
the proposed algorithm. The main result there is Theorem 15, the proof of which is
developed in Section 4, with some technical details gathered in the appendix. Some
numerical simulations are detailed in Section 3.4.
1.3. Notation. Throughout we equip the Euclidean space Rp (p ≥ 1 integer) with
its usual Euclidean inner product 〈·, ·〉 and norm ‖ · ‖2, its Borel sigma-algebra, and
its Lebesgue measure. All vectors u ∈ Rp are column-vectors unless stated otherwise.
We also use the following norms on R^p: ‖θ‖_1 def= Σ_{j=1}^p |θj|, ‖θ‖_0 def= Σ_{j=1}^p 1{|θj| > 0}, and ‖θ‖_∞ def= max_{1≤j≤p} |θj|.
We set ∆ def= {0, 1}^p. For θ, θ′ ∈ R^p, θ · θ′ ∈ R^p denotes the component-wise product of θ and θ′. For δ ∈ ∆, we set R^p_δ def= {θ · δ : θ ∈ R^p}, and we write θδ as shorthand for θ · δ. We define δc def= 1 − δ, that is, δcj = 1 − δj, 1 ≤ j ≤ p. For a matrix A ∈ R^{m×m} and δ ∈ ∆, Aδ (resp. Aδc) denotes the matrix in R^{‖δ‖_0×‖δ‖_0} (resp. R^{(m−‖δ‖_0)×(m−‖δ‖_0)}) obtained by keeping only the rows and columns of A for which δj = 1 (resp. δj = 0). For δ, δ′ ∈ ∆, we write δ ⊇ δ′ to mean that for any j ∈ {1, . . . , p}, whenever δ′j = 1, we have δj = 1.
Throughout the paper e denotes Euler's number, and (m choose q) is the binomial coefficient m!/(q!(m − q)!). For x ∈ R, sign(x) is the sign of x (sign(x) = 1 if x > 0, sign(x) = −1 if x < 0, and sign(x) = 0 if x = 0).
If f(θ, x) is a real-valued function that depends on the parameter θ and some other
argument x, the notation ∇(k)f(θ, x), where k is an integer, denotes the k-th partial
derivative of f with respect to θ. For k = 1, we write ∇f(θ, x) instead of ∇(1)f(θ, x).
The total variation distance between two probability measures µ, ν is defined as
‖µ − ν‖tv def= sup_{A measurable} (µ(A) − ν(A)).
All the asymptotic results in the paper are derived by letting the dimension p grow
to infinity, and we say that a term x ∈ R is an absolute constant if x does not depend
on p.
2. Contraction properties of Πγ. In this section we establish that, under some mild conditions, Πγ is a well-defined probability distribution with posterior contraction properties similar to those of the spike-and-slab posterior distribution given in (1.2). We make the following assumptions.
H1. For all z ∈ Z, the function θ ↦ ℓ(θ; z) is concave and twice differentiable, and there exist symmetric positive semidefinite matrices S, S̄ ∈ R^{p×p} that do not depend on z and θ, such that for all θ ∈ R^p and all z ∈ Z,
−S̄ ⪯ ∇^{(2)} log fθ(z) ⪯ −S,
where the notation A ⪯ B means that B − A is positive semidefinite.
Remark 1. We note that one can always take S to be the zero matrix, since ℓ is concave. Hence H1 mainly requires that the Hessian matrix of the log-likelihood function be lower bounded, uniformly in θ and z, by the matrix −S̄. Although somewhat restrictive, this assumption is enough to handle linear and logistic regression models. It is unclear whether the results developed below can be extended beyond H1. □
For integer s ≥ 1, we set
κ̄(s) def= sup{ u'S̄u / ‖u‖_2^2 : u ∈ R^p \ {0}, ‖u‖_0 ≤ s }, and κ(s) def= inf{ u'Su / ‖u‖_2^2 : u ∈ R^p \ {0}, ‖u‖_0 ≤ s },
with the convention κ̄(0) = 0, κ(0) = +∞. For s = p, κ̄(s) is the largest eigenvalue of S̄, which we write as λmax(S̄).
Following a standard approach in Bayesian asymptotics, we will assume that there
exists a true value of the parameter θ? such that Z ∼ fθ? . More precisely we assume
the following.
H2. There exists θ? ∈ R^p \ {0} such that Z ∼ fθ?, and we set s? def= ‖θ?‖_0.
Under H2 we expect θ? to be close to the maximizer of the log-likelihood θ ↦ ℓ(θ; Z); that is, we expect ∇ℓ(θ?; Z) ≈ 0. Therefore, the sets
Ec def= { z ∈ Z : ‖∇ℓ(θ?; z)‖_∞ ≤ c/2 },
for c > 0, will naturally play an important role in the analysis. Throughout the paper, we write δ? ∈ ∆ to denote the sparsity structure of θ? (that is, δ?,j = 1{|θ?,j| > 0}, j = 1, . . . , p). We write P? (resp. E?) to denote the probability distribution (resp. expectation operator) of the random variable Z ∈ Z assumed in H2.
As a prior distribution on δ, we assume that the δj are independent Bernoulli
random variables. More precisely, we assume the following.
H3. For some absolute constant u > 0, setting q def= 1/p^{1+u}, we assume that
ωδ = ∏_{j=1}^p q^{δj} (1 − q)^{1−δj}.
Remark 2. The prior ωδ induced by H3 can be written as ωδ = g_{‖δ‖_0} / (p choose ‖δ‖_0), where g_s = (p choose s) q^s (1 − q)^{p−s}. It is then easily checked that
(2.1) g_s ≤ (2/p^u) g_{s−1}, s = 1, . . . , p.
Discrete priors {ωδ, δ ∈ ∆} of the form ωδ = g_{‖δ‖_0} / (p choose ‖δ‖_0), where {g_s} satisfies conditions of the form (2.1), were introduced by [12], and shown to work well for high-dimensional problems. □
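The geometric decay (2.1) is elementary to verify numerically; the following sketch checks it for one arbitrary choice of p and u (illustrative values, not from the paper).

```python
from math import comb

def g(p, q, s):
    # g_s = (p choose s) q^s (1 - q)^(p - s): the prior probability that
    # ||delta||_0 = s under the independent Bernoulli(q) weights of H3.
    return comb(p, s) * q ** s * (1 - q) ** (p - s)

p, u = 50, 1.0
q = 1.0 / p ** (1 + u)
# Check the geometric decay (2.1): g_s <= (2 / p^u) g_{s-1} for s = 1..p.
assert all(g(p, q, s) <= (2.0 / p ** u) * g(p, q, s - 1) for s in range(1, p + 1))
```

The decay follows from the ratio g_s/g_{s−1} = ((p − s + 1)/s) · q/(1 − q), which is maximal at s = 1 and bounded by roughly p^{−u} when q = p^{−(1+u)}.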
The ability to recover θ? depends on the quantity of information available, an idea that we formalize by imposing an appropriate restricted strong concavity condition on the log-likelihood ℓ via the function
(2.2) Lγ(δ, θ; z) def= ℓ(θ; z) − ℓ(θ?; z) − ⟨∇ℓ(θ?; z), θ − θ?⟩ + 2γ‖δ · ∇ℓ(θ; z) − δ · ∇ℓ(θ?; z)‖_2^2.
The function θ ↦ ℓ(θ; z) − ℓ(θ?; z) − ⟨∇ℓ(θ?; z), θ − θ?⟩ is the well-known Bregman divergence associated with ℓ. The additional term 2γ‖δ · ∇ℓ(θ; z) − δ · ∇ℓ(θ?; z)‖_2^2 in (2.2) is due to the envelope approximation, and would typically be small for γ small,
and δ sparse. For our results to work with a certain generality, we use the concept
of a rate function introduced in [1]. A continuous function r : [0,+∞) → [0,+∞) is
called a rate function if r(0) = 0, r is increasing and limx↓0 r(x)/x = 0.
H4. There exist γ > 0, ρ̄ > 0, s ∈ {s?, . . . , p}, and a rate function r1 such that for all δ ∈ ∆ with ‖δ‖_0 ≤ s, and all θ ∈ R^p_δ, we have
log E?[ e^{Lγ(δ,θ;Z)} 1_{Eρ̄}(Z) ] ≤ −(1/2) r1(‖θ − θ?‖_2).
Remark 3. We view H4 as a form of restricted strong convexity assumption on the function −ℓ ([32]). This assumption is much weaker than assuming that ℓ is strongly concave. We will see in Section 2.2 that H4 holds for linear regression models. It can also be shown to hold for logistic regression models, although we do not pursue this here. □
Remark 4. A lower bound on Lγ will also be needed below. We note that by Assumption H1, we have
(2.3) Lγ(δ, θ; z) ≥ −(1/2)(θ − θ?)'S̄(θ − θ?). □
With γ, ρ̄ and s as in H4, we define
(2.4) a def= γ Tr(S̄ − S) + 4γ^2 ( κ̄(s) Tr(S̄) + 4‖S̄‖_F^2 ),
where Tr(M) (resp. ‖M‖F) denotes the trace (resp. the Frobenius norm) of M. As we will see, the contraction rate of Πγ is at least
(2.5) ε def= inf{ x > 0 : r1(z) ≥ 3ρ̄ (s? + s)^{1/2} z for all z ≥ x }.
We then set ∆s def= {δ ∈ ∆ : ‖δ‖_0 ≤ s}, and
(2.6) B_{m,M} def= ⋃_{δ∈∆s} ( {δ} × B^{(δ)}_{m,M} ),
where
B^{(δ)}_{m,M} def= { θ ∈ R^p : ‖θδ − θ?‖_2 ≤ Mε, ‖θ − θδ‖_2 ≤ 2√((m+1)γp), and ‖θ − θδ‖_∞ ≤ 2√((m+1)γ log(p)) },
for absolute constants m, M.
Theorem 5. Assume H1-H4. Take ρ ∈ (0, ρ̄] and γ > 0 as in H4 such that
(2.7) 4γλmax(S̄) ≤ 1, and γρ̄^2 = o(log(p)), as p → ∞.
Suppose also that ε as defined in (2.5) is finite, and that Πγ(·|z) is well-defined for P?-almost all z ∈ Eρ̄ and for all p large enough. Then there exists an absolute constant A0 such that for p ≥ A0, for all m ≥ 1, M > 2, and for any measurable subset E ⊆ Eρ̄ of Z such that P?(Z ∈ E) ≥ 1/2, we have
E?[ Πγ(B_{m,M}|Z) | Z ∈ E ] ≥ 1 − E?[ Πγ(‖δ‖_0 > s | Z) | Z ∈ E ]
− 4 e^{a/2} p^{(2+u)s?} ( 1 + κ̄(s?)/ρ^2 )^{s?} Σ_{j≥1} e^{−(1/6) r1(jMε/2) + 8ρ s^{1/2} (jMε/2)}
− 2 e^{s log(9p)} Σ_{j≥1} e^{−(1/6) r1(jMε/2)} − (4/p^m) exp( a/2 + 3γρ^2 s + 2γκ̄(s)^2 (Mε)^2 ).
Proof. See Section 5 of the Supplement.
The theorem applies naturally with E = Eρ̄. The general case E ⊆ Eρ̄ will be needed below. We note that the probabilities of the events {Z ∈ Eρ̄} are relatively easy to control, so one can easily derive an unconditional version of the theorem. Theorem 5 outlines a set of sufficient conditions under which one can guarantee that, given the event Z ∈ E, if (δ, θ) ∼ Πγ(·|Z) then the sparsified vector θδ satisfies ‖θδ − θ?‖_2 ≤ Mε, ‖θ − θδ‖_2 ≤ 2√((m+1)γp), and ‖θ − θδ‖_∞ ≤ 2√((m+1)γ log(p)), with high probability. Controlling both the ℓ2 and the ℓ∞ norms gives us better control of the term θ − θδ, which will be needed in the mixing time analysis. To use the theorem one needs to establish by other means that Πγ(·|z) is well-defined and puts vanishing probability on the set {‖δ‖_0 > s}. This question is addressed in Lemma 21 under some mild additional assumptions.
The second condition in (2.7) is readily satisfied in most cases. Hence the result implies that one should choose γ as
γ = γ0 / (4λmax(S̄)),
for some user-defined absolute constant γ0 ∈ (0, 1], provided that this choice of γ also satisfies H4.
Remark 6. It is interesting to note that we did not directly rely on any sample size condition in the theorem. Here the amount of information available about θ is formalized in H4 directly in terms of the curvature of the log-likelihood function ℓ. In models with independent samples, H4 typically translates into a sample size requirement. We refer the reader to Section 2.2 for an example with linear regression models. □
2.1. Model selection and Bernstein-von Mises approximation. With some additional assumptions, we show next that the distribution Πγ puts overwhelming probability mass around (δ?, θ?) and satisfies a Bernstein-von Mises approximation. To that end we will assume that there exists an absolute constant M > 2 such that
(2.8) min{ |θ?,j| : j such that δ?,j = 1 } > Mε.
Clearly this assumption is unverifiable in practice since θ? is typically not known. However, a strong signal assumption such as (2.8) seems to be needed in one form or another for model selection ([31, 11, 45]). For δ ∈ ∆, we recall that the notation δ ⊇ δ? means that for all 1 ≤ j ≤ p, δ?,j = 1 implies δj = 1. We note that when (2.8) holds, the set B^{(δ)}_{m,M} is empty if δ does not satisfy δ ⊇ δ?. Hence one immediate implication of assumption (2.8) is that the set B_{m,M} reduces to
B_{m,M} = ⋃_{δ∈A} ( {δ} × B^{(δ)}_{m,M} ),
where
A def= { δ ∈ ∆ : δ ⊇ δ?, and ‖δ‖_0 ≤ s }.
In other words, when (2.8) and the assumptions of Theorem 5 hold, Πγ puts most of its mass only on sparse models that contain the true model δ?. Therefore correct model selection becomes possible if the prior distribution offsets the natural tendency to overfit. Given θ ∈ R^p and δ ∈ ∆ \ {0}, we write [θ]δ to denote the vector of δ-selected components of θ listed in their order of appearance: [θ]δ = (θj, δj = 1) ∈ R^{‖δ‖_0}. Conversely, if u ∈ R^{‖δ‖_0}, we write (u, 0)δ to denote the element of R^p_δ such that [(u, 0)δ]δ = u. We define the function ℓ[δ](·; z) : R^{‖δ‖_0} → R by ℓ[δ](u; z) def= ℓ((u, 0)δ; z).
We then introduce
(2.9) θ̂δ(z) def= Argmax_{u∈R^{‖δ‖_0}} ℓ[δ](u; z), z ∈ Z.
When δ = δ?, we sometimes write θ̂?(z) instead of θ̂δ?(z). At times, to shorten the notation, we omit the data z and write θ̂δ instead of θ̂δ(z). Omitting the data z, we write Iγ,δ ∈ R^{‖δ‖_0×‖δ‖_0} to denote the matrix of second derivatives of u ↦ ℓ[δ](u; z) evaluated at θ̂δ(z). When δ = δ?, we simply write Iγ. We make the following assumption.
assumption.
H5. The integer s in H4 is such that κ(s) > 0, and the following holds.
1. For all δ ∈ ∆s and z ∈ Eρ̄, the estimate θ̂δ(z) is well-defined and satisfies
‖θ̂δ(z) − [θ?]δ‖_2 ≤ ε,
where ε is as defined in (2.5).
2. For all z ∈ Z and all δ ∈ ∆s, the function u ↦ ℓ[δ](u; z) is thrice differentiable on R^{‖δ‖_0}, and for all c > 0,
ϖ_{2,c} def= sup_{z∈Eρ̄} sup_{δ∈∆s} sup_{u∈R^{‖δ‖_0}: ‖u−[θ?]δ‖_2 ≤ cε} ‖∇^{(3)} ℓ[δ](u; z)‖op
is finite, where ‖ · ‖op denotes the operator norm3.
Remark 7. When H1 holds, H5-(1) is typically easy to check. Indeed, for all δ ∈ ∆s and z ∈ Eρ̄, we have
0 ≥ −ℓ[δ](θ̂δ; z) + ℓ[δ]([θ?]δ; z) ≥ ⟨−∇ℓ[δ]([θ?]δ; z), θ̂δ − [θ?]δ⟩ + (κ(s)/2) ‖θ̂δ − [θ?]δ‖_2^2,
where the first inequality uses the definition of θ̂δ, and the second uses H1 and the definition of κ(s). Hence for all δ ∈ ∆s and z ∈ Eρ̄, we have
(2.10) ‖θ̂δ(z) − [θ?]δ‖_2 ≤ ρ̄ s^{1/2} / κ(s),
3If M is a linear operator from R^p to R^{p×p}, we define ‖M‖op def= sup_{‖u‖_2=1} ‖Mu‖F.
which gives an easy way of checking H5-(1), by comparing the right-hand side of (2.10) to ε. H5-(2) is hard to check in general, since it requires the third derivative of the log-likelihood to be uniformly bounded in z (for all practical purposes). However, it trivially holds for linear regression models, since in that case ϖ_{2,c} = 0. It can also be shown to hold for logistic regression models, although we do not pursue this here. □
For Λ > 0 we introduce the set
Eρ̄,Λ def= Eρ̄ ∩ ⋂_{k=0}^{s−s?} { z ∈ Z : sup_{δ∈A: ‖δ‖_0 = s?+k} ℓ(θ̂δ(z); z) − ℓ(θ̂?(z); z) ≤ kΛ }.
We introduce the following distribution on ∆ × R^p:
(2.11) Π∞γ(δ, dθ|z) ∝ e^{ −(1/2)([θ]δ? − θ̂?)' Iγ ([θ]δ? − θ̂?) − (1/(2γ))(θ − θδ?)'(Ip − γS̄)(θ − θδ?) } 1{δ?}(δ) 1_{B^{(δ?)}_{m,M}}(θ) dθ.
Note that the δ-marginal of Π∞γ(·|z) is the point mass at δ?. We note also that for p large, the restriction to the set B^{(δ?)}_{m,M} in Π∞γ(·|z) is inconsequential and can be removed – by Gaussian concentration. Hence, for all practical purposes, a draw U from the θ-marginal of Π∞γ(·|z) is such that [U]δ? and [U]δc? are independent, [U]δ? ∼ N(θ̂?, Iγ^{−1}), and [U]δc? ∼ N(0, γ([Ip − γS̄]δc?)^{−1}). Hence Uj = Op(√γ) for all j such that δ?,j = 0.
Theorem 8. Assume H1-H5 and (2.8) for some absolute constant M > 2. Fix γ > 0 such that 4γλmax(S̄) ≤ 1, and ρ > 0. Suppose also that for some Λ > 0,
(2ρ e^Λ / p^u) √(2π/κ(s)) ≤ 1, and (M − 1) ε κ(s)^{1/2} − s?^{1/2} ≥ 2.
Then there exist absolute constants A0, C such that for all p ≥ A0, for all m ≥ 4, and for all z ∈ Eρ̄,Λ such that Πγ(·|z) is well-defined, we have
(2.12) Πγ({δ?} × B^{(δ?)}_{m,M} | z) ≥ Πγ(B_{m,M} | z) − (4ρ e^Λ / p^u) √(2π/κ(s)) e^{2(M+2)ρ s^{1/2} ε} e^{((M+1)^3/3) ε^3 ϖ_{2,M}} e^{s?^{1/2} ε ϖ_{2,M} / κ(s)} e^{2γκ̄(s)^2 (Mε)^2} e^{3γρ^2 s} e^{a/2}.
Furthermore,
(2.13) ‖Πγ(·|z) − Π∞γ(·|z)‖tv ≤ ( 1 − Πγ({δ?} × B^{(δ?)}_{m,M} | z) ) + 2 ( 1 − Πγ(B_{m,M} | z) ) + C ι1 ( 1 + ρ s^{1/2} ε + ε^3 ϖ_{2,M} ) e^{Cρ s^{1/2} ε + C ε^3 ϖ_{2,M}},
where
ι1 def= e^{a/2 + 3γρ^2 s + 2γκ̄(s)^2 (Mε)^2} − 1 + 1/p^m.
Proof. See Section 6 of the Supplement.
Theorems 5 and 8 can be combined to derive simple conditions under which Πγ puts most of its probability mass on subsets of the form {δ?} × B^{(δ?)}_{m,M}, and satisfies a Bernstein-von Mises approximation when z ∈ Eρ̄,Λ. We refer to Section 2.2 for a linear regression example. Controlling the probability P?(Z ∈ Eρ̄,Λ) boils down to controlling the log-likelihood ratios of the model. This is easily done for linear regression models (see Section 2.2). The recent work of [40] provides some tools that can be used to deal with logistic regression models; however, this remains to be explored.
The Bernstein-von Mises approximation implies that for γ small, posterior confidence intervals on θ obtained from Πγ are approximately equivalent to their corresponding frequentist counterparts knowing δ? (oracle confidence intervals). In that sense the theorem can be used to argue that the Bayesian inference is equivalent to the oracle frequentist inference.
2.2. Application to high-dimensional linear regression models. We illustrate the
results with the linear regression model. Suppose that we have n subjects, and on
subject i we observe (Zi, xi) ∈ R × Rp, where xi is non-random, and the following
holds.
H6. For i = 1, . . . , n, Zi ∼ N(〈θ?, xi〉 , σ2) are independent random variables, for
a parameter θ? ∈ Rp \ {0}, and a known absolute constant σ2 > 0.
Let X ∈ R^{n×p} be the matrix whose i-th row is x'_i. We write Xj ∈ R^n to denote the j-th column of X. Throughout the paper, we assume that the matrix X is normalized so that
(2.14) ‖Xj‖_2^2 = n, j = 1, . . . , p.
For s ≥ 1, we define
v(s) def= inf{ u'(X'X)u / (n‖u‖_2^2) : u ≠ 0, ‖u‖_0 ≤ s }, and v̄(s) def= sup{ u'(X'X)u / (n‖u‖_2^2) : u ≠ 0, ‖u‖_0 ≤ s }.
We also introduce
v def= inf{ u'(X'X)u / (n‖u‖_2^2) : u ≠ 0, ‖uδc?‖_1 ≤ 7‖uδ?‖_1 }.
Different behaviors can be obtained from Πγ depending on the choice of the hyper-parameter ρ. To keep the discussion short, we assume that
(2.15) ρ = (4/σ) √(n / log(p)), and ρ̄ = (4/σ) √(n log(p)).
We make the following assumption.
H7. As p → ∞, s? = o(log(p)) and
(2.16) 1/v + 1/v(s) + v̄(s) = O(1),
where s def= s? + (2/u)(m0 + 1 + 2s?), for some absolute constant m0 ≥ 1.
Remark 9. Assumption H7 can be viewed as a minimal sample size requirement. It imposes that we increase the sample size n as p grows, so as to guarantee that v and v(s) remain bounded away from 0, and v̄(s) remains bounded from above. For instance, if X is a realization of a random matrix with iid entries, it is known that H7 holds with high probability if the sample size n grows like s log(p) (see e.g. [37] and the references therein). □
In the next result, ε as defined in (2.5) takes the form
ε = (24σ / v(s)) √((s? + s) log(p) / n).
The limiting distribution Π∞γ(·|z) in the Bernstein-von Mises approximation is such that if (δ, U) ∼ Π∞γ(·|z), then δ = δ?, [U]δ? ∼ N((X'δ? Xδ?)^{−1} X'δ? z, σ^2 (X'δ? Xδ?)^{−1}), and [U]δc? ∼ N(0, γ([Ip − (γ/σ^2)(X'X)]δc?)^{−1}), with [U]δ?, [U]δc? independent. We deduce the following corollary.
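A draw from the θ-marginal of this limit can be sketched directly from the two independent Gaussian blocks above; the data below are simulated placeholders, and γ is chosen small enough that Ip − (γ/σ²)X'X stays positive definite, consistent with the condition 4γλmax(S̄) ≤ 1.

```python
import numpy as np

def sample_bvm_limit(z, X, delta_star, sigma2, gamma, rng):
    """One draw U from the theta-marginal of Pi_gamma^infty in the linear
    regression case: [U]_{delta*} is the usual Gaussian posterior of the
    OLS coefficients knowing delta*, while [U]_{delta*^c} is an independent
    N(0, gamma * ([I_p - (gamma/sigma2) X'X]_{delta*^c})^{-1}) block."""
    p = X.shape[1]
    on = delta_star == 1
    Xs = X[:, on]
    A = Xs.T @ Xs
    mean_on = np.linalg.solve(A, Xs.T @ z)
    cov_on = sigma2 * np.linalg.inv(A)
    M = np.eye(p) - (gamma / sigma2) * (X.T @ X)
    cov_off = gamma * np.linalg.inv(M[np.ix_(~on, ~on)])
    U = np.empty(p)
    U[on] = rng.multivariate_normal(mean_on, cov_on)
    U[~on] = rng.multivariate_normal(np.zeros(int((~on).sum())), cov_off)
    return U

rng = np.random.default_rng(0)
n, p = 30, 6
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:2] = [2.0, -1.5]
z = X @ theta_star + 0.1 * rng.standard_normal(n)
delta_star = (theta_star != 0).astype(int)
sigma2 = 0.01
gamma = 0.25 * sigma2 / np.linalg.eigvalsh(X.T @ X)[-1]  # keeps M above PD
U = sample_bvm_limit(z, X, delta_star, sigma2, gamma, rng)
assert U.shape == (p,) and np.isfinite(U).all()
```

For small γ the off-support block has standard deviation of order √γ, matching the Uj = Op(√γ) behavior noted above.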
Corollary 10. Assume H3, H6, H7, and choose ρ and ρ̄ as in (2.15). Suppose also that we choose γ > 0 such that (8/σ^2) γ log(p) λmax(X'X) ≤ 1, and, as p → ∞,
(2.17) γ n log(p) = O(1), and γ^2 ( n Tr(X'X) + ‖X'X‖_F^2 ) = o(log(p)).
Furthermore, suppose that as p → ∞, log(p)/n → 0, and
(2.18) lim_{p→∞} { min_{j: δ?,j=1} |θ?,j| } √( n / ((s? + s) log(p)) ) = +∞.
Then, setting Λ = 3 log(n ∧ p), the following holds. For all m > 1 and all M > max(2, √((u+2)/3)), we have
E?[ Πγ({δ?} × B^{(δ?)}_{m,M} | Z) | Z ∈ Eρ̄,Λ ] ≥ 1 − 1/p^{m0} − 1/p^{m−1} − 1/p^{M^2 s} − 1/p^{u−3},
and P?(Z ∉ Eρ̄,Λ) ≤ 2/p + A1/(n ∧ p)^{1/4},
for all p large enough, where A1 is some absolute constant. If, in addition to the assumptions above, γ s? n log(p) = o(1) and γ^2 ( n Tr(X'X) + ‖X'X‖_F^2 ) = o(1) as p → ∞, then
lim_{p→∞} E?[ ‖Πγ(·|Z) − Π∞γ(·|Z)‖tv ] = 0.
Proof. See Section 7 of the Supplement.
In general, the assumptions in (2.17) and the condition γ log(p) λmax(X'X) ≤ σ^2/8 can be easily satisfied by taking γ sufficiently small. If we choose γ ∼ 1/(n log(p)), as imposed in Theorem 15 below, then these assumptions imply some restrictions on the design matrix X, in that we need
λmax(X'X) = O(n), and Tr(X'X)/λmax(X'X) = o(log(p)^3).
These latter conditions typically hold if p and n are of the same order and the design matrix X has a small number of leading singular values, which is similar to the spiked covariance model commonly used in principal component analysis (see for instance [10] and the references therein). In simulation results not reported here, we noticed that the conclusions of the theorem continue to hold if X is a random matrix with iid standard normal entries.
The assumption (2.18) is a different (and stronger) form of the high signal-to-noise ratio assumption. It implies that for any M > 2, (2.8) holds for all p large enough. The additional assumptions needed for the Bernstein-von Mises theorem highlight the fact that smaller values of γ are typically needed for the Bernstein-von Mises approximation to hold. Although not reported here, we have indeed observed such behavior in the simulations.
3. Markov Chain Monte Carlo Computation. In this section we develop and analyze an MCMC algorithm to sample from Πγ. We shall focus on the θ-marginal of Πγ, given by

(3.1) Πγ(du|z) ∝ Σ_{δ∈∆} ωδ (2πγ)^{‖δ‖₀/2} (ρ/2)^{‖δ‖₀} e^{−hγ(δ,u;z)} du.
We will abuse the notation and continue to write Πγ to denote this marginal. The set
whose probability we seek will typically make it clear whether we are referring to the
joint distribution or its marginals.
3.1. A MCMC sampler for Πγ. We first describe the MCMC sampler. We use a data-augmentation approach where we sample the joint variable (δ, θ), and then discard δ. To sample (δ, θ) we use a Metropolis-Hastings-within-Gibbs sampler, where we update δ given θ, then update the selected components [θ]δ given (δ, [θ]δc), and finally update [θ]δc given (δ, [θ]δ). We refer the reader to [42, 35] for an introduction to basic MCMC algorithms.
To develop the details, we need to introduce some notation. For γ > 0, δ ∈ ∆, θ ∈ Rp, we define the proximal map

Proxγ(δ, θ) := Argmin_{v∈Rpδ} [ ρ‖v‖₁ + (1/(2γ))‖v − θ‖₂² ] = δ · sγ(θ),

where sγ(θ) := (sγ(θ₁), . . . , sγ(θp)) ∈ Rp is the soft-thresholding operator applied componentwise to θ, with

sγ(x) := sign(x)(|x| − γρ)₊, x ∈ R,
where sign(a) is the sign of a, and a+ = max(a, 0). With this definition, hγ in (1.4)
can be rewritten as
(3.2) hγ(δ, θ; z) = −ℓ(θ; z) − ⟨∇ℓ(θ; z), Jγ(δ, θ) − θ⟩ + ρ‖Jγ(δ, θ)‖₁ + (1/(2γ))‖Jγ(δ, θ) − θ‖₂²,
where

(3.3) Jγ(δ, θ) := Proxγ(δ, θ + γ∇ℓ(θ; z)) = δ · sγ(θ + γ∇ℓ(θ; z)).
This alternative expression shows that the function hγ(δ, θ; z) is easy to evaluate.
Furthermore, plugging (3.2) in (1.6) shows that given θ, the components of δ are
conditionally independent Bernoulli random variables:
(3.4) Πγ(δ|θ, z) = ∏_{j=1}^p [pγ,j(θ)]^{δj} [1 − pγ,j(θ)]^{1−δj},

where

pγ,j(θ) := [ (q/(1−q)) (ρ/2) √(2πγ) e^{rγ,j} ] / [ 1 + (q/(1−q)) (ρ/2) √(2πγ) e^{rγ,j} ], j = 1, . . . , p,
where rγ,j := rγ(θj + γ∇jℓ(θ; z)), with ∇jℓ(θ; z) denoting the j-th partial derivative of ℓ(·; z) with respect to θj, evaluated at θ, and for x ∈ R,
(3.5) rγ(x) := −(1/(2γ)) sγ(x)² + (1/γ) x sγ(x) − ρ|sγ(x)| = { (1/(2γ))(|x| − γρ)² if |x| > γρ; 0 otherwise. }
It follows from the above that drawing samples from the conditional distribution of
δ given θ, z is easily achieved.
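For linear regression, where ∇ℓ(θ; z) = X′(z − Xθ)/σ², the exact draw of δ given θ can be sketched as follows (a minimal illustration; the function name, argument names, and parameter values are ours, not from the paper):

```python
import numpy as np

def sample_delta(theta, grad, gamma, rho, q, rng):
    """Draw delta | theta from (3.4): p independent Bernoulli components."""
    x = theta + gamma * grad                                   # argument of r_gamma
    s = np.sign(x) * np.maximum(np.abs(x) - gamma * rho, 0.0)  # soft-thresholding s_gamma(x)
    r = -s**2 / (2 * gamma) + x * s / gamma - rho * np.abs(s)  # r_gamma(x), see (3.5)
    # p_{gamma,j} = a_j/(1+a_j) with a_j = (q/(1-q))(rho/2)sqrt(2*pi*gamma)exp(r_j);
    # compute on the log scale to avoid overflow for large r_j
    log_a = np.log(q / (1 - q)) + np.log(rho / 2.0) \
        + 0.5 * np.log(2 * np.pi * gamma) + r
    prob = 1.0 / (1.0 + np.exp(-log_a))
    return (rng.uniform(size=theta.size) < prob).astype(int)
```

Working on the log scale matters here, since rγ,j can be of order n for strong signals.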
Now consider the conditional distribution Πγ(θ|δ, z) ∝ e^{−hγ(δ,θ;z)}. Given δ, we partition θ into θ = ([θ]δ, [θ]δc), where [θ]δ groups the components of θ for which δj = 1, and [θ]δc groups the remaining components. We propose a Metropolis-within-Gibbs MCMC scheme whereby we first update [θ]δ using a Random Walk Metropolis scheme while keeping [θ]δc fixed, and then we update [θ]δc using an Independence Metropolis-Hastings scheme while keeping [θ]δ fixed. Again we refer the reader to [42] for an introduction to these basic MCMC algorithms. To give more details, it is enough to consider the case where 0 < ‖δ‖₀ < p. We update the component [θ]δ using a Random Walk Metropolis step with a Gaussian proposal N(0, τδ² I_{‖δ‖₀}) for some scale parameter τδ² > 0 (we give more detail on the choice of τδ² below), while keeping δ and
[θ]δc fixed. To update the component [θ]δc, we build an approximation h̃γ of hγ as follows. First, in view of (3.3), we propose to approximate Jγ(δ, θ) by

J̃γ(δ, θ) := δ · sγ(θ + γ∇ℓ(θδ; z)),

and we note that J̃γ(δ, θ) does not actually depend on [θ]δc. Using a Taylor expansion we also approximate ℓ(θ; z) and ∇ℓ(θ; z) by ℓ̃(θ; z) and ∇̃ℓ(θ; z) respectively, where

ℓ̃(θ; z) := ℓ(θδ; z) + ⟨∇ℓ(θδ; z), θ − θδ⟩ + (1/2)(θ − θδ)′∇⁽²⁾ℓ(θδ; z)(θ − θδ),

and

∇̃ℓ(θ; z) := ∇ℓ(θδ; z) + ∇⁽²⁾ℓ(θδ; z)(θ − θδ).

We then propose to approximate hγ(δ, θ; z) by replacing Jγ(δ, θ) by J̃γ(δ, θ), ℓ(θ; z) by ℓ̃(θ; z), and ∇ℓ(θ; z) by ∇̃ℓ(θ; z) in the expression of hγ given in (3.2). This leads to

(3.6) h̃γ(δ, θ; z) := −ℓ̃(θ; z) − ⟨∇̃ℓ(θ; z), J̃γ(δ, θ) − θ⟩ + ρ‖J̃γ(δ, θ)‖₁ + (1/(2γ))‖J̃γ(δ, θ) − θ‖₂².
Remark 11. It will be important to keep in mind that in linear regression models, θ ↦ ℓ(θ; z) is quadratic, so that ℓ̃(θ; z) = ℓ(θ; z) and ∇̃ℓ(θ; z) = ∇ℓ(θ; z). Hence in that case the approximation h̃γ involves only replacing Jγ by J̃γ.
We set

(3.7) Σδ,θ := ( [Ip + γ∇⁽²⁾ℓ(θδ; z)]δc )⁻¹, and mδ,θ := Σδ,θ [ (Ip + γ∇⁽²⁾ℓ(θδ; z))(J̃γ(δ, θ) − θδ) ]δc.

We note that Σδ,θ and mδ,θ depend on θ only through [θ]δ. We note also that under H1, the matrix Σδ,θ is symmetric positive definite whenever γλmax(S) < 1. With some easy algebra it can be shown that h̃γ(δ, θ; z) can be written as

h̃γ(δ, θ; z) = (1/(2γ))([θ]δc − mδ,θ)′ Σδ,θ⁻¹ ([θ]δc − mδ,θ) + const,
where the term const does not depend on [θ]δc (but does depend on [θ]δ). Hence, given
δ and [θ]δ, we update [θ]δc using an Independence Metropolis-Hastings algorithm with
proposal N(mδ,θ, γΣδ,θ).
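For concreteness, in the linear regression model (where ∇ℓ(θ; z) = X′(z − Xθ)/σ² and ∇⁽²⁾ℓ(θ; z) = −X′X/σ²) the moments mδ,θ and Σδ,θ of (3.7) can be computed directly; the following naive O(p³) sketch (our own, with hypothetical argument names) returns the moments of the Independence Metropolis-Hastings proposal N(mδ,θ, γΣδ,θ):

```python
import numpy as np

def imh_proposal_moments(X, z, delta, theta, gamma, rho, sigma2=1.0):
    """Moments (m, Sigma) of (3.7) for the linear model, grad = X'(z - X theta)/sigma2."""
    p = X.shape[1]
    free = delta == 0                                # coordinates in delta^c
    theta_d = np.where(delta == 1, theta, 0.0)       # theta_delta
    grad = X.T @ (z - X @ theta_d) / sigma2
    x = theta_d + gamma * grad
    # J~_gamma(delta, theta) = delta . s_gamma(theta_delta + gamma * grad)
    J = delta * np.sign(x) * np.maximum(np.abs(x) - gamma * rho, 0.0)
    H = np.eye(p) - gamma * (X.T @ X) / sigma2       # I_p + gamma * Hessian
    Sigma = np.linalg.inv(H[np.ix_(free, free)])     # ([I_p + gamma Hess]_{delta^c})^{-1}
    m = Sigma @ (H @ (J - theta_d))[free]
    return m, Sigma
```

For small γ the matrix [Ip + γ∇⁽²⁾ℓ]δc is positive definite, so Σδ,θ is a valid covariance matrix.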
Given δ ∈ ∆, let us call K→δ the transition kernel on Rp of the combined θ-move that we just described. Let us call K←δ the kernel obtained by reversing the order of the updates (we update [θ]δc first using the independence sampler described above, followed by a Random Walk Metropolis update of [θ]δ). For the purpose of having a reversible kernel we introduce

Kδ(θ, ·) := (1/2)K→δ(θ, ·) + (1/2)K←δ(θ, ·).
The proposed MCMC algorithm to sample from Πγ is as follows.

Algorithm 1. For some initial distribution ν₀ on Rp, draw u₀ ∼ ν₀. Given u₀, . . . , u_n for some n ≥ 0, draw independently D_{n+1} ∼ Ber(0.5).

1. If D_{n+1} = 0, set u_{n+1} = u_n.
2. If D_{n+1} = 1,
   (a) draw δ ∼ Πγ(·|u_n, z) as given in (3.4), and
   (b) draw u_{n+1} ∼ Kδ(u_n, ·).
Remark 12. The use of the kernel Kδ instead of K→δ (or K←δ) does not increase the computational cost, and ensures reversibility, which is needed in our theory. The introduction of the indicator variable Dn implies that half of the time the chain does not move: we have a lazy Markov chain, which is also needed in our theory. These tricks are not used in practice, and for the numerical illustrations presented below we only implemented the kernel K→δ.

The indicator variables δ discarded in Algorithm 1 are important in practice for the variable selection problem, and are usually collected along the iterations. Here we focus the analysis on the continuous variables u_n ∈ Rp. Obviously we do not lose anything, since given u_n exact sampling of δ is possible as discussed above. In other words the mixing of the joint process {(δ_n, u_n), n ≥ 0} is driven by the mixing of the marginal {u_n, n ≥ 0}.
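In outline, the sampling loop of Algorithm 1 can be sketched as below; `sample_delta_given_u` and `theta_kernel` are stand-ins for the exact δ-draw of (3.4) and the kernel Kδ (both names are ours):

```python
import numpy as np

def algorithm1(u0, sample_delta_given_u, theta_kernel, n_iter, rng):
    """Lazy data-augmentation chain of Algorithm 1 on the theta-marginal."""
    u = np.array(u0, dtype=float)
    path = [u.copy()]
    for _ in range(n_iter):
        if rng.uniform() < 0.5:                 # D_{n+1} = 0: lazy step, stay put
            path.append(u.copy())
            continue
        delta = sample_delta_given_u(u, rng)    # step 2(a)
        u = theta_kernel(delta, u, rng)         # step 2(b)
        path.append(u.copy())
    return np.asarray(path)
```

In practice one would also record the δ draws along the iterations for variable selection.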
3.1.1. Initialization. The choice of the initial distribution ν₀ plays a crucial role for fast mixing. Given z ∈ Rn, and a model δ ∈ ∆s, let θ̂δ = (X′δXδ)⁻¹X′δz denote the ordinary least squares estimate based on the variables selected by δ. Let us call ν(δ)(·|z) the Gaussian distribution on Rp such that if (U₁, . . . , Up) ∼ ν(δ)(·|z), then [U]δ ∼ N(θ̂δ, σ²(X′δXδ)⁻¹), and (independently) Uj i.i.d. ∼ N(0, γ) for all j such that δj = 0.
We propose to take the initial distribution ν₀ as ν(δ(0))(·|z) for some initial estimate δ(0) of δ⋆. Perhaps the most natural choice of δ(0) is the lasso estimate ([41, 7]). In a strong signal-to-noise ratio setting the lasso is known to recover δ⋆ with high probability ([28]). However, in practice lasso estimates can perform poorly, so it is important to understand the mixing of the MCMC sampler when δ(0) is close to, but not exactly equal to, δ⋆.
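A draw from the initial distribution ν(δ(0))(·|z) can be sketched as follows (our own minimal implementation; argument names are hypothetical):

```python
import numpy as np

def draw_initial(X, z, delta0, gamma, sigma2, rng):
    """One draw from nu^{(delta0)}(.|z): N(theta_hat, sigma2 (X_d'X_d)^{-1}) on the
    selected coordinates, independent N(0, gamma) on the others."""
    u = rng.normal(0.0, np.sqrt(gamma), size=X.shape[1])   # spike coordinates
    sel = np.flatnonzero(delta0)
    if sel.size > 0:
        Xd = X[:, sel]
        G = np.linalg.inv(Xd.T @ Xd)
        theta_hat = G @ (Xd.T @ z)                         # OLS on the selected variables
        u[sel] = rng.multivariate_normal(theta_hat, sigma2 * G)
    return u
```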
3.1.2. Computational cost. The computational cost of Algorithm 1 is dominated by the cost of sampling from the Gaussian distribution N(mδ,θ, γΣδ,θ), which itself is dominated by the Cholesky decomposition of Σδ,θ. Hence each iteration of Algorithm 1 in general has a cost that scales with p as O(p³). However in some cases, a faster implementation is possible along the lines of an algorithm proposed in [6]⁴. Suppose that the Hessian matrix ∇⁽²⁾ℓ(θ; z) can be written as

(3.8) ∇⁽²⁾ℓ(θ; z) = −X′WθX,

where X ∈ Rn×p, and Wθ ∈ Rn×n is a non-singular diagonal matrix. This is the case for instance for linear or logistic regression models. Then Σδ,θ = (I_{p−‖δ‖₀} − γX′δcWθXδc)⁻¹, where Xδc ∈ R^{n×(p−‖δ‖₀)} is the sub-matrix of X obtained by selecting the columns for which δj = 0. By the Woodbury formula we then have

Σδ,θ = I_{p−‖δ‖₀} + γX′δc(Wθ⁻¹ − γXδcX′δc)⁻¹Xδc.

Therefore if C′θCθ = Wθ⁻¹ − γXδcX′δc is the Cholesky decomposition of the Rn×n matrix Wθ⁻¹ − γXδcX′δc, we can sample from N(0, γΣδ,θ) by drawing Z ∼ N(0, I_{p−‖δ‖₀}) and U ∼ N(0, In) independently, and returning

√γ Z + γX′δc Cθ⁻¹ U.
⁴ We are grateful to Anirban Bhattacharya for pointing out this paper to us.
It is also easy to see that the Cholesky factor Cθ can also be exploited to compute mδ,θ. The per-iteration computational cost of this approach is O(n³ + p²n). Hence in the particular case where (3.8) holds, the per-iteration computational cost of the algorithm is O(p² min(n, p)), which matches other state-of-the-art algorithms for high-dimensional regression ([6]).
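The Woodbury-based sampler can be sketched as follows (a naive illustration with dense numpy linear algebra; the function and argument names are ours):

```python
import numpy as np

def sample_woodbury(Xc, W, gamma, rng):
    """Draw from N(0, gamma * Sigma), Sigma = (I - gamma Xc' W Xc)^{-1}, using the
    n x n Cholesky factor C'C = W^{-1} - gamma Xc Xc' instead of a (p-s) x (p-s) one."""
    n, m = Xc.shape
    M = np.linalg.inv(W) - gamma * (Xc @ Xc.T)   # n x n matrix to factorize
    L = np.linalg.cholesky(M)                    # M = L L', i.e. C = L'
    Z = rng.standard_normal(m)
    U = rng.standard_normal(n)
    x = np.linalg.solve(L.T, U)                  # C^{-1} U
    return np.sqrt(gamma) * Z + gamma * (Xc.T @ x)
```

By the Woodbury identity, the output has covariance γ(I + γXc′(W⁻¹ − γXcXc′)⁻¹Xc) = γΣ.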
3.2. Mixing time of Markov chains. Our objective in this section is to provide
some qualitative bounds on the mixing time of Algorithm 1. Particularly, we wish
to understand how this mixing time depends on the dimension p. We follow the
conductance approach using the framework of [24]. However this theory cannot be
directly applied since the target distribution of interest is a mixture of log-concave
densities, and hence is not log-concave. Our main contribution is the idea that one
can invoke the contraction properties of Πγ to essentially reduce the mixing of Pγ to
the mixing of Kδ⋆ – the Markov kernel that samples from the dominant component
of Πγ . This latter problem can then be handled by the standard theory of [24].
We start with a general overview of the technique using some generic notation; the
specific application to Πγ is presented in Section 3.3. Let π be a probability measure
on Rp that is absolutely continuous with respect to the Rp-Lebesgue measure dθ such
that
(3.9) π(dθ) ∝ e−h(θ)dθ, θ ∈ Rp,
for a measurable function h : Rp → [0,∞). We will abuse notation and write π to
denote both π and its density. Let P be a Markov kernel on Rp. For any integer n ≥ 1,
Pn denotes the Markov kernel defined recursively as P 1 = P , and
Pn(x, ·) def=
∫Pn−1(x, dz)P (z, ·).
For a probability measure µ, the product µP is the probability measure defined as
µP (·) def=
∫µ(dz)P (z, ·).
We say that P is reversible with respect to π if for all measurable sets A,B ⊆ Rp:∫Aπ(dθ)P (θ,B) =
∫Bπ(dθ)P (θ,A).
Reversibility of P with respect to π implies that P has invariant distribution π. We say that a Markov kernel P is lazy if P(θ, {θ}) ≥ 1/2 for all θ ∈ Rp. For ζ ∈ [0, 1/2), the ζ-conductance of the Markov kernel P as introduced by [23] is defined as

Φζ(P) := inf{ ∫_A π(dθ)P(θ, Ac) / min(π(A) − ζ, π(Ac) − ζ) : A measurable, ζ < π(A) < 1 − ζ }.
The case ζ = 0 corresponds to the usual conductance. The conductance measures
how rapidly a Markov chain moves around the space if started from its stationary
distribution. In practice most MCMC algorithms are started from some initial distri-
bution ν0 that is not the stationary distribution. In high-dimensional problems – due
to the curse of dimensionality and the concentration of measure phenomenon – the
choice of the initial distribution becomes crucial. A fundamental result by [39] relates
the mixing time of the Markov chain to the conductance of P and the properties
of the initial distribution ν0. More details can also be found in [13, 25, 5] and the
references therein. Here we will use the generalization provided by Corollary 1.5-(2)
of [24], which can be stated as follows.
Theorem 13. Suppose that P is lazy and has invariant distribution π, and fix ζ ∈ (0, 1/2). For any probability measure ν₀, and any integer K ≥ 1, we have

‖ν₀P^K − π‖tv ≤ Hζ ( 1 + (1/ζ) e^{−K Φζ(P)²/2} ),

where

Hζ := sup_{A: π(A)≤ζ} |ν₀(A) − π(A)|.
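On a finite state space the quantities in Theorem 13 can be computed exactly by enumeration, which makes the bound easy to sanity-check. The following toy example (our own construction, not from the paper) builds a lazy reversible 3-state Metropolis chain and computes Φζ(P) and Hζ by brute force:

```python
import itertools
import numpy as np

# Lazy, reversible 3-state Metropolis chain with stationary distribution pi.
pi = np.array([0.25, 0.5, 0.25])
PM = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            PM[i, j] = 0.5 * min(1.0, pi[j] / pi[i])   # propose j w.p. 1/2, MH accept
    PM[i, i] = 1.0 - PM[i].sum()
P = 0.5 * (np.eye(3) + PM)                              # lazified kernel

def zeta_conductance(P, pi, zeta):
    """Brute-force Phi_zeta(P): infimum over all A with zeta < pi(A) < 1 - zeta."""
    best = np.inf
    states = range(len(pi))
    for r in range(1, len(pi)):
        for A in map(list, itertools.combinations(states, r)):
            pA = pi[A].sum()
            if not zeta < pA < 1 - zeta:
                continue
            Ac = [j for j in states if j not in A]
            flow = sum(pi[i] * P[i, j] for i in A for j in Ac)
            best = min(best, flow / min(pA - zeta, 1 - pA - zeta))
    return best

zeta, nu0 = 0.3, np.array([0.2, 0.6, 0.2])
phi = zeta_conductance(P, pi, zeta)
# H_zeta = sup over A with pi(A) <= zeta of |nu0(A) - pi(A)|
H = max(abs(nu0[list(A)].sum() - pi[list(A)].sum())
        for r in range(4) for A in itertools.combinations(range(3), r)
        if pi[list(A)].sum() <= zeta)
```

Here Φζ(P) = 0.625 and Hζ = 0.05, and one can check numerically that ‖ν₀P^K − π‖tv stays below the bound of Theorem 13 for every K.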
Using this result boils down to lower bounding the ζ-conductance Φζ(P ), and
upper bounding Hζ . We follow the approach of [25] which consists in studying the
restriction of P to some well-chosen subset of Rp. More precisely, if Θ ⊆ Rp is a non-empty measurable subset such that π(Θ) > 0, the Θ-conductance of P is

ΦΘ(P) := inf{ ∫_B π(dθ)P(θ, Bc ∩ Θ) / min(π(B), π(Bc ∩ Θ)) : B ⊆ Θ, B measurable, 0 < π(B) < π(Θ) }.

We note – and this is easily shown – that if π(Θ) ≥ 1 − ζ, then Φζ(P) ≥ ΦΘ(P). Hence we can reduce the problem to lower bounding ΦΘ(P) for a well-chosen subset Θ.
The next result builds on Theorem 2 of [5] and provides an approach to lower-bound
ΦΘ(P). The result itself is of some independent interest, since it can be applied more widely. The set-up is as follows. With π and P as above, suppose that there exists a sub-Markov kernel {q(x, ·), x ∈ Rp} (meaning that (x, y) ↦ q(x, y) is measurable, and ∫_{Rp} q(x, y)dy ≤ 1 for all x ∈ Rp) such that

(3.10) P(θ, du) ≥ q(θ, u)du, θ ∈ Rp.

Given u, v ∈ Rp, set

qu,v(θ) := min(q(u, θ), q(v, θ)), θ ∈ Rp.

Let |||·||| denote an arbitrary pseudo-norm on Rp such that |||θ||| ≤ ‖θ‖₂ for all θ.
Theorem 14. Suppose that P is a transition kernel that is reversible with respect to π as given in (3.9), such that (3.10) holds. Suppose also that the following holds.

1. There exists a nonempty convex set Θ ⊂ Rp with finite diameter diam(Θ) := max_{u,v∈Θ} ‖u − v‖₂, such that h is convex on Θ.
2. There exist r > 0, α > 0, such that for all u, v ∈ Θ that satisfy |||u − v||| ≤ r, we have ∫_Θ qu,v(θ)dθ ≥ α.

Then

ΦΘ(P) ≥ (α/4) min[ 1, 2r/diam(Θ) ].
Proof. See Section 4.1.
3.3. Application to the kernel Pγ for linear regression models. We analyze the
mixing of Algorithm 1. For simplicity we focus on linear regression models. The tran-
sition kernel of the Rp-valued Markov chain {un, n ≥ 0} defined by Algorithm 1
is
(3.11) Pγ(u, dv) := (1/2)δ_u(dv) + (1/2) Σ_{ω∈∆} Πγ(ω|u, z) Kω(u, dv).
The coherence of the design matrix X plays a role. We define it here as

(3.12) C(X) := sup_{j: δ⋆,j=1} sup_{u∈R^{p−s⋆}: ‖u‖₂=1} (1/√n) |⟨Xj, Xδ⋆c u⟩| = sup_{j: δ⋆,j=1} (1/√n) ‖X′δ⋆c Xj‖₂,

where Xδ⋆c ∈ R^{n×(p−s⋆)} is the submatrix of X corresponding to the columns j for which δ⋆,j = 0. C(X) is a measure of correlation between the important and the non-important variables in the design matrix X. For a random matrix with iid Gaussian entries, C(X) ≈ √p. With the same notations as in Section 2.2, we have the following result.
Theorem 15. Assume H3, H6, H7, choose ρ, ρ̄ as in (2.15), and choose γ = γ₀/(n log(p)) for some absolute constant γ₀ > 0 such that (2.17) and (2.18) hold. Suppose that we initialize Algorithm 1 as described in Section 3.1.1 with ν₀ = ν(δ(0))(·|z) for some δ(0) ⊇ δ⋆ such that FP := ‖δ(0)‖₀ − s⋆ = O(1), as p → ∞. Fix ζ₀ ∈ (0, 1/2). Then we can find absolute constants C₀, C₁, C₂, C₃ such that if we scale the step-size of the Random Walk Metropolis update of Algorithm 1 as τδ = C₀/(‖δ‖₀^{1/2}√(n log(p))), the following holds. For all p ≥ C₁, and all integers K such that

K ≥ C₂ (1 + FP) p exp( C₃ s⋆² [ 1 + ( (C(X)/(s⋆^{1/2} log(p))) √(p/n) )² ] ),

we have

E⋆[ ‖ν₀Pγ^K − Πγ(·|Z)‖tv ] ≤ 3ζ₀,

provided that the constants u, m₀ in H3 and H7 are taken large enough.
Proof. See Section 4.2.
The theorem suggests that if C(X) is small and p is not too large compared to n, then the mixing time is essentially linear in p. We note also – as expected – that the bound degrades for large values of the false-positive number FP. However the constants C₀, . . . , C₃ depend in general on FP, hence the theorem does not provide a clear read of the dependence on FP.

One of the main conclusions of the theorem is that the mixing of the algorithm is directly impacted by C(X). However it is worth pointing out that Theorem 15 also relies on the restricted eigenvalue assumption made in H7, which also restricts the correlation between any set of s = O(s⋆) columns of X. In other words, controlling C(X) alone is not enough for fast mixing.
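The coherence (3.12) is straightforward to compute for a given design; a minimal sketch (our own helper, with hypothetical names):

```python
import numpy as np

def coherence(X, delta_star):
    """C(X) of (3.12): max over signal columns j of ||X_{delta^c}' X_j||_2 / sqrt(n)."""
    n = X.shape[0]
    Xc = X[:, np.flatnonzero(delta_star == 0)]
    return max(np.linalg.norm(Xc.T @ X[:, j]) / np.sqrt(n)
               for j in np.flatnonzero(delta_star))
```

For a design whose signal columns are orthogonal to the noise columns, C(X) = 0; correlated columns drive it up.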
The strong signal-to-noise ratio condition (2.18) plays a crucial role in the analysis,
and our method of proof breaks down if this assumption does not hold. More flexible
techniques such as the space decomposition approach ([26, 18, 44]), or perhaps the
coupling approach of [27] might be more successful in relaxing this assumption.
3.4. Numerical illustrations. We illustrate some of the conclusions above with the following simulation study. We consider a linear regression model with Gaussian noise N(0, σ²), where σ² is set to 1. We experiment with sample size n = 200, and dimension p ∈ {500, 1000, . . . , 5000}. We fix the number of non-zero coefficients to s⋆ = 10, and δ⋆ is given by

δ⋆ = (1, . . . , 1, 0, . . . , 0), with 10 ones followed by p − 10 zeros.

The non-zero coefficients of θ⋆ are uniformly drawn from (−a − 1, −a) ∪ (a, a + 1), where

a = SNR √(s⋆ log(p)/n), with SNR ∈ {0.5, 4}.
We expect SNR = 4 to bring us closer to satisfying (2.18). To draw the design matrix we proceed as follows. First we draw X(0) ∈ Rn×p with i.i.d. entries N(0, 1), and form its singular value decomposition X(0) = US(0)V′. To bring us closer to assumption (2.17), we re-scale the singular values to form S(1), where S(1)ii = S(0)₁₁ (1/i)^{0.2}, and we form X(1) = US(1)V′. We consider two different design matrices. In the low coherence case we take X = X(1), whereas for the high coherence example we take

X = Σ_{k=1}^K S(1)kk Uk V′k,

for some integer K that we take as K = 30 + (p − 2000)/100, with K = 20 for p ≤ 1000. We subsequently standardize X to satisfy (2.14).
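The design construction above can be sketched as follows (our own transcription of the recipe; the final column standardization of (2.14) is omitted):

```python
import numpy as np

def make_designs(n, p, K, rng):
    """Low- and high-coherence designs from a rescaled SVD of a Gaussian matrix."""
    X0 = rng.standard_normal((n, p))
    U, S0, Vt = np.linalg.svd(X0, full_matrices=False)
    S1 = S0[0] * (1.0 / np.arange(1, S0.size + 1)) ** 0.2  # S1_ii = S0_11 (1/i)^0.2
    X_low = (U * S1) @ Vt                                  # X^(1) = U S^(1) V'
    X_high = (U[:, :K] * S1[:K]) @ Vt[:K]                  # keep K leading terms only
    return X_low, X_high

X_low, X_high = make_designs(20, 50, 5, np.random.default_rng(0))
```

The high-coherence design is rank-deficient by construction, which concentrates the correlation among its columns.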
We choose the prior as in H3, with u = 5, ρ as in (2.15), and set γ as

γ = γ₀σ²/λmax(X′X),

for a tuning parameter γ₀ = 0.2. We use the initial distribution ν₀ = ν(δ(0))(·|z) described in Section 3.1.1 for some initial value δ(0) ∈ ∆. We allow for some errors in specifying δ(0), and we consider three choices: (a) a good initialization set-up where δ(0) has no false-negatives and 10% false-positives, (b) a poor initialization set-up where δ(0) has no false-positives but 20% false-negatives, (c) a lasso initialization where δ(0) is
taken as the lasso estimate computed using MATLAB default cross-validation set up.
The lasso initialization corresponds of course to the initialization one would typically
use in practice.
Since the ideal value of the step-size τ in the Random Walk Metropolis step of
Algorithm 1 is not known, we use an adaptive Random Walk Metropolis algorithm
([3]) to adaptively select τ such that the acceptance probability of the Markov chain
is approximately 30%.
To monitor the mixing, we compute the sensitivity and the precision at iteration k as

SENk = (1/s⋆) Σ_{j=1}^p 1{|δk,j| > 0} 1{|δ⋆,j| > 0}, PRECk = Σ_{j=1}^p 1{|δk,j| > 0} 1{|δ⋆,j| > 0} / Σ_{j=1}^p 1{|δk,j| > 0}.
And we empirically measure the mixing time of the algorithm as the first time k where both SENk and PRECk reach 1, truncated to 2 × 10⁴ – that is, we stop any run that has not mixed after 2 × 10⁴ iterations. In the high signal-to-noise regime (SNR = 4) this definition makes sense, since in that case we know that with high probability most of the probability mass of Πγ is concentrated on δ⋆. In the weak signal-to-noise ratio regime this definition is not appropriate, since in this case the distribution Πγ can have a non-negligible probability of omitting some of the non-zero coefficients. In this case we amend the definition and set the mixing time as the first time where SEN ≥ αSEN and PREC = 1, where αSEN is set by running a long preliminary version of the algorithm.
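The monitoring statistics can be computed as follows (a small helper of our own):

```python
import numpy as np

def sen_prec(delta_k, delta_star):
    """Sensitivity and precision of the sampled model delta_k against delta_star."""
    dk = np.asarray(delta_k, dtype=bool)
    ds = np.asarray(delta_star, dtype=bool)
    tp = np.sum(dk & ds)                       # true positives
    sen = tp / ds.sum()
    prec = tp / dk.sum() if dk.any() else 0.0
    return sen, prec
```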
For comparison we also show the results for a similar Metropolis-within-Gibbs algorithm to sample from the weak spike-and-slab posterior distribution given in (1.3). This distribution differs from the posterior distribution analyzed by [31] only in the fact that we have used here a Laplace slab density instead of the Gaussian density used by [31]. One can sample from it with the same strategy described above for Πγ, with the additional simplification that the conditional distribution of [θ]δc given δ, [θ]δ has a Gaussian closed form. The resulting sampler is similar to the Gibbs sampler implemented in [31].
All the results presented are based on 45 independent MCMC replications. Figure 1 presented in the introduction shows the behavior of the mixing time as a function of the dimension p, with SNR = 4, under different initializations and design matrix coherence. The simulation results confirm the conclusion of Theorem 15 that the mixing time
is roughly linear in p when the algorithm is well initialized and the coherence of the
design matrix is low. We also observe that the two algorithms (for Πγ and Πγ) behave
similarly.
We also explore the behavior of the algorithms under the lasso initialization. Figures 2 and 3 show boxplots of the mixing times under different scenarios for p = 500 and p = 2,000. We obtain similar conclusions. The results show that when the signal-to-noise ratio is low, the initial lasso estimates have false negatives, which results in poor MCMC mixing. When the signal-to-noise ratio is high, the initial lasso estimates show good recovery. However mixing is still slow if C(X) is high.
Fig 2. Boxplots of estimated mixing times (first row) and relative error (second row) of sampling from Πγ (denoted MA) and from the weak spike-and-slab posterior (denoted SS), under different configurations of signal-to-noise ratio and matrix coherence. Dimension p = 500.
4. Proofs of Theorem 14 and Theorem 15.
4.1. Proof of Theorem 14. The proof is similar to the proof of Theorem 2 of [5],
or Theorem 3.2 of [24]. It is based on well-known iso-perimetric inequalities for log-
concave densities. We will use the following version taken from [43] Theorem 4.2.
Fig 3. Boxplots of estimated mixing times (first row) and relative error (second row) of sampling from Πγ (denoted MA) and from the weak spike-and-slab posterior (denoted SS), under different configurations of signal-to-noise ratio and matrix coherence. Dimension p = 2000.

Lemma 16. Let Θ be a convex subset of Rp, and h : Θ → R a convex function. Let Θ = S1 ∪ S2 ∪ S3 be a partition of Θ into nonempty measurable components such that d := inf_{x₁∈S1, x₂∈S2} ‖x₁ − x₂‖₂ > 0. Then

∫_{S3} e^{−h(θ)}dθ ≥ (2d/diam(Θ)) min[ ∫_{S1} e^{−h(θ)}dθ, ∫_{S2} e^{−h(θ)}dθ ].
Remark that (3.10) implies that for all u, v ∈ Rp, and for z ∈ {u, v}, A ↦ P(z, A) − ∫_A qu,v(x)dx is a non-negative measure on Rp. Hence for any measurable subset A of Θ,

P(u, A) − P(v, A) ≤ P(u, A) − ∫_A qu,v(x)dx ≤ P(u, Θ) − ∫_Θ qu,v(x)dx.
A similar bound holds for P (v,A)− P (u,A), leading to the following result.
Lemma 17. If (3.10) holds, then for all u, v ∈ Rp,

sup_{A⊆Θ} |P(u, A) − P(v, A)| ≤ P(u, Θ) ∨ P(v, Θ) − ∫_Θ qu,v(z)dz,

where a ∨ b := max(a, b).
With Lemma 16 and Lemma 17 in place, the proof of Theorem 14 can be done as
follows. Fix A ⊆ Θ such that 0 < π(A) < 1. Define
S1 := {θ ∈ A : P(θ, Ac ∩ Θ) < α/2}, S2 := {θ ∈ Ac ∩ Θ : P(θ, A) < α/2},

and S3 = Θ \ (S1 ∪ S2). Hence we have a partition Θ = S1 ∪ S2 ∪ S3 of Θ. If π(S1) ≤ π(A)/2, then

(4.1) ∫_A π(dθ)P(θ, Ac ∩ Θ) ≥ (α/2)π(A \ S1) ≥ (α/4)π(A).

Similarly, if π(S2) ≤ π(Ac ∩ Θ)/2, then

(4.2) ∫_{Ac∩Θ} π(dθ)P(θ, A) ≥ (α/2)π((Ac ∩ Θ) \ S2) ≥ (α/4)π(Ac ∩ Θ).
Now suppose that π(S1) > π(A)/2 and π(S2) > π(Ac ∩ Θ)/2. Then by reversibility

∫_A π(dθ)P(θ, Ac ∩ Θ) = (1/2)∫_A π(dθ)P(θ, Ac ∩ Θ) + (1/2)∫_{Ac∩Θ} π(dθ)P(θ, A)
 ≥ (α/4)( π(A \ S1) + π((Ac ∩ Θ) \ S2) )
 = (α/4)π(S3).   (4.3)
Fix θ, ϑ ∈ Θ such that |||θ − ϑ||| ≤ r. By assumption we then have ∫_Θ qθ,ϑ(x)dx ≥ α. Without any loss of generality suppose that P(θ, Θ) ≥ P(ϑ, Θ). It follows from Lemma 17 that

P(θ, Θ) − α ≥ sup_{B⊆Θ} |P(θ, B) − P(ϑ, B)| ≥ P(θ, Θ) − P(θ, Ac ∩ Θ) − P(ϑ, A),

where the last inequality follows by setting B = A = Θ \ (Ac ∩ Θ). Hence P(θ, Ac ∩ Θ) ≥ α − P(ϑ, A). This means that if |||θ − ϑ||| ≤ r and ϑ ∈ S2, we necessarily have θ ∉ S1. Hence d := inf_{θ₁∈S1, θ₂∈S2} ‖θ₁ − θ₂‖₂ ≥ inf_{θ₁∈S1, θ₂∈S2} |||θ₁ − θ₂||| ≥ r. By Lemma 16,

π(S3) ≥ (2r/diam(Θ)) min(π(S1), π(S2)).

Combining this with (4.3), (4.1) and (4.2), we conclude that

ΦΘ(P) ≥ (α/4) min(1, 2r/diam(Θ)),

as claimed.
4.2. Proof of Theorem 15. The first step of the proof is a lower bound on the conductance of Pγ that we derive in the next lemma. Given m ≥ 1, M > 2, ζ ∈ (0, 1/2), we define

Eρ(ζ, m, M) := Eρ ∩ { z ∈ Rn : Πγ({δ⋆} × B(δ⋆)_{m,M} | z) ≥ 1 − ζ }.
Lemma 18. Assume H6, H7, and choose γ = γ₀/(n log(p)) for some absolute constant γ₀ > 0 such that 4(γ/σ²)λmax(X′X) ≤ 1. Take ρ ∈ (0, ρ̄] where ρ̄ is as in (2.15). Fix ζ ∈ (0, 1/2), m ≥ 5, M > 2 arbitrary. Then we can find finite absolute constants C₀, C₁, C₂, C₃ ≥ 1 that do not depend on ζ such that, setting the step-size τδ of the Random Walk Metropolis updates of Algorithm 1 as

τδ = C₀/(‖δ‖₀^{1/2}√(n log(p))),

the following holds. For all p ≥ C₁, and all z ∈ Eρ(ζ, m, M), we have

(4.4) Φζ(Pγ) ≥ (C₂/√p) exp( −C₃ s⋆² [ 1 + ( (C(X)/(s⋆^{1/2} log(p))) √(p/n) )² ] ).
Proof. Fix m ≥ 5, M > 2, ζ ∈ (0, 1/2), and z ∈ Eρ(ζ, m, M) arbitrary. To shorten notation, we write B(δ) for B(δ)_{m,M}, and E for Eρ(ζ, m, M). Since z ∈ E, we have

1 − Πγ({δ⋆} × B(δ⋆)|z) ≤ ζ.

Let A be a measurable set such that ζ < Πγ(A|z) < 1 − ζ. We wish to lower bound the quantity ∫_A Πγ(dθ|z)Pγ(θ, Ac) / min(Πγ(A|z) − ζ, Πγ(Ac|z) − ζ). Given the expressions of Pγ and Πγ in (3.11) and (3.1) respectively, we have

∫_A Πγ(dθ|z)Pγ(θ, Ac) ≥ (1/2) Σ_δ ∫_A Πγ(dθ|z)Πγ(δ|θ, z)Kδ(θ, Ac)
 = (1/2) Σ_δ Πγ(δ|z) ∫_A Πγ(dθ|δ, z)Kδ(θ, Ac)
 ≥ (1/2) Πγ(δ⋆|z) ∫_{A∩B(δ⋆)} Πγ(dθ|δ⋆, z)Kδ⋆(θ, Ac ∩ B(δ⋆)).
Then using the definition of ΦB(δ)(Kδ) (the B(δ)-conductance of Kδ), we get

(4.5) ∫_A Πγ(dθ|z)Pγ(θ, Ac) ≥ (1/2) Πγ(δ⋆|z) ΦB(δ⋆)(Kδ⋆) × min( Πγ(Aδ⋆|δ⋆, z), Πγ(Acδ⋆|δ⋆, z) ),
where Aδ := A ∩ B(δ) and Acδ := Ac ∩ B(δ). On the other hand, since 1 − Πγ({δ⋆} × B(δ⋆)|z) ≤ ζ,

Πγ(A|z) ≤ Πγ({δ⋆} × Aδ⋆|z) + 1 − Πγ({δ⋆} × B(δ⋆)|z) ≤ Πγ({δ⋆} × Aδ⋆|z) + ζ.

Hence Πγ(A|z) − ζ ≤ Πγ(δ⋆|z)Πγ(Aδ⋆|δ⋆, z). A similar bound holds for Πγ(Ac|z) − ζ. We combine these with (4.5) to get

∫_A Πγ(dθ|z)Pγ(θ, Ac) / min( Πγ(A|z) − ζ, Πγ(Ac|z) − ζ ) ≥ (1/2) ΦB(δ⋆)(Kδ⋆).

Since the right-hand side of the last display is independent of A, we conclude that

(4.6) Φζ(Pγ) ≥ (1/2) ΦB(δ⋆)(Kδ⋆).
To lower bound ΦB(δ⋆)(Kδ⋆), we shall apply Theorem 14. To save some notation, in the remainder of the proof we will write δ for δ⋆, and s for s⋆. We apply Theorem 14 with Θ taken as B(δ), and |||θ||| := ‖[θ]δ‖₂. Let us check that all the assumptions are satisfied. Clearly, Kδ is reversible with respect to Πγ(dθ|δ, z) ∝ e^{−hγ(δ,θ;z)}dθ, and θ ↦ hγ(δ, θ; z) is convex ([34] Theorem 2.3).
Let us first recall our notations. For 1 ≤ i ≤ p, u ∈ Ri, v ∈ Rp−i, and δ ∈ ∆ such
that ‖δ‖0 = i, we write (u, v)δ to denote the vector of Rp, θ say, such that [θ]δ = u,
and [θ]δc = v. When v = (0, . . . , 0) ∈ Rp−i, we write (u, 0)δ. If the structure δ is
understood, we will simply write (u, v).
For τ² > 0 and u ∈ Rs, let Q_{s,τ²}(u, ·) denote the density of the Gaussian distribution N(u, τ²Is) on Rs, and let Gγ,δ(u, ·) denote the density of the Gaussian distribution N(mδ,(u,0), γΣδ,(u,0)) on R^{p−s} of the Independence Metropolis-Hastings update, with mδ,(u,0) and Σδ,(u,0) as given in (3.7). We recall that Gγ,δ(u, ·) is proportional to e^{−h̃γ}, with h̃γ as in (3.6). With these notations, for any measurable subset A ⊆ Rp, the Markov chain can move from θ into A under the kernel Kδ = (1/2)K→δ + (1/2)K←δ if we choose to use K→δ, and both the Random Walk Metropolis and the independence sampler moves are accepted:

Kδ(θ, A) ≥ (1/2) ∫_A Q_{s,τδ²}([θ]δ, [u]δ) min[ 1, e^{−hγ(δ,([u]δ,[θ]δc);z)} / e^{−hγ(δ,([θ]δ,[θ]δc);z)} ]
 × Gγ,δ([u]δ, [u]δc) min[ 1, e^{−hγ(δ,([u]δ,[u]δc);z)} Gγ,δ([u]δ, [θ]δc) / ( e^{−hγ(δ,([u]δ,[θ]δc);z)} Gγ,δ([u]δ, [u]δc) ) ] du.
Hence Kδ satisfies (3.10) with

q(θ, u) := (1/2) Q_{s,τδ²}([θ]δ, [u]δ) min[ 1, e^{−hγ(δ,([u]δ,[θ]δc);z)} / e^{−hγ(δ,([θ]δ,[θ]δc);z)} ]
 × Gγ,δ([u]δ, [u]δc) min[ 1, e^{−hγ(δ,([u]δ,[u]δc);z)} Gγ,δ([u]δ, [θ]δc) / ( e^{−hγ(δ,([u]δ,[θ]δc);z)} Gγ,δ([u]δ, [u]δc) ) ].
We show in Lemma 27 and Lemma 28 of the supplement that we can find a finite absolute constant C₀ such that for all p large enough

sup_{z∈Eρ} sup_{θ∈B(δ)} | h̃γ(δ, θ; z) − hγ(δ, θ; z) | ≤ C₀R₁,

and

sup_{z∈Eρ} | hγ(δ, θ₁; z) − hγ(δ, θ₂; z) | ≤ C₀R₂ |||θ₂ − θ₁||| + C₀R₃

for all θ₁, θ₂ ∈ B(δ) such that [θ₁]δc = [θ₂]δc, where

R₁ := sγ( ρ + C(X)√((m + 1)γnp) )²,
R₂ := s^{1/2}( ρ + nMε + C(X)γns^{1/2}√((m + 1)γnp) ),
and R₃ := γs(ρ + nMε)( ρ + nMε + C(X)√((m + 1)γnp) ).
Hence for z ∈ Eρ, and ([u]δ, [θ]δc), ([u]δ, [u]δc) ∈ B(δ), we have

min[ 1, e^{−hγ(δ,([u]δ,[u]δc);z)} Gγ,δ([u]δ, [θ]δc) / ( e^{−hγ(δ,([u]δ,[θ]δc);z)} Gγ,δ([u]δ, [u]δc) ) ]
 = min[ 1, e^{h̃γ(δ,([u]δ,[u]δc);z)−hγ(δ,([u]δ,[u]δc);z)} / e^{h̃γ(δ,([u]δ,[θ]δc);z)−hγ(δ,([u]δ,[θ]δc);z)} ] ≥ e^{−2C₀R₁}.
It follows that for all p large enough, z ∈ Eρ, and θ₁, θ₂ ∈ B(δ),

∫_{Rp} min(q(θ₁, u), q(θ₂, u))du ≥ (e^{−C₀(2R₁+R₃)}/2)
 × ∫_{B(δ)} min( Q_{s,τδ²}([θ₁]δ, [u]δ) e^{−C₀R₂|||u−θ₁|||}, Q_{s,τδ²}([θ₂]δ, [u]δ) e^{−C₀R₂|||u−θ₂|||} ) Gγ,δ([u]δ, [u]δc) du.
To proceed we define

V₁ := { x ∈ Rs : ‖x − [θ⋆]δ‖₂ ≤ Mε, ‖x − [θ₁]δ‖₂ ≤ (√(2π)/320)(4Mε/(s^{1/2} log(p))), and ‖x − [θ₂]δ‖₂ ≤ (√(2π)/320)(4Mε/(s^{1/2} log(p))) },

and

V₂ := { v ∈ R^{p−s} : ‖v‖₂ ≤ ε₁₁, ‖v‖∞ ≤ ε₁₂ },

where ε₁₁ = 2√((1 + m)γp), ε₁₂ = 2√((m + 1)γ log(p)). We note that V := {(u, v)δ : u ∈ V₁, v ∈ V₂} ⊂ B(δ), so that it follows from the last display that

(4.7) ∫_{Rp} min(q(θ₁, u), q(θ₂, u))du ≥ (e^{−C₀(2R₁+R₃)}/2) e^{−C₀R₂(√(2π)/320)(4Mε/(s^{1/2} log(p)))} [ inf_{x∈V₁} Gγ,δ(x, V₂) ] ∫_{V₁} min( Q_{s,τδ²}([θ₁]δ, x), Q_{s,τδ²}([θ₂]δ, x) )dx.
We recall that Gγ,δ(x, ·) is the density of the Gaussian distribution N(mδ,(x,0), γΣδ,(x,0)) on R^{p−s}, where Σδ,(x,0) = (I_{p−s} − (γ/σ²)X′δcXδc)⁻¹, and

mδ,(x,0) = −(γ/σ²)( I_{p−s} − (γ/σ²)X′δcXδc )⁻¹ X′δcXδ( [Jγ(δ, (x, 0))]δ − x ).
We need to lower bound the term Gγ,δ(x, V₂) = N(mδ,(x,0), γΣδ,(x,0))(V₂). It suffices to show that ‖mδ,(x,0)‖₂ ≤ √(γp) and ‖mδ,(x,0)‖∞ ≤ √γ. Indeed if these inequalities hold, and if (U₁, . . . , U_{p−s}) iid ∼ N(0, 1), we have

P( ‖mδ,(x,0) + γ^{1/2}Σ^{1/2}_{δ,(x,0)}U‖₂ > ε₁₁ ) ≤ P( U′Σδ,(x,0)U > (1/γ)(ε₁₁ − √(γp))² )
 ≤ exp( −( 2√((m + 1)p) − √p − √(Tr(Σδ,(x,0))) )² / (2λmax(Σδ,(x,0))) ),

where the last inequality uses Lemma 25 in the supplement. Similarly,

P( ‖mδ,(x,0) + γ^{1/2}Σ^{1/2}_{δ,(x,0)}U‖∞ > ε₁₂ ) ≤ 2 exp( log(p) − (m + 1) log(p)/(2λmax(Σδ,(x,0))) ).

Noting that the largest eigenvalue of Σδ,(x,0) is bounded from above by 4/3 (since 4γλmax(X′X) ≤ σ²), we can easily conclude that

(4.8) inf_{x∈V₁} Gγ,δ(x, V₂) ≥ 1/2,
for all $p$ large enough, and $m\geq5$. It remains to show that $\|m_{\delta,(x,0)}\|_2\leq\sqrt{\gamma p}$ and $\|m_{\delta,(x,0)}\|_\infty\leq\sqrt{\gamma}$. For any $\theta$, $(m_{\delta,\theta})_j=-\frac{\gamma}{\sigma^2}[J_\gamma(\delta,\theta)-\theta]'_\delta X'_\delta X_{\delta^c}(\Sigma_{\delta,\theta}e_j)$, where $e_j$ denotes the $j$-th standard unit vector. Clearly, $\|\Sigma_{\delta,\theta}e_j\|_2\leq4/3$. Hence, using the definition of the incoherence parameter $C(X)$,
\[ \|m_{\delta,\theta}\|_\infty\leq\frac43\frac{\gamma}{\sigma^2}\sqrt{n\,v(s)\,\lambda_{\max}(X'X)}\,\|J_\gamma(\delta,\theta)-\theta_\delta\|_2. \]
Noting that for $\theta=(x,0)\in B^{(\delta)}$, $\|J_\gamma(\delta,\theta)-\theta_\delta\|_2\leq s^{1/2}\gamma\big(\tfrac{3\rho}{2}+\tfrac{n\sqrt{v(s)}}{\sigma^2}M\epsilon\big)\leq Cs^{1/2}\gamma(\rho+nM\epsilon)=O(s^{1/2}\gamma n\epsilon)$, we easily get
\[ \|m_{\delta,(x,0)}\|_\infty=o(\sqrt\gamma),\quad\text{and}\quad\|m_{\delta,(x,0)}\|_2\leq\sqrt p\,\|m_{\delta,(x,0)}\|_\infty=o(\sqrt{\gamma p}), \]
since $s=o(\log(p))$ as assumed in H7. Therefore, (4.8) and (4.7) imply that
\[ \int_{\mathbb{R}^p}\min(q(\theta_1,u),q(\theta_2,u))\,du\geq\frac{e^{-C_0(2R_1+R_3)}}{4}\,e^{-C_0R_2\frac{\sqrt{2\pi}}{320}\frac{4M\epsilon}{s^{1/2}\log(p)}}\int_{\mathcal{V}_1}\min\big(Q_{s,\tau_\delta^2}([\theta_1]_\delta,x),\ Q_{s,\tau_\delta^2}([\theta_2]_\delta,x)\big)\,dx. \]
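The eigenvalue bound used above — that $\Sigma_{\delta,(x,0)}=(I-\frac{\gamma}{\sigma^2}X'_{\delta^c}X_{\delta^c})^{-1}$ has largest eigenvalue at most $4/3$ whenever $4\gamma\lambda_{\max}(X'X)\leq\sigma^2$ — can be verified numerically; a minimal sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 50, 8, 1.0
X = rng.normal(size=(n, p))
G = X.T @ X
lam_max = np.linalg.eigvalsh(G).max()
gamma = sigma2 / (4 * lam_max)          # enforce 4*gamma*lam_max(X'X) <= sigma^2

# eigenvalues of Sigma are 1/(1 - gamma*lam_j/sigma^2) with gamma*lam_j/sigma^2 <= 1/4
Sigma = np.linalg.inv(np.eye(p) - (gamma / sigma2) * G)
assert np.linalg.eigvalsh(Sigma).max() <= 4.0 / 3.0 + 1e-10
```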
We lower bound the integral on the right-hand side of the last display using Lemma 29 in the supplement, which we apply with $d\leftarrow s$, $R\leftarrow M\epsilon$, $\sigma\leftarrow\tau_\delta=\frac{\sqrt{2\pi}}{320}\frac{M\epsilon}{s\log(p)}$, and $r\leftarrow4\tau_\delta\sqrt s=\frac{\sqrt{2\pi}}{320}\frac{4M\epsilon}{s^{1/2}\log(p)}$. The lemma implies that for $|||\theta_1-\theta_2|||\leq\tau_\delta/4$, we have
\[ \int_{\mathbb{R}^p}\min(q(\theta_1,u),q(\theta_2,u))\,du\geq\frac{e^{-C_0(2R_1+R_3)}}{16}\,e^{-C_0R_2\frac{\sqrt{2\pi}}{320}\frac{4M\epsilon}{s^{1/2}\log(p)}}. \]
Hence $K_\delta$ satisfies all the assumptions of Theorem 14, and we conclude that
\[ \Phi_{B^{(\delta)}}(K_\delta)\geq\frac1{64}\,e^{-2C_0(2R_1+R_3)}\,e^{-C_0R_2\frac{\sqrt{2\pi}}{320}\frac{4M\epsilon}{s^{1/2}\log(p)}}\times\min\Bigg[1,\ \frac{\sqrt{2\pi}}{2\times320}\,\frac{M\epsilon}{s\log(p)\big(2\sqrt{(m+1)\gamma p}+M\epsilon\big)}\Bigg]. \]
Using $\gamma=\frac{\gamma_0}{n\log(p)}$, we check that for some absolute constant $C$,
\[ R_1\leq C\,s\,\frac pn\Big(\frac{C(X)}{\log(p)}\Big)^2,\qquad R_3\leq C\Big(s^2+s^{3/2}\sqrt{\frac pn}\,\frac{C(X)}{\log(p)}\Big), \]
and
\[ R_2\,\frac{M\epsilon}{s^{1/2}\log(p)}\leq C\,s\Big(1+\sqrt{\frac pn}\,\frac{C(X)}{(\log(p))^{3/2}}\Big). \]
Hence, we have
\[ R_1+R_3+R_2\,\frac{M\epsilon}{s^{1/2}\log(p)}=O\Bigg(s^2\Big[1+\Big(\frac{C(X)}{s^{1/2}\log(p)}\sqrt{\frac pn}\Big)^2\Big]\Bigg), \]
and
\[ \frac{M\epsilon}{s\log(p)\big(2\sqrt{(m+1)\gamma p}+M\epsilon\big)}\geq\frac{C}{1+\sqrt{m+1}}\,\frac{1}{s\log(p)+\sqrt{ps}}, \]
and it follows that for all $p$ large enough,
\[ (4.9)\quad \Phi_{B^{(\delta)}}(K_\delta)\geq\frac{C_1}{\sqrt p}\,e^{-C_2s^2\big[1+\big(\frac{C(X)}{s^{1/2}\log(p)}\sqrt{p/n}\big)^2\big]}, \]
for absolute constants $C_1,C_2$. The result then follows by combining (4.9) and (4.6).
4.2.1. Proof of Theorem 15. Throughout the proof, $C$ denotes a generic universal constant whose actual value may change from one appearance to the next. To shorten notation, we will also write $\delta$ to denote $\delta^{(0)}$ (the initialization of Algorithm 1 as described in Section 3.1.1). Fix $\zeta_0\in(0,1/2)$. Set
\[ (4.10)\quad m_1=\max\Big(0,\ \frac{8\sigma^2}{v(s)\gamma_0}\log\Big(\frac{8\,\mathrm{FP}}{\zeta_0}\Big)\Big),\quad\text{and}\quad\zeta=\frac{\zeta_0}{16\,p^{C_0(1+m_1+\mathrm{FP})}}, \]
where $C_0\geq1$ is an absolute constant that we specify later. Let $m\geq5$ and $M>\max\big(2,\sqrt{\tfrac{u+2}{3}}\big)$ be arbitrary, and set
\[ \mathcal{E}_\rho(\zeta,m,M)\overset{\mathrm{def}}{=}\mathcal{E}_{\rho,\Lambda}\bigcap\Big\{z\in\mathbb{R}^n:\ \Pi_\gamma\big(\{\delta_\star\}\times B^{(\delta_\star)}_{m,M}\,\big|\,z\big)\geq1-\zeta\Big\}. \]
By Markov's inequality,
\[ \mathbb{P}_\star\big(Z\notin\mathcal{E}_\rho(\zeta,m,M)\big)\leq\mathbb{P}_\star\big[1-\Pi_\gamma(\{\delta_\star\}\times B^{(\delta_\star)}_{m,M}|Z)>\zeta\ \big|\ Z\in\mathcal{E}_{\rho,\Lambda}\big]+\mathbb{P}_\star(Z\notin\mathcal{E}_{\rho,\Lambda})\\ \leq\frac1\zeta\,\mathbb{E}_\star\big[1-\Pi_\gamma(\{\delta_\star\}\times B^{(\delta_\star)}_{m,M}|Z)\ \big|\ Z\in\mathcal{E}_{\rho,\Lambda}\big]+\mathbb{P}_\star(Z\notin\mathcal{E}_{\rho,\Lambda}). \]
Therefore, by Corollary 10, we can choose absolute constants $m\geq5$, $M>\max\big(2,\sqrt{\tfrac{u+2}{3}}\big)$ (that depend only on $C_0$, $m_1$, $\mathrm{FP}$ and $\zeta_0$) such that if $u$ and $m_0$ are also large enough, then we have
\[ (4.11)\quad \mathbb{P}_\star\big(Z\notin\mathcal{E}_\rho(\zeta,m,M)\big)\leq\frac{\zeta_0}{2}, \]
for all $p$ large enough. By Theorem 13 and Lemma 18, there exist absolute constants $C_1,C_2,C_3,C_4$ that do not depend on $\zeta$ such that for all $p\geq C_1$ and all $z\in\mathcal{E}_\rho(\zeta,m,M)$: choosing the step-size of the Random Walk Metropolis as $\tau_\delta=\frac{C_2}{\|\delta\|_0^{1/2}\sqrt{n\log(p)}}$, and choosing an integer $K$ satisfying
\[ K\geq C_3\,p\,\log\Big(\frac1\zeta\Big)\exp\Bigg(C_4s_\star^2\Big[1+\Big(\frac{C(X)}{s_\star^{1/2}\log(p)}\sqrt{\frac pn}\Big)^2\Big]\Bigg), \]
we have
\[ \|\nu_0P^K_\gamma-\Pi_\gamma(\cdot|z)\|_{\mathrm{tv}}\leq2\sup_{A:\ \Pi_\gamma(A|z)\leq\zeta}\big|\nu_0(A)-\Pi_\gamma(A|z)\big|. \]
Using this with (4.11), it follows that
\[ (4.12)\quad \mathbb{E}_\star\big[\|\nu_0P^K_\gamma-\Pi_\gamma(\cdot|Z)\|_{\mathrm{tv}}\big]\leq2\,\mathbb{E}_\star\Big[\sup_{A:\ \Pi_\gamma(A|Z)\leq\zeta}\big|\nu_0(A)-\Pi_\gamma(A|Z)\big|\,\mathbf{1}_{\mathcal{E}_\rho(\zeta,m,M)}(Z)\Big]+\frac{\zeta_0}{2}. \]
Therefore, to finish the proof it suffices to upper bound the first term on the right-hand side of the last display. We recall that $\nu_0=\nu^{(\delta)}(\cdot|z)$ (where $\delta$ is short for $\delta^{(0)}$), and we split the term $\Pi_\gamma(A|z)-\nu_0(A)$ as
\[ \Big(\Pi_\gamma(A|z)-\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)\Big)+\Big(\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)-\nu^{(\delta)}(A|z)\Big), \]
where $B_1^{(\delta)}$ is short for $B^{(\delta)}_{m_1,M}$. For any measurable set $A$ of $\mathbb{R}^p$ such that $\Pi_\gamma(A|z)\leq\zeta$, if $\Pi_\gamma(A|z)\geq\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)$, then
\[ \Big|\Pi_\gamma(A|z)-\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)\Big|\leq\zeta. \]
We note that for $z\in\mathcal{E}_\rho(\zeta,m,M)$,
\[ \Pi_\gamma(A|z)\geq\Pi_\gamma\big(\{\delta_\star\}\times(A\cap B_1^{(\delta_\star)})\big|z\big)=\Pi_\gamma\big(\{\delta_\star\}\times B_1^{(\delta_\star)}\big|z\big)\,\frac{\Pi_\gamma(A\cap B_1^{(\delta_\star)}|\delta_\star,z)}{\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)}\geq(1-\zeta)\,\frac{\Pi_\gamma(A\cap B_1^{(\delta_\star)}|\delta_\star,z)}{\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)}. \]
Using this, we see that if $\Pi_\gamma(A|z)\leq\zeta$, then $\Pi_\gamma(A\cap B_1^{(\delta_\star)}|\delta_\star,z)/\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)\leq\zeta/(1-\zeta)\leq2\zeta$, so that if $\Pi_\gamma(A|z)<\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)$, then
\[ \Big|\Pi_\gamma(A|z)-\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)\Big|\leq\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)-\frac{\Pi_\gamma(A\cap B_1^{(\delta_\star)}|\delta_\star,z)}{\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)}+\zeta\,\frac{\Pi_\gamma(A\cap B_1^{(\delta_\star)}|\delta_\star,z)}{\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)}\\ \leq\Bigg|\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)-\frac{\Pi_\gamma(A\cap B_1^{(\delta_\star)}|\delta_\star,z)}{\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)}\Bigg|+\zeta. \]
We conclude that for $z\in\mathcal{E}_\rho(\zeta,m,M)$,
\[ (4.13)\quad \sup_{A:\ \Pi_\gamma(A|z)\leq\zeta}\Big|\Pi_\gamma(A|z)-\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)\Big|\leq2\zeta\Bigg[1+\sup_{\Pi_\gamma(A\cap B_1^{(\delta_\star)}|\delta_\star,z)>0}\frac{\nu^{(\delta)}(A\cap B_1^{(\delta_\star)}|z)\,\big/\,\nu^{(\delta)}(B_1^{(\delta_\star)}|z)}{\Pi_\gamma(A\cap B_1^{(\delta_\star)}|\delta_\star,z)\,\big/\,\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)}\Bigg]. \]
Given $A\subseteq B_1^{(\delta_\star)}$, we write
\[ (4.14)\quad \frac{\nu^{(\delta)}(A|z)\,\big/\,\nu^{(\delta)}(B_1^{(\delta_\star)}|z)}{\Pi_\gamma(A|\delta_\star,z)\,\big/\,\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)}=\frac{\nu^{(\delta_\star)}(A|z)\,\big/\,\nu^{(\delta_\star)}(B_1^{(\delta_\star)}|z)}{\Pi_\gamma(A|\delta_\star,z)\,\big/\,\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)}\times\frac{\nu^{(\delta)}(A|z)\,\big/\,\nu^{(\delta)}(B_1^{(\delta_\star)}|z)}{\nu^{(\delta_\star)}(A|z)\,\big/\,\nu^{(\delta_\star)}(B_1^{(\delta_\star)}|z)}. \]
Define
\[ g_\gamma(\delta,\theta;z)\overset{\mathrm{def}}{=}\frac12\big([\theta]_\delta-\hat\theta_\delta\big)'\mathcal{I}_{\gamma,\delta}\big([\theta]_\delta-\hat\theta_\delta\big)+\frac{1}{2\gamma}\big(\theta-\theta_\delta\big)'\big(\theta-\theta_\delta\big). \]
The first ratio on the right-hand side of (4.14) can be written as
\[ (4.15)\quad \frac{\nu^{(\delta_\star)}(A|z)\,\big/\,\nu^{(\delta_\star)}(B_1^{(\delta_\star)}|z)}{\Pi_\gamma(A|\delta_\star,z)\,\big/\,\Pi_\gamma(B_1^{(\delta_\star)}|\delta_\star,z)}=\frac{\int_A e^{-g_\gamma(\delta_\star,\theta;z)}\,d\theta}{\int_A e^{-h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1}\,d\theta}\times\frac{\int_{B_1^{(\delta_\star)}}e^{-h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1}\,d\theta}{\int_{B_1^{(\delta_\star)}}e^{-g_\gamma(\delta_\star,\theta;z)}\,d\theta}. \]
By Lemma 19 in the supplement, we have
\[ h_\gamma(\delta_\star,\theta;z)=-\ell(\theta_{\delta_\star};z)+\rho\|\theta_{\delta_\star}\|_1+\frac{1}{2\gamma}(\theta-\theta_{\delta_\star})'(\theta-\theta_{\delta_\star})-R, \]
where the remainder $R$ satisfies
\[ (4.16)\quad 0\leq R\leq\frac12(\theta-\theta_{\delta_\star})'S(\theta-\theta_{\delta_\star})+\frac\gamma2\big\|\delta_\star\cdot\nabla\ell(\theta;z)-\rho\,\mathrm{sign}(\theta_{\delta_\star})\big\|_2^2. \]
It follows that
\[ -h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1+g_\gamma(\delta_\star,\theta;z)=\rho\big(\|\hat\theta_{\delta_\star}\|_1-\|\theta_{\delta_\star}\|_1\big)+\ell(\theta_{\delta_\star};z)-\ell(\hat\theta_{\delta_\star};z)\\ +\frac12\big([\theta]_{\delta_\star}-\hat\theta_{\delta_\star}\big)'\mathcal{I}_{\gamma,\delta_\star}\big([\theta]_{\delta_\star}-\hat\theta_{\delta_\star}\big)+R. \]
Since $\ell(\theta;z)$ is quadratic and $\delta_\star\cdot\nabla\ell(\hat\theta_{\delta_\star};z)=0$, we have $\ell(\theta_{\delta_\star};z)-\ell(\hat\theta_{\delta_\star};z)+\frac12([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})=0$. Hence
\[ -h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1+g_\gamma(\delta_\star,\theta;z)=\rho\big(\|\hat\theta_{\delta_\star}\|_1-\|\theta_{\delta_\star}\|_1\big)+R. \]
For $z\in\mathcal{E}_\rho$ and $\theta\in B_1^{(\delta_\star)}$, $\big|\|\hat\theta_{\delta_\star}\|_1-\|\theta_{\delta_\star}\|_1\big|\leq s_\star^{1/2}(M+1)\epsilon$. Using this and the fact that $R\geq0$, we see that the first term on the right-hand side of (4.15) is upper bounded by
\[ e^{(M+1)\rho s_\star^{1/2}\epsilon}\leq e^{\log(p)}, \]
since $\rho s_\star^{1/2}\epsilon=o(\log(p))$ by assumption. By proceeding as in the calculations leading to (5.4) in the supplement, we can show that
\[ -h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1=\rho\big(\|\hat\theta_{\delta_\star}\|_1-\|\theta_{\delta_\star}\|_1\big)-g_\gamma(\delta_\star,\theta;z)+R\\ \leq(M+1)\rho s_\star^{1/2}\epsilon+C(1+M^2)s_\star-\frac12\big([\theta]_{\delta_\star}-\hat\theta_{\delta_\star}\big)'\mathcal{I}_{\gamma,\delta_\star}\big([\theta]_{\delta_\star}-\hat\theta_{\delta_\star}\big)-\frac{1}{2\gamma}(\theta-\theta_{\delta_\star})'A_\gamma(\theta-\theta_{\delta_\star}), \]
where $A_\gamma=I_p-\gamma(1+4\gamma nv(s_\star))S$. Hence the second term on the right-hand side of (4.15) is upper bounded by
\[ e^{C(1+M^2)s_\star}\,\frac{\int_{B_1^{(\delta_\star)}}e^{-\frac12([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})-\frac{1}{2\gamma}[\theta]'_{\delta_\star^c}[A_\gamma]_{\delta_\star^c}[\theta]_{\delta_\star^c}}\,d\theta}{\int_{B_1^{(\delta_\star)}}e^{-\frac12([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})-\frac{1}{2\gamma}\|[\theta]_{\delta_\star^c}\|_2^2}\,d\theta}\leq\frac{2\,e^{C(1+M^2)s_\star}}{\sqrt{\det([A_\gamma]_{\delta_\star^c})}}\leq2\,e^{C(1+M^2)s_\star}\,e^{C\gamma^2(n\mathrm{Tr}(X'X)+\|X'X\|_F^2)}. \]
Finally, we use the assumptions $\gamma^2\big(n\mathrm{Tr}(X'X)+\|X'X\|_F^2\big)=o(\log(p))$ and $s_\star=o(\log(p))$ to conclude that the second term on the right-hand side of (4.15) is also upper bounded by $e^{\log(p)}$ for all $p$ large enough. Therefore the first term on the right-hand side of (4.14) is upper bounded by $e^{2\log(p)}$ for all $p$ large enough. The second term on the right-hand side of (4.14) can be written as
\[ \frac{\nu^{(\delta)}(A|z)\,\big/\,\nu^{(\delta)}(B_1^{(\delta_\star)}|z)}{\nu^{(\delta_\star)}(A|z)\,\big/\,\nu^{(\delta_\star)}(B_1^{(\delta_\star)}|z)}=\frac{\int_A e^{-g_\gamma(\delta,\theta;z)}\,d\theta}{\int_A e^{-g_\gamma(\delta_\star,\theta;z)}\,d\theta}\times\frac{\int_{B_1^{(\delta_\star)}}e^{-g_\gamma(\delta_\star,\theta;z)}\,d\theta}{\int_{B_1^{(\delta_\star)}}e^{-g_\gamma(\delta,\theta;z)}\,d\theta}. \]
We claim that
\[ (4.17)\quad \sup_{z\in\mathcal{E}_\rho}\,\sup_{\theta\in B_1^{(\delta_\star)}}\big|g_\gamma(\delta,\theta;z)-g_\gamma(\delta_\star,\theta;z)\big|\leq C(m_1+\mathrm{FP})\log(p), \]
for some absolute constant $C\geq1$ (that does not depend on $m$, $M$). Together with the equation right before it, (4.17) then implies that the second term on the right-hand side of (4.14) is upper bounded by $p^{C(m_1+\mathrm{FP})}$. Therefore we can conclude that (4.13) becomes
\[ (4.18)\quad \sup_{A:\ \Pi_\gamma(A|z)\leq\zeta}\Big|\Pi_\gamma(A|z)-\nu^{(\delta)}_{B_1^{(\delta_\star)}}(A|z)\Big|\,\mathbf{1}_{\mathcal{E}_\rho(\zeta,m,M)}(z)\leq4\zeta\,p^{C(1+m_1+\mathrm{FP})}. \]
Hence, taking $C_0$ as the constant $C$ of the last display in the expression of $\zeta$, we conclude from (4.18) and (4.12) that for all $p$ large enough,
\[ (4.19)\quad \mathbb{E}_\star\big[\|\nu_0P^K_\gamma-\Pi_\gamma(\cdot|Z)\|_{\mathrm{tv}}\big]\leq\zeta_0+2\,\mathbb{E}_\star\Big[\big\|\nu^{(\delta)}_{B_1^{(\delta_\star)}}(\cdot|Z)-\nu^{(\delta)}(\cdot|Z)\big\|_{\mathrm{tv}}\,\mathbf{1}_{\mathcal{E}_\rho(\zeta,m,M)}(Z)\Big]. \]
Noting that $\nu^{(\delta)}_{B_1^{(\delta_\star)}}(\cdot|z)$ is the restriction of $\nu^{(\delta)}(\cdot|z)$ to $B_1^{(\delta_\star)}$, we have
\[ (4.20)\quad \big\|\nu^{(\delta)}_{B_1^{(\delta_\star)}}(\cdot|z)-\nu^{(\delta)}(\cdot|z)\big\|_{\mathrm{tv}}\leq1-\nu^{(\delta)}\big(B_1^{(\delta_\star)}\big|z\big). \]
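The bound in (4.20) — the total-variation distance between a distribution and its renormalized restriction to a set $B$ is at most $1-\nu(B)$ — is easy to check numerically (in the discrete case it in fact holds with equality); a minimal sketch with an arbitrary probability vector:

```python
import numpy as np

rng = np.random.default_rng(9)
k = 20
nu = rng.dirichlet(np.ones(k))             # a probability vector on k points
B = np.zeros(k, dtype=bool)
B[:12] = True                              # the restriction set

nu_B = np.where(B, nu, 0.0) / nu[B].sum()  # nu restricted and renormalized to B
tv = 0.5 * np.abs(nu_B - nu).sum()         # total-variation distance
assert tv <= 1 - nu[B].sum() + 1e-12       # here equality actually holds
```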
Moreover,
\[ (4.21)\quad 1-\nu^{(\delta)}\big(B_1^{(\delta_\star)}\big|z\big)\leq\mathbb{P}\Big(\big\|[\hat\theta_\delta+\mathcal{I}^{-1/2}_{\delta,\gamma}V]_{\delta_\star}-[\theta_\star]_{\delta_\star}\big\|_2>M\epsilon\Big)+\mathbb{P}\Big(\big\|[\hat\theta_\delta+\mathcal{I}^{-1/2}_{\delta,\gamma}V]_{\delta-\delta_\star}\big\|_2^2>2(m_1+1)\gamma p\Big)\\ +\mathbb{P}\Big(\big\|[\hat\theta_\delta+\mathcal{I}^{-1/2}_{\delta,\gamma}V]_{\delta-\delta_\star}\big\|_\infty^2>4(m_1+1)\gamma\log(p)\Big)+\mathbb{P}\big(\gamma\|W\|_2^2>2(m_1+1)\gamma p\big)+\mathbb{P}\big(\gamma\|W\|_\infty^2>4(m_1+1)\gamma\log(p)\big), \]
where $V=(V_1,\ldots,V_{\|\delta\|_0})\overset{iid}{\sim}\mathbf{N}(0,1)$, and $W=(W_1,\ldots,W_{p-s_\star-\mathrm{FP}})\overset{iid}{\sim}\mathbf{N}(0,1)$. For $z\in\mathcal{E}_\rho$ (which implies that $\|\hat\theta_\delta-[\theta_\star]_\delta\|_2\leq\epsilon$, as seen in (2.10)), given that $\epsilon=\frac{\sigma}{v(s)}\sqrt{72(m_0+1)\frac{s\log(p)}{n}}$, and noting that the largest eigenvalue of $\mathcal{I}^{-1}_{\gamma,\delta}$ is at most $\sigma^2/(nv(s))$, we can then use standard Gaussian deviation bounds to conclude that the sum of the first and the last two terms on the right-hand side of (4.21) is upper bounded by
\[ \mathbb{P}\Bigg(\|[V]_{\delta_\star}\|_2>\frac{(M-1)\epsilon}{\sqrt{\sigma^2/(nv(s))}}\Bigg)+\mathbb{P}\Big(\|W\|_2>\sqrt{2(1+m_1)p}\Big)+\mathbb{P}\Big(\|W\|_\infty>2\sqrt{(1+m_1)\log(p)}\Big)\leq\frac{1}{p^{s_\star(m_0+1)}}+e^{-m_1p/4}+\frac{1}{p^{m_1}}\leq\frac{\zeta_0}{4}, \]
for all $p$ large enough, and $m_1\geq1$. Suppose that for $z\in\mathcal{E}_\rho(\zeta,m,M)$, $\|[\hat\theta_\delta]_{\delta-\delta_\star}\|_\infty\leq\sqrt{(m_1+1)\gamma\log(p)}$ (which implies that $\|[\hat\theta_\delta]_{\delta-\delta_\star}\|_2\leq\mathrm{FP}^{1/2}\sqrt{(m_1+1)\gamma\log(p)}$). Then the sum of the second and third terms on the right-hand side of (4.21) is upper bounded by
\[ \mathbb{P}\Big(\big\|[\mathcal{I}^{-1/2}_{\delta,\gamma}V]_{\delta-\delta_\star}\big\|_2>\sqrt{\tfrac12(m_1+1)\gamma p}\Big)+\mathbb{P}\Big(\big\|[\mathcal{I}^{-1/2}_{\delta,\gamma}V]_{\delta-\delta_\star}\big\|_\infty>\sqrt{(m_1+1)\gamma\log(p)}\Big)\\ \leq\exp\Bigg[-\frac12\Big(\frac{v(s)^{1/2}}{\sigma}\sqrt{\tfrac12(m_1+1)\gamma np}-\sqrt{\mathrm{FP}}\Big)^2\Bigg]+2\exp\Big(\log(\mathrm{FP})-\frac{(m_1+1)v(s)\gamma n\log(p)}{2\sigma^2}\Big)\\ \leq e^{-\sqrt p}+2\,\mathrm{FP}\,e^{-\frac{m_1v(s)\gamma_0}{2\sigma^2}}\leq\frac{\zeta_0}{4},
\]
for all $p$ large enough, given the choice $m_1+1\geq\frac{2\sigma^2}{v(s)\gamma_0}\log\big(\frac{4\,\mathrm{FP}}{\zeta_0}\big)$. We combine this bound with (4.21), (4.20), and (4.19) to conclude that
\[ (4.22)\quad \mathbb{E}_\star\big[\|\nu_0P^K_\gamma-\Pi_\gamma(\cdot|Z)\|_{\mathrm{tv}}\big]\leq2\zeta_0+2\,\mathbb{P}_\star\Big[\big\|[\hat\theta_\delta(Z)]_{\delta-\delta_\star}\big\|_\infty>\sqrt{(m_1+1)\gamma\log(p)}\Big]. \]
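The Gaussian sup-norm probabilities handled above all follow the same pattern: a union bound over the coordinates combined with the one-dimensional tail bound $\mathbb{P}(|W_i|>t)\leq 2e^{-t^2/2}$. A minimal Monte Carlo sketch of this pattern (dimensions and threshold chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
p, t, reps = 50, 3.5, 20_000

# empirical P(max_i |W_i| > t) for W with p iid N(0,1) coordinates
W = rng.normal(size=(reps, p))
emp = np.mean(np.abs(W).max(axis=1) > t)

# union bound + Gaussian tail: P(||W||_inf > t) <= 2 p exp(-t^2/2)
bound = 2 * p * np.exp(-t**2 / 2)
assert emp <= bound
```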
We note that $\hat\theta_\delta(Z)$ is the ordinary least squares estimate in the regression model $Z=X_\delta\theta+\sigma E$, where $E\sim\mathbf{N}(0,I_n)$. It follows that $[\hat\theta_\delta(Z)]_{\delta-\delta_\star}\sim\mathbf{N}(0,\sigma^2Q)$, where
\[ Q=\Big(X'_{\delta-\delta_\star}\big(I_n-X_{\delta_\star}(X'_{\delta_\star}X_{\delta_\star})^{-1}X'_{\delta_\star}\big)X_{\delta-\delta_\star}\Big)^{-1}. \]
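The covariance $Q$ above is the Schur complement identity for block least squares: the sub-block of $(X'_\delta X_\delta)^{-1}$ corresponding to the coordinates in $\delta-\delta_\star$ equals $\big(X'_{\delta-\delta_\star}(I_n-P_{\delta_\star})X_{\delta-\delta_\star}\big)^{-1}$, with $P_{\delta_\star}$ the projector onto the columns of $X_{\delta_\star}$. A minimal numerical check (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, s_star, k = 40, 3, 4              # k plays the role of ||delta||_0 - s_star
X1 = rng.normal(size=(n, s_star))    # columns indexed by delta_star
X2 = rng.normal(size=(n, k))         # columns indexed by delta - delta_star
X = np.hstack([X1, X2])

P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)          # projector onto col(X1)
Q = np.linalg.inv(X2.T @ (np.eye(n) - P1) @ X2)     # Schur-complement inverse
full_inv = np.linalg.inv(X.T @ X)

# the lower-right block of (X'X)^{-1} equals Q
assert np.allclose(full_inv[s_star:, s_star:], Q)
```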
We also note that for any unit vector $u\in\mathbb{R}^{\|\delta\|_0-s_\star}$, we have
\[ u'Q^{-1}u\geq u'X'_{\delta-\delta_\star}X_{\delta-\delta_\star}u-\frac{1}{nv(s_\star)}\big\|X'_{\delta_\star}X_{\delta-\delta_\star}u\big\|_2^2\geq nv(s)-\frac{C(X)^2s_\star}{v(s_\star)}\geq\frac{nv(s)}{2}, \]
for all $p$ large enough, since $C(X)s_\star/\sqrt n=o(1)$. Therefore the largest eigenvalue of $Q$ is upper bounded by $2/(nv(s))$. As a result, and since $\gamma n\log(p)=\gamma_0$ and $\mathrm{FP}=O(1)$,
by Gaussian tail bounds (Lemma 25), we have
\[ \mathbb{P}_\star\Big(\big\|[\hat\theta_\delta(Z)]_{\delta-\delta_\star}\big\|_\infty>\sqrt{(m_1+1)\gamma\log(p)}\Big)\leq\exp\Big\{\log(\mathrm{FP})-\frac{(m_1+1)v(s)\gamma n\log(p)}{8\sigma^2}\Big\}\leq\frac{\zeta_0}{2}, \]
for all $p$ large enough. Hence (4.22) becomes
\[ (4.23)\quad \mathbb{E}_\star\big[\|\nu_0P^K_\gamma-\Pi_\gamma(\cdot|Z)\|_{\mathrm{tv}}\big]\leq3\zeta_0, \]
as claimed. To complete the proof, it remains to establish (4.17). To that end, we set $\mathrm{diff}(\theta)\overset{\mathrm{def}}{=}g_\gamma(\delta,\theta;z)-g_\gamma(\delta_\star,\theta;z)$. Since $\delta\supseteq\delta_\star$, we have
\[ 2\,\mathrm{diff}(\theta)=([\theta]_\delta-\hat\theta_\delta)'\mathcal{I}_{\gamma,\delta}([\theta]_\delta-\hat\theta_\delta)-([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})-\frac1\gamma[\theta]'_{\delta-\delta_\star}[\theta]_{\delta-\delta_\star}. \]
We split the term $([\theta]_\delta-\hat\theta_\delta)'\mathcal{I}_{\gamma,\delta}([\theta]_\delta-\hat\theta_\delta)-([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}([\theta]_{\delta_\star}-\hat\theta_{\delta_\star})$ as
\[ ([\theta]_\delta-[\theta_\star]_\delta)'\mathcal{I}_{\gamma,\delta}([\theta]_\delta-[\theta_\star]_\delta)-([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})\\ +2([\theta]_\delta-[\theta_\star]_\delta)'\mathcal{I}_{\gamma,\delta}([\theta_\star]_\delta-\hat\theta_\delta)-2([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}([\theta_\star]_{\delta_\star}-\hat\theta_{\delta_\star})\\ +(\hat\theta_\delta-[\theta_\star]_\delta)'\mathcal{I}_{\gamma,\delta}(\hat\theta_\delta-[\theta_\star]_\delta)-(\hat\theta_{\delta_\star}-[\theta_\star]_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}(\hat\theta_{\delta_\star}-[\theta_\star]_{\delta_\star}). \]
We calculate that
\[ ([\theta]_\delta-[\theta_\star]_\delta)'\mathcal{I}_{\gamma,\delta}([\theta]_\delta-[\theta_\star]_\delta)-([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})\\ =2[\theta-\theta_\star]'_{\delta_\star}\Big(\frac{1}{\sigma^2}X'_{\delta_\star}X_{\delta-\delta_\star}\Big)[\theta]_{\delta-\delta_\star}+[\theta]'_{\delta-\delta_\star}\Big(\frac{1}{\sigma^2}X'_{\delta-\delta_\star}X_{\delta-\delta_\star}\Big)[\theta]_{\delta-\delta_\star}. \]
Since $\hat\theta_\delta=(X'_\delta X_\delta)^{-1}X'_\delta z$, we get $\hat\theta_\delta-[\theta_\star]_\delta=(X'_\delta X_\delta)^{-1}X'_\delta(z-X\theta_\star)$ for all $\delta\supseteq\delta_\star$. We use this to calculate that
\[ ([\theta]_\delta-[\theta_\star]_\delta)'\mathcal{I}_{\gamma,\delta}([\theta_\star]_\delta-\hat\theta_\delta)-([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}([\theta_\star]_{\delta_\star}-\hat\theta_{\delta_\star})=-\frac{1}{\sigma^2}[\theta]'_{\delta-\delta_\star}X'_{\delta-\delta_\star}(z-X\theta_\star), \]
and
\[ (\hat\theta_\delta-[\theta_\star]_\delta)'\mathcal{I}_{\gamma,\delta}(\hat\theta_\delta-[\theta_\star]_\delta)-(\hat\theta_{\delta_\star}-[\theta_\star]_{\delta_\star})'\mathcal{I}_{\gamma,\delta_\star}(\hat\theta_{\delta_\star}-[\theta_\star]_{\delta_\star})\\ =\frac{1}{\sigma^2}(z-X\theta_\star)'X_{\delta-\delta_\star}(X'_{\delta-\delta_\star}X_{\delta-\delta_\star})^{-1}X'_{\delta-\delta_\star}(z-X\theta_\star). \]
All together, we have
\[ 2\,\mathrm{diff}(\theta)=\frac{1}{\sigma^2}(z-X\theta_\star)'X_{\delta-\delta_\star}(X'_{\delta-\delta_\star}X_{\delta-\delta_\star})^{-1}X'_{\delta-\delta_\star}(z-X\theta_\star)\\ +\frac{2}{\sigma^2}[\theta-\theta_\star]'_{\delta_\star}\big(X'_{\delta_\star}X_{\delta-\delta_\star}\big)[\theta]_{\delta-\delta_\star}-\frac{2}{\sigma^2}[\theta]'_{\delta-\delta_\star}X'_{\delta-\delta_\star}(z-X\theta_\star)\\ -\frac1\gamma[\theta]'_{\delta-\delta_\star}\Big(I_{\|\delta\|_0-s_\star}-\frac{\gamma}{\sigma^2}X'_{\delta-\delta_\star}X_{\delta-\delta_\star}\Big)[\theta]_{\delta-\delta_\star}. \]
For $\theta\in B^{(\delta_\star)}$ and $z\in\mathcal{E}_\rho$, we have
\[ \frac{1}{\sigma^2}(z-X\theta_\star)'X_{\delta-\delta_\star}(X'_{\delta-\delta_\star}X_{\delta-\delta_\star})^{-1}X'_{\delta-\delta_\star}(z-X\theta_\star)\leq\frac{8(m_0+1)\mathrm{FP}}{v(s)}\log(p). \]
We check that for $\theta\in B^{(\delta_\star)}$ and $z\in\mathcal{E}_\rho$,
\[ \Big|\frac{2}{\sigma^2}[\theta-\theta_\star]'_{\delta_\star}\big(X'_{\delta_\star}X_{\delta-\delta_\star}\big)[\theta]_{\delta-\delta_\star}\Big|\leq C\,\mathrm{FP}^{1/2}\,\frac{\mu_0s_\star}{\sqrt n\,\log(p)}\,\log(p)=o(\log(p)), \]
\[ \Big|\frac{2}{\sigma^2}[\theta]'_{\delta-\delta_\star}X'_{\delta-\delta_\star}(z-X\theta_\star)\Big|\leq\frac{2}{\sigma}\,\frac{\mathrm{FP}}{\sqrt{\log(p)}}\,\sqrt{8(m_0+1)(m+1)\log(p)}=o(\log(p)),
\]
and
\[ 0\leq\frac1\gamma[\theta]'_{\delta-\delta_\star}\Big(I_{\|\delta\|_0-s_\star}-\frac{\gamma}{\sigma^2}X'_{\delta-\delta_\star}X_{\delta-\delta_\star}\Big)[\theta]_{\delta-\delta_\star}\leq4(m_1+1)\log(p). \]
We easily deduce (4.17), and this completes the proof of the theorem. $\square$
5. Proof of Theorem 5.
5.1. Some preliminary results. Throughout the proof, $\theta_\star$ is the true value of the parameter as introduced in H2, $\delta_\star$ denotes its sparsity structure, and $s_\star=\|\theta_\star\|_0$. The first lemma is adapted from Lemma 2 of [2] and gives an approximation of the function $h_\gamma(\delta,\theta;z)$ for small $\gamma$.
Lemma 19. Assume H1, and let $h_\gamma$ be as in (1.4). For all $\delta\in\Delta$, $\gamma>0$, $u\in\mathbb{R}^p$, and $z\in\mathcal{Z}$, we have
\[ -\ell(u_\delta;z)+\rho\|u_\delta\|_1+\frac{1}{2\gamma}\|u-u_\delta\|_2^2-\frac12(u-u_\delta)'\underline S(u-u_\delta)\geq h_\gamma(\delta,u;z)\\ \geq-\ell(u_\delta;z)+\rho\|u_\delta\|_1+\frac{1}{2\gamma}\|u-u_\delta\|_2^2-\frac12(u-u_\delta)'S(u-u_\delta)-R_\gamma(\delta,u;z), \]
where $R_\gamma(\delta,u;z)$ satisfies
\[ 0\leq R_\gamma(\delta,u;z)\leq\frac\gamma2\big\|\delta\cdot\nabla\ell(u;z)-\rho\,\mathrm{sign}(u_\delta)\big\|_2^2. \]
Proof. Using the definition of $h_\gamma$ in (1.4) and the definitions of $\underline S$ and $S$ in H1, we have
\[ h_\gamma(\delta,u;z)\leq-\ell(u;z)-\langle\nabla\ell(u;z),u_\delta-u\rangle+\frac{1}{2\gamma}\|u-u_\delta\|_2^2+\rho\|u_\delta\|_1\\ \leq-\ell(u_\delta;z)+\rho\|u_\delta\|_1-\frac12(u-u_\delta)'\underline S(u-u_\delta)+\frac{1}{2\gamma}\|u-u_\delta\|_2^2, \]
which is the first inequality. To prove the second, we note that for any $v\in\mathbb{R}^p_\delta$,
\[ -\ell(u;z)-\langle\nabla\ell(u;z),v-u\rangle=-\ell(u_\delta;z)-\langle\nabla\ell(u;z),v-u_\delta\rangle+\mathcal{F}(u|u_\delta), \]
where $\mathcal{F}(u|u_\delta)\overset{\mathrm{def}}{=}-\big[\ell(u;z)-\ell(u_\delta;z)-\langle\nabla\ell(u_\delta;z),u-u_\delta\rangle\big]+\langle\nabla\ell(u;z)-\nabla\ell(u_\delta;z),u-u_\delta\rangle$. By a Taylor expansion with integral remainder and H1, we have
\[ \mathcal{F}(u|u_\delta)=(u-u_\delta)'\Big[\int_0^1t\,\nabla^{(2)}\ell\big(u_\delta+t(u-u_\delta)\big)\,dt\Big](u-u_\delta)\geq-\frac12(u-u_\delta)'S(u-u_\delta). \]
Hence, for all $u\in\mathbb{R}^p$ and all $v\in\mathbb{R}^p_\delta$,
\[ -\ell(u;z)-\langle\nabla\ell(u;z),v-u\rangle\geq-\ell(u_\delta;z)-\langle\nabla\ell(u;z),v-u_\delta\rangle-\frac12(u-u_\delta)'S(u-u_\delta). \]
By convexity of the $\ell_1$-norm, $\|v\|_1\geq\|u_\delta\|_1+\langle\mathrm{sign}(u_\delta),v-u_\delta\rangle$. We combine the last two inequalities to conclude that
\[ -\ell(u;z)-\langle\nabla\ell(u;z),v-u\rangle+\rho\|v\|_1+\frac{1}{2\gamma}\|v-u\|_2^2\geq-\ell(u_\delta;z)+\rho\|u_\delta\|_1\\ +\big\langle\rho\,\mathrm{sign}(u_\delta)-\delta\cdot\nabla\ell(u;z),v-u_\delta\big\rangle+\frac{1}{2\gamma}\|v-u_\delta\|_2^2+\frac{1}{2\gamma}\|u-u_\delta\|_2^2-\frac12(u-u_\delta)'S(u-u_\delta). \]
The second inequality follows by noting that $\big\langle\rho\,\mathrm{sign}(u_\delta)-\delta\cdot\nabla\ell(u;z),v-u_\delta\big\rangle+\frac{1}{2\gamma}\|v-u_\delta\|_2^2\geq-\frac\gamma2\big\|\rho\,\mathrm{sign}(u_\delta)-\delta\cdot\nabla\ell(u;z)\big\|_2^2$. $\square$
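The last step of the proof is the completing-the-square inequality $\langle w,v\rangle+\frac{1}{2\gamma}\|v\|_2^2\geq-\frac\gamma2\|w\|_2^2$, with equality at the minimizer $v=-\gamma w$. A minimal numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 0.3
w = rng.normal(size=(1000, 5))
v = rng.normal(size=(1000, 5))

# <w, v> + ||v||^2 / (2*gamma) >= -gamma * ||w||^2 / 2, for every pair (w, v)
lhs = np.sum(w * v, axis=1) + np.sum(v * v, axis=1) / (2 * gamma)
rhs = -gamma * np.sum(w * w, axis=1) / 2
assert np.all(lhs >= rhs - 1e-12)

# equality at the minimizer v = -gamma * w
v_star = -gamma * w
lhs_star = np.sum(w * v_star, axis=1) + np.sum(v_star**2, axis=1) / (2 * gamma)
assert np.allclose(lhs_star, rhs)
```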
The next result gives a lower bound on the normalizing constant of $\Pi_\gamma$.

Lemma 20. Assume H1–H2. For $\gamma>0$ and $z\in\mathcal{Z}$, let $C_\gamma(z)$ denote the normalizing constant of $\Pi_\gamma(\cdot|z)$. If $\gamma\lambda_{\max}(S)<1$, then we have
\[ (5.1)\quad \sqrt{\det\big([I_p-\gamma\underline S]_{\delta_\star^c}\big)}\,(2\pi\gamma)^{-\frac p2}\,C_\gamma(z)\geq\omega_{\delta_\star}\,e^{\ell(\theta_\star;z)}\,e^{-\rho\|\theta_\star\|_1}\Big(\frac{\rho^2}{\kappa(s_\star)+\rho^2}\Big)^{s_\star}, \]
where for a matrix $A\in\mathbb{R}^{p\times p}$ and $\delta\in\Delta$, the notation $[A]_{\delta^c}$ denotes the sub-matrix of $A$ obtained by removing the rows and columns of $A$ for which $\delta_j=1$.
Proof. By definition, we have
\[ C_\gamma(z)=\sum_{\delta\in\Delta}\omega_\delta(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\Big(\frac\rho2\Big)^{\|\delta\|_0}\int_{\mathbb{R}^p}e^{-h_\gamma(\delta,u;z)}\,du\geq\omega_{\delta_\star}(2\pi\gamma)^{\frac{\|\delta_\star\|_0}{2}}\Big(\frac\rho2\Big)^{\|\delta_\star\|_0}\int_{\mathbb{R}^p}e^{-h_\gamma(\delta_\star,u;z)}\,du. \]
By the first inequality of Lemma 19,
\[ C_\gamma(z)\geq\omega_{\delta_\star}(2\pi\gamma)^{\frac{\|\delta_\star\|_0}{2}}\Big(\frac\rho2\Big)^{\|\delta_\star\|_0}\int_{\mathbb{R}^p}e^{\ell(u_{\delta_\star};z)}e^{-\rho\|u_{\delta_\star}\|_1}e^{-\frac{1}{2\gamma}(u-u_{\delta_\star})'(I_p-\gamma\underline S)(u-u_{\delta_\star})}\,du. \]
The integrand in the last display is a multiplicatively separable function of $[u]_{\delta_\star}$ and $[u]_{\delta_\star^c}$. Integrating out $[u]_{\delta_\star^c}$ then yields
\[ C_\gamma(z)\geq\frac{(2\pi\gamma)^{\frac p2}}{\sqrt{\det([I_p-\gamma\underline S]_{\delta_\star^c})}}\,\omega_{\delta_\star}\Big(\frac\rho2\Big)^{\|\delta_\star\|_0}\int_{\mathbb{R}^p}e^{\ell(u;z)}e^{-\rho\|u\|_1}\,\mu_{\delta_\star}(du). \]
The lower bound on the term $\omega_{\delta_\star}\big(\frac\rho2\big)^{\|\delta_\star\|_0}\int_{\mathbb{R}^p}e^{\ell(u;z)}e^{-\rho\|u\|_1}\mu_{\delta_\star}(du)$ established in Lemma 11 of [1] is then employed to deduce (5.1). $\square$
Lemma 21. Assume H1–H2. Let $\bar\rho>0$, $\rho\in(0,\bar\rho]$, and $\gamma>0$ be such that $4\gamma\lambda_{\max}(S)\leq1$. If for all $\delta\in\Delta$ and $\theta\in\mathbb{R}^p$,
\[ (5.2)\quad \log\mathbb{E}_\star\Big[e^{(1-\frac{\rho}{\bar\rho})\langle\nabla\ell(\theta_\star;Z),\theta-\theta_\star\rangle+L_\gamma(\delta,\theta;Z)}\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\Big]\leq-\frac12r_0(\|\theta-\theta_\star\|_2)\,\mathbf{1}_{\{\|[\theta]_{\delta_\star^c}\|_1\leq7\|[\theta-\theta_\star]_{\delta_\star}\|_1\}}(\theta), \]
for some rate function $r_0$, then $\Pi_\gamma(\cdot|z)$ is well-defined for $\mathbb{P}_\star$-almost all $z\in\mathcal{E}_\rho$. Furthermore,
\[ \mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)\,\Pi_\gamma(\|\delta\|_0>s_\star+\eta\,|\,Z)\big]\leq\frac{1}{p^{m_0}}, \]
for all $m_0\geq1$, where
\[ (5.3)\quad \eta\overset{\mathrm{def}}{=}\frac2u\Bigg[m_0+2s_\star+\frac{a_0}{2\log(p)}+s_\star\frac{\log\big(1+\frac{\kappa(s_\star)}{\rho^2}\big)}{\log(p)}+\frac{\gamma}{2\log(p)}\mathrm{Tr}(S-\underline S)+\frac{2\gamma^2}{\log(p)}\big(\lambda_{\max}(S)\mathrm{Tr}(S)+4\|S\|_F^2\big)\Bigg], \]
and
\[ a_0\overset{\mathrm{def}}{=}-\min_{x>0}\big[r_0(x)-4\rho s_\star^{1/2}x\big]. \]
Proof. That $\Pi_\gamma(\cdot|z)$ is well-defined is equivalent to the statement that its normalizing constant, denoted $C_\gamma(z)$, is finite. Hence it suffices to establish that $C_\gamma(z)$ is finite for $\mathbb{P}_\star$-almost all $z\in\mathcal{E}_\rho$. By using the second inequality of Lemma 19, we have
\[ \frac{(2\pi\gamma)^{-\frac p2}C_\gamma(z)}{e^{\ell(\theta_\star;z)-\rho\|\theta_\star\|_1}}=\sum_{\delta\in\Delta}\omega_\delta\Big(\frac\rho2\Big)^{\|\delta\|_0}\Big(\frac{1}{2\pi\gamma}\Big)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^p}\frac{e^{-h_\gamma(\delta,\theta;z)}}{e^{\ell(\theta_\star;z)-\rho\|\theta_\star\|_1}}\,d\theta\\ \leq\sum_{\delta\in\Delta}\omega_\delta\Big(\frac\rho2\Big)^{\|\delta\|_0}\Big(\frac{1}{2\pi\gamma}\Big)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^p}\frac{e^{\ell(\theta_\delta;z)}e^{-\rho\|\theta_\delta\|_1}}{e^{\ell(\theta_\star;z)}e^{-\rho\|\theta_\star\|_1}}\,e^{-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)+R_\gamma(\delta,\theta;z)}\,d\theta. \]
For $z\in\mathcal{E}_\rho$ and $\delta\in\Delta$, we use the convexity of the squared norm and H1 to bound the term $R_\gamma$ (given in Lemma 19) as
\[ R_\gamma(\delta,\theta;z)\leq\frac\gamma2\big\|\delta\cdot\nabla\ell(\theta;z)-\delta\cdot\nabla\ell(\theta_\delta;z)+\delta\cdot\nabla\ell(\theta_\delta;z)-\delta\cdot\nabla\ell(\theta_\star;z)+\delta\cdot\nabla\ell(\theta_\star;z)-\rho\,\mathrm{sign}(\theta_\delta)\big\|_2^2\\ \leq2\gamma\kappa(\|\delta\|_0)\,(\theta-\theta_\delta)'S(\theta-\theta_\delta)+2\gamma\big\|\delta\cdot\nabla\ell(\theta_\delta;z)-\delta\cdot\nabla\ell(\theta_\star;z)\big\|_2^2+3\gamma\rho^2\|\delta\|_0, \]
where the second inequality uses H1, the definition of $\kappa(\|\delta\|_0)$, and the fact that $\|\nabla\ell(\theta_\star;z)\|_\infty\leq\rho/2$ for $z\in\mathcal{E}_\rho$. It follows that
\[ (5.4)\quad -\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)+R_\gamma(\delta,\theta;z)\leq-\frac{1}{2\gamma}(\theta-\theta_\delta)'A_\delta(\theta-\theta_\delta)\\ +2\gamma\big\|\delta\cdot\nabla\ell(\theta_\delta;z)-\delta\cdot\nabla\ell(\theta_\star;z)\big\|_2^2+3\gamma\rho^2\|\delta\|_0, \]
where $A_\delta\overset{\mathrm{def}}{=}I_p-\gamma(1+4\gamma\kappa(\|\delta\|_0))S$. If $\delta=(1,\ldots,1)$, (5.4) still holds with $A_\delta=0$. Note also that if $\delta\neq(1,\ldots,1)$, the matrix $A_\delta$ is positive definite under the assumption $4\gamma\lambda_{\max}(S)\leq1$. We recall the notation $L_\gamma(\delta,\theta;z)$ introduced in (2.2) of the main manuscript, and use it to write
\[ (5.5)\quad \ell(\theta_\delta;z)-\ell(\theta_\star;z)+2\gamma\big\|\delta\cdot\nabla\ell(\theta_\delta;z)-\delta\cdot\nabla\ell(\theta_\star;z)\big\|_2^2=\langle\nabla\ell(\theta_\star;z),\theta_\delta-\theta_\star\rangle+L_\gamma(\delta,\theta_\delta;z)\\ =\frac{\rho}{\bar\rho}\langle\nabla\ell(\theta_\star;z),\theta_\delta-\theta_\star\rangle+\Big(1-\frac{\rho}{\bar\rho}\Big)\langle\nabla\ell(\theta_\star;z),\theta_\delta-\theta_\star\rangle+L_\gamma(\delta,\theta_\delta;z). \]
Since $\|\nabla\ell(\theta_\star;z)\|_\infty\leq\rho/2$ for $z\in\mathcal{E}_\rho$, we have $\frac{\rho}{\bar\rho}\big|\langle\nabla\ell(\theta_\star;z),\theta_\delta-\theta_\star\rangle\big|\leq(\rho/2)\|\theta_\delta-\theta_\star\|_1$. Using this, and accounting for all the terms, we obtain
\[ \frac{(2\pi\gamma)^{-\frac p2}C_\gamma(z)}{e^{\ell(\theta_\star;z)-\rho\|\theta_\star\|_1}}\,\mathbf{1}_{\mathcal{E}_\rho}(z)\leq\sum_{\delta\in\Delta}\omega_\delta\Big(\frac\rho2\Big)^{\|\delta\|_0}\Big(\frac{1}{2\pi\gamma}\Big)^{\frac{p-\|\delta\|_0}{2}}\\ \times e^{3\gamma\|\delta\|_0\rho^2}\,\mathbf{1}_{\mathcal{E}_\rho}(z)\int_{\mathbb{R}^p}e^{d(\theta_\delta)+(1-\frac{\rho}{\bar\rho})\langle\nabla\ell(\theta_\star;z),\theta_\delta-\theta_\star\rangle+L_\gamma(\delta,\theta_\delta;z)}\,e^{-\frac{1}{2\gamma}(\theta-\theta_\delta)'A_\delta(\theta-\theta_\delta)}\,d\theta, \]
where $d(\theta)\overset{\mathrm{def}}{=}\frac\rho2\|\theta-\theta_\star\|_1-\rho\big(\|\theta\|_1-\|\theta_\star\|_1\big)$. Taking the expectation on both sides and using Fubini's theorem and (5.2) gives
\[ \mathbb{E}_\star\Bigg[\frac{(2\pi\gamma)^{-\frac p2}C_\gamma(Z)}{e^{\ell(\theta_\star;Z)-\rho\|\theta_\star\|_1}}\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\Bigg]\leq\sum_{\delta\in\Delta}\omega_\delta\Big(\frac\rho2\Big)^{\|\delta\|_0}\Big(\frac{1}{2\pi\gamma}\Big)^{\frac{p-\|\delta\|_0}{2}}e^{3\gamma\|\delta\|_0\rho^2}\int_{\mathbb{R}^p}e^{d(\theta_\delta)}\,e^{-\frac{1}{2\gamma}(\theta-\theta_\delta)'A_\delta(\theta-\theta_\delta)}\,d\theta.
\]
All the integrals on the right-hand side of the last inequality are finite, since the matrices $A_\delta$ are symmetric positive definite and $d(\theta)\sim-\frac\rho2\|\theta\|_1$ for $\|\theta\|_1$ large. Hence, for $\mathbb{P}_\star$-almost all $z\in\mathcal{E}_\rho$, $C_\gamma(z)$ is finite, as claimed.
To establish the second part of the lemma, we first note that for any $z\in\mathcal{Z}$ and any measurable subset $B$ of $\Delta\times\mathbb{R}^p$, using (5.1), the second inequality of Lemma 19, (5.4), and similar calculations as above, we get
\[ (5.6)\quad \Pi_\gamma(B|z)\leq\sqrt{\det\big([I_p-\gamma\underline S]_{\delta_\star^c}\big)}\Big(1+\frac{\kappa(s_\star)}{\rho^2}\Big)^{s_\star}\frac{1}{\omega_{\delta_\star}}\sum_{\delta\in\Delta}\omega_\delta\Big(\frac{1}{2\pi\gamma}\Big)^{\frac{p-\|\delta\|_0}{2}}\Big(\frac\rho2\Big)^{\|\delta\|_0}\\ \times e^{3\gamma\rho^2\|\delta\|_0}\int_{B(\delta)}\frac{e^{-\rho\|\theta_\delta\|_1}}{e^{-\rho\|\theta_\star\|_1}}\,\frac{e^{\ell(\theta_\delta;z)}}{e^{\ell(\theta_\star;z)}}\,e^{2\gamma\|\delta\cdot\nabla\ell(\theta_\delta;z)-\delta\cdot\nabla\ell(\theta_\star;z)\|_2^2}\,e^{-\frac{1}{2\gamma}(\theta-\theta_\delta)'A_\delta(\theta-\theta_\delta)}\,d\theta, \]
where $B(\delta)\overset{\mathrm{def}}{=}\{\theta\in\mathbb{R}^p:\ (\delta,\theta)\in B\}$. For some arbitrary $\eta>0$, let $A_1^c\overset{\mathrm{def}}{=}\{\delta\in\Delta:\ \|\delta\|_0>s_\star+\eta\}$, and $\mathcal{A}_1^c\overset{\mathrm{def}}{=}A_1^c\times\mathbb{R}^p$. It follows from the last display (applied to $B=\mathcal{A}_1^c$), together with (5.5) and Fubini's theorem, that
\[ (5.7)\quad \mathbb{E}_\star\big[\Pi_\gamma(\mathcal{A}_1^c|Z)\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\big]\leq\sqrt{\det\big([I_p-\gamma\underline S]_{\delta_\star^c}\big)}\Big(1+\frac{\kappa(s_\star)}{\rho^2}\Big)^{s_\star}\sum_{\delta\in A_1^c}\frac{\omega_\delta}{\omega_{\delta_\star}}\Big(\frac\rho2\Big)^{\|\delta\|_0}\\ \times\frac{e^{3\gamma\rho^2\|\delta\|_0}}{\sqrt{\det([A_\delta]_{\delta^c})}}\int_{\mathbb{R}^p}e^{d(\theta)}\,\mathbb{E}_\star\Big[e^{(1-\frac{\rho}{\bar\rho})\langle\nabla\ell(\theta_\star;Z),\theta_\delta-\theta_\star\rangle+L_\gamma(\delta,\theta_\delta;Z)}\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\Big]\,\mu_\delta(d\theta), \]
where $\det([A_\delta]_{\delta^c})$ is taken to be $1$ if $\delta\equiv1$. We claim that for all $\theta\in\mathbb{R}^p$,
\[ (5.8)\quad d(\theta)+\log\mathbb{E}_\star\Big[e^{(1-\frac{\rho}{\bar\rho})\langle\nabla\ell(\theta_\star;Z),\theta_\delta-\theta_\star\rangle+L_\gamma(\delta,\theta_\delta;Z)}\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\Big]\leq-\frac\rho4\|\theta-\theta_\star\|_1+\frac{a_0}{2}, \]
where $a_0=-\min_{x>0}\big[r_0(x)-4\rho s_\star^{1/2}x\big]>0$. Obviously this claim is true if $\theta=\theta_\star$.
Suppose now that $\theta\neq\theta_\star$. First, using the notation $\delta^c\overset{\mathrm{def}}{=}1-\delta$, we note that
\[ d(\theta)=\frac\rho2\|\delta_\star\cdot(\theta-\theta_\star)\|_1+\frac\rho2\|\delta_\star^c\cdot\theta\|_1-\rho\|\delta_\star\cdot\theta\|_1-\rho\|\delta_\star^c\cdot\theta\|_1+\rho\|\theta_\star\|_1\\ \leq-\frac\rho2\|\delta_\star^c\cdot(\theta-\theta_\star)\|_1+\frac{3\rho}{2}\|\delta_\star\cdot(\theta-\theta_\star)\|_1. \]
If $\|\delta_\star^c\cdot(\theta-\theta_\star)\|_1>7\|\delta_\star\cdot(\theta-\theta_\star)\|_1$, then
\[ d(\theta)\leq-\frac\rho4\|\delta_\star^c\cdot(\theta-\theta_\star)\|_1-\frac{7\rho}{4}\|\delta_\star\cdot(\theta-\theta_\star)\|_1+\frac{3\rho}{2}\|\delta_\star\cdot(\theta-\theta_\star)\|_1\leq-\frac\rho4\|\theta-\theta_\star\|_1. \]
This bound, together with (5.2), shows that the claim (5.8) holds when $\|\delta_\star^c\cdot(\theta-\theta_\star)\|_1>7\|\delta_\star\cdot(\theta-\theta_\star)\|_1$. Now, if $\|\delta_\star^c\cdot(\theta-\theta_\star)\|_1\leq7\|\delta_\star\cdot(\theta-\theta_\star)\|_1$, then again by (5.2), the left-hand side of (5.8) is upper bounded by
\[ -\frac\rho2\|\delta_\star^c\cdot(\theta-\theta_\star)\|_1+\frac{3\rho}{2}\|\delta_\star\cdot(\theta-\theta_\star)\|_1-\frac12r_0(\|\theta-\theta_\star\|_2)\\ \leq-\frac\rho2\|\theta-\theta_\star\|_1-\frac12\Big[r_0(\|\theta-\theta_\star\|_2)-4\rho s_\star^{1/2}\|\theta-\theta_\star\|_2\Big]\leq-\frac\rho2\|\theta-\theta_\star\|_1+\frac{a_0}{2}, \]
which also gives (5.8).
which also gives (5.8). We can then use (5.8) to deduce that
(5.9)
∫Rped(θ)E?
[e
(1− ρ
ρ
)〈∇`(θ?;z),θδ−θ?〉+Lγ(δ,θδ;z)1Eρ(Z)
]µδ(dθ)
≤ ea0/2
∫Rpe−
ρ4‖θ−θ?‖1µδ(dθ) ≤ ea0/2
(8
ρ
)‖δ‖0,
and using this in (5.7) we conclude that
\[ \mathbb{E}_\star\big[\Pi_\gamma(\mathcal{A}_1^c|Z)\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\big]\leq e^{a_0/2}\Big(1+\frac{\kappa(s_\star)}{\rho^2}\Big)^{s_\star}\sum_{\delta\in A_1^c}\frac{\omega_\delta}{\omega_{\delta_\star}}\,4^{\|\delta\|_0}\,e^{3\gamma\|\delta\|_0\rho^2}\,\frac{\sqrt{\det\big([I_p-\gamma\underline S]_{\delta_\star^c}\big)}}{\sqrt{\det([A_\delta]_{\delta^c})}}. \]
We claim that for all $\delta\in\Delta$,
\[ (5.10)\quad \frac{\sqrt{\det\big(I_{p-s_\star}-\gamma[\underline S]_{\delta_\star^c}\big)}}{\sqrt{\det([A_\delta]_{\delta^c})}}\leq\exp\Big(s_\star+\frac\gamma2\mathrm{Tr}(S-\underline S)+2\gamma^2\big(\kappa(\|\delta\|_0)\mathrm{Tr}(S)+4\|S\|_F^2\big)\Big). \]
To show this, we write
\[ \frac{\sqrt{\det\big(I_{p-s_\star}-\gamma[\underline S]_{\delta_\star^c}\big)}}{\sqrt{\det([A_\delta]_{\delta^c})}}=\frac{\sqrt{\det\big(I_{p-s_\star}-\gamma[\underline S]_{\delta_\star^c}\big)}}{\sqrt{\det\big(I_{p-\|\delta\|_0}-\gamma[\underline S]_{\delta^c}\big)}}\times\frac{\sqrt{\det\big(I_{p-\|\delta\|_0}-\gamma[\underline S]_{\delta^c}\big)}}{\sqrt{\det([A_\delta]_{\delta^c})}}. \]
The first term on the right-hand side of the last display can be further written as
\[ \frac{\sqrt{\det\big([I_p-\gamma\underline S]_{\delta_\star^c}\big)}}{\sqrt{\det(I_p-\gamma\underline S)}}\times\frac{\sqrt{\det(I_p-\gamma\underline S)}}{\sqrt{\det\big([I_p-\gamma\underline S]_{\delta^c}\big)}}\leq\Big(\frac43\Big)^{\frac{s_\star}{2}},
\]
where the inequality follows from Lemma 26 and the fact that all the eigenvalues of the matrix $I_p-\gamma\underline S$ are between $3/4$ and $1$. If $\underline\lambda_j$ (resp. $\lambda_j$) denote the eigenvalues of $[\underline S]_{\delta^c}$ (resp. $[S]_{\delta^c}$), we have
\[ \frac{\sqrt{\det\big(I_{p-\|\delta\|_0}-\gamma[\underline S]_{\delta^c}\big)}}{\sqrt{\det([A_\delta]_{\delta^c})}}=\exp\Bigg[\frac12\sum_{j=1}^{p-\|\delta\|_0}\Big(\log(1-\gamma\underline\lambda_j)-\log\big(1-\gamma(1+4\gamma\kappa(\|\delta\|_0))\lambda_j\big)\Big)\Bigg]. \]
Since $4\gamma\lambda_{\max}(S)\leq1$, we have $\gamma\lambda_j\leq1/4$ and $\gamma(1+4\gamma\kappa(\|\delta\|_0))\lambda_j\leq1/2$ for all $1\leq j\leq p-\|\delta\|_0$. Furthermore, the function $\log$ satisfies $\log(1-x)\leq-x$, and $\log(1-x)\geq-x-4x^2$ for $x\in[0,1/2]$. We deduce that
\[ (5.11)\quad \frac{\sqrt{\det\big(I_{p-\|\delta\|_0}-\gamma[\underline S]_{\delta^c}\big)}}{\sqrt{\det([A_\delta]_{\delta^c})}}\leq e^{\frac\gamma2\mathrm{Tr}(S-\underline S)+2\gamma^2(\kappa(\|\delta\|_0)\mathrm{Tr}(S)+4\|S\|_F^2)}. \]
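The two elementary logarithm inequalities invoked above, $\log(1-x)\leq-x$ and $\log(1-x)\geq-x-4x^2$ on $[0,1/2]$, can be checked numerically on a fine grid; a minimal sketch:

```python
import numpy as np

x = np.linspace(0.0, 0.5, 10_001)
val = np.log1p(-x)          # log(1 - x), numerically accurate near 0
upper = -x                  # log(1 - x) <= -x
lower = -x - 4 * x**2       # log(1 - x) >= -x - 4 x^2 on [0, 1/2]
assert np.all(val <= upper + 1e-15)
assert np.all(val >= lower - 1e-15)
```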
These last two results together establish (5.10). On the other hand, using H3, we have
\[ \sum_{\delta\in A_1^c}\frac{\omega_\delta}{\omega_{\delta_\star}}\,4^{\|\delta\|_0}\,e^{3\gamma\|\delta\|_0\rho^2}=\sum_{j=s_\star+\eta+1}^{p}\binom pj\Big(\frac{q}{1-q}\Big)^{j-s_\star}\big(4e^{3\gamma\rho^2}\big)^{j}\leq\binom{p}{s_\star}\big(4e^{3\gamma\rho^2}\big)^{s_\star}\sum_{j=s_\star+\eta+1}^{p}\Big(\frac{8e^{3\gamma\rho^2}}{p^u}\Big)^{j-s_\star}, \]
using the facts that $\frac{q}{1-q}=\frac{1}{p^{u+1}-1}\leq\frac{2}{p^{u+1}}$ for $p\geq2$, and $\binom pj\leq p^{j-s_\star}\binom{p}{s_\star}$. Hence, for $p$ large enough so that $\frac{8e^{3\gamma\rho^2}}{p^u}\leq\frac12$, we get
\[ \sum_{\delta\in A_1^c}\frac{\omega_\delta}{\omega_{\delta_\star}}\,4^{\|\delta\|_0}\,e^{3\gamma\|\delta\|_0\rho^2}\leq\binom{p}{s_\star}\big(4e^{3\gamma\rho^2}\big)^{s_\star}\Big(\frac{8e^{3\gamma\rho^2}}{p^u}\Big)^{\eta}\leq e^{\frac32s_\star\log(p)-\frac{u\eta}{2}\log(p)}, \]
for all $p$ large enough, where we again use the assumption that $\gamma\rho^2=o(\log(p))$, and $\log\binom{p}{s_\star}\leq s_\star\log(p)$. Hence we conclude that
\[ \mathbb{E}_\star\big[\Pi_\gamma(\mathcal{A}_1^c|Z)\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\big]\leq e^{\frac{a_0}{2}}\Big(1+\frac{\kappa(s_\star)}{\rho^2}\Big)^{s_\star}e^{2s_\star\log(p)-\frac u2\eta\log(p)}\\ \times\exp\Big(\frac\gamma2\mathrm{Tr}(S-\underline S)+2\gamma^2\big(\lambda_{\max}(S)\mathrm{Tr}(S)+4\|S\|_F^2\big)\Big)\leq\frac{1}{p^{m_0}}, \]
by choosing $\eta$ as in the statement of the lemma. Hence the result. $\square$
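Two of the elementary facts used in the prior-odds computation above — the identity $\frac{q}{1-q}=\frac{1}{p^{u+1}-1}\leq\frac{2}{p^{u+1}}$ for $p\geq2$, and the geometric-series tail bound $\sum_{j\geq\eta+1}x^j\leq2x^{\eta+1}$ for $x\leq1/2$ — can be checked numerically; a minimal sketch:

```python
import numpy as np

# q/(1-q) = 1/(p^{u+1} - 1) <= 2/p^{u+1} whenever p^{u+1} >= 2
for p in [2, 5, 100]:
    for u in [0.0, 0.5, 2.0]:
        q = p ** -(u + 1.0)           # prior inclusion probability in H3
        assert abs(q / (1 - q) - 1.0 / (p**(u + 1) - 1)) < 1e-12
        assert 1.0 / (p**(u + 1) - 1) <= 2.0 / p**(u + 1)

# tail of a geometric series: sum_{j >= eta+1} x^j <= 2 x^{eta+1} for x <= 1/2
x, eta = 0.5, 7
tail = sum(x**j for j in range(eta + 1, 200))
assert tail <= 2 * x**(eta + 1)
```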
For linear regression models, the previous lemma takes a slightly sharper form.

Lemma 22. In the particular case of the linear regression model, for all $\rho>0$, all $\gamma>0$ such that $\frac{4\gamma}{\sigma^2}\lambda_{\max}(X'X)\leq1$, and all $z\in\mathbb{R}^n$, $\Pi_\gamma(\cdot|z)$ is well-defined.
Proof. Here $S=(1/\sigma^2)X'X$, and as above we have
\[ (2\pi\gamma)^{-\frac p2}C_\gamma(z)\leq\sum_{\delta\in\Delta}\omega_\delta\Big(\frac\rho2\Big)^{\|\delta\|_0}\Big(\frac{1}{2\pi\gamma}\Big)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^p}e^{\ell(\theta_\delta;z)}e^{-\rho\|\theta_\delta\|_1}e^{-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)+R_\gamma(\delta,\theta;z)}\,d\theta. \]
With a similar argument, we bound
\[ R_\gamma(\delta,\theta;z)\leq\frac32\gamma\rho^2\|\delta\|_0+\frac{3\gamma}{\sigma^2}\lambda_{\max}(X'X)\,\frac{1}{2\sigma^2}\|z-X\theta_\delta\|_2^2+\frac{3\gamma}{2\sigma^2}\lambda_{\max}(X'X)\,\frac{1}{\sigma^2}(\theta-\theta_\delta)'(X'X)(\theta-\theta_\delta). \]
Therefore, if $\frac{4\gamma}{\sigma^2}\lambda_{\max}(X'X)\leq1$, then
\[ \ell(\theta_\delta;z)-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)+R_\gamma(\delta,\theta;z)\leq\frac32\gamma\rho^2\|\delta\|_0+\frac14\ell(\theta_\delta;z)-\frac{1}{2\gamma}(\theta-\theta_\delta)'M_\delta(\theta-\theta_\delta), \]
where $M_\delta\overset{\mathrm{def}}{=}I_p-\gamma\big(1+\frac{3\gamma}{\sigma^2}\lambda_{\max}(X'X)\big)S$. And since $\ell(\theta_\delta;z)=-\frac{1}{2\sigma^2}\|z-X\theta_\delta\|_2^2\leq0$, we get
\[ (2\pi\gamma)^{-\frac p2}C_\gamma(z)\leq\sum_{\delta\in\Delta}\omega_\delta\Big(\frac\rho2\Big)^{\|\delta\|_0}\Big(\frac{1}{2\pi\gamma}\Big)^{\frac{p-\|\delta\|_0}{2}}e^{\frac32\gamma\rho^2\|\delta\|_0}\int_{\mathbb{R}^p}e^{-\rho\|\theta_\delta\|_1}e^{-\frac{1}{2\gamma}(\theta-\theta_\delta)'M_\delta(\theta-\theta_\delta)}\,d\theta. \]
All the integrals on the right-hand side of the last display are finite since $M_\delta$ is positive definite. This proves the lemma. $\square$
The proof of Theorem 5 is based on classical testing arguments, for which we need the following result. This lemma slightly extends Lemma 14 of [1].
Lemma 23. Assume H2 and H4. Define
\[ \epsilon\overset{\mathrm{def}}{=}\inf\big\{x>0:\ r_1(z)\geq3\rho(s_\star+s)^{1/2}z,\ \text{for all }z\geq x\big\}. \]
If $\epsilon<\infty$, then for any $M>2$ there exists a measurable function $\phi:\mathcal{Z}\to[0,1]$ such that
\[ \mathbb{E}_\star(\phi(Z))\leq\binom ps\,9^s\sum_{j\geq1}e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)}. \]
Furthermore, for all $\delta\in\Delta$ such that $\|\delta\|_0\leq s$, and all $\theta\in\mathbb{R}^p_\delta$ such that $\|\theta-\theta_\star\|_2>jM\epsilon$ for some $j\geq1$, we have
\[ \int_{\mathcal{E}_\rho}(1-\phi(z))\,e^{2\gamma\|\delta\cdot\nabla\ell(\theta;z)-\delta\cdot\nabla\ell(\theta_\star;z)\|_2^2}\,f_\theta(z)\,dz\leq e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)}. \]
Proof. If $q_1,q_2$ are two integrable nonnegative functions on some arbitrary measure space, we define their Hellinger affinity as
\[ \mathrm{H}(q_1,q_2)\overset{\mathrm{def}}{=}\int\sqrt{q_1q_2}. \]
We will rely on the following result due to [21].

Lemma 24 (Kleijn–van der Vaart (2006)). Let $p$ be a density and $\mathcal{Q}$ a family of integrable functions. Then there exists a measurable $[0,1]$-valued function $\phi$ such that
\[ \sup_{q\in\mathcal{Q}}\Big[\int\phi\,p+\int(1-\phi)\,q\Big]\leq\sup_{q\in\mathrm{conv}(\mathcal{Q})}\mathrm{H}(p,q), \]
where $\mathrm{conv}(\mathcal{Q})$ is the convex hull of $\mathcal{Q}$.
Fix $\delta\in\Delta$ such that $\|\delta\|_0\leq s$, and $\theta\in\mathbb{R}^p_\delta$. To put ourselves in the setting of Lemma 24, we define
\[ q_{\delta,u}(z)\overset{\mathrm{def}}{=}e^{2\gamma\|\delta\cdot\nabla\ell(u;z)-\delta\cdot\nabla\ell(\theta_\star;z)\|_2^2}\,f_u(z)\,\mathbf{1}_{\mathcal{E}_\rho}(z),\qquad u\in\mathbb{R}^p_\delta,\ z\in\mathcal{Z}. \]
For $z\in\mathcal{E}_\rho$ and $u\in\mathbb{R}^p_\delta$, we have
\[ (5.12)\quad \frac{q_{\delta,u}(z)}{f_{\theta_\star}(z)}=e^{\langle\nabla\ell(\theta_\star;z),u-\theta_\star\rangle+L_\gamma(\delta,u;z)}\,\mathbf{1}_{\mathcal{E}_\rho}(z)\leq e^{\frac\rho2\|u-\theta_\star\|_1+L_\gamma(\delta,u;z)}. \]
Therefore,
\[ \int_{\mathcal{Z}}q_{\delta,u}(z)\,dz\leq e^{\frac\rho2\|u-\theta_\star\|_1}\,\mathbb{E}_\star\big[e^{L_\gamma(\delta,u;Z)}\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\big]<\infty, \]
by H4. Now fix $\eta\geq2\epsilon$, and suppose that $\|\theta-\theta_\star\|_2>\eta$. Let
\[ \mathcal{P}_{\delta,\theta}\overset{\mathrm{def}}{=}\Big\{q_{\delta,u}:\ u\in\mathbb{R}^p_\delta,\ \|u-\theta\|_2\leq\frac\eta2\Big\}. \]
According to Lemma 24, applied with $p=f_{\theta_\star}$ and $\mathcal{Q}=\mathcal{P}_{\delta,\theta}$, there exists a test function $\phi_{\delta,\theta}$ such that
\[ (5.13)\quad \sup_{q\in\mathcal{P}_{\delta,\theta}}\Big[\int\phi_{\delta,\theta}\,f_{\theta_\star}+\int(1-\phi_{\delta,\theta})\,q\Big]\leq\sup_{q\in\mathrm{conv}(\mathcal{P}_{\delta,\theta})}\mathrm{H}(f_{\theta_\star},q). \]
Any $q\in\mathrm{conv}(\mathcal{P}_{\delta,\theta})$ can be written as $q=\sum_j\alpha_jq_{\delta,u_j}$, where $\sum_j\alpha_j=1$, $u_j\in\mathbb{R}^p_\delta$, and $\|u_j-\theta\|_2\leq\eta/2$. Notice that this implies that $\|u_j-\theta_\star\|_2>\eta/2\geq\epsilon$. Therefore, by Jensen's inequality and (5.12), we get
\[ \mathrm{H}(q,f_{\theta_\star})=\int_{\mathcal{E}_\rho}\sqrt{\sum_j\alpha_j\frac{q_{\delta,u_j}(z)}{f_{\theta_\star}(z)}}\,f_{\theta_\star}(z)\,dz\leq\sqrt{\sum_j\alpha_j\int_{\mathcal{E}_\rho}\frac{q_{\delta,u_j}(z)}{f_{\theta_\star}(z)}\,f_{\theta_\star}(z)\,dz}\\ \leq\sqrt{\sum_j\alpha_j\,e^{\frac\rho2\|u_j-\theta_\star\|_1}\,\mathbb{E}_\star\big[e^{L_\gamma(\delta,u_j;Z)}\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\big]}\leq\sqrt{\sum_j\alpha_j\,e^{\frac\rho2\|u_j-\theta_\star\|_1-\frac12r_1(\|u_j-\theta_\star\|_2)}}. \]
Since $\|u_j\|_0\leq s$ and $\|u_j-\theta_\star\|_2>\eta/2\geq\epsilon$, the definition of $\epsilon$ gives
\[ \frac\rho2\|u_j-\theta_\star\|_1-\frac16r_1(\|u_j-\theta_\star\|_2)\leq\frac{\rho(s+s_\star)^{1/2}}{2}\|u_j-\theta_\star\|_2-\frac16r_1(\|u_j-\theta_\star\|_2)\leq0. \]
We conclude that for any $q\in\mathrm{conv}(\mathcal{P}_{\delta,\theta})$,
\[ (5.14)\quad \mathrm{H}(q,f_{\theta_\star})\leq e^{-\frac16r_1\left(\frac\eta2\right)}. \]
Now, for $M>2$, write $\cup_\delta\{\theta\in\mathbb{R}^p_\delta:\ \|\theta-\theta_\star\|_2>M\epsilon\}$ as $\cup_\delta\cup_{j\geq1}A_\epsilon(\delta,j)$, where the unions in $\delta$ are taken over all $\delta$ such that $\|\delta\|_0=s$, and
\[ A_\epsilon(\delta,j)\overset{\mathrm{def}}{=}\big\{\theta\in\mathbb{R}^p_\delta:\ jM\epsilon<\|\theta-\theta_\star\|_2\leq(j+1)M\epsilon\big\}. \]
For $A_\epsilon(\delta,j)\neq\emptyset$, let $S(\delta,j)$ be a maximal $(jM\epsilon/2)$-separated set of points in $A_\epsilon(\delta,j)$. It is easily checked that the cardinality of $S(\delta,j)$ is upper bounded by $9^{\|\delta\|_0}=9^s$ (see for instance Example 7.1 of [15] for the argument). For $\theta_{\delta,jk}\in S(\delta,j)$, let $\phi_{\delta,jk}$ denote the test function $\phi_{\delta,\theta_{\delta,jk}}$ obtained above with $\theta=\theta_{\delta,jk}$ and $\eta=jM\epsilon$. From (5.13) and (5.14), $\phi_{\delta,jk}$ satisfies
\[ (5.15)\quad \sup_{u\in\mathbb{R}^p_\delta,\ \|u-\theta_{\delta,jk}\|_2\leq\frac{jM\epsilon}{2}}\Big[\mathbb{E}_\star(\phi_{\delta,jk}(Z))+\int_{\mathcal{E}_\rho}(1-\phi_{\delta,jk}(z))\,q_{\delta,u}(z)\,dz\Big]\leq e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)}. \]
Then we set
\[ \phi=\sup_{\delta:\ \|\delta\|_0=s}\ \sup_{j\geq1}\ \max_{\theta_{\delta,jk}\in S(\delta,j)}\phi_{\delta,jk}. \]
It then follows that
\[ \mathbb{E}_\star(\phi(Z))\leq\sum_\delta\sum_{j\geq1}\sum_{\theta_{\delta,jk}\in S(\delta,j)}\mathbb{E}_\star(\phi_{\delta,jk}(Z))\leq\binom ps\,9^s\sum_{j\geq1}e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)}. \]
And if for some $\delta$ such that $\|\delta\|_0\leq s$ and $\theta\in\mathbb{R}^p_\delta$ we have $\|\theta-\theta_\star\|_2>jM\epsilon$, then we can find $\bar\delta$ with $\|\bar\delta\|_0=s$ such that $\theta\in\mathbb{R}^p_{\bar\delta}$, and $\theta$ lies within $(iM\epsilon)/2$ of some point $\theta_{\bar\delta,ik}\in S(\bar\delta,i)$ for some $i\geq j$. Hence, by (5.15),
\[ \int_{\mathcal{E}_\rho}(1-\phi(z))\,q_{\delta,\theta}(z)\,dz\leq\int_{\mathcal{E}_\rho}(1-\phi_{\bar\delta,ik}(z))\,q_{\delta,\theta}(z)\,dz\leq e^{-\frac16r_1\left(\frac{iM\epsilon}{2}\right)}. \]
This ends the proof. $\square$
We make use of the following Gaussian version of the Hanson–Wright inequality, which follows directly from standard deviation bounds for Lipschitz functions of Gaussian random variables (see e.g. Theorem 5.6 of [9]).

Lemma 25. If $Z\sim\mathbf{N}(0,I_m)$ and $A\in\mathbb{R}^{m\times m}$ is a symmetric positive semi-definite matrix, then for all $t\geq\mathrm{Tr}(A)$,
\[ \mathbb{P}\big(Z'AZ>t\big)\leq\exp\Bigg[-\frac{\big(\sqrt t-\sqrt{\mathrm{Tr}(A)}\big)^2}{2\lambda_{\max}(A)}\Bigg]. \]
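Lemma 25 can be sanity-checked by Monte Carlo: for a fixed PSD matrix $A$ and a threshold $t\geq\mathrm{Tr}(A)$, the empirical tail of $Z'AZ$ should fall below the stated bound. A minimal sketch (matrix and threshold chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(7)
m, reps = 5, 100_000
B = rng.normal(size=(m, m))
A = B @ B.T / m                               # symmetric positive semi-definite
trA = np.trace(A)
lam = np.linalg.eigvalsh(A).max()
t = (np.sqrt(trA) + 2 * np.sqrt(lam))**2      # a threshold with t >= Tr(A)

Z = rng.normal(size=(reps, m))
quad = np.einsum('ij,jk,ik->i', Z, A, Z)      # Z' A Z for each sample
emp = np.mean(quad > t)
bound = np.exp(-(np.sqrt(t) - np.sqrt(trA))**2 / (2 * lam))
assert emp <= bound                           # Monte Carlo tail below Lemma 25 bound
```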
We will also need the following lemma on determinants of sub-matrices.

Lemma 26. Let $M=\begin{pmatrix}A&B\\B'&D\end{pmatrix}$ be symmetric positive definite, with $D\in\mathbb{R}^{q\times q}$. Then
\[ \det(A)\,\lambda_{\min}(M)^q\leq\det(M)\leq\det(A)\,\lambda_{\max}(M)^q. \]
Proof. This follows from Cauchy's interlacing property for eigenvalues. See for instance Theorem 4.3.17 of [19]. $\square$
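Lemma 26 is easy to verify numerically for a random symmetric positive definite matrix; a minimal sketch (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
p, q = 6, 3                           # D is the q x q lower-right block
B = rng.normal(size=(p, p))
M = B @ B.T + 0.1 * np.eye(p)         # symmetric positive definite

A = M[:p - q, :p - q]                 # upper-left block of M
detA, detM = np.linalg.det(A), np.linalg.det(M)
eig = np.linalg.eigvalsh(M)

# det(A) * lam_min(M)^q <= det(M) <= det(A) * lam_max(M)^q
assert detA * eig.min()**q <= detM * (1 + 1e-9)
assert detM <= detA * eig.max()**q * (1 + 1e-9)
```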
5.2. Proof of Theorem 5. Let $\Delta_s\overset{\mathrm{def}}{=}\{\delta\in\Delta:\ \|\delta\|_0\leq s\}$. We have $\Delta\times\mathbb{R}^p=\big((\Delta\setminus\Delta_s)\times\mathbb{R}^p\big)\cup\mathcal{F}_1\cup\mathcal{F}_{21}\cup\mathcal{F}_{22}\cup\mathcal{B}_{m,M}$, where
\[ \mathcal{F}_1\overset{\mathrm{def}}{=}\bigcup_{\delta\in\Delta_s}\{\delta\}\times\big\{\theta\in\mathbb{R}^p:\ \|\theta_\delta-\theta_\star\|_2>M\epsilon\big\}, \]
\[ \mathcal{F}_{21}\overset{\mathrm{def}}{=}\bigcup_{\delta\in\Delta_s}\{\delta\}\times\Big\{\theta\in\mathbb{R}^p:\ \|\theta_\delta-\theta_\star\|_2\leq M\epsilon,\ \text{and}\ \|\theta-\theta_\delta\|_2>2\sqrt{(m+1)\gamma p}\Big\}, \]
\[ \mathcal{F}_{22}\overset{\mathrm{def}}{=}\bigcup_{\delta\in\Delta_s}\{\delta\}\times\Big\{\theta\in\mathbb{R}^p:\ \|\theta_\delta-\theta_\star\|_2\leq M\epsilon,\ \|\theta-\theta_\delta\|_2\leq2\sqrt{(m+1)\gamma p},\ \text{and}\ \|\theta-\theta_\delta\|_\infty>2\sqrt{(m+1)\gamma\log(p)}\Big\}. \]
Hence we can write
\[ \Pi_\gamma(\mathcal{B}_{m,M}|Z)=1-\Pi_\gamma(\|\delta\|_0>s|Z)-\Pi_\gamma(\mathcal{F}_1|Z)-\Pi_\gamma(\mathcal{F}_{21}|Z)-\Pi_\gamma(\mathcal{F}_{22}|Z). \]
Therefore, using the assumption that $\mathbb{P}_\star(Z\in\mathcal{E}_\rho)\geq1/2$ and the definition of conditional probability,
\[ (5.16)\quad \mathbb{E}_\star\big[\Pi_\gamma(\mathcal{B}_{m,M}|Z)\,\big|\,Z\in\mathcal{E}_\rho\big]\geq1-\mathbb{E}_\star\big[\Pi_\gamma(\|\delta\|_0>s|Z)\,\big|\,Z\in\mathcal{E}_\rho\big]-2\,\mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)\,\Pi_\gamma(\mathcal{F}_1|Z)\big]\\ -2\,\mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)\,\Pi_\gamma(\mathcal{F}_{21}|Z)\big]-2\,\mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)\,\Pi_\gamma(\mathcal{F}_{22}|Z)\big]. \]
Therefore, to finish the proof it suffices to upper bound the terms on the right-hand side of (5.16).
Let $\phi$ denote the test function asserted by Lemma 23, where $M>2$ is some arbitrary absolute constant. We can then write
\[ (5.17)\quad \mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)\,\Pi_\gamma(\mathcal{F}_1|Z)\big]\leq\mathbb{E}_\star(\phi(Z))+\mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)(1-\phi(Z))\,\Pi_\gamma(\mathcal{F}_1|Z)\big]. \]
Since $\binom ps\,9^s\leq e^{s\log(9p)}$, we have
\[ (5.18)\quad \mathbb{E}_\star(\phi(Z))\leq e^{s\log(9p)}\sum_{j=1}^\infty e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)}. \]
We apply (5.6) with $B=\mathcal{F}_1$ and Fubini's theorem to get
\[ (5.19)\quad \mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)(1-\phi(Z))\,\Pi_\gamma(\mathcal{F}_1|Z)\big]\leq\Big(1+\frac{\kappa(s_\star)}{\rho^2}\Big)^{s_\star}\sum_{\delta\in\Delta_s}\frac{\omega_\delta}{\omega_{\delta_\star}}\Big(\frac\rho2\Big)^{\|\delta\|_0}e^{3\gamma\rho^2\|\delta\|_0}\,\frac{\sqrt{\det\big(I_{p-s_\star}-\gamma[\underline S]_{\delta_\star^c}\big)}}{\sqrt{\det([A_\delta]_{\delta^c})}}\\ \times\int_{F_\epsilon}\frac{e^{-\rho\|\theta\|_1}}{e^{-\rho\|\theta_\star\|_1}}\,\mathbb{E}_\star\Bigg[\mathbf{1}_{\mathcal{E}_\rho}(Z)(1-\phi(Z))\,\frac{e^{\ell(\theta;Z)}}{e^{\ell(\theta_\star;Z)}}\,e^{2\gamma\|\delta\cdot\nabla\ell(\theta;Z)-\delta\cdot\nabla\ell(\theta_\star;Z)\|_2^2}\Bigg]\,\mu_\delta(d\theta), \]
where $F_\epsilon\overset{\mathrm{def}}{=}\{\theta\in\mathbb{R}^p:\ \|\theta-\theta_\star\|_2>M\epsilon\}=\cup_{j\geq1}F_{j,\epsilon}$, with $F_{j,\epsilon}\overset{\mathrm{def}}{=}\{\theta\in\mathbb{R}^p:\ jM\epsilon<\|\theta-\theta_\star\|_2\leq(j+1)M\epsilon\}$. Using Lemma 23, we have
\[ \int_{F_{j,\epsilon}}\frac{e^{-\rho\|\theta\|_1}}{e^{-\rho\|\theta_\star\|_1}}\,\mathbb{E}_\star\Bigg[\mathbf{1}_{\mathcal{E}_\rho}(Z)(1-\phi(Z))\,\frac{e^{\ell(\theta;Z)}}{e^{\ell(\theta_\star;Z)}}\,e^{2\gamma\|\delta\cdot\nabla\ell(\theta;Z)-\delta\cdot\nabla\ell(\theta_\star;Z)\|_2^2}\Bigg]\,\mu_\delta(d\theta)\\ \leq e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)}\int_{F_{j,\epsilon}}\frac{e^{-\rho\|\theta\|_1}}{e^{-\rho\|\theta_\star\|_1}}\,\mu_\delta(d\theta)\leq e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)}\,e^{8\rho s^{1/2}\left(\frac{jM\epsilon}{2}\right)}\int_{\mathbb{R}^p}e^{-\rho\|\theta-\theta_\star\|_1}\,\mu_\delta(d\theta), \]
and it is easily seen that $\int_{\mathbb{R}^p}e^{-\rho\|\theta-\theta_\star\|_1}\,\mu_\delta(d\theta)\leq\big(\frac2\rho\big)^{\|\delta\|_0}$. Therefore, (5.19) reduces to
\[ (5.20)\quad \mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)(1-\phi(Z))\,\Pi_\gamma(\mathcal{F}_1|Z)\big]\leq\Big(1+\frac{\kappa(s_\star)}{\rho^2}\Big)^{s_\star}\sum_{j=1}^\infty e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)+8\rho s^{1/2}\left(\frac{jM\epsilon}{2}\right)}\\ \times\sum_{\delta\in\Delta_s}\frac{\omega_\delta}{\omega_{\delta_\star}}\,e^{3\gamma\rho^2\|\delta\|_0}\,\frac{\sqrt{\det\big(I_{p-s_\star}-\gamma[\underline S]_{\delta_\star^c}\big)}}{\sqrt{\det([A_\delta]_{\delta^c})}}. \]
Using (5.10), and borrowing the definition of $a$ from Equation (2.4) of the main manuscript, we get
\[ \mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)(1-\phi(Z))\,\Pi_\gamma(\mathcal{F}_1|Z)\big]\leq e^{s_\star+\frac a2}\Big(1+\frac{\kappa(s_\star)}{\rho^2}\Big)^{s_\star}\times\sum_{\delta\in\Delta_s}\frac{\omega_\delta}{\omega_{\delta_\star}}\,e^{3\gamma\rho^2\|\delta\|_0}\sum_{j=1}^\infty e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)+8\rho s^{1/2}\left(\frac{jM\epsilon}{2}\right)}. \]
Using (2.7), we have $4e^{3\gamma\rho^2}\leq p^{u+1}$ for all $p$ large enough, so that Assumption H3 gives
\[ \sum_{\delta\in\Delta_s}\frac{\omega_\delta}{\omega_{\delta_\star}}\,e^{3\gamma\rho^2\|\delta\|_0}\leq\Big(\frac{1-q}{q}\Big)^{s_\star}\sum_{k=0}^{s}\binom pk\Big(\frac{q}{1-q}\Big)^ke^{3\gamma\rho^2k}\leq2(1-q)^{s_\star}p^{s_\star(1+u)}\leq2\,p^{s_\star(1+u)}. \]
It follows that
\[ (5.21)\quad \mathbb{E}_\star\big[\mathbf{1}_{\mathcal{E}_\rho}(Z)(1-\phi(Z))\,\Pi_\gamma(\mathcal{F}_1|Z)\big]\leq2e^{\frac a2}\exp\Big((2+u)s_\star\log(p)+s_\star\log\Big(1+\frac{\kappa(s_\star)}{\rho^2}\Big)\Big)\\ \times\sum_{j=1}^\infty e^{-\frac16r_1\left(\frac{jM\epsilon}{2}\right)+8\rho s^{1/2}\left(\frac{jM\epsilon}{2}\right)}. \]
We set $F^{(\delta)}_{21}\overset{\mathrm{def}}{=}\{\theta\in\mathbb{R}^p:\ \|\theta_\delta-\theta_\star\|_2\leq M\epsilon,\ \text{and}\ \|\theta-\theta_\delta\|_2>\epsilon_1\}$, with $\epsilon_1=2\sqrt{(m+1)\gamma p}$. From the definition of $\Pi_\gamma$, and using the two inequalities of Lemma 19, we have
\[ \Pi_\gamma(\mathcal{F}_{21}|Z)=\frac{\sum_{\delta\in\Delta_s}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\big(\frac{1}{2\pi\gamma}\big)^{\frac{p-\|\delta\|_0}{2}}\int_{F^{(\delta)}_{21}}e^{-h_\gamma(\delta,u;Z)}\,du}{\sum_{\delta\in\Delta}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\big(\frac{1}{2\pi\gamma}\big)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^p}e^{-h_\gamma(\delta,u;Z)}\,du}\\ \leq\frac{\sum_{\delta\in\Delta_s}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\big(\frac{1}{2\pi\gamma}\big)^{\frac{p-\|\delta\|_0}{2}}\int_{F^{(\delta)}_{21}}e^{\ell(u_\delta;Z)-\rho\|u_\delta\|_1-\frac{1}{2\gamma}(u-u_\delta)'(I_p-\gamma S)(u-u_\delta)+R_\gamma(\delta,u;Z)}\,du}{\sum_{\delta\in\Delta}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\big(\frac{1}{2\pi\gamma}\big)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^p}e^{\ell(u_\delta;Z)-\rho\|u_\delta\|_1-\frac{1}{2\gamma}(u-u_\delta)'(I_p-\gamma\underline S)(u-u_\delta)}\,du}. \]
For $Z\in\mathcal{E}_\rho$ and $\delta\in\Delta_s$, proceeding as in (5.4), we have
\[ (5.22)\quad -\frac{1}{2\gamma}(u-u_\delta)'(I_p-\gamma S)(u-u_\delta)+R_\gamma(\delta,u;Z)\leq-\frac{1}{2\gamma}(u-u_\delta)'A(u-u_\delta)+3\gamma\rho^2s+2\gamma\kappa(s)^2\|u_\delta-\theta_\star\|_2^2, \]
where $A=I_p-\gamma(1+4\gamma\kappa(s))S$. It follows that
\[ (5.23)\quad \mathbf{1}_{\mathcal{E}_\rho}(Z)\,\Pi_\gamma(\mathcal{F}_{21}|Z)\leq\mathbf{1}_{\mathcal{E}_\rho}(Z)\,e^{3\gamma\rho^2s+2\gamma\kappa(s)^2(M\epsilon)^2}\\ \times\frac{\sum_{\delta\in\Delta_s}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\Big[\int_{\mathbb{R}^p}e^{\ell(u;Z)-\rho\|u\|_1}\,\mu_\delta(du)\Big]\frac{1}{\sqrt{\det([A]_{\delta^c})}}\,T_\delta}{\sum_{\delta\in\Delta}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\Big[\int_{\mathbb{R}^p}e^{\ell(u;Z)-\rho\|u\|_1}\,\mu_\delta(du)\Big]\frac{1}{\sqrt{\det(I_{p-\|\delta\|_0}-\gamma[\underline S]_{\delta^c})}}},
\]
where
\[ T_\delta\overset{\mathrm{def}}{=}\frac{\int_{\tilde F^{(\delta)}}e^{-\frac{1}{2\gamma}z'[A]_{\delta^c}z}\,dz}{\int_{\mathbb{R}^{p-\|\delta\|_0}}e^{-\frac{1}{2\gamma}z'[A]_{\delta^c}z}\,dz}, \]
and $\tilde F^{(\delta)}\overset{\mathrm{def}}{=}\{\theta\in\mathbb{R}^{p-\|\delta\|_0}:\ \|\theta\|_2\geq\epsilon_1\}$. We have seen in (5.11) that
\[ \frac{\sqrt{\det\big(I_{p-\|\delta\|_0}-\gamma[\underline S]_{\delta^c}\big)}}{\sqrt{\det([A]_{\delta^c})}}\leq e^{\frac a2}. \]
Therefore, (5.23) is upper bounded by
\[ e^{3\gamma\rho^2s+2\gamma\kappa(s)^2(M\epsilon)^2}\,e^{\frac a2}\,\sup_{\delta\in\Delta_s}T_\delta. \]
$T_\delta$ is the probability of the set $\tilde F^{(\delta)}\subset\mathbb{R}^{p-\|\delta\|_0}$ under the Gaussian distribution $\mathbf{N}\big(0,\gamma([A]_{\delta^c})^{-1}\big)$. Under the assumption $4\gamma\lambda_{\max}(S)\leq1$, all the eigenvalues of the matrix $A=I_p-\gamma(1+4\gamma\kappa(s))S$ lie in $[1/2,1]$, and so do the eigenvalues of the sub-matrix $[A]_{\delta^c}$ (by Cauchy's interlacing property for eigenvalues; see Theorem 4.3.17 of [19]). Hence, by Lemma 25, we have $T_\delta\leq e^{-\frac m4p}$ for $m\geq1$. Hence,
\[ \mathbf{1}_{\mathcal{E}_\rho}(Z)\,\Pi_\gamma(\mathcal{F}_{21}|Z)\leq e^{-\frac{mp}{4}}\exp\Big(3\gamma\rho^2s+2\gamma\kappa(s)^2(M\epsilon)^2+\frac a2\Big). \]
A similar bound holds for $\mathcal{F}_{22}$, with $\frac{1}{p^m}$ instead of $e^{-mp/4}$. This completes the proof. $\square$
6. Proof of Theorem 8.

Proof. We split the proof into two parts.

Part one: model selection consistency. To shorten notation, we write $B^{(\delta)}$ for $B^{(\delta)}_{m,M}$, and $\mathcal{B}$ for $\mathcal{B}_{m,M}$. We have seen that under (2.8), $\mathcal{B}=\bigcup_{\delta\in\mathcal{A}}\big(\{\delta\}\times B^{(\delta)}\big)$, which we can obviously write as $\mathcal{B}=\big(\{\delta_\star\}\times B^{(\delta_\star)}\big)\cup\bigcup_{\delta\in\mathcal{A}_0}\{\delta\}\times B^{(\delta)}$, where $\mathcal{A}_0\overset{\mathrm{def}}{=}\mathcal{A}\setminus\{\delta_\star\}$. Hence
\[ \Pi_\gamma\big(\{\delta_\star\}\times B^{(\delta_\star)}\big|z\big)=\Pi_\gamma(\mathcal{B}|z)-\Pi_\gamma\big(\cup_{\delta\in\mathcal{A}_0}\{\delta\}\times B^{(\delta)}\big|z\big). \]
The proof then boils down to controlling the rightmost term in the last equation. We have, by definition,
\[ (6.1)\quad \Pi_\gamma\big(\cup_{\delta\in\mathcal{A}_0}\{\delta\}\times B^{(\delta)}\big|z\big)=\frac{\sum_{\delta\in\mathcal{A}_0}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\big(\frac{1}{2\pi\gamma}\big)^{\frac{p-\|\delta\|_0}{2}}\int_{B^{(\delta)}}e^{-h_\gamma(\delta,\theta;z)}\,d\theta}{\sum_{\delta\in\Delta}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\big(\frac{1}{2\pi\gamma}\big)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^p}e^{-h_\gamma(\delta,\theta;z)}\,d\theta}\\ \leq\frac{\sum_{\delta\in\mathcal{A}_0}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\big(\frac{1}{2\pi\gamma}\big)^{\frac{p-\|\delta\|_0}{2}}\int_{\{\theta\in\mathbb{R}^p:\ \|\theta_\delta-\theta_\star\|_2\leq M\epsilon\}}e^{-h_\gamma(\delta,\theta;z)}\,d\theta}{\sum_{\delta\in\mathcal{A}}\omega_\delta\big(\frac\rho2\big)^{\|\delta\|_0}\big(\frac{1}{2\pi\gamma}\big)^{\frac{p-\|\delta\|_0}{2}}\int_{\{\theta\in\mathbb{R}^p:\ \|\theta_\delta-\theta_\star\|_2\leq M\epsilon\}}e^{-h_\gamma(\delta,\theta;z)}\,d\theta}. \]
The first inequality of Lemma 19 says that for all $\theta\in\mathbb{R}^p$,
\[
-h_\gamma(\delta,\theta;z)\ge \ell(\theta_\delta;z)-\rho\|\theta_\delta\|_1-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta).
\]
Since $\ell(\theta_\delta;z)=\ell_{[\delta]}([\theta]_\delta;z)$, we combine this with the definition of $\varpi_{2,M}$ in H5 to obtain that for all $\theta\in\mathbb{R}^p$ such that $\|\theta_\delta-\theta_\star\|_2\le M\varepsilon$,
\[
-h_\gamma(\delta,\theta;z)\ge \ell_{[\delta]}(\hat\theta_\delta;z)+\left\langle\nabla\ell_{[\delta]}(\hat\theta_\delta;z),[\theta]_\delta-\hat\theta_\delta\right\rangle-\frac{1}{2}([\theta]_\delta-\hat\theta_\delta)'[-\nabla^{(2)}\ell_{[\delta]}(\hat\theta_\delta;z)]([\theta]_\delta-\hat\theta_\delta)-\frac{\varpi_{2,M}}{6}\|[\theta]_\delta-\hat\theta_\delta\|_2^3-\rho\|\theta_\delta\|_1-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta).
\]
By the first order optimality condition of $\hat\theta_\delta$, $\nabla\ell_{[\delta]}(\hat\theta_\delta;z)=0$. Furthermore, for any $\theta\in\mathbb{R}^p$ such that $\|\theta_\delta-\theta_\star\|_2\le M\varepsilon$, and using the assumption that $\|\hat\theta_\delta-[\theta_\star]_\delta\|_2\le\varepsilon$, we have
\[
-\rho\|\theta_\delta\|_1-\frac{\varpi_{2,M}}{6}\|[\theta]_\delta-\hat\theta_\delta\|_2^3\ge-\rho\|\hat\theta_\delta\|_1-(M+1)\rho s^{1/2}\varepsilon-\frac{(M+1)^3}{6}\varpi_{2,M}\varepsilon^3.
\]
We then deduce that
\[
-h_\gamma(\delta,\theta;z)\ge-(M+1)\rho s^{1/2}\varepsilon-\frac{(M+1)^3}{6}\varpi_{2,M}\varepsilon^3+\ell_{[\delta]}(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1-\frac{1}{2}([\theta]_\delta-\hat\theta_\delta)'[-\nabla^{(2)}\ell_{[\delta]}(\hat\theta_\delta;z)]([\theta]_\delta-\hat\theta_\delta)-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta).
\tag{6.2}
\]
It follows from this last inequality that the denominator of the right-hand side of (6.1) is lower bounded by
\[
e^{-(M+1)\rho s^{1/2}\varepsilon-\frac{(M+1)^3}{6}\varpi_{2,M}\varepsilon^3}\sum_{\delta\in\mathcal{A}}\left(\frac{1}{2\pi\gamma}\right)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^{p-\|\delta\|_0}}e^{-\frac{1}{2\gamma}u'([I_p-\gamma S]_{\delta^c})u}\,du\times\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}e^{\ell(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1}\sqrt{\det\left(2\pi\mathcal{I}^{-1}_{\gamma,\delta}\right)}\,\mathbf{N}(\hat\theta_\delta,\mathcal{I}^{-1}_{\gamma,\delta})(B_\delta),
\]
where $B_\delta\overset{\mathrm{def}}{=}\{u\in\mathbb{R}^{\|\delta\|_0}:\ \|u-[\theta_\star]_\delta\|_2\le M\varepsilon\}$. Starting with the second inequality of Lemma 19, and with the same calculations as above, we get for any $\theta\in\mathcal{B}^{(\delta)}$,
\[
-h_\gamma(\delta,\theta;z)\le(M+1)\rho s^{1/2}\varepsilon+\frac{(M+1)^3}{6}\varpi_{2,M}\varepsilon^3+\ell_{[\delta]}(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1-\frac{1}{2}([\theta]_\delta-\hat\theta_\delta)'[-\nabla^{(2)}\ell_{[\delta]}(\hat\theta_\delta;z)]([\theta]_\delta-\hat\theta_\delta)-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)+R_\gamma(\delta,\theta;z),
\]
and for $z\in\mathcal{E}_\rho$, $\delta\in\mathcal{A}$, and $\theta\in\mathcal{B}^{(\delta)}$, using (5.4), we have
\[
-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)+R_\gamma(\delta,\theta;z)\le-\frac{1}{2\gamma}(\theta-\theta_\delta)'[A]_{\delta^c}(\theta-\theta_\delta)+2\gamma\kappa(s)^2(M\varepsilon)^2+3\gamma\rho^2\|\delta\|_0.
\]
The last two inequalities imply that for $z\in\mathcal{E}_\rho$, $\delta\in\mathcal{A}$, and $\theta\in\mathcal{B}^{(\delta)}$,
\[
-h_\gamma(\delta,\theta;z)\le(M+1)\rho s^{1/2}\varepsilon+\frac{(M+1)^3}{6}\varpi_{2,M}\varepsilon^3+2\gamma\kappa(s)^2(M\varepsilon)^2+3\gamma\rho^2\|\delta\|_0+\ell_{[\delta]}(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1-\frac{1}{2}([\theta]_\delta-\hat\theta_\delta)'[-\nabla^{(2)}\ell_{[\delta]}(\hat\theta_\delta;z)]([\theta]_\delta-\hat\theta_\delta)-\frac{1}{2\gamma}(\theta-\theta_\delta)'[A]_{\delta^c}(\theta-\theta_\delta).
\tag{6.3}
\]
Therefore the numerator on the right-hand side of (6.1) is upper bounded by
\[
e^{(M+1)\rho s^{1/2}\varepsilon+\frac{(M+1)^3}{6}\varpi_{2,M}\varepsilon^3}\,e^{2\gamma\kappa(s)^2(M\varepsilon)^2}\sum_{\delta\in\mathcal{A}_0}\left(\frac{1}{2\pi\gamma}\right)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^{p-\|\delta\|_0}}e^{-\frac{1}{2\gamma}u'([A]_{\delta^c})u}\,du\times e^{3\gamma\rho^2\|\delta\|_0}\,\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}e^{\ell(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1}\sqrt{\det\left(2\pi\mathcal{I}^{-1}_{\gamma,\delta}\right)}\,\mathbf{N}(\hat\theta_\delta,\mathcal{I}^{-1}_{\gamma,\delta})(B_\delta).
\]
Furthermore, we have seen in (5.11) that
\[
\sqrt{\frac{\det(I_{p-\|\delta\|_0}-\gamma[S]_{\delta^c})}{\det([A]_{\delta^c})}}\le e^{\frac{a}{2}}.
\]
Therefore it follows from the above that (6.1) gives
\[
\Pi_\gamma\left(\cup_{\delta\in\mathcal{A}_0}\{\delta\}\times\mathcal{B}^{(\delta)}\,\middle|\,z\right)\le e^{c_0}\times\frac{\displaystyle\sum_{\delta\in\mathcal{A}_0}\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\frac{e^{\ell(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1}}{\sqrt{\det(I_{p-\|\delta\|_0}-\gamma[S]_{\delta^c})}}\sqrt{\det\left(2\pi\mathcal{I}^{-1}_{\gamma,\delta}\right)}\,\mathbf{N}(\hat\theta_\delta,\mathcal{I}^{-1}_{\gamma,\delta})(B_\delta)}{\displaystyle\sum_{\delta\in\mathcal{A}}\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\frac{e^{\ell(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1}}{\sqrt{\det(I_{p-\|\delta\|_0}-\gamma[S]_{\delta^c})}}\sqrt{\det\left(2\pi\mathcal{I}^{-1}_{\gamma,\delta}\right)}\,\mathbf{N}(\hat\theta_\delta,\mathcal{I}^{-1}_{\gamma,\delta})(B_\delta)},
\]
where $c_0\overset{\mathrm{def}}{=}2(M+1)\rho s^{1/2}\varepsilon+\frac{(M+1)^3}{3}\varpi_{2,M}\varepsilon^3+2\gamma\kappa(s)^2(M\varepsilon)^2+3\gamma\rho^2 s+\frac{a}{2}$. We rewrite the last inequality as
\[
\Pi_\gamma\left(\cup_{\delta\in\mathcal{A}_0}\{\delta\}\times\mathcal{B}^{(\delta)}\,\middle|\,z\right)\le e^{c_0}\,\frac{\sum_{k=1}^{s-s_\star}G_k}{\sum_{k=0}^{s-s_\star}G_k},
\tag{6.4}
\]
where
\[
G_k=\sum_{\delta\supseteq\delta_\star,\ \|\delta\|_0=s_\star+k}\omega_\delta\left(\frac{\rho}{2}\right)^{s_\star+k}\times\frac{e^{\ell(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1}}{\sqrt{\det\left(I_{p-s_\star-k}-\gamma[S]_{\delta^c}\right)}}\sqrt{\det\left(2\pi\mathcal{I}^{-1}_{\gamma,\delta}\right)}\,\mathbf{N}(\hat\theta_\delta,\mathcal{I}^{-1}_{\gamma,\delta})(B_\delta).
\]
We note that
\[
\sqrt{\det\left(2\pi\mathcal{I}^{-1}_{\gamma,\delta}\right)}=\frac{(2\pi)^{\frac{s_\star+k}{2}}}{\sqrt{\det\left([-\nabla^{(2)}\ell_{[\delta]}(\hat\theta_\delta;z)]\right)}}.
\]
Hence
\[
G_k=\sum_{\delta\supseteq\delta_\star,\ \|\delta\|_0=s_\star+k}\omega_\delta\left(\frac{\rho}{2}\right)^{s_\star+k}\frac{e^{\ell(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1}}{\sqrt{\det\left(I_{p-s_\star-k}-\gamma[S]_{\delta^c}\right)}}\,\frac{(2\pi)^{\frac{s_\star+k}{2}}\,\mathbf{N}(\hat\theta_\delta,\mathcal{I}^{-1}_{\gamma,\delta})(B_\delta)}{\sqrt{\det\left([-\nabla^{(2)}\ell_{[\delta]}(\hat\theta_\delta;z)]\right)}}.
\]
Fix $\delta$ such that $\delta\supseteq\delta_\star$, $\|\delta\|_0=s_\star+k$. Firstly, since $[S]_{\delta^c}$ is a sub-matrix of $[S]_{\delta_\star^c}$ and the eigenvalues of $I_{p-s_\star}-\gamma[S]_{\delta_\star^c}$ are all between $1/2$ and $1$, it is not hard to see that
\[
\sqrt{\frac{\det\left(I_{p-s_\star}-\gamma[S]_{\delta_\star^c}\right)}{\det\left(I_{p-s_\star-k}-\gamma[S]_{\delta^c}\right)}}\le 1.
\]
Secondly,
\[
-\rho\|\hat\theta_\delta\|_1+\rho\|\theta_\star\|_1\le\rho\|\hat\theta_\delta-\theta_\star\|_1\le 2\rho s^{1/2}\varepsilon,
\]
and for $z\in\mathcal{E}_{\rho,\Lambda}$, we have
\[
\ell(\hat\theta_\delta;z)\le\ell(\theta_\star;z)+\Lambda k.
\]
It follows that
\[
G_k\le G_0\,e^{2\rho s^{1/2}\varepsilon}\times\sum_{\delta\supseteq\delta_\star,\ \|\delta\|_0=s_\star+k}\frac{\omega_\delta}{\omega_{\delta_\star}}\,e^{\Lambda k}\left(\frac{\rho}{2}\right)^k(2\pi)^{\frac{k}{2}}\,\frac{\sqrt{\det\left([-\nabla^{(2)}\ell_{[\delta_\star]}(\hat\theta_{\delta_\star};z)]\right)}}{\sqrt{\det\left([-\nabla^{(2)}\ell_{[\delta]}(\hat\theta_\delta;z)]\right)}}\,\frac{\mathbf{N}(\hat\theta_\delta,\mathcal{I}^{-1}_{\gamma,\delta})(B_\delta)}{\mathbf{N}(\hat\theta_{\delta_\star},\mathcal{I}^{-1}_{\gamma,\delta_\star})(B_{\delta_\star})}.
\]
We have
\[
\mathbf{N}(\hat\theta_{\delta_\star},\mathcal{I}^{-1}_{\gamma,\delta_\star})(B_{\delta_\star})\ge 1-\mathbb{P}\left(V'\left([-\nabla^{(2)}\ell_{[\delta_\star]}(\hat\theta_{\delta_\star};z)]\right)^{-1}V>(M-1)^2\varepsilon^2\right)\ge 1-\mathbb{P}\left(\|V\|_2\ge(M-1)\varepsilon\,\kappa(s)^{1/2}\right),
\]
where $V=(V_1,\ldots,V_{s_\star})\overset{\mathrm{i.i.d.}}{\sim}\mathbf{N}(0,1)$, and where the second inequality uses H1 and the definition of $\kappa(s)$. By a standard exponential bound for Gaussian random variables, we have
\[
\mathbb{P}\left(\|V\|_2\ge(M-1)\varepsilon\,\kappa(s)^{1/2}\right)\le\exp\left(-\frac{1}{2}\left((M-1)\varepsilon\,\kappa(s)^{1/2}-s_\star^{1/2}\right)_+^2\right).
\]
Hence
\[
\frac{\mathbf{N}(\hat\theta_\delta,\mathcal{I}^{-1}_{\gamma,\delta})(B_\delta)}{\mathbf{N}(\hat\theta_{\delta_\star},\mathcal{I}^{-1}_{\gamma,\delta_\star})(B_{\delta_\star})}\le\frac{1}{\mathbf{N}(\hat\theta_{\delta_\star},\mathcal{I}^{-1}_{\gamma,\delta_\star})(B_{\delta_\star})}\le\frac{1}{1-e^{-\frac{1}{2}\left((M-1)\varepsilon\kappa(s)^{1/2}-s_\star^{1/2}\right)_+^2}}\le 2,
\]
using the assumption that $(M-1)\varepsilon\,\kappa(s)^{1/2}-s_\star^{1/2}\ge 2$. We split
\[
\frac{\sqrt{\det\left([-\nabla^{(2)}\ell_{[\delta_\star]}(\hat\theta_{\delta_\star};z)]\right)}}{\sqrt{\det\left([-\nabla^{(2)}\ell_{[\delta]}(\hat\theta_\delta;z)]\right)}}=\frac{\sqrt{\det\left([-\nabla^{(2)}\ell((\hat\theta_{\delta_\star},0)_{\delta_\star};z)]_{\delta_\star,\delta_\star}\right)}}{\sqrt{\det\left([-\nabla^{(2)}\ell((\hat\theta_\delta,0)_\delta;z)]_{\delta_\star,\delta_\star}\right)}}\cdot\frac{\sqrt{\det\left([-\nabla^{(2)}\ell((\hat\theta_\delta,0)_\delta;z)]_{\delta_\star,\delta_\star}\right)}}{\sqrt{\det\left([-\nabla^{(2)}\ell((\hat\theta_\delta,0)_\delta;z)]_{\delta,\delta}\right)}}.
\]
The convexity of the function $-\log\det$ can be used to show that for any pair of symmetric positive definite matrices $A,B$ of the same size, $|\log\det(A)-\log\det(B)|\le\max(\|A^{-1}\|_{\mathrm{F}},\|B^{-1}\|_{\mathrm{F}})\,\|B-A\|_{\mathrm{F}}$. We use this to conclude that
\[
\frac{\sqrt{\det\left([-\nabla^{(2)}\ell((\hat\theta_{\delta_\star},0)_{\delta_\star};z)]_{\delta_\star,\delta_\star}\right)}}{\sqrt{\det\left([-\nabla^{(2)}\ell((\hat\theta_\delta,0)_\delta;z)]_{\delta_\star,\delta_\star}\right)}}\le\exp\left(\frac{1}{2}\,\frac{s_\star^{1/2}}{\kappa(s)}\left\|\left[\nabla^{(2)}\ell_{[\delta]}([\hat\theta_\delta]_\delta;z)-\nabla^{(2)}\ell_{[\delta]}([\theta_\star]_\delta;z)\right]_{\delta_\star,\delta_\star}\right\|_{\mathrm{F}}\right)\le e^{\frac{1}{2}\frac{s_\star^{1/2}\varpi_{2,M}}{\kappa(s)}\|\hat\theta_\delta-\theta_\star\|_2}\le e^{\frac{s_\star^{1/2}\varpi_{2,M}\varepsilon}{\kappa(s)}}.
\]
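The log-determinant perturbation bound invoked here is easy to sanity-check numerically; the following sketch (an illustration only, not part of the proof) verifies it on randomly generated symmetric positive definite matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

def logdet_bound_holds(d=6):
    # Two random symmetric positive definite matrices A and B.
    M1, M2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
    A = M1 @ M1.T + d * np.eye(d)
    B = M2 @ M2.T + d * np.eye(d)
    # |log det(A) - log det(B)| <= max(||A^{-1}||_F, ||B^{-1}||_F) ||B - A||_F
    lhs = abs(np.linalg.slogdet(A)[1] - np.linalg.slogdet(B)[1])
    rhs = max(np.linalg.norm(np.linalg.inv(A), 'fro'),
              np.linalg.norm(np.linalg.inv(B), 'fro')) * np.linalg.norm(B - A, 'fro')
    return lhs <= rhs

print(all(logdet_bound_holds() for _ in range(100)))  # prints True
```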
Then we use Lemma 26 to get
\[
\frac{\sqrt{\det\left([-\nabla^{(2)}\ell((\hat\theta_\delta,0)_\delta;z)]_{\delta_\star,\delta_\star}\right)}}{\sqrt{\det\left([-\nabla^{(2)}\ell((\hat\theta_\delta,0)_\delta;z)]_{\delta,\delta}\right)}}\le\left(\frac{1}{\sqrt{\kappa(s)}}\right)^k.
\]
It follows that for $z\in\mathcal{E}_{\rho,\Lambda}$,
\[
G_k\le G_0\,2\,e^{2\rho s^{1/2}\varepsilon+\frac{s_\star^{1/2}\varepsilon\,\varpi_{2,M}}{\kappa(s)}}\,e^{\Lambda k}\left(\frac{\rho}{2}\right)^k(2\pi)^{\frac{k}{2}}\left(\frac{1}{\sqrt{\kappa(s)}}\right)^k\sum_{\delta\supseteq\delta_\star,\ \|\delta\|_0=s_\star+k}\frac{\omega_\delta}{\omega_{\delta_\star}},
\]
and under H3, and for $p$ large enough so that $q\le 1/2$,
\[
\sum_{\delta\supseteq\delta_\star,\ \|\delta\|_0=s_\star+k}\frac{\omega_\delta}{\omega_{\delta_\star}}=\left(\frac{q}{1-q}\right)^k\binom{p-s_\star}{k}\le(2q)^k e^{k\log(p)}\le\left(\frac{2}{p^u}\right)^k,
\]
using the fact that $\binom{p-s_\star}{k}\le e^{k\log(p-s_\star)}\le e^{k\log(p)}$. Therefore
\[
\sum_{k=1}^{s}G_k\le G_0\,2\,e^{2\rho s^{1/2}\varepsilon+\frac{s_\star^{1/2}\varepsilon\,\varpi_{2,M}}{\kappa(s)}}\sum_{k=1}^{s}\left(\frac{\rho e^{\Lambda}}{p^u}\sqrt{\frac{2\pi}{\kappa(s)}}\right)^k.
\tag{6.5}
\]
It follows that for $\frac{2\rho e^{\Lambda}}{p^u}\sqrt{\frac{2\pi}{\kappa(s)}}\le 1$, we get
\[
\frac{\sum_{k=1}^{s}G_k}{\sum_{k=0}^{s}G_k}\le\frac{4\rho e^{\Lambda}}{p^u}\sqrt{\frac{2\pi}{\kappa(s)}}\,e^{2\rho s^{1/2}\varepsilon+\frac{s_\star^{1/2}\varepsilon\,\varpi_{2,M}}{\kappa(s)}},
\]
which, together with (6.4), implies the stated bound in (2.12).
Part two: Bernstein-von Mises approximation. We introduce the following probability distributions on $\Delta\times\mathbb{R}^p$:
\[
\Pi_{\gamma,\mathcal{B}}(\delta,d\theta|z)\propto\omega_\delta(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta\|_0}e^{-h_\gamma(\delta,\theta;z)}\mathbf{1}_{\mathcal{B}}(\delta,\theta)\,d\theta,
\]
\[
\check\Pi_{\gamma,\mathcal{B}}(\delta,d\theta|z)\propto\omega_\delta(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta\|_0}e^{\ell(\theta_\delta;z)-\rho\|\theta_\delta\|_1-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)}\mathbf{1}_{\mathcal{B}}(\delta,\theta)\,d\theta,
\]
and
\[
\Pi^{\infty}_{\gamma,\mathcal{B}}(\delta,d\theta|z)\propto\omega_\delta(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta\|_0}e^{\ell(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1}\times e^{-\frac{1}{2}([\theta]_\delta-\hat\theta_\delta)'\mathcal{I}_{\gamma,\delta}([\theta]_\delta-\hat\theta_\delta)-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)}\mathbf{1}_{\mathcal{B}}(\delta,\theta)\,d\theta.
\]
For any measurable subset $C$ of $\Delta\times\mathbb{R}^p$, we can obviously write
\[
|\Pi_\gamma(C|z)-\Pi^{\infty}_\gamma(C|z)|\le|\Pi_\gamma(C|z)-\Pi^{\infty}_{\gamma,\mathcal{B}}(C|z)|+|\Pi^{\infty}_{\gamma,\mathcal{B}}(C|z)-\Pi^{\infty}_\gamma(C|z)|.
\]
Since $\Pi^{\infty}_\gamma(\cdot|z)$ is one of the component probability measures of $\Pi^{\infty}_{\gamma,\mathcal{B}}(\cdot|z)$, by the coupling inequality we have
\[
\|\Pi^{\infty}_{\gamma,\mathcal{B}}(\cdot|z)-\Pi^{\infty}_\gamma(\cdot|z)\|_{\mathrm{tv}}\le 1-\Pi^{\infty}_{\gamma,\mathcal{B}}(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}|z)\le 1-\Pi_\gamma(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}|z)+\|\Pi_\gamma(\cdot|z)-\Pi^{\infty}_{\gamma,\mathcal{B}}(\cdot|z)\|_{\mathrm{tv}}.
\]
We conclude that
\[
\|\Pi_\gamma(\cdot|z)-\Pi^{\infty}_\gamma(\cdot|z)\|_{\mathrm{tv}}\le 1-\Pi_\gamma(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}|z)+2\,\|\Pi_\gamma(\cdot|z)-\Pi^{\infty}_{\gamma,\mathcal{B}}(\cdot|z)\|_{\mathrm{tv}}.
\tag{6.6}
\]
Hence it suffices to bound the rightmost term of (6.6). For a measurable subset $C$ of $\Delta\times\mathbb{R}^p$, we have
\[
|\Pi_\gamma(C|z)-\Pi^{\infty}_{\gamma,\mathcal{B}}(C|z)|\le|\Pi_\gamma(C|z)-\Pi_{\gamma,\mathcal{B}}(C|z)|+|\Pi_{\gamma,\mathcal{B}}(C|z)-\check\Pi_{\gamma,\mathcal{B}}(C|z)|+|\check\Pi_{\gamma,\mathcal{B}}(C|z)-\Pi^{\infty}_{\gamma,\mathcal{B}}(C|z)|.
\tag{6.7}
\]
To deal with the first term on the right-hand side of the inequality in (6.7), we first note that $\Pi_{\gamma,\mathcal{B}}(\cdot|z)$ is none other than the restriction of $\Pi_\gamma(\cdot|z)$ to the set $\mathcal{B}$. With this in mind, we make the following general observation. For any probability measure $\mu$, and a measurable set $A$ such that $\mu(A)>0$, if $\mu_A$ denotes the restriction of $\mu$ to $A$ (that is, $\mu_A(B)\overset{\mathrm{def}}{=}\mu(A\cap B)/\mu(A)$), we can decompose $\mu$ as $\mu=\mu_A+\mu(A^c)(\mu_{A^c}-\mu_A)$, where $A^c$ denotes the complement of $A$. This decomposition implies that for all measurable sets $B$,
\[
|\mu(B)-\mu_A(B)|\le\max\left(\mu(A^c\cap B),\ \frac{\mu(A\cap B)}{\mu(A)}\mu(A^c)\right)\le\mu(A^c).
\tag{6.8}
\]
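The bound (6.8) can be verified exhaustively on a small discrete example (an illustrative sketch, not part of the proof):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
mu = rng.dirichlet(np.ones(8))            # a probability measure mu on {0,...,7}
A = [0, 2, 3, 6]                          # a set with mu(A) > 0
mask = np.isin(np.arange(8), A)
muA = np.where(mask, mu, 0.0) / mu[mask].sum()   # restriction mu_A

# Check |mu(B) - mu_A(B)| <= mu(A^c) over every subset B of {0,...,7}.
worst = max(abs(mu[list(B)].sum() - muA[list(B)].sum())
            for r in range(9) for B in itertools.combinations(range(8), r))
mu_Ac = 1.0 - mu[mask].sum()
print(worst <= mu_Ac + 1e-12)             # prints True; B = A^c attains the bound
```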
In the particular case of $\Pi_\gamma$ and $\Pi_{\gamma,\mathcal{B}}$, this bound readily implies that
\[
\sup_{C\ \mathrm{meas.}}|\Pi_\gamma(C|z)-\Pi_{\gamma,\mathcal{B}}(C|z)|\le 1-\Pi_\gamma(\mathcal{B}|z).
\tag{6.9}
\]
We claim that for all $z\in\mathcal{E}_\rho$,
\[
\sup_{C\ \mathrm{meas.}}\left|\Pi_{\gamma,\mathcal{B}}(C|z)-\check\Pi_{\gamma,\mathcal{B}}(C|z)\right|\le 2\iota_1,
\tag{6.10}
\]
where
\[
\iota_1\overset{\mathrm{def}}{=}e^{\frac{a}{2}+3\gamma\rho^2 s+2\gamma\kappa(s)^2(M\varepsilon)^2}-1+\frac{1}{p^m}.
\]
To establish (6.10), we note that if $f\ge g$ are two unnormalized positive densities on some measurable space, with normalizing constants $Z_f$, $Z_g$ respectively, and $A$ is a measurable set, we have
\[
\left|\frac{\int_A f(x)\,dx}{Z_f}-\frac{\int_A g(x)\,dx}{Z_g}\right|=\left|\frac{(Z_g-Z_f)\int_A f(x)\,dx}{Z_g Z_f}+\frac{\int_A (f(x)-g(x))\,dx}{Z_g}\right|\le\left(\frac{Z_f}{Z_g}-1\right).
\tag{6.11}
\]
Owing to the first inequality of Lemma 19, we can apply this result with $f/Z_f$ as $\Pi_{\gamma,\mathcal{B}}(\cdot|z)$, and $g/Z_g$ as $\check\Pi_{\gamma,\mathcal{B}}(\cdot|z)$. Now, it suffices to control the ratio of normalizing constants of $\Pi_{\gamma,\mathcal{B}}(\cdot|z)$ and $\check\Pi_{\gamma,\mathcal{B}}(\cdot|z)$, given by
\[
\frac{\displaystyle\sum_{\delta\in\Delta_s}\omega_\delta(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\int_{\mathcal{B}^{(\delta)}}e^{-h_\gamma(\delta,\theta;z)}\,d\theta}{\displaystyle\sum_{\delta\in\Delta_s}\omega_\delta(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\int_{\mathcal{B}^{(\delta)}}e^{\ell(\theta_\delta;z)-\rho\|\theta_\delta\|_1-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)}\,d\theta}.
\tag{6.12}
\]
By the second inequality of Lemma 19, and (5.22), for $z\in\mathcal{E}_\rho$, $\delta\in\Delta_s$, and $\theta\in\mathcal{B}^{(\delta)}$, we have
\[
-h_\gamma(\delta,\theta;z)\le\ell(\theta_\delta;z)-\rho\|\theta_\delta\|_1-\frac{1}{2\gamma}(\theta-\theta_\delta)'A(\theta-\theta_\delta)+3\gamma\rho^2 s+2\gamma\kappa(s)^2(M\varepsilon)^2,
\]
where $A=I_p-\gamma(1+4\gamma\kappa(s))S$. Hence the numerator of (6.12) is upper bounded by
\[
(2\pi\gamma)^{p/2}\,e^{3\gamma\rho^2 s+2\gamma\kappa(s)^2(M\varepsilon)^2}\sum_{\delta\in\Delta_s}\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\left[\int_{\{\theta\in\mathbb{R}^p:\ \|\theta-\theta_\star\|_2\le M\varepsilon\}}e^{\ell(\theta;z)-\rho\|\theta\|_1}\mu_\delta(d\theta)\right]\times\frac{1}{\sqrt{\det([A]_{\delta^c})}},
\]
whereas the denominator is equal to
\[
(2\pi\gamma)^{p/2}\sum_{\delta\in\Delta_s}\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\left[\int_{\{\theta\in\mathbb{R}^p:\ \|\theta-\theta_\star\|_2\le M\varepsilon\}}e^{\ell(\theta;z)-\rho\|\theta\|_1}\mu_\delta(d\theta)\right]\frac{T_\delta}{\sqrt{\det([I_p-\gamma S]_{\delta^c})}},
\]
where $T_\delta$ is the probability of the set $\{u\in\mathbb{R}^{p-\|\delta\|_0}:\ \|u\|_2\le 2\sqrt{(1+m)\gamma p},\ \|u\|_\infty\le 2\sqrt{(m+1)\gamma\log(p)}\}$ under the distribution $\mathbf{N}(0,\gamma([I_p-\gamma S]_{\delta^c})^{-1})$, which is easily seen to be larger than $1-\frac{1}{p^m}$ for all $m\ge 4$, by standard Gaussian tail bounds. From these results and (5.11) we conclude that the ratio (6.12) is upper bounded by $(1-\frac{1}{p^m})^{-1}e^{3\gamma\rho^2 s+2\gamma\kappa(s)^2(M\varepsilon)^2+\frac{a}{2}}$, and this together with (6.11) implies (6.10).
We claim that for all $z\in\mathcal{E}_\rho$,
\[
\sup_{C\ \mathrm{meas.}}\left|\check\Pi_{\gamma,\mathcal{B}}(C|z)-\Pi^{\infty}_{\gamma,\mathcal{B}}(C|z)\right|\le 16\left((M+1)\rho s^{1/2}\varepsilon+\frac{(M+1)^3}{3}\varepsilon^3\varpi_{2,M}\right)e^{4(M+1)\rho s^{1/2}\varepsilon+\frac{4}{3}(M+1)^3\varepsilon^3\varpi_{2,M}}.
\tag{6.13}
\]
We establish this with the following general observation. Let $\{a_j\}$, $\{b_j\}$ be two discrete probability distributions on some discrete set $J$, with $a_j>0$. Let $\mu(j,dx)=a_j\mu_j(dx)$, $\nu(j,dx)=b_j\nu_j(dx)$ be two probability measures on $J\times\mathbb{R}^p$, where for each $j$, $\mu_j(dx)$ and $\nu_j(dx)$ are equivalent probability measures supported by some measurable subset of $\mathbb{R}^p$. For any measurable set $A\subset J\times\mathbb{R}^p$, we have
\[
\mu(A)-\nu(A)=\sum_j\left(1-\frac{b_j}{a_j}\right)a_j\mu_j(A^{(j)})+\sum_j\frac{b_j}{a_j}\int_{A^{(j)}}\left(1-\frac{d\nu_j}{d\mu_j}(x)\right)a_j\mu_j(dx),
\]
where $A^{(j)}\overset{\mathrm{def}}{=}\{x\in\mathbb{R}^p:\ (j,x)\in A\}$. This implies that
\[
\sup_{A\ \mathrm{meas.}}|\mu(A)-\nu(A)|\le\sup_j\left|1-\frac{b_j}{a_j}\right|+\sup_j\left(\frac{b_j}{a_j}\right)\sup_x\left|1-\frac{d\nu_j}{d\mu_j}(x)\right|.
\tag{6.14}
\]
We apply this result with $\mu$ taken as $\check\Pi_{\gamma,\mathcal{B}}$, and $\nu$ taken as $\Pi^{\infty}_{\gamma,\mathcal{B}}$. In that case the mixture weights are
\[
\bar a_\delta=\frac{a_\delta}{\sum_{\delta\in\mathcal{A}}a_\delta},\qquad \bar b_\delta=\frac{b_\delta}{\sum_{\delta\in\mathcal{A}}b_\delta},
\]
where
\[
a_\delta=\omega_\delta(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\int_{\mathcal{B}^{(\delta)}}e^{\ell(\theta_\delta;z)-\rho\|\theta_\delta\|_1-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)}\,d\theta,
\]
\[
b_\delta=\omega_\delta(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta\|_0}e^{\ell(\hat\theta_\delta;z)-\rho\|\hat\theta_\delta\|_1}\times\int_{\mathcal{B}^{(\delta)}}e^{-\frac{1}{2}([\theta]_\delta-\hat\theta_\delta)'\mathcal{I}_{\gamma,\delta}([\theta]_\delta-\hat\theta_\delta)-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)}\,d\theta.
\]
We have
\[
\frac{\min_{\delta\in\mathcal{A}}\frac{b_\delta}{a_\delta}}{\max_{\delta\in\mathcal{A}}\frac{b_\delta}{a_\delta}}\le\frac{\bar b_\delta}{\bar a_\delta}\le\frac{\max_{\delta\in\mathcal{A}}\frac{b_\delta}{a_\delta}}{\min_{\delta\in\mathcal{A}}\frac{b_\delta}{a_\delta}}.
\]
We take a Taylor expansion of the function $u\mapsto\ell_{[\delta]}(u;z)$ to the third order around $\hat\theta_\delta$, and note that $\nabla\ell_{[\delta]}(\hat\theta_\delta;z)=0$, to conclude that for all $z\in\mathcal{E}_\rho$, $\theta\in\mathcal{B}^{(\delta)}$,
\[
\left|\ell(\theta_\delta;z)-\rho\|\theta_\delta\|_1-\ell(\hat\theta_\delta;z)+\rho\|\hat\theta_\delta\|_1+\frac{1}{2}([\theta]_\delta-\hat\theta_\delta)'\mathcal{I}_{\gamma,\delta}([\theta]_\delta-\hat\theta_\delta)\right|\le\rho\|\theta_\delta-\hat\theta_\delta\|_1+\frac{1}{6}\varpi_{2,M}\|[\theta]_\delta-\hat\theta_\delta\|_2^3\le c_1,
\tag{6.15}
\]
where $c_1=(M+1)\rho s^{1/2}\varepsilon+\frac{(M+1)^3}{6}\varpi_{2,M}\varepsilon^3$. It follows easily that $e^{-c_1}\le\frac{b_\delta}{a_\delta}\le e^{c_1}$, so that $e^{-2c_1}\le\frac{\bar b_\delta}{\bar a_\delta}\le e^{2c_1}$. Similarly, the Radon-Nikodym derivative satisfies
\[
e^{-2c_1}\le\frac{d\nu_\delta}{d\mu_\delta}(\theta)\le e^{2c_1}.
\]
With these bounds, (6.13) easily follows from (6.14), and the fact that for all $x\in(e^{-a},e^{a})$ for some $a>0$, we have $|1-x|\le ae^{a}$. The theorem then follows from (6.13), (6.10), (6.9), (6.7) and (6.6).
7. Proof of Corollary 10. We know from Lemma 22 that $\Pi_\gamma$ is well-defined for all $z\in\mathbb{R}^n$ and all $\gamma>0$ such that $4\gamma\lambda_{\max}(X'X)/\sigma^2\le 1$. H6 readily implies H2. Ignoring constants, it is straightforward that
\[
\ell(\theta;z)=-\frac{1}{2\sigma^2}\|z-X\theta\|_2^2,\qquad\nabla\ell(\theta;z)=\frac{1}{\sigma^2}X'(z-X\theta),\qquad\nabla^{(2)}\ell(\theta;z)=-\frac{1}{\sigma^2}X'X.
\]
Hence H1 holds with $\bar S=S=(X'X)/\sigma^2$. To apply Lemma 21 we need to check (5.2).
For $\gamma>0$, and $\delta\in\Delta$, we have
\[
L_\gamma(\delta,\theta;z)=\ell(\theta;z)-\ell(\theta_\star;z)-\langle\nabla\ell(\theta_\star;z),\theta-\theta_\star\rangle+\frac{2\gamma}{\sigma^4}(\theta-\theta_\star)'X'X_\delta X'_\delta X(\theta-\theta_\star)\le-\frac{n}{2\sigma^2}(\theta-\theta_\star)'\left(\frac{X'X}{n}\right)(\theta-\theta_\star)+\frac{2n\gamma\lambda_{\max}(X_\delta X'_\delta)}{\sigma^4}(\theta-\theta_\star)'\left(\frac{X'X}{n}\right)(\theta-\theta_\star).
\]
Using this and the moment generating function of the Gaussian distribution yields
\[
\log\mathbb{E}_\star\left[e^{L_\gamma(\delta,\theta;Z)+\left(1-\frac{\underline{\rho}}{\rho}\right)\langle\nabla\ell(\theta_\star;Z),\theta-\theta_\star\rangle}\mathbf{1}_{\mathcal{E}_\rho}(Z)\right]\le-\frac{n}{2\sigma^2}\left(1-\frac{4\gamma\lambda_{\max}(XX')}{\sigma^2}-\left(1-\frac{\underline{\rho}}{\rho}\right)^2\right)(\theta-\theta_\star)'\left(\frac{X'X}{n}\right)(\theta-\theta_\star)\le-\frac{n}{2\sigma^2}\left(\frac{\underline{\rho}}{\rho}-\frac{4\gamma\lambda_{\max}(XX')}{\sigma^2}\right)(\theta-\theta_\star)'\left(\frac{X'X}{n}\right)(\theta-\theta_\star).
\]
Since $\underline{\rho}/\rho=1/\log(p)$, $(8/\sigma^2)\gamma\log(p)\lambda_{\max}(X'X)\le 1$, and given H7, we readily deduce that (5.2) holds with the rate function $r_0(x)=\frac{nv}{2\sigma^2\log(p)}x^2$. In that case $a_0=128\frac{s_\star}{v}$.
Furthermore, under the stated assumptions we easily check that $\eta$ in (5.3) satisfies $\eta\le\frac{2}{u}(m_0+1+2s_\star)$. This naturally suggests taking $s=s_\star+\frac{2}{u}(m_0+1+2s_\star)$ in H4. For $\|\delta\|_0\le s$ and $\theta\in\mathbb{R}^p_\delta$, the upper bound on $L_\gamma(\delta,\theta;z)$ obtained above readily shows that
\[
\log\mathbb{E}_\star\left[e^{L_\gamma(\delta,\theta;Z)}\mathbf{1}_{\mathcal{E}_\rho}(Z)\right]\le L_\gamma(\delta,\theta;Z)\le-\frac{n}{2\sigma^2}\left(1-\frac{4\gamma n v(s)}{\sigma^2}\right)(\theta-\theta_\star)'\left(\frac{X'X}{n}\right)(\theta-\theta_\star).
\]
Since $\gamma n=o(1)$ as $p\to\infty$, we see that for all $p$ large enough, H4 holds with the rate function $r_1(x)=\frac{n v(s)}{2\sigma^2}x^2$, and from the definitions, we obtain that
\[
\varepsilon=\frac{6\sigma^2\rho(s_\star+s)^{1/2}}{n v(s)}=\frac{24\sigma}{v(s)}\sqrt{\frac{(s_\star+s)\log(p)}{n}}.
\]
We can then apply Theorem 5. Condition (2.7) and $a=o(\log(p))$ are implied by the assumptions on $\gamma$ in (2.17) of the main manuscript. We also easily see that $\gamma\rho^2 s+\gamma\kappa(s)^2(M\varepsilon)^2=O(s)=o(\log(p))$. Consequently, for all $m\ge 2$, $4p^{-m}e^{\frac{a}{2}+3\gamma\rho^2 s+2\gamma\kappa(s)^2(M\varepsilon)^2}\le p^{-(m-1)}$. Hence by Theorem 5, for any $m>1$, $p$ large enough, and for any event $\mathcal{E}$ such that $\mathbb{P}_\star(Z\in\mathcal{E})\ge 1/2$,
\[
\mathbb{E}_\star\left[\Pi_\gamma\left(\mathcal{B}_{m,M}|Z\right)\,\middle|\,Z\in\mathcal{E}\right]\ge 1-\frac{1}{p^{m_0}}-\frac{1}{p^{m-1}}-2e^{s\log(9p)}\sum_{j\ge 1}e^{-\frac{1}{6}r_1\left(\frac{jM\varepsilon}{2}\right)}-4e^{\frac{a}{2}}p^{(2+u)s_\star}\left(1+\frac{\kappa(s_\star)}{\rho^2}\right)^{s_\star}\sum_{j\ge 1}e^{-\frac{1}{6}r_1\left(\frac{jM\varepsilon}{2}\right)+8\rho s^{1/2}\left(\frac{jM\varepsilon}{2}\right)}.
\]
Given the expression of $r_1$ found above, we check that
\[
\sum_{j\ge 1}e^{-\frac{1}{6}r_1\left(\frac{jM\varepsilon}{2}\right)}\le\frac{e^{-\frac{n v(s)}{12\sigma^2}(M\varepsilon/2)^2}}{1-e^{-\frac{n v(s)}{12\sigma^2}(M\varepsilon/2)^2}},
\]
and noting that $v(s)\le 1$ as a consequence of (2.14), we have
\[
\frac{n v(s)}{12\sigma^2}\left(\frac{M\varepsilon}{2}\right)^2=\frac{12M^2}{v(s)}(s_\star+s)\log(p)\ge 2M^2 s\log(9p),
\]
for all $p\ge 2$. Hence,
\[
e^{s\log(9p)}\sum_{j\ge 1}e^{-\frac{1}{6}r_1\left(\frac{jM\varepsilon}{2}\right)}\le\frac{2}{(9p)^{M^2 s}}\le\frac{1}{p^{M^2 s}}.
\]
Similarly, we check that
\[
-\frac{1}{12}r_1\left(\frac{jM\varepsilon}{2}\right)+8\rho s^{1/2}\left(\frac{jM\varepsilon}{2}\right)\le 0,
\]
for all $j\ge 1$ if $M\ge 64/n$. Hence, given $M>2$, for all $p$ large enough (such that $n\ge 32$), we have
\[
\sum_{j\ge 1}e^{-\frac{1}{6}r_1\left(\frac{jM\varepsilon}{2}\right)+8\rho s^{1/2}\left(\frac{jM\varepsilon}{2}\right)}\le\sum_{j\ge 1}e^{-\frac{1}{12}r_1\left(\frac{jM\varepsilon}{2}\right)}\le 2\exp\left(-6M^2(s_\star+s)\log(p)\right),
\]
and since $a=o(\log(p))$, and
\[
\log\left(1+\frac{\kappa(s_\star)}{\rho^2}\right)^{s_\star}=s_\star\log\left(1+\frac{v(s_\star)}{16\log(p)}\right)=o(s_\star\log(p)),
\]
as $p\to\infty$. Let us take $M>2$ such that $3M^2\ge 2+u$. We then get
\[
4e^{\frac{a}{2}}p^{(2+u)s_\star}\left(1+\frac{\kappa(s_\star)}{\rho^2}\right)^{s_\star}\sum_{j\ge 1}e^{-\frac{1}{6}r_1\left(\frac{jM\varepsilon}{2}\right)+8\rho s^{1/2}\left(\frac{jM\varepsilon}{2}\right)}\le\frac{1}{p^{M^2 s}},
\]
for all $p$ large enough. We conclude that there exists an absolute constant $A_0$ such that for all $p\ge A_0$,
\[
\mathbb{E}_\star\left[\Pi_\gamma\left(\mathcal{B}_{m,M}|Z\right)\,\middle|\,Z\in\mathcal{E}\right]\ge 1-\frac{1}{p^{m_0}}-\frac{1}{p^{m-1}}-\frac{1}{p^{M^2 s}},
\tag{7.1}
\]
for any measurable subset $\mathcal{E}\subseteq\mathcal{E}_\rho$ such that $\mathbb{P}_\star(Z\in\mathcal{E})\ge 1/2$.
To apply Theorem 8, we need to check H5 and (2.8) of the main manuscript. For any given $M\ge\max\left(2,\sqrt{\frac{u+2}{3}}\right)$, (2.18) of the main manuscript implies (2.8), for all $p$ large enough. Since $-\nabla^{(2)}\ell_{[\delta]}(\hat\theta_\delta(z);z)=(X'_\delta X_\delta)/\sigma^2$, it is straightforward that
\[
\kappa(s)=\frac{n v(s)}{\sigma^2}>0,\qquad\text{and}\qquad\varpi_2=0.
\]
By (2.10) of the main manuscript, for $\|\delta\|_0\le s$, we get
\[
\|\hat\theta_\delta(z)-[\theta_\star]_\delta\|_2\,\mathbf{1}_{\mathcal{E}_\rho}(z)\le\frac{\rho s^{1/2}}{\kappa(s)}\le\frac{1}{6}\varepsilon\le\varepsilon.
\]
This shows that H5 holds. We shall apply Theorem 8 with $\Lambda=3\log(n\wedge p)$. We easily check that $(M-1)\varepsilon\,\kappa(s)^{1/2}\ge s_\star^{1/2}+2$,
\[
\frac{\rho e^{\Lambda}}{p^u}\sqrt{\frac{2\pi}{\kappa(s)}}\le\frac{\log(p)}{p^{u-2}}\,4\sqrt{\frac{2}{v(s)}},
\]
and
\[
\left(\frac{4\rho e^{\Lambda}}{p^u}\sqrt{\frac{2\pi}{\kappa(s)}}\right)e^{2(M+2)\rho s^{1/2}\varepsilon}\,e^{2\gamma\kappa(s)^2(M\varepsilon)^2}\,e^{3\gamma\rho^2 s}\,e^{\frac{a}{2}}\le\frac{\log(p)}{p^{u-2}}\,4\sqrt{\frac{2}{v(s)}}\,e^{o(\log(p))}\le\frac{1}{p^{u-3}},
\]
for all $p$ large enough. Hence we can apply Theorem 8 to conclude that
\[
\Pi_\gamma(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}_{m,M}|Z)\,\mathbf{1}_{\mathcal{E}_{\rho,\Lambda}}(Z)\ge\Pi_\gamma(\mathcal{B}_{m,M}|Z)\,\mathbf{1}_{\mathcal{E}_{\rho,\Lambda}}(Z)-\frac{1}{p^{u-3}}\,\mathbf{1}_{\mathcal{E}_{\rho,\Lambda}}(Z).
\]
Taking the expectation on both sides and dividing by $\mathbb{P}_\star(Z\in\mathcal{E}_{\rho,\Lambda})$, together with (7.1), yields
\[
\mathbb{E}_\star\left[\Pi_\gamma(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}_{m,M}|Z)\,\middle|\,Z\in\mathcal{E}\right]\ge 1-\frac{1}{p^{m_0}}-\frac{1}{p^{m-1}}-\frac{1}{p^{M^2 s}}-\frac{1}{p^{u-3}}.
\]
It remains to control the term $\mathbb{P}_\star\left[Z\notin\mathcal{E}_{\rho,\Lambda}\right]$. By Gaussian tail bounds, we see that H6 and (2.14) imply that
\[
\mathbb{P}_\star\left(Z\notin\mathcal{E}_\rho\right)\le\frac{2}{p}.
\tag{7.2}
\]
Since $Z=X_{\delta_\star}\theta_\star+\sigma U$, where $U\sim\mathbf{N}(0,I_n)$, for any $\delta\in\mathcal{A}$ with $\delta\neq\delta_\star$, we have
\[
\ell(\hat\theta_\delta;Z)-\ell(\theta_\star;Z)=U'\left[X_\delta(X'_\delta X_\delta)^{-1}X'_\delta-X_{\delta_\star}(X'_{\delta_\star}X_{\delta_\star})^{-1}X'_{\delta_\star}\right]U=\|Q'_{\delta-\delta_\star}U\|_2^2,
\]
where $X=QR$, with $Q\in\mathbb{R}^{n\times(p\wedge n)}$ and $R\in\mathbb{R}^{(p\wedge n)\times p}$, denotes the QR decomposition of $X$. Using this, and by Lemma 5 of [11], which provides a deviation bound on the maximum of chi-square random variables, we can find an absolute constant $c$ such that
\[
\mathbb{P}_\star(Z\notin\mathcal{E}_\Lambda)=\mathbb{P}_\star\left[\cup_{k=1}^{s-s_\star}\left\{\max_{\delta\in\mathcal{A}:\ \|\delta\|_0=s_\star+k}\ell(\hat\theta_\delta;Z)-\ell(\theta_\star;Z)>3k\log(n\wedge p)\right\}\right]\le\sum_{k=1}^{s-s_\star}\mathbb{P}_\star\left[\max_{\delta\in\mathcal{A}:\ \|\delta\|_0=s_\star+k}\ell(\hat\theta_\delta;Z)-\ell(\theta_\star;Z)>3k\log(n\wedge p)\right]\le\sum_{k=1}^{s-s_\star}\frac{e^{ck}}{\binom{n\wedge p}{k}^{\frac{1}{4}}}.
\]
Since $\binom{n\wedge p}{k}\ge(n\wedge p-k)^k=e^{k\log(n\wedge p-k)}\ge e^{k\log((n\wedge p)/2)}$ for $k\le s\le(n\wedge p)/2$, for $n,p$ large enough, we get
\[
\sum_{k=1}^{s-s_\star}\frac{e^{ck}}{\binom{n\wedge p}{k}^{\frac{1}{4}}}\le\sum_{k=1}^{s}e^{-\frac{k}{4}\left(\log((n\wedge p)/2)-4c\right)}\le 2e^{-\frac{1}{4}\left(\log((n\wedge p)/2)-4c\right)}=\frac{C_1}{(n\wedge p)^{\frac{1}{4}}},
\]
for some absolute constant $C_1$. Hence
\[
\mathbb{P}_\star\left[Z\notin\mathcal{E}_{\rho,\Lambda}\right]\le\frac{2}{p}+\frac{C_1}{(n\wedge p)^{\frac{1}{4}}}.
\]
The Bernstein-von Mises approximation part of the theorem follows easily. $\square$
8. Technical lemmas needed in the proof of Lemma 18.

Lemma 27. Assume H6. Then there exists an absolute constant $A_0$ such that for all $p\ge A_0$, all $z\in\mathbb{R}^n$, all $\gamma>0$, and all $\theta\in\mathbb{R}^p$, we have
\[
\left|h_\gamma(\delta_\star,\theta;z)-\bar h_\gamma(\delta_\star,\theta;z)\right|\le\frac{\|\delta_\star\|_0\,\gamma}{2}\left(\rho+\frac{C(X)\sqrt{n}}{\sigma^2}\|\theta-\theta_\delta\|_2\right)^2.
\]
Proof. Fix $\gamma>0$, $\delta=\delta_\star$, $\theta$, and $z$ as above. Using (3.2) and (3.6) we have
\[
h_\gamma(\delta,\theta;z)-\bar h_\gamma(\delta,\theta;z)=\rho\left(\|J_\gamma(\delta,\theta)\|_1-\|\bar J_\gamma(\delta,\theta)\|_1\right)-\frac{1}{2\gamma}\left\langle J_\gamma(\delta,\theta)-\bar J_\gamma(\delta,\theta),\ J_\gamma(\delta,\theta)-\theta_\delta-\gamma\nabla\ell(\theta;z)+\bar J_\gamma(\delta,\theta)-\theta_\delta-\gamma\nabla\ell(\theta_\delta;z)\right\rangle.
\]
From the definition of the proximal operator, there exist $E_{1,j}\in[-1,1]$ and $E_{2,j}\in[-1,1]$ such that
\[
J_{\gamma,j}(\delta,\theta)=\left(\theta_j+\gamma\nabla_j\ell(\theta;z)+\gamma\rho E_{1,j}\right)\delta_j,\qquad\text{and}\qquad \bar J_{\gamma,j}(\delta,\theta)=\left(\theta_j+\gamma\nabla_j\ell(\theta_\delta;z)+\gamma\rho E_{2,j}\right)\delta_j.
\]
Hence for $j$ such that $\delta_j=1$, we have
\[
\left|J_{\gamma,j}(\delta,\theta)-\bar J_{\gamma,j}(\delta,\theta)\right|\le 2\gamma\rho+\frac{\gamma}{\sigma^2}\left|\left\langle X_j,\sum_{k:\ \delta_k=0}\theta_k X_k\right\rangle\right|\le 2\gamma\rho+\frac{\gamma C(X)\sqrt{n}}{\sigma^2}\|\theta-\theta_\delta\|_2,
\]
using the coherence parameter $C(X)$. Similarly,
\[
\left|J_{\gamma,j}(\delta,\theta)-(\theta_\delta)_j-\gamma\nabla_j\ell(\theta;z)+\bar J_{\gamma,j}(\delta,\theta)-(\theta_\delta)_j-\gamma\nabla_j\ell(\theta_\delta;z)\right|\le 2\gamma\rho+\frac{\gamma C(X)\sqrt{n}}{\sigma^2}\|\theta-\theta_\delta\|_2.
\]
We conclude that
\[
\left|h_\gamma(\delta,\theta;z)-\bar h_\gamma(\delta,\theta;z)\right|\le\rho\|\delta\|_0\left(2\gamma\rho+\frac{\gamma C(X)\sqrt{n}}{\sigma^2}\|\theta-\theta_\delta\|_2\right)+\frac{\|\delta\|_0}{2\gamma}\left(2\gamma\rho+\frac{\gamma C(X)\sqrt{n}}{\sigma^2}\|\theta-\theta_\delta\|_2\right)^2.
\]
The result follows easily. $\square$
Lemma 28. Assume H6, H7. Then there exist constants $A_0,C_0$ such that for all $p\ge A_0$, all $m\ge 1$, $M>2$, and all $\theta_1,\theta_2\in\mathcal{B}^{(\delta_\star)}_{m,M}$ such that $[\theta_1]_{\delta_\star^c}=[\theta_2]_{\delta_\star^c}$, we have
\[
\sup_{z\in\mathcal{E}_\rho}\left|h_\gamma(\delta_\star,\theta_1;z)-h_\gamma(\delta_\star,\theta_2;z)\right|\le C_0\left(\left(1+\frac{\gamma n s_\star^{1/2}}{\sigma^2}\right)(\rho+nM\varepsilon)+\frac{\gamma n s_\star^{1/2}}{\sigma^2}C(X)\sqrt{(m+1)\gamma np}\right)\|\theta_2-\theta_1\|_1+C_0\,\gamma s_\star\left(1+\frac{\gamma n s_\star^{1/2}}{\sigma^2}\right)(\rho+nM\varepsilon)\left(\rho+nM\varepsilon+C(X)\sqrt{(m+1)\gamma np}\right).
\]
Proof. For convenience we write $\delta$ and $\mathcal{B}^{(\delta)}$ instead of $\delta_\star$ and $\mathcal{B}^{(\delta_\star)}_{m,M}$, respectively. We also set $s=\|\delta_\star\|_0$. Fix $z\in\mathcal{E}_\rho$, and $\theta_1,\theta_2\in\mathcal{B}^{(\delta)}$ such that $[\theta_1]_{\delta^c}=[\theta_2]_{\delta^c}$. We start with some general remarks. For $\theta\in\mathbb{R}^p$, since $\ell$ is quadratic and $\nabla^{(2)}\ell(\theta;z)=-\frac{1}{\sigma^2}X'X$, we have
\[
\nabla\ell(\theta;z)=\nabla\ell(\theta_\star;z)-\frac{1}{\sigma^2}(X'X)(\theta-\theta_\delta)-\frac{1}{\sigma^2}(X'X)(\theta_\delta-\theta_\star).
\]
Hence for $z\in\mathcal{E}_\rho$, and $j$ such that $\delta_j=1$, if $\nabla_j$ denotes the partial derivative operator with respect to $\theta_j$, we have
\[
|\nabla_j\ell(\theta;z)|\le\frac{\rho}{2}+\frac{\sqrt{n}\,C(X)}{\sigma^2}\|\theta-\theta_\delta\|_2+\frac{n\sqrt{v(s)}}{\sigma^2}\|\theta_\delta-\theta_\star\|_2.
\]
Hence for all $\theta\in\mathbb{R}^p$,
\[
\sup_{z\in\mathcal{E}_\rho}\max_{j:\ \delta_j=1}|\nabla_j\ell(\theta;z)|\le\frac{\rho}{2}+\frac{\sqrt{n}\,C(X)}{\sigma^2}\|\theta-\theta_\delta\|_2+\frac{n\sqrt{v(s)}}{\sigma^2}\|\theta_\delta-\theta_\star\|_2.
\tag{8.1}
\]
From (3.2), we have
\[
h_\gamma(\delta,\theta;z)=-\ell(\theta;z)-\langle\nabla\ell(\theta;z),J_\gamma(\delta,\theta)-\theta\rangle+\rho\|J_\gamma(\delta,\theta)\|_1+\frac{1}{2\gamma}\|J_\gamma(\delta,\theta)-\theta\|_2^2.
\]
Since $\ell$ is quadratic, we have
\[
-\ell(\theta;z)-\langle\nabla\ell(\theta;z),J_\gamma(\delta,\theta)-\theta\rangle=-\ell\left(J_\gamma(\delta,\theta);z\right)+\frac{1}{2\sigma^2}[J_\gamma(\delta,\theta)-\theta]'(X'X)[J_\gamma(\delta,\theta)-\theta].
\]
Hence
\[
h_\gamma(\delta,\theta_2;z)-h_\gamma(\delta,\theta_1;z)=U^{(1)}+U^{(2)}+U^{(3)}+U^{(4)},
\]
where
\[
U^{(1)}\overset{\mathrm{def}}{=}\ell\left(J_\gamma(\delta,\theta_1);z\right)-\ell\left(J_\gamma(\delta,\theta_2);z\right),
\]
\[
U^{(2)}\overset{\mathrm{def}}{=}\frac{1}{2\sigma^2}\left([J_\gamma(\delta,\theta_2)-\theta_2]'(X'X)[J_\gamma(\delta,\theta_2)-\theta_2]-[J_\gamma(\delta,\theta_1)-\theta_1]'(X'X)[J_\gamma(\delta,\theta_1)-\theta_1]\right),
\]
\[
U^{(3)}\overset{\mathrm{def}}{=}\rho\left(\|J_\gamma(\delta,\theta_2)\|_1-\|J_\gamma(\delta,\theta_1)\|_1\right),
\]
\[
U^{(4)}\overset{\mathrm{def}}{=}\frac{1}{2\gamma}\left(\|J_\gamma(\delta,\theta_2)-\theta_2\|_2^2-\|J_\gamma(\delta,\theta_1)-\theta_1\|_2^2\right).
\]
Note that $U^{(1)}=\langle\nabla\ell(\bar\theta;z),J_\gamma(\delta,\theta_1)-J_\gamma(\delta,\theta_2)\rangle$, for some $\bar\theta$ on the segment between $J_\gamma(\delta,\theta_1)$ and $J_\gamma(\delta,\theta_2)$. Therefore using (8.1) we get
\[
\left|U^{(1)}+U^{(3)}\right|\le\left(\frac{3\rho}{2}+\frac{\sqrt{n}\,C(X)}{\sigma^2}\|\bar\theta-\bar\theta_\delta\|_2+\frac{n\sqrt{v(s)}}{\sigma^2}\|\bar\theta_\delta-\theta_\star\|_2\right)\|J_\gamma(\delta,\theta_2)-J_\gamma(\delta,\theta_1)\|_1.
\]
Since $\bar\theta$ lies on the segment between $J_\gamma(\delta,\theta_1)$ and $J_\gamma(\delta,\theta_2)$, $\bar\theta-\bar\theta_\delta=0$, and we have
\[
\|\bar\theta_\delta-\theta_\star\|_2\le\max\left(\|J_\gamma(\delta,\theta_1)-\theta_\star\|_2,\ \|J_\gamma(\delta,\theta_2)-\theta_\star\|_2\right).
\tag{8.2}
\]
Given $\theta\in\mathcal{B}^{(\delta)}$, we recall that if $\delta_j=0$, $J_{\gamma,j}(\delta,\theta)=0$, and if $\delta_j=1$, we have $J_{\gamma,j}(\delta,\theta)=\theta_j+\gamma\nabla_j\ell(\theta;Z)+\gamma\rho E_j$, for some $E_j\in[-1,1]$ (that depends on $\theta$). Hence if $\delta_j=1$, using (8.1) and the fact that $\theta\in\mathcal{B}^{(\delta)}$, we have
\[
|J_{\gamma,j}(\delta,\theta)-\theta_{\star,j}|\le|\theta_j-\theta_{\star,j}|+\gamma\left(\frac{3\rho}{2}+\frac{2\sqrt{np}\,C(X)\sqrt{(1+m)\gamma\log(p)}}{\sigma^2}+\frac{n\sqrt{v(s)}\,M\varepsilon}{\sigma^2}\right).
\]
It follows that for any $\theta\in\mathcal{B}^{(\delta)}$,
\[
\|J_\gamma(\delta,\theta)-\theta_\star\|_2\le M\varepsilon+\gamma s^{1/2}\left(\frac{3\rho}{2}+\frac{2\sqrt{n}\,C(X)}{\sigma^2}\sqrt{(m+1)\gamma p}+\frac{n\sqrt{v(s)}}{\sigma^2}M\varepsilon\right).
\]
Using the last display in (8.2) yields
\[
\|\bar\theta_\delta-\theta_\star\|_2\le M\varepsilon+\gamma s^{1/2}\left(\frac{3\rho}{2}+\frac{2\sqrt{n}\,C(X)\sqrt{(m+1)\gamma p}}{\sigma^2}+\frac{n\sqrt{v(s)}}{\sigma^2}M\varepsilon\right).
\]
Similarly, we have
\[
\|J_\gamma(\delta,\theta_2)-J_\gamma(\delta,\theta_1)\|_1\le\|\theta_2-\theta_1\|_1+\gamma s\left(2\rho+\frac{n}{\sigma^2}v(s)^{1/2}\|\theta_2-\theta_1\|_2\right)\le\|\theta_2-\theta_1\|_1+2\gamma s\left(\rho+\frac{n}{\sigma^2}v(s)^{1/2}M\varepsilon\right).
\]
We then get
\[
\left|U^{(1)}+U^{(3)}\right|\le C_1\left(\gamma s\left[\rho+\frac{n}{\sigma^2}M\varepsilon\right]+\|\theta_2-\theta_1\|_1\right)\times\left[\left(1+\frac{\gamma n s^{1/2}}{\sigma^2}\right)(\rho+nM\varepsilon)+\frac{\gamma n s^{1/2}}{\sigma^2}\,\frac{C(X)\sqrt{n}\sqrt{(m+1)\gamma p}}{\sigma^2}\right],
\]
for all $p$ large enough, and for some absolute constant $C_1$, using the fact that $v(s)=O(1)$. For $U^{(2)}$, we write
\[
U^{(2)}=\frac{1}{2\sigma^2}\Delta_1'(X'X)\Delta_2,
\]
where $\Delta_1=J_\gamma(\delta,\theta_2)-\theta_2-J_\gamma(\delta,\theta_1)+\theta_1$, and $\Delta_2=J_\gamma(\delta,\theta_2)-\theta_2+J_\gamma(\delta,\theta_1)-\theta_1$.
We note that $\Delta_{1,j}=0$ for $\delta_j=0$. Hence
\[
|U^{(2)}|\le\frac{1}{2\sigma^2}\|\Delta_1\|_1\max_{j:\ \delta_j=1}|\langle X_j,X\Delta_2\rangle|\le\frac{1}{2\sigma^2}\|\Delta_1\|_1\left(n v(s)^{1/2}\|\Delta_{2,\delta}\|_2+C(X)\sqrt{n}\,\|\Delta_2-\Delta_{2,\delta}\|_2\right).
\]
With the same calculations as above we have
\[
\|\Delta_1\|_1\le 2\gamma s\left(\rho+\frac{n}{\sigma^2}v(s)^{1/2}M\varepsilon\right),
\]
\[
\|\Delta_{2,\delta}\|_2\le s^{1/2}\gamma\left(2\rho+\max_{j:\ \delta_j=1}|\nabla_j\ell(\theta_1;z)|+\max_{j:\ \delta_j=1}|\nabla_j\ell(\theta_2;z)|\right)\le\frac{C\gamma s^{1/2}}{\sigma^2}\left(\sigma^2\rho+nM\varepsilon+C(X)\sqrt{n}\sqrt{(m+1)\gamma p}\right),
\]
and
\[
\|\Delta_2-\Delta_{2,\delta}\|_2=\|(\theta_1-\theta_{1,\delta})+(\theta_2-\theta_{2,\delta})\|_2\le 4\sqrt{(m+1)\gamma p}.
\]
Therefore
\[
|U^{(2)}|\le\frac{C\gamma s}{\sigma^2}\left(\rho+\frac{n}{\sigma^2}M\varepsilon\right)\left[\frac{n\gamma s^{1/2}}{\sigma^2}(\rho+nM\varepsilon)+\left(1+\frac{\gamma n s^{1/2}}{\sigma^2}\right)C(X)\sqrt{(m+1)\gamma np}\right],
\]
for all $p$ large enough. For $U^{(4)}$, we proceed similarly as with $U^{(2)}$ to get
\[
|U^{(4)}|\le\frac{C\gamma s}{\sigma^2}\left(\rho+\frac{n}{\sigma^2}M\varepsilon\right)\left[\frac{n\gamma s^{1/2}}{\sigma^2}(\rho+nM\varepsilon)+\left(1+\frac{\gamma n s^{1/2}}{\sigma^2}\right)C(X)\sqrt{(m+1)\gamma np}\right].
\]
We complete the proof by collecting these bounds. $\square$
We will also need the following lemma, which is adapted from Lemma 4 of [5]. For $\tau>0$, we let $Q_\tau(\theta,\cdot)$ denote the density of the normal distribution $\mathbf{N}(\theta,\tau^2 I_d)$.

Lemma 29. For some integer $d\ge 1$, fix $\theta_0\in\mathbb{R}^d$, $R>0$, and define the set $\Xi\overset{\mathrm{def}}{=}\{u\in\mathbb{R}^d:\ \|u-\theta_0\|_2\le R\}$. Let $0<\sigma\le\frac{R\sqrt{2\pi}}{320d}$, and $r\overset{\mathrm{def}}{=}4\sigma\sqrt{d}$. For all $u,v\in\Xi$ such that $\|u-v\|_2\le\sigma/4$, we have
\[
\int_{\Xi_{uv}}Q_\sigma(u,z)\wedge Q_\sigma(v,z)\,dz\ge\frac{1}{4},
\]
where $\Xi_{uv}\overset{\mathrm{def}}{=}\{z\in\Xi:\ \|z-u\|_2\le r,\ \|z-v\|_2\le r\}$.
Proof. For $x\in\mathbb{R}^d$ and $a>0$ we shall write $B_a(x)$ to denote the ball of radius $a$ centered at $x$: $B_a(x)\overset{\mathrm{def}}{=}\{z\in\mathbb{R}^d:\ \|z-x\|_2\le a\}$. Without any loss of generality we can assume that $\theta_0=0_d$, and we can take $u,v\in\Xi$ such that $\|u-\theta_0\|_2=\|v-\theta_0\|_2=R$. We set $\Xi_{u,v}\overset{\mathrm{def}}{=}B_r(u)\cap B_r(v)\cap\Xi$, and we introduce
\[
\mathcal{H}_1\overset{\mathrm{def}}{=}\left\{\theta\in\mathbb{R}^d:\ 2\langle\theta-u,v-u\rangle\ge\|v-u\|_2^2\right\},
\]
\[
\mathcal{H}_2\overset{\mathrm{def}}{=}\left\{\theta\in\mathbb{R}^d:\ 2\langle\theta-u,v-u\rangle<-\|v-u\|_2^2\right\},
\]
\[
\mathcal{H}_3\overset{\mathrm{def}}{=}\left\{\theta\in\mathbb{R}^d:\ -\|v-u\|_2^2\le 2\langle\theta-u,v-u\rangle<\|v-u\|_2^2\right\}.
\]
We have $\mathbb{R}^d=\mathcal{H}_1\cup\mathcal{H}_2\cup\mathcal{H}_3$, and since $\|\theta-u\|_2^2=\|\theta-v\|_2^2+2\langle\theta-u,v-u\rangle-\|v-u\|_2^2$, we see that
\[
Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)=\begin{cases}Q_\sigma(u,\theta)&\text{if }\theta\in\mathcal{H}_1,\\ Q_\sigma(v,\theta)&\text{if }\theta\notin\mathcal{H}_1.\end{cases}
\]
Therefore,
\[
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,d\theta\ge\int_{\Xi_{u,v}\cap\mathcal{H}_1}Q_\sigma(u,\theta)\,d\theta+\int_{\Xi_{u,v}\cap\mathcal{H}_2}Q_\sigma(v,\theta)\,d\theta.
\]
Let $w=(u+v)/2$, $\wp\overset{\mathrm{def}}{=}\{z\in\mathbb{R}^d:\ \langle z,w\rangle=\|w\|_2^2\}$, and let $\wp_-\overset{\mathrm{def}}{=}\{z\in\mathbb{R}^d:\ \langle z,w\rangle\le\|w\|_2^2\}$ denote the corresponding half-space. Define also $\mathcal{B}_r\overset{\mathrm{def}}{=}B_r(u)\cap B_r(v)$. It can easily be checked that for $j\in\{1,2\}$,
\[
\mathcal{B}_r\cap\mathcal{H}_j\cap\left(\wp_--\frac{r^2 w}{R\|w\|_2}\right)\subseteq\Xi_{u,v}\cap\mathcal{H}_j.
\]
Indeed, any $\theta\in\mathbb{R}^d$ can be written $\theta=\frac{\theta' w}{\|w\|_2}\frac{w}{\|w\|_2}+s$, where $\langle s,w\rangle=0$. And if $\theta\in\mathcal{B}_r$ then $\|s\|_2=\left\|\theta-\frac{\theta' w}{\|w\|_2}\frac{w}{\|w\|_2}\right\|_2\le\|\theta-w\|_2=\|\theta-(u+v)/2\|_2\le r$. Also, if $\theta\in\wp_--\frac{r^2 w}{R\|w\|_2}$, then $\frac{\theta' w}{\|w\|_2}\le\|w\|_2-\frac{r^2}{R}\le R-\frac{r^2}{R}$. Hence
\[
\|\theta\|_2^2=\left(\frac{\theta' w}{\|w\|_2}\right)^2+\|s\|_2^2\le\left(R-\frac{r^2}{R}\right)^2+r^2\le R^2.
\]
We conclude that
\[
\int_{\Xi_{u,v}\cap\mathcal{H}_1}Q_\sigma(u,\theta)\,d\theta\ge\int_{\Xi_{u,v}\cap\mathcal{H}_1\cap\wp_-}Q_\sigma(u,\theta)\,d\theta\ge\int_{\mathcal{B}_r\cap\mathcal{H}_1\cap\wp_-}Q_\sigma(u,\theta)\,d\theta-\int_{\wp_-\setminus\left(\wp_--\frac{r^2 w}{R\|w\|_2}\right)}Q_\sigma(u,\theta)\,d\theta.
\]
And $\theta\in\wp_-\setminus\left(\wp_--\frac{r^2 w}{R\|w\|_2}\right)$ is equivalent to $-\frac{r^2}{R}\le\left\langle\theta-u,\frac{w}{\|w\|_2}\right\rangle\le 0$. Hence
\[
\int_{\wp_-\setminus\left(\wp_--\frac{r^2 w}{R\|w\|_2}\right)}Q_\sigma(u,\theta)\,d\theta=\int_0^{\frac{r^2}{R}}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{t^2}{2\sigma^2}}\,dt\le\frac{r^2}{\sigma R\sqrt{2\pi}}.
\]
A similar calculation holds for the second integral. We conclude that
\[
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,d\theta\ge\int_{\mathcal{B}_r\cap\mathcal{H}_1\cap\wp_-}Q_\sigma(u,\theta)\,d\theta+\int_{\mathcal{B}_r\cap\mathcal{H}_2\cap\wp_-}Q_\sigma(v,\theta)\,d\theta-\frac{2r^2}{\sigma R\sqrt{2\pi}}.
\]
It can be checked that the sets $B_r(u)$, $B_r(v)$, $\mathcal{H}_1$ and $\mathcal{H}_2$ are symmetric with respect to $\wp$. Indeed, for $x\in\mathbb{R}^d$, the reflection of $x$ with respect to $\wp$ is $x-2\left(\frac{x' w}{\|w\|_2}-\|w\|_2\right)\frac{w}{\|w\|_2}$, and it is easy to check that if $x$ belongs to any of these sets, its reflection also belongs to it. Hence,
\[
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,d\theta\ge\frac{1}{2}\int_{\mathcal{B}_r\cap\mathcal{H}_1}Q_\sigma(u,\theta)\,d\theta+\frac{1}{2}\int_{\mathcal{B}_r\cap\mathcal{H}_2}Q_\sigma(v,\theta)\,d\theta-\frac{2r^2}{\sigma R\sqrt{2\pi}}
\]
\[
\ge\frac{1}{2}\int_{\mathcal{B}_r\cap\mathcal{H}_1}Q_\sigma(u,\theta)\,d\theta+\frac{1}{2}\int_{\mathcal{B}_r\cap\mathcal{H}_1^c}Q_\sigma(v,\theta)\,d\theta-\frac{1}{2}\int_{\mathcal{H}_3}Q_\sigma(v,\theta)\,d\theta-\frac{2r^2}{\sigma R\sqrt{2\pi}}.
\]
We have
\[
\int_{\mathcal{B}_r\cap\mathcal{H}_1}Q_\sigma(u,\theta)\,d\theta+\int_{\mathcal{B}_r\cap\mathcal{H}_1^c}Q_\sigma(v,\theta)\,d\theta=\int_{\mathbb{R}^d}\mathbf{1}_{\mathcal{B}_r\cap\mathcal{H}_1}(\theta)Q_\sigma(u,\theta)\,d\theta+\int_{\mathbb{R}^d}\mathbf{1}_{\mathcal{B}_r\cap\mathcal{H}_1^c}(v-u+\theta)Q_\sigma(u,\theta)\,d\theta\ge\int_{\mathbb{R}^d}\mathbf{1}_{\mathcal{B}_r\cap\mathcal{H}_1}(\theta)Q_\sigma(u,\theta)\,d\theta+\int_{\mathbb{R}^d}\mathbf{1}_{B_r(u)\cap\mathcal{H}_2}(\theta)Q_\sigma(u,\theta)\,d\theta,
\]
where the last inequality uses the fact, easy to check, that if $\theta\in B_r(u)\cap\mathcal{H}_2$, then $\theta+v-u\in\mathcal{B}_r\setminus\mathcal{H}_1$. Furthermore,
\[
B_r(u)=\left(B_r(u)\cap\mathcal{H}_1\right)\cup\left(B_r(u)\cap\mathcal{H}_2\right)\cup\left(B_r(u)\cap\mathcal{H}_3\right),
\]
and $B_r(u)\cap\mathcal{H}_1=\mathcal{B}_r\cap\mathcal{H}_1$. Hence
\[
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,d\theta\ge\frac{1}{2}\int_{B_r(u)}Q_\sigma(u,\theta)\,d\theta-\frac{1}{2}\int_{\mathcal{H}_3}Q_\sigma(u,\theta)\,d\theta-\frac{1}{2}\int_{\mathcal{H}_3}Q_\sigma(v,\theta)\,d\theta-\frac{2r^2}{\sigma R\sqrt{2\pi}}.
\]
With $r=4\sigma\sqrt{d}$, we have $\int_{B_r(u)}Q_\sigma(u,\theta)\,d\theta\ge 1-10^{-4}$. With $\sigma\le\frac{R\sqrt{2\pi}}{320d}$,
\[
\frac{2r^2}{\sigma R\sqrt{2\pi}}\le\frac{1}{10},
\]
and with $\|v-u\|_2\le\sigma/4$,
\[
\int_{\mathcal{H}_3}Q_\sigma(u,\theta)\,d\theta+\int_{\mathcal{H}_3}Q_\sigma(v,\theta)\,d\theta=\int_{-\frac{\|v-u\|_2}{2}}^{\frac{\|v-u\|_2}{2}}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{t^2}{2\sigma^2}}\,dt+\int_{-\frac{3\|v-u\|_2}{2}}^{\frac{\|v-u\|_2}{2}}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{t^2}{2\sigma^2}}\,dt\le\frac{3\|v-u\|_2}{\sqrt{2\pi}\sigma}\le\frac{3}{4\sqrt{2\pi}}.
\]
In conclusion,
\[
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,d\theta\ge\frac{1}{2}-\frac{10^{-4}}{2}-\frac{3}{8\sqrt{2\pi}}-\frac{1}{10}\ge\frac{1}{4},
\]
as claimed. $\square$
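The minorization constant $1/4$ in Lemma 29 is conservative; a Monte Carlo estimate of the overlap integral (an illustrative sketch with an arbitrary configuration of $u,v$ on the boundary of $\Xi$, not part of the proof) typically lands well above it:

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma = 4, 1.0
R = 320 * d * sigma / np.sqrt(2 * np.pi)      # so that sigma = R*sqrt(2*pi)/(320 d)
r = 4 * sigma * np.sqrt(d)

u = np.zeros(d); u[0] = R                     # worst case: u on the boundary of Xi
ang = sigma / (4 * R)                         # ||u - v||_2 <= sigma/4 along the sphere
v = np.zeros(d); v[0], v[1] = R * np.cos(ang), R * np.sin(ang)

# Importance sampling from Q_sigma(u, .):
#   int_{Xi_uv} min(Q_u, Q_v) = E_{z~N(u, sigma^2 I)}[min(1, Q_v(z)/Q_u(z)) 1_{Xi_uv}(z)]
n = 200_000
z = u + sigma * rng.standard_normal((n, d))
log_ratio = (((z - u) ** 2).sum(1) - ((z - v) ** 2).sum(1)) / (2 * sigma ** 2)
w = np.minimum(1.0, np.exp(log_ratio))
in_set = ((np.linalg.norm(z, axis=1) <= R)            # z in Xi (theta_0 = 0)
          & (np.linalg.norm(z - u, axis=1) <= r)
          & (np.linalg.norm(z - v, axis=1) <= r))
est = float((w * in_set).mean())
print(est >= 0.25)                            # prints True
```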
Acknowledgements. The authors are grateful to Galin Jones, Scott Schmidler
and Yuekai Sun for very helpful discussions.
References.
[1] Atchade, Y. A. (2017). On the contraction properties of some high-dimensional quasi-posterior
distributions. Ann. Statist. 45 2248–2273.
[2] Atchade, Y. F. (2015). A Moreau-Yosida approximation scheme for high-dimensional posterior
and quasi-posterior distributions. ArXiv e-prints .
[3] Atchade, Y. F. and Rosenthal, J. S. (2005). On adaptive Markov chain Monte Carlo algo-
rithms. Bernoulli 11 815–828.
[4] Bauschke, H. H. and Combettes, P. L. (2011). Convex analysis and monotone operator
theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathematiques de la SMC,
Springer, New York. With a foreword by Hedy Attouch.
URL http://dx.doi.org/10.1007/978-1-4419-9467-7
[5] Belloni, A. and Chernozhukov, V. (2009). On the computational complexity of MCMC-
based estimators in large samples. Ann. Statist. 37 2011–2055.
[6] Bhattacharya, A., Chakraborty, A. and Mallick, B. (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika 103 985–991.
[7] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and
Dantzig selector. Ann. Statist. 37 1705–1732.
[8] Bottolo, L. and Richardson, S. (2010). Evolutionary stochastic search for Bayesian model exploration. Bayesian Anal. 5 583–618.
[9] Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration inequalities: a nonasymp-
totic theory of independence. Springer Series in Statistics, Oxford University Press, Oxford.
[10] Cai, T., Han, X. and Pan, G. (2017). Limiting Laws for Divergent Spiked Eigenvalues and
Largest Non-spiked Eigenvalue of Sample Covariance Matrices. ArXiv e-prints .
[11] Castillo, I., Schmidt-Hieber, J. and van der Vaart, A. (2015). Bayesian linear regression
with sparse priors. Ann. Statist. 43 1986–2018.
[12] Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior
concentration for possibly sparse sequences. Ann. Statist. 40 2069–2101.
[13] Dyer, M., Frieze, A. and Kannan, R. (1991). A random polynomial-time algorithm for
approximating the volume of convex bodies. J. ACM 38 1–17.
[14] George, E. I. and McCulloch, R. E. (1997). Approaches to Bayesian variable selection. Statist. Sinica 7 339–373.
[15] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior
distributions. Ann. Statist. 28 500–531.
[16] Gottardo, R. and Raftery, A. E. (2008). Markov chain Monte Carlo with mixtures of mutually singular distributions. Journal of Computational and Graphical Statistics 17 949–975.
[17] Green, P. J. (2003). Trans-dimensional Markov chain Monte Carlo. In Highly structured
stochastic systems, vol. 27 of Oxford Statist. Sci. Ser. 179–206.
[18] Guan, Y. and Krone, S. M. (2007). Small-world MCMC and convergence to multi-modal distributions: from slow mixing to fast mixing. Ann. Appl. Probab. 17 284–304.
[19] Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. 2nd ed. Cambridge University
Press, New York, NY, USA.
[20] Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Statist. 33 730–773.
[21] Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite-dimensional
Bayesian statistics. Ann. Statist. 34 837–877.
[22] Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association 89 958–966.
[23] Lovasz, L. and Simonovits, M. (1990). The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science. SFCS '90, IEEE Computer Society, Washington, DC, USA.
[24] Lovasz, L. and Simonovits, M. (1993). Random walks in a convex body and an improved
volume algorithm. Random Structures Algorithms 4 359–412.
[25] Lovasz, L. and Vempala, S. (2007). The geometry of logconcave functions and sampling
algorithms. Random Structures Algorithms 30 307–358.
[26] Madras, N. and Randall, D. (2002). Markov chain decomposition for convergence rate anal-
ysis. Ann. Appl. Probab. 12 581–606.
[27] Mangoubi, O. and Smith, A. (2017). Rapid Mixing of Hamiltonian Monte Carlo on Strongly
Log-Concave Distributions. ArXiv e-prints .
[28] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-
dimensional data. Ann. Statist. 37 246–270.
[29] Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression.
JASA 83 1023–1032.
[30] Moreau, J.-J. (1965). Proximite et dualite dans un espace hilbertien. Bull. Soc. Math. France
93 273–299.
[31] Narisetty, N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing
priors. Ann. Statist. 42 789–817.
[32] Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified frame-
work for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical
Science 27 538–557.
[33] Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization
1 123–231.
[34] Patrinos, P., Stella, L. and Bemporad, A. (2014). Forward-backward truncated Newton
methods for convex composite optimization. ArXiv e-prints .
[35] Robert, C. P. and Casella, G. (2004). Monte Carlo statistical methods. 2nd ed. Springer
Texts in Statistics, Springer-Verlag, New York.
[36] Rockova, V. and George, E. I. (2014). EMVS: the EM approach to Bayesian variable selection. Journal of the American Statistical Association 109 828–846.
[37] Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements.
IEEE Trans. Inf. Theor. 59 3434–3447.
[38] Schreck, A., Fort, G., Le Corff, S. and Moulines, E. (2013). A shrinkage-thresholding
Metropolis adjusted Langevin algorithm for Bayesian variable selection. ArXiv e-prints .
[39] Sinclair, A. and Jerrum, M. (1989). Approximate counting, uniform generation and rapidly
mixing Markov chains. Inform. and Comput. 82 93–133.
[40] Sur, P., Chen, Y. and Candes, E. J. (2017). The Likelihood Ratio Test in High-Dimensional
Logistic Regression Is Asymptotically a Rescaled Chi-Square. ArXiv e-prints .
[41] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc.
Ser. B 58 267–288.
[42] Tierney, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22
1701–1762. With discussion and a rejoinder by the author.
[43] Vempala, S. (2005). Geometric random walks: a survey. In Combinatorial and computational
geometry, vol. 52 of Math. Sci. Res. Inst. Publ. Cambridge Univ. Press, Cambridge, 577–616.
[44] Woodard, D. B., Schmidler, S. C. and Huber, M. (2009). Conditions for rapid mixing of
parallel and simulated tempering on multimodal distributions. Ann. Appl. Probab. 19 617–640.
[45] Yang, Y., Wainwright, M. J. and Jordan, M. I. (2016). On the computational complexity of high-dimensional Bayesian variable selection. Ann. Statist. 44 2497–2532.
1085 South University, Ann Arbor, 48109, MI, United States
E-mail: [email protected]