Source: dept.stat.lsa.umich.edu/~yvesa/moreau_asymp.pdf (retrieved 2018-05-27)

Submitted to the Annals of Statistics

arXiv: arXiv:1803.10282

REGULARIZATION AND COMPUTATION WITH

HIGH-DIMENSIONAL SPIKE-AND-SLAB POSTERIOR

DISTRIBUTIONS∗

By Yves Atchadé† and Anwesha Bhattacharyya†

University of Michigan†

We consider the Bayesian analysis of a high-dimensional statis-

tical model with a spike-and-slab prior, and we study the forward-

backward envelope of the posterior distribution – which we denote Πγ

for some regularization parameter γ > 0. Viewing Πγ as a pseudo-

posterior distribution, we work out a set of sufficient conditions under

which it contracts towards the true value of the parameter as γ ↓ 0,

and p (the dimension of the parameter space) diverges to ∞. In linear

regression models the contraction rate matches the contraction rate

of the true posterior distribution.

In the second part of the paper, we study a practical Markov Chain

Monte Carlo (MCMC) algorithm to sample from Πγ . In the particular

case of the linear regression model, and focusing on models with high

signal-to-noise ratios, we show that the mixing time of the MCMC

algorithm depends crucially on the coherence of the design matrix

and the initialization of the Markov chain. In the most favorable

cases, we show that the computational complexity of the algorithm

scales with the dimension p as O(p e^{s⋆²}), where s⋆ is the number of non-zero components of the true parameter. We provide some simulation

results to illustrate the theory. Our simulation results also suggest

that the proposed algorithm (as well as a version of the Gibbs sampler

of [31]) mixes poorly if poorly initialized, or if the design matrix has

high coherence.

1. Introduction. Suppose that we wish to infer a parameter θ ∈ Rp from a

random sample Z ∈ Z, based on the statistical model Z ∼ fθ(z)dz, where fθ is a

density on a sample space Z equipped with a reference sigma-finite measure dz. The

∗This work is partially supported by the NSF grant DMS 1513040.

MSC 2010 subject classifications: Primary 62F15, 60K35; secondary 60K35

Keywords and phrases: High-dimensional Bayesian inference, Variable selection, Posterior contrac-

tion, Forward-backward envelope, Markov Chain Monte Carlo Mixing, High-dimensional linear re-

gression models



log-likelihood function

ℓ(θ; z) := log f_θ(z), θ ∈ R^p, z ∈ Z,

is assumed to be a concave function of θ and a jointly measurable function on Rp×Z.

We take a Bayesian approach and consider a setting where p is very large and it is

statistically appealing to perform a variable selection step in the estimation process.

This problem has attracted a lot of attention in recent years, and several approaches

are available. One of the most effective solutions – at least in theory – relies on spike-and-slab priors ([29, 14]), and can be described as follows. For δ ∈ ∆ := {0, 1}^p, let

µδ(dθ) denote the product measure on Rp given by

µ_δ(dθ) := ∏_{j=1}^p µ_{δ_j}(dθ_j),

where µ0(dz) is the Dirac mass at 0, and µ1(dz) is the Lebesgue measure on R. With

{ωδ, δ ∈ ∆} denoting a prior distribution on ∆, we consider the spike-and-slab prior

distribution on ∆ × R^p given by¹

(1.1) ω_δ (ρ/2)^{‖δ‖_1} e^{−ρ‖θ‖_1} µ_δ(dθ),

for a parameter ρ > 0. Given Z = z, the resulting posterior distribution on ∆×Rp is

(1.2) Π(δ, dθ|z) ∝ f_θ(z) ω_δ (ρ/2)^{‖δ‖_1} e^{−ρ‖θ‖_1} µ_δ(dθ).

The posterior distribution (1.2) has been recently studied in the high-dimensional

regime (see e.g. [12, 11, 1]), where it is shown to contract towards the true value of

the parameter at an optimal rate. In practice however, this posterior is computationally difficult to handle and typically requires specialized MCMC techniques such as

reversible jump ([17]), and related methods ([16, 38]). However these MCMC algo-

rithms are often difficult to design and tune, particularly in high-dimensional problems

(see for instance [38] and [2] for some numerical comparisons). In the particular case

¹The use of the Laplace distribution is not fundamental. We use it here partly for mathematical and computational convenience, and partly because of its widespread use in applications. Most of the results below can also be worked out if the Laplace distribution is replaced by a density g such that g(0) = 1 and log g is Lipschitz – note however that this condition is not satisfied by the Gaussian distribution.


of the Gaussian linear model with a Gaussian slab density, it is sometimes possible to

side-step this computational difficulties by integrating out the parameter θ. One can

then explore the resulting discrete distribution δ ↦ Π(δ|z) using standard MCMC algorithms (see for instance [14, 8, 45] and the references therein). However this strategy

is typically not widely applicable.

A very popular approximation to Π – sometimes also referred to as spike-and-slab

– is obtained by replacing the point mass at zero by a Gaussian distribution with a

small variance γ, say, ([14, 20, 36, 31]). The resulting posterior distribution is

(1.3) Π̃γ(δ, du|z) ∝ ω_δ (2πγ)^{‖δ‖_0/2} (ρ/2)^{‖δ‖_0} e^{ℓ(u;z) − ρ‖u_δ‖_1 − ‖u − u_δ‖_2²/(2γ)} du.

This approximation, which we denote Π̃γ, was recently studied in [31], where consistency for the linear regression model is established in the high-dimensional setting. Here we consider another approximation scheme: the variational approximation of Π proposed in [2] – denoted Πγ – that is obtained by taking its forward-backward envelope (an envelope function similar to the Moreau envelope). One advantage of working with Πγ is that it is computationally and mathematically tractable. Indeed, the definition of Πγ leads to tight approximation bounds that we leverage in the analysis. In our numerical experiments we found that Πγ and Π̃γ behave very similarly, and to some extent we expect the results – and the techniques – derived here to hold when applied to Π̃γ.

To define Πγ, we endow the Euclidean space R^p with its Lebesgue measure, denoted dθ or du, etc. For a set A ⊆ R^p, let ι_A denote its characteristic function, defined here as ι_A(u) = 0 if u ∈ A, and ι_A(u) = +∞ otherwise. Given δ ∈ ∆, let R^p_δ := {δ · u : u ∈ R^p} (where δ · u := (u₁δ₁, . . . , u_pδ_p) ∈ R^p). The basic idea is to replace the function u ↦ −ℓ(u; z) + ρ‖u‖_1 + ι_{R^p_δ}(u) by its forward-backward envelope ([34]), defined as

(1.4) h_γ(δ, θ; z) := min_{v ∈ R^p_δ} [ −ℓ(θ; z) − ⟨∇ℓ(θ; z), v − θ⟩ + ρ‖v‖_1 + (1/(2γ))‖v − θ‖_2² ],

for some parameter γ > 0. This envelope function is closely related to the more widely known Moreau envelope ([30, 4, 33]). We choose to work with the forward-backward envelope because it is typically available in closed form, whereas the Moreau envelope is not. It is easily seen from the definition and the convexity of −ℓ(·; z) that

(1.5) h_γ(δ, ·; z) ≤ −ℓ(·; z) + ρ‖·‖_1 + ι_{R^p_δ}(·),


and furthermore h_γ(δ, ·; z) converges pointwise to −ℓ(·; z) + ρ‖·‖_1 + ι_{R^p_δ}(·) as γ ↓ 0 (see for instance [34, 2]). Assuming that ∫_{R^p} e^{−h_γ(δ,u;z)} du is finite for all δ ∈ ∆, we are then naturally led to the probability distribution on ∆ × R^p given by

(1.6) Πγ(δ, du|z) ∝ ω_δ (2πγ)^{‖δ‖_0/2} (ρ/2)^{‖δ‖_0} e^{−h_γ(δ,u;z)} du,

that we take as an approximation of Π. We note however that it is not automatic

that Πγ is a well-defined probability distribution. Since hγ is known to approximate

−ℓ(·; z) + ρ‖·‖_1 + ι_{R^p_δ}(·), we naturally expect Πγ to behave like Π, for γ small, and

results of this type can be found in [2].
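Since the forward-backward envelope is available in closed form, it can be evaluated directly: for fixed δ the inner minimum in (1.4) is coordinate-wise separable, attained at a soft-thresholded gradient step on the active coordinates and at zero elsewhere. The following is a minimal sketch; the helper names are ours, and the Gaussian linear model used for illustration is an assumption, not the paper's general setting.

```python
import numpy as np

def soft(x, t):
    # soft-thresholding operator: prox of t * |.|
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fb_envelope(loglik, grad, theta, delta, rho, gamma):
    """Evaluate h_gamma(delta, theta; z) as in (1.4).  The minimizer over
    v in R^p_delta is soft(theta + gamma*grad, gamma*rho) on the active
    coordinates (delta_j = 1) and 0 on the inactive ones."""
    g = grad(theta)
    v = np.where(delta == 1, soft(theta + gamma * g, gamma * rho), 0.0)
    return (-loglik(theta) - g @ (v - theta)
            + rho * np.abs(v).sum()
            + np.sum((v - theta) ** 2) / (2.0 * gamma))

# illustration: Gaussian linear model, l(theta; z) = -||y - X theta||^2 / 2
X = np.array([[1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.5],
              [0.3, 0.2, 1.0, 0.0]])
y = np.array([1.0, 2.0, 0.5])
loglik = lambda th: -0.5 * np.sum((y - X @ th) ** 2)
grad = lambda th: X.T @ (y - X @ th)

delta = np.array([1, 1, 0, 0])
theta = np.array([0.5, -1.0, 0.0, 0.0])     # a point of R^p_delta
rho = 1.0
target = -loglik(theta) + rho * np.abs(theta).sum()
h = fb_envelope(loglik, grad, theta, delta, rho, gamma=0.1)
# (1.5): the envelope lies below -l + rho*||.||_1 on R^p_delta
assert h <= target + 1e-12
```

Taking γ smaller in this sketch closes the gap between h_γ and the target, consistent with the pointwise convergence noted after (1.5).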

1.1. Main contributions. The contribution of this work is two-fold. Firstly, viewing

Πγ as a pseudo/quasi-posterior distribution, we study its statistical properties as the

dimension p increases. We derive some sufficient conditions under which Πγ is well

defined and puts most of its probability mass around the true value of the parameter

as p→∞. More precisely, Theorem 5 can be used to show that with high probability

a draw (δ, θ) from Πγ(·|Z) produces a typically sparse vector δ, and a non-sparse

vector θ. Although θ is not sparse, its components for which δj = 0 are typically

small (O(√γ)), and its sparsified version (that is, θ_δ = θ · δ) is typically close to θ⋆,

the true value of the parameter.

We also study the contraction rate of Πγ , and using the linear regression model

as an example, we show that the rate of contraction of Πγ matches that of Π as

derived in [11]. Furthermore we show that Πγ enjoys a Bernstein-von Mises approx-

imation, and again in the particular case of the linear regression model, we recover

the Bernstein-von Mises theorem established by [11], up to some small difference in

the Fisher information matrix due to the approximate nature of Πγ .

Practical use of Bayesian procedures typically hinges on the ability to draw samples

from the posterior distribution. In the second part of the paper we develop an efficient

Metropolis-within-Gibbs algorithm to sample from Πγ . The algorithm is similar to the

Gibbs sampler used in [31]. Building on the posterior contraction properties developed

in the first part, we analyze the computational complexity of the algorithm as p→∞,

focusing on the linear regression case. The behavior of the resulting Markov chain

depends on the signal-to-noise ratio of the underlying regression problem, and in this

work we focus solely on problems with high signal-to-noise ratios. Even in this case,

the mixing of the Markov chain depends on the initial distribution of the Markov


chain, and a form of coherence of the design matrix X ∈ R^{n×p}, defined as

C(X) := sup_{j: δ⋆,j = 1} (1/√n) ‖X′_{δ⋆^c} X_j‖_2,

where X_j is the j-th column of X, and X_{δ⋆^c} is the sub-matrix of X obtained by removing the columns of X for which δ⋆,j = 1. Figure 1-(a)-(c) show the estimated mixing time

of the algorithm (truncated at 2 × 10^4) as a function of p under different scenarios,

and illustrate the main conclusions of the paper. For comparison we also present – in

dashed lines – the estimated mixing times of a similar algorithm to sample from the

weak spike-and-slab posterior distribution (1.3) discussed above. The results presented

here deal only with the high signal-to-noise ratio setting, and we refer the reader to

Section 3.4 for a detailed description of the experiment. Figure 1-(a) shows the mixing

times in a setting where the design matrix X has low coherence, and a good initial

value is used to start the algorithm2. In Figure 1-(b) the initial value remains the

same, but the design matrix X has a higher coherence parameter (see Section 3.4 for

the details on how such design matrix is produced). Finally in Figure 1-(c), the design

matrix is the same as in Figure 1-(a) but the initial value of δ is “much worst”: it

has 20% false-negatives, but no false-positive. In this latter case most of the mixing

times observed are greater than the 20, 000 iterations mark.

[Figure 1 here: three panels, (a)-(c), plotting the estimated mixing time against the dimension p (0 to 5000); solid lines: Moreau approximation, dashed lines: weak spike-and-slab.]

Fig 1. Estimated mixing times as a function of the dimension p. (a) low coherence, good initialization. (b) high coherence, good initialization. (c) low coherence, poor initialization.

2Here a “good initial value” for the Markov chain is a model δ that has no false-negative, and 10%

false-positives.


Two observations stand out from these results. The algorithm seems to mix quickly

when the design matrix X has low coherence and the initial distribution of the Markov

chain has no false-negatives. The mixing seems to degrade when the coherence increases, and the algorithm seems to mix very poorly when the initial distribution has

false-negatives. We show in Theorem 15 a result that partly supports these empirical

findings. More precisely, under some additional regularity conditions we show that

with an initial distribution without false-negatives, the mixing time of the algorithm

scales with p as

O( p exp( c s⋆² [ 1 + (C(X) / (s⋆^{1/2} log(p))) √(p/n) ]² ) ),

where s⋆ = ‖θ⋆‖_0, and c is an absolute constant. The result implies that when p is of the same order as n, and the coherence C(X) is comparable to s⋆^{1/2} log(p), the mixing time of the algorithm is essentially linear in p.

Our result is related to the recent work of [45], which studied a Metropolis-Hastings algorithm to sample from the marginal distribution of δ in a linear regression model with a Gaussian spike-and-slab g-prior. Unlike this work, where the fast mixing of our MCMC algorithm hinges on a high signal-to-noise ratio, good initialization, and low coherence, these authors showed that their algorithm has a worst-case (with respect to the initial distribution) mixing time that scales as O(s⋆²(n + s⋆) p log(p)). We note that the posterior distribution considered by [45] can be viewed as a “collapsed” version of ours (where the regression parameters are integrated out). The robustness of the algorithm of [45] can perhaps be understood as a result of this collapsing – the benefits of collapsing variables in a block-update sampling problem are well-known ([22]). We feel that more research is needed on this problem. At the very least, we stress again that the idea of collapsing parameters from the posterior distribution is not always feasible, particularly in generalized linear regression models.

1.2. Outline of the paper. The paper is organized as follows. We study the statis-

tical properties of Πγ in Section 2. We derive two main theorems. Theorem 5 deals

with the contraction and the contraction rate of Πγ as p → ∞, whereas Theorem 8

studies variable selection and the Bernstein-von Mises theorem. We illustrate these

results in the particular case of the linear regression model in Section 2.2, leading to

Corollary 10. Since the proofs of the results of Section 2 follow techniques similar to those


in [11] and [1], the details are placed in the appendix. In Section 3 we study the prob-

lem of sampling from Πγ , using a slightly modified version of the MCMC algorithm

of [2]. In the particular case of the linear regression model, we study the mixing of

the proposed algorithm. The main result there is Theorem 15, the proof of which is

developed in Section 4, with some technical details gathered in the appendix. Some

numerical simulations are detailed in Section 3.4.

1.3. Notation. Throughout we equip the Euclidean space Rp (p ≥ 1 integer) with

its usual Euclidean inner product 〈·, ·〉 and norm ‖ · ‖2, its Borel sigma-algebra, and

its Lebesgue measure. All vectors u ∈ Rp are column-vectors unless stated otherwise.

We also use the following norms on R^p: ‖θ‖_1 := ∑_{j=1}^p |θ_j|, ‖θ‖_0 := ∑_{j=1}^p 1{|θ_j| > 0}, and ‖θ‖_∞ := max_{1≤j≤p} |θ_j|.

We set ∆ := {0, 1}^p. For θ, θ′ ∈ R^p, θ · θ′ ∈ R^p denotes the component-wise product of θ and θ′. For δ ∈ ∆, we set R^p_δ := {θ · δ : θ ∈ R^p}, and we write θ_δ as a short for θ · δ. We define δ^c := 1 − δ, that is, δ^c_j = 1 − δ_j, 1 ≤ j ≤ p. For a matrix A ∈ R^{m×m} and δ ∈ ∆, A_δ (resp. A_{δ^c}) denotes the matrix in R^{‖δ‖_0 × ‖δ‖_0} (resp. R^{(m−‖δ‖_0) × (m−‖δ‖_0)}) obtained by keeping only the rows and columns of A for which δ_j = 1 (resp. δ_j = 0). For δ, δ′ ∈ ∆, we write δ ⊇ δ′ to mean that for any j ∈ {1, . . . , p}, whenever δ′_j = 1, we have δ_j = 1.

Throughout the paper e denotes the Euler number, and (m choose q) is the binomial coefficient m!/(q!(m − q)!). For x ∈ R, sign(x) is the sign of x (sign(x) = 1 if x > 0, sign(x) = −1 if x < 0, and sign(x) = 0 if x = 0).

If f(θ, x) is a real-valued function that depends on the parameter θ and some other

argument x, the notation ∇(k)f(θ, x), where k is an integer, denotes the k-th partial

derivative of f with respect to θ. For k = 1, we write ∇f(θ, x) instead of ∇(1)f(θ, x).

The total variation metric between two probability measures µ, ν is defined as

‖µ − ν‖_tv := sup_{A measurable} (µ(A) − ν(A)).

All the asymptotic results in the paper are derived by letting the dimension p grow

to infinity, and we say that a term x ∈ R is an absolute constant if x does not depend

on p.

2. Contraction properties of Πγ. In this section we establish that, under some mild conditions, Πγ is a well-defined probability distribution with posterior contraction properties similar to those of the spike-and-slab posterior distribution given in (1.2). We make the following assumptions.

H1. For all z ∈ Z, the function θ ↦ ℓ(θ; z) is concave, twice differentiable, and there exist symmetric positive semidefinite matrices S̲, S̄ ∈ R^{p×p} that do not depend on z and θ, such that for all θ ∈ R^p, and all z ∈ Z,

−S̄ ⪯ ∇^(2) log f_θ(z) ⪯ −S̲,

where the notation A ⪯ B means that B − A is positive semidefinite.

Remark 1. We note that one can always take S̲ to be the zero matrix, since ℓ is concave. Hence H1 mainly requires that the Hessian matrix of the log-likelihood function be lower bounded, uniformly in θ and z, by the matrix −S̄. Although somewhat restrictive, this assumption is enough to handle linear and logistic regression models. It is unclear whether the results developed below can be extended beyond H1.

For an integer s ≥ 1, we set

κ̄(s) := sup{ u′S̄u / ‖u‖_2² : u ∈ R^p \ {0}, ‖u‖_0 ≤ s },
and κ̲(s) := inf{ u′S̲u / ‖u‖_2² : u ∈ R^p \ {0}, ‖u‖_0 ≤ s },

and we convene that κ̄(0) = 0 and κ̲(0) = +∞. For s = p, κ̄(s) is the largest eigenvalue of S̄, which we write as λmax(S̄).
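For small p, the two restricted curvature quantities just defined (the sup and the inf over s-sparse vectors) can be brute-forced by scanning all supports of size s: by eigenvalue interlacing, the extremes over ‖u‖_0 ≤ s are attained on principal submatrices of order s. A sketch (a single curvature matrix S is used for both quantities, purely for illustration):

```python
import numpy as np
from itertools import combinations

def restricted_extremes(S, s):
    """Brute-force sup and inf of u'Su / ||u||_2^2 over ||u||_0 <= s
    by scanning all principal submatrices of order s (small p only)."""
    p = S.shape[0]
    hi, lo = -np.inf, np.inf
    for supp in combinations(range(p), s):
        w = np.linalg.eigvalsh(S[np.ix_(supp, supp)])  # sorted ascending
        hi, lo = max(hi, w[-1]), min(lo, w[0])
    return hi, lo

S = np.diag([1.0, 2.0, 3.0, 4.0])
hi, lo = restricted_extremes(S, 2)   # for a diagonal S: 4.0 and 1.0
```

For s = p this reduces to the extreme eigenvalues of S, matching the remark above about λmax.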

Following a standard approach in Bayesian asymptotics, we will assume that there

exists a true value of the parameter θ⋆ such that Z ∼ f_{θ⋆}. More precisely we assume the following.

H2. There exists θ⋆ ∈ R^p \ {0} such that Z ∼ f_{θ⋆}, and we set s⋆ := ‖θ⋆‖_0.


Under H2 we expect θ⋆ to be close to the maximizer of the log-likelihood θ ↦ ℓ(θ; Z). That is, we expect ∇ℓ(θ⋆; Z) ≈ 0. Therefore the sets

E_c := { z ∈ Z : ‖∇ℓ(θ⋆; z)‖_∞ ≤ c/2 },

for c > 0, will naturally play an important role in the analysis. Throughout the paper, we write δ⋆ ∈ ∆ to denote the sparsity structure of θ⋆ (that is, δ⋆,j = 1{|θ⋆,j| > 0}, j = 1, . . . , p). We will write P⋆ (resp. E⋆) to denote the probability distribution (resp. expectation operator) of the random variable Z ∈ Z assumed in H2.

As a prior distribution on δ, we assume that the δj are independent Bernoulli

random variables. More precisely, we assume the following.

H3. For some absolute constant u > 0, setting q := 1/p^{1+u}, we assume that

ω_δ = ∏_{j=1}^p q^{δ_j} (1 − q)^{1−δ_j}.

Remark 2. The prior ω_δ induced by H3 can be written as ω_δ = g_{‖δ‖_0} / (p choose ‖δ‖_0), where g_s = (p choose s) q^s (1 − q)^{p−s}. It is then easily checked that

(2.1) g_s ≤ (2/p^u) g_{s−1}, s = 1, . . . , p.

Discrete priors {ω_δ, δ ∈ ∆} of the form ω_δ = g_{‖δ‖_0} / (p choose ‖δ‖_0), where {g_s} satisfies conditions of the form (2.1), were introduced by [12], and shown to work well for high-dimensional problems.
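The ratio bound (2.1) is easy to verify numerically, provided one works in log space (q^s underflows quickly for moderate p). The following sketch, with p and u chosen as arbitrary illustrative values, checks g_s ≤ (2/p^u) g_{s−1} for every s:

```python
from math import lgamma, log

def log_g(p, s, u):
    """log of g_s = binom(p, s) * q^s * (1 - q)^(p - s), q = p^-(1+u)."""
    q = p ** (-(1.0 + u))
    log_binom = lgamma(p + 1) - lgamma(s + 1) - lgamma(p - s + 1)
    return log_binom + s * log(q) + (p - s) * log(1.0 - q)

p, u = 200, 0.5
log_bound = log(2.0) - u * log(p)      # log of 2 / p^u
ok = all(log_g(p, s, u) - log_g(p, s - 1, u) <= log_bound
         for s in range(1, p + 1))
```

The ratio g_s/g_{s−1} = ((p − s + 1)/s) · q/(1 − q) is largest at s = 1, where it equals p q/(1 − q) ≤ 2/p^u whenever q ≤ 1/2, which is why the check passes uniformly in s.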

The ability to recover θ⋆ depends on the quantity of information available, an idea that we formalize by imposing an appropriate restricted strong concavity condition on the log-likelihood ℓ via the function

(2.2) L_γ(δ, θ; z) := ℓ(θ; z) − ℓ(θ⋆; z) − ⟨∇ℓ(θ⋆; z), θ − θ⋆⟩ + 2γ‖δ · ∇ℓ(θ; z) − δ · ∇ℓ(θ⋆; z)‖_2².

The function θ ↦ ℓ(θ; z) − ℓ(θ⋆; z) − ⟨∇ℓ(θ⋆; z), θ − θ⋆⟩ is the well-known Bregman divergence associated with ℓ. The additional term 2γ‖δ · ∇ℓ(θ; z) − δ · ∇ℓ(θ⋆; z)‖_2² in (2.2) is due to the Moreau envelope approximation, and would typically be small for γ small,


and δ sparse. For our results to work with a certain generality, we use the concept

of a rate function introduced in [1]. A continuous function r : [0,+∞) → [0,+∞) is

called a rate function if r(0) = 0, r is increasing, and lim_{x↓0} r(x)/x = 0.

H4. There exist γ > 0, ρ̄ > 0, s ∈ {s⋆, . . . , p}, and a rate function r₁ such that for all δ ∈ ∆ such that ‖δ‖_0 ≤ s, and for all θ ∈ R^p_δ, we have

log E⋆[ e^{L_γ(δ,θ;Z)} 1_{E_ρ̄}(Z) ] ≤ −(1/2) r₁(‖θ − θ⋆‖_2).

Remark 3. We view H4 as a form of restricted strong convexity assumption on

the function −` ([32]). This assumption is much weaker than the assumption that

` is strongly concave. We will see in Section 2.2 that H4 holds for linear regression

models. It can also be shown to hold for logistic regression models, although we do

not pursue this here.

Remark 4. A lower bound on L_γ will also be needed below. We note that by Assumption H1, we have

(2.3) L_γ(δ, θ; z) ≥ −(1/2) (θ − θ⋆)′ S̄ (θ − θ⋆).

With γ, ρ̄ and s as in H4 we define

(2.4) a := γ Tr(S̄ − S̲) + 4γ² ( κ̄(s) Tr(S̄) + 4‖S̄‖_F² ),

where Tr(M) (resp. ‖M‖_F) denotes the trace (resp. the Frobenius norm) of M. As we will see, the contraction rate of Πγ is at least

(2.5) ε := inf{ x > 0 : r₁(z) ≥ 3ρ̄(s⋆ + s)^{1/2} z for all z ≥ x }.
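The infimum in (2.5) has a simple closed form in a common special case: the right-hand side of the condition is linear in z, say c·z, so if the rate function is quadratic, r₁(z) = κz², then r₁(z) ≥ cz for all z ≥ x exactly when κx ≥ c, giving ε = c/κ. A quick numerical sanity check of this closed form, with arbitrary illustrative constants:

```python
import numpy as np

rho, s_star, s, kappa = 2.0, 5, 20, 100.0
c = 3.0 * rho * np.sqrt(s_star + s)        # the linear slope in (2.5)
eps_closed = c / kappa                     # claimed infimum for r1(z) = kappa*z^2

# brute-force the infimum in (2.5) on a grid:
# kappa*z^2 >= c*z for all z >= x  <=>  kappa*x >= c
xs = np.linspace(1e-4, 10.0, 100001)
feasible = xs[kappa * xs >= c]
eps_grid = feasible.min()
```

With these constants c = 30 and ε = 0.3; the grid infimum agrees up to the grid spacing.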

We then set ∆_s := {δ ∈ ∆ : ‖δ‖_0 ≤ s}, and

(2.6) B_{m,M} := ⋃_{δ ∈ ∆_s} ( {δ} × B^{(δ)}_{m,M} ),

where

B^{(δ)}_{m,M} := { θ ∈ R^p : ‖θ_δ − θ⋆‖_2 ≤ Mε, ‖θ − θ_δ‖_2 ≤ 2√((m+1)γp), and ‖θ − θ_δ‖_∞ ≤ 2√((m+1)γ log(p)) },

for absolute constants m, M.

Theorem 5. Assume H1-H4. Take ρ ∈ (0, ρ̄], and γ > 0 as in H4, such that

(2.7) 4γλmax(S̄) ≤ 1, and γρ² = o(log(p)), as p → ∞.

Suppose also that ε as defined in (2.5) is finite, and that Πγ(·|z) is well-defined for P⋆-almost all z ∈ E_ρ, and for all p large enough. Then there exists an absolute constant A₀ such that for p ≥ A₀, for all m ≥ 1, M > 2, and for any measurable subset E ⊆ E_ρ of Z such that P⋆(Z ∈ E) ≥ 1/2, we have

E⋆[ Πγ(B_{m,M} | Z) | Z ∈ E ] ≥ 1 − E⋆[ Πγ(‖δ‖_0 > s | Z) | Z ∈ E ]

− 4 e^{a/2} p^{(2+u)s⋆} ( 1 + κ̄(s⋆)/ρ² )^{s⋆} ∑_{j≥1} e^{−(1/6) r₁(jMε/2) + 8ρ s^{1/2} (jMε/2)}

− 2 e^{s log(9p)} ∑_{j≥1} e^{−(1/6) r₁(jMε/2)} − (4/p^m) exp( a/2 + 3γρ²s + 2γ κ̄(s)² (Mε)² ).

Proof. See Section 5 of the Supplement.

The theorem applies naturally with E = E_ρ. The general case E ⊆ E_ρ will be needed below. We note that the probabilities of the events {Z ∈ E_ρ} are relatively easy to control, so one can easily derive an unconditional version of the theorem. Theorem 5 outlines a set of sufficient conditions under which one can guarantee that, given the event Z ∈ E, if (δ, θ) ∼ Πγ(·|Z) then the sparsified vector θ_δ satisfies ‖θ_δ − θ⋆‖_2 ≤ Mε, and ‖θ − θ_δ‖_2 ≤ 2√((m+1)γp) (and ‖θ − θ_δ‖_∞ ≤ 2√((m+1)γ log(p))) with high probability. Controlling both the ℓ_∞ and the ℓ_2 norm gives us a better control of the term θ − θ_δ, which will be needed in the mixing time analysis. To use the theorem one needs to establish by other means that Πγ(·|z) is well-defined and puts vanishing probability on the set {‖δ‖_0 > s}. This question is addressed in Lemma 21 under some mild additional assumptions.


The second condition in (2.7) is readily satisfied in most cases. Hence the result implies that one should choose γ as

γ = γ₀ / (4λmax(S̄)),

for some user-defined absolute constant γ₀ ∈ (0, 1], provided that this choice of γ also satisfies H4.

Remark 6. It is interesting to note that we did not directly rely on any sample size condition in the theorem. Here the amount of information available about θ is formalized in H4 directly in terms of curvature of the log-likelihood function ℓ. In models with independent samples, H4 typically translates into a sample size requirement. We refer the reader to Section 2.2 for an example with linear regression models.

2.1. Model selection and Bernstein-von Mises approximation. With some additional assumptions we show next that the distribution Πγ puts overwhelming probability mass around (δ⋆, θ⋆) and satisfies a Bernstein-von Mises approximation. To that end we will assume that there exists an absolute constant M > 2 such that

(2.8) θ̲⋆ := min{ |θ⋆,j| : δ⋆,j = 1 } > Mε.

Clearly this assumption is unverifiable in practice since θ⋆ is typically not known. However, a strong-signal assumption such as (2.8) seems to be needed in one form or another for model selection ([31, 11, 45]). For δ ∈ ∆, we recall that the notation δ ⊇ δ⋆ means that for all 1 ≤ j ≤ p, δ⋆,j = 1 implies that δ_j = 1. We note that when (2.8) holds, the set B^{(δ)}_{m,M} is empty if δ does not satisfy δ ⊇ δ⋆. Hence one immediate implication of assumption (2.8) is that the set B_{m,M} reduces to

B_{m,M} = ⋃_{δ ∈ A} ( {δ} × B^{(δ)}_{m,M} ),

where

A := { δ ∈ ∆ : δ ⊇ δ⋆, and ‖δ‖_0 ≤ s }.

In other words, when (2.8) and the assumptions of Theorem 5 hold, Πγ puts most of its mass only on sparse models that contain the true model δ⋆. Therefore correct model selection becomes possible if the prior distribution offsets the natural tendency


to overfit. Given θ ∈ R^p, and δ ∈ ∆ \ {0}, we write [θ]_δ to denote the vector of δ-selected components of θ, listed in their order of appearance: [θ]_δ = (θ_j, δ_j = 1) ∈ R^{‖δ‖_0}. Conversely, if u ∈ R^{‖δ‖_0}, we write (u, 0)_δ to denote the element of R^p_δ such that [(u, 0)_δ]_δ = u. We define the function ℓ_{[δ]}(·; z) : R^{‖δ‖_0} → R by ℓ_{[δ]}(u; z) := ℓ((u, 0)_δ; z). We then introduce

(2.9) θ̂_δ(z) := Argmax_{u ∈ R^{‖δ‖_0}} ℓ_{[δ]}(u; z), z ∈ Z.

When δ = δ⋆, we sometimes write θ̂⋆(z) instead of θ̂_{δ⋆}(z). At times, to shorten the notation, we will omit the data z and write θ̂_δ instead of θ̂_δ(z). Omitting the data z, we write I_{γ,δ} ∈ R^{‖δ‖_0 × ‖δ‖_0} to denote the matrix of second derivatives of u ↦ −ℓ_{[δ]}(u; z), evaluated at θ̂_δ(z). When δ = δ⋆, we simply write I_γ. We make the following assumption.

H5. The integer s in H4 is such that κ̲(s) > 0, and the following holds.

1. For all δ ∈ ∆_s, and z ∈ E_ρ̄, the estimate θ̂_δ(z) is well-defined and satisfies

‖θ̂_δ(z) − [θ⋆]_δ‖_2 ≤ ε,

where ε is as defined in (2.5).

2. For all z ∈ Z, all δ ∈ ∆_s, the function u ↦ ℓ_{[δ]}(u; z) is thrice differentiable on R^{‖δ‖_0}, and for all c > 0,

ϖ_{2,c} := sup_{z ∈ E_ρ̄} sup_{δ ∈ ∆_s} sup_{u ∈ R^{‖δ‖_0}: ‖u − [θ⋆]_δ‖_2 ≤ cε} ‖∇^(3) ℓ_{[δ]}(u; z)‖_op,

is finite, where ‖ · ‖_op denotes the operator norm³.

Remark 7. When H1 holds, H5-(1) is typically easy to check. Indeed, for all δ ∈ ∆_s, and z ∈ E_ρ̄, we have

0 ≥ −ℓ_{[δ]}(θ̂_δ; z) + ℓ_{[δ]}([θ⋆]_δ; z) ≥ ⟨ −∇ℓ_{[δ]}([θ⋆]_δ; z), θ̂_δ − [θ⋆]_δ ⟩ + (κ̲(s)/2) ‖θ̂_δ − [θ⋆]_δ‖_2²,

where the first inequality uses the definition of θ̂_δ, whereas the second inequality uses H1 and the definition of κ̲(s). Hence for all δ ∈ ∆_s, and z ∈ E_ρ̄, we have

(2.10) ‖θ̂_δ(z) − [θ⋆]_δ‖_2 ≤ ρ̄ s^{1/2} / κ̲(s),

³If M is a linear operator from R^p to R^{p×p}, we define ‖M‖_op := sup_{‖u‖_2=1} ‖Mu‖_F.


which gives an easy way of checking H5-(1), by comparing the right-hand side of (2.10) to ε. H5-(2) is hard to check in general since it requires the third derivative of the log-likelihood to be uniformly bounded in z (for all practical purposes). However it trivially holds for linear regression models, since in that case ϖ_{2,c} = 0. It can also be shown to hold for logistic regression models, although we do not pursue this here.

For $\Lambda > 0$ we introduce the set
$$\mathcal{E}_{\rho,\Lambda} \overset{\mathrm{def}}{=} \mathcal{E}_\rho \cap \bigcap_{k=0}^{s-s_\star}\left\{z \in \mathcal{Z}:\ \sup_{\delta\in\mathcal{A}:\ \|\delta\|_0=s_\star+k} \ell(\theta_\delta(z); z) - \ell(\theta_\star(z); z) \le k\Lambda\right\}.$$
We introduce the following distribution on $\Delta \times \mathbb{R}^p$,

(2.11)  $\Pi_\gamma^\infty(\delta, d\theta|z) \propto e^{-\frac{1}{2}([\theta]_{\delta_\star}-\theta_\star)'\mathcal{I}_\gamma([\theta]_{\delta_\star}-\theta_\star) - \frac{1}{2\gamma}(\theta-\theta_{\delta_\star})'(I_p-\gamma S)(\theta-\theta_{\delta_\star})}\,\mathbf{1}_{\{\delta_\star\}}(\delta)\,\mathbf{1}_{\mathcal{B}_{m,M}^{(\delta_\star)}}(\theta)\,d\theta.$

Note that the $\delta$-marginal of $\Pi_\gamma^\infty(\cdot|z)$ is the point mass at $\delta_\star$. We note also that for $p$ large the restriction to the set $\mathcal{B}_{m,M}^{(\delta_\star)}$ in $\Pi_\gamma^\infty(\cdot|z)$ is inconsequential and can be removed, by Gaussian concentration. Hence for all practical purposes, a draw $U$ from the $\theta$-marginal of $\Pi_\gamma^\infty(\cdot|z)$ is such that $[U]_{\delta_\star}$ and $[U]_{\delta_\star^c}$ are independent, $[U]_{\delta_\star} \sim \mathbf{N}(\theta_\star, \mathcal{I}_\gamma^{-1})$, and $[U]_{\delta_\star^c} \sim \mathbf{N}(0, \gamma([I_p - \gamma S]_{\delta_\star^c})^{-1})$. Hence $U_j = O_p(\sqrt{\gamma})$ for all $j$ such that $\delta_{\star,j} = 0$.

Theorem 8. Assume H1-H5 and (2.8) for some absolute constant $M > 2$. Fix $\gamma > 0$ such that $4\gamma\lambda_{\max}(S) \le 1$, and $\rho > 0$. Suppose also that for some $\Lambda > 0$,
$$\frac{2\rho\, e^\Lambda\sqrt{2\pi}}{p^u\,\kappa(s)} \le 1, \quad \text{and} \quad (M-1)\epsilon\,\kappa(s)^{1/2} - s_\star^{1/2} \ge 2.$$
Then there exist absolute constants $A_0, C$, such that for all $p \ge A_0$, for all $m \ge 4$, and for all $z \in \mathcal{E}_{\rho,\Lambda}$ such that $\Pi_\gamma(\cdot|z)$ is well-defined, we have

(2.12)  $\Pi_\gamma(\{\delta_\star\}\times\mathcal{B}_{m,M}^{(\delta_\star)}|z) \ge \Pi_\gamma(\mathcal{B}_{m,M}|z) - \left(\frac{4\rho\, e^\Lambda\sqrt{2\pi}}{p^u\,\kappa(s)}\right) e^{2(M+2)\rho s^{1/2}\epsilon}\, e^{\frac{(M+1)^3}{3}\epsilon^3\varpi_{2,M}}\, e^{\frac{s_\star^{1/2}\epsilon\,\varpi_{2,M}}{\kappa(s)}}\, e^{2\gamma\kappa(s)^2(M\epsilon)^2}\, e^{3\gamma\rho^2 s}\, e^{a^2}.$


Furthermore,

(2.13)  $\left\|\Pi_\gamma(\cdot|z) - \Pi_\gamma^\infty(\cdot|z)\right\|_{\mathrm{tv}} \le \left(1 - \Pi_\gamma(\{\delta_\star\}\times\mathcal{B}_{m,M}^{(\delta_\star)}|z)\right) + 2\left(1 - \Pi_\gamma(\mathcal{B}_{m,M}|z)\right) + C\iota_1\left(1 + \rho s^{1/2}\epsilon + \epsilon^3\varpi_{2,M}\right)e^{C\rho s^{1/2}\epsilon + C\epsilon^3\varpi_{2,M}},$

where
$$\iota_1 \overset{\mathrm{def}}{=} e^{a^2 + 3\gamma\rho^2 s + 2\gamma\kappa(s)^2(M\epsilon)^2} - 1 + \frac{1}{p^m}.$$

Proof. See Section 6 of the Supplement.

Theorems 5 and 8 can be combined to derive some simple conditions under which $\Pi_\gamma$ puts most of its probability mass on subsets of the form $\{\delta_\star\}\times\mathcal{B}_{m,M}^{(\delta_\star)}$, and satisfies a Bernstein-von Mises approximation when $z \in \mathcal{E}_{\rho,\Lambda}$. We refer to Section 2.2 for a linear regression example. Controlling the probability $\mathbb{P}_\star(Z \in \mathcal{E}_{\rho,\Lambda})$ boils down to controlling the log-likelihood ratios of the model. This is easily done for linear regression models (see Section 2.2). The recent work of [40] provides some tools that can be used to deal with logistic regression models; however, this remains to be explored.

The Bernstein-von Mises approximation implies that for $\gamma$ small, posterior confidence intervals on $\theta$ obtained from $\Pi_\gamma$ are approximately equivalent to their frequentist counterparts computed knowing $\delta_\star$ (oracle confidence intervals). In that sense the theorem can be used to argue that the Bayesian inference is equivalent to the oracle frequentist inference.

2.2. Application to high-dimensional linear regression models. We illustrate the results with the linear regression model. Suppose that we have $n$ subjects, and on subject $i$ we observe $(Z_i, x_i) \in \mathbb{R}\times\mathbb{R}^p$, where $x_i$ is non-random, and the following holds.

H6. For $i = 1, \ldots, n$, $Z_i \sim \mathbf{N}(\langle\theta_\star, x_i\rangle, \sigma^2)$ are independent random variables, for a parameter $\theta_\star \in \mathbb{R}^p\setminus\{0\}$, and a known absolute constant $\sigma^2 > 0$.

Let $X \in \mathbb{R}^{n\times p}$ be such that the $i$-th row of $X$ is $x_i'$. We write $X_j \in \mathbb{R}^n$ to denote the $j$-th column of $X$. Throughout the paper, we will automatically assume that the matrix $X$ is normalized such that

(2.14)  $\|X_j\|_2^2 = n, \quad j = 1, \ldots, p.$


For $s \ge 1$, we define
$$\underline{v}(s) \overset{\mathrm{def}}{=} \inf\left\{\frac{u'(X'X)u}{n\|u\|_2^2}:\ u \ne 0,\ \|u\|_0 \le s\right\}, \quad \text{and} \quad \bar{v}(s) \overset{\mathrm{def}}{=} \sup\left\{\frac{u'(X'X)u}{n\|u\|_2^2}:\ u \ne 0,\ \|u\|_0 \le s\right\}.$$
We also introduce
$$\underline{v} \overset{\mathrm{def}}{=} \inf\left\{\frac{u'(X'X)u}{n\|u\|_2^2}:\ u \ne 0,\ \|u_{\delta_\star^c}\|_1 \le 7\|u_{\delta_\star}\|_1\right\}.$$

Different behavior can be obtained from $\Pi_\gamma$ depending on the choice of the hyper-parameter $\rho$. To keep the discussion short, we assume that

(2.15)  $\rho = \frac{4}{\sigma}\sqrt{\frac{n}{\log(p)}}, \quad \text{and} \quad \bar\rho = \frac{4}{\sigma}\sqrt{n\log(p)}.$

We make the following assumption.

H7. As $p \to \infty$, $s_\star = o(\log(p))$ and

(2.16)  $\frac{1}{\underline{v}} + \frac{1}{\underline{v}(s)} + \bar{v}(s) = O(1),$

where $s \overset{\mathrm{def}}{=} s_\star + \frac{2}{u}(m_0 + 1 + 2s_\star)$, for some absolute constant $m_0 \ge 1$.

Remark 9. Assumption H7 can be viewed as a minimal sample size requirement. It imposes that the sample size $n$ increase as $p$ grows, so as to guarantee that $\underline{v}$ and $\underline{v}(s)$ remain bounded away from 0, and $\bar{v}(s)$ remains bounded from above. For instance, if $X$ is a realization of a random matrix with i.i.d. entries, it is known that H7 holds with high probability if the sample size $n$ grows as $s\log(p)$ (see e.g. [37] and the references therein). □

In the next result, $\epsilon$ as defined in (2.5) takes the form
$$\epsilon = \frac{24\sigma}{\underline{v}(s)}\sqrt{\frac{(s_\star + s)\log(p)}{n}}.$$
And the limiting distribution $\Pi_\gamma^\infty(\cdot|z)$ in the Bernstein-von Mises approximation is such that if $(\delta, U) \sim \Pi_\gamma^\infty(\cdot|z)$, then $\delta = \delta_\star$, $[U]_{\delta_\star} \sim \mathbf{N}\left((X_{\delta_\star}'X_{\delta_\star})^{-1}X_{\delta_\star}'z,\ \sigma^2(X_{\delta_\star}'X_{\delta_\star})^{-1}\right)$, and $[U]_{\delta_\star^c} \sim \mathbf{N}\left(0,\ \gamma\left(\left[I_p - \frac{\gamma}{\sigma^2}(X'X)\right]_{\delta_\star^c}\right)^{-1}\right)$, with $[U]_{\delta_\star}$, $[U]_{\delta_\star^c}$ independent. We deduce the following corollary.
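Since both blocks of the limiting distribution are explicit Gaussians, a draw from its $\theta$-marginal is easy to simulate. The following is a minimal numpy sketch (the function name and interface are ours, not from the paper); it assumes $\gamma$ is small enough that $I - (\gamma/\sigma^2)X'X$ restricted to the inactive coordinates is positive definite.

```python
import numpy as np

def sample_limit(X, z, delta_star, gamma, sigma2, rng):
    # Draw U from the theta-marginal of the limiting distribution:
    # [U]_{delta*}   ~ N((X_a' X_a)^{-1} X_a' z, sigma^2 (X_a' X_a)^{-1}),
    # [U]_{delta*^c} ~ N(0, gamma * ([I - (gamma/sigma^2) X'X]_{delta*^c})^{-1}),
    # with the two blocks independent.
    act = delta_star.astype(bool)
    Xa = X[:, act]
    G = Xa.T @ Xa
    mean_a = np.linalg.solve(G, Xa.T @ z)
    cov_a = sigma2 * np.linalg.inv(G)
    U = np.zeros(X.shape[1])
    U[act] = rng.multivariate_normal(mean_a, cov_a)
    Xc = X[:, ~act]
    M = np.eye(Xc.shape[1]) - (gamma / sigma2) * (Xc.T @ Xc)
    U[~act] = rng.multivariate_normal(np.zeros(Xc.shape[1]),
                                      gamma * np.linalg.inv(M))
    return U
```

As the text notes, the inactive coordinates are then of order $\sqrt{\gamma}$.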


Corollary 10. Assume H3, H6, H7, and choose $\rho$ and $\bar\rho$ as in (2.15). Suppose also that we choose $\gamma > 0$ such that $\frac{8}{\sigma^2}\gamma\log(p)\lambda_{\max}(X'X) \le 1$, and as $p \to \infty$,

(2.17)  $\gamma n\log(p) = O(1), \quad \text{and} \quad \gamma^2\left(n\,\mathrm{Tr}(X'X) + \|X'X\|_F^2\right) = o(\log(p)).$

Furthermore, suppose that as $p \to \infty$, $\log(p)/n \to 0$, and

(2.18)  $\lim_{p\to\infty}\left\{\min_{j:\ \delta_{\star,j}=1}|\theta_{\star,j}|\right\}\sqrt{\frac{n}{(s_\star + s)\log(p)}} = +\infty.$

Then, setting $\Lambda = 3\log(n\wedge p)$, the following holds. For all $m > 1$ and all $M > \max\left(2, \sqrt{\frac{u+2}{3}}\right)$, we have
$$\mathbb{E}_\star\left[\Pi_\gamma\left(\{\delta_\star\}\times\mathcal{B}_{m,M}^{(\delta_\star)}\,\middle|\,Z\right)\,\middle|\,Z \in \mathcal{E}_{\rho,\Lambda}\right] \ge 1 - \frac{1}{p^{m_0}} - \frac{1}{p^{m-1}} - \frac{1}{p^{M^2 s}} - \frac{1}{p^{u-3}},$$
$$\text{and} \quad \mathbb{P}_\star\left(Z \notin \mathcal{E}_{\rho,\Lambda}\right) \le \frac{2}{p} + \frac{A_1}{(n\wedge p)^{1/4}},$$
for all $p$ large enough, where $A_1$ is some absolute constant. If, in addition to the assumptions above, $\gamma s_\star n\log(p) = o(1)$ and $\gamma^2\left(n\,\mathrm{Tr}(X'X) + \|X'X\|_F^2\right) = o(1)$ as $p \to \infty$, then
$$\lim_{p\to\infty}\mathbb{E}_\star\left[\left\|\Pi_\gamma(\cdot|Z) - \Pi_\gamma^\infty(\cdot|Z)\right\|_{\mathrm{tv}}\right] = 0.$$

Proof. See Section 7 of the Supplement.

In general, the assumptions (2.17) and $\gamma\log(p)\lambda_{\max}(X'X) \le \sigma^2/8$ can easily be satisfied by taking $\gamma$ sufficiently small. If we choose $\gamma \sim \frac{1}{n\log(p)}$ as imposed in Theorem 15 below, then these assumptions imply some restrictions on the design matrix $X$, in that we need
$$\lambda_{\max}(X'X) = O(n), \quad \text{and} \quad \frac{\mathrm{Tr}(X'X)}{\lambda_{\max}(X'X)} = o(\log(p)^3).$$
These latter conditions typically hold if $p$ and $n$ are of the same order and the design matrix $X$ has a small number of leading singular values, which is similar to the spiked covariance model commonly used in principal component analysis (see for instance [10] and the references therein). In simulation results not reported here, we noticed that the conclusions of the theorem continue to hold if $X$ is a random matrix with i.i.d. standard normal entries.


The assumption (2.18) is a different (and stronger) form of the high signal-to-noise ratio assumption. It implies that for any $M > 2$, (2.8) holds for all $p$ large enough. The additional assumptions needed for the Bernstein-von Mises theorem highlight the fact that smaller values of $\gamma$ are typically needed for the Bernstein-von Mises approximation to hold. Although not reported here, we have indeed observed such behavior in the simulations.

3. Markov Chain Monte Carlo Computation. In this section we develop and analyze an MCMC algorithm to sample from $\Pi_\gamma$. We shall focus on the $\theta$-marginal of $\Pi_\gamma$, given by

(3.1)  $\Pi_\gamma(du|z) \propto \sum_{\delta\in\Delta}\omega_\delta\,(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta\|_0}e^{-h_\gamma(\delta,u;z)}\,du.$

We will abuse notation and continue to write $\Pi_\gamma$ to denote this marginal. The set whose probability we seek will typically make it clear whether we are referring to the joint distribution or to its marginals.

3.1. A MCMC sampler for $\Pi_\gamma$. We start with a description of the MCMC sampler. We use a data-augmentation approach where we sample the joint variable $(\delta, \theta)$ and then discard $\delta$. To sample $(\delta, \theta)$ we use a Metropolis-Hastings-within-Gibbs sampler, where we update $\delta$ given $\theta$, then we update the selected components $[\theta]_\delta$ given $(\delta, [\theta]_{\delta^c})$, and finally update $[\theta]_{\delta^c}$ given $(\delta, [\theta]_\delta)$. We refer the reader to [42, 35] for an introduction to basic MCMC algorithms.

To develop the details, we need to introduce some notation. For $\gamma > 0$, $\delta \in \Delta$, $\theta \in \mathbb{R}^p$, we define the proximal map
$$\mathrm{Prox}_\gamma(\delta, \theta) \overset{\mathrm{def}}{=} \mathrm{Argmin}_{v\in\mathbb{R}^p_\delta}\left[\rho\|v\|_1 + \frac{1}{2\gamma}\|v - \theta\|_2^2\right] = \delta\cdot s_\gamma(\theta),$$
where $s_\gamma(\theta) \overset{\mathrm{def}}{=} (s_\gamma(\theta_1), \ldots, s_\gamma(\theta_p)) \in \mathbb{R}^p$ is the soft-thresholding operation applied componentwise to $\theta$, with
$$s_\gamma(x) \overset{\mathrm{def}}{=} \mathrm{sign}(x)\,(|x| - \gamma\rho)_+, \quad x \in \mathbb{R},$$

where $\mathrm{sign}(a)$ is the sign of $a$, and $a_+ = \max(a, 0)$. With this definition, $h_\gamma$ in (1.4) can be rewritten as

(3.2)  $h_\gamma(\delta, \theta; z) = -\ell(\theta; z) - \langle\nabla\ell(\theta; z),\ J_\gamma(\delta, \theta) - \theta\rangle + \rho\|J_\gamma(\delta, \theta)\|_1 + \frac{1}{2\gamma}\|J_\gamma(\delta, \theta) - \theta\|_2^2,$

where

(3.3)  $J_\gamma(\delta, \theta) \overset{\mathrm{def}}{=} \mathrm{Prox}_\gamma(\delta, \theta + \gamma\nabla\ell(\theta; z)) = \delta\cdot s_\gamma(\theta + \gamma\nabla\ell(\theta; z)).$

This alternative expression shows that the function $h_\gamma(\delta, \theta; z)$ is easy to evaluate. Furthermore, plugging (3.2) into (1.6) shows that, given $\theta$, the components of $\delta$ are conditionally independent Bernoulli random variables:

(3.4)  $\Pi_\gamma(\delta|\theta, z) = \prod_{j=1}^p [p_{\gamma,j}(\theta)]^{\delta_j}[1 - p_{\gamma,j}(\theta)]^{1-\delta_j}, \quad \text{where} \quad p_{\gamma,j}(\theta) \overset{\mathrm{def}}{=} \frac{\frac{q}{1-q}\frac{\rho}{2}\sqrt{2\pi\gamma}\,e^{r_{\gamma,j}}}{1 + \frac{q}{1-q}\frac{\rho}{2}\sqrt{2\pi\gamma}\,e^{r_{\gamma,j}}}, \quad j = 1, \ldots, p,$

where $r_{\gamma,j} \overset{\mathrm{def}}{=} r_\gamma(\theta_j + \gamma\nabla_j\ell(\theta; z))$, where $\nabla_j\ell(\theta; z)$ denotes the $j$-th partial derivative of $\ell(\cdot; z)$ evaluated at $\theta$, and for $x \in \mathbb{R}$,

(3.5)  $r_\gamma(x) \overset{\mathrm{def}}{=} -\frac{1}{2\gamma}s_\gamma(x)^2 + \frac{1}{\gamma}x\,s_\gamma(x) - \rho|s_\gamma(x)| = \begin{cases}\frac{1}{2\gamma}(|x| - \gamma\rho)^2 & \text{if } |x| > \gamma\rho,\\ 0 & \text{otherwise.}\end{cases}$

It follows from the above that drawing samples from the conditional distribution of $\delta$ given $\theta, z$ is easily achieved.
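To make the $\delta$-update concrete, here is a minimal numpy sketch of the soft-thresholding map $s_\gamma$, the function $r_\gamma$ of (3.5), and the exact conditional draw of $\delta$ in (3.4); the function names are ours, and the Bernoulli probabilities are computed on the log-odds scale to avoid overflow for large $r_{\gamma,j}$.

```python
import numpy as np

def soft_threshold(x, gamma, rho):
    # s_gamma(x) = sign(x) * (|x| - gamma*rho)_+, applied componentwise
    return np.sign(x) * np.maximum(np.abs(x) - gamma * rho, 0.0)

def r_gamma(x, gamma, rho):
    # r_gamma(x) = (|x| - gamma*rho)^2 / (2*gamma) when |x| > gamma*rho, else 0
    return np.where(np.abs(x) > gamma * rho,
                    (np.abs(x) - gamma * rho) ** 2 / (2.0 * gamma), 0.0)

def sample_delta(theta, grad, gamma, rho, q, rng):
    # Exact conditional draw of delta given theta, per (3.4): independent
    # Bernoullis with odds (q/(1-q)) * (rho/2) * sqrt(2*pi*gamma) * exp(r_j)
    r = r_gamma(theta + gamma * grad, gamma, rho)
    log_odds = (np.log(q / (1.0 - q)) + np.log(rho / 2.0)
                + 0.5 * np.log(2.0 * np.pi * gamma) + r)
    prob = 1.0 / (1.0 + np.exp(-log_odds))
    return (rng.random(theta.shape) < prob).astype(int)
```

Coordinates whose gradient step $\theta_j + \gamma\nabla_j\ell$ is large in absolute value get $r_{\gamma,j}$ large, hence are selected with probability close to one.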

Now consider the conditional distribution $\Pi_\gamma(\theta|\delta, z) \propto e^{-h_\gamma(\delta,\theta;z)}$. Given $\delta$, we partition $\theta$ into $\theta = ([\theta]_\delta, [\theta]_{\delta^c})$, where $[\theta]_\delta$ groups the components of $\theta$ for which $\delta_j = 1$, and $[\theta]_{\delta^c}$ groups the remaining components. We propose a Metropolis-within-Gibbs MCMC scheme whereby we first update $[\theta]_\delta$ using a Random Walk Metropolis step while keeping $[\theta]_{\delta^c}$ fixed, and then update $[\theta]_{\delta^c}$ using an Independence Metropolis-Hastings step while keeping $[\theta]_\delta$ fixed. Again we refer the reader to [42] for an introduction to these basic MCMC algorithms. To give more details, it is enough to consider the case where $0 < \|\delta\|_0 < p$. We update the component $[\theta]_\delta$ using a Random Walk Metropolis step with a Gaussian proposal $\mathbf{N}(0, \tau_\delta^2 I_{\|\delta\|_0})$ for some scale parameter $\tau_\delta^2 > 0$ (we give more detail on the choice of $\tau_\delta^2$ below), while keeping $\delta$ and $[\theta]_{\delta^c}$ fixed. To update the component $[\theta]_{\delta^c}$, we build an approximation $\bar h_\gamma$ of $h_\gamma$ as follows. First, in view of (3.3), we propose to approximate $J_\gamma(\delta, \theta)$ by
$$\bar J_\gamma(\delta, \theta) \overset{\mathrm{def}}{=} \delta\cdot s_\gamma(\theta + \gamma\nabla\ell(\theta_\delta; z)),$$
and we note that $\bar J_\gamma(\delta, \theta)$ does not actually depend on $[\theta]_{\delta^c}$. Using a Taylor expansion, we also approximate $\ell(\theta; z)$ and $\nabla\ell(\theta; z)$ by $\tilde\ell(\theta; z)$ and $\tilde\nabla\ell(\theta; z)$ respectively, where
$$\tilde\ell(\theta; z) \overset{\mathrm{def}}{=} \ell(\theta_\delta; z) + \langle\nabla\ell(\theta_\delta; z),\ \theta - \theta_\delta\rangle + \frac{1}{2}(\theta - \theta_\delta)'\nabla^{(2)}\ell(\theta_\delta; z)(\theta - \theta_\delta),$$
and
$$\tilde\nabla\ell(\theta; z) \overset{\mathrm{def}}{=} \nabla\ell(\theta_\delta; z) + \nabla^{(2)}\ell(\theta_\delta; z)(\theta - \theta_\delta).$$
We then approximate $h_\gamma(\delta, \theta; z)$ by replacing $J_\gamma(\delta, \theta)$ by $\bar J_\gamma(\delta, \theta)$, $\ell(\theta; z)$ by $\tilde\ell(\theta; z)$, and $\nabla\ell(\theta; z)$ by $\tilde\nabla\ell(\theta; z)$ in the expression of $h_\gamma$ given in (3.2). This leads to

(3.6)  $\bar h_\gamma(\delta, \theta; z) \overset{\mathrm{def}}{=} -\tilde\ell(\theta; z) - \left\langle\tilde\nabla\ell(\theta; z),\ \bar J_\gamma(\delta, \theta) - \theta\right\rangle + \rho\|\bar J_\gamma(\delta, \theta)\|_1 + \frac{1}{2\gamma}\|\bar J_\gamma(\delta, \theta) - \theta\|_2^2.$

Remark 11. It will be important to keep in mind that in linear regression models, $\theta \mapsto \ell(\theta; z)$ is quadratic, so that $\tilde\ell(\theta; z) = \ell(\theta; z)$ and $\tilde\nabla\ell(\theta; z) = \nabla\ell(\theta; z)$. Hence in that case the approximation $\bar h_\gamma$ involves only replacing $J_\gamma$ by $\bar J_\gamma$.

We set

(3.7)  $\Sigma_{\delta,\theta} \overset{\mathrm{def}}{=} \left(\left[I_p + \gamma\nabla^{(2)}\ell(\theta_\delta; z)\right]_{\delta^c}\right)^{-1}, \quad \text{and} \quad m_{\delta,\theta} \overset{\mathrm{def}}{=} \Sigma_{\delta,\theta}\left[(I_p + \gamma\nabla^{(2)}\ell(\theta_\delta; z))(\bar J_\gamma(\delta, \theta) - \theta_\delta)\right]_{\delta^c}.$

We note that $\Sigma_{\delta,\theta}$ and $m_{\delta,\theta}$ depend on $\theta$ only through $[\theta]_\delta$. We note also that under H1, the matrix $\Sigma_{\delta,\theta}$ is symmetric positive definite whenever $\gamma\lambda_{\max}(S) < 1$. With some easy algebra, it can be shown that $\bar h_\gamma(\delta, \theta; z)$ can be written as
$$\bar h_\gamma(\delta, \theta; z) = \frac{1}{2\gamma}\left([\theta]_{\delta^c} - m_{\delta,\theta}\right)'\Sigma_{\delta,\theta}^{-1}\left([\theta]_{\delta^c} - m_{\delta,\theta}\right) + \mathrm{const},$$


where the term const does not depend on $[\theta]_{\delta^c}$ (but does depend on $[\theta]_\delta$). Hence, given $\delta$ and $[\theta]_\delta$, we update $[\theta]_{\delta^c}$ using an Independence Metropolis-Hastings algorithm with proposal $\mathbf{N}(m_{\delta,\theta}, \gamma\Sigma_{\delta,\theta})$.

Given $\delta \in \Delta$, let us call $\overrightarrow{K}_\delta$ the transition kernel on $\mathbb{R}^p$ of the combined $\theta$-move that we just described. Let us call $\overleftarrow{K}_\delta$ the kernel obtained by reversing the order of the updates (we update $[\theta]_{\delta^c}$ first using the independence sampler described above, followed by a Random Walk Metropolis update of $[\theta]_\delta$). For the purpose of having a reversible kernel, we introduce
$$K_\delta(\theta, \cdot) \overset{\mathrm{def}}{=} \frac{1}{2}\overrightarrow{K}_\delta(\theta, \cdot) + \frac{1}{2}\overleftarrow{K}_\delta(\theta, \cdot).$$

The proposed MCMC algorithm to sample from $\Pi_\gamma$ is as follows.

Algorithm 1. For some initial distribution $\nu_0$ on $\mathbb{R}^p$, draw $u_0 \sim \nu_0$. Given $u_0, \ldots, u_n$ for some $n \ge 0$, draw independently $D_{n+1} \sim \mathbf{Ber}(0.5)$.

1. If $D_{n+1} = 0$, set $u_{n+1} = u_n$.
2. If $D_{n+1} = 1$,
   (a) draw $\delta \sim \Pi_\gamma(\cdot|u_n, z)$ as given in (3.4), and
   (b) draw $u_{n+1} \sim K_\delta(u_n, \cdot)$.
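The structure of Algorithm 1 can be organized as the following skeleton (a sketch: `draw_delta` and `kernel_step` are our placeholder names for the exact $\delta$-draw of (3.4) and one application of $K_\delta$, passed in as functions).

```python
import numpy as np

def algorithm1(u0, n_iters, draw_delta, kernel_step, rng):
    # Lazy data-augmentation sampler: with probability 1/2 the chain holds
    # still (D_{n+1} = 0); otherwise draw delta ~ Pi_gamma(.|u_n, z) and then
    # u_{n+1} ~ K_delta(u_n, .). The delta draws can be collected along the
    # way for the variable selection problem.
    u = np.asarray(u0, dtype=float).copy()
    samples = [u.copy()]
    for _ in range(n_iters):
        if rng.random() < 0.5:            # D_{n+1} = 0: lazy step
            samples.append(u.copy())
            continue
        delta = draw_delta(u, rng)        # update delta given theta
        u = kernel_step(delta, u, rng)    # update theta given delta
        samples.append(u.copy())
    return np.array(samples)
```

As Remark 12 below notes, the lazy step and the symmetrized kernel are devices for the theory; a practical implementation would iterate $\overrightarrow{K}_\delta$ at every step.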

Remark 12. The use of the kernel $K_\delta$ instead of $\overrightarrow{K}_\delta$ (or $\overleftarrow{K}_\delta$) does not increase the computational cost, and ensures reversibility, which is needed in our theory. The introduction of the indicator variable $D_n$ implies that half of the time the chain does not move: we have a lazy Markov chain, which is also needed in our theory. These tricks are not used in practice, and for the numerical illustrations presented below we only implemented the kernel $\overrightarrow{K}_\delta$.

The indicator variables $\delta$ discarded in Algorithm 1 are important in practice for the variable selection problem, and are usually collected along the iterations. Here we focus the analysis on the continuous variables $u_n \in \mathbb{R}^p$. We do not lose anything by doing so, since given $u_n$, exact sampling of $\delta$ is possible as discussed above. In other words, the mixing of the joint process $\{(\delta_n, u_n),\ n \ge 0\}$ is driven by the mixing of the marginal $\{u_n,\ n \ge 0\}$. □


3.1.1. Initialization. The choice of the initial distribution $\nu_0$ plays a crucial role for fast mixing. Given $z \in \mathbb{R}^n$ and a model $\delta \in \Delta_s$, let $\theta_\delta = (X_\delta'X_\delta)^{-1}X_\delta'z$ denote the ordinary least squares estimate based on the variables selected by $\delta$. Let us call $\nu^{(\delta)}(\cdot|z)$ the Gaussian distribution on $\mathbb{R}^p$ such that if $(U_1, \ldots, U_p) \sim \nu^{(\delta)}(\cdot|z)$, then $[U]_\delta \sim \mathbf{N}(\theta_\delta, \sigma^2(X_\delta'X_\delta)^{-1})$, and (independently) $U_j \overset{\mathrm{i.i.d.}}{\sim} \mathbf{N}(0, \gamma)$ for all $j$ such that $\delta_j = 0$.

We propose to take the initial distribution $\nu_0$ as $\nu^{(\delta^{(0)})}(\cdot|z)$ for some initial estimate $\delta^{(0)}$ of $\delta_\star$. Perhaps the most natural choice of $\delta^{(0)}$ is the lasso estimate ([41, 7]). In a strong signal-to-noise ratio setting, the lasso is known to recover $\delta_\star$ with high probability ([28]). However, in practice lasso estimates can perform poorly. So it is important to understand the mixing of the MCMC sampler when $\delta^{(0)}$ is close to, but not exactly equal to, $\delta_\star$.
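A draw from $\nu^{(\delta^{(0)})}(\cdot|z)$ can be sketched as follows (numpy; the function name is ours, and $\delta^{(0)}$ is passed as a 0/1 vector).

```python
import numpy as np

def initial_draw(X, z, delta0, gamma, sigma2, rng):
    # Draw u0 ~ nu^{(delta0)}(.|z): an OLS-centred Gaussian on the selected
    # block, and independent N(0, gamma) noise on the unselected coordinates.
    act = delta0.astype(bool)
    Xa = X[:, act]
    G = Xa.T @ Xa
    theta_hat = np.linalg.solve(G, Xa.T @ z)     # OLS on the model delta0
    u = rng.normal(0.0, np.sqrt(gamma), size=X.shape[1])
    u[act] = rng.multivariate_normal(theta_hat, sigma2 * np.linalg.inv(G))
    return u
```

In practice $\delta^{(0)}$ would come from a lasso fit, as discussed above.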

3.1.2. Computational cost. The computational cost of Algorithm 1 is dominated by the cost of sampling from the Gaussian distribution $\mathbf{N}(m_{\delta,\theta}, \gamma\Sigma_{\delta,\theta})$, which itself is dominated by the Cholesky decomposition of $\Sigma_{\delta,\theta}$. Hence each iteration of Algorithm 1 in general has a cost that scales with $p$ as $O(p^3)$. However, in some cases a faster implementation is possible, along the lines of an algorithm proposed in [6]⁴. Suppose that the Hessian matrix $\nabla^{(2)}\ell(\theta; z)$ can be written as

(3.8)  $\nabla^{(2)}\ell(\theta; z) = -X'W_\theta X,$

where $X \in \mathbb{R}^{n\times p}$, and $W_\theta \in \mathbb{R}^{n\times n}$ is a non-singular diagonal matrix. This is the case, for instance, for linear or logistic regression models. Then $\Sigma_{\delta,\theta} = \left(I_{p-\|\delta\|_0} - \gamma X_{\delta^c}'W_\theta X_{\delta^c}\right)^{-1}$, where $X_{\delta^c} \in \mathbb{R}^{n\times(p-\|\delta\|_0)}$ is the sub-matrix of $X$ obtained by selecting the columns for which $\delta_j = 0$. By the Woodbury formula, we then have
$$\Sigma_{\delta,\theta} = I_{p-\|\delta\|_0} + \gamma X_{\delta^c}'\left(W_\theta^{-1} - \gamma X_{\delta^c}X_{\delta^c}'\right)^{-1}X_{\delta^c}.$$
Therefore, if $C_\theta'C_\theta = W_\theta^{-1} - \gamma X_{\delta^c}X_{\delta^c}'$ is the Cholesky decomposition of the $\mathbb{R}^{n\times n}$ matrix $W_\theta^{-1} - \gamma X_{\delta^c}X_{\delta^c}'$, we can sample from $\mathbf{N}(0, \gamma\Sigma_{\delta,\theta})$ by drawing $Z \sim \mathbf{N}(0, I_{p-\|\delta\|_0})$ and $U \sim \mathbf{N}(0, I_n)$ independently, and returning
$$\sqrt{\gamma}\,Z + \gamma X_{\delta^c}'C_\theta^{-1}U.$$

4We are grateful to Anirban Bhattacharya for pointing out this paper to us.


It is also easy to see that the Cholesky factor $C_\theta$ can be exploited to compute $m_{\delta,\theta}$ as well. The per-iteration computational cost of this approach is $O(n^3 + p^2 n)$. Hence, in the particular case where (3.8) holds, the per-iteration computational cost of the algorithm is $O(p^2\min(n, p))$, which matches other state-of-the-art algorithms for high-dimensional regression ([6]).
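The Woodbury-based sampling step above can be sketched in numpy as follows (the function name is ours; $W_\theta$ is passed as the vector of its diagonal entries, and $\gamma$ is assumed small enough that $W_\theta^{-1} - \gamma X_{\delta^c}X_{\delta^c}'$ is positive definite).

```python
import numpy as np

def sample_gaussian_woodbury(Xc, w_diag, gamma, rng):
    # Draw from N(0, gamma * Sigma) with Sigma = (I - gamma Xc' W Xc)^{-1},
    # using the Woodbury identity
    #   Sigma = I + gamma Xc' (W^{-1} - gamma Xc Xc')^{-1} Xc,
    # so only an n x n Cholesky factorization is needed: O(n^3 + p^2 n).
    n, pm = Xc.shape
    A = np.diag(1.0 / w_diag) - gamma * (Xc @ Xc.T)   # W^{-1} - gamma Xc Xc'
    C = np.linalg.cholesky(A).T                       # upper factor: C'C = A
    Z = rng.standard_normal(pm)
    U = rng.standard_normal(n)
    return np.sqrt(gamma) * Z + gamma * (Xc.T @ np.linalg.solve(C, U))
```

One can verify the identity directly on a small example: for admissible $\gamma$, `np.linalg.inv(np.eye(pm) - gamma * Xc.T @ np.diag(w_diag) @ Xc)` agrees with `np.eye(pm) + gamma * Xc.T @ np.linalg.solve(A, Xc)`.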

3.2. Mixing time of Markov chains. Our objective in this section is to provide some qualitative bounds on the mixing time of Algorithm 1. In particular, we wish to understand how this mixing time depends on the dimension $p$. We follow the conductance approach, using the framework of [24]. However, this theory cannot be applied directly, since the target distribution of interest is a mixture of log-concave densities, and hence is not log-concave. Our main contribution is the idea that one can invoke the contraction properties of $\Pi_\gamma$ to essentially reduce the mixing of $P_\gamma$ to the mixing of $K_{\delta_\star}$ – the Markov kernel that samples from the dominant component of $\Pi_\gamma$. This latter problem can then be handled by the standard theory of [24].

We start with a general overview of the technique using some generic notation; the specific application to $\Pi_\gamma$ is presented in Section 3.3. Let $\pi$ be a probability measure on $\mathbb{R}^p$ that is absolutely continuous with respect to the Lebesgue measure $d\theta$ on $\mathbb{R}^p$, such that

(3.9)  $\pi(d\theta) \propto e^{-h(\theta)}d\theta, \quad \theta \in \mathbb{R}^p,$

for a measurable function $h : \mathbb{R}^p \to [0, \infty)$. We will abuse notation and write $\pi$ to denote both $\pi$ and its density. Let $P$ be a Markov kernel on $\mathbb{R}^p$. For any integer $n \ge 1$, $P^n$ denotes the Markov kernel defined recursively as $P^1 = P$, and
$$P^n(x, \cdot) \overset{\mathrm{def}}{=} \int P^{n-1}(x, dz)P(z, \cdot).$$
For a probability measure $\mu$, the product $\mu P$ is the probability measure defined as
$$\mu P(\cdot) \overset{\mathrm{def}}{=} \int\mu(dz)P(z, \cdot).$$
We say that $P$ is reversible with respect to $\pi$ if for all measurable sets $A, B \subseteq \mathbb{R}^p$:
$$\int_A \pi(d\theta)P(\theta, B) = \int_B \pi(d\theta)P(\theta, A).$$


Reversibility of $P$ with respect to $\pi$ implies that $P$ has invariant distribution $\pi$. We say that a Markov kernel $P$ is lazy if $P(\theta, \{\theta\}) \ge 1/2$ for all $\theta \in \mathbb{R}^p$. For $\zeta \in [0, 1/2)$, the $\zeta$-conductance of the Markov kernel $P$, as introduced by [23], is defined as
$$\Phi_\zeta(P) \overset{\mathrm{def}}{=} \inf\left\{\frac{\int_A \pi(d\theta)P(\theta, A^c)}{\min(\pi(A) - \zeta,\ \pi(A^c) - \zeta)}:\ A \text{ measurable},\ \zeta < \pi(A) < 1 - \zeta\right\}.$$
The case $\zeta = 0$ corresponds to the usual conductance. The conductance measures how rapidly a Markov chain moves around the space if started from its stationary distribution. In practice, most MCMC algorithms are started from some initial distribution $\nu_0$ that is not the stationary distribution. In high-dimensional problems – due to the curse of dimensionality and the concentration of measure phenomenon – the choice of the initial distribution becomes crucial. A fundamental result by [39] relates the mixing time of the Markov chain to the conductance of $P$ and the properties of the initial distribution $\nu_0$. More details can also be found in [13, 25, 5] and the references therein. Here we will use the generalization provided by Corollary 1.5-(2) of [24], which can be stated as follows.

Theorem 13. Suppose that $P$ is lazy and has invariant distribution $\pi$, and fix $\zeta \in (0, 1/2)$. For any probability measure $\nu_0$, and any integer $K \ge 1$, we have
$$\|\nu_0 P^K - \pi\|_{\mathrm{tv}} \le H_\zeta\left(1 + \frac{1}{\zeta}e^{-K\frac{\Phi_\zeta^2(P)}{2}}\right),$$
where
$$H_\zeta \overset{\mathrm{def}}{=} \sup_{A:\ \pi(A)\le\zeta}|\nu_0(A) - \pi(A)|.$$
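As an illustration of how the bound is used: given a conductance lower bound $\Phi$, a value of $\zeta$, and $H_\zeta$, the displayed inequality can be inverted to give an explicit iteration count (a sketch; the function name and the requirement $\mathrm{tol} > H_\zeta$ are ours).

```python
import numpy as np

def iterations_needed(phi, zeta, h_zeta, tol):
    # Smallest integer K with H_zeta * (1 + exp(-K * phi^2 / 2) / zeta) <= tol,
    # solving the mixing-time bound of Theorem 13 for K.
    if tol <= h_zeta:
        raise ValueError("tol must exceed H_zeta for the bound to reach tol")
    return int(np.ceil(-2.0 / phi ** 2 * np.log(zeta * (tol / h_zeta - 1.0))))
```

Note the quadratic dependence on $1/\Phi$: halving the conductance quadruples the required number of iterations.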

Using this result boils down to lower bounding the $\zeta$-conductance $\Phi_\zeta(P)$ and upper bounding $H_\zeta$. We follow the approach of [25], which consists in studying the restriction of $P$ to some well-chosen subset of $\mathbb{R}^p$. More precisely, if $\Theta \subseteq \mathbb{R}^p$ is a non-empty measurable subset such that $\pi(\Theta) > 0$, the $\Theta$-conductance of $P$ is
$$\Phi_\Theta(P) \overset{\mathrm{def}}{=} \inf\left\{\frac{\int_B \pi(d\theta)P(\theta, B^c\cap\Theta)}{\min(\pi(B),\ \pi(B^c\cap\Theta))}:\ B \subseteq \Theta,\ B \text{ measurable},\ 0 < \pi(B) < \pi(\Theta)\right\}.$$
We note – and this is easily shown – that if $\pi(\Theta) \ge 1 - \zeta$, then $\Phi_\zeta(P) \ge \Phi_\Theta(P)$. Hence we can reduce the problem to lower bounding $\Phi_\Theta(P)$ for a well-chosen subset $\Theta$. The next result builds on Theorem 2 of [5] and provides an approach to lower-bound


$\Phi_\Theta(P)$. The result itself is of some independent interest, since it can be applied more widely. The setup is as follows. With $\pi$ and $P$ as above, suppose that there exists a sub-Markov kernel $\{q(x, \cdot),\ x \in \mathbb{R}^p\}$ (meaning that $(x, y) \mapsto q(x, y)$ is measurable, and $\int_{\mathbb{R}^p} q(x, y)\,dy \le 1$ for all $x \in \mathbb{R}^p$) such that

(3.10)  $P(\theta, du) \ge q(\theta, u)\,du, \quad \theta \in \mathbb{R}^p.$

Given $u, v \in \mathbb{R}^p$, set
$$q_{u,v}(\theta) \overset{\mathrm{def}}{=} \min\left(q(u, \theta),\ q(v, \theta)\right), \quad \theta \in \mathbb{R}^p.$$
Let $|||\cdot|||$ denote some arbitrary pseudo-norm on $\mathbb{R}^p$ such that $|||\theta||| \le \|\theta\|_2$ for all $\theta$.

Theorem 14. Suppose that $P$ is a transition kernel that is reversible with respect to $\pi$ as given in (3.9), such that (3.10) holds. Suppose also that the following holds.

1. There exists a nonempty convex set $\Theta \subset \mathbb{R}^p$ with finite diameter $\mathrm{diam}(\Theta) \overset{\mathrm{def}}{=} \max_{u,v\in\Theta}\|u - v\|_2$, such that $h$ is convex on $\Theta$.
2. There exist $r > 0$, $\alpha > 0$, such that for all $u, v \in \Theta$ that satisfy $|||u - v||| \le r$, we have $\int_\Theta q_{u,v}(\theta)\,d\theta \ge \alpha$.

Then
$$\Phi_\Theta(P) \ge \frac{\alpha}{4}\min\left[1,\ \frac{2r}{\mathrm{diam}(\Theta)}\right].$$

Proof. See Section 4.1.

3.3. Application to the kernel $P_\gamma$ for linear regression models. We analyze the mixing of Algorithm 1. For simplicity, we focus on linear regression models. The transition kernel of the $\mathbb{R}^p$-valued Markov chain $\{u_n,\ n \ge 0\}$ defined by Algorithm 1 is

(3.11)  $P_\gamma(u, dv) \overset{\mathrm{def}}{=} \frac{1}{2}\delta_u(dv) + \frac{1}{2}\sum_{\omega\in\Delta}\Pi_\gamma(\omega|u, z)K_\omega(u, dv).$

The coherence of the design matrix $X$ plays a role. We define it here as

(3.12)  $\mathcal{C}(X) \overset{\mathrm{def}}{=} \sup_{j:\ \delta_{\star,j}=1}\ \sup_{u\in\mathbb{R}^{p-s_\star}:\ \|u\|_2=1}\frac{1}{\sqrt{n}}\left|\left\langle X_j, X_{\delta_\star^c}u\right\rangle\right| = \sup_{j:\ \delta_{\star,j}=1}\frac{1}{\sqrt{n}}\|X_{\delta_\star^c}'X_j\|_2,$

where $X_{\delta_\star^c} \in \mathbb{R}^{n\times(p-s_\star)}$ is the submatrix of $X$ corresponding to the columns $j$ for which $\delta_{\star,j} = 0$. $\mathcal{C}(X)$ is a measure of the correlation between the important and the unimportant variables in the design matrix $X$. For a random matrix with i.i.d. Gaussian entries, $\mathcal{C}(X) \approx \sqrt{p}$. With the same notation as in Section 2.2, we have the following result.

Theorem 15. Assume H3, H6, H7, choose $\rho, \bar\rho$ as in (2.15), and choose $\gamma = \frac{\gamma_0}{n\log(p)}$ for some absolute constant $\gamma_0 > 0$ such that (2.17) and (2.18) hold. Suppose that we initialize Algorithm 1 as described in Section 3.1.1 with $\nu_0 = \nu^{(\delta^{(0)})}(\cdot|z)$ for some $\delta^{(0)} \supseteq \delta_\star$ such that $\mathrm{FP} \overset{\mathrm{def}}{=} \|\delta^{(0)}\|_0 - s_\star = O(1)$ as $p \to \infty$. Fix $\zeta_0 \in (0, 1/2)$. Then we can find absolute constants $C_0, C_1, C_2, C_3$ such that, if we scale the step size of the Random Walk Metropolis update of Algorithm 1 as $\tau_\delta = \frac{C_0}{\|\delta\|_0^{1/2}\sqrt{n\log(p)}}$, the following holds. For all $p \ge C_1$, and all integers $K$ such that
$$K \ge C_2\,(1 + \mathrm{FP})\,p\,\exp\left(C_3 s_\star^2\left[1 + \left(\frac{\mathcal{C}(X)}{s^{1/2}\log(p)}\sqrt{\frac{p}{n}}\right)^2\right]\right),$$
we have
$$\mathbb{E}_\star\left[\|\nu_0 P_\gamma^K - \Pi_\gamma(\cdot|Z)\|_{\mathrm{tv}}\right] \le 3\zeta_0,$$
provided that the constants $u, m_0$ in H3 and H7 are taken large enough.

Proof. See Section 4.2.

The theorem suggests that if $\mathcal{C}(X)$ is small and $p$ is not too large compared to $n$, then the mixing time is essentially linear in $p$. We note also – as expected – that the bound degrades for large values of the false-positive number FP. However, the constants $C_0, \ldots, C_3$ depend in general on FP, hence the theorem does not provide a clear read of the dependence on FP.

One of the main conclusions of the theorem is that the mixing of the algorithm is directly impacted by $\mathcal{C}(X)$. However, it is worth pointing out that Theorem 15 also relies on the restricted eigenvalue assumption made in H7, which in turn restricts the correlation between any set of $s = O(s_\star)$ columns of $X$. In other words, controlling $\mathcal{C}(X)$ alone is not enough for fast mixing.

The strong signal-to-noise ratio condition (2.18) plays a crucial role in the analysis, and our method of proof breaks down if this assumption does not hold. More flexible techniques, such as the space decomposition approach ([26, 18, 44]) or perhaps the coupling approach of [27], might be more successful in relaxing this assumption.

3.4. Numerical illustrations. We illustrate some of the conclusions above with the following simulation study. We consider a linear regression model with Gaussian noise $\mathbf{N}(0, \sigma^2)$, where $\sigma^2$ is set to 1. We experiment with sample size $n = 200$ and dimension $p \in \{500, 1000, \ldots, 5000\}$. We fix the number of non-zero coefficients to $s_\star = 10$, and $\delta_\star$ is given by
$$\delta_\star = (\underbrace{1, \ldots, 1}_{10},\ \underbrace{0, \ldots, 0}_{p-10}).$$
The non-zero coefficients of $\theta_\star$ are uniformly drawn from $(-a-1, -a)\cup(a, a+1)$, where
$$a = \mathrm{SNR}\sqrt{\frac{s_\star\log(p)}{n}}, \quad \text{where } \mathrm{SNR} \in \{0.5, 4\}.$$
We expect SNR = 4 to bring us closer to satisfying (2.18). To draw the design matrix, we proceed as follows. First we draw $X^{(0)} \in \mathbb{R}^{n\times p}$ with i.i.d. entries $\mathbf{N}(0, 1)$, and form its singular value decomposition $X^{(0)} = US^{(0)}V'$. To bring us closer to assumption (2.17), we re-scale the singular values to form $S^{(1)}$, where $S^{(1)}_{ii} = S^{(0)}_{11}\left(\frac{1}{i}\right)^{0.2}$, and we form $X^{(1)} = US^{(1)}V'$. We consider two different design matrices. In the low-coherence case we take $X = X^{(1)}$, whereas for the high-coherence example we take
$$X = \sum_{k=1}^K S^{(1)}_{kk}U_k V_k',$$
for some integer $K$ that we take as $K = 30 + (p - 2000)/100$, with $K = 20$ for $p \le 1000$. We subsequently standardize $X$ to satisfy (2.14).
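The design-matrix construction just described can be sketched as follows (numpy; `make_design` is our name; `K=None` gives the low-coherence design $X^{(1)}$, while a finite `K` gives the truncated high-coherence design).

```python
import numpy as np

def make_design(n, p, K=None, decay=0.2, rng=None):
    # Draw X^(0) with i.i.d. N(0,1) entries, rescale its singular values as
    # S^(1)_ii = S^(0)_11 * (1/i)^decay, optionally truncate the SVD expansion
    # to K terms (high-coherence case), then standardize columns per (2.14).
    rng = np.random.default_rng() if rng is None else rng
    X0 = rng.standard_normal((n, p))
    U, s, Vt = np.linalg.svd(X0, full_matrices=False)
    s1 = s[0] * (1.0 / np.arange(1, len(s) + 1)) ** decay
    k = len(s) if K is None else K
    X = (U[:, :k] * s1[:k]) @ Vt[:k, :]
    return X * np.sqrt(n) / np.linalg.norm(X, axis=0)   # ||X_j||_2^2 = n
```

Truncating to a small $K$ concentrates all the energy on a few singular directions, which inflates the coherence $\mathcal{C}(X)$ between the active and inactive columns.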

We choose the prior as in H3, with $u = 5$, $\rho$ as in (2.15), and set $\gamma$ as
$$\gamma = \frac{\gamma_0\sigma^2}{\lambda_{\max}(X'X)},$$
for a tuning parameter $\gamma_0 = 0.2$. We use the initial distribution $\nu_0 = \nu^{(\delta^{(0)})}(\cdot|z)$ described in Section 3.1.1 for some initial value $\delta^{(0)} \in \Delta$. We allow for some errors in specifying $\delta^{(0)}$, and we consider three choices: (a) a good initialization setup where $\delta^{(0)}$ has no false negatives and 10% false positives, (b) a poor initialization setup where $\delta^{(0)}$ has no false positives but 20% false negatives, and (c) a lasso initialization where $\delta^{(0)}$ is taken as the lasso estimate computed using the MATLAB default cross-validation setup. The lasso initialization corresponds, of course, to the initialization one would typically use in practice.

Since the ideal value of the step size $\tau$ in the Random Walk Metropolis step of Algorithm 1 is not known, we use an adaptive Random Walk Metropolis algorithm ([3]) to adaptively select $\tau$ such that the acceptance probability of the Markov chain is approximately 30%.
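A standard Robbins-Monro style adaptation rule of the kind used in adaptive MCMC can be sketched as follows (our parameter names; `target=0.30` matches the acceptance rate above, and the decaying gain makes the adaptation vanish asymptotically).

```python
import numpy as np

def adapt_step(tau, accepted, iteration, target=0.30, kappa=0.6):
    # Increase tau after an acceptance (empirical rate above target), decrease
    # it after a rejection, with gain (iteration+1)^{-kappa} -> 0.
    gain = (iteration + 1) ** (-kappa)
    return tau * np.exp(gain * (float(accepted) - target))
```

The update is applied on the log scale so that $\tau$ stays positive.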

To monitor the mixing, we compute the sensitivity and the precision at iteration $k$ as
$$\mathrm{SEN}_k = \frac{1}{s_\star}\sum_{j=1}^p \mathbf{1}_{\{|\delta_{k,j}|>0\}}\mathbf{1}_{\{|\delta_{\star,j}|>0\}}, \quad \mathrm{PREC}_k = \frac{\sum_{j=1}^p \mathbf{1}_{\{|\delta_{k,j}|>0\}}\mathbf{1}_{\{|\delta_{\star,j}|>0\}}}{\sum_{j=1}^p \mathbf{1}_{\{|\delta_{k,j}|>0\}}}.$$
We empirically measure the mixing time of the algorithm as the first time $k$ at which both $\mathrm{SEN}_k$ and $\mathrm{PREC}_k$ reach 1, truncated to $2\times 10^4$ – that is, we stop any run that has not mixed after $2\times 10^4$ iterations. In the high signal-to-noise regime (SNR = 4) this definition makes sense, since in that case we know that with high probability most of the probability mass of $\Pi_\gamma$ is concentrated on $\delta_\star$. In the weak signal-to-noise ratio regime this definition is not appropriate, since in this case the distribution $\Pi_\gamma$ can have a non-negligible probability of omitting some of the non-zero coefficients. In this case we amend the definition and set the mixing time as the first time at which $\mathrm{SEN} \ge \alpha_{\mathrm{SEN}}$ and $\mathrm{PREC} = 1$, where $\alpha_{\mathrm{SEN}}$ is set by running a long preliminary version of the algorithm.
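The two monitoring statistics can be computed as follows (numpy sketch, our function name):

```python
import numpy as np

def sensitivity_precision(delta_k, delta_star):
    # SEN_k: fraction of the true non-zeros recovered at iteration k;
    # PREC_k: fraction of the selected coordinates that are true non-zeros.
    tp = np.sum((delta_k != 0) & (delta_star != 0))
    sen = tp / max(np.sum(delta_star != 0), 1)
    prec = tp / max(np.sum(delta_k != 0), 1)
    return sen, prec
```

The `max(..., 1)` guards simply avoid division by zero when no coordinate is selected.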

For comparison, we also show the results for a similar Metropolis-within-Gibbs algorithm that samples from the weak spike-and-slab posterior distribution $\Pi_\gamma$ given in (1.3). This distribution differs from the posterior distribution analyzed by [31] only in that we have used here a Laplace slab density instead of the Gaussian density used by [31]. One can sample from it with the same strategy described above, with the additional simplification that the conditional distribution of $[\theta]_{\delta^c}$ given $\delta, [\theta]_\delta$ has a Gaussian closed form. The resulting sampler is similar to the Gibbs sampler implemented in [31].

All the results presented are based on 45 independent MCMC replications. Figure 1, presented in the introduction, shows the behavior of the mixing time as a function of the dimension $p$, with SNR = 4, under different initializations and design matrix coherences. The simulation results confirm the conclusion of Theorem 15 that the mixing time


is roughly linear in $p$ when the algorithm is well initialized and the coherence of the design matrix is low. We also observe that the two algorithms (for the two posterior distributions) behave similarly.

We also explore the behavior of the algorithms under the lasso initialization. Figures 2 and 3 show the boxplots of the mixing times under different scenarios for $p = 500$ and $p = 2000$. We obtain similar conclusions. The results show that when the signal-to-noise ratio is low, the initial lasso estimates have false negatives, which results in poor MCMC mixing. When the signal-to-noise ratio is high, the initial lasso estimates show good recovery. However, mixing is still slow if $\mathcal{C}(X)$ is high.

[Figure 2 appears here: boxplots over four panels (wSNR/hSNR × high/low coherence), with mixing_time (first row) and relative_error (second row) on the vertical axes, and Sampler (MA, SS) on the horizontal axis.]

Fig 2. Boxplots of estimated mixing times (first row) and relative error (second row) of sampling from $\Pi_\gamma$ (denoted MA) and from the weak spike-and-slab posterior (denoted SS), under different configurations of signal-to-noise ratio and matrix coherence. Dimension $p = 500$.

4. Proofs of Theorem 14 and Theorem 15.

4.1. Proof of Theorem 14. The proof is similar to the proof of Theorem 2 of [5],

or Theorem 3.2 of [24]. It is based on well-known iso-perimetric inequalities for log-

concave densities. We will use the following version taken from [43] Theorem 4.2.

[Figure 3 here: boxplot panels for the four configurations (wSNR/hSNR × high/low coherence).]

Fig 3. Boxplots of estimated mixing times (first row) and relative error (second row) of sampling from Π̌γ (denoted MA) and Πγ (denoted SS), under different configurations of signal-to-noise ratio and matrix coherence. Dimension p = 2000.

Lemma 16. Let Θ be a convex subset of R^p, and h : Θ → R a convex function. Let Θ = S1 ∪ S2 ∪ S3 be a partition of Θ into nonempty measurable components such

that d \stackrel{def}{=} \inf_{x_1\in S_1,\,x_2\in S_2}\|x_1-x_2\|_2 > 0. Then
\[
\int_{S_3} e^{-h(\theta)}\,d\theta \;\ge\; \frac{2d}{\mathrm{diam}(\Theta)}\,\min\left(\int_{S_1} e^{-h(\theta)}\,d\theta,\ \int_{S_2} e^{-h(\theta)}\,d\theta\right).
\]
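As a quick numerical sanity check of Lemma 16 (not part of the proof), the sketch below evaluates both sides of the inequality for a one-dimensional log-concave example, h(θ) = θ²/2 on Θ = [−3, 3], with the partition S1 = [−3, −1], S2 = [1, 3], S3 = (−1, 1), so that d = 2 and diam(Θ) = 6.

```python
# Numerical check of the iso-perimetric inequality of Lemma 16 in 1D.
import numpy as np

def mass(a, b, n=200_000):
    # midpoint-rule approximation of the integral of exp(-h) over [a, b]
    w = (b - a) / n
    t = np.linspace(a, b, n, endpoint=False) + w / 2
    return np.exp(-t**2 / 2).sum() * w

m1, m2, m3 = mass(-3, -1), mass(1, 3), mass(-1, 1)
d, diam = 2.0, 6.0
lhs, rhs = m3, (2 * d / diam) * min(m1, m2)
print(lhs >= rhs)  # the lemma's conclusion holds in this example
```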

Remark that (3.10) implies that for all u, v ∈ R^p, and for z ∈ {u, v}, the map A ↦ P(z, A) − ∫_A q_{u,v}(x) dx is a non-negative measure on R^p. Hence, for any measurable subset A of Θ,
\[
P(u,A) - P(v,A) \le P(u,A) - \int_A q_{u,v}(x)\,dx \le P(u,\Theta) - \int_\Theta q_{u,v}(x)\,dx.
\]
A similar bound holds for P(v,A) − P(u,A), leading to the following result.

Lemma 17. If (3.10) holds, then for all u, v ∈ R^p,
\[
\sup_{A\subseteq\Theta}|P(u,A) - P(v,A)| \le P(u,\Theta)\vee P(v,\Theta) - \int_\Theta q_{u,v}(z)\,dz,
\]
where a ∨ b \stackrel{def}{=} max(a, b).
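Lemma 17 can be checked numerically in a simple case: take P(u, ·) = N(u, 1) and P(v, ·) = N(v, 1) on Θ = R, with q_{u,v} the pointwise minimum of the two densities, which satisfies the minorization (3.10) by construction. Since here P(u, Θ) = P(v, Θ) = 1, the bound reads TV ≤ 1 − ∫ q_{u,v}; the grid and tolerance below are illustrative choices.

```python
# Numerical check of Lemma 17 for two unit-variance Gaussian kernels in 1D.
import numpy as np

u, v, n = 0.0, 1.5, 400_000
x = np.linspace(-10, 12, n)
dx = x[1] - x[0]
pu = np.exp(-(x - u)**2 / 2) / np.sqrt(2 * np.pi)
pv = np.exp(-(x - v)**2 / 2) / np.sqrt(2 * np.pi)

tv = 0.5 * np.abs(pu - pv).sum() * dx       # sup_A |P(u,A) - P(v,A)|
overlap = np.minimum(pu, pv).sum() * dx     # integral of q_{u,v}
# With P(u,Theta) = P(v,Theta) = 1, the lemma reads: tv <= 1 - overlap.
print(tv <= 1 - overlap + 1e-6)
```

For densities the two sides actually coincide, which is why the overlap ∫ q_{u,v} is the natural quantity to lower bound in the conductance argument below.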

With Lemma 16 and Lemma 17 in place, the proof of Theorem 14 proceeds as follows. Fix A ⊆ Θ such that 0 < π(A) < 1. Define
\[
S_1 \stackrel{def}{=} \left\{\theta\in A:\ P(\theta, A^c\cap\Theta) < \frac{\alpha}{2}\right\},\qquad S_2 \stackrel{def}{=} \left\{\theta\in A^c\cap\Theta:\ P(\theta, A) < \frac{\alpha}{2}\right\},
\]


and S3 = Θ \ (S1 ∪ S2), so that Θ = S1 ∪ S2 ∪ S3 is a partition of Θ. If π(S1) ≤ π(A)/2, then
\[
(4.1)\qquad \int_A \pi(d\theta)\,P(\theta, A^c\cap\Theta) \ \ge\ \frac{\alpha}{2}\,\pi(A\setminus S_1) \ \ge\ \frac{\alpha}{4}\,\pi(A).
\]
Similarly, if π(S2) ≤ π(A^c ∩ Θ)/2, then
\[
(4.2)\qquad \int_{A^c\cap\Theta} \pi(d\theta)\,P(\theta, A) \ \ge\ \frac{\alpha}{2}\,\pi\left((A^c\cap\Theta)\setminus S_2\right) \ \ge\ \frac{\alpha}{4}\,\pi(A^c\cap\Theta).
\]

Now suppose that π(S1) > π(A)/2 and π(S2) > π(A^c ∩ Θ)/2. Then, by reversibility,
\[
\begin{aligned}
\int_A \pi(d\theta)\,P(\theta, A^c\cap\Theta) &= \frac{1}{2}\int_A \pi(d\theta)\,P(\theta, A^c\cap\Theta) + \frac{1}{2}\int_{A^c\cap\Theta} \pi(d\theta)\,P(\theta, A)\\
&\ge \frac{\alpha}{4}\left(\pi(A\setminus S_1) + \pi\left((A^c\cap\Theta)\setminus S_2\right)\right)\\
&\ge \frac{\alpha}{4}\,\pi(S_3).\qquad (4.3)
\end{aligned}
\]

Fix θ, ϑ ∈ Θ such that |||θ − ϑ||| ≤ r. By assumption we then have ∫_Θ q_{θ,ϑ}(x) dx ≥ α. Without any loss of generality, suppose that P(θ, Θ) ≥ P(ϑ, Θ). It follows from Lemma 17 that
\[
P(\theta,\Theta) - \alpha \ \ge\ \sup_{B\subseteq\Theta}|P(\theta,B) - P(\vartheta,B)| \ \ge\ P(\theta,\Theta) - P(\theta, A^c\cap\Theta) - P(\vartheta, A),
\]
where the last inequality follows by setting B = A = Θ \ (A^c ∩ Θ). Hence P(θ, A^c ∩ Θ) ≥ α − P(ϑ, A). This means that if |||θ − ϑ||| ≤ r and ϑ ∈ S2, then necessarily θ ∉ S1.

Hence d \stackrel{def}{=} \inf_{\theta_1\in S_1,\,\theta_2\in S_2}\|\theta_1-\theta_2\|_2 \ge \inf_{\theta_1\in S_1,\,\theta_2\in S_2}|||\theta_1-\theta_2||| \ge r. By Lemma 16,
\[
\pi(S_3) \ \ge\ \frac{2r}{\mathrm{diam}(\Theta)}\,\min\left(\pi(S_1), \pi(S_2)\right).
\]

Combining this with (4.3), (4.1) and (4.2), we conclude that
\[
\Phi_\Theta(P) \ \ge\ \frac{\alpha}{4}\,\min\left(1,\ \frac{2r}{\mathrm{diam}(\Theta)}\right),
\]
as claimed.


4.2. Proof of Theorem 15. The first step of the proof is a lower bound on the conductance of Pγ, which we derive in the next lemma. Given m ≥ 1, M > 2, and ζ ∈ (0, 1/2), we define
\[
\mathcal{E}_\rho(\zeta,m,M) \stackrel{def}{=} \mathcal{E}_\rho \cap \left\{z\in\mathbb{R}^n:\ \Pi_\gamma\left(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}_{m,M}\,\middle|\,z\right) \ge 1-\zeta\right\}.
\]

Lemma 18. Assume H6, H7, and choose γ = γ0/(n log(p)) for some absolute constant γ0 > 0 such that 4(γ/σ²)λ_max(X′X) ≤ 1. Take ρ ∈ (0, ρ̄], where ρ̄ is as in (2.15). Fix ζ ∈ (0, 1/2), m ≥ 5, and M > 2 arbitrary. Then we can find finite absolute constants C0, C1, C2, C3 ≥ 1 that do not depend on ζ such that, setting the step-size τδ of the Random Walk Metropolis updates of Algorithm 1 as
\[
\tau_\delta = \frac{C_0}{\|\delta\|_0^{1/2}\sqrt{n\log(p)}},
\]
the following holds. For all p ≥ C1, and all z ∈ Eρ(ζ, m, M), we have
\[
(4.4)\qquad \Phi_\zeta(P_\gamma) \ \ge\ \frac{C_2}{\sqrt{p}}\,e^{-C_3 s_\star^2\left[1+\left(\frac{C(X)}{s_\star^{1/2}\log(p)}\sqrt{\frac{p}{n}}\right)^2\right]}.
\]
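To see what a bound of the form (4.4) implies for running times, the sketch below evaluates the lower bound for hypothetical constants C2 = C3 = 1 and a negligible coherence term — the theorem does not specify its absolute constants, so the numbers are purely illustrative. Since mixing times scale like 1/Φ² up to logarithmic factors, the bound translates into the O(p e^{s⋆²}) complexity discussed in the introduction.

```python
# Illustration only: turning a conductance lower bound of the shape (4.4)
# into a 1/Phi^2 mixing-time scale, with made-up constants C2 = C3 = 1.
import math

def phi_lower_bound(p, s, C2=1.0, C3=1.0, coh_term=0.0):
    return (C2 / math.sqrt(p)) * math.exp(-C3 * s**2 * (1.0 + coh_term**2))

for p in (500, 2000):
    phi = phi_lower_bound(p, s=3)
    print(p, phi, 1.0 / phi**2)   # 1/Phi^2 grows linearly in p here
```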

Proof. Fix m ≥ 5, M > 2, ζ ∈ (0, 1/2), and z ∈ Eρ(ζ, m, M) arbitrary. To shorten notation, we write B^{(δ)} for B^{(δ)}_{m,M}, and E for Eρ(ζ, m, M). Since z ∈ E, we have
\[
1 - \Pi_\gamma\left(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}\,\middle|\,z\right) \le \zeta.
\]

Let A be a measurable set such that ζ < Πγ(A|z) < 1 − ζ. We wish to lower bound the quantity ∫_A Πγ(dθ|z)Pγ(θ, A^c) / min(Πγ(A|z) − ζ, Πγ(A^c|z) − ζ). Given the expressions of Pγ and Πγ in (3.11) and (3.1) respectively, we have
\[
\begin{aligned}
\int_A \Pi_\gamma(d\theta|z)\,P_\gamma(\theta, A^c) &\ge \frac{1}{2}\sum_\delta \int_A \Pi_\gamma(d\theta|z)\,\Pi_\gamma(\delta|\theta,z)\,K_\delta(\theta, A^c)\\
&= \frac{1}{2}\sum_\delta \Pi_\gamma(\delta|z)\int_A \Pi_\gamma(d\theta|\delta,z)\,K_\delta(\theta, A^c)\\
&\ge \frac{1}{2}\,\Pi_\gamma(\delta_\star|z)\int_{A\cap\mathcal{B}^{(\delta_\star)}} \Pi_\gamma(d\theta|\delta_\star,z)\,K_{\delta_\star}\left(\theta, A^c\cap\mathcal{B}^{(\delta_\star)}\right).
\end{aligned}
\]

Then, using the definition of Φ_{B^{(δ)}}(K_δ) (the B^{(δ)}-conductance of K_δ), we get
\[
(4.5)\qquad \int_A \Pi_\gamma(d\theta|z)\,P_\gamma(\theta, A^c) \ \ge\ \frac{1}{2}\,\Pi_\gamma(\delta_\star|z)\,\Phi_{\mathcal{B}^{(\delta_\star)}}(K_{\delta_\star})\times\min\left(\Pi_\gamma(A_{\delta_\star}|\delta_\star,z),\ \Pi_\gamma(A^c_{\delta_\star}|\delta_\star,z)\right),
\]


where A_δ \stackrel{def}{=} A ∩ B^{(δ)} and A^c_δ \stackrel{def}{=} A^c ∩ B^{(δ)}. On the other hand, since 1 − Πγ({δ⋆} × B^{(δ⋆)}|z) ≤ ζ,
\[
\Pi_\gamma(A|z) \le \Pi_\gamma(\{\delta_\star\}\times A_{\delta_\star}|z) + 1 - \Pi_\gamma(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}|z) \le \Pi_\gamma(\{\delta_\star\}\times A_{\delta_\star}|z) + \zeta.
\]
Hence Πγ(A|z) − ζ ≤ Πγ(δ⋆|z)Πγ(A_{δ⋆}|δ⋆, z). A similar bound holds for Πγ(A^c|z) − ζ.

We combine these with (4.5) to get
\[
\frac{\int_A \Pi_\gamma(d\theta|z)\,P_\gamma(\theta, A^c)}{\min\left(\Pi_\gamma(A|z)-\zeta,\ \Pi_\gamma(A^c|z)-\zeta\right)} \ \ge\ \frac{1}{2}\,\Phi_{\mathcal{B}^{(\delta_\star)}}(K_{\delta_\star}).
\]
Since the right-hand side of the last display does not depend on A, we conclude that
\[
(4.6)\qquad \Phi_\zeta(P_\gamma) \ \ge\ \frac{1}{2}\,\Phi_{\mathcal{B}^{(\delta_\star)}}(K_{\delta_\star}).
\]

To lower-bound Φ_{B^{(δ⋆)}}(K_{δ⋆}), we apply Theorem 14 with Θ taken as B^{(δ)}, and |||θ||| \stackrel{def}{=} ‖[θ]δ‖2. To save notation, in the remainder of the proof we write δ for δ⋆, and s for s⋆. Let us check that all the assumptions of Theorem 14 are satisfied. Clearly, Kδ is reversible with respect to Πγ(dθ|δ, z) ∝ e^{−hγ(δ,θ;z)}dθ, and θ ↦ hγ(δ, θ; z) is convex ([34] Theorem 2.3).

Let us first recall our notation. For 1 ≤ i ≤ p, u ∈ R^i, v ∈ R^{p−i}, and δ ∈ ∆ such that ‖δ‖0 = i, we write (u, v)_δ to denote the vector θ ∈ R^p such that [θ]δ = u and [θ]_{δ^c} = v. When v = (0, . . . , 0) ∈ R^{p−i}, we write (u, 0)_δ. If the structure δ is understood, we simply write (u, v).

For τ² > 0 and u ∈ R^s, let Q_{s,τ²}(u, ·) denote the density of the Gaussian distribution N(u, τ²I_s) on R^s, and let G_{γ,δ}(u, ·) denote the density of the Gaussian distribution N(m_{δ,(u,0)}, γΣ_{δ,(u,0)}) on R^{p−s} used in the independence Metropolis–Hastings update, with m_{δ,(u,0)} and Σ_{δ,(u,0)} as given in (3.7). We recall that G_{γ,δ}(u, ·) is proportional to e^{−\bar h_γ(δ,(u,·);z)}, with \bar h_γ as in (3.6). With this notation, for any measurable subset A ⊆ R^p, the Markov chain can move from θ into A under the kernel K_δ = (1/2)\overrightarrow{K}_δ + (1/2)\overleftarrow{K}_δ if we choose to use \overrightarrow{K}_δ, and both the random walk Metropolis and the independence sampler moves are accepted:
\[
K_\delta(\theta, A) \ge \frac{1}{2}\int_A Q_{s,\tau_\delta^2}([\theta]_\delta,[u]_\delta)\,\min\left[1,\frac{e^{-h_\gamma(\delta,([u]_\delta,[\theta]_{\delta^c});z)}}{e^{-h_\gamma(\delta,([\theta]_\delta,[\theta]_{\delta^c});z)}}\right] G_{\gamma,\delta}([u]_\delta,[u]_{\delta^c})\,\min\left[1,\frac{e^{-h_\gamma(\delta,([u]_\delta,[u]_{\delta^c});z)}\,G_{\gamma,\delta}([u]_\delta,[\theta]_{\delta^c})}{e^{-h_\gamma(\delta,([u]_\delta,[\theta]_{\delta^c});z)}\,G_{\gamma,\delta}([u]_\delta,[u]_{\delta^c})}\right] du.
\]


Hence K_δ satisfies (3.10) with
\[
q(\theta, u) \stackrel{def}{=} \frac{1}{2}\,Q_{s,\tau_\delta^2}([\theta]_\delta,[u]_\delta)\,\min\left[1,\frac{e^{-h_\gamma(\delta,([u]_\delta,[\theta]_{\delta^c});z)}}{e^{-h_\gamma(\delta,([\theta]_\delta,[\theta]_{\delta^c});z)}}\right]\times G_{\gamma,\delta}([u]_\delta,[u]_{\delta^c})\,\min\left[1,\frac{e^{-h_\gamma(\delta,([u]_\delta,[u]_{\delta^c});z)}\,G_{\gamma,\delta}([u]_\delta,[\theta]_{\delta^c})}{e^{-h_\gamma(\delta,([u]_\delta,[\theta]_{\delta^c});z)}\,G_{\gamma,\delta}([u]_\delta,[u]_{\delta^c})}\right].
\]

We show in Lemma 27 and Lemma 28 of the supplement that we can find a finite absolute constant C0 such that for all p large enough,
\[
\sup_{z\in\mathcal{E}_\rho}\,\sup_{\theta\in\mathcal{B}^{(\delta)}}\left|\bar h_\gamma(\delta,\theta;z) - h_\gamma(\delta,\theta;z)\right| \le C_0 R_1,
\]
and
\[
\sup_{z\in\mathcal{E}_\rho}\left|h_\gamma(\delta,\theta_1;z) - h_\gamma(\delta,\theta_2;z)\right| \le C_0 R_2\,|||\theta_2-\theta_1||| + C_0 R_3
\]
for all θ1, θ2 ∈ B^{(δ)} such that [θ1]_{δ^c} = [θ2]_{δ^c}, where
\[
R_1 \stackrel{def}{=} s\gamma\left(\rho + C(X)\sqrt{(m+1)\gamma np}\right)^2,\qquad R_2 \stackrel{def}{=} s^{1/2}\left(\rho + nM\epsilon + C(X)\gamma n s^{1/2}\sqrt{(m+1)\gamma np}\right),
\]
and R_3 = \gamma s(\rho + nM\epsilon)\left(\rho + nM\epsilon + C(X)\sqrt{(m+1)\gamma np}\right).

Hence, for z ∈ Eρ and ([u]δ, [θ]_{δ^c}), ([u]δ, [u]_{δ^c}) ∈ B^{(δ)}, we have
\[
\min\left[1,\frac{e^{-h_\gamma(\delta,([u]_\delta,[u]_{\delta^c});z)}\,G_{\gamma,\delta}([u]_\delta,[\theta]_{\delta^c})}{e^{-h_\gamma(\delta,([u]_\delta,[\theta]_{\delta^c});z)}\,G_{\gamma,\delta}([u]_\delta,[u]_{\delta^c})}\right] = \min\left[1,\frac{e^{\bar h_\gamma(\delta,([u]_\delta,[u]_{\delta^c});z)-h_\gamma(\delta,([u]_\delta,[u]_{\delta^c});z)}}{e^{\bar h_\gamma(\delta,([u]_\delta,[\theta]_{\delta^c});z)-h_\gamma(\delta,([u]_\delta,[\theta]_{\delta^c});z)}}\right] \ \ge\ e^{-2C_0R_1}.
\]

It follows that for all p large enough, z ∈ Eρ, and θ1, θ2 ∈ B^{(δ)},
\[
\int_{\mathbb{R}^p}\min\left(q(\theta_1,u), q(\theta_2,u)\right)du \ \ge\ \frac{e^{-C_0(2R_1+R_3)}}{2}\int_{\mathcal{B}^{(\delta)}}\min\left(Q_{s,\tau_\delta^2}([\theta_1]_\delta,[u]_\delta)\,e^{-C_0R_2|||u-\theta_1|||},\ Q_{s,\tau_\delta^2}([\theta_2]_\delta,[u]_\delta)\,e^{-C_0R_2|||u-\theta_2|||}\right) G_{\gamma,\delta}([u]_\delta,[u]_{\delta^c})\,du.
\]


To proceed, we define
\[
V_1 \stackrel{def}{=} \left\{x\in\mathbb{R}^s:\ \|x-[\theta_\star]_\delta\|_2\le M\epsilon,\ \|x-[\theta_1]_\delta\|_2\le \sqrt{\frac{2\pi}{320}}\,\frac{4M\epsilon}{s^{1/2}\log(p)},\ \text{and}\ \|x-[\theta_2]_\delta\|_2\le \sqrt{\frac{2\pi}{320}}\,\frac{4M\epsilon}{s^{1/2}\log(p)}\right\},
\]
and
\[
V_2 \stackrel{def}{=} \left\{v\in\mathbb{R}^{p-s}:\ \|v\|_2\le\epsilon_{11},\ \|v\|_\infty\le\epsilon_{12}\right\},
\]
where ε11 = 2√((1+m)γp) and ε12 = 2√((m+1)γ log(p)). We note that V \stackrel{def}{=} {(u, v)_δ : u ∈ V1, v ∈ V2} ⊂ B^{(δ)}, so that it follows from the last display that

\[
(4.7)\qquad \int_{\mathbb{R}^p}\min\left(q(\theta_1,u), q(\theta_2,u)\right)du \ \ge\ \frac{e^{-C_0(2R_1+R_3)}}{2}\,e^{-C_0R_2\sqrt{\frac{2\pi}{320}}\frac{4M\epsilon}{s^{1/2}\log(p)}}\left[\inf_{x\in V_1} G_{\gamma,\delta}(x, V_2)\right]\int_{V_1}\min\left(Q_{s,\tau_\delta^2}([\theta_1]_\delta,x),\ Q_{s,\tau_\delta^2}([\theta_2]_\delta,x)\right)dx.
\]

We recall that G_{γ,δ}(x, ·) is the density of the Gaussian distribution N(m_{δ,(x,0)}, γΣ_{δ,(x,0)}) on R^{p−s}, where Σ_{δ,(x,0)} = (I_{p−s} − (γ/σ²)X′_{δ^c}X_{δ^c})^{−1}, and
\[
m_{\delta,(x,0)} = -\frac{\gamma}{\sigma^2}\left(I_{p-s}-\frac{\gamma}{\sigma^2}X'_{\delta^c}X_{\delta^c}\right)^{-1}X'_{\delta^c}X_\delta\left([J_\gamma(\delta,(x,0))]_\delta - x\right).
\]

We need to lower bound the term G_{γ,δ}(x, V2) = N(m_{δ,(x,0)}, γΣ_{δ,(x,0)})(V2). It suffices to show that ‖m_{δ,(x,0)}‖2 ≤ √(γp) and ‖m_{δ,(x,0)}‖∞ ≤ √γ. Indeed, if these inequalities hold, and if (U1, . . . , U_{p−s}) iid∼ N(0, 1), we have
\[
\mathbb{P}\left(\|m_{\delta,(x,0)}+\gamma^{1/2}\Sigma^{1/2}_{\delta,(x,0)}U\|_2>\epsilon_{11}\right) \le \mathbb{P}\left(U'\Sigma_{\delta,(x,0)}U>\frac{1}{\gamma}\left(\epsilon_{11}-\sqrt{\gamma p}\right)^2\right) \le \exp\left[-\frac{\left(2\sqrt{(m+1)p}-\sqrt{p}-\sqrt{\mathrm{Tr}(\Sigma_{\delta,(x,0)})}\right)^2}{2\lambda_{\max}(\Sigma_{\delta,(x,0)})}\right],
\]
where the last inequality uses Lemma 25 in the supplement. Similarly,
\[
\mathbb{P}\left(\|m_{\delta,(x,0)}+\gamma^{1/2}\Sigma^{1/2}_{\delta,(x,0)}U\|_\infty>\epsilon_{12}\right) \le 2\exp\left[\log(p)-\frac{(m+1)\log(p)}{2\lambda_{\max}(\Sigma_{\delta,(x,0)})}\right].
\]
Noting that the largest eigenvalue of Σ_{δ,(x,0)} is bounded from above by 4/3 (since 4γλ_max(X′X) ≤ σ²), we easily conclude that
\[
(4.8)\qquad \inf_{x\in V_1} G_{\gamma,\delta}(x, V_2) \ \ge\ \frac{1}{2},
\]


for all p large enough and m ≥ 5. It remains to show that ‖m_{δ,(x,0)}‖2 ≤ √(γp) and ‖m_{δ,(x,0)}‖∞ ≤ √γ. For any θ, (m_{δ,θ})_j = −(γ/σ²)[J_γ(δ, θ) − θ]′_δ X′_δ X_{δ^c}(Σ_{δ,θ}e_j), where e_j denotes the j-th standard unit vector. Clearly, ‖Σ_{δ,θ}e_j‖2 ≤ 4/3. Hence, using the definition of the coherence parameter C(X),
\[
\|m_{\delta,\theta}\|_\infty \le \frac{4}{3}\,\frac{\gamma}{\sigma^2}\,C(X)\sqrt{n\,v(s)\,\lambda_{\max}(X'X)}\,\|J_\gamma(\delta,\theta)-\theta_\delta\|_2.
\]
Noting that for θ = (x, 0) ∈ B^{(δ)},
\[
\|J_\gamma(\delta,\theta)-\theta_\delta\|_2 \le s^{1/2}\gamma\left(\frac{3\rho}{2}+\frac{n\sqrt{v(s)}}{\sigma^2}M\epsilon\right) \le C s^{1/2}\gamma(\rho+nM\epsilon) = O(s^{1/2}\gamma n\epsilon),
\]
we easily get
\[
\|m_{\delta,(x,0)}\|_\infty = o(\sqrt{\gamma}),\quad\text{and}\quad \|m_{\delta,(x,0)}\|_2 \le \sqrt{p}\,\|m_{\delta,(x,0)}\|_\infty = o(\sqrt{\gamma p}),
\]

since s = o(log(p)), as assumed in H7. Therefore, (4.8) and (4.7) imply that
\[
\int_{\mathbb{R}^p}\min\left(q(\theta_1,u), q(\theta_2,u)\right)du \ \ge\ \frac{e^{-C_0(2R_1+R_3)}}{4}\,e^{-C_0R_2\sqrt{\frac{2\pi}{320}}\frac{4M\epsilon}{s^{1/2}\log(p)}}\int_{V_1}\min\left(Q_{s,\tau_\delta^2}([\theta_1]_\delta,x),\ Q_{s,\tau_\delta^2}([\theta_2]_\delta,x)\right)dx.
\]

We lower bound the integral on the right-hand side of the last display using Lemma 29 in the supplement, applied with d ← s, R ← Mε, σ ← τδ = √(2π/320)·Mε/(s log(p)), and r ← 4τδ√s = √(2π/320)·4Mε/(s^{1/2} log(p)). The lemma implies that for |||θ1 − θ2||| ≤ τδ/4, we have
\[
\int_{\mathbb{R}^p}\min\left(q(\theta_1,u), q(\theta_2,u)\right)du \ \ge\ \frac{e^{-C_0(2R_1+R_3)}}{16}\,e^{-C_0R_2\sqrt{\frac{2\pi}{320}}\frac{4M\epsilon}{s^{1/2}\log(p)}}.
\]

Hence K_δ satisfies all the assumptions of Theorem 14, and we conclude that
\[
\Phi_{\mathcal{B}^{(\delta)}}(K_\delta) \ \ge\ \frac{1}{64}\,e^{-2C_0(2R_1+R_3)}\,e^{-C_0R_2\sqrt{\frac{2\pi}{320}}\frac{4M\epsilon}{s^{1/2}\log(p)}}\times\min\left(1,\ \frac{\sqrt{2\pi}}{2\times 320}\,\frac{M\epsilon}{s\log(p)\left(2\sqrt{(m+1)\gamma p}+M\epsilon\right)}\right).
\]

Using γ = γ0/(n log(p)), we check that for some absolute constant C,
\[
R_1 \le C\,s\,\frac{p}{n}\left(\frac{C(X)}{\log(p)}\right)^2,\qquad R_3 \le C\left(s^2+s^{3/2}\sqrt{\frac{p}{n}}\,\frac{C(X)}{\log(p)}\right),
\]


and
\[
\frac{R_2\,M\epsilon}{s^{1/2}\log(p)} \le C\,s\left(1+\sqrt{\frac{p}{n}}\,\frac{C(X)}{(\log(p))^{3/2}}\right).
\]

Hence, we have
\[
R_1+R_3+\frac{R_2\,M\epsilon}{s^{1/2}\log(p)} = O\left(s^2\left[1+\left(\frac{C(X)}{s^{1/2}\log(p)}\sqrt{\frac{p}{n}}\right)^2\right]\right),
\]
and
\[
\frac{M\epsilon}{s\log(p)\left(2\sqrt{(m+1)\gamma p}+M\epsilon\right)} \ \ge\ \frac{C}{1+\sqrt{m+1}}\,\frac{1}{s\log(p)+\sqrt{ps}},
\]

and it follows that
\[
(4.9)\qquad \Phi_{\mathcal{B}^{(\delta)}}(K_\delta) \ \ge\ \frac{C_1}{\sqrt{p}}\,e^{-C_2 s^2\left[1+\left(\frac{C(X)}{s^{1/2}\log(p)}\sqrt{\frac{p}{n}}\right)^2\right]},
\]
for all p large enough, for absolute constants C1, C2. The result then follows by combining (4.9) and (4.6).

4.2.1. Proof of Theorem 15. Throughout the proof, C denotes a generic universal constant whose actual value may change from one appearance to the next. To shorten notation, we also write δ to denote δ^{(0)} (the initialization of Algorithm 1 as described in Section 3.1.1). Fix ζ0 ∈ (0, 1/2). Set
\[
(4.10)\qquad m_1 = \max\left(0,\ \frac{8\sigma^2}{v(s)\gamma_0}\log\left(\frac{8\,\mathrm{FP}}{\zeta_0}\right)\right),\quad\text{and}\quad \zeta = \frac{\zeta_0}{16\,p^{C_0(1+m_1+\mathrm{FP})}},
\]

where C0 ≥ 1 is an absolute constant that we specify later. Let m ≥ 5 and M > max(2, √((u+2)/3)) be arbitrary, and set
\[
\mathcal{E}_\rho(\zeta,m,M) \stackrel{def}{=} \mathcal{E}_{\rho,\Lambda}\cap\left\{z\in\mathbb{R}^n:\ \Pi_\gamma\left(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}_{m,M}\,\middle|\,z\right)\ge 1-\zeta\right\}.
\]

By Markov's inequality,
\[
\begin{aligned}
\mathbb{P}_\star\left(Z\notin\mathcal{E}_\rho(\zeta,m,M)\right) &\le \mathbb{P}_\star\left[1-\Pi_\gamma\left(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}_{m,M}\,\middle|\,Z\right)>\zeta\,\middle|\,Z\in\mathcal{E}_{\rho,\Lambda}\right]+\mathbb{P}_\star\left(Z\notin\mathcal{E}_{\rho,\Lambda}\right)\\
&\le \frac{1}{\zeta}\,\mathbb{E}_\star\left[1-\Pi_\gamma\left(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}_{m,M}\,\middle|\,Z\right)\,\middle|\,Z\in\mathcal{E}_{\rho,\Lambda}\right]+\mathbb{P}_\star\left(Z\notin\mathcal{E}_{\rho,\Lambda}\right).
\end{aligned}
\]

Therefore, by Corollary 10, we can choose absolute constants m ≥ 5 and M > max(2, √((u+2)/3)) (that depend only on C0, m1, FP and ζ0) such that if u and m0 are also large enough,


then we have
\[
(4.11)\qquad \mathbb{P}_\star\left(Z\notin\mathcal{E}_\rho(\zeta,m,M)\right) \le \frac{\zeta_0}{2},
\]

for all p large enough. By Theorem 13 and Lemma 18, there exist absolute constants C1, C2, C3, C4 that do not depend on ζ such that for all p ≥ C1 and all z ∈ Eρ(ζ, m, M), choosing the step-size of the Random Walk Metropolis as τδ = C2/(‖δ‖0^{1/2}√(n log(p))), and choosing an integer K satisfying
\[
K \ \ge\ C_3\,p\,\log\left(\frac{1}{\zeta}\right)\exp\left(C_4 s_\star^2\left[1+\left(\frac{C(X)}{s_\star^{1/2}\log(p)}\sqrt{\frac{p}{n}}\right)^2\right]\right),
\]
we have
\[
\|\nu_0 P_\gamma^K - \Pi_\gamma(\cdot|z)\|_{\mathrm{tv}} \ \le\ 2\sup_{A:\,\Pi_\gamma(A|z)\le\zeta}\left|\nu_0(A)-\Pi_\gamma(A|z)\right|.
\]

Using this with (4.11), it follows that
\[
(4.12)\qquad \mathbb{E}_\star\left[\|\nu_0 P_\gamma^K-\Pi_\gamma(\cdot|Z)\|_{\mathrm{tv}}\right] \ \le\ 2\,\mathbb{E}_\star\left[\sup_{A:\,\Pi_\gamma(A|Z)\le\zeta}\left|\nu_0(A)-\Pi_\gamma(A|Z)\right|\mathbf{1}_{\mathcal{E}_\rho(\zeta,m,M)}(Z)\right]+\frac{\zeta_0}{2}.
\]

Therefore, to finish the proof it suffices to upper bound the first term on the right-hand side of the last display. We recall that ν0 = ν^{(δ)}(·|z) (where δ is short for δ^{(0)}), and we split the term Πγ(A|z) − ν0(A) as
\[
\left(\Pi_\gamma(A|z)-\nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(A|z)\right)+\left(\nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(A|z)-\nu^{(\delta)}(A|z)\right),
\]
where B^{(δ)}_1 is short for B^{(δ)}_{m1,M}. For any measurable set A of R^p such that Πγ(A|z) ≤ ζ, if Πγ(A|z) ≥ ν^{(δ)}_{B^{(δ⋆)}_1}(A|z), then
\[
\left|\Pi_\gamma(A|z)-\nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(A|z)\right| \le \zeta.
\]

We note that for z ∈ Eρ(ζ, m, M),
\[
\Pi_\gamma(A|z) \ \ge\ \Pi_\gamma\left(\{\delta_\star\}\times A\cap\mathcal{B}^{(\delta_\star)}_1\,\middle|\,z\right) = \Pi_\gamma\left(\{\delta_\star\}\times\mathcal{B}^{(\delta_\star)}_1\,\middle|\,z\right)\frac{\Pi_\gamma(A\cap\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}{\Pi_\gamma(\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)} \ \ge\ (1-\zeta)\,\frac{\Pi_\gamma(A\cap\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}{\Pi_\gamma(\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)},
\]


Using this, we see that if Πγ(A|z) ≤ ζ, then Πγ(A ∩ B^{(δ⋆)}_1|δ⋆, z)/Πγ(B^{(δ⋆)}_1|δ⋆, z) ≤ ζ/(1 − ζ) ≤ 2ζ, so that if Πγ(A|z) < ν^{(δ)}_{B^{(δ⋆)}_1}(A|z), then
\[
\begin{aligned}
\left|\Pi_\gamma(A|z)-\nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(A|z)\right| &\le \nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(A|z)-\frac{\Pi_\gamma(A\cap\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}{\Pi_\gamma(\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}+\zeta\,\frac{\Pi_\gamma(A\cap\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}{\Pi_\gamma(\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}\\
&\le \left|\nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(A|z)-\frac{\Pi_\gamma(A\cap\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}{\Pi_\gamma(\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}\right|+\zeta.
\end{aligned}
\]

We conclude that for z ∈ Eρ(ζ, m, M),
\[
(4.13)\qquad \sup_{A:\,\Pi_\gamma(A|z)\le\zeta}\left|\Pi_\gamma(A|z)-\nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(A|z)\right| \ \le\ 2\zeta\left[1+\sup_{\Pi_\gamma(A\cap\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)>0}\frac{\nu^{(\delta)}(A\cap\mathcal{B}^{(\delta_\star)}_1|z)\,/\,\nu^{(\delta)}(\mathcal{B}^{(\delta_\star)}_1|z)}{\Pi_\gamma(A\cap\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)\,/\,\Pi_\gamma(\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}\right].
\]

Given A ⊆ B^{(δ⋆)}_1, we write
\[
(4.14)\qquad \frac{\nu^{(\delta)}(A|z)/\nu^{(\delta)}(\mathcal{B}^{(\delta_\star)}_1|z)}{\Pi_\gamma(A|\delta_\star,z)/\Pi_\gamma(\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)} = \frac{\nu^{(\delta_\star)}(A|z)/\nu^{(\delta_\star)}(\mathcal{B}^{(\delta_\star)}_1|z)}{\Pi_\gamma(A|\delta_\star,z)/\Pi_\gamma(\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)}\times\frac{\nu^{(\delta)}(A|z)/\nu^{(\delta)}(\mathcal{B}^{(\delta_\star)}_1|z)}{\nu^{(\delta_\star)}(A|z)/\nu^{(\delta_\star)}(\mathcal{B}^{(\delta_\star)}_1|z)}.
\]

Define
\[
g_\gamma(\delta,\theta;z) \stackrel{def}{=} \frac{1}{2}\left([\theta]_\delta-\hat\theta_\delta\right)'I_{\gamma,\delta}\left([\theta]_\delta-\hat\theta_\delta\right)+\frac{1}{2\gamma}\left(\theta-\hat\theta_\delta\right)'\left(\theta-\hat\theta_\delta\right).
\]

The first ratio on the right-hand side of (4.14) can be written as
\[
(4.15)\qquad \frac{\nu^{(\delta_\star)}(A|z)/\nu^{(\delta_\star)}(\mathcal{B}^{(\delta_\star)}_1|z)}{\Pi_\gamma(A|\delta_\star,z)/\Pi_\gamma(\mathcal{B}^{(\delta_\star)}_1|\delta_\star,z)} = \frac{\int_A e^{-g_\gamma(\delta_\star,\theta;z)}\,d\theta}{\int_A e^{-h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1}\,d\theta}\times\frac{\int_{\mathcal{B}^{(\delta_\star)}_1} e^{-h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1}\,d\theta}{\int_{\mathcal{B}^{(\delta_\star)}_1} e^{-g_\gamma(\delta_\star,\theta;z)}\,d\theta}.
\]

By Lemma 19 in the supplement, we have
\[
h_\gamma(\delta_\star,\theta;z) = -\ell(\theta_{\delta_\star};z)+\rho\|\theta_{\delta_\star}\|_1+\frac{1}{2\gamma}\left(\theta-\theta_{\delta_\star}\right)'\left(\theta-\theta_{\delta_\star}\right)-R,
\]


where the remainder R satisfies
\[
(4.16)\qquad 0 \le R \le \frac{1}{2}\left(\theta-\theta_{\delta_\star}\right)'S\left(\theta-\theta_{\delta_\star}\right)+\frac{\gamma}{2}\left\|\delta_\star\cdot\nabla\ell(\theta;z)-\rho\,\mathrm{sign}(\theta_{\delta_\star})\right\|_2^2.
\]

It follows that
\[
-h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1+g_\gamma(\delta_\star,\theta;z) = \rho\left(\|\hat\theta_{\delta_\star}\|_1-\|\theta_{\delta_\star}\|_1\right)+\ell(\theta_{\delta_\star};z)-\ell(\hat\theta_{\delta_\star};z)+\frac{1}{2}\left([\theta]_{\delta_\star}-\hat\theta_{\delta_\star}\right)'I_{\gamma,\delta_\star}\left([\theta]_{\delta_\star}-\hat\theta_{\delta_\star}\right)+R.
\]
Since ℓ(θ; z) is quadratic and δ⋆ · ∇ℓ(θ̂_{δ⋆}; z) = 0, we have ℓ(θ_{δ⋆}; z) − ℓ(θ̂_{δ⋆}; z) + ½([θ]_{δ⋆} − θ̂_{δ⋆})′I_{γ,δ⋆}([θ]_{δ⋆} − θ̂_{δ⋆}) = 0. Hence
\[
-h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1+g_\gamma(\delta_\star,\theta;z) = \rho\left(\|\hat\theta_{\delta_\star}\|_1-\|\theta_{\delta_\star}\|_1\right)+R.
\]

For z ∈ Eρ and θ ∈ B^{(δ⋆)}_1, |‖θ̂_{δ⋆}‖1 − ‖θ_{δ⋆}‖1| ≤ s⋆^{1/2}(M + 1)ε. Using this and the fact that R ≥ 0, we see that the first term on the right-hand side of (4.15) is upper bounded by
\[
e^{(M+1)\rho s_\star^{1/2}\epsilon} \le e^{\log(p)},
\]
since ρs⋆^{1/2}ε = o(log(p)) by assumption. Proceeding as in the calculations leading to (5.4) in the supplement, we can show that

\[
-h_\gamma(\delta_\star,\theta;z)-\ell(\hat\theta_{\delta_\star};z)+\rho\|\hat\theta_{\delta_\star}\|_1 = \rho\left(\|\hat\theta_{\delta_\star}\|_1-\|\theta_{\delta_\star}\|_1\right)-g_\gamma(\delta_\star,\theta;z)+R \ \le\ (M+1)\rho s_\star^{1/2}\epsilon+C(1+M^2)s_\star-\frac{1}{2}\left([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star}\right)'I_{\gamma,\delta_\star}\left([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star}\right)-\frac{1}{2\gamma}\left(\theta-\theta_{\delta_\star}\right)'A_\gamma\left(\theta-\theta_{\delta_\star}\right),
\]

where A_γ = I_p − γ(1 + 4γn v(s⋆))S. Hence the second term on the right-hand side of (4.15) is upper bounded by
\[
e^{C(1+M^2)s_\star}\,\frac{\int_{\mathcal{B}^{(\delta_\star)}_1} e^{-\frac{1}{2}([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})'I_{\gamma,\delta_\star}([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})-\frac{1}{2\gamma}[\theta]'_{\delta^c_\star}[A_\gamma]_{\delta^c_\star}[\theta]_{\delta^c_\star}}\,d\theta}{\int_{\mathcal{B}^{(\delta_\star)}_1} e^{-\frac{1}{2}([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})'I_{\gamma,\delta_\star}([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star})-\frac{1}{2\gamma}\|[\theta]_{\delta^c_\star}\|_2^2}\,d\theta} \ \le\ \frac{2\,e^{C(1+M^2)s_\star}}{\sqrt{\det\left([A_\gamma]_{\delta^c_\star}\right)}} \ \le\ 2\,e^{C(1+M^2)s_\star}\,e^{C\gamma^2\left(n\,\mathrm{Tr}(X'X)+\|X'X\|_F^2\right)}.
\]


Finally, we use the assumptions γ²(n Tr(X′X) + ‖X′X‖²_F) = o(log(p)) and s⋆ = o(log(p)) to conclude that the second term on the right-hand side of (4.15) is upper bounded by e^{log(p)} for all p large enough. Therefore, the first term on the right-hand side of (4.14) is upper bounded by e^{2 log(p)} for all p large enough. The second term on the right-hand side of (4.14) can be written as
\[
\frac{\nu^{(\delta)}(A|z)/\nu^{(\delta)}(\mathcal{B}^{(\delta_\star)}_1|z)}{\nu^{(\delta_\star)}(A|z)/\nu^{(\delta_\star)}(\mathcal{B}^{(\delta_\star)}_1|z)} = \frac{\int_A e^{-g_\gamma(\delta,\theta;z)}\,d\theta}{\int_A e^{-g_\gamma(\delta_\star,\theta;z)}\,d\theta}\times\frac{\int_{\mathcal{B}^{(\delta_\star)}_1} e^{-g_\gamma(\delta_\star,\theta;z)}\,d\theta}{\int_{\mathcal{B}^{(\delta_\star)}_1} e^{-g_\gamma(\delta,\theta;z)}\,d\theta}.
\]

We claim that
\[
(4.17)\qquad \sup_{z\in\mathcal{E}_\rho}\,\sup_{\theta\in\mathcal{B}^{(\delta_\star)}_1}\left|g_\gamma(\delta,\theta;z)-g_\gamma(\delta_\star,\theta;z)\right| \le C(m_1+\mathrm{FP})\log(p),
\]
for some absolute constant C ≥ 1 (that does not depend on m, M). Together with the display right before it, (4.17) then implies that the second term on the right-hand side of (4.14) is upper bounded by p^{C(m1+FP)}. Therefore, (4.13) becomes

\[
(4.18)\qquad \sup_{A:\,\Pi_\gamma(A|z)\le\zeta}\left|\Pi_\gamma(A|z)-\nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(A|z)\right|\mathbf{1}_{\mathcal{E}_\rho(\zeta,m,M)}(z) \ \le\ 4\zeta\,p^{C(1+m_1+\mathrm{FP})}.
\]

Hence, taking C0 as the constant C of the last display in the expression of ζ, we conclude from (4.18) and (4.12) that for all p large enough,
\[
(4.19)\qquad \mathbb{E}_\star\left[\|\nu_0 P_\gamma^K-\Pi_\gamma(\cdot|Z)\|_{\mathrm{tv}}\right] \le \zeta_0+2\,\mathbb{E}_\star\left[\left\|\nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(\cdot|Z)-\nu^{(\delta)}(\cdot|Z)\right\|_{\mathrm{tv}}\mathbf{1}_{\mathcal{E}_\rho(\zeta,m,M)}(Z)\right].
\]

Noting that ν^{(δ)}_{B^{(δ⋆)}_1}(·|z) is the restriction of ν^{(δ)}(·|z) to B^{(δ⋆)}_1, we have
\[
(4.20)\qquad \left\|\nu^{(\delta)}_{\mathcal{B}^{(\delta_\star)}_1}(\cdot|z)-\nu^{(\delta)}(\cdot|z)\right\|_{\mathrm{tv}} \le 1-\nu^{(\delta)}\left(\mathcal{B}^{(\delta_\star)}_1\,\middle|\,z\right),
\]


and
\[
\begin{aligned}
(4.21)\qquad 1-\nu^{(\delta)}\left(\mathcal{B}^{(\delta_\star)}_1\,\middle|\,z\right) &\le \mathbb{P}\left(\left\|[\hat\theta_\delta+I^{-1/2}_{\delta,\gamma}V]_{\delta_\star}-[\theta_\star]_{\delta_\star}\right\|_2>M\epsilon\right)\\
&\quad+\mathbb{P}\left(\left\|[\hat\theta_\delta+I^{-1/2}_{\delta,\gamma}V]_{\delta-\delta_\star}\right\|_2^2>2(m_1+1)\gamma p\right)+\mathbb{P}\left(\left\|[\hat\theta_\delta+I^{-1/2}_{\delta,\gamma}V]_{\delta-\delta_\star}\right\|_\infty^2>4(m_1+1)\gamma\log(p)\right)\\
&\quad+\mathbb{P}\left(\gamma\|W\|_2^2>2(m_1+1)\gamma p\right)+\mathbb{P}\left(\gamma\|W\|_\infty^2>4(m_1+1)\gamma\log(p)\right),
\end{aligned}
\]

where V = (V1, . . . , V_{‖δ‖0}) iid∼ N(0, 1) and W = (W1, . . . , W_{p−s⋆−FP}) iid∼ N(0, 1). For z ∈ Eρ (which implies that ‖θ̂δ − [θ⋆]δ‖2 ≤ ε, as seen in (2.10)), given that ε = (σ/v(s))√(72(m0 + 1)s log(p)/n), and noting that the largest eigenvalue of I^{−1}_{γ,δ} is σ²/(n v(s)), we can use standard Gaussian deviation bounds to conclude that the sum of the first and last two terms on the right-hand side of (4.21) is upper bounded by
\[
\mathbb{P}\left(\|[V]_{\delta_\star}\|_2>\frac{(M-1)\epsilon}{\sqrt{\sigma^2/(n\,v(s))}}\right)+\mathbb{P}\left(\|W\|_2>\sqrt{2(1+m_1)p}\right)+\mathbb{P}\left(\|W\|_\infty>2\sqrt{(1+m_1)\log(p)}\right) \le \frac{1}{p^{s_\star(m_0+1)}}+e^{-m_1 p/4}+\frac{1}{p^{m_1}} \le \frac{\zeta_0}{4},
\]
for all p large enough and m1 ≥ 1. Suppose that for z ∈ Eρ(ζ, m, M), ‖[θ̂δ]_{δ−δ⋆}‖∞ ≤ √((m1 + 1)γ log(p)) (which implies that ‖[θ̂δ]_{δ−δ⋆}‖2 ≤ FP^{1/2}√((m1 + 1)γ log(p))). Then the sum of the second and third terms on the right-hand side of (4.21) is upper bounded by

\[
\begin{aligned}
&\mathbb{P}\left(\left\|[I^{-1/2}_{\delta,\gamma}V]_{\delta-\delta_\star}\right\|_2>\sqrt{\tfrac{1}{2}(m_1+1)\gamma p}\right)+\mathbb{P}\left(\left\|[I^{-1/2}_{\delta,\gamma}V]_{\delta-\delta_\star}\right\|_\infty>\sqrt{(m_1+1)\gamma\log(p)}\right)\\
&\qquad\le \exp\left[-\frac{1}{2}\left(\frac{v(s)^{1/2}}{\sigma}\sqrt{\tfrac{1}{2}(m_1+1)\gamma np}-\sqrt{\mathrm{FP}}\right)^2\right]+2\exp\left(\log(\mathrm{FP})-\frac{(m_1+1)\,v(s)\,\gamma n\log(p)}{2\sigma^2}\right)\\
&\qquad\le e^{-\sqrt{p}}+2\,\mathrm{FP}\,e^{-\frac{m_1 v(s)\gamma_0}{2\sigma^2}} \le \frac{\zeta_0}{4},
\end{aligned}
\]


for all p large enough, given the choice m1 + 1 ≥ (2σ²/(v(s)γ0)) log(4FP/ζ0). We combine this bound with (4.21), (4.20), and (4.19) to conclude that
\[
(4.22)\qquad \mathbb{E}_\star\left[\|\nu_0 P_\gamma^K-\Pi_\gamma(\cdot|Z)\|_{\mathrm{tv}}\right] \le 2\zeta_0+2\,\mathbb{P}_\star\left[\left\|[\hat\theta_\delta(Z)]_{\delta-\delta_\star}\right\|_\infty>\sqrt{(m_1+1)\gamma\log(p)}\right].
\]

We note that θ̂δ(Z) is the ordinary least squares estimate in the regression model Z = X_δθ + σE, where E ∼ N(0, I_n). It follows that [θ̂δ(Z)]_{δ−δ⋆} ∼ N(0, σ²Q), where
\[
Q = \left(X'_{\delta-\delta_\star}\left(I_n-X_{\delta_\star}(X'_{\delta_\star}X_{\delta_\star})^{-1}X'_{\delta_\star}\right)X_{\delta-\delta_\star}\right)^{-1}.
\]

We also note that for any unit vector u ∈ R^{‖δ‖0−s⋆}, we have
\[
u'Q^{-1}u \ \ge\ u'X'_{\delta-\delta_\star}X_{\delta-\delta_\star}u-\frac{1}{n\,v(s_\star)}\left\|X'_{\delta_\star}X_{\delta-\delta_\star}u\right\|_2^2 \ \ge\ n\,v(s)-\frac{C(X)^2 s_\star}{v(s_\star)} \ \ge\ \frac{n\,v(s)}{2},
\]
for all p large enough, since C(X)s⋆/√n = o(1). Therefore the largest eigenvalue of Q is upper bounded by 2/(n v(s)). As a result, and since γn log(p) = γ0 and FP = O(1), by Gaussian tail bounds (Lemma 25) we have
\[
\mathbb{P}_\star\left(\left\|[\hat\theta_\delta(Z)]_{\delta-\delta_\star}\right\|_\infty>\sqrt{(m_1+1)\gamma\log(p)}\right) \le \exp\left\{\log(\mathrm{FP})-\frac{(m_1+1)\,v(s)\,\gamma n\log(p)}{8\sigma^2}\right\} \le \frac{\zeta_0}{2},
\]

for all p large enough. Hence (4.22) becomes
\[
(4.23)\qquad \mathbb{E}_\star\left[\|\nu_0 P_\gamma^K-\Pi_\gamma(\cdot|Z)\|_{\mathrm{tv}}\right] \le 3\zeta_0,
\]

as claimed. To complete the proof, it remains to establish (4.17). To that end, we set diff(θ) \stackrel{def}{=} g_γ(δ, θ; z) − g_γ(δ⋆, θ; z). Since δ ⊇ δ⋆, we have
\[
2\,\mathrm{diff}(\theta) = \left([\theta]_\delta-\hat\theta_\delta\right)'I_{\gamma,\delta}\left([\theta]_\delta-\hat\theta_\delta\right)-\left([\theta]_{\delta_\star}-\hat\theta_{\delta_\star}\right)'I_{\gamma,\delta_\star}\left([\theta]_{\delta_\star}-\hat\theta_{\delta_\star}\right)-\frac{1}{\gamma}[\theta]'_{\delta-\delta_\star}[\theta]_{\delta-\delta_\star}.
\]

We split the term ([θ]δ − θ̂δ)′I_{γ,δ}([θ]δ − θ̂δ) − ([θ]_{δ⋆} − θ̂_{δ⋆})′I_{γ,δ⋆}([θ]_{δ⋆} − θ̂_{δ⋆}) as
\[
\begin{aligned}
&\left([\theta]_\delta-[\theta_\star]_\delta\right)'I_{\gamma,\delta}\left([\theta]_\delta-[\theta_\star]_\delta\right)-\left([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star}\right)'I_{\gamma,\delta_\star}\left([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star}\right)\\
&\quad+2\left([\theta]_\delta-[\theta_\star]_\delta\right)'I_{\gamma,\delta}\left([\theta_\star]_\delta-\hat\theta_\delta\right)-2\left([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star}\right)'I_{\gamma,\delta_\star}\left([\theta_\star]_{\delta_\star}-\hat\theta_{\delta_\star}\right)\\
&\quad+\left(\hat\theta_\delta-[\theta_\star]_\delta\right)'I_{\gamma,\delta}\left(\hat\theta_\delta-[\theta_\star]_\delta\right)-\left(\hat\theta_{\delta_\star}-[\theta_\star]_{\delta_\star}\right)'I_{\gamma,\delta_\star}\left(\hat\theta_{\delta_\star}-[\theta_\star]_{\delta_\star}\right).
\end{aligned}
\]


We calculate that
\[
\left([\theta]_\delta-[\theta_\star]_\delta\right)'I_{\gamma,\delta}\left([\theta]_\delta-[\theta_\star]_\delta\right)-\left([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star}\right)'I_{\gamma,\delta_\star}\left([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star}\right) = 2[\theta-\theta_\star]'_{\delta_\star}\left(\frac{1}{\sigma^2}X'_{\delta_\star}X_{\delta-\delta_\star}\right)[\theta]_{\delta-\delta_\star}+[\theta]'_{\delta-\delta_\star}\left(\frac{1}{\sigma^2}X'_{\delta-\delta_\star}X_{\delta-\delta_\star}\right)[\theta]_{\delta-\delta_\star}.
\]

Since θ̂δ = (X′δXδ)^{−1}X′δ z, we get θ̂δ − [θ⋆]δ = (X′δXδ)^{−1}X′δ(z − Xθ⋆) for all δ ⊇ δ⋆. We use this to calculate that
\[
\left([\theta]_\delta-[\theta_\star]_\delta\right)'I_{\gamma,\delta}\left([\theta_\star]_\delta-\hat\theta_\delta\right)-\left([\theta]_{\delta_\star}-[\theta_\star]_{\delta_\star}\right)'I_{\gamma,\delta_\star}\left([\theta_\star]_{\delta_\star}-\hat\theta_{\delta_\star}\right) = -\frac{1}{\sigma^2}[\theta]'_{\delta-\delta_\star}X'_{\delta-\delta_\star}(z-X\theta_\star),
\]

and
\[
\left(\hat\theta_\delta-[\theta_\star]_\delta\right)'I_{\gamma,\delta}\left(\hat\theta_\delta-[\theta_\star]_\delta\right)-\left(\hat\theta_{\delta_\star}-[\theta_\star]_{\delta_\star}\right)'I_{\gamma,\delta_\star}\left(\hat\theta_{\delta_\star}-[\theta_\star]_{\delta_\star}\right) = \frac{1}{\sigma^2}(z-X\theta_\star)'X_{\delta-\delta_\star}\left(X'_{\delta-\delta_\star}X_{\delta-\delta_\star}\right)^{-1}X'_{\delta-\delta_\star}(z-X\theta_\star).
\]

All together, we have
\[
\begin{aligned}
2\,\mathrm{diff}(\theta) &= \frac{1}{\sigma^2}(z-X\theta_\star)'X_{\delta-\delta_\star}\left(X'_{\delta-\delta_\star}X_{\delta-\delta_\star}\right)^{-1}X'_{\delta-\delta_\star}(z-X\theta_\star)\\
&\quad+\frac{2}{\sigma^2}[\theta-\theta_\star]'_{\delta_\star}\left(X'_{\delta_\star}X_{\delta-\delta_\star}\right)[\theta]_{\delta-\delta_\star}-\frac{2}{\sigma^2}[\theta]'_{\delta-\delta_\star}X'_{\delta-\delta_\star}(z-X\theta_\star)\\
&\quad-\frac{1}{\gamma}[\theta]'_{\delta-\delta_\star}\left(I_{\|\delta\|_0-s_\star}-\frac{\gamma}{\sigma^2}X'_{\delta-\delta_\star}X_{\delta-\delta_\star}\right)[\theta]_{\delta-\delta_\star}.
\end{aligned}
\]

For θ ∈ B^{(δ⋆)}_1 and z ∈ Eρ, we have
\[
\frac{1}{\sigma^2}(z-X\theta_\star)'X_{\delta-\delta_\star}\left(X'_{\delta-\delta_\star}X_{\delta-\delta_\star}\right)^{-1}X'_{\delta-\delta_\star}(z-X\theta_\star) \le \frac{8(m_0+1)\,\mathrm{FP}}{v(s)}\log(p).
\]

We check that for θ ∈ B^{(δ⋆)}_1 and z ∈ Eρ,
\[
\left|\frac{2}{\sigma^2}[\theta-\theta_\star]'_{\delta_\star}\left(X'_{\delta_\star}X_{\delta-\delta_\star}\right)[\theta]_{\delta-\delta_\star}\right| \le C\,\mathrm{FP}^{1/2}\,\frac{\mu_0 s_\star}{\sqrt{n\log(p)}}\log(p) = o(\log(p)),
\]
\[
\left|\frac{2}{\sigma^2}[\theta]'_{\delta-\delta_\star}X'_{\delta-\delta_\star}(z-X\theta_\star)\right| \le \frac{2}{\sigma}\,\frac{\mathrm{FP}}{\sqrt{\log(p)}}\,\sqrt{8(m_0+1)(m+1)}\,\log(p) = o(\log(p)),
\]


and
\[
0 \le \frac{1}{\gamma}[\theta]'_{\delta-\delta_\star}\left(I_{\|\delta\|_0-s_\star}-\frac{\gamma}{\sigma^2}X'_{\delta-\delta_\star}X_{\delta-\delta_\star}\right)[\theta]_{\delta-\delta_\star} \le 4(m_1+1)\log(p).
\]

We easily deduce (4.17), and this completes the proof of the theorem. □
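The proof above invokes Gaussian sup-norm deviation bounds several times (via Lemma 25 of the supplement, which is not restated here). The sketch below checks the flavor of one such bound by Monte Carlo: for W with p iid N(0, 1) entries, a plain union bound gives P(‖W‖∞ > 2√((1+m) log p)) ≤ 2p^{−1−2m}, which is well below p^{−m}. The union-bound constant is our own bookkeeping, not a claim about Lemma 25.

```python
# Monte Carlo check of a Gaussian sup-norm deviation bound of the kind used above.
import numpy as np

rng = np.random.default_rng(1)
p, m = 500, 1
thresh = 2.0 * np.sqrt((1 + m) * np.log(p))
W = rng.standard_normal((20_000, p))          # 20,000 replicates of W
freq = (np.abs(W).max(axis=1) > thresh).mean()  # empirical exceedance frequency
print(freq, 2 * p ** (-1 - 2 * m))
```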

5. Proof of Theorem 5.

5.1. Some preliminary results. Throughout the proof, θ⋆ is the true value of the parameter as introduced in H2, δ⋆ denotes its sparsity structure, and s⋆ = ‖θ⋆‖0. The first lemma is adapted from [2] Lemma 2, and gives an approximation of the function hγ(δ, θ; z) for small γ.

Lemma 19. Assume H1, and let hγ be as in (1.4). For all δ ∈ ∆, γ > 0, u ∈ R^p, and z ∈ Z, we have
\[
\begin{aligned}
-\ell(u_\delta;z)+\rho\|u_\delta\|_1+\frac{1}{2\gamma}\|u-u_\delta\|_2^2-\frac{1}{2}(u-u_\delta)'S(u-u_\delta) &\ge h_\gamma(\delta,u;z)\\
&\ge -\ell(u_\delta;z)+\rho\|u_\delta\|_1+\frac{1}{2\gamma}\|u-u_\delta\|_2^2-\frac{1}{2}(u-u_\delta)'S(u-u_\delta)-R_\gamma(\delta,u;z),
\end{aligned}
\]
where R_γ(δ, u; z) satisfies
\[
0 \le R_\gamma(\delta,u;z) \le \frac{\gamma}{2}\left\|\delta\cdot\nabla\ell(u;z)-\rho\,\mathrm{sign}(u_\delta)\right\|_2^2.
\]

Proof. Using the definition of hγ in (1.4) and the definition of S in H1, we have
\[
\begin{aligned}
h_\gamma(\delta,u;z) &\le -\ell(u;z)-\langle\nabla\ell(u;z),u_\delta-u\rangle+\frac{1}{2\gamma}\|u-u_\delta\|_2^2+\rho\|u_\delta\|_1\\
&\le -\ell(u_\delta;z)+\rho\|u_\delta\|_1-\frac{1}{2}(u-u_\delta)'S(u-u_\delta)+\frac{1}{2\gamma}\|u-u_\delta\|_2^2,
\end{aligned}
\]
which is the first inequality. To prove the second, we note that for any v ∈ R^p_δ,
\[
-\ell(u;z)-\langle\nabla\ell(u;z),v-u\rangle = -\ell(u_\delta;z)-\langle\nabla\ell(u;z),v-u_\delta\rangle+F(u|u_\delta),
\]
where F(u|u_δ) \stackrel{def}{=} −[ℓ(u; z) − ℓ(u_δ; z) − ⟨∇ℓ(u_δ; z), u − u_δ⟩] + ⟨∇ℓ(u; z) − ∇ℓ(u_δ; z), u − u_δ⟩. By a Taylor expansion with integral remainder and H1, we have
\[
F(u|u_\delta) = (u-u_\delta)'\left[\int_0^1 t\,\nabla^{(2)}\ell\left(u_\delta+t(u-u_\delta)\right)dt\right](u-u_\delta) \ \ge\ -\frac{1}{2}(u-u_\delta)'S(u-u_\delta).
\]


Hence, for all u ∈ R^p and all v ∈ R^p_δ,
\[
-\ell(u;z)-\langle\nabla\ell(u;z),v-u\rangle \ \ge\ -\ell(u_\delta;z)-\langle\nabla\ell(u;z),v-u_\delta\rangle-\frac{1}{2}(u-u_\delta)'S(u-u_\delta).
\]
By convexity of the ℓ1-norm, ‖v‖1 ≥ ‖u_δ‖1 + ⟨sign(u_δ), v − u_δ⟩. We combine the last two inequalities to conclude that
\[
\begin{aligned}
-\ell(u;z)-\langle\nabla\ell(u;z),v-u\rangle+\rho\|v\|_1+\frac{1}{2\gamma}\|v-u\|_2^2 &\ge -\ell(u_\delta;z)+\rho\|u_\delta\|_1+\langle\rho\,\mathrm{sign}(u_\delta)-\delta\cdot\nabla\ell(u;z),v-u_\delta\rangle\\
&\quad+\frac{1}{2\gamma}\|v-u_\delta\|_2^2+\frac{1}{2\gamma}\|u-u_\delta\|_2^2-\frac{1}{2}(u-u_\delta)'S(u-u_\delta).
\end{aligned}
\]
The second inequality follows by noting that ⟨ρ sign(u_δ) − δ·∇ℓ(u; z), v − u_δ⟩ + (1/2γ)‖v − u_δ‖²₂ ≥ −(γ/2)‖ρ sign(u_δ) − δ·∇ℓ(u; z)‖²₂.
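The closing step of this proof is the completing-the-square inequality: for any vectors a, w and any γ > 0, ⟨a, w⟩ + ‖w‖²/(2γ) ≥ −(γ/2)‖a‖², with equality at w = −γa. The sketch below verifies this numerically on random vectors; the dimension and sample size are arbitrary choices.

```python
# Numerical check of the completing-the-square inequality closing Lemma 19.
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.3
a = rng.standard_normal(5)
lhs_min = float("inf")
for _ in range(10_000):
    w = rng.standard_normal(5) * 3
    lhs_min = min(lhs_min, a @ w + (w @ w) / (2 * gamma))
rhs = -(gamma / 2) * (a @ a)
w_star = -gamma * a                                   # the exact minimizer
exact = a @ w_star + (w_star @ w_star) / (2 * gamma)  # equals rhs
print(lhs_min >= rhs, abs(exact - rhs) < 1e-12)
```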

The next result gives a lower bound on the normalizing constant of Πγ.

Lemma 20. Assume H1-H2. For γ > 0 and z ∈ Z, let C_γ(z) denote the normalizing constant of Πγ(·|z). If γλ_max(S) < 1, then we have
\[
(5.1)\qquad \sqrt{\det\left([I_p-\gamma S]_{\delta^c_\star}\right)}\,(2\pi\gamma)^{-\frac{p}{2}}\,C_\gamma(z) \ \ge\ \omega_{\delta_\star}\,e^{\ell(\theta_\star;z)}\,e^{-\rho\|\theta_\star\|_1}\left(\frac{\rho^2}{\kappa(s_\star)+\rho^2}\right)^{s_\star},
\]
where, for a matrix A ∈ R^{p×p} and δ ∈ ∆, the notation [A]_{δ^c} denotes the sub-matrix of A obtained by removing the rows and columns of A for which δ_j = 1.

Proof. By definition, we have
\[
C_\gamma(z) = \sum_{\delta\in\Delta}\omega_\delta\,(2\pi\gamma)^{\frac{\|\delta\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\int_{\mathbb{R}^p}e^{-h_\gamma(\delta,u;z)}\,du \ \ge\ \omega_{\delta_\star}\,(2\pi\gamma)^{\frac{\|\delta_\star\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta_\star\|_0}\int_{\mathbb{R}^p}e^{-h_\gamma(\delta_\star,u;z)}\,du.
\]

By the first inequality of Lemma 19,
\[
C_\gamma(z) \ \ge\ \omega_{\delta_\star}\,(2\pi\gamma)^{\frac{\|\delta_\star\|_0}{2}}\left(\frac{\rho}{2}\right)^{\|\delta_\star\|_0}\int_{\mathbb{R}^p}e^{\ell(u_{\delta_\star};z)}\,e^{-\rho\|u_{\delta_\star}\|_1}\,e^{-\frac{1}{2\gamma}(u-u_{\delta_\star})'(I_p-\gamma S)(u-u_{\delta_\star})}\,du.
\]
The integrand in the last display is a multiplicatively separable function of [u]_{δ⋆} and [u]_{δ^c_⋆}. Integrating out [u]_{δ^c_⋆} then yields
\[
C_\gamma(z) \ \ge\ \frac{(2\pi\gamma)^{\frac{p}{2}}}{\sqrt{\det\left([I_p-\gamma S]_{\delta^c_\star}\right)}}\,\omega_{\delta_\star}\left(\frac{\rho}{2}\right)^{\|\delta_\star\|_0}\int_{\mathbb{R}^p}e^{\ell(u;z)}\,e^{-\rho\|u\|_1}\,\mu_{\delta_\star}(du).
\]


The lower bound on the term ω_{δ⋆}(ρ/2)^{‖δ⋆‖0}∫_{R^p} e^{ℓ(u;z)}e^{−ρ‖u‖1}μ_{δ⋆}(du) established in [1] Lemma 11 is then employed to deduce (5.1).

Lemma 21. Assume H1-H2. Let ρ̄ > 0, ρ ∈ (0, ρ̄], and γ > 0 be such that 4γλ_max(S) ≤ 1. If, for all δ ∈ ∆ and θ ∈ R^p,
\[
(5.2)\qquad \log\mathbb{E}_\star\left[e^{\left(1-\frac{\rho}{\bar\rho}\right)\langle\nabla\ell(\theta_\star;Z),\theta-\theta_\star\rangle+L_\gamma(\delta,\theta;Z)}\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\right] \ \le\ -\frac{1}{2}\,r_0\left(\|\theta-\theta_\star\|_2\right)\mathbf{1}_{\left\{\|[\theta]_{\delta^c_\star}\|_1\le 7\|[\theta-\theta_\star]_{\delta_\star}\|_1\right\}}(\theta),
\]
for some rate function r0, then Πγ(·|z) is well-defined for P⋆-almost all z ∈ Eρ. Furthermore,
\[
\mathbb{E}_\star\left[\mathbf{1}_{\mathcal{E}_\rho}(Z)\,\Pi_\gamma\left(\|\delta\|_0>s_\star+\eta\,\middle|\,Z\right)\right] \ \le\ \frac{1}{p^{m_0}},
\]
for all m0 ≥ 1, where

\[
(5.3)\qquad \eta \stackrel{def}{=} \frac{2}{u}\left[m_0+2s_\star+\frac{a_0}{2\log(p)}+\frac{s_\star\log\left(1+\frac{\kappa(s_\star)}{\rho^2}\right)}{\log(p)}+\frac{\gamma}{2\log(p)}\,\mathrm{Tr}(\bar S-S)+\frac{2\gamma^2}{\log(p)}\left(\lambda_{\max}(S)\,\mathrm{Tr}(S)+4\|S\|_F^2\right)\right],
\]

and
\[
a_0 \stackrel{def}{=} -\min_{x>0}\left[r_0(x)-4\rho\,s_\star^{1/2}\,x\right].
\]

Proof. That Πγ(·|z) is well-defined is equivalent to the statement that its normalizing constant, denoted C_γ(z), is finite. Hence it suffices to establish that C_γ(z) is finite for P⋆-almost all z ∈ Eρ. Using the second inequality of Lemma 19, we have
\[
\begin{aligned}
\frac{(2\pi\gamma)^{-\frac{p}{2}}\,C_\gamma(z)}{e^{\ell(\theta_\star;z)-\rho\|\theta_\star\|_1}} &= \sum_{\delta\in\Delta}\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\left(\frac{1}{2\pi\gamma}\right)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^p}\frac{e^{-h_\gamma(\delta,\theta;z)}}{e^{\ell(\theta_\star;z)-\rho\|\theta_\star\|_1}}\,d\theta\\
&\le \sum_{\delta\in\Delta}\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\left(\frac{1}{2\pi\gamma}\right)^{\frac{p-\|\delta\|_0}{2}}\int_{\mathbb{R}^p}\frac{e^{\ell(\theta_\delta;z)}\,e^{-\rho\|\theta_\delta\|_1}}{e^{\ell(\theta_\star;z)}\,e^{-\rho\|\theta_\star\|_1}}\,e^{-\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)+R_\gamma(\delta,\theta;z)}\,d\theta.
\end{aligned}
\]

For z ∈ Eρ and δ ∈ ∆, we use the convexity of the squared norm and H1 to bound the term R_γ (given in Lemma 19) as
\[
\begin{aligned}
R_\gamma(\delta,\theta;z) &\le \frac{\gamma}{2}\left\|\delta\cdot\nabla\ell(\theta;z)-\delta\cdot\nabla\ell(\theta_\delta;z)+\delta\cdot\nabla\ell(\theta_\delta;z)-\delta\cdot\nabla\ell(\theta_\star;z)+\delta\cdot\nabla\ell(\theta_\star;z)-\rho\,\mathrm{sign}(\theta_\delta)\right\|_2^2\\
&\le 2\gamma\,\kappa(\|\delta\|_0)\,(\theta-\theta_\delta)'S(\theta-\theta_\delta)+2\gamma\left\|\delta\cdot\nabla\ell(\theta_\delta;z)-\delta\cdot\nabla\ell(\theta_\star;z)\right\|_2^2+3\gamma\rho^2\|\delta\|_0,
\end{aligned}
\]
where the second inequality uses H1, the definition of κ(‖δ‖0), and the fact that ‖∇ℓ(θ⋆; z)‖∞ ≤ ρ/2 for z ∈ Eρ. It follows that
\[
(5.4)\qquad -\frac{1}{2\gamma}(\theta-\theta_\delta)'(I_p-\gamma S)(\theta-\theta_\delta)+R_\gamma(\delta,\theta;z) \ \le\ -\frac{1}{2\gamma}(\theta-\theta_\delta)'A_\delta(\theta-\theta_\delta)+2\gamma\left\|\delta\cdot\nabla\ell(\theta_\delta;z)-\delta\cdot\nabla\ell(\theta_\star;z)\right\|_2^2+3\gamma\rho^2\|\delta\|_0,
\]
where A_δ \stackrel{def}{=} I_p − γ(1 + 4γκ(‖δ‖0))S. If δ = (1, . . . , 1), (5.4) still holds with A_δ = 0. Note also that if δ ≠ (1, . . . , 1), the matrix A_δ is positive definite under the assumption 4γλ_max(S) ≤ 1. We recall the notation L_γ(δ, θ; z) introduced in (2.2) of the main manuscript, and use it to write

\[
(5.5)\qquad \ell(\theta_\delta;z)-\ell(\theta_\star;z)+2\gamma\left\|\delta\cdot\nabla\ell(\theta_\delta;z)-\delta\cdot\nabla\ell(\theta_\star;z)\right\|_2^2 = \langle\nabla\ell(\theta_\star;z),\theta_\delta-\theta_\star\rangle+L_\gamma(\delta,\theta_\delta;z) = \frac{\rho}{\bar\rho}\langle\nabla\ell(\theta_\star;z),\theta_\delta-\theta_\star\rangle+\left(1-\frac{\rho}{\bar\rho}\right)\langle\nabla\ell(\theta_\star;z),\theta_\delta-\theta_\star\rangle+L_\gamma(\delta,\theta_\delta;z).
\]

Since ‖∇ℓ(θ⋆; z)‖∞ ≤ ρ/2 for z ∈ Eρ, we have (ρ/ρ̄)|⟨∇ℓ(θ⋆; z), θδ − θ⋆⟩| ≤ (ρ/2)‖θδ − θ⋆‖1. Using this, and accounting for all the terms, we obtain
\[
\frac{(2\pi\gamma)^{-\frac{p}{2}}\,C_\gamma(z)}{e^{\ell(\theta_\star;z)-\rho\|\theta_\star\|_1}}\,\mathbf{1}_{\mathcal{E}_\rho}(z) \ \le\ \sum_{\delta\in\Delta}\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\left(\frac{1}{2\pi\gamma}\right)^{\frac{p-\|\delta\|_0}{2}}e^{3\gamma\|\delta\|_0\rho^2}\,\mathbf{1}_{\mathcal{E}_\rho}(z)\int_{\mathbb{R}^p}e^{d(\theta_\delta)+\left(1-\frac{\rho}{\bar\rho}\right)\langle\nabla\ell(\theta_\star;z),\theta_\delta-\theta_\star\rangle+L_\gamma(\delta,\theta_\delta;z)}\,e^{-\frac{1}{2\gamma}(\theta-\theta_\delta)'A_\delta(\theta-\theta_\delta)}\,d\theta,
\]
where d(θ) \stackrel{def}{=} (ρ/2)‖θ − θ⋆‖1 − ρ(‖θ‖1 − ‖θ⋆‖1). Taking the expectation on both sides and using Fubini's theorem and (5.2) gives

\[
\mathbb{E}_\star\left[\frac{(2\pi\gamma)^{-\frac{p}{2}}\,C_\gamma(Z)}{e^{\ell(\theta_\star;Z)-\rho\|\theta_\star\|_1}}\,\mathbf{1}_{\mathcal{E}_\rho}(Z)\right] \ \le\ \sum_{\delta\in\Delta}\omega_\delta\left(\frac{\rho}{2}\right)^{\|\delta\|_0}\left(\frac{1}{2\pi\gamma}\right)^{\frac{p-\|\delta\|_0}{2}}e^{3\gamma\|\delta\|_0\rho^2}\int_{\mathbb{R}^p}e^{d(\theta_\delta)}\,e^{-\frac{1}{2\gamma}(\theta-\theta_\delta)'A_\delta(\theta-\theta_\delta)}\,d\theta.
\]


All the integrals on the right-hand side of the last inequality are finite, since the matrices Aδ are symmetric positive definite and d(θ) ≤ −(ρ/2)‖θ‖₁ for all ‖θ‖₁ large enough. Hence, for P⋆-almost all z ∈ Eρ, Cγ(z) is finite, as claimed.

To establish the second part of the lemma, we first note that for any z ∈ Z and any measurable subset B of ∆ × R^p, using (5.1), the second inequality of Lemma 19, (5.4), and similar calculations as above, we get

(5.6) Πγ(B|z) ≤ √det([Ip − γS]_{δ⋆ᶜ}) (1 + κ(s⋆)/ρ²)^{s⋆} (1/ω_{δ⋆}) Σ_{δ∈∆} ωδ (1/(2πγ))^{(p−‖δ‖₀)/2} (ρ/2)^{‖δ‖₀} e^{3γρ²‖δ‖₀} ∫_{B(δ)} [e^{−ρ‖θδ‖₁}/e^{−ρ‖θ⋆‖₁}] [e^{ℓ(θδ;z)}/e^{ℓ(θ⋆;z)}] e^{2γ‖δ·∇ℓ(θδ;z) − δ·∇ℓ(θ⋆;z)‖₂²} e^{−(1/(2γ))(θ − θδ)′Aδ(θ − θδ)} dθ,

where B(δ) := {θ ∈ R^p : (δ, θ) ∈ B}. For some arbitrary η > 0, let A₁ᶜ := {δ ∈ ∆ : ‖δ‖₀ > s⋆ + η}. It follows from the last display (applied to B = A₁ᶜ × R^p), together with (5.5) and Fubini's theorem, that

(5.7) E⋆[Πγ(A₁ᶜ × R^p | Z) 1_{Eρ}(Z)] ≤ √det([Ip − γS]_{δ⋆ᶜ}) (1 + κ(s⋆)/ρ²)^{s⋆} Σ_{δ∈A₁ᶜ} (ωδ/ω_{δ⋆}) (ρ/2)^{‖δ‖₀} [e^{3γρ²‖δ‖₀}/√det([Aδ]_{δᶜ})] ∫_{R^p} e^{d(θ)} E⋆[e^{(1−ρ̄/ρ)⟨∇ℓ(θ⋆;Z), θδ − θ⋆⟩ + Lγ(δ,θδ;Z)} 1_{Eρ}(Z)] μδ(dθ),

where det([Aδ]_{δᶜ}) is taken as 1 if δ ≡ 1. We claim that for all θ ∈ R^p,

(5.8) d(θ) + log E⋆[e^{(1−ρ̄/ρ)⟨∇ℓ(θ⋆;Z), θδ − θ⋆⟩ + Lγ(δ,θδ;Z)} 1_{Eρ}(Z)] ≤ −(ρ/4)‖θ − θ⋆‖₁ + a₀/2,

where a₀ := −min_{x>0} [r₀(x) − 4ρ s⋆^{1/2} x] > 0. Obviously this claim is true if θ = θ⋆.

Suppose now that θ ≠ θ⋆. First, using the notation δᶜ := 1 − δ, we note that

d(θ) = (ρ̄/2)‖δ⋆·(θ − θ⋆)‖₁ + (ρ̄/2)‖δ⋆ᶜ·θ‖₁ − ρ‖δ⋆·θ‖₁ − ρ‖δ⋆ᶜ·θ‖₁ + ρ‖θ⋆‖₁

≤ −(ρ/2)‖δ⋆ᶜ·(θ − θ⋆)‖₁ + (3ρ/2)‖δ⋆·(θ − θ⋆)‖₁.

If ‖δ⋆ᶜ·(θ − θ⋆)‖₁ > 7‖δ⋆·(θ − θ⋆)‖₁, then

d(θ) ≤ −(ρ/4)‖δ⋆ᶜ·(θ − θ⋆)‖₁ − (7ρ/4)‖δ⋆·(θ − θ⋆)‖₁ + (3ρ/2)‖δ⋆·(θ − θ⋆)‖₁ ≤ −(ρ/4)‖θ − θ⋆‖₁.


This bound together with (5.2) shows that the claim (5.8) holds when ‖δ⋆ᶜ·(θ − θ⋆)‖₁ > 7‖δ⋆·(θ − θ⋆)‖₁. Now, if ‖δ⋆ᶜ·(θ − θ⋆)‖₁ ≤ 7‖δ⋆·(θ − θ⋆)‖₁, then again by (5.2), the left-hand side of (5.8) is upper bounded by

−(ρ/2)‖δ⋆ᶜ·(θ − θ⋆)‖₁ + (3ρ/2)‖δ⋆·(θ − θ⋆)‖₁ − (1/2) r₀(‖θ − θ⋆‖₂)

≤ −(ρ/2)‖θ − θ⋆‖₁ − (1/2)[r₀(‖θ − θ⋆‖₂) − 4ρ s⋆^{1/2}‖θ − θ⋆‖₂] ≤ −(ρ/2)‖θ − θ⋆‖₁ + a₀/2,

which also gives (5.8). We can then use (5.8) to deduce that

(5.9) ∫_{R^p} e^{d(θ)} E⋆[e^{(1−ρ̄/ρ)⟨∇ℓ(θ⋆;Z), θδ − θ⋆⟩ + Lγ(δ,θδ;Z)} 1_{Eρ}(Z)] μδ(dθ) ≤ e^{a₀/2} ∫_{R^p} e^{−(ρ/4)‖θ − θ⋆‖₁} μδ(dθ) ≤ e^{a₀/2} (8/ρ)^{‖δ‖₀},

and using this in (5.7) we conclude that

E⋆[Πγ(A₁ᶜ × R^p | Z) 1_{Eρ}(Z)] ≤ e^{a₀/2} (1 + κ(s⋆)/ρ²)^{s⋆} Σ_{δ∈A₁ᶜ} (ωδ/ω_{δ⋆}) 4^{‖δ‖₀} e^{3γ‖δ‖₀ρ²} √det([Ip − γS]_{δ⋆ᶜ}) / √det([Aδ]_{δᶜ}).

We claim that for all δ ∈ ∆,

(5.10) √det(I_{p−s⋆} − γ[S]_{δ⋆ᶜ}) / √det([Aδ]_{δᶜ}) ≤ exp( s⋆ + (γ/2)Tr(S̄ − S) + 2γ²(κ(‖δ‖₀)Tr(S̄) + 4‖S̄‖F²) ).

To show this, we write

√det(I_{p−s⋆} − γ[S]_{δ⋆ᶜ}) / √det([Aδ]_{δᶜ}) = [√det(I_{p−s⋆} − γ[S]_{δ⋆ᶜ}) / √det(I_{p−‖δ‖₀} − γ[S]_{δᶜ})] × [√det(I_{p−‖δ‖₀} − γ[S]_{δᶜ}) / √det([Aδ]_{δᶜ})].

The first term on the right-hand side of the last display can be further written as

[√det([Ip − γS]_{δ⋆ᶜ}) / √det(Ip − γS)] × [√det(Ip − γS) / √det([Ip − γS]_{δᶜ})] ≤ (4/3)^{s⋆/2},


where the inequality follows from Lemma 26 and the fact that all the eigenvalues of the matrix Ip − γS are between 3/4 and 1. If λj (resp. λ̄j) denote the eigenvalues of [S]_{δᶜ} (resp. [S̄]_{δᶜ}), we have

√det(I_{p−‖δ‖₀} − γ[S]_{δᶜ}) / √det([Aδ]_{δᶜ}) = exp( (1/2) Σ_{j=1}^{p−‖δ‖₀} [log(1 − γλj) − log(1 − γ(1 + 4γκ(‖δ‖₀))λ̄j)] ).

Since 4γλmax(S̄) ≤ 1, we have γλj ≤ 1/4 and γ(1 + 4γκ(‖δ‖₀))λ̄j ≤ 1/2 for all 1 ≤ j ≤ p − ‖δ‖₀. Furthermore, the logarithm satisfies log(1 − x) ≤ −x for x ∈ [0, 1), and log(1 − x) ≥ −x − 4x² for x ∈ [0, 1/2]. We deduce that

(5.11) √det(I_{p−‖δ‖₀} − γ[S]_{δᶜ}) / √det([Aδ]_{δᶜ}) ≤ e^{(γ/2)Tr(S̄−S) + 2γ²(κ(‖δ‖₀)Tr(S̄) + 4‖S̄‖F²)}.

These last two results together establish (5.10). On the other hand, using H3, we have

Σ_{δ∈A₁ᶜ} (ωδ/ω_{δ⋆}) 4^{‖δ‖₀} e^{3γ‖δ‖₀ρ²} = Σ_{j=s⋆+η+1}^{p} (p choose j) (q/(1−q))^{j−s⋆} (4e^{3γρ²})^{j}

≤ (p choose s⋆) (4e^{3γρ²})^{s⋆} Σ_{j=s⋆+η+1}^{p} (8e^{3γρ²}/p^u)^{j−s⋆},

using the fact that q/(1−q) = 1/(p^{u+1} − 1) ≤ 2/p^{u+1} for p ≥ 2, and (p choose j) ≤ p^{j−s⋆} (p choose s⋆). Hence, for p large enough so that 8e^{3γρ²}/p^u ≤ 1/2, we get

Σ_{δ∈A₁ᶜ} (ωδ/ω_{δ⋆}) 4^{‖δ‖₀} e^{3γ‖δ‖₀ρ²} ≤ (p choose s⋆) (4e^{3γρ²})^{s⋆} (8e^{3γρ²}/p^u)^{η} ≤ e^{(3/2)s⋆ log(p) − (uη/2) log(p)},

for all p large enough, where here we use again the assumption that γρ² = o(log(p)), and log (p choose s⋆) ≤ s⋆ log(p). Hence we conclude that

E⋆[Πγ(A₁ᶜ × R^p | Z) 1_{Eρ}(Z)] ≤ e^{a₀/2} (1 + κ(s⋆)/ρ²)^{s⋆} e^{2s⋆ log(p) − (uη/2) log(p)} exp( (γ/2)Tr(S̄ − S) + 2γ²(λmax(S̄)Tr(S̄) + 4‖S̄‖F²) ) ≤ 1/p^{m₀},

by choosing η as in the statement of the lemma. Hence the result.
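The elementary combinatorial facts used above — (p choose j) ≤ p^{j−s⋆}(p choose s⋆) for j ≥ s⋆, and log (p choose s⋆) ≤ s⋆ log(p) — can be checked numerically; the values of p and s⋆ below are toy choices for illustration only:

```python
from math import comb, log

# Toy values (illustration only; not from the paper).
p, s_star = 50, 3

# (p choose j) <= p^(j - s_star) * (p choose s_star) for all j >= s_star:
# each extra factor (p - i + 1)/i in the binomial ratio is at most p.
ok_binom = all(
    comb(p, j) <= p ** (j - s_star) * comb(p, s_star)
    for j in range(s_star, p + 1)
)

# log (p choose s_star) <= s_star * log(p), since (p choose s_star) <= p^s_star.
ok_log = log(comb(p, s_star)) <= s_star * log(p)
```

Both checks pass for any p ≥ 1 and 0 ≤ s⋆ ≤ p; the choice above is arbitrary.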


For linear regression models the previous lemma takes a slightly sharper form.

Lemma 22. In the particular case of the linear regression model, for all ρ > 0, all γ > 0 such that 4γλmax(X′X)/σ² ≤ 1, and all z ∈ R^n, Πγ(·|z) is well-defined.

Proof. Here S = (1/σ²)X′X, and as above, we have

(2πγ)^{−p/2} Cγ(z) ≤ Σ_{δ∈∆} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{R^p} e^{ℓ(θδ;z)} e^{−ρ‖θδ‖₁} e^{−(1/(2γ))(θ−θδ)′(Ip−γS)(θ−θδ) + Rγ(δ,θ;z)} dθ.

With a similar argument we bound

Rγ(δ, θ; z) ≤ (3/2)γρ²‖δ‖₀ + (3γ/σ²)λmax(X′X)·(1/(2σ²))‖z − Xθδ‖₂² + (3γ/(2σ⁴))λmax(X′X)(θ − θδ)′(X′X)(θ − θδ).

Therefore, if 4γλmax(X′X)/σ² ≤ 1, then

ℓ(θδ; z) − (1/(2γ))(θ − θδ)′(Ip − γS)(θ − θδ) + Rγ(δ, θ; z) ≤ (3/2)γρ²‖δ‖₀ + (1/4)ℓ(θδ; z) − (1/(2γ))(θ − θδ)′Mδ(θ − θδ),

where Mδ := Ip − γ(1 + (3γ/σ²)λmax(X′X))S. And since ℓ(θδ; z) = −(1/(2σ²))‖z − Xθδ‖₂² ≤ 0, we get

(2πγ)^{−p/2} Cγ(z) ≤ Σ_{δ∈∆} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} e^{(3/2)γρ²‖δ‖₀} ∫_{R^p} e^{−ρ‖θδ‖₁} e^{−(1/(2γ))(θ−θδ)′Mδ(θ−θδ)} dθ.

All the integrals on the right-hand side of the last display are finite, since Mδ is positive definite. This proves the lemma.
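As a small numerical sanity check of the positive-definiteness of Mδ (with a hypothetical 3×2 design matrix X, not taken from the paper), the step-size condition 4γλmax(X′X)/σ² ≤ 1 does keep the eigenvalues of Mδ = Ip − γ(1 + (3γ/σ²)λmax(X′X))S strictly positive:

```python
import math

# Hypothetical 3x2 design matrix (illustration only).
X = [[1.0, 0.5], [0.2, 1.0], [0.3, 0.7]]
sigma2 = 1.0

# X'X (2x2, symmetric positive definite here).
XtX = [[sum(X[k][i] * X[k][j] for k in range(3)) for j in range(2)] for i in range(2)]

def eig2(a, b, d):
    # Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]], closed form.
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr / 4.0 - det)
    return tr / 2.0 - disc, tr / 2.0 + disc

lmin_xtx, lmax_xtx = eig2(XtX[0][0], XtX[0][1], XtX[1][1])

# Largest gamma allowed by the lemma: 4*gamma*lmax(X'X)/sigma^2 = 1.
gamma = sigma2 / (4.0 * lmax_xtx)

# M_delta = I - gamma*(1 + 3*gamma*lmax(X'X)/sigma^2) * S, with S = X'X/sigma^2.
c = gamma * (1.0 + 3.0 * gamma * lmax_xtx / sigma2) / sigma2
M = [[(1.0 if i == j else 0.0) - c * XtX[i][j] for j in range(2)] for i in range(2)]
m_min, m_max = eig2(M[0][0], M[0][1], M[1][1])
```

At the boundary value of γ the smallest eigenvalue of Mδ equals 1 − (1/4)(1 + 3/4) = 9/16 for the worst eigen-direction, so Mδ stays safely positive definite.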

The proof of Theorem 5 is based on some classical testing arguments for which we

need the following result. This lemma slightly extends Lemma 14 of [1].


Lemma 23. Assume H2 and H4. Define

ε := inf{x > 0 : r₁(y) ≥ 3ρ(s⋆ + s)^{1/2} y for all y ≥ x}.

If ε < ∞, then for any M > 2, there exists a measurable function φ : Z → [0, 1] such that

E⋆(φ(Z)) ≤ (p choose s) 9^s Σ_{j≥1} e^{−(1/6)r₁(jMε/2)}.

Furthermore, for all δ ∈ ∆ such that ‖δ‖₀ ≤ s, and all θ ∈ R^p_δ such that ‖θ − θ⋆‖₂ > jMε for some j ≥ 1, we have

∫_{Eρ} (1 − φ(z)) e^{2γ‖δ·∇ℓ(θ;z) − δ·∇ℓ(θ⋆;z)‖₂²} f_θ(z) dz ≤ e^{−(1/6)r₁(jMε/2)}.

Proof. If q₁, q₂ are two integrable functions on some arbitrary measure space, we define their Hellinger affinity as

H(q₁, q₂) := ∫ √(q₁ q₂).

We will rely on the following result due to [21].

Lemma 24 (Kleijn–van der Vaart (2006)). Let p be a density and Q a family of integrable functions. Then there exists a measurable [0, 1]-valued function φ such that

sup_{q∈Q} [∫ φ p + ∫ (1 − φ) q] ≤ sup_{q∈conv(Q)} H(p, q),

where conv(Q) is the convex hull of Q.

Fix δ ∈ ∆ such that ‖δ‖₀ ≤ s, and θ ∈ R^p_δ. To put ourselves in the setting of Lemma 24, we define

q_{δ,u}(z) := e^{2γ‖δ·∇ℓ(u;z) − δ·∇ℓ(θ⋆;z)‖₂²} f_u(z) 1_{Eρ}(z), u ∈ R^p_δ, z ∈ Z.

For z ∈ Eρ, u ∈ R^p_δ, we have

(5.12) q_{δ,u}(z)/f_{θ⋆}(z) = e^{⟨∇ℓ(θ⋆;z), u−θ⋆⟩ + Lγ(δ,u;z)} 1_{Eρ}(z) ≤ e^{(ρ̄/2)‖u−θ⋆‖₁ + Lγ(δ,u;z)}.


Therefore

∫_Z q_{δ,u}(z) dz ≤ e^{(ρ̄/2)‖u−θ⋆‖₁} E⋆[e^{Lγ(δ,u;Z)} 1_{Eρ}(Z)] < ∞,

by H4. Now, fix η ≥ 2ε, and suppose that ‖θ − θ⋆‖₂ > η. Let

P_{δ,θ} := {q_{δ,u} : u ∈ R^p_δ, ‖u − θ‖₂ ≤ η/2}.

According to Lemma 24, applied with p = f_{θ⋆} and Q = P_{δ,θ}, there exists a test function φ_{δ,θ} such that

(5.13) sup_{q∈P_{δ,θ}} [∫ φ_{δ,θ} f_{θ⋆} + ∫ (1 − φ_{δ,θ}) q] ≤ sup_{q∈conv(P_{δ,θ})} H(f_{θ⋆}, q).

Any q ∈ conv(P_{δ,θ}) can be written as q = Σ_j αj q_{δ,uj}, where Σ_j αj = 1, uj ∈ R^p_δ, and ‖uj − θ‖₂ ≤ η/2. Notice that this implies that ‖uj − θ⋆‖₂ > η/2 ≥ ε. Therefore, by Jensen's inequality and (5.12), we get

H(q, f_{θ⋆}) = ∫_{Eρ} √( Σ_j αj q_{δ,uj}(z)/f_{θ⋆}(z) ) f_{θ⋆}(z) dz ≤ √( Σ_j αj ∫_{Eρ} [q_{δ,uj}(z)/f_{θ⋆}(z)] f_{θ⋆}(z) dz )

≤ √( Σ_j αj e^{(ρ̄/2)‖uj−θ⋆‖₁} E⋆[e^{Lγ(δ,uj;Z)} 1_{Eρ}(Z)] ) ≤ √( Σ_j αj e^{(ρ̄/2)‖uj−θ⋆‖₁ − (1/2)r₁(‖uj−θ⋆‖₂)} ).

Since ‖uj‖₀ ≤ s and ‖uj − θ⋆‖₂ > η/2 ≥ ε, the definition of ε yields

(ρ̄/2)‖uj − θ⋆‖₁ − (1/6)r₁(‖uj − θ⋆‖₂) ≤ (ρ(s + s⋆)^{1/2}/2)‖uj − θ⋆‖₂ − (1/6)r₁(‖uj − θ⋆‖₂) ≤ 0.

We conclude that for any q ∈ conv(P_{δ,θ}),

(5.14) H(q, f_{θ⋆}) ≤ e^{−(1/6)r₁(η/2)}.
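The Jensen step in the Hellinger computation above — the affinity of a mixture with f is at most the square root of the mixture of the integrals ∫(qⱼ/f)f — holds for any finite mixture of (possibly unnormalized) positive components; a quick discrete check with toy densities (illustration only):

```python
import math
import random

random.seed(0)
Z = range(5)  # toy finite sample space

def rand_density():
    w = [random.random() + 0.05 for _ in Z]
    s = sum(w)
    return [x / s for x in w]

f = rand_density()                                              # role of f_theta*
qs = [[random.uniform(0.1, 2.0) for _ in Z] for _ in range(2)]  # unnormalized q's
alpha = [0.3, 0.7]

mix = [sum(a * q[z] for a, q in zip(alpha, qs)) for z in Z]

# Left side: affinity of the mixture with f.
H_mix = sum(math.sqrt(mix[z] * f[z]) for z in Z)

# Right side: Jensen (concavity of sqrt) bound; here integral of q_j/f against f
# is just sum(q_j).
rhs = math.sqrt(sum(a * sum(q) for a, q in zip(alpha, qs)))
```

The inequality H_mix ≤ rhs holds for any weights and any positive components, which is all the proof needs.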

Now for M > 2, write ∪_δ {θ ∈ R^p_δ : ‖θ − θ⋆‖₂ > Mε} as ∪_δ ∪_{j≥1} A_ε(δ, j), where the unions in δ are taken over all δ such that ‖δ‖₀ = s, and

A_ε(δ, j) := {θ ∈ R^p_δ : jMε < ‖θ − θ⋆‖₂ ≤ (j + 1)Mε}.


For A_ε(δ, j) ≠ ∅, let S(δ, j) be a maximal (jMε/2)-separated set of points in A_ε(δ, j). It is easily checked that the cardinality of S(δ, j) is upper bounded by 9^{‖δ‖₀} = 9^s (see for instance [15], Example 7.1, for the argument). For θ_{δ,jk} ∈ S(δ, j), let φ_{δ,jk} denote the test function φ_{δ,θ_{δ,jk}} obtained above with θ = θ_{δ,jk} and η = jMε. From (5.13) and (5.14), φ_{δ,jk} satisfies

(5.15) sup_{u∈R^p_δ: ‖u−θ_{δ,jk}‖₂ ≤ jMε/2} [ E⋆(φ_{δ,jk}(Z)) + ∫_{Eρ} (1 − φ_{δ,jk}(z)) q_{δ,u}(z) dz ] ≤ e^{−(1/6)r₁(jMε/2)}.

Then we set

φ := sup_{δ: ‖δ‖₀=s} sup_{j≥1} max_{θ_{δ,jk}∈S(δ,j)} φ_{δ,jk}.

It then follows that

E⋆(φ(Z)) ≤ Σ_δ Σ_{j≥1} Σ_{θ_{δ,jk}∈S(δ,j)} E⋆(φ_{δ,jk}(Z)) ≤ (p choose s) 9^s Σ_{j≥1} e^{−(1/6)r₁(jMε/2)}.

And if for some δ such that ‖δ‖₀ ≤ s and θ ∈ R^p_δ we have ‖θ − θ⋆‖₂ > jMε, then we can find δ̄ with ‖δ̄‖₀ = s such that θ ∈ R^p_{δ̄}, and θ lies within (iMε)/2 of some point θ_{δ̄,ik} ∈ S(δ̄, i) for some i ≥ j. Hence, by (5.15),

∫_{Eρ} (1 − φ(z)) q_{δ,θ}(z) dz ≤ ∫_{Eρ} (1 − φ_{δ̄,ik}(z)) q_{δ,θ}(z) dz ≤ e^{−(1/6)r₁(iMε/2)} ≤ e^{−(1/6)r₁(jMε/2)},

since r₁ is non-decreasing and i ≥ j. This ends the proof.
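The 9^{‖δ‖₀} cardinality bound for a (jMε/2)-separated set in the annulus A_ε(δ, j) can be probed numerically. A greedy random packing in dimension 2 (so the bound is 9² = 81) stays well below it; the radii below are toy values for illustration only:

```python
import math
import random

random.seed(1)
r = 1.0          # plays the role of jM*eps with j = 1 (toy scale)
sep = r / 2.0    # required separation between kept points

def sample_annulus():
    # Rejection-sample a point with r < |x| <= 2r in dimension 2.
    while True:
        x = (random.uniform(-2 * r, 2 * r), random.uniform(-2 * r, 2 * r))
        n = math.hypot(*x)
        if r < n <= 2 * r:
            return x

# Greedy packing: keep a candidate only if it is > sep away from all kept points.
packed = []
for _ in range(5000):
    c = sample_annulus()
    if all(math.hypot(c[0] - p[0], c[1] - p[1]) > sep for p in packed):
        packed.append(c)
```

A volume argument (disjoint balls of radius sep/2 inside a slightly inflated annulus) already caps the packing at well under 81 points in dimension 2, consistent with the general 9^d bound.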

We make use of the following Gaussian version of the Hanson–Wright inequality, which follows directly from standard deviation bounds for Lipschitz functions of Gaussian random variables (see e.g. Theorem 5.6 of [9]).

Lemma 25. If Z ∼ N(0, I_m) and A ∈ R^{m×m} is a symmetric positive semi-definite matrix, then for all t ≥ Tr(A),

P(Z′AZ > t) ≤ exp( −(√t − √Tr(A))² / (2‖A‖₂) ).

We will also need the following lemma on determinants of sub-matrices.


Lemma 26. Let M be a symmetric positive definite matrix written in block form as M = (A B; B′ D), with D ∈ R^{q×q} (so that A is also symmetric positive definite). Then

det(A) λmin(M)^q ≤ det(M) ≤ det(A) λmax(M)^q.

Proof. This follows from Cauchy's interlacing property for eigenvalues. See for instance [19], Theorem 4.3.17.
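Lemma 26 can be checked numerically in the smallest nontrivial case, a 2×2 matrix with scalar blocks A, B, D (so q = 1); the entries below are toy values:

```python
import math

# Toy symmetric positive definite M = [[A, B], [B, D]] with scalar blocks (q = 1).
A, B, D = 2.0, 0.6, 1.5
detM = A * D - B * B

# Eigenvalues of the symmetric 2x2 matrix M in closed form.
tr = A + D
disc = math.sqrt(tr * tr / 4.0 - detM)
lmin, lmax = tr / 2.0 - disc, tr / 2.0 + disc

lower = A * lmin  # det(A) * lmin(M)^q with q = 1
upper = A * lmax  # det(A) * lmax(M)^q with q = 1
```

Here detM = 2.64 lies between A·λmin = 2.2 and A·λmax = 4.8, as the lemma predicts.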

5.2. Proof of Theorem 5. Let ∆_s := {δ ∈ ∆ : ‖δ‖₀ ≤ s}. We have ∆ × R^p = ((∆ ∖ ∆_s) × R^p) ∪ F₁ ∪ F₂₁ ∪ F₂₂ ∪ B_{m,M}, where

F₁ := ∪_{δ∈∆_s} {δ} × {θ ∈ R^p : ‖θδ − θ⋆‖₂ > Mε},

F₂₁ := ∪_{δ∈∆_s} {δ} × {θ ∈ R^p : ‖θδ − θ⋆‖₂ ≤ Mε, and ‖θ − θδ‖₂ > 2√((m+1)γp)},

F₂₂ := ∪_{δ∈∆_s} {δ} × {θ ∈ R^p : ‖θδ − θ⋆‖₂ ≤ Mε, ‖θ − θδ‖₂ ≤ 2√((m+1)γp), and ‖θ − θδ‖∞ > 2√((m+1)γ log(p))}.

Hence we can write

Πγ(B_{m,M}|Z) = 1 − Πγ(‖δ‖₀ > s | Z) − Πγ(F₁|Z) − Πγ(F₂₁|Z) − Πγ(F₂₂|Z).

Therefore, using the assumption that P⋆(Z ∈ Eρ) ≥ 1/2, and the definition of conditional probability,

(5.16) E⋆[Πγ(B_{m,M}|Z) | Z ∈ Eρ] ≥ 1 − E⋆[Πγ(‖δ‖₀ > s|Z) | Z ∈ Eρ] − 2E⋆[1_{Eρ}(Z)Πγ(F₁|Z)] − 2E⋆[1_{Eρ}(Z)Πγ(F₂₁|Z)] − 2E⋆[1_{Eρ}(Z)Πγ(F₂₂|Z)].

Hence, to finish the proof it suffices to upper-bound the terms on the right-hand side of (5.16).

Let φ denote the test function asserted by Lemma 23, where M > 2 is some arbitrary absolute constant. We can then write

(5.17) E⋆[1_{Eρ}(Z)Πγ(F₁|Z)] ≤ E⋆(φ(Z)) + E⋆[1_{Eρ}(Z)(1 − φ(Z))Πγ(F₁|Z)].


Since (p choose s) 9^s ≤ e^{s log(9p)}, we have

(5.18) E⋆(φ(Z)) ≤ e^{s log(9p)} Σ_{j=1}^∞ e^{−(1/6)r₁(jMε/2)}.

We apply (5.6) with B = F₁ and Fubini's theorem to get

(5.19) E⋆[1_{Eρ}(Z)(1 − φ(Z))Πγ(F₁|Z)] ≤ (1 + κ(s⋆)/ρ²)^{s⋆} Σ_{δ∈∆_s} (ωδ/ω_{δ⋆}) (ρ/2)^{‖δ‖₀} e^{3γρ²‖δ‖₀} [√det(I_{p−s⋆} − γ[S]_{δ⋆ᶜ}) / √det([Aδ]_{δᶜ})] × ∫_{F_ε} [e^{−ρ‖θ‖₁}/e^{−ρ‖θ⋆‖₁}] E⋆[1_{Eρ}(Z)(1 − φ(Z)) [e^{ℓ(θ;Z)}/e^{ℓ(θ⋆;Z)}] e^{2γ‖δ·∇ℓ(θ;Z) − δ·∇ℓ(θ⋆;Z)‖₂²}] μδ(dθ),

where F_ε := {θ ∈ R^p : ‖θ − θ⋆‖₂ > Mε} = ∪_{j≥1} F_{j,ε}, with F_{j,ε} := {θ ∈ R^p : jMε < ‖θ − θ⋆‖₂ ≤ (j + 1)Mε}. Using Lemma 23, we have

∫_{F_{j,ε}} [e^{−ρ‖θ‖₁}/e^{−ρ‖θ⋆‖₁}] E⋆[1_{Eρ}(Z)(1 − φ(Z)) [e^{ℓ(θ;Z)}/e^{ℓ(θ⋆;Z)}] e^{2γ‖δ·∇ℓ(θ;Z) − δ·∇ℓ(θ⋆;Z)‖₂²}] μδ(dθ)

≤ e^{−(1/6)r₁(jMε/2)} ∫_{F_{j,ε}} [e^{−ρ‖θ‖₁}/e^{−ρ‖θ⋆‖₁}] μδ(dθ) ≤ e^{−(1/6)r₁(jMε/2)} e^{8ρs^{1/2}(jMε/2)} ∫_{R^p} e^{−ρ‖θ−θ⋆‖₁} μδ(dθ),

and it is easily seen that ∫_{R^p} e^{−ρ‖θ−θ⋆‖₁} μδ(dθ) ≤ (2/ρ)^{‖δ‖₀}. Therefore (5.19) reduces to

(5.20) E⋆[1_{Eρ}(Z)(1 − φ(Z))Πγ(F₁|Z)] ≤ (1 + κ(s⋆)/ρ²)^{s⋆} Σ_{j=1}^∞ e^{−(1/6)r₁(jMε/2) + 8ρs^{1/2}(jMε/2)} × Σ_{δ∈∆_s} (ωδ/ω_{δ⋆}) e^{3γρ²‖δ‖₀} √det(I_{p−s⋆} − γ[S]_{δ⋆ᶜ}) / √det([Aδ]_{δᶜ}).

Using (5.10) and borrowing the definition of a in Equation (2.4) of the main manuscript, we get

E⋆[1_{Eρ}(Z)(1 − φ(Z))Πγ(F₁|Z)] ≤ e^{s⋆ + a/2} (1 + κ(s⋆)/ρ²)^{s⋆} × Σ_{δ∈∆_s} (ωδ/ω_{δ⋆}) e^{3γρ²‖δ‖₀} Σ_{j=1}^∞ e^{−(1/6)r₁(jMε/2) + 8ρs^{1/2}(jMε/2)}.


Using (2.7) we have 4e^{3γρ²} ≤ p^{u+1} for all p large enough, so that Assumption H3 gives

Σ_{δ∈∆_s} (ωδ/ω_{δ⋆}) e^{3γρ²‖δ‖₀} ≤ ((1 − q)/q)^{s⋆} Σ_{k=0}^{s} (2e^{3γρ²}/p^u)^{k} ≤ 2(1 − q)^{s⋆} p^{s⋆(1+u)} ≤ 2p^{s⋆(1+u)}.

It follows that

(5.21) E⋆[1_{Eρ}(Z)(1 − φ(Z))Πγ(F₁|Z)] ≤ 2e^{a/2} exp( (2 + u)s⋆ log(p) + s⋆ log(1 + κ(s⋆)/ρ²) ) × Σ_{j=1}^∞ e^{−(1/6)r₁(jMε/2) + 8ρs^{1/2}(jMε/2)}.

We set F₂₁^{(δ)} := {θ ∈ R^p : ‖θδ − θ⋆‖₂ ≤ Mε, and ‖θ − θδ‖₂ > ε₁}, with ε₁ := 2√((m+1)γp). From the definition of Πγ, and using the two inequalities of Lemma 19, we have

Πγ(F₂₁|Z) = [Σ_{δ∈∆_s} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{F₂₁^{(δ)}} e^{−hγ(δ,u;Z)} du] / [Σ_{δ∈∆} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{R^p} e^{−hγ(δ,u;Z)} du]

≤ [Σ_{δ∈∆_s} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{F₂₁^{(δ)}} e^{ℓ(uδ;Z) − ρ‖uδ‖₁ − (1/(2γ))(u−uδ)′(Ip−γS)(u−uδ) + Rγ(δ,u;Z)} du] / [Σ_{δ∈∆} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{R^p} e^{ℓ(uδ;Z) − ρ‖uδ‖₁ − (1/(2γ))(u−uδ)′(Ip−γS)(u−uδ)} du].

For Z ∈ Eρ and δ ∈ ∆_s, proceeding as in (5.4) we have

(5.22) −(1/(2γ))(u − uδ)′(Ip − γS)(u − uδ) + Rγ(δ, u; Z) ≤ −(1/(2γ))(u − uδ)′A(u − uδ) + 3γρ²s + 2γκ(s)²‖uδ − θ⋆‖₂²,

where A := Ip − γ(1 + 4γκ(s))S̄. It follows that

(5.23) 1_{Eρ}(Z)Πγ(F₂₁|Z) ≤ 1_{Eρ}(Z) e^{3γρ²s + 2γκ(s)²(Mε)²} × [Σ_{δ∈∆_s} ωδ (ρ/2)^{‖δ‖₀} (∫_{R^p} e^{ℓ(u;Z) − ρ‖u‖₁} μδ(du)) Tδ / √det([A]_{δᶜ})] / [Σ_{δ∈∆} ωδ (ρ/2)^{‖δ‖₀} (∫_{R^p} e^{ℓ(u;Z) − ρ‖u‖₁} μδ(du)) / √det(I_{p−‖δ‖₀} − γ[S]_{δᶜ})],


where

Tδ := ∫_{F₂^{(δ)}} e^{−(1/(2γ)) z′([A]_{δᶜ}) z} dz / ∫_{R^{p−‖δ‖₀}} e^{−(1/(2γ)) z′([A]_{δᶜ}) z} dz,

and F₂^{(δ)} := {θ ∈ R^{p−‖δ‖₀} : ‖θ‖₂ ≥ ε₁}. We have seen in (5.11) that

√det(I_{p−‖δ‖₀} − γ[S]_{δᶜ}) / √det([A]_{δᶜ}) ≤ e^{a/2}.

Therefore, (5.23) is upper bounded by

e^{3γρ²s + 2γκ(s)²(Mε)²} e^{a/2} sup_{δ∈∆_s} Tδ.

Tδ is the probability of the set F₂^{(δ)} ⊂ R^{p−‖δ‖₀} under the Gaussian distribution N(0, γ([A]_{δᶜ})^{−1}). Under the assumption 4γλmax(S̄) ≤ 1, all the eigenvalues of the matrix A = Ip − γ(1 + 4γκ(s))S̄ lie in [1/2, 1], and so do the eigenvalues of the sub-matrix [A]_{δᶜ} (by Cauchy's interlacing property for eigenvalues; see Theorem 4.3.17 of [19]). Hence by Lemma 25, we have Tδ ≤ e^{−mp/4} for m ≥ 1. Hence,

1_{Eρ}(Z)Πγ(F₂₁|Z) ≤ e^{−mp/4} exp( 3γρ²s + 2γκ(s)²(Mε)² + a/2 ).

A similar bound holds for F₂₂, with 1/p^m in place of e^{−mp/4}. This completes the proof.

6. Proof of Theorem 8.

Proof. We split the proof into two parts.

Part one: Model selection consistency. To shorten notation, we write B^{(δ)} for B^{(δ)}_{m,M}, and B for B_{m,M}. We have seen that under (2.8), B = ∪_{δ∈A} ({δ} × B^{(δ)}), which we can obviously write as B = ({δ⋆} × B^{(δ⋆)}) ∪ ∪_{δ∈A₀} {δ} × B^{(δ)}, where A₀ := A ∖ {δ⋆}. Hence

Πγ({δ⋆} × B^{(δ⋆)} | z) = Πγ(B|z) − Πγ(∪_{δ∈A₀} {δ} × B^{(δ)} | z).


The proof then boils down to controlling the rightmost term in the last equation. We have by definition

(6.1) Πγ(∪_{δ∈A₀} {δ} × B^{(δ)} | z) = [Σ_{δ∈A₀} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{B^{(δ)}} e^{−hγ(δ,θ;z)} dθ] / [Σ_{δ∈∆} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{R^p} e^{−hγ(δ,θ;z)} dθ]

≤ [Σ_{δ∈A₀} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{{θ∈R^p: ‖θδ−θ⋆‖₂≤Mε}} e^{−hγ(δ,θ;z)} dθ] / [Σ_{δ∈A} ωδ (ρ/2)^{‖δ‖₀} (1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{{θ∈R^p: ‖θδ−θ⋆‖₂≤Mε}} e^{−hγ(δ,θ;z)} dθ].

The first inequality of Lemma 19 says that for all θ ∈ R^p,

−hγ(δ, θ; z) ≥ ℓ(θδ; z) − ρ‖θδ‖₁ − (1/(2γ))(θ − θδ)′(Ip − γS)(θ − θδ).

Since ℓ(θδ; z) = ℓ^{[δ]}([θ]_δ; z), we combine this with the definition of ϖ_{2,M} in H5 to obtain that for all θ ∈ R^p such that ‖θδ − θ⋆‖₂ ≤ Mε,

−hγ(δ, θ; z) ≥ ℓ^{[δ]}(θ̂δ; z) + ⟨∇ℓ^{[δ]}(θ̂δ; z), [θ]_δ − θ̂δ⟩ − (1/2)([θ]_δ − θ̂δ)′[−∇^{(2)}ℓ^{[δ]}(θ̂δ; z)]([θ]_δ − θ̂δ) − (ϖ_{2,M}/6)‖[θ]_δ − θ̂δ‖₂³ − ρ‖θδ‖₁ − (1/(2γ))(θ − θδ)′(Ip − γS)(θ − θδ).

By the first-order optimality condition of θ̂δ, ∇ℓ^{[δ]}(θ̂δ; z) = 0. Furthermore, for any θ ∈ R^p such that ‖θδ − θ⋆‖₂ ≤ Mε, and using the assumption that ‖θ̂δ − [θ⋆]_δ‖₂ ≤ ε, we have

−ρ‖θδ‖₁ − (ϖ_{2,M}/6)‖[θ]_δ − θ̂δ‖₂³ ≥ −ρ‖θ̂δ‖₁ − (M + 1)ρs^{1/2}ε − ((M + 1)³/6)ϖ_{2,M} ε³.

We then deduce that

(6.2) −hγ(δ, θ; z) ≥ −(M + 1)ρs^{1/2}ε − ((M + 1)³/6)ϖ_{2,M} ε³ + ℓ^{[δ]}(θ̂δ; z) − ρ‖θ̂δ‖₁ − (1/2)([θ]_δ − θ̂δ)′[−∇^{(2)}ℓ^{[δ]}(θ̂δ; z)]([θ]_δ − θ̂δ) − (1/(2γ))(θ − θδ)′(Ip − γS)(θ − θδ).

It follows from this last inequality that the denominator of the right-hand side of (6.1) is lower bounded by

e^{−(M+1)ρs^{1/2}ε − ((M+1)³/6)ϖ_{2,M}ε³} Σ_{δ∈A} [(1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{R^{p−‖δ‖₀}} e^{−(1/(2γ)) u′([Ip−γS]_{δᶜ}) u} du] × ωδ (ρ/2)^{‖δ‖₀} e^{ℓ(θ̂δ;z) − ρ‖θ̂δ‖₁} √det(2π I_{γ,δ}^{−1}) N(θ̂δ, I_{γ,δ}^{−1})(Bδ),


where Bδ := {u ∈ R^{‖δ‖₀} : ‖u − [θ⋆]_δ‖₂ ≤ Mε}. Starting with the second inequality of Lemma 19, and with the same calculations as above, we get for any θ ∈ B^{(δ)},

−hγ(δ, θ; z) ≤ (M + 1)ρs^{1/2}ε + ((M + 1)³/6)ϖ_{2,M} ε³ + ℓ^{[δ]}(θ̂δ; z) − ρ‖θ̂δ‖₁ − (1/2)([θ]_δ − θ̂δ)′[−∇^{(2)}ℓ^{[δ]}(θ̂δ; z)]([θ]_δ − θ̂δ) − (1/(2γ))(θ − θδ)′(Ip − γS)(θ − θδ) + Rγ(δ, θ; z),

and for z ∈ Eρ, δ ∈ A, and θ ∈ B^{(δ)}, using (5.4), we have

−(1/(2γ))(θ − θδ)′(Ip − γS)(θ − θδ) + Rγ(δ, θ; z) ≤ −(1/(2γ))(θ − θδ)′[A]_{δᶜ}(θ − θδ) + 2γκ(s)²(Mε)² + 3γρ²‖δ‖₀.

The last two inequalities imply that for z ∈ Eρ, δ ∈ A, and θ ∈ B^{(δ)},

(6.3) −hγ(δ, θ; z) ≤ (M + 1)ρs^{1/2}ε + ((M + 1)³/6)ϖ_{2,M} ε³ + 2γκ(s)²(Mε)² + 3γρ²‖δ‖₀ + ℓ^{[δ]}(θ̂δ; z) − ρ‖θ̂δ‖₁ − (1/2)([θ]_δ − θ̂δ)′[−∇^{(2)}ℓ^{[δ]}(θ̂δ; z)]([θ]_δ − θ̂δ) − (1/(2γ))(θ − θδ)′[A]_{δᶜ}(θ − θδ).

Therefore the numerator on the right-hand side of (6.1) is upper bounded by

e^{(M+1)ρs^{1/2}ε + ((M+1)³/6)ϖ_{2,M}ε³} e^{2γκ(s)²(Mε)²} Σ_{δ∈A} [(1/(2πγ))^{(p−‖δ‖₀)/2} ∫_{R^{p−‖δ‖₀}} e^{−(1/(2γ)) u′([A]_{δᶜ}) u} du] × e^{3γρ²‖δ‖₀} ωδ (ρ/2)^{‖δ‖₀} e^{ℓ(θ̂δ;z) − ρ‖θ̂δ‖₁} √det(2π I_{γ,δ}^{−1}) N(θ̂δ, I_{γ,δ}^{−1})(Bδ).

Furthermore, we have seen in (5.11) that

√det(I_{p−‖δ‖₀} − γ[S]_{δᶜ}) / √det([A]_{δᶜ}) ≤ e^{a/2}.

Therefore it follows from the above that (6.1) gives us

Πγ(∪_{δ∈A₀} {δ} × B^{(δ)} | z) ≤ e^{c₀} × [Σ_{δ∈A₀} ωδ (ρ/2)^{‖δ‖₀} (e^{ℓ(θ̂δ;z) − ρ‖θ̂δ‖₁}/√det(I_{p−‖δ‖₀} − γ[S]_{δᶜ})) √det(2π I_{γ,δ}^{−1}) N(θ̂δ, I_{γ,δ}^{−1})(Bδ)] / [Σ_{δ∈A} ωδ (ρ/2)^{‖δ‖₀} (e^{ℓ(θ̂δ;z) − ρ‖θ̂δ‖₁}/√det(I_{p−‖δ‖₀} − γ[S]_{δᶜ})) √det(2π I_{γ,δ}^{−1}) N(θ̂δ, I_{γ,δ}^{−1})(Bδ)],

where c₀ := 2(M + 1)ρs^{1/2}ε + ((M + 1)³/3)ϖ_{2,M} ε³ + 2γκ(s)²(Mε)² + 3γρ²s + a/2. We rewrite the last inequality as

(6.4) Πγ(∪_{δ∈A₀} {δ} × B^{(δ)} | z) ≤ e^{c₀} [Σ_{k=1}^{s−s⋆} G_k] / [Σ_{k=0}^{s−s⋆} G_k],

where

G_k := Σ_{δ⊇δ⋆, ‖δ‖₀=s⋆+k} ωδ (ρ/2)^{s⋆+k} × (e^{ℓ(θ̂δ;z) − ρ‖θ̂δ‖₁}/√det(I_{p−s⋆−k} − γ[S]_{δᶜ})) √det(2π I_{γ,δ}^{−1}) N(θ̂δ, I_{γ,δ}^{−1})(Bδ).

We note that

√det(2π I_{γ,δ}^{−1}) = (2π)^{(s⋆+k)/2} / √det([−∇^{(2)}ℓ^{[δ]}(θ̂δ; z)]).

Hence

G_k = Σ_{δ⊇δ⋆, ‖δ‖₀=s⋆+k} ωδ (ρ/2)^{s⋆+k} (e^{ℓ(θ̂δ;z) − ρ‖θ̂δ‖₁}/√det(I_{p−s⋆−k} − γ[S]_{δᶜ})) · (2π)^{(s⋆+k)/2} N(θ̂δ, I_{γ,δ}^{−1})(Bδ) / √det([−∇^{(2)}ℓ^{[δ]}(θ̂δ; z)]).

Fix δ such that δ ⊇ δ⋆ and ‖δ‖₀ = s⋆ + k. Firstly, since [S]_{δᶜ} is a sub-matrix of [S]_{δ⋆ᶜ} and the eigenvalues of I_{p−s⋆} − γ[S]_{δ⋆ᶜ} are all between 1/2 and 1, it is not hard to see that

√det(I_{p−s⋆} − γ[S]_{δ⋆ᶜ}) / √det(I_{p−s⋆−k} − γ[S]_{δᶜ}) ≤ 1.

Secondly,

−ρ‖θ̂δ‖₁ + ρ‖θ̂_{δ⋆}‖₁ ≤ ρ‖θ̂δ − θ̂_{δ⋆}‖₁ ≤ 2ρs^{1/2}ε,

and for z ∈ E_{ρ,Λ}, we have

ℓ(θ̂δ; z) ≤ ℓ(θ̂_{δ⋆}; z) + Λk.

It follows that

G_k ≤ G₀ e^{2ρs^{1/2}ε} × Σ_{δ⊇δ⋆, ‖δ‖₀=s⋆+k} (ωδ/ω_{δ⋆}) e^{Λk} (ρ/2)^k (2π)^{k/2} [√det([−∇^{(2)}ℓ^{[δ⋆]}(θ̂_{δ⋆}; z)]) / √det([−∇^{(2)}ℓ^{[δ]}(θ̂δ; z)])] · [N(θ̂δ, I_{γ,δ}^{−1})(Bδ) / N(θ̂_{δ⋆}, I_{γ,δ⋆}^{−1})(B_{δ⋆})].


We have

N(θ̂_{δ⋆}, I_{γ,δ⋆}^{−1})(B_{δ⋆}) ≥ 1 − P( V′([−∇^{(2)}ℓ^{[δ⋆]}(θ̂_{δ⋆}; z)])^{−1} V > (M − 1)²ε² ) ≥ 1 − P( ‖V‖₂ ≥ (M − 1)ε κ(s)^{1/2} ),

where V = (V₁, …, V_{s⋆})′ with V_i i.i.d. N(0, 1), and where the second inequality uses H1 and the definition of κ(s). By the standard exponential bound for Gaussian random variables, we have

P( ‖V‖₂ ≥ (M − 1)ε κ(s)^{1/2} ) ≤ exp( −(1/2)((M − 1)ε κ(s)^{1/2} − s⋆^{1/2})₊² ).

Hence

N(θ̂δ, I_{γ,δ}^{−1})(Bδ) / N(θ̂_{δ⋆}, I_{γ,δ⋆}^{−1})(B_{δ⋆}) ≤ 1 / N(θ̂_{δ⋆}, I_{γ,δ⋆}^{−1})(B_{δ⋆}) ≤ 1 / (1 − e^{−(1/2)((M−1)ε κ(s)^{1/2} − s⋆^{1/2})₊²}) ≤ 2,

using the assumption that (M − 1)ε κ(s)^{1/2} − s⋆^{1/2} ≥ 2. We split

√det([−∇^{(2)}ℓ^{[δ⋆]}(θ̂_{δ⋆}; z)]) / √det([−∇^{(2)}ℓ^{[δ]}(θ̂δ; z)]) = [√det([−∇^{(2)}ℓ((θ̂_{δ⋆}, 0)_{δ⋆}; z)]_{δ⋆,δ⋆}) / √det([−∇^{(2)}ℓ((θ̂δ, 0)_δ; z)]_{δ⋆,δ⋆})] × [√det([−∇^{(2)}ℓ((θ̂δ, 0)_δ; z)]_{δ⋆,δ⋆}) / √det([−∇^{(2)}ℓ((θ̂δ, 0)_δ; z)]_{δ,δ})].

The convexity of the function −log det can be used to show that for any pair of symmetric positive definite matrices A, B of the same size, |log det(A) − log det(B)| ≤ max(‖A^{−1}‖F, ‖B^{−1}‖F) ‖B − A‖F. We use this to conclude that

√det([−∇^{(2)}ℓ((θ̂_{δ⋆}, 0)_{δ⋆}; z)]_{δ⋆,δ⋆}) / √det([−∇^{(2)}ℓ((θ̂δ, 0)_δ; z)]_{δ⋆,δ⋆}) ≤ exp( (1/2)(s⋆^{1/2}/κ(s)) ‖[∇^{(2)}ℓ((θ̂δ, 0)_δ; z) − ∇^{(2)}ℓ((θ̂_{δ⋆}, 0)_{δ⋆}; z)]_{δ⋆,δ⋆}‖F ) ≤ e^{(1/2)(s⋆^{1/2}/κ(s))ϖ_{2,M}‖θ̂δ − θ̂_{δ⋆}‖₂} ≤ e^{s⋆^{1/2} ϖ_{2,M} ε / κ(s)},

since ‖θ̂δ − θ̂_{δ⋆}‖₂ ≤ 2ε.


Then we use Lemma 26 to get

√det([−∇^{(2)}ℓ((θ̂δ, 0)_δ; z)]_{δ⋆,δ⋆}) / √det([−∇^{(2)}ℓ((θ̂δ, 0)_δ; z)]_{δ,δ}) ≤ (1/√κ(s))^k.

It follows that for z ∈ E_{ρ,Λ},

G_k ≤ 2G₀ e^{2ρs^{1/2}ε + s⋆^{1/2}ε ϖ_{2,M}/κ(s)} e^{Λk} (ρ/2)^k (2π)^{k/2} (1/√κ(s))^k Σ_{δ⊇δ⋆, ‖δ‖₀=s⋆+k} (ωδ/ω_{δ⋆}),

and under H3, and for p large enough so that q ≤ 1/2,

Σ_{δ⊇δ⋆, ‖δ‖₀=s⋆+k} (ωδ/ω_{δ⋆}) = (q/(1 − q))^k (p − s⋆ choose k) ≤ (2q)^k e^{k log(p)} ≤ (2/p^u)^k,

using the fact that (p − s⋆ choose k) ≤ e^{k log(p−s⋆)} ≤ e^{k log(p)}. Therefore

(6.5) Σ_{k=1}^{s−s⋆} G_k ≤ 2G₀ e^{2ρs^{1/2}ε + s⋆^{1/2}ε ϖ_{2,M}/κ(s)} Σ_{k=1}^{s−s⋆} ( (ρ e^Λ/p^u) √(2π/κ(s)) )^k.

It follows that if 2(ρ e^Λ/p^u)√(2π/κ(s)) ≤ 1, we get

[Σ_{k=1}^{s−s⋆} G_k] / [Σ_{k=0}^{s−s⋆} G_k] ≤ 4 (ρ e^Λ/p^u) √(2π/κ(s)) e^{2ρs^{1/2}ε + s⋆^{1/2}ε ϖ_{2,M}/κ(s)},

which, together with (6.4), implies the stated bound in (2.12).

Part two: Bernstein–von Mises approximation. We introduce the following probability distributions on ∆ × R^p:

Πγ,B(δ, dθ|z) ∝ ωδ (2πγ)^{‖δ‖₀/2} (ρ/2)^{‖δ‖₀} e^{−hγ(δ,θ;z)} 1_B(δ, θ) dθ,

Π̄γ,B(δ, dθ|z) ∝ ωδ (2πγ)^{‖δ‖₀/2} (ρ/2)^{‖δ‖₀} e^{ℓ(θδ;z) − ρ‖θδ‖₁ − (1/(2γ))(θ−θδ)′(Ip−γS)(θ−θδ)} 1_B(δ, θ) dθ,

and

Π∞γ,B(δ, dθ|z) ∝ ωδ (2πγ)^{‖δ‖₀/2} (ρ/2)^{‖δ‖₀} e^{ℓ(θ̂δ;z) − ρ‖θ̂δ‖₁} × e^{−(1/2)([θ]_δ − θ̂δ)′ I_{γ,δ} ([θ]_δ − θ̂δ) − (1/(2γ))(θ−θδ)′(Ip−γS)(θ−θδ)} 1_B(δ, θ) dθ.

For all measurable subsets C of ∆ × R^p, we obviously can write

|Πγ(C|z) − Π∞γ(C|z)| ≤ |Πγ(C|z) − Π∞γ,B(C|z)| + |Π∞γ,B(C|z) − Π∞γ(C|z)|.

Since Π∞γ(·|z) is one of the component probability measures of Π∞γ,B(·|z), by the coupling inequality we have

‖Π∞γ,B(·|z) − Π∞γ(·|z)‖tv ≤ 1 − Π∞γ,B({δ⋆} × B^{(δ⋆)} | z) ≤ 1 − Πγ({δ⋆} × B^{(δ⋆)} | z) + ‖Πγ(·|z) − Π∞γ,B(·|z)‖tv.

We conclude that

(6.6) ‖Πγ(·|z) − Π∞γ(·|z)‖tv ≤ 1 − Πγ({δ⋆} × B^{(δ⋆)} | z) + 2‖Πγ(·|z) − Π∞γ,B(·|z)‖tv.

Hence it suffices to bound the rightmost term of (6.6). For a measurable subset C of ∆ × R^p, we have

(6.7) |Πγ(C|z) − Π∞γ,B(C|z)| ≤ |Πγ(C|z) − Πγ,B(C|z)| + |Πγ,B(C|z) − Π̄γ,B(C|z)| + |Π̄γ,B(C|z) − Π∞γ,B(C|z)|.

To deal with the first term on the right-hand side of the inequality in (6.7), we first note that Πγ,B(·|z) is none other than the restriction of Πγ(·|z) to the set B. With this in mind, we make the following general observation. For any probability measure μ and any measurable set A such that μ(A) > 0, if μ_A denotes the restriction of μ to A (μ_A(B) := μ(A ∩ B)/μ(A)), we can decompose μ as μ = μ_A + μ(Aᶜ)(μ_{Aᶜ} − μ_A), where Aᶜ denotes the complement of A. This decomposition implies that for all measurable sets B,

(6.8) |μ(B) − μ_A(B)| ≤ max( μ(Aᶜ ∩ B), [μ(A ∩ B)/μ(A)] μ(Aᶜ) ) ≤ μ(Aᶜ).

In the particular case of Πγ and Πγ,B, this bound readily implies that

(6.9) sup_{C meas.} |Πγ(C|z) − Πγ,B(C|z)| ≤ 1 − Πγ(B|z).
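The restriction bound (6.8)–(6.9) is in fact tight: for a discrete measure μ and its restriction μ_A, the supremum over B of |μ(B) − μ_A(B)| is the total variation distance between μ and μ_A, which equals μ(Aᶜ) exactly. A quick check on a toy discrete measure (illustration only):

```python
import random

random.seed(4)
# A toy discrete probability measure on {0, ..., 9}.
w = [random.random() for _ in range(10)]
s = sum(w)
mu = [x / s for x in w]

A = {0, 2, 3, 7}  # restriction set
muA_mass = sum(mu[x] for x in A)
mu_restr = [mu[x] / muA_mass if x in A else 0.0 for x in range(10)]

# sup over B of |mu(B) - mu_A(B)| equals the total variation distance,
# computed here by the half-L1 formula.
tv = 0.5 * sum(abs(mu[x] - mu_restr[x]) for x in range(10))
mu_Ac = 1.0 - muA_mass
```

Up to floating-point rounding, tv coincides with μ(Aᶜ), so (6.8) cannot be improved in general.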


We claim that for all z ∈ Eρ,

(6.10) sup_{C meas.} |Πγ,B(C|z) − Π̄γ,B(C|z)| ≤ 2ι₁,

where

ι₁ := e^{a/2 + 3γρ²s + 2γκ(s)²(Mε)²} − 1 + 1/p^m.

To establish (6.10), we note that if f ≥ g are two unnormalized positive densities on some measurable space with normalizing constants Z_f, Z_g respectively, and A is a measurable set, we have

(6.11) | ∫_A f(x)dx/Z_f − ∫_A g(x)dx/Z_g | = | (Z_g − Z_f) ∫_A f(x)dx/(Z_g Z_f) + ∫_A (f(x) − g(x))dx/Z_g | ≤ Z_f/Z_g − 1.

Owing to the first inequality of Lemma 19, we can apply this result with f/Z_f taken as Πγ,B(·|z), and g/Z_g as Π̄γ,B(·|z). Now, it suffices to control the ratio of normalizing constants of Πγ,B(·|z) and Π̄γ,B(·|z), given by

(6.12) [Σ_{δ∈∆_s} ωδ (2πγ)^{‖δ‖₀/2} (ρ/2)^{‖δ‖₀} ∫_{B^{(δ)}} e^{−hγ(δ,θ;z)} dθ] / [Σ_{δ∈∆_s} ωδ (2πγ)^{‖δ‖₀/2} (ρ/2)^{‖δ‖₀} ∫_{B^{(δ)}} e^{ℓ(θδ;z) − ρ‖θδ‖₁ − (1/(2γ))(θ−θδ)′(Ip−γS)(θ−θδ)} dθ].

By the second inequality of Lemma 19, and (5.22), for z ∈ Eρ, δ ∈ ∆_s, and θ ∈ B^{(δ)}, we have

−hγ(δ, θ; z) ≤ ℓ(θδ; z) − ρ‖θδ‖₁ − (1/(2γ))(θ − θδ)′A(θ − θδ) + 3γρ²s + 2γκ(s)²(Mε)²,

where A = Ip − γ(1 + 4γκ(s))S̄. Hence the numerator of (6.12) is upper bounded by

(2πγ)^{p/2} e^{3γρ²s + 2γκ(s)²(Mε)²} Σ_{δ∈∆_s} ωδ (ρ/2)^{‖δ‖₀} [∫_{{θ∈R^p: ‖θ−θ⋆‖₂≤Mε}} e^{ℓ(θ;z) − ρ‖θ‖₁} μδ(dθ)] × 1/√det([A]_{δᶜ}),

whereas the denominator is equal to

(2πγ)^{p/2} Σ_{δ∈∆_s} ωδ (ρ/2)^{‖δ‖₀} [∫_{{θ∈R^p: ‖θ−θ⋆‖₂≤Mε}} e^{ℓ(θ;z) − ρ‖θ‖₁} μδ(dθ)] × Tδ/√det([Ip − γS]_{δᶜ}),


where Tδ is the probability of the set {u ∈ R^{p−‖δ‖₀} : ‖u‖₂ ≤ 2√((m+1)γp), ‖u‖∞ ≤ 2√((m+1)γ log(p))} under the distribution N(0, γ([Ip − γS]_{δᶜ})^{−1}), which is easily seen to be larger than 1 − 1/p^m for all m ≥ 4, by standard Gaussian tail bounds. From these results and (5.11) we conclude that the ratio (6.12) is upper bounded by (1 − 1/p^m)^{−1} e^{3γρ²s + 2γκ(s)²(Mε)² + a/2}, and this together with (6.11) implies (6.10).

We claim that for all z ∈ Eρ,

(6.13) sup_{C meas.} |Π̄γ,B(C|z) − Π∞γ,B(C|z)| ≤ 16( (M + 1)ρs^{1/2}ε + ((M + 1)³/3)ε³ ϖ_{2,M} ) e^{4(M+1)ρs^{1/2}ε + (4/3)(M+1)³ε³ϖ_{2,M}}.

We establish this with the following general observation. Let {a_j}, {b_j} be two discrete probability distributions on some discrete set J, with a_j > 0. Let μ(j, dx) = a_j μ_j(dx) and ν(j, dx) = b_j ν_j(dx) be two probability measures on J × R^p, where for each j, μ_j(dx) and ν_j(dx) are equivalent probability measures supported by some measurable subset of R^p. For any measurable set A ⊂ J × R^p, we have

μ(A) − ν(A) = Σ_j (1 − b_j/a_j) a_j μ_j(A^{(j)}) + Σ_j (b_j/a_j) ∫_{A^{(j)}} (1 − (dν_j/dμ_j)(x)) a_j μ_j(dx),

where A^{(j)} := {x ∈ R^p : (j, x) ∈ A}. This implies that

(6.14) sup_{A meas.} |μ(A) − ν(A)| ≤ sup_j |1 − b_j/a_j| + sup_j (b_j/a_j) · sup_x |1 − (dν_j/dμ_j)(x)|.

We apply this result with μ taken as Π̄γ,B, and ν taken as Π∞γ,B. In that case

a_δ = ã_δ / Σ_{δ∈A} ã_δ, b_δ = b̃_δ / Σ_{δ∈A} b̃_δ,

where

ã_δ = ωδ (2πγ)^{‖δ‖₀/2} (ρ/2)^{‖δ‖₀} ∫_{B^{(δ)}} e^{ℓ(θδ;z) − ρ‖θδ‖₁ − (1/(2γ))(θ−θδ)′(Ip−γS)(θ−θδ)} dθ,

b̃_δ = ωδ (2πγ)^{‖δ‖₀/2} (ρ/2)^{‖δ‖₀} e^{ℓ(θ̂δ;z) − ρ‖θ̂δ‖₁} × ∫_{B^{(δ)}} e^{−(1/2)([θ]_δ − θ̂δ)′ I_{γ,δ} ([θ]_δ − θ̂δ) − (1/(2γ))(θ−θδ)′(Ip−γS)(θ−θδ)} dθ.


We have

[min_{δ∈A} (b̃δ/ãδ)] / [max_{δ∈A} (b̃δ/ãδ)] ≤ b_δ/a_δ ≤ [max_{δ∈A} (b̃δ/ãδ)] / [min_{δ∈A} (b̃δ/ãδ)].

We take a Taylor expansion of the function u ↦ ℓ^{[δ]}(u; z) to the third order around θ̂δ, and note that ∇ℓ^{[δ]}(θ̂δ; z) = 0, to conclude that for all z ∈ Eρ, θ ∈ B^{(δ)},

(6.15) | ℓ(θδ; z) − ρ‖θδ‖₁ − ℓ(θ̂δ; z) + ρ‖θ̂δ‖₁ + (1/2)([θ]_δ − θ̂δ)′ I_{γ,δ} ([θ]_δ − θ̂δ) | ≤ ρ‖θδ − θ̂δ‖₁ + (1/6)ϖ_{2,M}‖θδ − θ̂δ‖₂³ ≤ c₁,

where c₁ := (M + 1)ρs^{1/2}ε + ((M + 1)³/6)ϖ_{2,M} ε³. It follows easily that e^{−c₁} ≤ b̃δ/ãδ ≤ e^{c₁}, so that e^{−2c₁} ≤ b_δ/a_δ ≤ e^{2c₁}. Similarly the Radon–Nikodym derivative satisfies

e^{−2c₁} ≤ (dν_δ/dμ_δ)(θ) ≤ e^{2c₁}.

With these bounds, (6.13) easily follows from (6.14) and the fact that |1 − x| ≤ a e^a for all x ∈ (e^{−a}, e^a), a > 0. The theorem then follows from (6.13), (6.10), (6.9), (6.7), and (6.6).
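The measure-decomposition identity used to derive (6.14) can be verified exhaustively on a toy discrete product space (all quantities below are illustrative, not from the paper):

```python
import itertools

# Toy two-component space: j in {0, 1}, x in {0, 1, 2}.
a = [0.4, 0.6]                               # {a_j}
b = [0.7, 0.3]                               # {b_j}
mu_j = [[0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]    # mu_j(dx), rows sum to 1
nu_j = [[0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]    # nu_j(dx), rows sum to 1

def check(A):
    # A is a set of pairs (j, x); verify the decomposition identity on A.
    muA = sum(a[j] * mu_j[j][x] for (j, x) in A)
    nuA = sum(b[j] * nu_j[j][x] for (j, x) in A)
    rhs = sum((1 - b[j] / a[j]) * a[j] * mu_j[j][x] for (j, x) in A) \
        + sum((b[j] / a[j]) * (1 - nu_j[j][x] / mu_j[j][x]) * a[j] * mu_j[j][x]
              for (j, x) in A)
    return abs((muA - nuA) - rhs) < 1e-12

pairs = [(j, x) for j in range(2) for x in range(3)]
all_ok = all(check(set(A)) for r in range(len(pairs) + 1)
             for A in itertools.combinations(pairs, r))
```

The identity holds term by term (each (j, x) contributes a_j μ_j(x) − b_j ν_j(x) to both sides), so it passes for every one of the 64 subsets A.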

7. Proof of Corollary 10. We know from Lemma 22 that Πγ is well-defined for all z ∈ R^n and all γ > 0 such that 4γλmax(X′X)/σ² ≤ 1. H6 readily implies H2. Ignoring constants, it is straightforward that

ℓ(θ; z) = −(1/(2σ²))‖z − Xθ‖₂², ∇ℓ(θ; z) = (1/σ²)X′(z − Xθ), ∇^{(2)}ℓ(θ; z) = −(1/σ²)X′X.

Hence H1 holds with S = S̄ = (X′X)/σ². To apply Lemma 21 we need to check (5.2).
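The gradient formula just displayed can be confirmed by a central finite difference on toy data (hypothetical X, z, θ; sketch only):

```python
# Finite-difference check of the log-likelihood gradient for the Gaussian
# linear model (toy data; sigma^2 = 1): grad = X'(z - X theta) / sigma^2.
sigma2 = 1.0
X = [[1.0, 0.5], [0.2, 1.0], [0.3, 0.7]]
z = [0.4, -0.1, 0.9]
theta = [0.3, -0.2]

def ell(th):
    resid = [zi - sum(Xij * tj for Xij, tj in zip(row, th))
             for row, zi in zip(X, z)]
    return -sum(r * r for r in resid) / (2.0 * sigma2)

def grad(th):
    resid = [zi - sum(Xij * tj for Xij, tj in zip(row, th))
             for row, zi in zip(X, z)]
    return [sum(X[k][i] * resid[k] for k in range(3)) / sigma2 for i in range(2)]

h = 1e-6
num_grad = []
for i in range(2):
    tp = list(theta); tp[i] += h
    tm = list(theta); tm[i] -= h
    num_grad.append((ell(tp) - ell(tm)) / (2.0 * h))

g = grad(theta)
```

Central differences agree with the closed-form gradient to within O(h²), confirming the displayed formula on this toy instance.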

For γ > 0 and δ ∈ ∆, we have

Lγ(δ, θ; z) = ℓ(θ; z) − ℓ(θ⋆; z) − ⟨∇ℓ(θ⋆; z), θ − θ⋆⟩ + (2γ/σ⁴)(θ − θ⋆)′X′X_δ X_δ′X(θ − θ⋆)

≤ −(n/(2σ²))(θ − θ⋆)′(X′X/n)(θ − θ⋆) + (2nγ λmax(X_δ X_δ′)/σ⁴)(θ − θ⋆)′(X′X/n)(θ − θ⋆).


Using this and the moment generating function of the Gaussian distribution yields

log E⋆[ e^{Lγ(δ,θ;Z) + (1 − ρ̄/ρ)⟨∇ℓ(θ⋆;Z), θ−θ⋆⟩} 1_{Eρ}(Z) ]

≤ −(n/(2σ²)) ( 1 − 4γλmax(XX′)/σ² − (1 − ρ̄/ρ)² ) (θ − θ⋆)′(X′X/n)(θ − θ⋆)

≤ −(n/(2σ²)) ( ρ̄/ρ − 4γλmax(XX′)/σ² ) (θ − θ⋆)′(X′X/n)(θ − θ⋆).

Since ρ̄/ρ = 1/log(p) and (8/σ²)γ log(p) λmax(X′X) ≤ 1, and given H7, we readily deduce that (5.2) holds with the rate function r₀(x) = (nv/(2σ² log(p))) x². In that case a₀ = 128 s⋆/v.

Furthermore, under the stated assumptions we easily check that η in (5.3) satisfies η ≤ (2/u)(m₀ + 1 + 2s⋆). This naturally suggests taking s = s⋆ + (2/u)(m₀ + 1 + 2s⋆) in H4. For ‖δ‖₀ ≤ s and θ ∈ R^p_δ, the upper bound on Lγ(δ, θ; z) obtained above (which, in the linear model, does not depend on z) readily shows that

log E⋆[e^{Lγ(δ,θ;Z)} 1_{Eρ}(Z)] ≤ Lγ(δ, θ; z) ≤ −(n/(2σ²)) (1 − 4γn v̄(s)/σ²) (θ − θ⋆)′(X′X/n)(θ − θ⋆).

Since γn = o(1) as p → ∞, we see that for all p large enough, H4 holds with the rate function r₁(x) = (n v(s)/(2σ²)) x², and from the definitions we obtain that

ε = 6σ²ρ(s⋆ + s)^{1/2}/(n v(s)) = (24σ/v(s)) √((s⋆ + s) log(p)/n).

We can then apply Theorem 5. Condition (2.7) and $a = o(\log(p))$ are implied by the assumptions on $\gamma$ in (2.17) of the main manuscript. We also easily see that $\gamma\rho^2\bar s + \gamma\kappa(\bar s)^2(M\epsilon)^2 = O(\bar s) = o(\log(p))$. Consequently, for all $m\ge 2$, $4p^{-m}e^{\frac a2+3\gamma\rho^2\bar s+2\gamma\kappa(\bar s)^2(M\epsilon)^2}\le p^{-(m-1)}$. Hence by Theorem 5, for any $m>1$, $p$ large enough, and any event $E$ such that $\mathbb{P}_\star(Z\in E)\ge 1/2$,

\begin{multline*}
\mathbb{E}_\star\left[\check\Pi_\gamma\left(B_{m,M}\,\middle|\,Z\right)\,\middle|\,Z\in E\right]\ge 1-\frac{1}{p^{m_0}}-\frac{1}{p^{m-1}} - 2e^{\bar s\log(9p)}\sum_{j\ge1}e^{-\frac16 r_1\left(\frac{jM\epsilon}{2}\right)}\\
-4e^{\frac a2}p^{(2+u)s_\star}\left(1+\frac{\kappa(s_\star)}{\rho^2}\right)^{s_\star}\sum_{j\ge1}e^{-\frac16 r_1\left(\frac{jM\epsilon}{2}\right)+8\rho\bar s^{1/2}\left(\frac{jM\epsilon}{2}\right)}.
\end{multline*}


Given the expression of $r_1$ found above, we check that
\[
\sum_{j\ge1}e^{-\frac16 r_1\left(\frac{jM\epsilon}{2}\right)} \le \frac{e^{-\frac{nv(\bar s)}{12\sigma^2}\left(\frac{M\epsilon}{2}\right)^2}}{1-e^{-\frac{nv(\bar s)}{12\sigma^2}\left(\frac{M\epsilon}{2}\right)^2}},
\]

and noting that $v(\bar s)\le 1$ as a consequence of (2.14), we have
\[
\frac{nv(\bar s)}{12\sigma^2}\left(\frac{M\epsilon}{2}\right)^2 = \frac{12M^2}{v(\bar s)}(s_\star+\bar s)\log(p) \ge 2M^2\bar s\log(9p),
\]

for all $p\ge 2$. Hence,
\[
e^{\bar s\log(9p)}\sum_{j\ge1}e^{-\frac16 r_1\left(\frac{jM\epsilon}{2}\right)} \le \frac{2}{(9p)^{M^2\bar s}} \le \frac{1}{p^{M^2\bar s}}.
\]
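The series bounds above rest on the elementary inequality $\sum_{j\ge1}e^{-aj^2}\le\sum_{j\ge1}e^{-aj}=e^{-a}/(1-e^{-a})$, applied with $a = \frac{nv(\bar s)}{12\sigma^2}(M\epsilon/2)^2$; a quick numeric sanity check of this step (the values of $a$ are arbitrary):

```python
import math

def lhs(a, terms=200):
    # sum_{j >= 1} exp(-a j^2), truncated (the tail is negligible for a > 0)
    return sum(math.exp(-a * j * j) for j in range(1, terms + 1))

def rhs(a):
    # geometric-series bound exp(-a) / (1 - exp(-a))
    return math.exp(-a) / (1.0 - math.exp(-a))

for a in (0.1, 0.5, 1.0, 3.0):
    assert lhs(a) <= rhs(a)
```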

Similarly, we check that
\[
-\frac{1}{12}r_1\left(\frac{jM\epsilon}{2}\right) + 8\rho\bar s^{1/2}\left(\frac{jM\epsilon}{2}\right)\le 0,
\]

for all $j\ge1$ if $M\ge 64/n$. Hence, given $M>2$, for all $p$ large enough (such that $n\ge 32$), we have
\[
\sum_{j\ge1}e^{-\frac16 r_1\left(\frac{jM\epsilon}{2}\right)+8\rho\bar s^{1/2}\left(\frac{jM\epsilon}{2}\right)} \le \sum_{j\ge1}e^{-\frac{1}{12}r_1\left(\frac{jM\epsilon}{2}\right)} \le 2\exp\left(-6M^2(s_\star+\bar s)\log(p)\right),
\]

and since $a = o(\log(p))$, and
\[
\log\left(1+\frac{\kappa(s_\star)}{\rho^2}\right)^{s_\star} = s_\star\log\left(1+\frac{v(s_\star)}{16\log(p)}\right) = o(s_\star\log(p)),
\]

as $p\to\infty$. Let us take $M>2$ such that $3M^2\ge 2+u$. We then get
\[
4e^{\frac a2}p^{(2+u)s_\star}\left(1+\frac{\kappa(s_\star)}{\rho^2}\right)^{s_\star}\sum_{j\ge1}e^{-\frac16 r_1\left(\frac{jM\epsilon}{2}\right)+8\rho\bar s^{1/2}\left(\frac{jM\epsilon}{2}\right)} \le \frac{1}{p^{M^2\bar s}},
\]

for all $p$ large enough. We conclude that there exists an absolute constant $A_0$ such that for all $p\ge A_0$,
(7.1)
\[
\mathbb{E}_\star\left[\check\Pi_\gamma\left(B_{m,M}\,\middle|\,Z\right)\,\middle|\,Z\in E\right]\ge 1-\frac{1}{p^{m_0}}-\frac{1}{p^{m-1}}-\frac{1}{p^{M^2\bar s}},
\]
for any measurable subset $E\subseteq\mathcal E_\rho$ such that $\mathbb{P}_\star(Z\in E)\ge 1/2$.


To apply Theorem 8, we need to check H5 and (2.8) of the main manuscript. For any given $M\ge\max\left(2,\sqrt{\frac{u+2}{3}}\right)$, (2.18) of the main manuscript implies (2.8) of the main manuscript for all $p$ large enough. Since $-\nabla^{(2)}\ell^{[\delta]}(\hat\theta_\delta(z);z) = (X_\delta'X_\delta)/\sigma^2$, it is straightforward that
\[
\kappa(\bar s) = \frac{nv(\bar s)}{\sigma^2} > 0, \quad\text{and}\quad \varpi_2 = 0.
\]

By (2.10) of the main manuscript, for $\|\delta\|_0\le\bar s$ we get
\[
\|\hat\theta_\delta(z)-[\theta_\star]_\delta\|_2\,\mathbf{1}_{\mathcal E_\rho}(z) \le \frac{\rho\bar s^{1/2}}{\kappa(\bar s)} \le \frac16\epsilon \le \epsilon.
\]

This shows that H5 holds. We shall apply Theorem 8 with $\Lambda = 3\log(n\wedge p)$. We easily check that $(M-1)\epsilon\,\kappa(\bar s)^{1/2}\ge s_\star^{1/2}+2$,
\[
\frac{\rho e^\Lambda}{p^u}\sqrt{\frac{2\pi}{\kappa(\bar s)}} \le \frac{\log(p)}{p^{u-2}}\,4\sqrt{\frac{2}{v(\bar s)}},
\]
and
\[
\left(\frac{4\rho e^\Lambda}{p^u}\sqrt{\frac{2\pi}{\kappa(\bar s)}}\right)e^{2(M+2)\rho\bar s^{1/2}\epsilon}\,e^{2\gamma\kappa(\bar s)^2(M\epsilon)^2}\,e^{3\gamma\rho^2\bar s}\,e^{\frac a2} \le \frac{\log(p)}{p^{u-2}}\,4\sqrt{\frac{2}{v(\bar s)}}\,e^{o(\log(p))} \le \frac{1}{p^{u-3}},
\]

for all $p$ large enough. Hence we can apply Theorem 8 to conclude that
\[
\check\Pi_\gamma\left(\{\delta_\star\}\times B^{(\delta_\star)}_{m,M}\,\middle|\,Z\right)\mathbf{1}_{\mathcal E_\rho}(Z) \ge \check\Pi_\gamma\left(B_{m,M}\,\middle|\,Z\right)\mathbf{1}_{\mathcal E_\rho}(Z) - \frac{1}{p^{u-3}}\mathbf{1}_E(Z).
\]

Taking the expectation on both sides and dividing by $\mathbb{P}_\star(Z\in\mathcal E_\rho)$, together with (7.1), yields
\[
\mathbb{E}_\star\left[\check\Pi_\gamma\left(\{\delta_\star\}\times B^{(\delta_\star)}_{m,M}\,\middle|\,Z\right)\,\middle|\,Z\in E\right] \ge 1-\frac{1}{p^{m_0}}-\frac{1}{p^{m-1}}-\frac{1}{p^{M^2\bar s}}-\frac{1}{p^{u-3}}.
\]

It remains to control the term $\mathbb{P}_\star\left[Z\notin\mathcal E_{\rho,\Lambda}\right]$. By Gaussian tail bounds, we see that H6 and (2.14) imply that
(7.2)
\[
\mathbb{P}_\star\left(Z\notin\mathcal E_\rho\right)\le\frac2p.
\]

Since $Z = X_{\delta_\star}\theta_\star+\sigma U$, where $U\sim\mathbf{N}(0,I_n)$, for any $\delta\in\mathcal A$ with $\delta\ne\delta_\star$ we have
\[
\ell(\hat\theta_\delta;Z)-\ell(\theta_\star;Z) = U'\left[X_\delta(X_\delta'X_\delta)^{-1}X_\delta'-X_{\delta_\star}(X_{\delta_\star}'X_{\delta_\star})^{-1}X_{\delta_\star}'\right]U = \|Q_{\delta-\delta_\star}'U\|_2^2,
\]


where $X = QR$, with $Q\in\mathbb{R}^{n\times(p\wedge n)}$ and $R\in\mathbb{R}^{(p\wedge n)\times p}$, denotes the QR decomposition of $X$. Using this, and Lemma 5 of [11], which provides a deviation bound on the maximum of chi-square random variables, we can find an absolute constant $c$ such that

\begin{multline*}
\mathbb{P}_\star(Z\notin\mathcal E_\Lambda) = \mathbb{P}_\star\left[\bigcup_{k=1}^{\bar s-s_\star}\left\{\max_{\delta\in\mathcal A:\,\|\delta\|_0=s_\star+k}\ell(\hat\theta_\delta;Z)-\ell(\theta_\star;Z)>3k\log(n\wedge p)\right\}\right]\\
\le \sum_{k=1}^{\bar s-s_\star}\mathbb{P}_\star\left[\max_{\delta\in\mathcal A:\,\|\delta\|_0=s_\star+k}\ell(\hat\theta_\delta;Z)-\ell(\theta_\star;Z)>3k\log(n\wedge p)\right] \le \sum_{k=1}^{\bar s-s_\star}\frac{e^{ck}}{\binom{n\wedge p}{k}^{1/4}}.
\end{multline*}

Since $\binom{n\wedge p}{k}\ge(n\wedge p-k)^k = e^{k\log(n\wedge p-k)}\ge e^{k\log((n\wedge p)/2)}$ for $k\le\bar s\le(n\wedge p)/2$, for $n,p$ large enough we get

\[
\sum_{k=1}^{\bar s-s_\star}\frac{e^{ck}}{\binom{n\wedge p}{k}^{1/4}} \le \sum_{k=1}^{\bar s}e^{-\frac k4\left(\log((n\wedge p)/2)-4c\right)} \le 2e^{-\frac14\left(\log((n\wedge p)/2)-4c\right)} = \frac{C_1}{(n\wedge p)^{1/4}},
\]

for some absolute constant $C_1$. Hence
\[
\mathbb{P}_\star\left[Z\notin\mathcal E_{\rho,\Lambda}\right]\le\frac2p+\frac{C_1}{(n\wedge p)^{1/4}}.
\]

The Bernstein-von Mises approximation part of the theorem follows easily.
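The chain of bounds above rests on the fact that, for nested supports, the difference of the two hat matrices is itself an orthogonal projection, so the quadratic form in $U$ is a squared norm (hence a chi-square variable). A minimal numeric check of this algebraic fact (supports, dimensions, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 10
X = rng.standard_normal((n, p))

def hat_matrix(cols):
    # orthogonal projection onto the span of the selected columns of X
    Xd = X[:, cols]
    return Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)

D = hat_matrix([0, 1, 2, 3, 4]) - hat_matrix([0, 1, 2])  # nested supports

# the difference is symmetric and idempotent, hence a projection ...
assert np.allclose(D, D.T, atol=1e-8)
assert np.allclose(D @ D, D, atol=1e-8)

# ... so the quadratic form U'DU equals ||DU||^2
U = rng.standard_normal(n)
assert np.isclose(U @ D @ U, np.linalg.norm(D @ U) ** 2)
```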

8. Technical lemmas needed in the proof of Lemma 18.

Lemma 27. Assume H6. Then there exists an absolute constant $A_0$ such that for all $p\ge A_0$, all $z\in\mathbb{R}^n$, all $\gamma>0$, and all $\theta\in\mathbb{R}^p$, we have
\[
\left|h_\gamma(\delta_\star,\theta;z)-\bar h_\gamma(\delta_\star,\theta;z)\right| \le \frac{\|\delta_\star\|_0\,\gamma}{2}\left(\rho+\frac{C(X)\sqrt n}{\sigma^2}\|\theta-\theta_{\delta_\star}\|_2\right)^2.
\]

Proof. Fix $\gamma>0$, $\delta=\delta_\star$, $\theta$, and $z$ as above. Using (3.2) and (3.6) we have
\begin{multline*}
h_\gamma(\delta,\theta;z)-\bar h_\gamma(\delta,\theta;z) = \rho\left(\|J_\gamma(\delta,\theta)\|_1-\|\bar J_\gamma(\delta,\theta)\|_1\right)\\
-\frac{1}{2\gamma}\left\langle J_\gamma(\delta,\theta)-\bar J_\gamma(\delta,\theta),\;J_\gamma(\delta,\theta)-\theta_\delta-\gamma\nabla\ell(\theta;z)+\bar J_\gamma(\delta,\theta)-\theta_\delta-\gamma\nabla\ell(\theta_\delta;z)\right\rangle.
\end{multline*}


From the definition of the proximal operator, there exist $E_{1,j}\in[-1,1]$ and $E_{2,j}\in[-1,1]$ such that
\[
J_{\gamma,j}(\delta,\theta) = \left(\theta_j+\gamma\nabla_j\ell(\theta;z)+\gamma\rho E_{1,j}\right)\delta_j, \quad\text{and}\quad \bar J_{\gamma,j}(\delta,\theta) = \left(\theta_j+\gamma\nabla_j\ell(\theta_\delta;z)+\gamma\rho E_{2,j}\right)\delta_j.
\]

Hence for $j$ such that $\delta_j=1$, we have
\[
\left|J_{\gamma,j}(\delta,\theta)-\bar J_{\gamma,j}(\delta,\theta)\right| \le 2\gamma\rho+\frac{\gamma}{\sigma^2}\left|\left\langle X_j,\sum_{k:\,\delta_k=0}\theta_k X_k\right\rangle\right| \le 2\gamma\rho+\frac{\gamma C(X)\sqrt n}{\sigma^2}\|\theta-\theta_\delta\|_2,
\]

using the coherence parameter $C(X)$. Similarly,
\[
\left|J_{\gamma,j}(\delta,\theta)-(\theta_\delta)_j-\gamma\nabla_j\ell(\theta;z)+\bar J_{\gamma,j}(\delta,\theta)-(\theta_\delta)_j-\gamma\nabla_j\ell(\theta_\delta;z)\right| \le 2\gamma\rho+\frac{\gamma C(X)\sqrt n}{\sigma^2}\|\theta-\theta_\delta\|_2.
\]

We conclude that
\[
\left|h_\gamma(\delta,\theta;z)-\bar h_\gamma(\delta,\theta;z)\right| \le \rho\|\delta\|_0\left(2\gamma\rho+\frac{\gamma C(X)\sqrt n}{\sigma^2}\|\theta-\theta_\delta\|_2\right) + \frac{\|\delta\|_0}{2\gamma}\left(2\gamma\rho+\frac{\gamma C(X)\sqrt n}{\sigma^2}\|\theta-\theta_\delta\|_2\right)^2.
\]
The result follows easily.
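The representation $J_{\gamma,j}(\delta,\theta) = (\theta_j+\gamma\nabla_j\ell(\theta;z)+\gamma\rho E_j)\delta_j$ with $E_j\in[-1,1]$ is the soft-thresholding form of the $\ell_1$ proximal-gradient (forward-backward) step restricted to the support $\delta$; a minimal sketch (function names and parameter values are illustrative, not from the authors' code):

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of t * |.|: shrink each coordinate toward zero by t
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def J(delta, theta, grad, gamma, rho):
    # forward-backward map: gradient step, then l1-prox, restricted to delta
    return soft_threshold(theta + gamma * grad, gamma * rho) * delta

rng = np.random.default_rng(2)
p, gamma, rho = 8, 0.1, 0.5
delta = (rng.random(p) < 0.5).astype(float)
theta, grad = rng.standard_normal(p), rng.standard_normal(p)
out = J(delta, theta, grad, gamma, rho)

# active coordinates have the form theta_j + gamma*grad_j + gamma*rho*E_j
# with E_j in [-1, 1], as used in the proof; inactive ones are zero
active = delta == 1.0
E = (out[active] - theta[active] - gamma * grad[active]) / (gamma * rho)
assert np.all(np.abs(E) <= 1.0 + 1e-12)
assert np.all(out[delta == 0.0] == 0.0)
```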

Lemma 28. Assume H6, H7. Then there exist constants $A_0, C_0$ such that for all $p\ge A_0$, all $m\ge 1$, $M>2$, and all $\theta_1,\theta_2\in B^{(\delta_\star)}_{m,M}$ such that $[\theta_1]_{\delta_\star^c} = [\theta_2]_{\delta_\star^c}$, we have
\begin{multline*}
\sup_{z\in\mathcal E_\rho}\left|h_\gamma(\delta_\star,\theta_1;z)-h_\gamma(\delta_\star,\theta_2;z)\right|\\
\le C_0\left(\left(1+\frac{\gamma n s_\star^{1/2}}{\sigma^2}\right)(\rho+nM\epsilon)+\frac{\gamma n s_\star^{1/2}}{\sigma^2}C(X)\sqrt{(m+1)\gamma np}\right)\|\theta_2-\theta_1\|_1\\
+ C_0\gamma s_\star\left(1+\frac{\gamma n s_\star^{1/2}}{\sigma^2}\right)(\rho+nM\epsilon)\left(\rho+nM\epsilon+C(X)\sqrt{(m+1)\gamma np}\right).
\end{multline*}


Proof. For convenience we write $\delta$ and $B(\delta)$ instead of $\delta_\star$ and $B^{(\delta_\star)}_{m,M}$, respectively. We also set $s = \|\delta_\star\|_0$. Fix $z\in\mathcal E_\rho$, and $\theta_1,\theta_2\in B(\delta)$ such that $[\theta_1]_{\delta^c} = [\theta_2]_{\delta^c}$. We start with some general remarks. For $\theta\in\mathbb{R}^p$, since $\ell$ is quadratic and $\nabla^{(2)}\ell(\theta;z) = -\frac{1}{\sigma^2}X'X$, we have
\[
\nabla\ell(\theta;z) = \nabla\ell(\theta_\star;z)-\frac{1}{\sigma^2}(X'X)(\theta-\theta_\delta)-\frac{1}{\sigma^2}(X'X)(\theta_\delta-\theta_\star).
\]

Hence for $z\in\mathcal E_\rho$ and $j$ such that $\delta_j=1$, if $\nabla_j$ denotes the partial derivative operator with respect to $\theta_j$, we have
\[
|\nabla_j\ell(\theta;z)| \le \frac{\rho}{2}+\frac{\sqrt n\,C(X)}{\sigma^2}\|\theta-\theta_\delta\|_2+\frac{n\sqrt{v(s)}}{\sigma^2}\|\theta_\delta-\theta_\star\|_2.
\]

Hence for all $\theta\in\mathbb{R}^p$,
(8.1)
\[
\sup_{z\in\mathcal E_\rho}\max_{j:\,\delta_j=1}|\nabla_j\ell(\theta;z)| \le \frac{\rho}{2}+\frac{\sqrt n\,C(X)}{\sigma^2}\|\theta-\theta_\delta\|_2+\frac{n\sqrt{v(s)}}{\sigma^2}\|\theta_\delta-\theta_\star\|_2.
\]

From (3.2), we have
\[
h_\gamma(\delta,\theta;z) = -\ell(\theta;z)-\langle\nabla\ell(\theta;z),J_\gamma(\delta,\theta)-\theta\rangle+\rho\|J_\gamma(\delta,\theta)\|_1+\frac{1}{2\gamma}\|J_\gamma(\delta,\theta)-\theta\|_2^2.
\]
Since $\ell$ is quadratic, we have
\[
-\ell(\theta;z)-\langle\nabla\ell(\theta;z),J_\gamma(\delta,\theta)-\theta\rangle = -\ell\left(J_\gamma(\delta,\theta);z\right)+\frac{1}{2\sigma^2}\left[J_\gamma(\delta,\theta)-\theta\right]'(X'X)\left[J_\gamma(\delta,\theta)-\theta\right].
\]

Hence
\[
h_\gamma(\delta,\theta_2;z)-h_\gamma(\delta,\theta_1;z) = U^{(1)}+U^{(2)}+U^{(3)}+U^{(4)},
\]
where
\begin{align*}
U^{(1)} &\overset{\mathrm{def}}{=} \ell\left(J_\gamma(\delta,\theta_1);z\right)-\ell\left(J_\gamma(\delta,\theta_2);z\right),\\
U^{(2)} &\overset{\mathrm{def}}{=} \frac{1}{2\sigma^2}\left([J_\gamma(\delta,\theta_2)-\theta_2]'(X'X)[J_\gamma(\delta,\theta_2)-\theta_2]-[J_\gamma(\delta,\theta_1)-\theta_1]'(X'X)[J_\gamma(\delta,\theta_1)-\theta_1]\right),\\
U^{(3)} &\overset{\mathrm{def}}{=} \rho\left(\|J_\gamma(\delta,\theta_2)\|_1-\|J_\gamma(\delta,\theta_1)\|_1\right),\\
U^{(4)} &\overset{\mathrm{def}}{=} \frac{1}{2\gamma}\left(\|J_\gamma(\delta,\theta_2)-\theta_2\|_2^2-\|J_\gamma(\delta,\theta_1)-\theta_1\|_2^2\right).
\end{align*}


Note that $U^{(1)} = \left\langle\nabla\ell(\bar\theta;z),J_\gamma(\delta,\theta_1)-J_\gamma(\delta,\theta_2)\right\rangle$ for some $\bar\theta$ on the segment between $J_\gamma(\delta,\theta_1)$ and $J_\gamma(\delta,\theta_2)$. Therefore, using (8.1), we get
\[
\left|U^{(1)}+U^{(3)}\right| \le \left(\frac{3\rho}{2}+\frac{\sqrt n\,C(X)}{\sigma^2}\|\bar\theta-\bar\theta_\delta\|_2+\frac{n\sqrt{v(s)}}{\sigma^2}\|\bar\theta_\delta-\theta_\star\|_2\right)\|J_\gamma(\delta,\theta_2)-J_\gamma(\delta,\theta_1)\|_1.
\]
Since $\bar\theta$ lies on the segment between $J_\gamma(\delta,\theta_1)$ and $J_\gamma(\delta,\theta_2)$, $\bar\theta-\bar\theta_\delta = 0$, and we have
(8.2)
\[
\|\bar\theta_\delta-\theta_\star\|_2 \le \max\left(\|J_\gamma(\delta,\theta_1)-\theta_\star\|_2,\,\|J_\gamma(\delta,\theta_2)-\theta_\star\|_2\right).
\]

Given $\theta\in B(\delta)$, we recall that if $\delta_j=0$ then $J_{\gamma,j}(\delta,\theta)=0$, and if $\delta_j=1$ we have $J_{\gamma,j}(\delta,\theta) = \theta_j+\gamma\nabla_j\ell(\theta;Z)+\gamma\rho E_j$ for some $E_j\in[-1,1]$ (that depends on $\theta$). Hence if $\delta_j=1$, using (8.1) and the fact that $\theta\in B(\delta)$, we have
\[
|J_{\gamma,j}(\delta,\theta)-\theta_{\star,j}| \le |\theta_j-\theta_{\star,j}|+\gamma\left(\frac{3\rho}{2}+\frac{2\sqrt n\,C(X)}{\sigma^2}\sqrt{(m+1)\gamma p}+\frac{n\sqrt{v(s)}}{\sigma^2}M\epsilon\right).
\]
It follows that for any $\theta\in B(\delta)$,
\[
\|J_\gamma(\delta,\theta)-\theta_\star\|_2 \le M\epsilon+\gamma s^{1/2}\left(\frac{3\rho}{2}+\frac{2\sqrt n\,C(X)}{\sigma^2}\sqrt{(m+1)\gamma p}+\frac{n\sqrt{v(s)}}{\sigma^2}M\epsilon\right).
\]

Using the last display in (8.2) yields
\[
\|\bar\theta_\delta-\theta_\star\|_2 \le M\epsilon+\gamma s^{1/2}\left(\frac{3\rho}{2}+\frac{2\sqrt n\,C(X)}{\sigma^2}\sqrt{(m+1)\gamma p}+\frac{n\sqrt{v(s)}}{\sigma^2}M\epsilon\right).
\]

Similarly, we have
\begin{align*}
\|J_\gamma(\delta,\theta_2)-J_\gamma(\delta,\theta_1)\|_1 &\le \|\theta_2-\theta_1\|_1+\gamma s\left(2\rho+\frac{n}{\sigma^2}v(s)^{1/2}\|\theta_2-\theta_1\|_2\right)\\
&\le \|\theta_2-\theta_1\|_1+2\gamma s\left(\rho+\frac{n}{\sigma^2}v(s)^{1/2}M\epsilon\right).
\end{align*}

We then get
\[
\left|U^{(1)}+U^{(3)}\right| \le C_1\left(\gamma s\left[\rho+\frac{n}{\sigma^2}M\epsilon\right]+\|\theta_2-\theta_1\|_1\right)\times\left[\left(1+\frac{\gamma n s^{1/2}}{\sigma^2}\right)(\rho+nM\epsilon)+\frac{\gamma n s^{1/2}}{\sigma^2}C(X)\sqrt{(m+1)\gamma np}\right],
\]
for all $p$ large enough, and for some absolute constant $C_1$, using the fact that $v(s) = O(1)$. For $U^{(2)}$, we write

\[
U^{(2)} = \frac{1}{2\sigma^2}\Delta_1'(X'X)\Delta_2,
\]


where $\Delta_1 = J_\gamma(\delta,\theta_2)-\theta_2-J_\gamma(\delta,\theta_1)+\theta_1$, and $\Delta_2 = J_\gamma(\delta,\theta_2)-\theta_2+J_\gamma(\delta,\theta_1)-\theta_1$. We note that $\Delta_{1,j}=0$ for $\delta_j=0$. Hence
\begin{align*}
|U^{(2)}| &\le \frac{1}{2\sigma^2}\|\Delta_1\|_1\max_{j:\,\delta_j=1}|\langle X_j,X\Delta_2\rangle|\\
&\le \frac{1}{2\sigma^2}\|\Delta_1\|_1\left(nv(s)^{1/2}\|\Delta_{2,\delta}\|_2+C(X)\sqrt n\,\|\Delta_2-\Delta_{2,\delta}\|_2\right).
\end{align*}

With the same calculations as above we have
\[
\|\Delta_1\|_1 \le 2\gamma s\left(\rho+\frac{n}{\sigma^2}v(s)^{1/2}M\epsilon\right),
\]
\[
\|\Delta_{2,\delta}\|_2 \le s^{1/2}\gamma\left(2\rho+\max_{j:\,\delta_j=1}|\nabla_j\ell(\theta_1;z)|+\max_{j:\,\delta_j=1}|\nabla_j\ell(\theta_2;z)|\right) \le \frac{C\gamma s^{1/2}}{\sigma^2}\left(\sigma^2\rho+nM\epsilon+C(X)\sqrt n\sqrt{(m+1)\gamma p}\right),
\]
and
\[
\|\Delta_2-\Delta_{2,\delta}\|_2 = \|(\theta_1-\theta_{1,\delta})+(\theta_2-\theta_{2,\delta})\|_2 \le 4\sqrt{(m+1)\gamma p}.
\]

Therefore
\[
|U^{(2)}| \le \frac{C\gamma s}{\sigma^2}\left(\rho+\frac{n}{\sigma^2}M\epsilon\right)\left[\frac{n\gamma s^{1/2}}{\sigma^2}(\rho+nM\epsilon)+\left(1+\frac{\gamma n s^{1/2}}{\sigma^2}\right)C(X)\sqrt{(m+1)\gamma np}\right],
\]

for all $p$ large enough. For $U^{(4)}$, we proceed similarly as with $U^{(2)}$ to get
\[
|U^{(4)}| \le \frac{C\gamma s}{\sigma^2}\left(\rho+\frac{n}{\sigma^2}M\epsilon\right)\left[\frac{n\gamma s^{1/2}}{\sigma^2}(\rho+nM\epsilon)+\left(1+\frac{\gamma n s^{1/2}}{\sigma^2}\right)C(X)\sqrt{(m+1)\gamma np}\right].
\]
We complete the proof by collecting these bounds together.

We will also need the following lemma, which is adapted from Lemma 4 of [5]. For $\tau>0$, we write $Q_\tau(\theta,\cdot)$ for the density of the normal distribution $\mathbf{N}(\theta,\tau^2 I_d)$.

Lemma 29. For some integer $d\ge1$, fix $\theta_0\in\mathbb{R}^d$, $R>0$, and define the set $\Xi \overset{\mathrm{def}}{=} \{u\in\mathbb{R}^d:\ \|u-\theta_0\|_2\le R\}$. Let $0<\sigma\le \frac{R\sqrt{2\pi}}{320d}$, and $r\overset{\mathrm{def}}{=} 4\sigma\sqrt d$. For all $u,v\in\Xi$ such that $\|u-v\|_2\le\sigma/4$, we have
\[
\int_{\Xi_{uv}}Q_\sigma(u,z)\wedge Q_\sigma(v,z)\,\mathrm dz \ge \frac14,
\]
where $\Xi_{uv}\overset{\mathrm{def}}{=}\{z\in\Xi:\ \|z-u\|_2\le r,\ \|z-v\|_2\le r\}$.


Proof. For $x\in\mathbb{R}^d$ and $a>0$, we shall write $\mathcal B_a(x)$ to denote the ball of radius $a$ centered at $x$: $\mathcal B_a(x)\overset{\mathrm{def}}{=}\{z\in\mathbb{R}^d:\ \|z-x\|_2\le a\}$. Without any loss of generality we can assume that $\theta_0=0_d$, and we can take $u,v\in\Xi$ such that $\|u-\theta_0\|_2 = \|v-\theta_0\|_2 = R$. We set $\Xi_{u,v}\overset{\mathrm{def}}{=}\mathcal B_r(u)\cap\mathcal B_r(v)\cap\Xi$, and we introduce
\begin{align*}
H_1 &\overset{\mathrm{def}}{=}\left\{\theta\in\mathbb{R}^d:\ 2\langle\theta-u,v-u\rangle\ge\|v-u\|_2^2\right\},\\
H_2 &\overset{\mathrm{def}}{=}\left\{\theta\in\mathbb{R}^d:\ 2\langle\theta-u,v-u\rangle<-\|v-u\|_2^2\right\},\\
H_3 &\overset{\mathrm{def}}{=}\left\{\theta\in\mathbb{R}^d:\ -\|v-u\|_2^2\le 2\langle\theta-u,v-u\rangle<\|v-u\|_2^2\right\}.
\end{align*}

We have $\mathbb{R}^d = H_1\cup H_2\cup H_3$, and since $\|\theta-u\|_2^2 = \|\theta-v\|_2^2+2\langle\theta-u,v-u\rangle-\|v-u\|_2^2$, we see that
\[
Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta) = \begin{cases} Q_\sigma(u,\theta) & \text{if }\theta\in H_1,\\ Q_\sigma(v,\theta) & \text{if }\theta\notin H_1.\end{cases}
\]

Therefore,
\[
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,\mathrm d\theta \ge \int_{\Xi_{u,v}\cap H_1}Q_\sigma(u,\theta)\,\mathrm d\theta+\int_{\Xi_{u,v}\cap H_2}Q_\sigma(v,\theta)\,\mathrm d\theta.
\]

Let $w = (u+v)/2$, $\wp\overset{\mathrm{def}}{=}\{z\in\mathbb{R}^d:\ \langle z,w\rangle = \|w\|_2^2\}$, and let $\wp_-\overset{\mathrm{def}}{=}\{z\in\mathbb{R}^d:\ \langle z,w\rangle\le\|w\|_2^2\}$ denote the corresponding half-space. Define also $\mathcal B_r\overset{\mathrm{def}}{=}\mathcal B_r(u)\cap\mathcal B_r(v)$. It can easily be checked that for $j\in\{1,2\}$,
\[
\mathcal B_r\cap H_j\cap\left(\wp_- -\frac{r^2 w}{R\|w\|_2}\right)\subseteq\Xi_{u,v}\cap H_j.
\]

Indeed, any $\theta\in\mathbb{R}^d$ can be written $\theta = \frac{\theta'w}{\|w\|_2}\frac{w}{\|w\|_2}+s$, where $\langle s,w\rangle = 0$. And if $\theta\in\mathcal B_r$, then $\|s\|_2 = \left\|\theta-\frac{\theta'w}{\|w\|_2}\frac{w}{\|w\|_2}\right\|_2\le\|\theta-w\|_2 = \|\theta-(u+v)/2\|_2\le r$. Also, if $\theta\in\wp_- -\frac{r^2w}{R\|w\|_2}$, then $\frac{\theta'w}{\|w\|_2}\le\|w\|_2-\frac{r^2}{R}\le R-\frac{r^2}{R}$. Hence
\[
\|\theta\|_2^2 = \left(\frac{\theta'w}{\|w\|_2}\right)^2+\|s\|_2^2 \le \left(R-\frac{r^2}{R}\right)^2+r^2 \le R^2.
\]

We conclude that
\begin{align*}
\int_{\Xi_{u,v}\cap H_1}Q_\sigma(u,\theta)\,\mathrm d\theta &\ge \int_{\Xi_{u,v}\cap H_1\cap\wp_-}Q_\sigma(u,\theta)\,\mathrm d\theta\\
&\ge \int_{\mathcal B_r\cap H_1\cap\wp_-}Q_\sigma(u,\theta)\,\mathrm d\theta-\int_{\wp_-\setminus\left(\wp_- -\frac{r^2w}{R\|w\|_2}\right)}Q_\sigma(u,\theta)\,\mathrm d\theta.
\end{align*}

And $\theta\in\wp_-\setminus\left(\wp_- -\frac{r^2w}{R\|w\|_2}\right)$ is equivalent to $-\frac{r^2}{R}\le\left\langle\theta-u,\frac{w}{\|w\|_2}\right\rangle\le 0$. Hence
\[
\int_{\wp_-\setminus\left(\wp_- -\frac{r^2w}{R\|w\|_2}\right)}Q_\sigma(u,\theta)\,\mathrm d\theta = \int_0^{\frac{r^2}{R}}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{t^2}{2\sigma^2}}\,\mathrm dt \le \frac{r^2}{\sigma R\sqrt{2\pi}}.
\]
A similar calculation


holds for the second integral. We conclude that
\[
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,\mathrm d\theta \ge \int_{\mathcal B_r\cap H_1\cap\wp_-}Q_\sigma(u,\theta)\,\mathrm d\theta+\int_{\mathcal B_r\cap H_2\cap\wp_-}Q_\sigma(v,\theta)\,\mathrm d\theta-\frac{2r^2}{\sigma R\sqrt{2\pi}}.
\]

It can be checked that the sets $\mathcal B_r(u)$, $\mathcal B_r(v)$, $H_1$ and $H_2$ are symmetric with respect to $\wp$. Indeed, for $x\in\mathbb{R}^d$, the reflection of $x$ with respect to $\wp$ is $x-2\left(\frac{x'w}{\|w\|_2}-\|w\|_2\right)\frac{w}{\|w\|_2}$, and it is an easy verification that if $x$ belongs to any of these sets, its reflection

also belongs. Hence,
\begin{align*}
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,\mathrm d\theta &\ge \frac12\int_{\mathcal B_r\cap H_1}Q_\sigma(u,\theta)\,\mathrm d\theta+\frac12\int_{\mathcal B_r\cap H_2}Q_\sigma(v,\theta)\,\mathrm d\theta-\frac{2r^2}{\sigma R\sqrt{2\pi}}\\
&\ge \frac12\int_{\mathcal B_r\cap H_1}Q_\sigma(u,\theta)\,\mathrm d\theta+\frac12\int_{\mathcal B_r\cap H_1^c}Q_\sigma(v,\theta)\,\mathrm d\theta-\frac12\int_{H_3}Q_\sigma(v,\theta)\,\mathrm d\theta-\frac{2r^2}{\sigma R\sqrt{2\pi}}.
\end{align*}

We have
\begin{align*}
\int_{\mathcal B_r\cap H_1}Q_\sigma(u,\theta)\,\mathrm d\theta+\int_{\mathcal B_r\cap H_1^c}Q_\sigma(v,\theta)\,\mathrm d\theta &= \int_{\mathbb{R}^d}\mathbf 1_{\mathcal B_r\cap H_1}(\theta)Q_\sigma(u,\theta)\,\mathrm d\theta+\int_{\mathbb{R}^d}\mathbf 1_{\mathcal B_r\cap H_1^c}(v-u+\theta)Q_\sigma(u,\theta)\,\mathrm d\theta\\
&\ge \int_{\mathbb{R}^d}\mathbf 1_{\mathcal B_r\cap H_1}(\theta)Q_\sigma(u,\theta)\,\mathrm d\theta+\int_{\mathbb{R}^d}\mathbf 1_{\mathcal B_r(u)\cap H_2}(\theta)Q_\sigma(u,\theta)\,\mathrm d\theta,
\end{align*}
where the last inequality uses the fact, easy to check, that if $\theta\in\mathcal B_r(u)\cap H_2$, then

$\theta+v-u\in\mathcal B_r\setminus H_1$. Furthermore,
\[
\mathcal B_r(u) = \left(\mathcal B_r(u)\cap H_1\right)\cup\left(\mathcal B_r(u)\cap H_2\right)\cup\left(\mathcal B_r(u)\cap H_3\right),
\]
and $\mathcal B_r(u)\cap H_1 = \mathcal B_r\cap H_1$. Hence
\[
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,\mathrm d\theta \ge \frac12\int_{\mathcal B_r(u)}Q_\sigma(u,\theta)\,\mathrm d\theta-\frac12\int_{H_3}Q_\sigma(u,\theta)\,\mathrm d\theta-\frac12\int_{H_3}Q_\sigma(v,\theta)\,\mathrm d\theta-\frac{2r^2}{\sigma R\sqrt{2\pi}}.
\]

With $r = 4\sigma\sqrt d$, we have $\int_{\mathcal B_r(u)}Q_\sigma(u,\theta)\,\mathrm d\theta \ge 1-10^{-4}$. With $\sigma\le\frac{R\sqrt{2\pi}}{320d}$,
\[
\frac{2r^2}{\sigma R\sqrt{2\pi}}\le\frac{1}{10},
\]


and with $\|v-u\|_2\le\sigma/4$,
\[
\int_{H_3}Q_\sigma(u,\theta)\,\mathrm d\theta+\int_{H_3}Q_\sigma(v,\theta)\,\mathrm d\theta = \int_{-\frac{\|v-u\|_2}{2}}^{\frac{\|v-u\|_2}{2}}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{t^2}{2\sigma^2}}\,\mathrm dt+\int_{-\frac{3\|v-u\|_2}{2}}^{\frac{\|v-u\|_2}{2}}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{t^2}{2\sigma^2}}\,\mathrm dt \le \frac{3\|v-u\|_2}{\sqrt{2\pi}\sigma} \le \frac{3}{4\sqrt{2\pi}}.
\]

In conclusion,
\[
\int_{\Xi_{u,v}}Q_\sigma(u,\theta)\wedge Q_\sigma(v,\theta)\,\mathrm d\theta \ge \frac12-\frac{10^{-4}}{2}-\frac{3}{8\sqrt{2\pi}}-\frac{1}{10} \ge \frac14,
\]
as claimed.
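The constant $1/4$ can be probed numerically; the Monte Carlo sketch below (dimension, radius, and seed are arbitrary) estimates the overlap integral by importance sampling from $Q_\sigma(u,\cdot)$, using $\int_{\Xi_{uv}}q_u\wedge q_v = \mathbb E_u\left[\min(1,q_v/q_u)\mathbf 1_{\Xi_{uv}}\right]$, for a worst-case pair $u,v$ on the boundary sphere:

```python
import numpy as np

rng = np.random.default_rng(3)
d, R = 4, 1.0
sigma = R * np.sqrt(2 * np.pi) / (320 * d)   # largest sigma allowed by the lemma
r = 4 * sigma * np.sqrt(d)

# u, v on the sphere of radius R (the worst case treated in the proof),
# with ||u - v|| <= sigma / 4
u = np.zeros(d)
u[0] = R
v = u + np.array([0.0, sigma / 8, 0.0, 0.0])
v *= R / np.linalg.norm(v)
assert np.linalg.norm(u - v) <= sigma / 4

# importance sampling from Q_sigma(u, .)
Z = u + sigma * rng.standard_normal((200_000, d))
log_ratio = (np.sum((Z - u) ** 2, axis=1)
             - np.sum((Z - v) ** 2, axis=1)) / (2 * sigma**2)
w = np.minimum(1.0, np.exp(log_ratio))          # min(1, q_v / q_u)
in_set = ((np.linalg.norm(Z, axis=1) <= R)      # z in Xi (theta_0 = 0)
          & (np.linalg.norm(Z - u, axis=1) <= r)
          & (np.linalg.norm(Z - v, axis=1) <= r))
estimate = np.mean(w * in_set)
assert estimate >= 0.25   # the lemma's lower bound
```

For this near-degenerate pair ($u\approx v$) the estimate sits close to $1/2$: roughly half of the Gaussian mass around a boundary point lies inside $\Xi$, which is comfortably above the guaranteed $1/4$.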

Acknowledgements. The authors are grateful to Galin Jones, Scott Schmidler

and Yuekai Sun for very helpful discussions.

References.

[1] Atchade, Y. A. (2017). On the contraction properties of some high-dimensional quasi-posterior

distributions. Ann. Statist. 45 2248–2273.

[2] Atchade, Y. F. (2015). A Moreau-Yosida approximation scheme for high-dimensional posterior

and quasi-posterior distributions. ArXiv e-prints .

[3] Atchade, Y. F. and Rosenthal, J. S. (2005). On adaptive Markov chain Monte Carlo algo-

rithms. Bernoulli 11 815–828.

[4] Bauschke, H. H. and Combettes, P. L. (2011). Convex analysis and monotone operator

theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathematiques de la SMC,

Springer, New York. With a foreword by Hedy Attouch.

URL http://dx.doi.org/10.1007/978-1-4419-9467-7

[5] Belloni, A. and Chernozhukov, V. (2009). On the computational complexity of MCMC-

based estimators in large samples. Ann. Statist. 37 2011–2055.

[6] Bhattacharya, A., Chakraborty, A. and Mallick, B. (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika 103 985–991.

[7] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and

Dantzig selector. Ann. Statist. 37 1705–1732.

[8] Bottolo, L. and Richardson, S. (2010). Evolutionary stochastic search for Bayesian model exploration. Bayesian Anal. 5 583–618.

[9] Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration inequalities: a nonasymp-

totic theory of independence. Springer Series in Statistics, Oxford University Press, Oxford.

[10] Cai, T., Han, X. and Pan, G. (2017). Limiting Laws for Divergent Spiked Eigenvalues and

Largest Non-spiked Eigenvalue of Sample Covariance Matrices. ArXiv e-prints .

[11] Castillo, I., Schmidt-Hieber, J. and van der Vaart, A. (2015). Bayesian linear regression

with sparse priors. Ann. Statist. 43 1986–2018.


[12] Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior

concentration for possibly sparse sequences. Ann. Statist. 40 2069–2101.

[13] Dyer, M., Frieze, A. and Kannan, R. (1991). A random polynomial-time algorithm for

approximating the volume of convex bodies. J. ACM 38 1–17.

[14] George, E. I. and McCulloch, R. E. (1997). Approaches to Bayesian variable selection. Statist. Sinica 7 339–373.

[15] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior

distributions. Ann. Statist. 28 500–531.

[16] Gottardo, R. and Raftery, A. E. (2008). Markov chain Monte Carlo with mixtures of mutually singular distributions. Journal of Computational and Graphical Statistics 17 949–975.

[17] Green, P. J. (2003). Trans-dimensional Markov chain Monte Carlo. In Highly structured

stochastic systems, vol. 27 of Oxford Statist. Sci. Ser. 179–206.

[18] Guan, Y. and Krone, S. M. (2007). Small-world MCMC and convergence to multi-modal distributions: from slow mixing to fast mixing. Ann. Appl. Probab. 17 284–304.

[19] Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. 2nd ed. Cambridge University

Press, New York, NY, USA.

[20] Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Statist. 33 730–773.

[21] Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite-dimensional

Bayesian statistics. Ann. Statist. 34 837–877.

[22] Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association 89 958–966.

[23] Lovasz, L. and Simonovits, M. (1990). The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science. SFCS '90, IEEE Computer Society, Washington, DC, USA.

[24] Lovasz, L. and Simonovits, M. (1993). Random walks in a convex body and an improved

volume algorithm. Random Structures Algorithms 4 359–412.

[25] Lovasz, L. and Vempala, S. (2007). The geometry of logconcave functions and sampling

algorithms. Random Structures Algorithms 30 307–358.

[26] Madras, N. and Randall, D. (2002). Markov chain decomposition for convergence rate anal-

ysis. Ann. Appl. Probab. 12 581–606.

[27] Mangoubi, O. and Smith, A. (2017). Rapid Mixing of Hamiltonian Monte Carlo on Strongly

Log-Concave Distributions. ArXiv e-prints .

[28] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-

dimensional data. Ann. Statist. 37 246–270.

[29] Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression.

JASA 83 1023–1032.

[30] Moreau, J.-J. (1965). Proximite et dualite dans un espace hilbertien. Bull. Soc. Math. France

93 273–299.

[31] Narisetty, N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing

priors. Ann. Statist. 42 789–817.


[32] Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified frame-

work for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical

Science 27 538–557.

[33] Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization

1 123–231.

[34] Patrinos, P., Stella, L. and Bemporad, A. (2014). Forward-backward truncated Newton

methods for convex composite optimization. ArXiv e-prints .

[35] Robert, C. P. and Casella, G. (2004). Monte Carlo statistical methods. 2nd ed. Springer

Texts in Statistics, Springer-Verlag, New York.

[36] Rockova, V. and George, E. I. (2014). EMVS: the EM approach to Bayesian variable selection. Journal of the American Statistical Association 109 828–846.

[37] Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements.

IEEE Trans. Inf. Theor. 59 3434–3447.

[38] Schreck, A., Fort, G., Le Corff, S. and Moulines, E. (2013). A shrinkage-thresholding

Metropolis adjusted Langevin algorithm for Bayesian variable selection. ArXiv e-prints .

[39] Sinclair, A. and Jerrum, M. (1989). Approximate counting, uniform generation and rapidly

mixing Markov chains. Inform. and Comput. 82 93–133.

[40] Sur, P., Chen, Y. and Candes, E. J. (2017). The Likelihood Ratio Test in High-Dimensional

Logistic Regression Is Asymptotically a Rescaled Chi-Square. ArXiv e-prints .

[41] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc.

Ser. B 58 267–288.

[42] Tierney, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22

1701–1762. With discussion and a rejoinder by the author.

[43] Vempala, S. (2005). Geometric random walks: a survey. In Combinatorial and computational

geometry, vol. 52 of Math. Sci. Res. Inst. Publ. Cambridge Univ. Press, Cambridge, 577–616.

[44] Woodard, D. B., Schmidler, S. C. and Huber, M. (2009). Conditions for rapid mixing of

parallel and simulated tempering on multimodal distributions. Ann. Appl. Probab. 19 617–640.

[45] Yang, Y., Wainwright, M. J. and Jordan, M. I. (2016). On the computational complexity of high-dimensional Bayesian variable selection. Ann. Statist. 44 2497–2532.

1085 South University, Ann Arbor, MI 48109, United States

E-mail: [email protected]

[email protected]