Langevin MCMC: Theory and Methods

Bayesian Computation Opening Workshop

A. Durmus (1), N. Brosse (2), E. Moulines (2), M. Pereyra (3), S. Sabanis (4)

(1) ENS Paris-Saclay

(2) Ecole Polytechnique

(3) Heriot-Watt University

(4) University of Edinburgh

1 Motivation and setting

2 The ULA algorithm

3 The Stochastic Gradient Langevin dynamics (SGLD)

4 Strongly log-concave distribution

5 Convex and Super-exponential densities

6 Non-smooth potentials

7 Logconcave densities with constrained domains

8 Normalizing constants of log-concave densities

Sampling distributions over high-dimensional state spaces has recently attracted a lot of research effort in the computational statistics and machine learning communities...

Applications (non-exhaustive)

1. Bayesian inference for high-dimensional models,
2. Bayesian inverse problems (e.g., image restoration and deblurring),
3. Aggregation of estimators and experts,
4. Bayesian non-parametrics.

Most of the sampling techniques known so far do not scale to high dimension... Challenges are numerous in this area...

Bayesian stuff

A Bayesian model is specified by

(i) the likelihood of the observed data: D ∼ L(·|x),
(ii) a prior distribution π0 on the parameter space, x ∈ X ⊂ R^d.

The inference is based on the posterior distribution:

$$\pi(x \mid D) = \frac{\pi_0(x)\, L(D \mid x)}{\int L(D \mid u)\, \pi_0(u)\, du}$$


In most cases the normalizing constant is not tractable:

π(x |D) ∝ π0(x)L(D|x) .

An example: Bayesian analysis of logistic regression

Likelihood: binary regression set-up in which the binary observations (responses) $\{R_i\}_{i=1}^n$ are conditionally independent Bernoulli random variables with success probabilities $\{F(x^T Z_i)\}_{i=1}^n$, where

1. $Z_i$ is a d-dimensional vector of known covariates,
2. $x$ is a d-dimensional vector of unknown regression coefficients,
3. $F$ is the link function: in logistic regression, F is the standard logistic cumulative distribution function:

$$F(t) = e^t / (1 + e^t)$$

An example: Bayesian analysis of logistic regression

Possible choices for the prior (tons of theory behind these, this is not black magic!)

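Gaussian prior: $\pi_0(x) \propto \exp(-\tfrac{1}{2} x^T \Sigma^{-1} x)$, with $\Sigma$ a prior covariance matrix

LASSO prior: $\pi_0(x) \propto \exp(-\sum_{i=1}^d \alpha_i |x_i|)$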

Many more sophisticated ones (spike-and-slab, horseshoe, ...) and other global-local shrinkage priors...

The posterior of x is, up to a proportionality constant, given by

$$\pi(x \mid (R, Z)) \propto \prod_{i=1}^n F^{R_i}(x^T Z_i)\,(1 - F(x^T Z_i))^{1 - R_i}\, \pi_0(x)$$
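As a minimal sketch of how this posterior enters a gradient-based sampler, the potential U(x) = −log π(x | (R, Z)), up to an additive constant, and its gradient can be coded as below. The Gaussian prior N(0, tau2 · I) and the names `potential`, `grad_potential` and `tau2` are assumptions chosen for illustration, not taken from the slides.

```python
import math

def sigmoid(t):
    """Numerically stable logistic CDF F(t) = e^t / (1 + e^t)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def potential(x, Z, R, tau2=1.0):
    """U(x) = -log posterior up to a constant, with an assumed prior N(0, tau2 * I)."""
    u = sum(xi * xi for xi in x) / (2.0 * tau2)
    for z, r in zip(Z, R):
        t = sum(a * b for a, b in zip(x, z))
        # -log[F(t)^r (1 - F(t))^(1 - r)] = log(1 + e^t) - r * t
        u += max(t, 0.0) + math.log1p(math.exp(-abs(t))) - r * t
    return u

def grad_potential(x, Z, R, tau2=1.0):
    """Gradient of the potential: x / tau2 + sum_i (F(x^T Z_i) - R_i) Z_i."""
    g = [xi / tau2 for xi in x]
    for z, r in zip(Z, R):
        t = sum(a * b for a, b in zip(x, z))
        w = sigmoid(t) - r
        for j, zj in enumerate(z):
            g[j] += w * zj
    return g
```

This potential is smooth and strongly convex (thanks to the Gaussian prior), which is the setting studied in the later sections.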

Bayesian statistics are out of the game ?

While optimization-based algorithms for the extremely popular Lasso and elastic net procedures can scale to dimension in the hundreds of thousands, corresponding Bayesian methods that use Markov chain Monte Carlo (MCMC) for computation are limited to problems at least an order of magnitude smaller.

The promise of scaling MCMC to large datasets has thus far not been realized, owing to the expensive computations often involved in MCMC algorithms and slow mixing of the corresponding Markov chains.

(long term) Objective:

- Contribute to filling the gap between optimization and simulation.
- Develop sampling methods which can provably work in high dimension (more precise statements later).

For logistic regression with horseshoe prior, ongoing work with James Johndrow

2 The ULA algorithm

Denote by π a target density w.r.t. the Lebesgue measure on $\mathbb{R}^d$, known up to a normalisation factor:

$$x \mapsto e^{-U(x)} \Big/ \int_{\mathbb{R}^d} e^{-U(y)}\, dy ,$$

Assumption: U is L-smooth: continuously differentiable and there exists a constant L such that for all x, y ∈ R^d,

‖∇U(x)−∇U(y)‖ ≤ L‖x − y‖ .

Note: this condition can be removed by taming the gradient (Brosse, Durmus, M., Sabanis, 2018, to appear SPA 2019)

(Overdamped) Langevin diffusion

Langevin SDE:

$$dY_t = -\nabla U(Y_t)\, dt + \sqrt{2}\, dB_t ,$$

where (Bt)t≥0 is a d-dimensional Brownian Motion.

π(x) ∝ exp(−U(x)) is the unique invariant probability measure.

Notation: $(P_t)_{t \ge 0}$ is the Markov semigroup associated with the Langevin diffusion: $P_t(x, A) = P(Y_t \in A \mid Y_0 = x)$.

Discretized Langevin diffusion

Idea: Sample the diffusion paths, using the Euler-Maruyama (EM) scheme:

$$X_{k+1} = X_k - \gamma_{k+1} \nabla U(X_k) + \sqrt{2\gamma_{k+1}}\, Z_{k+1} ,$$

where

- $(Z_k)_{k \ge 1}$ is i.i.d. $N(0, I_d)$,
- $(\gamma_k)_{k \ge 1}$ is a sequence of stepsizes, which can either be held constant or be chosen to decrease to 0 at a certain rate.

Closely related to the gradient descent algorithm.

This algorithm is referred to as the Unadjusted Langevin Algorithm (ULA) in Bayesian statistics or Langevin Monte Carlo (LMC) in machine learning.
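A minimal sketch of the ULA recursion with a constant stepsize is given below. The function name `ula`, the toy standard Gaussian target and the numerical values (stepsize, number of iterations, burn-in) are assumptions chosen for illustration only.

```python
import math
import random

def ula(grad_U, x0, step, n_iter, rng):
    """Run ULA: X_{k+1} = X_k - step * grad_U(X_k) + sqrt(2 * step) * Z_{k+1}."""
    x = list(x0)
    path = []
    for _ in range(n_iter):
        g = grad_U(x)
        x = [xi - step * gi + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
        path.append(x)
    return path

# Toy target: standard Gaussian, U(x) = ||x||^2 / 2, so grad U(x) = x.
rng = random.Random(0)
path = ula(lambda x: x, x0=[3.0, -3.0], step=0.05, n_iter=20000, rng=rng)
samples = path[2000:]  # discard a burn-in period
mean0 = sum(s[0] for s in samples) / len(samples)
var0 = sum((s[0] - mean0) ** 2 for s in samples) / len(samples)
```

Removing the Gaussian term recovers plain gradient descent on U. With the noise, the empirical mean and variance of the first coordinate should be close to 0 and 1, up to Monte Carlo error and a small discretization bias.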

Discretized Langevin diffusion: constant stepsize

When the stepsize is held constant, i.e. $\gamma_k = \gamma$, $(X_k)_{k \ge 1}$ is a homogeneous Markov chain with Markov kernel $R_\gamma$.

Under some appropriate conditions, $R_\gamma$ is irreducible and positive recurrent, with a unique invariant distribution $\pi_\gamma$ which does not coincide with the target distribution π.

- For a given precision ε > 0, how to choose the stepsize γ > 0 and the number of iterations n so that $\|\delta_x R_\gamma^n - \pi\|_{TV} \le \varepsilon$?
- How to select the starting point x cleverly?
- How to quantify the distance between $\pi_\gamma$ and π?

Discretized Langevin diffusion: decreasing stepsize

When $(\gamma_k)_{k \ge 1}$ is nonincreasing and non-constant, $(X_k)_{k \ge 1}$ is an inhomogeneous Markov chain associated with the kernels $(R_{\gamma_k})_{k \ge 1}$.

Notation: $Q_\gamma^p$ is the composition of Markov kernels

$$Q_\gamma^p = R_{\gamma_1} R_{\gamma_2} \cdots R_{\gamma_p}$$

With this notation, $E_x[f(X_p)] = \delta_x Q_\gamma^p f$.


- Convergence: can the step sizes be selected so that $\|\delta_x Q_\gamma^p - \pi\|_{TV} \to 0$?
- Optimal choice of simulation parameters: how to select the number of iterations to achieve $\|\delta_x Q_\gamma^p - \pi\|_{TV} \le \varepsilon$ starting from a given point x?
- Should we use fixed or decreasing step sizes?

Some (very incomplete) references: Early references

(i) Statistical physics: Parisi, 1981, Correlation function and Computer Simulations, Nuclear Physics.

(ii) Bayesian statistics: Grenander and Miller (in discussion Besag, Representation of knowledge in Complex Systems, JRSS B). First theoretical results given by Roberts and Tweedie, 1996, Exponential Convergence of Langevin Distributions and Their Discrete Approximations, Bernoulli; Stramer and Tweedie, Langevin-type models. I. Diffusions with given stationary distributions and their discretizations, MCAP, 1999.

(iii) Most of these results are qualitative (e.g. conditions under which the sampler is geometrically ergodic).

IMS 2018 14 / 84


Some (very incomplete) references: Euler discretisation

(i) Talay, D. and Tubaro, L. Expansion of the global error for numerical schemes solving stochastic differential equations, Stochastic Anal. Appl., 483–509 (1991)

(ii) Lamberton, D.; Pages, G. Recursive computation of the invariant distribution of a diffusion: the case of a weakly mean reverting drift. Stoch. Dyn. 3 (2003), no. 4, 435–451.

(iii) Lemaire, V. Behavior of the Euler scheme with decreasing step in a degenerate situation. ESAIM Probab. Stat. 11 (2007), 236–247.

(iv) Lemaire, V.; Menozzi, S. On some non asymptotic bounds for the Euler scheme. Electron. J. Probab. 15 (2010), no. 53, 1645–1681.

3 The Stochastic Gradient Langevin dynamics (SGLD)

SGLD Algorithm

Objective: posterior inference on large-scale datasets (many machine learning applications).

The target π is the posterior distribution in a Bayesian inference problem with prior density π0(x) and a large number N ≫ 1 of i.i.d. observations $D_i$ with likelihoods $L(D_i \mid x)$:

$$\pi(x) \propto \pi_0(x) \prod_{i=1}^N L(D_i \mid x) .$$

The cost of one iteration is Nd, which is prohibitively large for massive datasets.


Negated log-likelihood for a single observation: $U_i(x) = -\log L(D_i \mid x)$ for $i \in \{1, \dots, N\}$.

Negated log-prior: $U_0(x) = -\log \pi_0(x)$.

Negated log-posterior (up to an additive constant): $U = \sum_{i=0}^N U_i$.

SGLD Algorithm

Welling and Teh (2011) suggested to replace ∇U with an unbiased estimate

$$\nabla U_0 + (N/p) \sum_{i \in S} \nabla U_i ,$$

where S is a minibatch of size p drawn with replacement from $\{1, \dots, N\}$.

A single update of SGLD is then given by

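$$X_{k+1} = X_k - \gamma \Big( \nabla U_0(X_k) + \frac{N}{p} \sum_{i \in S_{k+1}} \nabla U_i(X_k) \Big) + \sqrt{2\gamma}\, Z_{k+1} ,$$

where $S_{k+1}$ is the minibatch drawn at iteration k + 1.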

The idea of using only a fraction of the observations to compute an unbiased estimate of the gradient at each iteration comes from Stochastic Gradient Descent (SGD), which is a popular algorithm in machine learning to minimize the potential U.
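The SGLD update can be sketched in a few lines. The toy model below (one-dimensional Gaussian mean with prior N(0, 1) and unit observation noise) and all names and numerical values (`sgld_step`, the stepsize, the batch size) are assumptions chosen so that the exact posterior mean is available for comparison.

```python
import math
import random

def sgld_step(x, data, grad_U0, grad_Ui, step, batch_size, rng):
    """One SGLD update with a minibatch drawn with replacement."""
    N = len(data)
    batch = [data[rng.randrange(N)] for _ in range(batch_size)]
    # Unbiased estimate of grad U = grad U_0 + sum_{i=1}^N grad U_i.
    grad_est = grad_U0(x) + (N / batch_size) * sum(grad_Ui(x, d) for d in batch)
    return x - step * grad_est + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)

rng = random.Random(1)
data = [2.0 + rng.gauss(0.0, 1.0) for _ in range(1000)]
grad_U0 = lambda x: x            # prior N(0, 1)
grad_Ui = lambda x, d: x - d     # likelihood N(x, 1)

x, chain = 0.0, []
for k in range(5000):
    x = sgld_step(x, data, grad_U0, grad_Ui, step=1e-4, batch_size=10, rng=rng)
    chain.append(x)
post_mean = sum(data) / (len(data) + 1)   # exact posterior mean for this model
est_mean = sum(chain[1000:]) / len(chain[1000:])
```

Each step touches only `batch_size` observations instead of all N. In this linear-Gaussian example the chain average should be close to the exact posterior mean, although the minibatch noise inflates the spread of the chain relative to ULA.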

SGLD and control variates

SGLD scales to large datasets by using noisy gradients calculated using a mini-batch or subset of the dataset.

However, the high variance inherent in these noisy gradients degrades performance and leads to slower mixing.

Idea: Use variance reduction for the unbiased estimator of the gradient.

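The Stochastic Variance Reduced Gradient Langevin Dynamics (SVR-GLD) (Dubey et al., 2016):

$$X_{k+1} = X_k - \gamma \Big( \nabla U_0(X_k) - g_{0,k} + \frac{N}{p} \sum_{i \in S_{k+1}} \{ \nabla U_i(X_k) - g_{i,k} \} \Big) + \sqrt{2\gamma}\, Z_{k+1} ,$$

where the $g_{i,k}$ are updated every m iterations, i.e.

$$g_{i,k} = \begin{cases} g_{i,k-1} & \text{if } k \ne 0 \bmod m , \\ \nabla U_i(X_k) & \text{otherwise.} \end{cases}$$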

Some (very incomplete) references: SGLD, SVR-GLD

(i) Welling, M. and Teh, Y.W., Bayesian learning via stochastic gradient Langevin dynamics, ICML, 2011

(ii) Vollmer, S.; Zygalakis, K.; Teh, Y. W. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics, J. Mach. Learn. Res. 17 (2016)

(iii) Teh, Y.W.; Thiery, A.; Vollmer, S., Consistency and fluctuations for stochastic gradient Langevin dynamics, J. Mach. Learn. Res. 17 (2016)

(iv) Dubey, A.; Reddi, S.; Williamson, S.; Poczos, B.; Smola, A.; Xing, E., Variance reduction in stochastic gradient Langevin dynamics, Advances in neural information processing systems

(v) Baker, J.; Fearnhead, P.; Fox, E. B.; Nemeth, C., Control variates for stochastic gradient MCMC (2017)

(vi) Dalalyan, A., Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, Proc. of the 2017 Conference on Learning Theory, volume 65

4 Strongly log-concave distribution

Strongly convex potential

Assumption: U is L-smooth and m-strongly convex:

‖∇U(x) − ∇U(y)‖ ≤ L‖x − y‖ ,

⟨∇U(x) − ∇U(y), x − y⟩ ≥ m‖x − y‖² .

Outline of the proof

1. Control in Wasserstein distance of the laws of the Langevin diffusion and its discretized version.

2. Relating Wasserstein distance result to total variation.

Key technique: (Synchronous and Reflection) coupling !

Wasserstein distance

Definition 1

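Let ξ, ξ′ be two probability measures on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. The Wasserstein distance of order 2 is given by

$$W_2^2(\xi, \xi') = \inf_{\zeta \in C(\xi, \xi')} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - x'\|^2\, \zeta(dx\, dx') = \inf_{\zeta \in C(\xi, \xi')} E[\|X - X'\|^2] ,$$

where $(X, X') \sim \zeta$ and $C(\xi, \xi')$ is the set of couplings of ξ and ξ′, i.e. probability measures on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals ξ and ξ′.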
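In dimension one the infimum over couplings is attained by the monotone (sorted) coupling, which gives a simple way to compute the distance. The sketch below relies on that standard fact and on the closed form for Gaussians; the function names are hypothetical.

```python
import math

def w2_empirical_1d(xs, ys):
    """W2 between two 1-D empirical measures with the same number of atoms.

    In one dimension the optimal coupling matches sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(xs))

def w2_gaussian_1d(m1, s1, m2, s2):
    """Closed-form W2 between N(m1, s1^2) and N(m2, s2^2)."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)
```

For example, translating a sample by a constant c gives a distance of exactly |c|, whatever the order in which the atoms are listed.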

Wasserstein distance convergence

Theorem 2

Assume that U is L-smooth and m-strongly convex.

Then, for all x, y ∈ R^d and t ≥ 0,

$$W_2(\delta_x P_t, \delta_y P_t) \le e^{-mt} \|x - y\| .$$

For all x ∈ R^d and t ≥ 0,

$$W_2(\delta_x P_t, \pi) \le e^{-mt} \{ \|x - x^\star\| + (d/m)^{1/2} \} ,$$

where $x^\star = \arg\min_{\mathbb{R}^d} U$.



Synchronous Coupling

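$$dY_t = -\nabla U(Y_t)\, dt + \sqrt{2}\, dB_t , \qquad d\tilde{Y}_t = -\nabla U(\tilde{Y}_t)\, dt + \sqrt{2}\, dB_t ,$$

where $(Y_0, \tilde{Y}_0) = (x, y)$ and both equations are driven by the same Brownian motion.

This SDE has a unique strong solution $(Y_t, \tilde{Y}_t)_{t \ge 0}$. Since

$$d(Y_t - \tilde{Y}_t) = -\{ \nabla U(Y_t) - \nabla U(\tilde{Y}_t) \}\, dt ,$$

the product rule for semimartingales implies

$$d\|Y_t - \tilde{Y}_t\|^2 = -2 \langle \nabla U(Y_t) - \nabla U(\tilde{Y}_t), Y_t - \tilde{Y}_t \rangle\, dt .$$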

Synchronous Coupling

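$$\|Y_t - \tilde{Y}_t\|^2 = \|Y_0 - \tilde{Y}_0\|^2 - 2 \int_0^t \langle \nabla U(Y_s) - \nabla U(\tilde{Y}_s), Y_s - \tilde{Y}_s \rangle\, ds .$$

Since U is strongly convex, $\langle \nabla U(y) - \nabla U(y'), y - y' \rangle \ge m \|y - y'\|^2$, which implies

$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2 - 2m \int_0^t \|Y_s - \tilde{Y}_s\|^2\, ds .$$

Grönwall's inequality then gives

$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2 e^{-2mt} .$$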
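The same synchronous coupling can be tried on the discretized chain: two ULA chains driven by the same Gaussian increments. The sketch below uses the quadratic potential U(x) = m x² / 2 in dimension one, an assumption made so that the distance between the two chains is known exactly, namely (1 − γm)^k |x − y| after k steps, which is bounded by e^{−mγk} |x − y|.

```python
import math
import random

def coupled_ula(grad_U, x, y, step, n_iter, rng):
    """Run two 1-D ULA chains driven by the same Gaussian increments."""
    dists = [abs(x - y)]
    for _ in range(n_iter):
        z = rng.gauss(0.0, 1.0)          # shared noise: synchronous coupling
        x = x - step * grad_U(x) + math.sqrt(2.0 * step) * z
        y = y - step * grad_U(y) + math.sqrt(2.0 * step) * z
        dists.append(abs(x - y))
    return dists

m = 2.0
dists = coupled_ula(lambda v: m * v, x=5.0, y=-1.0, step=0.01,
                    n_iter=200, rng=random.Random(0))
```

Because the noise cancels in the difference of the two chains, the contraction is deterministic even though each chain is random.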



Contraction property of the discretization

Theorem 3

Assume that U is L-smooth and m-strongly convex. Then,

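(i) Let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 2/(m + L)$. For all x, y ∈ R^d and ℓ ≥ n ≥ 1,

$$W_2(\delta_x Q_\gamma^{n,\ell}, \delta_y Q_\gamma^{n,\ell}) \le \Big\{ \prod_{k=n}^{\ell} (1 - \kappa \gamma_k) \Big\}^{1/2} \|x - y\| ,$$

where $\kappa = 2mL/(m + L)$ and $Q_\gamma^{n,\ell} = R_{\gamma_n} \cdots R_{\gamma_\ell}$.

(ii) For any γ ∈ (0, 2/(m + L)), for all x ∈ R^d and n ≥ 1,

$$W_2(\delta_x R_\gamma^n, \pi_\gamma) \le (1 - \kappa\gamma)^{n/2} \{ \|x - x^\star\| + (2\kappa^{-1} d)^{1/2} \} .$$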

A coupling proof (I)

Objective: compute a bound for $W_2(\delta_x Q_\gamma^n, \pi)$.

Since $\pi P_t = \pi$ for all t ≥ 0, it suffices to bound the Wasserstein distance

$$W_2(\delta_x Q_\gamma^n, \pi P_{\Gamma_n}) ,$$

where

- $\Gamma_n = \sum_{k=1}^n \gamma_k$,
- $\delta_x Q_\gamma^n$ is the law of the discretized diffusion,
- $\pi P_{\Gamma_n} = \pi$, where $(P_t)_{t \ge 0}$ is the semigroup of the diffusion.

Idea: synchronous coupling between the diffusion and the interpolation of the Euler discretization.

A coupling proof (II)

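Define, for all n ≥ 0 and $t \in [\Gamma_n, \Gamma_{n+1})$,

$$Y_t = Y_{\Gamma_n} - \int_{\Gamma_n}^t \nabla U(Y_s)\, ds + \sqrt{2}\,(B_t - B_{\Gamma_n}) ,$$

$$\bar{Y}_t = \bar{Y}_{\Gamma_n} - \int_{\Gamma_n}^t \nabla U(\bar{Y}_{\Gamma_n})\, ds + \sqrt{2}\,(B_t - B_{\Gamma_n}) ,$$

with $Y_0 \sim \pi$ and $\bar{Y}_0 = x$. Then $Y_{\Gamma_n} \sim \pi P_{\Gamma_n}$ and $\bar{Y}_{\Gamma_n} \sim \delta_x Q_\gamma^n$, so for all n ≥ 0,

$$W_2^2(\delta_x Q_\gamma^n, \pi P_{\Gamma_n}) \le E[\|Y_{\Gamma_n} - \bar{Y}_{\Gamma_n}\|^2] .$$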

Explicit bound in Wasserstein distance

Theorem 4

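Assume that U is m-strongly convex and L-smooth. Let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 1/(m + L)$. Then

$$W_2^2(\delta_x Q_\gamma^n, \pi) \le u_n^{(1)}(\gamma) \{ \|x - x^\star\|^2 + d/m \} + u_n^{(2)}(\gamma) ,$$

where $u_n^{(1)}(\gamma) = 2 \prod_{k=1}^n (1 - \kappa\gamma_k)$ with $\kappa = mL/(m + L)$ and

$$u_n^{(2)}(\gamma) = 2 \sum_{i=1}^n \Big[ \gamma_i^2\, c(m, L, \gamma_i) \prod_{k=i+1}^n (1 - \kappa\gamma_k) \Big] .$$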

Can be sharpened if U is three times continuously differentiable and there exists $\tilde{L}$ such that for all x, y ∈ R^d,

$$\|\nabla^2 U(x) - \nabla^2 U(y)\| \le \tilde{L} \|x - y\| .$$

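Fixed step size: for any ε > 0, one may choose γ so that

$$W_2^2(\delta_{x^\star} Q_\gamma^p, \pi) \le \varepsilon \quad \text{in } p = O(d \varepsilon^{-1}) \text{ iterations,}$$

where $x^\star$ is the unique maximum of π.

Decreasing step size with $\gamma_k = \gamma_1 k^{-\alpha}$, α ∈ (0, 1):

$$W_2^2(\delta_{x^\star} Q_\gamma^n, \pi) = d\, O(n^{-\alpha}) .$$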

Our results are tight (check with U(x) = (1/2)‖x‖2).
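For the Gaussian check mentioned above, everything is explicit: with U(x) = ‖x‖²/2 and a constant stepsize γ, the ULA chain is a Gaussian autoregression whose invariant law is N(0, v·I_d) with v = 1/(1 − γ/2). The sketch below computes the resulting Wasserstein bias between that invariant law and the target; the function name is hypothetical.

```python
import math

def ula_bias_w2_gaussian(d, step):
    """W2 distance between N(0, I_d) and the ULA invariant law for U(x) = ||x||^2 / 2."""
    if not 0.0 < step < 2.0:
        raise ValueError("the chain is stable only for 0 < step < 2")
    # X_{k+1} = (1 - step) X_k + sqrt(2 step) Z_{k+1} has invariant law N(0, v I_d)
    # with v = (1 - step)^2 v + 2 step, i.e. v = 1 / (1 - step / 2).
    v = 1.0 / (1.0 - step / 2.0)
    # W2 between two centred isotropic Gaussians: sqrt(d) * |sqrt(v) - 1|.
    return math.sqrt(d) * (math.sqrt(v) - 1.0)
```

For small γ this behaves like √d · γ/4, so in this example the bias grows with the square root of the dimension and vanishes linearly in the stepsize.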

From the Wasserstein distance to the TV

Theorem 5

Assume that U is m-strongly convex.

(i) For all x, y ∈ R^d,

$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{TV} \le 1 - 2\Phi\Big\{ -\frac{\|x - y\|}{\sqrt{(4/m)(e^{2mt} - 1)}} \Big\} ,$$

where Φ is the standard Gaussian cumulative distribution function.

(ii) For any µ, ν and t > 0,

$$\|\mu P_t - \nu P_t\|_{TV} \le \frac{2^{1/2}\, W_1(\mu, \nu)}{\{\pi\,(4/m)(e^{2mt} - 1)\}^{1/2}} .$$

Use reflection coupling (Lindvall and Rogers, 1986)

Hints of Proof I

$$dX_t = -\nabla U(X_t)\, dt + \sqrt{2}\, dB_t^d ,$$

$$dY_t = -\nabla U(Y_t)\, dt + \sqrt{2}\,(\mathrm{Id} - 2 e_t e_t^T)\, dB_t^d , \quad \text{where } e_t = e(X_t - Y_t) ,$$

with $X_0 = x$, $Y_0 = y$, $e(z) = z/\|z\|$ for z ≠ 0 and e(0) = 0. Define the coupling time $T_c = \inf\{ s \ge 0 : X_s = Y_s \}$. By construction $X_t = Y_t$ for $t \ge T_c$.

$$\tilde{B}_t^d = \int_0^t (\mathrm{Id} - 2 e_s e_s^T)\, dB_s^d$$

is a d-dimensional Brownian motion, therefore $(X_t)_{t \ge 0}$ and $(Y_t)_{t \ge 0}$ are weak solutions to Langevin diffusions started at x and y, respectively. Then by Lindvall's inequality, for all t > 0 we have

$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{TV} \le P(X_t \ne Y_t) .$$

Hints of Proof II

For t < Tc (before the coupling time)

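$$d(X_t - Y_t) = -\{ \nabla U(X_t) - \nabla U(Y_t) \}\, dt + 2\sqrt{2}\, e_t\, dB_t^1 ,$$

where $B_t^1 = \int_0^t \langle e_s, dB_s^d \rangle$ is a one-dimensional Brownian motion.

Using Itô's formula,

$$\|X_t - Y_t\| = \|x - y\| - \int_0^t \langle \nabla U(X_s) - \nabla U(Y_s), e_s \rangle\, ds + 2\sqrt{2}\, B_t^1 \le \|x - y\| - m \int_0^t \|X_s - Y_s\|\, ds + 2\sqrt{2}\, B_t^1 ,$$

and Grönwall's inequality implies

$$\|X_t - Y_t\| \le e^{-mt} \|x - y\| + 2\sqrt{2}\, B_t^1 - 2\sqrt{2}\, m \int_0^t B_s^1 e^{-m(t-s)}\, ds .$$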

Hint of Proof III

Therefore, by integration by parts, $\|X_t - Y_t\| \le U_t$ for $t \in (0, T_c)$, where $(U_t)_{t \ge 0}$ is the one-dimensional Ornstein-Uhlenbeck process defined by

$$U_t = e^{-mt} \|x - y\| + 2\sqrt{2} \int_0^t e^{-m(t-s)}\, dB_s^1 .$$

Therefore, for all x, y ∈ R^d and t ≥ 0, we get

$$P(T_c > t) \le P\Big( \min_{0 \le s \le t} U_s > 0 \Big) .$$

Finally, the proof follows from the tail of the hitting time of the (one-dimensional) OU process (see Borodin and Salminen, 2002).

From the Wasserstein distance to the TV (II)

$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{TV} \le \frac{\|x - y\|}{\sqrt{(2\pi/m)(e^{2mt} - 1)}}$$

1. $(P_t)_{t \ge 0}$ converges exponentially fast to π in total variation at a rate $e^{-mt}$.

2. For all measurable f : R^d → R with sup |f| ≤ 1, the function $x \mapsto P_t f(x)$ is Lipschitz, with Lipschitz constant of order $\{(2\pi/m)(e^{2mt} - 1)\}^{-1/2}$.

Explicit bound in total variation

Theorem 6

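Assume U is L-smooth and m-strongly convex. Let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 1/(m + L)$.

(Optional assumption) $U \in C^3(\mathbb{R}^d)$ and there exists $\tilde{L}$ such that for all x, y ∈ R^d:

$$\|\nabla^2 U(x) - \nabla^2 U(y)\| \le \tilde{L} \|x - y\| .$$

Then there exist sequences $\{u_n^{(1)}(\gamma), n \in \mathbb{N}\}$ and $\{u_n^{(2)}(\gamma), n \in \mathbb{N}\}$ such that for all x ∈ R^d and n ≥ 1,

$$\|\delta_x Q_\gamma^n - \pi\|_{TV} \le u_n^{(1)}(\gamma) \{ \|x - x^\star\|^2 + d/m \} + u_n^{(2)}(\gamma) .$$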

Constant step sizes

For any ε > 0, the minimal number of iterations to achieve $\|\delta_x R_\gamma^p - \pi\|_{TV} \le \varepsilon$ is

$$p = O(d \log(d)\, \varepsilon^{-1} |\log(\varepsilon)|) .$$

For a given stepsize γ, letting p → +∞, we get:

‖πγ − π‖TV ≤ Cγ |log(γ)| .

5 Convex and Super-exponential densities

Convergence of the Euler discretization

Assume one of the following conditions:

There exist α > 1, ρ > 0 and Mρ ≥ 0 such that for all y ∈ Rd , ‖y‖ ≥ Mρ:

〈∇U(y), y〉 ≥ ρ ‖y‖α .

U is convex.


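If $\lim_{k \to +\infty} \gamma_k = 0$ and $\sum_k \gamma_k = +\infty$, then

$$\lim_{p \to +\infty} \|\delta_x Q_\gamma^p - \pi\|_{TV} = 0 .$$

For a constant stepsize γ,

$$\|\pi_\gamma - \pi\|_{TV} \le C \sqrt{\gamma} .$$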

Target precision ε: the convex case

Setting: U is convex, constant stepsize.

Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV:

$\|\delta_x R_\gamma^p - \pi\|_{TV} \le \varepsilon$ .

        d           ε                         L
γ       O(d^-3)     O(ε^2 / log(ε^-1))        O(L^-2)
p       O(d^5)      O(ε^-2 log^2(ε^-1))       O(L^2)

Table: In the strongly convex case, d !

Strongly convex outside a ball potential

U is convex everywhere and strongly convex outside a ball, i.e. there exist R ≥ 0 and m > 0 such that for all x, y ∈ R^d with ‖x − y‖ ≥ R,

⟨∇U(x) − ∇U(y), x − y⟩ ≥ m‖x − y‖² .

Eberle (2015) established that the convergence in the Wasserstein distance does not depend on the dimension.

Durmus, Moulines (2016) established that the convergence of the semigroup in TV to π does not depend on the dimension but only on R: new bounds which scale nicely in the dimension.



Dependence on the dimension

Setting: U is convex and strongly convex outside a ball, constant stepsize.

Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV:

$\|\delta_x R_\gamma^p - \pi\|_{TV} \le \varepsilon$ .

        d              ε                         L           m           R
γ       O(d^-1)        O(ε^2 / log(ε^-1))        O(L^-2)     O(m)        O(R^-4)
p       O(d log(d))    O(ε^-2 log^2(ε^-1))       O(L^2)      O(m^-2)     O(R^8)

Table: Of course, there is a price to pay in the dependence in R

Target precision ε: the convex case

Setting: U is convex, constant stepsize.

Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV:

$\|\delta_{x^\star} R_\gamma^p - \pi\|_{TV} \le \varepsilon$ ,

with starting point $x^\star \in \arg\min_{\mathbb{R}^d} U$.

        d           ε                         L
γ       O(d^-3)     O(ε^2 / log(ε^-1))        O(L^-2)
p       O(d^5)      O(ε^-2 log^2(ε^-1))       O(L^2)


Dependence on the dimension

Setting: U is convex and strongly convex outside a ball, constant stepsize.

Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV:

$\|\delta_{x^\star} R_\gamma^p - \pi\|_{TV} \le \varepsilon$ ,

with starting point $x^\star \in \arg\min_{\mathbb{R}^d} U$.

        d              ε                         L           m           R
γ       O(d^-1)        O(ε^2 / log(ε^-1))        O(L^-2)     O(m)        O(R^-4)
p       O(d log(d))    O(ε^-2 log^2(ε^-1))       O(L^2)      O(m^-2)     O(R^8)

6 Non-smooth potentials

Non-smooth potential

Problem: U is convex but not continuously differentiable? More precisely, assume that

π ∝ e−U , U = f + g ,

where f is convex and smooth and g is convex but not smooth.

Applications :

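(i) LASSO: $g(x) = \sum_{i=1}^d |x_i|$; fused-LASSO models: $g(x) = \sum_{i=2}^d |x_i - x_{i-1}|$.

(ii) Constrained domain: $g(x) = \iota_{\{\sum_{i=1}^d |x_i| \le \kappa\}}(x)$, with

$$\iota_K(x) = \begin{cases} +\infty & \text{if } x \notin K , \\ 0 & \text{if } x \in K . \end{cases}$$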

Non-smooth potential

Idea: apply the EM discretization (Durmus, M., Pereyra, 2018) to an appropriately regularized version of g, such that

(i) the convexity of U is preserved,
(ii) the regularisation of U is continuously differentiable and gradient Lipschitz,
(iii) the resulting approximation is close to π (e.g. in total variation norm).

Moreau-Yosida regularization

Assume that g : R^d → (−∞, +∞] is a l.s.c. convex function and let λ > 0.

The λ-Moreau-Yosida envelope $g^\lambda : \mathbb{R}^d \to \mathbb{R}$ is defined for all x ∈ R^d by

$$g^\lambda(x) = \inf_{y \in \mathbb{R}^d} \{ g(y) + (2\lambda)^{-1} \|x - y\|^2 \} \le g(x) .$$

For every x ∈ R^d, the minimum is achieved at a unique point, $\mathrm{prox}_g^\lambda(x)$, which is characterized by the inclusion

$$x - \mathrm{prox}_g^\lambda(x) \in \lambda\, \partial g(\mathrm{prox}_g^\lambda(x)) .$$

The Moreau-Yosida envelope is a regularized version of g, which approximates g from below.

Properties of proximal operators

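As λ ↓ 0, $g^\lambda$ converges pointwise to g, i.e. for all x ∈ R^d,

$$g^\lambda(x) \uparrow g(x) , \quad \text{as } \lambda \downarrow 0 .$$

The function $g^\lambda$ is convex and continuously differentiable, with

$$\nabla g^\lambda(x) = \lambda^{-1} (x - \mathrm{prox}_g^\lambda(x)) .$$

The proximal operator is a monotone operator: for all x, y ∈ R^d,

$$\langle \mathrm{prox}_g^\lambda(x) - \mathrm{prox}_g^\lambda(y), x - y \rangle \ge 0 ,$$

which implies that the Moreau-Yosida envelope is L-smooth with $L = \lambda^{-1}$: for all x, y ∈ R^d,

$$\|\nabla g^\lambda(x) - \nabla g^\lambda(y)\| \le \lambda^{-1} \|x - y\| .$$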
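These properties can be checked by hand on the scalar example g(y) = |y|, for which the proximal operator is the soft-thresholding map and the envelope is the Huber function. The sketch below uses that standard closed form; the function names are hypothetical.

```python
def prox_abs(x, lam):
    """Proximal operator of g(y) = |y| with parameter lam (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def moreau_env_abs(x, lam):
    """Moreau-Yosida envelope g^lam(x) = g(p) + (x - p)^2 / (2 lam), p = prox(x)."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)

def grad_moreau_env_abs(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam, always in [-1, 1]."""
    return (x - prox_abs(x, lam)) / lam
```

The envelope equals x²/(2λ) for |x| ≤ λ and |x| − λ/2 otherwise, so it lies below |x| and its gradient is a clipped linear function with slope 1/λ.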

Moreau-Yosida regularization

If g is not differentiable, but the proximal operator associated with g is available, its λ-Moreau-Yosida envelope $g^\lambda$ can be considered.

This leads to the approximation of the potential $U^\lambda : \mathbb{R}^d \to \mathbb{R}$ defined for all x ∈ R^d by

Uλ(x) = f (x) + gλ(x) .

Question: Does it make some sense to use Uλ for targeting π ∝ e−U ?

π ∝ e−U , U = f + g

f : Rd → R and g : Rd → (−∞,+∞] are convex

f is continuously differentiable and gradient Lipschitz with Lipschitz constant $L_f$, i.e. for all x, y ∈ R^d,

‖∇f(x) − ∇f(y)‖ ≤ Lf‖x − y‖ .

g is lower semi-continuous and $\int_{\mathbb{R}^d} e^{-g(y)}\, dy \in (0, +\infty)$.

Properties of proximal operators and consequences

The function $g^\lambda$ is convex and continuously differentiable.

$g^\lambda$ is gradient-Lipschitz: for all x, y ∈ R^d, $\|\nabla g^\lambda(x) - \nabla g^\lambda(y)\| \le \lambda^{-1} \|x - y\|$.

Consequence: the function $U^\lambda$ is convex, continuously differentiable and gradient-Lipschitz.

Properties of proximal operators and consequences

Question: can $U^\lambda$ be used instead of U in ULA (or any MCMC algorithm that uses the gradient of U)?

$U^\lambda$ defines a regularized distribution $\pi^\lambda$:

$$\pi^\lambda \propto e^{-U^\lambda} , \quad U^\lambda(x) = f(x) + g^\lambda(x) .$$

A minimal requirement is that $\pi^\lambda \propto e^{-U^\lambda}$ is a probability density function!

Under H1, for all λ > 0, $0 < \int_{\mathbb{R}^d} e^{-U^\lambda(y)}\, dy < +\infty$.



Some approximation results

$U^\lambda$ defines a regularized version $\pi^\lambda$ of π:

$$\pi^\lambda \propto e^{-U^\lambda} , \quad U^\lambda(x) = f(x) + g^\lambda(x) .$$

Question: we know that $\lim_{\lambda \downarrow 0} U^\lambda(x) = U(x)$. Is this sufficient to guarantee that $\pi^\lambda$ is a sensible approximation of π?

Theorem 8 (Durmus, Moulines, Pereyra, SIAM J. Imaging Sci., 2018)

Assume H1.

1. Then, $\lim_{\lambda \to 0} \|\pi^\lambda - \pi\|_{TV} = 0$.

2. Assume in addition that g is Lipschitz. Then for all λ > 0,

$$\|\pi^\lambda - \pi\|_{TV} \le \lambda \|g\|_{Lip}^2 .$$

Moreau-Yosida approximations

Figure: True densities (solid blue) and approximations (dashed red), for p(x) ∝ exp(−|x|), p(x) ∝ exp(−x⁴) and p(x) ∝ 1_[−0.5,0.5](x).

The MYULA algorithm

Main idea: target $\pi^\lambda \propto e^{-U^\lambda}$ instead of $\pi \propto e^{-U}$ using ULA.

- $\pi^\lambda$ is a "good" approximation of π provided that the regularization parameter λ is small enough.
- $U^\lambda$ is continuously differentiable, gradient Lipschitz and convex.

Given a regularization parameter λ > 0 and a stepsize γ > 0, the ULA applied to $\pi^\lambda$ yields

$$X_{k+1}^M = X_k^M - \gamma \{ \nabla f(X_k^M) + \lambda^{-1} (X_k^M - \mathrm{prox}_g^\lambda(X_k^M)) \} + \sqrt{2\gamma}\, Z_{k+1} ,$$

where $\{Z_k, k \in \mathbb{N}^*\}$ is a sequence of i.i.d. d-dimensional standard Gaussian random variables.
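A single MYULA update only needs the gradient of f and the proximal operator of g. The sketch below is one possible coding of that step; the toy choice f(x) = ‖x − y‖²/2 with g(x) = β‖x‖₁, and the names `myula_step`, `soft_threshold`, `y_obs` and `beta`, are assumptions made for illustration.

```python
import math
import random

def soft_threshold(v, t):
    """Proximal operator of t * |.| evaluated at v."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def myula_step(x, grad_f, prox_g, step, lam, rng):
    """One MYULA update for U = f + g, using the prox of g with parameter lam."""
    g = grad_f(x)
    p = prox_g(x, lam)
    # Drift is -step * grad U^lam(x) = -step * (grad f(x) + (x - prox(x)) / lam).
    return [xi - step * (gi + (xi - pi) / lam)
            + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
            for xi, gi, pi in zip(x, g, p)]

# Toy example (assumed): f(x) = ||x - y_obs||^2 / 2 and g(x) = beta * ||x||_1.
y_obs, beta = [1.5, 0.0, -0.2], 1.0
grad_f = lambda x: [xi - yi for xi, yi in zip(x, y_obs)]
prox_g = lambda x, lam: [soft_threshold(xi, lam * beta) for xi in x]

rng = random.Random(0)
x = [0.0, 0.0, 0.0]
for _ in range(1000):
    x = myula_step(x, grad_f, prox_g, step=0.05, lam=0.5, rng=rng)
```

The iterates are not sparse themselves: the ℓ1 term only enters through the smooth gradient of its envelope, which is what makes the plain ULA analysis applicable.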

Microscopy dataset

Figure: Microscopy dataset, field of size 4µm × 4µm containing 100 molecules: (a) original observation, (b) MAP.

Objective: recover a high-resolution image x ∈ R^n from a blurred and noisy observation y = Hx + w, where H is a circulant blurring matrix and $w \sim N(0, \sigma^2 I_n)$. This inverse problem is ill-conditioned, but this difficulty can be mitigated by exploiting prior knowledge.

Microscopy dataset

The goal is to recover the image x ∈ R^n from an incomplete and noisy set of Fourier measurements y = AFx + w, where F is the discrete Fourier transform operator, A is a tomographic sampling mask, and $w \sim N(0, \sigma^2 I_n)$.

This inverse problem is ill-posed, resulting in significant uncertainty about the true value of x.

Idea: use a LASSO (or horseshoe) prior to promote sparse images (nearly-black objects). The resulting posterior is

$$\pi(x) \propto \exp[ -( \|y - AFx\|^2 / 2\sigma^2 + \beta \|x\|_1 ) ] ,$$

with fixed hyper-parameters σ > 0 and β > 0.

This density is log-concave and MAP estimation can be performed efficiently by proximal convex optimisation.

Point estimators such as $x_{MAP}$ deliver accurate results but do not provide information about the posterior uncertainty of x or ϕ(x), where ϕ is a function.

This is formalised in the Bayesian decision theory framework by computing highest posterior density (HPD) regions.

Problem: compute posterior credibility sets for ϕ(x).



Comparison with PMALA

Figure: Microscopy experiment: (a) HPD region thresholds $\eta_\alpha$ for MYULA ($2 \times 10^6$ iterations, λ = 1, γ = 0.6) and PMALA ($2 \times 10^7$ iterations), (b) relative approximation error of MYULA.



Sparse image deblurring

Figure: Live-cell microscopy data (Zhu2012), with panels y, $x_{MAP}$ and uncertainty estimates. Uncertainty analysis (±78nm × ±125nm).

Computing time 4 minutes. $M = 10^5$ iterations. Estimation error 0.2%.

7 Logconcave densities with constrained domains

Densities with convex support

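π has bounded support, i.e. log π = −∞ outside some domain: Supp(π) = K, where K ⊂ R^d is a convex body:

$$\pi \propto e^{-U} , \quad U = f + \iota_K , \quad \iota_K(x) = \begin{cases} +\infty & \text{if } x \notin K , \\ 0 & \text{if } x \in K , \end{cases}$$

where f is smooth.

The Moreau-Yosida envelope of $\iota_K$ is given for the regularization parameter λ > 0 by

$$\iota_K^\lambda(x) = \inf_{y \in \mathbb{R}^d} ( \iota_K(y) + (2\lambda)^{-1} \|x - y\|^2 ) = (2\lambda)^{-1} \|x - \mathrm{proj}_K(x)\|^2 .$$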
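When K is a Euclidean ball, the projection is explicit and the envelope and its gradient can be written down directly. The sketch below assumes K = B(0, radius), a choice made only for illustration; the function names are hypothetical.

```python
import math

def proj_ball(x, radius):
    """Euclidean projection of x onto the ball B(0, radius)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [radius * v / norm for v in x]

def indicator_envelope(x, radius, lam):
    """Moreau-Yosida envelope of the indicator of the ball: dist(x, K)^2 / (2 lam)."""
    p = proj_ball(x, radius)
    return sum((a - b) ** 2 for a, b in zip(x, p)) / (2.0 * lam)

def grad_indicator_envelope(x, radius, lam):
    """Gradient of the envelope: (x - proj_K(x)) / lam, zero inside K."""
    p = proj_ball(x, radius)
    return [(a - b) / lam for a, b in zip(x, p)]
```

The hard constraint is thus replaced by a quadratic penalty on the distance to K, which vanishes inside K and pulls the chain back towards K with strength 1/λ outside.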


Previous works

Previous work for the Metropolis algorithm and the hit-and-run algorithm: Applegate, Kannan, Dyer, Frieze, Polson, Simonovits, Lovasz, Vempala...

Our approach is more in the spirit of Bubeck, Eldan, and Lehec (2015), based on the Projected Langevin Monte Carlo.

Main results - Assumptions


H2. f is convex, continuously differentiable on R^d and gradient Lipschitz with Lipschitz constant $L_f$, i.e. for all x, y ∈ R^d, ‖∇f(x) − ∇f(y)‖ ≤ Lf‖x − y‖.

H3. There exist r, R > 0, r ≤ R, such that B(0, r) ⊂ K ⊂ B(0, R).

Main results - Statement

Theorem 9 (Brosse, Durmus, Moulines, Pereyra, 2017)

Assume H2 and H3. For all ε > 0 and x ∈ R^d, there exist (explicit) λ > 0 and γ > 0 such that

$$\|\delta_x R_{\gamma,\lambda}^n - \pi\|_{TV} \le \varepsilon \quad \text{for } n = \Omega(d^5) ,$$

where $R_{\gamma,\lambda}$ is the Markov kernel associated to $(X_k^\lambda)_{k \ge 0}$.

Similar bounds hold for the Wasserstein distance.

Comparison with existing results

Lovasz (2007) shows that the complexity of the RWM and the hit-and-run algorithm is of order $d^4$. However, this result requires that K is well-rounded (which is difficult to check).

Bubeck et al. (2015) study the complexity of the Projected Langevin Monte Carlo algorithm (PLMC):

$$X_{k+1} = \mathrm{proj}_K( X_k - \gamma \nabla f(X_k) + \sqrt{2\gamma}\, Z_{k+1} ) .$$

Note that, contrary to MYULA, the iterates of PLMC always belong to K. Under similar assumptions, Bubeck et al. (2015) get explicit bounds in total variation for PLMC:

                                         d → +∞     ε → 0       R → +∞     r → 0
Bubeck et al. (2015), π uniform on K     O(d^7)     O(ε^-8)     O(R^6)     O(r^-6)
Bubeck et al. (2015), π log-concave      O(d^12)    O(ε^-12)    O(R^18)    O(r^-18)
MYULA                                    O(d^5)     O(ε^-6)     O(R^4)     O(r^-4)

Table: Complexity to achieve $\|\delta_{x^\star} R_{\gamma,\lambda}^n - \pi\|_{TV} \le \varepsilon$


Application to regression with ℓ1 constraints

1. For all s > 0, consider the density $\pi_s \propto e^{-U_s}$, where

$$U_s(x) = \frac{\|Y - Zx\|^2}{2\sigma^2} + \iota_{K_s}(x) , \quad K_s = \{ x \in \mathbb{R}^d : \|x\|_1 \le s \} .$$

2. Dual problem of LASSO regression in optimization.

3. We compute, for all $i \in \{1, \dots, d\}$, the median of $x_i$ for different values of s on the diabetes data set (n = 442, d = 10).

4. Compute the LASSO regularization paths.

Application to regression with ℓ1 constraints

Figure: Lasso paths for (a) MYULA, (b) Wall HMC (Neal 2010). Horizontal axis of both panels: shrinking factor, from 0 to 1.

8 Normalizing constants of log-concave densities

Normalizing constants

Let U : R^d → R. We aim at estimating $Z = \int_{\mathbb{R}^d} e^{-U(x)}\, dx < +\infty$.

Z is the normalizing constant of the probability density π associated with the potential U.

Many applications in Bayesian inference (Bayes factors) and statistical physics (free energy).

In Bayesian inference, models can be compared through Bayes factors, each of which is the ratio of two normalizing constants.

Many methods... But few theoretical guarantees.



Multistage sampling

Idea: decompose the original problem into a sequence of problems which are easier to solve.

Multistage sampling method (Gelman, 1998):

$$Z = Z_0 \prod_{i=0}^{M-1} \frac{Z_{i+1}}{Z_i} , \quad Z_M = Z ,$$

where

1. $M \in \mathbb{N}^\star$ is the number of stages,
2. $Z_0$ is the initial normalizing constant (should be easy to compute),
3. $Z_{i+1}/Z_i$ are the ratios of normalising constants (that should also be easy to estimate).

A Gaussian annealing algorithm

$M \in \mathbb{N}^\star$: number of stages.

Let $\{\sigma_i^2\}_{i=0}^M$ be an increasing sequence of positive numbers and set $\sigma_M^2 = +\infty$.

Consider the sequence of functions $\{U_i\}_{i=0}^M$ defined for all $i \in \{0, \dots, M\}$ and x ∈ R^d by

$$U_i(x) = \frac{\|x\|^2}{2\sigma_i^2} + U(x) ,$$

with the convention 1/∞ = 0.

Note that $U_M = U$, since $\sigma_M^2 = +\infty$.

If $\sigma_0$ is small enough, then $U_0(x) \approx \|x\|^2 / (2\sigma_0^2)$.

A Gaussian annealing algorithm

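Define the sequence of probability densities $\{\pi_i\}_{i=0}^M$ for $i \in \{0, \dots, M\}$ and x ∈ R^d by

$$\pi_i(x) = Z_i^{-1} e^{-U_i(x)} , \quad Z_i = \int_{\mathbb{R}^d} e^{-U_i(y)}\, dy .$$

It defines $(Z_i)_{i=1}^M$ in the decomposition

$$Z = Z_0 \prod_{i=0}^{M-1} \frac{Z_{i+1}}{Z_i} .$$

For $i \in \{0, \dots, M-1\}$, we get

$$\frac{Z_{i+1}}{Z_i} = \int_{\mathbb{R}^d} g_i(x) \pi_i(x)\, dx = \pi_i(g_i) ,$$

where $g_i : \mathbb{R}^d \to \mathbb{R}_+$ is defined for any x ∈ R^d by

$$g_i(x) = \exp(a_i \|x\|^2) , \quad a_i = \frac{1}{2} \Big( \frac{1}{\sigma_i^2} - \frac{1}{\sigma_{i+1}^2} \Big) .$$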
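The expression of the ratio as an expectation under the density at stage i follows from the definition of the annealed potentials, which differ only through their quadratic term. A short worked computation, written with the same symbols as above:

```latex
\frac{Z_{i+1}}{Z_i}
  = \frac{1}{Z_i} \int_{\mathbb{R}^d} e^{-U_{i+1}(x)} \, dx
  = \int_{\mathbb{R}^d} e^{U_i(x) - U_{i+1}(x)} \, \pi_i(x) \, dx ,
\qquad
U_i(x) - U_{i+1}(x)
  = \frac{1}{2} \Big( \frac{1}{\sigma_i^2} - \frac{1}{\sigma_{i+1}^2} \Big) \|x\|^2
  = a_i \|x\|^2 .
```

Since the variances increase with i, the coefficient a_i is positive, so the integrand grows with ‖x‖ and successive variances must be close enough for it to have moderate variance under the density at stage i.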



Multistage methods

Multistage sampling type algorithms are widely used and known under different names: multistage sampling (Valleau, 1972), (extended) bridge sampling (Gelman, 1998), annealed importance sampling (AIS) (Neal, 2001), thermodynamic integration (Girolami, 2016), power posterior (Friel, 2012).

For the stability and accuracy of the method, the choice of the parameters is crucial and is known to be difficult.

Indeed, the issue has been pointed out in several articles under the names of tuning tempered transitions (Friel, 2012), temperature placement (Friel, 2014), annealing sequence, temperature ladder (Girolami, 2016), effects of grid size...

In Brosse, Durmus, Moulines (2018), we explicitly determine the sequence $\{\sigma_i^2\}_{i=0}^{M-1}$.

Multistage Langevin

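Compute, for all $i \in \{0, \dots, M-1\}$,

$$\frac{Z_{i+1}}{Z_i} = \int_{\mathbb{R}^d} g_i(x) \pi_i(x)\, dx = \pi_i(g_i) .$$

The quantity $\pi_i(g_i)$ is estimated by the Unadjusted Langevin Algorithm (ULA) targeting $\pi_i$.

For all $i \in \{0, \dots, M-1\}$, consider

$$X_{i,k+1} = X_{i,k} - \gamma_i \nabla U_i(X_{i,k}) + \sqrt{2\gamma_i}\, Z_{i,k+1} , \quad X_{i,0} = 0 .$$

For $i \in \{0, \dots, M-1\}$, consider the following estimator of $Z_{i+1}/Z_i$,

$$\hat{\pi}_i(g_i) = \frac{1}{n_i} \sum_{k=N_i+1}^{N_i+n_i} g_i(X_{i,k}) ,$$

where $n_i \ge 1$ is the sample size and $N_i \ge 0$ the burn-in period.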
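One stage of this procedure can be sketched as follows, in dimension one and for the toy potential U(x) = x²/2, for which the exact ratio of normalizing constants is known. The function name `ratio_estimate`, the two variances and the simulation parameters are assumptions chosen for illustration, not the tuned values of the paper.

```python
import math
import random

def ratio_estimate(grad_Ui, a_i, step, burn_in, n_samples, rng):
    """Estimate Z_{i+1}/Z_i = pi_i(g_i), g_i(x) = exp(a_i * x^2), with 1-D ULA."""
    x, total = 0.0, 0.0                     # X_{i,0} = 0
    for k in range(burn_in + n_samples):
        x = x - step * grad_Ui(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        if k >= burn_in:                    # average over X_{N_i+1}, ..., X_{N_i+n_i}
            total += math.exp(a_i * x * x)
    return total / n_samples

# Toy check with U(x) = x^2 / 2 and one stage sigma_i^2 = 0.5 -> sigma_{i+1}^2 = 1.
s2_i, s2_next = 0.5, 1.0
a_i = 0.5 * (1.0 / s2_i - 1.0 / s2_next)
grad_Ui = lambda x: x / s2_i + x            # gradient of x^2 / (2 s2_i) + U(x)
est = ratio_estimate(grad_Ui, a_i, step=0.01, burn_in=1000,
                     n_samples=100000, rng=random.Random(0))
# Both stage densities are Gaussian here, so the ratio is explicit.
exact = math.sqrt((1.0 / s2_i + 1.0) / (1.0 / s2_next + 1.0))
```

The estimate carries both Monte Carlo error and the ULA discretization bias, which is why the stepsizes, sample sizes and burn-in periods all appear among the simulation parameters to be chosen.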

ULA algorithm


$\hat{Z}$ is the following estimator of Z:

$$\hat{Z} = (2\pi\sigma_0^2)^{d/2} (1 + \sigma_0^2 m)^{-d/2} \prod_{i=0}^{M-1} \hat{\pi}_i(g_i)$$
Theoretical analysis

Denote by S the set of simulation parameters,

$$S = \{ M, \{\sigma_i^2\}_{i=0}^{M-1}, \{\gamma_i\}_{i=0}^{M-1}, \{n_i\}_{i=0}^{M-1}, \{N_i\}_{i=0}^{M-1} \} .$$




Cost of the algorithm: $\mathrm{cost} = \sum_{i=0}^{M-1} \{ N_i + n_i \}$.

Theorem 10 (Brosse, Durmus, Moulines, 2018)

Let µ, ε ∈ (0, 1). There exists an explicit choice of the simulation parameters S such that the estimator $\hat{Z}$ satisfies

$$P( |\hat{Z}/Z - 1| > \varepsilon ) \le \mu .$$

Moreover, the cost of the algorithm is polynomial in the dimension d, $\varepsilon^{-1}$ and $\mu^{-1}$.

Numerical experiments





Figure: Boxplots of the logarithm of the normalizing constants of a multivariate Gaussian distribution in dimension d ∈ {10, 25, 50} (one panel per dimension).

Numerical experiments II

Figure: Boxplot of the log evidence for a mixture of 4 Gaussian distributions in dimension 2.

Our published work

1. Durmus, A.; Moulines, E., Quantitative bounds of convergence for geometrically ergodic Markov chain in the Wasserstein distance with application to the Metropolis adjusted Langevin algorithm, Stat. Comput., 2015

2. Durmus, A.; Moulines, E., Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm, Ann. Appl. Probab., 2017

3. Durmus, A.; Moulines, E., High-dimensional Bayesian inference via the Unadjusted Langevin Algorithm, major revision in Bernoulli, 2018 [first version submitted 3 years ago!]

4. Durmus, A.; Simsekli, U.; Moulines, E.; Badeau, R., Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo, NIPS, 2016

5. Brosse, N.; Durmus, A.; Moulines, E.; Pereyra, M., Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo, COLT, 2017

6. Durmus, A.; Moulines, E.; Pereyra, M., Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau, SIAM J. Imaging Sciences, 2018

