
A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions

Wei Deng
Department of Mathematics
Purdue University
West Lafayette, IN, USA
[email protected]

Guang Lin
Departments of Mathematics & School of Mechanical Engineering
Purdue University
West Lafayette, IN, USA
[email protected]

Faming Liang*
Department of Statistics
Purdue University
West Lafayette, IN, USA
[email protected]

Abstract

We propose an adaptively weighted stochastic gradient Langevin dynamics algorithm, the so-called contour stochastic gradient Langevin dynamics (CSGLD), for Bayesian learning in big data statistics. The proposed algorithm is essentially a scalable dynamic importance sampler, which automatically flattens the target distribution such that the simulation of a multi-modal distribution can be greatly facilitated. Theoretically, we prove a stability condition and establish the asymptotic convergence of the self-adapting parameter to a unique fixed point, regardless of the non-convexity of the original energy function; we also present an error analysis for the weighted averaging estimators. Empirically, the CSGLD algorithm is tested on multiple benchmark datasets, including CIFAR10 and CIFAR100. The numerical results indicate its superiority over the existing state-of-the-art algorithms in training deep neural networks.

1 Introduction

AI safety has long been an important issue in the deep learning community. A promising solution to the problem is Markov chain Monte Carlo (MCMC), which leads to asymptotically correct uncertainty quantification for deep neural network (DNN) models. However, traditional MCMC algorithms [Metropolis et al., 1953, Hastings, 1970] are not scalable to the big datasets that deep learning models rely on, although they have achieved significant successes in many scientific areas such as statistical physics and bioinformatics. It was not until the study of stochastic gradient Langevin dynamics (SGLD) [Welling and Teh, 2011] that the scalability issue encountered in Monte Carlo computing for big data problems was resolved. Ever since, a variety of scalable stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithms have been developed based on strategies such as Hamiltonian dynamics [Chen et al., 2014, Ma et al., 2015, Ding et al., 2014], Hessian approximation [Ahn et al., 2012, Li et al., 2016, Şimşekli et al., 2016], and higher-order numerical schemes [Chen et al., 2015, Li et al., 2019]. Despite their theoretical guarantees in statistical inference [Chen et al., 2015, Teh et al., 2016, Vollmer et al., 2016] and non-convex optimization [Zhang et al., 2017, Raginsky et al., 2017, Xu et al., 2018], these algorithms often converge slowly, which makes them difficult to use for efficient uncertainty quantification in many AI safety problems.

To develop more efficient SGMCMC algorithms, we seek inspiration from traditional MCMC algorithms, such as simulated annealing [Kirkpatrick et al., 1983], parallel tempering [Swendsen and Wang, 1986, Geyer, 1991], and flat histogram algorithms [Berg and Neuhaus, 1991, Wang and Landau, 2001].

* To whom correspondence should be addressed: Faming Liang.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

arXiv:2010.09800v1 [stat.ML] 19 Oct 2020


In particular, simulated annealing proposes to decay the temperature to increase the hitting probability of the global optima [Mangoubi and Vishnoi, 2018]; however, it often gets stuck in a local optimum under a fast cooling schedule. Parallel tempering proposes to swap the positions of neighboring Markov chains according to an acceptance-rejection rule. However, under the mini-batch setting, it often requires a large correction, which is known to deteriorate its performance [Deng et al., 2020a]. The flat histogram algorithms, such as the multicanonical [Berg and Neuhaus, 1991] and Wang-Landau [Wang and Landau, 2001] algorithms, were first proposed to sample discrete states of Ising models by yielding a flat histogram in the energy space, and were later extended into a general dynamic importance sampling algorithm, the so-called stochastic approximation Monte Carlo (SAMC) algorithm [Liang, 2005, Liang et al., 2007, Liang, 2009]. Theoretical studies [Lelièvre et al., 2008, Liang, 2010, Fort et al., 2015] support the efficiency of the flat histogram algorithms in Monte Carlo computing for small data problems. However, it remains unclear how to adapt the flat histogram idea to accelerate the convergence of SGMCMC so as to ensure efficient uncertainty quantification for AI safety problems.

This paper proposes the so-called contour stochastic gradient Langevin dynamics (CSGLD) algorithm, which successfully extends the flat histogram idea to SGMCMC. Like the SAMC algorithm [Liang, 2005, Liang et al., 2007, Liang, 2009], CSGLD works as a dynamic importance sampling algorithm, which adaptively adjusts the target measure at each iteration and accounts for the bias introduced thereby with importance weights. However, the theoretical analyses of the two types of dynamic importance sampling algorithms are quite different due to the fundamental difference in their transition kernels. We proceed by justifying the stability condition for CSGLD based on perturbation theory, and establishing the ergodicity of CSGLD based on newly developed theory for the convergence of adaptive SGLD. Empirically, we test the performance of CSGLD through a few experiments. It achieves remarkable performance on some synthetic data, UCI datasets, and computer vision datasets such as CIFAR10 and CIFAR100.

2 Contour stochastic gradient Langevin dynamics

Suppose we are interested in sampling from a probability measure π(x) with the density given by

π(x) ∝ exp(−U(x)/τ), x ∈ X,   (1)

where X denotes the sample space, U(x) is the energy function, and τ is the temperature. It is known that when U(x) is highly non-convex, SGLD can mix very slowly [Raginsky et al., 2017]. To accelerate the convergence, we exploit the flat histogram idea in SGLD.

Suppose that we have partitioned the sample space X into m subregions based on the energy function U(x): X_1 = {x : U(x) ≤ u_1}, X_2 = {x : u_1 < U(x) ≤ u_2}, ..., X_{m−1} = {x : u_{m−2} < U(x) ≤ u_{m−1}}, and X_m = {x : U(x) > u_{m−1}}, where −∞ < u_1 < u_2 < ... < u_{m−1} < ∞ are specified by the user. For convenience, we set u_0 = −∞ and u_m = ∞. Without loss of generality, we assume u_{i+1} − u_i = Δu for i = 1, ..., m − 2. We propose to simulate from a flattened density

ϖ_{Ψ_θ}(x) ∝ π(x) / Ψ_θ^ζ(U(x)),   (2)

where ζ > 0 is a hyperparameter controlling the geometric property of the flattened density (see Figure 1(a) for an illustration), and θ = (θ(1), θ(2), ..., θ(m)) is an unknown vector taking values in the space:

Θ = {(θ(1), θ(2), ..., θ(m)) : 0 < θ(1), θ(2), ..., θ(m) < 1 and ∑_{i=1}^m θ(i) = 1}.   (3)

2.1 A naïve contour SGLD

It is known that if we set†

(i) ζ = 1 and Ψ_θ(U(x)) = ∑_{i=1}^m θ(i) 1_{u_{i−1} < U(x) ≤ u_i},

(ii) θ(i) = θ⋆(i), where θ⋆(i) = ∫_{X_i} π(x) dx for i ∈ {1, 2, ..., m},   (4)

† 1_A is an indicator function that takes value 1 if the event A occurs and 0 otherwise.


the algorithm will act like the SAMC algorithm [Liang et al., 2007], yielding a flat histogram in the space of energy (see the pink curve in Fig. 1(b)). Theoretically, such a density-flattening strategy enables a sharper logarithmic Sobolev inequality and accelerates the convergence of simulations [Lelièvre et al., 2008, Fort et al., 2015]. However, such a density-flattening setting only works under the framework of the Metropolis algorithm [Metropolis et al., 1953]. A naïve application of the step function in formula (4)(i) to SGLD results in

∂ log Ψ_θ(u)/∂u = (1/Ψ_θ(u)) ∂Ψ_θ(u)/∂u = 0 almost everywhere,

which leads to the vanishing-gradient problem for SGLD. Calculating the gradient for the naïve contour SGLD, we have

∇_x log ϖ_{Ψ_θ}(x) = −[1 + ζτ ∂ log Ψ_θ(u)/∂u] ∇_x U(x)/τ = −∇_x U(x)/τ.

As such, the naïve algorithm behaves like SGLD and fails to simulate from the flattened density (2).

2.2 How to resolve the vanishing gradient

To tackle this issue, we propose to set Ψ_θ(u) as a piecewise continuous function:

Ψ_θ(u) = ∑_{i=1}^m (θ(i−1) e^{(log θ(i) − log θ(i−1)) (u − u_{i−1})/Δu}) 1_{u_{i−1} < u ≤ u_i},   (5)

where θ(0) is fixed to θ(1) for simplicity. A direct calculation shows that

∇_x log ϖ_{Ψ_θ}(x) = −[1 + ζτ ∂ log Ψ_θ(u)/∂u] ∇_x U(x)/τ
                  = −[1 + (ζτ/Δu) (log θ(J(x)) − log θ((J(x)−1) ∨ 1))] ∇_x U(x)/τ,   (6)

where J(x) ∈ {1, 2, ..., m} denotes the index of the subregion that x belongs to, i.e., u_{J(x)−1} < U(x) ≤ u_{J(x)}.§ Since θ is unknown, we propose to estimate it on the fly under the framework of stochastic approximation [Robbins and Monro, 1951]. Provided that a scalable transition kernel Π_{θ_k}(x_k, ·) is available and the energy function U(x) on the full data can be efficiently evaluated, the weighted density ϖ_{Ψ_θ}(x) can be simulated by iterating between the following steps:

(i) Simulate x_{k+1} from Π_{θ_k}(x_k, ·), which admits ϖ_{θ_k}(x) as the invariant distribution;

(ii) θ_{k+1}(i) = θ_k(i) + ω_{k+1} θ_k^ζ(J(x_{k+1})) (1_{i=J(x_{k+1})} − θ_k(i)) for i ∈ {1, 2, ..., m},   (7)

where θ_k denotes a working estimate of θ at the k-th iteration. We expect that, in the long run, such an algorithm can achieve an optimization-sampling equilibrium, such that θ_k converges to the fixed point θ⋆ and the random vector x_k converges weakly to the distribution ϖ_{Ψ_{θ⋆}}(x).

To make the algorithm scalable to big data, we propose to adopt the Langevin transition kernel for drawing samples at each iteration, for which a mini-batch of data can be used to accelerate computation. In addition, we observe that evaluating U(x) on the full data can be quite expensive for big data problems, while it is essentially free to obtain the stochastic energy U(x) in evaluating the stochastic gradient ∇_x U(x) due to the nature of auto-differentiation [Paszke et al., 2017]. For this reason, we propose a biased index J(x), where u_{J(x)−1} < (N/n) U(x) ≤ u_{J(x)}, with U(x) here denoting the stochastic energy computed on a mini-batch, N the sample size of the full dataset, and n the mini-batch size. Let {ε_k}_{k=1}^∞ and {ω_k}_{k=1}^∞ denote the learning rates and step sizes for SGLD and stochastic approximation, respectively. Given the above notation, the proposed algorithm is presented in Algorithm 1, which can be viewed as a scalable Wang-Landau algorithm for deep learning and big data problems.

2.3 Related work

Compared to the existing MCMC algorithms, the proposed algorithm has a few innovations:

First, CSGLD is an adaptive MCMC algorithm based on the Langevin transition kernel instead of the Metropolis transition kernel [Liang et al., 2007, Fort et al., 2015]. As a result, the existing convergence theory for the Wang-Landau algorithm does not apply. To resolve this issue, we first prove a stability condition for CSGLD based on perturbation theory, and then verify regularity conditions for the solution of the Poisson equation, so that the fluctuations of the mean-field system induced by CSGLD are controlled, which eventually ensures the convergence of CSGLD.

§ Formula (6) shows a practical numerical scheme. An alternative is presented in the supplementary material.


Algorithm 1 Contour SGLD Algorithm. One can conduct a resampling step from the pool of importance samples according to the importance weights to obtain the original distribution.

[1.] (Data subsampling) Simulate a mini-batch of data of size n from the whole dataset of size N; compute the stochastic gradient ∇_x U(x_k) and the stochastic energy U(x_k).

[2.] (Simulation step) Sample x_{k+1} using the SGLD algorithm based on x_k and θ_k, i.e.,

x_{k+1} = x_k − ε_{k+1} (N/n) [1 + (ζτ/Δu) (log θ_k(J(x_k)) − log θ_k((J(x_k)−1) ∨ 1))] ∇_x U(x_k) + √(2τ ε_{k+1}) w_{k+1},   (8)

where w_{k+1} ∼ N(0, I_d), d is the dimension, ε_{k+1} is the learning rate, and τ is the temperature.

[3.] (Stochastic approximation) Update the estimate of θ(i) for i = 1, 2, ..., m by setting

θ_{k+1}(i) = θ_k(i) + ω_{k+1} θ_k^ζ(J(x_{k+1})) (1_{i=J(x_{k+1})} − θ_k(i)),   (9)

where 1_{i=J(x_{k+1})} is an indicator function which equals 1 if i = J(x_{k+1}) and 0 otherwise.
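To make the three steps concrete, the following is a minimal one-dimensional NumPy sketch of the loop in Algorithm 1. It is an illustration under our own conventions (function names, 0-based indexing, and the numerical safeguard on θ), not the authors' implementation.

```python
import numpy as np

def csgld_1d(grad_U_hat, U_hat, x0, N, n, u_grid, delta_u,
             zeta=0.75, tau=1.0, eps=0.1, n_iter=100000, seed=0):
    """One-dimensional sketch of Algorithm 1 (illustrative names only).
    grad_U_hat(x) / U_hat(x) return mini-batch estimates of the gradient and
    energy; u_grid holds the thresholds u_1 < ... < u_{m-1}."""
    rng = np.random.default_rng(seed)
    m = len(u_grid) + 1
    x = float(x0)
    theta = np.full(m, 1.0 / m)          # initial estimate, uniform over the partition
    samples, weights = [], []

    def index(z):
        # 0-based version of the biased index J(x) from the stochastic energy
        return int(np.searchsorted(u_grid, (N / n) * U_hat(z)))

    for k in range(1, n_iter + 1):
        # Step 1 (data subsampling) is folded into grad_U_hat / U_hat.
        J = index(x)
        # Step 2 (simulation): SGLD move with the gradient multiplier of Eq. (8);
        # theta[max(J-1, 0)] encodes the convention theta(0) := theta(1).
        mult = 1.0 + (zeta * tau / delta_u) * (
            np.log(theta[J]) - np.log(theta[max(J - 1, 0)]))
        x = x - eps * (N / n) * mult * grad_U_hat(x) \
            + np.sqrt(2.0 * tau * eps) * rng.standard_normal()
        # Step 3 (stochastic approximation): the update of Eq. (9), with the
        # step size omega_k = 1 / (k^0.6 + 100) used in Section 4.1.
        omega = 1.0 / (k ** 0.6 + 100.0)
        J_new = index(x)
        indicator = np.zeros(m); indicator[J_new] = 1.0
        theta = theta + omega * theta[J_new] ** zeta * (indicator - theta)
        theta = np.clip(theta, 1e-12, None)   # our safeguard, keeps theta in Theta
        theta /= theta.sum()
        samples.append(x); weights.append(theta[J_new] ** zeta)
    return np.array(samples), np.array(weights), theta
```

A resampling or reweighting step based on the returned weights then recovers the original distribution π, as noted in the caption of Algorithm 1.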


Second, the use of the stochastic index J(x) avoids the evaluation of U(x) on the full data and thus significantly accelerates the computation of the algorithm, although it leads to a small bias, depending on the mini-batch size n, in parameter estimation. Compared to other methods, such as using a fixed sub-dataset to estimate U(x), the implementation is much simpler. Moreover, by combining the variance-reduced noisy energy estimators of Deng et al. [2020b], the bias also decreases to zero asymptotically as ε → 0.

Third, unlike the existing SGMCMC algorithms [Welling and Teh, 2011, Chen et al., 2014, Ma et al., 2015], CSGLD works as a dynamic importance sampler, which flattens the target distribution and reduces the energy barriers for the sampler to traverse between different regions of the energy landscape (see Fig. 1(a) for an illustration). The sampling bias introduced thereby is accounted for by the importance weight θ^ζ(J(·)). Interestingly, CSGLD possesses a self-adjusting mechanism to ease escapes from local traps, which is similar to the self-repulsive dynamics [Ye et al., 2020] and can be explained as follows. Assume that the sampler gets trapped in a local optimum at iteration k. Then CSGLD will automatically increase the multiplier of the stochastic gradient (i.e., the bracket term of (8)) at iteration k + 1 by increasing the value of θ_k(J(x)), while decreasing the components of θ_k corresponding to the other subregions. This adjustment continues until the sampler moves away from the current subregion. Then, in the following several iterations, the multiplier may become negative in the neighboring subregions of the local optimum due to the increased value of θ(J(x)), which continues to drive the sampler to higher energy regions and thus escape from the local trap. That is, in order to escape from local traps, CSGLD is sometimes forced to move toward higher energy regions by changing the sign of the stochastic gradient multiplier! This is a very attractive feature for simulations of multi-modal distributions.
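As a small numeric illustration of this sign change (the θ values below are hypothetical, not from the paper), the bracket term of (8) flips sign once the weight of the lower-energy neighbor sufficiently exceeds that of the current subregion:

```python
import numpy as np

zeta, tau, delta_u = 0.75, 1.0, 1.0      # settings as in Section 4.1
mult = lambda tJ, tJm1: 1.0 + (zeta * tau / delta_u) * (np.log(tJ) - np.log(tJm1))

# Within an over-visited subregion, theta(J) grows past its neighbor:
print(mult(0.30, 0.05))   # ~ 2.34 > 0: the drift multiplier is inflated
# Just above the over-visited subregion, the neighbor below dominates:
print(mult(0.05, 0.30))   # ~ -0.34 < 0: the drift now pushes toward higher energy
```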

3 Theoretical study of the CSGLD algorithm

In this section, we study the convergence of the CSGLD algorithm under the framework of stochastic approximation and show the ergodicity property based on the weighted averaging estimators.

3.1 Convergence analysis

Following the tradition of stochastic approximation analysis, we rewrite the updating rule (9) as

θ_{k+1} = θ_k + ω_{k+1} H(θ_k, x_{k+1}),   (10)

where H(θ, x) = (H_1(θ, x), ..., H_m(θ, x)) is a random field function with

H_i(θ, x) = θ^ζ(J(x)) (1_{i=J(x)} − θ(i)), i = 1, 2, ..., m.   (11)


Notably, H(θ, x) works under an empirical measure ϖ_θ(x), which approximates the invariant measure ϖ_{Ψ_θ}(x) ∝ π(x)/Ψ_θ^ζ(U(x)) asymptotically as ε → 0 and n → N. As shown in Lemma 1, we have the mean-field equation

h(θ) = ∫_X H(θ, x) ϖ_θ(x) dx = Z_θ^{-1} (θ⋆ + ε̂ β(θ) − θ) = 0,   (12)

where θ⋆ = (∫_{X_1} π(x) dx, ∫_{X_2} π(x) dx, ..., ∫_{X_m} π(x) dx), Z_θ is the normalizing constant, β(θ) is a perturbation term, and ε̂ is a small error depending on ε, n and m. The mean-field equation implies that for any ζ > 0, θ_k converges to a small neighbourhood of θ⋆. By applying perturbation theory and setting the Lyapunov function V(θ) = (1/2) ‖θ⋆ − θ‖², we can establish the stability condition:

Lemma 1 (Stability). Given a small enough ε (learning rate), a large enough n (batch size) and m (partition number), there is a constant φ = inf_θ Z_θ^{-1} > 0 such that the mean field h(θ) satisfies

∀ θ ∈ Θ,  ⟨h(θ), θ − θ⋆⟩ ≤ −φ ‖θ − θ⋆‖² + O(ε + 1/m + δ_n(θ)),

where δ_n(·) is a bias term depending on the batch size n that decays to 0 as n → N.

Together with the tool of the Poisson equation [Benveniste et al., 1990, Andrieu et al., 2005], which controls the fluctuation of H(θ, x) − h(θ), we can establish the convergence of θ_k in Theorem 1, whose proof is given in the supplementary material.

Theorem 1 (L2 convergence rate). Given Assumptions 1-5 (given in the Appendix), a small enough learning rate ε_k, a large partition number m, and a large batch size n, θ_k converges to θ⋆ such that

E[‖θ_k − θ⋆‖²] = O(ω_k + sup_{i≥k₀} ε_i + 1/m + sup_{i≥k₀} δ_n(θ_i)),

where k₀ is some large enough integer, θ⋆ = (∫_{X_1} π(x) dx, ∫_{X_2} π(x) dx, ..., ∫_{X_m} π(x) dx), and δ_n(·) is a bias term depending on the batch size n that decays to 0 as n → N.

3.2 Ergodicity and dynamic importance sampler

CSGLD belongs to the class of adaptive MCMC algorithms, but its transition kernel is based on SGLD instead of the Metropolis algorithm. As such, the ergodicity theory for traditional adaptive MCMC algorithms [Roberts and Rosenthal, 2007, Andrieu and Éric Moulines, 2006, Fort et al., 2011, Liang, 2010] is not directly applicable. To tackle this issue, we conduct the following theoretical study. First, rewrite (8) as

x_{k+1} = x_k − ε (∇_x L(x_k, θ⋆) + Υ(x_k, θ_k, θ⋆)) + N(0, 2ετI),   (13)

where ∇_x L(x_k, θ⋆) = (N/n) [1 + (ζτ/Δu) (log θ⋆(J(x_k)) − log θ⋆((J(x_k)−1) ∨ 1))] ∇_x U(x_k), the bias term Υ(x_k, θ_k, θ⋆) = ∇_x L(x_k, θ_k) − ∇_x L(x_k, θ⋆), and ∇_x L(x_k, θ_k) = (N/n) [1 + (ζτ/Δu) (log θ_k(J(x_k)) − log θ_k((J(x_k)−1) ∨ 1))] ∇_x U(x_k). The order of the bias is figured out in Lemma C1 in the supplementary material based on the results of Theorem 1.

Next, we show how the empirical mean (1/k) ∑_{i=1}^k f(x_i) deviates from the posterior mean ∫_X f(x) ϖ_{Ψ_{θ⋆}}(x) dx. Note that this is a direct application of Theorem 2 of Chen et al. [2015], obtained by treating ∇_x L(x, θ⋆) as the stochastic gradient of a target distribution and Υ(x, θ, θ⋆) as the bias of the stochastic gradient. Moreover, considering that ϖ_{Ψ̄_{θ⋆}}(x) ∝ π(x)/θ⋆^ζ(J(x)) → ϖ_{Ψ_{θ⋆}}(x) as m → ∞ based on Lemma B4 in the supplementary material, we have the following:

Lemma 2 (Convergence of the Averaging Estimators). Suppose Assumptions 1-6 (in the supplementary material) hold. For any bounded function f, we have

|E[(1/k) ∑_{i=1}^k f(x_i)] − ∫_X f(x) ϖ_{Ψ̄_{θ⋆}}(dx)| = O(1/(kε) + √ε + √(∑_{i=1}^k ω_i)/k + 1/√m + sup_{i≥k₀} √δ_n(θ_i)),

where ϖ_{Ψ̄_{θ⋆}}(x) = (1/Z_{θ⋆}) π(x)/θ⋆^ζ(J(x)) and Z_{θ⋆} = ∑_{i=1}^m ∫_{X_i} π(x) dx / θ⋆(i)^ζ.


Finally, we consider the problem of estimating the quantity ∫_X f(x) π(x) dx. Recall that π(x) is the target distribution that we would like to make inference for. To estimate this quantity, we naturally consider the weighted averaging estimator ∑_{i=1}^k θ_i^ζ(J(x_i)) f(x_i) / ∑_{i=1}^k θ_i^ζ(J(x_i)), treating θ_i^ζ(J(x_i)) as the dynamic importance weight of the sample x_i for i = 1, 2, ..., k. The convergence of this estimator is established in Theorem 2, which can be proved by repeatedly applying Theorem 1 and Lemma 2, with the details given in the supplementary material.

Theorem 2 (Convergence of the Weighted Averaging Estimators). Suppose Assumptions 1-6 hold. For any bounded function f, we have

|E[∑_{i=1}^k θ_i^ζ(J(x_i)) f(x_i) / ∑_{i=1}^k θ_i^ζ(J(x_i))] − ∫_X f(x) π(dx)| = O(1/(kε) + √ε + √(∑_{i=1}^k ω_i)/k + 1/√m + sup_{i≥k₀} √δ_n(θ_i)).

The bias of the weighted averaging estimator decreases if one applies a larger batch size, a finer sample space partition, a smaller learning rate ε, and smaller step sizes {ω_k}_{k≥0}. Admittedly, the order of this bias is slightly larger than the O(1/(kε) + ε) rate achieved by standard SGLD. We note that this sacrifice is necessary, since simulating from the flattened distribution ϖ_{Ψ_{θ⋆}} often leads to much faster convergence; see e.g. the green curve vs. the purple curve in Fig. 1(c).
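As a small illustrative sketch (our own helper, consistent with the estimator in Theorem 2), the weighted average can be computed from a CSGLD trajectory as follows:

```python
import numpy as np

def weighted_average(f_values, weights):
    """Weighted averaging estimator of the posterior mean of f under pi.

    f_values : f(x_i) along the trajectory.
    weights  : dynamic importance weights theta_i^zeta(J(x_i)), e.g. the
               `weights` array returned by the csgld_1d sketch above.
    """
    w = np.asarray(weights, dtype=float)
    f = np.asarray(f_values, dtype=float)
    return float(np.sum(w * f) / np.sum(w))
```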

4 Numerical studies

4.1 Simulations of multi-modal distributions

A Gaussian mixture distribution. The first numerical study tests the performance of CSGLD on a Gaussian mixture distribution π(x) = 0.4 N(−6, 1) + 0.6 N(4, 1). In each experiment, the algorithm was run for 10^7 iterations. We fix the temperature τ = 1 and the learning rate ε = 0.1. The step size for stochastic approximation follows ω_k = 1/(k^{0.6} + 100). The sample space is partitioned into 50 subregions with Δu = 1. The stochastic gradients are simulated by injecting additional random noise following N(0, 0.01) into the exact gradients. For comparison, SGLD is chosen as the baseline algorithm and implemented with the same setup as CSGLD. We repeat the experiments 10 times and report the average and the associated standard deviation.

We first assume that θ⋆ is known and plot the energy functions for both π(x) and ϖ_{Ψ_{θ⋆}} with different values of ζ. Fig. 1(a) shows that the original energy function has a rather large energy barrier which strongly affects the communication between the two modes of the distribution. In contrast, CSGLD samples from a modified energy function, which yields a flattened landscape and reduced energy barriers. For example, with ζ = 0.75, the energy barrier for this example is greatly reduced from 12 to as small as 2. Consequently, the local-trap problem can be greatly alleviated. Regarding the bizarre peaks around x = 4, we leave the study to the supplementary material.

[Figure 1 panels: (a) original v.s. trial densities, showing the original energy and the modified energies with ζ = 0.5, 0.75, and 1 over the sample space, with a large barrier flattened into a small one; (b) θ's estimates and histograms over the partition indexes, comparing θ⋆ with the CSGLD estimates for ζ = 0.5, 0.75, and 1; (c) estimation errors over iterations for SGLD, CSGLD, and KSGLD.]

Figure 1: Comparison between SGLD and CSGLD: Fig. 1(b) presents only the first 12 partitions for an illustrative purpose; KSGLD in Fig. 1(c) is implemented by assuming θ⋆ is known.

Fig. 1(b) summarizes the estimates of θ⋆ with ζ = 0.75, which match the ground truth value of θ⋆ very well. Notably, we see that θ⋆(i) decays exponentially fast as the partition index i increases, which indicates an exponentially decreasing probability of visiting high-energy regions and hence a severe local-trap problem. CSGLD tackles this issue by adaptively updating the transition kernel or, equivalently, the invariant distribution, such that the sampler moves like a "random walk" in the space of energy. In particular, setting ζ = 1 leads to a flat histogram of energy (for the samples produced by CSGLD).

To explore the performance of CSGLD in quantity estimation with the weighted averaging estimator, we compare CSGLD (ζ = 0.75) with SGLD and KSGLD in estimating the posterior mean ∫_X x π(x) dx, where KSGLD was implemented by assuming θ⋆ is known and sampling from ϖ_{Ψ_{θ⋆}} directly. Each algorithm was run 10 times, and we recorded the mean absolute estimation error along iterations. As shown in Fig. 1(c), the estimation error of SGLD decays quite slowly and rarely converges due to the high energy barrier. On the contrary, KSGLD converges much faster, which shows the advantage of sampling from a flattened distribution ϖ_{Ψ_{θ⋆}}. Admittedly, θ⋆ is unknown in practice. CSGLD instead adaptively updates its invariant distribution while optimizing the parameter θ until an optimization-sampling equilibrium is reached. In the early period of the run, CSGLD converges slightly more slowly than KSGLD, but soon it becomes as efficient as KSGLD.

Finally, we compare the sample path and learning rate for CSGLD and SGLD. As shown in Fig. 2(a), SGLD tends to be trapped in a deep local optimum for an exponentially long time. CSGLD, in contrast, possesses a self-adjusting mechanism for escaping from local traps. In the early period of a run, CSGLD may suffer from a similar local-trap problem as SGLD (see Fig. 2(b)). In this case, the components of θ corresponding to the current subregion increase very fast, eventually rendering a smaller or even negative stochastic gradient multiplier which bounces the sampler back to high energy regions. To illustrate the process, we plot a bouncy zone and an absorbing zone in Fig. 2(c). The bouncy zone enables the sampler to "jump" over large energy barriers to explore other modes. As the run continues, θ_k converges to θ⋆. Fig. 2(d) shows that larger bouncy "jumps" (in red lines) can potentially be induced in the bouncy zone, which occurs in both local and global optima. Due to the self-adjusting mechanism, CSGLD has the local-trap problem much alleviated.

[Figure 2 panels: (a) SGLD paths, trapped behind an energy barrier > 10; (b) CSGLD paths (early), with absorbing and bouncy zones marked; (c) CSGLD paths (mid), with absorbing and bouncy zones marked; (d) CSGLD paths (late).]

Figure 2: Sample trajectories of SGLD and CSGLD: plots (a) and (c) are obtained with 100,000 iterations, a thinning factor of 100, and ζ = 0.75, while plot (b) utilizes a thinning factor of 10.

A synthetic multi-modal distribution. We next simulate from a distribution π(x) ∝ e^{−U(x)}, where U(x) = ∑_{i=1}^{2} (x(i)² − 10 cos(1.2π x(i)))/3 and x = (x(1), x(2)). We compare CSGLD with SGLD, replica exchange SGLD (reSGLD) [Deng et al., 2020a], and SGLD with cyclic learning rates (cycSGLD) [Zhang et al., 2020], and detail the setups in the supplementary material. Fig. 3(a) shows that the distribution contains nine important modes, where the center mode has the largest probability mass and the four modes on the corners have the smallest mass. We see in Fig. 3(b) that SGLD spends too much time in local regions and only identifies three modes. cycSGLD has a better ability to explore the distribution by leveraging large learning rates cyclically. However, as illustrated in Fig. 3(c), such a mechanism is still not efficient enough to resolve the local-trap issue for this problem. reSGLD includes a high-temperature process to encourage exploration and allows interactions between the two processes via appropriate swaps. We observe in Fig. 3(d) that reSGLD obtains both exploration and exploitation abilities and yields a much better result. However, the noisy energy estimator may hinder the swapping efficiency, and it becomes difficult to estimate a few modes on the corners. As for our algorithm, CSGLD first simulates the importance samples and then recovers the original distribution according to the importance weights. We notice that the samples from CSGLD traverse freely in the parameter space and eventually achieve a remarkable performance, as shown in Fig. 3(e).


[Figure 3 panels: (a) Ground truth, (b) SGLD, (c) cycSGLD, (d) reSGLD, (e) CSGLD.]

Figure 3: Simulations of a multi-modal distribution. A resampling scheme is used for CSGLD.

4.2 UCI data

We tested the performance of CSGLD on the UCI regression datasets. For each dataset, we normalized all features and randomly selected 10% of the observations for testing. Following Hernandez-Lobato and Adams [2015], we modeled the data using a Multi-Layer Perceptron (MLP) with a single hidden layer of 50 hidden units. We set the mini-batch size n = 50 and trained the model for 5,000 epochs. The learning rate was set to 5e-6 and the default L2-regularization coefficient to 1e-4. For all the datasets, we used the stochastic energy (N/n) U(x) to evaluate the partition index. We set the energy bandwidth Δu = 100. We fine-tuned the temperature τ and the hyperparameter ζ. For a fair comparison, each algorithm was run 10 times with fixed seeds for each dataset. In each run, the performance of the algorithm was evaluated by averaging over 50 models, where the averaging estimator was used for SGD and SGLD and the weighted averaging estimator was used for CSGLD. As shown in Table 1, SGLD outperforms the stochastic gradient descent (SGD) algorithm on most datasets due to the advantage of a sampling algorithm in obtaining more informative modes. Since all these datasets are small, there is only very limited potential for improvement. Nevertheless, CSGLD still consistently outperforms all the baselines, including SGD and SGLD.
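For concreteness, the single-hidden-layer MLP used here corresponds roughly to the following PyTorch sketch (the ReLU activation is our assumption, not stated in the text):

```python
import torch.nn as nn

def make_mlp(in_dim):
    """Single hidden layer of 50 units for UCI regression, following
    Hernandez-Lobato and Adams [2015]; the ReLU choice is illustrative."""
    return nn.Sequential(nn.Linear(in_dim, 50), nn.ReLU(), nn.Linear(50, 1))
```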

The contour strategy proposed in the paper can be naturally extended to SGHMC [Chen et al., 2014, Ma et al., 2015] without affecting the theoretical results. In what follows, we adopted a numerical method proposed by Saatci and Wilson [2017] to avoid extra hyperparameter tuning. We set the momentum term to 0.9 and simply inherited all the other parameter settings used in the above experiments. In this case, we compare the contour SGHMC (CSGHMC) with the baselines, including M-SGD (momentum SGD) and SGHMC. The comparison indicates that some improvements can be achieved by including the momentum.

Table 1: Algorithm evaluation using average root-mean-square error and its standard deviation.

Dataset               | Energy    | Concrete  | Yacht     | Wine
Hyperparameters (τ/ζ) | 1/1       | 5/1       | 1/2.5     | 5/10
SGD                   | 1.13±0.07 | 4.60±0.14 | 0.81±0.08 | 0.65±0.01
SGLD                  | 1.08±0.07 | 4.12±0.10 | 0.72±0.07 | 0.63±0.01
CSGLD                 | 1.02±0.06 | 3.98±0.11 | 0.69±0.06 | 0.62±0.01
M-SGD                 | 0.95±0.07 | 4.32±0.27 | 0.73±0.08 | 0.71±0.02
SGHMC                 | 0.77±0.06 | 4.25±0.19 | 0.66±0.07 | 0.67±0.02
CSGHMC                | 0.76±0.06 | 4.15±0.20 | 0.72±0.09 | 0.65±0.01

4.3 Computer vision data

This section compares only CSGHMC with M-SGD and SGHMC due to the popularity of momentum in accelerating computation for computer vision datasets. We keep partitioning the sample space according to the stochastic energy (N/n) U(x), where a mini-batch of data of size n is randomly chosen from the full dataset of size N at each iteration. Notably, such a strategy significantly accelerates the computation of CSGHMC. As a result, CSGHMC has almost the same computational cost as SGHMC and SGD. To reduce the bias associated with the stochastic energy, we choose a large batch size n = 1,000. For more discussions on the hyperparameter settings, we refer readers to Section D in the supplementary material.


CIFAR10 is a standard computer vision dataset with 10 classes and 60,000 images, of which 50,000 images were used for training and the rest for testing. We modeled the data using a Resnet of 20 layers (Resnet20) [He et al., 2016]. In particular, for CSGHMC, we considered a partition of the energy space into 200 subregions, where the energy bandwidth was set to Δu = 1000. We trained the model for a total of 1000 epochs and evaluated the model every ten epochs based on two criteria, namely, the best point estimate (BPE) and the Bayesian model average (BMA). We repeated each experiment 10 times and report in Table 2 the average prediction accuracy and the corresponding standard deviation.

In the first set of experiments, all the algorithms utilized a fixed learning rate ε = 2e-7 and a fixed temperature τ = 0.01 under the Bayesian setting. SGHMC performs quite similarly to M-SGD, both obtaining around 90% accuracy in BPE and 92% in BMA. Notably, in this case, simulated annealing is not applied to any of the algorithms, and achieving the state of the art is quite difficult. However, BMA still consistently outperforms BPE, implying the great potential of advanced MCMC techniques in deep learning. Instead of simulating from π(x) directly, CSGHMC adaptively simulates from a flattened distribution ϖ_{θ⋆} and adjusts the sampling bias by dynamic importance weights. As a result, the weighted averaging estimators obtain an improvement by as large as 0.8% on BMA. In addition, the flattened distribution facilitates optimization, and the increase in BPE is quite significant.

In the second set of experiments, we employed a decaying schedule on both the learning rates and temperatures (if applicable) to obtain simulated annealing effects. For the learning rate, we fixed it at 2e-6 in the first 400 epochs and then decayed it by a factor of 1.01 at each epoch. For the temperature, we consistently decayed it by a factor of 1.01 at each epoch. We call the resulting algorithms saM-SGD, saSGHMC, and saCSGHMC, respectively. Table 2 shows that the performance of all algorithms is increased quite significantly, where the fine-tuned baselines already obtain state-of-the-art results. Nevertheless, saCSGHMC further improves BPE by 0.25% and slightly improves the highly optimized BMA by nearly 0.1%.

The CIFAR100 dataset has 100 classes, each of which contains 500 training images and 100 testing images. We follow a similar setup as for CIFAR10, except that Δu is set to 5000. For M-SGD, BMA can be better than BPE by as much as 5.6%. CSGHMC leads to an improvement of 3.5% on BPE and 2% on BMA, which further demonstrates the superiority of advanced MCMC techniques. Table 2 also shows that with the help of both simulated annealing and importance sampling, saCSGHMC can outperform the highly optimized baselines by almost 1% accuracy on BPE and 0.7% on BMA. These significant improvements show the advantage of the proposed method in training DNNs.

Table 2: Experiments on CIFAR10 & CIFAR100 using Resnet20, where BPE and BMA are short for best point estimate and Bayesian model average, respectively.

Algorithms | CIFAR10 BPE | CIFAR10 BMA | CIFAR100 BPE | CIFAR100 BMA
M-SGD      | 90.02±0.06  | 92.03±0.08  | 61.41±0.15   | 67.04±0.12
SGHMC      | 90.01±0.07  | 91.98±0.05  | 61.46±0.14   | 66.43±0.11
CSGHMC     | 90.87±0.04  | 92.85±0.05  | 63.97±0.21   | 68.94±0.23
saM-SGD    | 93.83±0.07  | 94.25±0.04  | 69.18±0.13   | 71.83±0.12
saSGHMC    | 93.80±0.06  | 94.24±0.06  | 69.24±0.11   | 71.98±0.10
saCSGHMC   | 94.06±0.07  | 94.33±0.07  | 70.18±0.15   | 72.67±0.15

5 Conclusion

We have proposed CSGLD as a general scalable Monte Carlo algorithm for both simulation and optimization tasks. CSGLD automatically adjusts the invariant distribution during simulations to facilitate escaping from local traps and traversing the entire energy landscape. The sampling bias introduced thereby is accounted for by dynamic importance weights. We proved a stability condition for the mean-field system induced by CSGLD, together with the convergence of its self-adapting parameter θ to a unique fixed point θ⋆. We also established the convergence of a weighted averaging estimator for CSGLD. The bias of the estimator decreases as we employ a finer partition, a larger mini-batch size, and smaller learning rates and step sizes. We tested CSGLD and its variants on a few examples, which show their great potential in deep learning and big data computing.


Broader Impact

Our algorithm contributes to AI safety by providing more robust predictions and helping build a safer environment. It extends the flat histogram algorithms from the Metropolis kernel to the Langevin kernel and paves the way for future research on various dynamic importance samplers and adaptive biasing force (ABF) techniques for big data problems. The Bayesian community and researchers in the area of Monte Carlo methods will enjoy the benefit of our work. To the best of our knowledge, negative societal consequences are not apparent, and no one is put at a disadvantage.

Acknowledgment

Liang's research was supported in part by the grants DMS-2015498, R01-GM117597 and R01-GM126089. Lin acknowledges the support from NSF (DMS-1555072, DMS-1736364), BNL Subcontract 382247, W911NF-15-1-0562, and DE-SC0021142.

References

Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. In Proc. of the International Conference on Machine Learning (ICML), 2012.
Christophe Andrieu and Éric Moulines. On the Ergodicity Properties of Some Adaptive MCMC Algorithms. Annals of Applied Probability, 16:1462–1505, 2006.
Christophe Andrieu, Éric Moulines, and Pierre Priouret. Stability of Stochastic Approximation under Verifiable Conditions. SIAM J. Control Optim., 44(1):283–312, 2005.
Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations. Berlin: Springer, 1990.
Bernd A. Berg and T. Neuhaus. Multicanonical Algorithms for First Order Phase Transitions. Physics Letters B, 267(2):249–253, 1991.
Changyou Chen, Nan Ding, and Lawrence Carin. On the Convergence of Stochastic Gradient MCMC Algorithms with High-order Integrators. In Advances in Neural Information Processing Systems (NeurIPS), pages 2278–2286, 2015.
Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), 2014.
Umut Şimşekli, Roland Badeau, A. Taylan Cemgil, and Gaël Richard. Stochastic Quasi-Newton Langevin Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), pages 642–651, 2016.
Wei Deng, Xiao Zhang, Faming Liang, and Guang Lin. An Adaptive Empirical Bayesian Method for Sparse Deep Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Wei Deng, Qi Feng, Liyao Gao, Faming Liang, and Guang Lin. Non-Convex Learning via Replica Exchange Stochastic Gradient MCMC. In Proc. of the International Conference on Machine Learning (ICML), 2020a.
Wei Deng, Qi Feng, Georgios Karagiannis, Guang Lin, and Faming Liang. Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction. arXiv:2010.01084, 2020b.
Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian Sampling using Stochastic Gradient Thermostats. In Advances in Neural Information Processing Systems (NeurIPS), pages 3203–3211, 2014.
G. Fort, E. Moulines, and P. Priouret. Convergence of Adaptive and Interacting Markov Chain Monte Carlo Algorithms. Annals of Statistics, 39:3262–3289, 2011.
G. Fort, B. Jourdain, E. Kuhn, T. Lelièvre, and G. Stoltz. Convergence of the Wang-Landau Algorithm. Math. Comput., 84(295):2297–2327, 2015.
Charles J. Geyer. Markov Chain Monte Carlo Maximum Likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156–163, 1991.
W.K. Hastings. Monte Carlo Sampling Methods using Markov Chains and Their Applications. Biometrika, 57:97–109, 1970.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Jose Miguel Hernandez-Lobato and Ryan Adams. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Proc. of the International Conference on Machine Learning (ICML), volume 37, pages 1861–1869, 2015.
Scott Kirkpatrick, C. D. Gelatt Jr, and Mario P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
T. Lelièvre, M. Rousset, and G. Stoltz. Long-time Convergence of an Adaptive Biasing Force Method. Nonlinearity, 21:1155–1181, 2008.
Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks. In Proc. of the National Conference on Artificial Intelligence (AAAI), pages 1788–1794, 2016.
Xuechen Li, Denny Wu, Lester Mackey, and Murat A. Erdogdu. Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond. In Advances in Neural Information Processing Systems (NeurIPS), pages 7746–7758, 2019.
Faming Liang. A Generalized Wang-Landau Algorithm for Monte Carlo Computation. Journal of the American Statistical Association, 100(472):1311–1327, 2005.
Faming Liang. On the Use of Stochastic Approximation Monte Carlo for Monte Carlo Integration. Statistics and Probability Letters, 79:581–587, 2009.
Faming Liang. Trajectory Averaging for Stochastic Approximation MCMC Algorithms. The Annals of Statistics, 38:2823–2856, 2010.
Faming Liang, Chuanhai Liu, and Raymond J. Carroll. Stochastic Approximation in Monte Carlo Computation. Journal of the American Statistical Association, 102:305–320, 2007.
Yi-An Ma, Tianqi Chen, and Emily B. Fox. A Complete Recipe for Stochastic Gradient MCMC. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
Oren Mangoubi and Nisheeth K. Vishnoi. Convex Optimization with Unbounded Nonconvex Oracles using Simulated Annealing. In Proc. of Conference on Learning Theory (COLT), 2018.
J.C. Mattingly, A.M. Stuart, and D.J. Higham. Ergodicity for SDEs and Approximations: Locally Lipschitz Vector Fields and Degenerate Noise. Stochastic Processes and their Applications, 101:185–232, 2002.
Jonathan C. Mattingly, Andrew M. Stuart, and M.V. Tretyakov. Convergence of Numerical Time-Averaging and Stationary Measures via Poisson Equations. SIAM Journal on Numerical Analysis, 48:552–577, 2010.
N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21:1087–1091, 1953.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex Learning via Stochastic Gradient Langevin Dynamics: a Nonasymptotic Analysis. In Proc. of Conference on Learning Theory (COLT), June 2017.
Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22:400–407, 1951.
Gareth O. Roberts and Jeff S. Rosenthal. Coupling and Ergodicity of Adaptive Markov Chain Monte Carlo Algorithms. Journal of Applied Probability, 44:458–475, 2007.
Yunus Saatci and Andrew G. Wilson. Bayesian GAN. In Advances in Neural Information Processing Systems (NeurIPS), pages 3622–3631, 2017.
Issei Sato and Hiroshi Nakagawa. Approximation Analysis of Stochastic Gradient Langevin Dynamics by Using Fokker-Planck Equation and Ito Process. In Proc. of the International Conference on Machine Learning (ICML), 2014.
Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo Simulation of Spin-Glasses. Phys. Rev. Lett., 57:2607–2609, 1986.
Yee Whye Teh, Alexandre Thiéry, and Sebastian Vollmer. Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17:1–33, 2016.
Eric Vanden-Eijnden. Introduction to Regular Perturbation Theory. Slides, 2001. URL https://cims.nyu.edu/~eve2/reg_pert.pdf.
Sebastian J. Vollmer, Konstantinos C. Zygalakis, and Yee Whye Teh. Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.
Fugao Wang and D. P. Landau. Efficient, Multiple-range Random Walk Algorithm to Calculate the Density of States. Physical Review Letters, 86(10):2050–2053, 2001.
Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proc. of the International Conference on Machine Learning (ICML), pages 681–688, 2011.
Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Mao Ye, Tongzheng Ren, and Qiang Liu. Stein Self-Repulsive Dynamics: Benefits From Past Samples. arXiv:2002.09070v1, 2020.
Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning. In Proc. of the International Conference on Learning Representations (ICLR), 2020.
Yuchen Zhang, Percy Liang, and Moses Charikar. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Proc. of Conference on Learning Theory (COLT), pages 1980–2022, 2017.
Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random Erasing Data Augmentation. ArXiv e-prints, 2017.


Supplementary Material for "A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions"

The supplementary material is organized as follows: Section A provides a review of the related methodologies, Section B proves the stability condition and the convergence of the self-adapting parameter, Section C establishes the ergodicity of the contour stochastic gradient Langevin dynamics (CSGLD) algorithm, and Section D provides more discussion of the algorithm.

A Background on stochastic approximation and Poisson equation

A.1 Stochastic approximation

Stochastic approximation [Benveniste et al., 1990] provides a standard framework for the development of adaptive algorithms. Given a random field function H(θ, x), the goal of the stochastic approximation algorithm is to find the solution to the mean-field equation h(θ) = 0, i.e., solving

h(θ) = ∫_X H(θ, x) ϖ_θ(dx) = 0,

where x ∈ X ⊂ R^d, θ ∈ Θ ⊂ R^m, H(θ, x) is a random field function, and ϖ_θ(x) is a distribution function of x depending on the parameter θ. The stochastic approximation algorithm works by repeating the following iterations:

(1) Draw x_{k+1} ∼ Π_{θ_k}(x_k, ·), where Π_{θ_k}(x_k, ·) is a transition kernel that admits ϖ_{θ_k}(x) as the invariant distribution;

(2) Update θ_{k+1} = θ_k + ω_{k+1} H(θ_k, x_{k+1}) + ω_{k+1}² ρ(θ_k, x_{k+1}), where ρ(·, ·) denotes a bias term.

The algorithm differs from the Robbins-Monro algorithm [Robbins and Monro, 1951] in that x is simulated from a transition kernel Π_{θ_k}(·, ·) instead of the exact distribution ϖ_{θ_k}(·). As a result, a Markov state-dependent noise H(θ_k, x_{k+1}) − h(θ_k) is generated, which requires some regularity conditions to control the fluctuation ∑_{k≥0} Π_θ^k (H(θ, x) − h(θ)). Moreover, it supports a more general form in which a bounded bias term ρ(·, ·) is allowed without affecting the theoretical properties of the algorithm.

A.2 Poisson equation

Stochastic approximation generates a nonhomogeneous Markov chain {(x_k, θ_k)}_{k=1}^∞, for which the convergence theory can be studied based on the Poisson equation

μ_θ(x) − Π_θ μ_θ(x) = H(θ, x) − h(θ),

where Π_θ(x, A) is the transition kernel for any Borel subset A ⊂ X and μ_θ(·) is a function on X. The solution to the Poisson equation exists when the following series converges:

μ_θ(x) := ∑_{k≥0} Π_θ^k (H(θ, x) − h(θ)).

That is, the consistency of the estimator θ can be established by controlling the perturbations of ∑_{k≥0} Π_θ^k (H(θ, x) − h(θ)) via imposing some regularity conditions on μ_θ(·). Towards this goal, Benveniste et al. [1990] gave the following regularity conditions on μ_θ(·) to ensure the convergence of the adaptive algorithm: there exist a function V: X → [1, ∞) and a constant C such that for all θ, θ′ ∈ Θ,

‖Π_θ μ_θ(x)‖ ≤ C V(x),  ‖Π_θ μ_θ(x) − Π_{θ′} μ_{θ′}(x)‖ ≤ C ‖θ − θ′‖ V(x),  E[V(x)] < ∞,

which requires only first-order smoothness. In contrast, the ergodicity theory by Mattingly et al. [2010] and Vollmer et al. [2016] relies on the much stronger 4th-order smoothness.


B Stability and convergence analysis for CSGLD

B.1 CSGLD algorithm

To make the theory more general, we slightly extend CSGLD by allowing a higher-order bias term. The resulting algorithm works by iterating between the following two steps:

(1) Sample x_{k+1} = x_k − ε_k ∇_x L(x_k, θ_k) + N(0, 2ε_k τ I),   (S1)

(2) Update θ_{k+1} = θ_k + ω_{k+1} H(θ_k, x_{k+1}) + ω_{k+1}² ρ(θ_k, x_{k+1}),   (S2)

where ε_k is the learning rate, ω_{k+1} is the step size, and ∇_x L(x, θ) is the stochastic gradient given by

∇_x L(x, θ) = (N/n) [1 + (ζτ/Δu) (log θ(J(x)) − log θ((J(x)−1) ∨ 1))] ∇_x U(x),   (14)

H(θ, x) = (H_1(θ, x), ..., H_m(θ, x)) is a random field function with

H_i(θ, x) = θ^ζ(J(x)) (1_{i=J(x)} − θ(i)), i = 1, 2, ..., m,   (15)

for some constant ζ > 0, and ρ(θ_k, x_{k+1}) is a bias term.

B.2 Convergence of parameter estimation

To establish the convergence of θ_k, we make the following assumptions:

Assumption A1 (Compactness). The space Θ is compact such that inf_Θ θ(i) > 0 for any i ∈ {1, 2, ..., m}. There exists a large constant Q > 0 such that for any θ ∈ Θ and x ∈ X,

‖θ‖ ≤ Q, ‖H(θ, x)‖ ≤ Q, ‖ρ(θ, x)‖ ≤ Q.   (16)

To simplify the proof, we consider a slightly stronger assumption such that inf_Θ θ(i) > 0 holds for any i ∈ {1, 2, ..., m}. To relax this assumption, we refer interested readers to Fort et al. [2015], where the recurrence property was proved for the sequence {θ_k}_{k≥1} of a similar algorithm. Such a property guarantees that θ_k visits a desired compact space often enough, rendering the convergence of the sequence.

Assumption A2 (Smoothness). U(x) is M-smooth; that is, there exists a constant M > 0 such that for any x, x′ ∈ X,

‖∇_x U(x) − ∇_x U(x′)‖ ≤ M ‖x − x′‖.   (17)

Smoothness is a standard assumption in the study of the convergence of SGLD; see e.g. Raginsky et al. [2017], Xu et al. [2018].

Assumption A3 (Dissipativity). There exist constants m̃ > 0 and b ≥ 0 such that for any x ∈ X and θ ∈ Θ,

⟨∇_x L(x, θ), x⟩ ≤ b − m̃ ‖x‖².   (18)

This assumption ensures that the samples move towards the origin regardless of the initial point, which is standard in proving the geometric ergodicity of dynamical systems; see e.g. Mattingly et al. [2002], Raginsky et al. [2017], Xu et al. [2018].

Assumption A4 (Gradient noise). The stochastic gradient is unbiased, that is,

E[∇_x Ũ(x_k) − ∇_x U(x_k)] = 0,

where ∇_x Ũ denotes the stochastic (mini-batch) gradient; in addition, there exist some constants M > 0 and B > 0 such that

E[‖∇_x Ũ(x_k) − ∇_x U(x_k)‖²] ≤ M² ‖x_k‖² + B²,

where the expectation E[·] is taken with respect to the distribution of the noise component included in ∇_x Ũ(x).


Lemma B1 establishes a stability condition for CSGLD, which implies potential convergence of θ_k.

Lemma B1 (Stability). Suppose that Assumptions A1-A4 hold. For any θ ∈ Θ,

⟨h(θ), θ − θ⋆⟩ ≤ −φ ‖θ − θ⋆‖² + O(δ_n(θ) + ε + 1/m),

where φ = inf_θ Z_θ^{-1} > 0, θ⋆ = (∫_{X_1} π(x) dx, ∫_{X_2} π(x) dx, ..., ∫_{X_m} π(x) dx), and δ_n(·) is a bias term depending on the batch size n such that δ_n(·) → 0 as n → N.

Proof. Let ϖ_{Ψ_θ}(x) ∝ π(x)/Ψ_θ^ζ(U(x)) denote a theoretical invariant measure of SGLD, where Ψ_θ(u) is the fixed piecewise continuous function given by

Ψ_θ(u) = ∑_{i=1}^m (θ(i−1) e^{(log θ(i) − log θ(i−1)) (u − u_{i−1})/Δu}) 1_{u_{i−1} < u ≤ u_i},   (19)

the full data is used in determining the indexes of the subregions, and the learning rate converges to zero. In addition, we define a piecewise constant function

Ψ̄_θ(u) = ∑_{i=1}^m θ(i) 1_{u_{i−1} < u ≤ u_i},

and a theoretical measure ϖ_{Ψ̄_θ}(x) ∝ π(x)/θ^ζ(J(x)). Obviously, as the sample space partition becomes finer and finer, i.e., u_1 → u_min, u_{m−1} → u_max and m → ∞, we have ‖Ψ̄_θ − Ψ_θ‖ → 0 and ‖ϖ_{Ψ̄_θ}(x) − ϖ_{Ψ_θ}(x)‖ → 0, where u_min and u_max denote the minimum and maximum of U(x), respectively. Without loss of generality, we assume u_max < ∞; otherwise, u_max can be set to a value such that π({x : U(x) > u_max}) is sufficiently small.

For each i ∈ {1, 2, ..., m}, the random field H̃_i(θ, x) = θ^ζ(J̃(x)) (1_{i=J̃(x)} − θ(i)) is a biased estimator of H_i(θ, x) = θ^ζ(J(x)) (1_{i=J(x)} − θ(i)), where J̃(x) denotes the index determined by the stochastic energy. Let δ_n(θ) = E[H̃(θ, x) − H(θ, x)] denote the bias, which is caused by the mini-batch evaluation of the energy and decays to 0 as n → N.

First, let us compute the mean field h(θ) with respect to the empirical measure ϖ_θ(x):

h_i(θ) = ∫_X H̃_i(θ, x) ϖ_θ(x) dx = ∫_X H_i(θ, x) ϖ_θ(x) dx + δ_n(θ)
       = ∫_X H_i(θ, x) (ϖ_{Ψ̄_θ}(x) + [ϖ_{Ψ_θ}(x) − ϖ_{Ψ̄_θ}(x)] + [ϖ_θ(x) − ϖ_{Ψ_θ}(x)]) dx + δ_n(θ),   (20)

where the three terms in the decomposition are denoted by I₁, I₂ and I₃, respectively.

For the term I₁, we have

∫_X H_i(θ, x) ϖ_{Ψ̄_θ}(x) dx = (1/Z_θ) ∫_X θ^ζ(J(x)) (1_{i=J(x)} − θ(i)) π(x)/θ^ζ(J(x)) dx
= Z_θ^{-1} [∑_{k=1}^m ∫_{X_k} π(x) 1_{k=i} dx − θ(i) ∑_{k=1}^m ∫_{X_k} π(x) dx] = Z_θ^{-1} [θ⋆(i) − θ(i)],   (21)

where Z_θ = ∑_{i=1}^m ∫_{X_i} π(x) dx / θ(i)^ζ denotes the normalizing constant of ϖ_{Ψ̄_θ}(x).

Next, let us consider the integrals I₂ and I₃. By Lemma B4 and the boundedness of H(θ, x), we have

∫_X H_i(θ, x) (ϖ_{Ψ_θ}(x) − ϖ_{Ψ̄_θ}(x)) dx = O(1/m).   (22)

For the term I₃, we have for any fixed θ,

∫_X H_i(θ, x) (ϖ_θ(x) − ϖ_{Ψ_θ}(x)) dx = O(δ_n(θ)) + O(ε),   (23)

where δ_n(·) uniformly decays to 0 as n → N, and the order of the O(ε) term follows from Theorem 6 of Sato and Nakagawa [2014].


Plugging (21), (22) and (23) into (20), we have

hi(θ) = Z−1θ [εβi(θ) + θ?(i)− θ(i)] , (24)

where ε = O(δn(θ) + ε+ 1

m

)and βi(θ) is a bounded term such that Z−1

θ εβi(θ) =

O(δn(θ) + ε+ 1

m

).

To solve the ODE system with small disturbances, we resort to standard techniques in perturbation theory. By the fundamental theorem of perturbation theory [Vanden-Eijnden, 2001], the solution to the mean-field equation $h(\theta) = 0$ is
\[
\theta(i) = \theta^\star(i) + \bar{\epsilon}\beta_i(\theta^\star) + O(\bar{\epsilon}^2), \quad i = 1, 2, \ldots, m, \tag{25}
\]
which is a stable point in a small neighbourhood of $\theta^\star$.

Considering the positive definite function $\mathbb{V}(\theta) = \frac{1}{2}\|\theta^\star - \theta\|^2$ for the mean-field system $h(\theta) = Z_\theta^{-1}(\bar{\epsilon}\beta(\theta) + \theta^\star - \theta) = Z_\theta^{-1}(\theta^\star - \theta) + O(\bar{\epsilon})$, we have
\[
\langle h(\theta), \nabla\mathbb{V}(\theta)\rangle = \langle h(\theta), \theta - \theta^\star\rangle = -Z_\theta^{-1}\|\theta - \theta^\star\|^2 + O(\bar{\epsilon}) \le -\varphi\|\theta - \theta^\star\|^2 + O\Big(\delta_n(\theta) + \epsilon + \frac{1}{m}\Big),
\]
where $\varphi = \inf_\theta Z_\theta^{-1} > 0$ by the compactness assumption A1. This concludes the proof.
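To build intuition for Lemma B1 and (21), the following sketch (a toy 1-D Gaussian target evaluated on a grid; the partition and all constants are illustrative) numerically verifies that $\theta^\star$ zeroes the mean field computed under $\varpi_{\bar{\Psi}_\theta}$:

```python
import numpy as np
from scipy.stats import norm

# Toy numerical check of (21): for a 1-D Gaussian target discretized on a grid,
# theta_star = (int_{X_1} pi, ..., int_{X_m} pi) zeroes the mean field h_i(theta)
# evaluated under varpi_{bar Psi_theta}. Everything here is illustrative.
zeta, m = 0.75, 10
xs = np.linspace(-6, 6, 20001)
pi = norm.pdf(xs)                              # target density (up to a constant)
U = -norm.logpdf(xs)                           # energy function
edges = np.linspace(U.min(), U.max(), m + 1)   # energy-based partition
J = np.clip(np.searchsorted(edges, U, side="right"), 1, m) - 1

theta_star = np.array([pi[J == i].sum() for i in range(m)])
theta_star /= theta_star.sum()                 # subregion probabilities

def mean_field(theta):
    t = np.maximum(theta[J], 1e-12)
    w = pi / t ** zeta                         # varpi_{bar Psi_theta}, unnormalized
    w /= w.sum()
    return np.array([(w * t ** zeta * ((J == i) - theta[i])).sum() for i in range(m)])

print(np.abs(mean_field(theta_star)).max())         # ~ 0: fixed point
print(np.abs(mean_field(np.full(m, 1 / m))).max())  # clearly nonzero away from it
```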

The following is a restatement of Lemma 1 of Deng et al. [2019], which holds for any $\theta$ in the compact space $\Theta$.

Lemma B2 (Uniform $L^2$ bounds). Suppose Assumptions A1, A3 and A4 hold. Given a small enough learning rate, we have $\sup_{k\ge 1}\mathbb{E}[\|x_k\|^2] < \infty$.

Lemma B3 (Solution of Poisson equation). Suppose that Assumptions A1-A4 hold. There is a solution $\mu_\theta(\cdot)$ on $\mathcal{X}$ to the Poisson equation
\[
\mu_\theta(x) - \Pi_\theta\mu_\theta(x) = \widetilde{H}(\theta, x) - h(\theta). \tag{26}
\]
In addition, for all $\theta, \theta' \in \Theta$, there exists a constant $C$ such that
\[
\begin{aligned}
\mathbb{E}[\|\Pi_\theta\mu_\theta(x)\|] &\le C, \\
\mathbb{E}[\|\Pi_\theta\mu_\theta(x) - \Pi_{\theta'}\mu_{\theta'}(x)\|] &\le C\|\theta - \theta'\|.
\end{aligned} \tag{27}
\]

Proof. The lemma can be proved based on Theorem 13 of Vollmer et al. [2016], whose conditions can be easily verified for CSGLD given Assumptions A1-A4 and Lemma B2. The details are omitted.

Now we are ready to prove the first main result on the convergence of $\theta_k$. The technical lemmas are listed in Section B.3.

Assumption A5 (Learning rate and step size). The learning rate $\{\epsilon_k\}_{k\in\mathbb{N}}$ is a positive non-increasing sequence of real numbers satisfying
\[
\lim_k \epsilon_k = 0, \qquad \sum_{k=1}^{\infty}\epsilon_k = \infty.
\]
The step size $\{\omega_k\}_{k\in\mathbb{N}}$ is a positive decreasing sequence of real numbers such that
\[
\omega_k \to 0, \qquad \sum_{k=1}^{\infty}\omega_k = +\infty, \qquad \liminf_{k\to\infty}\Big(2\varphi\,\frac{\omega_k}{\omega_{k+1}} + \frac{\omega_{k+1} - \omega_k}{\omega_{k+1}^2}\Big) > 0. \tag{28}
\]
According to Benveniste et al. [1990], we can choose $\omega_k := \frac{A}{k^{\alpha} + B}$ for some $\alpha \in (\frac{1}{2}, 1]$ and suitable constants $A > 0$ and $B > 0$.

Theorem 3 ($L^2$ convergence rate). Suppose Assumptions A1-A5 hold. For a sufficiently large value of $m$, a sufficiently small learning rate sequence $\{\epsilon_k\}_{k=1}^{\infty}$, and a sufficiently small step size sequence $\{\omega_k\}_{k=1}^{\infty}$, $\{\theta_k\}_{k=0}^{\infty}$ converges to $\theta^\star$ in $L^2$-norm such that
\[
\mathbb{E}\big[\|\theta_k - \theta^\star\|^2\big] = O\Big(\omega_k + \sup_{i\ge k_0}\epsilon_i + \frac{1}{m} + \sup_{i\ge k_0}\delta_n(\theta_i)\Big),
\]
where $k_0$ is a sufficiently large constant, and $\delta_n(\theta)$ is a bias term decaying to 0 as $n \to N$.
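Before turning to the proof, here is a quick numerical sanity check (with hypothetical constants $A$, $B$, $\alpha$, $\varphi$) that the suggested schedule $\omega_k = \frac{A}{k^{\alpha}+B}$ satisfies the step-size condition (28):

```python
import numpy as np

# Sanity check of condition (28) for omega_k = A / (k**alpha + B); the
# constants A, B, alpha and phi below are hypothetical placeholders.
A, B, alpha, phi = 10.0, 100.0, 0.8, 0.5
k = np.arange(1, 10**6, dtype=np.float64)
w = A / (k**alpha + B)
w_next = A / ((k + 1)**alpha + B)

cond = 2 * phi * w / w_next + (w_next - w) / w_next**2
print(cond.min())  # stays bounded away from 0, so the liminf in (28) is positive
```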


Proof. Consider the iterates
\[
\theta_{k+1} = \theta_k + \omega_{k+1}\big(\widetilde{H}(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\big).
\]
Define $T_k = \theta_k - \theta^\star$. Subtracting $\theta^\star$ from both sides and taking the squared norm, we have
\[
\|T_{k+1}\|^2 = \|T_k\|^2 + \omega_{k+1}^2\|\widetilde{H}(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\|^2 + 2\omega_{k+1}\underbrace{\langle T_k, \widetilde{H}(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle}_{D}.
\]

First, by Lemma B5, there exists a constant $G = 4Q^2(1 + Q^2)$ such that
\[
\|\widetilde{H}(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\|^2 \le G(1 + \|T_k\|^2). \tag{29}
\]

Next, by the Poisson equation (26), we have
\[
\begin{aligned}
D &= \langle T_k, \widetilde{H}(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle \\
&= \langle T_k, h(\theta_k) + \mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle \\
&= \underbrace{\langle T_k, h(\theta_k)\rangle}_{D_1} + \underbrace{\langle T_k, \mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle}_{D_2} + \underbrace{\langle T_k, \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle}_{D_3}.
\end{aligned}
\]

For the term $D_1$, by Lemma B1, we have
\[
\mathbb{E}[\langle T_k, h(\theta_k)\rangle] \le -\varphi\mathbb{E}[\|T_k\|^2] + O\Big(\delta_n(\theta_k) + \epsilon_k + \frac{1}{m}\Big).
\]
For convenience, in what follows we denote $O\big(\delta_n(\theta_k) + \epsilon_k + \frac{1}{m}\big)$ by $\Delta_k$.

To deal with the term $D_2$, we make the following decomposition:
\[
D_2 = \underbrace{\langle T_k, \mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_k)\rangle}_{D_{21}} + \underbrace{\langle T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_k) - \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle}_{D_{22}} + \underbrace{\langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k) - \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle}_{D_{23}}.
\]

(i) By the Markov property, $\mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_k)$ forms a martingale difference sequence, so
\[
\mathbb{E}[\langle T_k, \mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_k)\rangle \,|\, \mathcal{F}_k] = 0, \tag{D21}
\]
where $\mathcal{F}_k$ is the $\sigma$-field formed by $\{\theta_0, x_1, \theta_1, x_2, \cdots, x_k, \theta_k\}$.

(ii) By the regularity of the solution of the Poisson equation in (27) and Lemma B6, we have

\[
\mathbb{E}[\|\Pi_{\theta_k}\mu_{\theta_k}(x_k) - \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\|] \le C\|\theta_k - \theta_{k-1}\| \le 2QC\omega_k. \tag{30}
\]
Using the Cauchy-Schwarz inequality, (30) and the compactness of $\Theta$ in Assumption A1, we have
\[
\mathbb{E}[\langle T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_k) - \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle] \le \mathbb{E}[\|T_k\|]\cdot 2QC\omega_k \le 4Q^2C\omega_k \le 5Q^2C\omega_{k+1}, \tag{D22}
\]
where the last inequality follows from Assumption A5 and holds for a large enough $k$.

(iii) For the last term of $D_2$,
\[
\begin{aligned}
\langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k) - \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle 
&= \big(\langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle - \langle T_{k+1}, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle\big) \\
&\quad + \big(\langle T_{k+1}, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle - \langle T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle\big) \\
&= (z_k - z_{k+1}) + \langle T_{k+1} - T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle,
\end{aligned}
\]
where $z_k = \langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle$. By the regularity assumption (27) and Lemma B6,
\[
\mathbb{E}\langle T_{k+1} - T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle \le \mathbb{E}[\|\theta_{k+1} - \theta_k\|]\cdot\mathbb{E}[\|\Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\|] \le 2QC\omega_{k+1}. \tag{D23}
\]

Regarding $D_3$, since $\rho(\theta_k, x_{k+1})$ is bounded, applying the Cauchy-Schwarz inequality gives
\[
\mathbb{E}[\langle T_k, \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle] \le 2Q^2\omega_{k+1}. \tag{D3}
\]


Finally, adding (29), $D_1$, $D_{21}$, $D_{22}$, $D_{23}$ and $D_3$ together, it follows that, for the constant $C_0 = G + 10Q^2C + 4QC + 4Q^2$,
\[
\mathbb{E}\big[\|T_{k+1}\|^2\big] \le (1 - 2\omega_{k+1}\varphi + G\omega_{k+1}^2)\mathbb{E}\big[\|T_k\|^2\big] + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1} + 2\mathbb{E}[z_k - z_{k+1}]\omega_{k+1}. \tag{31}
\]

Moreover, from (16) and (27), $\mathbb{E}[|z_k|]$ is upper bounded by
\[
\mathbb{E}[|z_k|] = \mathbb{E}[\langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle] \le \mathbb{E}[\|T_k\|]\,\mathbb{E}[\|\Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\|] \le 2QC. \tag{32}
\]

According to Lemma B7, we can choose $\lambda_0$ and $k_0$ such that
\[
\mathbb{E}[\|T_{k_0}\|^2] \le \psi_{k_0} = \lambda_0\omega_{k_0} + \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i,
\]
which satisfies the conditions (43) and (44) of Lemma B9. Applying Lemma B9 leads to

\[
\mathbb{E}\big[\|T_k\|^2\big] \le \psi_k + \mathbb{E}\Big[\sum_{j=k_0+1}^{k}\Lambda_j^k(z_{j-1} - z_j)\Big], \tag{33}
\]
where $\psi_k = \lambda_0\omega_k + \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i$ for all $k > k_0$. Based on (32) and the monotonicity of $\Lambda_j^k$ in Lemma B8, we have

\[
\begin{aligned}
\mathbb{E}\Big|\sum_{j=k_0+1}^{k}\Lambda_j^k(z_{j-1} - z_j)\Big| 
&= \mathbb{E}\Big|\sum_{j=k_0+1}^{k-1}(\Lambda_{j+1}^k - \Lambda_j^k)z_j - 2\omega_k z_k + \Lambda_{k_0+1}^k z_{k_0}\Big| \\
&\le \sum_{j=k_0+1}^{k-1} 2(\Lambda_{j+1}^k - \Lambda_j^k)QC + \mathbb{E}[|2\omega_k z_k|] + 2\Lambda_k^k QC \\
&\le 2(\Lambda_k^k - \Lambda_{k_0}^k)QC + 2\Lambda_k^k QC + 2\Lambda_k^k QC \le 6\Lambda_k^k QC.
\end{aligned} \tag{34}
\]

Since $\psi_k = \lambda_0\omega_k + \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i$ satisfies the conditions (43) and (44) of Lemma B9, it follows from (33) and (34) that, for any $k > k_0$,
\[
\mathbb{E}[\|T_k\|^2] \le \psi_k + 6\Lambda_k^k QC = (\lambda_0 + 12QC)\omega_k + \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i = \lambda\omega_k + \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i,
\]
where $\lambda = \lambda_0 + 12QC$, $\lambda_0 = \frac{2G\sup_{i\ge k_0}\Delta_i + 2C_0\varphi}{C_1\varphi}$, $C_1 = \liminf\big(2\varphi\frac{\omega_k}{\omega_{k+1}} + \frac{\omega_{k+1}-\omega_k}{\omega_{k+1}^2}\big) > 0$, $C_0 = G + 10Q^2C + 4QC + 4Q^2$, and $G = 4Q^2(1 + Q^2)$.

B.3 Technical lemmas

Lemma B4. Suppose Assumption A1 holds, and $u_1$ and $u_{m-1}$ are fixed such that $\Psi(u_1) > \nu$ and $\Psi(u_{m-1}) > 1 - \nu$ for some small constant $\nu > 0$. For any bounded function $f(x)$, we have
\[
\int_{\mathcal{X}} f(x)\big(\varpi_{\Psi_\theta}(x) - \varpi_{\bar{\Psi}_\theta}(x)\big)dx = O\Big(\frac{1}{m}\Big). \tag{35}
\]

Proof. Recall that $\varpi_{\bar{\Psi}_\theta}(x) = \frac{1}{Z_\theta}\frac{\pi(x)}{\theta^{\zeta}(J(x))}$ and $\varpi_{\Psi_\theta}(x) = \frac{1}{Z_{\Psi_\theta}}\frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}$. Since $f(x)$ is bounded, it suffices to show
\[
\begin{aligned}
\int_{\mathcal{X}}\bigg|\frac{1}{Z_\theta}\frac{\pi(x)}{\theta^{\zeta}(J(x))} - \frac{1}{Z_{\Psi_\theta}}\frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}\bigg|dx 
&\le \int_{\mathcal{X}}\frac{1}{Z_\theta}\bigg|\frac{\pi(x)}{\theta^{\zeta}(J(x))} - \frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}\bigg|dx + \int_{\mathcal{X}}\bigg|\frac{1}{Z_\theta} - \frac{1}{Z_{\Psi_\theta}}\bigg|\frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}dx \\
&= \underbrace{\frac{1}{Z_\theta}\sum_{i=1}^m\int_{\mathcal{X}_i}\bigg|\frac{\pi(x)}{\theta^{\zeta}(i)} - \frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}\bigg|dx}_{I_1} + \underbrace{\sum_{i=1}^m\bigg|\frac{1}{Z_\theta} - \frac{1}{Z_{\Psi_\theta}}\bigg|\int_{\mathcal{X}_i}\frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}dx}_{I_2} = O\Big(\frac{1}{m}\Big),
\end{aligned} \tag{36}
\]
where $Z_\theta = \sum_{i=1}^m\int_{\mathcal{X}_i}\frac{\pi(x)}{\theta^{\zeta}(i)}dx$, $Z_{\Psi_\theta} = \sum_{i=1}^m\int_{\mathcal{X}_i}\frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}dx$, and $\Psi_\theta(u)$ is the piecewise continuous function defined in (19).

By Assumption A1, $\inf_{\Theta}\theta(i) > 0$ for any $i$. Further, by the mean-value theorem, which implies $|x^{\zeta} - y^{\zeta}| \lesssim |x - y|z^{\zeta}$ for any $\zeta > 0$, $x \le y$ and $z \in [x, y] \subset [u_1, \infty)$, we have
\[
\begin{aligned}
I_1 &= \frac{1}{Z_\theta}\sum_{i=1}^m\int_{\mathcal{X}_i}\bigg|\frac{\theta^{\zeta}(i) - \Psi_\theta^{\zeta}(U(x))}{\theta^{\zeta}(i)\Psi_\theta^{\zeta}(U(x))}\bigg|\pi(x)dx \lesssim \frac{1}{Z_\theta}\sum_{i=1}^m\int_{\mathcal{X}_i}\frac{|\Psi_\theta(u_{i-1}) - \Psi_\theta(u_i)|}{\theta^{\zeta}(i)}\pi(x)dx \\
&\le \max_i|\Psi_\theta(u_i - \Delta u) - \Psi_\theta(u_i)|\,\frac{1}{Z_\theta}\sum_{i=1}^m\int_{\mathcal{X}_i}\frac{\pi(x)}{\theta^{\zeta}(i)}dx = \max_i|\Psi_\theta(u_i - \Delta u) - \Psi_\theta(u_i)| \lesssim \Delta u = O\Big(\frac{1}{m}\Big),
\end{aligned}
\]
where the last inequality follows by a Taylor expansion, and the last equality follows as $u_1$ and $u_{m-1}$ are fixed. Similarly, we have

\[
I_2 = \bigg|\frac{1}{Z_\theta} - \frac{1}{Z_{\Psi_\theta}}\bigg|Z_{\Psi_\theta} = \frac{|Z_{\Psi_\theta} - Z_\theta|}{Z_\theta} \le \frac{1}{Z_\theta}\sum_{i=1}^m\int_{\mathcal{X}_i}\bigg|\frac{\pi(x)}{\theta^{\zeta}(i)} - \frac{\pi(x)}{\Psi_\theta^{\zeta}(U(x))}\bigg|dx = I_1 = O\Big(\frac{1}{m}\Big).
\]
The proof can then be concluded by combining the orders of $I_1$ and $I_2$.

Lemma B5. Given $\sup\{\omega_k\}_{k=1}^{\infty} \le 1$, there exists a constant $G = 4Q^2(1 + Q^2)$ such that
\[
\|\widetilde{H}(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\|^2 \le G(1 + \|\theta_k - \theta^\star\|^2). \tag{37}
\]

Proof. By the compactness condition in Assumption A1, we have
\[
\|\widetilde{H}(\theta_k, x_{k+1})\|^2 \le Q^2(1 + \|\theta_k\|^2) = Q^2(1 + \|\theta_k - \theta^\star + \theta^\star\|^2) \le Q^2(1 + 2\|\theta_k - \theta^\star\|^2 + 2Q^2). \tag{38}
\]
Therefore, using (38), we can show that, for the constant $G = 4Q^2(1 + Q^2)$,
\[
\begin{aligned}
\|\widetilde{H}(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\|^2 
&\le 2\|\widetilde{H}(\theta_k, x_{k+1})\|^2 + 2\omega_{k+1}^2\|\rho(\theta_k, x_{k+1})\|^2 \\
&\le 2Q^2(1 + 2\|\theta_k - \theta^\star\|^2 + 2Q^2) + 2Q^2 \\
&\le 2Q^2\big(2 + 2Q^2 + (2 + 2Q^2)\|\theta_k - \theta^\star\|^2\big) \le G(1 + \|\theta_k - \theta^\star\|^2).
\end{aligned}
\]

Lemma B6. Given $\sup\{\omega_k\}_{k=1}^{\infty} \le 1$, we have
\[
\|\theta_k - \theta_{k-1}\| \le 2\omega_k Q. \tag{39}
\]

Proof. Following the update $\theta_k - \theta_{k-1} = \omega_k\widetilde{H}(\theta_{k-1}, x_k) + \omega_k^2\rho(\theta_{k-1}, x_k)$, we have
\[
\|\theta_k - \theta_{k-1}\| = \|\omega_k\widetilde{H}(\theta_{k-1}, x_k) + \omega_k^2\rho(\theta_{k-1}, x_k)\| \le \omega_k\|\widetilde{H}(\theta_{k-1}, x_k)\| + \omega_k^2\|\rho(\theta_{k-1}, x_k)\|.
\]
By the compactness condition in Assumption A1 and $\sup\{\omega_k\}_{k=1}^{\infty} \le 1$, (39) follows.

Lemma B7. There exist constants $\lambda_0$ and $k_0$ such that, for all $\lambda \ge \lambda_0$ and all $k > k_0$, the sequence $\{\psi_k\}_{k=1}^{\infty}$ with $\psi_k = \lambda\omega_k + \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i$ satisfies

\[
\psi_{k+1} \ge (1 - 2\omega_{k+1}\varphi + G\omega_{k+1}^2)\psi_k + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1}. \tag{40}
\]

Proof. Replacing $\psi_k$ with $\lambda\omega_k + \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i$ in (40), it suffices to show
\[
\lambda\omega_{k+1} + \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i \ge (1 - 2\omega_{k+1}\varphi + G\omega_{k+1}^2)\Big(\lambda\omega_k + \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i\Big) + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1},
\]
which is equivalent to proving
\[
\lambda\big(\omega_{k+1} - \omega_k + 2\omega_k\omega_{k+1}\varphi - G\omega_k\omega_{k+1}^2\big) \ge \frac{1}{\varphi}\sup_{i\ge k_0}\Delta_i\big({-2\omega_{k+1}\varphi} + G\omega_{k+1}^2\big) + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1}.
\]


Given the step-size condition in (28), we have
\[
\omega_{k+1} - \omega_k + 2\omega_k\omega_{k+1}\varphi \ge C_1\omega_{k+1}^2,
\]
where $C_1 = \liminf\big(2\varphi\frac{\omega_k}{\omega_{k+1}} + \frac{\omega_{k+1}-\omega_k}{\omega_{k+1}^2}\big) > 0$. Combining this with $\Delta_k \le \sup_{i\ge k_0}\Delta_i$, it suffices to prove
\[
\lambda(C_1 - G\omega_k)\omega_{k+1}^2 \ge \Big(\frac{G}{\varphi}\sup_{i\ge k_0}\Delta_i + C_0\Big)\omega_{k+1}^2. \tag{41}
\]
It is clear that, for $k_0$ large enough that $\omega_{k_0} \le \frac{C_1}{2G}$ and for $\lambda_0 = \frac{2G\sup_{i\ge k_0}\Delta_i + 2C_0\varphi}{C_1\varphi}$, the desired conclusion (41) holds for all $k \ge k_0$ and $\lambda \ge \lambda_0$.

The following lemma is a restatement of Lemma 25 (page 247) of Benveniste et al. [1990].

Lemma B8. Suppose $k_0$ is an integer satisfying $\inf_{k>k_0}\big(\frac{\omega_{k+1}-\omega_k}{\omega_k\omega_{k+1}} + 2\varphi - G\omega_{k+1}\big) > 0$ for some constant $G$. Then, for any $K > k_0$, the sequence $\{\Lambda_k^K\}_{k=k_0,\ldots,K}$ defined below is increasing in $k$ and upper bounded by $2\omega_K$:
\[
\Lambda_k^K = \begin{cases} 2\omega_k\prod_{j=k}^{K-1}\big(1 - 2\omega_{j+1}\varphi + G\omega_{j+1}^2\big) & \text{if } k < K,\\[2pt] 2\omega_k & \text{if } k = K. \end{cases} \tag{42}
\]
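A small numerical check (hypothetical constants; the step-size schedule follows Assumption A5) that the sequence $\Lambda_k^K$ in (42) is increasing in $k$ and respects the stated bound:

```python
import numpy as np

# Numerical illustration of Lemma B8 with hypothetical constants: Lambda_k^K
# from (42) is increasing in k and bounded by its last element 2*omega_K.
A, B, alpha, phi, G = 10.0, 100.0, 0.8, 0.5, 0.2
K, k0 = 500, 10
w = A / (np.arange(1, K + 2, dtype=float) ** alpha + B)   # w[k] ~ omega_{k+1}

def Lam(k, K):
    # the empty product equals 1 when k == K
    prod = np.prod([1 - 2 * w[j + 1] * phi + G * w[j + 1] ** 2 for j in range(k, K)])
    return 2 * w[k] * prod

lams = [Lam(k, K) for k in range(k0, K + 1)]
print(all(a <= b + 1e-12 for a, b in zip(lams, lams[1:])))  # increasing in k
print(max(lams) <= 2 * w[K] + 1e-12)                        # bound 2 * omega_K
```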

Lemma B9. Let $\{\psi_k\}_{k>k_0}$ be a sequence satisfying, for all $k > k_0$,
\[
\psi_{k+1} \ge \psi_k\big(1 - 2\omega_{k+1}\varphi + G\omega_{k+1}^2\big) + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1}, \tag{43}
\]
and assume there exists a $k_0$ such that
\[
\mathbb{E}\big[\|T_{k_0}\|^2\big] \le \psi_{k_0}. \tag{44}
\]
Then, for all $k > k_0$, we have
\[
\mathbb{E}\big[\|T_k\|^2\big] \le \psi_k + \sum_{j=k_0+1}^{k}\Lambda_j^k(z_{j-1} - z_j). \tag{45}
\]

Proof. We proceed by induction. Assuming (45) holds at $k$ and applying (31), we have
\[
\mathbb{E}\big[\|T_{k+1}\|^2\big] \le (1 - 2\omega_{k+1}\varphi + \omega_{k+1}^2 G)\Big(\psi_k + \sum_{j=k_0+1}^{k}\Lambda_j^k(z_{j-1} - z_j)\Big) + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1} + 2\omega_{k+1}\mathbb{E}[z_k - z_{k+1}].
\]
Applying (40) and Lemma B8, respectively, we obtain
\[
\begin{aligned}
\mathbb{E}\big[\|T_{k+1}\|^2\big] &\le \psi_{k+1} + (1 - 2\omega_{k+1}\varphi + \omega_{k+1}^2 G)\sum_{j=k_0+1}^{k}\Lambda_j^k(z_{j-1} - z_j) + 2\omega_{k+1}\mathbb{E}[z_k - z_{k+1}] \\
&\le \psi_{k+1} + \sum_{j=k_0+1}^{k}\Lambda_j^{k+1}(z_{j-1} - z_j) + \Lambda_{k+1}^{k+1}\mathbb{E}[z_k - z_{k+1}] \\
&\le \psi_{k+1} + \sum_{j=k_0+1}^{k+1}\Lambda_j^{k+1}(z_{j-1} - z_j).
\end{aligned}
\]

C Ergodicity and dynamic importance sampler

Our interest is to analyze the deviation between the weighted averaging estimator $\frac{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))}$ and the posterior expectation $\int_{\mathcal{X}} f(x)\pi(dx)$ for a bounded function $f$. To accomplish this analysis, we first study the convergence of the posterior sample mean $\frac{1}{k}\sum_{i=1}^k f(x_i)$ to the expectation $\bar{f} = \int_{\mathcal{X}} f(x)\varpi_{\Psi_{\theta^\star}}(x)dx$ and then extend the result to $\int_{\mathcal{X}} f(x)\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx$. The key tool for establishing the ergodic theory is still the Poisson equation, which characterizes the fluctuation between $f(x)$ and $\bar{f}$:
\[
\mathcal{L}g(x) = f(x) - \bar{f}, \tag{46}
\]
where $g(x)$ is the solution of the Poisson equation and $\mathcal{L}$ is the infinitesimal generator of the Langevin diffusion,
\[
\mathcal{L}g := \langle\nabla g, \nabla L(\cdot, \theta^\star)\rangle + \tau\nabla^2 g.
\]
By imposing the following regularity conditions on the function $g(x)$, we can control the perturbations of $\frac{1}{k}\sum_{i=1}^k f(x_i) - \bar{f}$ and enable convergence of the weighted averaging estimators.

Assumption A6 (Regularity). $g(x)$ is sufficiently smooth, and there exists a function $\mathcal{V}(x)$ such that $\|D^k g\| \lesssim \mathcal{V}^{p_k}(x)$ for some constants $p_k > 0$, where $k \in \{0, 1, 2, 3\}$. In addition, $\mathcal{V}^p$ has a bounded expectation, i.e., $\sup_x \mathbb{E}[\mathcal{V}^p(x)] < \infty$, and $\mathcal{V}$ is smooth, i.e., $\sup_{s\in(0,1)}\mathcal{V}^p(sx + (1-s)y) \lesssim \mathcal{V}^p(x) + \mathcal{V}^p(y)$ for all $x, y \in \mathcal{X}$ and $p \le 2\max_k\{p_k\}$.

For stronger but verifiable conditions, we refer readers to Vollmer et al. [2016]. In what follows, we present a lemma that is mainly adapted from Theorem 2 of Chen et al. [2015] with a fixed learning rate $\epsilon$.

Lemma C1 (Convergence of the Averaging Estimators). Suppose Assumptions A1-A6 hold. For any bounded function $f$,
\[
\bigg|\mathbb{E}\Big[\frac{\sum_{i=1}^k f(x_i)}{k}\Big] - \int_{\mathcal{X}} f(x)\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx\bigg| = O\bigg(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i\ge k_0}\sqrt{\delta_n(\theta_i)}\bigg),
\]
where $k_0$ is a sufficiently large constant, $\varpi_{\bar{\Psi}_{\theta^\star}}(x) \propto \frac{\pi(x)}{\theta_\star^{\zeta}(J(x))}$, and $\frac{\sum_{i=1}^k\omega_i}{k} = o\big(\frac{1}{\sqrt{k}}\big)$ as implied by Assumption A5.

Proof. We rewrite the CSGLD algorithm as follows:
\[
\begin{aligned}
x_{k+1} &= x_k - \epsilon_k\nabla_x\widetilde{L}(x_k, \theta_k) + \mathcal{N}(0, 2\epsilon_k\tau I) \\
&= x_k - \epsilon_k\big(\nabla_x L(x_k, \theta^\star) + \Upsilon(x_k, \theta_k, \theta^\star)\big) + \mathcal{N}(0, 2\epsilon_k\tau I),
\end{aligned}
\]
where $\nabla_x\widetilde{L}(x, \theta) = \frac{N}{n}\Big[1 + \frac{\zeta\tau}{\Delta u}\big(\log\theta(\widetilde{J}(x)) - \log\theta((\widetilde{J}(x) - 1)\vee 1)\big)\Big]\nabla_x\widetilde{U}(x)$, $\nabla_x L(x, \theta)$ is as defined in Section B.1, and the bias term is given by $\Upsilon(x_k, \theta_k, \theta^\star) = \nabla_x\widetilde{L}(x_k, \theta_k) - \nabla_x L(x_k, \theta^\star)$.

By Assumption A2, we have $\|\nabla_x U(x)\| = \|\nabla_x U(x) - \nabla_x U(x^\star)\| \lesssim \|x - x^\star\| \le \|x\| + \|x^\star\|$ for some optimum $x^\star$. Then the $L^2$ upper bound in Lemma B2 implies that $\nabla_x U(x)$ has a bounded second moment. Combining this with Assumption A4 and Eve's law (i.e., the variance decomposition formula), we have $\mathbb{E}[\|\nabla_x\widetilde{U}(x)\|^2] < \infty$, and hence $\mathbb{E}[\|\nabla_x\widetilde{U}(x)\|] < \infty$. Then, by the triangle inequality and Jensen's inequality,
\[
\begin{aligned}
\|\mathbb{E}[\Upsilon(x_k, \theta_k, \theta^\star)]\| &\le \mathbb{E}[\|\nabla_x\widetilde{L}(x_k, \theta_k) - \nabla_x\widetilde{L}(x_k, \theta^\star)\|] + \mathbb{E}[\|\nabla_x\widetilde{L}(x_k, \theta^\star) - \nabla_x L(x_k, \theta^\star)\|] \\
&\lesssim \mathbb{E}[\|\theta_k - \theta^\star\|] + O(\delta_n(\theta^\star)) \le \sqrt{\mathbb{E}[\|\theta_k - \theta^\star\|^2]} + O(\delta_n(\theta^\star)) \\
&\le O\bigg(\sqrt{\omega_k + \epsilon + \frac{1}{m} + \sup_{i\ge k_0}\delta_n(\theta_i)}\bigg),
\end{aligned} \tag{47}
\]
where Assumption A1 and Theorem 3 are used to derive the smoothness of $\nabla_x\widetilde{L}(x, \theta)$ with respect to $\theta$, and $\delta_n(\theta) = \mathbb{E}[\widetilde{H}(\theta, x) - H(\theta, x)]$ is the bias caused by the mini-batch evaluation of $U(x)$.

The ergodic average based on biased gradients and a fixed learning rate is a direct result of Theorem 2 of Chen et al. [2015] upon imposing the regularity condition A6. Simulating from $\varpi_{\Psi_{\theta^\star}}(x) \propto \frac{\pi(x)}{\Psi_{\theta^\star}^{\zeta}(U(x))}$ and combining (47) and Theorem 3, we have
\[
\begin{aligned}
\bigg|\mathbb{E}\Big[\frac{\sum_{i=1}^k f(x_i)}{k}\Big] - \int_{\mathcal{X}} f(x)\varpi_{\Psi_{\theta^\star}}(x)dx\bigg| 
&\le O\bigg(\frac{1}{k\epsilon} + \epsilon + \frac{\sum_{i=1}^k\|\mathbb{E}[\Upsilon(x_i, \theta_i, \theta^\star)]\|}{k}\bigg) \\
&\lesssim O\bigg(\frac{1}{k\epsilon} + \epsilon + \frac{\sum_{i=1}^k\sqrt{\omega_i + \epsilon + \frac{1}{m} + \sup_{j\ge k_0}\delta_n(\theta_j)}}{k}\bigg) \\
&\le O\bigg(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i\ge k_0}\sqrt{\delta_n(\theta_i)}\bigg),
\end{aligned}
\]
where the last inequality follows by repeatedly applying $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ together with $\sum_{i=1}^k\sqrt{\omega_i} \le \sqrt{k\sum_{i=1}^k\omega_i}$.

For any bounded function $f(x)$, we have $\big|\int_{\mathcal{X}} f(x)\varpi_{\Psi_{\theta^\star}}(x)dx - \int_{\mathcal{X}} f(x)\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx\big| = O\big(\frac{1}{m}\big)$ by Lemma B4. By the triangle inequality, we have
\[
\bigg|\mathbb{E}\Big[\frac{\sum_{i=1}^k f(x_i)}{k}\Big] - \int_{\mathcal{X}} f(x)\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx\bigg| \le O\bigg(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i\ge k_0}\sqrt{\delta_n(\theta_i)}\bigg),
\]
which concludes the proof.

Finally, we are ready to show the convergence of the weighted averaging estimator $\frac{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))}$ to the posterior mean $\int_{\mathcal{X}} f(x)\pi(dx)$.

Theorem 4 (Convergence of the Weighted Averaging Estimators). Suppose Assumptions A1-A6 hold. For any bounded function $f$, we have
\[
\bigg|\mathbb{E}\bigg[\frac{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))}\bigg] - \int_{\mathcal{X}} f(x)\pi(dx)\bigg| = O\bigg(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i\ge k_0}\sqrt{\delta_n(\theta_i)}\bigg).
\]

Proof. Applying the triangle inequality and $|\mathbb{E}[x]| \le \mathbb{E}[|x|]$, we have
\[
\begin{aligned}
&\bigg|\mathbb{E}\bigg[\frac{\sum_{i=1}^k\theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)}{\sum_{i=1}^k\theta_i^{\zeta}(\widetilde{J}(x_i))}\bigg] - \int_{\mathcal{X}} f(x)\pi(dx)\bigg| \\
&\le \underbrace{\mathbb{E}\bigg[\bigg|\frac{\sum_{i=1}^k\theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)}{\sum_{i=1}^k\theta_i^{\zeta}(\widetilde{J}(x_i))} - \frac{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))}\bigg|\bigg]}_{I_1} + \underbrace{\mathbb{E}\bigg[\bigg|\frac{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))} - Z_{\theta^\star}\frac{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))f(x_i)}{k}\bigg|\bigg]}_{I_2} \\
&\quad + \underbrace{\mathbb{E}\bigg[\frac{Z_{\theta^\star}}{k}\sum_{i=1}^k\big|\theta_i^{\zeta}(J(x_i)) - \theta_\star^{\zeta}(J(x_i))\big|\cdot|f(x_i)|\bigg]}_{I_3} + \underbrace{\bigg|\mathbb{E}\bigg[\frac{Z_{\theta^\star}}{k}\sum_{i=1}^k\theta_\star^{\zeta}(J(x_i))f(x_i)\bigg] - \int_{\mathcal{X}} f(x)\pi(dx)\bigg|}_{I_4}.
\end{aligned}
\]

For the term $I_1$, consider the bias $\delta_n(\theta) = \mathbb{E}[\widetilde{H}(\theta, x) - H(\theta, x)]$ as defined in the proof of Lemma B1, which decreases to 0 as $n \to N$. Applying the mean-value theorem, we have
\[
I_1 = \mathbb{E}\Bigg[\Bigg|\frac{\big(\sum_{i=1}^k\theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)\big)\big(\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))\big) - \big(\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))f(x_i)\big)\big(\sum_{i=1}^k\theta_i^{\zeta}(\widetilde{J}(x_i))\big)}{\big(\sum_{i=1}^k\theta_i^{\zeta}(\widetilde{J}(x_i))\big)\big(\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))\big)}\Bigg|\Bigg] \lesssim \sup_i\delta_n(\theta_i) = O\Big(\sup_i\delta_n(\theta_i)\Big), \tag{48}
\]
where the middle step uses the boundedness of $f$ and of $\Theta$, together with $\inf_{\Theta}\theta(i) > 0$.


For the term $I_2$, by the boundedness of $\Theta$ and $f$ and the assumption $\inf_{\Theta}\theta^{\zeta}(i) > 0$, we have
\[
\begin{aligned}
I_2 &= \mathbb{E}\Bigg[\Bigg|\frac{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))}\bigg(1 - \frac{Z_{\theta^\star}\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))}{k}\bigg)\Bigg|\Bigg] \lesssim \mathbb{E}\Bigg[\Bigg|\frac{Z_{\theta^\star}\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))}{k} - 1\Bigg|\Bigg] \\
&= \mathbb{E}\Bigg[\Bigg|Z_{\theta^\star}\sum_{i=1}^m\frac{\sum_{j=1}^k\big(\theta_j^{\zeta}(i) - \theta_\star^{\zeta}(i) + \theta_\star^{\zeta}(i)\big)1_{J(x_j)=i}}{k} - 1\Bigg|\Bigg] \\
&\le \underbrace{\mathbb{E}\Bigg[Z_{\theta^\star}\sum_{i=1}^m\frac{\sum_{j=1}^k\big|\theta_j^{\zeta}(i) - \theta_\star^{\zeta}(i)\big|1_{J(x_j)=i}}{k}\Bigg]}_{I_{21}} + \underbrace{\mathbb{E}\Bigg[\Bigg|Z_{\theta^\star}\sum_{i=1}^m\theta_\star^{\zeta}(i)\frac{\sum_{j=1}^k 1_{J(x_j)=i}}{k} - 1\Bigg|\Bigg]}_{I_{22}}.
\end{aligned}
\]

For $I_{21}$, by first applying the inequality $|x^{\zeta} - y^{\zeta}| \le \zeta|x - y|z^{\zeta-1}$, valid for any $\zeta > 0$, $x \le y$ and $z \in [x, y]$ by the mean-value theorem, and then applying the Cauchy-Schwarz inequality, we have
\[
I_{21} \lesssim \frac{1}{k}\mathbb{E}\Bigg[\sum_{j=1}^k\sum_{i=1}^m\big|\theta_j^{\zeta}(i) - \theta_\star^{\zeta}(i)\big|\Bigg] \lesssim \frac{1}{k}\mathbb{E}\Bigg[\sum_{j=1}^k\sum_{i=1}^m|\theta_j(i) - \theta_\star(i)|\Bigg] \lesssim \frac{1}{k}\sqrt{\sum_{j=1}^k\mathbb{E}\big[\|\theta_j - \theta^\star\|^2\big]}, \tag{49}
\]
where the compactness of $\Theta$ has been used in deriving the second inequality.

For $I_{22}$, consider the following relation:
\[
1 = \sum_{i=1}^m\int_{\mathcal{X}_i}\pi(x)dx = \sum_{i=1}^m\int_{\mathcal{X}_i}\theta_\star^{\zeta}(i)\frac{\pi(x)}{\theta_\star^{\zeta}(i)}dx = Z_{\theta^\star}\int_{\mathcal{X}}\sum_{i=1}^m\theta_\star^{\zeta}(i)1_{J(x)=i}\,\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx;
\]

then we have
\[
\begin{aligned}
I_{22} &= \mathbb{E}\Bigg[\Bigg|Z_{\theta^\star}\sum_{i=1}^m\theta_\star^{\zeta}(i)\frac{\sum_{j=1}^k 1_{J(x_j)=i}}{k} - Z_{\theta^\star}\int_{\mathcal{X}}\sum_{i=1}^m\theta_\star^{\zeta}(i)1_{J(x)=i}\,\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx\Bigg|\Bigg] \\
&= Z_{\theta^\star}\,\mathbb{E}\Bigg[\Bigg|\frac{1}{k}\sum_{j=1}^k\Big(\sum_{i=1}^m\theta_\star^{\zeta}(i)1_{J(x_j)=i}\Big) - \int_{\mathcal{X}}\Big(\sum_{i=1}^m\theta_\star^{\zeta}(i)1_{J(x)=i}\Big)\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx\Bigg|\Bigg] \\
&= O\bigg(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i\ge k_0}\sqrt{\delta_n(\theta_i)}\bigg),
\end{aligned} \tag{50}
\]
where the last equality follows from Lemma C1, as the step function $\sum_{i=1}^m\theta_\star^{\zeta}(i)1_{J(x)=i}$ is integrable.

For $I_3$, by the boundedness of $f$, the mean-value theorem and the Cauchy-Schwarz inequality, we have
\[
I_3 \lesssim \mathbb{E}\Bigg[\frac{1}{k}\sum_{i=1}^k\big|\theta_i^{\zeta}(J(x_i)) - \theta_\star^{\zeta}(J(x_i))\big|\Bigg] \lesssim \frac{1}{k}\mathbb{E}\Bigg[\sum_{j=1}^k\sum_{i=1}^m|\theta_j(i) - \theta_\star(i)|\Bigg] \lesssim \frac{1}{k}\sqrt{\sum_{j=1}^k\mathbb{E}\big[\|\theta_j - \theta^\star\|^2\big]}. \tag{51}
\]

For the last term $I_4$, we first decompose $\int_{\mathcal{X}} f(x)\pi(dx)$ over the $m$ disjoint regions to facilitate the analysis:
\[
\int_{\mathcal{X}} f(x)\pi(dx) = \int_{\cup_{j=1}^m\mathcal{X}_j} f(x)\pi(dx) = \sum_{j=1}^m\int_{\mathcal{X}_j}\theta_\star^{\zeta}(j)f(x)\frac{\pi(dx)}{\theta_\star^{\zeta}(j)} = Z_{\theta^\star}\int_{\mathcal{X}}\sum_{j=1}^m\theta_\star^{\zeta}(j)f(x)1_{J(x)=j}\,\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx. \tag{52}
\]


Plugging (52) into the term $I_4$, we have
\[
\begin{aligned}
I_4 &= \Bigg|\mathbb{E}\Bigg[\frac{Z_{\theta^\star}}{k}\sum_{i=1}^k\sum_{j=1}^m\theta_\star^{\zeta}(j)f(x_i)1_{J(x_i)=j}\Bigg] - \int_{\mathcal{X}} f(x)\pi(dx)\Bigg| \\
&= Z_{\theta^\star}\Bigg|\mathbb{E}\Bigg[\frac{1}{k}\sum_{i=1}^k\Big(\sum_{j=1}^m\theta_\star^{\zeta}(j)f(x_i)1_{J(x_i)=j}\Big)\Bigg] - \int_{\mathcal{X}}\Big(\sum_{j=1}^m\theta_\star^{\zeta}(j)f(x)1_{J(x)=j}\Big)\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx\Bigg|.
\end{aligned} \tag{53}
\]

Applying Lemma C1 to the function $\sum_{j=1}^m\theta_\star^{\zeta}(j)f(x)1_{J(x)=j}$ yields
\[
\Bigg|\mathbb{E}\Bigg[\frac{1}{k}\sum_{i=1}^k\Big(\sum_{j=1}^m\theta_\star^{\zeta}(j)f(x_i)1_{J(x_i)=j}\Big)\Bigg] - \int_{\mathcal{X}}\Big(\sum_{j=1}^m\theta_\star^{\zeta}(j)f(x)1_{J(x)=j}\Big)\varpi_{\bar{\Psi}_{\theta^\star}}(x)dx\Bigg| = O\bigg(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i\ge k_0}\sqrt{\delta_n(\theta_i)}\bigg). \tag{54}
\]

Plugging (54) into (53) and combining $I_1$, $I_{21}$, $I_{22}$, $I_3$ and Theorem 3, we have
\[
\bigg|\mathbb{E}\bigg[\frac{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k\theta_i^{\zeta}(J(x_i))}\bigg] - \int_{\mathcal{X}} f(x)\pi(dx)\bigg| = O\bigg(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i\ge k_0}\sqrt{\delta_n(\theta_i)}\bigg),
\]
which concludes the proof of the theorem.
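For concreteness, a minimal sketch of how the weighted averaging estimator of Theorem 4 would be assembled from the output of a CSGLD run; the samples, latent vectors, and index map below are toy placeholders rather than the paper's implementation:

```python
import numpy as np

# Sketch of the weighted averaging estimator in Theorem 4. The samples xs,
# latent vectors thetas, and index map J below are illustrative stand-ins
# for the actual output of a CSGLD run.
def weighted_average(xs, thetas, J, f, zeta=0.75):
    w = np.array([thetas[i][J(x)] ** zeta for i, x in enumerate(xs)])
    fx = np.array([f(x) for x in xs])
    return (w * fx).sum() / w.sum()      # self-normalized importance estimate

k, m = 1000, 10
rng = np.random.default_rng(1)
xs = rng.normal(size=(k, 1))                          # fake samples
thetas = np.full((k, m), 1.0 / m)                     # fake latent vectors
J = lambda x: min(int(np.linalg.norm(x) * 3), m - 1)  # hypothetical index map
print(weighted_average(xs, thetas, J, f=lambda x: x[0] ** 2))
```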

D More discussions on the algorithm

D.1 An alternative numerical scheme

In addition to the numerical scheme used in (6) and (8) in the main body, we can also consider the following numerical scheme:
\[
x_{k+1} = x_k - \epsilon_{k+1}\frac{N}{n}\bigg[1 + \frac{\zeta\tau}{\Delta u}\Big(\log\theta_k\big((J(x_k) + 1)\wedge m\big) - \log\theta_k\big(J(x_k)\big)\Big)\bigg]\nabla_x\widetilde{U}(x_k) + \sqrt{2\tau\epsilon_{k+1}}\,w_{k+1}.
\]
Such a scheme leads to a similar theoretical result and a better treatment of $\Psi_\theta(\cdot)$ for the subregions that contain stationary points.
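A minimal sketch of this alternative update on a toy 1-D energy (all constants hypothetical; the latent vector $\theta$ is held fixed here for brevity, whereas in CSGLD it adapts via stochastic approximation):

```python
import numpy as np

# Sketch of the alternative scheme above on a toy 1-D energy; all constants
# are hypothetical and theta is held fixed (CSGLD would adapt it each step).
rng = np.random.default_rng(2)
N, n, tau, zeta, m, du, lr = 1000, 100, 1.0, 0.75, 100, 0.25, 1e-3
theta = np.full(m, 1.0 / m)

def stoch_grad_U(x):                 # placeholder mini-batch energy gradient
    return x + 0.1 * rng.normal()

def J(x):                            # subregion index from the toy energy
    return int(np.clip(0.5 * x ** 2 / du, 0, m - 1))

x = 1.0
for _ in range(1000):
    j = J(x)
    # forward difference of log-theta, index capped at m (0-based: m - 1)
    mult = 1 + zeta * tau * (np.log(theta[min(j + 1, m - 1)]) - np.log(theta[j])) / du
    x += -lr * (N / n) * mult * stoch_grad_U(x) + np.sqrt(2 * tau * lr) * rng.normal()
print(x)
```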

D.2 Bizarre peaks in the Gaussian mixture distribution

A bizarre peak always indicates that there is a stationary point of the same energy somewhere else in the sample space, since the sample space is partitioned according to the energy function in CSGLD. For example, we study a mixture distribution with asymmetric modes, $\pi(x) = \frac{1}{6}N(-6, 1) + \frac{5}{6}N(4, 1)$. Figure 4 shows a bizarre peak at a point $x$. Although $x$ is not a local minimum, it has the same energy as $-6$, which is a local minimum. Note that in CSGLD, $x$ and $-6$ belong to the same subregion.

[Figure 4 plots, over the sample space, the original energy of the mixture together with the modified energies for $\zeta = 0.5$, $0.75$ and $1$; the bizarre peak appears at the point sharing the energy of the local minimum at $-6$.]

Figure 4: Explanation of bizarre peaks.
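The location of the bizarre peak can be reproduced numerically; the following sketch (toy code, not from the paper) finds the point between the two modes whose energy matches that of the local minimum at $-6$:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Locate the bizarre peak of Figure 4: the point between the two modes whose
# energy U(x) = -log pi(x) equals the energy of the local minimum at x = -6.
def U(x):
    return -np.log((1/6) * norm.pdf(x, -6, 1) + (5/6) * norm.pdf(x, 4, 1))

target = U(-6.0)
x_peak = brentq(lambda x: U(x) - target, -5.0, 3.9)  # root of U(x) - U(-6)
print(x_peak, U(x_peak), target)   # x_peak falls in the same subregion as -6
```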


D.3 Simulations of multi-modal distributions

We run all the algorithms for 200,000 iterations and assume that the stochastic energy and gradient estimates follow Gaussian distributions with a variance of 0.1. We include an additional quadratic regularizer $(\|x\|^2 - 7)1_{\|x\|^2 > 7}$ to confine the samples to the center region. We use a constant learning rate of 0.001 for SGLD, reSGLD, and CSGLD; we adopt cyclic cosine learning rates with initial learning rate 0.005 and 20 cycles for cycSGLD. The temperature is fixed at 1 for all the algorithms, except for the high-temperature process of reSGLD, which employs a temperature of 3. In particular, for CSGLD we choose the step size $\omega_k = \min\{0.003, \frac{10}{k^{0.8} + 100}\}$ for learning the latent vector. We fix 100 partitions, each with an energy bandwidth of 0.25, and choose $\zeta = 0.75$.
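A compact illustrative loop matching the settings above (toy 1-D surrogate energy with Gaussian noise of variance 0.1; everything other than the listed hyperparameters is hypothetical):

```python
import numpy as np

# Illustrative CSGLD loop with the settings of D.3 on a toy 1-D surrogate
# energy; the energy/gradient noise has variance 0.1 as stated above.
rng = np.random.default_rng(3)
m, du, zeta, tau, lr = 100, 0.25, 0.75, 1.0, 1e-3
theta = np.full(m, 1.0 / m)

def energy_and_grad(x):                          # noisy toy energy and gradient
    return (0.5 * (x - 4) ** 2 + np.sqrt(0.1) * rng.normal(),
            (x - 4) + np.sqrt(0.1) * rng.normal())

x = 0.0
for k in range(1, 200_001):
    u, g = energy_and_grad(x)
    j = int(np.clip(u / du, 0, m - 1))           # subregion index
    mult = 1 + zeta * tau * (np.log(theta[j]) - np.log(theta[max(j - 1, 0)])) / du
    x += -lr * mult * g + np.sqrt(2 * tau * lr) * rng.normal()
    omega = min(0.003, 10.0 / (k ** 0.8 + 100))  # step size from above
    theta += omega * theta[j] ** zeta * ((np.arange(m) == j) - theta)
print(theta[:5])
```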

D.4 Extension to the scenarios with high-ζ

In some complex experiments (e.g. computer vision) with losses of large magnitude, the fixed point $\theta^\star$ can be very close to the vector $(1, 0, \ldots, 0)$, i.e., the first subregion contains almost all the probability mass, if the sample space is not appropriately partitioned. As a result, estimating the $\theta(i)$'s for the high-energy subregions can be quite difficult due to the limitation of floating-point arithmetic. If a small value of $\zeta$ is used, the gradient multiplier $1 + \zeta\tau\frac{\log\theta_\star(i) - \log\theta_\star((i-1)\vee 1)}{\Delta u}$ is close to 1 for any $i$, and the algorithm performs similarly to SGLD, except with different weights. When a large value of $\zeta$ is used, the convergence of $\theta_k$ to $\theta^\star$ can become relatively slow. To tackle this issue, we include a high-order bias term in the stochastic approximation as follows:
\[
\theta_{k+1}(i) = \theta_k(i) + \omega_{k+1}\Big(\theta_k^{\zeta}\big(J(x_{k+1})\big) + \omega_{k+1}1_{i\ge J(x_{k+1})}\rho\Big)\Big(1_{i=J(x_{k+1})} - \theta_k(i)\Big), \tag{55}
\]
for $i = 1, 2, \ldots, m$, where $\rho$ is a constant. As shown earlier, our convergence theory allows the inclusion of such a high-order bias term. In simulations, the high-order bias term $\omega_{k+1}^2 1_{i\ge J(x_{k+1})}\rho$ penalizes the higher-energy regions more heavily, and thus accelerates the convergence of $\theta_k$ toward the pattern $(1, 0, 0, \ldots, 0)$, especially in the early period.

In all computations for the computer vision examples, we set the momentum coefficient to 0.9 and the weight decay to 25, and employed the data augmentation scheme of Zhong et al. [2017]. In addition, for CSGHMC and saCSGHMC, we set $\omega_k = \frac{10}{k^{0.75} + 1000}$ and $\rho = 1$ in (55) for both CIFAR10 and CIFAR100, and set $\zeta = 1\times 10^6$ for CIFAR10 and $3\times 10^6$ for CIFAR100.

D.5 Number of partitions

A fine partition leads to a smaller discretization error, but it may increase the risk of instability. In particular, it can cause large bouncy jumps around local optima, where a large negative learning rate may arise, i.e., $\frac{\log\theta(2) - \log\theta(1)}{\Delta u} \ll 0$ in formula (8). Empirically, we suggest partitioning the sample space into a moderate number of subregions, e.g. 10-1000, to balance stability against discretization error.
