
Page 1

Sampling Methods on Manifolds and Their View from Probability Manifolds

Chang Liu

Microsoft Research Asia

[email protected]

Work done at the Department of Computer Science and Technology, Tsinghua University

Nov. 20, 2019

Page 2

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 3

Introduction: the Sampling Task

The need to draw samples from a distribution:

Bayesian inference: $p(z|x) = p(z)\,p(x|z)/p(x) \propto p(z)\,p(x|z)$.

Generation from generative models (e.g., MRF generation).

Monte Carlo estimation (e.g., MRF likelihood gradients, doubly-stochastic gradients).

Page 4

Introduction: Sampling Methods

Methods:

Monte Carlo:

Directly draw i.i.d. samples. Efficient, but requires the exact density.

Markov Chain Monte Carlo (MCMC):

Draw samples by simulating a Markov chain with the desired stationary distribution. Admits unnormalized densities, but introduces autocorrelation.

Particle-Based Variational Inference (ParVI):

Optimize a set of particles (i.e., samples) to drive the particle distribution towards the target distribution. Admits unnormalized densities, but requires an assumption on the particle distribution (which affects performance).

Page 5

Introduction: Manifold

Concept ($M$-dim manifold $\mathcal M$): a topological space locally homeomorphic to an open subset of $\mathbb R^M$.

Merits:

Inclusive concept: relaxes global linearity.
Rich structures can be equipped: distance, gradient, distribution, dynamics, etc.
Fundamental view of geometry: parameterization-invariant.

Page 6

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold.

Spherical Admixture Model (SAM) [61]: topics on spheres for better representation.

Page 7

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold.

Hyperspherical Variational Auto-Encoder [19, 30]: spherical latent space for an uninformative prior.

Page 8

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold.

Hyperbolic Variational Auto-Encoders [53, 30, 59, 55]: hyperbolic latent space (right) for the analogy to a tree structure (left).

Bayesian Matrix Factorization [64, 66, 73]: factor matrices on the Stiefel manifold [68, 33] $\{M\in\mathbb R^{m\times n} \mid M^\top M = I_n\}$.

Page 9

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold.

Information Geometry [3, 4]: for Bayesian inference $p(z|x)$ of a Bayesian model $\{p(z), p(x|z)\}$.

Page 10

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold: how to comply with the manifold geometry while remaining efficient?

Viewing Sampling Methods on Probability Manifolds:

ParVIs have a natural optimization interpretation on a probability space; can it be made concrete? Do MCMCs have a similar interpretation?

Page 11

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 12

Manifolds

$M$-dim. manifold $\mathcal M$ (a): a topological space locally homeomorphic to an open subset of $\mathbb R^M$.

Tangent vector $v$ at $x\in\mathcal M$ (b): a linear function $C^\infty(\mathcal M)\to\mathbb R$ satisfying the Leibniz rule (a directional derivative).

A smooth curve $\gamma_t$ through $x$ defines a tangent vector (the derivative along the curve) (c).

Tangent space $T_x\mathcal M$ at $x$ (b): an $M$-dim. linear space.

Flow of a vector field $V$ (d): the set of curves $\{(\varphi_t)_t\}$ s.t. $\dot\varphi_t = V(\varphi_t)$ (exists at least locally).

[Figure: panels (a)-(d) illustrating these concepts.]

Page 13

Manifolds

Riemannian structure: inner product in every tangent space TxM.

Coordinate expression: $\langle u, v\rangle_{T_x\mathcal M} = g_{ij}(x)\,u^i v^j$.

Gradient of $f$: $\langle\operatorname{grad} f(x), v\rangle_{T_x\mathcal M} = v[f] := v^i\,\partial_i f(x)$
$\iff$ steepest ascent direction: $\operatorname{grad} f(x) = \max\cdot\operatorname{argmax}_{\|v\|_{T_x\mathcal M}=1}\,\frac{d}{dt} f(\varphi_t)$.

Coordinate expression: $\big(\operatorname{grad} f(x)\big)^i = g^{ij}(x)\,\partial_j f(x)$.

Page 14

Manifolds

Riemannian structure: inner product in every tangent space TxM.

Distance: $d(x, y) = \sqrt{\inf_{\gamma:\ \gamma_0=x,\ \gamma_1=y}\int_0^1\langle\dot\gamma_t, \dot\gamma_t\rangle_{T_{\gamma_t}\mathcal M}\,dt}$.

Geodesic: the minimizing curve(s), when it exists (e.g., when $\mathcal M$ is complete as a metric space [32]).

More fundamental definition: auto-parallel curves under an affine connection (covariant derivative); a generalization of straight lines.

Page 15

Manifolds

Riemannian structure: inner product in every tangent space $T_x\mathcal M$.

Exponential map $\mathrm{Exp}_x(v)$: maps $v\in T_x\mathcal M$ to the end point of the geodesic tangent to $v$ at $x$ with length $\|v\|_{T_x\mathcal M}$.

A generalization of vector addition.

Parallel transport $\Gamma_x^y(v)$: moves $v\in T_x\mathcal M$ to $T_y\mathcal M$ parallelly (in a certain sense), along the geodesic from $x$ to $y$.

A generalization of conventional parallel transport. It is generally path-dependent. More fundamental def.: specified by an affine connection.

Page 16

Manifolds

Measures on orientable manifolds can be expressed by volume forms:

Volume form: an alternating linear map $(T_x\mathcal M)^M\to\mathbb R$ for every $x$.

Lebesgue measure of a coordinate space: $dx^1\wedge\cdots\wedge dx^M$.

Riemannian volume form (Riemannian measure): the coordinate-invariant volume form $\sqrt{|G|}\,dx^1\wedge\cdots\wedge dx^M$.

Page 17

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 18

MCMCs on Euclidean Space

Classical MCMCs: high autocorrelation.

Metropolis-Hastings algorithm [54, 31].

Gibbs sampling [27].

Dynamics-Based MCMCs: more effective moves.

Dynamics: a continuous-time, no-jump Markov process:

$dx = V(x)\,dt + \sqrt{2D(x)}\,dB_t(x)$.

Key tool: the Fokker-Planck equation (FPE):

$\partial_t p_t = -\partial_i(p_t V^i) + \partial_i\partial_j(p_t D^{ij})$.

Page 19

MCMCs on Euclidean Space

Dynamics-Based MCMCs: more effective moves.

Langevin Dynamics (LD) [39] ([63, 62, 71]):

$dx = \Sigma^{-1}\nabla\log p\,dt + \sqrt{2\Sigma^{-1}}\,dB_t(x)$.

Hamiltonian Dynamics (Hamiltonian Monte Carlo (HMC) [21, 56, 10]):

$\begin{cases} dx = \Sigma^{-1} r\,dt, \\ dr = \nabla\log p\,dt. \end{cases}$

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [37, 15]:

$\begin{cases} dx = \Sigma^{-1} r\,dt, \\ dr = \nabla\log p\,dt - Cr\,dt + \sqrt{2C\Sigma}\,dB_t(x). \end{cases}$

Stochastic Gradient Nose-Hoover Thermostats (SGNHT) [20]:

$\begin{cases} dx = \Sigma^{-1} r\,dt, \\ dr = \nabla\log p\,dt - \xi r\,dt + \sqrt{2C\Sigma}\,dB_t(x), \\ d\xi = \big(\tfrac1M r^\top\Sigma^{-1} r - 1\big)\,dt. \end{cases}$
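To make the dynamics concrete, here is a minimal Euler–Maruyama sketch (an illustration, not from the talk) of the Langevin dynamics for a 1-D standard-normal target, assuming $\Sigma = I$:

```python
import numpy as np

# Euler-Maruyama simulation of dx = grad log p dt + sqrt(2) dB_t (Sigma = I),
# targeting the 1-D standard normal p = N(0, 1).
rng = np.random.default_rng(0)
grad_log_p = lambda x: -x                 # d/dx log N(x; 0, 1)

eps, n_steps, burn_in = 1e-2, 60_000, 10_000
x, samples = 0.0, []
for step in range(n_steps):
    x += eps * grad_log_p(x) + np.sqrt(2 * eps) * rng.normal()
    if step >= burn_in:
        samples.append(x)

print(np.mean(samples), np.var(samples))  # approximately 0 and 1
```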

Page 20

MCMCs on Euclidean Space

Dynamics-Based MCMCs: more effective moves.

The complete recipe [51] for the dynamics:

$dx = V(x)\,dt + \sqrt{2D(x)}\,dB_t(x), \qquad V^i(x) = \partial_j\big(p(x)\,(D^{ij}(x) + Q^{ij}(x))\big)\,/\,p(x), \quad (1)$

for some pos. semi-def. $D_{M\times M}$ (diffusion matrix) and skew-symm. $Q_{M\times M}$ (curl matrix), keeps $p$ invariant.

The converse also holds. If $D$ is pos. def., then $p$ is the unique stationary distribution.
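A small numerical sanity check of Eq. (1) (an illustration under stated assumptions, not from the talk): in one dimension $Q$ must vanish (skew-symmetric scalars are zero), and the recipe drift $V = \partial_x(pD)/p$ makes the Fokker–Planck right-hand side vanish identically at $p$, for any positive $D(x)$:

```python
import numpy as np

# 1-D check of the complete recipe: with Q = 0 and D(x) > 0, the drift
# V = d(pD)/dx / p makes p stationary under the Fokker-Planck equation
#   dp/dt = -d(pV)/dx + d^2(pD)/dx^2.
x = np.linspace(-5, 5, 2001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # target density
D = 1.0 + 0.5 * np.sin(x)**2                   # an arbitrary positive diffusion

d = lambda f: np.gradient(f, x)                # finite-difference d/dx
V = d(p * D) / p                               # recipe drift (Q = 0 in 1-D)
rhs = -d(p * V) + d(d(p * D))                  # Fokker-Planck RHS at p

print(np.max(np.abs(rhs)))                     # ~0, up to rounding error
```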

Page 21

MCMCs on Euclidean Space

Dynamics-Based MCMCs: more effective moves.

Stochastic Gradient MCMC: for Bayesian inference,

$\nabla_z\log p(z|\{x^{(n)}\}_{n=1}^N) = \nabla_z\log p(z) + \sum_{n=1}^N\nabla_z\log p(x^{(n)}|z)$,

$\widetilde\nabla_z\log p(z|\{x^{(n)}\}_{n=1}^N) := \nabla_z\log p(z) + \frac{N}{|S|}\sum_{n\in S}\nabla_z\log p(x^{(n)}|z) \approx \nabla_z\log p(z|\{x^{(n)}\}_{n=1}^N) + \mathcal N(0, A(z))$.

Influence on the dynamics $dx = V(x)\,dt + \sqrt{2D(x)}\,dB_t(x)$:

$\operatorname{Var}(V(x)\,dt) = \operatorname{Var}(V(x))\,dt^2 = o(dt), \qquad \operatorname{Var}\big(\sqrt{2D(x)}\,dB_t(x)\big) = 2D(x)\,dt$.

HMC cannot be simulated using stochastic gradients [15, 9].
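For concreteness, a hedged sketch of the mini-batch estimator $\widetilde\nabla_z\log p(z|\{x^{(n)}\})$ on an assumed Gaussian toy model (the model and all names are illustrative, not from the talk), showing it is unbiased:

```python
import numpy as np

# Mini-batch gradient estimator (assumed toy model: prior z ~ N(0, 1),
# likelihood x_n ~ N(z, 1)):
#   grad_tilde(z) = grad log p(z) + (N / |S|) * sum_{n in S} grad log p(x_n | z)
rng = np.random.default_rng(0)
N = 10_000
data = rng.normal(1.5, 1.0, size=N)

def grad_tilde(z, batch=100):
    S = rng.choice(N, size=batch, replace=False)
    return -z + (N / batch) * (data[S] - z).sum()

z = 0.3
full_grad = -z + (data - z).sum()                      # exact full-data gradient
avg_est = np.mean([grad_tilde(z) for _ in range(5000)])
print(full_grad, avg_est)                              # the estimator is unbiased
```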

Page 22

MCMCs on Riemannian Manifolds

In the coordinate space ($p$ is the density w.r.t. the Lebesgue measure):

Riemann Manifold Langevin Dynamics (RMLD) [28, 60]:

$dx = G^{-1}\nabla\log p\,dt + \nabla\cdot G^{-1}\,dt + \sqrt{2G^{-1}}\,dB_t(x)$.

Riemann Manifold Hamiltonian Monte Carlo (RMHMC) [28]:

$\begin{cases} dx = G^{-1} r\,dt, \\ dr = \nabla\log\big(p/\sqrt{|G|}\big)\,dt - \tfrac12\nabla\big(r^\top G^{-1} r\big)\,dt. \end{cases}$

Stochastic Gradient Riemann Hamiltonian Monte Carlo (SGRHMC) [51]:

$\begin{cases} dx = G^{-1/2} r\,dt, \\ dr = G^{-1/2}\nabla\log p\,dt - \nabla\cdot G^{-1/2}\,dt - G^{-1} r\,dt + \sqrt{2G^{-1}}\,dB_t(x). \end{cases}$

Page 23

MCMCs on Riemannian Manifolds

In the coordinate space: application using the Fisher-Rao metric (information geometry [3, 4]):

Given a Bayesian model $p(z), p(x|z)$, $z$ is a coordinate of the manifold $\{p(x|z) \mid z\in\mathcal Z\}$.

Fisher-Rao metric: $G(z) := \mathbb E_{p(x|z)}\big[\nabla_z^\top\log p(x|z)\,\nabla_z\log p(x|z)\big]$.

Derived from the KL divergence. The corresponding distance is ($\sqrt 8\,\times$) the JS divergence. Invariant under reparameterization of $z$.

[Figure: HMC (left) and RMHMC (right) [28].]

Page 24

MCMCs on Riemannian Manifolds

Problems of the coordinate space: a global one may not exist (e.g., hyperspheres $S^{n-1} := \{x\in\mathbb R^n \mid \|x\|_2 = 1\}$).

Cumbersome to switch between coordinate systems.

G would be singular near the edge of a coordinate space.

Simulation in an embedded space $\Xi(\mathcal M)$: a homeomorphic injection $\Xi: \mathcal M\to\mathbb R^n$.

Global representation.

Common manifolds have a natural (isometric) embedding.

Hausdorff meas. on Ξ(M) (isom. emb.) is the Riem. meas. on M.

Page 25

MCMCs on Riemannian Manifolds

RMHMC in the embedded space:

Constrained HMC (CHMC) [11].

Geodesic Monte Carlo (GMC) [12].

Stochastic Gradient MCMCs in the embedded space [42]:

Stochastic Gradient Geodesic Monte Carlo (SGGMC).

Geodesic Stochastic Gradient Nose-Hoover Thermostats (gSGNHT).

Page 26

Stochastic Gradient MCMCs in the Embedded Space

Table: A summary of MCMCs on Riemannian manifolds. –: sampling on manifold not supported; †: the integrators are not in the SSI scheme (it is unclear whether the claimed "2nd-order" is equivalent to ours); ‡: 2nd-order integrators for SGHMC and mSGNHT are developed by [13] and [40], respectively.

| methods                    | stochastic gradient | no inner iteration | no global coordinates | order of integrator |
| LD [63, 62]                | ×                   | √                  | –                     | 1st                 |
| HMC [56]                   | ×                   | √                  | –                     | 2nd                 |
| GMC [12]                   | ×                   | √                  | √                     | 2nd                 |
| RMLD [28]                  | ×                   | √                  | ×                     | 1st                 |
| RMHMC [28]                 | ×                   | ×                  | ×                     | 2nd†                |
| CHMC [11]                  | ×                   | ×                  | √                     | 2nd†                |
| SGLD [71]                  | √                   | √                  | –                     | 1st                 |
| SGHMC [15] / SGNHT [20]    | √                   | √                  | –                     | 1st‡                |
| SGRLD [60] / SGRHMC [51]   | √                   | √                  | ×                     | 1st                 |
| SGGMC / gSGNHT [42]        | √                   | √                  | √                     | 2nd                 |

Page 27

Stochastic Gradient MCMCs in the Embedded Space

SGGMC dynamics (coordinate space):

Augment with the momentum $r\in\mathbb R^m$ (more precisely, a covector $\in T_x^*\mathcal M$).

Augmented target distribution:

$-\log p(z, r) = \underbrace{-\log p(z|x) + \tfrac12\log|G(z)|}_{\text{potential energy}} + \underbrace{\tfrac12 r^\top G(z)^{-1} r}_{\text{kinetic energy}}$.

Let $\mathcal M$ be isometrically embedded in $\mathbb R^n$ via $y = \Xi(z)$. Define:

$D(z) = \begin{pmatrix} 0 & 0 \\ 0 & J(z)^\top C J(z) \end{pmatrix}, \qquad Q(z) = \begin{pmatrix} 0 & -I \\ I & 0 \end{pmatrix}$,

where $J_{n\times m}: J_{ai} = \partial y_a/\partial z_i$ (so $J^\top J = G$).

Page 28

Stochastic Gradient MCMCs in the Embedded Space

SGGMC dynamics (coordinate space):

$dz = G^{-1} r\,dt$,

$dr = \nabla_z\log p(z|x)\,dt - \tfrac12\nabla_z\log|G(z)|\,dt - J^\top C J G^{-1} r\,dt - \tfrac12\nabla_z\big[r^\top G^{-1} r\big]\,dt + \mathcal N(0,\ 2J^\top C J\,dt)$.

Page 29

Stochastic Gradient MCMCs in the Embedded Space

SGGMC simulation (embedded space): Symmetric Splitting Integrator (SSI) [13]. Split the SGGMC dynamics (in the coordinate space)

$dz = G^{-1} r\,dt, \quad dr = \nabla_z\log p(z|x)\,dt - \tfrac12\nabla_z\log|G(z)|\,dt - J^\top C J G^{-1} r\,dt - \tfrac12\nabla_z\big[r^\top G^{-1} r\big]\,dt + \mathcal N(0, 2J^\top C J\,dt)$

into three parts:

A: $\begin{cases} dz = G^{-1} r\,dt \\ dr = -\tfrac12\nabla_z\big[r^\top G^{-1} r\big]\,dt \end{cases}$ $\Rightarrow$ $(z_t, r_t) = \mathrm{GeodFlow}(z_0, r_0)$ [1, 12]

B: $\begin{cases} dz = 0 \\ dr = -J^\top C J G^{-1} r\,dt \end{cases}$ $\Rightarrow$ $\begin{cases} z_t = z_0 \\ r_t = J^\top\operatorname{expm}\{-Ct\}\,J G^{-1} r_0 \end{cases}$

O: $\begin{cases} dz = 0 \\ dr = \nabla_z\log p(z|x)\,dt - \tfrac12\nabla_z\log|G(z)|\,dt + \mathcal N(0, 2J^\top C J\,dt) \end{cases}$ $\Rightarrow$ $\begin{cases} z_t = z_0 \\ r_t = r_0 + \nabla_z\log p(z_0|x)\,t - \tfrac12\nabla_z\log|G(z_0)|\,t + \mathcal N(0, 2J^\top C J\,t) \end{cases}$

Page 30

Stochastic Gradient MCMCs in the Embedded Space

SGGMC simulation (emb. sp.): Symmetric Splitting Integrator (SSI) [13].

Dynamics A in the embedded space: geodesic flow (i.e., exponential map + parallel transport).

Example 1 (Geodesic flow of the hypersphere $S^{n-1}$ in the embedded space)

$\begin{cases} y(t) = y(0)\cos(\alpha t) + (v(0)/\alpha)\sin(\alpha t) \\ v(t) = -\alpha\,y(0)\sin(\alpha t) + v(0)\cos(\alpha t), \end{cases}$

where $y\in S^{n-1}$, $v = \dot y\in T_y(S^{n-1})$, and $\alpha = \|v(0)\|$.
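Example 1 translates directly into a few lines of NumPy; a minimal sketch with a check that the flow stays on the sphere and keeps $v$ tangent:

```python
import numpy as np

# Geodesic flow on S^{n-1} in the embedded space (Example 1).
def geodesic_flow(y, v, t):
    alpha = np.linalg.norm(v)
    if alpha == 0.0:
        return y, v
    y_t = y * np.cos(alpha * t) + (v / alpha) * np.sin(alpha * t)
    v_t = -alpha * y * np.sin(alpha * t) + v * np.cos(alpha * t)
    return y_t, v_t

y = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0])               # tangent: y . v = 0
y_t, v_t = geodesic_flow(y, v, 0.3)
print(np.linalg.norm(y_t), y_t @ v_t)       # 1.0 and 0.0: on-sphere, tangent
```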

Page 31

Stochastic Gradient MCMCs in the Embedded Space

SGGMC simulation (emb. sp.): Symmetric Splitting Integrator (SSI) [13].

Dynamics B and O in the embedded space:

B: $\begin{cases} y(t) = y(0) \\ v(t) = \Lambda\big(y(0)\big)\operatorname{expm}\{-Ct\}\,v(0) \end{cases}$

O: $\begin{cases} y(t) = y(0) \\ v(t) = v(0) + \Lambda\big(y(0)\big)\big[\nabla_y\log p_H\big(y(0)|x\big)\,t + \mathcal N(0, 2Ct)\big], \end{cases}$

where $p_H$ is the density function w.r.t. the Hausdorff measure, and $\Lambda(y) = I_n - P(y)P(y)^\top$ is the projection onto $\Xi_*(T_z\mathcal M)$.

Example 2 (The projection $\Lambda(y)$ for the hypersphere in the embedded space)

$\Lambda(y) = I_n - yy^\top$.

Page 32

Stochastic Gradient MCMCs in the Embedded Space

SGGMC simulation (emb. sp.): Symmetric Splitting Integrator (SSI) [13].

Simulate following the sequence “ABOBA”:

Algorithm 1 Sampling procedure of SGGMC

Sample a subset $S$ for computing $\nabla_y\log p_H(y)$; $(y_0, v_0)\leftarrow(y^{(n-1)}, v^{(n-1)})$.
for $l = 1, 2, \ldots, L$ do
  A: update $(y^*, v^*)\leftarrow(y_{l-1}, v_{l-1})$ by the geodesic flow for time step $\varepsilon_n/2$.
  B: $v^*\leftarrow\exp\{-C\varepsilon_n/2\}\,v^*$.
  O: $v^*\leftarrow v^* + \Lambda(y^*)\cdot\big[\nabla_y\log p_H(y^*)\,\varepsilon_n + \mathcal N\big(0, (2C - \varepsilon_n V(y^*))\,\varepsilon_n\big)\big]$.
  B: $v^*\leftarrow\exp\{-C\varepsilon_n/2\}\,v^*$.
  A: update $(y_l, v_l)\leftarrow(y^*, v^*)$ by the geodesic flow for time step $\varepsilon_n/2$.
end for

Second-order simulation: $\mathrm{MSE} = O(L^{-2K/(2K+1)})$ [13].
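A schematic NumPy rendering of one ABOBA step of Algorithm 1 on the sphere (a sketch under simplifying assumptions: $C = c\,I$, the $\varepsilon_n V(y^*)$ noise-variance correction is dropped, and grad_log_pH is a user-supplied stochastic gradient):

```python
import numpy as np

# One ABOBA step of SGGMC on S^{n-1} in embedded-space variables (y, v).
# Assumptions: C = c * I; the eps_n * V(y*) correction of Algorithm 1 is
# dropped; grad_log_pH(y) is a stochastic gradient of log p_H(y | x).
def geodesic_flow(y, v, t):
    a = np.linalg.norm(v)
    if a == 0.0:
        return y, v
    return (y * np.cos(a * t) + (v / a) * np.sin(a * t),
            -a * y * np.sin(a * t) + v * np.cos(a * t))

def aboba_step(y, v, grad_log_pH, eps, c, rng):
    proj = lambda y: np.eye(len(y)) - np.outer(y, y)   # Lambda(y), Example 2
    y, v = geodesic_flow(y, v, eps / 2)                # A
    v = np.exp(-c * eps / 2) * v                       # B
    noise = np.sqrt(2 * c * eps) * rng.normal(size=len(y))
    v = v + proj(y) @ (grad_log_pH(y) * eps + noise)   # O
    v = np.exp(-c * eps / 2) * v                       # B
    y, v = geodesic_flow(y, v, eps / 2)                # A
    return y, v
```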

Page 33

Stochastic Gradient MCMCs in the Embedded Space

gSGNHT dynamics:

$dz = G^{-1} r\,dt$,

$dr = \nabla_z\log p(z|x)\,dt - \tfrac12\nabla_z\log|G|\,dt - \xi r\,dt - \tfrac12\nabla_z\big[r^\top G^{-1} r\big]\,dt + \mathcal N(0, 2CG\,dt)$,

$d\xi = \big(\tfrac1m r^\top G^{-1} r - 1\big)\,dt$.

gSGNHT simulation:

Algorithm 2 Sampling procedure of gSGNHT

A: update $(y^*, v^*)\leftarrow(y_{l-1}, v_{l-1})$ by the geodesic flow for time step $\varepsilon_n/2$; $\xi^*\leftarrow\xi_{l-1} + \big(\tfrac1m v_{l-1}^\top v_{l-1} - 1\big)\,\varepsilon_n/2$.
B: $v^*\leftarrow\exp\{-\xi^*\varepsilon_n/2\}\,v^*$.
O: $v^*\leftarrow v^* + \Lambda(y^*)\cdot\big[\nabla_y\log p_H(y^*)\,\varepsilon_n + \mathcal N\big(0, (2C - \varepsilon_n V(y^*))\,\varepsilon_n\big)\big]$.
B: $v^*\leftarrow\exp\{-\xi^*\varepsilon_n/2\}\,v^*$.
A: update $(y_l, v_l)\leftarrow(y^*, v^*)$ by the geodesic flow for time step $\varepsilon_n/2$.

Page 34

Stochastic Gradient MCMCs in the Embedded Space

Experimental results:

Figure: Joint posterior of $z_1$ and $z_2$ in gray scale. Left: true distribution; right: empirical distribution by samples of SGGMC.

Page 35

Stochastic Gradient MCMCs in the Embedded Space

Experimental results: inference for Spherical Admixture Model (SAM) [61]

Model structure:

Documents $v$ (e.g., normalized tf-idf), topics $\beta$, corpus mean $\mu$: on hyperspheres.

Posterior of interest: $p(\beta|v)$.

$\nabla_\beta\log p(\beta|v) = \frac{1}{p(\beta|v)}\nabla_\beta\int p(\beta, \theta|v)\,d\theta = \mathbb E_{p(\theta|\beta, v)}\big[\nabla_\beta\log p(\beta, \theta|v)\big]$.

Run another MCMC (GMC [12]) to sample from $p(\theta|\beta, v)$ (supported on a simplex) to estimate the expectation.

Page 36

Stochastic Gradient MCMCs in the Embedded Space

Experimental results: inference for Spherical Admixture Model (SAM) [61]

103

104

105

2000

2500

3000

3500

4000

4500

5000

wall time in seconds (log scale)

log−

perp

lexi

ty

VIStoVIGMC−apprMHGMC−bGibbsSGGMC−batchSGGMC−fullgSGNHT−batchgSGNHT−full

Figure: Results on the 150K Wikipedia subset (150K training and 1K test, 50topics)

Page 37

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 38

ParVIs on Euclidean Space

Particle-Based Variational Inference (ParVI): optimize a set of particles (i.e., samples) to drive the particle distribution towards the target distribution.

More flexible and accurate than classical (i.e., statistical-model-based) variational inference.

Has a better convergence perspective than MCMCs.

More particle-efficient than MCMCs.

Page 39

ParVIs on Euclidean Space

Stein Variational Gradient Descent (SVGD) [46]:

A deterministic dynamics $\dot z_t = v(z_t)$ on $\mathcal M = \mathbb R^m$ induces a continuously-evolving distribution $(q_t)$ on $\mathcal M$:

$\partial_t q_t = -\nabla\cdot(q_t v)$. (continuity equation / deterministic FPE)

Page 40

ParVIs on Euclidean Space

Stein Variational Gradient Descent (SVGD) [46]:

To drive $(q_t)$ towards $p$, let it minimize $\mathrm{KL}(q_t\|p)$:

Find the decreasing rate (directional derivative):

$-\frac{d}{dt}\mathrm{KL}(q_t\|p) = \mathbb E_q\big[v\cdot\nabla\log p + \nabla\cdot v\big]$.

Find the $v$ maximizing the decreasing rate, $v^* := \max\cdot\operatorname{argmax}_{\|v\|_{\mathcal X}=1}\, -\frac{d}{dt}\mathrm{KL}(q_t\|p)$ (functional gradient):

Taking $\mathcal X = T(\mathcal M) = \mathbb R^m$: no tractable solution.
Taking $\mathcal X = \mathcal H^m$, where $\mathcal H$ is the RKHS [67] of a kernel $K$:

$v^*(x') = \mathbb E_{q(x)}\big[K(x, x')\,\nabla_x\log p(x) + \nabla_x K(x, x')\big]$.

The expectation can be estimated directly by the particles!

Simulate the particles by applying the dynamics: $x^{(i)}\leftarrow x^{(i)} + \epsilon\,v^*(x^{(i)})$.
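A minimal SVGD sketch with an RBF kernel for a 1-D standard-normal target (the bandwidth and step size are fixed by hand; both are illustrative choices, not from the talk):

```python
import numpy as np

# SVGD with K(x, x') = exp(-(x - x')^2 / (2h)), target p = N(0, 1).
rng = np.random.default_rng(0)
x = rng.uniform(-6, 6, size=100)            # initial particles
grad_log_p = lambda x: -x
h, eps = 0.5, 0.05                          # bandwidth and step size (assumed)

for _ in range(500):
    diff = x[:, None] - x[None, :]          # diff[i, j] = x_i - x_j
    K = np.exp(-diff**2 / (2 * h))
    # v*(x_i) = mean_j [ K(x_j, x_i) grad log p(x_j) + d/dx_j K(x_j, x_i) ]
    phi = (K @ grad_log_p(x) + (diff * K).sum(axis=1) / h) / len(x)
    x = x + eps * phi

print(x.mean(), x.var())                    # particles approximate N(0, 1)
```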

Page 41

ParVIs on Riemannian Manifolds

Riemannian SVGD [41]:

Utilize information geometry to enhance efficiency (coordinate space).

Enable ParVIs on manifolds like hyperspheres (embedded space).

Page 42

ParVIs on Riemannian Manifolds

Dynamics on a Riemannian manifold:

$\dot z_t = X(z_t)$; $(z_t)$ is a curve of the flow of $X$.

Evolving distribution: let all densities be w.r.t. the Riemannian measure.

Lemma 3 (Continuity Equation on a Riemannian Manifold)

$\partial_t q_t = -\operatorname{div}(q_t X) = -X[q_t] - q_t\operatorname{div}(X) = -X^i\partial_i q_t - q_t\partial_i X^i - q_t X^i\partial_i\log\sqrt{|G|}$.

Page 43

ParVIs on Riemannian Manifolds

Directional derivative:

Theorem 4 (Directional Derivative)

Let $p$ be a fixed distribution. Then the directional derivative is

$-\frac{d}{dt}\mathrm{KL}(q_t\|p) = \mathbb E_{q_t}\big[\operatorname{div}(pX)/p\big] = \mathbb E_{q_t}\big[X[\log p] + \operatorname{div}(X)\big]$.

$X[q_t]$: the action of the vector field $X$ on the smooth function $q_t$; in any coordinate system, $X[q_t] = X^i\partial_i q_t$.

$\operatorname{div}(X)$: the divergence of the vector field $X$; in any coordinate system, $\operatorname{div}(X) = \partial_i(\sqrt{|G|}\,X^i)/\sqrt{|G|}$.

Page 44

ParVIs on Riemannian Manifolds

Functional gradient:

$X^* := \max\cdot\operatorname{argmax}_{X\in\mathcal X,\ \|X\|_{\mathcal X}=1}\ \mathcal J(X) := \mathbb E_q\big[X[\log p] + \operatorname{div}(X)\big]$,

where $\mathcal X$ is a subspace of vector fields on $\mathcal M$, such that:

1. $X^*$ is a valid vector field on $\mathcal M$.

Example 5 (Nontriviality of a valid vector field)

Vector fields on an even-dimensional hypersphere must have a zero point (hairy ball theorem; [2], Thm 8.5.13). The choice $\mathcal X = \mathcal H^m$ in SVGD cannot guarantee this requirement.

2. $X^*$ is coordinate invariant. Concept: the expression in any coordinate system is the same. Necessary for avoiding arbitrariness of the solution. The choice $\mathcal X = \mathcal H^m$ in SVGD cannot guarantee this requirement either.

3. $X^*$ can be expressed in closed form.

Page 45

ParVIs on Riemannian Manifolds

Functional gradient:

Our Solution

$\mathcal X = \{\operatorname{grad} f \mid f\in\mathcal H\}$, where $\mathcal H$ is the RKHS of a kernel $K$.

The gradient of a function is a valid, coordinate-invariant vector field.

Lemma 6

For a Gaussian RKHS, $\mathcal X$ is isometrically isomorphic to $\mathcal H$.

Theorem 7 (Functional Gradient)

$X^{*\prime} = \operatorname{grad}' f^{*\prime}, \qquad f^{*\prime} = \mathbb E_q\big[(\operatorname{grad} K)[\log p] + \Delta K\big]$,

where "$\cdot'$" takes $x'$ as argument, and $\Delta f := \operatorname{div}(\operatorname{grad} f)$. In coordinates:

$X^{*\prime\,i} = g'^{ij}\partial'_j\,\mathbb E_q\Big[\big(g^{ab}\partial_a\log\big(p\sqrt{|G|}\big) + \partial_a g^{ab}\big)\partial_b K + g^{ab}\partial_a\partial_b K\Big]$.

Simulate the dynamics: $z^{(s)}\leftarrow z^{(s)} + \epsilon\,X^*(z^{(s)})$.

Page 46

ParVIs on Riemannian Manifolds

Experimental Results (coordinate space):

Figure: Test accuracy along iterations for Bayesian logistic regression (BLR), comparing SVGD and RSVGD (ours): (a) on the Splice19 dataset; (b) on the Covertype dataset. Both methods are run 20 times on Splice19 and 10 times on Covertype.

Page 47

ParVIs on Riemannian Manifolds

Functional gradient in the embedded space:

Proposition 8 (Functional Gradient in the Embedded Space)

Let the $m$-dim $\mathcal M$ be isometrically embedded in $\mathbb R^n$ (with orthonormal basis $\{y^\alpha\}_{\alpha=1}^n$) via $\Xi: \mathcal M\to\mathbb R^n$. Then $X^{*\prime} = (I_n - P'P'^\top)\nabla' f^{*\prime}$,

$f^{*\prime} = \mathbb E_q\Big[\big(\nabla\log(p\sqrt{|G|})\big)^\top\big(I_n - PP^\top\big)(\nabla K) + \nabla^\top\nabla K - \operatorname{tr}\big(P^\top(\nabla\nabla^\top K)P\big) + \big((J^\top\nabla)^\top(G^{-1}J^\top)\big)(\nabla K)\Big]$,

where $\nabla = (\partial_{y^1}, \ldots, \partial_{y^n})^\top$, $J_{n\times m}: J_{ai} = \partial y_a/\partial z_i$, and $P\in\mathbb R^{n\times(n-m)}$ is an orthonormal basis of the orthogonal complement of $\Xi_*(T_z\mathcal M)$.

Simulate the dynamics with the exponential map:

$y^{(s)}\leftarrow\mathrm{Exp}_{y^{(s)}}\big(\epsilon\,X^*(y^{(s)})\big)$.

(Is a coordinate-independent expression possible?)

Page 48

ParVIs on Riemannian Manifolds

Functional gradient on hyperspheres:

Proposition 9 (Functional Gradient for Embedded Hyperspheres)

For $S^{n-1}$ isometrically embedded in $\mathbb R^n$ with orthonormal basis $\{y^\alpha\}_{\alpha=1}^n$, we have $X^{*\prime} = (I_n - y'y'^\top)\nabla' f^{*\prime}$, where

$f^{*\prime} = \mathbb E_q\Big[(\nabla\log p)^\top(\nabla K) + \nabla^\top\nabla K - y^\top\big(\nabla\nabla^\top K\big)y - \big(y^\top\nabla\log p + n - 1\big)\,y^\top\nabla K\Big]$.

Simulate the dynamics with the exponential map on $S^{n-1}$:

$\mathrm{Exp}_y(v) = y\cos(\|v\|) + (v/\|v\|)\sin(\|v\|)$.
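In code, this exponential map is a two-liner (a sketch):

```python
import numpy as np

# Exponential map on S^{n-1}: follow the great circle from y along tangent v.
def exp_sphere(y, v):
    nv = np.linalg.norm(v)
    return y if nv == 0.0 else y * np.cos(nv) + (v / nv) * np.sin(nv)

y = np.array([0.0, 0.0, 1.0])
v = np.array([np.pi / 2, 0.0, 0.0])         # tangent at y (y . v = 0)
print(exp_sphere(y, v))                     # a quarter turn: lands at (1, 0, 0)
```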

Page 49

ParVIs on Riemannian Manifolds

Experimental Results (embedded space):

Figure: Results on the SAM inference task on the 20News-different dataset, in log-perplexity: (a) results with 100 particles, along epochs; (b) results at 200 epochs, against the number of particles. Methods: VI, GMC-seq/par, SGGMCb-seq/par, SGGMCf-seq/par, RSVGD (ours). SGGMCf: full batch; SGGMCb: mini-batch of size 50.

Page 50

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 51

Questions on ParVIs and MCMCs

ParVIs exhibit the intuition of minimizing $\mathrm{KL}_p(\cdot)$ on a probability space, along the steepest descent direction. Can this be made concrete?

Liu (2017) [45] conceives a probability manifold on which SVGD simulates the gradient flow, but the validity of that artificial manifold is unknown.

ParVIs do not assume a parametric statistical model, but need a kernel (or other treatment). Do they need an assumption / make an approximation?

Do general MCMCs have a flow/optimization interpretation?

Things are made clear on the Wasserstein space.

Page 52

The Wasserstein Space

For a metric space $(\mathcal M, d)$:

$\mathcal P_2(\mathcal M) := \big\{q\text{: distribution on }\mathcal M \,\big|\, \exists\,x_0\in\mathcal M\text{ s.t. }\mathbb E_q[d(x_0, x)^2] < +\infty\big\}$.

$\mathcal P_2(\mathcal M)$ is a metric space ([70], Def 6.4) with the Wasserstein distance

$d_W(q, p) := \Big(\inf_{\pi\in\Pi(q, p)}\mathbb E_{\pi(x, y)}\big[d(x, y)^2\big]\Big)^{1/2}$,

where

$\Pi(q, p) := \Big\{\pi\text{: distribution on }\mathcal M\times\mathcal M \,\Big|\, \int_{\mathcal M}\pi(x, y)\,dy = q(x),\ \int_{\mathcal M}\pi(x, y)\,dx = p(y)\Big\}$.
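As an aside (a sketch, not from the talk): for two equal-size empirical measures with uniform weights, the infimum over couplings is attained at a permutation, so $d_W$ on samples can be computed exactly with an assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# 2-Wasserstein distance between two equal-size empirical measures with
# uniform weights: the optimal coupling is a permutation of the samples.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))      # samples of q
Y = rng.normal(3.0, 1.0, size=(200, 2))      # samples of p

cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # d(x_i, y_j)^2
i, j = linear_sum_assignment(cost)
print(np.sqrt(cost[i, j].mean()))            # ~||mean_q - mean_p|| = sqrt(18)
```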

Page 53

The Wasserstein Space: Riemannian Structure

For a Riem. manif. (M, 〈·, ·〉TxM), P2(M) also has a Riem. str. [58, 70, 6]:

Tangent vector v ⇐⇒ vector field X on M.

Tangent space at $q$: $T_q\mathcal P_2(\mathcal M) = \overline{\{\operatorname{grad} f \mid f\in C_c^\infty(\mathcal M)\}}^{L_q^2(\mathcal M)}$

is a subspace of $L_q^2(\mathcal M) := \big\{X \,\big|\, \mathbb E_{q(x)}\big[\langle X(x), X(x)\rangle_{T_x\mathcal M}\big] < \infty\big\}$.

([70], Thm 13.8; [6], Thm 8.3.1, Def 8.4.1, Prop 8.4.5)

Page 54

The Wasserstein Space: Riemannian Structure

For a Riem. manif. (M, 〈·, ·〉TxM), P2(M) also has a Riem. str. [58, 70, 6]:

Riemannian structure: $T_q\mathcal P_2$ inherits the inner product of $L_q^2$:

$\langle X, Y\rangle_{T_q\mathcal P_2} = \mathbb E_{q(x)}\big[\langle X(x), Y(x)\rangle_{T_x\mathcal M}\big]$.

It is consistent with $d_W$ [8].

Page 55

The Wasserstein Space: Riemannian Structure

Gradient flow on $\mathcal P_2(\mathcal M)$ for $\mathrm{KL}_p(q) := \mathbb E_q[\log(q/p)]$ (using the Riemannian measure):

$\mathcal P_2(\mathcal M)$ as a Riemannian manifold:

$V^{\mathrm{GF}} := -\operatorname{grad}\mathrm{KL}_p(q) = -\operatorname{grad}\Big(\frac{\delta}{\delta q}\mathrm{KL}_p(q)\Big) = \operatorname{grad}\log(p/q)$. ([70], Thm 23.18; [6], Example 11.1.2)

$\mathcal P_2(\mathcal M)$ as a metric space: e.g., the Minimizing Movement Scheme (MMS) ([6], Def. 2.0.6):

$q_{t+\epsilon} = \operatorname{argmin}_{q\in\mathcal P_2(\mathcal M)}\ \mathrm{KL}_p(q) + \frac{1}{2\epsilon}\,d_W^2(q, q_t)$.

They coincide under the Riemannian structure. ([70], Prop. 23.1, Rem. 23.4; [6], Thm. 11.1.6; [24], Lem. 2.7)

Exponential convergence when $p$ is log-concave. ([70], Thm 23.25, Thm 24.7; [6], Thm 11.1.4)

Page 56

Langevin Dynamics as Wasserstein Gradient Flow

The Langevin dynamics

$dx = \nabla\log p(x)\,dt + \sqrt 2\,dB_t(x)$

produces the same [14] evolving distribution $(q_t)$ as

$dx = \nabla\log\big(p(x)/q_t(x)\big)\,dt$,

which is the gradient flow of $\mathrm{KL}_p$ on $\mathcal P_2(\mathcal M)$ for Euclidean $\mathcal M$.

The gradient flow interpretation of LD was known earlier from the MMS perspective [34].

Page 57

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 58

Understanding ParVIs on the Wasserstein Space

Understand and accelerate ParVIs from the Wasserstein gradient flow perspective [43].

Consider Euclidean M = Rm for brevity.

Page 59

SVGD Approximates the Wasserstein Gradient Flow

Reformulate $V^{\mathrm{GF}}$ as:

$V^{\mathrm{GF}} = \max\cdot\operatorname{argmax}_{V\in L_q^2,\ \|V\|_{L_q^2}=1}\big\langle V^{\mathrm{GF}}, V\big\rangle_{L_q^2}. \quad (2)$

We find:

Theorem 10 ($V^{\mathrm{SVGD}}$ approximates $V^{\mathrm{GF}}$)

$V^{\mathrm{SVGD}} = \max\cdot\operatorname{argmax}_{V\in\mathcal H^D,\ \|V\|_{\mathcal H^D}=1}\big\langle V^{\mathrm{GF}}, V\big\rangle_{L_q^2}$.

$\mathcal H^D$ is a subspace of $L_q^2$, so $V^{\mathrm{SVGD}}$ is the projection of $V^{\mathrm{GF}}$ onto $\mathcal H^D$.

Page 60

ParVIs Approx. the Wass. Gradient Flow by Smoothing

Smoothing Functions

SVGD restricts the optimization domain $L_q^2$ to $\mathcal H^D$.

Theorem 11 ($\mathcal H^D$ smooths $L_q^2$)

For $\mathcal M = \mathbb R^D$, a Gaussian kernel $K$ on $\mathcal M$ and an absolutely continuous $q$, the vector-valued RKHS $\mathcal H^D$ of $K$ is isometrically isomorphic to the closure $\mathcal G := \overline{\{\phi * K : \phi\in C_c^\infty\}}^{L_q^2}$.

$\overline{C_c^\infty}^{L_q^2} = L_q^2$ ([36], Thm. 2.11) $\implies$ $\mathcal G$ is roughly the kernel-smoothed $L_q^2$.

Smoothing the Density

The Blob method (w-SGLD-B) [14]: partially smooths the density,

$V^{\mathrm{GF}} = -\nabla\Big(\frac{\delta}{\delta q}\mathbb E_q[\log(q/p)]\Big) \implies V^{\mathrm{Blob}} = -\nabla\Big(\frac{\delta}{\delta q}\mathbb E_q[\log(\tilde q/p)]\Big)$,

where $\tilde q := q * K$ is the kernel-smoothed density.

Page 61

ParVIs Approx. the Wass. Gradient Flow by Smoothing

Equivalence: a smoothing-function objective takes the form $\mathbb E_q[L(V)]$ with $L: L_q^2\to L_q^2$ linear,

$\implies \mathbb E_{\tilde q}[L(V)] = \mathbb E_{q*K}[L(V)] = \mathbb E_q[L(V) * K] = \mathbb E_q[L(V * K)]$,

so smoothing the density is equivalent to smoothing the test function.

Necessity: $\operatorname{grad}\mathrm{KL}_p(q)$ is undefined at $q = \hat q := \frac1N\sum_{i=1}^N\delta_{x^{(i)}}$.

Theorem 12 (Necessity of smoothing for SVGD)

For $q = \hat q$ and $V\in L_p^2$, problem (2),

$\max_{V\in L_p^2,\ \|V\|_{L_p^2}=1}\big\langle V^{\mathrm{GF}}, V\big\rangle_{L_{\hat q}^2}$,

has no optimal solution. In fact the supremum of the objective is infinite, indicating that a maximizing sequence of $V$ tends to be ill-posed.

ParVIs rely on the smoothing assumption! No free lunch!

Page 62

New ParVIs with Smoothing

Gradient Flow with Smoothed Density (GFSD): fully smooth the density:

$V^{\mathrm{GFSD}} := \nabla\log p - \nabla\log\tilde q$.

Gradient Flow with Smoothed test Functions (GFSF):

$V^{\mathrm{GF}} = \nabla\log p - \nabla\log q \implies V^{\mathrm{GF}} = \nabla\log p + \operatorname{argmin}_{U\in L^2}\max_{\phi\in C_c^\infty,\ \|\phi\|_{L_q^2}=1}\big(\mathbb E_q[\phi\cdot U - \nabla\cdot\phi]\big)^2$.

Smooth $\phi$: take $\phi$ from $\mathcal H^D$:

$V^{\mathrm{GFSF}} := \nabla\log p + \operatorname{argmin}_{U\in L^2}\max_{\phi\in\mathcal H^D,\ \|\phi\|_{\mathcal H^D}=1}\big(\mathbb E_q[\phi\cdot U - \nabla\cdot\phi]\big)^2$.

Solution: $V^{\mathrm{GFSF}} = V + K'K^{-1}$. (Note $V^{\mathrm{SVGD}} = V^{\mathrm{GFSF}}K$.) Here $V_{:,i} = \nabla_{x^{(i)}}\log p(x^{(i)})$, $K_{ij} = K(x^{(i)}, x^{(j)})$, $K'_{:,i} = \sum_j\nabla_{x^{(j)}}K(x^{(j)}, x^{(i)})$.
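A minimal sketch of the GFSD update for a 1-D standard-normal target, with $\tilde q$ the Gaussian-kernel density estimate of the particles (bandwidth and step size are illustrative):

```python
import numpy as np

# GFSD: V = grad log p - grad log (q * K), with q * K estimated as a
# Gaussian kernel-density estimate of the particles; target p = N(0, 1).
rng = np.random.default_rng(0)
x = rng.uniform(-6, 6, size=100)
grad_log_p = lambda x: -x
h, eps = 0.5, 0.05                          # bandwidth and step size (assumed)

for _ in range(500):
    diff = x[:, None] - x[None, :]          # diff[i, j] = x_i - x_j
    K = np.exp(-diff**2 / (2 * h))
    q_tilde = K.sum(axis=1)                 # unnormalized q * K at x_i
    grad_q = (-diff / h * K).sum(axis=1)    # its derivative at x_i
    x = x + eps * (grad_log_p(x) - grad_q / q_tilde)

print(x.mean(), x.var())                    # particles approximate N(0, 1)
```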

Page 63

Bandwidth Selection via the Heat Equation

Note

Under the dynamics $dx = -\nabla\log q_t(x)\,dt$, $q_t$ evolves following the heat equation (HE): $\partial_t q_t(x) = \Delta q_t(x)$.

Smoothing the density: $q_t(x)\approx\tilde q(x) = \tilde q(x; \{x^{(i)}\}_{i=1}^N)$. Then for $q_{t+\epsilon}(x)$:

Due to the HE, $q_{t+\epsilon}(x)\approx\tilde q(x) + \epsilon\Delta\tilde q(x)$.

Due to the effect of the dynamics, the updated particles $\{x^{(i)} - \epsilon\nabla\log\tilde q(x^{(i)})\}_{i=1}^N$ approximate $q_{t+\epsilon}$, so $q_{t+\epsilon}(x)\approx\tilde q\big(x; \{x^{(i)} - \epsilon\nabla\log\tilde q(x^{(i)})\}_{i=1}^N\big)$.

Objective: $\sum_k\Big(\tilde q(x^{(k)}) + \epsilon\Delta\tilde q(x^{(k)}) - \tilde q\big(x^{(k)}; \{x^{(i)} - \epsilon\nabla\log\tilde q(x^{(i)})\}_{i=1}^N\big)\Big)^2$.

Take $\epsilon\to 0$ and make the objective dimensionless ($h/x^2$ is dimensionless):

$\frac{1}{h^{D+2}}\sum_k\Big[\Delta\tilde q\big(x^{(k)}; \{x^{(i)}\}_i\big) + \sum_j\nabla_{x^{(j)}}\tilde q\big(x^{(k)}; \{x^{(i)}\}_i\big)\cdot\nabla\log\tilde q\big(x^{(j)}; \{x^{(i)}\}_i\big)\Big]^2$.

Also applicable to smoothing functions.

Page 64

Bandwidth Selection via the Heat Equation

Figure: Comparison of HE (bottom row) with the median method (top row) for bandwidth selection, across SVGD, Blob, GFSD, and GFSF (columns).

Page 65

Accelerated First-Order Methods on the Wasserstein Space

Nesterov's Acceleration Methods on Riemannian Manifolds: $r_k\in\mathcal P_2(\mathcal M)$ is an auxiliary variable; $V_k := -\operatorname{grad}\mathrm{KL}(r_k)$.

Riemannian Accelerated Gradient (RAG) [47] (with simplification):

$\begin{cases} q_k = \mathrm{Exp}_{r_{k-1}}(\epsilon V_{k-1}), \\ r_k = \mathrm{Exp}_{q_k}\Big[-\Gamma_{r_{k-1}}^{q_k}\Big(\frac{k-1}{k}\,\mathrm{Exp}_{r_{k-1}}^{-1}(q_{k-1}) - \frac{k+\alpha-2}{k}\,\epsilon V_{k-1}\Big)\Big]. \end{cases}$

Riemannian Nesterov's method (RNes) [74] (with simplification):

$\begin{cases} q_k = \mathrm{Exp}_{r_{k-1}}(\epsilon V_{k-1}), \\ r_k = \mathrm{Exp}_{q_k}\Big\{c_1\,\mathrm{Exp}_{q_k}^{-1}\Big[\mathrm{Exp}_{r_{k-1}}\Big((1-c_2)\,\mathrm{Exp}_{r_{k-1}}^{-1}(q_{k-1}) + c_2\,\mathrm{Exp}_{r_{k-1}}^{-1}(q_k)\Big)\Big]\Big\}. \end{cases}$

Required:

Exponential map $\mathrm{Exp}_q: T_q\mathcal P_2(\mathcal M)\to\mathcal P_2(\mathcal M)$ and its inverse.

Parallel transport $\Gamma_q^r: T_q\mathcal P_2(\mathcal M)\to T_r\mathcal P_2(\mathcal M)$.

Page 66

Accelerated First-Order Methods on the Wasserstein Space

Leveraging the Riemannian Structure of $\mathcal P_2(\mathcal M)$:

Exponential map ([70], Coro. 7.22; [6], Prop. 8.4.6; [24], Prop. 2.1): $\mathrm{Exp}_q(V) = (\mathrm{id} + V)_\# q$, i.e.,

$\{x^{(i)}\}_i\sim q \implies \{x^{(i)} + V(x^{(i)})\}_i\sim\mathrm{Exp}_q(V)$.

Inverse exponential map: requires the optimal transport map.

Sinkhorn methods [17, 72] appear costly and unstable. Instead, make approximations when $\{x^{(i)}\}_i$ and $\{y^{(i)}\}_i$ are pairwise close: $d(x^{(i)}, y^{(i)}) \ll \min\big\{\min_{j\neq i} d(x^{(i)}, x^{(j)}),\ \min_{j\neq i} d(y^{(i)}, y^{(j)})\big\}$.

Proposition 13 (Inverse exponential map)

For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big(\mathrm{Exp}_q^{-1}(r)\big)(x^{(i)})\approx y^{(i)} - x^{(i)}$.

Page 67

Accelerated First-Order Methods on the Wasserstein Space

Leveraging the Riemannian Structure of P2(M):

Parallel transport:

Analytical results [49, 50] are hard to implement; use Schild's ladder method [23, 35] for approximation.

Proposition 14 (Parallel transport)

For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big(\Gamma_q^r(V)\big)(y^{(i)})\approx V(x^{(i)})$ for all $V\in T_q\mathcal P_2$.

Page 68

Accelerated First-Order Methods on the Wasserstein Space

Algorithm 3 The acceleration framework with Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov's method (WNes)

1: WAG: select acceleration factor $\alpha > 3$; WNes: select or calculate $c_1, c_2\in\mathbb R_+$;
2: Initialize $\{x_0^{(i)}\}_{i=1}^N$ distinctly; let $y_0^{(i)} = x_0^{(i)}$;
3: for $k = 1, 2, \cdots, k_{\max}$ do
4:   for $i = 1, \cdots, N$ do
5:     Find $V(y_{k-1}^{(i)})$ by SVGD/Blob/GFSD/GFSF;
6:     $x_k^{(i)} = y_{k-1}^{(i)} + \epsilon V(y_{k-1}^{(i)})$;
7:     $y_k^{(i)} = x_k^{(i)} + \begin{cases}\text{WAG: }\frac{k-1}{k}\big(y_{k-1}^{(i)} - x_{k-1}^{(i)}\big) + \frac{k+\alpha-2}{k}\,\epsilon V(y_{k-1}^{(i)}); \\ \text{WNes: } c_1(c_2 - 1)\big(x_k^{(i)} - x_{k-1}^{(i)}\big);\end{cases}$
8:   end for
9: end for
10: Return $\{x_{k_{\max}}^{(i)}\}_{i=1}^N$.
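A schematic rendering of the WNes branch of Algorithm 3 (a sketch: V stands for any of the SVGD/Blob/GFSD/GFSF updates above; the constants are illustrative):

```python
import numpy as np

# WNes branch of Algorithm 3: x_k = y_{k-1} + eps * V(y_{k-1});
# y_k = x_k + c1 * (c2 - 1) * (x_k - x_{k-1}).  V maps a particle array to
# a velocity array (e.g., an SVGD/Blob/GFSD/GFSF update).
def wnes(V, x0, eps=0.05, c1=0.9, c2=1.5, n_iter=500):
    x, y = x0.copy(), x0.copy()
    for _ in range(n_iter):
        x_new = y + eps * V(y)                    # gradient step at the lookahead
        y = x_new + c1 * (c2 - 1) * (x_new - x)   # Nesterov-style momentum
        x = x_new
    return x

# Usage with a trivial "vector field" flowing each particle to the mode:
rng = np.random.default_rng(0)
print(wnes(lambda y: -y, rng.uniform(-6, 6, size=100)).std())
```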

Page 69

Accelerated First-Order Methods on the Wasserstein Space

Experimental results: Bayesian inference for Latent Dirichlet Allocation:

Figure: Acceleration effect of WAG and WNes on LDA (measured by hold-out perplexity), for SVGD, Blob, GFSD, and GFSF, each with WGD/PO/WAG/WNes variants.

Figure: Comparison of SVGD and SGNHT on LDA, as representatives of ParVIs and MCMCs. Average over 10 runs.

Page 70

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 71

Understanding MCMCs on the Wasserstein Space

Understanding MCMC dynamics as flows on the Wasserstein Space [44]:

The Langevin dynamics (LD) is recognized as the Wasserstein gradient flow of the KL divergence [34].

This benefits the analysis of its asymptotic [63] and non-asymptotic [22, 16] behaviors, and relates it to ParVIs [14, 43].

Does a general MCMC dynamics correspond to an interpretable flow on the Wasserstein space?

Page 72

The First Reformulation

Lemma 15 (Equivalent deterministic MCMC dynamics)

A general MCMC dynamics specified by a symm. pos. semi-def. $D$ and skew-symm. $Q$ via Eq. (1) produces the same distribution evolution as the deterministic dynamics

$dx = W_t(x)\,dt, \qquad (W_t)^i(x) = D^{ij}(x)\,\partial_j\log\big(p(x)/q_t(x)\big) + Q^{ij}(x)\,\partial_j\log p(x) + \partial_j Q^{ij}(x), \quad (3)$

where $q_t$ is the distribution density of $x$ at time $t$.

$\implies$ Barbour's generator [7]: $\mathcal A f := \frac{d}{dt}\mathbb E_{q_t}[f]\big|_{q_t=\delta_x} = \frac1p\,\partial_j\big[p\,(D^{ij} + Q^{ij})(\partial_i f)\big]$ (c.f. [29]).

How to interpret $W_t(x)$?

Page 75

Interpret MCMC Dynamics

$(W_t)^i(x) = D^{ij}(x)\,\partial_j\log\big(p(x)/q_t(x)\big) + Q^{ij}(x)\,\partial_j\log p(x) + \partial_j Q^{ij}(x)$.

1. $D^{ij}(x)\,\partial_j\log\big(p(x)/q_t(x)\big)$ seems like a gradient flow on $\mathcal P_2(\mathcal M)$:

Euclidean $\mathcal M$: $D = I$.

Hilbert $\mathcal M$: constant and non-singular $D$.

Riemannian $\mathcal M$: non-singular $D(x)$.

But we need positive semi-definite $D(x)$.

Page 76

Interpret MCMC Dynamics

Fiber bundle $\mathcal M$ (of dim. $M = m + n$) (known knowledge):

$\mathcal M$ is locally $\mathcal M_0\times\mathcal F$ ($\dim(\mathcal M_0) = m$, $\dim(\mathcal F) = n$) [57], in terms of a projection $\varpi$:

$\varpi: \mathcal M\to\mathcal M_0$, locally $\iff$ $\mathcal M_0\times\mathcal F\to\mathcal M_0$.

The fiber through $y\in\mathcal M_0$: $\mathcal M_y := \varpi^{-1}(y)$ (diffeomorphic to $\mathcal F$).

Coordinate decomposition: $x = (y, z)$, where $y\in\mathbb R^m$ are coordinates of $\mathcal M_0$ and $z\in\mathbb R^n$ are coordinates of $\mathcal M_y$.

Page 77

Interpret MCMC Dynamics

Fiber-Riemannian manifold $\mathcal M$:

Definition 3 (Fiber-Riemannian manifold)

$\mathcal M$ is a fiber-Riemannian manifold if it is a fiber bundle and there is a Riemannian structure $g^{\mathcal M_y}$ on each fiber $\mathcal M_y$.

Page 78

Interpret MCMC Dynamics

Gradient on the fiber $\mathcal M_y$:

$\big(\operatorname{grad}^{\mathcal M_y} f(y, z)\big)^a = \big(g^{\mathcal M_y}(z)\big)^{ab}\,\partial_{z^b} f(y, z), \quad 1\le a, b\le n$.

Define the fiber-gradient on $\mathcal M$ by taking the union over $y$:

$\big(\operatorname{grad}^{\mathrm{fib}} f(x)\big)_M := \Big(0_m,\ \big(\operatorname{grad}^{\mathcal M_{\varpi(x)}} f(\varpi(x), z)\big)_n\Big)$.

Page 79

Interpret MCMC Dynamics

(Wt)i(x) = Dij(x) ∂j log(p(x)/qt(x)) +Qij(x) ∂j log p(x) + ∂jQ

ij(x).

1 Dij(x) ∂j log(p(x)/qt(x)) seems like a gradient flow on P2(M).

Fiber-Riemannian manifold M:

Definition 3 (Fiber-Riemannian manifold)

M is a fiber-Riemannian manifold if it is a fiberbundle and there is a Riemannian structure gMy

on each fiber My.

Alternatively, the fiber-gradient on $\mathcal{M}$ is:
$$\big(\mathrm{grad}^{\mathrm{fib}} f(x)\big)^i = g^{ij}(x)\,\partial_j f(x), \quad 1 \le i, j \le M, \qquad \big(g^{ij}(x)\big)_{M\times M} := \begin{pmatrix} 0_{m\times m} & 0_{m\times n} \\ 0_{n\times m} & \big((g^{\mathcal{M}_{\varpi(x)}}(z))^{ab}\big)_{n\times n} \end{pmatrix}. \tag{4}$$
We use $g$ to denote the fiber-Riemannian structure.
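To make Eq. (4) concrete, here is a minimal NumPy sketch (my own illustration, not from the deck): a constant fiber metric, a simplifying assumption, is embedded into the $M \times M$ block structure, which annihilates the base ($y$) components of a gradient.

```python
import numpy as np

# Sketch of Eq. (4): g = diag(0_{m×m}, fiber block), so applying g to a
# full gradient keeps only the fiber (z) components.
def fiber_gradient(grad_f, m, g_fiber):
    M = m + g_fiber.shape[0]
    g = np.zeros((M, M))
    g[m:, m:] = g_fiber      # only the fiber block is nonzero
    return g @ grad_f

# Hypothetical example with m = 1 base dim and n = 2 fiber dims:
grad_f = np.array([0.3, -1.2, 0.7])          # ∂_j f at some x = (y, z)
print(fiber_gradient(grad_f, 1, np.eye(2)))  # -> [ 0.  -1.2  0.7]
```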


Structures on $\mathcal{P}_2(\mathcal{M})$ with fiber-Riemannian $\mathcal{M}$:

Hard to decompose $\mathcal{P}_2(\mathcal{M})$ directly, but
$$\{q(z|y) \in \mathcal{P}_2(\mathcal{M}_y) \mid y \in \mathcal{M}_0\} \;\overset{\text{locally}}{\Longleftrightarrow}\; \mathcal{M}_0 \times \mathcal{P}_2(\mathcal{M}_y): \text{fiber-Riemannian!}$$
On $\mathcal{P}_2(\mathcal{M}_y)$,
$$\big(\mathrm{grad}\,\mathrm{KL}_{p(\cdot|y)}(q(\cdot|y))(z)\big)^a = \big(g^{\mathcal{M}_y}(z)\big)^{ab}\,\partial_{z^b}\log\frac{q(z|y)}{p(z|y)} = \big(g^{\mathcal{M}_y}(z)\big)^{ab}\,\partial_{z^b}\log\frac{q(y,z)}{p(y,z)}, \quad 1 \le a, b \le n,$$
where the second equality holds because the marginals $q(y)$ and $p(y)$ are constant in $z$.
Taking the union over $y \in \mathcal{M}_0$, the fiber-gradient on $\mathcal{P}_2(\mathcal{M})$ is:
$$\big(\mathrm{grad}^{\mathrm{fib}}\,\mathrm{KL}_p(q)(x)\big)_M = \big(g^{ij}(x)\,\partial_j\log(q(x)/p(x))\big)_M.$$


Assumption 4 (Regular MCMC dynamics (1/2))
(a) $D = C$ or $D = 0$ or $D = \begin{pmatrix} 0 & 0 \\ 0 & C \end{pmatrix}$, for a symm. positive definite $C(x)$.
(b) …

Satisfied by existing MCMC instances. Could be relaxed by coordinate transformation.
$\Longrightarrow$ $D_{ij}\,\partial_j\log(p/q_t)$ is the fiber-gradient with fiber-Riemannian support $(\mathcal{M}, g)$ where $(g^{ij}) = D$.


2. $Q_{ij}(x)\,\partial_j\log p(x) + \partial_j Q_{ij}(x)$ makes a Hamiltonian flow.

The common Hamiltonian flow: $\mathcal{M} = \mathbb{R}^{2\ell}$, $Q = \begin{pmatrix} 0 & I_\ell \\ -I_\ell & 0 \end{pmatrix}$.

Symplectic manifold [18, 52]: $\mathcal{M}$ even-dim., $Q$ non-singular.
Poisson manifold $\mathcal{M}$ [25]:
Poisson structure: a bivector field $\beta = \beta^{ij}\,\partial_i \otimes \partial_j = \sum_{i<j} \beta^{ij}\,\partial_i \wedge \partial_j$ (an anti-symm. 2nd-order contravariant tensor field; $(\beta^{ij})$ is skew-symm.) that satisfies the Jacobi identity:
$$\beta^{il}\,\partial_l\beta^{jk} + \beta^{jl}\,\partial_l\beta^{ki} + \beta^{kl}\,\partial_l\beta^{ij} = 0, \quad \forall i, j, k. \tag{5}$$

Hamiltonian flow $X_f$ of a smooth function $f$:
$$\big(X_f(x)\big)[h] := \big(\beta(\mathrm{d}f, \mathrm{d}h)\big)(x) = \beta^{ij}(x)\,\partial_i f(x)\,\partial_j h(x).$$
Coordinate expression: $\big(X_f(x)\big)^i = \beta^{ij}(x)\,\partial_j f(x)$.
$X_f$ conserves $f$: $\frac{\mathrm{d}}{\mathrm{d}t} f(\varphi_t) = 0$ along its flow $\varphi_t$.
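A small numerical check (my own sketch, not from the deck): for a constant skew-symmetric $\beta$, which satisfies Eq. (5) trivially, integrating the coordinate flow $(X_f)^i = \beta^{ij}\partial_j f$ leaves $f$ numerically constant; the quadratic $f$ here is a hypothetical choice.

```python
import numpy as np

beta = np.array([[0.0, 1.0], [-1.0, 0.0]])   # constant skew-symm. Poisson structure
f = lambda x: 0.5 * x @ x                    # Hamiltonian f(x) = |x|^2 / 2
grad_f = lambda x: x

def rk4_step(x, h):
    v = lambda y: beta @ grad_f(y)           # (X_f)^i = β^{ij} ∂_j f
    k1 = v(x); k2 = v(x + h/2*k1); k3 = v(x + h/2*k2); k4 = v(x + h*k3)
    return x + h/6 * (k1 + 2*k2 + 2*k3 + k4)

x = np.array([1.0, 0.0])
for _ in range(1000):
    x = rk4_step(x, 0.01)
print(f(x))  # ≈ 0.5: f is conserved up to integration error
```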


Poisson structure on $\mathcal{P}_2(\mathcal{M})$ [49, 5, 26] (known knowledge): the Hamiltonian flow of a function $F$ on $\mathcal{P}_2(\mathcal{M})$ is
$$X_F(q) = \pi_q(X_f),$$
where the function $f$ on $\mathcal{M}$ relates to $F$ via $\mathrm{grad}_q\,\mathbb{E}_q[f] = \mathrm{grad}_q F(q)$, and $\pi_q$ is the orthogonal projection $L^2_q(\mathcal{M}) \to T_q\mathcal{P}_2(\mathcal{M})$, which does not change the distribution evolution.

Hamiltonian flow of KL on $\mathcal{P}_2(\mathcal{M})$:

Lemma 2 (Hamiltonian flow of KL on $\mathcal{P}_2(\mathcal{M})$)
The Hamiltonian flow of $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$ is $X_{\mathrm{KL}_p}(q) = \pi_q(X_{\log(q/p)})$, where $\big(X_{\log(q/p)}(x)\big)^i = \beta^{ij}(x)\,\partial_j\log(q(x)/p(x))$.


$$-\big(X_{\log(q/p)}(x)\big)^i = \beta^{ij}(x)\,\partial_j\log p(x) - \beta^{ij}(x)\,\partial_j\log q(x).$$

Assumption 4 (Regular MCMC dynamics (2/2))
(a) $D = C$ or $D = 0$ or $D = \begin{pmatrix} 0 & 0 \\ 0 & C \end{pmatrix}$, for a symm. positive definite $C(x)$.
(b) $Q(x)$ satisfies Eq. (5): $Q^{il}\,\partial_l Q^{jk} + Q^{jl}\,\partial_l Q^{ki} + Q^{kl}\,\partial_l Q^{ij} = 0,\ \forall i, j, k$.

Satisfied by MCMCs except for SGNHT-related methods [20, 75]. Required to match the Poisson structure; unnecessary for conservation of the Hamiltonian.

$Q_{ij}\,\partial_j\log p + \partial_j Q_{ij} \overset{?}{\Longleftrightarrow} Q_{ij}\,\partial_j\log p - Q_{ij}\,\partial_j\log q$? Yes! The two fields differ by $Q_{ij}\,\partial_j\log q + \partial_j Q_{ij}$, whose $q$-weighted divergence vanishes by the skew-symmetry of $Q$, so both produce the same distribution evolution.
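As a hedged sketch (mine, not the authors'), condition (b) can be checked numerically for a given field $Q(x)$ via finite differences; any constant skew-symmetric $Q$, like the hypothetical one below, satisfies Eq. (5) trivially.

```python
import numpy as np

# Finite-difference check of the Jacobi-type condition (Eq. (5)) for Q(x).
def jacobi_residual(Q_fn, x, eps=1e-5):
    M = len(x)
    def dQ(l, i, j):  # ∂_l Q^{ij}(x), central difference
        e = np.zeros(M); e[l] = eps
        return (Q_fn(x + e)[i, j] - Q_fn(x - e)[i, j]) / (2 * eps)
    Q = Q_fn(x)
    res = 0.0
    for i in range(M):
        for j in range(M):
            for k in range(M):
                s = sum(Q[i, l] * dQ(l, j, k) + Q[j, l] * dQ(l, k, i)
                        + Q[k, l] * dQ(l, i, j) for l in range(M))
                res = max(res, abs(s))
    return res

Q_fn = lambda x: np.array([[0.0, 1.0], [-1.0, 0.0]])  # constant => Eq. (5) holds
print(jacobi_residual(Q_fn, np.array([0.3, -0.7])))   # ≈ 0
```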


Interpret MCMC Dynamics: Main Theorem

Theorem 5 (Equivalence between regular MCMC dynamics on $\mathbb{R}^M$ and fGH flows on $\mathcal{P}_2(\mathcal{M})$)

We call $(\mathcal{M}, g, \beta)$ a fiber-Riemannian Poisson (fRP) manifold, and define the fiber-gradient Hamiltonian (fGH) flow on $\mathcal{P}_2(\mathcal{M})$ as:
$$\mathcal{W}_{\mathrm{KL}_p} := -\pi(\mathrm{grad}^{\mathrm{fib}}\,\mathrm{KL}_p) - X_{\mathrm{KL}_p}, \qquad \big(\mathcal{W}_{\mathrm{KL}_p}(q)\big)^i = \pi_q\big((g^{ij} + \beta^{ij})\,\partial_j\log(p/q)\big). \tag{6}$$
Then: regular MCMC dynamics $\Longleftrightarrow$ fGH flow with fRP $\mathcal{M}$, $(D, Q) \Longleftrightarrow (g, \beta)$.
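Under the simplifying assumptions of constant $D$ and $Q$, and dropping the projection $\pi_q$ (which, as noted above, does not change the distribution evolution), the fGH drift of Eq. (6) reduces to a one-liner. This is an illustrative sketch of mine, not the full construction:

```python
import numpy as np

# Sketch of the fGH drift in Eq. (6) with constant (D, Q) <-> (g, β):
#   W(x) = (D + Q) (∇log p(x) - ∇log q(x))
def fgh_drift(x, D, Q, grad_log_p, grad_log_q):
    return (D + Q) @ (grad_log_p(x) - grad_log_q(x))
```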


Interpret MCMC Dynamics: Case Study

Type 1: D is non-singular (m = 0 in Eq. (4)).

M0 degenerates, M is the unique fiber.

M is Riemannian, fiber gradient =⇒ gradient.

The fGH flow: $\mathcal{W}_{\mathrm{KL}_p} = -\pi(\mathrm{grad}\,\mathrm{KL}_p) - X_{\mathrm{KL}_p}$:
$-\pi(\mathrm{grad}\,\mathrm{KL}_p)$: the steepest-descent direction of $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$.
$-X_{\mathrm{KL}_p}$: conserves $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$ and helps mixing/exploration.

Converges to p uniquely (c.f. [51]).

Robust to SG (c.f. [65, 69]).

Instances:

LD [62] / SGLD [71]: $Q = 0$, $\mathcal{M}$ is Euclidean.
RLD [28] / SGRLD [60]: $Q = 0$, $\mathcal{M}$ is the manifold under consideration.
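For reference, a minimal stochastic-simulation sketch of the Type-1 instance (SG)LD, i.e., $D = I$, $Q = 0$ on Euclidean $\mathcal{M}$; `grad_log_p` is a user-supplied (possibly stochastic) gradient, and the step size is a hypothetical choice.

```python
import numpy as np

# One (SG)LD step: x <- x + ε ∇log p(x) + sqrt(2ε) ξ,  ξ ~ N(0, I).
def sgld_step(x, grad_log_p, eps=1e-3):
    return x + eps * grad_log_p(x) + np.sqrt(2 * eps) * np.random.randn(*x.shape)
```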


Type 2: D = 0 (n = 0 in Eq. (4)).

M0 =M, fibers degenerate.

M has no (fiber-)Riemannian structures.

The fGH flow: $\mathcal{W}_{\mathrm{KL}_p} = -X_{\mathrm{KL}_p}$ conserves $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$ and helps mixing/exploration.

Fragile against SG: no stabilizing forces (i.e., (fiber-)gradient flows) (c.f. [15, 9]).

Hard to extend to ParVIs.

Instances ($\ell$-dim. sample space $\mathcal{S}$):
HMC [21, 56, 10] ($\mathcal{S} = \mathbb{R}^\ell$): $\mathcal{M} = \mathbb{R}^{2\ell}$.
HMC relies on geometric ergodicity for convergence [48, 10].
RHMC [28] / LagrMC [38] / GMC [12] (manifold $\mathcal{S}$): $\mathcal{M} = T^*\mathcal{S}$.


Type 3: $D \neq 0$ and $D$ is singular ($m, n \ge 1$ in Eq. (4)).

Non-degenerate M0 and My.

M is a non-trivial fRP manifold.

The fGH flow: $\mathcal{W}_{\mathrm{KL}_p} := -\pi(\mathrm{grad}^{\mathrm{fib}}\,\mathrm{KL}_p) - X_{\mathrm{KL}_p}$:
$-\pi(\mathrm{grad}^{\mathrm{fib}}\,\mathrm{KL}_p)$: the steepest-descent direction of $\mathrm{KL}_{p(\cdot|y)}(q(\cdot|y))$ on each fiber $\mathcal{P}_2(\mathcal{M}_y)$.
$-X_{\mathrm{KL}_p}$: conserves $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$ and helps mixing/exploration.

Robust to SG (SG appears on each fiber) (c.f. [15, 13]).

Instances ($\ell$-dim. sample space $\mathcal{S}$):
SGHMC [15] ($\mathcal{S} = \mathbb{R}^\ell$), SGRHMC [51] / SGGMC [42] (manifold $\mathcal{S}$): $\mathcal{M}_0 = \mathcal{S}$, $\mathcal{M}_\theta = T^*_\theta\mathcal{S}$.
SGNHT [20] ($\mathcal{S} = \mathbb{R}^\ell$), gSGNHT [42] (manifold $\mathcal{S}$): $\mathcal{M}_0 = \mathcal{S}$, $\mathcal{M}_\theta = \mathbb{R} \times T^*_\theta\mathcal{S}$.


ParVI Simulation for SGHMC

Simulate the deterministic dynamics of SGHMC:

By Lemma 15 (Eq. (3)):
$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \Sigma^{-1} r, \qquad \frac{\mathrm{d}r}{\mathrm{d}t} = \nabla_\theta \log p(\theta) - C\Sigma^{-1} r - C\nabla_r \log q(r).$$
By Theorem 5 (Eq. (6)):
$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \Sigma^{-1} r + \nabla_r \log q(r), \qquad \frac{\mathrm{d}r}{\mathrm{d}t} = \nabla_\theta \log p(\theta) - C\Sigma^{-1} r - C\nabla_r \log q(r) - \nabla_\theta \log q(\theta).$$

To estimate $\nabla \log q$ with particles, use ParVI techniques [43], e.g., Blob [14]:
$$-\nabla_r \log q(r^{(i)}) \approx -\frac{\sum_k \nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(i,j)}} - \sum_k \frac{\nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(j,k)}},$$
where $K_r^{(i,j)} := K_r(r^{(i)}, r^{(j)})$.
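A minimal NumPy sketch of this Blob estimate (assuming an RBF kernel $K(r, r') = \exp(-\|r - r'\|^2/(2h))$ with a hand-picked bandwidth, rather than the HE method used in the experiments):

```python
import numpy as np

def blob_neg_grad_log_q(R, h=0.1):
    """Blob estimate of -∇log q at each particle; R has shape (N, d)."""
    diff = R[:, None, :] - R[None, :, :]            # r^(i) - r^(k), shape (N, N, d)
    K = np.exp(-np.sum(diff**2, axis=-1) / (2*h))   # K^(i,k), shape (N, N)
    gradK = -diff / h * K[..., None]                # ∇_{r^(i)} K^(i,k)
    term1 = gradK.sum(1) / K.sum(1, keepdims=True)        # Σ_k ∇K / Σ_j K^(i,j)
    term2 = (gradK / K.sum(0)[None, :, None]).sum(1)      # Σ_k ∇K / Σ_j K^(j,k)
    return -term1 - term2

R = np.random.randn(50, 2)           # 50 hypothetical particles in R^2
print(blob_neg_grad_log_q(R).shape)  # -> (50, 2)
```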


pSGHMC-det:
$$\frac{\Delta\theta^{(i)}}{\varepsilon} = \Sigma^{-1} r^{(i)}, \qquad \frac{\Delta r^{(i)}}{\varepsilon} = \nabla_\theta \log p(\theta^{(i)}) - C\Sigma^{-1} r^{(i)} - C\left(\frac{\sum_k \nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(i,j)}} + \sum_k \frac{\nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(j,k)}}\right).$$

pSGHMC-fGH:
$$\frac{\Delta\theta^{(i)}}{\varepsilon} = \Sigma^{-1} r^{(i)} + \frac{\sum_k \nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(i,j)}} + \sum_k \frac{\nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(j,k)}},$$
$$\frac{\Delta r^{(i)}}{\varepsilon} = \nabla_\theta \log p(\theta^{(i)}) - \left(\frac{\sum_k \nabla_{\theta^{(i)}} K_\theta^{(i,k)}}{\sum_j K_\theta^{(i,j)}} + \sum_k \frac{\nabla_{\theta^{(i)}} K_\theta^{(i,k)}}{\sum_j K_\theta^{(j,k)}}\right) - C\Sigma^{-1} r^{(i)} - C\left(\frac{\sum_k \nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(i,j)}} + \sum_k \frac{\nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(j,k)}}\right).$$

Advantages:
Over SGHMC: particle-efficiency, and ParVI techniques like HE [43] apply.
Over ParVIs: more efficient dynamics than LD.
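Putting the pieces together, here is a hedged sketch of one pSGHMC-det step (assuming scalar $\Sigma^{-1}$ and $C$, and reusing the `blob_neg_grad_log_q` sketch above; `grad_log_p` stands in for the, possibly stochastic, $\nabla_\theta \log p$):

```python
import numpy as np

def psghmc_det_step(theta, r, grad_log_p, blob_neg_grad_log_q,
                    eps=0.01, sigma_inv=1.0, c=0.5):
    """One pSGHMC-det update for particle arrays (theta, r) of shape (N, d)."""
    theta = theta + eps * sigma_inv * r
    r = r + eps * (grad_log_p(theta)
                   - c * sigma_inv * r
                   + c * blob_neg_grad_log_q(r))  # -C ∇_r log q(r), Blob estimate
    return theta, r
```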


Experimental Results: Synthetic

Figure: Dynamics simulation results. Rows correspond to Blob, SGHMC, pSGHMC-det, pSGHMC-fGH, respectively. All methods adopt the same step size 0.01, and SGHMC-related methods share the same $\Sigma^{-1} = 1.0$, $C = 0.5$. In each row, figures are plotted every 300 iterations, and the last one for 10,000 iterations. The HE method [43] is used for bandwidth selection.


Experimental Results: Latent Dirichlet Allocation (LDA)

Figure: Performance on LDA with the ICML data set. (a) Learning curve: hold-out perplexity vs. iteration (20 particles) for Blob, SGHMC, pSGHMC-det, and pSGHMC-fGH. (b) Particle efficiency: hold-out perplexity vs. number of particles at iteration 600 for SGHMC, pSGHMC-det, and pSGHMC-fGH.


Experimental Results: Bayesian Neural Networks (BNNs)

Figure: Performance on BNN with the MNIST data set. (a) Learning curve: error rate vs. epoch (10 particles) for Blob, SGHMC, pSGHMC-det, and pSGHMC-fGH. (b) Particle efficiency: error rate vs. number of particles at epoch 80 for SGHMC, pSGHMC-det, and pSGHMC-fGH.


Thanks! Questions?


References

Ralph Abraham and Jerrold E. Marsden. Foundations of mechanics. Benjamin/Cummings Publishing Company, Reading, Massachusetts, 1978.

Ralph Abraham, Jerrold E. Marsden, and Tudor Ratiu. Manifolds, tensor analysis, and applications, volume 75. Springer Science & Business Media, New York, 2012.

Shun-Ichi Amari. Information geometry and its applications. Springer, Tokyo, 2016.

Shun-Ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Society, Providence, Rhode Island, 2007.

Luigi Ambrosio and Wilfrid Gangbo. Hamiltonian ODEs in the Wasserstein space of probability measures. Communications on Pure and Applied Mathematics, 61(1):18–53, 2008.

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, Berlin, 2008.

Andrew D. Barbour. Stein's method for diffusion approximations. Probability Theory and Related Fields, 84(3):297–322, 1990.

Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.

Michael Betancourt. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 533–540, Lille, France, 2015.

Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.

Marcus A. Brubaker, Mathieu Salzmann, and Raquel Urtasun. A family of MCMC methods on implicitly defined manifolds. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS 2012), pages 161–172, La Palma, Canary Islands, 2012.

Simon Byrne and Mark Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.

Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2269–2277, Montreal, Canada, 2015.

Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2018), Monterey, California, USA, 2018.

Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1683–1691, Beijing, China, 2014.

Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. arXiv preprint arXiv:1705.09048, 2017.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, Lake Tahoe, Nevada, USA, 2013.

Ana Cannas Da Silva. Lectures on symplectic geometry, volume 3575. Springer, Boston, 2001.

Tim R. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M. Tomczak. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.

Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, Montreal, Canada, 2014.

Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.

J. Ehlers, F. Pirani, and A. Schild. The geometry of free fall and light propagation. In General Relativity (papers in honour of J. L. Synge), pages 63–84, 1972.

Matthias Erbar et al. The heat equation on manifolds as a gradient flow in the Wasserstein space. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 46, pages 1–23, Paris, 2010.

Rui Loja Fernandes and Ioan Mărcuț. Lectures on Poisson Geometry. Springer, Basel, 2014.

Wilfrid Gangbo, Hwa Kil Kim, and Tommaso Pacini. Differential forms on Wasserstein space and infinite-dimensional Hamiltonian systems. American Mathematical Society, Providence, Rhode Island, 2010.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Readings in Computer Vision, pages 564–584. Elsevier, Los Altos, California, USA, 1987.

Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.

Jackson Gorham, Andrew B. Duncan, Sebastian J. Vollmer, and Lester Mackey. Measuring sample quality with diffusions. arXiv preprint arXiv:1611.06972, 2016.

Daniele Grattarola, Lorenzo Livi, and Cesare Alippi. Adversarial autoencoders with constant-curvature latent manifolds. arXiv preprint arXiv:1812.04314, 2018.

W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

Heinz Hopf and Willi Rinow. Über den Begriff der vollständigen differentialgeometrischen Fläche. Commentarii Mathematici Helvetici, 3(1):209–225, 1931.

I. M. James. The topology of Stiefel manifolds, volume 24. Cambridge University Press, New York, 1976.

Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

Arkady Kheyfets, Warner A. Miller, and Gregory A. Newton. Schild's ladder parallel transport procedure for an arbitrary connection. International Journal of Theoretical Physics, 39(12):2891–2898, 2000.

Ondřej Kováčik and Jiří Rákosník. On spaces $L^{p(x)}$ and $W^{k,p(x)}$. Czechoslovak Mathematical Journal, 41(4):592–618, 1991.

Hendrik Anthony Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.

Shiwei Lan, Vasileios Stathopoulos, Babak Shahbaba, and Mark Girolami. Markov chain Monte Carlo from Lagrangian dynamics. Journal of Computational and Graphical Statistics, 24(2):357–378, 2015.

Paul Langevin. Sur la théorie du mouvement brownien. Comptes Rendus, 146:530–533, 1908.

Chunyuan Li, Changyou Chen, Kai Fan, and Lawrence Carin. High-order stochastic gradient thermostats for Bayesian learning of deep models. arXiv preprint arXiv:1512.07662, 2015.

Chang Liu and Jun Zhu. Riemannian Stein variational gradient descent for Bayesian inference. In The 32nd AAAI Conference on Artificial Intelligence, pages 3627–3634, New Orleans, Louisiana, USA, 2018.

Chang Liu, Jun Zhu, and Yang Song. Stochastic gradient geodesic MCMC methods. In Advances in Neural Information Processing Systems 29, pages 3009–3017, Barcelona, Spain, 2016.

Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu, and Lawrence Carin. Understanding and accelerating particle-based variational inference. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4082–4092, Long Beach, California, USA, 2019.

Chang Liu, Jingwei Zhuo, and Jun Zhu. Understanding MCMC dynamics as flows on the Wasserstein space. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4093–4103, Long Beach, California, USA, 2019.

Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3118–3126, Long Beach, California, USA, 2017.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2370–2378, Barcelona, Spain, 2016.

Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4875–4884, Long Beach, California, USA, 2017.

Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. arXiv preprint arXiv:1601.08057, 2016.

John Lott. Some geometric calculations on Wasserstein space. Communications in Mathematical Physics, 277(2):423–437, 2008.

John Lott. An intrinsic parallel transport in Wasserstein space. Proceedings of the American Mathematical Society, 145(12):5329–5340, 2017.

Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2899–2907, Montreal, Canada, 2015.

Jerrold E. Marsden and Tudor S. Ratiu. Introduction to mechanics and symmetry: a basic exposition of classical mechanical systems, volume 17. Springer Science & Business Media, Berlin, 2013.

Emile Mathieu, Charline Le Lan, Chris J. Maddison, Ryota Tomioka, and Yee Whye Teh. Hierarchical representations with Poincaré variational auto-encoders. arXiv preprint arXiv:1901.06033, 2019.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Yoshihiro Nagano, Shoichiro Yamaguchi, Yasuhiro Fujita, and Masanori Koyama. A differentiable Gaussian-like distribution on hyperbolic space for gradient-based learning. arXiv preprint arXiv:1902.02992, 2019.

Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.

Liviu I. Nicolaescu. Lectures on the Geometry of Manifolds. World Scientific, Singapore, 2007.

Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1–2):101–174, 2001.

Ivan Ovinnikov. Poincaré Wasserstein autoencoder. arXiv preprint arXiv:1901.01427, 2019.

Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, Lake Tahoe, Nevada, USA, 2013.

Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 903–910, Haifa, Israel, 2010.

Gareth O. Roberts and Osnat Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337–357, 2002.

Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.

Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pages 880–887, Helsinki, Finland, 2008.

Issei Sato and Hiroshi Nakagawa. Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Itô process. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 982–990, Beijing, China, 2014.

Yang Song and Jun Zhu. Bayesian matrix completion via adaptive relaxed spectral regularization. In The 30th AAAI Conference on Artificial Intelligence (AAAI-16), pages 2044–2050, Phoenix, Arizona, USA, 2016.

Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, New York, 2008.

Eduard L. Stiefel. Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii Mathematici Helvetici, 8(1):305–353, 1935.

Yee Whye Teh, Alexandre H. Thiery, and Sebastian J. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):193–225, 2016.

Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, Berlin, 2008.

Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pages 681–688, Bellevue, Washington, USA, 2011.

Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for computing Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.

Viktor Yanush and Dmitry Kropotov. Hamiltonian Monte-Carlo for orthogonal matrices. arXiv preprint arXiv:1901.08045, 2019.

Hongyi Zhang and Suvrit Sra. An estimate sequence for geodesically convex optimization. In Proceedings of the 31st Annual Conference on Learning Theory (COLT 2018), pages 1703–1723, Stockholm, Sweden, 2018.

Yizhe Zhang, Changyou Chen, Zhe Gan, Ricardo Henao, and Lawrence Carin. Stochastic gradient monomial Gamma sampler. arXiv preprint arXiv:1706.01498, 2017.