
Page 1

Sampling Methods on Manifolds and Their View from Probability Manifolds

Chang Liu

Microsoft Research Asia

[email protected]

Work done at the Department of Computer Science and Technology, Tsinghua University

Nov. 20, 2019

Page 2

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 3

Introduction: the Sampling Task

The need to draw samples from a distribution:

Bayesian inference: $p(z|x) = p(z)\,p(x|z)/p(x) \propto p(z)\,p(x|z)$.

Generation from generative models (e.g., MRF generation).

Monte Carlo estimation (e.g., MRF likelihood gradients, doubly-stochastic gradients).

Page 4

Introduction: Sampling Methods

Methods:

Monte Carlo:

Directly draw i.i.d. samples. Efficient, but requires the exact density.

Markov Chain Monte Carlo (MCMC):

Draw samples by simulating a Markov chain with the desired stationary distribution. Admits unnormalized densities, but introduces autocorrelation.

Particle-Based Variational Inference (ParVI):

Optimize a set of particles (i.e., samples) to drive the particle distribution towards the target distribution. Admits unnormalized densities, but requires an assumption on the particle distribution (which affects performance).

Page 5

Introduction: Manifold

Concept ($M$-dim manifold $\mathcal M$): a topological space locally homeomorphic to an open subset of $\mathbb R^M$.

Merits:

Inclusive concept: relaxes global linearity.
Rich structures can be equipped: distance, gradient, distribution, dynamics, etc.
Fundamental view of geometry: parameterization-invariant.

Page 6

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold.

Spherical Admixture Model (SAM) [61]: topics on spheres for better representation.

Page 7

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold.

Hyperspherical Variational Auto-Encoder [19, 30]: spherical latent space for an uninformative prior.

Page 8

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold.

Hyperbolic Variational Auto-Encoders [53, 30, 59, 55]: hyperbolic latent space (right) for the analogy to a tree structure (left).

Bayesian Matrix Factorization [64, 66, 73]: factor matrices on the Stiefel manifold [68, 33] $\{M\in\mathbb R^{m\times n} \mid M^\top M = I_n\}$.

Page 9

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold.

Information Geometry [3, 4]: for Bayesian inference $p(z|x)$ of a Bayesian model $\{p(z), p(x|z)\}$.

Page 10

Introduction: Sampling and Manifolds

Sampling from a distribution supported on a manifold: how to comply with the manifold geometry while remaining efficient?

Viewing Sampling Methods on Probability Manifolds:

ParVIs have a natural optimization interpretation on a probability space; can it be made concrete? Do MCMCs have a similar interpretation?

Page 11

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 12

Manifolds

$M$-dim. manifold $\mathcal M$ (a): a topological space locally homeomorphic to an open subset of $\mathbb R^M$.

Tangent vector $v$ at $x\in\mathcal M$ (b): a linear function $C^\infty(\mathcal M)\to\mathbb R$ satisfying the Leibniz rule (a directional derivative).

A smooth curve $\gamma_t$ through $x$ defines a tangent vector (the derivative along the curve) (c).

Tangent space $T_x\mathcal M$ at $x$ (b): an $M$-dim. linear space.

Flow of a vector field $V$ (d): the set of curves $\{(\varphi_t)_t\}$ s.t. $\dot\varphi_t = V(\varphi_t)$ (exists at least locally).

[Figure: panels (a)-(d) illustrating these concepts.]

Page 13

Manifolds

Riemannian structure: inner product in every tangent space TxM.

Coordinate expression: $\langle u, v\rangle_{T_x\mathcal M} = g_{ij}(x)\,u^i v^j$.

Gradient of $f$: $\langle\operatorname{grad} f(x), v\rangle_{T_x\mathcal M} = v[f] := v^i\,\partial_i f(x)$
$\iff$ steepest ascent direction: $\operatorname{grad} f(x) = \max\cdot\operatorname{argmax}_{\|v\|_{T_x\mathcal M}=1}\,\frac{d}{dt} f(\varphi_t)$.

Coordinate expression: $\big(\operatorname{grad} f(x)\big)^i = g^{ij}(x)\,\partial_j f(x)$.

Page 14

Manifolds

Riemannian structure: inner product in every tangent space TxM.

Distance: $d(x, y) = \sqrt{\inf_{\gamma:\ \gamma_0=x,\ \gamma_1=y}\int_0^1\langle\dot\gamma_t, \dot\gamma_t\rangle_{T_{\gamma_t}\mathcal M}\,dt}$.

Geodesic: the minimizing curve(s), when it exists (e.g., when $\mathcal M$ is complete as a metric space [32]).

More fundamental definition: auto-parallel curves under an affine connection (covariant derivative); a generalization of straight lines.

Page 15

Manifolds

Riemannian structure: inner product in every tangent space $T_x\mathcal M$.

Exponential map $\mathrm{Exp}_x(v)$: maps $v\in T_x\mathcal M$ to the end point of the geodesic tangent to $v$ at $x$ with length $\|v\|_{T_x\mathcal M}$.

A generalization of vector addition.

Parallel transport $\Gamma_x^y(v)$: moves $v\in T_x\mathcal M$ to $T_y\mathcal M$ parallelly (in a certain sense), along the geodesic from $x$ to $y$.

A generalization of conventional parallel transport. It is generally path-dependent. More fundamental def.: specified by an affine connection.

Page 16

Manifolds

Measures on orientable manifolds can be expressed by volume forms:

Volume form: an alternating linear map $(T_x\mathcal M)^M\to\mathbb R$ for every $x$.

Lebesgue measure of a coordinate space: $dx^1\wedge\cdots\wedge dx^M$.

Riemannian volume form (Riemannian measure): the coordinate-invariant volume form $\sqrt{|G|}\,dx^1\wedge\cdots\wedge dx^M$.

Page 17

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 18

MCMCs on Euclidean Space

Classical MCMCs: high autocorrelation.

Metropolis-Hastings algorithm [54, 31].

Gibbs sampling [27].

Dynamics-Based MCMCs: more effective moves.

Dynamics: a continuous-time, no-jump Markov process:

$dx = V(x)\,dt + \sqrt{2D(x)}\,dB_t(x)$.

Key tool: the Fokker-Planck equation (FPE):

$\partial_t p_t = -\partial_i(p_t V^i) + \partial_i\partial_j(p_t D^{ij})$.

Page 19

MCMCs on Euclidean Space

Dynamics-Based MCMCs: more effective moves.

Langevin Dynamics (LD) [39] ([63, 62, 71]):

$dx = \Sigma^{-1}\nabla\log p\,dt + \sqrt{2\Sigma^{-1}}\,dB_t(x)$.

Hamiltonian Dynamics (Hamiltonian Monte Carlo (HMC) [21, 56, 10]):

$\begin{cases} dx = \Sigma^{-1} r\,dt, \\ dr = \nabla\log p\,dt. \end{cases}$

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [37, 15]:

$\begin{cases} dx = \Sigma^{-1} r\,dt, \\ dr = \nabla\log p\,dt - Cr\,dt + \sqrt{2C\Sigma}\,dB_t(x). \end{cases}$

Stochastic Gradient Nose-Hoover Thermostats (SGNHT) [20]:

$\begin{cases} dx = \Sigma^{-1} r\,dt, \\ dr = \nabla\log p\,dt - \xi r\,dt + \sqrt{2C\Sigma}\,dB_t(x), \\ d\xi = \big(\tfrac1M r^\top\Sigma^{-1} r - 1\big)\,dt. \end{cases}$
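To make the dynamics concrete, here is a minimal Euler–Maruyama sketch (an illustration, not from the talk) of the Langevin dynamics for a 1-D standard-normal target, assuming $\Sigma = I$:

```python
import numpy as np

# Euler-Maruyama simulation of dx = grad log p dt + sqrt(2) dB_t (Sigma = I),
# targeting the 1-D standard normal p = N(0, 1).
rng = np.random.default_rng(0)
grad_log_p = lambda x: -x                 # d/dx log N(x; 0, 1)

eps, n_steps, burn_in = 1e-2, 60_000, 10_000
x, samples = 0.0, []
for step in range(n_steps):
    x += eps * grad_log_p(x) + np.sqrt(2 * eps) * rng.normal()
    if step >= burn_in:
        samples.append(x)

print(np.mean(samples), np.var(samples))  # approximately 0 and 1
```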

Page 20

MCMCs on Euclidean Space

Dynamics-Based MCMCs: more effective moves.

The complete recipe [51] for the dynamics:

$dx = V(x)\,dt + \sqrt{2D(x)}\,dB_t(x), \qquad V^i(x) = \partial_j\big(p(x)\,(D^{ij}(x) + Q^{ij}(x))\big)\,/\,p(x), \quad (1)$

for some pos. semi-def. $D_{M\times M}$ (diffusion matrix) and skew-symm. $Q_{M\times M}$ (curl matrix), keeps $p$ invariant.

The converse also holds. If $D$ is pos. def., then $p$ is the unique stationary distribution.
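A small numerical sanity check of Eq. (1) (an illustration under stated assumptions, not from the talk): in one dimension $Q$ must vanish (skew-symmetric scalars are zero), and the recipe drift $V = \partial_x(pD)/p$ makes the Fokker–Planck right-hand side vanish identically at $p$, for any positive $D(x)$:

```python
import numpy as np

# 1-D check of the complete recipe: with Q = 0 and D(x) > 0, the drift
# V = d(pD)/dx / p makes p stationary under the Fokker-Planck equation
#   dp/dt = -d(pV)/dx + d^2(pD)/dx^2.
x = np.linspace(-5, 5, 2001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # target density
D = 1.0 + 0.5 * np.sin(x)**2                   # an arbitrary positive diffusion

d = lambda f: np.gradient(f, x)                # finite-difference d/dx
V = d(p * D) / p                               # recipe drift (Q = 0 in 1-D)
rhs = -d(p * V) + d(d(p * D))                  # Fokker-Planck RHS at p

print(np.max(np.abs(rhs)))                     # ~0, up to rounding error
```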

Page 21

MCMCs on Euclidean Space

Dynamics-Based MCMCs: more effective moves.

Stochastic Gradient MCMC: for Bayesian inference,

$\nabla_z\log p(z|\{x^{(n)}\}_{n=1}^N) = \nabla_z\log p(z) + \sum_{n=1}^N\nabla_z\log p(x^{(n)}|z)$,

$\widetilde\nabla_z\log p(z|\{x^{(n)}\}_{n=1}^N) := \nabla_z\log p(z) + \frac{N}{|S|}\sum_{n\in S}\nabla_z\log p(x^{(n)}|z) \approx \nabla_z\log p(z|\{x^{(n)}\}_{n=1}^N) + \mathcal N(0, A(z))$.

Influence on the dynamics $dx = V(x)\,dt + \sqrt{2D(x)}\,dB_t(x)$:

$\operatorname{Var}(V(x)\,dt) = \operatorname{Var}(V(x))\,dt^2 = o(dt), \qquad \operatorname{Var}\big(\sqrt{2D(x)}\,dB_t(x)\big) = 2D(x)\,dt$.

HMC cannot be simulated using stochastic gradients [15, 9].
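For concreteness, a hedged sketch of the mini-batch estimator $\widetilde\nabla_z\log p(z|\{x^{(n)}\})$ on an assumed Gaussian toy model (the model and all names are illustrative, not from the talk), showing it is unbiased:

```python
import numpy as np

# Mini-batch gradient estimator (assumed toy model: prior z ~ N(0, 1),
# likelihood x_n ~ N(z, 1)):
#   grad_tilde(z) = grad log p(z) + (N / |S|) * sum_{n in S} grad log p(x_n | z)
rng = np.random.default_rng(0)
N = 10_000
data = rng.normal(1.5, 1.0, size=N)

def grad_tilde(z, batch=100):
    S = rng.choice(N, size=batch, replace=False)
    return -z + (N / batch) * (data[S] - z).sum()

z = 0.3
full_grad = -z + (data - z).sum()                      # exact full-data gradient
avg_est = np.mean([grad_tilde(z) for _ in range(5000)])
print(full_grad, avg_est)                              # the estimator is unbiased
```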

Page 22

MCMCs on Riemannian Manifolds

In the coordinate space ($p$ is the density w.r.t. the Lebesgue measure):

Riemann Manifold Langevin Dynamics (RMLD) [28, 60]:

$dx = G^{-1}\nabla\log p\,dt + \nabla\cdot G^{-1}\,dt + \sqrt{2G^{-1}}\,dB_t(x)$.

Riemann Manifold Hamiltonian Monte Carlo (RMHMC) [28]:

$\begin{cases} dx = G^{-1} r\,dt, \\ dr = \nabla\log\big(p/\sqrt{|G|}\big)\,dt - \tfrac12\nabla\big(r^\top G^{-1} r\big)\,dt. \end{cases}$

Stochastic Gradient Riemann Hamiltonian Monte Carlo (SGRHMC) [51]:

$\begin{cases} dx = G^{-1/2} r\,dt, \\ dr = G^{-1/2}\nabla\log p\,dt - \nabla\cdot G^{-1/2}\,dt - G^{-1} r\,dt + \sqrt{2G^{-1}}\,dB_t(x). \end{cases}$

Page 23

MCMCs on Riemannian Manifolds

In the coordinate space: application using the Fisher-Rao metric (information geometry [3, 4]):

Given a Bayesian model $p(z), p(x|z)$, $z$ is a coordinate of the manifold $\{p(x|z) \mid z\in\mathcal Z\}$.

Fisher-Rao metric: $G(z) := \mathbb E_{p(x|z)}\big[\nabla_z^\top\log p(x|z)\,\nabla_z\log p(x|z)\big]$.

Derived from the KL divergence. The corresponding distance is ($\sqrt 8\,\times$) the JS divergence. Invariant under reparameterization of $z$.

[Figure: HMC (left) and RMHMC (right) [28].]

Page 24

MCMCs on Riemannian Manifolds

Problems of the coordinate space: a global one may not exist (e.g., hyperspheres $S^{n-1} := \{x\in\mathbb R^n \mid \|x\|_2 = 1\}$).

Cumbersome to switch between coordinate systems.

G would be singular near the edge of a coordinate space.

Simulation in an embedded space $\Xi(\mathcal M)$: a homeomorphic injection $\Xi: \mathcal M\to\mathbb R^n$.

Global representation.

Common manifolds have a natural (isometric) embedding.

Hausdorff meas. on Ξ(M) (isom. emb.) is the Riem. meas. on M.

Page 25

MCMCs on Riemannian Manifolds

RMHMC in the embedded space:

Constrained HMC (CHMC) [11].

Geodesic Monte Carlo (GMC) [12].

Stochastic Gradient MCMCs in the embedded space [42]:

Stochastic Gradient Geodesic Monte Carlo (SGGMC).

Geodesic Stochastic Gradient Nose-Hoover Thermostats (gSGNHT).

Page 26

Stochastic Gradient MCMCs in the Embedded Space

Table: A summary of MCMCs on Riemannian manifolds. –: sampling on manifold not supported; †: the integrators are not in the SSI scheme (it is unclear whether the claimed "2nd-order" is equivalent to ours); ‡: 2nd-order integrators for SGHMC and mSGNHT are developed by [13] and [40], respectively.

| methods                    | stochastic gradient | no inner iteration | no global coordinates | order of integrator |
| LD [63, 62]                | ×                   | √                  | –                     | 1st                 |
| HMC [56]                   | ×                   | √                  | –                     | 2nd                 |
| GMC [12]                   | ×                   | √                  | √                     | 2nd                 |
| RMLD [28]                  | ×                   | √                  | ×                     | 1st                 |
| RMHMC [28]                 | ×                   | ×                  | ×                     | 2nd†                |
| CHMC [11]                  | ×                   | ×                  | √                     | 2nd†                |
| SGLD [71]                  | √                   | √                  | –                     | 1st                 |
| SGHMC [15] / SGNHT [20]    | √                   | √                  | –                     | 1st‡                |
| SGRLD [60] / SGRHMC [51]   | √                   | √                  | ×                     | 1st                 |
| SGGMC / gSGNHT [42]        | √                   | √                  | √                     | 2nd                 |

Page 27

Stochastic Gradient MCMCs in the Embedded Space

SGGMC dynamics (coordinate space):

Augment with the momentum $r\in\mathbb R^m$ (more precisely, a covector $\in T_x^*\mathcal M$).

Augmented target distribution:

$-\log p(z, r) = \underbrace{-\log p(z|x) + \tfrac12\log|G(z)|}_{\text{potential energy}} + \underbrace{\tfrac12 r^\top G(z)^{-1} r}_{\text{kinetic energy}}$.

Let $\mathcal M$ be isometrically embedded in $\mathbb R^n$ via $y = \Xi(z)$. Define:

$D(z) = \begin{pmatrix} 0 & 0 \\ 0 & J(z)^\top C J(z) \end{pmatrix}, \qquad Q(z) = \begin{pmatrix} 0 & -I \\ I & 0 \end{pmatrix}$,

where $J_{n\times m}: J_{ai} = \partial y_a/\partial z_i$ (so $J^\top J = G$).

Page 28

Stochastic Gradient MCMCs in the Embedded Space

SGGMC dynamics (coordinate space):

$dz = G^{-1} r\,dt$,

$dr = \nabla_z\log p(z|x)\,dt - \tfrac12\nabla_z\log|G(z)|\,dt - J^\top C J G^{-1} r\,dt - \tfrac12\nabla_z\big[r^\top G^{-1} r\big]\,dt + \mathcal N(0,\ 2J^\top C J\,dt)$.

Page 29

Stochastic Gradient MCMCs in the Embedded Space

SGGMC simulation (embedded space): Symmetric Splitting Integrator (SSI) [13]. Split the SGGMC dynamics (in the coordinate space)

$dz = G^{-1} r\,dt, \quad dr = \nabla_z\log p(z|x)\,dt - \tfrac12\nabla_z\log|G(z)|\,dt - J^\top C J G^{-1} r\,dt - \tfrac12\nabla_z\big[r^\top G^{-1} r\big]\,dt + \mathcal N(0, 2J^\top C J\,dt)$

into three parts:

A: $\begin{cases} dz = G^{-1} r\,dt \\ dr = -\tfrac12\nabla_z\big[r^\top G^{-1} r\big]\,dt \end{cases}$ $\Rightarrow$ $(z_t, r_t) = \mathrm{GeodFlow}(z_0, r_0)$ [1, 12]

B: $\begin{cases} dz = 0 \\ dr = -J^\top C J G^{-1} r\,dt \end{cases}$ $\Rightarrow$ $\begin{cases} z_t = z_0 \\ r_t = J^\top\operatorname{expm}\{-Ct\}\,J G^{-1} r_0 \end{cases}$

O: $\begin{cases} dz = 0 \\ dr = \nabla_z\log p(z|x)\,dt - \tfrac12\nabla_z\log|G(z)|\,dt + \mathcal N(0, 2J^\top C J\,dt) \end{cases}$ $\Rightarrow$ $\begin{cases} z_t = z_0 \\ r_t = r_0 + \nabla_z\log p(z_0|x)\,t - \tfrac12\nabla_z\log|G(z_0)|\,t + \mathcal N(0, 2J^\top C J\,t) \end{cases}$

Page 30

Stochastic Gradient MCMCs in the Embedded Space

SGGMC simulation (emb. sp.): Symmetric Splitting Integrator (SSI) [13].

Dynamics A in the embedded space: geodesic flow (i.e., exponential map + parallel transport).

Example 1 (Geodesic flow of the hypersphere $S^{n-1}$ in the embedded space)

$\begin{cases} y(t) = y(0)\cos(\alpha t) + (v(0)/\alpha)\sin(\alpha t) \\ v(t) = -\alpha\,y(0)\sin(\alpha t) + v(0)\cos(\alpha t), \end{cases}$

where $y\in S^{n-1}$, $v = \dot y\in T_y(S^{n-1})$, and $\alpha = \|v(0)\|$.
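Example 1 translates directly into a few lines of NumPy; a minimal sketch with a check that the flow stays on the sphere and keeps $v$ tangent:

```python
import numpy as np

# Geodesic flow on S^{n-1} in the embedded space (Example 1).
def geodesic_flow(y, v, t):
    alpha = np.linalg.norm(v)
    if alpha == 0.0:
        return y, v
    y_t = y * np.cos(alpha * t) + (v / alpha) * np.sin(alpha * t)
    v_t = -alpha * y * np.sin(alpha * t) + v * np.cos(alpha * t)
    return y_t, v_t

y = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0])               # tangent: y . v = 0
y_t, v_t = geodesic_flow(y, v, 0.3)
print(np.linalg.norm(y_t), y_t @ v_t)       # 1.0 and 0.0: on-sphere, tangent
```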

Page 31

Stochastic Gradient MCMCs in the Embedded Space

SGGMC simulation (emb. sp.): Symmetric Splitting Integrator (SSI) [13].

Dynamics B and O in the embedded space:

B: $\begin{cases} y(t) = y(0) \\ v(t) = \Lambda\big(y(0)\big)\operatorname{expm}\{-Ct\}\,v(0) \end{cases}$

O: $\begin{cases} y(t) = y(0) \\ v(t) = v(0) + \Lambda\big(y(0)\big)\big[\nabla_y\log p_H\big(y(0)|x\big)\,t + \mathcal N(0, 2Ct)\big], \end{cases}$

where $p_H$ is the density function w.r.t. the Hausdorff measure, and $\Lambda(y) = I_n - P(y)P(y)^\top$ is the projection onto $\Xi_*(T_z\mathcal M)$.

Example 2 (The projection $\Lambda(y)$ for the hypersphere in the embedded space)

$\Lambda(y) = I_n - yy^\top$.

Page 32

Stochastic Gradient MCMCs in the Embedded Space

SGGMC simulation (emb. sp.): Symmetric Splitting Integrator (SSI) [13].

Simulate following the sequence “ABOBA”:

Algorithm 1 Sampling procedure of SGGMC

Sample a subset $S$ for computing $\nabla_y\log p_H(y)$; $(y_0, v_0)\leftarrow(y^{(n-1)}, v^{(n-1)})$.
for $l = 1, 2, \ldots, L$ do
  A: update $(y^*, v^*)\leftarrow(y_{l-1}, v_{l-1})$ by the geodesic flow for time step $\varepsilon_n/2$.
  B: $v^*\leftarrow\exp\{-C\varepsilon_n/2\}\,v^*$.
  O: $v^*\leftarrow v^* + \Lambda(y^*)\cdot\big[\nabla_y\log p_H(y^*)\,\varepsilon_n + \mathcal N\big(0, (2C - \varepsilon_n V(y^*))\,\varepsilon_n\big)\big]$.
  B: $v^*\leftarrow\exp\{-C\varepsilon_n/2\}\,v^*$.
  A: update $(y_l, v_l)\leftarrow(y^*, v^*)$ by the geodesic flow for time step $\varepsilon_n/2$.
end for

Second-order simulation: $\mathrm{MSE} = O(L^{-2K/(2K+1)})$ [13].
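A schematic NumPy rendering of one ABOBA step of Algorithm 1 on the sphere (a sketch under simplifying assumptions: $C = c\,I$, the $\varepsilon_n V(y^*)$ noise-variance correction is dropped, and grad_log_pH is a user-supplied stochastic gradient):

```python
import numpy as np

# One ABOBA step of SGGMC on S^{n-1} in embedded-space variables (y, v).
# Assumptions: C = c * I; the eps_n * V(y*) correction of Algorithm 1 is
# dropped; grad_log_pH(y) is a stochastic gradient of log p_H(y | x).
def geodesic_flow(y, v, t):
    a = np.linalg.norm(v)
    if a == 0.0:
        return y, v
    return (y * np.cos(a * t) + (v / a) * np.sin(a * t),
            -a * y * np.sin(a * t) + v * np.cos(a * t))

def aboba_step(y, v, grad_log_pH, eps, c, rng):
    proj = lambda y: np.eye(len(y)) - np.outer(y, y)   # Lambda(y), Example 2
    y, v = geodesic_flow(y, v, eps / 2)                # A
    v = np.exp(-c * eps / 2) * v                       # B
    noise = np.sqrt(2 * c * eps) * rng.normal(size=len(y))
    v = v + proj(y) @ (grad_log_pH(y) * eps + noise)   # O
    v = np.exp(-c * eps / 2) * v                       # B
    y, v = geodesic_flow(y, v, eps / 2)                # A
    return y, v
```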

Page 33

Stochastic Gradient MCMCs in the Embedded Space

gSGNHT dynamics:

$dz = G^{-1} r\,dt$,

$dr = \nabla_z\log p(z|x)\,dt - \tfrac12\nabla_z\log|G|\,dt - \xi r\,dt - \tfrac12\nabla_z\big[r^\top G^{-1} r\big]\,dt + \mathcal N(0, 2CG\,dt)$,

$d\xi = \big(\tfrac1m r^\top G^{-1} r - 1\big)\,dt$.

gSGNHT simulation:

Algorithm 2 Sampling procedure of gSGNHT

A: update $(y^*, v^*)\leftarrow(y_{l-1}, v_{l-1})$ by the geodesic flow for time step $\varepsilon_n/2$; $\xi^*\leftarrow\xi_{l-1} + \big(\tfrac1m v_{l-1}^\top v_{l-1} - 1\big)\,\varepsilon_n/2$.
B: $v^*\leftarrow\exp\{-\xi^*\varepsilon_n/2\}\,v^*$.
O: $v^*\leftarrow v^* + \Lambda(y^*)\cdot\big[\nabla_y\log p_H(y^*)\,\varepsilon_n + \mathcal N\big(0, (2C - \varepsilon_n V(y^*))\,\varepsilon_n\big)\big]$.
B: $v^*\leftarrow\exp\{-\xi^*\varepsilon_n/2\}\,v^*$.
A: update $(y_l, v_l)\leftarrow(y^*, v^*)$ by the geodesic flow for time step $\varepsilon_n/2$.

Page 34

Stochastic Gradient MCMCs in the Embedded Space

Experimental results:

Figure: Joint posterior of $z_1$ and $z_2$ in gray scale. Left: true distribution; right: empirical distribution by samples of SGGMC.

Page 35

Stochastic Gradient MCMCs in the Embedded Space

Experimental results: inference for Spherical Admixture Model (SAM) [61]

Model structure:

Documents $v$ (e.g., normalized tf-idf), topics $\beta$, corpus mean $\mu$: on hyperspheres.

Posterior of interest: $p(\beta|v)$.

$\nabla_\beta\log p(\beta|v) = \frac{1}{p(\beta|v)}\nabla_\beta\int p(\beta, \theta|v)\,d\theta = \mathbb E_{p(\theta|\beta, v)}\big[\nabla_\beta\log p(\beta, \theta|v)\big]$.

Run another MCMC (GMC [12]) to sample from $p(\theta|\beta, v)$ (supported on a simplex) to estimate the expectation.

Page 36

Stochastic Gradient MCMCs in the Embedded Space

Experimental results: inference for Spherical Admixture Model (SAM) [61]

103

104

105

2000

2500

3000

3500

4000

4500

5000

wall time in seconds (log scale)

log−

perp

lexi

ty

VIStoVIGMC−apprMHGMC−bGibbsSGGMC−batchSGGMC−fullgSGNHT−batchgSGNHT−full

Figure: Results on the 150K Wikipedia subset (150K training and 1K test, 50topics)

Page 37

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 38

ParVIs on Euclidean Space

Particle-Based Variational Inference (ParVI): optimize a set of particles (i.e., samples) to drive the particle distribution towards the target distribution.

More flexible and accurate than classical (i.e., statistical-model-based) variational inference.

Has a better convergence perspective than MCMCs.

More particle-efficient than MCMCs.

Page 39

ParVIs on Euclidean Space

Stein Variational Gradient Descent (SVGD) [46]:

A deterministic dynamics $\dot z_t = v(z_t)$ on $\mathcal M = \mathbb R^m$ induces a continuously-evolving distribution $(q_t)$ on $\mathcal M$:

$\partial_t q_t = -\nabla\cdot(q_t v)$. (continuity equation / deterministic FPE)

Page 40

ParVIs on Euclidean Space

Stein Variational Gradient Descent (SVGD) [46]:

To drive $(q_t)$ towards $p$, let it minimize $\mathrm{KL}(q_t\|p)$:

Find the decreasing rate (directional derivative):

$-\frac{d}{dt}\mathrm{KL}(q_t\|p) = \mathbb E_q\big[v\cdot\nabla\log p + \nabla\cdot v\big]$.

Find the $v$ maximizing the decreasing rate, $v^* := \max\cdot\operatorname{argmax}_{\|v\|_{\mathcal X}=1}\, -\frac{d}{dt}\mathrm{KL}(q_t\|p)$ (functional gradient):

Taking $\mathcal X = T(\mathcal M) = \mathbb R^m$: no tractable solution.
Taking $\mathcal X = \mathcal H^m$, where $\mathcal H$ is the RKHS [67] of a kernel $K$:

$v^*(x') = \mathbb E_{q(x)}\big[K(x, x')\,\nabla_x\log p(x) + \nabla_x K(x, x')\big]$.

The expectation can be estimated directly by the particles!

Simulate the particles by applying the dynamics: $x^{(i)}\leftarrow x^{(i)} + \epsilon\,v^*(x^{(i)})$.
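A minimal SVGD sketch with an RBF kernel for a 1-D standard-normal target (the bandwidth and step size are fixed by hand; both are illustrative choices, not from the talk):

```python
import numpy as np

# SVGD with K(x, x') = exp(-(x - x')^2 / (2h)), target p = N(0, 1).
rng = np.random.default_rng(0)
x = rng.uniform(-6, 6, size=100)            # initial particles
grad_log_p = lambda x: -x
h, eps = 0.5, 0.05                          # bandwidth and step size (assumed)

for _ in range(500):
    diff = x[:, None] - x[None, :]          # diff[i, j] = x_i - x_j
    K = np.exp(-diff**2 / (2 * h))
    # v*(x_i) = mean_j [ K(x_j, x_i) grad log p(x_j) + d/dx_j K(x_j, x_i) ]
    phi = (K @ grad_log_p(x) + (diff * K).sum(axis=1) / h) / len(x)
    x = x + eps * phi

print(x.mean(), x.var())                    # particles approximate N(0, 1)
```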

Page 41

ParVIs on Riemannian Manifolds

Riemannian SVGD [41]:

Utilize information geometry to enhance efficiency (coordinate space).

Enable ParVIs on manifolds like hyperspheres (embedded space).

Page 42

ParVIs on Riemannian Manifolds

Dynamics on a Riemannian manifold:

$\dot z_t = X(z_t)$; $(z_t)$ is a curve of the flow of $X$.

Evolving distribution: let all densities be w.r.t. the Riemannian measure.

Lemma 3 (Continuity Equation on a Riemannian Manifold)

$\partial_t q_t = -\operatorname{div}(q_t X) = -X[q_t] - q_t\operatorname{div}(X) = -X^i\partial_i q_t - q_t\partial_i X^i - q_t X^i\partial_i\log\sqrt{|G|}$.

Page 43

ParVIs on Riemannian Manifolds

Directional derivative:

Theorem 4 (Directional Derivative)

Let $p$ be a fixed distribution. Then the directional derivative is

$-\frac{d}{dt}\mathrm{KL}(q_t\|p) = \mathbb E_{q_t}\big[\operatorname{div}(pX)/p\big] = \mathbb E_{q_t}\big[X[\log p] + \operatorname{div}(X)\big]$.

$X[q_t]$: the action of the vector field $X$ on the smooth function $q_t$; in any coordinate system, $X[q_t] = X^i\partial_i q_t$.

$\operatorname{div}(X)$: the divergence of the vector field $X$; in any coordinate system, $\operatorname{div}(X) = \partial_i(\sqrt{|G|}\,X^i)/\sqrt{|G|}$.

Page 44

ParVIs on Riemannian Manifolds

Functional gradient:

$X^* := \max\cdot\operatorname{argmax}_{X\in\mathcal X,\ \|X\|_{\mathcal X}=1}\ \mathcal J(X) := \mathbb E_q\big[X[\log p] + \operatorname{div}(X)\big]$,

where $\mathcal X$ is a subspace of vector fields on $\mathcal M$, such that:

1. $X^*$ is a valid vector field on $\mathcal M$.

Example 5 (Nontriviality of a valid vector field)

Vector fields on an even-dimensional hypersphere must have a zero point (hairy ball theorem; [2], Thm 8.5.13). The choice $\mathcal X = \mathcal H^m$ in SVGD cannot guarantee this requirement.

2. $X^*$ is coordinate invariant. Concept: the expression in any coordinate system is the same. Necessary for avoiding arbitrariness of the solution. The choice $\mathcal X = \mathcal H^m$ in SVGD cannot guarantee this requirement either.

3. $X^*$ can be expressed in closed form.

Page 45

ParVIs on Riemannian Manifolds

Functional gradient:

Our Solution

$\mathcal X = \{\operatorname{grad} f \mid f\in\mathcal H\}$, where $\mathcal H$ is the RKHS of a kernel $K$.

The gradient of a function is a valid, coordinate-invariant vector field.

Lemma 6

For a Gaussian RKHS, $\mathcal X$ is isometrically isomorphic to $\mathcal H$.

Theorem 7 (Functional Gradient)

$X^{*\prime} = \operatorname{grad}' f^{*\prime}, \qquad f^{*\prime} = \mathbb E_q\big[(\operatorname{grad} K)[\log p] + \Delta K\big]$,

where "$\cdot'$" takes $x'$ as argument, and $\Delta f := \operatorname{div}(\operatorname{grad} f)$. In coordinates:

$X^{*\prime\,i} = g'^{ij}\partial'_j\,\mathbb E_q\Big[\big(g^{ab}\partial_a\log\big(p\sqrt{|G|}\big) + \partial_a g^{ab}\big)\partial_b K + g^{ab}\partial_a\partial_b K\Big]$.

Simulate the dynamics: $z^{(s)}\leftarrow z^{(s)} + \epsilon\,X^*(z^{(s)})$.

Page 46

ParVIs on Riemannian Manifolds

Experimental Results (coordinate space):

Figure: Test accuracy along iterations for Bayesian logistic regression (BLR), comparing SVGD and RSVGD (ours): (a) on the Splice19 dataset; (b) on the Covertype dataset. Both methods are run 20 times on Splice19 and 10 times on Covertype.

Page 47

ParVIs on Riemannian Manifolds

Functional gradient in the embedded space:

Proposition 8 (Functional Gradient in the Embedded Space)

Let the $m$-dim $\mathcal M$ be isometrically embedded in $\mathbb R^n$ (with orthonormal basis $\{y^\alpha\}_{\alpha=1}^n$) via $\Xi: \mathcal M\to\mathbb R^n$. Then $X^{*\prime} = (I_n - P'P'^\top)\nabla' f^{*\prime}$,

$f^{*\prime} = \mathbb E_q\Big[\big(\nabla\log(p\sqrt{|G|})\big)^\top\big(I_n - PP^\top\big)(\nabla K) + \nabla^\top\nabla K - \operatorname{tr}\big(P^\top(\nabla\nabla^\top K)P\big) + \big((J^\top\nabla)^\top(G^{-1}J^\top)\big)(\nabla K)\Big]$,

where $\nabla = (\partial_{y^1}, \ldots, \partial_{y^n})^\top$, $J_{n\times m}: J_{ai} = \partial y_a/\partial z_i$, and $P\in\mathbb R^{n\times(n-m)}$ is an orthonormal basis of the orthogonal complement of $\Xi_*(T_z\mathcal M)$.

Simulate the dynamics with the exponential map:

$y^{(s)}\leftarrow\mathrm{Exp}_{y^{(s)}}\big(\epsilon\,X^*(y^{(s)})\big)$.

(Is a coordinate-independent expression possible?)

Page 48

ParVIs on Riemannian Manifolds

Functional gradient on hyperspheres:

Proposition 9 (Functional Gradient for Embedded Hyperspheres)

For $S^{n-1}$ isometrically embedded in $\mathbb R^n$ with orthonormal basis $\{y^\alpha\}_{\alpha=1}^n$, we have $X^{*\prime} = (I_n - y'y'^\top)\nabla' f^{*\prime}$, where

$f^{*\prime} = \mathbb E_q\Big[(\nabla\log p)^\top(\nabla K) + \nabla^\top\nabla K - y^\top\big(\nabla\nabla^\top K\big)y - \big(y^\top\nabla\log p + n - 1\big)\,y^\top\nabla K\Big]$.

Simulate the dynamics with the exponential map on $S^{n-1}$:

$\mathrm{Exp}_y(v) = y\cos(\|v\|) + (v/\|v\|)\sin(\|v\|)$.
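In code, this exponential map is a two-liner (a sketch):

```python
import numpy as np

# Exponential map on S^{n-1}: follow the great circle from y along tangent v.
def exp_sphere(y, v):
    nv = np.linalg.norm(v)
    return y if nv == 0.0 else y * np.cos(nv) + (v / nv) * np.sin(nv)

y = np.array([0.0, 0.0, 1.0])
v = np.array([np.pi / 2, 0.0, 0.0])         # tangent at y (y . v = 0)
print(exp_sphere(y, v))                     # a quarter turn: lands at (1, 0, 0)
```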

Page 49

ParVIs on Riemannian Manifolds

Experimental Results (embedded space):

Figure: Results on the SAM inference task on the 20News-different dataset, in log-perplexity: (a) results with 100 particles, along epochs; (b) results at 200 epochs, against the number of particles. Methods: VI, GMC-seq/par, SGGMCb-seq/par, SGGMCf-seq/par, RSVGD (ours). SGGMCf: full batch; SGGMCb: mini-batch of size 50.

Page 50

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 51

Questions on ParVIs and MCMCs

ParVIs exhibit the intuition of minimizing $\mathrm{KL}_p(\cdot)$ on a probability space, along the steepest descent direction. Can this be made concrete?

Liu (2017) [45] conceives a probability manifold on which SVGD simulates the gradient flow, but the validity of that artificial manifold is unknown.

ParVIs do not assume a parametric statistical model, but need a kernel (or other treatment). Do they need an assumption / make an approximation?

Do general MCMCs have a flow/optimization interpretation?

Things are made clear on the Wasserstein space.

Page 52

The Wasserstein Space

For a metric space $(\mathcal M, d)$:

$\mathcal P_2(\mathcal M) := \big\{q\text{: distribution on }\mathcal M \,\big|\, \exists\,x_0\in\mathcal M\text{ s.t. }\mathbb E_q[d(x_0, x)^2] < +\infty\big\}$.

$\mathcal P_2(\mathcal M)$ is a metric space ([70], Def 6.4) with the Wasserstein distance

$d_W(q, p) := \Big(\inf_{\pi\in\Pi(q, p)}\mathbb E_{\pi(x, y)}\big[d(x, y)^2\big]\Big)^{1/2}$,

where

$\Pi(q, p) := \Big\{\pi\text{: distribution on }\mathcal M\times\mathcal M \,\Big|\, \int_{\mathcal M}\pi(x, y)\,dy = q(x),\ \int_{\mathcal M}\pi(x, y)\,dx = p(y)\Big\}$.
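As an aside (a sketch, not from the talk): for two equal-size empirical measures with uniform weights, the infimum over couplings is attained at a permutation, so $d_W$ on samples can be computed exactly with an assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# 2-Wasserstein distance between two equal-size empirical measures with
# uniform weights: the optimal coupling is a permutation of the samples.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))      # samples of q
Y = rng.normal(3.0, 1.0, size=(200, 2))      # samples of p

cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # d(x_i, y_j)^2
i, j = linear_sum_assignment(cost)
print(np.sqrt(cost[i, j].mean()))            # ~||mean_q - mean_p|| = sqrt(18)
```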

Page 53

The Wasserstein Space: Riemannian Structure

For a Riem. manif. (M, 〈·, ·〉TxM), P2(M) also has a Riem. str. [58, 70, 6]:

Tangent vector v ⇐⇒ vector field X on M.

Tangent space at $q$: $T_q\mathcal P_2(\mathcal M) = \overline{\{\operatorname{grad} f \mid f\in C_c^\infty(\mathcal M)\}}^{L_q^2(\mathcal M)}$

is a subspace of $L_q^2(\mathcal M) := \big\{X \,\big|\, \mathbb E_{q(x)}\big[\langle X(x), X(x)\rangle_{T_x\mathcal M}\big] < \infty\big\}$.

([70], Thm 13.8; [6], Thm 8.3.1, Def 8.4.1, Prop 8.4.5)

Page 54

The Wasserstein Space: Riemannian Structure

For a Riem. manif. (M, 〈·, ·〉TxM), P2(M) also has a Riem. str. [58, 70, 6]:

Riemannian structure: $T_q\mathcal P_2$ inherits the inner product of $L_q^2$:

$\langle X, Y\rangle_{T_q\mathcal P_2} = \mathbb E_{q(x)}\big[\langle X(x), Y(x)\rangle_{T_x\mathcal M}\big]$.

It is consistent with $d_W$ [8].

Page 55

The Wasserstein Space: Riemannian Structure

Gradient flow on $\mathcal P_2(\mathcal M)$ for $\mathrm{KL}_p(q) := \mathbb E_q[\log(q/p)]$ (using the Riemannian measure):

$\mathcal P_2(\mathcal M)$ as a Riemannian manifold:

$V^{\mathrm{GF}} := -\operatorname{grad}\mathrm{KL}_p(q) = -\operatorname{grad}\Big(\frac{\delta}{\delta q}\mathrm{KL}_p(q)\Big) = \operatorname{grad}\log(p/q)$. ([70], Thm 23.18; [6], Example 11.1.2)

$\mathcal P_2(\mathcal M)$ as a metric space: e.g., the Minimizing Movement Scheme (MMS) ([6], Def. 2.0.6):

$q_{t+\epsilon} = \operatorname{argmin}_{q\in\mathcal P_2(\mathcal M)}\ \mathrm{KL}_p(q) + \frac{1}{2\epsilon}\,d_W^2(q, q_t)$.

They coincide under the Riemannian structure. ([70], Prop. 23.1, Rem. 23.4; [6], Thm. 11.1.6; [24], Lem. 2.7)

Exponential convergence when $p$ is log-concave. ([70], Thm 23.25, Thm 24.7; [6], Thm 11.1.4)

Page 56

Langevin Dynamics as Wasserstein Gradient Flow

The Langevin dynamics

$dx = \nabla\log p(x)\,dt + \sqrt 2\,dB_t(x)$

produces the same [14] evolving distribution $(q_t)$ as

$dx = \nabla\log\big(p(x)/q_t(x)\big)\,dt$,

which is the gradient flow of $\mathrm{KL}_p$ on $\mathcal P_2(\mathcal M)$ for Euclidean $\mathcal M$.

The gradient flow interpretation of LD was known earlier from the MMS perspective [34].

Page 57

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 58

Understanding ParVIs on the Wasserstein Space

Understand and accelerate ParVIs from the Wasserstein gradient flow perspective [43].

Consider Euclidean M = Rm for brevity.

Page 59

SVGD Approximates the Wasserstein Gradient Flow

Reformulate $V^{\mathrm{GF}}$ as:

$V^{\mathrm{GF}} = \max\cdot\operatorname{argmax}_{V\in L_q^2,\ \|V\|_{L_q^2}=1}\big\langle V^{\mathrm{GF}}, V\big\rangle_{L_q^2}. \quad (2)$

We find:

Theorem 10 ($V^{\mathrm{SVGD}}$ approximates $V^{\mathrm{GF}}$)

$V^{\mathrm{SVGD}} = \max\cdot\operatorname{argmax}_{V\in\mathcal H^D,\ \|V\|_{\mathcal H^D}=1}\big\langle V^{\mathrm{GF}}, V\big\rangle_{L_q^2}$.

$\mathcal H^D$ is a subspace of $L_q^2$, so $V^{\mathrm{SVGD}}$ is the projection of $V^{\mathrm{GF}}$ onto $\mathcal H^D$.

Page 60

ParVIs Approx. the Wass. Gradient Flow by Smoothing

Smoothing Functions

SVGD restricts the optimization domain $L_q^2$ to $\mathcal H^D$.

Theorem 11 ($\mathcal H^D$ smooths $L_q^2$)

For $\mathcal M = \mathbb R^D$, a Gaussian kernel $K$ on $\mathcal M$ and an absolutely continuous $q$, the vector-valued RKHS $\mathcal H^D$ of $K$ is isometrically isomorphic to the closure $\mathcal G := \overline{\{\phi * K : \phi\in C_c^\infty\}}^{L_q^2}$.

$\overline{C_c^\infty}^{L_q^2} = L_q^2$ ([36], Thm. 2.11) $\implies$ $\mathcal G$ is roughly the kernel-smoothed $L_q^2$.

Smoothing the Density

The Blob method (w-SGLD-B) [14]: partially smooths the density,

$V^{\mathrm{GF}} = -\nabla\Big(\frac{\delta}{\delta q}\mathbb E_q[\log(q/p)]\Big) \implies V^{\mathrm{Blob}} = -\nabla\Big(\frac{\delta}{\delta q}\mathbb E_q[\log(\tilde q/p)]\Big)$,

where $\tilde q := q * K$ is the kernel-smoothed density.

Page 61

ParVIs Approx. the Wass. Gradient Flow by Smoothing

Equivalence: a smoothing-function objective takes the form $\mathbb E_q[L(V)]$ with $L: L_q^2\to L_q^2$ linear,

$\implies \mathbb E_{\tilde q}[L(V)] = \mathbb E_{q*K}[L(V)] = \mathbb E_q[L(V) * K] = \mathbb E_q[L(V * K)]$,

so smoothing the density is equivalent to smoothing the test function.

Necessity: $\operatorname{grad}\mathrm{KL}_p(q)$ is undefined at $q = \hat q := \frac1N\sum_{i=1}^N\delta_{x^{(i)}}$.

Theorem 12 (Necessity of smoothing for SVGD)

For $q = \hat q$ and $V\in L_p^2$, problem (2),

$\max_{V\in L_p^2,\ \|V\|_{L_p^2}=1}\big\langle V^{\mathrm{GF}}, V\big\rangle_{L_{\hat q}^2}$,

has no optimal solution. In fact the supremum of the objective is infinite, indicating that a maximizing sequence of $V$ tends to be ill-posed.

ParVIs rely on the smoothing assumption! No free lunch!

Page 62

New ParVIs with Smoothing

Gradient Flow with Smoothed Density (GFSD): fully smooth the density:

$V^{\mathrm{GFSD}} := \nabla\log p - \nabla\log\tilde q$.

Gradient Flow with Smoothed test Functions (GFSF):

$V^{\mathrm{GF}} = \nabla\log p - \nabla\log q \implies V^{\mathrm{GF}} = \nabla\log p + \operatorname{argmin}_{U\in L^2}\max_{\phi\in C_c^\infty,\ \|\phi\|_{L_q^2}=1}\big(\mathbb E_q[\phi\cdot U - \nabla\cdot\phi]\big)^2$.

Smooth $\phi$: take $\phi$ from $\mathcal H^D$:

$V^{\mathrm{GFSF}} := \nabla\log p + \operatorname{argmin}_{U\in L^2}\max_{\phi\in\mathcal H^D,\ \|\phi\|_{\mathcal H^D}=1}\big(\mathbb E_q[\phi\cdot U - \nabla\cdot\phi]\big)^2$.

Solution: $V^{\mathrm{GFSF}} = V + K'K^{-1}$. (Note $V^{\mathrm{SVGD}} = V^{\mathrm{GFSF}}K$.) Here $V_{:,i} = \nabla_{x^{(i)}}\log p(x^{(i)})$, $K_{ij} = K(x^{(i)}, x^{(j)})$, $K'_{:,i} = \sum_j\nabla_{x^{(j)}}K(x^{(j)}, x^{(i)})$.
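A minimal sketch of the GFSD update for a 1-D standard-normal target, with $\tilde q$ the Gaussian-kernel density estimate of the particles (bandwidth and step size are illustrative):

```python
import numpy as np

# GFSD: V = grad log p - grad log (q * K), with q * K estimated as a
# Gaussian kernel-density estimate of the particles; target p = N(0, 1).
rng = np.random.default_rng(0)
x = rng.uniform(-6, 6, size=100)
grad_log_p = lambda x: -x
h, eps = 0.5, 0.05                          # bandwidth and step size (assumed)

for _ in range(500):
    diff = x[:, None] - x[None, :]          # diff[i, j] = x_i - x_j
    K = np.exp(-diff**2 / (2 * h))
    q_tilde = K.sum(axis=1)                 # unnormalized q * K at x_i
    grad_q = (-diff / h * K).sum(axis=1)    # its derivative at x_i
    x = x + eps * (grad_log_p(x) - grad_q / q_tilde)

print(x.mean(), x.var())                    # particles approximate N(0, 1)
```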

Page 63

Bandwidth Selection via the Heat Equation

Note

Under the dynamics $dx = -\nabla\log q_t(x)\,dt$, $q_t$ evolves following the heat equation (HE): $\partial_t q_t(x) = \Delta q_t(x)$.

Smoothing the density: $q_t(x)\approx\tilde q(x) = \tilde q(x; \{x^{(i)}\}_{i=1}^N)$. Then for $q_{t+\epsilon}(x)$:

Due to the HE, $q_{t+\epsilon}(x)\approx\tilde q(x) + \epsilon\Delta\tilde q(x)$.

Due to the effect of the dynamics, the updated particles $\{x^{(i)} - \epsilon\nabla\log\tilde q(x^{(i)})\}_{i=1}^N$ approximate $q_{t+\epsilon}$, so $q_{t+\epsilon}(x)\approx\tilde q\big(x; \{x^{(i)} - \epsilon\nabla\log\tilde q(x^{(i)})\}_{i=1}^N\big)$.

Objective: $\sum_k\Big(\tilde q(x^{(k)}) + \epsilon\Delta\tilde q(x^{(k)}) - \tilde q\big(x^{(k)}; \{x^{(i)} - \epsilon\nabla\log\tilde q(x^{(i)})\}_{i=1}^N\big)\Big)^2$.

Take $\epsilon\to 0$ and make the objective dimensionless ($h/x^2$ is dimensionless):

$\frac{1}{h^{D+2}}\sum_k\Big[\Delta\tilde q\big(x^{(k)}; \{x^{(i)}\}_i\big) + \sum_j\nabla_{x^{(j)}}\tilde q\big(x^{(k)}; \{x^{(i)}\}_i\big)\cdot\nabla\log\tilde q\big(x^{(j)}; \{x^{(i)}\}_i\big)\Big]^2$.

Also applicable to smoothing functions.

Page 64

Bandwidth Selection via the Heat Equation

Figure: Comparison of HE (bottom row) with the median method (top row) for bandwidth selection, across SVGD, Blob, GFSD, and GFSF (columns).

Page 65

Accelerated First-Order Methods on the Wasserstein Space

Nesterov's Acceleration Methods on Riemannian Manifolds: $r_k\in\mathcal P_2(\mathcal M)$ is an auxiliary variable; $V_k := -\operatorname{grad}\mathrm{KL}(r_k)$.

Riemannian Accelerated Gradient (RAG) [47] (with simplification):

$\begin{cases} q_k = \mathrm{Exp}_{r_{k-1}}(\epsilon V_{k-1}), \\ r_k = \mathrm{Exp}_{q_k}\Big[-\Gamma_{r_{k-1}}^{q_k}\Big(\frac{k-1}{k}\,\mathrm{Exp}_{r_{k-1}}^{-1}(q_{k-1}) - \frac{k+\alpha-2}{k}\,\epsilon V_{k-1}\Big)\Big]. \end{cases}$

Riemannian Nesterov's method (RNes) [74] (with simplification):

$\begin{cases} q_k = \mathrm{Exp}_{r_{k-1}}(\epsilon V_{k-1}), \\ r_k = \mathrm{Exp}_{q_k}\Big\{c_1\,\mathrm{Exp}_{q_k}^{-1}\Big[\mathrm{Exp}_{r_{k-1}}\Big((1-c_2)\,\mathrm{Exp}_{r_{k-1}}^{-1}(q_{k-1}) + c_2\,\mathrm{Exp}_{r_{k-1}}^{-1}(q_k)\Big)\Big]\Big\}. \end{cases}$

Required:

Exponential map $\mathrm{Exp}_q: T_q\mathcal P_2(\mathcal M)\to\mathcal P_2(\mathcal M)$ and its inverse.

Parallel transport $\Gamma_q^r: T_q\mathcal P_2(\mathcal M)\to T_r\mathcal P_2(\mathcal M)$.

Page 66

Accelerated First-Order Methods on the Wasserstein Space

Leveraging the Riemannian Structure of $\mathcal P_2(\mathcal M)$:

Exponential map ([70], Coro. 7.22; [6], Prop. 8.4.6; [24], Prop. 2.1): $\mathrm{Exp}_q(V) = (\mathrm{id} + V)_\# q$, i.e.,

$\{x^{(i)}\}_i\sim q \implies \{x^{(i)} + V(x^{(i)})\}_i\sim\mathrm{Exp}_q(V)$.

Inverse exponential map: requires the optimal transport map.

Sinkhorn methods [17, 72] appear costly and unstable. Instead, make approximations when $\{x^{(i)}\}_i$ and $\{y^{(i)}\}_i$ are pairwise close: $d(x^{(i)}, y^{(i)}) \ll \min\big\{\min_{j\neq i} d(x^{(i)}, x^{(j)}),\ \min_{j\neq i} d(y^{(i)}, y^{(j)})\big\}$.

Proposition 13 (Inverse exponential map)

For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big(\mathrm{Exp}_q^{-1}(r)\big)(x^{(i)})\approx y^{(i)} - x^{(i)}$.

Page 67

Accelerated First-Order Methods on the Wasserstein Space

Leveraging the Riemannian Structure of P2(M):

Parallel transport:

Analytical results [49, 50] are hard to implement; use Schild's ladder method [23, 35] for approximation.

Proposition 14 (Parallel transport)

For pairwise close samples $\{x^{(i)}\}_i$ of $q$ and $\{y^{(i)}\}_i$ of $r$, we have $\big(\Gamma_q^r(V)\big)(y^{(i)})\approx V(x^{(i)})$ for all $V\in T_q\mathcal P_2$.

Page 68

Accelerated First-Order Methods on the Wasserstein Space

Algorithm 3 The acceleration framework with Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov's method (WNes)

1: WAG: select acceleration factor $\alpha > 3$; WNes: select or calculate $c_1, c_2\in\mathbb R_+$;
2: Initialize $\{x_0^{(i)}\}_{i=1}^N$ distinctly; let $y_0^{(i)} = x_0^{(i)}$;
3: for $k = 1, 2, \cdots, k_{\max}$ do
4:   for $i = 1, \cdots, N$ do
5:     Find $V(y_{k-1}^{(i)})$ by SVGD/Blob/GFSD/GFSF;
6:     $x_k^{(i)} = y_{k-1}^{(i)} + \epsilon V(y_{k-1}^{(i)})$;
7:     $y_k^{(i)} = x_k^{(i)} + \begin{cases}\text{WAG: }\frac{k-1}{k}\big(y_{k-1}^{(i)} - x_{k-1}^{(i)}\big) + \frac{k+\alpha-2}{k}\,\epsilon V(y_{k-1}^{(i)}); \\ \text{WNes: } c_1(c_2 - 1)\big(x_k^{(i)} - x_{k-1}^{(i)}\big);\end{cases}$
8:   end for
9: end for
10: Return $\{x_{k_{\max}}^{(i)}\}_{i=1}^N$.
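A schematic rendering of the WNes branch of Algorithm 3 (a sketch: V stands for any of the SVGD/Blob/GFSD/GFSF updates above; the constants are illustrative):

```python
import numpy as np

# WNes branch of Algorithm 3: x_k = y_{k-1} + eps * V(y_{k-1});
# y_k = x_k + c1 * (c2 - 1) * (x_k - x_{k-1}).  V maps a particle array to
# a velocity array (e.g., an SVGD/Blob/GFSD/GFSF update).
def wnes(V, x0, eps=0.05, c1=0.9, c2=1.5, n_iter=500):
    x, y = x0.copy(), x0.copy()
    for _ in range(n_iter):
        x_new = y + eps * V(y)                    # gradient step at the lookahead
        y = x_new + c1 * (c2 - 1) * (x_new - x)   # Nesterov-style momentum
        x = x_new
    return x

# Usage with a trivial "vector field" flowing each particle to the mode:
rng = np.random.default_rng(0)
print(wnes(lambda y: -y, rng.uniform(-6, 6, size=100)).std())
```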

Page 69

Accelerated First-Order Methods on the Wasserstein Space

Experimental results: Bayesian inference for Latent Dirichlet Allocation:

Figure: Acceleration effect of WAG and WNes on LDA (measured by hold-out perplexity), for SVGD, Blob, GFSD, and GFSF, each with WGD/PO/WAG/WNes variants.

Figure: Comparison of SVGD and SGNHT on LDA, as representatives of ParVIs and MCMCs. Average over 10 runs.

Page 70

1 Introduction

2 Sampling on Manifolds
    Manifold Concepts
    MCMCs on Manifolds
    ParVIs on Manifolds

3 Understanding Sampling Methods on Probability Manifolds
    The Wasserstein Space
    Understanding ParVIs on the Wasserstein Space
    Understanding MCMCs on the Wasserstein Space

Page 71

Understanding MCMCs on the Wasserstein Space

Understanding MCMC dynamics as flows on the Wasserstein Space [44]:

The Langevin dynamics (LD) is recognized as the Wasserstein gradient flow of the KL divergence [34].

This benefits the analysis of its asymptotic [63] and non-asymptotic [22, 16] behaviors, and relates it to ParVIs [14, 43].

Does a general MCMC dynamics correspond to an interpretable flow on the Wasserstein space?

Page 72

The First Reformulation

Lemma 15 (Equivalent deterministic MCMC dynamics)

A general MCMC dynamics specified by a symm. pos. semi-def. $D$ and skew-symm. $Q$ via Eq. (1) produces the same distribution evolution as the deterministic dynamics

$dx = W_t(x)\,dt, \qquad (W_t)^i(x) = D^{ij}(x)\,\partial_j\log\big(p(x)/q_t(x)\big) + Q^{ij}(x)\,\partial_j\log p(x) + \partial_j Q^{ij}(x), \quad (3)$

where $q_t$ is the distribution density of $x$ at time $t$.

$\implies$ Barbour's generator [7]: $\mathcal A f := \frac{d}{dt}\mathbb E_{q_t}[f]\big|_{q_t=\delta_x} = \frac1p\,\partial_j\big[p\,(D^{ij} + Q^{ij})(\partial_i f)\big]$ (c.f. [29]).

How to interpret $W_t(x)$?

Page 75

Interpret MCMC Dynamics

$(W_t)^i(x) = D^{ij}(x)\,\partial_j\log\big(p(x)/q_t(x)\big) + Q^{ij}(x)\,\partial_j\log p(x) + \partial_j Q^{ij}(x)$.

1. $D^{ij}(x)\,\partial_j\log\big(p(x)/q_t(x)\big)$ seems like a gradient flow on $\mathcal P_2(\mathcal M)$:

Euclidean $\mathcal M$: $D = I$.

Hilbert $\mathcal M$: constant and non-singular $D$.

Riemannian $\mathcal M$: non-singular $D(x)$.

But we need positive semi-definite $D(x)$.

Page 76

Interpret MCMC Dynamics

Fiber bundle $\mathcal M$ (of dim. $M = m + n$) (known knowledge):

$\mathcal M$ is locally $\mathcal M_0\times\mathcal F$ ($\dim(\mathcal M_0) = m$, $\dim(\mathcal F) = n$) [57], in terms of a projection $\varpi$:

$\varpi: \mathcal M\to\mathcal M_0$, locally $\iff$ $\mathcal M_0\times\mathcal F\to\mathcal M_0$.

The fiber through $y\in\mathcal M_0$: $\mathcal M_y := \varpi^{-1}(y)$ (diffeomorphic to $\mathcal F$).

Coordinate decomposition: $x = (y, z)$, where $y\in\mathbb R^m$ are coordinates of $\mathcal M_0$ and $z\in\mathbb R^n$ are coordinates of $\mathcal M_y$.

Page 77

Interpret MCMC Dynamics

Fiber-Riemannian manifold $\mathcal M$:

Definition 3 (Fiber-Riemannian manifold)

$\mathcal M$ is a fiber-Riemannian manifold if it is a fiber bundle and there is a Riemannian structure $g^{\mathcal M_y}$ on each fiber $\mathcal M_y$.

Page 78

Interpret MCMC Dynamics

Gradient on the fiber $\mathcal M_y$:

$\big(\operatorname{grad}^{\mathcal M_y} f(y, z)\big)^a = \big(g^{\mathcal M_y}(z)\big)^{ab}\,\partial_{z^b} f(y, z), \quad 1\le a, b\le n$.

Define the fiber-gradient on $\mathcal M$ by taking the union over $y$:

$\big(\operatorname{grad}^{\mathrm{fib}} f(x)\big)_M := \Big(0_m,\ \big(\operatorname{grad}^{\mathcal M_{\varpi(x)}} f(\varpi(x), z)\big)_n\Big)$.

Page 79

Interpret MCMC Dynamics

(Wt)i(x) = Dij(x) ∂j log(p(x)/qt(x)) +Qij(x) ∂j log p(x) + ∂jQ

ij(x).

1 Dij(x) ∂j log(p(x)/qt(x)) seems like a gradient flow on P2(M).

Fiber-Riemannian manifold M:

Definition 3 (Fiber-Riemannian manifold)

M is a fiber-Riemannian manifold if it is a fiberbundle and there is a Riemannian structure gMy

on each fiber My.

Alternatively, the fiber-gradient on $\mathcal{M}$ is:
$$\big(\mathrm{grad}^{\mathrm{fib}} f(x)\big)^i = g^{ij}(x)\,\partial_j f(x), \quad 1 \le i, j \le M, \qquad \big(g^{ij}(x)\big)_{M\times M} := \begin{pmatrix} 0_{m\times m} & 0_{m\times n} \\ 0_{n\times m} & \big((g^{\mathcal{M}_{\varpi(x)}}(z))^{ab}\big)_{n\times n} \end{pmatrix}. \tag{4}$$
We use $g$ to denote the fiber-Riemannian structure.
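To make Eq. (4) concrete, here is a minimal NumPy sketch (my own illustration, not from the deck): a constant fiber metric, a simplifying assumption, is embedded into the $M \times M$ block structure, which annihilates the base ($y$) components of a gradient.

```python
import numpy as np

# Sketch of Eq. (4): g = diag(0_{m×m}, fiber block), so applying g to a
# full gradient keeps only the fiber (z) components.
def fiber_gradient(grad_f, m, g_fiber):
    M = m + g_fiber.shape[0]
    g = np.zeros((M, M))
    g[m:, m:] = g_fiber      # only the fiber block is nonzero
    return g @ grad_f

# Hypothetical example with m = 1 base dim and n = 2 fiber dims:
grad_f = np.array([0.3, -1.2, 0.7])          # ∂_j f at some x = (y, z)
print(fiber_gradient(grad_f, 1, np.eye(2)))  # -> [ 0.  -1.2  0.7]
```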


Structures on $\mathcal{P}_2(\mathcal{M})$ with fiber-Riemannian $\mathcal{M}$:

Hard to decompose $\mathcal{P}_2(\mathcal{M})$ directly, but
$$\{q(z|y) \in \mathcal{P}_2(\mathcal{M}_y) \mid y \in \mathcal{M}_0\} \;\overset{\text{locally}}{\Longleftrightarrow}\; \mathcal{M}_0 \times \mathcal{P}_2(\mathcal{M}_y): \text{fiber-Riemannian!}$$
On $\mathcal{P}_2(\mathcal{M}_y)$,
$$\big(\mathrm{grad}\,\mathrm{KL}_{p(\cdot|y)}(q(\cdot|y))(z)\big)^a = \big(g^{\mathcal{M}_y}(z)\big)^{ab}\,\partial_{z^b}\log\frac{q(z|y)}{p(z|y)} = \big(g^{\mathcal{M}_y}(z)\big)^{ab}\,\partial_{z^b}\log\frac{q(y,z)}{p(y,z)}, \quad 1 \le a, b \le n,$$
where the second equality holds because the marginals $q(y)$ and $p(y)$ are constant in $z$.
Taking the union over $y \in \mathcal{M}_0$, the fiber-gradient on $\mathcal{P}_2(\mathcal{M})$ is:
$$\big(\mathrm{grad}^{\mathrm{fib}}\,\mathrm{KL}_p(q)(x)\big)_M = \big(g^{ij}(x)\,\partial_j\log(q(x)/p(x))\big)_M.$$


Assumption 4 (Regular MCMC dynamics (1/2))
(a) $D = C$ or $D = 0$ or $D = \begin{pmatrix} 0 & 0 \\ 0 & C \end{pmatrix}$, for a symm. positive definite $C(x)$.
(b) …

Satisfied by existing MCMC instances. Could be relaxed by coordinate transformation.
$\Longrightarrow$ $D_{ij}\,\partial_j\log(p/q_t)$ is the fiber-gradient with fiber-Riemannian support $(\mathcal{M}, g)$ where $(g^{ij}) = D$.


2. $Q_{ij}(x)\,\partial_j\log p(x) + \partial_j Q_{ij}(x)$ makes a Hamiltonian flow.

The common Hamiltonian flow: $\mathcal{M} = \mathbb{R}^{2\ell}$, $Q = \begin{pmatrix} 0 & I_\ell \\ -I_\ell & 0 \end{pmatrix}$.

Symplectic manifold [18, 52]: $\mathcal{M}$ even-dim., $Q$ non-singular.
Poisson manifold $\mathcal{M}$ [25]:
Poisson structure: a bivector field $\beta = \beta^{ij}\,\partial_i \otimes \partial_j = \sum_{i<j} \beta^{ij}\,\partial_i \wedge \partial_j$ (an anti-symm. 2nd-order contravariant tensor field; $(\beta^{ij})$ is skew-symm.) that satisfies the Jacobi identity:
$$\beta^{il}\,\partial_l\beta^{jk} + \beta^{jl}\,\partial_l\beta^{ki} + \beta^{kl}\,\partial_l\beta^{ij} = 0, \quad \forall i, j, k. \tag{5}$$

Hamiltonian flow $X_f$ of a smooth function $f$:
$$\big(X_f(x)\big)[h] := \big(\beta(\mathrm{d}f, \mathrm{d}h)\big)(x) = \beta^{ij}(x)\,\partial_i f(x)\,\partial_j h(x).$$
Coordinate expression: $\big(X_f(x)\big)^i = \beta^{ij}(x)\,\partial_j f(x)$.
$X_f$ conserves $f$: $\frac{\mathrm{d}}{\mathrm{d}t} f(\varphi_t) = 0$ along its flow $\varphi_t$.
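A small numerical check (my own sketch, not from the deck): for a constant skew-symmetric $\beta$, which satisfies Eq. (5) trivially, integrating the coordinate flow $(X_f)^i = \beta^{ij}\partial_j f$ leaves $f$ numerically constant; the quadratic $f$ here is a hypothetical choice.

```python
import numpy as np

beta = np.array([[0.0, 1.0], [-1.0, 0.0]])   # constant skew-symm. Poisson structure
f = lambda x: 0.5 * x @ x                    # Hamiltonian f(x) = |x|^2 / 2
grad_f = lambda x: x

def rk4_step(x, h):
    v = lambda y: beta @ grad_f(y)           # (X_f)^i = β^{ij} ∂_j f
    k1 = v(x); k2 = v(x + h/2*k1); k3 = v(x + h/2*k2); k4 = v(x + h*k3)
    return x + h/6 * (k1 + 2*k2 + 2*k3 + k4)

x = np.array([1.0, 0.0])
for _ in range(1000):
    x = rk4_step(x, 0.01)
print(f(x))  # ≈ 0.5: f is conserved up to integration error
```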


Poisson structure on $\mathcal{P}_2(\mathcal{M})$ [49, 5, 26] (known knowledge): the Hamiltonian flow of a function $F$ on $\mathcal{P}_2(\mathcal{M})$ is
$$X_F(q) = \pi_q(X_f),$$
where the function $f$ on $\mathcal{M}$ relates to $F$ via $\mathrm{grad}_q\,\mathbb{E}_q[f] = \mathrm{grad}_q F(q)$, and $\pi_q$ is the orthogonal projection $L^2_q(\mathcal{M}) \to T_q\mathcal{P}_2(\mathcal{M})$, which does not change the distribution evolution.

Hamiltonian flow of KL on $\mathcal{P}_2(\mathcal{M})$:

Lemma 2 (Hamiltonian flow of KL on $\mathcal{P}_2(\mathcal{M})$)
The Hamiltonian flow of $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$ is $X_{\mathrm{KL}_p}(q) = \pi_q(X_{\log(q/p)})$, where $\big(X_{\log(q/p)}(x)\big)^i = \beta^{ij}(x)\,\partial_j\log(q(x)/p(x))$.


$$-\big(X_{\log(q/p)}(x)\big)^i = \beta^{ij}(x)\,\partial_j\log p(x) - \beta^{ij}(x)\,\partial_j\log q(x).$$

Assumption 4 (Regular MCMC dynamics (2/2))
(a) $D = C$ or $D = 0$ or $D = \begin{pmatrix} 0 & 0 \\ 0 & C \end{pmatrix}$, for a symm. positive definite $C(x)$.
(b) $Q(x)$ satisfies Eq. (5): $Q^{il}\,\partial_l Q^{jk} + Q^{jl}\,\partial_l Q^{ki} + Q^{kl}\,\partial_l Q^{ij} = 0,\ \forall i, j, k$.

Satisfied by MCMCs except for SGNHT-related methods [20, 75]. Required to match the Poisson structure; unnecessary for conservation of the Hamiltonian.

$Q_{ij}\,\partial_j\log p + \partial_j Q_{ij} \overset{?}{\Longleftrightarrow} Q_{ij}\,\partial_j\log p - Q_{ij}\,\partial_j\log q$? Yes! The two fields differ by $Q_{ij}\,\partial_j\log q + \partial_j Q_{ij}$, whose $q$-weighted divergence vanishes by the skew-symmetry of $Q$, so both produce the same distribution evolution.
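As a hedged sketch (mine, not the authors'), condition (b) can be checked numerically for a given field $Q(x)$ via finite differences; any constant skew-symmetric $Q$, like the hypothetical one below, satisfies Eq. (5) trivially.

```python
import numpy as np

# Finite-difference check of the Jacobi-type condition (Eq. (5)) for Q(x).
def jacobi_residual(Q_fn, x, eps=1e-5):
    M = len(x)
    def dQ(l, i, j):  # ∂_l Q^{ij}(x), central difference
        e = np.zeros(M); e[l] = eps
        return (Q_fn(x + e)[i, j] - Q_fn(x - e)[i, j]) / (2 * eps)
    Q = Q_fn(x)
    res = 0.0
    for i in range(M):
        for j in range(M):
            for k in range(M):
                s = sum(Q[i, l] * dQ(l, j, k) + Q[j, l] * dQ(l, k, i)
                        + Q[k, l] * dQ(l, i, j) for l in range(M))
                res = max(res, abs(s))
    return res

Q_fn = lambda x: np.array([[0.0, 1.0], [-1.0, 0.0]])  # constant => Eq. (5) holds
print(jacobi_residual(Q_fn, np.array([0.3, -0.7])))   # ≈ 0
```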


Interpret MCMC Dynamics: Main Theorem

Theorem 5 (Equivalence between regular MCMC dynamics on $\mathbb{R}^M$ and fGH flows on $\mathcal{P}_2(\mathcal{M})$)

We call $(\mathcal{M}, g, \beta)$ a fiber-Riemannian Poisson (fRP) manifold, and define the fiber-gradient Hamiltonian (fGH) flow on $\mathcal{P}_2(\mathcal{M})$ as:
$$\mathcal{W}_{\mathrm{KL}_p} := -\pi(\mathrm{grad}^{\mathrm{fib}}\,\mathrm{KL}_p) - X_{\mathrm{KL}_p}, \qquad \big(\mathcal{W}_{\mathrm{KL}_p}(q)\big)^i = \pi_q\big((g^{ij} + \beta^{ij})\,\partial_j\log(p/q)\big). \tag{6}$$
Then: regular MCMC dynamics $\Longleftrightarrow$ fGH flow with fRP $\mathcal{M}$, $(D, Q) \Longleftrightarrow (g, \beta)$.
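Under the simplifying assumptions of constant $D$ and $Q$, and dropping the projection $\pi_q$ (which, as noted above, does not change the distribution evolution), the fGH drift of Eq. (6) reduces to a one-liner. This is an illustrative sketch of mine, not the full construction:

```python
import numpy as np

# Sketch of the fGH drift in Eq. (6) with constant (D, Q) <-> (g, β):
#   W(x) = (D + Q) (∇log p(x) - ∇log q(x))
def fgh_drift(x, D, Q, grad_log_p, grad_log_q):
    return (D + Q) @ (grad_log_p(x) - grad_log_q(x))
```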


Interpret MCMC Dynamics: Case Study

Type 1: D is non-singular (m = 0 in Eq. (4)).

M0 degenerates, M is the unique fiber.

M is Riemannian, fiber gradient =⇒ gradient.

The fGH flow: $\mathcal{W}_{\mathrm{KL}_p} = -\pi(\mathrm{grad}\,\mathrm{KL}_p) - X_{\mathrm{KL}_p}$:
$-\pi(\mathrm{grad}\,\mathrm{KL}_p)$: the steepest-descent direction of $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$.
$-X_{\mathrm{KL}_p}$: conserves $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$ and helps mixing/exploration.

Converges to p uniquely (c.f. [51]).

Robust to SG (c.f. [65, 69]).

Instances:

LD [62] / SGLD [71]: $Q = 0$, $\mathcal{M}$ is Euclidean.
RLD [28] / SGRLD [60]: $Q = 0$, $\mathcal{M}$ is the manifold under consideration.
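For reference, a minimal stochastic-simulation sketch of the Type-1 instance (SG)LD, i.e., $D = I$, $Q = 0$ on Euclidean $\mathcal{M}$; `grad_log_p` is a user-supplied (possibly stochastic) gradient, and the step size is a hypothetical choice.

```python
import numpy as np

# One (SG)LD step: x <- x + ε ∇log p(x) + sqrt(2ε) ξ,  ξ ~ N(0, I).
def sgld_step(x, grad_log_p, eps=1e-3):
    return x + eps * grad_log_p(x) + np.sqrt(2 * eps) * np.random.randn(*x.shape)
```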


Type 2: D = 0 (n = 0 in Eq. (4)).

M0 =M, fibers degenerate.

M has no (fiber-)Riemannian structures.

The fGH flow: $\mathcal{W}_{\mathrm{KL}_p} = -X_{\mathrm{KL}_p}$ conserves $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$ and helps mixing/exploration.

Fragile against SG: no stabilizing forces (i.e., (fiber-)gradient flows) (c.f. [15, 9]).

Hard to extend to ParVIs.

Instances ($\ell$-dim. sample space $\mathcal{S}$):
HMC [21, 56, 10] ($\mathcal{S} = \mathbb{R}^\ell$): $\mathcal{M} = \mathbb{R}^{2\ell}$.
HMC relies on geometric ergodicity for convergence [48, 10].
RHMC [28] / LagrMC [38] / GMC [12] (manifold $\mathcal{S}$): $\mathcal{M} = T^*\mathcal{S}$.


Type 3: $D \neq 0$ and $D$ is singular ($m, n \ge 1$ in Eq. (4)).

Non-degenerate M0 and My.

M is a non-trivial fRP manifold.

The fGH flow: $\mathcal{W}_{\mathrm{KL}_p} := -\pi(\mathrm{grad}^{\mathrm{fib}}\,\mathrm{KL}_p) - X_{\mathrm{KL}_p}$:
$-\pi(\mathrm{grad}^{\mathrm{fib}}\,\mathrm{KL}_p)$: the steepest-descent direction of $\mathrm{KL}_{p(\cdot|y)}(q(\cdot|y))$ on each fiber $\mathcal{P}_2(\mathcal{M}_y)$.
$-X_{\mathrm{KL}_p}$: conserves $\mathrm{KL}_p$ on $\mathcal{P}_2(\mathcal{M})$ and helps mixing/exploration.

Robust to SG (SG appears on each fiber) (c.f. [15, 13]).

Instances ($\ell$-dim. sample space $\mathcal{S}$):
SGHMC [15] ($\mathcal{S} = \mathbb{R}^\ell$), SGRHMC [51] / SGGMC [42] (manifold $\mathcal{S}$): $\mathcal{M}_0 = \mathcal{S}$, $\mathcal{M}_\theta = T^*_\theta\mathcal{S}$.
SGNHT [20] ($\mathcal{S} = \mathbb{R}^\ell$), gSGNHT [42] (manifold $\mathcal{S}$): $\mathcal{M}_0 = \mathcal{S}$, $\mathcal{M}_\theta = \mathbb{R} \times T^*_\theta\mathcal{S}$.


ParVI Simulation for SGHMC

Simulate the deterministic dynamics of SGHMC:

By Lemma 15 (Eq. (3)):
$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \Sigma^{-1} r, \qquad \frac{\mathrm{d}r}{\mathrm{d}t} = \nabla_\theta \log p(\theta) - C\Sigma^{-1} r - C\nabla_r \log q(r).$$
By Theorem 5 (Eq. (6)):
$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \Sigma^{-1} r + \nabla_r \log q(r), \qquad \frac{\mathrm{d}r}{\mathrm{d}t} = \nabla_\theta \log p(\theta) - C\Sigma^{-1} r - C\nabla_r \log q(r) - \nabla_\theta \log q(\theta).$$

To estimate $\nabla \log q$ with particles, use ParVI techniques [43], e.g., Blob [14]:
$$-\nabla_r \log q(r^{(i)}) \approx -\frac{\sum_k \nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(i,j)}} - \sum_k \frac{\nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(j,k)}},$$
where $K_r^{(i,j)} := K_r(r^{(i)}, r^{(j)})$.
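A minimal NumPy sketch of this Blob estimate (assuming an RBF kernel $K(r, r') = \exp(-\|r - r'\|^2/(2h))$ with a hand-picked bandwidth, rather than the HE method used in the experiments):

```python
import numpy as np

def blob_neg_grad_log_q(R, h=0.1):
    """Blob estimate of -∇log q at each particle; R has shape (N, d)."""
    diff = R[:, None, :] - R[None, :, :]            # r^(i) - r^(k), shape (N, N, d)
    K = np.exp(-np.sum(diff**2, axis=-1) / (2*h))   # K^(i,k), shape (N, N)
    gradK = -diff / h * K[..., None]                # ∇_{r^(i)} K^(i,k)
    term1 = gradK.sum(1) / K.sum(1, keepdims=True)        # Σ_k ∇K / Σ_j K^(i,j)
    term2 = (gradK / K.sum(0)[None, :, None]).sum(1)      # Σ_k ∇K / Σ_j K^(j,k)
    return -term1 - term2

R = np.random.randn(50, 2)           # 50 hypothetical particles in R^2
print(blob_neg_grad_log_q(R).shape)  # -> (50, 2)
```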


pSGHMC-det:
$$\frac{\Delta\theta^{(i)}}{\varepsilon} = \Sigma^{-1} r^{(i)}, \qquad \frac{\Delta r^{(i)}}{\varepsilon} = \nabla_\theta \log p(\theta^{(i)}) - C\Sigma^{-1} r^{(i)} - C\left(\frac{\sum_k \nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(i,j)}} + \sum_k \frac{\nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(j,k)}}\right).$$

pSGHMC-fGH:
$$\frac{\Delta\theta^{(i)}}{\varepsilon} = \Sigma^{-1} r^{(i)} + \frac{\sum_k \nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(i,j)}} + \sum_k \frac{\nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(j,k)}},$$
$$\frac{\Delta r^{(i)}}{\varepsilon} = \nabla_\theta \log p(\theta^{(i)}) - \left(\frac{\sum_k \nabla_{\theta^{(i)}} K_\theta^{(i,k)}}{\sum_j K_\theta^{(i,j)}} + \sum_k \frac{\nabla_{\theta^{(i)}} K_\theta^{(i,k)}}{\sum_j K_\theta^{(j,k)}}\right) - C\Sigma^{-1} r^{(i)} - C\left(\frac{\sum_k \nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(i,j)}} + \sum_k \frac{\nabla_{r^{(i)}} K_r^{(i,k)}}{\sum_j K_r^{(j,k)}}\right).$$

Advantages:
Over SGHMC: particle-efficiency, and ParVI techniques like HE [43] apply.
Over ParVIs: more efficient dynamics than LD.
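Putting the pieces together, here is a hedged sketch of one pSGHMC-det step (assuming scalar $\Sigma^{-1}$ and $C$, and reusing the `blob_neg_grad_log_q` sketch above; `grad_log_p` stands in for the, possibly stochastic, $\nabla_\theta \log p$):

```python
import numpy as np

def psghmc_det_step(theta, r, grad_log_p, blob_neg_grad_log_q,
                    eps=0.01, sigma_inv=1.0, c=0.5):
    """One pSGHMC-det update for particle arrays (theta, r) of shape (N, d)."""
    theta = theta + eps * sigma_inv * r
    r = r + eps * (grad_log_p(theta)
                   - c * sigma_inv * r
                   + c * blob_neg_grad_log_q(r))  # -C ∇_r log q(r), Blob estimate
    return theta, r
```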


Experimental Results: Synthetic

Figure: Dynamics simulation results. Rows correspond to Blob, SGHMC, pSGHMC-det, pSGHMC-fGH, respectively. All methods adopt the same step size 0.01, and SGHMC-related methods share the same $\Sigma^{-1} = 1.0$, $C = 0.5$. In each row, figures are plotted every 300 iterations, and the last one for 10,000 iterations. The HE method [43] is used for bandwidth selection.


Experimental Results: Latent Dirichlet Allocation (LDA)

Figure: Performance on LDA with the ICML data set. (a) Learning curve: hold-out perplexity vs. iteration (20 particles) for Blob, SGHMC, pSGHMC-det, and pSGHMC-fGH. (b) Particle efficiency: hold-out perplexity vs. number of particles at iteration 600 for SGHMC, pSGHMC-det, and pSGHMC-fGH.


Experimental Results: Bayesian Neural Networks (BNNs)

Figure: Performance on BNN with the MNIST data set. (a) Learning curve: error rate vs. epoch (10 particles) for Blob, SGHMC, pSGHMC-det, and pSGHMC-fGH. (b) Particle efficiency: error rate vs. number of particles at epoch 80 for SGHMC, pSGHMC-det, and pSGHMC-fGH.


Thanks! Questions?


References

Ralph Abraham and Jerrold E. Marsden. Foundations of mechanics. Benjamin/Cummings Publishing Company, Reading, Massachusetts, 1978.

Ralph Abraham, Jerrold E. Marsden, and Tudor Ratiu. Manifolds, tensor analysis, and applications, volume 75. Springer Science & Business Media, New York, 2012.

Shun-Ichi Amari. Information geometry and its applications. Springer, Tokyo, 2016.

Shun-Ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Society, Providence, Rhode Island, 2007.

Luigi Ambrosio and Wilfrid Gangbo. Hamiltonian ODEs in the Wasserstein space of probability measures. Communications on Pure and Applied Mathematics, 61(1):18–53, 2008.

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, Berlin, 2008.

Andrew D. Barbour. Stein's method for diffusion approximations. Probability Theory and Related Fields, 84(3):297–322, 1990.

Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.

Michael Betancourt. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 533–540, Lille, France, 2015.

Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.

Marcus A. Brubaker, Mathieu Salzmann, and Raquel Urtasun. A family of MCMC methods on implicitly defined manifolds. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS 2012), pages 161–172, La Palma, Canary Islands, 2012.

Simon Byrne and Mark Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.

Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2269–2277, Montreal, Canada, 2015.

Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2018), Monterey, California, USA, 2018.

Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1683–1691, Beijing, China, 2014.

Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. arXiv preprint arXiv:1705.09048, 2017.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, Lake Tahoe, Nevada, USA, 2013.

Ana Cannas Da Silva. Lectures on symplectic geometry, volume 3575. Springer, Boston, 2001.

Tim R. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M. Tomczak. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.

Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, Montreal, Canada, 2014.

Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.

J. Ehlers, F. Pirani, and A. Schild. The geometry of free fall and light propagation. In General Relativity (papers in honour of J. L. Synge), pages 63–84, 1972.

Matthias Erbar et al. The heat equation on manifolds as a gradient flow in the Wasserstein space. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 46, pages 1–23, Paris, 2010.

Rui Loja Fernandes and Ioan Mărcuț. Lectures on Poisson Geometry. Springer, Basel, 2014.

Wilfrid Gangbo, Hwa Kil Kim, and Tommaso Pacini. Differential forms on Wasserstein space and infinite-dimensional Hamiltonian systems. American Mathematical Society, Providence, Rhode Island, 2010.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Readings in Computer Vision, pages 564–584. Elsevier, Los Altos, California, USA, 1987.

Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.

Jackson Gorham, Andrew B. Duncan, Sebastian J. Vollmer, and Lester Mackey. Measuring sample quality with diffusions. arXiv preprint arXiv:1611.06972, 2016.

Daniele Grattarola, Lorenzo Livi, and Cesare Alippi. Adversarial autoencoders with constant-curvature latent manifolds. arXiv preprint arXiv:1812.04314, 2018.

W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

Heinz Hopf and Willi Rinow. Über den Begriff der vollständigen differentialgeometrischen Fläche. Commentarii Mathematici Helvetici, 3(1):209–225, 1931.

I. M. James. The topology of Stiefel manifolds, volume 24. Cambridge University Press, New York, 1976.

Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

Arkady Kheyfets, Warner A. Miller, and Gregory A. Newton. Schild's ladder parallel transport procedure for an arbitrary connection. International Journal of Theoretical Physics, 39(12):2891–2898, 2000.

Ondřej Kováčik and Jiří Rákosník. On spaces $L^{p(x)}$ and $W^{k,p(x)}$. Czechoslovak Mathematical Journal, 41(4):592–618, 1991.

Hendrik Anthony Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.

Shiwei Lan, Vasileios Stathopoulos, Babak Shahbaba, and Mark Girolami. Markov chain Monte Carlo from Lagrangian dynamics. Journal of Computational and Graphical Statistics, 24(2):357–378, 2015.

Paul Langevin. Sur la théorie du mouvement brownien. Comptes Rendus, 146:530–533, 1908.

Chunyuan Li, Changyou Chen, Kai Fan, and Lawrence Carin. High-order stochastic gradient thermostats for Bayesian learning of deep models. arXiv preprint arXiv:1512.07662, 2015.

Chang Liu and Jun Zhu. Riemannian Stein variational gradient descent for Bayesian inference. In The 32nd AAAI Conference on Artificial Intelligence, pages 3627–3634, New Orleans, Louisiana, USA, 2018.

Chang Liu, Jun Zhu, and Yang Song. Stochastic gradient geodesic MCMC methods. In Advances in Neural Information Processing Systems 29, pages 3009–3017, Barcelona, Spain, 2016.

Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu, and Lawrence Carin. Understanding and accelerating particle-based variational inference. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4082–4092, Long Beach, California, USA, 2019.

Chang Liu, Jingwei Zhuo, and Jun Zhu. Understanding MCMC dynamics as flows on the Wasserstein space. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4093–4103, Long Beach, California, USA, 2019.

Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3118–3126, Long Beach, California, USA, 2017.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2370–2378, Barcelona, Spain, 2016.

Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4875–4884, Long Beach, California, USA, 2017.

Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. arXiv preprint arXiv:1601.08057, 2016.

John Lott. Some geometric calculations on Wasserstein space. Communications in Mathematical Physics, 277(2):423–437, 2008.

John Lott. An intrinsic parallel transport in Wasserstein space. Proceedings of the American Mathematical Society, 145(12):5329–5340, 2017.

Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2899–2907, Montreal, Canada, 2015.

Jerrold E. Marsden and Tudor S. Ratiu. Introduction to mechanics and symmetry: a basic exposition of classical mechanical systems, volume 17. Springer Science & Business Media, Berlin, 2013.

Emile Mathieu, Charline Le Lan, Chris J. Maddison, Ryota Tomioka, and Yee Whye Teh. Hierarchical representations with Poincaré variational auto-encoders. arXiv preprint arXiv:1901.06033, 2019.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Yoshihiro Nagano, Shoichiro Yamaguchi, Yasuhiro Fujita, and Masanori Koyama. A differentiable Gaussian-like distribution on hyperbolic space for gradient-based learning. arXiv preprint arXiv:1902.02992, 2019.

Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.

Liviu I. Nicolaescu. Lectures on the Geometry of Manifolds. World Scientific, Singapore, 2007.

Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1–2):101–174, 2001.

Ivan Ovinnikov. Poincaré Wasserstein autoencoder. arXiv preprint arXiv:1901.01427, 2019.

Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, Lake Tahoe, Nevada, USA, 2013.

Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 903–910, Haifa, Israel, 2010.

Gareth O. Roberts and Osnat Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337–357, 2002.

Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.

Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pages 880–887, Helsinki, Finland, 2008.

Issei Sato and Hiroshi Nakagawa. Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Itô process. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 982–990, Beijing, China, 2014.

Yang Song and Jun Zhu. Bayesian matrix completion via adaptive relaxed spectral regularization. In The 30th AAAI Conference on Artificial Intelligence (AAAI-16), pages 2044–2050, Phoenix, Arizona, USA, 2016.

Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, New York, 2008.

Eduard L. Stiefel. Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii Mathematici Helvetici, 8(1):305–353, 1935.

Yee Whye Teh, Alexandre H. Thiery, and Sebastian J. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):193–225, 2016.

Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, Berlin, 2008.

Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pages 681–688, Bellevue, Washington, USA, 2011.

Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for computing Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.

Viktor Yanush and Dmitry Kropotov. Hamiltonian Monte-Carlo for orthogonal matrices. arXiv preprint arXiv:1901.08045, 2019.

Hongyi Zhang and Suvrit Sra. An estimate sequence for geodesically convex optimization. In Proceedings of the 31st Annual Conference on Learning Theory (COLT 2018), pages 1703–1723, Stockholm, Sweden, 2018.

Yizhe Zhang, Changyou Chen, Zhe Gan, Ricardo Henao, and Lawrence Carin. Stochastic gradient monomial Gamma sampler. arXiv preprint arXiv:1706.01498, 2017.