Large-Scale Bayesian Learning in the Big Data Era: Focusing on Stochastic Gradient Langevin Dynamics

Issei Sato

The University of Tokyo / JST PRESTO

Kawarabayashi ERATO Thanksgiving Festival (感謝祭), Summer 2014

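Likelihood maximization

• Data: x1:n = {x1, ..., xn}

• Likelihood: p(x1:n|θ)

• Maximum likelihood estimate: θ* = argmax_θ p(x1:n|θ)

• Predictive distribution: p(x|θ*)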

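Posterior maximization

• Data: x1:n = {x1, ..., xn}

• Likelihood: p(x1:n|θ)

• Posterior distribution (Bayes' theorem): p(θ|x1:n) = p(x1:n|θ) p(θ) / p(x1:n)

• MAP estimate: θ* = argmax_θ p(θ|x1:n)

• Predictive distribution: p(x|θ*)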


[Figure: the posterior distribution p(θ|x1:n) with the point estimate θ* marked]

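Bayesian prediction

• Data: x1:n = {x1, ..., xn}

• Likelihood: p(x1:n|θ)

• Posterior distribution: p(θ|x1:n) ∝ p(x1:n|θ) p(θ)

• Predictive distribution: p(x|x1:n) = ∫ p(x|θ) p(θ|x1:n) dθ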
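The three predictors introduced so far (maximum likelihood, MAP, and Bayesian prediction) can be compared on a toy model. The sketch below is not from the talk: it assumes coin-flip data with a Beta(a, b) prior, chosen only because all three quantities have closed forms. The function name `predict_heads` and the prior parameters are illustrative assumptions.

```python
from fractions import Fraction


def predict_heads(data, a=2, b=2):
    """Return P(next flip = 1) under ML, MAP, and Bayesian prediction.

    Assumed toy model: data is a list of 0/1 flips with a Beta(a, b) prior
    on theta (a, b > 1 so that the posterior mode is well defined).
    """
    n, k = len(data), sum(data)
    ml = Fraction(k, n)                          # argmax of the likelihood
    map_est = Fraction(k + a - 1, n + a + b - 2)  # mode of the Beta posterior
    bayes = Fraction(k + a, n + a + b)            # integral of theta against the posterior
    return ml, map_est, bayes


ml, map_est, bayes = predict_heads([1, 1, 1])
print(ml, map_est, bayes)
```

With only three observations the point estimates are more extreme than the Bayesian prediction, which averages over the whole posterior instead of committing to a single θ*.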

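Matrix Factorization

[Figure: a matrix R with unknown entries ("?") factorized into matrices U and V]

Probabilistic Matrix Factorization

[Figure: the same R, U, V schematic]

Bayesian Matrix Factorization

[Figure: the same R, U, V schematic]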

Bayesian prediction

• Data: x1:n = {x1, ..., xn}

• Likelihood: p(x1:n|θ)

• Predictive distribution: p(x|x1:n) = ∫ p(x|θ) p(θ|x1:n) dθ

Computing this is computationally expensive.

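Approximate Bayesian inference

• Sampling-based approximation: replace the integral over θ with an average over samples drawn from the posterior

• Variational Bayes: replace the posterior with a tractable approximating distribution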

Example of a sampling-based approximation

Metropolis-Hastings

[Diagram labels: Target distribution, Propose, Accept/Reject Test]
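A minimal random-walk Metropolis-Hastings sketch makes the propose and accept/reject steps concrete. It is not the talk's code: the Gaussian model (N(θ, 1) likelihood, N(0, 10²) prior), the proposal width `step`, and the function names are assumptions made for illustration. Note that every accept/reject test evaluates the likelihood over all n data points.

```python
import math
import random


def log_posterior(theta, data):
    # Unnormalized log target: assumed N(0, 10^2) prior and N(theta, 1) likelihood.
    # The sum over all data points is the O(n) cost of each evaluation.
    log_prior = -theta ** 2 / (2 * 10.0 ** 2)
    log_lik = sum(-(x - theta) ** 2 / 2 for x in data)
    return log_prior + log_lik


def metropolis_hastings(data, n_samples, step=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, data)
    samples = []
    for _ in range(n_samples):
        # Propose: symmetric Gaussian random walk around the current state
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept/reject test: accept with probability min(1, target ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of target densities.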

Motivation: the big-n problem

• Sampling-based approximation: O(n) per update; with subsampling → O(m)

• Variational Bayes: O(n) per update; with subsampling → O(m)

* A hot topic: a tutorial on it was held at ICML 2014.

* Today's topic

Recent work on sampling-based approximation + subsampling

• Stochastic Gradient Langevin Dynamics (SGLD) [Welling & Teh, ICML2011]

• Stochastic Gradient Riemannian Langevin Dynamics [Patterson & Teh, NIPS2013]

• Distributed Stochastic Gradient MCMC [Ahn, Shahbaba & Welling, ICML2014]

• Theoretical Analysis of SGLD by Fokker-Planck Equation and Ito Process [Sato & Nakagawa, ICML2014]

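Stochastic Gradient Langevin Dynamics (SGLD) [Welling & Teh, 2011]

Samples are generated by

θ_{t+1} = θ_t + (ε_t/2) ( ∇log p(θ_t) + (n/m) Σ_{i∈S_t} ∇log p(x_i|θ_t) ) + η_t

• Mini-batch: S_t, a random subset of m of the n data points

• Stochastic gradient: (n/m) Σ_{i∈S_t} ∇log p(x_i|θ_t)

• Injected Gaussian noise: η_t ~ N(0, ε_t)

• Annealed step-size: Σ_t ε_t = ∞, Σ_t ε_t² < ∞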
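The SGLD update of Welling & Teh (2011) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the toy model (mean θ of N(θ, 1) data with a N(0, `prior_var`) prior), the schedule ε_t = a(b + t)^(−γ), and all parameter defaults are assumptions.

```python
import math
import random


def sgld_gaussian_mean(data, n_steps, batch_size, a=0.01, b=1.0, gamma=0.55,
                       prior_var=100.0, seed=0):
    """SGLD samples for the mean theta of N(theta, 1) data (assumed toy model)."""
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for t in range(n_steps):
        eps = a * (b + t) ** (-gamma)             # annealed step size
        batch = rng.sample(data, batch_size)      # mini-batch of m points, O(m) work
        grad_log_prior = -theta / prior_var
        # Stochastic gradient: mini-batch sum rescaled by n / m
        grad_log_lik = (n / batch_size) * sum(x - theta for x in batch)
        noise = rng.gauss(0.0, math.sqrt(eps))    # injected Gaussian noise, variance eps
        theta += 0.5 * eps * (grad_log_prior + grad_log_lik) + noise
        samples.append(theta)
    return samples
```

Dropping the `noise` term leaves an ordinary stochastic gradient ascent step on the log posterior. The injected noise is what turns the optimizer into a sampler.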


Stochastic Gradient Langevin Dynamics (SGLD) [Welling & Teh, 2011]

The mini-batch gradient step with an annealed step-size is a Stochastic Gradient Method.

Theoretical Analysis of SGLD by Fokker-Planck Equation and Ito Process [Sato & Nakagawa, ICML2014]

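Motivation

• Annealing step-size → slow mixing rate
  → Constant step-size

• The original SGLD paper inserts an MH step between updates
  → We would like to omit it (empirically this works)

• What distribution does θ generated by such an SGLD converge to?
  → Fokker-Planck equation

• In what sense does θ converge?
  → Ito process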

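Main Results

On the convergence of the probability distribution of θt:
The probability distribution of θt generated by SGLD converges to the Bayesian posterior distribution.

On the convergence of θt:
θt converges weakly but does not converge strongly.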


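Virtual Time Line

[Figure: a time axis marking the SGLD update times up to time T]

• N: total number of SGLD updates

• Update time: the time of the k-th SGLD update

• Time interval: the interval between consecutive update times

• i.e., adjusting the time intervals amounts to adjusting T

Roadmap:

1. Analyze the probability distribution q(t,θ) of θt at time t

2. Find the stationary distribution q(θ) of q(t,θ)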

Fokker-Planck equation [Risken & Frank, 1984; Daum, 1994]

A differential equation describing the time evolution of a p.d.f.

• q(t,θ): p.d.f. of θ at time t

• U(θ): energy function

Condition: [equation not preserved]

⇒ stationary distribution q(θ) ∝ exp(−U(θ)), up to a normalization term
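For reference, one standard form of the Fokker-Planck equation and its stationary solution is written out below. The slide's own equation is not preserved, so the scaling here (unit temperature, diffusion coefficient 1) is an assumption, as is the use of q(t,θ), U(θ), and Z as symbols.

```latex
% Assumed form: overdamped Langevin dynamics  d\theta = -\nabla U(\theta)\,dt + \sqrt{2}\,dW_t
\frac{\partial q(t,\theta)}{\partial t}
  = \nabla \cdot \bigl( q(t,\theta)\, \nabla U(\theta) \bigr) + \Delta q(t,\theta)

% Stationary solution: with q(\theta) = e^{-U(\theta)}/Z we have
% \nabla q = -q\,\nabla U, so the flux q\,\nabla U + \nabla q vanishes.
q(\theta) = \frac{1}{Z} \exp\bigl(-U(\theta)\bigr),
\qquad Z = \int \exp\bigl(-U(\theta)\bigr)\, d\theta < \infty
```

Under this form, choosing U(θ) to be the negative log of the unnormalized posterior makes the stationary distribution the Bayesian posterior.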

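From the FP equation to the Bayesian posterior

If q(t,θ) follows the FP equation and q(θ) is the stationary distribution of q(t,θ), then

q(θ) ∝ exp(−U(θ)), where U(θ) = −L(θ) and L(θ) = log p(x1:n|θ) + log p(θ)

⇒ q(θ) = p(θ|x1:n), the Bayesian posterior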

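Our plan:

• Analyze the distribution q(t,θ) of θt generated by SGLD at time t
  - q(t,θ) satisfies the FP equation

• Analyze the energy function U(θ)
  - U(θ) = −L(θ)

Problem setting

• ε: constant step size

[Equation not preserved: the SGLD update with constant step size ε, with its stochastic noise term labeled]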

Assumption

The expectations of the stochastic noise over the mini-batch sampling set: [equations not preserved]

* This equality always holds. *
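The slide's equation is not preserved. A natural reading, and an assumption here, is that the equality that always holds is the zero-mean property of the stochastic noise: under uniform subsampling, the rescaled mini-batch gradient equals the full gradient in expectation. The sketch below checks this exactly by enumerating every mini-batch of a tiny data set. The Gaussian log-likelihood and the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def full_gradient(theta, data):
    # Gradient of sum_i log N(x_i | theta, 1) with respect to theta
    return sum(x - theta for x in data)


def minibatch_gradient(theta, batch, n):
    # Mini-batch sum rescaled by n / m
    return Fraction(n, len(batch)) * sum(x - theta for x in batch)


def expected_minibatch_gradient(theta, data, m):
    # Exact expectation under uniform subsampling: average over every size-m subset
    batches = list(combinations(data, m))
    total = sum(minibatch_gradient(theta, b, len(data)) for b in batches)
    return total / len(batches)


data = [Fraction(1), Fraction(3), Fraction(4), Fraction(7)]
theta = Fraction(2)
print(full_gradient(theta, data), expected_minibatch_gradient(theta, data, 2))
```

Exact rational arithmetic is used so that the two quantities can be compared without floating-point tolerance.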

Result

Let q(t,θ) be the p.d.f. of θt generated by SGLD.

[Equation not preserved; one term is labeled "Negative log-likelihood"]

Proof sketch: take the Fourier transform, then apply the inverse Fourier transform.

What we know so far:

As ε → 0, the stationary distribution of θt generated by SGLD is the Bayesian posterior.

What we want to know next:

The (discretization) error caused by ε > 0
⇒ convergence analysis of θt


From the FP equation to the S.D.E.

FP equation for SGLD

→ Stochastic Differential Equation (Ito process)

Ito Process [Ito, 1944]

dX_t = a(t, X_t) dt + b(t, X_t) dW_t

• W_t: Wiener process

• a, b: Lipschitz-continuous functions of linear growth

Discrete approximation:

Y_{k+1} = Y_k + a(t_k, Y_k) Δt + b(t_k, Y_k) ΔW_k,  ΔW_k ~ N(0, Δt)

* Euler-Maruyama method
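The Euler-Maruyama discretization of an Ito process can be sketched as follows. The example S.D.E. is an assumption chosen for illustration: the Langevin equation dX = −X dt + √2 dW for the energy U(x) = x²/2, whose stationary distribution is N(0, 1). The function name `euler_maruyama` and its signature are hypothetical.

```python
import math
import random


def euler_maruyama(drift, diffusion, x0, T, n_steps, rng):
    """Simulate dX = drift(X) dt + diffusion(X) dW on [0, T]; return the path."""
    dt = T / n_steps
    x = x0
    path = [x]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))   # Wiener increment ~ N(0, dt)
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path


# Assumed example: Langevin S.D.E. for U(x) = x^2 / 2, stationary law N(0, 1)
rng = random.Random(0)
finals = [
    euler_maruyama(lambda x: -x, lambda x: math.sqrt(2.0), 3.0, 5.0, 200, rng)[-1]
    for _ in range(2000)
]
mean = sum(finals) / len(finals)
var = sum((v - mean) ** 2 for v in finals) / len(finals)
print(mean, var)
```

With a finite Δt the simulated variance is slightly biased away from 1, which is the kind of discretization error the convergence analysis quantifies.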

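Strong convergence and weak convergence

Strong convergence

A time-discrete approximation Y^δ with maximum step size δ converges strongly to X at time T if

lim_{δ→0} E[ |X_T − Y^δ(T)| ] = 0

Weak convergence

A time-discrete approximation Y^δ converges weakly to X at time T if

lim_{δ→0} | E[h(X_T)] − E[h(Y^δ(T))] | = 0

for any continuously differentiable function h of polynomial growth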


From the S.D.E. to SGLD

S.D.E. representation (Ito process) of SGLD

SGLD = discrete approximation of the S.D.E. + stochastic approximation noise

Tools used: Ito formula, Gronwall inequality, Feynman-Kac formula, etc.

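Error analysis

• Strong approximation error of SGLD
  [Equation not preserved; terms labeled SGLD, S.D.E, and Stochastic noise]
  ⇒ SGLD does not converge strongly

• Weak approximation error of SGLD, for any continuously differentiable function h
  [Equation not preserved; terms labeled SGLD and S.D.E]
  ⇒ SGLD converges weakly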

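Discussion & Conclusion

Strong convergence: important for sample-path analysis

* Sample-path analysis is rarely performed in Bayesian inference.

Weak convergence: important for Bayesian inference

Computing the expectation E[h(θ)] of some function h is the basic computation in Bayesian inference,

e.g., the predictive distribution: h(θ) = p(x|θ).

SGLD is a promising algorithm for approximating expectations under the Bayesian posterior by sample averages.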
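The basic computation E[h(θ)] can be sketched as a plain sample average. For illustration only, the posterior samples below are exact draws from an assumed Beta(5, 2) posterior over a coin bias rather than SGLD output, and h(θ) = p(x = 1|θ) = θ, so the estimate approximates the predictive probability. The helper name `posterior_expectation` is hypothetical.

```python
import random


def posterior_expectation(h, samples):
    # Sample average approximating E[h(theta)] under the posterior
    return sum(h(theta) for theta in samples) / len(samples)


# Assumed posterior: Beta(5, 2), for which E[theta] = 5 / 7 in closed form
rng = random.Random(0)
samples = [rng.betavariate(5, 2) for _ in range(20000)]
pred = posterior_expectation(lambda theta: theta, samples)
print(pred)
```

Only the average of h over the samples matters here, not the individual sample paths, which is why weak convergence is the relevant notion.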

Q & A


Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) [Patterson & Teh, 2013]

[Equation not preserved; terms labeled: natural gradient, change in curvature, align noise]

SGRLD results - LDA

[Figure: results plots]

• Datasets: NIPS (2483 documents), Wikipedia (150,000 documents)

• Compared methods: OVB [Hoffman, Blei & Bach, 2010], HSVG [Mimno, Hoffman & Blei, 2012]

Distributed SGLD [Ahn, Shahbaba & Welling, 2014]

[Figure: a total of N data points split into partitions of N1, N2, and N3 points]

1. Trajectory Sampling

2. Adaptive Load Balancing

3. Chain Coupling

D-SGLD Results

Model: Latent Dirichlet Allocation

Wikipedia dataset: 4.6M articles, 811M tokens, vocabulary size: 7702

PubMed dataset: 8.2M articles, 730M tokens, vocabulary size: 39987

AD-LDA: Newman et al. (2007)
