Probabilistic Inference (CO-493)
Variational Inference
Marc Deisenroth
Department of Computing
Imperial College London
February 19, 2019
Learning Material
§ Pattern Recognition and Machine Learning, Chapter 10 (Bishop, 2006)
§ Machine Learning: A Probabilistic Perspective, Chapter 21 (Murphy, 2012)
§ Variational Inference: A Review for Statisticians (Blei et al., 2017)
§ NIPS-2016 Tutorial by Blei, Ranganath, Mohamed: https://nips.cc/Conferences/2016/Schedule?showEvent=6199
§ Tutorials by S. Mohamed: http://shakirm.com/papers/VITutorial.pdf and http://shakirm.com/slides/MLSS2018-Madrid-ProbThinking.pdf
Variational Inference Marc Deisenroth @Imperial College, February 19, 2019 2
Overview
Introduction and Background
Key Idea
Optimization Objective
Conditionally Conjugate Models
Mean-Field Variational Inference
Stochastic Variational Inference
Limits of Classical Variational Inference
Black-Box Variational Inference
Computing Gradients of Expectations
  Score Function Gradients
  Pathwise Gradients
Amortized Inference
Richer Posterior Approximations
Probabilistic Pipeline
[Figure: probabilistic pipeline with the stages Build model, Discover patterns, Predict and explore, and Criticize model, and with Assumptions and Data as inputs. Adapted from the NIPS-2016 tutorial by Blei et al.]
§ Use knowledge and assumptions about the data to build a model
§ Use model and data to discover patterns
§ Predict and explore
§ Criticize/revise the model
Inference is the key algorithmic problem: What does the model say about the data?
Goal: general and scalable approaches to inference
Probabilistic Machine Learning
[Figure: graphical model with latent variable $z$ and observed variable $x$]
§ Probabilistic model: Joint distribution of latent variables $z$ and observed variables $x$ (data):
$$p(x, z)$$
§ Inference: Learning about the unknowns $z$ through the posterior distribution
$$p(z|x) = \frac{p(x, z)}{p(x)}, \qquad p(x) = \int p(x|z)\,p(z)\,dz$$
§ Normally: Denominator (marginal likelihood/evidence) intractable (i.e., we cannot compute the integral)
We therefore need approximate inference to get the posterior
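To make the role of the evidence integral concrete, here is a minimal sketch for a one-dimensional toy model where the integral can still be brute-forced on a grid. The model ($p(z) = \mathcal{N}(0, 1)$, $p(x|z) = \mathcal{N}(z, 1)$), the observation `x_obs`, and the grid are assumptions made for illustration and are not from the slides. In high-dimensional latent spaces such a grid is no longer feasible, which is why approximate inference is needed.

```python
import numpy as np

def normal_pdf(v, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at v."""
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

x_obs = 1.5                                  # a single made-up observation
z = np.linspace(-10.0, 10.0, 2001)           # grid over the latent variable
dz = z[1] - z[0]

# Joint p(x, z) = p(x|z) p(z) for the assumed toy model
joint = normal_pdf(x_obs, z, 1.0) * normal_pdf(z, 0.0, 1.0)

# Evidence p(x) = integral of p(x|z) p(z) dz, approximated by a Riemann sum
evidence = np.sum(joint) * dz

# Posterior p(z|x) = p(x, z) / p(x), evaluated on the grid
posterior = joint / evidence
posterior_mean = np.sum(z * posterior) * dz

# Closed forms for this conjugate toy model: p(x) = N(x; 0, 2), E[z|x] = x/2
evidence_exact = normal_pdf(x_obs, 0.0, np.sqrt(2.0))
print(evidence, evidence_exact, posterior_mean)
```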
Some Options for Posterior Inference
§ Markov Chain Monte Carlo (to sample from the posterior)
§ Laplace approximation
§ Expectation propagation (Minka, 2001)
§ Variational inference (Jordan et al., 1999)
Variational Inference
§ Variational inference is the most scalable inference method available (at the moment)
§ Can handle (arbitrarily) large datasets
§ Applications include:
  § Topic modeling (Hoffman et al., 2013)
  § Community detection (Gopalan & Blei, 2013)
  § Genetic analysis (Gopalan et al., 2016)
  § Reinforcement learning (e.g., Eslami et al., 2016)
  § Neuroscience analysis (Manning et al., 2014)
  § Compression and content generation (Gregor et al., 2016)
  § Traffic analysis (Kucukelbir et al., 2016; Salimbeni & Deisenroth, 2017)
Key Idea: Approximation by Optimization
[Figure: the variational family $q(z|\nu)$, with the optimization path from $\nu^{\text{init}}$ to $\nu^*$ and the posterior $p(z|x)$. Figure adapted from Blei et al.'s NIPS-2016 tutorial]
§ Find approximation of a probability distribution (e.g., posterior) by optimization:
  1. Define an objective function
  2. Define a (parametrized) family of approximating distributions $q_\nu$
  3. Optimize objective function w.r.t. variational parameters $\nu$
§ Inference becomes optimization
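The three steps can be sketched on a deliberately simple case. This is an illustration under stated assumptions, not the lecture's implementation: the target is a known Gaussian standing in for the posterior, the family is Gaussian with $\nu$ = (`mean`, `std`), the objective is the KL divergence defined later in these slides, and plain gradient descent with hand-derived gradients serves as the optimizer.

```python
import numpy as np

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL(N(m_q, s_q^2) || N(m_p, s_p^2)) in closed form."""
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p)**2) / (2.0 * s_p**2) - 0.5

# Stand-in "posterior" (assumed known here so the objective has a closed form)
target_mean, target_std = 2.0, 0.5

# Step 1: objective = kl_gauss(mean, std, target_mean, target_std)
# Step 2: family q(z | nu) = N(mean, std^2) with nu = (mean, std); nu_init below
mean, std = 0.0, 1.0
step_size = 0.05

# Step 3: gradient descent on the objective w.r.t. nu
for _ in range(2000):
    grad_mean = (mean - target_mean) / target_std**2      # d KL / d mean
    grad_std = -1.0 / std + std / target_std**2           # d KL / d std
    mean -= step_size * grad_mean
    std -= step_size * grad_std

final_kl = kl_gauss(mean, std, target_mean, target_std)
print(mean, std, final_kl)
```

Because the target lies inside the Gaussian family here, the optimizer recovers it and the KL divergence goes to zero. With a real posterior the family usually does not contain the target, and $\nu^*$ is only the closest member.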
Some Useful Quantities
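§ Kullback-Leibler divergence
$$\mathrm{KL}(q(x)\,\|\,p(x)) = \int q(x)\log\frac{q(x)}{p(x)}\,dx = \mathbb{E}_q\left[\log\frac{q(x)}{p(x)}\right] = \mathbb{E}_q[\log q(x)] - \mathbb{E}_q[\log p(x)]$$
§ Differential entropy
$$\mathbb{H}[q(x)] = -\mathbb{E}_q[\log q(x)] = -\int q(x)\log q(x)\,dx$$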
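Both quantities are expectations under $q$, so they can be estimated from samples of $q$. The sketch below checks this against the known closed forms for Gaussians. The two Gaussians, the sample size, and the helper `log_normal_pdf` are illustrative assumptions.

```python
import numpy as np

def log_normal_pdf(v, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

rng = np.random.default_rng(0)
q_mean, q_std = 0.0, 1.0        # q(x), chosen for illustration
p_mean, p_std = 1.0, 2.0        # p(x), chosen for illustration

x = rng.normal(q_mean, q_std, size=200_000)      # samples from q
log_q = log_normal_pdf(x, q_mean, q_std)
log_p = log_normal_pdf(x, p_mean, p_std)

kl_mc = np.mean(log_q - log_p)                   # E_q[log q(x)] - E_q[log p(x)]
entropy_mc = -np.mean(log_q)                     # -E_q[log q(x)]

# Closed forms for univariate Gaussians
kl_exact = np.log(p_std / q_std) + (q_std**2 + (q_mean - p_mean)**2) / (2.0 * p_std**2) - 0.5
entropy_exact = 0.5 * np.log(2.0 * np.pi * np.e * q_std**2)
print(kl_mc, kl_exact, entropy_mc, entropy_exact)
```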
Optimization Objective
[Figure: the variational family $q(z|\nu)$, with the optimization path from $\nu^{\text{init}}$ to $\nu^*$ and the posterior $p(z|x)$. Figure adapted from Blei et al.'s NIPS-2016 tutorial]
§ Need to compare distributions (variational approximation $q$, true posterior $p$): Kullback-Leibler divergence
§ Find variational parameters $\nu$ by minimizing the KL divergence $\mathrm{KL}(q(z|\nu)\,\|\,p(z|x))$ between the variational approximation $q$ and the true posterior.
Optimization Objective (2)
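§ Minimize the KL divergence $\mathrm{KL}(q(z|\nu)\,\|\,p(z|x))$ between the variational approximation $q$ and the true posterior.
$q(z|\nu) = p(z|x)$ is the optimal solution
$$\begin{aligned}
\mathrm{KL}(q(z)\,\|\,p(z|x)) &= \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z|x)] \\
&= \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z)] - \mathbb{E}_q[\log p(x|z)] + \log p(x) \\
&= \mathrm{KL}(q(z)\,\|\,p(z)) - \mathbb{E}_q[\log p(x|z)] + \text{const}
\end{aligned}$$
The second line uses Bayes' theorem, $\log p(z|x) = \log p(x|z) + \log p(z) - \log p(x)$. The term $\log p(x)$ does not depend on $q$ and is therefore a constant in the optimization.
§ Not required to know the unknown posterior $p(z|x)$
§ Minimizing $\mathrm{KL}(q(z)\,\|\,p(z|x))$ is equivalent to maximizing a lower bound on the marginal likelihood (ELBO)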
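The decomposition can be checked numerically in a model where every term has a closed form. The conjugate toy model ($p(z) = \mathcal{N}(0, 1)$, $p(x|z) = \mathcal{N}(z, 1)$, hence $p(z|x) = \mathcal{N}(x/2, 1/2)$ and $p(x) = \mathcal{N}(0, 2)$), the observation, and the parameters of $q$ are assumptions chosen for this sketch.

```python
import numpy as np

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL(N(m_q, s_q^2) || N(m_p, s_p^2)) in closed form."""
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p)**2) / (2.0 * s_p**2) - 0.5

x_obs = 1.5          # made-up observation
m, s = 0.3, 0.8      # arbitrary variational parameters of q(z) = N(m, s^2)

# Left-hand side: KL to the true posterior p(z|x) = N(x/2, 1/2)
kl_to_posterior = kl_gauss(m, s, x_obs / 2.0, np.sqrt(0.5))

# Right-hand side: KL to the prior, expected log-likelihood, log-evidence
kl_to_prior = kl_gauss(m, s, 0.0, 1.0)
expected_loglik = -0.5 * np.log(2.0 * np.pi) - 0.5 * ((x_obs - m)**2 + s**2)
log_evidence = -0.5 * np.log(2.0 * np.pi * 2.0) - x_obs**2 / 4.0

rhs = kl_to_prior - expected_loglik + log_evidence
print(kl_to_posterior, rhs)
```

Only `kl_to_posterior` requires the posterior. The right-hand side needs just the prior, the likelihood, and a constant, which is what makes the objective usable in practice.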
Evidence Lower Bound
Importance Sampling
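Key idea: Transform an intractable integral into an expectation under a simpler distribution $q$ (proposal distribution):
[Figure: plot of $p(x)$, $f(x)$ and $q(x)$ against $x$]
$$\begin{aligned}
\mathbb{E}_p[f(x)] &= \int f(x)\,p(x)\,dx \\
&= \int f(x)\,p(x)\,\frac{q(x)}{q(x)}\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \\
&= \mathbb{E}_q\left[f(x)\,\frac{p(x)}{q(x)}\right]
\end{aligned}$$
If we choose $q$ in a way that we can easily sample from it, we can approximate this last expectation by Monte Carlo:
$$\mathbb{E}_q\left[f(x)\,\frac{p(x)}{q(x)}\right] \approx \frac{1}{S}\sum_{s=1}^{S} f(x^{(s)})\,\frac{p(x^{(s)})}{q(x^{(s)})} = \frac{1}{S}\sum_{s=1}^{S} w_s\, f(x^{(s)}), \qquad w_s = \frac{p(x^{(s)})}{q(x^{(s)})}, \quad x^{(s)} \sim q(x)$$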
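A minimal sketch of the Monte Carlo estimator above, assuming a case where the answer is known so the estimate can be checked: $p(x) = \mathcal{N}(0, 1)$, $f(x) = x^2$ (so $\mathbb{E}_p[f(x)] = 1$), and a wider Gaussian proposal $q(x) = \mathcal{N}(0, 2^2)$. These choices and the sample size are illustrative.

```python
import numpy as np

def log_normal_pdf(v, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

rng = np.random.default_rng(0)
S = 100_000
proposal_std = 2.0

# x^(s) ~ q(x) = N(0, 2^2)
x = rng.normal(0.0, proposal_std, size=S)

# Importance weights w_s = p(x^(s)) / q(x^(s)), computed in log space
log_w = log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 0.0, proposal_std)
w = np.exp(log_w)

# (1/S) * sum_s w_s f(x^(s)) with f(x) = x^2; the true value E_p[x^2] is 1
estimate = np.mean(w * x**2)
print(estimate)
```

A proposal with heavier tails than the target keeps the weights bounded. A proposal narrower than the target would make a few weights very large and the estimate unreliable.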
Importance Sampling: Properties
§ Unbiased estimate of the expectation
§ Many draws from posterior needed, especially in high dimensions
§ Good for evaluating integrals, but we don't know much about the posterior distribution
§ Degeneracy: a few samples can end up carrying almost all of the weight
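Weight degeneracy can be made visible with the effective sample size, $1 / \sum_s \tilde{w}_s^2$ for normalized weights $\tilde{w}_s$. This diagnostic is a common one but is not mentioned on the slides. The sketch assumes a standard-normal target and a slightly wider Gaussian proposal in `dim` dimensions, both chosen only for illustration.

```python
import numpy as np

def effective_sample_size(dim, num_samples, rng):
    """ESS of self-normalized importance weights for an assumed toy problem.

    Target p = N(0, I), proposal q = N(0, 1.5^2 I), both in `dim` dimensions.
    """
    proposal_std = 1.5
    x = rng.normal(0.0, proposal_std, size=(num_samples, dim))
    # log p(x) - log q(x), summed over dimensions; the 2*pi terms cancel
    log_w = np.sum(
        -0.5 * x**2 + 0.5 * (x / proposal_std) ** 2 + np.log(proposal_std),
        axis=1,
    )
    w = np.exp(log_w - np.max(log_w))   # subtract the max for numerical stability
    w /= np.sum(w)                      # normalize the weights to sum to one
    return 1.0 / np.sum(w**2)

rng = np.random.default_rng(0)
S = 10_000
ess_by_dim = {d: effective_sample_size(d, S, rng) for d in (1, 10, 50)}
print(ess_by_dim)
```

Even with this mild mismatch between target and proposal, the effective sample size shrinks quickly as the dimension grows, so most of the `S` draws contribute almost nothing to the estimate.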
Jensen’s Inequality
An important result from convex analysis:
Jensen's Inequality
For concave functions $f$:
$$f(\mathbb{E}[z]) \geq \mathbb{E}[f(z)]$$
[Figure: plot of $f(x)$ against $x$]
Logarithms are concave. Therefore:
$$\log \mathbb{E}[f(z)] = \log \int f(z)\,p(z)\,dz \geq \int p(z)\log f(z)\,dz = \mathbb{E}[\log f(z)]$$
Idea: For computing the marginal likelihood, use Jensen's inequality instead of MCMC
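The inequality for the logarithm can be checked on samples. The sketch assumes $z \sim \mathcal{N}(0, 1)$ and the positive function $f(z) = e^z$, for which $\log \mathbb{E}[f(z)] = 0.5$ and $\mathbb{E}[\log f(z)] = 0$. Both choices are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=200_000)   # z ~ p(z) = N(0, 1)
f = np.exp(z)                            # a positive function f(z), assumed here

log_of_mean = np.log(np.mean(f))         # estimates log E[f(z)] = 0.5
mean_of_log = np.mean(np.log(f))         # estimates E[log f(z)] = E[z] = 0
print(log_of_mean, mean_of_log)
```

The gap between the two numbers is the price paid for moving the logarithm inside the expectation. In the ELBO derivation this gap is the KL divergence between $q$ and the true posterior.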
From Importance Sampling to Variational Inference
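Look at log-marginal likelihood (log-evidence):
$$\begin{aligned}
\log p(x) &= \log \int p(x|z)\,p(z)\,dz \\
&= \log \int p(x|z)\,p(z)\,\frac{q(z)}{q(z)}\,dz \\
&= \log \int p(x|z)\,\frac{p(z)}{q(z)}\,q(z)\,dz \\
&= \log \mathbb{E}_q\left[p(x|z)\,\frac{p(z)}{q(z)}\right] \\
&\geq \mathbb{E}_q\left[\log\left(p(x|z)\,\frac{p(z)}{q(z)}\right)\right] \\
&= \mathbb{E}_q[\log p(x|z)] - \mathbb{E}_q\left[\log\left(\frac{q(z)}{p(z)}\right)\right] \\
&= \mathbb{E}_q[\log p(x|z)] - \mathrm{KL}(q(z)\,\|\,p(z))
\end{aligned}$$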
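The bound can be estimated by sampling from $q$ and compared with the exact log-evidence in a model where the latter is known. The sketch assumes the conjugate toy model $p(z) = \mathcal{N}(0, 1)$, $p(x|z) = \mathcal{N}(z, 1)$, for which $p(x) = \mathcal{N}(0, 2)$ and $p(z|x) = \mathcal{N}(x/2, 1/2)$. The function `elbo_estimate` and all numbers are illustrative, not from the slides.

```python
import numpy as np

def log_normal_pdf(v, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def elbo_estimate(x_obs, q_mean, q_std, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(x|z) + log p(z) - log q(z)]."""
    z = rng.normal(q_mean, q_std, size=num_samples)             # z ~ q(z)
    log_joint = log_normal_pdf(x_obs, z, 1.0) + log_normal_pdf(z, 0.0, 1.0)
    return np.mean(log_joint - log_normal_pdf(z, q_mean, q_std))

rng = np.random.default_rng(0)
x_obs = 1.5                                                     # made-up observation
log_evidence = log_normal_pdf(x_obs, 0.0, np.sqrt(2.0))         # exact log p(x)

# An arbitrary q gives a strict lower bound
elbo_poor = elbo_estimate(x_obs, 0.3, 0.8, 100_000, rng)
# q equal to the true posterior makes the bound tight
elbo_best = elbo_estimate(x_obs, x_obs / 2.0, np.sqrt(0.5), 100_000, rng)
print(log_evidence, elbo_poor, elbo_best)
```

The difference `log_evidence - elbo_poor` estimates $\mathrm{KL}(q(z)\,\|\,p(z|x))$, so raising the bound and shrinking the KL divergence to the posterior are the same thing.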
Evidence Lower Bound (ELBO)
§ We just lower-bounded the evidence (marginal likelihood):
$$\log p(x) \geq \mathbb{E}_q[\log p(x|z)] - \mathrm{KL}(q(z)\,\|\,p(z)) =: \mathrm{ELBO}$$
§ Data-fit term (expected log-likelihood): Measures how well samples from $q(z)$ explain the data (“reconstruction cost”).
→ Encourages $q$ to place its mass on the MAP estimate.
§ Regularizer: Variational posterior $q(z)$ should not differ much from the prior $p(z)$.
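Both terms can be evaluated in closed form in simple cases. The sketch below assumes a toy model that is not from the slides (z ~ N(0, 1), x|z ~ N(z, 1), Gaussian q(z) = N(m, s2)), for which the exact posterior is N(x/2, 1/2) and log p(x) = log N(x; 0, 2). The function names are hypothetical.

```python
import math

def data_fit(x, m, s2):
    """E_q[log p(x|z)] for p(x|z) = N(x; z, 1) and q(z) = N(m, s2)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)

def kl_to_prior(m, s2):
    """Regularizer KL(N(m, s2) || N(0, 1))."""
    return 0.5 * (s2 + m ** 2 - 1.0 - math.log(s2))

def elbo(x, m, s2):
    """Data-fit term minus regularizer."""
    return data_fit(x, m, s2) - kl_to_prior(m, s2)

x = 1.0
log_evidence = -0.5 * (math.log(2 * math.pi * 2.0) + x ** 2 / 2.0)  # log N(x; 0, 2)

elbo_at_prior = elbo(x, 0.0, 1.0)          # q = prior: zero regularizer, poorer data fit
elbo_at_posterior = elbo(x, x / 2.0, 0.5)  # q = exact posterior: the bound is tight
```

Setting q to the prior removes the regularizer but leaves a gap to the log-evidence, whereas setting q to the exact posterior trades a non-zero KL term for a better data fit and closes the gap.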
ELBO and KL Divergence
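§ Maximizing the ELBO w.r.t. the variational parameters $\nu$ is equivalent to minimizing $\mathrm{KL}(q(z)\,\|\,p(z|x))$, where $p(z|x)$ is the true posterior:
$$
\begin{aligned}
\mathrm{KL}(q(z)\,\|\,p(z|x)) &= \mathrm{KL}(q(z)\,\|\,p(z)) - \mathbb{E}_q[\log p(x|z)] + \text{const} \\
&= -\mathrm{ELBO} + \text{const}
\end{aligned}
$$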
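The constant in this identity can be made explicit. The following derivation is a sketch that uses only Bayes' rule, $p(z|x) = p(x|z)\,p(z)/p(x)$; it shows that the constant is the log-evidence, which does not depend on $\nu$.

```latex
\begin{aligned}
\mathrm{KL}(q(z)\,\|\,p(z|x))
  &= \mathbb{E}_q[\log q(z) - \log p(z|x)] \\
  &= \mathbb{E}_q[\log q(z) - \log p(z) - \log p(x|z)] + \log p(x) \\
  &= \mathrm{KL}(q(z)\,\|\,p(z)) - \mathbb{E}_q[\log p(x|z)] + \log p(x) \\
  &= -\mathrm{ELBO} + \log p(x)
\end{aligned}
```

Because the KL divergence is non-negative, this also recovers $\log p(x) \geq \mathrm{ELBO}$, with equality when $q(z)$ equals the true posterior.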
Optimizing the ELBO
$$
\begin{aligned}
F(\nu) &= \underbrace{\mathbb{E}_q[\log p(x|z)]}_{\text{data-fit}} - \underbrace{\mathrm{KL}(q(z|\nu)\,\|\,p(z))}_{\text{regularizer}} \\
&= \mathbb{E}_q[\log p(x, z)] + \mathrm{H}[q(z|\nu)]
\end{aligned}
$$
§ Choice of the variational distribution $q(z|\nu)$ → Parametrization
§ Computational challenges:
  § Compute expectation
  § Compute gradients $dF/d\nu$ for variational learning
  § ELBO is non-convex → Local optima
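The second form of $F(\nu)$ follows by expanding the KL term. A short sketch of that step, where $\mathrm{H}[q] = -\mathbb{E}_q[\log q(z|\nu)]$ denotes the entropy of the variational distribution:

```latex
\begin{aligned}
F(\nu) &= \mathbb{E}_q[\log p(x|z)] - \mathbb{E}_q[\log q(z|\nu) - \log p(z)] \\
       &= \mathbb{E}_q[\log p(x|z) + \log p(z)] - \mathbb{E}_q[\log q(z|\nu)] \\
       &= \mathbb{E}_q[\log p(x, z)] + \mathrm{H}[q(z|\nu)]
\end{aligned}
```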
Overview
[Figure: the variational family $q(z|\nu)$ with $\nu_{\text{init}}$, $\nu^*$ and the posterior $p(z|x)$ marked. Figure adapted from Blei et al.'s NIPS-2016 tutorial]
§ Find approximation of a probability distribution (e.g., posterior) by optimization:
  1. Define an objective function
  2. Define a (parametrized) family of approximating distributions $q_\nu$
  3. Optimize objective function w.r.t. variational parameters $\nu$
§ Inference → Optimization
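The three steps can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration and not taken from the slides: the model (z ~ N(0, 1), x|z ~ N(z, 1)), the Gaussian family q(z|ν) = N(m, s²) with ν = (m, log s), the hand-derived gradient, the starting point and the step size.

```python
import math

x = 1.0  # single observation (illustrative value)

# 1. Objective: closed-form ELBO of the assumed toy model.
def elbo(m, log_s):
    s2 = math.exp(2.0 * log_s)
    data_fit = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl = 0.5 * (s2 + m ** 2 - 1.0 - 2.0 * log_s)
    return data_fit - kl

def elbo_grad(m, log_s):
    """Hand-derived gradient (dF/dm, dF/dlog_s) of the ELBO above."""
    return x - 2.0 * m, 1.0 - 2.0 * math.exp(2.0 * log_s)

# 2. Family: q(z | nu) = N(m, s^2) with nu = (m, log s); nu_init is arbitrary.
m, log_s = -2.0, 1.0

# 3. Optimize: plain gradient ascent on nu.
step = 0.05
trace = [elbo(m, log_s)]
for _ in range(500):
    g_m, g_s = elbo_grad(m, log_s)
    m, log_s = m + step * g_m, log_s + step * g_s
    trace.append(elbo(m, log_s))

s2 = math.exp(2.0 * log_s)
log_evidence = -0.5 * (math.log(2 * math.pi * 2.0) + x ** 2 / 2.0)  # log N(x; 0, 2)
```

For this conjugate toy model the family contains the exact posterior N(x/2, 1/2), so the optimized ELBO reaches the log-evidence. In general the family does not contain the posterior and a gap remains.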
Roadmap I
1. Define a generic class of conditionally conjugate models
2. Classical mean-field variational inference
3. Stochastic variational inference → Scales to massive data
Model Class
[Figure: graphical model with global variables $\beta$ and, inside a plate over $n = 1, \dots, N$, local variables $z_n$ and observations $x_n$]
§ All unknown parameters are described by random variables
§ Global random variables $\beta$, which control all the data
§ Local random variables $z_n$, which are local to individual data points $x_n$
§ Example: Gaussian mixture model
  § Global: means, covariances, weights $\mu_k$, $\Sigma_k$, $\pi_k$
  § Local: binary assignments $z_n$
Probabilistic Model
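[Figure: graphical model with global variables $\beta$ and, inside a plate over $n = 1, \dots, N$, local variables $z_n$ and observations $x_n$]
§ Observations/data $x_1, \dots, x_N$
§ $n$th data point depends only on local $z_n$ and global $\beta$
§ Joint distribution:
$$p(\beta, z_{1:N}, x_{1:N}) = p(\beta) \prod_{n=1}^{N} p(z_n, x_n | \beta)$$
Objective: Compute posterior distribution of all unknowns: $p(\beta, z_{1:N} | x_{1:N})$.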
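The factorization of the joint translates directly into code. The model below is an assumed instance of this class, chosen only for illustration: a one-dimensional mixture with unit-variance components, uniform mixing weights, and an independent N(0, prior_var) prior on each component mean. All function names and numbers are hypothetical.

```python
import math

def log_normal(v, mean, var):
    """Log-density of a univariate Gaussian N(v; mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def log_prior(beta, prior_var=10.0):
    """log p(beta): independent N(0, prior_var) priors on the component means."""
    return sum(log_normal(mu, 0.0, prior_var) for mu in beta)

def log_local(z_n, x_n, beta):
    """log p(z_n, x_n | beta) = log p(z_n) + log p(x_n | z_n, beta), uniform weights."""
    return -math.log(len(beta)) + log_normal(x_n, beta[z_n], 1.0)

def log_joint(beta, z, x):
    """log p(beta, z_{1:N}, x_{1:N}) = log p(beta) + sum_n log p(z_n, x_n | beta)."""
    return log_prior(beta) + sum(log_local(z_n, x_n, beta) for z_n, x_n in zip(z, x))

beta = [-2.0, 3.0]          # global variables: component means
x = [-1.8, 2.9, 3.3, -2.4]  # observations
z = [0, 1, 1, 0]            # local variables: component assignments

value = log_joint(beta, z, x)
```

Each data point contributes one additive local term, which is the structure that later makes per-data-point (and subsampled) updates possible.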
Complete Conditional
[Figure: graphical model with global variables $\beta$ and, inside a plate over $n = 1, \dots, N$, local variables $z_n$ and observations $x_n$]
§ Complete conditional: Conditional of a single latent variable given the observations and all other latent variables
$$p(z_n | \beta, x_n), \qquad p(\beta | z_{1:N}, x_{1:N})$$
§ Assume that each complete conditional is a member of the exponential family (Bernoulli, Beta, Gamma, Gaussian, ...)
→ Conditionally conjugate models
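As an illustration, both complete conditionals can be written down for a simple mixture. The sketch assumes a model that is not specified on the slides: one-dimensional unit-variance components, uniform mixing weights, and a N(0, prior_var) prior on each component mean. Under these assumptions the local conditional is categorical and the global conditional is Gaussian, both exponential-family members. Function names are hypothetical.

```python
import math

def local_conditional(x_n, beta):
    """p(z_n | beta, x_n): categorical, proportional to N(x_n; beta[k], 1)."""
    logits = [-0.5 * (x_n - mu) ** 2 for mu in beta]
    top = max(logits)
    unnorm = [math.exp(l - top) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def global_conditional(k, z, x, prior_var=10.0):
    """p(beta[k] | z, x): Gaussian; returns (mean, variance)."""
    assigned = [x_n for z_n, x_n in zip(z, x) if z_n == k]
    precision = 1.0 / prior_var + len(assigned)  # prior precision + one unit per point
    return sum(assigned) / precision, 1.0 / precision

beta = [-2.0, 3.0]          # component means
x = [-1.8, 2.9, 3.3, -2.4]  # observations
z = [0, 1, 1, 0]            # component assignments

probs = local_conditional(x[1], beta)
post_mean, post_var = global_conditional(1, z, x)
```

The local conditional depends only on the point's own observation and the globals, while the global conditional pools all points assigned to component k.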
Generic Class of Models: Examples
§ Bayesian mixture models
§ Hidden Markov models
§ Factor analysis
§ Principal component analysis
§ Linear regression
Approximating Distributions
[Figure: two graphs over $z_1$, $z_2$, $z_3$ marking the ends of a spectrum, from most expressive (true posterior, $q(z|x) = p(z|x)$) to least expressive (fully factorized, $q(z|x) = \prod_i q_i(z_i)$)]
§ Specifying the class of posteriors is closely related to specifying a model of the data
→ We have a lot of flexibility
§ Generally:
  § Build expressive class of posteriors (no overfitting problems)
  § Maintain computational efficiency → Scalability
Mean-Field Approximation
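[Figure: two graphs over $z_1$, $z_2$, $z_3$ marking the ends of a spectrum, from most expressive (true posterior, $q(z|x) = p(z|x)$) to least expressive (fully factorized, $q(z|x) = \prod_i q_i(z_i)$)]
§ Assume conditionally conjugate model
§ Fully factorized (mean field) approximation:
$$q(\beta, z | \nu) = q(\beta | \lambda) \prod_{n=1}^{N} q(z_n | \phi_n)$$
[Figure: graphical model of the approximation, with $\lambda \to \beta$ and, inside a plate over $n = 1, \dots, N$, $\phi_n \to z_n$]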
Fully Factorized Distribution
§ Fully factorized (mean field) approximation:
$$q(\beta, z | \nu) = q(\beta | \lambda) \prod_{n=1}^{N} q(z_n | \phi_n), \qquad \nu = \{\lambda, \phi_1, \dots, \phi_N\}$$
§ All latent variables are independent and governed by their own variational parameters
§ Assume each factor is in the same exponential family as the model’s complete conditional
§ The q-factors do not depend on the data
§ ELBO connects this family to the data
§ Maximize the ELBO w.r.t. variational parameters ν
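A minimal sketch of such a fully factorized distribution, assuming Gaussian factors for the global variables (component means) and categorical factors for the local assignments. These families, the numbers and the function name are illustrative assumptions, not prescribed by the slides.

```python
import math

def log_q(beta, z, lam, phi):
    """log q(beta, z | nu) = log q(beta | lambda) + sum_n log q(z_n | phi_n)."""
    total = 0.0
    for mu, (m, s2) in zip(beta, lam):   # Gaussian factor per component mean
        total += -0.5 * (math.log(2 * math.pi * s2) + (mu - m) ** 2 / s2)
    for z_n, phi_n in zip(z, phi):       # categorical factor per data point
        total += math.log(phi_n[z_n])
    return total

lam = [(-2.0, 0.5), (3.0, 0.5)]              # lambda: (mean, variance) per component
phi = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # phi_n: assignment probabilities per point

# Size of nu = {lambda, phi_1, ..., phi_N}: grows linearly with the number of data points.
nu_size = 2 * len(lam) + sum(len(p) for p in phi)

value = log_q([-1.9, 3.1], [0, 1, 0], lam, phi)
```

Note that `log_q` takes no data argument: the factors themselves are data-independent, and only optimizing the ELBO ties their parameters to the observations.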
Optimizing in Turn
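§ Independent latent variables: $q(z) = \prod_n q(z_n | \phi_n) = \prod_n q_n(z_n)$
→ Optimize variational parameters in turn
$$
\begin{aligned}
\mathrm{ELBO} &= \mathbb{E}_q\!\left[\log\left(\frac{p(x|z)\,p(z)}{q(z)}\right)\right] = \mathbb{E}_q[\log p(x, z) - \log q(z)] \\
&= \mathbb{E}_{q_n}\Big[\underbrace{\mathbb{E}_{i \neq n}[\log p(x, z)]}_{=:\,\hat{p}(x, z_n)}\Big] - \mathbb{E}_{q_n}[\log q_n(z_n)] - \sum_{i \neq n} \mathbb{E}_{q_i}[\log q_i(z_i)] \\
&= \mathbb{E}_{q_n}[\hat{p}(x, z_n)] - \mathbb{E}_{q_n}[\log q_n(z_n)] - \sum_{i \neq n} \mathbb{E}_{q_i}[\log q_i(z_i)]
\end{aligned}
$$
§ Fix $q_{i \neq n}$ and optimize factor $q_n$:
$$
\begin{aligned}
\mathrm{ELBO}(q_n) &= \mathbb{E}_{q_n}[\hat{p}(x, z_n)] - \mathbb{E}_{q_n}[\log q_n(z_n)] + \text{const} \\
&= \mathbb{E}_{q_n}\!\left[\log \frac{\exp(\hat{p}(x, z_n))}{q_n(z_n)}\right] + \text{const} \\
&= -\mathrm{KL}(q_n(z_n)\,\|\,\exp(\hat{p}(x, z_n))) + \text{const}
\end{aligned}
$$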
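The single-factor view of the ELBO can be checked numerically for discrete latent variables. The sketch below assumes two categorical latents and a randomly generated table standing in for log p(x, z1, z2) at a fixed x; the table, the sizes and the function names are all hypothetical.

```python
import math
import random

rng = random.Random(1)
K1, K2 = 3, 4
# Hypothetical table of log p(x, z1, z2) for a fixed observation x
log_p = [[rng.uniform(-3.0, 0.0) for _ in range(K2)] for _ in range(K1)]

def normalize(w):
    total = sum(w)
    return [v / total for v in w]

def entropy(q):
    return -sum(p * math.log(p) for p in q)

def elbo(q1, q2):
    """E_q[log p(x, z) - log q(z)] for the factorized q(z) = q1(z1) q2(z2)."""
    joint_term = sum(q1[i] * q2[j] * log_p[i][j] for i in range(K1) for j in range(K2))
    return joint_term + entropy(q1) + entropy(q2)

def p_hat(q2):
    """p_hat(x, z1) = E_{q2}[log p(x, z1, z2)], one value per state of z1."""
    return [sum(q2[j] * log_p[i][j] for j in range(K2)) for i in range(K1)]

def optimal_q1(q2):
    """Factor proportional to exp(p_hat(x, z1)), normalized over z1."""
    ph = p_hat(q2)
    top = max(ph)
    return normalize([math.exp(v - top) for v in ph])

q2 = normalize([rng.random() + 0.1 for _ in range(K2)])  # fixed other factor
q1_star = optimal_q1(q2)
best = elbo(q1_star, q2)

# With q2 held fixed, no other choice of q1 should give a larger ELBO.
others = [elbo(normalize([rng.random() + 0.1 for _ in range(K1)]), q2) for _ in range(200)]
```

Here the constant in ELBO(q1) is the entropy of the fixed factor q2, and the maximizer over q1 is the normalized exp(p̂).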
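Optimal Factors
$$\text{ELBO}(q_n) = -\mathrm{KL}\big(q_n(z_n) \,\|\, \exp(\hat{p}(x, z_n))\big) + \text{const}$$
§ Maximizing the ELBO w.r.t. $q_n$ is equivalent to minimizing $\mathrm{KL}\big(q_n(z_n) \,\|\, \exp(\hat{p}(x, z_n))\big)$, which is achieved for $q_n(z_n) \propto \exp(\hat{p}(x, z_n))$:
$$\log q_n^*(z_n) = \hat{p}(x, z_n) + \text{const} = \mathbb{E}_{q_{i \neq n}}[\log p(x, z)] + \text{const}$$
§ Get the optimal factor $q_n^*$ by
1. Writing down the log-joint distribution of all latent and observed variables, $\log p(x, z)$
2. Computing the expectation w.r.t. all other random variables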
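As a worked instance of this two-step recipe, consider a bivariate Gaussian target $p(z) = \mathcal{N}(z \mid \mu, \Lambda^{-1})$ with factorization $q(z) = q_1(z_1)\,q_2(z_2)$. This example is not on the slide; it is the standard textbook case (compare Bishop (2006), Chapter 10), and the symbols $\mu$, $\Lambda$ and $m_1$ are introduced here purely for illustration. Keeping only the terms of the log-density that depend on $z_1$ gives:

```latex
\begin{aligned}
\log q_1^*(z_1) &= \mathbb{E}_{q_2}[\log p(z)] + \text{const} \\
&= \mathbb{E}_{q_2}\!\left[-\tfrac{1}{2}\Lambda_{11}(z_1-\mu_1)^2 - \Lambda_{12}(z_1-\mu_1)(z_2-\mu_2)\right] + \text{const} \\
&= -\tfrac{1}{2}\Lambda_{11} z_1^2 + z_1\big(\Lambda_{11}\mu_1 - \Lambda_{12}(\mathbb{E}_{q_2}[z_2]-\mu_2)\big) + \text{const} \\
\Rightarrow\quad q_1^*(z_1) &= \mathcal{N}\big(z_1 \mid m_1,\ \Lambda_{11}^{-1}\big),
\qquad m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\,\big(\mathbb{E}_{q_2}[z_2]-\mu_2\big)
\end{aligned}
```

The optimal factor is Gaussian without having assumed so, and its mean depends on the current mean of the other factor, which is why the factors have to be updated in turn.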
Mean-Field Approximation for Conditionally Conjugate Models
[Figure: graphical model with nodes $\phi_n$, $z_n$ (plate $n = 1, \dots, N$) and $\beta$, $\lambda$]
§ Optimal factors (see Bishop (2006) or Ghahramani & Beal (2001)):
$$q^*(\beta \mid \lambda) \propto \exp\big(\mathbb{E}_z[\log p(x, z, \beta)]\big)$$
$$q^*(z_n \mid \phi_n) \propto \exp\big(\mathbb{E}_\beta[\log p(x_n, z_n, \beta)]\big)$$
§ Update one term at a time ⇒ Coordinate ascent
§ No closed-form solution (see EM algorithm)
§ Iteratively optimize each parameter until we reach a local optimum (convergence guaranteed)
Mean-Field Approximation: Algorithm
1. Input: data $x$, model $p(\beta, z, x)$
2. Initialize global variational parameters $\lambda$ randomly
3. While ELBO has not converged, repeat:
3.1 For each data point $x_n$
3.1.1 Update local variational parameters $\phi_n$
3.2 Update global variational parameters $\lambda$
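The loop above can be sketched for one concrete conditionally conjugate model. The model is an assumption chosen for illustration, not taken from the slides: a 1-D mixture of unit-variance Gaussians with equal mixing weights, prior $\mu_k \sim \mathcal{N}(0, \texttt{prior\_var})$, global factors $q(\mu_k) = \mathcal{N}(m_k, s_k^2)$ and local factors $q(z_n) = \mathrm{Cat}(\phi_n)$. The function name `cavi_gmm` and all of its parameters are hypothetical. As a simplification, the sketch stops when the means stop changing rather than monitoring the ELBO.

```python
import math
import random


def cavi_gmm(x, num_components=2, prior_var=10.0, max_sweeps=200, tol=1e-8, seed=0):
    """Coordinate-ascent mean-field VI for a 1-D mixture of unit-variance Gaussians."""
    rng = random.Random(seed)
    n, k = len(x), num_components
    # Step 2: random initialization of the global parameters lambda = (m, s2),
    # where q(mu_j) = N(m[j], s2[j]).
    m = [rng.gauss(0.0, 1.0) for _ in range(k)]
    s2 = [1.0] * k
    phi = [[1.0 / k] * k for _ in range(n)]
    for _ in range(max_sweeps):
        # Step 3.1: update the local parameters phi_n for every data point.
        for i, xi in enumerate(x):
            # E_q[log N(x_i | mu_j, 1)] up to terms that do not depend on j.
            logits = [xi * m[j] - 0.5 * (m[j] ** 2 + s2[j]) for j in range(k)]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            phi[i] = [w / total for w in weights]
        # Step 3.2: update the global parameters given all local parameters.
        m_old = list(m)
        for j in range(k):
            resp = sum(phi[i][j] for i in range(n))
            s2[j] = 1.0 / (1.0 / prior_var + resp)
            m[j] = s2[j] * sum(phi[i][j] * x[i] for i in range(n))
        # Simplified stopping rule (assumption): means have stopped moving.
        if max(abs(a - b) for a, b in zip(m, m_old)) < tol:
            break
    return m, s2, phi


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(-3.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
    means, variances, _ = cavi_gmm(data)
    print(sorted(means), variances)
```

Note that every sweep touches all $N$ data points before the global parameters move once, which is the cost that stochastic variational inference avoids.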
Mean-Field Approximation: Limitation
[Figure: two contour plots over $x_1$ and $x_2$, both axes ranging from −4 to 4]
§ Mean-field VI to approximate a correlated Gaussian with a factorized Gaussian
§ Generally, mean-field VI tends to yield an approximation that is too compact ⇒ Need better classes of posterior approximations
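The "too compact" behaviour can be checked numerically. The sketch below assumes a 2-D Gaussian target with known mean and covariance and uses the standard result that the optimal mean-field factors are Gaussians with variances $1/\Lambda_{ii}$, where $\Lambda$ is the precision matrix. The function names and the example covariance are illustrative choices, not taken from the slides.

```python
def invert_2x2(a):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    return [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]


def mean_field_gaussian(mu, cov, num_sweeps=100):
    """Coordinate-ascent mean-field fit q(z) = q1(z1) q2(z2) to N(mu, cov)."""
    prec = invert_2x2(cov)
    m = [0.0, 0.0]  # arbitrary initialization of the factor means
    for _ in range(num_sweeps):
        # Each factor mean depends on the current mean of the other factor.
        m[0] = mu[0] - prec[0][1] / prec[0][0] * (m[1] - mu[1])
        m[1] = mu[1] - prec[1][0] / prec[1][1] * (m[0] - mu[0])
    # Factor variances are the inverse diagonal entries of the precision matrix.
    variances = [1.0 / prec[0][0], 1.0 / prec[1][1]]
    return m, variances


if __name__ == "__main__":
    cov = [[1.0, 0.9], [0.9, 1.0]]
    means, variances = mean_field_gaussian([1.0, -1.0], cov)
    print("mean-field means:", means)
    print("mean-field variances:", variances, "true marginal variances:", [cov[0][0], cov[1][1]])
```

With correlation 0.9 and unit marginal variances, each factor has variance $1 - 0.9^2 = 0.19$: the means are recovered, but the spread is badly underestimated.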
Classical Variational Inference: Limitation
[Figure: diagram with panels Data, Local structure, Global structure]
§ Classical VI is inefficient: Need to crunch through the full dataset to update variational parameters ⇒ Can't handle massive data
§ Stochastic variational inference updates the global hidden structure once we have any update of the local structure ⇒ Stochastic optimization
Stochastic Variational Inference
[Figure: loop over Data with the steps Subsample data, Infer local structure, Update global structure]
§ Coordinate-ascent VI is inefficient: Need to look at the full dataset to update variational parameters
§ Idea:
  § Do some local computations for each data point
  § Subsample data, infer local structure, update global structure
§ Key: Stochastic optimization
Stochastic Optimization: Key Idea
§ Replace exact gradient with cheaper (noisy) estimates (Robbins & Monro, 1951)
§ This estimate could be based on a subset of the data (mini-batches)
§ Guaranteed to converge to a local optimum, provided the gradient estimates are unbiased and the step sizes decay appropriately
§ Key driver of modern machine learning
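A minimal sketch of this idea, under assumptions made only for illustration: the objective is $F(\nu) = -\frac{1}{2N}\sum_n (x_n - \nu)^2$, whose exact gradient is $\bar{x} - \nu$, and the noisy estimate replaces $\bar{x}$ by the mean of a random mini-batch. The function name `sgd_mean`, the batch size and the step size $\rho_t = 1/t$ are hypothetical choices.

```python
import random


def sgd_mean(data, num_steps=2000, batch_size=10, seed=0):
    """Maximize F(nu) = -(1/2N) * sum_n (x_n - nu)^2 using mini-batch gradients."""
    rng = random.Random(seed)
    nu = 0.0
    for t in range(1, num_steps + 1):
        # Mini-batch sampled uniformly with replacement from the dataset.
        batch = [rng.choice(data) for _ in range(batch_size)]
        # Unbiased estimate of the exact gradient mean(data) - nu.
        noisy_grad = sum(batch) / batch_size - nu
        rho = 1.0 / t  # decaying step size
        nu += rho * noisy_grad
    return nu


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
    print("stochastic estimate:", sgd_mean(data))
    print("exact maximizer:", sum(data) / len(data))
```

Each step looks at only `batch_size` points instead of the full dataset, yet the iterate approaches the exact maximizer because the gradient noise averages out as the step size shrinks.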
Noisy Updates of Variational Parameters
§ With noisy gradients, update the variational parameters:
$$\nu_{t+1} = \nu_t + \rho_t \hat{\nabla}_\nu F(\nu)$$
§ Requires unbiased gradients, i.e., on average the gradient points in the correct direction:
$$\mathbb{E}[\hat{\nabla}_\nu F(\nu)] = \nabla_\nu F(\nu)$$
§ Some requirements on the step size parameter $\rho_t$ ⇒ Convergence to local optimum
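The slide does not spell out the step size requirements. The usual ones are the Robbins-Monro conditions, stated here as a reminder; the schedule in the second line is one common choice (used, for example, by Hoffman et al. (2013)), and the symbols $\tau$ and $\kappa$ are not from the slides.

```latex
\begin{aligned}
&\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty \\
&\text{e.g.}\quad \rho_t = (t + \tau)^{-\kappa}, \qquad \tau \geq 0, \quad \kappa \in (0.5,\, 1]
\end{aligned}
```

The first condition lets the iterates travel arbitrarily far from a poor initialization; the second makes the accumulated gradient noise vanish.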
Natural Gradient
$$\tilde{\nabla}_\nu F(\nu) = \mathbf{F}^{-1} \nabla_\nu F(\nu), \qquad \mathbf{F}: \text{Fisher information matrix}$$
§ Points in the direction of maximum change in distribution space, not in parameter space ⇒ We maximize the ELBO/minimize KL
§ Invariant to parametrization of distribution (e.g., variance vs precision of a Gaussian)
§ Scales each parameter individually
§ Natural gradient for conditionally conjugate models is easy to compute (Hoffman et al., 2013)
§ Noisy natural gradient (one estimate per data point)
  § Unbiased
  § Only depends on optimized parameters of a single data point ⇒ cheap to compute
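To make the rescaling by $\mathbf{F}^{-1}$ concrete, the sketch below uses a univariate Gaussian $\mathcal{N}(\mu, \sigma^2)$ parametrized by $(\mu, \sigma)$, for which the Fisher information matrix is diagonal with entries $1/\sigma^2$ and $2/\sigma^2$. The choice of distribution and the function names are assumptions for illustration; the slides do not work through this case.

```python
def gaussian_fisher_diag(sigma):
    """Diagonal of the Fisher information of N(mu, sigma^2) w.r.t. (mu, sigma)."""
    return (1.0 / sigma ** 2, 2.0 / sigma ** 2)


def natural_gradient(grad, sigma):
    """Apply F^{-1} to an ordinary gradient (dF/dmu, dF/dsigma)."""
    fisher = gaussian_fisher_diag(sigma)
    # F is diagonal here, so F^{-1} rescales each coordinate separately.
    return tuple(g / f for g, f in zip(grad, fisher))


if __name__ == "__main__":
    ordinary = (1.0, 4.0)
    for sigma in (0.5, 2.0):
        print("sigma =", sigma, "natural gradient =", natural_gradient(ordinary, sigma))
```

The same ordinary gradient yields a larger natural-gradient step when $\sigma$ is large: a fixed change in $\mu$ alters a broad distribution much less than a narrow one, so a larger parameter step is needed for the same change in distribution space.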
Algorithm
1. Input: data $x$, model $p(\beta, z, x)$
2. Initialize global variational parameters $\lambda$ randomly
3. Repeat
3.1 Sample data point $x_n$ uniformly at random
3.2 Update local parameter $\phi_n$
3.3 Compute intermediate global parameter $\hat{\lambda}$ based on noisy natural gradient
3.4 Set global parameter
$$\lambda \leftarrow (1 - \rho_t)\lambda + \rho_t \hat{\lambda}$$
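A sketch of steps 3.1 to 3.4 for one concrete model, chosen here as an assumption: a 1-D mixture of unit-variance Gaussians with equal mixing weights and prior $\mu_k \sim \mathcal{N}(0, \texttt{prior\_var})$. The global parameters $\lambda$ are stored as `eta1` (mean times precision) and `prec` (precision) of each $q(\mu_k)$, which are linear in the natural parameters, so the convex combination in step 3.4 can be applied to them directly. The intermediate parameter $\hat{\lambda}$ treats the sampled point as if it were replicated $N$ times. The function name, the step size schedule $\rho_t = (t + \tau)^{-\kappa}$ and all default values are hypothetical.

```python
import math
import random


def svi_gmm(x, num_components=2, prior_var=10.0, num_steps=3000,
            kappa=0.7, tau=1.0, seed=0):
    """Stochastic variational inference for a 1-D unit-variance Gaussian mixture."""
    rng = random.Random(seed)
    n, k = len(x), num_components
    # Step 2: random initialization of lambda, with q(mu_j) = N(eta1/prec, 1/prec).
    eta1 = [rng.gauss(0.0, 1.0) for _ in range(k)]
    prec = [1.0] * k
    for t in range(1, num_steps + 1):
        xn = x[rng.randrange(n)]  # Step 3.1: sample one data point
        m = [eta1[j] / prec[j] for j in range(k)]
        s2 = [1.0 / prec[j] for j in range(k)]
        # Step 3.2: local parameters phi_n of q(z_n) for the sampled point only.
        logits = [xn * m[j] - 0.5 * (m[j] ** 2 + s2[j]) for j in range(k)]
        top = max(logits)
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        phi = [w / total for w in weights]
        # Step 3.3: intermediate global parameters, as if x_n occurred N times.
        eta1_hat = [n * phi[j] * xn for j in range(k)]
        prec_hat = [1.0 / prior_var + n * phi[j] for j in range(k)]
        # Step 3.4: convex combination with a decaying step size.
        rho = (t + tau) ** (-kappa)
        eta1 = [(1.0 - rho) * eta1[j] + rho * eta1_hat[j] for j in range(k)]
        prec = [(1.0 - rho) * prec[j] + rho * prec_hat[j] for j in range(k)]
    means = [eta1[j] / prec[j] for j in range(k)]
    variances = [1.0 / prec[j] for j in range(k)]
    return means, variances


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(-3.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
    print(svi_gmm(data))
```

Each iteration costs the same regardless of the dataset size, since only one data point is touched before the global parameters are updated.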
[Figure: loop over Data with the steps Subsample data, Infer local structure, Update global structure]
§ Look at a single data point in your dataset
§ Infer local variational parameters
§ Update global variational parameters using noisy natural gradient
§ Repeat
⇒ Simple way to scale variational inference to massive datasets
Example: Online LDA (Hoffman et al., 2010)
From Hoffman et al. (2010)
Overview
[Figure: variational family $q(z \mid \nu)$ with initial parameters $\nu_{\text{init}}$, optimized parameters $\nu^*$, and the posterior $p(z \mid x)$; adapted from Blei et al.'s NIPS-2016 tutorial]
§ Find approximation of a probability distribution (e.g., posterior) by optimization:
1. Define an objective function
2. Define a (parametrized) family of approximating distributions $q_\nu$
3. Optimize objective function w.r.t. variational parameters $\nu$
§ Inference ⇒ Optimization
Roadmap II
1. Limits of Classical Variational Inference
2. Black-Box Variational Inference
3. Computing Gradients of Expectations
Variational Inference: General Recipe
[Figure: $p(x, z)$ and $q(z \mid \nu)$ feed into $\int (\ldots)\, q(z \mid \nu)\, dz$, followed by $\nabla_\nu$; adapted from Blei et al.'s NIPS-2016 tutorial]
§ Specify model $p(x, z)$ and approximation $q(z \mid \nu)$
§ Objective $F(\nu) = \mathbb{E}_q[\log p(x, z) - \log q(z \mid \nu)]$
§ Compute expectation
§ Compute gradient
§ Optimize with gradient ascent on $F(\nu)$ (equivalently, gradient descent on $-F(\nu)$)
This recipe is fairly generic.
Example: Bayesian Logistic Regression
§ Binary classification
§ Inputs $x \in \mathbb{R}$, labels $y \in \{0, 1\}$
§ Model parameter $z$ (normally denoted by $\theta$)
[Figure: contour plot over $x_1$ and $x_2$ with contour levels labelled 0.1 to 0.9]
Prior on model parameter: $p(z) = \mathcal{N}(0, 1)$
Likelihood: $p(y_n \mid x_n, z) = \mathrm{Ber}(\sigma(z x_n))$
§ Assume we have a single data point $(x, y)$
§ Goal: Approximate the intractable posterior distribution $p(z \mid x, y)$ using variational inference
Example: Bayesian Logistic Regression (2)
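§ Choose Gaussian variational approximation:
$$q(z \mid \nu) = \mathcal{N}(\mu, \sigma^2), \qquad \nu = \{\mu, \sigma^2\}$$
§ Objective function: ELBO $F(\nu)$
$$
\begin{aligned}
F(\mu, \sigma^2) &= \mathbb{E}_q[\log p(z) - \log q(z) + \log p(y \mid x, z)] \\
&= -\tfrac{1}{2}(\mu^2 + \sigma^2) + \tfrac{1}{2}\log \sigma^2 + \mathbb{E}_q[\log p(y \mid x, z)] + c
\end{aligned}
$$
$$
\begin{aligned}
\mathbb{E}_q[\log p(y \mid x, z)] &= \mathbb{E}_q[y \log \sigma(xz) + (1 - y)\log(1 - \sigma(xz))] \\
&= \mathbb{E}_q[yxz] - \mathbb{E}_q[y \log(1 + \exp(xz))] + \mathbb{E}_q\left[(1 - y)\log\left(1 - \frac{\exp(xz)}{1 + \exp(xz)}\right)\right]
\end{aligned}
$$
with
$$\sigma(xz) = \frac{\exp(xz)}{1 + \exp(xz)}$$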
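The closed-form part of the objective, including the constant $c$ that the slide leaves unspecified, follows from two standard Gaussian identities: the second moment $\mathbb{E}_q[z^2] = \mu^2 + \sigma^2$ and the Gaussian entropy. The derivation below is added for completeness and is not on the slide.

```latex
\begin{aligned}
\mathbb{E}_q[\log p(z)] &= -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\mathbb{E}_q[z^2]
  = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(\mu^2 + \sigma^2) \\
-\mathbb{E}_q[\log q(z)] &= \tfrac{1}{2}\log(2\pi e\,\sigma^2)
  = \tfrac{1}{2}\log(2\pi) + \tfrac{1}{2} + \tfrac{1}{2}\log\sigma^2 \\
\mathbb{E}_q[\log p(z) - \log q(z)] &= -\tfrac{1}{2}(\mu^2 + \sigma^2) + \tfrac{1}{2}\log\sigma^2 + \tfrac{1}{2}
  \quad\Rightarrow\quad c = \tfrac{1}{2}
\end{aligned}
```

The $\log(2\pi)$ terms cancel, so only the expected log-likelihood remains to be computed.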
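Computing the Expected Log-Likelihood

$$
\begin{aligned}
\mathbb{E}_q[\log p(y|x, z)] &= \mathbb{E}_q[yxz] - \mathbb{E}_q[y\log(1 + \exp(xz))] + \mathbb{E}_q\left[(1 - y)\log\left(1 - \frac{\exp(xz)}{1 + \exp(xz)}\right)\right] \\
&= yx\mu - \mathbb{E}_q[y\log(1 + \exp(xz))] + \mathbb{E}_q\left[(1 - y)\log\left(\frac{1}{1 + \exp(xz)}\right)\right] \\
&= yx\mu - \mathbb{E}_q[y\log(1 + \exp(xz))] - \mathbb{E}_q[\log(1 + \exp(xz))] + \mathbb{E}_q[y\log(1 + \exp(xz))] \\
&= yx\mu - \mathbb{E}_q[\log(1 + \exp(xz))]
\end{aligned}
$$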
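The simplification above can be checked numerically. The following is a minimal sketch, assuming a single data point and illustrative values for µ, σ and x (none of these numbers are from the slides). It compares a Monte Carlo estimate of the original Bernoulli log-likelihood with the simplified form.

```python
import numpy as np

def log_lik(y, x, z):
    """Bernoulli log-likelihood log p(y | x, z) with a sigmoid link."""
    s = 1.0 / (1.0 + np.exp(-x * z))
    return y * np.log(s) + (1.0 - y) * np.log(1.0 - s)

def simplified(y, x, z):
    """Per-sample simplified form: y*x*z - log(1 + exp(x*z))."""
    return y * x * z - np.logaddexp(0.0, x * z)

rng = np.random.default_rng(0)
mu, sigma, x = 0.5, 1.2, 0.7            # illustrative values (assumption)
z = mu + sigma * rng.standard_normal(100_000)   # samples from q(z | nu)

gaps = {}
for y in (0.0, 1.0):
    lhs = log_lik(y, x, z).mean()                        # E_q[log p(y | x, z)]
    rhs = y * x * mu - np.logaddexp(0.0, x * z).mean()   # y x mu - E_q[log(1 + exp(xz))]
    gaps[y] = abs(lhs - rhs)
    print(f"y={y}: lhs={lhs:.4f}, rhs={rhs:.4f}")
```

For y = 1 the two sides differ only by the Monte Carlo error of replacing the sample mean of z with µ; the remaining expectation still has to be estimated from samples.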
Example: Bayesian Logistic Regression (ctd.)
§ Choose Gaussian variational approximation: $q(z|\nu) = \mathcal{N}(\mu, \sigma^2)$, $\nu = \{\mu, \sigma^2\}$
§ Objective function: ELBO $F(\nu)$

$$
\begin{aligned}
F(\mu, \sigma^2) &= \mathbb{E}_q[\log p(z) + \log p(y|x, z) - \log q(z)] \\
&= -\tfrac{1}{2}(\mu^2 + \sigma^2) + \tfrac{1}{2}\log\sigma^2 + \mathbb{E}_q[\log p(y|x, z)] + c \\
&= -\tfrac{1}{2}(\mu^2 + \sigma^2) + \tfrac{1}{2}\log\sigma^2 + yx\mu - \mathbb{E}_q[\log(1 + \exp(xz))] + c
\end{aligned}
$$

§ Expectation cannot be computed in closed form
§ Pushing gradients through Monte Carlo estimates is very hard.
§ Option: Lower-bound this expectation; but this is model specific.
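For this one-dimensional example the intractable expectation can still be approximated numerically. The sketch below uses Gauss-Hermite quadrature, which is a swapped-in technique and not part of the slides. It assumes a standard normal prior p(z) = N(0, 1), which is consistent with the -1/2 (µ² + σ²) term, and the function name `elbo_logreg` is hypothetical.

```python
import numpy as np

def elbo_logreg(mu, sigma2, x, y, n_nodes=40):
    """ELBO F(mu, sigma^2) up to the additive constant c, for one data point (x, y).

    Assumes the prior p(z) = N(0, 1). The expectation E_q[log(1 + exp(xz))]
    is approximated with Gauss-Hermite quadrature.
    """
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    z = mu + np.sqrt(2.0 * sigma2) * t             # change of variables for N(mu, sigma2)
    e_softplus = np.sum(w * np.logaddexp(0.0, x * z)) / np.sqrt(np.pi)
    return -0.5 * (mu**2 + sigma2) + 0.5 * np.log(sigma2) + y * x * mu - e_softplus

# Illustrative evaluation (values are assumptions, not from the slides).
print(elbo_logreg(mu=0.3, sigma2=1.0, x=2.0, y=1.0))
```

Quadrature of this kind does not scale to high-dimensional z, which is one reason to look for a model-independent approach.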
Non-Conjugate Models
§ Nonlinear time series models
§ Deep latent Gaussian models
§ Attention models (e.g., DRAW)
§ Generalized linear models (e.g., logistic regression)
§ Bayesian neural networks
§ ...
There are many interesting non-conjugate models
→ Look for a solution that is not model specific
→ Black-Box Variational Inference
Black-Box Variational Inference
From Blei et al.’s NIPS-2016 tutorial
§ Any model
§ Massive data
§ Some general assumptions on the approximating family
Computational Challenge of Classical VI
[Figure: diagram with the elements $p(x, z)$, $q(z|\nu)$, $\int (\ldots)\, q(z|\nu)\, dz$, and $\nabla_\nu$. Adapted from Blei et al.’s NIPS-2016 tutorial]
§ Integral computation, which makes the ELBO explicitly a function of the variational parameters
§ Integral cannot be computed for non-conjugate models
→ Gradient computation difficult
Approach
[Figure: diagram with the elements $p(x, z)$, $q(z|\nu)$, $\int (\ldots)\, q(z|\nu)\, dz$, and $\nabla_\nu$. Adapted from Blei et al.’s NIPS-2016 tutorial]
§ Switch order of integration (compute expectations) and differentiation
§ Approximate the expectation after having taken the gradient
→ Monte Carlo estimator (ideally with low variance)
§ Stochastic optimization
→ Require a general way to compute gradients of expectations
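Re-Writing the ELBO

$$
\begin{aligned}
\text{ELBO} &= \mathbb{E}_q[\log p(x|z)] - \mathrm{KL}(q(z|\nu) \,\|\, p(z)) \\
&= \mathbb{E}_q[\log p(x|z)] + \mathbb{E}_q[\log p(z) - \log q(z|\nu)] \\
&= \mathbb{E}_q[\underbrace{\log p(x, z) - \log q(z|\nu)}_{=:\, g(z, \nu)}]
\end{aligned}
$$

Evidence Lower Bound

$$
\text{ELBO} = \mathbb{E}_q[g(z, \nu)] = \int g(z, \nu)\, q(z|\nu)\, dz, \qquad g(z, \nu) := \log p(x, z) - \log q(z|\nu)
$$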
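Written as an expectation of g(z, ν), the ELBO can be estimated by sampling from q. The sketch below assumes a toy conjugate model, p(z) = N(0, 1) and p(x|z) = N(z, 1), chosen only because its evidence p(x) = N(0, 2) and posterior N(x/2, 1/2) are known in closed form. The model and all function names are illustrative and not from the slides.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log-density of N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def g(z, mu, sigma2, x):
    """g(z, nu) = log p(x, z) - log q(z | nu) for the toy model."""
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    return log_joint - log_normal(z, mu, sigma2)

def elbo_mc(mu, sigma2, x, n_samples=50_000, seed=0):
    """Monte Carlo estimate of ELBO = E_q[g(z, nu)] with q = N(mu, sigma2)."""
    rng = np.random.default_rng(seed)
    z = mu + np.sqrt(sigma2) * rng.standard_normal(n_samples)
    return g(z, mu, sigma2, x).mean()

x_obs = 1.5                                   # illustrative observation
log_evidence = log_normal(x_obs, 0.0, 2.0)    # log p(x) for the toy model
print("ELBO with q = prior:    ", elbo_mc(0.0, 1.0, x_obs))
print("ELBO with q = posterior:", elbo_mc(x_obs / 2, 0.5, x_obs))
print("log evidence:           ", log_evidence)
```

When q equals the exact posterior, g(z, ν) is constant in z and equal to log p(x), so the bound is tight; for any other q the estimate should fall below the log evidence.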
Approach
[Figure: diagram with the elements $p(x, z)$, $q(z|\nu)$, $\int (\ldots)\, q(z|\nu)\, dz$, and $\nabla_\nu$. Adapted from Blei et al.’s NIPS-2016 tutorial]
§ Switch order of integration (compute expectations) and differentiation
§ Simplify the expectation after having taken the gradient
Log-Derivative Trick

$$
\nabla_\nu \log q(z|\nu) = \frac{\nabla_\nu q(z|\nu)}{q(z|\nu)} \iff \nabla_\nu q(z|\nu) = q(z|\nu)\, \nabla_\nu \log q(z|\nu)
$$

§ Therefore:

$$
\int \nabla_\nu q(z|\nu)\, f(z)\, dz = \int q(z|\nu)\, \nabla_\nu \log q(z|\nu)\, f(z)\, dz = \mathbb{E}_q[\nabla_\nu \log q(z|\nu)\, f(z)]
$$

§ If we can sample from q, this expectation can be evaluated easily (Monte Carlo estimation)
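As a small numerical illustration of the identity above, the sketch below takes q = N(µ, σ²) and the test function f(z) = z², both chosen here for illustration only. In this case E_q[f(z)] = µ² + σ², so the exact derivative with respect to µ is 2µ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5                        # illustrative values (assumption)
z = mu + sigma * rng.standard_normal(200_000)   # samples from q(z | nu)

f = z ** 2                                  # test function f(z) = z^2
score_mu = (z - mu) / sigma ** 2            # d/dmu log N(z | mu, sigma^2)

grad_estimate = np.mean(score_mu * f)       # Monte Carlo estimate of E_q[score * f]
grad_exact = 2.0 * mu                       # d/dmu (mu^2 + sigma^2)
print(grad_estimate, grad_exact)
```

The estimate needs only samples from q and evaluations of f; f itself is never differentiated.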
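Gradients of Expectations: Approach 1

$$
\text{ELBO} = F(\nu) = \mathbb{E}_q[g(z, \nu)], \qquad g(z, \nu) = \log p(x, z) - \log q(z|\nu)
$$

§ Need gradient of ELBO w.r.t. variational parameters $\nu$

$$
\begin{aligned}
\nabla_\nu F &= \nabla_\nu \mathbb{E}_q[g(z, \nu)] = \nabla_\nu \int g(z, \nu)\, q(z|\nu)\, dz \\
&= \int \big(\nabla_\nu g(z, \nu)\big)\, q(z|\nu) + g(z, \nu)\, \nabla_\nu q(z|\nu)\, dz && \text{(product rule)} \\
&= \int q(z|\nu)\, \nabla_\nu g(z, \nu) + q(z|\nu)\, \nabla_\nu \log q(z|\nu)\, g(z, \nu)\, dz && \text{(log-derivative trick)} \\
&= \mathbb{E}_q[\nabla_\nu \log q(z|\nu)\, g(z, \nu) + \nabla_\nu g(z, \nu)]
\end{aligned}
$$

§ We successfully swapped gradient and expectation
§ q known → Sample from q and use Monte Carlo estimation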
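The two-term estimator can be tried on a model where the ELBO gradient is available in closed form. The sketch below assumes the toy model p(z) = N(0, 1), p(x|z) = N(z, 1) with q = N(µ, σ²) and ν = (µ, σ²); the model, the values and the function names are illustrative assumptions. For this model the exact gradient is (x - 2µ, -1 + 1/(2σ²)).

```python
import numpy as np

def log_normal(v, mean, var):
    """Log-density of N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def elbo_grad_two_term(mu, sigma2, x, n_samples=400_000, seed=0):
    """Monte Carlo estimate of E_q[grad log q * g + grad g] w.r.t. nu = (mu, sigma2)."""
    rng = np.random.default_rng(seed)
    z = mu + np.sqrt(sigma2) * rng.standard_normal(n_samples)
    g = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, mu, sigma2)
    # Score of the Gaussian q w.r.t. mu and sigma2, stacked to shape (2, n_samples).
    score = np.stack([(z - mu) / sigma2,
                      -0.5 / sigma2 + 0.5 * (z - mu) ** 2 / sigma2 ** 2])
    # For fixed z, log p(x, z) does not depend on nu, so grad_nu g = -grad_nu log q.
    grad_g = -score
    return np.mean(score * g + grad_g, axis=1)

def elbo_grad_exact(mu, sigma2, x):
    """Closed-form ELBO gradient for the toy model."""
    return np.array([x - 2.0 * mu, -1.0 + 0.5 / sigma2])

print(elbo_grad_two_term(0.3, 0.64, 1.5))
print(elbo_grad_exact(0.3, 0.64, 1.5))
```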
Approach
[Figure: diagram with the elements $p(x, z)$, $q(z|\nu)$, $\int (\ldots)\, q(z|\nu)\, dz$, and $\nabla_\nu$. Adapted from Blei et al.’s NIPS-2016 tutorial]
§ Swap order of integration (compute expectations) and differentiation
§ Simplify the expectation after having taken the gradient
Score Function Gradient Estimator of the ELBO
Simplifying the Gradient

$$
\nabla_\nu \text{ELBO} = \mathbb{E}_q[\nabla_\nu \log q(z|\nu)\, g(z, \nu) + \nabla_\nu g(z, \nu)]
$$

§ Let’s simplify this gradient → Score function
Score Function
§ Score function: Derivative of a log-likelihood with respect to the parameter vector $\nu$:

$$
\text{score} = \nabla_\nu \log q(z|\nu) = \frac{1}{q(z|\nu)}\, \nabla_\nu q(z|\nu)
$$

§ Measures the sensitivity of the log-likelihood w.r.t. ν
§ Central to maximum likelihood estimation
Score Function (2)
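
$$
\text{score} = \nabla_\nu \log q(z|\nu) = \frac{1}{q(z|\nu)}\, \nabla_\nu q(z|\nu)
$$

§ Important property:

$$
\begin{aligned}
\mathbb{E}_{q(z|\nu)}[\text{score}] &= \mathbb{E}_{q(z|\nu)}\left[\frac{1}{q(z|\nu)}\, \nabla_\nu q(z|\nu)\right] = \int \frac{1}{q(z|\nu)}\, q(z|\nu)\, \nabla_\nu q(z|\nu)\, dz \\
&= \int \nabla_\nu q(z|\nu)\, dz = \nabla_\nu \int q(z|\nu)\, dz = \nabla_\nu 1 = 0
\end{aligned}
$$

→ Mean of the score function is 0
§ Variance of the score: Fisher information → Natural gradients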
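Both properties can be checked by sampling. The sketch below assumes q = N(µ, σ²) and looks only at the score with respect to µ, for which the Fisher information is 1/σ²; the parameter values are illustrative and not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = -0.4, 1.3                        # illustrative values (assumption)
z = mu + sigma * rng.standard_normal(500_000)    # samples from q(z | nu)

score_mu = (z - mu) / sigma ** 2             # d/dmu log N(z | mu, sigma^2)

score_mean = score_mu.mean()                 # should be close to 0
score_var = score_mu.var()                   # should be close to the Fisher information
fisher_mu = 1.0 / sigma ** 2                 # Fisher information of N(mu, sigma^2) w.r.t. mu
print(score_mean, score_var, fisher_mu)
```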
Score Function Gradient Estimator
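
$$
\text{ELBO} = \mathbb{E}_q[g(z, \nu)] = \mathbb{E}_q[\log p(x, z) - \log q(z|\nu)]
$$

§ Gradient of ELBO:

$$
\begin{aligned}
\nabla_\nu \text{ELBO} &= \mathbb{E}_q[\nabla_\nu \log q(z|\nu)\, g(z, \nu)] + \mathbb{E}_q[\nabla_\nu g(z, \nu)] \\
&= \mathbb{E}_q[\nabla_\nu \log q(z|\nu)\, g(z, \nu)] + \mathbb{E}_q[\underbrace{\nabla_\nu \log p(x, z)}_{=0} - \underbrace{\nabla_\nu \log q(z|\nu)}_{\text{score}}]
\end{aligned}
$$

§ Exploit that the mean of the score function is 0. Then:

$$
\begin{aligned}
\nabla_\nu \text{ELBO} &= \mathbb{E}_q[\nabla_\nu \log q(z|\nu)\, g(z, \nu)] \\
&= \mathbb{E}_q[\nabla_\nu \log q(z|\nu)\, (\log p(x, z) - \log q(z|\nu))]
\end{aligned}
$$

§ Likelihood ratio gradient (Glynn, 1990)
§ REINFORCE gradient (Williams, 1992)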
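The estimator can be plugged into plain stochastic gradient ascent. The sketch below applies it to the one-dimensional Bayesian logistic regression example with a single data point. It assumes a standard normal prior, parameterizes q by (µ, ρ) with ρ = log σ to keep σ positive, and uses a fixed step size; these choices, the data values and all function names are assumptions for illustration. A Gauss-Hermite quadrature of the ELBO is used only as a reference to check progress.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log-density of N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def log_joint(z, x, y):
    """log p(z) + log p(y | x, z), with p(z) = N(0, 1) and a sigmoid likelihood."""
    return log_normal(z, 0.0, 1.0) + y * x * z - np.logaddexp(0.0, x * z)

def score_gradient(mu, rho, x, y, rng, n_samples=100):
    """Score-function estimate of the ELBO gradient w.r.t. (mu, rho), rho = log sigma."""
    sigma = np.exp(rho)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps                              # samples from q(z | nu)
    g = log_joint(z, x, y) - log_normal(z, mu, sigma ** 2)
    score = np.stack([eps / sigma, eps ** 2 - 1.0])   # d log q / d mu, d log q / d rho
    return (score * g).mean(axis=1)

def elbo_quadrature(mu, rho, x, y, n_nodes=40):
    """Reference ELBO (up to an additive constant) via Gauss-Hermite quadrature."""
    sigma2 = np.exp(2.0 * rho)
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    z = mu + np.sqrt(2.0 * sigma2) * t
    e_softplus = np.sum(w * np.logaddexp(0.0, x * z)) / np.sqrt(np.pi)
    return -0.5 * (mu ** 2 + sigma2) + 0.5 * np.log(sigma2) + y * x * mu - e_softplus

def fit(x, y, n_steps=1500, step_size=0.05, seed=0):
    """Stochastic gradient ascent on the ELBO, started at mu = 0, sigma = 1."""
    rng = np.random.default_rng(seed)
    nu = np.array([0.0, 0.0])
    for _ in range(n_steps):
        nu = nu + step_size * score_gradient(nu[0], nu[1], x, y, rng)
    return nu

x_obs, y_obs = 2.0, 1.0                               # illustrative data point
mu_hat, rho_hat = fit(x_obs, y_obs)
print("ELBO before:", elbo_quadrature(0.0, 0.0, x_obs, y_obs))
print("ELBO after: ", elbo_quadrature(mu_hat, rho_hat, x_obs, y_obs))
```

Nothing in `score_gradient` is specific to logistic regression beyond `log_joint`, so a different model would only need a different log-joint function. The estimator is noisy, so each update averages over a batch of samples.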
Variational Inference Marc Deisenroth @Imperial College, February 19, 2019 71
![Page 170: Variational Inference - Marc DeisenrothVariational Inference Variational inference is the mostscalable inference method available (at the moment) Can handle (arbitrarily) large datasets](https://reader036.vdocuments.site/reader036/viewer/2022062402/5f9e3c0b1e78e35f366757b9/html5/thumbnails/170.jpg)
Score Function Gradient Estimator
ELBO “ Eqrgpz, νqs “ Eqrlog ppx, zq ´ log qpz|νqs
§ Gradient of ELBO:
∇νELBO “ Eqr∇ν log qpz|νqgpz, νqs `Eqr∇νgpz, νqs
“ Eqr∇ν log qpz|νqgpz, νqs
`Eqr∇ν log ppx, zqlooooooomooooooon
“0
´∇ν log qpz|νqlooooooomooooooon
score
s
§ Exploit that the mean of the score function is 0. Then:
∇νELBO “ Eqr∇ν log qpz|νqgpz, νqs
“ Eqr∇ν log qpz|νqplog ppx, zq ´ log qpz|νqqs
§ Likelihood ratio gradient (Glynn, 1990)§ REINFORCE gradient (Williams, 1992)
Using Noisy Stochastic Gradients
§ Gradient of the ELBO
$$\nabla_\nu \mathrm{ELBO} = \mathbb{E}_q[\nabla_\nu \log q(z|\nu)\,(\log p(x, z) - \log q(z|\nu))]$$
is an expectation
§ Require that $q(z|\nu)$ is differentiable w.r.t. $\nu$
§ Get noisy unbiased gradients using Monte Carlo by sampling from $q$:
$$\frac{1}{S}\sum_{s=1}^{S} \nabla_\nu \log q(z^{(s)}|\nu)\,\big(\log p(x, z^{(s)}) - \log q(z^{(s)}|\nu)\big), \qquad z^{(s)} \sim q(z|\nu)$$
§ Sampling from $q$ is easy (we choose $q$)
§ Use this within SVI to converge to a local optimum
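The Monte Carlo estimator above can be tried on a toy problem. The following is a minimal sketch, not taken from the lecture: it assumes a hypothetical model $z \sim \mathcal{N}(0, 1)$, $x|z \sim \mathcal{N}(z, 1)$ with a made-up observation `X_OBS`, and a variational family $q(z|\nu) = \mathcal{N}(\nu, 1)$. For this choice the exact ELBO gradient is $x - 2\nu$, which gives something to compare against. All function names are illustrative.

```python
import math
import random

X_OBS = 2.0  # hypothetical observation for the assumed toy model


def log_joint(x, z):
    """log p(x, z) for the assumed model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -0.5 * z * z - 0.5 * (x - z) ** 2 - math.log(2.0 * math.pi)


def log_q(z, nu):
    """log q(z | nu) for the assumed variational family N(nu, 1)."""
    return -0.5 * (z - nu) ** 2 - 0.5 * math.log(2.0 * math.pi)


def score(z, nu):
    """Score function: gradient of log q(z | nu) with respect to nu."""
    return z - nu


def score_gradient_estimate(nu, num_samples, rng):
    """Monte Carlo score-function estimate of the ELBO gradient at nu."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(nu, 1.0)  # z^(s) ~ q(z | nu)
        total += score(z, nu) * (log_joint(X_OBS, z) - log_q(z, nu))
    return total / num_samples


rng = random.Random(0)
nu = 0.0
estimate = score_gradient_estimate(nu, 20000, rng)
exact = X_OBS - 2.0 * nu  # closed-form gradient, valid only for this toy model
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

The estimate is unbiased but noisy; reducing the number of samples makes the spread around the exact value visibly larger.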
BBVI: Algorithm
1. Input: model $p(x, z)$, variational approximation $q(z|\nu)$
2. Repeat
  2.1 Draw $S$ samples $z^{(s)} \sim q(z|\nu)$
  2.2 Update variational parameters
$$\nu_{t+1} = \nu_t + \rho_t\, \underbrace{\frac{1}{S}\sum_{s=1}^{S} \nabla_\nu \log q(z^{(s)}|\nu)\,\big(\log p(x, z^{(s)}) - \log q(z^{(s)}|\nu)\big)}_{\text{MC estimate of the score-function gradient of the ELBO}}$$
  2.3 $t = t + 1$
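The loop above can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's implementation: the toy model ($z \sim \mathcal{N}(0,1)$, $x|z \sim \mathcal{N}(z,1)$), the observation `X_OBS`, the family $q(z|\nu) = \mathcal{N}(\nu, 1)$, and the step-size schedule $\rho_t = 1/(t + 10)$ are all choices made here. For this model the exact posterior mean is $x/2$, so the learned $\nu$ should end up close to it.

```python
import math
import random

X_OBS = 2.0  # hypothetical observation


def log_joint(z):
    """log p(x, z) for the assumed toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -0.5 * z * z - 0.5 * (X_OBS - z) ** 2 - math.log(2.0 * math.pi)


def log_q(z, nu):
    """log q(z | nu) for the assumed variational family N(nu, 1)."""
    return -0.5 * (z - nu) ** 2 - 0.5 * math.log(2.0 * math.pi)


def bbvi(nu_init, num_iters, num_samples, rng):
    """Stochastic gradient ascent on the ELBO with score-function gradients."""
    nu = nu_init
    for t in range(num_iters):
        # Step 2.1: draw S samples from q(z | nu).
        samples = [rng.gauss(nu, 1.0) for _ in range(num_samples)]
        # Step 2.2: MC estimate of the gradient; the score of N(nu, 1) is z - nu.
        grad = sum(
            (z - nu) * (log_joint(z) - log_q(z, nu)) for z in samples
        ) / num_samples
        rho = 1.0 / (t + 10.0)  # assumed decreasing step size
        nu = nu + rho * grad
    return nu


rng = random.Random(1)
nu_hat = bbvi(nu_init=-1.0, num_iters=2000, num_samples=50, rng=rng)
posterior_mean = X_OBS / 2.0  # exact answer for this toy model
print(f"nu_hat={nu_hat:.3f}, posterior mean={posterior_mean:.3f}")
```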
Requirements for Inference
§ Computing the noisy gradient of the ELBO requires:
  § Sampling from $q$. We choose $q$ so that this is possible.
  § Evaluate the score function $\nabla_\nu \log q(z|\nu)$
  § Evaluate $\log q(z|\nu)$ and $\log p(x, z) = \log p(z) + \log p(x|z)$
⇒ No model-specific computations for optimization (computations are only specific to the choice of the variational approximation)
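The "black-box" character of these requirements can be made concrete: an estimator only needs four callables. The sketch below is illustrative and makes several assumptions not in the slides: the function name `score_function_gradient`, a toy model $z \sim \mathrm{Exponential}(1)$, $x|z \sim \mathrm{Poisson}(z)$ with a made-up observation $x = 3$, and an exponential variational family with rate $\nu$. For that pairing the exact ELBO gradient is $2/\nu^2 - (x + 1)/\nu$, used here only as a check.

```python
import math
import random


def score_function_gradient(sample_q, score_q, log_q, log_joint, num_samples):
    """MC estimate of the ELBO gradient from four black-box callables."""
    total = 0.0
    for _ in range(num_samples):
        z = sample_q()
        total += score_q(z) * (log_joint(z) - log_q(z))
    return total / num_samples


# Hypothetical toy model: z ~ Exponential(1), x | z ~ Poisson(z), observed x = 3.
X_OBS = 3


def log_joint(z):
    log_prior = -z
    log_likelihood = X_OBS * math.log(z) - z - math.lgamma(X_OBS + 1)
    return log_prior + log_likelihood


# Assumed variational family: q(z | nu) = Exponential(rate = nu).
nu = 1.0
rng = random.Random(0)

estimate = score_function_gradient(
    sample_q=lambda: rng.expovariate(nu),
    score_q=lambda z: 1.0 / nu - z,          # d/dnu of (log nu - nu * z)
    log_q=lambda z: math.log(nu) - nu * z,
    log_joint=log_joint,
    num_samples=40000,
)
exact = 2.0 / nu ** 2 - (X_OBS + 1) / nu  # closed form for this toy pairing only
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

Swapping in a different model only changes `log_joint`; the estimator itself is untouched, which is the point of the requirement list.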
Issue: Variance of the Gradients
§ Stochastic optimization ⇒ gradients are noisy (high variance)
§ The noisier the gradients, the slower the convergence
§ Possible solutions:
  § Control variates (with the score function as control variate)
  § Rao-Blackwellization
  § Importance sampling
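As an illustration of the first option, the sketch below subtracts a scaled copy of the score from each gradient term. Because the score has zero mean, the estimate stays unbiased while its variance drops. Everything specific is an assumption made here: the toy model ($z \sim \mathcal{N}(0,1)$, $x|z \sim \mathcal{N}(z,1)$, `X_OBS = 2`), the family $q(z|\nu) = \mathcal{N}(\nu, 1)$ at `NU = 0`, and the use of a separate pilot batch to estimate the scaling $a^* = \mathrm{Cov}(f, h) / \mathrm{Var}(h)$.

```python
import math
import random
import statistics

X_OBS = 2.0  # hypothetical observation
NU = 0.0     # variational parameter at which the gradient is estimated


def log_joint(z):
    """log p(x, z) for the assumed toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -0.5 * z * z - 0.5 * (X_OBS - z) ** 2 - math.log(2.0 * math.pi)


def log_q(z):
    """log q(z | NU) for the assumed variational family N(NU, 1)."""
    return -0.5 * (z - NU) ** 2 - 0.5 * math.log(2.0 * math.pi)


def draw_terms(num_samples, rng):
    """Per-sample pairs (f, h): f = score * g is the gradient term, h = score."""
    terms = []
    for _ in range(num_samples):
        z = rng.gauss(NU, 1.0)
        h = z - NU
        f = h * (log_joint(z) - log_q(z))
        terms.append((f, h))
    return terms


rng = random.Random(0)

# Pilot batch: estimate the scaling a* = Cov(f, h) / Var(h).
pilot = draw_terms(5000, rng)
mean_f = statistics.fmean([f for f, _ in pilot])
mean_h = statistics.fmean([h for _, h in pilot])
cov_fh = sum((f - mean_f) * (h - mean_h) for f, h in pilot) / (len(pilot) - 1)
var_h = sum((h - mean_h) ** 2 for _, h in pilot) / (len(pilot) - 1)
a_star = cov_fh / var_h

# Fresh batch: compare the plain terms with the control-variate terms.
fresh = draw_terms(20000, rng)
plain = [f for f, _ in fresh]
controlled = [f - a_star * h for f, h in fresh]

var_plain = statistics.variance(plain)
var_controlled = statistics.variance(controlled)
mean_controlled = statistics.fmean(controlled)
print(f"variance plain={var_plain:.2f}, with control variate={var_controlled:.2f}")
```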
Non-Conjugate Models
§ Nonlinear time series models
§ Deep latent Gaussian models
§ Attention models (e.g., DRAW)
§ Generalized linear models (e.g., logistic regression)
§ Bayesian neural networks
§ ...
BBVI allows us to design models $p(x, z)$ based on the data, and not on the inference we can do
Assumptions
§ Score-function gradient estimator only requires general assumptions
§ Noisy gradients are a problem
§ Address this issue by making some additional assumptions (not too strict)
⇒ Pathwise gradient estimators
Pathwise Gradient Estimators of the ELBO
Approach
[Figure: $p(x, z)$, $q(z|\nu)$, $\int (\dots)\, q(z|\nu)\, dz$, $\nabla_\nu$. Adopted from Blei et al.’s NIPS-2016 tutorial]
§ Switch order of integration (compute expectations) and differentiation
§ Approximate the expectation after having taken the gradient
Change of Variables
[Figure: labels $b_1$, $b_2$, $c_1$, $c_2$ and $f(\cdot)$]
§ Use function $f : X \to Y$, $x \mapsto f(x) = y$
§ Change of area/volume:
$$\left|\int_Y y\, dy\right| = \left|\frac{df}{dx}\right| \left|\int_X x\, dx\right|$$
Reparametrization Trick
Base distribution $p(\epsilon)$ and a deterministic transformation $z = t(\epsilon, \nu)$ so that $z \sim q(z|\nu)$. Then:
$$\nabla_\nu \mathbb{E}_{q(z|\nu)}[f(z)] = \mathbb{E}_{p(\epsilon)}[\nabla_\nu f(t(\epsilon, \nu))]$$
Expectation taken w.r.t. base distribution
§ Key idea: change of variables using a deterministic transformation
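The identity can be checked numerically on a case where the left-hand side is known. The sketch below assumes, purely for illustration, $f(z) = z^2$, $q(z|\nu) = \mathcal{N}(\mu, \sigma^2)$ with $\nu = (\mu, \sigma)$, and $t(\epsilon, \nu) = \mu + \sigma\epsilon$. Then $\mathbb{E}_q[z^2] = \mu^2 + \sigma^2$, so the exact gradient is $(2\mu, 2\sigma)$.

```python
import random


def pathwise_gradient(mu, sigma, num_samples, rng):
    """Estimate the gradient of E_q[z^2] w.r.t. (mu, sigma) via z = mu + sigma * eps."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # eps ~ p(eps) = N(0, 1), parameter free
        z = mu + sigma * eps        # z = t(eps, nu)
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # dt/dmu = 1
        grad_sigma += df_dz * eps   # dt/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples


rng = random.Random(0)
mu, sigma = 1.5, 0.7  # illustrative values
g_mu, g_sigma = pathwise_gradient(mu, sigma, 20000, rng)
print(f"estimate=({g_mu:.3f}, {g_sigma:.3f}), exact=({2 * mu:.3f}, {2 * sigma:.3f})")
```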
Reparametrization Trick (2)
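$$\nabla_\nu \mathbb{E}_{q(z|\nu)}[f(z)] = \mathbb{E}_{p(\epsilon)}[\nabla_\nu f(t(\epsilon, \nu))]$$
§ Change of variables ⇒ probability mass contained in a differential area must be invariant under change of variables:
$$|q(z|\nu)\, dz| = |p(\epsilon)\, d\epsilon|$$
§ This implies
$$\begin{aligned}
\nabla_\nu \mathbb{E}_{q(z|\nu)}[f(z)] &= \nabla_\nu \int f(z)\, q(z|\nu)\, dz = \nabla_\nu \int f(z(\epsilon))\, p(\epsilon)\, d\epsilon \\
&= \nabla_\nu \int f(t(\epsilon, \nu))\, p(\epsilon)\, d\epsilon = \int \nabla_\nu f(t(\epsilon, \nu))\, p(\epsilon)\, d\epsilon \\
&= \mathbb{E}_{p(\epsilon)}[\nabla_\nu f(t(\epsilon, \nu))]
\end{aligned}$$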
Example
[Figure: two scatter plots with axes $x_1$ and $x_2$]
$$\nu := \{\mu, R\}, \qquad R R^\top = \Sigma$$
$$p(\epsilon) = \mathcal{N}(0, I)$$
$$t(\epsilon, \nu) = \mu + R\epsilon$$
$$\Longrightarrow\; p(z) = \mathcal{N}(z \mid \mu, \Sigma)$$
§ Base distribution is parameter free
§ A simple deterministic transformation $t$ generates more expressive distributions¹
¹ https://tinyurl.com/hyakoj2
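The Gaussian example can be reproduced with a few lines of sampling code. This is a minimal sketch: the values of `MU` and `SIGMA` are made up for illustration, $R$ is taken to be the lower-triangular Cholesky factor (one valid choice with $R R^\top = \Sigma$), and the helper names are hypothetical. The sample mean and covariance of $z = \mu + R\epsilon$ should then approach $\mu$ and $\Sigma$.

```python
import math
import random


def cholesky_2x2(sigma):
    """Lower-triangular R with R R^T = sigma for a 2x2 positive-definite matrix."""
    (a, b), (_, d) = sigma
    r11 = math.sqrt(a)
    r21 = b / r11
    r22 = math.sqrt(d - r21 * r21)
    return [[r11, 0.0], [r21, r22]]


def transform(eps, mu, r):
    """Deterministic transformation z = mu + R eps."""
    return [
        mu[0] + r[0][0] * eps[0] + r[0][1] * eps[1],
        mu[1] + r[1][0] * eps[0] + r[1][1] * eps[1],
    ]


MU = [2.0, -3.0]                   # illustrative mean
SIGMA = [[4.0, 1.5], [1.5, 2.0]]   # illustrative covariance
R = cholesky_2x2(SIGMA)

rng = random.Random(0)
samples = [
    transform([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], MU, R)
    for _ in range(20000)
]

n = len(samples)
mean = [sum(z[i] for z in samples) / n for i in range(2)]
cov = [
    [sum((z[i] - mean[i]) * (z[j] - mean[j]) for z in samples) / (n - 1)
     for j in range(2)]
    for i in range(2)
]
print("sample mean:", [round(m, 2) for m in mean])
print("sample covariance:", [[round(c, 2) for c in row] for row in cov])
```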
Reparametrization as a System of Pipes
[Figure provided by S. Mohamed; labels: $\epsilon$, $z$, $\mu$, $R$]
§ Path: Follow the input noise through the pipes²
§ Construction of pipes known (deterministic transformations) ⇒ go back and push gradients through it
§ Also called “push-in method”: Push the parameters of the $z$ distribution into the deterministic transformations
² https://tinyurl.com/hyakoj2
Gradients of Expectations: Approach 2
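$$\begin{aligned}
\nabla_\nu \mathrm{ELBO} &= \nabla_\nu \mathbb{E}_q[g(z, \nu)] \\
&= \nabla_\nu \int g(z, \nu)\, q(z|\nu)\, dz \\
&= \nabla_\nu \int g(z, \nu)\, p(\epsilon)\, d\epsilon && q(z|\nu)\, dz = p(\epsilon)\, d\epsilon \\
&= \nabla_\nu \int g(t(\epsilon, \nu), \nu)\, p(\epsilon)\, d\epsilon && z = t(\epsilon, \nu) \\
&= \int \nabla_\nu g(t(\epsilon, \nu), \nu)\, p(\epsilon)\, d\epsilon && \nabla_\nu \int_\epsilon = \int_\epsilon \nabla_\nu \\
&= \mathbb{E}_{p(\epsilon)}[\nabla_\nu g(t(\epsilon, \nu), \nu)]
\end{aligned}$$
Turned gradient of an expectation into expectation of a gradient (and sampling from $p(\epsilon)$ is very easy).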
Approach
[Figure: $p(x, z)$, $q(z|\nu)$, $\int (\dots)\, q(z|\nu)\, dz$, $\nabla_\nu$. Adopted from Blei et al.’s NIPS-2016 tutorial]
§ Swap order of integration (compute expectations) and differentiation
§ Approximate the expectation after having taken the gradient
Pathwise Gradients
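$$g(z, \nu) = \log p(x, z) - \log q(z|\nu), \qquad z = t(\epsilon, \nu)$$
Simplify gradient of the ELBO:
$$\begin{aligned}
\nabla_\nu \mathrm{ELBO} &= \mathbb{E}_{p(\epsilon)}[\nabla_\nu g(t(\epsilon, \nu), \nu)] \\
&= \mathbb{E}_{p(\epsilon)}[\nabla_\nu \log p(x, t(\epsilon, \nu)) - \nabla_\nu \log q(t(\epsilon, \nu)|\nu)] && \text{Def. of } g \\
&= \mathbb{E}_{p(\epsilon)}[\nabla_z \log p(x, z)\, \nabla_\nu t(\epsilon, \nu) - \nabla_z \log q(z|\nu)\, \nabla_\nu t(\epsilon, \nu) - \underbrace{\nabla_\nu \log q(t(\epsilon, \nu)|\nu)}_{\text{score}}] && \text{Chain rule} \\
&= \mathbb{E}_{p(\epsilon)}[\nabla_z \big(\log p(x, z) - \log q(z|\nu)\big)\, \nabla_\nu t(\epsilon, \nu)] && \text{Score property}
\end{aligned}$$
§ Pathwise gradient
§ Reparametrization gradient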
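The final expression can be turned into an estimator directly. The sketch below is an illustration under assumptions not in the slides: a toy model $z \sim \mathcal{N}(0,1)$, $x|z \sim \mathcal{N}(z,1)$ with a made-up observation, and $q(z|\nu) = \mathcal{N}(\mu, \sigma^2)$ with $\nu = (\mu, \sigma)$ and $t(\epsilon, \nu) = \mu + \sigma\epsilon$. For this pairing the exact ELBO gradient is $(x - 2\mu,\ 1/\sigma - 2\sigma)$, which serves as a check.

```python
import random

X_OBS = 2.0  # hypothetical observation


def grad_z_g(z, eps, sigma):
    """d/dz [log p(x, z) - log q(z | nu)] evaluated at z = mu + sigma * eps."""
    grad_log_joint = -z + (X_OBS - z)  # prior N(0, 1) plus likelihood N(x | z, 1)
    grad_log_q = -eps / sigma          # -(z - mu) / sigma^2 with z - mu = sigma * eps
    return grad_log_joint - grad_log_q


def pathwise_elbo_gradient(mu, sigma, num_samples, rng):
    """MC estimate of E_{p(eps)}[grad_z g * grad_nu t] for nu = (mu, sigma)."""
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        dg_dz = grad_z_g(z, eps, sigma)
        g_mu += dg_dz * 1.0     # dt/dmu = 1
        g_sigma += dg_dz * eps  # dt/dsigma = eps
    return g_mu / num_samples, g_sigma / num_samples


rng = random.Random(0)
mu, sigma = 0.3, 0.8  # illustrative values
g_mu, g_sigma = pathwise_elbo_gradient(mu, sigma, 20000, rng)
exact_mu = X_OBS - 2.0 * mu
exact_sigma = 1.0 / sigma - 2.0 * sigma
print(f"estimate=({g_mu:.3f}, {g_sigma:.3f}), exact=({exact_mu:.3f}, {exact_sigma:.3f})")
```

Setting both exact expressions to zero gives $\mu = x/2$ and $\sigma^2 = 1/2$, the true posterior of this toy model.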
Variance Comparison
Figure from Kucukelbir et al. (2017)
§ Drastically reduced variance compared to score-function gradient estimation
§ Restricted class of models (compared with score function estimator)
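The variance gap can be reproduced on a small scale. The sketch below does not recreate the figure from Kucukelbir et al. (2017); it is a toy comparison under assumptions made here: the model $z \sim \mathcal{N}(0,1)$, $x|z \sim \mathcal{N}(z,1)$ with `X_OBS = 2`, and $q(z|\nu) = \mathcal{N}(\nu, 1)$ at `NU = 0`. Both estimators target the same gradient $x - 2\nu$, but their per-sample terms have very different spread.

```python
import math
import random
import statistics

X_OBS = 2.0  # hypothetical observation
NU = 0.0     # variational mean at which both estimators are compared


def log_joint(z):
    """log p(x, z) for the assumed toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -0.5 * z * z - 0.5 * (X_OBS - z) ** 2 - math.log(2.0 * math.pi)


def log_q(z):
    """log q(z | NU) for the assumed variational family N(NU, 1)."""
    return -0.5 * (z - NU) ** 2 - 0.5 * math.log(2.0 * math.pi)


rng = random.Random(0)
score_terms = []
pathwise_terms = []
for _ in range(20000):
    eps = rng.gauss(0.0, 1.0)
    z = NU + eps  # z = t(eps, nu), so z ~ q(z | NU)

    # Score-function term: (grad_nu log q) * (log p - log q).
    score_terms.append((z - NU) * (log_joint(z) - log_q(z)))

    # Pathwise term: grad_z (log p - log q) * grad_nu t, with grad_nu t = 1.
    grad_z_g = (-z + (X_OBS - z)) + (z - NU)
    pathwise_terms.append(grad_z_g * 1.0)

var_score = statistics.variance(score_terms)
var_pathwise = statistics.variance(pathwise_terms)
print(f"per-sample variance: score={var_score:.2f}, pathwise={var_pathwise:.2f}")
```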
Score Function vs Pathwise Gradients
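$$\mathrm{ELBO} = \int g(z, \nu)\, q(z|\nu)\, dz$$
$$g(z, \nu) = \log p(x, z) - \log q(z|\nu)$$
§ Score function gradient:
$$\nabla_\nu \mathrm{ELBO} = \mathbb{E}_q[\big(\nabla_\nu \log q(z|\nu)\big)\, g(z, \nu)]$$
Gradient of the variational distribution
§ Reparametrization gradient:
$$\nabla_\nu \mathrm{ELBO} = \mathbb{E}_{p(\epsilon)}[\big(\nabla_z g(z, \nu)\big)\, \nabla_\nu t(\epsilon, \nu)]$$
Gradient of the model and the variational distribution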
Score Function vs Pathwise Gradients (2)
§ Score function
  § Works for all models (continuous and discrete)
  § Works for a large class of variational approximations
  § Variance can be high ⇒ slow convergence
§ Pathwise gradient estimator
  § Requires differentiable models
  § Requires the variational approximation to be expressed as a deterministic transformation $z = t(\epsilon, \nu)$
  § Generally lower variance
Re-cap: Hierarchical Bayesian Models
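[Figure: graphical model with global variables $\beta$ and local variables $z_n$, $x_n$, $n = 1, \dots, N$]
§ Joint distribution:
$$p(\beta, z, x) = p(\beta) \prod_{n=1}^{N} p(z_n, x_n|\beta)$$
§ Mean-field variational approximation:
[Figure: graphical model with variational parameters $\phi_n$ for $z_n$, $n = 1, \dots, N$, and $\lambda$ for $\beta$]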
Re-cap: Mean-Field Stochastic Variational Inference
1. Input: data $x$, model $p(\beta, z, x)$
2. Initialize global variational parameters $\lambda$ randomly
3. Repeat
   3.1 Sample data point $x_n$ uniformly at random
   3.2 Update local parameter $\phi_n = \mathbb{E}_{\lambda}[\ldots]$
   3.3 Compute intermediate global parameter $\hat{\lambda} = N \mathbb{E}_{\phi_{1:N}}[\ldots] + \ldots$
   3.4 Set global parameter $\lambda \leftarrow (1 - \rho_t)\lambda + \rho_t \hat{\lambda}$
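The loop above can be sketched on a deliberately simple conjugate model. Everything specific here is an assumption made for illustration: the model $\beta \sim \mathcal{N}(0, 1)$, $x_n \sim \mathcal{N}(\beta, 1)$ has no local variables, so step 3.2 is skipped; $\lambda$ holds the natural parameters of a Gaussian $q(\beta)$ and is initialized at the prior rather than randomly; the step size $\rho_t = (t + \tau)^{-\kappa}$ is one common Robbins-Monro choice.

```python
import numpy as np

def svi_gaussian_mean(x, num_steps=20_000, kappa=0.7, tau=1.0, seed=0):
    """SVI for beta ~ N(0, 1), x_n ~ N(beta, 1), with a Gaussian q(beta).

    lam holds the natural parameters of q(beta): (mean / var, -1 / (2 * var)).
    """
    rng = np.random.default_rng(seed)
    N = len(x)
    prior = np.array([0.0, -0.5])                 # natural parameters of N(0, 1)
    lam = np.array([0.0, -0.5])                   # initialise q(beta) at the prior
    for t in range(num_steps):
        n = rng.integers(N)                       # 3.1: sample a data point uniformly
        # 3.3: intermediate global parameter, as if x_n had been observed N times
        lam_hat = prior + N * np.array([x[n], -0.5])
        rho = (t + tau) ** (-kappa)               # decaying step size
        lam = (1.0 - rho) * lam + rho * lam_hat   # 3.4: convex combination
    var = -1.0 / (2.0 * lam[1])
    return lam[0] * var, var                      # mean and variance of q(beta)

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=100)
mean, var = svi_gaussian_mean(x)
exact_mean, exact_var = x.sum() / (len(x) + 1), 1.0 / (len(x) + 1)
print(mean, exact_mean, var, exact_var)
```

Each iteration touches a single data point, yet the iterates settle close to the exact posterior, which is available in closed form for this model.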
BBVI Stochastic Variational Inference
1. Input: data $x$, model $p(\beta, z, x)$
2. Initialize global variational parameters $\lambda$ randomly
3. Repeat
   3.1 Sample data point $x_n$ uniformly at random
   3.2 Update local parameter $\phi_n = \mathbb{E}_{\lambda}[\ldots]$
   3.3 Compute intermediate global parameter $\hat{\lambda} = N \mathbb{E}_{\phi_{1:N}}[\ldots] + \ldots$
   3.4 Set global parameter $\lambda \leftarrow (1 - \rho_t)\lambda + \rho_t \hat{\lambda}$
Issue:
§ The expectations we require to update the local and global parameters are no longer tractable
  No closed-form updating of variational factors
Addressing the Challenge
§ Same problem we had with the ELBO: integral intractable
  Gradient descent for variational updates
§ Idea: Stochastic optimization
§ Expectations for updating local variational parameters are computed per data point
  Need to run an optimization algorithm per data point
  SVI gets really slow
  Amortized inference
Amortized (=shared) Inference
§ Key idea: Learn a mapping (with global variational parameters) from data points $x_n$ to local variational parameters $\phi_n$:
$$f(x_n, \theta) = \phi_n$$
§ No more local variational parameters to optimize
§ No longer independent optimization of variational parameters
§ New global variational parameters $\theta$ can be optimized using stochastic gradient descent
§ The mapping is often a deep neural network (inference network)
§ Can we overfit? Discuss!
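A minimal sketch of such a mapping, under stated assumptions: the local parameters $\phi_n$ are taken to be the mean and log-variance of a diagonal Gaussian $q(z_n \mid \phi_n)$, and $f$ is a one-hidden-layer network. The layer sizes, the parameter names (`W1`, `b1`, `W2`, `b2`) and the helper `init_inference_network` are hypothetical.

```python
import numpy as np

def init_inference_network(input_dim, hidden_dim, latent_dim, seed=0):
    """Randomly initialise theta, the shared parameters of the mapping f."""
    rng = np.random.default_rng(seed)
    return {
        "W1": 0.1 * rng.standard_normal((input_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": 0.1 * rng.standard_normal((hidden_dim, 2 * latent_dim)),
        "b2": np.zeros(2 * latent_dim),
    }

def f(x, theta):
    """Map data points x (shape [batch, input_dim]) to local parameters phi_n.

    Here phi_n = (mean, log-variance) of a diagonal Gaussian q(z_n | phi_n).
    """
    h = np.tanh(x @ theta["W1"] + theta["b1"])
    out = h @ theta["W2"] + theta["b2"]
    mean, log_var = np.split(out, 2, axis=-1)
    return mean, log_var

theta = init_inference_network(input_dim=5, hidden_dim=16, latent_dim=2)
x = np.random.default_rng(1).standard_normal((1000, 5))   # N = 1000 data points
mean, log_var = f(x, theta)
num_theta = sum(p.size for p in theta.values())
print(mean.shape, log_var.shape, num_theta)
```

In this sketch the number of parameters in `theta` is fixed by the network architecture, whereas separate local parameters would require `2 * latent_dim` numbers for each of the N data points.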
Amortized VI in Hierarchical Bayesian Models
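[Graphical model: β with variational parameter λ; z_n with variational parameter φ_n, n = 1, ..., N]
$$
\begin{aligned}
\text{ELBO} &= \mathbb{E}_q\big[\log p(x, \beta, z_{1:N}) - \log q(\beta, z_{1:N} \mid \lambda, \phi_{1:N})\big] && \nu = \{\lambda, \phi_{1:N}\} \\
&= \mathbb{E}_q\big[\log p(x, \beta, z_{1:N})\big] - \mathbb{E}_q\Big[\log q(\beta \mid \lambda) + \sum_{n=1}^{N} \log q(z_n \mid \phi_n)\Big] && \text{mean-field} \\
&= \mathbb{E}_q\big[\log p(x, \beta, z_{1:N})\big] - \mathbb{E}_q\Big[\log q(\beta \mid \lambda) + \sum_{n=1}^{N} \log q(z_n \mid f(x_n, \theta))\Big] && \phi_n = f(x_n, \theta)
\end{aligned}
$$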
Stochastic Gradients
$$\nabla_\theta \text{ELBO} = \frac{d\,\text{ELBO}}{d\phi_n} \frac{d\phi_n}{d\theta}$$
§ The ELBO gradient w.r.t. the local variational parameters is difficult
  Stochastic gradient estimators (score function, reparametrization)
§ The gradient of the variational parameters w.r.t. the parameters $\theta$ of the inference network is easy
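The two factors of the chain rule can be made concrete on a toy problem. The sketch assumes a one-dimensional model $p(z) = \mathcal{N}(0, 1)$, $p(x \mid z) = \mathcal{N}(z, 1)$ and a linear "inference network" $\phi = f(x, \theta) = (\mu, \log\sigma) = (wx + b, c)$; neither is taken from the lecture. The first factor is estimated with the pathwise (reparametrization) estimator, the second is written down in closed form.

```python
import numpy as np

def elbo_grad_theta(x, theta, num_samples=100_000, seed=0):
    """Pathwise estimate of grad_theta ELBO for one data point x.

    Toy model (an assumption): p(z) = N(0, 1), p(x | z) = N(z, 1).
    Inference network: phi = f(x, theta) = (mu, log_sigma) = (w * x + b, c).
    """
    w, b, c = theta
    mu, sigma = w * x + b, np.exp(c)
    eps = np.random.default_rng(seed).standard_normal(num_samples)
    z = mu + sigma * eps                          # reparametrization
    dlogp_dz = x - 2.0 * z                        # d/dz [log p(z) + log p(x | z)]
    # dELBO/dphi: the "difficult" factor, estimated by Monte Carlo.
    delbo_dmu = dlogp_dz.mean()
    delbo_dlogsig = (dlogp_dz * sigma * eps).mean() + 1.0   # +1 from the entropy of q
    # dphi/dtheta: the "easy" factor, here available in closed form.
    dphi_dtheta = np.array([[x, 1.0, 0.0],        # d mu / d(w, b, c)
                            [0.0, 0.0, 1.0]])     # d log_sigma / d(w, b, c)
    return np.array([delbo_dmu, delbo_dlogsig]) @ dphi_dtheta

x, theta = 1.5, np.array([0.2, 0.1, -0.3])
grad = elbo_grad_theta(x, theta)
# Closed-form gradient for this toy model, for comparison.
mu, sigma2 = theta[0] * x + theta[1], np.exp(2 * theta[2])
exact = np.array([(x - 2 * mu) * x, x - 2 * mu, 1.0 - 2.0 * sigma2])
print(grad, exact)
```

For a deep inference network the Jacobian `dphi_dtheta` would be obtained by backpropagation rather than written by hand.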
Amortized SVI
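1. Input: data $x$, model $p(\beta, z, x)$
2. Initialize global variational parameters $\lambda$ randomly
3. Repeat
   3.1 Sample $\beta \sim q(\beta \mid \lambda)$
   3.2 Sample data point $x_n$ uniformly at random
   3.3 Compute stochastic natural gradients $\tilde{\nabla}_\lambda \text{ELBO}$ and $\tilde{\nabla}_\theta \text{ELBO}$
   3.4 Update global parameters
       $\lambda \leftarrow \lambda + \rho_t \tilde{\nabla}_\lambda \text{ELBO}$ (global variational parameters)
       $\theta \leftarrow \theta + \rho_t \tilde{\nabla}_\theta \text{ELBO}$ (inference network parameters)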
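A stripped-down version of this loop, as a sketch under several assumptions: the toy model $p(z) = \mathcal{N}(0, 1)$, $p(x \mid z) = \mathcal{N}(z, 1)$ has no global latent variable $\beta$, so step 3.1 and the $\lambda$ update are omitted; the inference network is linear, $q(z_n \mid x_n) = \mathcal{N}(w x_n + b, e^{2c})$ with $\theta = (w, b, c)$; and plain pathwise stochastic gradients are used in place of natural gradients. The step-size schedule is an arbitrary decaying choice.

```python
import numpy as np

def amortized_svi(x, num_steps=5_000, num_samples=8, seed=0):
    """Fit theta = (w, b, c) of q(z_n | x_n) = N(w * x_n + b, exp(c)^2).

    Toy model (an assumption): p(z) = N(0, 1), p(x | z) = N(z, 1); no global beta.
    """
    rng = np.random.default_rng(seed)
    w, b, c = 0.0, 0.0, 0.0
    for t in range(num_steps):
        xn = x[rng.integers(len(x))]              # sample a data point uniformly
        mu, sigma = w * xn + b, np.exp(c)         # phi_n = f(x_n, theta)
        eps = rng.standard_normal(num_samples)
        z = mu + sigma * eps                      # reparametrized samples
        dlogp_dz = xn - 2.0 * z                   # d/dz [log p(z) + log p(x_n | z)]
        g_mu = dlogp_dz.mean()
        g_c = (dlogp_dz * sigma * eps).mean() + 1.0   # +1 from the entropy of q
        rho = 0.02 * (1.0 + t / 100.0) ** (-0.6)  # decaying step size
        # Gradient ascent on the ELBO; chain rule through f(x_n, theta).
        w, b, c = w + rho * g_mu * xn, b + rho * g_mu, c + rho * g_c
    return w, b, c

rng = np.random.default_rng(1)
z_true = rng.standard_normal(500)
x = z_true + rng.standard_normal(500)
w, b, c = amortized_svi(x)
print(w, b, np.exp(2 * c))   # the exact posterior is N(x / 2, 1 / 2)
```

For this model the exact posterior of every data point is $\mathcal{N}(x_n / 2, 1/2)$, so a single shared $\theta$ can represent all local posteriors, and the learned values can be compared against $w = 0.5$, $b = 0$, $e^{2c} = 0.5$.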
Example: Variational Auto-Encoder
Variational Auto-Encoder: Model
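[Graphical model: latent variable z_n and observation x_n with model parameters ψ, n = 1, ..., N. Schematic: z passes through the networks μ_ψ, Σ_ψ to give μ and Σ, which define p(x|z)]
§ Model (Rezende et al., 2014; Kingma & Welling, 2014):
$$p(z) = \mathcal{N}(0, I)$$
$$p(x \mid z) = \mathcal{N}\big(\mu_\psi(z), \Sigma_\psi(z)\big)$$
§ $\mu_\psi$ and $\Sigma_\psi$ are deep networks with model parameters $\psi$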
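The generative process can be sketched as ancestral sampling. The network below is a stand-in with random weights, not a trained model; the single hidden layer, the diagonal form of $\Sigma_\psi(z)$, and the names `init_decoder` and `sample_from_model` are assumptions made for illustration.

```python
import numpy as np

def init_decoder(latent_dim, hidden_dim, data_dim, seed=0):
    """Randomly initialise the model parameters psi (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((latent_dim, hidden_dim)) / np.sqrt(latent_dim),
        "b1": np.zeros(hidden_dim),
        "W_mu": rng.standard_normal((hidden_dim, data_dim)) / np.sqrt(hidden_dim),
        "W_logvar": rng.standard_normal((hidden_dim, data_dim)) / np.sqrt(hidden_dim),
    }

def sample_from_model(psi, num_samples, seed=0):
    """Ancestral sampling: z ~ N(0, I), then x ~ N(mu_psi(z), Sigma_psi(z))."""
    rng = np.random.default_rng(seed)
    latent_dim = psi["W1"].shape[0]
    z = rng.standard_normal((num_samples, latent_dim))        # z ~ p(z)
    h = np.tanh(z @ psi["W1"] + psi["b1"])
    mu = h @ psi["W_mu"]                                       # mu_psi(z)
    var = np.exp(h @ psi["W_logvar"])                          # diagonal of Sigma_psi(z)
    x = mu + np.sqrt(var) * rng.standard_normal(mu.shape)      # x ~ p(x | z)
    return z, x

psi = init_decoder(latent_dim=2, hidden_dim=32, data_dim=10)
z, x = sample_from_model(psi, num_samples=5)
print(z.shape, x.shape)
```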
Variational Auto-Encoder: Inference
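[Schematic: x passes through the networks μ_θ, Σ_θ to give μ and Σ, which define the distribution of z]
§ Inference:
$$q(z \mid x) = \mathcal{N}\big(\mu_\theta(x), \Sigma_\theta(x)\big)$$
§ $\mu_\theta$ and $\Sigma_\theta$ are deep networks that map data onto (local) variational parameters
  Inference network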
Variational Auto-Encoder
[Schematic: x passes through the inference network μ_θ, Σ_θ to give μ and Σ; z = μ + Rε with ε ~ N(0, I); z passes through the model networks μ_ψ, Σ_ψ to give μ and Σ, which define p(x|z)]
§ Reparametrization trick introduces random noise $\epsilon \sim \mathcal{N}(0, I)$, but everything else is a deterministic transformation
  No need to push gradients through samples
§ Significance of the VAE:
  § Propagation of gradients through probability distributions (stochastic backpropagation)
  § Joint learning of model parameters and variational parameters
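The transformation $z = \mu + R\epsilon$ can be checked numerically. The slides do not define $R$; the sketch assumes it is the Cholesky factor of $\Sigma$, so that $\Sigma = R R^\top$. The values of `mu` and `Sigma` are arbitrary.

```python
import numpy as np

def reparametrize(mu, Sigma, eps):
    """Return z = mu + R eps, where R is the Cholesky factor of Sigma (Sigma = R R^T)."""
    R = np.linalg.cholesky(Sigma)        # lower-triangular
    return mu + eps @ R.T                # each row of eps is one noise draw

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
eps = np.random.default_rng(0).standard_normal((200_000, 2))   # eps ~ N(0, I)
z = reparametrize(mu, Sigma, eps)
print(z.mean(axis=0))            # close to mu
print(np.cov(z, rowvar=False))   # close to Sigma
```

All randomness enters through `eps`, which does not depend on `mu` or `Sigma`, so `z` is a differentiable function of the variational parameters.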
Variational Auto-Encoder: A Different Schematic
[Schematic: the model p(x|z) maps z to data, x ~ p(x|z); the inference network q(z|x) maps data to z, z ~ q(z|x)]
§ Generative process (left) (generator)
§ Inference process (right) (recognition/inference network)
Applications
§ Data compression/dimensionality reduction (similar to PCA)
§ Data visualization
§ Generation of new (realistic) data
§ Denoising
§ Probabilistic data imputation (fill gaps in data)
VAE: Data Visualization
Figure from Rezende et al. (2014)
VAE: Generation of Realistic Images
Figure from Rezende et al. (2014)
VAE: Probabilistic Data Imputation
Figure from Rezende et al. (2014)
Overview
[Diagram relating the model class (conditionally conjugate; hierarchical Bayesian) to the variational approximation of the posterior (mean-field; more general). Entries: analytic solution; stochastic gradient estimators (example: VAE); dense Gaussian, mixture models, normalizing flows, auxiliary-variable models]
Richer Posterior Approximations
[Figure: two graphs over z_1, z_2, z_3, ranging from most expressive to least expressive]
True posterior (most expressive): $q(z \mid x) = p(z \mid x)$
Fully factorized (least expressive): $q(z \mid x) = \prod_i q_i(z_i)$
§ Build richer posteriors
§ Maintain computational efficiency and scalability
§ Use all the things we know for specifying models of the data (but now for posterior approximations)
Structured Mean Field
[Figure: three graphs over z_1, z_2, z_3, ranging from most expressive to least expressive]
True posterior (most expressive): $q(z \mid x) = p(z \mid x)$
Structured approximation: $q(z) = \prod_k q_k(z_k \mid z_{j \neq k})$
Fully factorized (least expressive): $q(z \mid x) = \prod_i q_i(z_i)$
§ Introduce dependencies between latent variables
  Richer posterior than fully factorized
Autoregressive Distributions
[Graphical model: latent variables z_1, z_2, z_3, z_4 in a fixed order]
§ Autoregressive structure:
$$q(z_{1:K} \mid \nu) = \prod_{k=1}^{K} q(z_k \mid z_{1:k-1}, \nu_k)$$
§ Ordering of latent variables and nonlinear dependencies
§ VAE (Rezende et al., 2014): mean-field Gaussian; DRAW (Gregor et al., 2015): autoregressive
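The factorization translates directly into a sampling loop that also accumulates $\log q(z_{1:K})$. The sketch below uses linear-Gaussian conditionals, which is a simplifying assumption (the slide also allows nonlinear dependencies); the names `A`, `b`, `s` stand in for the parameters $\nu_k$ and are hypothetical.

```python
import numpy as np

def sample_and_log_q(A, b, s, rng):
    """Draw z ~ q(z_{1:K}) and return (z, log q(z)) for a linear-Gaussian autoregressive q.

    q(z_k | z_{1:k-1}) = N(b[k] + A[k, :k] @ z[:k], s[k]^2); A is strictly lower triangular.
    """
    K = len(b)
    z = np.zeros(K)
    log_q = 0.0
    for k in range(K):                       # fixed ordering of the latent variables
        mean_k = b[k] + A[k, :k] @ z[:k]     # depends only on z_1, ..., z_{k-1}
        z[k] = mean_k + s[k] * rng.standard_normal()
        log_q += -0.5 * ((z[k] - mean_k) / s[k]) ** 2 - np.log(s[k]) - 0.5 * np.log(2 * np.pi)
    return z, log_q

K = 4
rng = np.random.default_rng(0)
A = np.tril(rng.standard_normal((K, K)), k=-1)   # strictly lower triangular weights
b = rng.standard_normal(K)
s = np.array([1.0, 0.5, 2.0, 0.7])
z, log_q = sample_and_log_q(A, b, s, rng)
print(z, log_q)
```

With linear conditionals the joint is a correlated Gaussian, so even this simple case is strictly richer than a fully factorized Gaussian.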
Other Posteriors
§ Mixture models (Saul & Jordan, 1996): $q(z) = \sum_k \pi_k q_k(z_k \mid \nu_k)$
§ Linking functions (Ranganath et al., 2016): $q(z) = \Big(\prod_{k=1}^{K} q(z_k \mid \nu_k)\Big) L(z_{1:K} \mid \nu_{K+1})$
Normalizing Flows (1)
§ Apply a sequence of $K$ invertible transformations to an initial distribution $q_0$
  Change-of-variables rule
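As a sketch of the change-of-variables bookkeeping, the code below pushes samples from $q_0 = \mathcal{N}(0, I)$ through two planar transformations $f(z) = z + u \tanh(w^\top z + b)$, one family of invertible maps described by Rezende & Mohamed (2015). The specific values of `u`, `w`, `b` are arbitrary and chosen so that $w^\top u \geq -1$, which keeps each map invertible.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """One planar transformation f(z) = z + u * tanh(w^T z + b) and its log |det Jacobian|."""
    a = z @ w + b
    f_z = z + np.outer(np.tanh(a), u)
    psi = (1.0 - np.tanh(a) ** 2)[:, None] * w          # h'(a) * w
    log_det = np.log(np.abs(1.0 + psi @ u))             # det(I + u psi^T) = 1 + psi^T u
    return f_z, log_det

def flow_log_density(z0, flows):
    """Push samples z0 ~ q_0 = N(0, I) through the flows and track log q_K(z_K)."""
    log_q = -0.5 * np.sum(z0**2, axis=1) - 0.5 * z0.shape[1] * np.log(2 * np.pi)
    z = z0
    for u, w, b in flows:
        z, log_det = planar_flow(z, u, w, b)
        log_q = log_q - log_det                          # change-of-variables rule
    return z, log_q

rng = np.random.default_rng(0)
flows = [(np.array([0.5, -0.3]), np.array([1.0, 0.8]), 0.1),
         (np.array([-0.2, 0.6]), np.array([0.4, -1.2]), -0.5)]
z0 = rng.standard_normal((5, 2))
zK, log_qK = flow_log_density(z0, flows)
print(zK, log_qK)
```

Each transformation only subtracts its log-determinant from the running log-density, so the density of the final sample is available without inverting any map.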
Normalizing Flows (2)
Rezende & Mohamed (2015)
Summary
§ Variational inference finds an approximate posterior by optimization
§ Minimizing the KL divergence is equivalent to maximizing a lower bound on the marginal likelihood
§ Mean-field VI: analytic updates in conditionally conjugate models
§ Stochastic VI: Stochastic optimization for scalability
§ General models require us to compute gradients of expectations
  § Score-function gradients
  § Pathwise gradients
§ Amortized inference
§ Modern VI allows us to specify rich classes of posterior approximations
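The equivalence in the second point follows from a standard identity, stated here in generic notation (latent variables $z$, data $x$, approximation $q$) as a reminder:

```latex
\log p(x) = \underbrace{\mathbb{E}_{q(z)}\big[\log p(x, z) - \log q(z)\big]}_{\text{ELBO}} + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)
```

Since $\log p(x)$ does not depend on $q$ and the KL divergence is non-negative, the ELBO is a lower bound on $\log p(x)$, and maximizing it over $q$ minimizes the KL divergence to the true posterior.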
References I
[1] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag, 2006.
[3] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[4] S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, and G. E. Hinton. Attend, Infer, Repeat: Fast Scene Understanding with Generative Models. In Advances in Neural Information Processing Systems, 2016.
[5] Z. Ghahramani and M. J. Beal. Propagation Algorithms for Variational Bayesian Learning. In Advances in Neural Information Processing Systems, 2001.
[6] P. Gopalan, W. Hao, D. M. Blei, and J. D. Storey. Scaling Probabilistic Models of Genetic Variation to Millions of Humans. Nature Genetics, 48(12):1587–1590, 2016.
[7] P. K. Gopalan and D. M. Blei. Efficient Discovery of Overlapping Communities in Massive Networks. Proceedings of the National Academy of Sciences, page 201221839, 2013.
[8] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A Recurrent Neural Network For Image Generation. In Proceedings of the International Conference on Machine Learning, 2015.
[9] M. D. Hoffman, D. M. Blei, and F. Bach. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems, 23:1–9, 2010.
[10] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic Variational Inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
[11] D. Jimenez Rezende and S. Mohamed. Variational Inference with Normalizing Flows. In Proceedings of the International Conference on Machine Learning, 2015.
[12] D. Jimenez Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Variational Inference in Deep Latent Gaussian Models. In Proceedings of the International Conference on Machine Learning, 2014.
References II
[13] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37:183–233, 1999.
[14] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.
[15] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic Differentiation Variational Inference. Journal of Machine Learning Research, 18(1):430–474, 2017.
[16] J. R. Manning, R. Ranganath, K. A. Norman, and D. M. Blei. Topographic Factor Analysis: A Bayesian Model for Inferring Brain Networks from Neural Data. PloS One, 9(5):e94914, 2014.
[17] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2001.
[18] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, USA, 2012.
[19] R. Ranganath, D. Tran, and D. M. Blei. Hierarchical Variational Models. In Proceedings of the International Conference on Machine Learning, 2016.
[20] H. Salimbeni and M. P. Deisenroth. Doubly Stochastic Variational Inference for Deep Gaussian Processes. In Advances in Neural Information Processing Systems, 2017.
[21] L. K. Saul and M. I. Jordan. Exploiting Tractable Substructures in Intractable Networks. In Advances in Neural Information Processing Systems, 1996.
[22] R. J. Williams. Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3):229–256, May 1992.