Introduction to deep learning
Andreas Damianou
Amazon, Cambridge, UK
Royal Statistical Society, London, 13 Dec. 2018
Notebook
http://adamian.github.io/talks/Damianou_DL_tutorial_18.ipynb
Starting with a cliché...
Deep neural networks: hierarchical function definitions

A neural network is a composition of functions (layers), each parameterized with a weight vector $w_l$. E.g. for 2 layers:

$$f_{\text{net}} = h_2(h_1(x; w_1); w_2).$$

Generally $f_{\text{net}}: x \mapsto y$ with:

$$h_1 = \phi(x w_1 + b_1)$$
$$h_2 = \phi(h_1 w_2 + b_2)$$
$$\cdots$$
$$y = \phi(h_{L-1} w_L + b_L)$$

$\phi$ is the (non-linear) activation function.
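A minimal runnable sketch of this composition (not from the slides; the tanh activation and the layer sizes are illustrative assumptions):

```python
import numpy as np

def phi(z):
    """Non-linear activation; tanh is one common choice (an assumption here)."""
    return np.tanh(z)

def f_net(x, weights, biases):
    """Compose layers: h_l = phi(h_{l-1} w_l + b_l), starting from h_0 = x."""
    h = x
    for w, b in zip(weights, biases):
        h = phi(h @ w + b)
    return h

# Example: a 2-layer network mapping 3 inputs to 1 output, on a batch of 5.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((4, 1))]
biases = [np.zeros(4), np.zeros(1)]
x = rng.standard_normal((5, 3))
y = f_net(x, weights, biases)   # shape (5, 1)
```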
Defining the loss

- We have our function approximator $f_{\text{net}}(x) = \hat{y}$.
- We have to define a loss (objective function) to relate the function's outputs to the observed data.
- E.g. the squared difference $\sum_n (\hat{y}_n - y_n)^2$, or the cross-entropy (see the sketch below).
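A minimal sketch of the two example losses in NumPy (the epsilon guard against log(0) is an added assumption):

```python
import numpy as np

def squared_difference(y_hat, y):
    """Sum of squared differences between predictions and observations."""
    return np.sum((y_hat - y) ** 2)

def cross_entropy(p_hat, y):
    """Cross-entropy between one-hot targets y and predicted probabilities p_hat."""
    eps = 1e-12                          # guard against log(0)
    return -np.sum(y * np.log(p_hat + eps))

y = np.array([1.0, 0.0, 2.0])
print(squared_difference(np.array([0.9, 0.2, 1.8]), y))   # 0.09
```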
Probabilistic re-formulation

- Training by minimizing a loss:

$$\arg\min_w \; \underbrace{\frac{1}{2} \sum_{i=1}^{N} \left(f_{\text{net}}(w, x_i) - y_i\right)^2}_{\text{fit}} \;+\; \underbrace{\lambda \sum_i \| w_i \|}_{\text{regularizer}}$$

- Equivalent probabilistic view for regression, maximizing the posterior probability:

$$\arg\max_w \; \underbrace{\log p(y|x, w)}_{\text{fit}} \;+\; \underbrace{\log p(w)}_{\text{regularizer}}$$

  where $p(y|x,w) \sim \mathcal{N}$ and $p(w) \sim \text{Laplace}$ (the equivalence is made explicit below).

- Optimization is still done with back-prop (i.e. gradient descent).
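Spelling out the equivalence (a standard derivation, not shown on the slide; unit noise variance is assumed): take negative logs of the Gaussian likelihood and the Laplace prior and drop constants,

$$-\log p(y|x,w) = \frac{1}{2}\sum_{i=1}^{N} \left(f_{\text{net}}(w, x_i) - y_i\right)^2 + \text{const}, \qquad -\log p(w) = \lambda \sum_i |w_i| + \text{const},$$

so maximizing $\log p(y|x,w) + \log p(w)$ over $w$ is exactly minimizing fit + regularizer (here with the $\ell_1$ norm; a Gaussian prior would give $\ell_2$).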
Graphical depiction
Optimization

One layer:

$$\text{Loss} = \frac{1}{2}(h - y)^2, \qquad h = \phi(xw)$$

$$\frac{\partial \text{Loss}}{\partial w} = \underbrace{(y - h)}_{\varepsilon}\, \frac{\partial \phi(xw)}{\partial w}$$

Two layers:

$$\text{Loss} = \frac{1}{2}(h_2 - y)^2, \qquad h_2 = \phi\big(\underbrace{\phi(x w_0)}_{h_1}\, w_1\big)$$

$$\frac{\partial \text{Loss}}{\partial w_0} = \cdots \qquad \frac{\partial \text{Loss}}{\partial w_1} = \cdots$$
Derivative w.r.t. $w_1$

$$\frac{\partial\, (h_2 - y)^2}{\partial w_1} = -2 \cdot \tfrac{1}{2}\,(h_2 - y)\, \frac{\partial h_2}{\partial w_1} = (y - h_2)\, \frac{\partial \phi(h_1 w_1)}{\partial w_1} = (y - h_2)\, \frac{\partial \phi(h_1 w_1)}{\partial (h_1 w_1)}\, \frac{\partial (h_1 w_1)}{\partial w_1} = \underbrace{(y - h_2)}_{\varepsilon_2}\, \underbrace{\frac{\partial \phi(h_1 w_1)}{\partial (h_1 w_1)}}_{g_1}\; h_1^{T}$$

$h_1$ is computed during the forward pass.
Derivative w.r.t. $w_0$

$$\frac{\partial\, (h_2 - y)^2}{\partial w_0} = -2 \cdot \tfrac{1}{2}\,(h_2 - y)\, \frac{\partial h_2}{\partial w_0} = (y - h_2)\, \frac{\partial \phi(h_1 w_1)}{\partial (h_1 w_1)}\, \frac{\partial (h_1 w_1)}{\partial h_1}\, \frac{\partial h_1}{\partial w_0} = \varepsilon_2\, g_1\, w_1^{T}\, \frac{\partial \phi(x w_0)}{\partial (x w_0)}\, \frac{\partial (x w_0)}{\partial w_0} = \varepsilon_2\, g_1\, w_1^{T}\, \underbrace{\frac{\partial \phi(x w_0)}{\partial (x w_0)}}_{g_0}\; x^{T}$$

Propagation of error is just the chain rule.
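A minimal NumPy sketch of exactly this backward pass (tanh for $\phi$ is an assumption; names mirror the slides, and following the slides' sign convention $\varepsilon_2 = y - h_2$ gives the descent direction rather than the raw gradient):

```python
import numpy as np

phi = np.tanh
dphi = lambda z: 1.0 - np.tanh(z) ** 2   # phi'(z) for tanh

def forward_backward(x, y, w0, w1):
    # Forward pass; h1 is stored because the backward pass reuses it.
    a1 = x @ w0
    h1 = phi(a1)
    a2 = h1 @ w1
    h2 = phi(a2)

    # Backward pass: propagate the error with the chain rule.
    eps2 = y - h2                        # epsilon_2 on the slides
    g1 = dphi(a2)                        # d phi(h1 w1) / d (h1 w1)
    g0 = dphi(a1)                        # d phi(x w0)  / d (x w0)
    d_w1 = h1.T @ (eps2 * g1)            # eps2 * g1 * h1^T
    d_w0 = x.T @ (((eps2 * g1) @ w1.T) * g0)   # eps2 * g1 * w1^T * g0 * x^T
    return d_w0, d_w1

rng = np.random.default_rng(0)
x, y = rng.standard_normal((5, 3)), rng.standard_normal((5, 1))
w0, w1 = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))
d_w0, d_w1 = forward_backward(x, y, w0, w1)  # descent directions for w0, w1
```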
Go to notebook!
Automatic differentiation

Example: $f(x_1, x_2) = x_1 \sqrt{\log x_1 \cdot \sin(x_2^2)}$ has a symbolic graph:

(image: sanyamkapoor.com)
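A minimal sketch of reverse-mode autodiff in MXNet (the talk's framework); the exact form of $f$ above is a reconstruction of the garbled slide formula, so treat it as an assumption:

```python
import mxnet as mx
from mxnet import autograd

x1 = mx.nd.array([2.0]); x1.attach_grad()
x2 = mx.nd.array([0.5]); x2.attach_grad()

with autograd.record():                        # record the computation graph
    f = x1 * mx.nd.sqrt(mx.nd.log(x1) * mx.nd.sin(x2 ** 2))
f.backward()                                   # reverse sweep through the graph

print(x1.grad.asscalar(), x2.grad.asscalar())  # df/dx1, df/dx2
```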
A NN in mxnet
Back to notebook!
We’re far from done...

- How to initialize the (so many) parameters?
- How to pick the right architecture?
- Layers and parameters co-adapt.
- Multiple local optima in the optimization surface.
- Numerical problems.
- Bad behaviour of the composite function (e.g. problematic gradient distribution).
- OVERFITTING
Curve fitting [skip]
(figure: three curve-fitting panels; axes x ∈ [−2.0, 2.0], y ∈ [2.5, 5.5])
Taming the dragon
(figure: "← my neural network", "← me")

How to make your neural network do what you want it to do?
Lottery ticket hypothesis
Might provide intuition for many of the tricks used.

- Optimization landscape: multiple optima, difficult to navigate.
- Over-parameterized networks contain multiple sub-networks ("lottery tickets").
- "Winning ticket": a lucky sub-network that found a good solution.
- Over-parameterization: more tickets, higher probability of winning.
- Of course, this means we have to prune or at least regularize.

(Frankle and Carbin (2018))
“Tricks”
- Smart initializations
- ReLU: better behaviour of gradients
- Early stopping: prevent overfitting
- Dropout
- Batch-normalization
- Transfer/meta-learning/BO: guide the training with another model
- many other "tricks"
Vanishing and exploding gradients

Recall the gradient w.r.t. $w_0$:

$$\frac{\partial\, (h_2 - y)^2}{\partial w_0} = \underbrace{(y - h_2)}_{\varepsilon_2}\, g_1\, w_1^{T}\, \underbrace{\frac{\partial \phi(x w_0)}{\partial (x w_0)}}_{g_0}\; x^{T}$$

Each layer contributes a factor $g_l$ (the activation's derivative) and a factor $w_l^{T}$; in a deep network the product of many such factors can shrink towards zero (vanishing) or grow without bound (exploding).

- ReLU: an activation function leading to well-behaved gradients (see the sketch below).
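A minimal numeric illustration (using sigmoid as the contrasting saturating activation is an assumption, not from the slide):

```python
import numpy as np

# Gradient factors g = phi'(z): sigmoid saturates (g -> 0 for large |z|),
# while ReLU contributes a factor of exactly 1 wherever the unit is active.
z = np.linspace(-6, 6, 5)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
g_sigmoid = sigmoid(z) * (1.0 - sigmoid(z))   # <= 0.25, near 0 in the tails

g_relu = (z > 0).astype(float)                # exactly 0 or 1

print(g_sigmoid)   # tiny factors whose product vanishes with depth
print(g_relu)      # unit factors on the active half
```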
Early stopping
(image: deeplearning4j.org)
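A minimal sketch of the idea (the patience rule and the synthetic validation curve are illustrative assumptions): stop once the validation loss has not improved for a fixed number of epochs, and keep the best epoch seen.

```python
import numpy as np

rng = np.random.default_rng(0)
val_losses = [1.0 / (1 + e) + 0.01 * rng.standard_normal() + 0.002 * e
              for e in range(100)]     # synthetic U-shaped validation curve

best, best_epoch, patience, wait = float("inf"), 0, 5, 0
for epoch, loss in enumerate(val_losses):
    if loss < best:                    # improvement: reset the counter
        best, best_epoch, wait = loss, epoch, 0
    else:
        wait += 1
        if wait >= patience:           # plateau: stop training here
            break

print(f"stopped at epoch {epoch}; best epoch was {best_epoch}")
```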
Dropout
(figure: network before ⟹ after dropout)

- Randomly drop units during training.
- Prevents units from co-adapting too much and prevents overfitting (see the sketch below).
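A minimal sketch of inverted dropout (the keep probability p = 0.8 and the 1/p rescaling at training time are standard choices, assumed here rather than taken from the slide):

```python
import numpy as np

def dropout(h, p=0.8, training=True, rng=np.random.default_rng(0)):
    """Zero each unit independently with probability 1-p at training time."""
    if not training:
        return h                       # test time: use the layer as-is
    mask = rng.random(h.shape) < p     # keep each unit with probability p
    return h * mask / p                # rescale so E[output] is unchanged

h = np.ones((2, 5))
print(dropout(h))                      # roughly 20% of entries zeroed
```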
Batch-normalization
- Normalize each layer's output so that e.g. $\mu = 0$, $\sigma = 1$ (see the sketch below)
- Reduces covariate shift (data distribution changes)
- Less co-adaptation of layers
- Overall: faster convergence
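A minimal sketch of the training-time normalization (the learnable scale/shift γ, β and the ε guard are standard parts of batch norm, assumed here):

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift."""
    mu = h.mean(axis=0)                 # per-feature batch mean
    var = h.var(axis=0)                 # per-feature batch variance
    h_hat = (h - mu) / np.sqrt(var + eps)
    return gamma * h_hat + beta

h = np.random.default_rng(0).standard_normal((32, 4)) * 10 + 3
out = batch_norm(h, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ≈ 0 and ≈ 1
```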
Meta-learning
- Optimize the neural network model with the help of another model.
- The helper model might be allowed to learn from multiple datasets.
(image: Rusu et al. 2018 - LEO)
Bayesian HPO
- Hyperparameters: learning rate, weight decay, architectures, learning protocols
- Optimize them using Bayesian optimization
- Prediction of learning curves can speed up HPO in a bandit setting
- Example: https://xfer.readthedocs.io/en/master/demos/xfer-hpo.html
Convolutional NN
(image: towardsdatascience.com)
Recurrent NN
(image: towardsdatascience.com)
Deployment: Transfer learning
Training neural networks from scratch is often not practical, as it requires:

- a lot of data
- expertise
- compute (e.g. GPU machines)

Solution:

- Transfer learning: repurposing pre-trained neural networks to solve new tasks (see the sketch below).
- A library for transfer learning: https://github.com/amzn/xfer

Go to Notebook!
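A hedged sketch of the idea with MXNet Gluon's model zoo (this is generic Gluon, not the xfer library's API; the 10-class head is an illustrative assumption):

```python
import mxnet as mx
from mxnet.gluon import nn
from mxnet.gluon.model_zoo import vision

# Reuse a pre-trained feature extractor; train only a fresh output layer.
pretrained = vision.resnet18_v1(pretrained=True)

net = nn.HybridSequential()
net.add(pretrained.features)           # pre-trained representation (kept fixed below)
net.add(nn.Dense(10))                  # new head for the 10-class target task
net[1].initialize(mx.init.Xavier())    # only the new layer needs initializing

# Only the head's parameters are given to the optimizer, so the
# pre-trained features are effectively frozen during fine-tuning.
trainer = mx.gluon.Trainer(net[1].collect_params(), "sgd",
                           {"learning_rate": 0.01})
```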
Bayesian deep learning
We saw that optimizing the parameters is a challenge. Why not marginalize them out completely?
Probabilistic re-formulation

Recall the probabilistic view of training as MAP estimation: $\arg\max_w \log p(y|x,w) + \log p(w)$. Rather than optimizing $w$, we can integrate it out.
Integrating out weights
$$\mathcal{D} := (x, y)$$

$$p(w|\mathcal{D}) = \frac{p(\mathcal{D}|w)\, p(w)}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int p(\mathcal{D}|w)\, p(w)\, dw$$
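Predictions then follow by averaging the network over the posterior (the standard posterior predictive, stated here for completeness):

$$p(y_*|x_*, \mathcal{D}) = \int p(y_*|x_*, w)\, p(w|\mathcal{D})\, dw$$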
Inference
- $p(\mathcal{D})$ (and hence $p(w|\mathcal{D})$) is difficult to compute because of the nonlinear way in which $w$ appears through $g$.
- Attempt at variational inference:

$$\underbrace{\mathrm{KL}\big(q(w; \theta)\, \|\, p(w|\mathcal{D})\big)}_{\text{minimize}} = \log p(\mathcal{D}) - \underbrace{\mathcal{L}(\theta)}_{\text{maximize}}$$

  where

$$\mathcal{L}(\theta) = \underbrace{\mathbb{E}_{q(w;\theta)}[\log p(\mathcal{D}, w)]}_{\mathcal{F}} + H[q(w; \theta)]$$

- The term $\mathcal{F}$ (highlighted in red on the slide) is still problematic. Solution: MC (see the sketch below).
- Such approaches can be formulated as black-box inferences.
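A minimal sketch of the Monte Carlo step, assuming a factorized Gaussian $q(w;\theta)$ with the reparameterization $w = \mu + \sigma\epsilon$ (the placeholder log-joint stands in for a real model's $\log p(\mathcal{D}, w)$):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(w):
    """Placeholder for log p(D, w) of an actual model (assumption)."""
    return -0.5 * np.sum(w ** 2)

mu, log_sigma = np.zeros(3), np.zeros(3)    # variational parameters theta
S = 64                                      # number of MC samples

eps = rng.standard_normal((S, 3))
w_samples = mu + np.exp(log_sigma) * eps    # w = mu + sigma * eps ~ q(w; theta)
F_hat = np.mean([log_joint(w) for w in w_samples])  # MC estimate of F
print(F_hat)
```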
Bayesian neural network (what we saw before)
(figure: graphical model of the network — input X at the bottom, hidden layers H with activations φ, output Y at the top)
From NN to GP

(figure: graphical model with GP-mapped hidden layers between X and Y)

- NN: $H_2 = W_2\, \phi(H_1)$
- GP: $\phi$ is $\infty$-dimensional, so: $H_2 = f_2(H_1; \theta_2) + \epsilon$
- NN: $p(W)$
- GP: $p(f(\cdot))$
Summary
- Vanilla feedforward NN with backpropagation (chain rule)
- Automatic differentiation
- Practical issues and solutions ("tricks")
- Understanding the challenges: optimization landscape and capacity
- ConvNets and RNNs
- Transfer learning for practical use
- Bayesian NNs
Conclusions
- NNs are mathematically simple; the challenge is in how to optimize them.
- Data efficiency? Uncertainty calibration? Interpretability? Safety? ...