Introduction to deep learning

Andreas Damianou

Amazon, Cambridge, UK

Royal Statistical Society, London, 13 Dec. 2018

Notebook

http://adamian.github.io/talks/Damianou_DL_tutorial_18.ipynb

Starting with a cliché...

Deep neural networks: hierarchical function definitions

A neural network is a composition of functions (layers), each parameterized with a weight vector $w_l$. E.g. for 2 layers:

$$f_{\text{net}} = h_2(h_1(x; w_1); w_2).$$

Generally $f_{\text{net}} : x \mapsto y$ with:

$$h_1 = \phi(x w_1 + b_1)$$
$$h_2 = \phi(h_1 w_2 + b_2)$$
$$\cdots$$
$$y = \phi(h_{L-1} w_L + b_L)$$

$\phi$ is the (non-linear) activation function.
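
A minimal NumPy sketch of the composition above (my illustration, not the talk's notebook); the layer sizes, tanh activation, and random data are made up:

```python
# Two-layer forward pass: y = phi(phi(x w1 + b1) w2 + b2)
import numpy as np

def phi(z):
    return np.tanh(z)                  # the (non-linear) activation function

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 5))            # a single 5-dimensional input (made up)
w1, b1 = rng.normal(size=(5, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

h1 = phi(x @ w1 + b1)                  # first layer
y = phi(h1 @ w2 + b2)                  # second layer: f_net(x)
print(y.shape)                         # (1, 1)
```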

Defining the loss

- We have our function approximator $f_{\text{net}}(x) = \hat{y}$.
- We have to define a loss (objective function) relating the function's outputs to the observed data.
- E.g. squared difference $\sum_n (\hat{y}_n - y_n)^2$ or cross-entropy.
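
As a hedged sketch (not the talk's code), the two losses mentioned above, for predictions `y_hat` and observed targets `y`:

```python
import numpy as np

def squared_error(y_hat, y):
    # sum_n (y_hat_n - y_n)^2
    return np.sum((y_hat - y) ** 2)

def cross_entropy(p_hat, y, eps=1e-12):
    # y one-hot (or {0,1}), p_hat predicted probabilities
    return -np.sum(y * np.log(p_hat + eps))
```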

Probabilistic re-formulation

- Training minimizing loss:

  $$\arg\min_w \; \underbrace{\tfrac{1}{2}\sum_{i=1}^{N} \big(f_{\text{net}}(w, x_i) - y_i\big)^2}_{\text{fit}} \;+\; \lambda \underbrace{\sum_i \| w_i \|}_{\text{regularizer}}$$

- Equivalent probabilistic view for regression, maximizing posterior probability:

  $$\arg\max_w \; \underbrace{\log p(y|x, w)}_{\text{fit}} \;+\; \underbrace{\log p(w)}_{\text{regularizer}}$$

  where $p(y|x,w) \sim \mathcal{N}$ and $p(w) \sim$ Laplace.

- Optimization still done with back-prop (i.e. gradient descent).
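
A small numerical check of this equivalence (my sketch, assuming unit noise variance and a linear "network" for simplicity): the regularized loss and the negative log posterior differ only by an additive constant, so they share the same minimizer; the Laplace prior scale $b$ corresponds to $\lambda = 1/b$.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=(20, 3)), rng.normal(size=20)   # toy data (made up)
lam, b = 0.5, 2.0                                      # lambda = 1/b

def f_net(w, x):
    return x @ w                       # linear model standing in for the network

def loss(w):
    return 0.5 * np.sum((f_net(w, x) - y) ** 2) + lam * np.sum(np.abs(w))

def neg_log_posterior(w):
    log_lik = -0.5 * np.sum((y - f_net(w, x)) ** 2)    # N(y | f, 1), constants dropped
    log_prior = -np.sum(np.abs(w)) / b                 # Laplace(0, b), constants dropped
    return -(log_lik + log_prior)

for _ in range(3):
    w = rng.normal(size=3)
    print(loss(w) - neg_log_posterior(w))              # same constant (here 0.0) every time
```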

Graphical depiction

Optimization

One layer:

$$\text{Loss} = \tfrac{1}{2}(h - y)^2, \qquad h = \phi(xw)$$

$$\frac{\partial \text{Loss}}{\partial w} = \underbrace{(y - h)}_{\varepsilon}\, \frac{\partial \phi(xw)}{\partial w}$$

Two layers:

$$\text{Loss} = \tfrac{1}{2}(h_2 - y)^2, \qquad h_2 = \phi\big(\underbrace{\phi(x w_0)}_{h_1}\, w_1\big)$$

$$\frac{\partial \text{Loss}}{\partial w_0} = \cdots \qquad \frac{\partial \text{Loss}}{\partial w_1} = \cdots$$

Derivative w.r.t. $w_1$

$$\frac{\partial (h_2 - y)^2}{\partial w_1} = -2 \cdot \tfrac{1}{2}\, (h_2 - y)\, \frac{\partial h_2}{\partial w_1} =$$
$$= (y - h_2)\, \frac{\partial \phi(h_1 w_1)}{\partial w_1} =$$
$$= (y - h_2)\, \frac{\partial \phi(h_1 w_1)}{\partial (h_1 w_1)}\, \frac{\partial (h_1 w_1)}{\partial w_1} =$$
$$= \underbrace{(y - h_2)}_{\varepsilon_2}\, \underbrace{\frac{\partial \phi(h_1 w_1)}{\partial (h_1 w_1)}}_{g_1}\, h_1^{\top}$$

$h_1$ is computed during the forward pass.

Derivative w.r.t. $w_0$

$$\frac{\partial (h_2 - y)^2}{\partial w_0} = -2 \cdot \tfrac{1}{2}\, (h_2 - y)\, \frac{\partial h_2}{\partial w_0} =$$
$$= (y - h_2)\, \frac{\partial \phi(h_1 w_1)}{\partial (h_1 w_1)}\, \frac{\partial (h_1 w_1)}{\partial h_1}\, \frac{\partial h_1}{\partial w_0} =$$
$$= \varepsilon_2\, g_1\, w_1^{\top}\, \frac{\partial \phi(x w_0)}{\partial (x w_0)}\, \frac{\partial (x w_0)}{\partial w_0} =$$
$$= \varepsilon_2\, g_1\, w_1^{\top}\, \underbrace{\frac{\partial \phi(x w_0)}{\partial (x w_0)}}_{g_0}\, x^{\top}$$

Propagation of error is just the chain rule.
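
A hedged NumPy sketch of these two updates (my illustration, following the slides' $\varepsilon = y - h$ sign convention), using tanh for $\phi$ so that $g = \partial\phi(z)/\partial z = 1 - \tanh^2(z)$; the sizes and data are made up:

```python
import numpy as np

phi = np.tanh
dphi = lambda z: 1.0 - np.tanh(z) ** 2   # derivative of the activation

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
w0, w1 = rng.normal(size=(4, 3)), rng.normal(size=(3, 1))
y = np.array([[0.5]])

# forward pass (h1 is stored for the backward pass)
h1 = phi(x @ w0)
h2 = phi(h1 @ w1)

# backward pass: propagation of error via the chain rule
eps2 = y - h2                            # epsilon_2
g1 = dphi(h1 @ w1)                       # d phi(h1 w1) / d(h1 w1)
grad_w1 = h1.T @ (eps2 * g1)             # epsilon_2 * g1 * h1^T

g0 = dphi(x @ w0)                        # d phi(x w0) / d(x w0)
delta1 = (eps2 * g1) @ w1.T              # epsilon_2 * g1 * w1^T: error reaching layer 1
grad_w0 = x.T @ (delta1 * g0)            # ... * g0 * x^T
```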

Go to notebook!

Automatic differentiation

Example: $f(x_1, x_2) = x_1 \sqrt{\log x_1}\, \sin(x_2^2)$

has symbolic graph:

(image: sanyamkapoor.com)
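
A short sketch of reverse-mode automatic differentiation with MXNet's autograd (my example, not the talk's notebook), using the function above as reconstructed from the slide:

```python
from mxnet import nd, autograd

x1, x2 = nd.array([3.0]), nd.array([2.0])
x1.attach_grad()                       # allocate space for df/dx1
x2.attach_grad()                       # allocate space for df/dx2

with autograd.record():                # record the computation graph
    f = x1 * nd.sqrt(nd.log(x1)) * nd.sin(x2 ** 2)

f.backward()                           # reverse sweep over the graph
print(x1.grad, x2.grad)                # df/dx1, df/dx2
```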

A NN in mxnet

Back to notebook!
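
For reference, a hedged Gluon sketch of what such a network might look like (my code, with made-up sizes and data, not the actual notebook):

```python
import mxnet as mx
from mxnet import nd, autograd, gluon
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(64, activation='relu'),   # hidden layer
        nn.Dense(1))                       # output layer
net.initialize(mx.init.Xavier())

loss_fn = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

x = nd.random.normal(shape=(32, 5))        # toy batch
y = nd.random.normal(shape=(32, 1))
for epoch in range(10):
    with autograd.record():
        loss = loss_fn(net(x), y)
    loss.backward()                        # back-prop via autograd
    trainer.step(batch_size=32)            # one gradient-descent update
```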

We’re far from done...

- How to initialize the (so many) parameters?
- How to pick the right architecture?
- Layers and parameters co-adapt.
- Multiple local optima in the optimization surface.
- Numerical problems.
- Bad behaviour of the composite function (e.g. problematic gradient distribution).
- OVERFITTING

Curve fitting [skip]

(figure: three curve-fitting panels; only the axis ticks survived extraction)

Taming the dragon

(image labels: "← my neural network", "← me")

How to make your neural network do what you want it to do?

Lottery ticket hypothesis

Might provide intuition for many of the tricks used.

- Optimization landscape: multiple optima and difficult to navigate.
- Over-parameterized networks contain multiple sub-networks ("lottery tickets").
- "Winning ticket": a lucky sub-network found a good solution.
- Over-parameterization: more tickets, higher winning probability.
- Of course this means we have to prune or at least regularize.

(Frankle and Carbin (2018))
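
As an illustration only (magnitude pruning, not the paper's full iterative procedure of pruning, rewinding and retraining), a sketch of extracting a sparse sub-network by keeping the largest-magnitude weights:

```python
import numpy as np

def prune_by_magnitude(w, keep_fraction=0.2):
    """Return a 0/1 mask keeping the top `keep_fraction` of |w|."""
    k = int(np.ceil(keep_fraction * w.size))
    threshold = np.sort(np.abs(w), axis=None)[-k]
    return (np.abs(w) >= threshold).astype(w.dtype)

w = np.random.default_rng(0).normal(size=(8, 8))   # stand-in weight matrix
mask = prune_by_magnitude(w, keep_fraction=0.2)
w_pruned = w * mask                                # the surviving sub-network
print(mask.mean())                                 # ~0.2 of the weights kept
```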

“Tricks”

- Smart initializations
- ReLU: better behaviour of gradients
- Early stopping: prevent overfitting
- Dropout
- Batch-normalization
- Transfer learning / meta-learning / BO (Bayesian optimization): guide the training with another model
- many other "tricks"

Vanishing and exploding gradients

$$\frac{\partial (h_2 - y)^2}{\partial w_0} = -2 \cdot \tfrac{1}{2}\, (h_2 - y)\, \frac{\partial h_2}{\partial w_0} =$$
$$= (y - h_2)\, \frac{\partial \phi(h_1 w_1)}{\partial (h_1 w_1)}\, \frac{\partial (h_1 w_1)}{\partial h_1}\, \frac{\partial h_1}{\partial w_0} =$$
$$= \varepsilon_2\, g_1\, w_1^{\top}\, \frac{\partial \phi(x w_0)}{\partial (x w_0)}\, \frac{\partial (x w_0)}{\partial w_0} =$$
$$= \varepsilon_2\, g_1\, w_1^{\top}\, \underbrace{\frac{\partial \phi(x w_0)}{\partial (x w_0)}}_{g_0}\, x^{\top}$$

- ReLU: an activation function leading to well-behaved gradients.
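
A small numerical illustration (mine, not from the slides) of why the product of many activation derivatives vanishes: each sigmoid factor is at most 0.25, whereas ReLU's derivative is exactly 1 on active units.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

z = np.linspace(-5, 5, 101)
print(d_sigmoid(z).max())        # ~0.25: the largest possible per-layer factor

for L in [5, 20, 50]:
    print(L, 0.25 ** L)          # upper bound on an L-layer product of sigmoid factors
```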

Early stopping

Monitor the validation error during training and stop when it starts increasing, even if the training error keeps decreasing.

(image: deeplearning4j.org)
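
A hedged sketch of the early-stopping pattern (generic, not the talk's code); the loss curve below is synthetic, standing in for a real validation run:

```python
import numpy as np

epochs = np.arange(200)
val_losses = 1.0 / (1 + epochs) + 0.002 * epochs   # improves, then starts overfitting

best_val, patience, wait, stop_epoch = float('inf'), 5, 0, None
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val - 1e-6:
        best_val, wait = val_loss, 0               # new best: reset the patience counter
    else:
        wait += 1
        if wait >= patience:
            stop_epoch = epoch                     # validation stopped improving: stop here
            break
print(stop_epoch, best_val)
```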

Dropout

(diagram: full network ⇒ thinned network after dropout)

- Randomly drop units during training.
- Prevents units from co-adapting too much and prevents overfitting.
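
A minimal sketch of (inverted) dropout at training time (my illustration, not the talk's code): each unit is kept with probability `p_keep` and the surviving activations are rescaled by `1/p_keep`, so nothing needs rescaling at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.5):
    mask = rng.random(h.shape) < p_keep    # keep each unit with probability p_keep
    return (h * mask) / p_keep             # rescale so the expected output matches the input

h = np.ones((2, 6))
print(dropout(h))                          # roughly half the units zeroed out
```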

Batch-normalization

- Normalize each layer's output so that e.g. $\mu = 0$, $\sigma = 1$.
- Reduces covariate shift (data distribution changes).
- Less co-adaptation of layers.
- Overall: faster convergence.
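
A sketch of the normalization step only (my illustration; full batch-norm also learns a per-feature scale γ and shift β, omitted here):

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    mu = h.mean(axis=0)                    # per-feature mean over the mini-batch
    var = h.var(axis=0)                    # per-feature variance over the mini-batch
    return (h - mu) / np.sqrt(var + eps)   # ~ zero mean, unit variance per feature

h = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(64, 10))
h_bn = batch_norm(h)
print(h_bn.mean(axis=0).round(3), h_bn.std(axis=0).round(3))
```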

Meta-learning

- Optimize the neural network model with the help of another model.
- The helper model might be allowed to learn from multiple datasets.

(image: Rusu et al. 2018 - LEO)

Bayesian HPO

- Hyperparameters: learning rate, weight decay, architectures, learning protocols.
- Optimize them using Bayesian optimization.
- Prediction of learning curves can speed up HPO in a bandit setting.
- Example: https://xfer.readthedocs.io/en/master/demos/xfer-hpo.html

Convolutional NN

(image: towardsdatascience.com)
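
A hedged Gluon sketch (mine, not the talk's notebook) of a small convolutional network for 28x28 single-channel images; the layer sizes are made up:

```python
from mxnet import nd
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Conv2D(channels=16, kernel_size=3, activation='relu'),
        nn.MaxPool2D(pool_size=2),
        nn.Conv2D(channels=32, kernel_size=3, activation='relu'),
        nn.MaxPool2D(pool_size=2),
        nn.Flatten(),
        nn.Dense(10))                          # class scores
net.initialize()

print(net(nd.zeros((1, 1, 28, 28))).shape)     # (1, 10)
```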

Recurrent NN

(image: towardsdatascience.com)
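
Similarly, a hedged Gluon sketch of a recurrent model (mine, with made-up sizes): an LSTM over a sequence followed by a per-step dense output.

```python
from mxnet import nd
from mxnet.gluon import nn, rnn

net = nn.Sequential()
net.add(rnn.LSTM(hidden_size=32),      # processes the whole sequence
        nn.Dense(1, flatten=False))    # per-step output
net.initialize()

x = nd.zeros((20, 4, 8))               # (sequence length, batch, features)
print(net(x).shape)                    # (20, 4, 1)
```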

Deployment: Transfer learning

Training neural networks from scratch is not practical as this requires:

- a lot of data
- expertise
- compute (e.g. GPU machines)

Solution:

- Transfer learning: repurposing pre-trained neural networks to solve new tasks.
- A library for transfer learning: https://github.com/amzn/xfer

Go to Notebook!
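
A hedged sketch of the repurposing idea itself (not the xfer library's API): reuse a network pre-trained on ImageNet as a feature extractor and train only a new output layer on the target task.

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn
from mxnet.gluon.model_zoo import vision

pretrained = vision.resnet18_v1(pretrained=True)   # downloads ImageNet weights

net = nn.Sequential()
net.add(pretrained.features,            # reuse the pre-trained feature extractor
        nn.Dense(5))                    # new head for a (hypothetical) 5-class task
net[1].initialize()                     # only the new layer needs initializing

# train only the new head by giving the trainer just its parameters
trainer = gluon.Trainer(net[1].collect_params(), 'sgd', {'learning_rate': 0.01})
```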

Bayesian deep learning

We saw that optimizing the parameters is a challenge. Why not marginalize them out completely?

Probabilistic re-formulation

- Training minimizing loss:

  $$\arg\min_w \; \underbrace{\tfrac{1}{2}\sum_{i=1}^{N} \big(f_{\text{net}}(w, x_i) - y_i\big)^2}_{\text{fit}} \;+\; \lambda \underbrace{\sum_i \| w_i \|}_{\text{regularizer}}$$

- Equivalent probabilistic view for regression, maximizing posterior probability:

  $$\arg\max_w \; \underbrace{\log p(y|x, w)}_{\text{fit}} \;+\; \underbrace{\log p(w)}_{\text{regularizer}}$$

  where $p(y|x,w) \sim \mathcal{N}$ and $p(w) \sim$ Laplace.

- Optimization still done with back-prop (i.e. gradient descent).

Integrating out weights

$$\mathcal{D} := (x, y)$$

$$p(w|\mathcal{D}) = \frac{p(\mathcal{D}|w)\, p(w)}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int p(\mathcal{D}|w)\, p(w)\, \mathrm{d}w$$

Inference

- $p(\mathcal{D})$ (and hence $p(w|\mathcal{D})$) is difficult to compute because of the nonlinear way in which $w$ appears through $g$ (the network's nonlinearity).
- Attempt at variational inference:

  $$\underbrace{\mathrm{KL}\big(q(w;\theta)\,\|\,p(w|\mathcal{D})\big)}_{\text{minimize}} = \log p(\mathcal{D}) - \underbrace{\mathcal{L}(\theta)}_{\text{maximize}}$$

  where

  $$\mathcal{L}(\theta) = \underbrace{\mathbb{E}_{q(w;\theta)}\big[\log p(\mathcal{D}, w)\big]}_{F} + \mathrm{H}\big[q(w;\theta)\big]$$

- The term $F$ (shown in red on the slide) is still problematic. Solution: MC (Monte Carlo estimation).
- Such approaches can be formulated as black-box inference.
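
A minimal sketch of the MC solution for the problematic term $F$ (my illustration, not the talk's code), using the reparameterization $w = \mu + \sigma\epsilon$ with a Gaussian $q(w;\theta)$; the one-weight "network", data, and noise level are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))                              # toy inputs
y = np.tanh(1.5 * x) + 0.1 * rng.normal(size=(50, 1))     # toy targets

def log_joint(w):
    """log p(D, w) = log p(y|x,w) + log p(w), constants dropped."""
    f = np.tanh(x * w)                                    # one-weight "network"
    log_lik = -0.5 * np.sum((y - f) ** 2) / 0.1 ** 2      # Gaussian likelihood
    log_prior = -0.5 * w ** 2                             # standard normal prior
    return log_lik + log_prior

def elbo_mc(mu, log_sigma, S=32):
    """L(theta) ~ (1/S) sum_s log p(D, w_s) + H[q], with w_s = mu + sigma * eps_s."""
    sigma = np.exp(log_sigma)
    w_samples = mu + sigma * rng.normal(size=S)           # reparameterization trick
    F = np.mean([log_joint(w) for w in w_samples])        # MC estimate of F
    entropy = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2) # H[q] for a Gaussian
    return F + entropy

print(elbo_mc(mu=1.0, log_sigma=-1.0))
```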

Bayesian neural network (what we saw before)

(diagram: a feedforward network with input $X$, hidden layers $H$, output $Y$, and activation units $\phi$ between layers)

From NN to GP

(diagram: the same hierarchy with each parametric layer replaced by a GP layer)

- NN: $H_2 = W_2\, \phi(H_1)$
- GP: $\phi$ is $\infty$-dimensional, so: $H_2 = f_2(H_1; \theta_2) + \varepsilon$
- NN: $p(W)$
- GP: $p(f(\cdot))$

Summary

- Vanilla feedforward NN with backpropagation (chain rule)
- Automatic differentiation
- Practical issues and solutions ("tricks")
- Understanding the challenges: optimization landscape and capacity
- ConvNets and RNNs
- Transfer learning for practical use
- Bayesian NNs

Conclusions

- NNs are mathematically simple; the challenge is in how to optimize them.
- Data efficiency? Uncertainty calibration? Interpretability? Safety? ...