
Learning Multi-layer Networks

Steve Renals

Machine Learning Practical — MLP Lecture 4, 12 October 2016

MLP Lecture 4 Learning Multi-layer Networks 1

Recap: Training multi-layer networks

[Figure: a multi-layer network with inputs x_i, two hidden layers h^(1), h^(2), outputs y_1 ... y_K, and weights w^(1)_ji, w^(2)_kj, w^(3)_lk between successive layers]

δ^(2)_k = ( Σ_m δ^(3)_m w^(3)_mk ) h^(2)_k (1 − h^(2)_k)

∂E^n/∂w^(2)_kj = δ^(2)_k h^(1)_j

w^(2)_kj ← w^(2)_kj − η (δ^(2)_k h^(1)_j)

How to set the learning rate?

Weight Updates

Let D_i(t) = ∂E/∂w_i(t) be the gradient of the error function E with respect to a weight w_i at update time t

"Vanilla" gradient descent updates the weight along the negative gradient direction:

Δw_i(t) = −η D_i(t)

w_i(t+1) = w_i(t) + Δw_i(t)

Hyperparameter η – the learning rate

Initialise η, and update it as the training progresses (learning rate schedule)
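As a concrete illustration, the two update equations above can be sketched in a few lines of NumPy. The quadratic error function, the value of η, and the initial weights below are illustrative choices, not from the lecture:

```python
import numpy as np

# "Vanilla" gradient descent on a toy quadratic error E(w) = 0.5 * w @ w,
# whose gradient is simply D(t) = w(t).
def grad(w):
    return w  # dE/dw for E(w) = 0.5 * w @ w

eta = 0.1                      # learning rate (hyperparameter)
w = np.array([1.0, -2.0])      # illustrative initial weights
for t in range(100):
    delta_w = -eta * grad(w)   # Delta w_i(t) = -eta * D_i(t)
    w = w + delta_w            # w_i(t+1) = w_i(t) + Delta w_i(t)

print(np.allclose(w, 0.0, atol=1e-3))  # True: converged close to the minimum at 0
```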

Learning Rate Schedules

Proofs of convergence for stochastic optimisation rely on a learning rate that reduces through time (as 1/t) – Robbins and Monro (1951)

Learning rate schedule – typically initial larger steps followed by smaller steps for fine tuning: results in faster convergence and better solutions

Time-dependent schedules

Piecewise constant: pre-determined η for each epoch
Exponential: η(t) = η(0) exp(−t/r) (r ∼ training set size)
Reciprocal: η(t) = η(0) (1 + t/r)^(−c) (c ∼ 1)

Performance-dependent η – e.g. "NewBOB": fixed η until validation set stops improving, then halve η each epoch (i.e. constant, then exponential)
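The three time-dependent schedules above can be sketched as small functions; the numeric values of η(0), r, c and the piecewise table are illustrative, not prescribed by the lecture:

```python
import math

def eta_exponential(t, eta0=0.1, r=50000):
    # eta(t) = eta(0) * exp(-t / r), with r ~ training set size
    return eta0 * math.exp(-t / r)

def eta_reciprocal(t, eta0=0.1, r=50000, c=1.0):
    # eta(t) = eta(0) * (1 + t / r)^(-c), with c ~ 1
    return eta0 * (1 + t / r) ** (-c)

def eta_piecewise(epoch, table={0: 0.1, 10: 0.01, 20: 0.001}):
    # pre-determined eta for each range of epochs (illustrative values)
    return table[max(k for k in table if k <= epoch)]

print(eta_exponential(0), eta_reciprocal(0))  # both start at eta(0) = 0.1
```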

Training with Momentum

Δw_i(t) = −η D_i(t) + α Δw_i(t−1)

α ∼ 0.9 is the momentum

Weight changes start by following the gradient

After a few updates they start to have velocity – no longer pure gradient descent

The momentum term encourages the weight change to go in the previous direction

Damps the random directions of the gradients, to encourage weight changes in a consistent direction
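The momentum update can be sketched on the same toy quadratic as before; the error function and numeric values (other than the recommended α ∼ 0.9) are illustrative:

```python
import numpy as np

def grad(w):
    return w  # gradient of the toy error E(w) = 0.5 * w @ w

eta, alpha = 0.1, 0.9
w = np.array([1.0, -2.0])
delta_w = np.zeros_like(w)     # Delta w(t-1), initially zero
for t in range(300):
    # Delta w(t) = -eta * D(t) + alpha * Delta w(t-1): velocity accumulates
    delta_w = -eta * grad(w) + alpha * delta_w
    w = w + delta_w

print(np.allclose(w, 0.0, atol=1e-3))  # True: converged close to the minimum at 0
```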

Adaptive Learning Rates

Tuning learning rate (and momentum) parameters can be expensive (hyperparameter grid search) – it works, but we can do better

Adaptive learning rates and per-weight learning rates

AdaGrad – normalise the update for each weight
RMSProp – AdaGrad forces the learning rate to always decrease; this constraint is relaxed with RMSProp
Adam – "RMSProp with momentum"

Well explained by Andrej Karpathy at http://cs231n.github.io/neural-networks-3/

AdaGrad

Separate, normalised update for each weight

Normalised by the sum squared gradient S

S_i(0) = 0

S_i(t) = S_i(t−1) + D_i(t)²

Δw_i(t) = −(η / √(S_i(t) + ε)) D_i(t)

ε ∼ 10⁻⁸ is a small constant to prevent division-by-zero errors

The update step for a parameter w_i is normalised by the (square root of the) sum squared gradients for that parameter

Weights with larger gradient magnitudes will have smaller weight updates; small gradients result in larger updates

The effective learning rate for a parameter is forced to decrease monotonically, since the normalising term S_i cannot get smaller

Duchi et al, http://jmlr.org/papers/v12/duchi11a.html
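The three AdaGrad equations translate directly into a per-weight loop; the toy gradient and the value of η are illustrative:

```python
import numpy as np

def grad(w):
    return w  # toy gradient, illustrative

eta, eps = 0.5, 1e-8
w = np.array([1.0, -2.0])
S = np.zeros_like(w)                     # S_i(0) = 0
for t in range(500):
    D = grad(w)
    S = S + D ** 2                       # S_i(t) = S_i(t-1) + D_i(t)^2
    w = w - eta / np.sqrt(S + eps) * D   # per-weight normalised update

print(w)  # both weights have shrunk towards the minimum at 0
```

Because S only grows, the effective step size for each weight shrinks over time, matching the monotonic-decrease property noted above.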

RMSProp

RProp (http://dx.doi.org/10.1109/ICNN.1993.298623) is a method for batch gradient descent which uses an adaptive learning rate for each parameter and only the sign of the gradient (equivalent to normalising by the gradient)

RMSProp is a stochastic gradient descent version of RProp (Hinton, http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf), normalised by a moving average of the squared gradient – similar to AdaGrad, but replacing the sum by a moving average for S:

S_i(t) = β S_i(t−1) + (1 − β) D_i(t)²

Δw_i(t) = −(η / √(S_i(t) + ε)) D_i(t)

β ∼ 0.9 is the decay rate

Effective learning rates are no longer guaranteed to decrease
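In the same toy setting, RMSProp differs from the AdaGrad sketch only in how S is accumulated; η, ε and the gradient are illustrative, β ∼ 0.9 as above:

```python
import numpy as np

def grad(w):
    return w  # toy gradient, illustrative

eta, beta, eps = 0.01, 0.9, 1e-8
w = np.array([1.0, -2.0])
S = np.zeros_like(w)
for t in range(2000):
    D = grad(w)
    # moving average of the squared gradient instead of a running sum
    S = beta * S + (1 - beta) * D ** 2
    w = w - eta / np.sqrt(S + eps) * D

print(w)  # both weights driven towards the minimum at 0
```

Because S is a decaying average, it can shrink again when recent gradients are small, so the effective learning rate can rise as well as fall.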

Adam

Hinton commented about RMSProp: "Momentum does not help as much as it normally does"

Adam (Kingma & Ba, https://arxiv.org/abs/1412.6980) can be viewed as addressing this – it is a variant of RMSProp with momentum:

M_i(t) = α M_i(t−1) + (1 − α) D_i(t)

S_i(t) = β S_i(t−1) + (1 − β) D_i(t)²

Δw_i(t) = −(η / √(S_i(t) + ε)) M_i(t)

Here a momentum-smoothed gradient is used for the update in place of the raw gradient. Kingma and Ba recommend α ∼ 0.9, β ∼ 0.999
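The three equations above can be sketched in the same toy setting; note that this simplified form, like the slide, omits the bias-correction terms of the published Adam algorithm, and all numeric values besides the recommended α and β are illustrative:

```python
import numpy as np

def grad(w):
    return w  # toy gradient, illustrative

eta, alpha, beta, eps = 0.01, 0.9, 0.999, 1e-8
w = np.array([1.0, -2.0])
M = np.zeros_like(w)   # momentum-smoothed gradient
S = np.zeros_like(w)   # moving average of the squared gradient
for t in range(3000):
    D = grad(w)
    M = alpha * M + (1 - alpha) * D        # M_i(t)
    S = beta * S + (1 - beta) * D ** 2     # S_i(t)
    w = w - eta / np.sqrt(S + eps) * M     # smoothed gradient replaces D in the update

print(w)  # both weights driven towards the minimum at 0
```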

Coursework 1

MNIST classification, working on a standard architecture (2 hidden layers, each with 100 hidden units)

Part 1 Learning Rate Schedules – investigate either an exponential or reciprocal learning rate schedule
Part 2 Training with Momentum – investigate using a gradient descent learning rule with momentum
Part 3 Adaptive Learning Rules – implement and investigate two of AdaGrad, RMSProp, Adam

Submit a report (PDF), along with your notebook and Python code. Primarily assessed on the report, which should include:

A clear description of the methods used and algorithms implemented
Quantitative results for the experiments you carried out, including relevant graphs
Discussion of the results of your experiments and any conclusions you have drawn

Generalisation in practice

Generalization

How many hidden units (or, how many weights) do we need?

How many hidden layers do we need?

Generalization: what is the expected error on a test set?

Causes of error

Network too "flexible": too many weights compared with the number of training examples
Network not flexible enough: not enough weights (hidden units) to represent the desired mapping

When comparing models, it can be helpful to compare systems with the same number of trainable parameters (i.e. the number of trainable weights in a neural network)

Optimizing training set performance does not necessarily optimize test set performance...

Training / Test / Validation Data

Partitioning the data...

Training data – data used for training the network
Validation data – frequently used to measure the error of a network on "unseen" data (e.g. after each epoch)
Test data – less frequently used "unseen" data, ideally only used once

Frequent use of the same test data can indirectly "tune" the network to that data (e.g. by influencing the choice of hyperparameters such as learning rate, number of hidden units, number of layers, ...)

Measuring generalisation

Generalization Error – the predicted error on unseen data. How can the generalization error be estimated?

Training error?

E_train = − Σ_{n ∈ training set} Σ_{k=1..K} t^n_k ln y^n_k

Validation error?

E_val = − Σ_{n ∈ validation set} Σ_{k=1..K} t^n_k ln y^n_k
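The cross-entropy sums above can be computed in one line of NumPy; the predictions and one-hot targets below are made-up values for illustration:

```python
import numpy as np

# Cross-entropy error over a dataset, as in E_train / E_val above.
# t: one-hot targets (N x K), y: predicted class probabilities (N x K).
def cross_entropy(t, y):
    return -np.sum(t * np.log(y))

t = np.array([[1.0, 0.0], [0.0, 1.0]])  # illustrative targets
y = np.array([[0.9, 0.1], [0.2, 0.8]])  # illustrative predictions
print(cross_entropy(t, y))  # -(ln 0.9 + ln 0.8)
```

The same function measures E_train or E_val depending only on which partition of the data t and y come from.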

Cross-validation

Optimize network performance given a fixed training set

Hold out a set of data (validation set) and predict generalization performance on this set

1 Train the network in the usual way on the training data
2 Estimate the performance of the network on the validation set

If several networks are trained on the same data, choose the one that performs best on the validation set (not the training set)

n-fold Cross-validation: divide the data into n partitions; select each partition in turn to be the validation set, and train on the remaining (n − 1) partitions. Estimate generalization error by averaging over all validation sets.
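The n-fold partitioning scheme can be sketched as an index generator; the helper name and the dataset size are illustrative:

```python
import numpy as np

# n-fold cross-validation indices: each partition serves once as the
# validation set while the remaining (n - 1) form the training set.
def n_fold_splits(num_examples, n):
    indices = np.arange(num_examples)
    folds = np.array_split(indices, n)
    for i in range(n):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(n) if j != i])
        yield train, valid

# 10 examples, 5 folds: every example appears in exactly one validation fold
splits = list(n_fold_splits(10, 5))
print(len(splits))  # 5 (train, valid) pairs
```

Averaging the validation error over the n pairs gives the cross-validated estimate of generalization error.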

Overtraining

Overtraining corresponds to a network function too closely fit to the training set (too much flexibility)

Undertraining corresponds to a network function not well fit to the training set (too little flexibility)

Solutions

If possible, increase network complexity in line with the training set size
Use prior information to constrain the network function
Control the flexibility: Structural Stabilization
Control the effective flexibility: early stopping and regularization

Structural Stabilization

Directly control the number of weights:

Compare models with different numbers of hidden units

Start with a large network and reduce the number of weights by pruning individual weights or hidden units

Weight sharing — use prior knowledge to constrain the weights on a set of connections to be equal → Convolutional Neural Networks

Early Stopping

Use validation set to decide when to stop training

Training Set Error monotonically decreases as training progresses

Validation Set Error will reach a minimum then start to increase

Early Stopping

[Figure: training and validation error E plotted against training time t; the validation error reaches its minimum at t*, after which it rises while the training error continues to fall]

Early Stopping

Use validation set to decide when to stop training

Training Set Error monotonically decreases as training progresses

Validation Set Error will reach a minimum then start to increase

Best generalization predicted to be at the point of minimum validation set error

"Effective Flexibility" increases as training progresses

The network has an increasing number of "effective degrees of freedom" as training progresses

Network weights become more tuned to the training data

Very effective — used in many practical applications such as speech recognition and optical character recognition
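The early-stopping rule can be sketched as a loop that remembers the best-so-far validation error; `train_epoch`, `validation_error`, and the `patience` cut-off are hypothetical stand-ins, not part of the lecture:

```python
# Early stopping sketch: keep the weights from the epoch with the lowest
# validation error, and stop once it has not improved for `patience` epochs.
def train_with_early_stopping(train_epoch, validation_error,
                              max_epochs=100, patience=5):
    best_err, best_epoch, best_state = float("inf"), 0, None
    for epoch in range(max_epochs):
        state = train_epoch()                 # one pass over the training set
        err = validation_error(state)         # monitor "unseen" data
        if err < best_err:
            best_err, best_epoch, best_state = err, epoch, state
        elif epoch - best_epoch >= patience:  # no improvement for `patience` epochs
            break                             # stop near the minimum at t*
    return best_state, best_err

# Toy check with a validation curve that dips at epoch 4 then rises again:
errs = iter([1.0, 0.8, 0.6, 0.5, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
state, err = train_with_early_stopping(lambda: None, lambda s: next(errs))
print(err)  # 0.45, the minimum of the validation curve
```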


Summary

Learning rates and weight updates

Coursework 1

Generalisation in practice

Reading:

Michael Nielsen, chapters 2 & 3 of Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/

Andrej Karpathy, CS231n notes (Stanford), http://cs231n.github.io/neural-networks-3/

Diederik Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization", ICLR-2015, https://arxiv.org/abs/1412.6980