Learning Principles - Eindhoven University of Technology
Learning Goals
• Complete learning procedure
• Different loss functions, regularization methods, optimization methods
• Backpropagation mathematical derivation
• Other key concepts such as hyperparameters, data preprocessing, and weight initialization
Learning – Training of Deep Neural Networks
PyTorch
Classification Problem
{dog, cat, truck, plane, ...}
cat
Three key components:
- Score function: maps the input to class scores
- Loss function: evaluates the quality of the mapping
- Optimization function: updates the classifier
Score Function
Loss Function
Optimization Function
Score Function : Linear Classifier
f(x,W) = Wx + b → 10 numbers giving class scores
W, b: weights and biases
Update Score Function
Loss Function : Linear Classifier
Class scores: f(x, W)
Output quality: evaluated by the loss function
Why is a loss function needed? The optimizer requires a "hint" for the update.
Scores f(x, W) for three example images (columns: cat image, car image, frog image):
  cat:   3.2   1.3   2.2
  car:   5.1   4.9   2.5
  frog: -1.7   2.0  -3.1
Loss Function : Linear Classifier
Given a dataset of examples {(x_i, y_i)}_{i=1}^N, where x_i is an image and y_i is its (integer) label.
Single-sample loss: L_i(f(x_i, W), y_i)
Loss over the dataset: L = (1/N) Σ_i L_i(f(x_i, W), y_i)
Loss Function : Linear Classifier
- SVM Multiclass Loss (Hinge Loss)
- Softmax Regression : Softmax + Cross Entropy Loss
Common Loss Functions for Classification
SVM Multiclass Loss Formulation
Scores vector: s = f(x_i, W)
The SVM loss for example i has the form:
L_i = Σ_{j≠y_i} max(0, s_j - s_{y_i} + 1)
Loss over the full dataset is the average: L = (1/N) Σ_i L_i
Using the scores above (cat image, car image, frog image), the per-image SVM losses are 2.9, 0 and 12.9, so L = (2.9 + 0 + 12.9)/3 ≈ 5.27
SVM Multiclass Loss Example
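A minimal sketch computing this loss for the example scores (assumed example code, not from the slides; rows are images, columns are class scores):

```python
import numpy as np

# Multiclass SVM (hinge) loss for the three example images above.
# Rows = images, columns = class scores in the order [cat, car, frog].
scores = np.array([[3.2, 5.1, -1.7],   # cat image
                   [1.3, 4.9,  2.0],   # car image
                   [2.2, 2.5, -3.1]])  # frog image
labels = np.array([0, 1, 2])            # correct class index per image

def svm_loss(scores, labels, delta=1.0):
    N = scores.shape[0]
    correct = scores[np.arange(N), labels][:, None]      # s_{y_i}
    margins = np.maximum(0.0, scores - correct + delta)  # max(0, s_j - s_{y_i} + delta)
    margins[np.arange(N), labels] = 0.0                   # skip j == y_i
    return margins.sum(axis=1)                            # per-image losses

per_image = svm_loss(scores, labels)
print(per_image, per_image.mean())   # [2.9, 0.0, 12.9], mean ≈ 5.27
```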
Suppose that we found a W such that L = 0. Is this W unique?
No! 2W also gives L = 0! (This motivates adding a regularization term.)
Regularization
Full loss: L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
Data loss: model predictions should match the training labels
Regularization loss: prevents overfitting by penalizing large weights
λ = regularization strength (a hyperparameter)
Common choices for R(W):
  L2 regularization: R(W) = Σ_k Σ_l W_{k,l}²
  L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
  Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)
Softmax Classifier & Cross Entropy Loss
Formulated as follows:
- “Softmax” function converts scores to probabilities in last layer
- Loss is cross entropy between class probabilities and ground truth
Example (cat image): the class scores [cat, car, frog] = [3.2, 5.1, -1.7] are unnormalized log-probabilities / logits.
We want to interpret raw classifier scores as probabilities → Softmax function.
Probabilities must be >= 0: apply exp → [24.5, 164.0, 0.18] (unnormalized probabilities)
Probabilities must sum to 1: normalize → [0.13, 0.87, 0.00] (probabilities)
Softmax Classifier
Scores vector: s = f(x_i, W)
The Softmax classifier interprets the scores as class probabilities.
Softmax function: P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j}
Cross-entropy loss: L_i = -log P(Y = y_i | X = x_i)
Final loss: L = (1/N) Σ_i L_i
(The cross entropy between the one-hot ground-truth distribution and the predicted distribution differs from their Kullback-Leibler divergence only by the entropy of the ground truth, which is 0 for a one-hot target.)
Softmax Classifier & Cross Entropy Loss
Example (cat image): logits [3.2, 5.1, -1.7] → exp → [24.5, 164.0, 0.18] → normalize → probabilities [0.13, 0.87, 0.00]
Correct (one-hot) probabilities: [1.00, 0.00, 0.00]
Cross-entropy loss: L_i = -log(0.13) = 2.04
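A minimal sketch of the softmax + cross-entropy computation for the cat-image logits above (assumed example code, not from the slides):

```python
import numpy as np

logits = np.array([3.2, 5.1, -1.7])   # scores for [cat, car, frog]
y = 0                                  # correct class: cat

def softmax(s):
    s = s - s.max()                    # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum()

probs = softmax(logits)                # ≈ [0.13, 0.87, 0.00]
loss = -np.log(probs[y])               # ≈ 2.04
print(probs, loss)
```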
Softmax vs. SVM
- We have a dataset of (x, y) pairs
- We have a score function: s = f(x; W) = Wx
- We have a loss function:
  Softmax: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
  SVM: L_i = Σ_{j≠y_i} max(0, s_j - s_{y_i} + 1)
  Generalized full loss: L = (1/N) Σ_i L_i + R(W)
Summary – Loss Functions
Loss is a function of W
- How can we find the best W for our score function?
Optimization
Naïve approach: random search (randomize W, keep the best)
Improves over time, but very slow
Another approach : Local random search
Optimization
Add a random perturbation: W' = W + δW
Better than weight randomization. But expensive and slow to converge.
Another approach : Following the gradient
Optimization
Derivative of a 1-dimensional function: df(x)/dx = lim_{h→0} [f(x + h) - f(x)] / h
In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.
The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.
Numerical gradient example:
current W: [0.34, -1.11, 0.78, …], loss 1.25347, gradient dL/dW = [?, ?, ?, …]
W + step in the first dimension: [0.34 + 0.0001, -1.11, 0.78, …], loss 1.25322
→ first gradient component: (1.25322 - 1.25347) / 0.0001 = -2.5
W + step in the second dimension: [0.34, -1.11 + 0.0001, 0.78, …], loss 1.25353
→ second gradient component: (1.25353 - 1.25347) / 0.0001 = 0.6
Repeating this for every dimension fills in the gradient [-2.5, 0.6, ?, …]
Optimization – Following gradient
Numerical evaluation of the gradient:
- Very slow
- Needs a separate evaluation for every dimension of W
- Only approximate
Optimization – Following gradient
Better: derive the analytic gradient as a function of W
- Exact
- Fast, once the expression is derived
- Error-prone; always verify with a gradient check against the numerical gradient
Optimization – Following gradient
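A minimal sketch of a numerical gradient and a gradient check against a toy analytic gradient (the helper names and toy loss are assumptions, not from the slides):

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    grad = np.zeros_like(W)
    flat_W, flat_g = W.reshape(-1), grad.reshape(-1)
    for i in range(flat_W.size):            # one loss evaluation per dimension -> slow
        old = flat_W[i]
        flat_W[i] = old + h
        loss_plus = loss_fn(W)
        flat_W[i] = old - h
        loss_minus = loss_fn(W)
        flat_W[i] = old
        flat_g[i] = (loss_plus - loss_minus) / (2 * h)   # centered difference
    return grad

# Gradient check against a toy loss whose analytic gradient is simply W.
W = np.random.randn(3, 4)
loss_fn = lambda W: 0.5 * np.sum(W ** 2)
num_grad = numerical_gradient(loss_fn, W)
rel_error = np.abs(num_grad - W).max() / np.abs(W).max()
print(rel_error)                             # should be tiny, ~1e-8 .. 1e-6
```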
Numerical evaluation was simple in the previous example. How about deriving the analytic gradient of all weights for this network?
Computational graph: the input x and the weights W feed a multiply node (*) that produces the scores s; the hinge loss on s and the regularization term R(W) are summed (+) to give the total loss L.
Better Idea: Computational graphs + Backpropagation
Backpropagation: a simple example
f(x, y, z) = (x + y) z, e.g. x = -2, y = 5, z = -4
Forward pass: q = x + y = 3, f = q·z = -12
Want: ∂f/∂x, ∂f/∂y, ∂f/∂z
Local gradients: ∂q/∂x = 1, ∂q/∂y = 1, ∂f/∂q = z, ∂f/∂z = q
Chain rule: ∂f/∂x = (∂f/∂q)·(∂q/∂x), i.e. [upstream gradient] × [local gradient]
Backward pass: ∂f/∂f = 1; ∂f/∂z = q = 3; ∂f/∂q = z = -4; ∂f/∂x = -4·1 = -4; ∂f/∂y = -4·1 = -4
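A quick check of the hand-computed gradients with PyTorch autograd (assuming the f(x, y, z) = (x + y)·z example above):

```python
import torch

x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

q = x + y            # forward: q = 3
f = q * z            # forward: f = -12
f.backward()         # backward pass through the computational graph

print(x.grad, y.grad, z.grad)   # tensor(-4.), tensor(-4.), tensor(3.)
```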
In general, each node f in the computational graph receives an upstream gradient (how much the node's output influences the loss), multiplies it by its local gradients (how much each input influences the node's output), and passes the resulting downstream gradients on to its inputs.
Another example: f(w, x) = 1 / (1 + e^(-(w0·x0 + w1·x1 + w2)))
Working backward node by node, every step is [upstream gradient] × [local gradient]. At an add node the upstream gradient passes unchanged to both inputs, e.g. [0.2] × [1] = 0.2 (both inputs!). At a multiply node the inputs swap roles, e.g. x0: [0.2] × [2] = 0.4 and w0: [0.2] × [-1] = -0.2.
Sigmoid: the final chain of nodes computes the sigmoid function σ(x) = 1 / (1 + e^(-x)), whose local gradient has the convenient form dσ(x)/dx = (1 - σ(x))·σ(x).
Collapsing those nodes into a single sigmoid gate: [upstream gradient] × [local gradient] = [1.00] × [(1 - 0.73)·(0.73)] ≈ 0.2
The computational graph representation may not be unique. Choose one where the local gradients at each node can be easily expressed!
Patterns in gradient flow
- add gate: gradient distributor. E.g. inputs 3 and 4, output 7; the upstream gradient 2 is passed unchanged to both inputs.
- mul gate: "swap multiplier". E.g. inputs 2 and 3, output 6; with upstream gradient 5, the input 2 receives 5·3 = 15 and the input 3 receives 5·2 = 10.
- copy gate: gradient adder. E.g. the value 7 is copied to two branches; the upstream gradients 4 and 2 are summed: 4 + 2 = 6.
- max gate: gradient router. E.g. inputs 4 and 5, output 5; the upstream gradient 9 is routed entirely to the larger input (5), and 0 to the other.
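A quick autograd check of these gate patterns, using the same example numbers (an assumed demo, not from the slides):

```python
import torch

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(4.0, requires_grad=True)
(a + b).backward(torch.tensor(2.0))              # add gate, upstream gradient 2
print(a.grad, b.grad)                             # 2, 2  (distributor)

c = torch.tensor(2.0, requires_grad=True)
d = torch.tensor(3.0, requires_grad=True)
(c * d).backward(torch.tensor(5.0))              # mul gate, upstream gradient 5
print(c.grad, d.grad)                             # 15, 10 (swap multiplier)

e = torch.tensor(4.0, requires_grad=True)
f = torch.tensor(5.0, requires_grad=True)
torch.maximum(e, f).backward(torch.tensor(9.0))  # max gate, upstream gradient 9
print(e.grad, f.grad)                             # 0, 9  (router)
```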
Backprop Implementation: "flat" code
Forward pass: compute the output, storing the intermediate values of every node.
Backward pass: compute the gradients by revisiting the same nodes in reverse order, starting from the base case dL/dL = 1 and multiplying the upstream gradient by the local gradient of each gate (sigmoid, add, multiply) along the way.
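A minimal "flat" sketch of this forward/backward pattern for the sigmoid example above; the concrete weight and input values are assumed for illustration:

```python
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0   # assumed example values

# Forward pass: compute output, keeping intermediates
s0 = w0 * x0          # multiply gate
s1 = w1 * x1          # multiply gate
s2 = s0 + s1          # add gate
s3 = s2 + w2          # add gate
f  = 1.0 / (1.0 + math.exp(-s3))   # sigmoid gate, f ≈ 0.73

# Backward pass: compute grads in reverse order
df  = 1.0                       # base case
ds3 = df * (1 - f) * f          # sigmoid local gradient
ds2 = ds3                       # add gate distributes
dw2 = ds3
ds0 = ds2                       # add gate distributes
ds1 = ds2
dw0 = ds0 * x0                  # multiply gate swaps inputs
dx0 = ds0 * w0
dw1 = ds1 * x1
dx1 = ds1 * w1
print(f, dw0, dx0)              # ≈ 0.73, -0.2, 0.4
```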
Backprop Implementation: Modularized API
Graph (or Net) object (rough pseudo-code): forward() runs the nodes in topological order, backward() runs them in reverse order.
Example node: z = x * y (x, y, z are scalars)
Modularized implementation: forward / backward API
Need to stash some values for use in backward
Gate / Node / Function object (the actual PyTorch code follows this pattern): backward() takes the upstream gradient and multiplies it by the stashed local gradients.
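A rough sketch of such a gate object with a forward/backward API (pseudo-code in the spirit of the slides, not the lecture's exact code):

```python
class MultiplyGate:
    def forward(self, x, y):
        # Stash the inputs: they are the local gradients needed in backward
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # [downstream] = [upstream dz] * [local gradient], with the inputs swapped
        dx = dz * self.y
        dy = dz * self.x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(2.0, 3.0)      # 6.0
dx, dy = gate.backward(5.0)     # 15.0, 10.0 -- matches the mul-gate pattern above
```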
Example: PyTorch operators follow the same pattern. The PyTorch sigmoid layer, for instance, defines a forward function and a matching backward function that reuses the stored output.
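As a hedged illustration of the same pattern using PyTorch's public torch.autograd.Function API (not PyTorch's internal implementation):

```python
import torch

class Sigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = 1.0 / (1.0 + torch.exp(-x))
        ctx.save_for_backward(y)             # stash for use in backward
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * (1.0 - y) * y   # upstream * local gradient

x = torch.randn(4, requires_grad=True)
Sigmoid.apply(x).sum().backward()
print(torch.allclose(x.grad, torch.sigmoid(x) * (1 - torch.sigmoid(x))))  # True
```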
So far: backprop with scalars
What about vector-valued functions?
Vector derivatives
Scalar to scalar: regular derivative dy/dx ∈ R. If x changes by a small amount, how much will y change?
Vector to scalar: the derivative is the gradient dy/dx ∈ R^N. For each element of x, if it changes by a small amount, how much will y change?
Vector to vector: the derivative is the Jacobian dy/dx ∈ R^(N×M). For each element of x, if it changes by a small amount, how much will each element of y change?
Inputs x ∈ R^Dx and y ∈ R^Dy enter a node f that produces z ∈ R^Dz; the loss L is still a scalar!
Upstream gradient: dL/dz, a vector of size Dz. For each element of z, how much does it influence L?
Local gradients: Jacobian matrices dz/dx of size [Dx × Dz] and dz/dy of size [Dy × Dz].
Downstream gradients (matrix-vector multiply): dL/dx = (dz/dx)(dL/dz) of size Dx, and dL/dy = (dz/dy)(dL/dz) of size Dy.
Backprop with Vectors
Example: f(x) = max(0, x) (elementwise ReLU)
4D input x: [ 1, -2, 3, -1 ] → 4D output y: [ 1, 0, 3, 0 ]
Upstream gradient, 4D dL/dy: [ 4, -1, 5, 9 ]
Jacobian dy/dx (1 on the diagonal where x > 0, 0 elsewhere):
[ 1 0 0 0 ]
[ 0 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 0 ]
4D dL/dx = [dy/dx][dL/dy] = [ 4, 0, 5, 0 ]
The Jacobian is sparse: off-diagonal entries are always zero! Never explicitly form the Jacobian; instead use implicit multiplication.
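A minimal sketch of this implicit multiplication for the ReLU example (assumed example code):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
y = np.maximum(0.0, x)                  # forward: [1, 0, 3, 0]

dL_dy = np.array([4.0, -1.0, 5.0, 9.0]) # upstream gradient
dL_dx = dL_dy * (x > 0)                 # zero out entries where the input was <= 0
print(dL_dx)                            # [4, 0, 5, 0]
```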
Backprop with Matrices (or Tensors)
Inputs are now matrices x of shape [Dx×Mx] and y of shape [Dy×My]; the node f produces z of shape [Dz×Mz]; the loss L is still a scalar!
Upstream gradient: dL/dz of shape [Dz×Mz]. For each element of z, how much does it influence L?
Local gradients are generalized Jacobian matrices, e.g. dz/dx of shape [(Dx×Mx)×(Dz×Mz)] and dz/dy of shape [(Dy×My)×(Dz×Mz)]. For each element of x (or y), how much does it influence each element of z?
Downstream gradients follow from the matrix-vector multiply with these Jacobians: dL/dx has shape [Dx×Mx] and dL/dy has shape [Dy×My].
dL/dx always has the same shape as x!
Backprop with Matrices: matrix multiply y = xw
x: [N×D] = [ 2 1 -3 ; -3 4 2 ]
w: [D×M] = [ 3 2 1 -1 ; 2 1 3 2 ; 3 2 1 -2 ]
y: [N×M] = [ 13 9 -2 -6 ; 5 2 17 1 ]
Upstream gradient dL/dy: [N×M] = [ 2 3 -3 9 ; -8 1 4 6 ]
Explicit Jacobians would be dy/dx: [(N×D)×(N×M)] and dy/dw: [(D×M)×(N×M)]. For a neural net we may have N = 64 and D = M = 4096: each Jacobian takes 256 GB of memory! We must work with them implicitly.
Q: What parts of y are affected by one element of x?
A: x_{n,d} affects the whole row y_{n,·}
Q: How much does x_{n,d} affect y_{n,m}?
A: by w_{d,m}
Putting this together: dL/dx = (dL/dy) wᵀ, with shapes [N×D] = [N×M][M×D]
By similar logic: dL/dw = xᵀ (dL/dy), with shapes [D×M] = [D×N][N×M]
Backprop with Matrices
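A minimal sketch of these implicit formulas (assumed example code):

```python
import numpy as np

N, D, M = 2, 3, 4
x = np.random.randn(N, D)
w = np.random.randn(D, M)

y = x @ w                       # forward: [N×M]
dL_dy = np.random.randn(N, M)   # upstream gradient, same shape as y

dL_dx = dL_dy @ w.T             # [N×D] = [N×M][M×D]
dL_dw = x.T @ dL_dy             # [D×M] = [D×N][N×M]
assert dL_dx.shape == x.shape and dL_dw.shape == w.shape   # grads match input shapes
```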
Optimization - Gradient Descent (GD)
The most common optimization algorithm in deep learning:
- First-order optimization: uses only first derivatives
- The step size taken along the negative gradient is the learning rate
- Variants: Batch, Minibatch and Stochastic GD (all samples, a batch of samples, or a single sample per update)
Batch Gradient Descent
Uses all samples (the entire dataset) for each loss and gradient calculation:
- Expensive, but less noisy
- Stable loss gradient
- Can converge to a local minimum
Minibatch Gradient Descent
The full sum is expensive when N is large!
Approximate the sum using a minibatch of examples
Commonly uses minibatches of 32, 64 or 128 samples
Stochastic Gradient Descent (SGD)
Weights are updated after every sample:
- Minibatch gradient descent with batch size one
- Less commonly used, as vectorized operations over minibatches are more efficient
- In practice, you can assume most SGD implementations use minibatches!
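A rough sketch of the overall loop (the loss_and_grad and data-sampling helpers are hypothetical, not from the slides):

```python
import numpy as np

def sgd(W, data, loss_and_grad, lr=1e-3, batch_size=64, num_steps=1000):
    for _ in range(num_steps):
        idx = np.random.choice(len(data), batch_size)        # sample a minibatch
        loss, dW = loss_and_grad(W, [data[i] for i in idx])  # loss and gradient on the batch
        W = W - lr * dW                                      # step along the negative gradient
    return W
```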
SGD
- Build up "velocity" as a running mean of gradients
- Rho gives "friction"; typically rho = 0.9 or 0.99
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
SGD+Momentum
Momentum update: the actual step combines the gradient at the current point with the velocity.
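A minimal sketch of the momentum update (grad_fn is a hypothetical gradient function):

```python
import numpy as np

def sgd_momentum(W, grad_fn, lr=1e-2, rho=0.9, num_steps=1000):
    v = np.zeros_like(W)
    for _ in range(num_steps):
        dW = grad_fn(W)
        v = rho * v + dW        # build up velocity as a running mean of gradients
        W = W - lr * v          # step along the negative velocity
    return W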
Nesterov Momentum
Plain momentum combines the gradient at the current point with the velocity to get the step used to update the weights; Nesterov momentum instead evaluates the gradient at the point the velocity would carry us to.
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983 Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction
Nesterov Momentum
We want the update expressed in terms of the look-ahead point; change of variables and rearrange.
Nesterov Momentum
Gradient Descent - Challenges
- First-order: curvature information is not used
- Struggles near saddle points
- No independent parameter updates: the same learning rate is applied to all parameters
- AdaGrad adds element-wise scaling of the gradient based on the historical sum of squared gradients in each dimension
- This gives "per-parameter learning rates" or "adaptive learning rates"
Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
AdaGrad
RMSProp ("leaky AdaGrad"): replace AdaGrad's running sum of squared gradients with an exponentially decaying average (Tieleman and Hinton, 2012)
Kingma and Ba, “Adam: Amethod for stochastic optimization”, ICLR 2015
Adam combines:
- Momentum: a first-moment estimate (running average) of the gradients
- AdaGrad / RMSProp: a second-moment estimate (decaying average of squared gradients)
- Bias correction, for the fact that the first and second moment estimates start at zero
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
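A minimal sketch of the Adam update combining these three pieces (grad_fn is a hypothetical gradient function):

```python
import numpy as np

def adam(W, grad_fn, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=1000):
    m = np.zeros_like(W)            # first moment (momentum)
    v = np.zeros_like(W)            # second moment (AdaGrad / RMSProp style)
    for t in range(1, num_steps + 1):
        dW = grad_fn(W)
        m = beta1 * m + (1 - beta1) * dW
        v = beta2 * v + (1 - beta2) * dW * dW
        m_hat = m / (1 - beta1 ** t)     # bias correction: the moments start at zero
        v_hat = v / (1 - beta2 ** t)
        W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W
```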
Adam
Comparison of optimizer trajectories (SGD, SGD+Momentum, RMSProp, Adam), including their saddle point behaviour.
Recall Regularization
In common use:
- L2 regularization (weight decay)
- L1 regularization
- Elastic net (L1 + L2)
Regularization: Dropout
In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR2014
Regularization: Dropout. Example forward pass with a 3-layer network using dropout (sketch below).
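A minimal sketch of such a forward pass (assumed example code, in the spirit of the slide):

```python
import numpy as np

p_drop = 0.5   # probability of dropping a unit (hyperparameter)

def train_forward(X, W1, W2, W3):
    H1 = np.maximum(0, X @ W1)
    U1 = np.random.rand(*H1.shape) >= p_drop   # keep each unit with prob 1 - p_drop
    H1 = H1 * U1                               # drop!
    H2 = np.maximum(0, H1 @ W2)
    U2 = np.random.rand(*H2.shape) >= p_drop   # second dropout mask
    H2 = H2 * U2                               # drop!
    return H2 @ W3                             # output scores
```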
Dropout: Test time
Dropout makes our output random! y = f_W(x, z), where x is the input (image), y the output (label), and z a random mask.
We want to "average out" the randomness at test time: y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
Dropout: Test time
Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1·x + w2·y
During training (each input dropped with probability 1/2) we have:
E[a] = ¼(w1·x + w2·y) + ¼(w1·x + 0·y) + ¼(0·x + w2·y) + ¼(0·x + 0·y) = ½(w1·x + w2·y)
At test time, multiply by the dropout probability (here ½) so that the test-time output matches the expected training-time output.
Dropout: Test time
At test time all neurons are always active => we must scale the activations so that, for each neuron: output at test time = expected output at training time.
Dropout Summary
drop in forward pass
scale at test time
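A minimal sketch of the matching test-time pass (assumed example code): all neurons stay active and activations are scaled by the keep probability.

```python
import numpy as np

def test_forward(X, W1, W2, W3, keep_prob=0.5):
    H1 = np.maximum(0, X @ W1) * keep_prob   # scale at test time
    H2 = np.maximum(0, H1 @ W2) * keep_prob
    return H2 @ W3
```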
Regularization: A common pattern
Training: add some kind of randomness
Testing: average out the randomness (sometimes approximately)
Example: Batch Normalization
Training: normalize using statistics from random minibatches
Testing: use fixed statistics to normalize
Regularization: Data Augmentation
Pipeline: load an image and its label ("cat") → randomly transform the image → feed it through the CNN and compute the loss.
Data Augmentation : Horizontal Flips
Data Augmentation: Random crops and scales
Training: sample random crops / scales. ResNet:
1. Pick a random L in the range [256, 480]
2. Resize the training image so its short side = L
3. Sample a random 224 x 224 patch
Testing: average over a fixed set of crops. ResNet:
1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 crops of 224 x 224: 4 corners + center, plus flips
Data Augmentation: Color Jitter
Simple: randomize contrast and brightness
More complex:
1. Apply PCA to all [R, G, B] pixel values in the training set
2. Sample a "color offset" along the principal component directions
3. Add the offset to all pixels of a training image
(As seen in [Krizhevsky et al. 2012], ResNet, etc.)
Regularization: A common pattern
Training: add random noise
Testing: marginalize over the noise
Examples: Dropout, Batch Normalization, Data Augmentation
Regularization: DropConnect
Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013
Training: drop connections between neurons (set weights to 0)
Testing: use all the connections
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect
Regularization: Fractional Pooling
Training: use randomized pooling regions
Testing: average predictions from several regions
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling
Graham, “Fractional Max Pooling”, arXiv 2014
Regularization: Stochastic Depth
Training: skip some layers in the network
Testing: use all the layers
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
Regularization: Cutout
Training: set random image regions to zero
Testing: use the full image
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout
DeVries and Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout”, arXiv 2017
Works very well for small datasets like CIFAR, less common for large datasets like ImageNet
Regularization: Mixup
Training: train on random blends of images
Testing: use the original images
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup
Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018
Randomly blend the pixels of pairs of training images,e.g. 40% cat, 60% dog
CNN target label: cat: 0.4, dog: 0.6
Regularization summary
Training: add random noise
Testing: marginalize over the noise
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup
- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Try to train to 100% training accuracy on a small sample of training data (~5-10 minibatches); fiddle with architecture, learning rate, weight initialization
Loss not going down? LR too low, bad initialization
Loss explodes to Inf or NaN? LR too high, bad initialization
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Use the architecture from the previous step, use all training data, turn on small weight decay, find a learning rate that makes the loss drop significantly within ~100 iterations
Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Choose a few values of learning rate and weight decay around what worked from Step 3, train a few models for ~1-5 epochs.
Good weight decay to try: 1e-4, 1e-5, 0
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Pick best models from Step 4, train them for longer (~10-20 epochs) without learning rate decay
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss curves
Losses may be noisy, use a scatter plot and also plot moving average to see trends better
Loss curves: plot the training loss and the train / val accuracy over time
- Bad initialization: the loss barely moves at the start of training
- Loss plateaus: try learning rate decay
- Learning rate step decay: if the loss was still going down when the learning rate dropped, you decayed too early!
- Train accuracy still going up: you need to train longer
- Huge train / val accuracy gap means overfitting! Increase regularization, get more data
- No gap between train / val accuracy means underfitting: train longer, use a bigger model
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss curves
Step 7: GOTO step 5
(Assume X [NxD] is data matrix, each example in a row)
Data Preprocessing
Data Preprocessing
In practice, you may also see PCA (data has a diagonal covariance matrix) and whitening (covariance matrix is the identity matrix) of the data.
Data Preprocessing
Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize
After normalization: less sensitive to small changes in weights; easier to optimize
In practice for images: center only. E.g. consider CIFAR-10 with [32,32,3] images:
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract the per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
- Subtract the per-channel mean and divide by the per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)
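A minimal sketch of these options (assumed example code, shapes chosen to match CIFAR-10):

```python
import numpy as np

X = np.random.rand(50000, 32, 32, 3).astype(np.float32)   # stand-in for the training set

mean_image = X.mean(axis=0)                       # [32, 32, 3] array (AlexNet style)
X_centered = X - mean_image

channel_mean = X.mean(axis=(0, 1, 2))             # 3 numbers (VGGNet style)
channel_std = X.std(axis=(0, 1, 2))               # 3 numbers
X_normalized = (X - channel_mean) / channel_std   # ResNet style
```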
Weight Initialization
- Q: What happens when a constant initialization (W = const) is used?
Weight Initialization
- First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
Works for small networks, but causes problems with deeper networks.
Weight Initialization
Weight Initialization: Activation statistics
Forward pass for a 6-layer net with hidden size 4096 (small random weights, std 0.01):
All activations tend to zero for the deeper layers of the network.
Q: What do the gradients dL/dW look like?
A: All zero, no learning.
Weight Initialization: Activation statistics
Increase the std of the initial weights from 0.01 to 0.05:
All activations saturate.
Q: What do the gradients look like?
A: Local gradients are all zero, no learning.
Weight Initialization: "Xavier" Initialization
"Xavier" initialization: std = 1/sqrt(Din)
Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010
Activations are nicely scaled for all layers!
For conv layers, Din is kernel_size² * input_channels
Derivation: y = Wx, h = f(y)
Var(y_i) = Din * Var(x_i w_i)                          [assume x, w are iid]
         = Din * (E[x_i²] E[w_i²] - E[x_i]² E[w_i]²)   [assume x, w independent]
         = Din * Var(x_i) * Var(w_i)                   [assume x, w are zero-mean]
If Var(w_i) = 1/Din, then Var(y_i) = Var(x_i).
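A minimal sketch of this initialization (assumed example code):

```python
import numpy as np

# "Xavier" initialization for a fully-connected layer: std = 1/sqrt(Din)
Din, Dout = 4096, 4096
W_fc = np.random.randn(Din, Dout) / np.sqrt(Din)

# For a conv layer, Din = kernel_size^2 * input_channels
kernel_size, in_channels, out_channels = 3, 64, 128
fan_in = kernel_size ** 2 * in_channels
W_conv = np.random.randn(out_channels, in_channels, kernel_size, kernel_size) / np.sqrt(fan_in)
```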
Batch Normalization
Batch Normalization [Ioffe and Szegedy, 2015]
Consider a batch of activations at some layer. To make each dimension zero-mean and unit-variance, apply:
x̂ = (x - E[x]) / sqrt(Var[x])
This is a vanilla differentiable function...
Input x: shape N x D; per-channel mean: shape D; per-channel variance: shape D; normalized x̂: shape N x D.
Batch Normalization [Ioffe and Szegedy, 2015]
X has shape N x D: the mean and variance are computed over the batch dimension N, separately for each of the D channels.
Problem: is zero-mean, unit variance too hard of a constraint?
Solution: add learnable scale and shift parameters γ and β (each of shape D).
Output: y = γ·x̂ + β, shape N x D.
Learning γ = σ and β = μ will recover the identity function!
Estimates depend on minibatch; can’t do this at test-time!
Batch Normalization: Test-Time
The learnable scale and shift parameters γ, β are kept, but the per-channel mean and variance are replaced by (running) averages of the values seen during training.
Output: y = γ·(x - μ_running)/σ_running + β, shape N x D.
During testing batchnorm becomes a linear operator! Can be fused with the previous fully-connected or conv layer
Batch Normalization for ConvNets
Batch Normalization for fully-connected networks: x: N × D; normalize over N; μ, σ: 1 × D; γ, β: 1 × D; y = γ(x-μ)/σ + β
Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D): x: N × C × H × W; normalize over N, H, W; μ, σ: 1 × C × 1 × 1; γ, β: 1 × C × 1 × 1; y = γ(x-μ)/σ + β
Batch Normalization [Ioffe and Szegedy, 2015]
... → FC → BN → tanh → FC → BN → tanh → ...
Usually inserted after fully-connected or convolutional layers, and before the nonlinearity.
Batch Normalization [Ioffe and Szegedy, 2015]
- Makes deep networks much easier to train!
- Improves gradient flow
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test time: can be fused with the preceding conv layer!
- Behaves differently during training and testing: this is a very common source of bugs!
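A minimal usage sketch with standard PyTorch modules (an assumed example, not from the slides):

```python
import torch.nn as nn

# Batchnorm inserted after the linear/conv layer and before the nonlinearity.
fc_net = nn.Sequential(
    nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.Tanh(),
    nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.Tanh(),
)

conv_net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
)

# Training and testing behave differently (batch stats vs. running stats) --
# a very common source of bugs; switch modes explicitly:
fc_net.train()   # use minibatch statistics
fc_net.eval()    # use running-average statistics
```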
Layer Normalization
Batch Normalization for fully-connected networks: x: N × D; normalize over N; μ, σ: 1 × D; γ, β: 1 × D; y = γ(x-μ)/σ + β
Layer Normalization for fully-connected networks: x: N × D; normalize over D; μ, σ: N × 1; γ, β: 1 × D; y = γ(x-μ)/σ + β
Same behavior at train and test! Can be used in recurrent networks.
Ba, Kiros, and Hinton, "Layer Normalization", arXiv 2016
Instance Normalization
Batch Normalization for convolutional networks: x: N × C × H × W; normalize over N, H, W; μ, σ: 1 × C × 1 × 1; γ, β: 1 × C × 1 × 1; y = γ(x-μ)/σ + β
Instance Normalization for convolutional networks: x: N × C × H × W; normalize over H, W; μ, σ: N × C × 1 × 1; γ, β: 1 × C × 1 × 1; y = γ(x-μ)/σ + β
Same behavior at train / test!
Ulyanov et al, "Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis", CVPR 2017
Comparison of Normalization Layers
Group Normalization (Wu and He, "Group Normalization", ECCV 2018)