
Berk Ulker, [email protected]
TU Eindhoven, 2019

Intelligent Architectures (5LIL0)

Learning Principles


Learning Goals

• Complete learning procedure

• Different loss functions, regularization methods, optimization methods

• Backpropagation mathematical derivation

• Other key concepts such as hyperparameters, data preprocessing, and weight initialization


Learning – Training of Deep Neural Networks

PyTorch


Classification Problem

{dog, cat, truck, plane, ...}

cat

Three key components:

- Score function: maps the input to class scores
- Loss function: evaluates the quality of the mapping
- Optimization: updates the classifier to reduce the loss

Score Function: Linear Classifier

f(x, W) = Wx + b : 10 numbers giving class scores

W, b: weights and biases
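As an illustration (not the course's own code), the score function can be written in a few lines of PyTorch; the shapes used below, 3072-dimensional flattened CIFAR-10 inputs and 10 classes, are assumptions for the sketch.

import torch

def f(x, W, b):
    # f(x, W) = Wx + b: maps a flattened image to one score per class
    return W @ x + b

x = torch.randn(3072)             # assumed: flattened 32x32x3 CIFAR-10 image
W = 0.01 * torch.randn(10, 3072)  # assumed: 10 classes
b = torch.zeros(10)
scores = f(x, W, b)               # 10 numbers giving class scores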

Score Function: Linear Classifier

f(x, W) = Wx + b: during learning, the optimizer repeatedly updates the score function's parameters W and b.

Loss Function: Linear Classifier

The class scores f(x, W) are fed to a loss function, which measures the output quality.

Why is a loss function needed? The optimizer requires a "hint" (a quantitative measure of how good the current mapping is) to decide how to update the classifier.

Loss Function: Linear Classifier

Scores from f(x, W) for three example images (correct classes: cat, car, frog):

        image 1   image 2   image 3
cat       3.2       1.3       2.2
car       5.1       4.9       2.5
frog     -1.7       2.0      -3.1

Loss Function: Linear Classifier

Given a dataset of examples {(x_i, y_i)}, i = 1, ..., N, where x_i is an image and y_i its label, the loss over the dataset is the average of the single-sample losses:

L = (1/N) * sum_i L_i( f(x_i, W), y_i )

Common Loss Functions for Classification

- SVM multiclass loss (hinge loss)
- Softmax regression: softmax + cross-entropy loss

SVM Multiclass Loss Formulation

Given the scores vector s = f(x_i, W), the SVM loss for sample i has the form:

L_i = sum over j != y_i of max(0, s_j - s_{y_i} + 1)

SVM Multiclass Loss Example

Applying L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1) to the scores above gives losses 2.9, 0 and 12.9 for the three images.

The loss over the full dataset is the average: L = (2.9 + 0 + 12.9) / 3 = 5.27
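A minimal sketch that reproduces this example numerically; the score matrix and labels are taken from the slide, and the margin of 1 is the standard choice used in the formula above.

import numpy as np

scores = np.array([[3.2, 1.3, 2.2],     # cat row
                   [5.1, 4.9, 2.5],     # car row
                   [-1.7, 2.0, -3.1]])  # frog row
correct = [0, 1, 2]   # correct class index for image 1, 2, 3

def svm_loss(s, y):
    # multiclass hinge loss for one column of scores with correct class y
    margins = np.maximum(0, s - s[y] + 1.0)
    margins[y] = 0.0
    return margins.sum()

losses = [svm_loss(scores[:, i], correct[i]) for i in range(3)]
print(losses)            # ≈ [2.9, 0.0, 12.9]
print(np.mean(losses))   # ≈ 5.27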

SVM Multiclass Loss

Suppose that we found a W such that L = 0. Is this W unique?

No! 2W also has L = 0: scaling W scales all score differences, so margins that already exceed the threshold keep doing so.

Regularization

L(W) = (1/N) * sum_i L_i( f(x_i, W), y_i )  +  lambda * R(W)

Data loss: measures how well the model predictions fit the training data.
Regularization loss: prevents overfitting by penalizing large weights; lambda is the regularization strength (a hyperparameter).

Common choices of R(W):
- L2 regularization: R(W) = sum_{k,l} W_{k,l}^2
- L1 regularization: R(W) = sum_{k,l} |W_{k,l}|
- Elastic net (L1 + L2): R(W) = sum_{k,l} ( beta * W_{k,l}^2 + |W_{k,l}| )

Softmax Classifier & Cross-Entropy Loss

Formulated as follows:

- The "softmax" function converts the scores of the last layer into probabilities
- The loss is the cross entropy between the predicted class probabilities and the ground truth

Softmax Classifier

We want to interpret the raw classifier scores as probabilities; the softmax function does this in two steps. For one image (correct class: cat):

unnormalized log-probabilities / logits:   cat 3.2,  car 5.1,  frog -1.7
after exp (unnormalized probabilities):    24.5,  164.0,  0.18
after normalization (probabilities):       0.13,  0.87,  0.00

Exponentiation makes all values positive (probabilities must be >= 0); normalization makes them sum to 1.

(Example from the slides of Fei-Fei Li, Justin Johnson & Serena Yeung.)

Softmax Classifier & Cross-Entropy Loss

Given the scores vector s = f(x_i, W), the softmax classifier interprets the scores as class probabilities:

Softmax function:       P(Y = k | X = x_i) = exp(s_k) / sum_j exp(s_j)

Cross-entropy loss:     L_i = -log P(Y = y_i | X = x_i)

The cross entropy between the predicted distribution and the one-hot ground-truth distribution equals their Kullback-Leibler divergence, since the entropy of the one-hot target is 0. The final loss is the average of L_i over the dataset (plus regularization).

Softmax Classifier: Cross-Entropy Example

For the cat image, the predicted probabilities are [0.13, 0.87, 0.00] (cat, car, frog), while the correct one-hot distribution is [1.00, 0.00, 0.00]. The cross-entropy loss is therefore

L_i = -log(0.13) = 2.04
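A small numerical sketch of this computation (the scores are taken from the slide; subtracting the max before exponentiating is a standard numerical-stability trick, not something the slide states):

import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # cat, car, frog
correct_class = 0                      # "cat"

def softmax(s):
    e = np.exp(s - s.max())            # subtract max for numerical stability
    return e / e.sum()

probs = softmax(scores)                # ≈ [0.13, 0.87, 0.00]
loss = -np.log(probs[correct_class])   # ≈ 2.04
print(probs, loss)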


Softmax vs. SVM

Summary: Loss Functions

- We have some dataset of (x, y)
- We have a score function: s = f(x, W) = Wx + b
- We have a loss function, e.g. softmax (cross entropy) or SVM (hinge loss)
- The generalized full loss adds regularization: L = (1/N) * sum_i L_i( f(x_i, W), y_i ) + lambda * R(W)
- The loss is a function of W

Optimization

How can we find the best W for our score function?

Naïve approach: randomize. Repeatedly try random values of W and keep the best one. The loss improves, but very slowly.

Optimization

Another approach: local random search. Add a random perturbation, W' = W + ∂W, and keep it only if the loss improves. Better than pure weight randomization, but still expensive and slow to converge.

Optimization

Another approach: following the gradient.

Derivative of a 1-dimensional function:   df/dx = lim_{h -> 0} [ f(x + h) - f(x) ] / h

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of that direction with the gradient, and the direction of steepest descent is the negative gradient.

Optimization: Following the Gradient (numerical evaluation)

current W: [0.34, -1.11, 0.78, ...]        loss: 1.25347
gradient dL/dW: [?, ?, ?, ...]

Step in the first dimension: W + step = [0.34 + 0.0001, -1.11, 0.78, ...], new loss: 1.25322
-> first gradient component: (1.25322 - 1.25347) / 0.0001 = -2.5

Step in the second dimension: [0.34, -1.11 + 0.0001, 0.78, ...], new loss: 1.25353
-> second gradient component: (1.25353 - 1.25347) / 0.0001 = 0.6

gradient so far: [-2.5, 0.6, ?, ...]; repeat for every remaining dimension.

Optimization: Following the Gradient

Numerical evaluation of the gradient:
- Very slow: needs a separate loss evaluation for every dimension
- Only approximate

What we need is a function of W, the analytic gradient:
- Exact
- Fast, once the expression is derived
- Error-prone to derive, so always verify it with a gradient check against the numerical gradient
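As a sketch of such a gradient check, assuming a toy loss function with a known analytic gradient (the function, step size h, and shapes are illustrative):

import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    # centered finite differences: one pair of loss evaluations per dimension
    grad = np.zeros_like(W)
    for i in range(W.size):
        old = W.flat[i]
        W.flat[i] = old + h
        loss_plus = loss_fn(W)
        W.flat[i] = old - h
        loss_minus = loss_fn(W)
        W.flat[i] = old
        grad.flat[i] = (loss_plus - loss_minus) / (2 * h)
    return grad

loss_fn = lambda W: np.sum(W ** 2)   # toy loss with known analytic gradient 2W
W = np.random.randn(3, 4)
num = numerical_gradient(loss_fn, W)
ana = 2 * W
print(np.max(np.abs(num - ana)))     # should be tiny (~1e-8 or smaller)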

Optimization: Following the Gradient

Numerical evaluation was simple in the previous example. But how about deriving the analytic gradient of all the weights of a deep network by hand?

Better idea: computational graphs + backpropagation.

(Computational graph from the slide: the inputs x and W feed a multiply node (*) that produces the scores s; the scores go into the hinge loss; a regularization term R(W) is computed from W; a + node adds the data loss and the regularization to give the total loss L.)

Backpropagation: a simple example

Take a small function of three inputs, e.g. x = -2, y = 5, z = -4, and build its computational graph.

Want: the gradients of the output with respect to x, y and z.

First run a forward pass to compute the output, then walk the graph backwards. At each node the chain rule says:

downstream gradient = upstream gradient x local gradient

where the upstream gradient is the gradient of the final output with respect to this node's output, and the local gradient is the gradient of the node's output with respect to its input.
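A minimal sketch of this forward/backward pass, assuming the classic example f(x, y, z) = (x + y) * z; the exact function is not visible in the extracted slides, so treat it as an assumption:

# forward pass: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# backward pass (chain rule, from the output back to the inputs)
df_df = 1.0                 # base case
df_dq = z * df_df           # mul gate: local gradient wrt q is z  -> -4
df_dz = q * df_df           # mul gate: local gradient wrt z is q  ->  3
df_dx = 1.0 * df_dq         # add gate: passes the gradient through -> -4
df_dy = 1.0 * df_dq         # add gate: passes the gradient through -> -4
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0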

The general picture for a single node f in the graph: during the backward pass it receives an "upstream gradient" (the gradient of the loss with respect to its output), multiplies it by its "local gradient" (the gradient of its output with respect to each of its inputs), and sends the products on as "downstream gradients" to the nodes that feed it.

Another example: a single sigmoid neuron

The graph computes a weighted sum of the inputs followed by the sigmoid function. Running the backward pass node by node, always taking [upstream gradient] x [local gradient]:

- Add gates pass the gradient through unchanged: [0.2] x [1] = 0.2 (for both inputs!)
- Multiply gates scale by the other input: x0 gets [0.2] x [2] = 0.4 and w0 gets [0.2] x [-1] = -0.2
- Sigmoid gate: with output sigma(x) = 0.73, the local gradient is (1 - sigma(x)) * sigma(x), so the gradient at its input is [1.00] x [(1 - 0.73)(0.73)] = 0.2

The computational graph representation may not be unique. Choose one where the local gradients at each node can be easily expressed! For example, the whole sigmoid can be treated as a single gate with local gradient (1 - sigma(x)) * sigma(x) instead of being broken into exponential, add and divide nodes.

Patterns in gradient flow

- add gate: gradient distributor. The upstream gradient is copied unchanged to every input.
- mul gate: "swap multiplier". Each input receives the upstream gradient multiplied by the value of the other input.
- copy gate: gradient adder. A value that is used in several places receives the sum of the gradients flowing back along each use.
- max gate: gradient router. The upstream gradient is routed entirely to the input that attained the maximum; the other inputs get zero.

Backprop implementation ("flat" code)

Forward pass: compute the output, saving the intermediate values that the backward pass will need.

Backward pass: compute the gradients by walking the graph in reverse. Start from the base case (the gradient of the output with respect to itself is 1), then apply the gate rules one node at a time: sigmoid gate, add gates, multiply gates.
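As a sketch of such flat code, using the sigmoid-neuron example from the earlier slides; the input values (w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3) are assumed, chosen because they reproduce the 0.73 output shown there:

import math

# assumed inputs for the sigmoid-neuron example
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# forward pass: compute the output, stashing intermediates
s0 = w0 * x0                        # -2
s1 = w1 * x1                        #  6
s2 = s0 + s1                        #  4
s3 = s2 + w2                        #  1
out = 1.0 / (1.0 + math.exp(-s3))   # sigmoid -> 0.73

# backward pass: base case, then gate rules in reverse order
dout = 1.0
ds3 = (1 - out) * out * dout   # sigmoid gate -> 0.20
ds2 = 1.0 * ds3                # add gate
dw2 = 1.0 * ds3                # add gate
ds0 = 1.0 * ds2                # add gate
ds1 = 1.0 * ds2                # add gate
dw0 = x0 * ds0                 # mul gate -> -0.20
dx0 = w0 * ds0                 # mul gate ->  0.39 (0.4 on the slide, which rounds)
dw1 = x1 * ds1                 # mul gate
dx1 = w1 * ds1                 # mul gate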

Backprop Implementation: Modularized API

Instead of flat code, represent the network as a Graph (or Net) object that, in rough pseudocode, loops over its nodes: forward() in topological order, backward() in reverse order.

Each Gate / Node / Function object (this is how actual PyTorch code is organized) implements:
- forward(x, y): compute the output z from the inputs (x, y, z are scalars in this example) and stash whatever values are needed for the backward pass;
- backward(dz): take the upstream gradient and multiply it by the local gradients to produce the downstream gradients.
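A sketch of one such gate written against PyTorch's torch.autograd.Function API (the class name and the scalar example are illustrative, not the actual PyTorch source):

import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)   # stash values needed in backward
        return x * y

    @staticmethod
    def backward(ctx, grad_z):        # grad_z is the upstream gradient
        x, y = ctx.saved_tensors
        # downstream gradient = upstream gradient * local gradient
        return grad_z * y, grad_z * x

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)
z = Multiply.apply(x, y)
z.backward()
print(x.grad, y.grad)   # -4.0, 3.0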

Example: PyTorch operators

PyTorch implements its operators in exactly this forward / backward style; the PyTorch sigmoid layer, for example, has a forward function that computes the sigmoid output and a matching backward function that multiplies the upstream gradient by the sigmoid's local gradient.


So far: backprop with scalars

What about vector-valued functions?

Vector derivatives

Scalar to scalar: the regular derivative dy/dx. If x changes by a small amount, how much will y change?

Vector to scalar: the derivative is the gradient, a vector of the same shape as x. For each element of x, if it changes by a small amount, how much will y change?

Vector to vector: the derivative is the Jacobian matrix. For each element of x, if it changes by a small amount, how much will each element of y change?

Backprop with Vectors

The loss L is still a scalar! Consider a node f with vector inputs x (dimension Dx) and y (dimension Dy) and vector output z (dimension Dz):

- Upstream gradient dL/dz: a vector of dimension Dz. For each element of z, how much does it influence L?
- Local gradients dz/dx and dz/dy: Jacobian matrices of shape [Dx x Dz] and [Dy x Dz].
- Downstream gradients dL/dx and dL/dy: obtained by a matrix-vector multiply of each Jacobian with the upstream gradient; they have dimensions Dx and Dy, i.e. the same shape as the corresponding input.

Backprop with Vectors: ReLU example

f(x) = max(0, x), applied elementwise.

4D input x:               [ 1   -2   3   -1 ]
4D output y:              [ 1    0   3    0 ]
Upstream gradient dL/dy:  [ 4   -1   5    9 ]

Jacobian dy/dx (1 on the diagonal where x > 0, 0 elsewhere):
[ 1 0 0 0 ]
[ 0 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 0 ]

4D dL/dx = [dy/dx] [dL/dy] = [ 4   0   5   0 ]

The Jacobian is sparse: the off-diagonal entries are always zero. Never explicitly form the Jacobian; instead use the implicit multiplication (here: pass the upstream gradient through wherever the input was positive, zero it elsewhere).
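A sketch of that implicit multiplication for this ReLU example:

import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
dL_dy = np.array([4.0, -1.0, 5.0, 9.0])   # upstream gradient

y = np.maximum(0, x)                      # forward: [1, 0, 3, 0]
dL_dx = dL_dy * (x > 0)                   # backward without forming the Jacobian
print(dL_dx)                              # [4, 0, 5, 0]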

Backprop with Matrices (or Tensors)

The loss L is still a scalar! The inputs are now matrices, e.g. x of shape [Dx x Mx] and y of shape [Dy x My], and the output z has shape [Dz x Mz]:

- Upstream gradient dL/dz: shape [Dz x Mz]. For each element of z, how much does it influence L?
- Local "Jacobians" dz/dx and dz/dy: shapes [(Dx x Mx) x (Dz x Mz)] and [(Dy x My) x (Dz x Mz)]. For each element of x (or y), how much does it influence each element of z?
- Downstream gradients dL/dx and dL/dy: computed by the (generalized) matrix-vector multiply of the Jacobian with the upstream gradient. dL/dx always has the same shape as x!

Backprop with Matrices: matrix multiply

Node: y = x w (matrix multiply).

x: [N x D]
[  2  1 -3 ]
[ -3  4  2 ]

w: [D x M]
[ 3  2  1 -1 ]
[ 2  1  3  2 ]
[ 3  2  1 -2 ]

y: [N x M]
[ 13  9 -2 -6 ]
[  5  2 17  1 ]

Upstream gradient dL/dy: [N x M]
[  2  3 -3  9 ]
[ -8  1  4  6 ]

The local Jacobians dy/dx and dy/dw would have shapes [(N x D) x (N x M)] and [(D x M) x (N x M)]. For a neural net we may have N = 64 and D = M = 4096, so each Jacobian would take 256 GB of memory. We must work with them implicitly!

Q: What parts of y are affected by one element of x?
A: x[n, d] affects the whole row y[n, :].

Q: How much does x[n, d] affect y[n, m]?
A: By w[d, m], since y[n, m] = sum_d x[n, d] w[d, m].

Combining these gives the implicit rule dL/dx = (dL/dy) w^T, with shapes [N x D] = [N x M] [M x D]. By similar logic, dL/dw = x^T (dL/dy), with shapes [D x M] = [D x N] [N x M].
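A sketch that checks these implicit rules against PyTorch's autograd, using the x, w and upstream gradient from the slide (the forward output is recomputed rather than copied from the slide):

import torch

x = torch.tensor([[2., 1., -3.], [-3., 4., 2.]], requires_grad=True)       # [N x D]
w = torch.tensor([[3., 2., 1., -1.], [2., 1., 3., 2.], [3., 2., 1., -2.]],
                 requires_grad=True)                                        # [D x M]
dL_dy = torch.tensor([[2., 3., -3., 9.], [-8., 1., 4., 6.]])                # [N x M]

y = x @ w              # forward: [N x M]
y.backward(dL_dy)      # feed in the upstream gradient

# implicit rules, no Jacobian ever formed
dL_dx = dL_dy @ w.t()  # [N x D] = [N x M][M x D]
dL_dw = x.t() @ dL_dy  # [D x M] = [D x N][N x M]
print(torch.allclose(x.grad, dL_dx), torch.allclose(w.grad, dL_dw))   # True True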

Optimization: Gradient Descent (GD)

The most common optimization algorithm in deep learning:

- First-order optimization: uses only first derivatives
- The step size taken along the negative gradient is the learning rate
- Variants differ in how much data is used per step: batch GD (all samples), minibatch GD (a batch of samples), stochastic GD (a single sample); a minimal loop is sketched below
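A minimal sketch of the basic loop; the loss gradient function, learning rate and iteration count are illustrative assumptions:

import numpy as np

def gradient_descent(grad_fn, W, learning_rate=1e-3, steps=1000):
    # first-order update: step along the negative gradient
    # (for minibatch or stochastic GD, grad_fn would evaluate the gradient
    #  on a minibatch or a single sample instead of the full dataset)
    for _ in range(steps):
        dW = grad_fn(W)                 # gradient of the loss at the current W
        W = W - learning_rate * dW      # step size = learning rate
    return W

# toy example: minimizing ||W||^2, whose gradient is 2W
W_opt = gradient_descent(lambda W: 2 * W, np.random.randn(10, 3072))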

Batch Gradient Descent

Uses all samples (the entire dataset) for each loss and gradient calculation:

- Expensive, but less noisy

- Stable loss gradient

- Can converge to local minima

Minibatch Gradient Descent

The full sum over the dataset is expensive when N is large! Approximate the sum using a minibatch of examples; typical minibatch sizes are 32, 64 or 128.


Stochastic Gradient Descent (SGD)

Weights are updated after every sample

- Minibatch gradient descent with batch size one

- Less commonly used as vectorized ops are more efficient

- You can assume most SGD implementations use minibatches!

SGD + Momentum

- Build up "velocity" as a running mean of gradients
- Rho gives "friction"; typically rho = 0.9 or 0.99

Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013
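One common formulation of the update, sketched below (variable names are illustrative):

import numpy as np

def sgd_momentum_step(W, v, dW, learning_rate=1e-2, rho=0.9):
    # v accumulates a running mean of gradients; rho acts as "friction"
    v = rho * v + dW
    W = W - learning_rate * v
    return W, v

W = np.random.randn(10, 3072)
v = np.zeros_like(W)   # start with zero velocity
# inside the training loop: W, v = sgd_momentum_step(W, v, grad_of_loss(W))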

Momentum vs. Nesterov Momentum

Momentum update: combine the gradient at the current point with the velocity to get the actual step used to update the weights.

Nesterov momentum: "look ahead" to the point where updating with the current velocity would take us, compute the gradient there, and mix it with the velocity to get the actual update direction.

Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)", 1983
Nesterov, "Introductory lectures on convex optimization: a basic course", 2004
Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013

Nesterov Momentum

In practice we want the update expressed in terms of the current parameter vector; a change of variables and a rearrangement of the update equations gives an equivalent form that only uses quantities evaluated at the current parameters.

Gradient Descent: Challenges

- First order: curvature information is not used
- Saddle points
- No per-parameter updates: the same learning rate is used for every parameter

AdaGrad

- Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension
- "Per-parameter learning rates" or "adaptive learning rates"

Duchi et al, "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011

RMSProp: "leaky AdaGrad"

RMSProp keeps the AdaGrad idea but decays the running sum of squared gradients, so the effective learning rate does not shrink monotonically towards zero.

Tieleman and Hinton, 2012
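A sketch of the two updates side by side; the hyperparameter values are common defaults, assumed here for illustration:

import numpy as np

def adagrad_step(W, cache, dW, learning_rate=1e-2, eps=1e-7):
    cache = cache + dW * dW                               # historical sum of squares
    W = W - learning_rate * dW / (np.sqrt(cache) + eps)   # per-parameter scaling
    return W, cache

def rmsprop_step(W, cache, dW, learning_rate=1e-2, decay=0.99, eps=1e-7):
    cache = decay * cache + (1 - decay) * dW * dW         # "leaky" running average
    W = W - learning_rate * dW / (np.sqrt(cache) + eps)
    return W, cache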

Adam

Adam combines momentum (a running estimate of the first moment of the gradient) with AdaGrad / RMSProp-style scaling (a running estimate of the second moment), plus bias correction for the fact that the first and second moment estimates start at zero.

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015
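A sketch of the Adam update described above, with the suggested default hyperparameters:

import numpy as np

def adam_step(W, m, v, t, dW, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    t = t + 1
    m = beta1 * m + (1 - beta1) * dW        # momentum (first moment)
    v = beta2 * v + (1 - beta2) * dW * dW   # RMSProp-style second moment
    m_hat = m / (1 - beta1 ** t)            # bias correction: both estimates
    v_hat = v / (1 - beta2 ** t)            # start at zero
    W = W - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v, t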

(Figure: optimization trajectories of SGD, SGD+Momentum, RMSProp and Adam compared.)


Saddle Point Behaviour

Recall: Regularization

In common use:
- L2 regularization (weight decay)
- L1 regularization
- Elastic net (L1 + L2)

Regularization: Dropout

In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.

Srivastava et al, "Dropout: A simple way to prevent neural networks from overfitting", JMLR 2014

Regularization: Dropout

Example forward pass with a 3-layer network using dropout (a sketch of such a forward pass follows below).
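A sketch in the spirit of that example; the layer sizes and helper names are assumptions rather than the slide's exact code:

import numpy as np

p = 0.5   # probability of keeping a neuron (with 0.5, the drop probability is also 0.5)

def train_forward(X, W1, W2, W3):
    H1 = np.maximum(0, X @ W1)
    H1 = H1 * (np.random.rand(*H1.shape) < p)   # first dropout mask: drop!
    H2 = np.maximum(0, H1 @ W2)
    H2 = H2 * (np.random.rand(*H2.shape) < p)   # second dropout mask: drop!
    return H2 @ W3

def test_forward(X, W1, W2, W3):
    # all neurons active: scale by p so the expected output matches training time
    H1 = np.maximum(0, X @ W1) * p
    H2 = np.maximum(0, H1 @ W2) * p
    return H2 @ W3

An equivalent and nowadays more common variant, "inverted dropout", divides the mask by p during training instead, so the test-time forward pass needs no scaling at all.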

Dropout: Test Time

Dropout makes our output random! The output (label) now depends on both the input (image) and a random mask. We want to "average out" this randomness at test time.

Dropout: Test Time

We want to approximate the integral (the expectation of the output over all random masks). Consider a single neuron a with inputs x, y and weights w1, w2.

At test time (no dropout) we have: a = w1 x + w2 y.

During training, with a drop probability of 1/2, each of the four mask patterns is equally likely:

E[a] = 1/4 (w1 x + w2 y) + 1/4 (w1 x + 0 y) + 1/4 (0 x + w2 y) + 1/4 (0 x + 0 y) = 1/2 (w1 x + w2 y)

So at test time, multiply the output by the dropout probability to match the training-time expectation.

Dropout: Test Time

At test time all neurons are always active, so we must scale the activations such that, for each neuron: output at test time = expected output at training time.

Dropout Summary

Training: drop neurons in the forward pass.
Test time: scale the activations.


Regularization: A common pattern

Training: Add some kind of randomness

Testing: Average out randomness (sometimes approximate)

Regularization: A common pattern

Training: Add some kind of randomness

Testing: Average out randomness (sometimes approximate)

Example: Batch Normalization

Training: Normalize using stats from random minibatches

Testing: Use fixed stats to normalize

Regularization: Data Augmentation

(Training pipeline: load an image and its label, e.g. "cat"; feed the image through the CNN; compute the loss. With data augmentation, the image is randomly transformed before it is fed to the CNN, while the label stays the same.)

This image by Nikita is licensed under CC-BY 2.0

Data Augmentation: Horizontal Flips

Data Augmentation: Random Crops and Scales

Training: sample random crops / scales. ResNet:
1. Pick a random L in the range [256, 480]
2. Resize the training image so its short side is L
3. Sample a random 224 x 224 patch

Testing: average a fixed set of crops. ResNet:
1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, plus their flips

Data Augmentation: Color Jitter

Simple: randomize contrast and brightness.

More complex:
1. Apply PCA to all [R, G, B] pixels in the training set
2. Sample a "color offset" along the principal component directions
3. Add the offset to all pixels of a training image

(As seen in [Krizhevsky et al. 2012], ResNet, etc.)

Regularization: A common pattern

Training: add random noise.
Testing: marginalize over the noise.

Examples: Dropout, Batch Normalization, Data Augmentation

Regularization: DropConnect

Training: drop connections between neurons (set weights to 0).
Testing: use all the connections.

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect

Wan et al, "Regularization of Neural Networks using DropConnect", ICML 2013

Regularization: Fractional Pooling

Training: use randomized pooling regions.
Testing: average predictions from several regions.

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling

Graham, "Fractional Max Pooling", arXiv 2014

Regularization: Stochastic Depth

Training: skip some layers in the network.
Testing: use all the layers.

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth

Huang et al, "Deep Networks with Stochastic Depth", ECCV 2016

Regularization: Cutout

Training: set random image regions to zero.
Testing: use the full image.

Works very well for small datasets like CIFAR, less common for large datasets like ImageNet.

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout

DeVries and Taylor, "Improved Regularization of Convolutional Neural Networks with Cutout", arXiv 2017

Regularization: Mixup

Training: train on random blends of images. Randomly blend the pixels of pairs of training images, e.g. 40% cat + 60% dog, and give the CNN the correspondingly blended target label: cat 0.4, dog 0.6.
Testing: use the original images.

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup

Zhang et al, "mixup: Beyond Empirical Risk Minimization", ICLR 2018

Regularization Summary

Training: add random noise. Testing: marginalize over the noise.
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup

- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets

Choosing Hyperparameters

Step 1: Check the initial loss.
Step 2: Overfit a small sample.

Try to train to 100% training accuracy on a small sample of training data (~5-10 minibatches); fiddle with the architecture, learning rate, and weight initialization.

Loss not going down? LR too low, or bad initialization.
Loss explodes to Inf or NaN? LR too high, or bad initialization.

Choosing Hyperparameters

Step 1: Check initial loss. Step 2: Overfit a small sample. Step 3: Find an LR that makes the loss go down.

Use the architecture from the previous step, use all training data, turn on a small weight decay, and find a learning rate that makes the loss drop significantly within ~100 iterations.

Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4

Choosing Hyperparameters

Steps 1-3 as above. Step 4: Coarse grid, train for ~1-5 epochs.

Choose a few values of learning rate and weight decay around what worked in Step 3, and train a few models for ~1-5 epochs.

Good weight decay values to try: 1e-4, 1e-5, 0

Choosing Hyperparameters

Steps 1-4 as above. Step 5: Refine the grid, train longer.

Pick the best models from Step 4 and train them for longer (~10-20 epochs) without learning rate decay.

Choosing Hyperparameters

Steps 1-5 as above. Step 6: Look at the loss curves.

Loss curves

Plot the training loss and the train / val accuracy. Losses may be noisy: use a scatter plot and also plot a moving average to see trends better.

(Loss vs. time plot.) A loss that stays flat at the start before suddenly dropping suggests bad initialization.

(Loss vs. time plot.) Loss plateaus: try learning rate decay.

Page 163: Learning Principles - Eindhoven University of Technology · Learning Principles. Learning Goals • Complete learning procedure • Different loss functions, regularization methods,

[Loss vs. time plot with a learning rate step decay] Loss was still going down when the learning rate dropped: you decayed too early!


[Train / val accuracy vs. time plot] Accuracy still going up: you need to train longer.


[Train / val accuracy vs. time plot] Huge train / val gap means overfitting! Increase regularization, get more data.


[Train / val accuracy vs. time plot] No gap between train / val means underfitting: train longer, use a bigger model.


Choosing Hyperparameters

Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss curves
Step 7: GOTO step 5


Data Preprocessing

(Assume X [N x D] is the data matrix, each example in a row)


Data Preprocessing

In practice, you may also see PCA and Whitening of the data:

- Decorrelated data (data has diagonal covariance matrix)

- Whitened data (covariance matrix is the identity matrix)
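A NumPy sketch of these operations on the [N x D] data matrix X; the random data is only a placeholder:

```python
import numpy as np

X = np.random.randn(1000, 50)                     # placeholder [N x D] data matrix

# Zero-center and normalize (the common case):
X_centered = X - X.mean(axis=0)
X_normalized = X_centered / X_centered.std(axis=0)

# PCA and whitening (less common in practice):
cov = X_centered.T @ X_centered / X_centered.shape[0]
U, S, _ = np.linalg.svd(cov)
X_decorrelated = X_centered @ U                   # diagonal covariance matrix
X_whitened = X_decorrelated / np.sqrt(S + 1e-5)   # covariance ~ identity matrix
```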


Data Preprocessing

Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize

After normalization: less sensitive to small changes in weights; easier to optimize


In practice for Images: center only

e.g. consider CIFAR-10 example with [32,32,3] images

- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)

- Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)

- Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)
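A torchvision-style sketch of the ResNet-style option; the per-channel statistics below are illustrative numbers for CIFAR-10, computed once over the training set rather than taken from the slides:

```python
import torchvision.transforms as T

# Illustrative CIFAR-10 per-channel statistics (3 numbers each).
cifar_mean = (0.4914, 0.4822, 0.4465)
cifar_std = (0.2470, 0.2435, 0.2616)

transform = T.Compose([
    T.ToTensor(),                         # HWC uint8 in [0, 255] -> CHW float in [0, 1]
    T.Normalize(cifar_mean, cifar_std),   # subtract per-channel mean, divide by per-channel std
])
```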


Weight Initialization


Weight Initialization

Q: What happens when a constant initialization (W = constant) is used?


Weight Initialization

- First idea: Small random numbers (Gaussian with zero mean and 1e-2 standard deviation)

Works for small networks, but causes problems with deeper networks.


Weight Initialization: Activation statistics

Forward pass for a 6-layer net with hidden size 4096
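A minimal sketch of this experiment, assuming tanh activations and the std-0.01 Gaussian initialization from the previous slide:

```python
import torch

dims = [4096] * 7                          # input plus 6 hidden layers of size 4096
x = torch.randn(16, dims[0])               # a small batch of random inputs

for layer, (Din, Dout) in enumerate(zip(dims[:-1], dims[1:]), start=1):
    W = 0.01 * torch.randn(Din, Dout)      # small random (Gaussian) initialization
    x = torch.tanh(x @ W)                  # forward pass through one layer
    print(f"layer {layer}: activation std = {x.std().item():.4f}")
# The printed std shrinks towards zero with depth: activations collapse to zero.
```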


Weight Initialization: Activation statistics
Forward pass for a 6-layer net with hidden size 4096

All activations tend to zero for deeper network layers

Q: What do the gradients dL/dW look like?


Weight Initialization: Activation statistics
Forward pass for a 6-layer net with hidden size 4096

All activations tend to zero for deeper network layers

Q: What do the gradients dL/dW look like?

A: All zero, no learning


Weight Initialization: Activation statistics
Increase std of initial weights from 0.01 to 0.05


Weight Initialization: Activation statistics
Increase std of initial weights from 0.01 to 0.05

All activations saturate

Q: What do the gradients look like?


Weight Initialization: Activation statistics
Increase std of initial weights from 0.01 to 0.05

All activations saturate

Q: What do the gradients look like?

A: Local gradients all zero, no learning


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010


Activations are nicely scaled for all layers!

“Xavier” initialization: std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Weight Initialization: “Xavier” Initialization


Activations are nicely scaled for all layers!

For conv layers, Din is kernel_size² * input_channels

“Xavier” initialization: std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Weight Initialization: “Xavier” Initialization


“Xavier” initialization: std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Weight Initialization: “Xavier” Initialization

y = Wx,  h = f(y)

Derivation:
Var(yᵢ) = Din * Var(xᵢwᵢ)                          [Assume x, w are iid]
        = Din * (E[xᵢ²]E[wᵢ²] - E[xᵢ]²E[wᵢ]²)      [Assume x, w independent]
        = Din * Var(xᵢ) * Var(wᵢ)                  [Assume x, w are zero-mean]

If Var(wᵢ) = 1/Din then Var(yᵢ) = Var(xᵢ)

Activations are nicely scaled for all layers!

For conv layers, Din is kernel_size² * input_channels
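The same sketch as before, but with Xavier scaling instead of the fixed 0.01 std (again assuming tanh activations):

```python
import torch

dims = [4096] * 7
x = torch.randn(16, dims[0])

for layer, (Din, Dout) in enumerate(zip(dims[:-1], dims[1:]), start=1):
    W = torch.randn(Din, Dout) / (Din ** 0.5)   # "Xavier": std = 1/sqrt(Din)
    x = torch.tanh(x @ W)
    print(f"layer {layer}: activation std = {x.std().item():.4f}")
# The std now stays roughly constant across layers instead of collapsing to zero.
# For a conv layer, Din = kernel_size**2 * input_channels, e.g. 3*3*64 for a
# 3x3 convolution over 64 input channels.
```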


Batch Normalization


Batch Normalization [Ioffe and Szegedy, 2015]

Consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, apply:

x̂ = (x - E[x]) / sqrt(Var[x])   (computed independently for each dimension)

This is a vanilla differentiable function...


Batch Normalization [Ioffe and Szegedy, 2015]

Input x, shape N × D:
- Per-channel mean 𝞵, shape is D
- Per-channel var 𝝈², shape is D
- Normalized x̂ = (x - 𝞵)/𝝈, shape is N × D


Batch Normalization [Ioffe and Szegedy, 2015]

Input x, shape N × D:
- Per-channel mean 𝞵, shape is D
- Per-channel var 𝝈², shape is D
- Normalized x̂ = (x - 𝞵)/𝝈, shape is N × D

Problem: what if zero-mean, unit variance is too hard of a constraint?


Batch Normalization [Ioffe and Szegedy, 2015]

Input x, shape N × D:
- Per-channel mean 𝞵, shape is D
- Per-channel var 𝝈², shape is D
- Normalized x̂ = (x - 𝞵)/𝝈, shape is N × D

Learnable scale and shift parameters ɣ, β, shape is D
Output y = ɣx̂ + β, shape is N × D

Learning ɣ = 𝝈, β = 𝞵 will recover the identity function!


Batch Normalization

Input x, shape N × D:
- Per-channel mean 𝞵, shape is D
- Per-channel var 𝝈², shape is D
- Normalized x̂ = (x - 𝞵)/𝝈, shape is N × D

Learnable scale and shift parameters ɣ, β, shape is D
Output y = ɣx̂ + β, shape is N × D

Learning ɣ = 𝝈, β = 𝞵 will recover the identity function!

Problem: estimates of 𝞵 and 𝝈 depend on the minibatch; can’t do this at test-time!


Batch Normalization: Test-Time

Input x, shape N × D:
- 𝞵: (running) average of per-channel means seen during training, shape is D
- 𝝈²: (running) average of per-channel variances seen during training, shape is D
- Normalized x̂ = (x - 𝞵)/𝝈, shape is N × D

Learnable scale and shift parameters ɣ, β, shape is D
Output y = ɣx̂ + β, shape is N × D

During testing batchnorm becomes a linear operator! Can be fused with the previous fully-connected or conv layer.
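A minimal sketch of a batchnorm forward pass over an [N x D] input, showing the train-time batch statistics versus the test-time running averages (an illustration of the idea, not the nn.BatchNorm1d source):

```python
import torch

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training, momentum=0.1, eps=1e-5):
    if training:
        mu = x.mean(dim=0)                       # per-channel mean, shape D
        var = x.var(dim=0, unbiased=False)       # per-channel variance, shape D
        running_mean.mul_(1 - momentum).add_(momentum * mu)   # update running stats
        running_var.mul_(1 - momentum).add_(momentum * var)
    else:
        mu, var = running_mean, running_var      # fixed statistics: BN is now linear
    x_hat = (x - mu) / torch.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta                  # learnable scale and shift
```

In PyTorch, nn.BatchNorm1d / nn.BatchNorm2d switch between these two branches via model.train() and model.eval(); forgetting to switch at test time is the common source of bugs mentioned on a later slide.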


Batch Normalization for ConvNets

Batch Normalization for fully-connected networks:
  x: N × D
  Normalize over N
  𝞵, 𝝈: 1 × D
  ɣ, β: 1 × D
  y = ɣ(x - 𝞵)/𝝈 + β

Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D):
  x: N × C × H × W
  Normalize over N, H, W
  𝞵, 𝝈: 1 × C × 1 × 1
  ɣ, β: 1 × C × 1 × 1
  y = ɣ(x - 𝞵)/𝝈 + β


Batch Normalization [Ioffe and Szegedy, 2015]

Usually inserted after Fully Connected or Convolutional layers, and before the nonlinearity:

... → FC → BN → tanh → FC → BN → tanh → ...
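A small PyTorch sketch of that pattern (the layer sizes are arbitrary example values):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.BatchNorm1d(256),   # BN after the FC layer, before the nonlinearity
    nn.Tanh(),
    nn.Linear(256, 256),
    nn.BatchNorm1d(256),
    nn.Tanh(),
    nn.Linear(256, 10),
)
# For convolutional layers the same pattern uses nn.BatchNorm2d(num_channels).
```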


Batch Normalization [Ioffe and Szegedy, 2015]

- Makes deep networks much easier to train!
- Improves gradient flow
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
- Behaves differently during training and testing: this is a very common source of bugs!


Layer Normalization

Batch Normalization for fully-connected networks:
  x: N × D
  Normalize over N
  𝞵, 𝝈: 1 × D
  ɣ, β: 1 × D
  y = ɣ(x - 𝞵)/𝝈 + β

Layer Normalization for fully-connected networks:
  x: N × D
  Normalize over D
  𝞵, 𝝈: N × 1
  ɣ, β: 1 × D
  y = ɣ(x - 𝞵)/𝝈 + β

Same behavior at train and test! Can be used in recurrent networks.

Ba, Kiros, and Hinton, “Layer Normalization”, arXiv 2016
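A quick PyTorch comparison of the two (the tensor sizes are arbitrary example values):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 512)      # [N, D] batch of feature vectors

bn = nn.BatchNorm1d(512)      # statistics over the batch dimension N (per feature)
ln = nn.LayerNorm(512)        # statistics over the feature dimension D (per sample)

y_bn = bn(x)                  # depends on the other samples in the minibatch
y_ln = ln(x)                  # each sample normalized independently:
                              # identical behavior at train and test time
```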


Instance Normalization

Batch Normalization for convolutional networks:
  x: N × C × H × W
  Normalize over N, H, W
  𝞵, 𝝈: 1 × C × 1 × 1
  ɣ, β: 1 × C × 1 × 1
  y = ɣ(x - 𝞵)/𝝈 + β

Instance Normalization for convolutional networks:
  x: N × C × H × W
  Normalize over H, W
  𝞵, 𝝈: N × C × 1 × 1
  ɣ, β: 1 × C × 1 × 1
  y = ɣ(x - 𝞵)/𝝈 + β

Same behavior at train / test!

Ulyanov et al, “Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis”, CVPR 2017


Comparison of Normalization Layers

Wu and He, “Group Normalization”, ECCV 2018


Group Normalization

Wu and He, “Group Normalization”, ECCV 2018
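A sketch comparing the four normalization layers as exposed in PyTorch; the tensor shape and the group count are arbitrary example values:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 32, 32)                    # [N, C, H, W] feature map

norms = {
    "BatchNorm2d": nn.BatchNorm2d(64),            # stats over N, H, W (per channel)
    "LayerNorm": nn.LayerNorm([64, 32, 32]),      # stats over C, H, W (per sample)
    "InstanceNorm2d": nn.InstanceNorm2d(64),      # stats over H, W (per sample, per channel)
    "GroupNorm": nn.GroupNorm(8, 64),             # stats over H, W within groups of 8 channels
}

for name, layer in norms.items():
    y = layer(x)
    print(f"{name:15s} -> output shape {tuple(y.shape)}, mean {y.mean().item():+.4f}")
```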