Learning Principles - Eindhoven University of Technology
Learning Goals
• Complete learning procedure
• Different loss functions, regularization methods, optimization methods
• Backpropagation mathematical derivation
• Other key concepts such as hyperparameters, data preprocessing, and weight initialization
Learning – Training of Deep Neural Networks
PyTorch
Classification Problem
{dog, cat, truck, plane, ...}
cat
Three key components:
- Score function: maps the input to class scores
- Loss function: evaluates the quality of the mapping
- Optimization function: updates the classifier
Score Function
Loss Function
Optimization Function
Score Function : Linear Classifier
f(x,W) = Wx + b → 10 numbers giving class scores
W, b: weights and biases
Update Score Function
Loss Function : Linear Classifier
Class scores: f(x, W)
Output quality: evaluated by the loss function
Why is a loss function needed? The optimizer requires a "hint" for the update.
Scores f(x, W) for three example images (columns: cat image, car image, frog image):
  cat:   3.2   1.3   2.2
  car:   5.1   4.9   2.5
  frog: -1.7   2.0  -3.1
Loss Function : Linear Classifier
Given a dataset of examples {(x_i, y_i)}_{i=1}^N, where x_i is an image and y_i is its (integer) label.
Single-sample loss: L_i(f(x_i, W), y_i)
Loss over the dataset: L = (1/N) Σ_i L_i(f(x_i, W), y_i)
Loss Function : Linear Classifier
- SVM Multiclass Loss (Hinge Loss)
- Softmax Regression : Softmax + Cross Entropy Loss
Common Loss Functions for Classification
SVM Multiclass Loss Formulation
Scores vector: s = f(x_i, W)
The SVM loss for example i has the form:
L_i = Σ_{j≠y_i} max(0, s_j - s_{y_i} + 1)
Loss over the full dataset is the average: L = (1/N) Σ_i L_i
Using the scores above (cat image, car image, frog image), the per-image SVM losses are 2.9, 0 and 12.9, so L = (2.9 + 0 + 12.9)/3 ≈ 5.27
SVM Multiclass Loss Example
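A minimal sketch computing this loss for the example scores (assumed example code, not from the slides; rows are images, columns are class scores):

```python
import numpy as np

# Multiclass SVM (hinge) loss for the three example images above.
# Rows = images, columns = class scores in the order [cat, car, frog].
scores = np.array([[3.2, 5.1, -1.7],   # cat image
                   [1.3, 4.9,  2.0],   # car image
                   [2.2, 2.5, -3.1]])  # frog image
labels = np.array([0, 1, 2])            # correct class index per image

def svm_loss(scores, labels, delta=1.0):
    N = scores.shape[0]
    correct = scores[np.arange(N), labels][:, None]      # s_{y_i}
    margins = np.maximum(0.0, scores - correct + delta)  # max(0, s_j - s_{y_i} + delta)
    margins[np.arange(N), labels] = 0.0                   # skip j == y_i
    return margins.sum(axis=1)                            # per-image losses

per_image = svm_loss(scores, labels)
print(per_image, per_image.mean())   # [2.9, 0.0, 12.9], mean ≈ 5.27
```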
Suppose that we found a W such that L = 0. Is this W unique?
No! 2W also gives L = 0! (This motivates adding a regularization term.)
Regularization
Full loss: L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
Data loss: model predictions should match the training labels
Regularization loss: prevents overfitting by penalizing large weights
λ = regularization strength (a hyperparameter)
Common choices for R(W):
  L2 regularization: R(W) = Σ_k Σ_l W_{k,l}²
  L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
  Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)
Softmax Classifier & Cross Entropy Loss
Formulated as follows:
- “Softmax” function converts scores to probabilities in last layer
- Loss is cross entropy between class probabilities and ground truth
Example (cat image): the class scores [cat, car, frog] = [3.2, 5.1, -1.7] are unnormalized log-probabilities / logits.
We want to interpret raw classifier scores as probabilities → Softmax function.
Probabilities must be >= 0: apply exp → [24.5, 164.0, 0.18] (unnormalized probabilities)
Probabilities must sum to 1: normalize → [0.13, 0.87, 0.00] (probabilities)
Softmax Classifier
Scores vector: s = f(x_i, W)
The Softmax classifier interprets the scores as class probabilities.
Softmax function: P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j}
Cross-entropy loss: L_i = -log P(Y = y_i | X = x_i)
Final loss: L = (1/N) Σ_i L_i
(The cross entropy between the one-hot ground-truth distribution and the predicted distribution differs from their Kullback-Leibler divergence only by the entropy of the ground truth, which is 0 for a one-hot target.)
Softmax Classifier & Cross Entropy Loss
Example (cat image): logits [3.2, 5.1, -1.7] → exp → [24.5, 164.0, 0.18] → normalize → probabilities [0.13, 0.87, 0.00]
Correct (one-hot) probabilities: [1.00, 0.00, 0.00]
Cross-entropy loss: L_i = -log(0.13) = 2.04
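A minimal sketch of the softmax + cross-entropy computation for the cat-image logits above (assumed example code, not from the slides):

```python
import numpy as np

logits = np.array([3.2, 5.1, -1.7])   # scores for [cat, car, frog]
y = 0                                  # correct class: cat

def softmax(s):
    s = s - s.max()                    # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum()

probs = softmax(logits)                # ≈ [0.13, 0.87, 0.00]
loss = -np.log(probs[y])               # ≈ 2.04
print(probs, loss)
```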
Softmax vs. SVM
- We have a dataset of (x, y) pairs
- We have a score function: s = f(x; W) = Wx
- We have a loss function:
  Softmax: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
  SVM: L_i = Σ_{j≠y_i} max(0, s_j - s_{y_i} + 1)
  Generalized full loss: L = (1/N) Σ_i L_i + R(W)
Summary – Loss Functions
Loss is a function of W
- How can we find the best W for our score function?
Optimization
Naïve approach: random search (randomize W, keep the best)
Improves over time, but very slow
Another approach : Local random search
Optimization
Add a random perturbation: W' = W + δW
Better than weight randomization. But expensive and slow to converge.
Another approach : Following the gradient
Optimization
Derivative of a 1-dimensional function: df(x)/dx = lim_{h→0} [f(x + h) - f(x)] / h
In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.
The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.
Numerical gradient example:
current W: [0.34, -1.11, 0.78, …], loss 1.25347, gradient dL/dW = [?, ?, ?, …]
W + step in the first dimension: [0.34 + 0.0001, -1.11, 0.78, …], loss 1.25322
→ first gradient component: (1.25322 - 1.25347) / 0.0001 = -2.5
W + step in the second dimension: [0.34, -1.11 + 0.0001, 0.78, …], loss 1.25353
→ second gradient component: (1.25353 - 1.25347) / 0.0001 = 0.6
Repeating this for every dimension fills in the gradient [-2.5, 0.6, ?, …]
Optimization – Following gradient
Numerical evaluation of the gradient:
- Very slow
- Needs a separate evaluation for every dimension of W
- Only approximate
Optimization – Following gradient
Better: derive the analytic gradient as a function of W
- Exact
- Fast, once the expression is derived
- Error-prone; always verify with a gradient check against the numerical gradient
Optimization – Following gradient
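A minimal sketch of a numerical gradient and a gradient check against a toy analytic gradient (the helper names and toy loss are assumptions, not from the slides):

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    grad = np.zeros_like(W)
    flat_W, flat_g = W.reshape(-1), grad.reshape(-1)
    for i in range(flat_W.size):            # one loss evaluation per dimension -> slow
        old = flat_W[i]
        flat_W[i] = old + h
        loss_plus = loss_fn(W)
        flat_W[i] = old - h
        loss_minus = loss_fn(W)
        flat_W[i] = old
        flat_g[i] = (loss_plus - loss_minus) / (2 * h)   # centered difference
    return grad

# Gradient check against a toy loss whose analytic gradient is simply W.
W = np.random.randn(3, 4)
loss_fn = lambda W: 0.5 * np.sum(W ** 2)
num_grad = numerical_gradient(loss_fn, W)
rel_error = np.abs(num_grad - W).max() / np.abs(W).max()
print(rel_error)                             # should be tiny, ~1e-8 .. 1e-6
```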
Numerical evaluation was simple in the previous example. How about deriving the analytic gradient of all weights for this network?
Computational graph: the input x and the weights W feed a multiply node (*) that produces the scores s; the hinge loss on s and the regularization term R(W) are summed (+) to give the total loss L.
Better Idea: Computational graphs + Backpropagation
Backpropagation: a simple example
f(x, y, z) = (x + y) z, e.g. x = -2, y = 5, z = -4
Forward pass: q = x + y = 3, f = q·z = -12
Want: ∂f/∂x, ∂f/∂y, ∂f/∂z
Local gradients: ∂q/∂x = 1, ∂q/∂y = 1, ∂f/∂q = z, ∂f/∂z = q
Chain rule: ∂f/∂x = (∂f/∂q)·(∂q/∂x), i.e. [upstream gradient] × [local gradient]
Backward pass: ∂f/∂f = 1; ∂f/∂z = q = 3; ∂f/∂q = z = -4; ∂f/∂x = -4·1 = -4; ∂f/∂y = -4·1 = -4
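A quick check of the hand-computed gradients with PyTorch autograd (assuming the f(x, y, z) = (x + y)·z example above):

```python
import torch

x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

q = x + y            # forward: q = 3
f = q * z            # forward: f = -12
f.backward()         # backward pass through the computational graph

print(x.grad, y.grad, z.grad)   # tensor(-4.), tensor(-4.), tensor(3.)
```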
In general, each node f in the computational graph receives an upstream gradient (how much the node's output influences the loss), multiplies it by its local gradients (how much each input influences the node's output), and passes the resulting downstream gradients on to its inputs.
Another example: f(w, x) = 1 / (1 + e^(-(w0·x0 + w1·x1 + w2)))
Working backward node by node, every step is [upstream gradient] × [local gradient]. At an add node the upstream gradient passes unchanged to both inputs, e.g. [0.2] × [1] = 0.2 (both inputs!). At a multiply node the inputs swap roles, e.g. x0: [0.2] × [2] = 0.4 and w0: [0.2] × [-1] = -0.2.
Sigmoid: the final chain of nodes computes the sigmoid function σ(x) = 1 / (1 + e^(-x)), whose local gradient has the convenient form dσ(x)/dx = (1 - σ(x))·σ(x).
Collapsing those nodes into a single sigmoid gate: [upstream gradient] × [local gradient] = [1.00] × [(1 - 0.73)·(0.73)] ≈ 0.2
The computational graph representation may not be unique. Choose one where the local gradients at each node can be easily expressed!
Patterns in gradient flow
- add gate: gradient distributor. E.g. inputs 3 and 4, output 7; the upstream gradient 2 is passed unchanged to both inputs.
- mul gate: "swap multiplier". E.g. inputs 2 and 3, output 6; with upstream gradient 5, the input 2 receives 5·3 = 15 and the input 3 receives 5·2 = 10.
- copy gate: gradient adder. E.g. the value 7 is copied to two branches; the upstream gradients 4 and 2 are summed: 4 + 2 = 6.
- max gate: gradient router. E.g. inputs 4 and 5, output 5; the upstream gradient 9 is routed entirely to the larger input (5), and 0 to the other.
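A quick autograd check of these gate patterns, using the same example numbers (an assumed demo, not from the slides):

```python
import torch

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(4.0, requires_grad=True)
(a + b).backward(torch.tensor(2.0))              # add gate, upstream gradient 2
print(a.grad, b.grad)                             # 2, 2  (distributor)

c = torch.tensor(2.0, requires_grad=True)
d = torch.tensor(3.0, requires_grad=True)
(c * d).backward(torch.tensor(5.0))              # mul gate, upstream gradient 5
print(c.grad, d.grad)                             # 15, 10 (swap multiplier)

e = torch.tensor(4.0, requires_grad=True)
f = torch.tensor(5.0, requires_grad=True)
torch.maximum(e, f).backward(torch.tensor(9.0))  # max gate, upstream gradient 9
print(e.grad, f.grad)                             # 0, 9  (router)
```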
Backprop Implementation: "flat" code
Forward pass: compute the output, storing the intermediate values of every node.
Backward pass: compute the gradients by revisiting the same nodes in reverse order, starting from the base case dL/dL = 1 and multiplying the upstream gradient by the local gradient of each gate (sigmoid, add, multiply) along the way.
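A minimal "flat" sketch of this forward/backward pattern for the sigmoid example above; the concrete weight and input values are assumed for illustration:

```python
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0   # assumed example values

# Forward pass: compute output, keeping intermediates
s0 = w0 * x0          # multiply gate
s1 = w1 * x1          # multiply gate
s2 = s0 + s1          # add gate
s3 = s2 + w2          # add gate
f  = 1.0 / (1.0 + math.exp(-s3))   # sigmoid gate, f ≈ 0.73

# Backward pass: compute grads in reverse order
df  = 1.0                       # base case
ds3 = df * (1 - f) * f          # sigmoid local gradient
ds2 = ds3                       # add gate distributes
dw2 = ds3
ds0 = ds2                       # add gate distributes
ds1 = ds2
dw0 = ds0 * x0                  # multiply gate swaps inputs
dx0 = ds0 * w0
dw1 = ds1 * x1
dx1 = ds1 * w1
print(f, dw0, dx0)              # ≈ 0.73, -0.2, 0.4
```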
Backprop Implementation: Modularized API
Graph (or Net) object (rough pseudo-code): forward() runs the nodes in topological order, backward() runs them in reverse order.
Example node: z = x * y (x, y, z are scalars)
Modularized implementation: forward / backward API
Need to stash some values for use in backward
Gate / Node / Function object (the actual PyTorch code follows this pattern): backward() takes the upstream gradient and multiplies it by the stashed local gradients.
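A rough sketch of such a gate object with a forward/backward API (pseudo-code in the spirit of the slides, not the lecture's exact code):

```python
class MultiplyGate:
    def forward(self, x, y):
        # Stash the inputs: they are the local gradients needed in backward
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # [downstream] = [upstream dz] * [local gradient], with the inputs swapped
        dx = dz * self.y
        dy = dz * self.x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(2.0, 3.0)      # 6.0
dx, dy = gate.backward(5.0)     # 15.0, 10.0 -- matches the mul-gate pattern above
```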
Example: PyTorch operators follow the same pattern. The PyTorch sigmoid layer, for instance, defines a forward function and a matching backward function that reuses the stored output.
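As a hedged illustration of the same pattern using PyTorch's public torch.autograd.Function API (not PyTorch's internal implementation):

```python
import torch

class Sigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = 1.0 / (1.0 + torch.exp(-x))
        ctx.save_for_backward(y)             # stash for use in backward
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * (1.0 - y) * y   # upstream * local gradient

x = torch.randn(4, requires_grad=True)
Sigmoid.apply(x).sum().backward()
print(torch.allclose(x.grad, torch.sigmoid(x) * (1 - torch.sigmoid(x))))  # True
```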
So far: backprop with scalars
What about vector-valued functions?
Vector derivatives
Scalar to scalar: regular derivative dy/dx ∈ R. If x changes by a small amount, how much will y change?
Vector to scalar: the derivative is the gradient dy/dx ∈ R^N. For each element of x, if it changes by a small amount, how much will y change?
Vector to vector: the derivative is the Jacobian dy/dx ∈ R^(N×M). For each element of x, if it changes by a small amount, how much will each element of y change?
Inputs x ∈ R^Dx and y ∈ R^Dy enter a node f that produces z ∈ R^Dz; the loss L is still a scalar!
Upstream gradient: dL/dz, a vector of size Dz. For each element of z, how much does it influence L?
Local gradients: Jacobian matrices dz/dx of size [Dx × Dz] and dz/dy of size [Dy × Dz].
Downstream gradients (matrix-vector multiply): dL/dx = (dz/dx)(dL/dz) of size Dx, and dL/dy = (dz/dy)(dL/dz) of size Dy.
Backprop with Vectors
Example: f(x) = max(0, x) (elementwise ReLU)
4D input x: [ 1, -2, 3, -1 ] → 4D output y: [ 1, 0, 3, 0 ]
Upstream gradient, 4D dL/dy: [ 4, -1, 5, 9 ]
Jacobian dy/dx (1 on the diagonal where x > 0, 0 elsewhere):
[ 1 0 0 0 ]
[ 0 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 0 ]
4D dL/dx = [dy/dx][dL/dy] = [ 4, 0, 5, 0 ]
The Jacobian is sparse: off-diagonal entries are always zero! Never explicitly form the Jacobian; instead use implicit multiplication.
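A minimal sketch of this implicit multiplication for the ReLU example (assumed example code):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
y = np.maximum(0.0, x)                  # forward: [1, 0, 3, 0]

dL_dy = np.array([4.0, -1.0, 5.0, 9.0]) # upstream gradient
dL_dx = dL_dy * (x > 0)                 # zero out entries where the input was <= 0
print(dL_dx)                            # [4, 0, 5, 0]
```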
Backprop with Matrices (or Tensors)
Inputs are now matrices x of shape [Dx×Mx] and y of shape [Dy×My]; the node f produces z of shape [Dz×Mz]; the loss L is still a scalar!
Upstream gradient: dL/dz of shape [Dz×Mz]. For each element of z, how much does it influence L?
Local gradients are generalized Jacobian matrices, e.g. dz/dx of shape [(Dx×Mx)×(Dz×Mz)] and dz/dy of shape [(Dy×My)×(Dz×Mz)]. For each element of x (or y), how much does it influence each element of z?
Downstream gradients follow from the matrix-vector multiply with these Jacobians: dL/dx has shape [Dx×Mx] and dL/dy has shape [Dy×My].
dL/dx always has the same shape as x!
Backprop with Matrices: matrix multiply y = xw
x: [N×D] = [ 2 1 -3 ; -3 4 2 ]
w: [D×M] = [ 3 2 1 -1 ; 2 1 3 2 ; 3 2 1 -2 ]
y: [N×M] = [ 13 9 -2 -6 ; 5 2 17 1 ]
Upstream gradient dL/dy: [N×M] = [ 2 3 -3 9 ; -8 1 4 6 ]
Explicit Jacobians would be dy/dx: [(N×D)×(N×M)] and dy/dw: [(D×M)×(N×M)]. For a neural net we may have N = 64 and D = M = 4096: each Jacobian takes 256 GB of memory! We must work with them implicitly.
Q: What parts of y are affected by one element of x?
A: x_{n,d} affects the whole row y_{n,·}
Q: How much does x_{n,d} affect y_{n,m}?
A: by w_{d,m}
Putting this together: dL/dx = (dL/dy) wᵀ, with shapes [N×D] = [N×M][M×D]
By similar logic: dL/dw = xᵀ (dL/dy), with shapes [D×M] = [D×N][N×M]
Backprop with Matrices
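A minimal sketch of these implicit formulas (assumed example code):

```python
import numpy as np

N, D, M = 2, 3, 4
x = np.random.randn(N, D)
w = np.random.randn(D, M)

y = x @ w                       # forward: [N×M]
dL_dy = np.random.randn(N, M)   # upstream gradient, same shape as y

dL_dx = dL_dy @ w.T             # [N×D] = [N×M][M×D]
dL_dw = x.T @ dL_dy             # [D×M] = [D×N][N×M]
assert dL_dx.shape == x.shape and dL_dw.shape == w.shape   # grads match input shapes
```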
Optimization - Gradient Descent (GD)
The most common optimization algorithm in deep learning:
- First-order optimization: uses only first derivatives
- The step size taken along the negative gradient is the learning rate
- Variants: Batch, Minibatch and Stochastic GD (all samples, a batch of samples, or a single sample per update)
Batch Gradient Descent
Uses all samples (the entire dataset) for each loss and gradient calculation:
- Expensive, but less noisy
- Stable loss gradient
- Can converge to a local minimum
Minibatch Gradient Descent
The full sum is expensive when N is large!
Approximate the sum using a minibatch of examples
Commonly uses minibatches of 32, 64 or 128 samples
Stochastic Gradient Descent (SGD)
Weights are updated after every sample:
- Minibatch gradient descent with batch size one
- Less commonly used, as vectorized operations over minibatches are more efficient
- In practice, you can assume most SGD implementations use minibatches!
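A rough sketch of the overall loop (the loss_and_grad and data-sampling helpers are hypothetical, not from the slides):

```python
import numpy as np

def sgd(W, data, loss_and_grad, lr=1e-3, batch_size=64, num_steps=1000):
    for _ in range(num_steps):
        idx = np.random.choice(len(data), batch_size)        # sample a minibatch
        loss, dW = loss_and_grad(W, [data[i] for i in idx])  # loss and gradient on the batch
        W = W - lr * dW                                      # step along the negative gradient
    return W
```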
SGD
- Build up "velocity" as a running mean of gradients
- Rho gives "friction"; typically rho = 0.9 or 0.99
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
SGD+Momentum
Momentum update: the actual step combines the gradient at the current point with the velocity.
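A minimal sketch of the momentum update (grad_fn is a hypothetical gradient function):

```python
import numpy as np

def sgd_momentum(W, grad_fn, lr=1e-2, rho=0.9, num_steps=1000):
    v = np.zeros_like(W)
    for _ in range(num_steps):
        dW = grad_fn(W)
        v = rho * v + dW        # build up velocity as a running mean of gradients
        W = W - lr * v          # step along the negative velocity
    return W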
Nesterov Momentum
Plain momentum combines the gradient at the current point with the velocity to get the step used to update the weights; Nesterov momentum instead evaluates the gradient at the point the velocity would carry us to.
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983 Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction
Nesterov Momentum
We want the update expressed in terms of the look-ahead point; change of variables and rearrange.
Nesterov Momentum
Gradient Descent - Challenges
- First-order: curvature information is not used
- Struggles near saddle points
- No independent parameter updates: the same learning rate is applied to all parameters
- AdaGrad adds element-wise scaling of the gradient based on the historical sum of squared gradients in each dimension
- This gives "per-parameter learning rates" or "adaptive learning rates"
Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
AdaGrad
RMSProp ("leaky AdaGrad"): replace AdaGrad's running sum of squared gradients with an exponentially decaying average (Tieleman and Hinton, 2012)
Kingma and Ba, “Adam: Amethod for stochastic optimization”, ICLR 2015
Adam combines:
- Momentum: a first-moment estimate (running average) of the gradients
- AdaGrad / RMSProp: a second-moment estimate (decaying average of squared gradients)
- Bias correction, for the fact that the first and second moment estimates start at zero
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
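A minimal sketch of the Adam update combining these three pieces (grad_fn is a hypothetical gradient function):

```python
import numpy as np

def adam(W, grad_fn, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=1000):
    m = np.zeros_like(W)            # first moment (momentum)
    v = np.zeros_like(W)            # second moment (AdaGrad / RMSProp style)
    for t in range(1, num_steps + 1):
        dW = grad_fn(W)
        m = beta1 * m + (1 - beta1) * dW
        v = beta2 * v + (1 - beta2) * dW * dW
        m_hat = m / (1 - beta1 ** t)     # bias correction: the moments start at zero
        v_hat = v / (1 - beta2 ** t)
        W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W
```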
Adam
Comparison of optimizer trajectories (SGD, SGD+Momentum, RMSProp, Adam), including their saddle point behaviour.
Recall Regularization
In common use:
- L2 regularization (weight decay)
- L1 regularization
- Elastic net (L1 + L2)
Regularization: Dropout
In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR2014
Regularization: Dropout. Example forward pass with a 3-layer network using dropout (sketch below).
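A minimal sketch of such a forward pass (assumed example code, in the spirit of the slide):

```python
import numpy as np

p_drop = 0.5   # probability of dropping a unit (hyperparameter)

def train_forward(X, W1, W2, W3):
    H1 = np.maximum(0, X @ W1)
    U1 = np.random.rand(*H1.shape) >= p_drop   # keep each unit with prob 1 - p_drop
    H1 = H1 * U1                               # drop!
    H2 = np.maximum(0, H1 @ W2)
    U2 = np.random.rand(*H2.shape) >= p_drop   # second dropout mask
    H2 = H2 * U2                               # drop!
    return H2 @ W3                             # output scores
```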
Dropout: Test time
Dropout makes our output random! y = f_W(x, z), where x is the input (image), y the output (label), and z a random mask.
We want to "average out" the randomness at test time: y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
Dropout: Test time
Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1·x + w2·y
During training (each input dropped with probability 1/2) we have:
E[a] = ¼(w1·x + w2·y) + ¼(w1·x + 0·y) + ¼(0·x + w2·y) + ¼(0·x + 0·y) = ½(w1·x + w2·y)
At test time, multiply by the dropout probability (here ½) so that the test-time output matches the expected training-time output.
Dropout: Test time
At test time all neurons are always active => we must scale the activations so that, for each neuron: output at test time = expected output at training time.
Dropout Summary
drop in forward pass
scale at test time
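A minimal sketch of the matching test-time pass (assumed example code): all neurons stay active and activations are scaled by the keep probability.

```python
import numpy as np

def test_forward(X, W1, W2, W3, keep_prob=0.5):
    H1 = np.maximum(0, X @ W1) * keep_prob   # scale at test time
    H2 = np.maximum(0, H1 @ W2) * keep_prob
    return H2 @ W3
```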
Regularization: A common pattern
Training: add some kind of randomness
Testing: average out the randomness (sometimes approximately)
Example: Batch Normalization
Training: normalize using statistics from random minibatches
Testing: use fixed statistics to normalize
Regularization: Data Augmentation
Pipeline: load an image and its label ("cat") → randomly transform the image → feed it through the CNN and compute the loss.
Data Augmentation : Horizontal Flips
Data Augmentation: Random crops and scales
Training: sample random crops / scales. ResNet:
1. Pick a random L in the range [256, 480]
2. Resize the training image so its short side = L
3. Sample a random 224 x 224 patch
Testing: average over a fixed set of crops. ResNet:
1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 crops of 224 x 224: 4 corners + center, plus flips
Data Augmentation: Color Jitter
Simple: randomize contrast and brightness
More complex:
1. Apply PCA to all [R, G, B] pixel values in the training set
2. Sample a "color offset" along the principal component directions
3. Add the offset to all pixels of a training image
(As seen in [Krizhevsky et al. 2012], ResNet, etc.)
Regularization: A common pattern
Training: add random noise
Testing: marginalize over the noise
Examples: Dropout, Batch Normalization, Data Augmentation
Regularization: DropConnect
Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013
Training: drop connections between neurons (set weights to 0)
Testing: use all the connections
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect
Regularization: Fractional Pooling
Training: use randomized pooling regions
Testing: average predictions from several regions
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling
Graham, “Fractional Max Pooling”, arXiv 2014
Regularization: Stochastic Depth
Training: skip some layers in the network
Testing: use all the layers
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
Regularization: Cutout
Training: set random image regions to zero
Testing: use the full image
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout
DeVries and Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout”, arXiv 2017
Works very well for small datasets like CIFAR, less common for large datasets like ImageNet
Regularization: Mixup
Training: train on random blends of images
Testing: use the original images
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup
Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018
Randomly blend the pixels of pairs of training images,e.g. 40% cat, 60% dog
CNN target label: cat: 0.4, dog: 0.6
Regularization summary
Training: add random noise
Testing: marginalize over the noise
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup
- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Try to train to 100% training accuracy on a small sample of training data (~5-10 minibatches); fiddle with architecture, learning rate, weight initialization
Loss not going down? LR too low, bad initialization
Loss explodes to Inf or NaN? LR too high, bad initialization
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Use the architecture from the previous step, use all training data, turn on small weight decay, find a learning rate that makes the loss drop significantly within ~100 iterations
Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Choose a few values of learning rate and weight decay around what worked from Step 3, train a few models for ~1-5 epochs.
Good weight decay to try: 1e-4, 1e-5, 0
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Pick best models from Step 4, train them for longer (~10-20 epochs) without learning rate decay
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss curves
Losses may be noisy, use a scatter plot and also plot moving average to see trends better
Loss curves: plot the training loss and the train / val accuracy over time
- Bad initialization: the loss barely moves at the start of training
- Loss plateaus: try learning rate decay
- Learning rate step decay: if the loss was still going down when the learning rate dropped, you decayed too early!
- Train accuracy still going up: you need to train longer
- Huge train / val accuracy gap means overfitting! Increase regularization, get more data
- No gap between train / val accuracy means underfitting: train longer, use a bigger model
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss curves
Step 7: GOTO step 5
(Assume X [NxD] is data matrix, each example in a row)
Data Preprocessing
Data Preprocessing
In practice, you may also see PCA (data has a diagonal covariance matrix) and whitening (covariance matrix is the identity matrix) of the data.
Data Preprocessing
Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize
After normalization: less sensitive to small changes in weights; easier to optimize
In practice for images: center only. E.g. consider CIFAR-10 with [32,32,3] images:
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract the per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
- Subtract the per-channel mean and divide by the per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)
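A minimal sketch of these options (assumed example code, shapes chosen to match CIFAR-10):

```python
import numpy as np

X = np.random.rand(50000, 32, 32, 3).astype(np.float32)   # stand-in for the training set

mean_image = X.mean(axis=0)                       # [32, 32, 3] array (AlexNet style)
X_centered = X - mean_image

channel_mean = X.mean(axis=(0, 1, 2))             # 3 numbers (VGGNet style)
channel_std = X.std(axis=(0, 1, 2))               # 3 numbers
X_normalized = (X - channel_mean) / channel_std   # ResNet style
```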
Weight Initialization
- Q: What happens when a constant initialization (W = const) is used?
Weight Initialization
- First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
Works for small networks, but causes problems with deeper networks.
Weight Initialization
Weight Initialization: Activation statistics
Forward pass for a 6-layer net with hidden size 4096 (small random weights, std 0.01):
All activations tend to zero for the deeper layers of the network.
Q: What do the gradients dL/dW look like?
A: All zero, no learning.
Weight Initialization: Activation statistics
Increase the std of the initial weights from 0.01 to 0.05:
All activations saturate.
Q: What do the gradients look like?
A: Local gradients are all zero, no learning.
Weight Initialization: "Xavier" Initialization
"Xavier" initialization: std = 1/sqrt(Din)
Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010
Activations are nicely scaled for all layers!
For conv layers, Din is kernel_size² * input_channels
Derivation: y = Wx, h = f(y)
Var(y_i) = Din * Var(x_i w_i)                          [assume x, w are iid]
         = Din * (E[x_i²] E[w_i²] - E[x_i]² E[w_i]²)   [assume x, w independent]
         = Din * Var(x_i) * Var(w_i)                   [assume x, w are zero-mean]
If Var(w_i) = 1/Din, then Var(y_i) = Var(x_i).
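A minimal sketch of this initialization (assumed example code):

```python
import numpy as np

# "Xavier" initialization for a fully-connected layer: std = 1/sqrt(Din)
Din, Dout = 4096, 4096
W_fc = np.random.randn(Din, Dout) / np.sqrt(Din)

# For a conv layer, Din = kernel_size^2 * input_channels
kernel_size, in_channels, out_channels = 3, 64, 128
fan_in = kernel_size ** 2 * in_channels
W_conv = np.random.randn(out_channels, in_channels, kernel_size, kernel_size) / np.sqrt(fan_in)
```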
Batch Normalization
Batch Normalization [Ioffe and Szegedy, 2015]
Consider a batch of activations at some layer. To make each dimension zero-mean and unit-variance, apply:
x̂ = (x - E[x]) / sqrt(Var[x])
This is a vanilla differentiable function...
Input x: shape N x D; per-channel mean: shape D; per-channel variance: shape D; normalized x̂: shape N x D.
Batch Normalization [Ioffe and Szegedy, 2015]
X has shape N x D: the mean and variance are computed over the batch dimension N, separately for each of the D channels.
Problem: is zero-mean, unit variance too hard of a constraint?
Solution: add learnable scale and shift parameters γ and β (each of shape D).
Output: y = γ·x̂ + β, shape N x D.
Learning γ = σ and β = μ will recover the identity function!
Estimates depend on minibatch; can’t do this at test-time!
Batch Normalization: Test-Time
The learnable scale and shift parameters γ, β are kept, but the per-channel mean and variance are replaced by (running) averages of the values seen during training.
Output: y = γ·(x - μ_running)/σ_running + β, shape N x D.
During testing batchnorm becomes a linear operator! Can be fused with the previous fully-connected or conv layer
Batch Normalization for ConvNets
Batch Normalization for fully-connected networks: x: N × D; normalize over N; μ, σ: 1 × D; γ, β: 1 × D; y = γ(x-μ)/σ + β
Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D): x: N × C × H × W; normalize over N, H, W; μ, σ: 1 × C × 1 × 1; γ, β: 1 × C × 1 × 1; y = γ(x-μ)/σ + β
Batch Normalization [Ioffe and Szegedy, 2015]
... → FC → BN → tanh → FC → BN → tanh → ...
Usually inserted after fully-connected or convolutional layers, and before the nonlinearity.
Batch Normalization [Ioffe and Szegedy, 2015]
- Makes deep networks much easier to train!
- Improves gradient flow
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test time: can be fused with the preceding conv layer!
- Behaves differently during training and testing: this is a very common source of bugs!
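A minimal usage sketch with standard PyTorch modules (an assumed example, not from the slides):

```python
import torch.nn as nn

# Batchnorm inserted after the linear/conv layer and before the nonlinearity.
fc_net = nn.Sequential(
    nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.Tanh(),
    nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.Tanh(),
)

conv_net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
)

# Training and testing behave differently (batch stats vs. running stats) --
# a very common source of bugs; switch modes explicitly:
fc_net.train()   # use minibatch statistics
fc_net.eval()    # use running-average statistics
```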
Layer Normalization
Batch Normalization for fully-connected networks: x: N × D; normalize over N; μ, σ: 1 × D; γ, β: 1 × D; y = γ(x-μ)/σ + β
Layer Normalization for fully-connected networks: x: N × D; normalize over D; μ, σ: N × 1; γ, β: 1 × D; y = γ(x-μ)/σ + β
Same behavior at train and test! Can be used in recurrent networks.
Ba, Kiros, and Hinton, "Layer Normalization", arXiv 2016
Instance Normalization
Batch Normalization for convolutional networks: x: N × C × H × W; normalize over N, H, W; μ, σ: 1 × C × 1 × 1; γ, β: 1 × C × 1 × 1; y = γ(x-μ)/σ + β
Instance Normalization for convolutional networks: x: N × C × H × W; normalize over H, W; μ, σ: N × C × 1 × 1; γ, β: 1 × C × 1 × 1; y = γ(x-μ)/σ + β
Same behavior at train / test!
Ulyanov et al, "Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis", CVPR 2017
Comparison of Normalization Layers
Group Normalization (Wu and He, "Group Normalization", ECCV 2018)