
Page 1:

Training Neural Networks: Practical Issues

M. Soleymani

Sharif University of Technology

Fall 2017

Most slides have been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017, and some from Andrew Ng's lectures, “Deep Learning Specialization”, Coursera, 2017.

Page 2:

Outline

• Activation Functions

• Data Preprocessing

• Weight Initialization

• Checking gradients

Page 3:

Neuron

Page 4:

Activation functions

Page 5:

Activation functions: sigmoid

• Squashes numbers to range [0,1]

• Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

Problems:
1. Saturated neurons “kill” the gradients

Page 6:

Activation functions: sigmoid

• What happens when x = -10?

• What happens when x = 0?

• What happens when x = 10?
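A small numeric sketch (not from the slides) that answers the three questions above by evaluating the sigmoid and its local gradient:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # derivative of the sigmoid

for x in (-10.0, 0.0, 10.0):
    print(f"x = {x:+.0f}: sigmoid = {sigmoid(x):.5f}, local gradient = {sigmoid_grad(x):.2e}")

# At x = -10 and x = +10 the local gradient is ~4.5e-5, so almost nothing flows
# backward ("saturated neurons kill the gradients"); at x = 0 it is at its maximum, 0.25.
```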

Page 7:

Activation functions: sigmoid

• Squashes numbers to range [0,1]

• Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

Problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered

Page 8:

Activation functions: sigmoid

• Consider what happens when the input to a neuron (x) is always positive:

$f\left(\sum_i w_i x_i + b\right)$

• What can we say about the gradients on w?

Page 9:

Activation functions: sigmoid

• Consider what happens when the input to a neuron is always positive...

• What can we say about the gradients on w? Always all positive or all negative
– since $\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial f}\, f'(z)\, x_i$, every component of the gradient on $w$ shares the sign of the upstream gradient when all $x_i > 0$, so updates can only zig-zag

$f\left(\sum_i w_i x_i + b\right)$

Page 10:

Activation functions: sigmoid

• Squashes numbers to range [0,1]

• Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit computationally expensive

Page 11:

Activation functions: tanh

• Squashes numbers to range [-1,1]

• zero centered (nice)

• still kills gradients when saturated :(

[LeCun et al., 1991]

$\tanh(x) = \dfrac{1 - e^{-2x}}{1 + e^{-2x}} = 2\sigma(2x) - 1$

Page 12:

Activation functions: ReLU (Rectified Linear Unit)

• Does not saturate (in + region)

• Very computationally efficient

• Converges much faster than sigmoid/tanh in practice (e.g. 6x)

• Actually more biologically plausible than sigmoid

[Krizhevsky et al., 2012]

𝑓(𝑥) = max(0, 𝑥)

Page 13:

Activation functions: ReLU

• What happens when x = -10?

• What happens when x = 0?

• What happens when x = 10?

Page 14:

Activation functions: ReLU

• Does not saturate (in + region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)
• Actually more biologically plausible than sigmoid

• Problems:
– Not zero-centered output
– An annoyance: what is the gradient when x < 0? (it is zero, so a ReLU whose input is always negative never updates, a “dead ReLU”)

[Krizhevsky et al., 2012]

𝑓(𝑥) = max(0, 𝑥)

Page 15:

Activation functions: ReLU

=> people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01), so that units are more likely to start in the active region and not come out “dead”

Page 16:

Activation functions: Leaky ReLU

• Does not saturate

• Computationally efficient

• Converges much faster than sigmoid/tanh in practice! (e.g. 6x)

• will not “die”.

[Maas et al., 2013] [He et al., 2015]

Parametric Rectifier (PReLU)

𝛼 is a learnable parameter, trained by backpropagation along with the weights (see the sketch below)
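A sketch of both variants in their standard forms (the slide itself does not write them out): Leaky ReLU uses a small fixed slope for negative inputs, while PReLU makes that slope a learned parameter.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: small fixed slope alpha for x < 0, so the unit never fully "dies"
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # PReLU: same form, but alpha is a parameter learned during backpropagation
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))          # [-0.02  -0.005  0.     1.5 ]
print(prelu(x, alpha=0.25))   # [-0.5   -0.125  0.     1.5 ]
```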

Page 17:

Activation Functions: Exponential Linear Units (ELU)

[Clevert et al., 2015]

• All benefits of ReLU

• Closer to zero mean outputs

• Negative saturation regime compared with Leaky ReLU adds some robustness to noise

• Computation requires exp()
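The slide does not write out the ELU itself; for reference, a sketch of the form from Clevert et al. (2015) with the common default α = 1:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Linear for x > 0 (keeps the ReLU benefits); saturates smoothly to -alpha
    # for very negative x, which pushes mean activations closer to zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-5.0, -1.0, 0.0, 2.0])))
# [-0.99326205 -0.63212056  0.          2.        ]
```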

Page 18:

Maxout “Neuron”

• Does not have the basic form of “dot product->nonlinearity”

• Generalizes ReLU and Leaky ReLU

• Linear Regime! Does not saturate! Does not die!

• Problem: doubles the number of parameters/neuron :(

[Goodfellow et al., 2013]
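A sketch of a maxout unit with k = 2 linear pieces (the shapes and values below are purely illustrative): it takes the max over two affine functions of the input, which is why the parameter count per neuron doubles.

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # max over two linear pieces; with W2 = 0 and b2 = 0 it reduces to ReLU
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((3, 4)), np.zeros(3)
print(maxout(x, W1, b1, W2, b2))   # 3 maxout outputs
```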

Page 19:

Activation functions: TLDR

• In practice:
– Use ReLU. Be careful with your learning rates

– Try out Leaky ReLU / Maxout / ELU

– Try out tanh but don’t expect much

– Don’t use sigmoid

Page 20:

Data Preprocessing

Page 21:

Normalizing the input

• On the training set, compute the mean and variance of each input feature:

– $\mu_i = \dfrac{1}{N}\sum_{n=1}^{N} x_i^{(n)}$

– $\sigma_i^2 = \dfrac{1}{N}\sum_{n=1}^{N} \left(x_i^{(n)} - \mu_i\right)^2$

Page 22:

Normalizing the input

• On the training set, compute the mean and variance of each input feature:

– $\mu_i = \dfrac{1}{N}\sum_{n=1}^{N} x_i^{(n)}$

– $\sigma_i^2 = \dfrac{1}{N}\sum_{n=1}^{N} \left(x_i^{(n)} - \mu_i\right)^2$

• Remove mean: subtract from each input feature the mean of that feature
– $x_i \leftarrow x_i - \mu_i$

• Normalize variance:
– $x_i \leftarrow \dfrac{x_i}{\sigma_i}$

• If we normalize the training set, we use the same $\mu$ and $\sigma^2$ to normalize the test data too (see the sketch below)
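A minimal sketch of this recipe (variable names are illustrative; a small epsilon guards against division by zero for constant features):

```python
import numpy as np

def fit_normalizer(X_train):
    # statistics are computed on the training set only
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return mu, sigma

def apply_normalizer(X, mu, sigma, eps=1e-8):
    # the same mu / sigma are reused for validation and test data
    return (X - mu) / (sigma + eps)

X_train = np.random.randn(1000, 20) * 5.0 + 3.0
X_test = np.random.randn(200, 20) * 5.0 + 3.0
mu, sigma = fit_normalizer(X_train)
X_train_n = apply_normalizer(X_train, mu, sigma)
X_test_n = apply_normalizer(X_test, mu, sigma)   # note: training statistics
```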

Page 23:

Why zero-mean input?

• Reminder: sigmoid

• Consider what happens when the input to a neuron is always positive...

• What can we say about the gradients on w? Always all positive or all negative

– this is also why you want zero-mean data!

Page 24:

Why normalize inputs?

[Andrew Ng, Deep Neural Network, 2017]

© 2017 Coursera Inc.

Very different range of input features

Page 25:

Why normalize inputs?

[Andrew Ng, Deep Neural Network, 2017]

© 2017 Coursera Inc.

Speed up training

Page 26:

TLDR: In practice for Images: center only

• Example: consider CIFAR-10 example with [32,32,3] images

• Subtract the mean image (e.g. AlexNet)
– (mean image = [32,32,3] array)

• Subtract per-channel mean (e.g. VGGNet)
– (mean along each channel = 3 numbers)

For images it is not common to normalize the variance, or to do PCA or whitening. A sketch of the two centering options follows below.
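A sketch of the two centering options for a batch of [N, 32, 32, 3] images (array shapes and batch size are assumed for illustration):

```python
import numpy as np

images = np.random.randint(0, 256, size=(1000, 32, 32, 3)).astype(np.float32)

# Option 1 (AlexNet-style): subtract the mean image, a [32, 32, 3] array
mean_image = images.mean(axis=0)
centered_1 = images - mean_image

# Option 2 (VGGNet-style): subtract the per-channel mean, just 3 numbers
channel_mean = images.mean(axis=(0, 1, 2))
centered_2 = images - channel_mean
```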

Page 27:

Weight initialization

• How to choose the starting point for the iterative process of optimization

Page 28:

Weight initialization

• Initializing to zero (or to equal values) does not work: all neurons in a layer then compute the same output and receive the same gradient, so the symmetry is never broken

• First idea: small random numbers
– gaussian with zero mean and 1e-2 standard deviation

• Works ~okay for small networks, but problems with deeper networks.

Page 29:

Multi-layer network

$\text{Output} = a^{[L]} = f\!\left(z^{[L]}\right) = f\!\left(W^{[L]} a^{[L-1]}\right) = f\!\left(W^{[L]} f\!\left(W^{[L-1]} a^{[L-2]}\right)\right) = \cdots = f\!\left(W^{[L]} f\!\left(W^{[L-1]} \cdots f\!\left(W^{[2]} f\!\left(W^{[1]} x\right)\right)\right)\right)$

[Figure: forward-pass chain $x \xrightarrow{\times W^{[1]}} z^{[1]} \xrightarrow{f^{[1]}} a^{[1]} \xrightarrow{\times W^{[2]}} z^{[2]} \xrightarrow{f^{[2]}} \cdots \; a^{[L-1]} \xrightarrow{\times W^{[L]}} z^{[L]} \xrightarrow{f^{[L]}} a^{[L]} = \text{output}$]

Page 30:

Vanishing/exploding gradients

$\dfrac{\partial E_n}{\partial W^{[l]}} = \dfrac{\partial E_n}{\partial a^{[l]}} \times \dfrac{\partial a^{[l]}}{\partial W^{[l]}} = \delta^{[l]} \times a^{[l-1]} \times f'\!\left(z^{[l]}\right)$

$\delta^{[l-1]} = f'\!\left(z^{[l]}\right) \times W^{[l]} \times \delta^{[l]} = f'\!\left(z^{[l]}\right) \times W^{[l]} \times f'\!\left(z^{[l+1]}\right) \times W^{[l+1]} \times \delta^{[l+1]} = \cdots = f'\!\left(z^{[l]}\right) \times W^{[l]} \times f'\!\left(z^{[l+1]}\right) \times W^{[l+1]} \times f'\!\left(z^{[l+2]}\right) \times W^{[l+2]} \times \cdots \times f'\!\left(z^{[L]}\right) \times W^{[L]} \times \delta^{[L]}$

Page 31:

Vanishing/exploding gradients

$\delta^{[l-1]} = f'\!\left(z^{[l]}\right) \times W^{[l]} \times f'\!\left(z^{[l+1]}\right) \times W^{[l+1]} \times f'\!\left(z^{[l+2]}\right) \times W^{[l+2]} \times \cdots \times f'\!\left(z^{[L]}\right) \times W^{[L]} \times \delta^{[L]}$

For deep networks:

Large weights can cause exploding gradients

Small weights can cause vanishing gradients

Page 32:

Vanishing/exploding gradients

$\delta^{[l-1]} = f'\!\left(z^{[l]}\right) \times W^{[l]} \times f'\!\left(z^{[l+1]}\right) \times W^{[l+1]} \times f'\!\left(z^{[l+2]}\right) \times W^{[l+2]} \times \cdots \times f'\!\left(z^{[L]}\right) \times W^{[L]} \times \delta^{[L]}$

For deep networks:

Large weights can cause exploding gradients

Small weights can cause vanishing gradients

Example:

$W^{[1]} = \cdots = W^{[L]} = \omega I, \; f(z) = z \;\Rightarrow\; \delta^{[1]} = (\omega I)^{L-1}\,\delta^{[L]} = \omega^{L-1}\,\delta^{[L]}$
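Plugging numbers into this example (a small sketch with an assumed depth of L = 50):

```python
# delta[1] = omega**(L-1) * delta[L] for identity activations and W = omega * I
L = 50
for omega in (0.5, 1.0, 1.5):
    print(f"omega = {omega}: omega**(L-1) = {omega ** (L - 1):.3e}")
# omega = 0.5: ~1.8e-15  (gradient vanishes)
# omega = 1.0: 1.0       (gradient preserved)
# omega = 1.5: ~4.3e+08  (gradient explodes)
```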

Page 33:

Let's look at some activation statistics

• E.g. 10-layer net with 500 neurons on each layer
– using tanh non-linearities
– Initialization: gaussian with zero mean and 0.01 standard deviation (this experiment is sketched below)
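A sketch reproducing this experiment (layer width and init scale as stated above; everything else, such as the unit-gaussian input batch, is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 500                                      # 500 neurons per layer
act = rng.standard_normal((1000, D))         # assumed: unit-gaussian input batch
for layer in range(10):
    W = rng.standard_normal((D, D)) * 0.01   # gaussian, zero mean, std 0.01
    act = np.tanh(act @ W)
    print(f"layer {layer + 1}: mean {act.mean():+.5f}, std {act.std():.5f}")
# With std 0.01 the activations collapse toward zero in the deeper layers,
# so the gradients flowing back through the W*x products will also be tiny.
```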

Page 34:

Let's look at some activation statistics

All activations become zero!

Q: think about the backward pass. What do the gradients look like?

Hint: think about backward pass for a W*X gate.

Initialization: gaussian with zero mean and 0.01 standard deviation

Tanh activation function

Page 35:

Let's look at some activation statistics

Almost all neurons completely saturated, either -1 or 1. Gradients will be all zero.

Initialization: gaussian with zero mean and 1 standard deviation

Tanh activation function

Page 36:

Xavier initialization: intuition

• To have similar variances for the neurons' outputs, for neurons with a larger number of inputs we need to scale down the weight variance:

$z = w_1 x_1 + \cdots + w_r x_r$

– Thus, we scale down the weight variance when the fan-in is higher

• Thus, Xavier initialization can help to reduce the exploding and vanishing gradient problems

Page 37:

Xavier initialization

Reasonable initialization. (Mathematical derivation assumes linear activations)

[Glorot et al., 2010]

Initialization: gaussian with zero mean and $1/\sqrt{\text{fan\_in}}$ standard deviation
(fan_in for fully connected layers = number of neurons in the previous layer)

Tanh activation function
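Written as code, the recipe above is a one-line sketch (fan_in = 500 is just an example value):

```python
import numpy as np

fan_in, fan_out = 500, 500
# Xavier/Glorot initialization: std = 1 / sqrt(fan_in)
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
print(W.std())   # ~= 1 / sqrt(500) ~= 0.045
```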

Page 38:

Xavier initialization

but when using the ReLU nonlinearity it breaks.

Initialization: gaussian with zero mean and $1/\sqrt{\text{fan\_in}}$ standard deviation

ReLU activation function

Page 39:

Initialization: ReLU activation

[He et al., 2015] (note the additional /2)
Initialization: gaussian with zero mean and $\sqrt{2/\text{fan\_in}}$ (i.e. $1/\sqrt{\text{fan\_in}/2}$) standard deviation (sketched in code below)
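The corresponding He et al. (2015) recipe for ReLU layers as a NumPy sketch; note the extra factor of 2 relative to the Xavier sketch above:

```python
import numpy as np

fan_in, fan_out = 500, 500
# He initialization for ReLU: std = sqrt(2 / fan_in), i.e. 1 / sqrt(fan_in / 2)
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
print(W.std())   # ~= sqrt(2 / 500) ~= 0.063
```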

Page 40:

Initialization: ReLU activation

[He et al., 2015] (note the additional /2)
Initialization: gaussian with zero mean and $\sqrt{2/\text{fan\_in}}$ (i.e. $1/\sqrt{\text{fan\_in}/2}$) standard deviation

Page 41:

Proper initialization is an active area of research…

• Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010

• Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al., 2013

• Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014

• Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015

• Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015

• All you need is a good init by Mishkin and Matas, 2015

• ….

Page 42:

Check implementation of Backpropagation

• After implementing backprop we may need to check: “is it correct?”

• We need to identify the modules that may have a bug:
– Which elements of the approximate gradient vector are far from the corresponding ones in the exact gradient vector?

Page 43:

Check implementation of Backpropagation

[Figure: finite differences for $f(x) = x^4$, evaluated at $x - h$, $x$, and $x + h$]

One-sided difference $\dfrac{f(x+h) - f(x)}{h}$: error $O(h)$ when $h \to 0$

Central difference $\dfrac{f(x+h) - f(x-h)}{2h}$: error $O(h^2)$ when $h \to 0$

Page 44:

Check implementation of Backpropagation

With $f(x) = x^4$ and the central difference at $x = 3$, $h = 0.001$:

$\dfrac{f(3 + 0.001) - f(3 - 0.001)}{2 \times 0.001} = \dfrac{81.1080540 - 80.8920540}{0.002} = 108.000006$

Exact derivative: $f'(3) = 4 \times 3^3 = 108$

(Error $O(h^2)$ when $h \to 0$, versus $O(h)$ for the one-sided difference.)

Page 45:

Check gradients and debug backpropagation:

• $\theta$: contains all parameters of the network ($W^{[1]}, \dots, W^{[L]}$ vectorized and concatenated)

Page 46:

Check gradients and debug backpropagation:

• $\theta$: contains all parameters of the network ($W^{[1]}, \dots, W^{[L]}$ vectorized and concatenated)

• $J'[i] = \dfrac{\partial J}{\partial \theta_i}$, $i = 1, \dots, p$, is found via backprop

Page 47:

Check gradients and debug backpropagation:

• $\theta$: contains all parameters of the network ($W^{[1]}, \dots, W^{[L]}$ vectorized and concatenated)

• $J'[i] = \dfrac{\partial J}{\partial \theta_i}$, $i = 1, \dots, p$, is found via backprop

• For each $i$, compute the numerical approximation:

– $\widehat{J'}[i] = \dfrac{J(\theta_1, \dots, \theta_{i-1}, \theta_i + h, \theta_{i+1}, \dots, \theta_p) - J(\theta_1, \dots, \theta_{i-1}, \theta_i - h, \theta_{i+1}, \dots, \theta_p)}{2h}$

Page 48:

Check gradients and debug backpropagation:

• $\theta$: contains all parameters of the network ($W^{[1]}, \dots, W^{[L]}$ vectorized and concatenated)

• $J'[i] = \dfrac{\partial J}{\partial \theta_i}$, $i = 1, \dots, p$, is found via backprop

• For each $i$, compute the numerical approximation:

– $\widehat{J'}[i] = \dfrac{J(\theta_1, \dots, \theta_{i-1}, \theta_i + h, \theta_{i+1}, \dots, \theta_p) - J(\theta_1, \dots, \theta_{i-1}, \theta_i - h, \theta_{i+1}, \dots, \theta_p)}{2h}$

• Compute the relative error $\epsilon = \dfrac{\left\lVert J' - \widehat{J'} \right\rVert_2}{\left\lVert J' \right\rVert_2 + \left\lVert \widehat{J'} \right\rVert_2}$ (a sketch follows below)
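A compact sketch of the whole procedure on a toy objective (the analytic gradient here is written by hand; in practice it would come from your backprop code):

```python
import numpy as np

def numerical_gradient(J, theta, h=1e-7):
    # central differences, one parameter at a time
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += h
        theta_minus[i] -= h
        grad[i] = (J(theta_plus) - J(theta_minus)) / (2 * h)
    return grad

def relative_error(g_analytic, g_numeric):
    return np.linalg.norm(g_analytic - g_numeric) / (
        np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric))

# Toy check: J(theta) = sum(theta**4), whose exact gradient is 4 * theta**3
theta = np.random.randn(10)
J = lambda t: np.sum(t ** 4)
g_backprop = 4 * theta ** 3                      # stands in for the backprop result
eps = relative_error(g_backprop, numerical_gradient(J, theta))
print(eps)    # should be tiny, e.g. < 1e-7
```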

Page 49:

CheckGrad

• with e.g. $h = 10^{-7}$:
– $\epsilon < 10^{-7}$ (great)
– $\epsilon \approx 10^{-5}$ (may need a check)
• okay for objectives with kinks, but if there are no kinks then 1e-4 is too high
– $\epsilon \approx 10^{-3}$ (concern)
– $\epsilon > 10^{-2}$ (probably wrong)

• Use double precision and stick around the active range of floating point
• Be careful with the step size h (not so small that it causes underflow)
• Gradient check after a burn-in of training

• After running gradcheck at random initialization, run grad check again when you've trained for some number of iterations

Page 50:

Before learning: sanity checks Tips/Tricks

• Look for correct loss at chance performance.

• Increasing the regularization strength should increase the loss

• Network can overfit a tiny subset of data
– make sure you can achieve zero cost
– set regularization to zero

Page 51:

Initial loss

• Example: CIFAR-10
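For instance, assuming a softmax (cross-entropy) loss over the 10 CIFAR-10 classes with regularization turned off (the exact setup is not in the extracted slide), the expected loss at chance performance is:

```python
import numpy as np
# At chance the softmax assigns ~0.1 to the correct class,
# so the expected initial (unregularized) loss is -ln(0.1) ~= 2.303.
print(-np.log(1.0 / 10))
```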

Page 52:

Double check that the loss is reasonable

Page 53:

Primitive check of the whole network

• Make sure that you can overfit a very small portion of the training data
– Example:

• Take the first 20 examples from CIFAR-10

• turn off regularization

• use simple vanilla SGD

• Very small loss, train accuracy 1.00, nice!
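A self-contained sketch of this check (random arrays stand in for the 20 CIFAR-10 images, and the architecture, learning rate, and step count are all assumptions; the point is the procedure, not the exact numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
# 20 random inputs standing in for "the first 20 examples from CIFAR-10"
N, D, H, C = 20, 3072, 100, 10
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)

W1 = rng.standard_normal((D, H)) * np.sqrt(2.0 / D); b1 = np.zeros(H)
W2 = rng.standard_normal((H, C)) * np.sqrt(2.0 / H); b2 = np.zeros(C)

lr = 2e-2                                  # vanilla SGD, no regularization
for step in range(500):
    h_pre = X @ W1 + b1
    h = np.maximum(0.0, h_pre)             # ReLU
    scores = h @ W2 + b2
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores); p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(N), y]).mean()

    dscores = p.copy()
    dscores[np.arange(N), y] -= 1.0
    dscores /= N
    dW2, db2 = h.T @ dscores, dscores.sum(axis=0)
    dh = dscores @ W2.T
    dh[h_pre <= 0] = 0.0
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

acc = ((np.maximum(0.0, X @ W1 + b1) @ W2 + b2).argmax(axis=1) == y).mean()
print(f"final loss {loss:.4f}, train accuracy {acc:.2f}")
# If this tiny set cannot be driven to ~zero loss / 100% train accuracy,
# something in the model or training code is broken.
```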

Page 54:

Resource

• Please see the following notes:
– http://cs231n.github.io/neural-networks-2/

– http://cs231n.github.io/neural-networks-3/