* Original slides borrowed from Andrej Karpathy and Li Fei-Fei, Stanford cs231n
Lecture 4: Backpropagation
and Neural Networks (part 1)
Tuesday January 31, 2017
Announcements!
- If you are adversely affected by the immigration ban, please talk to me about accommodations.
- Send in your paper choices by tonight.
- You should be able to run a Jupyter server on the Tufts network machines now:
  (deep-venv)> pip install --upgrade jupyter
- hw1 deadline in two days (Thurs Feb 2): don't forget to read the course notes.
- Redo the calculation of dL/dW for the hinge loss.
Python/Numpy of the Day
- y_pred = scores.argmax(axis=1)
  - takes the index of the highest score in each row, i.e. the predicted class per example
- inds = np.random.choice(X.shape[0], batch_size)
  - randomly selects N numbers in a range, useful for subsampling
- [:, np.newaxis]
  - reshapes arrays of size (N,) to size (N,1)
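A minimal sketch tying these three idioms together (the array shapes and names below are illustrative, not from the homework):

```python
import numpy as np

scores = np.random.randn(500, 10)     # fake scores: 500 examples, 10 classes
X = np.random.randn(500, 3072)        # fake data matrix

y_pred = scores.argmax(axis=1)        # predicted class index per row -> shape (500,)

batch_size = 64
inds = np.random.choice(X.shape[0], batch_size)   # 64 random row indices
X_batch = X[inds]                                  # minibatch, shape (64, 3072)

col = y_pred[:, np.newaxis]           # reshape (500,) -> (500, 1), e.g. for broadcasting
```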
Where we are...
We want: a scores function, the SVM loss, and the full objective = data loss + regularization.
Optimization
(image credits to Alec Radford)
Gradient Descent
Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(
In practice: Derive analytic gradient, check your implementation with numerical gradient
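A minimal sketch of that gradient check using the centered-difference formula (the function and variable names here are illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered finite-difference estimate of df/dx at x (x is a numpy array)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)              # f(x + h)
        x[ix] = old - h
        fxmh = f(x)              # f(x - h)
        x[ix] = old              # restore original value
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# compare against your analytic gradient:
# rel_error = |g_num - g_ana| / max(|g_num|, |g_ana|); around 1e-7 or less is good
```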
Hinge Loss Gradient wrt Weights W
• We want the Jacobian matrix of all gradients:
• the partial derivatives of all output dimensions with respect to all input dimensions.
http://cs231n.github.io/optimization-1/#analytic
(Δ is the margin size, usually 1.0)
For the row of dW corresponding to the ground-truth class of a training instance, the gradient is -x_i scaled by the number of classes that violated the margin; for every other row j, it is x_i whenever class j violated the margin, and 0 otherwise.
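A sketch of one vectorized way to compute this (regularization omitted; it assumes X has shape (N, D), W has shape (D, C), and scores = X.dot(W), so the "rows" above become columns of dW):

```python
import numpy as np

def svm_loss_and_grad(W, X, y, delta=1.0):
    """Multiclass SVM (hinge) loss and gradient.
    W: (D, C) weights, X: (N, D) data, y: (N,) ground-truth class indices."""
    N = X.shape[0]
    scores = X.dot(W)                                   # (N, C)
    correct = scores[np.arange(N), y][:, np.newaxis]    # (N, 1)
    margins = np.maximum(0, scores - correct + delta)   # (N, C)
    margins[np.arange(N), y] = 0                        # don't count the true class
    loss = margins.sum() / N

    # gradient: +x_i for each violated margin, -(#violations) * x_i for the true class
    binary = (margins > 0).astype(float)                # (N, C)
    binary[np.arange(N), y] = -binary.sum(axis=1)
    dW = X.T.dot(binary) / N                            # (D, C)
    return loss, dW
```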
Softmax Loss Gradient wrt Scores s
eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
* note change of subscripts from last slide
Skipping some steps for space, please see the original notes.
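For reference, the standard result those skipped steps arrive at: with $p = \mathrm{softmax}(s)$ and cross-entropy loss $L = -\log p_y$,

$$\frac{\partial L}{\partial s_k} = p_k - \mathbf{1}[k = y], \qquad p_k = \frac{e^{s_k}}{\sum_j e^{s_j}}.$$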
Computational Graph
(figure: inputs x and W feed a multiply (*) node that produces the scores s; the hinge loss turns the scores into the data loss; a regularization node R(W) is added (+) to give the total loss L)
Convolutional Network (AlexNet)
(figure: the input image and the weights flow through the network's computational graph down to the loss)
Worked example: f(x, y, z) = (x + y) z, e.g. x = -2, y = 5, z = -4.
Want: the gradients ∂f/∂x, ∂f/∂y, ∂f/∂z.
Introduce the intermediate q = x + y, so f = q z. Locally, ∂q/∂x = 1, ∂q/∂y = 1, ∂f/∂q = z, ∂f/∂z = q.
Chain rule: ∂f/∂x = (∂f/∂q)(∂q/∂x) = z = -4, ∂f/∂y = (∂f/∂q)(∂q/∂y) = z = -4, and ∂f/∂z = q = 3.
(The slides step backward through the graph one node at a time, filling in these values.)
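A minimal sketch of that forward/backward sweep in plain Python (variable names mirror the graph above):

```python
# forward pass
x, y, z = -2, 5, -4
q = x + y           # q = 3
f = q * z           # f = -12

# backward pass, in reverse order
dfdz = q            # df/dz = q = 3
dfdq = z            # df/dq = z = -4
dfdx = 1.0 * dfdq   # chain rule: dq/dx * df/dq = -4
dfdy = 1.0 * dfdq   # chain rule: dq/dy * df/dq = -4
```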
A single gate f in the graph:
- forward pass: the gate computes its output from its inputs (the activations) and can immediately compute the "local gradient" of its output with respect to each input.
- backward pass: the gate receives the gradient of the loss with respect to its output, multiplies it by each local gradient (chain rule), and passes the resulting gradients back to its inputs.
Another example: a sigmoid neuron, f(w, x) = 1 / (1 + e^(-(w0·x0 + w1·x1 + w2))), backpropagated one elementary gate at a time. Sample steps read off the graph:
- through the *(-1) gate: (-1) * (-0.20) = 0.20
- through the + gate: [local gradient] x [its gradient] = [1] x [0.2] = 0.2 (both inputs!)
- through the * gate: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2
sigmoid function: σ(x) = 1 / (1 + e^(-x)), with the convenient derivative dσ/dx = (1 - σ(x)) σ(x).
sigmoid gate: the whole sigmoid expression can therefore be collapsed into a single gate; with a forward output of 0.73, its backward step is simply (0.73) * (1 - 0.73) ≈ 0.2.
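A minimal sketch of that example as staged forward/backward code, using the input values from the slide figure (a sketch, not the original slide code):

```python
import math

# inputs from the slide example
w = [2.0, -3.0, -3.0]   # w0, w1, w2 (w2 acts as the bias)
x = [-1.0, -2.0]        # x0, x1

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]     # = 1.0
f = 1.0 / (1.0 + math.exp(-dot))       # sigmoid -> ~0.73

# backward pass: treat the whole sigmoid as one gate
ddot = (1 - f) * f                     # local gradient of the sigmoid, ~0.2
dx = [w[0] * ddot, w[1] * ddot]        # backprop into x: [0.4, -0.6]
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # backprop into w: [-0.2, -0.4, 0.2]
```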
Patterns in backward flow
add gate: gradient distributor
max gate: gradient router
mul gate: gradient… “switcher”?
Gradients add at branches: when a variable feeds into multiple nodes of the graph, the gradients flowing back along each branch are summed (+) to form its total gradient.
Implementation: forward/backward API
Graph (or Net) object. (Rough pseudo code)
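A rough sketch in the spirit of that slide; the class and method names are illustrative pseudocode, not a real library API:

```python
# rough pseudocode: a Net/Graph object that owns the gates (nodes) in topological order
class ComputationalGraph(object):
    def forward(self, inputs):
        # pass the inputs to the input gates, then run every gate front-to-back
        for gate in self.nodes_topologically_sorted():
            gate.forward()
        return self.loss_gate.output

    def backward(self):
        # run every gate back-to-front, chaining gradients as we go
        for gate in reversed(self.nodes_topologically_sorted()):
            gate.backward()
        return self.input_gradients
```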
Implementation: forward/backward API
(figure: a single multiply gate * with inputs x, y and output z; x, y, z are scalars)
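A minimal sketch of what such a gate looks like in code (scalar case; the class and attribute names are illustrative):

```python
class MultiplyGate(object):
    def forward(self, x, y):
        # must remember the inputs: they are needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: local gradients (y and x) times the upstream gradient dz
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy
```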
Example: Torch Layers
(each layer is a "gate" that implements its own forward and backward pass)
Example: Torch MulConstant
- initialization: stores the constant
- forward(): multiplies the input by the constant
- backward(): multiplies the incoming gradient by the constant
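The Torch layer itself is written in Lua; a minimal Python/NumPy equivalent of the same idea (an illustrative sketch, not the actual Torch source) looks like:

```python
class MulConstant(object):
    def __init__(self, constant):
        self.constant = constant              # initialization: remember the constant

    def forward(self, x):
        return x * self.constant              # forward(): scale the input

    def backward(self, grad_output):
        return grad_output * self.constant    # backward(): scale the incoming gradient
```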
Example: Caffe Layers
Caffe Sigmoid Layer: the backward pass multiplies the local sigmoid derivative by top_diff, the gradient arriving from above (chain rule).
Gradients for vectorized code
When x, y, z are now vectors, the "local gradient" of a gate f becomes a Jacobian matrix: the derivative of each element of z with respect to each element of x.
Vectorized operations
f(x) = max(0, x) (elementwise), with a 4096-d input vector and a 4096-d output vector.
Q: what is the size of the Jacobian matrix? [4096 x 4096!]
Q2: what does it look like? (It is diagonal: each output element depends only on the corresponding input element, and each diagonal entry is 1 where the input is positive and 0 otherwise.)
Vectorized operations
f(x) = max(0, x) (elementwise), now applied to 100 4096-d input vectors producing 100 4096-d output vectors.
In practice we process an entire minibatch (e.g. 100 examples) at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\
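The good news is that you never materialize that Jacobian: because it is diagonal, multiplying by it collapses to an elementwise mask. A minimal sketch (illustrative names):

```python
import numpy as np

x = np.random.randn(100, 4096)      # minibatch of inputs
out = np.maximum(0, x)              # forward: elementwise ReLU

dout = np.random.randn(*out.shape)  # upstream gradient dL/dout
dx = dout * (x > 0)                 # backward: the Jacobian product reduces to a mask
```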
Assignment: Writing SVM/Softmax. Stage your forward/backward computation!
E.g. for the SVM: compute the margins as an intermediate in the forward pass, then reuse them in the backward pass.
Summary so far
- neural nets will be very large: there is no hope of writing down the gradient formula by hand for all parameters
- backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
- implementations maintain a graph structure, where the nodes implement a forward() / backward() API
- forward: compute the result of an operation and save any intermediates needed for the gradient computation in memory
- backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
Neural Network so far:
(Before) Linear score function: f = W x
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
  x (3072) → h (100 hidden units, via W1) → s (10 class scores, via W2)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))
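A minimal sketch of that 2-layer score function with the dimensions above (the weight names and random data are illustrative):

```python
import numpy as np

x = np.random.randn(3072)                 # one input, e.g. a flattened 32x32x3 image
W1 = np.random.randn(100, 3072) * 0.01    # first layer weights
W2 = np.random.randn(10, 100) * 0.01      # second layer weights

h = np.maximum(0, W1.dot(x))              # hidden layer, shape (100,)
s = W2.dot(h)                             # class scores, shape (10,)
```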
Full implementation of training a 2-layer Neural Network needs ~11 lines:
from @iamtrask, http://iamtrask.github.io/2015/07/12/basic-python-network/
(code figure, annotated: forward pass, backward pass, backprop of derivative)
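The code in that post is along these lines (a sketch of @iamtrask's toy network, lightly adapted to Python 3; treat it as illustrative rather than a verbatim copy of the slide):

```python
import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])   # 4 toy examples, 3 features
y = np.array([[0,1,1,0]]).T                        # targets
syn0 = 2*np.random.random((3,4)) - 1               # layer-1 weights
syn1 = 2*np.random.random((4,1)) - 1               # layer-2 weights
for j in range(60000):
    l1 = 1/(1 + np.exp(-np.dot(X, syn0)))              # forward pass: hidden layer (sigmoid)
    l2 = 1/(1 + np.exp(-np.dot(l1, syn1)))             # forward pass: output
    l2_delta = (y - l2) * (l2 * (1 - l2))              # backward pass: output error * sigmoid'
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1 - l1))  # backprop of derivative to the hidden layer
    syn1 += l1.T.dot(l2_delta)                         # weight updates
    syn0 += X.T.dot(l1_delta)
```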
Assignment: Writing a 2-layer Net. Stage your forward/backward computation!
(figure: the mathematical model of a neuron; the weighted sum of inputs is passed through a sigmoid activation function)
Be very careful with your brain analogies.
Biological Neurons:
- Many different types
- Dendrites can perform complex non-linear computations
- Synapses are not a single weight but a complex non-linear dynamical system
- Rate code may not be adequate
[Dendritic Computation. London and Häusser]
Activation Functions
- Sigmoid: σ(x) = 1 / (1 + e^(-x))
- tanh: tanh(x)
- ReLU: max(0, x)
- Leaky ReLU: max(0.1x, x)
- Maxout
- ELU
Neural Networks: Architectures
“Fully-connected” layers
“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”
Example feed-forward computation of a Neural Network
We can efficiently evaluate an entire layer of neurons.
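A minimal sketch of that layer-wise evaluation for a 3-layer net with sigmoid activations (the layer sizes and names are illustrative):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))    # sigmoid activation, applied elementwise

x = np.random.randn(3, 1)                               # input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)   # first hidden layer (4 neurons)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)   # second hidden layer (4 neurons)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)   # output layer

h1 = f(W1.dot(x) + b1)     # one matrix multiply evaluates the whole first layer
h2 = f(W2.dot(h1) + b2)
out = W3.dot(h2) + b3
```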
Setting the number of layers and their sizes
more neurons = more capacity
(you can play with this demo over at ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)
Do not use the size of the neural network as a regularizer. Use stronger regularization instead (e.g. a larger L2 penalty λ).
Summary
- we arrange neurons into fully-connected layers
- the abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
- neural networks are not really neural
- neural networks: bigger = better (but might have to regularize more strongly)
Next Lecture:
More than you ever wanted to know about Neural Networks and how to train them.