Neural Networks - Bishop PRML Ch. 5
TRANSCRIPT
Neural Networks: Bishop PRML Ch. 5
Alireza Ghane
Neural Networks
• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus on multi-layer perceptrons
  • Mathematical properties rather than plausibility
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Applications
Feed-forward Networks
• We have looked at generalized linear models of the form:

$$y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right)$$

  for fixed non-linear basis functions $\phi(\cdot)$
• We now extend this model by allowing adaptive basis functions, and learning their parameters
• In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

$$\phi_j(\mathbf{x}) = f\left( \sum_{j=1}^{M} \dots \right)$$
Feed-forward Networks
• Starting with input $\mathbf{x} = (x_1, \dots, x_D)$, construct linear combinations:

$$a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}$$

  These $a_j$ are known as activations
• Pass through an activation function $h(\cdot)$ to get output $z_j = h(a_j)$
• Model of an individual neuron
[Figure: model of an individual neuron; from Russell and Norvig, AIMA2e]
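A minimal sketch of this single-unit computation (Python/numpy; the example weights and the tanh choice for h are illustrative assumptions, not from the slides):

```python
import numpy as np

def unit(x, w, w0, h=np.tanh):
    """One neuron: activation a = w . x + w0, output z = h(a)."""
    a = np.dot(w, x) + w0   # linear combination: the "activation" a_j
    return h(a)             # squash through the activation function h

# Example: D = 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(unit(x, w, w0=0.3))
```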
Activation Functions
• Can use a variety of activation functions (a few are sketched in code below)
  • Sigmoidal (S-shaped)
    • Logistic sigmoid $1/(1 + \exp(-a))$ (useful for binary classification)
    • Hyperbolic tangent tanh
  • Radial basis function $z_j = \sum_i (x_i - w_{ji})^2$
  • Softmax
    • Useful for multi-class classification
  • Identity
    • Useful for regression
  • Threshold
  • . . .
• Needs to be differentiable for gradient-based learning (later)
• Can use different activation functions in each unit
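As a quick illustration, the common choices above are one-liners in numpy (a sketch; the max-subtraction in softmax is a standard numerical-stability trick, an implementation detail rather than slide content):

```python
import numpy as np

def logistic(a):                 # sigmoid, for binary classification
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):                     # hyperbolic tangent
    return np.tanh(a)

def softmax(a):                  # multi-class outputs summing to 1
    e = np.exp(a - np.max(a))    # shift by max for numerical stability
    return e / e.sum()

def identity(a):                 # regression outputs
    return a

def threshold(a):                # not differentiable at 0; unusable for
    return (a > 0).astype(float) # gradient-based learning
```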
Feed-forward Networks
[Figure: two-layer feed-forward network. Inputs $x_1, \dots, x_D$ plus bias $x_0$; hidden units $z_1, \dots, z_M$ plus bias $z_0$; outputs $y_1, \dots, y_K$. First-layer weights $w^{(1)}$ (e.g. $w^{(1)}_{MD}$) and second-layer weights $w^{(2)}$ (e.g. $w^{(2)}_{KM}$, bias $w^{(2)}_{10}$)]
• Connect together a number of these units into a feed-forward network (DAG)
• Above shows a network with one layer of hidden units
• Implements the function (sketched in numpy below):

$$y_k(\mathbf{x}, \mathbf{w}) = h\left( \sum_{j=1}^{M} w^{(2)}_{kj} \, h\left( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \right) + w^{(2)}_{k0} \right)$$
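A sketch of this forward computation, assuming numpy, with biases kept as separate vectors rather than $w_{j0}$ terms, tanh hidden units, and identity outputs (all assumptions for illustration):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, out=lambda a: a):
    """Two-layer feed-forward network: returns y(x, w)."""
    a = W1 @ x + b1          # hidden activations a_j
    z = h(a)                 # hidden outputs z_j = h(a_j)
    return out(W2 @ z + b2)  # output units (identity => regression)

# D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(np.array([1.0, 0.5, -0.5]), W1, b1, W2, b2))
```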
Network Training
• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, and optimize against it (both criteria below are sketched in code afterwards)
• For regression, training data are $(\mathbf{x}_n, t_n)$, $t_n \in \mathbb{R}$
  • Squared error naturally arises:

$$E(\mathbf{w}) = \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2$$

• For binary classification, this is another discriminative model; maximum likelihood gives:

$$p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} \{ 1 - y_n \}^{1 - t_n}$$

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
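Both error functions are a few lines of numpy. A sketch, where `y` holds the network outputs $y_n$ over the training set; the clipping inside the log is an implementation detail to avoid log(0), not slide content:

```python
import numpy as np

def squared_error(y, t):
    """Regression: E(w) = sum_n (y(x_n, w) - t_n)^2."""
    return np.sum((y - t) ** 2)

def cross_entropy(y, t, eps=1e-12):
    """Binary classification: negative log-likelihood of the Bernoulli
    model p(t|w) = prod_n y_n^t_n (1 - y_n)^(1 - t_n)."""
    y = np.clip(y, eps, 1 - eps)      # guard against log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```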
Parameter Optimization
[Figure: error surface $E(\mathbf{w})$ over weight space $(w_1, w_2)$, with points $w_A$, $w_B$, $w_C$ and a gradient vector $\nabla E$, illustrating multiple local minima]
• For either of these problems, the error function E(w) is nasty
  • Nasty = non-convex
  • Non-convex = has local minima
Descent Methods
• The typical strategy for optimization problems of this sort is a descent method:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta \mathbf{w}^{(\tau)}$$

• As we've seen before, these come in many flavours:
  • Gradient descent: uses $\nabla E(\mathbf{w}^{(\tau)})$
  • Stochastic gradient descent: uses $\nabla E_n(\mathbf{w}^{(\tau)})$
  • Newton-Raphson (second order): uses $\nabla^2 E(\mathbf{w}^{(\tau)})$
• All of these can be used here; stochastic gradient descent is particularly effective (a minimal loop is sketched below)
  • Redundancy in training data, escaping local minima
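A minimal stochastic-gradient-descent loop under these definitions; `grad_En` stands for any function returning $\nabla E_n(\mathbf{w})$ for a single training case, and the learning rate `eta` is an assumed hyperparameter:

```python
import numpy as np

def sgd(w, grad_En, data, eta=0.01, epochs=10, seed=0):
    """w^(tau+1) = w^(tau) - eta * grad E_n(w^(tau)), one case at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(len(data)):  # random visiting order
            x_n, t_n = data[n]
            w = w - eta * grad_En(w, x_n, t_n)
    return w
```

The per-case noise is what helps with redundant training data and with escaping shallow local minima.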
Computing Gradients
• The function $y(\mathbf{x}_n, \mathbf{w})$ implemented by a network is complicated
• It isn't obvious how to compute error function derivatives with respect to weights
• Numerical method for calculating error derivatives: use finite differences:

$$\frac{\partial E_n}{\partial w_{ji}} \approx \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon}$$

• How much computation would this take with W weights in the network?
  • O(W) per derivative, O(W²) total per gradient descent step (see the sketch below)
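The central-difference estimate is easy to code, and remains the standard sanity check for backprop gradients. A sketch; `E` is assumed to map a flat weight vector to the error:

```python
import numpy as np

def numeric_gradient(E, w, eps=1e-6):
    """dE/dw_i ~ (E(w + eps*e_i) - E(w - eps*e_i)) / (2*eps).
    Each weight needs its own O(W) evaluation of E => O(W^2) total."""
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (E(w_plus) - E(w_minus)) / (2 * eps)
    return g
```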
Error Backpropagation
• Backprop is an efficient method for computing error derivatives $\partial E_n / \partial w_{ji}$
  • O(W) to compute derivatives wrt all weights
• First, feed training example $\mathbf{x}_n$ forward through the network, storing all activations $a_j$
• Calculating derivatives for weights connected to output nodes is easy
  • e.g. for linear output nodes $y_k = \sum_i w_{ki} z_i$:

$$\frac{\partial E_n}{\partial w_{ki}} = \frac{\partial}{\partial w_{ki}} \frac{1}{2} (y_n - t_n)^2 = (y_n - t_n) z_{ni}$$

• For hidden layers, propagate error backwards from the output nodes
Chain Rule for Partial Derivatives
• A “reminder”
• For $f(x, y)$, with $f$ differentiable wrt $x$ and $y$, and $x$ and $y$ differentiable wrt $u$ and $v$:

$$\frac{\partial f}{\partial u} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u}
\quad \text{and} \quad
\frac{\partial f}{\partial v} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial v}$$
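As a concrete check of the rule, take $f(x, y) = xy$ with $x = u + v$ and $y = uv$ (an invented example, not from the slides); the sketch compares the chain-rule value of $\partial f / \partial u$ against a finite difference:

```python
def f(x, y):
    return x * y

def df_du(u, v):
    # df/dx = y, dx/du = 1; df/dy = x, dy/du = v
    x, y = u + v, u * v
    return y * 1.0 + x * v

u, v, eps = 1.5, -2.0, 1e-6
numeric = (f(u + eps + v, (u + eps) * v)
           - f(u - eps + v, (u - eps) * v)) / (2 * eps)
print(df_du(u, v), numeric)  # the two values should agree closely
```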
Error Backpropagation
• We can write

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} E_n(a_{j_1}, a_{j_2}, \dots, a_{j_m})$$

  where $\{j_i\}$ are the indices of the nodes in the same layer as node $j$
• Using the chain rule:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} + \sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ji}}$$

  where $\sum_k$ runs over all other nodes $k$ in the same layer as node $j$
• Since $a_k$ does not depend on $w_{ji}$, all terms in the summation go to 0:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}}$$
Error Backpropagation cont.
• Introduce error $\delta_j \equiv \frac{\partial E_n}{\partial a_j}$, so that

$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j \frac{\partial a_j}{\partial w_{ji}}$$

• Other factor is:

$$\frac{\partial a_j}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_k w_{jk} z_k = z_i$$
Error Backpropagation cont.
• Error $\delta_j$ can also be computed using the chain rule:

$$\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \underbrace{\frac{\partial E_n}{\partial a_k}}_{\delta_k} \frac{\partial a_k}{\partial a_j}$$

  where $\sum_k$ runs over all nodes $k$ in the layer after node $j$
• Eventually:

$$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$$

  • A weighted sum of the later error “caused” by this weight (a complete backward pass is sketched below)
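Putting the pieces together for the two-layer network from earlier: a hedged sketch of one backprop pass, assuming tanh hidden units (so $h'(a) = 1 - h(a)^2$), linear outputs, and $E_n = \frac{1}{2}\|y - t\|^2$:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    """Gradients of E_n = 0.5 * ||y - t||^2 for a single training case."""
    # Forward pass, storing activations
    a = W1 @ x + b1
    z = np.tanh(a)                        # hidden outputs z_j
    y = W2 @ z + b2                       # linear output units
    # Output errors: delta_k = dE_n/da_k = y_k - t_k
    delta2 = y - t
    # Hidden errors: delta_j = h'(a_j) * sum_k w_kj delta_k
    delta1 = (1 - z ** 2) * (W2.T @ delta2)
    # dE_n/dw_ji = delta_j * z_i (inputs x play the role of z for layer 1)
    return np.outer(delta1, x), delta1, np.outer(delta2, z), delta2
```

Its output can be checked against `numeric_gradient` from the Computing Gradients section, at O(W) cost instead of O(W²).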
Applications of Neural Networks
• Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)
Hand-written Digit Recognition
• MNIST: standard dataset for hand-written digit recognition
  • 60,000 training images, 10,000 test images
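For reference, MNIST ships with several libraries; a sketch using the Keras loader (assumes TensorFlow is installed, and downloads the data on first use):

```python
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```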
LeNet-5
[Figure: LeNet-5 architecture. INPUT 32x32 → C1: 6 feature maps @ 28x28 (convolutions) → S2: 6 feature maps @ 14x14 (subsampling) → C3: 16 feature maps @ 10x10 (convolutions) → S4: 16 feature maps @ 5x5 (subsampling) → C5: 120 units (full connection) → F6: 84 units (full connection) → OUTPUT: 10 (Gaussian connections)]
• LeNet developed by Yann LeCun et al.
• Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse same 5x5 “filter”; see the sketch below)
  • Breaking symmetry
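"Shared weights" just means sliding one 5x5 filter across the whole image. A minimal numpy sketch of one valid convolution plus 2x2 average subsampling (average pooling as in the original LeNet; a real implementation would use a library such as PyTorch):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide one shared kernel over the image (no padding)."""
    H, W = img.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def subsample_2x2(fmap):
    """2x2 average pooling (LeNet-style subsampling)."""
    H, W = fmap.shape
    return fmap[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2).mean(axis=(1, 3))

img = np.random.rand(32, 32)
fmap = conv2d_valid(img, np.random.rand(5, 5))  # 28x28, like C1
print(subsample_2x2(fmap).shape)                # (14, 14), like S2
```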
[Figure: the 82 test images misclassified by LeNet-5, each labeled true → predicted, e.g. 4 → 6, 3 → 5, 8 → 2]
• The 82 errors made by LeNet-5 (0.82% test error rate)
Deep Learning
• Collection of important techniques to improve performance:
  • Multi-layer networks (as in LeNet-5), auto-encoders for unsupervised feature learning, sparsity, convolutional networks, parameter tying, probabilistic models for initializing parameters, hinge activation functions for steeper gradients
  • ufldl.stanford.edu/eccv10-tutorial
  • http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
• Scalability is key: can use lots of data, since stochastic gradient descent is memory-efficient
Conclusion
• Readings: Ch. 5.1, 5.2, 5.3
• Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to do more than e.g. linear decision boundaries
• Different error functions
• Learning is more difficult, since the error function is not convex
  • Use stochastic gradient descent; obtain a (good?) local minimum
• Backpropagation for efficient gradient computation