Neural Networks - Bishop PRML Ch. 5


Page 1: Neural Networks - Bishop PRML Ch. 5


Neural Networks - Bishop PRML Ch. 5

Alireza Ghane / Greg Mori

Page 2: Neural Networks - Bishop PRML Ch. 5


Neural Networks

• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus on multi-layer perceptrons

• Mathematical properties rather than plausibility


Page 3: Neural Networks - Bishop PRML Ch. 5


Outline

Feed-forward Networks

Network Training

Error Backpropagation

Applications


Page 5: Neural Networks - Bishop PRML Ch. 5


Feed-forward Networks

• We have looked at generalized linear models of the form:

y(x, w) = f( ∑_{j=1}^{M} w_j φ_j(x) )

for fixed non-linear basis functions φ(·)

• We now extend this model by allowing adaptive basis functions and learning their parameters
• In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

φ_j(x) = f( ∑ . . . )


Page 7: Neural Networks - Bishop PRML Ch. 5


Feed-forward Networks

• Starting with input x = (x_1, . . . , x_D), construct linear combinations:

a_j = ∑_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}

These a_j are known as activations
• Pass through an activation function h(·) to get output z_j = h(a_j)

• Model of an individual neuron

(figure from Russell and Norvig, AIMA2e)
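To make the single-unit computation concrete, here is a small NumPy sketch (my own illustration, not from the slides); the logistic-sigmoid h and the specific weights are assumptions.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid h(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical input and weights for a single hidden unit j (D = 3 inputs)
x = np.array([0.5, -1.2, 3.0])     # inputs x_1, ..., x_D
w_j = np.array([0.1, 0.4, -0.2])   # weights w_j1, ..., w_jD
w_j0 = 0.3                         # bias w_j0

a_j = w_j @ x + w_j0               # activation a_j = sum_i w_ji x_i + w_j0
z_j = sigmoid(a_j)                 # unit output z_j = h(a_j)
print(a_j, z_j)
```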


Page 10: Neural Networks - Bishop PRML Ch. 5


Activation Functions

• Can use a variety of activation functions (a few common choices are sketched in code below)
  • Sigmoidal (S-shaped)
    • Logistic sigmoid 1/(1 + exp(−a)) (useful for binary classification)
    • Hyperbolic tangent tanh
  • Radial basis function z_j = ∑_i (x_i − w_{ji})^2
  • Softmax
    • Useful for multi-class classification
  • Identity
    • Useful for regression
  • Threshold
  • . . .
• Needs to be differentiable for gradient-based learning (later)
• Can use different activation functions in each unit

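The sketch below (my own, not part of the slides) implements a few of the activation functions listed above in NumPy; subtracting the maximum inside the softmax is an assumed detail for numerical stability.

```python
import numpy as np

def logistic_sigmoid(a):
    """1 / (1 + exp(-a)): squashes activations to (0, 1), useful for binary classification."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Normalized exponentials over a vector of activations (multi-class outputs)."""
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / e.sum()

def identity(a):
    """Identity activation, the usual choice for regression outputs."""
    return a

a = np.array([-2.0, 0.0, 3.0])
print(logistic_sigmoid(a), np.tanh(a), softmax(a), identity(a))
```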

Page 11: Neural Networks - Bishop PRML Ch. 5


Feed-forward Networks

[Figure: two-layer network. Inputs x_0, x_1, . . . , x_D feed hidden units z_0, z_1, . . . , z_M via weights w^{(1)} (e.g. w^{(1)}_{MD}); hidden units feed outputs y_1, . . . , y_K via weights w^{(2)} (e.g. w^{(2)}_{KM}, w^{(2)}_{10}).]

• Connect together a number of these units into a feed-forward network (DAG)
• Above shows a network with one layer of hidden units
• Implements function:

y_k(x, w) = h( ∑_{j=1}^{M} w^{(2)}_{kj} h( ∑_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} ) + w^{(2)}_{k0} )

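As a sketch of the composition above (my own, under assumed choices: tanh hidden units, identity outputs, and biases folded in via x_0 = z_0 = 1):

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, out=lambda a: a):
    """Forward pass of a one-hidden-layer feed-forward network.

    W1: (M, D+1) first-layer weights, column 0 holding the biases w^(1)_j0.
    W2: (K, M+1) second-layer weights, column 0 holding the biases w^(2)_k0.
    """
    x_tilde = np.concatenate(([1.0], x))   # prepend x_0 = 1 for the bias
    z = h(W1 @ x_tilde)                    # hidden unit outputs z_j = h(a_j)
    z_tilde = np.concatenate(([1.0], z))   # prepend z_0 = 1 for the bias
    return out(W2 @ z_tilde)               # outputs y_k

# Hypothetical sizes: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(2, 5))
print(forward(np.array([0.5, -1.0, 2.0]), W1, W2))
```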


Page 13: Neural Networks - Bishop PRML Ch. 5


Network Training

• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, and optimize against it
• For regression, training data are (x_n, t_n), t_n ∈ R
  • Squared error naturally arises:

E(w) = ∑_{n=1}^{N} {y(x_n, w) − t_n}^2

• For binary classification, this is another discriminative model; maximum likelihood gives:

p(t | w) = ∏_{n=1}^{N} y_n^{t_n} {1 − y_n}^{1 − t_n}

E(w) = −∑_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}

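A minimal sketch of the two error functions (my own illustration; the clipping inside the cross-entropy is an assumption to avoid log(0)):

```python
import numpy as np

def sum_of_squares_error(y, t):
    """E(w) = sum_n { y(x_n, w) - t_n }^2 for regression targets."""
    return np.sum((y - t) ** 2)

def cross_entropy_error(y, t, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ] for binary targets."""
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

y_reg, t_reg = np.array([1.2, 0.3]), np.array([1.0, 0.0])
y_cls, t_cls = np.array([0.9, 0.2]), np.array([1.0, 0.0])
print(sum_of_squares_error(y_reg, t_reg), cross_entropy_error(y_cls, t_cls))
```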


Page 16: Neural Networks - Bishop PRML Ch. 5


Parameter Optimization

[Figure: error surface E(w) over a two-dimensional weight space (w_1, w_2), with points w_A, w_B, w_C and the gradient ∇E.]

• For either of these problems, the error function E(w) is nasty
  • Nasty = non-convex
  • Non-convex = has local minima


Page 17: Neural Networks - Bishop PRML Ch. 5


Descent Methods

• The typical strategy for optimization problems of this sort is a descent method:

w^{(τ+1)} = w^{(τ)} + Δw^{(τ)}

• As we've seen before, these come in many flavours
  • Gradient descent ∇E(w^{(τ)})
  • Stochastic gradient descent ∇E_n(w^{(τ)})
  • Newton-Raphson (second order) ∇^2 E(w^{(τ)})
• All of these can be used here; stochastic gradient descent is particularly effective (a minimal sketch follows below)
  • Redundancy in training data, escaping local minima

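A minimal stochastic-gradient-descent loop, as a sketch only (the interface grad_En, the learning rate eta, and the epoch count are assumptions; grad_En could be the backpropagation routine described later):

```python
import numpy as np

def sgd(w, grad_En, data, eta=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent: w <- w - eta * grad E_n(w), one training case at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for n in rng.permutation(len(data)):      # visit examples in random order
            x_n, t_n = data[n]
            w = w - eta * grad_En(w, x_n, t_n)    # Delta w^(tau) = -eta * grad E_n(w^(tau))
    return w
```

Batch gradient descent would instead use the full gradient ∇E(w^{(τ)}) in each step; Newton-Raphson would additionally use second-order information.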


Page 20: Neural Networks - Bishop PRML Ch. 5


Computing Gradients

• The function y(x_n, w) implemented by a network is complicated
• It isn't obvious how to compute error function derivatives with respect to weights
• Numerical method for calculating error derivatives: use finite differences:

∂E_n/∂w_{ji} ≈ [E_n(w_{ji} + ε) − E_n(w_{ji} − ε)] / (2ε)

• How much computation would this take with W weights in the network?
  • O(W) per derivative, O(W^2) total per gradient descent step (see the sketch below)

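The following sketch (my own, with an assumed error-function interface E_n(w)) estimates each derivative by central differences; because it perturbs one weight at a time, the cost is O(W) per derivative and O(W^2) for the full gradient.

```python
import numpy as np

def finite_difference_grad(E_n, w, eps=1e-6):
    """Central-difference estimate of dE_n/dw_i for every entry of the weight vector w."""
    grad = np.zeros_like(w)
    for i in range(len(w)):                # one pair of O(W) evaluations per weight
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E_n(w_plus) - E_n(w_minus)) / (2.0 * eps)
    return grad

# Toy check against a known gradient: E(w) = ||w||^2 has gradient 2w
w = np.array([1.0, -2.0, 0.5])
print(finite_difference_grad(lambda v: np.sum(v ** 2), w))   # approximately [2, -4, 1]
```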



Page 25: Neural Networks - Bishop PRML Ch. 5


Error Backpropagation

• Backprop is an efficient method for computing error derivatives ∂E_n/∂w_{ji}
  • O(W) to compute derivatives w.r.t. all weights
• First, feed training example x_n forward through the network, storing all activations a_j
• Calculating derivatives for weights connected to output nodes is easy
  • e.g. for linear output nodes y_k = ∑_i w_{ki} z_i:

∂E_n/∂w_{ki} = ∂/∂w_{ki} [ (1/2)(y_n − t_n)^2 ] = (y_n − t_n) z_{ni}

• For hidden layers, propagate error backwards from the output nodes


Page 28: Neural Networks - Bishop PRML Ch. 5


Chain Rule for Partial Derivatives

• A “reminder”
• For f(x, y), with f differentiable w.r.t. x and y, and x and y differentiable w.r.t. u and v:

∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)

and

∂f/∂v = (∂f/∂x)(∂x/∂v) + (∂f/∂y)(∂y/∂v)

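As a quick numerical check of the first identity (my own example, not from the slides), take f(x, y) = x·y with x(u, v) = u + v and y(u, v) = u·v; then (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u) = y·1 + x·v, which the snippet compares against a central-difference estimate of ∂f/∂u.

```python
def f(x, y):        # f(x, y) = x * y, so df/dx = y and df/dy = x
    return x * y

def x_of(u, v):     # x(u, v) = u + v, so dx/du = 1
    return u + v

def y_of(u, v):     # y(u, v) = u * v, so dy/du = v
    return u * v

u, v, eps = 1.3, -0.7, 1e-6
chain_rule = y_of(u, v) * 1.0 + x_of(u, v) * v        # (df/dx)(dx/du) + (df/dy)(dy/du)
numeric = (f(x_of(u + eps, v), y_of(u + eps, v))
           - f(x_of(u - eps, v), y_of(u - eps, v))) / (2 * eps)   # central difference in u
print(chain_rule, numeric)   # the two values agree closely
```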

Page 29: Neural Networks - Bishop PRML Ch. 5


Error Backpropagation

• We can write

∂E_n/∂w_{ji} = ∂/∂w_{ji} E_n(a_{j_1}, a_{j_2}, . . . , a_{j_m})

where {j_i} are the indices of the nodes in the same layer as node j
• Using the chain rule:

∂E_n/∂w_{ji} = (∂E_n/∂a_j)(∂a_j/∂w_{ji}) + ∑_k (∂E_n/∂a_k)(∂a_k/∂w_{ji})

where the sum over k runs over all other nodes k in the same layer as node j
• Since a_k does not depend on w_{ji}, all terms in the summation go to 0:

∂E_n/∂w_{ji} = (∂E_n/∂a_j)(∂a_j/∂w_{ji})


Page 32: Neural Networks - Bishop PRML Ch. 5


Error Backpropagation cont.

• Introduce error δ_j ≡ ∂E_n/∂a_j

∂E_n/∂w_{ji} = δ_j ∂a_j/∂w_{ji}

• Other factor is:

∂a_j/∂w_{ji} = ∂/∂w_{ji} ∑_k w_{jk} z_k = z_i


Page 33: Neural Networks - Bishop PRML Ch. 5


Error Backpropagation cont.

• Error δ_j can also be computed using the chain rule:

δ_j ≡ ∂E_n/∂a_j = ∑_k (∂E_n/∂a_k)(∂a_k/∂a_j) = ∑_k δ_k (∂a_k/∂a_j)

where the sum over k runs over all nodes k in the layer after node j
• Eventually:

δ_j = h′(a_j) ∑_k w_{kj} δ_k

• A weighted sum of the later error “caused” by this weight (the complete procedure is sketched in code below)

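Putting the pieces together, here is a backprop sketch for the one-hidden-layer network from earlier (my own, under assumed choices: tanh hidden units, linear outputs, and the per-example error E_n = (1/2)‖y − t‖^2). The hidden errors follow the recursion δ_j = h′(a_j) ∑_k w_{kj} δ_k, and the gradients can be verified against the finite-difference estimate shown earlier.

```python
import numpy as np

def backprop(x, t, W1, W2):
    """Gradients of E_n = 0.5 * ||y - t||^2 for a one-hidden-layer network.

    W1: (M, D+1) first-layer weights (column 0 = biases), tanh hidden units.
    W2: (K, M+1) second-layer weights (column 0 = biases), linear output units.
    """
    # Forward pass, storing the quantities needed later
    x_tilde = np.concatenate(([1.0], x))
    z = np.tanh(W1 @ x_tilde)              # hidden outputs z_j = h(a_j)
    z_tilde = np.concatenate(([1.0], z))
    y = W2 @ z_tilde                       # linear outputs y_k

    # Backward pass: output errors, then hidden errors via the delta recursion
    delta_out = y - t                                          # delta_k = y_k - t_k
    delta_hid = (1.0 - z ** 2) * (W2[:, 1:].T @ delta_out)     # h'(a_j) * sum_k w_kj delta_k

    # dE_n/dw_ji = delta_j * (input feeding that weight)
    return np.outer(delta_hid, x_tilde), np.outer(delta_out, z_tilde)

# Hypothetical sizes: D = 3, M = 4, K = 2
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(2, 5))
gW1, gW2 = backprop(np.array([0.2, -0.5, 1.0]), np.array([1.0, 0.0]), W1, W2)
print(gW1.shape, gW2.shape)   # (4, 4) and (2, 5), matching W1 and W2
```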


Page 36: Neural Networks - Bishop PRML Ch. 5


Applications of Neural Networks

• Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)


Page 37: Neural Networks - Bishop PRML Ch. 5


Hand-written Digit Recognition

• MNIST - standard dataset for hand-written digit recognition
  • 60000 training, 10000 test images


Page 38: Neural Networks - Bishop PRML Ch. 5


LeNet-5

[Figure: LeNet-5 architecture. INPUT 32x32 → convolutions → C1: feature maps 6@28x28 → subsampling → S2: f. maps 6@14x14 → convolutions → C3: f. maps 16@10x10 → subsampling → S4: f. maps 16@5x5 → C5: layer 120 (full connection) → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections).]

• LeNet developed by Yann LeCun et al.
• Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse same 5x5 “filter”)
  • Breaking symmetry


Page 39: Neural Networks - Bishop PRML Ch. 5


[Figure: grid of the misclassified MNIST test digits, each labelled as true digit → predicted digit.]

• The 82 errors made by LeNet-5 (0.82% test error rate)


Page 40: Neural Networks - Bishop PRML Ch. 5


Deep Learning

• Collection of important techniques to improve performance:
  • Multi-layer networks (as in LeNet-5), auto-encoders for unsupervised feature learning, sparsity, convolutional networks, parameter tying, probabilistic models for initializing parameters, hinge activation functions for steeper gradients
• ufldl.stanford.edu/eccv10-tutorial
• http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
• Scalability is key; can use lots of data, since stochastic gradient descent is memory-efficient


Page 41: Neural Networks - Bishop PRML Ch. 5


Conclusion

• Readings: Ch. 5.1, 5.2, 5.3
• Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to do more than e.g. linear decision boundaries
  • Different error functions
• Learning is more difficult, error function not convex
  • Use stochastic gradient descent, obtain (good?) local minimum
• Backpropagation for efficient gradient computation
