TRANSCRIPT
Backpropagation learning
MIT Department of Brain and Cognitive Sciences 9.641J, Spring 2005 - Introduction to Neural Networks Instructor: Professor Sebastian Seung
Simple vs. multilayer perceptron
Hidden layer problem
• Radical change for the supervised learning problem.
• No desired values for the hidden layer.
• The network must find its own hidden layer activations.
Generalized delta rule
• Delta rule only works for the output layer.
• Backpropagation, or the generalized delta rule, is a way of creating desired values for the hidden layers.
Outline
• The algorithm
• Derivation as a gradient algorithm
• Sensitivity lemma
Multilayer perceptron
• L layers of weights and biases
• L+1 layers of neurons
$$x^0 \xrightarrow{W^1,\,b^1} x^1 \xrightarrow{W^2,\,b^2} \cdots \xrightarrow{W^L,\,b^L} x^L$$

$$x_i^l = f\!\left(\sum_{j=1}^{n_{l-1}} W_{ij}^l\, x_j^{l-1} + b_i^l\right)$$
Reward function
• Depends on activity of the output layer only.
• Maximize reward with respect to weights and biases.
$$R(x^L)$$
Example: squared error
• Square of desired minus actual output, with minus sign.
$$R(x^L) = -\frac{1}{2}\sum_{i=1}^{n_L}\left(d_i - x_i^L\right)^2$$
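The squared-error reward can be sketched in a few lines of NumPy (NumPy is an assumption; the lecture does not name a language):

```python
import numpy as np

def reward(xL, d):
    # R(x^L) = -1/2 * sum_i (d_i - x_i^L)^2: zero when the output matches
    # the desired values, negative otherwise, so maximizing R minimizes
    # the squared error.
    return -0.5 * np.sum((d - xL) ** 2)

reward(np.array([0.0, 0.0]), np.array([1.0, 0.0]))  # → -0.5
```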
Forward pass
For $l = 1$ to $L$:

$$u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l\, x_j^{l-1} + b_i^l, \qquad x_i^l = f(u_i^l)$$
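The forward pass above can be sketched in NumPy. The lecture leaves the activation $f$ generic, so a logistic sigmoid is assumed here:

```python
import numpy as np

def f(u):
    # Logistic sigmoid, standing in for the lecture's generic activation f.
    return 1.0 / (1.0 + np.exp(-u))

def forward(x0, Ws, bs):
    """Forward pass: for l = 1..L, u^l = W^l x^{l-1} + b^l, x^l = f(u^l).

    Returns all activations x^0..x^L and pre-activations u^1..u^L;
    the backward pass will need both.
    """
    xs, us = [x0], []
    for W, b in zip(Ws, bs):
        u = W @ xs[-1] + b
        us.append(u)
        xs.append(f(u))
    return xs, us
```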
Sensitivity computation
• The sensitivity is also called “delta.”
$$s_i^L = f'(u_i^L)\,\frac{\partial R}{\partial x_i^L} = f'(u_i^L)\left(d_i - x_i^L\right)$$
Backward pass
For $l = L$ down to $2$:

$$s_j^{l-1} = f'(u_j^{l-1})\sum_{i=1}^{n_l} s_i^l\, W_{ij}^l$$
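The sensitivity computation and backward pass can be sketched in NumPy. A logistic sigmoid $f$ is an assumption (so $f'(u) = f(u)(1-f(u))$), since the lecture keeps $f$ generic:

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def fprime(u):
    # Derivative of the logistic sigmoid: f'(u) = f(u)(1 - f(u)).
    s = f(u)
    return s * (1.0 - s)

def backward(us, Ws, d, xL):
    """Backward pass for the squared-error reward.

    s^L = f'(u^L)(d - x^L), then for l = L..2:
    s^{l-1} = f'(u^{l-1}) * (W^l)^T s^l.
    ss[l-1] holds s^l (layers numbered 1..L).
    """
    L = len(Ws)
    ss = [None] * L
    ss[L - 1] = fprime(us[L - 1]) * (d - xL)
    for k in range(L - 1, 0, -1):
        ss[k - 1] = fprime(us[k - 1]) * (Ws[k].T @ ss[k])
    return ss
```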
Learning update
• The weight and bias updates can be applied in any order.
$$\Delta W_{ij}^l = \eta\, s_i^l\, x_j^{l-1}, \qquad \Delta b_i^l = \eta\, s_i^l$$
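The update is a rank-one (outer-product) change to each weight matrix. A minimal sketch, assuming NumPy arrays; note the plus sign, since $R$ is a reward to be *maximized* (gradient ascent):

```python
import numpy as np

def update(Ws, bs, ss, xs, eta):
    """In-place learning update: ΔW^l = η s^l (x^{l-1})^T, Δb^l = η s^l.

    ss[l] is the sensitivity s^{l+1}; xs[l] is the input activity x^l
    to that layer (indices 0-based).
    """
    for l in range(len(Ws)):
        Ws[l] += eta * np.outer(ss[l], xs[l])
        bs[l] += eta * ss[l]
```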
Backprop is a gradient update
• Consider R as function of weights and biases.
$$\frac{\partial R}{\partial W_{ij}^l} = s_i^l\, x_j^{l-1}, \qquad \frac{\partial R}{\partial b_i^l} = s_i^l$$

$$\Delta W_{ij}^l = \eta\,\frac{\partial R}{\partial W_{ij}^l}, \qquad \Delta b_i^l = \eta\,\frac{\partial R}{\partial b_i^l}$$
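The claim that the sensitivities give the exact gradient can be checked against a central finite-difference estimate. This sketch assumes a sigmoid $f$ and the squared-error reward from earlier; the two-layer sizes are arbitrary choices:

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def reward(x0, Ws, bs, d):
    # R(x^L) = -1/2 * sum_i (d_i - x_i^L)^2 after a full forward pass.
    x = x0
    for W, b in zip(Ws, bs):
        x = f(W @ x + b)
    return -0.5 * np.sum((d - x) ** 2)

def backprop_grads(x0, Ws, bs, d):
    # Forward pass, storing activations and pre-activations.
    xs, us = [x0], []
    for W, b in zip(Ws, bs):
        u = W @ xs[-1] + b
        us.append(u)
        xs.append(f(u))
    # Backward pass: s^L = f'(u^L)(d - x^L); s^{l-1} = f'(u^{l-1}) (W^l)^T s^l.
    fp = lambda u: f(u) * (1.0 - f(u))
    s = fp(us[-1]) * (d - xs[-1])
    gWs, gbs = [], []
    for l in range(len(Ws) - 1, -1, -1):
        gWs.insert(0, np.outer(s, xs[l]))   # dR/dW^l = s^l (x^{l-1})^T
        gbs.insert(0, s.copy())             # dR/db^l = s^l
        if l > 0:
            s = fp(us[l - 1]) * (Ws[l].T @ s)
    return gWs, gbs

# Finite-difference check of one weight component.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(1)]
x0, d = rng.standard_normal(2), np.array([1.0])
gWs, _ = backprop_grads(x0, Ws, bs, d)
eps = 1e-6
Ws[0][0, 0] += eps
Rp = reward(x0, Ws, bs, d)
Ws[0][0, 0] -= 2 * eps
Rm = reward(x0, Ws, bs, d)
Ws[0][0, 0] += eps
numeric = (Rp - Rm) / (2 * eps)
print(abs(numeric - gWs[0][0, 0]))  # should be tiny (finite-difference error)
```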
Sensitivity lemma
• Sensitivity matrix = outer product of
  – sensitivity vector
  – activity vector
• The sensitivity vector is sufficient.
• Generalization of “delta.”
$$\frac{\partial R}{\partial W_{ij}^l} = \frac{\partial R}{\partial b_i^l}\, x_j^{l-1}$$
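The lemma follows in one step: $R$ depends on $W_{ij}^l$ and $b_i^l$ only through $u_i^l = \sum_j W_{ij}^l x_j^{l-1} + b_i^l$, so the chain rule gives

```latex
\frac{\partial R}{\partial W_{ij}^l}
  = \frac{\partial R}{\partial u_i^l}\,\frac{\partial u_i^l}{\partial W_{ij}^l}
  = \frac{\partial R}{\partial u_i^l}\, x_j^{l-1},
\qquad
\frac{\partial R}{\partial b_i^l}
  = \frac{\partial R}{\partial u_i^l}\,\frac{\partial u_i^l}{\partial b_i^l}
  = \frac{\partial R}{\partial u_i^l},
```

and substituting the second identity into the first yields the lemma.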
Coordinate transformation
$$u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l\, f(u_j^{l-1}) + b_i^l$$

$$\frac{\partial u_i^l}{\partial u_j^{l-1}} = W_{ij}^l\, f'(u_j^{l-1})$$

$$\frac{\partial R}{\partial u_i^l} = \frac{\partial R}{\partial b_i^l}$$
Output layer
$$x_i^L = f(u_i^L), \qquad u_i^L = \sum_j W_{ij}^L\, x_j^{L-1} + b_i^L$$

$$\frac{\partial R}{\partial b_i^L} = f'(u_i^L)\,\frac{\partial R}{\partial x_i^L}$$
Chain rule
• Composition of two functions: $u^{l-1} \to u^l \to R$ gives $u^{l-1} \to R$.

$$\frac{\partial R}{\partial u_j^{l-1}} = \sum_i \frac{\partial R}{\partial u_i^l}\,\frac{\partial u_i^l}{\partial u_j^{l-1}}$$

$$\frac{\partial R}{\partial b_j^{l-1}} = \sum_i \frac{\partial R}{\partial b_i^l}\, W_{ij}^l\, f'(u_j^{l-1})$$
Computational complexity
• Naïve estimate:
  – network output: order N
  – each component of the gradient: order N
  – N components: order N²
• With backprop: order N
Biological plausibility
• Local: pre- and postsynaptic variables
• Forward and backward passes use the same weights
• Extra set of variables
$$x_j^{l-1} \xrightarrow{W_{ij}^l} x_i^l, \qquad s_j^{l-1} \xleftarrow{W_{ij}^l} s_i^l$$
Backprop for brain modeling
• Backprop may not be a plausible account of learning in the brain.
• But perhaps the networks created by it are similar to biological neural networks.
• Zipser and Andersen:
  – train a network
  – compare its hidden neurons with those found in the brain.
LeNet
• Weight-sharing
• Sigmoidal neurons
• Learn binary outputs
Machine learning revolution
• Gradient following
  – or other hill-climbing methods
• Empirical error minimization