TRANSCRIPT
Backpropagation learning
MIT Department of Brain and Cognitive Sciences 9.641J, Spring 2005 - Introduction to Neural Networks Instructor: Professor Sebastian Seung
Simple vs. multilayer perceptron
Hidden layer problem
• Radical change for the supervised learning problem.
• No desired values for the hidden layer.
• The network must find its own hidden layer activations.
Generalized delta rule
• Delta rule only works for the output layer.
• Backpropagation, or the generalized delta rule, is a way of creating desired values for the hidden layers.
Outline
• The algorithm
• Derivation as a gradient algorithm
• Sensitivity lemma
Multilayer perceptron
• L layers of weights and biases
• L+1 layers of neurons
$$x^0 \xrightarrow{W^1,\,b^1} x^1 \xrightarrow{W^2,\,b^2} \cdots \xrightarrow{W^L,\,b^L} x^L$$

$$x_i^l = f\!\left(\sum_{j=1}^{n_{l-1}} W_{ij}^l\, x_j^{l-1} + b_i^l\right)$$
Reward function
• Depends on activity of the output layer only.
• Maximize reward with respect to weights and biases.
$$R(x^L)$$
Example: squared error
• Square of desired minus actual output, with minus sign.
$$R(x^L) = -\frac{1}{2}\sum_{i=1}^{n_L}\left(d_i - x_i^L\right)^2$$
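The squared-error reward can be sketched in a few lines of NumPy (NumPy is an assumption; the lecture does not name a language):

```python
import numpy as np

def reward(xL, d):
    # R(x^L) = -1/2 * sum_i (d_i - x_i^L)^2: zero when the output matches
    # the desired values, negative otherwise, so maximizing R minimizes
    # the squared error.
    return -0.5 * np.sum((d - xL) ** 2)

reward(np.array([0.0, 0.0]), np.array([1.0, 0.0]))  # → -0.5
```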
Forward pass
For $l = 1$ to $L$:

$$u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l\, x_j^{l-1} + b_i^l, \qquad x_i^l = f(u_i^l)$$
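The forward pass above can be sketched in NumPy. The lecture leaves the activation $f$ generic, so a logistic sigmoid is assumed here:

```python
import numpy as np

def f(u):
    # Logistic sigmoid, standing in for the lecture's generic activation f.
    return 1.0 / (1.0 + np.exp(-u))

def forward(x0, Ws, bs):
    """Forward pass: for l = 1..L, u^l = W^l x^{l-1} + b^l, x^l = f(u^l).

    Returns all activations x^0..x^L and pre-activations u^1..u^L;
    the backward pass will need both.
    """
    xs, us = [x0], []
    for W, b in zip(Ws, bs):
        u = W @ xs[-1] + b
        us.append(u)
        xs.append(f(u))
    return xs, us
```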
Sensitivity computation
• The sensitivity is also called “delta.”
$$s_i^L = f'(u_i^L)\,\frac{\partial R}{\partial x_i^L} = f'(u_i^L)\left(d_i - x_i^L\right)$$
Backward pass
For $l = L$ down to $2$:

$$s_j^{l-1} = f'(u_j^{l-1})\sum_{i=1}^{n_l} s_i^l\, W_{ij}^l$$
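The sensitivity computation and backward pass can be sketched in NumPy. A logistic sigmoid $f$ is an assumption (so $f'(u) = f(u)(1-f(u))$), since the lecture keeps $f$ generic:

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def fprime(u):
    # Derivative of the logistic sigmoid: f'(u) = f(u)(1 - f(u)).
    s = f(u)
    return s * (1.0 - s)

def backward(us, Ws, d, xL):
    """Backward pass for the squared-error reward.

    s^L = f'(u^L)(d - x^L), then for l = L..2:
    s^{l-1} = f'(u^{l-1}) * (W^l)^T s^l.
    ss[l-1] holds s^l (layers numbered 1..L).
    """
    L = len(Ws)
    ss = [None] * L
    ss[L - 1] = fprime(us[L - 1]) * (d - xL)
    for k in range(L - 1, 0, -1):
        ss[k - 1] = fprime(us[k - 1]) * (Ws[k].T @ ss[k])
    return ss
```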
Learning update
• The weight and bias updates can be applied in any order.
$$\Delta W_{ij}^l = \eta\, s_i^l\, x_j^{l-1}, \qquad \Delta b_i^l = \eta\, s_i^l$$
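The update is a rank-one (outer-product) change to each weight matrix. A minimal sketch, assuming NumPy arrays; note the plus sign, since $R$ is a reward to be *maximized* (gradient ascent):

```python
import numpy as np

def update(Ws, bs, ss, xs, eta):
    """In-place learning update: ΔW^l = η s^l (x^{l-1})^T, Δb^l = η s^l.

    ss[l] is the sensitivity s^{l+1}; xs[l] is the input activity x^l
    to that layer (indices 0-based).
    """
    for l in range(len(Ws)):
        Ws[l] += eta * np.outer(ss[l], xs[l])
        bs[l] += eta * ss[l]
```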
Backprop is a gradient update
• Consider R as function of weights and biases.
$$\frac{\partial R}{\partial W_{ij}^l} = s_i^l\, x_j^{l-1}, \qquad \frac{\partial R}{\partial b_i^l} = s_i^l$$

$$\Delta W_{ij}^l = \eta\,\frac{\partial R}{\partial W_{ij}^l}, \qquad \Delta b_i^l = \eta\,\frac{\partial R}{\partial b_i^l}$$
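The claim that the sensitivities give the exact gradient can be checked against a central finite-difference estimate. This sketch assumes a sigmoid $f$ and the squared-error reward from earlier; the two-layer sizes are arbitrary choices:

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def reward(x0, Ws, bs, d):
    # R(x^L) = -1/2 * sum_i (d_i - x_i^L)^2 after a full forward pass.
    x = x0
    for W, b in zip(Ws, bs):
        x = f(W @ x + b)
    return -0.5 * np.sum((d - x) ** 2)

def backprop_grads(x0, Ws, bs, d):
    # Forward pass, storing activations and pre-activations.
    xs, us = [x0], []
    for W, b in zip(Ws, bs):
        u = W @ xs[-1] + b
        us.append(u)
        xs.append(f(u))
    # Backward pass: s^L = f'(u^L)(d - x^L); s^{l-1} = f'(u^{l-1}) (W^l)^T s^l.
    fp = lambda u: f(u) * (1.0 - f(u))
    s = fp(us[-1]) * (d - xs[-1])
    gWs, gbs = [], []
    for l in range(len(Ws) - 1, -1, -1):
        gWs.insert(0, np.outer(s, xs[l]))   # dR/dW^l = s^l (x^{l-1})^T
        gbs.insert(0, s.copy())             # dR/db^l = s^l
        if l > 0:
            s = fp(us[l - 1]) * (Ws[l].T @ s)
    return gWs, gbs

# Finite-difference check of one weight component.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(1)]
x0, d = rng.standard_normal(2), np.array([1.0])
gWs, _ = backprop_grads(x0, Ws, bs, d)
eps = 1e-6
Ws[0][0, 0] += eps
Rp = reward(x0, Ws, bs, d)
Ws[0][0, 0] -= 2 * eps
Rm = reward(x0, Ws, bs, d)
Ws[0][0, 0] += eps
numeric = (Rp - Rm) / (2 * eps)
print(abs(numeric - gWs[0][0, 0]))  # should be tiny (finite-difference error)
```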
Sensitivity lemma
• Sensitivity matrix = outer product of
  – sensitivity vector
  – activity vector
• The sensitivity vector is sufficient.
• Generalization of “delta.”
$$\frac{\partial R}{\partial W_{ij}^l} = \frac{\partial R}{\partial b_i^l}\, x_j^{l-1}$$
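The lemma follows in one step: $R$ depends on $W_{ij}^l$ and $b_i^l$ only through $u_i^l = \sum_j W_{ij}^l x_j^{l-1} + b_i^l$, so the chain rule gives

```latex
\frac{\partial R}{\partial W_{ij}^l}
  = \frac{\partial R}{\partial u_i^l}\,\frac{\partial u_i^l}{\partial W_{ij}^l}
  = \frac{\partial R}{\partial u_i^l}\, x_j^{l-1},
\qquad
\frac{\partial R}{\partial b_i^l}
  = \frac{\partial R}{\partial u_i^l}\,\frac{\partial u_i^l}{\partial b_i^l}
  = \frac{\partial R}{\partial u_i^l},
```

and substituting the second identity into the first yields the lemma.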
Coordinate transformation
$$u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l\, f(u_j^{l-1}) + b_i^l$$

$$\frac{\partial u_i^l}{\partial u_j^{l-1}} = W_{ij}^l\, f'(u_j^{l-1})$$

$$\frac{\partial R}{\partial u_i^l} = \frac{\partial R}{\partial b_i^l}$$
Output layer
$$x_i^L = f(u_i^L), \qquad u_i^L = \sum_j W_{ij}^L\, x_j^{L-1} + b_i^L$$

$$\frac{\partial R}{\partial b_i^L} = f'(u_i^L)\,\frac{\partial R}{\partial x_i^L}$$
Chain rule
• Composition of two functions: $u^{l-1} \to u^l \to R$ gives $u^{l-1} \to R$.

$$\frac{\partial R}{\partial u_j^{l-1}} = \sum_i \frac{\partial R}{\partial u_i^l}\,\frac{\partial u_i^l}{\partial u_j^{l-1}}$$

$$\frac{\partial R}{\partial b_j^{l-1}} = \sum_i \frac{\partial R}{\partial b_i^l}\, W_{ij}^l\, f'(u_j^{l-1})$$
Computational complexity
• Naïve estimate:
  – network output: order N
  – each component of the gradient: order N
  – N components: order N²
• With backprop: order N
Biological plausibility
• Local: pre- and postsynaptic variables
• Forward and backward passes use the same weights
• Extra set of variables
$$x_j^{l-1} \xrightarrow{W_{ij}^l} x_i^l, \qquad s_j^{l-1} \xleftarrow{W_{ij}^l} s_i^l$$
Backprop for brain modeling
• Backprop may not be a plausible account of learning in the brain.
• But perhaps the networks created by it are similar to biological neural networks.
• Zipser and Andersen:
  – train a network
  – compare its hidden neurons with those found in the brain.
LeNet
• Weight-sharing
• Sigmoidal neurons
• Learn binary outputs
Machine learning revolution
• Gradient following
  – or other hill-climbing methods
• Empirical error minimization