Back-propagation (cnbc.cmu.edu/~plaut/intropdp/slides/backprop.pdf)
TRANSCRIPT
XOR with intermediate (“hidden”) units
Intermediate units can re-represent input patterns as new patterns with altered similarities
Targets which are not linearly separable in the input space can be linearly separable in the intermediate representational space
Intermediate units are called “hidden” because their activations are not determined directly by the training environment (inputs and targets)
1 / 33
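The re-representation idea above can be made concrete with a tiny numerical sketch (the fixed hidden weights here are illustrative, not from the slides): two threshold hidden units computing OR and AND re-represent the inputs so that XOR, which is not linearly separable in input space, becomes separable by a single linear threshold in hidden space.

```python
def step(x):
    """Binary threshold unit: 1 if net input is positive, else 0."""
    return 1 if x > 0 else 0

def hidden(x1, x2):
    # Fixed (hand-chosen) hidden re-representation:
    # h1 computes OR(x1, x2), h2 computes AND(x1, x2)
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    return h1, h2

def xor_from_hidden(x1, x2):
    # In the (h1, h2) space a single linear threshold suffices:
    # XOR is true when OR is on but AND is off
    h1, h2 = hidden(x1, x2)
    return step(h1 - 2 * h2 - 0.5)
```

Checking all four input patterns confirms the hidden layer has made the targets linearly separable; back-propagation's job is to learn such a re-representation rather than have it hand-wired.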
[Diagram: network with hidden and output layers]
Hidden-to-output weights can be trained with the Delta rule
How can we train input-to-hidden weights?
Hidden units do not have targets (for determining error)
Trick: We don’t need to know the error, just the error derivatives
2 / 33
Delta rule as gradient descent in error (sigmoid units)
n_j = \sum_i a_i w_{ij}

a_j = \frac{1}{1 + \exp(-n_j)}

Error: E = \frac{1}{2} \sum_j (t_j - a_j)^2

[Diagram: a_i → n_j → a_j → E, with weight w_{ij} and target t_j]

Gradient descent: \Delta w_{ij} = -\epsilon \, \frac{\partial E}{\partial w_{ij}}

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j} \, \frac{d a_j}{d n_j} \, \frac{\partial n_j}{\partial w_{ij}} = -(t_j - a_j) \, a_j (1 - a_j) \, a_i

\Delta w_{ij} = -\epsilon \, \frac{\partial E}{\partial w_{ij}} = \epsilon \, (t_j - a_j) \, a_j (1 - a_j) \, a_i
3 / 33
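A minimal sketch of the Delta-rule update derived above for a single sigmoid unit, in pure Python (the learning rate `eps` and the helper names are illustrative assumptions, not from the slides):

```python
import math

def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))

def delta_rule_step(a_in, w, t, eps=0.5):
    """One gradient-descent step on E = 1/2 (t - a_j)^2 for a sigmoid unit.

    Implements the slide's update: delta_w_ij = eps * (t_j - a_j) * a_j * (1 - a_j) * a_i
    """
    n = sum(ai * wi for ai, wi in zip(a_in, w))   # n_j = sum_i a_i w_ij
    a = sigmoid(n)                                # a_j = 1 / (1 + exp(-n_j))
    return [wi + eps * (t - a) * a * (1 - a) * ai
            for ai, wi in zip(a_in, w)]

def error(a_in, w, t):
    a = sigmoid(sum(ai * wi for ai, wi in zip(a_in, w)))
    return 0.5 * (t - a) ** 2
```

One step of this update should reduce the error on the training pattern it was computed from, since the step moves downhill in E.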
Generalized Delta rule (“back-propagation”)
n_j = \sum_i a_i w_{ij}

a_j = \frac{1}{1 + \exp(-n_j)}

Error: E = \frac{1}{2} \sum_j (t_j - a_j)^2

[Diagram: hidden unit a_i (net input n_i) → n_j → a_j → E, with weight w_{ij} and target t_j]

Gradient descent: \Delta w_{ij} = -\epsilon \, \frac{\partial E}{\partial w_{ij}}

\frac{\partial E}{\partial n_j} = \frac{\partial E}{\partial a_j} \, \frac{d a_j}{d n_j} = -(t_j - a_j) \, a_j (1 - a_j)

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial n_j} \, \frac{\partial n_j}{\partial w_{ij}} = \frac{\partial E}{\partial n_j} \, a_i

\frac{\partial E}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j} \, \frac{\partial n_j}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j} \, w_{ij}
4 / 33
Back-propagation
Forward pass (⇑):

a_i = \frac{1}{1 + \exp(-n_i)}

n_j = \sum_i a_i w_{ij}

a_j = \frac{1}{1 + \exp(-n_j)}

Backward pass (⇓):

\frac{\partial E}{\partial a_j} = -(t_j - a_j)

\frac{\partial E}{\partial n_j} = \frac{\partial E}{\partial a_j} \, a_j (1 - a_j)

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial n_j} \, a_i

\frac{\partial E}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j} \, w_{ij}
5 / 33
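The forward and backward passes above can be sketched for one hidden layer in pure Python (weight layouts `W1[i][j]`, `W2[j][k]` and the function names are assumptions for illustration; the equations are the slides'):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W1, W2):
    # Forward pass: input -> hidden -> output, sigmoid units throughout
    h = [sigmoid(sum(xi * W1[i][j] for i, xi in enumerate(x)))
         for j in range(len(W1[0]))]
    y = [sigmoid(sum(hj * W2[j][k] for j, hj in enumerate(h)))
         for k in range(len(W2[0]))]
    return h, y

def error(x, t, W1, W2):
    # E = 1/2 sum_k (t_k - y_k)^2
    _, y = forward(x, W1, W2)
    return 0.5 * sum((tk - yk) ** 2 for tk, yk in zip(t, y))

def backward(x, t, W1, W2):
    """Backward pass: returns dE/dW1 and dE/dW2."""
    h, y = forward(x, W1, W2)
    # dE/dn at the output: -(t_j - a_j) * a_j * (1 - a_j)
    dn_out = [-(tk - yk) * yk * (1 - yk) for tk, yk in zip(t, y)]
    # dE/dW2[j][k] = dE/dn_k * a_j
    g2 = [[hj * dk for dk in dn_out] for hj in h]
    # Propagate error to hidden activations: dE/da_j = sum_k dE/dn_k * w_jk
    da_h = [sum(W2[j][k] * dn_out[k] for k in range(len(dn_out)))
            for j in range(len(h))]
    dn_h = [da * hj * (1 - hj) for da, hj in zip(da_h, h)]
    # dE/dW1[i][j] = dE/dn_j * a_i
    g1 = [[xi * dj for dj in dn_h] for xi in x]
    return g1, g2
```

Because the backward pass is an exact application of the chain rule, the returned gradients agree with numerical (finite-difference) derivatives of `error`.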
What do hidden representations learn?
Plaut and Shallice (1993)
Mapped orthography to semantics (unrelated similarities)
Compared similarities among hidden representations to those among orthographic and semantic representations (over settling)
Hidden representations “split the difference” between input and output similarity
6 / 33
Accelerating learning: Momentum descent
\Delta w_{ij}^{[t]} = -\epsilon \, \frac{\partial E}{\partial w_{ij}} + \alpha \left( \Delta w_{ij}^{[t-1]} \right)
7 / 33
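The momentum update above, as a minimal sketch (the values of `eps` and `alpha` are illustrative assumptions; `velocity` holds the previous step Δw[t−1]):

```python
def momentum_step(w, grad, velocity, eps=0.1, alpha=0.9):
    """One momentum-descent step: delta_w[t] = -eps * dE/dw + alpha * delta_w[t-1]."""
    new_velocity = [-eps * g + alpha * v for g, v in zip(grad, velocity)]
    new_w = [wi + dv for wi, dv in zip(w, new_velocity)]
    return new_w, new_velocity
```

Accumulating a fraction of the previous step smooths the trajectory: components of the gradient that keep the same sign build up speed, while oscillating components cancel, which is what accelerates learning on the error surfaces shown on the following slides.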
“Auto-encoder” network (4–2–4)
8 / 33
Projections of error surface in weight space
Asterisk: error of current set of weights
Tick mark: error of next set of weights
Solid curve (0): Gradient direction
Solid curve (21): Integrated gradient direction (including momentum)
This is the actual direction of the weight step (tick mark is on this curve); the number is the angle with the gradient direction
Dotted curves: Random directions (each labeled by angle with gradient direction)
9 / 33
Epochs 1-2
10 / 33
Epochs 3-4
11 / 33
Epochs 5-6
12 / 33
Epochs 7-8
13 / 33
Epochs 9-10
14 / 33
Epochs 25-50
15 / 33
Epochs 75-end
16 / 33
High momentum (epochs 1-2)
17 / 33
High momentum (epochs 3-4)
18 / 33
High learning rate (epochs 1-2)
19 / 33
High learning rate (epochs 3-4)
20 / 33