Artificial Neural Networks - University of Minnesota Duluth (rmaclin/cs5541/notes/ml_chapter04.pdf)
TRANSCRIPT
Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics
CS 5751 Machine Learning
Chapter 4 Artificial Neural Networks 1
Connectionist Models
Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4 to 10^5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough
→ must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
[Figure: a network with a 30x32 sensor input retina, 4 hidden units, and output units for SharpLeft, StraightAhead, SharpRight]
Perceptron
[Figure: inputs x1, ..., xn with weights w1, ..., wn, plus a constant input x0 = 1 with weight w0, feeding a summation unit Σ wi xi followed by a threshold]

o(x1, ..., xn) = 1  if w0 + w1 x1 + ... + wn xn > 0
               = -1 otherwise

Sometimes we will use simpler vector notation:

o(x) = 1  if w · x > 0
     = -1 otherwise
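The threshold unit above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's code; the convention of folding the bias weight w0 into the weight vector by prepending x0 = 1 follows the slide.

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: o(x) = 1 if w . x > 0 else -1.
    w includes the bias weight w0; x is augmented with x0 = 1."""
    xa = np.concatenate(([1.0], x))   # prepend x0 = 1 for the bias term
    return 1 if np.dot(w, xa) > 0 else -1

# Example: w0 = -0.5, w1 = 1, w2 = 0 fires only when x1 = 1
w = np.array([-0.5, 1.0, 0.0])
```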
Decision Surface of Perceptron
[Figure: two plots in the (x1, x2) plane; in one, a line separates the positive and negative examples (linearly separable); in the other, no single line can separate them]

Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)?
But some functions are not representable
• e.g., not linearly separable
• therefore, we will want networks of these ...
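For the AND question above, one answer can be checked directly. The specific weight values below are one choice of mine, not the only valid setting; any line separating (1,1) from the other three corners works.

```python
import numpy as np

def threshold_unit(w, x):
    # o = 1 if w0 + w1*x1 + w2*x2 > 0, else -1
    return 1 if w[0] + np.dot(w[1:], x) > 0 else -1

# One weight setting that realizes AND: the line -1.5 + x1 + x2 = 0
# puts only the corner (1,1) on the positive side.
w_and = np.array([-1.5, 1.0, 1.0])
truth_table = {(x1, x2): threshold_unit(w_and, np.array([x1, x2]))
               for x1 in (0, 1) for x2 in (0, 1)}
# truth_table -> {(0,0): -1, (0,1): -1, (1,0): -1, (1,1): 1}
```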
Perceptron Training Rule

wi ← wi + Δwi  where  Δwi = η(t - o)xi

where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge
• if the training data is linearly separable
• and η is sufficiently small
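The training rule above can be sketched as follows. This is a minimal illustration under my own conventions (bias folded in as x0 = 1, targets in {-1, +1}); trained here on OR, which is linearly separable as the convergence result requires.

```python
import numpy as np

def train_perceptron(examples, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta*(t - o)*x_i."""
    n = len(examples[0][0])
    w = np.zeros(n + 1)               # w0 (bias) plus one weight per input
    for _ in range(epochs):
        for x, t in examples:
            xa = np.concatenate(([1.0], x))
            o = 1 if np.dot(w, xa) > 0 else -1
            w += eta * (t - o) * xa   # no change when o == t
    return w

# Learn OR from its four examples:
data = [(np.array([0., 0.]), -1), (np.array([0., 1.]), 1),
        (np.array([1., 0.]), 1), (np.array([1., 1.]), 1)]
w = train_perceptron(data)
```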
Gradient Descent

To understand, consider a simpler linear unit, where

o = w0 + w1 x1 + ... + wn xn

Idea: learn wi's that minimize the squared error

E[w] ≡ (1/2) Σ_{d∈D} (t_d - o_d)^2

where D is the set of training examples.
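The error E[w] above translates directly into code. A small sketch (function names and the toy data are mine):

```python
import numpy as np

def linear_output(w, x):
    # o = w0 + w1*x1 + ... + wn*xn
    return w[0] + np.dot(w[1:], x)

def squared_error(w, examples):
    # E[w] = (1/2) * sum over d in D of (t_d - o_d)^2
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

# Two toy examples consistent with the target t = 1 + x:
data = [(np.array([1.0]), 2.0), (np.array([2.0]), 3.0)]
```

With w = (1, 1) the unit reproduces every target exactly, so E[w] = 0; with w = (0, 1) each example contributes an error of 1, so E[w] = 1.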
Gradient Descent
[Figure: the error surface E[w] over the weight space (w0, w1), a parabolic bowl; gradient descent steps move downhill toward the minimum]
Gradient Descent

Gradient: ∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]

Training rule: Δw = -η ∇E[w]

i.e., Δwi = -η ∂E/∂wi
Gradient Descent

∂E/∂wi = ∂/∂wi (1/2) Σ_d (t_d - o_d)^2
       = (1/2) Σ_d ∂/∂wi (t_d - o_d)^2
       = (1/2) Σ_d 2 (t_d - o_d) ∂/∂wi (t_d - o_d)
       = Σ_d (t_d - o_d) ∂/∂wi (t_d - w · x_d)

∂E/∂wi = Σ_d (t_d - o_d)(-x_{i,d})
Gradient Descent

GRADIENT-DESCENT(training_examples, η)

Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

• Initialize each wi to some small random value
• Until the termination condition is met, do
  - Initialize each Δwi to zero.
  - For each <x, t> in training_examples, do
    * Input the instance x and compute the output o
    * For each linear unit weight wi, do
        Δwi ← Δwi + η(t - o)xi
  - For each linear unit weight wi, do
      wi ← wi + Δwi
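The pseudocode above maps onto NumPy almost line for line. A sketch, not the course's code; the iteration count, seed, and toy data are my choices, and note that it accumulates Δwi over all examples before applying them once per pass (batch mode):

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, iterations=500):
    """GRADIENT-DESCENT for a linear unit, following the slide's pseudocode."""
    n = len(training_examples[0][0])
    w = np.random.default_rng(0).uniform(-0.05, 0.05, n + 1)  # small random wi
    for _ in range(iterations):
        delta = np.zeros_like(w)          # each Delta_w_i starts at zero
        for x, t in training_examples:
            xa = np.concatenate(([1.0], x))
            o = np.dot(w, xa)             # linear unit output
            delta += eta * (t - o) * xa   # Delta_w_i <- Delta_w_i + eta*(t-o)*x_i
        w += delta                        # w_i <- w_i + Delta_w_i
    return w

# Fit the toy target t = 1 + 2x:
data = [(np.array([x]), 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
w = gradient_descent(data)
```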
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied:
  1. Compute the gradient ∇E_D[w]
  2. w ← w - η ∇E_D[w]

  E_D[w] ≡ (1/2) Σ_{d∈D} (t_d - o_d)^2

Incremental mode Gradient Descent:
Do until satisfied:
  - For each training example d in D
    1. Compute the gradient ∇E_d[w]
    2. w ← w - η ∇E_d[w]

  E_d[w] ≡ (1/2)(t_d - o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
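The incremental mode differs from the batch sketch only in applying the update immediately after each example, using the single-example error E_d. A minimal illustration under the same assumptions (toy data and hyperparameters are mine):

```python
import numpy as np

def incremental_gradient_descent(training_examples, eta=0.02, iterations=500):
    """Incremental (stochastic) mode: update w after EACH example."""
    n = len(training_examples[0][0])
    w = np.zeros(n + 1)
    for _ in range(iterations):
        for x, t in training_examples:
            xa = np.concatenate(([1.0], x))
            o = np.dot(w, xa)
            w += eta * (t - o) * xa   # one gradient step on E_d alone
    return w

# Same toy target as before, t = 1 + 2x:
data = [(np.array([x]), 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
w = incremental_gradient_descent(data)
```

Because η is small, the per-example steps track the batch trajectory closely, which is the approximation claim on the slide.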
Multilayer Networks of Sigmoid Units
[Figure: a network with inputs x1, x2, hidden units h1, h2, and output o; the input space is divided into decision regions labeled d0, d1, d2, d3]
Multilayer Decision Space
[Figure: the non-linear decision regions produced by the multilayer network]
Sigmoid Unit
[Figure: inputs x1, ..., xn with weights w1, ..., wn, plus x0 = 1 with weight w0, feeding a summation unit and then a sigmoid]

net = Σ_{i=0}^{n} wi xi

o = σ(net) = 1 / (1 + e^{-net})

σ(x) is the sigmoid function: σ(x) = 1 / (1 + e^{-x})

Nice property: dσ(x)/dx = σ(x)(1 - σ(x))

We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
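The "nice property" above is easy to verify numerically. A short sketch (the finite-difference check and test points are my additions):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    # Nice property: d sigma(x)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Check the identity against a central finite difference:
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_deriv(x)) < 1e-6
```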
The Sigmoid Function

σ(x) = 1 / (1 + e^{-x})

[Figure: plot of output versus net input for net input from -6 to 6; the curve rises smoothly from 0 to 1]

Sort of a rounded step function
Unlike the step function, we can take its derivative (makes learning possible)
Error Gradient for a Sigmoid Unit

∂E/∂wi = ∂/∂wi (1/2) Σ_{d∈D} (t_d - o_d)^2
       = (1/2) Σ_d ∂/∂wi (t_d - o_d)^2
       = (1/2) Σ_d 2 (t_d - o_d) ∂/∂wi (t_d - o_d)
       = Σ_d (t_d - o_d) (-∂o_d/∂wi)
       = -Σ_d (t_d - o_d) (∂o_d/∂net_d)(∂net_d/∂wi)

But we know:

∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d(1 - o_d)

∂net_d/∂wi = ∂(w · x_d)/∂wi = x_{i,d}

So:

∂E/∂wi = -Σ_{d∈D} (t_d - o_d) o_d (1 - o_d) x_{i,d}
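The final gradient formula above can be checked against a numeric finite-difference gradient. A small sketch (the particular w and toy examples are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(w, examples):
    # E = 1/2 sum_d (t_d - o_d)^2 for a single sigmoid unit
    return 0.5 * sum((t - sigmoid(np.dot(w, x))) ** 2 for x, t in examples)

def error_gradient(w, examples):
    # dE/dw_i = - sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}
    grad = np.zeros_like(w)
    for x, t in examples:
        o = sigmoid(np.dot(w, x))
        grad += -(t - o) * o * (1 - o) * x
    return grad

w = np.array([0.3, -0.2])
data = [(np.array([1.0, 0.5]), 1.0), (np.array([1.0, -1.0]), 0.0)]

# Verify each component against a central finite difference:
h = 1e-6
for i in range(len(w)):
    e = np.zeros_like(w); e[i] = h
    numeric = (error(w + e, data) - error(w - e, data)) / (2 * h)
    assert abs(numeric - error_gradient(w, data)[i]) < 1e-6
```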
Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, do
• For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k
       δk ← ok(1 - ok)(tk - ok)
  3. For each hidden unit h
       δh ← oh(1 - oh) Σ_{k∈outputs} w_{h,k} δk
  4. Update each network weight w_{i,j}
       w_{i,j} ← w_{i,j} + Δw_{i,j}  where  Δw_{i,j} = η δj x_{i,j}
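The four steps above can be sketched for a two-layer network of sigmoid units. This is a minimal illustration, not the course's code: bias units are omitted, the weight-matrix shapes are my own convention, and the demo trains on a single example only to show the error being driven down.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_hid, W_out, x, t, eta=0.1):
    """One pass of the slide's algorithm.
    W_hid: (n_hidden x n_in) hidden weights; W_out: (n_out x n_hidden)."""
    # 1. Forward pass: compute the outputs
    o_h = sigmoid(W_hid @ x)
    o_k = sigmoid(W_out @ o_h)
    # 2. delta_k = o_k (1 - o_k)(t_k - o_k) for each output unit
    delta_k = o_k * (1 - o_k) * (t - o_k)
    # 3. delta_h = o_h (1 - o_h) * sum_k w_{h,k} delta_k for each hidden unit
    delta_h = o_h * (1 - o_h) * (W_out.T @ delta_k)
    # 4. w_{i,j} <- w_{i,j} + eta * delta_j * x_{i,j}
    W_out = W_out + eta * np.outer(delta_k, o_h)
    W_hid = W_hid + eta * np.outer(delta_h, x)
    return W_hid, W_out

rng = np.random.default_rng(1)
W_hid = rng.uniform(-0.5, 0.5, (3, 2))   # small random initial weights
W_out = rng.uniform(-0.5, 0.5, (1, 3))
x, t = np.array([1.0, 0.0]), np.array([1.0])
for _ in range(5000):
    W_hid, W_out = backprop_step(W_hid, W_out, x, t)
# After many passes the output approaches the target t = 1.
```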
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
  - In practice, often works well (can run multiple times)
• Often include weight momentum α

    Δw_{i,j}(n) = η δj x_{i,j} + α Δw_{i,j}(n-1)

• Minimizes error over training examples
  - Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
  - Using network after training is fast
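The momentum term has a simple quantitative effect worth seeing: for a gradient component that stays constant across updates, the step size grows toward η·g/(1 - α), so consistent gradient directions are amplified. A tiny sketch (function name is mine):

```python
def momentum_step(delta_j_x, prev_delta, eta=0.1, alpha=0.9):
    # Delta w_{i,j}(n) = eta * delta_j * x_{i,j} + alpha * Delta w_{i,j}(n-1)
    return eta * delta_j_x + alpha * prev_delta

# With a constant gradient term g = 1, the step approaches
# eta * g / (1 - alpha) = 0.1 / 0.1 = 1.0, a 10x amplification:
g, delta = 1.0, 0.0
for _ in range(200):
    delta = momentum_step(g, delta)
print(delta)  # -> approximately 1.0
```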
Learning Hidden Layer Representations
[Figure: an 8x3x8 network; eight input units, three hidden units, eight output units]

Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Learning Hidden Layer Representations

Learned hidden layer representation:

Input → Hidden Values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
Output Unit Error during Training
[Figure: sum of squared errors for each output unit, plotted against training iterations 0 to 2500; the errors decrease as training proceeds]
Hidden Unit Encoding
[Figure: hidden unit encoding for one input, plotted against training iterations 0 to 2500; the three hidden unit values settle into a distinct code]
Input to Hidden Weights
[Figure: weights from the inputs to one hidden unit, plotted against training iterations 0 to 2500; the weight values spread out over roughly -5 to 4]
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions as training progresses
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require an exponential (in the number of inputs) number of hidden units
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
[Figure: error versus number of weight updates (example 1), updates 0 to 20000; the training set error decreases steadily while the validation set error decreases and then begins to rise]
Overfitting in ANNs
[Figure: error versus number of weight updates (example 2), updates 0 to 6000; again the validation set error eventually rises while the training set error continues to decrease]
Neural Nets for Face Recognition
[Figure: a network with 30x32 inputs and output units for head pose (left, strt, rgt, up); typical input images shown below]

90% accurate at learning head pose, and recognizing 1-of-20 faces
Learned Network Weights
[Figure: the learned weights from the 30x32 inputs to a hidden unit, displayed as images, for the pose outputs left, strt, rgt, up; typical input images shown alongside]
Alternative Error Functions

Penalize large weights:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd - o_kd)^2 + γ Σ_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} [ (t_kd - o_kd)^2 + μ Σ_j ( ∂t_kd/∂x_d^j - ∂o_kd/∂x_d^j )^2 ]

Tie weights together:
• e.g., in phoneme recognition
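The weight-penalty error function above can be sketched directly; its gradient adds a -2γw_{ji} term to every weight update, which is why it is often called "weight decay". A minimal illustration specialized to a single sigmoid output unit (the function names, γ value, and toy data are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_error(w, examples, gamma=0.01):
    """E(w) = (1/2) sum_d sum_k (t_kd - o_kd)^2 + gamma * sum_{j,i} w_{ji}^2,
    here with a single output unit so the sum over k has one term."""
    data_term = 0.5 * sum((t - sigmoid(np.dot(w, x))) ** 2 for x, t in examples)
    return data_term + gamma * float(np.sum(w ** 2))

# With zero weights the penalty vanishes and only the data term remains:
data = [(np.array([1.0, 0.0]), 1.0)]
e0 = penalized_error(np.zeros(2), data)   # o = sigmoid(0) = 0.5
```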
Recurrent Networks
[Figure: three diagrams. A feedforward network maps x(t) to y(t+1). A recurrent network feeds its output y(t) back in alongside x(t). The recurrent network unfolded in time replicates the network at steps t, t-1, t-2, taking x(t), x(t-1), x(t-2) as inputs]
Neural Network Summary
• physiologically (neurons) inspired model
• powerful (accurate), slow, opaque (hard to understand resulting model)
• bias: preferential
  - based on gradient descent
  - finds local minimum
  - affected by initial conditions, parameters
• neural units
  - linear
  - linear threshold
  - sigmoid
Neural Network Summary (cont)
• gradient descent
  - convergence
• linear units
  - limitation: hyperplane decision surface
  - learning rule
• multilayer network
  - advantage: can have non-linear decision surface
  - backpropagation to learn
    • backprop learning rule
• learning issues
  - units used
Neural Network Summary (cont)
• learning issues (cont)
  - batch versus incremental (stochastic)
  - parameters
    • initial weights
    • learning rate
    • momentum
  - cost (error) function
    • sum of squared errors
    • can include penalty terms
• recurrent networks
  - simple
  - backpropagation through time