Artificial Neural Networks - University of Minnesota Duluth (rmaclin/cs5541/notes/ml_chapter04.pdf)
TRANSCRIPT
Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics
CS 5751 Machine Learning
Chapter 4 Artificial Neural Networks 1
Connectionist Models
Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4 to 10^5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough
→ must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
[Figure: a network with a 30x32 sensor input retina, 4 hidden units, and output units for SharpLeft, StraightAhead, SharpRight]
Perceptron
[Figure: inputs x1, ..., xn with weights w1, ..., wn, plus a constant input x0 = 1 with weight w0, feeding a summation unit Σ wi xi followed by a threshold]

o(x1, ..., xn) = 1  if w0 + w1 x1 + ... + wn xn > 0
               = -1 otherwise

Sometimes we will use simpler vector notation:

o(x) = 1  if w · x > 0
     = -1 otherwise
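The threshold unit above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's code; the convention of folding the bias weight w0 into the weight vector by prepending x0 = 1 follows the slide.

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: o(x) = 1 if w . x > 0 else -1.
    w includes the bias weight w0; x is augmented with x0 = 1."""
    xa = np.concatenate(([1.0], x))   # prepend x0 = 1 for the bias term
    return 1 if np.dot(w, xa) > 0 else -1

# Example: w0 = -0.5, w1 = 1, w2 = 0 fires only when x1 = 1
w = np.array([-0.5, 1.0, 0.0])
```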
Decision Surface of Perceptron
[Figure: two plots in the (x1, x2) plane; in one, a line separates the positive and negative examples (linearly separable); in the other, no single line can separate them]

Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)?
But some functions are not representable
• e.g., not linearly separable
• therefore, we will want networks of these ...
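For the AND question above, one answer can be checked directly. The specific weight values below are one choice of mine, not the only valid setting; any line separating (1,1) from the other three corners works.

```python
import numpy as np

def threshold_unit(w, x):
    # o = 1 if w0 + w1*x1 + w2*x2 > 0, else -1
    return 1 if w[0] + np.dot(w[1:], x) > 0 else -1

# One weight setting that realizes AND: the line -1.5 + x1 + x2 = 0
# puts only the corner (1,1) on the positive side.
w_and = np.array([-1.5, 1.0, 1.0])
truth_table = {(x1, x2): threshold_unit(w_and, np.array([x1, x2]))
               for x1 in (0, 1) for x2 in (0, 1)}
# truth_table -> {(0,0): -1, (0,1): -1, (1,0): -1, (1,1): 1}
```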
Perceptron Training Rule

wi ← wi + Δwi  where  Δwi = η(t - o)xi

where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge
• if the training data is linearly separable
• and η is sufficiently small
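The training rule above can be sketched as follows. This is a minimal illustration under my own conventions (bias folded in as x0 = 1, targets in {-1, +1}); trained here on OR, which is linearly separable as the convergence result requires.

```python
import numpy as np

def train_perceptron(examples, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta*(t - o)*x_i."""
    n = len(examples[0][0])
    w = np.zeros(n + 1)               # w0 (bias) plus one weight per input
    for _ in range(epochs):
        for x, t in examples:
            xa = np.concatenate(([1.0], x))
            o = 1 if np.dot(w, xa) > 0 else -1
            w += eta * (t - o) * xa   # no change when o == t
    return w

# Learn OR from its four examples:
data = [(np.array([0., 0.]), -1), (np.array([0., 1.]), 1),
        (np.array([1., 0.]), 1), (np.array([1., 1.]), 1)]
w = train_perceptron(data)
```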
Gradient Descent

To understand, consider a simpler linear unit, where

o = w0 + w1 x1 + ... + wn xn

Idea: learn wi's that minimize the squared error

E[w] ≡ (1/2) Σ_{d∈D} (t_d - o_d)^2

where D is the set of training examples.
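The error E[w] above translates directly into code. A small sketch (function names and the toy data are mine):

```python
import numpy as np

def linear_output(w, x):
    # o = w0 + w1*x1 + ... + wn*xn
    return w[0] + np.dot(w[1:], x)

def squared_error(w, examples):
    # E[w] = (1/2) * sum over d in D of (t_d - o_d)^2
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

# Two toy examples consistent with the target t = 1 + x:
data = [(np.array([1.0]), 2.0), (np.array([2.0]), 3.0)]
```

With w = (1, 1) the unit reproduces every target exactly, so E[w] = 0; with w = (0, 1) each example contributes an error of 1, so E[w] = 1.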
Gradient Descent
[Figure: the error surface E[w] over the weight space (w0, w1), a parabolic bowl; gradient descent steps move downhill toward the minimum]
Gradient Descent

Gradient: ∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]

Training rule: Δw = -η ∇E[w]

i.e., Δwi = -η ∂E/∂wi
Gradient Descent

∂E/∂wi = ∂/∂wi (1/2) Σ_d (t_d - o_d)^2
       = (1/2) Σ_d ∂/∂wi (t_d - o_d)^2
       = (1/2) Σ_d 2 (t_d - o_d) ∂/∂wi (t_d - o_d)
       = Σ_d (t_d - o_d) ∂/∂wi (t_d - w · x_d)

∂E/∂wi = Σ_d (t_d - o_d)(-x_{i,d})
Gradient Descent

GRADIENT-DESCENT(training_examples, η)

Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

• Initialize each wi to some small random value
• Until the termination condition is met, do
  - Initialize each Δwi to zero.
  - For each <x, t> in training_examples, do
    * Input the instance x and compute the output o
    * For each linear unit weight wi, do
        Δwi ← Δwi + η(t - o)xi
  - For each linear unit weight wi, do
      wi ← wi + Δwi
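The pseudocode above maps onto NumPy almost line for line. A sketch, not the course's code; the iteration count, seed, and toy data are my choices, and note that it accumulates Δwi over all examples before applying them once per pass (batch mode):

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, iterations=500):
    """GRADIENT-DESCENT for a linear unit, following the slide's pseudocode."""
    n = len(training_examples[0][0])
    w = np.random.default_rng(0).uniform(-0.05, 0.05, n + 1)  # small random wi
    for _ in range(iterations):
        delta = np.zeros_like(w)          # each Delta_w_i starts at zero
        for x, t in training_examples:
            xa = np.concatenate(([1.0], x))
            o = np.dot(w, xa)             # linear unit output
            delta += eta * (t - o) * xa   # Delta_w_i <- Delta_w_i + eta*(t-o)*x_i
        w += delta                        # w_i <- w_i + Delta_w_i
    return w

# Fit the toy target t = 1 + 2x:
data = [(np.array([x]), 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
w = gradient_descent(data)
```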
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied:
  1. Compute the gradient ∇E_D[w]
  2. w ← w - η ∇E_D[w]

  E_D[w] ≡ (1/2) Σ_{d∈D} (t_d - o_d)^2

Incremental mode Gradient Descent:
Do until satisfied:
  - For each training example d in D
    1. Compute the gradient ∇E_d[w]
    2. w ← w - η ∇E_d[w]

  E_d[w] ≡ (1/2)(t_d - o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
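The incremental mode differs from the batch sketch only in applying the update immediately after each example, using the single-example error E_d. A minimal illustration under the same assumptions (toy data and hyperparameters are mine):

```python
import numpy as np

def incremental_gradient_descent(training_examples, eta=0.02, iterations=500):
    """Incremental (stochastic) mode: update w after EACH example."""
    n = len(training_examples[0][0])
    w = np.zeros(n + 1)
    for _ in range(iterations):
        for x, t in training_examples:
            xa = np.concatenate(([1.0], x))
            o = np.dot(w, xa)
            w += eta * (t - o) * xa   # one gradient step on E_d alone
    return w

# Same toy target as before, t = 1 + 2x:
data = [(np.array([x]), 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
w = incremental_gradient_descent(data)
```

Because η is small, the per-example steps track the batch trajectory closely, which is the approximation claim on the slide.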
Multilayer Networks of Sigmoid Units
[Figure: a network with inputs x1, x2, hidden units h1, h2, and output o; the input space is divided into decision regions labeled d0, d1, d2, d3]
Multilayer Decision Space
[Figure: the non-linear decision regions produced by the multilayer network]
Sigmoid Unit
[Figure: inputs x1, ..., xn with weights w1, ..., wn, plus x0 = 1 with weight w0, feeding a summation unit and then a sigmoid]

net = Σ_{i=0}^{n} wi xi

o = σ(net) = 1 / (1 + e^{-net})

σ(x) is the sigmoid function: σ(x) = 1 / (1 + e^{-x})

Nice property: dσ(x)/dx = σ(x)(1 - σ(x))

We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
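The "nice property" above is easy to verify numerically. A short sketch (the finite-difference check and test points are my additions):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    # Nice property: d sigma(x)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Check the identity against a central finite difference:
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_deriv(x)) < 1e-6
```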
The Sigmoid Function

σ(x) = 1 / (1 + e^{-x})

[Figure: plot of output versus net input for net input from -6 to 6; the curve rises smoothly from 0 to 1]

Sort of a rounded step function
Unlike the step function, we can take its derivative (makes learning possible)
Error Gradient for a Sigmoid Unit

∂E/∂wi = ∂/∂wi (1/2) Σ_{d∈D} (t_d - o_d)^2
       = (1/2) Σ_d ∂/∂wi (t_d - o_d)^2
       = (1/2) Σ_d 2 (t_d - o_d) ∂/∂wi (t_d - o_d)
       = Σ_d (t_d - o_d) (-∂o_d/∂wi)
       = -Σ_d (t_d - o_d) (∂o_d/∂net_d)(∂net_d/∂wi)

But we know:

∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d(1 - o_d)

∂net_d/∂wi = ∂(w · x_d)/∂wi = x_{i,d}

So:

∂E/∂wi = -Σ_{d∈D} (t_d - o_d) o_d (1 - o_d) x_{i,d}
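The final gradient formula above can be checked against a numeric finite-difference gradient. A small sketch (the particular w and toy examples are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(w, examples):
    # E = 1/2 sum_d (t_d - o_d)^2 for a single sigmoid unit
    return 0.5 * sum((t - sigmoid(np.dot(w, x))) ** 2 for x, t in examples)

def error_gradient(w, examples):
    # dE/dw_i = - sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}
    grad = np.zeros_like(w)
    for x, t in examples:
        o = sigmoid(np.dot(w, x))
        grad += -(t - o) * o * (1 - o) * x
    return grad

w = np.array([0.3, -0.2])
data = [(np.array([1.0, 0.5]), 1.0), (np.array([1.0, -1.0]), 0.0)]

# Verify each component against a central finite difference:
h = 1e-6
for i in range(len(w)):
    e = np.zeros_like(w); e[i] = h
    numeric = (error(w + e, data) - error(w - e, data)) / (2 * h)
    assert abs(numeric - error_gradient(w, data)[i]) < 1e-6
```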
Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, do
• For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k
       δk ← ok(1 - ok)(tk - ok)
  3. For each hidden unit h
       δh ← oh(1 - oh) Σ_{k∈outputs} w_{h,k} δk
  4. Update each network weight w_{i,j}
       w_{i,j} ← w_{i,j} + Δw_{i,j}  where  Δw_{i,j} = η δj x_{i,j}
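The four steps above can be sketched for a two-layer network of sigmoid units. This is a minimal illustration, not the course's code: bias units are omitted, the weight-matrix shapes are my own convention, and the demo trains on a single example only to show the error being driven down.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_hid, W_out, x, t, eta=0.1):
    """One pass of the slide's algorithm.
    W_hid: (n_hidden x n_in) hidden weights; W_out: (n_out x n_hidden)."""
    # 1. Forward pass: compute the outputs
    o_h = sigmoid(W_hid @ x)
    o_k = sigmoid(W_out @ o_h)
    # 2. delta_k = o_k (1 - o_k)(t_k - o_k) for each output unit
    delta_k = o_k * (1 - o_k) * (t - o_k)
    # 3. delta_h = o_h (1 - o_h) * sum_k w_{h,k} delta_k for each hidden unit
    delta_h = o_h * (1 - o_h) * (W_out.T @ delta_k)
    # 4. w_{i,j} <- w_{i,j} + eta * delta_j * x_{i,j}
    W_out = W_out + eta * np.outer(delta_k, o_h)
    W_hid = W_hid + eta * np.outer(delta_h, x)
    return W_hid, W_out

rng = np.random.default_rng(1)
W_hid = rng.uniform(-0.5, 0.5, (3, 2))   # small random initial weights
W_out = rng.uniform(-0.5, 0.5, (1, 3))
x, t = np.array([1.0, 0.0]), np.array([1.0])
for _ in range(5000):
    W_hid, W_out = backprop_step(W_hid, W_out, x, t)
# After many passes the output approaches the target t = 1.
```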
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
  - In practice, often works well (can run multiple times)
• Often include weight momentum α

    Δw_{i,j}(n) = η δj x_{i,j} + α Δw_{i,j}(n-1)

• Minimizes error over training examples
  - Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
  - Using network after training is fast
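The momentum term has a simple quantitative effect worth seeing: for a gradient component that stays constant across updates, the step size grows toward η·g/(1 - α), so consistent gradient directions are amplified. A tiny sketch (function name is mine):

```python
def momentum_step(delta_j_x, prev_delta, eta=0.1, alpha=0.9):
    # Delta w_{i,j}(n) = eta * delta_j * x_{i,j} + alpha * Delta w_{i,j}(n-1)
    return eta * delta_j_x + alpha * prev_delta

# With a constant gradient term g = 1, the step approaches
# eta * g / (1 - alpha) = 0.1 / 0.1 = 1.0, a 10x amplification:
g, delta = 1.0, 0.0
for _ in range(200):
    delta = momentum_step(g, delta)
print(delta)  # -> approximately 1.0
```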
Learning Hidden Layer Representations
[Figure: an 8x3x8 network; eight input units, three hidden units, eight output units]

Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Learning Hidden Layer Representations

Learned hidden layer representation:

Input → Hidden Values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
Output Unit Error during Training
[Figure: sum of squared errors for each output unit, plotted against training iterations 0 to 2500; the errors decrease as training proceeds]
Hidden Unit Encoding
[Figure: hidden unit encoding for one input, plotted against training iterations 0 to 2500; the three hidden unit values settle into a distinct code]
Input to Hidden Weights
[Figure: weights from the inputs to one hidden unit, plotted against training iterations 0 to 2500; the weight values spread out over roughly -5 to 4]
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions as training progresses
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require an exponential (in the number of inputs) number of hidden units
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
[Figure: error versus number of weight updates (example 1), updates 0 to 20000; the training set error decreases steadily while the validation set error decreases and then begins to rise]
Overfitting in ANNs
[Figure: error versus number of weight updates (example 2), updates 0 to 6000; again the validation set error eventually rises while the training set error continues to decrease]
Neural Nets for Face Recognition
[Figure: a network with 30x32 inputs and output units for head pose (left, strt, rgt, up); typical input images shown below]

90% accurate at learning head pose, and recognizing 1-of-20 faces
Learned Network Weights
[Figure: the learned weights from the 30x32 inputs to a hidden unit, displayed as images, for the pose outputs left, strt, rgt, up; typical input images shown alongside]
Alternative Error Functions

Penalize large weights:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd - o_kd)^2 + γ Σ_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} [ (t_kd - o_kd)^2 + μ Σ_j ( ∂t_kd/∂x_d^j - ∂o_kd/∂x_d^j )^2 ]

Tie weights together:
• e.g., in phoneme recognition
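The weight-penalty error function above can be sketched directly; its gradient adds a -2γw_{ji} term to every weight update, which is why it is often called "weight decay". A minimal illustration specialized to a single sigmoid output unit (the function names, γ value, and toy data are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_error(w, examples, gamma=0.01):
    """E(w) = (1/2) sum_d sum_k (t_kd - o_kd)^2 + gamma * sum_{j,i} w_{ji}^2,
    here with a single output unit so the sum over k has one term."""
    data_term = 0.5 * sum((t - sigmoid(np.dot(w, x))) ** 2 for x, t in examples)
    return data_term + gamma * float(np.sum(w ** 2))

# With zero weights the penalty vanishes and only the data term remains:
data = [(np.array([1.0, 0.0]), 1.0)]
e0 = penalized_error(np.zeros(2), data)   # o = sigmoid(0) = 0.5
```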
Recurrent Networks
[Figure: three diagrams. A feedforward network maps x(t) to y(t+1). A recurrent network feeds its output y(t) back in alongside x(t). The recurrent network unfolded in time replicates the network at steps t, t-1, t-2, taking x(t), x(t-1), x(t-2) as inputs]
Neural Network Summary
• physiologically (neurons) inspired model
• powerful (accurate), slow, opaque (hard to understand resulting model)
• bias: preferential
  - based on gradient descent
  - finds local minimum
  - affected by initial conditions, parameters
• neural units
  - linear
  - linear threshold
  - sigmoid
Neural Network Summary (cont)
• gradient descent
  - convergence
• linear units
  - limitation: hyperplane decision surface
  - learning rule
• multilayer network
  - advantage: can have non-linear decision surface
  - backpropagation to learn
    • backprop learning rule
• learning issues
  - units used
Neural Network Summary (cont)
• learning issues (cont)
  - batch versus incremental (stochastic)
  - parameters
    • initial weights
    • learning rate
    • momentum
  - cost (error) function
    • sum of squared errors
    • can include penalty terms
• recurrent networks
  - simple
  - backpropagation through time