Artificial Neural Networks (CS 5751 Machine Learning, Chapter 4) - University of Minnesota Duluth
Source: rmaclin/cs5541/Notes/ML_Chapter04.pdf


Page 1: Artificial Neural Networks

• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics

Page 2: Artificial Neural Networks - University of Minnesota Duluthrmaclin/cs5541/Notes/ML_Chapter04.pdf · Chapter 4 Artificial Neural Networks 27 • Increasingly non-linear functions as

Connectionist ModelsConsider humans• Neuron switching time ~.001 second• Number of neurons ~1010

• Connections per neuron ~104-5

• Scene recognition time ~ 1 second• Scene recognition time ~.1 second• 100 inference step does not seem like enoughmust use lots of parallel computation!f p pProperties of artificial neural nets (ANNs):• Many neuron-like threshold switching units• Many weighted interconnections among units• Highly parallel, distributed process• Emphasis on tuning weights automaticallyCS 5751 Machine Learning

Chapter 4 Artificial Neural Networks 2

• Emphasis on tuning weights automatically

Page 3: When to Consider Neural Networks

• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant

Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction

Page 4: ALVINN

ALVINN drives 70 mph on highways.

[Figure: ALVINN network: a 30x32 sensor input retina feeding 4 hidden units, with output units for Sharp Left, Straight Ahead, and Sharp Right]

Page 5: Perceptron

[Figure: perceptron unit: inputs x_1, ..., x_n plus a constant input x_0 = 1, weights w_0, ..., w_n, a summation node computing \sum_{i=0}^{n} w_i x_i, and a threshold output]

o(x_1, \ldots, x_n) =
\begin{cases}
1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\
-1 & \text{otherwise}
\end{cases}

Sometimes we will use simpler vector notation:

o(\vec{x}) =
\begin{cases}
1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\
-1 & \text{otherwise}
\end{cases}
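A minimal sketch of this unit in Python with NumPy (the function name perceptron_output is ours, not from the slides):

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: returns 1 if w . x > 0, else -1.

    w includes the bias weight w_0; x excludes the constant input x_0.
    """
    x = np.concatenate(([1.0], x))  # prepend x_0 = 1 so w_0 acts as the bias
    return 1 if np.dot(w, x) > 0 else -1
```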

Page 6: Decision Surface of Perceptron

[Figure: two plots in the (x1, x2) plane: on the left, points separable by a straight line; on the right, points (such as XOR) that no single line can separate]

Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)?

But some functions are not representable
• e.g., those not linearly separable
• therefore, we will want networks of these units ...
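One workable answer to the AND question above, reusing the perceptron_output sketch (these specific weights are one choice among many):

```python
import numpy as np

# With 0/1 inputs, both x1 and x2 must be 1 to overcome the bias of -0.8.
w_and = np.array([-0.8, 0.5, 0.5])  # [w0, w1, w2]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron_output(w_and, np.array([x1, x2])))
# Prints -1 for (0,0), (0,1), (1,0) and +1 for (1,1): AND(x1, x2).
```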

Page 7: Perceptron Training Rule

w_i \leftarrow w_i + \Delta w_i, \quad \text{where } \Delta w_i = \eta (t - o) x_i

where:
• t = c(\vec{x}) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate

Can prove the rule will converge
• if training data is linearly separable
• and η is sufficiently small
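A sketch of the rule as a training loop (a fixed epoch count stands in for the convergence test; the names are ours):

```python
import numpy as np

def train_perceptron(examples, eta=0.1, epochs=50):
    """Apply w_i <- w_i + eta * (t - o) * x_i over the training set.

    examples: list of (x, t) pairs with target t in {-1, +1}.
    """
    n = len(examples[0][0])
    w = np.zeros(n + 1)                      # w[0] is the bias weight w_0
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(w, x)      # current perceptron output
            x1 = np.concatenate(([1.0], x))  # include the constant input
            w += eta * (t - o) * x1          # the perceptron training rule
    return w
```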

Page 8: Gradient Descent

To understand, consider a simpler linear unit, where

o = w_0 + w_1 x_1 + \cdots + w_n x_n

Idea: learn w_i's that minimize the squared error

E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

where D is the set of training examples.

Page 9: Gradient Descent

[Figure: the error surface E as a function of the weights, with the gradient descent path stepping downhill toward its minimum]

Page 10: Gradient Descent

Gradient:

\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]

Training rule:

\Delta \vec{w} = -\eta \nabla E[\vec{w}]

i.e.,

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}

Page 11: Gradient Descent

\frac{\partial E}{\partial w_i}
= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2
= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})
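The final line of the derivation, written out in NumPy (a vectorized sketch; X carries a leading column of ones for w_0):

```python
import numpy as np

def linear_unit_gradient(w, X, t):
    """dE/dw_i = sum_d (t_d - o_d) * (-x_{i,d}) for the linear unit.

    X: (num_examples, n + 1) input matrix, first column all ones.
    t: (num_examples,) target vector.
    """
    o = X @ w            # o_d = w . x_d for every training example
    return -(t - o) @ X  # one partial derivative per weight
```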

Page 12: Gradient Descent

GRADIENT-DESCENT(training_examples, η)

Each training example is a pair of the form ⟨\vec{x}, t⟩, where \vec{x} is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

• Initialize each w_i to some small random value
• Until the termination condition is met, do
  - Initialize each Δw_i to zero
  - For each ⟨\vec{x}, t⟩ in training_examples, do
    * Input the instance \vec{x} and compute the output o
    * For each linear unit weight w_i, do
      Δw_i ← Δw_i + η (t - o) x_i
  - For each linear unit weight w_i, do
    w_i ← w_i + Δw_i
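The pseudocode above, transcribed into Python (a sketch: a fixed epoch count replaces the unspecified termination condition):

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit, following the pseudocode.

    training_examples: list of (x, t) pairs; x is a plain input vector.
    """
    rng = np.random.default_rng(0)
    n = len(training_examples[0][0])
    w = rng.uniform(-0.05, 0.05, n + 1)      # small random initial weights
    for _ in range(epochs):                  # "until termination condition"
        delta_w = np.zeros_like(w)           # initialize each delta_w_i to 0
        for x, t in training_examples:
            x1 = np.concatenate(([1.0], x))  # constant input x_0 = 1
            o = np.dot(w, x1)                # compute the linear output
            delta_w += eta * (t - o) * x1    # accumulate eta * (t - o) * x_i
        w += delta_w                         # update weights once per pass
    return w
```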

Page 13: Artificial Neural Networks - University of Minnesota Duluthrmaclin/cs5541/Notes/ML_Chapter04.pdf · Chapter 4 Artificial Neural Networks 27 • Increasingly non-linear functions as

SummaryPerceptron training rule guaranteed to succeed if• Training examples are linearly separableTraining examples are linearly separable• Sufficiently small learning rate η

Linear unit training rule uses gradient descent• Guaranteed to converge to hypothesis with• Guaranteed to converge to hypothesis with

minimum squared error• Given sufficiently small learning rate ηGiven sufficiently small learning rate η• Even when training data contains noise• Even when training data not separable by HCS 5751 Machine Learning

Chapter 4 Artificial Neural Networks 13

• Even when training data not separable by H

Page 14: Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient ∇E_D[\vec{w}]
2. \vec{w} ← \vec{w} - η ∇E_D[\vec{w}]

where E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

Incremental mode Gradient Descent:
Do until satisfied:
- For each training example d in D:
  1. Compute the gradient ∇E_d[\vec{w}]
  2. \vec{w} ← \vec{w} - η ∇E_d[\vec{w}]

where E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
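The incremental variant differs from the batch sketch above only in where the update happens:

```python
import numpy as np

def incremental_gradient_descent(training_examples, eta=0.05, epochs=100):
    """Update the weights after every example (on E_d) instead of
    accumulating a batch delta over all of D."""
    rng = np.random.default_rng(0)
    n = len(training_examples[0][0])
    w = rng.uniform(-0.05, 0.05, n + 1)
    for _ in range(epochs):
        for x, t in training_examples:
            x1 = np.concatenate(([1.0], x))
            o = np.dot(w, x1)
            w += eta * (t - o) * x1  # per-example step down gradient of E_d
        # no separate batch update: each example moved w immediately
    return w
```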

Page 15: Multilayer Networks of Sigmoid Units

[Figure: on the left, a 2D data set with regions labeled d0-d3; on the right, a network with inputs x1 and x2, hidden sigmoid units h1 and h2, and output o]

Page 16: Multilayer Decision Space

[Figure: the non-linear decision regions produced by the multilayer network on the previous slide]

Page 17: Sigmoid Unit

[Figure: sigmoid unit: inputs x_1, ..., x_n plus x_0 = 1, weights w_0, ..., w_n, a summation node computing net, and output o = σ(net)]

net = \sum_{i=0}^{n} w_i x_i
\qquad
o = \sigma(net) = \frac{1}{1 + e^{-net}}

σ(x) is the sigmoid function:

\sigma(x) = \frac{1}{1 + e^{-x}}

Nice property:

\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))

We can derive gradient descent rules to train:
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
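The sigmoid and its derivative in code (the "nice property" means the derivative reuses the function's own output):

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """d sigma(x) / dx = sigma(x) * (1 - sigma(x))"""
    s = sigmoid(x)
    return s * (1.0 - s)
```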

Page 18: Artificial Neural Networks - University of Minnesota Duluthrmaclin/cs5541/Notes/ML_Chapter04.pdf · Chapter 4 Artificial Neural Networks 27 • Increasingly non-linear functions as

The Sigmoid Function

0 70.80.9

1

-xex

+=

11)(σ

0.30.40.50.60.7

outp

ut

00.10.2

-6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6

net input

Sort of a rounded step functionSort of a rounded step functionUnlike step function, can take derivative (makes learningpossible)

CS 5751 Machine Learning

Chapter 4 Artificial Neural Networks 18

Page 19: Error Gradient for a Sigmoid Unit

\frac{\partial E}{\partial w_i}
= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)
= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:

\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)

\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}

So:

\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}
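The closed-form gradient from the last line, vectorized (a sketch reusing the sigmoid helper above; X again carries a leading column of ones):

```python
import numpy as np

def sigmoid_unit_gradient(w, X, t):
    """dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}."""
    o = sigmoid(X @ w)                     # o_d = sigma(w . x_d)
    return -((t - o) * o * (1.0 - o)) @ X  # one component per weight
```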

Page 20: Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, do
• For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k:
     δ_k ← o_k (1 - o_k)(t_k - o_k)
  3. For each hidden unit h:
     δ_h ← o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} δ_k
  4. Update each network weight w_{i,j}:
     w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}
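A compact sketch of the algorithm for a single hidden layer, reusing the sigmoid helper defined earlier (the matrix shapes and function name are our conventions; the comments map onto steps 1-4 above):

```python
import numpy as np

def backprop_epoch(X, T, W_hidden, W_output, eta=0.3):
    """One pass of backpropagation over all examples.

    X: (num_examples, n_in) inputs; T: (num_examples, n_out) targets in (0, 1).
    W_hidden: (n_in + 1, n_hidden); W_output: (n_hidden + 1, n_out).
    Row 0 of each weight matrix holds the bias weights.
    """
    for x, t in zip(X, T):
        # 1. Forward pass: compute hidden and output activations
        x1 = np.concatenate(([1.0], x))
        h = sigmoid(x1 @ W_hidden)
        h1 = np.concatenate(([1.0], h))
        o = sigmoid(h1 @ W_output)
        # 2. Output unit errors: delta_k = o_k (1 - o_k) (t_k - o_k)
        delta_out = o * (1.0 - o) * (t - o)
        # 3. Hidden unit errors: delta_h = o_h (1 - o_h) sum_k w_hk delta_k
        #    (skip the bias row of W_output, which no hidden unit feeds)
        delta_hidden = h * (1.0 - h) * (W_output[1:] @ delta_out)
        # 4. Weight updates: w_ij <- w_ij + eta * delta_j * x_ij
        W_output += eta * np.outer(h1, delta_out)
        W_hidden += eta * np.outer(x1, delta_hidden)
    return W_hidden, W_output
```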

Page 21: More on Backpropagation

• Gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  - In practice, often works well (can run multiple times)
• Often include weight momentum α (see the sketch below):
  Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n-1)
• Minimizes error over training examples
  - Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
  - Using the network after training is fast
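The momentum term keeps a copy of the previous update; a minimal sketch (the function and argument names are ours):

```python
import numpy as np

def momentum_update(W, grad_step, dW_prev, alpha=0.9):
    """One weight update with momentum:
    dW(n) = eta * delta_j * x_ij + alpha * dW(n-1),
    where grad_step is the eta * delta_j * x_ij term already computed.
    Returns the new weights and the update to remember for step n + 1."""
    dW = grad_step + alpha * dW_prev
    return W + dW, dW
```

In the backprop_epoch sketch earlier, the two `+=` weight updates would become momentum_update calls that thread a dW_prev matrix from one example to the next.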

Page 22: Learning Hidden Layer Representations

[Figure: a network with 8 inputs, 3 hidden units, and 8 outputs, trained to reproduce its input at the output]

Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

Page 23: Learning Hidden Layer Representations

Learned hidden unit values:

Input → Hidden values → Output
10000000 → .08 .04 .89 → 10000000
01000000 → .88 .11 .01 → 01000000
00100000 → .27 .97 .01 → 00100000
00010000 → .71 .97 .99 → 00010000
00001000 → .02 .05 .03 → 00001000
00000100 → .99 .99 .22 → 00000100
00000010 → .98 .01 .80 → 00000010
00000001 → .01 .94 .60 → 00000001
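The experiment can be reproduced with the backprop_epoch sketch from earlier (the epoch count and learning rate here are our guesses, and the learned values will differ in detail from the table):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.eye(8)        # the eight one-hot input patterns
T = np.eye(8)        # identity task: targets equal inputs
W_hidden = rng.uniform(-0.05, 0.05, (9, 3))  # 8 inputs + bias -> 3 hidden
W_output = rng.uniform(-0.05, 0.05, (4, 8))  # 3 hidden + bias -> 8 outputs

for _ in range(5000):  # training takes thousands of passes, as the slides warn
    W_hidden, W_output = backprop_epoch(X, T, W_hidden, W_output, eta=0.3)

hidden = sigmoid(np.c_[np.ones(8), X] @ W_hidden)
print(np.round(hidden, 2))  # each row: one input's 3-value hidden encoding
```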

Page 24: Output Unit Error during Training

[Figure: sum of squared errors for each output unit versus training iterations (0 to 2500)]

Page 25: Hidden Unit Encoding

[Figure: the hidden unit encoding for one input pattern, plotted over training iterations (0 to 2500)]

Page 26: Input to Hidden Weights

[Figure: the weights from the inputs to one hidden unit, plotted over training iterations (0 to 2500); the values spread out over roughly -5 to 4]

Page 27: Convergence of Backpropagation

Gradient descent to some local minimum
• Perhaps not the global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)

Nature of convergence
• Initialize weights near zero
• Therefore, initial networks are near-linear
• Increasingly non-linear functions as training progresses

Page 28: Expressive Capabilities of ANNs

Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require an exponential (in the number of inputs) number of hidden units

Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Page 29: Overfitting in ANNs

[Figure: error versus number of weight updates (example 1), 0 to 20000 updates; the training set error keeps decreasing while the validation set error levels off and then rises]

Page 30: Overfitting in ANNs

[Figure: error versus number of weight updates (example 2), 0 to 6000 updates; the validation set error again diverges from the training set error as training continues]

Page 31: Neural Nets for Face Recognition

[Figure: network with 30x32 image inputs and output units for left, straight, right, and up head poses; typical input images shown below]

90% accurate at learning head pose, and recognizing 1-of-20 faces.

Page 32: Learned Network Weights

[Figure: the same 30x32-input network with its learned weights visualized as images, next to typical input images; outputs are left, straight, right, up]

Page 33: Alternative Error Functions

Penalize large weights:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]

Tie weights together:
• e.g., in phoneme recognition
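The weight-decay variant is simple to state in code (a sketch; the gamma value is a small constant we chose for illustration):

```python
import numpy as np

def penalized_error(T, O, weight_matrices, gamma=1e-4):
    """First alternative error: squared error plus gamma * sum of w_ji^2.

    T, O: arrays of targets and network outputs over all examples.
    weight_matrices: the network's weight arrays, e.g., [W_hidden, W_output].
    """
    squared = 0.5 * np.sum((T - O) ** 2)
    penalty = gamma * sum(np.sum(W ** 2) for W in weight_matrices)
    return squared + penalty
```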

Page 34: Recurrent Networks

[Figure: three diagrams: (a) a feedforward network computing y(t+1) from x(t); (b) a recurrent network whose output y(t) is fed back as an input alongside x(t); (c) the recurrent network unfolded in time, with copies for x(t), x(t-1), x(t-2) producing y(t+1), y(t), y(t-1)]

Page 35: Neural Network Summary

• physiologically (neurons) inspired model
• powerful (accurate), slow, opaque (hard to understand the resulting model)
• bias: preferential
  - based on gradient descent
  - finds a local minimum
  - affected by initial conditions, parameters
• neural units
  - linear
  - linear threshold
  - sigmoid

Page 36: Neural Network Summary (cont)

• gradient descent
  - convergence
• linear units
  - limitation: hyperplane decision surface
  - learning rule
• multilayer network
  - advantage: can have non-linear decision surface
  - backpropagation to learn
    * backprop learning rule
• learning issues
  - units used

Page 37: Neural Network Summary (cont)

• learning issues (cont)
  - batch versus incremental (stochastic)
  - parameters
    * initial weights
    * learning rate
    * momentum
  - cost (error) function
    * sum of squared errors
    * can include penalty terms
• recurrent networks
  - simple
  - backpropagation through time