
Ming-Feng Yeh 1

CHAPTER 11

Back-Propagation

Ming-Feng Yeh 2

Objectives

A generalization of the LMS algorithm, called backpropagation, can be used to train multilayer networks.

Backpropagation is an approximate steepest descent algorithm, in which the performance index is mean square error.

In order to calculate the derivatives, we need to use the chain rule of calculus.

Ming-Feng Yeh 3

Motivation

The perceptron learning rule and the LMS algorithm were designed to train single-layer perceptron-like networks. They are only able to solve linearly separable classification problems.

Parallel Distributed Processing: the backpropagation algorithm, a generalization of LMS to multilayer networks, was popularized by the PDP research group.

The multilayer perceptron, trained by the backpropagation algorithm, is currently the most widely used neural network.

Ming-Feng Yeh 4

Three-Layer Network

Number of neurons in each layer: $R$ inputs and $S^1$, $S^2$, $S^3$ neurons in layers 1, 2, 3 (an $R - S^1 - S^2 - S^3$ network).

(Figure: three-layer feedforward network diagram.)

Ming-Feng Yeh 5

Pattern Classification: XOR Gate

The limitations of the single-layer perceptron (Minsky & Papert, 1969)

$$\left\{ p_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\; t_1 = 0 \right\},\quad
\left\{ p_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\; t_2 = 0 \right\},\quad
\left\{ p_3 = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\; t_3 = 1 \right\},\quad
\left\{ p_4 = \begin{bmatrix} 0 \\ 1 \end{bmatrix},\; t_4 = 1 \right\}$$

(Figure: the four input vectors $p_1, \dots, p_4$ plotted in the input plane; no single straight decision boundary separates the two classes.)

Ming-Feng Yeh 6

Two-Layer XOR Network

Two-layer, 2-2-1 network: each first-layer neuron forms its own decision boundary (individual decisions), and the second-layer neuron combines the two first-layer outputs with an AND operation.

(Figure: 2-2-1 network diagram with its weights and biases, and the two individual decision boundaries.)

Ming-Feng Yeh 7

Solved Problem P11.1

Design a multilayer network to distinguish these categories.

The four prototype vectors $p_1, p_2, p_3, p_4$ (four-element vectors with entries $\pm 1$) are divided into Class I and Class II.

A single-layer solution would require a weight matrix $W$ and bias $b$ with $Wp_q + b \ge 0$ for the Class I vectors and $Wp_q + b < 0$ for the Class II vectors, but the four inequalities cannot all be satisfied: there is no hyperplane that can separate these two categories.

Ming-Feng Yeh 8

Solution of Problem P11.1

A two-layer 4-2-1 network solves the problem: each first-layer neuron makes an AND-type decision over the four input components $p_1, p_2, p_3, p_4$, and the single second-layer neuron combines the two first-layer outputs with an OR operation.

(Figure: 4-2-1 network diagram showing the weights, biases, and the AND/OR roles of the neurons.)

Ming-Feng Yeh 9

Function Approximation

Two-layer, 1-2-1 network, with transfer functions
$$f^1(n) = \frac{1}{1 + e^{-n}} \;(\text{log-sigmoid}), \qquad f^2(n) = n \;(\text{linear}).$$

Nominal parameter values:
$$w^1_{1,1} = 10, \quad w^1_{2,1} = 10, \quad b^1_1 = -10, \quad b^1_2 = 10, \qquad w^2_{1,1} = 1, \quad w^2_{1,2} = 1, \quad b^2 = 0.$$
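To see the shape of this nominal response, here is a minimal numpy sketch (not part of the slides; the helper names and the sign convention for $b^1$ are my own assumptions) that evaluates $a^2 = W^2 \operatorname{logsig}(W^1 p + b^1) + b^2$ over the plotted input range:

```python
import numpy as np

# Nominal 1-2-1 network from the slide above (assumed b1 = [-10; 10]).
W1 = np.array([[10.0], [10.0]])    # first-layer weights (2x1)
b1 = np.array([[-10.0], [10.0]])   # first-layer biases  (2x1)
W2 = np.array([[1.0, 1.0]])        # second-layer weights (1x2)
b2 = np.array([[0.0]])             # second-layer bias    (1x1)

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))

def response(p):
    """Network output a2 for a scalar input p."""
    a1 = logsig(W1 * p + b1)        # hidden-layer outputs
    return (W2 @ a1 + b2).item()    # linear output layer

# Sample the response on [-2, 2]; the two "steps" sit near p = -1 and p = 1.
for p in np.linspace(-2.0, 2.0, 9):
    print(f"p = {p:5.2f}   a2 = {response(p):.3f}")
```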

Ming-Feng Yeh 10

Function Approximation

The centers of the steps occur where the net input to a neuron in the first layer is zero.

The steepness of each step can be adjusted by changing the network weights.

$$n^1_1 = w^1_{1,1}\, p + b^1_1 = 0 \;\Rightarrow\; p = -\frac{b^1_1}{w^1_{1,1}} = -\frac{-10}{10} = 1,$$
$$n^1_2 = w^1_{2,1}\, p + b^1_2 = 0 \;\Rightarrow\; p = -\frac{b^1_2}{w^1_{2,1}} = -\frac{10}{10} = -1.$$

Ming-Feng Yeh 11

Effect of Parameter Changes

(Plot: network response $a^2$ versus $p \in [-2, 2]$ as the first-layer bias $b^1_2$ is varied over 0, 5, 10, 15, 20.)

Ming-Feng Yeh 12

Effect of Parameter Changes

(Plot: network response as the second-layer weight $w^2_{1,1}$ is varied over $-1.0, -0.5, 0.0, 0.5, 1.0$.)

Ming-Feng Yeh 13

Effect of Parameter Changes

(Plot: network response as the other second-layer weight, $w^2_{1,2}$, is varied over $-1.0, -0.5, 0.0, 0.5, 1.0$.)

Ming-Feng Yeh 14

Effect of Parameter Changes

(Plot: network response as the second-layer bias $b^2$ is varied over $-1.0, -0.5, 0.0, 0.5, 1.0$.)

Ming-Feng Yeh 15

Function Approximation

Two-layer networks, with sigmoid transfer functions in the hidden layer and linear transfer functions in the output layer, can approximate virtually any function of interest to any degree of accuracy, provided sufficiently many hidden units are available.

Ming-Feng Yeh 16

Backpropagation Algorithm

For multilayer networks, the outputs of one layer become the inputs to the following layer.

$$a^0 = p, \qquad a^{m+1} = f^{m+1}\!\left( W^{m+1} a^m + b^{m+1} \right), \quad m = 0, 1, \dots, M-1, \qquad a = a^M.$$
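As a minimal sketch (the helper names are my own, not from the slides), this forward pass is just a loop over the layers:

```python
import numpy as np

def forward(p, weights, biases, transfer_fns):
    """Propagate an input through an M-layer network:
    a^0 = p, a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}), output a = a^M."""
    a = p
    activations = [a]                      # keep every a^m; needed later for the updates
    for W, b, f in zip(weights, biases, transfer_fns):
        a = f(W @ a + b)
        activations.append(a)
    return activations                     # activations[-1] is the network output
```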

Ming-Feng Yeh 17

Performance Index

Training set: $\{p_1, t_1\},\; \{p_2, t_2\},\; \dots,\; \{p_Q, t_Q\}$

Mean square error: $F(x) = E[e^2] = E[(t - a)^2]$

Vector case: $F(x) = E[e^T e] = E[(t - a)^T (t - a)]$

Approximate mean square error (single sample): $\hat{F}(x) = (t(k) - a(k))^T (t(k) - a(k)) = e^T(k)\, e(k)$

Approximate steepest descent algorithm:
$$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha\, \frac{\partial \hat{F}}{\partial w^m_{i,j}}, \qquad b^m_i(k+1) = b^m_i(k) - \alpha\, \frac{\partial \hat{F}}{\partial b^m_i}.$$

Ming-Feng Yeh 18

Chain Rule

If $f(n) = e^n$ and $n = 2w$, so that $f(n(w)) = e^{2w}$, then
$$\frac{d f(n(w))}{dw} = \frac{d f(n)}{dn} \times \frac{d n(w)}{dw} = (e^n)(2).$$

Approximate mean square error:
$$\hat{F}(x) = (t(k) - a(k))^T (t(k) - a(k)) = e^T(k)\, e(k).$$

Applying the chain rule to the steepest descent updates:
$$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha\, \frac{\partial \hat{F}}{\partial n^m_i}\, \frac{\partial n^m_i}{\partial w^m_{i,j}}, \qquad b^m_i(k+1) = b^m_i(k) - \alpha\, \frac{\partial \hat{F}}{\partial n^m_i}\, \frac{\partial n^m_i}{\partial b^m_i}.$$
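As a quick sanity check of the chain-rule example (my own sketch, not from the slides), the analytic derivative $2e^{2w}$ can be compared against a finite difference:

```python
import math

w, h = 0.3, 1e-6                             # arbitrary evaluation point and step size
f = lambda w: math.exp(2.0 * w)              # f(n(w)) with f(n) = e^n and n = 2w
analytic = 2.0 * math.exp(2.0 * w)           # chain rule: (df/dn)(dn/dw) = e^n * 2
numeric = (f(w + h) - f(w - h)) / (2.0 * h)  # central-difference approximation
print(analytic, numeric)                     # the two values agree to many digits
```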

Ming-Feng Yeh 19

Sensitivity & Gradient

The net input to the $i$th neuron of layer $m$:
$$n^m_i = \sum_{j=1}^{S^{m-1}} w^m_{i,j}\, a^{m-1}_j + b^m_i, \qquad \frac{\partial n^m_i}{\partial w^m_{i,j}} = a^{m-1}_j, \qquad \frac{\partial n^m_i}{\partial b^m_i} = 1.$$

The sensitivity of $\hat{F}$ to changes in the $i$th element of the net input at layer $m$:
$$s^m_i \equiv \frac{\partial \hat{F}}{\partial n^m_i}.$$

Gradient:
$$\frac{\partial \hat{F}}{\partial w^m_{i,j}} = \frac{\partial \hat{F}}{\partial n^m_i}\, \frac{\partial n^m_i}{\partial w^m_{i,j}} = s^m_i\, a^{m-1}_j, \qquad \frac{\partial \hat{F}}{\partial b^m_i} = \frac{\partial \hat{F}}{\partial n^m_i}\, \frac{\partial n^m_i}{\partial b^m_i} = s^m_i.$$

Ming-Feng Yeh 20

Steepest Descent Algorithm

The steepest descent algorithm for the approximate mean square error:

$$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha\, s^m_i\, a^{m-1}_j, \qquad b^m_i(k+1) = b^m_i(k) - \alpha\, s^m_i.$$

Matrix form:
$$W^m(k+1) = W^m(k) - \alpha\, s^m (a^{m-1})^T, \qquad b^m(k+1) = b^m(k) - \alpha\, s^m,$$
where the sensitivity vector is
$$s^m \equiv \frac{\partial \hat{F}}{\partial n^m} = \begin{bmatrix} \dfrac{\partial \hat{F}}{\partial n^m_1} \\[4pt] \dfrac{\partial \hat{F}}{\partial n^m_2} \\ \vdots \\ \dfrac{\partial \hat{F}}{\partial n^m_{S^m}} \end{bmatrix}.$$

Ming-Feng Yeh 21

BP the Sensitivity

Backpropagation: a recurrence relationship in which the sensitivity at layer m is computed from the sensitivity at layer m+1. Jacobian matrix:

$$\frac{\partial n^{m+1}}{\partial n^m} =
\begin{bmatrix}
\dfrac{\partial n^{m+1}_1}{\partial n^m_1} & \dfrac{\partial n^{m+1}_1}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_1}{\partial n^m_{S^m}} \\[6pt]
\dfrac{\partial n^{m+1}_2}{\partial n^m_1} & \dfrac{\partial n^{m+1}_2}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_2}{\partial n^m_{S^m}} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_1} & \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_{S^m}}
\end{bmatrix}.$$

Ming-Feng Yeh 22

Matrix Representation

The $(i, j)$ element of the Jacobian matrix:
$$\frac{\partial n^{m+1}_i}{\partial n^m_j}
= \frac{\partial \left( \sum_{l=1}^{S^m} w^{m+1}_{i,l}\, a^m_l + b^{m+1}_i \right)}{\partial n^m_j}
= w^{m+1}_{i,j}\, \frac{\partial a^m_j}{\partial n^m_j}
= w^{m+1}_{i,j}\, \frac{\partial f^m(n^m_j)}{\partial n^m_j}
= w^{m+1}_{i,j}\, \dot{f}^m(n^m_j).$$

In matrix form:
$$\frac{\partial n^{m+1}}{\partial n^m} = W^{m+1} \dot{F}^m(n^m), \qquad
\dot{F}^m(n^m) =
\begin{bmatrix}
\dot{f}^m(n^m_1) & 0 & \cdots & 0 \\
0 & \dot{f}^m(n^m_2) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \dot{f}^m(n^m_{S^m})
\end{bmatrix}.$$
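In code, $\dot{F}^m(n^m)$ is simply a diagonal matrix of elementwise transfer-function derivatives. A minimal sketch (the helper names are my own assumptions):

```python
import numpy as np

def F_dot(df, n_m):
    """Diagonal matrix F'^m(n^m) with the derivatives f'(n_i^m) on its diagonal."""
    return np.diag(df(n_m).ravel())

# Elementwise derivatives for two common transfer functions:
dlogsig  = lambda n: np.exp(-n) / (1.0 + np.exp(-n))**2   # log-sigmoid
dpurelin = lambda n: np.ones_like(n)                      # linear: F_dot is the identity
```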

Ming-Feng Yeh 23

Recurrence Relation

The recurrence relation for the sensitivity

The sensitivities are propagated backward through the network from the last layer to the first layer.

$$s^m = \frac{\partial \hat{F}}{\partial n^m}
= \left( \frac{\partial n^{m+1}}{\partial n^m} \right)^{\!T} \frac{\partial \hat{F}}{\partial n^{m+1}}
= \dot{F}^m(n^m)\,(W^{m+1})^T\, \frac{\partial \hat{F}}{\partial n^{m+1}}
= \dot{F}^m(n^m)\,(W^{m+1})^T\, s^{m+1}.$$

$$s^M \rightarrow s^{M-1} \rightarrow \cdots \rightarrow s^2 \rightarrow s^1.$$

Ming-Feng Yeh 24

Backpropagation Algorithm

At the final layer:

$$s^M_i = \frac{\partial \hat{F}}{\partial n^M_i}
= \frac{\partial (t - a)^T (t - a)}{\partial n^M_i}
= \frac{\partial \sum_{j=1}^{S^M} (t_j - a_j)^2}{\partial n^M_i}
= -2\,(t_i - a_i)\, \frac{\partial a_i}{\partial n^M_i},$$

$$\frac{\partial a_i}{\partial n^M_i} = \frac{\partial a^M_i}{\partial n^M_i} = \frac{\partial f^M(n^M_i)}{\partial n^M_i} = \dot{f}^M(n^M_i),$$

$$s^M_i = -2\,(t_i - a_i)\, \dot{f}^M(n^M_i) \qquad \Longrightarrow \qquad s^M = -2\, \dot{F}^M(n^M)\,(t - a).$$

Ming-Feng Yeh 25

Summary

The first step is to propagate the input forward through the network:

The second step is to propagate the sensitivities backward through the network, from the output layer back through the hidden layers:

The final step is to update the weights and biases:

Forward propagation:
$$a^0 = p, \qquad a^{m+1} = f^{m+1}\!\left( W^{m+1} a^m + b^{m+1} \right), \quad m = 0, 1, \dots, M-1, \qquad a = a^M.$$

Backward propagation of the sensitivities:
$$\text{Output layer:}\quad s^M = -2\, \dot{F}^M(n^M)\,(t - a), \qquad
\text{Hidden layers:}\quad s^m = \dot{F}^m(n^m)\,(W^{m+1})^T s^{m+1}, \quad m = M-1, \dots, 2, 1.$$

Weight and bias update:
$$W^m(k+1) = W^m(k) - \alpha\, s^m (a^{m-1})^T, \qquad b^m(k+1) = b^m(k) - \alpha\, s^m.$$
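These three steps translate directly into code. Below is a minimal numpy sketch of one iteration for a two-layer logsig/linear network like the 1-2-1 example that follows; the function names, array shapes, and default learning rate are my own assumptions, not part of the slides.

```python
import numpy as np

logsig  = lambda n: 1.0 / (1.0 + np.exp(-n))
dlogsig = lambda a: (1.0 - a) * a            # derivative written in terms of the output a

def bp_step(p, t, W1, b1, W2, b2, lr=0.1):
    """One backpropagation iteration (inputs and outputs are column vectors)."""
    # 1) propagate the input forward
    a0 = p
    a1 = logsig(W1 @ a0 + b1)
    a2 = W2 @ a1 + b2                        # linear (purelin) output layer
    e  = t - a2
    # 2) propagate the sensitivities backward
    s2 = -2.0 * e                            # F'^2 = I for a linear output layer
    s1 = np.diag(dlogsig(a1).ravel()) @ W2.T @ s2
    # 3) update the weights and biases (approximate steepest descent)
    W2 = W2 - lr * s2 @ a1.T
    b2 = b2 - lr * s2
    W1 = W1 - lr * s1 @ a0.T
    b1 = b1 - lr * s1
    return W1, b1, W2, b2, e
```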

Ming-Feng Yeh 26

BP Neural Network

(Figure: layer-by-layer diagram of the multilayer network, from the $R$ inputs $p_1, \dots, p_R$ through layer 1 with $S^1$ neurons, a generic hidden layer $m$ with $S^m$ neurons and weights $w^m_{i,j}$ connecting neuron $j$ of layer $m-1$ to neuron $i$ of layer $m$, up to the output layer $M$ with outputs $a^M_1, \dots, a^M_{S^M}$.)

Ming-Feng Yeh 27

Ex: Function Approximation

The 1-2-1 network is trained to approximate the function
$$g(p) = 1 + \sin\!\left(\frac{\pi}{4}\, p\right).$$

(Block diagram: the input $p$ is applied to both the function $g$ and the network; the difference $e = t - a$ between the target $t = g(p)$ and the network output drives the training.)

Ming-Feng Yeh 28

Network Architecture

1-2-1 network.

(Figure: architecture of the 1-2-1 network, input $p$, output $a$.)

Ming-Feng Yeh 29

Initial Values

$$W^1(0) = \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix}, \quad
b^1(0) = \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix}, \quad
W^2(0) = \begin{bmatrix} 0.09 & -0.17 \end{bmatrix}, \quad
b^2(0) = \begin{bmatrix} 0.48 \end{bmatrix}.$$

Initial network response:

(Plot: the sine-wave target $g(p)$ and the initial network response $a^2$ versus $p \in [-2, 2]$.)

Ming-Feng Yeh 30

Forward Propagation

Initial input: $a^0 = p = 1$.

Output of the 1st layer:
$$a^1 = f^1(W^1 a^0 + b^1)
= \operatorname{logsig}\!\left( \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix}(1) + \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix} \right)
= \operatorname{logsig}\!\left( \begin{bmatrix} -0.75 \\ -0.54 \end{bmatrix} \right)
= \begin{bmatrix} \dfrac{1}{1 + e^{0.75}} \\[6pt] \dfrac{1}{1 + e^{0.54}} \end{bmatrix}
= \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix}.$$

Output of the 2nd layer:
$$a^2 = f^2(W^2 a^1 + b^2)
= \operatorname{purelin}\!\left( \begin{bmatrix} 0.09 & -0.17 \end{bmatrix} \begin{bmatrix} 0.321 \\ 0.368 \end{bmatrix} + 0.48 \right) = 0.446.$$

Error:
$$e = t - a = \left( 1 + \sin\frac{\pi}{4} p \right) - a^2 = \left( 1 + \sin\frac{\pi}{4} \right) - 0.446 = 1.261.$$
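The same forward pass can be checked numerically; this is a sketch with my own variable names that reproduces the numbers above:

```python
import numpy as np

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))

# Initial parameters and the training point p = 1 from the example.
W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([[-0.48], [-0.13]])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([[0.48]])
p  = 1.0
t  = 1.0 + np.sin(np.pi / 4.0 * p)          # target from g(p) = 1 + sin(pi*p/4)

a1 = logsig(W1 * p + b1)                    # about [0.321, 0.368]
a2 = W2 @ a1 + b2                           # about 0.446
e  = t - a2                                 # about 1.261
print(a1.ravel(), a2.item(), e.item())
```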

Ming-Feng Yeh 31

Transfer Func. Derivatives

$$\dot{f}^1(n) = \frac{d}{dn}\!\left( \frac{1}{1 + e^{-n}} \right)
= \frac{e^{-n}}{(1 + e^{-n})^2}
= \left( 1 - \frac{1}{1 + e^{-n}} \right)\!\left( \frac{1}{1 + e^{-n}} \right)
= (1 - a^1)(a^1),$$

$$\dot{f}^2(n) = \frac{d}{dn}(n) = 1.$$

Ming-Feng Yeh 32

Backpropagation

The second-layer sensitivity:
$$s^2 = -2\,\dot{F}^2(n^2)\,(t - a) = -2\,[\dot{f}^2(n^2)]\,e = -2(1)(1.261) = -2.522.$$

The first-layer sensitivity:
$$s^1 = \dot{F}^1(n^1)\,(W^2)^T s^2
= \begin{bmatrix} (1 - a^1_1)(a^1_1) & 0 \\ 0 & (1 - a^1_2)(a^1_2) \end{bmatrix}
\begin{bmatrix} w^2_{1,1} \\ w^2_{1,2} \end{bmatrix} s^2
= \begin{bmatrix} (1 - 0.321)(0.321) & 0 \\ 0 & (1 - 0.368)(0.368) \end{bmatrix}
\begin{bmatrix} 0.09 \\ -0.17 \end{bmatrix} (-2.522)
= \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix}.$$

Ming-Feng Yeh 33

Weight Update

Learning rate $\alpha = 0.1$.

$$W^2(1) = W^2(0) - \alpha\, s^2 (a^1)^T
= \begin{bmatrix} 0.09 & -0.17 \end{bmatrix} - 0.1\,(-2.522) \begin{bmatrix} 0.321 & 0.368 \end{bmatrix}
= \begin{bmatrix} 0.171 & -0.0772 \end{bmatrix},$$

$$b^2(1) = b^2(0) - \alpha\, s^2 = [0.48] - 0.1\,(-2.522) = [0.732],$$

$$W^1(1) = W^1(0) - \alpha\, s^1 (a^0)^T
= \begin{bmatrix} -0.27 \\ -0.41 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix} [1]
= \begin{bmatrix} -0.265 \\ -0.420 \end{bmatrix},$$

$$b^1(1) = b^1(0) - \alpha\, s^1
= \begin{bmatrix} -0.48 \\ -0.13 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0495 \\ 0.0997 \end{bmatrix}
= \begin{bmatrix} -0.475 \\ -0.140 \end{bmatrix}.$$
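As a self-contained numeric sketch (my own variable names, learning rate 0.1), the sensitivities and updated parameters above can be reproduced in a few lines:

```python
import numpy as np

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))
W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([[-0.48], [-0.13]])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([[0.48]])
p, lr = 1.0, 0.1
t = 1.0 + np.sin(np.pi / 4.0 * p)

a1 = logsig(W1 * p + b1); a2 = W2 @ a1 + b2; e = t - a2   # forward pass (slide 30)
s2 = -2.0 * e                                             # about -2.522
s1 = np.diag(((1.0 - a1) * a1).ravel()) @ W2.T @ s2       # about [-0.0495, 0.0997]

W2_new = W2 - lr * s2 @ a1.T    # about [ 0.171, -0.0772]
b2_new = b2 - lr * s2           # about [ 0.732]
W1_new = W1 - lr * s1 * p       # about [-0.265, -0.420]
b1_new = b1 - lr * s1           # about [-0.475, -0.140]
print(W2_new, b2_new.item(), W1_new.ravel(), b1_new.ravel())
```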

Ming-Feng Yeh 34

Choice of Network Structure

Multilayer networks can be used to approximate almost any function, if we have enough neurons in the hidden layers.

We cannot say, in general, how many layers or how many neurons are necessary for adequate performance.

Ming-Feng Yeh 35

Illustrated Example 1

$$g(p) = 1 + \sin\!\left(\frac{i\pi}{4}\, p\right), \qquad -2 \le p \le 2, \qquad i = 1, 2, 4, 8.$$

(Plots: responses of a trained 1-3-1 network for $i = 1$, $2$, $4$, and $8$; as $i$ grows the underlying function becomes more complex, and the fixed 1-3-1 network approximates it less well.)

Ming-Feng Yeh 36

Illustrated Example 2

$$g(p) = 1 + \sin\!\left(\frac{6\pi}{4}\, p\right), \qquad -2 \le p \le 2.$$

(Plots: responses of trained 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks; the approximation improves as hidden-layer neurons are added.)

Ming-Feng Yeh 37

Convergence

$$g(p) = 1 + \sin(\pi\, p), \qquad -2 \le p \le 2.$$

(Plots: two training runs of the same network started from different initial conditions. Left: convergence to the global minimum. Right: convergence to a local minimum. The numbers next to each curve, 0 through 5, indicate the sequence of iterations.)

Ming-Feng Yeh 38

Generalization

In most cases the multilayer network is trained with a finite number of examples of proper network behavior:
$$\{p_1, t_1\},\; \{p_2, t_2\},\; \dots,\; \{p_Q, t_Q\}.$$

This training set is normally representative of a much larger class of possible input/output pairs.

Can the network successfully generalize what it has learned to the total population?

Ming-Feng Yeh 39

Generalization Example

$$g(p) = 1 + \sin\!\left(\frac{\pi}{4}\, p\right), \qquad p = -2, -1.6, -1.2, \dots, 1.6, 2.$$

(Plots: responses of a trained 1-2-1 network and a trained 1-9-1 network on $p \in [-2, 2]$, together with the eleven training points.)

For a network to be able to generalize, it should have fewer parameters than there are data points in the training set.

The 1-2-1 network generalizes well; the 1-9-1 network, which has more parameters than there are training points, does not.