
Page 1

Back-Propagation Algorithm

Perceptron
Gradient Descent
Multi-layered neural network
Back-Propagation
More on Back-Propagation
Examples

Page 2

Inner-product

A measure of the projection of one vector onto another.

$$\text{net} = \langle \vec{w}, \vec{x} \rangle = \|\vec{w}\| \cdot \|\vec{x}\| \cdot \cos(\theta)$$

$$\text{net} = \sum_{i=1}^{n} w_i \, x_i$$

Activation function

$$o = f(\text{net}) = f\left(\sum_{i=1}^{n} w_i \, x_i\right)$$

$$f(x) := \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ -1 & \text{if } x < 0 \end{cases}$$
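As a concrete illustration of the two formulas above, a minimal Python sketch (not from the original slides; the example weights are my own choice):

    def sgn(x):
        # Threshold activation: +1 if x >= 0, -1 otherwise.
        return 1 if x >= 0 else -1

    def perceptron_output(w, x):
        # net = <w, x> = sum_i w_i * x_i
        net = sum(wi * xi for wi, xi in zip(w, x))
        return sgn(net)

    # Hand-picked weights; the last component acts as a bias via a constant input of 1.
    print(perceptron_output([1.0, 1.0, -1.5], [1.0, 1.0, 1.0]))  # -> 1 (fires)
    print(perceptron_output([1.0, 1.0, -1.5], [0.0, 1.0, 1.0]))  # -> -1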

Page 3

Sigmoid function:

$$f(x) := \sigma(x) = \frac{1}{1 + e^{-ax}}$$

Step function:

$$f(x) := \varphi(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}$$

Piecewise-linear function:

$$f(x) := \varphi(x) = \begin{cases} 1 & \text{if } x \geq 0.5 \\ x & \text{if } 0.5 > x > -0.5 \\ 0 & \text{if } x \leq -0.5 \end{cases}$$
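In code, the three activation functions above might look as follows (a direct Python transcription; the parameter a is from the sigmoid formula):

    import math

    def sigmoid(x, a=1.0):
        # f(x) = 1 / (1 + e^(-a*x))
        return 1.0 / (1.0 + math.exp(-a * x))

    def step(x):
        # f(x) = 1 if x >= 0, else 0
        return 1.0 if x >= 0 else 0.0

    def ramp(x):
        # Piecewise-linear unit as on the slide: saturates at 1 and 0, linear in between.
        if x >= 0.5:
            return 1.0
        if x <= -0.5:
            return 0.0
        return x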

Gradient Descent

To understand gradient descent, consider a simpler linear unit, where

$$o = \sum_{i=0}^{n} w_i \, x_i$$

Let's learn the $w_i$ that minimize the squared error over the training set $D = \{(x_1,t_1), (x_2,t_2), \ldots, (x_d,t_d), \ldots, (x_m,t_m)\}$ ($t$ for target).

Page 4

Error for different hypotheses, over $w_0$ and $w_1$ (a 2-dimensional weight space).

We want to move the weight vector in the direction that decreases $E$:

$$w_i = w_i + \Delta w_i, \qquad \vec{w} = \vec{w} + \Delta\vec{w}$$

Page 5

Differentiating E

Differentiating the squared error $E[\vec{w}] = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$ with respect to $w_i$ gives the update rule for gradient descent:

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id}$$
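A direct Python transcription of this batch rule (a sketch; the data layout and the value of eta are my own choices, not from the slides):

    def gradient_descent_step(w, D, eta=0.05):
        # One batch update: dw_i = eta * sum_d (t_d - o_d) * x_id
        dw = [0.0] * len(w)
        for x, t in D:                                    # x[0] = 1 serves as the bias input
            o = sum(wi * xi for wi, xi in zip(w, x))      # linear unit output
            for i, xi in enumerate(x):
                dw[i] += eta * (t - o) * xi
        return [wi + dwi for wi, dwi in zip(w, dw)]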

Page 6

Stochastic Approximation to Gradient Descent

The gradient descent training rule updates the weights by summing over all the training examples in $D$.

Stochastic gradient descent approximates gradient descent by updating the weights incrementally, calculating the error for each example:

$$\Delta w_i = \eta \, (t - o) \, x_i$$

Known as the delta rule or LMS (least mean-square) weight update. Also the Adaline rule, used for adaptive filters; Widrow and Hoff (1960).
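The incremental version updates after each single example (again a sketch, with the same conventions as the batch version above):

    def delta_rule_step(w, x, t, eta=0.05):
        # Incremental LMS / Adaline update: dw_i = eta * (t - o) * x_i
        o = sum(wi * xi for wi, xi in zip(w, x))          # linear unit output
        return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]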

Page 7

XOR problem and Perceptron

Shown by Minsky and Papert in the mid-1960s: a single perceptron cannot represent functions that are not linearly separable, such as XOR.

Page 8

Multi-layer Networks

The limitations of the simple perceptron do not apply to feed-forward networks with intermediate or "hidden" nonlinear units.

A network with just one hidden layer can represent any Boolean function.

The great power of multi-layer networks was realized long ago, but it was only in the eighties that it was shown how to make them learn.

Multiple layers of cascaded linear units still produce only linear functions. We search for networks capable of representing nonlinear functions, so the units should use nonlinear activation functions (examples were given earlier, and more follow on a later slide).

Page 9

XOR example
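The slide shows the solution as a figure. A minimal sketch of one well-known construction, with hand-picked weights (my own choice, not read off the slide): one hidden unit computes OR, the other AND, and the output unit fires when OR is true but AND is not.

    def step(x):
        return 1 if x >= 0 else 0

    def xor_net(x1, x2):
        h1 = step(x1 + x2 - 0.5)        # hidden unit 1: OR
        h2 = step(x1 + x2 - 1.5)        # hidden unit 2: AND
        return step(h1 - 2 * h2 - 0.5)  # output: OR and not AND = XOR

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))  # prints the XOR truth table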

Page 10

Back-propagation is a learning algorithm for multi-layer neural networks.

It was invented independently several times: Bryson and Ho [1969], Werbos [1974], Parker [1985], Rumelhart et al. [1986].

Parallel Distributed Processing, Vol. 1: Foundations. David E. Rumelhart, James L. McClelland, and the PDP Research Group.

"What makes people smarter than computers?" These volumes by a pioneering neurocomputing ...

Page 11

Back-propagation

The algorithm gives a prescription for changing the weights $w_{ij}$ in any feed-forward network to learn a training set of input-output pairs $\{x^d, t^d\}$.

We consider a simple two-layer network with inputs $x_1, x_2, x_3, x_4, x_5$.

[Figure: a two-layer network; five inputs $x_k$, a hidden layer, and an output layer.]

Page 12

Given the pattern $x^d$, hidden unit $j$ receives a net input

$$\text{net}_j^d = \sum_{k=1}^{5} w_{jk} \, x_k^d$$

and produces the output

$$V_j^d = f(\text{net}_j^d) = f\left(\sum_{k=1}^{5} w_{jk} \, x_k^d\right)$$

Output unit $i$ thus receives

$$\text{net}_i^d = \sum_{j=1}^{3} W_{ij} \, V_j^d = \sum_{j=1}^{3} \left( W_{ij} \cdot f\left(\sum_{k=1}^{5} w_{jk} \, x_k^d\right) \right)$$

and produces the final output

$$o_i^d = f(\text{net}_i^d) = f\left(\sum_{j=1}^{3} W_{ij} \, V_j^d\right) = f\left(\sum_{j=1}^{3} \left( W_{ij} \cdot f\left(\sum_{k=1}^{5} w_{jk} \, x_k^d\right)\right)\right)$$

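The same forward pass written out in Python (a sketch for this 5-3-2 example; a sigmoid is assumed for f, matching the differentiable activations discussed later):

    import math

    def f(x):
        # Sigmoid activation, one differentiable choice for f.
        return 1.0 / (1.0 + math.exp(-x))

    def forward(w, W, x):
        # w[j][k]: input-to-hidden weights (3 x 5); W[i][j]: hidden-to-output weights (2 x 3).
        net_hidden = [sum(w[j][k] * x[k] for k in range(5)) for j in range(3)]
        V = [f(n) for n in net_hidden]                                  # hidden outputs V_j
        net_out = [sum(W[i][j] * V[j] for j in range(3)) for i in range(2)]
        o = [f(n) for n in net_out]                                     # final outputs o_i
        return V, o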
Page 13

Our usual error function, for $l$ outputs and $m$ input-output pairs $\{x^d, t^d\}$:

$$E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{l} \left( t_i^d - o_i^d \right)^2$$

In our example (two outputs) $E$ becomes

$$E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left( t_i^d - o_i^d \right)^2$$

$$E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left( t_i^d - f\left( \sum_{j=1}^{3} W_{ij} \cdot f\left( \sum_{k=1}^{5} w_{jk} \, x_k^d \right) \right) \right)^2$$

$E[\vec{w}]$ is differentiable given $f$ is differentiable, so gradient descent can be applied.

Page 14

For the hidden-to-output connections the gradient descent rule gives:

$$\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = -\eta \sum_{d=1}^{m} (t_i^d - o_i^d) \, f'(\text{net}_i^d) \cdot (-V_j^d)$$

$$\Delta W_{ij} = \eta \sum_{d=1}^{m} (t_i^d - o_i^d) \, f'(\text{net}_i^d) \, V_j^d$$

Defining

$$\delta_i^d = f'(\text{net}_i^d)\,(t_i^d - o_i^d)$$

this becomes

$$\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d \, V_j^d$$

For the input-to-hidden connections $w_{jk}$ we must differentiate with respect to the $w_{jk}$. Using the chain rule we obtain

$$\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d} \cdot \frac{\partial V_j^d}{\partial w_{jk}}$$

Page 15

$$\delta_i^d = f'(\text{net}_i^d)\,(t_i^d - o_i^d)$$

$$\delta_j^d = f'(\text{net}_j^d) \sum_{i=1}^{2} W_{ij} \, \delta_i^d$$

$$\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d) \, f'(\text{net}_i^d) \, W_{ij} \, f'(\text{net}_j^d) \, x_k^d$$

$$\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \delta_i^d \, W_{ij} \, f'(\text{net}_j^d) \, x_k^d$$

$$\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d \, x_k^d$$

Compared with

$$\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d \, V_j^d$$

we have the same form with a different definition of $\delta$.
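In code, the two deltas and the resulting per-pattern gradient terms might look like this (a sketch continuing the forward-pass example above; f and fprime are the activation and its derivative):

    def backprop_pattern(w, W, x, t, f, fprime):
        # Forward pass, keeping the net inputs for the derivative terms.
        net_hidden = [sum(w[j][k] * x[k] for k in range(5)) for j in range(3)]
        V = [f(n) for n in net_hidden]
        net_out = [sum(W[i][j] * V[j] for j in range(3)) for i in range(2)]
        o = [f(n) for n in net_out]
        # Output deltas: delta_i = f'(net_i) * (t_i - o_i)
        delta_out = [fprime(net_out[i]) * (t[i] - o[i]) for i in range(2)]
        # Hidden deltas: delta_j = f'(net_j) * sum_i W_ij * delta_i
        delta_hid = [fprime(net_hidden[j]) * sum(W[i][j] * delta_out[i] for i in range(2))
                     for j in range(3)]
        # Per-pattern contributions to dW_ij and dw_jk (scale by eta and sum over d).
        dW = [[delta_out[i] * V[j] for j in range(3)] for i in range(2)]
        dw = [[delta_hid[j] * x[k] for k in range(5)] for j in range(3)]
        return dW, dw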

Page 16

In general, with an arbitrary number of layers, the back-propagation update rule always has the form

$$\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{\text{output}} \cdot V_{\text{input}}$$

where "output" and "input" refer to the two ends of the connection concerned, $V$ stands for the appropriate input (hidden unit or real input $x^d$), and $\delta$ depends on the layer concerned.

The equation

$$\delta_j^d = f'(\text{net}_j^d) \sum_{i=1}^{2} W_{ij} \, \delta_i^d$$

allows us to determine the $\delta$ of a given hidden unit $V_j$ in terms of the $\delta$'s of the units $o_i$ it feeds into. The coefficients $W_{ij}$ are the usual forward weights, but the errors $\delta$ are propagated backward; hence back-propagation.

Page 17

We have to use a nonlinear differentiable activation function. Examples:

$$f(x) = \sigma(x) = \frac{1}{1 + e^{-\beta x}}, \qquad f'(x) = \sigma'(x) = \beta \cdot \sigma(x) \cdot (1 - \sigma(x))$$

$$f(x) = \tanh(\beta \cdot x), \qquad f'(x) = \beta \cdot (1 - f(x)^2)$$
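Both pairs in Python (a sketch; beta as in the formulas above):

    import math

    def sigma(x, beta=1.0):
        return 1.0 / (1.0 + math.exp(-beta * x))

    def sigma_prime(x, beta=1.0):
        # sigma'(x) = beta * sigma(x) * (1 - sigma(x))
        s = sigma(x, beta)
        return beta * s * (1.0 - s)

    def tanh_f(x, beta=1.0):
        return math.tanh(beta * x)

    def tanh_prime(x, beta=1.0):
        # f'(x) = beta * (1 - f(x)^2)
        return beta * (1.0 - math.tanh(beta * x) ** 2)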

Page 18

Consider a network with $M$ layers, $m = 1, 2, \ldots, M$.

$V_i^m$ is the output of the $i$th unit of the $m$th layer; $V_i^0$ is a synonym for $x_i$, the $i$th input. The superscript $m$ indexes layers, not patterns. $w_{ij}^m$ means the connection from $V_j^{m-1}$ to $V_i^m$.

Stochastic Back-Propagation Algorithm (mostly used); a full sketch in code follows the steps below.

1. Initialize the weights to small random values.
2. Choose a pattern $x^d$ and apply it to the input layer: $V_k^0 = x_k^d$ for all $k$.
3. Propagate the signal forward through the network:
$$V_i^m = f(\text{net}_i^m) = f\left(\sum_j w_{ij}^m \, V_j^{m-1}\right)$$
4. Compute the deltas for the output layer:
$$\delta_i^M = f'(\text{net}_i^M)\,(t_i^d - V_i^M)$$
5. Compute the deltas for the preceding layers, for $m = M, M-1, \ldots, 2$:
$$\delta_i^{m-1} = f'(\text{net}_i^{m-1}) \sum_j w_{ji}^m \, \delta_j^m$$
6. Update all connections:
$$\Delta w_{ij}^m = \eta \, \delta_i^m \, V_j^{m-1}, \qquad w_{ij}^{\text{new}} = w_{ij}^{\text{old}} + \Delta w_{ij}$$
7. Go to step 2 and repeat for the next pattern.
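A compact end-to-end sketch of steps 1-7 in Python, for the earlier 5-3-2 example with sigmoid units (my own minimal implementation of the procedure above, not code from the lecture; layer sizes, eta, and the epoch count are illustrative):

    import math, random

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def train_backprop(data, n_in=5, n_hid=3, n_out=2, eta=0.5, epochs=1000):
        # Step 1: initialize the weights to small random values.
        w = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hid)]
        W = [[random.uniform(-0.1, 0.1) for _ in range(n_hid)] for _ in range(n_out)]
        for _ in range(epochs):
            for x, t in data:                       # Step 2: choose a pattern.
                # Step 3: propagate the signal forward.
                net_h = [sum(w[j][k] * x[k] for k in range(n_in)) for j in range(n_hid)]
                V = [sigmoid(n) for n in net_h]
                net_o = [sum(W[i][j] * V[j] for j in range(n_hid)) for i in range(n_out)]
                o = [sigmoid(n) for n in net_o]
                # Step 4: output deltas; for the sigmoid, f'(net) = o * (1 - o).
                d_out = [o[i] * (1 - o[i]) * (t[i] - o[i]) for i in range(n_out)]
                # Step 5: deltas for the preceding layer.
                d_hid = [V[j] * (1 - V[j]) * sum(W[i][j] * d_out[i] for i in range(n_out))
                         for j in range(n_hid)]
                # Step 6: update all connections.
                for i in range(n_out):
                    for j in range(n_hid):
                        W[i][j] += eta * d_out[i] * V[j]
                for j in range(n_hid):
                    for k in range(n_in):
                        w[j][k] += eta * d_hid[j] * x[k]
                # Step 7: continue with the next pattern.
        return w, W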

Page 19

More on Back-Propagation

Gradient descent over the entire network weight vector. Easily generalized to arbitrary directed graphs. Will find a local, not necessarily global, error minimum. In practice it often works well (one can run it multiple times).

Gradient descent can be very slow if $\eta$ is too small, and can oscillate widely if $\eta$ is too large. Often a weight momentum $\alpha$ is included:

$$\Delta w_{pq}(t+1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha \cdot \Delta w_{pq}(t)$$

The momentum parameter $\alpha$ is chosen between 0 and 1; 0.9 is a good value.

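As a sketch, the momentum update keeps the previous weight change per weight (flat lists here for brevity; grad_E holds the components of dE/dw):

    def momentum_update(w, grad_E, dw_prev, eta=0.1, alpha=0.9):
        # dw(t+1) = -eta * dE/dw + alpha * dw(t), applied componentwise.
        dw = [-eta * g + alpha * p for g, p in zip(grad_E, dw_prev)]
        w_new = [wi + di for wi, di in zip(w, dw)]
        return w_new, dw    # dw is kept for the next step's momentum term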
Page 20

Minimizes error over the training examples; will it generalize well to unseen examples?

Training can take thousands of iterations, so it is slow! Using the network after training is very fast.

Page 21

Convergence of Back-propagation

Gradient descent converges to some local minimum, perhaps not the global minimum. Remedies: add momentum; use stochastic gradient descent; train multiple nets with different initial weights.

Nature of convergence: initialize the weights near zero, so the initial network is near-linear; increasingly non-linear functions become possible as training progresses.

Page 22

Expressive Capabilities of ANNs

Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs.

Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]. Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

NETtalk, Sejnowski et al. 1987.

Page 23

Prediction

Page 24

Page 25

Perceptron
Gradient Descent
Multi-layered neural network
Back-Propagation
More on Back-Propagation
Examples

Page 26

RBF Networks, Support Vector Machines