
AI - NN Lecture Notes

Chapter 8

Feed-forward Networks


§8.1 Introduction To Classification

The Classification Model

X = [x_1 x_2 … x_n]^t -- the input pattern of the classifier.
i_0(X) -- the decision function.
The response of the classifier is 1 or 2 or … or R.

[Block diagram: the inputs x_1, x_2, …, x_n feed the Classifier, which computes i_0(X) and outputs the Pattern Class, 1 or 2 or … or R.]


Geometric Explanation of Classification

Pattern -- an n-dimensional vector.

All n-dimensional patterns constitute an n-dimensional Euclidean space E^n, which is called the pattern space.

If all patterns can be divided into R classes, then the region of the space containing only patterns of the r-th class is called the r-th region, r = 1, …, R. Regions are separated from each other by decision surfaces.

A pattern classifier maps sets of patterns in E^n into one of the regions denoted by the numbers i_0 = 1, 2, …, R.


Classifiers That Use The Discriminant Functions

Membership in a class is determined by comparing R discriminant functions g_i(X), i = 1, …, R, computed for the input pattern under consideration.

The g_i(X) are scalar values, and the pattern X belongs to the i-th class iff g_i(X) > g_j(X) for all j = 1, …, R, j ≠ i. The decision surface equation is g_i(X) - g_j(X) = 0.

Assuming that the discriminant functions are known, the block diagram of a basic pattern classifier is shown below:


[Block diagram: the input X feeds R discriminators computing g_1(X), …, g_i(X), …, g_R(X); a Maximum Selector outputs the class i_0.]

For a given pattern, the i-th discriminator computes the value of the function g_i(X), called briefly the discriminant.
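To make the maximum-selector scheme concrete, here is a minimal Python sketch; the particular linear discriminants and the input pattern are hypothetical examples, not taken from the notes.

```python
import numpy as np

def classify(X, discriminants):
    """Maximum-selector classifier: return the index i0 of the largest g_i(X)."""
    values = [g(X) for g in discriminants]   # compute all R discriminants
    return int(np.argmax(values)) + 1        # classes are numbered 1 .. R

# Hypothetical linear discriminants g_i(X) = w_i . X + b_i for R = 3 classes
discriminants = [
    lambda X: np.dot([ 1.0,  0.0], X) + 0.0,
    lambda X: np.dot([-1.0,  1.0], X) + 0.5,
    lambda X: np.dot([ 0.0, -1.0], X) - 0.5,
]

print(classify(np.array([2.0, 0.5]), discriminants))   # prints the winning class number
```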


When R = 2, the classifier, called a dichotomizer, simplifies as below:

[Block diagram: the input X feeds a single discriminator computing the discriminant g(X); a TLU outputs the class, 1 or -1.]

Its discriminant function is g(X) = g_1(X) - g_2(X).
If g(X) > 0, then X belongs to Class 1; if g(X) < 0, then X belongs to Class 2.


The following figure is an example where six patterns belong to one of two classes and the decision surface is a straight line.

[Figure: in the (x_1, x_2) plane, the patterns (0, 0), (-1/2, -1), (-1, -2) lie in the region g(X) > 0 and the patterns (2, 0), (3/2, -1), (1, -2) lie in the region g(X) < 0. The discriminant is g(X) = -2x_1 + x_2 + 2 and the decision surface is g(X) = 0.]

An infinite number of discriminant functions may exist.
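The dichotomizer of this example can be checked directly; a minimal sketch using the discriminant g(X) = -2x_1 + x_2 + 2 and the six patterns from the figure.

```python
def g(x1, x2):
    """Discriminant from the example: g(X) = -2*x1 + x2 + 2."""
    return -2 * x1 + x2 + 2

# The six patterns of the figure
patterns = [(0, 0), (-0.5, -1), (-1, -2),   # region g(X) > 0 -> Class 1
            (2, 0), (1.5, -1), (1, -2)]     # region g(X) < 0 -> Class 2

for x1, x2 in patterns:
    cls = 1 if g(x1, x2) > 0 else 2
    print(f"({x1}, {x2}): g = {g(x1, x2):+.1f} -> Class {cls}")
```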


Training and Classification

Consider neural network classifiers that derive their weights during the learning cycle.

The sample patterns, called the training sequence, are presented to the machine along with the correct response provided by the teacher.

After each incorrect response, the classifier modifies its parameters by means of iterative, supervised learning based on comparing the targeted correct response with the actual response.


[Figure: training of a TLU-based classifier. The inputs x_1, …, x_n, x_{n+1} = -1 with weights w_1, …, w_n, w_{n+1} feed a summing node followed by a TLU that produces the output o = 1 or -1; the error d - o between the teacher's desired response d and the actual output o drives the weight adjustment.]


§8.2 Single Layer Perceptron

1. Linear Threshold Unit and Separability

[Figure: a linear threshold unit. The inputs x_1, …, x_n, …, x_N with weights w_1, …, w_n, …, w_N and threshold T produce the output y.]

X = (x_1, …, x_N)^t,  x_n ∈ {1, -1}
W = (w_1, …, w_N)^t,  T ∈ R
y = sgn(W^t X - T) ∈ {1, -1}


Let X' = (X^t, -1)^t and W' = (W^t, T)^t; then we have

y = sgn(W'^t X') = sgn( Σ_{n=1}^{N+1} w_n x_n )

Linearly Separable Patterns
Assume that there is a pattern set which is divided into N subsets. If a linear machine can classify the patterns from the i-th subset as belonging to class i, for i = 1, …, N, then the pattern sets are linearly separable.


When R = 2, for a given function f: X → {-1, 1}, if there exist W and T such that f(X) = sgn(W^t X - T), then the function is said to be linearly separable.

Example 1: NAND is a linearly separable function. Given X and y such that

x_1  x_2   y
-1   -1    1
-1    1    1
 1   -1    1
 1    1   -1

it can be found that

W = (-1, -1) and T = -3/2

is the solution:

y = sgn(W^t X - T) = sgn(-x_1 - x_2 + 3/2)


x_1  x_2    T      w_1 x_1 + w_2 x_2 - T      y
-1   -1   -3/2     1 + 1 - (-3/2) = 7/2       1
-1    1   -3/2     1 - 1 - (-3/2) = 3/2       1
 1   -1   -3/2    -1 + 1 - (-3/2) = 3/2       1
 1    1   -3/2    -1 - 1 - (-3/2) = -1/2     -1

[Figure: in the (x_1, x_2) plane, the decision line x_1 + x_2 = 3/2 crosses both axes at 3/2, with y > 0 on one side and y < 0 on the other; the corresponding unit has inputs x_1, x_2, weights w_1, w_2, threshold -3/2, and output y = sgn(-x_1 - x_2 + 3/2).]
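The table above can be reproduced by a short check of the TLU; a minimal sketch using W = (-1, -1) and T = -3/2 from Example 1.

```python
def tlu(x1, x2, w=(-1.0, -1.0), T=-1.5):
    """Linear threshold unit: y = sgn(w1*x1 + w2*x2 - T)."""
    return 1 if w[0] * x1 + w[1] * x2 - T >= 0 else -1

# NAND truth table in the {-1, +1} encoding used by the notes
for x1, x2, target in [(-1, -1, 1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]:
    y = tlu(x1, x2)
    print(x1, x2, y, "ok" if y == target else "wrong")
```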


Example 2: XOR is not a linearly separable function. Given that

x_1  x_2   y
-1   -1   -1
-1    1    1
 1   -1    1
 1    1   -1

it is impossible to find W and T satisfying y = sgn(W^t X - T):
If (-1)w_1 + (-1)w_2 < T, then (+1)w_1 + (+1)w_2 > T;
If (-1)w_1 + (+1)w_2 > T, then (+1)w_1 + (-1)w_2 < T.

It is seen that linearly separable binary patterns can be classified by a suitably designed neuron, but linearly non-separable binary patterns cannot.


Perceptron Learning

Given the training set {X(t), d(t)}, t = 0, 1, 2, …, where X(t) = (x_1(t), …, x_N(t), -1); let w_{N+1} = T and x_{N+1} = -1.

y = sgn( Σ_{n=1}^{N+1} w_n x_n ) = +1 if Σ_{n=1}^{N+1} w_n x_n ≥ 0, and -1 otherwise.

(1) Set w_n(0) = small random values, n = 1, …, N+1.
(2) Input a sample X(t) = (x_1(t), …, x_N(t), -1) and d(t).
(3) Compute the actual output y(t).
(4) Revise the weights: w_n(t+1) = w_n(t) + η[d(t) - y(t)]x_n(t).
(5) Return to (2) until w_n(t+1) = w_n(t) for all n.


where 0 < η < 1 is a learning coefficient.
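A minimal Python sketch of steps (1)-(5), trained here on the NAND data of Example 1; the learning coefficient η = 0.1 and the random initialization range are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def perceptron_train(samples, eta=0.1, max_epochs=100, seed=0):
    """Perceptron learning: w(t+1) = w(t) + eta*[d(t) - y(t)]*x(t) on augmented patterns."""
    rng = np.random.default_rng(seed)
    n_inputs = len(samples[0][0])
    w = rng.uniform(-0.1, 0.1, n_inputs + 1)        # (1) small random weights, w_{N+1} = T
    for _ in range(max_epochs):
        changed = False
        for x, d in samples:                        # (2) present a sample and its target d
            x_aug = np.append(x, -1.0)              #     augmented pattern (x_1, ..., x_N, -1)
            y = 1 if np.dot(w, x_aug) >= 0 else -1  # (3) actual output
            if y != d:
                w += eta * (d - y) * x_aug          # (4) revise the weights
                changed = True
        if not changed:                             # (5) stop when the weights no longer change
            break
    return w

nand = [([-1, -1], 1), ([-1, 1], 1), ([1, -1], 1), ([1, 1], -1)]
print(perceptron_train(nand))   # weights (w_1, w_2, w_3 = T) separating the NAND patterns
```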

Theorem: The perceptron learning algorithm will converge if the function is linearly separable.

Gradient Descent Algorithm

The perceptron learning algorithm is restricted to the linearly separable function cases (hard-limiting activation function). The gradient descent algorithm can be applied in more general cases, with the only requirement that the function be differentiable.


Given the training set (x_n, y_n), n = 1, …, N, try to find W* such that ŷ_n = f(W*·x_n) ≈ y_n.

Let E = Σ_{n=1}^{N} E_n = (1/2) Σ_{n=1}^{N} (y_n - ŷ_n)² = (1/2) Σ_{n=1}^{N} (y_n - f(W*·x_n))²

be the error measure of learning. To minimize E, take

grad E = ∂E/∂W_m = Σ_{n=1}^{N} ∂E_n/∂W_m = (1/2) Σ_{n=1}^{N} ∂(y_n - ŷ_n)²/∂W_m

       = (1/2) Σ_{n=1}^{N} ∂(y_n - f(W*·x_n))²/∂W_m

       = - Σ_{n=1}^{N} (y_n - f(W*·x_n)) · ∂f(W*·x_n)/∂W_m

       = - Σ_{n=1}^{N} (y_n - f(W*·x_n)) · [∂f(W*·x_n)/∂(W·x_n)] · [∂(W·x_n)/∂W_m]

       = - Σ_{n=1}^{N} (y_n - f(W*·x_n)) · f' · x_mn

The learning (adjusting) rule is thus as follows

W_m ← W_m - η ∂E/∂W_m = W_m + η Σ_{n=1}^{N} (y_n - ŷ_n) f' · x_mn,   η > 0
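A minimal sketch of this batch gradient-descent rule for a single unit; tanh is used here as an illustrative differentiable activation f, and the data (NAND targets with a constant -1 bias column) and step size are hypothetical.

```python
import numpy as np

def gradient_descent(X, y, eta=0.05, epochs=500):
    """Batch rule: W <- W + eta * sum_n (y_n - f(W.x_n)) * f'(W.x_n) * x_n, with f = tanh."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        net = X @ W                    # W . x_n for every sample
        y_hat = np.tanh(net)           # differentiable activation f
        f_prime = 1.0 - y_hat ** 2     # f'(net) for tanh
        W += eta * ((y - y_hat) * f_prime) @ X
    return W

# Hypothetical data: NAND targets, with a constant -1 column playing the role of the threshold
X = np.array([[-1, -1, -1], [-1, 1, -1], [1, -1, -1], [1, 1, -1]], dtype=float)
y = np.array([1, 1, 1, -1], dtype=float)
W = gradient_descent(X, y)
print(np.sign(np.tanh(X @ W)))   # signs of the outputs: expected [1, 1, 1, -1]
```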


§8.3 Multi-Layer Perceptron

1. Why Multi-Layer?

XOR, which as was seen cannot be implemented by a 1-layer network, can be implemented by a 2-layer network:

[Figure: a 2-layer XOR network. The inputs x_10 and x_20 feed a hidden unit with weights 1, 1 and threshold 1.5 producing x_1, and an output unit with weights 1, 1, -2 and threshold 0.5 producing y.]

x_1 = sgn(1·x_10 + 1·x_20 - 1.5)
y = sgn(1·x_10 + 1·x_20 - 2·x_1 - 0.5)


x_10  x_20  x_1    y
  1     1     1   -1
  1    -1    -1    1
 -1     1    -1    1
 -1    -1    -1   -1
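This table can be verified directly; a minimal sketch of the 2-layer network with the weights and thresholds given above.

```python
def sgn(v):
    return 1 if v >= 0 else -1

def xor_net(x10, x20):
    """2-layer XOR network with the weights and thresholds from the figure."""
    x1 = sgn(1 * x10 + 1 * x20 - 1.5)            # hidden unit, threshold 1.5
    y = sgn(1 * x10 + 1 * x20 - 2 * x1 - 0.5)    # output unit, threshold 0.5
    return x1, y

for x10, x20 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x10, x20, *xor_net(x10, x20))
```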

Single-layer networks have no hidden units and thus have no internal representation ability. They can only map similar input patterns to similar output ones.

If there is a layer of hidden units, there is always an internal representation of the input patterns that may support any mapping from the input to the output units.


This figure shows the internal representation abilities.

[Figure: a table of network structure vs. types of the classified region, the classification for XOR, meshed classified regions, and the general form of the classified region. A single-layer network gives 2 regions separated by a hyperplane; a two-layer network gives an open or closed convex region; a three-layer network gives regions of arbitrary form.]


2. Back Propagation Algorithm

More precisely, BP is an error back-propagation learning algorithm for multi-layer perceptron networks and is also a sort of generalized gradient descent algorithm.

Assumptions:
1) The MLP has M layers and one single output node.
2) Each node is of sigmoid type with activation function f(x) = 1 / (1 + e^(-x)).
3) Training samples are (x_n, y_n), n = 1, …, N.


4) Error measure is chosen as

E = Σ_{n=1}^{N} E_n = (1/2) Σ_{n=1}^{N} (y_n - ŷ_n)²

Let δ_jn = ∂E_n/∂net_jn and net_jn = Σ_i w_ji O_in.

I) If j is an output node, then O_jn = ŷ_n and

δ_jn = (∂E_n/∂ŷ_n)(∂ŷ_n/∂net_jn) = -(y_n - ŷ_n) f'(net_jn)

Thus

∂E_n/∂w_ji = (∂E_n/∂net_jn)(∂net_jn/∂w_ji) = δ_jn O_in = -(y_n - ŷ_n) O_in f'(net_jn)


II) If j is not an output node, then

δ_jn = ∂E_n/∂net_jn = (∂E_n/∂O_jn)(∂O_jn/∂net_jn) = (∂E_n/∂O_jn) f'(net_jn), but

∂E_n/∂O_jn = Σ_k (∂E_n/∂net_kn)(∂net_kn/∂O_jn) = Σ_k (∂E_n/∂net_kn) ∂(Σ_i w_ki O_in)/∂O_jn = Σ_k δ_kn w_kj

Therefore

δ_jn = f'(net_jn) Σ_k δ_kn w_kj

and

∂E_n/∂w_ji = δ_jn O_in = O_in f'(net_jn) Σ_k δ_kn w_kj


The MLP - BP algorithm can then be described as below

(1) Set the initial W.
(2) Until convergence (w_ji = const), repeat the following:
    (i) For n = 1 to N (the number of samples):
        (a) Compute O_in, net_jn, ŷ_n and E_n.
        (b) For m = M down to 2, for all units j in layer m, compute ∂E_n/∂w_ji.
    (ii) Revise the weights: w_ji ← w_ji - η ∂E_n/∂w_ji,  η > 0.
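A minimal Python sketch of this procedure for a single-hidden-layer, single-output sigmoid network; the network size, the XOR data in a 0/1 encoding, the learning rate and the epoch count are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_train(X, y, n_hidden=3, eta=0.5, epochs=10000, seed=0):
    """Error back-propagation for one hidden layer and one sigmoid output unit."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1]))   # input -> hidden weights
    W2 = rng.uniform(-0.5, 0.5, n_hidden + 1)             # hidden (+ bias) -> output weights
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            # forward pass: O_in, net_jn, y_hat
            h = sigmoid(W1 @ x_n)
            h_aug = np.append(h, 1.0)                      # constant unit acting as output bias
            y_hat = sigmoid(W2 @ h_aug)
            # backward pass: delta of the output node, then of the hidden nodes
            delta_out = -(y_n - y_hat) * y_hat * (1.0 - y_hat)    # sigmoid: f' = f(1 - f)
            delta_hid = h * (1.0 - h) * W2[:n_hidden] * delta_out
            # weight revision: w <- w - eta * dE_n/dw
            W2 -= eta * delta_out * h_aug
            W1 -= eta * np.outer(delta_hid, x_n)
    return W1, W2

# XOR in a 0/1 encoding, with a constant bias input of 1 appended to each pattern
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
W1, W2 = bp_train(X, y)
H = np.hstack([sigmoid(X @ W1.T), np.ones((len(X), 1))])
print(np.round(sigmoid(H @ W2), 2))   # should approach [0, 1, 1, 0]
```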


Mapping Ability of Feedforward Networks

MLPs play the mathematical role of a mapping R^n → R^m.

Kolmogorov Theorem (1957):

Let φ(x) be a bounded monotonically increasing continuous function, K be a bounded closed subset of R^n, and f(X) = f(x_1, …, x_n) be a real continuous function on K. Then for any ε > 0, there exist an integer N and constants c_i, T_i, and w_ij (i = 1, …, N; j = 1, …, n) such that

f̂(x_1, …, x_n) = Σ_{i=1}^{N} c_i φ( Σ_{j=1}^{n} w_ij x_j - T_i )    (1)

and


max_{X ∈ K} | f(x_1, …, x_n) - f̂(x_1, …, x_n) | < ε    (2)

That is to say, for any ε > 0, there exists a 3-layer network whose hidden-unit output function is φ(x), whose input and output units are linear, and whose total input-output relation f̂(x_1, …, x_n) satisfies Eq. (2).

Significance: Any continuous mapping R^n → R^m can be approximated by a k-layer (k-2 hidden layers) network's input-output relation, k ≥ 3.
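To illustrate the form of Eq. (1), here is a minimal sketch that fits f̂(x) = Σ_i c_i φ(Σ_j w_ij x_j - T_i) to a continuous target function; the random hidden parameters, the least-squares fit of the c_i, and the target sin(πx) are illustrative choices, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
phi = np.tanh                                    # bounded, monotonically increasing, continuous

# A continuous target function on the bounded closed set K = [-1, 1]
f = lambda x: np.sin(np.pi * x)
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)

N = 50                                           # number of hidden units
w = rng.normal(scale=3.0, size=(N, 1))           # hidden weights w_ij
T = rng.uniform(-3.0, 3.0, size=N)               # hidden thresholds T_i
H = phi(x @ w.T - T)                             # hidden outputs phi(sum_j w_ij x_j - T_i)

c, *_ = np.linalg.lstsq(H, f(x).ravel(), rcond=None)   # linear output units: fit the c_i
f_hat = H @ c                                    # f^(x) = sum_i c_i phi(sum_j w_ij x_j - T_i)
print("max |f - f^| =", float(np.max(np.abs(f(x).ravel() - f_hat))))
```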


§8.4 Applications of Feed-forward Networks

MLP can be successfully applied to classification and diagnosing problems whose solution is obtained via experimentation and simulation rather than via a rigorous and formal approach to the problem.

MLP can also act as an expert system. Formulation of the rules is one of the bottlenecks in building up expert systems. However, the layered networks can acquire knowledge without extracting IF-THEN rules if the number of training vector pairs is sufficient to suitably form all decision regions.


1) Fault Diagnosing

Automobile engine diagnosing (Marko et al., 1989)
-- employs a single-hidden-layer network
-- identifies 26 different faults
-- the training set consists of 16 sets of data for each failure
-- the training time needed is 10 minutes
-- the main-frame is a NESTOR NDS-100 computer
-- fault recognition accuracy is 100%

Switching system fault diagnosing (Wen Fang, 1994)
-- BP algorithm
-- 3-layer network
-- no mis-diagnosing, much better than the existing one


2) Handwritten Digit Recognition

Postal code recognition (Wang et al., 1995)
-- 3-layer network
-- preprocessing -- 130 features
-- rejection rate < 5%
-- mis-classification rate < 0.01%

3) Other Applications Include
-- text reading
-- speech recognition
-- image recognition
-- medical diagnosing
-- approximation


-- optimization
-- coding
-- robot controlling, etc.

Advantages and Disadvantages
-- learning from examples
-- better performance than the traditional approach
-- long training time (much improved)
-- local minima (already overcome)