AI - NN Lecture Notes
Chapter 8
Feed-forward Networks
§8.1 Introduction To Classification
The Classification Model
X = [x_1 x_2 ... x_n]^t -- the input pattern of the classifier.
i_0(X) -- the decision function.
The response of the classifier is 1 or 2 or ... or R.

[Figure: block diagram -- the input pattern X = (x_1, x_2, ..., x_n) enters the classifier, which computes i_0(X) and outputs the pattern class 1 or 2 or ... or R]
Geometric Explanation of Classification
Pattern -- an n-dimensional vector.
All n-dimensional patterns constitute an n-dimensional Euclidean space E^n, called the pattern space.
If all patterns can be divided into R classes, then the region of the space containing only patterns of the r-th class is called the r-th region, r = 1, ..., R. Regions are separated from each other by decision surfaces.
A pattern classifier maps sets of patterns in E^n into one of the regions, denoted by the numbers i_0 = 1, 2, ..., R.
Classifiers That Use The Discriminant Functions
Membership in a class is determined by comparing R discriminant functions g_i(X), i = 1, ..., R, computed for the input pattern under consideration.
The g_i(X) are scalar values, and the pattern X belongs to the i-th class iff g_i(X) > g_j(X) for all i, j = 1, ..., R, i ≠ j. The decision surface equation is g_i(X) - g_j(X) = 0.
Assuming that the discriminant functions are known, the block diagram of a basic pattern classifier can be shown below:

[Figure: the pattern X enters R discriminators computing g_1(X), ..., g_i(X), ..., g_R(X); a maximum selector picks the largest discriminant and outputs the class i_0]

For a given pattern, the i-th discriminator computes the value of the function g_i(X), called briefly the discriminant.
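The maximum-selector scheme above can be sketched in a few lines of Python. The linear form of the discriminants and the particular weights below are illustrative assumptions, not from the notes:

```python
# Maximum-selector classifier: i_0(X) = argmax_i g_i(X).
# Hypothetical linear discriminants g_i(X) = W_i . X + b_i for illustration.

def make_classifier(weights, biases):
    """weights[i], biases[i] define the i-th discriminant g_i(X)."""
    def g(i, x):
        return sum(w * v for w, v in zip(weights[i], x)) + biases[i]
    def classify(x):
        # Maximum selector: pick the class whose discriminant is largest.
        scores = [g(i, x) for i in range(len(weights))]
        return scores.index(max(scores)) + 1   # classes are numbered 1..R
    return classify

# Three classes in the plane, each discriminant favouring one direction.
classify = make_classifier(
    weights=[(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)],
    biases=[0.0, 0.0, 0.0])
print(classify((2.0, 0.5)))   # class 1: g_1 = 2 is the largest discriminant
```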
When R = 2, the classifier, called a dichotomizer, is simplified as below:

[Figure: X enters a single discriminator computing the discriminant g(X); a threshold logic unit (TLU) outputs the class i_0 = 1 if g(X) > 0 and i_0 = -1 otherwise]
Its discriminant function is g(X) = g_1(X) - g_2(X).
If g(X) > 0, then X belongs to Class 1; if g(X) < 0, then X belongs to Class 2.
The following figure is an example where 6 patterns belong to one of the 2 classes and the decision surface is a straight line.
[Figure: six patterns in the (x_1, x_2) plane -- (0, 0), (-1/2, -1), (-1, -2) on the side g(X) > 0 and (2, 0), (3/2, -1), (1, -2) on the side g(X) < 0 of the decision surface g(X) = -2x_1 + x_2 + 2 = 0]
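The six patterns of this example can be checked directly against the dichotomizer g(X) = -2x_1 + x_2 + 2, as in this minimal sketch:

```python
# Dichotomizer for the example: g(X) = -2*x1 + x2 + 2.
# Class 1 if g(X) > 0, Class 2 if g(X) < 0.

def g(x1, x2):
    return -2 * x1 + x2 + 2

def dichotomize(x1, x2):
    return 1 if g(x1, x2) > 0 else 2

class1 = [(0, 0), (-1/2, -1), (-1, -2)]   # side g(X) > 0
class2 = [(2, 0), (3/2, -1), (1, -2)]     # side g(X) < 0
print([dichotomize(*p) for p in class1])  # [1, 1, 1]
print([dichotomize(*p) for p in class2])  # [2, 2, 2]
```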
An infinite number of discriminant functions may exist for the same classification task.
Training and Classification
Consider neural network classifiers that derive their weights during the learning cycle.
The sample patterns, called the training sequence, are presented to the machine along with the correct response provided by the teacher.
After each incorrect response, the classifier modifies its parameters by means of iterative, supervised learning based on comparing the targeted correct response with the actual response.
[Figure: supervised training of a TLU -- inputs x_1, ..., x_n and x_{n+1} = -1 with weights w_1, ..., w_n, w_{n+1} feed the TLU, whose output o = 1 or -1 is compared with the desired response d; the error d - o drives the weight adjustment]
§8.2 Single Layer Perceptron
1. Linear Threshold Unit and Separability
[Figure: linear threshold unit -- inputs x_1, ..., x_n, ..., x_N with weights w_1, ..., w_n, ..., w_N, threshold T, output y]

X = (x_1, ..., x_N)^t,  x_n ∈ {1, -1}
W = (w_1, ..., w_N)^t,  T ∈ R
y = sgn(W^t X - T) ∈ {1, -1}

Let X' = (X^t, -1)^t and W' = (W^t, T)^t; then we have

y = sgn(W'^t X') = sgn( Σ_{n=1}^{N+1} w_n x_n )

where x_{N+1} = -1 and w_{N+1} = T.
Linearly Separable Patterns
Assume that there is a pattern set which is divided into subsets Ω_1, Ω_2, ..., Ω_N. If a linear machine can classify the patterns from Ω_i as belonging to class i, for i = 1, ..., N, then the pattern sets are linearly separable.

When R = 2, for a given mapping f: X → {-1, 1}, if there exist W and T such that f(X) = sgn(W^t X - T), then the function f is said to be linearly separable.
Example 1: NAND is a linearly separable function. Given X and y such that

  x_1  x_2 |  y
  -1   -1  |  1
  -1    1  |  1
   1   -1  |  1
   1    1  | -1

it can be found that W = (-1, -1)^t and T = -3/2 is a solution:

y = sgn(W^t X - T) = sgn(-x_1 - x_2 + 3/2)

  x_1  x_2 |   T  | w_1 x_1 + w_2 x_2 - T  |  y
  -1   -1  | -3/2 |  1 + 1 - (-3/2) =  7/2 |  1
  -1    1  | -3/2 |  1 - 1 - (-3/2) =  3/2 |  1
   1   -1  | -3/2 | -1 + 1 - (-3/2) =  3/2 |  1
   1    1  | -3/2 | -1 - 1 - (-3/2) = -1/2 | -1
[Figure: the decision line x_1 + x_2 = 3/2 in the (x_1, x_2) plane, crossing each axis at 3/2; for y = sgn(-x_1 - x_2 + 3/2), y > 0 below the line and y < 0 above it]
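The truth table above can be verified mechanically; here is a minimal sketch of the LTU with W = (-1, -1) and T = -3/2:

```python
# Linear threshold unit realizing NAND: y = sgn(w1*x1 + w2*x2 - T)
# with W = (-1, -1) and T = -3/2, i.e. y = sgn(-x1 - x2 + 3/2).

def sgn(v):
    return 1 if v >= 0 else -1

def ltu(x1, x2, w=(-1.0, -1.0), T=-1.5):
    return sgn(w[0] * x1 + w[1] * x2 - T)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, ltu(x1, x2))
# NAND in the {-1, 1} convention: output is -1 only for input (1, 1)
```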
Example 2: XOR is not a linearly separable function. Given that

  x_1  x_2 |  y
  -1   -1  | -1
  -1    1  |  1
   1   -1  |  1
   1    1  | -1

it is impossible to find W and T satisfying y = sgn(W^t X - T), since the four patterns require

  (-1)w_1 + (-1)w_2 < T,   (+1)w_1 + (+1)w_2 < T,
  (-1)w_1 + (+1)w_2 > T,   (+1)w_1 + (-1)w_2 > T.

Adding the first pair gives 0 < 2T, while adding the second pair gives 0 > 2T -- a contradiction.
It is seen that linearly separable binary patterns can be classified by a suitably designed neuron, but linearly non-separable binary patterns cannot.
Perceptron Learning
Given the training set {X(t), d(t)}, t = 0, 1, 2, ..., where X(t) = {x_1(t), ..., x_N(t), -1}. Let w_{N+1} = T and x_{N+1} = -1, so that

y = sgn( Σ_{n=1}^{N+1} w_n x_n ) = { +1, if Σ_{n=1}^{N+1} w_n x_n ≥ 0
                                   { -1, otherwise

(1) Set w_n(0) = small random values, n = 1, ..., N+1.
(2) Input a sample X(t) = {x_1(t), ..., x_N(t), -1} and its label d(t).
(3) Compute the actual output y(t).
(4) Revise the weights: w_n(t+1) = w_n(t) + η[d(t) - y(t)]x_n(t).
(5) Return to (2) until w_n(t+1) = w_n(t) for all n.

where 0 < η < 1 is a learning coefficient.
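Steps (1)-(5) can be sketched as follows, trained here on the NAND example from earlier; the initial weight range, η = 0.5, and the epoch limit are illustrative choices:

```python
import random

def sgn(v):
    return 1 if v >= 0 else -1

def perceptron_train(samples, N, eta=0.5, max_epochs=100, seed=0):
    """samples: list of (X, d), with X including the fixed -1 bias input."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(N + 1)]    # step (1)
    for _ in range(max_epochs):
        changed = False
        for x, d in samples:                              # step (2)
            y = sgn(sum(wn * xn for wn, xn in zip(w, x))) # step (3)
            if y != d:                                    # step (4)
                w = [wn + eta * (d - y) * xn for wn, xn in zip(w, x)]
                changed = True
        if not changed:                                   # step (5)
            return w
    return w

# NAND with the augmented input x_{N+1} = -1 (so w_{N+1} plays the role of T).
nand = [((-1, -1, -1), 1), ((-1, 1, -1), 1), ((1, -1, -1), 1), ((1, 1, -1), -1)]
w = perceptron_train(nand, N=2)
print([sgn(sum(wn * xn for wn, xn in zip(w, x))) for x, _ in nand])  # [1, 1, 1, -1]
```

Since NAND is linearly separable, the convergence theorem below guarantees that the loop terminates with all samples classified correctly.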
Theorem: The perceptron learning algorithm will converge if the function is linearly separable.
Gradient Descent Algorithm
The perceptron learning algorithm is restricted to linearly separable functions (hard-limiting activation function). The gradient descent algorithm can be applied in more general cases, with the only requirement that the activation function be differentiable.
Given the training set (x_n, y_n), n = 1, ..., N, try to find W* such that ŷ_n = f(W*·x_n) ≈ y_n.

Let

E = Σ_{n=1}^{N} E_n = (1/2) Σ_{n=1}^{N} (y_n - ŷ_n)^2 = (1/2) Σ_{n=1}^{N} (y_n - f(W*·x_n))^2

be the error measure of learning. To minimize E, take the gradient

∂E/∂w_m = Σ_{n=1}^{N} ∂E_n/∂w_m
        = (1/2) Σ_{n=1}^{N} ∂(y_n - ŷ_n)^2 / ∂w_m
        = (1/2) Σ_{n=1}^{N} ∂(y_n - f(W*·x_n))^2 / ∂w_m
        = - Σ_{n=1}^{N} (y_n - f(W*·x_n)) · ∂f(W*·x_n)/∂w_m
        = - Σ_{n=1}^{N} (y_n - f(W*·x_n)) · f'(W*·x_n) · ∂(W*·x_n)/∂w_m
        = - Σ_{n=1}^{N} (y_n - f(W*·x_n)) f'(W*·x_n) x_mn

The learning (adjusting) rule is thus as follows:

w_m = w_m - η ∂E/∂w_m = w_m + η Σ_{n=1}^{N} (y_n - ŷ_n) f'(W*·x_n) x_mn,   η > 0
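The rule above can be sketched for a single sigmoid unit f(v) = 1/(1 + e^{-v}), for which f' = f(1 - f); the OR-style data, η, and step count are illustrative choices:

```python
import math

def f(v):                       # sigmoid activation
    return 1.0 / (1.0 + math.exp(-v))

def train(samples, M, eta=0.5, steps=5000):
    """Batch gradient descent: w_m <- w_m + eta * sum_n (y_n - yhat_n) f'(v_n) x_mn."""
    w = [0.0] * M
    for _ in range(steps):
        grad = [0.0] * M
        for x, y in samples:
            v = sum(wm * xm for wm, xm in zip(w, x))
            yhat = f(v)
            fprime = yhat * (1.0 - yhat)          # sigmoid derivative f(1 - f)
            for m in range(M):
                grad[m] += (y - yhat) * fprime * x[m]
        w = [wm + eta * gm for wm, gm in zip(w, grad)]
    return w

# OR-like data in 0/1 coding, with a fixed -1 bias input appended.
data = [((0, 0, -1), 0.0), ((0, 1, -1), 1.0), ((1, 0, -1), 1.0), ((1, 1, -1), 1.0)]
w = train(data, M=3)
preds = [round(f(sum(wm * xm for wm, xm in zip(w, x)))) for x, _ in data]
print(preds)
```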
§8.3 Multi-Layer Perceptron
1. Why Multi-Layer?
XOR, which as was seen cannot be implemented by a 1-layer network, can be implemented by a 2-layer network:
[Figure: 2-layer network for XOR -- inputs x_10 and x_20 each feed the hidden unit x_1 (weights 1, 1, threshold 1.5) and the output unit y (weights 1, 1); the hidden unit connects to y with weight -2, and y has threshold 0.5]

x_1 = sgn(1·x_10 + 1·x_20 - 1.5)
y = sgn(1·x_10 + 1·x_20 - 2·x_1 - 0.5)

  x_10  x_20  x_1 |  y
   1     1     1  | -1
   1    -1    -1  |  1
  -1     1    -1  |  1
  -1    -1    -1  | -1
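The truth table can be reproduced directly from the two equations above:

```python
# Two-layer realization of XOR from the figure:
#   x1 = sgn(x10 + x20 - 1.5)            (hidden unit)
#   y  = sgn(x10 + x20 - 2*x1 - 0.5)     (output unit)

def sgn(v):
    return 1 if v >= 0 else -1

def xor_net(x10, x20):
    x1 = sgn(1 * x10 + 1 * x20 - 1.5)
    return sgn(1 * x10 + 1 * x20 - 2 * x1 - 0.5)

for x10 in (1, -1):
    for x20 in (1, -1):
        print(x10, x20, xor_net(x10, x20))
# (1,1) -> -1, (1,-1) -> 1, (-1,1) -> 1, (-1,-1) -> -1
```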
Single-layer networks have no hidden units and thus have no internal representation ability. They can only map similar input patterns to similar output ones.
If there is a layer of hidden units, there is always an internal representation of the input patterns that may support any mapping from input to output units.
This figure shows the internal representation abilities:

[Table: Network Structure | Type of Classified Region | Classification for XOR | Meshed Classified Region | General Form of Classified Region. One layer -- 2 regions separated by a hyperplane; two layers -- open or closed convex regions; three layers -- arbitrary forms]
2. Back Propagation Algorithm
More precisely, BP is an error back-propagation learning algorithm for multi-layer perceptron networks, and is also a sort of generalized gradient descent algorithm.
Assumptions:
1) The MLP has M layers and a single output node.
2) Each node is of sigmoid type with activation function f(x) = 1 / (1 + e^(-x)).
3) Training samples are (x_n, y_n), n = 1, ..., N.
4) The error measure is chosen as

E = Σ_{n=1}^{N} E_n = (1/2) Σ_{n=1}^{N} (y_n - ŷ_n)^2
Let δ_jn = ∂E_n/∂net_jn, where net_jn = Σ_i w_ji O_in.

I) If j is an output node, then O_jn = ŷ_n and

δ_jn = ∂E_n/∂ŷ_n · ∂ŷ_n/∂net_jn = -(y_n - ŷ_n) f'(net_jn)

Thus

∂E_n/∂w_ji = ∂E_n/∂net_jn · ∂net_jn/∂w_ji = δ_jn O_in = -(y_n - ŷ_n) O_in f'(net_jn)

II) If j is not an output node, then

δ_jn = ∂E_n/∂O_jn · ∂O_jn/∂net_jn = ∂E_n/∂O_jn · f'(net_jn), but

∂E_n/∂O_jn = Σ_k ∂E_n/∂net_kn · ∂net_kn/∂O_jn = Σ_k ∂E_n/∂net_kn · ∂(Σ_i w_ki O_in)/∂O_jn = Σ_k δ_kn w_kj

Therefore

δ_jn = f'(net_jn) Σ_k δ_kn w_kj

and

∂E_n/∂w_ji = δ_jn O_in = O_in f'(net_jn) Σ_k δ_kn w_kj

The MLP-BP algorithm can then be described as below:

(1) Set the initial W.
(2) Until convergence (w_ji = const), repeat the following:
    (i) For n = 1 to N (number of samples):
        (a) Compute O_in, net_jn, ŷ_n and E_n.
        (b) For m = M down to 2 and all units j in layer m, compute ∂E_n/∂w_ji.
    (ii) Revise the weights: w_ji = w_ji - η ∂E_n/∂w_ji,   η > 0.
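A minimal sketch of one backward pass for a 2-2-1 sigmoid network, checked against a numerical gradient; the network size, weights, and data point are illustrative assumptions:

```python
import math

def f(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(W1, W2, x):
    """W1: 2x2 hidden weights, W2: 2 output weights; returns (hidden outputs, yhat)."""
    h = [f(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    yhat = f(sum(W2[j] * h[j] for j in range(2)))
    return h, yhat

def backprop(W1, W2, x, y):
    """Gradients of E = 0.5*(y - yhat)^2 via the delta rule above."""
    h, yhat = forward(W1, W2, x)
    # output node: delta = dE/dnet = -(y - yhat) * f'(net), with f' = yhat*(1 - yhat)
    delta_out = -(y - yhat) * yhat * (1.0 - yhat)
    gW2 = [delta_out * h[j] for j in range(2)]
    # hidden nodes: delta_j = f'(net_j) * sum_k delta_k * w_kj
    gW1 = [[0.0] * 2 for _ in range(2)]
    for j in range(2):
        delta_j = h[j] * (1.0 - h[j]) * delta_out * W2[j]
        for i in range(2):
            gW1[j][i] = delta_j * x[i]
    return gW1, gW2

# Numerical check of dE/dW2[0] by central finite differences.
W1 = [[0.3, -0.2], [0.1, 0.4]]
W2 = [0.5, -0.6]
x, y, eps = [1.0, -1.0], 1.0, 1e-6

def E(W2v):
    _, yhat = forward(W1, W2v, x)
    return 0.5 * (y - yhat) ** 2

gW1, gW2 = backprop(W1, W2, x, y)
num = (E([W2[0] + eps, W2[1]]) - E([W2[0] - eps, W2[1]])) / (2 * eps)
print(abs(gW2[0] - num) < 1e-6)   # True: analytic gradient matches numerical one
```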
Mapping Ability of Feedforward Networks

MLPs play the mathematical role of a mapping R^n → R^m.

Kolmogorov Theorem (1957): Let φ(x) be a bounded, monotonically increasing continuous function, K be a bounded closed subset of R^n, and f(X) = f(x_1, ..., x_n) be a real continuous function on K. Then for any ε > 0, there exist an integer N and constants c_i, T_i and w_ij (i = 1, ..., N; j = 1, ..., n) such that

f^(x_1, ..., x_n) = Σ_{i=1}^{N} c_i φ( Σ_{j=1}^{n} w_ij x_j - T_i )      (1)

and

max_{X ∈ K} | f(x_1, ..., x_n) - f^(x_1, ..., x_n) | < ε                 (2)

That is to say, for any ε > 0, there exists a 3-layer network whose hidden-unit output function is φ(x), whose input and output units are linear, and whose total input-output relation f^(x_1, ..., x_n) satisfies Eq. (2).

Significance: Any continuous mapping R^n → R^m can be approximated by a k-layer (i.e. (k-2)-hidden-layer) network's input-output relation, k ≥ 3.
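The form of Eq. (1) can be illustrated with a random hidden layer and least squares for the outer coefficients c_i. This is an extreme-learning-machine-style sketch, not the theorem's construction; numpy, N = 100, and the target function sin are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                  # number of hidden units

# Target: a continuous function on a bounded closed set K = [0, pi].
xs = np.linspace(0.0, np.pi, 200)
fx = np.sin(xs)

# f^(x) = sum_i c_i * phi(w_i * x - T_i), phi = sigmoid (bounded, increasing).
w = rng.uniform(-5.0, 5.0, N)
T = rng.uniform(-5.0, 5.0, N)
H = 1.0 / (1.0 + np.exp(-(np.outer(xs, w) - T)))   # hidden outputs, shape (200, N)

# Fit the linear output layer c_i by least squares.
c, *_ = np.linalg.lstsq(H, fx, rcond=None)
approx = H @ c
err = float(np.max(np.abs(fx - approx)))
print(err < 1e-2)   # the random-feature expansion approximates sin closely
```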
§8.4 Applications of Feed-forward Networks
MLPs can be successfully applied to classification and diagnosis problems whose solution is obtained via experimentation and simulation rather than via a rigorous and formal approach to the problem.
MLPs can also act as expert systems. Formulation of the rules is one of the bottlenecks in building expert systems. However, layered networks can acquire knowledge without extracting IF-THEN rules if the number of training vector pairs is sufficient to suitably form all decision regions.
1) Fault Diagnosis
Automobile engine diagnosis (Marko et al, 1989)
-- employs a single-hidden-layer network
-- identifies 26 different faults
-- the training set consists of 16 sets of data for each failure
-- the training time needed is 10 minutes
-- the mainframe is a NESTOR NDS-100 computer
-- fault recognition accuracy is 100%
Switching system fault diagnosis (Wen Fang, 1994)
-- BP algorithm
-- 3-layer network
-- no mis-diagnosis, much better than the existing system
2) Handwritten Digit Recognition
Postal code recognition (Wang et al, 1995)
-- 3-layer network
-- preprocessing -- 130 features
-- rejection rate < 5%
-- mis-classification rate < 0.01%
3) Other Applications Include
-- text reading
-- speech recognition
-- image recognition
-- medical diagnosis
-- approximation
-- optimization
-- coding
-- robot control, etc.
Advantages and Disadvantages
-- learning from examples
-- better performance than traditional approaches
-- long training time (much improved)
-- local minima (already overcome)