Lecture 11: Single Layer Perceptrons


TRANSCRIPT

  • In the Name of God

    Lecture 11: Single Layer Perceptrons

  • Perceptron: architecture

    • We consider the architecture: a feed-forward NN with one layer.
    • It is sufficient to study single layer perceptrons with just one neuron.

  • Single layer perceptrons

    • Generalization to single layer perceptrons with more neurons is easy because:
      ▫ The output units are independent of each other.
      ▫ Each weight only affects one of the outputs.

  • Perceptron: Neuron Model

    • The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation function, the sign function.

  • Perceptron

    • The perceptron is based on a non-linear neuron.
    • x(n) = [+1, x1(n), x2(n), ..., xm(n)]^T
    • w(n) = [b(n), w1(n), w2(n), ..., wm(n)]^T
    • v(n) = w^T(n) x(n) defines a hyperplane:

      $v = \sum_{i=1}^{m} w_i x_i + b$

      $y = \operatorname{sign}\left(b + \sum_{i=1}^{m} w_i x_i\right)$
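    As a concrete illustration of this neuron model (an editorial sketch, not from the slides; the numbers are made up), the following computes v and y for an augmented input x = [+1, x1, ..., xm] with the bias folded into w:

      import numpy as np

      def perceptron_output(w, x):
          """Compute the induced local field v = w^T x and the sign output y."""
          v = np.dot(w, x)          # v = b + sum_i w_i * x_i  (bias b = w[0], since x[0] = +1)
          y = 1 if v > 0 else -1    # sign activation (convention used here: sign(0) -> -1)
          return v, y

      # Example with made-up numbers: w = [b, w1, w2], x = [+1, x1, x2]
      w = np.array([-0.5, 1.0, 1.0])
      x = np.array([+1.0, 0.2, 0.9])
      print(perceptron_output(w, x))   # v = 0.6 > 0, so y = +1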

  • Perceptron: Applications

    • The perceptron is used for classification: classify correctly a set of examples into one of two classes, C1 and C2:
      ▫ If the output of the perceptron is +1, then the input is assigned to class C1.
      ▫ If the output of the perceptron is -1, then the input is assigned to class C2.

  • Perceptron Training

    • How can we train a perceptron for a classification task?
    • We try to find suitable values for the weights in such a way that the training examples are correctly classified.
    • Geometrically, we try to find a hyper-plane that separates the examples of the two classes.

  • Perceptron: Classification

    • The equation below describes a hyperplane in the input space. This hyperplane is used to separate the two classes C1 and C2:

      $\sum_{i=1}^{m} w_i x_i + b = 0$

    [Figure: the (x1, x2) plane divided by the decision boundary w1x1 + w2x2 + b = 0 into the decision region for C1 (where w1x1 + w2x2 + b > 0) and the decision region for C2]
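    As an illustrative aside (the weights w1 = w2 = 1 and b = -1 below are made-up values, not from the slides), this small sketch assigns a 2-D point to C1 or C2 according to the sign of w1x1 + w2x2 + b:

      def decision_region(x1, x2, w1=1.0, w2=1.0, b=-1.0):
          """Return the class of (x1, x2) relative to the hyperplane w1*x1 + w2*x2 + b = 0."""
          v = w1 * x1 + w2 * x2 + b
          return "C1" if v > 0 else "C2"   # v > 0: decision region for C1

      # Two points on opposite sides of the line x1 + x2 = 1 (illustrative weights only)
      print(decision_region(2.0, 1.0))   # v = 2.0 > 0   -> C1
      print(decision_region(0.0, 0.0))   # v = -1.0 <= 0 -> C2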

  • Example: AND

    • Here is a representation of the AND function.
    • White means false, black means true for the output.
    • -1 means false, +1 means true for the input.
    • A linear decision surface separates false from true instances (one such surface is sketched after the truth table).

      -1 AND -1 = false
      -1 AND +1 = false
      +1 AND -1 = false
      +1 AND +1 = true
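    As an illustrative aside (these weights are hand-picked for demonstration, not given in the slides), one linear decision surface that realizes AND under the ±1 encoding is b = -0.5, w1 = w2 = 0.5:

      import numpy as np

      # One possible (hand-picked) weight vector for AND with +/-1 inputs:
      # y = sign(b + w1*x1 + w2*x2) with b = -0.5, w1 = w2 = 0.5
      w = np.array([-0.5, 0.5, 0.5])   # [b, w1, w2]

      for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
          v = w @ np.array([1.0, x1, x2])
          y = 1 if v > 0 else -1
          print(f"{x1:+d} AND {x2:+d} -> {y:+d}")   # only (+1, +1) gives +1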

  • Example: AND

    • Watch a perceptron learn the AND function:


  • Example: XOR

    • Here is the XOR function:

      -1 XOR -1 = false
      -1 XOR +1 = true
      +1 XOR -1 = true
      +1 XOR +1 = false

    • Perceptrons cannot learn such linearly inseparable functions.

  • Example: XOR

    • Watch a perceptron try to learn XOR

  • Perceptron learning

    • If the n-th member of the training set, x(n), is correctly classified by the weight vector w(n) computed at the n-th iteration of the algorithm, no correction is made to the weight vector, i.e.:
      ▫ w(n+1) = w(n) if w^T(n)x(n) > 0 and x(n) belongs to class C1
      ▫ w(n+1) = w(n) if w^T(n)x(n) ≤ 0 and x(n) belongs to class C2
    • Otherwise, the weight vector of the perceptron is updated:
      ▫ w(n+1) = w(n) - η(n)x(n) if w^T(n)x(n) > 0 and x(n) belongs to class C2
      ▫ w(n+1) = w(n) + η(n)x(n) if w^T(n)x(n) ≤ 0 and x(n) belongs to class C1
    • The learning-rate parameter η(n) controls the adjustment applied to the weight vector at iteration n.

  • Perceptron learning

    • Or, simply:

      w(n+1) = w(n) + η(n) e(n) x(n)

    • where

      e(n) = d(n) − y(n)   [the error]

    • The perceptron uses the McCulloch-Pitts formal model of a neuron.
    • The learning process is performed for a finite number of iterations and then stops.

  • Learning algorithm

    • Epoch: presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).
    • Error: the error value is the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it outputs a 1, then Error = -1.

  • Decision boundary

    • In simple cases, divide feature space by drawing a hyperplane across it.
    • Known as a decision boundary.
    • Discriminant function: returns different values on opposite sides (a straight line in two dimensions).
    • Problems which can be classified in this way are linearly separable.

    [Figure: two scatter plots of + and - examples in the (x1, x2) plane; in the first the classes can be separated by a straight line (linearly separable), in the second they cannot (non-linearly separable)]

  • Example

    • Consider the 2-dimensional training set C1 ∪ C2:
      ▫ C1 = {(1, 1, 1), (1, 1, -1), (1, 0, -1)} with class label 1
      ▫ C2 = {(1, -1, -1), (1, -1, 1), (1, 0, 1)} with class label 0
    • Train a perceptron on C1 ∪ C2. The first pass is as follows:

      Input         Weight       Desired  Actual  Update?  New weight
      (1, 1, 1)     (1, 0, 0)    1        1       No       (1, 0, 0)
      (1, 1, -1)    (1, 0, 0)    1        1       No       (1, 0, 0)
      (1, 0, -1)    (1, 0, 0)    1        1       No       (1, 0, 0)
      (1, -1, -1)   (1, 0, 0)    0        1       Yes      (0, 1, 1)
      (1, -1, 1)    (0, 1, 1)    0        0       No       (0, 1, 1)
      (1, 0, 1)     (0, 1, 1)    0        1       Yes      (-1, 1, 0)

  • Example

    C1 = {(1, 1, 1), (1, 1, -1), (1, 0, -1)}   C2 = {(1, -1, -1), (1, -1, 1), (1, 0, 1)}

    Fill out this table sequentially (second pass):

      Input         Weight       Desired  Actual  Update?  New weight
      (1, 1, 1)     (-1, 1, 0)   1        0       Yes      (0, 2, 1)
      (1, 1, -1)    (0, 2, 1)    1        1       No       (0, 2, 1)
      (1, 0, -1)    (0, 2, 1)    1        0       Yes      (1, 2, 0)
      (1, -1, -1)   (1, 2, 0)    0        0       No       (1, 2, 0)
      (1, -1, 1)    (1, 2, 0)    0        0       No       (1, 2, 0)
      (1, 0, 1)     (1, 2, 0)    0        1       Yes      (0, 2, -1)

  • Example

    C1 = {(1, 1, 1), (1, 1, -1), (1, 0, -1)}   C2 = {(1, -1, -1), (1, -1, 1), (1, 0, 1)}

    Fill out this table sequentially (third pass):

      Input         Weight       Desired  Actual  Update?  New weight
      (1, 1, 1)     (0, 2, -1)   1        1       No       (0, 2, -1)
      (1, 1, -1)    (0, 2, -1)   1        1       No       (0, 2, -1)
      (1, 0, -1)    (0, 2, -1)   1        1       No       (0, 2, -1)
      (1, -1, -1)   (0, 2, -1)   0        0       No       (0, 2, -1)
      (1, -1, 1)    (0, 2, -1)   0        0       No       (0, 2, -1)
      (1, 0, 1)     (0, 2, -1)   0        0       No       (0, 2, -1)

    At epoch 3 no weight changes occur, so execution of the algorithm stops.
    Final weight vector: (0, 2, -1). The decision hyperplane is 2x1 - x2 = 0.
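    The three passes above can be reproduced with a short script (an editorial sketch, not part of the slides), assuming the convention implied by the tables: output y = 1 if w·x > 0 and 0 otherwise, update w ← w + (d − y)·x with learning rate 1, and initial weight (1, 0, 0) taken from the first table row:

      import numpy as np

      # Training set from the example: inputs are already augmented with a leading +1 (bias input).
      X = np.array([[1, 1, 1], [1, 1, -1], [1, 0, -1],     # class C1, desired output 1
                    [1, -1, -1], [1, -1, 1], [1, 0, 1]])   # class C2, desired output 0
      d = np.array([1, 1, 1, 0, 0, 0])

      w = np.array([1.0, 0.0, 0.0])          # initial weight, from the first row of the table
      for epoch in range(1, 10):
          changed = False
          for x, target in zip(X, d):
              y = 1 if w @ x > 0 else 0      # actual response
              if y != target:
                  w = w + (target - y) * x   # error-correction update with eta = 1
                  changed = True
          print(f"epoch {epoch}: w = {w}")
          if not changed:                    # stop when a full pass makes no change
              break
      # Reaches w = (0, 2, -1), i.e. the decision hyperplane 2*x1 - x2 = 0.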

  • Example: final results

    [Figure: the training points of C1 (+) and C2 (-) in the (x1, x2) plane, together with the final decision boundary 2x1 - x2 = 0 and the weight vector w]

  • Some Unhappiness About Perceptron Training

    • The perceptron learning rule fails to converge if the examples are not linearly separable.
      ▫ It can only model linearly separable classes, like (those described by) the following Boolean functions: AND, OR, but not XOR.
    • When a perceptron gives the right answer, no learning takes place.
    • Anything below the threshold is interpreted as "no", even if it is just below the threshold.
      ▫ Might it be better to train the neuron based on how far below the threshold it is?

  • Perceptron: Convergence Theorem

    Suppose the datasets C1 and C2 are linearly separable. The perceptron learning algorithm then converges after n0 iterations, with n0 ≤ nmax, on the training set C1 ∪ C2.

    Proof:
    • Suppose x ∈ C1 ⇒ output = 1, and x ∈ C2 ⇒ output = -1.
    • For simplicity assume w(1) = 0, η = 1.
    • Suppose the perceptron incorrectly classifies x(1), ..., x(n), ... ∈ C1, so that w^T(k) x(k) ≤ 0.
    • Error-correction rule: w(n+1) = w(n) + x(n), hence

      w(2) = w(1) + x(1)
      w(3) = w(2) + x(2)
      ...
      w(n+1) = x(1) + ... + x(n)

  • Perceptron: Convergence Theorem

    • Let w0 be such that w0^T x(n) > 0 for all x(n) ∈ C1. Such a w0 exists because C1 and C2 are linearly separable.
    • Let α = min { w0^T x(n) : x(n) ∈ C1 }.
    • Then w0^T w(n+1) = w0^T x(1) + ... + w0^T x(n) ≥ n α.
    • By the Cauchy-Schwarz inequality, ||w0||^2 ||w(n+1)||^2 ≥ [w0^T w(n+1)]^2, so

      $\|w(n+1)\|^2 \ge \dfrac{n^2 \alpha^2}{\|w_0\|^2}$   (A)

  • Perceptron: Convergence Theorem

    • Now we consider another route: w(k+1) = w(k) + x(k).
    • Taking the squared Euclidean norm:

      $\|w(k+1)\|^2 = \|w(k)\|^2 + \|x(k)\|^2 + 2\, w^T(k) x(k)$

    • The last term is ≤ 0 because x(k) is misclassified, so

      $\|w(k+1)\|^2 \le \|w(k)\|^2 + \|x(k)\|^2, \qquad k = 1, \ldots, n$

    • With w(1) = 0, summing the inequalities

      $\|w(2)\|^2 \le \|w(1)\|^2 + \|x(1)\|^2, \quad \|w(3)\|^2 \le \|w(2)\|^2 + \|x(2)\|^2, \ \ldots$

      gives

      $\|w(n+1)\|^2 \le \sum_{k=1}^{n} \|x(k)\|^2$

  • Perceptron: Convergence Theorem

    • Let β = max { ||x(n)||^2 : x(n) ∈ C1 }. Then

      $\|w(n+1)\|^2 \le n \beta$   (B)

    • For sufficiently large values of n, (B) comes into conflict with (A). Thus n cannot be greater than the value nmax for which (A) and (B) are both satisfied with the equality sign:

      $\dfrac{n_{\max}^2 \alpha^2}{\|w_0\|^2} = n_{\max} \beta$

    • The perceptron convergence algorithm therefore terminates in at most

      $n_{\max} = \dfrac{\beta\, \|w_0\|^2}{\alpha^2}$   iterations.
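    As a numeric illustration of this bound (an editorial calculation, not on the slides): take the worked example from earlier, fold class C2 into C1 by negating its samples (so that a single vector w0 with w0^T x > 0 for every sample certifies separability), and use the final weight vector (0, 2, -1) as w0. Then α = 1, β = 3, ||w0||^2 = 5, and the bound gives nmax = 15 weight corrections:

      import numpy as np

      # C1 as given; C2 samples are negated so that one vector w0 with w0.x > 0 for
      # every (possibly negated) sample certifies linear separability.
      C1 = np.array([[1, 1, 1], [1, 1, -1], [1, 0, -1]])
      C2 = np.array([[1, -1, -1], [1, -1, 1], [1, 0, 1]])
      samples = np.vstack([C1, -C2])

      w0 = np.array([0.0, 2.0, -1.0])           # the separating vector found earlier
      alpha = min(samples @ w0)                 # alpha = min w0^T x   (here 1.0)
      beta = max(np.sum(samples**2, axis=1))    # beta  = max ||x||^2  (here 3)
      n_max = beta * np.dot(w0, w0) / alpha**2  # bound on the number of corrections
      print(alpha, beta, n_max)                 # 1.0 3 15.0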

  • Fixed-increment convergence theorem

    • Let the subsets of training vectors C1 and C2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron then converges after some n0 iterations, in the sense that

      w(n0) = w(n0+1) = w(n0+2) = ...

      is a solution vector for n0 ≤ nmax.

  • Learning-Rate Annealing Schedules

    • When the learning rate is large, the trajectory may follow a zigzagging path.
    • When it is small, the procedure may be slow.
    • Simplest choice: a constant learning-rate parameter,

      η(n) = η0   (η0 a constant)

    • Stochastic approximation:
      ▫ time-varying, η(n) = c / n   (c a constant)
      ▫ when c is large, there is a danger of parameter blowup for small n.

  • Learning-Rate Annealing Schedules

    • Search-then-converge schedule:

      $\eta(n) = \dfrac{\eta_0}{1 + n/\tau}$   (η0, τ constants)

      ▫ In the early stages, when n is small compared to the search time constant τ, the learning-rate parameter is approximately equal to η0.
      ▫ For a number of iterations n large compared to the search time constant τ, the learning-rate parameter approximates c/n.
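    A small sketch comparing the three schedules (the values of η0, c and τ below are made up for illustration):

      # Illustrative values only: eta0, c and tau are not specified in the slides.
      eta0, c, tau = 0.5, 2.0, 100.0

      constant    = lambda n: eta0                  # eta(n) = eta0
      stochastic  = lambda n: c / n                 # eta(n) = c / n
      search_conv = lambda n: eta0 / (1 + n / tau)  # eta(n) = eta0 / (1 + n/tau)

      for n in (1, 10, 100, 1000):
          print(n, constant(n), stochastic(n), round(search_conv(n), 4))
      # search-then-converge stays near eta0 for n << tau and behaves like (eta0*tau)/n for n >> tau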

  • Summary

    1. Initialization
       ▫ Set w(0) = 0.
    2. Activation
       ▫ At time step n, activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n).
    3. Computation of actual response
       ▫ y(n) = sgn[w^T(n) x(n)]
    4. Adaptation of the weight vector
       ▫ w(n+1) = w(n) + η [d(n) − y(n)] x(n)
       ▫ where d(n) = +1 if x(n) belongs to class C1, and d(n) = −1 if x(n) belongs to class C2.
    5. Continuation
       ▫ Increment time step n and go back to step 2.

    A minimal code sketch of these five steps is given below.
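    This sketch is an illustration, not code from the slides; it assumes a constant learning rate η = 0.1 and the ±1 target convention above, and is demonstrated on the AND data from the earlier example:

      import numpy as np

      def train_perceptron(X, d, eta=0.1, epochs=100):
          """Steps 1-5: initialize w to 0, then repeatedly apply the sign unit and the
          error-correction update w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n)."""
          X = np.hstack([np.ones((len(X), 1)), X])   # prepend the fixed +1 bias input
          w = np.zeros(X.shape[1])                   # step 1: initialization
          for _ in range(epochs):                    # step 5: continuation
              for x, target in zip(X, d):            # step 2: activation
                  y = 1 if w @ x > 0 else -1         # step 3: actual response, sgn(w^T x)
                  w += eta * (target - y) * x        # step 4: adaptation of the weight vector
          return w

      # AND with the +/-1 encoding used earlier (learning rate 0.1 is an illustrative choice)
      X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
      d = np.array([-1, -1, -1, 1])
      w = train_perceptron(X, d)
      print(w, [1 if w @ np.r_[1, x] > 0 else -1 for x in X])   # classifies AND correctly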

  • Perceptron Learning Example

    [Figure: a single neuron with inputs x1, ..., xn weighted by w1, ..., wn, plus a bias input x0 with weight w0]

      $a = \sum_{i=0}^{n} w_i x_i$

      y = 1 if a > 0, and y = 0 otherwise

  • Perceptron Learning Example

    • Initial weights: w = [0.25, -0.1, 0.5], giving the decision boundary x2 = 0.2 x1 - 0.5.
    • (x, t) = ([-1, -1], 1):  o = sgn(0.25 + 0.1 - 0.5) = -1, misclassified; weight change [0.2, -0.2, -0.2]
    • (x, t) = ([2, 1], -1):   o = sgn(0.45 - 0.6 + 0.3) = 1, misclassified; weight change [-0.2, -0.4, -0.2]
    • (x, t) = ([1, 1], 1):    o = sgn(0.25 - 0.7 + 0.1) = -1, misclassified; weight change [0.2, 0.2, 0.2]

    [Figure: the three training points with targets t = 1, t = -1, t = 1 in the (x1, x2) plane, and the decision boundary after each update]
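    The three updates above can be checked with a short script (an editorial sketch, not from the slides). The learning rate η = 0.1 is inferred from the listed weight changes, and the update used is Δw = η (t − o) x with x augmented by a leading 1:

      import numpy as np

      sgn = lambda a: 1 if a > 0 else -1

      w = np.array([0.25, -0.1, 0.5])                # initial weights [w0, w1, w2]
      eta = 0.1                                      # inferred from the weight changes on the slide
      samples = [([-1, -1], 1), ([2, 1], -1), ([1, 1], 1)]

      for (x1, x2), t in samples:
          x = np.array([1.0, x1, x2])                # augmented input
          o = sgn(w @ x)                             # perceptron output
          dw = eta * (t - o) * x                     # weight change
          w = w + dw
          print(f"t={t:+d} o={o:+d} dw={np.round(dw, 2)} w={np.round(w, 2)}")
      # Expected weight changes: [0.2 -0.2 -0.2], [-0.2 -0.4 -0.2], [0.2 0.2 0.2]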

  • Adaline: Adaptive Linear Element

    • The output y is a linear combination of the inputs x:

      $y(n) = \sum_{j=0}^{m} w_j(n)\, x_j(n)$

    [Figure: a linear neuron with inputs x1, ..., xm, weights w1, ..., wm, and output y]

  • Adaline: Adaptive Linear Element

    • Adaline uses a linear neuron model and the Least-Mean-Square (LMS) learning algorithm.
    • The idea: try to minimize the square error, which is a function of the weights:

      $E(w(n)) = \tfrac{1}{2} e^2(n)$

      $e(n) = d(n) - \sum_{j=0}^{m} w_j(n)\, x_j(n)$

    • We can find the minimum of the error function E by means of the steepest descent method.

  • Steepest Descent Method

    • Start with an arbitrary point.
    • Find a direction in which E is decreasing most rapidly: the direction opposite to the gradient

      $\nabla E(w) = \left( \dfrac{\partial E}{\partial w_1}, \ldots, \dfrac{\partial E}{\partial w_m} \right)$

    • Make a small step in that direction:

      $w(n+1) = w(n) - \eta\, \nabla E(w(n))$
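    For the Adaline error E = ½ e² with e = d − w^T x, the gradient with respect to w is −e·x, so the steepest-descent step above becomes the LMS update w(n+1) = w(n) + η e(n) x(n). A minimal sketch (an illustration, not from the slides; the data and learning rate are made up):

      import numpy as np

      def lms_step(w, x, d, eta=0.05):
          """One LMS / steepest-descent step on E = 0.5 * e^2 with a linear neuron."""
          e = d - w @ x                # error of the linear output
          return w + eta * e * x       # w(n+1) = w(n) - eta * grad E = w(n) + eta * e * x

      # Made-up 1-D regression data: fit y = w0 + w1 * x1 to points on the line y = 2 * x1 + 1
      rng = np.random.default_rng(0)
      w = np.zeros(2)
      for _ in range(2000):
          x1 = rng.uniform(-1, 1)
          x = np.array([1.0, x1])      # augmented input [1, x1]
          w = lms_step(w, x, d=2 * x1 + 1)
      print(np.round(w, 3))            # approaches [1, 2]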

  • Comparison of Perceptron and Adaline

                           Perceptron                           Adaline
      Architecture         Single-layer                         Single-layer
      Neuron model         Non-linear                           Linear
      Learning algorithm   Minimize number of misclassified     Minimize total squared error
                           examples
      Application          Linear classification                Linear classification, regression

  • Reading

    • S. Haykin, Neural Networks: A Comprehensive Foundation, 2007 (Chapter 4).