Lecture 11: Single Layer Perceptrons
ce.sharif.edu/courses/92-93/1/ce957-1/resources/...
TRANSCRIPT
-
In the Name of God
Lecture 11: Single Layer Perceptrons
-
Perceptron: architecture
• We consider the architecture: a feed-forward NN with one layer
• It is sufficient to study single layer perceptrons with just one neuron
-
Single layer perceptrons
• Generalization to single layer perceptrons with more neurons is easy because:
▫ The output units are independent of each other
▫ Each weight only affects one of the outputs
-
Perceptron: Neuron Model
• The (McCulloch-Pitts) perceptron is a single layer NN with a non-linear activation function, the sign function
-
Perceptron
• The perceptron is based on a nonlinear neuron:
▫ x(n) = [+1, x1(n), x2(n), ..., xm(n)]^T
▫ w(n) = [b(n), w1(n), w2(n), ..., wm(n)]^T
▫ v(n) = w^T(n) x(n); the equation v(n) = 0 defines a hyperplane

v = Σ_{i=1}^{m} w_i x_i + b
y = sign(Σ_{i=1}^{m} w_i x_i + b)
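To make the model concrete, here is a minimal sketch of this neuron in Python (not from the lecture; the function name is illustrative):

```python
import numpy as np

def perceptron_output(w, x, b):
    """McCulloch-Pitts perceptron: y = sign(w.x + b)."""
    v = np.dot(w, x) + b        # induced local field v = sum_i w_i x_i + b
    return 1 if v > 0 else -1   # sign activation (v = 0 mapped to -1 here)
```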
-
Perceptron: Applications
• The perceptron is used for classification: correctly classify a set of examples into one of the two classes C1 and C2:
▫ If the output of the perceptron is +1, then the input is assigned to class C1
▫ If the output of the perceptron is -1, then the input is assigned to class C2
-
Perceptron Training
• How can we train a perceptron for a classification task?
• We try to find suitable values for the weights in such a way that the training examples are correctly classified.
• Geometrically, we try to find a hyper-plane that separates the examples of the two classes.
-
Perceptron: Classification
• The equation below describes a hyperplane in the input space. This hyperplane is used to separate the two classes C1 and C2:

Σ_{i=1}^{m} w_i x_i + b = 0

[Figure: in the (x1, x2) plane, the decision boundary is the line w1x1 + w2x2 + b = 0; the decision region for C1 is where w1x1 + w2x2 + b > 0, and the region for C2 lies on the other side.]
-
Example: AND
• Here is a representation of the AND function
• White means false, black means true for the output
• -1 means false, +1 means true for the input
• A linear decision surface separates false from true instances
-1 AND -1 = false
-1 AND +1 = false
+1 AND -1 = false
+1 AND +1 = true
-
Example: AND
• Watch a perceptron learn the AND function:
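What the animation computes can be sketched in a few lines of Python. The update used here is the error-correction rule introduced on the later "Perceptron learning" slides, with the ±1 encoding above and the bias folded in as x0 = +1; all names are illustrative:

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, max_epochs=100):
    """Error-correction learning: w <- w + eta*(d - y)*x, with x[0] = +1 as bias input."""
    w = np.zeros(X.shape[1])
    for epoch in range(max_epochs):
        errors = 0
        for x, target in zip(X, d):
            y = 1 if np.dot(w, x) > 0 else -1
            if y != target:
                w += eta * (target - y) * x
                errors += 1
        if errors == 0:                 # converged: a full epoch with no mistakes
            return w, epoch + 1
    return w, max_epochs

# AND with +/-1 encoding; the first column is the fixed bias input +1
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]])
d = np.array([-1, -1, -1, 1])
w, epochs = train_perceptron(X, d)
print(w, epochs)   # a separating weight vector is found after a few epochs
```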
-
Example: XOR
• Here’s the XOR function:
-1 XOR -1 = false
-1 XOR +1 = true
+1 XOR -1 = true
+1 XOR +1 = false
Perceptrons cannot learn functions that are not linearly separable.
-
Example: XOR
• Watch a perceptron try to learn XOR
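Running the same error-correction rule on the XOR data shows the failure directly: the per-epoch error count never reaches zero, no matter how many epochs are allowed (an illustrative sketch, not the lecture's code):

```python
import numpy as np

# XOR with +/-1 encoding; the first column is the fixed bias input +1
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]])
d = np.array([-1, 1, 1, -1])

w = np.zeros(3)
for epoch in range(1000):
    errors = 0
    for x, target in zip(X, d):
        y = 1 if np.dot(w, x) > 0 else -1
        if y != target:
            w += 0.1 * (target - y) * x
            errors += 1
    if errors == 0:        # never happens: no line separates the XOR classes
        break
print(errors)              # stays > 0 after every epoch
```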
-
Perceptron learning
• If the n-th member of the training set, x(n), is correctly classified by the weight vector w(n) computed at the n-th iteration of the algorithm, no correction is made to the weight vector, i.e.:
▫ w(n+1) = w(n) if w^T(n) x(n) > 0 and x(n) belongs to class C1
▫ w(n+1) = w(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C2
• Otherwise the weight vector of the perceptron is updated:
▫ w(n+1) = w(n) − η(n) x(n) if w^T(n) x(n) > 0 and x(n) belongs to class C2
▫ w(n+1) = w(n) + η(n) x(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C1
• The learning-rate parameter η(n) controls the adjustment applied to the weight vector at iteration n.
-
Perceptron learning
• Or, simply:
▫ w(n+1) = w(n) + η(n) e(n) x(n)
• where
▫ e(n) = d(n) − y(n) [the error]
• The perceptron uses the McCulloch-Pitts formal model of a neuron.
• The learning process is performed for a finite number of iterations and then stops.
-
Learning algorithm
Epoch: the presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).
Error: the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it outputs 1, then Error = -1.
-
Decision boundary
• In simple cases, divide the feature space by drawing a hyperplane across it.
• Known as a decision boundary.
• Discriminant function: returns different values on opposite sides (a straight line in 2D).
• Problems which can be classified this way are linearly separable.

[Figure: two scatter plots of + and - points in the (x1, x2) plane; on the left a linearly separable set, split by a straight line; on the right a non-linearly separable set.]
-
Example
• Consider the 2-dimensional training set C1 ∪ C2:
▫ C1 = {(1, 1, 1), (1, 1, -1), (1, 0, -1)} with class label 1
▫ C2 = {(1, -1, -1), (1, -1, 1), (1, 0, 1)} with class label 0
• Train a perceptron on C1 ∪ C2. The first pass is as follows:

Input        Weight      Desired  Actual  Update?  New weight
(1, 1, 1)    (1, 0, 0)   1        1       No       (1, 0, 0)
(1, 1, -1)   (1, 0, 0)   1        1       No       (1, 0, 0)
(1, 0, -1)   (1, 0, 0)   1        1       No       (1, 0, 0)
(1, -1, -1)  (1, 0, 0)   0        1       Yes      (0, 1, 1)
(1, -1, 1)   (0, 1, 1)   0        0       No       (0, 1, 1)
(1, 0, 1)    (0, 1, 1)   0        1       Yes      (-1, 1, 0)
-
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}   C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
Fill out this table sequentially (second pass):

Input        Weight      Desired  Actual  Update?  New weight
(1, 1, 1)    (-1, 1, 0)  1        0       Yes      (0, 2, 1)
(1, 1, -1)   (0, 2, 1)   1        1       No       (0, 2, 1)
(1, 0, -1)   (0, 2, 1)   1        0       Yes      (1, 2, 0)
(1, -1, -1)  (1, 2, 0)   0        0       No       (1, 2, 0)
(1, -1, 1)   (1, 2, 0)   0        0       No       (1, 2, 0)
(1, 0, 1)    (1, 2, 0)   0        1       Yes      (0, 2, -1)
-
Example
C1: {(1, 1, 1), (1, 1, -1), (1, 0, -1)}   C2: {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
Fill out this table sequentially (third pass):

Input        Weight      Desired  Actual  Update?  New weight
(1, 1, 1)    (0, 2, -1)  1        1       No       (0, 2, -1)
(1, 1, -1)   (0, 2, -1)  1        1       No       (0, 2, -1)
(1, 0, -1)   (0, 2, -1)  1        1       No       (0, 2, -1)
(1, -1, -1)  (0, 2, -1)  0        0       No       (0, 2, -1)
(1, -1, 1)   (0, 2, -1)  0        0       No       (0, 2, -1)
(1, 0, 1)    (0, 2, -1)  0        0       No       (0, 2, -1)

At epoch 3 no weight changes occur, so we stop the algorithm. Final weight vector: (0, 2, -1). The decision hyperplane is 2x1 - x2 = 0.
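The three passes can be checked mechanically. The sketch below (η = 1 and outputs in {0, 1}, as in the tables; names are mine) reproduces the trace and ends with the weight vector (0, 2, -1):

```python
import numpy as np

C = [((1, 1, 1), 1), ((1, 1, -1), 1), ((1, 0, -1), 1),   # C1, desired output 1
     ((1, -1, -1), 0), ((1, -1, 1), 0), ((1, 0, 1), 0)]  # C2, desired output 0

w = np.array([1.0, 0.0, 0.0])                 # initial weight (1, 0, 0)
while True:
    changed = False
    for x, d in C:
        x = np.array(x)
        y = 1 if np.dot(w, x) > 0 else 0      # actual response
        if y != d:
            w += (d - y) * x                  # eta = 1
            changed = True
    if not changed:                           # an epoch with no updates: stop
        break
print(w)   # -> [ 0.  2. -1.], i.e. decision hyperplane 2*x1 - x2 = 0
```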
-
Example: final results

[Figure: the training points of C1 (+) and C2 (-) in the (x1, x2) plane, with the learned decision boundary 2x1 - x2 = 0 and the weight vector w = (0, 2, -1) drawn perpendicular to it; C1 lies on the positive side.]
-
Some Unhappiness About Perceptron Training
• The perceptron learning rule fails to converge if examples are not linearly separable.
▫ It can only model linearly separable classes, like (those described by) the following Boolean functions: AND, OR, but not XOR.
• When a perceptron gives the right answer, no learning takes place.
• Anything below the threshold is interpreted as "no", even if it is just below the threshold.
▫ Might it be better to train the neuron based on how far below the threshold it is?
-
Perceptron: Convergence Theorem
Suppose datasets C1 and C2 are linearly separable. The perceptron learning algorithm converges after n0 iterations, with n0 ≤ nmax, on training set C1 ∪ C2.
Proof:
• Suppose x ∈ C1 ⇒ output = 1, and x ∈ C2 ⇒ output = -1.
• For simplicity assume w(1) = 0, η = 1.
• Suppose the perceptron incorrectly classifies x(1), ..., x(n), ... ∈ C1, i.e. w^T(k) x(k) ≤ 0.
• Error correction rule: w(n+1) = w(n) + x(n), so
w(2) = w(1) + x(1)
w(3) = w(2) + x(2)
...
w(n+1) = x(1) + ... + x(n)
-
Perceptron: Convergence Theorem
• Let w0 be such that w0T x(n) > 0 x(n) C1. w0 exists because C1 and C2 are linearly separable.
• Let = min w0T x(n) | x(n) C1.Let min w0 x(n) | x(n) C1.
• Then, w0T w(n+1) = w0T x(1) + … + w0T x(n) n
• Cauchy-Schwarz inequality:||w0||2 ||w(n+1)||2 [w0T w(n+1)]2
||w(n+1)||2 (A)n2 2
||w0|| 2
-
Perceptron: Convergence Theorem
• Now we consider another route:
w(k+1) = w(k) + x(k)
||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2 w^T(k) x(k)   (Euclidean norm)
The cross term is ≤ 0 because x(k) is misclassified, so
||w(k+1)||² ≤ ||w(k)||² + ||x(k)||²,   k = 1, ..., n
• With w(1) = 0:
||w(2)||² ≤ ||w(1)||² + ||x(1)||²
||w(3)||² ≤ ||w(2)||² + ||x(2)||²
...
||w(n+1)||² ≤ Σ_{k=1}^{n} ||x(k)||²
-
Perceptron: Convergence Theorem
• Let β = max { ||x(n)||² : x(n) ∈ C1 }.
• Then ||w(n+1)||² ≤ nβ   (B)
• For sufficiently large values of n, (B) comes into conflict with (A). Hence n cannot be greater than the nmax for which (A) and (B) are both satisfied with the equality sign.
• The perceptron convergence algorithm terminates in at most

nmax = β ||w0||² / α²

iterations.
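Combining (A) and (B) makes the bound explicit (a LaTeX restatement of the slide's argument):

```latex
\frac{n^{2}\alpha^{2}}{\lVert \mathbf{w}_{0}\rVert^{2}}
\le \lVert \mathbf{w}(n+1)\rVert^{2} \le n\beta
\quad\Longrightarrow\quad
n \le \frac{\beta\,\lVert \mathbf{w}_{0}\rVert^{2}}{\alpha^{2}} = n_{\max}.
```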
-
Fixed-increment convergence theorem
• Let the subsets of training vectors C1 and C2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that
w(n0) = w(n0+1) = w(n0+2) = ...
• is a solution vector for n0 ≤ nmax.
-
Learning-Rate Annealing Schedules
• When the learning rate is large, the trajectory may follow a zigzagging path
• When it is small, the procedure may be slow
• Simplest choice is a constant learning-rate parameter:
η(n) = η0 > 0
• Stochastic approximation
▫ time-varying: η(n) = c/n   (c is a constant)
▫ when c is large, danger of parameter blowup for small n
-
Learning-Rate Annealing Schedules
• Search-then-converge schedule:
η(n) = η0 / (1 + n/τ)   (η0, τ are constants)
▫ in the early stages, the learning-rate parameter is approximately equal to η0
▫ for a number of iterations n large compared to the search time constant τ, the learning-rate parameter approximates c/n
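The three schedules, side by side, as a small Python sketch (the default values are illustrative, not from the slides):

```python
def eta_constant(n, eta0=0.1):
    """Simplest schedule: eta(n) = eta0 for all n."""
    return eta0

def eta_stochastic(n, c=1.0):
    """Stochastic approximation: eta(n) = c / n; a large c risks blowup for small n."""
    return c / n

def eta_search_then_converge(n, eta0=0.1, tau=100.0):
    """Search-then-converge: eta(n) = eta0 / (1 + n/tau).
    Roughly eta0 in the early stages; behaves like (eta0 * tau) / n for n >> tau."""
    return eta0 / (1.0 + n / tau)
```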
-
Summary
1. Initialization
▫ set w(0) = 0
2. Activation
▫ at time step n, activate the perceptron by applying the continuous-valued input vector x(n) and desired response d(n)
3. Computation of actual response
▫ y(n) = sgn[w^T(n) x(n)]
4. Adaptation of weight vector
▫ w(n+1) = w(n) + η[d(n) − y(n)] x(n)
where d(n) = +1 if x(n) belongs to class C1, and d(n) = −1 if x(n) belongs to class C2
5. Continuation
▫ increment time step n and go back to step 2
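Steps 1-5 translate directly into a loop. In this sketch (names mine) the stopping rule, a full pass with no misclassification, is a common choice rather than something the slide prescribes:

```python
import numpy as np

def perceptron_train(X, d, eta=1.0, max_steps=10000):
    """Perceptron convergence algorithm, following steps 1-5 of the summary."""
    w = np.zeros(X.shape[1])                    # 1. Initialization: w(0) = 0
    n, correct_streak = 0, 0
    while n < max_steps and correct_streak < len(X):
        x, dn = X[n % len(X)], d[n % len(X)]    # 2. Activation: apply x(n) and d(n)
        y = 1.0 if np.dot(w, x) > 0 else -1.0   # 3. Actual response y(n) = sgn(w^T(n) x(n))
        w += eta * (dn - y) * x                 # 4. Adaptation of the weight vector
        correct_streak = correct_streak + 1 if y == dn else 0
        n += 1                                  # 5. Continuation: increment n, go to step 2
    return w
```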
-
Perceptron Learning Example

[Figure: a perceptron unit with bias input x0 = -1 (weight w0) and inputs x1, x2, ..., xn with weights w1, w2, ..., wn, producing output y.]

a = Σ_{i=0}^{n} w_i x_i
y = 1 if a > 0, y = 0 otherwise
-
Perceptron Learning Example

[Animation frames: a perceptron with initial weights w = [0.25, -0.1, 0.5] (decision line x2 = 0.2 x1 - 0.5) is shown examples labeled t = +1 or t = -1. Misclassified points trigger weight updates, e.g. (x,t) = ([-1,-1], 1) gives o = sgn(0.25 + 0.1 - 0.5) = -1 ≠ t and (x,t) = ([1,1], 1) gives o = sgn(0.25 - 0.7 + 0.1) = -1 ≠ t, while (x,t) = ([2,1], 1) gives o = sgn(0.45 - 0.6 + 0.3) = +1 = t; the weight vector passes through values such as [0.2, -0.2, -0.2], [-0.2, -0.4, -0.2], and [0.2, 0.2, 0.2] as the boundary moves.]
-
Adaline: Adaptive Linear Element
• The output y is a linear combination of the inputs x:

y(n) = Σ_{j=0}^{m} x_j(n) w_j(n)

[Figure: a linear neuron with inputs x1, x2, ..., xm and weights w1, w2, ..., wm producing output y.]
-
Adaline: Adaptive Linear Element
• Adaline uses a linear neuron model and the Least-Mean-Square (LMS) learning algorithm
• The idea: try to minimize the squared error, which is a function of the weights:

E(w(n)) = (1/2) e²(n)
e(n) = d(n) − Σ_{j=0}^{m} x_j(n) w_j(n)

• We can find the minimum of the error function E by means of the steepest descent method
-
Steepest Descent Method
• start with an arbitrary point
• find a direction in which E is decreasing most rapidly:
−∇E(w) = −[∂E/∂w1, ..., ∂E/∂wm]
• make a small step in that direction:
w(n+1) = w(n) − η ∇E(w(n))
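For E = ½ e²(n) with a linear neuron, the gradient with respect to w is −e(n) x(n), so the steepest-descent step is exactly the LMS update; a minimal sketch (names mine):

```python
import numpy as np

def lms_step(w, x, d, eta=0.01):
    """One steepest-descent step on E(w) = 0.5 * e^2 for a linear neuron (LMS rule)."""
    e = d - np.dot(w, x)     # e(n) = d(n) - y(n), with linear output y(n) = w.x
    return w + eta * e * x   # w(n+1) = w(n) - eta * grad E = w(n) + eta * e(n) * x(n)
```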
-
Comparison of Perceptron and Adaline

                      Perceptron                  Adaline
Architecture          Single-layer                Single-layer
Neuron model          Non-linear                  Linear
Learning algorithm    Minimize number of          Minimize total
                      misclassified examples      squared error
Application           Linear classification       Linear classification,
                                                  regression
-
Reading
• S. Haykin, Neural Networks: A Comprehensive Foundation, 2007 (Chapter 4).