Lecture 11: Single Layer Perceptrons


TRANSCRIPT

  • In the Name of God

    Lecture 11: Single Layer Perceptrons

  • Perceptron: architecture

    • We consider the architecture: a feed-forward NN with one layer.
    • It is sufficient to study single layer perceptrons with just one neuron.

  • Single layer perceptrons

    • Generalization to single layer perceptrons with more neurons is easy because:
      ▫ The output units are independent of each other.
      ▫ Each weight only affects one of the outputs.

  • Perceptron: Neuron Model

    • The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation function, the sign function.

  • Perceptron

    • The perceptron is based on a non-linear neuron.
    • x(n) = [+1, x1(n), x2(n), ..., xm(n)]^T
    • w(n) = [b(n), w1(n), w2(n), ..., wm(n)]^T
    • v(n) = w^T(n) x(n) defines a hyperplane:

      $v = \sum_{i=1}^{m} w_i x_i + b$

      $y = \operatorname{sign}\left(b + \sum_{i=1}^{m} w_i x_i\right)$
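    As a concrete illustration of this neuron model (an editorial sketch, not from the slides; the numbers are made up), the following computes v and y for an augmented input x = [+1, x1, ..., xm] with the bias folded into w:

      import numpy as np

      def perceptron_output(w, x):
          """Compute the induced local field v = w^T x and the sign output y."""
          v = np.dot(w, x)          # v = b + sum_i w_i * x_i  (bias b = w[0], since x[0] = +1)
          y = 1 if v > 0 else -1    # sign activation (convention used here: sign(0) -> -1)
          return v, y

      # Example with made-up numbers: w = [b, w1, w2], x = [+1, x1, x2]
      w = np.array([-0.5, 1.0, 1.0])
      x = np.array([+1.0, 0.2, 0.9])
      print(perceptron_output(w, x))   # v = 0.6 > 0, so y = +1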

  • Perceptron: Applications

    • The perceptron is used for classification: classify correctly a set of examples into one of two classes, C1 and C2:
      ▫ If the output of the perceptron is +1, then the input is assigned to class C1.
      ▫ If the output of the perceptron is -1, then the input is assigned to class C2.

  • Perceptron Training

    • How can we train a perceptron for a classification task?
    • We try to find suitable values for the weights in such a way that the training examples are correctly classified.
    • Geometrically, we try to find a hyper-plane that separates the examples of the two classes.

  • Perceptron: Classification

    • The equation below describes a hyperplane in the input space. This hyperplane is used to separate the two classes C1 and C2:

      $\sum_{i=1}^{m} w_i x_i + b = 0$

    [Figure: the (x1, x2) plane divided by the decision boundary w1x1 + w2x2 + b = 0 into the decision region for C1 (where w1x1 + w2x2 + b > 0) and the decision region for C2]
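    As an illustrative aside (the weights w1 = w2 = 1 and b = -1 below are made-up values, not from the slides), this small sketch assigns a 2-D point to C1 or C2 according to the sign of w1x1 + w2x2 + b:

      def decision_region(x1, x2, w1=1.0, w2=1.0, b=-1.0):
          """Return the class of (x1, x2) relative to the hyperplane w1*x1 + w2*x2 + b = 0."""
          v = w1 * x1 + w2 * x2 + b
          return "C1" if v > 0 else "C2"   # v > 0: decision region for C1

      # Two points on opposite sides of the line x1 + x2 = 1 (illustrative weights only)
      print(decision_region(2.0, 1.0))   # v = 2.0 > 0   -> C1
      print(decision_region(0.0, 0.0))   # v = -1.0 <= 0 -> C2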

  • Example: AND

    • Here is a representation of the AND function.
    • White means false, black means true for the output.
    • -1 means false, +1 means true for the input.
    • A linear decision surface separates false from true instances (one such surface is sketched after the truth table).

      -1 AND -1 = false
      -1 AND +1 = false
      +1 AND -1 = false
      +1 AND +1 = true
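    As an illustrative aside (these weights are hand-picked for demonstration, not given in the slides), one linear decision surface that realizes AND under the ±1 encoding is b = -0.5, w1 = w2 = 0.5:

      import numpy as np

      # One possible (hand-picked) weight vector for AND with +/-1 inputs:
      # y = sign(b + w1*x1 + w2*x2) with b = -0.5, w1 = w2 = 0.5
      w = np.array([-0.5, 0.5, 0.5])   # [b, w1, w2]

      for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
          v = w @ np.array([1.0, x1, x2])
          y = 1 if v > 0 else -1
          print(f"{x1:+d} AND {x2:+d} -> {y:+d}")   # only (+1, +1) gives +1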

  • Example: AND

    • Watch a perceptron learn the AND function:


  • Example: XOR

    • Here is the XOR function:

      -1 XOR -1 = false
      -1 XOR +1 = true
      +1 XOR -1 = true
      +1 XOR +1 = false

    • Perceptrons cannot learn such linearly inseparable functions.

  • Example: XOR

    • Watch a perceptron try to learn XOR

  • Perceptron learning

    • If the n-th member of the training set, x(n), is correctly classified by the weight vector w(n) computed at the n-th iteration of the algorithm, no correction is made to the weight vector, i.e.:
      ▫ w(n+1) = w(n) if w^T(n)x(n) > 0 and x(n) belongs to class C1
      ▫ w(n+1) = w(n) if w^T(n)x(n) ≤ 0 and x(n) belongs to class C2
    • Otherwise, the weight vector of the perceptron is updated:
      ▫ w(n+1) = w(n) - η(n)x(n) if w^T(n)x(n) > 0 and x(n) belongs to class C2
      ▫ w(n+1) = w(n) + η(n)x(n) if w^T(n)x(n) ≤ 0 and x(n) belongs to class C1
    • The learning-rate parameter η(n) controls the adjustment applied to the weight vector at iteration n.

  • Perceptron learning

    • Or, simply:

      w(n+1) = w(n) + η(n) e(n) x(n)

    • where

      e(n) = d(n) − y(n)   [the error]

    • The perceptron uses the McCulloch-Pitts formal model of a neuron.
    • The learning process is performed for a finite number of iterations and then stops.

  • Learning algorithm

    • Epoch: presentation of the entire training set to the neural network. In the case of the AND function, an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]).
    • Error: the error value is the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it outputs a 1, then Error = -1.

  • Decision boundary

    • In simple cases, divide feature space by drawing a hyperplane across it.
    • Known as a decision boundary.
    • Discriminant function: returns different values on opposite sides (a straight line in two dimensions).
    • Problems which can be classified in this way are linearly separable.

    [Figure: two scatter plots of + and - examples in the (x1, x2) plane; in the first the classes can be separated by a straight line (linearly separable), in the second they cannot (non-linearly separable)]

  • Example

    • Consider the 2-dimensional training set C1 ∪ C2:
      ▫ C1 = {(1, 1, 1), (1, 1, -1), (1, 0, -1)} with class label 1
      ▫ C2 = {(1, -1, -1), (1, -1, 1), (1, 0, 1)} with class label 0
    • Train a perceptron on C1 ∪ C2. The first pass is as follows:

      Input         Weight       Desired  Actual  Update?  New weight
      (1, 1, 1)     (1, 0, 0)    1        1       No       (1, 0, 0)
      (1, 1, -1)    (1, 0, 0)    1        1       No       (1, 0, 0)
      (1, 0, -1)    (1, 0, 0)    1        1       No       (1, 0, 0)
      (1, -1, -1)   (1, 0, 0)    0        1       Yes      (0, 1, 1)
      (1, -1, 1)    (0, 1, 1)    0        0       No       (0, 1, 1)
      (1, 0, 1)     (0, 1, 1)    0        1       Yes      (-1, 1, 0)

  • Example

    C1 = {(1, 1, 1), (1, 1, -1), (1, 0, -1)}   C2 = {(1, -1, -1), (1, -1, 1), (1, 0, 1)}

    Fill out this table sequentially (second pass):

      Input         Weight       Desired  Actual  Update?  New weight
      (1, 1, 1)     (-1, 1, 0)   1        0       Yes      (0, 2, 1)
      (1, 1, -1)    (0, 2, 1)    1        1       No       (0, 2, 1)
      (1, 0, -1)    (0, 2, 1)    1        0       Yes      (1, 2, 0)
      (1, -1, -1)   (1, 2, 0)    0        0       No       (1, 2, 0)
      (1, -1, 1)    (1, 2, 0)    0        0       No       (1, 2, 0)
      (1, 0, 1)     (1, 2, 0)    0        1       Yes      (0, 2, -1)

  • Example

    C1 = {(1, 1, 1), (1, 1, -1), (1, 0, -1)}   C2 = {(1, -1, -1), (1, -1, 1), (1, 0, 1)}

    Fill out this table sequentially (third pass):

      Input         Weight       Desired  Actual  Update?  New weight
      (1, 1, 1)     (0, 2, -1)   1        1       No       (0, 2, -1)
      (1, 1, -1)    (0, 2, -1)   1        1       No       (0, 2, -1)
      (1, 0, -1)    (0, 2, -1)   1        1       No       (0, 2, -1)
      (1, -1, -1)   (0, 2, -1)   0        0       No       (0, 2, -1)
      (1, -1, 1)    (0, 2, -1)   0        0       No       (0, 2, -1)
      (1, 0, 1)     (0, 2, -1)   0        0       No       (0, 2, -1)

    At epoch 3 no weight changes occur, so execution of the algorithm stops.
    Final weight vector: (0, 2, -1). The decision hyperplane is 2x1 - x2 = 0.
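    The three passes above can be reproduced with a short script (an editorial sketch, not part of the slides), assuming the convention implied by the tables: output y = 1 if w·x > 0 and 0 otherwise, update w ← w + (d − y)·x with learning rate 1, and initial weight (1, 0, 0) taken from the first table row:

      import numpy as np

      # Training set from the example: inputs are already augmented with a leading +1 (bias input).
      X = np.array([[1, 1, 1], [1, 1, -1], [1, 0, -1],     # class C1, desired output 1
                    [1, -1, -1], [1, -1, 1], [1, 0, 1]])   # class C2, desired output 0
      d = np.array([1, 1, 1, 0, 0, 0])

      w = np.array([1.0, 0.0, 0.0])          # initial weight, from the first row of the table
      for epoch in range(1, 10):
          changed = False
          for x, target in zip(X, d):
              y = 1 if w @ x > 0 else 0      # actual response
              if y != target:
                  w = w + (target - y) * x   # error-correction update with eta = 1
                  changed = True
          print(f"epoch {epoch}: w = {w}")
          if not changed:                    # stop when a full pass makes no change
              break
      # Reaches w = (0, 2, -1), i.e. the decision hyperplane 2*x1 - x2 = 0.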

  • Example: final results

    [Figure: the training points of C1 (+) and C2 (-) in the (x1, x2) plane, together with the final decision boundary 2x1 - x2 = 0 and the weight vector w]

  • Some Unhappiness About Perceptron Training

    • The perceptron learning rule fails to converge if the examples are not linearly separable.
      ▫ It can only model linearly separable classes, like (those described by) the following Boolean functions: AND, OR, but not XOR.
    • When a perceptron gives the right answer, no learning takes place.
    • Anything below the threshold is interpreted as "no", even if it is just below the threshold.
      ▫ Might it be better to train the neuron based on how far below the threshold it is?

  • Perceptron: Convergence Theorem

    Suppose the datasets C1 and C2 are linearly separable. The perceptron learning algorithm then converges after n0 iterations, with n0 ≤ nmax, on the training set C1 ∪ C2.

    Proof:
    • Suppose x ∈ C1 ⇒ output = 1, and x ∈ C2 ⇒ output = -1.
    • For simplicity assume w(1) = 0, η = 1.
    • Suppose the perceptron incorrectly classifies x(1), ..., x(n), ... ∈ C1, so that w^T(k) x(k) ≤ 0.
    • Error-correction rule: w(n+1) = w(n) + x(n), hence

      w(2) = w(1) + x(1)
      w(3) = w(2) + x(2)
      ...
      w(n+1) = x(1) + ... + x(n)

  • Perceptron: Convergence Theorem

    • Let w0 be such that w0^T x(n) > 0 for all x(n) ∈ C1. Such a w0 exists because C1 and C2 are linearly separable.
    • Let α = min { w0^T x(n) : x(n) ∈ C1 }.
    • Then w0^T w(n+1) = w0^T x(1) + ... + w0^T x(n) ≥ n α.
    • By the Cauchy-Schwarz inequality, ||w0||^2 ||w(n+1)||^2 ≥ [w0^T w(n+1)]^2, so

      $\|w(n+1)\|^2 \ge \dfrac{n^2 \alpha^2}{\|w_0\|^2}$   (A)

  • Perceptron: Convergence Theorem

    • Now we consider another route: w(k+1) = w(k) + x(k).
    • Taking the squared Euclidean norm:

      $\|w(k+1)\|^2 = \|w(k)\|^2 + \|x(k)\|^2 + 2\, w^T(k) x(k)$

    • The last term is ≤ 0 because x(k) is misclassified, so

      $\|w(k+1)\|^2 \le \|w(k)\|^2 + \|x(k)\|^2, \qquad k = 1, \ldots, n$

    • With w(1) = 0, summing the inequalities

      $\|w(2)\|^2 \le \|w(1)\|^2 + \|x(1)\|^2, \quad \|w(3)\|^2 \le \|w(2)\|^2 + \|x(2)\|^2, \ \ldots$

      gives

      $\|w(n+1)\|^2 \le \sum_{k=1}^{n} \|x(k)\|^2$

  • Perceptron: Convergence Theorem

    • Let β = max { ||x(n)||^2 : x(n) ∈ C1 }. Then

      $\|w(n+1)\|^2 \le n \beta$   (B)

    • For sufficiently large values of n, (B) comes into conflict with (A). Thus n cannot be greater than the value nmax for which (A) and (B) are both satisfied with the equality sign:

      $\dfrac{n_{\max}^2 \alpha^2}{\|w_0\|^2} = n_{\max} \beta$

    • The perceptron convergence algorithm therefore terminates in at most

      $n_{\max} = \dfrac{\beta\, \|w_0\|^2}{\alpha^2}$   iterations.
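    As a numeric illustration of this bound (an editorial calculation, not on the slides): take the worked example from earlier, fold class C2 into C1 by negating its samples (so that a single vector w0 with w0^T x > 0 for every sample certifies separability), and use the final weight vector (0, 2, -1) as w0. Then α = 1, β = 3, ||w0||^2 = 5, and the bound gives nmax = 15 weight corrections:

      import numpy as np

      # C1 as given; C2 samples are negated so that one vector w0 with w0.x > 0 for
      # every (possibly negated) sample certifies linear separability.
      C1 = np.array([[1, 1, 1], [1, 1, -1], [1, 0, -1]])
      C2 = np.array([[1, -1, -1], [1, -1, 1], [1, 0, 1]])
      samples = np.vstack([C1, -C2])

      w0 = np.array([0.0, 2.0, -1.0])           # the separating vector found earlier
      alpha = min(samples @ w0)                 # alpha = min w0^T x   (here 1.0)
      beta = max(np.sum(samples**2, axis=1))    # beta  = max ||x||^2  (here 3)
      n_max = beta * np.dot(w0, w0) / alpha**2  # bound on the number of corrections
      print(alpha, beta, n_max)                 # 1.0 3 15.0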

  • Fixed-increment convergence theorem

    • Let the subsets of training vectors C1 and C2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron then converges after some n0 iterations, in the sense that

      w(n0) = w(n0+1) = w(n0+2) = ...

      is a solution vector for n0 ≤ nmax.

  • Learning-Rate Annealing Schedules

    • When the learning rate is large, the trajectory may follow a zigzagging path.
    • When it is small, the procedure may be slow.
    • Simplest choice: a constant learning-rate parameter,

      η(n) = η0   (η0 a constant)

    • Stochastic approximation:
      ▫ time-varying, η(n) = c / n   (c a constant)
      ▫ when c is large, there is a danger of parameter blowup for small n.

  • Learning-Rate Annealing Schedules

    • Search-then-converge schedule:

      $\eta(n) = \dfrac{\eta_0}{1 + n/\tau}$   (η0, τ constants)

      ▫ In the early stages, when n is small compared to the search time constant τ, the learning-rate parameter is approximately equal to η0.
      ▫ For a number of iterations n large compared to the search time constant τ, the learning-rate parameter approximates c/n.
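    A small sketch comparing the three schedules (the values of η0, c and τ below are made up for illustration):

      # Illustrative values only: eta0, c and tau are not specified in the slides.
      eta0, c, tau = 0.5, 2.0, 100.0

      constant    = lambda n: eta0                  # eta(n) = eta0
      stochastic  = lambda n: c / n                 # eta(n) = c / n
      search_conv = lambda n: eta0 / (1 + n / tau)  # eta(n) = eta0 / (1 + n/tau)

      for n in (1, 10, 100, 1000):
          print(n, constant(n), stochastic(n), round(search_conv(n), 4))
      # search-then-converge stays near eta0 for n << tau and behaves like (eta0*tau)/n for n >> tau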

  • Summary

    1. Initialization
       ▫ Set w(0) = 0.
    2. Activation
       ▫ At time step n, activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n).
    3. Computation of actual response
       ▫ y(n) = sgn[w^T(n) x(n)]
    4. Adaptation of the weight vector
       ▫ w(n+1) = w(n) + η [d(n) − y(n)] x(n)
       ▫ where d(n) = +1 if x(n) belongs to class C1, and d(n) = −1 if x(n) belongs to class C2.
    5. Continuation
       ▫ Increment time step n and go back to step 2.

    A minimal code sketch of these five steps is given below.
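    This sketch is an illustration, not code from the slides; it assumes a constant learning rate η = 0.1 and the ±1 target convention above, and is demonstrated on the AND data from the earlier example:

      import numpy as np

      def train_perceptron(X, d, eta=0.1, epochs=100):
          """Steps 1-5: initialize w to 0, then repeatedly apply the sign unit and the
          error-correction update w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n)."""
          X = np.hstack([np.ones((len(X), 1)), X])   # prepend the fixed +1 bias input
          w = np.zeros(X.shape[1])                   # step 1: initialization
          for _ in range(epochs):                    # step 5: continuation
              for x, target in zip(X, d):            # step 2: activation
                  y = 1 if w @ x > 0 else -1         # step 3: actual response, sgn(w^T x)
                  w += eta * (target - y) * x        # step 4: adaptation of the weight vector
          return w

      # AND with the +/-1 encoding used earlier (learning rate 0.1 is an illustrative choice)
      X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
      d = np.array([-1, -1, -1, 1])
      w = train_perceptron(X, d)
      print(w, [1 if w @ np.r_[1, x] > 0 else -1 for x in X])   # classifies AND correctly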

  • Perceptron Learning Example

    [Figure: a single neuron with inputs x1, ..., xn weighted by w1, ..., wn, plus a bias input x0 with weight w0]

      $a = \sum_{i=0}^{n} w_i x_i$

      y = 1 if a > 0, and y = 0 otherwise

  • Perceptron Learning Example

    • Initial weights: w = [0.25, -0.1, 0.5], giving the decision boundary x2 = 0.2 x1 - 0.5.
    • (x, t) = ([-1, -1], 1):  o = sgn(0.25 + 0.1 - 0.5) = -1, misclassified; weight change [0.2, -0.2, -0.2]
    • (x, t) = ([2, 1], -1):   o = sgn(0.45 - 0.6 + 0.3) = 1, misclassified; weight change [-0.2, -0.4, -0.2]
    • (x, t) = ([1, 1], 1):    o = sgn(0.25 - 0.7 + 0.1) = -1, misclassified; weight change [0.2, 0.2, 0.2]

    [Figure: the three training points with targets t = 1, t = -1, t = 1 in the (x1, x2) plane, and the decision boundary after each update]
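    The three updates above can be checked with a short script (an editorial sketch, not from the slides). The learning rate η = 0.1 is inferred from the listed weight changes, and the update used is Δw = η (t − o) x with x augmented by a leading 1:

      import numpy as np

      sgn = lambda a: 1 if a > 0 else -1

      w = np.array([0.25, -0.1, 0.5])                # initial weights [w0, w1, w2]
      eta = 0.1                                      # inferred from the weight changes on the slide
      samples = [([-1, -1], 1), ([2, 1], -1), ([1, 1], 1)]

      for (x1, x2), t in samples:
          x = np.array([1.0, x1, x2])                # augmented input
          o = sgn(w @ x)                             # perceptron output
          dw = eta * (t - o) * x                     # weight change
          w = w + dw
          print(f"t={t:+d} o={o:+d} dw={np.round(dw, 2)} w={np.round(w, 2)}")
      # Expected weight changes: [0.2 -0.2 -0.2], [-0.2 -0.4 -0.2], [0.2 0.2 0.2]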

  • Adaline: Adaptive Linear Element

    • The output y is a linear combination of the inputs x:

      $y(n) = \sum_{j=0}^{m} w_j(n)\, x_j(n)$

    [Figure: a linear neuron with inputs x1, ..., xm, weights w1, ..., wm, and output y]

  • Adaline: Adaptive Linear Element

    • Adaline uses a linear neuron model and the Least-Mean-Square (LMS) learning algorithm.
    • The idea: try to minimize the square error, which is a function of the weights:

      $E(w(n)) = \tfrac{1}{2} e^2(n)$

      $e(n) = d(n) - \sum_{j=0}^{m} w_j(n)\, x_j(n)$

    • We can find the minimum of the error function E by means of the steepest descent method.

  • Steepest Descent Method

    • Start with an arbitrary point.
    • Find a direction in which E is decreasing most rapidly: the direction opposite to the gradient

      $\nabla E(w) = \left( \dfrac{\partial E}{\partial w_1}, \ldots, \dfrac{\partial E}{\partial w_m} \right)$

    • Make a small step in that direction:

      $w(n+1) = w(n) - \eta\, \nabla E(w(n))$
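    For the Adaline error E = ½ e² with e = d − w^T x, the gradient with respect to w is −e·x, so the steepest-descent step above becomes the LMS update w(n+1) = w(n) + η e(n) x(n). A minimal sketch (an illustration, not from the slides; the data and learning rate are made up):

      import numpy as np

      def lms_step(w, x, d, eta=0.05):
          """One LMS / steepest-descent step on E = 0.5 * e^2 with a linear neuron."""
          e = d - w @ x                # error of the linear output
          return w + eta * e * x       # w(n+1) = w(n) - eta * grad E = w(n) + eta * e * x

      # Made-up 1-D regression data: fit y = w0 + w1 * x1 to points on the line y = 2 * x1 + 1
      rng = np.random.default_rng(0)
      w = np.zeros(2)
      for _ in range(2000):
          x1 = rng.uniform(-1, 1)
          x = np.array([1.0, x1])      # augmented input [1, x1]
          w = lms_step(w, x, d=2 * x1 + 1)
      print(np.round(w, 3))            # approaches [1, 2]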

  • Comparison of Perceptron and Adaline

                           Perceptron                           Adaline
      Architecture         Single-layer                         Single-layer
      Neuron model         Non-linear                           Linear
      Learning algorithm   Minimize number of misclassified     Minimize total squared error
                           examples
      Application          Linear classification                Linear classification, regression

  • Reading

    • S. Haykin, Neural Networks: A Comprehensive Foundation, 2007 (Chapter 4).