TRANSCRIPT
Linear Discriminant Functions
Wen-Hung Liao, 11/25/2008
Introduction: LDF
Assume we know the proper form of the discriminant functions, instead of the underlying probability densities. Use samples to estimate the parameters of the classifier (statistical or non-statistical). We will be concerned with discriminant functions that are either linear in the components of x, or linear in some given set of functions of x.
Why LDF?
Simplicity vs. accuracy
Attractive candidates for initial, trial classifiers
Related to neural networks
Approach
Find the LDF by minimizing a criterion function, using a gradient descent procedure for the minimization. Issues to consider: the convergence properties of the procedure, and its computational complexity.
Example of a criterion function: sample risk, or training error. (Not appropriate. Why? Because a small training error does not guarantee a small test error.)
LDF and Decision Surfaces
A linear discriminant function:
g(x) = w^t x + w0
where w is the weight vector and w0 is the bias or threshold.
Two-Category Case
Decision rule: decide w1 if g(x) > 0; decide w2 if g(x) < 0.
In other words, x is assigned to w1 if the inner product w^t x exceeds the threshold -w0.
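The two-category rule above can be sketched in a few lines of Python (using numpy). The weight vector and bias here are made-up values for illustration, not taken from the slides:

```python
import numpy as np

# Hypothetical 2-D example: w and w0 are chosen arbitrarily for illustration.
w = np.array([1.0, 2.0])   # weight vector
w0 = -4.0                  # bias / threshold

def g(x):
    """Linear discriminant g(x) = w^t x + w0."""
    return w @ x + w0

def decide(x):
    """Decide w1 if g(x) > 0, otherwise w2."""
    return "w1" if g(x) > 0 else "w2"

print(decide(np.array([3.0, 3.0])))  # g = 3 + 6 - 4 = 5 > 0  -> w1
print(decide(np.array([0.0, 1.0])))  # g = 0 + 2 - 4 = -2 < 0 -> w2
```

Equivalently, the test g(x) > 0 is w^t x > -w0, i.e. the inner product exceeding the threshold.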
Decision Boundary
A hyperplane H is defined by g(x) = 0. If x1 and x2 are both on the decision surface, then:
w^t x1 + w0 = w^t x2 + w0
w^t (x1 - x2) = 0
so w is normal to any vector lying on the hyperplane.
Distance Measure
For any x, write
x = x_p + r (w / ||w||)
where x_p is the normal projection of x onto H, and r is the algebraic distance. Since g(x_p) = 0,
g(x) = w^t (x_p + r w/||w||) + w0 = r ||w||
so
r = g(x) / ||w||
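The distance formula r = g(x)/||w|| is easy to check numerically. A minimal sketch with a made-up hyperplane (w and w0 are illustrative values):

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weight vector, ||w|| = 5
w0 = -10.0

def g(x):
    return w @ x + w0

def distance_to_hyperplane(x):
    """Algebraic distance r = g(x) / ||w||; the sign tells which side of H."""
    return g(x) / np.linalg.norm(w)

x = np.array([2.0, 2.0])
r = distance_to_hyperplane(x)          # g = 6 + 8 - 10 = 4, so r = 4/5 = 0.8
x_p = x - r * w / np.linalg.norm(w)    # normal projection of x onto H
print(r, g(x_p))                       # g(x_p) is 0: x_p lies on H
```

Subtracting r·(w/||w||) recovers the projection x_p, confirming x = x_p + r(w/||w||).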
Multi-category Case
General case:
Reduce the problem to c-1 two-class problems, or use c(c-1)/2 linear discriminants, one for every pair of classes.
Use c linear discriminants:
g_i(x) = w_i^t x + w_i0,  i = 1, ..., c
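With c linear discriminants, a common choice is to assign x to the class with the largest g_i(x) (a linear machine). A minimal sketch; the weight matrix and biases below are invented for illustration:

```python
import numpy as np

# Illustrative c = 3 linear machine: row i of W is w_i, and b[i] is w_i0.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])

def classify(x):
    """Assign x to the class with the largest g_i(x) = w_i^t x + w_i0."""
    scores = W @ x + b
    return int(np.argmax(scores))

print(classify(np.array([2.0, 0.5])))  # scores = [2.0, 0.5, -2.0] -> class 0
```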
Distance Measure
w_i - w_j is normal to H_ij. The distance from x to H_ij is given by:
r_ij = (g_i(x) - g_j(x)) / ||w_i - w_j||
Quadratic DF
Add terms involving products of pairs of components of x to obtain the quadratic discriminant function:
g(x) = w0 + Σ_{i=1}^{d} w_i x_i + Σ_{i=1}^{d} Σ_{j=1}^{d} w_ij x_i x_j
The separating surface defined by g(x) = 0 is a hyperquadric surface.
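The double sum over w_ij is just a quadratic form x^t W x, so g(x) is cheap to evaluate. A minimal sketch, with coefficients chosen (as an assumption, not from the slides) so that W = I and the surface is a unit hypersphere:

```python
import numpy as np

# Hypothetical quadratic discriminant g(x) = w0 + w^t x + x^t W x.
w0 = -1.0
w  = np.array([0.0, 0.0])
W  = np.eye(2)            # W = I gives a hypersphere: g(x) = ||x||^2 - 1

def g(x):
    """Quadratic discriminant: bias + linear term + quadratic form."""
    return w0 + w @ x + x @ W @ x

# Points with ||x|| = 1 lie on the separating surface g(x) = 0.
print(g(np.array([1.0, 0.0])))   # 0.0 (on the surface)
print(g(np.array([0.0, 2.0])))   # 3.0 (outside)
```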
Hyperquadric Surfaces
If W = [w_ij] is not singular, then the linear terms in g(x) can be eliminated by translating the axes. Define a scaled matrix:
W' = W / ((1/4) w^t W^{-1} w - w0)
Depending on W', the separating surface can be a hypersphere, a hyperellipsoid, or a hyperhyperboloid.
Generalized LDF
Polynomial discriminant functions. Generalized LDF:
g(x) = Σ_{i=1}^{d̂} a_i y_i(x)
where a is a d̂-dimensional weight vector and the y_i(x) are arbitrary functions of x.
Augmented Vectors
Augmented feature vector: y = [1, x1, ..., xd]^t
Augmented weight vector: a = [w0, w1, ..., wd]^t
Mapping the d-dimensional x-space to a (d+1)-dimensional y-space:
g(x) = w0 + Σ_{i=1}^{d} w_i x_i = Σ_{i=0}^{d} w_i x_i = a^t y, where x0 = 1.
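Augmentation folds the bias into a single dot product, so g(x) = a^t y. A minimal sketch (the weight values are made up for illustration):

```python
import numpy as np

def augment(x):
    """Map d-dimensional x to (d+1)-dimensional y = [1, x1, ..., xd]."""
    return np.concatenate(([1.0], x))

# Illustrative weights: a = [w0, w1, ..., wd] bundles bias and weight vector.
a = np.array([-4.0, 1.0, 2.0])   # w0 = -4, w = [1, 2]
x = np.array([3.0, 3.0])
y = augment(x)
print(a @ y)   # a^t y = -4 + 3 + 6 = 5, identical to w^t x + w0
```

The payoff is notational: every hyperplane in x-space becomes a hyperplane through the origin in y-space.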
2-Category Separable Case
Look for a weight vector that classifies all of the samples correctly. If such a weight vector exists, the samples are said to be linearly separable.
Gradient Descent Procedure
Define a criterion function J(a) that is minimized if a is a solution vector.
Step 1: Randomly pick a(1), and compute the gradient vector ∇J(a(1)).
Step 2: a(2) is obtained by moving some distance from a(1) in the direction of steepest descent. In general:
a(k+1) = a(k) - η(k) ∇J(a(k))
where η(k) is a positive scale factor (the learning rate).
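The update rule above can be sketched as a simple loop. The criterion below, J(a) = ||a - a*||², is an assumed stand-in chosen so the minimizer a* is known; it is not the criterion used later in the slides:

```python
import numpy as np

# Illustrative quadratic criterion J(a) = ||a - a_star||^2 with known minimizer.
a_star = np.array([2.0, -1.0])

def grad_J(a):
    """Gradient of J(a) = ||a - a_star||^2."""
    return 2.0 * (a - a_star)

a = np.zeros(2)              # a(1): arbitrary starting point
eta = 0.1                    # fixed learning rate eta(k)
for k in range(200):
    a = a - eta * grad_J(a)  # a(k+1) = a(k) - eta(k) * grad J(a(k))

print(np.round(a, 4))        # converges toward a_star = [2, -1]
```

With a fixed step size the iterates contract geometrically toward the minimizer on this quadratic criterion.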
Setting the Learning Rate
Second-order expansion of J(a) around a(k):
J(a) ≃ J(a(k)) + ∇J^t (a - a(k)) + (1/2) (a - a(k))^t H (a - a(k))
where H is the Hessian matrix of second partial derivatives h_ij = ∂²J/∂a_i∂a_j, evaluated at a(k).
Substituting a(k+1) = a(k) - η(k) ∇J:
J(a(k+1)) ≃ J(a(k)) - η(k) ||∇J||² + (1/2) η²(k) ∇J^t H ∇J
This is minimized when:
η(k) = ||∇J||² / (∇J^t H ∇J)
Newton Descent
For nonsingular H, use the update rule:
a(k+1) = a(k) - H^{-1} ∇J
Newton's rule converges in fewer steps, but each step is more expensive to compute.
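On a quadratic criterion the Newton rule reaches the minimizer in one step, since the second-order expansion is exact. A sketch with an assumed quadratic J(a) = (1/2) a^t H a - b^t a (so ∇J = H a - b and the Hessian is the constant H; the values are made up):

```python
import numpy as np

# Illustrative quadratic criterion: grad J = H a - b, Hessian = H (constant).
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

a = np.zeros(2)
grad = H @ a - b
a = a - np.linalg.inv(H) @ grad   # a(k+1) = a(k) - H^{-1} grad J

print(np.round(H @ a - b, 8))     # gradient is zero after one Newton step
```

The trade-off named in the slide is visible here: the step requires solving a linear system in H (here via an explicit inverse), which costs far more than a gradient step.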
Perceptron Criterion Function
The perceptron criterion function:
J_p(a) = Σ_{y ∈ Y(a)} (-a^t y)
where Y(a) is the set of samples misclassified by a. Since
∇J_p = Σ_{y ∈ Y(a)} (-y)
the update rule is:
a(k+1) = a(k) + η(k) Σ_{y ∈ Y_k} y
where Y_k is the set of samples misclassified by a(k).
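The batch perceptron rule above can be sketched as follows. The four samples are invented for illustration: they are already augmented, and the class-w2 samples are negated ("normalized"), so a solution vector a satisfies a^t y > 0 for every y:

```python
import numpy as np

# Illustrative linearly separable set: augmented samples, w2 rows negated.
Y = np.array([[ 1.0, 2.0, 1.0],    # class w1, augmented
              [ 1.0, 1.0, 2.0],    # class w1, augmented
              [-1.0, 1.0, 1.0],    # class w2, augmented and negated
              [-1.0, 2.0, 0.0]])   # class w2, augmented and negated

def misclassified(a, Y):
    """Rows y with a^t y <= 0, i.e. the set Y_k of misclassified samples."""
    return Y[Y @ a <= 0]

a = np.zeros(3)
eta = 1.0                          # fixed learning rate eta(k)
for k in range(100):
    Yk = misclassified(a, Y)
    if len(Yk) == 0:               # no misclassified samples: done
        break
    a = a + eta * Yk.sum(axis=0)   # a(k+1) = a(k) + eta * sum of y in Y_k

print(a, len(misclassified(a, Y)))  # a solution vector; 0 misclassified
```

On a linearly separable set like this one, the loop terminates with every sample on the positive side, which is what the convergence proof cited below guarantees.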
Convergence Proof
Refer to pages 229 to 232 of the textbook.