TRANSCRIPT
Linear Discriminant Functions
Wen-Hung Liao, 11/25/2008
Introduction: LDF
Assume we know the proper form of the discriminant functions, instead of the underlying probability densities. Use samples to estimate the parameters of the classifier (statistical or non-statistical). We will be concerned with discriminant functions that are either linear in the components of x, or linear in some given set of functions of x.
Why LDF?
Simplicity vs. accuracy
Attractive candidates for initial, trial classifiers
Related to neural networks
Approach
Find the LDF by minimizing a criterion function, using a gradient descent procedure for the minimization. Issues to consider: the convergence properties of the procedure, and its computational complexity.
Example of a criterion function: sample risk, or training error. (Not appropriate. Why? Because a small training error does not guarantee a small test error.)
LDF and Decision Surfaces
A linear discriminant function:
g(x) = w^t x + w0
where w is the weight vector and w0 is the bias or threshold.
Two-Category Case
Decision rule: decide w1 if g(x) > 0; decide w2 if g(x) < 0.
In other words, x is assigned to w1 if the inner product w^t x exceeds the threshold -w0.
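The two-category rule above can be sketched in a few lines of Python (using numpy). The weight vector and bias here are made-up values for illustration, not taken from the slides:

```python
import numpy as np

# Hypothetical 2-D example: w and w0 are chosen arbitrarily for illustration.
w = np.array([1.0, 2.0])   # weight vector
w0 = -4.0                  # bias / threshold

def g(x):
    """Linear discriminant g(x) = w^t x + w0."""
    return w @ x + w0

def decide(x):
    """Decide w1 if g(x) > 0, otherwise w2."""
    return "w1" if g(x) > 0 else "w2"

print(decide(np.array([3.0, 3.0])))  # g = 3 + 6 - 4 = 5 > 0  -> w1
print(decide(np.array([0.0, 1.0])))  # g = 0 + 2 - 4 = -2 < 0 -> w2
```

Equivalently, the test g(x) > 0 is w^t x > -w0, i.e. the inner product exceeding the threshold.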
Decision Boundary
A hyperplane H is defined by g(x) = 0. If x1 and x2 are both on the decision surface, then:
w^t x1 + w0 = w^t x2 + w0
w^t (x1 - x2) = 0
so w is normal to any vector lying on the hyperplane.
Distance Measure
For any x, write
x = x_p + r (w / ||w||)
where x_p is the normal projection of x onto H, and r is the algebraic distance. Since g(x_p) = 0,
g(x) = w^t (x_p + r w/||w||) + w0 = r ||w||
so
r = g(x) / ||w||
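The distance formula r = g(x)/||w|| is easy to check numerically. A minimal sketch with a made-up hyperplane (w and w0 are illustrative values):

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weight vector, ||w|| = 5
w0 = -10.0

def g(x):
    return w @ x + w0

def distance_to_hyperplane(x):
    """Algebraic distance r = g(x) / ||w||; the sign tells which side of H."""
    return g(x) / np.linalg.norm(w)

x = np.array([2.0, 2.0])
r = distance_to_hyperplane(x)          # g = 6 + 8 - 10 = 4, so r = 4/5 = 0.8
x_p = x - r * w / np.linalg.norm(w)    # normal projection of x onto H
print(r, g(x_p))                       # g(x_p) is 0: x_p lies on H
```

Subtracting r·(w/||w||) recovers the projection x_p, confirming x = x_p + r(w/||w||).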
Multi-category Case
General case:
Reduce the problem to c-1 two-class problems, or use c(c-1)/2 linear discriminants, one for every pair of classes.
Use c linear discriminants:
g_i(x) = w_i^t x + w_i0,  i = 1, ..., c
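With c linear discriminants, a common choice is to assign x to the class with the largest g_i(x) (a linear machine). A minimal sketch; the weight matrix and biases below are invented for illustration:

```python
import numpy as np

# Illustrative c = 3 linear machine: row i of W is w_i, and b[i] is w_i0.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])

def classify(x):
    """Assign x to the class with the largest g_i(x) = w_i^t x + w_i0."""
    scores = W @ x + b
    return int(np.argmax(scores))

print(classify(np.array([2.0, 0.5])))  # scores = [2.0, 0.5, -2.0] -> class 0
```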
Distance Measure
w_i - w_j is normal to H_ij. The distance from x to H_ij is given by:
r_ij = (g_i(x) - g_j(x)) / ||w_i - w_j||
Quadratic DF
Add terms involving products of pairs of components of x to obtain the quadratic discriminant function:
g(x) = w0 + Σ_{i=1}^{d} w_i x_i + Σ_{i=1}^{d} Σ_{j=1}^{d} w_ij x_i x_j
The separating surface defined by g(x) = 0 is a hyperquadric surface.
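The double sum over w_ij is just a quadratic form x^t W x, so g(x) is cheap to evaluate. A minimal sketch, with coefficients chosen (as an assumption, not from the slides) so that W = I and the surface is a unit hypersphere:

```python
import numpy as np

# Hypothetical quadratic discriminant g(x) = w0 + w^t x + x^t W x.
w0 = -1.0
w  = np.array([0.0, 0.0])
W  = np.eye(2)            # W = I gives a hypersphere: g(x) = ||x||^2 - 1

def g(x):
    """Quadratic discriminant: bias + linear term + quadratic form."""
    return w0 + w @ x + x @ W @ x

# Points with ||x|| = 1 lie on the separating surface g(x) = 0.
print(g(np.array([1.0, 0.0])))   # 0.0 (on the surface)
print(g(np.array([0.0, 2.0])))   # 3.0 (outside)
```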
Hyperquadric Surfaces
If W = [w_ij] is not singular, then the linear terms in g(x) can be eliminated by translating the axes. Define a scaled matrix:
W' = W / ((1/4) w^t W^{-1} w - w0)
Depending on W', the separating surface can be a hypersphere, a hyperellipsoid, or a hyperhyperboloid.
Generalized LDF
Polynomial discriminant functions. Generalized LDF:
g(x) = Σ_{i=1}^{d̂} a_i y_i(x)
where a is a d̂-dimensional weight vector and the y_i(x) are arbitrary functions of x.
Augmented Vectors
Augmented feature vector: y = [1, x1, ..., xd]^t
Augmented weight vector: a = [w0, w1, ..., wd]^t
Mapping the d-dimensional x-space to a (d+1)-dimensional y-space:
g(x) = w0 + Σ_{i=1}^{d} w_i x_i = Σ_{i=0}^{d} w_i x_i = a^t y, where x0 = 1.
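Augmentation folds the bias into a single dot product, so g(x) = a^t y. A minimal sketch (the weight values are made up for illustration):

```python
import numpy as np

def augment(x):
    """Map d-dimensional x to (d+1)-dimensional y = [1, x1, ..., xd]."""
    return np.concatenate(([1.0], x))

# Illustrative weights: a = [w0, w1, ..., wd] bundles bias and weight vector.
a = np.array([-4.0, 1.0, 2.0])   # w0 = -4, w = [1, 2]
x = np.array([3.0, 3.0])
y = augment(x)
print(a @ y)   # a^t y = -4 + 3 + 6 = 5, identical to w^t x + w0
```

The payoff is notational: every hyperplane in x-space becomes a hyperplane through the origin in y-space.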
2-Category Separable Case
Look for a weight vector that classifies all of the samples correctly. If such a weight vector exists, the samples are said to be linearly separable.
Gradient Descent Procedure
Define a criterion function J(a) that is minimized if a is a solution vector.
Step 1: Randomly pick a(1), and compute the gradient vector ∇J(a(1)).
Step 2: a(2) is obtained by moving some distance from a(1) in the direction of steepest descent. In general:
a(k+1) = a(k) - η(k) ∇J(a(k))
where η(k) is a positive scale factor (the learning rate).
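The update rule above can be sketched as a simple loop. The criterion below, J(a) = ||a - a*||², is an assumed stand-in chosen so the minimizer a* is known; it is not the criterion used later in the slides:

```python
import numpy as np

# Illustrative quadratic criterion J(a) = ||a - a_star||^2 with known minimizer.
a_star = np.array([2.0, -1.0])

def grad_J(a):
    """Gradient of J(a) = ||a - a_star||^2."""
    return 2.0 * (a - a_star)

a = np.zeros(2)              # a(1): arbitrary starting point
eta = 0.1                    # fixed learning rate eta(k)
for k in range(200):
    a = a - eta * grad_J(a)  # a(k+1) = a(k) - eta(k) * grad J(a(k))

print(np.round(a, 4))        # converges toward a_star = [2, -1]
```

With a fixed step size the iterates contract geometrically toward the minimizer on this quadratic criterion.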
Setting the Learning Rate
Second-order expansion of J(a) around a(k):
J(a) ≃ J(a(k)) + ∇J^t (a - a(k)) + (1/2) (a - a(k))^t H (a - a(k))
where H is the Hessian matrix of second partial derivatives h_ij = ∂²J/∂a_i∂a_j, evaluated at a(k).
Substituting a(k+1) = a(k) - η(k) ∇J:
J(a(k+1)) ≃ J(a(k)) - η(k) ||∇J||² + (1/2) η²(k) ∇J^t H ∇J
This is minimized when:
η(k) = ||∇J||² / (∇J^t H ∇J)
Newton Descent
For nonsingular H, use the update rule:
a(k+1) = a(k) - H^{-1} ∇J
Newton's rule converges in fewer steps, but each step is more expensive to compute.
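On a quadratic criterion the Newton rule reaches the minimizer in one step, since the second-order expansion is exact. A sketch with an assumed quadratic J(a) = (1/2) a^t H a - b^t a (so ∇J = H a - b and the Hessian is the constant H; the values are made up):

```python
import numpy as np

# Illustrative quadratic criterion: grad J = H a - b, Hessian = H (constant).
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

a = np.zeros(2)
grad = H @ a - b
a = a - np.linalg.inv(H) @ grad   # a(k+1) = a(k) - H^{-1} grad J

print(np.round(H @ a - b, 8))     # gradient is zero after one Newton step
```

The trade-off named in the slide is visible here: the step requires solving a linear system in H (here via an explicit inverse), which costs far more than a gradient step.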
Perceptron Criterion Function
The perceptron criterion function:
J_p(a) = Σ_{y ∈ Y(a)} (-a^t y)
where Y(a) is the set of samples misclassified by a. Since
∇J_p = Σ_{y ∈ Y(a)} (-y)
the update rule is:
a(k+1) = a(k) + η(k) Σ_{y ∈ Y_k} y
where Y_k is the set of samples misclassified by a(k).
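The batch perceptron rule above can be sketched as follows. The four samples are invented for illustration: they are already augmented, and the class-w2 samples are negated ("normalized"), so a solution vector a satisfies a^t y > 0 for every y:

```python
import numpy as np

# Illustrative linearly separable set: augmented samples, w2 rows negated.
Y = np.array([[ 1.0, 2.0, 1.0],    # class w1, augmented
              [ 1.0, 1.0, 2.0],    # class w1, augmented
              [-1.0, 1.0, 1.0],    # class w2, augmented and negated
              [-1.0, 2.0, 0.0]])   # class w2, augmented and negated

def misclassified(a, Y):
    """Rows y with a^t y <= 0, i.e. the set Y_k of misclassified samples."""
    return Y[Y @ a <= 0]

a = np.zeros(3)
eta = 1.0                          # fixed learning rate eta(k)
for k in range(100):
    Yk = misclassified(a, Y)
    if len(Yk) == 0:               # no misclassified samples: done
        break
    a = a + eta * Yk.sum(axis=0)   # a(k+1) = a(k) + eta * sum of y in Y_k

print(a, len(misclassified(a, Y)))  # a solution vector; 0 misclassified
```

On a linearly separable set like this one, the loop terminates with every sample on the positive side, which is what the convergence proof cited below guarantees.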
Convergence Proof
Refer to pages 229 to 232 of the textbook.