Copyright © 2001, Andrew W. Moore
Support Vector Machines
Andrew W. Moore, Associate Professor
School of Computer Science, Carnegie Mellon University
Support Vector Machines: Slide 2
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Support Vector Machines: Slide 4
Linear Classifiers
[Diagram: input x → learning machine f → estimated output y_est]
f(x, w, b) = sign(w . x - b)
[Plot: 2-D datapoints; one marker denotes +1, the other denotes -1]
How would you classify this data?
A learning machine f takes an input x and transforms it, somehow using the weights w and bias b, into a predicted output y_est = ±1.
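To make f concrete, here is a minimal sketch in Python (NumPy assumed; the weights and query point are made-up values, not from the slides):

```python
import numpy as np

def f(x, w, b):
    """Linear classifier from the slide: sign(w . x - b), returning +1 or -1."""
    return 1 if np.dot(w, x) - b >= 0 else -1

# Made-up 2-D example
w = np.array([2.0, 1.0])
b = 1.0
print(f(np.array([1.0, 1.0]), w, b))  # w . x - b = 2.0, so prints 1
```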
Support Vector Machines: Slide 5
Linear Classifiers
f(x, w, b) = sign(w . x - b)
[Plot: the same data with several candidate separating lines]
The perceptron rule stops only when no misclassification remains; each mistake triggers the update Δw = η (t - y) x.
Any of these lines is possible.
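A minimal sketch of the perceptron rule just described, assuming targets t in {+1, -1}, a learning rate eta, and the made-up separable data below:

```python
import numpy as np

def perceptron(X, t, eta=1.0, max_epochs=100):
    """Apply dw = eta * (t - y) * x on each mistake; stop when none remain."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xk, tk in zip(X, t):
            y = 1 if np.dot(w, xk) - b >= 0 else -1
            if y != tk:
                w += eta * (tk - y) * xk   # the update rule from the slide
                b -= eta * (tk - y)        # bias moves the opposite way (f uses -b)
                mistakes += 1
        if mistakes == 0:                  # only stops once the data is separated
            break
    return w, b

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w, b = perceptron(X, t)
```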
Support Vector Machines: Slide 6
Linear Classifiers
f(x, w, b) = sign(w . x - b)
[Plot: a separating line found by least squares]
The LMS rule uses linear elements and minimizes the squared error E = Σk (tk - yk)².
Minimizing E does not necessarily guarantee that no misclassification remains.
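For contrast with the perceptron, a sketch of one LMS pass: a gradient step on E = Σ (tk - yk)², where y is the raw linear output rather than its sign (data and step size would be supplied by the caller; eta is a made-up default):

```python
import numpy as np

def lms_epoch(X, t, w, b, eta=0.01):
    """One pass of the LMS rule: gradient step on (tk - yk)^2 per example."""
    for xk, tk in zip(X, t):
        y = np.dot(w, xk) - b       # linear output, no sign()
        w += eta * (tk - y) * xk
        b -= eta * (tk - y)
    return w, b
```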
Support Vector Machines: Slide 7
Linear Classifiers
f(x, w, b) = sign(w . x - b)
[Plot: several candidate separating lines, each with zero training error]
If the objective is to assure zero sample error…
…any of these would be fine. But which is best?
Support Vector Machines: Slide 8
Classifier Margin
f(x, w, b) = sign(w . x - b)
[Plot: a separating line with its margin drawn as a band between the classes]
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
The margin of the separator is the width of separation between the classes.
Support Vector Machines: Slide 9
Maximum Margin
f(x, w, b) = sign(w . x - b)
[Plot: the maximum margin separator, with the support vectors lying on the margin]
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM, called a Linear SVM (LSVM).
Support Vectors are those datapoints that the margin pushes up against.
Support Vector Machines: Slide 10
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Support Vector Machines: Slide 11
Specifying a line and margin
[Diagram: a classifier boundary with a parallel Plus-Plane and Minus-Plane; a "Predict Class = +1" zone on one side and a "Predict Class = -1" zone on the other]
• How do we represent this mathematically?
• …in m input dimensions?
Support Vector Machines: Slide 12
Specifying a line and margin
[Diagram: the planes wx+b=1, wx+b=0, and wx+b=-1 between the "Predict Class = +1" and "Predict Class = -1" zones]
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Classify as…
+1 if w . x + b >= 1
-1 if w . x + b <= -1
Universe explodes if -1 < w . x + b < 1
Support Vector Machines: Slide 13
Computing the margin width
[Diagram: the planes wx+b=1, wx+b=0, and wx+b=-1; M = margin width]
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus-Plane. Why?
How do we compute M in terms of w and b?
Let u and v be two vectors on the Plus-Plane. What is w . (u - v)? Since w . u + b = 1 and w . v + b = 1, we get w . (u - v) = 0, so w is perpendicular to every direction lying within the Plus-Plane.
And so of course the vector w is also perpendicular to the Minus-Plane.
Support Vector Machines: Slide 14
Computing the margin width
The line from x- to x+ is perpendicular to the planes. So to get from x- to x+, travel some distance in the direction of w: x+ = x- + λw for some λ. Because w . x+ + b = 1 and w . x- + b = -1, it follows that λ = 2 / (w . w), and so the margin width is M = |x+ - x-| = λ |w| = 2 / sqrt(w . w).
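A quick numeric check of M = 2 / sqrt(w . w), stepping from the Minus-Plane to the Plus-Plane along w (the w and b values here are arbitrary):

```python
import numpy as np

w = np.array([3.0, 4.0])                 # arbitrary weight vector, |w| = 5
b = -2.0

M = 2.0 / np.sqrt(np.dot(w, w))          # margin width from the derivation

x_minus = ((-1 - b) / np.dot(w, w)) * w  # a point with w . x + b = -1
lam = 2.0 / np.dot(w, w)
x_plus = x_minus + lam * w               # lands on w . x + b = +1
assert np.isclose(np.linalg.norm(x_plus - x_minus), M)
```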
Support Vector Machines: Slide 17
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Support Vector Machines: Slide 18
Learning the Maximum Margin Classifier
[Diagram: the planes wx+b=1, wx+b=0, and wx+b=-1, with x- and x+; M = margin width = 2 / sqrt(w . w)]
Given a guess of w and b we can:
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?
• Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?
Support Vector Machines: Slide 19
Learning via Quadratic Programming
• QP is a well-studied class of optimization problems: optimize a quadratic function of some real-valued variables subject to linear constraints.
Support Vector Machines: Slide 21
Learning the Maximum Margin Classifier
[Diagram: the planes wx+b=1, wx+b=0, and wx+b=-1; M = 2 / sqrt(w . w)]
Q1: What is our quadratic optimization criterion?
The objective is to maximize M = 2 / sqrt(w . w) while ensuring all data points are in the correct half-planes; equivalently, minimize w . w.
Q2: What are the constraints in our QP? How many constraints will we have? R, one per datapoint. What should they be?
w . xk + b >= +1 if yk = +1
w . xk + b <= -1 if yk = -1
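To make the QP concrete, here is a sketch that hands exactly this criterion and these constraints (the two constraint forms combine into yk (w . xk + b) >= 1) to a generic constrained optimizer, using SciPy's SLSQP as a stand-in for a dedicated QP solver; the 2-D data are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable data, labels in {+1, -1}
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):
    w = z[:2]                     # z = [w1, w2, b]
    return np.dot(w, w)           # minimizing w . w maximizes M = 2/sqrt(w . w)

constraints = [                   # one constraint per datapoint: yk (w . xk + b) >= 1
    {"type": "ineq",
     "fun": lambda z, xk=xk, yk=yk: yk * (np.dot(z[:2], xk) + z[2]) - 1.0}
    for xk, yk in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
```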
Support Vector Machines: Slide 22
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Support Vector Machines: Slide 23
Uh-oh!
[Plot: data with +1 and -1 points that are not linearly separable]
This is going to be a problem! What should we do?
Idea: minimize w . w + C (distance of error points to their correct place), i.e. trade off margin width against training error.
Support Vector Machines: Slide 24
Learning Maximum Margin with Noise
[Diagram: the planes wx+b=1, wx+b=0, and wx+b=-1, with slack distances ε2, ε7, ε11 marked for points on the wrong side of their plane]
Our QP criterion: minimize (1/2) w . w + C Σk εk
• How many constraints? 2R
• What should they be?
w . xk + b >= 1 - εk if yk = +1
w . xk + b <= -1 + εk if yk = -1
εk >= 0 for all k
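The earlier optimizer sketch extended with slack variables, mirroring this slide's criterion and its 2R constraints (R margin constraints, plus εk >= 0 expressed here as bounds); the data and C are made up:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, 0.0], [1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])   # the last point is on the wrong side
R, C = len(X), 1.0

def objective(z):                 # z = [w1, w2, b, eps_1 .. eps_R]
    w, eps = z[:2], z[3:]
    return 0.5 * np.dot(w, w) + C * np.sum(eps)

constraints = [                   # yk (w . xk + b) >= 1 - eps_k, for k = 1..R
    {"type": "ineq",
     "fun": lambda z, k=k: y[k] * (np.dot(z[:2], X[k]) + z[2]) - 1.0 + z[3 + k]}
    for k in range(R)
]
bounds = [(None, None)] * 3 + [(0.0, None)] * R   # the other R constraints: eps_k >= 0

res = minimize(objective, x0=np.zeros(3 + R), method="SLSQP",
               constraints=constraints, bounds=bounds)
w, b, eps = res.x[:2], res.x[2], res.x[3:]
```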
Support Vector Machines: Slide 26
Controlling Soft Margin Separation
Support Vector Machines: Slide 27
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Support Vector Machines: Slide 28
Suppose we're in 1-dimension
What would SVMs do with this data?
[Plot: 1-D datapoints of both classes along the x-axis, with the origin at x = 0]
Support Vector Machines: Slide 29
Suppose we're in 1-dimension
Not a big surprise
[Plot: the same 1-D data with the maximum margin boundary; the positive "plane" and negative "plane" are just points either side of it]
Support Vector Machines: Slide 30
Harder 1-dimensional dataset
That's wiped the smirk off SVM's face. What can be done about this?
[Plot: 1-D data with the two classes interleaved around x = 0, so no single threshold separates them]
Support Vector Machines: Slide 31
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:
zk = (xk, xk²)
Support Vector Machines: Slide 32
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:
zk = (xk, xk²)
[Plot: after the mapping, the data are linearly separable in the (x, x²) plane]
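A sketch of that mapping on made-up 1-D data: inseparable on the line, separable after zk = (xk, xk²):

```python
import numpy as np

x = np.array([-3.0, -2.0, 2.0, 3.0, -0.5, 0.0, 0.5])  # made-up 1-D inputs
t = np.array([1, 1, 1, 1, -1, -1, -1])                 # the +1s surround the -1s

z = np.column_stack([x, x ** 2])   # z_k = (x_k, x_k^2)
# In the new space the horizontal line z2 = 2 separates the classes:
print(all((z[:, 1] > 2) == (t == 1)))   # True
```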
Support Vector Machines: Slide 33
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Support Vector Machines: Slide 34
Common SVM basis functions
zk = ( polynomial terms of xk of degree 1 to q )
zk = ( radial basis functions of xk ):  zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )
zk = ( sigmoid functions of xk )
This is sensible. Is that the end of the story? No… there's one more trick!
Support Vector Machines: Slide 36
Quadratic Dot Products
The quadratic basis vector Φ(a) = (1, √2 a1, √2 a2, …, √2 am, a1², a2², …, am², √2 a1a2, √2 a1a3, …, √2 am-1am) consists of a constant term, linear terms, pure quadratic terms, and quadratic cross-terms (and similarly for Φ(b)). So:
Φ(a) . Φ(b) = 1 + 2 Σi ai bi + Σi ai² bi² + 2 Σi Σj>i ai aj bi bj
Support Vector Machines: Slide 37
Quadratic Dot Products
Now compare that with (a . b + 1)²:
(a . b + 1)² = (a . b)² + 2 (a . b) + 1
 = (Σi ai bi)² + 2 Σi ai bi + 1
 = Σi Σj ai aj bi bj + 2 Σi ai bi + 1
 = Σi ai² bi² + 2 Σi Σj>i ai aj bi bj + 2 Σi ai bi + 1
They're the same! And (a . b + 1)² is only O(m) to compute, instead of O(m²) for the explicit dot product Φ(a) . Φ(b)!
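A quick numeric check of the identity on this slide, with the quadratic basis written out explicitly (any dimension m works; m = 5 here):

```python
import numpy as np
from itertools import combinations

def phi(a):
    """Constant, linear, pure quadratic, and quadratic cross-terms, as on Slide 36."""
    cross = [np.sqrt(2) * a[i] * a[j] for i, j in combinations(range(len(a)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * a, a ** 2, cross])

a, b = np.random.randn(5), np.random.randn(5)
assert np.isclose(np.dot(phi(a), phi(b)), (np.dot(a, b) + 1) ** 2)  # they're the same
```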
Support Vector Machines: Slide 38
SVM Kernel Functions
• K(a,b) = (a . b + 1)^d is an example of an SVM Kernel Function.
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function.
• Radial-Basis-style Kernel Function: K(a,b) = exp( -|a - b|² / (2σ²) )
• Neural-net-style Kernel Function: K(a,b) = tanh( κ a . b - δ )
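The three kernels above as one-liners (the parameter defaults are arbitrary choices, not from the slides):

```python
import numpy as np

def poly_kernel(a, b, d=2):
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    return np.tanh(kappa * np.dot(a, b) - delta)
```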
Support Vector Machines: Slide 40
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs:
  • SVM 1 learns "Output==1" vs "Output != 1"
  • SVM 2 learns "Output==2" vs "Output != 2"
  • :
  • SVM N learns "Output==N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
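A sketch of that prediction step, assuming each of the N trained SVMs is summarized by a weight vector and bias (hypothetical inputs):

```python
import numpy as np

def ovr_predict(x, models):
    """models: list of (w, b) pairs, one per class ('Output==i' vs the rest).
    Returns the class whose SVM pushes x furthest into its positive region."""
    scores = [np.dot(w, x) + b for (w, b) in models]
    return int(np.argmax(scores)) + 1   # classes numbered 1..N
```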
Support Vector Machines: Slide 41
What You Should Know
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (but, for this class, you don't need to know how it does it)
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Support Vector Machines: Slide 42
Remaining…
• Actually, I skip:
  • ERM (Empirical Risk Minimization)
  • VC (Vapnik–Chervonenkis) dimension
• But these are needed at data-processing time:
  • Handling unbalanced datasets
  • Selecting C
  • Selecting a Kernel
• And these at tuning time:
  • What is a valid Kernel?
  • How to construct a valid Kernel?
Now, the real game is starting in earnest!
Support Vector Machines: Slide 43
SVMlight
• http://svmlight.joachims.org (Thorsten Joachims @ Cornell Univ.)
• Commands:
  • svm_learn [options] [train_file] [model_file]
    e.g. svm_learn -c 1.5 -x 1 train_file model_file
  • svm_classify [test_file] [model_file] [prediction_file]
    e.g. svm_classify test_file model_file prediction_file
• File format: <y> <featureNumber>:<value> … <featureNumber>:<value> # comment
  e.g. 1 23:0.5 105:0.1 1023:0.8 1999:0.34
This example corresponds to the dense vector:
Idx:   1 … 23  … 105 … 1023 1024 … 1998 1999 … | class
Value: 0 … 0.5 … 0.1 … 0.8  0    … 0    0.34 … | 1
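A small helper that emits lines in this format (a sketch; SVMlight expects feature numbers in increasing order):

```python
def svmlight_line(label, features, comment=None):
    """features: dict mapping featureNumber -> value."""
    parts = [str(label)] + [f"{i}:{v}" for i, v in sorted(features.items())]
    if comment is not None:
        parts.append("#" + comment)
    return " ".join(parts)

# Reproduces the example row from the slide
print(svmlight_line(1, {23: 0.5, 105: 0.1, 1023: 0.8, 1999: 0.34}))
# -> 1 23:0.5 105:0.1 1023:0.8 1999:0.34
```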
Support Vector Machines: Slide 44
Options
• -z <char>: selects whether the SVM is trained in classification (c) or regression (r) mode.
• -c <float>: controls the trade-off between training error and margin; the lower the value of C, the more training error is tolerated. The best value of C depends on the data and must be determined empirically.
• -w <float>: width of the tube (i.e. ε) of the ε-insensitive loss function used in regression. The value must be non-negative.
• -j <float>: specifies cost-factors for the loss functions in both classification and regression.
• -b <int>: switches between a biased (1) or an unbiased (0) hyperplane.
• -i <int>: selects how training errors are treated (1-4).
• -x <int>: selects whether to compute leave-one-out estimates of the generalization performance.
• -t <int>: selects the type of kernel function (0-4).
• -q <int>: maximum size of the working set.
• -m <int>: specifies the amount of memory (in megabytes) available for caching kernel evaluations.
• …
Support Vector Machines: Slide 45
Example 1: Sense Disambiguation
• "AK" has 3 senses: above knee, acetate kinase, artificial kidney.
• Example: the context w13, w345, w874, w8, w7345, AK(acetate kinase), w123, w13, w8, w14, w8
• becomes the training line: 2 w8:3 w13:2 w14:1 w123:1 w345:1 w874:1 w7345:1
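A sketch of the count-based feature extraction this example implies (hypothetical helper; the ambiguous token AK itself is dropped and becomes the sense label, as above):

```python
from collections import Counter

def context_to_line(sense_label, context_words):
    """Count context words and emit them sorted by word index."""
    counts = Counter(int(w.lstrip("w")) for w in context_words)
    feats = " ".join(f"w{i}:{n}" for i, n in sorted(counts.items()))
    return f"{sense_label} {feats}"

ctx = ["w13", "w345", "w874", "w8", "w7345", "w123", "w13", "w8", "w14", "w8"]
print(context_to_line(2, ctx))
# -> 2 w8:3 w13:2 w14:1 w123:1 w345:1 w874:1 w7345:1
```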
Support Vector Machines: Slide 46
Example of training set:
1 109:0.060606 193:0.090909 211:0.052632 3079:1.000000 5882:0.200000
1 109:0.030303 143:1.000000 377:0.500000 439:0.500000 1012:0.500000 3808:1.000000 3891:1.000000 4789:0.333333 18363:0.333333 23106:1.000000
1 174:0.166667 244:0.500000 321:0.500000 332:0.500000 723:0.333333 3064:0.500000 3872:1.000000 16401:1.000000 19369:1.000000 23109:1.000000
2 108:0.250000 148:0.333333 202:0.250000 313:0.200000 380:1.000000 3303:1.000000 8944:1.000000 11513:1.000000 23110:1.000000 23111:1.000000
1 100:0.125000 109:0.030303 129:0.200000 131:0.071429 135:0.050000 543:0.500000 5880:0.200000 17153:0.100000 23112:0.100000
1 100:0.125000 109:0.060606 125:0.500000 131:0.071429 193:0.045455 2553:0.200000 4790:0.100000 5880:0.200000
2 382:1.000000 410:1.000000 790:1.000000 1160:0.500000 2052:1.000000 3260:1.000000 7323:0.500000 23113:1.000000 23114:1.000000 23115:1.000000
2 133:1.000000 147:0.250000 148:0.333333 177:0.500000 214:1.000000 960:1.000000 1131:1.000000 1359:1.000000 6378:0.333333 22842:1.000000
1 109:0.060606 917:1.000000 2747:0.500000 2748:0.500000 2749:1.000000 3252:1.000000 21699:1.000000
1 543:0.500000 557:1.000000 563:0.500000 1077:1.000000 1160:0.500000 2747:0.500000 2748:0.500000 12805:1.000000 23116:1.000000 23117:1.000000
1 593:1.000000 1124:0.500000 1615:1.000000 3607:1.000000 13443:1.000000 23118:1.000000 23119:1.000000 23120:1.000000
…
Support Vector Machines: Slide 47
Trained model of SVM:
SVM-multiclass Version V1.01
2 # number of classes
24208 # number of base features
0 # loss function
3 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty # kernel parameter -u
48584 # highest feature index
167 # number of training documents
335 # number of support vectors plus 1
0 # threshold b, each following line is a SV (starting with alpha*y)
0.010000000000000000208166817117217 24401:0.2 24407:0.071428999 24756:1 24909:0.25 25283:1 25424:0.33333299 25487:0.043478001 25773:0.5 31693:1 37411:1 #
-0.010000000000000000208166817117217 193:0.2 199:0.071428999 548:1 701:0.25 1075:1 1216:0.33333299 1279:0.043478001 1565:0.5 7485:1 13203:1 #
0.010000000000000000208166817117217 24407:0.071428999 24419:0.0625 25807:0.166667 26905:0.166667 27631:1 27664:1 35043:1 35888:1 36118:1 48085:1 #
-0.010000000000000000208166817117217 199:0.071428999 211:0.0625 1599:0.166667 2697:0.166667 3423:1 3456:1 10835:1 11680:1 11910:1 23877:1 #
…
Support Vector Machines: Slide 49
Example 3: Text Classification
• Corpus: Reuters-21578
• 12,902 news stories
• 118 categories
• ModApte split (75% training: 9,603 docs / 25% test: 3,299 docs)