2013-1 Machine Learning Lecture 05 - Andrew Moore - Support Vector Machines
TRANSCRIPT
Nov 23rd, 2001 Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machines
Modified from the slides by Dr. Andrew W. Moore
http://www.cs.cmu.edu/~awm/tutorials
Support Vector Machines: Slide 2 Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
f(x,w,b) = sign(w · x - b)
[Figure: an input x is fed through f to produce an estimate y^est; one marker denotes y = +1, the other denotes y = -1.]
How would you classify this data?
(Slides 3 through 6 repeat the question, each with a different candidate linear boundary drawn through the same data.)
Any of these would be fine..
..but which is best?
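To make the linear classifier concrete, here is a minimal sketch (the weight vector, bias, and points are made-up values, not taken from the slides) that evaluates f(x,w,b) = sign(w · x - b) on a few 2-D points:

```python
import numpy as np

def linear_classify(X, w, b):
    """Return sign(w . x - b) for each row x of X, as +1 or -1."""
    scores = X @ w - b
    return np.where(scores >= 0, 1, -1)

# Made-up weight vector, bias, and points, purely for illustration.
w = np.array([1.0, 2.0])
b = 0.5
X = np.array([[2.0, 1.0], [-1.0, -1.0], [0.0, 0.0]])
print(linear_classify(X, w, b))  # [ 1 -1 -1]
```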
Support Vector Machines: Slide 7 Copyright © 2001, 2003, Andrew W. Moore
Classifier Margin
f(x,w,b) = sign(w · x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Support Vector Machines: Slide 8 Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin
f(x,w,b) = sign(w · x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Support Vector Machines: Slide 9 Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin
f(x,w,b) = sign(w · x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support vectors are those datapoints that the margin pushes up against.
Linear SVM
Support Vector Machines: Slide 10 Copyright © 2001, 2003, Andrew W. Moore
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.
3. LOOCV (leave-one-out cross-validation) is easy since the model is immune to removal of any non-support-vector datapoints.
4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
Support Vector Machines: Slide 11 Copyright © 2001, 2003, Andrew W. Moore
Estimate the Margin
• What is the distance expression for a point x to a line wx + b = 0?
[Figure: the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ and a point $\mathbf{x}$ at distance $d(\mathbf{x})$ from it.]
$$d(\mathbf{x}) = \frac{|\mathbf{w} \cdot \mathbf{x} + b|}{\|\mathbf{w}\|_2} = \frac{|\mathbf{w} \cdot \mathbf{x} + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$$
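To make the distance expression concrete, here is a small sketch (the w, b, and points are made-up values) that computes d(x) = |w · x + b| / ||w||_2 with NumPy:

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w . x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Made-up values, purely for illustration.
w = np.array([3.0, 4.0])   # ||w||_2 = 5
b = -10.0
print(distance_to_hyperplane(np.array([2.0, 1.0]), w, b))  # |6 + 4 - 10| / 5 = 0.0
print(distance_to_hyperplane(np.array([4.0, 2.0]), w, b))  # |12 + 8 - 10| / 5 = 2.0
```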
Support Vector Machines: Slide 12 Copyright © 2001, 2003, Andrew W. Moore
Estimate the Margin
• What is the expression for margin?
[Figure: the separating hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ with the margin marked.]
$$\text{margin} = \min_{\mathbf{x} \in D} d(\mathbf{x}) = \min_{\mathbf{x} \in D} \frac{|\mathbf{w} \cdot \mathbf{x} + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$$
Support Vector Machines: Slide 13 Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
[Figure: the separating hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ with the margin marked.]
$$\begin{aligned}
\{\mathbf{w}^*, b^*\} &= \operatorname*{argmax}_{\mathbf{w}, b} \; \text{margin}(\mathbf{w}, b, D) \\
&= \operatorname*{argmax}_{\mathbf{w}, b} \; \min_{\mathbf{x}_i \in D} d(\mathbf{x}_i) \\
&= \operatorname*{argmax}_{\mathbf{w}, b} \; \min_{\mathbf{x}_i \in D} \frac{|\mathbf{x}_i \cdot \mathbf{w} + b|}{\sqrt{\sum_{j=1}^{d} w_j^2}}
\end{aligned}$$
Support Vector Machines: Slide 14 Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
[Figure: the separating hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ with the margin marked.]
$$\operatorname*{argmax}_{\mathbf{w}, b} \; \min_{\mathbf{x}_i \in D} \frac{|\mathbf{x}_i \cdot \mathbf{w} + b|}{\sqrt{\sum_{j=1}^{d} w_j^2}}
\quad \text{subject to} \quad \forall \mathbf{x}_i \in D: \; y_i (\mathbf{x}_i \cdot \mathbf{w} + b) \ge 0$$
• Min-max problem → game problem
Support Vector Machines: Slide 15 Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
[Figure: the separating hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ with the margin marked.]
$$\operatorname*{argmax}_{\mathbf{w}, b} \; \min_{\mathbf{x}_i \in D} \frac{|\mathbf{x}_i \cdot \mathbf{w} + b|}{\sqrt{\sum_{j=1}^{d} w_j^2}}
\quad \text{subject to} \quad \forall \mathbf{x}_i \in D: \; y_i (\mathbf{x}_i \cdot \mathbf{w} + b) \ge 0$$
Strategy: remove the scaling freedom in $(\mathbf{w}, b)$ by requiring $\forall \mathbf{x}_i \in D: \; |\mathbf{x}_i \cdot \mathbf{w} + b| \ge 1$. The problem then becomes
$$\operatorname*{argmin}_{\mathbf{w}, b} \; \sum_{i=1}^{d} w_i^2
\quad \text{subject to} \quad \forall \mathbf{x}_i \in D: \; y_i (\mathbf{x}_i \cdot \mathbf{w} + b) \ge 1$$
Support Vector Machines: Slide 16 Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin Linear Classifier
• How to solve it?
$$\{\mathbf{w}^*, b^*\} = \operatorname*{argmin}_{\mathbf{w}, b} \; \sum_{k=1}^{d} w_k^2$$
$$\text{subject to} \quad
y_1 (\mathbf{w} \cdot \mathbf{x}_1 + b) \ge 1, \quad
y_2 (\mathbf{w} \cdot \mathbf{x}_2 + b) \ge 1, \quad \ldots, \quad
y_N (\mathbf{w} \cdot \mathbf{x}_N + b) \ge 1$$
Support Vector Machines: Slide 17 Copyright © 2001, 2003, Andrew W. Moore
Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms for optimizing (minimizing or maximizing) a quadratic function of some real-valued variables subject to linear constraints.
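To connect this to an off-the-shelf solver, here is a hedged sketch that feeds the maximum-margin primal (minimize wᵀw subject to y_k(w · x_k + b) ≥ 1) to the cvxopt QP solver, stacking u = (w, b). The toy data, and the tiny regularizer that keeps the quadratic term strictly positive definite for the solver, are illustrative choices, not part of the slides:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

# Made-up linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, m = X.shape

# Variables u = (w_1, ..., w_m, b).  Quadratic criterion: 1/2 u^T P u.
# The tiny value on the b diagonal keeps P strictly positive definite.
P = matrix(np.diag([1.0] * m + [1e-8]))
q = matrix(np.zeros(m + 1))
# y_k (w . x_k + b) >= 1  rewritten as  -y_k * [x_k, 1] @ u <= -1.
G = matrix(-y[:, None] * np.hstack([X, np.ones((n, 1))]))
h = matrix(-np.ones(n))

sol = solvers.qp(P, q, G, h)
u = np.array(sol['x']).ravel()
w, b = u[:m], u[m]
print("w =", w, "b =", b)
```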
Support Vector Machines: Slide 18 Copyright © 2001, 2003, Andrew W. Moore
Quadratic Programming
Find
$$\operatorname*{argmin}_{\mathbf{u}} \; c + \mathbf{d}^T \mathbf{u} + \frac{\mathbf{u}^T R \mathbf{u}}{2} \qquad \text{(quadratic criterion)}$$
Subject to n linear inequality constraints:
$$a_{11} u_1 + a_{12} u_2 + \ldots + a_{1m} u_m \le b_1$$
$$a_{21} u_1 + a_{22} u_2 + \ldots + a_{2m} u_m \le b_2$$
$$\vdots$$
$$a_{n1} u_1 + a_{n2} u_2 + \ldots + a_{nm} u_m \le b_n$$
And subject to e additional linear equality constraints:
$$a_{(n+1)1} u_1 + a_{(n+1)2} u_2 + \ldots + a_{(n+1)m} u_m = b_{(n+1)}$$
$$\vdots$$
$$a_{(n+e)1} u_1 + a_{(n+e)2} u_2 + \ldots + a_{(n+e)m} u_m = b_{(n+e)}$$
Support Vector Machines: Slide 20 Copyright © 2001, 2003, Andrew W. Moore
Quadratic Programming
$$\{\mathbf{w}^*, b^*\} = \operatorname*{argmin}_{\mathbf{w}, b} \; \sum_i w_i^2
\quad \text{subject to} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \;\; \text{for all training data } (\mathbf{x}_i, y_i)$$
In standard QP form, the criterion is $\mathbf{w}^T \mathbf{w}$ and the constraints are
$$y_1 (\mathbf{w} \cdot \mathbf{x}_1 + b) \ge 1, \quad y_2 (\mathbf{w} \cdot \mathbf{x}_2 + b) \ge 1, \quad \ldots, \quad y_N (\mathbf{w} \cdot \mathbf{x}_N + b) \ge 1
\qquad (N \text{ inequality constraints})$$
Support Vector Machines: Slide 21 Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
This is going to be a problem!
What should we do?
Support Vector Machines: Slide 22 Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
Idea 1:
Find minimum w.w, while minimizing number of training set errors.
Problemette: Two things to minimize makes for an ill-defined optimization
Support Vector Machines: Slide 23 Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
Idea 1.1:
Minimize w.w + C (#train errors), where C is a tradeoff parameter.
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
Support Vector Machines: Slide 24 Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
Idea 1.1:
Minimize w.w + C (#train errors), where C is a tradeoff parameter.
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
It can't be expressed as a Quadratic Programming problem, and solving it may be too slow.
(Also, it doesn't distinguish between disastrous errors and near misses.)
Support Vector Machines: Slide 25 Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
Idea 2.0:
Minimize w.w + C (distance of error points to their correct place)
Support Vector Machines: Slide 26 Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine (SVM) for Noisy Data
• Any problem with the above formulation?
$$\{\mathbf{w}^*, b^*\} = \operatorname*{argmin}_{\mathbf{w}, b} \; \sum_{i=1}^{d} w_i^2 + c \sum_{j=1}^{N} \varepsilon_j$$
$$\text{subject to} \quad
y_1 (\mathbf{w} \cdot \mathbf{x}_1 + b) \ge 1 - \varepsilon_1, \quad
y_2 (\mathbf{w} \cdot \mathbf{x}_2 + b) \ge 1 - \varepsilon_2, \quad \ldots, \quad
y_N (\mathbf{w} \cdot \mathbf{x}_N + b) \ge 1 - \varepsilon_N$$
[Figure: a nearly separable dataset with three margin-violating points whose slacks are labeled $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$.]
Support Vector Machines: Slide 27 Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine (SVM) for Noisy Data
• Balance the trade off between margin and classification errors
$$\{\mathbf{w}^*, b^*\} = \operatorname*{argmin}_{\mathbf{w}, b} \; \sum_{i=1}^{d} w_i^2 + c \sum_{j=1}^{N} \varepsilon_j$$
$$\text{subject to} \quad
y_1 (\mathbf{w} \cdot \mathbf{x}_1 + b) \ge 1 - \varepsilon_1, \; \varepsilon_1 \ge 0; \quad
y_2 (\mathbf{w} \cdot \mathbf{x}_2 + b) \ge 1 - \varepsilon_2, \; \varepsilon_2 \ge 0; \quad \ldots; \quad
y_N (\mathbf{w} \cdot \mathbf{x}_N + b) \ge 1 - \varepsilon_N, \; \varepsilon_N \ge 0$$
[Figure: the same dataset, with slack variables $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$ marked.]
Support Vector Machines: Slide 28 Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine for Noisy Data
$$\{\mathbf{w}^*, b^*\} = \operatorname*{argmin}_{\mathbf{w}, b} \; \sum_{i=1}^{d} w_i^2 + c \sum_{j=1}^{N} \varepsilon_j$$
$$\text{subject to} \quad
y_1 (\mathbf{w} \cdot \mathbf{x}_1 + b) \ge 1 - \varepsilon_1, \; \varepsilon_1 \ge 0, \quad \ldots, \quad
y_N (\mathbf{w} \cdot \mathbf{x}_N + b) \ge 1 - \varepsilon_N, \; \varepsilon_N \ge 0
\qquad (N \text{ pairs of inequality constraints})$$
How do we determine the appropriate value for c ?
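In practice c (usually called C in software) is chosen by cross-validation. A hedged sketch with scikit-learn on synthetic data (the grid of C values is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, slightly noisy two-class data, for illustration only.
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.05, random_state=0)

# Pick the slack/margin tradeoff C by 5-fold cross-validation over a coarse grid.
grid = GridSearchCV(SVC(kernel='linear'), {'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_['C'])
print("cross-validated accuracy:", grid.best_score_)
```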
Support Vector Machines: Slide 29 Copyright © 2001, 2003, Andrew W. Moore
The Dual Form of QP
Maximize
$$\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad \text{where } Q_{kl} = y_k y_l \, (\mathbf{x}_k \cdot \mathbf{x}_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k=1}^{R} \alpha_k y_k \mathbf{x}_k$$
Then classify with: f(x,w,b) = sign(w · x - b)
Support Vector Machines: Slide 31 Copyright © 2001, 2003, Andrew W. Moore
An Equivalent QP
Maximize
$$\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad \text{where } Q_{kl} = y_k y_l \, (\mathbf{x}_k \cdot \mathbf{x}_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k \text{ s.t. } \alpha_k > 0} \alpha_k y_k \mathbf{x}_k$$
Datapoints with $\alpha_k > 0$ will be the support vectors ..so this sum only needs to be over the support vectors.
Support Vector Machines: Slide 32 Copyright © 2001, 2003, Andrew W. Moore
Support Vectors
[Figure: the decision boundary with the two margin planes $\mathbf{w} \cdot \mathbf{x} + b = 1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$; the datapoints lying on them are the support vectors.]
The decision boundary is determined only by those support vectors!
$$\mathbf{w} = \sum_{k=1}^{R} \alpha_k y_k \mathbf{x}_k$$
$$\forall i: \; \alpha_i \left( y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right) = 0$$
$\alpha_i = 0$ for non-support vectors; $\alpha_i > 0$ for support vectors.
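This sparseness can be checked directly on a fitted linear SVM. A hedged scikit-learn sketch on synthetic data: dual_coef_ stores the products α_k y_k for the support vectors only, so w = Σ_k α_k y_k x_k can be rebuilt from them and compared with the solver's own w:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic separable data, for illustration only.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X, y)

# dual_coef_[0] contains alpha_k * y_k for the support vectors.
w_from_alphas = clf.dual_coef_[0] @ clf.support_vectors_
print("support vectors:", len(clf.support_vectors_), "of", len(X), "points")
print("w rebuilt from the alphas:", w_from_alphas)
print("w reported by the solver: ", clf.coef_[0])  # should match
```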
Support Vector Machines: Slide 33 Copyright © 2001, 2003, Andrew W. Moore
The Dual Form of QP
How to determine b ?
Support Vector Machines: Slide 34 Copyright © 2001, 2003, Andrew W. Moore
An Equivalent QP: Determine b
A linear programming problem!
$$\{\mathbf{w}^*, b^*\} = \operatorname*{argmin}_{\mathbf{w}, b} \; \sum_{i=1}^{d} w_i^2 + c \sum_{j=1}^{N} \varepsilon_j
\quad \text{subject to} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \varepsilon_i, \; \varepsilon_i \ge 0 \;\; (i = 1, \ldots, N)$$
Fix $\mathbf{w} = \mathbf{w}^*$ and solve for $b$ alone:
$$b^* = \operatorname*{argmin}_{b} \; \sum_{j=1}^{N} \varepsilon_j
\quad \text{subject to} \quad y_i (\mathbf{w}^* \cdot \mathbf{x}_i + b) \ge 1 - \varepsilon_i, \; \varepsilon_i \ge 0 \;\; (i = 1, \ldots, N)$$
Support Vector Machines: Slide 35 Copyright © 2001, 2003, Andrew W. Moore
An Equivalent QP
Maximize
$$\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad \text{where } Q_{kl} = y_k y_l \, (\mathbf{x}_k \cdot \mathbf{x}_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k=1}^{R} \alpha_k y_k \mathbf{x}_k, \qquad
b = y_K (1 - \varepsilon_K) - \mathbf{x}_K \cdot \mathbf{w} \quad \text{where } K = \operatorname*{argmax}_{k} \alpha_k$$
Then classify with: f(x,w,b) = sign(w · x - b)
Datapoints with $\alpha_k > 0$ will be the support vectors ..so this sum only needs to be over the support vectors.
Why did I tell you about this equivalent QP?
• It’s a formulation that QP packages can optimize more quickly
• Because of further jaw-dropping developments you’re about to learn.
Support Vector Machines: Slide 36 Copyright © 2001, 2003, Andrew W. Moore
Suppose we’re in 1-dimension
What would SVMs do with this data?
x=0
Support Vector Machines: Slide 37 Copyright © 2001, 2003, Andrew W. Moore
Suppose we’re in 1-dimension
Not a big surprise
Positive “plane” Negative “plane”
x=0
Support Vector Machines: Slide 38 Copyright © 2001, 2003, Andrew W. Moore
Harder 1-dimensional dataset
That’s wiped the smirk off SVM’s face.
What can be done about this?
x=0
Support Vector Machines: Slide 39 Copyright © 2001, 2003, Andrew W. Moore
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:
$$\mathbf{z}_k = (x_k, \, x_k^2)$$
x=0
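A tiny sketch of that basis expansion (the 1-D points are made up): after mapping x_k to z_k = (x_k, x_k²), the interleaved classes become separable by a horizontal line in the new plane:

```python
import numpy as np

# Made-up 1-D data: negatives near the origin, positives further out.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

Z = np.column_stack([x, x ** 2])   # z_k = (x_k, x_k^2)
# In the (x, x^2) plane the classes split cleanly at, e.g., x^2 = 2:
print(np.all((Z[:, 1] > 2) == (y == 1)))   # True
```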
Support Vector Machines: Slide 41 Copyright © 2001, 2003, Andrew W. Moore
Common SVM basis functions
z_k = ( polynomial terms of x_k of degree 1 to q )
z_k = ( radial basis functions of x_k ): $z_k[j] = \varphi_j(\mathbf{x}_k) = \exp\left(-\frac{|\mathbf{x}_k - \mathbf{c}_j|^2}{2\sigma^2}\right)$
z_k = ( sigmoid functions of x_k )
This is sensible. Is that the end of the story? No... there's one more trick!
Support Vector Machines: Slide 42 Copyright © 2001, 2003, Andrew W. Moore
Quadratic Basis Functions
$$\Phi(\mathbf{x}) = \left(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; \ldots,\; \sqrt{2}\,x_m,\; x_1^2,\; x_2^2,\; \ldots,\; x_m^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1 x_3,\; \ldots,\; \sqrt{2}\,x_{m-1} x_m\right)^T$$
(constant term, linear terms, pure quadratic terms, quadratic cross-terms)
Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2
You may be wondering what those $\sqrt{2}$'s are doing.
• You should be happy that they do no harm
• You'll find out why they're there soon.
Support Vector Machines: Slide 43 Copyright © 2001, 2003, Andrew W. Moore
QP (old)
Maximize
$$\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad \text{where } Q_{kl} = y_k y_l \, (\mathbf{x}_k \cdot \mathbf{x}_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k=1}^{R} \alpha_k y_k \mathbf{x}_k$$
Then classify with: f(x,w,b) = sign(w · x - b)
Support Vector Machines: Slide 44 Copyright © 2001, 2003, Andrew W. Moore
QP with basis functions
Maximize
$$\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad \text{where } Q_{kl} = y_k y_l \left(\Phi(\mathbf{x}_k) \cdot \Phi(\mathbf{x}_l)\right)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k \text{ s.t. } \alpha_k > 0} \alpha_k y_k \Phi(\mathbf{x}_k)$$
Then classify with: f(x,w,b) = sign(w · Φ(x) - b)
Most important changes: $\mathbf{x} \rightarrow \Phi(\mathbf{x})$
Support Vector Machines: Slide 45 Copyright © 2001, 2003, Andrew W. Moore
QP with basis functions
Maximize
$$\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad \text{where } Q_{kl} = y_k y_l \left(\Phi(\mathbf{x}_k) \cdot \Phi(\mathbf{x}_l)\right)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k \text{ s.t. } \alpha_k > 0} \alpha_k y_k \Phi(\mathbf{x}_k)$$
Then classify with: f(x,w,b) = sign(w · Φ(x) - b)
We must do R²/2 dot products to get this matrix ready. Each dot product requires m²/2 additions and multiplications. The whole thing costs R² m² / 4. Yeeks!
…or does it?
Support Vector Machines: Slide 46 Copyright © 2001, 2003, Andrew W. Moore
Quadratic Dot Products
$$\Phi(\mathbf{a}) \cdot \Phi(\mathbf{b}) = 1 + 2\sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + 2\sum_{i=1}^{m} \sum_{j=i+1}^{m} a_i a_j b_i b_j$$
(constant term + linear terms + pure quadratic terms + quadratic cross-terms)
Support Vector Machines: Slide 47 Copyright © 2001, 2003, Andrew W. Moore
Quadratic Dot Products
$$\Phi(\mathbf{a}) \cdot \Phi(\mathbf{b}) = 1 + 2\sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + 2\sum_{i=1}^{m} \sum_{j=i+1}^{m} a_i a_j b_i b_j$$
Just out of casual, innocent, interest, let's look at another function of a and b:
$$\begin{aligned}
(\mathbf{a} \cdot \mathbf{b} + 1)^2 &= (\mathbf{a} \cdot \mathbf{b})^2 + 2(\mathbf{a} \cdot \mathbf{b}) + 1 \\
&= \left(\sum_{i=1}^{m} a_i b_i\right)^2 + 2\sum_{i=1}^{m} a_i b_i + 1 \\
&= \sum_{i=1}^{m} \sum_{j=1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1 \\
&= \sum_{i=1}^{m} a_i^2 b_i^2 + 2\sum_{i=1}^{m} \sum_{j=i+1}^{m} a_i a_j b_i b_j + 2\sum_{i=1}^{m} a_i b_i + 1
\end{aligned}$$
Support Vector Machines: Slide 48 Copyright © 2001, 2003, Andrew W. Moore
Quadratic Dot Products
$$\Phi(\mathbf{a}) \cdot \Phi(\mathbf{b}) = 1 + 2\sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + 2\sum_{i=1}^{m} \sum_{j=i+1}^{m} a_i a_j b_i b_j$$
$$(\mathbf{a} \cdot \mathbf{b} + 1)^2 = \sum_{i=1}^{m} a_i^2 b_i^2 + 2\sum_{i=1}^{m} \sum_{j=i+1}^{m} a_i a_j b_i b_j + 2\sum_{i=1}^{m} a_i b_i + 1$$
They're the same!
And this is only O(m) to compute!
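A quick numerical check of this identity (the vectors are arbitrary made-up values): the explicit quadratic feature map and the kernel shortcut give the same number:

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit quadratic feature map: constant, sqrt(2)*linear, squares, sqrt(2)*cross-terms."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])

print(phi(a) @ phi(b))     # explicit feature map: O(m^2) work
print((a @ b + 1.0) ** 2)  # kernel trick: O(m) work, same number (30.25)
```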
Support Vector Machines: Slide 49 Copyright © 2001, 2003, Andrew W. Moore
QP with Quintic basis functions
Maximize
$$\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad \text{where } Q_{kl} = y_k y_l \left(\Phi(\mathbf{x}_k) \cdot \Phi(\mathbf{x}_l)\right)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k \text{ s.t. } \alpha_k > 0} \alpha_k y_k \Phi(\mathbf{x}_k)$$
Then classify with: f(x,w,b) = sign(w · Φ(x) - b)
Support Vector Machines: Slide 50 Copyright © 2001, 2003, Andrew W. Moore
QP with Quadratic basis functions
Maximize
$$\sum_{k=1}^{R} \alpha_k - \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad \text{where } Q_{kl} = y_k y_l \, K(\mathbf{x}_k, \mathbf{x}_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k \text{ s.t. } \alpha_k > 0} \alpha_k y_k \Phi(\mathbf{x}_k), \qquad
b = y_K (1 - \varepsilon_K) - \mathbf{x}_K \cdot \mathbf{w} \quad \text{where } K = \operatorname*{argmax}_{k} \alpha_k$$
Then classify with: f(x,w,b) = sign(K(w, x) - b)
Most important change: $K(\mathbf{x}_k, \mathbf{x}_l) = \Phi(\mathbf{x}_k) \cdot \Phi(\mathbf{x}_l)$
Support Vector Machines: Slide 51 Copyright © 2001, 2003, Andrew W. Moore
Higher Order Polynomials
Polynomial | Φ(x) | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | Φ(a)·Φ(b) | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
Quadratic | All m²/2 terms up to degree 2 | m² R² / 4 | 2,500 R² | (a·b+1)² | m R² / 2 | 50 R²
Cubic | All m³/6 terms up to degree 3 | m³ R² / 12 | 83,000 R² | (a·b+1)³ | m R² / 2 | 50 R²
Quartic | All m⁴/24 terms up to degree 4 | m⁴ R² / 48 | 1,960,000 R² | (a·b+1)⁴ | m R² / 2 | 50 R²
Support Vector Machines: Slide 52 Copyright © 2001, 2003, Andrew W. Moore
SVM Kernel Functions
• K(a,b) = (a · b + 1)^d is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
• Radial-Basis-style Kernel Function: $K(\mathbf{a}, \mathbf{b}) = \exp\left(-\frac{(\mathbf{a} - \mathbf{b})^2}{2\sigma^2}\right)$
• Neural-net-style Kernel Function: $K(\mathbf{a}, \mathbf{b}) = \tanh(\mathbf{a} \cdot \mathbf{b})$
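The three kernels above written out directly (a sketch; the degree d, the width σ, and the scale κ and offset δ often added to the neural-net-style kernel are illustrative parameters, not values from the slides):

```python
import numpy as np

def k_poly(a, b, d=2):
    """Polynomial kernel (a . b + 1)^d."""
    return (np.dot(a, b) + 1.0) ** d

def k_rbf(a, b, sigma=1.0):
    """Radial-basis-style kernel exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def k_tanh(a, b, kappa=1.0, delta=0.0):
    """Neural-net-style kernel tanh(kappa * a . b - delta)."""
    return np.tanh(kappa * np.dot(a, b) - delta)

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k_poly(a, b), k_rbf(a, b), k_tanh(a, b))
```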
Support Vector Machines: Slide 53 Copyright © 2001, 2003, Andrew W. Moore
Kernel Tricks
• Replacing the dot product with a kernel function
• Not all functions are kernel functions
• They need to be decomposable: K(a,b) = Φ(a) · Φ(b)
• Could K(a,b) = (a − b)³ be a kernel function?
• Could K(a,b) = (a − b)⁴ − (a + b)² be a kernel function?
Support Vector Machines: Slide 54 Copyright © 2001, 2003, Andrew W. Moore
Kernel Tricks
• Mercer's condition: to expand a kernel function K(x,y) into a dot product, i.e. K(x,y) = Φ(x) · Φ(y), K(x,y) has to be a positive semi-definite function; i.e., for any function f(x) for which $\int f(x)^2 \, dx$ is finite, the following inequality holds:
$$\int\!\!\int f(x)\, K(x,y)\, f(y)\, dx\, dy \ge 0$$
• Could $K(x,y) = \sum_i x_i^p y_i^p$ be a kernel function?
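Mercer's condition is hard to check analytically, but on any finite sample one can at least test whether the Gram matrix K_ij = K(x_i, x_j) is positive semi-definite (all eigenvalues ≥ 0). A sketch on random made-up points; the squared-distance "kernel" used as the failing example is not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))   # made-up sample points

def gram(K, X):
    """Gram matrix K_ij = K(x_i, x_j) over a finite sample."""
    n = len(X)
    return np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])

k_good = lambda a, b: (np.dot(a, b) + 1.0) ** 2   # a valid kernel
k_bad = lambda a, b: np.sum((a - b) ** 2)         # symmetric, but not PSD

for name, k in [("(a.b+1)^2", k_good), ("||a-b||^2", k_bad)]:
    eigvals = np.linalg.eigvalsh(gram(k, X))
    # A clearly negative minimum eigenvalue shows the Gram matrix is not PSD,
    # so the candidate cannot be a Mercer kernel.
    print(name, "min eigenvalue:", eigvals.min())
```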
Support Vector Machines: Slide 55 Copyright © 2001, 2003, Andrew W. Moore
Kernel Tricks
• Pro
• Introduces nonlinearity into the model
• Computationally cheap
• Con
• Still has potential overfitting problems
Support Vector Machines: Slide 56 Copyright © 2001, 2003, Andrew W. Moore
Nonlinear Kernel (I)
Support Vector Machines: Slide 57 Copyright © 2001, 2003, Andrew W. Moore
Nonlinear Kernel (II)
Support Vector Machines: Slide 58 Copyright © 2001, 2003, Andrew W. Moore
Kernelize Logistic Regression
How can we introduce the nonlinearity into the logistic regression?
$$p(y \mid \mathbf{x}) = \frac{1}{1 + \exp(-y\, \mathbf{x} \cdot \mathbf{w})}$$
$$l_{reg}(\mathbf{w}) = \sum_{i=1}^{N} \log\left(1 + \exp(-y_i\, \mathbf{x}_i \cdot \mathbf{w})\right) + c \sum_{k} w_k^2$$
Support Vector Machines: Slide 59 Copyright © 2001, 2003, Andrew W. Moore
Kernelize Logistic Regression
• Representation Theorem: expand the weight vector over the (mapped) training points,
$$\mathbf{x} \rightarrow \Phi(\mathbf{x}), \qquad \mathbf{w} = \sum_{i=1}^{N} \alpha_i \Phi(\mathbf{x}_i), \qquad K(\mathbf{w}, \mathbf{x}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}_i, \mathbf{x})$$
$$p(y \mid \mathbf{x}) = \frac{1}{1 + \exp(-y\, K(\mathbf{w}, \mathbf{x}))} = \frac{1}{1 + \exp\left(-y \sum_{i=1}^{N} \alpha_i K(\mathbf{x}, \mathbf{x}_i)\right)}$$
$$l_{reg}(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \log\left(1 + \exp\left(-y_i \sum_{j=1}^{N} \alpha_j K(\mathbf{x}_i, \mathbf{x}_j)\right)\right) + c \sum_{i,j=1}^{N} \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j)$$
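A minimal sketch of the kernelized objective above, fitted by plain gradient descent on the α's (the data, kernel width, step size, and regularization constant are all made-up illustrative choices, not the slides' own optimizer):

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel over the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def fit_kernel_logreg(K, y, c=0.1, lr=0.01, iters=2000):
    """Gradient descent on l_reg(alpha); y must be in {-1, +1}."""
    alpha = np.zeros(len(y))
    for _ in range(iters):
        f = K @ alpha                    # scores on the training points
        s = 1.0 / (1.0 + np.exp(y * f))  # sigma(-y_i f_i)
        grad = K @ (-y * s) + 2.0 * c * (K @ alpha)
        alpha -= lr * grad
    return alpha

# Made-up 1-D data where the negative class sits between the positives,
# so a linear model fails but an RBF-kernelized one need not.
X = np.array([[-3.0], [-2.5], [-0.5], [0.0], [0.5], [2.5], [3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0])
K = rbf_gram(X, sigma=1.0)
alpha = fit_kernel_logreg(K, y)
print(np.sign(K @ alpha))   # predictions on the training points
```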
Support Vector Machines: Slide 60 Copyright © 2001, 2003, Andrew W. Moore
Overfitting in SVM
[Figure: classification error versus polynomial kernel degree (1 to 5) on two datasets, Breast Cancer (error roughly 0 to 0.08) and Ionosphere (error roughly 0 to 0.2), with separate curves for training error and testing error.]
Support Vector Machines: Slide 61 Copyright © 2001, 2003, Andrew W. Moore
SVM Performance • Anecdotally they work very very well indeed.
• Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark
• Another Example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.
• There is a lot of excitement and religious fervor about SVMs as of 2001.
• Despite this, some practitioners are a little skeptical.
Support Vector Machines: Slide 62 Copyright © 2001, 2003, Andrew W. Moore
Diffusion Kernel • Kernel function describes the correlation or similarity
between two data points
• Suppose we have a function s(x,y) that describes the similarity between two data points, and assume it is non-negative and symmetric. How can we generate a kernel function based on this similarity function?
• A graph theory approach …
Support Vector Machines: Slide 63 Copyright © 2001, 2003, Andrew W. Moore
Diffusion Kernel
• Create a graph for the data points
• Each vertex corresponds to a data point
• The weight of each edge is the similarity s(x,y)
• Graph Laplacian
$$L_{i,j} = \begin{cases} s(x_i, x_j) & i \ne j \\ -\sum_{k \ne i} s(x_i, x_k) & i = j \end{cases}$$
• Properties of the Laplacian
• Negative semi-definite
Support Vector Machines: Slide 64 Copyright © 2001, 2003, Andrew W. Moore
Diffusion Kernel
• Consider a simple Laplacian:
$$L_{i,j} = \begin{cases} 1 & x_i \text{ and } x_j \text{ are connected} \\ -\sum_{x_k \in N(x_i)} 1 & i = j \end{cases}$$
• Consider $L^2, L^4, \ldots$
• What do these matrices represent?
• A diffusion kernel
Support Vector Machines: Slide 66 Copyright © 2001, 2003, Andrew W. Moore
Diffusion Kernel: Properties
• Positive definite
• Local relationships in L induce global relationships
• Works for undirected weighted graphs with symmetric similarities $s(x_i, x_j) = s(x_j, x_i)$
• How to compute the diffusion kernel $K = e^{\beta L}$, which satisfies $\frac{dK}{d\beta} = L K$?
Support Vector Machines: Slide 67 Copyright © 2001, 2003, Andrew W. Moore
Computing Diffusion Kernel
• Singular value decomposition of the Laplacian L:
$$L = U \Sigma U^T = (\mathbf{u}_1, \ldots, \mathbf{u}_m)\, \Sigma\, (\mathbf{u}_1, \ldots, \mathbf{u}_m)^T = \sum_{i=1}^{m} \lambda_i \mathbf{u}_i \mathbf{u}_i^T,
\qquad \mathbf{u}_i^T \mathbf{u}_j = \delta_{i,j}$$
• What is $L^2$?
$$L^2 = \left(\sum_{i=1}^{m} \lambda_i \mathbf{u}_i \mathbf{u}_i^T\right)^2
= \sum_{i,j=1}^{m} \lambda_i \lambda_j \mathbf{u}_i \mathbf{u}_i^T \mathbf{u}_j \mathbf{u}_j^T
= \sum_{i=1}^{m} \lambda_i^2 \mathbf{u}_i \mathbf{u}_i^T$$
Support Vector Machines: Slide 68 Copyright © 2001, 2003, Andrew W. Moore
Computing Diffusion Kernel
• What about $L^n$?
$$L^n = \sum_{i=1}^{m} \lambda_i^n \mathbf{u}_i \mathbf{u}_i^T$$
• Compute the diffusion kernel $e^{L}$:
$$e^{L} = \sum_{n=0}^{\infty} \frac{1}{n!} L^n
= \sum_{n=0}^{\infty} \sum_{i=1}^{m} \frac{\lambda_i^n}{n!} \mathbf{u}_i \mathbf{u}_i^T
= \sum_{i=1}^{m} e^{\lambda_i} \mathbf{u}_i \mathbf{u}_i^T$$
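Numerically, the eigendecomposition route above is exactly what a matrix-exponential routine computes. A sketch on a tiny made-up path graph (scipy assumed available; the diffusion rate β is an illustrative value):

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

# Adjacency of a made-up 4-node path graph: 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = A - np.diag(A.sum(axis=1))   # Laplacian: 1 off-diagonal, minus degree on the diagonal

beta = 0.5
K = expm(beta * L)               # diffusion kernel e^{beta L}

# Same thing via the eigendecomposition L = sum_i lambda_i u_i u_i^T.
lam, U = np.linalg.eigh(L)
K_eig = U @ np.diag(np.exp(beta * lam)) @ U.T
print(np.allclose(K, K_eig))               # True
print(np.all(np.linalg.eigvalsh(K) > 0))   # positive definite
```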
Support Vector Machines: Slide 69 Copyright © 2001, 2003, Andrew W. Moore
Doing multi-class classification • SVMs can only handle two-class outputs (i.e. a
categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs
• SVM 1 learns “Output==1” vs “Output != 1”
• SVM 2 learns “Output==2” vs “Output != 2”
• :
• SVM N learns “Output==N” vs “Output != N”
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
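A hedged sketch of exactly that recipe with scikit-learn on synthetic 3-class data: fit one SVM per class on "this class vs. the rest", then pick the class whose SVM pushes the point furthest into its positive region. (SVC also has built-in multi-class handling; this just spells out the slide's recipe.)

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic 3-class data, for illustration only.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)
classes = np.unique(y)

# One SVM per class: "Output == c" vs "Output != c".
machines = {c: SVC(kernel='linear').fit(X, (y == c).astype(int)) for c in classes}

def predict(x):
    x = x.reshape(1, -1)
    # decision_function gives the signed margin-like score; take the largest.
    scores = {c: clf.decision_function(x)[0] for c, clf in machines.items()}
    return max(scores, key=scores.get)

print("predicted:", predict(X[0]), "true:", y[0])
```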
Support Vector Machines: Slide 70 Copyright © 2001, 2003, Andrew W. Moore
Ranking Problem • Consider a problem of ranking essays
• Three ranking categories: good, ok, bad
• Given an input document, predict its ranking category
• How should we formulate this problem?
• A simple multiple class solution
• Each ranking category is an independent class
• But, there is something missing here …
• We miss the ordinal relationship between classes !
Support Vector Machines: Slide 71 Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• Which choice is better?
• How could we formulate this problem?
[Figure: essays from the three categories 'good', 'OK', and 'bad' projected onto two candidate directions w and w'.]
Support Vector Machines: Slide 72 Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• What are the two decision boundaries?
$$\mathbf{w} \cdot \mathbf{x} - b_1 = 0 \quad \text{and} \quad \mathbf{w} \cdot \mathbf{x} - b_2 = 0$$
• What is the margin for ordinal regression?
$$\text{margin}_1(\mathbf{w}, b_1) = \min_{\mathbf{x}_i \in D_g \cup D_o} d(\mathbf{x}_i) = \min_{\mathbf{x}_i \in D_g \cup D_o} \frac{|\mathbf{x}_i \cdot \mathbf{w} - b_1|}{\|\mathbf{w}\|}$$
$$\text{margin}_2(\mathbf{w}, b_2) = \min_{\mathbf{x}_i \in D_o \cup D_b} d(\mathbf{x}_i) = \min_{\mathbf{x}_i \in D_o \cup D_b} \frac{|\mathbf{x}_i \cdot \mathbf{w} - b_2|}{\|\mathbf{w}\|}$$
$$\text{margin}(\mathbf{w}, b_1, b_2) = \min\left(\text{margin}_1(\mathbf{w}, b_1), \; \text{margin}_2(\mathbf{w}, b_2)\right)$$
• Maximize the margin:
$$\{\mathbf{w}^*, b_1^*, b_2^*\} = \operatorname*{argmax}_{\mathbf{w}, b_1, b_2} \; \text{margin}(\mathbf{w}, b_1, b_2)$$
Support Vector Machines: Slide 75 Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• How do we solve this monster?
$$\begin{aligned}
\{\mathbf{w}^*, b_1^*, b_2^*\} &= \operatorname*{argmax}_{\mathbf{w}, b_1, b_2} \; \text{margin}(\mathbf{w}, b_1, b_2) \\
&= \operatorname*{argmax}_{\mathbf{w}, b_1, b_2} \; \min\left(\text{margin}_1(\mathbf{w}, b_1), \; \text{margin}_2(\mathbf{w}, b_2)\right) \\
&= \operatorname*{argmax}_{\mathbf{w}, b_1, b_2} \; \min\left( \min_{\mathbf{x}_i \in D_g \cup D_o} \frac{|\mathbf{x}_i \cdot \mathbf{w} - b_1|}{\|\mathbf{w}\|}, \;\; \min_{\mathbf{x}_i \in D_o \cup D_b} \frac{|\mathbf{x}_i \cdot \mathbf{w} - b_2|}{\|\mathbf{w}\|} \right)
\end{aligned}$$
subject to
$$\forall \mathbf{x}_i \in D_g: \; \mathbf{x}_i \cdot \mathbf{w} - b_1 \ge 0$$
$$\forall \mathbf{x}_i \in D_o: \; b_1 - \mathbf{x}_i \cdot \mathbf{w} \ge 0, \quad \mathbf{x}_i \cdot \mathbf{w} - b_2 \ge 0$$
$$\forall \mathbf{x}_i \in D_b: \; b_2 - \mathbf{x}_i \cdot \mathbf{w} \ge 0$$
Support Vector Machines: Slide 76 Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• The same old trick
• To remove the scaling invariance, set
$$\forall \mathbf{x}_i \in D_g \cup D_o: \; |\mathbf{x}_i \cdot \mathbf{w} - b_1| \ge 1,
\qquad \forall \mathbf{x}_i \in D_o \cup D_b: \; |\mathbf{x}_i \cdot \mathbf{w} - b_2| \ge 1$$
• Now the problem is simplified as:
$$\{\mathbf{w}^*, b_1^*, b_2^*\} = \operatorname*{argmin}_{\mathbf{w}, b_1, b_2} \; \sum_{i=1}^{d} w_i^2$$
subject to
$$\forall \mathbf{x}_i \in D_g: \; \mathbf{x}_i \cdot \mathbf{w} - b_1 \ge 1$$
$$\forall \mathbf{x}_i \in D_o: \; b_1 - \mathbf{x}_i \cdot \mathbf{w} \ge 1, \quad \mathbf{x}_i \cdot \mathbf{w} - b_2 \ge 1$$
$$\forall \mathbf{x}_i \in D_b: \; b_2 - \mathbf{x}_i \cdot \mathbf{w} \ge 1$$
Support Vector Machines: Slide 77 Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• Noisy case: add slack variables, as in the soft-margin SVM
• Is this sufficient?
$$\{\mathbf{w}^*, b_1^*, b_2^*\} = \operatorname*{argmin}_{\mathbf{w}, b_1, b_2} \; \sum_{i=1}^{d} w_i^2
+ c \sum_{\mathbf{x}_i \in D_g} \varepsilon_i + c \sum_{\mathbf{x}_i \in D_o} \varepsilon_i + c \sum_{\mathbf{x}_i \in D_b} \varepsilon_i$$
subject to
$$\forall \mathbf{x}_i \in D_g: \; \mathbf{x}_i \cdot \mathbf{w} - b_1 \ge 1 - \varepsilon_i, \; \varepsilon_i \ge 0$$
$$\forall \mathbf{x}_i \in D_o: \; b_1 - \mathbf{x}_i \cdot \mathbf{w} \ge 1 - \varepsilon_i, \; \mathbf{x}_i \cdot \mathbf{w} - b_2 \ge 1 - \varepsilon_i', \; \varepsilon_i \ge 0, \; \varepsilon_i' \ge 0$$
$$\forall \mathbf{x}_i \in D_b: \; b_2 - \mathbf{x}_i \cdot \mathbf{w} \ge 1 - \varepsilon_i, \; \varepsilon_i \ge 0$$
Support Vector Machines: Slide 78 Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
[Figure: essays from the three categories 'good', 'OK', and 'bad' projected onto the direction w, with the two thresholds b_1 and b_2.]
$$\{\mathbf{w}^*, b_1^*, b_2^*\} = \operatorname*{argmin}_{\mathbf{w}, b_1, b_2} \; \sum_{i=1}^{d} w_i^2
+ c \sum_{\mathbf{x}_i \in D_g} \varepsilon_i + c \sum_{\mathbf{x}_i \in D_o} \varepsilon_i + c \sum_{\mathbf{x}_i \in D_b} \varepsilon_i$$
subject to
$$\forall \mathbf{x}_i \in D_g: \; \mathbf{x}_i \cdot \mathbf{w} - b_1 \ge 1 - \varepsilon_i, \; \varepsilon_i \ge 0$$
$$\forall \mathbf{x}_i \in D_o: \; b_1 - \mathbf{x}_i \cdot \mathbf{w} \ge 1 - \varepsilon_i, \; \mathbf{x}_i \cdot \mathbf{w} - b_2 \ge 1 - \varepsilon_i', \; \varepsilon_i \ge 0, \; \varepsilon_i' \ge 0$$
$$\forall \mathbf{x}_i \in D_b: \; b_2 - \mathbf{x}_i \cdot \mathbf{w} \ge 1 - \varepsilon_i, \; \varepsilon_i \ge 0$$
$$b_1 \ge b_2$$
Support Vector Machines: Slide 79 Copyright © 2001, 2003, Andrew W. Moore
References • An excellent tutorial on VC-dimension and Support
Vector Machines:
C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible: (Not for beginners including myself)
Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience; 1998
• Software: SVM-light, http://svmlight.joachims.org/, free download