TDT4173 Machine Learning and Case-Based Reasoning
Lecture 6 – Support Vector Machines. Ensemble Methods
Helge Langseth and Agnar Aamodt
NTNU – IDI, Section for Intelligent Systems
Outline
1 Wrap-up from last time
2 Ensemble methods: Background, Bagging, Boosting
3 Support Vector Machines: Background, Linear separators, The dual problem, Non-separable subspaces, Nonlinearity and kernels
TDT4173 Machine Learning
Support Vector Machines
TDT4173 Machine Learning and CBR
Support Vector Machines (SVMs)
Kernel Methods
Paper by Bennett and Campbell
Support Vector Machines Background
Description of the task
Data:
1 We have a set of data D = {(x1, y1), . . . , (xm, ym)}. The instances are described by xi, the class by yi.
2 The data is generated by some unknown probability distribution P(x, y).
Task:
1 Be able to “guess” y at a new location x.
2 For SVMs one typically states this as “find an unknown function f(x) that estimates y at x.”
3 Note! In this lesson we look at binary classification, and let y ∈ {−1, +1} denote the classes.
4 We will look for linear functions, i.e., f(x) = b + wTx ≡ b + ∑_{i=1}^{n} wi · xi, where n is the dimension of x.
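As a quick illustration, a linear decision function of this form takes only a few lines of numpy. The weight vector, bias, and test point below are hypothetical values chosen for illustration, not values learned from data:

```python
import numpy as np

# Hypothetical weights and bias (not learned from data).
w = np.array([2.0, -1.0])
b = 0.5

def f(x):
    """Linear decision function f(x) = b + w^T x; its sign is the class."""
    return b + w @ x

x_new = np.array([1.0, 0.25])
y_hat = 1 if f(x_new) >= 0 else -1   # predicted class in {-1, +1}
```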
Support Vector Machines Linear separators
How to find the best linear separator
[Figure series omitted: a linearly separable data set in the plane, with candidate separating lines]
We are looking for a linear separator for this data.
There are so many solutions. . .
But only one is considered the “best”!
SVMs are called “large margin classifiers”. . .
. . . and the data points touching the margin lines are the support vectors.
Support Vector Machines Linear separators
The geometry of the problem
[Figure omitted: the normal vector w, the separating hyperplane {x : b + wTx = 0}, and the margin hyperplanes {x : b + wTx = −1} and {x : b + wTx = +1}]
Note! Since one margin hyperplane has b + wTx = −1 and the other has b + wTx = +1, the distance between them is 2/‖w‖.
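The 2/‖w‖ claim is easy to verify numerically: take the points where the normal direction crosses each margin hyperplane and measure the gap between them. The vector w below is an arbitrary example:

```python
import numpy as np

w = np.array([3.0, 4.0])   # arbitrary example normal vector
b = 0.0

# Points where the line along w (through the origin) crosses the two
# margin hyperplanes b + w^T x = +1 and b + w^T x = -1.
x_plus = (1 - b) * w / (w @ w)
x_minus = (-1 - b) * w / (w @ w)

gap = np.linalg.norm(x_plus - x_minus)   # equals 2 / ||w|| = 0.4 here
```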
Support Vector Machines Linear separators
An optimisation problem
Optimisation criteria:
The distance between the margins is 2/‖w‖, so that is what we want to maximise.
Equivalently, we can minimise ‖w‖/2.
For simplicity of the mathematics, we will rather minimise ‖w‖²/2.
Constraints:
The margin separates all data observations correctly:
b + wTxi ≤ −1 for yi = −1.
b + wTxi ≥ +1 for yi = +1.
Alternative (equivalent) constraint set: yi(b + wTxi) ≥ 1
Support Vector Machines Linear separators
An optimisation problem (2)
Mathematical Programming Setting:
Combining the above requirements we obtain
minimize wrt. w and b:   (1/2) ‖w‖²
subject to yi(b + wTxi) − 1 ≥ 0, i = 1, . . . , m
Properties:
Problem is convex
Hence it has a unique minimum
Efficient algorithms for solving it exist
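As a sketch of how small this problem is in practice, the primal can be handed directly to a general-purpose solver. The four-point toy data set and the use of scipy's SLSQP method are my own choices for illustration, not part of the lecture:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data.
X = np.array([[0., 0.], [0., 1.], [2., 0.], [2., 1.]])
y = np.array([-1., -1., 1., 1.])

def objective(v):
    w = v[:2]                  # v = (w1, w2, b)
    return 0.5 * (w @ w)       # minimise ||w||^2 / 2

# One constraint y_i (b + w^T x_i) - 1 >= 0 per data point.
constraints = [{'type': 'ineq',
                'fun': lambda v, i=i: y[i] * (v[2] + v[:2] @ X[i]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]     # expect w close to (1, 0), b close to -1
```

For this data the separating line is x1 = 1 with margin width 2/‖w‖ = 2.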
Support Vector Machines The dual problem
The dual problem – and the convex hull
The convex hull of {xj}:
The convex hull of {xj} is the smallest subset of the instance space that
is convex, and
contains all elements of {xj}.
Find it by drawing lines between all the xj and choosing the “outermost boundary”.
Support Vector Machines The dual problem
The dual problem – and the convex hull (2)
Look at the pair of points that are closest across the two convex hulls. The decision line must be orthogonal to the line between these two closest points.
[Figure omitted: the closest points c and d of the two convex hulls]
So, we want to minimise ‖c − d‖. Here c can be written as a weighted sum of all elements in the first class: c = ∑_{yi=Class 1} αi xi, and similarly for d.
Support Vector Machines The dual problem
The dual problem – and the convex hull (3)
Minimising ‖c − d‖ is (modulo a constant) equivalent to this formulation:

minimize wrt. α:   (1/2) ∑_{i=1}^{m} ∑_{j=1}^{m} αi αj yi yj xiTxj − ∑_{i=1}^{m} αi
subject to ∑_{i=1}^{m} yi αi = 0 and αi ≥ 0, i = 1, . . . , m
Properties:
Problem is convex, hence has unique minimum.
Quadratic programming problem – known solution method.
At the solution, αi > 0 only if xi is a support vector.
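A minimal sketch of solving this dual numerically, again with a hypothetical four-point data set and scipy's SLSQP solver standing in for a dedicated quadratic-programming routine:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [0., 1.], [2., 0.], [2., 1.]])   # toy data
y = np.array([-1., -1., 1., 1.])
G = np.outer(y, y) * (X @ X.T)    # G[i, j] = y_i y_j x_i^T x_j

def dual(a):
    """Dual objective: (1/2) a^T G a - sum(a)."""
    return 0.5 * a @ G @ a - a.sum()

res = minimize(dual, x0=np.full(4, 0.1),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
               bounds=[(0.0, None)] * 4)
alpha = res.x
w = (alpha * y) @ X               # primal weights recovered from alpha
```

The recovered w should match the primal solution for this data (close to (1, 0)), even though several α vectors can attain the optimum.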
Support Vector Machines The dual problem
“Theoretical foundation”
1 Formal proofs of SVM properties are available (but out of scope for us).
2 Large margins are smart: if there are small variations in x, we will still classify correctly.
3 There are many “skinny” margin planes, but only one “fattest” plane; the large-margin solution is thus more robust.
Support Vector Machines Non-separable subspaces
What if the convex hulls are overlapping?
If the convex hulls are overlapping, we cannot find a linear separator.
To handle this, we optimise a criterion where we maximise the distance between the lines minus a penalty for misclassifications.
This is equivalent to scaling the convex hulls, and doing as before on the reduced convex hulls.
Support Vector Machines Non-separable subspaces
What if the convex hulls are overlapping? (2)
The problem with scaling is (modulo a constant) equivalent to this formulation:

minimize wrt. α:   (1/2) ∑_{i=1}^{m} ∑_{j=1}^{m} αi αj yi yj xiTxj − ∑_{i=1}^{m} αi
subject to ∑_{i=1}^{m} yi αi = 0 and C ≥ αi ≥ 0, i = 1, . . . , m
Properties:
Problem as before, but C introduces the scaling; this is equivalent to incurring a cost for misclassifications.
Still solvable using “standard” methods.
Demo: Different values of C: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
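The same penalty idea can be sketched from the primal side: minimise ‖w‖²/2 plus C times the sum of hinge losses, here with plain subgradient descent on synthetic overlapping clusters. Everything below (the data, learning rate, and iteration count) is a hypothetical illustration, not the lecture's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two overlapping Gaussian clusters (hypothetical data).
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(2000):
    viol = y * (X @ w + b) < 1.0          # points violating the margin
    # Subgradient of ||w||^2 / 2 + C * sum of hinge losses.
    w -= lr * (w - C * (y[viol, None] * X[viol]).sum(axis=0))
    b -= lr * (-C * y[viol].sum())

accuracy = np.mean(np.sign(X @ w + b) == y)
```

Because the clusters overlap, some training points stay misclassified; the penalty C controls how hard the optimiser tries to fix them.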
Support Vector Machines
Statistical Learning Theory
• Misclassification error and the function complexity bound generalization error.
• Maximizing margins minimizes complexity.
• “Eliminates” overfitting.
• Solution depends only on Support Vectors, not number of attributes.
Support Vector Machines Nonlinearity and kernels
Nonlinear problems – when scaling does not make sense
The problem is difficult to solve when x = (r, s) has only two dimensions. . .
. . . but if we blow it up to five dimensions, θ(x) = {r, s, rs, r², s²}, i.e. “invent” the mapping θ(·) : R² ↦ R⁵, and try to find the linear separator in R⁵, then everything is OK.
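A small numerical check of this idea, using points inside and outside the unit circle (a hypothetical example): no line separates them in R², but after mapping with θ, the linear function with w = (0, 0, 0, 1, 1) and b = −1 does:

```python
import numpy as np

def theta(x):
    """The map theta(x) = {r, s, rs, r^2, s^2} from R^2 to R^5."""
    r, s = x
    return np.array([r, s, r * s, r * r, s * s])

inside = [np.array(p) for p in [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.4)]]
outside = [np.array(p) for p in [(1.5, 0.0), (0.0, -1.6), (1.1, 1.1)]]

w5, b5 = np.array([0., 0., 0., 1., 1.]), -1.0   # linear separator in R^5
f5 = lambda x: b5 + w5 @ theta(x)               # negative inside, positive outside
```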
Support Vector Machines Nonlinearity and kernels
Solving the problem in higher dimensions
We solve this as before, but remembering to look in the higher dimension:

minimize wrt. α:   (1/2) ∑_{i=1}^{m} ∑_{j=1}^{m} αi αj yi yj θ(xi)Tθ(xj) − ∑_{i=1}^{m} αi
subject to ∑_{i=1}^{m} yi αi = 0 and C ≥ αi ≥ 0, i = 1, . . . , m
Note that:
We do not need to evaluate θ(x) directly, only θ(xi)Tθ(xj).
If we find a “clever way” of evaluating θ(xi)Tθ(xj) (i.e., independent of the size of the target space), we can solve the problem easily, without even thinking about what θ(x) even means.
We define K(xi, xj) = θ(xi)Tθ(xj), and focus on finding K(·, ·) instead of the mapping. K is called a kernel.
Support Vector Machines Nonlinearity and kernels
Kernel functions
θ(x)                       K(xi, xj)
Degree-d polynomial        (xiTxj + 1)^d
Radial Basis Functions     exp(−‖xi − xj‖² / (2σ²))
Two-layer Neural Network   sigmoid(η · xiTxj + c)
Different kernels have different properties; finding the “right” kernel is a difficult task, and kernels can be hard to visualise.
Example: The RBF kernel (implicitly) uses an infinite-dimensional representation for θ(·).
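For the polynomial kernel the “clever way” can be checked directly: in R², the degree-2 kernel (xTz + 1)² equals the ordinary inner product of an explicit six-dimensional feature map. The test points below are arbitrary:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Degree-d polynomial kernel."""
    return (x @ z + 1.0) ** d

def theta2(x):
    """Explicit feature map whose inner product matches the d=2 kernel in R^2."""
    r, s = x
    c = np.sqrt(2.0)
    return np.array([1.0, c * r, c * s, r * r, s * s, c * r * s])

x = np.array([0.5, -1.0])    # arbitrary test points
z = np.array([2.0, 0.3])
same = np.isclose(poly_kernel(x, z), theta2(x) @ theta2(z))
```

The kernel computes the six-dimensional inner product without ever building the six-dimensional vectors, which is the whole point.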
Support Vector Machines Nonlinearity and kernels
SVMs: Algorithmic summary
Select the parameter C (tradeoff between minimising training set error and maximising the margin).
Select kernel function, and associated parameters (e.g., σ forRBF).
Solve the optimisation problem using quadratic programming.
Find the value b by using the support vectors.
Classify a new point x using f(x) = sign{ ∑_{i=1}^{m} yi αi K(x, xi) − b }
Demo: Different kernels: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
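The whole recipe can be sketched end to end on a toy problem. The data, the choice of a linear kernel, and the use of scipy's SLSQP in place of a dedicated QP solver are all illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [0., 1.], [2., 0.], [2., 1.]])   # toy data
y = np.array([-1., -1., 1., 1.])
C = 10.0                           # step 1: pick C
K = X @ X.T                        # step 2: kernel matrix, linear kernel here
G = np.outer(y, y) * K

# Step 3: solve the dual by quadratic programming (generic solver here).
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(), x0=np.full(4, 0.1),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
               bounds=[(0.0, C)] * 4)
alpha = res.x

# Step 4: find b from a support vector s, matching f(x) = sign{sum - b}.
s = int(np.argmax(alpha))
b = (alpha * y) @ K[s] - y[s]

# Step 5: classify a new point.
def classify(x_new):
    return np.sign((alpha * y) @ (X @ x_new) - b)
```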
Support Vector Machines
TDT4173 Machine Learning and CBR, October 2–4, 2000, M2000
SVM Extensions
• Regression
• Variable Selection
• Boosting
• Density Estimation
• Unsupervised Learning
– Novelty/Outlier Detection
– Feature Detection
– Clustering
Support Vector Machines
Many Other Applications
• Speech Recognition
• Database Marketing
• Quark Flavors in High Energy Physics
• Dynamic Object Recognition
• Knock Detection in Engines
• Protein Sequence Problem
• Text Categorization
• Breast Cancer Diagnosis
• See: http://www.clopinet.com/isabelle/Projects/SVM/applist.html
Support Vector Machines
Hallelujah!
• Generalization theory and practice meet
• General methodology for many types of problems
• Same Program + New Kernel = New method
• No problems with local minima
• Few model parameters. Selects capacity.
• Robust optimization methods.
• Successful Applications
BUT…
Support Vector Machines
HYPE?
• Will SVMs beat my best hand-tuned method Z for X?
• Do SVMs scale to massive data sets?
• How to choose C and the kernel?
• What is the effect of attribute scaling?
• How to handle categorical variables?
• How to incorporate domain knowledge?
• How to interpret the results?