
Page 1: Support Vector Machines

Andrei Alexandrescu

June 19, 2007

Page 2: Introduction

Page 3: Metaphor #1

■ Imagine a single perceptron
■ Linear separation
■ Train it to draw the separation hyperplane:
  ◆ Maximize d+ + d−, where
  ◆ d+ is the distance to the closest positive point
  ◆ d− is the distance to the closest negative point

You’ve got an SVM.

Page 4: Metaphor #2

■ Imagine creating a highway:
■ Straight
■ Trees on the left
■ Rocks on the right
■ Farthest from closest tree
■ Farthest from closest rock

You need an SVM.

Page 5: Metaphor #3

■ Imagine you want to support a metal sheet with coil springs
■ Coil springs are attached to fixed points
■ They push from two different sides
■ For the metal sheet to be supported, the forces from all springs must cancel out

The sheet will be, well, supported along an SVM plane.

Page 6: Background: Structural Risk Minimization

Page 7: Capacity and Generalization

■ Generalization: Figure out similarities between already-seen data and new data
  ◆ Too much: “Square piece of paper? That’s a $100 bill”
■ Capacity: Ability to allocate new categories for data
  ◆ Too much: “#L26118670? It’s a fake; all $100 bills I’ve seen had other serial numbers”
■ The two compete with one another
■ How to strike the right balance?

Page 8: Empirical Risk

■ We are given l observations 〈xi, yi〉
  ◆ xi ∈ Rn
  ◆ yi ∈ {−1, 1}
■ Learn y = f(x, α) by tuning α
■ Expected test error (risk) and empirical risk:

R(\alpha) = \frac{1}{2} \int |y - f(x, \alpha)| \, dP(x, y)    (1)

R_{\mathrm{emp}}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|    (2)
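To make Eq. (2) concrete, here is a minimal numpy sketch; the toy labels below are made up for illustration:

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """R_emp from Eq. (2): sum of |y_i - f(x_i)| over l samples, divided by 2l.

    With labels in {-1, +1}, |y - f| / 2 is 0 on a correct prediction
    and 1 on an error, so this is just the 0/1 training error."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.abs(y_true - y_pred).sum() / (2 * len(y_true))

# Hypothetical example: 3 of 4 predictions correct -> R_emp = 0.25
print(empirical_risk([+1, -1, +1, -1], [+1, -1, -1, -1]))  # 0.25
```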

Page 9: Risk Bound

■ For 0/1 loss and 0 < η < 1, the following holds with probability at least 1 − η:

R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h \left(1 + \log \frac{2l}{h}\right) - \log \frac{\eta}{4}}{l}}    (3)

where h ∈ N is the Vapnik-Chervonenkis (VC) dimension
■ Second term: the “VC confidence”
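A quick sketch evaluating the VC confidence term of Eq. (3); the values of l, h, and η below are hypothetical:

```python
from math import log, sqrt

def vc_confidence(l, h, eta):
    """Second term of Eq. (3): sqrt((h(1 + log(2l/h)) - log(eta/4)) / l)."""
    return sqrt((h * (1 + log(2 * l / h)) - log(eta / 4)) / l)

# Hypothetical: 10,000 samples, VC dimension 50, eta = 0.05
print(vc_confidence(l=10_000, h=50, eta=0.05))  # ~0.19
# The term shrinks as l grows and grows with h: capacity costs us.
```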

Page 10: Importance of Risk Bound

1. Not dependent on P(x, y)
2. lhs not computable
3. rhs computable if we know h

■ For a given task, choose the machine that minimizes the risk bound!
■ Even when the bound is not tight, we can contrast the “tightness” of various families of machines

Page 11: The VC dimension

■ For a family of functions f(α):
  ◆ Choose a set of l points
  ◆ Label them in any way
  ◆ ∃α s.t. f(α) can recognize (“shatter”) them
■ Then f(α) has VC dimension at least l
■ The VC dimension h is the largest such l

Page 12: Example: Hyperplanes in Rn

■ Choosing 4 planar points:
  ◆ they can’t be separated by one line for all of their possible labelings (one labeling will be inseparable)
■ Similarly, n + 2 points in Rn can’t be separated for all labelings
■ So the VC dimension of hyperplanes in Rn is n + 1
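To see the shattering argument computationally, here is a brute-force sketch of my own (not from the slides): it tests every ±1 labeling for linear separability with a small feasibility LP, assuming numpy and scipy are available.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Feasibility of y_i (w . x_i + b) >= 1 as an LP in the variables (w, b)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = X.shape[1]
    # Variables z = (w, b); constraints -y_i (x_i . w + b) <= -1.
    A = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(n + 1), A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (n + 1))
    return res.success

def shattered(X):
    """True if every +/-1 labeling of the points in X is linearly separable."""
    return all(separable(X, y)
               for y in itertools.product([-1, 1], repeat=len(X)))

print(shattered([[0, 0], [1, 0], [0, 1]]))          # True: 3 points shatter
print(shattered([[0, 0], [0, 1], [1, 0], [1, 1]]))  # False: XOR labeling fails
```

Three non-collinear points pass; the four corners fail on the XOR-style labeling, matching a VC dimension of 3 for lines in the plane.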

Page 13: Not a Hard and Fast Formula

■ Consider a nearest-neighbor (NN) classifier
■ VC dimension is infinite
■ Remp = 0
■ The bound is irrelevant, yet the NN classifier can perform well in many situations

Page 14: Corollary

We’d like to find a machine able to zero the empirical risk (sufficient capacity) and minimize the VC dimension (capacity not wastefully large).

Page 15: The SVM Connection

Page 16: Linear SVMs

■ Training data {xi, yi}, i = 1, . . . , l, with xi ∈ Rn and yi ∈ {−1, 1}
■ On a separating hyperplane: xw + b = 0, where
  ◆ w is normal to the hyperplane
  ◆ |b| / ‖w‖ is the hyperplane’s distance to the origin
  ◆ ‖w‖ is the Euclidean norm of w

Page 17: Linear SVMs (cont’d)

■ d+, d−: the shortest distances from the positive and negative points to the hyperplane
■ Define the margin m = d+ + d−
■ Task: find the separating hyperplane that maximizes m

Key point: maximizing the margin minimizes the VC dimension.

Page 18: Computation

■ For the separating plane:

x_i w + b \ge +1 \quad \text{for } y_i = +1    (4)

x_i w + b \le -1 \quad \text{for } y_i = -1    (5)

\equiv    (6)

y_i (x_i w + b) - 1 \ge 0, \; \forall i    (7)

■ The closest points on each side satisfy the equalities, i.e. |xi w + b| = 1, so each lies at distance 1/‖w‖ from the plane:

d_+ + d_- = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}    (8)
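A numeric sanity check of Eq. (8), with a hypothetical canonical (w, b) chosen purely for illustration:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -2.0      # hypothetical canonical hyperplane
norm = np.linalg.norm(w)               # ||w|| = 5

# Closest points: any x with x.w + b = +1 (positive) or -1 (negative).
x_pos = np.array([1.0, 0.0])           # 3*1 + 4*0 - 2 = +1
x_neg = np.array([1.0 / 3.0, 0.0])     # 3*(1/3) + 0 - 2 = -1

d_plus = abs(x_pos @ w + b) / norm     # 1/||w|| = 0.2
d_minus = abs(x_neg @ w + b) / norm    # 1/||w|| = 0.2
print(d_plus + d_minus, 2 / norm)      # both 0.4: the margin 2/||w||
```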

Page 19: Switching to Lagrangian

■ One coefficient per training sample
■ The constraints become easier to handle
■ Training data appears only in dot products
■ Great for applying the kernel trick later on

Page 20: Lagrangian Form

■ Minimize

L_P = \frac{\|w\|^2}{2} - \sum_{i=1}^{l} \alpha_i y_i (x_i w + b) + \sum_{i=1}^{l} \alpha_i    (9)

■ Convex quadratic programming problem with the dual: maximize

L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)    (10)

subject to αi ≥ 0 and Σi αi yi = 0
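As a sketch of what solving the dual (10) involves, here is a toy problem handed to scipy's general-purpose SLSQP routine; real SVM trainers use dedicated QP/SMO solvers, and the 2-D data below is made up:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j (x_i . x_j)

# Maximize L_D  <=>  minimize -L_D, s.t. alpha_i >= 0, sum_i alpha_i y_i = 0
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
               x0=np.zeros(len(y)),
               jac=lambda a: G @ a - np.ones(len(y)),
               bounds=[(0, None)] * len(y),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
               method='SLSQP')
alpha = res.x

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                        # index of one support vector
b = y[sv] - w @ X[sv]                        # from y_s (x_s . w + b) = 1
print(alpha.round(3), w.round(3), round(b, 3))
```

The αi that come back nonzero flag the support vectors discussed on the next slide.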

Page 21: The Support Vectors

■ The points with αi > 0 are the support vectors
■ The solution depends only on them
■ All others have αi = 0 and can be moved arbitrarily far from the decision hyperplane, or removed

Page 22: Testing

■ Once the hyperplane is found:

y = sgn(wx + b) (11)

Page 23: Unseparable Data

■ Add slack variables ξi ≥ 0:

x_i w + b \ge +1 - \xi_i \quad \text{for } y_i = +1    (12)

x_i w + b \le -1 + \xi_i \quad \text{for } y_i = -1    (13)

■ The Lagrangian formulation is only influenced by an upper bound C on the αi
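In practice the bound C is the soft-margin hyperparameter. A hedged scikit-learn sketch (assuming the library is available; the overlapping blobs are synthetic) showing that a small C caps the αi and typically recruits more support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical overlapping classes: two noisy Gaussian blobs
X = np.vstack([rng.normal(+1, 1.0, (50, 2)), rng.normal(-1, 1.0, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

for C in (100.0, 0.01):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors")
# Small C caps the alphas, so more points end up with alpha_i at the bound.
```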

Page 24: Nonlinear SVMs

Page 25: Space Transformation (RKHS)

■ Take points from Rd to some space H:

\Phi : \mathbb{R}^d \to \mathcal{H}    (14)

■ Choose a kernel function K such that

K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)    (15)

■ Since in the Lagrangian formulation we only have the xi in dot products (remember?), we don’t even need to know Φ!
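Eq. (15) can be verified by hand for the degree-2 polynomial kernel on R2, where an explicit Φ is a standard textbook construction (the map below is my illustration, not from the slides):

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, y) = (x . y)^2 on R^2:
    Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) in R^3."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z) ** 2, phi(x) @ phi(z))  # both 1.0: kernel == dot product in H
```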

Page 26: Kernel Example

■ Gaussian kernel:

K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}    (16)

■ Polynomial:

K(x_i, x_j) = (x_i \cdot x_j)^p    (17)

■ Kinda sorta neural net!

K(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j - \delta)    (18)
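All three kernels are one-liners; a small sketch with illustrative default parameters:

```python
import numpy as np

def gaussian(xi, xj, sigma=1.0):            # Eq. (16)
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def polynomial(xi, xj, p=2):                # Eq. (17)
    return (xi @ xj) ** p

def sigmoid(xi, xj, kappa=1.0, delta=0.0):  # Eq. (18)
    return np.tanh(kappa * (xi @ xj) - delta)
```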

Page 27: Using Kernels

■ Just replace xi·xj with K(xi, xj) everywhere and the magic is complete
■ Training is identical and takes similar time
■ Separation is still linear, but in a different space (possibly infinite-dimensional!)
■ Same sleight of hand for testing, where s1, . . . , sNs are the support vectors:

y = \operatorname{sgn}\left( \sum_{i=1}^{N_s} \alpha_i y_i K(s_i, x) + b \right)    (19)
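As a sanity check of Eq. (19), here is a sketch reconstructing a trained RBF SVM's decision values from its support vectors; it assumes scikit-learn, whose dual_coef_ attribute stores the products αi yi, and uses made-up data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])           # hypothetical XOR-ish nonlinear labels

clf = SVC(kernel='rbf', gamma=0.5).fit(X, y)

x_new = rng.normal(size=(5, 2))
K = rbf_kernel(x_new, clf.support_vectors_, gamma=0.5)    # K(x, s_i)
manual = K @ clf.dual_coef_[0] + clf.intercept_[0]        # Eq. (19) before sgn
print(np.allclose(manual, clf.decision_function(x_new)))  # True
```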

Page 28: Loose Ends. Conclusions

Page 29: SVM For Multiple Classes

■ Build n “one-versus-all” classifiers
■ Essentially costs n times the complexity of one classifier
  ◆ During testing choose the most confident answer
■ Build n(n−1)/2 “one-versus-one” classifiers
  ◆ Decide by voting where the data belongs
  ◆ Many classifiers, but little data for each
■ DAGSVM (Platt): Same training time, n times testing time
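Both reductions come prepackaged in scikit-learn; a brief sketch (assuming the library and its bundled digits dataset are available):

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)     # 10 classes, so n = 10
ova = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)  # n classifiers
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)   # n(n-1)/2 classifiers
print(len(ova.estimators_), len(ovo.estimators_))          # 10 and 45
```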

Page 30: Soft Outputs

■ sgn is discontinuous; we want to restore confidence
■ SVM outputs can be mapped to posterior probabilities (Platt)
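Platt's method fits a sigmoid to the raw SVM scores; scikit-learn exposes it through probability=True. A sketch on synthetic data (assuming the library is available):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1, 1.0, (50, 2)), rng.normal(-1, 1.0, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

clf = SVC(kernel='linear', probability=True).fit(X, y)  # internal Platt fit
print(clf.predict_proba(X[:3]))  # soft posteriors instead of hard sgn outputs
```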

Page 31: Conclusions

■ Powerful theoretical grounds
■ Global, unique solution
■ Performance depends on choice of kernel and parameters
  ◆ Still a research topic
■ Training is memory-intensive; chunking must be used
■ Complexity dependent on the # of support vectors