Mathematical Programming in Support Vector Machines
Olvi L. Mangasarian
University of Wisconsin - Madison
High Performance Computation for Engineering Systems Seminar
MIT October 4, 2000
What is a Support Vector Machine?
An optimally defined surface
Typically nonlinear in the input space
Linear in a higher dimensional space
Implicitly defined by a kernel function
What are Support Vector Machines Used For?
Classification
Regression & Data Fitting
Supervised & Unsupervised Learning

(Will concentrate on classification)
Example of Nonlinear Classifier: Checkerboard Classifier
Outline of Talk
Generalized support vector machines (SVMs): a completely general kernel allows complex classification (no Mercer condition!)
Smooth support vector machines (SSVM): smooth the SVM and solve it by a fast Newton method
Lagrangian support vector machines (LSVM): a very fast, simple iterative scheme; one matrix inversion, no LP, no QP
Reduced support vector machines (RSVM): handle large datasets with nonlinear kernels
Generalized Support Vector Machines
2-Category Linearly Separable Case

[Figure: the point sets $A+$ and $A-$ separated by the bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, with normal $w$.]
Generalized Support Vector Machines
Algebra of the 2-Category Linearly Separable Case

Given $m$ points in $n$-dimensional space, represented by an $m \times n$ matrix $A$. Membership of each point $A_i$ in class $+1$ or $-1$ is specified by an $m \times m$ diagonal matrix $D$ with $+1$ and $-1$ entries.

Separate by two bounding planes $x'w = \gamma \pm 1$:

$$A_i w \ge \gamma + 1 \ \text{ for } D_{ii} = +1, \qquad A_i w \le \gamma - 1 \ \text{ for } D_{ii} = -1.$$

More succinctly:

$$D(Aw - e\gamma) \ge e,$$

where $e$ is a vector of ones.
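As a concrete illustration (a minimal sketch, not from the talk; the data values are invented), the matrix $D$ and the bounding-plane condition translate directly into MATLAB:

```matlab
% Build D from a label vector d in {+1,-1}^m and test whether a given
% plane (w, gamma) satisfies D(A*w - e*gamma) >= e.
m = 6; n = 2;
A = [2 2; 3 1; 2 3; -2 -1; -3 -2; -1 -3];   % m points as rows
d = [1; 1; 1; -1; -1; -1];                  % class of each point
D = diag(d); e = ones(m,1);
w = [1; 1]; gamma = 0;                      % candidate separating plane
separated = all(D*(A*w - e*gamma) >= e)     % returns 1 here
```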
Generalized Support Vector Machines
Maximizing the Margin between Bounding Planes

[Figure: the bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$ around $A+$ and $A-$; the margin between them is $\frac{2}{\|w\|_2}$.]
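The margin value in the figure is just the distance between the two parallel planes:

$$\text{margin} \;=\; \frac{(\gamma + 1) - (\gamma - 1)}{\|w\|_2} \;=\; \frac{2}{\|w\|_2},$$

so minimizing $\|w\|_2$, as in the formulation on the next slide, maximizes the margin.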
Generalized Support Vector Machines
The Linear Support Vector Machine Formulation

Solve the following mathematical program for some $\nu > 0$:

$$\min_{w,\gamma,y}\ \nu e'y + \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.}\quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0.$$

The nonnegative slack variable $y$ is zero if and only if:

The convex hulls of $A+$ and $A-$ do not intersect
$\nu$ is sufficiently large
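For concreteness, here is a minimal sketch (mine, not the talk's) of solving this program directly with quadprog from MATLAB's Optimization Toolbox; the stacking $z = [w;\ \gamma;\ y]$ is an arbitrary choice. The SSVM and LSVM algorithms later in the talk avoid a general-purpose QP solver entirely.

```matlab
% min nu*e'*y + 1/2*w'*w  s.t.  D(A*w - e*gamma) + y >= e,  y >= 0
[m,n] = size(A); e = ones(m,1);
Hq = blkdiag(eye(n), 0, zeros(m));     % quadratic term: 1/2*w'*w only
f  = [zeros(n+1,1); nu*e];             % linear term: nu*e'*y
Aineq = -[D*A, -D*e, eye(m)];          % -(D(A*w - e*gamma) + y) <= -e
bineq = -e;
lb = [-inf(n+1,1); zeros(m,1)];        % y >= 0; w and gamma free
z = quadprog(Hq, f, Aineq, bineq, [], [], lb);
w = z(1:n); gamma = z(n+1); y = z(n+2:end);
```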
Breast Cancer Diagnosis Application
97% Tenfold Cross-Validation Correctness
780 Samples: 494 Benign, 286 Malignant
Another Application: Disputed Federalist Papers
(Bosch & Smith 1998)

56 Hamilton, 50 Madison, 12 Disputed
Generalized Support Vector Machine Motivation
(Nonlinear Kernel Without Mercer Condition)
Linear SVM, with linear separating surface $x'w = \gamma$:

$$\min_{w,\gamma,y}\ \nu e'y + \|w\|_1 \quad \text{s.t.}\quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0.$$

Set $w = A'Du$. The resulting linear surface is $x'A'Du = \gamma$, and the program becomes:

$$\min_{u,\gamma,y}\ \nu e'y + \|u\|_1 \quad \text{s.t.}\quad D(AA'Du - e\gamma) + y \ge e, \quad y \ge 0.$$

Replace $AA'$ by an arbitrary nonlinear kernel $K(A,A')$. The resulting nonlinear surface is $K(x',A')Du = \gamma$:

$$\min_{u,\gamma,y}\ \nu e'y + \|u\|_1 \quad \text{s.t.}\quad D(K(A,A')Du - e\gamma) + y \ge e, \quad y \ge 0.$$
SSVM: Smooth Support Vector Machine
(SVM as an Unconstrained Minimization Problem)

Changing to the 2-norm and measuring the margin in $(w,\gamma)$ space:

$$\min_{w,\gamma,y}\ \frac{\nu}{2}\,y'y + \frac{1}{2}\,(w'w + \gamma^2) \quad \text{s.t.}\quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0.$$

At a solution, $y = \big(e - D(Aw - e\gamma)\big)_+$, where $(\cdot)_+$ replaces negative components by zeros, so the constraints can be eliminated and the problem becomes an unconstrained minimization in $(w,\gamma)$ whose objective contains the nonsmooth plus function.
Smoothing the Plus Function: Integrate the Sigmoid Function
SSVM: The Smooth Support Vector Machine
Smoothing the Plus Function

Integrating the sigmoid approximation to the step function:

$$s(x,\alpha) = \frac{1}{1+\varepsilon^{-\alpha x}},$$

gives a smooth, excellent approximation to the plus function:

$$p(x,\alpha) = x + \frac{1}{\alpha}\log\big(1+\varepsilon^{-\alpha x}\big), \qquad \alpha > 0.$$

(Here $\varepsilon$ denotes the base of natural logarithms.)

Replacing the plus function in the nonsmooth SVM by the smooth approximation gives our SSVM:

$$\min_{w,\gamma}\ \Phi_\alpha(w,\gamma) := \frac{\nu}{2}\,\big\|p\big(e - D(Aw - e\gamma),\alpha\big)\big\|_2^2 + \frac{1}{2}\,\|(w,\gamma)\|_2^2.$$
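A quick numerical check (not from the slides) of how tightly $p$ tracks the plus function $(x)_+ = \max(x,0)$: the gap peaks at $x = 0$ with value $\log(2)/\alpha$, so it shrinks like $1/\alpha$.

```matlab
p = @(x,alpha) x + log(1+exp(-alpha*x))/alpha;   % smooth plus function
x = linspace(-2,2,401);
for alpha = [1 5 10]
  fprintf('alpha = %2d: max |p - plus| = %.4f\n', ...
          alpha, max(abs(p(x,alpha) - max(x,0))));   % ~ log(2)/alpha
end
```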
Newton: Minimize a sequence of quadratic approximations to the strongly convex objective function, i.e., solve a sequence of linear equations in $n+1$ variables. (Small-dimensional input space.)

Armijo: Shorten the distance between successive iterates so as to generate a sufficient decrease in the objective function. (In computational reality, not needed!)

Global quadratic convergence: Starting from any point, the iterates are guaranteed to converge to the unique solution at a quadratic rate, i.e., errors get squared. (Typically 6 to 8 iterations, without an Armijo step.)
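A bare-bones sketch of the Newton iteration for $\Phi_\alpha$ (the function name, starting point, and stopping test are mine; no Armijo step, matching the remark above). It uses $p'(x,\alpha) = s(x,\alpha)$, which follows from the integral definition of $p$.

```matlab
function [w,gamma] = ssvm_newton(A,D,nu,alpha,itmax)
% Minimize Phi_alpha(w,gamma) by Newton's method: each step solves one
% (n+1)x(n+1) linear system, as described above.
[m,n] = size(A); e = ones(m,1);
w = zeros(n,1); gamma = 0;
J = [-D*A, D*e];                        % Jacobian of r = e - D(A*w - e*gamma)
for it = 1:itmax
  r = e - D*(A*w - e*gamma);
  s = 1./(1+exp(-alpha*r));             % sigmoid = dp/dr
  p = r + log(1+exp(-alpha*r))/alpha;   % smooth plus function p(r,alpha)
  g = nu*(J'*(p.*s)) + [w; gamma];      % gradient of Phi_alpha
  h = s.^2 + alpha*p.*s.*(1-s);         % curvature weights
  Hess = nu*(J'*(h.*J)) + eye(n+1);     % h.*J uses implicit expansion
  step = -(Hess\g);                     % Newton direction
  w = w + step(1:n); gamma = gamma + step(n+1);
  if norm(g) <= 1e-8, break; end
end
```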
SSVM with a Nonlinear Kernel
Nonlinear Separating Surface in Input Space: $K(x',A')Du = \gamma$
Examples of Kernels
Generate Nonlinear Separating Surfaces in Input Space

$A \in R^{m \times n}$, $a \in R^m$, $\mu \in R$, $d$ an integer.

Polynomial kernel (the $d$-th power taken componentwise):

$$(AA' + \mu a a')^{d}$$

Neural network kernel (the step function $\tau(\cdot): R \to \{0,1\}$ applied componentwise):

$$\tau(AA' + \mu a a')$$

Gaussian (radial basis) kernel:

$$\varepsilon^{-\mu \|A_i - A_j\|^2}, \qquad i,j = 1,\dots,m.$$
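All three kernels are easy to realize in MATLAB; here is a vectorized Gaussian kernel in the rectangular form $K(A,B): R^{m\times n} \times R^{n\times l} \to R^{m\times l}$ used later in the talk (the function name gaussker is mine):

```matlab
function K = gaussker(A,B,mu)
% Gaussian kernel: K(i,j) = exp(-mu*||A(i,:)' - B(:,j)||^2)
% A is m-by-n, B is n-by-l, so K is m-by-l (rectangular kernels allowed).
sqA = sum(A.^2,2);                  % m-by-1 squared row norms
sqB = sum(B.^2,1);                  % 1-by-l squared column norms
K = exp(-mu*(sqA + sqB - 2*A*B));   % implicit expansion to m-by-l
```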
LSVM: Lagrangian Support Vector Machine
Dual of SVM

Taking the dual of the SVM formulation:

$$\min_{w,\gamma,y}\ \frac{\nu}{2}\,y'y + \frac{1}{2}\,(w'w + \gamma^2) \quad \text{s.t.}\quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0,$$

gives the following simple dual problem:

$$\min_{0 \le u \in R^m}\ \frac{1}{2}\,u'\Big(\frac{I}{\nu} + D(AA' + ee')D\Big)u - e'u.$$

The variables $(w,\gamma,y)$ of SSVM are related to $u$ by:

$$w = A'Du, \qquad y = \frac{u}{\nu}, \qquad \gamma = -e'Du.$$
LSVM: Lagrangian Support Vector Machine
Dual SVM as a Symmetric Linear Complementarity Problem

Defining the two matrices:

$$H = D[A\ \ {-e}], \qquad Q = \frac{I}{\nu} + HH',$$

reduces the dual SVM to:

$$\min_{0 \le u \in R^m}\ f(u) := \frac{1}{2}\,u'Qu - e'u.$$

The optimality condition for this dual SVM is the LCP:

$$0 \le u \ \perp\ Qu - e \ge 0,$$

which, by Implicit Lagrangian theory, is equivalent to:

$$Qu - e = \big((Qu - e) - \alpha u\big)_+, \qquad \alpha > 0.$$
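This identity also gives a one-line optimality check for any candidate $u$ (a sketch; Q, u, e, and alpha are assumed to be in scope):

```matlab
pl  = @(x) (abs(x)+x)/2;                           % plus function (x)_+
res = norm((Q*u - e) - pl((Q*u - e) - alpha*u));   % zero iff u solves the LCP
```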
LSVM Algorithm
Simple & Linearly Convergent – One Small Matrix Inversion

$$u^{i+1} = Q^{-1}\big(e + ((Qu^i - e) - \alpha u^i)_+\big), \qquad i = 0,1,\dots,$$

where $0 < \alpha < \frac{2}{\nu}$.

Key idea: the Sherman-Morrison-Woodbury formula allows the inversion of an extremely large $m \times m$ matrix $Q$ by merely inverting a much smaller $(n+1) \times (n+1)$ matrix:

$$\Big(\frac{I}{\nu} + HH'\Big)^{-1} = \nu\Big(I - H\Big(\frac{I}{\nu} + H'H\Big)^{-1}H'\Big).$$
LSVM Algorithm – Linear Kernel
11 Lines of MATLAB Code

```matlab
function [it, opt, w, gamma] = svml(A,D,nu,itmax,tol)
% lsvm with SMW for min 1/2*u'*Q*u-e'*u s.t. u=>0,
% Q=I/nu+H*H', H=D[A -e]
% Input: A, D, nu, itmax, tol; Output: it, opt, w, gamma
% [it, opt, w, gamma] = svml(A,D,nu,itmax,tol);
[m,n]=size(A); alpha=1.9/nu; e=ones(m,1); H=D*[A -e]; it=0;
S=H*inv(speye(n+1)/nu+H'*H);              % SMW: only an (n+1)x(n+1) inverse
u=nu*(1-S*(H'*e)); oldu=u+1;
while it<itmax & norm(oldu-u)>tol
  z=(1+pl(((u/nu+H*(H'*u))-alpha*u)-1));  % z = e + ((Q*u-e)-alpha*u)_+
  oldu=u;
  u=nu*(z-S*(H'*z));                      % u = Q^{-1}*z via SMW
  it=it+1;
end;
opt=norm(u-oldu); w=A'*D*u; gamma=-e'*D*u;

function pl = pl(x); pl = (abs(x)+x)/2;   % plus function (x)_+
```
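A hypothetical usage sketch (data, sizes, and parameter values invented for illustration), assuming the code above is saved as svml.m:

```matlab
% Two shifted Gaussian clouds with labels in {+1,-1}.
m = 1000; n = 10;
A = [randn(m/2,n)+0.5; randn(m/2,n)-0.5];
d = [ones(m/2,1); -ones(m/2,1)];
D = spdiags(d,0,m,m);                      % diagonal +/-1 label matrix
[it,opt,w,gamma] = svml(A,D,1,100,1e-5);   % nu=1, itmax=100, tol=1e-5
correctness = mean(sign(A*w - gamma) == d) % fraction on the right side
```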
LSVM Algorithm – Linear Kernel
Computational Results

2 million random points in 10-dimensional space: classified in 6.7 minutes, in 6 iterations, to 1e-5 accuracy, on a 250 MHz UltraSPARC II with 2 gigabytes of memory. CPLEX ran out of memory on the same problem.

32562 points in 123-dimensional space (UCI Adult dataset): classified in 141 seconds and 55 iterations to 85% correctness on a 400 MHz Pentium II with 2 gigabytes of memory. SVMlight classified the same dataset in 178 seconds and 4497 iterations.
LSVM – Nonlinear Kernel
Formulation

For the nonlinear kernel:

$$K(A,B): R^{m\times n} \times R^{n\times l} \to R^{m\times l},$$

the separating nonlinear surface is given by:

$$K\left([x'\ \ {-1}],\ \begin{bmatrix} A' \\ -e' \end{bmatrix}\right)Du = 0,$$

where $u$ is the solution of the dual problem:

$$\min_{0 \le u \in R^m}\ f(u) := \frac{1}{2}\,u'Qu - e'u,$$

with $Q$ redefined as:

$$G = [A\ \ {-e}], \qquad Q = \frac{I}{\nu} + DK(G,G')D.$$
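A hedged sketch of this nonlinear iteration built on the gaussker helper above (parameter values are illustrative). Note that $Q$ is now a full $m \times m$ matrix, so Sherman-Morrison-Woodbury no longer helps and $m$ must stay moderate; RSVM, below, addresses exactly this.

```matlab
[m,n] = size(A); e = ones(m,1);
mu = 1; nu = 1; alpha = 1.9/nu; itmax = 100;
G = [A -e];
Q = eye(m)/nu + D*gaussker(G,G',mu)*D;     % m-by-m: no SMW shortcut here
u = Q\e; oldu = u + 1; it = 0;
while it < itmax && norm(oldu-u) > 1e-5
  oldu = u;
  u = Q\(e + max((Q*u - e) - alpha*u, 0)); % same LSVM iteration as before
  it = it + 1;
end
% classify a test point x (column vector) on the surface K([x' -1],G')Du = 0:
% sign(gaussker([x' -1], G', mu)*D*u)
```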
LSVM Algorithm – Nonlinear Kernel Application
100 Iterations, 58 Seconds on a Pentium II, 95.9% Accuracy
Reduced Support Vector Machines (RSVM)
Large Nonlinear Kernel Classification Problems
Key idea: use a rectangular kernel $K(A, \bar{A}')$, where $\bar{A}'$ is a small random sample of $A'$. Typically $\bar{A}$ has 1% to 10% of the rows of $A$.

Two important consequences:

RSVM can solve very large problems
The nonlinear separator depends only on $\bar{A}$

(The small square kernel $K(\bar{A},\bar{A}')$, i.e., a conventional SVM trained on the sample alone, gives lousy results; see the checkerboard comparison below.)

$$\min_{\bar{u},\gamma,y}\ \frac{\nu}{2}\,y'y + \frac{1}{2}\,(\bar{u}'\bar{u} + \gamma^2) \quad \text{s.t.}\quad D\big(K(A,\bar{A}')\bar{D}\bar{u} - e\gamma\big) + y \ge e, \quad y \ge 0.$$

Separating surface: $K(x', \bar{A}')\bar{D}\bar{u} = \gamma$.
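A small sketch of the reduction step (the sampling rate and names are mine), using the gaussker helper from earlier:

```matlab
% Sample 10% of the rows of A, then form the rectangular kernel K(A,Abar').
[m,n] = size(A);
mbar = ceil(0.1*m);
idx  = randperm(m); idx = idx(1:mbar);
Abar = A(idx,:); Dbar = D(idx,idx);   % sampled rows and their labels
K = gaussker(A, Abar', mu);           % m-by-mbar rectangular kernel
% The formulation above is then solved with K*Dbar in place of the full
% m-by-m kernel, so only mbar multipliers ubar enter the problem.
```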
Conventional SVM Result on Checkerboard Using 50 Random Points Out of 1000
RSVM Result on Checkerboard Using SAME 50 Random Points Out of 1000
RSVM on Large Classification Problems
Standard Error over 50 Runs = 0.001 to 0.002
RSVM Time = 1.24 × (Random Points Time)
Conclusion
Mathematical programming plays an essential role in SVMs:

Theory
New formulations: generalized SVMs
New algorithm-generating concepts: smoothing (SSVM), implicit Lagrangian (LSVM)

Algorithms
Fast: SSVM
Massive: LSVM, RSVM
Future Research
Theory
Concave minimization
Concurrent feature & data selection
Multiple-instance problems
SVMs as complementarity problems

Algorithms
Multicategory classification algorithms
Kernel methods in nonlinear programming
Chunking for massive classification: $10^8$ points
Talk & Papers Available on Web
www.cs.wisc.edu/~olvi