
Page 1:

The value of the kernel function represents the inner product of two training points in feature space.

Kernel functions merge two steps:
1. Map the input data from input space to feature space (which might be infinite-dimensional).
2. Do the inner product in the feature space.

Kernel Technique: Based on Mercer's Condition (1909)

Page 2:

A Simple Example of a Kernel

Polynomial kernel of degree 2: $K(x, z) = \langle x, z \rangle^2$.

Let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},\; z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \in \mathbb{R}^2$ and the nonlinear map $\phi : \mathbb{R}^2 \mapsto \mathbb{R}^3$ be defined by

$$\phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix}.$$

Then $\langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.

There are many other nonlinear maps $\psi(x)$ that satisfy the relation $\langle \psi(x), \psi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.
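As a quick sanity check of this identity, here is a minimal NumPy snippet; the example vectors and helper names are illustrative, not part of the original slides:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = <x, z>^2."""
    return np.dot(x, z) ** 2

# Any x, z in R^2 work; these values are arbitrary.
x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = np.dot(phi(x), phi(z))   # inner product in feature space
rhs = poly2_kernel(x, z)       # kernel value computed in input space
print(lhs, rhs)                # both equal 1.0 for these points
assert np.isclose(lhs, rhs)
```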

Page 3:

Power of the Kernel Technique

Consider a nonlinear map $\phi : \mathbb{R}^n \mapsto \mathbb{R}^p$ that consists of distinct features for all the monomials of degree $d$. Then $p = \binom{n + d - 1}{d}$.

For example: $n = 11,\; d = 10,\; p = 92378$.

Is it necessary to form $\phi$ explicitly? We only need to know $\langle \phi(x), \phi(z) \rangle$! This can be achieved by

$$K(x, z) = \langle x, z \rangle^d.$$
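The combinatorial count can be checked numerically. The sketch below (hypothetical helper names) evaluates the closed form with `math.comb` and, for small cases, verifies it against brute-force enumeration of the monomials; the point of the slide stands, since evaluating $(x \cdot z)^d$ never touches the $p$-dimensional feature space:

```python
from math import comb
from itertools import combinations_with_replacement

def monomial_count(n, d):
    """Count distinct monomials of degree d in n variables by enumeration."""
    return sum(1 for _ in combinations_with_replacement(range(n), d))

def feature_dim(n, d):
    """Closed form quoted on the slide: C(n + d - 1, d)."""
    return comb(n + d - 1, d)

# Sanity check on small cases where enumeration is cheap.
for n, d in [(2, 2), (3, 3), (5, 4)]:
    assert monomial_count(n, d) == feature_dim(n, d)

print(feature_dim(11, 10))  # huge even for modest n and d, yet
# K(x, z) = (x . z) ** d needs only one length-n inner product.
```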

Page 4:

More Examples of Kernels

$$K(A, B) : \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times l} \mapsto \mathbb{R}^{m \times l}$$

Let $A \in \mathbb{R}^{m \times n},\; a \in \mathbb{R}^m,\; \mu \in \mathbb{R}$, and let $d$ be an integer:

Polynomial kernel: $(AA' + \mu a a')^d_{\bullet}$ (the power is taken entrywise)
$\Rightarrow$ Linear kernel $AA'$: $\mu = 0,\; d = 1$

Gaussian (radial basis) kernel: $K(A, A')_{ij} = \exp\!\big(-\mu \|A_i - A_j\|_2^2\big),\; i, j = 1, \ldots, m$

The $ij$-entry of $K(A, A')$ represents the "similarity" of data points $A_i$ and $A_j$.
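For concreteness, here is one way these kernel matrices are commonly computed with NumPy; the helper names are illustrative, and the polynomial version folds the $\mu a a'$ term into a scalar offset, which corresponds to the common choice $a = e$ (a vector of ones):

```python
import numpy as np

def polynomial_kernel(A, B, mu=1.0, d=2):
    """Polynomial kernel (A B' + mu)^d applied entrywise; mu=0, d=1 gives the linear kernel A B'."""
    return (A @ B.T + mu) ** d

def gaussian_kernel(A, B, mu=0.1):
    """Gaussian (RBF) kernel: K_ij = exp(-mu * ||A_i - B_j||^2)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2 * A @ B.T
    )
    return np.exp(-mu * sq_dists)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))       # 5 points in R^3 (illustrative data)
print(polynomial_kernel(A, A).shape)  # (5, 5)
print(gaussian_kernel(A, A).shape)    # (5, 5)
```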

Page 5:

Nonlinear SVM Motivation

Linear SVM (linear separator: $x'w + b = 0$):

$$\min_{w,\,b,\,\xi}\; \frac{C}{2}\|\xi\|_2^2 + \frac{1}{2}\big(\|w\|_2^2 + b^2\big) \quad \text{s.t.}\quad D(Aw + eb) + \xi \ge e,\; \xi \ge 0 \qquad \text{(QP)}$$

(Here $A$ holds the training points as rows, $D$ is the diagonal matrix of $\pm 1$ class labels, $e$ is a vector of ones, and $\xi$ is the slack vector.)

By QP "duality", $w = A'Du$. Maximizing the margin in the "dual space" gives the dual SSVM, with separator $x'A'Du + b = 0$:

$$\min_{u,\,b,\,\xi}\; \frac{C}{2}\|\xi\|_2^2 + \frac{1}{2}\big(\|u\|_2^2 + b^2\big) \quad \text{s.t.}\quad D(AA'Du + eb) + \xi \ge e,\; \xi \ge 0$$

or, in smooth unconstrained form,

$$\min_{u,\,b}\; \frac{C}{2}\big\|p\big(e - D(AA'Du + eb),\,\alpha\big)\big\|_2^2 + \frac{1}{2}\big(\|u\|_2^2 + b^2\big).$$

Page 6:

Nonlinear Smooth SVM

Replace $AA'$ by a nonlinear kernel $K(A, A')$:

$$\min_{u,\,b}\; \frac{C}{2}\big\|p\big(e - D(K(A, A')Du + eb),\,\alpha\big)\big\|_2^2 + \frac{1}{2}\big(\|u\|_2^2 + b^2\big)$$

Use the Newton-Armijo algorithm to solve the problem; each iteration solves $m+1$ linear equations in $m+1$ variables.

Nonlinear classifier (depends on the entire dataset):

$$K(x', A')Du + b = 0$$
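The slides do not spell out the smoothing function $p(\cdot, \alpha)$, so the sketch below assumes the standard SSVM choice $p(x, \alpha) = x + \frac{1}{\alpha}\log(1 + e^{-\alpha x})$ and simply evaluates the smoothed objective; the function names are illustrative. Newton-Armijo (as the slide says) or any quasi-Newton method can then minimize it:

```python
import numpy as np

def smooth_plus(x, alpha=5.0):
    # p(x, alpha) = x + (1/alpha) * log(1 + exp(-alpha * x)),
    # a smooth approximation of the plus function (x)_+.
    return x + np.logaddexp(0.0, -alpha * x) / alpha

def ssvm_objective(u, b, K, y, C=1.0, alpha=5.0):
    """Value of  C/2 * ||p(e - D(K D u + e b), alpha)||^2 + 1/2 * (||u||^2 + b^2),
    with D = diag(y) written as elementwise products by the label vector y."""
    e = np.ones(K.shape[0])
    residual = smooth_plus(e - y * (K @ (y * u) + b), alpha)
    return 0.5 * C * residual @ residual + 0.5 * (u @ u + b * b)
```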

Page 7:

Difficulties with Nonlinear SVM for Large Problems

- The nonlinear kernel $K(A, A') \in \mathbb{R}^{l \times l}$ is fully dense.
- Computational complexity depends on the number of examples; the complexity of nonlinear SVM is roughly $O((l+1)^3)$.
- Long CPU time to compute the dense kernel matrix: $O(l^2)$ entries need to be generated and stored.
- Runs out of memory while storing the kernel matrix.
- The separating surface depends on almost the entire dataset, so the entire dataset must be stored even after solving the problem.

Page 8:

Solving the SVM with a Massive Dataset

Standard optimization techniques require that the data be held in memory, which limits the SVM to datasets of a few thousand points.

Solution I: SMO (Sequential Minimal Optimization)
- Solve the sub-optimization problem defined by the working set (size = 2).
- Increase the objective function iteratively.

Solution II: RSVM (Reduced Support Vector Machine)

Page 9:

Reduced Support Vector Machine

(i) Choose a random subset matrix $\bar{A} \in \mathbb{R}^{\bar{m} \times n}$ of the entire data matrix $A \in \mathbb{R}^{l \times n}$, with $\bar{m} \ll l$.

(ii) Solve the following problem by Newton's method:

$$\min_{(\bar{u},\,b) \in \mathbb{R}^{\bar{m}+1}}\; \frac{\nu}{2}\big\|p\big(e - D(K(A, \bar{A}')\bar{u} + eb),\,\alpha\big)\big\|_2^2 + \frac{1}{2}\big(\|\bar{u}\|_2^2 + b^2\big)$$

(iii) The nonlinear classifier is defined by the optimal solution $(\bar{u}, b)$ of step (ii):

$$K(x', \bar{A}')\bar{u} + b = \sum_{i=1}^{\bar{m}} \bar{u}_i K(\bar{A}_i, x) + b = 0$$

Using the small square kernel $K(\bar{A}, \bar{A}')$ alone gives lousy results!
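A compact Python sketch of steps (i)-(iii), under explicit assumptions: it uses a Gaussian kernel, replaces the Newton solve of step (ii) with SciPy's L-BFGS on the same smoothed objective, and the names `train_rsvm`, `m_bar`, `nu`, etc. are illustrative rather than from the slides:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, mu=1.0):
    # Gaussian kernel matrix: K_ij = exp(-mu * ||A_i - B_j||^2).
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-mu * sq)

def train_rsvm(A, y, m_bar=50, nu=1.0, alpha=5.0, mu=1.0, seed=0):
    """RSVM sketch: random reduced set + smoothed objective, minimized with L-BFGS."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(A), size=min(m_bar, len(A)), replace=False)
    A_bar = A[idx]                        # step (i): reduced set, m_bar x n
    K = rbf_kernel(A, A_bar)              # rectangular kernel, l x m_bar
    e = np.ones(len(A))

    def p(x):                             # smooth plus function
        return x + np.logaddexp(0.0, -alpha * x) / alpha

    def obj(z):                           # step (ii): objective in (u_bar, b)
        u, b = z[:-1], z[-1]
        r = p(e - y * (K @ u + b))
        return 0.5 * nu * r @ r + 0.5 * (u @ u + b * b)

    z = minimize(obj, np.zeros(K.shape[1] + 1), method="L-BFGS-B").x
    u, b = z[:-1], z[-1]
    # step (iii): classifier sign(K(x', A_bar') u_bar + b)
    return lambda X: np.sign(rbf_kernel(X, A_bar) @ u + b)
```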

Page 10:

A Nonlinear Kernel Application

Checkerboard training set: 1,000 points in $\mathbb{R}^2$; separate 486 asterisks from 514 dots.

Page 11:

Conventional SVM Result on the Checkerboard Using 50 Randomly Selected Points Out of 1,000

$K(\bar{A}, \bar{A}') \in \mathbb{R}^{50 \times 50}$

Page 12:

RSVM Result on the Checkerboard Using the SAME 50 Random Points Out of 1,000

$K(A, \bar{A}') \in \mathbb{R}^{1000 \times 50}$

Page 13:

RSVM on Moderate-Sized Problems
(Best test set correctness %, CPU seconds; each cell is correctness / time)

Dataset (m x n, m̄)               K(A,Ā') m x m̄      K(A,A') m x m      K(Ā,Ā') m̄ x m̄
Cleveland Heart (297 x 13, 30)    86.47 / 3.04        85.92 / 32.42      76.88 / 1.58
BUPA Liver (345 x 6, 35)          74.86 / 2.68        73.62 / 32.61      68.95 / 2.04
Ionosphere (351 x 34, 35)         95.19 / 5.02        94.35 / 59.88      88.70 / 2.13
Pima Indians (768 x 8, 50)        78.64 / 5.72        76.59 / 328.3      57.32 / 4.64
Tic-Tac-Toe (958 x 9, 96)         98.75 / 14.56       98.43 / 1033.5     88.24 / 8.87
Mushroom (8124 x 22, 215)         89.04 / 466.20      N/A                83.90 / 221.50

Page 14:

RSVM on the Large UCI Adult Dataset (data matrix A ∈ R^{m x 123})
Average correctness % and standard deviation over 50 runs

Dataset (Train, Test)    K(A,Ā') m x m̄: Test % ± Std    K(Ā,Ā') m̄ x m̄: Test % ± Std    m̄      m̄/m
(6414, 26148)            84.47 ± 0.001                   77.03 ± 0.014                   210    3.2%
(11221, 21341)           84.71 ± 0.001                   75.96 ± 0.016                   225    2.0%
(16101, 16461)           84.90 ± 0.001                   75.45 ± 0.017                   242    1.5%
(22697, 9865)            85.31 ± 0.001                   76.73 ± 0.018                   284    1.2%
(32562, 16282)           85.07 ± 0.001                   76.95 ± 0.013                   326    1.0%

Page 15:

Reduced Set $\bar{A} \in \mathbb{R}^{\bar{m} \times n}$: Plays the Most Important Role in RSVM

It is natural to raise two questions:

- Is there a way to choose the reduced set other than random selection so that RSVM will have better performance?
- Is there a mechanism to determine the size of the reduced set automatically or dynamically?

Page 16:

Reduced Set Selection According to the Data Scatter in Input Space

Choose the reduced set randomly, but only keep the points in the reduced set that are more than a certain minimal distance apart (a sketch of this rule follows below).

These points are expected to be a representative sample.
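One possible implementation of this selection rule; `scattered_subset`, `min_dist`, and `m_bar` are hypothetical names introduced here for illustration:

```python
import numpy as np

def scattered_subset(A, m_bar, min_dist, seed=0):
    """Scan the data in random order, keeping a point only if it is at least
    min_dist away from every point already kept (stop after m_bar points)."""
    rng = np.random.default_rng(seed)
    kept = []
    for i in rng.permutation(len(A)):
        if all(np.linalg.norm(A[i] - A[j]) >= min_dist for j in kept):
            kept.append(i)
        if len(kept) == m_bar:
            break
    return A[kept]
```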

Page 17:

Data Scatter in Input Space is NOT Good Enough

An example is given as follows: training data analogous to the XOR problem.

[Figure: 12 labeled training points scattered in the input space (not reproduced).]

Page 18:

Mapping to Feature Space

Map the input data via the nonlinear mapping:

$$\phi : (x_1, x_2) \mapsto \big( x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2 \big)$$

This is equivalent to the polynomial kernel of degree 2:

$$K(x, z) = \langle x, z \rangle^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2$$
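The slide's data are only described as "analogous to XOR", so the sketch below uses hypothetical XOR-style points at $(\pm 1, \pm 1)$ to illustrate why this map helps: after $\phi$, the third coordinate $\sqrt{2}\,x_1 x_2$ alone carries the class sign, so a plane separates the classes in feature space:

```python
import numpy as np

# Hypothetical XOR-style data: class +1 at (1,1), (-1,-1); class -1 at (1,-1), (-1,1).
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

F = np.array([phi(x) for x in X])
print(F[:, 2])                  # third feature: sqrt(2)*x1*x2, positive for one class, negative for the other
print(np.sign(F[:, 2]) == y)    # the plane  sqrt(2)*x1*x2 = 0  separates the classes
```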

Page 19:

Data Points in the Feature Space

[Figure: the same 12 training points plotted in the feature space (not reproduced).]

Page 20:

The Polynomial Kernel Matrix

Page 21:

Experiment Result

[Figure: experiment result on the 12-point example (not reproduced).]

Page 22:

Express the Classifier as a Linear Combination of Kernel Functions

In SSVM, the nonlinear separating surface is

$$K(x', A')u + b = \sum_{i=1}^{l} u_i K(A_i, x) + b = 0,$$

a linear combination of the set of kernel functions $\{1, K(A_1, \cdot), \ldots, K(A_l, \cdot)\}$.

In RSVM, the nonlinear separating surface is

$$K(x', \bar{A}')\bar{u} + b = \sum_{i=1}^{\bar{m}} \bar{u}_i K(\bar{A}_i, x) + b = 0,$$

a linear combination of the set of kernel functions $\{1, K(\bar{A}_1, \cdot), \ldots, K(\bar{A}_{\bar{m}}, \cdot)\}$.
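A direct evaluation of such a separating surface; the values of $\bar{A}$, $\bar{u}$, and $b$ below are made up purely for illustration:

```python
import numpy as np

def decision_function(x, A_bar, u_bar, b, kernel):
    """Separating-surface value: sum_i u_bar[i] * K(A_bar[i], x) + b."""
    return sum(u_i * kernel(a_i, x) for u_i, a_i in zip(u_bar, A_bar)) + b

# Example with a degree-2 polynomial kernel (all numbers are illustrative).
poly2 = lambda a, x: (a @ x) ** 2
A_bar = np.array([[1.0, 0.0], [0.0, 1.0]])
u_bar = np.array([0.5, -0.5])
print(np.sign(decision_function(np.array([2.0, 1.0]), A_bar, u_bar, 0.0, poly2)))  # 1.0
```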

Page 23:

Motivation of IRSVM: The Strength of Weak Ties

Mark S. Granovetter, "The Strength of Weak Ties," The American Journal of Sociology, Vol. 78, No. 6, pp. 1360-1380, May 1973.

If the kernel functions are very similar, the space spanned by these kernel functions will be very limited.

Page 24:

Incremental Reduced SVMs

Start with a very small reduced set, then add a new data point only when its kernel vector is dissimilar to the current set of kernel functions.

Such a point contributes the most extra information for generating the separating surface.

Repeat until several successive points cannot be added.

Page 25:

How to measure the dissimilarity?

Add a point $A_i$ into the reduced set if the distance from its kernel vector $K(A, A_i') \in \mathbb{R}^{l \times 1}$ to the column space of $K(A, \bar{A}')$ is greater than a threshold.

Page 26:

Solving Least Squares Problems

This distance can be determined by solving a least squares problem (LSP):

$$\mathrm{dist}\big(K(A, A_i'),\; \mathrm{col}\,K(A, \bar{A}')\big)^2 \;=\; \min_{\beta \in \mathbb{R}^{\bar{m}}} \big\| K(A, \bar{A}')\beta - K(A, A_i') \big\|_2^2$$

The LSP has a unique solution $\beta^* = \big(K(A, \bar{A}')'K(A, \bar{A}')\big)^{-1} K(A, \bar{A}')' K(A, A_i')$ if $\mathrm{rank}\big(K(A, \bar{A}')\big) = \bar{m}$, and the distance is then $\big\| K(A, \bar{A}')\beta^* - K(A, A_i') \big\|_2$.
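With NumPy this distance is a short wrapper around `numpy.linalg.lstsq`; the function name and argument names below are illustrative:

```python
import numpy as np

def distance_to_column_space(K_red, k_vec):
    """Distance from the kernel vector k_vec = K(A, A_i') to the column space
    of the reduced kernel matrix K_red = K(A, A_bar'),
    i.e. min_beta ||K_red @ beta - k_vec||_2."""
    beta, *_ = np.linalg.lstsq(K_red, k_vec, rcond=None)
    return np.linalg.norm(K_red @ beta - k_vec)
```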

Page 27:

IRSVM Algorithm Pseudo-code (Sequential Version)

1  Randomly choose two data points from the training data as the initial reduced set
2  Compute the reduced kernel matrix
3  For each data point not in the reduced set:
4      Compute its kernel vector
5      Compute the distance from the kernel vector
6          to the column space of the current reduced kernel matrix
7      If its distance exceeds a certain threshold,
8          add this point to the reduced set and form the new reduced kernel matrix
9  Until several successive failures happen in line 7
10 Solve the QP problem of the nonlinear SVM with the obtained reduced kernel
11 A new data point is classified by the separating surface
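A compact Python sketch of the selection loop (lines 1-9), assuming for simplicity that the full kernel matrix is available column by column; in practice only the needed kernel vectors would be computed, and the names and the `max_failures` parameter are illustrative:

```python
import numpy as np

def irsvm_select(K_full, tol, max_failures=5, seed=0):
    """Sequential IRSVM reduced-set selection sketch.
    K_full[:, i] is the kernel vector K(A, A_i'); columns are scanned in random
    order and kept when they sit far from the span of the columns kept so far."""
    rng = np.random.default_rng(seed)
    order = list(rng.permutation(K_full.shape[1]))
    reduced = [order.pop(), order.pop()]          # line 1: two random starters
    failures = 0
    for i in order:                               # line 3: scan remaining points
        K_red = K_full[:, reduced]                # current reduced kernel matrix
        beta, *_ = np.linalg.lstsq(K_red, K_full[:, i], rcond=None)
        dist = np.linalg.norm(K_red @ beta - K_full[:, i])
        if dist > tol:                            # line 7: dissimilar enough
            reduced.append(i)
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:          # line 9: several successive failures
                break
    return reduced
```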

Page 28:

Speed up IRSVM

We have to solve the LSP many times, and its complexity is $O(\bar{m}^3)$.

The main cost depends on $\bar{m}$ but not on the particular data point being examined.

Taking advantage of this, we examine a batch of data points at the same time.

Page 29:

IRSVM Algorithm Pseudo-code (Batch Version)

1  Randomly choose two data points from the training data as the initial reduced set
2  Compute the reduced kernel matrix
3  For each batch of data points not in the reduced set:
4      Compute their kernel vectors
5      Compute the corresponding distances from these kernel vectors
6          to the column space of the current reduced kernel matrix
7      For those points whose distance exceeds a certain threshold,
8          add those points to the reduced set and form the new reduced kernel matrix
9  Until no data points in a batch were added in lines 7-8
10 Solve the QP problem of the nonlinear SVM with the obtained reduced kernel
11 A new data point is classified by the separating surface
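The distances for a whole batch can come from a single least-squares solve with multiple right-hand sides, which is where the saving comes from; this sketch is illustrative and the names are hypothetical:

```python
import numpy as np

def batch_distances(K_red, K_batch):
    """Distances from a batch of kernel vectors (columns of K_batch) to the
    column space of K_red, via one least-squares solve for all columns.
    The factorization work depends on the reduced-set size, not on the batch size."""
    Beta, *_ = np.linalg.lstsq(K_red, K_batch, rcond=None)  # solves all right-hand sides at once
    return np.linalg.norm(K_red @ Beta - K_batch, axis=0)
```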

Page 30:

IRSVM on Four Public Datasets

Page 31:

IRSVM on UCI Adult datasets

Page 32:

Time Comparison on Adult Datasets

Page 33:

IRSVM 10-Run Average on the 6414-Point Adult Training Set