the value of kernel function represents the inner product of two training points in feature space...
Post on 19-Dec-2015
![Page 1: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/1.jpg)
Kernel Technique Based on Mercer's Condition (1909)
The value of a kernel function represents the inner product of two training points in feature space
Kernel functions merge two steps:
1. Map the input data from input space to feature space (which might be infinite-dimensional)
2. Take the inner product in the feature space
![Page 2: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/2.jpg)
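A Simple Example of Kernel
Polynomial Kernel of Degree 2: $K(x,z) = \langle x, z \rangle^2$
Let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},\ z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \in R^2$ and the nonlinear map $\phi: R^2 \mapsto R^3$ defined by
$$\phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \end{bmatrix}.$$
Then $\langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2 = K(x,z)$.
There are many other nonlinear maps, $\psi(x)$, that satisfy the relation: $\langle \psi(x), \psi(z) \rangle = \langle x, z \rangle^2 = K(x,z)$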
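The identity on this slide can be checked numerically. The sketch below is a minimal illustration, not part of the original slides: `phi` is the map defined above, while `psi` (a map into $R^4$) is one hypothetical choice of an alternative map that satisfies the same relation.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map from R^2 to R^3 (as on the slide)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

def psi(x):
    """A different map (illustrative assumption) into R^4 with the same inner products."""
    x1, x2 = x
    return np.array([x1**2, x2**2, x1 * x2, x2 * x1])

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = <x, z>^2, computed in input space."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

via_phi = float(np.dot(phi(x), phi(z)))   # inner product in R^3
via_psi = float(np.dot(psi(x), psi(z)))   # inner product in R^4
via_kernel = poly2_kernel(x, z)           # no feature map needed
print(via_phi, via_psi, via_kernel)
```

All three values agree up to floating-point round-off, which is the sense in which the kernel merges the mapping step and the inner-product step.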
![Page 3: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/3.jpg)
Power of the Kernel Technique
Consider a nonlinear map $\phi: R^n \mapsto R^p$ that consists of distinct features of all the monomials of degree $d$. Then $p = \binom{n+d-1}{d}$.
For example: $n = 11,\ d = 10,\ p = \binom{20}{10} = 184756$
Is it necessary? We only need to know $\langle \phi(x), \phi(z) \rangle$!
This can be achieved by $K(x,z) = \langle x, z \rangle^d$
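To make the saving concrete, the following sketch builds the explicit monomial feature map for a small case and compares it with the kernel value. It assumes each monomial $x^\alpha$ is scaled by $\sqrt{d!/\alpha!}$ (the multinomial weight), which is what makes the inner product equal $\langle x, z \rangle^d$; the function name `monomial_features` and the choice $n = 3$, $d = 4$ are illustrative only.

```python
import itertools
import math
import numpy as np

def monomial_features(x, d):
    """All weighted monomials of degree d in the entries of x.

    Each monomial x^alpha is scaled by sqrt(d! / alpha!) so that the inner
    product of two feature vectors equals <x, z>^d (multinomial theorem).
    """
    n = len(x)
    feats = []
    for idx in itertools.combinations_with_replacement(range(n), d):
        alpha = [idx.count(i) for i in range(n)]  # exponent of each variable
        coef = math.factorial(d) / math.prod(math.factorial(a) for a in alpha)
        feats.append(math.sqrt(coef) * math.prod(x[i] for i in idx))
    return np.array(feats)

n, d = 3, 4
rng = np.random.default_rng(0)
x, z = rng.normal(size=n), rng.normal(size=n)

fx, fz = monomial_features(x, d), monomial_features(z, d)
p = math.comb(n + d - 1, d)        # number of distinct monomials of degree d
explicit = float(fx @ fz)          # needs all p features
kernel = float(x @ z) ** d         # needs only one inner product in R^n
print(p, explicit, kernel)
```

The explicit route costs work proportional to $p$, while the kernel route costs one inner product in $R^n$ followed by a scalar power.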
![Page 4: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/4.jpg)
More Examples of Kernel
$K(A,B): R^{m \times n} \times R^{n \times l} \mapsto R^{m \times l}$
$A \in R^{m \times n},\ a \in R^m,\ \mu \in R$; $d$ is an integer:
Polynomial Kernel: $(AA' + \mu aa')^{d}_{\bullet}$ (Linear Kernel $AA'$: $\mu = 0,\ d = 1$)
Gaussian (Radial Basis) Kernel: $K(A,A')_{ij} = \varepsilon^{-\mu \|A_i - A_j\|_2^2},\quad i,j = 1,\ldots,m$
The $ij$-entry of $K(A,A')$ represents the "similarity" of data points $A_i$ and $A_j$
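A minimal NumPy sketch of these two kernels follows. The function names and the tiny data matrix are assumptions made for illustration; the subscript $\bullet$ is read here as a componentwise power, and the second argument of `gaussian_kernel` is passed as rows, so `gaussian_kernel(A, B, mu)` plays the role of $K(A, B')$.

```python
import numpy as np

def polynomial_kernel(A, a, mu=0.0, d=1):
    """Componentwise power (A A' + mu * a a')^d; mu = 0, d = 1 gives the linear kernel."""
    return (A @ A.T + mu * np.outer(a, a)) ** d

def gaussian_kernel(A, B, mu):
    """Entry (i, j) is exp(-mu * ||A_i - B_j||^2) for rows A_i of A and B_j of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-mu * np.maximum(sq, 0.0))  # clip tiny negatives from round-off

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
a = np.ones(len(A))

K_lin = polynomial_kernel(A, a)                # linear kernel A A'
K_poly = polynomial_kernel(A, a, mu=1.0, d=2)  # (A A' + a a')^2, componentwise
K_rbf = gaussian_kernel(A, A, mu=0.5)          # similarity decays with distance
print(K_rbf)
```

Identical points get similarity 1 under the Gaussian kernel, and the value shrinks toward 0 as the points move apart.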
![Page 5: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/5.jpg)
Nonlinear SVM Motivation
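Linear SVM (linear separator: $x'w + b = 0$):
$$\min_{\xi \ge 0,\, w,\, b}\ \frac{C}{2}\|\xi\|_2^2 + \frac{1}{2}\left(\|w\|_2^2 + b^2\right) \quad \text{s.t.}\quad D(Aw + eb) + \xi \ge e \qquad \text{(QP)}$$
By QP "duality", $w = A'D\alpha$. Maximizing the margin in the "dual space" gives:
$$\min_{\xi \ge 0,\, \alpha,\, b}\ \frac{C}{2}\|\xi\|_2^2 + \frac{1}{2}\left(\|\alpha\|_2^2 + b^2\right) \quad \text{s.t.}\quad D(AA'D\alpha + eb) + \xi \ge e$$
Dual SSVM with separator $x'A'D\alpha + b = 0$:
$$\min_{\alpha,\, b}\ \frac{C}{2}\left\|p\left(e - D(AA'D\alpha + eb),\ \beta\right)\right\|_2^2 + \frac{1}{2}\left(\|\alpha\|_2^2 + b^2\right)$$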
![Page 6: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/6.jpg)
Nonlinear Smooth SVM
Replace $AA'$ by a nonlinear kernel $K(A,A')$:
$$\min_{\alpha,\, b}\ \frac{C}{2}\left\|p\left(e - D(K(A,A')D\alpha + eb),\ \beta\right)\right\|_2^2 + \frac{1}{2}\left(\|\alpha\|_2^2 + b^2\right)$$
Use the Newton-Armijo algorithm to solve the problem
Each iteration solves $m+1$ linear equations in $m+1$ variables
Nonlinear classifier depends on entire dataset:
Nonlinear Classifier: $K(x',A')D\alpha + b = 0$
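The smooth objective can be evaluated in a few lines. The slides do not define the smoothing function $p$; the sketch below assumes the form commonly used in the smooth SVM literature, $p(x, \beta) = x + \frac{1}{\beta}\log(1 + \varepsilon^{-\beta x})$, a smooth approximation of $\max(x, 0)$. The function names, the four-point data set and the parameter values are illustrative, and the Newton-Armijo solver itself is not shown.

```python
import numpy as np

def smooth_plus(x, beta):
    """p(x, beta) = x + log(1 + exp(-beta * x)) / beta, a smooth max(x, 0).

    Written with logaddexp for numerical stability; this form of p is an
    assumption, since the slides do not spell it out.
    """
    return np.logaddexp(0.0, beta * x) / beta

def ssvm_objective(alpha, b, K, y, C=1.0, beta=5.0):
    """(C/2) * ||p(e - D(K D alpha + e b), beta)||^2 + (1/2) * (||alpha||^2 + b^2)."""
    # D = diag(y), so D @ v is the elementwise product y * v.
    r = 1.0 - y * (K @ (y * alpha) + b)
    return 0.5 * C * np.sum(smooth_plus(r, beta) ** 2) + 0.5 * (alpha @ alpha + b * b)

# Four XOR-like points with labels +1 / -1 (illustrative data).
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-1.0 * sq)  # Gaussian kernel with mu = 1

value_at_zero = ssvm_objective(np.zeros(4), 0.0, K, y)
print(value_at_zero)
```

Because the objective is smooth and unconstrained, a Newton-type method can be applied to it directly, which is where the $(m+1) \times (m+1)$ linear systems per iteration come from.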
![Page 7: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/7.jpg)
Difficulties with Nonlinear SVM for Large Problems
The nonlinear kernel $K(A,A') \in R^{l \times l}$ is fully dense
Long CPU time to compute the dense kernel matrix
Need to generate and store $O(l^2)$ entries
Runs out of memory while storing the kernel matrix
Computational complexity depends on the number of examples
Complexity of nonlinear SVM $\sim O((l+1)^3)$
Separating surface depends on almost entire dataset
Need to store the entire dataset even after solving the problem
![Page 8: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/8.jpg)
Solving the SVM with Massive Dataset
Standard optimization techniques require that the data are held in memory
This limits the SVM to datasets of a few thousand points
Solution I: SMO (Sequential Minimal Optimization)
Solve the sub-optimization problem defined by the working set (size = 2)
Increase the objective function iteratively
Solution II: RSVM (Reduced Support Vector Machine)
![Page 9: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/9.jpg)
Reduced Support Vector Machine
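(i) Choose a random subset matrix $\bar{A} \in R^{\bar{m} \times n}$ of the entire data matrix $A \in R^{l \times n}$ ($\bar{m} \ll l$)
(ii) Solve the following problem by Newton's method:
$$\min_{(\bar{u},\, b) \in R^{\bar{m}+1}}\ \frac{\nu}{2}\left\|p\left(e - D(K(A,\bar{A}')\bar{u} + eb),\ \alpha\right)\right\|_2^2 + \frac{1}{2}\left(\|\bar{u}\|_2^2 + b^2\right)$$
(iii) The nonlinear classifier is defined by the optimal solution $(\bar{u}, b)$ in step (ii):
Nonlinear Classifier: $K(x',\bar{A}')\bar{u} + b = 0$, that is, $\bar{u}'K(\bar{A},x) + b = \sum_{i=1}^{\bar{m}} \bar{u}_i K(\bar{A}_i, x) + b = 0$
Using $K(\bar{A},\bar{A}')$ gives lousy results!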
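Steps (i) and (iii) can be sketched as follows. This is a minimal illustration under stated assumptions: a Gaussian kernel with an arbitrary width `mu`, uniformly random stand-in data, and hypothetical helper names. Step (ii), the Newton solve for $(\bar{u}, b)$, is not implemented here.

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """Entry (i, j) is exp(-mu * ||A_i - B_j||^2) for rows A_i of A and B_j of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-mu * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
l, n, m_bar = 1000, 2, 50
A = rng.uniform(0.0, 1.0, size=(l, n))        # stand-in for the full data matrix

subset = rng.choice(l, size=m_bar, replace=False)
A_bar = A[subset]                              # step (i): random reduced set

K_rect = gaussian_kernel(A, A_bar, mu=10.0)    # K(A, A_bar'): l x m_bar, used by RSVM
K_small = gaussian_kernel(A_bar, A_bar, mu=10.0)  # K(A_bar, A_bar'): m_bar x m_bar

def decision_value(x, A_bar, u_bar, b, mu=10.0):
    """Step (iii): evaluate K(x', A_bar') u_bar + b for a single point x."""
    return float(gaussian_kernel(x[None, :], A_bar, mu)[0] @ u_bar + b)

print(K_rect.shape, K_small.shape)
```

The rectangular kernel keeps one row per training point, so every point still constrains the fit, while only the $\bar{m}$ points of the reduced set need to be stored to evaluate the classifier.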
![Page 10: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/10.jpg)
A Nonlinear Kernel Application
Checkerboard Training Set: 1000 Points in $R^2$
Separate 486 Asterisks from 514 Dots
![Page 11: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/11.jpg)
Conventional SVM Result on Checkerboard
Using 50 Randomly Selected Points Out of 1000
$K(\bar{A},\bar{A}') \in R^{50 \times 50}$
![Page 12: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/12.jpg)
RSVM Result on Checkerboard
Using SAME 50 Random Points Out of 1000
$K(A,\bar{A}') \in R^{1000 \times 50}$
![Page 13: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/13.jpg)
RSVM on Moderate Sized Problems (Best Test Set Correctness %, CPU seconds)

| Dataset, size $m \times n$, $\bar{m}$ | $K(A,\bar{A}')$, $m \times \bar{m}$ | $K(A,A')$, $m \times m$ | $K(\bar{A},\bar{A}')$, $\bar{m} \times \bar{m}$ |
|---|---|---|---|
| Cleveland Heart, 297 x 13, 30 | 86.47, 3.04 | 85.92, 32.42 | 76.88, 1.58 |
| BUPA Liver, 345 x 6, 35 | 74.86, 2.68 | 73.62, 32.61 | 68.95, 2.04 |
| Ionosphere, 351 x 34, 35 | 95.19, 5.02 | 94.35, 59.88 | 88.70, 2.13 |
| Pima Indians, 768 x 8, 50 | 78.64, 5.72 | 76.59, 328.3 | 57.32, 4.64 |
| Tic-Tac-Toe, 958 x 9, 96 | 98.75, 14.56 | 98.43, 1033.5 | 88.24, 8.87 |
| Mushroom, 8124 x 22, 215 | 89.04, 466.20 | N/A, N/A | 83.90, 221.50 |
![Page 14: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/14.jpg)
RSVM on Large UCI Adult Dataset
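Average Correctness % & Standard Deviation, 50 Runs
Standard Deviation over 50 Runs = 0.001

| Adult dataset size (train, test) | $K(A,\bar{A}')_{m \times \bar{m}}$ testing % | Std. dev. | $K(\bar{A},\bar{A}')_{\bar{m} \times \bar{m}}$ testing % | Std. dev. | $\bar{m}$ (for $\bar{A}_{\bar{m} \times 123}$) | $\bar{m}/m$ |
|---|---|---|---|---|---|---|
| (6414, 26148) | 84.47 | 0.001 | 77.03 | 0.014 | 210 | 3.2% |
| (11221, 21341) | 84.71 | 0.001 | 75.96 | 0.016 | 225 | 2.0% |
| (16101, 16461) | 84.90 | 0.001 | 75.45 | 0.017 | 242 | 1.5% |
| (22697, 9865) | 85.31 | 0.001 | 76.73 | 0.018 | 284 | 1.2% |
| (32562, 16282) | 85.07 | 0.001 | 76.95 | 0.013 | 326 | 1.0% |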
![Page 15: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/15.jpg)
Reduced Set $\bar{A} \in R^{\bar{m} \times n}$: Plays the Most Important Role in RSVM
It is natural to raise two questions:
Is there a way to choose the reduced set other than random selection so that RSVM will have better performance?
Is there a mechanism to determine the size of the reduced set automatically or dynamically?
![Page 16: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/16.jpg)
Reduced Set Selection According to the Data Scatter in Input Space
Choose the reduced set randomly, but only keep the points in the reduced set that are more than a certain minimal distance apart
These points are expected to be a representative sample
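One way to read this selection rule is as a greedy filter over random candidates. The sketch below is a hypothetical implementation: the function name, the greedy order, the Euclidean distance and the parameter values are all assumptions, since the slides do not give the details.

```python
import numpy as np

def select_spread_subset(A, n_candidates, min_dist, seed=0):
    """Draw random candidates and keep only those farther than min_dist
    from every point already kept (greedy, in random order)."""
    rng = np.random.default_rng(seed)
    candidates = rng.choice(len(A), size=n_candidates, replace=False)
    kept = []
    for i in candidates:
        if all(np.linalg.norm(A[i] - A[j]) > min_dist for j in kept):
            kept.append(int(i))
    return np.array(kept)

rng = np.random.default_rng(1)
A = rng.uniform(0.0, 1.0, size=(500, 2))   # illustrative data in the unit square
kept = select_spread_subset(A, n_candidates=100, min_dist=0.15)
print(len(kept))
```

The number of points kept is not fixed in advance; it depends on the minimal distance and on how the data are scattered.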
![Page 17: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/17.jpg)
Data Scatter in Input Space is NOT Good Enough
An example is given as follows:
Training data analogous to XOR problem
[Figure: twelve training points, labeled 1 to 12, in input space]
![Page 18: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/18.jpg)
Mapping to Feature Space
Map the input data via the nonlinear mapping:
$\phi: (x_1, x_2) \to \left((x_1)^2,\ (x_2)^2,\ \sqrt{2}\,x_1 x_2\right)$
Equivalent to polynomial kernel with degree 2:
$$K(x,z) = \langle x \cdot z \rangle^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2$$
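A property of this particular map that is easy to verify: every coordinate is quadratic, so $x$ and $-x$ have the same image. Points that are far apart in input space can therefore coincide in feature space. The short check below is an illustrative sketch with arbitrarily chosen points.

```python
import numpy as np

def phi(x):
    """Degree-2 feature map (x1, x2) -> (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

x = np.array([1.0, -2.0])
z = np.array([0.5, 3.0])

same_image = np.allclose(phi(x), phi(-x))          # antipodal points collapse
input_gap = float(np.linalg.norm(x - (-x)))        # yet they are far apart in input space
kernel_matches = np.isclose(phi(x) @ phi(z), (x @ z) ** 2)
print(same_image, input_gap, kernel_matches)
```

This is one reason why distances measured in input space can be a poor guide to how similar two kernel functions are.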
![Page 19: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/19.jpg)
Data Points in the Feature Space
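[Figure: the twelve points in input space and in feature space, where the labels appear in pairs: 1 and 4, 2 and 5, 3 and 6, 7 and 10, 8 and 11, 9 and 12]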
![Page 20: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/20.jpg)
The Polynomial Kernel Matrix
![Page 21: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/21.jpg)
Experiment Result
[Figure: the twelve training points, labeled 1 to 12]
![Page 22: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/22.jpg)
Express the Classifier as Linear Combination of Kernel Functions
In SSVM, the nonlinear separating surface is:
$$K(x',A')u + b = \sum_{i=1}^{l} u_i K(A_i, x) + b = 0$$
It is a linear combination of a set of kernel functions $\{1, K(A_1,\cdot), \ldots, K(A_l,\cdot)\}$
In RSVM, the nonlinear separating surface is:
$$K(x',\bar{A}')\bar{u} + b = \sum_{i=1}^{\bar{m}} \bar{u}_i K(\bar{A}_i, x) + b = 0$$
It is a linear combination of a set of kernel functions $\{1, K(\bar{A}_1,\cdot), \ldots, K(\bar{A}_{\bar{m}},\cdot)\}$
![Page 23: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/23.jpg)
Motivation of IRSVM: The Strength of Weak Ties
The strength of weak ties, Mark S. Granovetter, The American Journal of Sociology, Vol. 78, No. 6, 1360-1380, May 1973
If the kernel functions are very similar, the space spanned by these kernel functions will be very limited.
![Page 24: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/24.jpg)
Incremental Reduced SVMs
Start with a very small reduced set, then add a new data point only when the kernel vector is dissimilar to the current function set
This point contributes the most extra information for generating the separating surface
Repeat until several successive points cannot be added
![Page 25: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/25.jpg)
How to measure the dissimilarity?
Add a point into the reduced set if the distance from the kernel vector $K(A,A_i') \in R^{l \times 1}$ to the column space of $K(A,\bar{A}')$ is greater than a threshold
Distance: $\left\|K(A,\bar{A}')\beta - K(A,A_i')\right\|_2$
![Page 26: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/26.jpg)
Solving Least Squares Problems
This distance can be determined by solving a least squares problem (LSP):
$$\min_{\beta}\ \left\|K(A,\bar{A}')\beta - K(A,A_i')\right\|_2^2$$
The LSP has a unique solution $\beta^*$ if $\mathrm{rank}(K(A,\bar{A}')) = \bar{m}$, and the squared distance is
$$\left\|K(A,\bar{A}')\beta^* - K(A,A_i')\right\|_2^2$$
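The distance computation can be sketched with NumPy's least squares routine. The helper name and the small matrices are illustrative assumptions; `K_red` plays the role of $K(A,\bar{A}')$ and `k_vec` the role of $K(A,A_i')$.

```python
import numpy as np

def distance_to_column_space(K_red, k_vec):
    """Distance from k_vec to the column space of K_red, via least squares."""
    beta_star, *_ = np.linalg.lstsq(K_red, k_vec, rcond=None)
    residual = K_red @ beta_star - k_vec
    return float(np.linalg.norm(residual)), beta_star

# Columns span the first two coordinate axes of R^3 (full column rank).
K_red = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])
k_vec = np.array([1.0, 2.0, 3.0])

dist, beta_star = distance_to_column_space(K_red, k_vec)
print(dist, beta_star)
```

In this toy case only the third coordinate of `k_vec` lies outside the column space, so the distance is its absolute value.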
![Page 27: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/27.jpg)
IRSVM Algorithm pseudo-code
(sequential version)
1 Randomly choose two data points from the training data as the initial reduced set
2 Compute the reduced kernel matrix
3 For each data point not in the reduced set
4 Compute its kernel vector
5 Compute the distance from the kernel vector
6 to the column space of the current reduced kernel matrix
7 If its distance exceeds a certain threshold
8 Add this point into the reduced set and form the new reduced kernel matrix
9 Until several successive failures happen in line 7
10 Solve the QP problem of nonlinear SVMs with the obtained reduced kernel
11 A new data point is classified by the separating surface
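Lines 1 to 9 of the pseudo-code can be sketched in Python as below. This is a minimal sketch under stated assumptions: a Gaussian kernel, a random visiting order, and hypothetical parameters `threshold` and `max_failures` (the slides say only "a certain threshold" and "several successive failures"). Lines 10 and 11, the SVM solve and the classification, are left out.

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """Entry (i, j) is exp(-mu * ||A_i - B_j||^2) for rows A_i of A and B_j of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-mu * np.maximum(sq, 0.0))

def irsvm_select_sequential(A, mu, threshold, max_failures=10, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(A))
    reduced = [int(i) for i in order[:2]]              # line 1
    K_red = gaussian_kernel(A, A[reduced], mu)         # line 2
    failures = 0
    for i in order[2:]:                                # line 3
        k_vec = gaussian_kernel(A, A[[i]], mu)[:, 0]   # line 4
        beta, *_ = np.linalg.lstsq(K_red, k_vec, rcond=None)
        dist = np.linalg.norm(K_red @ beta - k_vec)    # lines 5-6
        if dist > threshold:                           # line 7
            reduced.append(int(i))                     # line 8
            K_red = np.column_stack([K_red, k_vec])
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:               # line 9
                break
    return np.array(reduced), K_red

rng = np.random.default_rng(2)
A = rng.normal(size=(60, 2))                           # illustrative data
reduced, K_red = irsvm_select_sequential(A, mu=1.0, threshold=0.5)
print(len(reduced), K_red.shape)
```

Each candidate requires one least squares solve against the current reduced kernel matrix, which is the cost the batch version tries to amortize.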
![Page 28: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/28.jpg)
Speed up IRSVM
The main cost depends on but not on
Taking advantage of this, we examine a batch of data points at the same time
We have to solve the LSP many times, and the complexity is $O(\bar{m}^3)$
![Page 29: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/29.jpg)
IRSVM Algorithm pseudo-code
(Batch version)
1 Randomly choose two data points from the training data as the initial reduced set
2 Compute the reduced kernel matrix
3 For a batch of data points not in the reduced set
4 Compute their kernel vectors
5 Compute the corresponding distances from these kernel vectors
6 to the column space of the current reduced kernel matrix
7 For those points whose distance exceeds a certain threshold
8 Add those points into the reduced set and form the new reduced kernel matrix
9 Until no data points in a batch were added in lines 7 and 8
10 Solve the QP problem of nonlinear SVMs with the obtained reduced kernel
11 A new data point is classified by the separating surface
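The batch variant can be sketched by passing all kernel vectors of a batch to the least squares solver as multiple right-hand sides. As before, the Gaussian kernel, `batch_size`, `threshold` and the function names are assumptions for illustration, and lines 10 and 11 are not implemented.

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """Entry (i, j) is exp(-mu * ||A_i - B_j||^2) for rows A_i of A and B_j of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-mu * np.maximum(sq, 0.0))

def irsvm_select_batch(A, mu, threshold, batch_size=20, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(A))
    reduced = [int(i) for i in order[:2]]                    # line 1
    K_red = gaussian_kernel(A, A[reduced], mu)               # line 2
    for start in range(2, len(order), batch_size):           # line 3
        batch = order[start:start + batch_size]
        K_batch = gaussian_kernel(A, A[batch], mu)           # line 4: one kernel vector per column
        B, *_ = np.linalg.lstsq(K_red, K_batch, rcond=None)  # all right-hand sides at once
        dists = np.linalg.norm(K_red @ B - K_batch, axis=0)  # lines 5-6
        accepted = dists > threshold                         # line 7
        if not accepted.any():                               # line 9
            break
        reduced.extend(int(i) for i in batch[accepted])      # line 8
        K_red = np.column_stack([K_red, K_batch[:, accepted]])
    return np.array(reduced), K_red

rng = np.random.default_rng(3)
A = rng.normal(size=(80, 2))                                 # illustrative data
reduced, K_red = irsvm_select_batch(A, mu=1.0, threshold=0.5)
print(len(reduced), K_red.shape)
```

Note that distances within a batch are measured against the reduced set as it stood before the batch, so two accepted points from the same batch may still be similar to each other; that trade-off follows from the pseudo-code as written.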
![Page 30: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/30.jpg)
IRSVM on Four Public Datasets
![Page 31: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/31.jpg)
IRSVM on UCI Adult datasets
![Page 32: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/32.jpg)
Time comparison on Adult datasets
![Page 33: The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from](https://reader036.vdocuments.site/reader036/viewer/2022062714/56649d385503460f94a11d6e/html5/thumbnails/33.jpg)
IRSVM 10 Runs Average on 6414 Points Adult Training Set