Support Vector Classification and Regression
Chih-Jen Lin
Department of Computer Science
National Taiwan University
Short course at ITRI, 2016
Outline
1 Introduction
2 SVM and kernel methods
3 Dual problem and solving optimization problems
4 Regularization and linear versus kernel
5 Multi-class classification
6 Support vector regression
7 SVM for clustering
8 Practical use of support vector classification
9 A practical example of SVR
10 Discussion and conclusions
Introduction
About this Course
Last year I gave a four-day short course on “introduction of data mining”
In that course, SVM was discussed
This year I received a request to specifically talk about SVM
So I assume that some of you would like to learn more details of SVM
About this Course (Cont’d)
Therefore, this short course will be more technical than last year
More mathematics will be involved
We will have breaks at 9:50, 10:50, 13:50, and 14:50
Course slides:
www.csie.ntu.edu.tw/~cjlin/talks/itri.pdf
I may still update slides (e.g., if we find errors in our lectures)
SVM and kernel methods
Support Vector Classification
Training vectors: $x_i$, $i = 1, \ldots, l$
Feature vectors. For example,
A patient = $[\text{height}, \text{weight}, \ldots]^T$
Consider a simple case with two classes:
Define an indicator vector $y \in \mathbb{R}^l$:
$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1} \\ -1 & \text{if } x_i \text{ in class 2} \end{cases}$$
A hyperplane which separates all data
[Figure: two classes (◦ and △) separated by parallel hyperplanes $w^T x + b = +1, 0, -1$]

A separating hyperplane: $w^T x + b = 0$
$$w^T x_i + b \ge 1 \ \text{ if } y_i = 1, \qquad w^T x_i + b \le -1 \ \text{ if } y_i = -1$$
Decision function $f(x) = \mathrm{sgn}(w^T x + b)$, $x$: test data
Many possible choices of $w$ and $b$
Maximal Margin
Distance between $w^T x + b = 1$ and $-1$:
$$\frac{2}{\|w\|} = \frac{2}{\sqrt{w^T w}}$$
A quadratic programming problem (Boser et al., 1992):
$$\min_{w, b} \ \frac{1}{2} w^T w \qquad \text{subject to} \quad y_i(w^T x_i + b) \ge 1, \ i = 1, \ldots, l.$$
Example
Given two training data in $\mathbb{R}^1$ as in the following figure:

(△ at $x = 0$, ◦ at $x = 1$)

What is the separating hyperplane?
The two data points are $x_1 = 1$ and $x_2 = 0$, with $y = [+1, -1]^T$
Example (Cont’d)
Now $w \in \mathbb{R}^1$. The optimization problem is
$$\min_{w, b} \ \frac{1}{2} w^2 \qquad \text{subject to} \quad w \cdot 1 + b \ge 1, \quad (1) \qquad -1(w \cdot 0 + b) \ge 1. \quad (2)$$
From (2), $-b \ge 1$.
Putting this into (1), $w \ge 2$.
That is, for any $(w, b)$ satisfying (1) and (2), $w \ge 2$.
Example (Cont’d)
We are minimizing $\frac{1}{2}w^2$, so the smallest possibility is $w = 2$.
Thus, $(w, b) = (2, -1)$ is the optimal solution.
The separating hyperplane is $2x - 1 = 0$, in the middle of the two training data:

(△ at $x = 0$, ◦ at $x = 1$, separating point • at $x = 1/2$)
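To make this concrete, here is a minimal sketch (not from the slides; it assumes scikit-learn, whereas the course uses LIBSVM) that checks the tiny example numerically. A very large $C$ approximates the hard-margin problem:

```python
# Sketch: verify the 1-D example; a huge C approximates the hard-margin SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [0.0]])        # x1 = 1, x2 = 0
y = np.array([+1, -1])

model = SVC(kernel="linear", C=1e10).fit(X, y)
print(model.coef_, model.intercept_)       # approximately [[2.]] and [-1.]
print(model.decision_function([[0.5]]))    # x = 1/2 is on the hyperplane: ~0
```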
Data May Not Be Linearly Separable
An example:
[Figure: ◦'s and △'s interleaved so that no hyperplane separates them]

Allow training errors
Higher dimensional (maybe infinite) feature space
$$\phi(x) = [\phi_1(x), \phi_2(x), \ldots]^T.$$
Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)
$$\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \qquad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ i = 1, \ldots, l.$$
Example: $x \in \mathbb{R}^3$, $\phi(x) \in \mathbb{R}^{10}$
$$\phi(x) = [1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3]^T$$
Finding the Decision Function
$w$: maybe infinite variables
The dual problem: finite number of variables
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \quad y^T \alpha = 0,$$
where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$
At optimum:
$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$
A finite problem: #variables = #training data
Kernel Tricks
$Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ needs a closed form
Example: $x_i \in \mathbb{R}^3$, $\phi(x_i) \in \mathbb{R}^{10}$
$$\phi(x_i) = [1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3]^T$$
Then $\phi(x_i)^T \phi(x_j) = (1 + x_i^T x_j)^2$.
Kernel: $K(x, y) = \phi(x)^T \phi(y)$; common kernels:
$e^{-\gamma \|x_i - x_j\|^2}$ (Radial Basis Function or Gaussian kernel)
$(x_i^T x_j / a + b)^d$ (Polynomial kernel)
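A quick numeric check of the closed form above (a sketch; NumPy is assumed):

```python
# Sketch: phi(x)^T phi(y) equals (1 + x^T y)^2 for the degree-2 mapping above.
import numpy as np

def phi(x):
    s = np.sqrt(2.0)
    return np.array([1, s*x[0], s*x[1], s*x[2],
                     x[0]**2, x[1]**2, x[2]**2,
                     s*x[0]*x[1], s*x[0]*x[2], s*x[1]*x[2]])

x, y = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(phi(x) @ phi(y))     # inner product in the mapped space
print((1 + x @ y) ** 2)    # closed-form kernel value; the two agree
```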
Can be inner product in infinite dimensional space. Assume $x \in \mathbb{R}^1$ and $\gamma > 0$.
$$e^{-\gamma\|x_i - x_j\|^2} = e^{-\gamma(x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_i x_j - \gamma x_j^2}$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 + \frac{2\gamma x_i x_j}{1!} + \frac{(2\gamma x_i x_j)^2}{2!} + \frac{(2\gamma x_i x_j)^3}{3!} + \cdots\right)$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 \cdot 1 + \sqrt{\frac{2\gamma}{1!}}x_i \cdot \sqrt{\frac{2\gamma}{1!}}x_j + \sqrt{\frac{(2\gamma)^2}{2!}}x_i^2 \cdot \sqrt{\frac{(2\gamma)^2}{2!}}x_j^2 + \sqrt{\frac{(2\gamma)^3}{3!}}x_i^3 \cdot \sqrt{\frac{(2\gamma)^3}{3!}}x_j^3 + \cdots\right) = \phi(x_i)^T \phi(x_j),$$
where
$$\phi(x) = e^{-\gamma x^2}\left[1, \sqrt{\frac{2\gamma}{1!}}x, \sqrt{\frac{(2\gamma)^2}{2!}}x^2, \sqrt{\frac{(2\gamma)^3}{3!}}x^3, \cdots\right]^T.$$
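Truncating the series gives a finite feature map whose inner product already matches the kernel to high precision (a sketch, assuming NumPy; 30 terms is an arbitrary cutoff):

```python
# Sketch: truncate phi(x) after `terms` coordinates and compare with the kernel.
import math
import numpy as np

def phi(x, gamma, terms=30):
    return np.array([math.exp(-gamma * x**2) *
                     math.sqrt((2 * gamma)**k / math.factorial(k)) * x**k
                     for k in range(terms)])

gamma, xi, xj = 0.5, 0.8, -0.3
print(phi(xi, gamma) @ phi(xj, gamma))   # truncated inner product
print(math.exp(-gamma * (xi - xj)**2))   # exact kernel value; nearly identical
```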
Decision function
At optimum:
$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$
Decision function:
$$w^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$$
Only $\phi(x_i)$ with $\alpha_i > 0$ are used ⇒ support vectors
Support Vectors: More Important Data
Only $\phi(x_i)$ with $\alpha_i > 0$ are used ⇒ support vectors

[Figure: a two-class toy problem; the support vectors are the points near the decision boundary]
See more examples via SVM Toy available at the libsvm webpage (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
Example: Primal-dual Relationship
If separable, primal problem does not have $\xi_i$:
$$\min_{w, b} \ \frac{1}{2} w^T w \qquad \text{subject to} \quad y_i(w^T x_i + b) \ge 1, \ i = 1, \ldots, l.$$
Dual problem is
$$\min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{l} \alpha_i \qquad \text{subject to} \quad 0 \le \alpha_i, \ i = 1, \ldots, l, \quad \sum_{i=1}^{l} y_i \alpha_i = 0.$$
Example: Primal-dual Relationship (Cont’d)

Consider the earlier example:

(△ at $x = 0$, ◦ at $x = 1$)

Now two data are $x_1 = 1$, $x_2 = 0$ with $y = [+1, -1]^T$
The solution is $(w, b) = (2, -1)$
Example: Primal-dual Relationship (Cont’d)

The dual objective function:
$$\frac{1}{2}\begin{bmatrix} \alpha_1 & \alpha_2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} - \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \frac{1}{2}\alpha_1^2 - (\alpha_1 + \alpha_2)$$
In optimization, the objective function means the function to be optimized. Constraints are
$$\alpha_1 - \alpha_2 = 0, \quad 0 \le \alpha_1, \quad 0 \le \alpha_2.$$
Example: Primal-dual Relationship (Cont’d)

Substituting $\alpha_2 = \alpha_1$ into the objective function,
$$\frac{1}{2}\alpha_1^2 - 2\alpha_1$$
has the smallest value at $\alpha_1 = 2$.
Because $[2, 2]^T$ satisfies the constraints $0 \le \alpha_1$ and $0 \le \alpha_2$, it is optimal
Example: Primal-dual Relationship (Cont’d)

Using the primal-dual relation,
$$w = y_1 \alpha_1 x_1 + y_2 \alpha_2 x_2 = 1 \cdot 2 \cdot 1 + (-1) \cdot 2 \cdot 0 = 2$$
This is the same as that obtained by solving the primal problem.
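The same answer can be reached by handing the tiny dual to a generic solver (a sketch; SciPy is an assumption, not what the course software does):

```python
# Sketch: solve the two-variable dual numerically and recover w.
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 0.0])                 # x1 = 1, x2 = 0
y = np.array([+1.0, -1.0])
Q = np.outer(y, y) * np.outer(x, x)      # Q_ij = y_i y_j x_i x_j

obj = lambda a: 0.5 * a @ Q @ a - a.sum()
cons = ({"type": "eq", "fun": lambda a: y @ a},)          # y^T alpha = 0
res = minimize(obj, np.zeros(2), bounds=[(0, None)] * 2, constraints=cons)

alpha = res.x                            # approximately [2, 2]
w = (alpha * y) @ x                      # w = sum_i alpha_i y_i x_i
print(alpha, w)                          # w is about 2, matching the primal
```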
More about Support Vectors

We know
$$\alpha_i > 0 \Rightarrow \text{support vector}$$
We have
$$y_i(w^T x_i + b) < 1 \Rightarrow \alpha_i > 0 \Rightarrow \text{support vector},$$
$$y_i(w^T x_i + b) = 1 \Rightarrow \alpha_i \ge 0 \Rightarrow \text{maybe SV},$$
and
$$y_i(w^T x_i + b) > 1 \Rightarrow \alpha_i = 0 \Rightarrow \text{not SV}$$
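These three cases can be observed directly on a toy set (a sketch with scikit-learn; `dual_coef_` stores $y_i \alpha_i$ for the support vectors):

```python
# Sketch: compare margins y_i(w^T x_i + b) with the alpha_i values.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

m = SVC(kernel="linear", C=1.0).fit(X, y)
alpha = np.zeros(len(X))
alpha[m.support_] = np.abs(m.dual_coef_[0])   # alpha_i = |y_i alpha_i|
margins = y * m.decision_function(X)
print(np.all(alpha[margins > 1.001] == 0))    # margin > 1  =>  alpha_i = 0
print(np.all(alpha[margins < 0.999] > 0))     # margin < 1  =>  alpha_i > 0
```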
Dual problem and solving optimization problems
Convex Optimization I
Convex problems are an important class of optimization problems that possess nice properties
A function $f$ is convex if for all $x, y$:
$$f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y), \quad \forall \theta \in [0, 1]$$
That is, the line segment between any two points on the graph is not lower than the function value
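The definition can be spot-checked numerically, e.g., for $f(x) = x^2$ (a sketch; random sampling, not a proof):

```python
# Sketch: sample the convexity inequality for f(x) = x^2.
import random

f = lambda x: x * x
for _ in range(1000):
    x, y, t = random.uniform(-5, 5), random.uniform(-5, 5), random.random()
    assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-9
print("inequality held at all sampled points")
```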
Convex Optimization II

[Figure: a convex function; the chord joining $(x, f(x))$ and $(y, f(y))$ lies above the curve]
Convex Optimization III
A convex optimization problem takes the following form:
$$\min \ f_0(w) \qquad \text{subject to} \quad f_i(w) \le 0, \ i = 1, \ldots, m, \quad h_i(w) = 0, \ i = 1, \ldots, p, \qquad (3)$$
where $f_0, \ldots, f_m$ are convex functions and $h_1, \ldots, h_p$ are affine (i.e., linear functions):
$$h_i(w) = a^T w + b$$
Convex Optimization IV

A nice property of convex optimization problems is that
$$\inf_w \{f_0(w) \mid w \text{ satisfies constraints}\}$$
is unique
The optimal objective value is unique, but the optimal $w$ may not be
There are other nice properties such as the primal-dual relationship that we will use
To learn more about convex optimization, you can check the book by Boyd and Vandenberghe (2004)
Deriving the Dual
For simplicity, consider the problem without $\xi_i$:
$$\min_{w, b} \ \frac{1}{2} w^T w \qquad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1, \ i = 1, \ldots, l.$$
Its dual is
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad \text{subject to} \quad 0 \le \alpha_i, \ i = 1, \ldots, l, \quad y^T \alpha = 0,$$
where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$
Lagrangian Dual I
Lagrangian dual:
$$\max_{\alpha \ge 0} \left( \min_{w, b} L(w, b, \alpha) \right),$$
where
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i(w^T \phi(x_i) + b) - 1 \right)$$
Lagrangian Dual II
Strong duality:
$$\min \ \text{Primal} = \max_{\alpha \ge 0} \left( \min_{w, b} L(w, b, \alpha) \right)$$
After SVM became popular, quite a few people think that for any optimization problem
⇒ Lagrangian dual exists and strong duality holds
Wrong! We usually need:
the optimization problem is convex
certain constraint qualification holds (details not discussed)
Lagrangian Dual III

We have both:
the SVM primal is convex and has linear constraints
Simplify the dual. When $\alpha$ is fixed,
$$\min_{w, b} L(w, b, \alpha) = \begin{cases} -\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \ne 0, \\ \min_{w} \ \frac{1}{2} w^T w - \sum_{i=1}^{l} \alpha_i \left( y_i w^T \phi(x_i) - 1 \right) & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0. \end{cases}$$
If $\sum_{i=1}^{l} \alpha_i y_i \ne 0$, we can decrease
$$-b \sum_{i=1}^{l} \alpha_i y_i$$
in $L(w, b, \alpha)$ to $-\infty$
If $\sum_{i=1}^{l} \alpha_i y_i = 0$, the optimum of the strictly convex function
$$\frac{1}{2} w^T w - \sum_{i=1}^{l} \alpha_i \left( y_i w^T \phi(x_i) - 1 \right)$$
happens when
$$\nabla_w L(w, b, \alpha) = 0.$$
Thus,
$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i).$$
Note that
$$w^T w = \left( \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \right)^T \left( \sum_{j=1}^{l} \alpha_j y_j \phi(x_j) \right) = \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)$$
The dual is
$$\max_{\alpha \ge 0} \ \begin{cases} \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0, \\ -\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \ne 0. \end{cases}$$
Lagrangian dual: $\max_{\alpha \ge 0} \left( \min_{w, b} L(w, b, \alpha) \right)$
$-\infty$ is definitely not the maximum of the dual, so a dual optimal solution cannot happen when
$$\sum_{i=1}^{l} \alpha_i y_i \ne 0.$$
The dual is simplified to
$$\max_{\alpha \in \mathbb{R}^l} \ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) \qquad \text{subject to} \quad y^T \alpha = 0, \quad \alpha_i \ge 0, \ i = 1, \ldots, l.$$
Our problems may be infinite dimensional (i.e., $w \in \mathbb{R}^\infty$)
We can still use Lagrangian duality
See a rigorous discussion in Lin (2001)
Primal versus Dual I
Recall the dual problem is
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \quad y^T \alpha = 0$$
and at optimum
$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \qquad (4)$$
Primal versus Dual II

What if we put (4) into the primal?
$$\min_{\alpha, \xi} \ \frac{1}{2} \alpha^T Q \alpha + C \sum_{i=1}^{l} \xi_i \qquad \text{subject to} \quad (Q\alpha + by)_i \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (5)$$
Note that
$$y_i w^T \phi(x_i) = y_i \sum_{j=1}^{l} \alpha_j y_j \phi(x_j)^T \phi(x_i) = \sum_{j=1}^{l} Q_{ij} \alpha_j = (Q\alpha)_i$$
Primal versus Dual III
If $Q$ is positive definite, we can prove that the optimal $\alpha$ of (5) is the same as that of the dual
So the dual is not the only choice to obtain the model
Large Dense Quadratic Programming
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \quad y^T \alpha = 0$$
$Q_{ij} \ne 0$; $Q$: an $l$ by $l$ fully dense matrix
50,000 training points: 50,000 variables:
$$(50{,}000^2 \times 8 / 2) \text{ bytes} = 10\text{GB RAM to store } Q$$
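The arithmetic behind the 10GB figure, assuming 8-byte entries and storing only half of the symmetric matrix:

```python
# Sketch: memory needed for the dense Q of 50,000 training points.
l = 50_000
bytes_needed = l ** 2 * 8 / 2          # 8-byte entries; symmetric, store half
print(bytes_needed / 1e9, "GB")        # 10.0 GB
```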
Large Dense Quadratic Programming (Cont’d)

For quadratic programming problems, traditional optimization methods assume that $Q$ is available in the computer memory
They cannot be directly applied here because $Q$ cannot even be stored
Currently, decomposition methods (a type of coordinate descent method) are what is used in practice
Decomposition Methods
Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)
Similar to coordinate-wise minimization
Working set $B$; $N = \{1, \ldots, l\} \setminus B$ fixed
Sub-problem at the $k$th iteration:
$$\min_{\alpha_B} \ \frac{1}{2} \begin{bmatrix} \alpha_B^T & (\alpha_N^k)^T \end{bmatrix} \begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix} - \begin{bmatrix} e_B^T & e_N^T \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}$$
$$\text{subject to} \quad 0 \le \alpha_t \le C, \ t \in B, \quad y_B^T \alpha_B = -y_N^T \alpha_N^k$$
Avoid Memory Problems
The new objective function:
$$\frac{1}{2} \begin{bmatrix} \alpha_B^T & (\alpha_N^k)^T \end{bmatrix} \begin{bmatrix} Q_{BB}\alpha_B + Q_{BN}\alpha_N^k \\ Q_{NB}\alpha_B + Q_{NN}\alpha_N^k \end{bmatrix} - e_B^T \alpha_B + \text{constant}$$
$$= \frac{1}{2} \alpha_B^T Q_{BB} \alpha_B + (-e_B + Q_{BN}\alpha_N^k)^T \alpha_B + \text{constant}$$
Only $|B|$ columns of $Q$ are needed
In general $|B| \le 10$ is used
Calculated when used: trade time for space
But is such an approach practical?
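The "calculated when used" idea in code (a sketch of the concept, not LIBSVM's implementation): only the $|B|$ kernel columns needed by the sub-problem are formed.

```python
# Sketch: form Q[:, B] on demand instead of storing the full l-by-l matrix.
import numpy as np

def rbf_kernel(X, Z, gamma):
    # K(x, z) = exp(-gamma ||x - z||^2) for all pairs of rows of X and Z
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def q_columns(X, y, B, gamma):
    # Columns Q[:, B] with Q_ij = y_i y_j K(x_i, x_j): an l-by-|B| block
    return (y[:, None] * y[B][None, :]) * rbf_kernel(X, X[B], gamma)

X = np.random.randn(1000, 5)
y = np.where(np.random.rand(1000) < 0.5, -1.0, 1.0)
print(q_columns(X, y, B=[3, 7], gamma=0.5).shape)   # (1000, 2): time for space
```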
How Do Decomposition Methods Perform?

Convergence is not very fast. This is known because of using only first-order information
But there is no need to have a very accurate $\alpha$
Decision function:
$$\mathrm{sgn}(w^T \phi(x) + b) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right)$$
Prediction may still be correct with a rough $\alpha$
Further, in some situations,
$$\#\text{ support vectors} \ll \#\text{ training points}$$
Initial $\alpha^1 = 0$, some instances never used
How Do Decomposition Methods Perform? (Cont’d)

An example of training 50,000 instances using the software LIBSVM:
$svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time 79.524s
This was done on a typical desktop
Calculating the whole $Q$ takes more time
#SVs = 3,370 ≪ 50,000
A good case where some $\alpha_i$ remain at zero all the time
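For reference, roughly the same run through scikit-learn's LIBSVM wrapper (a sketch; `22features` is the slide's data file, and the flags map to `C=16`, `gamma=4`, and a 400MB kernel cache):

```python
# Sketch: svm-train -c 16 -g 4 -m 400, via scikit-learn's LIBSVM binding.
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

X, y = load_svmlight_file("22features")    # the data set used on the slide
model = SVC(C=16, gamma=4, cache_size=400).fit(X, y)
print(model.support_.size)                 # total number of support vectors (nSV)
```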
How Do Decomposition Methods Perform? (Cont’d)

Because many $\alpha_i = 0$ in the end, we can develop a shrinking technique
Variables are removed during the optimization procedure. Smaller problems are solved
Machine Learning Properties are Useful in Designing Optimization Algorithms

We have seen that special properties of SVM contribute to the viability of decomposition methods
For machine learning applications, there is no need to accurately solve the optimization problem
Because some optimal $\alpha_i = 0$, decomposition methods may not need to update all the variables
Also, we can use shrinking techniques to reduce the problem size during decomposition methods
Differences between Optimization and Machine Learning

The two topics may have different focuses. We give the following example
The decomposition method we just discussed converges more slowly when $C$ is large
Using $C = 1$ on a data set: # iterations: 508
Using $C = 5{,}000$: # iterations: 35,241
Optimization researchers may rush to solve difficult cases of large $C$
It turns out that large $C$ is less used than small $C$
Recall that SVM solves
$$\frac{1}{2} w^T w + C \ (\text{sum of training losses})$$
A large $C$ means to overfit training data
This does not give good test accuracy. More details about overfitting will be discussed later
Regularization and linear versus kernel
Equivalent Optimization Problem
• Recall the SVM optimization problem is
$$\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \qquad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ i = 1, \ldots, l.$$
• It is equivalent to
$$\min_{w, b} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max(0, 1 - y_i(w^T \phi(x_i) + b))$$
• The reformulation is useful to derive SVM from a different viewpoint
Equivalent Optimization Problem (Cont’d)
That is, at optimum,
$$\xi_i = \max(0, 1 - y_i(w^T \phi(x_i) + b))$$
Reason: from the constraints,
$$\xi_i \ge 1 - y_i(w^T \phi(x_i) + b) \quad \text{and} \quad \xi_i \ge 0,$$
but we also want to minimize $\xi_i$
Linear and Kernel I
Linear classifier:
$$\mathrm{sgn}(w^T x + b)$$
Kernel classifier:
$$\mathrm{sgn}(w^T \phi(x) + b) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right)$$
Linear is a special case of kernel
An important difference is that for linear we can store $w$
Linear and Kernel II
For kernel, $w$ may be infinite dimensional and cannot be stored
We will show that they are useful in different circumstances
The Bias Term b
Recall the decision function is
$$\mathrm{sgn}(w^T x + b)$$
Sometimes the bias term $b$ is omitted:
$$\mathrm{sgn}(w^T x)$$
This is fine if the number of features is not too small
Minimizing Training Errors
For classification, naturally we aim to minimize the training error:
$$\min_w \ (\text{training errors})$$
To characterize the training error, we need a loss function $\xi(w; x, y)$ for each instance $(x, y)$
Ideally we should use the 0–1 training loss:
$$\xi(w; x, y) = \begin{cases} 1 & \text{if } y w^T x < 0, \\ 0 & \text{otherwise} \end{cases}$$
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult

[Figure: the 0–1 loss as a step function of $-y w^T x$]
We need continuous approximations
Common Loss Functions
Hinge loss (l1 loss):
$$\xi_{L1}(w; x, y) \equiv \max(0, 1 - y w^T x) \qquad (6)$$
Squared hinge loss (l2 loss):
$$\xi_{L2}(w; x, y) \equiv \max(0, 1 - y w^T x)^2 \qquad (7)$$
Logistic loss:
$$\xi_{LR}(w; x, y) \equiv \log(1 + e^{-y w^T x}) \qquad (8)$$
SVM: (6)-(7). Logistic regression (LR): (8)
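The three losses written out as functions of the margin $m = y w^T x$ (a sketch; NumPy assumed):

```python
# Sketch: losses (6)-(8) as functions of the margin m = y * w^T x.
import numpy as np

hinge    = lambda m: np.maximum(0.0, 1.0 - m)         # (6) l1 loss
sq_hinge = lambda m: np.maximum(0.0, 1.0 - m) ** 2    # (7) l2 loss
logistic = lambda m: np.log1p(np.exp(-m))             # (8) LR loss

for m in [-1.0, 0.0, 1.0, 2.0]:
    print(m, hinge(m), sq_hinge(m), logistic(m))
# All three are large for wrongly classified points (m < 0) and
# small or zero once the margin is comfortable (m >= 1).
```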
Common Loss Functions (Cont’d)
[Figure: $\xi_{L1}$, $\xi_{L2}$, and $\xi_{LR}$ plotted against $-y w^T x$]

Logistic regression is very related to SVM. Their performance (i.e., test accuracy) is usually similar
Common Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification,
You can easily achieve 100% training accuracy
This is useless
When training a data set, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
[Figure: filled symbols: training; ◦ and △: testing]
Regularization
To minimize the training error we manipulate the $w$ vector so that it fits the data
To avoid overfitting we need a way to make $w$'s values less extreme.
One idea is to make $w$ values closer to zero
We can add, for example,
$$\frac{w^T w}{2} \quad \text{or} \quad \|w\|_1$$
to the objective function
General Form of Linear Classification I
Training data $\{y_i, x_i\}$, $x_i \in \mathbb{R}^n$, $i = 1, \ldots, l$, $y_i = \pm 1$
$l$: # of data, $n$: # of features
$$\min_w \ f(w), \qquad f(w) \equiv \frac{w^T w}{2} + C \sum_{i=1}^{l} \xi(w; x_i, y_i) \qquad (9)$$
$w^T w / 2$: regularization term
$\xi(w; x, y)$: loss function
$C$: regularization parameter
General Form of Linear Classification II
Of course we can map data to a higher dimensional space:
$$\min_w \ f(w), \qquad f(w) \equiv \frac{w^T w}{2} + C \sum_{i=1}^{l} \xi(w; \phi(x_i), y_i)$$
SVM and Logistic Regression I
If hinge (l1) loss is used, the optimization problem is
$$\min_w \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max(0, 1 - y_i w^T x_i)$$
It is the SVM problem we had earlier (without the bias $b$)
Therefore, we have derived SVM from a different viewpoint
We also see that SVM is very related to logistic regression
SVM and Logistic Regression II
However, many think that they are different
This is wrong
Reason for this misunderstanding: traditionally,
when people say SVM ⇒ kernel SVM
when people say logistic regression ⇒ linear logistic regression
Indeed we can do kernel logistic regression:
$$\min_w \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T \phi(x_i)}\right)$$
SVM and Logistic Regression III
A main difference from SVM is that logistic regression has a probability interpretation
We will introduce logistic regression from another viewpoint
Logistic Regression
For a label-feature pair $(y, x)$, assume the probability model is
$$p(y|x) = \frac{1}{1 + e^{-y w^T x}}.$$
Note that
$$p(1|x) + p(-1|x) = \frac{1}{1 + e^{-w^T x}} + \frac{1}{1 + e^{w^T x}} = \frac{e^{w^T x}}{1 + e^{w^T x}} + \frac{1}{1 + e^{w^T x}} = 1$$
$w$ is the parameter to be decided
Logistic Regression (Cont’d)
Idea of this model:
$$p(1|x) = \frac{1}{1 + e^{-w^T x}} \ \begin{cases} \to 1 & \text{if } w^T x \gg 0, \\ \to 0 & \text{if } w^T x \ll 0 \end{cases}$$
Assume training instances are
$$(y_i, x_i), \ i = 1, \ldots, l$$
Logistic Regression (Cont’d)
Logistic regression finds $w$ by maximizing the following likelihood:
$$\max_w \ \prod_{i=1}^{l} p(y_i | x_i). \qquad (10)$$
Negative log-likelihood:
$$-\log \prod_{i=1}^{l} p(y_i | x_i) = -\sum_{i=1}^{l} \log p(y_i | x_i) = \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right)$$
Logistic Regression (Cont’d)
Logistic regression:
$$\min_w \ \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right).$$
Regularized logistic regression:
$$\min_w \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right). \qquad (11)$$
$C$: regularization parameter decided by users
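A direct transcription of (11) as code (a sketch; NumPy assumed), handy for checking what an optimizer returns:

```python
# Sketch: the regularized logistic regression objective (11).
import numpy as np

def lr_objective(w, X, y, C):
    # 0.5 w^T w + C * sum_i log(1 + exp(-y_i w^T x_i))
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.sum(np.log1p(np.exp(-margins)))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.where(rng.random(100) < 0.5, -1.0, 1.0)
print(lr_objective(np.zeros(5), X, y, C=1.0))   # at w = 0 each loss is log(2)
```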
Loss Functions: Differentiability
However,
$\xi_{L1}$: not differentiable
$\xi_{L2}$: differentiable but not twice differentiable
$\xi_{LR}$: twice differentiable
The same optimization method may not be applicable to all these losses
Discussion
We see that the same classification method can be derived in different ways
SVM:
Maximal margin
Regularization and training losses
LR:
Regularization and training losses
Maximum likelihood
Regularization
L1 versus L2:
$$\|w\|_1 \quad \text{and} \quad \frac{w^T w}{2}$$
$w^T w / 2$: smooth, easier to optimize
$\|w\|_1$: non-differentiable; sparse solution, possibly many zero elements
Possible advantages of L1 regularization:
Feature selection
Less storage for $w$
Linear and Kernel Classification
Methods such as SVM and logistic regression can be used in two ways
Kernel methods: data mapped to a higher dimensional space
$$x \Rightarrow \phi(x)$$
$\phi(x_i)^T \phi(x_j)$ easily calculated; little control on $\phi(\cdot)$
Feature engineering + linear classification:
We have $x$ without mapping. Alternatively, we can say that $\phi(x)$ is our $x$; full control on $x$ or $\phi(x)$
We refer to them as kernel and linear classifiers
Linear and Kernel Classification
Let’s check the prediction cost
$$w^T x + b \quad \text{versus} \quad \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$$
If $K(x_i, x_j)$ takes $O(n)$, then
$$O(n) \quad \text{versus} \quad O(nl)$$
Linear is much cheaper
A similar difference occurs for training
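One rough way to see the $O(n)$ versus $O(nl)$ gap (a sketch with scikit-learn; absolute timings depend on the machine):

```python
# Sketch: prediction cost, linear (one dot product) vs. kernel (sum over SVs).
import time
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 100))
y = (X[:, 0] > 0).astype(int)

lin = LinearSVC().fit(X, y)        # stores w; prediction is w^T x + b
ker = SVC(kernel="rbf").fit(X, y)  # stores support vectors; cost grows with them
for model in (lin, ker):
    t0 = time.perf_counter()
    model.predict(X)
    print(type(model).__name__, time.perf_counter() - t0)
```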
Linear and Kernel Classification (Cont’d)
In a sense, linear is a special case of kernel
Indeed, we can prove that the test accuracy of linear is the same as Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)
Therefore, roughly we have
test accuracy: kernel ≥ linear
cost: kernel ≫ linear
Speed is the reason to use linear
Linear and Kernel Classification (Cont’d)
For some problems, accuracy by linear is as good as nonlinear
But training and testing are much faster
This particularly happens for document classification
Number of features (bag-of-words model) very large
Data very sparse (i.e., few non-zeros)
Comparison Between Linear and Kernel (Training Time & Testing Accuracy)

              Linear                  RBF Kernel
Data set      Time    Accuracy        Time        Accuracy
MNIST38       0.1     96.82           38.1        99.70
ijcnn1        1.6     91.81           26.8        98.69
covtype       1.4     76.37           46,695.8    96.11
news20        1.1     96.95           383.2       96.90
real-sim      0.3     97.44           938.3       97.82
yahoo-japan   3.1     92.63           20,955.2    93.31
webspam       25.7    93.35           15,681.8    99.26

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features
Extension: Training Explicit Form of Nonlinear Mappings I

Linear-SVM method to train $\phi(x_1), \ldots, \phi(x_l)$
Kernel not used
Applicable only if dimension of $\phi(x)$ not too large
Low-degree polynomial mappings:
$$K(x_i, x_j) = (x_i^T x_j + 1)^2 = \phi(x_i)^T \phi(x_j)$$
$$\phi(x) = [1, \sqrt{2}x_1, \ldots, \sqrt{2}x_n, x_1^2, \ldots, x_n^2, \sqrt{2}x_1 x_2, \ldots, \sqrt{2}x_{n-1} x_n]^T$$
Extension: Training Explicit Form of Nonlinear Mappings II

For this mapping, # features = $O(n^2)$
Recall $O(n)$ for linear versus $O(nl)$ for kernel
Now $O(n^2)$ versus $O(nl)$
Sparse data:
$n \Rightarrow \bar{n}$, the average # non-zeros for sparse data
$\bar{n} \ll n \Rightarrow O(\bar{n}^2)$ may be much smaller than $O(l\bar{n})$
When degree is small, train the explicit form of $\phi(x)$
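The idea in code (a sketch; scikit-learn's PolynomialFeatures with LinearSVC stands in for the LIBLINEAR-based approach in the slides):

```python
# Sketch: train the explicit degree-2 mapping phi(x) with a linear classifier.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = make_pipeline(PolynomialFeatures(degree=2), LinearSVC())
model.fit(X, y)    # a linear solver on phi(x); no kernel evaluations
print(model.score(X, y))
```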
Testing Accuracy and Training Time
              Degree-2 Polynomial               Accuracy diff.
Data set      Training time (s)      Accuracy   vs. Linear   vs. RBF
              LIBLINEAR   LIBSVM
a9a           1.6         89.8       85.06      0.07         0.02
real-sim      59.8        1,220.5    98.00      0.49         0.10
ijcnn1        10.7        64.2       97.84      5.63         -0.85
MNIST38       8.6         18.4       99.29      2.47         -0.40
covtype       5,211.9     NA         80.09      3.74         -15.98
webspam       3,228.1     NA         98.44      5.29         -0.76

Training $\phi(x_i)$ by linear: faster than kernel, but sometimes competitive accuracy
Chih-Jen Lin (National Taiwan Univ.) 87 / 181
Example: Dependency Parsing I

This is an NLP application

|               | Kernel: RBF | Kernel: Poly-2 | Linear: Linear | Linear: Poly-2 |
|---------------|-------------|----------------|----------------|----------------|
| Training time | 3h34m53s    | 3h21m51s       | 3m36s          | 3m43s          |
| Parsing speed | 0.7x        | 1x             | 1652x          | 103x           |
| UAS           | 89.92       | 91.67          | 89.11          | 91.71          |
| LAS           | 88.55       | 90.60          | 88.07          | 90.71          |

We get faster training/testing while maintaining good accuracy
Example: Dependency Parsing II

We achieve this by training low-degree polynomial-mapped data with linear classification

That is, linear methods explicitly train $\phi(x_i), \forall i$

We consider the following low-degree polynomial mapping:

$$\phi(x) = [1,\ x_1, \dots, x_n,\ x_1^2, \dots, x_n^2,\ x_1x_2, \dots, x_{n-1}x_n]^T$$
Handling High Dimensionality of $\phi(x)$

A multi-class problem with sparse data:

| $n$    | Dim. of $\phi(x)$ | $l$     | $\bar{n}$ | # nonzeros of $w$ |
|--------|-------------------|---------|-----------|-------------------|
| 46,155 | 1,065,165,090     | 204,582 | 13.3      | 1,438,456         |

$\bar{n}$: average # of nonzeros per instance

Dimensionality of $w$ is very high, but $w$ is sparse

Some feature columns $x_ix_j$ of the training data are entirely zero

Hashing techniques are used to handle the sparse $w$
Example: Classifier in a Small Device

In a sensor application (Yu et al., 2014), the classifier can use less than 16KB of RAM

| Classifier          | Test accuracy | Model size |
|---------------------|---------------|------------|
| Decision Tree       | 77.77         | 76.02KB    |
| AdaBoost (10 trees) | 78.84         | 1,500.54KB |
| SVM (RBF kernel)    | 85.33         | 1,287.15KB |

Number of features: 5

We consider a degree-3 polynomial mapping:

$$\text{dimensionality} = \binom{5+3}{3} + \text{bias term} = 57.$$
Example: Classifier in a Small Device (Cont'd)

One-against-one strategy for 5-class classification:

$$\binom{5}{2} \times 57 \times 4\ \text{bytes} = 2.28\text{KB}$$
Assume single precision
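As a sanity check, a small sketch (mine, not from the slides) reproducing the arithmetic, assuming 1KB = 1000 bytes as the slide's rounding suggests:

```python
# Reproduce the model-size arithmetic for the one-against-one degree-3 model.
from math import comb

dim = comb(5 + 3, 3) + 1            # degree-3 mapping of 5 features + bias = 57
n_binary = comb(5, 2)               # one-against-one: C(5,2) = 10 binary SVMs
size_bytes = n_binary * dim * 4     # 4 bytes per weight (single precision)
print(dim, size_bytes / 1000)       # 57, 2.28 (KB)
```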
Results:

| SVM method        | Test accuracy | Model size |
|-------------------|---------------|------------|
| RBF kernel        | 85.33         | 1,287.15KB |
| Polynomial kernel | 84.79         | 2.28KB     |
| Linear kernel     | 78.51         | 0.24KB     |
Multi-class classification
Multi-class Classification

SVM and logistic regression are methods for two-class classification

We need certain ways to extend them for multi-class problems

This is not a problem for methods such as nearest neighbor or decision trees
Multi-class Classification (Cont'd)

$k$ classes

One-against-the-rest: train $k$ binary SVMs:

1st class vs. $(2, \dots, k)$th classes
2nd class vs. $(1, 3, \dots, k)$th classes
...

$k$ decision functions:

$$(w^1)^T\phi(x) + b^1,\ \dots,\ (w^k)^T\phi(x) + b^k$$
Prediction:

$$\arg\max_j\ (w^j)^T\phi(x) + b^j$$

Reason: if $x \in$ 1st class, then we should have

$$\begin{aligned}
(w^1)^T\phi(x) + b^1 &\ge +1 \\
(w^2)^T\phi(x) + b^2 &\le -1 \\
&\ \vdots \\
(w^k)^T\phi(x) + b^k &\le -1
\end{aligned}$$
Multi-class Classification (Cont'd)

One-against-one: train $k(k-1)/2$ binary SVMs on the class pairs

$$(1,2), (1,3), \dots, (1,k), (2,3), (2,4), \dots, (k-1,k)$$

If 4 classes ⇒ 6 binary SVMs:

| $y_i = 1$ | $y_i = -1$ | Decision function |
|-----------|------------|-------------------|
| class 1   | class 2    | $f^{12}(x) = (w^{12})^Tx + b^{12}$ |
| class 1   | class 3    | $f^{13}(x) = (w^{13})^Tx + b^{13}$ |
| class 1   | class 4    | $f^{14}(x) = (w^{14})^Tx + b^{14}$ |
| class 2   | class 3    | $f^{23}(x) = (w^{23})^Tx + b^{23}$ |
| class 2   | class 4    | $f^{24}(x) = (w^{24})^Tx + b^{24}$ |
| class 3   | class 4    | $f^{34}(x) = (w^{34})^Tx + b^{34}$ |
For a test instance, predict with all binary SVMs:

| Classes | Winner |
|---------|--------|
| 1 vs. 2 | 1      |
| 1 vs. 3 | 1      |
| 1 vs. 4 | 1      |
| 2 vs. 3 | 2      |
| 2 vs. 4 | 4      |
| 3 vs. 4 | 3      |

Select the class with the largest vote:

| class   | 1 | 2 | 3 | 4 |
|---------|---|---|---|---|
| # votes | 3 | 1 | 1 | 1 |

May use decision values as well
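A minimal sketch (mine, not from the slides) of this voting scheme for the 4-class example above:

```python
# One-against-one voting: count each pairwise winner, predict the top class.
from collections import Counter

pairwise_winner = {(1, 2): 1, (1, 3): 1, (1, 4): 1,   # winners from the table
                   (2, 3): 2, (2, 4): 4, (3, 4): 3}

votes = Counter(pairwise_winner.values())
print(votes)                          # {1: 3, 2: 1, 4: 1, 3: 1}
print(max(votes, key=votes.get))      # class 1 wins with 3 votes
```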
Solving a Single Problem

An approach by Crammer and Singer (2002):

$$\min_{w^1,\dots,w^k}\ \frac{1}{2}\sum_{m=1}^k \|w^m\|_2^2 + C\sum_{i=1}^l \xi(\{w^m\}_{m=1}^k;\, x_i, y_i),$$

where

$$\xi(\{w^m\}_{m=1}^k;\, x, y) \equiv \max_{m \neq y}\ \max\bigl(0,\ 1 - (w^y - w^m)^T x\bigr).$$

We hope the decision value of $x_i$ by the model $w^{y_i}$ is larger than the others

Prediction: same as one-against-the-rest

$$\arg\max_j\ (w^j)^T x$$
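A minimal sketch (mine, not from the slides) evaluating this loss for one instance; the weight values are illustrative:

```python
# Crammer-Singer loss: max_{m != y} max(0, 1 - (w^y - w^m)^T x).
import numpy as np

def cs_loss(W, x, y):
    """W: (k, n) array with rows w^1..w^k; y: true class index (0-based)."""
    scores = W @ x
    margins = 1.0 - (scores[y] - scores)   # 1 - (w^y - w^m)^T x for every m
    margins[y] = 0.0                       # exclude m == y
    return max(0.0, margins.max())

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
x = np.array([2.0, 0.5])
print(cs_loss(W, x, y=0))   # 0.0: class 0 beats every other class by >= 1
```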
Discussion

Other variants of solving a single optimization problem include Weston and Watkins (1999); Lee et al. (2004)

A comparison in Hsu and Lin (2002)

RBF kernel: accuracy similar for different methods

But 1-against-1 is the fastest for training
Maximum Entropy

Maximum entropy: a generalization of logistic regression for multi-class problems

It is widely applied in NLP applications

Conditional probability of label $y$ given data $x$:

$$P(y|x) \equiv \frac{\exp(w_y^T x)}{\sum_{m=1}^k \exp(w_m^T x)}$$
Maximum Entropy (Cont'd)

We then minimize the regularized negative log-likelihood:

$$\min_{w_1,\dots,w_k}\ \frac{1}{2}\sum_{m=1}^k \|w_m\|^2 + C\sum_{i=1}^l \xi(\{w_m\}_{m=1}^k;\, x_i, y_i),$$

where

$$\xi(\{w_m\}_{m=1}^k;\, x, y) \equiv -\log P(y|x).$$
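A minimal sketch (mine, not from the slides) of this loss, using a log-sum-exp for numerical stability:

```python
# Maximum-entropy (softmax) loss: xi({w_m}; x, y) = -log P(y|x).
import numpy as np

def maxent_loss(W, x, y):
    """W: (k, n) array with rows w_1..w_k; y: true class index (0-based)."""
    scores = W @ x
    log_z = np.logaddexp.reduce(scores)    # log sum_m exp(w_m^T x), stable
    return log_z - scores[y]               # -log P(y|x)

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
x = np.array([2.0, 0.5])
print(maxent_loss(W, x, y=0))   # small when w_y^T x dominates the others
```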
Maximum Entropy (Cont'd)

Is this loss function reasonable?

If

$$w_{y_i}^T x_i \gg w_m^T x_i,\ \forall m \neq y_i,$$

then

$$\xi(\{w_m\}_{m=1}^k;\, x_i, y_i) \approx 0$$

That is, no loss

In contrast, if

$$w_{y_i}^T x_i \ll w_m^T x_i,\ m \neq y_i,$$

then $P(y_i|x_i) \ll 1$ and the loss is large
Features as Functions

NLP applications often use a function $f(x, y)$ to generate the feature vector:

$$P(y|x) \equiv \frac{\exp(w^T f(x,y))}{\sum_{y'} \exp(w^T f(x,y'))}. \qquad (12)$$

The earlier probability model is a special case with

$$f(x,y) = [\,\underbrace{0,\dots,0}_{(y-1)n},\ x^T,\ 0,\dots,0\,]^T \in \mathbb{R}^{nk} \quad\text{and}\quad w = \begin{bmatrix} w_1 \\ \vdots \\ w_k \end{bmatrix}.$$
Support vector regression
Least Square Regression I

Given training data $(x_1, y_1), \dots, (x_l, y_l)$

Now $y_i \in R$ is the target value

Regression: find a function so that $f(x_i) \approx y_i$
Least Square Regression II

Least square regression:

$$\min_{w,b}\ \sum_{i=1}^l \bigl(y_i - (w^Tx_i + b)\bigr)^2$$

That is, we model $f(x)$ by

$$f(x) = w^Tx + b$$

An example:
Least Square Regression III

[Figure: height (cm) versus weight (kg) with the fitted regression line 0.60 · Weight + 130.2]

note: picture is from http://tex.stackexchange.com/questions/119179/how-to-add-a-regression-line-to-randomly-generated-points-using-pgfplots-in-tikz
Least Square Regression IV

This is equivalent to

$$\min_{w,b}\ \sum_{i=1}^l \xi(w, b;\, x_i, y_i)$$

where

$$\xi(w, b;\, x_i, y_i) = \bigl(y_i - (w^Tx_i + b)\bigr)^2$$
Regularized Least Square

$\xi(w, b;\, x_i, y_i)$ is a kind of loss function

We can add regularization:

$$\min_{w,b}\ \frac{1}{2}w^Tw + C\sum_{i=1}^l \xi(w, b;\, x_i, y_i)$$
C is still the regularization parameter
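As a concrete illustration, a minimal sketch (mine, not from the slides) solving this problem in closed form via the normal equations; leaving the bias $b$ unregularized is my assumption, matching the $\frac{1}{2}w^Tw$ term:

```python
# Regularized least squares: min_{w,b} 0.5*||w||^2 + C*sum_i (y_i - w^T x_i - b)^2.
import numpy as np

def reg_least_squares(X, y, C):
    A = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    D = np.eye(A.shape[1]); D[-1, -1] = 0.0    # do not regularize b
    # Setting the gradient to zero gives (D/(2C) + A^T A) z = A^T y.
    z = np.linalg.solve(D / (2 * C) + A.T @ A, A.T @ y)
    return z[:-1], z[-1]                       # w, b

X = np.array([[50.0], [60.0], [70.0], [80.0]])  # weight (kg)
y = np.array([160.0, 166.0, 172.0, 178.0])      # height (cm)
print(reg_least_squares(X, y, C=100.0))         # close to slope 0.6, intercept 130
```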
Other loss functions?
Support Vector Regression I

ε-insensitive loss function ($b$ omitted):

$$\max(|w^Tx_i - y_i| - \varepsilon,\ 0)$$

$$\max(|w^Tx_i - y_i| - \varepsilon,\ 0)^2$$
Support Vector Regression II

[Figure: L1 and L2 ε-insensitive losses as functions of $w^Tx_i - y_i$; both are zero on the interval $(-\varepsilon, \varepsilon)$]

$\varepsilon$: errors small enough are treated as no error

This makes the model more robust (less overfitting of the data)
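A minimal sketch (mine, not from the slides) of the two losses:

```python
# L1 and L2 epsilon-insensitive losses: zero inside the eps-tube.
import numpy as np

def eps_insensitive(residual, eps, squared=False):
    """residual = w^T x_i - y_i."""
    loss = np.maximum(np.abs(residual) - eps, 0.0)
    return loss ** 2 if squared else loss

r = np.array([-1.0, -0.3, 0.0, 0.3, 1.0])
print(eps_insensitive(r, eps=0.5))                 # [0.5  0.  0.  0.  0.5 ]
print(eps_insensitive(r, eps=0.5, squared=True))   # [0.25 0.  0.  0.  0.25]
```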
Support Vector Regression III

One more parameter ($\varepsilon$) to decide

An equivalent form of the optimization problem:

$$\begin{aligned}
\min_{w,b,\xi,\xi^*}\quad & \frac{1}{2}w^Tw + C\sum_{i=1}^l \xi_i + C\sum_{i=1}^l \xi_i^* \\
\text{subject to}\quad & w^T\phi(x_i) + b - y_i \le \varepsilon + \xi_i, \\
& y_i - w^T\phi(x_i) - b \le \varepsilon + \xi_i^*, \\
& \xi_i, \xi_i^* \ge 0,\ i = 1, \dots, l.
\end{aligned}$$

This form is similar to the SVM formulation derived earlier
Support Vector Regression IV

The dual problem is

$$\begin{aligned}
\min_{\alpha,\alpha^*}\quad & \frac{1}{2}(\alpha - \alpha^*)^TQ(\alpha - \alpha^*) + \varepsilon\sum_{i=1}^l (\alpha_i + \alpha_i^*) + \sum_{i=1}^l y_i(\alpha_i - \alpha_i^*) \\
\text{subject to}\quad & e^T(\alpha - \alpha^*) = 0, \\
& 0 \le \alpha_i, \alpha_i^* \le C,\ i = 1, \dots, l,
\end{aligned}$$

where $Q_{ij} = K(x_i, x_j) \equiv \phi(x_i)^T\phi(x_j)$.
Support Vector Regression V

After solving the dual problem,

$$w = \sum_{i=1}^l (-\alpha_i + \alpha_i^*)\phi(x_i)$$

and the approximate function is

$$\sum_{i=1}^l (-\alpha_i + \alpha_i^*)K(x_i, x) + b.$$
Discussion

SVR and least-square regression are closely related

Why do people more commonly use l2 (least-square) rather than l1 losses?

Easier because of differentiability
SVM for clustering
One-class SVM I

Separate data into normal ones and outliers (Schölkopf et al., 2001):

$$\begin{aligned}
\min_{w,\xi,\rho}\quad & \frac{1}{2}w^Tw - \rho + \frac{1}{\nu l}\sum_{i=1}^l \xi_i \\
\text{subject to}\quad & w^T\phi(x_i) \ge \rho - \xi_i, \\
& \xi_i \ge 0,\ i = 1, \dots, l.
\end{aligned}$$
One-class SVM II

Instead of the parameter $C$ in SVM, here the parameter is $\nu$

$$w^T\phi(x_i) \ge \rho - \xi_i$$

means that we hope most data satisfy

$$w^T\phi(x_i) \ge \rho.$$

That is, most data are on one side of the hyperplane

Those on the wrong side are considered as outliers
One-class SVM III

The dual problem is

$$\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\alpha^TQ\alpha \\
\text{subject to}\quad & 0 \le \alpha_i \le 1/(\nu l),\ i = 1, \dots, l, \\
& e^T\alpha = 1,
\end{aligned}$$

where $Q_{ij} = K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$.

The decision function is

$$\operatorname{sgn}\left(\sum_{i=1}^l \alpha_i K(x_i, x) - \rho\right).$$
One-class SVM IV

The role of $-\rho$ is similar to the bias term $b$ earlier

From the dual problem we can see that

$$\nu \in (0, 1]$$

Otherwise, if $\nu > 1$, then

$$e^T\alpha \le 1/\nu < 1$$

violates the linear constraint

Clearly, a larger $\nu$ means we don't need to push $\xi_i$ to zero ⇒ more data are considered as outliers
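For experimentation, a minimal usage sketch (mine, not from the slides), assuming scikit-learn is available; its OneClassSVM implements this ν-parameterized formulation:

```python
# Flag outliers with a nu-parameterized one-class SVM (RBF kernel).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # mostly "normal" points
X[:10] += 6.0                       # shift a few points far away

clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)
pred = clf.predict(X)               # +1 = normal, -1 = outlier
print((pred == -1).sum(), "points flagged as outliers")
```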
Support Vector Data Description (SVDD) I

SVDD is another technique to identify outliers (Tax and Duin, 2004):

$$\begin{aligned}
\min_{R,a,\xi}\quad & R^2 + C\sum_{i=1}^l \xi_i \\
\text{subject to}\quad & \|\phi(x_i) - a\|^2 \le R^2 + \xi_i,\ i = 1, \dots, l, \\
& \xi_i \ge 0,\ i = 1, \dots, l.
\end{aligned}$$
Support Vector Data Description (SVDD) II

We obtain a hyperspherical model characterized by the center $a$ and the radius $R$

A test instance $x$ is detected as an outlier if

$$\|\phi(x) - a\|^2 > R^2.$$
Support Vector Data Description (SVDD) III

The dual problem:

$$\begin{aligned}
\min_{\alpha}\quad & \alpha^TQ\alpha - \sum_{i=1}^l \alpha_iQ_{i,i} \\
\text{subject to}\quad & e^T\alpha = 1, \qquad (13) \\
& 0 \le \alpha_i \le C,\ i = 1, \dots, l.
\end{aligned}$$

This dual problem is very close to that of one-class SVM
Support Vector Data Description (SVDD) IV

Consider a scaled version of the one-class SVM dual:

$$\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\alpha^TQ\alpha \\
\text{subject to}\quad & 0 \le \alpha_i \le 1,\ i = 1, \dots, l, \\
& e^T\alpha = \nu l.
\end{aligned}$$

If the Gaussian kernel is used,

$$Q_{i,i} = e^{-\gamma\|x_i - x_i\|^2} = 1$$

and the two dual problems are equivalent
Discussion

For unsupervised settings, evaluation is very difficult

Usually the evaluation is done in a subjective way
Practical use of support vector classification
Let's Try a Practical Example
A problem from astroparticle physics
1 2.61e+01 5.88e+01 -1.89e-01 1.25e+02
1 5.70e+01 2.21e+02 8.60e-02 1.22e+02
1 1.72e+01 1.73e+02 -1.29e-01 1.25e+02
0 2.39e+01 3.89e+01 4.70e-01 1.25e+02
0 2.23e+01 2.26e+01 2.11e-01 1.01e+02
0 1.64e+01 3.92e+01 -9.91e-02 3.24e+01
Training and testing sets available: 3,089 and 4,000 instances

Data available at LIBSVM Data Sets
Training and Testing
Training the set svmguide1 to obtain svmguide1.model
$./svm-train svmguide1
Testing the set svmguide1.t
$./svm-predict svmguide1.t svmguide1.model out
Accuracy = 66.925% (2677/4000)
We see that training and testing accuracy are very different. Training accuracy is almost 100%:
$./svm-predict svmguide1 svmguide1.model out
Accuracy = 99.7734% (3082/3089)
Why this Fails

Gaussian kernel is used here

We see that most kernel elements have

$$K_{ij} = e^{-\|x_i - x_j\|^2/4} \begin{cases} = 1 & \text{if } i = j, \\ \to 0 & \text{if } i \neq j, \end{cases}$$

because some features are in large numeric ranges

For what kind of data do we have

$$K \approx I?$$
Why this Fails (Cont'd)

If we have training data

$$\phi(x_1) = [1, 0, \dots, 0]^T,\ \dots,\ \phi(x_l) = [0, \dots, 0, 1]^T,$$

then

$$K = I$$

Clearly such training data can be correctly separated, but how about testing data?

So overfitting occurs
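A small sketch (mine, not from the slides) reproducing this effect on random features in a large numeric range:

```python
# Unscaled wide-range features push the Gaussian kernel matrix toward I.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 200, size=(5, 4))                  # large numeric range
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
K = np.exp(-sq / 4)                                   # K_ij = exp(-||xi-xj||^2/4)
print(np.round(K, 3))   # ~identity: 1 on the diagonal, ~0 elsewhere
```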
Overfitting

See the illustration in the next slide

In theory, you can easily achieve 100% training accuracy. This is useless.

When training and predicting data, we should

Avoid underfitting: small training error

Avoid overfitting: small testing error
[Figure: decision boundaries on two-class data; filled symbols: training points, hollow symbols: testing points]
Data Scaling

Without scaling, the above overfitting situation may occur

Also, features in greater numeric ranges may dominate

Example:

|       | height | gender |
|-------|--------|--------|
| $x_1$ | 150    | F      |
| $x_2$ | 180    | M      |
| $x_3$ | 185    | M      |

and $y_1 = 0,\ y_2 = 1,\ y_3 = 1$.
Data Scaling (Cont'd)

The separating hyperplane is almost vertical

[Figure: points $x_1$, $x_2$, $x_3$ with an almost vertical separating hyperplane]

Strongly depends on the first attribute, but the second may also be important
Data Scaling (Cont'd)

A simple solution is to linearly scale each feature to $[0, 1]$ by

$$\frac{\text{feature value} - \min}{\max - \min},$$

where $\max$ and $\min$ are the maximal and minimal values of each feature
There are many other scaling methods
Scaling generally helps, but not always
Data Scaling (Cont'd)

Scaling is needed for methods relying on similarity between instances

For example, k-nearest neighbor

It's not needed for methods such as decision trees, which rely on relative positions within an attribute
Data Scaling: Same Factors
A common mistake
$./svm-scale -l -1 -u 1 svmguide1 > svmguide1.scale
$./svm-scale -l -1 -u 1 svmguide1.t > svmguide1.t.scale
-l -1 -u 1: scaling to [−1, 1]
We need to use the same factors on training and testing:
$./svm-scale -s range1 svmguide1 > svmguide1.scale
$./svm-scale -r range1 svmguide1.t > svmguide1.t.scale
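The same idea in a minimal Python sketch (mine, not from the slides; svm-scale itself is the tool shown above): compute the min/max on the training set only and reuse them on the test set:

```python
# Fit scaling factors on training data; apply the SAME factors to test data.
import numpy as np

def fit_scaler(X, lo=-1.0, hi=1.0):
    return X.min(axis=0), X.max(axis=0), lo, hi       # like svm-scale -s range1

def apply_scaler(X, scaler):
    mn, mx, lo, hi = scaler                           # like svm-scale -r range1
    return lo + (X - mn) * (hi - lo) / (mx - mn)

X_train = np.array([[26.1, 58.8], [57.0, 221.0], [17.2, 173.0]])
X_test = np.array([[23.9, 38.9]])
scaler = fit_scaler(X_train)
print(apply_scaler(X_train, scaler))
print(apply_scaler(X_test, scaler))   # may fall outside [-1, 1]; that is fine
```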
Later we will give a real example
After Data Scaling
Train scaled data and then predict
$./svm-train svmguide1.scale
$./svm-predict svmguide1.t.scale svmguide1.scale.model
svmguide1.t.predict
Accuracy = 96.15%
Training accuracy is now similar
$./svm-predict svmguide1.scale svmguide1.scale.model o
Accuracy = 96.439%
For this experiment, we use parameters C = 1, γ = 0.25, but sometimes performance is sensitive to parameters
Parameters versus Performances
If we use C = 20, γ = 400
$./svm-train -c 20 -g 400 svmguide1.scale
$./svm-predict svmguide1.scale svmguide1.scale.model o
Accuracy = 100% (3089/3089)
100% training accuracy but
$./svm-predict svmguide1.t.scale svmguide1.scale.model o
Accuracy = 82.7% (3308/4000)
Very bad test accuracy
Overfitting happens
Parameter Selection

For SVM, we may need to select suitable parameters

They are $C$ and kernel parameters

Example:

$\gamma$ of $e^{-\gamma\|x_i - x_j\|^2}$

$a, b, d$ of $(x_i^Tx_j/a + b)^d$

How to select them so performance is better?
Performance Evaluation

Available data ⇒ training and validation

Train on the training set; test on the validation set to estimate the performance

A common way is k-fold cross validation (CV):

Data randomly separated into k groups

Each time, k − 1 groups are used for training and one for testing

Select parameters/kernels with the best CV result

There are many other methods to evaluate the performance
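As an illustration, a minimal sketch (mine, not from the slides), assuming scikit-learn is available, of what grid.py automates: selecting (C, γ) by 5-fold CV over a grid:

```python
# Grid search over (C, gamma) with 5-fold cross validation on toy data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)   # synthetic labels

grid = {"C": [2.0 ** k for k in (-1, 3, 7, 11)],
        "gamma": [2.0 ** k for k in (-7, -3, 1)]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```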
Contour of CV Accuracy

[Figure: contour plot of cross-validation accuracy over a grid of (C, γ) values]
The good region of parameters is quite large

SVM is sensitive to parameters, but not that sensitive

Sometimes default parameters work

But it's good to select them if time is allowed
Example of Parameter Selection
Direct training and test
$./svm-train svmguide3
$./svm-predict svmguide3.t svmguide3.model o
→ Accuracy = 2.43902%
After data scaling, accuracy is still low
$./svm-scale -s range3 svmguide3 > svmguide3.scale
$./svm-scale -r range3 svmguide3.t > svmguide3.t.scale
$./svm-train svmguide3.scale
$./svm-predict svmguide3.t.scale svmguide3.scale.model o
→ Accuracy = 12.1951%
Example of Parameter Selection (Cont'd)
Select parameters by trying a grid of (C , γ) values
$ python grid.py svmguide3.scale
· · ·
128.0 0.125 84.8753

(Best C = 128.0, γ = 0.125 with five-fold cross-validation rate = 84.8753%)
Train and predict using the obtained parameters
$ ./svm-train -c 128 -g 0.125 svmguide3.scale
$ ./svm-predict svmguide3.t.scale svmguide3.scale.model svmguide3.t.predict
→ Accuracy = 87.8049%
Selecting Kernels

RBF, polynomial, or others?

For beginners, use RBF first

Linear kernel: special case of RBF

Accuracy of linear the same as RBF under certain parameters (Keerthi and Lin, 2003)

Polynomial kernel:

$$(x_i^Tx_j/a + b)^d$$

Numerical difficulties: $(<1)^d \to 0$, $(>1)^d \to \infty$

More parameters than RBF
Selecting Kernels (Cont'd)

Commonly used kernels are Gaussian (RBF), polynomial, and linear

But in different areas, special kernels have been developed. Examples:

1. χ² kernel is popular in computer vision

2. String kernel is useful in some domains
A Simple Procedure for Beginners

After helping many users, we came up with the following procedure

1. Conduct simple scaling on the data

2. Consider the RBF kernel $K(x, y) = e^{-\gamma\|x - y\|^2}$

3. Use cross-validation to find the best parameters $C$ and $\gamma$

4. Use the best $C$ and $\gamma$ to train the whole training set

5. Test

In LIBSVM, we have a python script easy.py implementing this procedure
A Simple Procedure for Beginners (Cont'd)

We proposed this procedure in an "SVM guide" (Hsu et al., 2003) and implemented it in LIBSVM

From research viewpoints, this procedure is not novel. We never thought about submitting our guide somewhere

But this procedure has been tremendously useful

Now it is almost the standard thing to do for SVM beginners
A Real Example of Wrong Scaling

Separately scale each feature of training and testing data to [0, 1]:
$ ../svm-scale -l 0 svmguide4 > svmguide4.scale
$ ../svm-scale -l 0 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale
Accuracy = 69.2308% (216/312) (classification)
The accuracy is low even after parameter selection
$ ../svm-scale -l 0 -s range4 svmguide4 > svmguide4.scale
$ ../svm-scale -r range4 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale
Accuracy = 89.4231% (279/312) (classification)
A Real Example of Wrong Scaling (Cont'd)

With the correct setting, the 10 features in the test data svmguide4.t.scale have the following maximal values:

0.7402, 0.4421, 0.6291, 0.8583, 0.5385, 0.7407, 0.3982, 1.0000, 0.8218, 0.9874

Scaling the test set to [0, 1] generated an erroneous set.
More about Cross Validation

CV can be used for other classification methods

For example, a common way to select k of k-nearest neighbor is by CV

However, CV is easily misused
More about Cross Validation (Cont'd)

CV is a biased estimate

Think about this: if you have many parameters, you may adjust them to boost your CV accuracy

In some papers, people compare CV accuracy of different methods

This is not very appropriate

It's better to report independent test accuracy

Indeed, you are allowed to predict the test set only once for reporting the results
More about Cross Validation (Cont'd)

Sometimes you must be careful in splitting data for CV

Assume you have 20,000 images of 200 users:

User 1: 100 images
· · ·
User 200: 100 images

The standard CV may overestimate the performance because of easier predictions
More about Cross Validation (Cont'd)

An instance in the validation set may find a close one in the training set

A more suitable setting is to split data by meta-level information (i.e., users here)
A practical example of SVR
Electricity Load Forecasting

EUNITE worldwide competition 2001: http://neuron-ai.tuke.sk/competition

We were given:

Load per half hour from 1997 to 1998
Average daily temperature from 1995 to 1998
List of holidays

Goal: predict daily maximal load of January 1999

A time series prediction problem
SVR for Time Series Prediction I

Given $(\dots, y_{t-\Delta}, \dots, y_{t-1}, y_t, \dots, y_l)$ as the training series

Generate training data:

$(y_{t-\Delta}, \dots, y_{t-1})$ as attributes (features) of $x_i$

$y_t$ as the target value

One-step ahead prediction

Prediction: starting from the last segment,

$$(y_{l-\Delta+1}, \dots, y_l) \to \hat{y}_{l+1}$$

Repeat by using newly predicted values
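A minimal sketch (mine, not from the slides) of this windowing and iterative prediction scheme; the SVR import and its parameters are illustrative assumptions:

```python
# Build (y_{t-delta}, ..., y_{t-1}) -> y_t pairs, then predict iteratively.
import numpy as np

def make_windows(series, delta):
    X = np.array([series[t - delta:t] for t in range(delta, len(series))])
    y = np.array(series[delta:])
    return X, y

def predict_ahead(model, history, delta, steps):
    hist = list(history)
    for _ in range(steps):                 # feed each prediction back in
        hist.append(model.predict([hist[-delta:]])[0])
    return hist[len(history):]

# Usage with any fit/predict regressor, e.g. scikit-learn's SVR (assumed):
#   from sklearn.svm import SVR
#   X, y = make_windows(load_series, delta=7)
#   model = SVR(kernel="rbf", C=2**12, gamma=2**-4, epsilon=0.5).fit(X, y)
#   january = predict_ahead(model, load_series[-7:], delta=7, steps=31)
```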
Data Analyses I

Maximal load of each day

Didn't know how to use all half-hour data

Not used for later analyses/experiments

[Figure: daily maximal load (max_load) versus time over 1997–1998]
Data Analyses II

Issues largely discussed in earlier works:

Seasonal periodicity

Weekly periodicity: weekday higher, weekend lower

Holiday effect

All of the above known for January 1999

Weather influence: temperature unknown for January 1999

Temperature is very, very important

The main difficulty of this competition
Data Analyses III

Most early work on short-term prediction: temperature available

[Figure: daily maximal load (max_load) versus temperature]
Data Analyses IV

Error propagation of time series prediction is an issue
Methods

In addition to SVR, another simple and effective method is local modeling

It is like nearest neighbor in classification

Local modeling: find segments that closely resemble the segment preceding the point to be predicted, and average the elements that follow these similar segments
Data Encoding I

Both methods use a segment (a vector) for predicting the next value

Encoding: contents of a segment

The simplest: each segment contains the load of the previous Δ days

Used for the local model: Δ = 7

For SVM, more information is incorporated:

Seven attributes: maximal loads of the past 7 days
Data Encoding II

Seven binary (i.e., 0 or 1) attributes: which day of the week the target day is

One binary attribute: whether the target day is a holiday or not

One attribute (optional): temperature of the target day

Temperature unknown: train two SVMs, one for load and one for temperature

$$\begin{pmatrix} y_{t-\Delta} \\ T_{t-\Delta} \end{pmatrix}, \dots, \begin{pmatrix} y_{t-1} \\ T_{t-1} \end{pmatrix} \xrightarrow{\text{SVM1}} y_t$$

$$T_{t-\Delta}, \dots, T_{t-1} \xrightarrow{\text{SVM2}} T_t$$
Model Selection I

Parameters and features:

Δ: for both approaches

Local model: # of similar segments

SVR:

1. $C$: cost of error
2. $\varepsilon$: width of the ε-insensitive loss
3. mapping function $\phi$

Extremely important for data prediction

Known data separated into training, validation, and testing
Model Selection II

January 1997 or January 1998 as validation

Model selection is expensive

Restrict the search space via reasonable choices or simply guessing:

Δ = 7

SVR: RBF function $\phi(x_i)^T\phi(x_j) = e^{-\gamma\|x_i - x_j\|^2}$

Use the default width ε = 0.5 of LIBSVM

$C = 2^{12}$, $\gamma = 2^{-4}$: decided by validation
Summer Data I

Without summer:

Results for testing January 1998 (or 1997) are better

Give up information from April to September

This is an example where domain knowledge is used

However, can we do automatic time series segmentation to see that summer and winter are different?
Evaluation of Time Series Prediction I

MSE (Mean Square Error):

$$\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

MAPE (Mean Absolute Percentage Error):

$$\frac{1}{n}\sum_{i=1}^n \frac{|y_i - \hat{y}_i|}{|y_i|}$$

Error propagation: larger errors later

Unfair to earlier predictions if MSE is used
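A minimal sketch (mine, not from the slides) of the two criteria; note the slide's MSE omits the 1/n factor:

```python
# MSE (as on the slide, without 1/n) and MAPE for a prediction series.
import numpy as np

def mse(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2)

def mape(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

y_true = np.array([800.0, 750.0, 820.0])
y_pred = np.array([790.0, 760.0, 800.0])
print(mse(y_true, y_pred), f"{mape(y_true, y_pred):.4%}")
```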
Evaluation of Time Series Prediction II

There are other criteria
Results: Local Model I

Validation on different numbers of segments

Results in the competition: slightly worse than SVR
Results: SVR I

Two SVRs: very difficult to predict temperature

If in one day the temperature suddenly drops or increases ⇒ erroneous after that day

We conclude that if temperature is used, the variation is higher

We decided to give up using the temperature information

Only one SVM used

Prediction results for January 1998:
Results: SVR II

[Figure: daily maximal load (max_load) over January 1998]

The load of each week is similar

However, the model manages to find the trend
![Page 177: Support Vector Classi cation and Regressioncjlin/talks/itri.pdf · SVM and kernel methods Outline 1 Introduction 2 SVM and kernel methods 3 Dual problem and solving optimization problems](https://reader033.vdocuments.site/reader033/viewer/2022042914/5f4d6189f9b77f29830daf38/html5/thumbnails/177.jpg)
A practical example of SVR
Results: SVR III
Holiday is lower but error largerResults after encountering a holiday more inaccurateHolidays: January 1 and 6
Treat all 31 days in January 1999 as non-holidays
Some earlier work consider holidays andnon-holidays separately
We cannot do this because information aboutholidays is quite limited
Overall we take a very conservative approach
We forgot to manually lower the load of January 6

This is the reason why our $\max_i(\text{error}_i)$ is not good
Results: SVR IV
MAPE: 1.98839%
MSE: 364.498
[Figure: max_load over time for the 31 predicted days (y-axis 600–900), comparing the real and predicted values]
Discussion I
Instead of this conservative approach, can we do better?

Is there a good way to use temperature information?

Feature selection is the key to our approach

Examples: removing summer data, treating holidays as non-holidays

Parameter selection: needed, but a large range of parameters works (see the sketch below)

For example, changing $(C, \gamma) = (2^{12}, 2^{-4})$ to $(2^{12}, 2^{-5})$

⇒ results do not change much
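A minimal sketch of such a search over exponentially spaced parameters; the synthetic data, the grid ranges, and the 5-fold cross-validation are illustrative assumptions, not the competition setup.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(600, 900, size=(200, 7))    # hypothetical 7-day load windows
y = X[:, -1] + rng.normal(0, 10, size=200)  # hypothetical next-day max load

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])
param_grid = {
    "svr__C":     [2.0**k for k in range(6, 14, 2)],   # 2^6 ... 2^12
    "svr__gamma": [2.0**k for k in range(-8, 0, 2)],   # 2^-8 ... 2^-2
}
search = GridSearchCV(pipe, param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Neighboring grid points should give similar cross-validation error here, which is why a coarse grid over a wide range is usually enough.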
Discussion and conclusions
Conclusions
In this short course, we have introduced the details of SVM

Linear versus kernel is an important issue: you must decide when to use which

No matter how many advanced techniques are developed, simple models such as linear SVM or logistic regression will remain the first thing to try