support vector machines and kernel based learning · 2007. 9. 6. · kernel based learning:...
TRANSCRIPT
Support Vector Machines and Kernel Based
Learning
Johan Suykens
K.U. Leuven, ESAT-SCD/SISTAKasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70Email: [email protected]
http://www.esat.kuleuven.be/scd/
Tutorial ICANN 2007Porto Portugal, Sept. 2007
ICANN 2007 ⋄ Johan Suykens
Living in a data worldbiomedical
energyprocess industry
bio-informatics
multimedia
traffic
ICANN 2007 ⋄ Johan Suykens 1
Kernel based learning: interdisciplinary challenges
SVM & kernel methods
linear algebra
mathematics
statistics
systems and control theory
signal processingoptimization
machine learning
pattern recognition
data mining
neural networks
• Understanding the essential concepts and different facets of problems
• Providing systematical approaches, engineering kernel machines
• Integrative design, bridging the gaps between theory and practice
ICANN 2007 ⋄ Johan Suykens 2
- Part I contents -
• neural networks and support vector machinesfeature map and kernelsprimal and dual problem
• classification, regressionconvex problem, robustness, sparseness
• wider use of the kernel trickleast squares support vector machines as core problemskernel principal component analysis
• large scale fixed-size methodnonlinear modelling
ICANN 2007 ⋄ Johan Suykens 3
Classical MLPsx1
x2
x3
xn
1
w1
w2
w3wn
b
h(·)
h(·)
y
Multilayer Perceptron (MLP) properties:
• Universal approximation of continuous nonlinear functions
• Learning from input-output patterns: off-line/on-line
• Parallel network architecture, multiple inputs and outputs
+ Flexible and widely applicable:
Feedforward/recurrent networks, supervised/unsupervised learning
- Many local minima, trial and error for determining number of neurons
ICANN 2007 ⋄ Johan Suykens 4
Support Vector Machines
cost function cost function
weights weights
MLP SVM
• Nonlinear classification and function estimation by convex optimizationwith a unique solution and primal-dual interpretations.
• Number of neurons automatically follows from a convex program.
• Learning and generalization in high dimensional input spaces(coping with the curse of dimensionality).
• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, kernels fromgraphical models, ... ), application-specific kernels (e.g. bioinformatics)
ICANN 2007 ⋄ Johan Suykens 5
Classifier with maximal margin
• Training set (xi, yi)Ni=1: inputs xi ∈ R
n; class labels yi ∈ −1,+1
• Classifier: y(x) = sign[wTϕ(x) + b]
with ϕ(·) : Rn → R
nh the mapping to a high dimensional feature space(which can be infinite dimensional!)
• Maximize the margin for good generalization ability (margin = 2‖w‖2
)
(VC theory: linear SVM classifier dates back from the sixties)
x x
xx
x
x
x
o
o
oo
o o
oo o
x o → x x
xx
x
x
x
o
o
oo
o o
oo o
x o
ICANN 2007 ⋄ Johan Suykens 6
SVM classifier: primal and dual problem
• Primal problem: [Vapnik, 1995]
minw,b,ξ
J (w, ξ) =1
2wTw + c
N∑
i=1
ξi s.t.
yi[wTϕ(xi) + b] ≥ 1 − ξi
ξi ≥ 0, i = 1, ..., N
Trade-off between margin maximization and tolerating misclassifications
• Conditions for optimality from Lagrangian.Express the solution in the Lagrange multipliers.
• Dual problem: QP problem (convex problem)
maxα
Q(α) = −1
2
N∑
i,j=1
yiyj K(xi, xj)αiαj+
N∑
j=1
αj s.t.
N∑
i=1
αiyi = 0
0 ≤ αi ≤ c, ∀i
ICANN 2007 ⋄ Johan Suykens 7
Obtaining solution via Lagrangian
• Lagrangian:
L(w, b, ξ;α, ν) = J (w, ξ) −
N∑
i=1
αiyi[wTϕ(xi) + b] − 1 + ξi −
N∑
i=1
νiξi
• Find saddle point: maxα,ν
minw,b,ξ
L(w, b, ξ; α, ν), one obtains
∂L∂w = 0 → w =
N∑
i=1
αiyiϕ(xi)
∂L∂b = 0 →
N∑
i=1
αiyi = 0
∂L∂ξi
= 0 → 0 ≤ αi ≤ c, i = 1, ..., N
Finally, write the solution in terms of α (Lagrange multipliers).
ICANN 2007 ⋄ Johan Suykens 8
SVM classifier model representations
• Classifier: primal representation
y(x) = sign[wTϕ(x) + b]
Kernel trick (Mercer Theorem): K(xi, xj) = ϕ(xi)T ϕ(xj)
• Dual representation:
y(x) = sign[∑
i
αi yi K(x, xi) + b]
Some possible kernels K(·, ·):
K(x, xi) = xTi x (linear)
K(x, xi) = (xTi x + τ)d (polynomial)
K(x, xi) = exp(−‖x − xi‖22/σ2) (RBF)
K(x, xi) = tanh(κxTi x + θ) (MLP)
• Sparseness property (many αi = 0)−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2
−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
3
x(1)x
(2)
ICANN 2007 ⋄ Johan Suykens 9
SVMs: living in two worlds ...
x
oooo
x
xx
x
x xx
x
x
oo
oo Input space
Feature space
ϕ(x)
Primal space:
Dual space:
y(x) = sign[wTϕ(x) + b]
y(x) = sign[P#sv
i=1 αiyiK(x, xi) + b]
K(xi, xj) = ϕ(xi)Tϕ(xj) (“Kernel trick”)
y(x)
y(x)
w1
wnh
α1
α#sv
ϕ1(x)
ϕnh(x)
K(x, x1)
K(x, x#sv)
x
x
ICANN 2007 ⋄ Johan Suykens 10
Reproducing Kernel Hilbert Space (RKHS) view
• Variational problem: [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]
find function f such that
minf∈H
1
N
N∑
i=1
L(yi, f(xi)) + λ‖f‖2K
with L(·, ·) the loss function. ‖f‖K is norm in RKHS H defined by K.
• Representer theorem: for convex loss function, solution of the form
f(x) =
NX
i=1
αiK(x, xi)
Reproducing property f(x) = 〈f, Kx〉K with Kx(·) = K(x, ·)
• Some special cases:
L(y, f(x)) = (y − f(x))2: regularization network
L(y, f(x)) = |y − f(x)|ǫ: SVM regression with ǫ-insensitive loss function
−ε 0 +ε
ICANN 2007 ⋄ Johan Suykens 11
Different views on kernel machines
LS−SVMSVM
Kriging
Gaussian Processes
RKHS
Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba
Obtaining complementary insights from different perspectives:kernels are used in different methodologies
Support vector machines (SVM): optimization approach (primal/dual)Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysisGaussian processes (GP): probabilistic/Bayesian approach
ICANN 2007 ⋄ Johan Suykens 12
Wider use of the kernel trick
• Angle between vectors: (e.g. correlation analysis)Input space:
cos θxz =xTz
‖x‖2‖z‖2Feature space:
cos θϕ(x),ϕ(z) =ϕ(x)Tϕ(z)
‖ϕ(x)‖2‖ϕ(z)‖2
=K(x, z)
√
K(x, x)√
K(z, z)
• Distance between vectors: (e.g. for “kernelized” clustering methods)Input space:
‖x − z‖22 = (x − z)T (x − z) = xTx + zTz − 2xTz
Feature space:
‖ϕ(x) − ϕ(z)‖22 = K(x, x) + K(z, z) − 2K(x, z)
x
zθ
ICANN 2007 ⋄ Johan Suykens 13
Least Squares Support Vector Machines: “core problems”
• Regression (RR)
minw,b,e
wTw + γ∑
i
e2i s.t. yi = wTϕ(xi) + b + ei, ∀i
• Classification (FDA)
minw,b,e
wTw + γ∑
i
e2i s.t. yi(w
Tϕ(xi) + b) = 1 − ei, ∀i
• Principal component analysis (PCA)
minw,b,e
−wTw + γ∑
i
e2i s.t. ei = wTϕ(xi) + b, ∀i
• Canonical correlation analysis/partial least squares (CCA/PLS)
minw,v,b,d,e,r
wTw+vTv+ν1
∑
i
e2i+ν2
∑
i
r2i−γ
∑
i
eiri s.t.
ei = wTϕ1(xi) + bri = vTϕ2(yi) + d
• partially linear models, spectral clustering, subspace algorithms, ...
ICANN 2007 ⋄ Johan Suykens 14
LS-SVM classifier
• Preserve support vector machine methodology, but simplify via leastsquares and equality constraints [Suykens, 1999]
• Primal problem:
minw,b,e
J (w, e) =1
2wTw + γ
1
2
N∑
i=1
e2i s.t. yi [w
Tϕ(xi) + b]=1 − ei, ∀i
• Dual problem:
[
0 yT
y Ω + I/γ
] [
bα
]
=
[
01N
]
where Ωij = yiyj ϕ(xi)Tϕ(xj) = yiyj K(xi, xj) and y = [y1; ...; yN ].
• LS-SVM classifiers perform very well on 20 UCI data sets [Van Gestel et al., ML 2004]
Winning results in competition WCCI 2006 by [Cawley, 2006]
ICANN 2007 ⋄ Johan Suykens 15
Obtaining solution from Lagrangian
• Lagrangian:
L(w, b, e;α) = J (w, e) −N
∑
i=1
αiyi[wTϕ(xi) + b] − 1 + ei
with Lagrange multipliers αi (support values).
• Conditions for optimality:8
>
>
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
>
>
:
∂L∂w = 0 → w =
NX
i=1
αiyiϕ(xi)
∂L∂b = 0 →
NX
i=1
αiyi = 0
∂L∂ei
= 0 → αi = γei, i = 1, ..., N∂L∂αi
= 0 → yi[wTϕ(xi) + b]− 1 + ei = 0, i = 1, ..., N
Eliminate w, e and write solution in α, b.
ICANN 2007 ⋄ Johan Suykens 16
Microarray data analysis
FDALS-SVM classifier (linear, RBF)Kernel PCA + FDA(unsupervised selection of PCs)(supervised selection of PCs)
Use regularization for linear classifiers
Systematic benchmarking study in [Pochet et al., Bioinformatics 2004]
Webservice: http://www.esat.kuleuven.ac.be/MACBETH/
Efficient computational methods for feature selection by rank-one updates
[Ojeda et al., 2007]
ICANN 2007 ⋄ Johan Suykens 17
Weighted versions and robustness
Convex cost function
convexoptimiz.
SVM solution
Weighted version with
modified cost function
robuststatistics
LS-SVM solution
SVM Weighted LS-SVM
• Weighted LS-SVM: minw,b,e
1
2wTw + γ
1
2
N∑
i=1
vie2i s.t. yi = wTϕ(xi) + b + ei, ∀i
with vi determined from eiNi=1 of unweighted LS-SVM [Suykens et al., 2002].
Robustness and stability of reweighted kernel based regression [Debruyne et al., 2006].
• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]
ICANN 2007 ⋄ Johan Suykens 18
Kernel principal component analysis (KPCA)
−1.5 −1 −0.5 0 0.5 1−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
linear PCA
−1.5 −1 −0.5 0 0.5 1−2
−1.5
−1
−0.5
0
0.5
1
1.5
kernel PCA (RBF kernel)
Kernel PCA [Scholkopf et al., 1998]:take eigenvalue decomposition of the kernel matrix
K(x1, x1) ... K(x1, xN)... ...
K(xN , x1) ... K(xN , xN)
(applications in dimensionality reduction and denoising)
Where is the regularization ?
ICANN 2007 ⋄ Johan Suykens 19
Kernel PCA: primal and dual problem
• Underlying primal problem with regularization term [Suykens et al., 2003]
• Primal problem:
minw,b,e
−1
2wTw +
1
2γ
N∑
i=1
e2i s.t. ei = wTϕ(xi) + b, i = 1, ..., N.
(or alternatively min 12w
Tw − 12γ
∑Ni=1 e2
i )
• Dual problem = kernel PCA :
Ωcα = λα with λ = 1/γ
with Ωc,ij = (ϕ(xi) − µϕ)T (ϕ(xj) − µϕ) the centered kernel matrix.
• Score variables (allowing also out-of-sample extensions):
z(x) = wT (ϕ(x) − µϕ) =∑
j αj(K(xj, x) − 1N
∑
r K(xr, x)−1N
∑
r K(xr, xj) + 1N2
∑
r
∑
s K(xr, xs))
ICANN 2007 ⋄ Johan Suykens 20
Primal versus dual problems
Example 1: microarray data (10.000 genes & 50 training data)
Classifier model:sign(wTx + b) (primal)sign(
∑
i αiyixTi x + b) (dual)
primal: w ∈ R10.000 (only 50 training data!)
dual: α ∈ R50
Example 2: datamining problem (1.000.000 training data & 20 inputs)
primal: w ∈ R20
dual: α ∈ R1.000.000 (kernel matrix: 1.000.000 × 1.000.000 !)
ICANN 2007 ⋄ Johan Suykens 21
Fixed-size LS-SVM: primal-dual kernel machines
Primal space Dual space
Nystrom method
Kernel PCA
Density estimate
Entropy criteria
Eigenfunctions
SV selectionRegression
Link Nystrom approximation (GP) - kernel PCA - density estimation[Girolami, 2002; Williams & Seeger, 2001]
Modelling in view of primal-dual representations[Suykens et al., 2002]: primal space estimation, sparse, large scale
ICANN 2007 ⋄ Johan Suykens 22
Fixed-size LS-SVM: toy examples
−10 −8 −6 −4 −2 0 2 4 6 8 10−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
x
y
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
−10 −8 −6 −4 −2 0 2 4 6 8 10−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
x
y
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2−2.5
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
2.5
Sparse representations with estimation in primal space
ICANN 2007 ⋄ Johan Suykens 23
Large scale problems: Fixed-size LS-SVM
Estimate in primal (approximate feature map from KPCA on subset)Santa Fe laser data
−2
−1
0
1
2
3
4
5
0 100 200 300 400 500 600 700 800 900
discrete time k
yk
10 20 30 40 50 60 70 80 90−2
−1
0
1
2
3
4
discrete time k
yk,y
kTraining: yk+1 = f(yk, yk−1, ..., yk−p)Iterative prediction: yk+1 = f(yk, yk−1, ..., yk−p)(works well for p large, e.g. p = 50) [Espinoza et al., 2003]
ICANN 2007 ⋄ Johan Suykens 24
Partially linear models: nonlinear system identification
0 2 4 6 8 10 12 14
x 104
−0.2
−0.15
−0.1
−0.05
0
0.05
0.1
0.15
Input Signal
0 2 4 6 8 10 12 14
x 104
−0.4
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
test training validation
Output Signal
0.5 1 1.5 2 2.5 3 3.5 4
x 104
−0.02
−0.015
−0.01
−0.005
0
0.005
0.01
0.015
0.02
Discrete Time Index
Residuals
0.5 1 1.5 2 2.5 3 3.5 4
x 104
−0.02
−0.015
−0.01
−0.005
0
0.005
0.01
0.015
0.02
Discrete Time Index
Residuals
Silver box benchmark study (physical system with cubic nonlinearity):(top-right) full black-box, (bottom-right) partially linear
Related application: power load forecasting [Espinoza et al., 2005]
ICANN 2007 ⋄ Johan Suykens 25
- Part II contents -
• generalizations to KPCAweighted kernel PCAspectral clusteringkernel canonical correlation analysis
• model selectionstructure detectionkernel designsemi-supervised learningincorporation of constraints
• kernel maps with reference pointdimensionality reduction and data visualization
ICANN 2007 ⋄ Johan Suykens 26
Core models + additional constraints
• Monoticity constraints: [Pelckmans et al., 2005]
minw,b,e
wTw + γ
NX
i=1
e2i s.t.
yi = wTϕ(xi) + b + ei, (i = 1, ..., N)
wTϕ(xi) ≤ wTϕ(xi+1), (i = 1, ..., N − 1)
• Structure detection: [Pelckmans et al., 2005; Tibshirani, 1996]
minw,e,t
ρ
PX
p=1
tp+
PX
p=1
w(p)T
w(p)
+γ
NX
i=1
e2i s.t.
(
yi =PP
p=1 w(p)Tϕ(p)(x(p)i ) + ei, (∀i)
−tp ≤ w(p)Tϕ(p)(x(p)i ) ≤ tp, (∀i, ∀p)
• Autocorrelated errors: [Espinoza et al., 2006]
minw,b,r,e
wTw + γ
NX
i=1
r2i s.t.
yi = wTϕ(xi) + b + ei, (i = 1, .., N)
ei = ρei−1 + ri, (i = 2, ..., N)
• Spectral clustering: [Alzate & Suykens, 2006; Chung, 1997; Shi & Malik, 2000]
minw,b,e
−wTw + γe
TD−1
e s.t. ei = wTϕ(xi) + b, (i = 1, ..., N)
ICANN 2007 ⋄ Johan Suykens 27
Generalizations to Kernel PCA: other loss functions
• Consider general loss function L (L2 case = KPCA):
minw,b,e
−1
2wTw +
1
2γ
N∑
i=1
L(ei) s.t. ei = wTϕ(xi) + b, i = 1, ..., N.
Generalizations of KPCA that lead to robustness and sparseness, e.g.Vapnik ǫ-insensitive loss, Huber loss function [Alzate & Suykens, 2006].
• Weighted least squares versions and incorporation of constraints:
minw,b,e
−1
2wTw +
1
2γ
N∑
i=1
vie2i s.t.
ei = wTϕ(xi) + b, i = 1, ..., N∑N
i=1 eie(1)i = 0
...∑N
i=1 eie(i−1)i = 0
Find i-th PC w.r.t. i − 1 orthogonality constraints (previous PC e(j)i ).
The solution is given by a generalized eigenvalue problem.
ICANN 2007 ⋄ Johan Suykens 28
Generalizations to Kernel PCA: robust denoising
Test set images, corrupted with Gaussian noise and outliers
Classical Kernel PCA Robust method
bottom rows: application of different pre-image algorithms
Robust method: improved results and fewer components needed
ICANN 2007 ⋄ Johan Suykens 29
Generalizations to Kernel PCA: sparseness
−0.5 0 0.5 1 1.5 2 2.5−1
−0.5
0
0.5
1
1.5
2
2.5
x1
x 2
−0.5 0 0.5 1 1.5 2 2.5−1
−0.5
0
0.5
1
1.5
2
2.5
x1
x 2
PC1
−0.5 0 0.5 1 1.5 2 2.5−1
−0.5
0
0.5
1
1.5
2
2.5
x1
x 2
PC2
−0.5 0 0.5 1 1.5 2 2.5−1
−0.5
0
0.5
1
1.5
2
2.5
x1
x 2
PC3
Sparse kernel PCA using ǫ-insensitive loss [Alzate & Suykens, 2006](top figure: denoising; bottom figures: different support vectors (in black) per principal component vector)
ICANN 2007 ⋄ Johan Suykens 30
Spectral clustering: weighted KPCA
• Spectral graph clustering [Chung, 1997;Shi & Malik, 2000; Ng et al., 2002]
• Normalized cut problemLq = λDq
with L = D − W the Laplacian of the graph.Cluster membership indicators are given by q.
• KPCA Weighted LS-SVM formulation to normalized cut:
minw,b,e
−1
2wTw + γ
1
2eTV e such that ei = wTϕ(xi) + b, ∀i = 1, ..., N
with V = D−1 the inverse degree matrix [Alzate & Suykens, 2006].Allows for out-of-sample extensions on test data.
2
3
4
5
6
1
cut of size 2
cut of size 1(minimal cut)
ICANN 2007 ⋄ Johan Suykens 31
Application to image segmentation
Given image (240× 160) Image segmentation
Large scale image: out-of-sample extension [Alzate & Suykens, 2006]
ICANN 2007 ⋄ Johan Suykens 32
Kernel Canonical Correlation Analysis
x
xx
x
x
x
x
x x
x
x
x
x x
xx
xx
xx
x
xx
x
x
x
x
x x
x
x
x
x x
xx
xx
xx
Space XSpace Y
Feature space on XFeature space on Y
Target spaces
Correlation: minw,v
∑
i
‖zxi− zyi
‖22
ϕ1(·)
ϕ2(·)
zx = wT ϕ1(x)zy = vT ϕ2(y)
Applications of kernel CCA [Suykens et al., 2002, Bach & Jordan, 2002] e.g. in:
- bioinformatics (correlation gene network - gene expression profiles) [Vert et al., 2003]
- information retrieval, fMRI [Shawe-Taylor et al., 2004]
- state estimation of dynamical systems, subspace algorithms [Goethals et al., 2005]
ICANN 2007 ⋄ Johan Suykens 33
LS-SVM formulation to Kernel CCA
• Score variables: zx = wT (ϕ1(x)− µϕ1), zy = vT (ϕ2(y)− µϕ2
)
Feature maps ϕ1, ϕ2, kernels K1(xi, xj) = ϕ1(xi)Tϕ1(xj), K2(yi, yj) = ϕ2(yi)
Tϕ2(yj)
• Primal problem: (Kernel PLS case: ν1 = 0, ν2 = 0 [Hoegaerts et al., 2004])
maxw,v,e,r
γ
NX
i=1
eiri − ν1
1
2
NX
i=1
e2i − ν2
1
2
NX
i=1
r2i−
1
2w
Tw −
1
2v
Tv
such that ei = wT (ϕ1(xi)− µϕ1), ri = vT (ϕ2(yi)− µϕ2
), ∀i
with µϕ1 = (1/N)PN
i=1 ϕ1(xi), µϕ2 = (1/N)PN
i=1 ϕ2(yi).
• Dual problem: generalized eigenvalue problem [Suykens et al. 2002]»
0 Ωc,2
Ωc,1 0
– »
α
β
–
= λ
»
ν1Ωc,1 + I 0
0 ν2Ωc,2 + I
– »
α
β
–
, λ = 1/γ
with Ωc,1ij= (ϕ1(xi)− µϕ1)
T (ϕ1(xj)− µϕ1), Ωc,2ij= (ϕ2(yi)− µϕ2)
T (ϕ2(yj)− µϕ2)
ICANN 2007 ⋄ Johan Suykens 34
System identification of Hammerstein systems
• Hammerstein system:
xt+1 = Axt + Bf(ut) + νt
yt = Cxt + Df(ut) + vt
with E[
νp
vp
]
[νTq vT
q ] =[
Q S
ST R
]
δpq.
• System identification problem:given (ut, yt)
N−1t=0 , estimate A,B, C,D, f .
• Subspace algorithms [Goethals et al., IEEE-AC 2005]:first estimate state vector sequence (can also be done by KCCA)(for linear systems equivalent to Kalman filtering)
• Related problems: linear non-Gaussian models, links with ICAKernels for linear systems and gait recognition [Bissacco et al., 2007]
f(.) [A, B, C, D]yu
ICANN 2007 ⋄ Johan Suykens 35
Bayesian inference
=Posterior
Evidence
LikelihoodPrior
=Posterior
Evidence
LikelihoodPrior
=Posterior
Evidence
LikelihoodPrior
Max
Max
Max
Level 1
Level 2
Level 3Model comparison
Hyperparameters
Parameters
Automatic relevance determination (ARD) [MacKay, 1998]: infer elements of diagonal
matrix S in K(xi, xj) = exp(−(xi − xj)TS(xi − xj)) which indicate how relevant
input variables are (but: many local minima, computationally expensive).
ICANN 2007 ⋄ Johan Suykens 36
Classification of brain tumors using ARD
Bayesian learning (automatic relevance determination) of most relevant frequencies [Lu, 2005]
ICANN 2007 ⋄ Johan Suykens 37
Hierarchical Kernel Machines
Level 1
Level 3
Level 2
LS−SVM substrate
Model selection
Sparseness
Structure detection
Conceptually
Hierarchical kernel machine Convex optimization
Computationally
Hierarchical modelling approach leading to convex optimization problemComputationally fusing training, hyperparameter and model selectionOptimization modelling: sparseness, input/structure selection, stability ...
[Pelckmans et al., ML 2006]
ICANN 2007 ⋄ Johan Suykens 38
Additive regularization trade-off
• Traditional Tikhonov regularization scheme:
minw,e
wTw + γ∑
i
e2i s.t. ei = yi − wTϕ(xi), ∀i = 1, ..., N
Training solution for fixed value of γ: (K + I/γ)α = y
→ Selection of γ via validation set: non-convex problem
• Additive regularization trade-off [Pelckmans et al., 2005]:
minw,e
wTw +∑
i
(ei − ci)2 s.t. ei = yi − wTϕ(xi), ∀i = 1, ..., N
Training solution for fixed value of c = [c1; ...; cN ]: (K + I)α = y − c
→ Selection of c via validation set: can be convex problem
• Convex relaxation to Tikhonov regularization[Pelckmans et al., IEEE-TNN 2007]
ICANN 2007 ⋄ Johan Suykens 39
Sparse models
• SVM classically: sparse solution from QP problem at training level
• Hierarchical kernel machine: fused problem with sparseness obtained atthe validation level [Pelckmans et al., 2005]
−1.2 −1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8−0.2
0
0.2
0.4
0.6
0.8
1
X1
X2
LS−SVMγ=5.3667,σ2=0.90784
RBF , with 2 different classes
class 1class 2
ICANN 2007 ⋄ Johan Suykens 40
Additive models and structure detection
• Additive models: y(x) =∑P
p=1 w(p)Tϕ(p)(x(p)) with x(p) the p-th input.
Kernel K(xi, xj) =∑P
p=1 K(p)(x(p)i , x
(p)j ).
• Structure detection [Pelckmans et al., 2005]:
minw,e,t
ρPP
p=1 tp +PP
p=1 w(p)Tw(p) + γPN
i=1 e2i
s.t.
(
yi =PP
p=1 w(p)Tϕ(p)(x(p)i ) + ei, ∀i = 1, ..., N
−tp ≤ w(p)Tϕ(p)(x(p)i ) ≤ tp, ∀i = 1, ..., N, ∀p = 1, ..., P
Study how the solution with maximalvariation varies for different values of ρ
100
101
102
103
104
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
ρ
Max
imal
Var
iatio
n
4 relevant input variables
21 irrelevant input variables
ICANN 2007 ⋄ Johan Suykens 41
Incorporation of prior knowledge
• Example: LS-SVM regression with monoticity constraint
minw,b,e
1
2wTw + γ
1
2
N∑
i=1
e2i s.t.
yi = wTϕ(xi) + b + ei, ∀i = 1, ..., NwTϕ(xi) ≤ wTϕ(xi+1), ∀i = 1, ..., N − 1
• Application: estimation of cdf [Pelckmans et al., 2005]
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2−0.2
0
0.2
0.4
0.6
0.8
1
X
P(X
)
empirical cdftrue cdf
Y1
Y2
−1.5 −1 −0.5 0 0.5 1 1.5−0.2
0
0.2
0.4
0.6
0.8
1
X
P(X
)
ecdfcdfChebychev mkrmLS−SVM
ICANN 2007 ⋄ Johan Suykens 42
Equivalent kernels from constraints
Regression with autocorrelated errors:
minw,b,r,e
wTw + γ∑
i r2i
s.t. yi = wTϕ(xi) + b + ei (i = 1, .., N)ei = ρei−1 + ri (i = 2, ..., N)
leads to
f(x) =
N∑
j=2
αj−1Keq(xj, x) + b
with “equivalent kernel”
Keq(xj, xi) = K(xj, xi) − ρK(xj−1, xi)
where K(xj, xi) = ϕ(xj)Tϕ(xi) [Espinoza et al., 2006].
Partially Linear
Structure
LS-SVM
Regression
Autocorrelated
Residuals
Imposing
Symmetry
Modular Definition of the Model Structure
ICANN 2007 ⋄ Johan Suykens 43
Application: electric load forecasting
Short-term load forecasting (1-24 hours)Important for power generation decisionsHourly load values from substations in Belgian gridSeasonal/weekly/intra-daily patterns
20 40 60 80 100 120 140 1600.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1Actual LoadFS−LSSVM
Hour
Norm
alize
dLoad
20 40 60 80 100 120 140 1600.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1Actual LoadLinear
Hour
Norm
alize
dLoad
(a) (b)
20 40 60 80 100 120 140 1600.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1Actual LoadFS−LSSVM
Hour
Norm
alize
dLoad
20 40 60 80 100 120 140 1600.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1Actual LoadLinear
Hour
Norm
alize
dLoad
(c) (d)
[Espinoza et al., 2007]
1-hour ahead
24-hours ahead
Fixed-size LS-SVM→ ← Linear ARX model
ICANN 2007 ⋄ Johan Suykens 44
Semi-supervised learning
x1
x2
−1 −0.5 0 0.5 1 1.5 2 2.5−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
x1
x2
−1 −0.5 0 0.5 1 1.5 2 2.5−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
Semi-supervised learning: part labeled and part unlabeled data
Assumptions for semi-supervised learning to work [Chapelle et al., 2006]:
• Smoothness assumption: if two points x1, x2 in a high density region areclose, then also the corresponding outputs y1, y2
• Cluster assumption: points from same cluster are likely same class
• Low density separation: decision boundary should be in low density region
• Manifold assumption: data lie on a low-dimensional manifold
ICANN 2007 ⋄ Johan Suykens 45
Semi-supervised learning in RKHS
• Learning in RKHS [Belkin & Niyogi, 2004]:
minf∈H
1
N
N∑
i=1
V (yi, f(xi)) + λ‖f‖2K + ηfTLf
with V (·, ·) loss function, L Laplacian matrix, ‖f‖K norm in RKHSH, f = [f(x1); ...; f(xNl+Nu)] (Nl, Nu number of labeled and unlabeleddata)
• Laplacian term: discretization of the Laplace-Beltrami operator
• Representer theorem: f(x) =∑Nl+Nu
i=1 αiK(x, xi)
• Least squares solution case: Laplacian acts on kernel matrix
• Problem: true labels of unlabeled data assumed to be zero.
ICANN 2007 ⋄ Johan Suykens 46
Formulation by adding constraints
• Semi-supervised LS-SVM model [Luts et al., 2007]:
minw,e,b,y
12w
Tw + 12γ
∑Ni=1 e2
i + 12η
∑Ni,j=1 vij(yi − yj)
2
s.t. yi = wTϕ(xi) + b, i = 1, ..., N
yi = νiyi − ei, νi ∈ 0, 1, , i = 1, ..., N
where νi = 0 for unlabeled data, νi = 1 for labeled data.
• MRI image: healthy tissue versus tumor classification [Luts et al., 2007]
Nosologic Image
2 4 6 8 10 12
2
4
6
8
10
121
2
3
4
5
6
Nosologic Image
2 4 6 8 10 12
2
4
6
8
10
121
2
3
4
5
6
[etumour FP6-2002-lifescihealth503094, healthagents FP6-2005-IST027213]
ICANN 2007 ⋄ Johan Suykens 47
Learning combination of kernels
• Take combination K =∑m
i=1 µiKi (µi ≥ 0) (e.g. for data fusion).Learn µi as convex problem [Lanckriet et al., JMLR 2004]
• QP problem of SVM:
maxα
2αT1 − αTdiag(y)Kdiag(y)α s.t. 0 ≤ α ≤ C,αTy = 0
is replaced by
minµi
maxα
2αT1 − αTdiag(y)(
m∑
i=1
µiKi)diag(y)α
s.t. 0 ≤ α ≤ C, αTy = 0, trace(
m∑
i=1
µiKi) = c,
m∑
i=1
µiKi 0.
Can be solved as a semidefinite program (SDP problem) [Boyd &Vandenberghe, 2004] (LMI constraint for positive definite kernel)
ICANN 2007 ⋄ Johan Suykens 48
Kernel design
A
B
D
C
E
P(A,B,C,D,E) = P(A|B) P(B) P(C|B) P(D|C) P(E|B)
Kernels from graphical models, Bayesian networks, HMMsKernels tailored to data types (DNA sequence, text, chemoinformatics)
[Tsuda et al., Bioinformatics 2002; Jebara et al., JMLR 2004, Ralaivola et al., 2005]
- Probability product kernel:
K(p1, p2) =
∫
X
p1(x)ρ p2(x)ρdx
- Prior knowledge incorporation
ICANN 2007 ⋄ Johan Suykens 49
Dimensionality reduction and data visualization
• Traditionally:commonly used techniques are e.g. principal component analysis, multi-dimensional scaling, self-organizing maps
• More recently:isomap, locally linear embedding, Hessian locally linear embedding,diffusion maps, Laplacian eigenmaps(“kernel eigenmap methods and manifold learning”)[Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]
• Relevant issues:- learning and generalization [Cucker & Smale, 2002; Poggio et al., 2004]- model representations and out-of-sample extensions- convex/non-convex problems, computational complexity [Smale, 1997]
• Kernel maps with reference point (KMref) [Suykens, 2007]:data visualization and dimensionality reduction by solving linear system
ICANN 2007 ⋄ Johan Suykens 50
Kernel maps with reference point: problem statement
• Kernel maps with reference point [Suykens, 2007]:- LS-SVM core part: realize dimensionality reduction x 7→ z- reference point q (e.g. first point; sacrificed in the visualization)
• Example: d = 2
minz,w1,w2,b1,b2,ei,1,ei,2
1
2(z − PDz)
T(z − PDz) +
ν
2(w
T1 w1 + w
T2 w2) +
η
2
NX
i=1
(e2i,1 + e
2i,2)
such that cT1,1z = q1 + e1,1
cT1,2z = q2 + e1,2
cTi,1z = wT
1 ϕ1(xi) + b1 + ei,1, ∀i = 2, ..., N
cTi,2z = wT
2 ϕ2(xi) + b2 + ei,2, ∀i = 2, ..., N
Coordinates in low dimensional space: z = [z1; z2; ...; zN ] ∈ RdN
Regularization term: (z − PDz)T (z − PDz) =PN
i=1 ‖zi −PN
j=1 sijDzj‖22
with D diagonal matrix and sij = exp(−‖xi − xj‖22/σ2).
ICANN 2007 ⋄ Johan Suykens 51
Kernel maps with reference point: solution
• The unique solution to the problem is given by the linear system
U −V1M−11 1 −V2M
−12 1
−1TM−11 V T
1 1TM−11 1 0
−1TM−12 V T
2 0 1TM−12 1
zb1
b2
=
η(q1c1,1 + q2c1,2)00
with matrices
U = (I − PD)T (I − PD) − γI + V1M−11 V T
1 + V2M−12 V T
2 + ηc1,1cT1,1 + ηc1,2c
T1,2
M1 =1
νΩ1 +
1
ηI , M2 =
1
νΩ2 +
1
ηI
V1 = [c2,1 ... cN,1] , V2 = [c2,2 ... cN,2]
kernel matrices Ω1, Ω2 ∈ R(N−1)×(N−1):
Ω1,ij = K1(xi, xj) = ϕ1(xi)Tϕ1(xj), Ω2,ij = K2(xi, xj) = ϕ2(xi)
Tϕ2(xj)
positive definite kernel functions K1(·, ·),K2(·, ·).
ICANN 2007 ⋄ Johan Suykens 52
Kernel maps with reference point: model representations
• The primal and dual model representations allow making out-of-sample extensions. Evaluation at point x∗ ∈ R
p:
z∗,1 = wT1 ϕ1(x
∗) + b1 =1
ν
N∑
i=2
αi,1K1(xi, x∗) + b1
z∗,2 = wT2 ϕ2(x
∗) + b2 =1
ν
N∑
i=2
αi,2K2(xi, x∗) + b2
Estimated coordinates for visualization: z∗ = [z∗,1; z∗,2].
• α1, α2 ∈ RN−1 are the unique solutions to the linear systems
M1α1 = V T1 z − b11N−1 and M2α2 = V T
2 z − b21N−1
and α1 = [α2,1; ...;αN,1], α2 = [α2,2; ...;αN,2], 1N−1 = [1; 1; ..., ; 1].
ICANN 2007 ⋄ Johan Suykens 53
KMref: spiral example
−1.5−1
−0.50
0.51
−1.5
−1
−0.5
0
0.5
1−0.5
0
0.5
x1
x2
x 3
−0.02 −0.015 −0.01 −0.005 0 0.005 0.01−5
0
5
10
15
20x 10
−3
z1
z 2
training data (blue *), validation data (magenta o), test data (red +)
Model selection: min∑
i,j
(
zTi zj
‖zi‖2‖zj‖2−
xTi xj
‖xi‖2‖xj‖2
)2
ICANN 2007 ⋄ Johan Suykens 54
KMref: swiss roll example
−1
−0.5
0
0.5
1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
x1
x2
x 3
−3.5 −3 −2.5 −2 −1.5 −1 −0.5 0
x 10−3
0
0.5
1
1.5
2
2.5
3x 10
−3
z1
z 2
Given 3D swiss roll data KMref result - 2D projection
600 training data, 100 validation data
ICANN 2007 ⋄ Johan Suykens 55
KMref: visualizing gene distributions
−2.35 −2.3 −2.25 −2.2 −2.15 −2.1 −2.05 −2 −1.95
x 10−31.9
22.1
2.22.3
x 10−3
1.7
1.8
1.9
2
2.1
x 10−3
z1
z2
z 3
−2.35 −2.3 −2.25 −2.2 −2.15 −2.1 −2.05 −2 −1.95x 10
−31.9
2
2.1
2.2
2.3
x 10−3
1.7
1.8
1.9
2
2.1
x 10−3
z1
z2
z 3
KMref 3D projection (Alon colon cancer microarray data set)Dimension input space: 62
Number of genes: 1500 (training: 500, validation: 500, test: 500)Model selection: σ2 = 104, σ2
1 = 103, σ22 = 0.5σ2
1, σ23 = 0.1σ2
1,η = 1, ν = 100, D = diag10, 5, 1, q = [+1;−1;−1].
ICANN 2007 ⋄ Johan Suykens 56
Nonlinear dynamical systems control
u
xp
θ
−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 3 3.50
0.5
1
1.5
Control objective LS−SVM objectivemin
subject to System dynamics (time k = 1, 2, ... , N)
LS−SVM controller (time k = 1, 2, ... , N)
+
Merging optimal control and support vector machine optimization problems:
Approximate solutions to optimal control problems [Suykens et al., NN 2001]
ICANN 2007 ⋄ Johan Suykens 57
Conclusions and future challenges
• Integrative understanding and systematic design for supervised, semi-supervised, unsupervised learning and beyond
• Kernel methods: complementary views (LS-)SVM, RKHS, GP
• Least squares support vector machines as “core problems”:provides methodology for “optimization modelling”
• Bridging gaps between fundamental theory, algorithms and applications
• Reliable methods: numerically, computationally, statistically
Websites:http://www.kernel-machines.org/http://www.esat.kuleuven.be/sista/lssvmlab/
ICANN 2007 ⋄ Johan Suykens 58
Books
• Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.
• Chapelle O., Scholkopf B., Zien A. (Eds.), Semi-Supervised Learning, MIT Press, 2006.
• Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge UniversityPress, 2000.
• Cucker F., Zhou D.-X., Learning Theory: an Approximation Theory Viewpoint, Cambridge UniversityPress, 2007.
• Rasmussen C.E., Williams C.K.I., Gaussian Processes for Machine Learning, MIT Press, 2006.
• Scholkopf B., Smola A., Learning with Kernels, MIT Press, 2002.
• Scholkopf B., Tsuda K., Vert J.P. (Eds.) Kernel Methods in Computational Biology 400, MIT Press,
2004.
• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press,
2004.
• Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B., Vandewalle J., Least Squares Support
Vector Machines, World Scientific, Singapore, 2002.
• Suykens J.A.K., Horvath G., Basu S., Micchelli C., Vandewalle J. (Eds.), Advances in Learning Theory
: Methods, Models and Applications, vol. 190 NATO-ASI Series III: Computer and Systems Sciences,
IOS Press, 2003.
• Vapnik V., Statistical Learning Theory, John Wiley & Sons, 1998.
• Wahba G., Spline Models for Observational Data, Series Appl. Math., 59, SIAM, 1990.
ICANN 2007 ⋄ Johan Suykens 59