support vector machines and kernel based learning · 2007. 9. 6. · kernel based learning:...

Support Vector Machines and Kernel Based Learning

Johan Suykens

K.U. Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium

Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]

http://www.esat.kuleuven.be/scd/

Tutorial ICANN 2007, Porto, Portugal, Sept. 2007

ICANN 2007 ⋄ Johan Suykens

Living in a data world

biomedical

energy

process industry

bio-informatics

multimedia

traffic


Kernel based learning: interdisciplinary challenges

SVM & kernel methods

linear algebra

mathematics

statistics

systems and control theory

signal processing

optimization

machine learning

pattern recognition

data mining

neural networks

• Understanding the essential concepts and different facets of problems

• Providing systematic approaches, engineering kernel machines

• Integrative design, bridging the gaps between theory and practice


- Part I contents -

• neural networks and support vector machines
  feature map and kernels
  primal and dual problem

• classification, regression
  convex problem, robustness, sparseness

• wider use of the kernel trick
  least squares support vector machines as core problems
  kernel principal component analysis

• large scale
  fixed-size method
  nonlinear modelling


Classical MLPs

[Figure: neuron with inputs x1, x2, x3, ..., xn, weights w1, w2, w3, ..., wn, bias b, activation function h(·) and output y]

Multilayer Perceptron (MLP) properties:

• Universal approximation of continuous nonlinear functions

• Learning from input-output patterns: off-line/on-line

• Parallel network architecture, multiple inputs and outputs

+ Flexible and widely applicable:

Feedforward/recurrent networks, supervised/unsupervised learning

- Many local minima, trial and error for determining number of neurons


Support Vector Machines

[Figure: cost function versus weights, for an MLP and for an SVM]

• Nonlinear classification and function estimation by convex optimization with a unique solution and primal-dual interpretations.

• Number of neurons automatically follows from a convex program.

• Learning and generalization in high dimensional input spaces (coping with the curse of dimensionality).

• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, kernels from graphical models, ...), application-specific kernels (e.g. bioinformatics)


Classifier with maximal margin

• Training set $\{(x_i, y_i)\}_{i=1}^N$: inputs $x_i \in \mathbb{R}^n$; class labels $y_i \in \{-1, +1\}$

• Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

with $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_h}$ the mapping to a high dimensional feature space (which can be infinite dimensional!)

• Maximize the margin for good generalization ability (margin $= 2/\|w\|_2$)

(VC theory: linear SVM classifier dates back to the sixties)

[Figure: two scatter plots of two-class data (x and o) with a separating line, left → right]


SVM classifier: primal and dual problem

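• Primal problem: [Vapnik, 1995]

$$\min_{w,b,\xi} \mathcal{J}(w,\xi) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad \begin{cases} y_i[w^T \varphi(x_i) + b] \geq 1 - \xi_i \\ \xi_i \geq 0, \quad i = 1, ..., N \end{cases}$$

Trade-off between margin maximization and tolerating misclassifications

• Conditions for optimality from Lagrangian. Express the solution in the Lagrange multipliers.

• Dual problem: QP problem (convex problem)

$$\max_{\alpha} Q(\alpha) = -\frac{1}{2} \sum_{i,j=1}^N y_i y_j K(x_i, x_j) \alpha_i \alpha_j + \sum_{j=1}^N \alpha_j \quad \text{s.t.} \quad \begin{cases} \sum_{i=1}^N \alpha_i y_i = 0 \\ 0 \leq \alpha_i \leq c, \quad \forall i \end{cases}$$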


Obtaining solution via Lagrangian

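• Lagrangian:

$$\mathcal{L}(w, b, \xi; \alpha, \nu) = \mathcal{J}(w, \xi) - \sum_{i=1}^N \alpha_i \{ y_i[w^T \varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^N \nu_i \xi_i$$

• Find saddle point: $\max_{\alpha,\nu} \min_{w,b,\xi} \mathcal{L}(w, b, \xi; \alpha, \nu)$, one obtains

$\partial \mathcal{L} / \partial w = 0 \;\to\; w = \sum_{i=1}^N \alpha_i y_i \varphi(x_i)$

$\partial \mathcal{L} / \partial b = 0 \;\to\; \sum_{i=1}^N \alpha_i y_i = 0$

$\partial \mathcal{L} / \partial \xi_i = 0 \;\to\; 0 \leq \alpha_i \leq c, \quad i = 1, ..., N$

Finally, write the solution in terms of α (Lagrange multipliers).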
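The step from the Lagrangian to the dual cost can be written out explicitly. The following is a sketch of the standard substitution, assuming that the full stationarity condition in $\xi_i$ reads $c - \alpha_i - \nu_i = 0$ with $\alpha_i, \nu_i \geq 0$ (which is what yields the box constraint $0 \leq \alpha_i \leq c$):

```latex
\begin{aligned}
\mathcal{L}
&= \tfrac{1}{2} w^T w + c \sum_i \xi_i
   - \sum_i \alpha_i \{ y_i [w^T \varphi(x_i) + b] - 1 + \xi_i \}
   - \sum_i \nu_i \xi_i \\
&= \tfrac{1}{2} w^T w
   - w^T \underbrace{\sum_i \alpha_i y_i \varphi(x_i)}_{=\, w}
   - b \underbrace{\sum_i \alpha_i y_i}_{=\, 0}
   + \sum_i \alpha_i
   + \sum_i \underbrace{(c - \alpha_i - \nu_i)}_{=\, 0} \xi_i \\
&= -\tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \varphi(x_i)^T \varphi(x_j)
   + \sum_i \alpha_i .
\end{aligned}
```

With $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ the last line is the dual cost $Q(\alpha)$ of the QP problem, so the feature map never has to be evaluated explicitly.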


SVM classifier model representations

• Classifier: primal representation

$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

Kernel trick (Mercer Theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

• Dual representation:

$y(x) = \mathrm{sign}[\sum_i \alpha_i y_i K(x, x_i) + b]$

Some possible kernels $K(\cdot, \cdot)$:

$K(x, x_i) = x_i^T x$ (linear)

$K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial)

$K(x, x_i) = \exp(-\|x - x_i\|_2^2 / \sigma^2)$ (RBF)

$K(x, x_i) = \tanh(\kappa x_i^T x + \theta)$ (MLP)

• Sparseness property (many $\alpha_i = 0$)

[Figure: two-dimensional toy example with axes x^(1) and x^(2)]
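The four kernels and the dual representation of the classifier can be sketched in a few lines of NumPy. This is only an illustration: the function names, the default parameter values and the toy support vectors are hypothetical and not taken from the slides.

```python
import numpy as np

def linear_kernel(x, z):
    # K(x, z) = z^T x
    return float(np.dot(z, x))

def polynomial_kernel(x, z, tau=1.0, d=2):
    # K(x, z) = (z^T x + tau)^d
    return float((np.dot(z, x) + tau) ** d)

def rbf_kernel(x, z, sigma2=1.0):
    # K(x, z) = exp(-||x - z||_2^2 / sigma^2)
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / sigma2))

def mlp_kernel(x, z, kappa=1.0, theta=0.0):
    # K(x, z) = tanh(kappa z^T x + theta); not positive definite for every kappa, theta
    return float(np.tanh(kappa * np.dot(z, x) + theta))

def dual_classifier(x, sv_x, sv_y, alpha, b, kernel):
    # y(x) = sign[sum_i alpha_i y_i K(x, x_i) + b]
    s = sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, sv_y, sv_x))
    return int(np.sign(s + b))

# Hypothetical toy model: two support vectors with equal weights.
sv_x = np.array([[1.0, 0.0], [-1.0, 0.0]])
sv_y = np.array([1.0, -1.0])
alpha = np.array([1.0, 1.0])
b = 0.0
label_pos = dual_classifier(np.array([2.0, 0.0]), sv_x, sv_y, alpha, b, linear_kernel)
label_neg = dual_classifier(np.array([-3.0, 1.0]), sv_x, sv_y, alpha, b, linear_kernel)
```

Only the points with nonzero `alpha` enter the sum, which is where the sparseness property pays off at prediction time.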


SVMs: living in two worlds ...

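[Figure: two-class data (x and o) in the input space and in the feature space, related by the mapping ϕ(x)]

Primal space: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

[Network diagram: input x, hidden units ϕ_1(x), ..., ϕ_{n_h}(x), weights w_1, ..., w_{n_h}, output y(x)]

$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ (“Kernel trick”)

Dual space: $y(x) = \mathrm{sign}[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b]$

[Network diagram: input x, hidden units K(x, x_1), ..., K(x, x_{#sv}), weights α_1, ..., α_{#sv}, output y(x)]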


Reproducing Kernel Hilbert Space (RKHS) view

• Variational problem: [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]

find function f such that

$$\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^N L(y_i, f(x_i)) + \lambda \|f\|_K^2$$

with $L(\cdot, \cdot)$ the loss function. $\|f\|_K$ is norm in RKHS $\mathcal{H}$ defined by K.

• Representer theorem: for convex loss function, solution of the form

$$f(x) = \sum_{i=1}^N \alpha_i K(x, x_i)$$

Reproducing property $f(x) = \langle f, K_x \rangle_K$ with $K_x(\cdot) = K(x, \cdot)$

• Some special cases:

$L(y, f(x)) = (y - f(x))^2$: regularization network

$L(y, f(x)) = |y - f(x)|_\epsilon$: SVM regression with ε-insensitive loss function

[Figure: ε-insensitive loss function, with the interval from −ε to +ε marked around 0]


Different views on kernel machines

[Diagram: different views on kernel machines: SVM, LS-SVM, Kriging, Gaussian Processes, RKHS]

Some early history on RKHS:

1910-1920: Moore

1940: Aronszajn

1951: Krige

1970: Parzen

1971: Kimeldorf & Wahba

Obtaining complementary insights from different perspectives: kernels are used in different methodologies

Support vector machines (SVM): optimization approach (primal/dual)
Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
Gaussian processes (GP): probabilistic/Bayesian approach


Wider use of the kernel trick

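• Angle between vectors: (e.g. correlation analysis)

Input space:

$$\cos \theta_{xz} = \frac{x^T z}{\|x\|_2 \|z\|_2}$$

Feature space:

$$\cos \theta_{\varphi(x),\varphi(z)} = \frac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \frac{K(x, z)}{\sqrt{K(x, x)} \sqrt{K(z, z)}}$$

• Distance between vectors: (e.g. for “kernelized” clustering methods)

Input space:

$$\|x - z\|_2^2 = (x - z)^T (x - z) = x^T x + z^T z - 2 x^T z$$

Feature space:

$$\|\varphi(x) - \varphi(z)\|_2^2 = K(x, x) + K(z, z) - 2 K(x, z)$$

[Figure: vectors x and z with the angle θ between them]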
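Both identities need nothing but kernel evaluations, which the following minimal sketch makes concrete. The helper names and the test vectors are hypothetical; with the linear kernel the feature-space quantities reduce to the input-space ones, which gives a quick sanity check.

```python
import numpy as np

def linear(x, z):
    return float(np.dot(x, z))

def rbf(x, z, sigma2=1.0):
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(d, d) / sigma2))

def feature_cosine(K, x, z):
    # cos(theta) between phi(x) and phi(z), using kernel evaluations only
    return K(x, z) / np.sqrt(K(x, x) * K(z, z))

def feature_sq_distance(K, x, z):
    # ||phi(x) - phi(z)||_2^2 = K(x, x) + K(z, z) - 2 K(x, z)
    return K(x, x) + K(z, z) - 2.0 * K(x, z)

x = np.array([3.0, 0.0])
z = np.array([0.0, 4.0])
d2_linear = feature_sq_distance(linear, x, z)   # equals ||x - z||^2 in input space
cos_linear = feature_cosine(linear, x, z)       # x and z are orthogonal
d2_rbf = feature_sq_distance(rbf, x, z)         # bounded, since K(x, x) = 1 for RBF
```

For the RBF kernel every point has unit norm in feature space, so the squared distance always lies between 0 and 2.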


Least Squares Support Vector Machines: “core problems”

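• Regression (RR)

$$\min_{w,b,e} w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \; \forall i$$

• Classification (FDA)

$$\min_{w,b,e} w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i (w^T \varphi(x_i) + b) = 1 - e_i, \; \forall i$$

• Principal component analysis (PCA)

$$\min_{w,b,e} -w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; \forall i$$

• Canonical correlation analysis/partial least squares (CCA/PLS)

$$\min_{w,v,b,d,e,r} w^T w + v^T v + \nu_1 \sum_i e_i^2 + \nu_2 \sum_i r_i^2 - \gamma \sum_i e_i r_i \quad \text{s.t.} \quad \begin{cases} e_i = w^T \varphi_1(x_i) + b \\ r_i = v^T \varphi_2(y_i) + d \end{cases}$$

• partially linear models, spectral clustering, subspace algorithms, ...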


LS-SVM classifier

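• Preserve support vector machine methodology, but simplify via least squares and equality constraints [Suykens, 1999]

• Primal problem:

$$\min_{w,b,e} \mathcal{J}(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad y_i [w^T \varphi(x_i) + b] = 1 - e_i, \; \forall i$$

• Dual problem:

$$\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_N \end{bmatrix}$$

where $\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j)$ and $y = [y_1; ...; y_N]$.

• LS-SVM classifiers perform very well on 20 UCI data sets [Van Gestel et al., ML 2004]

Winning results in competition WCCI 2006 by [Cawley, 2006]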
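Because the dual problem is a square linear system, training an LS-SVM classifier needs only a linear solver. The sketch below builds and solves that system with an RBF kernel; the function names, the toy data and the values of γ and σ² are assumptions made for illustration, not settings from the tutorial.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma2):
    # Pairwise K(a, b) = exp(-||a - b||_2^2 / sigma^2)
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def lssvm_train(X, y, gamma, sigma2):
    # Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1_N]
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel_matrix(X, X, sigma2)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # b, alpha

def lssvm_predict(X_new, X, y, alpha, b, sigma2):
    # y(x) = sign[sum_i alpha_i y_i K(x, x_i) + b]
    return np.sign(rbf_kernel_matrix(X_new, X, sigma2) @ (alpha * y) + b)

# Hypothetical toy data: two well separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
b, alpha = lssvm_train(X, y, gamma=10.0, sigma2=4.0)
pred = lssvm_predict(X, X, y, alpha, b, sigma2=4.0)
```

Unlike the SVM dual, no inequality constraints appear, so in general every `alpha[i]` is nonzero; sparseness is traded for the simplicity of a linear system.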


Obtaining solution from Lagrangian

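• Lagrangian:

$$\mathcal{L}(w, b, e; \alpha) = \mathcal{J}(w, e) - \sum_{i=1}^N \alpha_i \{ y_i [w^T \varphi(x_i) + b] - 1 + e_i \}$$

with Lagrange multipliers $\alpha_i$ (support values).

• Conditions for optimality:

$$\begin{cases} \partial \mathcal{L} / \partial w = 0 \;\to\; w = \sum_{i=1}^N \alpha_i y_i \varphi(x_i) \\ \partial \mathcal{L} / \partial b = 0 \;\to\; \sum_{i=1}^N \alpha_i y_i = 0 \\ \partial \mathcal{L} / \partial e_i = 0 \;\to\; \alpha_i = \gamma e_i, \quad i = 1, ..., N \\ \partial \mathcal{L} / \partial \alpha_i = 0 \;\to\; y_i [w^T \varphi(x_i) + b] - 1 + e_i = 0, \quad i = 1, ..., N \end{cases}$$

Eliminate w, e and write solution in α, b.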
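The elimination step can be spelled out as follows. This is a sketch of the algebra implied by the optimality conditions, using only the symbols defined on these slides:

```latex
\begin{aligned}
& e_i = \alpha_i / \gamma, \qquad w = \sum_{j=1}^N \alpha_j y_j \varphi(x_j) \\
& \Rightarrow\;
  y_i \Big[ \sum_{j=1}^N \alpha_j y_j \varphi(x_j)^T \varphi(x_i) + b \Big]
  - 1 + \alpha_i / \gamma = 0 \\
& \Rightarrow\;
  \sum_{j=1}^N \underbrace{y_i y_j K(x_i, x_j)}_{\Omega_{ij}} \alpha_j
  + y_i b + \alpha_i / \gamma = 1, \qquad i = 1, ..., N \\
& \Rightarrow\;
  (\Omega + I/\gamma)\, \alpha + y\, b = 1_N,
  \qquad y^T \alpha = 0 .
\end{aligned}
```

Stacking the last two equations gives the $(N+1) \times (N+1)$ linear system in $[b; \alpha]$ of the LS-SVM dual problem.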


Microarray data analysis

FDA
LS-SVM classifier (linear, RBF)
Kernel PCA + FDA (unsupervised selection of PCs) (supervised selection of PCs)

Use regularization for linear classifiers

Systematic benchmarking study in [Pochet et al., Bioinformatics 2004]

Webservice: http://www.esat.kuleuven.ac.be/MACBETH/

Efficient computational methods for feature selection by rank-one updates

[Ojeda et al., 2007]


Weighted versions and robustness

[Diagram: SVM route: convex cost function, convex optimization, SVM solution. Weighted LS-SVM route: LS-SVM solution, robust statistics, weighted version with modified cost function]

• Weighted LS-SVM:

$$\min_{w,b,e} \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N v_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \; \forall i$$

with $v_i$ determined from $\{e_i\}_{i=1}^N$ of unweighted LS-SVM [Suykens et al., 2002].

Robustness and stability of reweighted kernel based regression [Debruyne et al., 2006].

• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]


Kernel principal component analysis (KPCA)

[Figure: two panels on the same 2-D data set: linear PCA and kernel PCA (RBF kernel)]

Kernel PCA [Scholkopf et al., 1998]: take eigenvalue decomposition of the kernel matrix

$$\begin{bmatrix} K(x_1, x_1) & \dots & K(x_1, x_N) \\ \vdots & & \vdots \\ K(x_N, x_1) & \dots & K(x_N, x_N) \end{bmatrix}$$

(applications in dimensionality reduction and denoising)

Where is the regularization?


Kernel PCA: primal and dual problem

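• Underlying primal problem with regularization term [Suykens et al., 2003]

• Primal problem:

$$\min_{w,b,e} -\frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, ..., N.$$

(or alternatively $\min \frac{1}{2} w^T w - \frac{1}{2} \gamma \sum_{i=1}^N e_i^2$)

• Dual problem = kernel PCA:

$$\Omega_c \alpha = \lambda \alpha \quad \text{with} \quad \lambda = 1/\gamma$$

with $\Omega_{c,ij} = (\varphi(x_i) - \mu_\varphi)^T (\varphi(x_j) - \mu_\varphi)$ the centered kernel matrix.

• Score variables (allowing also out-of-sample extensions):

$$z(x) = w^T (\varphi(x) - \mu_\varphi) = \sum_j \alpha_j \Big( K(x_j, x) - \frac{1}{N} \sum_r K(x_r, x) - \frac{1}{N} \sum_r K(x_r, x_j) + \frac{1}{N^2} \sum_r \sum_s K(x_r, x_s) \Big)$$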
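The dual eigenvalue problem and the out-of-sample score can be sketched as below. The unit-norm scaling of the eigenvectors returned by `numpy.linalg.eigh`, the RBF kernel, and the random data are assumptions of this example; the slides do not fix a normalization for α.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma2):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def kpca_fit(X, sigma2, n_components):
    # Eigendecomposition of the centered kernel matrix Omega_c
    N = X.shape[0]
    K = rbf_kernel_matrix(X, X, sigma2)
    C = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    Omega_c = C @ K @ C
    lam, vecs = np.linalg.eigh(Omega_c)          # ascending eigenvalues
    order = np.argsort(lam)[::-1][:n_components]
    return K, lam[order], vecs[:, order]

def kpca_score(x_new, X, K, alpha, sigma2):
    # z(x) = sum_j alpha_j (K(x_j, x) - mean_r K(x_r, x)
    #                       - mean_r K(x_r, x_j) + mean_{r,s} K(x_r, x_s))
    k = rbf_kernel_matrix(X, x_new[None, :], sigma2)[:, 0]
    centered = k - k.mean() - K.mean(axis=0) + K.mean()
    return centered @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
sigma2 = 2.0
K, lam, alpha = kpca_fit(X, sigma2, n_components=2)
z_train_3 = kpca_score(X[3], X, K, alpha, sigma2)   # score of a training point
```

Evaluating the out-of-sample formula at a training point $x_i$ reproduces row i of $\Omega_c \alpha = \lambda \alpha$, which is a convenient consistency check.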


Primal versus dual problems

Example 1: microarray data (10.000 genes & 50 training data)

Classifier model:
$\mathrm{sign}(w^T x + b)$ (primal)
$\mathrm{sign}(\sum_i \alpha_i y_i x_i^T x + b)$ (dual)

primal: $w \in \mathbb{R}^{10.000}$ (only 50 training data!)

dual: $\alpha \in \mathbb{R}^{50}$

Example 2: datamining problem (1.000.000 training data & 20 inputs)

primal: $w \in \mathbb{R}^{20}$

dual: $\alpha \in \mathbb{R}^{1.000.000}$ (kernel matrix: 1.000.000 × 1.000.000 !)
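A back-of-the-envelope calculation shows why the choice between primal and dual matters. The helper below is hypothetical and assumes a linear model and 8-byte floating point storage for a dense kernel matrix:

```python
def problem_sizes(n_samples, n_inputs, bytes_per_number=8):
    # Number of unknowns in each representation of a linear model,
    # and the storage of a dense N x N kernel matrix (assumed float64).
    return {
        "primal_unknowns": n_inputs,
        "dual_unknowns": n_samples,
        "kernel_matrix_bytes": n_samples * n_samples * bytes_per_number,
    }

microarray = problem_sizes(n_samples=50, n_inputs=10_000)
datamining = problem_sizes(n_samples=1_000_000, n_inputs=20)
```

Under these assumptions the microarray kernel matrix takes 20 kB, while the datamining kernel matrix would take 8 · 10^12 bytes, which is why the second example is better handled in the primal.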


Fixed-size LS-SVM: primal-dual kernel machines

[Diagram: primal space and dual space, linked through the Nystrom method, kernel PCA, density estimate, entropy criteria, eigenfunctions, SV selection and regression]

Link Nystrom approximation (GP) - kernel PCA - density estimation [Girolami, 2002; Williams & Seeger, 2001]

Modelling in view of primal-dual representations [Suykens et al., 2002]: primal space estimation, sparse, large scale
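The idea of estimating in the primal with an approximate feature map can be sketched as follows. This is a simplified illustration, not the method of the cited papers: the subset is taken at evenly spaced indices instead of being selected with an entropy criterion, and the RBF width, γ, subset size and toy data are all assumed values.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma2):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def nystrom_feature_map(X_sub, sigma2, eps=1e-10):
    # Eigendecomposition of the small M x M kernel matrix on the subset;
    # phi_hat(x) = Lambda^{-1/2} U^T [K(x, x_1), ..., K(x, x_M)]^T
    lam, U = np.linalg.eigh(rbf_kernel_matrix(X_sub, X_sub, sigma2))
    keep = lam > eps                              # drop numerically zero directions
    scale = U[:, keep] / np.sqrt(lam[keep])
    return lambda X: rbf_kernel_matrix(X, X_sub, sigma2) @ scale

def primal_ridge(Phi, y, gamma):
    # min_{w,b} w^T w + gamma * sum_i e_i^2 with e_i = y_i - w^T phi_i - b
    Phi1 = np.hstack([Phi, np.ones((Phi.shape[0], 1))])
    R = np.eye(Phi1.shape[1]) / gamma
    R[-1, -1] = 0.0                               # bias term is not regularized
    return np.linalg.solve(Phi1.T @ Phi1 + R, Phi1.T @ y)

# Hypothetical toy regression problem: sin(x)/x on [-10, 10].
x = np.linspace(-10.0, 10.0, 400)[:, None]
y = np.sinc(x[:, 0] / np.pi)
X_sub = x[::20]                                   # fixed size M = 20
phi = nystrom_feature_map(X_sub, sigma2=4.0)
theta = primal_ridge(phi(x), y, gamma=1e3)
y_hat = np.hstack([phi(x), np.ones((len(x), 1))]) @ theta
rmse = float(np.sqrt(np.mean((y_hat - y) ** 2)))
```

The linear system solved in the primal has size at most M + 1, independent of the number of training points N, which is what makes the approach attractive for large data sets.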


Fixed-size LS-SVM: toy examples

[Figure: four panels: two plots of y versus x on the interval from −10 to 10, and two plots in a 2-D plane]

Sparse representations with estimation in primal space


Large scale problems: Fixed-size LS-SVM

Estimate in primal (approximate feature map from KPCA on subset)

Santa Fe laser data

[Figure: two panels versus discrete time k: the series $y_k$, and the prediction $\hat{y}_k$ compared with $y_k$]

Training: $\hat{y}_{k+1} = f(y_k, y_{k-1}, ..., y_{k-p})$

Iterative prediction: $\hat{y}_{k+1} = f(\hat{y}_k, \hat{y}_{k-1}, ..., \hat{y}_{k-p})$

(works well for p large, e.g. p = 50) [Espinoza et al., 2003]
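The difference between training on measured values and iterating on the model's own outputs can be illustrated with a small sketch. A linear autoregressive model fitted by least squares stands in here for the nonlinear f (an assumption made to keep the example short); the helper names and the sinusoidal test signal are hypothetical.

```python
import numpy as np

def make_regressors(y, p):
    # Rows [y_k, y_{k-1}, ..., y_{k-p}] with targets y_{k+1}
    rows = [y[k - p:k + 1][::-1] for k in range(p, len(y) - 1)]
    return np.array(rows), y[p + 1:]

def iterate(f, history, n_steps, p):
    # Feed the model its own predictions: after p + 1 steps no measured
    # value is left in the regressor vector.
    buf = list(history[-(p + 1):])                # oldest value first
    out = []
    for _ in range(n_steps):
        y_next = f(np.array(buf[::-1]))           # newest value first, as in training
        out.append(y_next)
        buf = buf[1:] + [y_next]
    return np.array(out)

k = np.arange(300)
y = np.sin(0.3 * k)
p = 1
Phi, target = make_regressors(y[:200], p)
theta, *_ = np.linalg.lstsq(Phi, target, rcond=None)

def f(v):
    return float(v @ theta)

y_iter = iterate(f, y[:200], n_steps=100, p=p)
max_err = float(np.max(np.abs(y_iter - y[200:300])))
```

For this noise-free sinusoid the recursion is exact, so the iterated prediction stays on the true signal; with a nonlinear model and real data, errors accumulate over the prediction horizon.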


Partially linear models: nonlinear system identification

[Figure: input signal and output signal versus discrete time index (×10^4), with the output split into test, training and validation parts; residuals versus discrete time index for two models]

Silver box benchmark study (physical system with cubic nonlinearity): (top-right) full black-box, (bottom-right) partially linear

Related application: power load forecasting [Espinoza et al., 2005]


- Part II contents -

• generalizations to KPCA
  weighted kernel PCA
  spectral clustering
  kernel canonical correlation analysis

• model selection
  structure detection
  kernel design
  semi-supervised learning
  incorporation of constraints

• kernel maps with reference point
  dimensionality reduction and data visualization


Core models + additional constraints

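• Monotonicity constraints: [Pelckmans et al., 2005]

$$\min_{w,b,e} w^T w + \gamma \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i, & (i = 1, ..., N) \\ w^T \varphi(x_i) \leq w^T \varphi(x_{i+1}), & (i = 1, ..., N - 1) \end{cases}$$

• Structure detection: [Pelckmans et al., 2005; Tibshirani, 1996]

$$\min_{w,e,t} \rho \sum_{p=1}^P t_p + \sum_{p=1}^P w^{(p)T} w^{(p)} + \gamma \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = \sum_{p=1}^P w^{(p)T} \varphi^{(p)}(x_i^{(p)}) + e_i, & (\forall i) \\ -t_p \leq w^{(p)T} \varphi^{(p)}(x_i^{(p)}) \leq t_p, & (\forall i, \forall p) \end{cases}$$

• Autocorrelated errors: [Espinoza et al., 2006]

$$\min_{w,b,r,e} w^T w + \gamma \sum_{i=1}^N r_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i, & (i = 1, ..., N) \\ e_i = \rho e_{i-1} + r_i, & (i = 2, ..., N) \end{cases}$$

• Spectral clustering: [Alzate & Suykens, 2006; Chung, 1997; Shi & Malik, 2000]

$$\min_{w,b,e} -w^T w + \gamma e^T D^{-1} e \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; (i = 1, ..., N)$$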


Generalizations to Kernel PCA: other loss functions

• Consider general loss function L (L2 case = KPCA):

$$\min_{w,b,e} -\frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^N L(e_i) \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, ..., N.$$

Generalizations of KPCA that lead to robustness and sparseness, e.g. Vapnik ε-insensitive loss, Huber loss function [Alzate & Suykens, 2006].

• Weighted least squares versions and incorporation of constraints:

$$\min_{w,b,e} -\frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^N v_i e_i^2 \quad \text{s.t.} \quad \begin{cases} e_i = w^T \varphi(x_i) + b, \quad i = 1, ..., N \\ \sum_{i=1}^N e_i e_i^{(1)} = 0 \\ \quad \vdots \\ \sum_{i=1}^N e_i e_i^{(i-1)} = 0 \end{cases}$$

Find i-th PC w.r.t. i − 1 orthogonality constraints (previous PC $e_i^{(j)}$).

The solution is given by a generalized eigenvalue problem.


Generalizations to Kernel PCA: robust denoising

Test set images, corrupted with Gaussian noise and outliers

Classical Kernel PCA Robust method

bottom rows: application of different pre-image algorithms

Robust method: improved results and fewer components needed


Generalizations to Kernel PCA: sparseness

[Figure: four panels in the (x1, x2) plane: denoising result (top) and the support vectors for PC1, PC2 and PC3 (bottom)]

Sparse kernel PCA using ε-insensitive loss [Alzate & Suykens, 2006] (top figure: denoising; bottom figures: different support vectors (in black) per principal component vector)


Spectral clustering: weighted KPCA

• Spectral graph clustering [Chung, 1997; Shi & Malik, 2000; Ng et al., 2002]

• Normalized cut problem

$$L q = \lambda D q$$

with L = D − W the Laplacian of the graph. Cluster membership indicators are given by q.

• KPCA Weighted LS-SVM formulation to normalized cut:

$$\min_{w,b,e} -\frac{1}{2} w^T w + \gamma \frac{1}{2} e^T V e \quad \text{such that} \quad e_i = w^T \varphi(x_i) + b, \; \forall i = 1, ..., N$$

with $V = D^{-1}$ the inverse degree matrix [Alzate & Suykens, 2006]. Allows for out-of-sample extensions on test data.

[Figure: graph with six nodes (1 to 6), showing a cut of size 2 and a cut of size 1 (minimal cut)]
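The generalized eigenvalue problem $Lq = \lambda Dq$ of the normalized cut can be solved with a symmetric eigensolver after a change of variables. The sketch below covers only this classical spectral step, not the weighted kernel PCA formulation with out-of-sample extension; the RBF affinity, its width and the toy data are assumptions.

```python
import numpy as np

def normalized_cut_eigs(W):
    # Solve L q = lambda D q with L = D - W through the symmetric matrix
    # D^{-1/2} L D^{-1/2}; if u is one of its eigenvectors, q = D^{-1/2} u.
    d = W.sum(axis=1)
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L_sym)                # ascending eigenvalues
    return lam, d_inv_sqrt[:, None] * U

# Hypothetical toy data: two clusters of 15 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (15, 2)), rng.normal(2.0, 0.3, (15, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 1.0)                             # RBF affinity matrix
lam, q = normalized_cut_eigs(W)
labels = (q[:, 1] > 0).astype(int)                # sign of the second eigenvector
```

The first eigenvector is constant (eigenvalue 0) and carries no cluster information; the sign pattern of the second one gives the two-way partition.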


Application to image segmentation

Given image (240× 160) Image segmentation

Large scale image: out-of-sample extension [Alzate & Suykens, 2006]


Kernel Canonical Correlation Analysis

[Diagram: Space X and Space Y are mapped by ϕ1(·) and ϕ2(·) to the feature space on X and the feature space on Y, and from there to the target spaces]

$z_x = w^T \varphi_1(x)$, $z_y = v^T \varphi_2(y)$

Correlation: $\min_{w,v} \sum_i \|z_{x_i} - z_{y_i}\|_2^2$

Applications of kernel CCA [Suykens et al., 2002, Bach & Jordan, 2002] e.g. in:

- bioinformatics (correlation gene network - gene expression profiles) [Vert et al., 2003]

- information retrieval, fMRI [Shawe-Taylor et al., 2004]

- state estimation of dynamical systems, subspace algorithms [Goethals et al., 2005]


LS-SVM formulation to Kernel CCA

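• Score variables: $z_x = w^T (\varphi_1(x) - \mu_{\varphi_1})$, $z_y = v^T (\varphi_2(y) - \mu_{\varphi_2})$

Feature maps $\varphi_1, \varphi_2$, kernels $K_1(x_i, x_j) = \varphi_1(x_i)^T \varphi_1(x_j)$, $K_2(y_i, y_j) = \varphi_2(y_i)^T \varphi_2(y_j)$

• Primal problem: (Kernel PLS case: $\nu_1 = 0$, $\nu_2 = 0$ [Hoegaerts et al., 2004])

$$\max_{w,v,e,r} \gamma \sum_{i=1}^N e_i r_i - \nu_1 \frac{1}{2} \sum_{i=1}^N e_i^2 - \nu_2 \frac{1}{2} \sum_{i=1}^N r_i^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$$

such that $e_i = w^T (\varphi_1(x_i) - \mu_{\varphi_1})$, $r_i = v^T (\varphi_2(y_i) - \mu_{\varphi_2})$, $\forall i$

with $\mu_{\varphi_1} = (1/N) \sum_{i=1}^N \varphi_1(x_i)$, $\mu_{\varphi_2} = (1/N) \sum_{i=1}^N \varphi_2(y_i)$.

• Dual problem: generalized eigenvalue problem [Suykens et al. 2002]

$$\begin{bmatrix} 0 & \Omega_{c,2} \\ \Omega_{c,1} & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} \nu_1 \Omega_{c,1} + I & 0 \\ 0 & \nu_2 \Omega_{c,2} + I \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \quad \lambda = 1/\gamma$$

with $\Omega_{c,1_{ij}} = (\varphi_1(x_i) - \mu_{\varphi_1})^T (\varphi_1(x_j) - \mu_{\varphi_1})$, $\Omega_{c,2_{ij}} = (\varphi_2(y_i) - \mu_{\varphi_2})^T (\varphi_2(y_j) - \mu_{\varphi_2})$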


System identification of Hammerstein systems

• Hammerstein system:

$$x_{t+1} = A x_t + B f(u_t) + \nu_t$$
$$y_t = C x_t + D f(u_t) + v_t$$

with $E \left\{ \begin{bmatrix} \nu_p \\ v_p \end{bmatrix} \begin{bmatrix} \nu_q^T & v_q^T \end{bmatrix} \right\} = \begin{bmatrix} Q & S \\ S^T & R \end{bmatrix} \delta_{pq}$.

• System identification problem: given $\{(u_t, y_t)\}_{t=0}^{N-1}$, estimate A, B, C, D, f.

• Subspace algorithms [Goethals et al., IEEE-AC 2005]: first estimate state vector sequence (can also be done by KCCA) (for linear systems equivalent to Kalman filtering)

• Related problems: linear non-Gaussian models, links with ICA
Kernels for linear systems and gait recognition [Bissacco et al., 2007]

[Block diagram: u → f(·) → [A, B, C, D] → y]


Bayesian inference

[Diagram: three levels of inference, each of the form Posterior = Likelihood × Prior / Evidence with maximization of the posterior: Level 1 (parameters), Level 2 (hyperparameters), Level 3 (model comparison)]

Automatic relevance determination (ARD) [MacKay, 1998]: infer elements of diagonal matrix S in $K(x_i, x_j) = \exp(-(x_i - x_j)^T S (x_i - x_j))$ which indicate how relevant input variables are (but: many local minima, computationally expensive).
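The role of the diagonal matrix S is easy to see in code. The sketch below only evaluates the ARD kernel for a given diagonal (the Bayesian inference of S is not shown); the function name and the sample points are hypothetical.

```python
import numpy as np

def ard_kernel_matrix(A, B, s):
    # K(a, b) = exp(-(a - b)^T S (a - b)) with S = diag(s), s >= 0
    s = np.asarray(s, dtype=float)
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-(diff ** 2 * s).sum(axis=-1))

A1 = np.array([[0.0, 0.0], [1.0, 5.0]])
A2 = np.array([[0.0, 3.0], [1.0, -7.0]])          # same first input, different second input
K1 = ard_kernel_matrix(A1, A1, s=[1.0, 0.0])
K2 = ard_kernel_matrix(A2, A2, s=[1.0, 0.0])
```

With the second diagonal element set to zero, the kernel no longer depends on the second input, so `K1` and `K2` coincide: a small $s_k$ marks input k as irrelevant.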


Classification of brain tumors using ARD

Bayesian learning (automatic relevance determination) of most relevant frequencies [Lu, 2005]


Hierarchical Kernel Machines

[Diagram: conceptually, a hierarchical kernel machine with Levels 1 to 3 (labels: LS-SVM substrate, model selection, sparseness, structure detection); computationally, convex optimization]

Hierarchical modelling approach leading to convex optimization problem
Computationally fusing training, hyperparameter and model selection
Optimization modelling: sparseness, input/structure selection, stability ...

[Pelckmans et al., ML 2006]


Additive regularization trade-off

• Traditional Tikhonov regularization scheme:

$$\min_{w,e} w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad e_i = y_i - w^T \varphi(x_i), \; \forall i = 1, ..., N$$

Training solution for fixed value of γ: $(K + I/\gamma) \alpha = y$

→ Selection of γ via validation set: non-convex problem

• Additive regularization trade-off [Pelckmans et al., 2005]:

$$\min_{w,e} w^T w + \sum_i (e_i - c_i)^2 \quad \text{s.t.} \quad e_i = y_i - w^T \varphi(x_i), \; \forall i = 1, ..., N$$

Training solution for fixed value of $c = [c_1; ...; c_N]$: $(K + I) \alpha = y - c$

→ Selection of c via validation set: can be convex problem

• Convex relaxation to Tikhonov regularization [Pelckmans et al., IEEE-TNN 2007]
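The two training systems can be compared numerically. The sketch below solves both for a small toy problem; the kernel, the data and the value of γ are assumptions. It also checks a relation that follows from subtracting the two systems: choosing $c = (1/\gamma - 1) \alpha_\gamma$ reproduces the Tikhonov solution $\alpha_\gamma$, so the Tikhonov scheme is contained in the additive one.

```python
import numpy as np

def tikhonov_alpha(K, y, gamma):
    # (K + I/gamma) alpha = y
    return np.linalg.solve(K + np.eye(len(y)) / gamma, y)

def additive_alpha(K, y, c):
    # (K + I) alpha = y - c; the unknown alpha depends linearly on c
    return np.linalg.solve(K + np.eye(len(y)), y - c)

# Hypothetical toy data with an RBF kernel matrix.
x = np.linspace(0.0, 1.0, 10)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)
y = np.sin(2.0 * np.pi * x)

gamma = 5.0
a_tik = tikhonov_alpha(K, y, gamma)
c_equiv = (1.0 / gamma - 1.0) * a_tik             # c that reproduces the Tikhonov solution
a_add = additive_alpha(K, y, c_equiv)
a_zero = additive_alpha(K, y, np.zeros_like(y))   # c = 0 corresponds to gamma = 1
```

Because α is linear in c, a validation criterion that is convex in α stays convex in c, whereas α depends on γ through a matrix inverse.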


Sparse models

• SVM classically: sparse solution from QP problem at training level

• Hierarchical kernel machine: fused problem with sparseness obtained at the validation level [Pelckmans et al., 2005]

[Figure: two-class example (class 1, class 2) in the (X1, X2) plane; LS-SVM with RBF kernel, γ = 5.3667, σ² = 0.90784]


Additive models and structure detection

• Additive models: $y(x) = \sum_{p=1}^P w^{(p)T} \varphi^{(p)}(x^{(p)})$ with $x^{(p)}$ the p-th input.

Kernel $K(x_i, x_j) = \sum_{p=1}^P K^{(p)}(x_i^{(p)}, x_j^{(p)})$.

• Structure detection [Pelckmans et al., 2005]:

$$\min_{w,e,t} \rho \sum_{p=1}^P t_p + \sum_{p=1}^P w^{(p)T} w^{(p)} + \gamma \sum_{i=1}^N e_i^2$$

$$\text{s.t.} \quad \begin{cases} y_i = \sum_{p=1}^P w^{(p)T} \varphi^{(p)}(x_i^{(p)}) + e_i, & \forall i = 1, ..., N \\ -t_p \leq w^{(p)T} \varphi^{(p)}(x_i^{(p)}) \leq t_p, & \forall i = 1, ..., N, \; \forall p = 1, ..., P \end{cases}$$

Study how the solution with maximal variation varies for different values of ρ

[Figure: maximal variation versus ρ (logarithmic axis, 10^0 to 10^4): 4 relevant input variables, 21 irrelevant input variables]


Incorporation of prior knowledge

• Example: LS-SVM regression with monotonicity constraint

$$\min_{w,b,e} \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i, & \forall i = 1, ..., N \\ w^T \varphi(x_i) \leq w^T \varphi(x_{i+1}), & \forall i = 1, ..., N - 1 \end{cases}$$

• Application: estimation of cdf [Pelckmans et al., 2005]

[Figure: two panels of P(X) versus X: empirical cdf and true cdf (with levels Y1, Y2 marked); comparison of ecdf, cdf, Chebychev, mkr and mLS-SVM]


Equivalent kernels from constraints

Regression with autocorrelated errors:

$$\min_{w,b,r,e} w^T w + \gamma \sum_i r_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i & (i = 1, ..., N) \\ e_i = \rho e_{i-1} + r_i & (i = 2, ..., N) \end{cases}$$

leads to

$$f(x) = \sum_{j=2}^N \alpha_{j-1} K_{eq}(x_j, x) + b$$

with “equivalent kernel”

$$K_{eq}(x_j, x_i) = K(x_j, x_i) - \rho K(x_{j-1}, x_i)$$

where $K(x_j, x_i) = \varphi(x_j)^T \varphi(x_i)$ [Espinoza et al., 2006].

[Diagram: modular definition of the model structure: LS-SVM regression combined with partially linear structure, autocorrelated residuals, imposing symmetry]
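Given an ordinary kernel matrix, the equivalent kernel is a difference of shifted rows. The sketch below shows only this construction and the evaluation of f; the way α and b are estimated is not shown, and the helper names, the toy kernel and the coefficient values are hypothetical.

```python
import numpy as np

def equivalent_kernel_matrix(K, rho):
    # K[j, i] = K(x_j, x_i) with 0-based indices.
    # Row j-2 of the result holds K_eq(x_j, x_i) = K(x_j, x_i) - rho K(x_{j-1}, x_i)
    # for j = 2, ..., N, so the result has shape (N-1, N).
    return K[1:, :] - rho * K[:-1, :]

def evaluate_model(K_eq, alpha, b):
    # f(x_i) = sum_{j=2}^{N} alpha_{j-1} K_eq(x_j, x_i) + b, for all training points
    return K_eq.T @ alpha + b

x = np.linspace(0.0, 1.0, 6)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)
K_eq = equivalent_kernel_matrix(K, rho=0.8)
K_eq_no_corr = equivalent_kernel_matrix(K, rho=0.0)
f_train = evaluate_model(K_eq, alpha=np.ones(5), b=0.1)   # placeholder coefficients
```

For ρ = 0 the equivalent kernel falls back to the original kernel (restricted to j = 2, ..., N), which recovers standard LS-SVM regression.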


Application: electric load forecasting

Short-term load forecasting (1-24 hours)
Important for power generation decisions
Hourly load values from substations in Belgian grid
Seasonal/weekly/intra-daily patterns

[Figure: normalized load versus hour, actual load compared with the model output. (a) Fixed-size LS-SVM, 1-hour ahead; (b) linear ARX model, 1-hour ahead; (c) Fixed-size LS-SVM, 24-hours ahead; (d) linear ARX model, 24-hours ahead]

[Espinoza et al., 2007]


Semi-supervised learning

[Figure: two panels of a 2-D data set in the (x1, x2) plane]

Semi-supervised learning: part labeled and part unlabeled data

Assumptions for semi-supervised learning to work [Chapelle et al., 2006]:

• Smoothness assumption: if two points x1, x2 in a high density region are close, then so should be the corresponding outputs y1, y2

• Cluster assumption: points from same cluster are likely same class

• Low density separation: decision boundary should be in low density region

• Manifold assumption: data lie on a low-dimensional manifold


Semi-supervised learning in RKHS

• Learning in RKHS [Belkin & Niyogi, 2004]:

$$\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^N V(y_i, f(x_i)) + \lambda \|f\|_K^2 + \eta f^T L f$$

with $V(\cdot, \cdot)$ loss function, L Laplacian matrix, $\|f\|_K$ norm in RKHS $\mathcal{H}$, $f = [f(x_1); ...; f(x_{N_l + N_u})]$ ($N_l$, $N_u$ number of labeled and unlabeled data)

• Laplacian term: discretization of the Laplace-Beltrami operator

• Representer theorem: $f(x) = \sum_{i=1}^{N_l + N_u} \alpha_i K(x, x_i)$

• Least squares solution case: Laplacian acts on kernel matrix

• Problem: true labels of unlabeled data assumed to be zero.


Formulation by adding constraints

• Semi-supervised LS-SVM model [Luts et al., 2007]:

$$\min_{w,e,b,\hat{y}} \frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^N e_i^2 + \frac{1}{2} \eta \sum_{i,j=1}^N v_{ij} (\hat{y}_i - \hat{y}_j)^2$$

$$\text{s.t.} \quad \begin{cases} \hat{y}_i = w^T \varphi(x_i) + b, & i = 1, ..., N \\ \hat{y}_i = \nu_i y_i - e_i, \; \nu_i \in \{0, 1\}, & i = 1, ..., N \end{cases}$$

where $\nu_i = 0$ for unlabeled data, $\nu_i = 1$ for labeled data.

• MRI image: healthy tissue versus tumor classification [Luts et al., 2007]

[Figure: two nosologic images on a 12 × 12 grid, with a colour scale from 1 to 6]

[etumour FP6-2002-lifescihealth503094, healthagents FP6-2005-IST027213]


Learning combination of kernels

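• Take combination $K = \sum_{i=1}^m \mu_i K_i$ ($\mu_i \geq 0$) (e.g. for data fusion). Learn $\mu_i$ as convex problem [Lanckriet et al., JMLR 2004]

• QP problem of SVM:

$$\max_{\alpha} 2 \alpha^T 1 - \alpha^T \mathrm{diag}(y) K \mathrm{diag}(y) \alpha \quad \text{s.t.} \quad 0 \leq \alpha \leq C, \; \alpha^T y = 0$$

is replaced by

$$\min_{\mu_i} \max_{\alpha} 2 \alpha^T 1 - \alpha^T \mathrm{diag}(y) \Big( \sum_{i=1}^m \mu_i K_i \Big) \mathrm{diag}(y) \alpha$$

$$\text{s.t.} \quad 0 \leq \alpha \leq C, \; \alpha^T y = 0, \; \mathrm{trace}\Big( \sum_{i=1}^m \mu_i K_i \Big) = c, \; \sum_{i=1}^m \mu_i K_i \succeq 0.$$

Can be solved as a semidefinite program (SDP problem) [Boyd & Vandenberghe, 2004] (LMI constraint for positive definite kernel)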
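The constraints on the combined kernel can be illustrated without solving the SDP itself. The sketch below only constructs a feasible combination for given nonnegative weights and rescales it to the prescribed trace; it does not learn the weights. The function name, the two base kernels and the weight values are hypothetical.

```python
import numpy as np

def combine_kernels(kernels, mu, c):
    # K = sum_i mu_i K_i with mu_i >= 0, rescaled so that trace(K) = c.
    mu = np.asarray(mu, dtype=float)
    if np.any(mu < 0):
        raise ValueError("weights mu_i must be nonnegative")
    K = sum(m * Ki for m, Ki in zip(mu, kernels))
    return K * (c / np.trace(K))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
K_lin = X @ X.T                                                   # linear kernel
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-sq / 2.0)                                         # RBF kernel
K = combine_kernels([K_lin, K_rbf], mu=[0.3, 0.7], c=8.0)
min_eig = float(np.linalg.eigvalsh(K).min())
```

A nonnegative combination of positive semidefinite matrices is again positive semidefinite, so for $\mu_i \geq 0$ the LMI constraint holds automatically; it becomes an active restriction only if negative weights are allowed.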


Kernel design

[Figure: Bayesian network over the variables A, B, C, D, E]

P(A,B,C,D,E) = P(A|B) P(B) P(C|B) P(D|C) P(E|B)

Kernels from graphical models, Bayesian networks, HMMs
Kernels tailored to data types (DNA sequence, text, chemoinformatics)

[Tsuda et al., Bioinformatics 2002; Jebara et al., JMLR 2004; Ralaivola et al., 2005]

- Probability product kernel:

$$K(p_1, p_2) = \int_{\mathcal{X}} p_1(x)^\rho \, p_2(x)^\rho \, dx$$

- Prior knowledge incorporation
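For distributions on a finite set the integral in the probability product kernel becomes a sum, which gives a compact illustration. The discrete setting, the function name and the example distributions are assumptions of this sketch.

```python
import numpy as np

def probability_product_kernel(p1, p2, rho=0.5):
    # Discrete analogue of K(p1, p2) = integral of p1(x)^rho p2(x)^rho dx
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return float(np.sum(p1 ** rho * p2 ** rho))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.0, 1.0])
r = np.array([0.2, 0.3, 0.5])
k_pp = probability_product_kernel(p, p)           # rho = 1/2: equals sum of p = 1
k_pq = probability_product_kernel(p, q)           # disjoint supports
k_pr_expected = probability_product_kernel(p, r, rho=1.0)
```

With ρ = 1/2 the kernel of a distribution with itself is 1 and distributions with disjoint supports get kernel value 0; with ρ = 1 the kernel is the inner product of the two probability vectors.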


Dimensionality reduction and data visualization

• Traditionally:
commonly used techniques are e.g. principal component analysis, multi-dimensional scaling, self-organizing maps

• More recently:
isomap, locally linear embedding, Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps
(“kernel eigenmap methods and manifold learning”)
[Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]

• Relevant issues:
- learning and generalization [Cucker & Smale, 2002; Poggio et al., 2004]
- model representations and out-of-sample extensions
- convex/non-convex problems, computational complexity [Smale, 1997]

• Kernel maps with reference point (KMref) [Suykens, 2007]:
data visualization and dimensionality reduction by solving linear system


Kernel maps with reference point: problem statement

• Kernel maps with reference point [Suykens, 2007]:
- LS-SVM core part: realize dimensionality reduction $x \mapsto z$
- reference point q (e.g. first point; sacrificed in the visualization)

• Example: d = 2

$$\min_{z, w_1, w_2, b_1, b_2, e_{i,1}, e_{i,2}} \frac{1}{2} (z - P_D z)^T (z - P_D z) + \frac{\nu}{2} (w_1^T w_1 + w_2^T w_2) + \frac{\eta}{2} \sum_{i=1}^N (e_{i,1}^2 + e_{i,2}^2)$$

$$\text{such that} \quad \begin{cases} c_{1,1}^T z = q_1 + e_{1,1} \\ c_{1,2}^T z = q_2 + e_{1,2} \\ c_{i,1}^T z = w_1^T \varphi_1(x_i) + b_1 + e_{i,1}, & \forall i = 2, ..., N \\ c_{i,2}^T z = w_2^T \varphi_2(x_i) + b_2 + e_{i,2}, & \forall i = 2, ..., N \end{cases}$$

Coordinates in low dimensional space: $z = [z_1; z_2; ...; z_N] \in \mathbb{R}^{dN}$

Regularization term: $(z - P_D z)^T (z - P_D z) = \sum_{i=1}^N \| z_i - \sum_{j=1}^N s_{ij} D z_j \|_2^2$

with D diagonal matrix and $s_{ij} = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$.


Kernel maps with reference point: solution

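• The unique solution to the problem is given by the linear system

$$\begin{bmatrix} U & -V_1 M_1^{-1} 1 & -V_2 M_2^{-1} 1 \\ -1^T M_1^{-1} V_1^T & 1^T M_1^{-1} 1 & 0 \\ -1^T M_2^{-1} V_2^T & 0 & 1^T M_2^{-1} 1 \end{bmatrix} \begin{bmatrix} z \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} \eta (q_1 c_{1,1} + q_2 c_{1,2}) \\ 0 \\ 0 \end{bmatrix}$$

with matrices

$$U = (I - P_D)^T (I - P_D) - \gamma I + V_1 M_1^{-1} V_1^T + V_2 M_2^{-1} V_2^T + \eta c_{1,1} c_{1,1}^T + \eta c_{1,2} c_{1,2}^T$$

$$M_1 = \frac{1}{\nu} \Omega_1 + \frac{1}{\eta} I, \quad M_2 = \frac{1}{\nu} \Omega_2 + \frac{1}{\eta} I$$

$$V_1 = [c_{2,1} \; ... \; c_{N,1}], \quad V_2 = [c_{2,2} \; ... \; c_{N,2}]$$

kernel matrices $\Omega_1, \Omega_2 \in \mathbb{R}^{(N-1) \times (N-1)}$:

$\Omega_{1,ij} = K_1(x_i, x_j) = \varphi_1(x_i)^T \varphi_1(x_j)$, $\Omega_{2,ij} = K_2(x_i, x_j) = \varphi_2(x_i)^T \varphi_2(x_j)$

positive definite kernel functions $K_1(\cdot, \cdot)$, $K_2(\cdot, \cdot)$.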


Kernel maps with reference point: model representations

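• The primal and dual model representations allow making out-of-sample extensions. Evaluation at point $x_* \in \mathbb{R}^p$:

$$z_{*,1} = w_1^T \varphi_1(x_*) + b_1 = \frac{1}{\nu} \sum_{i=2}^N \alpha_{i,1} K_1(x_i, x_*) + b_1$$

$$z_{*,2} = w_2^T \varphi_2(x_*) + b_2 = \frac{1}{\nu} \sum_{i=2}^N \alpha_{i,2} K_2(x_i, x_*) + b_2$$

Estimated coordinates for visualization: $z_* = [z_{*,1}; z_{*,2}]$.

• $\alpha_1, \alpha_2 \in \mathbb{R}^{N-1}$ are the unique solutions to the linear systems

$$M_1 \alpha_1 = V_1^T z - b_1 1_{N-1} \quad \text{and} \quad M_2 \alpha_2 = V_2^T z - b_2 1_{N-1}$$

and $\alpha_1 = [\alpha_{2,1}; ...; \alpha_{N,1}]$, $\alpha_2 = [\alpha_{2,2}; ...; \alpha_{N,2}]$, $1_{N-1} = [1; 1; ...; 1]$.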


KMref: spiral example

[Figure: spiral data in 3-D (x1, x2, x3) and its 2-D projection (z1, z2)]

training data (blue *), validation data (magenta o), test data (red +)

Model selection: $\min \sum_{i,j} \left( \frac{z_i^T z_j}{\|z_i\|_2 \|z_j\|_2} - \frac{x_i^T x_j}{\|x_i\|_2 \|x_j\|_2} \right)^2$
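The model selection criterion compares pairwise cosines in the input space with pairwise cosines in the low dimensional space. A minimal sketch follows; the function names and the random data are hypothetical, and rows with zero norm are assumed not to occur.

```python
import numpy as np

def cosine_matrix(A):
    # Entry (i, j) is a_i^T a_j / (||a_i||_2 ||a_j||_2)
    norms = np.linalg.norm(A, axis=1)
    return (A @ A.T) / np.outer(norms, norms)

def kmref_selection_criterion(Z, X):
    # sum_{i,j} (cos(z_i, z_j) - cos(x_i, x_j))^2
    return float(((cosine_matrix(Z) - cosine_matrix(X)) ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Z_random = rng.normal(size=(20, 2))
crit_identity = kmref_selection_criterion(X, X)
crit_scaled = kmref_selection_criterion(3.0 * X, X)   # cosines are scale invariant
crit_random = kmref_selection_criterion(Z_random, X)
```

Since only angles enter, the criterion is insensitive to the overall scale of the coordinates z, which can be very small in the projections obtained from the linear system.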


KMref: swiss roll example

[Figure: given 3D swiss roll data (x1, x2, x3) and the KMref result, a 2D projection (z1, z2)]

600 training data, 100 validation data


KMref: visualizing gene distributions

[Figure: two 3-D views of the projected data with axes z1, z2, z3 (all of order 10^-3)]

KMref 3D projection (Alon colon cancer microarray data set)
Dimension input space: 62

Number of genes: 1500 (training: 500, validation: 500, test: 500)
Model selection: $\sigma^2 = 10^4$, $\sigma_1^2 = 10^3$, $\sigma_2^2 = 0.5 \sigma_1^2$, $\sigma_3^2 = 0.1 \sigma_1^2$, η = 1, ν = 100, D = diag{10, 5, 1}, q = [+1; −1; −1].


Nonlinear dynamical systems control

[Figure: controlled system with input u, position xp and angle θ, and a plot of the resulting trajectory]

[Diagram: control objective + LS-SVM objective, minimized subject to the system dynamics (time k = 1, 2, ..., N) and the LS-SVM controller (time k = 1, 2, ..., N)]

Merging optimal control and support vector machine optimization problems:

Approximate solutions to optimal control problems [Suykens et al., NN 2001]


Conclusions and future challenges

• Integrative understanding and systematic design for supervised, semi-supervised, unsupervised learning and beyond

• Kernel methods: complementary views (LS-)SVM, RKHS, GP

• Least squares support vector machines as “core problems”: provides methodology for “optimization modelling”

• Bridging gaps between fundamental theory, algorithms and applications

• Reliable methods: numerically, computationally, statistically

Websites:
http://www.kernel-machines.org/
http://www.esat.kuleuven.be/sista/lssvmlab/


Books

• Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.

• Chapelle O., Scholkopf B., Zien A. (Eds.), Semi-Supervised Learning, MIT Press, 2006.

• Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.

• Cucker F., Zhou D.-X., Learning Theory: an Approximation Theory Viewpoint, Cambridge University Press, 2007.

• Rasmussen C.E., Williams C.K.I., Gaussian Processes for Machine Learning, MIT Press, 2006.

• Scholkopf B., Smola A., Learning with Kernels, MIT Press, 2002.

• Scholkopf B., Tsuda K., Vert J.P. (Eds.), Kernel Methods in Computational Biology 400, MIT Press, 2004.

• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.

• Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B., Vandewalle J., Least Squares Support Vector Machines, World Scientific, Singapore, 2002.

• Suykens J.A.K., Horvath G., Basu S., Micchelli C., Vandewalle J. (Eds.), Advances in Learning Theory: Methods, Models and Applications, vol. 190 NATO-ASI Series III: Computer and Systems Sciences, IOS Press, 2003.

• Vapnik V., Statistical Learning Theory, John Wiley & Sons, 1998.

• Wahba G., Spline Models for Observational Data, Series Appl. Math., 59, SIAM, 1990.

