
Introduction to Support Vector machines

Colin Campbell

Intelligent Systems Lab – University of Bristol, UK

6th International Summer School on Pattern Recognition, September 2010


Outline: part 1

1.1 Introduction

1.2 Support Vector Machines for binary classification

1.3 Multi-class classification

1.4 Learning with noise: soft margins

1.5 Case study: predicting disease progression


Outline: part 2

2 Different models you can construct with kernels

2.1 Other kernel-based learning machines

2.2 Introducing a confidence measure

2.3 One class classification

2.4 Regression: learning with real-valued labels

2.5 Structured output learning


Outline: part 3

3 Learning with kernels

3.1 Properties of kernels

3.2 Simple kernels

3.3 String kernels

3.4 p-spectrum kernel

3.5 Graph kernels

3.6 Multiple kernel learning (MKL)

3.7 Case Study: protein fold prediction using MKL


1.1. Introduction.

Support Vector Machines have become a well established tool within machine learning. Conceptually they have many advantages:

the approach is systematic and properly motivated by statistical learning theory.

training a Support Vector Machine (SVM) involves optimization of a convex function: there is a unique solution.

the constructed model has an explicit dependence on a subset of the datapoints, the support vectors, which improves model interpretation.

1.1. Introduction.

Data is stored in the form of kernels which quantify the similarity or dissimilarity of data objects. Kernels can now be constructed for a wide variety of data objects, from continuous and discrete input data through to sequence and graph data.

Support Vector Machines work well in practice.

The kernel substitution concept is applicable to many other types of data analysis model.

Thus Support Vector Machines are the most well known of a broad class of methods which use kernels to represent data and can be called kernel-based methods.

SVMs for binary classification: 1

Statistical learning theory provides the theoretical underpinning to SVM learning. From the perspective of this subject, the motivation for considering binary classifier SVMs comes from a theoretical upper bound on the generalization error, that is, the theoretical prediction error when applying the classifier to novel, unseen instances. This generalization bound has two important features:

[A] the error bound is minimized by maximizing the margin, γ, i.e. the minimal distance between the hyperplane separating the two classes and the closest datapoints to the hyperplane.

[B] the generalization error bound does not depend on the dimensionality of the space.

SVMs for binary classification: 2

Figure: a separating hyperplane for two classes of datapoints, with x1 and x2 the closest points on either side of the margin band.

SVMs for binary classification: 3

Let us consider a binary classification task with datapoints xᵢ (i = 1, . . . , m) having corresponding labels yᵢ = ±1, and let the decision function be:

f(x) = sign(w · x + b)

If the dataset is separable then the data will be correctly classified if yᵢ(w · xᵢ + b) > 0 ∀i. This relation is invariant under a positive rescaling of the argument inside the sign-function, hence we implicitly define a scale for (w, b) by setting w · x + b = 1 for the closest points on one side and w · x + b = −1 for the closest points on the other side.

SVMs for binary classification: 4

The hyperplanes passing through w · x + b = 1 and w · x + b = −1 are called canonical hyperplanes, and the region between these canonical hyperplanes is called the margin band.

For the separating hyperplane w · x + b = 0 the normal vector is w/||w||₂, and hence the margin can be found from the projection of x₁ − x₂ onto this vector.

Since w · x₁ + b = 1 and w · x₂ + b = −1 this means the margin is γ = 1/||w||₂.

SVMs for binary classification: 5

To maximize the margin the task is therefore:

minimize (1/2)||w||₂²

subject to the constraints:

yᵢ(w · xᵢ + b) ≥ 1 ∀i

SVMs for binary classification: 6

As a constrained optimisation problem, the above formulation can be reduced to minimization of the following Lagrange function, which we will call the primal formulation:

L = (1/2)(w · w) − Σᵢ αᵢ (yᵢ(w · xᵢ + b) − 1)

where the αᵢ are Lagrange multipliers such that αᵢ ≥ 0.

SVMs for binary classification: 7

We can take the derivatives with respect to b and w:

∂L/∂b = − Σᵢ αᵢyᵢ = 0

∂L/∂w = w − Σᵢ αᵢyᵢxᵢ = 0

Substituting w back into the primal objective we get the dual objective function:

SVMs for binary classification: 8

W(α) = Σᵢ αᵢ − (1/2) Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ (xᵢ · xⱼ)

which must be maximised with respect to the αᵢ subject to the constraints:

αᵢ ≥ 0        Σᵢ αᵢyᵢ = 0

SVMs for binary classification: 9

So far we haven't used the second observation, [B], implied by the generalisation theorem mentioned above: the error bound does not depend on the dimensionality of the space.

From the dual objective we notice that the datapoints, xᵢ, only appear inside an inner product.

To get an alternative representation of the data we could therefore map the datapoints into a space with a different dimensionality, called feature space, through a replacement:

xᵢ · xⱼ → Φ(xᵢ) · Φ(xⱼ)

SVMs for binary classification: 10

Data which is not separable in input space can always be separated in a space of high enough dimensionality.

Observation [B] means there is no loss of generalisation performance if we map to a feature space where the data is separable and a margin can be defined.

The functional form of the mapping Φ(xᵢ) does not need to be known, since it is implicitly defined by the choice of kernel: K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ), the inner product in feature space.

Of course, there are restrictions on the possible choice of kernel. One apparent restriction is that there must be an inner product consistently defined in feature space. This restricts feature space to a Hilbert space.

Non-separable dataset


Possible kernels

Many types of kernel are possible, e.g.

(a) a Gaussian kernel:

K(xᵢ, xⱼ) = e^(−||xᵢ − xⱼ||²/2σ²)

(b)

K(xᵢ, xⱼ) = (xᵢ · xⱼ + 1)^d        K(xᵢ, xⱼ) = tanh(βxᵢ · xⱼ + b)

which define other types of classifier, in this case a polynomial and a feedforward neural network classifier.
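As an illustrative sketch, a Gaussian kernel matrix can be computed directly with NumPy; the width σ and the toy data below are hypothetical choices:

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, sigma=1.0):
    """K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))   # 5 toy datapoints in 3 dimensions
K = gaussian_kernel_matrix(X, X, sigma=1.5)
```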

Summary

For binary classification with a given choice of kernel the learning task therefore involves maximization of:

W(α) = Σᵢ αᵢ − (1/2) Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ K(xᵢ, xⱼ)

subject to:

αᵢ ≥ 0        Σᵢ αᵢyᵢ = 0
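This maximization is a quadratic programme. A minimal training sketch, assuming scikit-learn's SVC (which solves exactly this dual for a chosen kernel; the toy data and parameter values are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # toy binary labels in {-1, +1}

# gamma plays the role of 1/(2 sigma^2) in the Gaussian kernel; C is the soft-margin parameter
clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)
print(clf.support_)      # indices of the support vectors (points with alpha_i > 0)
print(clf.dual_coef_)    # the products y_i * alpha_i for the support vectors
```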

The bias b

Since the bias, b, has not featured so far it must be found separately. For a datapoint with yᵢ = +1 we note that:

min_{i : yᵢ=+1} [ w · xᵢ + b ] = min_{i : yᵢ=+1} [ Σⱼ yⱼαⱼK(xᵢ, xⱼ) ] + b = 1

with a similar expression for datapoints labeled yᵢ = −1. From this observation we deduce:

b = −(1/2) [ max_{i : yᵢ=−1} ( Σⱼ yⱼαⱼK(xᵢ, xⱼ) ) + min_{i : yᵢ=+1} ( Σⱼ yⱼαⱼK(xᵢ, xⱼ) ) ]

Decision function

For a novel input vector z, the predicted class is therefore based on the sign of:

φ(z) = Σᵢ yᵢαᵢ* K(z, xᵢ) + b*

where b* denotes the value of the bias at optimality. We will henceforth refer to such a solution (αᵢ*, b*) as a hypothesis modelling the data. Only those points which lie closest to the hyperplane have αᵢ* > 0 and these points are the support vectors. All other points have αᵢ* = 0 and the decision function is independent of these samples.
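A small sketch of evaluating this decision function, assuming the optimal αᵢ* and b* have already been obtained from a solver (the helper names and the kernel argument are hypothetical):

```python
import numpy as np

def svm_decision(z, X_train, y_train, alpha, b, kernel):
    """phi(z) = sum_i y_i alpha_i K(z, x_i) + b."""
    k_vals = np.array([kernel(x_i, z) for x_i in X_train])
    return float(np.sum(y_train * alpha * k_vals) + b)

def predict(z, X_train, y_train, alpha, b, kernel):
    """Predicted class is the sign of the decision function."""
    return 1 if svm_decision(z, X_train, y_train, alpha, b, kernel) >= 0 else -1
```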

The Karush-Kuhn-Tucker (KKT) conditions

The Karush-Kuhn-Tucker (KKT) conditions are the complete set of conditions which must be satisfied at the optimum of a constrained optimisation problem. One of the KKT conditions is:

αᵢ (yᵢ(w · xᵢ + b) − 1) = 0

from which we deduce that either yᵢ(w · xᵢ + b) > 1 (a non-support vector) and hence αᵢ = 0, or yᵢ(w · xᵢ + b) = 1 (a support vector) and thus αᵢ ≥ 0.

1.3. Multi-class classification

Many problems involve multiclass classification and a number of schemes have been outlined. The main strategies are as follows:

if the number of classes is small then we can use a directed acyclic graph (DAG) with the learning task reduced to binary classification at each node.

we could use a series of one-against-all classifiers: we construct C separate SVMs, with the c-th SVM trained using data from class c as the positively labelled samples and the remaining classes as the negatively labelled samples (see the sketch after this list).

we use a set of one-class classifiers (considered later): each one-class classifier constructs a boundary around one class of data.
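A minimal sketch of the one-against-all scheme, assuming binary SVMs from scikit-learn (an assumed tool choice):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X, y, classes, **svm_params):
    """One binary SVM per class c: class c positive, all other classes negative."""
    return {c: SVC(**svm_params).fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_one_vs_all(models, z):
    """Assign the class whose classifier gives the largest decision value."""
    scores = {c: m.decision_function(z.reshape(1, -1))[0] for c, m in models.items()}
    return max(scores, key=scores.get)
```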

1.3: DAG tree for multiclass classification

Figure: A multi-class classification problem can be reduced to a series of binary classification tasks: a DAG whose internal nodes are pairwise classifiers (1/2, 1/3, 2/3) and whose leaves are the classes 1, 2 and 3.

1.4. Learning with noise: soft margins

Most real-life datasets contain noise and an SVM can fit to this noise, leading to poor generalization. Two soft margin methods can be used to reduce the effects of noise:

With an L1 error norm the constraint 0 ≤ αᵢ is replaced by the box constraint:

0 ≤ αᵢ ≤ C

With an L2 error norm we add a small positive constant to the leading diagonal of the kernel matrix (illustrated below):

K(xᵢ, xᵢ) ← K(xᵢ, xᵢ) + λ

C and λ control the trade-off between training error and generalization ability.
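When working with a precomputed kernel matrix, the L2 soft margin is simply an addition to the leading diagonal; a minimal sketch (the kernel and the value of λ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)  # Gaussian kernel, sigma = 1

lam = 0.1                            # L2 soft-margin parameter (hypothetical value)
K_soft = K + lam * np.eye(len(K))    # K(x_i, x_i) <- K(x_i, x_i) + lambda
```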

L1 and L2 soft margins

Figure: Soft margin classification using L1 and L2 error norms. Left: test error as a percentage (y-axis) versus C (x-axis). Right: test error as a percentage (y-axis) versus λ (x-axis). The ionosphere dataset from the UCI Database Repository was used with Gaussian kernels (σ = 1.5).

Case study: predicting disease progression

We now illustrate the use of Support Vector Machines with an application to predicting disease progression.

In this study the objective is to predict relapse versus non-relapse for Wilms' tumour, a cancer which accounts for about 6% of all childhood malignancies.

The cDNA microarray had 30,720 probes (17,790 reliable readings).

The dataset was approximately balanced with 27 samples (13 samples were from patients who relapsed).

Performance

Figure: The number of LOO test errors (y-axis) versus the number of top-ranked features (x-axis) remaining, using a Fisher score filter (left) or a t-test filter (right), for predicting relapse or non-relapse for Wilms' tumour.

2.1. Different models you can construct with kernels

The kernel substitution concept can be applied to a broad range of algorithmic methods where data appears in the form of an inner product, and so we can use kernel substitution:

K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ)

Example: a linear programming approach

Example: rather than using quadratic programming it is also possible to derive a kernel classifier in which the learning task involves linear programming (LP) instead. For binary classification, the predicted class label is determined by the sign of:

f(z) = Σᵢ wᵢK(z, xᵢ) + b

In contrast to an SVM, where we used a 2-norm for the weights (the sum of the squares of the weights), here we will use a 1-norm (the sum of the absolute values):

||w||₁ = Σᵢ |wᵢ|

A linear programming approach 2

This norm is useful since it encourages sparse solutions in which many or most weights are zero. During training we minimise the objective function:

L = (1/m)||w||₁ + C Σᵢ ξᵢ

where the slack variable ξᵢ is defined by:

ξᵢ = max{1 − yᵢf(xᵢ), 0}

These slack variables are therefore positively-valued when yᵢf(xᵢ) < 1, which only occurs when sample i is misclassified or falls inside the margin. Furthermore, we can always write any variable wᵢ, potentially positive or negative, as the difference of two positively-valued variables wᵢ = αᵢ − α̂ᵢ where αᵢ, α̂ᵢ ≥ 0.

A linear programming approach 3

To make the value of wᵢ as small as possible we therefore minimise the sum (αᵢ + α̂ᵢ). We thus obtain a linear programming (LP) problem with objective function:

min_{α,α̂,ξ,b} [ (1/m) Σᵢ (αᵢ + α̂ᵢ) + C Σᵢ ξᵢ ]

subject to the constraints (from the definition of ξᵢ above):

yᵢ f(xᵢ) ≥ 1 − ξᵢ

with αᵢ, α̂ᵢ, ξᵢ ≥ 0 and f(xᵢ) = Σⱼ (αⱼ − α̂ⱼ)K(xᵢ, xⱼ) + b.
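A sketch of this linear programme using scipy.optimize.linprog; the Gaussian kernel and the splitting of the free bias as b = b₊ − b₋ are illustrative choices, not prescribed above:

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def train_lp_classifier(X, y, C=1.0, sigma=1.0):
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    # variables: [alpha (m), alpha_hat (m), xi (m), b_plus, b_minus], all >= 0 by default
    c = np.concatenate([np.full(m, 1.0 / m), np.full(m, 1.0 / m),
                        np.full(m, C), [0.0, 0.0]])
    # margin constraints  y_i f(x_i) >= 1 - xi_i  rewritten as A_ub @ v <= -1
    A = np.hstack([-y[:, None] * K, y[:, None] * K, -np.eye(m),
                   -y[:, None], y[:, None]])
    res = linprog(c, A_ub=A, b_ub=-np.ones(m), method="highs")
    v = res.x
    w = v[:m] - v[m:2 * m]             # alpha - alpha_hat
    b = v[3 * m] - v[3 * m + 1]
    return w, b

# predictions on a novel point z: sign of gaussian_kernel(z[None, :], X)[0] @ w + b
```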

Other models

There are many other kernelisable models, e.g.:

Bayes Point Machine (supervised learning)

Analytic Center Machine (supervised learning)

Kernel PCA (principal component analysis, unsupervised)


2.2. Introducing a confidence measure

It is useful to have a confidence measure for the class assignment in addition to determining the class label.

An SVM with a linear kernel does have an inbuilt measure of confidence that could be exploited to provide a confidence measure for the assigned class, i.e. the distance of a new point from the separating hyperplane.

Logically, a test point at a large distance from the separating hyperplane should be assigned a higher degree of confidence than a point which lies close to the hyperplane.

Introducing a confidence measure 1

Recall that the output of an SVM, before thresholding to ±1, is given by:

f(z) = Σᵢ yᵢαᵢK(xᵢ, z) + b

One approach is to fit the posterior probability p(y = +1|f) directly. A good choice for the mapping function is the sigmoid:

p(y = +1|f) = 1 / (1 + exp(Af + B))

with the parameters A and B found from a training set (fᵢ, yᵢ) as we discuss below.

Introducing a confidence measure 2

Let us define tᵢ as the target probabilities:

tᵢ = (yᵢ + 1)/2

so that, using yᵢ ∈ {−1, 1}, we have tᵢ ∈ {0, 1}. Furthermore, letting fᵢ denote f(xᵢ), we can find A and B by minimizing the following function over the entire training set:

min_{A,B} [ − Σᵢ ( tᵢ log(pᵢ) + (1 − tᵢ) log(1 − pᵢ) ) ]

where pᵢ is the sigmoid evaluated at fᵢ.
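A small sketch of fitting A and B by minimising this function with SciPy; the optimiser and starting point are hypothetical choices:

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt_scaling(f, y):
    """Fit p(y=+1|f) = 1/(1 + exp(A f + B)) by minimising the cross-entropy above."""
    t = (y + 1) / 2.0                        # targets in {0, 1}
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)     # guard against log(0)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))
    res = minimize(nll, x0=[-1.0, 0.0], method="Nelder-Mead")
    return res.x                             # fitted A, B
```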

Example: ovarian cancer

Figure: Probability of membership of one class (y-axis) versus margin (x-axis). The plot shows the training points and fitted sigmoid for an ovarian cancer data set. A hard margin was used, which explains the absence of points in the central band.

Gaussian Processes 1

The above method was proposed by John Platt.

To develop a fully probabilistic approach to kernel-based learning requires some understanding of Bayesian methods, which takes us beyond this introductory course.

Gaussian processes (GPs) are a kernel-based approach which will give a probability distribution for the prediction on novel test points.

Gaussian Processes 2

Ignoring the bias term b, the real-valued output of the GP is:

yᵢ = wᵀΦ(xᵢ)

where w is the weight vector. We take a probabilistic approach and place a prior probability distribution over the weights w: namely, that they are normally distributed and thus modelled by a Gaussian distribution

p(w) = N(w|0, α⁻¹I)

where α is the precision, or inverse variance, of the distribution.

Gaussian Processes 3

From y = Φw and p(w) it is apparent that y is expressed as a linear combination of Gaussian distributions and must therefore follow a Gaussian distribution itself. The mean and covariance matrix for the probability distribution of y are given by:

E[y] = Φ E[w] = 0

Covariance[y] = E[yyᵀ] = (1/α) ΦΦᵀ = K

where we define a kernel function:

Kᵢⱼ = (1/α) Φ(xᵢ)ᵀΦ(xⱼ)

With a novel input z this approach gives a mean value for the prediction and an associated spread.
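The role of K as the covariance of the prior can be illustrated by drawing sample functions y ~ N(0, K); a minimal sketch with a Gaussian kernel (the kernel width and jitter term are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)

sigma = 1.0
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))   # kernel as prior covariance

# three functions drawn from the prior y ~ N(0, K); the jitter keeps K numerically PSD
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```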

2.3. One class classification

For many real-world problems the task is not to classify but to detect novel or abnormal instances. Novelty detection has potential applications in many domains such as condition monitoring or medical diagnosis.

One-class classification: the task is to model the support of a data distribution, i.e. to create a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere.

One class classification: applications

Applications:

novelty detection,

multi-class classification: we create a function φ_{c=1} which is positive where class c = 1 data is located and negative elsewhere; we could then create a set of φ_c, one for each data class, and the relative ratios of the φ_c would decide ambiguous datapoints not falling readily into one class.

One class classification

One approach to one-class learning is to find a hypersphere with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere. The effect of outliers is reduced by using slack variables ξᵢ to allow for outliers outside the sphere, and the task is to minimize the volume of the sphere and the number of datapoints outside it, i.e.

min [ R² + (1/mν) Σᵢ ξᵢ ]

subject to the constraints (xᵢ − a)ᵀ(xᵢ − a) ≤ R² + ξᵢ and ξᵢ ≥ 0, and where ν controls the trade-off between the two terms.

One class classification 2

The primal objective function is then:

L(R, a, αᵢ, ξᵢ) = R² + (1/mν) Σᵢ ξᵢ − Σᵢ γᵢξᵢ − Σᵢ αᵢ ( R² + ξᵢ − (xᵢ · xᵢ − 2a · xᵢ + a · a) )

with αᵢ ≥ 0 and γᵢ ≥ 0.

One class classification 3

After kernel substitution the dual formulation gives rise to a quadratic programming problem, namely maximize:

W(α) = Σᵢ αᵢK(xᵢ, xᵢ) − Σᵢ,ⱼ αᵢαⱼK(xᵢ, xⱼ)

with respect to the αᵢ, subject to Σᵢ αᵢ = 1 and 0 ≤ αᵢ ≤ 1/mν.

One class classification 4

Having completed the training process, a test point z is declared novel if:

φ(z) = R² − K(z, z) + 2 Σᵢ αᵢK(z, xᵢ) − Σᵢ,ⱼ αᵢαⱼK(xᵢ, xⱼ) < 0
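For practical use, scikit-learn provides a closely related ν-parameterised one-class SVM (a hyperplane rather than hypersphere formulation, so this is an analogous tool rather than the exact algorithm above; the data and parameters are hypothetical):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))                 # "normal" data only

model = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)
# +1 = inside the support of the data, -1 = declared novel
print(model.predict(np.array([[0.1, -0.2], [8.0, 8.0]])))
```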

One class classification: example

Figure: The boundary around two clusters of data in input space corresponding to φ(x) = 0. Left: a hard margin solution using a Gaussian kernel. Right: a solution using the modified Gaussian kernel K(xᵢ, xⱼ) = e^(−|xᵢ − xⱼ|/2σ²).

2.4. Regression: learning with real-valued labels

So far we have only considered learning with discrete labels.

The following approach to learning with real-valued outputs is also theoretically motivated by statistical learning theory.

Regression 1

We now use constraints yᵢ − w · xᵢ − b ≤ ε and w · xᵢ + b − yᵢ ≤ ε to allow for some deviation ε between the eventual targets yᵢ and the function f(x) = w · x + b modelling the data.

We can visualise this as a band or tube of size ±ε around the hypothesis function f(x), and any points outside this tube can be viewed as training errors.

The structure of the tube is defined by an ε-insensitive loss function.

a linear ε-insensitive loss function

Figure: Left: a linear ε-insensitive loss function versus yᵢ − w · xᵢ − b. Right: a quadratic ε-insensitive loss function.

A linear ε-insensitive loss function 1

As before we minimize ||w||² to penalise overcomplexity.

To account for training errors we also introduce slack variables ξᵢ, ξ̂ᵢ for the two types of training error.

These slack variables are zero for points inside the tube and progressively increase for points outside the tube according to the loss function used.

This approach is called ε-SV regression.

A linear ε-insensitive loss function 2

For a linear ε-insensitive loss function the task is therefore to minimize:

min_{w,ξ,ξ̂} [ ||w||² + C Σᵢ (ξᵢ + ξ̂ᵢ) ]

subject to:

yᵢ − w · xᵢ − b ≤ ε + ξᵢ

(w · xᵢ + b) − yᵢ ≤ ε + ξ̂ᵢ

where the slack variables are both positive, ξᵢ, ξ̂ᵢ ≥ 0.

A linear ε-insensitive loss function 3

After kernel substitution the dual objective function is:

W(α, α̂) = Σᵢ yᵢ(αᵢ − α̂ᵢ) − ε Σᵢ (αᵢ + α̂ᵢ) − (1/2) Σᵢ,ⱼ (αᵢ − α̂ᵢ)(αⱼ − α̂ⱼ)K(xᵢ, xⱼ)

which is maximized subject to:

Σᵢ αᵢ = Σᵢ α̂ᵢ

and:

0 ≤ αᵢ ≤ C        0 ≤ α̂ᵢ ≤ C

The model

The function modelling the data is then:

f(z) = Σᵢ (αᵢ − α̂ᵢ)K(xᵢ, z) + b
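ε-SV regression of this form is available directly in scikit-learn's SVR; a minimal sketch (C, ε and the kernel width are hypothetical values):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, size=60))[:, None]
y = np.sin(x).ravel() + 0.1 * rng.normal(size=60)    # noisy real-valued targets

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(x, y)
y_pred = reg.predict(x)
```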

2.5. Structured output learning

So far we have considered SVM learning with simple outputs: either a discrete output label for classification or a continuously-valued output for regression.

However, for many real-world applications we would like to consider more complex output structures.

With structured output prediction we wish to capture the dependency structure across the output class labels so as to generalise across classes. Example: given an input sentence, we may want to output a syntax or parse tree which portrays the syntactic structure of the sentence.

Structural SVMs 1

Figure: A simple parse tree for the sentence 'Paul ate the pie': S splits into NP (N: Paul) and VP, with VP → V (ate) and NP → Det (the), N (pie).

Structural SVMs 2

To proceed further we need a function to encode the mapping from the input x to the given structured output y.

x is the input sentence 'Paul ate the pie'

y is the resulting parse tree.

Structural SVMs 3

The function Ψ(x, y) encodes this mapping for the given vocabulary space and formal grammar used:

Ψ(x, y) = (1, 0, 1, 1, . . . , 1, 1, 1, 1)ᵀ

where the components count the occurrences of the grammar rules:

S → NP, VP
S → NP
NP → Det, N
VP → V, NP
. . .
N → Paul
V → ate
Det → the
N → pie

Structural SVMs 4

Ψ(x, y) is a vector whose components are the counts of how often a grammar rule occurs in the parse tree y.

We will associate a weight wₗ to each node in the tree.

This in turn means we can derive a predictive function given some novel input z.

Structural SVMs 5

Specifically, we can then state a function F(z, y; w) as a weighted linear combination of the Ψ(z, y):

F(z, y; w) = wᵀΨ(z, y)

F(z, y; w) can be regarded as quantifying the compatibility of the pair (z, y). The predicted structured output therefore amounts to maximising this function over the space of parse trees Y, thus:

f_w(z) = argmax_{y∈Y} F(z, y; w)

Structural SVMs 6

We still need to derive a scheme to train such a structural SVM, i.e. we need to find appropriate values for the weights w.

The function Δ(ŷ, y) quantifies the difference between the predicted output ŷ and the correct output y, given input x.

Intuitively, the training process should involve the minimisation of Δ(ŷ, y) so that ŷ closely agrees with y.

Structural SVMs 7

However, in general Δ is not a convex function and it will have discontinuities.

An example would be the choice Δ(y, y) = 0 and Δ(y, y′) = 1 if y ≠ y′, which is discontinuous and does not give a unique solution.

To circumvent this problem we avoid minimising Δ(ŷ, y) directly but instead minimise a more tractable upper bound on Δ(ŷ, y).

Structural SVMs 8

Thus for a given input xᵢ with matching output yᵢ, and prediction ŷᵢ = f_w(xᵢ), we use:

Δ(ŷᵢ, yᵢ) ≤ max_{y∈Y} [ Δ(yᵢ, y) + wᵀΨ(xᵢ, y) ] − wᵀΨ(xᵢ, yᵢ)

This bound follows by noting that:

Δ(ŷᵢ, yᵢ) ≤ Δ(ŷᵢ, yᵢ) − [ wᵀΨ(xᵢ, yᵢ) − wᵀΨ(xᵢ, ŷᵢ) ] ≤ max_{y∈Y} [ Δ(yᵢ, y) + wᵀΨ(xᵢ, y) ] − wᵀΨ(xᵢ, yᵢ)

since ŷᵢ maximises wᵀΨ(xᵢ, y), so the term in square brackets is non-positive.

Structural SVMs 9

We further introduce a ||w||² regulariser to finally give the following optimisation problem for a structural SVM:

W = min_w { (1/2)||w||² + C Σᵢ [ max_{y∈Y} ( Δ(yᵢ, y) + wᵀΨ(xᵢ, y) ) − wᵀΨ(xᵢ, yᵢ) ] }

Structural SVMs 10

We could rewrite this problem as:

W = min_w { f(w) − g(w) }

f(w) = (1/2)||w||² + C Σᵢ max_{y∈Y} ( Δ(yᵢ, y) + wᵀΨ(xᵢ, y) )

g(w) = C Σᵢ wᵀΨ(xᵢ, yᵢ)

which amounts to a difference of two convex functions.

Structural SVMs 11

An alternative approach is to introduce a slack variable ξᵢ for each sample and reformulate it as a familiar constrained quadratic programming problem:

min_{w,ξ} [ (1/2)||w||² + C Σᵢ ξᵢ ]

subject to the set of constraints:

wᵀΨ(xᵢ, yᵢ) − wᵀΨ(xᵢ, y) + ξᵢ ≥ Δ(yᵢ, y)

for all y ∈ Y, with ξᵢ ≥ 0 ∀i and where i = 1, . . . , m.

3.1. Properties of kernels

For a set of real-valued variables a₁, . . . , aₘ a positive semi-definite kernel (PSD kernel) satisfies (equation [A]):

Σᵢ Σⱼ aᵢaⱼK(xᵢ, xⱼ) ≥ 0

This type of kernel is symmetric, K(xᵢ, xⱼ) = K(xⱼ, xᵢ), with positive components on the diagonal, K(x, x) ≥ 0.

Properties of kernels 2

An obvious example is the linear kernel:

K(xᵢ, xⱼ) = xᵢ · xⱼ

This kernel is plainly symmetric and it satisfies [A] since:

Σᵢ Σⱼ aᵢaⱼ(xᵢ · xⱼ) = || Σᵢ aᵢxᵢ ||² ≥ 0

Properties of kernels 3

With a mapping into feature space we can also satisfy this positive semi-definite requirement. Thus suppose we use Φ(x) to perform a mapping into a d-dimensional feature space, and let (a · b) denote an inner product between a and b in this space; then:

Σᵢ Σⱼ aᵢaⱼ (Φ(xᵢ) · Φ(xⱼ)) = || Σᵢ aᵢΦ(xᵢ) ||² ≥ 0

Properties of kernels 4

Of course this statement is only correct if an inner product can be defined in feature space. A Hilbert space is a vector space with a defined inner product (which is also complete with respect to the norm defined by this inner product). A PSD kernel is therefore permissible if and only if there exists a Hilbert space H and a mapping Φ : X → H such that for any xᵢ and xⱼ in X:

K(xᵢ, xⱼ) = (Φ(xᵢ) · Φ(xⱼ))_H

Properties of kernels 5

In fact this is always true. If the kernel is symmetric positive semi-definite then it can be diagonalised using an orthonormal basis of eigenvectors (u₁, . . . , uₘ) with non-negative eigenvalues λₘ ≥ . . . ≥ λ₁ ≥ 0, thus:

K(xᵢ, xⱼ) = Σₖ λₖuₖ(i)uₖ(j) = Φ(xᵢ) · Φ(xⱼ)

and the mapping function is Φ(xᵢ) = (√λ₁u₁(i), . . . , √λₘuₘ(i)).

In short, if we can establish that the proposed kernel matrix is symmetric positive semi-definite then it is a permissible kernel.
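This eigen-construction can be checked numerically: diagonalise a kernel matrix, form the feature map from the scaled eigenvectors and confirm that inner products reproduce K; a small sketch with an arbitrary Gaussian kernel matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)   # Gaussian kernel matrix

eigvals, U = np.linalg.eigh(K)              # symmetric eigendecomposition
eigvals = np.clip(eigvals, 0.0, None)       # clip tiny negative values from round-off
Phi = U * np.sqrt(eigvals)                  # row i is Phi(x_i) = (sqrt(lambda_k) u_k(i))_k

print(np.allclose(Phi @ Phi.T, K))          # feature-space inner products recover K
```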

3.2. Simple kernels

In many applications fairly simply defined kernels are used. Examples of kernels:

the homogeneous polynomial kernel:

K(xᵢ, xⱼ) = (xᵢ · xⱼ)^d

the inhomogeneous polynomial kernel:

K(xᵢ, xⱼ) = (xᵢ · xⱼ + c)^d

Gaussian kernels:

K(xᵢ, xⱼ) = exp( −||xᵢ − xⱼ||² / 2σ² )

The kernel parameter 1

For each of these kernels we notice that there is at least one kernel parameter (e.g. σ for the Gaussian kernel) whose value must be found.

The kernel parameter 2

There are several ways to find the value of this parameter:

If we have enough data we can split it into a training set, a validation set and a test set. We then pursue a cross-validation study (see the sketch below).

We can avoid the use of validation data by using certain theoretical criteria (discussed below).

We can use multiple kernel learning (discussed later) with a linear combination of kernels of the same functional form but having regularly spaced values of the kernel parameter.
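A minimal cross-validation sketch for choosing the Gaussian kernel parameter, assuming scikit-learn's grid search (the parameter grids and toy data are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

param_grid = {"gamma": [0.01, 0.1, 1.0, 10.0], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```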

Finding the kernel parameter

Finding the kernel parameter: the test error can be estimated as a function of the kernel parameter, without recourse to validation data.

Example: Thorsten Joachims' leave-one-out bound. This generalisation bound is calculated by extracting a single datapoint and estimating the prediction error based on the hypothesis derived from the remaining m − 1 datapoints.

Finding the kernel parameter 2

This calculation includes use of a soft margin such as the box constraint 0 ≤ αᵢ ≤ C. Specifically, the number of leave-one-out errors of an L1-norm soft margin SVM is upper bounded by:

|{ i : (2αᵢ*B² + ξᵢ*) ≥ 1 }| / m

that is, the count of the number of samples satisfying 2αᵢ*B² + ξᵢ* ≥ 1, divided by the total number of samples, m. B² is an upper bound on K(xᵢ, xᵢ) with K(xᵢ, xⱼ) ≥ 0 (we can determine ξᵢ* from yᵢ(Σⱼ αⱼ*K(xⱼ, xᵢ) + b*) ≥ 1 − ξᵢ*). For a Gaussian kernel B = 1, and with a hard margin this bound simply becomes the count of the number of instances i for which 2αᵢ* ≥ 1, divided by m.

String kernels

Strings are sequences of symbols drawn from an alphabet which we will denote Σ. Strings are relevant in many contexts. For example, we could be considering words from English text or strings of the four nucleotides A, C, G and T which make up DNA genetic sequences. String kernels encode the similarity of strings and there are a number of schemes based on different ways of quantifying the matches between two strings.

Example: STRAY, RAY and RAYS.

p-spectrum kernel

One way of comparing two strings is to count the number of contiguous substrings of length p which are in common between them. A p-spectrum kernel is based on the frequency spectrum of order p. This is the spectrum of frequencies of all contiguous substrings of length p. For example, suppose we wish to compute the 2-spectrum of the string s = SAY. There are two contiguous substrings of length p = 2, namely u₁ = SA and u₂ = AY, both with a frequency of 1.

p-spectrum kernel 2

As an explicit example let us consider the strings s1 = SAY, s2 = BAY, s3 = SAD and s4 = BAD. The 2-spectra are given in the table below:

Φ      SA   BA   AY   AD
SAY     1    0    1    0
BAY     0    1    1    0
SAD     1    0    0    1
BAD     0    1    0    1

where each entry is the number of occurrences of the substring u (say u1 = SA) in the given string (say s1 = SAY).

p-spectrum kernel 3

The kernel is then:

K      SAY   BAY   SAD   BAD
SAY      2     1     1     0
BAY      1     2     0     1
SAD      1     0     2     1
BAD      0     1     1     2

p-spectrum kernel 4

Thus to compute the (SAY, SAD) entry, for example, we read across the SAY row in the first table, locating all the corresponding p = 2 substrings of SAD, i.e. SA and AD. We add up the entries in this row corresponding to these substrings to give the corresponding entry in the second table.
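The tables above can be reproduced with a few lines of Python; a minimal sketch of the p-spectrum kernel:

```python
from collections import Counter

def p_spectrum(s, p=2):
    """Frequency spectrum of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def p_spectrum_kernel(s, t, p=2):
    """Inner product of the two p-spectra."""
    fs, ft = p_spectrum(s, p), p_spectrum(t, p)
    return sum(count * ft[sub] for sub, count in fs.items())

print(p_spectrum_kernel("SAY", "SAD"))   # -> 1, matching the (SAY, SAD) entry above
print(p_spectrum_kernel("SAY", "SAY"))   # -> 2
```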

Graph kernels

Graphs are a natural way to represent many types of data.

The vertices, or nodes, represent data objects and the edges then represent the relationships between these data objects.

Graph kernels

With graphs we can consider two types of similarity:

kernels on graphs: for a given graph we may be interested in the similarity of two vertices within the same graph. This graph could encode information about the similarities of datapoints represented as vertices.

kernels between graphs: we may be interested in constructing a measure of similarity between two different graphs.

Kernels on graphs

Example: kernels on graphs. Let us consider a graph G = (V, E) where V is the set of vertices and E are the edges. We will formulate the corresponding kernel as an exponential kernel. Specifically, the exponential of an (n × n) matrix βH can be written as a series expansion:

e^{βH} = I + βH + (β²/2!)H² + (β³/3!)H³ + . . .

where β is a real-valued scalar parameter (we comment on its interpretation later). Any even power H²ⁿ of a symmetric matrix is positive semi-definite; thus if H is symmetric the exponential of this matrix is positive semi-definite and is a suitable kernel.

Kernels on graphs 2

For a graph with vertex set V and edge set E we can use the notation v₁ ∼ v₂ to denote a link or edge between vertices v₁ and v₂. Furthermore let dᵢ denote the number of edges leading into vertex i. We then construct a kernel using:

Hᵢⱼ = 1 for i ∼ j,    Hᵢⱼ = −dᵢ for i = j,    Hᵢⱼ = 0 otherwise
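A small sketch of building H from an adjacency matrix and forming the exponential kernel with SciPy; the example graph and the value of β are hypothetical:

```python
import numpy as np
from scipy.linalg import expm

# adjacency matrix of a small undirected example graph
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

degrees = A.sum(axis=1)
H = A - np.diag(degrees)   # H_ij = 1 for i~j, -d_i on the diagonal, 0 otherwise
beta = 0.5                 # diffusion parameter (hypothetical choice)
K = expm(beta * H)         # exponential kernel on the graph vertices
```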

Multiple kernel learning

The learning-the-kernel problem ranges from finding the width parameter, σ, in a Gaussian kernel to obtaining an optimal linear combination of a finite set of candidate kernels, with these kernels representing different types of input data. The latter is referred to as multiple kernel learning (MKL).

Multiple kernel learning 2

Let K be a set of candidate kernels.

The typical choice for K is a linear combination of p prescribed kernels {K_ℓ : ℓ = 1, . . . , p}, i.e.

K = Σ_ℓ λ_ℓ K_ℓ

where Σ_ℓ λ_ℓ = 1, λ_ℓ ≥ 0, and the λ_ℓ are called kernel coefficients.
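A fixed linear combination of precomputed kernel matrices can be formed directly; the coefficients in the comment below are placeholders, since learning the λ_ℓ is precisely what MKL algorithms do:

```python
import numpy as np

def combine_kernels(kernel_list, lam):
    """K = sum_l lambda_l K_l with lambda_l >= 0 and sum_l lambda_l = 1."""
    lam = np.asarray(lam, dtype=float)
    assert np.all(lam >= 0) and np.isclose(lam.sum(), 1.0)
    return sum(l * K for l, K in zip(lam, kernel_list))

# e.g. two candidate kernels over the same datapoints, equally weighted (hypothetical):
# K = combine_kernels([K_sequence, K_structure], [0.5, 0.5])
```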

Multiple kernel learning 3

This framework is motivated by classification or regression problems in which we want to use all available input data, and this input data may derive from many disparate sources.

These data objects could include network graphs and sequence strings in addition to numerical data: as we have seen, all these types of data can be encoded into kernels.

The problem of data integration is therefore transformed into the problem of learning the most appropriate combination of candidate kernel matrices (typically a linear combination is used, though nonlinear combinations are feasible).

MKL: an example

Understanding the three-dimensional structure of a protein can give insight into its function.

This motivates the problem of using machine learning methods to predict the structure of a protein from sequence and other data.

In this case study we will only consider a sub-problem of structure prediction, in which the predicted label is over a set of fold classes.

MKL: an example 2

In this study there were 27 fold classes, with 313 proteins used for training and 385 for testing.

There are a number of observational features relevant to predicting fold class, and in this study we used 12 different informative data-types, or feature spaces.

These included the RNA sequence and various physical measurements such as hydrophobicity, polarity and van der Waals volume. Viewed as a machine learning task, there are multiple data-types, thus we consider multiple kernel learning.

Performance: MKLdiv

Figure: Performance of an MKL algorithm (MKLdiv) on a protein fold prediction dataset. There are 27 classes and 12 types of input data. Left figure: test set accuracy (TSA, as a %) based on individual datatypes (vertical bars) and using MKL (horizontal bar). Right figure: the kernel coefficients λ_ℓ which quantify the relative significance of individual types of data.

Performance: SimpleMKL

Figure: Performance of an MKL algorithm (SimpleMKL) on a protein fold prediction dataset. Left figure: test set accuracy (TSA, as a %). Right figure: the kernel coefficients λ_ℓ. This algorithm is less accurate than MKLdiv but it entirely eliminates certain datatypes (right figure). Thus lower accuracy is achieved based on the use of fewer types of data.

Conclusion: Support Vector Machines

SVMs are properly motivated by statistical learning theory,

optimisation of a convex objective function: one solution,

kernel substitution makes the method very effective: different kernels can handle different types of input data,

kernel substitution can be used with other methods: kernel-based learning.

Conclusion

Many different types of task can be achieved: classification, regression, novelty detection, structured output learning, classification with multiple different types of input data,

can be extended to a probabilistic/Bayesian framework,

work well in practice.