A review of optimization methodologies in support vector machines

John Shawe-Taylor (a), Shiliang Sun (a,b,*)

(a) Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom
(b) Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China

Neurocomputing 74 (2011) 3609–3618. doi:10.1016/j.neucom.2011.06.026. © 2011 Elsevier B.V. All rights reserved.

Article history: Received 13 March 2011; received in revised form 24 June 2011; accepted 26 June 2011; available online 12 August 2011. Communicated by C. Fyfe.

Keywords: Decision support systems; Duality; Optimization methodology; Pattern classification; Support vector machine (SVM)

* Corresponding author at: Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China. Tel.: +86 21 54345186; fax: +86 21 54345119. E-mail addresses: [email protected], [email protected] (S. Sun).

Abstract

Support vector machines (SVMs) are theoretically well-justified machine learning techniques which have also been successfully applied to many real-world domains. Optimization methodologies play a central role in finding solutions of SVMs. This paper reviews representative and state-of-the-art techniques for optimizing the training of SVMs, especially SVMs for classification. The objective is to provide readers with an overview of the basic elements of, and recent advances in, training SVMs, and to enable them to develop and implement new optimization strategies in their own SVM-related research.

1. Introduction

Support vector machines (SVMs) are state-of-the-art machine learning techniques rooted in structural risk minimization [51,60]. Roughly speaking, by the theory of structural risk minimization, a function from a function class tends to have a low expected risk on data drawn from an underlying distribution if it has a low empirical risk on a data set sampled from this distribution and, simultaneously, the function class has low complexity, as measured, for example, by the VC-dimension [61]. The well-known large margin principle in SVMs [8] essentially restricts the complexity of the function class, making use of the fact that for some function classes a larger margin corresponds to a lower fat-shattering dimension [51].

Since their invention [8], research on SVMs has exploded both in theory and in applications. Recent theoretical advances in characterizing their generalization errors include Rademacher complexity theory [3,52] and PAC-Bayesian theory [33,39]. In practice, SVMs have been successfully applied to many real-world domains [11]. Moreover, SVMs have been combined with other research directions, for example multitask learning, active learning, multi-view learning and semi-supervised learning, and have brought forward fruitful approaches [4,15,16,55,56].

Optimization methodologies play an important role in training SVMs. Depending on the requirements on training speed, memory constraints and the accuracy of the optimization variables, practitioners should choose different optimization methods. It is thus instructive to review the optimization techniques used to train SVMs. Given that SVMs have been blended into the development of other learning mechanisms, such a review should also help readers formulate and solve their own optimization objectives arising in SVM-related research.

Since developments in optimization techniques for SVMs are still active, and SVMs have multiple variants for different purposes such as classification, regression and density estimation, it is impractical and unnecessary to include every optimization technique used so far. Therefore, this paper mainly focuses on reviewing representative optimization methodologies used for SVM classification. Familiarity with these techniques is, however, helpful for understanding other related optimization methods and for implementing optimization for other SVM variants.

The rest of this paper is organized as follows. Section 2 introduces the theory of convex optimization, which is closely related to the optimization of SVMs. Section 3 gives the formulation of SVMs for pattern classification, where the kernel trick is also involved; there we also derive the dual optimization problems and provide optimality conditions for SVMs, which constitute a foundation for the various optimization methodologies detailed in Section 4. We then briefly describe SVMs for density estimation and regression in Section 5, and introduce optimization techniques used for learning kernels in Section 6. Finally, concluding remarks are given in Section 7.



2. Convex optimization theory

An optimization problem has the form

$$\min\ f(x) \quad \text{s.t.}\quad f_i(x) \le 0,\ i = 1,\dots,n. \tag{1}$$

Here, the vector $x \in \mathbb{R}^m$ is the optimization variable, the function $f: \mathbb{R}^m \to \mathbb{R}$ is the objective function, and the functions $f_i: \mathbb{R}^m \to \mathbb{R}$ ($i = 1,\dots,n$) are the inequality constraint functions. The domain of this problem is $D = \operatorname{dom} f \cap \bigcap_{i=1}^n \operatorname{dom} f_i$. A vector from the domain is said to be an element of the feasible set $D_f$ if and only if the constraints hold at this point. A vector $x^\ast$ is called optimal, or a solution of the optimization problem, if its objective value is the smallest among all vectors satisfying the constraints [10]. For now we do not assume that the above optimization problem is convex.

The Lagrangian associated with problem (1) is defined as

$$L(x,\alpha) = f(x) + \sum_{i=1}^n \alpha_i f_i(x), \quad \alpha_i \ge 0, \tag{2}$$

where $\alpha = [\alpha_1,\dots,\alpha_n]^\top$. The scalar $\alpha_i$ is referred to as the Lagrange multiplier associated with the $i$th constraint. The vector $\alpha$ is called the dual variable or Lagrange multiplier vector. The Lagrange dual function is defined as the infimum of the Lagrangian:

$$g(\alpha) = \inf_{x \in D_f} L(x,\alpha). \tag{3}$$

Denote the optimal value of problem (1) by $f^\ast$. It can easily be shown that $g(\alpha) \le f^\ast$ [10].

With the aim of approaching $f^\ast$, it is natural to define the following Lagrange dual optimization problem:

$$\max\ g(\alpha) \quad \text{s.t.}\quad \alpha \succeq 0, \tag{4}$$

where $\alpha \succeq 0$ means $\alpha_i \ge 0$ ($i = 1,\dots,n$).
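As a concrete illustration of (2)–(4), consider the toy convex problem $\min x^2$ subject to $1 - x \le 0$, for which the inner minimization in (3) can be done in closed form, giving $g(\alpha) = \alpha - \alpha^2/4$. The sketch below (a hypothetical example, not from the paper) verifies weak duality $g(\alpha) \le f^\ast = 1$ on a grid and locates the dual maximizer $\alpha^\ast = 2$, where strong duality holds.

```python
import numpy as np

# Toy problem: min f(x) = x**2  s.t.  f1(x) = 1 - x <= 0.
# The optimum is x* = 1 with f* = 1.
# Lagrangian: L(x, a) = x**2 + a * (1 - x), minimized over x at x = a / 2,
# so the dual function (3) is g(a) = a - a**2 / 4.

def dual(a):
    x = a / 2.0                 # unconstrained minimizer of L(., a)
    return x**2 + a * (1.0 - x)

f_star = 1.0
alphas = np.linspace(0.0, 10.0, 1001)
g_vals = dual(alphas)

# Weak duality: g(a) <= f* everywhere on the dual-feasible set.
assert np.all(g_vals <= f_star + 1e-12)

# Strong duality holds for this problem: the dual max equals f* at a* = 2.
a_star = alphas[np.argmax(g_vals)]
print(a_star, g_vals.max())     # close to 2.0 and 1.0
```

For nonconvex problems only the first assertion (weak duality) is guaranteed; the second reflects the zero duality gap discussed in Section 2.1.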

2.1. Optimality conditions for generic optimization problems

The KKT (Karush–Kuhn–Tucker) conditions are among the most important sufficient criteria for solving generic optimization problems; they are given in the following theorem [26,31,36,49].

Theorem 1 (KKT saddle point conditions). Consider a generic optimization problem given by (1) where $f$ and $f_i$ are arbitrary functions. If a pair of variables $(x^\ast, \alpha^\ast)$ exists with $x^\ast \in D$ and $\alpha^\ast \succeq 0$, such that for all $x \in D$ and $\alpha \in [0,\infty)^n$,

$$L(x^\ast,\alpha) \le L(x^\ast,\alpha^\ast) \le L(x,\alpha^\ast), \tag{5}$$

then $x^\ast$ is a solution to the optimization problem.

One of the important insights derived from (5) is

$$\alpha_i^\ast f_i(x^\ast) = 0, \quad i = 1,\dots,n, \tag{6}$$

which is usually called the complementary slackness condition. Furthermore, under the assumption that (5) holds, it is straightforward to prove that $L(x^\ast,\alpha^\ast) = f^\ast = g^\ast$, with $g^\ast$ being the optimal value of the dual problem and $\alpha^\ast$ the associated optimal solution. The important step is to show that

$$\max_{\alpha \succeq 0} g(\alpha) = \max_{\alpha \succeq 0}\ \inf_{x \in D_f} L(x,\alpha) \ge \inf_{x \in D_f} L(x,\alpha^\ast) = L(x^\ast,\alpha^\ast) = f^\ast. \tag{7}$$

Combining this with $g(\alpha) \le f^\ast$ yields the assertion. Now there is no gap between the optimal values of the primal and dual problems: the KKT-gap $\sum_{i=1}^n \alpha_i f_i(x)$ vanishes when an optimal pair of solutions satisfying (5) is found.

If equality constraints are included in an optimization problem, such as $h(x) = 0$, we can split each of them into two inequality constraints $h(x) \le 0$ and $-h(x) \le 0$ [49]. Suppose the optimization problem is

$$\min\ f(x) \quad \text{s.t.}\quad f_i(x) \le 0,\ i = 1,\dots,n, \quad h_j(x) = 0,\ j = 1,\dots,e. \tag{8}$$

The Lagrangian associated with it is defined as

$$L(x,\alpha,\beta) = f(x) + \sum_{i=1}^n \alpha_i f_i(x) + \sum_{j=1}^e \beta_j h_j(x), \quad \alpha_i \ge 0,\ \beta_j \in \mathbb{R}. \tag{9}$$

The corresponding KKT saddle point conditions for problem (8) are given by the following theorem [49].

Theorem 2 (KKT saddle point conditions with equality constraints). Consider a generic optimization problem given by (8) where $f$, $f_i$ and $h_j$ are arbitrary. If a triplet of variables $(x^\ast, \alpha^\ast, \beta^\ast)$ exists with $x^\ast \in D$, $\alpha^\ast \succeq 0$ and $\beta^\ast \in \mathbb{R}^e$, such that for all $x \in D$, $\alpha \in [0,\infty)^n$ and $\beta \in \mathbb{R}^e$,

$$L(x^\ast,\alpha,\beta) \le L(x^\ast,\alpha^\ast,\beta^\ast) \le L(x,\alpha^\ast,\beta^\ast), \tag{10}$$

then $x^\ast$ is a solution to the optimization problem, where $D$ is temporarily reused to represent the domain of problem (8).

As equality constraints can be equivalently addressed by splitting them into two inequality constraints, we will mainly use the formulation given in (1) for optimization treatment in the sequel.

2.2. Optimality conditions for convex optimization problems

The KKT saddle point conditions are sufficient for any optimization problem. However, it is more desirable to disclose necessary conditions when implementing optimization methodologies. Although it is difficult to provide necessary conditions for generic optimization problems, this is quite possible for some convex optimization problems having nice properties.

Definition 3 (Convex function). A function $f: \mathbb{R}^m \to \mathbb{R}$ is convex if its domain $\operatorname{dom} f$ is a convex set and if for all $0 \le \theta \le 1$ and all $x, y \in \operatorname{dom} f$ the following holds:

$$f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y). \tag{11}$$

If the functions $f$ and $f_i$ are all convex, then problem (1) is a convex optimization problem. For a convex optimization problem of the form (1), both the domain $D$ and the feasible region

$$D_f = \{x \in D : f_i(x) \le 0\ (i = 1,\dots,n)\} \tag{12}$$

can be proved to be convex sets.

The following theorem [36,49] states the properties required of the constraints in order to obtain the necessary optimality conditions for convex optimization problems.

Theorem 4 (Constraint qualifications). Suppose $D \subseteq \mathbb{R}^m$ is a convex domain, and the feasible region $D_f$ is defined by (12) where the functions $f_i: D \to \mathbb{R}$ are convex ($i = 1,\dots,n$). Then the following three qualifications on the constraint functions $f_i$ are connected by (i) ⇔ (ii) and (iii) ⇒ (i):

(i) There exists an $x \in D$ such that $f_i(x) < 0$ for all $i = 1,\dots,n$ (Slater's condition [54]).

(ii) For all nonzero $\alpha \in [0,\infty)^n$ there exists an $x \in D$ such that $\sum_{i=1}^n \alpha_i f_i(x) \le 0$ (Karlin's condition [25]).

(iii) The feasible region $D_f$ contains at least two distinct points, and there exists an $x \in D_f$ such that all $f_i$ are strictly convex at $x$ with respect to $D_f$ (strict constraint qualification).

Now we reach the theorem [31,25,49] stating the necessity of the KKT conditions.

Theorem 5 (Necessity of KKT conditions). For a convex optimization problem defined by (1), if all the inequality constraints $f_i$ satisfy one of the constraint qualifications of Theorem 4, then the KKT saddle point conditions (5) given in Theorem 1 are necessary for optimality.

A convex optimization problem is defined as the minimization of a convex function over a convex set. It can be represented as

$$\min\ f(x) \quad \text{s.t.}\quad f_i(x) \le 0,\ i = 1,\dots,n, \quad a_j^\top x = b_j,\ j = 1,\dots,e, \tag{13}$$

where the functions $f$ and $f_i$ are convex, and the equality constraints $h_j(x) = a_j^\top x - b_j$ are affine [10]. In this case, we can eliminate the equality constraints by updating the domain $D \cap \bigcap_{j=1}^e \{x : a_j^\top x = b_j\} \to D$ and then apply Theorem 5 to a convex problem with no equality constraints.

Slater's condition can be further refined when some of the inequality constraints in the convex problem defined by (1) are affine: the strict feasibility in Slater's condition for these constraints can be relaxed to feasibility [10].

2.3. Optimality conditions for differentiable convex problems

For SVMs, we often encounter convex problems whose objective and constraint functions are furthermore differentiable. Suppose a convex problem (1) satisfies the conditions of Theorem 5. Then we can find its solutions by making use of the KKT conditions (5). Under the assumption that the objective and constraint functions are differentiable, the KKT conditions can be simplified as in the following theorem [31,49]. Many optimization methodologies frequently use these conditions.

Theorem 6 (KKT for differentiable convex problems). Consider a convex problem (1) whose objective and constraint functions are differentiable at solutions. The vector $x^\ast$ is a solution if there exists some $\alpha^\ast \in \mathbb{R}^n$ such that the following conditions hold:

$$\alpha^\ast \succeq 0 \quad (\text{required by the Lagrangian}), \tag{14}$$

$$\partial_x L(x^\ast,\alpha^\ast) = \partial_x f(x^\ast) + \sum_{i=1}^n \alpha_i^\ast\, \partial_x f_i(x^\ast) = 0 \quad (\text{saddle point at } x^\ast), \tag{15}$$

$$\partial_{\alpha_i} L(x^\ast,\alpha^\ast) = f_i(x^\ast) \le 0,\ i = 1,\dots,n \quad (\text{saddle point at } \alpha^\ast), \tag{16}$$

$$\alpha_i^\ast f_i(x^\ast) = 0,\ i = 1,\dots,n \quad (\text{zero KKT-gap}). \tag{17}$$

Note that the sign '$\le$' in (16) is correct as a result of the domain of $\alpha_i$ ($i = 1,\dots,n$) in the Lagrangian being $[0,\infty)$.
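Conditions (14)–(17) can be checked mechanically on a small instance. In the hypothetical example below, $\min x_1^2 + x_2^2$ subject to $1 - x_1 - x_2 \le 0$ has, by symmetry, the solution $x^\ast = (1/2, 1/2)$ with multiplier $\alpha^\ast = 1$, and each of the four conditions is verified numerically.

```python
import numpy as np

# Problem: min f(x) = x1^2 + x2^2  s.t.  f1(x) = 1 - x1 - x2 <= 0.
# By symmetry the solution is x* = (0.5, 0.5) with alpha* = 1.
x_star = np.array([0.5, 0.5])
a_star = 1.0

grad_f = 2.0 * x_star             # gradient of the objective at x*
grad_f1 = np.array([-1.0, -1.0])  # gradient of the constraint f1
f1 = 1.0 - x_star.sum()           # constraint value at x*

# (14) dual feasibility
assert a_star >= 0
# (15) stationarity: grad f + alpha* grad f1 = 0
assert np.allclose(grad_f + a_star * grad_f1, 0.0)
# (16) primal feasibility: f1(x*) <= 0
assert f1 <= 1e-12
# (17) complementary slackness: alpha* f1(x*) = 0
assert abs(a_star * f1) < 1e-12
print("KKT conditions (14)-(17) hold")
```

Since the constraint is active at the optimum, (16) and (17) hold with $f_1(x^\ast) = 0$; an inactive constraint would instead force $\alpha^\ast = 0$ through (17).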

3. SVMs and kernels

In this section, we describe the optimization formulations, dual optimization problems, optimality conditions and related concepts for SVMs for binary classification, which will prove very useful for the various SVM optimization methodologies introduced in later sections. After addressing linear classifiers for linearly separable and nonseparable problems, respectively, we describe nonlinear classifiers with kernels.

Henceforth, we use boldface $x$ to denote an example vector, and $n$ will denote the number of training examples, for consistency with most of the literature on SVMs. Given a training set $S = \{x_i, y_i\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{1,-1\}$, the aim of SVMs is to induce a classifier which has good classification performance on future unseen examples.

3.1. Linear classifier for linearly separable problems

When the data set $S$ is linearly separable, SVMs solve the problem

$$\min_{w,b}\ \Phi(w) = \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ i = 1,\dots,n, \tag{18}$$

where the constraints represent the linear separability property. The large margin principle is reflected by minimizing $\tfrac{1}{2}\|w\|^2$, with $2/\|w\|$ being the margin between the two hyperplanes $w^\top x + b = 1$ and $w^\top x + b = -1$. The SVM classifier is

$$c_{w,b}(x) = \operatorname{sign}(w^\top x + b). \tag{19}$$

Problem (18) is a differentiable convex problem with affine constraints, and therefore the constraint qualification is satisfied by the refined Slater's condition. Theorem 6 is thus applicable to this optimization problem.

The Lagrangian is constructed as

$$L(w,b,\lambda) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^n \lambda_i [y_i(w^\top x_i + b) - 1], \quad \lambda_i \ge 0, \tag{20}$$

where $\lambda = [\lambda_1,\dots,\lambda_n]^\top$ collects the associated Lagrange multipliers. Using the superscript star to denote the solutions of the optimization problem, we have

$$\partial_w L(w^\ast,b^\ast,\lambda^\ast) = w^\ast - \sum_{i=1}^n \lambda_i^\ast y_i x_i = 0, \tag{21}$$

$$\partial_b L(w^\ast,b^\ast,\lambda^\ast) = -\sum_{i=1}^n \lambda_i^\ast y_i = 0. \tag{22}$$

From (21), the solution $w^\ast$ has the form

$$w^\ast = \sum_{i=1}^n \lambda_i^\ast y_i x_i. \tag{23}$$

The training examples for which $\lambda_i^\ast > 0$ are called support vectors, because the other examples, with $\lambda_i^\ast = 0$, can be omitted from the expression.

Substituting (21) and (22) into the Lagrangian results in

$$g(\lambda) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^n \lambda_i y_i w^\top x_i + \sum_{i=1}^n \lambda_i
= \sum_{i=1}^n \lambda_i + \tfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j x_i^\top x_j - \sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j x_i^\top x_j
= \sum_{i=1}^n \lambda_i - \tfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j x_i^\top x_j. \tag{24}$$

Thus the dual optimization problem is

$$\max_{\lambda}\ g(\lambda) = \lambda^\top \mathbf{1} - \tfrac{1}{2}\lambda^\top D \lambda \quad \text{s.t.}\quad \lambda^\top y = 0,\ \lambda \succeq 0, \tag{25}$$

where $y = [y_1,\dots,y_n]^\top$ and $D$ is a symmetric $n \times n$ matrix with entries $D_{ij} = y_i y_j x_i^\top x_j$ [43,52].

The zero KKT-gap requirement (also called the complementary slackness condition) implies that

$$\lambda_i^\ast [y_i(x_i^\top w^\ast + b^\ast) - 1] = 0, \quad i = 1,\dots,n. \tag{26}$$

Therefore, for all support vectors the constraints are active with equality. The bias $b^\ast$ can thus be calculated as

$$b^\ast = y_i - x_i^\top w^\ast \tag{27}$$

using any support vector $x_i$. With $\lambda^\ast$ and $b^\ast$ calculated, the SVM decision function can be represented as

$$c^\ast(x) = \operatorname{sign}\left(\sum_{i=1}^n y_i \lambda_i^\ast\, x^\top x_i + b^\ast\right). \tag{28}$$
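The whole pipeline (25) → (23) → (27) → (28) can be exercised end to end on a tiny data set. The sketch below (a made-up four-point example; the general-purpose SLSQP solver stands in for the specialized methods of Section 4) maximizes the dual, recovers $w^\ast$ and $b^\ast$, and confirms that every margin constraint of (18) holds.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
D = (y[:, None] * y[None, :]) * (X @ X.T)   # D_ij = y_i y_j x_i.x_j
n = len(y)

# Maximize g(lam) of (25) by minimizing its negative,
# subject to lam.y = 0 and lam >= 0.
obj = lambda lam: 0.5 * lam @ D @ lam - lam.sum()
cons = ({'type': 'eq', 'fun': lambda lam: lam @ y},)
res = minimize(obj, np.zeros(n), bounds=[(0, None)] * n,
               constraints=cons, method='SLSQP')
lam = res.x

# Recover the primal solution via (23) and (27).
w = (lam * y) @ X
sv = np.argmax(lam)              # index of a support vector (lam_i > 0)
b = y[sv] - X[sv] @ w

# All margin constraints of (18) hold (up to solver tolerance).
print(np.round(y * (X @ w + b), 3))
```

Only two of the four multipliers end up (essentially) nonzero here: the two points closest to the separating hyperplane, i.e. the support vectors.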

3.2. Linear classifier for linearly nonseparable problems

In the case that the data set is not linearly separable but we would still like to learn a linear classifier, a loss on the violation of the linear separability constraints has to be introduced. A common choice is the hinge loss

$$\max(0,\ 1 - y_i(w^\top x_i + b)), \tag{29}$$

which can be represented using a slack variable $\xi_i$. The optimization problem is formulated as

$$\min_{w,b,\xi}\ \Phi(w,\xi) = \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ i = 1,\dots,n, \tag{30}$$

where the scalar $C$ controls the balance between the margin and the empirical loss. This problem is also a differentiable convex problem with affine constraints, and the constraint qualification is satisfied by the refined Slater's condition.
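For a fixed $(w, b)$, the smallest slack $\xi_i$ satisfying the constraints of (30) is exactly the hinge loss (29); the constraint is tight precisely when the margin requirement is violated. A short numerical check on arbitrary made-up values:

```python
import numpy as np

# For a fixed classifier (w, b), the minimal feasible slack in (30) is
# xi_i = max(0, 1 - y_i * (w.x_i + b)), i.e. the hinge loss (29).
w, b = np.array([1.0, -0.5]), 0.2
X = np.array([[1.5, 0.0], [0.1, 0.3], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0])

margins = y * (X @ w + b)
hinge = np.maximum(0.0, 1.0 - margins)
print(hinge)   # zero exactly where the functional margin is at least 1
```

Here only the second example has margin below 1, so only its slack is positive; summing the slacks and weighting by $C$ recovers the empirical-loss term of (30).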

The Lagrangian of problem (30) is

$$L(w,b,\xi,\lambda,\gamma) = \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \lambda_i[y_i(w^\top x_i + b) - 1 + \xi_i] - \sum_{i=1}^n \gamma_i \xi_i, \quad \lambda_i \ge 0,\ \gamma_i \ge 0, \tag{31}$$

where $\lambda = [\lambda_1,\dots,\lambda_n]^\top$ and $\gamma = [\gamma_1,\dots,\gamma_n]^\top$ are the associated Lagrange multipliers. Making use of Theorem 6, we obtain

$$\partial_w L(w^\ast,b^\ast,\xi^\ast,\lambda^\ast,\gamma^\ast) = w^\ast - \sum_{i=1}^n \lambda_i^\ast y_i x_i = 0, \tag{32}$$

$$\partial_b L(w^\ast,b^\ast,\xi^\ast,\lambda^\ast,\gamma^\ast) = -\sum_{i=1}^n \lambda_i^\ast y_i = 0, \tag{33}$$

$$\partial_{\xi_i} L(w^\ast,b^\ast,\xi^\ast,\lambda^\ast,\gamma^\ast) = C - \lambda_i^\ast - \gamma_i^\ast = 0, \quad i = 1,\dots,n, \tag{34}$$

where (32) and (33) are respectively identical to (21) and (22) for the separable case.

The dual optimization problem is derived as

$$\max_{\lambda}\ g(\lambda) = \lambda^\top \mathbf{1} - \tfrac{1}{2}\lambda^\top D \lambda \quad \text{s.t.}\quad \lambda^\top y = 0,\ \lambda \succeq 0,\ \lambda \preceq C\mathbf{1}, \tag{35}$$

where $y$ and $D$ have the same meanings as in problem (25). The complementary slackness conditions require

$$\lambda_i^\ast [y_i(x_i^\top w^\ast + b^\ast) - 1 + \xi_i^\ast] = 0, \quad \gamma_i^\ast \xi_i^\ast = 0, \quad i = 1,\dots,n. \tag{36}$$

Combining (34) and (36), we can solve $b^\ast = y_i - x_i^\top w^\ast$ for any support vector $x_i$ with $0 < \lambda_i^\ast < C$. The existence of such a $\lambda_i^\ast$ is equivalent to the assumption that there is at least one support vector with functional output $y_i(x_i^\top w^\ast + b^\ast) = 1$ [43]. Usually this is true. If this assumption is violated, however, other user-defined techniques have to be used to determine $b^\ast$. Once $\lambda^\ast$ and $b^\ast$ are solved, the SVM decision function is given by

$$c^\ast(x) = \operatorname{sign}\left(\sum_{i=1}^n y_i \lambda_i^\ast\, x^\top x_i + b^\ast\right). \tag{37}$$

3.3. Nonlinear classifier with kernels

The use of kernel functions (kernels for short) provides a powerful and principled approach to detecting nonlinear relations with a linear method in an appropriate feature space [52]. The design of kernels and linear methods can be decoupled, which largely facilitates the modularity of machine learning methods, including SVMs.

Definition 7 (Kernel function). A kernel is a function $k$ that for all $x, z$ from a space $X$ (which need not be a vector space) satisfies

$$k(x,z) = \langle \phi(x), \phi(z) \rangle, \tag{38}$$

where $\phi$ is a mapping from the space $X$ to a Hilbert space $F$ that is usually called the feature space:

$$\phi: x \in X \mapsto \phi(x) \in F. \tag{39}$$

To verify that a function is a kernel, one approach is to construct a feature space for which the function applied to two inputs corresponds to first performing an explicit feature mapping and then computing the inner product between the images of the inputs. An alternative approach, which is more widely used, is to investigate the finitely positive semidefinite property [52].

Definition 8 (Finitely positive semidefinite function). A function $k: X \times X \to \mathbb{R}$ satisfies the finitely positive semidefinite property if it is a symmetric function for which the kernel matrices $K$ with $K_{ij} = k(x_i,x_j)$, formed by restriction to any finite subset of the space $X$, are positive semidefinite.

The following theorem [52] justifies the use of the above property for characterizing kernels [2,49].

Theorem 9 (Characterization of kernels). A function $k: X \times X \to \mathbb{R}$ which is either continuous or has a countable domain can be decomposed as an inner product in a Hilbert space $F$ by a feature map $\phi$ applied to both its arguments,

$$k(x,z) = \langle \phi(x), \phi(z) \rangle, \tag{40}$$

if and only if it satisfies the finitely positive semidefinite property.

Normally, a Hilbert space is defined as an inner product space that is complete. If the separability property is further added to the definition of a Hilbert space, the requirement in Theorem 9 that the kernel be continuous or the input space countable becomes necessary to address this issue. The Hilbert space constructed in proving Theorem 9 is called the reproducing kernel Hilbert space (RKHS) because of the following reproducing property of the kernel, which results from the defined inner product:

$$\langle f, k(x,\cdot) \rangle = f(x), \tag{41}$$

where $f$ is a function in the space $F$ of functions, and the function $k(x,\cdot)$ is the mapping $\phi(x)$.

Besides the linear kernel $k(x,z) = x^\top z$, other commonly used kernels are the polynomial kernel

$$k(x,z) = (1 + x^\top z)^{\deg}, \tag{42}$$

where $\deg$ is the degree of the polynomial, and the Gaussian radial basis function (RBF) kernel (Gaussian kernel for short)

$$k(x,z) = \exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right). \tag{43}$$
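Definition 8 can be probed numerically: for the Gaussian kernel (43), the kernel matrix built on any finite sample should be symmetric with no negative eigenvalues (up to floating-point round-off). A small sketch on random hypothetical data:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), as in (43).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
K = gaussian_kernel_matrix(X)

# Symmetry and finite positive semidefiniteness (Definition 8).
assert np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())   # >= 0 up to numerical round-off
assert eigvals.min() > -1e-8
```

Of course, passing this check on one sample does not prove a function is a kernel; Theorem 9 requires positive semidefiniteness on every finite subset, so a single failing sample suffices to disprove the kernel property while passing samples only provide evidence.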


Using the kernel trick, the optimization problem (35) for SVMs becomes

$$\max_{\lambda}\ g(\lambda) = \lambda^\top \mathbf{1} - \tfrac{1}{2}\lambda^\top D \lambda \quad \text{s.t.}\quad \lambda^\top y = 0,\ \lambda \succeq 0,\ \lambda \preceq C\mathbf{1}, \tag{44}$$

where the entries of $D$ are $D_{ij} = y_i y_j k(x_i,x_j)$. The solution for the SVM classifier is formulated as

$$c^\ast(x) = \operatorname{sign}\left(\sum_{i=1}^n y_i \lambda_i^\ast\, k(x_i,x) + b^\ast\right). \tag{45}$$

4. Optimization methodologies for SVMs

For small and moderately sized problems (say, with fewer than 5000 examples), interior point algorithms are reliable and accurate optimization methods to choose [49]. For large-scale problems, methods that exploit the sparsity of the dual variables must be adopted; if storage is further limited, a compact representation should be considered, for example, using an approximation of the kernel matrix.

4.1. Interior point algorithms

An interior point is a pair of primal and dual variables which satisfy the primal and dual constraints, respectively. Below we outline the main procedure of interior point algorithms for SVM optimization, given the fundamental importance of this optimization technique.

Suppose we are now focusing on solving an equivalent form of problem (44):

$$\min_{\lambda}\ \tfrac{1}{2}\lambda^\top D \lambda - \lambda^\top \mathbf{1} \quad \text{s.t.}\quad \lambda^\top y = 0,\ \lambda \preceq C\mathbf{1},\ \lambda \succeq 0. \tag{46}$$

Introducing Lagrange multipliers $h$, $s$ and $z$, where $s, z \succeq 0$ and $h$ is free, we get the Lagrangian

$$\tfrac{1}{2}\lambda^\top D \lambda - \lambda^\top \mathbf{1} + s^\top(\lambda - C\mathbf{1}) - z^\top \lambda + h\,\lambda^\top y. \tag{47}$$

The KKT conditions from Theorem 6 are instantiated as

$$D\lambda - \mathbf{1} + s - z + hy = 0 \quad (\text{dual feasibility}),$$
$$\lambda^\top y = 0, \quad \lambda \preceq C\mathbf{1}, \quad \lambda \succeq 0 \quad (\text{primal feasibility}),$$
$$s^\top(\lambda - C\mathbf{1}) = 0, \quad z^\top \lambda = 0 \quad (\text{zero KKT-gap}),$$
$$s \succeq 0, \quad z \succeq 0. \tag{48}$$

Usually, the conditions $\lambda \preceq C\mathbf{1}$ and $s^\top(\lambda - C\mathbf{1}) = 0$ are transformed, via slack variables $\gamma$, to

$$\lambda + \gamma = C\mathbf{1}, \quad s^\top \gamma = 0, \quad \gamma \succeq 0. \tag{49}$$

Instead of requiring $z^\top \lambda = 0$ and $s^\top \gamma = 0$, the primal-dual path-following algorithm [59] modifies these conditions to $z_i\lambda_i = \mu$ and $s_i\gamma_i = \mu$ ($i = 1,\dots,n$) for $\mu > 0$. After solving the variables for a certain $\mu$, we then decrease $\mu$ toward 0. The advantage is that this facilitates the use of a Newton-type predictor-corrector algorithm to update $\lambda, \gamma, h, s, z$ [49].

Suppose we already have an appropriate initialization of $\lambda, \gamma, h, s, z$ and would like to calculate their increments. For this purpose, expanding the equalities in (48) and (49), e.g., substituting $\lambda$ with $\lambda + \Delta\lambda$, yields

$$D\Delta\lambda + \Delta s - \Delta z + y\Delta h = -D\lambda + \mathbf{1} - s + z - hy =: r_1,$$
$$y^\top \Delta\lambda = -y^\top \lambda,$$
$$\Delta\lambda + \Delta\gamma = C\mathbf{1} - \lambda - \gamma = 0,$$
$$\gamma_i^{-1}s_i\Delta\gamma_i + \Delta s_i = \mu\gamma_i^{-1} - s_i - \gamma_i^{-1}\Delta\gamma_i\Delta s_i =: r_{kkt1_i}, \quad i = 1,\dots,n,$$
$$\lambda_i^{-1}z_i\Delta\lambda_i + \Delta z_i = \mu\lambda_i^{-1} - z_i - \lambda_i^{-1}\Delta\lambda_i\Delta z_i =: r_{kkt2_i}, \quad i = 1,\dots,n. \tag{50}$$

Since clearly

$$\Delta s_i = r_{kkt1_i} + \gamma_i^{-1}s_i\Delta\lambda_i, \quad \Delta z_i = r_{kkt2_i} - \lambda_i^{-1}z_i\Delta\lambda_i, \quad \Delta\gamma = -\Delta\lambda, \tag{51}$$

we only need to further solve for $\Delta\lambda$ and $\Delta h$. Substituting (51) into the first two equalities of (50), we have

$$\begin{bmatrix} D + \operatorname{diag}(\gamma^{-1}\circ s + \lambda^{-1}\circ z) & y \\ y^\top & 0 \end{bmatrix} \begin{bmatrix} \Delta\lambda \\ \Delta h \end{bmatrix} = \begin{bmatrix} r_1 - r_{kkt1} + r_{kkt2} \\ -y^\top \lambda \end{bmatrix}, \tag{52}$$

where $\circ$ denotes the elementwise product and

$$\gamma^{-1} = [1/\gamma_1,\dots,1/\gamma_n]^\top, \quad \lambda^{-1} = [1/\lambda_1,\dots,1/\lambda_n]^\top, \quad r_{kkt1} = [r_{kkt1_1},\dots,r_{kkt1_n}]^\top, \quad r_{kkt2} = [r_{kkt2_1},\dots,r_{kkt2_n}]^\top. \tag{53}$$

Then the Newton-type predictor-corrector algorithm, which aims to achieve performance similar to that of higher-order methods without implementing them, is adopted to solve (52) and update the other variables.

By the blockwise matrix inversion formula with the Schur complement technique, the above linear equation can be readily solved if we can invert $D + \operatorname{diag}(\gamma^{-1}\circ s + \lambda^{-1}\circ z)$. The standard Cholesky decomposition, which is numerically extremely stable, is suitable for this task, with computational complexity about $n^3/6$ [45].
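The system (52) has an arrow structure: after eliminating $\Delta\lambda$ with the Schur complement, $\Delta h$ satisfies a scalar equation, and both inner solves reuse one Cholesky factorization of $M = D + \operatorname{diag}(\gamma^{-1}\circ s + \lambda^{-1}\circ z)$. A sketch on random stand-in data (all quantities here are hypothetical placeholders for the paper's $M$, $r_1 - r_{kkt1} + r_{kkt2}$, and $-y^\top\lambda$):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)      # stands in for D + diag(...), SPD
y = rng.choice([-1.0, 1.0], size=n)
r = rng.standard_normal(n)       # stands in for r1 - rkkt1 + rkkt2
rho = rng.standard_normal()      # stands in for -y.lam

# Factor M once; both right-hand sides reuse the factorization.
c = cho_factor(M)
Minv_y = cho_solve(c, y)
Minv_r = cho_solve(c, r)

# Eliminate dlam = M^{-1}(r - y dh); the second row gives a scalar
# Schur-complement equation for dh.
dh = (y @ Minv_r - rho) / (y @ Minv_y)
dlam = Minv_r - dh * Minv_y

# Verify the block system [[M, y], [y', 0]] [dlam; dh] = [r; rho].
assert np.allclose(M @ dlam + y * dh, r)
assert np.allclose(y @ dlam, rho)
```

Factoring $M$ once and back-substituting twice is exactly why the Cholesky cost of roughly $n^3/6$ dominates each interior point iteration.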

Generally, interior point algorithms are a good choice for small and moderately sized problems, offering high reliability and precision. Because they involve the Cholesky decomposition of a matrix whose size scales with the number of training examples, the computation is expensive for large-scale data. One way to overcome this is to combine interior point algorithms with sparse greedy matrix approximation to provide a feasible computation of the matrix inverse [49].

4.2. Chunking and sequential minimal optimization

The chunking algorithm [60] relies on the fact that the value of the quadratic form in (46) is unchanged if we remove the rows and columns of $D$ associated with zero entries of $\lambda$. That is, only the support vectors are needed to represent the final classifier. Thus, the large quadratic programming problem can be divided into a series of smaller subproblems.

Chunking starts with a subset (chunk) of the training data and solves a small quadratic programming (QP) subproblem. It then retains the support vectors, replaces the other data in the chunk with a certain number of examples violating the KKT conditions, and solves the second subproblem for a new classifier. Each subproblem is initialized with the solution of the previous one. At the last step, all nonzero entries of $\lambda$ are identified.


Osuna et al. [42] gave a theorem on the convergence of breaking a large QP problem down into a series of smaller QP subproblems. As long as at least one example violating the KKT conditions is added to the examples of the previous subproblem, each step decreases the objective function and has a feasible point satisfying all constraints [44]. According to this theorem, the chunking algorithm converges to the globally optimal solution.

The sequential minimal optimization (SMO) algorithm, proposed by Platt [44], is a special case of chunking which iteratively solves the smallest possible optimization problem, involving just two examples at a time:

$$\min_{\lambda_i,\lambda_j}\ \tfrac{1}{2}\left[\lambda_i^2 D_{ii} + 2\lambda_i\lambda_j D_{ij} + \lambda_j^2 D_{jj}\right] - \lambda_i - \lambda_j \quad \text{s.t.}\quad \lambda_i y_i + \lambda_j y_j = r_{ij},\ \ 0 \le \lambda_i \le C,\ \ 0 \le \lambda_j \le C, \tag{54}$$

where the scalar $r_{ij}$ is determined by fulfilling the constraint $\lambda^\top y = 0$ of the full problem.

The advantage of SMO is that a QP problem with two examplecan be solved analytically, and thus a numerical QP solver isavoided. It can be more than 1000 times faster and exhibit betterscaling (up to one order better) in the training set size than theclassical chunking algorithm given in [60]. Convergence is guar-anteed as long as at least one of the two selected examplesviolates the KKT conditions before solving the subproblem. How-ever, it should be noted that the SMO algorithm will not convergeas quickly if solutions with a high accuracy are needed [44].A more recent approach is the LASVM algorithm [6], whichoptimizes the cost function of SVMs with a reorganization ofthe SMO direction searches and can converge to the SVM solution.

There are other decomposition techniques based on the same idea of working set selection, where more than two examples are selected. Fast convergence and an inexpensive computational cost per iteration are two conflicting goals. Two widely used decomposition packages are LIBSVM [13] and SVMlight [22]. LIBSVM is implemented for working sets of two examples, and exploits second-order information for working set selection. Different from LIBSVM, SVMlight can exploit working sets of more than two examples (up to $O(10^3)$) and thus needs a QP solver to solve the subproblems. Recently, following the SVMlight framework, Zanni et al. [62] proposed a parallel SVM software package with gradient projection QP solvers. Experiments on data sets of size $O(10^6)$ show that training nonlinear SVMs with $O(10^5)$ support vectors can be completed in a few hours with some tens of processors [62].

4.3. Coordinate descent

Coordinate descent is a popular optimization technique that updates one variable at a time by optimizing a subproblem involving only a single variable [21]. It has been discussed and applied to SVM optimization, e.g., in the work of [7,18,37]. Here we introduce the coordinate descent method used by Hsieh et al. [21] for training large-scale linear SVMs in the dual form, though coordinate descent is certainly not confined to linear SVMs.

In some applications where the features of examples are high dimensional, SVMs with linear and nonlinear kernels have similar performance. Furthermore, if the linear kernel is used, much larger data sets can be adopted to train SVMs [21]. Linear SVMs are among the most popular tools for dealing with such applications.

By appending each example with a constant feature 1, Hsieh et al. [21] absorbed the bias $b$ into the weight vector $\mathbf{w}$,
$$\mathbf{x}^\top = [\mathbf{x}^\top, 1], \quad \mathbf{w}^\top = [\mathbf{w}^\top, b], \tag{55}$$

and solved the following dual problem:

$$\begin{aligned}
\min_{\boldsymbol{\lambda}}\;& f(\boldsymbol{\lambda})=\tfrac{1}{2}\boldsymbol{\lambda}^\top D\boldsymbol{\lambda}-\boldsymbol{\lambda}^\top \mathbf{1} \\
\text{s.t.}\;& \boldsymbol{\lambda}\preceq C\mathbf{1},\quad \boldsymbol{\lambda}\succeq \mathbf{0},
\end{aligned} \tag{56}$$
where the parameters have the same meaning as in (35). The optimization process begins with an initial value $\boldsymbol{\lambda}^0\in\mathbb{R}^n$

and iteratively updates $\boldsymbol{\lambda}$ to generate $\boldsymbol{\lambda}^0,\boldsymbol{\lambda}^1,\ldots,\boldsymbol{\lambda}^\infty$. The process from $\boldsymbol{\lambda}^k$ to $\boldsymbol{\lambda}^{k+1}$ is referred to as an outer iteration. During each outer iteration there are $n$ inner iterations which sequentially update $\lambda_1,\ldots,\lambda_n$. Each outer iteration generates multiplier vectors $\boldsymbol{\lambda}^{k,i}\in\mathbb{R}^n$ ($i=1,\ldots,n+1$) such that
$$\begin{aligned}
&\boldsymbol{\lambda}^{k,1}=\boldsymbol{\lambda}^k,\qquad \boldsymbol{\lambda}^{k,n+1}=\boldsymbol{\lambda}^{k+1},\\
&\boldsymbol{\lambda}^{k,i}=[\lambda_1^{k+1},\ldots,\lambda_{i-1}^{k+1},\lambda_i^k,\ldots,\lambda_n^k]^\top,\quad i=2,\ldots,n.
\end{aligned} \tag{57}$$

When we update $\boldsymbol{\lambda}^{k,i}$ to $\boldsymbol{\lambda}^{k,i+1}$, the following one-variable subproblem [21] is solved:
$$\begin{aligned}
\min_{d}\;& f(\boldsymbol{\lambda}^{k,i}+d\mathbf{e}_i) \\
\text{s.t.}\;& 0\le \lambda_i^k+d\le C,
\end{aligned} \tag{58}$$

where the unit vector $\mathbf{e}_i=[0,\ldots,0,1,0,\ldots,0]^\top$. The objective function is a quadratic function of the scalar $d$,
$$f(\boldsymbol{\lambda}^{k,i}+d\mathbf{e}_i)=\tfrac{1}{2}D_{ii}d^2+\nabla_i f(\boldsymbol{\lambda}^{k,i})\,d+\mathrm{const.}, \tag{59}$$

where $\nabla_i f$ is the $i$th entry of the gradient vector $\nabla f$. For linear SVMs, $\nabla_i f$ can be calculated efficiently, and thus speedy convergence of the whole optimization process can be expected. It was shown that the above coordinate descent method is faster than some state-of-the-art solvers for SVM optimization [21]. Note that, from the point of view of coordinate descent, the SMO algorithm also implements a coordinate descent procedure, though on coordinate pairs.
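As a concrete illustration, the following sketch implements dual coordinate descent for a linear L1-SVM in the spirit of Hsieh et al. [21] (the function name, toy data and iteration counts are our own choices). The key point is that $\mathbf{w}=\sum_i\lambda_i y_i\mathbf{x}_i$ is maintained incrementally, so each single-variable update, including the gradient entry $\nabla_i f = y_i\mathbf{w}^\top\mathbf{x}_i-1$, costs only $O(d)$:

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, n_outer=50):
    n, d = X.shape
    lam = np.zeros(n)
    w = np.zeros(d)
    Qii = np.einsum('ij,ij->i', X, X)   # diagonal entries ||x_i||^2
    for _ in range(n_outer):            # outer iterations
        for i in range(n):              # inner iterations over lambda_1..lambda_n
            grad = y[i] * (w @ X[i]) - 1.0           # gradient entry, O(d)
            lam_new = min(max(lam[i] - grad / Qii[i], 0.0), C)
            w += (lam_new - lam[i]) * y[i] * X[i]    # incremental update of w
            lam[i] = lam_new
    return w, lam

# toy separable data; the bias is absorbed by appending a constant feature 1
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
X = np.hstack([X, np.ones((4, 1))])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, lam = dual_cd_linear_svm(X, y)
```

Each inner step minimizes the one-variable quadratic (59) exactly and clips the result to $[0, C]$.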

4.4. Active set methods

Active set methods are among the typical approaches to solving QP problems [41]. In active set methods, the constraints are divided into the set of active constraints (the active set) and the set of inactive constraints. The algorithm iteratively solves the optimization problem, updating the active set accordingly until the optimal solution is found. At each iteration, an appropriate subproblem is solved for the variables that are not actively constrained.

Active set methods have also been applied to SVM optimization, e.g., by [12,19,46,47]. In this subsection, we mainly outline the methods used in the two recent papers [46,47], both of which have theoretical guarantees to converge in finite time.

The active set method adopted by Scheinberg [46] to solve (46) fixes the variables $\lambda_i$ in the active set at their current values (0 or $C$), and then solves the reduced subproblem at each iteration
$$\begin{aligned}
\min_{\boldsymbol{\lambda}_s}\;& \tfrac{1}{2}\boldsymbol{\lambda}_s^\top D_{ss}\boldsymbol{\lambda}_s+\boldsymbol{\lambda}_a^\top D_{as}\boldsymbol{\lambda}_s-\boldsymbol{\lambda}_s^\top\mathbf{1} \\
\text{s.t.}\;& \boldsymbol{\lambda}_s^\top\mathbf{y}_s=-\boldsymbol{\lambda}_a^\top\mathbf{y}_a, \\
& \boldsymbol{\lambda}_s\preceq C\mathbf{1},\quad \boldsymbol{\lambda}_s\succeq\mathbf{0},
\end{aligned} \tag{60}$$
where $\boldsymbol{\lambda}_a$ and $\boldsymbol{\lambda}_s$ correspond to the variables in and not in the active set, respectively, $D_{ss}$ and $D_{as}$ are composed of the associated rows and columns of $D$, and the meanings of $\mathbf{y}_a$ and $\mathbf{y}_s$ are similarly


defined. Note that the cardinality of the free vector $\boldsymbol{\lambda}_s$ is the number of support vectors on the margin with $\mathbf{0}\prec\boldsymbol{\lambda}_s\prec C\mathbf{1}$ after the last iteration.

Solving the above reduced subproblem can be expensive if the number of such support vectors is large. Scheinberg [46] proposed to make one step toward its solution at each iteration. That is, each time either one variable enters or one leaves the active set. The rank-one update of the Cholesky factorization of $D_{ss}$ is therefore efficient for this variant.

The active set method used by Shilton et al. [47] is similar, but it solves a different objective function, a partially dual form
$$\max_{b}\ \min_{\mathbf{0}\preceq\boldsymbol{\lambda}\preceq C\mathbf{1}}\ \frac{1}{2}\begin{bmatrix} b \\ \boldsymbol{\lambda} \end{bmatrix}^\top \begin{bmatrix} 0 & \mathbf{y}^\top \\ \mathbf{y} & D \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\lambda} \end{bmatrix} - \begin{bmatrix} b \\ \boldsymbol{\lambda} \end{bmatrix}^\top \begin{bmatrix} \mathbf{0} \\ \mathbf{1} \end{bmatrix}, \tag{61}$$
where $D$ is the same as in (44). The advantage of using this partially dual representation is that the constraint $\boldsymbol{\lambda}^\top\mathbf{y}=0$, whose presence complicates the implementation of the active set approach, is eliminated [47].

4.5. Solving the primal with Newton's method

Although most of the literature on SVMs deals with the dual optimization problem, there is also some work concentrating on optimizing the primal, because the primal and the dual are indeed two sides of the same coin. For example, [27,38] studied the primal optimization of linear SVMs, and [14,34] studied the primal optimization of both linear and nonlinear SVMs. These works mainly use Newton-type methods. Below we introduce the optimization problem and the approach adopted in [14].

Consider a kernel function $k$ and an associated reproducing kernel Hilbert space $\mathcal{H}$. The optimization problem is
$$\min_{f\in\mathcal{H}}\ \rho\|f\|_{\mathcal{H}}^2+\sum_{i=1}^n R(y_i,f(\mathbf{x}_i)), \tag{62}$$
where $\rho$ is a regularization coefficient, and $R(y,t)$ is a loss function that is differentiable with respect to its second argument [14].

Suppose the optimal solution is $f^*$. The gradient of the objective function at $f^*$ must vanish, so we have
$$2\rho f^*+\sum_{i=1}^n \frac{\partial R}{\partial f}(y_i,f^*(\mathbf{x}_i))\,k(\mathbf{x}_i,\cdot)=0, \tag{63}$$
where the reproducing property $f(\mathbf{x}_i)=\langle f,k(\mathbf{x}_i,\cdot)\rangle_{\mathcal{H}}$ is used. This indicates that the optimal solution can be represented as a linear combination of kernel functions evaluated at the $n$ training data, which is known as the representer theorem [29].

From (63), we can write $f(\mathbf{x})$ in the form
$$f(\mathbf{x})=\sum_{i=1}^n \beta_i k(\mathbf{x}_i,\mathbf{x}) \tag{64}$$
and then solve for $\boldsymbol{\beta}=[\beta_1,\ldots,\beta_n]^\top$. The entries of $\boldsymbol{\beta}$ should not be interpreted as Lagrange multipliers [14]. Recall the definition of the kernel matrix $K$ with $K_{ij}=k(\mathbf{x}_i,\mathbf{x}_j)$.

The optimization problem (62) can be rewritten as the following unconstrained optimization problem:
$$\min_{\boldsymbol{\beta}}\ \rho\boldsymbol{\beta}^\top K\boldsymbol{\beta}+\sum_{i=1}^n R(y_i,K_i^\top\boldsymbol{\beta}), \tag{65}$$
where $K_i$ is the $i$th column of $K$. As the common hinge loss $R(y_i,f(\mathbf{x}_i))=\max(0,1-y_if(\mathbf{x}_i))$ is not differentiable, Chapelle [14] adopted the Huber loss as a differentiable approximation of the hinge loss. The Huber loss is formulated as

$$R(y,t)=\begin{cases}
0 & \text{if } yt>1+h,\\[3pt]
\dfrac{(1+h-yt)^2}{4h} & \text{if } |1-yt|\le h,\\[3pt]
1-yt & \text{if } yt<1-h,
\end{cases} \tag{66}$$
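A small numeric check (with values of our choosing) makes the piecewise definition concrete and confirms that the Huber loss matches the hinge loss outside the band $|1-yt|\le h$ while smoothing the corner inside it:

```python
import numpy as np

def huber(yt, h):
    """Huber loss (66) as a function of the product yt = y * t."""
    return np.where(yt > 1.0 + h, 0.0,
           np.where(yt < 1.0 - h, 1.0 - yt,
                    (1.0 + h - yt) ** 2 / (4.0 * h)))

h = 0.5
vals = huber(np.array([2.0, 1.0, 0.0]), h)  # zero, quadratic and linear pieces
# continuity at the two breakpoints of the piecewise definition
left = huber(np.array([1.0 - h]), h)        # linear side also gives 1 - yt = h
right = huber(np.array([1.0 + h]), h)       # quadratic side also gives 0
```

At both breakpoints the adjacent pieces agree, which is exactly what makes the surrogate differentiable.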

where the parameter $h$ typically varies between 0.01 and 0.5 [14]. The Newton update of the variables is
$$\boldsymbol{\beta}\leftarrow\boldsymbol{\beta}-\rho_{\mathrm{step}}H^{-1}\nabla, \tag{67}$$
where $H$ and $\nabla$ are the Hessian and gradient of the objective function in (65), respectively, and the step size $\rho_{\mathrm{step}}$ can be found by a line search along the direction of $H^{-1}\nabla$.

The term $-H^{-1}\nabla$ is provided by the standard Newton's method, which seeks an increment $\mathbf{d}$ to maximize or minimize a function, say $L(\boldsymbol{\theta}+\mathbf{d})$. To show this, setting the derivative of the following second-order Taylor series approximation
$$L(\boldsymbol{\theta}+\mathbf{d})=L(\boldsymbol{\theta})+\mathbf{d}^\top\nabla(\boldsymbol{\theta})+\tfrac{1}{2}\mathbf{d}^\top H(\boldsymbol{\theta})\mathbf{d} \tag{68}$$
to zero, we obtain
$$\mathbf{d}=-H^{-1}(\boldsymbol{\theta})\nabla(\boldsymbol{\theta}). \tag{69}$$
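Putting (65)–(69) together, the following is a compact sketch of primal Newton training with the Huber-smoothed hinge loss, in the spirit of [14] (the jitter term, the simple backtracking line search and the toy data are our own choices, not the paper's implementation). The gradient of (65) is $2\rho K\boldsymbol{\beta}+K\mathbf{g}$ and the Hessian is $2\rho K+K\,\mathrm{diag}(\mathbf{s})\,K$, where $\mathbf{g}$ and $\mathbf{s}$ collect the first and second derivatives of the loss at each training point:

```python
import numpy as np

def primal_newton(K, y, rho=0.1, h=0.5, iters=30):
    n = K.shape[0]
    beta = np.zeros(n)

    def obj(b):
        yt = y * (K @ b)
        R = np.where(yt > 1.0 + h, 0.0,
            np.where(yt < 1.0 - h, 1.0 - yt, (1.0 + h - yt) ** 2 / (4.0 * h)))
        return rho * b @ K @ b + R.sum()

    for _ in range(iters):
        yt = y * (K @ beta)
        g = np.zeros(n)                 # dR/dt at each training point
        s = np.zeros(n)                 # d^2R/dt^2, nonzero only in the band
        quad = np.abs(1.0 - yt) <= h
        lin = yt < 1.0 - h
        g[quad] = -y[quad] * (1.0 + h - yt[quad]) / (2.0 * h)
        s[quad] = 1.0 / (2.0 * h)
        g[lin] = -y[lin]
        grad = 2.0 * rho * (K @ beta) + K @ g               # gradient of (65)
        H = 2.0 * rho * K + (K * s) @ K + 1e-8 * np.eye(n)  # Hessian + jitter
        d = -np.linalg.solve(H, grad)                       # Newton direction (69)
        step = 1.0                      # backtracking search for rho_step in (67)
        while step > 1e-6 and obj(beta + step * d) > obj(beta):
            step /= 2.0
        beta = beta + step * d
    return beta

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
beta = primal_newton(K, y)
```

The line search matters here: because the loss is flat far from the margin, a full Newton step can overshoot, and damping the step restores monotone descent.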

It was shown that primal and dual optimization are equivalent in terms of the solution and time complexity [14]. However, as primal optimization focuses more directly on minimizing the desired objective function, it may be superior to dual optimization when one is interested in finding approximate solutions [14].

Here, we would like to point out an alternative solution approach called the augmented Lagrangian method [5], which has approximately the same convergence speed as Newton's method.

4.6. Stochastic subgradient with projection

The subgradient method is a simple iterative algorithm for minimizing a convex objective function that can be non-differentiable. The method looks very much like the ordinary gradient method applied to differentiable functions, but with several subtle differences. For instance, the subgradient method is not truly a descent method: the function value often increases. Furthermore, the step lengths in the subgradient method are fixed in advance, instead of being found by a line search as in the gradient method [9].

A subgradient of a function $f(\mathbf{x}):\mathbb{R}^d\to\mathbb{R}$ at $\mathbf{x}$ is any vector $\mathbf{g}$ that satisfies
$$f(\mathbf{y})\ge f(\mathbf{x})+\mathbf{g}^\top(\mathbf{y}-\mathbf{x}) \tag{70}$$
for all $\mathbf{y}$ [53]. When $f$ is differentiable, a very useful choice of $\mathbf{g}$ is $\nabla f(\mathbf{x})$, and then the subgradient method uses the same search direction as the gradient descent method.

Suppose we are minimizing the above function $f(\mathbf{x})$, which is convex. The subgradient method uses the iteration
$$\mathbf{x}^{(k+1)}\leftarrow\mathbf{x}^{(k)}-\alpha_k\mathbf{g}^{(k)}, \tag{71}$$
where $\mathbf{g}^{(k)}$ is any subgradient of $f$ at $\mathbf{x}^{(k)}$, and $\alpha_k>0$ is the step length.
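A one-dimensional toy run (ours) shows the two traits noted above: the iterates are not monotone, so one keeps the best point found so far, and a diminishing step length $\alpha_k=1/k$ still drives the method toward the minimizer of a non-differentiable function such as $f(x)=|x-3|$:

```python
# Subgradient method (71) applied to the non-differentiable f(x) = |x - 3|.
def subgradient_min(x0, iters=2000):
    x, best = x0, x0
    for k in range(1, iters + 1):
        g = 1.0 if x > 3.0 else (-1.0 if x < 3.0 else 0.0)  # a subgradient of f
        x = x - (1.0 / k) * g                               # step length 1/k
        if abs(x - 3.0) < abs(best - 3.0):
            best = x                                        # track the best point
    return best

best = subgradient_min(10.0)
```

Near the kink the iterates oscillate around the minimizer, but the shrinking step length makes the oscillation amplitude vanish.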

Recently, there has been much interest in studying gradient methods for SVM training [30,63]. Shalev-Shwartz et al. [50] proposed an algorithm alternating between stochastic subgradient descent and projection steps to solve the primal problem
$$\min_{\mathbf{w}}\ \frac{\rho}{2}\|\mathbf{w}\|^2+\frac{1}{n}\sum_{i=1}^n R(\mathbf{w},(\mathbf{x}_i,y_i)), \tag{72}$$
where the hinge loss $R(\mathbf{w},(\mathbf{x},y))=\max\{0,1-y\langle\mathbf{w},\mathbf{x}\rangle\}$ with classifier function $f(\mathbf{w})=\langle\mathbf{w},\mathbf{x}\rangle$.


The algorithm receives two parameters as input: $T$, the total number of iterations, and $k$, the number of examples used to calculate subgradients. Initially, the weight vector $\mathbf{w}_1$ is set to any vector with norm no larger than $1/\sqrt{\rho}$. For the $t$th iteration, a set $A_t$ of size $k$ is chosen from the original $n$ training examples to approximate the objective function as
$$f(\mathbf{w},A_t)=\frac{\rho}{2}\|\mathbf{w}\|^2+\frac{1}{k}\sum_{(\mathbf{x},y)\in A_t} R(\mathbf{w},(\mathbf{x},y)). \tag{73}$$

In general, $A_t$ includes $k$ examples sampled i.i.d. from the training set. Define $A_t^+$ to be the set of examples in $A_t$ for which the hinge loss is nonzero. Then, a two-step update is performed. First, $\mathbf{w}_t$ is updated to $\mathbf{w}_{t+\frac{1}{2}}$,
$$\mathbf{w}_{t+\frac{1}{2}}=\mathbf{w}_t-\eta_t\nabla_t, \tag{74}$$
where the step length $\eta_t=1/(\rho t)$, and $\nabla_t=\rho\mathbf{w}_t-(1/k)\sum_{(\mathbf{x},y)\in A_t^+}y\mathbf{x}$ is a subgradient of $f(\mathbf{w},A_t)$ at $\mathbf{w}_t$. Last, $\mathbf{w}_{t+\frac{1}{2}}$ is updated to $\mathbf{w}_{t+1}$ by projecting onto the set
$$\{\mathbf{w}:\|\mathbf{w}\|\le 1/\sqrt{\rho}\}, \tag{75}$$
which is shown to contain the optimal solution of the SVM. The algorithm finally outputs $\mathbf{w}_{T+1}$.

If $k=1$, the algorithm is a variant of the stochastic gradient method; if $k$ equals the total number of training examples, the algorithm is the subgradient projection method. Besides its simplicity, the algorithm is guaranteed to obtain a solution of accuracy $\epsilon$ within $\tilde{O}(1/\epsilon)$ iterations, where an $\epsilon$-accurate solution $\mathbf{w}$ is defined as $f(\mathbf{w})\le\min_{\mathbf{w}}f(\mathbf{w})+\epsilon$ [50].
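The loop below sketches this algorithm (often referred to as Pegasos [50]); the toy data, random seed and iteration budget are ours. Each round samples a mini-batch $A_t$, takes a subgradient step with $\eta_t=1/(\rho t)$, and projects back onto the ball of radius $1/\sqrt{\rho}$:

```python
import numpy as np

def pegasos(X, y, rho=0.5, k=2, T=3000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                                  # ||w_1|| <= 1/sqrt(rho)
    for t in range(1, T + 1):
        idx = rng.choice(n, size=k, replace=False)   # mini-batch A_t
        viol = idx[y[idx] * (X[idx] @ w) < 1.0]      # A_t^+: nonzero hinge loss
        grad = rho * w - (X[viol].T @ y[viol]) / k   # subgradient as in (74)
        w = w - grad / (rho * t)                     # step length eta_t = 1/(rho*t)
        radius = 1.0 / np.sqrt(rho)                  # projection step (75)
        norm = np.linalg.norm(w)
        if norm > radius:
            w *= radius / norm
    return w

# toy separable data with the bias absorbed as a constant feature
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0],
              [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = pegasos(X, y)
```

With $k=n$ and `replace=False` the same loop becomes the deterministic subgradient projection method mentioned above.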

4.7. Cutting plane algorithms

The essence of cutting plane algorithms is to approximate a risk function by one or multiple linear under-estimators (the so-called cutting planes) [28]. Recently, some large-scale SVM solvers using cutting plane algorithms have been proposed. For example, Joachims [23] solved a solution-equivalent structural variant of the following problem:

$$\min_{\mathbf{w}}\ \frac{1}{2}\|\mathbf{w}\|^2+\frac{C}{n}\sum_{i=1}^n R(\mathbf{w},(\mathbf{x}_i,y_i)), \tag{76}$$

where the hinge loss $R(\mathbf{w},(\mathbf{x},y))=\max\{0,1-y\langle\mathbf{w},\mathbf{x}\rangle\}$ with classifier function $f(\mathbf{w})=\langle\mathbf{w},\mathbf{x}\rangle$. Teo et al. [57] considered a wider range of losses and regularization terms applicable to many pattern analysis problems. Franc [17] improved the cutting plane algorithm used in [23] with line search and monotonicity techniques to solve problem (76) more efficiently. Below we provide a flavor of what the basic cutting plane algorithm does in order to solve problem (76).

At iteration $t$, a reduced problem of (76) is solved for $\mathbf{w}_t$,
$$\min_{\mathbf{w}}\ \frac{1}{2}\|\mathbf{w}\|^2+\frac{C}{n}\sum_{i=1}^n R_t(\mathbf{w},(\mathbf{x}_i,y_i)), \tag{77}$$

where $R_t$ is a linear approximation of $R$. $R_t$ is derived using the following property: as a result of convexity, for any $(\mathbf{x}_i,y_i)$ the function $R$ can be approximated at any point $\mathbf{w}_{t-1}$ by a linear under-estimator [17]
$$R(\mathbf{w})\ge R(\mathbf{w}_{t-1})+\langle\mathbf{g}_{t-1},\mathbf{w}-\mathbf{w}_{t-1}\rangle,\quad \forall\,\mathbf{w}\in\mathbb{R}^d, \tag{78}$$
where $\mathbf{g}_{t-1}$ is any subgradient of $R$ at $\mathbf{w}_{t-1}$. Note that (78) resembles (70), which indicates that there can be some similarity between the methods mentioned here and those in Section 4.6. This inequality can be abbreviated as $R(\mathbf{w})\ge\langle\mathbf{g}_{t-1},\mathbf{w}\rangle+b_{t-1}$ with $b_{t-1}$ defined as $b_{t-1}=R(\mathbf{w}_{t-1})-\langle\mathbf{g}_{t-1},\mathbf{w}_{t-1}\rangle$. The equality $\langle\mathbf{g}_{t-1},\mathbf{w}\rangle+b_{t-1}=0$ is called a cutting plane.

For SVM optimization, a subgradient of $R(\mathbf{w},(\mathbf{x}_i,y_i))$ at $\mathbf{w}_{t-1}$ can be given as $\mathbf{g}_{t-1}=-p_iy_i\mathbf{x}_i$, where $p_i=1$ if $y_i\langle\mathbf{w}_{t-1},\mathbf{x}_i\rangle<1$ and $p_i=0$ otherwise. Consequently, the term $R_t(\mathbf{w},(\mathbf{x}_i,y_i))$ in (77) can be replaced by the linear relaxation $\langle\mathbf{g}_{t-1},\mathbf{w}\rangle+b_{t-1}$, and thus a standard QP is recovered which is straightforward to solve. To get a better approximation of the original risk function, more than one cutting plane at different points can be incorporated [17].
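The construction of a single cutting plane can be sketched as follows (the helper function and toy numbers are ours). Using the subgradient $\mathbf{g}=-(1/n)\sum_i p_i y_i\mathbf{x}_i$ of the averaged hinge risk, the plane $\langle\mathbf{g},\mathbf{w}\rangle+b$ touches the risk at the current point and under-estimates it everywhere else:

```python
import numpy as np

def hinge_risk(w, X, y):
    """Averaged hinge risk R(w) = (1/n) * sum_i max(0, 1 - y_i <w, x_i>)."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def cutting_plane(w_prev, X, y):
    """Return (g, b) with <g, w> + b <= R(w) for all w, equality at w_prev."""
    margins = y * (X @ w_prev)
    p = (margins < 1.0).astype(float)   # indicators of active hinge terms
    g = -(X.T @ (p * y)) / len(y)       # a subgradient of the averaged risk
    b = hinge_risk(w_prev, X, y) - g @ w_prev
    return g, b

X = np.array([[1.0, 2.0], [-1.5, -0.5], [2.0, -1.0]])
y = np.array([1.0, -1.0, 1.0])
w0 = np.array([0.2, -0.3])
g, b = cutting_plane(w0, X, y)
```

A cutting plane solver accumulates such pairs $(\mathbf{g}, b)$ from successive iterates and minimizes the regularizer plus the pointwise maximum of the planes.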

5. SVMs for density estimation and regression

In this section, we briefly touch on the problems of density estimation and regression using SVMs. Many of the optimization methodologies introduced in the last section can be adopted, with some adaptation, to deal with density estimation and regression. We therefore do not delve into the optimization details of these problems here.

The one-class SVM, an unsupervised learning machine proposed by Schölkopf et al. [48] for density estimation, aims to capture the support of a high-dimensional distribution. It outputs a binary function which is $+1$ in a region containing most of the data, and $-1$ in the remaining region. The corresponding optimization problem, which can be interpreted as separating the data set from the origin in a feature space, is formulated as
$$\begin{aligned}
\min_{\mathbf{w},\boldsymbol{\xi},\rho}\;& \frac{1}{2}\|\mathbf{w}\|^2+\frac{1}{\nu n}\sum_{i=1}^n\xi_i-\rho \\
\text{s.t.}\;& \langle\mathbf{w},\phi(\mathbf{x}_i)\rangle\ge\rho-\xi_i,\quad i=1,\ldots,n,\\
& \xi_i\ge 0,\quad i=1,\ldots,n,
\end{aligned} \tag{79}$$

where the scalar $\nu\in(0,1]$ is a balance parameter. The decision function is
$$c(\mathbf{x})=\mathrm{sign}(\langle\mathbf{w},\phi(\mathbf{x})\rangle-\rho). \tag{80}$$

For support vector regression, the labels in the training data are real numbers, that is, $y_i\in\mathbb{R}$ ($i=1,\ldots,n$). In order to introduce a sparse representation for the decision function, Vapnik [61] devised the following $\epsilon$-insensitive function [49]
$$|y-f(\mathbf{x})|_\epsilon=\max\{0,\ |y-f(\mathbf{x})|-\epsilon\} \tag{81}$$
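Numerically (with values of our choosing), the tube behaviour is easy to see: residuals smaller than $\epsilon$ incur no loss at all, which is what induces sparsity in the resulting dual solution:

```python
import numpy as np

def eps_insensitive(y, f, eps):
    """epsilon-insensitive loss (81): zero inside the tube |y - f| <= eps."""
    return np.maximum(0.0, np.abs(y - f) - eps)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 1.0])
vals = eps_insensitive(y_true, y_pred, eps=0.1)  # -> [0.0, 0.4, 1.9]
```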

with $\epsilon\ge 0$, and applied it to support vector regression. The standard form of support vector regression is to minimize the objective
$$\frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n|y_i-f(\mathbf{x}_i)|_\epsilon, \tag{82}$$

where the positive scalar $C$ reflects the trade-off between function complexity and the empirical loss. An equivalent optimization problem commonly used is given by
$$\begin{aligned}
\min_{\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\xi}^*}\;& \frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n\xi_i+C\sum_{i=1}^n\xi_i^* \\
\text{s.t.}\;& \langle\mathbf{w},\phi(\mathbf{x}_i)\rangle+b-y_i\le\epsilon+\xi_i,\\
& y_i-\langle\mathbf{w},\phi(\mathbf{x}_i)\rangle-b\le\epsilon+\xi_i^*,\\
& \xi_i,\ \xi_i^*\ge 0,\quad i=1,\ldots,n.
\end{aligned} \tag{83}$$

The decision output for support vector regression is
$$c(\mathbf{x})=\langle\mathbf{w},\phi(\mathbf{x})\rangle+b. \tag{84}$$

6. Optimization methods for kernel learning

SVMs are among the most famous kernel-based algorithms, which encode the similarity information between examples largely by the chosen kernel function or kernel matrix. Determining the kernel functions or kernel matrices is essentially a model selection problem. This section introduces several optimization techniques used for learning kernels in SVM-type applications.


Semidefinite programming is a recently popular class of optimization problems with wide engineering applications, including kernel learning for SVMs [32,35]. For solving semidefinite programs, there exist interior point algorithms with good theoretical properties and computational efficiency [58].

A semidefinite program (SDP) is a special convex optimization problem with a linear objective function, and linear matrix inequality and affine equality constraints.

Definition 10 (SDP). An SDP is an optimization problem of the form
$$\begin{aligned}
\min_{\mathbf{u}}\;& \mathbf{c}^\top\mathbf{u} \\
\text{s.t.}\;& F^j(\mathbf{u}):=F_0^j+u_1F_1^j+\cdots+u_qF_q^j\preceq 0,\quad j=1,\ldots,L,\\
& A\mathbf{u}=\mathbf{b},
\end{aligned} \tag{85}$$
where $\mathbf{u}:=[u_1,\ldots,u_q]^\top\in\mathbb{R}^q$ is the decision vector, $\mathbf{c}$ is a constant vector defining the linear objective, $F_i^j$ ($i=0,\ldots,q$) are given symmetric matrices, and $F^j(\mathbf{u})\preceq 0$ means that the symmetric matrix $F^j(\mathbf{u})$ is negative semidefinite.

Lanckriet et al. [32] proposed an SDP framework for learning the kernel matrix, which is readily applicable to induction, transduction and even the multiple kernel learning case. The idea is to wrap the original optimization problems with the following constraints on the symmetric kernel matrix $K$:
$$K=\sum_{i=1}^m\mu_iK_i,\qquad K\succeq 0,\qquad \mathrm{trace}(K)\le c, \tag{86}$$
or with the constraints
$$K=\sum_{i=1}^m\mu_iK_i,\qquad \mu_i\ge 0\ (i=1,\ldots,m),\qquad K\succeq 0,\qquad \mathrm{trace}(K)\le c, \tag{87}$$
and then formulate the problem as an SDP, which is usually solved by interior point methods. The advantages of the second set of constraints include reducing the computational complexity and facilitating the study of the statistical properties of a class of kernel matrices [32]. In particular, for the 1-norm soft margin SVM, the SDP for kernel learning reduces to a quadratically constrained quadratic programming (QCQP) problem. Tools for second-order cone programming (SOCP) can be applied to solve this problem efficiently.
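A quick numeric sanity check (ours) of the constraint set (87): any nonnegative combination of PSD base kernel matrices is again PSD, so restricting $\mu_i\ge 0$ keeps every candidate $K$ a valid kernel matrix, while the trace bound controls its scale:

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.standard_normal((5, 3))
A2 = rng.standard_normal((5, 3))
K1, K2 = A1 @ A1.T, A2 @ A2.T   # two PSD base kernel matrices
mu = np.array([0.7, 0.3])       # combination weights, mu_i >= 0
K = mu[0] * K1 + mu[1] * K2     # candidate kernel matrix from (87)
eigs = np.linalg.eigvalsh(K)    # all eigenvalues should be nonnegative
```

Without the sign constraints of (87), a combination with negative weights could have negative eigenvalues, which is why (86) must keep the explicit $K\succeq 0$ constraint inside the SDP.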

Argyriou et al. [1] proposed a DC (difference of convex functions) programming algorithm for supervised kernel learning. An important difference from other kernel selection methods is that they consider the convex hull of continuously parameterized basic kernels; that is, the number of basic kernels can be infinite. In their formulation, kernels are selected from the set
$$\mathcal{K}(G):=\left\{\int G(\omega)\,dp(\omega)\ :\ p\in\mathcal{P}(\Omega),\ \omega\in\Omega\right\}, \tag{88}$$
where $\Omega$ is the kernel parameter space, $G(\omega)$ is a basic kernel, and $\mathcal{P}(\Omega)$ is the set of all probability measures on $\Omega$.

Making use of the conjugate function and the von Neumann minimax theorem [40], Argyriou et al. [1] formulated an optimization problem which was then addressed by a greedy algorithm. As a subroutine of the algorithm involves optimizing a function which is not convex, they converted it to a difference of two convex functions. With DC programming techniques [20], the necessary and sufficient condition for finding a solution is obtained, which is then solved by a cutting plane algorithm.

7. Conclusion

We have presented a review of optimization techniques used for training SVMs. The KKT optimality conditions are a milestone in optimization theory, and the optimization of many real problems relies heavily on their application. We have shown how to instantiate the KKT conditions for SVMs. Along with the introduction of the SVM algorithms, the characterization of effective kernels has also been presented, which should be helpful for understanding SVMs with nonlinear classifiers.

For the optimization methodologies applied to SVMs, we have reviewed interior point algorithms, chunking and SMO, coordinate descent, active set methods, Newton's method for solving the primal, stochastic subgradient with projection, and cutting plane algorithms. The first four methods focus on the optimization of dual problems, while the last three directly deal with the optimization of the primal. This reflects current developments in SVM training.

Since their invention, SVMs have been extended to multiple variants in order to address different applications, such as the briefly introduced density estimation and regression problems. For kernel learning in SVM-type applications, we further introduced SDP and DC programming, which can be very useful for optimization problems more complex than traditional SVMs with fixed kernels.

We believe the optimization techniques introduced in this paper can be applied to other SVM-related research as well. Based on these techniques, readers can implement their own solvers for different purposes, for example contributing to one future line of research for training SVMs by developing fast and accurate approximation methods for the challenging problem of large-scale applications. Another direction is investigating recent promising optimization methods (e.g., [24] and its references) and applying them to variants of SVMs for particular machine learning applications.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Project 61075005, the Fundamental Research Funds for the Central Universities, and the PASCAL2 Network of Excellence. This publication only reflects the authors' views.

References

[1] A. Argyriou, R. Hauser, C. Micchelli, M. Pontil, A DC-programming algorithm for kernel selection, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 41–48.
[2] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society 68 (1950) 337–404.
[3] P. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural results, Journal of Machine Learning Research 3 (2002) 463–482.
[4] K. Bennett, A. Demiriz, Semi-supervised support vector machines, Advances in Neural Information Processing Systems 11 (1999) 368–374.
[5] D. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, Academic Press, New York, 1982.
[6] A. Bordes, S. Ertekin, J. Weston, L. Bottou, Fast kernel classifiers with online and active learning, Journal of Machine Learning Research 6 (2005) 1579–1619.
[7] A. Bordes, L. Bottou, P. Gallinari, J. Weston, Solving multiclass support vector machines with LaRank, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 89–96.
[8] B. Boser, I. Guyon, V. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the 5th ACM Workshop on Computational Learning Theory, 1992, pp. 144–152.
[9] S. Boyd, L. Xiao, A. Mutapcic, Subgradient Methods, available at http://www.stanford.edu/class/ee392o/subgrad_method.pdf, 2003.
[10] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, England, 2004.
[11] C. Campbell, Kernel methods: a survey of current techniques, Neurocomputing 48 (2002) 63–84.
[12] G. Cauwenberghs, T. Poggio, Incremental and decremental support vector machine learning, Advances in Neural Information Processing Systems 13 (2001) 409–415.
[13] C. Chang, C. Lin, LIBSVM: A Library for Support Vector Machines, available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[14] O. Chapelle, Training a support vector machine in the primal, Neural Computation 19 (2007) 1155–1178.
[15] T. Evgeniou, M. Pontil, Regularized multi-task learning, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 109–117.
[16] J. Farquhar, D. Hardoon, H. Meng, J. Shawe-Taylor, S. Szedmak, Two view learning: SVM-2K, theory and practice, Advances in Neural Information Processing Systems 18 (2006) 355–362.
[17] V. Franc, S. Sonnenburg, Optimized cutting plane algorithm for support vector machines, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 320–327.
[18] T. Friess, N. Cristianini, C. Campbell, The kernel adatron algorithm: a fast and simple learning procedure for support vector machines, in: Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 188–196.
[19] T. Hastie, S. Rosset, R. Tibshirani, J. Zhu, The entire regularization path for the support vector machine, Journal of Machine Learning Research 5 (2004) 1391–1415.
[20] R. Horst, V. Thoai, DC programming: overview, Journal of Optimization Theory and Applications 103 (1999) 1–41.
[21] C. Hsieh, K. Chang, C. Lin, S. Keerthi, S. Sundararajan, A dual coordinate descent method for large-scale linear SVM, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 408–415.
[22] T. Joachims, Making large-scale SVM learning practical, in: Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1998.
[23] T. Joachims, Training linear SVMs in linear time, in: Proceedings of the 12th ACM Conference on Knowledge Discovery and Data Mining, 2006, pp. 217–226.
[24] A. Juditsky, A. Nemirovski, C. Tauvel, Solving Variational Inequalities with Stochastic Mirror-Prox Algorithm, PASCAL EPrints, 2008.
[25] S. Karlin, Mathematical Methods and Theory in Games, Programming, and Economics, Addison-Wesley, Reading, MA, 1959.
[26] W. Karush, Minima of Functions of Several Variables with Inequalities as Side Constraints, Master's Thesis, Department of Mathematics, University of Chicago, 1939.
[27] S. Keerthi, D. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs, Journal of Machine Learning Research 6 (2005) 341–361.
[28] J. Kelley, The cutting-plane method for solving convex programs, Journal of the Society for Industrial and Applied Mathematics 8 (1960) 703–712.
[29] G. Kimeldorf, G. Wahba, A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, Annals of Mathematical Statistics 41 (1970) 495–502.
[30] J. Kivinen, A. Smola, R. Williamson, Online learning with kernels, IEEE Transactions on Signal Processing 52 (2004) 2165–2176.
[31] H. Kuhn, A. Tucker, Nonlinear programming, in: Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, 1951, pp. 481–492.
[32] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. Jordan, Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research 5 (2004) 27–72.
[33] J. Langford, J. Shawe-Taylor, PAC-Bayes and margins, Advances in Neural Information Processing Systems 15 (2002) 439–446.
[34] Y. Lee, O. Mangasarian, SSVM: a smooth support vector machine for classification, Computational Optimization and Applications 20 (2001) 5–22.
[35] J. Li, S. Sun, Nonlinear combination of multiple kernels for support vector machines, in: Proceedings of the 20th International Conference on Pattern Recognition, 2010, pp. 1–4.
[36] O. Mangasarian, Nonlinear Programming, McGraw-Hill, New York, 1969.
[37] O. Mangasarian, D. Musicant, Successive overrelaxation for support vector machines, IEEE Transactions on Neural Networks 10 (1999) 1032–1037.
[38] O. Mangasarian, A finite Newton method for classification, Optimization Methods and Software 17 (2002) 913–929.
[39] D. McAllester, Simplified PAC-Bayesian margin bounds, in: Proceedings of the 16th Annual Conference on Computational Learning Theory, 2003, pp. 203–215.
[40] C. Micchelli, M. Pontil, Learning the kernel function via regularization, Journal of Machine Learning Research 6 (2005) 1099–1125.
[41] J. Nocedal, S. Wright, Numerical Optimization, Springer-Verlag, New York, 1999.
[42] E. Osuna, R. Freund, F. Girosi, An improved training algorithm for support vector machines, in: Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, 1997, pp. 276–285.
[43] E. Osuna, R. Freund, F. Girosi, Support Vector Machines: Training and Applications, Technical Report AIM-1602, Massachusetts Institute of Technology, MA, 1997.
[44] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[45] C. Rasmussen, C. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[46] K. Scheinberg, An efficient implementation of an active set method for SVMs, Journal of Machine Learning Research 7 (2006) 2237–2257.
[47] A. Shilton, M. Palaniswami, D. Ralph, A. Tsoi, Incremental training of support vector machines, IEEE Transactions on Neural Networks 16 (2005) 114–131.
[48] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, R. Williamson, Estimating the support of a high-dimensional distribution, Neural Computation 13 (2001) 1443–1471.
[49] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[50] S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: primal estimated sub-gradient solver for SVM, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 807–814.
[51] J. Shawe-Taylor, P. Bartlett, R. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Transactions on Information Theory 44 (1998) 1926–1940.
[52] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, UK, 2004.
[53] N. Shor, Minimization Methods for Non-Differentiable Functions, Springer-Verlag, Berlin, 1985.
[54] M. Slater, A note on Motzkin's transposition theorem, Econometrica 19 (1951) 185–186.
[55] S. Sun, D. Hardoon, Active learning with extremely sparse labeled examples, Neurocomputing 73 (2010) 2980–2988.
[56] S. Sun, J. Shawe-Taylor, Sparse semi-supervised learning using conjugate functions, Journal of Machine Learning Research 11 (2010) 2423–2455.
[57] C. Teo, Q. Le, A. Smola, S. Vishwanathan, A scalable modular convex solver for regularized risk minimization, in: Proceedings of the 13th ACM Conference on Knowledge Discovery and Data Mining, 2007, pp. 727–736.
[58] L. Vandenberghe, S. Boyd, Semidefinite programming, SIAM Review 38 (1996) 49–95.
[59] R. Vanderbei, LOQO User's Manual—Version 3.10, Technical Report SOR-97-08, Statistics and Operations Research, Princeton University, 1997.
[60] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.
[61] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[62] L. Zanni, T. Serafini, G. Zanghirati, Parallel software for training large scale support vector machines on multiprocessor systems, Journal of Machine Learning Research 7 (2006) 1467–1492.
[63] T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 919–926.

John Shawe-Taylor is a professor at the Department of Computer Science, University College London (UK). His main research area is statistical learning theory, but his contributions range from neural networks to machine learning and graph theory. He obtained a Ph.D. in Mathematics at Royal Holloway, University of London in 1986. He subsequently completed an M.Sc. in the Foundations of Advanced Information Technology at Imperial College. He was promoted to professor of Computing Science in 1996. He has published over 150 research papers. He moved to the University of Southampton in 2003 to lead the ISIS research group. He has been the director of the Centre for Computational Statistics and Machine Learning at University College London since July 2006. He has coordinated a number of Europe-wide projects investigating the theory and practice of machine learning, including the NeuroCOLT projects. He is currently the scientific coordinator of a Framework VI Network of Excellence in Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL) involving 57 partners.

Shiliang Sun received the B.E. degree with honors in Automatic Control from the Department of Automatic Control, Beijing University of Aeronautics and Astronautics in 2002, and the Ph.D. degree with honors in Pattern Recognition and Intelligent Systems from the State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing, China, in 2007. In 2004, he was named a Microsoft Fellow. Currently, he is a professor at the Department of Computer Science and Technology and the founding director of the Pattern Recognition and Machine Learning Research Group, East China Normal University. From 2009 to 2010, he was a visiting researcher at the Department of Computer Science, University College London, working within the Centre for Computational Statistics and Machine Learning. He is a member of the IEEE, a member of the PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning) network of excellence, and serves on the editorial boards of multiple international journals. His research interests include machine learning, pattern recognition, computer vision, natural language processing and intelligent transportation systems.