Neurocomputing 74 (2011) 3609–3618
0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2011.06.026
⁎ Corresponding author at: Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China. Tel.: +86 21 54345186; fax: +86 21 54345119.
E-mail addresses: [email protected], [email protected] (S. Sun).
A review of optimization methodologies in support vector machines
John Shawe-Taylor a, Shiliang Sun a,b,⁎
a Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom
b Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China
Article info
Article history:
Received 13 March 2011
Received in revised form
24 June 2011
Accepted 26 June 2011
Communicated by C. Fyfe
Available online 12 August 2011
Keywords:
Decision support systems
Duality
Optimization methodology
Pattern classification
Support vector machine (SVM)
Abstract
Support vector machines (SVMs) are theoretically well-justified machine learning techniques which have also been successfully applied to many real-world domains. The use of optimization methodologies plays a central role in finding solutions of SVMs. This paper reviews representative and state-of-the-art techniques for optimizing the training of SVMs, especially SVMs for classification. The objective of this paper is to provide readers with an overview of the basic elements and recent advances in training SVMs, and to enable them to develop and implement new optimization strategies for SVM-related research.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Support vector machines (SVMs) are state-of-the-art machine learning techniques with their root in structural risk minimization [51,60]. Roughly speaking, by the theory of structural risk minimization, a function from a function class tends to have a low expectation of risk on all the data from an underlying distribution if it has a low empirical risk on a certain data set sampled from this distribution and, simultaneously, the function class has a low complexity, for example, as measured by the VC-dimension [61]. The well-known large margin principle in SVMs [8] essentially restricts the complexity of the function class, making use of the fact that for some function classes a larger margin corresponds to a lower fat-shattering dimension [51].
Since their invention [8], research on SVMs has exploded both in theory and applications. Recent theoretical advances in characterizing their generalization errors include Rademacher complexity theory [3,52] and PAC-Bayesian theory [33,39]. In practice, SVMs have been successfully applied to many real-world domains [11]. Moreover, SVMs have been combined with other research directions, for example, multitask learning, active learning, multi-view learning and semi-supervised learning, and have brought forward fruitful approaches [4,15,16,55,56].
The use of optimization methodologies plays an important role in training SVMs. Due to different requirements on training speed, memory constraints and accuracy of the optimization variables, practitioners should choose different optimization methods. It is thus instructive to review the optimization techniques used to train SVMs. Given that SVMs have been blended into the development of other learning mechanisms, such a review should also help readers formulate and solve their own optimization objectives arising from SVM-related research.
Since developments on optimization techniques for SVMs are still active, and SVMs have multiple variants depending on different purposes such as classification, regression and density estimation, it is impractical and unnecessary to include every optimization technique used so far. Therefore, this paper mainly focuses on reviewing representative optimization methodologies used for SVM classification. Being familiar with these techniques is, however, helpful for understanding other related optimization methods and for implementing optimization for other SVM variants.
The rest of this paper is organized as follows. Section 2 introduces the theory of convex optimization, which is closely related to the optimization of SVMs. Section 3 gives the formulation of SVMs for pattern classification, where the kernel trick is also involved. Also in Section 3, we derive the dual optimization problems and provide optimality conditions for SVMs; these constitute a foundation for the various optimization methodologies detailed in Section 4. Then we briefly describe SVMs for density estimation and regression in Section 5, and introduce optimization techniques used for learning kernels in Section 6. Finally, concluding remarks are given in Section 7.
2. Convex optimization theory
An optimization problem has the form
\[
\min f(x) \quad \text{s.t.}\quad f_i(x) \le 0,\; i = 1,\ldots,n. \qquad (1)
\]
Here, the vector $x \in \mathbb{R}^m$ is the optimization variable, the function $f:\mathbb{R}^m \to \mathbb{R}$ is the objective function, and the functions $f_i:\mathbb{R}^m \to \mathbb{R}$ $(i=1,\ldots,n)$ are the inequality constraint functions. The domain of this problem is $D = \operatorname{dom} f \cap \bigcap_{i=1}^n \operatorname{dom} f_i$. A vector from the domain is an element of the feasible set $D_f$ if and only if the constraints hold at this point. A vector $x^\ast$ is called optimal, or a solution of the optimization problem, if its objective value is the smallest among all vectors satisfying the constraints [10]. For now we do not assume the above optimization problem is convex.
The Lagrangian associated with problem (1) is defined as
\[
L(x,\alpha) = f(x) + \sum_{i=1}^n \alpha_i f_i(x), \quad \alpha_i \ge 0, \qquad (2)
\]
where $\alpha = [\alpha_1,\ldots,\alpha_n]^\top$. The scalar $\alpha_i$ is referred to as the Lagrange multiplier associated with the $i$th constraint. The vector $\alpha$ is called a dual variable or Lagrange multiplier vector. The Lagrange dual function is defined as the infimum of the Lagrangian:
\[
g(\alpha) = \inf_{x \in D_f} L(x,\alpha). \qquad (3)
\]
Denote the optimal value of problem (1) by $f^\ast$. It can easily be shown that $g(\alpha) \le f^\ast$ [10].

With the aim of approaching $f^\ast$, it is natural to define the following Lagrange dual optimization problem:
\[
\max g(\alpha) \quad \text{s.t.}\quad \alpha \succeq 0, \qquad (4)
\]
where $\alpha \succeq 0$ means $\alpha_i \ge 0$ $(i=1,\ldots,n)$.
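As a concrete illustration of (2)-(4) (our own toy example, not from the paper), consider minimizing $f(x)=x^2$ subject to $1-x \le 0$. Minimizing $L(x,\alpha)=x^2+\alpha(1-x)$ over $x$ gives $x=\alpha/2$, so $g(\alpha)=\alpha-\alpha^2/4$, whose maximum over $\alpha \ge 0$ equals the primal optimum $f^\ast = 1$ at $x^\ast = 1$:

```python
# Worked toy example of the Lagrange dual (assumed illustration):
# minimize f(x) = x^2 subject to f_1(x) = 1 - x <= 0.
# L(x, a) = x^2 + a*(1 - x); minimizing over x gives x = a/2,
# so g(a) = a - a^2/4, maximized at a* = 2 with g(a*) = 1 = f(x*) at x* = 1.

def g(a):
    """Lagrange dual function: g(a) = inf_x x**2 + a*(1 - x) = a - a**2 / 4."""
    return a - a * a / 4.0

# Coarse grid search over the dual variable a >= 0.
best_a = max((a / 100.0 for a in range(0, 501)), key=g)
print(best_a, g(best_a))  # 2.0 1.0 -- the dual optimum matches the primal optimum
```

Here the duality gap is zero, which is the situation characterized by the KKT conditions below.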
2.1. Optimality conditions for generic optimization problems
KKT (Karush–Kuhn–Tucker) conditions are among the most important sufficient criteria for solving generic optimization problems. They are given in the following theorem [26,31,36,49].
Theorem 1 (KKT saddle point conditions). Consider a generic optimization problem given by (1), where $f$ and $f_i$ are arbitrary functions. If a pair of variables $(x^\ast,\alpha^\ast)$ exists, where $x^\ast \in D \subseteq \mathbb{R}^m$ and $\alpha^\ast \succeq 0$, such that for all $x \in D \subseteq \mathbb{R}^m$ and $\alpha \in [0,\infty)^n$,
\[
L(x^\ast,\alpha) \le L(x^\ast,\alpha^\ast) \le L(x,\alpha^\ast), \qquad (5)
\]
then $x^\ast$ is a solution to the optimization problem.
One of the important insights derived from (5) is
\[
\alpha_i^\ast f_i(x^\ast) = 0, \quad i=1,\ldots,n, \qquad (6)
\]
which is usually called the complementary slackness condition. Furthermore, under the assumption that (5) holds, it is straightforward to prove that $L(x^\ast,\alpha^\ast) = f^\ast = g^\ast$, with $g^\ast$ being the optimal value of the dual problem and $\alpha^\ast$ the associated optimal solution. The important step is to show that
\[
\max_{\alpha \succeq 0} g(\alpha) = \max_{\alpha \succeq 0} \inf_{x \in D_f} L(x,\alpha) \ge \inf_{x \in D_f} L(x,\alpha^\ast) = L(x^\ast,\alpha^\ast) = f^\ast. \qquad (7)
\]
Combining this with $g(\alpha) \le f^\ast$ results in the assertion. Now there is no gap between the optimal values of the primal and dual problems. The KKT-gap $\sum_{i=1}^n \alpha_i f_i(x)$ vanishes when the optimal pair of solutions satisfying (5) is found.

If equality constraints are included in an optimization problem, such as $h(x)=0$, we can split each into two inequality constraints $h(x) \le 0$ and $-h(x) \le 0$ [49]. Suppose the optimization problem is
\[
\min f(x) \quad \text{s.t.}\quad f_i(x) \le 0,\; i=1,\ldots,n, \quad h_j(x) = 0,\; j=1,\ldots,e. \qquad (8)
\]
The Lagrangian associated with it is defined as
\[
L(x,\alpha,\beta) = f(x) + \sum_{i=1}^n \alpha_i f_i(x) + \sum_{j=1}^e \beta_j h_j(x), \quad \alpha_i \ge 0,\ \beta_j \in \mathbb{R}. \qquad (9)
\]
The corresponding KKT saddle point conditions for problem (8) are given by the following theorem [49].
Theorem 2 (KKT saddle point conditions with equality constraints). Consider a generic optimization problem given by (8), where $f$, $f_i$ and $h_j$ are arbitrary. If a triplet of variables $(x^\ast,\alpha^\ast,\beta^\ast)$ exists, where $x^\ast \in D \subseteq \mathbb{R}^m$, $\alpha^\ast \succeq 0$ and $\beta^\ast \in \mathbb{R}^e$, such that for all $x \in D \subseteq \mathbb{R}^m$, $\alpha \in [0,\infty)^n$ and $\beta \in \mathbb{R}^e$,
\[
L(x^\ast,\alpha,\beta) \le L(x^\ast,\alpha^\ast,\beta^\ast) \le L(x,\alpha^\ast,\beta^\ast), \qquad (10)
\]
then $x^\ast$ is a solution to the optimization problem, where $D$ is temporarily reused to represent the domain of problem (8).

As equality constraints can be equivalently addressed by splitting them into two inequality constraints, we will mainly use the formulation given in (1) for optimization treatment in the sequel.
2.2. Optimality conditions for convex optimization problems
The KKT saddle point conditions are sufficient for any optimization problem. However, it is more desirable to disclose necessary conditions when implementing optimization methodologies. Although it is difficult to provide necessary conditions for generic optimization problems, this is quite possible for some convex optimization problems having nice properties.
Definition 3 (Convex function). A function $f:\mathbb{R}^m \to \mathbb{R}$ is convex if its domain $\operatorname{dom} f$ is a convex set and if for $0 \le \theta \le 1$ and all $x, y \in \operatorname{dom} f$, the following holds:
\[
f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y). \qquad (11)
\]
If the functions $f$ and $f_i$ are all convex, then problem (1) is a convex optimization problem. For a convex optimization problem of form (1), both the domain $D$ and the feasible region
\[
D_f = \{x \in D : f_i(x) \le 0\ (i=1,\ldots,n)\} \qquad (12)
\]
can be proved to be convex sets.

The following theorem [36,49] states the nice properties required of the constraints in order to obtain the necessary optimality conditions for convex optimization problems.
Theorem 4 (Constraint qualifications). Suppose $D \subseteq \mathbb{R}^m$ is a convex domain, and the feasible region $D_f$ is defined by (12), where the functions $f_i : D \to \mathbb{R}$ are convex $(i=1,\ldots,n)$. Then the following three qualifications on the constraint functions $f_i$ are connected by (i) ⇔ (ii) and (iii) ⇒ (i):

(i) There exists an $x \in D$ s.t. $f_i(x) < 0$ for all $i=1,\ldots,n$ (Slater's condition [54]).
(ii) For all nonzero $\alpha \in [0,\infty)^n$ there exists an $x \in D$ s.t. $\sum_{i=1}^n \alpha_i f_i(x) \le 0$ (Karlin's condition [25]).
(iii) The feasible region $D_f$ contains at least two distinct points, and there exists an $x \in D_f$ s.t. all $f_i$ are strictly convex at $x$ with respect to $D_f$ (strict constraint qualification).

Now we reach the theorem [31,25,49] stating the necessity of the KKT conditions.
Theorem 5 (Necessity of KKT conditions). For a convex optimization problem defined by (1), if all the inequality constraints $f_i$ satisfy one of the constraint qualifications of Theorem 4, then the KKT saddle point conditions (5) given in Theorem 1 are necessary for optimality.

A convex optimization problem is defined as the minimization of a convex function over a convex set. It can be represented as
\[
\min f(x) \quad \text{s.t.}\quad f_i(x) \le 0,\; i=1,\ldots,n, \quad a_j^\top x = b_j,\; j=1,\ldots,e, \qquad (13)
\]
where the functions $f$ and $f_i$ are convex, and the equality constraints $h_j(x) = a_j^\top x - b_j$ are affine [10]. In this case, we can eliminate the equality constraints by updating the domain $D \cap \bigcap_{j=1}^e \{x : a_j^\top x = b_j\} \to D$ and then apply Theorem 5 to a convex problem with no equality constraints.

Slater's condition can be further refined when some of the inequality constraints in the convex problem defined by (1) are affine: the strict feasibility in Slater's condition for these constraints can be relaxed to feasibility [10].
2.3. Optimality conditions for differentiable convex problems
For SVMs, we often encounter convex problems whose objective and constraint functions are further differentiable. Suppose a convex problem (1) satisfies the conditions of Theorem 5. Then we can find its solutions making use of the KKT conditions (5). Under the assumption that the objective and constraint functions are differentiable, the KKT conditions can be simplified as in the following theorem [31,49]. Many optimization methodologies frequently use these conditions.
Theorem 6 (KKT for differentiable convex problems). Consider a convex problem (1) whose objective and constraint functions are differentiable at solutions. The vector $x^\ast$ is a solution if there exists some $\alpha^\ast \in \mathbb{R}^n$ s.t. the following conditions hold:
\[
\alpha^\ast \succeq 0 \quad (\text{required by the Lagrangian}), \qquad (14)
\]
\[
\partial_x L(x^\ast,\alpha^\ast) = \partial_x f(x^\ast) + \sum_{i=1}^n \alpha_i^\ast \partial_x f_i(x^\ast) = 0 \quad (\text{saddle point at } x^\ast), \qquad (15)
\]
\[
\partial_{\alpha_i} L(x^\ast,\alpha^\ast) = f_i(x^\ast) \le 0,\; i=1,\ldots,n \quad (\text{saddle point at } \alpha^\ast), \qquad (16)
\]
\[
\alpha_i^\ast f_i(x^\ast) = 0,\; i=1,\ldots,n \quad (\text{zero KKT-gap}). \qquad (17)
\]
Note that the sign '$\le$' in (16) is correct as a result of the domain of $\alpha_i$ $(i=1,\ldots,n)$ in the Lagrangian being $[0,\infty)$.
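The four conditions can be checked mechanically. A minimal sketch (our own assumed example): for $\min (x-2)^2$ s.t. $x - 1 \le 0$, the candidate $x^\ast = 1$ with multiplier $\alpha^\ast = 2$ satisfies (14)-(17):

```python
# Check Theorem 6's conditions (14)-(17) for:
# minimize f(x) = (x - 2)**2  s.t.  f1(x) = x - 1 <= 0.
# Candidate solution: x* = 1 with Lagrange multiplier a* = 2.

x_star, a_star = 1.0, 2.0

grad_f = 2.0 * (x_star - 2.0)   # df/dx at x*
grad_f1 = 1.0                   # df1/dx (constant)
f1 = x_star - 1.0               # constraint value at x*

assert a_star >= 0                              # (14) dual feasibility
assert abs(grad_f + a_star * grad_f1) < 1e-12   # (15) Lagrangian stationarity
assert f1 <= 0                                  # (16) primal feasibility
assert abs(a_star * f1) < 1e-12                 # (17) zero KKT-gap
print("all KKT conditions hold at (x*, a*) = (1, 2)")
```

By Theorem 5, for this convex problem (Slater's condition holds, e.g., at $x=0$) the same conditions are also necessary.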
3. SVMs and kernels
In this section, we describe the optimization formulations, dual optimization problems, optimality conditions and related concepts for SVMs for binary classification, which will prove very useful for the various SVM optimization methodologies introduced in later sections. After addressing linear classifiers for linearly separable and nonseparable problems, respectively, we describe nonlinear classifiers with kernels.

Henceforth, we use boldface $\mathbf{x}$ to denote an example vector, and $n$ will be used to denote the number of training examples, for consistency with most literature on SVMs. Given a training set $S = \{\mathbf{x}_i, y_i\}_{i=1}^n$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{1,-1\}$, the aim of SVMs is to induce a classifier which has good classification performance on future unseen examples.
3.1. Linear classifier for linearly separable problems
When the data set $S$ is linearly separable, SVMs solve the problem
\[
\min_{\mathbf{w},b}\ \Phi(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1,\; i=1,\ldots,n, \qquad (18)
\]
where the constraints represent the linearly separable property. The large margin principle is reflected by minimizing $\frac{1}{2}\|\mathbf{w}\|^2$, with $2/\|\mathbf{w}\|$ being the margin between the two hyperplanes $\mathbf{w}^\top \mathbf{x} + b = 1$ and $\mathbf{w}^\top \mathbf{x} + b = -1$. The SVM classifier would be
\[
c_{\mathbf{w},b}(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b). \qquad (19)
\]
Problem (18) is a differentiable convex problem with affine constraints, and therefore the constraint qualification is satisfied by the refined Slater's condition. Theorem 6 is suitable for solving the current optimization problem.
The Lagrangian is constructed as
\[
L(\mathbf{w},b,\boldsymbol{\lambda}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \lambda_i [y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1], \quad \lambda_i \ge 0, \qquad (20)
\]
where $\boldsymbol{\lambda} = [\lambda_1,\ldots,\lambda_n]^\top$ are the associated Lagrange multipliers. Using the superscript star to denote the solutions of the optimization problem, we have
\[
\partial_{\mathbf{w}} L(\mathbf{w}^\ast,b^\ast,\boldsymbol{\lambda}^\ast) = \mathbf{w}^\ast - \sum_{i=1}^n \lambda_i^\ast y_i \mathbf{x}_i = 0, \qquad (21)
\]
\[
\partial_b L(\mathbf{w}^\ast,b^\ast,\boldsymbol{\lambda}^\ast) = -\sum_{i=1}^n \lambda_i^\ast y_i = 0. \qquad (22)
\]
From (21), the solution $\mathbf{w}^\ast$ has the form
\[
\mathbf{w}^\ast = \sum_{i=1}^n \lambda_i^\ast y_i \mathbf{x}_i. \qquad (23)
\]
The training examples for which $\lambda_i^\ast > 0$ are called support vectors, because the other examples, with $\lambda_i^\ast = 0$, can be omitted from the expression.
Substituting (21) and (22) into the Lagrangian results in
\[
g(\boldsymbol{\lambda}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \lambda_i y_i \mathbf{w}^\top \mathbf{x}_i + \sum_{i=1}^n \lambda_i
= \sum_{i=1}^n \lambda_i + \tfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i \lambda_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j - \sum_{i=1}^n\sum_{j=1}^n \lambda_i \lambda_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j
= \sum_{i=1}^n \lambda_i - \tfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i \lambda_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j. \qquad (24)
\]
Thus the dual optimization problem is
\[
\max_{\boldsymbol{\lambda}}\ g(\boldsymbol{\lambda}) = \boldsymbol{\lambda}^\top \mathbf{1} - \tfrac{1}{2}\boldsymbol{\lambda}^\top D \boldsymbol{\lambda} \quad \text{s.t.}\quad \boldsymbol{\lambda}^\top \mathbf{y} = 0,\; \boldsymbol{\lambda} \succeq 0, \qquad (25)
\]
where $\mathbf{y} = [y_1,\ldots,y_n]^\top$ and $D$ is a symmetric $n \times n$ matrix with entries $D_{ij} = y_i y_j \mathbf{x}_i^\top \mathbf{x}_j$ [43,52].

The zero KKT-gap requirement (also called the complementary slackness condition) implies that
\[
\lambda_i^\ast [y_i(\mathbf{x}_i^\top \mathbf{w}^\ast + b^\ast) - 1] = 0, \quad i=1,\ldots,n. \qquad (26)
\]
Therefore, for all support vectors the constraints are active with equality. The bias $b^\ast$ can thus be calculated as
\[
b^\ast = y_i - \mathbf{x}_i^\top \mathbf{w}^\ast \qquad (27)
\]
using any support vector $\mathbf{x}_i$. With $\boldsymbol{\lambda}^\ast$ and $b^\ast$ calculated, the SVM decision function can be represented as
\[
c^\ast(\mathbf{x}) = \operatorname{sign}\Big(\sum_{i=1}^n y_i \lambda_i^\ast \mathbf{x}^\top \mathbf{x}_i + b^\ast\Big). \qquad (28)
\]
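The whole pipeline (23), (25), (27), (28) can be traced on a two-point toy problem where the dual has a closed form. A minimal sketch (our own assumed example, not from the paper): with $x_1 = 1$ $(y_1 = +1)$ and $x_2 = -1$ $(y_2 = -1)$ in 1-D, the constraint $\boldsymbol{\lambda}^\top \mathbf{y} = 0$ forces $\lambda_1 = \lambda_2 = \lambda$, so $g(\lambda) = 2\lambda - 2\lambda^2$ is maximized at $\lambda^\ast = 1/2$:

```python
# Hard-margin SVM on two 1-D points: x1 = 1 (y1 = +1), x2 = -1 (y2 = -1).
# Dual (25) reduces to g(l) = 2l - 2l^2 with maximizer l* = 1/2.

xs, ys = [1.0, -1.0], [1, -1]
lam = 0.5                                    # analytic maximizer of g

# w* from the expansion (23), b* from (27) using support vector x1.
w = sum(lam * y * x for y, x in zip(ys, xs))
b = ys[0] - xs[0] * w

def classify(x):
    """Decision function (28) for this 1-D toy problem."""
    return 1 if w * x + b >= 0 else -1

print(w, b, classify(3.0), classify(-0.2))   # 1.0 0.0 1 -1
```

Both training points are support vectors here and sit exactly on the margin hyperplanes, consistent with (26).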
3.2. Linear classifier for linearly nonseparable problems
In the case that the data set is not linearly separable but wewould still like to learn a linear classifier, a loss on the violation ofthe linearly separable constraints has to be introduced. A com-mon choice is the hinge loss
maxð0,1�yiðw>xiþbÞÞ, ð29Þ
which can be represented using a slack variable xi.The optimization problem is formulated as
\[
\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \Phi(\mathbf{w},\boldsymbol{\xi}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i,\;\ \xi_i \ge 0,\; i=1,\ldots,n, \qquad (30)
\]
where the scalar $C$ controls the balance between the margin and the empirical loss. This problem is also a differentiable convex problem with affine constraints. The constraint qualification is satisfied by the refined Slater's condition.
The Lagrangian of problem (30) is
\[
L(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\gamma}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \lambda_i[y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 + \xi_i] - \sum_{i=1}^n \gamma_i \xi_i, \quad \lambda_i \ge 0,\ \gamma_i \ge 0, \qquad (31)
\]
where $\boldsymbol{\lambda} = [\lambda_1,\ldots,\lambda_n]^\top$ and $\boldsymbol{\gamma} = [\gamma_1,\ldots,\gamma_n]^\top$ are the associated Lagrange multipliers. Making use of Theorem 6, we obtain
\[
\partial_{\mathbf{w}} L(\mathbf{w}^\ast,b^\ast,\boldsymbol{\xi}^\ast,\boldsymbol{\lambda}^\ast,\boldsymbol{\gamma}^\ast) = \mathbf{w}^\ast - \sum_{i=1}^n \lambda_i^\ast y_i \mathbf{x}_i = 0, \qquad (32)
\]
\[
\partial_b L(\mathbf{w}^\ast,b^\ast,\boldsymbol{\xi}^\ast,\boldsymbol{\lambda}^\ast,\boldsymbol{\gamma}^\ast) = -\sum_{i=1}^n \lambda_i^\ast y_i = 0, \qquad (33)
\]
\[
\partial_{\xi_i} L(\mathbf{w}^\ast,b^\ast,\boldsymbol{\xi}^\ast,\boldsymbol{\lambda}^\ast,\boldsymbol{\gamma}^\ast) = C - \lambda_i^\ast - \gamma_i^\ast = 0, \quad i=1,\ldots,n, \qquad (34)
\]
where (32) and (33) are identical to (21) and (22), respectively, for the separable case.
The dual optimization problem is derived as
\[
\max_{\boldsymbol{\lambda}}\ g(\boldsymbol{\lambda}) = \boldsymbol{\lambda}^\top \mathbf{1} - \tfrac{1}{2}\boldsymbol{\lambda}^\top D \boldsymbol{\lambda} \quad \text{s.t.}\quad \boldsymbol{\lambda}^\top \mathbf{y} = 0,\; \boldsymbol{\lambda} \succeq 0,\; \boldsymbol{\lambda} \preceq C\mathbf{1}, \qquad (35)
\]
where $\mathbf{y}$ and $D$ have the same meanings as in problem (25). The complementary slackness conditions require
\[
\lambda_i^\ast [y_i(\mathbf{x}_i^\top \mathbf{w}^\ast + b^\ast) - 1 + \xi_i^\ast] = 0, \quad i=1,\ldots,n,
\]
\[
\gamma_i^\ast \xi_i^\ast = 0, \quad i=1,\ldots,n. \qquad (36)
\]
Combining (34) and (36), we can solve $b^\ast = y_i - \mathbf{x}_i^\top \mathbf{w}^\ast$ for any support vector $\mathbf{x}_i$ with $0 < \lambda_i^\ast < C$. The existence of such a $\lambda_i^\ast$ is equivalent to the assumption that there is at least one support vector with functional output $y_i(\mathbf{x}_i^\top \mathbf{w}^\ast + b^\ast) = 1$ [43]. Usually this is true. If this assumption is violated, however, other user-defined techniques have to be used to determine $b^\ast$. Once $\boldsymbol{\lambda}^\ast$ and $b^\ast$ are solved, the SVM decision function is given by
\[
c^\ast(\mathbf{x}) = \operatorname{sign}\Big(\sum_{i=1}^n y_i \lambda_i^\ast \mathbf{x}^\top \mathbf{x}_i + b^\ast\Big). \qquad (37)
\]
3.3. Nonlinear classifier with kernels
The use of kernel functions (kernels for short) provides a powerful and principled approach to detecting nonlinear relations with a linear method in an appropriate feature space [52]. The design of kernels and linear methods can be decoupled, which largely facilitates the modularity of machine learning methods, including SVMs.
Definition 7 (Kernel function). A kernel is a function $k$ that for all $\mathbf{x}, \mathbf{z}$ from a space $X$ (which need not be a vector space) satisfies
\[
k(\mathbf{x},\mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle, \qquad (38)
\]
where $\phi$ is a mapping from the space $X$ to a Hilbert space $F$ that is usually called the feature space:
\[
\phi : \mathbf{x} \in X \mapsto \phi(\mathbf{x}) \in F. \qquad (39)
\]
To verify that a function is a kernel, one approach is to construct a feature space for which the function value on two inputs corresponds to first performing an explicit feature mapping and then computing the inner product between the images of the inputs. An alternative approach, which is more widely used, is to investigate the finitely positive semidefinite property [52].
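The explicit-feature-map route can be made concrete for the degree-2 polynomial kernel in two dimensions (our own illustration): the map $\phi(\mathbf{x}) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1x_2, x_2^2)$ satisfies $\langle\phi(\mathbf{x}),\phi(\mathbf{z})\rangle = (1+\mathbf{x}^\top\mathbf{z})^2$.

```python
import numpy as np

# Explicit feature map for the d = 2 polynomial kernel (1 + x.z)**2:
# phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, sqrt(2)x1x2, x2^2).

def phi(x):
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
lhs = phi(x) @ phi(z)          # inner product in the feature space
rhs = (1.0 + x @ z) ** 2       # kernel evaluated directly
print(abs(lhs - rhs) < 1e-9)   # True
```

Expanding $(1+\mathbf{x}^\top\mathbf{z})^2 = 1 + 2\mathbf{x}^\top\mathbf{z} + (\mathbf{x}^\top\mathbf{z})^2$ and matching terms shows the identity holds for all inputs, not just this pair.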
Definition 8 (Finitely positive semidefinite function). A function $k : X \times X \to \mathbb{R}$ satisfies the finitely positive semidefinite property if it is a symmetric function for which the kernel matrices $K$ with $K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$, formed by restriction to any finite subset of the space $X$, are positive semidefinite.

The following theorem [52] justifies the use of the above property for characterizing kernels [2,49].
Theorem 9 (Characterization of kernels). A function $k : X \times X \to \mathbb{R}$, which is either continuous or has a countable domain, can be decomposed as an inner product in a Hilbert space $F$ by a feature map $\phi$ applied to both its arguments,
\[
k(\mathbf{x},\mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle, \qquad (40)
\]
if and only if it satisfies the finitely positive semidefinite property.

Normally, a Hilbert space is defined as an inner product space that is complete. If the separability property is further added to the definition of a Hilbert space, the requirement in Theorem 9 that the kernel be continuous or the input space countable becomes necessary to address this issue. The Hilbert space constructed in proving Theorem 9 is called the reproducing kernel Hilbert space (RKHS), because of the following reproducing property of the kernel resulting from the defined inner product:
\[
\langle f_F,\, k(\mathbf{x},\cdot) \rangle = f_F(\mathbf{x}), \qquad (41)
\]
where $f_F$ is a function in the space $F$ of functions, and the function $k(\mathbf{x},\cdot)$ is the mapping $\phi(\mathbf{x})$.
Besides the linear kernel $k(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top \mathbf{z}$, other commonly used kernels are the polynomial kernel
\[
k(\mathbf{x},\mathbf{z}) = (1 + \mathbf{x}^\top \mathbf{z})^{\deg}, \qquad (42)
\]
where $\deg$ is the degree of the polynomial, and the Gaussian radial basis function (RBF) kernel (Gaussian kernel for short)
\[
k(\mathbf{x},\mathbf{z}) = \exp\Big({-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}}\Big). \qquad (43)
\]
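The finitely positive semidefinite property of Definition 8 can be checked numerically for the Gaussian kernel on any finite point set. A small sketch (our own illustration):

```python
import numpy as np

# Build a Gaussian kernel matrix (43) on a few random points and verify it is
# symmetric positive semidefinite, as Definition 8 requires.

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                  # five 3-D points
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

assert np.allclose(K, K.T)                       # symmetry
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10                    # PSD up to numerical tolerance
print("smallest eigenvalue:", eigvals.min())
```

Of course, passing this check on one finite subset is only a necessary condition; Theorem 9 requires it for every finite subset of $X$.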
Using the kernel trick, the optimization problem (35) for SVMs becomes
\[
\max_{\boldsymbol{\lambda}}\ g(\boldsymbol{\lambda}) = \boldsymbol{\lambda}^\top \mathbf{1} - \tfrac{1}{2}\boldsymbol{\lambda}^\top D \boldsymbol{\lambda} \quad \text{s.t.}\quad \boldsymbol{\lambda}^\top \mathbf{y} = 0,\; \boldsymbol{\lambda} \succeq 0,\; \boldsymbol{\lambda} \preceq C\mathbf{1}, \qquad (44)
\]
where the entries of $D$ are $D_{ij} = y_i y_j k(\mathbf{x}_i,\mathbf{x}_j)$. The solution for the SVM classifier is formulated as
\[
c^\ast(\mathbf{x}) = \operatorname{sign}\Big(\sum_{i=1}^n y_i \lambda_i^\ast k(\mathbf{x}_i,\mathbf{x}) + b^\ast\Big). \qquad (45)
\]
4. Optimization methodologies for SVMs
For small and moderately sized problems (say, with fewer than 5000 examples), interior point algorithms are reliable and accurate optimization methods to choose [49]. For large-scale problems, methods that exploit the sparsity of the dual variables must be adopted; if a further limitation on storage is required, a compact representation should be considered, for example, using an approximation to the kernel matrix.
4.1. Interior point algorithms
An interior point is a pair of primal and dual variables which satisfy the primal and dual constraints, respectively. Below we provide the main procedure of interior point algorithms for SVM optimization, given the fundamental importance of this optimization technique.

Suppose now we are focusing on solving an equivalent problem of problem (44):
\[
\min_{\boldsymbol{\lambda}}\ \tfrac{1}{2}\boldsymbol{\lambda}^\top D \boldsymbol{\lambda} - \boldsymbol{\lambda}^\top \mathbf{1} \quad \text{s.t.}\quad \boldsymbol{\lambda}^\top \mathbf{y} = 0,\; \boldsymbol{\lambda} \preceq C\mathbf{1},\; \boldsymbol{\lambda} \succeq 0. \qquad (46)
\]
Introducing Lagrange multipliers $h$, $\mathbf{s}$ and $\mathbf{z}$, where $\mathbf{s},\mathbf{z} \succeq 0$ and $h$ is free, we can get the Lagrangian
\[
\tfrac{1}{2}\boldsymbol{\lambda}^\top D \boldsymbol{\lambda} - \boldsymbol{\lambda}^\top \mathbf{1} + \mathbf{s}^\top(\boldsymbol{\lambda} - C\mathbf{1}) - \mathbf{z}^\top \boldsymbol{\lambda} + h\boldsymbol{\lambda}^\top \mathbf{y}. \qquad (47)
\]
The KKT conditions from Theorem 6 are instantiated as
\[
D\boldsymbol{\lambda} - \mathbf{1} + \mathbf{s} - \mathbf{z} + h\mathbf{y} = 0 \quad (\text{dual feasibility}),
\]
\[
\boldsymbol{\lambda}^\top \mathbf{y} = 0, \quad \boldsymbol{\lambda} \preceq C\mathbf{1}, \quad \boldsymbol{\lambda} \succeq 0 \quad (\text{primal feasibility}),
\]
\[
\mathbf{s}^\top(\boldsymbol{\lambda} - C\mathbf{1}) = 0, \quad \mathbf{z}^\top \boldsymbol{\lambda} = 0 \quad (\text{zero KKT-gap}),
\]
\[
\mathbf{s} \succeq 0, \quad \mathbf{z} \succeq 0. \qquad (48)
\]
Usually, the conditions $\boldsymbol{\lambda} \preceq C\mathbf{1}$ and $\mathbf{s}^\top(\boldsymbol{\lambda} - C\mathbf{1}) = 0$ are transformed to
\[
\boldsymbol{\lambda} + \boldsymbol{\gamma} = C\mathbf{1}, \quad \mathbf{s}^\top \boldsymbol{\gamma} = 0, \quad \boldsymbol{\gamma} \succeq 0. \qquad (49)
\]
Instead of requiring $\mathbf{z}^\top \boldsymbol{\lambda} = 0$ and $\mathbf{s}^\top \boldsymbol{\gamma} = 0$, according to the primal-dual path-following algorithm [59], these conditions are modified to $z_i \lambda_i = \mu$ and $s_i \gamma_i = \mu$ $(i=1,\ldots,n)$ for $\mu > 0$. After solving the variables for a certain $\mu$, we then decrease $\mu$ toward 0. The advantage is that this facilitates the use of a Newton-type predictor-corrector algorithm to update $\boldsymbol{\lambda}, \boldsymbol{\gamma}, h, \mathbf{s}, \mathbf{z}$ [49].
Suppose we already have appropriate initializations for $\boldsymbol{\lambda}, \boldsymbol{\gamma}, h, \mathbf{s}, \mathbf{z}$, and would like to calculate their increments. For this purpose, expanding the equalities in (48) and (49), e.g., substituting $\boldsymbol{\lambda}$ with $\boldsymbol{\lambda} + \Delta\boldsymbol{\lambda}$, yields
\[
D\Delta\boldsymbol{\lambda} + \Delta\mathbf{s} - \Delta\mathbf{z} + \mathbf{y}\Delta h = -D\boldsymbol{\lambda} + \mathbf{1} - \mathbf{s} + \mathbf{z} - h\mathbf{y} =: r_1,
\]
\[
\mathbf{y}^\top \Delta\boldsymbol{\lambda} = -\mathbf{y}^\top \boldsymbol{\lambda},
\]
\[
\Delta\boldsymbol{\lambda} + \Delta\boldsymbol{\gamma} = C\mathbf{1} - \boldsymbol{\lambda} - \boldsymbol{\gamma} = 0,
\]
\[
\gamma_i^{-1} s_i \Delta\gamma_i + \Delta s_i = \mu\gamma_i^{-1} - s_i - \gamma_i^{-1}\Delta\gamma_i\Delta s_i =: r_{kkt1_i}, \quad i=1,\ldots,n,
\]
\[
\lambda_i^{-1} z_i \Delta\lambda_i + \Delta z_i = \mu\lambda_i^{-1} - z_i - \lambda_i^{-1}\Delta\lambda_i\Delta z_i =: r_{kkt2_i}, \quad i=1,\ldots,n. \qquad (50)
\]
Since
\[
\Delta s_i = r_{kkt1_i} + \gamma_i^{-1} s_i \Delta\lambda_i, \quad
\Delta z_i = r_{kkt2_i} - \lambda_i^{-1} z_i \Delta\lambda_i, \quad
\Delta\boldsymbol{\gamma} = -\Delta\boldsymbol{\lambda}, \qquad (51)
\]
we only need to further solve for $\Delta\boldsymbol{\lambda}$ and $\Delta h$. Substituting (51) into the first two equalities of (50), we have
\[
\begin{bmatrix} D + \operatorname{diag}(\gamma_i^{-1}s_i + \lambda_i^{-1}z_i) & \mathbf{y} \\ \mathbf{y}^\top & 0 \end{bmatrix}
\begin{bmatrix} \Delta\boldsymbol{\lambda} \\ \Delta h \end{bmatrix}
=
\begin{bmatrix} r_1 - r_{kkt1} + r_{kkt2} \\ -\mathbf{y}^\top\boldsymbol{\lambda} \end{bmatrix}, \qquad (52)
\]
where
\[
\boldsymbol{\gamma}^{-1} = [1/\gamma_1,\ldots,1/\gamma_n]^\top, \quad
\boldsymbol{\lambda}^{-1} = [1/\lambda_1,\ldots,1/\lambda_n]^\top, \quad
r_{kkt1} = [r_{kkt1_1},\ldots,r_{kkt1_n}]^\top, \quad
r_{kkt2} = [r_{kkt2_1},\ldots,r_{kkt2_n}]^\top. \qquad (53)
\]
Then the Newton-type predictor-corrector algorithm, which aims to achieve performance similar to higher-order methods without implementing them, is adopted to solve (52) and the other variables.

By the blockwise matrix inversion formula with the Schur complement technique, the above linear equation can be readily solved if we can invert $D + \operatorname{diag}(\gamma_i^{-1}s_i + \lambda_i^{-1}z_i)$. The standard Cholesky decomposition, which is numerically extremely stable, is suitable for this, with computational complexity $n^3/6$ [45].

Generally, interior point algorithms are a good choice for small and moderately sized problems, with high reliability and precision. Because they involve a Cholesky decomposition of a matrix scaled by the number of training examples, the computation is expensive for large-scale data. One way to overcome this is to combine interior point algorithms with sparse greedy matrix approximation to provide a feasible computation of the matrix inverse [49].
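The Schur-complement step for the structured system (52) can be sketched as follows (our own illustration, with a generic SPD matrix $A$ standing in for $D + \operatorname{diag}(\cdot)$ and generic right-hand sides): eliminate $\Delta\boldsymbol{\lambda}$ to get $\Delta h = (\mathbf{y}^\top A^{-1} r - t)/(\mathbf{y}^\top A^{-1}\mathbf{y})$, then back-substitute, applying $A^{-1}$ through a Cholesky factorization.

```python
import numpy as np

# Solve [[A, y], [y^T, 0]] [dl; dh] = [r; t] via the Schur complement,
# applying A^{-1} through a Cholesky factorization of the SPD block A.

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # stand-in for D + diag(...), SPD
y = rng.standard_normal(n)
r = rng.standard_normal(n)
t = rng.standard_normal()

L = np.linalg.cholesky(A)

def solve_A(b):
    """Apply A^{-1} using the triangular Cholesky factors."""
    return np.linalg.solve(L.T, np.linalg.solve(L, b))

Ainv_r, Ainv_y = solve_A(r), solve_A(y)
dh = (y @ Ainv_r - t) / (y @ Ainv_y)
dl = solve_A(r - y * dh)

# Cross-check against solving the full (n+1)-dimensional system directly.
K = np.block([[A, y[:, None]], [y[None, :], np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.append(r, t))
print(np.allclose(np.append(dl, dh), sol))  # True
```

The point of the factorization is reuse: one Cholesky decomposition serves both triangular solves, which is what makes the per-iteration cost of the predictor-corrector updates manageable.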
4.2. Chunking, sequential minimal optimization
The chunking algorithm [60] relies on the fact that the value of the quadratic form in (46) is unchanged if we remove the rows and columns of $D$ that are associated with zero entries of $\boldsymbol{\lambda}$. That is, we need only employ the support vectors to represent the final classifier. Thus, the large quadratic programming problem can be divided into a series of smaller subproblems.

Chunking starts with a subset (chunk) of the training data and solves a small quadratic programming (QP) subproblem. It then retains the support vectors, replaces the other data in the chunk with a certain number of examples violating the KKT conditions, and solves the second subproblem for a new classifier. Each subproblem is initialized with the solutions of the previous subproblem. At the last step, all nonzero entries of $\boldsymbol{\lambda}$ can be identified.
Osuna et al. [42] gave a theorem on the convergence of breaking down a large QP problem into a series of smaller QP subproblems. As long as at least one example violating the KKT conditions is added to the examples of the previous subproblem, each step decreases the objective function and has a feasible point satisfying all constraints [44]. According to this theorem, the chunking algorithm will converge to the globally optimal solution.
The sequential minimal optimization (SMO) algorithm, proposed by Platt [44], is a special case of chunking, which iteratively solves the smallest possible optimization problem, involving two examples each time:
\[
\min_{\lambda_i,\lambda_j}\ \tfrac{1}{2}\big[\lambda_i^2 D_{ii} + 2\lambda_i\lambda_j D_{ij} + \lambda_j^2 D_{jj}\big] - \lambda_i - \lambda_j \quad \text{s.t.}\quad \lambda_i y_i + \lambda_j y_j = r_{ij},\; 0 \le \lambda_i \le C,\; 0 \le \lambda_j \le C, \qquad (54)
\]
where the scalar $r_{ij}$ is computed by fulfilling the constraint $\boldsymbol{\lambda}^\top \mathbf{y} = 0$ of the full problem.
The advantage of SMO is that a QP problem with two examples can be solved analytically, and thus a numerical QP solver is avoided. It can be more than 1000 times faster and exhibit better scaling (up to one order better) in the training set size than the classical chunking algorithm given in [60]. Convergence is guaranteed as long as at least one of the two selected examples violates the KKT conditions before solving the subproblem. However, it should be noted that the SMO algorithm will not converge as quickly if solutions with a high accuracy are needed [44]. A more recent approach is the LASVM algorithm [6], which optimizes the cost function of SVMs with a reorganization of the SMO direction searches and can converge to the SVM solution.
There are other decomposition techniques based on the same idea of working set selection, where more than two examples are selected. Fast convergence and inexpensive computational cost during each iteration are two conflicting goals. Two widely used decomposition packages are LIBSVM [13] and SVMlight [22]. LIBSVM is implemented for working sets of two examples, and exploits second-order information for working set selection. Different from LIBSVM, SVMlight can exploit working sets of more than two examples (up to $O(10^3)$) and thus needs a QP solver for the subproblems. Recently, following the SVMlight framework, Zanni et al. [62] proposed a parallel SVM software with gradient projection QP solvers. Experiments on data sets sized $O(10^6)$ show that training nonlinear SVMs with $O(10^5)$ support vectors can be completed in a few hours with some tens of processors [62].
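The analytic two-variable update behind (54) can be sketched compactly. The following is a simplified SMO in the spirit of Platt's algorithm (our own illustration: exhaustive pair sweeps instead of Platt's heuristic working set selection, linear kernel, toy data); each pair update moves $\lambda_j$ along the constraint line $\lambda_i y_i + \lambda_j y_j = \text{const}$ and clips to the box $[0,C]$:

```python
import numpy as np

# Simplified SMO: analytically solve subproblem (54) for one pair (i, j) at a
# time, preserving lam_i*y_i + lam_j*y_j and clipping to the box [0, C].

def smo_train(X, y, C=1.0, passes=20, tol=1e-5):
    n = len(y)
    K = X @ X.T                              # linear kernel matrix
    lam, b = np.zeros(n), 0.0
    for _ in range(passes):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                Ei = (lam * y) @ K[:, i] + b - y[i]   # prediction errors
                Ej = (lam * y) @ K[:, j] + b - y[j]
                eta = K[i, i] + K[j, j] - 2 * K[i, j] # curvature of (54)
                if eta <= 0:
                    continue
                lj_new = lam[j] + y[j] * (Ei - Ej) / eta
                if y[i] == y[j]:                      # feasible segment ends
                    L, H = max(0.0, lam[i] + lam[j] - C), min(C, lam[i] + lam[j])
                else:
                    L, H = max(0.0, lam[j] - lam[i]), min(C, C + lam[j] - lam[i])
                if L >= H:
                    continue
                lj_new = min(max(lj_new, L), H)
                if abs(lj_new - lam[j]) < tol:
                    continue
                lam[i] += y[i] * y[j] * (lam[j] - lj_new)  # keep lam^T y = 0
                lam[j] = lj_new
                # recompute b from a free support vector, as in (27)
                free = (lam > tol) & (lam < C - tol)
                idx = np.where(free)[0] if free.any() else np.where(lam > tol)[0]
                if len(idx):
                    k = idx[0]
                    b = y[k] - (lam * y) @ K[:, k]
    return lam, b

X = np.array([[2.0, 0], [3.0, 1], [-2.0, 0], [-3.0, -1]])
y = np.array([1, 1, -1, -1])
lam, b = smo_train(X, y)
w = (lam * y) @ X
pred = np.sign(X @ w + b)
print(pred)  # [ 1.  1. -1. -1.] -- matches y on this toy set
```

Real implementations differ mainly in how the pair $(i,j)$ is selected (e.g., maximal KKT violation or second-order gain), which is what determines practical convergence speed.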
4.3. Coordinate descent
Coordinate descent is a popular optimization technique that updates one variable at a time by optimizing a subproblem involving only a single variable [21]. It has been discussed and applied to SVM optimization, e.g., in the work of [7,18,37]. Here we introduce the coordinate descent method used by Hsieh et al. [21] for training large-scale linear SVMs in the dual form, though coordinate descent is certainly not confined to linear SVMs.

In some applications where the features of examples are high dimensional, SVMs with the linear kernel and with nonlinear kernels have similar performance. Furthermore, if the linear kernel is used, much larger data sets can be adopted to train SVMs [21]. Linear SVMs are among the most popular tools for dealing with this kind of application.
By appending each example with a constant feature 1, Hsieh et al. [21] absorbed the bias $b$ into the weight vector $\mathbf{w}$:
\[
\mathbf{x}^\top \leftarrow [\mathbf{x}^\top, 1], \quad \mathbf{w}^\top \leftarrow [\mathbf{w}^\top, b], \qquad (55)
\]
and solved the following dual problem:
\[
\min_{\boldsymbol{\lambda}}\ f(\boldsymbol{\lambda}) = \tfrac{1}{2}\boldsymbol{\lambda}^\top D \boldsymbol{\lambda} - \boldsymbol{\lambda}^\top \mathbf{1} \quad \text{s.t.}\quad \boldsymbol{\lambda} \preceq C\mathbf{1},\; \boldsymbol{\lambda} \succeq 0, \qquad (56)
\]
where the parameters have the same meanings as in (35).

The optimization process begins with an initial point $\boldsymbol{\lambda}^0 \in \mathbb{R}^n$ and iteratively updates $\boldsymbol{\lambda}$ to generate $\boldsymbol{\lambda}^0, \boldsymbol{\lambda}^1, \ldots, \boldsymbol{\lambda}^\infty$. The process from $\boldsymbol{\lambda}^k$ to $\boldsymbol{\lambda}^{k+1}$ is referred to as an outer iteration. During each outer iteration there are $n$ inner iterations which sequentially update $\lambda_1,\ldots,\lambda_n$. Each outer iteration generates multiplier vectors $\boldsymbol{\lambda}^{k,i} \in \mathbb{R}^n$ $(i=1,\ldots,n+1)$ such that
\[
\boldsymbol{\lambda}^{k,1} = \boldsymbol{\lambda}^k, \quad \boldsymbol{\lambda}^{k,n+1} = \boldsymbol{\lambda}^{k+1}, \quad
\boldsymbol{\lambda}^{k,i} = [\lambda_1^{k+1},\ldots,\lambda_{i-1}^{k+1},\lambda_i^k,\ldots,\lambda_n^k]^\top, \quad i=2,\ldots,n. \qquad (57)
\]
When we update $\boldsymbol{\lambda}^{k,i}$ to $\boldsymbol{\lambda}^{k,i+1}$, the following one-variable subproblem [21] is solved:
\[
\min_d\ f(\boldsymbol{\lambda}^{k,i} + d\,\mathbf{e}_i) \quad \text{s.t.}\quad 0 \le \lambda_i^k + d \le C, \qquad (58)
\]
where the unit vector $\mathbf{e}_i = [0,\ldots,0,1,0,\ldots,0]^\top$. The objective function is a quadratic function of the scalar $d$:
\[
f(\boldsymbol{\lambda}^{k,i} + d\,\mathbf{e}_i) = \tfrac{1}{2} D_{ii} d^2 + \nabla_i f(\boldsymbol{\lambda}^{k,i})\, d + \text{const.}, \qquad (59)
\]
where $\nabla_i f$ is the $i$th entry of the gradient vector $\nabla f$. For linear SVMs, $\nabla_i f$ can be efficiently calculated, and thus speedy convergence of the whole optimization process can be expected. It was shown that the above coordinate descent method is faster than some state-of-the-art solvers for SVM optimization [21]. Note that, from the point of view of coordinate descent, the SMO algorithm also implements a coordinate descent procedure, though on coordinate pairs.
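The method can be sketched in a few lines (our own simplified illustration in the spirit of Hsieh et al. [21]): since $\nabla_i f = y_i\,\mathbf{w}^\top\mathbf{x}_i - 1$ with $\mathbf{w} = \sum_j \lambda_j y_j \mathbf{x}_j$ maintained incrementally, the minimizer of (59) under the box constraint of (58) is a clipped closed-form step:

```python
import numpy as np

# Dual coordinate descent for the linear SVM dual (56): closed-form clipped
# update of one lambda at a time, with w maintained incrementally.

def dcd_train(X, y, C=1.0, outer_iters=50):
    n, d = X.shape
    lam = np.zeros(n)
    w = np.zeros(d)                       # w = sum_i lam_i * y_i * x_i
    Dii = (X * X).sum(axis=1)             # D_ii = ||x_i||^2 for the linear kernel
    for _ in range(outer_iters):
        for i in range(n):                # inner iterations over coordinates
            grad = y[i] * (w @ X[i]) - 1.0            # gradient entry of (56)
            lam_new = min(max(lam[i] - grad / Dii[i], 0.0), C)
            w += (lam_new - lam[i]) * y[i] * X[i]     # incremental w update
            lam[i] = lam_new
    return lam, w

# Toy data with the bias absorbed as a constant last feature, as in (55).
X = np.array([[2.0, 0, 1], [3.0, 1, 1], [-2.0, 0, 1], [-3.0, -1, 1]])
y = np.array([1, 1, -1, -1])
lam, w = dcd_train(X, y)
print(np.sign(X @ w))  # agrees with y on this toy set
```

The key point is that each inner update is $O(d)$ regardless of $n$, which is why this approach scales to very large linear SVM problems.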
4.4. Active set methods
Active set methods are among the typical approaches to solving QP problems [41]. In active set methods, constraints are divided into the set of active constraints (the active set) and the set of inactive constraints. The algorithm iteratively solves the optimization problem, updating the active set accordingly, until the optimal solution is found. For a certain iteration, an appropriate subproblem is solved for the variables that are not actively constrained.

Active set methods have also been applied to SVM optimization, e.g., by [12,19,46,47]. In this subsection, we mainly outline the methods used in the two recent papers [46,47], which both have theoretical guarantees of converging in finite time.
The active set method adopted by Scheinberg [46] to solve (46) fixes the variables $\lambda_i$ in the active set at their current values (0 or $C$), and then solves the reduced subproblem at each iteration:
\[
\min_{\boldsymbol{\lambda}_s}\ \tfrac{1}{2}\boldsymbol{\lambda}_s^\top D_{ss}\boldsymbol{\lambda}_s + \boldsymbol{\lambda}_a^\top D_{as}\boldsymbol{\lambda}_s - \boldsymbol{\lambda}_s^\top \mathbf{1} \quad \text{s.t.}\quad \boldsymbol{\lambda}_s^\top \mathbf{y}_s = -\boldsymbol{\lambda}_a^\top \mathbf{y}_a,\; \boldsymbol{\lambda}_s \preceq C\mathbf{1},\; \boldsymbol{\lambda}_s \succeq 0, \qquad (60)
\]
where $\boldsymbol{\lambda}_a$ and $\boldsymbol{\lambda}_s$ correspond to variables in and not in the active set, respectively, $D_{ss}$ and $D_{as}$ are composed of the associated rows and columns of $D$, and the meanings of $\mathbf{y}_a$ and $\mathbf{y}_s$ are similarly defined. Note that the cardinality of the free vector $\boldsymbol{\lambda}_s$ is the number of support vectors on the margin, with $0 \prec \boldsymbol{\lambda}_s \prec C\mathbf{1}$ after the last iteration.
Solving the above reduced subproblem can be expensive if the number of such support vectors is large. Scheinberg [46] proposed to make one step toward its solution at each iteration; that is, each time either one variable enters or one leaves the active set. A rank-one update of the Cholesky factorization of $D_{ss}$ is therefore efficient for this variant.
The active set method used by Shilton et al. [47] is similar, but it solves a different objective function, a partially dual form:
\[
\max_{b}\ \min_{0 \preceq \boldsymbol{\lambda} \preceq C\mathbf{1}}\
\frac{1}{2}
\begin{bmatrix} b \\ \boldsymbol{\lambda} \end{bmatrix}^\top
\begin{bmatrix} 0 & \mathbf{y}^\top \\ \mathbf{y} & D \end{bmatrix}
\begin{bmatrix} b \\ \boldsymbol{\lambda} \end{bmatrix}
-
\begin{bmatrix} b \\ \boldsymbol{\lambda} \end{bmatrix}^\top
\begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \qquad (61)
\]
where $D$ is the same as in (44). The advantage of using this partially dual representation is that the constraint $\boldsymbol{\lambda}^\top \mathbf{y} = 0$, whose presence complicates the implementation of the active set approach, is eliminated [47].
4.5. Solving the primal with Newton’s method
Although most literature on SVMs deals with the dual optimization problem, there is also some work concentrating on optimizing the primal, because the primal and the dual are indeed two sides of the same coin. For example, [27,38] studied the primal optimization of linear SVMs, and [14,34] studied the primal optimization of both linear and nonlinear SVMs. These works mainly use Newton-type methods. Below we introduce the optimization problem and the approach adopted in [14].
Consider a kernel function $k$ and an associated reproducing kernel Hilbert space $\mathcal{H}$. The optimization problem is

$$\min_{f\in\mathcal{H}}\ r\|f\|_{\mathcal{H}}^2 + \sum_{i=1}^n R(y_i, f(x_i)), \qquad (62)$$

where $r$ is a regularization coefficient and $R(y,t)$ is a loss function that is differentiable with respect to its second argument [14].
Suppose the optimal solution is $f^*$. The gradient of the objective function at $f^*$ must vanish, giving

$$2 r f^* + \sum_{i=1}^n \frac{\partial R}{\partial f}(y_i, f^*(x_i))\, k(x_i,\cdot) = 0, \qquad (63)$$

where the reproducing property $f(x_i) = \langle f, k(x_i,\cdot)\rangle_{\mathcal{H}}$ is used. This indicates that the optimal solution can be represented as a linear combination of kernel functions evaluated at the $n$ training data, which is known as the representer theorem [29].
From (63), we can write $f(x)$ in the form

$$f(x) = \sum_{i=1}^n \beta_i k(x_i, x) \qquad (64)$$

and then solve for $\beta = [\beta_1, \ldots, \beta_n]^\top$. The entries of $\beta$ should not be interpreted as Lagrange multipliers [14]. Recall the definition of the kernel matrix $K$ with $K_{ij} = k(x_i, x_j)$. The optimization problem (62) can be rewritten as the following unconstrained optimization problem:
$$\min_{\beta}\ r\,\beta^\top K \beta + \sum_{i=1}^n R(y_i, K_i^\top \beta), \qquad (65)$$

where $K_i$ is the $i$th column of $K$. As the common hinge loss $R(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$ is not differentiable, Chapelle [14] adopted the Huber loss as a differentiable approximation of the hinge loss. The Huber loss is formulated as
$$R(y,t) = \begin{cases} 0 & \text{if } yt > 1+h, \\[2pt] \dfrac{(1+h-yt)^2}{4h} & \text{if } |1-yt| \le h, \\[2pt] 1-yt & \text{if } yt < 1-h, \end{cases} \qquad (66)$$

where the parameter $h$ typically varies between 0.01 and 0.5 [14]. The Newton update of the variables is

$$\beta \leftarrow \beta - r_{\text{step}} H^{-1}\nabla, \qquad (67)$$

where $H$ and $\nabla$ are the Hessian and the gradient of the objective function in (65), respectively, and the step size $r_{\text{step}}$ can be found by line search along the direction of $H^{-1}\nabla$.
The term $-H^{-1}\nabla$ comes from the standard Newton's method, which seeks an increment $d$ to maximize or minimize a function, say $L(h+d)$. To see this, setting the derivative of the second-order Taylor series approximation

$$L(h+d) = L(h) + d^\top \nabla(h) + \tfrac{1}{2}\, d^\top H(h)\, d \qquad (68)$$

to zero, we obtain

$$d = -H^{-1}(h)\nabla(h). \qquad (69)$$
It was shown that primal and dual optimization are equivalent in terms of the solution and the time complexity [14]. However, as primal optimization more directly minimizes the desired objective function, it may be superior to dual optimization when one is interested in finding approximate solutions [14].
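As a concrete illustration of the primal Newton iteration (65)-(67), the following numpy sketch trains a kernel classifier with the Huber loss (66). It is a minimal implementation of our own (damped Newton steps with a simple backtracking line search), not Chapelle's code, and all parameter values are illustrative.

```python
import numpy as np

def huber_loss(y, t, h):
    """Huber loss (66), elementwise, for labels y in {-1, +1}."""
    yt = y * t
    return np.where(yt > 1 + h, 0.0,
           np.where(np.abs(1 - yt) <= h, (1 + h - yt) ** 2 / (4 * h), 1 - yt))

def huber_derivs(y, t, h):
    """First and second derivatives of (66) w.r.t. its second argument t."""
    yt = y * t
    d1, d2 = np.zeros_like(t), np.zeros_like(t)
    quad = np.abs(1 - yt) <= h                 # quadratic piece
    lin = yt < 1 - h                           # linear piece
    d1[quad] = -y[quad] * (1 + h - yt[quad]) / (2 * h)
    d2[quad] = 1.0 / (2 * h)
    d1[lin] = -y[lin]
    return d1, d2

def primal_newton_svm(K, y, r=0.1, h=0.5, iters=25):
    """Minimize r*b^T K b + sum_i R(y_i, (K b)_i), i.e. problem (65)."""
    n = y.size
    beta = np.zeros(n)
    obj = lambda b: r * b @ (K @ b) + huber_loss(y, K @ b, h).sum()
    for _ in range(iters):
        d1, d2 = huber_derivs(y, K @ beta, h)
        grad = 2 * r * (K @ beta) + K @ d1     # gradient of (65)
        H = 2 * r * K + (K * d2) @ K           # Hessian K diag(d2) K + 2rK
        step = np.linalg.solve(H + 1e-8 * np.eye(n), grad)
        s = 1.0                                # backtracking search for r_step
        while obj(beta - s * step) > obj(beta) and s > 1e-6:
            s *= 0.5
        beta = beta - s * step
    return beta
```

On a toy problem with a full-rank RBF kernel, $\operatorname{sign}(K\beta)$ recovers the training labels after a few iterations.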
Here, we would like to point out an alternative solution approach, the augmented Lagrangian method [5], which has approximately the same convergence speed as Newton's method.
4.6. Stochastic subgradient with projection
The subgradient method is a simple iterative algorithm for minimizing a convex objective function that can be non-differentiable. The method looks very much like the ordinary gradient method applied to differentiable functions, but with several subtle differences. For instance, the subgradient method is not truly a descent method: the function value often increases. Furthermore, the step lengths in the subgradient method are fixed in advance, instead of being found by a line search as in the gradient method [9].
A subgradient of a function $f(x): \mathbb{R}^d \to \mathbb{R}$ at $x$ is any vector $g$ that satisfies

$$f(y) \ge f(x) + g^\top (y - x) \qquad (70)$$

for all $y$ [53]. When $f$ is differentiable, a very useful choice of $g$ is $\nabla f(x)$, and then the subgradient method uses the same search direction as gradient descent.
Suppose we are minimizing the above convex function $f(x)$. The subgradient method uses the iteration

$$x^{(k+1)} \leftarrow x^{(k)} - \alpha_k g^{(k)}, \qquad (71)$$

where $g^{(k)}$ is any subgradient of $f$ at $x^{(k)}$, and $\alpha_k > 0$ is the step length.
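A tiny self-contained illustration of iteration (71), with a function and step schedule of our own choosing: minimizing the non-differentiable $f(x) = \|x - a\|_1$ with predetermined diminishing steps, tracking the best iterate because individual steps need not decrease $f$.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps):
    """Iteration (71) with a predetermined step schedule; returns the best iterate."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for alpha in steps:
        x = x - alpha * subgrad(x)       # not a descent method: f(x) may go up
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Example: f(x) = ||x - a||_1 is minimized at x = a with value 0.
a = np.array([1.0, -2.0, 3.0])
f = lambda x: np.abs(x - a).sum()
g = lambda x: np.sign(x - a)             # a valid subgradient of f at every x
steps = [0.5 / np.sqrt(k) for k in range(1, 501)]
x_best, f_best = subgradient_method(f, g, np.zeros(3), steps)
```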
Recently, there has been a lot of interest in studying gradient methods for SVM training [30,63]. Shalev-Shwartz et al. [50] proposed an algorithm that alternates between stochastic subgradient descent and projection steps to solve the primal problem
$$\min_{w}\ \frac{r}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n R(w, (x_i, y_i)), \qquad (72)$$

where the hinge loss $R(w,(x,y)) = \max\{0, 1 - y\langle w, x\rangle\}$ and the classifier function is $f(x) = \langle w, x\rangle$.
The algorithm receives two parameters as input: $T$, the total number of iterations, and $k$, the number of examples used to calculate subgradients. Initially, the weight vector $w_1$ is set to any vector with norm no larger than $1/\sqrt{r}$. At the $t$th iteration, a set $A_t$ of size $k$ is chosen from the original $n$ training examples to approximate the objective function as

$$f(w, A_t) = \frac{r}{2}\|w\|^2 + \frac{1}{k}\sum_{(x,y)\in A_t} R(w, (x,y)). \qquad (73)$$
In general, $A_t$ includes $k$ examples sampled i.i.d. from the training set. Define $A_t^+$ to be the set of examples in $A_t$ for which the hinge loss is nonzero. Then a two-step update is performed. First, $w_t$ is updated to $w_{t+\frac{1}{2}}$:

$$w_{t+\frac{1}{2}} = w_t - \eta_t \nabla_t, \qquad (74)$$

where the step length $\eta_t = 1/(r t)$, and $\nabla_t = r w_t - (1/k)\sum_{(x,y)\in A_t^+} y x$ is a subgradient of $f(w, A_t)$ at $w_t$. Last, $w_{t+\frac{1}{2}}$ is updated to $w_{t+1}$ by projecting onto the set

$$\{w : \|w\| \le 1/\sqrt{r}\}, \qquad (75)$$

which is shown to contain the optimal solution of the SVM. The algorithm finally outputs $w_{T+1}$.
If $k=1$, the algorithm is a variant of the stochastic gradient method; if $k$ equals the total number of training examples, it is the subgradient projection method. Besides its simplicity, the algorithm is guaranteed to obtain a solution of accuracy $\epsilon$ within $\tilde{O}(1/\epsilon)$ iterations, where an $\epsilon$-accurate solution $w$ is defined by $f(w) \le \min_w f(w) + \epsilon$ [50].
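The two-step update above can be sketched as follows (a minimal mini-batch version of our own, with illustrative parameter choices); for $k=1$ it reduces to the stochastic variant.

```python
import numpy as np

def pegasos(X, Y, r=0.1, T=2000, k=5, seed=0):
    """Mini-batch Pegasos-style solver: subgradient step (74), then projection (75)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                              # ||w_1|| = 0 <= 1/sqrt(r)
    for t in range(1, T + 1):
        idx = rng.choice(n, size=k, replace=False)
        Ax, Ay = X[idx], Y[idx]
        plus = Ay * (Ax @ w) < 1                 # A_t^+: examples with nonzero hinge loss
        eta = 1.0 / (r * t)                      # step length eta_t = 1/(r t)
        grad = r * w - (Ay[plus, None] * Ax[plus]).sum(axis=0) / k
        w = w - eta * grad                       # update (74)
        norm = np.linalg.norm(w)
        if norm > 1.0 / np.sqrt(r):              # projection (75) onto {||w|| <= 1/sqrt(r)}
            w *= 1.0 / (np.sqrt(r) * norm)
    return w
```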
4.7. Cutting plane algorithms
The essence of cutting plane algorithms is to approximate a risk function by one or multiple linear under-estimators (the so-called cutting planes) [28]. Recently, several large-scale SVM solvers using cutting plane algorithms have been proposed. For example, Joachims [23] solved a solution-equivalent structural variant of the following problem:
$$\min_{w}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^n R(w, (x_i, y_i)), \qquad (76)$$

where the hinge loss $R(w,(x,y)) = \max\{0, 1 - y\langle w, x\rangle\}$ and the classifier function is $f(x) = \langle w, x\rangle$. Teo et al. [57] considered a wider range of losses and regularization terms applicable to many pattern analysis problems. Franc [17] improved the cutting plane algorithm used in [23] with line search and monotonicity techniques to solve problem (76) more efficiently. Below we provide a flavor of what the basic cutting plane algorithm does in order to solve problem (76).
At iteration $t$, a reduced problem of (76) is solved for $w_t$:

$$\min_{w}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^n R_t(w, (x_i, y_i)), \qquad (77)$$

where $R_t$ is a linear approximation of $R$, derived using the following property. As a result of convexity, for any $(x_i, y_i)$ the function $R$ can be approximated at any point $w_{t-1}$ by a linear under-estimator [17]:

$$R(w) \ge R(w_{t-1}) + \langle g_{t-1}, w - w_{t-1}\rangle, \quad \forall w \in \mathbb{R}^d, \qquad (78)$$

where $g_{t-1}$ is any subgradient of $R$ at $w_{t-1}$. Note that (78) resembles (70), which indicates some similarity between the methods mentioned here and those in Section 4.6. This inequality can be abbreviated as $R(w) \ge \langle g_{t-1}, w\rangle + b_{t-1}$ with $b_{t-1} = R(w_{t-1}) - \langle g_{t-1}, w_{t-1}\rangle$. The equality $\langle g_{t-1}, w\rangle + b_{t-1} = 0$ is called a cutting plane.
For SVM optimization, a subgradient of $R(w,(x_i,y_i))$ at $w_{t-1}$ can be given as $g_{t-1} = -p_i y_i x_i$, where $p_i = 1$ if $y_i\langle w_{t-1}, x_i\rangle < 1$ and $p_i = 0$ otherwise. Consequently, the term $R_t(w,(x_i,y_i))$ in (77) can be replaced by the linear relaxation $\langle g_{t-1}, w\rangle + b_{t-1}$, and a standard QP is recovered which is straightforward to solve. To obtain a better approximation of the original risk function, more than one cutting plane, taken at different points, can be incorporated [17].
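The basic loop can be sketched as follows. This is a simplified bundle-style variant of our own that accumulates cutting planes on the averaged hinge risk and solves each reduced problem with a generic solver, not the specialized QP machinery of [23].

```python
import numpy as np
from scipy.optimize import minimize

def cutting_plane_svm(X, Y, C=10.0, iters=25):
    """Approximate min_w 0.5||w||^2 + C * R_emp(w), a form of (76), where
    R_emp is the averaged hinge loss, via cutting planes R_emp(w) >= <g,w> + b."""
    n, d = X.shape
    risk = lambda w: np.maximum(0.0, 1.0 - Y * (X @ w)).mean()
    obj = lambda z: 0.5 * z[:d] @ z[:d] + C * z[d]   # z = (w, xi)
    planes = []
    w = np.zeros(d)
    for _ in range(iters):
        active = Y * (X @ w) < 1                     # points with nonzero hinge loss
        g = -(Y[active, None] * X[active]).sum(axis=0) / n   # subgradient of R_emp at w
        b = risk(w) - g @ w                          # offset b_{t-1} as in the text
        planes.append((g, b))
        # Reduced problem: min 0.5||w||^2 + C*xi  s.t.  xi >= <g_j,w> + b_j,  xi >= 0.
        cons = [{'type': 'ineq', 'fun': (lambda z, gj=gj, bj=bj: z[d] - gj @ z[:d] - bj)}
                for gj, bj in planes]
        cons.append({'type': 'ineq', 'fun': lambda z: z[d]})
        w = minimize(obj, np.zeros(d + 1), method='SLSQP', constraints=cons).x[:d]
    return w
```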
5. SVMs for density estimation and regression
In this section, we briefly touch on density estimation and regression with SVMs. Many of the optimization methodologies introduced in the last section can be adopted, with some adaptation, to deal with density estimation and regression, so we do not delve into the optimization details of these problems.
The one-class SVM, an unsupervised learning machine proposed by Schölkopf et al. [48] for density estimation, aims to capture the support of a high-dimensional distribution. It outputs a binary function which is $+1$ in a region containing most of the data and $-1$ in the remaining region. The corresponding optimization problem, which can be interpreted as separating the data set from the origin in a feature space, is formulated as
$$\min_{w, \xi, \rho}\ \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^n \xi_i - \rho$$
$$\text{s.t.}\ \langle w, \phi(x_i)\rangle \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \qquad (79)$$

where the scalar $\nu \in (0, 1]$ is a balance parameter. The decision function is

$$c(x) = \operatorname{sign}(\langle w, \phi(x)\rangle - \rho). \qquad (80)$$
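In practice, problem (79) is rarely coded by hand. As a usage sketch (assuming scikit-learn is available, with toy data and arbitrarily chosen hyperparameters of ours):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))              # samples from the distribution
clf = OneClassSVM(kernel='rbf', gamma=0.5, nu=0.1)   # nu is the balance parameter
clf.fit(X)
inlier_rate = (clf.predict(X) == 1).mean()           # roughly 1 - nu of the data gets +1
far_point = clf.predict([[8.0, 8.0]])                # a point far outside the support
```

The parameter `nu` upper-bounds the fraction of training points labeled $-1$, matching the role of $\nu$ in (79).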
For support vector regression, the labels in the training data are real numbers, that is, $y_i \in \mathbb{R}$ ($i = 1, \ldots, n$). In order to introduce a sparse representation for the decision function, Vapnik [61] devised the following $\epsilon$-insensitive loss [49]

$$|y - f(x)|_\epsilon = \max\{0, |y - f(x)| - \epsilon\} \qquad (81)$$

with $\epsilon \ge 0$, and applied it to support vector regression. The standard form of support vector regression is to minimize the objective

$$\frac{1}{2}\|w\|^2 + C\sum_{i=1}^n |y_i - f(x_i)|_\epsilon, \qquad (82)$$
where the positive scalar $C$ reflects the trade-off between function complexity and the empirical loss. An equivalent optimization problem commonly used is given by

$$\min_{w, b, \xi, \xi^*}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i + C\sum_{i=1}^n \xi_i^*$$
$$\text{s.t.}\ \langle w, \phi(x_i)\rangle + b - y_i \le \epsilon + \xi_i,$$
$$\qquad y_i - \langle w, \phi(x_i)\rangle - b \le \epsilon + \xi_i^*,$$
$$\qquad \xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, n. \qquad (83)$$

The decision output for support vector regression is

$$c(x) = \langle w, \phi(x)\rangle + b. \qquad (84)$$
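As with the one-class case, a quick usage sketch (scikit-learn assumed available; the toy data and parameters are our own):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = 0.5 * X.ravel() + 1.0 + rng.normal(0.0, 0.05, 200)   # noisy line y = 0.5x + 1
reg = SVR(kernel='linear', C=10.0, epsilon=0.1)          # epsilon-insensitive tube
reg.fit(X, y)
pred = reg.predict([[2.0]])                              # true value: 0.5*2 + 1 = 2.0
```

Here `epsilon` is the half-width of the insensitive tube in (81) and `C` trades off flatness against the empirical loss, as in (82).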
6. Optimization methods for kernel learning
SVMs are among the most famous kernel-based algorithms, which encode the similarity information between examples largely through the chosen kernel function or kernel matrix. Determining the kernel functions or kernel matrices is essentially a model selection problem. This section introduces several optimization techniques used for learning kernels in SVM-type applications.
Semidefinite programming is a recently popular class of optimization problems with wide engineering applications, including kernel learning for SVMs [32,35]. For solving semidefinite programs, there exist interior point algorithms with good theoretical properties and computational efficiency [58].
A semidefinite program (SDP) is a special convex optimizationproblem with a linear objective function, and linear matrixinequality and affine equality constraints.
Definition 10 (SDP). An SDP is an optimization problem of the form

$$\min_{u}\ c^\top u$$
$$\text{s.t.}\ F^j(u) := F_0^j + u_1 F_1^j + \cdots + u_q F_q^j \preceq 0, \quad j = 1, \ldots, L,$$
$$\qquad A u = b, \qquad (85)$$

where $u := [u_1, \ldots, u_q]^\top \in \mathbb{R}^q$ is the decision vector, $c$ is a constant vector defining the linear objective, $F_i^j$ ($i = 0, \ldots, q$) are given symmetric matrices, and $F^j(u) \preceq 0$ means that the symmetric matrix $F^j(u)$ is negative semidefinite.
Lanckriet et al. [32] proposed an SDP framework for learning the kernel matrix, which is readily applicable to induction, transduction and even multiple kernel learning. The idea is to wrap the original optimization problems with the following constraints on the symmetric kernel matrix $K$:

$$K = \sum_{i=1}^m \mu_i K_i, \quad K \succeq 0, \quad \operatorname{trace}(K) \le c, \qquad (86)$$

or with the constraints

$$K = \sum_{i=1}^m \mu_i K_i, \quad \mu_i \ge 0 \ (i = 1, \ldots, m), \quad K \succeq 0, \quad \operatorname{trace}(K) \le c, \qquad (87)$$
and then formulate the problem as an SDP, which is usually solved with interior point methods. The advantages of the second set of constraints include reduced computational complexity and easier study of the statistical properties of a class of kernel matrices [32]. In particular, for the 1-norm soft margin SVM, the SDP for kernel learning reduces to a quadratically constrained quadratic programming (QCQP) problem, which tools for second-order cone programming (SOCP) can solve efficiently.
Argyriou et al. [1] proposed a DC (difference of convex functions) programming algorithm for supervised kernel learning. An important difference from other kernel selection methods is that they consider the convex hull of continuously parameterized basic kernels; that is, the number of basic kernels can be infinite. In their formulation, kernels are selected from the set

$$\mathcal{K}(G) := \left\{\int_{\Omega} G(\omega)\, dp(\omega) : p \in \mathcal{P}(\Omega)\right\}, \qquad (88)$$

where $\Omega$ is the kernel parameter space, $G(\omega)$ is a basic kernel, and $\mathcal{P}(\Omega)$ is the set of all probability measures on $\Omega$.
Making use of the conjugate function and the von Neumann minimax theorem [40], Argyriou et al. [1] formulated an optimization problem which was then addressed by a greedy algorithm. As a subroutine of the algorithm involves optimizing a function which is not convex, they converted it to a difference of two convex functions. With DC programming techniques [20], the necessary and sufficient condition for finding a solution is obtained, which is then solved by a cutting plane algorithm.
7. Conclusion
We have presented a review of optimization techniques used for training SVMs. The KKT optimality conditions are a milestone in optimization theory, and the optimization of many real problems relies heavily on their application. We have shown how to instantiate the KKT conditions for SVMs. Along with the introduction of the SVM algorithms, the characterization of effective kernels has also been presented, which is helpful for understanding SVMs with nonlinear classifiers.
For the optimization methodologies applied to SVMs, we have reviewed interior point algorithms, chunking and SMO, coordinate descent, active set methods, Newton's method for solving the primal, stochastic subgradient with projection, and cutting plane algorithms. The first four methods focus on the optimization of dual problems, while the last three directly deal with the optimization of the primal. This reflects current developments in SVM training.
Since their invention, SVMs have been extended to multiple variants in order to address different applications, such as the briefly introduced density estimation and regression problems. For kernel learning in SVM-type applications, we further introduced SDP and DC programming, which can be very useful for optimizing problems more complex than traditional SVMs with fixed kernels.
We believe the optimization techniques introduced in this paper can be applied to other SVM-related research as well. Based on these techniques, readers can implement their own solvers for different purposes, for example, contributing to one future line of research for training SVMs by developing fast and accurate approximation methods for challenging large-scale applications. Another direction is to investigate recent promising optimization methods (e.g., [24] and its references) and apply them to variants of SVMs for certain machine learning applications.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Project 61075005, the Fundamental Research Funds for the Central Universities, and the PASCAL2 Network of Excellence. This publication only reflects the authors' views.
References
[1] A. Argyriou, R. Hauser, C. Micchelli, M. Pontil, A DC-programming algorithm for kernel selection, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 41–48.
[2] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society 68 (1950) 337–404.
[3] P. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural results, Journal of Machine Learning Research 3 (2002) 463–482.
[4] K. Bennett, A. Demiriz, Semi-supervised support vector machines, Advances in Neural Information Processing Systems 11 (1999) 368–374.
[5] D. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, Academic Press, New York, 1982.
[6] A. Bordes, S. Ertekin, J. Weston, L. Bottou, Fast kernel classifiers with online and active learning, Journal of Machine Learning Research 6 (2005) 1579–1619.
[7] A. Bordes, L. Bottou, P. Gallinari, J. Weston, Solving multiclass support vector machines with LaRank, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 89–96.
[8] B. Boser, I. Guyon, V. Vapnik, A training algorithm for optimal margin classifier, in: Proceedings of the 5th ACM Workshop on Computational Learning Theory, 1992, pp. 144–152.
[9] S. Boyd, L. Xiao, A. Mutapcic, Subgradient Methods, available at: http://www.stanford.edu/class/ee392o/subgrad_method.pdf, 2003.
[10] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, England, 2004.
[11] C. Campbell, Kernel methods: a survey of current techniques, Neurocomputing 48 (2002) 63–84.
[12] G. Cauwenberghs, T. Poggio, Incremental and decremental support vector machine learning, Advances in Neural Information Processing Systems 13 (2001) 409–415.
[13] C. Chang, C. Lin, LIBSVM: A Library for Support Vector Machines, available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[14] O. Chapelle, Training a support vector machine in the primal, Neural Computation 19 (2007) 1155–1178.
[15] T. Evgeniou, M. Pontil, Regularized multi-task learning, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 109–117.
[16] J. Farquhar, D. Hardoon, H. Meng, J. Shawe-Taylor, S. Szedmak, Two view learning: SVM-2K, theory and practice, Advances in Neural Information Processing Systems 18 (2006) 355–362.
[17] V. Franc, S. Sonnenburg, Optimized cutting plane algorithm for support vector machines, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 320–327.
[18] T. Friess, N. Cristianini, C. Campbell, The kernel adatron algorithm: a fast and simple learning procedure for support vector machines, in: Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 188–196.
[19] T. Hastie, S. Rosset, R. Tibshirani, J. Zhu, The entire regularization path for the support vector machine, Journal of Machine Learning Research 5 (2004) 1391–1415.
[20] R. Horst, V. Thoai, DC programming: overview, Journal of Optimization Theory and Applications 103 (1999) 1–41.
[21] C. Hsieh, K. Chang, C. Lin, S. Keerthi, S. Sundararajan, A dual coordinate descent method for large-scale linear SVM, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 408–415.
[22] T. Joachims, Making large-scale SVM learning practical, in: Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1998.
[23] T. Joachims, Training linear SVMs in linear time, in: Proceedings of the 12th ACM Conference on Knowledge Discovery and Data Mining, 2006, pp. 217–226.
[24] A. Juditsky, A. Nemirovski, C. Tauvel, Solving Variational Inequalities with Stochastic Mirror-Prox Algorithm, PASCAL EPrints, 2008.
[25] S. Karlin, Mathematical Methods and Theory in Games, Programming, and Economics, Addison Wesley, Reading, MA, 1959.
[26] W. Karush, Minima of Functions of Several Variables with Inequalities as Side Constraints, Master's Thesis, Department of Mathematics, University of Chicago, 1939.
[27] S. Keerthi, D. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs, Journal of Machine Learning Research 6 (2005) 341–361.
[28] J. Kelley, The cutting-plane method for solving convex programs, Journal of the Society for Industrial and Applied Mathematics 8 (1960) 703–712.
[29] G. Kimeldorf, G. Wahba, A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, Annals of Mathematical Statistics 41 (1970) 495–502.
[30] J. Kivinen, A. Smola, R. Williamson, Online learning with kernels, IEEE Transactions on Signal Processing 52 (2004) 2165–2176.
[31] H. Kuhn, A. Tucker, Nonlinear programming, in: Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, 1951, pp. 481–492.
[32] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. Jordan, Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research 5 (2004) 27–72.
[33] J. Langford, J. Shawe-Taylor, PAC-Bayes and margins, Advances in Neural Information Processing Systems 15 (2002) 439–446.
[34] Y. Lee, O. Mangasarian, SSVM: a smooth support vector machine for classification, Computational Optimization and Applications 20 (2001) 5–22.
[35] J. Li, S. Sun, Nonlinear combination of multiple kernels for support vector machines, in: Proceedings of the 20th International Conference on Pattern Recognition, 2010, pp. 1–4.
[36] O. Mangasarian, Nonlinear Programming, McGraw-Hill, New York, 1969.
[37] O. Mangasarian, D. Musicant, Successive overrelaxation for support vector machines, IEEE Transactions on Neural Networks 10 (1999) 1032–1037.
[38] O. Mangasarian, A finite Newton method for classification, Optimization Methods and Software 17 (2002) 913–929.
[39] D. McAllester, Simplified PAC-Bayesian margin bounds, in: Proceedings of the 16th Annual Conference on Computational Learning Theory, 2003, pp. 203–215.
[40] C. Micchelli, M. Pontil, Learning the kernel function via regularization, Journal of Machine Learning Research 6 (2005) 1099–1125.
[41] J. Nocedal, S. Wright, Numerical Optimization, Springer-Verlag, New York, 1999.
[42] E. Osuna, R. Freund, F. Girosi, An improved training algorithm for support vector machines, in: Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, 1997, pp. 276–285.
[43] E. Osuna, R. Freund, F. Girosi, Support Vector Machines: Training and Applications, Technical Report AIM-1602, Massachusetts Institute of Technology, MA, 1997.
[44] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[45] C. Rasmussen, C. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[46] K. Scheinberg, An efficient implementation of an active set method for SVMs, Journal of Machine Learning Research 7 (2006) 2237–2257.
[47] A. Shilton, M. Palaniswami, D. Ralph, A. Tsoi, Incremental training of support vector machines, IEEE Transactions on Neural Networks 16 (2005) 114–131.
[48] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, R. Williamson, Estimating the support of a high-dimensional distribution, Neural Computation 13 (2001) 1443–1471.
[49] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[50] S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: primal estimated sub-gradient solver for SVM, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 807–814.
[51] J. Shawe-Taylor, P. Bartlett, R. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Transactions on Information Theory 44 (1998) 1926–1940.
[52] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, UK, 2004.
[53] N. Shor, Minimization Methods for Non-Differentiable Functions, Springer-Verlag, Berlin, 1985.
[54] M. Slater, A note on Motzkin's transposition theorem, Econometrica 19 (1951) 185–186.
[55] S. Sun, D. Hardoon, Active learning with extremely sparse labeled examples, Neurocomputing 73 (2010) 2980–2988.
[56] S. Sun, J. Shawe-Taylor, Sparse semi-supervised learning using conjugate functions, Journal of Machine Learning Research 11 (2010) 2423–2455.
[57] C. Teo, Q. Le, A. Smola, S. Vishwanathan, A scalable modular convex solver for regularized risk minimization, in: Proceedings of the 13th ACM Conference on Knowledge Discovery and Data Mining, 2007, pp. 727–736.
[58] L. Vandenberghe, S. Boyd, Semidefinite programming, SIAM Review 38 (1996) 49–95.
[59] R. Vanderbei, LOQO User's Manual—Version 3.10, Technical Report SOR-97-08, Statistics and Operations Research, Princeton University, 1997.
[60] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.
[61] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[62] L. Zanni, T. Serafini, G. Zanghirati, Parallel software for training large scale support vector machines on multiprocessor systems, Journal of Machine Learning Research 7 (2006) 1467–1492.
[63] T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 919–926.
John Shawe-Taylor is a professor at the Department of Computer Science, University College London (UK). His main research area is Statistical Learning Theory, but his contributions range from Neural Networks, to Machine Learning, to Graph Theory. He obtained a Ph.D. in Mathematics at Royal Holloway, University of London in 1986. He subsequently completed an M.Sc. in the Foundations of Advanced Information Technology at Imperial College. He was promoted to professor of Computing Science in 1996. He has published over 150 research papers. He moved to the University of Southampton in 2003 to lead the ISIS research group. He has been the director of the Centre for Computational Statistics and Machine Learning at University College London since July 2006. He has coordinated a number of Europe-wide projects investigating the theory and practice of Machine Learning, including the NeuroCOLT projects. He is currently the scientific coordinator of a Framework VI Network of Excellence in Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL) involving 57 partners.
Shiliang Sun received the B.E. degree with honors in Automatic Control from the Department of Automatic Control, Beijing University of Aeronautics and Astronautics in 2002, and the Ph.D. degree with honors in Pattern Recognition and Intelligent Systems from the State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing, China, in 2007. In 2004, he was entitled Microsoft Fellow. Currently, he is a professor at the Department of Computer Science and Technology and the founding director of the Pattern Recognition and Machine Learning Research Group, East China Normal University. From 2009 to 2010, he was a visiting researcher at the Department of Computer Science, University College London, working within the Centre for Computational Statistics and Machine Learning. He is a member of the IEEE, a member of the PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning) network of excellence, and on the editorial boards of multiple international journals. His research interests include Machine Learning, Pattern Recognition, Computer Vision, Natural Language Processing and Intelligent Transportation Systems.