

Pattern Recognition 40 (2007) 972–980, www.elsevier.com/locate/pr

Adaptive simplification of solution for support vector machine

Qing Li a,∗, Licheng Jiao a, Yingjuan Hao b

a Institute of Intelligent Information Processing, Xidian University, P.O. Box 224, Xi’an 710071, PR China
b Lanzhou University, Lanzhou 730000, PR China

Received 14 November 2005; received in revised form 3 June 2006; accepted 4 July 2006

Abstract

SVM has been receiving increasing interest in areas ranging from its original application in pattern recognition to other applications such as regression estimation, due to its remarkable generalization performance. Unfortunately, SVM is currently considerably slow in the test phase because of the number of support vectors, which has been a serious limitation for some applications. To overcome this problem, we propose an adaptive algorithm named feature vector selection (FVS) to select the feature vectors from the support vector solutions; it is based on the vector correlation principle and a greedy algorithm. Through the adaptive algorithm, the sparsity of the solution is improved and the time cost of testing is reduced. Because the number of feature vectors is selected adaptively according to the requirements, the generalization/complexity trade-off can be controlled directly. Computer simulations on regression estimation and pattern recognition show that FVS is a promising algorithm for simplifying the solution of the support vector machine.
© 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Support vector machine; Simplification; Vector correlation; Feature vector; Regression estimation; Pattern recognition

1. Introduction

The support vector machine (SVM) is a new and promising classification and regression technique proposed by Vapnik and his group at AT&T Bell Laboratories [1,2]. The theory of SVM is based on the idea of structural risk minimization (SRM) [3,4]. It has been shown to provide higher performance than traditional learning machines and has been introduced as a powerful tool for solving both pattern recognition and regression estimation problems [5,6]. SVM uses a device called kernel mapping to map the data in input space to a high-dimensional feature space in which the problem becomes linear. The decision function obtained by SVM is related not only to the number of support vectors (SVs) and their weights but also to the a priori chosen kernel, which is called the support vector kernel. SVM has been successfully

∗ Corresponding author. Tel.: +86 029 88208394.
E-mail addresses: [email protected] (Q. Li), [email protected] (L. Jiao), [email protected] (Y. Hao).

0031-3203/$30.00 © 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2006.07.005

applied in many areas, such as time series prediction, handwritten digit recognition and image classification [7–11].

However, one problem remains: the heavy computation in the testing phase caused by the large number of support vectors, which greatly hinders its practical use. The time taken for an SVM to test a new sample is proportional to the number of support vectors, so the decision speed becomes quite slow if the number of support vectors is very large. SVM is a sparse machine learning algorithm in theory, but the sparsity of the solution is not as good as one would expect, so it is considerably slower in the test phase than other approaches with similar generalization performance [12]. Therefore, how to reduce the complexity of the test phase has become a crucial problem in the SVM field.

In 1996, Burges described a method of speeding up the classification process by approximating the solution using a smaller number of vectors [13]. The reduced set of vectors determined by Burges' method are generally not support vectors. Burges and Schölkopf refined the method in 1997 [14]. The reduced set of vectors used in the refined method is computed from the original support vector set, which


provides the best approximation to the original decision surface. Yet, the method proposed by Burges and Schölkopf did not show good generalization performance on some instances, such as the NIST dataset. In 2001, Tom Downs proposed the simplification of support vectors by linear dependence. Downs' method can exactly reduce the support vectors without any loss of generalization, but the reduced set can still be large for some instances [12]. In 2002, Janez Brank et al. proposed a method to select feature vectors from the training dataset using a linear SVM, which reduces the complexity of the training process and improves the sparsity of the decision function, and successfully applied it to text categorization [15].

In this paper, we propose a new method to select the feature vectors from the support vector solutions according to the vector correlation principle and a greedy algorithm. The method can capture the structure of the feature space by approximating a basis of the support vector solutions; therefore, the statistical information of the solutions of the SVM is preserved. Further, the number of feature vectors can be selected adaptively according to the task's needs, so the generalization/complexity trade-off can be controlled directly.

The nonlinear mapping function Φ maps the support vectors (x_i, y_i), 1 ≤ i ≤ l, in input space into a high-dimensional feature space as (Φ(x_i), y_i), 1 ≤ i ≤ l. According to linear theory, the mapped support vectors {Φ(x_1), Φ(x_2), ..., Φ(x_l)} may not be linearly independent [16]. Suppose {Φ(x̃_1), Φ(x̃_2), ..., Φ(x̃_m)} (often m is far less than l) can approximate the mapped support vectors {Φ(x_1), Φ(x_2), ..., Φ(x_l)} accurately; then any mapped support vector Φ(x_i) can be expressed as a linear combination of the feature vectors, Σ_j γ_ij Φ(x̃_j). Therefore, testing on the original support vectors is equivalent to testing on the feature vectors, provided we know the corresponding coefficient matrix Γ = [γ_1, γ_2, ..., γ_l]^T, because

\[
\begin{bmatrix} \Phi(x_1) \\ \Phi(x_2) \\ \vdots \\ \Phi(x_l) \end{bmatrix}
=
\begin{bmatrix}
\gamma_{11} & \gamma_{12} & \cdots & \gamma_{1m} \\
\gamma_{21} & \gamma_{22} & \cdots & \gamma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_{l1} & \gamma_{l2} & \cdots & \gamma_{lm}
\end{bmatrix}
\begin{bmatrix} \Phi(\tilde{x}_1) \\ \Phi(\tilde{x}_2) \\ \vdots \\ \Phi(\tilde{x}_m) \end{bmatrix}, \tag{1}
\]

where [Φ(x_1), Φ(x_2), ..., Φ(x_l)]^T are the original support vectors in the feature space. In practice we usually aim to improve the test speed greatly while keeping the accuracy acceptable, so we propose to select feature vectors that approximate the original support vectors with error δ:

\[
\delta = \|\Phi(x) - \Gamma\,\Phi(\tilde{x})\|, \tag{2}
\]

where Φ(x) = {Φ(x_1), Φ(x_2), ..., Φ(x_l)} and Φ(x̃) = {Φ(x̃_1), Φ(x̃_2), ..., Φ(x̃_m)}. In this way, the number of feature vectors can be reduced adaptively according to the approximation error δ, which satisfies the needs of practical tasks.
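For concreteness, note that this approximation error never requires the mapping Φ explicitly: expanding the squared distance for a single support vector and applying the kernel substitution k(u, v) = ⟨Φ(u), Φ(v)⟩ gives a purely kernel-based expression (spelled out here as a reading aid; the same quantity is minimized in Section 3):

\[
\Bigl\|\Phi(x_i) - \sum_{j=1}^{m}\gamma_{ij}\,\Phi(\tilde{x}_j)\Bigr\|^2
= k(x_i, x_i)
- 2\sum_{j=1}^{m}\gamma_{ij}\,k(\tilde{x}_j, x_i)
+ \sum_{j=1}^{m}\sum_{j'=1}^{m}\gamma_{ij}\,\gamma_{ij'}\,k(\tilde{x}_j, \tilde{x}_{j'}).
\]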

In this paper, we propose to select the feature vectors X_F = {x_F1, x_F2, ..., x_Fm} adaptively according to the vector correlation principle and a greedy algorithm, and then to test the SVM with the feature vectors, so as to reduce the scale of the testing computation and improve the sparsity of the solution. Moreover, we analyze in theory the generalization performance under the approximation error.

The rest of the paper is organized as follows. A brief review of SVM for both regression estimation and pattern recognition is given in Section 2. Section 3 details the method of feature vector selection (FVS) and the simplification of support vectors; the generalization performance is analyzed in Section 3.3. Computer simulations are presented in Section 4. Finally, we give our concluding remarks in Section 5.

2. Support vector machine (SVM)

SVM uses an SV kernel to map the data in input space to a high-dimensional feature space in which the problem can be solved in linear form [17–19].

2.1. SVM for regression estimation

Given a set of data points (x_i, y_i), 1 ≤ i ≤ l, randomly generated from an unknown function, SVM approximates the function using the following form:

\[
y = f(x, \omega) = \omega^{\mathrm T}\Phi(x) + b, \tag{3}
\]

where Φ is a nonlinear mapping function. In order to obtain a small risk when estimating (3), SVM uses the following regularized risk functional:

\[
\min\;\Bigl(\tfrac{1}{2}\|\omega\|^2 + \frac{C}{l}\sum_{i=1}^{l} L_\varepsilon\bigl(y_i, f(x_i, \omega)\bigr)\Bigr). \tag{4}
\]

The second term of (4) is the empirical error measured by the ε-insensitive loss function defined as

\[
L_\varepsilon\bigl(y, f(x, \omega)\bigr) =
\begin{cases}
0 & \text{if } |y - f(x, \omega)| < \varepsilon,\\
|y - f(x, \omega)| - \varepsilon & \text{otherwise}.
\end{cases} \tag{5}
\]

In (4), C > 0 is a constant and ε > 0 is called the tube size, which defines the width of the ε-insensitive zone of the cost function [1,2]. Introducing Lagrange multiplier techniques, the minimization of (4) leads to the following dual optimization problem:

\[
\max\; W(\alpha_i^{(*)}) = \sum_{i=1}^{l} y_i(\alpha_i - \alpha_i^{*})
- \varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^{*})
- \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*})\,K(x_i, x_j), \tag{6}
\]


with the following constraints:

\[
\sum_{i=1}^{l}(\alpha_i - \alpha_i^{*}) = 0, \qquad 0 \le \alpha_i^{(*)} \le C, \quad i = 1, \ldots, l, \tag{7}
\]

where K(x, x_i) is a Mercer kernel. Common examples of Mercer kernels are the polynomial kernel K(x_i, x_j) = (⟨x_i, x_j⟩ + 1)^d and the Gaussian kernel K(x_i, x_j) = exp(−(1/σ²)‖x_i − x_j‖²). Finally, the resulting regression estimate found by SVM is

\[
f(x) = \alpha^{\mathrm T} K(x, x_{sv}) + b, \tag{8}
\]

where α^T = [(α_1 − α_1^*), ..., (α_l − α_l^*)] and x_sv is the support vector set.
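As a concrete illustration of where the quantities in (8) come from in practice, the following sketch trains an ε-SVR and reads off the support vectors x_sv, the dual coefficients (α_i − α_i^*) and the bias b that the post-processing of Section 3 operates on. It is only an illustration of ours: the paper's experiments used a Matlab implementation, while this sketch assumes scikit-learn and NumPy, a toy noisy sinc dataset, and one plausible conversion of the paper's sigma into scikit-learn's gamma.

import numpy as np
from sklearn.svm import SVR

# Noisy 1-D sinc data in the spirit of Section 4.1.1 (illustrative, not the paper's exact setup).
rng = np.random.default_rng(0)
X = np.linspace(-10, 10, 400).reshape(-1, 1)
y = np.sinc(X.ravel() / np.pi) + rng.normal(0.0, 0.1, X.shape[0])   # sin(x)/x plus Gaussian noise

# epsilon-SVR with an RBF kernel; C and epsilon follow Section 4.1.1, gamma = 1/(2*sigma^2) is assumed.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0 / (2 * 1.79 ** 2))
svr.fit(X, y)

x_sv = svr.support_vectors_       # the support vectors s_i
alpha = svr.dual_coef_.ravel()    # the coefficients (alpha_i - alpha_i*) of Eq. (8)
b = svr.intercept_[0]             # the bias term b
print(f"{len(x_sv)} support vectors out of {len(X)} training samples")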

2.2. SVM for pattern recognition

SVM for pattern recognition is similar to that for regression; the only differences are the expression of the optimization problem and the decision function. Consider a binary classification problem and let (x_i, y_i) ∈ R^N × {±1}, i = 1, ..., l, be a set of training samples. Finding the classification hyperplane by SVM can be cast as a quadratic optimization problem with slack variables ξ_i:

\[
\min\; \Phi(\omega, \xi) = \tfrac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{l}\xi_i
\quad \text{s.t.}\quad y_i\bigl[(\omega\cdot x_i) - b\bigr] \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1, \ldots, l. \tag{10}
\]

Using Lagrange multiplier techniques leads to the following dual optimization problem:

\[
\max\; W(\alpha) = \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i y_i \alpha_j y_j K(x_i, x_j)
\quad \text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i = 0,\;\; \alpha_i \in [0, C],\;\; i = 1, \ldots, l. \tag{11}
\]

The final decision function becomes

\[
f(x) = \operatorname{sgn}\bigl(\alpha^{\mathrm T} K(x, x_{sv}) + b\bigr), \tag{12}
\]

where α^T = [α_1 y_1, ..., α_l y_l].
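The classification case exposes the same quantities; note that in scikit-learn (again our assumption, not the paper's toolchain) dual_coef_ already stores the products α_i y_i that appear in (12), so no extra bookkeeping is needed before the simplification of Section 3. The toy data and parameter values below are illustrative only.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # a simple nonlinearly separable toy problem

clf = SVC(kernel="rbf", C=10.0, gamma=0.5)
clf.fit(X, y)

x_sv = clf.support_vectors_          # support vectors s_i
alpha_y = clf.dual_coef_.ravel()     # alpha_i * y_i, exactly the entries of alpha^T in Eq. (12)
b = clf.intercept_[0]
print(f"{len(x_sv)} support vectors; f(x) = sgn(sum_i alpha_i y_i K(s_i, x) + b)")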

3. Simplification of support vectors for SVM

3.1. Feature vector selection

Let (x_i, y_i), 1 ≤ i ≤ l (x_i ∈ R^N, y_i ∈ R), be the support vectors. A nonlinear mapping function Φ maps the input space of the support vectors into a feature Hilbert space H:

\[
\Phi: \mathbb{R}^N \to H, \qquad x \mapsto \Phi(x). \tag{13}
\]

Therefore, the mapped support vector set in the feature space is (Φ(x_i), y_i), 1 ≤ i ≤ l (Φ(x_i) ∈ H, y_i ∈ R), which lies in a subspace H_s of H with dimension at most l. In practice, the dimension of this subspace is far lower than l and equals the number of its basis vectors. As shown in the introduction, testing on the original support vectors is equivalent to testing on the feature vectors if the feature vectors approximate the support vectors accurately. In this paper, we propose a post-processing method that selects feature vectors to approximate the original support vectors and uses these feature vectors in the final solution, so as to improve the sparsity of the solution and reduce the test computation.

In the following, the method of selecting feature vectors is introduced. To simplify notation, for each x_i the mapping is denoted Φ(x_i) = Φ_i for 1 ≤ i ≤ l, and the selected feature vectors are denoted x_Fj with Φ(x_Fj) = Φ_Fj for 1 ≤ j ≤ M (M is the number of feature vectors). For a given feature vector set X_F = {x_F1, x_F2, ..., x_FM}, the mapping of any vector x_i can be expressed as a linear combination of X_F in the form

\[
\hat{\Phi}_i = \Phi_F\,\gamma_i, \tag{14}
\]

where Φ_F = [Φ_F1, ..., Φ_FM] is the matrix whose columns are the mapped feature vectors and γ_i = (γ_i1, ..., γ_iM)^T is the corresponding coefficient vector.

Given (x_i, y_i), 1 ≤ i ≤ l, the goal is to find the feature vectors X_F = {x_F1, x_F2, ..., x_FM} such that, for any mapping Φ_i, the estimated mapping Φ̂_i is as close as possible to Φ_i:

\[
\delta = \sum_{x_i \in X}\delta_i = \sum_{x_i \in X}\|\Phi_i - \hat{\Phi}_i\|^2. \tag{15}
\]

Here δ is the approximation error between the feature vector set and the original support vector set. We minimize δ to select the feature vectors:

\[
\min_{X_F}\,\delta = \min_{X_F}\Bigl(\sum_{x_i \in X}\delta_i\Bigr). \tag{16}
\]

Now, we get

\[
\delta_i = \|\Phi_i - \Phi_F\,\gamma_i\|^2
= (\Phi_i - \Phi_F\,\gamma_i)^{\mathrm T}(\Phi_i - \Phi_F\,\gamma_i)
= \Phi_i^{\mathrm T}\Phi_i - 2\,\gamma_i^{\mathrm T}\Phi_F^{\mathrm T}\Phi_i + \gamma_i^{\mathrm T}\Phi_F^{\mathrm T}\Phi_F\,\gamma_i. \tag{17}
\]

Setting the derivative of δ_i with respect to γ_i to zero gives the coefficient vector γ_i:

\[
\frac{\partial \delta_i}{\partial \gamma_i} = 2(\Phi_F^{\mathrm T}\Phi_F)\,\gamma_i - 2\,\Phi_F^{\mathrm T}\Phi_i,
\qquad
\frac{\partial \delta_i}{\partial \gamma_i} = 0 \;\Rightarrow\; \gamma_i = (\Phi_F^{\mathrm T}\Phi_F)^{-1}\Phi_F^{\mathrm T}\Phi_i, \tag{18}
\]


and (Φ_F^T Φ_F)^{-1} exists if the mapped feature vectors are linearly independent. Substituting (17) and (18) into (16), we get

\[
\min_{X_F}\,\delta = \min_{X_F}\Bigl(\sum_{x_i \in X}\bigl(\Phi_i^{\mathrm T}\Phi_i - \Phi_i^{\mathrm T}\Phi_F(\Phi_F^{\mathrm T}\Phi_F)^{-1}\Phi_F^{\mathrm T}\Phi_i\bigr)\Bigr). \tag{19}
\]

By Mercer's theorem, we can replace the inner product between feature-space vectors by a positive definite kernel function over pairs of vectors in input space. In other words, we use the substitution Φ(x)^T Φ(y) = ⟨Φ(x), Φ(y)⟩ = k(x, y) in (19) and get

\[
\min_{X_F}\,\delta = \min_{X_F}\Bigl(\sum_{x_i \in X}\bigl(k_{ii} - K_{Fi}^{\mathrm T} K_{FF}^{-1} K_{Fi}\bigr)\Bigr), \tag{20}
\]

where k_ii = Φ_i^T Φ_i, K_Fi = Φ_F^T Φ_i and K_FF = Φ_F^T Φ_F. We define the feature-set fitness δ_F and the per-vector fitness δ_Fi corresponding to a given feature vector set X_F by

\[
\delta_F = \frac{1}{l}\sum_{x_i \in X}\delta_{Fi}, \tag{21}
\]

where

\[
\delta_{Fi} = k_{ii} - K_{Fi}^{\mathrm T} K_{FF}^{-1} K_{Fi}. \tag{22}
\]

Now, minimization of (16) is equivalent to minimizing

\[
\min_{X_F}(\delta_F). \tag{23}
\]
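The quantities in (20)–(22) require nothing more than kernel evaluations. The following NumPy sketch (our illustration, not the authors' code; the function names, the RBF choice and the small ridge added for numerical stability are assumptions) returns the per-sample fitness values δ_Fi for a candidate feature set X_F given by its index set; their mean is the feature-set fitness δ_F of (21).

import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # K[i, j] = exp(-||A_i - B_j||^2 / (2 * sigma**2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fvs_residuals(X, F_idx, sigma=1.0, jitter=1e-10):
    # Per-sample fitness delta_Fi = k_ii - K_Fi^T K_FF^{-1} K_Fi, cf. Eqs. (20) and (22).
    F = X[F_idx]
    K_FF = rbf_kernel(F, F, sigma) + jitter * np.eye(len(F_idx))   # small ridge (our addition)
    K_FX = rbf_kernel(F, X, sigma)                                  # column i is K_Fi
    k_diag = np.ones(len(X))                                        # k(x_i, x_i) = 1 for the RBF kernel
    Z = np.linalg.solve(K_FF, K_FX)                                 # K_FF^{-1} K_FX without an explicit inverse
    return k_diag - np.einsum("mi,mi->i", K_FX, Z)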

The process of FVS is a greedy iterative algorithm. When selecting the first feature vector, we look for the sample that gives the minimum δ_F. In each iteration, (21) is used to estimate the performance of the current feature set and (22) is used to select the next best candidate feature vector: the sample with the maximal fitness δ_Fi for the current feature set is selected as the next feature vector, because the collinearity between such a vector and the current feature set is the worst (in other words, such a vector can hardly be expressed as a linear combination of the current feature vectors). The expected maximum number of feature vectors or a pre-defined feature-set fitness can be used to stop the iterative process. Theoretically, the selected feature vectors are the real basis vectors of the original support points if we pre-set δ_F = 0, which means we simplify the solution of the SVM exactly. As noted in the introduction, we usually do not need to simplify the support vectors exactly, since we want much sparser solutions to meet the task's needs, so we propose to set δ_F > 0 to remove more support points. When the current feature-set fitness reaches the pre-defined fitness (which means that X_F is a good approximation of a basis for the original dataset in H) or the number of feature vectors reaches the expected maximum, the algorithm stops. One further stopping criterion must be noted: the algorithm should also stop if the matrix K_FF = Φ_F^T Φ_F is no longer invertible, which means the current feature set X_F is already a real basis of the feature space H. From the above analysis, we can see that the number of feature vectors can be controlled adaptively by setting different approximation errors δ_F or by predefining the expected number of feature vectors. Generally, the more feature vectors we obtain, the smaller the approximation error δ_F is.
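Putting the pieces together, the greedy loop described above can be sketched as follows (a sketch under our assumptions, reusing the hypothetical fvs_residuals helper from the previous snippet; the authors' Matlab implementation is not reproduced here). The arguments max_n and min_fit play the roles of the maxN and minFit parameters introduced in Section 4, and the loop also stops when K_FF is about to become numerically singular.

import numpy as np

def select_feature_vectors(X, sigma=1.0, max_n=50, min_fit=0.0):
    # Greedy feature vector selection (FVS), following Section 3.1.
    n = len(X)
    candidates = list(range(n))
    # First feature vector: the single sample whose one-element set gives the smallest fitness delta_F.
    first = min(candidates, key=lambda j: fvs_residuals(X, [j], sigma).mean())
    selected = [first]
    candidates.remove(first)

    while candidates and len(selected) < max_n:
        res = fvs_residuals(X, selected, sigma)
        if res.mean() <= min_fit:                       # pre-defined fitness reached, Eq. (21)
            break
        best = max(candidates, key=lambda j: res[j])    # worst-approximated sample, Eq. (22)
        if res[best] < 1e-9:                            # adding it would make K_FF singular: a basis is found
            break
        selected.append(best)
        candidates.remove(best)
    return selected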

3.2. Simplification of solution for SVM

Suppose we train an SVM and that l of the training samples are determined to be support vectors. Denote them by SV_s = {s_1, s_2, ..., s_l}. According to Section 2, the decision surface for both regression and classification takes the form f(x) = Σ_{i=1}^{l} α_i K(s_i, x) + b, which can also be expressed as

\[
f(x) = \sum_{i=1}^{l}\alpha_i\,\langle\Phi(s_i), \Phi(x)\rangle + b = \alpha^{\mathrm T}\Phi_s\,\Phi(x) + b, \tag{24}
\]

where α^T = [α_1, ..., α_l] and Φ_s = [Φ(s_1), ..., Φ(s_l)]^T. Now let X_F = {x_F1, x_F2, ..., x_FM} (M < l) be the feature vector set of the original support vectors, with the corresponding approximation coefficient matrix

\[
B =
\begin{bmatrix}
\beta_{11} & \beta_{12} & \cdots & \beta_{1M} \\
\beta_{21} & \beta_{22} & \cdots & \beta_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{l1} & \beta_{l2} & \cdots & \beta_{lM}
\end{bmatrix}, \tag{25}
\]

which means Φ_s = [Φ(s_1), ..., Φ(s_l)]^T can be approximated by Φ_F = [Φ(x_F1), ..., Φ(x_FM)]^T with the approximation coefficient matrix B:

\[
\begin{bmatrix} \Phi(s_1) \\ \Phi(s_2) \\ \vdots \\ \Phi(s_l) \end{bmatrix}
=
\begin{bmatrix}
\beta_{11} & \beta_{12} & \cdots & \beta_{1M} \\
\beta_{21} & \beta_{22} & \cdots & \beta_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{l1} & \beta_{l2} & \cdots & \beta_{lM}
\end{bmatrix}
\begin{bmatrix} \Phi(x_{F1}) \\ \Phi(x_{F2}) \\ \vdots \\ \Phi(x_{FM}) \end{bmatrix}. \tag{26}
\]

We rewrite (26) as Φ_s = B Φ_F and substitute it into (24):

\[
f(x) = \alpha^{\mathrm T}\Phi_s\,\Phi(x) + b = \alpha^{\mathrm T} B\,\Phi_F\,\Phi(x) + b
= \sum_{i=1}^{M}\tilde{\beta}_i\,\langle\Phi(x_{Fi}), \Phi(x)\rangle + b
= \sum_{i=1}^{M}\tilde{\beta}_i\,K(x_{Fi}, x) + b, \tag{27}
\]

where β̃_i = α_1 β_{1i} + α_2 β_{2i} + ... + α_l β_{li}.

From Section 3.1, we know that the number of feature vectors is often far less than the number of support vectors, so substituting feature vectors for support vectors greatly improves the sparsity of the solution and speeds up the decision when testing a new sample.
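Concretely, once the feature set is fixed, row i of B is just the coefficient vector γ_i of (18) computed for support vector s_i, so the whole simplification reduces to two small linear-algebra steps. The sketch below is ours (it reuses the hypothetical rbf_kernel helper from Section 3.1, and x_sv, alpha, b are assumed to come from a trained SVM as in Section 2); it is not the authors' implementation.

import numpy as np
# assumes rbf_kernel(A, B, sigma) from the sketch in Section 3.1

def simplify_svm(x_sv, alpha, sigma, F_idx):
    # Replace the SV expansion (24) by the reduced expansion (27).
    x_F = x_sv[F_idx]
    K_FF = rbf_kernel(x_F, x_F, sigma)
    K_FS = rbf_kernel(x_F, x_sv, sigma)        # column i is K_Fi for support vector s_i
    B = np.linalg.solve(K_FF, K_FS).T          # row i of B is gamma_i = K_FF^{-1} K_Fi, Eq. (18)
    beta = B.T @ alpha                         # beta_j = sum_i alpha_i * B_ij, as below Eq. (27)
    return x_F, beta

def predict_simplified(x, x_F, beta, b, sigma):
    # f(x) = sum_j beta_j K(x_Fj, x) + b, Eq. (27); works for a single point or a batch.
    return rbf_kernel(np.atleast_2d(x), x_F, sigma) @ beta + b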

Page 5: Adaptive simplification of solution for support vector machine

976 Q. Li et al. / Pattern Recognition 40 (2007) 972–980

3.3. Analysis of the approximation error

In general, the feature vectors are not an exact approximation of the support vectors; what, then, happens to the generalization performance when we replace the support vectors with the feature vectors? In the following, we discuss this problem. First, consider the following theorem.

Theorem (consistency). Suppose f(x) can be expressed in the form α^T Φ(X) Φ(x), where Φ is the nonlinear mapping function and α = [α_1, ..., α_l]^T, Φ(X) = [Φ(x_1), ..., Φ(x_l)]^T. If ‖Φ(X) − μ Φ(X_F)‖ ≤ ε, then

\[
\|f(x) - \alpha^{\mathrm T}\mu\,\Phi(X_F)\,\Phi(x)\| \le \varepsilon\cdot\|\alpha\|\cdot\|\Phi(x)\|, \tag{28}
\]

where Φ(X_F) = [Φ(x_F1), ..., Φ(x_FM)]^T and μ = [μ_ij]_{l×M} is the corresponding approximation coefficient matrix. Moreover, when ε converges to zero, ‖f(x) − α^T μ Φ(X_F) Φ(x)‖ also converges to zero.

The proof of the theorem is given in the Appendix. From the theorem, we know that the error of the simplified solution is bounded by ε·‖α‖·‖Φ(x)‖ if the feature vectors approximate the original support vectors with error ε. Obviously, this bound converges to zero as ε converges to zero.
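A small remark of ours, following directly from the kernel substitution of Section 3.1: both norms on the right-hand side of (28) are computable without the explicit mapping, because ‖Φ(x)‖ = √(k(x, x)), so the bound can be evaluated as

\[
\|f(x) - \alpha^{\mathrm T}\mu\,\Phi(X_F)\,\Phi(x)\| \;\le\; \varepsilon\,\|\alpha\|\,\sqrt{k(x, x)},
\]

and for a Gaussian kernel, where k(x, x) = 1, it reduces to ε‖α‖.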

4. Validation of effectiveness

In this paper, we compare the FVS SVM with the standard SVM. The kernel for both SVM and FVS is the RBF kernel of the form K(x, x_i) = exp(−‖x − x_i‖²/2p). The FVS parameters are denoted as follows: maxN is the maximum number of feature vectors and minFit is the FVS stopping criterion (pre-defined accuracy). In the regression tests, the approximation error is measured as

\[
e_{ss} = \sqrt{\frac{1}{l}\sum_{i=1}^{l}(y_i - f_i)^2},
\]

and the loss function of the SVM is the ε-insensitive loss. To reduce the influence of randomness, each experiment was performed over 30 independent runs, and all experiments were carried out on a Pentium IV 2.6 GHz machine with 512 MB RAM using the Matlab 7.01 compiler.

Table 1
Approximation results for the 1-D sinc function

Algorithm   App. error   #.fv/#.sv (a)   Simp. rate (%)   Testing time (s)   Error
FVS SVM     δ = 0        27/136          80.1             1.3594             0.00829
            δ = 0.006    13/136          90.4             0.0688             0.00838
            δ = 0.014    12/136          91.2             0.0594             0.00847
            δ = 0.033    10/136          92.6             0.0469             0.00862
SVM         None         136/136         0                26.75              0.00829

(a) #.fv represents the number of feature vectors; #.sv the number of support vectors.

4.1. Regression experiments

4.1.1. Approximation of a single-variable function
In this experiment, we approximate the one-dimensional function y = sin(x)/x, also known as the sinc function. We uniformly sample 1200 points over the domain [−10, 10], 400 of which are taken as training examples and the rest as testing examples.

Before training, Gaussian noise with zero mean and standard deviation 0.1 is added to each training point. The kernel for both SVM and FVS is the RBF kernel with sigma = 1.79, and for SVM, C = 1, ε = 0.1. We select different approximation errors (shown in Table 1) to evaluate the performance of the FVS SVM proposed in this paper. The approximation plots with 10 reduced feature vectors are given in Figs. 1 and 2.
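For readers who want to retrace this experiment, the sketch below strings together the earlier snippets (the SVR training from Section 2 and the hypothetical select_feature_vectors, simplify_svm and predict_simplified helpers from Section 3). It is our reconstruction under stated assumptions, not the authors' Matlab script; the random seed, the train/test split and the gamma conversion are all assumptions, so the numbers will not match Table 1 exactly.

import numpy as np
from sklearn.svm import SVR

sigma = 1.79
rng = np.random.default_rng(0)

# 1200 uniform samples of sin(x)/x on [-10, 10]; 400 for training, 800 for testing.
x_all = np.linspace(-10, 10, 1200).reshape(-1, 1)
y_all = np.sinc(x_all.ravel() / np.pi)
idx = rng.permutation(1200)
tr, te = idx[:400], idx[400:]
y_tr = y_all[tr] + rng.normal(0.0, 0.1, 400)     # Gaussian noise on the training targets only

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0 / (2 * sigma ** 2)).fit(x_all[tr], y_tr)
x_sv, alpha, b = svr.support_vectors_, svr.dual_coef_.ravel(), svr.intercept_[0]

# FVS post-processing with the helpers sketched in Section 3 (hypothetical names).
F_idx = select_feature_vectors(x_sv, sigma=sigma, max_n=30, min_fit=0.0)
x_F, beta = simplify_svm(x_sv, alpha, sigma, F_idx)
y_hat = predict_simplified(x_all[te], x_F, beta, b, sigma)
e_ss = np.sqrt(np.mean((y_all[te] - y_hat) ** 2))   # the error measure of Section 4
print(len(x_sv), "SVs ->", len(F_idx), "FVs, test e_ss =", round(float(e_ss), 5))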

From the experiments, we see that the FVS algorithm simplifies the support points exactly when δ_F = 0 (only 27 reduced feature vectors). Moreover, it can reduce the support points much further with δ_F ≠ 0 while leaving the approximation error almost unchanged. More detailed information is given in Table 1.

Fig. 1. Markers represent the SVs; the dotted line represents the original sinc curve; the solid line represents the curve regressed by the standard SVM (ε = 0.1, noise standard deviation 0.1, C = 1, 136 SV/400 total, i.e., 400 training samples containing 136 support vectors).


Fig. 2. ‘∗’ marks the reduced feature vectors; the dotted line represents the original sinc curve; the solid line represents the curve regressed by the FVS SVM (ε = 0.1, noise standard deviation 0.1, C = 1, 10 FV/136 SV/400 total).

Table 2
Approximation results for the 2-D sinc function

Algorithm   #.fv/#.sv   Simp. rate (%)   Testing time (s)   Error
FVS SVM     80/216      63.0             7.8438             0.01754
            65/216      70.0             6.2501             0.01759
            45/216      79.2             4.1576             0.01768
            35/216      83.8             3.2344             0.01787
SVM         216/216     0                51.984             0.01753

4.1.2. Approximation of a two-variable function
This experiment approximates a two-dimensional sinc function of the form

\[
y = \frac{\sin\bigl(\pi\sqrt{x_1^2 + x_2^2}\bigr)}{\pi\sqrt{x_1^2 + x_2^2}}
\]

over the domain [−10, 10] × [−10, 10]. We uniformly sample 2500 points, 500 of which are taken as training examples and the rest as testing examples.

Before training, Gaussian noise with zero mean and standard deviation 0.1 is added to each training point. The kernel for both SVM and FVS is the RBF kernel with sigma = 0.49, and for SVM, C = 1, ε = 0.1. In this experiment, we predefine the maximal number of feature vectors to reduce the support vector solutions adaptively. Table 2 lists the approximation errors of the FVS SVM and the standard SVM. The approximation plots with 45 reduced feature vectors are shown in Fig. 3.

In Fig. 3(b), 216 support vectors are used to approximate the function. However, only 45 reduced feature vectors are needed to approximate the original function while achieving almost the same result as Fig. 3(b), which confirms that FVS can really improve the sparsity of the support vectors. More detailed information is given in Table 2.

4.2. Experiments of pattern recognition

4.2.1. Two-spirals problem
Learning to tell two spirals apart is important both for purely academic reasons and for industrial applications [20]. In pattern recognition research it is a well-known problem because of its difficulty. The parametric equations of the two spirals can be written as

\[
\text{spiral-1:}\quad x_1 = (k_1\theta + e_1)\cos\theta,\qquad y_1 = (k_1\theta + e_1)\sin\theta,
\]
\[
\text{spiral-2:}\quad x_2 = (k_2\theta + e_2)\cos\theta,\qquad y_2 = (k_2\theta + e_2)\sin\theta,
\]

where k_1, k_2, e_1 and e_2 are parameters. In our experiment we choose k_1 = k_2 = 4, e_1 = 1, e_2 = 10. We uniformly generate 12 000 samples, and randomly choose 1500 of them as training data and the others as test data.
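Under the parametric form reconstructed above, the dataset can be generated as in the sketch below (ours, for illustration only). The paper does not state the range of the angle θ, so the interval [0, 3π] and the even split between the two spirals are assumptions.

import numpy as np

def two_spirals(n=12000, k=4.0, e1=1.0, e2=10.0, theta_max=3 * np.pi, seed=0):
    # Two intertwined spirals r = k*theta + e, labelled +1 / -1.
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, theta_max, n)
    e = np.where(rng.random(n) < 0.5, e1, e2)          # choose a spiral for each sample
    r = k * theta + e
    X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    y = np.where(e == e1, 1, -1)
    return X, y

X, y = two_spirals()
train_idx = np.random.default_rng(1).choice(len(X), 1500, replace=False)   # 1500 training samples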

Before training, we add noise to the data by randomly choosing 150 training samples and flipping their class labels. The kernel for both SVM and FVS is the RBF kernel with sigma = 8, and for SVM, C = 10. We select different approximation errors to control the number of feature vectors and test the performance of the FVS SVM proposed in this paper. Table 3 lists the classification results of the FVS SVM and the standard SVM.

From the experiments, FVS exactly simplifies the solution from 1196 support vectors to 75 reduced feature vectors with δ_F = 0. Moreover, we can reduce the support points much further, up to δ_F = 0.07, while keeping the classification accuracy unchanged.

4.2.2. Artificial data experiment
We generate an artificial dataset with the parametric equations

\[
x = \rho\sin\varphi\cos\theta,\qquad y = \rho\sin\varphi\sin\theta,\qquad z = \rho\cos\varphi,
\qquad \theta \in U[0, 2\pi],\;\; \varphi \in U[0, \pi].
\]

The parameter ρ of the first class follows the continuous uniform distribution U[0, 50], and ρ of the second class follows U[50, 100]. We generate 8000 samples, and randomly choose 1000 samples as training data and the others as test data.
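A sketch of the data generator implied by these equations (our illustration; the variable names and the equal class proportions are assumptions):

import numpy as np

def spherical_data(n=8000, seed=0):
    # x = rho*sin(phi)*cos(theta), y = rho*sin(phi)*sin(theta), z = rho*cos(phi).
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2 * np.pi, n)
    phi = rng.uniform(0.0, np.pi, n)
    label = rng.integers(0, 2, n)                      # 0: first class, 1: second class
    rho = np.where(label == 0, rng.uniform(0, 50, n), rng.uniform(50, 100, n))
    X = np.column_stack([rho * np.sin(phi) * np.cos(theta),
                         rho * np.sin(phi) * np.sin(theta),
                         rho * np.cos(phi)])
    return X, np.where(label == 0, 1, -1)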

Before training, we add noise to the data by randomly choosing 100 training samples and flipping their class labels. The kernel for both SVM and FVS is the RBF kernel with sigma = 8, and for SVM, C = 10. We predefine different maximal numbers of feature vectors (shown in Table 4) to test the performance of the FVS SVM. Table 4 lists the classification results of the FVS SVM and the standard SVM.

4.2.3. Waveform and ionosphere dataset classification
We also ran experiments on two well-known datasets, the waveform and ionosphere datasets from the UCI Benchmark Repository [21].


Fig. 3. Regression of the two-variable sinc function: (a) the original function; (b) the approximation by the standard SVM (216 SV); and (c) the approximation by the FVS SVM (45 FV).

Table 3
Recognition results for the two spirals

Algorithm   App. error   #.fv/#.sv    Simp. rate (%)   Testing time (s)   Accuracy (%)
FVS SVM     δ = 0        75/1196      93.7             35.766             100
            δ = 0.03     25/1196      97.9             11.953             100
            δ = 0.07     22/1196      98.2             10.516             100
            δ = 0.14     16/1196      98.7             7.6875             91.3
SVM         None         1196/1196    0                579.42             100

Table 4
Recognition results on the artificial dataset

Algorithm   #.fv/#.sv   Simp. rate (%)   Testing time (s)   Accuracy (%)
FVS SVM     277/624     55.6             88.172             89.78
            229/624     63.3             72.813             89.71
            187/624     70.0             59.406             89.68
            103/624     83.5             32.781             88.84
SVM         624/624     0                201.69             89.80

The waveform dataset has 21 characteristic attributes, all containing noise, and one class attribute. It contains three classes, and we select two of them (class 0 and class 2) as our experimental dataset. The dataset has 3353 samples, of which we choose 3000 as training data and the others as test data. The kernel for both SVM and FVS is the RBF kernel with sigma = 10, and for SVM, C = 1. In this experiment, we select different approximation errors to control the number of feature vectors and test the performance of the proposed FVS SVM. The classification results are shown in Table 5.

The ionosphere dataset is a binary problem with 34 continuous characteristic attributes and one class attribute. The dataset contains 351 instances, and we randomly choose 250 instances as training samples and the others as testing samples. The kernel for both SVM and FVS is the RBF kernel with sigma = 1.3, and for SVM, C = 4. Table 6 lists the classification results of the FVS SVM and the standard SVM. In this experiment, we predefine different maximal numbers of feature vectors to test the generalization performance of the proposed FVS SVM.

From the above experiments on the benchmark datasets, we see that FVS simplifies the solutions successfully according to the pre-defined fitness or the expected maximum number of feature vectors, which demonstrates the feasibility and validity of using the FVS algorithm with SVM.


Table 5
Test on the waveform dataset

Algorithm   App. error    #.fv/#.sv   Simp. rate (%)   Testing time (s)   Accuracy (%)
FVS SVM     δ = 0         198/350     43.4             41.922             92.27
            δ = 0.0016    74/350      78.6             15.391             92.27
            δ = 0.0032    41/350      88.3             8.4531             92.27
            δ = 0.0064    25/350      92.9             5.1875             92.23
            δ = 0.0128    20/350      94.3             4.0938             91.97
SVM         None          350/350     0                73.938             92.27

Table 6
Test on the ionosphere dataset

Algorithm   #.fv/#.sv   Simp. rate (%)   Testing time (s)   Accuracy (%)
FVS SVM     120/138     13.0             88.172             94.06
            115/138     16.7             72.813             94.06
            109/138     21.0             59.406             93.07
            103/138     25.4             32.781             91.10
SVM         138/138     0                0.70313            94.06

5. Concluding remarks

SVM uses a device called kernel mapping to map the data in input space to a high-dimensional feature space in which the problem becomes linear. It has made great progress in the field of machine learning. However, one problem remains: the heavy computation in the testing phase caused by the large number of support vectors, which greatly hinders its practical use. The time taken for an SVM to test a new sample is proportional to the number of support vectors, so if the number of support vectors is very large, the decision speed becomes quite slow. SVM is a sparse machine learning algorithm in theory, but the sparsity of the solution is not as good as one would expect. It is considerably slower in the test phase than other approaches with similar generalization performance.

In this paper, we have proposed a new method to adaptively select the feature vectors from the support vector solutions according to the vector correlation principle and a greedy algorithm. The method can capture the structure of the feature space by approximating a basis of the support vector solutions; therefore, the statistical information of the solutions of the SVM is preserved. Further, the number of feature vectors can be selected adaptively according to the task's needs, so the generalization/complexity trade-off can be controlled directly.

Acknowledgments

This research was supported by the National Natural Science Foundation of China under Grant no. 60372050 and the National Grand Fundamental Research 973 Program of China under Grant no. 2001CB309403. Moreover, the authors wish to thank the anonymous reviewers for their valuable comments and suggestions, which have helped to improve this manuscript significantly.

Appendix (Proof of the theorem)

Proof. Suppose Φ is the nonlinear mapping function and X_F = [x_F1, ..., x_FM] can approximate the dataset X = [x_1, ..., x_l] in the feature space with error ε, corresponding to the coefficient matrix μ = [μ_ij]_{l×M}, which can be written as

\[
\|\Phi(X) - \mu\,\Phi(X_F)\| \le \varepsilon. \tag{A.1}
\]

Then,

\[
\begin{aligned}
\|f(x) - \alpha^{\mathrm T}\mu\,\Phi(X_F)\,\Phi(x)\|
&= \|f(x) - \alpha^{\mathrm T}\Phi(X)\,\Phi(x) + \alpha^{\mathrm T}\Phi(X)\,\Phi(x) - \alpha^{\mathrm T}\mu\,\Phi(X_F)\,\Phi(x)\| \\
&\le \|f(x) - \alpha^{\mathrm T}\Phi(X)\,\Phi(x)\| + \|\alpha^{\mathrm T}\Phi(X)\,\Phi(x) - \alpha^{\mathrm T}\mu\,\Phi(X_F)\,\Phi(x)\|.
\end{aligned} \tag{A.2}
\]

According to the theorem,

\[
f(x) = \alpha^{\mathrm T}\Phi(X)\,\Phi(x). \tag{A.3}
\]

Substituting (A.1) and (A.3) into (A.2), we get

\[
\|f(x) - \alpha^{\mathrm T}\mu\,\Phi(X_F)\,\Phi(x)\|
\le \|\alpha^{\mathrm T}\Phi(X)\,\Phi(x) - \alpha^{\mathrm T}\mu\,\Phi(X_F)\,\Phi(x)\|
\le \|\alpha\|\cdot\|\Phi(X) - \mu\,\Phi(X_F)\|\cdot\|\Phi(x)\|
= \varepsilon\cdot\|\alpha\|\cdot\|\Phi(x)\|.
\]

It is obvious that ‖f(x) − α^T μ Φ(X_F) Φ(x)‖ converges to zero when ε converges to zero, and the proof is completed. □

References

[1] V. Vapnik, An overview of statistical learning theory, IEEE Trans. Neural Networks 10 (5) (1999) 988–999.

[2] C. Cortes, V. Vapnik, Support vector networks, Mach. Learn. 20(1995) 273–297.


[3] V. Vapnik, Three remarks on support vector machine, in: S.A. Solla, T.K. Leen, K.R. Müller (Eds.), Advances in Neural Computing, vol. 10, 1998, pp. 1299–1319.

[4] V. Vapnik, The Nature of Statistical Learning Theory, Springer, NewYork, 1995.

[5] A. Smola, B. Schölkopf, A tutorial on support vector regression, Royal Holloway College, University of London, UK, NeuroCOLT Technical Report NC-TR-98-030, 1998.

[6] C.J.C. Burges, A tutorial on support vector machines for patternrecognition, Data Mining Knowledge Discovery 2 (2) (1998) 121–167.

[7] M. Schmidt, Identifying speaker with support vector networks, InInterface ’96 Proceedings, Sydney, Australia, 1996.

[8] K.I. Kim, K. Jung, S.H. Park, H.J. Kim, Support vector machinesfor texture classification, IEEE Trans. Pattern Anal. Mach. Intell. 24(11) (2002) 1542–1550.

[9] L.J. Cao, E.H. Francis, Support vector machine with adaptiveparameters in financial time series forecasting, IEEE Trans. NeuralNetworks 14 (6) (2003) 1506–1518.

[10] V. Vapnik, S.E. Golowich, A.J. Smola, Support vector method forfunction approximation, regression estimation and signal processing,Adv. Neural Inform. Process. Syst. 9 (1996) 281–287.

[11] O. Chapelle, P. Haffner, V. Vapnik, Support vector machines forhistogram image classification, IEEE Trans. Neural Networks 5 (10)(1999) 1055–1064.

[12] T. Downs, K. Gates, A. Masters, Exact simplification of supportvector solutions, J. Mach. Learning Res. 2 (2001) 293–297.

[13] C.J.C. Burges, Simplified support vector decision rules, in: Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 1996, pp. 71–77.

[14] C.J.C. Burges, B. Schoelkopf, Improving speed and accuracy ofsupport vector learning machines, Adv. Neural Inform. Process. Syst.9 (1997) 375–381.

[15] J. Brank, M. Grobelnik, N. Milic-Frayling, D. Mladenic, Featureselection using linear support vector machines, Microsoft ResearchTechnical Report MSR-TR-2002-63, 12 June 2002.

[16] G. Baudat, F. Anouar, Feature vector selection and projection usingkernels, Neurocomputing 55 (1–2) (2003) 21–28.

[17] C.J.C. Burges, Geometry and invariance in kernel based methods, in: Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 86–116.

[18] B. Scholkopf, A. Smola, Learning with Kernels, MIT Press,Cambridge, 1999.

[19] T. Graepel, R. Herbrich, J. Shawe-Taylor, Generalisation error bounds for sparse linear classifiers, in: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000, pp. 298–303.

[20] K.J. Lang, M.J. Witbrock, Learning to tell two spirals apart, in:Proceedings of 1989 Connectionist Models Summer School, 1989,pp. 52–61.

[21] C. Blake, C. Merz, UCI repository of machine learning databases.

About the Author—QING LI received the B.S. degree in automation from Xidian University, Xi’an, China, in 2002. Since 2002, he has been working toward the M.S. and Ph.D. degrees in circuit and system at Xidian University. His research interests include pattern recognition, machine learning and data mining.

About the Author—LICHENG JIAO received the B.S. degree from Shanghai Jiaotong University, Shanghai, China, in 1982 and the M.S. and Ph.D. degrees from Xi’an Jiaotong University, Xi’an, China, in 1984 and 1990, respectively. He is currently Professor and Dean of the School of Electronic Engineering at Xidian University. His research interests include neural networks, data mining, nonlinear intelligent signal processing, and communication.

About the Author—YINGJUAN HAO received the B.S. degree from Northwest University for Nationalities, Lanzhou, China, in 2004. Since 2004, she has been working toward the M.S. degree at Lanzhou University.