Talanta 71 (2007) 1679–1683. doi:10.1016/j.talanta.2006.07.047

A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification

Qi Shen a,b,*, Wei-Min Shi a, Wei Kong a, Bao-Xian Ye a

a Chemistry Department, Zhengzhou University, Zhengzhou 450052, China
b State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, China

Received 26 April 2006; received in revised form 6 July 2006; accepted 27 July 2006. Available online 1 September 2006.

* Corresponding author. Tel.: +86 371 67767957; fax: +86 371 67763220. E-mail address: [email protected] (Q. Shen).

Abstract

In the analysis of gene expression profiles, the number of tissue samples with gene expression levels available is usually small compared with the number of genes. This can lead to overfitting or even to a complete failure of microarray data analysis. The selection of genes that are truly indicative of the tissue classification concerned is therefore one of the key steps in microarray studies. In the present paper, we have combined a modified discrete particle swarm optimization (PSO) algorithm with support vector machines (SVM) for tumor classification. The modified discrete PSO is applied to select genes, while SVM is used as the classifier and evaluator. The proposed approach was applied to microarray data of 22 normal and 40 colon tumor tissues and showed good prediction performance. The results demonstrate that the modified PSO is a useful tool for gene selection and for mining high-dimensional data.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Particle swarm optimization; Support vector machine; Gene selection; Gene expression data

1. Introduction

The development of microarray technology is creating a wealth of gene expression data and has brought about a revolution in biological and medical research [1]. Thousands or even tens of thousands of gene expression levels can be monitored simultaneously with microarray technology [2,3]. In the analysis of gene expression profiles, the number of tissue samples with gene expression levels available is usually small compared with the number of genes. This can lead to overfitting and the curse of dimensionality, or even to a complete failure of microarray data analysis. Most of the genes monitored in a microarray may not be relevant to classification, and these genes may degrade the prediction performance of data analysis by masking the contribution of the relevant genes [4–7]. The selection of genes that are truly indicative of the tissue classification concerned is therefore one of the key steps in microarray studies. The benefit gained from gene selection in microarray data analysis is not only the stability of the analysis model, but also the biological interpretability of the relationship between the genes and a complex biological phenomenon. Large numbers of features also increase computational complexity and cost. Therefore, reducing the dimensionality of the gene expression information is a key step in developing a successful gene expression-based data analysis system.

Various clustering, classification, and prediction techniques have been used to analyze and understand the gene expression data resulting from DNA microarrays, such as Fisher discriminant analysis, artificial neural networks and support vector machines (SVM). SVM has been found useful in handling classification tasks with high dimensionality and sparsity of data points, and has been recommended as a popular approach for treating this particular data structure efficiently [8–11]. However, there is increasing evidence that gene selection is also essential for successful SVM analysis of microarray data, and the lack of gene selection can spoil SVM performance [12,13]. Results of SVM analysis might not improve when an excess of genes is used.


It has been recognized that the elimination of uninformative genes which do not contribute to model formulation is important in microarray data analysis, even in situations when SVM is applied.

For gene selection, one can use filtering approaches such as the t test and nonparametric scoring, as well as more sophisticated methodologies such as genetic algorithms (GAs) and evolutionary algorithms (EAs) [14–19]. Among them, GAs and EAs are optimization techniques simulating biological systems, classified as a category of the research of so-called artificial life. The particle swarm optimization (PSO) algorithm [20–22], a relatively new optimization technique in this category, can also be used as an excellent optimizer; it originated as a simulation of a simplified social system. Similar to GAs and EAs, PSO is a population-based optimization tool which searches for optima by updating generations. However, unlike GAs and EAs, PSO has no evolution operators such as crossover and mutation. Compared to GAs and EAs, the advantages of PSO are that it is easy to implement and there are few parameters to adjust. Most versions of PSO operate in continuous, real-number space. A modified discrete PSO algorithm was proposed in our previous studies [23–25] to select variables in partial least squares modeling and showed satisfactory performance. In the present paper, we have combined the modified discrete PSO and support vector machines (SVM) for tumor classification. The modified discrete PSO is applied to select genes, while SVM is used as the classifier and evaluator. The formulation and the corresponding programming flow chart are presented in detail in the paper. The proposed approach was applied to microarray data of 22 normal and 40 colon tumor tissues and showed good prediction performance. The results demonstrate that the modified PSO is a useful tool for gene selection and mining high-dimensional data.

2. Methods

2.1. Support vector machines

SVM is used here for classifying tumor and normal tissues. SVM is a kind of learning machine based on statistical learning theory and is a popular tool in pattern recognition. The most remarkable characteristics of SVMs are the absence of local minima, the sparseness of the solution, and the use of kernel-induced feature spaces. The basic idea of applying SVMs to pattern classification can be outlined as follows. First, map the input vectors into a feature space, either linearly or non-linearly, which is relevant to the selection of the kernel function. Then, within the feature space, seek an optimized linear division; i.e., construct a hyperplane which can separate the two classes with the least error and maximal margin. The SVM training process always seeks a globally optimized solution and avoids overfitting, so it has the ability to deal with a large number of features. A complete description of the theory of SVMs for pattern recognition is given in the book by Vapnik [26]. In this study, a linear kernel function is used in the SVM procedure.
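As a concrete illustration of the linear-kernel setup described above, the following is a minimal sketch (not the authors' Matlab code) of a linear soft-margin SVM trained by stochastic sub-gradient descent on the hinge loss; the function names and hyperparameters here are illustrative assumptions, and in practice an established SVM library would be used.

```python
import random

def train_linear_svm(X, y, lam=0.001, eta=0.1, epochs=500, seed=0):
    """Approximately minimize lam/2*||w||^2 + mean hinge loss by
    stochastic sub-gradient descent; labels y must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            w = [wj * (1.0 - eta * lam) for wj in w]  # L2 shrinkage
            if y[i] * score < 1:  # sample inside the margin: push it out
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def svm_predict(model, x):
    """Sign of the linear decision function, mapped to +1/-1."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

With a linear kernel the decision function is simply a hyperplane in the original gene-expression space, which is why the selected feature subset directly determines the classifier's inputs.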


2.2. Modified particle swarm optimization for gene selection

PSO [20–22], developed by Eberhart and Kennedy in 1995, is a stochastic global optimization technique inspired by the social behavior of bird flocking. The algorithm models the exploration of a problem space by a population of individuals, or particles. In PSO, each single solution is a particle in the search space. Each individual in PSO flies through the search space with a velocity which is dynamically adjusted according to the flying experience of its own and its companions. In the PSO algorithm, a population of particles is updated on the basis of information about each particle's previous best performance and the best particle in the population. PSO is initialized with a group of random particles. Each particle is treated as a point in a D-dimensional space. The ith particle is represented as xi = (xi1, xi2, ..., xiD). The best previous position of the ith particle, the one that gives the best fitness value, is represented as pi = (pi1, pi2, ..., piD). The best particle among all the particles in the population is represented by pg = (pg1, pg2, ..., pgD). The velocity, i.e. the rate of position change, of particle i is represented as vi = (vi1, vi2, ..., viD). In every iteration, each particle is updated by following these two best values.

For a discrete problem expressed in binary notation, a particle moves in a search space restricted to 0 or 1 on each dimension. In a binary problem, updating a particle represents the change of a bit that should be in either state 1 or 0, and the velocity represents the probability of bit xiD taking the value 1 or 0.

According to the information sharing mechanism of PSO, a modified discrete PSO [22] was proposed as follows. The velocity viD of every individual is a random number in the range (0, 1). The resulting change in position is then defined by the following rule:

If (0 < viD ≤ a), then xiD(new) = xiD(old)    (1)

If (a < viD ≤ (1 + a)/2), then xiD(new) = piD    (2)

If ((1 + a)/2 < viD ≤ 1), then xiD(new) = pgD    (3)

where a is a random value in the range (0, 1) named the static probability. In this study the static probability a equals 0.5. Though the velocity in the modified discrete PSO differs from that in the continuous version of PSO, the information sharing mechanism and the model of updating a particle by following the two best positions are the same in the two PSO versions. To circumvent convergence to local optima and improve the ability of the modified PSO algorithm to overleap local optima, 10% of particles are forced to fly randomly, not following the two best particles:

If (0 < viD ≤ 0.1) and (0 < b ≤ β), then xiD(new) = 1    (4)

If (0 < viD ≤ 0.1) and (β < b ≤ 1), then xiD(new) = 0    (5)

If (0.1 < viD ≤ 1), then xiD(new) = xiD(old)    (6)

where b is a random value in the range (0, 1) and β is the selection probability. In the random flying operator, 10% of particles are randomly selected, and each site of a selected particle has a probability of 0.1 of varying its value in a stochastic manner. In this study the selection probability β equals 0.01, to avoid a large number of genes being contained in the subsets; for microarray data analysis, this means 1% of genes are randomly selected. If the minimum error criterion is attained or the number of cycles reaches a user-defined limit, the algorithm is terminated.
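The position update of Eqs. (1)–(6) can be sketched in Python as follows. This is an illustrative reimplementation rather than the authors' Matlab code; the constants a = 0.5 and β = 0.01 are taken from the text, and the `random_flyer` flag marks the 10% of particles that fly randomly.

```python
import random

def update_particle(x, p_best, g_best, a=0.5, beta=0.01,
                    random_flyer=False, rng=random):
    """Return the new 0/1 position vector of one particle."""
    new_x = []
    for d in range(len(x)):
        v = rng.random()  # velocity viD: uniform random in (0, 1)
        if random_flyer:
            # Eqs. (4)-(6): the selected 10% of particles ignore the two
            # best positions and mutate each bit stochastically.
            if v <= 0.1:
                b = rng.random()
                new_x.append(1 if b <= beta else 0)  # Eqs. (4) and (5)
            else:
                new_x.append(x[d])                   # Eq. (6)
        else:
            # Eqs. (1)-(3): keep the old bit, copy the personal best,
            # or copy the global best, depending on where v falls.
            if v <= a:
                new_x.append(x[d])                   # Eq. (1)
            elif v <= (1 + a) / 2:
                new_x.append(p_best[d])              # Eq. (2)
            else:
                new_x.append(g_best[d])              # Eq. (3)
    return new_x
```

Note that when a particle's current position coincides with both best positions, Eqs. (1)–(3) leave it unchanged; only the random flyers can then introduce new bits.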

2.3. Classification modeling by the combination of modified discrete PSO and support vector machines (PSOSVM)

Though SVM has the ability to avoid overfitting and to deal with a large number of features, there is increasing evidence that feature selection is also essential for successful SVM analysis. An efficient scheme is to combine gene selection with SVM analysis: SVM is used as the classifier and the modified discrete PSO is applied to select features. In the modified discrete PSO, each particle is encoded as a string of binary bits, one per gene, which together define the feature subset of an SVM classifier. A bit "0" in a particle indicates that the corresponding gene is not used. The classification modeling by particle swarm optimization and support vector machine (PSOSVM) is described as follows:

Step 1. Randomly initialize all the initial binary strings IND in the modified discrete PSO with an appropriate population size. IND is a string of binary bits corresponding to each gene in the SVM.
Step 2. Calculate the fitness function of each individual, corresponding to models on the training set. If the best objective function of the generation fulfills the end condition, the training is stopped and the results are output; otherwise, go to the next step.
Step 3. Update the IND population according to the modified discrete PSO.
Step 4. Go back to the second step to calculate the fitness of the renewed population.

The PSOSVM scheme is presented in Fig. 1.
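The steps above can be sketched as a search loop. In this sketch, `evaluate_fitness` and `update_particle` are hypothetical stand-ins for the SVM-based fitness evaluation and the modified discrete PSO update; the fitness is treated as a quantity to minimize, and the 10% random-flyer fraction follows the text.

```python
import random

def psosvm_search(n_genes, evaluate_fitness, update_particle,
                  pop_size=50, max_iter=50, seed=0):
    """Return the best binary gene-selection string found (minimal fitness)."""
    rng = random.Random(seed)
    # Step 1: random binary strings IND, one bit per gene.
    pop = [[rng.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    p_best = [p[:] for p in pop]                 # personal best positions
    p_fit = [evaluate_fitness(p) for p in pop]   # their fitness values
    g_best = p_best[min(range(pop_size), key=lambda i: p_fit[i])][:]
    for _ in range(max_iter):                    # Step 4: iterate
        # 10% of particles fly randomly this generation.
        flyers = set(rng.sample(range(pop_size), max(1, pop_size // 10)))
        for i in range(pop_size):
            # Step 3: modified discrete PSO update.
            pop[i] = update_particle(pop[i], p_best[i], g_best,
                                     random_flyer=(i in flyers), rng=rng)
            # Step 2: fitness of the renewed individual.
            f = evaluate_fitness(pop[i])
            if f < p_fit[i]:
                p_fit[i], p_best[i] = f, pop[i][:]
        g_best = p_best[min(range(pop_size), key=lambda i: p_fit[i])][:]
    return g_best
```

The population size of 50 and the limit of 50 iterations mirror the settings reported later for the colon dataset.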

In this study, a cross-validation (CV) resampling approach is used to construct the learning and test sets. We evaluated the classification accuracy of the prediction models derived from the four data sets by using a leave-half-out CV (LHOCV) procedure. Briefly, the two-class samples are randomly split into two data sets of approximately equal size. The training data set, which is a random combination of the subsets for the two classes, is used to derive a classification model that is then applied to predict the remaining subsets. The LHOCV produces four pairs of learning and test sets. Each individual is evaluated by the value averaged over the four pairs.

In PSOSVM, the performance of each particle is measured according to a pre-defined fitness function. The fitness is defined as the reciprocal of the classification accuracy averaged over LHOCV, evaluated using a linear SVM.


Fig. 1. Flow chart of the PSOSVM scheme.

The modified PSO and SVM algorithms were written in Matlab 5.3 and run on a personal computer (Intel Pentium 4 processor, 1.5 GHz, 256 MB RAM).

3. Data set

Alon [27] analyzed the gene expression of colon tissues. In the present study, the colon data were used to test the performance of PSOSVM in gene selection for microarray data analysis. The colon data set consists of expression profiles of 2000 genes, measured using an Affymetrix oligonucleotide array, from 22 normal and 40 colon tumor tissues; the data are publicly available at http://www.microarray.princeton.edu/oncology/. The gene expression data are scaled into (0.0, 1.0) for analysis. Among the 62 colon samples, 50 randomly selected samples were used as the training set and the remaining 12 samples as the prediction set. For the colon dataset, we first applied a t test filtering algorithm to select the 400 top-ranked informative genes and then applied the PSOSVM search method to these 400 genes.
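A t-test pre-filter of this kind can be sketched as below. The Welch form of the two-sample t-statistic is an assumption (the paper does not specify the variant), and k = 400 corresponds to the setting above.

```python
import math

def t_filter(X, y, k):
    """X: samples x genes, y: two-class labels.
    Return indices of the k genes with the largest |t| statistic."""
    classes = sorted(set(y))
    a = [i for i, lab in enumerate(y) if lab == classes[0]]
    b = [i for i, lab in enumerate(y) if lab == classes[1]]
    scores = []
    for g in range(len(X[0])):
        va = [X[i][g] for i in a]
        vb = [X[i][g] for i in b]
        ma, mb = sum(va) / len(va), sum(vb) / len(vb)
        # Unbiased per-class variances (Welch-style pooling).
        sa = sum((v - ma) ** 2 for v in va) / max(len(va) - 1, 1)
        sb = sum((v - mb) ** 2 for v in vb) / max(len(vb) - 1, 1)
        se = math.sqrt(sa / len(va) + sb / len(vb)) or 1e-12
        scores.append((abs(ma - mb) / se, g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:k]]
```

Restricting the PSO search to the 400 top-ranked genes shrinks the binary search space from 2^2000 to 2^400 dimensions before the swarm is launched.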

4. Results and discussion

At the beginning, an SVM classifier with all 2000 genes was applied to the colon cancer dataset. Using all 2000 initial genes, the classification accuracies for the training set and test set were 1 and 0.8333, respectively. Using all 2000 genes does not offer good predictive ability, and there is an obvious symptom of overfitting. Inclusion of excess gene variables in the modeling process degrades the performance of the SVM analysis. This might arise from the sensitivity of SVM to irrelevant variables that do not contribute to classification and prediction. So the modified PSO algorithm is employed to select the genes strongly contributing to classification for SVM modeling.

Fig. 2. Distribution of classification accuracy over 200 runs of partition samples using the best four-gene model.

In the present work, the population size of PSO is set to 50 and PSOSVM was stopped after 50 iterations. In PSOSVM, the prediction performances of the classification models derived from the selected variables were evaluated by using the LHOCV procedure.

The best model with maximum classification accuracy found during the PSOSVM search contains four genes. The four genes are No. 377, 765, 1495 and 1582. The best model gives classification accuracies of 94.00% for the training set and 91.67% for the test set. Two samples are reported as false positives, and the number of false negatives reported by the best model is also equal to 2. The four-gene model provides a sensitivity of 95% and a specificity of 91%. The classification accuracies of the next best model, which contains five genes, are 95.1 and 90% for the training and test sets, respectively. Using only five or four genes, the PSOSVM method provided 91% classification accuracy on the test set for colon cancer. By removing redundant genes, the test error was reduced.

Even though the classification accuracies of the prediction models were evaluated by using the LHOCV procedure, it should be noted that the classification accuracy of a model at each iteration is not necessarily the same, because of the varying partition of training and test sets. The reliability of a classification model is an essential issue in microarray data analysis. To evaluate accurately the predictive ability and reliability of the models derived from the optimal sets of genes selected by PSOSVM, the total tissue samples were randomly partitioned into training and test sets 200 times and the classification accuracy was then averaged for each selected set of genes. Fig. 2 shows the distribution of classification accuracy on the test set over 200 runs of partition samples using the best four-gene model. As shown in Fig. 2, a classification accuracy larger than 91.5% occurs about 111 times in 200 runs, and the highest accuracy (100%) appears 68 times. By resampling a large number of learning samples, the average classification accuracy achieved 91.8%. It can be seen that the classification model using the selected four genes is stable and reliable.

Fig. 3. Convergence curves for PSOSVM.

In comparison with SVM analysis using all genes, better results are obtained from classification analysis including only the selected genes than from SVM analysis including all variables. The predictive ability of the SVM model was much improved by the PSOSVM analysis, with the classification accuracy rising from 83 to 91.7%. One notices that the modified PSO search helps us to select 4 or 5 genes from 2000 descriptors. These selected genes carry more or less information related to tumor classification. In this way, PSOSVM maximally extracts information from the original data set for microarray data analysis.

As shown in Fig. 3, the maximum classification accuracy can be obtained in about 25 cycles of the PSOSVM algorithm, and the fitness value drops quickly. Information sharing in PSO is between the global best position and the corresponding personal best positions. This is a very unique attribute of the PSO algorithm, and it seems that it is due to the introduction of this unique attribute that the PSO algorithm generally exhibits a high convergence rate.

Fig. 4. The relationship between classification accuracy and the number of genes selected using PSOSVM.


Fig. 4 shows the relationship between classification accuracy and the number of genes in a selected optimal gene set using PSOSVM. For the colon dataset, four or five genes is the optimal number for classification by SVM. As the number of genes exceeds 5, the classification accuracy decreases; the more genes are selected, the smaller the classification accuracy becomes. This may be because more irrelevant genes were included in building the classification model and thus degraded the prediction performance. It indicates the necessity of excluding irrelevant genes, or of selecting relevant genes, in microarray analysis.

In the modified discrete PSO, the performance of each particle is measured according to a pre-defined fitness function. In PSOSVM the fitness is defined as the reciprocal of the classification accuracy averaged over LHOCV, evaluated using a linear SVM. Other classification methods, such as Fisher linear discriminant analysis (LDA), can also be used instead of SVM to evaluate the fitness in the modified PSO. The best model with maximum classification accuracy contains three genes when using LDA as the fitness evaluator. The three genes are No. 377, 493 and 1964. This best model gives classification accuracies of 94.00% for the training set and 75% for the test set. By resampling the learning samples 200 times, the average classification accuracy achieved 90.43 and 89.38% for the training and test sets, respectively.

When the modified PSO search terminates, one may count the number of times a particular gene descriptor appears in the 100 individual combinations. When one lists the descriptors in order of decreasing number of appearances, the top descriptors, i.e. the most frequently appearing features, are obtained. Gene 377 (Z50753), related to uroguanylin precursor, is shown to be important in the classification analysis of the colon dataset. Gene 377 has markedly low expression in colon tumor samples, and it may play an essential role in the pathology and treatment of colon cancer [28–30]. Gene 1423 (J02854), corresponding to myosin regulatory light chain, also occupies an important position. Besides these genes, genes 14, 66, 249, 698, 765, 1042 and 1873 are also advantageous for classification of the colon dataset. It is interesting that most of these top descriptors have high expression in normal tissue and low expression in colon tumor samples. Genes No. 14, 249 and 377 were also found to be correlated with colon tumors in other studies [30,31]. The genetic architecture of the disease is a complex one, and we developed a novel approach to hunting for disease-relevant genes using PSOSVM as the feature gene search engine.
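The descriptor-frequency count described above can be sketched as follows; the population of final particles is the assumed input, and the tally simply counts how often each gene's bit is set.

```python
from collections import Counter

def top_descriptors(population, n=10):
    """population: list of 0/1 particles; return the n gene indices
    that appear (bit = 1) most often, in decreasing frequency."""
    counts = Counter()
    for particle in population:
        counts.update(g for g, bit in enumerate(particle) if bit == 1)
    return [g for g, _ in counts.most_common(n)]
```

Ranking genes by selection frequency in this way gives a simple stability measure on top of the single best particle.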

5. Conclusion

The selection of genes that are truly indicative of the tissue classification concerned is a key step in developing a successful gene expression-based data analysis system. In this paper, the modified discrete PSO was applied to select genes and a support vector machine was used as the classifier. A new objective function was formulated to determine the appropriate number of genes. The colon data set was used to test the proposed PSOSVM algorithm. The results have demonstrated that the proposed method is useful for gene selection and classification.

Acknowledgements

The work was financially supported by the National Natural Science Foundation of China (Grant Nos. 20505015 and 20475050).

References

[1] D. Schena, R.W. Shalon, P.O. Davis, Science 270 (1995) 467.
[2] Y. Yamaguchi, D. Ogura, K. Yamashita, M. Miyazaki, H. Nakamura, H. Maeda, Talanta 68 (2006) 700.
[3] G.P. Yang, D.T. Ross, W.W. Kuang, P.O. Brown, R.J. Weigel, Nucleic Acids Res. 27 (1999) 1517.
[4] G. Stephanopoulos, D. Hwang, W.A. Schmitt, J. Misra, Bioinformatics 18 (2002) 1054.
[5] S. Biceiato, A. Luchini, C.D. Bello, Bioinformatics 19 (2003) 571.
[6] D.V. Nguyen, D.M. Rocke, Bioinformatics 18 (2002) 1216.
[7] Y.X. Tan, L. Shi, W. Tong, G.G.T. Hwang, C. Wang, Comput. Biol. Chem. 28 (2004) 235.
[8] C.Z. Cai, W.L. Wang, L.Z. Sun, Y.Z. Chen, Math. Biosci. 185 (2003) 111.
[9] E. Byvatov, U. Fechner, J. Sadowski, G. Schneider, J. Chem. Inf. Comput. Sci. 43 (2003) 1882.
[10] H.X. Liu, R.S. Zhang, F. Luan, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, J. Chem. Inf. Comput. Sci. 43 (2003) 900.
[11] R. Burbidge, M. Trotter, B. Buxton, S. Holden, Comput. Chem. 26 (2001) 5.
[12] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, V. Vapnik, Adv. Neural Inf. Process. Syst. 13 (2001) 668.
[13] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Mach. Learn. 46 (2002) 389.
[14] V.G. Franco, J.C. Peraan, V.E. Mantovani, H.C. Goicoechea, Talanta 68 (2006) 1005.
[15] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, Science 286 (1999) 531.
[16] L. Li, W. Jiang, X. Li, K.L. Moser, Z. Guo, L. Du, Q. Wang, E.J. Topol, Q. Wang, S. Rao, Genomics 85 (2005) 16.
[17] Z. Ramadan, D. Jacobs, M. Grigorov, S. Kochhar, Talanta 68 (2006) 1683.
[18] P. Watkins, G. Puxty, Talanta 68 (2006) 1336.
[19] T. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, D. Haussler, Bioinformatics 16 (2000) 906.
[20] J. Kennedy, R. Eberhart, Proceedings of the IEEE International Conference on Neural Networks, 1995, p. 1942.
[21] Y. Shi, R. Eberhart, Proceedings of the IEEE World Congress on Computational Intelligence, 1998, p. 69.
[22] M. Clerc, J. Kennedy, IEEE Trans. Evol. Comput. 6 (2002) 58.
[23] Q. Shen, J.H. Jing, G.L. Shen, R.Q. Yu, Eur. J. Pharm. Sci. 22 (2004) 145.
[24] Q. Shen, J.H. Jing, W.Q. Lin, G.L. Shen, R.Q. Yu, J. Comput. Chem. 25 (2004) 1726.
[25] Q. Shen, J.H. Jing, G.L. Shen, R.Q. Yu, J. Chem. Inf. Comput. Sci. 44 (2004) 2027.
[26] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[27] U. Alon, Proc. Natl. Acad. Sci. 96 (1999) 6745.
[28] D. Notterman, U. Alon, A. Sierk, A. Levine, Cancer Res. 61 (2001) 3124.
[29] K. Shailubhai, H.H. Yu, K. Karunanandaa, J.Y. Wang, S.L. Eber, Y. Wang, N.S. Joo, H.D. Kim, B.W. Miedema, S.Z. Abbas, Cancer Res. 60 (2000) 5151.
[30] Y. Li, C. Campbell, M. Tipping, Bioinformatics 18 (2002) 1332.
[31] S. Ma, J. Huang, Bioinformatics 21 (2005) 4356.