Available online at www.sciencedirect.com

Computational Biology and Chemistry 32 (2008) 53–60

Research Article

Hybrid particle swarm optimization and tabu search approach for selecting genes for tumor classification using gene expression data

Qi Shen, Wei-Min Shi ∗, Wei Kong
Chemistry Department, Zhengzhou University, Zhengzhou 450052, China

Received 7 January 2007; received in revised form 10 October 2007; accepted 14 October 2007

∗ Corresponding author. Tel.: +86 371 67767957; fax: +86 371 67763220. E-mail address: [email protected] (W.-M. Shi).

1476-9271/$ – see front matter © 2007 Published by Elsevier Ltd. doi:10.1016/j.compbiolchem.2007.10.001

Abstract

Gene expression data are characterized by thousands, or even tens of thousands, of measured genes on only a few tissue samples. This can lead to overfitting and the curse of dimensionality, or even to a complete failure in the analysis of microarray data. Gene selection is an important component of gene expression-based tumor classification systems. In this paper, we develop a hybrid particle swarm optimization (PSO) and tabu search (HPSOTS) approach for gene selection for tumor classification. The incorporation of tabu search (TS) as a local improvement procedure enables the HPSOTS algorithm to escape local optima and show satisfactory performance. The proposed approach is applied to three different microarray datasets. Moreover, we compare the performance of HPSOTS on these datasets with that of stepwise selection and of the pure TS and PSO algorithms. The results demonstrate that HPSOTS is a useful tool for gene selection and for mining high-dimensional data. © 2007 Published by Elsevier Ltd.

Keywords: Particle swarm optimization; Tabu search; Gene selection; Gene expression data

1. Introduction

High-density DNA microarrays are one of the most powerful tools for functional genomic studies, and the development of microarray technology allows the expression levels of thousands of genes to be measured simultaneously (Schena et al., 1995). Recent studies have shown that one of the most important applications of microarrays is tumor classification (Cho et al., 2003; Li et al., 2004).

Gene selection is an important component of gene expression-based tumor classification systems. Microarray experiments generate large datasets with expression values for thousands, or even tens of thousands, of genes but for no more than a few tissue samples. Most of the genes monitored in a microarray may be irrelevant to the analysis, and the use of all the genes may degrade the prediction performance of the classification rule by masking the contribution of the relevant genes (Li, 2006; Li and Yang, 2002; Stephanopoulos et al., 2002; Nguyen and Rocke, 2002; Bicciato et al., 2003; Tan et al., 2004). An efficient way to solve this problem is gene selection, and the selection of discriminatory genes is critical to improving accuracy and decreasing computational complexity and cost. Once relevant genes have been selected, conventional classification techniques can be applied to the microarray data. Gene selection may also highlight the relevant genes, which could enable biologists to gain significant insight into the genetic nature of the disease and the mechanisms responsible for it (Guyon et al., 2002; Wang et al., 2005).

Several gene selection techniques have been employed in classification problems, such as the t-test filtering approach, as well as artificial intelligence techniques such as genetic algorithms (GAs), evolution algorithms (EAs) (Golub et al., 1999; Furey et al., 2000; Xiong et al., 2001; Peng et al., 2003; Li et al., 2005; Tibshirani et al., 2002; Sima and Dougherty, 2006), simulated annealing, tabu search and particle swarm optimization.

The particle swarm optimization (PSO) algorithm (Kennedy and Eberhart, 1995; Shi and Eberhart, 1998; Clerc and Kennedy, 2002) was proposed by James Kennedy and R.C. Eberhart in 1995 and is motivated by the social behavior of organisms such as bird flocking and fish schooling. Particle swarm optimization comprises a very simple concept and can be implemented in a few lines of computer code. It requires only a few parameters to adjust, and is computationally inexpensive in
terms of both memory requirements and speed. A modified discrete PSO algorithm was proposed in our previous studies (Shen et al., 2004a,b, in press) to reduce dimensionality and showed satisfactory performance. Although PSO has proved to be a potent search technique for optimization problems, there are still many complex situations in which PSO tends to converge to local optima and does not perform particularly well.

Tabu search (TS) is a powerful optimization procedure that has been successfully applied to a number of combinatorial optimization problems (Glover, 1986). It avoids convergence to local minima by employing a flexible memory system. However, the convergence speed of TS depends on the initial solution, and the parallelism of the PSO population can help TS find the promising regions of the search space very quickly.

In this paper, we develop a hybrid PSO and TS (HPSOTS) approach for gene selection for tumor classification. The incorporation of TS as a local improvement procedure enables the HPSOTS algorithm to escape local optima and show satisfactory performance. The formulation and the corresponding programming flow chart are presented in detail in the paper. To evaluate the performance of HPSOTS, the proposed approach is applied to three publicly available microarray datasets. Moreover, we compare the performance of HPSOTS on these datasets with that of stepwise selection and of the pure TS and PSO algorithms. The results demonstrate that HPSOTS is a useful tool for gene selection and for mining high-dimensional data.

2. Methods

2.1. Modified particle swarm optimization

PSO (Kennedy and Eberhart, 1995; Shi and Eberhart, 1998; Clerc and Kennedy, 2002) is a stochastic global optimization technique that can be used for gene selection. PSO carries out a search based on a population of individuals, or particles, and each particle represents a potential solution in the search space. During flight, each particle adjusts its position with a velocity that depends on its own experience and on the experience of the other particles, until a stopping criterion is satisfied. Suppose that the problem space is $D$-dimensional; then the position of the $i$th particle is represented as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$. The velocity, i.e. the rate of position change of particle $i$, is represented as $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$. The best previous position of the $i$th particle, i.e. the one that gives its best fitness value, is expressed as $p_i = (p_{i1}, p_{i2}, \ldots, p_{iD})$. The position of the best particle in the swarm is denoted as $p_g = (p_{g1}, p_{g2}, \ldots, p_{gD})$. For a discrete problem expressed in binary notation, a particle moves in a search space restricted to 0 or 1 on each dimension. In a binary problem, updating a particle means changing bits that must each be in state 1 or 0, and the velocity represents the probability of bit $x_{id}$ taking the value 1 or 0. In every iteration, each particle is updated by following the two best values.

According to the information sharing mechanism of PSO, a modified discrete PSO (Shen et al., 2004a,b) was proposed as follows. The velocity $v_{id}$ of every individual is a random number in the range (0,1). The resulting change in position is then defined by the following rule:

$$\text{If } 0 < v_{id} \le a, \text{ then } x_{id}(\text{new}) = x_{id}(\text{old}) \tag{1}$$

$$\text{If } a < v_{id} \le \frac{1 + a}{2}, \text{ then } x_{id}(\text{new}) = p_{id} \tag{2}$$

$$\text{If } \frac{1 + a}{2} < v_{id} \le 1, \text{ then } x_{id}(\text{new}) = p_{gd} \tag{3}$$

where $a$ is a value in the range (0,1) called the static probability. In this study the static probability $a$ is set to 0.5. Although the velocity in the modified discrete PSO is different from that in the continuous version of PSO, the information sharing mechanism and the updating model, in which a particle follows the two best positions, are the same in both PSO versions. The details of the modified PSO have been described elsewhere (Shen et al., 2004a,b).
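
The update rule of Eqs. (1)–(3) can be illustrated with a short Python sketch. It is not the authors' Matlab code: the function name `update_particle` and the `rng` argument are hypothetical, and the velocity is drawn from `random.random()`, which samples [0, 1) rather than the open interval (0,1) stated in the text.

```python
import random

def update_particle(x, p_best, g_best, a=0.5, rng=random):
    """Apply Eqs. (1)-(3) to every bit of one particle (illustrative sketch)."""
    upper = (1.0 + a) / 2.0
    new_x = []
    for x_d, p_d, g_d in zip(x, p_best, g_best):
        v = rng.random()           # velocity v_id, assumed uniform on [0, 1)
        if v <= a:
            new_x.append(x_d)      # Eq. (1): keep the old bit
        elif v <= upper:
            new_x.append(p_d)      # Eq. (2): copy the particle's own best bit
        else:
            new_x.append(g_d)      # Eq. (3): copy the swarm's best bit
    return new_x
```

With $a = 0.5$, each bit keeps its old value with probability 0.5 and copies the personal-best or global-best bit with probability 0.25 each.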

2.2. Tabu search

Tabu search (TS) was invented by Glover (1986) and has been used to solve a wide range of hard optimization problems. TS is an iterative procedure designed for the solution of optimization problems. TS starts with a random solution and evaluates the fitness function for that solution. Then all possible neighbors of the given solution are generated and evaluated. A neighbor is a solution that can be reached from the current solution by a simple, basic transformation. If the best of these neighbors is not in the tabu list, it is picked as the new current solution. The tabu list keeps track of previously explored solutions and prohibits TS from revisiting them. Thus, even if the best neighbor solution is worse than the current solution, TS will accept it and move uphill. In this way, local minima can be overcome. Any reversal of these solutions or moves is then a forbidden move and is classified as tabu. Aspiration criteria, which allow the tabu status of a move to be overridden, can be introduced if that move is still found to lead to a better fitness than that of the current optimum. If no more neighbors are present (all are tabu), or if no improvements are found during a predetermined number of iterations, the algorithm stops. Otherwise, the algorithm continues the TS procedure.
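
As a rough illustration of this procedure, the following Python sketch runs a tabu search over bit strings. Several choices are assumptions made for the example and are not taken from the paper: neighbors are generated by flipping a single bit, the fitness is maximized, the tabu list stores whole solutions, and the names `tabu_search`, `tabu_size`, `max_iter` and `patience` are hypothetical. The aspiration criterion is left out, because a solution stored in the list has already been visited and so cannot beat the best fitness found so far.

```python
from collections import deque

def tabu_search(start, fitness, tabu_size=5, max_iter=50, patience=10):
    """Maximize `fitness` over bit tuples using single-bit-flip moves."""
    current = tuple(start)
    best, best_fit = current, fitness(current)
    tabu = deque([current], maxlen=tabu_size)    # most recently visited solutions
    stale = 0
    for _ in range(max_iter):
        # Neighbors: every solution reachable by flipping exactly one bit.
        neighbors = [current[:i] + (1 - current[i],) + current[i + 1:]
                     for i in range(len(current))]
        allowed = [n for n in neighbors if n not in tabu]
        if not allowed:
            break                                # all neighbors are tabu
        current = max(allowed, key=fitness)      # accepted even if it is worse
        tabu.append(current)
        if fitness(current) > best_fit:
            best, best_fit = current, fitness(current)
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break                            # no improvement for too long
    return best, best_fit
```

For example, `tabu_search((0,) * 6, sum)` climbs to the all-ones string, then keeps moving through worse neighbors until the `patience` limit stops it, and returns the best solution seen.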

2.3. Classification by hybrid PSO and TS (HPSOTS) approach

Although PSO has good convergence properties and is effective at solving optimization problems, after some generations the population diversity can be greatly reduced and the PSO algorithm may converge prematurely to a local optimum. TS is a powerful stochastic optimization technique that can theoretically converge asymptotically to a global optimum solution, but it takes much time to reach the near-global minimum. The incorporation of TS into PSO as a local improvement procedure enables the algorithm to maintain population diversity and to avoid being trapped in misleading local optima. In the present work, strings of binary bits were adopted to code all the particles of the modified discrete PSO. Each binary-coded string (particle) stands for a set of genes, which are used for evaluating the classification fitness function. A bit “0” in a particle indicates that the corresponding gene is not used. The TS was applied to pick the new current particle. The classification modeling by hybrid PSO and TS (HPSOTS) is described as follows:

Step 1. Randomly initialize all the binary strings IND in HPSOTS with an appropriate population size and evaluate the fitness function of each individual in IND. IND is the set of binary strings in which each bit corresponds to one gene.

Step 2. Generate and evaluate the neighbors of 90% of the individuals in IND according to the information sharing mechanism of PSO (Eqs. (1)–(3)).

Step 3. Pick a new individual from the explored neighborhood according to the aspiration criterion and tabu conditions, and update the IND population.

Step 4. To further improve the ability of HPSOTS to escape local optima, the other 10% of the particles in IND are forced to fly randomly instead of following the two best particles. Evaluate the fitness function of these 10% of particles.

Step 5. If the best objective function value of the generation fulfills the end condition, the training is stopped and the results are output; otherwise, go to Step 2 to renew the population.

The HPSOTS scheme is presented in Fig. 1.
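
Steps 1 to 5 can be sketched in Python as below. This is only one possible reading of the scheme, not the authors' implementation. The following are assumptions introduced for the example: a single tabu list shared by all particles, a fixed number of PSO-generated neighbors per particle (`n_neighbors`), a fixed iteration budget as the end condition, a score that is maximized (the paper's fitness is the reciprocal of accuracy, so minimizing it is equivalent), and the omission of the aspiration criterion for brevity. All function and parameter names are hypothetical.

```python
import random
from collections import deque

def pso_move(x, p_best, g_best, a, rng):
    """One discrete PSO move following Eqs. (1)-(3)."""
    upper = (1.0 + a) / 2.0
    out = []
    for x_d, p_d, g_d in zip(x, p_best, g_best):
        v = rng.random()
        out.append(x_d if v <= a else p_d if v <= upper else g_d)
    return tuple(out)

def hpsots(fitness, n_bits, pop_size=50, n_iter=50, tabu_size=5,
           n_neighbors=5, a=0.5, seed=0):
    """Maximize `fitness` over bit tuples; returns (best_particle, best_fitness)."""
    rng = random.Random(seed)

    def rand_particle():
        return tuple(rng.randint(0, 1) for _ in range(n_bits))

    # Step 1: random initial population and its fitness.
    pop = [rand_particle() for _ in range(pop_size)]
    p_best = list(pop)
    p_fit = [fitness(x) for x in pop]
    g = max(range(pop_size), key=p_fit.__getitem__)
    g_best, g_fit = p_best[g], p_fit[g]
    tabu = deque(maxlen=tabu_size)
    n_random = max(1, pop_size // 10)        # the 10% that fly randomly
    for _ in range(n_iter):
        for i in range(pop_size):
            if i < pop_size - n_random:
                # Step 2: neighbors of particle i generated by PSO moves.
                cands = [pso_move(pop[i], p_best[i], g_best, a, rng)
                         for _ in range(n_neighbors)]
                # Step 3: take the best neighbor that is not tabu.
                ok = [c for c in cands if c not in tabu]
                if not ok:
                    continue
                pop[i] = max(ok, key=fitness)
                tabu.append(pop[i])
            else:
                # Step 4: random flight, ignoring both best positions.
                pop[i] = rand_particle()
            f = fitness(pop[i])
            if f > p_fit[i]:
                p_best[i], p_fit[i] = pop[i], f
            if f > g_fit:
                g_best, g_fit = pop[i], f
        # Step 5: the fixed iteration budget stands in for the end condition.
    return g_best, g_fit
```

The defaults mirror the settings reported later for the colon data (population size 50, tabu list size 5, 50 iterations); the remaining defaults are arbitrary.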

In HPSOTS, the performance of each particle is measured according to a pre-defined fitness function. In tumor classification the fitness is defined as the reciprocal of the averaged classification accuracy over fivefold cross-validation, evaluated using Fisher linear discriminant analysis (LDA). It is important that the data used for evaluating the predictive accuracy of the classifier be distinct from the data used for selecting the genes and building the supervised classifier (Dupuy and Simon, 2007). If the cross-validation accuracy is calculated within the feature-selection process and is used as an estimate of the prediction error, it carries a selection bias (Ambroise and McLachlan, 2002). Feature selection should be done within the cross-validation loop, using the training set only, rather than reusing the same selected features for all iterations. Because the expression levels of different genes are correlated, the set of genes selected may vary substantially among different iterations of the cross-validation process. We summarized the stability found and refined the genes used in the classifier.
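
The fitness definition can be made concrete with the sketch below, which computes the reciprocal of the mean k-fold accuracy for a given gene mask. To stay self-contained it swaps in a nearest-centroid classifier in place of LDA, so it shows the structure of the computation rather than the paper's classifier. The names `cv_fitness` and `nearest_centroid_predict` are hypothetical, and the sketch assumes at least `k` samples. In line with the warning about selection bias, such a fitness should be computed on training samples only.

```python
import random

def nearest_centroid_predict(train_X, train_y, test_X):
    """Stand-in classifier (not LDA): assign each sample to the closest class mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))

    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in test_X]

def cv_fitness(X, y, mask, k=5, seed=0):
    """Reciprocal of mean k-fold accuracy using only genes where mask[d] == 1."""
    cols = [d for d, bit in enumerate(mask) if bit]
    if not cols:
        return float("inf")                  # empty gene set: worst fitness
    Xs = [[row[d] for d in cols] for row in X]
    idx = list(range(len(Xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]    # k disjoint held-out folds
    accs = []
    for fold in folds:
        held = set(fold)
        tr = [i for i in idx if i not in held]
        pred = nearest_centroid_predict([Xs[i] for i in tr], [y[i] for i in tr],
                                        [Xs[i] for i in fold])
        accs.append(sum(p == y[i] for p, i in zip(pred, fold)) / len(fold))
    mean_acc = sum(accs) / k
    return 1.0 / mean_acc if mean_acc > 0 else float("inf")
```

A perfect classifier gives a fitness of 1.0, and lower accuracy gives a larger value, so the search minimizes this quantity.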

For comparison with the performance of HPSOTS, stepwise selection and the pure TS and PSO algorithms were also applied to select a subset of genes to be used in the LDA. In forward stepwise selection, we began by selecting the most important gene, then selected the second most important gene, and so on. The procedure continues to select one additional gene at a time until including further genes leads to no improvement in the classification accuracy. All the selected genes are then used for evaluating the averaged classification accuracy over fivefold cross-validation by LDA.
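
The forward stepwise procedure can be sketched as a greedy loop. In this Python example the `score` callable is a placeholder for the cross-validated LDA accuracy of a gene subset, and the name `forward_stepwise` is hypothetical. Stopping as soon as the score fails to strictly improve is an assumption about how "no improvement" is tested.

```python
def forward_stepwise(n_genes, score):
    """Greedy forward selection; `score` maps a tuple of gene indices to accuracy."""
    selected, best = [], float("-inf")
    while len(selected) < n_genes:
        remaining = [g for g in range(n_genes) if g not in selected]
        # Try each remaining gene and keep the one that scores highest.
        cand = max(remaining, key=lambda g: score(tuple(selected + [g])))
        cand_score = score(tuple(selected + [cand]))
        if cand_score <= best:
            break                  # adding a gene no longer improves accuracy
        selected.append(cand)
        best = cand_score
    return selected, best
```

Because each step commits to the single best gene, the loop can miss subsets whose genes are only informative in combination, which is one reason to compare it against population-based searches.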

The HPSOTS, LDA, stepwise selection, TS and PSO algorithms were written in Matlab 5.3 and run on a personal computer (Intel Pentium 4 processor, 1.5 GHz, 256 MB RAM).

Fig. 1. The chart of the HPSOTS scheme.

3. Results and discussion

In the present study, three publicly available datasets are used to test the performance of HPSOTS in gene selection for tumor classification. The first dataset was published by Alon et al. (1999). It comprises human tumor and normal colon tissues and is available at http://microarray.princeton.edu/oncology/. The second dataset, on breast tumor samples, was first presented by West et al. (2001) and Spang et al. (2001). It can be obtained at http://mgm.duke.edu/genome/dna_micro/work/. The third dataset, on acute leukemia classification, was originally analysed by Golub et al. (1999). It can be downloaded from http://www.genome.wi.mit.edu/MPR. Expression levels for each gene were normalized by subtracting the lowest value and dividing by the range of that gene, so that the expression levels for each gene are scaled to [0,1]. We first applied a t-test filtering algorithm to select a set of top-ranked informative genes: prior to the heuristic search procedure, genes with lower absolute t-test values between the normal and tumor samples were removed. The heuristic search methods (HPSOTS, stepwise selection, pure TS and pure PSO) were then applied to the remaining genes of each dataset.
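
The preprocessing described here (per-gene scaling to [0,1] followed by t-test ranking) might look as follows in Python. The paper does not say which form of the t statistic was used, so the unequal-variance (Welch) form below is an assumption. The function names are hypothetical, class labels are taken to be 0 and 1, and each class is assumed to have at least two samples.

```python
import math

def min_max_scale(values):
    """Scale one gene's expression values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]         # constant gene: no information
    return [(v - lo) / (hi - lo) for v in values]

def t_statistic(a, b):
    """Two-sample t statistic with unequal variances (Welch form, an assumption)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
    vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
    se = math.sqrt(va / len(a) + vb / len(b))
    if se == 0:
        # Zero variance in both classes: infinite separation unless means agree.
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / se

def top_genes_by_t(X, y, n_top):
    """Indices of the n_top genes with the largest |t| between classes 0 and 1."""
    scores = []
    for d in range(len(X[0])):
        col = min_max_scale([row[d] for row in X])
        a = [v for v, label in zip(col, y) if label == 0]
        b = [v for v, label in zip(col, y) if label == 1]
        scores.append((abs(t_statistic(a, b)), d))
    scores.sort(key=lambda s: (-s[0], s[1]))  # largest |t| first, ties by index
    return [d for _, d in scores[:n_top]]
```

Here `X` is a list of samples, each a list of gene expression values. Calling `top_genes_by_t(X, y, 400)` would correspond to the filtering step used for the colon data.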

3.1. Colon dataset

The colon dataset consists of expression profiles of 2000 genes, measured with an Affymetrix oligonucleotide array, from 22 normal and 40 colon tumor tissues. Among the 62 colon samples, 50 randomly selected samples were used as the training set and the remaining 12 samples as the prediction set. For the colon dataset, we first applied the t-test filtering algorithm to select the 400 top-ranked informative genes and then applied the heuristic search methods to these 400 genes.

As a comparison, stepwise selection was first used to select the subset of genes for the LDA model. The optimal subset for LDA classification using the stepwise method contains three genes (Hsa.8125, Hsa.8147, Hsa.1832). The classification accuracy over fivefold cross-validation is 83.87%. For the training and prediction sets the classification accuracies are 85.09% and 82.34%, respectively, indicating that stepwise selection is inadequate for modeling the colon data.

To compare with HPSOTS, gene selection by pure TS and PSO was also performed, with the fitness defined as the reciprocal of the averaged classification accuracy over fivefold cross-validation evaluated using LDA. Table 1 summarizes the results of classification by the stepwise, pure TS and PSO methods. The classification accuracies over fivefold cross-validation are 90.32% and 90.33% by TS and PSO, respectively. The classification accuracies of the optimal model for the training set and test set by TS are 90.74% and 90.34%, respectively. The classification accuracies for the training set and test set by PSO are 90.33% and 88.27%, respectively. A comparison with stepwise selection shows that better results were obtained with the TS and PSO algorithms.

Table 1
Results of gene selection by the stepwise, HPSOTS, pure TS and PSO for colon data

| Gene | Method | Cross-validation accuracy | Averaged accuracy^a (training) | Averaged accuracy^a (test) |
|---|---|---|---|---|
| Hsa.8125, Hsa.8147, Hsa.1832 | Stepwise | 0.8387 | 0.8509 | 0.8234 |
| Hsa.8147, Hsa.36689, Hsa.37937, Hsa.34575 | Pure TS | 0.9032 | 0.9041 | 0.9031 |
| Hsa.3862, Hsa.15101, Hsa.2809, Hsa.36689, Hsa.43279, Hsa.1832 | Pure PSO | 0.9033 | 0.9139 | 0.8827 |
| Hsa.1832, Hsa.36689, Hsa.477, Hsa.1659, Hsa.2809, Hsa.15101, Hsa.3862, Hsa.36218 | HPSOTS | 0.9355 | 0.9332 | 0.9031 |

a Averaged classification accuracy over 200 runs of partition samples using the selected genes.

To further improve the LDA model, the HPSOTS algorithm was employed to select the genes contributing strongly to classification for LDA modeling. In the present work, the population size of HPSOTS was set to 50, the size of the tabu list to 5, and HPSOTS was stopped after 50 iterations. In HPSOTS, the prediction performance of the classification models derived from the selected variables was evaluated using a fivefold cross-validation procedure. The best model found during the HPSOTS search, with maximum classification accuracy, contains eight genes (Hsa.1832, Hsa.36689, Hsa.477, Hsa.1659, Hsa.2809, Hsa.15101, Hsa.3862, Hsa.36218).

Fig. 2. Distribution of classification accuracy over 200 runs of partition samples using the best eight-gene-model for colon dataset.

The classification accuracy over fivefold cross-validation is 93.55% by HPSOTS. Compared with pure TS and PSO, the classification accuracy for the training set by HPSOTS was better. The incorporation of TS into PSO as a local improvement procedure improved the performance of the gene selection, as the classification accuracies for the training and test sets are improved with the introduction of TS.

As the partition into training and test sets is a random choice, the classification accuracy of a model is not necessarily the same for each partition. To evaluate accurately the predictive ability and reliability of the models derived from the optimal sets of genes selected by HPSOTS, the tissue samples were randomly partitioned into 80% training and 20% test sets 200 times, and the classification accuracy was then averaged for each selected set of genes. Fig. 2 shows the distribution of classification accuracy for the training set and test set over 200 runs of partition samples using the best eight-gene model. As shown in Fig. 2, the classification accuracy for the test set is larger than 90.65% in about 107 of the 200 runs, and the highest accuracy (100%) appears 55 times. Over this large number of resampled learning sets, the average classification accuracy reached 93.32% and 90.31% for the training and test sets. It can be seen that the classification model using the selected eight genes is stable and reliable.

3.2. Estrogen dataset

The estrogen dataset includes expression values of 7129 genes for 49 breast tumor samples. The response describes the lymph nodal (LN) status, which is an indicator of the metastatic spread of the tumor, an important risk factor for disease outcome. Of the total, 25 samples are positive (LN+) and 24 samples are negative (LN−). Among the 49 breast tumor samples, 40 randomly selected samples were used as the training set and the remaining 9 samples as the prediction set. For the estrogen dataset, the t-test filtering algorithm was first applied to select the 1000 top-ranked informative genes, and the heuristic search methods were then applied to these 1000 genes.

Table 2
Results of gene selection by the stepwise, HPSOTS, pure TS and PSO for estrogen data

| Gene | Method | Cross-validation accuracy | Averaged accuracy^a (training) | Averaged accuracy^a (test) |
|---|---|---|---|---|
| AFFX-CreX-3_st, U83601_at, M26708_s_at | Stepwise | 0.8571 | 0.8594 | 0.8490 |
| L26081_at, J04130_s_at, U90916_at, U40038_at, X63097_at, J04080_at | Pure TS | 0.9184 | 0.9135 | 0.8931 |
| U51678_at, U90916_at, X62083_s_at, U73824_at, AB002559_at, U27185_at | Pure PSO | 0.9388 | 0.9225 | 0.8780 |
| U83601_at, M85276_at, X96584_at, AFFX-CreX-3_st, D00762_at, U40371_at | HPSOTS | 0.9796 | 0.9667 | 0.9350 |

a Averaged classification accuracy over 200 runs of partition samples using the selected genes.

Fig. 3. Distribution of classification accuracy over 200 runs of partition samples using the best six-gene-model for estrogen dataset.

Table 2 summarizes the results of classification by the stepwise, pure TS, pure PSO and HPSOTS methods for the estrogen dataset. The optimal subset for LDA classification using the stepwise method contains three genes (AFFX-CreX-3_st, U83601_at,
M26708_s_at). The classification accuracy over fivefold cross-validation is 85.71%. For the training and prediction sets the classification accuracies are 85.94% and 84.40%, respectively. Pure TS and PSO were then applied to select genes for LDA modeling of the estrogen data. The classification accuracies over fivefold cross-validation are 91.84% and 93.88% by TS and PSO, respectively. The average classification accuracies of the models selected by pure TS and PSO were 91.35% and 92.25% for the training set, and 89.31% and 87.80% for the test set, respectively.

To further improve the LDA model, the HPSOTS algorithm was used to select genes for the estrogen data. The optimal subset for LDA classification using HPSOTS contains six genes (U83601_at, M85276_at, X96584_at, AFFX-CreX-3_st, D00762_at, U40371_at). The classification accuracy over fivefold cross-validation is 97.96% by HPSOTS. Fig. 3 shows the distribution of classification accuracy for the training set and test set over 200 runs of partition samples (80% training, 20% test) using the best six-gene model. For the test set, the highest accuracy (100%) appears 109 times in 200 runs. Over this large number of resampled learning sets, the average classification accuracy reached 96.67% and 93.50% for the training and test sets.

3.3. AML/ALL dataset

The dataset contains gene expression levels for 7129 genes from two classes of leukemia: acute lymphoblastic leukemia (ALL) and acute myeloblastic leukemia (AML). Following the experimental setup of the original authors, the data are split into a training set consisting of 38 samples, of which 27 are ALL and 11 are AML, and a test set of 34 samples, 20 ALL and 14 AML. For the AML/ALL dataset, we first applied the t-test filtering algorithm to select the 1000 top-ranked informative genes and then applied the heuristic search methods to these 1000 genes.

Table 3
Results of gene selection by the stepwise, HPSOTS, pure TS and PSO for AML/ALL data

| Gene | Method | Cross-validation accuracy | Averaged accuracy^a (training) | Averaged accuracy^a (test) |
|---|---|---|---|---|
| M55150_at, Z15115_at, X59417_at | Stepwise | 0.9028 | 0.9083 | 0.8814 |
| M55150_at, M19311_s_at, X89576_at, U70063_at, J03798_at | Pure TS | 0.9583 | 0.9583 | 0.9424 |
| X13839_at, X16706_at, J03779_at, U88666_at, M95623_cds1_at, Z54367_s_at, HG1612-HT1612_at | Pure PSO | 0.9444 | 0.9475 | 0.9419 |
| M13690_s_at, X05130_s_at, X94232_at, M38690_at, U43885_at, M98399_s_at, M61853_at | HPSOTS | 0.9861 | 0.9808 | 0.9581 |

a Averaged classification accuracy over 200 runs of partition samples using the selected genes.

Fig. 4. Distribution of classification accuracy over 200 runs of partition samples using the best seven-gene-model for AML/ALL dataset.

Table 3 summarizes the results of classification by the stepwise, pure TS, pure PSO and HPSOTS methods for the AML/ALL dataset. The optimal subset for LDA classification using the stepwise method contains three genes (M55150_at,
Z15115_at, X59417_at). The classification accuracy over fivefold cross-validation is 90.28%. For the training and prediction sets the classification accuracies are 90.83% and 88.14%, respectively, indicating that stepwise selection is inadequate for modeling the dataset. The classification accuracies over fivefold cross-validation are 95.83% and 94.44% by TS and PSO, respectively. The average classification accuracies of the models selected by pure TS and PSO were 95.83% and 94.75% for the training set, and 94.24% and 94.19% for the test set, respectively. For this dataset the results using pure TS and PSO are close to each other. The optimal subset for LDA classification using HPSOTS contains seven genes (M13690_s_at, X05130_s_at, X94232_at, M38690_at, U43885_at, M98399_s_at, M61853_at). The classification accuracy over fivefold cross-validation is 98.61% by HPSOTS. The results for this dataset show again that the proposed HPSOTS offers substantially improved classification performance compared with TS and PSO. Fig. 4 shows the distribution of classification accuracy for the test set over 200 runs of partition samples (50% training, 50% test) using the best seven-gene model. As shown in Fig. 4, the highest classification accuracy (100%) for the training set appears 79 times in 200 runs. For the test set, the classification accuracy is larger than 95.81% in about 124 of the 200 runs, and the highest accuracy (100%) appears 48 times. Over this large number of resampled learning sets, the average classification accuracy reached 98.08% and 95.81% for the training and test sets. It can be seen that the seven-gene classification model for the AML/ALL dataset is stable and reliable.

4. Conclusion

The selection of genes that are truly indicative of the tissue classification concerned is a key step in developing a successful gene expression-based data analysis system. In this paper, a hybrid PSO and TS (HPSOTS) approach for gene selection was developed for tumor classification. The incorporation of TS as a local improvement procedure enables the HPSOTS algorithm to escape local optima and show satisfactory performance. The proposed HPSOTS algorithm was applied to three different microarray datasets. Moreover, stepwise selection and the pure TS and PSO algorithms were compared with HPSOTS. The results demonstrate that the proposed method is a useful tool for gene selection and for mining high-dimensional data.

Acknowledgement

The work was financially supported by the National Natural Science Foundation of China (Grant No. 20505015).

eferences

Alon, U., et al., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. U.S.A. 96, 6745–6750.

Ambroise, C., McLachlan, G.J., 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. 99 (10), 6562–6566.

Bicciato, S., Luchini, A., Bello, C.D., 2003. PCA disjoint models for multi-class cancer analysis using gene expression data. Bioinformatics 19, 571–578.

Cho, J.H., Lee, D., Park, J.H., Lee, I.B., 2003. New gene selection method for classification of cancer subtypes considering within-class variation. FEBS Lett. 551, 3–7.

Clerc, M., Kennedy, J., 2002. The particle swarm-explosion, stability and convergence in a multidimensional complex space. IEEE Trans. Evol. Comput. 6, 58–64.

Dupuy, A., Simon, R.M., 2007. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J. Natl. Cancer Inst. 99 (2), 147–157.

Furey, T., Cristianini, N., Duffy, N., Bednarski, D., Schummer, M., Haussler, D., 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906–914.

Glover, F., 1986. Future paths for integer programming and links to artificial intelligence. Comput. Oper. Res. 5, 533–549.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., et al., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422.

Kennedy, J., Eberhart, R., 1995. Particle swarm optimization. In: Proceedings of the IEEE First International Conference on Neural Networks, Perth, Australia, pp. 1942–1948.

Li, W., 2006. The-more-the-better and the-less-the-better. Bioinformatics 22 (18), 2187–2188.

Li, W., Yang, Y., 2002. How many genes are needed for a discriminant microarray data analysis? In: Lin, S.M., Johnson, K.F. (Eds.), Methods of Microarray Data Analysis. Kluwer Academic, pp. 137–150.

Li, T., Zhang, C., Ogihara, M., 2004. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20, 2429–2437.

Li, L., Jiang, W., Li, X., Moser, K.L., Guo, Z., Du, L., Wang, Q., Topol, E.J., Wang, Q., Rao, S., 2005. A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset. Genomics 85, 16–23.

Nguyen, D.V., Rocke, D.M., 2002. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 18, 1216–1226.

Peng, S.H., Xu, Q.H., Ling, X.B., Peng, X.N., Du, W., Chen, L.B., 2003. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Lett. 555, 358–362.

Schena, M., Shalon, D., Davis, R.W., Brown, P.O., 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.

Shen, Q., Jing, J.H., Shen, G.L., Yu, R.Q., 2004a. Modified particle swarm optimization algorithm for variable selection in MLR and PLS modeling: QSAR studies of antagonism of angiotensin II antagonists. Eur. J. Pharm. Sci. 22, 145–152.

Shen, Q., Jing, J.H., Shen, G.L., Yu, R.Q., 2004b. Optimized partition of minimum spanning tree for piecewise modeling by particle swarm algorithm: QSAR studies of antagonism of angiotensin II antagonists. J. Chem. Inf. Comput. Sci. 44, 2027–2031.

Shen, Q., Shi, W.M., Kong, W., Ye, B.X., in press. A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification. Talanta.

Shi, Y., Eberhart, R., 1998. A modified particle swarm optimizer. In: Proceedings of the IEEE World Congress on Computational Intelligence, pp. 69–73.

Sima, C., Dougherty, E.R., 2006. What should be expected from feature selection in small-sample settings. Bioinformatics 22 (19), 2430–2436.

Spang, R., Blanchette, C., Zuzan, H., Marks, J., Nevins, J., West, M., 2001. Prediction and uncertainty in the analysis of gene expression profiles. In: Wingerder, E., Hofestadt, R. (Eds.), Proceedings of the German Conference on Bioinformatics. Braunschweig.

Stephanopoulos, G., Hwang, D., Schmitt, W.A., Misra, J., 2002. Mapping physiological states from microarray expression measurements. Bioinformatics 18, 1054–1063.

Tan, Y.X., Shi, L., Tong, W., Hwang, G.G.T., Wang, C., 2004. Multi-class tumor classification by discriminant partial least squares using microarray gene expression data and assessment of classification models. Comput. Biol. Chem. 28, 235–244.

Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G., 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99 (10), 6567–6572.

Wang, Y., Makedon, F.S., Ford, J.C., Pearlman, J., 2005. HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics 21 (8), 1530–1537.

West, M., et al., 2001. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci. U.S.A. 98, 11467–11562.

Xiong, M., Li, W., Zhao, J., Jin, L., Boerwinkle, E., 2001. Feature (gene) selection in gene expression-based tumor classification. Mol. Genet. Metab. 73, 239–247.