

Computers in Biology and Medicine 39 (2009) 646–649

Contents lists available at ScienceDirect

Computers in Biology and Medicine

journal homepage: www.elsevier.com/locate/cbm

Simultaneous genes and training samples selection by modified particle swarm optimization for gene expression data classification

Qi Shen∗, Zhen Mei, Bao-Xian Ye
Department of Chemistry, Zhengzhou University, Zhengzhou 450052, China

ARTICLE INFO

Article history:
Received 31 January 2008
Accepted 29 April 2009

Keywords:
Gene expression data
Gene selection
Sample selection
Particle swarm optimization

ABSTRACT

Gene expression datasets are a means to classify and predict the diagnostic categories of a patient. Informative gene and representative sample selection are two important aspects of reducing gene expression data. Identifying and pruning redundant genes and samples simultaneously can improve the performance of classification and circumvent the local optima problem. In the present paper, modified particle swarm optimization was applied to selecting optimal genes and samples simultaneously, and a support vector machine was used as the objective function to determine the optimum set of genes and samples. To evaluate the performance of the newly proposed method, it was applied to three publicly available microarray datasets. It has been demonstrated that the proposed method for gene and sample selection is a useful tool for mining high-dimension data.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

The development of microarray technology has inspired a wealth of gene expression data and led to a revolution in information collection techniques in biology and medical science. These gene expression databases were employed to classify and predict the diagnostic categories of a patient [1]. Indeed, the precise diagnosis of cancer type is critical for successful treatment.

To determine cancer type, classification models were set up using data mining techniques. Generally, gene expression data have two dimensions: the columns record all sorts of genes of each tissue sample, and the rows represent different samples. Gene expression data are usually redundant, contaminated, or contain samples that are troublesome for the classification model to learn [2]. For a gene expression data matrix, a data reduction strategy can be used to reduce the number of columns (gene selection) or rows (sample selection) [3]. Gene and sample selection are two important aspects of reducing gene expression data. Compared with the number of genes, the number of tissue samples with gene expression levels available is usually small. This can lead either to overfitting or even to a complete failure of microarray data analysis. The existence of samples in the overlapped boundary region may also make the decision boundary too complex and degrade the generalization ability of the classifier. In addition, labeling a sample may in some cases be subjective, and a very small number of mislabeled samples can deeply

∗ Corresponding author. Tel.: +86 371 67781024; fax: +86 371 67763220.
E-mail address: [email protected] (Q. Shen).

0010-4825/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiomed.2009.04.008

degrade the performance of the classifier [4]. Mislabeled and troublesome learning samples are often near the boundary and lead to a high error rate. Data reduction is therefore necessary for establishing a good classification model. Identifying and pruning redundant genes and incorrectly labeled samples can improve classification performance, yielding higher stability and generalization ability in the prediction model.

Most data reduction techniques are used for gene selection. Traditional gene selection methods usually rank genes according to a score, and the genes with the highest scores are selected. Recently, semi-supervised [5], unsupervised [6], mixture model [7], and hybrid gene selection [8] methods have been applied to selecting genes. Sample selection is a newer field that has developed recently [9,10] and is seldom mentioned in gene expression data classification. In [11], extreme patient samples were used for outcome prediction from gene expression data. Some strategies used for data reduction first select samples and then select variables. Such an approach finds it difficult to circumvent the local optima and overfitting problems. From this point of view, it is highly desirable to design a methodology capable of selecting genes and samples simultaneously.

Particle swarm optimization (PSO) [12] is a novel evolutionary computation technique based on swarm intelligence, inspired by the social behavior of bird flocking. Similar to evolutionary algorithms, PSO is initialized with a population of random solutions, which search for optima by updating generations. However, PSO possesses no crossover or mutation operators and is simpler than other evolutionary algorithms. The utility of PSO for gene selection has been demonstrated in gene expression data analysis [13,14]. The modified PSO



algorithm has been proposed in our published work and used to select genes and variables for data mining with satisfactory performance [15–17]. In the present paper, the modified PSO was applied to selecting optimal genes and samples simultaneously, and a support vector machine (SVM) was used as the objective function to determine the optimum set of genes and samples. The corresponding computer flow chart is presented in detail in this article. To evaluate the performance of the newly proposed method, it was applied to three publicly available microarray datasets. It has been demonstrated that the proposed gene and sample selection method is a useful tool for mining high-dimension data.

2. Methods

2.1. Modified particle swarm optimization

PSO was developed by Kennedy and Eberhart [12] in 1995. Recently, many modified versions of PSO have been introduced. Most modified versions of PSO operate in continuous, real-number space. In this paper, we use the discrete version of PSO introduced in our previous studies [15–17] to perform a hybrid optimization of gene and sample selection.

In the PSO algorithm, the mechanism for exploring a problem space is to update a population of particles on the basis of information about each particle's previous best performance and the best particle in the population. PSO is initialized with a group of random particles. Each particle is treated as a point in a D-dimensional space. The ith particle is represented as xi = (xi1, xi2, . . . , xiD). The best previous position of the ith particle, i.e. the one that gives the best fitness value, is represented as pi = (pi1, pi2, . . . , piD). The best particle among all particles in the population is represented by pg = (pg1, pg2, . . . , pgD). The velocity, i.e. the rate of position change, of particle i is represented as vi = (vi1, vi2, . . . , viD). In every iteration, each particle is updated by following these two best values.

For a discrete problem expressed in binary notation, a particle moves in a search space restricted to 0 or 1 on each dimension. In a binary problem, updating a particle represents a change of a bit, which must be in either state 1 or 0, and the velocity represents the probability of bit xid taking the value 1 or 0.

According to the information sharing mechanism of PSO, a modified discrete PSO was applied as follows. The velocity vid of every individual is a random number in the range (0, 1). The resulting change in position is then defined by the following rule:

If 0 < vid ≤ a, then xid(new) = xid(old) (1)

If a < vid ≤ (1 + a)/2, then xid(new) = pid (2)

If (1 + a)/2 < vid ≤ 1, then xid(new) = pgd (3)

where a is a random value in the range (0, 1) named the static probability. In this study, the static probability a equals 0.5. To circumvent convergence to local optima and improve the ability of the modified PSO algorithm to escape local optima, 10% of the particles are forced to fly randomly, not following the two best particles.

If the minimum error criterion is attained or the number of cycles reaches a user-defined limit, the algorithm is terminated.
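The update rule of Eqs. (1)–(3) can be sketched in code. The following is a minimal NumPy re-implementation (the authors' own code was written in Matlab, so the function and parameter names here are illustrative, not theirs): velocities are drawn uniformly from (0, 1), and each bit either keeps its old value, copies the particle's personal best, or copies the global best, with a fraction of particles re-randomized to help the swarm escape local optima.

```python
import numpy as np

def update_particles(x, p_best, g_best, a=0.5, random_frac=0.1, rng=None):
    """One iteration of the modified discrete PSO position update (Eqs. (1)-(3)).

    x       : (n_particles, n_dims) current binary positions
    p_best  : (n_particles, n_dims) each particle's best position so far
    g_best  : (n_dims,)             best position found by the whole swarm
    a       : static probability (0.5 in the paper)
    """
    if rng is None:
        rng = np.random.default_rng()
    n, d = x.shape
    v = rng.random((n, d))               # velocities, uniform on [0, 1)
    new_x = x.copy()                     # Eq. (1): 0 < v <= a keeps the old bit
    mid = (v > a) & (v <= (1 + a) / 2)   # Eq. (2): copy the personal best bit
    high = v > (1 + a) / 2               # Eq. (3): copy the global best bit
    new_x[mid] = p_best[mid]
    new_x[high] = np.broadcast_to(g_best, x.shape)[high]
    # a fraction of particles flies randomly, not following the two bests
    n_rand = int(random_frac * n)
    if n_rand > 0:
        idx = rng.choice(n, size=n_rand, replace=False)
        new_x[idx] = rng.integers(0, 2, size=(n_rand, d))
    return new_x
```

With a = 1 and no random particles, every velocity falls in Eq. (1)'s interval and the swarm does not move, which is a quick sanity check on the rule.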

2.2. Simultaneous gene and sample selection by modified PSO (SSPSO)

In this paper, we used the modified PSO to select the optimal set of genes and samples simultaneously. In the modified discrete PSO, each particle is encoded as a string of binary bits associated with all the genes and samples used. Each particle consists of two parts. The length of the first part equals the number of genes (Ng), and each bit is associated with a gene. The length of the second part

[Fig. 1. The chart of the SSPSO scheme: initialize the PSO population (IND) on the training set; calculate the fitness of the population; if the best fitness is good enough, construct a classifier using the selected genes and samples and output the result for the test set; otherwise update IND by PSO to form a new population and repeat.]

equals the number of samples (Ns), and each bit is associated with a sample. The length of each particle is the sum of Ng and Ns. A bit "1" in a particle represents the usefulness of the corresponding gene or sample, whereas a bit "0" represents its uselessness. Simultaneous gene and sample selection by the modified PSO proceeds as follows:

Step 1. Initialize all the binary strings IND (individuals) in the modified PSO randomly, with an appropriate population size. Each IND is a string of binary bits standing for a set of genes and samples.

Step 2. Calculate the objective function of each particle. If the best objective function of the generation fulfills the end condition, training is stopped and the results are output; otherwise, go to the next step.

Step 3. Update the IND population according to the modified discrete PSO (Eqs. (1)–(3)).

Step 4. Go back to step 2 to calculate the fitness of the renewed population. The total process of SSPSO is shown in Fig. 1.
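As a concrete illustration of the encoding above, splitting one particle into its gene and sample masks might look as follows (a hypothetical Python helper; the original implementation was in Matlab):

```python
import numpy as np

def decode_particle(particle, n_genes, n_samples):
    """Split one binary SSPSO particle into a gene mask and a sample mask.

    The first n_genes bits mark selected genes ("1" = selected) and the
    remaining n_samples bits mark selected training samples.
    """
    particle = np.asarray(particle)
    if particle.size != n_genes + n_samples:
        raise ValueError("particle length must equal Ng + Ns")
    gene_mask = particle[:n_genes].astype(bool)
    sample_mask = particle[n_genes:].astype(bool)
    return gene_mask, sample_mask
```

For the bipolar disorder dataset discussed later (400 filtered genes, 39 training samples), a particle would have 400 + 39 = 439 bits.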

2.3. The fitness function

In SSPSO, the optimal set of genes and samples is simultaneously selected by each particle, and the performance of each particle is measured according to a pre-defined fitness function. The fitness is defined as the total classification error rate over 5-fold cross-validation, which is evaluated using a linear support vector machine



(SVM) classifier. Briefly, the samples selected by each particle, which are associated with bit "1" in the second part of the particle, are randomly divided into five subsets of approximately equal size. Each time, one of the five subsets is used as the test set and the other four subsets are put together to form a training set used to construct an SVM classification model. The procedure is repeated five times, and each particle in SSPSO is evaluated by the total error over all five trials.
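The fitness evaluation just described could be computed as in the following sketch, which assumes scikit-learn for the linear SVM and the 5-fold split (the authors' implementation was in Matlab 7.0, so this is an illustrative re-implementation with hypothetical names, not their code; the guard for degenerate particles is also an assumption):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def particle_fitness(particle, X, y, n_genes, seed=0):
    """Total 5-fold cross-validation error for one SSPSO particle.

    particle : binary vector of length n_genes + n_train_samples
    X, y     : full training matrix (samples x genes) and class labels
    Returns the total number of misclassified held-out samples over
    the five folds (lower is better).
    """
    particle = np.asarray(particle)
    gene_mask = particle[:n_genes].astype(bool)
    sample_mask = particle[n_genes:].astype(bool)
    if gene_mask.sum() == 0 or sample_mask.sum() < 5:
        return len(y)                 # degenerate particle: worst fitness
    # keep only the genes and samples this particle marks with "1"
    Xs, ys = X[sample_mask][:, gene_mask], y[sample_mask]
    errors = 0
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=seed).split(Xs):
        clf = SVC(kernel="linear").fit(Xs[tr], ys[tr])
        errors += int((clf.predict(Xs[te]) != ys[te]).sum())
    return errors
```

The PSO then minimizes this value over the particle population.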

3. Datasets

Three public datasets were used to test our proposed method in this paper.

3.1. Bipolar disorder data

The bipolar disorder dataset [18] consists of 61 samples (31 control and 30 bipolar disorder). Gene expression for 22283 human genes was measured using the Affymetrix technology. These data are publicly available at http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=2190.

3.2. Gliomas of grades III and IV data

The gliomas dataset was obtained by applying the Affymetrix gene chip technology and was first presented in [19]. The common expression matrix monitors 22645 genes in 85 glioma tumor samples. In this dataset, 26 samples are labeled grade III; the remaining 59 samples are labeled grade IV. These data are retrieved from http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=1976.

3.3. Sarcoma data

The sarcoma dataset [20,21] is publicly available at http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=1209. The dataset consists of expression profiles of 22283 human genes from 54 patients; of these, 15 are normal and 39 are diseased.

Our algorithm was programmed in Matlab 7.0 and run on a personal computer.

4. Results and discussion

For each dataset, about two-thirds of the samples were randomly selected as the training set and the remaining samples as the prediction set. For the bipolar disorder dataset, we used 39 randomly selected samples (19 bipolar disorder and 20 controls) as the training set and the remaining 22 samples as the prediction set. Among the 85 glioma tumor samples, 55 randomly selected samples (38 grade IV and 17 grade III) were used as the training set and the remaining 30 samples as the prediction set. For the sarcoma dataset, we used 35 randomly selected samples (10 normal and 25 diseased) as the training set and the remaining 19 samples as the prediction set. The methods (SVM, PSOSVM, SSPSO) were then applied to the training sets to obtain models and predict the test sets. Because of the arbitrariness of the partition of a dataset, the predicted error rate of a model may differ between partitions. To evaluate the performance of each algorithm (SVM, PSOSVM, SSPSO) accurately, the partition was repeated 20 times and the error rates of the partitions were averaged.
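The repeated-partition protocol above can be sketched generically; the splitter and names here are illustrative assumptions, and any evaluation callable (SVM, PSOSVM, or SSPSO) can be plugged in:

```python
import numpy as np

def repeated_split_error(X, y, evaluate, n_repeats=20, train_frac=2 / 3, seed=0):
    """Average test-set error over repeated random train/test partitions.

    evaluate(X_tr, y_tr, X_te, y_te) is any callable returning an error
    rate; the partition is redrawn n_repeats times and the rates averaged.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    rates = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)            # fresh random two-thirds split
        tr, te = perm[:n_train], perm[n_train:]
        rates.append(evaluate(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(rates))
```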

First, a naive SVM classifier with all genes was applied to these gene expression datasets. To evaluate the performance of the SVM classifier accurately, the total tissue samples were randomly partitioned into training and test sets 20 times and the misclassification rates were averaged. For the bipolar disorder dataset,

the average misclassification rates for the training set and test set were 55.51% and 43.78%, respectively, using all initial genes. For the glioma and sarcoma datasets, the average misclassification rates for the training set were 14.91% and 9.43%, respectively, and the average misclassification rates for the test set were 10.4% and 3.95%, respectively. The results indicated that including an excess of irrelevant genes was inadequate for classification and degraded the performance of the SVM analysis.

To compare with SSPSO, genes selected by PSO (PSOSVM) were also applied to these gene expression datasets; here only genes were selected, and sample selection was not considered. The three datasets each contained more than 22000 genes, and the number of tissue samples was very small compared with the number of genes. Large numbers of uninformative genes aggravate computational complexity and cost. Therefore, we first applied a t-test filtering algorithm to select the 400 top-ranked informative genes for these datasets and then applied the PSOSVM, SVM, and SSPSO methods to these 400 genes.
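The t-test filtering step can be sketched as follows, assuming SciPy's two-sample t-test. Ranking by absolute t-statistic is one common choice; the paper does not specify the exact scoring variant, so treat this as an assumption:

```python
import numpy as np
from scipy.stats import ttest_ind

def top_genes_by_ttest(X, y, k=400):
    """Rank genes with a two-sample t-test and keep the k top-ranked ones.

    X : (n_samples, n_genes) expression matrix; y : binary class labels.
    Returns the column indices of the k genes with the largest |t|.
    """
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(-np.abs(t))[:k]
```

The returned indices would then define the first part of each SSPSO particle (400 bits in the experiments reported here).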

In PSOSVM, the population size of PSO was set to 100 and the procedure was stopped after 200 iterations. The performance of each particle was evaluated using the total classification error rate over 5-fold cross-validation by a linear SVM. The average misclassification rates for each dataset are presented in Table 1. For the bipolar disorder dataset, there were on average 8 genes in the models found by PSOSVM, and the average misclassification rates were 8.72% for the training set and 26.36% for the test set. Using the 13-gene model, the average misclassification rates for the training and test sets were 0% and 15.33%, respectively, for the glioma dataset. For the sarcoma dataset, there were on average 11 genes in the model, and the average misclassification rates for the training and test sets were 0.57% and 4.21%, respectively, by PSOSVM.

The SVM using the 400 top-ranked informative genes was also applied to these datasets. For the bipolar disorder dataset, the average misclassification rates for the training and test sets were 26.67% and 22.0%, respectively. For the glioma and sarcoma datasets, the average misclassification rates for the training set were 6.82% and 7.83%, respectively, and the average misclassification rates for the test set were 1.43% and 2.16%, respectively. Comparing PSOSVM against the SVM with the 400 top-ranked genes and the naive SVM showed that the training-set results of PSOSVM were better than those of the SVM with the 400 top-ranked genes and with all genes, while its test-set misclassification rates were higher than those of the SVM using the 400 top-ranked genes. These observations imply that overfitting had likely occurred on the training sets in PSOSVM.

To further improve classification performance, the SSPSO algorithm was used to classify the three gene expression datasets. In SSPSO, the modified PSO was used to select genes and samples simultaneously. The population size and number of generations in SSPSO were set to 100 and 200, the same as in PSOSVM.

All the initial binary particle strings IND were initialized in SSPSO, and genes and samples were selected simultaneously in each particle. For example, the length of a particle for the bipolar disorder dataset was 439: the first 400 bits represented the genes selected by the t-test, and the last 39 bits represented the 39 training samples. A bit "1" among the first 400 bits indicated that the corresponding gene was selected, and a bit "1" among the last 39 bits indicated that the corresponding sample was selected from the training set. The selected training samples were then randomly divided into five subsets of approximately equal size and used to measure the performance of the particle according to the total classification error rate over 5-fold cross-validation. The objective function of each particle was calculated, and the IND population was updated according to Eqs. (1)–(3). This process was iterated until the number of iterations reached 200.



Table 1
Results of misclassification rates for three datasets.

Datasets          Method        SSPSO (%)  SVMa (%)  PSOSVM (%)  Naive SVMb (%)
Bipolar disorder  Training set  0          26.67     8.72        55.51
                  Test set      20.00      22.00     26.36       43.18
Glioma            Training set  0          6.82      0           14.91
                  Test set      7.33       7.83      15.33       10.40
Sarcoma           Training set  0          1.43      0.57        9.43
                  Test set      1.05       2.16      4.21        3.95

a SVM method using 400 top-ranked genes selected by the t-test filtering algorithm.
b SVM method using all original genes.

The results of SSPSO are also listed in Table 1. The average numbers of genes and samples selected for the bipolar disorder dataset by SSPSO were 18 and 34. The average misclassification rates for the bipolar disorder dataset were 0% and 20.00% for the training and test sets, respectively. For the glioma dataset, on average 43 samples were selected by SSPSO from the 55 training samples, and the average number of genes selected was 41. The average numbers of genes and samples selected were 22 and 31 for the sarcoma dataset. For the glioma and sarcoma datasets, the average misclassification rates for the training set were all zero, and the average misclassification rates for the test set were 7.33% and 1.05%, respectively. The introduction of sample selection into the algorithm improved the performance of the SVM in SSPSO, as the misclassification rates for the training and test sets were lower than those of SVM and PSOSVM. The selection of representative samples was beneficial to a stable model and could prevent overfitting to some extent. These results indicated that the performance of our proposed SSPSO was better than the SVM and PSOSVM methods.

5. Conclusions

Selecting informative genes and representative samples simultaneously is essential in classifying gene expression data. In this paper, the modified PSO was applied to select genes and samples simultaneously. Three public datasets were used to test the proposed algorithm. The results demonstrated that the proposed method is a useful tool for data mining.

Conflict of interest statement

No conflict of interest exists in the submission of this manuscript.

Acknowledgment

The work was financially supported by the National Natural Science Foundation of China (Grant nos. 20505015, 20475050).

References

[1] P. Qiu, Z.J. Wang, K.J. Liu, Ensemble dependence model for classification and prediction of cancer and normal gene expression data, Bioinformatics 21 (2005) 3114.

[2] A. Angelova, Y. Abu-Mostafa, P. Perona, Pruning training sets for learning of object categories, IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1 (2005) 494.

[3] J.R. Cano, F. Herrera, M. Lozano, On the combination of evolutionary algorithms and stratified strategies for training set selection in data mining, Applied Soft Computing 6 (3) (2006) 323.

[4] A. Malossini, E. Blanzieri, T. Raymond, Detecting potential labeling errors in microarrays by data perturbation, Bioinformatics 22 (2006) 2114.

[5] P.H. Gosselin, M. Cord, Feature-based approach to semi-supervised similarity learning, Pattern Recognition 39 (10) (2005) 1839.

[6] R. Varshavsky, A. Gottlieb, M. Linial, et al., Novel unsupervised feature filtering of biological data, Bioinformatics 22 (14) (2006) 507.

[7] C. Ming-Hui, J.G. Ibrahim, C. Yueh-Yun, A new class of mixture models for differential gene expression in DNA microarray data, Journal of Statistical Planning and Inference 138 (2) (2008) 387.

[8] Q. Hu, Z. Xie, D. Yu, Hybrid attribute reduction based on a novel fuzzy-rough model and information granulation, Pattern Recognition 40 (12) (2007) 3509.

[9] S. Vijayakumar, M. Sugiyama, H. Ogawa, Training data selection for optimal generalization with noise variance reduction in neural networks, Neural Nets WIRN Vietri-98, vol. 153, Springer, Berlin, 1998.

[10] J. Sun, G.S. Hong, Y.S. Wong, et al., Effective training data selection in tool condition monitoring system, International Journal of Machine Tools and Manufacture 46 (2006) 218.

[11] H. Liu, J. Li, L. Wong, Use of extreme patient samples for outcome prediction from gene expression data, Bioinformatics 21 (16) (2005) 3377.

[12] J. Kennedy, R. Eberhart, A new optimizer using particle swarm theory, Proceedings of the Sixth International Symposium on Micro Machine and Human Science, vol. 39, IEEE Service Center, Nagoya, Japan, 1995.

[13] L.Y. Chuang, H.W. Chang, C.J. Tu, et al., Improved binary PSO for feature selection using gene expression data, Computational Biology and Chemistry 32 (1) (2008) 29.

[14] R. Xu, G.K. Venayagamoorthy, D.C. Wunsch, Modeling of gene regulatory networks with hybrid differential evolution and particle swarm optimization, Neural Networks 20 (8) (2007) 917.

[15] Q. Shen, J.H. Jing, G.L. Shen, et al., Modified particle swarm optimization algorithm for variable selection in MLR and PLS modeling: QSAR studies of antagonism of angiotensin II antagonists, European Journal of Pharmaceutical Science 22 (2004) 145.

[16] Q. Shen, J.H. Jing, W.Q. Lin, et al., Hybridized particle swarm algorithm for adaptive structure training of multilayer feed-forward neural network: QSAR studies of bioactivity of organic compounds, Journal of Computational Chemistry 25 (2004) 1726.

[17] Q. Shen, W.-M. Shi, W. Kong, et al., A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification, Talanta 71 (2007) 1679.

[18] M.M. Ryan, H.E. Lockstone, S.J. Huffaker, et al., Gene expression analysis of bipolar disorder reveals downregulation of the ubiquitin cycle and alterations in synaptic genes, Molecular Psychiatry 11 (10) (2006) 965.

[19] W.A. Freije, F.E. Castro-Vargas, Z. Fang, et al., Gene expression profiling of gliomas strongly predicts survival, Cancer Research 64 (18) (2004) 6503.

[20] K.Y. Detwiller, N.T. Fernando, N.H. Segal, et al., Analysis of hypoxia-related gene expression in sarcomas and effect of hypoxia on RNA interference of vascular endothelial cell growth factor A, Cancer Research 65 (13) (2005) 5881.

[21] S.S. Yoon, N.H. Segal, P.J. Park, K.Y. Detwiller, et al., Angiogenic profile of soft tissue sarcomas based on analysis of circulating factors and microarray gene expression, Journal of Surgical Research 135 (2) (2006) 282.