


Hindawi Publishing Corporation
Computational and Mathematical Methods in Medicine
Volume 2013, Article ID 798189, 14 pages
http://dx.doi.org/10.1155/2013/798189

Research Article

Multiple Suboptimal Solutions for Prediction Rules in Gene Expression Data

Osamu Komori,1 Mari Pritchard,2 and Shinto Eguchi1

1 The Institute of Statistical Mathematics, Midori-cho, Tachikawa, Tokyo 190-8562, Japan
2 CLC Bio Japan, Inc., Daikanyama Park Side Village 204, 9-8 Sarugakucho, Shibuya-ku, Tokyo 150-0033, Japan

Correspondence should be addressed to Osamu Komori; komori@ism.ac.jp

Received 30 January 2013; Revised 22 March 2013; Accepted 23 March 2013

Academic Editor: Shigeyuki Matsui

Copyright © 2013 Osamu Komori et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper discusses mathematical and statistical aspects of analysis methods applied to microarray gene expressions. We focus on pattern recognition to extract informative features embedded in the data for prediction of phenotypes. It has been pointed out that severe difficulties arise from the imbalance between the number of observed genes and the number of observed subjects. We reanalyze published microarray gene expression data and detect many other gene sets with almost the same performance. We conclude that, at the current stage, it is not possible to extract only informative genes with high performance from among all observed genes. We investigate why this difficulty persists even though analysis methods and learning algorithms are actively proposed in statistical machine learning. We focus on the mutual coherence, or the maximal absolute value of the Pearson correlation between two genes, and describe the distributions of the correlation for the selected set of genes and for the total set. We show that the problem of finding informative genes in high-dimensional data is ill-posed and that the difficulty is closely related to the mutual coherence.

1. Introduction

The Human Genome Project [1, 2] has driven genome technology forward to exhaustive observation. The accumulation of genome knowledge leads us to study gene and protein expressions to elucidate the functions of genes and the interactions among genes. We overview the progress of microarray technology for gene expressions and the analysis methods based on gene expression towards good prediction of phenotypes. Analysis of gene expressions has been rapidly developed and enhanced by microarray technology. At the current stage, this progress enables us to observe all the gene expressions of subjects in an exhaustive manner. It is opening an area of bioinformatics that seeks to discover the relation of phenotypes with gene expressions, where phenotypes include degrees and stages of pathological response, treatment effect, and prognosis of patients. We anticipate that a breakthrough in medical and clinical science will come from key understandings that elucidate the associations between phenotypes and gene expressions. For this, the machine learning approach has been successfully exploited, including the support vector machine, boosting learning algorithms, and the Bayesian network.

However, there exists a difficult problem in analyses of gene expression data, in which the number of genes is much larger than the number of samples. Such an extreme imbalance between the data dimension and the sample size is a typical characteristic of genomics and omics data; this tendency will become more apparent on account of new technology in the near future. Appropriate methods for solving the problem are urgently required; however, there are severe difficulties with attaining a comprehensive understanding of the phenotypes based on the set of exhaustive gene expressions. We face many false positive genes sinking true positive genes in the prediction, which creates an impediment to building individualized medicine. There are a vast number of proposals with complex procedures that challenge the difficult problem of extracting robust and exhaustive information from gene expressions. Sparse feature selection is one example; it is considered in order to avoid overfitting and to obtain an interpretable model for gene expression data. In the regression context, Tibshirani [3] proposed the Lasso, which achieves feature selection by shrinking some coefficients to exactly zero. In the image processing area, the sparse model is also considered for redundant representation of data by Donoho and Elad [4] and Candes et al. [5].

The current prediction performance has attained a constant level; however, there are still unsolved problems in prediction when the observed number of genes is extremely larger than that of the subjects. This causes difficulties in which superfluous genes are discovered whose expressions have only weak power for prediction. As a result, almost no microarray data analysis has been completely confirmed by biological replication, as discussed in Allison et al. [6].

We address the microarray data analysis discussed in van't Veer et al. [7], in which a set of 70 genes is selected as informative biomarkers for the prediction of prognosis in breast cancer patients. The result helped to build the prediction kit named "MammaPrint," which was approved by the FDA in 2007. We make a reanalysis of the data and discuss how, surprisingly, many other gene sets show almost the same prediction performance as the published gene set. Thus their gene set does not uniquely show a reasonable power of prediction, so we suggest that it is impossible to build a universal prediction rule by efficient selection of genes. In particular, the ranking procedure according to a measure of association with the phenotype is very fragile to data partition. We discuss the statistical reason why the data analysis of gene expression is involved with such difficulties of multiple solutions. We calculate the value of the mutual coherence for the genes in MammaPrint and show the essential difficulty of determining informative genes among the huge number of genes used in the data analysis.

This paper is organized as follows. In Section 2 we describe pattern recognition for gene expression and overview the currently proposed methods. In Section 3 we point out the current difficulty in gene expression analysis using a real data set. Finally, we discuss the results and future works in Section 4.

2. Pattern Recognition from Gene Expressions

In this section we overview pattern recognition for gene expression. First we mention the DNA microarray technique, which is the widely used method to measure millions of gene expressions. Then we present the current methods for gene selection using microarrays.

2.1. DNA Microarray. Microarray has become a widely used technology in a variety of areas. A microarray measures the amount of mRNA, or gene expression. There are two major technologies available for gene expression measurement. One is the GeneChip system provided by Affymetrix, Inc.; GeneChip uses prefabricated short lengths of oligonucleotide. The other is the cDNA array, which was originally developed by Schena et al. [9]. We briefly mention both technologies.

GeneChip uses a pair of short oligonucleotides attached to a solid surface such as glass, plastic, or silicon. The short pair of oligonucleotides is called the probe pair. Each probe pair is composed of a perfect match (PM) probe and a mismatch (MM) probe. PM is a section of the mRNA of interest, and MM is created by changing the middle (13th) base of the PM with the intention of measuring nonspecific binding. mRNA samples are collected from subjects such as cancer patients and then labeled with fluorescent dye. If the short oligonucleotide matches the mRNA sample, the labeled mRNA sample is hybridized to a spot of the microarray. If the labeled mRNA and the probe match perfectly, they bind strongly; otherwise they bind weakly. Those with weak or nonspecific binding are washed out by a washing buffer; then only strongly bound mRNA samples are measured by a scanner. Scanned measurements need further processing before analysis, such as outlier detection, background subtraction, and normalization. These processes are called preprocessing.

In the early stage of microarray, the quality of microarray measurements contained a lot of variance; therefore, preprocessing was a very active research area. Affymetrix recommended the use of both PM and MM probes to subtract nonspecific binding and implemented the MAS5 algorithm in their software; however, Irizarry et al. [10] and Naef et al. [11] pointed out that the normalization model considering MM captures the nonspecific effect more than reality. Currently, the robust multichip average (RMA), which was introduced by Irizarry et al. [10], is also widely used.

The cDNA array uses glass slides to attach short oligonucleotide probes and uses inkjet printing technology. GeneChip uses a one-color fluorescent dye; the cDNA array, on the other hand, utilizes two different colors of fluorescent dye. One of the colors is for the control mRNA and the other is for the treatment mRNA. Both samples are hybridized on the same array, and the scanner detects both fluorescent dyes separately. Data processing is slightly different from GeneChip: as cDNA uses two fluorescent dyes, scanned data is normally treated as the ratio of treatment over control.

Microarray technology has improved in the last decades, including a reduction of the variance; normalization procedures do not have as great an effect as before. Research interest has moved to areas such as data analysis, finding subclasses, and predicting the subclass.

2.2. Review for Prediction via Microarray. The initial approach employed for subclass discovery was hierarchical clustering analysis. Golub et al. [12] showed the result of clustering for leukemia data using microarray. In their result, subclasses of leukemia were well clustered by gene expression pattern. This result was hailed as a new dawn in the cancer classification problem.

Breast cancer is one of the cancers most used for the gene expression classification problem. Breast cancer treatment decisions are based largely on clinicopathological criteria such as tumor size, histological grade, and lymph node metastasis; however, van't Veer et al. [7] pointed out that the majority of patients received unnecessary chemotherapy and that there is a need to find better criteria for who benefits from chemotherapy.

van't Veer et al. [7] proposed 70 genes to predict patient outcome as the first multigene signature in breast cancer. The 70


Table 1: A taxonomy of feature selection techniques summarized by Saeys et al. [8]. Three major types of feature selection are addressed; each type has a subcategory. Advantages, disadvantages, and example methods are shown.

Filter (univariate)
- Advantages: fast; scalable; independent of the classifier.
- Disadvantages: ignores feature dependencies; ignores interaction with the classifier.
- Examples: chi-square (χ2); Euclidean distance; t-test; information gain.

Filter (multivariate)
- Advantages: models feature dependencies; independent of the classifier; better computational complexity than wrapper methods.
- Disadvantages: slower than univariate techniques; less scalable than univariate techniques; ignores interaction with the classifier.
- Examples: correlation-based feature selection; Markov blanket filter; fast correlation-based feature selection.

Wrapper (deterministic)
- Advantages: simple; interacts with the classifier; models feature dependencies; less computationally intensive than randomized methods.
- Disadvantages: risk of overfitting; more prone than randomized algorithms to getting stuck in a local optimum; classifier-dependent selection.
- Examples: sequential forward selection; sequential backward selection.

Wrapper (randomized)
- Advantages: less prone to local optima; interacts with the classifier; models feature dependencies.
- Disadvantages: computationally intensive; classifier-dependent selection; higher risk of overfitting than deterministic algorithms.
- Examples: simulated annealing; randomized hill climbing; genetic algorithms; estimation of distribution algorithms.

Embedded
- Advantages: interacts with the classifier; better computational complexity than wrapper methods; models feature dependencies.
- Disadvantages: classifier-dependent selection.
- Examples: decision trees; weighted naive Bayes; RFE-SVM.

genes were determined using 78 patients' tumor samples. In brief, they selected 5000 significantly expressed genes from the 25000 genes on the microarray; then the correlation coefficient with outcome was calculated for each gene. Genes were sorted by correlation coefficient and further optimized from the top-ranked gene, starting from a subset of five genes and sequentially adding genes. The top 70 genes were proposed as outcome predictors. Paik et al. [13] proposed a 21-gene signature based on RT-PCR results. These two multigene prognostic signatures are available as the clinical tests named MammaPrint and Oncotype DX. The FDA cleared the MammaPrint test in 2007, and it is currently being tested in the Microarray In Node-negative and 1–3 positive lymph-node Disease may Avoid ChemoTherapy (MINDACT) trial for further assessment, which is described by Cardoso et al. [14].

Besides these two multigene prognostic signatures, different multigene sets have been selected as prognostic signatures. Fan et al. [15] discussed a different set of multigene signatures in breast cancer prognostic classification studies. Those signatures show little overlap; however, they still showed similar classification power. Fan et al. [15] suggested that these signatures are probably tracking a common set of biological phenotypes. Considering the thousands and millions of genes on a microarray, multiple useful signature sets are not difficult to imagine; however, finding a stable and informative gene set with high classification accuracy is of interest.

2.3. Feature Selection. Reduction of the dimension size is necessary, as superfluous features can cause overfitting and make interpretation of the classification model difficult. Reducing the dimension size while keeping relevant features is important, and several feature selection methods have been proposed. Saeys et al. [8] provided a taxonomy of feature selection methods and discussed their uses, advantages, and disadvantages. They mentioned the objectives of feature selection: (a) to avoid overfitting and improve model performance, that is, prediction performance in the case of supervised classification and better cluster detection in the case of clustering; (b) to provide faster and more cost-effective models; and (c) to gain a deeper insight into the underlying processes that generated the data. Table 1 provides their taxonomy of feature selection methods. In the context of classification, feature selection methods are organized into three categories: filter methods, wrapper methods, and embedded methods. Feature selection methods are categorized depending on how they are combined with the construction of a classification model.

Filter methods calculate statistics, such as t-statistics, and then filter out those features which do not meet the threshold value. Advantages of filter methods are easy implementation, computational simplicity, and speed. Filter methods are independent of classification methods; therefore, different classifiers can be used. Two of the disadvantages of filter methods are that they ignore the interaction with the classification model and that most of the proposed methods are univariate.
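As an illustration of such a univariate filter, the following sketch ranks features by a two-sample t-statistic and keeps those exceeding a cutoff. The data and the threshold are toy values for illustration only, not taken from the paper:

```python
import math

def t_statistic(xs, ys):
    """Two-sample t-statistic with pooled variance for one feature."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((v - mx) ** 2 for v in xs) / (nx - 1)
    vy = sum((v - my) ** 2 for v in ys) / (ny - 1)
    pooled = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled * (1 / nx + 1 / ny))

def filter_features(expr_neg, expr_pos, threshold):
    """Keep indices of features whose |t| exceeds the threshold.

    expr_neg / expr_pos: per-class data, one inner list per feature.
    """
    kept = []
    for k, (xs, ys) in enumerate(zip(expr_neg, expr_pos)):
        if abs(t_statistic(xs, ys)) > threshold:
            kept.append(k)
    return kept

# Toy data: feature 0 separates the classes, feature 1 does not.
neg = [[1.0, 1.2, 0.9, 1.1], [5.0, 4.8, 5.3, 5.1]]
pos = [[3.0, 3.1, 2.9, 3.2], [5.1, 4.9, 5.2, 5.0]]
print(filter_features(neg, pos, threshold=3.0))  # [0]
```

Note that each feature is scored in isolation, which is exactly the "ignores feature dependencies" weakness listed in Table 1.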

Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search, as in sequential backward selection [16]. Wrapper methods search the space of feature subsets, which grows exponentially with the number of features. One advantage of wrapper methods is the interaction between the feature subset search and model selection. The computational burden is one of the disadvantages of this approach, especially if building the classifier has a high computational cost.
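A minimal sketch of the greedy forward variant of such a wrapper follows. The `score` callable stands in for a classifier-based evaluation (e.g., cross-validated accuracy); the toy score used here is purely illustrative:

```python
def forward_selection(features, score, max_size):
    """Greedy wrapper: repeatedly add the feature that most improves
    the classifier-based score, stopping when no candidate helps."""
    selected, best = [], float("-inf")
    while len(selected) < max_size:
        candidates = [f for f in features if f not in selected]
        scored = [(score(selected + [f]), f) for f in candidates]
        s, f = max(scored, key=lambda t: t[0])
        if s <= best:
            break  # no candidate improves the score
        selected, best = selected + [f], s
    return selected

# Toy score: rewards subsets containing features 2 and 5, penalizes size.
toy_score = lambda subset: sum(1 for f in subset if f in (2, 5)) - 0.1 * len(subset)
print(forward_selection(range(8), toy_score, max_size=4))  # [2, 5]
```

Each candidate subset requires retraining the classifier, which is where the computational burden mentioned above comes from.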

The third class of feature selection is embedded techniques. Like the wrapper methods, embedded techniques search for an optimal subset of features during classifier construction; an example is SVM-RFE, proposed by Guyon et al. [17]. Thus embedded approaches are specific to a given learning algorithm. The difference from wrapper methods is that embedded techniques are guided by the learning process: whereas wrapper methods search over possible combinations of gene sets, embedded techniques search for the combination based on a criterion internal to the learner. This enables a reduction in computational burden.

Besides these methods, the idea of sparseness was recently introduced in some feature selection methods. One approach is to use penalties in the regression context: Tibshirani [3] proposed the Lasso, which uses an L1-norm penalty. The combination of an L1 penalty and an L2 penalty is called the elastic net [18]. These methods focus on reducing features to avoid overfitting and to obtain better interpretability, as biologists expect to gain biological insight from the selected features.
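The mechanism by which an L1 penalty drives some coefficients to exactly zero can be seen in the soft-thresholding operator that underlies coordinate-descent Lasso solvers. A minimal sketch, with an illustrative penalty value (not from the paper):

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*|b|: shrinks z toward 0 and
    sets it exactly to 0 whenever |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Small coefficients are zeroed out; large ones are shrunk.
coefs = [2.5, -0.3, 0.8, -1.7, 0.05]
print([round(soft_threshold(c, 0.5), 6) for c in coefs])
# [2.0, 0.0, 0.3, -1.2, 0.0]

def elastic_net_update(z, lam1, lam2):
    """Elastic net adds an L2 term, which further scales the result."""
    return soft_threshold(z, lam1) / (1.0 + lam2)
```

The exact zeros are what make the Lasso act as a feature selector rather than a pure shrinkage method.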

However, these sparseness ideas do not take into account multiple solutions in one data set. When the data dimension is in the thousands or millions, there are multiple possible solutions. Sorting genes based on some criterion and then selecting a subset from the top is not always the best selection. We elaborate on multiple solutions in the following section and give some idea of how to select the optimal solution. Here we refer to an optimal solution as a prediction rule with high classification accuracy for various data sets.

3. Multiple Solutions for Prediction Rules

The existence of multiple solutions for the prediction of disease status based on breast cancer data [7] was shown by Ein-Dor et al. [19], who suggested three reasons for this problem. The first is that there are many genes correlated with disease status; the second is that the differences in the correlations among the genes are very small; the last is that the value of the correlation is very sensitive to the sample used for its calculation. In that paper they demonstrate gene ranking based on the correlation and show that there exist many equally predictive gene sets.

In this section we investigate the existence of multiple solutions from different viewpoints. First, to check the variability of prediction accuracy based on different statistical methods, we apply the van't Veer method [7], the Fisher linear discriminant analysis [20], AdaBoost [21], and AUCBoost [22]. The last two methods are called boosting in the machine learning community, where genes are nonlinearly combined to predict the disease status. Second, we apply hierarchical clustering to examine the heterogeneity of gene expression patterns. Sørlie et al. [23] showed that there exist subtypes of breast cancer for which the patterns of gene expression are clearly different and the disease statuses differ accordingly. Ein-Dor et al. [19] suggested that the heterogeneity of the subtypes is one reason why there are so many solutions for the prediction, or large fluctuations in the genes selected for the predictions. Hence we calculate the Biological Homogeneity Index (BHI) to see the clustering performance for various gene sets and examine the existence of the subtypes. Finally, we consider the mutual coherence for the breast cancer data and discuss its relation to the multiple solutions.
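For a gene expression matrix, the mutual coherence considered here is the maximum absolute Pearson correlation over all pairs of genes. A minimal sketch on toy data (the expression vectors are illustrative):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def mutual_coherence(genes):
    """Max |Pearson correlation| over all distinct pairs of genes.

    genes: list of expression vectors, one per gene.
    """
    p = len(genes)
    return max(abs(pearson(genes[j], genes[k]))
               for j in range(p) for k in range(j + 1, p))

# Toy example: genes 0 and 1 are nearly collinear, gene 2 is not.
genes = [[1.0, 2.0, 3.0, 4.0],
         [2.1, 3.9, 6.2, 7.8],
         [4.0, 1.0, 3.0, 2.0]]
print(round(mutual_coherence(genes), 3))  # 0.998
```

A coherence close to 1, as in this toy example, means two genes carry nearly the same information, so either can substitute for the other in a predictive gene set.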

The breast cancer data consist of the expression data of 25000 genes for 97 cancer patients. After a filtering procedure, 5420 genes were identified as significantly regulated, and we focused on this filtered data. The patients are divided into a training sample (78 patients) and a test sample (19 patients) in the same way as in the original paper [7]. Here we consider classical methods and boosting methods to predict whether the patient has good prognosis (free of disease for at least 5 years after initial diagnosis) or bad prognosis (distant metastases within 5 years). The prediction rule is generated from the training sample, and we measure the prediction accuracy on the test sample.

3.1. Classification. We briefly introduce the statistical methods used for the prediction of disease status. The first two methods are linear discriminant functions, where the genes are linearly combined to classify the patients into a good prognosis group or a bad prognosis group. The last two are boosting methods, where the genes are nonlinearly combined to generate the discriminant functions.

3.1.1. The van't Veer Method. Let $y$ be a class label indicating disease status, such as $y = -1$ (good prognosis) and $y = 1$ (metastases), and let $\mathbf{x} = (x_1, \ldots, x_p)^T$ be a $p$-dimensional covariate such as gene expression. We denote the samples of $y = -1$ and $y = 1$ by $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, respectively, and the mean value of the patients with good prognosis as

$$\bar{\mathbf{x}}^- = (\bar{x}^-_1, \ldots, \bar{x}^-_p)^T = \frac{1}{n_-} \sum_{i=1}^{n_-} \mathbf{x}^-_i. \qquad (1)$$

Then van't Veer et al. [7] proposed a discriminant function $F(\mathbf{x})$ based on the correlation with the average good-prognosis profile above, which is given as

$$F(\mathbf{x}) = -\frac{\sum_{k=1}^{p} (x_k - \bar{x})(\bar{x}^-_k - \bar{x}^-)}{\sqrt{\sum_{k=1}^{p} (x_k - \bar{x})^2}\,\sqrt{\sum_{k=1}^{p} (\bar{x}^-_k - \bar{x}^-)^2}}, \qquad (2)$$

where

$$\bar{x} = \frac{1}{p} \sum_{k=1}^{p} x_k, \qquad \bar{x}^- = \frac{1}{p} \sum_{k=1}^{p} \bar{x}^-_k. \qquad (3)$$

If the value of $F(\mathbf{x})$ is smaller than a predefined threshold value, then the patient is judged to have good prognosis ($y = -1$); otherwise, to have metastases ($y = 1$). This method is called one-class classification, since it focuses on the information of only one class label to predict the disease status $y$. This idea is also employed in the machine learning community; see Yousef et al. [24] and Gardner et al. [25] for applications in biology and medicine.
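Equation (2) is simply the negative Pearson correlation between a new expression profile and the good-prognosis mean profile. A small sketch of the rule follows; the training profiles and the threshold are toy values for illustration, not data from the paper:

```python
import math

def mean_profile(samples):
    """Componentwise mean of a list of expression vectors, as in (1)."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]

def vant_veer_score(x, good_mean):
    """F(x) in (2): minus the Pearson correlation between x and the
    average good-prognosis profile."""
    p = len(x)
    mx = sum(x) / p
    mg = sum(good_mean) / p
    num = sum((x[k] - mx) * (good_mean[k] - mg) for k in range(p))
    den = (math.sqrt(sum((x[k] - mx) ** 2 for k in range(p)))
           * math.sqrt(sum((good_mean[k] - mg) ** 2 for k in range(p))))
    return -num / den

# Toy data: three "good prognosis" training profiles over three genes.
good = [[1.0, 2.0, 0.5], [1.2, 2.2, 0.4], [0.9, 1.9, 0.6]]
gm = mean_profile(good)
threshold = 0.0  # illustrative; chosen on training data in practice
for x in ([1.1, 2.1, 0.5], [0.5, 0.2, 2.0]):
    label = -1 if vant_veer_score(x, gm) < threshold else 1
    print(label)  # -1 for the profile resembling gm, then 1
```

A profile that correlates strongly with the good-prognosis mean gets a negative score and is assigned $y = -1$; a dissimilar profile is assigned $y = 1$.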

3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that $\mathbf{x}$ is distributed as $N(\mu_-, \Sigma_-)$ for $y = -1$ and as $N(\mu_+, \Sigma_+)$ for $y = 1$. Then, if $\Sigma_- = \Sigma_+ = \Sigma$, the estimated log-likelihood ratio is given as

$$F(\mathbf{x}) = (\bar{\mathbf{x}}_+ - \bar{\mathbf{x}}_-)^T \hat{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \bar{\mathbf{x}}_+^T \hat{\Sigma}^{-1} \bar{\mathbf{x}}_+ + \frac{1}{2} \bar{\mathbf{x}}_-^T \hat{\Sigma}^{-1} \bar{\mathbf{x}}_- + \log \frac{n_+}{n_-}, \qquad (4)$$

where $\bar{\mathbf{x}}_+$, $\bar{\mathbf{x}}_-$, and $\hat{\Sigma}$ are the sample means for $y \in \{-1, 1\}$ and the total sample variance, respectively. For simplicity, we take the diagonal part of $\hat{\Sigma}$ (diagonal linear discriminant analysis) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used in situations where $p$ is much larger than $n\,(= n_- + n_+)$; in that case the inverse of $\hat{\Sigma}$ cannot be calculated.
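A sketch of the diagonal version of (4), where only the per-gene variances on the diagonal of $\hat{\Sigma}$ are inverted; the data are toy values for illustration:

```python
import math

def column_means(samples):
    """Componentwise mean of a list of expression vectors."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]

def dlda_score(x, neg, pos):
    """Log-likelihood ratio (4) using the diagonal of the total sample
    variance; positive scores favor y = +1."""
    p = len(x)
    m_neg, m_pos = column_means(neg), column_means(pos)
    n = len(neg) + len(pos)
    grand = column_means(neg + pos)
    # Diagonal of the total sample variance.
    var = [sum((s[k] - grand[k]) ** 2 for s in neg + pos) / n
           for k in range(p)]
    score = sum((m_pos[k] - m_neg[k]) / var[k] * x[k] for k in range(p))
    score -= 0.5 * sum(m_pos[k] ** 2 / var[k] for k in range(p))
    score += 0.5 * sum(m_neg[k] ** 2 / var[k] for k in range(p))
    score += math.log(len(pos) / len(neg))
    return score

neg = [[1.0, 5.0], [1.2, 4.8], [0.8, 5.2]]   # y = -1 samples
pos = [[3.0, 5.1], [3.2, 4.9], [2.8, 5.0]]   # y = +1 samples
print(dlda_score([3.1, 5.0], neg, pos) > 0)  # True: assigned to y = +1
print(dlda_score([0.9, 5.0], neg, pos) > 0)  # False: assigned to y = -1
```

Because only the diagonal is used, the score is a simple sum of per-gene contributions and remains computable even when $p \gg n$.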

3.1.3. AdaBoost. We introduce a famous boosting method from the machine learning community. The key concept of boosting is to construct a powerful discriminant function $F(\mathbf{x})$ by combining various weak classifiers $f(\mathbf{x})$ [26]. We employ a set $\mathcal{F}$ of decision stumps as a dictionary of weak classifiers. Here the decision stump for the $k$th gene expression $x_k$ is defined as a simple step function:

$$f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases} \qquad (5)$$

where $b_k$ is a threshold value. Accordingly, $f_k(\mathbf{x})$ is the simplest classifier in the sense that it neglects all information in the gene expression pattern other than that of the single gene $x_k$. However, by varying the value of $b_k$ for all genes ($k = 1, \ldots, p$), we obtain an $\mathcal{F}$ that contains exhaustive information about gene expression patterns. We attempt to build a good discriminant function $F$ by combining decision stumps in $\mathcal{F}$.

The typical method is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss of a discriminant function $F$:

$$L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}, \qquad (6)$$

where the entire data set is given as $\{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$. The exponential loss is sequentially minimized by the following algorithm.

(1) Initialize the weight of $\mathbf{x}_i$ for $i = 1, \ldots, n$ as $w_1(i) = 1/n$.

(2) For the iteration number $t = 1, \ldots, T$:

(a) choose $f_t$ as

$$f_t = \arg\min_{f \in \mathcal{F}} \epsilon_t(f), \qquad (7)$$

where

$$\epsilon_t(f) = \sum_{i=1}^{n} I(f(\mathbf{x}_i) \ne y_i) \frac{w_t(i)}{\sum_{i=1}^{n} w_t(i)}, \qquad (8)$$

and $I$ is the indicator function;

(b) calculate the coefficient as

$$\beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)}; \qquad (9)$$

(c) update the weights as

$$w_{t+1}(i) = w_t(i) \exp\{-y_i \beta_t f_t(\mathbf{x}_i)\}. \qquad (10)$$

(3) Output the final function as $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t f_t(\mathbf{x})$.

Here we have

$$L_{\exp}(F + \beta f) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \exp\{-\beta y_i f(\mathbf{x}_i)\} \qquad (11)$$

$$= \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \left[ e^{\beta} I(f(\mathbf{x}_i) \ne y_i) + e^{-\beta} I(f(\mathbf{x}_i) = y_i) \right] \qquad (12)$$

$$= L_{\exp}(F) \left\{ e^{\beta} \epsilon(f) + e^{-\beta} (1 - \epsilon(f)) \right\} \qquad (13)$$

$$\ge 2 L_{\exp}(F) \sqrt{\epsilon(f)(1 - \epsilon(f))}, \qquad (14)$$

where $\epsilon(f) = \frac{1}{n} \sum_{i=1}^{n} I(f(\mathbf{x}_i) \ne y_i) \exp\{-y_i F(\mathbf{x}_i)\} / L_{\exp}(F)$, and the equality in (14) is attained if and only if

$$\beta = \frac{1}{2} \log \frac{1 - \epsilon(f)}{\epsilon(f)}. \qquad (15)$$


Hence we can see that the exponential loss is sequentially minimized by the algorithm. It is easily shown that

$$\epsilon_{t+1}(f_t) = \frac{1}{2}. \qquad (16)$$

That is, the weak classifier $f_t$ chosen at step $t$ is the worst element in $\mathcal{F}$ in the sense of the weighted error rate at step $t + 1$.
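A compact sketch of the algorithm in (7)-(10) with decision stumps on one-dimensional toy data follows; the threshold grid and data are illustrative, not from the paper:

```python
import math

def stump(k, b, sign):
    """Decision stump (5): sign * (+1 if x[k] >= b else -1)."""
    return lambda x: sign * (1 if x[k] >= b else -1)

def adaboost(X, y, stumps, T):
    n = len(X)
    w = [1.0 / n] * n
    model = []  # list of (beta_t, f_t) pairs
    for _ in range(T):
        # (7)-(8): weighted error of each stump; pick the minimizer.
        total = sum(w)
        errs = [sum(w[i] for i in range(n) if f(X[i]) != y[i]) / total
                for f in stumps]
        eps, f = min(zip(errs, stumps), key=lambda e: e[0])
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard the logarithm
        beta = 0.5 * math.log((1 - eps) / eps)  # (9)
        w = [w[i] * math.exp(-y[i] * beta * f(X[i])) for i in range(n)]  # (10)
        model.append((beta, f))
    return lambda x: sum(b * f(x) for b, f in model)

X = [[0.2], [0.4], [0.6], [0.8]]
y = [-1, -1, 1, 1]
stumps = [stump(0, b, s) for b in (0.3, 0.5, 0.7) for s in (1, -1)]
F = adaboost(X, y, stumps, T=5)
print([1 if F(x) >= 0 else -1 for x in X])  # recovers [-1, -1, 1, 1]
```

The clamping of the weighted error away from 0 and 1 is a practical guard: on separable toy data a stump can reach zero error, where (9) is undefined.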

3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion consists of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, is given as

$$\mathrm{AUC}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)), \qquad (17)$$

where $H(z)$ is the Heaviside function: $H(z) = 1$ if $z \ge 0$ and $H(z) = 0$ otherwise. To avoid the difficulty of maximizing the nondifferentiable function above, an approximate AUC is considered by Komori [28] as

$$\mathrm{AUC}_{\sigma}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_{\sigma}(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)), \qquad (18)$$

where $H_{\sigma}(z) = \Phi(z/\sigma)$, with $\Phi$ being the standard normal distribution function. A smaller scale parameter $\sigma$ means a better approximation of the Heaviside function $H(z)$. Based on this approximation, a boosting method for the maximization of the AUC as well as the partial AUC is proposed by Komori and Eguchi [22], in which they consider the following objective function:

\[
\mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_{\sigma}\bigl(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\bigr) - \lambda \sum_{k=1}^{p} \int F_k''(x_k)^2 \, dx_k, \tag{19}
\]
where $F_k''(x_k)$ is the second derivative of the $k$th component of $F(\mathbf{x})$ and $\lambda$ is a smoothing parameter that controls the smoothness of $F(\mathbf{x})$. Here the set of weak classifiers is given as

\[
\mathcal{F} = \left\{ f(\mathbf{x}) = \frac{N_{kl}(x_k)}{Z_{kl}} \;\middle|\; k = 1, 2, \ldots, p;\; l = 1, 2, \ldots, m_k \right\}, \tag{20}
\]
where $N_{kl}$ is a basis function for representing natural cubic splines and $Z_{kl}$ is a standardization factor. Then the relationship
\[
\max_{\sigma,\lambda,F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda,F} \mathrm{AUC}_{1,\lambda}(F) \tag{21}
\]
allows us to fix $\sigma = 1$ without loss of generality, and we have the following algorithm.

(1) Start with $F_0(\mathbf{x}) = 0$.

(2) For $t = 1, \ldots, T$:

(a) update $\beta_{t-1}(f)$ to $\beta_t(f)$ with a one-step Newton--Raphson iteration;

(b) find the best weak classifier $f_t$:
\[
f_t = \operatorname*{argmax}_{f}\; \mathrm{AUC}_{1,\lambda}\bigl(F_{t-1} + \beta_t(f)\, f\bigr); \tag{22}
\]

(c) update the score function as
\[
F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t)\, f_t(\mathbf{x}). \tag{23}
\]

(3) Output $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t(f_t)\, f_t(\mathbf{x})$.

The value of the smoothing parameter $\lambda$ and the iteration number $T$ are determined simultaneously by cross-validation.
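The empirical criterion (17) and its smoothed version (18) that drive this algorithm are straightforward to evaluate. The snippet below is a minimal sketch, not the paper's implementation, computing $\Phi$ through the error function:

```python
import math

def phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_auc(scores_neg, scores_pos):
    """Empirical AUC (17): fraction of (negative, positive) pairs with
    F(x+) - F(x-) >= 0, i.e., the Heaviside function with H(0) = 1."""
    return sum(1 for sn in scores_neg for sp in scores_pos if sp - sn >= 0) \
        / (len(scores_neg) * len(scores_pos))

def smoothed_auc(scores_neg, scores_pos, sigma=1.0):
    """Approximate AUC (18): the Heaviside function replaced by
    H_sigma(z) = Phi(z / sigma), differentiable in the score function."""
    return sum(phi((sp - sn) / sigma) for sn in scores_neg for sp in scores_pos) \
        / (len(scores_neg) * len(scores_pos))

# toy usage: scores of three negative and three positive samples
neg = [0.1, 0.4, 0.35]
pos = [0.8, 0.9, 0.3]
```

As $\sigma \to 0$, `smoothed_auc` converges to `empirical_auc` whenever no score difference is exactly zero, which is the sense in which a smaller $\sigma$ gives a better approximation.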

3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and genes are determined by the correlation, and complete linkage was applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters $\mathcal{C} = \{C_1, \ldots, C_K\}$ and the biological categories or subtypes $\mathcal{B} = \{B_1, \ldots, B_L\}$:
\[
\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k(n_k - 1)} \sum_{\substack{i \neq j \\ i, j \in C_k}} I\bigl(B(i) = B(j)\bigr), \tag{24}
\]
where $B(i) \in \mathcal{B}$ is the subtype of subject $i$ and $n_k$ is the number of subjects in $C_k$. This index is bounded above by 1, which corresponds to perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related to the prognosis of the patients [23].
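Equation (24) translates directly into code. The sketch below is illustrative only; representing clusters as index lists and subtypes as a dictionary is our own convention:

```python
def bhi(clusters, subtype):
    """Biological homogeneity index (24): within each cluster, the fraction
    of ordered pairs of distinct subjects sharing the same biological
    subtype, averaged over the K clusters."""
    total = 0.0
    for C in clusters:
        n_k = len(C)
        if n_k < 2:
            continue  # a singleton cluster contributes 0 to the average
        agree = sum(1 for i in C for j in C if i != j and subtype[i] == subtype[j])
        total += agree / (n_k * (n_k - 1))
    return total / len(clusters)

# toy usage: two clusters of subjects labeled by (hypothetical) ER status
clusters = [[0, 1, 2], [3, 4]]
subtype = {0: "ER+", 1: "ER+", 2: "ER-", 3: "ER-", 4: "ER-"}
```

For this toy example the first cluster contributes $2/6$ (two of the six ordered pairs agree) and the second contributes $1$, so the BHI is $2/3$.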

3.3. Mutual Coherence. Now we have a data matrix $\mathbf{X}$ with $n$ rows ($n$ patients) and $p$ columns ($p$ genes). We assume an $n$-dimensional vector $\mathbf{b}$ indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis. The magnitude of $\mathbf{b}$ denotes the level of disease status. Then the optimal linear solution $\boldsymbol{\beta} \in \mathbb{R}^p$ should satisfy
\[
\mathbf{X}\boldsymbol{\beta} = \mathbf{b}. \tag{25}
\]

Note that if $p$ is much larger than $n$, then only a few elements of $\boldsymbol{\beta}$ should be nonzero and the others zero, which means $\boldsymbol{\beta}$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $\mathbf{X}$ as
\[
\mu(\mathbf{X}) = \max_{\substack{1 \le i, j \le p \\ i \neq j}} \frac{\bigl|\mathbf{x}_i^T \mathbf{x}_j\bigr|}{\|\mathbf{x}_i\|_2\, \|\mathbf{x}_j\|_2}, \tag{26}
\]
where $\mathbf{x}_i \in \mathbb{R}^n$ denotes the $i$th column of $\mathbf{X}$, that is, the $i$th gene in the breast cancer data, $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the similarity of columns (genes) in the data matrix $\mathbf{X}$ and is bounded as $0 \le \mu(\mathbf{X}) \le 1$. The next theorem shows the relationship between the sparsity of $\boldsymbol{\beta}$ and the data matrix $\mathbf{X}$.

Theorem 1 (Elad [30]). If $\mathbf{X}\boldsymbol{\beta} = \mathbf{b}$ has a solution $\boldsymbol{\beta}$ obeying $\|\boldsymbol{\beta}\|_0 < (1 + 1/\mu(\mathbf{X}))/2$, this solution is necessarily the sparsest possible, where $\|\boldsymbol{\beta}\|_0$ denotes the number of the nonzero components of $\boldsymbol{\beta}$.

This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \boldsymbol{\beta}^T\mathbf{x}$ could have a sparse solution to predict the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(\mathbf{X})$ is nearly equal to 1, then $\|\boldsymbol{\beta}\|_0$ could become approximately 1, indicating that just one gene (a column of $\mathbf{X}$) could be enough to predict the disease status if there were a solution $\boldsymbol{\beta}$. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in the classification problems we deal with, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.
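The mutual coherence (26) can be computed by scanning all column pairs. The following pure-Python sketch is for illustration; its $O(p^2 n)$ loop is impractical for all 5420 genes but fine for small examples:

```python
import math

def mutual_coherence(X):
    """Mutual coherence (26): the largest absolute normalized inner product
    between two distinct columns of the data matrix X (rows = patients,
    columns = genes)."""
    p = len(X[0])
    cols = [[row[k] for row in X] for k in range(p)]  # column vectors in R^n
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    mu = 0.0
    for i in range(p):
        for j in range(i + 1, p):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            mu = max(mu, abs(dot) / (norms[i] * norms[j]))
    return mu

# toy usage: three genes over three patients; genes 1 and 2 are proportional,
# so the coherence attains its upper bound of 1
X = [[1.0, 2.0, 0.0],
     [2.0, 4.0, 1.0],
     [3.0, 6.0, 0.0]]
```

With $\mu(\mathbf{X})$ close to 1, the bound of Theorem 1 degenerates to $\|\boldsymbol{\beta}\|_0 < 1$, which is the formal counterpart of the remark above.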

3.4. Results. We prepare various data sets using a training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7], in order to investigate the existence of multiple solutions for the prediction of the disease status. These are given by
\[
\mathcal{D} = \bigl\{ D_{1-70},\, D_{6-75},\, \ldots,\, D_{161-230} \bigr\}. \tag{27}
\]

The first data set, $D_{1-70}$, consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is based on the correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions $F(\mathbf{x})$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The results of the classification performance of the van't Veer method and DLDA are illustrated in Figure 1; those of AdaBoost and AUCBoost are in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. In regard to the performance based on the training data, denoted by the solid lines, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, having an AUC of more than 0.8 and error rates less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
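Assuming that consecutive windows shift by five genes, which is our reading of the listed members $D_{1-70}$, $D_{6-75}$, and $D_{161-230}$, the family $\mathcal{D}$ of (27) can be enumerated as:

```python
def sliding_gene_sets(n_genes=230, width=70, step=5):
    """1-based inclusive index windows D_{1-70}, D_{6-75}, ..., D_{161-230}
    over the ranked gene list, as in (27); the step of 5 is an assumption
    inferred from the listed members."""
    return [(s, s + width - 1) for s in range(1, n_genes - width + 2, step)]

windows = sliding_gene_sets()
```

Each window selects 70 consecutive genes from the correlation-ranked list, giving 33 overlapping candidate gene sets in total.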

We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1-70}$, $D_{11-80}$, and $D_{111-180}$, which are shown in Figure 3. Each row represents 70 genes and each column represents 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1-70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11-80}$ shows different patterns from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the basal-like subtype or the triple-negative subtype, though the HER2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER negative shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set $D_{111-180}$ has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that of $D_{1-70}$. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.

Next we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with 5420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performance of DLDA, which shows the most stable prediction accuracy as seen in Figure 1, over 100 replications based on randomly selected 50 patients is shown in Figure 6, where the vertical axes are the AUC (left panels) and the error rate (right panels) and the genes are ranked by correlation (a) and by AUC (b). The performance clearly gets worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of the gene expression patterns may come from the several subtypes


Figure 1: Results of classical methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets $\mathcal{D}$ (1–70, 41–120, 81–150, 121–190, 161–230) for the van't Veer method and DLDA, for training data (solid lines) and test data.

as suggested in the cluster analysis, and it would make it difficult to select useful genes for prediction.
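The resampling check described above can be sketched as follows. This is an illustrative reconstruction on synthetic data (Pearson correlation ranking over random subsamples), not the original analysis code:

```python
import random

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

def selection_rates(expr, status, top=70, trials=100, subsample=50, seed=0):
    """For each trial, subsample patients, rank genes by |correlation| with
    the disease status, and record which genes enter the top list; returns
    the per-gene selection rate over all trials."""
    rng = random.Random(seed)
    n, p = len(expr), len(expr[0])
    counts = [0] * p
    for _ in range(trials):
        idx = rng.sample(range(n), subsample)
        sub = [expr[i] for i in idx]
        lab = [status[i] for i in idx]
        corr = [abs(pearson([row[k] for row in sub], lab)) for k in range(p)]
        for k in sorted(range(p), key=lambda g: corr[g], reverse=True)[:top]:
            counts[k] += 1
    return [c / trials for c in counts]

# toy usage on synthetic data: gene 0 tracks disease status, the rest are noise
status = [1 if i % 2 == 0 else -1 for i in range(30)]
rng = random.Random(1)
expr = [[status[i] + 0.1 * rng.random() if k == 0 else rng.random()
         for k in range(10)] for i in range(30)]
rates = selection_rates(expr, status, top=3, trials=20, subsample=15)
```

Genes with a strong, consistent association are selected in nearly every subsample, while noise genes enter the top list only sporadically; a wide spread of these rates over the MammaPrint genes is exactly the instability shown in Figure 5.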

Finally, we calculate the mutual coherence based on the genes in MammaPrint and the total 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM 014889 and NM 014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone


Figure 2: Results of boosting methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets $\mathcal{D}$ (1–70, 41–120, 81–150, 121–190, 161–230) for AdaBoost and AUCBoost, for training and test data.

to select genes such that their gene expression patterns are similar to each other. This would be one reason why we have multiple solutions after the ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.

4. Discussion and Concluding Remarks

In this paper we have addressed an important classification issue in microarray data. Our results demonstrate the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that the performance of several gene sets was essentially the same; this behavior was confirmed with the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and


Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients: (a) $D_{1-70}$ (MammaPrint), (b) $D_{11-80}$, clearly showing some subtypes, and (c) $D_{111-180}$, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.

the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into known subtypes, but considering subtypes in feature selection could be future work.


Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line), over the data sets 1–70, 41–120, 81–150, 121–190, and 161–230.

Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.


Figure 6: The values of AUC (left panel) and error rate (right panel) calculated by DLDA using genes ranked in the top 70 by the correlation coefficients (a) and by AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.

Microarray technology has become commonplace, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.

One approach to tackling this problem of high dimensionality is to apply statistical methods in consideration of the background knowledge of medical or biological data. As seen in Figure 3 and demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the resultant outcome of the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of the genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5 and suggested by


Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint (NM 014889 against NM 014968). (b) The distribution of the correlation for the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of pairs of genes on which the correlations are calculated, standardized between 0 and 1 for a clear view.

Ein-Dor et al. [19], the gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of all genes, would be useful for capturing the mutual relations among genes and for identifying subgroups of genes informative for the prediction problem. The combination of unsupervised and supervised learning is a promising way to resolve the difficulty involved in high-dimensional data.

We have addressed the challenges and difficulties in the pattern recognition of gene expression. The main problem is caused by high-dimensional data sets. This is a problem not only for microarray data but also for RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarray for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] presented clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs for RNA-seq data: its dimension is also quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data has the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.

Conflict of Interests

The authors have declared that no conflict of interests exists.

References

[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.

[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.

[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.

[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.

[5] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.

[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.

[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.

[8] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.

[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.

[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined—microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq—quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.



an interpretable model for gene expression data. In the regression context, Tibshirani [3] proposed the lasso, which achieves feature selection by shrinking some coefficients to exactly zero. In the image processing area, the sparse model is also considered for redundant representation of data by Donoho and Elad [4] and Candes et al. [5].

Prediction performance has currently reached a certain level; however, unsolved problems remain in prediction settings where the observed number of genes is extremely larger than that of the subjects. This causes difficulties in which superfluous discovered genes arise whose expressions have weak power for prediction. As a result, almost no microarray data analysis is completely confirmed by biological replication, as discussed in Allison et al. [6].

We address the microarray data analysis discussed in van't Veer et al. [7], in which a set of 70 genes is selected as informative biomarkers for the prediction of prognosis in breast cancer patients. The result helped to build the prediction kit named "MammaPrint," which was approved by the FDA in 2007. We reanalyze the data and discuss how surprisingly many other gene sets show almost the same prediction performance as the published gene set. Thus their gene set does not uniquely show a reasonable power of prediction, so we suggest that it is impossible to build a universal prediction rule by efficient selection of genes. In particular, the ranking procedure according to a measure of association with the phenotype is very fragile to data partition. We discuss the statistical reason why the analysis of gene expression data involves such difficulties of multiple solutions. We calculate the value of the mutual coherence for the genes in MammaPrint and show the essential difficulty of determining informative genes among the huge number of genes used in the data analysis.

This paper is organized as follows. In Section 2, we describe the pattern recognition of gene expression and overview the currently proposed methods. In Section 3, we point out the current difficulty in gene expression analysis using a real data set. Finally, we discuss the results and future work in Section 4.

2. Pattern Recognition from Gene Expressions

In this section we overview the pattern recognition of gene expression. First we describe the DNA microarray technique, which is the widely used method to measure millions of gene expressions. Then we present the current methods for gene selection using microarray data.

2.1. DNA Microarray. Microarray has become a widely used technology in a variety of areas. A microarray measures the amount of mRNA, that is, gene expression. There are two major technologies available for gene expression measurement. One is the GeneChip system provided by Affymetrix, Inc.; GeneChip uses prefabricated short lengths of oligonucleotide. The other is the cDNA array, which was originally developed by Schena et al. [9]. We briefly describe both technologies.

GeneChip uses a pair of short-length oligonucleotides attached to a solid surface such as glass, plastic, or silicon. The short pair of oligonucleotides is called the probe pair. Each probe pair is composed of a perfect match (PM) probe and a mismatch (MM) probe. PM is a section of the mRNA of interest, and MM is created by changing the middle (13th) base of the PM with the intention of measuring nonspecific binding. mRNA samples are collected from subjects such as cancer patients, then labeled with fluorescence dye. If the short oligonucleotide is matched with the mRNA sample, the labeled mRNA sample is hybridized to a spot of the microarray. If the labeled mRNA and the probe match perfectly, they bind strongly; otherwise they bind weakly. Those with weak or nonspecific binding are washed out by a washing buffer; then only strongly bound mRNA samples are measured by a scanner. Scanned measurements need further processing before analysis, such as outlier detection, background subtraction, and normalization. These processes are called preprocessing.

In the early stage of the microarray era, microarray measurements contained a lot of variance; therefore preprocessing was a very active research area. Affymetrix recommended the use of both PM and MM probes to subtract nonspecific binding and implemented the MAS 5.0 algorithm in their software; however, Irizarry et al. [10] and Naef et al. [11] pointed out that the normalization model considering MM captures the nonspecific effect more than reality. Currently, the robust multichip average (RMA), introduced by Irizarry et al. [10], is also widely used.

The cDNA array uses glass slides to attach short oligonucleotide probes, using inkjet printing technology. GeneChip uses a one-color fluorescent dye; the cDNA array, on the other hand, utilizes two different-colored fluorescent dyes. One of the colors is for the control mRNA and the other is for the treatment mRNA. Both samples are hybridized on the same array, and the scanner detects the two fluorescent dyes separately. Data processing is slightly different from GeneChip: as the cDNA array uses two fluorescent dyes, scanned data is normally treated as the ratio of treatment over control.

Microarray technology has improved over the last decades, including a reduction of the variance, so normalization procedures do not have as great an effect as before. Research interest has moved to areas such as data analysis, finding subclasses, or predicting the subclass.

2.2. Review of Prediction via Microarray. The initial approach employed for subclass discovery was hierarchical clustering analysis. Golub et al. [12] showed the result of clustering for leukemia data using microarray. In their result, subclasses of leukemia were well clustered by gene expression pattern. This result was hailed as a new dawn in the cancer classification problem.

Breast cancer is one of the cancers most used for the gene expression classification problem. Breast cancer treatment decisions are based largely on clinicopathological criteria such as tumor size, histological grade, and lymph node metastasis; however, van't Veer et al. [7] pointed out that the majority of patients received unnecessary chemotherapy and that there is a need for better criteria to identify who benefits from chemotherapy.

van't Veer et al. [7] proposed 70 genes to predict patient outcome as the first multigene signature in breast cancer. The 70

Computational and Mathematical Methods in Medicine 3

Table 1: A taxonomy of feature selection techniques, summarized by Saeys et al. [8]. Three major feature selection models are addressed; each type has a subcategory. Advantages, disadvantages, and example methods are shown.

Filter (univariate)
  Advantages: fast; scalable; independent of the classifier.
  Disadvantages: ignores feature dependencies; ignores interaction with the classifier.
  Examples: chi-square (χ²); Euclidean distance; t-test; information gain.

Filter (multivariate)
  Advantages: models feature dependencies; independent of the classifier; better computational complexity than wrapper methods.
  Disadvantages: slower than univariate techniques; less scalable than univariate techniques; ignores interaction with the classifier.
  Examples: correlation-based feature selection; Markov blanket filter; fast correlation-based feature selection.

Wrapper (deterministic)
  Advantages: simple; interacts with the classifier; models feature dependencies; less computationally intensive than randomized methods.
  Disadvantages: risk of overfitting; more prone than randomized algorithms to getting stuck in a local optimum; classifier-dependent selection.
  Examples: sequential forward selection; sequential backward selection.

Wrapper (randomized)
  Advantages: less prone to local optima; interacts with the classifier; models feature dependencies.
  Disadvantages: computationally intensive; classifier-dependent selection; higher risk of overfitting than deterministic algorithms.
  Examples: simulated annealing; randomized hill climbing; genetic algorithms; estimation of distribution algorithms.

Embedded
  Advantages: interacts with the classifier; better computational complexity than wrapper methods; models feature dependencies.
  Disadvantages: classifier-dependent selection.
  Examples: decision trees; weighted naive Bayes; RFE-SVM.

genes were decided using 78 patients' tumor samples. In brief, they selected 5000 significantly expressed genes from the 25,000 genes on the microarray; then the correlation coefficient with outcome was calculated for each gene. Genes were sorted by correlation coefficient and further optimized starting from the top-ranked gene by sequentially adding subsets of five genes. The top 70 genes were proposed as outcome predictors. Paik et al. [13] proposed a 21-gene signature based on RT-PCR results. These two multigene prognostic signatures are available as clinical tests named MammaPrint and Oncotype DX. The FDA cleared the MammaPrint test in 2007, and it is currently being tested in the Microarray In Node-negative and 1-3 positive lymph-node Disease may Avoid ChemoTherapy (MINDACT) trial for further assessment, which is described by Cardoso et al. [14].

Besides these two multigene prognostic signatures, different multigene sets have been selected as prognostic signatures. Fan et al. [15] discussed several different sets of multigene signatures in the breast cancer prognostic classification studies. Those signatures show little overlap; however, they still showed similar classification power. Fan et al. [15] suggested that these signatures are probably tracking a common set of biological phenotypes. Considering the thousands and millions of genes on a microarray, multiple useful signature sets are not difficult to imagine; however, finding a stable and informative gene set with high classification accuracy is of interest.

2.3. Feature Selection. Reduction of dimension size is necessary, as superfluous features can cause overfitting and make interpretation of the classification model difficult. Reducing dimension size while keeping relevant features is important, and several feature selection methods have been proposed. Saeys et al. [8] provided a taxonomy of feature selection methods and discussed their use, advantages, and disadvantages. They mentioned three objectives of feature selection: (a) to avoid overfitting and improve model performance, that is, prediction performance in the case of supervised classification and better cluster detection in the case of clustering; (b) to provide faster and more cost-effective models; and (c) to gain a deeper insight into the underlying processes that generated the data. Table 1 provides their taxonomy of feature selection methods. In the context of classification, feature selection methods are organized into three categories: filter methods, wrapper methods, and embedded methods. Feature selection methods are categorized depending on how they are combined with the construction of a classification model.

Filter methods calculate statistics, such as t-statistics, then filter out the features that do not meet the threshold value. Advantages of filter methods are easy implementation, computational simplicity, and speed. Filter methods are independent of classification methods; therefore different classifiers can be used. Two of the disadvantages of filter methods are that they ignore the interaction with the classification model and that most of the proposed methods are univariate.
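To make the filter idea concrete, a univariate filter can be sketched as follows. This is illustrative code only, not from the original study; the gene names, expression values, and threshold are hypothetical.

```python
import math

def t_statistic(group_a, group_b):
    """Welch t-statistic between two lists of expression values."""
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    va = sum((v - ma) ** 2 for v in group_a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in group_b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def filter_genes(expr, labels, threshold):
    """Keep genes whose |t| meets the threshold.

    expr:   dict mapping gene name -> list of expression values per subject
    labels: list of +1/-1 phenotype labels aligned with the value lists
    """
    selected = []
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == -1]
        if abs(t_statistic(a, b)) >= threshold:
            selected.append(gene)
    return selected
```

Because each gene is scored in isolation, the procedure is fast and classifier-independent, but it cannot see dependencies between genes, which is exactly the disadvantage noted above.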

Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search, such as sequential backward selection [16]. Wrapper methods search over feature subsets, and the feature subset space grows exponentially with the number of features. One advantage of wrapper methods is the interaction between the feature subset search and model selection. Computational burden is one of the disadvantages of this approach, especially if building the classifier has a high computational cost.

The third class of feature selection is embedded techniques. Like the wrapper methods, embedded techniques search for an optimal subset of features during classifier construction; an example is SVM-RFE, proposed by Guyon et al. [17]. Thus embedded approaches are specific to a given learning algorithm. The difference from wrapper methods is that embedded techniques are guided by the learning process. Whereas wrapper methods search all possible combinations of gene sets, embedded techniques search for the combination based on a criterion, which enables a reduction in computational burden.

Besides these methods, the idea of sparseness was recently introduced in some feature selection methods. One approach is to use penalties in the regression context. Tibshirani [3] proposed the lasso, which uses the L1-norm penalty. The combination of the L1 penalty and the L2 penalty is called the elastic net [18]. These methods focus on reducing features to avoid overfitting and to obtain better interpretability, as biologists expect to gain biological insight from the selected features.
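The effect of these penalties can be illustrated with the coordinate-wise update rules: the L1 penalty acts through the soft-thresholding operator, and the naive elastic-net update adds a ridge shrinkage factor. The sketch below is illustrative only, assuming standardized covariates so the updates take this simple closed form.

```python
def soft_threshold(z, lam):
    """Soft-thresholding: the proximal operator of the L1 (lasso) penalty.
    Coefficients with |z| <= lam are set exactly to zero, giving sparsity."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def elastic_net_update(z, lam1, lam2):
    """Naive elastic-net coordinate update: L1 soft-thresholding
    followed by L2 (ridge) shrinkage of the surviving coefficient."""
    return soft_threshold(z, lam1) / (1.0 + lam2)
```

The exact zeros produced by `soft_threshold` are what make the lasso a feature selection method rather than only a shrinkage method.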

However, these sparseness ideas do not take into account multiple solutions in one data set. When the data dimension is in the thousands or millions, there are multiple possible solutions. Sorting genes based on some criterion and then selecting a subset from the top is not always the best selection. We elaborate on multiple solutions in the following section and give some idea of how to select the optimum solution. Here we refer to an optimal solution as a prediction rule with high classification accuracy for various data sets.

3. Multiple Solutions for Prediction Rules

The existence of multiple solutions for prediction of disease status based on the breast cancer data [7] was shown by Ein-Dor et al. [19], who suggest three reasons for this problem. The first is that there are many genes correlated with disease status; the second is that the differences in the correlation among the genes are very small; the last is that the value of the correlation is very sensitive to the sample used for its calculation. In the paper they demonstrate gene ranking based on the correlation and show that there exist many equally predictive gene sets.

In this section we investigate the existence of multiple solutions from different viewpoints. First, to check the variability of prediction accuracy based on different statistical methods, we apply the van't Veer method [7], the Fisher linear discriminant analysis [20], AdaBoost [21], and AUCBoost [22]. The last two methods are called boosting in the machine learning community, where genes are nonlinearly combined to predict the disease status. Second, we apply hierarchical clustering to examine the heterogeneity of gene expression patterns. Sørlie et al. [23] showed that there exist subtypes of breast cancer for which the patterns of gene expression are clearly different and for which the disease statuses also differ accordingly. Ein-Dor et al. [19] suggest that the heterogeneity of the subtypes is one reason why there are so many solutions for the prediction, or large fluctuations in the genes selected for the predictions. Hence we calculate the Biological Homogeneity Index (BHI) to see the clustering performance for various gene sets and examine the existence of the subtypes. Finally, we consider the mutual coherence for the breast cancer data and discuss its relation to the multiple solutions.

The breast cancer data consist of the expression data of 25,000 genes for 97 cancer patients. After a filtering procedure, 5420 genes were identified as significantly regulated, and we focused on this filtered data. The patients are divided into a training sample (78 patients) and a test sample (19 patients) in the same way as in the original paper [7]. Here we consider classical methods and boosting methods to predict whether the patient has good prognosis (free of disease for at least 5 years after initial diagnosis) or bad prognosis (distant metastases within 5 years). The prediction rule is generated from the training sample, and we measure the prediction accuracy on the test sample.

3.1. Classification. We briefly introduce the statistical methods used for the prediction of the disease status. The first two methods are linear discriminant functions, where the genes are linearly combined to classify the patients into a good prognosis group or a bad prognosis group. The last two are boosting methods, where the genes are nonlinearly combined to generate the discriminant functions.

3.1.1. The van't Veer Method. Let y be a class label indicating disease status, such as y = -1 (good prognosis) and y = 1 (metastases), and let x = (x_1, ..., x_p)^T be a p-dimensional covariate such as gene expression. We denote the samples of y = -1 and y = 1 as x_i^-, i = 1, ..., n_-, and x_j^+, j = 1, ..., n_+, respectively, and the mean value of the patients with good prognosis as

\[ \bar{\mathbf{x}}^- = (\bar{x}_1^-, \ldots, \bar{x}_p^-)^T = \frac{1}{n_-} \sum_{i=1}^{n_-} \mathbf{x}_i^-. \tag{1} \]

Then van't Veer et al. [7] proposed a discriminant function F(x) based on the correlation to the average good prognosis profile above, which is given as

\[ F(\mathbf{x}) = - \frac{\sum_{k=1}^{p} (x_k - \bar{x}) (\bar{x}_k^- - \tilde{x}^-)}{\sqrt{\sum_{k=1}^{p} (x_k - \bar{x})^2}\, \sqrt{\sum_{k=1}^{p} (\bar{x}_k^- - \tilde{x}^-)^2}}, \tag{2} \]

where

\[ \bar{x} = \frac{1}{p} \sum_{k=1}^{p} x_k, \qquad \tilde{x}^- = \frac{1}{p} \sum_{k=1}^{p} \bar{x}_k^-. \tag{3} \]

If the value of F(x) is smaller than a predefined threshold value, then the patient is judged to have good prognosis (y = -1); otherwise, to have metastases (y = 1). This method is a kind of one-class classification, in that it focuses on the information of only one class label to predict the disease status y. This idea is also employed in the machine learning community; see Yousef et al. [24] and Gardner et al. [25] for applications in biology and medicine.
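A direct transcription of (2) into code is straightforward: F(x) is simply the negative Pearson correlation between a new expression profile and the good-prognosis mean profile. The following is an illustrative sketch, not the authors' implementation; the threshold value is hypothetical, and constant profiles are assumed away.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length profiles (assumed nonconstant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (math.sqrt(sum((a - mu) ** 2 for a in u)) *
           math.sqrt(sum((b - mv) ** 2 for b in v)))
    return num / den

def vant_veer_score(x, good_centroid):
    """F(x) = -corr(x, good-prognosis centroid): a small (very negative)
    score means x resembles the average good-prognosis profile."""
    return -pearson(x, good_centroid)

def predict(x, good_centroid, threshold=0.0):
    """Return -1 (good prognosis) if F(x) is below the threshold, else +1."""
    return -1 if vant_veer_score(x, good_centroid) < threshold else 1
```

In the original study the threshold was tuned on the training sample; here it defaults to zero purely for illustration.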

3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that x is distributed as N(\mu_-, \Sigma_-) for y = -1 and as N(\mu_+, \Sigma_+) for y = 1. Then, if \Sigma_- = \Sigma_+ = \Sigma, the estimated log-likelihood ratio is given as

\[ F(\mathbf{x}) = (\hat{\boldsymbol{\mu}}_+ - \hat{\boldsymbol{\mu}}_-)^T \hat{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \hat{\boldsymbol{\mu}}_+^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_+ + \frac{1}{2} \hat{\boldsymbol{\mu}}_-^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_- + \log \frac{n_+}{n_-}, \tag{4} \]

where \hat{\mu}_+, \hat{\mu}_-, and \hat{\Sigma} are the sample means for y in {-1, 1} and the total sample variance, respectively. For simplicity we take the diagonal part of \hat{\Sigma} (diagonal linear discriminant analysis, DLDA) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used in a situation where p is much larger than n (= n_- + n_+); in that case the inverse of \hat{\Sigma} cannot be calculated.
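Because the covariance is diagonal, each gene contributes an independent term to the log-likelihood ratio in (4), so DLDA can be sketched in a few lines. This is illustrative code, not the authors' implementation; the pooled per-gene variance estimator shown (dividing by the total sample size) is one common choice.

```python
import math

def dlda_score(x, pos, neg):
    """Diagonal LDA discriminant for one profile x.

    pos, neg: lists of training profiles (lists of gene values) for
    y = +1 (metastases) and y = -1 (good prognosis).
    A positive score predicts y = +1.
    """
    n_pos, n_neg = len(pos), len(neg)
    score = math.log(n_pos / n_neg)  # prior term log(n+/n-)
    for k in range(len(x)):
        xp = [s[k] for s in pos]
        xn = [s[k] for s in neg]
        mp, mn = sum(xp) / n_pos, sum(xn) / n_neg
        # pooled variance of gene k: the kth diagonal element of Sigma-hat
        var = (sum((v - mp) ** 2 for v in xp) +
               sum((v - mn) ** 2 for v in xn)) / (n_pos + n_neg)
        score += (mp - mn) / var * x[k] - (mp ** 2 - mn ** 2) / (2 * var)
    return score
```

With equal class sizes and one gene, the decision boundary sits exactly at the midpoint of the two class means, as the quadratic terms in (4) dictate.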

3.1.3. AdaBoost. We introduce a famous boosting method in the machine learning community. The key concept of boosting is to construct a powerful discriminant function F(x) by combining various weak classifiers f(x) [26]. We employ a set \mathcal{F} of decision stumps as a dictionary of weak classifiers. Here the decision stump for the kth gene expression x_k is defined as a simple step function:

\[ f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases} \tag{5} \]

where b_k is a threshold value. Accordingly, f_k(x) is the simplest classifier in the sense that f_k(x) neglects all information about gene expression patterns other than that of the single gene x_k. However, by changing the value of b_k for all genes (k = 1, ..., p), we obtain a dictionary \mathcal{F} that contains exhaustive information about gene expression patterns. We attempt to build a good discriminant function F by combining decision stumps in \mathcal{F}.

The typical example is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss for a discriminant function F:

\[ L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}, \tag{6} \]

where the entire data set is given as \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}. The exponential loss is sequentially minimized by the following algorithm.

(1) Initialize the weight of x_i for i = 1, ..., n as w_1(i) = 1/n.

(2) For the iteration number t = 1, ..., T:

(a) choose f_t as

\[ f_t = \operatorname*{argmin}_{f \in \mathcal{F}} \epsilon_t(f), \tag{7} \]

where

\[ \epsilon_t(f) = \sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i)\, \frac{w_t(i)}{\sum_{i'=1}^{n} w_t(i')} \tag{8} \]

and I is the indicator function;

(b) calculate the coefficient as

\[ \beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)}; \tag{9} \]

(c) update the weight as

\[ w_{t+1}(i) = w_t(i) \exp\{-y_i \beta_t f_t(\mathbf{x}_i)\}. \tag{10} \]

(3) Output the final function as F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t f_t(\mathbf{x}).

Here we have

\[ L_{\exp}(F + \beta f) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \exp\{-\beta y_i f(\mathbf{x}_i)\} \tag{11} \]

\[ = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \left[ e^{\beta} I(f(\mathbf{x}_i) \neq y_i) + e^{-\beta} I(f(\mathbf{x}_i) = y_i) \right] \tag{12} \]

\[ = L_{\exp}(F) \left\{ e^{\beta} \epsilon(f) + e^{-\beta} (1 - \epsilon(f)) \right\} \tag{13} \]

\[ \ge 2 L_{\exp}(F) \sqrt{\epsilon(f)(1 - \epsilon(f))}, \tag{14} \]

where \epsilon(f) = \frac{1}{n} \sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i) \exp\{-y_i F(\mathbf{x}_i)\} / L_{\exp}(F), and the equality in (14) is attained if and only if

\[ \beta = \frac{1}{2} \log \frac{1 - \epsilon(f)}{\epsilon(f)}. \tag{15} \]

Hence we can see that the exponential loss is sequentially minimized in the algorithm. It is easily shown that

\[ \epsilon_{t+1}(f_t) = \frac{1}{2}. \tag{16} \]

That is, the weak classifier f_t chosen at step t is the worst element in \mathcal{F}, in the sense of the weighted error rate, at step t + 1.
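The algorithm above can be sketched directly for decision stumps. This is an illustrative toy implementation, not the code used in the experiments: it searches exhaustively over one stump per observed value of each gene and clips the weighted error away from 0 and 1 to keep the logarithm in (9) finite.

```python
import math

def stump(k, b):
    """Decision stump as in (5): +1 if x[k] >= b, else -1."""
    return lambda x: 1 if x[k] >= b else -1

def adaboost(X, y, T):
    """Minimal AdaBoost over decision stumps.

    X: list of samples (each a list of gene values); y: list of +1/-1 labels.
    Returns a list of (beta_t, f_t) pairs defining F(x) = sum beta_t f_t(x).
    """
    n, p = len(X), len(X[0])
    dictionary = [stump(k, X[i][k]) for k in range(p) for i in range(n)]
    w = [1.0 / n] * n
    model = []
    for _ in range(T):
        # (a) choose the stump minimizing the weighted error rate (8)
        best, best_err = None, None
        total = sum(w)
        for f in dictionary:
            err = sum(wi for wi, xi, yi in zip(w, X, y) if f(xi) != yi) / total
            if best is None or err < best_err:
                best, best_err = f, err
        best_err = min(max(best_err, 1e-12), 1.0 - 1e-12)  # guard log(0)
        # (b) coefficient (9); (c) weight update (10)
        beta = 0.5 * math.log((1.0 - best_err) / best_err)
        w = [wi * math.exp(-yi * beta * best(xi))
             for wi, xi, yi in zip(w, X, y)]
        model.append((beta, best))
    return model

def score(model, x):
    return sum(beta * f(x) for beta, f in model)
```

On real microarray data the dictionary would hold one stump per candidate threshold of each of thousands of genes, so practical implementations sort each gene's values once instead of scanning naively as here.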

3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion consists of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples x_i^-, i = 1, ..., n_-, and x_j^+, j = 1, ..., n_+, is given as

\[ \mathrm{AUC}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H\bigl(F(\mathbf{x}_j^+) - F(\mathbf{x}_i^-)\bigr), \tag{17} \]

where H(z) is the Heaviside function: H(z) = 1 if z >= 0 and H(z) = 0 otherwise. To avoid the difficulty of maximizing the nondifferentiable function above, an approximate AUC is considered by Komori [28]:

\[ \mathrm{AUC}_{\sigma}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_{\sigma}\bigl(F(\mathbf{x}_j^+) - F(\mathbf{x}_i^-)\bigr), \tag{18} \]

where H_\sigma(z) = \Phi(z/\sigma), with \Phi being the standard normal distribution function. A smaller scale parameter \sigma means a better approximation of the Heaviside function H(z). Based on this approximation, a boosting method for the maximization of the AUC, as well as the partial AUC, is proposed by Komori and Eguchi [22], in which they consider the following objective function:

\[ \mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_{\sigma}\bigl(F(\mathbf{x}_j^+) - F(\mathbf{x}_i^-)\bigr) - \lambda \sum_{k=1}^{p} \int F_k''(x_k)^2 \, dx_k, \tag{19} \]

where F_k''(x_k) is the second derivative of the kth component of F(x) and \lambda is a smoothing parameter that controls the smoothness of F(x). Here the set of weak classifiers is given as

\[ \mathcal{F} = \left\{ f(\mathbf{x}) = \frac{N_{kl}(x_k)}{Z_{kl}} \;\middle|\; k = 1, 2, \ldots, p;\; l = 1, 2, \ldots, m_k \right\}, \tag{20} \]

where N_{kl} is a basis function for representing natural cubic splines and Z_{kl} is a standardization factor. Then the relationship

\[ \max_{\sigma, \lambda, F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda, F} \mathrm{AUC}_{1,\lambda}(F) \tag{21} \]

allows us to fix \sigma = 1 without loss of generality, and we have the following algorithm.

(1) Start with F_0(x) = 0.

(2) For t = 1, ..., T:

(a) update \beta_{t-1}(f) to \beta_t(f) with a one-step Newton-Raphson iteration;

(b) find the best weak classifier f_t:

\[ f_t = \operatorname*{argmax}_{f \in \mathcal{F}} \mathrm{AUC}_{1,\lambda}\bigl(F_{t-1} + \beta_t(f) f\bigr); \tag{22} \]

(c) update the score function as

\[ F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t) f_t(\mathbf{x}). \tag{23} \]

(3) Output F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t(f_t) f_t(\mathbf{x}).

The value of the smoothing parameter \lambda and the iteration number T are determined at the same time by cross-validation.
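The empirical AUC (17) and its smooth surrogate (18) are easy to compute directly from the two score samples. The sketch below is illustrative only; \Phi is evaluated through the error function, and the Heaviside convention H(0) = 1 follows the definition above.

```python
import math

def empirical_auc(scores_pos, scores_neg):
    """Empirical AUC (17): fraction of (positive, negative) score pairs
    ranked correctly, counting ties (H(0)) as correct."""
    total = len(scores_pos) * len(scores_neg)
    return sum(1 for sp in scores_pos for sn in scores_neg if sp >= sn) / total

def smoothed_auc(scores_pos, scores_neg, sigma=1.0):
    """Smooth surrogate (18): the Heaviside step is replaced by the
    standard normal cdf Phi(z / sigma), which is differentiable in F."""
    def phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    total = len(scores_pos) * len(scores_neg)
    return sum(phi((sp - sn) / sigma)
               for sp in scores_pos for sn in scores_neg) / total
```

As sigma shrinks, the smooth version approaches the empirical AUC, which is the sense in which a smaller scale parameter gives a better approximation.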

3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage was applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters \mathcal{C} = \{C_1, \ldots, C_K\} and the biological categories or subtypes \mathcal{B} = \{B_1, \ldots, B_L\}:

\[ \mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k (n_k - 1)} \sum_{\substack{i \neq j \\ i, j \in C_k}} I\bigl(B^{(i)} = B^{(j)}\bigr), \tag{24} \]

where B^{(i)} \in \mathcal{B} is the subtype of subject i and n_k is the number of subjects in C_k. This index is bounded above by 1, which corresponds to perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related to the prognosis of the patients [23].
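Equation (24) translates directly into code. An illustrative sketch, not the software used in the study; the cluster and subtype labels are hypothetical, and singleton clusters (for which the pair count is undefined) are skipped.

```python
def bhi(clusters, subtype):
    """Biological homogeneity index (24).

    clusters: list of clusters, each a list of subject ids
    subtype:  dict mapping subject id -> biological category
    """
    total = 0.0
    for c in clusters:
        nk = len(c)
        if nk < 2:
            continue  # no within-cluster pairs for a singleton cluster
        agree = sum(1 for i in c for j in c
                    if i != j and subtype[i] == subtype[j])
        total += agree / (nk * (nk - 1))
    return total / len(clusters)
```

A value of 1 means every within-cluster pair shares a subtype; mixing subtypes within clusters pulls the index toward 0.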

3.3. Mutual Coherence. Now we have a data matrix X with n rows (n patients) and p columns (p genes). We assume an n-dimensional vector b indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis. The magnitude of b denotes the level of disease status. Then the optimal linear solution \beta \in \mathbb{R}^p should satisfy

\[ X\boldsymbol{\beta} = \mathbf{b}. \tag{25} \]

Note that if p is much larger than n, then only a few elements of \beta should be nonzero and the others zero, which means \beta has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix X as

\[ \mu(X) = \max_{\substack{1 \le i, j \le p \\ i \neq j}} \frac{|\mathbf{x}_i^T \mathbf{x}_j|}{\|\mathbf{x}_i\|_2 \, \|\mathbf{x}_j\|_2}, \tag{26} \]

where \mathbf{x}_i \in \mathbb{R}^n denotes the ith column of X (the ith gene in the breast cancer data), |\cdot| is the absolute value, and \|\cdot\|_2 is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix X and is bounded as 0 \le \mu(X) \le 1. The next theorem shows the relationship between the sparsity of \beta and the data matrix X.

Theorem 1 (Elad [30]). If X\beta = b has a solution \beta obeying \|\beta\|_0 < (1 + 1/\mu(X))/2, this solution is necessarily the sparsest possible, where \|\beta\|_0 denotes the number of nonzero components of \beta.

This theorem suggests that the linear discriminant function F(x) = \beta^T x could have a sparse solution to predict the disease status b, which corresponds to metastases or good prognosis in the breast cancer data. If the value of \mu(X) is nearly equal to 1, then the bound on \|\beta\|_0 becomes approximately 1, indicating that just one gene (a column of X) could be enough to predict the disease status if there were a solution \beta. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in the classification problems we deal with, Theorem 1 is indicative in the case where the number of covariates p is much larger than the observation number n.

3.4. Results. We prepare various data sets using the training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7], in order to investigate the existence of multiple solutions for the prediction of the disease status; these are given by

\[ \mathcal{D} = \{ D_{1\text{-}70}, D_{6\text{-}75}, \ldots, D_{161\text{-}230} \}. \tag{27} \]

The first data set, D_{1-70}, consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in \mathcal{D} is on the basis of correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions F(x), where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The classification performance of the van't Veer method and DLDA is illustrated in Figure 1; that of AdaBoost and AUCBoost in Figure 2. The AUC and the error rate are plotted against \mathcal{D}. In regard to the performance based on the training data, denoted by solid lines, the boosting methods are superior to the classical methods. However, in a comparison based on the test data, DLDA shows good performance for almost all the data sets in \mathcal{D}, having an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.

We investigate the performance of the hierarchical clusterings based on the training data sets D_{1-70}, D_{11-80}, and D_{111-180}, which are shown in Figure 3. Each row represents one of 70 genes and each column one of 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint (D_{1-70}) are 0.70, 0.69, and 0.57, respectively. The gene expression in D_{11-80} shows different patterns from the others. Mainly there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the basal-like or triple-negative subtype, though the Her2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER positive and good prognosis indicates Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set D_{111-180} has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that in D_{1-70}. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in \mathcal{D} have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.

Next we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with their 5420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performance of DLDA with 100 replications (the method with the most stable prediction accuracy, as seen in Figure 1), based on the randomly selected 50 patients, is shown in Figure 6, where the vertical axes are the AUC (left panels) and the error rate (right panels) and the genes are ranked by the correlation (a) and the AUC (b). The performance clearly gets worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of the gene expression pattern may come from the several subtypes


Figure 1: Results of classical methods. (a) van't Veer method; (b) DLDA. Each part shows the AUC values (left panel) and the error rate (right panel) over the data sets \mathcal{D} (D_{1-70} through D_{161-230}), for both training and test data.

as suggested in the cluster analysis, and it would make it difficult to select useful genes for predictions.
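The resampling check described above (subsample patients, rank genes by absolute correlation with the phenotype, count top-list membership) can be sketched as follows. Illustrative code only: the data, subset size, and repetition count are hypothetical stand-ins for the 50-of-78-patients, 100-replication experiment.

```python
import math
import random

def abs_corr(values, labels):
    """|Pearson correlation| between one gene's expression and the labels."""
    n = len(values)
    mv, ml = sum(values) / n, sum(labels) / n
    num = sum((v - mv) * (l - ml) for v, l in zip(values, labels))
    den = math.sqrt(sum((v - mv) ** 2 for v in values) *
                    sum((l - ml) ** 2 for l in labels))
    return abs(num / den) if den > 0 else 0.0

def selection_frequency(expr, labels, top, subset, reps, seed=0):
    """Repeatedly subsample patients, rank genes by |correlation|,
    and count how often each gene lands in the top list."""
    rng = random.Random(seed)
    genes = list(expr)
    counts = {g: 0 for g in genes}
    for _ in range(reps):
        sub = rng.sample(range(len(labels)), subset)
        ranked = sorted(genes, key=lambda g: -abs_corr(
            [expr[g][i] for i in sub], [labels[i] for i in sub]))
        for g in ranked[:top]:
            counts[g] += 1
    return counts
```

Genes whose selection count stays far below the number of replications are exactly the fragile, rarely selected genes seen in Figure 5.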

Finally, we calculate the mutual coherence based on the genes in MammaPrint and on the total 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM_014889 and NM_014968, corresponding to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined in the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone


Figure 2: Results of boosting methods. (a) AdaBoost; (b) AUCBoost. Each part shows the AUC values (left panel) and the error rate (right panel) over the data sets \mathcal{D} (D_{1-70} through D_{161-230}), for both training and test data.

to select genes such that their gene expression patterns are similar to each other. This would be one reason why we have multiple solutions after the ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.
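As a concrete sketch of this computation: the mutual coherence is the largest absolute Pearson correlation over all gene pairs, and the curves in panel (b) of Figure 7 are the sorted pairwise correlations. A minimal version on a synthetic expression matrix (not the actual breast cancer data) might look like this:

```python
import numpy as np

def mutual_coherence(X):
    """Maximum absolute Pearson correlation over all pairs of columns (genes).

    X: (n_samples, n_genes) expression matrix. Returns the mutual coherence
    and the pairwise |correlations| sorted decreasingly, as in Figure 7(b).
    """
    C = np.corrcoef(X, rowvar=False)      # gene-by-gene correlation matrix
    iu = np.triu_indices_from(C, k=1)     # upper triangle: count each pair once
    corrs = np.sort(np.abs(C[iu]))[::-1]  # sorted in decreasing order
    return corrs[0], corrs

rng = np.random.default_rng(0)
n = 97                                    # number of patients, as in the data
g1 = rng.normal(size=n)
# one near-duplicate gene plus eight independent ones
X = np.column_stack([g1, g1 + 0.1 * rng.normal(size=n),
                     rng.normal(size=(n, 8))])
mc, corrs = mutual_coherence(X)
print(round(mc, 3))                       # close to 1 for the duplicated pair
```

A gene set with many near-duplicate expression patterns, as MammaPrint appears to have, produces a correlation curve that stays high for many pairs.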

4. Discussion and Concluding Remarks

In this paper we have addressed an important classification issue using microarray data. Our results present the existence of multiple suboptimal solutions for choosing classification predictors. Classification results showed that the performance of several gene sets was the same. This behavior was confirmed by the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, nontop-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression pattern of selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and



Figure 3 Heat maps of gene expression data, with rows representing genes and columns representing patients: (a) $D_{1-70}$ (MammaPrint); (b) $D_{11-80}$, clearly showing some subtypes; and (c) $D_{111-180}$, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.

the bad prognosis groups from the nontop-ranked gene sets' clustering expression patterns. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. Selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into known subtypes, but considering subtypes for feature selection could be future work.


[Figure 4 graphics: BHI (0 to 1) over gene sets 1-70, 41-120, 81-150, 121-190, 161-230, with curves for metastases, ER positive, and PR positive.]

Figure 4 Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line).

[Figure 5 graphics: panels (a) and (b); vertical axis "Rate of ranking in the top 70" (0 to 1), horizontal axis the 70 MammaPrint genes labeled by accession numbers (AL080059, Contig63649 RC, ..., NM 018354).]

Figure 5 The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.


[Figure 6 graphics: panels (a) and (b); left panels AUC (0.5 to 1), right panels error rate (0 to 0.5), horizontal axis number of trials (0 to 100), with training and test curves.]

Figure 6 The values of AUC (left panel) and error rate (right panel) calculated by DLDA using genes ranked in the top 70 by the correlation coefficients (a) and by AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.

Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarrays. Besides, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the p ≫ n problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not omitted by the ranking procedure. This is the critical pitfall for gene selection in microarray data.

One approach to tackling this problem of the high dimensionality of the data is to apply statistical methods in consideration of the background knowledge of medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related with the resultant outcome of the patient's prognosis. This heterogeneity of the data is thought of as one factor that makes the ranking of the genes unstable, leading to the multiple solutions for prediction rules. As seen in Figure 5 and suggested by


[Figure 7 graphics: (a) scatter plot of NM 014889 against NM 014968, both axes from -0.2 to 0.6; (b) correlation against the index of pairs of genes, for the total 5420 genes and the 70 MammaPrint genes.]

Figure 7 (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distribution of the correlation of the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of pairs of genes based on which the correlations are calculated; it is standardized between 0 and 1 for a clear view.

Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of the total genes, would be useful to capture the mutual relations among genes and to identify subgroups of informative genes for the prediction problem. The combination of unsupervised learning and supervised learning is a promising way to solve the difficulty involved in high-dimensional data.

We addressed the challenges and difficulties regarding the pattern recognition of gene expression. The main problem was caused by high-dimensional data sets. This is not only a problem of microarray data but also of RNA sequencing. RNA sequencing technologies (RNA-seq) have dramatically advanced recently and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] showed a clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: the dimension of RNA-seq is quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data have the same difficulty as microarrays. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.

Conflict of Interests

The authors have declared that no conflict of interests exists.

References

[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860-921, 2001.

[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304-1351, 2001.

[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267-288, 1996.

[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197-2202, 2003.

[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207-1223, 2006.

[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55-65, 2006.

[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530-536, 2002.

[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.

[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614-10619, 1996.

[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249-264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999.

[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817-2826, 2004.

[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729-735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560-569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349-369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301-320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171-178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179-188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869-10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025-1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961-979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969-1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869-10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160-1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621-628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57-63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22-S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249-257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493-2518, 2011.


Table 1 A taxonomy of feature selection techniques, summarized by Saeys et al. [8]. The three major feature selection models are addressed; each type has subcategories. Advantages, disadvantages, and example methods are shown.

Filter, univariate.
  Advantages: fast; scalable; independent of the classifier.
  Disadvantages: ignores feature dependencies; ignores interaction with the classifier.
  Examples: χ²; Euclidean distance; t-test; information gain.

Filter, multivariate.
  Advantages: models feature dependencies; independent of the classifier; better computational complexity than wrapper methods.
  Disadvantages: slower and less scalable than univariate techniques; ignores interaction with the classifier.
  Examples: correlation-based feature selection; Markov blanket filter; fast correlation-based feature selection.

Wrapper, deterministic.
  Advantages: simple; interacts with the classifier; models feature dependencies; less computationally intensive than randomized methods.
  Disadvantages: risk of overfitting; more prone than randomized algorithms to getting stuck in a local optimum; classifier-dependent selection.
  Examples: sequential forward selection; sequential backward selection.

Wrapper, randomized.
  Advantages: less prone to local optima; interacts with the classifier; models feature dependencies.
  Disadvantages: computationally intensive; classifier-dependent selection; higher risk of overfitting than deterministic algorithms.
  Examples: simulated annealing; randomized hill climbing; genetic algorithms; estimation of distribution algorithms.

Embedded.
  Advantages: interacts with the classifier; better computational complexity than wrapper methods; models feature dependencies.
  Disadvantages: classifier-dependent selection.
  Examples: decision trees; weighted naive Bayes; RFE-SVM.

genes were decided using 78 patients' tumor samples. In brief, they selected 5000 significantly expressed genes from the 25000 genes on the microarray; then the correlation coefficient with outcome was calculated for each class. Genes were sorted by correlation coefficient and further optimized from the top-ranked gene by sequentially adding from a subset of five genes. The top 70 genes were proposed as outcome predictors. Paik et al. [13] proposed a 21-gene signature based on RT-PCR results. These two multigene prognostic signatures are available as clinical tests, named MammaPrint and Oncotype DX. The FDA cleared the MammaPrint test in 2007, and it is currently being tested in the Microarray In Node-negative and 1-3 positive lymph-node Disease may Avoid ChemoTherapy (MINDACT) trial for further assessment, as described by Cardoso et al. [14].

Besides these two multigene prognostic signatures, different multigene sets have been selected as prognostic signatures. Fan et al. [15] discussed a different set of multigene signatures in the breast cancer prognostic classification studies. Those signatures show little overlap; however, they still showed similar classification power. Fan et al. [15] suggested that these signatures are probably tracking a common set of biological phenotypes. Considering the thousands and millions of genes on microarrays, multiple useful signature sets are not difficult to imagine; however, finding a stable and informative gene set with high classification accuracy is of interest.

2.3. Feature Selection. Reduction of the dimension size is necessary, as superfluous features can cause overfitting and make interpretation of the classification model difficult. Reducing the dimension size while keeping relevant features is important. Several feature selection methods have been proposed. Saeys et al. [8] provided a taxonomy of feature selection methods and discussed their use, advantages, and disadvantages. They mentioned the objectives of feature selection: (a) to avoid overfitting and improve model performance, that is, prediction performance in the case of supervised classification and better cluster detection in the case of clustering; (b) to provide faster and more cost-effective models; and (c) to gain a deeper insight into the underlying processes that generated the data. Table 1 provides their taxonomy of feature selection methods. In the context of classification, feature selection methods are organized into three categories: filter methods, wrapper methods, and embedded methods. Feature selection methods are categorized depending on how they are combined with the construction of a classification model.

Filter methods calculate statistics, such as t-statistics, and then filter out the features that do not meet the threshold value. Advantages of filter methods are easy implementation, computational simplicity, and speed. Filter methods are independent of classification methods; therefore, different classifiers can be used. Two of the disadvantages of filter methods are that they ignore the interaction with the classification model and that most of the proposed methods are univariate.
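As an illustration of the filter idea, a minimal univariate t-statistic filter might look like the following sketch (numpy only; the data and the choice k = 3 are synthetic and illustrative, not the paper's setup):

```python
import numpy as np

def t_filter(X, y, k):
    """Keep the k genes with the largest |t| between classes y == 1 and y == -1.

    A univariate filter: the selection never consults a downstream classifier.
    X: (n_samples, n_genes); y: labels in {-1, +1}.
    """
    a, b = X[y == 1], X[y == -1]
    t = (a.mean(0) - b.mean(0)) / np.sqrt(
        a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    return np.argsort(-np.abs(t))[:k]     # indices of the top-k genes

rng = np.random.default_rng(1)
y = np.repeat([-1, 1], 30)
X = rng.normal(size=(60, 100))
X[y == 1, 5] += 2.0                       # plant a single informative gene
sel = t_filter(X, y, 3)
print(sel)
```

Because each gene is scored in isolation, two perfectly redundant genes receive identical scores, which is exactly the "ignores feature dependencies" weakness listed in Table 1.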

Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search, such as sequential backward selection [16]. Wrapper methods search over feature subsets, and the feature subset space grows exponentially with the number of features. One advantage of wrapper methods is the interaction between the feature subset search and model selection. The computational burden is one of the disadvantages of this approach, especially if building the classifier has a high computational cost.

The third class of feature selection is embedded techniques. Like the wrapper methods, embedded techniques search for an optimal subset of features during classifier construction; an example is SVM-RFE, proposed by Guyon et al. [17]. Thus, embedded approaches are specific to a given learning algorithm. The difference from wrapper methods is that embedded techniques are guided by the learning process. Whereas wrapper methods search over many possible combinations of gene sets, embedded techniques search for the combination based on a criterion. This enables a reduction in the computational burden.

Besides these methods, the idea of sparseness was recently introduced in some feature selection methods. One approach is to use penalties in the regression context. Tibshirani [3] proposed the lasso, which uses $L_1$-norm penalties. The combination of the $L_1$ penalty and the $L_2$ penalty is called the elastic net [18]. These methods focus on reducing features to avoid overfitting and to achieve better interpretability, as biologists expect to obtain biological insight from selected features.
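To make the $L_1$ selection effect concrete, here is a bare-bones coordinate-descent lasso on synthetic data (a didactic sketch, not a substitute for a tuned solver; the soft-threshold step is what drives coefficients exactly to zero):

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are standardized (so X_j'X_j / n = 1 and the
    per-coordinate update reduces to a soft threshold).
    """
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # partial residual without gene j
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft threshold
    return b

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 50))
X = (X - X.mean(0)) / X.std(0)                  # standardize columns
beta_true = np.zeros(50)
beta_true[[3, 17]] = [2.0, -1.5]
y = X @ beta_true + 0.1 * rng.normal(size=80)
b = lasso_cd(X, y, lam=0.2)
sel = np.flatnonzero(np.abs(b) > 1e-6)
print(sel)                                      # only a handful of genes survive
```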

However, these sparseness ideas do not take into account multiple solutions in one data set. When the data dimension is in the thousands or millions, there are multiple possible solutions. Sorting genes based on some criterion and then selecting a subset from the top is not always the best selection. We elaborate on multiple solutions in the following section and give some idea of how to select the optimum solution. Here we refer to an optimal solution as a prediction rule with high classification accuracy for various data sets.

3. Multiple Solutions for Prediction Rules

The existence of multiple solutions for the prediction of disease status based on the breast cancer data [7] was shown by Ein-Dor et al. [19], who suggest three reasons for this problem. The first is that there are many genes correlated with disease status; the second is that the differences in the correlation among the genes are very small; the last is that the value of the correlation is very sensitive to the sample used for its calculation. In the paper they demonstrate gene ranking based on the correlation and show that there exist many equally predictive gene sets.
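This sensitivity to the sample can be illustrated with a small resampling sketch in the spirit of Figure 5 (synthetic data; genes are ranked by absolute correlation with the outcome on repeated 50-patient subsamples):

```python
import numpy as np

def top_genes(X, y, k):
    # Rank genes by |Pearson correlation| with the outcome and keep the top k
    yc = (y - y.mean()) / y.std()
    Xc = (X - X.mean(0)) / X.std(0)
    corr = Xc.T @ yc / len(y)
    return set(np.argsort(-np.abs(corr))[:k])

rng = np.random.default_rng(3)
n, p, top = 100, 1000, 70
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, p))
X[:, :200] += 0.3 * y[:, None]        # many weakly informative, similar genes

ref = top_genes(X, y, top)            # ranking on the full sample
overlaps = []
for _ in range(50):
    idx = rng.choice(n, size=50, replace=False)   # a 50-patient subsample
    overlaps.append(len(ref & top_genes(X[idx], y[idx], top)) / top)
print(round(float(np.mean(overlaps)), 2))
```

Because many genes have nearly equal correlations, the top-70 list changes substantially from subsample to subsample, which is the instability the ranking curves in Figure 5 display.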

In this section we investigate the existence of multiple solutions from different viewpoints. First, to check the variability of prediction accuracy across different statistical methods, we apply the van't Veer method [7], the Fisher linear discriminant analysis [20], AdaBoost [21], and AUCBoost [22]. The last two methods are called boosting in the machine learning community, where genes are nonlinearly combined to predict the disease status. Second, we apply hierarchical clustering to examine the heterogeneity of gene expression patterns. Sørlie et al. [23] showed that there exist subtypes of breast cancer for which the patterns of gene expression are clearly different, and the disease statuses differ accordingly. Ein-Dor et al. [19] suggest that the heterogeneity of the subtypes is one reason why there are so many solutions for the prediction, or large fluctuations of the genes selected for the predictions. Hence, we calculate the Biological Homogeneity Index (BHI) to see the clustering performance for various gene sets and examine the existence of the subtypes. Finally, we consider the mutual coherence for the breast cancer data and discuss its relation to the multiple solutions.

The breast cancer data consist of the expression data of 25000 genes for 97 cancer patients. After a filtering procedure, 5420 genes were identified to be significantly regulated, and we focused on this filtered data set. The patients are divided into a training sample (78 patients) and a test sample (19 patients) in the same way as in the original paper [7]. Here we consider classical methods and boosting methods to predict whether a patient has good prognosis (free of disease for at least 5 years after initial diagnosis) or bad prognosis (distant metastases within 5 years). The prediction rule is generated from the training sample, and we measure the prediction accuracy on the test sample.

3.1. Classification. We briefly introduce the statistical methods used for the prediction of the disease status. The first two methods are linear discriminant functions, where the genes are linearly combined to classify the patients into a good prognosis group or a bad prognosis group. The last two are boosting methods, where the genes are nonlinearly combined to generate the discriminant functions.

3.1.1. The van't Veer Method. Let $y$ be a class label indicating disease status, such as $y = -1$ (good prognosis) and $y = 1$ (metastases), and let $\mathbf{x} = (x_1, \ldots, x_p)^T$ be a $p$-dimensional covariate such as gene expression. We denote the samples of $y = -1$ and $y = 1$ as $\mathbf{x}_i^-$, $i = 1, \ldots, n_-$, and $\mathbf{x}_j^+$, $j = 1, \ldots, n_+$, respectively, and the mean value of the patients with good prognosis as

$$\bar{\mathbf{x}}^- = (\bar{x}_1^-, \ldots, \bar{x}_p^-)^T = \frac{1}{n_-} \sum_{i=1}^{n_-} \mathbf{x}_i^-. \quad (1)$$

Then van't Veer et al. [7] proposed a discriminant function $F(\mathbf{x})$ based on the correlation to the average good prognosis profile above, which is given as

$$F(\mathbf{x}) = -\frac{\sum_{k=1}^{p} (x_k - \bar{x}) (\bar{x}_k^- - \bar{x}^-)}{\sqrt{\sum_{k=1}^{p} (x_k - \bar{x})^2} \sqrt{\sum_{k=1}^{p} (\bar{x}_k^- - \bar{x}^-)^2}}, \quad (2)$$


where

$$\bar{x} = \frac{1}{p} \sum_{k=1}^{p} x_k, \qquad \bar{x}^- = \frac{1}{p} \sum_{k=1}^{p} \bar{x}_k^-. \quad (3)$$

If the value of $F(\mathbf{x})$ is smaller than a predefined threshold value, then the patient is judged to have good prognosis ($y = -1$), and otherwise to have metastases ($y = 1$). This method is called one-class classification, because it focuses on only one class label's information to predict the disease status $y$. This idea is also employed in the machine learning community; see Yousef et al. [24] and Gardner et al. [25] for applications in biology and medicine.
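A direct transcription of (1)-(3) might read as follows (numpy; the threshold and the synthetic profiles are illustrative, whereas in the paper the threshold is chosen on the training sample):

```python
import numpy as np

def vant_veer_score(x, good_profile):
    """F(x) in (2): minus the Pearson correlation between a patient's expression
    profile x and the average good-prognosis profile; both are centered by their
    own means over the p genes, as in (3)."""
    xc = x - x.mean()
    gc = good_profile - good_profile.mean()
    return -float(xc @ gc / (np.linalg.norm(xc) * np.linalg.norm(gc)))

def predict(x, good_profile, threshold=0.0):
    # y = -1 (good prognosis) if F(x) is below the threshold, else y = +1
    return -1 if vant_veer_score(x, good_profile) < threshold else 1

rng = np.random.default_rng(4)
good = rng.normal(size=(30, 70))          # synthetic good-prognosis training profiles
profile = good.mean(axis=0)               # the average profile, as in (1)
x_like = profile + 0.2 * rng.normal(size=70)    # resembles the good profile
x_unlike = -profile + 0.2 * rng.normal(size=70)
print(predict(x_like, profile), predict(x_unlike, profile))
```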

3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that $\mathbf{x}$ is distributed as $N(\boldsymbol{\mu}_-, \Sigma_-)$ for $y = -1$ and as $N(\boldsymbol{\mu}_+, \Sigma_+)$ for $y = 1$. Then, if $\Sigma_- = \Sigma_+ = \Sigma$, the estimated log-likelihood ratio is given as

$$F(\mathbf{x}) = (\hat{\boldsymbol{\mu}}_+ - \hat{\boldsymbol{\mu}}_-)^T \hat{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \hat{\boldsymbol{\mu}}_+^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_+ + \frac{1}{2} \hat{\boldsymbol{\mu}}_-^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_- + \log \frac{n_+}{n_-}, \quad (4)$$

where $\hat{\boldsymbol{\mu}}_+$, $\hat{\boldsymbol{\mu}}_-$, and $\hat{\Sigma}$ are the sample means for $y \in \{-1, 1\}$ and the total sample variance, respectively. For simplicity we take the diagonal part of $\hat{\Sigma}$ (Diagonal Linear Discriminant Analysis) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used in a situation where $p$ is much larger than $n$ ($= n_- + n_+$); in that case the inverse of $\hat{\Sigma}$ cannot be calculated.
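A sketch of the diagonal version of (4), where only the diagonal of $\hat{\Sigma}$ is estimated so that the inverse exists even when $p \gg n$ (synthetic data; the class prior enters through the $\log(n_+/n_-)$ term):

```python
import numpy as np

def dlda_fit(X, y):
    """Per-class means, the diagonal of the total sample variance, and the
    log(n_+/n_-) prior term from (4)."""
    pos, neg = X[y == 1], X[y == -1]
    var = X.var(axis=0, ddof=1)           # diagonal entries only, so p >> n is fine
    return pos.mean(0), neg.mean(0), var, np.log(len(pos) / len(neg))

def dlda_score(x, mu_p, mu_n, var, log_prior):
    """Log-likelihood ratio (4) with the diagonal covariance; > 0 predicts y = +1."""
    return float((mu_p - mu_n) / var @ x
                 - 0.5 * mu_p @ (mu_p / var)
                 + 0.5 * mu_n @ (mu_n / var)
                 + log_prior)

rng = np.random.default_rng(5)
p = 200                                   # more genes than patients, as in microarrays
shift = np.zeros(p)
shift[:10] = 2.0                          # ten informative genes
Xp = rng.normal(size=(40, p)) + shift
Xn = rng.normal(size=(40, p)) - shift
X = np.vstack([Xp, Xn])
y = np.repeat([1, -1], 40)
params = dlda_fit(X, y)
acc = np.mean([(1 if dlda_score(x, *params) > 0 else -1) == yi
               for x, yi in zip(X, y)])
print(acc)                                # high in-sample accuracy
```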

3.1.3. AdaBoost. We introduce a famous boosting method in the machine learning community. The key concept of boosting is to construct a powerful discriminant function $F(\mathbf{x})$ by combining various weak classifiers $f(\mathbf{x})$ [26]. We employ a set $\mathcal{F}$ of decision stumps as a dictionary of weak classifiers. Here the decision stump for the $k$th gene expression $x_k$ is defined as a simple step function:

$$f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases} \quad (5)$$

where $b_k$ is a threshold value. Accordingly, $f_k(\mathbf{x})$ is the simplest classifier in the sense that it neglects all information of the gene expression patterns other than that of the single gene $x_k$. However, by varying the value of $b_k$ for all genes ($k = 1, \ldots, p$), we obtain an $\mathcal{F}$ that contains exhaustive information about the gene expression patterns. We attempt to build a good discriminant function $F$ by combining decision stumps in $\mathcal{F}$.

The typical method is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss for a discriminant function $F$:

$$L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}, \quad (6)$$

where the entire data set is given as $\{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$. The exponential loss is sequentially minimized by the following algorithm.

where the entire data set is given as (x119894 119910119894) | 119894 = 1 119899The

exponential loss is sequentially minimized by the followingalgorithm

(1) Initialize the weight of 119909119894for 119894 = 1 119899 as 119908

1(119894) =

1119899

(2) For the iteration number 119905 = 1 119879

(a) choose 119891119905as

119891119905= argmin119891isinF

120598119905(119891) (7)

where

120598119905(119891) =

119899

sum

119894=1

119868 (119891 (x119894) = 119910119894)

119908119905(119894)

sum119899

119894=1119908119905(119894) (8)

and 119868 is the indicator function(b) calculate the coefficient as

120573119905=1

2log

1 minus 120598119905(119891119905)

120598119905(119891119905) (9)

(c) update the weight as

119908119905+1(119894) = 119908

119905(119894) exp minus119910

119894120573119905119891119905(x119894) (10)

(3) output the final function as 119865(x) = sum119879119905=1120573119905119891119905(x)

Here we have

$$L_{\exp}(F + \beta f) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \exp\{-\beta y_i f(\mathbf{x}_i)\} \quad (11)$$

$$= \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \left[ e^{\beta} I(f(\mathbf{x}_i) \neq y_i) + e^{-\beta} I(f(\mathbf{x}_i) = y_i) \right] \quad (12)$$

$$= L_{\exp}(F) \left\{ e^{\beta} \epsilon(f) + e^{-\beta} (1 - \epsilon(f)) \right\} \quad (13)$$

$$\ge 2 L_{\exp}(F) \sqrt{\epsilon(f) (1 - \epsilon(f))}, \quad (14)$$

where $\epsilon(f) = \frac{1}{n} \sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i) \exp\{-y_i F(\mathbf{x}_i)\} / L_{\exp}(F)$, and the equality in (14) is attained if and only if

$$\beta = \frac{1}{2} \log \frac{1 - \epsilon(f)}{\epsilon(f)}. \quad (15)$$

6 Computational and Mathematical Methods in Medicine

Hence we can see that the exponential loss is sequentially minimized in the algorithm. It is easily shown that
$$\epsilon_{t+1}(f_t) = \frac{1}{2}. \qquad (16)$$
That is, the weak classifier $f_t$ chosen at step $t$ is the worst element in $\mathcal{F}$ in the sense of the weighted error rate at step $t + 1$.
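Property (16) is easy to verify numerically. The following toy check, with arbitrary made-up labels, stump predictions, and weights, applies the update (10) and recomputes the weighted error (8) for the same classifier:

```python
import math

# Toy check of (16): after the reweighting (10), the weighted error (8)
# of the classifier f_t that was just chosen rises to exactly 1/2.
y    = [ 1,  1,  1, -1]          # labels
pred = [ 1, -1,  1, -1]          # predictions f_t(x_i) of the chosen stump
w    = [0.1, 0.2, 0.3, 0.4]      # current weights w_t(i)

def weighted_error(pred, y, w):
    return sum(wi for wi, p, yi in zip(w, pred, y) if p != yi) / sum(w)

eps  = weighted_error(pred, y, w)              # epsilon_t(f_t) = 0.2
beta = 0.5 * math.log((1 - eps) / eps)         # eq. (9)
w_next = [wi * math.exp(-yi * beta * p)        # eq. (10)
          for wi, p, yi in zip(w, pred, y)]
print(weighted_error(pred, y, w_next))         # prints 0.5
```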

3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion consists of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples $\mathbf{x}^-_i$, $i = 1, \ldots, n^-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n^+$, is given as
$$\mathrm{AUC}(F) = \frac{1}{n^- n^+}\sum_{i=1}^{n^-}\sum_{j=1}^{n^+} H\left(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\right), \qquad (17)$$
where $H(z)$ is the Heaviside function: $H(z) = 1$ if $z \geq 0$ and $H(z) = 0$ otherwise. To avoid the difficulty of maximizing this nondifferentiable function, an approximate AUC is considered by Komori [28]:
$$\mathrm{AUC}_{\sigma}(F) = \frac{1}{n^- n^+}\sum_{i=1}^{n^-}\sum_{j=1}^{n^+} H_{\sigma}\left(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\right), \qquad (18)$$
where $H_{\sigma}(z) = \Phi(z/\sigma)$, with $\Phi$ being the standard normal distribution function. A smaller scale parameter $\sigma$ means a better approximation of the Heaviside function $H(z)$. Based on this approximation, a boosting method for the maximization of the AUC as well as the partial AUC is proposed by Komori and Eguchi [22], in which they consider the following objective function:

$$\mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n^- n^+}\sum_{i=1}^{n^-}\sum_{j=1}^{n^+} H_{\sigma}\left(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\right) - \lambda\sum_{k=1}^{p}\int F''_k(x_k)^2\, dx_k, \qquad (19)$$
where $F''_k(x_k)$ is the second derivative of the $k$th component of $F(\mathbf{x})$ and $\lambda$ is a smoothing parameter that controls the smoothness of $F(\mathbf{x})$. Here the set of weak classifiers is given as
$$\mathcal{F} = \left\{ f(\mathbf{x}) = \frac{N_{kl}(x_k)}{Z_{kl}} \;\Big|\; k = 1, 2, \ldots, p;\; l = 1, 2, \ldots, m_k \right\}, \qquad (20)$$
where $N_{kl}$ is a basis function for representing natural cubic splines and $Z_{kl}$ is a standardization factor. Then the relationship
$$\max_{\sigma,\lambda,F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda,F} \mathrm{AUC}_{1,\lambda}(F) \qquad (21)$$
allows us to fix $\sigma = 1$ without loss of generality, and we have the following algorithm.

(1) Start with $F_0(\mathbf{x}) = 0$.

(2) For $t = 1, \ldots, T$:

(a) update $\beta_{t-1}(f)$ to $\beta_t(f)$ with a one-step Newton-Raphson iteration;

(b) find the best weak classifier $f_t$:
$$f_t = \operatorname*{argmax}_{f} \mathrm{AUC}_{1,\lambda}\left(F_{t-1} + \beta_t(f) f\right); \qquad (22)$$

(c) update the score function as
$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t) f_t(\mathbf{x}). \qquad (23)$$

(3) Output $F(\mathbf{x}) = \sum_{t=1}^{T}\beta_t(f_t) f_t(\mathbf{x})$.

The value of the smoothing parameter $\lambda$ and the iteration number $T$ are determined at the same time by cross validation.
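The empirical AUC (17) and its smoothed version (18) are straightforward to compute. The sketch below uses our own helper names, with $\Phi$ built from the error function, and illustrates that $\mathrm{AUC}_{\sigma}$ approaches the empirical AUC as $\sigma$ shrinks:

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def auc_empirical(neg, pos):
    """Eq. (17), with the Heaviside convention H(0) = 1.
    neg, pos: scores F(x) of the negative and positive samples."""
    return sum(1.0 for xm in neg for xp in pos if xp - xm >= 0) \
        / (len(neg) * len(pos))

def auc_sigma(neg, pos, sigma):
    """Eq. (18): Heaviside replaced by the smooth H_sigma(z) = Phi(z/sigma)."""
    return sum(Phi((xp - xm) / sigma) for xm in neg for xp in pos) \
        / (len(neg) * len(pos))
```

With `neg = [0.1, 0.4]` and `pos = [0.3, 0.9]`, `auc_empirical` gives 0.75; `auc_sigma` with a small scale such as 0.001 is numerically indistinguishable from it, while a large scale washes the criterion out toward 0.5.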

3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage was applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters $\mathcal{C} = \{C_1, \ldots, C_K\}$ and the biological categories or subtypes $\mathcal{B} = \{B_1, \ldots, B_L\}$:
$$\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n_k(n_k - 1)}\sum_{\substack{i \neq j \\ i,j \in C_k}} I\left(B^{(i)} = B^{(j)}\right), \qquad (24)$$
where $B^{(i)} \in \mathcal{B}$ is the subtype of subject $i$ and $n_k$ is the number of subjects in $C_k$. This index is bounded above by 1, which corresponds to perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related to the prognosis of the patients [23].
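Equation (24) translates directly into code. The sketch below uses our own function and argument names; it takes a clustering (lists of subject ids) and a subject-to-subtype map, and skips singleton clusters, for which the pair count in (24) is undefined:

```python
def bhi(clusters, subtype):
    """Biological homogeneity index, eq. (24).
    clusters: list of lists of subject ids; subtype: dict id -> label."""
    total = 0.0
    for C in clusters:
        nk = len(C)
        if nk < 2:
            continue  # no ordered pairs in a singleton cluster
        # ordered pairs (i, j), i != j, sharing the same subtype
        same = sum(1 for i in C for j in C
                   if i != j and subtype[i] == subtype[j])
        total += same / (nk * (nk - 1))
    return total / len(clusters)
```

For example, with clusters `[[0, 1, 2], [3, 4]]` and subtypes A, A, B, B, B, the first cluster contributes 1/3 and the second 1, giving BHI = 2/3.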

3.3. Mutual Coherence. Now we have a data matrix $X$ with $n$ rows ($n$ patients) and $p$ columns ($p$ genes). We assume an $n$-dimensional vector $\mathbf{b}$ indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis; the magnitude of $\mathbf{b}$ denotes the level of disease status. Then the optimal linear solution $\boldsymbol{\beta} \in \mathbb{R}^p$ should satisfy
$$X\boldsymbol{\beta} = \mathbf{b}. \qquad (25)$$
Note that if $p$ is much larger than $n$, then only a few elements of $\boldsymbol{\beta}$ should be nonzero and the others zero, which means $\boldsymbol{\beta}$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as
$$\mu(X) = \max_{\substack{1 \leq i, j \leq p \\ i \neq j}} \frac{\left|\mathbf{x}_i^T \mathbf{x}_j\right|}{\|\mathbf{x}_i\|_2\, \|\mathbf{x}_j\|_2}, \qquad (26)$$
where $\mathbf{x}_i \in \mathbb{R}^n$ denotes the $i$th column of $X$, that is, the $i$th gene in the breast cancer data, $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix $X$ and is bounded as $0 \leq \mu(X) \leq 1$. The next theorem shows the relationship between the sparsity of $\boldsymbol{\beta}$ and the data matrix $X$.

Theorem 1 (Elad [30]). If $X\boldsymbol{\beta} = \mathbf{b}$ has a solution $\boldsymbol{\beta}$ obeying $\|\boldsymbol{\beta}\|_0 < (1 + 1/\mu(X))/2$, this solution is necessarily the sparsest possible, where $\|\boldsymbol{\beta}\|_0$ denotes the number of nonzero components of $\boldsymbol{\beta}$.

This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \boldsymbol{\beta}^T\mathbf{x}$ could have a sparse solution for predicting the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then $\|\boldsymbol{\beta}\|_0$ could become approximately 1, indicating that just one gene (a column of $X$) could be enough to predict the disease status if there were a solution $\boldsymbol{\beta}$. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in classification problems, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.
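The mutual coherence (26) is simply the largest off-diagonal entry of the absolute Gram matrix of the column-normalized data. A NumPy sketch (our own function name, assuming no zero columns):

```python
import numpy as np

def mutual_coherence(X):
    """mu(X) = max_{i != j} |x_i^T x_j| / (||x_i||_2 ||x_j||_2),
    taken over the columns x_i of X, as in eq. (26)."""
    Xn = X / np.linalg.norm(X, axis=0)   # normalize each column to unit norm
    G = np.abs(Xn.T @ Xn)                # absolute Gram matrix
    np.fill_diagonal(G, 0.0)             # exclude the i == j terms
    return G.max()
```

An orthogonal design has coherence 0, while nearly collinear columns push the coherence toward 1; for a 2 x 3 matrix with columns (1, 0), (1, 1), (0, 1), the maximum is 1/sqrt(2).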

3.4. Results. We prepare various data sets from a training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7] in order to investigate the existence of multiple solutions for the prediction of disease status. The data sets are given by
$$\mathcal{D} = \{D_{1\text{-}70}, D_{6\text{-}75}, \ldots, D_{161\text{-}230}\}. \qquad (27)$$
The first data set $D_{1\text{-}70}$ consists of the 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is based on the correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions $F(\mathbf{x})$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The classification performance of the van't Veer method and DLDA is illustrated in Figure 1; that of AdaBoost and AUCBoost in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. With regard to the performance on the training data, denoted by the solid lines, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, with an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
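The sliding gene sets in (27) can be generated mechanically. The step of 5 between consecutive windows is inferred from the listed endpoints (1-70, 6-75, ..., 161-230) and is our reading, not stated explicitly here:

```python
def gene_windows(n_genes=230, width=70, step=5):
    """Index windows D_{1-70}, D_{6-75}, ..., D_{161-230} over the ranked
    gene list; 1-based inclusive (start, end) pairs."""
    return [(s, s + width - 1) for s in range(1, n_genes - width + 2, step)]
```

This yields 33 overlapping windows of 70 genes each, the first being the MammaPrint set itself.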

We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1\text{-}70}$, $D_{11\text{-}80}$, and $D_{111\text{-}180}$, which are shown in Figure 3. Each row represents one of the 70 genes and each column one of the 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1\text{-}70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11\text{-}80}$ shows patterns different from the others. There are mainly two clusters, characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the Basal-like or triple-negative subtype, though the Her2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER negative shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set $D_{111\text{-}180}$ has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that in $D_{1\text{-}70}$. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.

Next, we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with 5420 gene expression patterns. We then take the top 70 genes ranked by the correlation coefficient of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performance of DLDA over 100 replications (DLDA shows the most stable prediction accuracy, as seen in Figure 1), based on the randomly selected 50 patients, is shown in Figure 6, where the vertical axes are AUC (left panels) and error rate (right panels) and the genes are ranked by the correlation (a) and by the AUC (b). The performance is clearly worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of the gene expression pattern may come from the several subtypes
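The resampling experiment can be sketched as follows. The data here are synthetic and the function names are our own, but the procedure (subsample patients, rank genes by absolute correlation with the disease status, record top-list membership) mirrors the one described above, in miniature:

```python
import random
from statistics import mean

def pearson(a, b):
    """Pearson correlation; returns 0 for a constant sequence."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a)
           * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den > 0 else 0.0

def top_rate(expr, status, top, size, trials, seed=1):
    """expr: dict gene -> expression values over patients; status: list of
    +1/-1 disease labels. Returns gene -> fraction of random subsamples in
    which the gene ranks in the top `top` by |correlation| with the status."""
    rng = random.Random(seed)
    counts = {g: 0 for g in expr}
    for _ in range(trials):
        idx = rng.sample(range(len(status)), size)   # subsample patients
        score = {g: abs(pearson([v[i] for i in idx],
                                [status[i] for i in idx]))
                 for g, v in expr.items()}
        for g in sorted(score, key=score.get, reverse=True)[:top]:
            counts[g] += 1
    return {g: c / trials for g, c in counts.items()}
```

A gene whose expression tracks the status perfectly is selected in every subsample, while noise genes enter the top list only sporadically, which is exactly the instability seen in Figure 5.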


[Figure 1: Results of the classical methods. (a) van't Veer method and (b) DLDA: AUC values (left panel) and error rate (right panel) over the data sets D (1-70, 41-120, 81-150, 121-190, 161-230), for training and test data.]

as suggested in the cluster analysis, and this would make it difficult to select useful genes for prediction.

Finally, we calculate the mutual coherence based on the genes in MammaPrint and on the total 5420 genes (Figure 7). The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM_014889 and NM_014968, which correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone


[Figure 2: Results of the boosting methods. (a) AdaBoost and (b) AUCBoost: AUC values (left panel) and error rate (right panel) over the data sets D (1-70, 41-120, 81-150, 121-190, 161-230), for training and test data.]

to select genes whose expression patterns are similar to each other. This would be one reason why we obtain multiple solutions after ranking based on the correlation coefficients. It is also interesting that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.

4. Discussion and Concluding Remarks

In this paper we have addressed an important classification issue for microarray data. Our results show the existence of multiple suboptimal solutions for the choice of classification predictors. The classification results showed that the performance of several gene sets was almost the same. This behavior was confirmed for the van't Veer method, DLDA, AdaBoost, and AUCBoost. Remarkably, non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and


[Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients. (a) D_{1-70} (MammaPrint), (b) D_{11-80}, clearly showing some subtypes, and (c) D_{111-180}, with the highest BHI regarding metastases. Blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.]

the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize samples into the known subtypes, but taking subtypes into account in feature selection could be future work.


[Figure 4: Biological homogeneity index (BHI) over the data sets D for metastases (solid line), ER positive (dashed line), and PR positive (dotted line).]

[Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.]


[Figure 6: The values of AUC (left panel) and error rate (right panel) calculated by DLDA using the genes ranked in the top 70 by the correlation coefficients (a) and by AUC (b). These values are based on randomly sampled 50 patients over 100 trials.]

Microarray technology has become commonplace, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy that is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.

One approach to tackling the high dimensionality of the data is to apply statistical methods in consideration of background knowledge about the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patients' prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5 and suggested by


[Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint (NM_014889 versus NM_014968). (b) The distribution of the correlations for the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the gene pairs on which the correlations are calculated, standardized to between 0 and 1 for a clear view.]

Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which uses the whole information of all genes, would be useful to capture the mutual relations among genes and to identify subgroups of genes informative for the prediction problem. The combination of unsupervised and supervised learning is a promising way to overcome the difficulty involved in high-dimensional data.

We have addressed the challenges and difficulties in the pattern recognition of gene expression. The main problem is caused by high-dimensional data sets. This is a problem not only for microarray data but also for RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] presented clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs with RNA-seq data, whose dimension is also quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data suffer from the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.

Conflict of Interests

The authors have declared that no conflict of interests exists

References

[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.

[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.

[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.

[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.

[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.

[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.

[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.

[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.

[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.

[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301–320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.



value. Advantages of filter methods are easy implementation, computational simplicity, and speed. Filter methods are independent of the classification method, so different classifiers can be used. Two disadvantages of filter methods are that they ignore the interaction with the classification model and that most of the proposed methods are univariate.

Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search, as in sequential backward selection [16]. Wrapper methods search over feature subsets, and the feature subset space grows exponentially with the number of features. One advantage of wrapper methods is the interaction between the feature subset search and model selection. Computational burden is one of the disadvantages of this approach, especially if building the classifier has a high computational cost.

The third class of feature selection is embedded techniques. Like the wrapper methods, embedded techniques search for an optimal subset of features during classifier construction; an example is SVM-RFE, proposed by Guyon et al. [17]. Thus embedded approaches are specific to a given learning algorithm. The difference from wrapper methods is that embedded techniques are guided by the learning process. Whereas wrapper methods search all possible combinations of gene sets, embedded techniques search for the combination based on a criterion. This enables a reduction in computational burden.

Besides these methods, the idea of sparseness was recently introduced in some feature selection methods. One approach is to use penalties in the regression context. Tibshirani [3] proposed the lasso, which uses an $L_1$-norm penalty. The combination of the $L_1$ penalty and the $L_2$ penalty is called the elastic net [18]. These methods focus on reducing features to avoid overfitting and to improve interpretability, as biologists expect to obtain biological insight from selected features.

However, these sparseness ideas do not take into account multiple solutions in one data set. When the data dimension is in the thousands or millions, there are multiple possible solutions. Sorting genes based on some criterion and then selecting a subset from the top is not always the best selection. We elaborate on multiple solutions in the following section and give some idea of how to select the optimal solution. Here we refer to an optimal solution as a prediction rule with high classification accuracy for various data sets.

3. Multiple Solutions for Prediction Rules

The existence of multiple solutions for prediction of disease status based on breast cancer data [7] was shown by Ein-Dor et al. [19], who suggest three reasons for this problem. The first is that there are many genes correlated with disease status; the second is that the differences of the correlation among the genes are very small; the last is that the value of the correlation is very sensitive to the sample used for its calculation. In the paper they demonstrate gene ranking based on the correlation and show that there exist many equally predictive gene sets.

In this section we investigate the existence of multiple solutions from different viewpoints. First, to check the variability of prediction accuracy across different statistical methods, we apply the van't Veer method [7], Fisher linear discriminant analysis [20], AdaBoost [21], and AUCBoost [22]. The last two methods are called boosting in the machine learning community, where genes are nonlinearly combined to predict the disease status. Second, we apply hierarchical clustering to examine the heterogeneity of gene expression patterns. Sørlie et al. [23] showed that there exist subtypes of breast cancer for which the patterns of gene expression are clearly different and the disease statuses differ accordingly. Ein-Dor et al. [19] suggest that the heterogeneity of the subtypes is one reason why there are so many solutions for the prediction, or such large fluctuations of the genes selected for the predictions. Hence we calculate the Biological Homogeneity Index (BHI) to see the clustering performance for various gene sets and to examine the existence of the subtypes. Finally, we consider the mutual coherence for the breast cancer data and discuss its relation to the multiple solutions.

The breast cancer data consist of the expression data of 25000 genes and 97 cancer patients. After a filtering procedure, 5420 genes were identified as significantly regulated, and we focused on this filtered data. The patients are divided into a training sample (78 patients) and a test sample (19 patients) in the same way as in the original paper [7]. Here we consider classical methods and boosting methods to predict whether the patient has good prognosis (free of disease at least 5 years after initial diagnosis) or bad prognosis (distant metastases within 5 years). The prediction rule is generated from the training sample, and we measure the prediction accuracy on the test sample.

3.1. Classification. We briefly introduce the statistical methods used for the prediction of the disease status. The first two methods are linear discriminant functions, where the genes are linearly combined to classify the patients into a good prognosis group or a bad prognosis group. The last two are boosting methods, where the genes are nonlinearly combined to generate the discriminant functions.

3.1.1. The van't Veer Method. Let $y$ be a class label indicating disease status, such as $y = -1$ (good prognosis) and $y = 1$ (metastases), and let $\mathbf{x} = (x_1, \ldots, x_p)^T$ be a $p$-dimensional covariate such as gene expression. We denote the samples of $y = -1$ and $y = 1$ as $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, respectively, and the mean vector of the patients with good prognosis as

$$\bar{\mathbf{x}}^- = (\bar{x}^-_1, \ldots, \bar{x}^-_p)^T = \frac{1}{n_-} \sum_{i=1}^{n_-} \mathbf{x}^-_i. \qquad (1)$$

Then van't Veer et al. [7] proposed a discriminant function $F(\mathbf{x})$ based on the correlation to the average good-prognosis profile above, which is given as

$$F(\mathbf{x}) = -\frac{\sum_{k=1}^{p} (x_k - \bar{x})(\bar{x}^-_k - \bar{x}^-_\cdot)}{\sqrt{\sum_{k=1}^{p} (x_k - \bar{x})^2}\,\sqrt{\sum_{k=1}^{p} (\bar{x}^-_k - \bar{x}^-_\cdot)^2}}, \qquad (2)$$

where

$$\bar{x} = \frac{1}{p} \sum_{k=1}^{p} x_k, \qquad \bar{x}^-_\cdot = \frac{1}{p} \sum_{k=1}^{p} \bar{x}^-_k. \qquad (3)$$

If the value of $F(\mathbf{x})$ is smaller than a predefined threshold value, then the patient is judged to have good prognosis ($y = -1$); otherwise, to have metastases ($y = 1$). This method is called one-class classification, because it focuses on the information of only one class label to predict the disease status $y$. This idea is also employed in the machine learning community; see Yousef et al. [24] and Gardner et al. [25] for applications in biology and medicine.
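As a concrete sketch (our own illustration, not the authors' code; the zero default threshold is an assumption for the example), the rule in (2) computes $F(\mathbf{x})$ as the negative Pearson correlation with the average good-prognosis profile and compares it with a threshold:

```python
import numpy as np

def vant_veer_score(x, good_profile):
    """F(x) in equation (2): the negative Pearson correlation between
    a sample x and the average good-prognosis profile."""
    xc = x - x.mean()
    gc = good_profile - good_profile.mean()
    return -np.sum(xc * gc) / (np.sqrt(np.sum(xc**2)) * np.sqrt(np.sum(gc**2)))

def classify(x, good_profile, threshold=0.0):
    """y = -1 (good prognosis) if the score is below the threshold,
    y = +1 (metastases) otherwise."""
    return -1 if vant_veer_score(x, good_profile) < threshold else 1
```

A sample identical to the good-prognosis profile has correlation 1, hence score $-1$, and is classified as good prognosis.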

3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that $\mathbf{x}$ is distributed as $N(\boldsymbol{\mu}_-, \Sigma_-)$ for $y = -1$ and as $N(\boldsymbol{\mu}_+, \Sigma_+)$ for $y = 1$. Then, if $\Sigma_- = \Sigma_+ = \Sigma$, the estimated log-likelihood ratio is given as

$$F(\mathbf{x}) = (\hat{\boldsymbol{\mu}}_+ - \hat{\boldsymbol{\mu}}_-)^T \hat{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \hat{\boldsymbol{\mu}}_+^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_+ + \frac{1}{2} \hat{\boldsymbol{\mu}}_-^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_- + \log \frac{n_+}{n_-}, \qquad (4)$$

where $\hat{\boldsymbol{\mu}}_+$, $\hat{\boldsymbol{\mu}}_-$, and $\hat{\Sigma}$ are the sample means for $y \in \{-1, 1\}$ and the total sample variance, respectively. For simplicity we take the diagonal part of $\hat{\Sigma}$ (Diagonal Linear Discriminant Analysis) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used in a situation where $p$ is much larger than $n$ $(= n_- + n_+)$; in that case the inverse of $\hat{\Sigma}$ cannot be calculated.
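A minimal sketch of the DLDA rule (our own illustration; the function names are hypothetical). Keeping only the diagonal of the covariance turns the matrix inverse into an elementwise division, which stays well defined even when $p \gg n$:

```python
import numpy as np

def dlda_fit(X_neg, X_pos):
    """Estimate the quantities in equation (4), keeping only the diagonal
    of the total sample covariance so that it is invertible even when the
    number of genes p exceeds the number of patients n."""
    mu_n, mu_p = X_neg.mean(axis=0), X_pos.mean(axis=0)
    var = np.vstack([X_neg, X_pos]).var(axis=0, ddof=1)  # diagonal of Sigma-hat
    prior = np.log(len(X_pos) / len(X_neg))              # log(n+ / n-)
    return mu_n, mu_p, var, prior

def dlda_score(x, params):
    """Log-likelihood ratio F(x); positive values favor y = +1 (metastases)."""
    mu_n, mu_p, var, prior = params
    inv = 1.0 / var                                      # elementwise "inverse"
    return ((mu_p - mu_n) * inv) @ x \
        - 0.5 * (mu_p * inv) @ mu_p + 0.5 * (mu_n * inv) @ mu_n + prior
```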

3.1.3. AdaBoost. We introduce a famous boosting method in the machine learning community. The key concept of boosting is to construct a powerful discriminant function $F(\mathbf{x})$ by combining various weak classifiers $f(\mathbf{x})$ [26]. We employ a set $\mathcal{F}$ of decision stumps as a dictionary of weak classifiers. Here the decision stump for the $k$th gene expression $x_k$ is defined as a simple step function:

$$f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases} \qquad (5)$$

where $b_k$ is a threshold value. Accordingly, $f_k(\mathbf{x})$ is the simplest classifier in the sense that it neglects all information of the gene expression patterns other than that of the single gene $x_k$. However, by varying the value of $b_k$ for all genes $(k = 1, \ldots, p)$, we obtain an $\mathcal{F}$ that contains exhaustive information of gene expression patterns. We attempt to build a good discriminant function $F$ by combining decision stumps in $\mathcal{F}$.

The typical method is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss for a discriminant function $F$:

$$L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}, \qquad (6)$$

where the entire data set is given as $\{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$. The exponential loss is sequentially minimized by the following algorithm.

(1) Initialize the weight of $\mathbf{x}_i$ for $i = 1, \ldots, n$ as $w_1(i) = 1/n$.

(2) For the iteration number $t = 1, \ldots, T$:

(a) choose $f_t$ as

$$f_t = \operatorname*{argmin}_{f \in \mathcal{F}} \epsilon_t(f), \qquad (7)$$

where

$$\epsilon_t(f) = \sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i) \frac{w_t(i)}{\sum_{i'=1}^{n} w_t(i')} \qquad (8)$$

and $I$ is the indicator function;

(b) calculate the coefficient as

$$\beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)}; \qquad (9)$$

(c) update the weight as

$$w_{t+1}(i) = w_t(i) \exp\{-y_i \beta_t f_t(\mathbf{x}_i)\}. \qquad (10)$$

(3) Output the final function as $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t f_t(\mathbf{x})$.

Here we have

$$L_{\exp}(F + \beta f) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \exp\{-\beta y_i f(\mathbf{x}_i)\} \qquad (11)$$

$$= \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \left[ e^{\beta} I(f(\mathbf{x}_i) \neq y_i) + e^{-\beta} I(f(\mathbf{x}_i) = y_i) \right] \qquad (12)$$

$$= L_{\exp}(F) \left\{ e^{\beta} \epsilon(f) + e^{-\beta} (1 - \epsilon(f)) \right\} \qquad (13)$$

$$\ge 2 L_{\exp}(F) \sqrt{\epsilon(f)(1 - \epsilon(f))}, \qquad (14)$$

where $\epsilon(f) = (1/n) \sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i) \exp\{-y_i F(\mathbf{x}_i)\} / L_{\exp}(F)$, and the equality in (14) is attained if and only if

$$\beta = \frac{1}{2} \log \frac{1 - \epsilon(f)}{\epsilon(f)}. \qquad (15)$$


Hence we can see that the exponential loss is sequentially minimized by the algorithm. It is easily shown that

$$\epsilon_{t+1}(f_t) = \frac{1}{2}. \qquad (16)$$

That is, the weak classifier $f_t$ chosen at step $t$ is the worst element in $\mathcal{F}$ in the sense of the weighted error rate at step $t + 1$.
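The algorithm above admits a compact implementation over decision stumps. This is our own sketch, not the authors' code: it searches stumps exhaustively over the observed thresholds and clips the weighted error to keep the logarithm in (9) finite.

```python
import numpy as np

def adaboost_stumps(X, y, T=10):
    """AdaBoost over the decision stumps of equation (5): at each round,
    pick the stump (gene k, threshold b) minimizing the weighted error,
    compute its coefficient beta_t, and reweight the samples."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)
    model = []
    for _ in range(T):
        best = None
        for k in range(p):
            for b in np.unique(X[:, k]):
                pred = np.where(X[:, k] >= b, 1, -1)
                err = w[pred != y].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, k, b, pred)
        err, k, b, pred = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # keep the log finite
        beta = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-y * beta * pred)
        model.append((k, b, beta))
    return model

def adaboost_predict(X, model):
    """Sign of the combined score F(x) = sum_t beta_t * f_t(x)."""
    F = np.zeros(len(X))
    for k, b, beta in model:
        F += beta * np.where(X[:, k] >= b, 1, -1)
    return np.sign(F)
```

Because a stump with weighted error above 1/2 receives a negative coefficient, the one-sided stump in (5) effectively covers both orientations.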

3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion consists of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, is given as

$$\mathrm{AUC}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)), \qquad (17)$$

where $H(z)$ is the Heaviside function: $H(z) = 1$ if $z \ge 0$ and $H(z) = 0$ otherwise. To avoid the difficulty of maximizing the nondifferentiable function above, an approximate AUC is considered by Komori [28] as

$$\mathrm{AUC}_\sigma(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)), \qquad (18)$$

where $H_\sigma(z) = \Phi(z/\sigma)$, with $\Phi$ being the standard normal distribution function. A smaller scale parameter $\sigma$ means a better approximation of the Heaviside function $H(z)$. Based on this approximation, a boosting method for the maximization of the AUC as well as the partial AUC is proposed by Komori and Eguchi [22], in which they consider the following objective function:

$$\mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)) - \lambda \sum_{k=1}^{p} \int F''_k(x_k)^2 \, dx_k, \qquad (19)$$

where $F''_k(x_k)$ is the second derivative of the $k$th component of $F(\mathbf{x})$ and $\lambda$ is a smoothing parameter that controls the smoothness of $F(\mathbf{x})$. Here the set of weak classifiers is given as

$$\mathcal{F} = \left\{ f(\mathbf{x}) = \frac{N_{kl}(x_k)}{Z_{kl}} \;\middle|\; k = 1, 2, \ldots, p;\ l = 1, 2, \ldots, m_k \right\}, \qquad (20)$$

where $N_{kl}$ is a basis function for representing natural cubic splines and $Z_{kl}$ is a standardization factor. Then the relationship

$$\max_{\sigma,\lambda,F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda,F} \mathrm{AUC}_{1,\lambda}(F) \qquad (21)$$

allows us to fix $\sigma = 1$ without loss of generality and gives the following algorithm.

(1) Start with $F_0(\mathbf{x}) = 0$.

(2) For $t = 1, \ldots, T$:

(a) update $\beta_{t-1}(f)$ to $\beta_t(f)$ with a one-step Newton-Raphson iteration;

(b) find the best weak classifier $f_t$:

$$f_t = \operatorname*{argmax}_{f} \mathrm{AUC}_{1,\lambda}(F_{t-1} + \beta_t(f) f); \qquad (22)$$

(c) update the score function as

$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t) f_t(\mathbf{x}). \qquad (23)$$

(3) Output $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t(f_t) f_t(\mathbf{x})$.

The value of the smoothing parameter $\lambda$ and the iteration number $T$ are determined at the same time by cross validation.
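To make the smoothing concrete, here is a small sketch (ours, not from [22]) of the empirical AUC (17) and its smooth approximation (18); the spline-penalized boosting itself is omitted.

```python
import numpy as np
from statistics import NormalDist

def empirical_auc(scores_neg, scores_pos):
    """Equation (17): the fraction of (negative, positive) pairs ranked
    correctly, with ties counted as correct since H(0) = 1."""
    return float(np.mean([[1.0 if sp - sn >= 0 else 0.0
                           for sp in scores_pos] for sn in scores_neg]))

def smoothed_auc(scores_neg, scores_pos, sigma=1.0):
    """Equation (18): the Heaviside step replaced by the normal cdf
    H_sigma(z) = Phi(z / sigma); smaller sigma gives a sharper approximation."""
    Phi = NormalDist().cdf
    return float(np.mean([[Phi((sp - sn) / sigma)
                           for sp in scores_pos] for sn in scores_neg]))
```

As $\sigma \to 0$ the smoothed value approaches the empirical AUC, while any fixed $\sigma > 0$ keeps the objective differentiable in $F$.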

3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage was applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters $\mathcal{C} = \{C_1, \ldots, C_K\}$ and the biological categories or subtypes $\mathcal{B} = \{B_1, \ldots, B_L\}$:

$$\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k(n_k - 1)} \sum_{\substack{i \neq j \\ i,j \in C_k}} I(B(i) = B(j)), \qquad (24)$$

where $B(i) \in \mathcal{B}$ is the subtype of subject $i$ and $n_k$ is the number of subjects in $C_k$. This index is bounded above by 1, which means perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related to the prognosis of the patients [23].
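Equation (24) can be computed directly; this is our own minimal sketch, in which clusters with fewer than two members are skipped (a convention the text does not specify).

```python
def bhi(clusters, labels):
    """Biological Homogeneity Index, equation (24): for each cluster, the
    fraction of ordered pairs of distinct members sharing a biological
    label, averaged over the clusters."""
    total = 0.0
    for members in clusters:
        nk = len(members)
        if nk < 2:
            continue  # pairs undefined for singleton clusters; skipped here
        agree = sum(1 for i in members for j in members
                    if i != j and labels[i] == labels[j])
        total += agree / (nk * (nk - 1))
    return total / len(clusters)
```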

3.3. Mutual Coherence. Now we have a data matrix $X$ with $n$ rows ($n$ patients) and $p$ columns ($p$ genes). We assume an $n$-dimensional vector $\mathbf{b}$ indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis; the magnitude of $\mathbf{b}$ denotes the level of disease status. Then the optimal linear solution $\boldsymbol{\beta} \in \mathbb{R}^p$ should satisfy

$$X\boldsymbol{\beta} = \mathbf{b}. \qquad (25)$$

Note that if $p$ is much larger than $n$, then only a few elements of $\boldsymbol{\beta}$ should be nonzero and the others zero, which means $\boldsymbol{\beta}$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as

$$\mu(X) = \max_{1 \le i, j \le p,\ i \neq j} \frac{|\mathbf{x}_i^T \mathbf{x}_j|}{\|\mathbf{x}_i\|_2 \|\mathbf{x}_j\|_2}, \qquad (26)$$

where $\mathbf{x}_i \in \mathbb{R}^n$ denotes the $i$th column of $X$, or the $i$th gene in the breast cancer data, $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix $X$ and is bounded as $0 \le \mu(X) \le 1$. The next theorem shows the relationship between the sparsity of $\boldsymbol{\beta}$ and the data matrix $X$.

Theorem 1 (Elad [30]). If $X\boldsymbol{\beta} = \mathbf{b}$ has a solution $\boldsymbol{\beta}$ obeying $\|\boldsymbol{\beta}\|_0 < (1 + 1/\mu(X))/2$, this solution is necessarily the sparsest possible, where $\|\boldsymbol{\beta}\|_0$ denotes the number of the nonzero components of $\boldsymbol{\beta}$.

This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{x}$ could have a sparse solution to predict the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then $\|\boldsymbol{\beta}\|_0$ could become approximately 1, indicating that just one gene (a column of $X$) could be enough to predict the disease status if there were a solution $\boldsymbol{\beta}$. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in the classification problems we deal with, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.
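Equation (26) reduces to a scan of the normalized Gram matrix; the following is our own sketch:

```python
import numpy as np

def mutual_coherence(X):
    """Mutual coherence mu(X), equation (26): the largest absolute cosine
    between two distinct columns (genes) of the data matrix X."""
    Xn = X / np.linalg.norm(X, axis=0)   # normalize each column to unit length
    G = np.abs(Xn.T @ Xn)                # |x_i^T x_j| / (||x_i|| ||x_j||)
    np.fill_diagonal(G, 0.0)             # exclude the i = j terms
    return G.max()
```

Orthogonal columns give $\mu(X) = 0$; a duplicated column gives $\mu(X) = 1$.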

3.4. Results. We prepare various data sets from the training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7] in order to investigate the existence of multiple solutions for the prediction of the disease status; they are given by

$$\mathcal{D} = \{D_{1\text{-}70}, D_{6\text{-}75}, \ldots, D_{161\text{-}230}\}. \qquad (27)$$

The first data set, $D_{1\text{-}70}$, consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is on the basis of the correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions $F(\mathbf{x})$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The results of the classification performance of the van't Veer method and DLDA are illustrated in Figure 1; those of AdaBoost and AUCBoost are in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. In regard to the performance based on the training data, denoted by the solid line, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, having an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.

We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1\text{-}70}$, $D_{11\text{-}80}$, and $D_{111\text{-}180}$, which are shown in Figure 3. Each row represents 70 genes and each column represents 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1\text{-}70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11\text{-}80}$ shows different patterns from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the basal-like subtype or the triple negative subtype, though the Her2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER negative shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set $D_{111\text{-}180}$ has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that of $D_{1\text{-}70}$. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.

Next, we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with 5420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performances of DLDA over 100 replications (DLDA shows the most stable prediction accuracy, as seen in Figure 1), based on the randomly selected 50 patients, are shown in Figure 6, where the vertical axes are the AUC (left panels) and the error rate (right panels), and the genes are ranked by the correlation (a) and the AUC (b). The performance is clearly worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of gene expression patterns may come from the several subtypes

Figure 1: Results of classical methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets $\mathcal{D}$ for the van't Veer method and DLDA.

as suggested in the cluster analysis, and it would make it difficult to select useful genes for prediction.

Finally, we calculate the mutual coherence based on the genes in MammaPrint and the total 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM 014889 and NM 014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including the one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone

Figure 2: Results of boosting methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets $\mathcal{D}$ for AdaBoost and AUCBoost.

to select genes such that their gene expression patterns are similar to each other. This would be one reason why we have multiple solutions after the ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.
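The ranking-stability experiment behind Figure 5 can be sketched as follows; this is our own illustration with simplified parameters (the study above used 50 of 78 patients, 5420 genes, the top 70, and 100 trials).

```python
import numpy as np

def top_k_stability(X, y, k=70, trials=100, subsample=50, seed=0):
    """Repeatedly subsample patients, rank genes by the absolute
    correlation of expression with the class label y, and record how
    often each gene lands in the top k."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(trials):
        idx = rng.choice(n, size=subsample, replace=False)
        Xc = X[idx] - X[idx].mean(axis=0)
        yc = y[idx] - y[idx].mean()
        denom = np.sqrt((Xc**2).sum(axis=0)) * np.sqrt((yc**2).sum())
        denom = np.where(denom == 0.0, np.inf, denom)  # guard constant genes
        r = np.abs(Xc.T @ yc) / denom                  # |corr| per gene
        counts[np.argsort(-r)[:k]] += 1
    return counts / trials
```

Genes with selection rates far below 1 in such a run are exactly the unstable members of a signature.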

4. Discussion and Concluding Remarks

In this paper we have addressed an important classification issue in the use of microarray data. Our results demonstrate the existence of multiple suboptimal solutions for the choice of classification predictors. The classification results showed that the performance of several gene sets was practically the same. This behavior was confirmed by the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, non-top-ranked gene sets sometimes showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression pattern of selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and

Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients. (a) $D_{1\text{-}70}$ (MammaPrint); (b) $D_{11\text{-}80}$, clearly showing some subtypes; and (c) $D_{111\text{-}180}$, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.

the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into the known subtypes, but considering subtypes for feature selection could be future work.

Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line).

Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.

Figure 6: The values of AUC (left panel) and error rate (right panel) calculated by DLDA using genes ranked in the top 70 by the correlation coefficients (a) and by the AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.

Microarray technology has become a common technology, and a lot of biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Besides, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not removed by the ranking procedure, and this is the critical pitfall for gene selection in microarray data.

One approach to tackle this problem of high dimensionality is to apply statistical methods in consideration of the background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the eventual outcome of the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of the genes unstable, leading to the multiple solutions for prediction rules. As seen in Figure 5, and as suggested by

Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distributions of the correlations for the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the pairs of genes for which the correlations are calculated, standardized between 0 and 1 for a clear view.

Ein-Dor et al. [19], the gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of all the genes, would be useful to capture the mutual relations among genes and to identify subgroups of genes informative for the prediction problem. The combination of unsupervised learning and supervised learning is a promising way to resolve the difficulty involved in high-dimensional data.

We have addressed the challenges and difficulties regarding the pattern recognition of gene expression. The main problem is caused by high-dimensional data sets. This is not only a problem of microarray data but also of RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarray for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] showed clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: the dimension of RNA-seq data is quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data have the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.

Conflict of Interests

The authors have declared that no conflict of interests exists.

References

[1] E. S. Lander, L. M. Linton, B. Birren, et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.

[2] J. C. Venter, M. D. Adams, E. W. Myers, et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.

[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267–288, 1996.

[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.

[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.

[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.

[7] L. J. van 't Veer, H. Dai, et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.

[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.

14 Computational and Mathematical Methods in Medicine

[10] R. A. Irizarry, B. Hobbs, F. Collin, et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo, et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[13] S. Paik, S. Shak, G. Tang, et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.

[14] F. Cardoso, L. van 't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels, et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang, et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.



where

\[ \bar{x} = \frac{1}{p}\sum_{k=1}^{p} x_k, \qquad \bar{x}_- = \frac{1}{p}\sum_{k=1}^{p} \bar{x}_{-k}. \tag{3} \]

If the value of $F(\mathbf{x})$ is smaller than a predefined threshold value, then the patient is judged to have a good prognosis ($y = -1$); otherwise, the patient is judged to have metastases ($y = 1$). This method is called one-class classification because it focuses on the label information of only one class to predict the disease status $y$. The idea is also employed in the machine learning community; see Yousef et al. [24] and Gardner et al. [25] for applications in biology and medicine.
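Equation (3) supplies the centering terms of a correlation between a new expression profile and the mean good-prognosis profile. A minimal sketch under that reading (the exact form of $F(\mathbf{x})$ is defined earlier in the paper; the NumPy data layout and function names here are illustrative assumptions, not the authors' code):

```python
import numpy as np

def vant_veer_score(x, X_good):
    """Pearson correlation of a new profile x (length p) with the mean
    good-prognosis template; X_good is (n_minus, p), patients in rows."""
    template = X_good.mean(axis=0)       # bar{x}_{-k}, k = 1..p
    xc = x - x.mean()                    # center by bar{x}
    tc = template - template.mean()      # center by bar{x}_-
    return xc @ tc / (np.linalg.norm(xc) * np.linalg.norm(tc))

def classify(x, X_good, threshold):
    # F(x) below the threshold -> good prognosis (y = -1), per the text
    return -1 if vant_veer_score(x, X_good) < threshold else 1
```

The score is a correlation, so it always lies in $[-1, 1]$; only the threshold needs to be tuned on training data.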

3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that $\mathbf{x}$ is distributed as $N(\boldsymbol{\mu}_-, \Sigma_-)$ for $y = -1$ and as $N(\boldsymbol{\mu}_+, \Sigma_+)$ for $y = 1$. Then, if $\Sigma_- = \Sigma_+ = \Sigma$, the estimated log-likelihood ratio is given as

\[ F(\mathbf{x}) = (\hat{\boldsymbol{\mu}}_+ - \hat{\boldsymbol{\mu}}_-)^{T} \hat{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\hat{\boldsymbol{\mu}}_+^{T}\hat{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_+ + \frac{1}{2}\hat{\boldsymbol{\mu}}_-^{T}\hat{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_- + \log\frac{n_+}{n_-}, \tag{4} \]

where $\hat{\boldsymbol{\mu}}_+$, $\hat{\boldsymbol{\mu}}_-$, and $\hat{\Sigma}$ are the sample means for $y \in \{-1, 1\}$ and the total sample variance, respectively. For simplicity, we take the diagonal part of $\hat{\Sigma}$ (Diagonal Linear Discriminant Analysis) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used when $p$ is much larger than $n$ $(= n_- + n_+)$, in which case the inverse of $\hat{\Sigma}$ cannot be calculated.
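The diagonal simplification can be sketched in a few lines of NumPy (an illustrative implementation, not the authors' code; patients are the rows of `X` and labels are ±1):

```python
import numpy as np

def dlda_fit(X, y):
    """Fit DLDA: class means plus a pooled *diagonal* variance, so no
    p x p covariance matrix has to be inverted when p >> n."""
    Xp, Xm = X[y == 1], X[y == -1]
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
    resid = np.vstack([Xp - mu_p, Xm - mu_m])   # within-class residuals
    var = resid.var(axis=0)                     # diagonal of Sigma-hat
    return mu_p, mu_m, var, np.log(len(Xp) / len(Xm))

def dlda_score(x, mu_p, mu_m, var, prior):
    # log-likelihood ratio (4) with Sigma-hat replaced by its diagonal
    return ((mu_p - mu_m) / var) @ x \
        - 0.5 * (mu_p @ (mu_p / var)) + 0.5 * (mu_m @ (mu_m / var)) + prior
```

Positive scores favor $y = 1$ (metastases) and negative scores favor $y = -1$ (good prognosis).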

3.1.3. AdaBoost. We introduce a famous boosting method from the machine learning community. The key concept of boosting is to construct a powerful discriminant function $F(\mathbf{x})$ by combining various weak classifiers $f(\mathbf{x})$ [26]. We employ a set $\mathcal{F}$ of decision stumps as a dictionary of weak classifiers. Here, the decision stump for the $k$th gene expression $x_k$ is defined as a simple step function:

\[ f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases} \tag{5} \]

where $b_k$ is a threshold value. Accordingly, $f_k(\mathbf{x})$ is the simplest classifier in the sense that it neglects all information in the gene expression pattern other than that of the single gene $x_k$. However, by varying the value of $b_k$ for all genes ($k = 1, \ldots, p$), we obtain a dictionary $\mathcal{F}$ that contains exhaustive information about the gene expression patterns. We attempt to build a good discriminant function $F$ by combining decision stumps in $\mathcal{F}$.

The typical method is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss of a discriminant function $F$:

\[ L_{\exp}(F) = \frac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}, \tag{6} \]

where the entire data set is given as $\{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$. The exponential loss is sequentially minimized by the following algorithm.

(1) Initialize the weights as $w_1(i) = 1/n$ for $i = 1, \ldots, n$.

(2) For the iteration number $t = 1, \ldots, T$:

(a) choose $f_t$ as

\[ f_t = \arg\min_{f \in \mathcal{F}} \epsilon_t(f), \tag{7} \]

where

\[ \epsilon_t(f) = \sum_{i=1}^{n} I\bigl(f(\mathbf{x}_i) \neq y_i\bigr)\, \frac{w_t(i)}{\sum_{i=1}^{n} w_t(i)}, \tag{8} \]

and $I$ is the indicator function;

(b) calculate the coefficient

\[ \beta_t = \frac{1}{2}\log\frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)}; \tag{9} \]

(c) update the weights as

\[ w_{t+1}(i) = w_t(i)\exp\{-y_i \beta_t f_t(\mathbf{x}_i)\}. \tag{10} \]

(3) Output the final function $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t f_t(\mathbf{x})$.

Here we have

\[ L_{\exp}(F + \beta f) = \frac{1}{n}\sum_{i=1}^{n}\exp\{-y_i F(\mathbf{x}_i)\}\exp\{-\beta y_i f(\mathbf{x}_i)\} \tag{11} \]

\[ = \frac{1}{n}\sum_{i=1}^{n}\exp\{-y_i F(\mathbf{x}_i)\}\bigl[e^{\beta} I\bigl(f(\mathbf{x}_i) \neq y_i\bigr) + e^{-\beta} I\bigl(f(\mathbf{x}_i) = y_i\bigr)\bigr] \tag{12} \]

\[ = L_{\exp}(F)\bigl\{e^{\beta}\epsilon(f) + e^{-\beta}\bigl(1 - \epsilon(f)\bigr)\bigr\} \tag{13} \]

\[ \ge 2 L_{\exp}(F)\sqrt{\epsilon(f)\bigl(1 - \epsilon(f)\bigr)}, \tag{14} \]

where $\epsilon(f) = \frac{1}{n}\sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i)\exp\{-y_i F(\mathbf{x}_i)\} / L_{\exp}(F)$, and the equality in (14) is attained if and only if

\[ \beta = \frac{1}{2}\log\frac{1 - \epsilon(f)}{\epsilon(f)}. \tag{15} \]


Hence we see that the exponential loss is sequentially minimized by the algorithm. It is easily shown that

\[ \epsilon_{t+1}(f_t) = \frac{1}{2}. \tag{16} \]

That is, the weak classifier $f_t$ chosen at step $t$ is the worst element of $\mathcal{F}$ in the sense of the weighted error rate at step $t + 1$.
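The algorithm above can be sketched directly in NumPy. This is an illustrative implementation, not the authors' code: the dictionary here also includes the mirror image of each stump (sign flipped), and thresholds are taken from the observed values of each gene.

```python
import numpy as np

def adaboost_stumps(X, y, T=20):
    """AdaBoost over decision stumps, following steps (1)-(3) above.
    Returns a list of (beta_t, gene index k, threshold b, sign s)."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)                      # step (1)
    model = []
    for _ in range(T):                           # step (2)
        best = None
        for k in range(p):                       # search the dictionary F
            for b in np.unique(X[:, k]):
                for s in (1, -1):                # stump and its mirror image
                    pred = np.where(X[:, k] >= b, s, -s)
                    eps = w[pred != y].sum() / w.sum()   # weighted error (8)
                    if best is None or eps < best[0]:
                        best = (eps, k, b, s, pred)
        eps, k, b, s, pred = best
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard log() at eps = 0
        beta = 0.5 * np.log((1 - eps) / eps)     # coefficient (9)
        w = w * np.exp(-y * beta * pred)         # weight update (10)
        model.append((beta, k, b, s))
    return model

def adaboost_predict(model, X):
    F = np.zeros(len(X))                         # F(x) = sum_t beta_t f_t(x)
    for beta, k, b, s in model:
        F += beta * np.where(X[:, k] >= b, s, -s)
    return np.sign(F)
```

The exhaustive stump search is $O(n p)$ per candidate threshold, which is why boosting over all genes remains feasible even for microarray-sized $p$.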

3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion consists of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples $\{\mathbf{x}^-_i,\ i = 1, \ldots, n_-\}$ and $\{\mathbf{x}^+_j,\ j = 1, \ldots, n_+\}$ is given as

\[ \mathrm{AUC}(F) = \frac{1}{n_- n_+}\sum_{i=1}^{n_-}\sum_{j=1}^{n_+} H\bigl(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\bigr), \tag{17} \]

where $H(z)$ is the Heaviside function: $H(z) = 1$ if $z \ge 0$ and $H(z) = 0$ otherwise. To avoid the difficulty of maximizing the nondifferentiable function above, an approximate AUC is considered by Komori [28] as

\[ \mathrm{AUC}_\sigma(F) = \frac{1}{n_- n_+}\sum_{i=1}^{n_-}\sum_{j=1}^{n_+} H_\sigma\bigl(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\bigr), \tag{18} \]

where $H_\sigma(z) = \Phi(z/\sigma)$, with $\Phi$ being the standard normal distribution function. A smaller scale parameter $\sigma$ gives a better approximation of the Heaviside function $H(z)$. Based on this approximation, a boosting method for the maximization of the AUC, as well as the partial AUC, is proposed by Komori and Eguchi [22], in which they consider the following objective function:

\[ \mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+}\sum_{i=1}^{n_-}\sum_{j=1}^{n_+} H_\sigma\bigl(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\bigr) - \lambda\sum_{k=1}^{p}\int F''_k(x_k)^2\, dx_k, \tag{19} \]

where $F''_k(x_k)$ is the second derivative of the $k$th component of $F(\mathbf{x})$ and $\lambda$ is a smoothing parameter that controls the smoothness of $F(\mathbf{x})$. Here, the set of weak classifiers is given as

\[ \mathcal{F} = \left\{ f(\mathbf{x}) = \frac{N_{kl}(x_k)}{Z_{kl}} \;\middle|\; k = 1, 2, \ldots, p;\ l = 1, 2, \ldots, m_k \right\}, \tag{20} \]

where $N_{kl}$ is a basis function for representing natural cubic splines and $Z_{kl}$ is a standardization factor. Then, the relationship

\[ \max_{\sigma,\lambda,F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda,F} \mathrm{AUC}_{1,\lambda}(F) \tag{21} \]

allows us to fix $\sigma = 1$ without loss of generality, and we have the following algorithm.

(1) Start with $F_0(\mathbf{x}) = 0$.

(2) For $t = 1, \ldots, T$:

(a) update $\beta_{t-1}(f)$ to $\beta_t(f)$ with a one-step Newton-Raphson iteration;

(b) find the best weak classifier $f_t$:

\[ f_t = \arg\max_{f} \mathrm{AUC}_{1,\lambda}\bigl(F_{t-1} + \beta_t(f) f\bigr); \tag{22} \]

(c) update the score function:

\[ F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t) f_t(\mathbf{x}). \tag{23} \]

(3) Output $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t(f_t) f_t(\mathbf{x})$.

The value of the smoothing parameter $\lambda$ and the iteration number $T$ are determined simultaneously by cross-validation.
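The core device of (17)-(18), replacing the Heaviside step by a normal CDF, is easy to write down. The sketch below (illustrative only, not the AUCBoost implementation) contrasts the exact empirical AUC with its smooth surrogate:

```python
import numpy as np
from math import erf, sqrt

def empirical_auc(scores_neg, scores_pos):
    """Exact empirical AUC (17): fraction of (negative, positive) score
    pairs ranked correctly, with H(0) counted as 1 as in the text."""
    diffs = scores_pos[:, None] - scores_neg[None, :]
    return (diffs >= 0).mean()

def smooth_auc(scores_neg, scores_pos, sigma=1.0):
    """Approximate AUC (18): the Heaviside function is replaced by the
    standard normal CDF Phi(z / sigma), which is differentiable in the
    scores and can therefore be maximized by gradient-type updates."""
    diffs = (scores_pos[:, None] - scores_neg[None, :]) / sigma
    phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))
    return phi(diffs).mean()
```

As $\sigma \to 0$ the surrogate converges to the empirical AUC, which is why fixing $\sigma = 1$ via (21) costs nothing in generality.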

3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage was applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters $\mathcal{C} = \{C_1, \ldots, C_K\}$ and the biological categories or subtypes $\mathcal{B} = \{B_1, \ldots, B_L\}$:

\[ \mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n_k(n_k - 1)}\sum_{\substack{i \neq j \\ i,j \in C_k}} I\bigl(B_{(i)} = B_{(j)}\bigr), \tag{24} \]

where $B_{(i)} \in \mathcal{B}$ is the subtype of subject $i$ and $n_k$ is the number of subjects in $C_k$. This index is bounded above by 1, which corresponds to perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related to the prognosis of the patients [23].
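A direct transcription of (24) follows; it is an illustrative sketch, with one implementation choice: clusters with fewer than two subjects have no within-cluster pairs, so they are skipped rather than divided by zero.

```python
import numpy as np

def bhi(clusters, subtypes):
    """Biological homogeneity index, following (24). `clusters` maps each
    subject to a cluster label; `subtypes` to a biological category."""
    clusters = np.asarray(clusters)
    subtypes = np.asarray(subtypes)
    vals = []
    for c in np.unique(clusters):
        members = subtypes[clusters == c]
        n_k = len(members)
        if n_k < 2:
            continue  # no within-cluster pairs to compare
        agree = sum(int(members[i] == members[j])
                    for i in range(n_k) for j in range(n_k) if i != j)
        vals.append(agree / (n_k * (n_k - 1)))
    return float(np.mean(vals))
```

A value of 1 means every cluster is pure with respect to the biological categories; mixed clusters pull the index toward 0.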

3.3. Mutual Coherence. Now we have a data matrix $X$ with $n$ rows ($n$ patients) and $p$ columns ($p$ genes). We assume an $n$-dimensional vector $\mathbf{b}$ indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis. The magnitude of $\mathbf{b}$ denotes the level of the disease status. Then the optimal linear solution $\boldsymbol{\beta} \in \mathbb{R}^p$ should satisfy

\[ X\boldsymbol{\beta} = \mathbf{b}. \tag{25} \]

Note that if $p$ is much larger than $n$, then only a few elements of $\boldsymbol{\beta}$ should be nonzero and the others zero, which means that $\boldsymbol{\beta}$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as

\[ \mu(X) = \max_{\substack{1 \le i,j \le p \\ i \neq j}} \frac{\bigl|\mathbf{x}_i^{T}\mathbf{x}_j\bigr|}{\|\mathbf{x}_i\|_2\,\|\mathbf{x}_j\|_2}, \tag{26} \]

where $\mathbf{x}_i \in \mathbb{R}^n$ denotes the $i$th column of $X$ (the $i$th gene in the breast cancer data), $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix $X$ and is bounded as $0 \le \mu(X) \le 1$. The next theorem shows the relationship between the sparsity of $\boldsymbol{\beta}$ and the data matrix $X$.
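Equation (26) translates into a few lines of NumPy (an illustrative sketch; the column normalization follows (26) exactly):

```python
import numpy as np

def mutual_coherence(X):
    """Mutual coherence (26): the largest normalized absolute inner
    product |x_i^T x_j| / (||x_i|| ||x_j||) over distinct columns of X."""
    G = np.abs(X.T @ X)                 # |x_i^T x_j| for all column pairs
    norms = np.linalg.norm(X, axis=0)
    G = G / np.outer(norms, norms)      # normalize by ||x_i||_2 ||x_j||_2
    np.fill_diagonal(G, 0.0)            # exclude i = j
    return G.max()
```

Two highly correlated genes, such as a duplicated column, immediately drive the coherence toward 1.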

Theorem 1 (Elad [30]). If $X\boldsymbol{\beta} = \mathbf{b}$ has a solution $\boldsymbol{\beta}$ obeying $\|\boldsymbol{\beta}\|_0 < \bigl(1 + 1/\mu(X)\bigr)/2$, this solution is necessarily the sparsest possible, where $\|\boldsymbol{\beta}\|_0$ denotes the number of nonzero components of $\boldsymbol{\beta}$.

This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \boldsymbol{\beta}^{T}\mathbf{x}$ could have a sparse solution for predicting the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then $\|\boldsymbol{\beta}\|_0$ could be approximately 1, indicating that just one gene (one column of $X$) could be enough to predict the disease status if there were a solution $\boldsymbol{\beta}$. This suggests that we could have a chance to find multiple solutions with fewer genes than the 70 genes of MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in classification problems, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.

3.4. Results. We prepare various data sets using a training sample (78 patients) based on the 230 genes selected by van 't Veer et al. [7] in order to investigate the existence of multiple solutions for the prediction of the disease status. The data sets are given by

\[ \mathcal{D} = \{D_{1\text{--}70},\ D_{6\text{--}75},\ \ldots,\ D_{161\text{--}230}\}. \tag{27} \]

The first data set $D_{1\text{--}70}$ consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is based on the correlation with disease status, as in [7]. We apply the van 't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsections, to the training data to obtain the discriminant functions $F(\mathbf{x})$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The classification performance of the van 't Veer method and DLDA is illustrated in Figure 1; that of AdaBoost and AUCBoost in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. With regard to the performance on the training data, denoted by the solid lines, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, with an AUC above 0.8 and error rates below 0.2 on average. This evidence suggests that there exist many sets of genes with almost the same classification performance as that of MammaPrint.

We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1\text{--}70}$, $D_{11\text{--}80}$, and $D_{111\text{--}180}$, shown in Figure 3. Each row represents one of 70 genes and each column one of 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1\text{--}70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11\text{--}80}$ shows patterns different from the others. There are mainly two clusters characterized by ER status. The left-hand side is the subgroup of patients with negative ER status and poor prognosis; this would correspond to the basal-like or triple-negative subtype, though the HER2 status is unclear. The right-hand side can be divided into three subgroups. The subgroup of patients with positive ER status shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because they show worse prognosis than Luminal A. The data set $D_{111\text{--}180}$ has the highest BHI for disease status (0.76) and a gene expression pattern similar to that of $D_{1\text{--}70}$. The other values of BHI are illustrated in Figure 4. It turns out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that various gene sets with similar expression patterns exist.

Next, we investigate the stability of the gene ranking in MammaPrint. From the 78 patients, we randomly choose 50 patients with their 5420 gene expression patterns. We then take the top 70 genes ranked by the correlation coefficients of the gene expression with the disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result when the AUC is used for the ranking instead of the correlation coefficient. We clearly see that some of the genes in MammaPrint are rarely selected into the top 70, which indicates the instability of the gene ranking. The performance of DLDA, which shows the most stable prediction accuracy in Figure 1, over 100 replications based on the randomly selected 50 patients is shown in Figure 6, where the vertical axes are the AUC (left panels) and the error rate (right panels), and the genes are ranked by the correlation (a) and by the AUC (b). The performance is clearly worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of the gene expression patterns may come from the several subtypes
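The resampling experiment described above can be sketched as follows (an illustration under assumed inputs, not the authors' code; genes are ranked by the absolute correlation of expression with the ±1 disease status):

```python
import numpy as np

def top_k_selection_rates(X, y, k=70, n_sub=50, trials=100, seed=0):
    """Repeatedly draw n_sub of the n patients, rank the p genes by
    |correlation with disease status|, and record how often each gene
    lands in the top k (cf. Figure 5)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(trials):
        idx = rng.choice(n, size=n_sub, replace=False)
        Xc = X[idx] - X[idx].mean(axis=0)       # center expression
        yc = y[idx] - y[idx].mean()             # center labels
        corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0)
                                    * np.linalg.norm(yc))
        counts[np.argsort(-corr)[:k]] += 1      # top-k genes this trial
    return counts / trials
```

A gene whose selection rate is far below 1 is exactly the kind of unstable signature member visible in Figure 5.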


Figure 1. Results of the classical methods: (a) the van 't Veer method and (b) DLDA. The panels show the AUC values (left) and the error rates (right) over the data sets $D_{1\text{--}70}, \ldots, D_{161\text{--}230}$ for the training and test data.

as suggested in the cluster analysis, and it would make it difficult to select useful genes for prediction.

Finally, we calculate the mutual coherence based on the genes in MammaPrint and the total 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM_014889 and NM_014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined in the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for clarity and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone


Figure 2. Results of the boosting methods: (a) AdaBoost and (b) AUCBoost. The panels show the AUC values (left) and the error rates (right) over the data sets $D_{1\text{--}70}, \ldots, D_{161\text{--}230}$ for the training and test data.

to select genes whose expression patterns are similar to each other. This would be one reason why we obtain multiple solutions after ranking based on correlation coefficients. It is also interesting that there are a few gene pairs with very low correlation even in the MammaPrint gene set.

4. Discussion and Concluding Remarks

In this paper, we have addressed an important classification issue in microarray data analysis. Our results demonstrate the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that the performance of several gene sets was practically the same. This behavior was confirmed with the van 't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, some non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and


Figure 3. Heat maps of gene expression data, with rows representing genes and columns representing patients: (a) $D_{1\text{--}70}$ (MammaPrint), (b) $D_{11\text{--}80}$, clearly showing some subtypes, and (c) $D_{111\text{--}180}$, with the highest BHI regarding metastases. Blue bars indicate patients with metastases, red bars those with positive ER status, and orange bars those with positive PR status.

the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns revealed subtypes in the expression patterns. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize the samples into the known subtypes, but considering subtypes in feature selection could be future work.


Figure 4. Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line) over the data sets $D_{1\text{--}70}, \ldots, D_{161\text{--}230}$.

Figure 5. The rates at which genes are ranked in the top 70 by the correlation coefficient (a) and by the AUC (b), based on randomly sampled sets of 50 patients. The horizontal axis lists the 70 genes used by MammaPrint.


Figure 6. The values of the AUC (left panels) and the error rate (right panels) calculated by DLDA using the genes ranked in the top 70 by the correlation coefficient (a) and by the AUC (b). These values are calculated from randomly sampled sets of 50 patients over 100 trials.

Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.

One approach to tackling the high dimensionality of the data is to apply statistical methods in consideration of the background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the resultant outcome of the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of the genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5 and suggested by

Computational and Mathematical Methods in Medicine 13

0 02 04 06

0

02

04

06

minus02

minus02NM 014889

NM

0149

68

(a)

Index of pairs of genes

Cor

relat

ion

5420 genes70 genes

0

02

04

06

08

1

0 02 04 06 08 1

(b)

Figure 7 (a) Scatter plots of two pairs of genes with highest mutualcoherence (0984) among 70 genes of MammaPrint (b)The distributionof the correlation of total 5420 genes (black) and 70 genes ofMammaPrint (red)The horizontal axis denotes the index of pairs of genes basedon which the correlations are calculated The horizontal axis is standardized between 0 and 1 for clear view

Ein-Dor et al [19] the gene ranking based on two-samplestatistic such as correlation coefficients has large amount ofvariability indicating the limitation of single gene analysisThe clustering analysis that deals with the whole informationof total genes would be useful to capture the mutual relationamong genes and to identify the subgroups of informativegenes for the prediction problem The combination withunsupervised learning and supervised learning is a promisingway to solve the difficulty involved in the high-dimensionaldata

We addressed the challenges and difficulties regarding thepattern recognition of gene expression The main problemwas caused by high-dimensional data sets This is not onlya problem of microarray data but also RNA sequencing TheRNA sequencing technologies (RNA-seq) have dramaticallyadvanced recently and are considered an alternative tomicroarray for measuring gene expression as addressed inMortazavi et al [33] Wang et al [34] Pepke et al [35] andWilhelm and Landry [36] Witten [37] showed the clusteringand classification methodology applying Poisson model forthe RNA-seq data The same difficulty occurs in the RNA-seq data The dimension of the RNA-seq is quite high Therunning cost of RNA sequencing has decreased howeverthe number of features is still much larger than the numberof observations Therefore it is not difficult to imagine thatRNA-seq data would have the same difficulty as microarrayWe have to be aware of it and tackle this difficulty using theinsights we have learned from microarray data

Conflict of Interests

The authors have declared that no conflict of interests exists

References

[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.

[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.

[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267–288, 1996.

[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.

[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.

[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.

[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.

[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.

[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.

[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined – microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.


6 Computational and Mathematical Methods in Medicine

Hence we can see that the exponential loss is sequentially minimized in the algorithm. It is easily shown that

$$\epsilon_{t+1}(f_t) = \frac{1}{2}. \qquad (16)$$

That is, the weak classifier $f_t$ chosen at step $t$ is the worst element of $\mathcal{F}$ in the sense of the weighted error rate at step $t + 1$.
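The identity (16) is easy to check numerically: after one round of the exponential-loss weight update, the weighted error of the classifier just chosen is exactly 1/2. A minimal sketch with synthetic labels (all names here are illustrative, not from the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1, 1], size=n)            # true labels
f = np.where(rng.random(n) < 0.7, y, -y)   # weak classifier, roughly 30% error

w = np.full(n, 1.0 / n)                    # uniform initial weights
eps = w[f != y].sum()                      # weighted error at step t
alpha = 0.5 * np.log((1 - eps) / eps)      # AdaBoost coefficient

w = w * np.exp(-alpha * y * f)             # exponential-loss weight update
w = w / w.sum()

eps_next = w[f != y].sum()                 # weighted error at step t + 1
print(eps_next)                            # 0.5 up to rounding
```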

3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion is built from the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples $x^-_i$, $i = 1, \ldots, n_-$, and $x^+_j$, $j = 1, \ldots, n_+$, is given as

$$\mathrm{AUC}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H\bigl(F(x^+_j) - F(x^-_i)\bigr), \qquad (17)$$

where $H(z)$ is the Heaviside function: $H(z) = 1$ if $z \geq 0$ and $H(z) = 0$ otherwise. To avoid the difficulty of maximizing this nondifferentiable function, Komori [28] considers the approximate AUC

$$\mathrm{AUC}_\sigma(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma\bigl(F(x^+_j) - F(x^-_i)\bigr), \qquad (18)$$

where $H_\sigma(z) = \Phi(z/\sigma)$, with $\Phi$ being the standard normal distribution function. A smaller scale parameter $\sigma$ means a better approximation of the Heaviside function $H(z)$. Based on this approximation, a boosting method for the maximization of the AUC as well as the partial AUC is proposed by Komori and Eguchi [22], who consider the objective function

$$\mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma\bigl(F(x^+_j) - F(x^-_i)\bigr) - \lambda \sum_{k=1}^{p} \int F_k''(x_k)^2 \, dx_k, \qquad (19)$$

where $F_k''(x_k)$ is the second derivative of the $k$th component of $F(x)$ and $\lambda$ is a smoothing parameter that controls the smoothness of $F(x)$. Here the set of weak classifiers is given as

$$\mathcal{F} = \Bigl\{ f(x) = \frac{N_{kl}(x_k)}{Z_{kl}} \;\Big|\; k = 1, 2, \ldots, p;\ l = 1, 2, \ldots, m_k \Bigr\}, \qquad (20)$$

where $N_{kl}$ is a basis function representing natural cubic splines and $Z_{kl}$ is a standardization factor. Then the relationship

$$\max_{\sigma, \lambda, F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda, F} \mathrm{AUC}_{1,\lambda}(F) \qquad (21)$$

allows us to fix $\sigma = 1$ without loss of generality, leading to the following algorithm.

(1) Start with $F_0(x) = 0$.
(2) For $t = 1, \ldots, T$:
    (a) update $\beta_{t-1}(f)$ to $\beta_t(f)$ with a one-step Newton-Raphson iteration;
    (b) find the best weak classifier $f_t$:
        $$f_t = \arg\max_f \mathrm{AUC}_{1,\lambda}\bigl(F_{t-1} + \beta_t(f)\, f\bigr); \qquad (22)$$
    (c) update the score function as
        $$F_t(x) = F_{t-1}(x) + \beta_t(f_t)\, f_t(x). \qquad (23)$$
(3) Output $F(x) = \sum_{t=1}^{T} \beta_t(f_t)\, f_t(x)$.

The value of the smoothing parameter $\lambda$ and the iteration number $T$ are determined simultaneously by cross-validation.
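As a sketch of (17) and (18), the empirical AUC and its smooth surrogate can be computed directly from the score values. The Gaussian scores below are hypothetical stand-ins for $F(x^-)$ and $F(x^+)$; only NumPy and the standard library are assumed:

```python
import math
import numpy as np

def phi(z):
    """Standard normal CDF, applied elementwise (stdlib erf, no SciPy)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def empirical_auc(f_neg, f_pos):
    """Empirical AUC of (17): fraction of (negative, positive) score pairs
    with F(x+) >= F(x-)."""
    diff = f_pos[None, :] - f_neg[:, None]   # n- x n+ matrix of differences
    return float(np.mean(diff >= 0))

def smooth_auc(f_neg, f_pos, sigma=1.0):
    """Approximate AUC of (18): Heaviside replaced by H_sigma(z) = Phi(z/sigma)."""
    diff = f_pos[None, :] - f_neg[:, None]
    return float(np.mean(phi(diff / sigma)))

rng = np.random.default_rng(1)
f_neg = rng.normal(0.0, 1.0, size=60)   # scores F(x-) of the n- controls
f_pos = rng.normal(1.5, 1.0, size=40)   # scores F(x+) of the n+ cases

# A smaller sigma gives a better approximation of the Heaviside function.
print(empirical_auc(f_neg, f_pos), smooth_auc(f_neg, f_pos, sigma=0.01))
```

The surrogate is differentiable in the score function, which is what makes the boosting update above tractable.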

3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage is used as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters $\mathcal{C} = \{C_1, \ldots, C_K\}$ and the biological categories or subtypes $\mathcal{B} = \{B_1, \ldots, B_L\}$:

$$\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k (n_k - 1)} \sum_{\substack{i \neq j \\ i, j \in C_k}} I\bigl(B(i) = B(j)\bigr), \qquad (24)$$

where $B(i) \in \mathcal{B}$ is the subtype of subject $i$ and $n_k$ is the number of subjects in $C_k$. This index is bounded above by 1, which corresponds to perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status, namely estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related with the prognosis of the patients [23].

3.3. Mutual Coherence. Now we have a data matrix $X$ with $n$ rows ($n$ patients) and $p$ columns ($p$ genes). We assume an $n$-dimensional vector $b$ indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis; the magnitude of $b$ denotes the level of disease status. Then the optimal linear solution $\beta \in \mathbb{R}^p$ should satisfy

$$X\beta = b. \qquad (25)$$

Note that if $p$ is much larger than $n$, then only a few elements of $\beta$ should be nonzero and the others zero, which means $\beta$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as

$$\mu(X) = \max_{1 \leq i, j \leq p,\; i \neq j} \frac{|x_i^T x_j|}{\|x_i\|_2 \, \|x_j\|_2}, \qquad (26)$$

where $x_i \in \mathbb{R}^n$ denotes the $i$th column of $X$ (the $i$th gene in the breast cancer data), $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix $X$ and is bounded as $0 \leq \mu(X) \leq 1$. The next theorem shows the relationship between the sparsity of $\beta$ and the data matrix $X$.

Theorem 1 (Elad [30]). If $X\beta = b$ has a solution $\beta$ obeying $\|\beta\|_0 < (1 + 1/\mu(X))/2$, this solution is necessarily the sparsest possible, where $\|\beta\|_0$ denotes the number of nonzero components of $\beta$.

This theorem suggests that the linear discriminant function $F(x) = \beta^T x$ could have a sparse solution for predicting the disease status $b$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then $\|\beta\|_0$ could become approximately 1, indicating that just one gene (one column of $X$) could be enough to predict the disease status if there were a solution $\beta$. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes of MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in classification problems, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.

3.4. Results. We prepare various data sets from a training sample (78 patients), based on the 230 genes selected by van't Veer et al. [7], in order to investigate the existence of multiple solutions for the prediction of disease status. These are given by

$$\mathcal{D} = \{D_{1-70}, D_{6-75}, \ldots, D_{161-230}\}, \qquad (27)$$

where the first data set $D_{1-70}$ consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is based on the correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain discriminant functions $F(x)$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The classification performance of the van't Veer method and DLDA is illustrated in Figure 1; that of AdaBoost and AUCBoost is in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. In terms of performance on the training data (denoted by solid lines), the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, having an AUC of more than 0.8 and error rates less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
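Among the compared classifiers, DLDA is simple enough to state in full. The following is a generic textbook sketch of diagonal linear discriminant analysis [26] on synthetic data; it is not the authors' exact implementation:

```python
import numpy as np

class DLDA:
    """Diagonal LDA: Gaussian classes with a shared diagonal covariance,
    i.e. a linear rule that ignores gene-gene correlations."""

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        # pooled per-feature variance (the diagonal covariance)
        resid = np.concatenate([X[y == c] - X[y == c].mean(axis=0)
                                for c in self.classes_])
        self.var_ = resid.var(axis=0) + 1e-8
        self.prior_ = np.array([np.mean(y == c) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        # discriminant: log prior minus standardized distance to the class mean
        d = [np.log(p) - 0.5 * (((X - m) ** 2) / self.var_).sum(axis=1)
             for m, p in zip(self.means_, self.prior_)]
        return self.classes_[np.argmax(d, axis=0)]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(50, 20)),   # class 0
               rng.normal(1, 1, size=(50, 20))])  # class 1, shifted means
y = np.array([0] * 50 + [1] * 50)
acc = np.mean(DLDA().fit(X, y).predict(X) == y)
print(acc)
```

Ignoring the off-diagonal covariance is exactly what makes DLDA stable when $p$ is large relative to $n$, which is consistent with its robust test performance here.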

We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1-70}$, $D_{11-80}$, and $D_{111-180}$, shown in Figure 3. Each row represents one of 70 genes and each column one of 78 patients, annotated with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1-70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11-80}$ shows different patterns from the others. There are mainly two clusters, characterized by ER status. The left-hand side is the subgroup of patients with ER negative status and poor prognosis; this would correspond to the basal-like or triple-negative subtype, though the Her2 status is unclear. The right-hand side can be divided into three subgroups. The subgroup of patients with ER negative status shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C, because it shows worse prognosis than Luminal A. The data set $D_{111-180}$ has the highest BHI for disease status (0.76) and a gene expression pattern similar to that of $D_{1-70}$. The other values of BHI are illustrated in Figure 4. It turns out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.

Next we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with their 5420 gene expression values. Then we take the top 70 genes ranked by the correlation coefficients of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5; the lower panel shows the result based on the AUC instead of the correlation coefficient. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performance of DLDA, which shows the most stable prediction accuracy as seen in Figure 1, based on the randomly selected 50 patients with 100 replications, is shown in Figure 6, where the vertical axes are AUC (left panels) and error rate (right panels), and genes are ranked by the correlation (a) and the AUC (b). The performance is clearly worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of gene expression patterns may come from the several subtypes


Figure 1: Results of the classical methods. (a) and (b) show the AUC values (left panels) and the error rates (right panels) over the data sets $\mathcal{D}$ (1-70, 41-120, 81-150, 121-190, 161-230) for the van't Veer method (a) and DLDA (b), on training and test data.

as suggested in the cluster analysis, and it would make it difficult to select useful genes for prediction.
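The resampling experiment above can be sketched as follows. The expression matrix here is synthetic noise standing in for the 5420-gene data, so only the mechanics (rank by absolute correlation with the outcome, take the top 70, repeat over 100 subsamples of 50 patients) match the paper, not the numbers:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, top = 78, 500, 70                 # patients, genes, list size
X = rng.normal(size=(n, p))             # stand-in for the expression matrix
y = rng.choice([-1.0, 1.0], size=n)     # disease status

def top_genes(X, y, k):
    """Rank genes by |correlation| with the outcome; return the top-k set."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return set(np.argsort(-np.abs(r))[:k])

reference = top_genes(X, y, top)        # "MammaPrint-like" full-sample list
hits = np.zeros(p)
for _ in range(100):                    # 100 resamplings of 50 patients
    idx = rng.choice(n, size=50, replace=False)
    for g in top_genes(X[idx], y[idx], top):
        hits[g] += 1

# Fraction of runs in which each reference gene stays in the top 70:
rates = hits[sorted(reference)] / 100
print(rates.mean())
```

With correlated, heterogeneous genes the effect is typically even stronger than with pure noise, since many near-duplicate genes compete for the same ranks.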

Finally, we calculate the mutual coherence based on the genes in MammaPrint and on the total 5420 genes (Figure 7). The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM 014889 and NM 014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone


Figure 2: Results of the boosting methods. (a) and (b) show the AUC values (left panels) and the error rates (right panels) over the data sets $\mathcal{D}$ (1-70, 41-120, 81-150, 121-190, 161-230) for AdaBoost (a) and AUCBoost (b), on training and test data.

to select genes whose expression patterns are similar to each other. This would be one reason why we obtain multiple solutions from ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.

4. Discussion and Concluding Remarks

In this paper we have addressed an important classification issue in microarray data analysis. Our results demonstrate the existence of multiple suboptimal solutions when choosing classification predictors. The classification results showed that the performance of several gene sets was essentially the same. This behavior was confirmed with the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and


Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients. (a) $D_{1-70}$ (MammaPrint), (b) $D_{11-80}$, clearly showing some subtypes, and (c) $D_{111-180}$, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive status, and orange bars those with PR positive status.

the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns revealed subtypes in the expression patterns. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into known subtypes, but considering subtypes for feature selection could be future work.


Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line), over the data sets 1-70, 41-120, 81-150, 121-190, and 161-230.

Figure 5: The rates of ranking in the top 70 genes by the correlation coefficients (a) and by the AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.


Figure 6: The values of AUC (left panels) and error rate (right panels) calculated by DLDA using the genes ranked in the top 70 by the correlation coefficients (a) and by the AUC (b). These values are based on randomly sampled 50 patients over 100 trials, on training and test data.

Microarray technology has become commonplace, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this, but the main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy, and this redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.

One approach to tackling the high dimensionality of the data is to apply statistical methods that incorporate background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5, and as suggested by


Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint (NM 014889 versus NM 014968). (b) The distributions of the correlations among the total 5420 genes (black) and among the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the pairs of genes for which the correlations are calculated, standardized to between 0 and 1 for a clear view.

Ein-Dor et al [19] the gene ranking based on two-samplestatistic such as correlation coefficients has large amount ofvariability indicating the limitation of single gene analysisThe clustering analysis that deals with the whole informationof total genes would be useful to capture the mutual relationamong genes and to identify the subgroups of informativegenes for the prediction problem The combination withunsupervised learning and supervised learning is a promisingway to solve the difficulty involved in the high-dimensionaldata

We addressed the challenges and difficulties regarding thepattern recognition of gene expression The main problemwas caused by high-dimensional data sets This is not onlya problem of microarray data but also RNA sequencing TheRNA sequencing technologies (RNA-seq) have dramaticallyadvanced recently and are considered an alternative tomicroarray for measuring gene expression as addressed inMortazavi et al [33] Wang et al [34] Pepke et al [35] andWilhelm and Landry [36] Witten [37] showed the clusteringand classification methodology applying Poisson model forthe RNA-seq data The same difficulty occurs in the RNA-seq data The dimension of the RNA-seq is quite high Therunning cost of RNA sequencing has decreased howeverthe number of features is still much larger than the numberof observations Therefore it is not difficult to imagine thatRNA-seq data would have the same difficulty as microarrayWe have to be aware of it and tackle this difficulty using theinsights we have learned from microarray data

Conflict of Interests

The authors have declared that no conflict of interests exists

References

[1] E S Lander L M Linton B Birren et al ldquoInitial sequencingand analysis of the human genomerdquo Nature vol 409 pp 860ndash921 2001

[2] J C Venter M D Adams E W Myers et al ldquoThe sequence ofthe human genomerdquo Science vol 291 no 5507 pp 1304ndash13512001

[3] R Tibshirani ldquoRegression shirinkagee and selection via thelassordquo Journal of Royal Statistical Society vol 58 no 1 pp 267ndash288 1996

[4] D L Donoho and M Elad ldquoOptimally sparse representationin general (nonorthogonal) dictionaries via ℓ1 minimizationrdquoProceedings of the National Academy of Sciences of the UnitedStates of America vol 100 no 5 pp 2197ndash2202 2003

[5] E J Candes J K Romberg and T Tao ldquoStable signal recoveryfrom incomplete and inaccurate measurementsrdquo Communica-tions on Pure and Applied Mathematics vol 59 no 8 pp 1207ndash1223 2006

[6] D B Allison X Cui G P Page and M Sabripour ldquoMicroarraydata analysis from disarray to consolidation and consensusrdquoNature Reviews Genetics vol 7 no 1 pp 55ndash65 2006

[7] L J vanrsquot Veer H Dai et al ldquoGene expression profiling predictsclinical outcome of breast cancerrdquoNature vol 415 no 6871 pp530ndash536 2002

[8] Y Saeys I Inza and P Larranaga ldquoA review of feature selectiontechniques in bioinformaticsrdquo Bioinformatics vol 23 no 19 pp2507ndash2517 2007

[9] M Schena D Shalon R Heller A Chai P O Brown andR W Davis ldquoParallel human genome analysis microarray-based expression monitoring of 1000 genesrdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 93 no 20 pp 10614ndash10619 1996

14 Computational and Mathematical Methods in Medicine

[10] RA Irizarry BHobbs F Collin et al ldquoExploration normaliza-tion and summaries of high density oligonucleotide array probelevel datardquo Biostatistics vol 4 no 2 pp 249ndash264 2003

[11] F Naef C R Hacker N Patil and M Magnasco ldquoEmpiricalcharacterization of the expression ratio noise structure in high-density oligonucleotide arraysrdquo Genome Biology vol 3 no 4article RESEARCH0018 2002

[12] T R Golub D K Slonim P Tamayo et al ldquoMolecular classi-fication of cancer class discovery and class prediction by geneexpression monitoringrdquo Science vol 286 no 5439 pp 531ndash5371999

[13] S Paik S Shak G Tang et al ldquoA multigene assay to predictrecurrence of tamoxifen-treated node-negative breast cancerrdquoThe New England Journal of Medicine vol 351 no 27 pp 2817ndash2826 2004

[14] F Cardoso L Vanrsquot Veer E Rutgers S Loi S Mook and MJ Piccart-Gebhart ldquoClinical application of the 70-gene profilethe MINDACT trialrdquo Journal of Clinical Oncology vol 26 no5 pp 729ndash735 2008

[15] C Fan D S Oh L Wessels et al ldquoConcordance among gene-expression-based predictors for breast cancerrdquoTheNewEnglandJournal of Medicine vol 355 no 6 pp 560ndash569 2006

[16] M Basseville ldquoDistance measures for signal processing andpattern recognitionrdquo Signal Processing vol 18 no 4 pp 349ndash369 1989

[17] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002

[18] H Zou and T Hastie ldquoRegularization and variable selection viathe elastic netrdquo Journal of the Royal Statistical Society B vol 67no 2 pp 301ndash320 2005

[19] L Ein-Dor I Kela GGetz DGivol and EDomany ldquoOutcomesignature genes in breast cancer is there a unique setrdquoBioinformatics vol 21 no 2 pp 171ndash178 2005

[20] R A Fisher ldquoThe use of multiple measurements in taxonomicproblemsrdquo Annals of Eugenics vol 7 no 2 pp 179ndash188 1936

[21] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990

[22] O Komori and S Eguchi ldquoA boosting method for maximizingthe partial area under the ROC curverdquo BMCBioinformatics vol11 article 314 2010

[23] T Soslashrlie Perou C M Perou R Tibshiran et al ldquoGene expres-sion patterns of breast carcinomas distinguish tumor subclasseswith clinical implicationsrdquo Proceedings of the National Academyof Sciences vol 98 no 19 pp 10869ndash10874 2001

[24] M Yousef S Jung L C Showe and M K Showe ldquoLearningfrom positive examples when the negative class is undeter-mined-microRNA gene identificationrdquo Algorithms for Molecu-lar Biology vol 3 no 1 article 2 2008

[25] A B Gardner A M Krieger G Vachtsevanos and B LittldquoOne-class novelty detection for seizure analysis from intracra-nial EEGrdquo Journal of Machine Learning Research vol 7 pp1025ndash1044 2006

[26] T Hastie R Tibshirani and J FriedmanThe Elements of Statis-tical Learning Data Mining Inference and Prediction SpringerNew York NY USA 2001

[27] M S PepeThe Statistical Evaluation of Medical Tests for Classi-fication and Prediction Oxford University Press New York NYUSA 2003

[28] O Komori ldquoA boosting method for maximization of the areaunder the ROC curverdquoAnnals of the Institute of StatisticalMath-ematics vol 63 no 5 pp 961ndash979 2011

[29] H M Wu ldquoOn biological validity indices for soft clusteringalgorithms for gene expression datardquo Computational Statisticsand Data Analysis vol 55 no 5 pp 1969ndash1979 2011

[30] M Elad Sparse and Redundant Representations fromTheory toApplications in Signal and Image Processing Springer NewYorkNY USA 2010

[31] T Soslashrlie C M Perou R Tibshiran et al ldquoGene expression pat-terns of breast carcinomas distinguish tumor subclasses withclinical implicationsrdquo Proceedings of the National Academy ofSciences vol 98 no 19 pp 10869ndash10874 2001

[32] J S Parker M Mullins M C U Cheang et al ldquoSupervised riskpredictor of breast cancer based on intrinsic subtypesrdquo Journalof Clinical Oncology vol 27 no 8 pp 1160ndash1167 2009

[33] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[34] Z Wang M Gerstein and M Snyder ldquoRNA-Seq a revolution-ary tool for transcriptomicsrdquo Nature Reviews Genetics vol 10no 1 pp 57ndash63 2009

[35] S Pepke B Wold and A Mortazavi ldquoComputation for ChIP-seq and RNA-seq studiesrdquo Nature Methods vol 6 no 11 ppS22ndashS32 2009

[36] B T Wilhelm and J R Landry ldquoRNA-Seq-quantitative mea-surement of expression through massively parallel RNA-sequencingrdquoMethods vol 48 no 3 pp 249ndash257 2009

[37] D M Witten ldquoClassification and clustering of sequencing datausing a Poisson modelrdquo The Annals of Applied Statistics vol 5no 4 pp 2493ndash2518 2011


Computational and Mathematical Methods in Medicine 7

the positive values correspond to metastases and negative ones to good prognosis. The magnitude of $\mathbf{b}$ denotes the level of disease status. Then the optimal linear solution $\beta \in \mathbb{R}^p$ should satisfy

$$X\beta = \mathbf{b}. \quad (25)$$

Note that if $p$ is much larger than $n$, then only a few elements of $\beta$ should be nonzero and the others zero, which means that $\beta$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as

$$\mu(X) = \max_{1 \le i, j \le p,\; i \ne j} \frac{\left|\mathbf{x}_i^{T}\mathbf{x}_j\right|}{\|\mathbf{x}_i\|_2 \, \|\mathbf{x}_j\|_2}, \quad (26)$$

where $\mathbf{x}_i \in \mathbb{R}^n$ denotes the $i$th column of $X$, or the $i$th gene in the breast cancer data, $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix $X$ and is bounded as $0 \le \mu(X) \le 1$. The next theorem shows the relationship between the sparsity of $\beta$ and the data matrix $X$.

Theorem 1 (Elad [30]). If $X\beta = \mathbf{b}$ has a solution $\beta$ obeying $\|\beta\|_0 < (1 + 1/\mu(X))/2$, this solution is necessarily the sparsest possible, where $\|\beta\|_0$ denotes the number of nonzero components of $\beta$.

This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \beta^T \mathbf{x}$ could have a sparse solution to predict the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then $\|\beta\|_0$ could become approximately 1, indicating that just one gene (a column of $X$) could be enough to predict the disease status if there were a solution $\beta$. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes of MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include the random effects seen in classification problems, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.
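The quantity in (26) is straightforward to compute directly. The following is a minimal sketch (our illustration, not the authors' code) that evaluates $\mu(X)$ for an $n \times p$ expression matrix; the synthetic matrix is an assumption used only to exercise the function.

```python
import numpy as np

def mutual_coherence(X):
    """Mutual coherence mu(X) of an n x p data matrix, as in (26):
    the largest value of |x_i^T x_j| / (||x_i||_2 ||x_j||_2)
    over all pairs of distinct columns."""
    G = X / np.linalg.norm(X, axis=0, keepdims=True)  # unit-norm columns
    C = np.abs(G.T @ G)                               # all pairwise values
    np.fill_diagonal(C, 0.0)                          # exclude i == j
    return float(C.max())

# Synthetic illustration (not the breast cancer data): n = 78 subjects,
# p = 230 genes, with one duplicated column forcing mu(X) = 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((78, 230))
X[:, 1] = X[:, 0]
print(mutual_coherence(X))  # 1.0 up to rounding
```

A value near 1 signals that at least two columns are nearly parallel, which is exactly the redundancy discussed below for the MammaPrint gene set.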

3.4. Results. We prepare various data sets using a training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7] in order to investigate the existence of multiple solutions for the prediction of the disease status, which are given by

$$\mathcal{D} = \{D_{1-70}, D_{6-75}, \ldots, D_{161-230}\}. \quad (27)$$

The first data set $D_{1-70}$ consists of the 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is on the basis of correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions $F(\mathbf{x})$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The results for the classification performance of the van't Veer method and DLDA are illustrated in Figure 1; those of AdaBoost and AUCBoost are in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. In regard to the performance based on the training data, denoted by the solid line, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, having an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
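Diagonal linear discriminant analysis admits a very short implementation: it is ordinary LDA with the pooled within-class covariance restricted to its diagonal, which is what keeps it usable when $p \gg n$. The sketch below is our own minimal illustration under that description, not the authors' code.

```python
import numpy as np

class DLDA:
    """Diagonal linear discriminant analysis: LDA with the pooled
    within-class covariance restricted to its diagonal, so only a
    per-gene variance is estimated. Suited to p >> n data."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Pooled within-class variance per feature, with a small ridge
        # so that a constant gene cannot cause division by zero.
        resid = np.concatenate([X[y == c] - m
                                for c, m in zip(self.classes_, self.means_)])
        self.var_ = resid.var(axis=0) + 1e-8
        return self

    def decision(self, X):
        # Negative standardized squared distance to each class mean.
        diff = X[:, None, :] - self.means_[None, :, :]
        return -((diff ** 2) / self.var_).sum(axis=2)

    def predict(self, X):
        return self.classes_[self.decision(X).argmax(axis=1)]
```

Ignoring the off-diagonal covariance terms sacrifices gene-gene interactions but avoids inverting a $p \times p$ matrix estimated from only $n$ subjects, which is why DLDA is comparatively stable in this setting.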

We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1-70}$, $D_{11-80}$, and $D_{111-180}$, which are shown in Figure 3. Each row represents 70 genes and each column represents 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1-70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11-80}$ shows different patterns from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis. This would correspond to the Basal-like subtype or triple-negative subtype, though the Her2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER negative shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set $D_{111-180}$ has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that in $D_{1-70}$. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.

Next, we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with the 5420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performances of DLDA with 100 replications, which shows the most stable prediction accuracy as seen in Figure 1, based on the randomly selected 50 patients are shown in Figure 6, where the vertical axes are AUC (left panels) and error rate (right panels) and the genes are ranked by the correlation (a) and AUC (b). The performance clearly becomes worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of the gene expression pattern may come from the several subtypes,

Figure 1: Results of classical methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over data sets $\mathcal{D}$ for the van't Veer method and DLDA.

as suggested in the cluster analysis, and it would make it difficult to select useful genes for prediction.
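The resampling check just described can be sketched compactly. The function below is our illustration on synthetic data: the sizes mirror the paper's setup (50 of the patients per draw, top 70 by absolute correlation, 100 trials), but the function name and the random data are assumptions.

```python
import numpy as np

def selection_rates(X, y, n_sub=50, top=70, trials=100, seed=0):
    """Repeatedly draw n_sub patients at random, rank genes by the
    absolute Pearson correlation of expression with the binary
    disease status, and record how often each gene lands in the
    top `top`. Returns per-gene selection rates in [0, 1]."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(trials):
        idx = rng.choice(n, size=n_sub, replace=False)
        Xs, ys = X[idx], y[idx]
        # |Pearson correlation| of each gene with disease status.
        r = np.abs([np.corrcoef(Xs[:, g], ys)[0, 1] for g in range(p)])
        counts[np.argsort(r)[::-1][:top]] += 1
    return counts / trials
```

A gene whose rate is far below 1 is one whose membership in the signature depends on which subjects happened to be sampled, which is the instability visible in Figure 5.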

Finally, we calculate the mutual coherence based on the genes in MammaPrint and the total 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM 014889 and NM 014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes in the total 5420 genes and in the 70 genes of MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone

Figure 2: Results of boosting methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over data sets $\mathcal{D}$ for AdaBoost and AUCBoost.

to select genes whose gene expression patterns are similar to each other. This would be one reason why we obtain multiple solutions from the ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.

4. Discussion and Concluding Remarks

In this paper we have addressed an important classification issue using microarray data. Our results present the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that the performance of several gene sets was almost the same. This behavior was confirmed with the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, some non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and

Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients. (a) $D_{1-70}$ (MammaPrint), (b) $D_{11-80}$, clearly showing some subtypes, and (c) $D_{111-180}$, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.

the bad prognosis groups from the clustering expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into known subtypes, but considering subtypes in feature selection could be future work.

Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line).

Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.

Figure 6: The values of AUC (left panel) and error rate (right panel) calculated by DLDA using genes ranked in the top 70 by the correlation coefficients (a) and by AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.

Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is

not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.
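The pitfall can be made concrete with a toy comparison: a plain top-$k$ ranking keeps near-duplicate genes, whereas a greedy redundancy filter, shown here purely as an illustrative remedy and not as a procedure from this paper, skips any candidate that is too correlated with a gene already chosen.

```python
import numpy as np

def top_k(score, k):
    """Plain ranking: the k highest-scoring genes, duplicates and all."""
    return [int(g) for g in np.argsort(score)[::-1][:k]]

def top_k_filtered(X, score, k, max_cor=0.8):
    """Walk down the ranked list and skip any gene whose absolute
    Pearson correlation with an already selected gene exceeds max_cor."""
    selected = []
    for g in np.argsort(score)[::-1]:
        if all(abs(np.corrcoef(X[:, g], X[:, s])[0, 1]) <= max_cor
               for s in selected):
            selected.append(int(g))
        if len(selected) == k:
            break
    return selected

# Tiny example: columns 0 and 1 are identical; column 2 is uncorrelated.
a = np.array([1.0, -1.0, 1.0, -1.0])
b = np.array([1.0, 1.0, -1.0, -1.0])
X = np.column_stack([a, a, b])
score = np.array([3.0, 2.0, 1.0])
print(top_k(score, 2))              # [0, 1]: the duplicate survives
print(top_k_filtered(X, score, 2))  # [0, 2]: the duplicate is skipped
```

The plain ranking returns two copies of the same signal, while the filtered version trades the second copy for an independent gene, which is the distinction the ranking procedure alone cannot make.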

One approach to tackling the high dimensionality of the data is to apply statistical methods in consideration of background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the resultant outcome of the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of the genes unstable, leading to the multiple solutions for prediction rules. As seen in Figure 5 and suggested by

Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distribution of the correlation for the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the pairs of genes based on which the correlations are calculated, standardized between 0 and 1 for clear view.

Ein-Dor et al. [19], the gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of the total genes, would be useful to capture the mutual relations among genes and to identify subgroups of genes informative for the prediction problem. The combination of unsupervised learning and supervised learning is a promising way to resolve the difficulty involved in high-dimensional data.

We addressed the challenges and difficulties regarding pattern recognition in gene expression. The main problem is caused by high-dimensional data sets. This is not only a problem of microarray data but also of RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] presented a clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data, whose dimension is also quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data would have the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.

Conflict of Interests

The authors have declared that no conflict of interests exists.

References

[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.

[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.

[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267–288, 1996.

[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.

[5] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.

[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.

[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.

[8] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.

[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.

[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.

[37] D M Witten ldquoClassification and clustering of sequencing datausing a Poisson modelrdquo The Annals of Applied Statistics vol 5no 4 pp 2493ndash2518 2011

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Page 8: Multiple Suboptimal Solutions for Prediction Rules in Gene ... · For this, the machine learning approach is successfully exploited including sup-portvectormachine,boostinglearningalgorithm,andthe

8 Computational and Mathematical Methods in Medicine

Figure 1: Results of classical methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets D (patients 1–70, 41–120, 81–150, 121–190, and 161–230) for the van't Veer method and DLDA, respectively. Training and test curves are shown in each panel.

as suggested in the cluster analysis, and it would make it difficult to select useful genes for prediction.

Finally, we calculate the mutual coherence based on the genes in MammaPrint and the total 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM_014889 and NM_014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one implicated in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs differs between the two sets, but the horizontal axis is restricted to the interval [0, 1] for clarity and easy comparison. The difference between the black and red curves indicates that the MammaPrint gene set is far more homogeneous in its expression patterns than the total set of 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone

Figure 2: Results of boosting methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets D for AdaBoost and AUCBoost, respectively.

to select genes whose expression patterns are similar to each other. This would be one reason why we obtain multiple solutions from ranking methods based on correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even within the MammaPrint gene set.
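The mutual coherence discussed above, that is, the largest absolute Pearson correlation over all pairs of gene expression profiles, can be sketched in a few lines. The following is a minimal illustration on toy data, not the code used in the paper:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def mutual_coherence(genes):
    """Largest |Pearson correlation| over all pairs of gene profiles."""
    return max(abs(pearson(genes[i], genes[j]))
               for i in range(len(genes))
               for j in range(i + 1, len(genes)))

# toy data: three genes measured on five patients (hypothetical values)
g1 = [1.0, 2.0, 3.0, 4.0, 5.0]
g2 = [2.1, 3.9, 6.2, 8.0, 9.9]   # nearly proportional to g1
g3 = [5.0, 1.0, 4.0, 2.0, 3.0]
print(round(mutual_coherence([g1, g2, g3]), 3))
```

A coherence close to 1, as for MammaPrint's 0.984, signals that the gene set contains nearly collinear profiles.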

4. Discussion and Concluding Remarks

In this paper, we have addressed an important classification issue in microarray data. Our results demonstrate the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that several gene sets achieved essentially the same performance. This behavior was confirmed for the van't Veer method, DLDA, AdaBoost, and AUCBoost. Remarkably, some non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and

Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients. (a) D1–70 (MammaPrint); (b) D11–80, clearly showing some subtypes; and (c) D111–180, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.

the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns revealed subtypes in the expression patterns. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into the known subtypes, but taking subtypes into account for feature selection could be future work.

Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line), computed over the data sets 1–70, 41–120, 81–150, 121–190, and 161–230.

Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.

Figure 6: The values of AUC (left panel) and error rate (right panel) calculated by DLDA using the genes ranked in the top 70 by correlation coefficients (a) and by AUC (b). These values are based on randomly sampled 50 patients over 100 trials.

Microarray technology has become commonplace, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the p ≫ n problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.
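This pitfall is easy to reproduce on synthetic p ≫ n data: ranking genes by a simple two-sample statistic on two different subsamples of the same patients yields noticeably different top lists. The following sketch is purely illustrative; the sizes, the signal strength, and the score are made up and are not those used in the paper:

```python
import random

def t_score(values, labels):
    """Two-sample statistic: standardized difference of class means."""
    a = [v for v, l in zip(values, labels) if l == 1]
    b = [v for v, l in zip(values, labels) if l == 0]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / max(len(a) - 1, 1)
    vb = sum((v - mb) ** 2 for v in b) / max(len(b) - 1, 1)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5 + 1e-12)

def top_genes(X, labels, k):
    """Indices of the k genes with the largest |score|."""
    order = sorted(range(len(X)), key=lambda g: -abs(t_score(X[g], labels)))
    return set(order[:k])

random.seed(0)
n, p, k = 60, 500, 20
labels = [i % 2 for i in range(n)]
# synthetic expression matrix with p >> n; only the first 50 genes carry
# a weak class signal, the rest are pure noise
X = [[random.gauss(0.3 * labels[i] if g < 50 else 0.0, 1.0) for i in range(n)]
     for g in range(p)]

def subsample(idx):
    return [[X[g][i] for i in idx] for g in range(p)], [labels[i] for i in idx]

# rank on two random subsamples of 40 patients and compare the top-k sets
s1 = top_genes(*subsample(random.sample(range(n), 40)), k)
s2 = top_genes(*subsample(random.sample(range(n), 40)), k)
print("overlap of the two top-%d lists: %d genes" % (k, len(s1 & s2)))
```

The overlap is typically well below k, which is exactly the instability of ranking-based selection described above.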

One approach to this problem of high dimensionality is to apply statistical methods that take into account background medical or biological knowledge. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5, and as suggested by

Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint (NM_014889 versus NM_014968). (b) The distributions of the correlations for the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the gene pairs on which the correlations are calculated, standardized to the interval [0, 1] for clarity.

Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which uses the whole information in the total set of genes, would be useful for capturing the mutual relations among genes and for identifying subgroups of genes informative for the prediction problem. Combining unsupervised and supervised learning is a promising way to overcome the difficulty inherent in high-dimensional data.
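One simple way to let the correlation structure temper a supervised ranking is to walk down the ranked list and keep a gene only if it is not nearly collinear with a gene already kept. This is only a sketch of the combination idea; the threshold and the toy scores below are invented for illustration:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = (sum((v - mx) ** 2 for v in x)) ** 0.5
    sy = (sum((v - my) ** 2 for v in y)) ** 0.5
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def decorrelated_top(X, scores, k, threshold=0.8):
    """Walk genes in decreasing score; keep a gene only if its |correlation|
    with every gene already kept stays below the threshold."""
    kept = []
    for g in sorted(range(len(X)), key=lambda g: -scores[g]):
        if all(abs(pearson(X[g], X[h])) < threshold for h in kept):
            kept.append(g)
        if len(kept) == k:
            break
    return kept

# toy example: gene 1 duplicates gene 0, so it is skipped despite its rank
X = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]
print(decorrelated_top(X, scores=[3.0, 2.9, 1.0], k=2))  # → [0, 2]
```

A pure ranking would have returned genes 0 and 1, the redundant pair; the correlation filter trades the redundant runner-up for an independent signal.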

We have addressed the challenges and difficulties of pattern recognition for gene expression. The main problem is caused by high-dimensional data sets. This is a problem not only for microarray data but also for RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed by Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] presented clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs with RNA-seq data: its dimension is also quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. It is therefore not difficult to imagine that RNA-seq data suffer from the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.
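To make the connection to count data concrete, here is a loose Poisson-model classifier for RNA-seq-like counts: a per-class Poisson rate is estimated for each feature, and a new sample is assigned to the class with the larger Poisson log-likelihood. This is a toy sketch in the spirit of [37], not Witten's actual sparse Poisson LDA; the counts are hypothetical:

```python
import math

def train_poisson(counts, labels):
    """Estimate a per-class Poisson rate for each feature (class mean count)."""
    rates = {}
    for c in set(labels):
        rows = [x for x, l in zip(counts, labels) if l == c]
        rates[c] = [max(sum(col) / len(rows), 1e-8) for col in zip(*rows)]
    return rates

def classify(x, rates):
    """Assign x to the class with the larger Poisson log-likelihood.
    The log(x_i!) term is dropped since it is constant across classes."""
    def loglik(lam):
        return sum(xi * math.log(li) - li for xi, li in zip(x, lam))
    return max(rates, key=lambda c: loglik(rates[c]))

# toy counts for two features and two classes (hypothetical RNA-seq-like data)
counts = [[10, 1], [12, 0], [1, 9], [0, 11]]
labels = [0, 0, 1, 1]
rates = train_poisson(counts, labels)
print(classify([11, 1], rates))  # → 0
print(classify([1, 10], rates))  # → 1
```

With thousands of features and few samples, this estimator faces the same p ≫ n instability described above, which is why sparsity or feature screening remains necessary for RNA-seq as well.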

Conflict of Interests

The authors have declared that no conflict of interests exists.

References

[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.

[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.

[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267–288, 1996.

[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.

[5] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.

[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.

[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.

[8] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.

[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.

[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom

Page 9: Multiple Suboptimal Solutions for Prediction Rules in Gene ... · For this, the machine learning approach is successfully exploited including sup-portvectormachine,boostinglearningalgorithm,andthe

Computational and Mathematical Methods in Medicine 9

0

02

04

06

08

1AU

C

Data setsTrainingTest

1ndash70 41ndash120 81ndash150 121ndash190 161ndash230

Erro

r

0

02

04

06

08

1

Data setsTrainingTest

1ndash70 41ndash120 81ndash150 121ndash190 161ndash230

(a) AdaBoost

0

02

04

06

08

1

AUC

Data setsTrainingTest

1ndash70 41ndash120 81ndash150 121ndash190 161ndash230

Erro

r

0

02

04

06

08

1

Data setsTrainingTest

1ndash70 41ndash120 81ndash150 121ndash190 161ndash230

(b) AUCBoost

Figure 2 Results of boosting methods (a) and (b) show the AUC values (left panel) and the error rate (right panel) over data sets D forAdaBoost and AUCBoost

to select genes such that their gene expression patterns aresimilar to each other This would be one reason why we havemultiple solutions after the ranking methods based on thecorrelation coefficients It is also interesting to see that thereare a few gene pairs with very low correlation even in gene setof MammaPrint

4 Discussion and Concluding Remarks

In this paper we have addressed an important classificationissue using microarray data Our results present the existence

of multiple suboptimal solutions for choosing classificationpredictors Classification results showed that the perfor-mance of several gene sets was the same This behaviorwas confirmed by the vanrsquot Veer method DLDA AdaBoostand AUCBoost Amazingly nontop ranked gene sets showedbetter performance than the top ranked gene sets Theseresults indicate that the ranking method is very unstable forfeature selection in microarray data

To examine the expression pattern of selected predictorswe performed clustering this added support to the existenceof multiple solutions We easily recognized the good and

10 Computational and Mathematical Methods in Medicine

(a) (b)

(c)

Figure 3 Heat maps of gene expression data with rows representing genes and columns representing patients (a) 1198631minus70

(MammaPrint)(b) 119863

11minus80clearly showing some subtypes and (c) 119863

111minus180with the highest BHI regarding metastases The blue bars indicate patients with

metastases red bars those with ER positive and orange bars those with PR positive

the bad prognosis groups from the nontop ranked genesets clustering expression pattern Interestingly some clus-tering patterns showed subtypes in the expression patternAs described by Soslashrlie et al [31] and Parker et al [32]breast cancer could be categorized into at least five subtypes

Luminal A Luminal B normal breast-like HER2 and basal-like Selected ranked gene sets contain heterogeneous groupsOur data set does not include the necessary pathologicaldata to categorize genes into known subtypes but consideringsubtypes for feature selection could be future work

Computational and Mathematical Methods in Medicine 11

SetBH

I

MetastasesERp

PRp

0

02

04

06

08

1

1ndash70 41ndash120 81ndash150 121ndash190 161ndash230

Figure 4 Biological homogeneity index (BHI) for metastases (solid line) ER positive (dashed line) and PR positive (dotted line)

Rate

of r

anki

ng in

the t

op 7

0

0

02

04

06

08

1

AL080059

Con

tig63649

RCC

ontig46218

RCN

M016359

AA555029

RCN

M003748

Con

tig38288

RCN

M003862

Con

tig28552

RCC

ontig32125

RCU82987

AL137718

AB037863

NM020188

NM020974

NM000127

NM002019

NM002073

NM000436

NM004994

Con

tig55377

RCC

ontig35251

RCC

ontig25991

NM003875

NM006101

NM003882

NM003607

AF0

7351

9A

F052

162

NM000849

Con

tig32185

RCN

M016577

Con

tig48328

RCC

ontig46223

RCN

M015984

NM006117

AK0

0074

5C

ontig40831

RCN

M003239

NM014791

X056

10N

M016448

NM018401

NM000788

Con

tig51464

RCA

L080

079

NM006931

AF2

5717

5N

M014321

NM002916

Con

tig55725

RCC

ontig24252

RCA

F201

951

NM005915

NM001282

Con

tig56457

RCN

M000599

NM020386

NM014889

AF0

5503

3N

M006681

NM007203

Con

tig63102

RCN

M003981

Con

tig20217

RCN

M001809

Con

tig2399

RCN

M004702

NM007036

NM018354

(a)

Rate

of r

anki

ng in

the t

op 7

0

0

02

04

06

08

1

AL080059

Con

tig63649

RCC

ontig

46218

RCN

M016359

AA555029

RCN

M003748

Con

tig38288

RCN

M003862

Con

tig28552

RCC

ontig

32125

RCU82987

AL1

37718

AB0

37863

NM

020188

NM

020974

NM

000127

NM

002019

NM

002073

NM

000436

NM

004994

Con

tig55377

RCC

ontig

35251

RCC

ontig

25991

NM

003875

NM

006101

NM

003882

NM

003607

AF0

7351

9A

F052

162

NM

000849

Con

tig32185

RCN

M016577

Con

tig48328

RCC

ontig

46223

RCN

M015984

NM

006117

AK0

0074

5C

ontig

40831

RCN

M003239

NM

014791

X056

10N

M016448

NM

018401

NM

000788

Con

tig51464

RCA

L080

079

NM

006931

AF2

5717

5N

M014321

NM

002916

Con

tig55725

RCC

ontig

24252

RCA

F201

951

NM

005915

NM

001282

Con

tig56457

RCN

M000599

NM

020386

NM

014889

AF0

5503

3N

M006681

NM

007203

Con

tig63102

RCN

M003981

Con

tig20217

RCN

M001809

Con

tig2399

RCN

M004702

NM

007036

NM

018354

(b)

Figure 5The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b) based on randomly sampled 50 patientsThehorizontal axis denotes 70 genes used by MammaPrint

12 Computational and Mathematical Methods in Medicine

0 20 40 60 80 100

1

05

06

07

08

09

1AU

C

0 20 40 60 80 100

0

01

02

03

04

05

Erro

r

Number of trialsNumber of trials

TrainingTest

TrainingTest

(a)

0 20 40 60 80 100

05

06

07

08

09

1

Number of trialsNumber of trials

AUC

0 20 40 60 80 100

0

01

02

03

04

05

Erro

r

TrainingTest

TrainingTest

(b)

Figure 6 The values of AUC (left panel) and error rate (right panel) calculated by DLDA using genes ranked in top 70 by the correlationcoefficients (a) and by AUC (b) These values are calculated based on randomly sampled 50 patients over 100 trials

Microarray technology has become common technologyand a lot of biomarker candidates are proposed Howevernot many researchers are aware that multiple solutions canexist in a single data set This instability is related to thehigh-dimensional nature of microarray Besides the rankingmethod easily returns different results for different sets ofsubjects There may be several explanations for this Themain problem is the ultra-high dimension ofmicroarray dataThis problem is known as 119901 ≫ 119899 problem The ultra-highdimension gene expression data contains too many similarexpression patterns causing redundancy This redundancy is

not omitted by ranking procedure This is the critical pitfallfor gene selection in microarray data

One approach to tackle this problem about the highdimensionality of the data is to apply statistical methods inconsideration with the background knowledge of medical orbiological data As seen in Figure 3 or demonstrated by Soslashrlieet al [31] the subtypes of breast cancer are closely relatedwith the resultant outcome of the patientrsquos prognosis Thisheterogeneity of data is thought as one factor that makes theranking of the genes unstable leading to themultiple solutionfor prediction rules As seen in Figure 5 and suggested by

Computational and Mathematical Methods in Medicine 13

0 02 04 06

0

02

04

06

minus02

minus02NM 014889

NM

0149

68

(a)

Index of pairs of genes

Cor

relat

ion

5420 genes70 genes

0

02

04

06

08

1

0 02 04 06 08 1

(b)

Figure 7 (a) Scatter plots of two pairs of genes with highest mutualcoherence (0984) among 70 genes of MammaPrint (b)The distributionof the correlation of total 5420 genes (black) and 70 genes ofMammaPrint (red)The horizontal axis denotes the index of pairs of genes basedon which the correlations are calculated The horizontal axis is standardized between 0 and 1 for clear view

Ein-Dor et al [19] the gene ranking based on two-samplestatistic such as correlation coefficients has large amount ofvariability indicating the limitation of single gene analysisThe clustering analysis that deals with the whole informationof total genes would be useful to capture the mutual relationamong genes and to identify the subgroups of informativegenes for the prediction problem The combination withunsupervised learning and supervised learning is a promisingway to solve the difficulty involved in the high-dimensionaldata

We addressed the challenges and difficulties regarding pattern recognition for gene expression. The main problem is caused by high-dimensional data sets. This is a problem not only for microarray data but also for RNA sequencing. RNA sequencing (RNA-seq) technologies have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] presented clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: its dimension is also quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. It is therefore not difficult to imagine that RNA-seq data suffer from the same difficulty as microarrays. We have to be aware of this and tackle the difficulty using the insights learned from microarray data.
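As a rough illustration of the Poisson modeling idea for count data, the following is a simple Poisson naive-Bayes sketch on simulated counts (not Witten's actual estimator [37]; the pseudocount and all dimensions are invented):

```python
import numpy as np

def fit_poisson_rates(X, y, pseudo=1.0):
    """Per-class Poisson rate for each gene, with a small pseudocount
    to keep rates strictly positive."""
    classes = np.unique(y)
    rates = np.array([(X[y == c].sum(0) + pseudo) /
                      (np.sum(y == c) + pseudo) for c in classes])
    return classes, rates

def predict(X, classes, rates):
    # Poisson log-likelihood per class; the log(x!) term is the same for
    # every class and can be dropped.
    loglik = X @ np.log(rates).T - rates.sum(1)
    return classes[np.argmax(loglik, axis=1)]

rng = np.random.default_rng(3)
n_per, p = 50, 200
lam0 = rng.uniform(1, 10, p)
lam1 = lam0.copy()
lam1[:10] *= 3.0                       # 10 genes over-expressed in class 1
X = np.vstack([rng.poisson(lam0, (n_per, p)),
               rng.poisson(lam1, (n_per, p))])
y = np.repeat([0, 1], n_per)

classes, rates = fit_poisson_rates(X, y)
acc = np.mean(predict(X, classes, rates) == y)
print(acc)
```

Because the features are again scored marginally, the redundancy and instability issues discussed above for microarrays carry over unchanged to such count models.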

Conflict of Interests

The authors declare that no conflict of interests exists.

References

[1] E. S. Lander, L. M. Linton, B. Birren, et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.

[2] J. C. Venter, M. D. Adams, E. W. Myers, et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.

[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.

[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.

[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.

[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.

[7] L. J. van 't Veer, H. Dai, et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.

[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.

[10] R. A. Irizarry, B. Hobbs, F. Collin, et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo, et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[13] S. Paik, S. Shak, G. Tang, et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.

[14] F. Cardoso, L. van 't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels, et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301–320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang, et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.


Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients: (a) D1–70 (MammaPrint), (b) D11–80, clearly showing some subtypes, and (c) D111–180, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those who are ER positive, and orange bars those who are PR positive.

the bad prognosis groups from the non-top-ranked gene sets by clustering the expression patterns. Interestingly, some clustering patterns revealed subtypes in the expression data. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into the known subtypes, but taking subtypes into account for feature selection could be future work.

Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line), computed for gene sets ranked 1–70, 41–120, 81–150, 121–190, and 161–230.
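One common form of the BHI scores each cluster by the fraction of gene pairs that share at least one functional class and averages over the clusters. The sketch below uses made-up annotations purely for illustration; the index in Figure 4 is computed against real clinical categories (metastases, ER, PR):

```python
import itertools

def bhi(clusters, annotation):
    """Biological homogeneity index: per-cluster fraction of annotated
    gene pairs that share at least one functional class, averaged over
    clusters. `annotation` maps gene -> set of functional classes."""
    scores = []
    for genes in clusters:
        annotated = [g for g in genes if annotation.get(g)]
        pairs = list(itertools.combinations(annotated, 2))
        if not pairs:
            continue
        hits = sum(bool(annotation[a] & annotation[b]) for a, b in pairs)
        scores.append(hits / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0

# Toy annotations: g1, g2, g5 relate to metastasis; g3, g4, g5 to ER.
ann = {"g1": {"metastasis"}, "g2": {"metastasis"}, "g3": {"ER"},
       "g4": {"ER"}, "g5": {"metastasis", "ER"}}
print(bhi([["g1", "g2", "g3"], ["g4", "g5"]], ann))
```

A higher BHI means the clusters are more biologically coherent, which is why it serves here as an external check on clusters derived purely from expression values.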

Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.
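The instability that Figure 5 documents can be mimicked with simulated data: resample 50 patients many times, rank the genes by correlation with the outcome, and record how often each gene lands in the top 70. All cohort sizes and effect strengths below are invented to be loosely microarray-like:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy cohort: 78 subjects, 500 genes, a weak signal in the first 70 genes.
n, p, k = 78, 500, 70
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, p))
X[:, :k] += 0.4 * y[:, None]          # modest effect, as in real data

# Repeat the ranking on random subsets of 50 patients and record how
# often each gene lands in the top 70.
trials, counts = 100, np.zeros(p)
for _ in range(trials):
    idx = rng.choice(n, 50, replace=False)
    score = np.abs([np.corrcoef(X[idx, j], y[idx])[0, 1]
                    for j in range(p)])
    counts[np.argsort(score)[::-1][:k]] += 1

rates = counts / trials
# Even the truly informative genes are far from always in the top 70.
print(rates[:k].mean().round(2), rates[k:].mean().round(2))
```

With weak per-gene effects, the signal genes are selected more often than noise genes but nowhere near always, which is exactly the pattern of selection rates well below 1 seen for the MammaPrint genes.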

Figure 6: The values of AUC (left panels) and error rate (right panels) calculated by DLDA using the genes ranked in the top 70 by the correlation coefficients (a) and by AUC (b). These values are calculated from randomly sampled 50 patients over 100 trials.
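DLDA (diagonal linear discriminant analysis), the classifier evaluated in Figure 6, assigns a sample to the nearest class mean under a diagonal covariance model, ignoring gene-gene covariances; that simplification is what keeps it usable when p ≫ n. A self-contained sketch on simulated data (the split and effect sizes are made up, and this is a minimal version, not the paper's exact pipeline):

```python
import numpy as np

def dlda_fit(X, y):
    """Class means plus one pooled per-gene variance (diagonal model)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(0) for c in classes])
    resid = X - means[np.searchsorted(classes, y)]
    var = resid.var(0) + 1e-8          # pooled diagonal variance
    return classes, means, var

def dlda_predict(X, classes, means, var):
    # Squared distance to each class mean, scaled by per-gene variance.
    d = ((X[:, None, :] - means[None]) ** 2 / var).sum(-1)
    return classes[np.argmin(d, axis=1)]

rng = np.random.default_rng(5)
n, p = 100, 300
y = np.repeat([0, 1], 50)
X = rng.normal(size=(n, p))
X[y == 1, :20] += 1.0                  # 20 differentially expressed genes

train = rng.permutation(n)[:50]
test = np.setdiff1d(np.arange(n), train)
classes, means, var = dlda_fit(X[train], y[train])
err = np.mean(dlda_predict(X[test], classes, means, var) != y[test])
print(err)
```

Repeating the train/test split many times, as in Figure 6, shows how much of the apparent accuracy of a selected gene set is split-to-split noise rather than a stable property of the genes.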


[12] T R Golub D K Slonim P Tamayo et al ldquoMolecular classi-fication of cancer class discovery and class prediction by geneexpression monitoringrdquo Science vol 286 no 5439 pp 531ndash5371999

[13] S Paik S Shak G Tang et al ldquoA multigene assay to predictrecurrence of tamoxifen-treated node-negative breast cancerrdquoThe New England Journal of Medicine vol 351 no 27 pp 2817ndash2826 2004

[14] F Cardoso L Vanrsquot Veer E Rutgers S Loi S Mook and MJ Piccart-Gebhart ldquoClinical application of the 70-gene profilethe MINDACT trialrdquo Journal of Clinical Oncology vol 26 no5 pp 729ndash735 2008

[15] C Fan D S Oh L Wessels et al ldquoConcordance among gene-expression-based predictors for breast cancerrdquoTheNewEnglandJournal of Medicine vol 355 no 6 pp 560ndash569 2006

[16] M Basseville ldquoDistance measures for signal processing andpattern recognitionrdquo Signal Processing vol 18 no 4 pp 349ndash369 1989

[17] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002

[18] H Zou and T Hastie ldquoRegularization and variable selection viathe elastic netrdquo Journal of the Royal Statistical Society B vol 67no 2 pp 301ndash320 2005

[19] L Ein-Dor I Kela GGetz DGivol and EDomany ldquoOutcomesignature genes in breast cancer is there a unique setrdquoBioinformatics vol 21 no 2 pp 171ndash178 2005

[20] R A Fisher ldquoThe use of multiple measurements in taxonomicproblemsrdquo Annals of Eugenics vol 7 no 2 pp 179ndash188 1936

[21] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990

[22] O Komori and S Eguchi ldquoA boosting method for maximizingthe partial area under the ROC curverdquo BMCBioinformatics vol11 article 314 2010

[23] T Soslashrlie Perou C M Perou R Tibshiran et al ldquoGene expres-sion patterns of breast carcinomas distinguish tumor subclasseswith clinical implicationsrdquo Proceedings of the National Academyof Sciences vol 98 no 19 pp 10869ndash10874 2001

[24] M Yousef S Jung L C Showe and M K Showe ldquoLearningfrom positive examples when the negative class is undeter-mined-microRNA gene identificationrdquo Algorithms for Molecu-lar Biology vol 3 no 1 article 2 2008

[25] A B Gardner A M Krieger G Vachtsevanos and B LittldquoOne-class novelty detection for seizure analysis from intracra-nial EEGrdquo Journal of Machine Learning Research vol 7 pp1025ndash1044 2006

[26] T Hastie R Tibshirani and J FriedmanThe Elements of Statis-tical Learning Data Mining Inference and Prediction SpringerNew York NY USA 2001

[27] M S PepeThe Statistical Evaluation of Medical Tests for Classi-fication and Prediction Oxford University Press New York NYUSA 2003

[28] O Komori ldquoA boosting method for maximization of the areaunder the ROC curverdquoAnnals of the Institute of StatisticalMath-ematics vol 63 no 5 pp 961ndash979 2011

[29] H M Wu ldquoOn biological validity indices for soft clusteringalgorithms for gene expression datardquo Computational Statisticsand Data Analysis vol 55 no 5 pp 1969ndash1979 2011

[30] M Elad Sparse and Redundant Representations fromTheory toApplications in Signal and Image Processing Springer NewYorkNY USA 2010

[31] T Soslashrlie C M Perou R Tibshiran et al ldquoGene expression pat-terns of breast carcinomas distinguish tumor subclasses withclinical implicationsrdquo Proceedings of the National Academy ofSciences vol 98 no 19 pp 10869ndash10874 2001

[32] J S Parker M Mullins M C U Cheang et al ldquoSupervised riskpredictor of breast cancer based on intrinsic subtypesrdquo Journalof Clinical Oncology vol 27 no 8 pp 1160ndash1167 2009

[33] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008

[34] Z Wang M Gerstein and M Snyder ldquoRNA-Seq a revolution-ary tool for transcriptomicsrdquo Nature Reviews Genetics vol 10no 1 pp 57ndash63 2009

[35] S Pepke B Wold and A Mortazavi ldquoComputation for ChIP-seq and RNA-seq studiesrdquo Nature Methods vol 6 no 11 ppS22ndashS32 2009

[36] B T Wilhelm and J R Landry ldquoRNA-Seq-quantitative mea-surement of expression through massively parallel RNA-sequencingrdquoMethods vol 48 no 3 pp 249ndash257 2009

[37] D M Witten ldquoClassification and clustering of sequencing datausing a Poisson modelrdquo The Annals of Applied Statistics vol 5no 4 pp 2493ndash2518 2011

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MEDIATORSINFLAMMATION

of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Behavioural Neurology

EndocrinologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Disease Markers

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

OncologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Oxidative Medicine and Cellular Longevity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational and Mathematical Methods in Medicine

OphthalmologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Diabetes ResearchJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Research and TreatmentAIDS

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Gastroenterology Research and Practice

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom


12 Computational and Mathematical Methods in Medicine

Figure 6: The values of the AUC (left panel) and the error rate (right panel) calculated by DLDA using the genes ranked in the top 70 by the correlation coefficients (a) and by the AUC (b). These values are calculated from 50 randomly sampled patients over 100 trials; training and test curves are shown in each panel.

Microarray technology has become commonplace, and many biomarker candidates have been proposed. However, few researchers are aware that multiple solutions can exist within a single data set. This instability is related to the high-dimensional nature of microarray data; moreover, the ranking method easily returns different results for different sets of subjects. The main cause is the ultra-high dimension of microarray data, known as the p ≫ n problem: ultra-high-dimensional gene expression data contain many genes with very similar expression patterns, and the resulting redundancy is not removed by a ranking procedure. This is the critical pitfall for gene selection in microarray data.
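The instability of ranking-based selection can be illustrated with a small simulation (hypothetical data; a single split-half comparison in the spirit of, not identical to, the 100-trial experiment above). With pure-noise labels, the top-70 lists chosen on two halves of the subjects barely overlap:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data mimicking the p >> n setting: 5420 genes, 78 subjects.
n, p, top_k = 78, 5420, 70
X = rng.normal(size=(n, p))          # expression matrix (subjects x genes)
y = rng.integers(0, 2, size=n)       # binary phenotype labels

def top_genes_by_correlation(X, y, k):
    """Rank genes by |Pearson correlation| with the label; keep the top k."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc**2).sum(axis=0)) * np.sqrt((yc**2).sum()))
    return set(np.argsort(-np.abs(r))[:k])

# Re-rank on two random halves of the subjects and compare the gene lists.
idx = rng.permutation(n)
half_a = top_genes_by_correlation(X[idx[:n // 2]], y[idx[:n // 2]], top_k)
half_b = top_genes_by_correlation(X[idx[n // 2:]], y[idx[n // 2:]], top_k)
overlap = len(half_a & half_b)
print(f"top-{top_k} overlap between the two halves: {overlap}")
```

Because 70 genes are drawn from 5420, a chance overlap of roughly one gene is expected here, which is exactly the kind of disagreement that ranking on different subject sets produces.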

One approach to the high dimensionality of the data is to apply statistical methods together with background knowledge from medicine or biology. As seen in Figure 3 and demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patients' prognoses. This heterogeneity of the data is thought to be one factor that makes the ranking of the genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5 and suggested by


[Figure 7: panel (a) is a scatter plot of the gene pair NM 014889 and NM 014968; panel (b) shows the correlation curves for the 5420 genes (black) and the 70 genes (red).]

Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distribution of the correlations for the total 5420 genes (black) and for the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the pairs of genes on which the correlations are calculated, standardized to the interval [0, 1] for a clear view.

Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of all the genes, would be useful for capturing the mutual relations among genes and for identifying subgroups of genes informative for the prediction problem. The combination of unsupervised and supervised learning is a promising way to address the difficulties involved in high-dimensional data.
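The mutual coherence reported in Figure 7 is simply the largest absolute pairwise Pearson correlation within the gene set. A minimal sketch on hypothetical data (one collinear pair is planted to mimic the 0.984 observed for MammaPrint):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical expression matrix standing in for the 70 MammaPrint genes.
n_subjects, n_genes = 78, 70
X = rng.normal(size=(n_subjects, n_genes))
# Plant one highly collinear gene pair to mimic the observed coherence.
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n_subjects)

# Mutual coherence: the largest |Pearson correlation| over all gene pairs.
R = np.corrcoef(X, rowvar=False)     # 70 x 70 correlation matrix
np.fill_diagonal(R, 0.0)             # exclude self-correlations
coherence = np.abs(R).max()
i, j = np.unravel_index(np.abs(R).argmax(), R.shape)
print(f"mutual coherence = {coherence:.3f} (gene pair {i}, {j})")
```

A coherence this close to 1 means the two genes carry nearly identical information, which is precisely the redundancy that a per-gene ranking cannot eliminate.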

We have addressed the challenges and difficulties in the pattern recognition of gene expression. The main problem is caused by high-dimensional data sets, and it is not limited to microarray data: RNA sequencing raises it as well. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] proposed clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: although the running cost of RNA sequencing has decreased, the number of features is still much larger than the number of observations. It is therefore easy to imagine that RNA-seq data suffer from the same difficulty as microarray data, and we must be aware of it and tackle it using the insights learned from microarray data.
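The idea of Poisson-model classification for count data can be caricatured as follows (hypothetical counts and a naive per-gene rate estimate; this is a simplified sketch, not Witten's actual PLDA estimator): fit per-gene Poisson rates for each class on training samples, then assign a test sample to the class with the larger Poisson log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical RNA-seq counts: two classes differing on a few genes.
n_per_class, p = 40, 200
lam0 = np.full(p, 5.0)
lam1 = lam0.copy()
lam1[:10] *= 3.0                     # first 10 genes up-regulated in class 1
X = np.vstack([rng.poisson(lam0, size=(n_per_class, p)),
               rng.poisson(lam1, size=(n_per_class, p))])
y = np.repeat([0, 1], n_per_class)

# Split into train/test halves and fit per-class rates (with smoothing).
train = np.arange(0, 2 * n_per_class, 2)
test = np.arange(1, 2 * n_per_class, 2)
rates = np.array([X[train][y[train] == c].mean(axis=0) + 0.5 for c in (0, 1)])

def poisson_loglik(x, lam):
    # log P(x | lam) up to the x!-term, which cancels between classes
    return (x * np.log(lam) - lam).sum()

pred = np.array([int(poisson_loglik(x, rates[1]) > poisson_loglik(x, rates[0]))
                 for x in X[test]])
acc = (pred == y[test]).mean()
print(f"test accuracy: {acc:.2f}")
```

Even this caricature makes the p ≫ n point: the classifier estimates 2p rate parameters from far fewer samples, so regularization or gene selection remains essential for real RNA-seq data.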

Conflict of Interests

The authors have declared that no conflict of interests exists.

References

[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.

[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.

[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267–288, 1996.

[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.

[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.

[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.

[7] L. J. van 't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.

[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.

[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.

[14] F. Cardoso, L. van 't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.



Page 14: Multiple Suboptimal Solutions for Prediction Rules in Gene ... · For this, the machine learning approach is successfully exploited including sup-portvectormachine,boostinglearningalgorithm,andthe

Computational and Mathematical Methods in Medicine

[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.

[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.

[14] F. Cardoso, L. Van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.

[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.

[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.

[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.

[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.

[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.

[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.

[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.

[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.

[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.

[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.

[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.

[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.

[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.

[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.

[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.

[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.
