Hindawi Publishing Corporation, Computational and Mathematical Methods in Medicine, Volume 2013, Article ID 798189, 14 pages, http://dx.doi.org/10.1155/2013/798189
Research Article: Multiple Suboptimal Solutions for Prediction Rules in Gene Expression Data
Osamu Komori,1 Mari Pritchard,2 and Shinto Eguchi1
1 The Institute of Statistical Mathematics, Midori-cho, Tachikawa, Tokyo 190-8562, Japan; 2 CLC Bio Japan, Inc., Daikanyama Park Side Village 204, 9-8 Sarugakucho, Shibuya-ku, Tokyo 150-0033, Japan
Correspondence should be addressed to Osamu Komori, komori@ism.ac.jp
Received 30 January 2013; Revised 22 March 2013; Accepted 23 March 2013
Academic Editor: Shigeyuki Matsui
Copyright © 2013 Osamu Komori et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper discusses mathematical and statistical aspects of analysis methods applied to microarray gene expression data. We focus on pattern recognition to extract informative features embedded in the data for prediction of phenotypes. It has been pointed out that there are severely difficult problems due to the imbalance between the number of observed genes and the number of observed subjects. We reanalyze published microarray gene expression data and detect many other gene sets with almost the same performance. We conclude that, at the current stage, it is not possible to extract only the informative genes with high performance from among all the observed genes. We investigate why this difficulty persists even though analysis methods and learning algorithms have been actively proposed in statistical machine learning. We focus on the mutual coherence, or the maximal absolute value of the Pearson correlation between two genes, and describe the distributions of the correlation for the selected set of genes and for the total set. We show that the problem of finding informative genes in high-dimensional data is ill-posed and that the difficulty is closely related to the mutual coherence.
1. Introduction
The Human Genome Project [1, 2] has driven genome technology forward to exhaustive observation. The accumulation of genome knowledge leads us to study gene and protein expressions to elucidate the functions of genes and the interactions among genes. We overview the progress of microarray technology for gene expressions and the analysis methods based on gene expression towards good prediction of phenotypes. Analysis of gene expressions has been rapidly developed and enhanced by microarray technology. At the current stage, this progress enables us to observe all the gene expressions of subjects in an exhaustive manner. It is opening an area of bioinformatics to discover the relation of phenotypes with gene expressions, where phenotypes include degrees and stages of pathological response, treatment effect, and prognosis of patients. We anticipate that a breakthrough in medical and clinical science will come from key discoveries that elucidate the associations between phenotypes and gene expressions. For this, the machine learning approach has been successfully exploited, including the support vector machine, boosting learning algorithms, and the Bayesian network.
However, there exists a difficult problem in analyses of gene expression data, in which the number of genes is much larger than the number of samples. Such an extreme imbalance between the data dimension and the sample size is a typical characteristic of genomics and omics data; this tendency will become more apparent on account of new technology in the near future. Appropriate methods for solving the problem are urgently required; however, there are severe difficulties in attaining a comprehensive understanding of phenotypes based on the set of exhaustive gene expressions. We face many false positive genes sinking the true positive genes in the prediction, which creates an impediment to building individualized medicine. There are a vast number of proposals with complex procedures that challenge the difficult problem of extracting robust and exhaustive information from gene expressions. Sparse feature selection is one example; it is considered in order to avoid overfitting and to obtain an interpretable model for gene expression data. In the regression context, Tibshirani [3] proposed the Lasso, which achieves feature selection by shrinking some coefficients to exactly zero. In the image processing area, the sparse model is also considered for redundant representation of data by Donoho and Elad [4] and Candes et al. [5].
Prediction performance has now attained a constant level; however, there are still unsolved problems in prediction when the observed number of genes is extremely larger than the number of subjects. This causes difficulties in which superfluous genes are discovered whose expressions have weak power for prediction. As a result, almost no microarray data analysis has been completely confirmed by biological replication, as discussed in Allison et al. [6].
We address the microarray data analysis discussed in van't Veer et al. [7], in which a set of 70 genes is selected as informative biomarkers for prediction of prognosis in breast cancer patients. The result helped to build the prediction kit named "MammaPrint", which was approved by the FDA in 2007. We reanalyze the data and discuss how, surprisingly, many other gene sets show almost the same prediction performance as the published gene set. Thus their gene set does not uniquely show a reasonable power of prediction, and we suggest that it is impossible to build a universal prediction rule by efficient selection of genes. In particular, the ranking procedure according to a measure of association with the phenotype is very fragile to data partition. We discuss the statistical reason why the data analysis of gene expression involves such difficulties of multiple solutions. We calculate the value of the mutual coherence for the genes in MammaPrint and show the essential difficulty of determining informative genes among the huge number of genes used in the data analysis.
This paper is organized as follows. In Section 2 we describe pattern recognition for gene expression and overview currently proposed methods. In Section 3 we point out the current difficulty in gene expression analysis using a real data set. Finally, we discuss the results and future work in Section 4.
2. Pattern Recognition from Gene Expressions
In this section we overview pattern recognition for gene expression. First we describe the DNA microarray technique, which is the most widely used method to measure the expression of a large number of genes simultaneously. Then we present the current methods for gene selection using microarrays.
2.1. DNA Microarray. Microarray has become a widely used technology in a variety of areas. A microarray measures the amount of mRNA, or gene expression. There are two major technologies available for gene expression measurement. One is the GeneChip system provided by Affymetrix, Inc.; GeneChip uses prefabricated short lengths of oligonucleotide. The other is the cDNA array, which was originally developed by Schena et al. [9]. We briefly describe both technologies.
GeneChip uses pairs of short oligonucleotides attached to a solid surface such as glass, plastic, or silicon. The short pair of oligonucleotides is called the probe pair. Each probe pair is composed of a perfect match (PM) probe and a mismatch (MM) probe. The PM is a section of the mRNA of interest, and the MM is created by changing the middle (13th) base of the PM with the intention of measuring nonspecific binding. mRNA samples are collected from subjects, such as cancer patients, and then labeled with a fluorescent dye. If the short oligonucleotide matches the mRNA sample, the labeled mRNA sample hybridizes to a spot of the microarray. If the labeled mRNA and the probe match perfectly, they bind strongly; otherwise, they bind weakly. Those with weak or nonspecific binding are washed out by a washing buffer; then only strongly bound mRNA samples are measured by a scanner. Scanned measurements need further processing before analysis, such as outlier detection, background subtraction, and normalization. These processes are called preprocessing.
In the early stage of microarray, the measurements contained a lot of variance; therefore, preprocessing was a very active research area. Affymetrix recommended the use of both PM and MM probes to subtract nonspecific binding and implemented the MAS5 algorithm in their software; however, Irizarry et al. [10] and Naef et al. [11] pointed out that the normalization model considering MM captures more nonspecific effect than reality. Currently, the robust multichip average (RMA), which was introduced by Irizarry et al. [10], is also widely used.
The cDNA array uses glass slides to attach short oligonucleotide probes and uses inkjet printing technology. GeneChip uses a single-color fluorescent dye; the cDNA array, on the other hand, utilizes fluorescent dyes of two different colors. One color is for the control mRNA, and the other is for the treatment mRNA. Both samples are hybridized on the same array, and the scanner detects the two fluorescent dyes separately. Data processing is slightly different from GeneChip: as the cDNA array uses two fluorescent dyes, scanned data are normally treated as the ratio of treatment over control.
Microarray technology has improved over the last decades, including a reduction of variance; normalization procedures do not have as great an effect as before. Research interest has moved to areas such as data analysis, finding subclasses, or predicting the subclass.
2.2. Review of Prediction via Microarray. The initial approach employed for subclass discovery was hierarchical clustering analysis. Golub et al. [12] showed the result of clustering for leukemia data using microarray. In their result, subclasses of leukemia were well clustered by gene expression pattern. This result was hailed as a new dawn in the cancer classification problem.
Breast cancer is one of the cancers most studied in the gene expression classification problem. Breast cancer treatment decisions are based largely on clinicopathological criteria such as tumor size, histological grade, and lymph node metastasis; however, van't Veer et al. [7] pointed out that the majority of patients received unnecessary chemotherapy and that there is a need for better criteria for deciding who benefits from chemotherapy.
van't Veer et al. [7] proposed 70 genes to predict patient outcome as the first multigene signature in breast cancer. The 70
Table 1: A taxonomy of feature selection techniques, summarized by Saeys et al. [8]. Three major feature selection models are addressed; each type has a subcategory. Advantages, disadvantages, and example methods are shown.

Filter (univariate)
- Advantages: fast; scalable; independent of the classifier.
- Disadvantages: ignores feature dependencies; ignores interaction with the classifier.
- Examples: χ²; Euclidean distance; t-test; information gain.

Filter (multivariate)
- Advantages: models feature dependencies; independent of the classifier; better computational complexity than wrapper methods.
- Disadvantages: slower than univariate techniques; less scalable than univariate techniques; ignores interaction with the classifier.
- Examples: correlation-based feature selection; Markov blanket filter; fast correlation-based feature selection.

Wrapper (deterministic)
- Advantages: simple; interacts with the classifier; models feature dependencies; less computationally intensive than randomized methods.
- Disadvantages: risk of overfitting; more prone than randomized algorithms to getting stuck in a local optimum; classifier-dependent selection.
- Examples: sequential forward selection; sequential backward selection.

Wrapper (randomized)
- Advantages: less prone to local optima; interacts with the classifier; models feature dependencies.
- Disadvantages: computationally intensive; classifier-dependent selection; higher risk of overfitting than deterministic algorithms.
- Examples: simulated annealing; randomized hill climbing; genetic algorithms; estimation of distribution algorithms.

Embedded
- Advantages: interacts with the classifier; better computational complexity than wrapper methods; models feature dependencies.
- Disadvantages: classifier-dependent selection.
- Examples: decision trees; weighted naive Bayes; RFE-SVM.
genes were decided using 78 patients' tumor samples. In brief, they selected 5000 significantly expressed genes from the 25000 genes on the microarray; then the correlation coefficient with outcome was calculated for each class. Genes were sorted by correlation coefficient and further optimized from the top-ranked genes by sequentially adding subsets of five genes. The top 70 genes were proposed as outcome predictors. Paik et al. [13] proposed a 21-gene signature based on RT-PCR results. These two multigene prognostic signatures are available as clinical tests named MammaPrint and Oncotype DX. The FDA cleared the MammaPrint test in 2007, and it is currently being tested in the Microarray In Node-negative and 1–3 positive lymph-node Disease may Avoid ChemoTherapy (MINDACT) trial for further assessment, as described by Cardoso et al. [14].
Besides these two multigene prognostic signatures, different multigene sets have been selected as prognostic signatures. Fan et al. [15] discussed different sets of multigene signatures in breast cancer prognostic classification studies. Those signatures show little overlap; however, they still showed similar classification power. Fan et al. [15] suggested that these signatures are probably tracking a common set of biological phenotypes. Considering the thousands and millions of genes on a microarray, multiple useful signature sets are not difficult to imagine; however, finding a stable and informative gene set with high classification accuracy is of interest.
2.3. Feature Selection. Reduction of the dimension size is necessary, as superfluous features can cause overfitting and make interpretation of the classification model difficult. Reducing the dimension size while keeping relevant features is important, and several feature selection methods have been proposed. Saeys et al. [8] provided a taxonomy of feature selection methods and discussed their use, advantages, and disadvantages. They mentioned the objectives of feature selection: (a) to avoid overfitting and improve model performance, that is, prediction performance in the case of supervised classification and better cluster detection in the case of clustering; (b) to provide faster and more cost-effective models; and (c) to gain a deeper insight into the underlying processes that generated the data. Table 1 provides their taxonomy of feature selection methods. In the context of classification, feature selection methods are organized into three categories: filter methods, wrapper methods, and embedded methods. Feature selection methods are categorized depending on how they are combined with the construction of a classification model.
Filter methods calculate statistics, such as t-statistics, and then filter out those features which do not meet the threshold
value. Advantages of filter methods are easy implementation, computational simplicity, and speed. Filter methods are independent of classification methods; therefore, different classifiers can be used. Two of the disadvantages of filter methods are that they ignore the interaction with the classification model and that most of the proposed methods are univariate.
Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search, as in sequential backward selection [16]. Wrapper methods search over feature subsets, and the feature subset space grows exponentially with the number of features. One advantage of wrapper methods is the interaction between the feature subset search and model selection. The computational burden is one of the disadvantages of this approach, especially if building the classifier has a high computational cost.
The third class of feature selection is embedded techniques. Like the wrapper methods, embedded techniques search for an optimal subset of features within classifier construction; an example is SVM-RFE, proposed by Guyon et al. [17]. Thus embedded approaches are specific to a given learning algorithm. The difference from wrapper methods is that embedded techniques are guided by the learning process. Whereas wrapper methods search over all possible combinations of gene sets, embedded techniques search for the combination based on a criterion. This enables a reduction in the computational burden.
Besides these methods, the idea of sparseness was recently introduced into some feature selection methods. One approach is to use penalties in the regression context. Tibshirani [3] proposed the Lasso, which uses an $L_1$-norm penalty. The combination of the $L_1$ penalty and the $L_2$ penalty is called the elastic net [18]. These methods focus on reducing features to avoid overfitting and to obtain better interpretability, as biologists expect to gain biological insight from the selected features.
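To make the $L_1$ idea concrete, the following is a minimal sketch of Lasso fitting by cyclic coordinate descent with soft-thresholding, a standard algorithm for this penalty (not the implementation used in the cited papers); the toy data, the penalty level `lam`, and all variable names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent on
    (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for k in range(p):
            r = y - X @ beta + X[:, k] * beta[k]   # partial residual
            rho = X[:, k] @ r / n
            beta[k] = soft_threshold(rho, lam) / col_sq[k]
    return beta

# Toy regression: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)
beta = lasso_cd(X, y, lam=0.5)
print(np.nonzero(np.abs(beta) > 1e-8)[0])   # noise coefficients shrink to exactly zero
```

Note how the soft-thresholding step sets coefficients to exactly zero rather than merely shrinking them, which is the feature selection property discussed above.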
However, these sparseness ideas do not take into account multiple solutions in one data set. When the data dimension is in the thousands or millions, there are multiple possible solutions. Sorting genes based on some criterion and then selecting a subset from the top is not always the best selection. We elaborate on multiple solutions in the following section and give some idea of how to select the optimal solution. Here we refer to an optimal solution as a prediction rule with high classification accuracy for various data sets.
3. Multiple Solutions for Prediction Rules
The existence of multiple solutions for prediction of disease status based on the breast cancer data [7] was shown by Ein-Dor et al. [19], who suggest three reasons for this problem. The first is that there are many genes correlated with disease status; the second is that the differences in correlation among the genes are very small; the last is that the value of the correlation is very sensitive to the sample used for its calculation. In the paper, they demonstrate gene ranking based on the correlation and show that there exist many equally predictive gene sets.
In this section we investigate the existence of multiple solutions from different viewpoints. First, to check the variability of prediction accuracy across different statistical methods, we apply the van't Veer method [7], the Fisher linear discriminant analysis [20], AdaBoost [21], and AUCBoost [22]. The last two methods are called boosting in the machine learning community, where genes are nonlinearly combined to predict the disease status. Second, we apply hierarchical clustering to examine the heterogeneity of gene expression patterns. Sørlie et al. [23] showed that there exist subtypes of breast cancer for which the patterns of gene expression are clearly different and the disease statuses differ accordingly. Ein-Dor et al. [19] suggest that the heterogeneity of the subtypes is one reason why there are so many solutions for the prediction, or large fluctuations of the genes selected for prediction. Hence we calculate the Biological Homogeneity Index (BHI) to see the clustering performance for various gene sets and examine the existence of the subtypes. Finally, we consider the mutual coherence of the breast cancer data and discuss its relation to the multiple solutions.
The breast cancer data consist of the expression data of 25000 genes for 97 cancer patients. After a filtering procedure, 5420 genes were identified as significantly regulated, and we focused on this filtered data. The patients are divided into a training sample (78 patients) and a test sample (19 patients) in the same way as in the original paper [7]. Here we consider classical methods and boosting methods to predict whether a patient has good prognosis (free of disease for at least 5 years after initial diagnosis) or bad prognosis (distant metastases within 5 years). The prediction rule is generated from the training sample, and we measure the prediction accuracy on the test sample.
3.1. Classification. We briefly introduce the statistical methods used for the prediction of the disease status. The first two methods are linear discriminant functions, where the genes are linearly combined to classify the patients into a good prognosis group or a bad prognosis group. The last two are boosting methods, where the genes are nonlinearly combined to generate the discriminant functions.
3.1.1. The van't Veer Method. Let $y$ be a class label indicating disease status, such as $y = -1$ (good prognosis) and $y = 1$ (metastases), and let $\mathbf{x} = (x_1, \ldots, x_p)^\top$ be a $p$-dimensional covariate such as gene expression. We denote the samples of $y = -1$ and $y = 1$ as $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, respectively, and the mean value of the patients with good prognosis as

$$\bar{\mathbf{x}}^- = (\bar{x}^-_1, \ldots, \bar{x}^-_p)^\top = \frac{1}{n_-} \sum_{i=1}^{n_-} \mathbf{x}^-_i. \quad (1)$$
Then van't Veer et al. [7] proposed a discriminant function $F(\mathbf{x})$ based on the correlation with the average good-prognosis profile above, which is given as

$$F(\mathbf{x}) = -\frac{\sum_{k=1}^{p} (x_k - \bar{x})(\bar{x}^-_k - \tilde{x}^-)}{\sqrt{\sum_{k=1}^{p} (x_k - \bar{x})^2}\,\sqrt{\sum_{k=1}^{p} (\bar{x}^-_k - \tilde{x}^-)^2}}, \quad (2)$$
where

$$\bar{x} = \frac{1}{p} \sum_{k=1}^{p} x_k, \qquad \tilde{x}^- = \frac{1}{p} \sum_{k=1}^{p} \bar{x}^-_k. \quad (3)$$
If the value of $F(\mathbf{x})$ is smaller than a predefined threshold value, then the patient is judged to have good prognosis ($y = -1$); otherwise, metastases ($y = 1$). This method is called one-class classification, where only the information of one class label is used to predict the disease status $y$. This idea is also employed in the machine learning community; see Yousef et al. [24] and Gardner et al. [25] for applications in biology and medicine.
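As a sketch, the discriminant function (2) is simply minus the Pearson correlation between a new expression vector and the average good-prognosis profile; the toy profile and patient vectors below are invented for illustration.

```python
import numpy as np

def vant_veer_score(x, good_profile):
    """F(x) from eq. (2): minus the Pearson correlation between a patient's
    expression vector x and the average good-prognosis profile."""
    xc = x - x.mean()
    gc = good_profile - good_profile.mean()
    return -(xc @ gc) / (np.sqrt(xc @ xc) * np.sqrt(gc @ gc))

def classify(x, good_profile, threshold=0.0):
    """y = -1 (good prognosis) if F(x) < threshold, else y = +1 (metastases)."""
    return -1 if vant_veer_score(x, good_profile) < threshold else 1

# Toy profile over p = 5 genes (values invented for illustration).
good_profile = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
similar = np.array([1.1, 2.2, 2.9, 4.1, 4.8])    # tracks the profile: F < 0
dissimilar = good_profile[::-1].copy()           # anti-correlated: F = +1
print(classify(similar, good_profile), classify(dissimilar, good_profile))  # → -1 1
```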
3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that $\mathbf{x}$ is distributed as $N(\boldsymbol{\mu}_-, \Sigma_-)$ for $y = -1$ and as $N(\boldsymbol{\mu}_+, \Sigma_+)$ for $y = 1$. Then, if $\Sigma_- = \Sigma_+ = \Sigma$, the estimated log-likelihood ratio is given as

$$F(\mathbf{x}) = (\hat{\boldsymbol{\mu}}_+ - \hat{\boldsymbol{\mu}}_-)^\top \hat{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \hat{\boldsymbol{\mu}}_+^\top \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_+ + \frac{1}{2} \hat{\boldsymbol{\mu}}_-^\top \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_- + \log \frac{n_+}{n_-}, \quad (4)$$

where $\hat{\boldsymbol{\mu}}_+$, $\hat{\boldsymbol{\mu}}_-$, and $\hat{\Sigma}$ are the sample means for $y \in \{1, -1\}$ and the total sample variance, respectively. For simplicity we take the diagonal part of $\hat{\Sigma}$ (Diagonal Linear Discriminant Analysis) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used in situations where $p$ is much larger than $n$ ($= n_- + n_+$); in that case the inverse of $\hat{\Sigma}$ cannot be calculated.
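A minimal sketch of DLDA as described above, using the diagonal of the pooled covariance in place of $\hat{\Sigma}$; the synthetic data, sizes, and variable names are illustrative assumptions, not the breast cancer data.

```python
import numpy as np

def dlda_fit(X_neg, X_pos):
    """Diagonal LDA (eq. (4) with the diagonal of the pooled covariance):
    usable even when p >> n, where the full covariance matrix is singular."""
    n_neg, n_pos = len(X_neg), len(X_pos)
    mu_neg, mu_pos = X_neg.mean(axis=0), X_pos.mean(axis=0)
    resid = np.vstack([X_neg - mu_neg, X_pos - mu_pos])
    var = resid.var(axis=0) + 1e-12      # per-gene pooled variance
    def score(x):
        return (((mu_pos - mu_neg) / var) @ x
                - 0.5 * mu_pos @ (mu_pos / var)
                + 0.5 * mu_neg @ (mu_neg / var)
                + np.log(n_pos / n_neg))
    return score

# Synthetic p >> n setting: 200 genes, 40 patients per class, mean shift 0.5.
rng = np.random.default_rng(1)
X_neg = rng.normal(0.0, 1.0, size=(40, 200))   # good prognosis
X_pos = rng.normal(0.5, 1.0, size=(40, 200))   # metastases
score = dlda_fit(X_neg, X_pos)
print(score(X_pos.mean(axis=0)) > score(X_neg.mean(axis=0)))  # → True
```

Keeping only the diagonal sidesteps inverting a $p \times p$ matrix estimated from $n \ll p$ samples, which is exactly the situation described in the text.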
3.1.3. AdaBoost. We introduce a famous boosting method in the machine learning community. The key concept of boosting is to construct a powerful discriminant function $F(\mathbf{x})$ by combining various weak classifiers $f(\mathbf{x})$ [26]. We employ a set $\mathcal{F}$ of decision stumps as a dictionary of weak classifiers. Here the decision stump for the $k$th gene expression $x_k$ is defined as a simple step function:

$$f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases} \quad (5)$$

where $b_k$ is a threshold value. Accordingly, $f_k(\mathbf{x})$ is the simplest classifier in the sense that it neglects all information about the gene expression pattern other than that of the single gene $x_k$. However, by changing the value of $b_k$ for all genes ($k = 1, \ldots, p$), we obtain an $\mathcal{F}$ that contains exhaustive information about gene expression patterns. We attempt to build a good discriminant function $F$ by combining decision stumps in $\mathcal{F}$.
The typical boosting method is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss for a discriminant function $F$:

$$L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}, \quad (6)$$
where the entire data set is given as $\{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$. The exponential loss is sequentially minimized by the following algorithm.

(1) Initialize the weight of $\mathbf{x}_i$ for $i = 1, \ldots, n$ as $w_1(i) = 1/n$.

(2) For the iteration number $t = 1, \ldots, T$:

(a) choose $f_t$ as

$$f_t = \operatorname*{argmin}_{f \in \mathcal{F}} \epsilon_t(f), \quad (7)$$

where

$$\epsilon_t(f) = \sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i) \frac{w_t(i)}{\sum_{i=1}^{n} w_t(i)}, \quad (8)$$

and $I$ is the indicator function;

(b) calculate the coefficient as

$$\beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)}; \quad (9)$$

(c) update the weights as

$$w_{t+1}(i) = w_t(i) \exp\{-y_i \beta_t f_t(\mathbf{x}_i)\}. \quad (10)$$

(3) Output the final function as $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t f_t(\mathbf{x})$.
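The algorithm above can be sketched as follows, assuming the stump dictionary is built from the observed values of each covariate; the toy data set and all names are illustrative. The inner search is brute force, so it is only suitable for small examples.

```python
import numpy as np

def stump_predict(X, k, b, sign):
    """Decision stump (eq. (5)): +1 if sign*(x_k - b) >= 0, else -1."""
    return np.where(sign * (X[:, k] - b) >= 0, 1, -1)

def adaboost_stumps(X, y, T=30):
    """AdaBoost, steps (1)-(3): pick the lowest weighted-error stump
    (eqs. (7)-(8)), weight it by beta_t (eq. (9)), reweight (eq. (10))."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)                          # step (1)
    ensemble = []
    for _ in range(T):                               # step (2)
        best = None
        for k in range(p):
            for b in np.unique(X[:, k]):
                for sign in (1, -1):
                    pred = stump_predict(X, k, b, sign)
                    eps = w[pred != y].sum() / w.sum()
                    if best is None or eps < best[0]:
                        best = (eps, k, b, sign)
        eps, k, b, sign = best
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)
        beta = 0.5 * np.log((1.0 - eps) / eps)       # eq. (9)
        w = w * np.exp(-y * beta * stump_predict(X, k, b, sign))  # eq. (10)
        # eq. (16): under the new weights, the chosen stump's error is 1/2
        ensemble.append((beta, k, b, sign))
    def F(x):                                        # step (3)
        return sum(beta * (1 if sign * (x[k] - b) >= 0 else -1)
                   for beta, k, b, sign in ensemble)
    return F

# Toy data: the label depends on the sum of two genes, so no single
# axis-aligned stump separates it, but a boosted combination does well.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
F = adaboost_stumps(X, y, T=30)
train_err = np.mean([(1 if F(x) >= 0 else -1) != yi for x, yi in zip(X, y)])
print(train_err)  # typically well below the single-stump error
```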
Here we have

$$L_{\exp}(F + \beta f) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \exp\{-\beta y_i f(\mathbf{x}_i)\} \quad (11)$$

$$= \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \left[ e^{\beta} I(f(\mathbf{x}_i) \neq y_i) + e^{-\beta} I(f(\mathbf{x}_i) = y_i) \right] \quad (12)$$

$$= L_{\exp}(F) \left\{ e^{\beta} \epsilon(f) + e^{-\beta} (1 - \epsilon(f)) \right\} \quad (13)$$

$$\ge 2 L_{\exp}(F) \sqrt{\epsilon(f)(1 - \epsilon(f))}, \quad (14)$$

where $\epsilon(f) = \frac{1}{n} \sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i) \exp\{-y_i F(\mathbf{x}_i)\} / L_{\exp}(F)$, and the equality in (14) is attained if and only if

$$\beta = \frac{1}{2} \log \frac{1 - \epsilon(f)}{\epsilon(f)}. \quad (15)$$
Hence we can see that the exponential loss is sequentially minimized in the algorithm. It is easily shown that

$$\epsilon_{t+1}(f_t) = \frac{1}{2}. \quad (16)$$

That is, the weak classifier $f_t$ chosen at step $t$ is the worst element in $\mathcal{F}$ in the sense of the weighted error rate at step $t + 1$.
3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion is built from the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, is given as

$$\mathrm{AUC}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)), \quad (17)$$
where $H(z)$ is the Heaviside function: $H(z) = 1$ if $z \ge 0$ and $H(z) = 0$ otherwise. To avoid the difficulty of maximizing the nondifferentiable function above, an approximate AUC is considered by Komori [28]:

$$\mathrm{AUC}_\sigma(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)), \quad (18)$$

where $H_\sigma(z) = \Phi(z/\sigma)$, with $\Phi$ being the standard normal distribution function. A smaller scale parameter $\sigma$ means a better approximation of the Heaviside function $H(z)$. Based on this approximation, a boosting method for the maximization of the AUC as well as the partial AUC is proposed by Komori and Eguchi [22], in which they consider the following objective function:
$$\mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)) - \lambda \sum_{k=1}^{p} \int F''_k(x_k)^2 \, dx_k, \quad (19)$$
where $F''_k(x_k)$ is the second derivative of the $k$th component of $F(\mathbf{x})$ and $\lambda$ is a smoothing parameter that controls the smoothness of $F(\mathbf{x})$. Here the set of weak classifiers is given as

$$\mathcal{F} = \left\{ f(\mathbf{x}) = \frac{N_{kl}(x_k)}{Z_{kl}} \;\middle|\; k = 1, \ldots, p;\ l = 1, \ldots, m_k \right\}, \quad (20)$$
where $N_{kl}$ is a basis function for representing natural cubic splines and $Z_{kl}$ is a standardization factor. Then the relationship

$$\max_{\sigma,\lambda,F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda,F} \mathrm{AUC}_{1,\lambda}(F) \quad (21)$$

allows us to fix $\sigma = 1$ without loss of generality, yielding the following algorithm.
(1) Start with $F_0(\mathbf{x}) = 0$.

(2) For $t = 1, \ldots, T$:

(a) update $\beta_{t-1}(f)$ to $\beta_t(f)$ with a one-step Newton-Raphson iteration;

(b) find the best weak classifier $f_t$:

$$f_t = \operatorname*{argmax}_{f} \mathrm{AUC}_{1,\lambda}(F_{t-1} + \beta_t(f) f); \quad (22)$$

(c) update the score function as

$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t) f_t(\mathbf{x}). \quad (23)$$

(3) Output $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t(f_t) f_t(\mathbf{x})$.

The value of the smoothing parameter $\lambda$ and the iteration number $T$ are determined simultaneously by cross-validation.
3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage is applied as the agglomeration method. To measure the performance of the clustering, we used the Biological Homogeneity Index (BHI) [29], which measures the homogeneity between the clusters $\mathcal{C} = \{C_1, \ldots, C_K\}$ and the biological categories or subtypes $\mathcal{B} = \{B_1, \ldots, B_L\}$:

$$\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k(n_k - 1)} \sum_{\substack{i \neq j \\ i,j \in C_k}} I(B^{(i)} = B^{(j)}), \quad (24)$$

where $B^{(i)} \in \mathcal{B}$ is the subtype of subject $i$ and $n_k$ is the number of subjects in $C_k$. This index is bounded above by 1, which indicates perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related to the prognosis of the patients [23].
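A direct transcription of (24), with toy cluster labels and subtypes invented for illustration:

```python
import numpy as np

def bhi(clusters, subtypes):
    """Biological Homogeneity Index (eq. (24)): for each cluster, the
    fraction of ordered within-cluster pairs sharing the same biological
    category, averaged over the K clusters."""
    clusters = np.asarray(clusters)
    subtypes = np.asarray(subtypes)
    labels = np.unique(clusters)
    total = 0.0
    for c in labels:
        b = subtypes[clusters == c]
        n_k = len(b)
        if n_k < 2:
            continue                      # no pairs in a singleton cluster
        same = sum(b[i] == b[j]
                   for i in range(n_k) for j in range(n_k) if i != j)
        total += same / (n_k * (n_k - 1))
    return total / len(labels)

print(bhi([1, 1, 2, 2], ["ER+", "ER+", "ER-", "ER-"]))  # → 1.0 (perfect homogeneity)
print(bhi([1, 1, 1, 1], ["ER+", "ER+", "ER-", "ER-"]))  # one mixed cluster: 1/3
```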
3.3. Mutual Coherence. Now we have a data matrix $X$ with $n$ rows ($n$ patients) and $p$ columns ($p$ genes). We assume an $n$-dimensional vector $\mathbf{b}$ indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis. The magnitude of $\mathbf{b}$ denotes the level of disease status. Then the optimal linear solution $\boldsymbol{\beta} \in \mathbb{R}^p$ should satisfy

$$X\boldsymbol{\beta} = \mathbf{b}. \quad (25)$$

Note that if $p$ is much larger than $n$, then only a few elements of $\boldsymbol{\beta}$ should be nonzero and the others zero, which means $\boldsymbol{\beta}$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as

$$\mu(X) = \max_{\substack{1 \le i,j \le p \\ i \neq j}} \frac{|\mathbf{x}_i^\top \mathbf{x}_j|}{\|\mathbf{x}_i\|_2 \|\mathbf{x}_j\|_2}, \quad (26)$$

where $\mathbf{x}_i \in \mathbb{R}^n$ denotes the $i$th column of $X$, or the $i$th gene in the breast cancer data, $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix $X$ and is bounded as $0 \le \mu(X) \le 1$. The next theorem shows the relationship between the sparsity of $\boldsymbol{\beta}$ and the data matrix $X$.
Theorem 1 (Elad [30]). If $X\boldsymbol{\beta} = \mathbf{b}$ has a solution $\boldsymbol{\beta}$ obeying $\|\boldsymbol{\beta}\|_0 < (1 + 1/\mu(X))/2$, this solution is necessarily the sparsest possible, where $\|\boldsymbol{\beta}\|_0$ denotes the number of nonzero components of $\boldsymbol{\beta}$.

This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \boldsymbol{\beta}^\top \mathbf{x}$ could have a sparse solution to predict the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then $\|\boldsymbol{\beta}\|_0$ could become approximately 1, indicating that just one gene (a column of $X$) could be enough to predict the disease status if there were a solution $\boldsymbol{\beta}$. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in classification problems, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.
3.4. Results. We prepare various data sets using the training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7] in order to investigate the existence of multiple solutions for the prediction of the disease status. The data sets are given by

$$D = \{D_{1-70},\, D_{6-75},\, \ldots,\, D_{161-230}\}. \quad (27)$$

The first data set, $D_{1-70}$, consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $D$ is based on the correlation with disease status, as in [7]. We apply
the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions $F(x)$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The results of the classification performance of the van't Veer method and DLDA are illustrated in Figure 1; those of AdaBoost and AUCBoost are in Figure 2. The AUC and the error rate are plotted against $D$. In regard to the performance based on the training data, denoted by the solid lines, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $D$, with an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
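The two evaluation criteria used throughout this subsection, the empirical AUC and the thresholded error rate, can be computed directly. The following is a minimal sketch with toy discriminant scores (function names and data are ours, not the paper's):

```python
import numpy as np

def auc(scores, labels):
    """Empirical AUC: fraction of (case, control) pairs in which the case
    receives the higher score; ties count as 1/2."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def error_rate(scores, labels, threshold=0.0):
    """Misclassification rate of the rule: predict 1 when F(x) > threshold."""
    pred = (scores > threshold).astype(int)
    return (pred != labels).mean()

# Toy discriminant values for 3 cases (label 1) and 3 controls (label 0).
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([2.0, 1.5, -0.2, 0.3, -1.0, -2.0])
print(auc(scores, labels))          # 8 of the 9 case-control pairs are ordered correctly
print(error_rate(scores, labels))   # 2 of the 6 subjects are misclassified at threshold 0
```

Unlike the error rate, the AUC does not depend on the choice of threshold, which is why both are reported.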
We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1-70}$, $D_{11-80}$, and $D_{111-180}$, which are shown in Figure 3. Each row represents 70 genes, and each column represents 78 patients with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1-70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11-80}$ shows different patterns from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the basal-like or triple-negative subtype, though the HER2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER positive shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set $D_{111-180}$ has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that in $D_{1-70}$. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in $D$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.
Next, we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with 5,420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performances of DLDA with 100 replications, which shows the most stable prediction accuracy as seen in Figure 1, based on the randomly selected 50 patients are shown in Figure 6, where the vertical axes are AUC (left panels) and error rate (right panels) and the genes are ranked by the correlation (a) and the AUC (b). The performance clearly becomes worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of the gene expression pattern may come from the several subtypes, as suggested in the cluster analysis, and it would make it difficult to select useful genes for prediction.

Figure 1: Results of classical methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets D for the van't Veer method and DLDA.
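The resampling check of the ranking stability described above can be sketched as follows. This is our own illustration on synthetic data (random noise standing in for the expression matrix), so the selection rates merely illustrate the instability, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, top, trials = 78, 500, 70, 100
X = rng.standard_normal((n, p))   # surrogate expression matrix
y = rng.integers(0, 2, size=n)    # surrogate disease status

def top_genes_by_correlation(X, y, k):
    """Rank genes by |correlation with the class label| and return the top k."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return set(np.argsort(-np.abs(r))[:k])

reference = top_genes_by_correlation(X, y, top)  # ranking on the full sample
hits = np.zeros(p)
for _ in range(trials):
    idx = rng.choice(n, size=50, replace=False)  # subsample 50 of 78 patients
    for g in top_genes_by_correlation(X[idx], y[idx], top):
        hits[g] += 1

# Selection rates of the reference top-70 genes across the resamples:
rates = hits[sorted(reference)] / trials
print(rates.min(), rates.max())  # with pure noise, many rates fall far below 1
```

Genes selected on the full sample that rarely reappear under subsampling are exactly the unstable ones visible in Figure 5.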
Finally, we calculate the mutual coherence based on the genes in MammaPrint and the total 5,420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM_014889 and NM_014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined in the right-hand side of (26) are calculated for all pairs of genes among the total 5,420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5,420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone to select genes whose gene expression patterns are similar to each other. This would be one reason why we have multiple solutions after the ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.

Figure 2: Results of boosting methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets D for AdaBoost and AUCBoost.
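The comparison behind Figure 7(b), sorting all pairwise absolute correlations and rescaling the pair index to [0, 1], can be sketched as follows. The matrices here are synthetic stand-ins (not the breast cancer data), with the "selected" set deliberately built around a shared component to mimic a homogeneous ranked gene set:

```python
import numpy as np

def sorted_abs_correlations(X):
    """All pairwise |correlations| between columns of X, sorted decreasingly,
    with the pair index rescaled to [0, 1] as in Figure 7(b)."""
    G = X - X.mean(axis=0)
    G = G / (np.linalg.norm(G, axis=0) + 1e-12)
    C = np.abs(G.T @ G)
    vals = C[np.triu_indices_from(C, k=1)]  # upper triangle: each pair once
    vals = np.sort(vals)[::-1]
    x = np.linspace(0.0, 1.0, len(vals))    # standardized pair index
    return x, vals

rng = np.random.default_rng(2)
X_all = rng.standard_normal((78, 300))      # stand-in for the full gene set
# A "selected" subset sharing a common component, mimicking ranked genes:
base = rng.standard_normal((78, 1))
X_sel = base + 0.7 * rng.standard_normal((78, 30))
_, c_all = sorted_abs_correlations(X_all)
_, c_sel = sorted_abs_correlations(X_sel)
print(c_sel.mean() > c_all.mean())          # the selected set is more homogeneous
```

The gap between the two sorted curves is the redundancy that correlation-based ranking cannot remove.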
4. Discussion and Concluding Remarks
In this paper, we have addressed an important classification issue using microarray data. Our results show the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that the performance of several gene sets was almost the same. This behavior was confirmed by the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and the bad prognosis groups from the clustering expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into known subtypes, but considering subtypes for feature selection could be future work.

Figure 3: Heat maps of gene expression data with rows representing genes and columns representing patients. (a) D_{1-70} (MammaPrint), (b) D_{11-80}, clearly showing some subtypes, and (c) D_{111-180}, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.
Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line).
Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.
Figure 6: The values of AUC (left panels) and error rate (right panels) calculated by DLDA using genes ranked in the top 70 by the correlation coefficients (a) and by AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.
Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.
One approach to tackling the high dimensionality of the data is to apply statistical methods that take into account background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patients' prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5, and as suggested by Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis that deals with the whole information of all genes would be useful to capture the mutual relations among genes and to identify subgroups of genes informative for the prediction problem. The combination of unsupervised and supervised learning is a promising way to resolve the difficulty involved in high-dimensional data.

Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distributions of the correlations of the total 5,420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the pairs of genes for which the correlations are calculated; it is standardized between 0 and 1 for a clear view.
We addressed the challenges and difficulties regarding pattern recognition from gene expression. The main problem is caused by high-dimensional data sets. This is a problem not only for microarray data but also for RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarray for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] showed clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: the dimension of RNA-seq data is quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data suffer the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.
Conflict of Interests
The authors have declared that no conflict of interests exists
References
[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860-921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304-1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267-288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197-2202, 2003.
[5] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207-1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55-65, 2006.
[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530-536, 2002.
[8] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614-10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249-264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999.
[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817-2826, 2004.
[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729-735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560-569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349-369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301-320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171-178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179-188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869-10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025-1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961-979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969-1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869-10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160-1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621-628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57-63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22-S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249-257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493-2518, 2011.
an interpretable model for gene expression data. In the regression context, Tibshirani [3] proposed the lasso, which achieves feature selection by shrinking some coefficients to exactly zero. In the image processing area, the sparse model is also considered for redundant representation of data by Donoho and Elad [4] and Candes et al. [5].
The prediction performance attained to date has reached a certain level; however, there remain unsolved problems in prediction when the observed number of genes is extremely larger than the number of subjects. This causes difficulties in which superfluous genes are discovered whose expressions have weak power for prediction. As a result, almost no microarray data analysis is completely confirmed by biological replication, as discussed in Allison et al. [6].
We address the microarray data analysis discussed in van't Veer et al. [7], in which a set of 70 genes is selected as informative biomarkers for the prediction of prognosis in breast cancer patients. The result helped to build the prediction kit named "MammaPrint," which was approved by the FDA in 2007. We reanalyze the data and discuss how, surprisingly, many other gene sets show almost the same prediction performance as the published gene set. Thus their gene set is not the unique one with reasonable predictive power, and we suggest that it is impossible to build a universal prediction rule by efficient selection of genes. In particular, the ranking procedure according to a measure of association with the phenotype is very fragile to data partition. We discuss the statistical reason why the analysis of gene expression data involves such difficulties of multiple solutions. We calculate the value of the mutual coherence for the genes in MammaPrint and show the essential difficulty of determining informative genes among the huge number of genes used in the data analysis.
This paper is organized as follows. In Section 2, we describe the pattern recognition of gene expression and overview the currently proposed methods. In Section 3, we point out the current difficulty in gene expression analysis using a real data set. Finally, we discuss the results and future works in Section 4.
2. Pattern Recognition from Gene Expressions
In this section, we overview the pattern recognition of gene expression. First, we mention the DNA microarray technique, which is the widely used method to measure a huge number of gene expressions. Then we present the current methods for gene selection using microarray.
2.1. DNA Microarray. Microarray has become a widely used technology in a variety of areas. Microarray measures the amount of mRNA, or gene expression. There are two major technologies available for gene expression measurement. One is the GeneChip system provided by Affymetrix Inc.; GeneChip uses prefabricated short lengths of oligonucleotides. The other is the cDNA array, which was originally developed by Schena et al. [9]. We briefly mention both technologies.
GeneChip uses pairs of short-length oligonucleotides attached to a solid surface such as glass, plastic, or silicon. The short pair of oligonucleotides is called the probe pair. Each probe pair is composed of a perfect match (PM) probe and a mismatch (MM) probe. The PM is a section of the mRNA of interest, and the MM is created by changing the middle (13th) base of the PM with the intention of measuring nonspecific binding. mRNA samples are collected from subjects such as cancer patients and then labeled with a fluorescent dye. If the short oligonucleotide matches the mRNA sample, the labeled mRNA sample is hybridized to a spot of the microarray. If the labeled mRNA and the probe match perfectly, they bind strongly; otherwise they bind weakly. Those with weak or nonspecific binding are washed out by a washing buffer, and then only strongly bound mRNA samples are measured by a scanner. Scanned measurements need further processing before analysis, such as outlier detection, background subtraction, and normalization. These processes are called preprocessing.
In the early stage of microarray, the quality of microarray measurements contained a lot of variance. Therefore, preprocessing was a very active research area. Affymetrix recommended the use of both PM and MM probes to subtract nonspecific binding and implemented the MAS5 algorithm in their software; however, Irizarry et al. [10] and Naef et al. [11] pointed out that the normalization model considering MM captures the nonspecific effect more than reality. Currently, the robust multichip average (RMA), introduced by Irizarry et al. [10], is also widely used.
The cDNA array uses glass slides to attach short oligonucleotide probes, using inkjet printing technology. GeneChip uses a one-color fluorescent dye; on the other hand, the cDNA array utilizes two different fluorescent dyes. One color is for the control mRNA, and the other is for the treatment mRNA. Both samples are hybridized on the same array. The scanner detects both fluorescent dyes separately. Data processing is slightly different from GeneChip: as the cDNA array uses two fluorescent dyes, scanned data are normally treated as ratio data of treatment over control.
Microarray technology has improved in the last decades, including a reduction of the variance, so normalization procedures do not have as great an effect as before. Research interest has moved to areas such as data analysis, finding subclasses, or predicting the subclass.
2.2. Review of Prediction via Microarray. The initial approach employed for subclass discovery was hierarchical clustering analysis. Golub et al. [12] showed the result of clustering for leukemia data using microarray. In their result, subclasses of leukemia were well clustered by gene expression pattern. This result was hailed as a new dawn in the cancer classification problem.
Breast cancer is one of the most used cancers for the gene expression classification problem. Breast cancer treatment decisions are based largely on clinicopathological criteria such as tumor size, histological grade, and lymph node metastasis; however, van't Veer et al. [7] pointed out that the majority of patients received unnecessary chemotherapy and that there is a need to find better criteria for who benefits from chemotherapy.
van't Veer et al. [7] proposed 70 genes to predict patient outcome as the first multigene signature in breast cancer. The 70 genes were decided using 78 patients' tumor samples. In brief, they selected 5,000 significantly expressed genes from the 25,000 genes on the microarray; then the correlation coefficient with the outcome was calculated for each gene. Genes were sorted by correlation coefficient and further optimized, starting from the top-ranked genes, by sequentially adding subsets of five genes. The top 70 genes were proposed as outcome predictors. Paik et al. [13] proposed a 21-gene signature based on RT-PCR results. These two multigene prognostic signatures are available as clinical tests named MammaPrint and Oncotype DX. The FDA cleared the MammaPrint test in 2007, and it is currently being tested in the Microarray In Node-negative and 1-3 positive lymph-node Disease may Avoid ChemoTherapy (MINDACT) trial for further assessment, as described by Cardoso et al. [14].

Table 1: A taxonomy of feature selection techniques, summarized by Saeys et al. [8]. Three major feature selection models are addressed, and each type has a subcategory; advantages, disadvantages, and example methods are shown.

Filter (univariate). Advantages: fast; scalable; independent of the classifier. Disadvantages: ignores feature dependencies; ignores interaction with the classifier. Examples: χ², Euclidean distance, t-test, information gain.

Filter (multivariate). Advantages: models feature dependencies; independent of the classifier; better computational complexity than wrapper methods. Disadvantages: slower and less scalable than univariate techniques; ignores interaction with the classifier. Examples: correlation-based feature selection, Markov blanket filter, fast correlation-based feature selection.

Wrapper (deterministic). Advantages: simple; interacts with the classifier; models feature dependencies; less computationally intensive than randomized methods. Disadvantages: risk of overfitting; more prone than randomized algorithms to getting stuck in a local optimum; classifier-dependent selection. Examples: sequential forward selection, sequential backward selection.

Wrapper (randomized). Advantages: less prone to local optima; interacts with the classifier; models feature dependencies. Disadvantages: computationally intensive; classifier-dependent selection; higher risk of overfitting than deterministic algorithms. Examples: simulated annealing, randomized hill climbing, genetic algorithms, estimation of distribution algorithms.

Embedded. Advantages: interacts with the classifier; better computational complexity than wrapper methods; models feature dependencies. Disadvantages: classifier-dependent selection. Examples: decision trees, weighted naive Bayes, RFE-SVM.
Besides these two multigene prognostic signatures, different multigene sets have been selected as prognostic signatures. Fan et al. [15] discussed different sets of multigene signatures in breast cancer prognostic classification studies. Those signatures show little overlap; however, they still show similar classification power. Fan et al. [15] suggested that these signatures are probably tracking a common set of biological phenotypes. Considering the thousands and millions of genes on a microarray, multiple useful signature sets are not difficult to imagine; however, finding a stable and informative gene set with high classification accuracy is of interest.
2.3. Feature Selection. Reduction of the dimension size is necessary, as superfluous features can cause overfitting and make interpretation of the classification model difficult. Reducing the dimension size while keeping relevant features is important. Several feature selection methods have been proposed. Saeys et al. [8] provided a taxonomy of feature selection methods and discussed their use, advantages, and disadvantages. They mentioned the objectives of feature selection: (a) to avoid overfitting and improve model performance, that is, prediction performance in the case of supervised classification and better cluster detection in the case of clustering; (b) to provide faster and more cost-effective models; and (c) to gain a deeper insight into the underlying processes that generated the data. Table 1 reproduces their taxonomy of feature selection methods. In the context of classification, feature selection methods are organized into three categories: filter methods, wrapper methods, and embedded methods. Feature selection methods are categorized depending on how they are combined with the construction of a classification model.
Filter methods calculate statistics, such as t-statistics, and then filter out the features that do not meet the threshold
4 Computational and Mathematical Methods in Medicine
value. The advantages of filter methods are easy implementation, computational simplicity, and speed. Filter methods are independent of the classification method; therefore, different classifiers can be used. Two of the disadvantages of filter methods are that they ignore the interaction with the classification model and that most of the proposed methods are univariate.
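As an illustration of such a filter step, the sketch below ranks genes by a Welch two-sample t-statistic and keeps those whose absolute value exceeds a cutoff. It is a minimal sketch, not the pipeline used in the paper; the toy expression values and the threshold of 3.0 are invented for the example.

```python
import math

def t_statistic(pos, neg):
    """Welch two-sample t-statistic for one gene."""
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    vp = sum((x - mp) ** 2 for x in pos) / (len(pos) - 1)
    vn = sum((x - mn) ** 2 for x in neg) / (len(neg) - 1)
    return (mp - mn) / math.sqrt(vp / len(pos) + vn / len(neg))

def filter_genes(expr_pos, expr_neg, threshold):
    """Keep the indices of genes whose |t| exceeds the threshold.

    expr_pos[k] and expr_neg[k] hold the expression values of gene k
    in the two classes; the classifier is never consulted, which is
    what makes this a filter method.
    """
    return [k for k, (pos, neg) in enumerate(zip(expr_pos, expr_neg))
            if abs(t_statistic(pos, neg)) >= threshold]

# Toy data: gene 0 separates the classes, gene 1 does not.
expr_pos = [[5.1, 4.9, 5.3], [0.1, -0.2, 0.0]]
expr_neg = [[1.0, 1.2, 0.8], [0.0, 0.1, -0.1]]
print(filter_genes(expr_pos, expr_neg, threshold=3.0))  # -> [0]
```

Because each gene is scored on its own, the step is fast but univariate, exactly the trade-off described above.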
Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search, as in sequential backward selection [16]. Wrapper methods search over feature subsets, and the feature subset space grows exponentially with the number of features. One advantage of wrapper methods is the interaction between the feature subset search and model selection. The computational burden is one of the disadvantages of this approach, especially if building the classifier has a high computational cost.
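A minimal sketch of a deterministic wrapper, sequential forward selection, is given below. The `score` callback stands in for training and evaluating a classifier on a candidate subset, which is where the wrapper's computational cost comes from; the toy scoring function in the example is hypothetical.

```python
def forward_selection(features, score, k):
    """Greedy wrapper sketch: repeatedly add the feature whose addition
    most improves the score of the current subset, up to k features."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining candidate improves the current subset
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical score: the classifier would do best on genes {0, 2},
# with a small penalty per extra gene (a stand-in for overfitting risk).
def toy_score(subset):
    return len(set(subset) & {0, 2}) - 0.01 * len(subset)

print(forward_selection([0, 1, 2, 3, 4], toy_score, k=3))  # -> [0, 2]
```

Every candidate step calls `score`, that is, refits the classifier, which is why wrappers scale poorly when the classifier is expensive.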
The third class of feature selection is embedded techniques. Like the wrapper methods, embedded techniques search for an optimal subset of features during classifier construction; an example is SVM-RFE, proposed by Guyon et al. [17]. Thus, embedded approaches are specific to a given learning algorithm. The difference from wrapper methods is that embedded techniques are guided by the learning process: whereas wrapper methods search over possible combinations of gene sets, embedded techniques search for the combination based on a criterion internal to the learner. This enables a reduction in the computational burden.
Besides these methods, the idea of sparseness was recently introduced in some feature selection methods. One approach is to use penalties in the regression context. Tibshirani [3] proposed the lasso, which uses an L1-norm penalty. The combination of an L1 penalty and an L2 penalty is called the elastic net [18]. These methods focus on reducing features to avoid overfitting and to improve interpretability, as biologists expect to obtain biological insight from the selected features.
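To make the L1 idea concrete, the following is a minimal coordinate-descent sketch of the lasso. It is illustrative only, not the solver of Tibshirani [3], and the toy design matrix is invented: its two columns are orthogonal, and the response depends only on the first, so the second coefficient is shrunk exactly to zero.

```python
def soft_threshold(z, t):
    """Soft-thresholding, the building block of L1 penalization."""
    return z - t if z > t else (z + t if z < -t else 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            xj = [row[j] for row in X]
            # partial residual: add back coordinate j's own contribution
            rho = sum(xj[i] * (resid[i] + xj[i] * beta[j]) for i in range(n)) / n
            denom = sum(v * v for v in xj) / n
            new = soft_threshold(rho, lam) / denom
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * xj[i]
                beta[j] = new
    return beta

# Toy problem: y = 2 * (first column); the second column is pure noise
# direction, so the lasso sets its coefficient exactly to zero.
X = [[1.0, 1.0], [-1.0, 1.0], [2.0, -1.0], [-2.0, -1.0]]
y = [2.0, -2.0, 4.0, -4.0]
print(lasso_cd(X, y, lam=0.1))
```

The exact zero in the second coefficient, rather than a merely small value, is what distinguishes the L1 penalty from ridge-type L2 shrinkage.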
However, these sparseness ideas do not take into account multiple solutions in one data set. When the data dimension is in the thousands or millions, there are multiple possible solutions. Sorting genes based on some criterion and then selecting a subset from the top is not always the best selection. We elaborate on multiple solutions in the following section and give some idea of how to select the optimum solution. Here we refer to an optimal solution as a prediction rule with high classification accuracy for various data sets.
3. Multiple Solutions for Prediction Rules
The existence of multiple solutions for prediction of disease status based on the breast cancer data [7] was shown by Ein-Dor et al. [19], who suggest three reasons for this problem. The first is that there are many genes correlated with disease status; the second is that the differences in correlation among the genes are very small; the last is that the value of the correlation is very sensitive to the sample used for its calculation. In that paper they demonstrate gene ranking based on the correlation and show that there exist many equally predictive gene sets.
In this section we investigate the existence of multiple solutions from different viewpoints. First, to check the variability of prediction accuracy across different statistical methods, we apply the van't Veer method [7], Fisher linear discriminant analysis [20], AdaBoost [21], and AUCBoost [22]. The last two methods are called boosting in the machine learning community, where genes are nonlinearly combined to predict the disease status. Second, we apply hierarchical clustering to examine the heterogeneity of gene expression patterns. Sørlie et al. [23] showed that there exist subtypes of breast cancer for which the patterns of gene expression are clearly different, and the disease statuses differ accordingly. Ein-Dor et al. [19] suggest that the heterogeneity of the subtypes is one reason why there are so many solutions for the prediction, or large fluctuations of the genes selected for the predictions. Hence, we calculate the Biological Homogeneity Index (BHI) to see the clustering performance for various gene sets and examine the existence of the subtypes. Finally, we consider the mutual coherence for the breast cancer data and discuss its relation to the multiple solutions.
The breast cancer data consist of the expression data of 25,000 genes for 97 cancer patients. After a filtering procedure, 5420 genes were identified as significantly regulated, and we focused on this filtered data. The patients are divided into a training sample (78 patients) and a test sample (19 patients) in the same way as in the original paper [7]. Here we consider classical methods and boosting methods to predict whether a patient has good prognosis (free of disease for at least 5 years after initial diagnosis) or bad prognosis (distant metastases within 5 years). The prediction rule is generated from the training sample, and we measure the prediction accuracy on the test sample.
3.1. Classification. We briefly introduce the statistical methods used for the prediction of the disease status. The first two methods are linear discriminant functions, where the genes are linearly combined to classify the patients into a good prognosis group or a bad prognosis group. The last two are boosting methods, where the genes are nonlinearly combined to generate the discriminant functions.
3.1.1. The van't Veer Method. Let y be a class label indicating disease status, such as y = -1 (good prognosis) and y = 1 (metastases), and let \mathbf{x} = (x_1, \ldots, x_p)^T be a p-dimensional covariate such as gene expression. We denote the samples of y = -1 and y = 1 as \mathbf{x}_i^-, i = 1, \ldots, n_-, and \mathbf{x}_j^+, j = 1, \ldots, n_+, respectively, and the mean value of the patients with good prognosis as

\bar{\mathbf{x}}^- = (\bar{x}_1^-, \ldots, \bar{x}_p^-)^T = \frac{1}{n_-} \sum_{i=1}^{n_-} \mathbf{x}_i^-.  (1)
Then van't Veer et al. [7] proposed a discriminant function F(\mathbf{x}) based on the correlation with the average good-prognosis profile above, which is given as

F(\mathbf{x}) = -\frac{\sum_{k=1}^{p} (x_k - \bar{x}) (\bar{x}_k^- - \bar{x}^-)}{\sqrt{\sum_{k=1}^{p} (x_k - \bar{x})^2} \sqrt{\sum_{k=1}^{p} (\bar{x}_k^- - \bar{x}^-)^2}},  (2)

where

\bar{x} = \frac{1}{p} \sum_{k=1}^{p} x_k, \qquad \bar{x}^- = \frac{1}{p} \sum_{k=1}^{p} \bar{x}_k^-.  (3)

If the value of F(\mathbf{x}) is smaller than a predefined threshold value, then the patient is judged to have good prognosis (y = -1); otherwise, metastases (y = 1). This method is called one-class classification because it focuses on the information of only one class label to predict the disease status y. This idea is also employed in the machine learning community; see Yousef et al. [24] and Gardner et al. [25] for applications in biology and medicine.
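A direct transcription of (2) can be sketched as follows: the score is minus the Pearson correlation between a patient's profile and the average good-prognosis profile, so patients resembling the good-prognosis average receive low scores. The toy profiles are invented for illustration.

```python
import math

def correlation(u, v):
    """Pearson correlation of two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    du = math.sqrt(sum((a - mu) ** 2 for a in u))
    dv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return num / (du * dv)

def vant_veer_score(x, good_profile):
    """F(x) in (2): minus the correlation with the average good-prognosis
    profile, so a larger score suggests a worse prognosis."""
    return -correlation(x, good_profile)

# Invented profiles: a patient close to the good-prognosis average
# scores lower than one far from it.
good_profile = [1.0, 2.0, 3.0, 4.0]
similar = [1.1, 2.0, 2.9, 4.2]
dissimilar = [4.0, 3.0, 2.0, 1.0]
print(vant_veer_score(similar, good_profile)
      < vant_veer_score(dissimilar, good_profile))  # -> True
```

Note that only the good-prognosis class enters the rule, which is what makes this a one-class method.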
3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that \mathbf{x} is distributed as N(\boldsymbol{\mu}_-, \Sigma_-) for y = -1 and as N(\boldsymbol{\mu}_+, \Sigma_+) for y = 1. Then, if \Sigma_- = \Sigma_+ = \Sigma, the estimated log-likelihood ratio is given as

F(\mathbf{x}) = (\bar{\mathbf{x}}_+ - \bar{\mathbf{x}}_-)^T \hat{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \bar{\mathbf{x}}_+^T \hat{\Sigma}^{-1} \bar{\mathbf{x}}_+ + \frac{1}{2} \bar{\mathbf{x}}_-^T \hat{\Sigma}^{-1} \bar{\mathbf{x}}_- + \log \frac{n_+}{n_-},  (4)

where \bar{\mathbf{x}}_+, \bar{\mathbf{x}}_-, and \hat{\Sigma} are the sample means for y \in \{-1, 1\} and the total sample variance, respectively. For simplicity we take the diagonal part of \hat{\Sigma} (diagonal linear discriminant analysis, DLDA) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used in situations where p is much larger than n (= n_- + n_+); in that case the inverse of \hat{\Sigma} cannot be calculated.
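With the diagonal restriction, (4) reduces to a sum of per-gene terms, which can be sketched as below; the single-gene example values are hypothetical.

```python
import math

def dlda_score(x, mean_pos, mean_neg, var_diag, n_pos, n_neg):
    """Log-likelihood ratio (4) with the covariance restricted to its
    diagonal; a positive score favours y = +1 (metastases)."""
    s = math.log(n_pos / n_neg)
    for xk, mp, mn, vk in zip(x, mean_pos, mean_neg, var_diag):
        # per-gene contribution of (mean_+ - mean_-)^T Sigma^{-1} x
        # minus the two quadratic terms, with Sigma diagonal
        s += (mp - mn) * xk / vk - (mp * mp - mn * mn) / (2.0 * vk)
    return s

# Hypothetical single-gene example: class means 2 and 0, unit variance,
# balanced classes, so the score is simply 2x - 2.
print(dlda_score([2.0], [2.0], [0.0], [1.0], 10, 10))  # -> 2.0
```

Because only the p diagonal variances are estimated rather than a full p-by-p covariance, the rule remains computable when p greatly exceeds n.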
3.1.3. AdaBoost. We introduce a famous boosting method in the machine learning community. The key concept of boosting is to construct a powerful discriminant function F(\mathbf{x}) by combining various weak classifiers f(\mathbf{x}) [26]. We employ a set \mathcal{F} of decision stumps as a dictionary of weak classifiers. Here the decision stump for the kth gene expression x_k is defined as a simple step function:

f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases}  (5)

where b_k is a threshold value. Accordingly, f_k(\mathbf{x}) is the simplest classifier in the sense that it neglects all information on gene expression patterns other than that of the single gene x_k. However, by varying the value of b_k for all genes (k = 1, \ldots, p), we obtain a set \mathcal{F} that contains exhaustive information on gene expression patterns. We attempt to build a good discriminant function F by combining decision stumps in \mathcal{F}.
The typical example is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss for a discriminant function F:

L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\},  (6)

where the entire data set is given as \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}. The exponential loss is sequentially minimized by the following algorithm.

(1) Initialize the weight of \mathbf{x}_i for i = 1, \ldots, n as w_1(i) = 1/n.
(2) For the iteration number t = 1, \ldots, T:
 (a) choose f_t as

 f_t = \arg\min_{f \in \mathcal{F}} \epsilon_t(f),  (7)

 where

 \epsilon_t(f) = \sum_{i=1}^{n} I(f(\mathbf{x}_i) \ne y_i) \frac{w_t(i)}{\sum_{i'=1}^{n} w_t(i')},  (8)

 and I is the indicator function;
 (b) calculate the coefficient as

 \beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)};  (9)

 (c) update the weights as

 w_{t+1}(i) = w_t(i) \exp\{-y_i \beta_t f_t(\mathbf{x}_i)\}.  (10)

(3) Output the final function as F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t f_t(\mathbf{x}).
Here we have

L_{\exp}(F + \beta f) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \exp\{-\beta y_i f(\mathbf{x}_i)\}  (11)

= \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \left[ e^{\beta} I(f(\mathbf{x}_i) \ne y_i) + e^{-\beta} I(f(\mathbf{x}_i) = y_i) \right]  (12)

= L_{\exp}(F) \left\{ e^{\beta} \epsilon(f) + e^{-\beta} (1 - \epsilon(f)) \right\}  (13)

\ge 2 L_{\exp}(F) \sqrt{\epsilon(f) (1 - \epsilon(f))},  (14)

where \epsilon(f) = (1/n) \sum_{i=1}^{n} I(f(\mathbf{x}_i) \ne y_i) \exp\{-y_i F(\mathbf{x}_i)\} / L_{\exp}(F), and the equality in (14) is attained if and only if

\beta = \frac{1}{2} \log \frac{1 - \epsilon(f)}{\epsilon(f)}.  (15)

Hence we can see that the exponential loss is sequentially minimized by the algorithm. It is easily shown that

\epsilon_{t+1}(f_t) = \frac{1}{2}.  (16)

That is, the weak classifier f_t chosen at step t is the worst element in \mathcal{F} in the sense of the weighted error rate at step t + 1.
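The steps above can be sketched in a short program. This is an illustrative implementation with an invented two-gene data set (an OR-type pattern that no single stump can fit), not the authors' code; the always-positive stump plays the role of an intercept.

```python
import math

def stump(k, b):
    """Decision stump (5): +1 if x[k] >= b, else -1."""
    return lambda x: 1 if x[k] >= b else -1

def adaboost(X, y, stumps, T=10):
    """AdaBoost, steps (1)-(3): reweight the data, pick the stump with the
    smallest weighted error (8), and combine it with coefficient (9)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []  # list of (beta_t, f_t)
    for _ in range(T):
        total = sum(w)
        errs = [sum(w[i] for i in range(n) if f(X[i]) != y[i]) / total
                for f in stumps]
        best = min(range(len(stumps)), key=errs.__getitem__)
        eps = errs[best]
        f = stumps[best]
        if eps == 0.0:
            model.append((1.0, f))  # a perfect stump: take it and stop
            break
        if eps >= 0.5:
            break  # no weak classifier better than chance remains
        beta = 0.5 * math.log((1.0 - eps) / eps)              # (9)
        w = [w[i] * math.exp(-y[i] * beta * f(X[i])) for i in range(n)]  # (10)
        model.append((beta, f))
    return lambda x: sum(b * f(x) for b, f in model)

# Invented OR-type data: no single stump separates it, but three boosting
# rounds over two feature stumps plus a constant stump do.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, 1, 1, 1]
stumps_ = [stump(0, -0.5), stump(0, 0.5), stump(1, 0.5)]
F = adaboost(X, y, stumps_, T=3)
print([1 if F(x) > 0 else -1 for x in X])  # -> [-1, 1, 1, 1]
```

After each round the misclassified points gain weight, so the next stump chosen is one that repairs the current mistakes, exactly the reweighting logic of (8)-(10).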
3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion consists of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples \mathbf{x}_i^-, i = 1, \ldots, n_-, and \mathbf{x}_j^+, j = 1, \ldots, n_+, is given as

\mathrm{AUC}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H(F(\mathbf{x}_j^+) - F(\mathbf{x}_i^-)),  (17)

where H(z) is the Heaviside function: H(z) = 1 if z \ge 0 and H(z) = 0 otherwise. To avoid the difficulty of maximizing the nondifferentiable function above, an approximate AUC is considered by Komori [28]:

\mathrm{AUC}_\sigma(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma(F(\mathbf{x}_j^+) - F(\mathbf{x}_i^-)),  (18)

where H_\sigma(z) = \Phi(z/\sigma), with \Phi being the standard normal distribution function. A smaller scale parameter \sigma means a better approximation of the Heaviside function H(z). Based on this approximation, a boosting method for the maximization of the AUC as well as the partial AUC is proposed by Komori and Eguchi [22], in which they consider the following objective function:
\mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma(F(\mathbf{x}_j^+) - F(\mathbf{x}_i^-)) - \lambda \sum_{k=1}^{p} \int F_k''(x_k)^2 \, dx_k,  (19)

where F_k''(x_k) is the second derivative of the kth component of F(\mathbf{x}) and \lambda is a smoothing parameter that controls the smoothness of F(\mathbf{x}). Here the set of weak classifiers is given as

\mathcal{F} = \left\{ f(\mathbf{x}) = \frac{N_{kl}(x_k)}{Z_{kl}} \;\middle|\; k = 1, 2, \ldots, p; \; l = 1, 2, \ldots, m_k \right\},  (20)

where N_{kl} is a basis function for representing natural cubic splines and Z_{kl} is a standardization factor. Then the relationship

\max_{\sigma, \lambda, F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda, F} \mathrm{AUC}_{1,\lambda}(F)  (21)
allows us to fix \sigma = 1 without loss of generality, and we have the following algorithm.

(1) Start with F_0(\mathbf{x}) = 0.
(2) For t = 1, \ldots, T:
 (a) update \beta_{t-1}(f) to \beta_t(f) with a one-step Newton-Raphson iteration;
 (b) find the best weak classifier f_t:

 f_t = \arg\max_{f} \mathrm{AUC}_{1,\lambda}(F_{t-1} + \beta_t(f) f);  (22)

 (c) update the score function as

 F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t) f_t(\mathbf{x}).  (23)

(3) Output F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t(f_t) f_t(\mathbf{x}).

The value of the smoothing parameter \lambda and the iteration number T are determined simultaneously by cross-validation.
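The full AUCBoost algorithm with spline weak classifiers is beyond a short sketch, but the core quantities, the empirical AUC (17) and its smooth approximation (18), can be computed directly as below; the toy scores are invented.

```python
import math

def smooth_heaviside(z, sigma):
    """H_sigma(z) = Phi(z / sigma), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / (sigma * math.sqrt(2.0))))

def auc(scores_pos, scores_neg, sigma=None):
    """Empirical AUC (17); with sigma set, the smooth version (18)."""
    if sigma is None:
        h = lambda z: 1.0 if z >= 0 else 0.0  # Heaviside H(z)
    else:
        h = lambda z: smooth_heaviside(z, sigma)
    pairs = [(sp, sn) for sp in scores_pos for sn in scores_neg]
    return sum(h(sp - sn) for sp, sn in pairs) / len(pairs)

# Invented scores: 3 of the 4 (positive, negative) pairs are ordered
# correctly, so the empirical AUC is 0.75.
pos, neg = [2.0, 3.0], [1.0, 2.5]
print(auc(pos, neg))                  # -> 0.75
print(auc(pos, neg, sigma=0.01))      # close to 0.75 (smooth version)
```

As sigma shrinks, the smooth version approaches the empirical AUC while remaining differentiable in the scores, which is what makes gradient-style boosting of the AUC feasible.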
3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage was applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters \mathcal{C} = \{C_1, \ldots, C_K\} and the biological categories or subtypes \mathcal{B} = \{B_1, \ldots, B_L\}:

\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k (n_k - 1)} \sum_{i \ne j, \; i,j \in C_k} I(B^{(i)} = B^{(j)}),  (24)

where B^{(i)} \in \mathcal{B} is the subtype of subject i and n_k is the number of subjects in C_k. This index is bounded above by 1, which corresponds to perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status, that is, estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related to the prognosis of the patients [23].
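A direct implementation of (24) can be sketched as follows; the toy clusters and subtype labels are invented.

```python
def bhi(clusters, subtype):
    """Biological homogeneity index (24).

    clusters: list of clusters, each a list of subject indices.
    subtype:  mapping from subject index to biological category.
    """
    total = 0.0
    for members in clusters:
        nk = len(members)
        if nk < 2:
            continue  # a singleton cluster has no pairs to compare
        # ordered pairs (i, j), i != j, within the cluster that agree
        agree = sum(1 for i in members for j in members
                    if i != j and subtype[i] == subtype[j])
        total += agree / (nk * (nk - 1))
    return total / len(clusters)

# Invented example: the second cluster is perfectly homogeneous,
# the first only partially so, giving BHI = (1/3 + 1) / 2 = 2/3.
clusters = [[0, 1, 2], [3, 4]]
subtype = {0: "A", 1: "A", 2: "B", 3: "B", 4: "B"}
print(bhi(clusters, subtype))
```

Each cluster contributes the fraction of its within-cluster pairs that share a subtype, so a BHI near 1 means the clustering recovers the biological categories.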
3.3. Mutual Coherence. Now we have a data matrix X with n rows (n patients) and p columns (p genes). We assume an n-dimensional vector \mathbf{b} indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis. The magnitude of \mathbf{b} denotes the level of disease status. Then the optimal linear solution \boldsymbol{\beta} \in \mathbb{R}^p should satisfy

X \boldsymbol{\beta} = \mathbf{b}.  (25)

Note that if p is much larger than n, then only a few elements of \boldsymbol{\beta} should be nonzero and the others zero, which means \boldsymbol{\beta} has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix X as

\mu(X) = \max_{1 \le i, j \le p, \; i \ne j} \frac{|\mathbf{x}_i^T \mathbf{x}_j|}{\|\mathbf{x}_i\|_2 \|\mathbf{x}_j\|_2},  (26)

where \mathbf{x}_i \in \mathbb{R}^n denotes the ith column of X (the ith gene in the breast cancer data), |\cdot| is the absolute value, and \|\cdot\|_2 is the Euclidean norm. This index measures the closeness of the columns (genes) in the data matrix X and is bounded as 0 \le \mu(X) \le 1. The next theorem shows the relationship between the sparsity of \boldsymbol{\beta} and the data matrix X.
Theorem 1 (Elad [30]). If X\boldsymbol{\beta} = \mathbf{b} has a solution \boldsymbol{\beta} obeying \|\boldsymbol{\beta}\|_0 < (1 + 1/\mu(X))/2, this solution is necessarily the sparsest possible, where \|\boldsymbol{\beta}\|_0 denotes the number of the nonzero components of \boldsymbol{\beta}.
This theorem suggests that the linear discriminant function F(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{x} could have a sparse solution for predicting the disease status \mathbf{b}, which corresponds to metastases or good prognosis in the breast cancer data. If the value of \mu(X) is nearly equal to 1, then \|\boldsymbol{\beta}\|_0 could become approximately 1, indicating that just one gene (a column of X) could be enough to predict the disease status if there were a solution \boldsymbol{\beta}. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in classification problems, Theorem 1 is indicative in the case where the number of covariates p is much larger than the number of observations n.
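The mutual coherence (26) can be computed directly from the column-wise cosines; the sketch below uses an invented 2 x 3 matrix whose third column overlaps the first two.

```python
import math

def mutual_coherence(X):
    """mu(X) in (26): the largest absolute cosine between two distinct
    columns of the n-by-p matrix X (given as a list of rows)."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    mu = 0.0
    for i in range(p):
        for j in range(i + 1, p):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            mu = max(mu, abs(dot) / (norms[i] * norms[j]))
    return mu

# Invented matrix with columns (1,0), (0,1), (1,1): the extreme pair
# has cosine 1/sqrt(2), so mu(X) is about 0.707.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
print(mutual_coherence(X))
```

A value near 1, as reported below for the MammaPrint genes, signals nearly duplicated columns, which is exactly the regime in which many different sparse solutions of (25) become plausible.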
3.4. Results. We prepare various data sets using the training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7] in order to investigate the existence of multiple solutions for the prediction of the disease status. The data sets are given by

\mathcal{D} = \{D_{1\text{-}70}, D_{6\text{-}75}, \ldots, D_{161\text{-}230}\}.  (27)

The first data set D_{1-70} consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in \mathcal{D} is on the basis of the correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions F(\mathbf{x}), where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The results on the classification performance of the van't Veer method and DLDA are illustrated in Figure 1; those of AdaBoost and AUCBoost are in Figure 2. The AUC and the error rate are plotted against \mathcal{D}. In regard to the performance based on the training data, denoted by the solid lines, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in \mathcal{D}, having an AUC of more than 0.8 and an error rate of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
We investigate the performance of the hierarchical clusterings based on the training data sets D_{1-70}, D_{11-80}, and D_{111-180}, which are shown in Figure 3. Each row represents 70 genes and each column represents 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint (D_{1-70}) are 0.70, 0.69, and 0.57, respectively. The gene expression in D_{11-80} shows patterns different from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the basal-like subtype or the triple-negative subtype, though the Her2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER positive that shows good prognosis indicates Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set D_{111-180} has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that in D_{1-70}. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in \mathcal{D} have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.

Next we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with 5420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5; the lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performance of DLDA, which shows the most stable prediction accuracy in Figure 1, over 100 replications based on randomly selected 50 patients is shown in Figure 6, where the vertical axes are the AUC (left panels) and the error rate (right panels) and the genes are ranked by the correlation (a) and by the AUC (b). The performance is clearly worse than that in Figure 1 in terms of both the AUC and the error rate. The heterogeneity of the gene expression patterns may come from the several subtypes
Figure 1: Results of classical methods. (a) van't Veer method and (b) DLDA: the AUC values (left panels) and the error rates (right panels) over the data sets \mathcal{D} (genes 1-70 through 161-230), for training and test data.
as suggested in the cluster analysis, and this heterogeneity would make it difficult to select useful genes for the predictions.
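The resampling experiment described above can be sketched as follows: genes are re-ranked on random subsets of subjects, and we count how often each gene enters the top k. The synthetic data (one gene exactly tracking the labels plus noise genes) and all the sizes are invented for illustration, on a far smaller scale than the 5420-gene study.

```python
import random

def correlation(u, v):
    """Pearson correlation of two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    du = sum((a - mu) ** 2 for a in u) ** 0.5
    dv = sum((b - mv) ** 2 for b in v) ** 0.5
    return num / (du * dv)

def top_k_frequency(expr, labels, k, n_subjects, trials, rng):
    """For each gene, the fraction of random subject subsets on which it
    is ranked in the top k by |correlation with the labels|."""
    p = len(expr)  # expr[g][s]: expression of gene g in subject s
    counts = [0] * p
    subjects = list(range(len(labels)))
    for _ in range(trials):
        sub = rng.sample(subjects, n_subjects)
        score = [abs(correlation([expr[g][s] for s in sub],
                                 [labels[s] for s in sub]))
                 for g in range(p)]
        for g in sorted(range(p), key=score.__getitem__, reverse=True)[:k]:
            counts[g] += 1
    return [c / trials for c in counts]

# Synthetic data: gene 0 tracks the labels exactly, genes 1-4 are noise.
rng = random.Random(0)
labels = [-1.0] * 10 + [1.0] * 10
expr = [list(labels)] + [[rng.uniform(-1.0, 1.0) for _ in labels]
                         for _ in range(4)]
freq = top_k_frequency(expr, labels, k=2, n_subjects=15, trials=30, rng=rng)
print(freq[0])  # the informative gene is (almost) always in the top 2
```

With many weakly and similarly correlated genes, as in the real data, the top-k membership of individual genes fluctuates sharply between subsets, which is the instability visible in Figure 5.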
Finally, we calculate the mutual coherence based on the genes in MammaPrint and on the total of 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM_014889 and NM_014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total of 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone
Figure 2: Results of boosting methods. (a) AdaBoost and (b) AUCBoost: the AUC values (left panels) and the error rates (right panels) over the data sets \mathcal{D} (genes 1-70 through 161-230), for training and test data.
to selecting genes whose gene expression patterns are similar to each other. This would be one reason why we have multiple solutions after the ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.
4. Discussion and Concluding Remarks
In this paper we have addressed an important classification issue in the use of microarray data. Our results demonstrate the existence of multiple suboptimal solutions for the choice of classification predictors. The classification results showed that the performance of several gene sets was nearly the same; this behavior was confirmed for the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, some non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and
Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients. (a) D_{1-70} (MammaPrint); (b) D_{11-80}, clearly showing some subtypes; and (c) D_{111-180}, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.
the bad prognosis groups from the expression patterns obtained by clustering the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize samples into the known subtypes, but considering subtypes in feature selection could be future work.
Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line) over the data sets 1-70 through 161-230.
Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.
Figure 6: The values of AUC (left panels) and error rate (right panels) calculated by DLDA over 100 trials, using genes ranked in the top 70 by the correlation coefficients (a) and by the AUC (b). These values are calculated based on randomly sampled 50 patients.
Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarrays. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the p >> n problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.

One approach to tackling the high dimensionality of the data is to apply statistical methods in combination with background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the resultant outcome of the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of the genes unstable, leading to the multiple solutions for prediction rules. As seen in Figure 5 and suggested by
Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint (NM_014889 versus NM_014968). (b) The distribution of the correlation for the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of pairs of genes, based on which the correlations are calculated, standardized between 0 and 1 for a clear view.
Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of the total set of genes, would be useful to capture the mutual relations among genes and to identify subgroups of informative genes for the prediction problem. The combination of unsupervised and supervised learning is a promising way to solve the difficulty involved in high-dimensional data.
We addressed the challenges and difficulties regarding pattern recognition of gene expression. The main problem is caused by high-dimensional data sets. This is not only a problem of microarray data but also of RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarray for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] showed a clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: its dimension is quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data suffer from the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.
Conflict of Interests
The authors have declared that no conflict of interests exists.
References
[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267–288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.
[7] L. J. van 't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.
[14] F. Cardoso, L. van 't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 19, pp. 10869–10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 19, pp. 10869–10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.
Table 1: A taxonomy of feature selection techniques, summarized by Saeys et al. [8]. Three major feature selection models are addressed, and each type has a subcategory. Advantages, disadvantages, and example methods are shown.

Filter (univariate)
- Advantages: fast; scalable; independent of the classifier.
- Disadvantages: ignores feature dependencies; ignores interaction with the classifier.
- Examples: χ²; Euclidean distance; t-test; information gain.

Filter (multivariate)
- Advantages: models feature dependencies; independent of the classifier; better computational complexity than wrapper methods.
- Disadvantages: slower and less scalable than univariate techniques; ignores interaction with the classifier.
- Examples: correlation-based feature selection; Markov blanket filter; fast correlation-based feature selection.

Wrapper (deterministic)
- Advantages: simple; interacts with the classifier; models feature dependencies; less computationally intensive than randomized methods.
- Disadvantages: risk of overfitting; more prone than randomized algorithms to getting stuck in a local optimum; classifier-dependent selection.
- Examples: sequential forward selection; sequential backward selection.

Wrapper (randomized)
- Advantages: less prone to local optima; interacts with the classifier; models feature dependencies.
- Disadvantages: computationally intensive; classifier-dependent selection; higher risk of overfitting than deterministic algorithms.
- Examples: simulated annealing; randomized hill climbing; genetic algorithms; estimation of distribution algorithms.

Embedded
- Advantages: interacts with the classifier; better computational complexity than wrapper methods; models feature dependencies.
- Disadvantages: classifier-dependent selection.
- Examples: decision trees; weighted naive Bayes; RFE-SVM.
genes were decided using 78 patients' tumor samples. In brief, they selected 5000 significantly expressed genes from the 25000 genes on the microarray; then the correlation coefficient with outcome was calculated for each gene. Genes were sorted by correlation coefficient and further optimized from the top-ranked gene by sequentially adding subsets of five genes. The top 70 genes were proposed as outcome predictors. Paik et al. [13] proposed a 21-gene signature based on RT-PCR results. These two multigene prognostic signatures are available as clinical tests named MammaPrint and Oncotype DX. The FDA cleared the MammaPrint test in 2007, and it is currently being tested in the Microarray In Node-negative and 1–3 positive lymph-node Disease may Avoid ChemoTherapy (MINDACT) trial for further assessment, as described by Cardoso et al. [14].

Besides these two multigene prognostic signatures, different multigene sets have been selected as prognostic signatures. Fan et al. [15] discussed several different multigene signatures in breast cancer prognostic classification studies. Those signatures show little overlap; however, they still showed similar classification power. Fan et al. [15] suggested that these signatures are probably tracking a common set of biological phenotypes. Considering the thousands and millions of genes on a microarray, multiple useful signature sets are not difficult to imagine; however, finding a stable and informative gene set with high classification accuracy is of interest.
2.3. Feature Selection. Reduction of the dimension size is necessary, as superfluous features can cause overfitting and make interpretation of the classification model difficult. Reducing the dimension size while keeping relevant features is important, and several feature selection methods have been proposed. Saeys et al. [8] provided a taxonomy of feature selection methods and discussed their use, advantages, and disadvantages. They mentioned three objectives of feature selection: (a) to avoid overfitting and improve model performance, that is, prediction performance in the case of supervised classification and better cluster detection in the case of clustering; (b) to provide faster and more cost-effective models; and (c) to gain a deeper insight into the underlying processes that generated the data. Table 1 provides their taxonomy of feature selection methods. In the context of classification, feature selection methods are organized into three categories, filter methods, wrapper methods, and embedded methods, depending on how they are combined with the construction of a classification model.
Filter methods calculate a statistic, such as the t-statistic, and then filter out the features that do not meet a threshold
value. Advantages of filter methods are easy implementation, computational simplicity, and speed. Filter methods are independent of classification methods; therefore, different classifiers can be used. Two of the disadvantages of filter methods are that they ignore the interaction with the classification model and that most of the proposed methods are univariate.
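As an illustration of a univariate filter, the thresholding step described above can be sketched in pure Python. The function names and the use of a Welch-type t-statistic are our own choices for this sketch, not prescriptions from the paper:

```python
import math

def t_statistic(group_a, group_b):
    """Welch t-statistic for one feature measured in two groups."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def filter_features(X, y, threshold):
    """Keep the column indices whose |t| meets the threshold.
    X: list of samples (rows), each a list of feature values; y: +1/-1 labels."""
    p = len(X[0])
    kept = []
    for k in range(p):
        pos = [X[i][k] for i in range(len(X)) if y[i] == 1]
        neg = [X[i][k] for i in range(len(X)) if y[i] == -1]
        if abs(t_statistic(pos, neg)) >= threshold:
            kept.append(k)
    return kept
```

Because each feature is scored in isolation, the filter is fast but, as noted above, blind to feature dependencies.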
Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search, as in sequential backward selection [16]. In principle, wrapper methods search over all feature subsets, and the feature subset space grows exponentially with the number of features. One advantage of wrapper methods is the interaction between the feature subset search and model selection. The computational burden is one of the disadvantages of this approach, especially if building the classifier has a high computational cost.
The third class of feature selection methods is embedded techniques. Like wrapper methods, embedded techniques search for an optimal subset of features during classifier construction; an example is SVM-RFE, proposed by Guyon et al. [17]. Thus, embedded approaches are specific to a given learning algorithm. The difference from wrapper methods is that embedded techniques are guided by the learning process: whereas wrapper methods search all possible combinations of gene sets, embedded techniques search for the combination based on a criterion, which reduces the computational burden.
Besides these methods, the idea of sparseness was recently introduced in some feature selection methods. One approach is to use penalties in the regression context. Tibshirani [3] proposed the lasso, which uses an $L_1$-norm penalty. The combination of the $L_1$ penalty and the $L_2$ penalty is called the elastic net [18]. These methods focus on reducing the number of features to avoid overfitting and to obtain better interpretability, as biologists expect to gain biological insight from the selected features.
However, these sparseness ideas do not take into account the existence of multiple solutions in one data set. When the data dimension is in the thousands or millions, there are multiple possible solutions; sorting genes by some criterion and then selecting a subset from the top is not always the best selection. We elaborate on multiple solutions in the following section and give some ideas on how to select an optimal solution. Here we refer to an optimal solution as a prediction rule with high classification accuracy for various data sets.
3. Multiple Solutions for Prediction Rules
The existence of multiple solutions for the prediction of disease status based on the breast cancer data [7] was shown by Ein-Dor et al. [19], who suggested three reasons for this problem. The first is that there are many genes correlated with disease status; the second is that the differences in correlation among the genes are very small; the last is that the value of the correlation is very sensitive to the sample used for its calculation. In that paper, they demonstrated gene ranking based on the correlation and showed that there exist many equally predictive gene sets.
In this section, we investigate the existence of multiple solutions from different viewpoints. First, to check the variability of prediction accuracy across different statistical methods, we apply the van 't Veer method [7], Fisher's linear discriminant analysis [20], AdaBoost [21], and AUCBoost [22]. The last two methods are called boosting in the machine learning community, where genes are nonlinearly combined to predict the disease status. Second, we apply hierarchical clustering to examine the heterogeneity of gene expression patterns. Sørlie et al. [23] showed that there exist subtypes of breast cancer for which the patterns of gene expression are clearly different and the disease statuses differ accordingly. Ein-Dor et al. [19] suggested that the heterogeneity of the subtypes is one reason why there are so many solutions for the prediction, or large fluctuations in the genes selected for the predictions. Hence, we calculate the Biological Homogeneity Index (BHI) to see the clustering performance for various gene sets and to examine the existence of the subtypes. Finally, we consider the mutual coherence of the breast cancer data and discuss its relation to the multiple solutions.
The breast cancer data consist of the expression data of 25000 genes for 97 cancer patients. After a filtering procedure, 5420 genes were identified as significantly regulated, and we focus on this filtered data. The patients are divided into a training sample (78 patients) and a test sample (19 patients) in the same way as in the original paper [7]. Here we consider classical methods and boosting methods to predict whether a patient has good prognosis (free of disease for at least 5 years after initial diagnosis) or bad prognosis (distant metastases within 5 years). The prediction rule is generated from the training sample, and we measure the prediction accuracy on the test sample.
3.1. Classification. We briefly introduce the statistical methods used for the prediction of the disease status. The first two methods are linear discriminant functions, where the genes are linearly combined to classify the patients into a good prognosis group or a bad prognosis group. The last two are boosting methods, where the genes are nonlinearly combined to generate the discriminant functions.
3.1.1. The van 't Veer Method. Let $y$ be a class label indicating disease status, such as $y = -1$ (good prognosis) and $y = 1$ (metastases), and let $\mathbf{x} = (x_1, \ldots, x_p)^T$ be a $p$-dimensional covariate such as gene expression. We denote the samples of $y = -1$ and $y = 1$ as $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, respectively, and the mean value of the patients with good prognosis as
$$\bar{\mathbf{x}}^- = (\bar{x}^-_1, \ldots, \bar{x}^-_p)^T = \frac{1}{n_-} \sum_{i=1}^{n_-} \mathbf{x}^-_i. \qquad (1)$$
Then van 't Veer et al. [7] proposed a discriminant function $F(\mathbf{x})$ based on the correlation with the average good prognosis profile above, which is given as
$$F(\mathbf{x}) = -\frac{\sum_{k=1}^{p} (x_k - \bar{x}) (\bar{x}^-_k - \bar{x}^-)}{\sqrt{\sum_{k=1}^{p} (x_k - \bar{x})^2} \sqrt{\sum_{k=1}^{p} (\bar{x}^-_k - \bar{x}^-)^2}}, \qquad (2)$$
where
$$\bar{x} = \frac{1}{p} \sum_{k=1}^{p} x_k, \qquad \bar{x}^- = \frac{1}{p} \sum_{k=1}^{p} \bar{x}^-_k. \qquad (3)$$
If the value of $F(\mathbf{x})$ is smaller than a predefined threshold value, then the patient is judged to have good prognosis ($y = -1$), and otherwise to have metastases ($y = 1$). This method is a form of one-class classification, where only the information of one class label is used to predict the disease status $y$. This idea is also employed in the machine learning community; see Yousef et al. [24] and Gardner et al. [25] for applications in biology and medicine.
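Since equation (2) is minus the Pearson correlation between a new expression profile and the average good-prognosis profile, the rule can be sketched in a few lines of Python. This is our own minimal illustration, assuming profiles are given as plain lists; function names are ours:

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (math.sqrt(sum((a - mu) ** 2 for a in u))
           * math.sqrt(sum((b - mv) ** 2 for b in v)))
    return num / den

def vant_veer_score(x, good_profile):
    """F(x) of equation (2): minus the correlation of profile x
    with the average good-prognosis profile."""
    return -pearson(x, good_profile)

def classify(x, good_profile, threshold=0.0):
    """y = -1 (good prognosis) if F(x) < threshold, else y = +1 (metastases)."""
    return -1 if vant_veer_score(x, good_profile) < threshold else 1
```

A profile that tracks the good-prognosis average closely gets a strongly negative score and is classified as good prognosis.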
3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that $\mathbf{x}$ is distributed as $N(\boldsymbol{\mu}_-, \Sigma_-)$ for $y = -1$ and as $N(\boldsymbol{\mu}_+, \Sigma_+)$ for $y = 1$. Then, if $\Sigma_- = \Sigma_+ = \Sigma$, the estimated log-likelihood ratio is given as
$$F(\mathbf{x}) = (\bar{\mathbf{x}}_+ - \bar{\mathbf{x}}_-)^T \hat{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \bar{\mathbf{x}}_+^T \hat{\Sigma}^{-1} \bar{\mathbf{x}}_+ + \frac{1}{2} \bar{\mathbf{x}}_-^T \hat{\Sigma}^{-1} \bar{\mathbf{x}}_- + \log \frac{n_+}{n_-}, \qquad (4)$$
where $\bar{\mathbf{x}}_+$, $\bar{\mathbf{x}}_-$, and $\hat{\Sigma}$ are the sample means for $y \in \{-1, 1\}$ and the total sample variance, respectively. For simplicity, we take only the diagonal part of $\hat{\Sigma}$ (Diagonal Linear Discriminant Analysis) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used in situations where $p$ is much larger than $n$ ($= n_- + n_+$); in that case, the inverse of $\hat{\Sigma}$ cannot be calculated.
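With a diagonal covariance, equation (4) decomposes into a sum over features, so the rule can be sketched without any matrix inversion. The sketch below is our own pure-Python illustration; a pooled per-feature variance stands in for the diagonal of $\hat{\Sigma}$:

```python
import math

def dlda_fit(X, y):
    """Estimate per-feature class means and the pooled diagonal variance.
    X: list of samples (rows); y: labels in {-1, +1}."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    p = len(X[0])
    mean_pos = [sum(x[k] for x in pos) / len(pos) for k in range(p)]
    mean_neg = [sum(x[k] for x in neg) / len(neg) for k in range(p)]
    var = []
    for k in range(p):
        # pooled within-class sum of squares, feature by feature
        ss = sum((x[k] - mean_pos[k]) ** 2 for x in pos)
        ss += sum((x[k] - mean_neg[k]) ** 2 for x in neg)
        var.append(ss / (len(X) - 2))
    prior = math.log(len(pos) / len(neg))
    return mean_pos, mean_neg, var, prior

def dlda_score(x, model):
    """Log-likelihood ratio F(x) of equation (4) with a diagonal covariance;
    predict y = +1 when F(x) > 0."""
    mean_pos, mean_neg, var, prior = model
    f = prior
    for k in range(len(x)):
        f += (mean_pos[k] - mean_neg[k]) * x[k] / var[k]
        f -= 0.5 * (mean_pos[k] ** 2 - mean_neg[k] ** 2) / var[k]
    return f
```

Note that for balanced classes the decision boundary sits at the midpoint between the two class means in each coordinate.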
3.1.3. AdaBoost. We introduce a famous boosting method from the machine learning community. The key concept of boosting is to construct a powerful discriminant function $F(\mathbf{x})$ by combining various weak classifiers $f(\mathbf{x})$ [26]. We employ a set $\mathcal{F}$ of decision stumps as a dictionary of weak classifiers. Here the decision stump for the $k$th gene expression $x_k$ is defined as a simple step function:
$$f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases} \qquad (5)$$
where $b_k$ is a threshold value. Accordingly, $f_k(\mathbf{x})$ is the simplest classifier in the sense that it neglects all information in the gene expression pattern other than that of the single gene $x_k$. However, by varying the value of $b_k$ for all genes ($k = 1, \ldots, p$), we obtain an $\mathcal{F}$ that contains exhaustive information about the gene expression patterns. We attempt to build a good discriminant function $F$ by combining decision stumps in $\mathcal{F}$.
The typical boosting method is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss for a discriminant function $F$:
$$L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}, \qquad (6)$$
where the entire data set is given as $\{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$. The exponential loss is sequentially minimized by the following algorithm.

(1) Initialize the weight of $\mathbf{x}_i$ for $i = 1, \ldots, n$ as $w_1(i) = 1/n$.

(2) For iteration number $t = 1, \ldots, T$:

(a) choose $f_t$ as
$$f_t = \operatorname*{argmin}_{f \in \mathcal{F}} \epsilon_t(f), \qquad (7)$$
where
$$\epsilon_t(f) = \sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i) \frac{w_t(i)}{\sum_{i'=1}^{n} w_t(i')}, \qquad (8)$$
and $I$ is the indicator function;

(b) calculate the coefficient as
$$\beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)}; \qquad (9)$$

(c) update the weights as
$$w_{t+1}(i) = w_t(i) \exp\{-y_i \beta_t f_t(\mathbf{x}_i)\}. \qquad (10)$$

(3) Output the final function as $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t f_t(\mathbf{x})$.
Here we have
$$L_{\exp}(F + \beta f) = \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \exp\{-\beta y_i f(\mathbf{x}_i)\} \qquad (11)$$
$$= \frac{1}{n} \sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\} \left[ e^{\beta} I(f(\mathbf{x}_i) \neq y_i) + e^{-\beta} I(f(\mathbf{x}_i) = y_i) \right] \qquad (12)$$
$$= L_{\exp}(F) \left\{ e^{\beta} \epsilon(f) + e^{-\beta} (1 - \epsilon(f)) \right\} \qquad (13)$$
$$\ge 2 L_{\exp}(F) \sqrt{\epsilon(f) (1 - \epsilon(f))}, \qquad (14)$$
where $\epsilon(f) = \frac{1}{n} \sum_{i=1}^{n} I(f(\mathbf{x}_i) \neq y_i) \exp\{-y_i F(\mathbf{x}_i)\} / L_{\exp}(F)$, and the equality in (14) is attained if and only if
$$\beta = \frac{1}{2} \log \frac{1 - \epsilon(f)}{\epsilon(f)}. \qquad (15)$$
Hence we can see that the exponential loss is sequentially minimized in the algorithm. It is also easily shown that
$$\epsilon_{t+1}(f_t) = \frac{1}{2}. \qquad (16)$$
That is, the weak classifier $f_t$ chosen at step $t$ is the worst element in $\mathcal{F}$ in the sense of the weighted error rate at step $t + 1$.
3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion is a function of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, is given as
$$\widehat{\mathrm{AUC}}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)), \qquad (17)$$
where $H(z)$ is the Heaviside function: $H(z) = 1$ if $z \ge 0$ and $H(z) = 0$ otherwise. To avoid the difficulty of maximizing this nondifferentiable function, an approximate AUC was considered by Komori [28] as
$$\widehat{\mathrm{AUC}}_\sigma(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)), \qquad (18)$$
where $H_\sigma(z) = \Phi(z/\sigma)$, with $\Phi$ being the standard normal distribution function. A smaller scale parameter $\sigma$ means a better approximation of the Heaviside function $H(z)$. Based on this approximation, a boosting method for the maximization of the AUC as well as the partial AUC was proposed by Komori and Eguchi [22], in which they consider the following objective function:
$$\widehat{\mathrm{AUC}}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)) - \lambda \sum_{k=1}^{p} \int F''_k(x_k)^2 \, dx_k, \qquad (19)$$
where $F''_k(x_k)$ is the second derivative of the $k$th component of $F(\mathbf{x})$ and $\lambda$ is a smoothing parameter that controls the smoothness of $F(\mathbf{x})$. Here the set of weak classifiers is given as
$$\mathcal{F} = \left\{ f(\mathbf{x}) = \frac{N_{kl}(x_k)}{Z_{kl}} \;\middle|\; k = 1, 2, \ldots, p; \; l = 1, 2, \ldots, m_k \right\}, \qquad (20)$$
where $N_{kl}$ is a basis function for representing natural cubic splines and $Z_{kl}$ is a standardization factor. Then the relationship
$$\max_{\sigma, \lambda, F} \widehat{\mathrm{AUC}}_{\sigma,\lambda}(F) = \max_{\lambda, F} \widehat{\mathrm{AUC}}_{1,\lambda}(F) \qquad (21)$$
allows us to fix $\sigma = 1$ without loss of generality, leading to the following algorithm.

(1) Start with $F_0(\mathbf{x}) = 0$.

(2) For $t = 1, \ldots, T$:

(a) update $\beta_{t-1}(f)$ to $\beta_t(f)$ with a one-step Newton–Raphson iteration;

(b) find the best weak classifier $f_t$:
$$f_t = \operatorname*{argmax}_{f \in \mathcal{F}} \widehat{\mathrm{AUC}}_{1,\lambda}(F_{t-1} + \beta_t(f) f); \qquad (22)$$

(c) update the score function as
$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t) f_t(\mathbf{x}). \qquad (23)$$

(3) Output $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t(f_t) f_t(\mathbf{x})$.

The value of the smoothing parameter $\lambda$ and the iteration number $T$ are determined simultaneously by cross validation.
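The empirical AUC (17) and its smoothed version (18) are straightforward to compute from two lists of scores. The pure-Python sketch below uses the error function to evaluate $\Phi$; it illustrates only the objective, not the full AUCBoost procedure with spline weak learners:

```python
import math

def empirical_auc(scores_neg, scores_pos):
    """Equation (17): fraction of (negative, positive) score pairs ranked
    correctly; ties count as correct because H(0) = 1 here."""
    total = sum(1 for s_n in scores_neg for s_p in scores_pos if s_p - s_n >= 0)
    return total / (len(scores_neg) * len(scores_pos))

def smoothed_auc(scores_neg, scores_pos, sigma=1.0):
    """Equation (18): replace the Heaviside step by the normal cdf Phi(z/sigma)."""
    def phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    total = sum(phi((s_p - s_n) / sigma)
                for s_n in scores_neg for s_p in scores_pos)
    return total / (len(scores_neg) * len(scores_pos))
```

As $\sigma \to 0$ the smoothed value converges to the empirical AUC, while for larger $\sigma$ the objective becomes differentiable enough for gradient-based boosting updates.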
3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage is applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters $\mathcal{C} = \{C_1, \ldots, C_K\}$ and the biological categories or subtypes $\mathcal{B} = \{B_1, \ldots, B_L\}$:
$$\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k (n_k - 1)} \sum_{\substack{i \neq j \\ i, j \in C_k}} I(B^{(i)} = B^{(j)}), \qquad (24)$$
where $B^{(i)} \in \mathcal{B}$ is the subtype of subject $i$ and $n_k$ is the number of subjects in $C_k$. This index is bounded above by 1, which indicates perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. Hormone status is known to be closely related to the prognosis of the patients [23].
3.3. Mutual Coherence. Now we have a data matrix $X$ with $n$ rows ($n$ patients) and $p$ columns ($p$ genes). We assume an $n$-dimensional vector $\mathbf{b}$ indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis; the magnitude of $\mathbf{b}$ denotes the level of disease status. Then the optimal linear solution $\boldsymbol{\beta} \in \mathbb{R}^p$ should satisfy
$$X \boldsymbol{\beta} = \mathbf{b}. \qquad (25)$$
Note that if $p$ is much larger than $n$, then only a few elements of $\boldsymbol{\beta}$ should be nonzero and the others zero, which means $\boldsymbol{\beta}$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as
$$\mu(X) = \max_{1 \le i, j \le p, \; i \neq j} \frac{\left| \mathbf{x}_i^T \mathbf{x}_j \right|}{\|\mathbf{x}_i\|_2 \|\mathbf{x}_j\|_2}, \qquad (26)$$
where $\mathbf{x}_i \in \mathbb{R}^n$ denotes the $i$th column of $X$, or the $i$th gene in the breast cancer data, $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix $X$ and is bounded as $0 \le \mu(X) \le 1$. The next theorem shows the relationship between the sparsity of $\boldsymbol{\beta}$ and the data matrix $X$.

Theorem 1 (Elad [30]). If $X \boldsymbol{\beta} = \mathbf{b}$ has a solution $\boldsymbol{\beta}$ obeying $\|\boldsymbol{\beta}\|_0 < (1 + 1/\mu(X))/2$, this solution is necessarily the sparsest possible, where $\|\boldsymbol{\beta}\|_0$ denotes the number of nonzero components of $\boldsymbol{\beta}$.

This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{x}$ could have a sparse solution to predict the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then $\|\boldsymbol{\beta}\|_0$ could become approximately 1, indicating that just one gene (a column of $X$) could be enough to predict the disease status if a solution $\boldsymbol{\beta}$ existed. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include the random effects seen in classification problems, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.
3.4. Results. We prepare various data sets using the training sample (78 patients), based on the 230 genes selected by van't Veer et al. [7], in order to investigate the existence of multiple solutions for the prediction of the disease status. The data sets are given by

$$\mathcal{D} = \left\{ D_{1\text{--}70},\; D_{6\text{--}75},\; \ldots,\; D_{161\text{--}230} \right\}. \quad (27)$$
The first data set, $D_{1\text{--}70}$, consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is based on the correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions $F(\mathbf{x})$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The results of the classification performance of the van't Veer method and DLDA are illustrated in Figure 1; those of AdaBoost and AUCBoost are in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. In regard to the performance based on the training data, denoted by the solid line, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, having an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
We investigate the performance of hierarchical clusterings based on the training data sets $D_{1\text{--}70}$, $D_{11\text{--}80}$, and $D_{111\text{--}180}$, which are shown in Figure 3. Each row represents 70 genes and each column represents 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1\text{--}70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11\text{--}80}$ shows different patterns from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the Basal-like subtype or triple-negative subtype, though the Her2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER positive shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set $D_{111\text{--}180}$ has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that in $D_{1\text{--}70}$. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.

Next, we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with their 5,420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient used in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performances of DLDA with 100 replications (DLDA shows the most stable prediction accuracy, as seen in Figure 1), based on randomly selected 50 patients, are shown in Figure 6, where the vertical axes are AUC (left panels) and error rate (right panels), and genes are ranked by the correlation (a) and AUC (b). The performance clearly gets worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of the gene expression pattern may come from the several subtypes,
[Figure: AUC (left) and error-rate (right) curves for training and test data, plotted over data sets 1–70, 41–120, 81–150, 121–190, and 161–230.]

Figure 1: Results of classical methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over data sets $\mathcal{D}$ for the van't Veer method and DLDA.
as suggested in the cluster analysis, and this would make it difficult to select useful genes for prediction.
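The resampling procedure described above (100 draws of 50 patients, ranking genes by absolute correlation with disease status) can be sketched as follows. The synthetic matrix `X` and labels `y` are placeholders for the actual breast cancer data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, top = 78, 500, 70                  # 78 patients; p synthetic genes
X = rng.normal(size=(n, p))              # stand-in for expression data
y = rng.choice([-1.0, 1.0], size=n)      # disease status labels

counts = np.zeros(p, dtype=int)
for _ in range(100):                     # 100 random subsamples of 50 patients
    idx = rng.choice(n, size=50, replace=False)
    Xs, ys = X[idx], y[idx]
    # absolute Pearson correlation of each gene with the status labels
    Xc = Xs - Xs.mean(axis=0)
    yc = ys - ys.mean()
    r = np.abs(Xc.T @ yc) / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    counts[np.argsort(r)[::-1][:top]] += 1

rates = counts / 100.0                   # per-gene selection rate, as in Figure 5
```

Plotting `rates` for a fixed reference gene set reproduces the kind of instability shown in the upper panel of Figure 5.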
Finally, we calculate the mutual coherence based on the genes in MammaPrint and the total 5,420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM 014889 and NM 014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes among the total 5,420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5,420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone
[Figure: AUC (left) and error-rate (right) curves for training and test data, plotted over data sets 1–70, 41–120, 81–150, 121–190, and 161–230.]

Figure 2: Results of boosting methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over data sets $\mathcal{D}$ for AdaBoost and AUCBoost.
to select genes whose expression patterns are similar to each other. This would be one reason why we have multiple solutions after ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.
4. Discussion and Concluding Remarks
In this paper we have addressed an important classification issue using microarray data. Our results show the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that the performance of several gene sets was almost the same. This behavior was confirmed with the van't Veer method, DLDA, AdaBoost, and AUCBoost. Remarkably, non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.
To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and
Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients: (a) $D_{1\text{--}70}$ (MammaPrint); (b) $D_{11\text{--}80}$, clearly showing some subtypes; and (c) $D_{111\text{--}180}$, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.
the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into known subtypes, but considering subtypes for feature selection could be future work.
Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line), over data sets 1–70, 41–120, 81–150, 121–190, and 161–230.
[Figure: bar plots of per-gene selection rates; the horizontal axis lists the accession IDs of the 70 MammaPrint genes.]

Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.
[Figure: AUC (left) and error-rate (right) values over 100 trials for training and test data.]

Figure 6: The values of AUC (left panel) and error rate (right panel) calculated by DLDA using genes ranked in the top 70 by the correlation coefficients (a) and by AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.
Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.
One approach to tackle this problem of the high dimensionality of the data is to apply statistical methods in combination with background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5 and suggested by
Figure 7: (a) Scatter plot of the pair of genes (NM 014889 and NM 014968) with the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distribution of the correlation for the total 5,420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of pairs of genes based on which the correlations are calculated; it is standardized between 0 and 1 for a clear view.
Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of all genes, would be useful for capturing the mutual relations among genes and for identifying subgroups of informative genes for the prediction problem. The combination of unsupervised and supervised learning is a promising way to overcome the difficulty involved in high-dimensional data.
We have addressed the challenges and difficulties regarding the pattern recognition of gene expression. The main problem is caused by high-dimensional data sets. This is a problem not only for microarray data but also for RNA sequencing. RNA sequencing (RNA-seq) technologies have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] showed clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: its dimension is quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data has the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.
Conflict of Interests
The authors have declared that no conflict of interests exists
References
[1] E. S. Lander, L. M. Linton, B. Birren, et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers, et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.
[7] L. J. van't Veer, H. Dai, et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin, et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo, et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[13] S. Paik, S. Shak, G. Tang, et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.
[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels, et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301–320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang, et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.
value. Advantages of filter methods are easy implementation, computational simplicity, and speed. Filter methods are independent of the classification method; therefore, different classifiers can be used. Two of the disadvantages of filter methods are that they ignore the interaction with the classification model and that most of the proposed methods are univariate.
Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search, as in sequential backward selection [16]. Wrapper methods search over feature subsets, and the feature subset space grows exponentially with the number of features. One advantage of wrapper methods is the interaction between the feature subset search and model selection. The computational burden is one of the disadvantages of this approach, especially if building the classifier has a high computational cost.
The third class of feature selection is embedded techniques. Like the wrapper methods, embedded techniques search for an optimal subset of features during classifier construction; an example is SVM-RFE, proposed by Guyon et al. [17]. Thus, embedded approaches are specific to a given learning algorithm. The difference from wrapper methods is that embedded techniques are guided by the learning process. Whereas wrapper methods search all possible combinations of gene sets, embedded techniques search for the combination based on a criterion. This enables a reduction in the computational burden.
Besides these methods, the idea of sparseness has recently been introduced in some feature selection methods. One approach is to use penalties in the regression context. Tibshirani [3] proposed the lasso, which uses an $L_1$-norm penalty; the combination of an $L_1$ penalty and an $L_2$ penalty is called the elastic net [18]. These methods focus on reducing features to avoid overfitting and to obtain better interpretability, as biologists expect to gain biological insight from the selected features.
However, these sparseness ideas do not take into account multiple solutions in one data set. When the data dimension is in the thousands or millions, there are multiple possible solutions. Sorting genes based on some criterion and then selecting a subset from the top is not always the best selection. We elaborate on multiple solutions in the following section and give some idea of how to select the optimum solution. Here we refer to an optimal solution as a prediction rule with high classification accuracy for various data sets.
3. Multiple Solutions for Prediction Rules
The existence of multiple solutions for the prediction of disease status based on the breast cancer data [7] was shown by Ein-Dor et al. [19], who suggest three reasons for this problem. The first is that there are many genes correlated with disease status; the second is that the differences in correlation among the genes are very small; the last is that the value of the correlation is very sensitive to the sample used for its calculation. In that paper, they demonstrate gene ranking based on the correlation and show that there exist many equally predictive gene sets.
In this section, we investigate the existence of multiple solutions from different viewpoints. First, to check the variability of prediction accuracy across different statistical methods, we apply the van't Veer method [7], Fisher's linear discriminant analysis [20], AdaBoost [21], and AUCBoost [22]. The last two methods are called boosting in the machine learning community, where genes are nonlinearly combined to predict the disease status. Second, we apply hierarchical clustering to examine the heterogeneity of gene expression patterns. Sørlie et al. [23] showed that there exist subtypes of breast cancer for which the patterns of gene expression are clearly different, and the disease statuses also differ accordingly. Ein-Dor et al. [19] suggest that the heterogeneity of the subtypes is one reason why there are so many solutions for the prediction, with large fluctuations in the genes selected for prediction. Hence, we calculate the Biological Homogeneity Index (BHI) to see the clustering performance for various gene sets and examine the existence of the subtypes. Finally, we consider the mutual coherence of the breast cancer data and discuss its relation to the multiple solutions.
The breast cancer data consist of the expression data of 25,000 genes for 97 cancer patients. After a filtering procedure, 5,420 genes were identified as significantly regulated, and we focus on this filtered data. The patients are divided into a training sample (78 patients) and a test sample (19 patients), in the same way as in the original paper [7]. Here we consider classical methods and boosting methods to predict whether the patient has good prognosis (free of disease for at least 5 years after initial diagnosis) or bad prognosis (distant metastases within 5 years). The prediction rule is generated from the training sample, and we measure the prediction accuracy on the test sample.
3.1. Classification. We briefly introduce the statistical methods used for the prediction of the disease status. The first two are linear discriminant functions, where the genes are linearly combined to classify the patients into a good prognosis group or a bad prognosis group. The last two are boosting methods, where the genes are nonlinearly combined to generate the discriminant functions.
3.1.1. The van't Veer Method. Let $y$ be a class label indicating disease status, such as $y = -1$ (good prognosis) and $y = 1$ (metastases), and let $\mathbf{x} = (x_1, \ldots, x_p)^T$ be a $p$-dimensional covariate such as gene expression. We denote the samples of $y = -1$ and $y = 1$ as $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, respectively, and the mean vector of the patients with good prognosis as

$$\bar{\mathbf{x}}^- = \left( \bar{x}^-_1, \ldots, \bar{x}^-_p \right)^T = \frac{1}{n_-} \sum_{i=1}^{n_-} \mathbf{x}^-_i. \quad (1)$$

Then van't Veer et al. [7] proposed a discriminant function $F(\mathbf{x})$ based on the correlation with the average good-prognosis profile above, which is given as

$$F(\mathbf{x}) = - \frac{\sum_{k=1}^{p} \left( x_k - \bar{x} \right) \left( \bar{x}^-_k - \bar{\bar{x}}^- \right)}{\sqrt{\sum_{k=1}^{p} \left( x_k - \bar{x} \right)^2} \sqrt{\sum_{k=1}^{p} \left( \bar{x}^-_k - \bar{\bar{x}}^- \right)^2}}, \quad (2)$$

where

$$\bar{x} = \frac{1}{p} \sum_{k=1}^{p} x_k, \qquad \bar{\bar{x}}^- = \frac{1}{p} \sum_{k=1}^{p} \bar{x}^-_k. \quad (3)$$
If the value of $F(\mathbf{x})$ is smaller than a predefined threshold value, then the patient is judged to have good prognosis ($y = -1$); otherwise, the patient is judged to have metastases ($y = 1$). This is a form of one-class classification, in which only the information of one class label is used to predict the disease status $y$. This idea is also employed in the machine learning community; see Yousef et al. [24] and Gardner et al. [25] for applications in biology and medicine.
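A direct transcription of (2) and (3) might look as follows; the function names are ours, and the threshold is left as a parameter as in the text:

```python
import numpy as np

def vantveer_score(x, good_mean):
    """F(x) in (2): the negative Pearson correlation between a patient's
    expression profile x and the average good-prognosis profile (1)."""
    xc = x - x.mean()                  # center over genes, as in (3)
    mc = good_mean - good_mean.mean()
    return -(xc @ mc) / (np.sqrt((xc ** 2).sum()) * np.sqrt((mc ** 2).sum()))

def vantveer_predict(x, good_mean, threshold=0.0):
    # score below the threshold -> good prognosis (y = -1), else metastases
    return -1 if vantveer_score(x, good_mean) < threshold else 1
```

A profile identical to the good-prognosis average gets the minimal score $-1$ and is classified as good prognosis; an anti-correlated profile gets $+1$.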
3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that $\mathbf{x}$ is distributed as $N(\boldsymbol{\mu}_-, \Sigma_-)$ for $y = -1$ and as $N(\boldsymbol{\mu}_+, \Sigma_+)$ for $y = 1$. Then, if $\Sigma_- = \Sigma_+ = \Sigma$, the estimated log-likelihood ratio is given as

$$F(\mathbf{x}) = \left( \bar{\mathbf{x}}_+ - \bar{\mathbf{x}}_- \right)^T \hat{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \bar{\mathbf{x}}_+^T \hat{\Sigma}^{-1} \bar{\mathbf{x}}_+ + \frac{1}{2} \bar{\mathbf{x}}_-^T \hat{\Sigma}^{-1} \bar{\mathbf{x}}_- + \log \frac{n_+}{n_-}, \quad (4)$$

where $\bar{\mathbf{x}}_+$, $\bar{\mathbf{x}}_-$, and $\hat{\Sigma}$ are the sample means for $y \in \{-1, 1\}$ and the total sample variance, respectively. For simplicity, we take the diagonal part of $\hat{\Sigma}$ (Diagonal Linear Discriminant Analysis) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used in situations where $p$ is much larger than $n$ ($= n_- + n_+$); in that case, the inverse of $\hat{\Sigma}$ cannot be calculated.
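The diagonal simplification of (4) can be sketched as follows: only per-gene variances are stored, so no $p \times p$ matrix inverse is needed even when $p \gg n$ (a minimal illustration; function names are ours):

```python
import numpy as np

def dlda_fit(Xm, Xp):
    """Fit diagonal LDA from the two classes (rows of Xm have y=-1, Xp y=+1)."""
    mu_m, mu_p = Xm.mean(axis=0), Xp.mean(axis=0)
    n_m, n_p = len(Xm), len(Xp)
    # pooled per-gene variances: the diagonal of Sigma-hat only
    var = (((Xm - mu_m) ** 2).sum(axis=0)
           + ((Xp - mu_p) ** 2).sum(axis=0)) / (n_m + n_p)
    return mu_m, mu_p, var, np.log(n_p / n_m)

def dlda_score(x, mu_m, mu_p, var, prior):
    # log-likelihood ratio (4) with Sigma-hat replaced by its diagonal
    return ((mu_p - mu_m) / var) @ x \
        - 0.5 * (mu_p ** 2 / var).sum() \
        + 0.5 * (mu_m ** 2 / var).sum() + prior
```

A positive score favors $y = 1$ (metastases), a negative score $y = -1$.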
3.1.3. AdaBoost. We introduce a famous boosting method from the machine learning community. The key concept of boosting is to construct a powerful discriminant function $F(\mathbf{x})$ by combining various weak classifiers $f(\mathbf{x})$ [26]. We employ a set $\mathcal{F}$ of decision stumps as a dictionary of weak classifiers. Here the decision stump for the $k$th gene expression $x_k$ is defined as the simple step function

$$f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases} \quad (5)$$

where $b_k$ is a threshold value. Accordingly, $f_k(\mathbf{x})$ is the simplest classifier in the sense that it neglects all information about gene expression patterns other than that of the single gene $x_k$. However, by varying the value of $b_k$ for all genes ($k = 1, \ldots, p$), we obtain a set $\mathcal{F}$ that contains exhaustive information about gene expression patterns. We attempt to build a good discriminant function $F$ by combining decision stumps in $\mathcal{F}$.
A typical boosting method is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss for a discriminant function $F$:

$$L_{\exp}(F) = \frac{1}{n} \sum_{i=1}^{n} \exp \left\{ -y_i F(\mathbf{x}_i) \right\}, \quad (6)$$

where the entire data set is given as $\{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$. The exponential loss is sequentially minimized by the following algorithm.

(1) Initialize the weight of $\mathbf{x}_i$ for $i = 1, \ldots, n$ as $w_1(i) = 1/n$.

(2) For the iteration number $t = 1, \ldots, T$:

(a) choose $f_t$ as

$$f_t = \arg\min_{f \in \mathcal{F}} \epsilon_t(f), \quad (7)$$

where

$$\epsilon_t(f) = \sum_{i=1}^{n} I \left( f(\mathbf{x}_i) \ne y_i \right) \frac{w_t(i)}{\sum_{i=1}^{n} w_t(i)}, \quad (8)$$

and $I$ is the indicator function;

(b) calculate the coefficient as

$$\beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)}; \quad (9)$$

(c) update the weight as

$$w_{t+1}(i) = w_t(i) \exp \left\{ -y_i \beta_t f_t(\mathbf{x}_i) \right\}. \quad (10)$$

(3) Output the final function as $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t f_t(\mathbf{x})$.
Here we have

$$L_{\exp}(F + \beta f) = \frac{1}{n}\sum_{i=1}^{n}\exp\{-y_i F(\mathbf{x}_i)\}\exp\{-\beta y_i f(\mathbf{x}_i)\} \qquad (11)$$

$$= \frac{1}{n}\sum_{i=1}^{n}\exp\{-y_i F(\mathbf{x}_i)\}\left[e^{\beta} I(f(\mathbf{x}_i) \ne y_i) + e^{-\beta} I(f(\mathbf{x}_i) = y_i)\right] \qquad (12)$$

$$= L_{\exp}(F)\left\{e^{\beta}\epsilon(f) + e^{-\beta}(1 - \epsilon(f))\right\} \qquad (13)$$

$$\ge 2 L_{\exp}(F)\sqrt{\epsilon(f)(1 - \epsilon(f))}, \qquad (14)$$

where $\epsilon(f) = (1/n)\sum_{i=1}^{n} I(f(\mathbf{x}_i) \ne y_i)\exp\{-y_i F(\mathbf{x}_i)\}\,/\,L_{\exp}(F)$, and the equality in (14) is attained if and only if

$$\beta = \frac{1}{2}\log\frac{1 - \epsilon(f)}{\epsilon(f)}. \qquad (15)$$

Hence we can see that the exponential loss is sequentially minimized by the algorithm. It is easily shown that

$$\epsilon_{t+1}(f_t) = \frac{1}{2}. \qquad (16)$$

That is, the weak classifier $f_t$ chosen at step $t$ is the worst element in $\mathcal{F}$ in the sense of the weighted error rate at step $t + 1$.
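The algorithm of steps (7)-(10) can be sketched as follows. This is an illustrative implementation with decision stumps searched over the observed values as thresholds, not the authors' code:

```python
import numpy as np

def adaboost_stumps(X, y, T=10):
    """AdaBoost with decision stumps f_k(x) = +/-1 according to x_k >= b_k."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)                  # step (1): uniform weights
    model = []                               # list of (gene k, threshold b, sign s, beta)
    for _ in range(T):
        best = None
        for k in range(p):                   # search over genes ...
            for b in np.unique(X[:, k]):     # ... thresholds ...
                for s in (1, -1):            # ... and the mirrored stump
                    pred = np.where(X[:, k] >= b, s, -s)
                    eps = w[pred != y].sum() / w.sum()   # Eq. (8)
                    if best is None or eps < best[0]:
                        best = (eps, k, b, s, pred)
        eps, k, b, s, pred = best
        eps = np.clip(eps, 1e-10, 1 - 1e-10)             # avoid log(0)
        beta = 0.5 * np.log((1 - eps) / eps)             # Eq. (9)
        w = w * np.exp(-y * beta * pred)                 # Eq. (10)
        model.append((k, b, s, beta))
    return model

def predict(model, X):
    F = sum(beta * np.where(X[:, k] >= b, s, -s) for k, b, s, beta in model)
    return np.sign(F)

# toy example: the label is determined by a single threshold on gene 0
X = np.array([[0.1], [0.2], [0.8], [0.9]])
y = np.array([-1, -1, 1, 1])
model = adaboost_stumps(X, y, T=5)
```

The exhaustive threshold search is quadratic in the sample size per gene; for microarray-scale $p$ one would sort each column once, but the simple form above suffices to show the algorithm.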
3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion consists of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples $\mathbf{x}^-_i$, $i = 1, \ldots, n_-$, and $\mathbf{x}^+_j$, $j = 1, \ldots, n_+$, is given as

$$\mathrm{AUC}(F) = \frac{1}{n_- n_+}\sum_{i=1}^{n_-}\sum_{j=1}^{n_+} H\left(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\right), \qquad (17)$$

where $H(z)$ is the Heaviside function: $H(z) = 1$ if $z \ge 0$ and $H(z) = 0$ otherwise. To avoid the difficulty of maximizing this nondifferentiable function, an approximate AUC is considered by Komori [28] as

$$\mathrm{AUC}_\sigma(F) = \frac{1}{n_- n_+}\sum_{i=1}^{n_-}\sum_{j=1}^{n_+} H_\sigma\left(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\right), \qquad (18)$$

where $H_\sigma(z) = \Phi(z/\sigma)$, with $\Phi$ being the standard normal distribution function. A smaller scale parameter $\sigma$ means a better approximation of the Heaviside function $H(z)$. Based on this approximation, a boosting method for the maximization of the AUC as well as the partial AUC is proposed by Komori and Eguchi [22], in which they consider the following objective function:
$$\mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+}\sum_{i=1}^{n_-}\sum_{j=1}^{n_+} H_\sigma\left(F(\mathbf{x}^+_j) - F(\mathbf{x}^-_i)\right) - \lambda\sum_{k=1}^{p}\int F_k''(x_k)^2\, dx_k, \qquad (19)$$

where $F_k''(x_k)$ is the second derivative of the $k$th component of $F(\mathbf{x})$ and $\lambda$ is a smoothing parameter that controls the smoothness of $F(\mathbf{x})$. Here the set of weak classifiers is given as

$$\mathcal{F} = \left\{ f(\mathbf{x}) = \frac{N_{kl}(x_k)}{Z_{kl}} \,\middle|\, k = 1, 2, \ldots, p;\ l = 1, 2, \ldots, m_k \right\}, \qquad (20)$$

where $N_{kl}$ is a basis function for representing natural cubic splines and $Z_{kl}$ is a standardization factor. Then the relationship

$$\max_{\sigma,\lambda,F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda,F} \mathrm{AUC}_{1,\lambda}(F) \qquad (21)$$

allows us to fix $\sigma = 1$ without loss of generality, and we have the following algorithm.

(1) Start with $F_0(\mathbf{x}) = 0$.

(2) For $t = 1, \ldots, T$:

(a) update $\beta_{t-1}(f)$ to $\beta_t(f)$ with a one-step Newton-Raphson iteration;

(b) find the best weak classifier $f_t$:
$$f_t = \arg\max_{f} \mathrm{AUC}_{1,\lambda}\left(F_{t-1} + \beta_t(f) f\right); \qquad (22)$$

(c) update the score function as
$$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t) f_t(\mathbf{x}). \qquad (23)$$

(3) Output $F(\mathbf{x}) = \sum_{t=1}^{T}\beta_t(f_t) f_t(\mathbf{x})$.

The value of the smoothing parameter $\lambda$ and the iteration number $T$ are determined simultaneously by cross validation.
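The objectives (17) and (18) are simple to compute directly. A minimal sketch (illustrative names, with $\Phi$ expressed through the error function):

```python
import math

def auc_empirical(scores_neg, scores_pos):
    """Empirical AUC of Eq. (17), with Heaviside H(z) = 1 for z >= 0."""
    total = sum(1.0 if sp >= sn else 0.0
                for sn in scores_neg for sp in scores_pos)
    return total / (len(scores_neg) * len(scores_pos))

def auc_sigma(scores_neg, scores_pos, sigma=1.0):
    """Approximate AUC of Eq. (18): mean of Phi((F(x+) - F(x-)) / sigma)."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    total = sum(phi((sp - sn) / sigma)
                for sn in scores_neg for sp in scores_pos)
    return total / (len(scores_neg) * len(scores_pos))

neg, pos = [0.1, 0.4], [0.5, 0.9]   # toy discriminant scores
```

As $\sigma \to 0$, `auc_sigma` approaches `auc_empirical`, while for moderate $\sigma$ it is smooth in the scores, which is what makes gradient-based boosting of (19) feasible.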
3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage was applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters $\mathcal{C} = \{C_1, \ldots, C_K\}$ and the biological categories or subtypes $\mathcal{B} = \{B_1, \ldots, B_L\}$:

$$\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n_k(n_k - 1)}\sum_{\substack{i \ne j \\ i,j \in C_k}} I\left(B^{(i)} = B^{(j)}\right), \qquad (24)$$

where $B^{(i)} \in \mathcal{B}$ is the subtype of subject $i$ and $n_k$ is the number of subjects in $C_k$. This index is bounded above by 1, which corresponds to perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related to the prognosis of the patients [23].
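Equation (24) translates directly into code. A minimal sketch (illustrative names, not from the paper; singleton clusters, for which $n_k(n_k-1) = 0$, are simply skipped here):

```python
from itertools import combinations

def bhi(clusters, subtype):
    """Biological homogeneity index of Eq. (24).

    clusters: list of lists of subject indices (C_1, ..., C_K)
    subtype:  dict mapping subject index -> biological category B^(i)
    """
    terms = []
    for C in clusters:
        if len(C) < 2:
            continue  # n_k(n_k - 1) = 0: the cluster contributes nothing
        agree = sum(1 for i, j in combinations(C, 2)
                    if subtype[i] == subtype[j])
        # combinations() counts each unordered pair once; Eq. (24) sums
        # over ordered pairs i != j, so multiply by 2
        terms.append(2 * agree / (len(C) * (len(C) - 1)))
    return sum(terms) / len(clusters)

# two clusters, the first perfectly homogeneous in subtype
clusters = [[0, 1, 2], [3, 4]]
subtype = {0: "ER+", 1: "ER+", 2: "ER+", 3: "ER+", 4: "ER-"}
print(bhi(clusters, subtype))  # (1 + 0) / 2 = 0.5
```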
3.3. Mutual Coherence. Now we have a data matrix $X$ with $n$ rows ($n$ patients) and $p$ columns ($p$ genes). We assume an $n$-dimensional vector $\mathbf{b}$ indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis. The magnitude of $\mathbf{b}$ denotes the level of disease status. Then the optimal linear solution $\boldsymbol{\beta} \in \mathbb{R}^p$ should satisfy

$$X\boldsymbol{\beta} = \mathbf{b}. \qquad (25)$$

Note that if $p$ is much larger than $n$, then only a few elements of $\boldsymbol{\beta}$ should be nonzero and the others zero, which means $\boldsymbol{\beta}$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as

$$\mu(X) = \max_{1 \le i, j \le p,\ i \ne j} \frac{\left|\mathbf{x}_i^T\mathbf{x}_j\right|}{\|\mathbf{x}_i\|_2 \|\mathbf{x}_j\|_2}, \qquad (26)$$

where $\mathbf{x}_i \in \mathbb{R}^n$ denotes the $i$th column of $X$, or the $i$th gene in the breast cancer data, $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix $X$ and is bounded as $0 \le \mu(X) \le 1$. The next theorem shows the relationship between the sparsity of $\boldsymbol{\beta}$ and the data matrix $X$.

Theorem 1 (Elad [30]). If $X\boldsymbol{\beta} = \mathbf{b}$ has a solution $\boldsymbol{\beta}$ obeying $\|\boldsymbol{\beta}\|_0 < (1 + 1/\mu(X))/2$, this solution is necessarily the sparsest possible, where $\|\boldsymbol{\beta}\|_0$ denotes the number of nonzero components of $\boldsymbol{\beta}$.

This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \boldsymbol{\beta}^T\mathbf{x}$ could have a sparse solution to predict the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then $\|\boldsymbol{\beta}\|_0$ could become approximately 1, indicating that just one gene (a column of $X$) could be enough to predict the disease status if there were a solution $\boldsymbol{\beta}$. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in the classification problems we deal with, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.
3.4. Results. We prepare various data sets from a training sample (78 patients), based on the 230 genes selected by van't Veer et al. [7], in order to investigate the existence of multiple solutions for the prediction of the disease status. These are given by

$$\mathcal{D} = \{D_{1\text{–}70}, D_{6\text{–}75}, \ldots, D_{161\text{–}230}\}. \qquad (27)$$

The first data set, $D_{1\text{–}70}$, consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is on the basis of correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions $F(\mathbf{x})$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The classification performance of the van't Veer method and DLDA is illustrated in Figure 1; that of AdaBoost and AUCBoost is in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. With regard to the performance on the training data, denoted by the solid lines, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, with an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes with almost the same classification performance as that of MammaPrint.
We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1\text{–}70}$, $D_{11\text{–}80}$, and $D_{111\text{–}180}$, which are shown in Figure 3. Each row represents the 70 genes and each column represents the 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1\text{–}70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11\text{–}80}$ shows different patterns from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative status and poor prognosis; this would correspond to the basal-like or triple negative subtype, though the Her2 status is unclear. The right-hand side can be divided into three subgroups. The subgroup of patients with ER negative status shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set $D_{111\text{–}180}$ has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that of $D_{1\text{–}70}$. The other values of BHI are illustrated in Figure 4. It turns out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.
Next we investigate the stability of the gene ranking in MammaPrint. From the 78 patients we randomly choose 50 patients with their 5,420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performance of DLDA over 100 replications, which shows the most stable prediction accuracy as seen in Figure 1, based on the randomly selected 50 patients is shown in Figure 6, where the vertical axes are the AUC (left panels) and the error rate (right panels), and the genes are ranked by the correlation (a) and the AUC (b). The performance is clearly worse than that in Figure 1 in terms of both the AUC and the error rate. The heterogeneity of the gene expression pattern may come from the several subtypes
Figure 1. Results of the classical methods: (a) and (b) show the AUC values (left panels) and the error rates (right panels) over the data sets $\mathcal{D}$ for the van't Veer method and DLDA.
as suggested in the cluster analysis, and this would make it difficult to select useful genes for prediction.
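The resampling experiment described above (rank genes on a random subset of patients and count how often each gene lands in the top list) can be sketched as follows; the data here are synthetic stand-ins, not the breast cancer matrix:

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Rank genes (columns of X) by |correlation with the class label|."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    return set(np.argsort(-np.abs(r))[:k])

def selection_rates(X, y, k=70, subsample=50, trials=100, seed=0):
    """For each gene, the fraction of random patient subsamples in which
    it is ranked within the top k (cf. Figure 5)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(trials):
        idx = rng.choice(n, size=subsample, replace=False)
        for g in top_k_by_correlation(X[idx], y[idx], k):
            counts[g] += 1
    return counts / trials

# synthetic check: one informative gene among pure noise
rng = np.random.default_rng(1)
y = rng.choice([-1, 1], size=78)
X = rng.normal(size=(78, 30))
X[:, 0] += y          # gene 0 carries the class signal
rates = selection_rates(X, y, k=5, subsample=50, trials=100)
```

With a single strong gene the selection rate concentrates near 1 for that gene; with many weakly and mutually correlated genes, as in the real data, the rates spread out, which is exactly the instability seen in Figure 5.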
Finally, we calculate the mutual coherence based on the genes in MammaPrint and on the total 5,420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM_014889 and NM_014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined in the right-hand side of (26) are calculated for all pairs of genes among the total 5,420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5,420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone
Figure 2. Results of the boosting methods: (a) and (b) show the AUC values (left panels) and the error rates (right panels) over the data sets $\mathcal{D}$ for AdaBoost and AUCBoost.
to select genes whose expression patterns are similar to each other. This would be one reason why we have multiple solutions after ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.
4. Discussion and Concluding Remarks
In this paper we have addressed an important classification issue in microarray data. Our results demonstrate the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that the performance of several gene sets was essentially the same. This behavior was confirmed by the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and
Figure 3. Heat maps of gene expression data, with rows representing genes and columns representing patients: (a) $D_{1\text{–}70}$ (MammaPrint); (b) $D_{11\text{–}80}$, clearly showing some subtypes; and (c) $D_{111\text{–}180}$, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive status, and orange bars those with PR positive status.
the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into known subtypes, but considering subtypes for feature selection could be future work.
Figure 4. Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line).
Figure 5. The rates of ranking within the top 70 genes by the correlation coefficients (a) and by the AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.
Figure 6. The values of the AUC (left panels) and the error rate (right panels) calculated by DLDA using genes ranked in the top 70 by the correlation coefficients (a) and by the AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.
Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarrays. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing a redundancy that is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.

One approach to tackling the high dimensionality of the data is to apply statistical methods in consideration of the background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patients' prognoses. This heterogeneity of the data is thought to be one factor that makes the ranking of the genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5 and suggested by
Figure 7. (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distributions of the correlations for the total 5,420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the pairs of genes on which the correlations are calculated, standardized between 0 and 1 for a clear view.
Ein-Dor et al. [19], the gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of the total genes, would be useful to capture the mutual relations among genes and to identify subgroups of informative genes for the prediction problem. The combination of unsupervised and supervised learning is a promising way to overcome the difficulty involved in high-dimensional data.

We have addressed the challenges and difficulties regarding pattern recognition from gene expression. The main problem is caused by high-dimensional data sets. This is a problem not only for microarray data but also for RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] showed clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs for RNA-seq data: the dimension of RNA-seq data is quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data have the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.
Conflict of Interests
The authors have declared that no conflict of interests exists
References
[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.
[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.
[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301–320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.
Computational and Mathematical Methods in Medicine 5
where
=1
119901
119901
sum
119896=1
119909119896
minus=1
119901
119901
sum
119896=1
minus
119896 (3)
If the value of 119865(x) is smaller than a predefined thresholdvalue then the patient is judged to have good prognosis (119910 =minus1) otherwise to have metastases (119910 = 1) This method iscalled one-class classification where it focuses on only one-class label information to predict the disease status 119910 Thisidea is also employed in machine learning community SeeYousef et al [24] and Gardner et al [25] for applications inbiology and medicine
3.1.2. DLDA. We consider Fisher's linear discriminant analysis [20], which is widely used in many applications. Suppose that $\mathbf{x}$ is distributed as $N(\mu_{-}, \Sigma_{-})$ for $y = -1$ and as $N(\mu_{+}, \Sigma_{+})$ for $y = 1$. Then, if $\Sigma_{-} = \Sigma_{+} = \Sigma$, the estimated log-likelihood ratio is given as
\[
F(\mathbf{x}) = (\hat{\mu}_{+} - \hat{\mu}_{-})^{T}\hat{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\hat{\mu}_{+}^{T}\hat{\Sigma}^{-1}\hat{\mu}_{+} + \frac{1}{2}\hat{\mu}_{-}^{T}\hat{\Sigma}^{-1}\hat{\mu}_{-} + \log\frac{n_{+}}{n_{-}}, \tag{4}
\]
where $\hat{\mu}_{+}$, $\hat{\mu}_{-}$, and $\hat{\Sigma}$ are the sample means for $y \in \{-1, 1\}$ and the total sample variance, respectively. For simplicity, we take the diagonal part of $\hat{\Sigma}$ (Diagonal Linear Discriminant Analysis) and predict the disease status based on the value of the log-likelihood ratio above. This modification is often used in a situation where $p$ is much larger than $n$ ($= n_{-} + n_{+}$); in that case, the inverse of $\hat{\Sigma}$ cannot be calculated.
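A minimal sketch of the diagonal rule is given below. The code is our own; in particular, the pooled per-gene variance is one reasonable reading of "total sample variance," and all function and variable names are ours.

```python
import numpy as np

def dlda_score(x, X_neg, X_pos):
    # Diagonal LDA score: the Gaussian log-likelihood ratio of eq. (4)
    # with Sigma-hat replaced by the diagonal of the pooled sample variance.
    mu_n, mu_p = X_neg.mean(axis=0), X_pos.mean(axis=0)
    n_n, n_p = len(X_neg), len(X_pos)
    # pooled per-gene variance (diagonal of Sigma-hat); no matrix inverse needed
    var = (np.var(X_neg, axis=0) * n_n + np.var(X_pos, axis=0) * n_p) / (n_n + n_p)
    inv = 1.0 / var
    return ((mu_p - mu_n) * inv) @ x \
        - 0.5 * (mu_p * inv) @ mu_p \
        + 0.5 * (mu_n * inv) @ mu_n \
        + np.log(n_p / n_n)

rng = np.random.default_rng(1)
Xn = rng.normal(0.0, 1.0, size=(40, 5))   # class y = -1
Xp = rng.normal(2.0, 1.0, size=(40, 5))   # class y = +1
print(dlda_score(Xp.mean(axis=0), Xn, Xp) > dlda_score(Xn.mean(axis=0), Xn, Xp))  # True
```

Keeping only the diagonal turns the $p \times p$ inversion into $p$ scalar divisions, which is exactly why the method remains usable when $p \gg n$.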
3.1.3. AdaBoost. We introduce a famous boosting method in the machine learning community. The key concept of boosting is to construct a powerful discriminant function $F(\mathbf{x})$ by combining various weak classifiers $f(\mathbf{x})$ [26]. We employ a set $\mathcal{F}$ of decision stumps as a dictionary of weak classifiers. Here the decision stump for the $k$th gene expression $x_k$ is defined as a simple step function such as
\[
f_k(\mathbf{x}) = \begin{cases} 1 & \text{if } x_k \ge b_k, \\ -1 & \text{otherwise}, \end{cases} \tag{5}
\]
where $b_k$ is a threshold value. Accordingly, $f_k(\mathbf{x})$ is the simplest classifier in the sense that it neglects all information of the gene expression patterns other than that of the single gene $x_k$. However, by changing the value of $b_k$ for all genes ($k = 1, \ldots, p$), we obtain a set $\mathcal{F}$ that contains exhaustive information of the gene expression patterns. We attempt to build a good discriminant function $F$ by combining decision stumps in $\mathcal{F}$.
The typical one is AdaBoost, proposed by Schapire [21], which is designed to minimize the exponential loss for a discriminant function $F$:
\[
L_{\exp}(F) = \frac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}, \tag{6}
\]
where the entire data set is given as $\{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$. The exponential loss is sequentially minimized by the following algorithm.

(1) Initialize the weight of $\mathbf{x}_i$ for $i = 1, \ldots, n$ as $w_1(i) = 1/n$.

(2) For the iteration number $t = 1, \ldots, T$:

(a) choose $f_t$ as
\[
f_t = \operatorname*{argmin}_{f \in \mathcal{F}} \ \epsilon_t(f), \tag{7}
\]
where
\[
\epsilon_t(f) = \sum_{i=1}^{n} I\bigl(f(\mathbf{x}_i) \ne y_i\bigr) \frac{w_t(i)}{\sum_{i'=1}^{n} w_t(i')}, \tag{8}
\]
and $I$ is the indicator function;

(b) calculate the coefficient as
\[
\beta_t = \frac{1}{2}\log\frac{1 - \epsilon_t(f_t)}{\epsilon_t(f_t)}; \tag{9}
\]

(c) update the weight as
\[
w_{t+1}(i) = w_t(i)\exp\{-y_i \beta_t f_t(\mathbf{x}_i)\}. \tag{10}
\]

(3) Output the final function as $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t f_t(\mathbf{x})$.
Here we have
\[
L_{\exp}(F + \beta f) = \frac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}\exp\{-\beta y_i f(\mathbf{x}_i)\} \tag{11}
\]
\[
= \frac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(\mathbf{x}_i)\}\left[e^{\beta} I\bigl(f(\mathbf{x}_i) \ne y_i\bigr) + e^{-\beta} I\bigl(f(\mathbf{x}_i) = y_i\bigr)\right] \tag{12}
\]
\[
= L_{\exp}(F)\left\{e^{\beta}\epsilon(f) + e^{-\beta}\bigl(1 - \epsilon(f)\bigr)\right\} \tag{13}
\]
\[
\ge 2 L_{\exp}(F)\sqrt{\epsilon(f)\bigl(1 - \epsilon(f)\bigr)}, \tag{14}
\]
where $\epsilon(f) = \frac{1}{n}\sum_{i=1}^{n} I(f(\mathbf{x}_i) \ne y_i)\exp\{-y_i F(\mathbf{x}_i)\} / L_{\exp}(F)$, and the equality in (14) is attained if and only if
\[
\beta = \frac{1}{2}\log\frac{1 - \epsilon(f)}{\epsilon(f)}. \tag{15}
\]
Hence we can see that the exponential loss is sequentially minimized in the algorithm. It is easily shown that
\[
\epsilon_{t+1}(f_t) = \frac{1}{2}. \tag{16}
\]
That is, the weak classifier $f_t$ chosen at step $t$ is the worst element in $\mathcal{F}$ in the sense of the weighted error rate at step $t + 1$.
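The loop above can be sketched in a few lines. This is our own minimal implementation; the dictionary below contains stumps in both orientations (a slight, standard extension of eq. (5)), and the small separable data set is made up for illustration.

```python
import numpy as np

def adaboost_stumps(X, y, T=5):
    n, p = X.shape
    w = np.full(n, 1.0 / n)                      # step (1): uniform weights
    model = []
    for _ in range(T):                           # step (2)
        best = None
        for k in range(p):
            for b in np.unique(X[:, k]):         # candidate thresholds from the data
                for s in (1, -1):                # both orientations of the stump
                    pred = s * np.where(X[:, k] >= b, 1, -1)
                    eps = w[pred != y].sum() / w.sum()   # weighted error, eq. (8)
                    if best is None or eps < best[0]:
                        best = (eps, k, b, s)
        eps, k, b, s = best
        eps = min(max(eps, 1e-12), 1 - 1e-12)    # guard the log when eps hits 0 or 1
        beta = 0.5 * np.log((1 - eps) / eps)     # eq. (9)
        pred = s * np.where(X[:, k] >= b, 1, -1)
        w = w * np.exp(-y * beta * pred)         # eq. (10)
        model.append((beta, k, b, s))
    return model

def adaboost_predict(model, X):
    # F(x) = sum_t beta_t f_t(x), thresholded at zero
    F = np.zeros(len(X))
    for beta, k, b, s in model:
        F += beta * s * np.where(X[:, k] >= b, 1, -1)
    return np.where(F >= 0, 1, -1)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
model = adaboost_stumps(X, y, T=3)
print(adaboost_predict(model, X))
```

Because each stump looks at a single gene, the fitted coefficients $\beta_t$ give a crude per-gene importance, which is one reason boosting is attractive for gene selection.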
3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion consists of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples $\{\mathbf{x}^{-}_i \mid i = 1, \ldots, n_{-}\}$ and $\{\mathbf{x}^{+}_j \mid j = 1, \ldots, n_{+}\}$ is given as
\[
\mathrm{AUC}(F) = \frac{1}{n_{-} n_{+}}\sum_{i=1}^{n_{-}}\sum_{j=1}^{n_{+}} H\bigl(F(\mathbf{x}^{+}_j) - F(\mathbf{x}^{-}_i)\bigr), \tag{17}
\]
where $H(z)$ is the Heaviside function: $H(z) = 1$ if $z \ge 0$ and $H(z) = 0$ otherwise. To avoid the difficulty of maximizing the nondifferentiable function above, an approximate AUC is considered by Komori [28] as
\[
\mathrm{AUC}_{\sigma}(F) = \frac{1}{n_{-} n_{+}}\sum_{i=1}^{n_{-}}\sum_{j=1}^{n_{+}} H_{\sigma}\bigl(F(\mathbf{x}^{+}_j) - F(\mathbf{x}^{-}_i)\bigr), \tag{18}
\]
distribution function A smaller scale parameter 120590 meansa better approximation of the Heaviside function 119867(119911)Based on this approximation a boosting method for themaximization of AUC as well as the partial AUC is proposedby Komori and Eguchi [22] in which they consider thefollowing objective function
AUC120590120582(119865) =
1
119899minus119899+
119899minus
sum
119894=1
119899+
sum
119895=1
119867120590(119865 (x+119895) minus 119865 (xminus
119894))
minus 120582
119901
sum
119896=1
int 11986510158401015840
119896(119909119896)2
119889119909119896
(19)
where 11986510158401015840119896(119909119896) is the second derivative of the 119896th component
of 119865(x) and 120582 is a smoothing parameter that controls thesmoothness of 119865(x) Here the set of weak classifiers is givenas
F = 119891 (x) =119873119896119897(119909119896)
119885119896119897
| 119896 = 1 2 119901 119897 = 1 2 119898119896
(20)
where 119873119896119897
is a basis function for representing natural cubicsplines and 119885
119896119897is a standardization factor Then the relation-
ship
max120590120582119865
AUC120590120582(119865) = max
120582119865
AUC1120582(119865) (21)
allows us to fix $\sigma = 1$ without loss of generality, and we have the following algorithm.

(1) Start with $F_0(\mathbf{x}) = 0$.

(2) For $t = 1, \ldots, T$:

(a) update $\beta_{t-1}(f)$ to $\beta_t(f)$ with a one-step Newton-Raphson iteration;
(b) find the best weak classifier $f_t$:
\[
f_t = \operatorname*{argmax}_{f} \ \mathrm{AUC}_{1,\lambda}\bigl(F_{t-1} + \beta_t(f) f\bigr); \tag{22}
\]
(c) update the score function as
\[
F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \beta_t(f_t) f_t(\mathbf{x}). \tag{23}
\]

(3) Output $F(\mathbf{x}) = \sum_{t=1}^{T} \beta_t(f_t) f_t(\mathbf{x})$.

The value of the smoothing parameter $\lambda$ and the iteration number $T$ are determined at the same time by cross-validation.
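The smooth surrogate of eqs. (17) and (18) is easy to verify numerically. Below is our own sketch (the score values are made up): the empirical AUC counts correctly ordered pairs, and the sigmoid version converges to it as $\sigma$ shrinks.

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def empirical_auc(f_neg, f_pos):
    # eq. (17): fraction of (negative, positive) pairs with F(x+) >= F(x-)
    return sum(1.0 for sn in f_neg for sp in f_pos if sp - sn >= 0) \
        / (len(f_neg) * len(f_pos))

def smooth_auc(f_neg, f_pos, sigma=1.0):
    # eq. (18): the Heaviside function replaced by Phi((F(x+) - F(x-)) / sigma)
    return sum(Phi((sp - sn) / sigma) for sn in f_neg for sp in f_pos) \
        / (len(f_neg) * len(f_pos))

f_neg = [0.1, 0.35, 0.4]   # scores of good-prognosis samples (made up)
f_pos = [0.5, 0.7, 0.8]    # scores of metastasis samples (made up)
print(empirical_auc(f_neg, f_pos))   # 1.0: every pair is ordered correctly
# a smaller sigma tightens the approximation toward the empirical AUC
print(smooth_auc(f_neg, f_pos, sigma=0.5) < smooth_auc(f_neg, f_pos, sigma=0.05))  # True
```

Unlike the Heaviside sum, the smoothed objective is differentiable in $F$, which is what makes the gradient-style boosting updates in steps (2a)-(2c) possible.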
3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage was applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters $\mathcal{C} = \{C_1, \ldots, C_K\}$ and the biological categories or subtypes $\mathcal{B} = \{B_1, \ldots, B_L\}$:
\[
\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{n_k(n_k - 1)}\sum_{\substack{i \ne j \\ i,j \in C_k}} I\bigl(B(i) = B(j)\bigr), \tag{24}
\]
where $B(i) \in \mathcal{B}$ is the subtype of subject $i$ and $n_k$ is the number of subjects in $C_k$. This index is bounded above by 1, the value indicating perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related with the prognosis of the patients [23].
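Eq. (24) translates directly into code. The sketch below is ours; the cluster/label data layout and the tiny example are assumptions for illustration.

```python
def bhi(clusters, subtype):
    # Eq. (24): average, over clusters, of the fraction of within-cluster
    # pairs (i, j), i != j, that share the same biological label.
    # `clusters` is a list of lists of subject indices; `subtype` maps a
    # subject index to its label (our own data layout).
    total = 0.0
    for C in clusters:
        nk = len(C)
        if nk < 2:
            continue                      # no pairs in a singleton cluster
        agree = sum(1 for i in C for j in C if i != j and subtype[i] == subtype[j])
        total += agree / (nk * (nk - 1))
    return total / len(clusters)

labels = ["ER+", "ER+", "ER-", "ER-"]
print(bhi([[0, 1], [2, 3]], labels))   # 1.0: perfectly homogeneous clusters
print(bhi([[0, 2], [1, 3]], labels))   # 0.0: every within-cluster pair disagrees
```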
3.3. Mutual Coherence. Now we have a data matrix $X$ with $n$ rows ($n$ patients) and $p$ columns ($p$ genes). We assume an $n$-dimensional vector $\mathbf{b}$ indicating the true disease status, where the positive values correspond to metastases and the negative ones to good prognosis. The magnitude of $\mathbf{b}$ denotes the level of disease status. Then the optimal linear solution $\boldsymbol{\beta} \in \mathbb{R}^{p}$ should satisfy
\[
X\boldsymbol{\beta} = \mathbf{b}. \tag{25}
\]
Note that if $p$ is much larger than $n$, then only a few elements of $\boldsymbol{\beta}$ should be nonzero and the others zero, which means $\boldsymbol{\beta}$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as
\[
\mu(X) = \max_{1 \le i, j \le p,\ i \ne j} \frac{\bigl|\mathbf{x}_i^{T}\mathbf{x}_j\bigr|}{\|\mathbf{x}_i\|_2 \|\mathbf{x}_j\|_2}, \tag{26}
\]
where $\mathbf{x}_i \in \mathbb{R}^{n}$ denotes the $i$th column of $X$, or the $i$th gene in the breast cancer data, $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures the closeness of the columns (genes) of the data matrix $X$ and is bounded as $0 \le \mu(X) \le 1$. The next theorem shows the relationship between the sparsity of $\boldsymbol{\beta}$ and the data matrix $X$.

Theorem 1 (Elad [30]). If $X\boldsymbol{\beta} = \mathbf{b}$ has a solution $\boldsymbol{\beta}$ obeying $\|\boldsymbol{\beta}\|_0 < (1 + 1/\mu(X))/2$, this solution is necessarily the sparsest possible, where $\|\boldsymbol{\beta}\|_0$ denotes the number of the nonzero components of $\boldsymbol{\beta}$.

This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \boldsymbol{\beta}^{T}\mathbf{x}$ could have a sparse solution to predict the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then $\|\boldsymbol{\beta}\|_0$ could become approximately 1, indicating that just one gene (a column of $X$) could be enough to predict the disease status if there were a solution $\boldsymbol{\beta}$. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in the classification problems we deal with, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.
3.4. Results. We prepare various data sets using a training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7] in order to investigate the existence of multiple solutions for the prediction of the disease status, given by
\[
\mathcal{D} = \{D_{1\text{-}70}, D_{6\text{-}75}, \ldots, D_{161\text{-}230}\}. \tag{27}
\]
The first data set $D_{1\text{-}70}$ consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is on the basis of the correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions $F(\mathbf{x})$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The results of the classification performance of the van't Veer method and DLDA are illustrated in Figure 1; those of AdaBoost and AUCBoost are in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. In regard to the performance based on the training data, denoted by the solid lines, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, having an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1\text{-}70}$, $D_{11\text{-}80}$, and $D_{111\text{-}180}$, which are shown in Figure 3. Each row represents 70 genes and each column represents 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1\text{-}70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11\text{-}80}$ shows different patterns from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the Basal-like subtype or the triple negative subtype, though the Her2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER negative shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set $D_{111\text{-}180}$ has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that in $D_{1\text{-}70}$. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.

Next, we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with 5420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5; the lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performance of DLDA, which shows the most stable prediction accuracy as seen in Figure 1, over 100 replications based on randomly selected 50 patients is shown in Figure 6, where the vertical axes are the AUC (left panels) and the error rate (right panels) and the genes are ranked by the correlation (a) and by the AUC (b). The performance clearly gets worse than that in Figure 1 in terms of both the AUC and the error rate. The heterogeneity of gene expression patterns may come from the several subtypes,
Figure 1: Results of classical methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets $\mathcal{D}$ for the van't Veer method and DLDA.
as suggested in the cluster analysis, and it would make it difficult to select useful genes for prediction.

Finally, we calculate the mutual coherence based on the genes in MammaPrint and the total 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM 014889 and NM 014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone
Figure 2: Results of boosting methods. (a) and (b) show the AUC values (left panel) and the error rate (right panel) over the data sets $\mathcal{D}$ for AdaBoost and AUCBoost.
to select genes whose expression patterns are similar to each other. This would be one reason why we have multiple solutions after ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.
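The resampling experiment summarized in Figure 5 can be sketched as follows. The code, sample sizes, and effect sizes are our own illustrative choices, applied to synthetic data rather than the breast cancer data.

```python
import numpy as np

def top_m_rates(X, y, m, trials=100, frac=0.64, seed=0):
    # For each trial: subsample subjects, rank genes by the absolute Pearson
    # correlation with the label, and record which genes land in the top m.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(trials):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        Xs, ys = X[idx], y[idx]
        r = np.array([np.corrcoef(Xs[:, k], ys)[0, 1] for k in range(p)])
        counts[np.argsort(-np.abs(r))[:m]] += 1
    return counts / trials      # per-gene selection rate, as in Figure 5

rng = np.random.default_rng(2)
n, p = 60, 20
y = np.repeat([-1.0, 1.0], n // 2)
X = rng.normal(size=(n, p))
X[:, 0] += 2.0 * y              # genes 0 and 1 are genuinely informative;
X[:, 1] += 2.0 * y              # the remaining 18 are pure noise
rates = top_m_rates(X, y, m=5)
print(rates[0] > 0.9, rates[1] > 0.9)   # informative genes are almost always kept
```

With strongly informative genes the selection rates approach 1; with many correlated, weakly informative genes, as in the real data, the rates spread out, which is exactly the instability visible in Figure 5.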
4. Discussion and Concluding Remarks

In this paper, we have addressed an important classification issue in the analysis of microarray data. Our results demonstrate the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that the performance of several gene sets was nearly the same. This behavior was confirmed for the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and
Figure 3: Heat maps of gene expression data with rows representing genes and columns representing patients: (a) $D_{1\text{-}70}$ (MammaPrint), (b) $D_{11\text{-}80}$, clearly showing some subtypes, and (c) $D_{111\text{-}180}$, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.
the bad prognosis groups from the clustering expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into known subtypes, but considering subtypes in feature selection could be future work.
Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line).
Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.
Figure 6: The values of AUC (left panel) and error rate (right panel) calculated by DLDA using genes ranked in the top 70 by the correlation coefficients (a) and by AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.
Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarray data. Besides, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultrahigh dimension of microarray data, known as the $p \gg n$ problem. Ultrahigh-dimensional gene expression data contain too many similar expression patterns, causing redundancy, and this redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.

One approach to tackling the high dimensionality of the data is to apply statistical methods in consideration of background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related with the resultant outcome of the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of the genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5 and suggested by
Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint (NM 014889 versus NM 014968). (b) The distribution of the correlation of the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of pairs of genes based on which the correlations are calculated, standardized between 0 and 1 for a clear view.
Ein-Dor et al. [19], the gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of all genes, would be useful to capture the mutual relations among genes and to identify subgroups of informative genes for the prediction problem. The combination of unsupervised and supervised learning is a promising way to resolve the difficulty involved in high-dimensional data.

We addressed the challenges and difficulties regarding the pattern recognition of gene expression. The main problem is caused by high-dimensional data sets. This is not only a problem of microarray data but also of RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] showed clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: the dimension of RNA-seq data is quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data would have the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.
Conflict of Interests
The authors have declared that no conflict of interests exists
References

[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860-921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304-1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267-288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197-2202, 2003.
[5] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207-1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55-65, 2006.
[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530-536, 2002.
[8] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614-10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249-264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999.
[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817-2826, 2004.
[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729-735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560-569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349-369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301-320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171-178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179-188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869-10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025-1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961-979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969-1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869-10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160-1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621-628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57-63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22-S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249-257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493-2518, 2011.
6 Computational and Mathematical Methods in Medicine
Hence we can see that the exponential loss is sequentially minimized in the algorithm. It is easily shown that

\epsilon_{t+1}(f_t) = \frac{1}{2}. \quad (16)

That is, the weak classifier f_t chosen at step t is the worst element in \mathcal{F} in the sense of the weighted error rate at step t + 1.
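This property can be checked numerically. The sketch below (with toy labels, predictions, and weights, not the paper's data) performs one AdaBoost-style exponential weight update and confirms that the weighted error of the chosen classifier under the updated weights is exactly 1/2:

```python
import numpy as np

# Numeric check of (16): after the AdaBoost exponential-loss weight update,
# the weighted error of the weak classifier f_t chosen at step t becomes
# exactly 1/2 under the new weights. All data here are toy values.
rng = np.random.default_rng(0)
n = 20
y = rng.choice([-1, 1], size=n)                  # true labels
f = y * np.where(np.arange(n) % 3 == 0, -1, 1)   # f_t, wrong on every 3rd sample
w = rng.random(n)
w /= w.sum()                                     # current weights w_t

eps = w[f != y].sum()                            # weighted error eps_t(f_t)
beta = 0.5 * np.log((1.0 - eps) / eps)           # AdaBoost coefficient beta_t
w_next = w * np.exp(-beta * y * f)               # exponential weight update
w_next /= w_next.sum()                           # weights w_{t+1}

eps_next = w_next[f != y].sum()                  # eps_{t+1}(f_t)
print(round(eps_next, 12))                       # -> 0.5
```

The update scales misclassified and correctly classified weights so their totals become equal, which is exactly the statement of (16).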
3.1.4. AUCBoost with Natural Cubic Splines. The area under the ROC curve (AUC) is widely used to measure classification accuracy [27]. This criterion consists of the false positive rate and the true positive rate, so it evaluates them separately, in contrast to the commonly used error rate. The empirical AUC based on the samples x^-_i, i = 1, \ldots, n_-, and x^+_j, j = 1, \ldots, n_+, is given as

\mathrm{AUC}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H\bigl(F(x^+_j) - F(x^-_i)\bigr), \quad (17)
where H(z) is the Heaviside function: H(z) = 1 if z \ge 0 and H(z) = 0 otherwise. To avoid the difficulty of maximizing the nondifferentiable function above, an approximate AUC is considered by Komori [28] as

\mathrm{AUC}_\sigma(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma\bigl(F(x^+_j) - F(x^-_i)\bigr), \quad (18)

where H_\sigma(z) = \Phi(z/\sigma), with \Phi being the standard normal distribution function. A smaller scale parameter \sigma means a better approximation of the Heaviside function H(z). Based on this approximation, a boosting method for the maximization of the AUC as well as the partial AUC is proposed by Komori and Eguchi [22], in which they consider the following objective function:
\mathrm{AUC}_{\sigma,\lambda}(F) = \frac{1}{n_- n_+} \sum_{i=1}^{n_-} \sum_{j=1}^{n_+} H_\sigma\bigl(F(x^+_j) - F(x^-_i)\bigr) - \lambda \sum_{k=1}^{p} \int F''_k(x_k)^2 \, dx_k, \quad (19)
where F''_k(x_k) is the second derivative of the kth component of F(x) and \lambda is a smoothing parameter that controls the smoothness of F(x). Here the set of weak classifiers is given as

\mathcal{F} = \left\{ f(x) = \frac{N_{kl}(x_k)}{Z_{kl}} \,\middle|\, k = 1, 2, \ldots, p;\ l = 1, 2, \ldots, m_k \right\}, \quad (20)

where N_{kl} is a basis function for representing natural cubic splines and Z_{kl} is a standardization factor. Then the relationship

\max_{\sigma, \lambda, F} \mathrm{AUC}_{\sigma,\lambda}(F) = \max_{\lambda, F} \mathrm{AUC}_{1,\lambda}(F) \quad (21)

allows us to fix \sigma = 1 without loss of generality and leads to the following algorithm.
(1) Start with F_0(x) = 0.

(2) For t = 1, \ldots, T:

(a) update \beta_{t-1}(f) to \beta_t(f) with a one-step Newton-Raphson iteration;

(b) find the best weak classifier f_t:

f_t = \arg\max_f \mathrm{AUC}_{1,\lambda}\bigl(F_{t-1} + \beta_t(f) f\bigr); \quad (22)

(c) update the score function as

F_t(x) = F_{t-1}(x) + \beta_t(f_t) f_t(x). \quad (23)

(3) Output F(x) = \sum_{t=1}^{T} \beta_t(f_t) f_t(x).

The value of the smoothing parameter \lambda and the iteration number T are determined at the same time by cross validation.
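The empirical AUC (17) and its smoothed version (18) can be sketched as follows; the score vectors are synthetic stand-ins rather than the breast cancer discriminant scores, and the \lambda-penalty of (19) is omitted:

```python
import math
import numpy as np

# Sketch of the empirical AUC (17) and its smooth approximation (18), where
# H_sigma(z) = Phi(z / sigma) with Phi the standard normal distribution
# function. The score vectors below are toy assumptions.
phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def auc_empirical(f_neg, f_pos):
    # mean of H(F(x+_j) - F(x-_i)) over all (i, j) pairs, H the Heaviside step
    diff = f_pos[None, :] - f_neg[:, None]
    return float(np.mean(diff >= 0))

def auc_smooth(f_neg, f_pos, sigma=1.0):
    # mean of Phi((F(x+_j) - F(x-_i)) / sigma); smaller sigma tracks (17) better
    diff = f_pos[None, :] - f_neg[:, None]
    return float(np.mean(phi(diff / sigma)))

rng = np.random.default_rng(1)
f_neg = rng.normal(0.0, 1.0, size=100)   # scores F(x-) of negative samples
f_pos = rng.normal(1.0, 1.0, size=80)    # scores F(x+) of positive samples
print(auc_empirical(f_neg, f_pos))
print(auc_smooth(f_neg, f_pos, sigma=0.1))  # approaches the empirical AUC
```

Unlike (17), the smoothed objective (18) is differentiable in F, which is what makes the gradient-style boosting updates above possible.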
3.2. Clustering. We applied hierarchical clustering to the breast cancer data [7], where the distances between samples and between genes are determined by the correlation, and complete linkage was applied as the agglomeration method. To measure the performance of the clustering, we used the biological homogeneity index (BHI) [29], which measures the homogeneity between the clusters \mathcal{C} = \{C_1, \ldots, C_K\} and the biological categories or subtypes \mathcal{B} = \{B_1, \ldots, B_L\}:

\mathrm{BHI}(\mathcal{C}, \mathcal{B}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k (n_k - 1)} \sum_{i \neq j;\ i, j \in C_k} I\bigl(B^{(i)} = B^{(j)}\bigr), \quad (24)

where B^{(i)} \in \mathcal{B} is the subtype for subject i and n_k is the number of subjects in C_k. This index is bounded above by 1, which corresponds to perfect homogeneity between the clusters and the biological categories. We calculated this index for the breast cancer data to investigate the relationship between the hierarchical clustering and biological categories such as disease status (good prognosis or metastases) and hormone status: estrogen receptor (ER) status and progesterone receptor (PR) status. The hormone status is known to be closely related with the prognosis of the patients [23].
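A minimal sketch of computing (24), assuming toy cluster and subtype assignments rather than the breast cancer data:

```python
import numpy as np

# Sketch of the biological homogeneity index (24): for each cluster, the
# fraction of ordered within-cluster pairs sharing the same biological
# label, averaged over the K clusters. The assignments below are toy values.
def bhi(clusters, labels):
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    ks = np.unique(clusters)
    total = 0.0
    for k in ks:
        lab = labels[clusters == k]
        n_k = len(lab)
        if n_k < 2:
            continue                     # a singleton cluster has no pairs
        same = sum(int(lab[i] == lab[j])
                   for i in range(n_k) for j in range(n_k) if i != j)
        total += same / (n_k * (n_k - 1))
    return total / len(ks)

clusters = [1, 1, 1, 2, 2, 2]
labels = ["ER+", "ER+", "ER-", "ER-", "ER-", "ER-"]
print(bhi(clusters, labels))             # (1/3 + 1) / 2 = 2/3
```

Each cluster contributes the fraction of its subject pairs that agree on the biological label, so a perfectly label-pure clustering attains the upper bound of 1.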
3.3. Mutual Coherence. Now we have a data matrix X with n rows (n patients) and p columns (p genes). We assume an n-dimensional vector b indicating the true disease status, where
the positive values correspond to metastases and the negative ones to good prognosis. The magnitude of b denotes the level of disease status. Then the optimal linear solution \beta \in \mathbb{R}^p should satisfy

X\beta = b. \quad (25)
Note that if p is much larger than n, then only a few elements of \beta should be nonzero and the others zero, which means \beta has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix X as

\mu(X) = \max_{1 \le i, j \le p,\ i \neq j} \frac{|x_i^T x_j|}{\|x_i\|_2 \|x_j\|_2}, \quad (26)

where x_i \in \mathbb{R}^n denotes the ith column in X, or the ith gene in the breast cancer data, |\cdot| is the absolute value, and \|\cdot\|_2 is the Euclidean norm. This index measures the closeness of the columns (genes) in the data matrix X and is bounded as 0 \le \mu(X) \le 1. The next theorem shows the relationship between the sparsity of \beta and the data matrix X.
Theorem 1 (Elad [30]). If X\beta = b has a solution \beta obeying \|\beta\|_0 < (1 + 1/\mu(X))/2, this solution is necessarily the sparsest possible, where \|\beta\|_0 denotes the number of the nonzero components of \beta.
This theorem suggests that the linear discriminant function F(x) = \beta^T x could have a sparse solution to predict the disease status b, which corresponds to metastases or good prognosis in the breast cancer data. If the value of \mu(X) is nearly equal to 1, then \|\beta\|_0 could become approximately 1, indicating that just one gene (a column in X) could be enough to predict the disease status if there were a solution \beta. This indicates that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects as seen in the classification problems we deal with, Theorem 1 is indicative in the case where the number of covariates p is much larger than the number of observations n.
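The index (26) is straightforward to compute. The sketch below uses a random matrix of the same shape as one of the data sets considered later (78 patients, 230 genes) as a stand-in for the real expression data:

```python
import numpy as np

# Sketch of the mutual coherence (26): the largest absolute value of the
# normalized inner product between two distinct columns of X. The random
# matrix below (n = 78 patients, p = 230 genes) is a toy assumption.
def mutual_coherence(X):
    Xn = X / np.linalg.norm(X, axis=0)   # scale every column to unit length
    G = np.abs(Xn.T @ Xn)                # |x_i^T x_j| / (||x_i||_2 ||x_j||_2)
    np.fill_diagonal(G, 0.0)             # exclude the i = j terms
    return float(G.max())

rng = np.random.default_rng(2)
X = rng.normal(size=(78, 230))
print(mutual_coherence(X))               # always between 0 and 1
```

After normalizing the columns, the Gram matrix holds every pairwise correlation-like term of (26) at once, so the maximum over its off-diagonal entries is \mu(X).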
3.4. Results. We prepared various data sets using a training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7] in order to investigate the existence of multiple solutions for the prediction of the disease status; these are given by

\mathcal{D} = \{D_{1-70}, D_{6-75}, \ldots, D_{161-230}\}. \quad (27)

The first data set D_{1-70} consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in \mathcal{D} is on the basis of correlation with disease status, as in [7]. We apply the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsections, to the training data to obtain the discriminant functions F(x), where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The classification performance of the van't Veer method and DLDA is illustrated in Figure 1; that of AdaBoost and AUCBoost is in Figure 2. The AUC and the error rate are plotted against \mathcal{D}. In regard to the performance based on the training data, denoted by the solid lines, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in \mathcal{D}, having an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
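For illustration, the index ranges of the sliding data sets in (27) can be generated as follows; the step of 5 and the window width of 70 are read off the listed sets, and the expression matrix itself is not needed for the indexing:

```python
# Sketch of how the sliding gene sets in (27) can be indexed: windows of 70
# consecutive ranks over the 230 correlation-ranked genes, shifted by 5, so
# that D runs from D_{1-70} through D_{161-230}. Ranks are 1-based as in the
# paper.
windows = [(start, start + 69) for start in range(1, 162, 5)]
print(windows[0], windows[1], windows[-1])   # (1, 70) (6, 75) (161, 230)
print(len(windows))                          # 33 candidate data sets in D
```

Each tuple gives the first and last rank of one data set; slicing the ranked gene list with these bounds reproduces the family \mathcal{D}.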
We investigate the performance of the hierarchical clusterings based on the training data sets D_{1-70}, D_{11-80}, and D_{111-180}, which are shown in Figure 3. Each row represents 70 genes and each column represents 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint (D_{1-70}) are 0.70, 0.69, and 0.57, respectively. The gene expression in D_{11-80} shows different patterns from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the Basal-like subtype or the triple negative subtype, though the Her2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER negative shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set D_{111-180} has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that of D_{1-70}. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in \mathcal{D} have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.
Next, we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with 5420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of the gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performances of DLDA with 100 replications (DLDA shows the most stable prediction accuracy, as seen in Figure 1), based on randomly selected 50 patients, are shown in Figure 6, where the vertical axes are AUC (left panels) and error rate (right panels), and the genes are ranked by the correlation (a) and the AUC (b). The performance clearly gets worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of the gene expression patterns may come from the several subtypes
Figure 1: Results of classical methods. (a) and (b) show the AUC values (left panels) and the error rates (right panels) over the data sets \mathcal{D} for the van't Veer method and DLDA.
as suggested in the cluster analysis, and it would make it difficult to select useful genes for prediction.
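The resampling procedure above can be sketched as follows. The expression matrix and status vector are synthetic, with p reduced from 5420 for speed, so the rates illustrate only the mechanics of the experiment, not the reported results:

```python
import numpy as np

# Sketch of the ranking-stability experiment: repeatedly subsample 50 of the
# 78 patients, rank genes by the absolute correlation of expression with
# disease status, and record how often each gene enters the top 70. The
# synthetic X and y below are assumptions (p reduced from 5420 to 500).
rng = np.random.default_rng(3)
n, p, top, trials = 78, 500, 70, 100
X = rng.normal(size=(n, p))                  # expression matrix stand-in
y = rng.choice([-1.0, 1.0], size=n)          # disease status stand-in

counts = np.zeros(p)
for _ in range(trials):
    idx = rng.choice(n, size=50, replace=False)
    Xs = X[idx]
    Xc = Xs - Xs.mean(axis=0)                # center each gene
    yc = y[idx] - y[idx].mean()              # center the status vector
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    counts[np.argsort(-np.abs(r))[:top]] += 1

rates = counts / trials                      # per-gene rate of top-70 ranking
print(rates.min(), rates.max())              # most rates sit well below 1
```

Because only 70 slots are available per trial, the per-gene rates sum to 70 across the panel; a stable signature would concentrate that mass on the same 70 genes, which is not what Figure 5 shows for MammaPrint.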
Finally, we calculate the mutual coherence based on the genes in MammaPrint and on all 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint: NM_014889 and NM_014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined on the right-hand side of (26) are calculated for all pairs of genes among the total 5420 genes and among the 70 genes in MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone
Figure 2: Results of boosting methods. (a) and (b) show the AUC values (left panels) and the error rates (right panels) over the data sets \mathcal{D} for AdaBoost and AUCBoost.
to select genes such that their gene expression patterns are similar to each other. This would be one reason why we have multiple solutions after the ranking methods based on the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the gene set of MammaPrint.
4. Discussion and Concluding Remarks

In this paper we have addressed an important classification issue using microarray data. Our results demonstrate the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that the performance of several gene sets was essentially the same. This behavior was confirmed by the van't Veer method, DLDA, AdaBoost, and AUCBoost. Remarkably, non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.

To examine the expression pattern of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and
Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients: (a) D_{1-70} (MammaPrint); (b) D_{11-80}, clearly showing some subtypes; and (c) D_{111-180}, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.
the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns showed subtypes in the expression pattern. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize genes into the known subtypes, but considering subtypes for feature selection could be future work.
Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line).
Figure 5: The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.
Figure 6: The values of AUC (left panels) and error rate (right panels) calculated by DLDA using genes ranked in the top 70 by the correlation coefficients (a) and by the AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.
Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarrays. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the p \gg n problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not removed by the ranking procedure, and this is the critical pitfall for gene selection in microarray data.
One approach to tackling the high dimensionality of the data is to apply statistical methods in combination with background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related with the eventual outcome of the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of the genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5, and as suggested by
Figure 7: (a) Scatter plot of the two genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint (NM_014889 versus NM_014968). (b) The distributions of the correlations for the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of pairs of genes based on which the correlations are calculated, standardized between 0 and 1 for a clear view.
Ein-Dor et al. [19], the gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of all genes, would be useful to capture the mutual relations among genes and to identify subgroups of informative genes for the prediction problem. The combination of unsupervised and supervised learning is a promising way to resolve the difficulty involved in high-dimensional data.
We have addressed the challenges and difficulties regarding the pattern recognition of gene expression. The main problem is caused by high-dimensional data sets. This is a problem not only for microarray data but also for RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed by Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] presented clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: the dimension of RNA-seq data is quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data have the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.
Conflict of Interests
The authors have declared that no conflict of interests exists.
References
[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, vol. 58, no. 1, pp. 267–288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[5] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.
[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[8] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.
[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, no. 2, pp. 301–320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined-microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.
Computational and Mathematical Methods in Medicine 7
the positive values correspond to metastases and negativeones to good prognosisThemagnitude of b denotes the levelof disease status Then the optimal linear solution 120573 isin R119901
should be satisfied
X120573 = b (25)
Note that if $p$ is much larger than $n$, then only a few elements of $\beta$ should be nonzero and the others zero, which means $\beta$ has a sparse structure. The sparsity has a close relationship with the mutual coherence [30], which is defined for the data matrix $X$ as

$$\mu(X) = \max_{1 \le i,j \le p,\; i \ne j} \frac{|\mathbf{x}_i^T \mathbf{x}_j|}{\|\mathbf{x}_i\|_2 \|\mathbf{x}_j\|_2}, \tag{26}$$

where $\mathbf{x}_i \in \mathbb{R}^n$ denotes the $i$th column of $X$, or the $i$th gene in the breast cancer data, $|\cdot|$ is the absolute value, and $\|\cdot\|_2$ is the Euclidean norm. This index measures how close the columns (genes) of the data matrix $X$ are to one another and is bounded as $0 \le \mu(X) \le 1$. The next theorem shows the relationship between the sparsity of $\beta$ and the data matrix $X$.
Theorem 1 (Elad [30]). If $X\beta = \mathbf{b}$ has a solution $\beta$ obeying $\|\beta\|_0 < (1 + 1/\mu(X))/2$, this solution is necessarily the sparsest possible, where $\|\beta\|_0$ denotes the number of nonzero components of $\beta$.
This theorem suggests that the linear discriminant function $F(\mathbf{x}) = \beta^T \mathbf{x}$ could have a sparse solution to predict the disease status $\mathbf{b}$, which corresponds to metastases or good prognosis in the breast cancer data. If the value of $\mu(X)$ is nearly equal to 1, then the bound on $\|\beta\|_0$ becomes approximately 1, indicating that just one gene (a column of $X$) could be enough to predict the disease status if there were a solution $\beta$. This suggests that we could have a chance to find multiple solutions with fewer genes than the 70 genes in MammaPrint. Although the framework in (25) is based on a system of linear equations and does not include random effects, as seen in the classification problems we deal with, Theorem 1 is indicative in the case where the number of covariates $p$ is much larger than the number of observations $n$.
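As a concrete illustration, the coherence index in (26) and the bound of Theorem 1 can be computed directly. The following is a minimal sketch in NumPy; the $78 \times 230$ matrix is a synthetic stand-in, since the breast cancer data are not reproduced here.

```python
import numpy as np

def mutual_coherence(X):
    """Mutual coherence mu(X) of (26): the largest value of
    |x_i^T x_j| / (||x_i||_2 ||x_j||_2) over distinct columns i != j."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)  # unit-norm columns
    G = np.abs(Xn.T @ Xn)                              # |cosine similarities|
    np.fill_diagonal(G, 0.0)                           # exclude i == j
    return float(G.max())

def sparsity_bound(mu):
    """Theorem 1: a solution with ||beta||_0 < (1 + 1/mu)/2 is the sparsest."""
    return (1.0 + 1.0 / mu) / 2.0

rng = np.random.default_rng(0)
X = rng.standard_normal((78, 230))   # synthetic stand-in: n = 78, p = 230
mu = mutual_coherence(X)
print(0.0 <= mu <= 1.0, sparsity_bound(1.0))  # True 1.0
```

When $\mu(X)$ is near 1, the bound $(1 + 1/\mu(X))/2$ is near 1, matching the discussion above.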
3.4. Results. We prepare various data sets using a training sample (78 patients) based on the 230 genes selected by van't Veer et al. [7] in order to investigate the existence of multiple solutions for the prediction of the disease status, which are given by

$$\mathcal{D} = \{D_{1-70}, D_{6-75}, \ldots, D_{161-230}\}. \tag{27}$$

The first data set $D_{1-70}$ consists of 78 patients and the 70 genes in MammaPrint. The ranking of the 230 genes in $\mathcal{D}$ is based on the correlation with disease status, as in [7]. We apply
the van't Veer method, DLDA, AdaBoost, and AUCBoost, as explained in the previous subsection, to the training data to obtain the discriminant functions $F(\mathbf{x})$, where each threshold is determined so that the training error rate is minimized. Then we evaluate the classification accuracy based on the training data as well as the test data. The results of the classification performance of the van't Veer method and DLDA are illustrated in Figure 1; those of AdaBoost and AUCBoost are in Figure 2. The AUC and the error rate are plotted against $\mathcal{D}$. In regard to the performance based on the training data, denoted by the solid line, the boosting methods are superior to the classical methods. However, in the comparison based on the test data, DLDA shows good performance for almost all the data sets in $\mathcal{D}$, having an AUC of more than 0.8 and error rates of less than 0.2 on average. This evidence suggests that there exist many sets of genes having almost the same classification performance as that of MammaPrint.
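The construction of the data sets in (27) — genes ranked by absolute correlation with disease status, then taken in overlapping windows of 70 — can be sketched as follows. The matrix and labels are synthetic stand-ins, and the step of five ranks is inferred from the window indices 1–70, 6–75, …, 161–230.

```python
import numpy as np

def rank_by_correlation(X, y):
    """Rank genes (columns of X) by |Pearson correlation| with the label y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(r))          # best-ranked gene first

def sliding_gene_sets(order, width=70, step=5, n_genes=230):
    """Windows D_{1-70}, D_{6-75}, ..., D_{161-230} over the ranked genes."""
    starts = range(0, n_genes - width + 1, step)
    return [order[s:s + width] for s in starts]

rng = np.random.default_rng(1)
X = rng.standard_normal((78, 230))        # synthetic stand-in for the data
y = rng.choice([-1.0, 1.0], size=78)      # disease status labels
order = rank_by_correlation(X, y)
D = sliding_gene_sets(order)
print(len(D), len(D[0]))                  # 33 70
```

Each of the 33 windows would then be fed to a classifier (the van't Veer method, DLDA, or a boosting method) exactly as a single candidate gene set.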
We investigate the performance of the hierarchical clusterings based on the training data sets $D_{1-70}$, $D_{11-80}$, and $D_{111-180}$, which are shown in Figure 3. Each row represents 70 genes and each column represents 78 patients, with disease status (blue), ER status (red), and PR status (orange). The BHI values for disease status, ER status, and PR status in MammaPrint ($D_{1-70}$) are 0.70, 0.69, and 0.57, respectively. The gene expression in $D_{11-80}$ shows different patterns from the others. Mainly, there are two clusters characterized by ER status. The left-hand side is the subgroup of patients with ER negative and poor prognosis; this would correspond to the basal-like or triple-negative subtype, though the Her2 status is unclear. The right-hand side could be divided into three subgroups. The subgroup of patients with ER negative shows good prognosis, indicating Luminal A, and either side of it would be Luminal B or Luminal C because it shows worse prognosis than Luminal A. The data set $D_{111-180}$ has the highest BHI for disease status, 0.76, and a gene expression pattern similar to that in $D_{1-70}$. The other values of BHI are illustrated in Figure 4. It turned out that the data sets in $\mathcal{D}$ have almost the same BHI for the three statuses, suggesting that there exist various gene sets with similar expression patterns.
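The BHI values quoted above can be computed from a clustering and a status label per patient. The sketch below uses one common formulation of the index (average within-cluster agreement over pairs; see [29] for variants), with toy labels in place of the real data.

```python
import numpy as np

def bhi(cluster_labels, status_labels):
    """Biological homogeneity index: average, over clusters, of the
    fraction of within-cluster pairs sharing the same status label.
    Clusters with fewer than two members are skipped."""
    cluster_labels = np.asarray(cluster_labels)
    status_labels = np.asarray(status_labels)
    scores = []
    for k in np.unique(cluster_labels):
        members = status_labels[cluster_labels == k]
        n = len(members)
        if n < 2:
            continue
        agree = sum(members[i] == members[j]
                    for i in range(n) for j in range(i + 1, n))
        scores.append(agree / (n * (n - 1) / 2))
    return float(np.mean(scores))

# Toy example: two clusters, one of which is fully homogeneous in status
clusters = [0, 0, 0, 1, 1, 1]
status   = [1, 1, 0, 0, 0, 0]
print(round(bhi(clusters, status), 3))   # 0.667
```

A BHI near 1 means the clusters are nearly pure with respect to the status, which is how the values 0.70, 0.69, and 0.57 above should be read.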
Next, we investigate the stability of the gene ranking in MammaPrint. Among the 78 patients, we randomly choose 50 patients with their 5420 gene expression patterns. Then we take the top 70 genes ranked by the correlation coefficients of gene expression with disease status. This procedure is repeated 100 times, and we check how many times the genes in MammaPrint are ranked within the top 70. The results are shown in the upper panel of Figure 5. The lower panel shows the result based on the AUC instead of the correlation coefficient in the ranking. We clearly see that some of the genes in MammaPrint are rarely selected in the top 70, which indicates the instability of the gene ranking. The performance of DLDA over 100 replications, which shows the most stable prediction accuracy as seen in Figure 1, based on the randomly selected 50 patients, is shown in Figure 6, where the vertical axes are the AUC (left panels) and the error rate (right panels), and the genes are ranked by the correlation (a) and the AUC (b). The performance clearly gets worse than that in Figure 1 in terms of both AUC and error rate. The heterogeneity of the gene expression pattern may come from the several subtypes,
Figure 1: Results of classical methods. (a) van't Veer method and (b) DLDA: the AUC values (left panels) and the error rates (right panels) over the data sets $\mathcal{D}$, based on the training data and the test data.
as suggested in the cluster analysis, and this would make it difficult to select useful genes for prediction.
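The resampling procedure described above — draw 50 of the 78 patients, re-rank all genes by absolute correlation with disease status, and record how often each gene of a reference set lands in the top 70 over 100 trials — can be sketched as below. The data are synthetic (and smaller than the real 5420 genes), so the rates only illustrate the mechanics.

```python
import numpy as np

def top_k_by_correlation(X, y, k=70):
    """Indices of the k genes with the largest |Pearson correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(r))[:k]

rng = np.random.default_rng(2)
n, p, n_trials = 78, 500, 100            # stand-in for the 5420-gene data
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)
reference = top_k_by_correlation(X, y)   # "MammaPrint-like" top 70

counts = np.zeros(p)
for _ in range(n_trials):
    idx = rng.choice(n, size=50, replace=False)   # 50 of the 78 patients
    counts[top_k_by_correlation(X[idx], y[idx])] += 1

rates = counts / n_trials                # per-gene selection rate (Figure 5)
print(counts.sum() == n_trials * 70)     # each trial contributes 70 picks
```

Plotting `rates[reference]` reproduces the kind of bar chart in Figure 5; genes whose rate stays well below 1 are exactly the unstable members of the reference set.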
Finally, we calculate the mutual coherence based on the genes in MammaPrint and the total 5420 genes in Figure 7. The scatter plot in the upper panel (a) shows the pair of genes with the highest mutual coherence (0.984) in MammaPrint. NM_014889 and NM_014968 correspond to MP1 and KIAA1104, respectively. MP1 is a scaffold protein in multiple signaling pathways, including the one in breast cancer; the latter is pitrilysin metallopeptidase 1. The correlations defined in the right-hand side of (26) are calculated for all pairs of genes in the total 5420 genes and in the 70 genes of MammaPrint. Their distributions are shown in the lower panel (b) of Figure 7. The gene pairs are sorted so that the values of the correlations decrease monotonically. Note that the number of gene pairs in each data set is different, but the range of the horizontal axis is restricted to between 0 and 1 for a clear view and easy comparison. The difference between the black and red curves indicates that the gene set of MammaPrint has large homogeneity of gene expression patterns in comparison with the total 5420 genes. This indicates that the ranking of genes based on a two-sample statistic such as the correlation coefficient is prone
Figure 2: Results of boosting methods. (a) AdaBoost and (b) AUCBoost: the AUC values (left panels) and the error rates (right panels) over the data sets $\mathcal{D}$, based on the training data and the test data.
to select genes whose expression patterns are similar to each other. This would be one reason why we obtain multiple solutions after ranking by the correlation coefficients. It is also interesting to see that there are a few gene pairs with very low correlation even in the MammaPrint gene set.
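The comparison behind Figure 7(b) — the sorted absolute pairwise correlations within a candidate gene set versus within a larger gene pool — can be sketched as follows. The two synthetic gene sets are constructed so that one is deliberately homogeneous (noisy copies of a single expression profile), mimicking the redundancy that correlation ranking tends to produce.

```python
import numpy as np

def sorted_abs_correlations(X):
    """Absolute pairwise Pearson correlations between the columns of X,
    sorted in decreasing order (the curves in Figure 7(b))."""
    p = X.shape[1]
    C = np.abs(np.corrcoef(X, rowvar=False))
    iu = np.triu_indices(p, k=1)          # each pair once, i < j
    return np.sort(C[iu])[::-1]

rng = np.random.default_rng(3)
base = rng.standard_normal((78, 1))
# A "homogeneous" gene set: noisy copies of one expression profile
homogeneous = base + 0.5 * rng.standard_normal((78, 70))
# A heterogeneous set: independent profiles
heterogeneous = rng.standard_normal((78, 70))

hom = sorted_abs_correlations(homogeneous)
het = sorted_abs_correlations(heterogeneous)
print(bool(hom.mean() > het.mean()))   # True: homogeneous pairs correlate more
```

The gap between the two sorted curves is the analogue of the gap between the red (MammaPrint) and black (all 5420 genes) curves in Figure 7(b).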
4. Discussion and Concluding Remarks
In this paper, we have addressed an important classification issue using microarray data. Our results demonstrate the existence of multiple suboptimal solutions for choosing classification predictors. The classification results showed that the performance of several gene sets was essentially the same. This behavior was confirmed by the van't Veer method, DLDA, AdaBoost, and AUCBoost. Surprisingly, some non-top-ranked gene sets showed better performance than the top-ranked gene sets. These results indicate that the ranking method is very unstable for feature selection in microarray data.
To examine the expression patterns of the selected predictors, we performed clustering; this added support to the existence of multiple solutions. We easily recognized the good and
Figure 3: Heat maps of gene expression data, with rows representing genes and columns representing patients. (a) $D_{1-70}$ (MammaPrint), (b) $D_{11-80}$, clearly showing some subtypes, and (c) $D_{111-180}$, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, red bars those with ER positive, and orange bars those with PR positive.
the bad prognosis groups from the clustered expression patterns of the non-top-ranked gene sets. Interestingly, some clustering patterns revealed subtypes in the expression patterns. As described by Sørlie et al. [31] and Parker et al. [32], breast cancer can be categorized into at least five subtypes: Luminal A, Luminal B, normal breast-like, HER2, and basal-like. The selected ranked gene sets contain heterogeneous groups. Our data set does not include the pathological data necessary to categorize the samples into the known subtypes, but taking subtypes into account for feature selection could be future work.
Figure 4: Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line), over the data sets $D_{1-70}$ through $D_{161-230}$.
Figure 5: The rates of ranking in the top 70 genes by the correlation coefficients (a) and by the AUC (b), based on randomly sampled 50 patients. The horizontal axis denotes the 70 genes used by MammaPrint.
Figure 6: The values of the AUC (left panels) and the error rate (right panels) calculated by DLDA using the genes ranked in the top 70 by the correlation coefficients (a) and by the AUC (b). These values are calculated based on randomly sampled 50 patients over 100 trials.
Microarray technology has become a common technology, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist in a single data set. This instability is related to the high-dimensional nature of microarrays. Moreover, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the $p \gg n$ problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy. This redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.
One approach to tackle this problem of the high dimensionality of the data is to apply statistical methods that take into account background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5 and suggested by
Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint (NM_014889 versus NM_014968). (b) The distribution of the correlations among the total 5420 genes (black) and the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the pairs of genes on which the correlations are calculated, standardized between 0 and 1 for a clear view.
Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which deals with the whole information of all genes, would be useful to capture the mutual relations among genes and to identify subgroups of informative genes for the prediction problem. The combination of unsupervised and supervised learning is a promising way to overcome the difficulty involved in high-dimensional data.
We addressed the challenges and difficulties regarding the pattern recognition of gene expression. The main problem is caused by high-dimensional data sets. This is not only a problem of microarray data but also of RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] showed clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs in RNA-seq data: its dimension is also quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. Therefore, it is not difficult to imagine that RNA-seq data have the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights we have learned from microarray data.
Conflict of Interests
The authors have declared that no conflict of interests exists
References
[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.
[7] L. J. van't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.
[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301–320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined — microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.
Submit your manuscripts athttpwwwhindawicom
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MEDIATORSINFLAMMATION
of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Behavioural Neurology
EndocrinologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Disease Markers
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
OncologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Oxidative Medicine and Cellular Longevity
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
PPAR Research
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
ObesityJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational and Mathematical Methods in Medicine
OphthalmologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Diabetes ResearchJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Research and TreatmentAIDS
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Gastroenterology Research and Practice
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Parkinsonrsquos Disease
Evidence-Based Complementary and Alternative Medicine
Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom
8 Computational and Mathematical Methods in Medicine
0
02
04
06
08
1
Data sets
AUC
0
02
04
06
08
1
Erro
r
TrainingTest
TrainingTest
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230Data sets
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230
(a) vanrsquot Veer methodEr
ror
0
02
04
06
08
1
AUC
Data setsTrainingTest
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230
0
02
04
06
08
1
Data setsTrainingTest
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230
(b) DLDA
Figure 1 Results of classical methods (a) and (b) show the AUC values (left panel) and the error rate (right panel) over data setsD for vanrsquotVeer method and DLDA
as suggested in the cluster analysis and it would make itdifficult to select useful genes for predictions
Finally we calculate the mutualcoherence based on genesin MammaPrint and total 5420 genes in Figure 7 The scatterplot in the upper panel (a) shows a pair of geneswith the high-est mutual coherence (0984) in MammaPrint NM 014889
and NM 014968 correspond to MP1 and KIAA1104 respec-tivelyMP1 is a scaffold protein inmultiple signaling pathwaysincluding the one in breast cancer The latter one is pitrilysinmetallopeptidase 1The correlations defined in the right-handside of (26) are calculated for all pairs of genes in total 5420
genes and 70 genes in MammaPrint The distributions ofthem are shown in the lower panel (b) in Figure 7 The genepairs are sorted so that the values of the correlations decreasemonotonically Note that the number of the gene pairs ineach data set is different but the range of horizontal axis isrestricted to 0 and 1 for clear view and for easy comparisonThedifference between the black and red curves indicates thatthe gene set of MammaPrint has large homogeneity of geneexpression patterns in comparison with that of total 5420genes This indicates that the ranking of genes based on two-sample statistic such as the correlation coefficients is prone
Computational and Mathematical Methods in Medicine 9
0
02
04
06
08
1AU
C
Data setsTrainingTest
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230
Erro
r
0
02
04
06
08
1
Data setsTrainingTest
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230
(a) AdaBoost
0
02
04
06
08
1
AUC
Data setsTrainingTest
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230
Erro
r
0
02
04
06
08
1
Data setsTrainingTest
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230
(b) AUCBoost
Figure 2 Results of boosting methods (a) and (b) show the AUC values (left panel) and the error rate (right panel) over data sets D forAdaBoost and AUCBoost
to select genes such that their gene expression patterns aresimilar to each other This would be one reason why we havemultiple solutions after the ranking methods based on thecorrelation coefficients It is also interesting to see that thereare a few gene pairs with very low correlation even in gene setof MammaPrint
4 Discussion and Concluding Remarks
In this paper we have addressed an important classificationissue using microarray data Our results present the existence
of multiple suboptimal solutions for choosing classificationpredictors Classification results showed that the perfor-mance of several gene sets was the same This behaviorwas confirmed by the vanrsquot Veer method DLDA AdaBoostand AUCBoost Amazingly nontop ranked gene sets showedbetter performance than the top ranked gene sets Theseresults indicate that the ranking method is very unstable forfeature selection in microarray data
To examine the expression pattern of selected predictorswe performed clustering this added support to the existenceof multiple solutions We easily recognized the good and
10 Computational and Mathematical Methods in Medicine
(a) (b)
(c)
Figure 3 Heat maps of gene expression data with rows representing genes and columns representing patients (a) 1198631minus70
(MammaPrint)(b) 119863
11minus80clearly showing some subtypes and (c) 119863
111minus180with the highest BHI regarding metastases The blue bars indicate patients with
metastases red bars those with ER positive and orange bars those with PR positive
the bad prognosis groups from the nontop ranked genesets clustering expression pattern Interestingly some clus-tering patterns showed subtypes in the expression patternAs described by Soslashrlie et al [31] and Parker et al [32]breast cancer could be categorized into at least five subtypes
Luminal A Luminal B normal breast-like HER2 and basal-like Selected ranked gene sets contain heterogeneous groupsOur data set does not include the necessary pathologicaldata to categorize genes into known subtypes but consideringsubtypes for feature selection could be future work
Computational and Mathematical Methods in Medicine 11
SetBH
I
MetastasesERp
PRp
0
02
04
06
08
1
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230
Figure 4 Biological homogeneity index (BHI) for metastases (solid line) ER positive (dashed line) and PR positive (dotted line)
Rate
of r
anki
ng in
the t
op 7
0
0
02
04
06
08
1
AL080059
Con
tig63649
RCC
ontig46218
RCN
M016359
AA555029
RCN
M003748
Con
tig38288
RCN
M003862
Con
tig28552
RCC
ontig32125
RCU82987
AL137718
AB037863
NM020188
NM020974
NM000127
NM002019
NM002073
NM000436
NM004994
Con
tig55377
RCC
ontig35251
RCC
ontig25991
NM003875
NM006101
NM003882
NM003607
AF0
7351
9A
F052
162
NM000849
Con
tig32185
RCN
M016577
Con
tig48328
RCC
ontig46223
RCN
M015984
NM006117
AK0
0074
5C
ontig40831
RCN
M003239
NM014791
X056
10N
M016448
NM018401
NM000788
Con
tig51464
RCA
L080
079
NM006931
AF2
5717
5N
M014321
NM002916
Con
tig55725
RCC
ontig24252
RCA
F201
951
NM005915
NM001282
Con
tig56457
RCN
M000599
NM020386
NM014889
AF0
5503
3N
M006681
NM007203
Con
tig63102
RCN
M003981
Con
tig20217
RCN
M001809
Con
tig2399
RCN
M004702
NM007036
NM018354
(a)
Rate
of r
anki
ng in
the t
op 7
0
0
02
04
06
08
1
AL080059
Con
tig63649
RCC
ontig
46218
RCN
M016359
AA555029
RCN
M003748
Con
tig38288
RCN
M003862
Con
tig28552
RCC
ontig
32125
RCU82987
AL1
37718
AB0
37863
NM
020188
NM
020974
NM
000127
NM
002019
NM
002073
NM
000436
NM
004994
Con
tig55377
RCC
ontig
35251
RCC
ontig
25991
NM
003875
NM
006101
NM
003882
NM
003607
AF0
7351
9A
F052
162
NM
000849
Con
tig32185
RCN
M016577
Con
tig48328
RCC
ontig
46223
RCN
M015984
NM
006117
AK0
0074
5C
ontig
40831
RCN
M003239
NM
014791
X056
10N
M016448
NM
018401
NM
000788
Con
tig51464
RCA
L080
079
NM
006931
AF2
5717
5N
M014321
NM
002916
Con
tig55725
RCC
ontig
24252
RCA
F201
951
NM
005915
NM
001282
Con
tig56457
RCN
M000599
NM
020386
NM
014889
AF0
5503
3N
M006681
NM
007203
Con
tig63102
RCN
M003981
Con
tig20217
RCN
M001809
Con
tig2399
RCN
M004702
NM
007036
NM
018354
(b)
Figure 5The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b) based on randomly sampled 50 patientsThehorizontal axis denotes 70 genes used by MammaPrint
12 Computational and Mathematical Methods in Medicine
0 20 40 60 80 100
1
05
06
07
08
09
1AU
C
0 20 40 60 80 100
0
01
02
03
04
05
Erro
r
Number of trialsNumber of trials
TrainingTest
TrainingTest
(a)
0 20 40 60 80 100
05
06
07
08
09
1
Number of trialsNumber of trials
AUC
0 20 40 60 80 100
0
01
02
03
04
05
Erro
r
TrainingTest
TrainingTest
(b)
Figure 6 The values of AUC (left panel) and error rate (right panel) calculated by DLDA using genes ranked in top 70 by the correlationcoefficients (a) and by AUC (b) These values are calculated based on randomly sampled 50 patients over 100 trials
Microarray technology has become commonplace, and many biomarker candidates have been proposed. However, not many researchers are aware that multiple solutions can exist within a single data set. This instability is related to the high-dimensional nature of microarray data; in addition, the ranking method easily returns different results for different sets of subjects. There may be several explanations for this. The main problem is the ultra-high dimension of microarray data, known as the p ≫ n problem: the number of genes p far exceeds the number of subjects n. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing redundancy, and this redundancy is not removed by the ranking procedure. This is the critical pitfall for gene selection in microarray data.
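This instability is easy to reproduce. The following sketch uses entirely synthetic, hypothetical data (78 subjects, 5000 genes in correlated blocks of 10, and 200 weakly informative genes; all numbers are illustrative, not the van 't Veer data): it ranks genes by absolute Pearson correlation with a binary phenotype on two random subsets of 50 subjects, mirroring the resampling design behind the figures, and counts how many genes the two top-70 lists share.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 78, 5000, 70  # subjects, genes, size of the selected "signature"

# Hypothetical expression matrix: blocks of 10 mutually correlated genes
# (shared latent pattern plus independent noise), so many genes look alike.
y = rng.integers(0, 2, size=n).astype(float)        # binary phenotype
base = rng.normal(size=(n, p // 10))                # 500 latent patterns
X = np.repeat(base, 10, axis=1) + 0.5 * rng.normal(size=(n, p))
X[:, :200] += 0.4 * y[:, None]                      # 200 weakly informative genes

def top_genes(rows):
    """Rank genes by absolute Pearson correlation with the phenotype
    computed only on the given subset of subjects; return the top-k set."""
    Xs, ys = X[rows], y[rows]
    Xc = Xs - Xs.mean(axis=0)
    yc = ys - ys.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return set(np.argsort(-np.abs(r))[:k])

# Two random subsets of 50 subjects yield two different "top-70" gene lists.
s1 = top_genes(rng.choice(n, size=50, replace=False))
s2 = top_genes(rng.choice(n, size=50, replace=False))
print(len(s1 & s2), "of", k, "top-ranked genes shared between the two subsamples")
```

The two lists typically share only a modest fraction of their genes even though the underlying informative genes are identical, which is exactly the multiplicity of suboptimal solutions discussed above.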
One approach to tackling the high dimensionality of the data is to apply statistical methods that take into account background knowledge of the medical or biological data. As seen in Figure 3, and as demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patient's prognosis. This heterogeneity of the data is thought to be one factor that makes the ranking of genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5, and as suggested by Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient has a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which uses the whole information of all the genes, would be useful for capturing the mutual relations among genes and for identifying subgroups of genes informative for the prediction problem. The combination of unsupervised and supervised learning is a promising way to overcome the difficulty inherent in high-dimensional data.

[Scatter plot and correlation curves omitted; panel (a) plots gene NM014889 against NM014968.]

Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distribution of the correlations of the total 5420 genes (black) and of the 70 MammaPrint genes (red). The horizontal axis denotes the index of the pairs of genes on which the correlations are based, standardized between 0 and 1 for a clear view.
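The mutual coherence used in this analysis, that is, the largest absolute Pearson correlation between two distinct genes, is simply the maximum off-diagonal entry of the gene-by-gene correlation matrix. A minimal sketch (synthetic data; the near-duplicate column is hypothetical) shows how a single nearly redundant gene pushes the coherence toward 1, the regime in which gene selection becomes ill-posed:

```python
import numpy as np

def mutual_coherence(X):
    """Maximum absolute Pearson correlation over all pairs of distinct
    columns (genes) of the subjects-by-genes matrix X."""
    Xc = X - X.mean(axis=0)                   # center each gene
    Xn = Xc / np.linalg.norm(Xc, axis=0)      # unit-norm columns
    G = Xn.T @ Xn                             # Gram matrix = pairwise correlations
    np.fill_diagonal(G, 0.0)                  # ignore self-correlation
    return float(np.abs(G).max())

rng = np.random.default_rng(1)
X = rng.normal(size=(78, 500))                # independent genes: low coherence
dup = X[:, 0] + 0.1 * rng.normal(size=78)     # a near-duplicate of gene 0
Xd = np.column_stack([X, dup])                # adding it drives coherence toward 1
print(mutual_coherence(X), mutual_coherence(Xd))
```

For the MammaPrint genes this paper reports a coherence of 0.984, so the selected signature itself sits deep in this redundant regime.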
We have addressed the challenges and difficulties in pattern recognition from gene expression data. The main problem is caused by high-dimensional data sets. This is a problem not only for microarray data but also for RNA sequencing. RNA sequencing (RNA-seq) technologies have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] presented a clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty occurs with RNA-seq data, whose dimension is also quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. It is therefore not difficult to imagine that RNA-seq data suffer from the same difficulty as microarray data. We have to be aware of this and tackle it using the insights learned from microarray data.
Conflict of Interests
The authors have declared that no conflict of interests exists.
References

[1] E. S. Lander, L. M. Linton, B. Birren, et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers, et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.
[7] L. J. van 't Veer, H. Dai, et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin, et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo, et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[13] S. Paik, S. Shak, G. Tang, et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.
[14] F. Cardoso, L. van 't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels, et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301–320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined – microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang, et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.
10 Computational and Mathematical Methods in Medicine
(a) (b)
(c)
Figure 3 Heat maps of gene expression data with rows representing genes and columns representing patients (a) 1198631minus70
(MammaPrint)(b) 119863
11minus80clearly showing some subtypes and (c) 119863
111minus180with the highest BHI regarding metastases The blue bars indicate patients with
metastases red bars those with ER positive and orange bars those with PR positive
the bad prognosis groups from the nontop ranked genesets clustering expression pattern Interestingly some clus-tering patterns showed subtypes in the expression patternAs described by Soslashrlie et al [31] and Parker et al [32]breast cancer could be categorized into at least five subtypes
Luminal A Luminal B normal breast-like HER2 and basal-like Selected ranked gene sets contain heterogeneous groupsOur data set does not include the necessary pathologicaldata to categorize genes into known subtypes but consideringsubtypes for feature selection could be future work
Computational and Mathematical Methods in Medicine 11
SetBH
I
MetastasesERp
PRp
0
02
04
06
08
1
1ndash70 41ndash120 81ndash150 121ndash190 161ndash230
Figure 4 Biological homogeneity index (BHI) for metastases (solid line) ER positive (dashed line) and PR positive (dotted line)
Rate
of r
anki
ng in
the t
op 7
0
0
02
04
06
08
1
AL080059
Con
tig63649
RCC
ontig46218
RCN
M016359
AA555029
RCN
M003748
Con
tig38288
RCN
M003862
Con
tig28552
RCC
ontig32125
RCU82987
AL137718
AB037863
NM020188
NM020974
NM000127
NM002019
NM002073
NM000436
NM004994
Con
tig55377
RCC
ontig35251
RCC
ontig25991
NM003875
NM006101
NM003882
NM003607
AF0
7351
9A
F052
162
NM000849
Con
tig32185
RCN
M016577
Con
tig48328
RCC
ontig46223
RCN
M015984
NM006117
AK0
0074
5C
ontig40831
RCN
M003239
NM014791
X056
10N
M016448
NM018401
NM000788
Con
tig51464
RCA
L080
079
NM006931
AF2
5717
5N
M014321
NM002916
Con
tig55725
RCC
ontig24252
RCA
F201
951
NM005915
NM001282
Con
tig56457
RCN
M000599
NM020386
NM014889
AF0
5503
3N
M006681
NM007203
Con
tig63102
RCN
M003981
Con
tig20217
RCN
M001809
Con
tig2399
RCN
M004702
NM007036
NM018354
(a)
Rate
of r
anki
ng in
the t
op 7
0
0
02
04
06
08
1
AL080059
Con
tig63649
RCC
ontig
46218
RCN
M016359
AA555029
RCN
M003748
Con
tig38288
RCN
M003862
Con
tig28552
RCC
ontig
32125
RCU82987
AL1
37718
AB0
37863
NM
020188
NM
020974
NM
000127
NM
002019
NM
002073
NM
000436
NM
004994
Con
tig55377
RCC
ontig
35251
RCC
ontig
25991
NM
003875
NM
006101
NM
003882
NM
003607
AF0
7351
9A
F052
162
NM
000849
Con
tig32185
RCN
M016577
Con
tig48328
RCC
ontig
46223
RCN
M015984
NM
006117
AK0
0074
5C
ontig
40831
RCN
M003239
NM
014791
X056
10N
M016448
NM
018401
NM
000788
Con
tig51464
RCA
L080
079
NM
006931
AF2
5717
5N
M014321
NM
002916
Con
tig55725
RCC
ontig
24252
RCA
F201
951
NM
005915
NM
001282
Con
tig56457
RCN
M000599
NM
020386
NM
014889
AF0
5503
3N
M006681
NM
007203
Con
tig63102
RCN
M003981
Con
tig20217
RCN
M001809
Con
tig2399
RCN
M004702
NM
007036
NM
018354
(b)
Figure 5The rates of ranking in the top 70 genes by correlation coefficients (a) and by AUC (b) based on randomly sampled 50 patientsThehorizontal axis denotes 70 genes used by MammaPrint
12 Computational and Mathematical Methods in Medicine
Figure 6: The values of AUC (left panels) and error rate (right panels) calculated by DLDA using the genes ranked in the top 70 by the correlation coefficients (a) and by AUC (b). These values are calculated from 50 randomly sampled patients over 100 trials; training and test curves are shown in each panel.
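For readers unfamiliar with the classifier evaluated in Figure 6, diagonal linear discriminant analysis (DLDA) is nearest-centroid classification with a shared, diagonal covariance estimate. The following is a generic illustrative sketch, not the implementation used in the paper; the function names are ours.

```python
import numpy as np

def dlda_fit(X, y):
    """Fit DLDA: per-class means and a pooled per-feature (diagonal) variance."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Pool the within-class residuals, then take one variance per feature.
    resid = np.concatenate([X[y == c] - means[i] for i, c in enumerate(classes)])
    var = resid.var(axis=0) + 1e-8  # small ridge to avoid division by zero
    return classes, means, var

def dlda_predict(X, classes, means, var):
    # Discriminant: squared distance to each class mean, weighted by 1/variance.
    d = ((X[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[np.argmin(d, axis=1)]
```

Because the covariance is restricted to a diagonal, DLDA remains usable even when the number of genes far exceeds the number of subjects, which is why it is a common baseline for microarray data.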
Microarray technology has become commonplace, and many biomarker candidates have been proposed. However, few researchers are aware that multiple solutions can exist within a single data set. This instability stems from the high-dimensional nature of microarray data; moreover, the ranking method easily returns different results for different sets of subjects. Several explanations are possible, but the main problem is the ultra-high dimension of microarray data, known as the p ≫ n problem. Ultra-high-dimensional gene expression data contain too many similar expression patterns, causing a redundancy that the ranking procedure does not remove. This is the critical pitfall for gene selection in microarray data.
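The instability described above is easy to reproduce on synthetic data: ranking genes by absolute correlation with the phenotype on two random halves of the subjects yields noticeably different top-k lists when p ≫ n. All data, dimensions, and names below are hypothetical and for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 78, 5000, 70           # subjects, genes, list size (p >> n)
X = rng.standard_normal((n, p))  # synthetic expression matrix
y = rng.integers(0, 2, n)        # synthetic binary phenotype

def top_k_genes(idx):
    """Rank genes by |Pearson correlation| with the label on a subject subset."""
    Xs, ys = X[idx], y[idx]
    Xc = Xs - Xs.mean(axis=0)
    yc = ys - ys.mean()
    r = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return set(np.argsort(-np.abs(r))[:k])

# Two random halves of the subjects give two different "top 70" lists.
a = top_k_genes(rng.choice(n, n // 2, replace=False))
b = top_k_genes(rng.choice(n, n // 2, replace=False))
print(len(a & b))  # overlap is typically small when p >> n and signal is weak
```

This mirrors the resampling experiment behind Figure 5, where even the 70 MammaPrint genes are not consistently ranked in the top 70 across random subsets of patients.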
One approach to the high dimensionality of the data is to apply statistical methods that incorporate background medical or biological knowledge. As seen in Figure 3 and demonstrated by Sørlie et al. [31], the subtypes of breast cancer are closely related to the patient's prognosis. This heterogeneity of the data is one factor that makes the ranking of genes unstable, leading to multiple solutions for prediction rules. As seen in Figure 5 and suggested by
Figure 7: (a) Scatter plot of the pair of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint (NM 014889 versus NM 014968). (b) The distribution of the correlations among all 5420 genes (black) and among the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the pairs of genes for which the correlations are calculated, standardized between 0 and 1 for clarity.
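The mutual coherence plotted in Figure 7, i.e., the largest absolute Pearson correlation over all pairs of genes, takes only a few lines to compute. This is a straightforward sketch of the quantity, not the authors' code.

```python
import numpy as np

def mutual_coherence(X):
    """Largest |Pearson correlation| over all pairs of columns (genes) of X."""
    Xc = X - X.mean(axis=0)
    Xn = Xc / (np.linalg.norm(Xc, axis=0) + 1e-12)  # unit-norm columns
    G = Xn.T @ Xn                                   # Gram matrix = correlations
    np.fill_diagonal(G, 0.0)                        # ignore self-correlations
    return np.abs(G).max()
```

A value near 1, as observed for the MammaPrint genes, signals near-duplicate columns; in the sparse-representation literature this is exactly the regime in which recovering a unique informative subset becomes ill-posed.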
Ein-Dor et al. [19], gene ranking based on a two-sample statistic such as the correlation coefficient carries a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which uses the whole information of all genes, would be useful for capturing the mutual relations among genes and for identifying subgroups of genes informative for the prediction problem. The combination of unsupervised and supervised learning is a promising way to address the difficulty inherent in high-dimensional data.
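One possible form of such a combination, hierarchical clustering of genes by correlation followed by a supervised choice of one representative per cluster, is sketched below under our own assumptions; the threshold `max_dist` and the function names are illustrative, not part of the paper's method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_representatives(X, y, max_dist=0.3):
    """Group correlated genes (unsupervised), keep one per cluster (supervised).

    X: subjects x genes matrix; y: binary phenotype. Genes closer than
    max_dist in correlation distance (1 - |r|) share a cluster; the gene
    most correlated with y represents its cluster."""
    D = 1.0 - np.abs(np.corrcoef(X, rowvar=False))   # gene-gene dissimilarity
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='average')
    labels = fcluster(Z, t=max_dist, criterion='distance')
    # Supervised score: absolute correlation of each gene with the phenotype.
    score = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    reps = [cluster[np.argmax(score[cluster])]
            for cluster in (np.flatnonzero(labels == c) for c in np.unique(labels))]
    return sorted(reps)
```

By collapsing highly coherent genes into a single representative before ranking, this kind of procedure directly attacks the redundancy that makes the top-70 lists unstable.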
We have addressed the challenges and difficulties in pattern recognition for gene expression. The main problem is caused by high-dimensional data sets, and it is not unique to microarray data: RNA sequencing (RNA-seq) technologies have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] proposed clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty arises there: the dimension of RNA-seq data is also quite high. The running cost of RNA sequencing has decreased; however, the number of features is still much larger than the number of observations. It is therefore not difficult to imagine that RNA-seq data suffer from the same difficulty as microarray data. We have to be aware of it and tackle this difficulty using the insights learned from microarray data.
Conflict of Interests
The authors have declared that no conflict of interests exists.
References
[1] E. S. Lander, L. M. Linton, B. Birren, et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers, et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.
[7] L. J. van't Veer, H. Dai, et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin, et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo, et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[13] S. Paik, S. Shak, G. Tang, et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.
[14] F. Cardoso, L. van't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels, et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301–320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined - microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani, et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang, et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.
[19] L Ein-Dor I Kela GGetz DGivol and EDomany ldquoOutcomesignature genes in breast cancer is there a unique setrdquoBioinformatics vol 21 no 2 pp 171ndash178 2005
[20] R A Fisher ldquoThe use of multiple measurements in taxonomicproblemsrdquo Annals of Eugenics vol 7 no 2 pp 179ndash188 1936
[21] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990
[22] O Komori and S Eguchi ldquoA boosting method for maximizingthe partial area under the ROC curverdquo BMCBioinformatics vol11 article 314 2010
[23] T Soslashrlie Perou C M Perou R Tibshiran et al ldquoGene expres-sion patterns of breast carcinomas distinguish tumor subclasseswith clinical implicationsrdquo Proceedings of the National Academyof Sciences vol 98 no 19 pp 10869ndash10874 2001
[24] M Yousef S Jung L C Showe and M K Showe ldquoLearningfrom positive examples when the negative class is undeter-mined-microRNA gene identificationrdquo Algorithms for Molecu-lar Biology vol 3 no 1 article 2 2008
[25] A B Gardner A M Krieger G Vachtsevanos and B LittldquoOne-class novelty detection for seizure analysis from intracra-nial EEGrdquo Journal of Machine Learning Research vol 7 pp1025ndash1044 2006
[26] T Hastie R Tibshirani and J FriedmanThe Elements of Statis-tical Learning Data Mining Inference and Prediction SpringerNew York NY USA 2001
[27] M S PepeThe Statistical Evaluation of Medical Tests for Classi-fication and Prediction Oxford University Press New York NYUSA 2003
[28] O Komori ldquoA boosting method for maximization of the areaunder the ROC curverdquoAnnals of the Institute of StatisticalMath-ematics vol 63 no 5 pp 961ndash979 2011
[29] H M Wu ldquoOn biological validity indices for soft clusteringalgorithms for gene expression datardquo Computational Statisticsand Data Analysis vol 55 no 5 pp 1969ndash1979 2011
[30] M Elad Sparse and Redundant Representations fromTheory toApplications in Signal and Image Processing Springer NewYorkNY USA 2010
[31] T Soslashrlie C M Perou R Tibshiran et al ldquoGene expression pat-terns of breast carcinomas distinguish tumor subclasses withclinical implicationsrdquo Proceedings of the National Academy ofSciences vol 98 no 19 pp 10869ndash10874 2001
[32] J S Parker M Mullins M C U Cheang et al ldquoSupervised riskpredictor of breast cancer based on intrinsic subtypesrdquo Journalof Clinical Oncology vol 27 no 8 pp 1160ndash1167 2009
[33] A Mortazavi B A Williams K McCue L Schaeffer and BWold ldquoMapping and quantifying mammalian transcriptomesby RNA-Seqrdquo Nature Methods vol 5 no 7 pp 621ndash628 2008
[34] Z Wang M Gerstein and M Snyder ldquoRNA-Seq a revolution-ary tool for transcriptomicsrdquo Nature Reviews Genetics vol 10no 1 pp 57ndash63 2009
[35] S Pepke B Wold and A Mortazavi ldquoComputation for ChIP-seq and RNA-seq studiesrdquo Nature Methods vol 6 no 11 ppS22ndashS32 2009
[36] B T Wilhelm and J R Landry ldquoRNA-Seq-quantitative mea-surement of expression through massively parallel RNA-sequencingrdquoMethods vol 48 no 3 pp 249ndash257 2009
[37] D M Witten ldquoClassification and clustering of sequencing datausing a Poisson modelrdquo The Annals of Applied Statistics vol 5no 4 pp 2493ndash2518 2011
Submit your manuscripts athttpwwwhindawicom
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MEDIATORSINFLAMMATION
of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Behavioural Neurology
EndocrinologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Disease Markers
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
OncologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Oxidative Medicine and Cellular Longevity
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
PPAR Research
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
ObesityJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational and Mathematical Methods in Medicine
OphthalmologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Diabetes ResearchJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Research and TreatmentAIDS
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Gastroenterology Research and Practice
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Parkinsonrsquos Disease
Evidence-Based Complementary and Alternative Medicine
Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom
Figure 7: (a) Scatter plot for the pair of genes (NM_014889 and NM_014968) attaining the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distribution of the correlations for the total of 5420 genes (black) and for the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the pairs of genes on which the correlations are calculated, standardized to the interval [0, 1] for a clear view; the vertical axis denotes the correlation.
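As a concrete illustration (not code from the paper), the mutual coherence of an expression matrix, that is, the maximum absolute Pearson correlation over all pairs of genes, can be computed directly from the standardized data columns; the variable names and the simulated sizes below are hypothetical:

```python
import numpy as np

def mutual_coherence(X):
    """Maximum absolute Pearson correlation over all pairs of columns (genes).

    X: (n_subjects, n_genes) expression matrix; assumes no gene is constant.
    """
    # Center each gene and scale it to unit Euclidean norm
    Z = X - X.mean(axis=0)
    Z /= np.linalg.norm(Z, axis=0)
    # Gram matrix of the standardized columns = Pearson correlation matrix
    C = np.abs(Z.T @ Z)
    np.fill_diagonal(C, 0.0)  # ignore self-correlations on the diagonal
    return C.max()

rng = np.random.default_rng(0)
X = rng.normal(size=(78, 70))  # e.g. 78 subjects, 70 genes (illustrative)
print(round(mutual_coherence(X), 3))
```

For independent Gaussian noise the coherence stays well below 1 at these sizes, whereas a coherence near 1, as for the MammaPrint pair above, signals nearly collinear genes.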
As shown by Ein-Dor et al. [19], gene rankings based on two-sample statistics such as correlation coefficients exhibit a large amount of variability, indicating the limitation of single-gene analysis. Clustering analysis, which exploits the information of all genes jointly, would be useful for capturing the mutual relations among genes and for identifying subgroups of genes that are informative for the prediction problem. The combination of unsupervised and supervised learning is a promising way to address the difficulty inherent in high-dimensional data.
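The ranking instability noted by Ein-Dor et al. [19] is easy to reproduce on synthetic data. The sketch below (illustrative sizes, not the paper's data) ranks genes by the absolute two-sample t-statistic, bootstraps the subjects, and measures how much of the top-k gene list survives resampling:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 40, 2000, 70            # subjects, genes, signature size (illustrative)
y = np.repeat([0, 1], n // 2)     # two phenotype classes
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.0             # only 10 genes carry a real class signal

def top_k_genes(X, y, k):
    # Rank genes by absolute two-sample t-statistic (unequal-variance form)
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / se
    return set(np.argsort(t)[-k:])

ref = top_k_genes(X, y, k)
overlaps = []
for _ in range(50):               # resample subjects with replacement
    idx = rng.choice(n, size=n, replace=True)
    overlaps.append(len(ref & top_k_genes(X[idx], y[idx], k)) / k)
print(f"mean overlap with original top-{k}: {np.mean(overlaps):.2f}")
```

With far more genes than subjects, the mean overlap falls well short of 1: many different top-k lists are nearly equally supported by the data, which is exactly the multiplicity of suboptimal solutions discussed in this paper.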
We addressed the challenges and difficulties in pattern recognition for gene expression data. The main problem is caused by the high dimensionality of the data sets. This is not only a problem for microarray data but also for RNA sequencing. RNA sequencing technologies (RNA-seq) have advanced dramatically in recent years and are considered an alternative to microarrays for measuring gene expression, as addressed in Mortazavi et al. [33], Wang et al. [34], Pepke et al. [35], and Wilhelm and Landry [36]. Witten [37] proposed a clustering and classification methodology applying a Poisson model to RNA-seq data. The same difficulty arises for RNA-seq data, whose dimension is also quite high: although the cost of RNA sequencing has decreased, the number of features is still much larger than the number of observations. It is therefore not difficult to imagine that RNA-seq data suffer from the same difficulty as microarray data; we have to be aware of it and tackle it using the insights learned from microarray data.
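To make the Poisson-model idea concrete, the sketch below is a minimal Poisson naive-Bayes classifier for count data. This is a deliberate simplification for illustration, not Witten's [37] exact estimator (which uses a log-linear model with size factors and shrinkage); all names and simulated sizes are hypothetical:

```python
import numpy as np

class PoissonNB:
    """Minimal Poisson naive-Bayes classifier for count features (illustrative)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Per-class mean count for each gene, with a small pseudocount for stability
        self.rate_ = np.array([X[y == c].mean(axis=0) + 1e-3 for c in self.classes_])
        self.prior_ = np.array([np.mean(y == c) for c in self.classes_])
        return self

    def predict(self, X):
        # log P(x | c) up to an x-only constant: sum_j (x_j log rate_cj - rate_cj)
        ll = X @ np.log(self.rate_).T - self.rate_.sum(axis=1)
        ll += np.log(self.prior_)
        return self.classes_[np.argmax(ll, axis=1)]

rng = np.random.default_rng(2)
X0 = rng.poisson(5.0, size=(30, 100))                  # class 0 counts
X1 = rng.poisson(5.0, size=(30, 100))
X1[:, :5] = rng.poisson(15.0, size=(30, 5))            # 5 genes upregulated in class 1
X, y = np.vstack([X0, X1]), np.repeat([0, 1], 30)
model = PoissonNB().fit(X, y)
print((model.predict(X) == y).mean())                  # training accuracy
```

Even this crude model separates the classes when a few genes carry strong count differences; the high-dimensionality caveat of this paper still applies, since with p much larger than n many such gene subsets perform comparably.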
Conflict of Interests
The authors have declared that no conflict of interests exists.
References
[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860–921, 2001.
[2] J. C. Venter, M. D. Adams, E. W. Myers et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, 2001.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
[4] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[6] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, "Microarray data analysis: from disarray to consolidation and consensus," Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.
[7] L. J. van 't Veer, H. Dai et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[8] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[9] M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 20, pp. 10614–10619, 1996.
[10] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[11] F. Naef, C. R. Hacker, N. Patil, and M. Magnasco, "Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays," Genome Biology, vol. 3, no. 4, article RESEARCH0018, 2002.
[12] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[13] S. Paik, S. Shak, G. Tang et al., "A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer," The New England Journal of Medicine, vol. 351, no. 27, pp. 2817–2826, 2004.
[14] F. Cardoso, L. van 't Veer, E. Rutgers, S. Loi, S. Mook, and M. J. Piccart-Gebhart, "Clinical application of the 70-gene profile: the MINDACT trial," Journal of Clinical Oncology, vol. 26, no. 5, pp. 729–735, 2008.
[15] C. Fan, D. S. Oh, L. Wessels et al., "Concordance among gene-expression-based predictors for breast cancer," The New England Journal of Medicine, vol. 355, no. 6, pp. 560–569, 2006.
[16] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349–369, 1989.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[18] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, vol. 67, no. 2, pp. 301–320, 2005.
[19] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, "Outcome signature genes in breast cancer: is there a unique set?" Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[21] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[22] O. Komori and S. Eguchi, "A boosting method for maximizing the partial area under the ROC curve," BMC Bioinformatics, vol. 11, article 314, 2010.
[23] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[24] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe, "Learning from positive examples when the negative class is undetermined: microRNA gene identification," Algorithms for Molecular Biology, vol. 3, no. 1, article 2, 2008.
[25] A. B. Gardner, A. M. Krieger, G. Vachtsevanos, and B. Litt, "One-class novelty detection for seizure analysis from intracranial EEG," Journal of Machine Learning Research, vol. 7, pp. 1025–1044, 2006.
[26] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, NY, USA, 2001.
[27] M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
[28] O. Komori, "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, vol. 63, no. 5, pp. 961–979, 2011.
[29] H. M. Wu, "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics and Data Analysis, vol. 55, no. 5, pp. 1969–1979, 2011.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences, vol. 98, no. 19, pp. 10869–10874, 2001.
[32] J. S. Parker, M. Mullins, M. C. U. Cheang et al., "Supervised risk predictor of breast cancer based on intrinsic subtypes," Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
[33] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
[34] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[35] S. Pepke, B. Wold, and A. Mortazavi, "Computation for ChIP-seq and RNA-seq studies," Nature Methods, vol. 6, no. 11, pp. S22–S32, 2009.
[36] B. T. Wilhelm and J. R. Landry, "RNA-Seq: quantitative measurement of expression through massively parallel RNA-sequencing," Methods, vol. 48, no. 3, pp. 249–257, 2009.
[37] D. M. Witten, "Classification and clustering of sequencing data using a Poisson model," The Annals of Applied Statistics, vol. 5, no. 4, pp. 2493–2518, 2011.