Comput Stat (2011) 26:259–277. DOI 10.1007/s00180-010-0216-2
ORIGINAL PAPER
Empirical study for the agreement between statistical methods in quality assessment and control of microarray data
Markus Schmidberger · Esmeralda Vicedo · Ulrich Mansmann
Received: 10 September 2009 / Accepted: 7 September 2010 / Published online: 21 September 2010. © Springer-Verlag 2010
Abstract As microarray data quality can affect each step of the microarray analysis process, quality assessment and control is an integral part of it. It detects divergent measurements beyond the acceptable level of random fluctuations. This empirical study identifies association and correlation between the six quality assessment methods for microarray outlier detection used in the arrayQualityMetrics package version 2.2.2. For the evaluation, two different agreement tests (Cohen's Kappa, after a marginal homogeneity criterion, and the AC1 Statistic), the Pearson correlation coefficient, and realistic microarray data from the public ArrayExpress database have been used. It is possible to assess the quality of a data set using only four of the six currently proposed statistical methods to comprehensively quantify the quality information in large series of microarrays. This saves computation time and reduces decision complexity for the analyst. The newly proposed rule is validated with data sets from biomedical studies.
Keywords Empirical study · Microarray · Quality assessment and control · R · Bioconductor
1 Introduction
Quality assessment (QA) and quality control (QC) are essential parts of the microarray data analysis process Gentleman et al. (2005); Berrar et al. (2003). The term quality assessment deals with the computation and interpretation of metrics that are intended to measure quality. The term quality control is used for possible subsequent actions, such as removing data or redoing experiments. In the following we call arrays detected as critical in the quality assessment step "outliers". These are not necessarily "bad" or "defective" arrays; a decision about each array has to be taken in the quality control step.

Electronic supplementary material The online version of this article (doi:10.1007/s00180-010-0216-2) contains supplementary material, which is available to authorized users.

M. Schmidberger (B) · E. Vicedo · U. Mansmann
Division of Biometrics and Bioinformatics, IBE, University of Munich, 81377 Munich, Germany
e-mail: [email protected]
As microarray data quality can affect each step of the microarray analysis process, "bad" arrays should be identified and removed as early as possible. In general, quality assessment should be applied twice in the analysis process: once after image processing and before preprocessing (background correction, normalization and summarization) to detect arrays with a particular problem (e.g. large spatial artifacts, strong probe effects between arrays, differences in RNA hybridization methods), and once after preprocessing and before high-level analysis (e.g. differential gene expression) to check for normalization efficiency.
A typical first step in the quality assessment process is the visual examination of images of the raw probe-level data or an image of the log-scale intensities. For experiments with more than 1,000 arrays this is a time-consuming task, and there is no golden rule for assessing an array as an outlier. Furthermore, there are effects which cannot be detected with pseudo-images Gentleman et al. (2005). Therefore, automatic procedures based on different quality metrics are required to make the best use of the information produced by the arrays, to ascertain the quality of this information and to save time. These metrics do not allow a completely automated quality assessment, but they can reduce the number of arrays requiring manual inspection and therefore reduce the decision complexity for the analyst.
In the following, the arrayQualityMetrics package from the Bioconductor repository for quality assessment and the publicly available data sets used for evaluation and validation in the empirical study are described. Section 3 deals with the statistical methods used to compare the statistical graphical methods for microarray outlier detection and with methods for automatic outlier detection. In Sect. 4 the results are discussed and validated, and a new rule for outlier detection based on the result table from the arrayQualityMetrics package is presented. At the end, a critical conclusion, some ideas for improving the arrayQualityMetrics package and a summary of the agreement study are presented.
2 The arrayQualityMetrics package
In the Bioconductor repository Gentleman et al. (2004) several methods for quality assessment are implemented. For example, the affy package Gautier et al. (2004) implements tools for graphical quality assessment of microarray data. The arrayQualityMetrics package Kauffmann et al. (2009) provides a comprehensive selection of tools that works on all expression arrays and platforms and produces a self-contained quality report, which can be web-delivered. The report contains the evaluation of different categories of metrics Huber (2008).
Boxplots and density plots allow the control of the homogeneity between the experiments. The boxplots are built from the unprocessed log-scale probe-level data on each array i,
Ĩij = log2(Iij),  (1)

where Iij is the intensity value of the j-th probe on the i-th array. Each box of the boxplot corresponds to one array i. It gives a simple summary of the distribution of probe intensities across all arrays. Typically, one expects the boxes to have similar size (IQR) and position (median).
The single-array quality assessment is stated by MA-plots; the existence of spatial effects is checked by image representations of the arrays. MA-plots are considered for each array i with M and A defined as
Mig = log2(Iig) − log2(I∗g),  (2)
Aig = (1/2) · (log2(Iig) + log2(I∗g)),  (3)

where Iig is the intensity of the studied array for the g-th gene (considered as a block on the array) and I∗g is the intensity of a "pseudo"-array ∗ for the same gene, which takes the median value over all the arrays.
Scatter plots are used to assess the reproducibility of the experiments. A heatmap representing the distance between the samples allows the evaluation of the biological signal-to-noise ratio. Heatmaps are computed as the mean absolute difference of the M-values for each pair of arrays i and l,
dil = meang(|Mig − Mlg|)  (4)
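The package itself computes these quantities in R; purely as an illustration of Eqs. (2)–(4), the following Python sketch (the intensity matrix is invented) builds the M and A values against the median "pseudo"-array and the pairwise mean-absolute-M distance matrix:

```python
from math import log2
from statistics import median

def ma_and_distances(intensities):
    """intensities: list of arrays, each a list of raw probe intensities.
    Returns the M values (Eq. 2) and A values (Eq. 3) per array against
    the median pseudo-array, and the distance matrix of Eq. (4)."""
    logI = [[log2(v) for v in row] for row in intensities]
    n_genes = len(logI[0])
    pseudo = [median(row[g] for row in logI) for g in range(n_genes)]  # I_*g
    M = [[row[g] - pseudo[g] for g in range(n_genes)] for row in logI]       # Eq. (2)
    A = [[0.5 * (row[g] + pseudo[g]) for g in range(n_genes)] for row in logI]  # Eq. (3)
    # Eq. (4): d_il = mean over g of |M_ig - M_lg|
    D = [[sum(abs(mi - ml) for mi, ml in zip(Mi, Ml)) / n_genes
          for Ml in M] for Mi in M]
    return M, A, D

# toy example: three arrays, four probes; the third array is shifted
# by exactly one log2 unit everywhere
I = [[4., 8., 16., 32.],
     [4., 8., 16., 32.],
     [8., 16., 32., 64.]]
M, A, D = ma_and_distances(I)
print(D[0][2])  # distance between arrays 1 and 3: one log2 unit
```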
In the case of Affymetrix experiments, some quality controls only available for this platform are included in the report: the RNA degradation plot from the affy package, the Relative Log Expression (RLE) and the Normalized Unscaled Standard Errors (NUSE) plots from the affyPLM package Brettschneider et al. (2007), and the QC stat plot from the simpleaffy package Wilson and Miller (2005).
In addition, quality assessment can be stratified by feature, subgrid or block of the microarray. Although examination of each stratum is crucial, a comprehensive analysis strategy based on all strata is advantageous Burgoon et al. (2005). This has also been combined by the arrayQualityMetrics package to consider an array as an outlier. In version 2.2.2 of the package the automatic outlier detection is performed for the boxplot and the spatial distribution of feature intensities (both based on the analysis at the array level), the MA-plot and the heatmap used for the examination at the block level, and RLE and NUSE at the feature level.
For example, Fig. 1 shows the intensity boxplots and the heatmap of the distances between arrays for ten different arrays of the E-GEOD-7123 data set listed in the summary table of Fig. 2. Full quality reports for the different microarray data sets used in this empirical study are provided in the Appendix. More information about quality assessment and control methods and the different plots can be found in Gentleman et al. (2005) and Kauffmann et al. (2009).
Fig. 1 Example graphics from a self-contained quality report generated by the arrayQualityMetrics package for the E-GEOD-7123 data set

Fig. 2 Summary table generated by the arrayQualityMetrics package for the E-GEOD-7123 data set. The outlier arrays have been detected and are marked in the table with an asterisk

In the arrayQualityMetrics package the automatic quality assessment and outlier detection is integrated. The results are presented as a summary table, marking critical arrays with an asterisk (see Fig. 2). For the automatic outlier detection, a score S is calculated for every quality assessment method and every array. For every quality assessment method the scores S of the arrays are compared using a method based on a boxplot. Those arrays that lie beyond the extremes of the boxplot's whiskers are considered as possible outlier arrays of that quality assessment method. For the different methods the scores S are calculated as follows:
MA-plot: the mean of the absolute value of M for each array i, Si = mean(|Mi|).
Boxplot: the mean and interquartile range of the boxplot of each array i, Si = mean(Ii) and IQR(Ii).
Heatmap: the sum of the rows of the distance matrix for array i, Si = Σl dil.
Spatial intensity distribution: the relative amplitude of the low- versus high-frequency components of the periodogram of array i, Si = Σ lowFreqi / Σ highFreqi.
NUSE plot: the mean and interquartile range of the box of each array i in the NUSE plot, Si = mean(NUSE(Θi)) and IQR(NUSE(Θi)).
RLE plot: any array with a median RLE higher than 0.1 is considered as a possible outlier (no score needs to be calculated).
In the following we assume that all these scores are correct and produce a good quality assessment for microarray data.
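The whisker-based score comparison described above can be sketched in a few lines. The exact whisker definition used by the package is not spelled out in the text, so Tukey's usual Q1 − 1.5·IQR / Q3 + 1.5·IQR bounds are an assumption here (Python used for illustration; the package itself is written in R):

```python
from statistics import quantiles

def whisker_outliers(scores, k=1.5):
    """Indices of scores beyond the boxplot whiskers, assumed here
    as Q1 - k*IQR and Q3 + k*IQR (Tukey's convention)."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, s in enumerate(scores) if s < lo or s > hi]

# one score per array for a given QA method, e.g. S_i = mean(|M_i|)
scores = [0.11, 0.12, 0.10, 0.13, 0.95, 0.12, 0.11]
print(whisker_outliers(scores))  # the fifth array (index 4) stands out
```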
Based on the result table in the self-contained quality report, the arrays of a data set can be classified into two dichotomous categories: "outlier" arrays and "acceptable" quality arrays. An array marked as an outlier is not necessarily a "bad" or "defective" array; in the quality control step a decision has to be taken on how to deal with this array. Arrays with no mark can be treated as good arrays. There is no published rule stating in how many methods an array has to be marked as an outlier to really be considered an outlier. One of the two following rules is often used:
>2 rule: arrays marked in more than two methods are treated as outlier arrays.
>3 rule: arrays marked in more than three methods are treated as outlier arrays.
Such a rule has to consider how much independent information is given by the different scores calculated for an array. It is expected that methods which inspect the same stratum of microarray information show similar behaviour. If some scores are strongly correlated, it is quite likely that one outlier mark in one score produces further marks in the correlated scores. It is not useful to base the decision on all correlated scores. The decision rule should be based on scores which offer complementary information (are not strongly correlated) and should include the three different strata of microarray information (feature, block and array). Therefore, a simple >1 rule cannot be used to accept an array as an outlier. Gentleman et al. (2005) present examples where more than one rule is required to classify an array as an outlier. A decision based on manual image inspection is critical and depends on the analyst. A more detailed rule is required to reduce the number of arrays for manual quality assessment and to support a reproducible quality assessment.
2.1 Used microarray data sets
For the study, four arbitrarily chosen publicly available microarray data sets from the ArrayExpress database Parkinson et al. (2007, 2009) are used. To avoid influences from any kind of data processing and to perform quality assessment and control as the first steps in the microarray analysis process, the raw data (CEL files) had to be available for the experiments. Furthermore, all arrays in one experiment have to be of the same chip type, and there should be enough arrays in one experiment, but the number should not be too big to be processable on one machine. For an optimal boxplot representation, data sets with fewer than 300 arrays have been considered. Boxplots containing more than 300 boxes are no longer visually discernible on a monitor or in a printed PDF file Vicedo (2009). Furthermore, processing more than 500 microarrays on one machine with the R language requires a lot of main memory (>64 GB, depending on the computer architecture and the microarray chip type) Schmidberger et al. (2009).
The following four data sets are used as evaluation sets in the empirical study:
E-GEOD-1159: The data set was published on April 15, 2004 Stirewalt et al. (2008). It consists of expression profiles of 293 acute myeloid leukemia (AML) patient samples. Chip platform: Affymetrix GeneChip Human Genome U133A Array (HG-U133A)

E-GEOD-11121: The data set was published on July 01, 2008 Schmidt et al. (2008). It consists of a population-based N0 untreated breast cancer cohort study including 200 samples. Chip platform: Affymetrix GeneChip Human Genome U133A Array (HG-U133A)

E-GEOD-15396: The data set was published on March 27, 2009. Transcription profiling of 156 human peripheral blood mononuclear, DU145, and HCT116 cells treated with a CDK inhibitor. Chip platform: Affymetrix GeneChip Human Genome U133 Plus 2.0 (HG-U133 Plus 2.0)

E-GEOD-2990: The data set was published on July 11, 2008 Sotiriou et al. (2006). Transcription profiling of 189 human invasive breast carcinomas and data analysis from published studies to understand the molecular basis of histologic grade to improve prognosis. Chip platform: Affymetrix GeneChip Human Genome U133A Array (HG-U133A)
To validate the results, two further arbitrarily chosen publicly available data sets from the ArrayExpress database are used:
E-GEOD-3960: The data set was published on July 11, 2008 Wang et al. (2006). Transcription profiling of 102 human prognostic samples from neuroblastoma patients for classification of neuroblastoma by integrating gene expression patterns with regional alterations in DNA copy number. Chip platform: Affymetrix GeneChip Human Genome U95Av2 (HG-U95Av2)

E-GEOD-4475: The data set was published on July 12, 2008 Hummel et al. (2006). Transcription profiling of 221 human Burkitt's lymphomas. Chip platform: Affymetrix GeneChip Human Genome U133A Array (HG-U133A)
In none of the reference articles or accompanying appendix files of these six studies can any description of quality assessment or control methods be found, and the analysis descriptions denote the use of all published arrays. Therefore, either no quality assessment and control was conducted or defective arrays were not published. MIAME describes the 'Minimum Information About a Microarray Experiment' that is needed to enable the unambiguous interpretation of the results of an experiment and potentially to reproduce it Brazma (2009). Especially point six of MIAME calls for publication of the processing protocols. MIAME is supported by several databases (including ArrayExpress) and software tools, but so far only some experiments comply with it completely. Examining the arrays with pseudo-image plots of the raw probe intensities (see Fig. 3) shows that there are several large artifacts, which is a strong indicator for a defective array Gentleman et al. (2005). Small artifacts are usually inconsequential because, in modern array designs, the probes of each probe set are not placed contiguously next to each other but are spread throughout the array. Thus, an artifact will at most affect a few probes in a probe set and neither the entire block nor the total array.

Fig. 3 Chip pseudo-image of the raw probe intensities from two arrays in the E-GEOD-1159 (array number 8) and E-GEOD-11121 (array number 272) experiments. In both images there is a large artifact
Table 1 lists the "bad" quality arrays in the described experiments identified by manual inspection of the pseudo-image plots as having large artifacts. In this manual process all images were independently evaluated by two microarray analysts. Arrays with large artifacts and clear intensity differences were marked. Columns two and three list the outlier arrays with more than two or three marks in the quality report table from the arrayQualityMetrics package version 2.2.2. Column four lists the outlier arrays for the validation data sets which were manually detected by two independent analysts, using the six graphical quality assessment methods from the arrayQualityMetrics package (without the calculated score S). Columns five and six present the new proposed rules described in the following sections.
All used data sets are published in the ArrayExpress database, and results and analyses on these data sets are published in established journals. Therefore, we assume a low (or zero) number of bad quality arrays in these example data sets. The >2 rule identifies the most arrays and can be useful for a first quality assessment, to reduce the number of arrays for manual inspection. However, excluding all arrays identified as outlier arrays with this rule is too strict, and the number of samples in the experiment would be unnecessarily reduced. Some of the arrays with strong artifacts are recognized by this rule. This indicates that, due to the array design, visible artifacts have only a low influence on the quality of microarrays. Using the >3 rule, a notably lower number of arrays is marked as outliers, and most of the arrays detected manually by image inspection are not included. Analyzing the described assessment graphics for the detected arrays in detail, clear differences to the other arrays are found. The large number of arrays identified with the >2 rule indicates that there is probably a
Table 1 Outlier arrays in the experiments assessed with the arrayQualityMetrics package and manual image inspection

Data set       Image inspection    arrayQualityMetrics (>2 rule)      arrayQualityMetrics (>3 rule)   Manual QA   arrayQualityMetrics (>=2 combined rule)   arrayQualityMetrics (>=2 reduced rule)
E-GEOD-11121   8,65,102,184,200    8,12,22,65,95,102,111,139,170      12,102                          Na          12,102                                    12,102
E-GEOD-1159    96,166,230,272      6,96,99,120,166,178,272,287,289    6,96,289                        Na          6,96,289                                  6,96,289
E-GEOD-15396   –                   6,23,84,134                        –                               Na          131                                       –
E-GEOD-2990    27,56,60            27,61                              –                               Na          –                                         –
E-GEOD-3960    –                   46,53,76                           46,53,76                        46,53,76    46,53,76                                  46,53,76
E-GEOD-4475    112,199             19,61                              –                               –           –                                         –

'Na' indicates that the manual quality assessment is not performed for these data sets
Table 2 Contingency table of binary ratings for rater A (rows) and rater B (columns)

                  Rater B
                  −        +
Rater A   −       a        b        a + b
          +       c        d        c + d
                  a + c    b + d    total
correlation between the scores and that arrays are marked due to correlated scores. This assumption is analysed in the empirical study described in the following sections.
3 Statistical analysis
To examine a possible association among the statistical methods, the statistical methods (raters) have been pairwise combined in 2 × 2 contingency tables (see Table 2). In the following we use the word "rater" synonymously with "statistical method". The table can be used to express the relationship between two variables. In this case + or − indicates an "acceptable" or an "outlier" array, respectively. For example, the number a is the number of arrays marked as outlier arrays by both statistical methods A and B. total is the number of arrays in the used data set. As a quantitative measure of association one could use the odds ratio (OR = b/c) for matched pairs.
3.1 Cohen’s Kappa
If there is homogeneity between the marginal frequencies (see Sect. 3.3), the agreement measure Cohen's Kappa coefficient Gewet (2002b); Landis and Koch (1977) can be applied. Cohen's Kappa coefficient κ is a statistical measure of inter-rater agreement for qualitative items. It is generally a more robust measure than a simple percent-agreement calculation, since κ takes the agreement occurring by chance into account.
Cohen’s Kappa measures the non-random agreement between two raters that eachclassify N items into C mutually exclusive categories.
κ = (ρO − ρE) / (1 − ρE)  (5)

where ρO is the relative observed agreement among the raters and ρE is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each rater randomly assigning each category. In view of the notation in Table 2 this means that N = total, C = 2, ρO = (a + d)/total and ρE = ((a + b) · (a + c) + (c + d) · (b + d))/total². If the raters are in complete agreement then κ = 1. If there is no agreement among the raters (other than what would be expected by chance) then κ ≈ 0.
Landis and Koch (1977) give Table 3 for interpreting κ values. This guide is, however, by no means universally accepted. The calculation of the standard error and confidence intervals for κ is explained in detail in Fleiss et al. (2003); Altman (1991). The limitation of the (weighted) Cohen's Kappa method is that it does not extend to multiple raters and does not adjust for both chance agreement and misclassification errors.
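A minimal Python sketch of Eq. (5) in the Table 2 notation; the cell counts in the example are invented (the paper's own computations are done in R):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for the 2x2 table of Table 2
    (rows: rater A, columns: rater B)."""
    total = a + b + c + d
    p_o = (a + d) / total                                        # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / total ** 2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# two QA methods agreeing on 90 "acceptable" and 5 "outlier" arrays,
# disagreeing on 5 arrays
print(round(cohens_kappa(90, 3, 2, 5), 3))  # about 0.64
```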
Table 3 Guide to interpret the kappa values provided by Landis and Koch (1977)
Kappa Strength of agreement
<0.2 Poor
>0.2–0.4 Fair
>0.4–0.6 Moderate
>0.6–0.8 Good
>0.8–1 Very good
3.2 AC1 statistic
In an effort to overcome some of these limitations, Kilem Gewet theoretically derived two alternatives to current agreement coefficients: the first- and second-order agreement coefficients AC1 and AC2 Gewet (2002a). These coefficients adjust for chance agreement and for both chance agreement and misclassification, respectively. Both are useful tools for reliability studies; however, the computation of the statistics and their conditional and unconditional variances beyond the two-rater, two-category case is nontrivial.
The AC1 Statistic Gewet (2002a,b) method conducts a formal analysis of the agreement without consideration of the marginal differences. It is a simple inter-rater agreement statistic and is given by:
AC1 = (ρ0 − ρE) / (1 − ρE)  (6)

where ρE measures the likelihood of agreement by chance and is defined as ρE = 2ρ+(1 − ρ+), with ρ0 = (a + d)/total and ρ+ = ((c + d) + (b + d))/(2 · total).
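The AC1 Statistic differs from Cohen's Kappa only in the chance-agreement term; a Python sketch of Eq. (6) under the same Table 2 notation (example cell counts invented):

```python
def ac1(a, b, c, d):
    """Gwet's AC1 for the 2x2 table of Table 2 (Eq. 6)."""
    total = a + b + c + d
    p0 = (a + d) / total                        # observed agreement
    p_plus = ((c + d) + (b + d)) / (2 * total)  # mean prevalence of "+"
    pe = 2 * p_plus * (1 - p_plus)              # chance agreement
    return (p0 - pe) / (1 - pe)

# with rare "outlier" marks the chance-agreement term is small,
# so AC1 comes out higher than kappa on the same table
print(round(ac1(90, 3, 2, 5), 3))  # about 0.942
```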
3.3 McNemar test
The McNemar test is used to assess the marginal homogeneity (OR = 1) of each rating category of the 2 × 2 contingency table Altman (1991); McNemar (1947). A significant result implies that the marginal frequencies are not homogeneous; in this case the Cohen's Kappa coefficient may not be calculated, but the AC1 Statistic may. If the McNemar test is not significant (homogeneity between the marginal frequencies), the Cohen's Kappa coefficient is used to find out whether there is any kind of agreement among the analysed graphical methods.
Marginal homogeneity implies that the row totals are equal to the corresponding column totals. In view of the notation in Table 2 this means:
(a + b) = (a + c) (7)
(c + d) = (b + d) (8)
The McNemar statistic is calculated as

χ² = (b − c)² / (b + c)  (9)
and the value χ² can be viewed as a chi-squared statistic with one degree of freedom. Altman (1991) recommends an improved version of the McNemar test with a correction for discontinuity, calculated as

χ² = (|b − c| − 1)² / (b + c).  (10)
Statistical significance is determined by evaluating the probability of χ² with reference to the values of a χ² distribution with one degree of freedom and the corresponding confidence intervals. A significant result implies that the marginal frequencies (or proportions) are not homogeneous.
If the values b and/or c are small, the McNemar test, especially the uncorrected version, is not well approximated by the chi-squared distribution. For (b + c) < 20, a two-tailed exact test based on the cumulative binomial distribution with p = q = 0.5 is used instead.
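A Python sketch of the test as described above: the continuity-corrected statistic of Eq. (10) for (b + c) ≥ 20 and a two-tailed exact binomial test with p = 0.5 otherwise. The cutoff and correction follow the text; the function interface is illustrative:

```python
from math import comb

def mcnemar(b, c):
    """McNemar test for marginal homogeneity in the Table 2 notation.
    Returns ("exact", p-value) for b + c < 20, following the text,
    and ("chi2", statistic) with the continuity correction of
    Eq. (10) otherwise."""
    n = b + c
    if n < 20:
        k = min(b, c)
        # two-tailed exact binomial test with p = q = 0.5
        p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return ("exact", min(p, 1.0))
    chi2 = (abs(b - c) - 1) ** 2 / n
    return ("chi2", chi2)

print(mcnemar(15, 10))  # large sample: corrected chi-squared statistic
print(mcnemar(8, 2))    # small sample: exact binomial p-value
```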
3.4 Pearson correlation coefficient
In addition to the two described tests, the Pearson product-moment correlation coefficient Fleiss et al. (2003) is calculated for all pairs of scores derived from two of the statistical methods, and the correlation of the interesting pairs is visualized with scatter plots. The Pearson correlation coefficient is a measure of the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive. It is widely used in science as a measure of the strength of linear dependence between two variables and is obtained by dividing the covariance of the two variables by the product of their standard deviations:
ρX,Y = cov(X, Y) / (σX σY)  (11)

where σX and σY are the standard deviations of X and Y.
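Eq. (11) in a few lines of Python (the score vectors in the example are invented):

```python
def pearson(x, y):
    """Pearson product-moment correlation coefficient (Eq. 11)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = (sum((xi - mx) ** 2 for xi in x) / n) ** 0.5
    sy = (sum((yi - my) ** 2 for yi in y) / n) ** 0.5
    return cov / (sx * sy)

# scores of two QA methods over five arrays; a perfect linear
# relationship yields +1 (up to floating point)
print(round(pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 6))  # 1.0
```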
4 Results
The arrayQualityMetrics package is applied to the six data sets and the self-contained HTML quality reports are generated. With 102 to 293 arrays per experiment, each of the loaded AffyBatch objects requires 10 to 50 GB of main memory. The calculations using these objects, e.g. generating the plots, require additional main memory (up to 75 GB). Furthermore, the process of creating an AffyBatch object and running the main arrayQualityMetrics() function, which creates the quality report, takes up to 8 hours on an Intel Xeon processor with 2.93 GHz.
The summary table in the HTML report is imported and manipulated in R, the arrays are ordered and coded into two categories ("outlier" and "accepted" arrays) and used to create 2 × 2 contingency tables of all possible rater combinations without repetition (6 raters yield 15 possible 2 × 2 tables). These tables are tested for
Table 4 Summary of the agreement in the four data sets and the 6 methods used in the arrayQualityMetrics package
E-GEOD-1159 E-GEOD-11121 E-GEOD-15396 E-GEOD-2990
MA-plot/Spatial_dis Poor Poor Poor Poor
MA-plot/Boxplot – Good Good –
MA-plot/heatmap Very good Good Very good Moderate
MA-plot/RLE Poor Poor Poor –
MA-plot/NUSE Poor Poor Poor Poor
Spatial_dis/Boxplot Poor Poor Poor Poor
Spatial_dis/heatmap Poor Poor Poor Poor
Spatial_dis/RLE Fair – Poor –
Spatial_dis/NUSE Fair Poor Poor Poor
Boxplot/heatmap – Good Good Moderate
Boxplot/RLE Poor Poor Poor –
Boxplot/NUSE – Poor Poor Poor
Heatmap/RLE Poor – Poor –
Heatmap/NUSE Poor Poor Poor Poor
RLE/NUSE Moderate Poor Poor Poor
Classification adapted from the Landis and Koch (1977) guidelines. See Table 3 for interpretation of the κ values
marginal homogeneity using the McNemar test with and without correction. Only the tables that pass the homogeneity test (the non-significant two-rater combinations) are used to calculate the Cohen's Kappa coefficient, while all tables are used for the AC1 Statistic. For this analysis procedure a new function, called kappaAQM() (available in the Appendix), is implemented and applied to the HTML reports from the arrayQualityMetrics package version 2.2.2.
The McNemar homogeneity test rejects the null hypothesis for up to five (in the E-GEOD-2990 data set) rater combinations. For about 50% of the pairs there is a small sample size, (b + c) < 20, and a binomial test is performed.
For an overview of the final results of the agreement test with Cohen's Kappa, Table 4 presents the classification of the statistical method pairs by the guidelines of Landis and Koch for all four data sets. The character "–" indicates that for this pair the McNemar homogeneity test rejected the null hypothesis and Cohen's Kappa could not be applied. The table demonstrates that the agreement between the MA-plot/Boxplot methods is "good" for two data sets. Between the MA-plot/heatmap methods the agreement is classified from "moderate" to "very good" for four data sets. The Boxplot/heatmap methods have an agreement from "good" to "moderate" in three data sets. The other method combinations present values that should be considered as not relevant for this agreement study between the methods used in the arrayQualityMetrics package version 2.2.2.
Figure 4 compares the obtained values of the two agreement tests. The AC1 Statistic obtains higher values than Cohen's Kappa. Formally, Cohen's Kappa allows both raters to use two different rating frameworks. This implies for the binary rating
Fig. 4 Plot with values of the AC1 Statistic and Cohen's Kappa agreement for the data sets E-GEOD-1159, E-GEOD-11121, E-GEOD-15396, E-GEOD-2990
that category one may be based on different frequencies between both raters and that for one rater the category "outlier" may mean something different than for rater two. The AC1 coefficient assumes that both raters use the same rating framework; thus, the category outlier is assumed to have equal incidence for both raters. The advantage of the AC1 method is that all method pairs are analysed, not only those accepted by the marginal homogeneity test. The Cohen's Kappa coefficients show more variance than the AC1 Statistic coefficients, and the Cohen's Kappa coefficients show a very low (or zero) agreement for several method pairs. As demonstrated in Gewet (2002b), the Cohen's Kappa coefficients tend to yield a low agreement. But a similar trend for the agreement between the statistical graphical methods can be found in both tests.
The following method pairs have the highest mean values of the AC1 Statistic: MA-plot/Boxplot (0.985), MA-plot/heatmap (0.983), Boxplot/heatmap (0.97), spatial intensity distribution/RLE (0.947), RLE/NUSE (0.940).
Examining the Pearson correlation coefficients and the scatter plots for the mentioned statistical graphical methods, the agreement between the listed five pairs of methods can be confirmed. Table 5 shows the median Pearson correlation coefficients for the four data sets. Especially for MA-plot/heatmap there is a strong correlation. Due to the symmetry of the coefficient the matrix is an upper triangular matrix, and the diagonal elements are equal to one, because the correlation coefficient of identical variables is one.
Figure 5 shows the scatter plots of the scores for the automatic outlier detection in the arrayQualityMetrics package. Only the interesting pairs are plotted, for the E-GEOD-11121 data set. For the other three data sets the scatter plots look very similar; for the other pairs of methods there is less correlation observable. In the upper left
Table 5 Median values of the Pearson correlation coefficient over all four data sets
Boxplot MA-plot NUSE RLE Heatmap Spatial_dis
Boxplot 1 0.11 −0.11 0.03 0.14 0.18
MA-plot 1 0.55 0.03 0.99 0.04
NUSE 1 0.24 0.56 0.25
RLE 1 0.04 0.37
Heatmap 1 0.04
Spatial_dis 1
figure the differential gene expression is calculated relative to the mean array of the considered sample. The mean of the absolute M values has to be positive and is influenced by extreme values of the differential gene expression. The figure demonstrates that arrays with a low or high interquartile range are associated with extreme means, which are affected by extreme differential expression values. In the upper right figure a distance matrix is created which measures the distance between each pair of arrays in a sample. For each array the mean distance to all other arrays is calculated and plotted on the y-axis. This number is not equal but similar to the distance of the array to a mean array. A situation close to that of the upper left figure is created this way.
After the analysis of the Cohen's Kappa and AC1 Statistic coefficients and the direct visualization of the correlation, the assumption of some agreement and correlation between the six quality assessment methods can be confirmed. There is a strong correlation between MA-plot/heatmap and a strong agreement between the statistical graphical methods MA-plot/boxplot and boxplot/heatmap. Between the other methods there is less agreement and correlation. This association among the three pairs of methods can be explained by the score S they use to detect the outlier arrays and by the level of the microarray data they use. The MA-plot and heatmap methods are based on the block stratum of the array data and both use the M metric as the basis for calculating the scores S (see Sect. 2). The boxplot is based on the intensities of the entire array (median and IQR of the intensities of each array), while MA-plot and heatmap both use the M value. If the boxplot detects an array as outlier, the array has too high/low intensities or the intensity values are dispersed (possibly due to experimental errors during the process, or because the genes on this array are more expressed than on the other arrays). For the MA-plot and heatmap methods the M metric is calculated as M_i = log2(I_ij) − log2(I_lj). If enough genes/probe sets on the array are affected, the M value for this array will be higher/lower than for the other arrays of the experiment. Hence, it will also be detected as outlier by both methods, MA-plot and heatmap.
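For binary outlier marks from two methods, the two agreement coefficients used in this study can be sketched as follows. This is a minimal two-rater, two-category version (the standard formulas for Cohen's kappa and Gwet's AC1); the function name is ours.

```python
import numpy as np

def agreement_coefficients(x, y):
    """Cohen's kappa and Gwet's AC1 for two binary raters,
    e.g. the outlier marks (True/False) of two QA methods."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    po = np.mean(x == y)                       # observed agreement
    p1, p2 = x.mean(), y.mean()                # marginal "outlier" rates
    pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)   # chance agreement (kappa)
    pi = (p1 + p2) / 2
    pe_ac1 = 2 * pi * (1 - pi)                 # chance agreement (AC1)
    kappa = (po - pe_kappa) / (1 - pe_kappa)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1
```

Unlike kappa, AC1 stays stable when the trait (here: being an outlier) is rare, which is the usual situation in quality assessment.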
Therefore, the quality assessment rule for the quality report table generated by the arrayQualityMetrics package can be improved in the following way:
>= 2 combined rule: If there is one outlier array mark in one of the three methods MA-plot, heatmap and boxplot, and one outlier array mark in one of the other three methods, then the array should be considered an outlier array. In this rule, the methods are divided into two groups: group 1 combines the methods with a strong correlation and agreement (MA-plot, heatmap and boxplot)
Fig. 5 Scatter plots of the scores for the automatic outlier detection in the arrayQualityMetrics package for the E-GEOD-11121 data set. "Outlier" arrays are marked in red. (Pairwise score correlations shown in the panels: Boxplot/MA-plot −0.06, Boxplot/heatmap −0.03, MA-plot/heatmap 0.99, NUSE/RLE 0.12, spatial distribution/RLE 0.65.)
and group 2 the rest of the methods (spatial distribution, NUSE, RLE). When an array is detected as outlier in group 1 and in group 2, it should be considered an outlier and automatically excluded.
Furthermore, due to the strong agreement between MA-plot/boxplot and MA-plot/heatmap, the methods heatmap and boxplot can be excluded from the quality assessment. The analysis of the microarray by strata is also considered in this rule. Hence, the following quality assessment rule should be used:
>= 2 reduced rule: If an array is marked as an outlier array by the MA-plot method and additionally marked in at least one of the other three methods (NUSE, RLE, spatial distribution of feature intensities), then the array should be considered an outlier array.
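The two rules can be written down as simple predicates over the per-array outlier marks of the six methods. The dictionary keys below are illustrative names, not the package's actual column names:

```python
def combined_rule(marks):
    """'>= 2 combined' rule: outlier iff marked by at least one method of
    group 1 (MA-plot, heatmap, boxplot) and at least one method of
    group 2 (spatial distribution, NUSE, RLE)."""
    group1 = marks["maplot"] or marks["heatmap"] or marks["boxplot"]
    group2 = marks["spatial"] or marks["nuse"] or marks["rle"]
    return group1 and group2

def reduced_rule(marks):
    """'>= 2 reduced' rule: outlier iff marked by the MA-plot method and
    by at least one of NUSE, RLE, spatial distribution."""
    return marks["maplot"] and (marks["nuse"] or marks["rle"] or marks["spatial"])
```

The two rules can only disagree for an array that is marked in group 2 and in heatmap or boxplot but not in the MA-plot; this is the kind of case observed for one array in E-GEOD-15396.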
Table 1 indicates that both newly described rules (>= 2 combined rule and >= 2 reduced rule) detect the same arrays as outliers, except for the data set E-GEOD-15396. In this experiment the >= 2 combined rule detects array 131 as outlier, but the >= 2 reduced rule does not. In this case, the >= 2 reduced rule agrees with the >3 rule.
4.1 Validation
The newly presented rules are validated with the two proposed validation data sets E-GEOD-3960 and E-GEOD-4475. For these two data sets a manual quality assessment was independently performed by two experts in microarray analysis. For their decisions they used all metrics generated by the arrayQualityMetrics package (without the scores S). In addition, the new rules are tested with the four learning data sets, too. Table 1 lists the outlier arrays in the described experiments identified by manual inspection of the pseudo-image plots, by the arrayQualityMetrics package version 2.2.2, by the complete manual quality assessment for the two validation data sets, and by the arrayQualityMetrics package with the two new rules.
All used data sets are published in the ArrayExpress database, and results and analyses on these data sets have been published in established journals. Therefore, we assume a low number of bad-quality arrays in these examples. But, as demonstrated in the introduction, there are still several outlier arrays present. The >2 rule identifies many arrays and can be useful for a first quality assessment, to reduce the number of arrays for manual inspection. However, excluding all arrays identified as outlier arrays by this rule is too strict; the >3 rule seems more useful for automatically detecting the outlier arrays (see Sect. 1). In the two validation data sets E-GEOD-3960 and E-GEOD-4475 (last two rows in Table 1) both newly proposed rules identify the same arrays as the >3 rule, and these arrays match the results of the complete manual quality assessment (Table 1, column four). Except for one array in the E-GEOD-15396 data set, the newly developed rules also match in the evaluation data sets. The >= 2 reduced rule finds exactly the same arrays, uses only four of the six proposed statistical graphical methods and includes the three strata of the microarray data (see Sect. 2).
5 Discussion and conclusions
In microarray experiments the arrayQualityMetrics package is very useful for quality assessment and control based on the mentioned statistical graphical methods. It works on all expression arrays and platforms and produces a self-contained quality report, which can be web-delivered. In addition, an automatic quality assessment and outlier detection is integrated and the results are represented in a summary table.
Unfortunately the methods require a lot of main memory and have long computation times (more than 8 h for 200 arrays). Two ways to solve these problems are the use of parallel computing, or reducing the number of statistical graphical methods to only the most informative measures.
5.1 Parallel computing
In parallel computing many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel").
The affyPara package (Schmidberger et al. 2009; Schmidberger and Mansmann 2008) implements parallel algorithms for the preprocessing of microarray data.
Parallelization of existing preprocessing methods produces, up to machine accuracy, the same results as the serial methods. The partition and distribution of data to several nodes solves the main memory problems and accelerates the methods by a factor of up to 15 for 300 arrays or more. Additionally, in the bachelor thesis of Esmeralda Vicedo (2009) parallel quality assessment methods were implemented in the affyPara package. The graphical visualization for more than 300 arrays becomes, independent of the used graphical method, very complex and unreadable. Therefore, the parallelized methods were extended to plot only the interesting arrays (outlier and reference arrays).
Due to the latest developments in computer chip production, multi-core machines are available to most users. In the arrayQualityMetrics() function up to 13 independent tasks are executed serially. The new R package multicore (Urbanek 2009) can be used to accelerate the quality assessment by the number of available processors. This does not solve the mentioned main memory problems, but due to the used fork technology (Stevens 1992) and read-only operations on the AffyBatch object, no additional main memory is required.
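The idea of running the independent quality-assessment tasks concurrently instead of serially can be sketched in Python. The paper's actual tooling is R's multicore package, which forks processes; the thread pool, the task function and the worker count below are illustrative assumptions for a self-contained example.

```python
from concurrent.futures import ThreadPoolExecutor

def qa_task(task_id):
    """Placeholder for one of the independent quality-assessment tasks;
    here it just performs a dummy computation."""
    return task_id, sum(i * i for i in range(1000))

def run_tasks(n_tasks=13, n_workers=4):
    """Execute the independent tasks concurrently instead of serially."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(qa_task, range(n_tasks)))
```

Because the tasks only read the shared input data, they can run side by side without coordination, which is exactly why fork with read-only access to the AffyBatch object needs no additional main memory.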
5.2 Method reduction
The presented empirical study for the agreement between statistical methods in quality assessment and control of microarray data with the arrayQualityMetrics package shows that there is some association and correlation between three of the six currently proposed statistical methods to comprehensively quantify the quality information in large series of microarrays. Therefore, the number of quality assessment methods can be reduced to four, and computation time saved.
There is a strong correlation between MA-plot/heatmap and a strong agreement between the statistical graphical methods MA-plot/boxplot and boxplot/heatmap. These three methods measure very similar quantities and it is enough to use only one of them. Therefore, a useful rule for the automatic detection of outliers based on the quality report table of the arrayQualityMetrics package can be defined:
If an array is marked as outlier array from the MA-plot method and at leastmarked in one of the other three methods (NUSE, RLE, spatial distribution offeature intensities) then the array should be considered as outlier array.
This decision rule is based on scores which offer complementary information (they are not strongly correlated), and only four of the six measures have to be calculated. The validation in Sect. 4 demonstrates that the new decision rule identifies the same arrays as outliers as the original rule in the arrayQualityMetrics package, and the three different strata of the microarray data are also considered in the new rule. This rule is already applied in the latest development version of the arrayQualityMetrics package (version 2.5.10). The summary table is now reduced to three columns.
As long as the number of (outlier) arrays is small, the authors advise checking all marked arrays manually. If the number of arrays is too large, the newly proposed >= 2 reduced rule can be used for automatic array exclusion.
Acknowledgments The work is supported by the LMUinnovative collaborative centre 'Analysis and Modelling of Complex Systems in Biology and Medicine'. Special thanks go to the working group "Statistical Computing" (http://www.statistical-computing.de/) and the organization committee of the workshop "Statistical Computing" 2009 in Günzburg (Germany).
References
Altman DG (1991) Practical statistics for medical research. Chapman & Hall, Boca Raton
Berrar DP, Dubitzky W, Granzow M (eds) (2003) A practical approach to microarray data analysis. Kluwer Academic Publishers Group, London
Brazma A (2009) Minimum information about a microarray experiment (MIAME)–successes, failures, challenges. Scientific World J 9:420–423
Brettschneider J, Collin F, Bolstad BM, Speed TP (2007) Quality assessment for short oligonucleotide arrays
Burgoon LD, Eckel-Passow JE, Gennings C, Boverhof DR, Burt JW, Fong CJ, Zacharewski TR (2005) Protocols for the assurance of microarray data quality and process control. Nucleic Acids Res 33:1–11
Fleiss JL, Levin BA, Levin B, Paik MC (2003) Statistical methods for rates and proportions. Wiley-Interscience, New York
Gautier L, Cope L, Bolstad BM, Irizarry RA (2004) affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20(3):307–315
Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S (2005) Bioinformatics and computational biology solutions using R and Bioconductor, 1st edn. Springer, Berlin
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
Gwet K (2002) Handbook of inter-rater reliability. Technical report, STATAXIS Publishing Company
Gwet K (2002) Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Stat Methods Inter Rater Reliab Assess 2:1–9
Huber W (2008) Sixth framework programme for quality of life and management of living resources. Technical report, Microarray and Gene Expression Data Society, EMERALD Workshop
Hummel M, Bentink S, Berger H, Klapper W, Wessendorf S, Barth TFE, Bernd H-W, Cogliatti SB, Dierlamm J, Feller AC, Hansmann M-L, Haralambieva E, Harder L, Hasenclever D, Kühn M, Lenze D, Lichter P, Martin-Subero JI, Möller P, Müller-Hermelink H-K, Ott G, Parwaresch RM, Pott C, Rosenwald A, Rosolowski M, Schwaenen C, Stürzenhofecker B, Szczepanowski M, Trautmann H, Wacker H-H, Spang R, Loeffler M, Trümper L, Stein H, Siebert R, for the Molecular Mechanisms in Malignant Lymphomas Network Project of the Deutsche Krebshilfe (2006) A biologic definition of Burkitt's lymphoma from transcriptional and genomic profiling. N Engl J Med 354(23):2419–2430
Kauffmann A, Gentleman R, Huber W (2009) arrayQualityMetrics–a Bioconductor package for quality assessment of microarray data. Bioinformatics 25(3):415–416
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174
McNemar Q (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12:153–157
Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, Holloway E, Lukk M, Malone J, Mani R, Pilicheva E, Rayner TF, Rezwan F, Sharma A, Williams E, Bradley XZ, Adamusiak T, Brandizi M, Burdett T, Coulson R, Krestyaninova M, Kurnosov P, Maguire E, Neogi SG, Rocca-Serra P, Sansone S-A, Sklyar N, Zhao M, Sarkans U, Brazma A (2009) ArrayExpress update–from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37(Database issue):D868–D872
Parkinson H, Kapushesky M, Shojatalab M, Abeygunawardena N, Coulson R, Farne A, Holloway E, Kolesnykov N, Lilja P, Lukk M, Mani R, Rayner T, Sharma A, William E, Sarkans U, Brazma A (2007) ArrayExpress–a public database of microarray experiments and gene expression profiles. Nucleic Acids Res 35(Database issue):D747–D750
Schmidt M, Böhm D, von Törne C, Steiner E, Puhl A, Pilch H, Lehr H-A, Hengstler JG, Kölbl H, Gehrmann M (2008) The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 68(13):5405–5413
Schmidberger M, Mansmann U (2008) Parallelized preprocessing algorithms for high-density oligonucleotide arrays. In: Proceedings of the IEEE international symposium on parallel and distributed processing (IPDPS), 14–18 April 2008, pp 1–7
Schmidberger M, Vicedo E, Mansmann U (2009) affyPara–a Bioconductor package for parallelized preprocessing algorithms of Affymetrix microarray data. Bioinform Biol Insights 3:83–87
Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Marc B, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M (2006) Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 98(4):262–272
Stevens WR (1992) Advanced programming in the UNIX environment. Addison-Wesley, Upper Saddle River, NJ
Stirewalt DL, Meshinchi S, Kopecky KJ, Fan W, Pogosova-Agadjanyan EL, Engel JH, Cronk MR, Dorcy KS, McQuary AR, Hockenbery D, Wood B, Heimfeld S, Radich JP (2008) Identification of genes with abnormal expression changes in acute myeloid leukemia. Genes Chromosomes Cancer 47(1):8–20
Urbanek S (2009) multicore: parallel processing of R code on machines with multiple cores or CPUs. R package version 0.1-3
Vicedo E (2009) Quality assessment of huge numbers of Affymetrix microarray data
Wang Q, Diskin S, Rappaport E, Attiyeh E, Mosse Y, Shue D, Seiser E, Jagannathan J, Shusterman S, Bansal M, Khazi D, Winter C, Okawa E, Grant G, Cnaan A, Zhao H, Cheung N-K, Gerald W, London W, Matthay KK, Brodeur GM, Maris JM (2006) Integrative genomics identifies distinct molecular classes of neuroblastoma and shows that multiple genes are targeted by regional alterations in DNA copy number. Cancer Res 66(12):6050–6062
Wilson CL, Miller CJ (2005) simpleaffy: a Bioconductor package for Affymetrix quality control and data analysis. Bioinformatics 21(18):3683–3685