
Comput Stat (2011) 26:259–277
DOI 10.1007/s00180-010-0216-2

ORIGINAL PAPER

Empirical study for the agreement between statistical methods in quality assessment and control of microarray data

Markus Schmidberger · Esmeralda Vicedo · Ulrich Mansmann

Received: 10 September 2009 / Accepted: 7 September 2010 / Published online: 21 September 2010
© Springer-Verlag 2010

Abstract As microarray data quality can affect each step of the microarray analysis process, quality assessment and control is an integral part of that process. It detects divergent measurements beyond the acceptable level of random fluctuations. This empirical study identifies association and correlation between the six quality assessment methods for microarray outlier detection used in the arrayQualityMetrics package version 2.2.2. For the evaluation, two different agreement tests (Cohen's Kappa, applied after a marginal homogeneity criterion, and the AC1 statistic), the Pearson correlation coefficient and realistic microarray data from the public ArrayExpress database have been used. It is possible to assess the quality of a data set using only four of the six currently proposed statistical methods and still comprehensively quantify the quality information in large series of microarrays. This saves computation time and reduces the decision complexity for the analyst. The newly proposed rule is validated with data sets from biomedical studies.

Keywords Empirical study · Microarray · Quality assessment and control · R · Bioconductor

1 Introduction

Quality assessment (QA) and quality control (QC) are essential parts of the microarray data analysis process Gentleman et al. (2005); Berrar et al. (2003). The term quality assessment deals with the computation and interpretation of metrics that are intended to measure quality.

Electronic supplementary material The online version of this article (doi:10.1007/s00180-010-0216-2) contains supplementary material, which is available to authorized users.

M. Schmidberger (B) · E. Vicedo · U. Mansmann
Division of Biometrics and Bioinformatics, IBE, University of Munich, 81377 Munich, Germany
e-mail: [email protected]


The term quality control is used for possible subsequent actions, such as removing data or redoing experiments. In the following we call arrays detected as critical in the quality assessment step “outliers”. These are not necessarily “bad” or “defective” arrays; a decision about each such array has to be taken in the quality control step.

As microarray data quality can affect each step of the microarray analysis process, “bad” arrays should be identified and removed as early as possible in the process. In general, quality assessment should be applied twice in the analysis process: once after image processing and before preprocessing (background correction, normalization and summarization) to detect arrays with a particular problem (e.g. large spatial artifacts, strong probe effects between arrays, differences in RNA hybridization methods), and once after preprocessing and before high-level analysis (e.g. differential gene expression) to check for normalization efficiency.

A typical first step in the quality assessment process is the visual examination of images of the raw probe-level data or an image of the log-scale intensities. For experiments with more than 1,000 arrays this is a time-consuming task and there is no golden rule for assessing an array as an outlier. Furthermore, there are effects which cannot be detected with pseudo-images Gentleman et al. (2005). Therefore, automatic procedures based on different quality metrics are required to make the best use of the information produced by the arrays, to ascertain the quality of this information and to save time. These metrics do not allow a completely automated quality assessment, but they can reduce the number of arrays requiring inspection and therefore reduce the decision complexity for the analyst.

In the following, the arrayQualityMetrics package from the Bioconductor repository for quality assessment and the publicly available data sets used for evaluation and for validation in the empirical study are described. Section 3 deals with the statistical methods used to compare the statistical graphical methods for microarray outlier detection and with methods for automatic outlier detection. In Sect. 4 the results are discussed and validated. A new rule for outlier detection based on the result table from the arrayQualityMetrics package is presented. At the end, a critical conclusion, some ideas for improving the arrayQualityMetrics package and a summary of the agreement study are presented.

2 The arrayQualityMetrics package

In the Bioconductor repository Gentleman et al. (2004) several methods for quality assessment are implemented. For example, the affy package Gautier et al. (2004) implements tools for graphical quality assessment of microarray data. The arrayQualityMetrics package Kauffmann et al. (2009) provides a comprehensive selection of tools that works on all expression arrays and platforms and produces a self-contained quality report, which can be web-delivered. The report contains the evaluation of different categories of metrics Huber (2008).

Boxplots and density plots allow the control of the homogeneity between the experiments. The boxplots are built from the unprocessed log-scale probe-level data on each array i,

123

Agreement study for quality assessment 261

$$I_i = \log_2(I_{ij}), \qquad (1)$$

where $I_{ij}$ is the intensity value of the j-th probe on the i-th array. Each box of the boxplot corresponds to one array i. It gives a simple summary of the distribution of probe intensities across all arrays. Typically, one expects the boxes to have similar size (IQR) and position (median).

The single array quality is assessed by MA-plots; the existence of spatial effects is checked by image representations of the arrays. An MA-plot is considered for each array i, with M and A defined as

$$M_{ig} = \log_2(I_{ig}) - \log_2(I_{*g}), \qquad (2)$$
$$A_{ig} = \tfrac{1}{2}\bigl(\log_2(I_{ig}) + \log_2(I_{*g})\bigr), \qquad (3)$$

where $I_{ig}$ is the intensity of the studied array on the g-th gene (considered as a block in the array) and $I_{*g}$ is the intensity of a “pseudo”-array $*$ on the same gene, which has the median values of all the arrays.

Scatter plots are used to assess the reproducibility of the experiments. A heatmap representing the distance between the samples allows the evaluation of the biological signal-to-noise ratio. Heatmaps are computed from the mean absolute difference of the M-values for each pair of arrays i and l,

$$d_{il} = \operatorname{mean}_g\bigl(\bigl|M_{ig} - M_{lg}\bigr|\bigr). \qquad (4)$$
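To make Eqs. (1)–(4) concrete, the following R sketch computes the log-scale intensities, the M and A values against the median pseudo-array, and the pairwise mean absolute M differences. The simulated intensity matrix and all variable names are assumptions for illustration, not code from the package:

```r
set.seed(1)
I <- matrix(2^rnorm(5 * 200, mean = 8), nrow = 5)        # 5 arrays x 200 probes
logI   <- log2(I)                                        # Eq. (1)
pseudo <- apply(logI, 2, median)                         # median "pseudo"-array
M <- sweep(logI, 2, pseudo, FUN = "-")                   # Eq. (2)
A <- sweep(logI, 2, pseudo, FUN = "+") / 2               # Eq. (3)
d <- as.matrix(dist(M, method = "manhattan")) / ncol(M)  # Eq. (4): mean |M_i - M_l|
```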

In the case of Affymetrix experiments, some quality controls only available for this platform are included in the report: the RNA degradation plot from the affy package, the Relative Log Expression (RLE) and the Normalized Unscaled Standard Error (NUSE) plots from the affyPLM package Brettschneider et al. (2007), and the QC stat plot from the simpleaffy package Wilson and Miller (2005).

In addition, quality assessment can be stratified to the feature, subgrid or block level of the microarray. Although examination of each stratum is crucial, a comprehensive analysis strategy based on all strata is advantageous Burgoon et al. (2005). This has also been adopted by the arrayQualityMetrics package to consider an array as an outlier. In version 2.2.2 of the package the automatic outlier detection is performed for the boxplot and the spatial distribution of feature intensities (both based on the analysis at array level), the MA-plot and heatmap used for the examination at block level, and RLE and NUSE at feature level.

For example, Fig. 1 shows the intensity boxplots and the heatmap of the distances between arrays for ten different arrays of the E-GEOD-7123 data set listed in the summary table of Fig. 2. Full quality reports for the different microarray data sets used in this empirical study are provided in the Appendix. More information about quality assessment and control methods and the different plots can be found in Gentleman et al. (2005) and Kauffmann et al. (2009).

The automatic quality assessment and outlier detection is integrated in the arrayQualityMetrics package. The results are represented as a summary table, marking critical arrays with an asterisk (see Fig. 2). For the automatic outlier detection, for every


Fig. 1 Example graphics from a self-contained quality report generated by the arrayQualityMetrics package for the E-GEOD-7123 data set

Fig. 2 Summary table generated by the arrayQualityMetrics package for the E-GEOD-7123 data set. The outlier arrays have been detected and are marked in the table with an asterisk

quality assessment method and every array a score S is calculated. For every quality assessment method the scores S of the arrays are compared using a method based on a boxplot. Those arrays that lie beyond the extremes of the boxplot's whiskers are considered as possible outlier arrays of the quality assessment method. For the different methods the scores S are calculated as follows:

MA-plot: the mean of the absolute value of M for each array i, $S_i = \operatorname{mean}(|M_i|)$.
Boxplot: the mean and interquartile range of the boxplot of each array i, $S_i = \operatorname{mean}(I_i)$ and $IQR(I_i)$.
Heatmap: the sum of the rows of the distance matrix for an array i, $S_i = \sum_l d_{il}$.
Spatial intensity distribution: the relative amplitude of low- versus high-frequency components of the periodogram in the array i, $S_i = \frac{\sum \text{lowFreq}_i}{\sum \text{highFreq}_i}$.


NUSE plot: the mean and interquartile range of the box of each array i in the NUSE plot, $S_i = \operatorname{mean}(NUSE(\Theta_i))$ and $IQR(NUSE(\Theta_i))$.
RLE plot: any array with a median RLE higher than 0.1 is considered as a possible outlier (no score needs to be calculated).

In the following we assume that all these scores are correct and produce a good quality assessment for microarray data.
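As a minimal sketch of the whisker-based flagging described above (using simulated scores, not the package's internal implementation), one can flag exactly the score values that R's boxplot statistics report as lying beyond the whiskers:

```r
set.seed(1)
S <- c(rnorm(48, mean = 5), 9.5, 0.2)    # scores for 50 arrays, two extreme ones
flagged <- S %in% boxplot.stats(S)$out   # values beyond the boxplot whiskers
which(flagged)                           # indices of possible outlier arrays
```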

Based on the result table in the self-contained quality report, the arrays of a data set can be classified into two dichotomous categories: “outlier” arrays and “acceptable” quality arrays. An array marked as an outlier is not necessarily a “bad” or “defective” array. In the quality control step a decision has to be taken on how to deal with this array. Arrays with no mark can be treated as good arrays. There is no published rule on how many methods an array has to be marked in to really be considered an outlier. One of the two following rules is often used:

>2 rule: arrays marked in more than two methods are treated as outlier arrays.
>3 rule: arrays marked in more than three methods are treated as outlier arrays.

Such a rule has to consider how much independent information is given by the different scores calculated for an array. It is expected that methods which inspect the same stratum of microarray information show a similar correlation. If some scores are strongly correlated, it is quite likely that one outlier mark in one score produces further marks in the correlated scores. It is not useful to base the decision on all correlated scores. The decision rule should be based on scores which offer complementary information (are not strongly correlated) and should include the three different strata of microarray information (features, block and array). Therefore, a simple >1 rule cannot be used to accept an array as an outlier. Gentleman et al. (2005) present examples where more than one rule is required to classify an array as an outlier. A decision based on manual image inspection is critical and depends on the analyst. A more detailed rule is required to reduce the number of arrays for manual quality assessment and to support a reproducible quality assessment.
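A small sketch of the two ad-hoc rules, assuming a hypothetical logical matrix marks that mirrors the asterisks of the summary table (one row per array, one column per method):

```r
marks <- matrix(c(TRUE,  TRUE,  TRUE,  FALSE, TRUE,  FALSE,   # array1: 4 marks
                  FALSE, FALSE, TRUE,  FALSE, FALSE, FALSE),  # array2: 1 mark
                nrow = 2, byrow = TRUE,
                dimnames = list(c("array1", "array2"),
                                c("MAplot", "Boxplot", "Heatmap",
                                  "Spatial", "RLE", "NUSE")))
rowSums(marks) > 2   # ">2 rule": array1 TRUE, array2 FALSE
rowSums(marks) > 3   # ">3 rule": array1 TRUE, array2 FALSE
```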

2.1 Used microarray data sets

For the study four arbitrarily chosen, publicly available microarray data sets from the ArrayExpress database Parkinson et al. (2007, 2009) were selected. To avoid influences from any kind of data processing and to perform the quality assessment and control as first steps in the microarray analysis process, the raw data (CEL files) had to be available for the experiments. Furthermore, all arrays in one experiment have to be of the same chip type, and there should be enough arrays in one experiment, but the number should not be too big to be processable on one machine. For an optimal boxplot representation, data sets with fewer than 300 arrays were considered. Boxplots containing more than 300 boxes are no longer visually discernible on a monitor or in a printed PDF file Vicedo (2009). Furthermore, processing more than 500 microarrays on one machine with the R language requires a lot of main memory (>64 GB, depending on the computer architecture and the microarray chip type) Schmidberger et al. (2009).

The following four data sets are used as evaluation sets in the empirical study:


E-GEOD-1159: The data set was published on April 15, 2004 Stirewalt et al. (2008). It consists of expression profiles of 293 acute myeloid leukemia (AML) patient samples.
Chip platform: Affymetrix GeneChip Human Genome U133A Array (HG-U133A)

E-GEOD-11121: The data set was published on July 01, 2008 Schmidt et al. (2008). It consists of a population-based N0 untreated breast cancer cohort study including 200 samples.
Chip platform: Affymetrix GeneChip Human Genome U133A Array (HG-U133A)

E-GEOD-15396: The data set was published on March 27, 2009. Transcription profiling of 156 human peripheral blood mononuclear, DU145, and HCT116 cells treated with a CDK inhibitor.
Chip platform: Affymetrix GeneChip Human Genome U133 Plus 2.0 (HG-U133 Plus 2.0)

E-GEOD-2990: The data set was published on July 11, 2008 Sotiriou et al. (2006). Transcription profiling of 189 human invasive breast carcinomas and data analysis from published studies to understand the molecular basis of histologic grade to improve prognosis.
Chip platform: Affymetrix GeneChip Human Genome U133A Array (HG-U133A)

To validate the results, two further arbitrarily chosen, publicly available data sets from the ArrayExpress database were selected:

E-GEOD-3960: The data set was published on July 11, 2008 Wang et al. (2006). Transcription profiling of 102 human prognostic samples from neuroblastoma patients for classification of neuroblastoma by integrating gene expression patterns with regional alterations in DNA copy number.
Chip platform: Affymetrix GeneChip Human Genome U95Av2 (HG-U95Av2)

E-GEOD-4475: The data set was published on July 12, 2008 Hummel et al. (2006). Transcription profiling of 221 human Burkitt lymphomas.
Chip platform: Affymetrix GeneChip Human Genome U133A Array (HG-U133A)

Neither the reference articles nor the accompanying appendix files of these six studies contain any description of quality assessment or control methods, and the analysis descriptions denote the use of all published arrays. Therefore, either no quality assessment and control was conducted or defective arrays were simply not published. MIAME describes the ‘Minimum Information About a Microarray Experiment’ that is needed to interpret the results of an experiment unambiguously and potentially to reproduce the experiment Brazma (2009). In particular, point six of MIAME calls for the publication of the processing protocols. MIAME is supported by several databases (including ArrayExpress) and software tools, but so far it is completely complied with by only some experiments. Examining the arrays with pseudo-image plots of the raw probe intensities (see Fig. 3) shows that there are several large artifacts, which is a strong


Fig. 3 Chip pseudo-image of the raw probe intensities from two arrays in the E-GEOD-1159 (array number 8) and E-GEOD-11121 (array number 272) experiments. In both images there is a large artifact

indicator for a defective array Gentleman et al. (2005). Small artifacts are usually inconsequential, because in modern array designs the probes for each probe set are not placed contiguously next to each other, but rather are spread throughout the array. Thus, an artifact will at most affect a few probes in a probe set, and neither the entire block nor the whole array.

Table 1 lists the “bad” quality arrays in the described experiments, identified by manual inspection of the pseudo-image plots as having large artifacts. In this manual process all images were independently evaluated by two microarray analysts. Arrays with large artifacts and clear intensity differences were marked. Columns two and three list the outlier arrays with more than two or three marks in the quality report table from the arrayQualityMetrics package version 2.2.2. Column four lists the outlier arrays for the validation data sets which were manually detected by two independent analysts using the six graphical quality assessment methods from the arrayQualityMetrics package (without the calculated score S). Columns five and six present the new proposed rules described in the following sections.

All used data sets are published in the ArrayExpress database, and results and analyses on these data sets are published in established journals. Therefore, we assume a low (or zero) number of bad quality arrays in these example data sets. The >2 rule identifies the most arrays and can be useful for a first quality assessment, to reduce the number of arrays for manual inspection. However, excluding all arrays identified as outlier arrays with this rule is too strict, and the number of samples in the experiment would be unnecessarily reduced. Some of the arrays with the strong artifacts are recognized with this rule. This indicates that, due to the array design, visible artifacts have only a low influence on the quality of microarrays. Using the >3 rule a notably lower number of arrays is marked as outliers, and most of the arrays detected manually by image inspection are not included. Analyzing the described assessment graphics for the detected arrays in detail, clear differences to the other arrays are found. The large number of arrays identified with the >2 rule indicates that there is probably a correlation between the scores and that arrays are marked due to correlated scores. This assumption is analysed in the empirical study described in the following sections.


Table 1 Outlier arrays in the experiments assessed with the arrayQualityMetrics (aQM) package and manual image inspection

Experiment      Image inspection   aQM (>2 rule)                      aQM (>3 rule)   Manual QA   aQM (>=2 combined rule)   aQM (>=2 reduced rule)
E-GEOD-11121    8,65,102,184,200   8,12,22,65,95,102,111,139,170      12,102          Na          12,102                    12,102
E-GEOD-1159     96,166,230,272     6,96,99,120,166,178,272,287,289    6,96,289        Na          6,96,289                  6,96,289
E-GEOD-15396    –                  6,23,84,134                        –               Na          131                       –
E-GEOD-2990     27,56,60           27,61                              –               Na          –                         –
E-GEOD-3960     –                  46,53,76                           46,53,76        46,53,76    46,53,76                  46,53,76
E-GEOD-4475     112,199            19,61                              –               –           –                         –

‘Na’ indicates that the manual quality assessment was not performed for these data sets

Table 2 Contingency table of binary ratings for rater A (rows) and rater B (columns)

                   Rater B
                   −        +
Rater A    −       a        b        a + b
           +       c        d        c + d
                   a + c    b + d    total


3 Statistical analysis

To test for a possible association among the statistical methods, the statistical methods (raters) have been pairwise combined in a 2 × 2 contingency table (see Table 2). In the following we use the word “rater” synonymously with “statistical method”. The table can be used to express the relationship between two variables. In this case + or − indicates an “acceptable” or an “outlier” array. For example, the number a is the number of arrays marked as outlier arrays by both statistical methods A and B. Total is the number of arrays in the used data set. As a quantitative measure of association one could use the Odds Ratio $OR = b/c$ for matched pairs.

3.1 Cohen’s Kappa

If there is homogeneity between the marginal frequencies (see Sect. 3.3), the agreement measure Cohen's Kappa coefficient Gewet (2002b); Landis and Koch (1977) can be applied. Cohen's Kappa coefficient κ is a statistical measure of inter-rater agreement for qualitative items. It is generally a more robust measure than a simple percent agreement calculation, since κ takes into account the agreement occurring by chance.

Cohen's Kappa measures the non-random agreement between two raters that each classify N items into C mutually exclusive categories.

$$\kappa = \frac{\rho_O - \rho_E}{1 - \rho_E} \qquad (5)$$

where $\rho_O$ is the relative observed agreement among the raters, and $\rho_E$ is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each rater randomly choosing each category. In the notation of Table 2 this means $N = total$, $C = 2$, $\rho_O = \frac{a+d}{total}$ and $\rho_E = \frac{(a+b)\cdot(a+c) + (c+d)\cdot(b+d)}{total^2}$. If the raters are in complete agreement then $\kappa = 1$. If there is no agreement among the raters (other than what would be expected by chance) then $\kappa \approx 0$.

Landis and Koch (1977) give Table 3 for interpreting κ values. This guide is, however, by no means universally accepted. The calculation of the standard error and confidence intervals for κ is explained in detail in Fleiss et al. (2003); Altman (1991). A limitation of the (weighted) Cohen's Kappa method is that it does not extend to multiple raters and does not adjust for both chance agreement and misclassification errors.
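A minimal R sketch of Eq. (5) from the cell counts of Table 2; the helper name cohen_kappa and the example counts are hypothetical:

```r
cohen_kappa <- function(a, b, c, d) {
  n  <- a + b + c + d
  pO <- (a + d) / n                                     # observed agreement
  pE <- ((a + b) * (a + c) + (c + d) * (b + d)) / n^2   # chance agreement
  (pO - pE) / (1 - pE)
}
cohen_kappa(a = 90, b = 6, c = 3, d = 1)   # hypothetical 2x2 counts
```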


Table 3 Guide to interpreting the kappa values provided by Landis and Koch (1977)

Kappa       Strength of agreement
<0.2        Poor
>0.2–0.4    Fair
>0.4–0.6    Moderate
>0.6–0.8    Good
>0.8–1      Very good

3.2 AC1 statistic

In an effort to overcome some of these limitations, Kilem Gewet theoretically derived two alternatives to current agreement coefficients: the first- and second-order agreement coefficients AC1 and AC2 Gewet (2002a). These coefficients adjust for chance agreement, and for both chance agreement and misclassification, respectively. Both are useful tools for reliability studies; however, computation of the statistics and their conditional and unconditional variances beyond the two-rater, two-category case is nontrivial.

The AC1 statistic Gewet (2002a,b) conducts a formal analysis of the agreement without consideration of the marginal differences. It is a simple inter-rater agreement statistic and is given by:

$$AC1 = \frac{\rho_0 - \rho_E}{1 - \rho_E} \qquad (6)$$

where $\rho_E$ measures the likelihood of agreement by chance and is defined as $\rho_E = 2\rho_+(1 - \rho_+)$, with $\rho_0 = \frac{a+d}{total}$ and $\rho_+ = \frac{(c+d) + (b+d)}{2 \cdot total}$.
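Analogously, a minimal sketch of Eq. (6) from the cell counts of Table 2; the helper name and the example counts are again hypothetical:

```r
ac1 <- function(a, b, c, d) {
  n      <- a + b + c + d
  p0     <- (a + d) / n                     # observed agreement
  p_plus <- ((c + d) + (b + d)) / (2 * n)   # approximated "+" prevalence
  pE     <- 2 * p_plus * (1 - p_plus)       # chance agreement
  (p0 - pE) / (1 - pE)
}
ac1(a = 90, b = 6, c = 3, d = 1)   # same hypothetical counts as above
```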

3.3 McNemar test

The McNemar test is used to assess marginal homogeneity (OR = 1) of each rating category of the 2 × 2 contingency table Altman (1991); McNemar (1947). A significant result implies that the marginal frequencies are not homogeneous; in that case Cohen's Kappa coefficient may not be calculated, but the AC1 statistic may. If the McNemar test is not significant (homogeneity between the marginal frequencies), Cohen's Kappa coefficient is used to find out if there is any kind of agreement among the analysed graphical methods.

Marginal homogeneity implies that the row totals are equal to the corresponding column totals. In the notation of Table 2 this means:

$$(a + b) = (a + c) \qquad (7)$$
$$(c + d) = (b + d) \qquad (8)$$

The McNemar statistic is calculated as

$$\chi^2 = \frac{(b - c)^2}{b + c} \qquad (9)$$


and the value χ² can be viewed as a chi-squared statistic with one degree of freedom. Altman (1991) recommends an improved version of the McNemar test with a continuity correction, calculated as

$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}. \qquad (10)$$

Statistical significance is determined by evaluating the probability of χ² with reference to a χ² distribution with one degree of freedom and the corresponding confidence intervals. A significant result implies that the marginal frequencies (or proportions) are not homogeneous.

If the values b and/or c are small, the McNemar test, especially the uncorrected version, is not well approximated by the chi-squared distribution. For (b + c) < 20 a two-tailed exact test, based on the cumulative binomial distribution with p = q = 0.5, is used instead.
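This decision logic can be sketched with R's stock stats functions; the example table is hypothetical. mcnemar.test() implements the continuity-corrected statistic of Eq. (10) by default, and binom.test() gives the exact two-tailed fallback:

```r
tab <- matrix(c(90, 3, 6, 1), nrow = 2,                 # a = 90, c = 3, b = 6, d = 1
              dimnames = list(A = c("-", "+"), B = c("-", "+")))
b <- tab[1, 2]; c <- tab[2, 1]                          # discordant cells
if (b + c < 20) {
  binom.test(b, b + c, p = 0.5)                         # exact two-tailed test
} else {
  mcnemar.test(tab, correct = TRUE)                     # Eq. (10), with correction
}
```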

3.4 Pearson correlation coefficient

In addition to the two described tests, the Pearson product-moment correlation coefficient Fleiss et al. (2003) is calculated for all pairs of scores derived from two of the statistical methods, and the correlation for the interesting pairs is visualized with scatter plots. The Pearson correlation coefficient is a measure of the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive. It is widely used in science as a measure of the strength of linear dependence between two variables and is obtained by dividing the covariance of the two variables by the product of their standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \qquad (11)$$

where σ denotes the standard deviation.
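Since this is a standard measure, a short R illustration with simulated score vectors suffices; cor() computes Eq. (11) by default:

```r
set.seed(2)
S1 <- rnorm(50)                        # scores of method 1 (simulated)
S2 <- 0.9 * S1 + rnorm(50, sd = 0.3)   # correlated scores of method 2
cor(S1, S2)                            # Pearson correlation, Eq. (11)
plot(S1, S2)                           # scatter plot as used in the study
```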

4 Results

The arrayQualityMetrics package was applied to the six data sets and the self-contained HTML quality reports were generated. With 102 to 293 arrays per experiment, each of the loaded AffyBatch objects requires 10 to 50 GB of main memory. The calculations on these objects, e.g. generating the plots, require additional main memory (up to 75 GB). Furthermore, the process of creating an AffyBatch object and running the main arrayQualityMetrics() function, which creates the quality report, takes up to 8 hours on an Intel Xeon processor with 2.93 GHz.

The summary table in the HTML report is imported and manipulated in R, the arrays are ordered and coded into two categories (“outlier” and “accepted” arrays) and used to create 2 × 2 contingency tables of all possible rater combinations without repetition (6 raters yield 15 possible 2 × 2 tables). These tables are tested for


Table 4 Summary of the agreement in the four data sets and the 6 methods used in the arrayQualityMetrics package

E-GEOD-1159 E-GEOD-11121 E-GEOD-15396 E-GEOD-2990

MA-plot/Spatial_dis Poor Poor Poor Poor

MA-plot/Boxplot – Good Good –

MA-plot/heatmap Very good Good Very good Moderate

MA-plot/RLE Poor Poor Poor –

MA-plot/NUSE Poor Poor Poor Poor

Spatial_dis/Boxplot Poor Poor Poor Poor

Spatial_dis/heatmap Poor Poor Poor Poor

Spatial_dis/RLE Fair – Poor –

Spatial_dis/NUSE Fair Poor Poor Poor

Boxplot/heatmap – Good Good Moderate

Boxplot/RLE Poor Poor Poor –

Boxplot/NUSE – Poor Poor Poor

Heatmap/RLE Poor – Poor –

Heatmap/NUSE Poor Poor Poor Poor

RLE/NUSE Moderate Poor Poor Poor

Classification adapted from the Landis and Koch (1977) guidelines. See Table 3 for interpretation of the κ values

marginal homogeneity using the McNemar test with and without correction. Only the tables that pass the homogeneity test (the non-significant two-rater combinations) are used to calculate Cohen's Kappa coefficient, while all tables are used for the AC1 statistic. For this analysis procedure a new function, called kappaAQM() (available in the Appendix), was implemented and applied to the HTML reports from the arrayQualityMetrics package version 2.2.2.
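The actual kappaAQM() function is provided in the Appendix and is not reproduced here; the following sketch only illustrates the described loop over the 15 rater pairs, reusing the hypothetical cohen_kappa() and ac1() helpers from Sect. 3 together with simulated outlier marks:

```r
set.seed(3)
marks <- matrix(runif(200 * 6) < 0.08, nrow = 200,      # simulated outlier marks
                dimnames = list(NULL, c("MAplot", "Boxplot", "Heatmap",
                                        "Spatial", "RLE", "NUSE")))
pairs <- combn(colnames(marks), 2)                      # 15 rater combinations
res <- apply(pairs, 2, function(p) {
  x <- marks[, p[1]]; y <- marks[, p[2]]
  a <- sum(!x & !y); b <- sum(!x & y); c <- sum(x & !y); d <- sum(x & y)
  pval <- if (b + c < 20) binom.test(b, b + c)$p.value  # exact test for small b+c
          else mcnemar.test(matrix(c(a, c, b, d), 2))$p.value
  c(mcnemar.p = pval,
    kappa = if (pval > 0.05) cohen_kappa(a, b, c, d) else NA,  # only if homogeneous
    AC1   = ac1(a, b, c, d))                            # AC1 for every pair
})
colnames(res) <- paste(pairs[1, ], pairs[2, ], sep = "/")
round(res, 3)
```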

The McNemar homogeneity test rejects the null hypothesis for up to five (in the E-GEOD-2990 data set) rater combinations. For about 50% of the pairs the sample size is small, (b + c) < 20, and a binomial test is performed instead.

For an overview of the final results of the agreement test with Cohen's Kappa, Table 4 presents the classification of the statistical method pairs according to the guidelines of Landis and Koch for all four data sets. The character “–” indicates that for this pair the McNemar homogeneity test rejected the null hypothesis and Cohen's Kappa could not be applied. The table demonstrates that the agreement between the MA-plot/Boxplot methods is “good” for two data sets. Between the MA-plot/heatmap methods the agreement is classified from “moderate” to “very good” for four data sets. The Boxplot/heatmap methods have an agreement from “good” to “moderate” in three data sets. The other method combinations show values that should be considered as not relevant for this agreement study between the methods used in the arrayQualityMetrics package version 2.2.2.

Figure 4 compares the obtained values of the two agreement tests. The AC1 statistic obtains higher values than Cohen's Kappa. Formally, Cohen's Kappa allows both raters to use two different rating frameworks.

Fig. 4 Plot of the values of the AC1 Statistic and Cohen's Kappa agreement for the data sets E-GEOD-1159, E-GEOD-11121, E-GEOD-15396 and E-GEOD-2990 (agreement coefficients between 0.0 and 1.0 for all 15 method pairs)

For the binary rating this implies that category one may be based on different frequencies for the two raters and that for one rater the category “outlier” may mean something different than for the other rater. The AC1 coefficient assumes that both raters use the same rating framework. Thus, the category outlier is assumed to have equal incidence for both raters. The advantage of the AC1 method is that all method pairs are analysed, not only those accepted by the marginal homogeneity test. The Cohen's Kappa coefficients show more variance than the AC1 statistic coefficients, and the Cohen's Kappa coefficients show a very low (or zero) agreement for several method pairs. As demonstrated in Gewet (2002b), the Cohen's Kappa coefficients tend to yield a low agreement. But a similar trend for the agreement between the statistical graphical methods can be found in both tests.

The following method pairs have the highest mean values of the AC1 statistic: MA-plot/Boxplot (0.985), MA-plot/heatmap (0.983), Boxplot/heatmap (0.97), spatial intensity distribution/RLE (0.947), RLE/NUSE (0.940).

Examining the Pearson correlation coefficients and the scatter plots for the mentioned statistical graphical methods, the agreement between the listed five pairs of methods can be confirmed. Table 5 shows the median Pearson correlation coefficients for the four data sets. Especially for MA-plot/heatmap there is a strong correlation. Due to the symmetry of the coefficient the matrix is an upper triangular matrix, and the diagonal elements are equal to one, because for identical variables the correlation coefficient is one.

Figure 5 shows the scatter plots of the scores for the automatic outlier detection in the arrayQualityMetrics package. Only the interesting pairs are plotted, for the E-GEOD-11121 data set. For the other three data sets the scatter plots look very similar; for the other pairs of methods there is less correlation observable.


Table 5 Median values of the Pearson correlation coefficient over all four data sets

              Boxplot   MA-plot   NUSE    RLE    Heatmap   Spatial_dis
Boxplot       1         0.11      −0.11   0.03   0.14      0.18
MA-plot                 1         0.55    0.03   0.99      0.04
NUSE                              1       0.24   0.56      0.25
RLE                                       1      0.04      0.37
Heatmap                                          1         0.04
Spatial_dis                                                1

In the upper left panel of Fig. 5 the differential gene expression is calculated relative to the mean array of the sample considered. The mean of the absolute M values has to be positive and is influenced by extreme values of the differential gene expression. The panel demonstrates that arrays with a low or high interquartile range are associated with extreme means, which are affected by extreme differential expression values. In the upper right panel a distance matrix is created which measures the distance between each pair of arrays in a sample. For each array the mean distance to all other arrays is calculated and plotted on the y-axis. This number is not equal but similar to the distance of the array to a mean array. A situation which is close to the upper left panel is created this way.

After the analysis of the Cohen's Kappa and AC1 statistic coefficients and the direct visualization of the correlation, the assumption of some agreement and correlation between the six quality assessment methods can be confirmed. There is a strong correlation between MA-plot/heatmap and a strong agreement between the statistical graphical methods MA-plot/Boxplot and Boxplot/heatmap. Between the other methods there is less agreement and correlation. This association among the three pairs of methods can be explained by the scores S they use to detect the outlier arrays and by the stratum of the microarray data they work on. The MA-plot and heatmap methods are based on the block stratum of the array data and both use the M metric as the basis to calculate the scores S (see Sect. 2). The boxplot is based on the intensities of the entire array (median and IQR of the intensities of each array), while MA-plot and heatmap both use the M value. If the boxplot detects an array as an outlier, the array has too high/low intensities or the intensity values are dispersed (maybe due to experimental errors during the process, or because the genes on this array are more expressed than on the other arrays). For the MA-plot or heatmap methods the M metric is calculated as $M_i = \log_2(I_{ij}) - \log_2(I_{lj})$. If enough genes/probe sets on the array are affected, the M value for this array will be higher/lower than for the other arrays of the experiment. Hence, it will also be detected as an outlier by both methods, MA-plot and heatmap.

Therefore, the quality assessment rule for the quality report table generated by thearrayQualityMetrics package can be improved in the following way:

>= 2 combined rule: If there is one outlier mark in one of the three methods MA-plot, heatmap and boxplot, and one outlier mark in one of the other three methods, then the array should be considered an outlier array. In this rule the methods are divided into two groups: group 1 combines the methods with a strong correlation and agreement (MA-plot, heatmap and Boxplot) and group 2 the rest of the methods (spatial distribution, NUSE, RLE). When an array is detected as an outlier in group 1 and in group 2, then it should be considered an outlier and automatically excluded.

Fig. 5 Scatter plots of the scores for the automatic outlier detection in the arrayQualityMetrics package for the E-GEOD-11121 data set. “Outlier” arrays are marked in red (panel correlations: Boxplot/MA-plot −0.06, Boxplot/heatmap −0.03, MA-plot/heatmap 0.99, RLE/NUSE 0.12, Spatial_dis/RLE 0.65)

Furthermore, due to the strong agreement between MA-plot/boxplot and MA-plot/heatmap, the heatmap and boxplot methods can be excluded from the quality assessment. The analysis of the microarray across the strata is also considered in this rule. The following quality assessment rule should then be used:

>= 2 reduced rule: If an array is marked as an outlier array by the MA-plot method and additionally marked in at least one of the other three methods (NUSE, RLE, spatial distribution of feature intensities), then the array should be considered an outlier array.
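Both rules reduce to simple logical operations on a marks matrix like the hypothetical one sketched in Sect. 2 (a sketch, not the package implementation):

```r
group1 <- c("MAplot", "Boxplot", "Heatmap")     # strongly correlated methods
group2 <- c("Spatial", "RLE", "NUSE")           # complementary methods
combined <- rowSums(marks[, group1]) >= 1 &
            rowSums(marks[, group2]) >= 1       # ">= 2 combined rule"
reduced  <- marks[, "MAplot"] &
            rowSums(marks[, group2]) >= 1       # ">= 2 reduced rule"
which(reduced)                                  # candidate outlier arrays
```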

Table 1 shows that the two newly described rules (>= 2 combined rule and >= 2 reduced rule) detect the same arrays as outliers, except for the data set E-GEOD-15396. In this experiment the >= 2 combined rule detects array 131 as an outlier, but the >= 2 reduced rule does not. In this case, the >= 2 reduced rule agrees with the >3 rule.


4.1 Validation

The newly presented rules are validated with the two proposed validation data sets E-GEOD-3960 and E-GEOD-4475. For these two data sets a manual quality assessment was independently performed by two experts in microarray analysis. For their decisions they used all metrics generated by the arrayQualityMetrics package (without the scores S). In addition, the new rules were also tested with the four learning data sets. Table 1 lists the outlier arrays in the described experiments identified by manual inspection of the pseudo-image plots, by the arrayQualityMetrics package version 2.2.2, by the complete manual quality assessment for the two validation data sets, and by the arrayQualityMetrics package combined with the two new rules.

All used data sets are published in the ArrayExpress database, and results and analyses on these data sets are published in established journals. Therefore, we assume a low number of bad quality arrays in these examples. But, as demonstrated in the introduction, several outlier arrays are present. The >2 rule identifies a lot of arrays and can be useful for a first quality assessment, to reduce the number of arrays for manual inspection. However, excluding all arrays identified as outlier arrays with this rule is too strict; the >3 rule seems to be more useful for detecting the outlier arrays automatically (see Sect. 1). In the two validation data sets E-GEOD-3960 and E-GEOD-4475 (last two rows in Table 1) both newly proposed rules identify the same arrays as the >3 rule, and these arrays match the results of the complete manual quality assessment (Table 1, column four). Except for one array in the E-GEOD-15396 data set, the newly developed rules also match in the evaluation data sets. The >= 2 reduced rule finds exactly the same arrays, uses only four of the six proposed statistical graphical methods and includes the three strata of the microarray data (see Sect. 2).

5 Discussion and conclusions

In microarray experiments the arrayQualityMetrics package is very useful for quality assessment and control based on the mentioned statistical graphical methods. It works on all expression arrays and platforms and produces a self-contained quality report, which can be web-delivered. In addition, an automatic quality assessment and outlier detection is integrated, and the results are represented as a summary table.

Unfortunately, the methods require a lot of main memory and have long computation times (more than 8 h for 200 arrays). Two ways to address these problems are the use of parallel computing, or reducing the number of statistical graphical methods used and computing only the most informative measures.

5.1 Parallel computing

In parallel computing many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (“in parallel”).

The affyPara package Schmidberger et al. (2009); Schmidberger and Mansmann (2008) implements parallel algorithms for the preprocessing of microarray data.


Parallelization of the existing preprocessing methods produces, up to machine accuracy, the same results as the serial methods. The partitioning and distribution of the data to several nodes solves the main memory problems and accelerates the methods by up to a factor of 15 for 300 arrays or more. Additionally, in the bachelor thesis of Esmeralda Vicedo (2009), parallel quality assessment methods were implemented in the affyPara package. The graphical visualization of more than 300 arrays becomes, independent of the graphical method used, very complex and unreadable. Therefore, the parallelized methods were extended to plot only the interesting arrays (outlier and reference arrays).

Due to the latest developments in computer chip production, multi-core machines are available to most users. In the arrayQualityMetrics() function up to 13 independent tasks are executed serially. The new R package multicore Urbanek (2009) can be used to accelerate the quality assessment by the number of available processors. Unfortunately, this does not solve the mentioned main memory problems, but due to the fork Stevens (1992) technology used and the read-only operations on the AffyBatch object, no additional main memory is required.
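A sketch of this idea with the multicore package; the task list is hypothetical (the real function runs its own 13 internal tasks), and the package has since been superseded by 'parallel', which provides the same mclapply() interface:

```r
library(multicore)                                # provides mclapply()
tasks <- list(                                    # hypothetical independent QA tasks
  boxplot_scores = function(x) apply(x, 1, median),
  maplot_scores  = function(x) rowMeans(abs(sweep(x, 2, apply(x, 2, median)))),
  heatmap_scores = function(x) rowSums(as.matrix(dist(x)))
)
x <- matrix(rnorm(50 * 1000), nrow = 50)          # 50 simulated arrays
scores <- mclapply(tasks, function(f) f(x))       # one forked worker per task
```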

5.2 Method reduction

The presented empirical study on the agreement between statistical methods in quality assessment and control of microarray data with the arrayQualityMetrics package shows that there is some association and correlation between three of the six currently proposed statistical methods to comprehensively quantify the quality information in large series of microarrays. Therefore, the number of quality assessment methods can be reduced to four and computation time saved.

There is a strong correlation between MA-plot/heatmap and a strong agreement between the statistical graphical methods MA-plot/Boxplot and Boxplot/heatmap. These three methods are very similar measures, and it is enough to use only one of them. Therefore, a useful rule for the automatic detection of outliers based on the quality report table from the arrayQualityMetrics package can be defined:

If an array is marked as an outlier array by the MA-plot method and additionally marked in at least one of the other three methods (NUSE, RLE, spatial distribution of feature intensities), then the array should be considered an outlier array.

This decision rule is based on scores which offer complementary information (are not strongly correlated), and only four of the six measures have to be calculated. The validation in Sect. 4 demonstrates that the new decision rule identifies the same arrays as outliers as the original rule in the arrayQualityMetrics package, and the three different strata of the microarray data are also considered in the new rule. This rule is already applied in the latest development version of the arrayQualityMetrics package (version 2.5.10). The summary table is now reduced to three columns.

As long as the number of (outlier) arrays is small, the authors advise checking all marked arrays manually. If the number of arrays is too big, the newly proposed >= 2 reduced rule can be used for automatic array exclusion.


Acknowledgments The work is supported by the LMUinnovative collaborative centre ‘Analysis and Modelling of Complex Systems in Biology and Medicine’. Special thanks go to the working group “Statistical Computing” (http://www.statistical-computing.de/) and the organization committee of the workshop “Statistical Computing” 2009 in Günzburg (Germany).

References

Altman DG (1991) Practical statistics for medical research. Chapman & Hall, Boca Raton
Berrar DP, Dubitzky W, Granzow M (eds) (2003) A practical approach to microarray data analysis. Kluwer Academic Publishers Group, London
Brazma A (2009) Minimum information about a microarray experiment (MIAME)–successes, failures, challenges. Scientific World J 9:420–423
Brettschneider J, Collin F, Bolstad BM, Speed TP (2007) Quality assessment for short oligonucleotide arrays
Burgoon LD, Eckel-Passow JE, Gennings C, Boverhof DR, Burt JW, Fong CJ, Zacharewski TR (2005) Protocols for the assurance of microarray data quality and process control. Nucleic Acids Res 33:1–11
Fleiss JL, Levin B, Paik MC (2003) Statistical methods for rates and proportions. Wiley-Interscience, New York
Gautier L, Cope L, Bolstad BM, Irizarry RA (2004) affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20(3):307–315
Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S (2005) Bioinformatics and computational biology solutions using R and Bioconductor, 1st edn. Springer, Berlin
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
Gewet K (2002a) Handbook of inter-rater reliability. Technical report, STATAXIS Publishing Company
Gewet K (2002b) Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Stat Methods Inter Rater Reliab Assess 2:1–9
Huber W (2008) Sixth framework programme for quality of life and management of living resources. Technical report, Microarray and Gene Expression Data Society, EMERALD Workshop
Hummel M, Bentink S, Berger H, Klapper W, Wessendorf S, Barth TFE, Bernd H-W, Cogliatti SB, Dierlamm J, Feller AC, Hansmann M-L, Haralambieva E, Harder L, Hasenclever D, Kühn M, Lenze D, Lichter P, Martin-Subero JI, Möller P, Müller-Hermelink H-K, Ott G, Parwaresch RM, Pott C, Rosenwald A, Rosolowski M, Schwaenen C, Stürzenhofecker B, Szczepanowski M, Trautmann H, Wacker H-H, Spang R, Loeffler M, Trümper L, Stein H, Siebert R, Molecular Mechanisms in Malignant Lymphomas Network Project of the Deutsche Krebshilfe (2006) A biologic definition of Burkitt's lymphoma from transcriptional and genomic profiling. N Engl J Med 354(23):2419–2430
Kauffmann A, Gentleman R, Huber W (2009) arrayQualityMetrics—a Bioconductor package for quality assessment of microarray data. Bioinformatics 25(3):415–416
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174
McNemar Q (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12:153–157
Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, Holloway E, Lukk M, Malone J, Mani R, Pilicheva E, Rayner TF, Rezwan F, Sharma A, Williams E, Bradley XZ, Adamusiak T, Brandizi M, Burdett T, Coulson R, Krestyaninova M, Kurnosov P, Maguire E, Neogi SG, Rocca-Serra P, Sansone S-A, Sklyar N, Zhao M, Sarkans U, Brazma A (2009) ArrayExpress update—from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37(Database issue):D868–D872
Parkinson H, Kapushesky M, Shojatalab M, Abeygunawardena N, Coulson R, Farne A, Holloway E, Kolesnykov N, Lilja P, Lukk M, Mani R, Rayner T, Sharma A, William E, Sarkans U, Brazma A (2007) ArrayExpress—a public database of microarray experiments and gene expression profiles. Nucleic Acids Res 35(Database issue):D747–D750
Schmidt M, Böhm D, von Törne C, Steiner E, Puhl A, Pilch H, Lehr H-A, Hengstler JG, Kölbl H, Gehrmann M (2008) The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 68(13):5405–5413
Schmidberger M, Mansmann U (2008) Parallelized preprocessing algorithms for high-density oligonucleotide arrays. In: Proceedings of the IEEE international symposium on parallel and distributed processing (IPDPS), 14–18 April 2008, pp 1–7
Schmidberger M, Vicedo E, Mansmann U (2009) affyPara—a Bioconductor package for parallelized preprocessing algorithms of Affymetrix microarray data. Bioinform Biol Insights 3:83–87
Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Marc B, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M (2006) Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 98(4):262–272
Stevens WR (1992) Advanced programming in the UNIX environment. Addison-Wesley, Upper Saddle River, NJ
Stirewalt DL, Meshinchi S, Kopecky KJ, Fan W, Pogosova-Agadjanyan EL, Engel JH, Cronk MR, Dorcy KS, McQuary AR, Hockenbery D, Wood B, Heimfeld S, Radich JP (2008) Identification of genes with abnormal expression changes in acute myeloid leukemia. Genes Chromosomes Cancer 47(1):8–20
Urbanek S (2009) multicore: parallel processing of R code on machines with multiple cores or CPUs. R package version 0.1-3
Vicedo E (2009) Quality assessment of huge numbers of Affymetrix microarray data. Bachelor thesis
Wang Q, Diskin S, Rappaport E, Attiyeh E, Mosse Y, Shue D, Seiser E, Jagannathan J, Shusterman S, Bansal M, Khazi D, Winter C, Okawa E, Grant G, Cnaan A, Zhao H, Cheung N-K, Gerald W, London W, Matthay KK, Brodeur GM, Maris JM (2006) Integrative genomics identifies distinct molecular classes of neuroblastoma and shows that multiple genes are targeted by regional alterations in DNA copy number. Cancer Res 66(12):6050–6062
Wilson CL, Miller CJ (2005) simpleaffy: a Bioconductor package for Affymetrix quality control and data analysis. Bioinformatics 21(18):3683–3685
