
BRIEFINGS IN BIOINFORMATICS. VOL 7. NO 2. 166-177
Advance Access publication March 7, 2006

doi:10.1093/bib/bbl002

Normalization and quantification of differential expression in gene expression microarrays

Christine Steinhoff and Martin Vingron

Received (in revised form): 28th January 2006

Abstract

Array-based gene expression studies frequently serve to identify genes that are expressed differently under two or more conditions. The actual analysis of the data, however, may be hampered by a number of technical and statistical problems. Possible remedies on the level of computational analysis lie in appropriate preprocessing steps, proper normalization of the data and application of statistical testing procedures in the derivation of differentially expressed genes. This review summarizes methods that are available for these purposes and provides a brief overview of the available software tools.

Keywords: microarray; normalization; low-level analysis; differential gene expression

INTRODUCTION

Microarray technology has been around for almost 10 years and with it a plethora of computational analysis tools has been developed. Yet, the application of microarray technology in biological research still poses serious problems and causes considerable confusion on the part of the users of the technology. The lack of simple answers to the problems in this field is largely due to the wide scope of questions that can be tackled with the technology, overlaid with its technical aspects, which influence the analysis in very specific ways. This review aims at summarizing existing approaches to the 'early' steps in the analysis pipeline, coupled with methods to tackle the supposedly simple question of finding genes that behave differently under different conditions.

A microarray experiment is performed under the assumption that gene intensities reflect actual mRNA levels. It is, however, well known that raw gene expression intensities do not fulfill this requirement. Their values are highly influenced by a number of non-biological sources of variation (for an overview see [1, 2]). Thus, for achieving biologically meaningful data, computational preprocessing including normalization steps is essential [3].

Microarray experiments are frequently employed for the purpose of identifying genes that are expressed differently under distinct conditions. This amounts to comparing one group A with another group B and delineating a list of genes ranked according to their respective statistic of differential expression. In a further step, significance is assigned to each gene and a cut-off value can be defined (for an overview see [4-7]). Even for these seemingly simple questions, proper preprocessing and normalization are crucial, and to a certain degree the two aspects are even linked with each other.

While we review the computational methods, familiarity with microarray technologies on the part of the reader is assumed. The platforms that will be considered are Affymetrix-type oligonucleotide arrays [13, 14] and two-colour spotted (cDNA-) arrays [14-16]. In the following, we use the abbreviations 'oligo array' and 'two-dye array'. For

Corresponding author. Christine Steinhoff, Max Planck Institute for Molecular Genetics, Department of Computational Molecular Biology, Ihnestr 73, D-14195 Berlin, Germany. E-mail: [email protected]

Christine Steinhoff is a postdoctoral scientist in the Computational Molecular Biology Department at the Max Planck Institute for Molecular Genetics in Berlin. Her research interest focuses on epigenetic gene regulatory mechanisms, especially based on gene expression experimental approaches.

Martin Vingron is the Director at the Max Planck Institute for Molecular Genetics in Berlin and Head of the Department for Computational Molecular Biology. His current research interest lies in utilizing gene expression data as well as evolutionary data for the elucidation of gene regulatory mechanisms.

© The Author 2006. Published by Oxford University Press. For Permissions, please email: [email protected]


Table 1: Freely available computational tools for preprocessing and normalization. Computational tools for preprocessing and normalization that are freely available are summarized in this table. For detailed information on the functions implemented in Bioconductor packages, we refer to the vignettes of the respective packages. Otherwise, detailed information is either given on the cited homepage or within the respective paper.

| Name | Package | Name of function | Download | References |
|---|---|---|---|---|
| PM/MM, Background correction | affy | expresso | http://www.bioconductor.org | [22, 33] |
| Summary statistic | affy | expresso | http://www.bioconductor.org | [22, 33] |
| Preprocessing | rma | - | http://www.bioconductor.org | [21] |
| Sequence based preprocessing | gcrma | - | http://www.bioconductor.org | [23] |
| Two-dye array preprocessing | marray | - | http://www.bioconductor.org | [82] |
| Variance stabilization | vsn | vsn | http://www.bioconductor.org | [35] |
| Li/Wong, dChip* | affy | expresso | http://www.bioconductor.org | [24] |
| ANOVA | maanova (R), MAANOVA (matlab) | - | http://www.jax.org/staff/churchill/labsite | [38] |
| Error model, local regression | NeSeColor (Fortran) | - | ftp://ftp.santafe.edu/pub/kepler | [39] |
| Local regression based methods | affy | expresso | http://www.bioconductor.org | [1, 31] |
| Quantile | affy | expresso | http://www.bioconductor.org | [33] |
| Qspline | affy | expresso | http://www.bioconductor.org | [32] |

*This is not exactly the original version of dChip [24], which is a commercial software. Here, we cite the open source version available at the Bioconductor project.

the latter technology also the possibility of dye-swap experiments will be considered.

This review is structured according to the sequence of analysis steps that need to be performed. Preprocessing and normalization are dealt with in the 'Preprocessing and normalization methods' section, while the 'Differential expression' section deals with the quantification of differential expression. We also provide a brief overview of tools that are available to the researcher in order to carry out these analytical steps (Tables 1 and 2). Most computational procedures that are reviewed in this article can be performed by using the open source language R [8] and R packages in the Bioconductor project [9]. We recommend using R and Bioconductor. Presently, the packages provide a wide range of powerful statistical applications for various kinds of genomic analysis. The project allows for the integration of different kinds of biological data and for rapid development of new statistical packages.

Recently, a number of books were published that introduce in detail the process of DNA microarray analysis, discuss problems and drawbacks, and provide different software solutions [10-15].

PREPROCESSING AND NORMALIZATION METHODS

Motivation

The need for what we call preprocessing comes from the fact that in addition to reflecting mRNA levels, spot intensities may also depend on peculiarities of print tips, particular PCR reactions, integration efficiency of a dye or spatial and hybridization specific effects. These problems can partly be remedied by image processing methods, background adjustment, normalization, summary of multiple probes per transcript, or quality control measures [1, 2, 16]. Thus, such procedures are referred to as preprocessing of the data.

A simple self-self comparison will demonstrate the problem. Splitting an RNA sample into two aliquots, labelling them differently and performing a hybridization will show a summary of all these unwanted effects. The variation seen between the two equal samples is all due to the experimental variation which we need to deal with in order to later on quantify differential gene expression [17].

The need for normalization arises from the observation that measurements from different hybridizations may occupy different scales. In order to compare them they need to be normalized. Otherwise, one would deem genes differentially expressed where only the hybridizations behaved differently. Additionally, the variance in the data tends to depend on the absolute intensity of the data. This, too, may lead to false biological conclusions and should be remedied by a normalization method.

For two hybridizations (or two colours of one hybridization) this latter problem is easily visualized with a scatter plot of the average of the two log intensities A versus their log ratio M [18]. This graphical representation is frequently referred to as


Table 2: Freely available computational tools for quantification of differential expression. Computational tools for quantification of differential expression that are freely available are summarized in this table. For detailed information on the functions that are available on the Bioconductor homepage we refer to the vignettes of the respective packages. Otherwise, detailed information is either given on the cited homepage or within the respective article.

| Description | Name | Download | References |
|---|---|---|---|
| Penalized t-statistic | SAM - samr | www-stat.stanford.edu/~tibs/SAM | [44] |
| Penalized t-statistic | samroc | Supplementary information in [14] | [46] |
| Modified t-statistic | Limma, lpe | http://www.bioconductor.org | [41, 51] |
| Modified t-statistic | CyberT | http://www.genomics.uci.edu/software.html | [48] |
| Linear model | maanova | http://www.jax.org/staff/churchill/labsite | [38] |
| Wilcoxon | multtest | http://www.bioconductor.org | [46] |
| Empirical Bayes | EBarrays | http://www.stat.wisc.edu/~newton | [40] |
| Relative entropy | SPEGRE | http://ctb.pku.edu.ca/main/QianGroup/sgegre.htms | [52] |
| Time course analysis: splines | EDGE | http://faculty.washington.edu/jstorey/edge | [55] |
| Time course analysis: HMM | GQL | http://ghmm.org/gql | [61] |
| Local false discovery rate | Twilight | http://www.bioconductor.org | [64] |
| Diverse multiple testing procedures | multtest | http://www.bioconductor.org | [14] |
| Q-values | QVALUE | http://faculty.washington.edu/jstorey/edge | [78] |
| FDR | - | http://www.stjuderesearch.org/depts/biostats | [62, 74, 76] |
| Gene filtering | genefilter | http://www.bioconductor.org | [14] |

MA plot [16]. It shows that the variance of M changes strongly with A: while the variance is low for high values of A, it is rather large for small values of A. This is a source for possible misinterpretation of the data: a fold change of two may be highly interesting for two strongly expressed genes while it is not noteworthy when the genes come from the region of low expression. For the quantification of differential expression, we require constant variance across the whole dynamic range.
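The coordinates of an MA plot can be sketched in a few lines. The following is a minimal illustration, assuming base-2 logarithms and strictly positive intensities; the function name `ma_values` and the intensity values are hypothetical and not taken from any of the cited packages.

```python
import math

def ma_values(red, green):
    """Return (M, A) pairs for paired, strictly positive intensities."""
    pairs = []
    for r, g in zip(red, green):
        m = math.log2(r) - math.log2(g)          # log ratio M
        a = 0.5 * (math.log2(r) + math.log2(g))  # average log intensity A
        pairs.append((m, a))
    return pairs

# Hypothetical intensities: one weakly and one strongly expressed gene,
# both with a fold change of two between the channels.
red = [16.0, 1024.0]
green = [8.0, 512.0]
ma = ma_values(red, green)
print(ma)
```

Both genes obtain the same M but very different A, which is the situation in which the intensity-dependent variance decides how noteworthy the fold change is.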

It is a common practice to transform gene expression intensities to logarithmic scale. This makes the variation of intensities or differences less dependent on the absolute magnitude and evens out highly skewed distributions. Furthermore, logarithmic transformations convert multiplicative errors into additive ones [19]. Problems with logarithmic scale arise for negative values, which occur frequently after background subtraction. For positive values close to zero, logarithmic transformation yields strongly negative values and consequently heavily scattered plots.

During the last years, a number of solutions for preprocessing and normalization came up. A pipeline of the analysis procedure and an overview of frequently used methods are given in Figure 1. One of the basic questions there pertains to the user's assumption as to whether only a small fraction of the genes or large parts of them change under the studied change of conditions. This is usually a reflection of the experiment design. For example, using a specialized array containing genes relevant for a particular biological process, one expects most of the genes to change in the experiment. When normalizing these experimental settings, housekeeping genes, internal controls or spikes have to be used. Amongst the genes of a whole-genome array, on the other hand, only a small fraction is expected to change. Here, we will focus on methods for the latter, the general purpose array. Owing to space limitations, we will not go into deep detail for most methods. For a tutorial guiding through normalization procedures, see the article by Kreil [20] published last year in this journal.

Preprocessing

When analysing oligo arrays, one chooses a background correction and decides how to utilize perfect matches (PM) and mismatches (MM) in order to obtain a summary of intensities. This is frequently called summary statistic in the literature (see [21, 22] for an overview). Irizarry et al. [21] propose a background correction that ignores MM values altogether. They offer a Bioconductor package called Robust Multi-array Analysis (RMA) that comprises background adjustment and normalization (refer to the 'Transformation methods' section). Wu et al. [23] introduce a sequence-based statistical model that describes background adjustment specifically for oligo arrays. The components of the error model are estimated by a maximum likelihood approach or an empirical Bayes approach. This approach is


[Figure 1: flow diagram. Stages shown: basic assumption (majority of genes changed: normalization by some spots, i.e. housekeeping genes or spikes; majority of genes unchanged: normalization by all spots); technology aspects (oligo array, two-dye array); preprocessing (oligo array: background model, PM/MM model, summary statistic; two-dye array: scanner setting, image analysis, background correction); normalization setting (within each slide, multiple slides, dye-swap); normalization method (global scaling; error model based: variance stabilization, ANOVA; transformation: global regression, local regression, quantile, qspline).]

Figure 1: Pipeline of preprocessing and normalization of gene expression data. In this figure, distinct stages of technology choice, data preprocessing and normalization are displayed as referred to in the text. In the first step, the basic assumption is either that the majority of genes on the array might change or remain unchanged (basic assumption). Depending on this, there are different technology settings possible (technology aspects). Again, depending on that choice, there are different strategies for how to preprocess the data (preprocessing). The basic normalization strategy can now be distinguished with regard to the number of slides in each normalization step (normalization setting). In the last step, one has to decide on the actual normalization method.

implemented in the Bioconductor package gcrma, which is a modified version of RMA that describes the intensity of probes as a function of the GC content. Li and Wong [24] establish a statistical framework that comprises an error model for perfect matches and mismatches. This setting is only applicable for oligo arrays. Their approach comprises the deduction of a summary statistic. For an overview of different probe set summary methods see [25].

Likewise, for two-dye arrays there exist tools for image analysis, including background correction, or for testing for the above-mentioned artifacts such as PCR batch effects [26].

Table 1 provides an overview of the freely available computational tools for preprocessing gene expression data. Having performed preprocessing for either kind of technology platform, we end up with one value per probe set or transcript represented on the array. In the following, these units will be referred to as 'genes' for short. Since it is not the focus of this review, we will not go into deeper detail regarding preprocessing steps. There are a number of publications dealing with this aspect [5, 11, 12, 14, 18].

Scaling methods

Applying scaling methods, one assumes that different sets of intensities differ by a constant global factor. Such methods are only correct for 'global multiplicative effects' [27], since all raw intensity values are multiplied by one common (i.e. global) scaling factor. Note that for log-transformed datasets multiplicative effects become additive. The scaling factor might be derived from the mean, median, Z-score, etc. [27, 28]. Preprocessing, including standardization, as provided by the Microarray Suite Software 5.0 (MAS5.0) applies a trimmed-mean-based scaling approach. Adapted from the available documentation of MAS5.0, the algorithm is implemented in the Bioconductor package affy.
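As an illustration of global scaling, the following minimal sketch rescales each array so that all arrays share a common median. It uses the median rather than the trimmed mean of MAS5.0, and the function name `scale_to_common_median`, the choice of target value and the data are assumptions made for this example only.

```python
import statistics

def scale_to_common_median(arrays, target=None):
    """Multiply every raw intensity of an array by one global factor."""
    medians = [statistics.median(a) for a in arrays]
    if target is None:
        # Assumed convention: use the mean of the array medians as target.
        target = statistics.mean(medians)
    return [[x * target / med for x in a] for a, med in zip(arrays, medians)]

# Hypothetical raw intensities; the second array is uniformly half as bright.
arrays = [[100.0, 200.0, 400.0], [50.0, 100.0, 200.0]]
scaled = scale_to_common_median(arrays)
print(scaled)
```

Because the two arrays differ only by a global multiplicative factor, a single scaling factor per array makes them identical; intensity-dependent effects would not be removed this way.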

Transformation methods

Transformation methods aim at quantitatively mapping one set of intensities to another one. They are non-parametric when no distributional assumptions are made. Mostly, these methods are based on regression. Regression can be applied either over the entire range of intensities [29, 30] or locally [31]. Depending on whether the regression function is a linear function or a polynomial function of degree larger than one, we distinguish linear and polynomial regression.

Especially for local regression, outlier values can strongly influence the regression curve. Therefore, it is advisable to introduce weights that penalize outliers. Local regression via loess/lowess (locally weighted scatter plot smooth) uses a linear (lowess) or quadratic (loess) polynomial weighted regression function with Tukey's biweight function [31], while local regression via locfit applies a tricubic weighting function. With regard to microarray normalization they perform very similarly. Workman et al. [32] proposed a normalization method where intensity pairs of two arrays are interpolated according to a cubic spline function (qspline).

Quantile normalization for oligo arrays as proposed by Bolstad et al. [33] aims at making the distribution of gene expression intensities of each sample the same. This approach is applicable to an arbitrary number of samples. Each quantile of intensities is projected to lie along the unit diagonal. This can be achieved by the following procedure: let X(i, k) be the gene expression intensity of the ith gene in the kth sample. Each sample set of intensities X(·, k) is sorted by a permutation π_k according to intensity values, which results in a sorted sample set X'(·, k). Then each intensity value X'(i, k) is substituted by the mean across all samples: mean(X'(i, ·)). The inverse permutation inv(π_k) is now applied to each sample set and produces the normalized set of gene expression intensities. The approach is implemented in the Bioconductor package affy.
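The procedure just described can be sketched as follows. This is a minimal illustration and not the affy implementation; the function name `quantile_normalize` and the data are hypothetical, and tied intensities receive no special treatment.

```python
def quantile_normalize(samples):
    """samples: list of equal-length lists, one list of intensities per sample."""
    n_genes = len(samples[0])
    # pi_k: gene indices of sample k ordered by increasing intensity.
    orders = [sorted(range(n_genes), key=s.__getitem__) for s in samples]
    # Mean of the r-th smallest intensity across all samples.
    rank_means = [
        sum(s[order[r]] for s, order in zip(samples, orders)) / len(samples)
        for r in range(n_genes)
    ]
    # Inverse permutation: write each rank mean back to the gene's position.
    normalized = []
    for order in orders:
        out = [0.0] * n_genes
        for r, gene in enumerate(order):
            out[gene] = rank_means[r]
        normalized.append(out)
    return normalized

# Hypothetical intensities for three genes in two samples.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
result = quantile_normalize(samples)
print(result)
```

After normalization every sample contains exactly the same set of values, while the ordering of genes within each sample is preserved.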

Error model based transformation methods

The basic idea of introducing an error model is to describe the relation between measured signal intensities and true abundance of RNA molecules. Assume that the true intensity level $x_{kg}$ of the kth sample and gth gene is disturbed by random multiplicative ($b_{kg}$) and additive ($a_{kg}$) factors. The measurement $y_{kg}$ of the gth gene in the kth sample can be described as

$y_{kg} = a_{kg} + b_{kg} x_{kg}$.

Proposing modified models and approaches that determine and decompose the multiplicative factor in stochastic terms has been the focus of several publications over the last few years [34-36]. One of the first approaches to determine a multiplicative term in an error model was proposed by [37] and yields a justification for the logarithmic transformation. A more sophisticated error model was introduced by Rocke and Durbin [34] and led to the normalization model of variance stabilization [35, 36].

Looking at MA representations (refer to 'Motivation' in the 'Preprocessing and normalization methods' section), we observe scattered plots for low intensities whereas for high intensities this is not the case [2, 18]. This phenomenon is due to the fact that the variance depends on the intensity: for low mean intensities we find a rather high variance, whereas for large intensities the variance is roughly constant. Variance stabilization provides a solution for this problem. Applying a variance stabilizing transformation as proposed by Huber et al. [35] and Durbin et al. [36], the variance is approximately constant across the whole dynamic range of expression intensities. Thus, it allows for quantification of differential expression independently of the mean intensities. Furthermore, this approach overcomes the shortcomings of the logarithmic transformation. Variance stabilization is performed by applying an arsinh transformation. In contrast to the logarithmic function, the arsinh function has no singularity at zero and is defined and continuous for zero and negative values.
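The behaviour of the arsinh transformation can be checked numerically. The sketch below is only an illustration: the function name `glog` and the default parameters `a` and `b` are hypothetical placeholders, whereas in a real variance stabilization the calibration parameters are estimated from the data.

```python
import math

def glog(x, a=0.0, b=1.0):
    """arsinh(a + b * x); a and b are placeholder calibration parameters."""
    return math.asinh(a + b * x)

# Defined for negative and zero intensities, unlike the logarithm.
for x in (-5.0, 0.0, 5.0, 1000.0):
    print(x, glog(x))

# For large intensities arsinh(x) is close to log(2x), i.e. it behaves
# like a logarithm shifted by a constant.
diff = math.asinh(1000.0) - math.log(2 * 1000.0)
print(diff)
```

The transformation therefore agrees with the logarithmic scale for strongly expressed genes and differs from it mainly in the low-intensity region where the logarithm is problematic.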

Kerr et al. [38] propose an analysis of variance (ANOVA) model to capture multiple effects and their interactions. To apply this method, the experiment has to be designed in an ANOVA setting. Dye-swap experiments, for example, fulfill this requirement. This method provides an integrative approach to adjust for extraneous effects and to assign significance to gene expression changes. These are captured in the variety-gene interaction term. Several other methods have been proposed, e.g. [19, 39].

DIFFERENTIAL EXPRESSION

Motivation

The task of analysing a gene expression experiment for differential genes falls into the following steps:

(1) Ranking: genes are ranked according to theirevidence of differential expression.

(2) Assigning significance: a statistical significance is assigned to each gene.

(3) Cut-off value: to arrive at a limited number of differentially expressed genes, a cut-off value for the statistical significance needs to be determined.

Quantifying gene expression differences highly depends on the experimental setting. First, one distinguishes according to whether repetitions are available or whether the measurement has been made only once. In the absence of repetitions, possibilities are very limited. The simplest experimental setting is the comparison of two experimental groups A and B, asking for their differences in each gene. Intuitively, one can use the empirical intensity values of each series A and B and introduce an ordered list of ranked differences between them. Typically, in this 'quick and dirty' approach a fixed cut-off is chosen; frequently this is a fold change of two. That means all genes showing a fold change of more than two are considered to be differential. Here, it is particularly important to perform a variance stabilizing normalization. Otherwise, the changes in variance over the intensities would dominate the analysis for differential genes. In order to detect differential expression, Newton et al. [40] propose an empirical Bayes approach. They use a probability model which accounts for measurement errors and fluctuations in absolute gene expression levels. They deduce estimates for expression changes.

Availability of repetitions provides for a richer spectrum of applicable statistical procedures. We distinguish the experimental setting according to the number of conditions that are compared. Either we compare two groups or multiple groups (Figure 2).

In the two-condition case, one considers either a paired or unpaired situation. Comparing a healthy group with a diseased one is an example for an unpaired experiment because the samples are independent. An example for a paired situation is gene expression measurements of one cell line before and after chemical treatment (Figure 2). The availability of replicates allows for a sound statistical procedure because variation between replicates can be considered. Several methods have been published that provide an appropriate statistical framework for analysing two-condition comparisons (for an overview see [4-7, 18, 41, 42] and the section 'Two-conditional setting and independent multiconditional setting').

In the case of multiple conditions, one distinguishes independent and dependent settings too (Figure 2). The essential difference between these is the linear order of states in the dependent setting. Statistically, each conditional state, e.g. each time point, is dependent on all the others. Cellular differentiation experiments are examples for a dependent testing structure. An example for the independent sample setting is finding differentially expressed genes comparing multiple groups of disease stages (for an overview see [4]). Most commonly, multiconditional experiments are time courses. Several methods that provide a statistical framework for analysing multiconditional settings are introduced subsequently (sections 'Two-conditional setting and independent multiconditional setting' and 'Dependent multiconditional setting').


[Figure 2: decision scheme. Number of conditions = 2: paired or unpaired; number of conditions > 2: independent or time course. Optional gene filtering: biological knowledge, variance, intensity. Methods listed for two conditions: t-statistic based testing, SAM, limma, Wilcoxon (rank sum or signed-rank); for independent multiple conditions: F-statistic, SAM, limma, Kruskal-Wallis; for time courses: splines, regression, HMM, SAM. Final step: control for/estimate FWER or FDR.]

Figure 2: Methods for quantification of differential gene expression in replicated experiments. A scheme displays basic considerations for an appropriate testing procedure aiming at quantifying differential expression. This figure is adapted from [4].

Two-conditional setting and independent multiconditional setting

The availability of replicates makes it possible to rank genes according to their associated t-statistic: $t = m/(\mathrm{std}/\sqrt{n})$, where m is the difference of means across replicates, std the within-group standard deviation and n the number of replicates. F-scores are the straightforward generalization of t-scores to the multiconditional case. Problems arise when genes with small intensity differences between conditions also show very small variances. This might yield high t-scores, so that these genes occupy top ranks. A remedy lies in artificially enlarging these variances.
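The ranking problem can be made concrete with a small sketch. It assumes a paired design in which each gene is represented by n replicate log ratios, so that m is their mean and std their standard deviation; the function name `t_statistic`, the gene names and the values are hypothetical.

```python
import math
import statistics

def t_statistic(log_ratios):
    """t = m / (std / sqrt(n)) for the replicate log ratios of one gene."""
    n = len(log_ratios)
    m = statistics.mean(log_ratios)
    std = statistics.stdev(log_ratios)
    return m / (std / math.sqrt(n))

# Hypothetical replicate log ratios: gene_a changes clearly, gene_b hardly
# changes at all but has an extremely small variance.
genes = {
    "gene_a": [1.0, 2.0, 3.0, 2.0],
    "gene_b": [0.10, 0.11, 0.09, 0.10],
}
scores = {name: t_statistic(values) for name, values in genes.items()}
ranking = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
print(scores, ranking)
```

In this toy example the gene with the negligible change obtains the larger t-score and is ranked first, which is the effect that motivates enlarging small variances.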

Accordingly, a number of methods has been introduced that propose different penalizing factors in the t-statistic [43-46]. Many authors offer freely available computational tools. Table 2 provides an overview of these tools. Lonnstedt and Speed [43] introduce a parametric empirical Bayes approach. In terms of ranking genes according to their evidence of differential expression this is equivalent to a penalized t-statistic [5]: $t = m/\sqrt{(a + \mathrm{std}^2)/n}$. They use the penalty value a, which is estimated from the mean and standard deviation of the variance across samples. Also, Tusher et al. [44] and Efron et al. [45] suggest using a penalizing factor, e.g. the 'fudge factor'. Likewise, low variances are being corrected by proposing an enlarging factor. The approach by Tusher et al. [44] is implemented in the computational tool called significance analysis of microarrays (SAM). Recently, SAM has been updated such that time courses can be analysed, too. The authors developed a Bioconductor package samr [47] as well as an Excel Add-in. Efron et al. [45] suggest applying an additive penalizing factor in the denominator of the t-statistic that is the 90th percentile of the standard deviation across samples. Choosing the penalizing factor to be zero reduces this method to the ordinary t-statistic. Also, Baldi and Long [48] suggest a Bayesian probabilistic approach combined with a modified t-test. Related to the approach by Tusher et al. [44], Broberg [46] suggests a calibrated testing procedure such that estimators for false negative and false positive rates are minimized.
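The effect of a penalty in the denominator can be sketched for the form $t = m/\sqrt{(a + \mathrm{std}^2)/n}$. The sketch assumes replicate log ratios per gene and a fixed, hypothetical penalty a = 0.5; in the cited methods the penalty is estimated from the data, and the function name `penalized_t` is not taken from any package.

```python
import math
import statistics

def penalized_t(log_ratios, a):
    """t = m / sqrt((a + std^2) / n); a = 0 gives the ordinary t-statistic."""
    n = len(log_ratios)
    m = statistics.mean(log_ratios)
    var = statistics.variance(log_ratios)  # std^2
    return m / math.sqrt((a + var) / n)

# Hypothetical replicate log ratios (large change vs tiny, very stable change).
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [0.10, 0.11, 0.09, 0.10]

for a in (0.0, 0.5):  # 0.5 is an arbitrary illustrative penalty
    print(a, penalized_t(gene_a, a), penalized_t(gene_b, a))
```

Without penalty the stable gene with the tiny change scores highest; with the penalty its score collapses while the gene with the large change keeps a high score.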


Several linear model approaches for ranking gene expression differences have been introduced [38, 41, 49, 50]. Kerr et al. [38] use ANOVA models for an integrated procedure of normalization and detection of differentially expressed genes. They assume a linear model of specific effects for log intensities of all genes. These effects might be dye, slide, treatment, gene effects and their respective interactions. Smyth et al. [41] propose a modified t-statistic that is proportional to the t-statistic with sample variance offset as used in [44-46]. The approach can be generalized for the multiconditional case. It has been implemented in the Bioconductor package limma [1, 41]. Using this package, experimental setting, duplicate spots and quality weights can also be considered. The moderated t-statistic is calculated, genes are ranked with respect to the resulting scores and P-values can be assigned. Further developments focus on linear models in a gene-wise manner [49]. Also, Jain et al. [51] propose a modified t-statistic. Lin et al. [50] use a robust linear model for each single gene to estimate contrasts of all pairwise comparisons of tested groups.

Furthermore, a number of rank-based (thus non-parametric) approaches have been developed. These are based on a Wilcoxon rank sum test or permutation t-test. While t-test and F-test based methods assume that the intensity measurements of normalized ratios are normally distributed, rank-based approaches do not do so. Instead of considering numerical values, Wilcoxon rank sum tests use ranks. This is a more robust approach, although frequently with lower power, because one loses information by switching from the numerical to the rank scale. In the multiconditional case, the Kruskal-Wallis test is the straightforward generalization of the Wilcoxon test. Yan et al. [52] present a non-parametric method based on the statistic of relative entropy between two distributions. For the assignment of significance, resampling-based permutations [53] are applied [52].
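The robustness gained by working on ranks can be illustrated with the rank sum statistic itself. This sketch only computes the statistic (no P-value); the function name `rank_sum` and the intensities are hypothetical, and ties are handled with midranks.

```python
def rank_sum(group_a, group_b):
    """Return (W, U): rank sum of group_a in the pooled data and the U statistic."""
    pooled = sorted(group_a + group_b)

    def midrank(value):
        first = pooled.index(value) + 1   # 1-based rank of first occurrence
        count = pooled.count(value)       # tied values share the mean rank
        return first + (count - 1) / 2

    w = sum(midrank(v) for v in group_a)
    n_a = len(group_a)
    u = w - n_a * (n_a + 1) / 2
    return w, u

# Hypothetical normalized intensities of one gene in two groups.
group_b = [5.2, 6.4, 7.0]
print(rank_sum([7.1, 8.3, 9.0], group_b))
# Replacing one value by an extreme outlier leaves the statistic unchanged.
print(rank_sum([7.1, 8.3, 900.0], group_b))
```

The outlier changes neither W nor U because only the ordering of the values enters; the same property is the reason for the loss of information and hence of power mentioned above.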

Dependent multiconditional setting

Time course experiments arise for example from cell differentiation processes and constitute multiconditional experiments. Each time point represents one conditional state. Thus, all experiments corresponding to one time point build up one conditional group. The essential difference compared with independent cases is the linear order of states. Statistically, each conditional state, e.g. each time point, is dependent on all the others. This fact requires new concepts of the statistical procedure. Bar-Joseph [54] provides an overview of several recent developments in analysing time course gene expression data.

Recently, the original form of SAM [44] has been generalized to time course experimental settings. Time is included as one covariate. Storey et al. [55] propose a statistical framework specifically designed for time course analysis. This spline-based approach has been implemented in the open-source software package EDGE (Table 2). To assign significance to each gene or group of genes they use a t-statistic and F-statistic related approach. Guo et al. [56] introduce a robust statistic which is based on the Wald statistic. There, time-relevant dependencies within the gene intensity data set are explicitly integrated. To assign significance, either recent versions of SAM [44] or the approach of [57] might be applied. Xu et al. [58] suggest an approach using regression analysis. To estimate the parameters of the regression model they apply least squares estimates. Standard errors are assessed using estimating techniques as introduced in [59]. Significance levels are assigned based on a Z-statistic. Bar-Joseph et al. [60] use cubic splines to describe gene expression time courses, and significance is assigned by comparing global differences of two aligned curves. Schliep et al. [61] suggest using Hidden Markov models (HMM) for the analysis of time course gene expression data. External biological knowledge can be integrated using a partially supervised learning approach. External biological knowledge might, for example, be the expression behaviour of several master genes that is known beforehand. The method has been implemented in the freely available software package GQL (Table 2).

Cut-off and multiple testing

After ranking the genes according to a statistical procedure, one has to find a cut-off above which biologically meaningful information is expected. Frequently, researchers choose the P-value cut-off of 0.05 and assume all genes showing a lower P-value to be biologically significant. Performing many tests at a time, however, aggravates the problem of falsely significant genes. Roughly speaking, when performing 10 000 tests one expects 5% of the genes, i.e. about 500, to show a P-value of less than 0.05 just due to chance.


There are a number of multiple testing approaches to overcome this problem. One possibility to alleviate the problem is to reduce the number of statistical tests by filtering steps. Thus, we have to find a criterion by which the number of testing procedures can be limited. This might be either external biological knowledge or variance across conditions. That means the set of intensities can be reduced by neglecting genes from which we do not expect any biological information. Alternatively, one could use only those genes that show a certain minimal amount of variance over all conditional states or apply intensity-based filtering, e.g. neglecting very lowly expressed genes. For oligo-array experimental settings, Pounds and Cheng [62] suggest a filtering procedure using the P-values of present/absent calls. They combine these to one summary P-value that is used for filtering. They also discuss that there might be cases where filtering does not necessarily improve the detection of differentially expressed genes.

Given a type I error rate (i.e. a false positive rate), controlling for multiple testing means correcting P-values such that the given error rate can be guaranteed across all tests. Methods can be divided into those that control the family-wise error rate (FWER) and those that control the false discovery rate (FDR). The FWER is the probability of at least one type I error among the genes declared significant. The FDR is the expected proportion of type I errors among the rejected hypotheses. For an overview see [63-65].
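In symbols, and using notation that is ours rather than that of the original text, let $V$ denote the number of true null hypotheses that are rejected (the type I errors) and $R$ the total number of rejected hypotheses. The two error rates can then be written as:

```latex
\mathrm{FWER} = \Pr(V \geq 1), \qquad
\mathrm{FDR} = \mathrm{E}\!\left[\frac{V}{\max(R, 1)}\right]
```

The term $\max(R, 1)$ in the denominator reflects the usual convention that the proportion is set to zero when no hypothesis is rejected.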

The so-called Bonferroni correction is an extremely conservative approach: the significance level is divided by the number of tests performed. This one-step multiplicity adjustment controls the FWER. Holm [66] suggests a stepwise procedure which improves the power. Westfall and Young [53] suggest a resampling method to adjust P-values.
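The Bonferroni and Holm adjustments can be sketched as follows. Instead of dividing the significance level, the sketch uses the equivalent formulation of multiplying the P-values, so that the adjusted values can be compared directly with the chosen level. The function names and the example P-values are ours.

```python
def bonferroni(p_values):
    """One-step adjustment: multiply each P-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Step-down adjustment of Holm: the k-th smallest P-value (k = 0, 1, ...)
    is multiplied by (m - k), and monotonicity is enforced with a running
    maximum."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw P-values for four genes.
raw = [0.001, 0.02, 0.03, 0.5]
print("Bonferroni:", bonferroni(raw))
print("Holm:      ", holm(raw))
```

Holm-adjusted P-values are never larger than the Bonferroni-adjusted ones, which is the gain in power referred to above.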

While these methods control the FWER, Benjamini and Hochberg [67] suggest a less conservative approach by controlling the FDR instead. Different modifications of this approach have been proposed [65, 67-76]. Storey and Tibshirani [77] propose using Q-values, which measure statistical significance in terms of the FDR rather than the false positive rate, as is the case for P-values. Estimation of the FDR, as proposed in [78], is implemented in SAM. Efron et al. [45] suggest using the local FDR. Given a score for a certain gene, the local FDR is the probability that the gene is not differentially expressed, conditioned on the observed test score. Scheid and Spang [64] derive an estimator for the local false discovery rate. The procedure is implemented in the Bioconductor package twilight.
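The step-up procedure of Benjamini and Hochberg can be sketched as follows. This is a minimal illustration of the adjustment itself, not of the Q-value or local FDR estimators mentioned above; the function name and the example P-values are ours.

```python
def benjamini_hochberg(p_values):
    """Step-up adjustment: the P-value of rank k (1-based, ascending) is
    multiplied by m / k, and monotonicity is enforced with a running
    minimum taken from the largest P-value downwards."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        candidate = p_values[idx] * m / (rank + 1)
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P-values for four genes.
raw = [0.001, 0.02, 0.03, 0.5]
print("Benjamini-Hochberg:", benjamini_hochberg(raw))
```

Genes whose adjusted value falls below a chosen level, e.g. 0.05, are declared significant with the FDR controlled at that level, under the independence assumptions of the original procedure.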

CONCLUSION

Starting with raw gene expression measurements, we summarized numerous approaches to arrive at biologically meaningful expression datasets. We outlined the importance of preprocessing and normalization, various aspects of ranking genes according to a statistic for differential expression, assigning significance to expression changes and deriving meaningful cut-offs [1-3, 5]. There are freely available software packages that enable methodologically sound analyses of microarray data. We provided a brief overview of different tools and recommend using the open source R packages in the Bioconductor project. These packages not only allow for the integration of various kinds of biological data but are also rapidly evolving and provide current statistical approaches.

Although this review has attempted to present normalization procedures separately from the search for differential genes, it must be realized that the two tasks are in fact linked. Normalization by transformation assumes that most of the genes represented on an array remain unchanged upon a change of condition. This, in turn, means that the normalization method already implicitly flags the other genes as differential, and it is these that are more likely to be found in the search for differential genes. Thus, the two problems really are one. One recommendable combination, for example, is to apply variance stabilization, followed by a modified t-test and multiple testing correction using the FDR. Variance stabilization overcomes many drawbacks of other methods, as outlined before. The choice of the test highly depends on the experimental setting. In our experience, modified t-tests have proven valuable in many cases.
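The first two steps of the recommended combination can be sketched as follows. The sketch rests on several assumptions of ours: the variance-stabilizing transformation is taken to be an arsinh transformation with fixed, hypothetical parameters `a` and `b` (dedicated software estimates such parameters from the data), and the modified t-statistic is taken to be of the SAM type, with a small constant `s0` added to the standard error. The intensities are invented.

```python
import math
from statistics import mean, variance


def arsinh_transform(intensities, a=0.0, b=1.0):
    """Variance-stabilizing transformation h(x) = arsinh(a + b * x).
    a and b are hypothetical fixed calibration parameters."""
    return [math.asinh(a + b * x) for x in intensities]


def modified_t(group1, group2, s0):
    """Two-sample t-type statistic with a constant s0 added to the standard
    error, which damps large scores caused by very small variances.
    With s0 = 0 this is the ordinary pooled-variance t-statistic."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(group1) - mean(group2)) / (standard_error + s0)


# Hypothetical raw intensities of one gene under two conditions.
control = arsinh_transform([1200.0, 1500.0, 1350.0], b=0.01)
treated = arsinh_transform([2900.0, 3400.0, 3100.0], b=0.01)

print("ordinary t:", modified_t(treated, control, s0=0.0))
print("modified t:", modified_t(treated, control, s0=0.1))
```

The resulting scores would then be converted to P-values, for instance by permutation, and adjusted for multiple testing using an FDR-controlling procedure.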

Practically, however, appropriate experimental design is crucial for achieving a biologically meaningful interpretation of the experiment. Otherwise, computational analysis needs to focus on troubleshooting rather than providing a solid procedure for biological hypothesis generation. For example, when searching for differentially expressed genes, replicates are indispensable for assigning a statistical significance to the changes. Furthermore, the smaller the expected expression changes, the more important repetitions are. When working with two-dye arrays, dye-swap experiments may offer an additional opportunity to cheaply generate data for normalization, in particular when only small amounts of sample material are available. An experiment that, right from the start, is designed to be evaluated by the ANOVA approach may minimize the number of hybridizations necessary to answer a particular question.

In microarray technology, the large number of genes that can be tested is of great appeal to the experimenter. At the same time, this is the statistical curse of the method. The fact that the number of genes on the array is typically much larger than the number of conditions is what makes it so difficult to analyse the data in a statistically sound manner. Remedies lie in filtering techniques and appropriate corrections for multiple testing. In addition, on the level of functional interpretation more information can be gained, e.g. by searching for overrepresentation of genes belonging to a particular functional category, by combining gene expression analysis with gene function prediction, or by elucidating biological networks [79-81].
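One common way to assess overrepresentation of a functional category, which the text does not prescribe and which we assume here only for illustration, is the upper tail of the hypergeometric distribution. The function name and all counts below are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_genes, n_category, n_selected, n_overlap):
    """Probability of observing at least n_overlap category members among
    n_selected genes drawn without replacement from n_genes genes, of which
    n_category belong to the category."""
    total = comb(n_genes, n_selected)
    upper = min(n_category, n_selected)
    tail = sum(
        comb(n_category, k) * comb(n_genes - n_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 000 genes on the array, 200 in the category,
# 300 genes called differential, 15 of which belong to the category.
p_value = hypergeom_upper_tail(10_000, 200, 300, 15)
print("overrepresentation P-value:", p_value)
```

When many categories are examined, the resulting P-values are themselves subject to the multiple testing problem discussed in this review.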

Key Points

• Microarray technology has been improved over the last decade, and at the same time many computational analysis tools have been proposed.

• To enable biological interpretation, appropriate computational preprocessing including normalization is essential. One recommendable normalization method is variance stabilization.

• The choice of the test for differential gene expression highly depends on the experimental setting. Many settings allow for modified t-test procedures.

• Due to the problem of high numbers of genes and few samples, multiple testing corrections are necessary, for example those using the FDR.

• We stress that normalization and differential gene discovery should be regarded as necessarily linked, in the sense that normalization strongly determines which genes will be found to be differential.

Acknowledgements

The authors thank Stefanie Scheid and Anja von Heydebreck for their critical reading of the manuscript and useful suggestions. We acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG), Sonderforschungsbereich (SFB) 618: Theoretical Biology: Robustness, Modularity and Evolutionary Design of Living Systems.

References

1. Smyth GK, Speed TP. Normalization of cDNA microarray data. Methods 2003;31:265-73.

2. Huber W, Heydebreck Av, Vingron M. Analysis of Microarray Gene Expression Data. Chichester: John Wiley & Sons, 2003.

3. Steinhoff C, Vingron M. Normalization Strategies for Microarray Data Analysis. Taylor & Francis Group, 2005.

4. Scheid S, Spang R. Microarray data analysis: Differential gene expression. Taylor & Francis Group, 2005.

5. Smyth GK, Yang YH, Speed TP. Statistical Issues in cDNA Microarray Data Analysis. Totowa: Humana Press, 2002.

6. Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 2002;18(4):546-54.

7. Elo L, Aittokallio T, Filen S, et al. The effect of replication on gene rankings: a practical comparison of methods for detecting differential expression in microarray experiments. In: Bioinformatics Research and Education Workshop, Berlin, 2005.

8. Ihaka R, Gentleman R. R: a language for data analysis and graphics. J Comput Graph Stat 1996;5:299-314.

9. Gentleman R, Carey V, Bates D, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 2004;5(10):R80.

10. Speed T. Statistical Analysis of Gene Expression Microarray Data. CRC Press, 2003.

11. Simon R, Korn EL, McShane LM, et al. Design and Analysis of DNA Microarray Investigations. Springer, 2003.

12. Parmigiani G, Garrett ES, Irizarry R (eds). The Analysis of Gene Expression Data. Springer, 2003.

13. Lee M-LT. Analysis of Microarray Gene Expression Data. Kluwer Academic Publishers, 2004.

14. Gentleman R, Carey V, Huber W, et al. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, 2005.

15. Nuber UA. DNA Microarrays. Taylor & Francis Group: Garland Science Publishing, 2005.

16. Dudoit S, Yang YH, Callow MJ, et al. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002;12:111-39.

17. Yang IV, Chen E, Hasseman JP, et al. Within the fold: assessing differential expression measures and reproducibility in microarray assays. Genome Biol 2002;3(11):research0062.1-0062.12.

18. Huber W, Heydebreck Av, Vingron M. Low-level analysis of microarray experiments. Wiley-VCH, 2005.

19. Cui X, Kerr MK, Churchill GA. Transformation for cDNA Microarray Data. Stat Appl Genet Mol Biol 2003;2(1):Article 4.

20. Kreil DP. There is no silver bullet - a guide to low-level data transforms and normalisation methods for microarray data. Brief Bioinform 2005;6(1):86-97.

21. Irizarry RA, Hobbs B, Collin F, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003;4(2):249-64.

22. Irizarry RA, Bolstad BM, Collin F, et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003;31(4):e15.

23. Wu Z, Irizarry RA, Gentleman R, et al. A model-based background adjustment for oligonucleotide expression arrays. J Am Stat Assoc 2004;99(468):909-17.

24. Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. PNAS 2001;98:31-6.

25. Choe SE, Boutros M, Michelson AM, et al. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. BMC Bioinformatics 2005;6:R16.

26. Yang YH, Buckley MJ, Dudoit S, et al. Comparison of methods for image analysis on cDNA microarray data. J Comput Graph Statist 2002;11(1):108-36.

27. Schena M, Shalon D, Davis RW, et al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995;270(5235):467-70.

28. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997;278(5338):680-6.

29. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531-7.

30. Virtaneva K, Wright FA, Tanner SM, et al. Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. PNAS 2001;98:1124-9.

31. Yang YH, Dudoit S, Luu P, et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 2002;30(4):e15.

32. Workman C, Jensen LJ, Jarmer H, et al. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol 2002;3(9):research0048.1-0048.16.

33. Bolstad BM, Irizarry RA, Astrand M, et al. A comparison of normalization methods for high density oligonucleotide array based on variance and bias. Bioinformatics 2003;19(2):185-93.

34. Rocke DM, Durbin BP. A model for measurement error for gene expression arrays. J Computat Biol 2001;8(6):557-69.

35. Huber W, Heydebreck Av, Sültmann H, et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002;18(Suppl 1):S96-104.

36. Durbin BP, Hardin JS, Hawkins DM, et al. A variance stabilizing transformation for gene-expression microarray data. Bioinformatics 2002;18(Suppl 1):S96-104.

37. Chen Y, Dougherty ER, Bittner ML. Ratio based decisions and the quantitative analysis of cDNA microarray images. J Biomed Opt 1997;2:364-74.

38. Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol 2000;7(6):819-37.

39. Kepler TB, Crosby L, Morgan KT. Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biol 2002;3(7):research0037.1-0037.12.

40. Newton MA, Kendziorski CM, Richmond CS, et al. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001;8(1):37-52.

41. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1).

42. Troyanskaya OG, Garber ME, Brown PO, et al. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics 2002.

43. Lonnstedt I, Speed TP. Replicated microarray data. Statistica Sinica 2002;12:31-46.

44. Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001;98:5116-24.

45. Efron B, Tibshirani R, Storey JD, et al. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 2001;96(456):1151-60.

46. Broberg P. Statistical methods for ranking differentially expressed genes. Genome Biol 2003;4:R41.

47. Tibshirani R, Chu G, Hastie T. The samr Package. Bioconductor, 2005.

48. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001;17(6):509-19.

49. Thomas JG, Olson JM, Tapscott SJ, et al. An efficient approach and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res 2001;11(7):1227-36.

50. Lin DM, Yang YH, Scolnick JA, et al. Spatial patterns of gene expression in the olfactory bulb. PNAS 2004;101(34):12718-23.

51. Jain N, Thatte J, Braciale T, et al. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003;19(15):1945-51.

52. Yan X, Deng M, Fung WK, et al. Detecting differentially expressed genes by relative entropy. J Theor Biol 2005;234:395-402.

53. Westfall PH, Young SS. Re-Sampling Based Multiple Testing. New York: Wiley, 1993.

54. Bar-Joseph Z. Analyzing time series gene expression data. Bioinformatics 2004;20(16):2493-503.

55. Storey JD, Xiao W, Leek JT, et al. Significance analysis of time course microarray experiments. PNAS 2005, in press.

56. Guo X, Qi H, Verfaillie CM, et al. Statistical significance analysis of longitudinal gene expression data. Bioinformatics 2003;19(13):1628-35.

57. Pan W, Lin J, Le CT. A mixture model approach to detecting differentially expressed genes with microarray data. Funct Integr Genomics 2003;3(3):117-24.

58. Xu XL, Olson JM, Zhao LP. A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model. Hum Mol Genet 2002;11(17):1977-85.

59. Zhao LP, Prentice RL, Breeden L. Statistical modeling of large microarray data sets to identify stimulus-response profiles. PNAS 2001;98:5631-6.

60. Bar-Joseph Z, Gerber G, Jaakkola T, et al. Comparing the continuous representation of time series expression profiles to identify differentially expressed genes. PNAS 2003;100:10146-51.

61. Schliep A, Costa IG, Steinhoff C, et al. Analyzing gene expression time-courses. IEEE Trans Comput Biol 2005;2(3):179-93.

62. Pounds S, Cheng C. Statistical development and evaluation of microarray gene expression data filters. J Comput Biol 2005;12(4):482-95.

63. Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Stat Sci 2003;18(1):71-103.

64. Scheid S, Spang R. A stochastic downhill search algorithm for estimating the local false discovery rate. IEEE Trans Comput Biol 2004;1(3):98-108.

65. Tsai CA, Hsueh HM, Chen JJ. Estimation of false discovery rates and multiple testing: application to gene microarray data. Biometrics 2003;59(4):1071-81.

66. Holm S. A simple sequentially rejective multiple test procedure. Scand J Statist 1979;6:65-70.

67. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 1995;57(1):289-300.

68. Benjamini Y, Hochberg Y. On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat 2000;25(1):60-83.

69. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple hypothesis testing under dependencies. Ann Statist 2001;29(4):1165-88.

70. Storey JD. The positive false discovery rate: a Bayesian interpretation and the Q-value. Ann Math Statist 2003;31(6):2013-35.

71. Benjamini Y, Liu W. A step-down multiple hypothesis testing procedure that controls the false discovery rate under independence. J Stat Plan Infer 1999;82:163-70.

72. Yekutieli D, Benjamini Y. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J Stat Plan Infer 1999;82(1-2):171-96.

73. Allison DB, Gadbury GL, Heo M, et al. A mixture model approach for the analysis of microarray gene expression data. Comput Statist Data Anal 2002;39(1):1-20.

74. Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of P-values. Bioinformatics 2003;19(10).

75. Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003;19(3):368-75.

76. Pounds S, Cheng C. Improving false discovery rate estimation. Bioinformatics 2004;20(11):1737-45.

77. Storey JD, Tibshirani R. Statistical significance for genome-wide studies. PNAS 2003;100(16):9440-5.

78. Storey JD. A direct approach to false discovery rates. J R Statist Soc B 2002;64(3):479-98.

79. Troyanskaya OG. Putting microarrays in a context: integrated analysis of diverse biological data. Brief Bioinformat 2005;6(1):34-43.

80. Stuart JM, Segal A, Koller D, et al. A gene-coexpression network for global discovery of conserved genetic modules. Science 2003;302:249-54.

81. McCarroll SA, Murphy CT, Zou S, et al. Comparing genomic expression patterns across species identifies shared transcriptional profile in aging. Nature Genet 2004;36(2):197-204.

82. Dudoit S, Yang YH. Bioconductor R packages for exploratory analysis and normalisation of cDNA microarray data. New York: Springer, 2002.
