gene expression microarray: differential expression, data artifacts · 2020-07-21 · tusher et...

Gene expression microarray: Differential expression, data artifacts

Outline
• Scientific goal and potential problems.
• Review of basic statistical concepts:
– Hypothesis testing.
– Multiple comparison problem.
• Differential expression (DE) test methods:
– SAM.
– Empirical Bayes (EB) methods: limma.
– Complex designs.
– Permutation.
• Data artifacts:
– Technical artifacts: batch effect.
– Biological artifacts: cell type mixture.

Input data for DE test

• Assume data are correctly within- and between-array normalized.

• Input data for the DE test is a matrix of positive numbers:
– Rows for genes and columns for samples.
– Usually work on the logarithm of the data, which is normal-ish.

Goal of DE test

• Goal: find genes that are expressed differently between (among) conditions.
– Assign a score to each gene to represent the statistical significance of its being different.
– Rank the genes according to the score.
– Find a proper threshold on the score for calling DE.
• Easy solutions:
– Hypothesis testing (t-test, ANOVA, linear model, etc.) to get p-values and use them as scores.
– Use the canonical cutoff (0.05) to call DE.

Potential problems

• Hypothesis testing:
– Sample sizes are usually small, which leads to unstable test results.
– When data are not normal, p-values are not accurate.
• Using 0.05 as the p-value threshold to call DE: multiple comparison problem.

Review of statistical inference
• Two-group t-test:
– Data from two groups (cancer and normal): $X_1, \ldots, X_M$; $Y_1, \ldots, Y_N$.
– Assume the X's and Y's are normally distributed.
– The “null hypothesis” is that the means of the X's and Y's are identical.
– Test statistic:

$$ t = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/M + S_Y^2/N}} $$

where

$$ S_X^2 = \frac{1}{M-1}\sum_{i=1}^{M}(X_i - \bar{X})^2, \qquad S_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 $$

– Under the null hypothesis, t follows a t-distribution.
– P-value: under the null hypothesis (that the means are the same), the probability of observing a t-statistic more extreme than the observed one.
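The test statistic above can be checked numerically. The following is a minimal sketch in base R on simulated values for a single gene (the data, group sizes and seed are made up for illustration); it computes t by hand and compares it with R's built-in unequal-variance t-test.

```r
# Simulated log-expression values for one gene (hypothetical data).
set.seed(1)
x <- rnorm(5, mean = 8.0, sd = 0.5)   # group X (e.g., cancer), M = 5
y <- rnorm(5, mean = 8.5, sd = 0.5)   # group Y (e.g., normal), N = 5

M <- length(x)
N <- length(y)

# var() uses the 1/(n-1) denominator, matching S_X^2 and S_Y^2 above.
t_stat <- (mean(x) - mean(y)) / sqrt(var(x) / M + var(y) / N)

# Cross-check against the built-in unequal-variance (Welch) t-test.
tt <- t.test(x, y, var.equal = FALSE)
print(c(manual = t_stat, builtin = unname(tt$statistic)))
print(tt$p.value)   # two-sided p-value
```

With only five samples per group the variance estimates in the denominator are themselves noisy, which is the issue the following slides address.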

Why gene-by-gene t-test is a bad idea

• Small sample size (e.g., 3 vs. 3) leads to unstable estimates of variances.
– By chance some genes have very small variance, which results in large t-statistics and tiny p-values even when the difference is small.
– Solution: SAM, EB methods.
• Sometimes data are not normally distributed, which leads to incorrect p-values.
– Solution: non-parametric approach to obtain p-values.

SAM t-test
Tusher et al. (2001) PNAS

• Try to remove (or minimize) the dependence of the test statistic on the variance (because a small variance tends to lead to a bigger test statistic).
• Solution: add a small constant to the denominator when calculating the t-statistic:

$$ d_i = \frac{\bar{y}_i - \bar{x}_i}{s_i + s_0} $$

– $\bar{y}_i, \bar{x}_i$: means of the two groups for gene i.
– $s_i$: standard deviation for gene i, assuming equal variance in both groups.
– $s_0$: "exchangeability factor" estimated using all genes.

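The exchangeability factor
• Chosen to make the signal-to-noise ratios independent of the signal, i.e., to make the distribution of the statistic independent of the variance.
• Procedure:
– Let $s^{\alpha}$ be the $\alpha$ percentile of the $s_i$ values.
– For $\alpha \in (0, 0.01, 0.02, \ldots, 1.0)$ compute $d_i^{\alpha} = (\bar{y}_i - \bar{x}_i) / (s_i + s^{\alpha})$.
– Compute $v_j^{\alpha} = \mathrm{mad}(d_i^{\alpha} \mid s_i \in [q_j, q_{j+1}))$, $j = 1, 2, \ldots, 99$; here the $q_j$ are quantiles of the $s_i$.
– Compute $cv(\alpha)$, the coefficient of variation of the $v_j^{\alpha}$.
– Choose $\hat{\alpha} = \arg\min_{\alpha} \{cv(\alpha)\}$ and set $s_0 = s^{\hat{\alpha}}$.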
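The procedure above can be sketched in a few lines of base R. This is an illustrative simplification, not the siggenes or SAM implementation: the function name `choose_s0`, the simulated inputs and the use of 100 percentile bins are assumptions made for this example.

```r
# diff: per-gene difference in group means; s: per-gene standard deviation.
# choose_s0 is a hypothetical helper written for this sketch.
choose_s0 <- function(diff, s, alphas = seq(0, 1, by = 0.01), n_bins = 100) {
  # Bin the genes by percentiles of s (the intervals [q_j, q_{j+1})).
  q <- quantile(s, probs = seq(0, 1, length.out = n_bins + 1))
  bin <- cut(s, breaks = unique(q), include.lowest = TRUE)
  cv <- sapply(alphas, function(a) {
    s_a <- quantile(s, probs = a)        # candidate constant s^alpha
    d <- diff / (s + s_a)                # statistic d_i^alpha
    v <- tapply(d, bin, mad)             # spread of d within each bin
    sd(v, na.rm = TRUE) / mean(v, na.rm = TRUE)   # coefficient of variation
  })
  best <- alphas[which.min(cv)]
  list(alpha = best, s0 = unname(quantile(s, probs = best)), cv = cv)
}

# Simulated inputs (hypothetical): 2000 genes with varying noise levels.
set.seed(2)
n <- 2000
s <- 0.3 * sqrt(rchisq(n, df = 4) / 4)
diff <- rnorm(n, sd = s)
res <- choose_s0(diff, s)
print(res$alpha)
print(res$s0)
```

The chosen `s0` is the percentile of the `s` values at which the spread of the statistic is most uniform across variance bins.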

SAM t-test

• Highly cited (>13,000 citations as of 2020), http://www-stat.stanford.edu/~tibs/SAM/.

• Implemented as Bioconductor package siggenes, and Excel plugin.

• Follow-up work: SAMSeq for RNA-seq DE tests.
• Limitations: the solution for s0 is often sensitive to the data.


Empirical Bayes method from limma
Smyth (2004) Statistical Applications in Genetics and Molecular Biology

• Highly cited (>11,500 citations as of 2020).
• Uses a Bayesian hierarchical model in a multiple regression setting.
• Borrows information from all genes to estimate gene-specific variances.
– As a result, variance estimates are “shrunk” toward the mean of all variances, which alleviates the problem of very small variances.
• Implemented in Bioconductor package “limma”.

to be evaluated at $\hat{\alpha}_g$ and the dependence is assumed to be such that it can be ignored to a first order approximation.

Let $v_{gj}$ be the jth diagonal element of $C^T V_g C$. The distributional assumptions made in this paper about the data can be summarized by

$$ \hat{\beta}_{gj} \mid \beta_{gj}, \sigma_g^2 \sim N(\beta_{gj}, v_{gj}\sigma_g^2) $$

and

$$ s_g^2 \mid \sigma_g^2 \sim \frac{\sigma_g^2}{d_g}\chi^2_{d_g} $$

where $d_g$ is the residual degrees of freedom for the linear model for gene g. Under these assumptions the ordinary t-statistic

$$ t_{gj} = \frac{\hat{\beta}_{gj}}{s_g\sqrt{v_{gj}}} $$

follows an approximate t-distribution on $d_g$ degrees of freedom.

The hierarchical model defined in the next section will assume that the estimators $\hat{\beta}_g$ and $s_g^2$ from different genes are independent. Although this is not necessarily a realistic assumption, the methodology which will be derived makes qualitative sense even when the genes are dependent, as they likely will be for data from actual microarray experiments.

This paper focuses on the problem of testing the null hypotheses $H_0: \beta_{gj} = 0$ and aims to develop improved test statistics. In many gene discovery experiments for which microarrays are used the primary aim is to rank the genes in order of evidence against $H_0$ rather than to assign absolute p-values (Smyth et al, 2003). This is because only a limited number of genes may be followed up for further study regardless of the number which are significant. Even when the above distributional assumptions fail for a given data set it may still be that the tests statistics perform well from a ranking the point of view.

3 Hierarchical Model

Given the large number of gene-wise linear model fits arising from a microarray experiment, there is a pressing need to take advantage of the parallel structure whereby the same model is fitted to each gene. This section defines a simple hierarchical model which in effect describes this parallel structure. The key is to describe how the unknown coefficients $\beta_{gj}$ and unknown variances $\sigma_g^2$ vary across genes. This is done by assuming prior distributions for these sets of parameters.

Prior information is assumed on $\sigma_g^2$ equivalent to a prior estimator $s_0^2$ with $d_0$ degrees of freedom, i.e.,

$$ \frac{1}{\sigma_g^2} \sim \frac{1}{d_0 s_0^2}\chi^2_{d_0}. $$

This describes how the variances are expected to vary across genes. For any given j, we assume that a $\beta_{gj}$ is non zero with known probability

$$ P(\beta_{gj} \neq 0) = p_j. $$

Then $p_j$ is the expected proportion of truly differentially expressed genes. For those which are nonzero, prior information on the coefficient is assumed equivalent to a prior observation equal to zero with unscaled variance $v_{0j}$, i.e.,

$$ \beta_{gj} \mid \sigma_g^2, \beta_{gj} \neq 0 \sim N(0, v_{0j}\sigma_g^2). $$

This describes the expected distribution of log-fold changes for genes which are differentially expressed. Apart from the mixing proportion $p_j$, the above equations describe a standard conjugate prior for the normal distributional model assumed in the previous section. In the case of replicated single sample data, the model and prior here is a reparametrization of that proposed by Lonnstedt and Speed (2002). The parametrizations are related through $d_g = f$, $v_g = 1/n$, $d_0 = 2\nu$, $s_0^2 = a/(d_0 v_g)$ and $v_0 = c$ where $f$, $n$, $\nu$ and $a$ are as in Lonnstedt and Speed (2002). See also Lonnstedt (2001). For the calculations in this paper the above prior details are sufficient and it is not necessary to fully specify a multivariate prior for the $\beta_g$.

Under the above hierarchical model, the posterior mean of $\sigma_g^{-2}$ given $s_g^2$ is $\tilde{s}_g^{-2}$ with

$$ \tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}. $$

The posterior values shrink the observed variances towards the prior values with the degree of shrinkage depending on the relative sizes of the observed and prior degrees of freedom. Define the moderated t-statistic by

$$ \tilde{t}_{gj} = \frac{\hat{\beta}_{gj}}{\tilde{s}_g\sqrt{v_{gj}}}. $$

This statistic represents a hybrid classical/Bayes approach in which the posterior variance has been substituted into to the classical t-statistic in place of the usual sample variance. The moderated t reduces to the ordinary t-statistic if $d_0 = 0$ and at the opposite end of the spectrum is proportion to the coefficient $\hat{\beta}_{gj}$ if $d_0 = \infty$.

In the next section the moderated t-statistics $\tilde{t}_{gj}$ and residual sample variances $s_g^2$ are shown to be distributed independently. The moderated t is shown to follow a t-distribution under the null hypothesis $H_0: \beta_{gj} = 0$ with degrees of freedom $d_g + d_0$. The added degrees of freedom for $\tilde{t}_{gj}$ over $t_{gj}$ reflect the extra information which is borrowed, on the basis of the hierarchical model, from the ensemble of genes for inference about each individual gene. Note that this distributional result assumes $d_0$ and $s_0^2$ to be given values. In practice these values need to be estimated from the data as described in Section 6.

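The hierarchical model

Let $\beta_{gj}$ be the coefficient (difference in means in the two-group setting) for gene g, factor j, and assume

$$ \hat{\beta}_{gj} \mid \beta_{gj}, \sigma_g^2 \sim N(\beta_{gj}, v_{gj}\sigma_g^2), \qquad s_g^2 \mid \sigma_g^2 \sim \frac{\sigma_g^2}{d_g}\chi^2_{d_g} $$

with priors:

$$ P(\beta_{gj} \neq 0) = p_j, \qquad \beta_{gj} \mid \sigma_g^2, \beta_{gj} \neq 0 \sim N(0, v_{0j}\sigma_g^2), \qquad \frac{1}{\sigma_g^2} \sim \frac{1}{d_0 s_0^2}\chi^2_{d_0} $$

Posterior statistics

Posterior variance:

$$ \tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g} $$

Moderated t-statistic for testing $\beta_{gj} = 0$:

$$ \tilde{t}_{gj} = \frac{\hat{\beta}_{gj}}{\tilde{s}_g\sqrt{v_{gj}}} $$

Under the null hypothesis, the moderated t follows a t-distribution with $d_g + d_0$ degrees of freedom.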
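To make the shrinkage concrete, here is a minimal base R sketch of the posterior variance and the moderated t-statistic for a simulated 3 vs. 3 comparison. The prior values `d0` and `s2_0` are fixed by hand purely for illustration (an assumption of this sketch); limma estimates them from the data, and this is not limma's code.

```r
set.seed(3)
n_genes <- 1000
n_per_group <- 3

# Simulated log-expression matrix: genes in rows, 3 + 3 samples in columns.
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
grp <- rep(c(0, 1), each = n_per_group)

# Coefficient: difference in group means for each gene.
beta_hat <- rowMeans(expr[, grp == 1]) - rowMeans(expr[, grp == 0])

# Pooled residual variance (equal group sizes) with d_g residual df.
d_g <- 2 * n_per_group - 2
s2_g <- (apply(expr[, grp == 0], 1, var) + apply(expr[, grp == 1], 1, var)) / 2
v_g <- 1 / n_per_group + 1 / n_per_group   # unscaled variance of beta_hat

# ASSUMED prior values, chosen by hand for this illustration only.
d0 <- 4
s2_0 <- 1

s2_tilde <- (d0 * s2_0 + d_g * s2_g) / (d0 + d_g)   # posterior variance
t_ord <- beta_hat / sqrt(s2_g * v_g)                # ordinary t, d_g df
t_mod <- beta_hat / sqrt(s2_tilde * v_g)            # moderated t, d0 + d_g df
p_mod <- 2 * pt(-abs(t_mod), df = d0 + d_g)

# The shrunken variances are less spread out than the raw ones.
print(range(s2_g))
print(range(s2_tilde))
```

Genes whose raw variance happens to be tiny no longer produce extreme statistics, because their variance estimate is pulled toward the prior value.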

Summary on two-sample DE test

• Try to alleviate the “small sample variance” problem.

• Combine information from all genes.
• Many other variations of the model.
• In practice SAM and limma perform similarly.

Volcano plot

• A diagnostic plot to visualize the test results.
• Scatter plot of the statistical significance (-log10 p-values) vs. biological significance (log2 fold change).

• Ideally the two should agree with each other.

A bad volcano plot

A good one
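A volcano plot can be drawn with base R graphics. The sketch below uses simulated fold changes and p-values (made up for illustration, as are the cutoffs 0.01 and 1); with real data one would plot the per-gene log2 fold change against the -log10 p-value from the DE test.

```r
set.seed(4)
n <- 2000
log2fc <- rnorm(n, sd = 1)              # simulated log2 fold changes
# Hypothetical p-values that depend on |log2fc|, for illustration only.
pval <- 2 * pnorm(-abs(log2fc / 0.5))

# Flag genes passing both a significance and a fold-change cutoff.
sig <- pval < 0.01 & abs(log2fc) > 1

plot(log2fc, -log10(pval), pch = 20,
     col = ifelse(sig, "red", "grey50"),
     xlab = "log2 fold change", ylab = "-log10 p-value")
abline(h = -log10(0.01), v = c(-1, 1), lty = 2)
```

In this idealized case significance and fold change agree, so the points form a clean V shape.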

More complex experiments

• Complex experimental designs:
– Multiple (>2) groups.
– Crossed/nested.
– etc.

• Examples for multiple-group:

[Figure: diagram of a multiple-group design with three groups and replicate samples in each group.]

A complicated loop design on two-color array

Oleksiak et al. (2002) Nature Genetics

DE test for complex design

• Two sample test -> multiple regression.
• The same problems still exist, and similar solutions can be applied.
• Mixed effect models can be used to model repeated measurements.
• Both SAM and limma provide functions for complex designs.
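As an illustration of moving from a two-sample test to regression, the base R sketch below builds the design matrix for a hypothetical three-group experiment and fits a linear model to one simulated gene. The group labels and expression values are invented for this example.

```r
# Three groups (A, B, C) with three replicates each (hypothetical layout).
group <- factor(rep(c("A", "B", "C"), each = 3))
design <- model.matrix(~ group)   # columns: (Intercept), groupB, groupC
print(design)

# One simulated gene in which group B is shifted upward.
set.seed(5)
y <- rnorm(9, mean = c(8, 8, 8, 9, 9, 9, 8, 8, 8), sd = 0.3)

fit <- lm(y ~ group)
print(anova(fit))                               # F-test: any difference among groups
print(summary(fit)$coefficients["groupB", ])    # B vs. A comparison
```

The same design matrix could be passed to a gene-wise fitting function so that every gene is fitted with the same model.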

P-values by randomization
• When the data don't satisfy the normality assumption, permutation/bootstrap can be used to derive empirical p-values.
• Procedure for two-sample comparison:
– For each gene, randomly shuffle the sample labels.
– Compute the t-statistic on the randomized data.
– Repeat the procedure N times, and compute the p-value as the percentage of times that the permuted t-statistic is more extreme than the observed one.
• The procedure is a little more complicated for multi-factor designs. Basically, shuffle the data based on the null model.
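The two-sample procedure above can be sketched for a single gene in base R. The function name `perm_pvalue`, the simulated data and the number of permutations are assumptions made for this illustration.

```r
# Empirical two-sided p-value from label permutation for one gene.
perm_pvalue <- function(x, y, n_perm = 1000) {
  t_obs <- t.test(x, y)$statistic
  pooled <- c(x, y)
  n_x <- length(x)
  t_perm <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)   # randomly reassign group labels
    t.test(pooled[idx], pooled[-idx])$statistic
  })
  # Fraction of permuted statistics at least as extreme as the observed one.
  mean(abs(t_perm) >= abs(t_obs))
}

set.seed(6)
x <- rnorm(8, mean = 0)
y <- rnorm(8, mean = 2)
p <- perm_pvalue(x, y, n_perm = 500)
print(p)
```

With very small groups (e.g., 3 vs. 3) there are only a handful of distinct relabelings, so the empirical p-value for a single gene cannot be very small.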

Multiple testing correction
• The multiple testing problem is severe in high-throughput data analysis because a large number of tests are performed.
– With type I error α=0.05, about 1000 out of 20000 genes are expected to be falsely declared DE (false positives) by chance when no gene is truly DE.
– If a total of 2000 genes are declared DE, the false discovery rate (FDR) is then about 0.5!
• Multiple testing correction:
– Bonferroni correction: use α=0.05/20000 (too conservative).
– FDR control (Benjamini and Hochberg, 1995 JRSS-B).
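The arithmetic on this slide, and the two corrections, can be reproduced with base R's `p.adjust`. The p-values below are simulated (19000 null genes and 1000 DE genes, an arbitrary mix chosen for illustration).

```r
n_genes <- 20000
alpha <- 0.05
expected_fp <- n_genes * alpha    # expected false positives if no gene is DE
fdr_example <- expected_fp / 2000 # FDR if 2000 genes are called DE
print(c(expected_fp = expected_fp, fdr = fdr_example))

set.seed(7)
# Simulated p-values: uniform for null genes, concentrated near 0 for DE genes.
pval <- c(runif(19000), rbeta(1000, 0.1, 10))

p_bonf <- p.adjust(pval, method = "bonferroni")
p_bh <- p.adjust(pval, method = "BH")   # Benjamini-Hochberg FDR

print(c(raw = sum(pval < 0.05),
        bonferroni = sum(p_bonf < 0.05),
        BH = sum(p_bh < 0.05)))
```

The number of calls under BH falls between the uncorrected and Bonferroni counts, which is why FDR control is the usual choice for this kind of data.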

Bioconductor packages for microarray analysis

Bioconductor for microarray data

• There is a rich collection of Bioconductor packages for microarrays. In fact, Bioconductor started as a project for microarray analysis.
• There are currently 200+ packages for microarrays.
• Important ones include:
– affy: one of the earliest Bioconductor packages. Designed for analyzing data from Affymetrix arrays.
– limma and siggenes: DE detection using the limma and SAM-t models.
– oligo: preprocessing tools for many types of oligonucleotide arrays. This is designed to replace the affy package.
– Many annotation data packages to link probe names to genes.

My suggestion

• Use oligo for reading in data, normalization and summarization.

• Use siggenes or limma for detecting DE genes.

An example of analyzing a set of Affymetrix data

• Data generated by MAQC (MicroArray Quality Control) project.

• Five brain samples and five reference samples on human exon arrays.

• Raw data are CEL files (binary file generated by factory).

• Each CEL file is around 65 MB.
• The platform design package (pd.huex.1.0.st.v2) needs to be installed.

Read in data
## load in necessary libraries
> library(oligo)
> library(limma)
## get a list of CEL files
> CELfiles=dir(pattern="CEL")
## read in all raw data
> rawdata=read.celfiles(CELfiles)
> rawdata
ExonFeatureSet (storageMode: lockedEnvironment)
assayData: 6553600 features, 10 samples
  element names: exprs
protocolData
  rowNames: ambion_A1.CEL, ambion_A2.CEL, ..., stratagene_K2.CEL (10 total)
...
Annotation: pd.huex.1.0.st.v2

Normalization and summarization
## using RMA
> normdata=rma(rawdata, target = "core")
> normdata
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22011 features, 10 samples
  element names: exprs
...
## extract expression values using the exprs function
> data=exprs(normdata)
> head(data)
           sample 1  sample 2  sample 3  sample 4
1007_s_at 10.160224 10.214496 10.090697 11.020649
1053_at    9.501826  9.500412  9.574311  7.361141
117_at     5.669447  5.478072  5.648788  6.048142
121_at     8.061479  8.154549  8.156215  7.902597
1255_g_at  4.307739  4.017903  3.992333  4.668972
1294_at    7.108730  7.185586  7.122404  6.597161
## check data distribution after RMA
> boxplot(data)

The boxplot looks really good after RMA, so between-array normalization is unnecessary. But in case you need it, use the normalizeQuantiles function from limma for quantile normalization:
> data2=normalizeQuantiles(data)

[Figure: boxplot of the data after quantile normalization.]

DE detection using SAM t-test

> library(siggenes)
## create a vector for design.
> design <- c(rep(0,5),rep(1,5))
> sam.result=sam(data2, cl=design)
> sam.result
SAM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances

DE detection using limma

## create design matrix. Intercept must be included
> design=cbind(mu=1,beta=c(rep(0,5),rep(1,5)))
## fit linear model and compute estimates
> limma.result=lmFit(data2, design=design)
## Empirical Bayes method to get p-values
> limma.result=eBayes(limma.result)
## get p-values for the comparison
> pval=limma.result$p.value[,"beta"]

Compare results from limma and SAM

• Agreement is good: 0.95 Spearman rank correlation.
• limma seems to be more liberal.

Obtain gene annotations
• Now you have p-values for all genes, but you also need gene names for generating a report.
• There are many annotation packages available for different array platforms. For example, hgu133a.db is for HGU133A arrays.
• These packages contain comprehensive information for all probes, including their sequences, chromosome, position, corresponding gene IDs, GO terms, etc.
• A typical way to convert probeset names to accession numbers or gene aliases is:
> library(hgu133a.db)
## convert to accession numbers:
> geneAcc=as.character(hgu133aACCNUM[rownames(data)])
## convert to gene names
> geneNames=as.character(hgu133aSYMBOL[rownames(data)])

Finally generate a report table

## select genes to report (the slot name was lost in the source; p.value is assumed here)
> ix=sam.result@p.value<0.1
> result=data.frame(gene=geneNames[ix],
+     pvalue=sam.result@p.value[ix],fold=sam.result@fold[ix])
## sort by fold change
> ix2=sort(result$fold, decreasing=TRUE, index.return=TRUE)$ix
> result=result[ix2,]
> head(result)
             gene pvalue     fold
2731192 NM_000477      0 185.5720
3457336 NM_006928      0 155.7143
2772566 NM_144646      0 152.8232
2731230 NM_001134      0 132.8515
> write.table(result, file="report.txt", sep="\t")

Data artifacts: batch effect and cell mixture

Technical artifact: batch effect
• Microarray experiments are very sensitive to experimental conditions:
– Equipment, reagents, technicians, etc.
• Data generated from different “batches” (lab, time, etc.) can be quite different, but data from the same batch tend to be more similar.
• So batch effects are structured noise/bias common to all replicates in the same batch, but markedly different from batch to batch.

Example

[Figure: clustered heatmap of the samples with a color key (values from 0.85 to 0.95). Sample labels, in plotted order: 27 E: sepsis peritoneal; 25 E: sepsis peritoneal; 26 E: sepsis peritoneal; 28 F: tumor spleen; 29 F: tumor bone marrow; 11 B: sepsis spleen; 12 D: sepsis bone marrow; 10 C: naive bone marrow; 5 B: sepsis spleen; 4 D: sepsis bone marrow; 6 D: sepsis bone marrow; 3 B: sepsis spleen; 1 A: naive spleen; 2 C: naive bone marrow; 8 C: naive bone marrow; 9 A: naive spleen; 7 A: naive spleen.]

Variation within and between batches

Methods to remove batch effects

• Based on linear model: batches cause location/scale changes (e.g., ComBat).
• Based on dimension reduction techniques: SVD, PCA, factor analysis, etc. (e.g., sva).
– The singular vectors/PCs/factors that are correlated with batch are deemed to come from batch effects.
– Remove batch effects from the data; what is left over is biological signal.
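The location part of the linear-model idea can be illustrated with a deliberately simple base R sketch: each gene is re-centered within each batch. This is not ComBat (which also adjusts scale and uses empirical Bayes shrinkage); the simulated data and variable names are assumptions for this example.

```r
set.seed(8)
n_genes <- 200
batch <- factor(rep(c("b1", "b2"), each = 4))

# Simulated data with an additive per-gene shift in batch b2.
edata <- matrix(rnorm(n_genes * 8), nrow = n_genes)
shift <- rnorm(n_genes, mean = 2)
edata[, batch == "b2"] <- edata[, batch == "b2"] + shift

# Location-only adjustment: subtract each gene's batch mean,
# then add back that gene's overall mean.
adjusted <- edata
for (b in levels(batch)) {
  cols <- batch == b
  adjusted[, cols] <- edata[, cols] - rowMeans(edata[, cols]) + rowMeans(edata)
}

# Largest remaining difference in per-gene batch means (should be ~0).
gap <- max(abs(rowMeans(adjusted[, batch == "b1"]) -
               rowMeans(adjusted[, batch == "b2"])))
print(gap)
```

If batch is confounded with the biological condition, this kind of centering removes the biological difference as well, so the adjustment is only meaningful when conditions are balanced across batches.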

sva package in Bioconductor

• Contains the ComBat function for removing effects of known batches.
• Assume we have
– edata: a matrix of raw expression values.
– batch: a vector of batch labels.

modcombat = model.matrix(~1, data=as.factor(batch))
combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat,
    par.prior=TRUE, prior.plot=FALSE)

BatchQC - Batch Effects Quality Control

• A Bioconductor package with a GUI (shiny app).

• http://bioconductor.org/packages/release/bioc/html/BatchQC.html


• One major conclusion is that tissues are more similar within a species, compared with the same tissue across species.

Comparison of the transcriptional landscapes between human and mouse tissues

Shin Lin (a,b,1), Yiing Lin (c,1), Joseph R. Nery (d), Mark A. Urich (d), Alessandra Breschi (e,f), Carrie A. Davis (g), Alexander Dobin (g), Christopher Zaleski (g), Michael A. Beer (h), William C. Chapman (c), Thomas R. Gingeras (g,i), Joseph R. Ecker (d,j,2), and Michael P. Snyder (a,2)

(a) Department of Genetics, Stanford University, Stanford, CA 94305; (b) Division of Cardiovascular Medicine, Stanford University, Stanford, CA 94305; (c) Department of Surgery, Washington University School of Medicine, St. Louis, MO 63110; (d) Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037; (e) Centre for Genomic Regulation and UPF, Catalonia, 08003 Barcelona, Spain; (f) Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Spain; (g) Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11742; (h) McKusick-Nathans Institute of Genetic Medicine and the Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205; (i) Affymetrix, Inc., Santa Clara, CA 95051; and (j) Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037

Contributed by Joseph R. Ecker, July 23, 2014 (sent for review May 23, 2014)

PNAS | December 2, 2014 | vol. 111 | no. 48 | 17224–17229 | www.pnas.org/cgi/doi/10.1073/pnas.1413624111

Although the similarities between humans and mice are typically highlighted, morphologically and genetically, there are many differences. To better understand these two species on a molecular level, we performed a comparison of the expression profiles of 15 tissues by deep RNA sequencing and examined the similarities and differences in the transcriptome for both protein-coding and -noncoding transcripts. Although commonalities are evident in the expression of tissue-specific genes between the two species, the expression for many sets of genes was found to be more similar in different tissues within the same species than between species. These findings were further corroborated by associated epigenetic histone mark analyses. We also find that many noncoding transcripts are expressed at a low level and are not detectable at appreciable levels across individuals. Moreover, the majority lack obvious sequence homologs between species, even when we restrict our attention to those which are most highly reproducible across biological replicates. Overall, our results indicate that there is considerable RNA expression diversity between humans and mice, well beyond what was described previously, likely reflecting the fundamental physiological differences between these two organisms.

transcriptome | epigenome | species comparison | noncoding transcripts

Significance

To date, various studies have found similarities between humans and mice on a molecular level, and indeed, the murine model serves as an important experimental system for biomedical science. In this study of a broad number of tissues between humans and mice, high-throughput sequencing assays on the transcriptome and epigenome reveal that, in general, differences dominate similarities between the two species. These findings provide the basis for understanding the differences in phenotypes and responses to conditions in humans and mice.

The mouse has served as a valuable model organism for human biology and disease. It is widely assumed that biochemical, cellular, and developmental pathways in the mouse are highly conserved with humans and that many processes are clearly preserved at a molecular and genetic level. Moreover, recent detailed studies have examined gene expression in a limited number of tissues in humans and mice. These studies have indicated that gene expression is often conserved and is more similar between the comparable tissues of different organisms rather than within tissues of the same organism. In contrast, the transcript isoform repertoire was found to be markedly different between species (1, 2).

Gene Expression Is More Similar Among Tissues Within a Species Than Between Corresponding Tissues of the Two Species

To examine the similarities between humans and mice in much greater detail, we produced RNA-seq data from 13 human tissues [as part of the Encyclopedia Of DNA Elements (ENCODE)], another 11 human tissues [as part of the Roadmap Epigenomics Mapping Consortium (REMC) (3)], and 13 mouse tissues (for mouse ENCODE). We also included in our analysis other data from mouse ENCODE and the Illumina Human BodyMap 2.0 (HBM) (SI Materials and Methods). Sequencing was performed to a depth of 11,313,824–166,188,101 mappable reads (median of 68,399,538 with an interquartile range of 31,557,381–81,836,199). In total, our analysis used 93 datasets encompassing the most tissue-diverse RNA-seq dataset to date spanning several major projects. Thirteen of the mouse and human orthologous datasets were produced by the same laboratory. For our analysis regarding noncoding transcripts, we incorporated an additional 294 RNA-seq datasets from the Genotype-Tissue Expression (GTEx) project (4).

We first explored gene expression similarities and differences by analyzing the expression of ~15,106 protein-coding orthologs; this list was generated by the modENCODE and mouse ENCODE consortia and represents the most recent mouse–human ortholog list to date (biorxiv.org/content/biorxiv/early/2014/05/31/005736.full.pdf). Fragments per kilobase of transcript per million (FPKM) values were obtained from each dataset, and principal component analysis (PCA) was used to compare gene expression (Materials and Methods). In contrast to what was reported previously (1, 2, 5), surprisingly, we found that the mouse and human samples cluster by species when the data are projected onto the first three principal components (Fig. 1A). Because the same tissues of the same species produced by different laboratories did not cluster together, the possibility of methodologic differences among laboratories confounding our results was considered. To address this issue, analysis of only the 13 paired samples processed under one experimental protocol yielded the same species-specific clustering (Fig. 1C). The same species-specific clustering was observed when other combinations of 10 or more tissues were examined, indicating that the clustering is not due to the particular 13–15 tissues selected. Finally, different normalization methods (e.g., quantile normalization) applied to the data produced similar groupings.

To understand the differences between our results and those of others (1, 2, 5), we performed extensive additional analyses. We first varied the ortholog list and similarity measure, but these changes did not significantly alter our results (Fig. S1A). We next applied our analytic process to human and mouse data produced from other studies that reported tissue dominated clustering (1, 5), and we were able to reproduce their findings (Fig. S1 B and C). Moreover, in our own dataset, we observed a more tissue-dominated clustering for principal components 4–6 (Fig. 1 B and D). Thus, gene expression profiles of different organism tissues do exhibit similarities in gene expression but of lower strength relative to organismal signals.

To determine whether we could further reconcile these various observations, we identified the groups of genes that are tissue specific and those present in all tissues (i.e., housekeeping) using Shannon entropy (H) (6). H is a parameter commonly used to assess tissue specificity, with lower values signifying expression in a smaller fraction of the total set. We calculated H for each gene using our expression data and considered genes with values below two to be tissue specific. We found that testes, brain, liver, muscle (cardiac and/or skeletal), and kidney were among the tissues that expressed the most tissue-specific genes (Fig. 1E),

Fig. 1. Loading plots from PCA on human and mouse gene expression data. (A) PCA is performed on the combined Stanford (human, mouse), Salk (human), HBM (human), LICR (mouse), and CSHL (mouse) expression datasets using 15 tissue types, 15,106 orthologs (biorxiv.org/content/biorxiv/early/2014/05/31/005736.full.pdf), and Pearson’s correlation as the distance measure. The loadings on principal components 1–3 are plotted. (B) Same as in A except loadings on principal components 4–6 are plotted. (C) The loadings on principal components 1–3 are plotted from a PCA performed as in A except only 13 human and mouse tissue sets processed at Stanford. (D) The loadings on principal components 4–6 for the analysis in C are used. (E) Barplot of number of tissue specific-genes per tissue. (F) PCA is performed as in A except the tissue set is restricted to testis, brain, heart, liver, and kidney, which have higher numbers of tissue-specific genes. The loadings on principal components 1–3 are plotted.

Author contributions: S.L., Y.L., W.C.C., T.R.G., J.R.E., and M.P.S. designed research; S.L., Y.L., J.R.N., M.A.U., and A.D. performed research; S.L., Y.L., and W.C.C. contributed new reagents/analytic tools; S.L., Y.L., A.B., C.A.D., A.D., C.Z., and M.A.B. analyzed data; and S.L., Y.L., J.R.E., and M.P.S. wrote the paper.

The authors declare no conflict of interest.

Freely available online through the PNAS open access option.

Data deposition: The data reported in this paper have been deposited in the Gene Expression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession no. GSE36025). The Roadmap Epigenomics Mapping Consortium RNA-seq data can be accessed under GSE16256.

(1) S.L. and Y.L. contributed equally to this work.
(2) To whom correspondence may be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1413624111/-/DCSupplemental.

• Experimental design: data are from 5 batches.

A reanalysis of mouse ENCODE comparative gene expression data [version 1; referees: 3 approved, 1 approved with reservations]

Yoav Gilad, Orna Mizrahi-Man
Department of Human Genetics, University of Chicago, Chicago, IL, 60637, USA

F1000Research 2015, 4:121 (doi: 10.12688/f1000research.6536.1). First published: 19 May 2015.

Abstract
Recently, the Mouse ENCODE Consortium reported that comparative gene expression data from human and mouse tend to cluster more by species rather than by tissue. This observation was surprising, as it contradicted much of the comparative gene regulatory data collected previously, as well as the common notion that major developmental pathways are highly conserved across a wide range of species, in particular across mammals. Here we show that the Mouse ENCODE gene expression data were collected using a flawed study design, which confounded sequencing batch (namely, the assignment of samples to sequencing flowcells and lanes) with species. When we account for the batch effect, the corrected comparative gene expression data from human and mouse tend to cluster by tissue, not by species.

of the number of ortholog pairs analyzed by Lin et al. Nevertheless, we believe that a possible explanation for this disparity is a pars-ing error. The last two columns of the ‘modENCODE ortholog file’ represent the number of genes from each species in the ortholog group. One of the steps required to obtain the subset of ortholog groups for analysis is to select those records where the two last col-umns have a value of 1 (i.e. one-to-one ortholog pairs). We found that if this selection is done through a command line search that does not require that the value in the last column be exactly “1”, but rather just begins with “1”, then the result is 15,104 putative human-mouse ortholog pairs.

Quality assessment of RNA-Seq dataWe used the FastQC software v0.10.0 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess the quality of the individ-ual FASTQ files (Supplementary Table 2–Supplementary Table 6). We were concerned by evidence for GC content bias and over-represented sequences. To examine the latter in greater detail, we mapped the sequences overrepresented in at least one sample to the genome of the respective species, using BLAT searches6 against the hg19 (human) and mm10 (mouse) assemblies at the UCSC genome browser site (http://genome.ucsc.edu/)6. We found that in both spe-cies many of the overrepresented sequences mapped perfectly to the mitochondrial genome (Supplementary Table 3–Supplementary Table 6). For the mouse pancreas sample only, we also found many overrepresented sequences mapped to regions with rRNA repeats from the SSU-rRNA_Hsa and LSU-rRNA_Hsa families.

Mapping RNA-Seq reads to genome sequences

We mapped the RNA-Seq reads to their respective genomes using Tophat v2.0.11 [7] with the following options: “--mate-inner-dist 200” (i.e. inner mate distance is 200nt, based on paired-end reads with length 100nt each and an insert size of 350-450nt); “--bowtie-n” (i.e. the “-n” option will be used in Bowtie [8] in the initial read mapping stage); “-g 1” (i.e. multi-mapping reads will be excluded from alignment); “-m 1” (i.e. one mismatch is allowed in the anchor region of a spliced alignment); “--library-type fr-firststrand” (the libraries had been constructed using the Illumina TruSeq Stranded mRNA LT Sample Prep Kit [2]). An exception was the mouse pancreas sample, for which the mapping process stalled consistently at the same stage. For this sample we used Tophat v1.4.1 [8] with the same options as above. Tophat requires a Bowtie [8] index. For human we used the Bowtie index that was packaged with the genome sequence in the file downloaded from the Illumina iGenomes page (http://support.illumina.com/sequencing/sequencing_software/igenome.html). For mouse we built an index using the bowtie-build utility from Bowtie v2.2.1 (v 0.12.7 for the index used with Tophat v1.4.1).
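The options listed above can be collected into a single command line. The sketch below only assembles and prints the command; it does not run Tophat. The index prefix and FASTQ file names are hypothetical placeholders, and the positional order (index, then the two read files) is an assumption based on the usual Tophat usage pattern.

```python
import shlex

# Hypothetical inputs: placeholders, not the files used in the study.
bowtie_index = "genome_index"
reads_1 = "sample_R1.fastq"
reads_2 = "sample_R2.fastq"

# Options as quoted in the text; positional arguments are assumed.
tophat_cmd = [
    "tophat",
    "--mate-inner-dist", "200",          # inner mate distance of 200 nt
    "--bowtie-n",                        # use Bowtie's -n mode for initial mapping
    "-g", "1",                           # report at most one alignment per read
    "-m", "1",                           # one mismatch allowed in the splice anchor
    "--library-type", "fr-firststrand",  # stranded library protocol
    bowtie_index, reads_1, reads_2,
]

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(tophat_cmd))
```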

Calculating gene GC content

For each of the two species we used the appropriate GTF file to generate a table, which contains for each gene its ENSEMBL gene identifier, its common name, and the GC content of the sequence covered by the union of the gene’s transcripts. To this end, we first generated a GTF file where overlapping exons from different transcripts of the same gene were merged into a single “exon” with the same sequence coverage, retaining the association with the gene identifier. Next, we computed the nucleotide content of the exons in this new GTF file using the ‘nuc’ utility from bedtools v2.17.0 [9]. Finally, we computed the GC content for each gene identifier by summing the number of ‘G’ and ‘C’ nucleotides in its merged exons and dividing by the sum of counts of unambiguous nucleotides in these exons.
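The per-gene calculation described above (merge overlapping exons, then divide G+C by the count of unambiguous bases) can be sketched in a few lines. This is an illustration on a made-up sequence with hypothetical half-open exon coordinates; it stands in for the GTF and bedtools steps rather than reproducing them.

```python
def merge_intervals(exons):
    """Merge overlapping or touching (start, end) half-open intervals."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gene_gc_content(sequence, exons):
    """GC fraction over merged exons, ignoring ambiguous bases such as N."""
    gc = 0
    unambiguous = 0
    for start, end in merge_intervals(exons):
        chunk = sequence[start:end].upper()
        gc += chunk.count("G") + chunk.count("C")
        unambiguous += sum(chunk.count(base) for base in "ACGT")
    return gc / unambiguous if unambiguous else float("nan")


# Made-up sequence and exon coordinates from two overlapping transcripts.
toy_sequence = "ATGCGCNNATATGGCC"
toy_exons = [(0, 6), (4, 10), (12, 16)]
gc = gene_gc_content(toy_sequence, toy_exons)
print(merge_intervals(toy_exons), round(gc, 3))
```

Merging first matters: without it, bases shared by overlapping exons would be counted more than once.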

Per-gene FPKM values

We used Cufflinks v2.2.1 [10] to compute fragments per kilobase of transcript per million (FPKM) values and aggregate them per gene. The only option used was “--library-type fr-firststrand”. For the required transcript annotation file (“-G” parameter) we used the GTF file for the respective species described in the “Genome and gene annotation files” section. We then generated a matrix of 14,744 by 26 FPKM values for each gene (in the ortholog table) and sample. While generating this table we noticed that some of the common gene names were associated with more than one ENSEMBL gene identifier. In some cases we determined that this was due to gene identifiers that have been retired from the ENSEMBL database [3] but were retained in the GTF file (27 and 64 retired identifiers for human and mouse, respectively). These retired identifiers were ignored when constructing the FPKM matrix. For the remaining such cases we incorporated the value from the first appearance of the common name.

Figure 1. Study design. Sequencing batches as inferred based on the sequence identifiers of the RNA-Seq reads.



After correcting for batch effects

• Tissues tend to cluster together more.

Analysis of normalized data after accounting for batch effects yields clustering by tissue

A previous evaluation of normalization methods for RNA-Seq data [15] suggested that FPKM values were not optimal for clustering analysis. Therefore, as a basis for our reanalysis, we used the matrix of per-gene raw fragment counts. The entire analysis was done within the R environment v 3.1.3 GUI 1.65 Snow Leopard build (6912) [12]. See Supplementary Text 2 for detailed commands, and a supplement zip file for the R input (available in Zenodo: http://dx.doi.org/10.5281/zenodo.17606).

Following Li et al. [16], we removed the 30% of genes with the lowest expression as determined by the sum of fragment counts across all samples. Next, due to the presence of mitochondrial genes among the overrepresented sequences in the data, we also removed reads that map to the 12 mitochondrial genes. This left us with expression data from 10,309 genes for analysis. We note that merely limiting the analysis to this subset of genes does not have a marked effect on the patterns reported by Lin et al. (Figure S3; detailed commands in Supplementary Text 3, and a supplement zip file for the R input (available in Zenodo: http://dx.doi.org/10.5281/zenodo.17606)). We performed within-column normalization to remove the GC bias in the data, indicated by the initial quality assessment. To this end, we applied the ‘withinLaneNormalization’ function from the EDASeq package v2.0.0 [17] to each column in the matrix, using the gene GC values for the species associated with the column. Next, we used the ‘calcNormFactors’ function from the edgeR package v3.8.6 [18], with the trimmed mean of M-values (TMM) method [19], to calculate normalization factors for the library sizes for the samples. We used these normalization factors in the depth normalization of the columns (using the column sums of the original, unfiltered, counts matrix as a proxy for library sizes). The normalized data were log2-transformed (after adding ‘1’ to each value in the matrix to avoid undefined values).
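Two of the simpler steps in this pipeline, dropping the 30% of genes with the lowest total count and taking log2 after adding 1, can be sketched as below. The gene names and counts are made up, and the GC-bias and TMM normalization steps are deliberately left out, so this is not a substitute for the EDASeq and edgeR calls named in the text.

```python
import math

# Made-up raw fragment counts: gene -> counts across three samples.
counts = {
    "g01": [0, 1, 0],
    "g02": [5, 3, 4],
    "g03": [100, 80, 120],
    "g04": [2, 0, 1],
    "g05": [50, 60, 40],
    "g06": [0, 0, 0],
    "g07": [10, 12, 9],
    "g08": [300, 250, 310],
    "g09": [7, 8, 6],
    "g10": [20, 25, 22],
}


def drop_lowest_percent(counts, percent=30):
    """Remove the given percentage of genes with the lowest summed counts."""
    ranked = sorted(counts, key=lambda gene: sum(counts[gene]))
    n_drop = len(ranked) * percent // 100
    dropped = set(ranked[:n_drop])
    return {gene: values for gene, values in counts.items() if gene not in dropped}


def log2_plus_one(values):
    """log2(x + 1), so that zero counts map to 0 instead of being undefined."""
    return [math.log2(value + 1) for value in values]


kept = drop_lowest_percent(counts)
log_counts = {gene: log2_plus_one(values) for gene, values in kept.items()}
print(sorted(kept))
```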

We then considered how to account for the fact that the assignment of samples to sequencing flowcells and lanes was nearly completely confounded with the species annotations of the samples (Figure 1). The consideration of ‘batch effect’ was the most important difference between the analysis that recapitulated the patterns reported by the mouse ENCODE papers (the previous ‘Results’ section) and the current reanalysis effort. Specifically, we accounted for the sequencing study design batch effects using the ‘ComBat’ function from the sva package v3.12.0 [20], with a model that includes effects for batch, species and tissue. For this purpose the samples were classified into five batches, based on the sequencing study design (see methods and Figure 1).
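Before any correction, it helps to check how strongly batch and the variable of interest overlap. The sketch below cross-tabulates batch against species on a hypothetical sample sheet (not the actual study design) and lists the batches that contain a single species, which is where batch and species cannot be separated.

```python
from collections import Counter

# Hypothetical sample sheet: (sample, batch, species). Not the real design.
samples = [
    ("s1", "batch1", "human"),
    ("s2", "batch1", "human"),
    ("s3", "batch2", "mouse"),
    ("s4", "batch2", "mouse"),
    ("s5", "batch3", "human"),
    ("s6", "batch3", "mouse"),
]


def cross_tabulate(samples):
    """Count samples for every (batch, species) combination."""
    return Counter((batch, species) for _, batch, species in samples)


def single_species_batches(samples):
    """Batches in which only one species was sequenced."""
    species_by_batch = {}
    for _, batch, species in samples:
        species_by_batch.setdefault(batch, set()).add(species)
    return sorted(b for b, seen in species_by_batch.items() if len(seen) == 1)


table = cross_tabulate(samples)
confounded = single_species_batches(samples)
print(dict(table))
print(confounded)
```

In this toy sheet only batch3 contains both species, so it is the only batch that carries direct information for separating a species effect from a batch effect.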

To visualize the data, we used the function ‘prcomp’ (with the ‘scale’ and ‘center’ options set to TRUE) to perform principal component analysis (PCA) of the transposed log-transformed matrix of ‘clean’ values (after removal of invariant columns, i.e. genes), and the ggplot2 package [13] to generate scatter plots of the PCA results. None of the first five principal components (accounting together for 56% of the variability in the data) support the clustering of the gene expression data by species (Figure 3a and Figure S4–Figure S5). However, the sixth principal component, which accounts for 6% of

Figure 3. Clustering of data once batch effects are accounted for. a. Two-dimensional plots of principal components calculated by applying PCA to the transposed matrix of batch-corrected log-transformed normalized fragment counts (from 10,309 orthologous gene pairs that remained after the exclusion steps described in the results) for the 26 samples, after removal of invariant columns (genes). b. Heatmap based on pairwise Pearson correlation of the expression data used in panel a. We used Euclidean distance and complete linkage as distance measure and clustering method, respectively.



Batch effects are prevalent

• Observed in many high-throughput experiments: microarray, different types of sequencing, even brain imaging.

• Methods for identifying and removing batch effects are under continuous development.


Tackling the widespread and critical impact of batch effects in high-throughput data

Jeffrey T. Leek, Robert B. Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W. Evan Johnson, Donald Geman, Keith Baggerly and Rafael A. Irizarry

Abstract | High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects, as well as other technical and biological artefacts, are widespread and critical to address. We review experimental and computational approaches for doing so.

Many technologies used in biology (including high-throughput ones such as microarrays, bead chips, mass spectrometers and second-generation sequencing) depend on a complicated set of reagents and hardware, along with highly trained personnel, to produce accurate measurements. When these conditions vary during the course of an experiment, many of the quantities being measured will be simultaneously affected by both biological and non-biological factors. Here we focus on batch effects, a common and powerful source of variation in high-throughput experiments.

Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study. For example, batch effects may occur if a subset of experiments was run on Monday and another set on Tuesday, if two technicians were responsible for different subsets of the experiments or if two different lots of reagents, chips or instruments were used. These effects are not exclusive to high-throughput biology and genomics research [1], and batch effects also affect low-dimensional molecular measurements, such as northern blots and quantitative PCR. Although batch effects are difficult or impossible to detect in low-dimensional assays, high-throughput technologies provide enough data to detect and even remove them. However, if not properly dealt with, these effects can have a particularly strong and pervasive impact. Specific examples have been documented in published studies [2,3] in which the biological variables were extremely correlated with technical variables, which subsequently led to serious concerns about the validity of the biological conclusions [4,5].

NATURE REVIEWS | GENETICS, VOLUME 11 | OCTOBER 2010

© 2010 Macmillan Publishers Limited. All rights reserved

We re-examined data sets from three studies for which batch effects have been reported (TABLE 1). The first was the sTCC study [9] described above. The second was a microarray data set from a study examining population differences in gene expression [2]. The conclusion of the original paper, that the expression of 1,097 of 4,197 genes differs between populations of European and Asian descent, was questioned in another publication [4] because the populations and processing dates were highly correlated. In fact, more differences were found when comparing data from two processing dates while keeping the population fixed. The third was a mass spectrometry data set that was used to develop a statistical procedure, based on proteomic patterns in serum, to distinguish neoplastic diseases from non-neoplastic diseases within the ovary [3]. Concerns about the conclusions of this paper were raised in another publication, which showed that outcome was confounded with run date [5].

To further illustrate the ubiquity and potential hazards associated with batch effects, we also carried out analyses on representative publicly available data sets that have been established using a range of high-throughput technologies. In addition to the three studies described above, we examined: data from a study of copy number variation in HapMap populations [16]; a study of copy number variation in a genome-wide association study of bipolar disorder [17]; gene expression from an ovarian cancer study from The Cancer Genome Atlas (TCGA) [18] produced using two platforms (Affymetrix and Agilent); methylation data from the same TCGA ovarian cancer samples produced using Illumina BeadChips; and second-generation sequencing data from a study comparing unrelated HapMap individuals (these data were a subset of the data from the 1000 Genomes Project).

We found batch effects for all of these data sets, and substantial percentages (32.1–99.5%) of measured features showed statistically significant associations with processing date, irrespective of biological phenotype (TABLE 1). This suggests that batch effects influence a large percentage of the measurements from genomic technologies. Next, we computed the first five principal components of the feature data (the principal components were ordered by the amount of variability explained). Ideally these principal components would correlate with the biological variables of interest, as the principal components represent the largest sources of signal in the data. Instead, for all of the studied data sets, the surrogates for batch (date or processing group) were strongly correlated with one of the top principal components (TABLE 1). In general, the correlation with the top principal components was not as high for the biological outcome as it was for the surrogates. This suggests that technical variability was more influential than biological variability across a range of experimental conditions and technologies.

For most of the data sets examined, neither date nor biological factors was perfectly associated with the top principal components, suggesting that other unknown sources of batch variability are present. This implies that accounting for date or processing group might not be sufficient to capture and remove batch effects. For example, we did a further analysis of second-generation sequencing data that were generated by the 1000 Genomes Project (FIG. 2). We found that 32% of the features were associated with date but up to 73% were associated with the second principal component. Note that the principal components cannot be explained by biology because only 17% of the features are associated with the biological outcome.

Downstream consequences

In the most benign cases, batch effects will lead to increased variability and decreased power to detect a real biological signal [15]. Of more concern are cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions. An example of confounding is when all of the cases are processed on one day and all of the controls are processed on another. We have shown that in a typical high-throughput experiment, one can expect a substantial percentage of features to show statistically significant differences when comparing across batches, even when no real biological differences are present (FIG. 1; TABLE 1). Therefore, if one is not aware of the batch effect, a confounded experiment will lead to incorrect biological conclusions because results due to batch will be impossible to distinguish from real biological effects. As an example, we consider the proteomics study mentioned above [3]. These published results and further confirmation [19] led to the development of a ‘home-brew’ diagnostic assay for ovarian cancer. However, in this study the biological variable of interest (neoplastic disease within the ovary) was extremely correlated with processing day [5]. Furthermore, batch effects were identified as a major driver of these results. Fortunately, objections raised after the assay was advertised led the US Food and Drug Administration to block use of the assay, pending further validation [20].

Figure 2 | Batch effects for second-generation sequencing data from the 1000 Genomes Project. [Heatmap; vertical axis: sample ordered by date; horizontal axis: genome location, 500,000 to 3,500,000.] Each row is a different HapMap sample processed in the same facility with the same platform. See Supplementary information S[…] (box) for a description of the data represented here. The samples are ordered by processing date with horizontal lines dividing the different dates. We show a […] Mb region from chromosome […]. Coverage data from each feature were standardized across samples: blue represents three standard deviations below average and orange represents three standard deviations above average. Various batch effects can be observed, and the largest one occurs between days […] and […] (the large orange horizontal streak).


Biological artifact: cell mixture

• Tissue sample is a mixture of different cell types. • Data collected are mixed signals.

Genetic profile of each cell type

Mixture proportions

An example: EWAS in aging study

Jaffe and Irizarry, Genome Biology (2014)

• Cellular composition changes with age.

• Cellular composition is a major source of variability in DNA methylation datasets in whole blood.

Existing signal deconvolution methods

• Reference-based methods (some type of regression):

– Require cell type specific signatures: Abbas et al. 2009; Clarke et al. 2010; Gong et al. 2011; Lu et al. 2003; Wang et al. 2006; Vallania et al. 2018; Du et al. 2018.

– Require mixture proportions: Erkkila et al. 2010; Lahdesmaki et al. 2005; Shen-Orr et al. 2010; Stuart et al. 2004.

• Reference-free methods (some type of factor analysis):

– Gaujoux et al. 2011; Kuhn et al. 2011; Repsilber et al. 2010; Roy et al. 2006; Venet et al. 2001; Houseman et al. 2012, 2014, 2016; Rahmani et al. 2016, 2018; Lutsik et al. 2017; Xie et al. 2018.
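The reference-based idea can be illustrated with the simplest possible case: two cell types with known profiles and one mixed sample. The sketch below uses made-up profiles and a closed-form least-squares estimate of the mixing proportion; it is a toy regression and not an implementation of any of the cited methods.

```python
def mix(profile_a, profile_b, proportion_a):
    """Mixed signal: a weighted sum of the two cell-type profiles."""
    return [proportion_a * a + (1 - proportion_a) * b
            for a, b in zip(profile_a, profile_b)]


def estimate_proportion(mixed, profile_a, profile_b):
    """Least-squares estimate of the proportion of cell type A.

    Since mixed - b = p * (a - b), regressing (mixed - b) on (a - b)
    without an intercept gives p. The result is clipped to [0, 1].
    """
    numerator = sum((y - b) * (a - b)
                    for y, a, b in zip(mixed, profile_a, profile_b))
    denominator = sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))
    return min(1.0, max(0.0, numerator / denominator))


# Made-up reference profiles for four genes in two cell types.
profile_a = [10.0, 2.0, 8.0, 1.0]
profile_b = [1.0, 9.0, 2.0, 7.0]

mixed = mix(profile_a, profile_b, 0.3)  # sample that is 30% cell type A
estimate = estimate_proportion(mixed, profile_a, profile_b)
print(round(estimate, 3))
```

With real data the mixed signal is noisy and there are more than two cell types, so the regression becomes a constrained multi-variable problem.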

Method to adjust for cell proportion

• In EWAS, add the cell type proportions as covariates in the model.

• More rigorous statistical modeling for DE/DM with sample mixture has been a popular topic recently, and a number of methods have been developed:

– csSAM: Shen-Orr et al. 2010 Nature Methods

– CellDMC: Zheng et al. 2018 Nature Methods

– TOAST: Li et al. 2019 Bioinformatics, 2019 Genome Biology
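The effect of adding the proportion as a covariate can be seen on a small synthetic example. Here the methylation value depends only on the cell proportion, and the proportion declines with age; all numbers are made up, and the ordinary least-squares helper is a plain-Python stand-in for a regression routine, not csSAM, CellDMC or TOAST.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(columns, y):
    """Least-squares coefficients for y ~ intercept + columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data: proportion declines with age; methylation depends
# only on proportion, with no direct age effect.
age = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
proportion = [0.60, 0.58, 0.50, 0.47, 0.40, 0.36]
methylation = [0.2 + 0.5 * p for p in proportion]

unadjusted = ols([age], methylation)             # [intercept, age]
adjusted = ols([age, proportion], methylation)   # [intercept, age, proportion]
print(unadjusted[1], adjusted[1], adjusted[2])
```

Without the covariate, age appears to be associated with methylation; once the proportion enters the model, the age coefficient shrinks to essentially zero.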

Review

• We have covered the microarray analysis DE test, including:

– SAM t-test.

– EB method: limma.

– A little on complex design.

– Permutation test.

– Multiple testing.

– R/Bioconductor packages for DE analysis.

• Batch effects.

• Cell type mixture in complex tissues.
