Microarrays NGS Normalization Methylation
Statistical tests for differential expression
in count data (1)
Erik van Zwet, LUMC
NBIC Advanced RNA-seq course, 25-26 August 2011
Academic Medical Center, Amsterdam
Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC
The analysis of a microarray experiment
- Pre-process
  - image analysis to produce relative abundances
  - take the logarithm
  - remove experimental artifacts (normalize)
- Perform 20,000 t tests
- Correct for multiple testing
The model
For any particular gene, the log expression levels are assumed to be normally distributed:
fit=lm(y~group)
or perhaps
fit=lm(y~age+group)
- One might further assume that the gene variances come from a common (inverse Gamma) distribution. In R: limma.
- Results in shrinkage: the empirical variances are pulled towards their common mean.
- Favours genes with a large variance.
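The per-gene model above can be tried on simulated data. This is a minimal sketch: the sample sizes, means and standard deviation are made up for illustration, and only the call `lm(y~group)` comes from the slide. It also shows that the test of the group coefficient coincides with the classical equal-variance two-sample t test.

```r
# Simulated log expression for one gene: 5 samples per group (hypothetical data).
set.seed(1)
group <- factor(rep(c("control", "treated"), each = 5))
y <- rnorm(10, mean = ifelse(group == "treated", 9, 8), sd = 0.5)

# Linear model with a group effect.
fit <- lm(y ~ group)
p_lm <- summary(fit)$coefficients["grouptreated", "Pr(>|t|)"]

# The same p-value from the two-sample t test with pooled variance.
p_t <- t.test(y ~ group, var.equal = TRUE)$p.value

c(lm = p_lm, t.test = p_t)
```

In a microarray analysis this fit would be repeated for every gene, which is where the 20,000 tests come from.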
Multiple Testing
K tests at level α = 0.05. Expect 0.05K false discoveries.
Bonferroni: K tests at level α = 0.05/K. Expect 0.05 false discoveries.
In other words, we expect to make one false discovery in twenty experiments. Perhaps too conservative?
Simple solution: increase the level.
Unfortunately, this is taboo.
Taboo (The Concise Oxford Dictionary, 1996): (1) a system or the act of setting a person or thing apart as sacred, prohibited, or accursed; (2) a prohibition or restriction imposed on certain behaviour, word usage, etc., by social custom.
Multiple Testing
                   H0 true   H0 false
H0 rejected           V         S
H0 not rejected       U         T
Note: V are the false discoveries.
Bonferroni (Holm) controls the FWER
FWER = P(V > 0) ≤ 0.05.
Benjamini–Hochberg controls the FDR
FDR = E[V / (V + S)] ≤ 0.05.
- We allow not more than 1 in 20 of our discoveries to be false (on average).
- FDR is less conservative than FWER, yet somehow it does not feel like violating the taboo.
Same for NGS data.
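The three procedures named above are available in base R through `p.adjust`. The sketch below uses ten made-up p-values (not from any real experiment) to show that Benjamini-Hochberg rejects at least as many hypotheses as Bonferroni or Holm at the same level.

```r
# Hypothetical p-values from K = 10 tests (made-up numbers).
p <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
       0.0278, 0.0298, 0.0344, 0.0459, 0.3240)

alpha <- 0.05
reject_bonf <- p.adjust(p, method = "bonferroni") <= alpha  # controls the FWER
reject_holm <- p.adjust(p, method = "holm") <= alpha        # controls the FWER
reject_bh   <- p.adjust(p, method = "BH") <= alpha          # controls the FDR

# Number of rejections under each procedure.
c(bonferroni = sum(reject_bonf), holm = sum(reject_holm), BH = sum(reject_bh))
```

Comparing adjusted p-values with α is equivalent to running the corresponding step-wise procedure at level α.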
Testing sets of genes (pathways)
- Next speaker.
Next Generation Sequencing
An NGS experiment produces very many counts.
- Counts are not well approximated by the normal distribution (unless the counts are large and you take the square root).
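One way to see why the square root helps: if the counts were Poisson (an assumption made here for illustration, not stated on the slide), the variance of sqrt(X) settles near 1/4 once the mean is large, whatever that mean is. The sketch computes the moments exactly by summing over the Poisson probabilities; `sqrt_moments` is a hypothetical helper name.

```r
# Exact mean and variance of sqrt(X) for X ~ Poisson(lambda), by summation.
sqrt_moments <- function(lambda, upper = 10 * lambda + 100) {
  x <- 0:upper                      # support, truncated far into the upper tail
  w <- dpois(x, lambda)
  m <- sum(w * sqrt(x))
  c(mean = m, var = sum(w * (sqrt(x) - m)^2))
}

small <- sqrt_moments(2)      # small counts: variance is not yet stabilised
large <- sqrt_moments(200)    # large counts: variance is close to 1/4

rbind(small, large)
```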
Example from Baggerly et al. 2003
library   1       2      3      4      5      6      7      8
pheno     T+      T+     T+     T+     T+     T−     T−     T−
size      100474  96631  92510  95785  18705  95155  91593  98220
tag 1     129     167    71     61     6      43     247    509
tag 2     ···
We want to test if tag 1 is more/less abundant in the tumor versus the control group.
In the tumor group we have
434 / 404105 = 0.0011
and in the control group we have
799 / 284968 = 0.0028.
So, tag 1 occurs more than twice as often in the control group.
Is this statistically significant?
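The pooled proportions can be reproduced directly from the table. The variable names below (`tumor`, `p_tumor`, `p_control`) are chosen here for illustration.

```r
# Counts of tag 1 and library sizes from the Baggerly et al. table.
x <- c(129, 167, 71, 61, 6, 43, 247, 509)
n <- c(100474, 96631, 92510, 95785, 18705, 95155, 91593, 98220)
tumor <- c(rep(TRUE, 5), rep(FALSE, 3))   # libraries 1-5 are T+, 6-8 are T-

p_tumor   <- sum(x[tumor])  / sum(n[tumor])    # 434 / 404105
p_control <- sum(x[!tumor]) / sum(n[!tumor])   # 799 / 284968
ratio <- p_control / p_tumor

round(c(tumor = p_tumor, control = p_control, ratio = ratio), 4)
```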
- Total count for tag 1 in the tumor group has a binomial distribution with n1 = 404105 and some probability p1.
- Total count for tag 1 in the control group has a binomial distribution with n2 = 284968 and some probability p2.
We can use R to test the null hypothesis H0 : p1 = p2
x=c(434,799)
n=c(404105,284968)
prop.test(x,n)
p-value < 2.2e-16.
Giddyup!
A more flexible approach is to fit a logistic regression model for the probability of observing tag 1. In such a model we could add other covariates besides the presence or absence of the tumor.
x=c(129,167,71,61,6,43,247,509)
n=c(100474,96631,92510,95785,18705,95155,91593,98220)
group=factor(c(0,0,0,0,0,1,1,1))  # 0 = tumor (T+), 1 = control (T−)
y=x/n
fit=glm(y~group,family=binomial(),weights=n)
Again, we find a very significant effect.
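A sketch of how the fitted model can be inspected. It assumes nothing beyond base R; the names `fit2`, `odds` and `log_or_direct` are introduced here. With a single two-level factor, the group coefficient is simply the log odds ratio of the two pooled proportions, and the two-column response form gives the same fit.

```r
x <- c(129, 167, 71, 61, 6, 43, 247, 509)
n <- c(100474, 96631, 92510, 95785, 18705, 95155, 91593, 98220)
group <- factor(c(0, 0, 0, 0, 0, 1, 1, 1))   # 0 = tumor, 1 = control

fit <- glm(x / n ~ group, family = binomial(), weights = n)

# Equivalent specification with a (successes, failures) response.
fit2 <- glm(cbind(x, n - x) ~ group, family = binomial())

# The group coefficient is the log odds ratio of the pooled proportions.
log_or <- unname(coef(fit)["group1"])
odds <- function(s, t) (s / t) / (1 - s / t)
log_or_direct <- log(odds(799, 284968) / odds(434, 404105))

# Wald test for the group effect: estimate, standard error, z, p-value.
summary(fit)$coefficients["group1", ]
```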
We put all subjects in each group together, and assumed binomial distributions for the total counts of tag 1. This is only valid if the proportions for each subject within a group are the same.
library     1       2      3      4      5      6      7      8
pheno       T+      T+     T+     T+     T+     T−     T−     T−
size        100474  96631  92510  95785  18705  95155  91593  98220
tag 1       129     167    71     61     6      43     247    509
prop. (%)   0.13    0.17   0.08   0.06   0.03   0.05   0.27   0.52
Do subjects 1 and 2 have the same proportion? No! p=0.012.
x=c(129,167)
n=c(100474,96631)
prop.test(x,n)
Overdispersion
- There is biological/measurement variation between subjects.
- This will increase the variance of the total count: overdispersion.
- Failure to recognize this will lead to underestimation of the variance.
- Underestimation of the variance will lead to many false positives.
We need to account for extra (biological) variation.
- For every subject i, let Pi be random.
- For every subject i, Xi ∼ Bin(ni, Pi).
- Test if the Pi differ on average between the two groups.
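Why a random Pi inflates the variance follows from the law of total variance. In the derivation below, μ and σ² denote the mean and variance of Pi; this notation is introduced here and does not appear on the slides.

```latex
\begin{aligned}
\operatorname{Var}(X_i)
  &= E\bigl[\operatorname{Var}(X_i \mid P_i)\bigr]
   + \operatorname{Var}\bigl(E[X_i \mid P_i]\bigr) \\
  &= E\bigl[n_i P_i (1 - P_i)\bigr] + \operatorname{Var}(n_i P_i) \\
  &= n_i \bigl(\mu - \mu^2 - \sigma^2\bigr) + n_i^2 \sigma^2 \\
  &= n_i \mu (1 - \mu) + n_i (n_i - 1)\, \sigma^2 .
\end{aligned}
```

The first term is the ordinary binomial variance; the second is the extra variation, which vanishes only when σ² = 0 (or ni = 1).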
Many ways to deal with overdispersion:
- log(Pi / (1 − Pi)) normally distributed (GLMM)
  - too slow to do a million tests or more.
- Pi beta distributed. R package BBSeq
  - seems to be too slow as well.
- If n is large and p is small, then we can use Poisson-Gamma or negative binomial. R packages baySeq, DESeq, edgeR.
  - we'll discuss now.
Recall, Binomial(n, p) distribution has
mean = np and variance = np(1− p).
DESeq and edgeR assume Negative-Binomial(µ, φ) distribution with
mean = µ and variance = µ + φµ².
Much effort goes into estimating the dispersion φ when very few (biological) replicates are available.
- Here we should point out that it is generally better to increase the sample size, if you can.
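The mean-variance relation can be checked numerically with base R's `dnbinom`, whose `size` argument corresponds to 1/φ in this parameterisation. The values of `mu` and `phi` below are arbitrary illustrative choices.

```r
mu  <- 100    # hypothetical mean count
phi <- 0.2    # hypothetical dispersion

# R parameterises the negative binomial by size = 1 / phi.
x <- 0:5000                       # support, truncated far into the upper tail
w <- dnbinom(x, size = 1 / phi, mu = mu)

nb_mean <- sum(w * x)
nb_var  <- sum(w * (x - nb_mean)^2)

c(mean = nb_mean, var = nb_var, formula = mu + phi * mu^2)
```

As φ approaches 0 the variance approaches µ, the Poisson case.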
From the edgeR manual
Much more on baySeq, DESeq and edgeR later today.
Quasi-Binomial “distribution”
A cheap trick is to add an extra dispersion parameter ϕ to the binomial:
variance = np(1 − p) becomes variance = ϕnp(1 − p).
To estimate, we maximize a “quasi likelihood”.
fit=glm(y~group,family=binomial(),weights=n)
simply becomes
fit=glm(y~group,family=quasibinomial(),weights=n)
Now we find p = 0.12.
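A sketch of what changes between the two fits on the tag 1 data. The coefficient estimates stay the same; the dispersion is estimated as the Pearson chi-square statistic divided by the residual degrees of freedom, and the standard errors are inflated by its square root. Names such as `fit_bin`, `fit_quasi` and `phi_hat` are chosen here.

```r
x <- c(129, 167, 71, 61, 6, 43, 247, 509)
n <- c(100474, 96631, 92510, 95785, 18705, 95155, 91593, 98220)
group <- factor(c(0, 0, 0, 0, 0, 1, 1, 1))
y <- x / n

fit_bin   <- glm(y ~ group, family = binomial(), weights = n)
fit_quasi <- glm(y ~ group, family = quasibinomial(), weights = n)

# Dispersion estimate: Pearson chi-square over residual degrees of freedom.
phi_hat <- sum(residuals(fit_bin, type = "pearson")^2) / df.residual(fit_bin)

# Same point estimate, standard error larger by a factor sqrt(phi_hat).
se_bin   <- summary(fit_bin)$coefficients["group1", "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients["group1", "Std. Error"]

# p-value for the group effect, now based on a t distribution.
p_quasi <- summary(fit_quasi)$coefficients["group1", "Pr(>|t|)"]
```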
fit=glm(y~group,family=quasibinomial(),weights=n)
- accounts for overdispersion.
- few assumptions, hence robust.
- standard GLM: fast and reliable software available.
- standard GLM: theory well-developed.
- easy to include covariates.
- extends to dependent data: GEE.
- We condition on the library sizes, and test if the proportion of any given tag relative to all other tags is different between groups.
- But, if any tag differs between groups then other tags must do so as well, simply because
  ∑i pi = 1.
- In a two sample binomial experiment with
  X1 ∼ Bin(n1, p1) and X2 ∼ Bin(n2, p2)
  it makes no sense to test both
  H0 : p1 = p2 and H0 : (1 − p1) = (1 − p2).
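A toy illustration of the constraint, with made-up abundances for four tags: only tag 4 changes in absolute terms, yet the proportions of the other three tags change as well because each column must sum to one.

```r
# Hypothetical absolute abundances of four tags in two groups.
# Only tag 4 truly changes; tags 1-3 are identical in both groups.
abundance <- cbind(groupA = c(100, 100, 100, 100),
                   groupB = c(100, 100, 100, 700))
rownames(abundance) <- paste0("tag", 1:4)

# Proportions relative to the library total, as in the binomial model.
prop <- sweep(abundance, 2, colSums(abundance), "/")
prop
```

A test on proportions would flag tags 1 to 3 as lower in group B, although their abundance is unchanged.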
- The problem is very noticeable if there are tags that are both very abundant and different between groups.
- Anders and Huber (2010) and Robinson and Oshlack (2010) suggest modifying the library sizes so that single tags have little influence.
- Alternatively, we might take a set of tags that we know won't differ between groups and test all tags of interest relative to that set.
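Both ideas can be sketched on a toy count table. The first part is a simplified version of the median-of-ratios size factor in the spirit of Anders and Huber (2010); it is not the DESeq implementation. The second part computes library sizes from a reference set, here assumed to be tags 1 to 3. All counts are hypothetical.

```r
counts <- cbind(A = c(100, 100, 100, 100),
                B = c(100, 100, 100, 700))
rownames(counts) <- paste0("tag", 1:4)

# Naive library sizes are dominated by the abundant, changing tag 4.
colSums(counts)

# Size factor per sample: median over tags of count / geometric mean across samples.
geo_mean <- exp(rowMeans(log(counts)))
size_factor <- apply(counts / geo_mean, 2, median)

# Alternative: library size computed from a reference set of tags
# assumed not to differ between groups (tags 1-3, by construction here).
ref_size <- colSums(counts[c("tag1", "tag2", "tag3"), ])

rbind(size_factor, ref_size)
```

With either choice, the two samples receive equal effective library sizes, so tags 1 to 3 no longer appear to differ.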
Methylation
Joint with
- Jelle Goeman.
- Bas Heijmans.
- Elmar Tobi.
n is not large and p is not small so baySeq, DESeq and edgeR cannot be used. Approximation with the normal distribution is also not appropriate.
Methylation
- addition of methyl group (CH3) to cytosine.
- occurs at CpG sites.
- inherited during mitosis, not meiosis.
- stably affects transcription.
Jirtle and Skinner, 2007
Dutch Famine 1944–1945
- Well documented, brief period allows the study of the effects of famine on human health.
- Children exposed to famine in utero compared to siblings:
  - diabetes
  - obesity
  - cardiovascular disease
How can we measure methylation?
Reduced Representation Bisulphite Sequencing
[Figure: RRBS workflow. Steps shown: MspI digestion and size selection (40-220 bp); end repair and adaptor ligation; bisulfite treatment; PCR; sequencing of short reads; alignment to the reduced representation and inference of methylation.]
Data
24 same-sex sibling pairs. One sib was exposed in utero to the Dutch Famine.
family 9900 9900 9930 9930 18240 18240 17650 17650 17910 17910 ...
case 0 1 0 1 0 1 0 1 0 1 ...
cg02824864
coverage 16 8 5 7 4 10 7 1 12 33 ...
methylated 11 5 4 7 4 8 3 1 9 25 ...
...
Three million different CpGs.
- First, we remove all CpGs with insufficient coverage.
- Next, we test for differences in methylation fraction between exposed and unexposed individuals for every CpG.
- Finally, we test various groups of CpGs simultaneously.
We can account for relatedness by introducing a fixed effect per family, very similar to a paired t test.
For more complicated designs, we would need GEE.
family 9900 9900 9930 9930 18240 18240 17650 17650 17910 17910 ...
case 0 1 0 1 0 1 0 1 0 1 ...
cg02824864
coverage (n) 16 8 5 7 4 10 7 1 12 33 ...
methyl. (x) 11 5 4 7 4 8 3 1 9 25 ...
...
y=x/n
fit=glm(y~case+fam,family=quasibinomial(),weights=n)  # fam: family identifier, coded as a factor
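A runnable sketch of this model, using only the five sibling pairs visible on the slide for cg02824864. The remaining pairs are elided there and are not reconstructed, so the output is a demonstration of the call and not a result of the study.

```r
# The five sibling pairs shown on the slide; the rest of the data is elided.
fam  <- factor(c(9900, 9900, 9930, 9930, 18240, 18240, 17650, 17650, 17910, 17910))
case <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)     # 1 = exposed in utero
n <- c(16, 8, 5, 7, 4, 10, 7, 1, 12, 33)    # coverage
x <- c(11, 5, 4, 7, 4, 8, 3, 1, 9, 25)      # methylated reads

y <- x / n
fit <- glm(y ~ case + fam, family = quasibinomial(), weights = n)

# Estimate, standard error, t statistic and p-value for the exposure effect.
summary(fit)$coefficients["case", ]
```

The family factor absorbs differences between sibling pairs, so the `case` coefficient compares each exposed sib with their own unexposed sibling.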
How to test sets of CpGs?
Next speaker!