Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman

Post on 13-Jan-2016


Statistical analysis of expression data:

Normalization, differential expression and multiple testing

Jelle Goeman

Outline

Normalization
Expression variation
Modeling the log fold change
Complex designs
Shrinkage and empirical Bayes (limma)
Multiple testing (false discovery rate)

Measuring expression

Platforms

Microarrays

RNAseq

Common:
  Need for normalization
  Batch effects

Why normalization

Some experimental factors cannot be completely controlled
  Amount of material
  Amount of degradation
  Print tip differences
  Quality of hybridization

Effects are systematic
  Cause variation between samples and between batches

What is normalization?

Normalization = an attempt to get rid of unwanted systematic variation by statistical means

Note 1: this will never completely succeed
Note 2: this may do more harm than good

Much better, but often impossible: better control of the experimental conditions

How do normalization methods work?

General approach

1. Assume: data from an ideal experiment would have characteristic A
   E.g. mean expression is equal for each sample
   Note: this is an assumption!

2. If the data do not have characteristic A, change the data such that the data now do have characteristic A
   E.g. multiply each sample’s expression by a factor

Example: quantile normalization

Assume:
  “Most probes are not differentially expressed”
  “As many probes are upregulated as downregulated”

Reasonable consequence: the distribution of the expression values is identical for each sample

Normalization: make the distribution of expression values identical for each sample

Quantile normalization in practice

Choose a target distribution
  Typically the average of the measured distributions
  All samples will get this distribution after normalization

Quantile normalization: replace the ith largest expression value in each sample by the ith largest value in the target distribution

Consequence:
  Distribution of expression values is the same between samples
  Expression values for specific genes may differ
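
As a rough illustration of the recipe above, here is a minimal Python sketch of quantile normalization. It assumes the target distribution is the average of the sorted samples and breaks ties by position. The function name `quantile_normalize` and the toy values are made up for this example; real implementations typically treat ties and missing values more carefully.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (one list of expression values each).

    Assumes all samples contain the same number of values and no missing data.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Target distribution: average of the sorted values at each rank.
    target = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered from smallest to largest value in this sample.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            # The value with a given rank is replaced by the target value of that rank.
            out[gene] = target[rank]
        normalized.append(out)
    return normalized


# Toy data: two samples, three genes (hypothetical values).
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(samples)
print(normalized)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both samples contain exactly the values 1.5, 3.5 and 5.5, but the value assigned to a given gene can still differ between the samples.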

Less radical forms of normalization

Make the means per sample the same
Make the medians the same
Make the variances the same
Loess curve smoothing

Same idea, but less change to the data

Overnormalizing

Normalizing can remove or reduce true biological differences
  Example: global increase in expression

Normalization can create differences that are not there
  Example: almost global increase in expression

Usually: normalization reduces unwanted variation
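
The "almost global increase" case can be made concrete with a toy example. The sketch below uses invented log2 expression values and the simplest normalization mentioned earlier (making the means per sample the same); it is only meant to show the direction of the distortion.

```python
# Hypothetical log2 expression values for four genes in two samples.
sample_a = [5.0, 6.0, 7.0, 8.0]
# In sample B the first three genes are truly 2-fold up (+1 on the log2 scale);
# the fourth gene is truly unchanged.
sample_b = [6.0, 7.0, 8.0, 8.0]


def center(values):
    """Mean normalization on the log scale: subtract the sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


true_diff = [b - a for a, b in zip(sample_a, sample_b)]
norm_diff = [b - a for a, b in zip(center(sample_a), center(sample_b))]

print(true_diff)  # [1.0, 1.0, 1.0, 0.0]
print(norm_diff)  # [0.25, 0.25, 0.25, -0.75]
```

The true increase in three genes is largely removed, and the unchanged gene now appears to be downregulated.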

Batch effects

Differences between batches are even stronger than between samples in the same batch

Note: batch effects at several stages

Normalization is not sufficient to remove batch effects

Methods are available (ComBat), but they are not perfect
Best: avoid batch effects if possible

Confounding by batch

Take care of batch effects in the experimental design

Problem: confounding of effect of interest by batch effects

Example: Golub data

Solution: balance or randomize

Expression variation

Differential expression

Two experimental conditions
  Treated versus untreated

Two distinct phenotypes
  Tumor versus normal tissue

Which genes can reliably be called differentially expressed?

Also: continuous phenotypes
  Which gene expressions are correlated with phenotype?

Variation in gene expression

Technical variation
  Variation due to measurement technique
  Variability of measured expression from experiment to experiment on the same subject

Biological variation
  Variation between subjects/samples
  Variability of “true” expression between different subjects

Total variation
  Sum of technical and biological variation

Reliable assessment

Two samples always have different expression
  Maybe even a high fold change
  Due to random biological and technical variation

Reliable assessment of differential expression:
  Show that the fold change found cannot be explained by random variation

Assessment of differential expression

Two interrelated aspects:

Fold change: how large is the expression difference found?

P-value: how sure are we that a true difference exists?

LIMMA: Linear models for gene expression

Modeling variation

How does gene expression depend on experimental conditions?

Can often be well modeled with linear models

Limma: linear models for microarray analysis
  Gordon Smyth, W. and E. Hall Institute, Australia

Multiplicative scale effects

Assumption: effects on gene expression work in a multiplicative way (“fold change”)

Example: treatment increases gene expression of gene MMP8 by a factor 2
  “2-fold increase”

Treatment decreases gene expression of gene MMP8 by a factor 2
  “2-fold decrease”

Multiplicative scale errors

Assumption: variation in gene expression works in a multiplicative way

A 2-fold increase by chance is just as likely as a 2-fold decrease by chance

When true expression is 4, measuring 8 is as likely as measuring 2

Working on the log scale

When effects are multiplicative, log-transform!
  Usual in microarray analysis: log to base 2

Remember: log(ab) = log(a) + log(b)
  2-fold increase = +1 to log expression
  2-fold decrease = -1 to log expression

Log scale makes multiplicative effects symmetric
  ½ and 2 are not symmetric around 1 (= no change)
  -1 and +1 are symmetric around 0 (= no change)
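
A few lines of Python show the symmetry on the log2 scale. The fold changes are arbitrary illustrative values.

```python
import math

# Fold changes on the original scale: 1.0 means "no change".
fold_changes = [0.25, 0.5, 1.0, 2.0, 4.0]
log_fold_changes = [math.log2(fc) for fc in fold_changes]
print(log_fold_changes)  # [-2.0, -1.0, 0.0, 1.0, 2.0]

# log(ab) = log(a) + log(b): doubling an expression of 4 adds 1 on the log2 scale.
doubled = math.log2(4 * 2)
summed = math.log2(4) + math.log2(2)
print(doubled, summed)  # 3.0 3.0
```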

A simple linear model

Example: treated and untreated samples

Model separately for each gene
  Log expression of gene 1: E1

E1 = a + b * Treatment + error

a: intercept = average untreated log expression
b: slope = treatment effect

Modeling all genes simultaneously

E1 = a1 + b1 * Treatment + error
E2 = a2 + b2 * Treatment + error
…
E20,000 = a20,000 + b20,000 * Treatment + error

Same model, but
  Separate intercept and slope for each gene
  And separate sd sigma1, sigma2, … of error

Estimates and standard errors

Gene 1: estimates for a1, b1 and sigma1
  Estimate of treatment effect of gene 1
  b1 is the estimated log fold change
  Standard error s.e.(b1) depends on sigma1

Regular t-test for H0: b1 = 0
  T = b1 / s.e.(b1)
  Can be used to calculate p-values
  Just like regular regression, only 20,000 times

Back to original scale

Log scale regression coefficient b1
  Average log fold change

Back to a fold change: 2^b1
  b1 = 1 becomes fold change 2
  b1 = -1 becomes fold change 1/2
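
The per-gene model, its t-statistic and the back-transformation can be put together in a small sketch. This is ordinary least squares for one gene written out in plain Python, not limma itself. The function name `fit_gene` and the log2 expression values are invented for illustration. The p-value would follow from a t-distribution with n - 2 degrees of freedom and is left out here.

```python
import math


def fit_gene(treatment, log_expr):
    """Least-squares fit of log_expr = a + b * treatment + error for one gene."""
    n = len(log_expr)
    x_mean = sum(treatment) / n
    y_mean = sum(log_expr) / n
    sxx = sum((x - x_mean) ** 2 for x in treatment)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(treatment, log_expr))
    b = sxy / sxx                      # slope: estimated log fold change
    a = y_mean - b * x_mean            # intercept: average untreated log expression
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(treatment, log_expr))
    sigma2 = rss / (n - 2)             # estimated error variance
    se_b = math.sqrt(sigma2 / sxx)     # standard error of b depends on sigma
    return a, b, se_b, b / se_b        # last value is the t-statistic


# Hypothetical data: 3 untreated (0) and 3 treated (1) samples, log2 scale.
treatment = [0, 0, 0, 1, 1, 1]
gene1 = [7.0, 7.2, 6.8, 8.1, 7.9, 8.0]

a1, b1, se_b1, t1 = fit_gene(treatment, gene1)
fold_change = 2 ** b1                  # back to the original scale
print(round(a1, 3), round(b1, 3), round(t1, 3), round(fold_change, 3))
```

With a 0/1 treatment variable the slope is the difference between the treated and untreated group means, so here it is a log fold change of about 1, that is a fold change of about 2. For all genes the same function would be applied once per gene.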

Confounders

Other effects may influence gene expression

Example: batch effects
Example: sex or age of patients

In a linear model we can adjust for such confounders

Flexibility of the linear model

Earlier: E1 = a1 + b1 * Treatment + error

Generalize:
E1 = a1 + b1 * X + c1 * Y + d1 * Z + error

Add as many variables as you need.

Variance shrinkage

Empirical Bayes

So far: each gene on its own
  20,000 unrelated models

Limma: exchange information between genes

“Borrowing strength”
  By empirical Bayes arguments

Estimating variance

For each gene a variance is estimated

Small sample size: variance estimate is unreliable
  Too small for some genes
  Too large for others

Variance estimated too small: false positives
Variance estimated too large: low power

Large and small estimated variance

Gene with low variance estimate
  Likely to have low true variance
  But also: likely to have underestimated variance

Gene with high variance estimate
  Likely to have high true variance
  But also: likely to have overestimated variance

Limma’s idea:
  Use information from other genes to assess whether variance is over/underestimated

True and estimated variance

Variance model

Limma has a gene variance model
  All genes’ variances are drawn at random from an inverse gamma distribution

Based on this model:
  Large variances are shrunk downwards
  Small variances are shrunk upwards
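
The direction of the shrinkage can be seen in a small sketch. In limma the moderated variance is a weighted average of the gene's own variance estimate and a prior variance, weighted by their degrees of freedom. Limma estimates the prior variance and prior degrees of freedom from all genes together. Here they are simply assumed, and all numbers below are hypothetical.

```python
def moderated_variance(s2, df, s2_prior, df_prior):
    """Weighted average of a gene's variance estimate and a prior variance."""
    return (df_prior * s2_prior + df * s2) / (df_prior + df)


# Assumed (not estimated) prior: in limma these would come from all genes.
s2_prior = 0.25
df_prior = 4

# Three hypothetical genes with 4 residual degrees of freedom each.
gene_variances = [0.01, 0.25, 4.0]
moderated = [moderated_variance(s2, 4, s2_prior, df_prior) for s2 in gene_variances]

for s2, s2_mod in zip(gene_variances, moderated):
    print(s2, "->", round(s2_mod, 3))
```

The small variance is pulled upwards, the large variance is pulled downwards, and a variance equal to the prior stays where it is. As the residual degrees of freedom `df` grow, the gene's own estimate dominates the average.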

Effect of variance shrinkage

Genes with large fold change and large variance
  More power
  More likely to be significant

Genes with small fold change and small variance
  Less power
  Less likely to be significant

Limma and sample size

Shrinkage of limma only effective for small sample size (< 10 samples/group)

Added information of other genes becomes negligible if sample size gets large

Large samples: Doing limma is the same as doing regression per gene

Differential expression in RNAseq

RNAseq data: counts

Gene id           Y1   Y2   Y3   Y4   Y5   Y6   Y7   Y8   Y9  Y10
ENSG00000110514   69  178  101   58  101   31  165  108   70    1
ENSG00000086015  115   52   86   88  146   84   59   85   86    0
ENSG00000115808  285  190  467  295  345  532  369  473  423    5
ENSG00000169740  502  184  363  195  403  262  225  332  136    3
ENSG00000215869    0    7    0    0    0    0    0    2    0    0
ENSG00000261609   20   31   76   20   25  158   23   18   23    1
ENSG00000169744  488  529  470  505 1137  373 1392 3517  192    1
ENSG00000215864    1    0    0    0    0    0    0    0    0    0

Modelling count data

Distinguish three types of variation
  Biological variation
  Technical variation
  Count variation

Count variation is important for low-expressed genes

Generally biological variation most important

Overdispersion

Modelling count data: two stages

1. Model how gene expression varies from sample to sample

2. Model how the observed count varies by repeated sequencing of the same sample

Stage 2 is specific for RNAseq
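
A quick check on two rows of the count table shown earlier illustrates why the slide is titled "Overdispersion". If count variation were the only source of variation (a Poisson model), the variance of a gene's counts would be about equal to its mean. The sketch below is a crude comparison only: it ignores differences in sequencing depth between samples, which a real analysis would account for.

```python
from statistics import mean, variance

# Two rows copied from the count table (samples Y1..Y10).
counts = {
    "ENSG00000110514": [69, 178, 101, 58, 101, 31, 165, 108, 70, 1],
    "ENSG00000169744": [488, 529, 470, 505, 1137, 373, 1392, 3517, 192, 1],
}

for gene, y in counts.items():
    m = mean(y)
    v = variance(y)  # sample variance (divides by n - 1)
    # A variance/mean ratio far above 1 indicates more variation than counting alone.
    print(gene, "mean:", round(m, 1), "variance:", round(v, 1), "ratio:", round(v / m, 1))
```

For both genes the variance is far larger than the mean, so sample-to-sample variation has to be modelled in addition to the count variation.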

Two approaches

Approach 1: Model the count variation and the between-sample variation
  edgeR
  DESeq

Approach 2: Normalize the count data and model only the biological variation
  voom + limma

Approach 3: Model count variation only
  Popular but very wrong!

Multiple testing

20,000 p-values

Fitting 20,000 linear models
  Some variance shrinkage

Result:
  20,000 fold changes
  20,000 p-values

Which ones are truly differentially expressed?

Multiple testing

Doing 20,000 tests: risk a false positive 20,000 times

If 5% of null hypotheses are significant, expect 1,000 significant results by pure chance

How to make sure you can really trust the results?

Bonferroni

Classical way of doing multiple testing
  Call K the number of tests performed

Bonferroni: significant = p-value < 0.05/K

“Adjusted p-value”
  Multiply all p-values by K, compare with 0.05
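
A minimal sketch of the Bonferroni adjustment in Python follows. The p-values are illustrative, the function name `bonferroni_adjust` is made up for this example, and capping the adjusted values at 1 is a common convention that the slide does not mention.

```python
def bonferroni_adjust(pvalues):
    """Multiply each p-value by the number of tests K, capped at 1."""
    k = len(pvalues)
    return [min(1.0, p * k) for p in pvalues]


# Illustrative p-values from K = 5 tests.
pvalues = [0.0031, 0.0034, 0.02, 0.10, 0.65]
adjusted = bonferroni_adjust(pvalues)
significant = [q < 0.05 for q in adjusted]

print([round(q, 4) for q in adjusted])
print(significant)  # [True, True, False, False, False]
```

Comparing an adjusted p-value with 0.05 gives the same decision as comparing the raw p-value with 0.05/K.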

Advantages of Bonferroni

Familywise error control = probability of making any type I error < 0.05

With 95% chance, list of differentially expressed genes has no errors

Very strict
Easy to do

Disadvantages of Bonferroni

Very strict
  “No” false positives
  Many false negatives

It is not a big problem to have a few false positives

Do validation experiments later

False discovery rate (Benjamini and Hochberg)

FDR = expected proportion of false discoveries among all discoveries

Control of FDR at 0.05 means in the long run experiments average about 5% type I errors among the reported genes

Percentage: longer lists of genes are allowed to have more errors

Benjamini and Hochberg by hand

1. Order the p-values small to large
   Example: 0.0031, 0.0034, 0.02, 0.10, 0.65

2. Multiply the k-th p-value by m/k, where m is the number of p-values, so
   0.0031 * 5/1, 0.0034 * 5/2, 0.02 * 5/3, 0.10 * 5/4, 0.65 * 5/5
   which becomes
   0.0155, 0.0085, 0.033, 0.125, 0.65

3. If the p-values are no longer in increasing order, replace each p-value by the smallest p-value that is later in the list. In the example, we replace 0.0155 by 0.0085. The final Benjamini-Hochberg adjusted p-values become
   0.0085, 0.0085, 0.033, 0.125, 0.65
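
The three steps translate directly into a short Python function. This is a sketch that follows the by-hand recipe. The name `bh_adjust` is made up for this example, and the cap at 1 is an added convention that does not matter for these numbers.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(pvalues)
    # Step 1: order the p-values from small to large (keep track of positions).
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # cap at 1
    # Walk from the largest p-value down, so that "the smallest value later
    # in the list" (step 3) is always available as a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        scaled = pvalues[i] * m / rank  # Step 2: multiply the k-th p-value by m/k.
        running_min = min(running_min, scaled)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.0031, 0.0034, 0.02, 0.10, 0.65]
adjusted = bh_adjust(pvalues)
print([round(q, 4) for q in adjusted])  # [0.0085, 0.0085, 0.0333, 0.125, 0.65]
```

The output matches the by-hand result, with 0.0333 shown to one more digit than the 0.033 on the slide.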

FDR warnings

FDR is susceptible to cheating

How to cheat with FDR?
  Add many tests of known false null hypotheses…

Result: reject more of the other null hypotheses
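
The effect is easy to reproduce with made-up numbers. The sketch below applies a Benjamini-Hochberg adjustment (a helper named `bh_adjust`, written for this example) to five hypothetical p-values, once on their own and once after padding the list with twenty tiny p-values standing in for hypotheses already known to be false.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values of the hypotheses we actually care about.
honest = [0.02, 0.03, 0.04, 0.045, 0.6]
# Twenty extra tests of null hypotheses known in advance to be false.
padding = [1e-6] * 20

before = bh_adjust(honest)
after = bh_adjust(honest + padding)[: len(honest)]

rejected_before = sum(q < 0.05 for q in before)
rejected_after = sum(q < 0.05 for q in after)
print(rejected_before, rejected_after)  # 0 4
```

Nothing about the five original hypotheses has changed, yet four of them are now rejected at an FDR of 0.05, because the guaranteed discoveries leave room for more false ones.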

Example limma results

Conclusion

Testing for differentially expressed genes

Repeated application of a linear model

Include all factors in the model that may influence gene expression

Limma: additional step “borrowing strength”

Don’t forget to correct for multiple testing!