statistical analysis of expression data: normalization, differential expression and multiple testing...

Statistical analysis of expression data:

Normalization, differential expression and multiple testing

Jelle Goeman

Outline

Normalization

Expression variation

Modeling the log fold change

Complex designs

Shrinkage and empirical Bayes (limma)

Multiple testing (False Discovery Rate)

Measuring expression

Platforms

Microarrays

RNAseq

Common:

Need for normalization

Batch effects

Why normalization

Some experimental factors cannot be completely controlled:

Amount of material

Amount of degradation

Print tip differences

Quality of hybridization

Effects are systematic

Cause variation between samples and between batches

What is normalization?

Normalization =

An attempt to get rid of unwanted systematic variation by statistical means

Note 1: this will never completely succeed

Note 2: this may do more harm than good

Much better, but often impossible

Better control of the experimental conditions

How do normalization methods work?

General approach

1. Assume: data from an ideal experiment would have characteristic A

E.g. mean expression is equal for each sample

Note: this is an assumption!

2. If the data do not have characteristic A, change the data such that the data now do have characteristic A

E.g. multiply each sample’s expression by a factor
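
The two steps can be sketched for the "equal mean per sample" example. This is a minimal illustration, not a recommended method: the function name `scale_to_common_mean`, the sample names and the numbers are all made up, and the common target is assumed to be the average of the per-sample means.

```python
from statistics import mean

def scale_to_common_mean(samples):
    """Rescale each sample so that all samples share the same mean expression.

    `samples` maps a sample name to a list of (non-log) expression values.
    """
    sample_means = {name: mean(values) for name, values in samples.items()}
    # Characteristic A: every sample has this one common mean (an assumption).
    target = mean(list(sample_means.values()))
    return {
        name: [v * target / sample_means[name] for v in values]
        for name, values in samples.items()
    }

# Hypothetical data: s2 looks like s1 with twice the amount of material.
samples = {
    "s1": [10.0, 20.0, 30.0],
    "s2": [20.0, 40.0, 60.0],
}
normalized = scale_to_common_mean(samples)
```

After scaling, both samples have the same mean, so the factor 2 between them is gone, whether it was technical or biological.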

Example: quantile normalization

Assume: “Most probes are not differentially expressed”

“As many probes are upregulated as downregulated”

Reasonable consequence:

The distribution of the expression values is identical for each sample

Normalization:

Make the distribution of expression values identical for each sample

Quantile normalization in practice

Choose a target distribution

Typically the average of the measured distributions

All samples will get this distribution after normalization

Quantile normalization: Replace the ith largest expression value in each sample by the ith largest value in the target distribution

Consequence:

Distribution of expressions the same between samples

Expressions for specific genes may differ
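
A minimal sketch of the procedure, assuming all samples contain the same number of values and taking the average of the sorted samples as the target distribution. The function name and the data are illustrative; ties are broken by position, which is simpler than what dedicated packages do.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length samples (lists of values)."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Target distribution: mean of the i-th smallest values across samples.
    target = [sum(s[i] for s in sorted_samples) / len(samples) for i in range(n)]
    normalized = []
    for s in samples:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = target[rank]  # i-th ranked value gets i-th target value
        normalized.append(out)
    return normalized

# Hypothetical expression values for three samples and four probes.
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
```

Every normalized sample contains exactly the same set of values; only the assignment of values to probes differs between samples.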

Less radical forms of normalization

Make the means per sample the same

Make the medians the same

Make the variances the same

Loess curve smoothing

Same idea, but less change to the data

Overnormalizing

Normalizing can remove or reduce true biological differences

Example: global increase in expression

Normalization can create differences that are not there

Example: almost global increase in expression
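
Both examples can be reproduced with a toy calculation. The sketch below assumes log2 expression values and a simple median-centering normalization; the numbers and the two "treatments" are hypothetical.

```python
from statistics import median

def median_center(log_expr):
    """Subtract the sample median from every log expression value."""
    m = median(log_expr)
    return [x - m for x in log_expr]

control = [4.0, 6.0, 8.0, 10.0, 12.0]
# Hypothetical treatment that truly doubles every gene (+1 on the log2 scale).
treated_global = [x + 1.0 for x in control]
# Hypothetical treatment that doubles every gene except the last one.
treated_almost = [x + 1.0 for x in control[:-1]] + [control[-1]]

centered_control = median_center(control)
diff_global = [t - c for t, c in zip(median_center(treated_global), centered_control)]
diff_almost = [t - c for t, c in zip(median_center(treated_almost), centered_control)]
```

In `diff_global` all differences are zero, so the true global increase has been removed. In `diff_almost` the four truly changed genes show no difference, while the one unchanged gene appears downregulated by 1.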

Usually: normalization reduces unwanted variation

Batch effects

Differences between batches are even stronger than between samples in the same batch

Note: batch effects at several stages

Normalization is not sufficient to remove batch-effects

Methods available (ComBat) but not perfect

Best: avoid batch effects if possible

Confounding by batch

Take care of batch-effects in experimental design

Problem: confounding of effect of interest by batch effects

Example: Golub data

Solution: balance or randomize

Expression variation

Differential expression

Two experimental conditions

Treated versus untreated

Two distinct phenotypes

Tumor versus normal tissue

Which genes can reliably be called differentially expressed?

Also: continuous phenotypes

Which gene expressions are correlated with phenotype?

Variation in gene expression

Technical variation

Variation due to measurement technique

Variability of measured expression from experiment to experiment on the same subject

Biological variation

Variation between subjects/samples

Variability of “true” expression between different subjects

Total variation

Sum of technical and biological variation
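
If technical and biological variation are assumed to be independent (an assumption the slide does not state), the sum can be read as a sum of variances. The symbols are introduced here for illustration only:

```latex
\sigma^2_{\text{total}} = \sigma^2_{\text{biological}} + \sigma^2_{\text{technical}}
```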

Reliable assessment

Two samples always have different expression

Maybe even a high fold change

Due to random biological and technical variation

Reliable assessment of differential expression:

Show: fold change found cannot be explained by random variation

Assessment of differential expression

Two interrelated aspects:

Fold change:

How large is the expression difference found?

P-value:

How sure are we that a true difference exists?

LIMMA: Linear models for gene expression

Modeling variation

How does gene expression depend on experimental conditions?

Can often be well modeled with linear models

Limma: linear models for microarray analysis

Gordon Smyth, W. and E. Hall Institute, Australia

Multiplicative scale effects

Assumption: effects on gene expression work in a multiplicative way (“fold change”)

Example: treatment increases gene expression of gene MMP8 by a factor 2: “2-fold increase”

Treatment decreases gene expression of gene MMP8 by a factor 2: “2-fold decrease”

Multiplicative scale errors

Assumption: variation on gene expression works in a multiplicative way

A 2-fold increase by chance is just as likely as a 2-fold decrease by chance

When true expression is 4, measuring 8 is as likely as measuring 2

Working on the log scale

When effects are multiplicative, log-transform!

Usual in microarray analysis: log to base 2

Remember: log(ab) = log(a) + log(b)

2-fold increase = +1 to log expression

2-fold decrease = -1 to log expression

Log scale makes multiplicative effects symmetric

½ and 2 are not symmetric around 1 (= no change)

-1 and +1 are symmetric around 0 (= no change)
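
A quick check of these statements with the standard library. The list of fold changes is just an example.

```python
import math

# Fold changes on the original scale: 1.0 means no change.
fold_changes = [0.5, 1.0, 2.0, 4.0]
log2_fold_changes = [math.log2(fc) for fc in fold_changes]

# log(ab) = log(a) + log(b): a 2-fold followed by a 4-fold increase
# is an 8-fold increase, i.e. 1 + 2 = 3 on the log2 scale.
combined = math.log2(2.0) + math.log2(4.0)
```

Here `log2_fold_changes` is `[-1.0, 0.0, 1.0, 2.0]`: halving and doubling sit at equal distance from 0.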

A simple linear model

Example: treated and untreated samples

Model separately for each gene

Log expression of gene 1: E1

E1 = a + b * Treatment + error

a: intercept = average untreated log expression

b: slope = treatment effect

Modeling all genes simultaneously

E1 = a1 + b1 * Treatment + error

E2 = a2 + b2 * Treatment + error

…

E20,000 = a20,000 + b20,000 * Treatment + error

Same model, but

Separate intercept and slope for each gene

And separate sd sigma1, sigma2, … of error

Estimates and standard errors

Gene 1: Estimates for a1, b1 and sigma1

Estimate of treatment effect of gene 1

b1 is the estimated log fold change

Standard error s.e.(b1) depends on sigma1

Regular t-test for H0: b1 = 0:

T = b1 / s.e.(b1)

Can be used to calculate p-values.

Just like regular regression, only 20,000 times
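
A minimal sketch of the per-gene fit for a two-group design, using only the standard library. It relies on the fact that for a 0/1 treatment variable the least-squares estimates reduce to group means. The function name `fit_gene`, the gene names and the data are hypothetical, and the p-value step (comparing T with a t-distribution on n - 2 degrees of freedom) is left out.

```python
from math import sqrt
from statistics import mean

def fit_gene(log_expr, treatment):
    """Least-squares fit of E = a + b * Treatment + error for one gene.

    `treatment` holds 0 (untreated) or 1 (treated) for each sample.
    """
    untreated = [e for e, t in zip(log_expr, treatment) if t == 0]
    treated = [e for e, t in zip(log_expr, treatment) if t == 1]
    a = mean(untreated)      # intercept: average untreated log expression
    b = mean(treated) - a    # slope: estimated log fold change
    rss = sum((e - a) ** 2 for e in untreated) + sum((e - a - b) ** 2 for e in treated)
    df = len(log_expr) - 2   # residual degrees of freedom
    sigma = sqrt(rss / df)   # estimated sd of the error
    se_b = sigma * sqrt(1 / len(untreated) + 1 / len(treated))
    return {"a": a, "b": b, "sigma": sigma, "se_b": se_b, "t": b / se_b, "df": df}

treatment = [0, 0, 0, 1, 1, 1]
genes = {
    "gene1": [5.0, 5.5, 4.5, 7.0, 7.5, 6.5],  # clear treatment effect
    "gene2": [8.0, 9.0, 7.0, 8.5, 7.5, 8.0],  # no treatment effect
}
# Same model, fitted separately for each gene.
results = {name: fit_gene(values, treatment) for name, values in genes.items()}
```

For `gene1` this gives b1 = 2 (a 4-fold increase) with sigma1 = 0.5; for `gene2` the estimated log fold change is 0.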

Back to original scale

Log scale regression coefficient b1

Average log fold change

Back to a fold change: 2^b1

b1 = 1 becomes fold change 2

b1 = -1 becomes fold change 1/2

Confounders

Other effects may influence gene expression

Example: batch effects

Example: sex or age of patients

In a linear model we can adjust for such confounders

Flexibility of the linear model

Earlier: E1 = a1 + b1 * Treatment + error

Generalize:

E1 = a1 + b1 * X + c1 * Y + d1 * Z + error

Add as many variables as you need.
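
Written out per sample, the generalized model for gene 1 has one coefficient per variable. The sample index i is notation added here, and X, Y and Z could for example be treatment, batch and sex (illustrative choices):

```latex
E_{1i} = a_1 + b_1 X_i + c_1 Y_i + d_1 Z_i + \varepsilon_{1i}, \qquad i = 1, \dots, n
```

The coefficient b_1 is then the log fold change associated with X, adjusted for Y and Z.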

Variance shrinkage

Empirical Bayes

So far: each gene on its own

20,000 unrelated models

Limma: exchange information between genes

“Borrowing strength”

By empirical Bayes arguments

Estimating variance

For each gene a variance is estimated

Small sample size: variance estimate is unreliable

Too small for some genes

Too large for others

Variance estimated too small: false positives

Variance estimated too large: low power

Large and small estimated variance

Gene with low variance estimate

Likely to have low true variance

But also: likely to have underestimated variance

Gene with high variance estimate

Likely to have high true variance

But also: likely to have overestimated variance

Limma’s idea:

Use information from other genes to assess whether variance is over/underestimated

True and estimated variance

Variance model

Limma has a gene variance model

All genes’ variances are drawn at random from an inverse gamma distribution

Based on this model:

Large variances are shrunk downwards

Small variances are shrunk upwards
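
Under this model the shrunken variance is a weighted average of the gene's own estimate and a prior variance, weighted by their degrees of freedom. The sketch below shows only that formula. In limma the prior variance and prior degrees of freedom are estimated from all genes together; here they are fixed by hand, and all numbers are hypothetical.

```python
def shrink_variance(s2, df, s2_prior, df_prior):
    """Weighted average of a gene's variance estimate and a prior variance."""
    return (df_prior * s2_prior + df * s2) / (df_prior + df)

# Hand-picked prior (in limma these values are estimated from the data).
s2_prior, df_prior = 0.25, 4.0
df = 4.0  # residual degrees of freedom per gene, e.g. 3 + 3 samples

small = shrink_variance(0.05, df, s2_prior, df_prior)  # shrunk upwards
large = shrink_variance(1.00, df, s2_prior, df_prior)  # shrunk downwards
# With many samples the gene's own estimate dominates and shrinkage fades.
large_many_samples = shrink_variance(1.00, 96.0, s2_prior, df_prior)
```

With 4 residual degrees of freedom the estimate 0.05 moves up to 0.15 and the estimate 1.00 moves down to 0.625; with 96 degrees of freedom the estimate 1.00 barely moves (0.97).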

Effect of variance shrinkage

Genes with large fold change and large variance:

More power

More likely to be significant

Genes with small fold change and small variance:

Less power

Less likely to be significant

Limma and sample size

Shrinkage of limma only effective for small sample size (< 10 samples/group)

Added information of other genes becomes negligible if sample size gets large

Large samples: Doing limma is the same as doing regression per gene

Differential expression in RNAseq

RNAseq data: counts

Gene id            Y1    Y2    Y3    Y4    Y5    Y6    Y7    Y8    Y9   Y10
ENSG00000110514    69   178   101    58   101    31   165   108    70     1
ENSG00000086015   115    52    86    88   146    84    59    85    86     0
ENSG00000115808   285   190   467   295   345   532   369   473   423     5
ENSG00000169740   502   184   363   195   403   262   225   332   136     3
ENSG00000215869     0     7     0     0     0     0     0     2     0     0
ENSG00000261609    20    31    76    20    25   158    23    18    23     1
ENSG00000169744   488   529   470   505  1137   373  1392  3517   192     1
ENSG00000215864     1     0     0     0     0     0     0     0     0     0

Modelling count data

Distinguish three types of variation

Biological variation

Technical variation

Count variation

Count variation is important for low-expressed genes

Generally biological variation most important

Overdispersion

Modelling count data: two stages

1.Model how gene expression varies from sample to sample

2.Model how the observed count varies by repeated sequencing of the same sample

Stage 2 is specific for RNAseq

Two approaches

Approach 1: Model the count variation and the between-sample variation

edgeR

DESeq

Approach 2: Normalize the count data and model only the biological variation

Voom + limma

Approach 3: Model count variation only

Popular but very wrong!
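
For Approach 2, the normalization step can be sketched as a log2 counts-per-million transform with small offsets (0.5 added to the count, 1 to the library size), as voom uses. This is only the transformation: voom additionally estimates a mean-variance trend to derive precision weights, which is not shown. The function name, the counts and the library size are made up for illustration.

```python
import math

def log_cpm(counts, lib_size):
    """log2 counts per million, with offsets to avoid taking the log of zero."""
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]

# Hypothetical sample: counts for three genes and an invented library size.
counts = [69, 0, 488]
lib_size = 999_999
values = log_cpm(counts, lib_size)
```

A count of zero maps to a finite value (here about -1), so low-expressed genes stay in the analysis.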

Multiple testing

20,000 p-values

Fitting 20,000 linear models

Some variance shrinkage

Result:

20,000 fold changes

20,000 p-values

Which ones are truly differentially expressed?

Multiple testing

Doing 20,000 tests: risk false positive 20,000 times

If all 20,000 null hypotheses are true, about 5% are still significant at level 0.05: expect 1,000 significant results by pure chance

How to make sure you can really trust the results?

Bonferroni

Classical way of doing multiple testing

Call K the number of tests performed

Bonferroni: significant = p-value < 0.05/K

“Adjusted p-value”

Multiply all p-values by K, compare with 0.05
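
A minimal sketch of the adjusted p-value calculation. The five p-values are example numbers, and capping the adjusted values at 1 is a common convention added here, not something stated on the slide.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests K, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

p_values = [0.0031, 0.0034, 0.02, 0.10, 0.65]
adjusted = bonferroni_adjust(p_values)
# Comparing adjusted p-values with 0.05 is the same as p < 0.05 / K.
significant = [p_adj < 0.05 for p_adj in adjusted]
```

With K = 5 only the two smallest p-values remain significant.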

Advantages of Bonferroni

Familywise error control = probability of making any type I error < 0.05

With 95% chance, list of differentially expressed genes has no errors

Very strict

Easy to do

Disadvantages of Bonferroni

Very strict

“No” false positives

Many false negatives

It is not a big problem to have a few false positives

Do validation experiments later

False discovery rate (Benjamini and Hochberg)

FDR = expected proportion of false discoveries among all discoveries

Control of FDR at 0.05 means in the long run experiments average about 5% type I errors among the reported genes

Percentage: longer lists of genes are allowed to have more errors

Benjamini and Hochberg by hand

1. Order the p-values small to large

Example: 0.0031, 0.0034, 0.02, 0.10, 0.65

2. Multiply the k-th p-value by m/k, where m is the number of p-values, so

0.0031 * 5/1, 0.0034 * 5/2, 0.02 * 5/3, 0.10 * 5/4, 0.65 * 5/5

which becomes

0.0155, 0.0085, 0.033, 0.125, 0.65

3. If the p-values are no longer in increasing order, replace each p-value by the smallest p-value that is later in the list. In the example, we replace 0.0155 by 0.0085. The final Benjamini-Hochberg adjusted p-values become

0.0085, 0.0085, 0.033, 0.125, 0.65
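
The three steps can be sketched in a few lines. The function name is illustrative; step 3 is implemented as a running minimum taken from the largest p-value downwards, and values are capped at 1.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Step 1: order the p-values from small to large (keep track of positions).
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        # Step 2: multiply the k-th p-value by m/k.
        # Step 3: never exceed any adjusted value later in the list.
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

p_values = [0.0031, 0.0034, 0.02, 0.10, 0.65]
adjusted = benjamini_hochberg(p_values)
```

Rounded to four decimals this gives 0.0085, 0.0085, 0.0333, 0.125, 0.65, matching the calculation by hand.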

FDR warnings

FDR is susceptible to cheating

How to cheat with FDR?

Add many tests of known false null hypotheses…

Result: reject more of the other null hypotheses
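
A toy demonstration of this effect, with invented p-values. Five "honest" tests are adjusted with Benjamini-Hochberg, first on their own and then padded with 20 tests whose null hypotheses are known to be false (tiny p-values).

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

honest = [0.001, 0.03, 0.04, 0.2, 0.5]   # the tests we actually care about
padding = [1e-6] * 20                    # added tests of known false nulls

# Number of honest tests rejected at FDR 0.05, without and with padding.
rejected_honest = sum(p < 0.05 for p in bh_adjust(honest))
rejected_padded = sum(p < 0.05 for p in bh_adjust(honest + padding)[:len(honest)])
```

Without padding one of the five honest tests is rejected; with padding three are, although nothing about those five tests has changed.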

Example limma results

Conclusion

Testing for differentially expressed genes

Repeated application of a linear model

Include all factors in the model that may influence gene expression

Limma: additional step “borrowing strength”

Don’t forget to correct for multiple testing!
