![Page 1: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/1.jpg)
Statistical Models for sequencing data: from Experimental Design
to Generalized Linear Models
Oscar M. Rueda, Breast Cancer Functional Genomics Group, CRUK Cambridge Research Institute (a.k.a. Li Ka Shing Centre)
Best practices in the analysis of RNA-Seq and ChIP-Seq data, 27th-31st July 2015
University of Cambridge, Cambridge, UK
Outline
• Experimental Design
• Design and Contrast matrices
• Generalized linear models
• Models for counting data
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
Sir Ronald Fisher (1890-1962)
[evolutionary biologist, geneticist and statistician]
An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.
John Tukey (1915-2000)
[Statistician]
An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts - for support rather than for illumination.
Andrew Lang (1844-1912)
[Poet, novelist and literary critic]
Experimental Design
Design of an experiment
• Select biological questions of interest
• Identify an appropriate measure to answer that question
• Select additional variables or factors that can influence the result of the experiment
• Select a sample size and the sample units
• Assign samples to lanes/flow cells.
Principles of Statistical Design of Experiments
• R. A. Fisher:
– Replication
– Blocking
– Randomization.
• They have been used in microarray studies from the beginning.
• Bar coding makes it easy to adapt them to NGS studies.
Sampling hierarchy
There are three levels of sampling:
• Subject sampling
• RNA sampling
• Fragment sampling.
Auer and Doerge. Genetics 185:405-416 (2010)
Unreplicated Data
Inferences at the RNA and fragment level can be obtained with Fisher's exact test, but they do not reflect biological variability.
Auer and Doerge. Genetics 185:405-416 (2010)
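A minimal sketch of the unreplicated case in R. The read counts and lane totals below are made up for illustration; with one sample per treatment, a gene's reads can be compared against all remaining reads in each lane with Fisher's exact test (`fisher.test` in base R).

```r
# Hypothetical counts for one gene in two unreplicated samples (one lane each).
gene.counts <- c(TreatmentA = 120, Control = 60)
# Hypothetical total number of reads in each lane.
lane.totals <- c(TreatmentA = 1e6, Control = 1e6)

# 2 x 2 table: reads from this gene vs. reads from all other genes.
tab <- rbind(gene  = gene.counts,
             other = lane.totals - gene.counts)

res <- fisher.test(tab)
res$p.value  # reflects fragment-level sampling only, not biological variability
```

A small p-value here says the two libraries differ, but with no biological replicates it cannot be attributed to the treatment.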
Replicated Data
Inferences about the treatment effect can be made using generalized linear models (more on this later).
Auer and Doerge. Genetics 185:405-416 (2010)
Is this a good design? We should randomize within blocks!
Balanced Block Designs
• Avoids confounding effects:
– Lane effects (any errors from the point where the sample is input to the flow cell until the data output). Examples: systematically bad sequencing cycles, errors in base calling…
– Batch effects (any errors after random fragmentation of the RNA until it is input to the flow cell). Examples: PCR amplification, reverse transcription artifacts…
– Other effects not related to treatment.
Auer and Doerge. Genetics 185:405-416 (2010)
Balanced blocks by multiplexing
Auer and Doerge. Genetics 185:405-416 (2010)
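A rough sketch of what a balanced block layout could look like in R. It assumes a hypothetical experiment (2 treatments, 3 biological replicates each, 6 lanes, barcodes `BC1` to `BC6`) and assumes that every barcoded sample is pooled and run on every lane, so that each lane acts as a complete block. The names and sizes are illustrative, not taken from the slides.

```r
# Hypothetical setup: 2 treatments x 3 biological replicates = 6 barcoded samples.
samples <- data.frame(
  sample    = paste0("S", 1:6),
  treatment = rep(c("TreatmentA", "Control"), each = 3),
  barcode   = paste0("BC", 1:6)
)
lanes <- data.frame(lane = paste0("Lane", 1:6))

# merge() on data frames with no shared columns returns every
# sample x lane combination: each lane (block) holds every sample.
design <- merge(samples, lanes)

# Every lane contains the same number of libraries from each treatment,
# so lane effects are not confounded with the treatment effect.
table(design$lane, design$treatment)
```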
Balanced incomplete block design and blocking without multiplexing
• Usually there are restrictions on the number of treatments, replicates, unique bar codes and available lanes for sequencing.
• 3 treatments (T1, T2, T3)
• 1 subject per treatment (T11, T21, T31)
• 2 technical replicates (T111, T112, T211, T212, T311, T312)
Auer and Doerge. Genetics 185:405-416 (2010)
Simulations
Auer and Doerge. Genetics 185:405-416 (2010)
Results
Auer and Doerge. Genetics 185:405-416 (2010)
Benefits of a proper design
• NGS benefits from design principles
• Technical replicates cannot replace biological replicates
• It is possible to avoid multiplexing with enough biological replicates and sequencing lanes
• The advantages of multiplexing outweigh the disadvantages (cost, loss of sequencing depth, bar-code bias…)
Design and contrast matrices
Statistical models
– We want to model the expected result of an outcome (dependent variable) under given values of other variables (independent variables):
$$E(Y) = f(X), \qquad Y = f(X) + \varepsilon$$
– $E(Y)$: expected value of variable $Y$
– $f$: arbitrary function (any shape)
– $X$: a set of $k$ independent variables (also called factors)
– $\varepsilon$: the variability around the expected mean of $Y$
Design matrix
– Represents the independent variables that influence the response variable, but also the way we have coded the information and the design of the experiment.
– For now, let's restrict ourselves to linear models:
$$Y = X\beta + \varepsilon$$
where $Y$ is the response variable, $\beta$ the parameter vector, $X$ the design matrix and $\varepsilon$ the stochastic error.
Types of designs considered
• Models with 1 factor
– Models with two treatments
– Models with several treatments
• Models with 2 factors
– Interactions
• Paired designs
• Models with categorical and continuous factors
• Time course experiments
• Multifactorial models.
Strategy
• Define our set of samples
• Define the factors, type of factors (continuous, categorical), number of levels…
• Define the set of parameters: the effects we want to estimate
• Build the design matrix, which relates the information that each sample contains about the parameters.
• Estimate the parameters of the model: testing
• Further estimation (and testing): contrast matrices.
Models with 1 factor, 2 levels
| Sample | Treatment |
|---|---|
| Sample 1 | Treatment A |
| Sample 2 | Control |
| Sample 3 | Treatment A |
| Sample 4 | Control |
| Sample 5 | Treatment A |
| Sample 6 | Control |

Number of samples: 6
Number of factors: 1
Treatment: number of levels: 2
Possible parameters (what differences are important?):
– Effect of Treatment A
– Effect of Control
Design matrix for models with 1 factor, 2 levels
Design matrix (rows: Sample 1 to Sample 6; columns: Treat. A, Control) and parameters (coefficients, levels of the variable):
$$
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix} =
\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} T \\ C \end{pmatrix}
$$
Equivalent to a t-test.
C is the mean expression of the control; T is the mean expression of the treatment.
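The one-factor, two-level design can be sketched in R as follows. The expression values in `y` are hypothetical and only there to make the fit runnable; the column names `T` and `C` are set by hand to mirror the parameters above.

```r
# Sample annotation: Samples 1, 3, 5 are Treatment A; 2, 4, 6 are Control.
Treatment <- factor(rep(c("TreatmentA", "Control"), 3),
                    levels = c("TreatmentA", "Control"))

# No intercept: one column per level, i.e. the parameters T and C.
design.matrix <- model.matrix(~ 0 + Treatment)
colnames(design.matrix) <- c("T", "C")
design.matrix

# Hypothetical log-expression values for one gene (illustration only).
y <- c(8.1, 6.9, 7.8, 7.2, 8.4, 7.0)

fit <- lm(y ~ 0 + design.matrix)
coef(fit)  # the two group means

# Comparing T with C corresponds to a two-sample t-test with equal variances.
t.test(y[Treatment == "TreatmentA"], y[Treatment == "Control"],
       var.equal = TRUE)
```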
Intercepts
Different parameterization: using an intercept
Let's now consider this parameterization:
C = Baseline expression
TA = Baseline expression + effect of treatment
So the set of parameters is:
C = Control (mean expression of the control)
a = TA − Control (mean change in expression under treatment)
Intercept
Design matrix (rows: Sample 1 to Sample 6; columns: Intercept, Treatment A) and parameters (coefficients, levels of the variable):
$$
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix} =
\begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 1 & 1 \\ 1 & 0 \\ 1 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ a \end{pmatrix}
$$
The intercept measures the baseline expression. a now measures the differential expression between Treatment A and Control.
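A short sketch of the intercept parameterization in R, assuming the same six samples and a hypothetical expression vector `y`. Putting `Control` first in `levels` makes it the baseline, so the second coefficient estimates a.

```r
# Control listed first, so it becomes the baseline level.
Treatment <- factor(rep(c("TreatmentA", "Control"), 3),
                    levels = c("Control", "TreatmentA"))

design.matrix <- model.matrix(~ Treatment)
colnames(design.matrix)  # "(Intercept)" "TreatmentTreatmentA"

# Hypothetical log-expression values for one gene (illustration only).
y <- c(8.1, 6.9, 7.8, 7.2, 8.4, 7.0)

fit <- lm(y ~ Treatment)
coef(fit)  # first: control mean (beta0); second: treatment minus control (a)
```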
Contrast matrices
Are the two parameterizations equivalent?
Contrast matrices allow us to estimate (and test) linear combinations of our coefficients.
Contrast matrix:
$$
\begin{pmatrix} 1 & -1 \end{pmatrix}
\begin{pmatrix} T \\ C \end{pmatrix} = T - C
$$
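One way to check the equivalence numerically, sketched in R with hypothetical data: the contrast (1, −1) applied to the group means (T, C) should give the same number as the coefficient a of the intercept model.

```r
Treatment <- factor(rep(c("TreatmentA", "Control"), 3),
                    levels = c("TreatmentA", "Control"))
y <- c(8.1, 6.9, 7.8, 7.2, 8.4, 7.0)  # hypothetical values, illustration only

# Parameterization without intercept: coefficients are (T, C).
fit.means <- lm(y ~ 0 + Treatment)

# Parameterization with intercept and Control as baseline: (beta0, a).
fit.int <- lm(y ~ relevel(Treatment, ref = "Control"))

contrast  <- c(1, -1)                         # picks out T - C
T.minus.C <- sum(contrast * coef(fit.means))

# The contrast estimate agrees with the coefficient a.
all.equal(T.minus.C, unname(coef(fit.int)[2]))
```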
Models with 1 factor, more than 2 levels
| Sample | Treatment |
|---|---|
| Sample 1 | Treatment A |
| Sample 2 | Treatment B |
| Sample 3 | Control |
| Sample 4 | Treatment A |
| Sample 5 | Treatment B |
| Sample 6 | Control |

Number of samples: 6
Number of factors: 1
Treatment: number of levels: 3
Possible parameters (what differences are important?):
– Effect of Treatment A
– Effect of Treatment B
– Effect of Control
– Differences between treatments?
ANOVA models
Design matrix for ANOVA models
Without intercept (one parameter per group):
$$
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} T_A \\ T_B \\ C \end{pmatrix}
$$
With intercept:
$$
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ a \\ b \end{pmatrix}
$$
Control = Baseline
TA = Baseline + a
TB = Baseline + b
Baseline levels
The model with intercept always takes one level as a baseline:
$$
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ b \\ c \end{pmatrix}
$$
The baseline is Treatment A; the coefficients are comparisons against it!
By default, R uses the first level as baseline.
R code
R code:
> Treatment <- rep(c("TreatmentA", "TreatmentB", "Control"), 2)
> design.matrix <- model.matrix(~ Treatment)       # model with intercept
> design.matrix <- model.matrix(~ -1 + Treatment)  # model without intercept
> design.matrix <- model.matrix(~ 0 + Treatment)   # model without intercept (equivalent)
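A brief sketch of how the baseline is chosen in R. A character vector is turned into a factor with sorted levels, so here `Control` comes first; the order can be set explicitly with the `levels` argument of `factor()`. The object names `Treatment.A` and `design.A` are only for this example.

```r
Treatment <- rep(c("TreatmentA", "TreatmentB", "Control"), 2)

# Levels are sorted, so "Control" is the first level and hence the baseline.
levels(factor(Treatment))
colnames(model.matrix(~ Treatment))

# Choose Treatment A as the baseline by listing it first.
Treatment.A <- factor(Treatment,
                      levels = c("TreatmentA", "TreatmentB", "Control"))
design.A <- model.matrix(~ Treatment.A)
design.A  # columns: intercept, B vs A, Control vs A
```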
Exercise
Build contrast matrices for all pairwise comparisons for this design:
$$
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} T_A \\ T_B \\ C \end{pmatrix}
$$
Contrast matrix (rows: $T_A - C$, $T_B - C$, $T_A - T_B$):
$$
\begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ 1 & -1 & 0 \end{pmatrix}
\begin{pmatrix} T_A \\ T_B \\ C \end{pmatrix}
$$
Exercise
Build contrast matrices for all pairwise comparisons for this design:
$$
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ a \\ b \end{pmatrix}
$$
Contrast matrix (rows: $a = T_A - C$, $b = T_B - C$, $a - b = T_A - T_B$):
$$
\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & -1 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ a \\ b \end{pmatrix}
$$
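The two exercise answers can be checked against each other in R. This is a sketch with hypothetical expression values; `C1` and `C2` are the contrast matrices for the parameterization without and with intercept, and both should return the same three pairwise differences.

```r
groups <- rep(c("TreatmentA", "TreatmentB", "Control"), 2)
y <- c(8.1, 9.0, 6.9, 7.8, 9.3, 7.2)  # hypothetical values, illustration only

# Parameterization 1: one mean per group, ordered (TA, TB, C).
g1 <- factor(groups, levels = c("TreatmentA", "TreatmentB", "Control"))
means <- coef(lm(y ~ 0 + g1))

# Parameterization 2: intercept = Control, then a and b.
g2 <- factor(groups, levels = c("Control", "TreatmentA", "TreatmentB"))
effects <- coef(lm(y ~ g2))

# Rows of both matrices: TA - C, TB - C, TA - TB.
C1 <- rbind(c(1, 0, -1), c(0, 1, -1), c(1, -1, 0))
C2 <- rbind(c(0, 1, 0),  c(0, 0, 1),  c(0, 1, -1))

drop(C1 %*% means)
drop(C2 %*% effects)  # same three differences
```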
Models with 2 factors
| Sample | Treatment | ER status |
|---|---|---|
| Sample 1 | Treatment A | + |
| Sample 2 | No Treatment | + |
| Sample 3 | Treatment A | + |
| Sample 4 | No Treatment | + |
| Sample 5 | Treatment A | - |
| Sample 6 | No Treatment | - |
| Sample 7 | Treatment A | - |
| Sample 8 | No Treatment | - |

Number of samples: 8
Number of factors: 2
Treatment: number of levels: 2
ER: number of levels: 2
Understanding Interactions
[Figure: panels illustrating a treatment effect only, an ER effect only, both effects (Treat + ER), a positive Treat x ER interaction, and a negative Treat x ER interaction.]
Adapted from Natalie Thorne, Nuno L. Barbosa Morais

| | No Treat | Treat A |
|---|---|---|
| ER - | S6, S8 | S5, S7 |
| ER + | S2, S4 | S1, S3 |
Models with 2 factors and no interaction
Model with no interaction: only main effects
Number of coefficients (parameters): Intercept + (♯levels Treat − 1) + (♯levels ER − 1) = 3
If we remove the intercept, the additional parameter comes from the missing level in one of the variables, but in models with more than one factor it is a good idea to keep the intercept.
Models with 2 factors (no interaction)
$$
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ S_5 \\ S_6 \\ S_7 \\ S_8 \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ a \\ er_{+} \end{pmatrix}
$$
R code:
> design.matrix <- model.matrix(~ Treatment + ER)  # model with intercept
In R, the baseline for each variable is the first level.
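A sketch of the additive two-factor design in R. The level labels `NoTreat`/`TreatA` and `neg`/`pos` are placeholders chosen for this example, with the first level of each factor acting as the baseline.

```r
# Sample annotation for S1..S8.
Treatment <- factor(rep(c("TreatA", "NoTreat"), 4),
                    levels = c("NoTreat", "TreatA"))
ER <- factor(rep(c("pos", "neg"), each = 4), levels = c("neg", "pos"))

design.matrix <- model.matrix(~ Treatment + ER)
colnames(design.matrix)  # "(Intercept)" "TreatmentTreatA" "ERpos"
ncol(design.matrix)      # 1 + (2 - 1) + (2 - 1) = 3
design.matrix
```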
Models with 2 factors and interaction
Model with interaction: main effects + interaction
Number of coefficients (parameters): Intercept + (♯levels Treat − 1) + (♯levels ER − 1) + ((♯levels Treat − 1) × (♯levels ER − 1)) = 4
Models with 2 factors (interaction)
$$
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ a \\ er_{+} \\ a.er_{+} \end{pmatrix}
$$
$a.er_{+}$ is the "extra effect" of Treatment A on ER+ samples.
R code:
> design.matrix <- model.matrix(~ Treatment * ER)  # model with intercept
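A sketch of the interaction design in R, using placeholder level labels (`NoTreat`/`TreatA`, `neg`/`pos`). In a formula, `Treatment * ER` expands to both main effects plus their interaction, and the interaction column is the product of the two main-effect columns.

```r
Treatment <- factor(rep(c("TreatA", "NoTreat"), 4),
                    levels = c("NoTreat", "TreatA"))
ER <- factor(rep(c("pos", "neg"), each = 4), levels = c("neg", "pos"))

design.int <- model.matrix(~ Treatment * ER)
colnames(design.int)
ncol(design.int)  # 1 + 1 + 1 + (1 * 1) = 4

# The interaction column is 1 only for treated ER+ samples (S1 and S3).
design.int[, "TreatmentTreatA:ERpos"]
```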
2 by 3 factorial experiment
• Identify DE genes that have different time profiles between different mutants. α = time effect, β = strain effect, αβ = interaction effect
[Figure: four panels of expression (Exp) against time (0, 12, 24) for Strain A and Strain B: (1) α > 0, β = 0, αβ = 0; (2) α > 0, β > 0, αβ = 0; (3) αβ > 0; (4) α = 0, β > 0, αβ = 0.]
Slide by Natalie Thorne, Nuno L. Barbosa Morais
Paired Designs
Unpaired coding:

| Sample | Type |
|---|---|
| Sample 1 | Tumour |
| Sample 2 | Matched Normal |
| Sample 3 | Tumour |
| Sample 4 | Matched Normal |
| Sample 5 | Tumour |
| Sample 6 | Matched Normal |
| Sample 7 | Tumour |
| Sample 8 | Matched Normal |

Number of samples: 8
Number of factors: 1
Type: number of levels: 2

Paired coding:

| Sample | Type |
|---|---|
| Sample 1 | Tumour |
| Sample 1 | Matched Normal |
| Sample 2 | Tumour |
| Sample 2 | Matched Normal |
| Sample 3 | Tumour |
| Sample 3 | Matched Normal |
| Sample 4 | Tumour |
| Sample 4 | Matched Normal |

Number of samples: 4
Number of factors: 2
Sample: number of levels: 4
Type: number of levels: 2
Design matrix for Paired experiments
We can gain precision in our estimates with a paired design, because individual variability is removed when we compare the effect of the treatment within the same sample.
R code:
> design.matrix <- model.matrix(~ -1 + Type)           # unpaired; model without intercept
> design.matrix <- model.matrix(~ -1 + Sample + Type)  # paired; model without intercept
$$
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \\ t \end{pmatrix}
$$
The effects $S_1, \dots, S_4$ only reflect biological differences not related to the tumour/normal effect.
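A sketch of the two codings in R. The level label `MatchedNormal` (without a space) is a choice made for this example; it is listed first so that the `Type` coefficient in the paired model is the tumour effect t.

```r
Sample <- factor(rep(paste0("Sample", 1:4), each = 2))
Type   <- factor(rep(c("Tumour", "MatchedNormal"), 4),
                 levels = c("MatchedNormal", "Tumour"))

# Unpaired: one column per tissue type.
unpaired <- model.matrix(~ 0 + Type)
colnames(unpaired)

# Paired: one column per individual plus a single tumour-effect column.
paired <- model.matrix(~ 0 + Sample + Type)
colnames(paired)
paired
```

With 8 libraries, the paired model spends 5 coefficients (4 individuals + t) and the unpaired model 2, so the paired design trades residual degrees of freedom for the removal of between-individual variability.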
Analysis of covariance (Models with categorical and continuous variables)

| Sample | ER | Dose |
|---|---|---|
| Sample 1 | + | 37 |
| Sample 2 | - | 52 |
| Sample 3 | + | 65 |
| Sample 4 | - | 89 |
| Sample 5 | + | 24 |
| Sample 6 | - | 19 |
| Sample 7 | + | 54 |
| Sample 8 | - | 67 |

Number of samples: 8
Number of factors: 2
ER: number of levels: 2
Dose: continuous
Analysis of covariance (Models with categorical and continuous variables)
$$
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 37 \\ 1 & 0 & 52 \\ 1 & 1 & 65 \\ 1 & 0 & 89 \\ 1 & 1 & 24 \\ 1 & 0 & 19 \\ 1 & 1 & 54 \\ 1 & 0 & 67 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ er_{+} \\ d \end{pmatrix}
$$
R code:
> design.matrix <- model.matrix(~ ER + dose)
If we consider the effect of dose to be linear, we use 1 coefficient (degree of freedom). We can also model it as non-linear (using splines, for example).
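A sketch of both options in R, using the ER and dose values from the table. The spline variant uses `ns()` from the `splines` package that ships with R; the choice of `df = 2` is arbitrary and only for illustration.

```r
library(splines)  # part of the standard R distribution

ER   <- factor(rep(c("pos", "neg"), 4), levels = c("neg", "pos"))
dose <- c(37, 52, 65, 89, 24, 19, 54, 67)

# Linear dose effect: a single coefficient for dose.
linear <- model.matrix(~ ER + dose)
ncol(linear)     # intercept + ER + dose = 3

# Non-linear dose effect: natural cubic spline with 2 degrees of freedom.
nonlinear <- model.matrix(~ ER + ns(dose, df = 2))
ncol(nonlinear)  # intercept + ER + 2 spline columns = 4
```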
Analysis of covariance (Models with categorical and continuous variables)
$$
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 37 & 37 \\ 1 & 0 & 52 & 0 \\ 1 & 1 & 65 & 65 \\ 1 & 0 & 89 & 0 \\ 1 & 1 & 24 & 24 \\ 1 & 0 & 19 & 0 \\ 1 & 1 & 54 & 54 \\ 1 & 0 & 67 & 0 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ er_{+} \\ d \\ er_{+}.d \end{pmatrix}
$$
Interaction: is the effect of dose the same in ER+ and ER−?
If the interaction is significant, the effect of the dose differs depending on the level of ER.
R code:
> design.matrix <- model.matrix(~ ER * dose)
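A sketch of the interaction design in R with the values from the table (`neg`/`pos` are placeholder labels for ER− and ER+). The interaction column equals the dose for ER+ samples and zero otherwise, so the dose slope is d in ER− samples and d plus the interaction coefficient in ER+ samples.

```r
ER   <- factor(rep(c("pos", "neg"), 4), levels = c("neg", "pos"))
dose <- c(37, 52, 65, 89, 24, 19, 54, 67)

design.int <- model.matrix(~ ER * dose)
colnames(design.int)  # "(Intercept)" "ERpos" "dose" "ERpos:dose"

# Dose for ER+ samples, 0 for ER- samples.
design.int[, "ERpos:dose"]
```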
Time Course experiments

| Sample | Time |
|---|---|
| Sample 1 | 0h |
| Sample 1 | 1h |
| Sample 1 | 4h |
| Sample 1 | 16h |
| Sample 2 | 0h |
| Sample 2 | 1h |
| Sample 2 | 4h |
| Sample 2 | 16h |

Number of samples: 2
Number of factors: 2
Sample: number of levels: 2
Time: continuous or categorical?
Main question: how does expression change over time?
If we model time as categorical, we don't make assumptions about its effect, but we use too many degrees of freedom.
If we model time as continuous, we use fewer degrees of freedom but we have to make assumptions about the type of effect.
Intermediate solution: splines
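Time Course experiments: no assumptions

$$
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} S_1 \\ S_2 \\ T_1 \\ T_4 \\ T_{16} \end{pmatrix}
$$

R code: > design.matrix <- model.matrix(~Sample + factor(Time))

With R's default intercept the columns are (Intercept), Sample 2, T1, T4 and T16, which is an equivalent parametrisation; ~ 0 + Sample + factor(Time) gives the S1 and S2 columns shown.

We can use contrasts to test differences at time points.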
![Page 54: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/54.jpg)
Time Course experiments

$$
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \\ Y_8 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 4 \\ 1 & 0 & 16 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 4 \\ 0 & 1 & 16 \end{pmatrix}
\begin{pmatrix} S_1 \\ S_2 \\ X \end{pmatrix}
$$

R code: > design.matrix <- model.matrix(~Sample + Time)

With R's default intercept the columns are (Intercept), Sample 2 and Time, which is an equivalent parametrisation; ~ 0 + Sample + Time gives the S1 and S2 columns shown.

We are assuming a linear effect of time

[Figure: lines against time for a small, a big and a large negative coefficient of X]

Intermediate models are possible: splines
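The trade-off between the three ways of modelling time can be sketched by counting design-matrix columns. The data are the two samples at 0h, 1h, 4h and 16h; the natural spline with 2 degrees of freedom is only one possible intermediate choice and is an assumption of this sketch.

```r
library(splines)  # ships with R; ns() builds a natural cubic spline basis

Sample <- factor(rep(c("Sample 1", "Sample 2"), each = 4))
Time   <- rep(c(0, 1, 4, 16), times = 2)

design_categorical <- model.matrix(~ Sample + factor(Time))     # no assumptions on time
design_linear      <- model.matrix(~ Sample + Time)             # linear effect of time
design_spline      <- model.matrix(~ Sample + ns(Time, df = 2)) # intermediate model

# Number of coefficients (degrees of freedom) used by each model.
sapply(list(categorical = design_categorical,
            linear      = design_linear,
            spline      = design_spline), ncol)
```

With only 8 observations, every extra column leaves fewer degrees of freedom for estimating the variance.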
![Page 55: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/55.jpg)
Multi factorial models

• We can fit models with many variables
• Sample size must be adequate to the number of factors
• Same rules for building the design matrix must be used:
    • There will be one column in the design matrix for the intercept
    • Continuous variables with a linear effect will need one column in the design matrix
    • Categorical variables will need ♯levels - 1 columns
    • Interactions will need (♯levels - 1) x (♯levels - 1) columns
    • It is possible to include interactions of more than 2 variables, but the number of samples needed to accurately estimate those interactions is large.
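The column-counting rules can be checked on a small made-up design. The variables `treatment`, `batch` and `age` below are hypothetical and only serve to illustrate the rules.

```r
# Hypothetical design: 12 samples, a 2-level factor, a 3-level factor
# and one continuous covariate.
treatment <- factor(rep(c("control", "treated"), times = 6))
batch     <- factor(rep(c("a", "b", "c"), each = 4))
age       <- c(31, 45, 52, 38, 60, 41, 35, 57, 49, 44, 39, 63)

main_effects     <- model.matrix(~ treatment + batch + age)
with_interaction <- model.matrix(~ treatment * batch + age)

ncol(main_effects)      # 1 + (2 - 1) + (3 - 1) + 1 columns
ncol(with_interaction)  # adds (2 - 1) x (3 - 1) interaction columns
```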
![Page 56: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/56.jpg)
Generalized linear models
![Page 57: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/57.jpg)
Statistical models

– We want to model the expected result of an outcome (dependent variable) under given values of other variables (independent variables)

$$E(Y) = f(X)$$

$$Y = f(X) + \varepsilon$$

• E(Y): expected value of variable Y
• f: arbitrary function (any shape)
• X: a set of k independent variables (also called factors)
• ε: the variability around the expected mean of Y
![Page 58: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/58.jpg)
Fixed vs Random effects
– If we consider an independent variable Xi as fixed, that is the set of observed values has been fixed by the design, then it is called a fixed factor.
– If we consider an independent variable Xj as random, that is the set of observed values comes from a realization of a random process, it is called a random factor.
– Models that include random effects are called MIXED MODELS.
– In this course we will only deal with fixed factors.
![Page 59: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/59.jpg)
Linear models
– The observed value of Y is a linear combination of the effects of the independent variables

– If we include categorical variables the model is called General Linear Model

Arbitrary number of independent variables:

$$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$$

Polynomials are valid:

$$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \dots + \beta_p X_1^p$$

We can use functions of the variables if the effects are linear:

$$E(Y) = \beta_0 + \beta_1 \log(X_1) + \beta_2 f(X_2) + \dots + \beta_k X_k$$

Smooth functions: not exactly the same as the so-called additive models
![Page 60: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/60.jpg)
Model Estimation

We use least squares estimation

[Figure: two panels of Y against X, one with a continuous X (0 to 100) and one with a categorical X (Male, Female)]

Given n observations (y1,..yn,x1,..xn), minimize the sum of squared differences between the observed and the predicted values
![Page 61: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/61.jpg)
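Model Estimation

$$Y = X\beta + \varepsilon$$

• $\beta$: parameter of interest (effect of X on Y)
• $\hat{\beta}$: estimator of the parameter of interest
• $se(\hat{\beta})$: standard error of the estimator of the parameter of interest

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

$$se(\hat{\beta}_i) = \sigma \sqrt{c_i}, \quad \text{where } c_i \text{ is the } i\text{th diagonal element of } (X^T X)^{-1}$$

Fitted values (predicted by the model): $\hat{y} = X\hat{\beta}$

Residuals (observed errors): $e = y - \hat{y}$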
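A minimal sketch of these formulas on simulated data, compared with R's `lm()`. The simulated values and the true coefficients (2 and 0.1) are assumptions for illustration; σ is replaced by its usual estimate from the residuals.

```r
set.seed(1)
n <- 50
x <- runif(n, 0, 100)
y <- 2 + 0.1 * x + rnorm(n)               # simulated data, true beta = (2, 0.1)

X <- cbind(Intercept = 1, x = x)           # design matrix
XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y         # (X'X)^{-1} X'Y

fitted_values <- X %*% beta_hat            # y-hat
residuals_obs <- y - fitted_values         # e = y - y-hat
sigma_hat <- sqrt(sum(residuals_obs^2) / (n - ncol(X)))
se_beta   <- sigma_hat * sqrt(diag(XtX_inv))

fit <- lm(y ~ x)
cbind(by_hand = as.vector(beta_hat), lm = coef(fit))
```

In practice `lm()` uses a QR decomposition rather than inverting XᵀX, which is numerically more stable; the estimates agree.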
![Page 62: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/62.jpg)
Model Assumptions

In order to conduct statistical inferences on the parameters of the model, some assumptions must be made:

• The observations 1,..,n are independent
• Normality of the errors: $\varepsilon_i \sim N(0, \sigma^2)$
• Homoscedasticity: the variance is constant.
• Linearity.
![Page 63: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/63.jpg)
Generalized linear models
– Extension of the linear model to other distributions and non-linearity in the structure (to some degree)

– Y must follow a probability distribution from the exponential family (Bernoulli, Binomial, Poisson, Gamma, Normal,…)

– Parameter estimation must be performed using an iterative method (IWLS).

$$g(E(Y)) = X\beta$$

where g is the link function.
![Page 64: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/64.jpg)
Example: Logistic Regression

– We want to study the relationship between the presence of an amplification in the ERBB2 gene and the size of the tumour in a specific type of breast cancer.

– Our dependent variable Y takes two possible values: “AMP”, “NORMAL” (“YES”, “NO”)

– X (size) takes continuous values.
![Page 65: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/65.jpg)
Example: Logistic Regression

It is very difficult to see the relationship. Let’s model the “probability of success”: in this case, the probability of amplification

[Figure: ERBB2 amplification (NO/YES) plotted against size (0 to 20)]
![Page 66: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/66.jpg)
Example: Logistic Regression

Some predictions are out of the possible range for a probability

[Figure: probability of amplification (0.0 to 1.0) plotted against size (0 to 20)]
![Page 67: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/67.jpg)
Example: Logistic Regression

We can transform the probabilities to a scale that goes from –Inf to Inf using log odds

$$\text{logodds} = \log\left(\frac{p}{1-p}\right)$$

[Figure: log odds of amplification plotted against size (0 to 20)]
![Page 68: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/68.jpg)
Example: Logistic Regression

How does this relate to the generalized linear model?

• Y follows a Bernoulli distribution; it can take two values (YES or NO)
• The expectation of Y, p, is the probability of YES (EY=p)
• We assume that there is a linear relationship between size and a function of the expected value of Y: the log odds (the link function)

$$\text{logodds}(\text{prob.amplif}) = \beta_0 + \beta_1 \, \text{Size}$$

$$g(EY) = X\beta$$
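A sketch of the logistic model with R's `glm()` on simulated data. The tumour sizes, the sample size of 200 and the true coefficients (-4 and 0.3) are invented for illustration and do not come from the ERBB2 data on the slides.

```r
set.seed(42)
size <- runif(200, 0, 20)                      # hypothetical tumour sizes
true_log_odds <- -4 + 0.3 * size               # assumed beta0 = -4, beta1 = 0.3
amplified <- rbinom(200, size = 1, prob = plogis(true_log_odds))

# Bernoulli response, logit (log odds) link.
fit <- glm(amplified ~ size, family = binomial(link = "logit"))
coef(fit)                                      # estimates on the log odds scale

# Predictions on the probability scale always fall between 0 and 1.
new_sizes <- data.frame(size = c(0, 10, 20))
predicted_prob     <- predict(fit, newdata = new_sizes, type = "response")
predicted_log_odds <- predict(fit, newdata = new_sizes, type = "link")
predicted_prob
```

The model is linear on the log odds scale, and `plogis()` (the inverse of the link) maps the linear predictor back to a probability.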
![Page 69: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/69.jpg)
Binomial Distribution

• It is the distribution of the number of events in a series of n independent Bernoulli experiments, each with a probability of success p.
• Y can take integer values from 0 to n
• EY=np
• VarY=np(1-p)

[Figure: binomial distribution, n=10, p=0.3]
![Page 70: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/70.jpg)
Poisson Distribution

• Let Y ~ B(n,p). If n is large and p is small then Y can be approximated by a Poisson Distribution (Law of rare events)
• Y ~ P(λ)
• EY=λ
• VarY=λ

[Figure: Poisson distribution, λ = 2]
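The law of rare events can be checked numerically. The values n = 1000 and p = 0.002 are arbitrary choices that give λ = np = 2.

```r
n <- 1000
p <- 0.002                      # large n, small p
lambda <- n * p                 # Poisson rate

k <- 0:10
binomial_prob <- dbinom(k, size = n, prob = p)
poisson_prob  <- dpois(k, lambda)

# Largest pointwise difference between the two probability functions.
max_difference <- max(abs(binomial_prob - poisson_prob))
max_difference

# Mean and variance: np and np(1 - p) for the binomial, both lambda for the Poisson.
c(binomial_mean = n * p, binomial_var = n * p * (1 - p), poisson_mean_var = lambda)
```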
![Page 71: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/71.jpg)
Negative Binomial Distribution

• Let Y ~ NB(r,p)
• Represents the number of successes in a sequence of Bernoulli experiments until r failures occur.
• It is also the distribution of a continuous mixture of Poisson distributions where λ follows a Gamma distribution.
• It can be seen as an overdispersed Poisson distribution.

$$p = \frac{\mu}{\sigma^2}$$

$$r = \frac{\mu^2}{\sigma^2 - \mu}$$

Overdispersion parameter

Location parameter

[Figure: negative binomial distribution, r=10, p=0.3]
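The two formulas can be checked against R's `dnbinom(size = r, prob = p)`, whose mean is r(1-p)/p and whose variance is r(1-p)/p². The mean of 20 and variance of 100 are made-up values; that R's parametrisation is the one intended on the slide is an assumption.

```r
mu     <- 20
sigma2 <- 100                   # hypothetical mean and variance, sigma2 > mu
p <- mu / sigma2
r <- mu^2 / (sigma2 - mu)

# Recover the moments numerically from the probability function.
y <- 0:2000                     # the tail beyond 2000 is negligible here
prob_y  <- dnbinom(y, size = r, prob = p)
mean_nb <- sum(y * prob_y)
var_nb  <- sum((y - mean_nb)^2 * prob_y)

c(p = p, r = r, mean = mean_nb, variance = var_nb)
```

Equivalently σ² = μ + μ²/r, so smaller values of r mean more overdispersion relative to a Poisson distribution.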
![Page 72: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/72.jpg)
Hypothesis testing

• Everything starts with a biological question to test:
    – What genes are differentially expressed under one treatment?
    – What genes are more commonly amplified in a class of tumours?
    – What promoters are methylated more frequently in cancer?
• We must express this biological question in terms of a parameter in a model.
• We then conduct an experiment, obtain data and estimate the parameter.
• How do we take into account uncertainty in order to answer our question based on our estimate?
![Page 73: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/73.jpg)
Sampling and testing
10% red balls and 90% blue balls
Random sample of 10 balls from the box
Discrete observations
When do I think that I am not sampling from this box anymore?
How many reds could I expect to get just by chance alone?
#red = 3
Slide by Natalie Thorne, Nuno L. Barbosa Morais
![Page 74: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/74.jpg)
10% red balls and 90% blue balls
Random sample of 10 balls from the box
Discrete observations
Sample
Null hypothesis (about the population that is being sampled)
Rejection criteria (based on your observed sample, do you have evidence to reject the hypothesis that you sampled from the null population)
#red = 3
Test statistic
Slide by Natalie Thorne, Nuno L. Barbosa Morais
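For the box example, the probability of seeing at least 3 red balls by chance can be computed directly. This sketch assumes the draws are independent (sampling with replacement, or a very large box), so that the number of reds is Binomial(10, 0.1) under the null hypothesis.

```r
n_draws      <- 10
p_red        <- 0.1      # proportion of red balls under the null hypothesis
observed_red <- 3        # test statistic

# P(#red >= 3) = 1 - P(#red <= 2) under the null hypothesis.
p_value <- 1 - pbinom(observed_red - 1, size = n_draws, prob = p_red)
p_value
```

Three reds out of ten is unusual for this box, but the probability is about 0.07, so it would not be rejected at the 5% level.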
![Page 75: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/75.jpg)
Hypothesis testing

• Null Hypothesis: Our population follows a (known) distribution defined by a set of parameters: H0 : X ~ f(θ1,…θk)
• Take a random sample (X1,…Xn) = (x1,…xn) and observe test statistic T(X1,…Xn) = t(x1,…xn)
• The distribution of T under H0 is known (g(.))
• p-value: probability under H0 of observing a result at least as extreme as t(x1,…xn)
![Page 76: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/76.jpg)
observed z-score
High probability of getting a more extreme score just by chance
P-value is high!
Slide by Natalie Thorne, Nuno L. Barbosa Morais
![Page 77: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/77.jpg)
observed z-score
Low probability of getting a more extreme score just by chance
P-value is low
Reject null hypothesis
Slide by Natalie Thorne, Nuno L. Barbosa Morais
![Page 78: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/78.jpg)
Type I and Type II errors
• Type I error: probability of rejecting the null hypothesis when it is true. Usually, it is the significance level of the test. It is denoted as α.
• Type II error: probability of not rejecting the null hypothesis when it is false. It is denoted as β.
• Decreasing one type of error increases the other, so in practice we fix the type I error and choose the test that minimizes type II error.
![Page 79: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/79.jpg)
The power of a test

• The power of a test is the probability of rejecting the null hypothesis at a given significance level when a specific alternative is true
• For a given significance level and a given alternative hypothesis in a given test, the power is a function of the sample size
• What is the difference between statistical significance and biological significance?

[Figure: power against sample size (0 to 5000) for a t-test with true difference 0.1, standard deviation 1 and significance level 0.05]

With enough sample size, we can detect any alternative hypothesis (if the estimator is consistent, its standard error converges to zero as the sample size increases)
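A power curve of this kind can be sketched with `power.t.test()` from R's stats package, using the settings named in the figure (true difference 0.1, standard deviation 1, significance level 0.05). The two-sample, two-sided design and the chosen sample sizes are assumptions; the slide does not say which t-test was used.

```r
sample_sizes <- c(100, 500, 1000, 2000, 5000)   # per group (assumed design)

power <- sapply(sample_sizes, function(n) {
  power.t.test(n = n, delta = 0.1, sd = 1, sig.level = 0.05)$power
})

round(setNames(power, sample_sizes), 3)
```

A difference of 0.1 standard deviations becomes detectable with thousands of samples, which is why statistical significance alone does not imply biological relevance.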
![Page 80: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/80.jpg)
The Likelihood Ratio Test (LRT)

• We are working with models, therefore we would like to do hypothesis tests on coefficients or contrasts of those models
• We fit two models: M1 without the coefficient to test and M2 with the coefficient.
• We compute the likelihoods of the two models (L1 and L2) and obtain LRT = -2log(L1/L2), which has a known distribution under the null hypothesis that the two models are equivalent. This is also known as model selection.
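A minimal sketch of the LRT for one coefficient in a Poisson GLM. The counts are simulated with made-up group means (20 and 30), and the reference distribution used here is the usual large-sample chi-squared with 1 degree of freedom, since the models differ by one coefficient.

```r
set.seed(7)
group  <- factor(rep(c("control", "treated"), each = 10))
counts <- rpois(20, lambda = ifelse(group == "treated", 30, 20))  # simulated counts

m1 <- glm(counts ~ 1,     family = poisson)   # M1: without the coefficient
m2 <- glm(counts ~ group, family = poisson)   # M2: with the coefficient

# LRT = -2 log(L1 / L2) = -2 (log L1 - log L2)
lrt <- -2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m2)))
p_value <- pchisq(lrt, df = 1, lower.tail = FALSE)

c(LRT = lrt, p.value = p_value)
```

The same statistic is returned by `anova(m1, m2, test = "Chisq")` as the difference in deviance.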
![Page 81: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/81.jpg)
Multiple testing problem

• In high-throughput experiments we are fitting one model for each gene/exon/sequence of interest, and therefore performing thousands of tests.
• The overall Type I error is not equal to the significance level of each test.
• Multiple test corrections try to fix this problem (Bonferroni, FDR,…)
![Page 82: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/82.jpg)
Distribution of p-values

Re-cap of hypothesis testing for a single hypothesis

Crucial property of p-values

Since p is a transformation of t, we can also imagine performing the experiment many times, but rather than plotting t each time, we plot p. What will the histogram look like?

If the null hypothesis is true, the p-values from the repeated experiments come from a Uniform(0,1) distribution.

[Figure: histogram of p; frequency against p from 0.0 to 1.0]

Alex Lewin (Biostatistics, St Mary’s) Multiple Testing March 2011 13 / 69
Slide by Alex Lewin, Ernest Turro, Paul O’Reilly
![Page 83: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/83.jpg)
Controlling the number of errors
| | Null Hypothesis True | Alternative Hypothesis True | Total |
|---|---|---|---|
| Not Significant (don’t reject) | ♯True Negative | ♯False Negative (Type II error) | N - ♯Rejections |
| Significant (Reject) | ♯False positive (Type I error) | ♯True positive | ♯Total Rejections |
| Total | n0 | N - n0 | N |

N = number of hypotheses tested
R = number of rejected hypotheses
n0 = number of true null hypotheses
![Page 84: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/84.jpg)
Bonferroni Correction

If the tests are independent:

P(at least one false positive | all null hypotheses are true) = P(at least one p-value < α | all null hypotheses are true) = $1 - (1 - \alpha)^N$

Usually, we set a threshold at α/N.

Bonferroni correction: reject each hypothesis at α/N level

It is a very conservative method
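The formula can be evaluated for a few numbers of tests to show how quickly the chance of at least one false positive grows, and what testing each hypothesis at α/N does to it. The values of N are arbitrary, and independence of the tests is assumed as on the slide.

```r
alpha <- 0.05
N <- c(1, 10, 100, 10000)                     # number of independent tests

fwer_uncorrected <- 1 - (1 - alpha)^N         # each test at level alpha
fwer_bonferroni  <- 1 - (1 - alpha / N)^N     # each test at level alpha / N

round(rbind(N, fwer_uncorrected, fwer_bonferroni), 4)
```

With the correction the family-wise error rate stays at or below α for every N.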
![Page 85: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/85.jpg)
False Discovery Rate (FDR)

N = number of hypotheses tested
R = number of rejected hypotheses
n0 = number of true null hypotheses

FDR aims to control the proportion of false positives among the rejected null hypotheses.

Family Wise Error Rate: FWER = P(V≥1)
False Discovery Rate: FDR = E(V/R | R>0) P(R>0)

| | Null Hypothesis True | Alternative Hypothesis True | Total |
|---|---|---|---|
| Not Significant (don’t reject) | ♯True Negative | ♯False Negative (Type II error) | N - ♯Rejections |
| Significant (Reject) | V = ♯False positive (Type I error) | ♯True positive | R = ♯Total Rejections |
| Total | n0 | N - n0 | N |
![Page 86: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/86.jpg)
Classical approaches to multiple testing (FWER and FDR)

Benjamini & Hochberg (BH) step-up method to control FDR

Benjamini & Hochberg proposed the idea of controlling FDR, and used a step-wise method for controlling it.

Step 1: compare largest p-value to the specified significance level α: If $p^{ord}_{m} > \alpha$ then don’t reject corresponding null hypothesis

Step 2: compare second largest p-value to a modified threshold: If $p^{ord}_{m-1} > \alpha (m-1)/m$ then don’t reject corresponding null hypothesis

Step 3: If $p^{ord}_{m-2} > \alpha (m-2)/m$ then don’t reject corresponding null hypothesis

. . .

Stop when a p-value is lower than the modified threshold: All other null hypotheses (with smaller p-values) are rejected.

Slide by Alex Lewin, Ernest Turro, Paul O’Reilly
![Page 87: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/87.jpg)
Adjusted p-values for BH FDR

The final threshold on p-values below which all null hypotheses are rejected is $\alpha j^*/m$ where $j^*$ is the index of the largest such p-value.

BH: compare $p_i$ to $\alpha j^*/m$ → compare $m p_i / j^*$ to $\alpha$

Can define ’adjusted p-values’ as $m p_i / j^*$

But these ’adjusted p-values’ tell you the level of FDR which is being controlled (as opposed to the FWER in the Bonferroni and Holm cases).

We will see versions of FDR in later sections, in Bayesian mixture models and in the Bayesian-inspired approaches in the last section.

Slide by Alex Lewin, Ernest Turro, Paul O’Reilly
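The step-up procedure can be sketched as a short function and compared with `p.adjust(method = "BH")` from R's stats package. The helper name `bh_adjust` and the ten p-values are made up for illustration. This sketch uses the common monotone form of the adjusted p-values, the running minimum of m·p(j)/j taken from the largest p-value downwards, which is an assumption about how the 'adjusted p-values' above are made usable for every hypothesis.

```r
# Hypothetical helper: BH adjusted p-values via the step-up scheme.
bh_adjust <- function(p) {
  m   <- length(p)
  ord <- order(p, decreasing = TRUE)    # start from the largest p-value
  adj <- p[ord] * m / (m:1)             # m * p_(j) / j for j = m, ..., 1
  adj <- pmin(1, cummin(adj))           # keep monotone, cap at 1
  adj[order(ord)]                       # back to the original order
}

# Made-up p-values for ten hypotheses.
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)
adjusted <- bh_adjust(p)

which(adjusted <= 0.05)                 # hypotheses rejected at FDR 5%
```

Rejecting every hypothesis whose adjusted p-value is at most α gives the same set as stepping through the thresholds α·j/m.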
![Page 88: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/88.jpg)
Multiple power problem

• We have another problem related to the power of each test. Each unit tested has a different test statistic that depends on the variance of the distribution. This variance is usually different for each gene/transcript,…
• This means that the probability of detecting a given difference is different for each gene; if there is low variability in a gene we will reject the null hypothesis under a smaller difference
• Methods that shrink the variance (like the empirical Bayes in limma for microarrays) deal with this problem.
![Page 89: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/89.jpg)
Models for counting data
![Page 90: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/90.jpg)
Microarray expression data

Data are color intensities

[Figure: two panels of RG densities, density against intensity, with intensity axes from 0 to 50000 and from 4 to 16]

$$\log y_{ij} \sim N(\mu_j, \sigma_j^2)$$

Adapted from slides by Benilton Carvalho
![Page 91: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/91.jpg)
Sequencing data

| Gene | Sample 1 | Sample 2 |
|---|---|---|
| ERBB2 | 0 | 45 |
| MYC | 14 | 23 |
| ESR1 | 56 | 2 |

• Transcript (or sequence, or methylation) i in sample j is generated at a rate λij
• A fragment attaches to the flow cell with a probability of pij (small)
• The number of observed tags yij follows a Poisson distribution with a rate that is proportional to λijpij.

[Figure: density against number of reads (0 to 60)]

The variance in a Poisson distribution is equal to the mean

Adapted from slides by Benilton Carvalho, Tom Hardcastle
![Page 92: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/92.jpg)
Extra variability

[Figure: squared coefficient of variation]

Adapted from slides by Benilton Carvalho
![Page 93: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/93.jpg)
Negative binomial model for sequencing data
Adapted from slides by Benilton Carvalho
![Page 94: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/94.jpg)
Estimating Overdispersion with edgeR

• edgeR (Robinson, McCarthy, Chen and Smyth)
• Total CV² = Technical CV² + Biological CV²
    – Technical CV²: decreases with sequencing depth
    – Biological CV²: variability in gene abundance between replicates
• Borrows information from all genes to estimate BCV.
    – Common dispersion for all tags
    – Empirical Bayes to shrink each dispersion to the common dispersion.
![Page 95: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/95.jpg)
Estimating Overdispersion with DESeq

• DESeq (Anders, Huber)
• Var = sμ + αs²μ²
    – μ: expected normalized count value
    – s: size factor for the sample
    – α: dispersion of the gene
• estimateDispersions()
    1. Dispersion value for each gene
    2. Fits a curve through the estimates
    3. Each gene gets an estimate between (1) and (2).
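The variance function can be tabulated to see where the extra term matters. The helper `nb_variance`, the size factor of 1 and the dispersion of 0.1 are illustrative assumptions; nothing here calls DESeq itself.

```r
# Variance of a count with expected normalized value mu, size factor s
# and dispersion alpha: Var = s*mu + alpha * s^2 * mu^2
nb_variance <- function(mu, s, alpha) s * mu + alpha * s^2 * mu^2

mu    <- c(1, 10, 100, 1000)   # expected normalized counts (illustrative)
s     <- 1                     # assumed size factor
alpha <- 0.1                   # assumed dispersion

variance_table <- data.frame(
  mu          = mu,
  poisson_var = s * mu,
  nb_var      = nb_variance(mu, s, alpha),
  ratio       = nb_variance(mu, s, alpha) / (s * mu)
)
variance_table
```

For low counts the Poisson term dominates, while for highly expressed genes the variance is driven almost entirely by the dispersion term.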
![Page 96: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/96.jpg)
Reproducibility
Slide by Wolfgang Huber
![Page 97: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/97.jpg)
A small number of genes get most of the reads
Slide by Wolfgang Huber
![Page 98: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/98.jpg)
Effective library sizes

• Also called normalization (although the counts are not changed!!!)
• We must estimate the effective library size of each sample, so our counts are comparable between genes and samples
• Gene lengths?
• These library sizes are included in the model as an offset (a parameter with a fixed value)
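How an offset enters a count model can be sketched with a Poisson GLM in base R. The library sizes, group labels and per-read rates below are invented; edgeR and DESeq fit negative binomial models, so this only illustrates the role of the offset.

```r
set.seed(3)
lib_size <- c(1e6, 2e6, 1.5e6, 3e6, 1e6, 2.5e6)   # hypothetical effective library sizes
group    <- factor(rep(c("A", "B"), each = 3))
rate     <- ifelse(group == "B", 40e-6, 20e-6)    # assumed per-read rates
counts   <- rpois(6, lambda = lib_size * rate)    # raw counts, left unchanged

# log(lib_size) enters the linear predictor with its coefficient fixed at 1.
fit <- glm(counts ~ group + offset(log(lib_size)), family = poisson)

coef(fit)                    # no coefficient is estimated for the offset
exp(coef(fit)["groupB"])     # estimated fold change of B over A
```

The counts themselves are never rescaled; the library size only shifts the expected value on the log scale.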
![Page 99: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/99.jpg)
Estimating library size with edgeR

• edgeR (Robinson, McCarthy, Chen and Smyth)
• Adjust for sequencing depth and RNA composition (total RNA output)
• Choose a set of genes with the same RNA composition between samples (with the log fold change of normalised counts) after trimming
• Use the total reads of that set as the estimate.
![Page 100: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/100.jpg)
Estimating library size with DESeq

• DESeq (Anders, Huber)
• Adjust for sequencing depth and RNA composition (total RNA output)
• Compute, for each gene and each sample, the ratio between the count and the geometric mean of that gene's counts over all samples (equivalently, the difference between the log count and the mean log count).
• The median of these ratios over all genes is the estimated library size (size factor) of the sample.
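The median-of-ratios idea can be sketched in base R on a toy count matrix. The counts are invented and contain no zeros, and this is an illustration of the calculation rather than the DESeq implementation.

```r
# Toy counts: 4 genes x 2 samples (made-up values, no zeros).
counts <- matrix(c( 10,  20,
                   100, 200,
                    50, 100,
                     5,  40),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("sample1", "sample2")))

# Log geometric mean of each gene over all samples.
log_geo_mean <- rowMeans(log(counts))

# For each sample: median over genes of log(count) - log(geometric mean).
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_mean))
})
size_factors
```

Three of the four genes double from sample 1 to sample 2, so the median ignores the outlying fourth gene and the two size factors differ by a factor of 2.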
![Page 101: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec51bcd58c4664d06191dd3/html5/thumbnails/101.jpg)
References
• Anders and Huber. Genome Biology 2010, 11:R106
• Auer and Doerge. Genetics 2010, 185:405-416
• Harrell. Regression Modeling Strategies
• Robles et al. BMC Genomics 2012, 13:484
• Venables and Ripley. Modern Applied Statistics with S