Introduction Needles and Haystacks Gene set testing and extensions Discussion
Statistical Bioinformatics
Jelle Goeman
Medical Statistics & Bioinformatics, Leiden University Medical Center
Kick-off Meeting, 2009-11-10
Statistical Bioinformatics Jelle Goeman
Outline
1 Introduction
2 Needles and Haystacks
3 Gene set testing and extensions
4 Discussion
Introduction
Data avalanche
Advent of genomics has had a great impact on statistics
New ways of working and thinking needed
Old rule of thumb: need at least five subjects for every measured feature
Old software (SPSS) could not handle the data
Old methods broke down
Great stimulus: much development
Exciting new subfield of statistics: high dimensional data analysis
Bioinformaticians in the medical statistics group
Jeanine Houwing and Stefan Bohringer
Statistical genetics
Bart Mertens
Statistics of proteomics
Erik van Zwet and Jelle Goeman
Statistics of microarray data
Statistical consulting
Medical Statistics group
Long tradition of statistical consulting for whole LUMC
Similar consulting for statistical bioinformatics
We can
Advise
Show/teach how to do statistical analyses
Perform analyses for you
Experimental design
Close relationship
Design of the experiment ⇐⇒ statistical analysis
Recommendation
See a statistician before you start your experiment
Multiple testing
Many simultaneous measurements
Consequence: many simultaneous research questions
Which gene expressions are different between cases and controls?
Multiplicity: needle-in-a-haystack problem
Among so many genes, many will seem to be different
Many P-values < 0.05 by pure chance
Correct for multiplicity
By statistical adjustment for multiple testing
Risk: throw out the good genes with the bad
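The effect of multiplicity can be illustrated with a small simulation. The sketch below assumes 10,000 genes with no true differences, so that every p-value is uniform on (0, 1). It then applies two common adjustments, Bonferroni and Benjamini-Hochberg. Both are chosen here for illustration only; the slides do not name a specific method, and the function names are hypothetical.

```python
import random

def count_significant(pvalues, alpha=0.05):
    """Count p-values below alpha, with no multiplicity correction."""
    return sum(p < alpha for p in pvalues)

def bonferroni_reject(pvalues, alpha=0.05):
    """Indices rejected by Bonferroni: p < alpha / m."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p < alpha / m]

def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value is below its threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

# Assumption: all 10,000 genes are null, so p-values are uniform on (0, 1).
rng = random.Random(1)
null_pvalues = [rng.random() for _ in range(10_000)]

print("unadjusted p < 0.05:", count_significant(null_pvalues))
print("Bonferroni rejections:", len(bonferroni_reject(null_pvalues)))
print("Benjamini-Hochberg rejections:", len(benjamini_hochberg_reject(null_pvalues)))
```

About 5% of the unadjusted p-values fall below 0.05 even though nothing is going on. The adjusted procedures remove nearly all of these, at the price of also being stricter towards genes with real effects.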
Prognostic modeling
Patient-level prediction
Using genomic information to distinguish good/bad prognosis
Similar needle-in-a-haystack problem
Many prediction rules seem to do well; few really do
[Figure: survival probability (0.0 to 1.0) versus time (0 to 15 years), for all tumors and for percentile groups 0-25, 25-50, 50-75 and 75-100]
The Needle-in-a-haystack problem
Problem
Too much data
Risk of false positive findings
Lack of structure
One solution: provide structure
Use external information to structure statistical learning
Source of information
Bioinformatics: databases
Gene set testing
Microarray gene expression studies
Produce (long) lists of differentially expressed genes
But: genes do not operate in isolation
What are the biological processes these genes are involved in?
Typical: post hoc analysis
Analyzing the list of differentially expressed genes for common functions
Using databases of gene function (Gene Ontology)
Problem: statistically highly inefficient
Better: incorporate gene function into analysis directly
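The post hoc step can be made concrete with a minimal sketch. It assumes the usual over-representation setup: a list of selected genes is compared with one annotated gene set through a hypergeometric tail probability. The slides do not specify which test is used in practice, so the choice of test, the function name and the numbers are illustrative assumptions.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without
    replacement from n_total genes, of which n_in_set are in the gene set."""
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_total - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_selected)

# Hypothetical numbers: 10,000 genes measured, 200 annotated to the set,
# 300 genes on the differential expression list, 15 of them in the set.
print(hypergeom_enrichment_p(10_000, 200, 300, 15))
```

This approach only sees which genes made the list. A gene just below the cut-off counts the same as a gene with no effect at all, which is one way to read the inefficiency noted above.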
Globaltest
Globaltest method
Analyze your data directly at the level of gene sets
Gene set                                        # genes   p-value
chromosome segregation                               14     1e-05
cell cycle                                          230     1e-05
cytokinesis                                           7     2e-05
microtubule cytoskeleton organiz. and biogen.        22     2e-05
microtubule-based process                            47     2e-05
mitotic cell cycle                                   69     2e-05
G2/M transition of mitotic cell cycle                 4     2e-05
DNA replication                                      49     3e-05
mitosis                                              53     3e-05
M phase                                              66     3e-05
M phase of mitotic cell cycle                        54     3e-05
sister chromatid segregation                          9     3e-05
mitotic sister chromatid segregation                  9     3e-05
establishment of organelle localization               3     4e-05
cytoskeleton organization and biogenesis            128     4e-05
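Testing directly at the gene set level can be sketched as a single test per set that asks whether any gene in the set is associated with the response. The example below is not the globaltest implementation (globaltest is an R/Bioconductor package). It is a simplified permutation test on the sum of squared gene-response scores, with hypothetical function names and made-up toy data.

```python
import random

def set_statistic(expression, response):
    """Sum over genes of the squared score between gene and centered response."""
    mean_y = sum(response) / len(response)
    centered = [y - mean_y for y in response]
    total = 0.0
    for gene_values in expression:  # one list of sample values per gene
        score = sum(x * r for x, r in zip(gene_values, centered))
        total += score ** 2
    return total

def gene_set_permutation_p(expression, response, n_perm=999, seed=0):
    """Permutation p-value for association between the gene set and the response."""
    rng = random.Random(seed)
    observed = set_statistic(expression, response)
    shuffled = list(response)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between samples and response
        if set_statistic(expression, shuffled) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)

# Toy data (made up): 12 samples, 6 controls then 6 cases, 2 genes in the set.
response = [0] * 6 + [1] * 6
expression = [
    [0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 1.1, 0.9, 1.2, 1.0, 1.3, 0.8],
    [5.0, 5.2, 4.9, 5.1, 5.3, 5.0, 4.1, 4.0, 4.2, 3.9, 4.3, 4.1],
]
print(gene_set_permutation_p(expression, response))
```

One test per gene set replaces one test per gene, so far fewer hypotheses need a multiplicity adjustment.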
Using the structure of Gene Ontology
Looking at many gene sets
Still relatively unstructured
Exploit structure
Gene Ontology is a graph
Let the graph guide the search
Result of one test shows which test to do next
Result: more power
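The idea that one test result decides which test comes next can be sketched as a top-down search: test the root term first, and test the children of a term only if the term itself is rejected. The sketch uses a fixed threshold at every node and leaves out the multiplicity adjustment that a real procedure would need. The hierarchy and p-values are a hypothetical toy example, not actual Gene Ontology content, and the real Gene Ontology is a graph in which a term can have several parents.

```python
from collections import deque

def top_down_search(children, pvalue, root, alpha=0.05):
    """Test the root, then test a node's children only if the node was rejected.
    Returns (tested, rejected), both in visiting order."""
    tested, rejected = [], []
    queue = deque([root])
    seen = {root}
    while queue:
        node = queue.popleft()
        tested.append(node)
        if pvalue(node) < alpha:
            rejected.append(node)
            for child in children.get(node, []):
                if child not in seen:  # a term may be reachable via several parents
                    seen.add(child)
                    queue.append(child)
    return tested, rejected

# Hypothetical toy hierarchy and p-values, for illustration only.
toy_children = {
    "biological process": ["cell cycle", "metabolism"],
    "cell cycle": ["mitotic cell cycle", "DNA replication"],
    "metabolism": ["lipid metabolism"],
}
toy_pvalues = {
    "biological process": 0.001,
    "cell cycle": 0.002,
    "metabolism": 0.40,
    "mitotic cell cycle": 0.01,
    "DNA replication": 0.20,
    "lipid metabolism": 0.03,
}

tested, rejected = top_down_search(toy_children, toy_pvalues.get, "biological process")
print("tested:", tested)
print("rejected:", rejected)
```

In this toy run "lipid metabolism" is never tested because its parent was not rejected. Spending fewer tests on unpromising branches is where the gain in power comes from.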
Structure in data
Structured data
Measurements along the genome
Exploit this structure
Start testing at the chromosome level
Go down deeper where you find effects
[Figure: chromosome 8 as a hierarchy with levels Chr, Arm (p, q), Band (p23.3 to q24.3) and Gene (H84926...R56148)]
Data-driven structure
Use clustering to get a data-driven structure
Exploit that structure for insight and increased power
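A data-driven structure of this kind can be sketched by clustering genes on a distance of one minus the absolute correlation, so that strongly positively and strongly negatively correlated genes end up together. The single-linkage rule, the threshold and the toy expression profiles below are illustrative assumptions, not the method behind the slide.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def abs_corr_distance(x, y):
    """Distance that is small when |correlation| is large."""
    return 1.0 - abs(pearson(x, y))

def cluster_genes(profiles, threshold):
    """Single-linkage agglomerative clustering, stopped when the closest
    pair of clusters is further apart than the threshold."""
    clusters = [{name} for name in profiles]

    def linkage(a, b):
        return min(abs_corr_distance(profiles[i], profiles[j]) for i in a for j in b)

    while len(clusters) > 1:
        dist, ia, ib = min(
            (linkage(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        if dist > threshold:
            break
        merged = clusters[ia] | clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [merged]
    return clusters

# Toy profiles (made up): g3 mirrors g1, g2 follows g1, g4 is unrelated.
toy_profiles = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 11],
    "g3": [5, 4, 3, 2, 1],
    "g4": [3, 1, 4, 1, 5],
}
print(cluster_genes(toy_profiles, threshold=0.3))
```

The resulting clusters can then play the same role as predefined gene sets: test a cluster as a whole first, and look inside it only where an effect shows up.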
[Figure: clustering dendrogram of genes by absolute correlation (axis from 1 to 0.2), with p-values on a scale from 1 to 1e-08; legend: positive association with survival, negative association with survival]
Future: use other structures
Discussion
Statistical Bioinformatics
Many new developments
Builds upon classical statistics
Greatest challenge and opportunity
Using biological knowledge to guide the analysis