![Page 1: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/1.jpg)
Statistical methods for microbial community comparison
james robert whiteApril 2008
Advisor: Mihai Pop, CBCB.
![Page 2: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/2.jpg)
Outline
2
• Brief background in metagenomics
• Introduce my problem
• Methods
• Applications
• Future work
![Page 3: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/3.jpg)
Biology!
• Every microbe has a conserved gene called16S rRNA.
• Easy to recognize and exists in all known microbes.
Bacillus anthracis
E. coli
Mycobacterium tuberculosis
3
IntroOur methodsApplicationsFuture work
![Page 4: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/4.jpg)
Metagenomics
16S gene. . . TAGTCCATGACAGTACCGTACAAAA . . .
4
Prochlorococcus marinus
IntroOur methodsApplicationsFuture work
![Page 5: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/5.jpg)
CensusEnvironment
(radioactive waste)
5
IntroOur methodsApplicationsFuture work
![Page 6: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/6.jpg)
Census
75%
10%
5%1%
6%
3%
taxa
Environment(radioactive waste)
6
IntroOur methodsApplicationsFuture work
![Page 7: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/7.jpg)
The Problem
(Healthy colons) (Sick colons)How do two environments
differ?
Which organisms are differentially
abundant?
7
IntroOur methodsApplicationsFuture work
![Page 8: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/8.jpg)
p1 p2 p3 p4 p5 p6 p7
t1 243 300 120 0 43 21 66
t2 12 34 32 0 0 0 0
t3 0 3 10 200 140 134 70
t4 42 4 12 54 76 80 60
t5 2 0 10 4 6 0 0
t6 5 5 3 15 12 0 43
Taxa abundance matrix
Healthy colons Sick colons
8
IntroOur methodsApplicationsFuture work
![Page 9: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/9.jpg)
Differential abundance
• convert frequencies to relative proportions.
• compute sample means, variances.
9
p1 p2 p3 p4 p5 p6 p7
t1 .37 .42 .35 0.0 .10 .05 .17
IntroOur methodsApplicationsFuture work
![Page 10: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/10.jpg)
Hypothesis tests
• So for each taxa, Ti , we perform a hypothesis test of proportions:
• Ho: μhealthy = μsick
• HA: μhealthy ≠ μsick
• We obtain a test statistic ti, corresponding p-value.
• Reject or accept the null hypothesis?
10
IntroOur methodsApplicationsFuture work
![Page 11: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/11.jpg)
Permuted p-values
11
IntroOur methodsApplicationsFuture work
![Page 12: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/12.jpg)
Permuted p-values
11
IntroOur methodsApplicationsFuture work
![Page 13: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/13.jpg)
Permuted p-values
11
t1 ... tM
IntroOur methodsApplicationsFuture work
![Page 14: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/14.jpg)
Permuted p-values
11
IntroOur methodsApplicationsFuture work
![Page 15: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/15.jpg)
Permuted p-values
11
IntroOur methodsApplicationsFuture work
![Page 16: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/16.jpg)
Permuted p-values
11
t1* ... tM*
IntroOur methodsApplicationsFuture work
![Page 17: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/17.jpg)
Permuted p-values
11
simulated t’s
t1* ... tM*
IntroOur methodsApplicationsFuture work
![Page 18: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/18.jpg)
Permuted p-values
11
simulated t’s
t1* ... tM*
IntroOur methodsApplicationsFuture work
![Page 19: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/19.jpg)
Permuted p-values
11
simulated t’s
t1* ... tM*
IntroOur methodsApplicationsFuture work
![Page 20: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/20.jpg)
Multiple tests
• criteria for rejecting Ho
• p-value estimates of significance
• choose a threshold α, and reject if p ≤ α
• if you reject all Ho with p ≤ 0.05, you expect 5% of all true Ho to be false positives.
• M = 10 tests? 1000 tests? 100,000 tests?
• Bonferroni correction
12
IntroOur methodsApplicationsFuture work
![Page 21: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/21.jpg)
FDR alternative
13
• False Discovery Rate - “the rate of significant features that are truly null.”
• Analog to p-value => q-value
• if you reject all Ho with p ≤ 0.05, you expect 5% of all true Ho to be false positives.
• if you reject all Ho with q ≤ 0.05, you expect 5% of all rejected Ho to be false positives.
• e.g. 10,000 tests.
IntroOur methodsApplicationsFuture work
![Page 22: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/22.jpg)
Multiple tests
accept null reject null total
null true MaT MrT MT
null false MaF MrF MF
total M-Mr Mr M
14
IntroOur methodsApplicationsFuture work
![Page 23: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/23.jpg)
Hedenfalk p values
15
IntroOur methodsApplicationsFuture work
![Page 24: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/24.jpg)
Hedenfalk q values
16
IntroOur methodsApplicationsFuture work
![Page 25: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/25.jpg)
Additional issues
17
• Low frequency taxa.
p1 p2 p3 p4 p5 p6 p7
t1 243 300 120 0 43 21 66
t2 12 34 32 0 0 0 0
t3 0 3 10 200 140 134 70t4 42 4 12 54 76 80 60
t5 2 0 10 4 6 0 0
t6 5 5 3 15 12 0 43
Healthy colons Sick colons
IntroOur methodsApplicationsFuture work
![Page 26: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/26.jpg)
heuristic
• N = total number of samples from treatment
• N*p ≥ 25 to use the t statistic
• p ≥ 25/N
• p ≥ 25/5000 = 0.005
18
IntroOur methodsApplicationsFuture work
![Page 27: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/27.jpg)
small frequencies
• what about 25/N > p ?
• if p is this small, indicates small variance among subjects, so merge all samples into one large sample.
• use Fisher’s exact test to find an appropriate p value.
19
IntroOur methodsApplicationsFuture work
![Page 28: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/28.jpg)
e.g.s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
0 1 1 2 3 0 0 1 0 0
50 49 49 48 47 50 50 49 50 50
g1 g2
S 7 1
F 243 24920
![Page 29: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/29.jpg)
Real 16S data
21
• Ley et al. 2006, Nature
• metagenomic study of differentially abundant taxa between human guts of obese (12) and lean (5) people
• found significant differences between two high level taxa: Bacteroidetes and Firmicutes
• we generated taxa abundance matrices from original data and tried to replicate their results.
IntroOur methodsApplicationsFuture work
![Page 30: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/30.jpg)
Ley et al. data
Obese vs. Lean
0
10
20
30
40
50
60
70
80
90
100
Firmicutes Bacteroidetes Actinobacteria
rela
tive
abun
danc
e (%
)
ObeseLean
22
(mean ± std. err)(p≤0.05)
IntroOur methodsApplicationsFuture work
![Page 31: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/31.jpg)
human vs. mouse
23
• Two 16S distal gut studies:
• 5 lean control humans (Ley et al., 2006)
• 12 lean control mice (Ley et al., PNAS, 2005)
• 6,250 16S sequences.
• assigned using the RDP II Bayesian classifier
IntroOur methodsApplicationsFuture work
![Page 32: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/32.jpg)
human vs. mouse
24
IntroOur methodsApplicationsFuture work
![Page 33: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/33.jpg)
Metabolic profiles
25
• Dinsdale et al., Nature, 2008.
• Collected 87 microbial and viral metagenomes.
• 15 million shotgun sequences!
• subterranean, coral reefs, hypersaline, freshwater, animal guts, mosquito viruses.
IntroOur methodsApplicationsFuture work
![Page 34: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/34.jpg)
Metabolic profiles
26
(mean ± std. err)(p≤0.02)
IntroOur methodsApplicationsFuture work
![Page 35: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/35.jpg)
27
![Page 36: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/36.jpg)
Timeline• December
• Consider statistical methodology given sampling issues.
• Develop at least two methodologies to compare.
• Design broad simulation to test q-values vs. p-values.
• January • Finish broad simulation.• Finalize statistical methodology.• Finish application of software to Ley
data.• February
• Apply best method to additional metagenomic data.
• Develop documentation for software. 28
• April• Complete final draft of
report including edits from advisor.
• Submit polished version of our software to BioConductor group.
• May• Deliver final report.• Final presentation
• Beyond• Submit paper.• New data.• Correlations between
taxa.
IntroOur methodsApplicationsFuture work
![Page 37: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/37.jpg)
Acknowledgments
• Mihai Pop, CBCB
• Andrea Ottesen, PLSC
• Frank Siewerdt, ANSC
• Paul Smith, STAT
• Radu Balan, CSCAMM
• Aleksey Zimin, IPST
29
![Page 38: Statistical methods for microbial community comparison › ~rvbalan › TEACHING › AMSC663Fall2007 › ... · • 5 lean control humans (Ley et al., 2006) • 12 lean control mice](https://reader033.vdocuments.site/reader033/viewer/2022042308/5ed4238da6cc2c57c3522cd8/html5/thumbnails/38.jpg)
Questions?
30