TRANSCRIPT
Robustness, Reproducibility & Ecological Consistency
in the Demarcation of Operational Taxonomic Units
Sebastian Schmidt, Institute for Molecular Life Sciences,
University of Zürich
ISME15, Seoul, 2014/08/29
A general workflow in (targeted) metagenomics
[Image: Jean Tinguely, “Heureka”, Lake Zürich]
Sampling & Sequencing
“Making OTUs”
Understanding your data (hopefully)
Concepts
replicability
robustness
reproducibility
ecological consistency
42
Life, the Universe and Everything?
Life, Microbial Ecology and Everything?
The Human Skin Microbiome (HSM) dataset (Grice et al, Science, 2009):
~115,000 full-length 16S sequences
sampled from 21 distinct body sites
clustered to 97% sequence identity
[Figure: OTUs obtained with six clustering methods on the HSM dataset. Schmidt et al, Environ Microbiol, in press]
OTU A (5,423 seq.): all methods agree (almost) perfectly
OTU B (8,465 seq.): lumping by SL
OTU C (sequence count illegible in the source): splitting by CL
OTU D (2,692 seq.): splitting by UCLUST
Small OTUs (size cutoff illegible in the source): methods provide different numbers of “small” OTUs
OTUs per method: UCLUST 3,282; the counts for UPARSE, CD-HIT, single linkage, complete linkage and average linkage are illegible in the source
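The “lumping by SL” and “splitting by CL” behaviour can be illustrated with a minimal sketch. It is a naive agglomerative clustering on a hypothetical four-sequence chain, not the tools or data used in the study; the function name `cluster`, the toy distances (in percent) and the 3% cutoff (standing in for 97% identity) are all assumptions made for illustration.

```python
from itertools import combinations

def cluster(dist, n, cutoff, linkage):
    """Naive agglomerative clustering of n items.

    dist maps (i, j) with i < j to a pairwise distance. Clusters are merged
    while the closest pair of clusters lies within the cutoff.
    """
    reducers = {
        "single": min,                            # closest pair of members
        "complete": max,                          # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),  # mean over all member pairs
    }
    reduce_fn = reducers[linkage]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ds = [dist[min(i, j), max(i, j)]
                  for i in clusters[a] for j in clusters[b]]
            d = reduce_fn(ds)
            if best is None or d < best[0]:
                best = (d, a, b)
        if best[0] > cutoff:
            break  # no remaining pair of clusters is close enough
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy data: four sequences along a chain, distances in percent.
positions = [0, 2, 4, 7]
toy = {(i, j): abs(positions[i] - positions[j])
       for i, j in combinations(range(4), 2)}

for method in ("single", "complete", "average"):
    print(method, cluster(toy, 4, 3, method))
# single   [[0, 1, 2, 3]]     (chaining lumps everything)
# complete [[0, 1], [2, 3]]   (every member pair must be within the cutoff)
# average  [[0, 1, 2], [3]]
```

On this toy chain the three linkage rules give three different partitions at the same nominal cutoff, which is the effect the figure shows at the scale of real OTUs.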
[Figure: pairwise comparison of clustering methods (AL, CL, SL, CD-HIT, UCLUST, UPARSE) for six diversity indices: Chao1, inverse Simpson, Shannon, Sørensen, JABD and Morisita-Horn. Each cell gives the Pearson correlation and the relative shift (log2) between the estimates of two methods. Colour marks the significance of the mean shift (red: shift towards higher values; blue: shift towards lower values). Schmidt et al, Environ Microbiol, in press (data from Grice et al, Science, 2009)]
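For readers unfamiliar with the alpha diversity indices named in the figure, the following is a minimal sketch of their textbook definitions, computed from a vector of per-OTU sequence counts. The function names and the example counts are hypothetical, and the study may have used different estimators or corrections.

```python
import math

def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over OTU proportions p_i."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def inverse_simpson(counts):
    """Inverse Simpson index 1 / sum(p_i^2)."""
    n = sum(counts)
    return 1.0 / sum((c / n) ** 2 for c in counts if c > 0)

def chao1(counts):
    """Chao1 richness estimate from singleton (f1) and doubleton (f2) OTUs."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used here when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2

# Hypothetical example: the same 20 reads, once lumped and once split.
lumped = [14, 4, 1, 1]
split = [8, 6, 2, 2, 1, 1]
for name, counts in (("lumped", lumped), ("split", split)):
    print(name, shannon(counts), inverse_simpson(counts), chao1(counts))
```

Because all three indices depend on how many OTUs there are and how reads are distributed among them, lumping or splitting by the clustering method shifts the estimates even when the underlying reads are identical.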
[Figure: heatmaps of Adjusted Mutual Information (colour scale 0.5 to 1.0) for all pairs of clustering methods (average linkage, complete linkage, single linkage, UCLUST, CD-HIT, UPARSE), with clustering thresholds from 90 to 100 on both axes. Schmidt et al, Environ Microbiol, in press]
A ‘global’ 16S dataset
~1.1M full-length sequences
≥30k samples, diverse environments
Adjusted Mutual Information (AMI), a measure of partition similarity
high replicability
…when clustering twice to the exact same threshold
differential robustness
…to slight threshold changes
differential reproducibility
pairwise similarity maxima between methods off-diagonal
comparability of results across studies?
“Greengenes 97” vs. “SILVA 99”: AMI ~ 0.65
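As a minimal sketch of how AMI compares two partitions of the same sequences, the function below computes mutual information, subtracts its expectation under random labelling with fixed cluster sizes, and normalises by the mean of the two entropies. The arithmetic-mean normalisation, the function name and the toy labels are assumptions; the study may have used a different variant.

```python
import math
from collections import Counter

def _entropy(sizes, n):
    """Entropy (in nats) of a partition with the given cluster sizes."""
    return -sum((s / n) * math.log(s / n) for s in sizes)

def adjusted_mutual_info(labels_u, labels_v):
    """AMI between two labelings of the same items (1.0 = identical partitions)."""
    n = len(labels_u)
    if n != len(labels_v):
        raise ValueError("both labelings must cover the same items")
    a = Counter(labels_u)                  # cluster sizes in partition U
    b = Counter(labels_v)                  # cluster sizes in partition V
    joint = Counter(zip(labels_u, labels_v))
    mi = sum((nij / n) * math.log(n * nij / (a[u] * b[v]))
             for (u, v), nij in joint.items())

    def lf(k):
        return math.lgamma(k + 1)          # log(k!)

    # Expected MI under the hypergeometric model of random overlaps.
    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_p = (lf(ai) + lf(bj) + lf(n - ai) + lf(n - bj)
                         - lf(n) - lf(nij) - lf(ai - nij) - lf(bj - nij)
                         - lf(n - ai - bj + nij))
                emi += (nij / n) * math.log(n * nij / (ai * bj)) * math.exp(log_p)

    denom = (_entropy(a.values(), n) + _entropy(b.values(), n)) / 2 - emi
    if abs(denom) < 1e-12:
        return 1.0                         # degenerate case, e.g. one cluster each
    return (mi - emi) / denom

# Hypothetical OTU assignments of six sequences by two methods.
method_1 = [0, 0, 0, 0, 1, 1]   # lumps the first four sequences
method_2 = [0, 0, 1, 1, 2, 2]   # splits them into two OTUs
print(adjusted_mutual_info(method_1, method_1))
print(adjusted_mutual_info(method_1, method_2))
```

OTU labels themselves carry no meaning here: only the grouping matters, so renaming the clusters of one method leaves the score unchanged.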
But which method makes the ‘best’ OTUs?
‘Good’ OTUs should correspond to ‘true’ bacterial lineages (‘species’)
they should comply with evolutionary theory of bacterial speciation
BUT: no unifying / commonly accepted bacterial species concept
Two main criteria for theory-compliant OTUs:
phylogenetic consistency (represent monophyletic lineages)
ecological consistency (represent ecologically homogeneous groups of organisms)
Gevers et al., Nat Rev Microbiol, 2005
Cohan, Philos T R Soc B, 2006
Koeppel et al., PNAS, 2008
Hunt et al., Science, 2008
Fraser et al., Science, 2009
Vos, Trends Microbiol, 2011
Koeppel & Wu, NAR, 2013
Preheim et al, Appl Env Microbiol, 2013
[and many more…]
[Figure: word cloud of ecological terms, including soil, sediment, water, marine, sludge, rhizosphere, lake, gut, wastewater, hydrothermal, biofilm, freshwater, skin, feces, sponge, coral and many others]
[Figure: Ecological Consistency Score (ECS) against number of OTUs (log scale) for complete linkage, UCLUST, CD-HIT, single linkage and average linkage, with 97% nominal similarity marked. Further panels: Archaea, ecological terms; Eukarya, ecological terms; Bacteria, ENVO terms; Bacteria, sampling sites; Bacteria, host taxonomy. Schmidt et al, PLOS Comp Biol, 2014]
Conclusions
replicability
clustering was generally replicable
robustness
AL, CL & CD-HIT were highly robust to (slightly) changing thresholds; UCLUST, UPARSE & SL were more sensitive
similar trends for robustness to clustering context and choice of subregion (not shown)
reproducibility
surprisingly discordant partitions by different methods
similarity maxima generally off-diagonal
AL and CD-HIT most similar pair
implications for reference-based OTU-binning: choice of reference clustering determines quality
ecological consistency
CL provided most consistent OTU sets
implications for taxonomy and species definitions?