Talk for UC Davis Applied Phylogenetics Course at Bodega Bay
DESCRIPTION
Talk by Jonathan Eisen for UC Davis Applied Phylogenetics Course at Bodega Bay
TRANSCRIPT
Phylogenomics
Jonathan A. Eisen, UC Davis
Bodega Applied Phylogenetics Workshop, March 7, 2011
Tuesday, March 8, 2011
Fleischmann et al. 1995 Science 269:496-512
Whole Genome Shotgun Sequencing
shotgun
sequence
Warner Brothers, Inc.
Assemble Fragments
sequencer output
assemble fragments
Closure & Annotation
From http://genomesonline.org
Genome Sequences Have Revolutionized Microbiology
• Predictions of metabolic processes
• Better vaccine and drug design
• New insights into mechanisms of evolution
• Genomes serve as template for functional studies
• New enzymes and materials for engineering and synthetic biology
General Steps in Analysis of Complete Genomes
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Comparative genomics
Genome Size
Genome Structure: More Variable than Once Thought
Why Completeness is Important
• Improves characterization of genome features
– Gene order, replication origins
• Better comparative genomics
– Genome duplications, inversions
• Presence and absence of particular genes can be very important
• Missing sequence might be important (e.g., centromere)
• Allows researchers to focus on biology not sequencing
Vibrio cholerae Metabolism
From http://genomesonline.org
Phylogenomic Analysis
• Evolutionary reconstructions greatly improve genome analyses
• Genome analysis greatly improves evolutionary reconstructions
• There is a feedback loop such that these should be integrated
Outline
• Phylogenomic Tales
– Selecting genomes for sequencing
– Species evolution
– Predicting functions of genes
– Uncultured microbes
– Searching for novel organisms and genes
• All of these are going to be told in the context of a recent project, “A Genomic Encyclopedia of Bacteria and Archaea” (aka GEBA)
GEBA Introduction
Knowing What We Don’t Know
Major Microbial Sequencing Efforts
• Coordinated, top-down efforts
– Fungal Genome Initiative (Broad/Whitehead)
– Gordon and Betty Moore Foundation Marine Microbial Genome Sequencing Project
– Sanger Center Pathogen Sequencing Unit
– NHGRI Human Gut Microbiome Project
– NIH Human Microbiome Program
• White paper or grant systems
– NIAID Microbial Sequencing Centers
– DOE/JGI Community Sequencing Program
– DOE/JGI BER Sequencing Program
– NSF/USDA Microbial Genome Sequencing
• Covers lots of ground and biological diversity
As of 2002
Acidobacteria
Bacteroides
Fibrobacteres
Gemmimonas
Verrucomicrobia
Planctomycetes
Chloroflexi
Proteobacteria
Chlorobi
Firmicutes
Fusobacteria
Actinobacteria
Cyanobacteria
Chlamydia
Spirochaetes
Deinococcus-Thermus
Aquificae
Thermotogae
TM6
OS-K
Termite Group
OP8
Marine Group A
WS3
OP9
NKB19
OP3
OP10
TM7
OP1
OP11
Nitrospira
Synergistes
Deferribacteres
Thermodesulfobacteria
Chrysiogenetes
Thermomicrobia
Dictyoglomus
Coprothermobacter
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
Based on Hugenholtz, 2002
Need for Tree Guidance Well Established
• Common approach within some eukaryotic groups
• Many small projects funded to fill in some bacterial or archaeal gaps
• Phylogenetic gaps in bacterial and archaeal projects commonly lamented in literature
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Solution I: sequence more phyla
• NSF-funded Tree of Life Project
• A genome from each of eight phyla
Eisen, Ward, Robb, Nelson, et al
Organisms Selected
Phylum                   Species selected
Chrysiogenes             Chrysiogenes arsenatis (GCA)
Coprothermobacter        Coprothermobacter proteolyticus (GCBP)
Dictyoglomi              Dictyoglomus thermophilum (GDT)
Thermodesulfobacteria    Thermodesulfobacterium commune (GTC)
Nitrospirae              Thermodesulfovibrio yellowstonii (GTY)
Thermomicrobia           Thermomicrobium roseum (GTR)
Deferribacteres          Geovibrio thiophilus (GGT)
Synergistes              Synergistes jonesii (GSJ)
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Still highly biased in terms of the tree
• NSF-funded Tree of Life Project
• A genome from each of eight phyla
Eisen & Ward, PIs
Major Lineages of Actinobacteria
2.5 Actinobacteria
2.5.1 Acidimicrobidae
2.5.1.1 Unclassified
2.5.1.2 "Microthrixineae"
2.5.1.3 Acidimicrobineae
2.5.1.3.1 Unclassified
2.5.1.3.2 Acidimicrobiaceae
2.5.1.4 BD2-10
2.5.1.5 EB1017
2.5.2 Actinobacteridae
2.5.2.1 Unclassified
2.5.2.10 Ellin306/WR160
2.5.2.11 Ellin5012
2.5.2.12 Ellin5034
2.5.2.13 Frankineae
2.5.2.13.1 Unclassified
2.5.2.13.2 Acidothermaceae
2.5.2.13.3 Ellin6090
2.5.2.13.4 Frankiaceae
2.5.2.13.5 Geodermatophilaceae
2.5.2.13.6 Microsphaeraceae
2.5.2.13.7 Sporichthyaceae
2.5.2.14 Glycomyces
2.5.2.15 Intrasporangiaceae
2.5.2.15.1 Unclassified
2.5.2.15.2 Dermacoccus
2.5.2.15.3 Intrasporangiaceae
2.5.2.16 Kineosporiaceae
2.5.2.17 Microbacteriaceae
2.5.2.17.1 Unclassified
2.5.2.17.2 Agrococcus
2.5.2.17.3 Agromyces
2.5.2.18 Micrococcaceae
2.5.2.19 Micromonosporaceae
2.5.2.2 Actinomyces
2.5.2.20 Propionibacterineae
2.5.2.20.1 Unclassified
2.5.2.20.2 Kribbella
2.5.2.20.3 Nocardioidaceae
2.5.2.20.4 Propionibacteriaceae
2.5.2.21 Pseudonocardiaceae
2.5.2.22 Streptomycineae
2.5.2.22.1 Unclassified
2.5.2.22.2 Kitasatospora
2.5.2.22.3 Streptacidiphilus
2.5.2.23 Streptosporangineae
2.5.2.23.1 Unclassified
2.5.2.23.2 Ellin5129
2.5.2.23.3 Nocardiopsaceae
2.5.2.23.4 Streptosporangiaceae
2.5.2.23.5 Thermomonosporaceae
2.5.2.3 Actinomycineae
2.5.2.4 Actinosynnemataceae
2.5.2.5 Bifidobacteriaceae
2.5.2.6 Brevibacteriaceae
2.5.2.7 Cellulomonadaceae
2.5.2.8 Corynebacterineae
2.5.2.8.1 Unclassified
2.5.2.8.2 Corynebacteriaceae
2.5.2.8.3 Dietziaceae
2.5.2.8.4 Gordoniaceae
2.5.2.8.5 Mycobacteriaceae
2.5.2.8.6 Rhodococcus
2.5.2.8.7 Rhodococcus
2.5.2.8.8 Rhodococcus
2.5.2.9 Dermabacteraceae
2.5.2.9.1 Unclassified
2.5.2.9.2 Brachybacterium
2.5.2.9.3 Dermabacter
2.5.3 Coriobacteridae
2.5.3.1 Unclassified
2.5.3.2 Atopobiales
2.5.3.3 Coriobacteriales
2.5.3.4 Eggerthellales
2.5.4 OPB41
2.5.5 PK1
2.5.6 Rubrobacteridae
2.5.6.1 Unclassified
2.5.6.2 "Thermoleiphilaceae"
2.5.6.2.1 Unclassified
2.5.6.2.2 Conexibacter
2.5.6.2.3 XGE514
2.5.6.3 MC47
2.5.6.4 Rubrobacteraceae
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Archaea
• NSF-funded Tree of Life Project
• A genome from each of eight phyla
Eisen & Ward, PIs
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
• NSF-funded Tree of Life Project
• A genome from each of eight phyla
Eisen & Ward, PIs
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Viruses
• NSF-funded Tree of Life Project
• A genome from each of eight phyla
Eisen & Ward, PIs
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Solution: Really Fill in the Tree
• GEBA
• A genomic encyclopedia of bacteria and archaea
Eisen & Ward, PIs
http://www.jgi.doe.gov/programs/GEBA/pilot.html
GEBA Pilot Project: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al.)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)
rRNA Tree of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
GEBA Pilot Target List
[Bar chart: # of Genomes (y-axis, 0 to 35) by Phyla (x-axis). Bacteria (B): Actinobacteria (High GC), Aminanaerobia, Aquificae, Bacteroidetes, Chloroflexi, Deferribacteres (listed twice), Deinococci, Delta Proteobacteria, Epsilon Proteobacteria, Firmicutes, Fusobacteria, Gamma Proteobacteria, Gemmatimonadetes, Haloanaerobiales, Planctomycetes, Spirochaetes, Thermodesulfobacteria, Thermodesulfobia, Thermovenabulae. Archaea (A): Halobacteria, Archaeoglobi, Methanobacteria, Methanomicrobia, Thermococci, Thermoprotei.]
GEBA Pilot Project Overview
• Identify major branches in rRNA tree for which no genomes are available
• Identify those with a cultured representative in DSMZ
• DSMZ grew > 200 of these and prepped DNA
• Sequence and finish 200+
• Annotate, analyze, release data
• Assess benefits of tree guided sequencing
• 1st paper Wu et al in Nature Dec 2009
GEBA Phylogenomic Lesson 1
The rRNA Tree of Life is a Useful Tool for Identifying Phylogenetically Novel Genomes
rRNA Tree of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science 276:734-740
Archaea
Eukaryotes
Bacteria
The Core Gets Small ...
The Pangenome
Islands Among Synteny
The Pangenome
Network of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Archaea
Eukaryotes
Bacteria
Using the Core
Whole genome tree built using AMPHORA by Martin Wu and Dongying Wu
Four Models for Rooting TOL
from Lake et al. doi: 10.1098/rstb.2009.0035
GEBA Phylogenomic Lesson 2
rRNA Tree is good but not perfect, and better genomic sampling improves phylogenetic inference
16S Says Hyphomonas is in Rhodobacterales
Badger et al. 2005
WGT and individual gene trees: It's Related to Caulobacterales
Badger et al. 2005
Badger et al. 2005 Int J System Evol Microbiol 55: 1021-1026.
16S WGT, 23S
Caveats: ignoring LGT and using concatenated alignments
Concatenated Alignment ML Tree
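A concatenated alignment joins the per-gene alignments end to end so that one tree can be inferred from all markers at once. The sketch below is only an illustration of that joining step, not the pipeline used for the tree on this slide: the marker names, taxon names, and sequences are made up, and padding missing markers with gaps is an assumption.

```python
def concatenate_alignments(alignments):
    """Join per-marker alignments into one concatenated alignment.

    alignments: dict of marker name -> dict of taxon -> aligned sequence.
    Taxa missing from a marker are padded with gap characters (an assumption).
    """
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    concatenated = {taxon: "" for taxon in taxa}
    for marker in sorted(alignments):
        aln = alignments[marker]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"marker {marker} is not aligned (unequal lengths)")
        width = lengths.pop()
        for taxon in taxa:
            concatenated[taxon] += aln.get(taxon, "-" * width)
    return concatenated


# Hypothetical toy data: two markers, one taxon missing from the second.
markers = {
    "recA": {"taxonA": "MAID", "taxonB": "MAVD", "taxonC": "M-VD"},
    "rpoB": {"taxonA": "KLE", "taxonB": "KIE"},
}
concat = concatenate_alignments(markers)
```

Every row of the result has the same length, which is what tree-building programs expect. The caveat on the previous slide still applies: concatenation assumes all markers share one history, which LGT can violate.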
Green Non Sulfur Bacteria
Chlamydia-Verrucomicrobia
Proteobacteria
Zimmer. New York Times. 2009
GEBA Phylogenomic Lesson 3
Phylogenetics guided genome selection (and phylogenetics in general) improves genome annotation
Predicting Function
• Key step in genome projects
• More accurate predictions help guide experimental and computational analyses
• Many diverse approaches
• All are improved by “phylogenomic” type analyses that integrate evolutionary reconstructions and an understanding of how new functions evolve
From Eisen et al. 1997 Nature Medicine 3: 1076-1078.
Blast Search of H. pylori “MutS”
• Blast search pulls up Syn. sp MutS#2 with a much more significant p value than other MutS homologs
• Based on this TIGR predicted this species had mismatch repair
• Assumes functional constancy
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
Predicting Function
• Identification of motifs
– Short regions of sequence similarity that are indicative of general activity
– e.g., ATP binding
• Homology/similarity based methods
– Gene sequence is searched against a database of other sequences
– If significantly similar genes are found, their functional information is used
• Problem
– Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function
MutL??
From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html
Phylogenetic Tree of MutS Family
[Figure: phylogenetic tree of MutS family proteins. Leaf labels are abbreviated species names: Aquae, Trepa, Fly, Xenla, Rat, Mouse, Human, Yeast, Neucr, Arath, Borbu, Strpy, Bacsu, Synsp, Ecoli, Neigo, Thema, Theaq, Deira, Chltr, Spombe, Celeg, Metth, Helpy, mSaco.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
MutS Subfamilies
[Figure: the MutS family tree with subfamilies labeled: MSH1, MSH2, MSH3, MSH4, MSH5, MSH6, MutS1, MutS2.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Overlaying Functions onto Tree
[Figure: the MutS family tree with subfamily labels (MSH1, MSH2, MSH3, MSH4, MSH5, MSH6, MutS1, MutS2) and known functions overlaid.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Functional Prediction Using Tree
[Figure: the MutS family tree with a predicted function for each subfamily.]
MSH1 - Mitochondrial Repair
MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair
MSH3 - Nuclear Repair Of Loops
MSH4 - Meiotic Crossing Over
MSH5 - Meiotic Crossing Over
MSH6 - Nuclear Repair Of Mismatches
MutS1 - Bacterial Mismatch and Loop Repair
MutS2 - Unknown Functions
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
PHYLOGENETIC PREDICTION OF GENE FUNCTION
METHOD
CHOOSE GENE(S) OF INTEREST
IDENTIFY HOMOLOGS
ALIGN SEQUENCES
CALCULATE GENE TREE
OVERLAY KNOWN FUNCTIONS ONTO TREE
INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
[Figure: the method applied to EXAMPLE A (genes 1 to 6) and EXAMPLE B (genes 1A to 3A and 1B to 3B from Species 1, Species 2, and Species 3), shown next to the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Inferred duplication events are marked, and one inference is labeled Ambiguous.]
Based on Eisen, 1998 Genome Res 8: 163-167.
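The last two steps of the method (overlay known functions, then infer) can be sketched in a few lines. This is a simplified illustration, not the published procedure: the tree is a hand-written nested tuple, the gene names and function labels are hypothetical, and the rule "use the smallest clade around the query that contains annotated genes" is an assumption that ignores explicit duplication inference.

```python
def leaves(tree):
    """Return all leaf names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree:
        result.extend(leaves(child))
    return result


def predict_function(tree, query, known):
    """Predict the query's function from its closest annotated relatives."""
    if query not in leaves(tree):
        raise ValueError(f"{query} is not in the tree")
    # Collect the clades that contain the query, from the root downward.
    path = []
    node = tree
    while not isinstance(node, str):
        path.append(node)
        node = next(child for child in node if query in leaves(child))
    # Check clades from the smallest outward until annotated genes are found.
    for clade in reversed(path):
        functions = {known[leaf] for leaf in leaves(clade)
                     if leaf != query and leaf in known}
        if len(functions) == 1:
            return functions.pop()
        if len(functions) > 1:
            return "ambiguous"
    return "unknown"


# Hypothetical gene tree and annotations.
gene_tree = ((("geneA", "geneB"), "query1"), ("geneC", ("geneD", "query2")))
known_functions = {
    "geneA": "mismatch repair",
    "geneB": "mismatch repair",
    "geneC": "meiotic crossing over",
    "geneD": "meiotic crossing over",
}
prediction1 = predict_function(gene_tree, "query1", known_functions)
prediction2 = predict_function(gene_tree, "query2", known_functions)
prediction3 = predict_function((("geneA", "geneC"), "query3"), "query3", known_functions)
```

The third call shows the ambiguous case from the figure: when the query's closest annotated relatives disagree, the sketch reports "ambiguous" rather than picking the top hit, which is the point of using the tree instead of a similarity ranking.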
Phylogenetic Prediction of Gene Function
• Termed phylogenomics (Eisen, et al 1997)
• Greatly improves accuracy of functional predictions compared to similarity based methods (e.g., blast)
• Automated methods now available
– Sean Eddy, Steven Brenner, Kimmen Sjölander, etc.
• But …
Example 2: Recent Changes
[Figure: neighbor-joining (NJ) tree of a gene family with members from E. coli, B. subtilis, Synechocystis sp., H. pylori, C. jejuni, A. fulgidus, T. maritima, T. pallidum, B. burgdorferi, P. abyssi, P. horikoshii, D. radiodurans, M. tuberculosis, and V. cholerae, including a long run of V. cholerae genes (VC and VCA loci). Some branches are marked with asterisks.]
• Phylogenomic functional prediction may not work well for very newly evolved functions
• Can use understanding of origin of novelty to better interpret these cases?
• Screen genomes for genes that have changed recently
– Pseudogenes and gene loss
– Contingency Loci
– Acquisition (e.g., LGT)
– Unusual dS/dN ratios
– Rapid evolutionary rates
– Recent duplications
Example 3: Non homology methods
• Many genes have homologs in other species but no homologs have ever been studied experimentally
• Non-homology methods can make functional predictions for these
• Example: phylogenetic profiling
Phylogenetic profiling basis
• Microbial genes are lost rapidly when not maintained by selection
• Genes can be acquired by lateral transfer
• Frequently gain and loss occurs for entire pathways/processes
• Thus might be able to use correlated presence/absence information to identify genes with similar functions
Non-Homology Predictions: Phylogenetic Profiling
• Step 1: Search all genes in organisms of interest against all other genomes
• Ask: Yes or No, is each gene found in each other species
• Cluster genes by distribution patterns (profiles)
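The three steps above can be sketched as follows. This is a toy illustration under stated assumptions: the homology search is taken as already done (the `hits` table is made up), gene and genome names are hypothetical, and genes are grouped only when their profiles are identical, whereas a real analysis would more likely cluster on a profile distance.

```python
from collections import defaultdict


def build_profiles(hits, genomes):
    """Turn search results into yes/no (1/0) profiles.

    hits: dict of gene -> set of genomes in which a homolog was found.
    genomes: ordered list of all genomes searched.
    """
    return {gene: tuple(1 if genome in found else 0 for genome in genomes)
            for gene, found in hits.items()}


def cluster_by_profile(profiles):
    """Group genes that share exactly the same presence/absence profile."""
    clusters = defaultdict(list)
    for gene, profile in sorted(profiles.items()):
        clusters[profile].append(gene)
    return dict(clusters)


# Hypothetical search results for four genes against four genomes.
genomes = ["genome1", "genome2", "genome3", "genome4"]
hits = {
    "geneA": {"genome1", "genome3"},
    "geneB": {"genome1", "genome3"},
    "geneC": {"genome2"},
    "geneD": {"genome1", "genome2", "genome3", "genome4"},
}
profiles = build_profiles(hits, genomes)
clusters = cluster_by_profile(profiles)
```

Here geneA and geneB end up in the same cluster because they are present in, and absent from, the same genomes, which is the signal used to suggest they act in the same pathway or process.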
Carboxydothermus hydrogenoformans
• Isolated from a Russian hotspring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome Determined (Wu et al. 2005 PLoS Genetics 1: e65.)
Homologs of Sporulation Genes
Wu et al. 2005 PLoS Genetics 1: e65.
Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.
Wu et al. 2005 PLoS Genetics 1: e65.
PG Profiling Works Better Using Orthology
GEBA Lesson 3: Phylogeny driven genome selection (and phylogenetics) improves genome annotation
• Took 56 GEBA genomes and compared results vs. 56 randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary” based predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction
GEBA Lesson 4: Metadata Important
GEBA Phylogenomic Lesson 5
Phylogeny-driven genome selection helps discover new genetic diversity
Network of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Archaea
Eukaryotes
Bacteria
Protein Family Rarefaction
• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
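The rarefaction curve described above can be sketched as below. This is a minimal illustration, not the analysis from the paper: the family assignments are assumed to have been produced already (for example by MCL), the genome and family names are made up, and averaging over random genome orderings is one common way to smooth the curve.

```python
import random


def rarefaction_curve(genome_families, n_shuffles=100, seed=0):
    """Mean number of distinct protein families seen after adding k genomes.

    genome_families: dict of genome -> set of protein family identifiers.
    Returns a list whose entry k-1 is the mean family count for k genomes,
    averaged over n_shuffles random orderings of the genomes.
    """
    rng = random.Random(seed)
    genomes = sorted(genome_families)
    totals = [0] * len(genomes)
    for _ in range(n_shuffles):
        order = genomes[:]
        rng.shuffle(order)
        seen = set()
        for i, genome in enumerate(order):
            seen |= genome_families[genome]
            totals[i] += len(seen)
    return [total / n_shuffles for total in totals]


# Hypothetical family assignments for three genomes.
families = {
    "genome1": {"famA", "famB"},
    "genome2": {"famB", "famC"},
    "genome3": {"famA", "famB", "famC", "famD"},
}
curve = rarefaction_curve(families)
```

Plotting `curve` against 1, 2, 3, ... genomes gives the rarefaction plot; a curve that keeps climbing means new genomes are still contributing protein families not seen before.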
Wu et al. 2009 Nature 462, 1056-1060
Synapomorphies exist
Wu et al. 2009 Nature 462, 1056-1060
Families/PD not uniform
Structural Novelty
• Of the 17000 protein families in the GEBA56, 1800 are novel in sequence (Wu)
• Structural modeling suggests many are structurally novel too (D’Haeseleer)
• 372 being crystallized by the PSI (Kerfeld)
GEBA Phylogenomic Lesson 6
Improves analysis of genome data from uncultured organisms
Great Plate Count Anomaly
Culturing Count <<<< Microscope Count
Environmental DNA Analysis
Culturing Count <<<< Microscope Count
DNA
rRNA Phylotyping
• Collect DNA from environment
• PCR amplify rRNA genes using broad (so-called universal) primers
• Sequence
• Align to others
• Infer evolutionary tree
• Unknowns “identified” by placement on tree
• Some use BLAST, but not as good as phylogeny
rRNA PCR
The Hidden Majority; Richness estimates
Bohannan and Hughes 2003; Hugenholtz 2002
rRNA data increasing exponentially too
rRNA phylotyping issues
• Massive amounts of data
– 1 x 10^6 new partial sequences with new 454
– 2 x 10^6 full length sequences in DB
• Alignments of new sequences not always straightforward
• Solutions:
– Reliance on similarity scores (bad)
– High throughput automated phylogenetic tools
• STAP
• WATERS
Perna et al. 2003
Diversity of Proteorhodopsins by PCR
de la Torre et al 2003
Metagenomics
shotgun
sequence
Massive Diversity of Proteorhodopsins
Venter et al., 2004
Applied Phylogenetics
Example I: Functional Diversity
Functional Diversity of Proteorhodopsins?
Venter et al., Science 304: 66. 2004
Example II: Phylotyping w/ many genes
rRNA Phylotyping in Sargasso Sea
Venter et al., Science 304: 66. 2004
Shotgun Sequencing Allows Use of Alternative Anchors (e.g., RecA)
Venter et al., Science 304: 66. 2004
Shotgun Sequencing Allows Use of Other Markers
[Bar chart: Sargasso Phylotypes. x-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). y-axis: Weighted % of Clones (0 to 0.5). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Venter et al., Science 304: 66. 2004
Shotgun Sequencing Allows Use of Other Markers
[Bar chart: Sargasso Phylotypes, Weighted % of Clones by Major Phylogenetic Group for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA.]
Cannot be done without good sampling of genomes
Venter et al., Science 304: 66-74. 2004
Example III: Binning
Metagenomics Challenge
Binning challenge
Best binning method: reference genomes
Binning challenge
No reference genome? What do you do?
CFB Phyla
Phylogenetic Binning
[Bar chart: Sargasso Phylotypes, Weighted % of Clones by Major Phylogenetic Group for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA.]
Venter et al., Science 304: 66-74. 2004
[Figure: Sargasso Phylotypes. x-axis: Major Phylogenetic Group; y-axis: Weighted % of Clones (0 to 0.5); series: EFG, EFTu, HSP70, RecA, RpoB, rRNA]
Shotgun Sequencing Allows Use of Other Markers
Cannot be done without good sampling of genomes
Venter et al., Science 304: 66-74. 2004
[Figure: Sargasso Phylotypes. x-axis: Major Phylogenetic Group; y-axis: Weighted % of Clones (0 to 0.5); series: EFG, EFTu, HSP70, RecA, RpoB, rRNA]
Shotgun Sequencing Allows Use of Other Markers
GEBA Project improves metagenomic analysis
Venter et al., Science 304: 66-74. 2004
GEBA Cyano
Sequencing status (as of 01/14):
Awaiting Material  11
Library            12
Production         22
Finishing           5
Grand Total        50
Ongoing/Planned Activities:
- Building Cyanobacterial Metadatabase (IMG-GOLD)
- 10th Cyanobacterial Molecular Biology Workshop, Lake Arrowhead, CA (06/10)
--> Cheryl will host: Workshop training as prep for virtual Jamboree
GEBA RNB
Plan: Sequence multiple Root Nodule Bacteria (RNBs) from across the planet. Pilot: 100 RNBs.
Alpha RNB: Bradyrhizobium, Mesorhizobium, Rhizobium, Sinorhizobium, Balneimonas-like, Devosia, Ochrobactrum, Phyllobacterium, Azorhizobium, Allorhizobium
Beta RNB: Cupriavidus, Burkholderia
Goal:
• Understand biogeographical effects on species evolution and understand host specificity.
Rationale:
• N2 fixation by legume pastures and crops provides 65% of the N currently utilized in agricultural production.
• Contributes 25 to 90 million metric tonnes of N per annum.
• Symbioses save US$6-10 billion annually on N fertilizer.
• Grain and animal production are enhanced by fixed nitrogen supplied by the symbiosis.
Nikos Kyrpides
Haloarchaeal GEBA-like
[Figure: phylogenetic tree of bacterial phyla. Tip labels: Acidobacteria, Bacteroides, Fibrobacteres, Gemmimonas, Verrucomicrobia, Planctomycetes, Chloroflexi, Proteobacteria, Chlorobi, Firmicutes, Fusobacteria, Actinobacteria, Cyanobacteria, Chlamydia, Spirochaetes, Deinococcus-Thermus, Aquificae, Thermotogae, TM6, OS-K, Termite Group, OP8, Marine Group A, WS3, OP9, NKB19, OP3, OP10, TM7, OP1, OP11, Nitrospira, Synergistes, Deferribacteres, Thermodesulfobacteria, Chrysiogenetes, Thermomicrobia, Dictyoglomus, Coprothermobacter]
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Still not happy
• NSF-funded Tree of Life Project
• A genome from each of eight phyla
Eisen & Ward, PIs
[Figure: Sargasso Phylotypes. x-axis: Major Phylogenetic Group; y-axis: Weighted % of Clones (0 to 0.5); series: EFG, EFTu, HSP70, RecA, RpoB, rRNA]
Shotgun Sequencing Allows Use of Other Markers
GEBA Project improves metagenomic analysis, but only a little
Venter et al., Science 304: 66-74. 2004
Phylogenomics Future 1
Need to adapt genomic and metagenomic methods to make better use of data
Improving Metagenomic Analysis
• Methods
– More automation
– Better phylogenetic methods for short reads
– Improved tools for using distantly related genomes in metagenomic analysis
• Data sets
– Rebuild protein family models
– New phylogenetic markers
– Need better reference phylogenies, including HGT
• More simulations
Automation
AMPHORA
Guide tree
[Figure: chart by phylogenetic group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified Proteobacteria, Cyanobacteria, Chlamydiae, Acidobacteria, Bacteroidetes, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified Bacteria); value axis 0 to 0.7; series (31 marker genes): dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf]
We have more than 700 complete genome sequences:
• Select 100 representatives
• Build gene families
• Identify families that are present in all organisms in equal numbers
• HMM building and phylogenetic analysis to identify the true markers
[Figure: Phylogenetic Tree of Bacteria (built from 31 concatenated marker alignments). Labeled clades: ε-, γ-, β-, α-, and δ-Proteobacteria; Firmicutes]
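The family-filtering step in the list above (families present in every genome in equal numbers) can be sketched in a few lines. This is only an illustration under stated assumptions: "equal numbers" is read as exactly one copy per genome, and the function name, input layout, and toy gene assignments are hypothetical, not taken from AMPHORA.

```python
from collections import Counter

def single_copy_universal_families(assignments, genomes):
    """Return gene families found exactly once in every listed genome.

    assignments: iterable of (genome, family) pairs, one pair per gene.
    genomes: names of all genomes that must carry the family.
    """
    counts = Counter(assignments)  # (genome, family) -> copy number
    families = {family for _, family in counts}
    return sorted(
        family for family in families
        if all(counts[(genome, family)] == 1 for genome in genomes)
    )

# Hypothetical toy data: recA is duplicated in g2, nifH is missing from g2.
genes = [
    ("g1", "rpoB"), ("g2", "rpoB"), ("g3", "rpoB"),
    ("g1", "recA"), ("g2", "recA"), ("g2", "recA"), ("g3", "recA"),
    ("g1", "nifH"), ("g3", "nifH"),
]
markers = single_copy_universal_families(genes, ["g1", "g2", "g3"])
print(markers)  # ['rpoB']
```

Families that pass this filter are only candidates; the slide's last step (HMM building and phylogenetic analysis) is what decides which of them behave as true markers.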
More Markers
AMPHORA 2 Coming w/ More Markers
Phylogenetic group      Genome Number   Gene Number   Marker Candidates
Archaea                 62              145415        106
Actinobacteria          63              267783        136
Alphaproteobacteria     94              347287        121
Betaproteobacteria      56              266362        311
Gammaproteobacteria     126             483632        118
Deltaproteobacteria     25              102115        206
Epsilonproteobacteria   18              33416         455
Bacteroides             25              71531         286
Chlamydiae              13              13823         560
Chloroflexi             10              33577         323
Cyanobacteria           36              124080        590
Firmicutes              106             312309        87
Spirochaetes            18              38832         176
Thermi                  5               14160         974
Thermotogae             9               17037         684
[Figure: Distances between gene trees and the AMPHORA concatenated genome tree. Two panels: NODAL distance (axis 0 to 6) and SPLIT distance (axis 0 to 0.9), one entry per gene tree. Gene trees shown (order as extracted from the first panel): rRNA16S, ruvB, nusA, rplB, purA, rpsJ, secY, rpsI, pyrH, rpsE, rplP, rplN, rpsC, ruvA, rplF, rplA, serS, rplK, rpsK, priA, smpB, rpsG, guaA, rpsQ, rpsL, rplU, rplO, rpsM, infC, rplS, rplV, rplC, rpsP, rplE, rplT, rplL, rplQ, rpsH, mraW, rpsO, rpsB, rplI, rplM, rplR, ttf, frr, tsf, rplD, radA, rpsS, trmD, coaE, rpmA. Legend: Ribosomal protein; Transcription/translation related protein; DNA repair protein; Protein of other function; AMPHORA marker. Reference: Distance between the genome tree and 100 random trees (average ± standard deviation)]
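As a rough illustration of what a split distance between a gene tree and the genome tree measures, the sketch below compares the internal-branch bipartitions of two trees. It is a Robinson-Foulds-style calculation normalized by the total number of splits; the exact definition and software behind the slide's numbers are not stated, so this normalization, the nested-tuple tree encoding, and the toy trees are all assumptions.

```python
def leaf_set(tree):
    """Collect tip names from a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def nontrivial_splits(tree):
    """Return the internal-branch bipartitions, ignoring where the root sits."""
    taxa = leaf_set(tree)
    anchor = min(taxa)  # each split is stored as the side lacking this tip
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        if 2 <= len(below) <= len(taxa) - 2:
            found.add(taxa - below if anchor in below else below)
        return below

    walk(tree)
    return found

def split_distance(tree_a, tree_b):
    """Fraction of splits present in only one tree (0.0 = same topology)."""
    a, b = nontrivial_splits(tree_a), nontrivial_splits(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0

# Hypothetical five-taxon trees: the gene tree swaps the positions of b and c.
genome_tree = ((("a", "b"), "c"), ("d", "e"))
gene_tree = ((("a", "c"), "b"), ("d", "e"))
print(split_distance(genome_tree, gene_tree))  # 0.5
```

Both trees must contain the same set of tips for the comparison to be meaningful; a gene tree missing some genomes would first need the genome tree pruned to the shared tips.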
Fragments
Phylogenetic challenge
A single tree with everything
PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data
Thomas J. Sharpton1*, Samantha J. Riesenfeld1, Steven W. Kembel2, Joshua Ladau1, James P. O’Dwyer2,3, Jessica L. Green2, Jonathan A. Eisen4, Katherine S. Pollard1,5
1 The J. David Gladstone Institutes, University of California San Francisco, San Francisco, California, United States of America, 2 Center for Ecology and Evolutionary Biology, University of Oregon, Eugene, Oregon, United States of America, 3 Institute of Integrative and Comparative Biology, University of Leeds, Leeds, United Kingdom, 4 Department of Evolution and Ecology, University of California Davis, Davis, California, United States of America, 5 Institute for Human Genetics & Division of Biostatistics, University of California San Francisco, San Francisco, California, United States of America
Abstract
Microbial diversity is typically characterized by clustering ribosomal RNA (SSU-rRNA) sequences into operational taxonomic units (OTUs). Targeted sequencing of environmental SSU-rRNA markers via PCR may fail to detect OTUs due to biases in priming and amplification. Analysis of shotgun sequenced environmental DNA, known as metagenomics, avoids amplification bias but generates fragmentary, non-overlapping sequence reads that cannot be clustered by existing OTU-finding methods. To circumvent these limitations, we developed PhylOTU, a computational workflow that identifies OTUs from metagenomic SSU-rRNA sequence data through the use of phylogenetic principles and probabilistic sequence profiles. Using simulated metagenomic data, we quantified the accuracy with which PhylOTU clusters reads into OTUs. Comparisons of PCR and shotgun sequenced SSU-rRNA markers derived from the global open ocean revealed that while PCR libraries identify more OTUs per sequenced residue, metagenomic libraries recover a greater taxonomic diversity of OTUs. In addition, we discover novel species, genera and families in the metagenomic libraries, including OTUs from phyla missed by analysis of PCR sequences. Taken together, these results suggest that PhylOTU enables characterization of part of the biosphere currently hidden from PCR-based surveys of diversity.
Citation: Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O’Dwyer JP, et al. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
Editor: Oded Beja, Technion-Israel Institute of Technology, Israel
Received July 22, 2010; Accepted December 17, 2010; Published January 20, 2011
Copyright: © 2011 Sharpton et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Funding for this work was provided by the Gordon and Betty Moore Foundation (grant #1660, http://www.moore.org/). JPOD acknowledges funding from the EPSRC (grant #EP/G051402/1, http://www.epsrc.ac.uk). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
Introduction
A central goal of ecology and evolution is to understand the forces that shape biodiversity - the variety of life on Earth. It is becoming increasingly clear that global biodiversity is mostly microbial. It is estimated that there are millions of microbial species on the planet, relatively few of which have been isolated in culture [1–2]. Despite the recognized importance of microorganisms, we still know little about the magnitude and variability of microbial biodiversity in natural environments relative to what is known about plants and animals. This is a major knowledge gap, given that microbes are critical components of our planet, responsible for key ecosystem services including the production of agriculturally critical small molecules, the degradation of environmental contaminants, and the regulation of human host phenotypes.
Biodiversity science has traditionally focused on comparing species richness across space, time and environments. Out of necessity, microbial diversity studies usually examine the richness (i.e. number) of operational taxonomic units (OTUs), where OTUs are sequence similarity based surrogates for microbial taxa, which can be difficult to define. In addition to richness, OTUs have been used to characterize the abundance, range, and distribution of microbes, thereby improving our understanding of both natural ecosystems and human health [3–6]. OTUs are commonly identified by aligning sequences of the small subunit of ribosomal RNA (SSU-rRNA) from one or more samples and identifying groups of related sequences using a hierarchical clustering algorithm. This clustering is based upon a measure of distance between all pairs of sequences, which is typically defined using some variant of the percent sequence identity (PID) (e.g. [3,7–8]). For example, researchers traditionally cluster sequences that are no more than 3% diverged into the same OTU. This designation has been proposed as being roughly equivalent to a species-level classification [9], though evidence suggests that it may result in an underestimate of the true number of species [10].
The SSU-rRNA sequences for OTU identification are traditionally amplified from a sample via polymerase chain reaction (PCR) using universal primers. Each PCR product is then individually sequenced. One of the biggest drawbacks of this targeted sequencing approach is that it leverages PCR, which has been shown to exhibit sequence-based biases at the level of priming and extension [11–13]. In addition, the so-called ‘universal’ PCR primers used in such assays will fail to amplify
PLoS Computational Biology | www.ploscompbiol.org 1 January 2011 | Volume 7 | Issue 1 | e1001061
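The PID-based OTU clustering described in the Introduction (sequences no more than 3% diverged fall into the same OTU) can be illustrated with a small sketch. It assumes the sequences are already aligned to equal length, computes identity over columns where neither sequence has a gap, and uses nearest-neighbor (single-linkage) clustering only; the function names and toy sequences are hypothetical, and published tools such as those cited in the paper differ in many details.

```python
def percent_identity(a, b):
    """Fraction of identical positions among columns without a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)

def cluster_otus(aligned, max_divergence=0.03):
    """Nearest-neighbor clustering: any pair within the threshold is linked."""
    names = list(aligned)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if 1.0 - percent_identity(aligned[a], aligned[b]) <= max_divergence:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical 100-column alignment: s2 differs from s1 at 2 sites, s3 at 8 sites.
s1 = "ACGT" * 25
aligned = {"s1": s1, "s2": "TT" + s1[2:], "s3": "T" * 10 + s1[10:]}
otus = cluster_otus(aligned)
print(otus)  # [['s1', 's2'], ['s3']]
```

Average or furthest-neighbor linkage would need a different merging rule; single linkage is used here only because it reduces to finding connected groups at one threshold.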
alignment used to build the profile, resulting in a multiple sequence alignment of full-length reference sequences and metagenomic reads. The final step of the alignment process is a quality control filter that 1) ensures that only homologous SSU-rRNA sequences from the appropriate phylogenetic domain are included in the final alignment, and 2) masks highly gapped alignment columns (see Text S1).
We use this high quality alignment of metagenomic reads and reference sequences to construct a fully-resolved, phylogenetic tree and hence determine the evolutionary relationships between the reads. Reference sequences are included in this stage of the analysis to guide the phylogenetic assignment of the relatively short metagenomic reads. While the software can be easily extended to incorporate a number of different phylogenetic tools capable of analyzing metagenomic data (e.g., RAxML [27], pplacer [28], etc.), PhylOTU currently employs FastTree as a default method due to its relatively high speed-to-performance ratio and its ability to construct accurate trees in the presence of highly-gapped data [29]. After construction of the phylogeny, lineages representing reference sequences are pruned from the tree. The resulting phylogeny of metagenomic reads is then used to compute a PD distance matrix in which the distance between a pair of reads is defined as the total tree path distance (i.e., branch length) separating the two reads [30]. This tree-based distance matrix is subsequently used to hierarchically cluster metagenomic reads via MOTHUR into OTUs in a fashion similar to traditional PID-based analysis [31]. As with PID clustering, the hierarchical algorithm can be tuned to produce finer or coarser clusters, corresponding to different taxonomic levels, by adjusting the clustering threshold and linkage method.
To evaluate the performance of PhylOTU, we employed statistical comparisons of distance matrices and clustering results for a variety of data sets. These investigations aimed 1) to compare PD versus PID clustering, 2) to explore overlap between PhylOTU clusters and recognized taxonomic designations, and 3) to quantify the accuracy of PhylOTU clusters from shotgun reads relative to those obtained from full-length sequences.
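The PD distance described above (total branch length along the tree path between two reads) can be sketched without any phylogenetics library. The tree encoding (each node mapped to its parent and branch length), the node names, and the branch lengths are hypothetical. Pruning reference tips does not change the path length between two reads, so this sketch simply restricts the distance matrix to reads instead of editing the tree.

```python
from itertools import combinations

def path_to_root(node, parent):
    """Map every ancestor of `node` (including itself) to its distance from `node`."""
    distance, ancestors = 0.0, {node: 0.0}
    while node in parent:
        node, length = parent[node]
        distance += length
        ancestors[node] = distance
    return ancestors

def tree_path_distance(a, b, parent):
    """Sum of branch lengths on the path joining tips a and b."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    # The lowest common ancestor gives the smallest combined distance.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)

# Hypothetical tree: node -> (parent, branch length); "root" has no entry.
parent = {
    "n1": ("root", 0.5),
    "read1": ("n1", 0.25),
    "ref1": ("n1", 0.375),
    "read2": ("root", 0.125),
}
reads = ["read1", "read2"]
pd_matrix = {
    (a, b): tree_path_distance(a, b, parent) for a, b in combinations(reads, 2)
}
print(pd_matrix)  # {('read1', 'read2'): 0.875}
```

A matrix like `pd_matrix` would then be handed to a hierarchical clustering step in place of PID distances, which is the role MOTHUR plays in the workflow quoted above.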
PhylOTU Clusters Recapitulate PID Clusters
We sought to identify how PD-based clustering compares to commonly employed PID-based clustering methods by applying the two methods to the same set of sequences. Both PID-based clustering and PhylOTU may be used to identify OTUs from overlapping sequences. Therefore we applied both methods to a dataset of 508 full-length bacterial SSU-rRNA sequences (reference sequences; see above) obtained from the Ribosomal Database Project (RDP) [25]. Recent work has demonstrated that PID is more accurately calculated from pairwise alignments than multiple sequence alignments [32–33], so we used ESPRIT, which implements pairwise alignments, to obtain a PID distance matrix for the reference sequences [32]. We used PhylOTU to compute a PD distance matrix for the same data. Then, we used MOTHUR to hierarchically cluster sequences into OTUs based on both PID and PD. For each of the two distance matrices, we employed a range of clustering thresholds and three different definitions of linkage in the hierarchical clustering algorithm: nearest-neighbor, average, and furthest-neighbor.
To statistically evaluate the similarity of cluster composition between each pair of clustering results, we used two summary statistics that together capture the frequency with which sequences are co-clustered in both analyses: true conjunction rate (i.e., the proportion of pairs of sequences derived from the same cluster in the first analysis that also are clustered together in the second analysis) and true disjunction rate (i.e., the proportion of pairs of sequences derived from different clusters in the first analysis that also are not clustered together in the second analysis) (see Methods
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001
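The two summary statistics defined in the text above, true conjunction rate and true disjunction rate, follow directly from their definitions. The sketch below assumes each clustering is given as a mapping from sequence name to cluster label; the function name and the toy clusterings are hypothetical.

```python
from itertools import combinations

def conjunction_disjunction_rates(first, second):
    """Compare two clusterings given as dicts of sequence name -> cluster label."""
    together = kept_together = apart = kept_apart = 0
    for a, b in combinations(sorted(first), 2):
        same_first = first[a] == first[b]
        same_second = second[a] == second[b]
        if same_first:
            together += 1
            kept_together += same_second
        else:
            apart += 1
            kept_apart += not same_second
    conjunction = kept_together / together if together else float("nan")
    disjunction = kept_apart / apart if apart else float("nan")
    return conjunction, disjunction

# Hypothetical clusterings of four sequences.
first = {"s1": "A", "s2": "A", "s3": "A", "s4": "B"}
second = {"s1": "x", "s2": "x", "s3": "y", "s4": "y"}
tcr, tdr = conjunction_disjunction_rates(first, second)
# Of 3 co-clustered pairs, 1 stays together; of 3 separated pairs, 2 stay apart.
print(tcr, tdr)
```

Both rates are asymmetric: swapping `first` and `second` generally changes them, since the first clustering defines which pairs count toward each denominator.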
AMPHORA ALL
• Build reference tree with concatenated alignment
• Align reads that match any of the HMMs to concatenated alignment
• Place reads into reference tree one at a time
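The concatenation step behind the list above can be sketched as follows: per-marker alignments are joined end to end, and any taxon or read lacking a marker is padded with gaps, so a read that matches a single HMM still lines up against the full concatenated alignment. The function name, data layout, and toy sequences are hypothetical; tree building and read placement themselves are left to dedicated software.

```python
def concatenate_alignments(marker_alignments, marker_order):
    """Join per-marker alignments; pad missing markers with gap characters.

    marker_alignments: marker name -> {taxon or read name: aligned sequence}.
    marker_order: the order in which markers are concatenated.
    """
    taxa = sorted({name for aln in marker_alignments.values() for name in aln})
    parts = {name: [] for name in taxa}
    for marker in marker_order:
        alignment = marker_alignments[marker]
        width = len(next(iter(alignment.values())))
        for name in taxa:
            parts[name].append(alignment.get(name, "-" * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}

# Hypothetical toy data: read1 matched only the rpoB model.
marker_alignments = {
    "rpoB": {"genomeA": "MKV", "genomeB": "MRV", "read1": "-K-"},
    "recA": {"genomeA": "AID", "genomeB": "AVD"},
}
concatenated = concatenate_alignments(marker_alignments, ["rpoB", "recA"])
print(concatenated["read1"])  # -K----
```

Every row of the result has the same length, which is what lets reads from different markers be placed into one reference tree.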
Phylogenomics Future 2
We have still only scratched the surface of microbial diversity
rRNA Tree of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science 276:734-740
Archaea
Eukaryotes
Bacteria
Phylogenetic Diversity: Genomes
From Wu et al. 2009 Nature 462, 1056-1060
Phylogenetic Diversity with GEBA
From Wu et al. 2009 Nature 462, 1056-1060
Phylogenetic Diversity: Isolates
From Wu et al. 2009 Nature 462, 1056-1060
Phylogenetic Diversity: All
From Wu et al. 2009 Nature 462, 1056-1060
Uncultured Lineages:
• Get into culture
• Enrichment cultures
• If abundant in low diversity ecosystems
• Flow sorting
• Microbeads
• Microfluidic sorting
• Single cell amplification
GEBA uncultured
Number of SAGs from Candidate Phyla
Site                                    OD1   OP11   OP3   SAR406
Site A: Hydrothermal vent               4     1      -     -
Site B: Gold Mine                       6     13     2     -
Site C: Tropical gyres (Mesopelagic)    -     -      -     2
Site D: Tropical gyres (Photic zone)    1     -      -     -
Sample collections at 4 additional sites are underway.
Phil Hugenholtz
Phylogenomics Future 3
Need Experiments from Across the Tree of Life too
[Figure: phylogenetic tree of bacterial phyla (tip labels from Acidobacteria to Coprothermobacter)]
• At least 40 phyla of bacteria
As of 2002
Based on Hugenholtz, 2002
[Figure: phylogenetic tree of bacterial phyla (tip labels from Acidobacteria to Coprothermobacter)]
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
As of 2002
Based on Hugenholtz, 2002
[Figure: phylogenetic tree of bacterial phyla (tip labels from Acidobacteria to Coprothermobacter)]
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
• Some studies in other phyla
As of 2002
Based on Hugenholtz, 2002
[Figure: phylogenetic tree of bacterial phyla (tip labels from Acidobacteria to Coprothermobacter)]
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
As of 2002
Based on Hugenholtz, 2002
[Figure: phylogenetic tree of bacterial phyla (tip labels from Acidobacteria to Coprothermobacter)]
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Viruses
As of 2002
Based on Hugenholtz, 2002
[Figure: phylogenetic tree of bacterial phyla (tip labels from Acidobacteria to Coprothermobacter); scale bar 0.1]
Tree based on Hugenholtz (2002) with some modifications.
Need experimental studies from across the tree too
[Figure: phylogenetic tree of bacterial phyla (tip labels from Acidobacteria to Coprothermobacter); scale bar 0.1]
Tree based on Hugenholtz (2002) with some modifications.
Adopt a Microbe
Conclusion
• Phylogenetic sampling of genomes improves our understanding of microbial diversity in many ways
• Still need
– More biogeography
– More phenotypic/experimental data
– Deeper phylogenetic sampling
MICROBES
A Happy Tree of Life
Acknowledgements
• GEBA: DOE-JGI, DSMZ
• GWSS: Nancy Moran & lab, Dongying Wu
• iSEEM: Katie Pollard, Jessica Green, Martin Wu
• RecA: Dongying Wu, Craig Venter, Doug Rusch, et al.