phylogeny-driven approaches to microbial & microbiome studies: talk by jonathan eisen at ucsb...

Phylogeny-Driven Approaches to Studies of Microbial and Microbiome Diversity Jonathan A. Eisen University of California, Davis @phylogenomics February 7, 2015 UCSB EEMB Graduate Student Symposium

Upload: jonathan-eisen

Post on 15-Jul-2015


Category:

Science



TRANSCRIPT

Page 1: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Phylogeny-Driven Approaches to Studies of Microbial and Microbiome Diversity

Jonathan A. Eisen
University of California, Davis

@phylogenomics

February 7, 2015
UCSB EEMB Graduate Student Symposium

Page 2: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Some Lessons I Think I Have Learned

Page 3: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Lesson 1: Go With Your Obsessions

Page 4: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Open Science

Page 5: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Open Science

X

Page 6: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Social Media & Science

Page 7: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Social Media & Science

X

Page 8: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

• RedSox

RedSox

Page 9: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

• RedSox

RedSox

X

Page 10: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Microbial Evolution

Page 11: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Microbial Evolution

Lesson 2: History Matters

Page 12: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Microbial Evolution

Lesson 2: History (of species, genes, people, science) Matters

Page 13: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Example I: Lost in Graduate School?

Page 14: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Lost in Graduate School?

Get A Map

Page 15: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Tree from Woese. 1987. Microbiological Reviews 51:221

Map for Graduate School

Carl Woese

Page 16: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Limited Sampling of RRR Studies

Tree from Woese. 1987. Microbiological Reviews 51:221

Page 17: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

My Study Organisms

Tree from Woese. 1987. Microbiological Reviews 51:221

Page 18: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

H. volcanii Excision Repair

[Figure: H. volcanii UV Repair Label 7 - 45 J/m2. X axis: Avg. Mol. Wt. (Base Pairs), 0 to 18000. Y axis: 0 to 0.6. Series: 45 J/m2 Dark 24 Hours; 45 J/m2 Photoreac.; 45 J/m2 t0; 0 J/m2 t0.]

[Image: By Grombo - from Wikipedia]

[Figure: UV Survival E. coli vs H. volcanii. X axis: UV J/m2, 0 to 400. Y axis: Relative Survival, 1E-07 to 1. Series: H. volcanii WFD11; E. coli NR10125 mfd+; E. coli NR10121 mfd-.]

From Eisen 1998. PhD Thesis.

Page 19: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Tree from Woese. 1987. Microbiological Reviews 51:221

Map for Graduate School

Lesson 3: Go Fishing Where Nobody Else Has

Page 20: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Example II: Rice Microbiomes and Phylogeny

Joseph Edwards
@Bulk_Soil

Sundar
@sundarlab

Cameron Johnson

Srijak Bhatnagar
@srijakbhatnagar

Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS

Supplementary Figures

Fig. S1. Map depicting soil collection locations for greenhouse experiment.

Fig. S2. Sampling and collection of the rhizocompartments. Roots are collected from rice plants and soil is shaken off the roots to leave ~1 mm of soil around the roots. The ~1 mm of soil

Page 21: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

PCR and phylogenetic analysis of rRNA genes

Workflow: DNA extraction → PCR → Sequence rRNA genes → Sequence alignment = Data matrix → Phylogenetic tree

PCR: Makes lots of copies of the rRNA genes in sample

rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2 5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’

Data matrix (rows recoverable from the slide):
rRNA3 C A C T G T
rRNA4 C A C A G T
Yeast T A C A G T

[Figure: data matrix and phylogeny relating rRNA1, rRNA2, rRNA3, rRNA4, E. coli, Humans, and Yeast.]
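As a rough illustration of how an alignment becomes a data matrix that a tree can be built from, the sketch below counts pairwise mismatches between the three rows that are legible on this slide. The function name `mismatches` is hypothetical, and a plain mismatch count is only a stand-in for the distance or likelihood models a real phylogenetic program would use.

```python
from itertools import combinations

# Rows of the data matrix as they appear on the slide (six aligned columns each).
alignment = {
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "Yeast": "TACAGT",
}


def mismatches(a: str, b: str) -> int:
    """Count aligned columns at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


# Pairwise mismatch counts, keyed by (name, name) in insertion order.
distances = {
    (p, q): mismatches(alignment[p], alignment[q])
    for p, q in combinations(alignment, 2)
}

for pair, d in distances.items():
    print(pair, d)
```

In this toy matrix rRNA3 and rRNA4 differ at a single column, which is consistent with the slide grouping them together.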

Page 22: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

STAP

An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP)

Dongying Wu1*, Amber Hartman1,6, Naomi Ward4,5, Jonathan A. Eisen1,2,3

1 UC Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 Section of Evolution and Ecology, College of Biological Sciences, University of California Davis, Davis, California, United States of America, 3 Department of Medical Microbiology and Immunology, School of Medicine, University of California Davis, Davis, California, United States of America, 4 Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, United States of America, 5 Center of Marine Biotechnology, Baltimore, Maryland, United States of America, 6 The Johns Hopkins University, Department of Biology, Baltimore, Maryland, United States of America

Abstract

Comparative analysis of small-subunit ribosomal RNA (ss-rRNA) gene sequences forms the basis for much of what we know about the phylogenetic diversity of both cultured and uncultured microorganisms. As sequencing costs continue to decline and throughput increases, sequences of ss-rRNA genes are being obtained at an ever-increasing rate. This increasing flow of data has opened many new windows into microbial diversity and evolution, and at the same time has created significant methodological challenges. Those processes which commonly require time-consuming human intervention, such as the preparation of multiple sequence alignments, simply cannot keep up with the flood of incoming data. Fully automated methods of analysis are needed. Notably, existing automated methods avoid one or more steps that, though computationally costly or difficult, we consider to be important. In particular, we regard both the building of multiple sequence alignments and the performance of high quality phylogenetic analysis to be necessary. We describe here our fully-automated ss-rRNA taxonomy and alignment pipeline (STAP). It generates both high-quality multiple sequence alignments and phylogenetic trees, and thus can be used for multiple purposes including phylogenetically-based taxonomic assignments and analysis of species diversity in environmental samples. The pipeline combines publicly-available packages (PHYML, BLASTN and CLUSTALW) with our automatic alignment, masking, and tree-parsing programs. Most importantly, this automated process yields results comparable to those achievable by manual analysis, yet offers speed and capacity that are unattainable by manual efforts.

Citation: Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS ONE 3(7): e2566. doi:10.1371/journal.pone.0002566

Editor: Jean-Nicolas Volff, Ecole Normale Superieure de Lyon, France

Received January 31, 2008; Accepted May 26, 2008; Published July 2, 2008

Copyright: © 2008 Wu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The National Science Foundation "Assembling the Tree of Life" Grant No. 0228651. The final work on this project was funded by the Gordon and Betty Moore Foundation (grant #1660 to Jonathan Eisen).

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

ss-rRNA gene sequence analysis as a tool for microbial systematics and ecology

Phylogenetic analysis of rRNA gene sequences (particularly ss-rRNA, i.e., the small subunit rRNA) has led to important advances in microbiology, such as the discovery of a third branch on the tree of life (the archaea) [1] and the realization that the microbes that can be grown in pure culture represent but a small fraction, in terms of both phylogenetic types and total numbers of cells, of the microbes found in nature [2]. The power of ss-rRNA for phylogenetic analysis can be attributed to many factors, including its presence in all cellular organisms, its favorable patterns of sequence conservation that enable study of both recent and ancient evolutionary events, and the ease with which this gene can be cloned and sequenced from new organisms [3]. The sequencing of ss-rRNA genes from new species is greatly facilitated by the presence of highly conserved regions at several positions along the gene [4]. The conservation of these regions allows one to design and use broadly targeted oligonucleotide primers that work on a wide diversity of species for both sequencing and amplification by the polymerase chain reaction (PCR). In fact, it is now standard procedure to sequence the ss-rRNA gene when a new microbe has been isolated [5,6].

The ss-rRNA gene has become a key target for environmental microbiology studies largely because, through the use of broadly targeted primers, one can use PCR to amplify in a single reaction the ss-rRNA genes from a wide diversity of organisms present in an environmental sample [7,8]. The amplified products can then be characterized in multiple ways such as through restriction digestion [9,10], denaturing gradient gel electrophoresis [11], hybridization to arrays [12], or sequencing. As sequencing continues to decrease in cost and difficulty, we believe it will become the preferred option and thus we focus on sequence analysis here.

Once DNA sequences of environmental ss-rRNA genes are in hand, multiple types of analyses can be used to characterize the organisms and communities from which they were obtained. For example, phylogenetic analysis of the sequences can reveal what types of microbial organisms are present in a sample. In addition, very closely related ss-rRNA sequences can be grouped together into phylotypes or operational taxonomic units (OTUs), groupings which

PLoS ONE | www.plosone.org 1 July 2008 | Volume 3 | Issue 7 | e2566

multiple alignment and phylogeny was deemed unfeasible. However, we believe this can compromise the value of the results.

For example, the delineation of OTUs has also been automated via tools that do not make use of alignments or phylogenetic trees (e.g., Greengenes). This is usually done by carrying out pairwise comparisons of sequences and then clustering sequences that have better than some cutoff threshold of similarity with each other. This approach can be powerful (and reasonably efficient) but it too has limitations. In particular, since multiple sequence alignments are not used, one cannot carry out standard phylogenetic analyses. In addition, without multiple sequence alignments one might end up comparing and contrasting different regions of a sequence depending on what it is paired with.
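The threshold-based OTU clustering described above can be sketched in a few lines. This is a toy under stated assumptions, not how any of the named tools works: the sequences are assumed to be equal length so that identity can be computed position by position (real tools compute pairwise alignments first), the linkage is single-linkage via union-find, and the names `percent_identity` and `cluster_otus` are hypothetical.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of positions that match; assumes equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(seqs: dict, cutoff: float = 97.0) -> list:
    """Single-linkage clustering: join any pair at or above the cutoff."""
    parent = {name: name for name in seqs}

    def find(name):
        # Walk to the cluster representative, compressing the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Toy reads: s1 and s2 are 90% identical; s3 is far from both.
toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAA", "s3": "TTTTACGTAA"}
otus = cluster_otus(toy, cutoff=90.0)
print(otus)
```

Note that nothing here yields a multiple sequence alignment or a tree, which is the limitation the passage points out.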

The limitations of avoiding multiple sequence alignments and phylogenetic analysis are readily apparent in tools to classify sequences. For example, the Ribosomal Database Project’s Classifier program [29] focuses on composition characteristics of each sequence (e.g., oligonucleotide frequency) and assigns taxonomy based upon clustering genes by their composition. Though this is fast and completely automatable, it can be misled in cases where distantly related sequences have converged on similar composition, something known to be a major problem in ss-rRNA sequences [30]. Other taxonomy assignment systems focus primarily on the similarity of sequences. The simplest of these is to use BLASTN to search a sequence database (e.g., Genbank) and to then use information about the top match to assign some sort of taxonomy information to new sequences. Such similarity-based approaches are analogous to using top BLAST matches to predict the functions of genes and have similar limitations. Though fast, such approaches are not ideal because the most similar sequence may not in fact be the most closely related sequence due to the vagaries of evolution such as unequal rates of change in different lineages or convergent evolution [31–35].

Despite the clear advantages of using multiple sequence alignments and phylogenetic analyses for many aspects of ss-rRNA analyses, there are only a few examples of attempts to generate these outputs in a highly or completely automated manner. The most comprehensive tool we are aware of is the BIBI software package [36], which takes new sequences, identifies similar sequences in a database using BLASTN, generates a new multiple sequence alignment, and then produces phylogenetic trees from the alignment. Users can then view the trees to make taxonomic assignments based upon phylogenetic position of query sequences relative to known ones. Though BIBI is a quantum leap more advanced than most available similarity-based classification tools, it does have some limitations. For example, the generation of new alignments for each sequence is computationally costly and does not take advantage of available curated alignments that make use of ss-rRNA secondary structure to guide the primary sequence alignment. Perhaps most importantly, however, the tool is not fully automated. In addition, it does not generate multiple sequence alignments for all sequences in a dataset, which would be necessary for doing many analyses.

Automated methods for analyzing rRNA sequences are also available at the web sites for multiple rRNA-centric databases, such as Greengenes and the Ribosomal Database Project (RDPII). Though these and other web sites offer diverse powerful tools, they do have some limitations. For example, not all provide multiple sequence alignments as output and few use phylogenetic approaches for taxonomy assignments or other analyses. More importantly, all provide only web-based interfaces and their integrated software (e.g., alignment and taxonomy assignment) cannot be locally installed by the user. Therefore, the user cannot take advantage of the speed and computing power of parallel processing such as is available on Linux clusters, or locally alter and potentially tailor these programs to their individual computing needs (Table 1).

Given the limited automated tools that are available, researchers have had to choose between two non-ideal options: manually generating and/or curating alignments (an expensive and slow process which can handle only a limited number of sequences) or using the non-phylogenetic and non-alignment based methods that can be automated more readily.

We describe here the development of a fully-automated, high-throughput method that meets many of the key requirements of ss-rRNA sequence analysis. First, this method generates high quality multiple sequence alignments that can be used for phylogenetic reconstructions as well as for diversity measures such as the identification of OTUs. Secondly, the method generates a phylogenetic tree for each query sequence and assigns that sequence to a taxonomic group based upon its position in the tree relative to other known sequences. The alignments and phylogenetic tree outputs of this program can be used for input into a variety of other software tools such as DOTUR (for identifying OTUs) and UNIFRAC (for phylogenetic based community comparisons) [26,37]. We refer to this method as STAP: a Small Subunit rRNA Taxonomy and Alignment Pipeline.

A key advantage of STAP is that it is the only fully automated method available that can be locally installed by the user and is

Table 1. Comparison of STAP’s computational abilities relative to existing commonly-used ss-rRNA analysis tools.

Feature | STAP | ARB | Greengenes | RDP
Installed where? | Locally | Locally | Web only | Web only
User interface | Command line | GUI | Web portal | Web portal
Parallel processing | YES | NO | NO | NO
Manual curation for taxonomy assignment | NO | YES | NO | NO
Manual curation for alignment | NO | YES | NO* | NO
Open source | YES** | NO | NO | NO
Processing speed | Fast | Slow | Medium | Medium

It is important to note that STAP is the only software that runs on the command line and can take advantage of parallel processing on Linux clusters and, further, is more amenable to downstream code manipulation.
*Note: Greengenes alignment output is compatible with upload into ARB and downstream manual alignment.
**The STAP program itself is open source; the programs it depends on are freely available but not open source.
doi:10.1371/journal.pone.0002566.t001

ss-rRNA Taxonomy Pipeline


STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment algorithm, the alignments from the STAP database remain intact, while gaps are inserted and nucleotides are trimmed for the query sequence according to the profile defined by the previous alignments from the databases. Thus the accuracy and quality of the alignment generated at this step depends heavily on the quality of the Bacterial/Archaeal ss-rRNA alignments from the Greengenes project or the Eukaryotic ss-rRNA alignments from the RDPII project.

Phylogenetic analysis using multiple sequence alignments rests on the assumption that the residues (nucleotides or amino acids) at the same position in every sequence in the alignment are homologous. Thus, columns in the alignment for which "positional homology" cannot be robustly determined must be excluded from subsequent analyses. This process of evaluating homology and eliminating questionable columns, known as masking, typically requires time-consuming, skillful, human intervention. We designed an automated masking method for ss-rRNA alignments, thus eliminating this bottleneck in high-throughput processing.

First, an alignment score is calculated for each aligned column by a method similar to that used in the CLUSTALX package [42]. Specifically, an R-dimensional sequence space representing all the possible nucleotide character states is defined. Then for each aligned column, the nucleotide populating that column in each of the aligned sequences is assigned a score in each of the R dimensions (Sr) according to the IUB matrix [42]. The consensus "nucleotide" for each column (X) also has R dimensions, with the score for each dimension (Xr) calculated as the average of the scores for that column in that dimension (average of Sr). Thus the score of the consensus nucleotide is a mathematical expression describing the average "nucleotide" in that column for that alignment.
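The consensus calculation in the paragraph above can be made concrete with a small sketch. It is an illustration only and rests on assumptions not taken from the paper: R is fixed at 4 (A, C, G, T), each nucleotide gets a one-hot score vector as a stand-in for the IUB matrix scores, and the per-column spread measure `mean_distance_to_consensus` is a hypothetical way to use the consensus, not the published scoring formula.

```python
import math

NUCLEOTIDES = "ACGT"  # R = 4 dimensions in this sketch (assumption)


def encode(base: str) -> list:
    """One-hot score vector S; a stand-in for IUB matrix scores (assumption).

    Any character outside ACGT, such as a gap, maps to the zero vector.
    """
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]


def consensus(column: str) -> list:
    """X_r = average of S_r over all sequences in the column."""
    vectors = [encode(b) for b in column]
    return [sum(v[r] for v in vectors) / len(vectors)
            for r in range(len(NUCLEOTIDES))]


def mean_distance_to_consensus(column: str) -> float:
    """Hypothetical column score: mean Euclidean distance from each S to X."""
    x = consensus(column)
    return sum(math.dist(encode(b), x) for b in column) / len(column)


conserved = "AAAA"  # every sequence has A in this column
variable = "AACC"   # column split between A and C
print(consensus(conserved), mean_distance_to_consensus(conserved))
print(consensus(variable), mean_distance_to_consensus(variable))
```

A perfectly conserved column sits exactly on its consensus, while a split column does not; a masking step could then drop columns whose spread exceeds some chosen threshold.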

Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002

Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001


Dongying Wu

Amber Hartman

Naomi Ward

Page 23: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

WATERS

Hartman et al. BMC Bioinformatics 2010, 11:317
http://www.biomedcentral.com/1471-2105/11/317

sequence rDNA (the genes for ribosomal RNA), in particular those for small-subunit ribosomal RNA (ss-rRNA). These studies revealed a large amount of previously undetected microbial diversity [1,11-13]. Researchers focused on the small subunit rRNA gene not only because of the ease with which it can be PCR amplified, but also because it has variable and highly conserved regions, it is thought to be universally distributed among all living organisms, and it is useful for inferring phylogenetic relationships [14,15]. Since then, "cultivation-independent technologies" have brought a revolution to the field of microbiology by allowing scientists to study a wide and complex amount of diversity in many different habitats and environments [16-18]. The general premise of these methods remains relatively unchanged from the initial experiments two decades ago and relies on straightforward molecular biology techniques and bioinformatics tools from ecology, evolutionary biology and DNA sequencing projects.

Briefly, the lab work involved in 16S rDNA surveys begins with environmental samples (e.g., soil or water) from which total genomic DNA is extracted. Next, the 16S rDNA is PCR-amplified with pan-bacterial or pan-archaeal primers (i.e., primers designed to amplify as many known bacteria or archaea as possible), cloned into a sequencing vector, and then sequenced (or directly sequenced without cloning in next generation sequencing), resulting in large collections of diverse microbial 16S rDNA sequences from these different samples. As sequencing costs have continually declined, environmental microbiology surveys have expanded correspondingly and 16S rDNA datasets have grown increasingly complex.

The size and complexity of data sets introduce a new challenge - analyses that one could carry out manually on small data sets now must be aided or run entirely on computers. And those analyses that previously were carried out computationally now must be made more efficient to have any hope of being completed in a timely manner [7,19].

How then is the microbial community sequencing data converted from reads off a sequencing machine to bar graphs, network diagrams, and biological conclusions? Fortunately, even as data sets have expanded, most researchers analyzing rDNA sequence data sets, even when they are very large, have a similar set of goals in their analysis. For example, most studies are interested in assigning a microbial identity to the 16S rDNA sequences and determining the proportion of these organisms in each sequence collection. And to achieve these (and related goals), a similar set of steps are used (Fig. 1) including aligning the rDNA sequences in a dataset to each other so that they are comparable, removing chimeric sequences generated during PCR, identifying closely related sets of sequences (also known as operational taxonomic units or OTUs), removing redundant sequences above a certain percent identity cutoff, assigning putative taxonomic identifiers to each sequence or representative of a group, inferring a phylogenetic tree of the sequences, and comparing the phylogenetic structure of different samples to each other and to the larger bacterial or archaeal tree of life.

Over the last few years, a large number of software tools and web applications have become available to carry out each of the above steps (e.g., [20,21] for chimera checking, [22] for phylogenetic comparisons, STAP for taxonomy assignments). In practice, even as new software became available, researchers still have to act as the drivers of the workflow. At each step in this process, different types of software must be chosen and employed, each with distinct data formatting requirements, invocation methods, and each associated with a variety of post-analysis steps that may be selected and applied. Even after all of these steps have been completed, a wide variety of statistical and visualization tools are applied to these results to interpret and represent these data. In this context, there is a clear need for tools that will run a comprehensive set of analyses all linked together into one system. Very recently, two such systems have been released - mothur and QIIME. WATERS is our effort in this regard with some key differences compared to mothur and QIIME.

Figure 1 Overview of WATERS. Schema of WATERS where white boxes indicate "behind the scenes" analyses that are performed in WATERS. Quality control files are generated for white boxes, but not otherwise routinely analyzed. Black arrows indicate that metadata (e.g., sample type) has been overlaid on the data for downstream interpretation. Colored boxes indicate different types of results files that are generated for the user for further use and biological interpretation. Colors indicate different types of WATERS actors from Fig. 2 which were used: green, Diversity metrics, WriteGraphCoordinates, Diversity graphs; blue, Taxonomy, BuildTree, Rename Trees, Save Trees; CreateUnifrac; yellow, CreateOtuTable, CreateCytoscape, CreateOTUFile; white, remaining unnamed actors.

[Figure 1 boxes: Align; Check chimeras; Cluster; Build Tree; Assign Taxonomy; Tree w/ Taxonomy; Diversity statistics & graphs; Unifrac files; Cytoscape network; OTU table]

Hartman et al 2010. W.A.T.E.R.S.: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences. BMC Bioinformatics 2010, 11:317 doi:10.1186/1471-2105-11-317


default is 97% and 99%), and they are also generated for every metadata variable comparison that the user includes.

Data pruning

To assist in troubleshooting and quality control, WATERS returns to the user three fasta files of sequences that were removed at various steps in the workflow. A short_sequences.fas file is created that contains all

Figure 3 Biologically similar results automatically produced by WATERS on published colonic microbiota samples. (A) Rarefaction curves similar to curves shown in Eckburg et al. Fig. 2; 70-72, indicate patient numbers, i.e., 3 different individuals. (B) Weighted Unifrac analysis based on phylogenetic tree and OTU data produced by WATERS very similar to Eckburg et al. Fig. 3B. (C) Neighbor-joining phylogenetic tree (Quicktree) representing the sequences analyzed by WATERS, which is clearly similar to Fig. S1 in Eckburg et al.


Amber Hartman

Bertram Ludaescher

Page 24: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

alignment used to build the profile, resulting in a multiple sequence alignment of full-length reference sequences and metagenomic reads. The final step of the alignment process is a quality control filter that 1) ensures that only homologous SSU-rRNA sequences from the appropriate phylogenetic domain are included in the final alignment, and 2) masks highly gapped alignment columns (see Text S1).

We use this high quality alignment of metagenomic reads and reference sequences to construct a fully-resolved, phylogenetic tree and hence determine the evolutionary relationships between the reads. Reference sequences are included in this stage of the analysis to guide the phylogenetic assignment of the relatively short metagenomic reads. While the software can be easily extended to incorporate a number of different phylogenetic tools capable of analyzing metagenomic data (e.g., RAxML [27], pplacer [28], etc.), PhylOTU currently employs FastTree as a default method due to its relatively high speed-to-performance ratio and its ability to construct accurate trees in the presence of highly-gapped data [29]. After construction of the phylogeny, lineages representing reference sequences are pruned from the tree. The resulting phylogeny of metagenomic reads is then used to compute a PD distance matrix in which the distance between a pair of reads is defined as the total tree path distance (i.e., branch length) separating the two reads [30]. This tree-based distance matrix is subsequently used to hierarchically cluster metagenomic reads via MOTHUR into OTUs in a fashion similar to traditional PID-based analysis [31]. As with PID clustering, the hierarchical algorithm can be tuned to produce finer or coarser clusters, corresponding to different taxonomic levels, by adjusting the clustering threshold and linkage method.

To evaluate the performance of PhylOTU, we employed statistical comparisons of distance matrices and clustering results for a variety of data sets. These investigations aimed 1) to compare PD versus PID clustering, 2) to explore overlap between PhylOTU clusters and recognized taxonomic designations, and 3) to quantify the accuracy of PhylOTU clusters from shotgun reads relative to those obtained from full-length sequences.
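The tree-based distance and clustering steps described above can be illustrated with a minimal Python sketch. It is not PhylOTU or MOTHUR code: the toy tree, the read names, and the helper functions `tree_distance` and `cluster` are all hypothetical, and only nearest-neighbor linkage is shown.

```python
from itertools import combinations

# Hypothetical toy tree: child -> (parent, branch length). "root" has no entry.
TREE = {
    "n1": ("root", 0.02),
    "n2": ("root", 0.30),
    "readA": ("n1", 0.01),
    "readB": ("n1", 0.02),
    "readC": ("n2", 0.01),
    "readD": ("n2", 0.03),
}


def path_to_root(node):
    """Map each ancestor of node (and node itself) to its distance from node."""
    dist, out = 0.0, {node: 0.0}
    while node in TREE:
        parent, length = TREE[node]
        dist += length
        out[parent] = dist
        node = parent
    return out


def tree_distance(a, b):
    """Total branch length on the tree path separating leaves a and b."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    # The path runs through the lowest common ancestor, which minimizes the sum.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def cluster(leaves, threshold):
    """Nearest-neighbor (single-linkage) clustering at a distance threshold."""
    clusters = [{leaf} for leaf in leaves]
    merged = True
    while merged:
        merged = False
        for c1, c2 in combinations(clusters, 2):
            if any(tree_distance(a, b) <= threshold for a in c1 for b in c2):
                clusters.remove(c1)
                clusters.remove(c2)
                clusters.append(c1 | c2)
                merged = True
                break  # restart the scan with the updated cluster list
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    print(cluster(["readA", "readB", "readC", "readD"], threshold=0.05))
```

Raising the threshold merges more reads into each cluster, which mirrors the tuning toward coarser taxonomic levels mentioned in the text.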

PhylOTU Clusters Recapitulate PID Clusters

We sought to identify how PD-based clustering compares to commonly employed PID-based clustering methods by applying the two methods to the same set of sequences. Both PID-based clustering and PhylOTU may be used to identify OTUs from overlapping sequences. Therefore we applied both methods to a dataset of 508 full-length bacterial SSU-rRNA sequences (reference sequences; see above) obtained from the Ribosomal Database Project (RDP) [25]. Recent work has demonstrated that PID is more accurately calculated from pairwise alignments than multiple sequence alignments [32–33], so we used ESPRIT, which implements pairwise alignments, to obtain a PID distance matrix for the reference sequences [32]. We used PhylOTU to compute a PD distance matrix for the same data. Then, we used MOTHUR to hierarchically cluster sequences into OTUs based on both PID and PD. For each of the two distance matrices, we employed a range of clustering thresholds and three different definitions of linkage in the hierarchical clustering algorithm: nearest-neighbor, average, and furthest-neighbor.

To statistically evaluate the similarity of cluster composition between each pair of clustering results, we used two summary statistics that together capture the frequency with which sequences are co-clustered in both analyses: true conjunction rate (i.e., the proportion of pairs of sequences derived from the same cluster in the first analysis that also are clustered together in the second analysis) and true disjunction rate (i.e., the proportion of pairs of sequences derived from different clusters in the first analysis that also are not clustered together in the second analysis) (see Methods

Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001
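The true conjunction and true disjunction rates defined in the excerpt above can be computed directly from two cluster assignments. The following Python sketch is one reading of those definitions; the function name `co_clustering_rates` and the example labels are hypothetical, not taken from the paper's Methods.

```python
from itertools import combinations


def co_clustering_rates(first, second):
    """Return (true conjunction rate, true disjunction rate).

    first, second: dicts mapping the same sequence ids to cluster labels
    in the first and second analysis, respectively.
    """
    same_first = same_both = diff_first = diff_both = 0
    for a, b in combinations(sorted(first), 2):
        together_first = first[a] == first[b]
        together_second = second[a] == second[b]
        if together_first:
            same_first += 1
            same_both += together_second  # pair stays together
        else:
            diff_first += 1
            diff_both += not together_second  # pair stays apart
    conjunction = same_both / same_first if same_first else float("nan")
    disjunction = diff_both / diff_first if diff_first else float("nan")
    return conjunction, disjunction


if __name__ == "__main__":
    pid = {"s1": "A", "s2": "A", "s3": "A", "s4": "B"}
    pd = {"s1": "X", "s2": "X", "s3": "Y", "s4": "Y"}
    print(co_clustering_rates(pid, pd))
```

Note that the rates are not symmetric: swapping which analysis counts as "first" generally changes both values.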


Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061

PhylOTU

Tom Sharpton @tjsharpton

Page 25: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

QIIME Phylotyping and Phylogenetic Ecology

Fig. S6. A set of 96 OTUs mainly consisting of Proteobacteria is enriched across every compartment in the greenhouse experiment. (A) Number of OTUs and the phyla and classes they belong to that are enriched across all rhizocompartments in the greenhouse experiment. (B) A subset of the Proteobacteria and the classes and families they belong to in the OTUs that are enriched across all rhizocompartments in the greenhouse.

https://evomics.org/2014/01/the-glories-of-the-gut-ask-a-fat-mouse/

Page 26: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Lesson 4: Accept When You Are Defeated

Page 27: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Rice Microbiome: Variation w/in Plant

Joseph Edwards

@Bulk_Soil

Sundar @sundarlab

Cameron Johnson

Srijak Bhatnagar

@srijakbhatnagar

To address some of these questions, we have undertaken an exhaustive characterization of the root-associated microbiome of rice. Rice is a major crop plant and a staple food for half of the world’s population. Metagenomic and proteomic approaches have been used to identify different microbial genes present in the rice microbiome (17, 18), but an extensive characterization of microbiome composition and variation has not been performed. Rice cultivation also contributes to global methane, accounting for an estimated 10–20% of anthropogenic emissions, due to the growth of methanogenic archaea in the vicinity of rice roots (19). Here we have used deep sequencing of microbial 16S rRNA genes to detect over 250,000 operational taxonomic units (OTUs), with a structural resolution of three distinct compartments (rhizosphere, rhizoplane, and endosphere) and extending over multiple factors contributing to variation, both under controlled greenhouse conditions as well as different field environments. The large datasets from the different conditions sampled in this study were used for identification of putative microbial consortia involved in processes such as methane cycling. Through dynamic studies of the microbiome composition, we provide insights into the process of root microbiome assembly.

Results

Root-Associated Microbiomes Form Three Spatially Separable Compartments Exhibiting Distinct and Overlapping Microbial Communities. Sterilized rice seeds were germinated and grown under controlled greenhouse conditions in soil collected from three rice fields across the Central Valley of California (SI Appendix, Fig. S1). We analyzed the bacterial and archaeal microbiomes from three separate rhizocompartments: the rhizosphere, rhizoplane, and endosphere (Fig. 1A). Because the root microbiome has been shown to correlate with the developmental stage of the plant (10), the root-associated microbial communities were sampled at 42 d (6 wk), when rice plants from all genotypes were well-established in the soil but still in their vegetative phase of growth. For our study, the rhizosphere compartment was composed of ∼1 mm of soil tightly adhering to the root surface that is not easily shaken from the root (SI Appendix, Fig. S2). The rhizoplane compartment microbiome was derived from the suite of microbes on the root surface that cannot be removed by washing in buffer but is removed by sonication (SI Appendix, Materials and Methods). The endosphere compartment microbiome, composed of the microbes inhabiting the interior of the root, was isolated from the same roots left after sonication. Unplanted soil pots were used as a control to differentiate plant effects from general edaphic factors.

The V4-V5 region of the 16S rRNA gene was amplified using PCR and sequenced using the Illumina MiSeq platform. A total of 10,554,651 high-quality sequences was obtained with a median read count per sample of 51,970 (range: 2,958–203,371; Dataset S2). The high-quality reads were clustered using >97% sequence identity into 101,112 microbial OTUs. Low-abundance OTUs (<5 total counts) were discarded, resulting in 27,147 OTUs. The resulting OTU counts in each library were normalized using the trimmed mean of M values method. This method was chosen due to its sensitivity for detecting differentially abundant taxa compared with traditional microbiome normalization techniques such as rarefaction and relative abundance (20). Measures of within-sample diversity (α-diversity) revealed a diversity gradient from the endosphere to the rhizosphere (Fig. 1B and Dataset S4). Endosphere communities had the lowest α-diversity and the rhizosphere had the highest α-diversity. The mean α-diversity was higher in the rhizosphere than bulk soil; however, the difference in α-diversity between these two compartments cannot be considered as statistically significant (Wilcoxon test; Dataset S4).

Unconstrained principal coordinate analyses (PCoAs) of weighted and unweighted UniFrac distances were performed to investigate patterns of separation between microbial communities (SI Appendix, Materials and Methods). The UniFrac distance is based on taxonomic relatedness, where the weighted UniFrac (WUF) metric takes abundance of taxa into consideration whereas the unweighted UniFrac (UUF) does not and is thus more sensitive to rare taxa. In both the WUF and UUF PCoAs, the rhizocompartments separate across the first principal coordinate, indicating that the largest source of variation in root-associated microbial communities is proximity to the root (Fig. 1C, WUF and SI Appendix, Fig. S4, UUF). Moreover, the pattern of separation is consistent with a gradient of microbial populations from the exterior of the root, across the rhizoplane, and into the interior of the root. Permutational multivariate analysis of variance (PERMANOVA) corroborates that rhizospheric compartmentalization comprises the largest source of variation within the microbiome data when using a WUF distance metric (46.62%, P < 0.001; Dataset S5A). PERMANOVA using a UUF distance, however, describes rhizospheric compartmentalization as having the second largest source of variation behind soil type (18.07%, P < 0.001; Dataset S5H). In addition to PERMANOVA, we also performed partial canonical analysis of principal coordinates (CAP) on both the WUF and UUF metrics to quantify the variance attributable to each experimental variable (SI Appendix, Materials and Methods). This technique differs from unconstrained PCoA in that technical factors can be controlled for in the analysis and the analysis can be constrained to any factor of interest to better understand the quantitative impact of the factor on the microbial composition. Using this technique to control for soil type, cultivar, and technical factors (biological replicate, sequencing batch, and planting container), we found that in agreement with the PERMANOVA results, microbial communities vary significantly between rhizocompartments (34.2% of variance, P = 0.005, WUF, SI Appendix, Fig. S5A and 22.6% of variance, P = 0.005, UUF, SI Appendix, Fig. S5C).

There are notable differences in the proportions of various phyla across the compartments that are consistent across every tested soil (Fig. 1D). The endosphere has a significantly greater proportion of Proteobacteria and Spirochaetes than the rhizosphere or bulk soil, whereas Acidobacteria, Planctomycetes, and Gemmatimonadetes are mostly depleted in the endosphere

Fig. 1. Root-associated microbial communities are separable by rhizocompartment and soil type. (A) A representation of a rice root cross-section depicting the locations of the microbial communities sampled. (B) Within-sample diversity (α-diversity) measurements between rhizospheric compartments indicate a decreasing gradient in microbial diversity from the rhizosphere to the endosphere independent of soil type. Estimated species richness was calculated as e^(Shannon entropy). The horizontal bars within boxes represent median. The tops and bottoms of boxes represent 75th and 25th quartiles, respectively. The upper and lower whiskers extend 1.5× the interquartile range from the upper edge and lower edge of the box, respectively. All outliers are plotted as individual points. (C) PCoA using the WUF metric indicates that the largest separation between microbial communities is spatial proximity to the root (PCo 1) and the second largest source of variation is soil type (PCo 2). (D) Histograms of phyla abundances in each compartment and soil. B, bulk soil; E, endosphere; P, rhizoplane; S, rhizosphere; Sac, Sacramento.
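The unweighted UniFrac (UUF) distance behind these ordinations is computed on a phylogenetic tree of the OTUs: it is the fraction of branch length that leads to OTUs found in only one of the two communities. The Python sketch below illustrates that idea on a hypothetical six-branch tree. The `BRANCHES` table and OTU names are invented for illustration, and this is not the implementation used in the paper.

```python
# Hypothetical toy tree: each branch is (length, set of OTUs descending from it).
BRANCHES = [
    (0.1, {"otu1"}),
    (0.1, {"otu2"}),
    (0.2, {"otu1", "otu2"}),
    (0.3, {"otu3"}),
    (0.3, {"otu4"}),
    (0.2, {"otu3", "otu4"}),
]


def unweighted_unifrac(sample_a, sample_b):
    """Unweighted UniFrac between two communities given as sets of OTU ids.

    Only presence/absence matters, so a rare OTU counts as much as an
    abundant one.
    """
    unique = observed = 0.0
    for length, descendants in BRANCHES:
        in_a = bool(descendants & sample_a)
        in_b = bool(descendants & sample_b)
        if in_a or in_b:
            observed += length  # branch leads to at least one community
            if in_a != in_b:
                unique += length  # branch leads to exactly one community
    return unique / observed if observed else 0.0


if __name__ == "__main__":
    rhizosphere = {"otu1", "otu2"}
    endosphere = {"otu1", "otu3"}
    print(round(unweighted_unifrac(rhizosphere, endosphere), 3))
```

The weighted variant (WUF) additionally weights each branch by the difference in relative abundance between the two communities, which is why it is less sensitive to rare taxa.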

2 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1414592112 Edwards et al.

Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS

Page 28: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Rice Genotype Affects Microbiome

using the UUF metric (26.7%, P = 0.005; SI Appendix, Fig. S5D). This discrepancy is likely due to differences between the WUF and UUF distance metrics: Soil type might have more of an effect on frequency of rare taxa than abundant taxa, and thus the UUF metric has a larger effect size for soil type. Compartments of plants grown in distinct soils have commonalities in differentially abundant OTUs (Dataset S9), sharing 92 endosphere-enriched OTUs, 71 rhizoplane-enriched OTUs, and 10 rhizosphere OTUs (SI Appendix, Fig. S8 J, I, and H, respectively, and SI Appendix, Fig. S9). In agreement with the PCoA analysis, Davis and Arbuckle shared a significant overlap in OTUs enriched in the endosphere and rhizoplane (P = 2.22 × 10^−16 and 7.86 × 10^−7, respectively, hypergeometric test; SI Appendix, Fig. S8 I and J) but not the rhizosphere (P = 0.52, hypergeometric test; SI Appendix, Fig. S8H). The Sacramento soil did not share significant overlaps in compartment-enriched OTUs with the other sites.

The enrichment/depletion effects within each rhizosphere compartment vary by soil. Rhizosphere compartments of plants in Davis and Arbuckle soils exhibited higher enrichment/depletion ratios (72/3 and 53/17, respectively) than plants in Sacramento soil (78/116) (SI Appendix, Fig. S8A). The level of enrichment is similar between each soil in the rhizosphere; however, the depletion level is higher in Sacramento soil than in Arbuckle or Davis. Chemical analysis of the soils showed that the nutrient compositions of the soils did not show any exceptional trends (Dataset S7). The Davis and Arbuckle fields were similar in pH and nitrate, magnesium, and phosphorus content, whereas the Arbuckle and Sacramento fields were similar in potassium, calcium, and iron content. Taken together, these results indicate that each soil contains a different pool of microbes and that the plant is not restricted to specific OTUs but instead draws from available OTUs in the pool to organize its microbiome. Nevertheless, the distribution of phyla across the different compartments was similar for all three soil types (Fig. 1D), suggesting that the overall recruitment of OTUs is governed by a set of factors that result in a consistent representation of phyla independent of soil type.

Microbial Communities in the Rhizocompartments Are Influenced by Rice Genotype. To investigate the relationship between rice genotype and the root microbiome, domesticated rice varieties cultivated in widely separated growing regions were tested. Six cultivated rice varieties spanning two species within the Oryza genus were grown for 42 d in the greenhouse before sampling. Asian rice (Oryza sativa) cultivars M104, Nipponbare (both temperate japonica varieties), IR50, and 93-11 (both indica varieties) were grown alongside two cultivars of African cultivated rice Oryza glaberrima, TOg7102 (Glab B) and TOg7267 (Glab E). PERMANOVA indicated that rice genotype accounted for a significant amount of variation between microbial communities when using WUF (2.41% of the variance, P < 0.001; Dataset S5A) and UUF (1.54% of the variance, P < 0.066; Dataset S5H); however, visual representations for clustering patterns of the genotypes were not evident on the first two axes of unconstrained PCoA ordinations (SI Appendix, Fig. S10). We then used CAP analysis to quantify the effect of rice genotype on the microbial communities. By focusing on rice cultivar and controlling for compartment, soil type, and technical factors, we found that genotypic differences in rice have a significant effect on root-associated microbial communities (5.1%, P = 0.005, WUF, Fig. 3A and 3.1%, P = 0.005, UUF, SI Appendix, Fig. S11A). Ordination of the resulting CAP analysis revealed clustering patterns of the cultivars that are only partially consistent with genetic lineage for both the WUF and UUF metrics. The two japonica cultivars clustered together and the two O. glaberrima cultivars clustered together; however, the indica cultivars were split, with 93-11 clustering with the O. glaberrima cultivars and IR50 clustering with the japonica cultivars.

To analyze how the genotypic effect manifests in individual rhizocompartments, we separated the whole dataset to focus on each compartment individually and conducted CAP analysis controlling for soil type and technical factors. The rhizosphere had the greatest genotypic effect on the microbiome (30.3%, P = 0.005, WUF, SI Appendix, Fig. S11B and 10.5%, P = 0.005, UUF, SI Appendix, Fig. S11E). The clustering patterns of the cultivars in the rhizosphere were similar to the clustering patterns exhibited when conducting CAP analysis on the whole data using all rhizocompartments. Again, the japonica and O. glaberrima cultivars clustered separately, whereas the indica cultivars were split between the japonica and O. glaberrima clusters. This clustering pattern is maintained in the rhizoplane communities (SI Appendix, Fig. S11 C and F); however, it breaks down in the endosphere compartment communities, which coincidentally are the least affected by rice genotype (12.8%, P = 0.005, WUF, SI Appendix, Fig. S11D and 8.5%, P = 0.028, UUF, SI Appendix, Fig. S11G). α-Diversity measurements within the rhizosphere show a notable difference between the cultivars (P = 3.12E-06, ANOVA), with the O. glaberrima cultivars exhibiting high diversity relative to the japonica cultivars, especially in Arbuckle soil (Fig. 3B and Dataset S11). Again, the two japonica cultivars were more similar to the indica cultivar IR50, and the two O. glaberrima cultivars were more similar to the indica cultivar 93-11. These patterns in α-diversity were not evident when examining other compartments (SI Appendix, Fig. S12). To explain which OTUs accounted for the genotypic effects in each rhizocompartment, we performed differential OTU abundance analyses between the cultivars (Dataset S12). In total, we found 125 OTUs that were affected by the plant genotype in at least one rhizocompartment. The rhizosphere had the most OTUs that were significantly impacted by genotype (SI Appendix, Fig. S13). This is consistent with the results from PERMANOVA and the CAP analyses.

Geographical Effects on the Microbiomes of Field-Grown Plants. We sought to determine whether the results from greenhouse plants were generalizable to cultivated rice and to investigate other factors that might affect the microbiome under field conditions. We therefore characterized the root-associated microbiomes of field rice plants distributed across eight geographically separate sites across California’s Sacramento Valley (Fig. 4A). These eight sites were operated under two cultivation practices: organic cultivation and a more conventional cultivation practice termed “ecofarming” (see below). Because genotype explained the least variance in the greenhouse data, we limited the analysis to one cultivar, S102, a California temperate japonica variety that is widely cultivated by commercial growers and is closely related to M104 (26). Field samples were collected from vegetatively growing rice plants in flooded fields and the previously defined rhizocompartments were analyzed as before. Unfortunately, collection of bulk soil controls for the field experiment was not

Fig. 3. Host plant genotype significantly affects microbial communities in the rhizospheric compartments. (A) Ordination of CAP analysis using the WUF metric constrained to rice genotype. (B) Within-sample diversity measurements of rhizosphere samples of each cultivar grown in each soil. Estimated species richness was calculated as e^(Shannon entropy). The horizontal bars within boxes represent median. The tops and bottoms of boxes represent 75th and 25th quartiles, respectively. The upper and lower whiskers extend 1.5× the interquartile range from the upper edge and lower edge of the box, respectively. All outliers are plotted as individual points.
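The α-diversity measure named in the caption, the exponential of the Shannon entropy, can be computed from a vector of OTU counts in a few lines. The sketch below assumes the natural logarithm and uses a hypothetical function name, `effective_richness`; the caption itself does not state the log base.

```python
from math import exp, log


def effective_richness(counts):
    """Exponential of the Shannon entropy of OTU counts (natural log assumed)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    shannon = -sum((c / total) * log(c / total) for c in counts if c > 0)
    return exp(shannon)


if __name__ == "__main__":
    print(effective_richness([10, 10, 10, 10]))  # four equally abundant OTUs
    print(effective_richness([97, 1, 1, 1]))     # one dominant OTU
```

For a perfectly even community the value equals the number of OTUs, and it falls toward 1 as a single OTU comes to dominate, which makes it easier to interpret than the raw entropy.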


Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS

Page 29: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Rice: Cultivation Site Effects

Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS

possible, because planting densities in California commercial rice fields are too high to find representative soil that is unlikely to be affected by nearby plants. Amplification and sequencing of the field microbiome samples yielded 13,349,538 high-quality sequences (median: 54,069 reads per sample; range: 12,535–148,233 reads per sample; Dataset S13). The sequences were clustered into OTUs using the same criteria as the greenhouse experiment, yielding 222,691 microbial OTUs and 47,983 OTUs with counts >5 across the field dataset.

We found that the microbial diversity of field rice plants is significantly influenced by the field site. α-Diversity measurements of the field rhizospheres indicated that the cultivation site significantly impacts microbial diversity (SI Appendix, Fig. S14A, P = 2.00E-16, ANOVA and Dataset S14). Unconstrained PCoA using both the WUF and UUF metrics showed that microbial communities separated by field site across the first axis (Fig. 4B, WUF and SI Appendix, Fig. S14B, UUF). PERMANOVA agreed with the unconstrained PCoA in that field site explained the largest proportion of variance between the microbial communities for field plants (30.4% of variance, P < 0.001, WUF, Dataset S5O and 26.6% of variance, P < 0.001, UUF, Dataset S5P). CAP analysis constrained to field site and controlled for rhizocompartment, cultivation practice, and technical factors (sequencing batch and biological replicate) agreed with the PERMANOVA results in that the field site explains the largest proportion of variance between the root-associated microbial communities in field plants (27.3%, P = 0.005, WUF, SI Appendix, Fig. S15A and 28.9%, P = 0.005, UUF, SI Appendix, Fig. S15E), suggesting that geographical factors may shape root-associated microbial communities.

Rhizospheric Compartmentalization Is Retained in Field Plants. Similar to the greenhouse plants, the rhizospheric microbiomes of field plants are distinguishable by compartment. α-Diversity of the field plants again showed that the rhizosphere had the highest microbial diversity, whereas the endosphere had the least diversity for all fields tested (SI Appendix, Fig. S14A and Dataset S15). PCoA of the microbial communities from field plants using the WUF and UUF distance metrics showed that the rhizocompartments separate across PCo 2 (Fig. 4C, WUF and SI Appendix, Fig. S14C, UUF). PERMANOVA indicated that the separation in the rhizospheric compartments explained the second largest source of variation of the factors that were tested (20.76%, P < 0.001, WUF, Dataset S5O and 7.30%, P < 0.001, UUF, Dataset S5P). CAP analysis of the field plants’ microbiomes constrained to the rhizocompartment factor and controlled for field site, cultivation practice, and technical factors agreed with PERMANOVA that a significant proportion of the variance between microbial communities is explained by rhizocompartment (20.9%, P = 0.005, WUF, SI Appendix, Fig. S15C and 10.9%, P = 0.005, UUF, SI Appendix, Fig. S15G).

Taxonomic distributions of phyla for the field plants were overall similar to the greenhouse plants: Proteobacteria, Chloroflexi, and Acidobacteria make up the majority of the rice microbiota. The taxonomic gradients from the rhizosphere to the endosphere are maintained in the field plants for Acidobacteria, Proteobacteria, Spirochaetes, Gemmatimonadetes, Armatimonadetes, and Planctomycetes. However, unlike for greenhouse plants, the distribution of Actinobacteria generally showed an increasing trend from the rhizosphere to the endosphere of field plants (SI Appendix, Fig. S14E and Dataset S16).

We again performed differential abundance tests between the OTUs in the compartments of field-grown plants (SI Appendix, Fig. S16). We found a set of 32 OTUs that were enriched in the endosphere compartment between every cultivation site, potentially representing a core field rice endospheric microbiome (SI Appendix, Fig. S17). The set of 32 OTUs consisted of Deltaproteobacteria in the genus Anaeromyxobacter and Spirochaetes, Actinobacteria, and Alphaproteobacteria in the family Methylocystaceae. Interestingly, 11 of the 32 core field endosphere OTUs were also found to be enriched in the endosphere compartment of greenhouse plants (SI Appendix, Fig. S18). Three of these OTUs were classifiable at the family level. These OTUs consisted of taxa in the families Kineosporiaceae, Rhodocyclaceae, and Myxococcaceae, all of which are also enriched in the Arabidopsis root endosphere microbiome (10).

Cultivation Practice Results in Discernible Differences in the Microbiomes. The rice fields that we sampled from were cultivated under two practices, organic farming and a variation of conventional cultivation called ecofarming (27). Ecofarming differs from organic farming in that chemical fertilizers, fungicide use, and herbicide use are all permitted but growth of transgenic rice and use of postharvest fumigants are not permitted. Although cultivation practice itself does significantly affect α-diversity of the rhizospheric compartments overall (P = 0.008, ANOVA; Dataset S14), there is also a significant interaction between the cultivation practice used and the rhizocompartments (P = 3.52E-07, ANOVA; Dataset S14), indicating that the α-diversities of some rhizocompartments are affected differentially by cultivation practice. The α-diversity within the rhizosphere compartment varied significantly by cultivation practice, with the mean α-diversity being higher in ecofarmed rhizospheres than organic rhizospheres (P = 0.001, Wilcoxon test; Dataset S14), whereas not in the endosphere and rhizoplane microbial communities (P = 0.51 and 0.75, respectively, Wilcoxon tests; Dataset S14). Under nonconstrained PCoA, the cultivation practices are separable across principal coordinates 2 and 3 for both the WUF metric (Fig. 4D) and UUF metric (SI Appendix, Fig. S14D). PERMANOVA of the microbial communities was in agreement with the PCoAs in that cultivation practice has a significant impact on the rhizospheric microbial communities of field rice plants (8.47%, P < 0.001, WUF, Dataset S5O and 6.52%, P < 0.001, UUF, Dataset S5P). CAP analysis of the field plants constrained to cultivation practice agreed with the PERMANOVA results that there are significant differences between microbial communities from organic and ecofarmed rice plants (6.9% of the variance,

Fig. 4. Root-associated microbiomes from field-grown plants are separable by cultivation site, rhizospheric compartment, and cultivation practice. (A) Map depicting the locations of the field experiment collection sites across California’s Central Valley. Circles represent organic-cultivated sites whereas triangles represent ecofarm-cultivated sites. (Scale bar, 10 mi.) (B) PCoA using the WUF method colored to depict the various sample collection sites. (C) Same PCoA in B colored by rhizospheric compartment. (D) Same PCoA in B and C depicting second and third axes and colored by cultivation practice.


Page 30: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Rice: Functional Enrichment x Genotype

and mitochondrial) reads to analyze microbial abundance in the endosphere over time (Fig. 6A). Using this technique, we confirmed the sterility of seedling roots before transplantation. We found that microbial penetrance into the endosphere occurred at or before 24 h after transplantation and that the proportion of microbial reads to organellar reads increased over the first 2 wk after transplantation (Fig. 6A). To further support the evidence for microbiome acquisition within the first 24 h, we sampled root endospheric microbiomes from sterilely germinated seedlings before transplanting into Davis field soil as well as immediately after transplantation and 24 h after transplantation (SI Appendix, Fig. S24). The root endospheres of sterilely germinated seedlings, as well as seedlings transplanted into Davis field soil for 1 min, both had a very low percentage of microbial reads compared with organellar reads (0.22% and 0.71%), with the differences not statistically significant (P = 0.1, Wilcoxon test). As before, endospheric microbial abundance increased significantly, by >10-fold after 24 h in field soil (3.95%, P = 0.05, Wilcoxon test). We conclude that brief soil contact does not strongly increase the proportion of microbial reads, and therefore the increase in microbial reads at 24 h is indicative of endophyte acquisition within 1 d after transplantation.

α-Diversity significantly varied by rhizocompartment (P < 2E-16; Dataset S23) and there was a significant interaction between rhizocompartment and collection time (P = 0.042; Dataset S23); however, when each rhizocompartment was analyzed individually, the bulk soil was the only compartment that showed a significant amount of variation in α-diversity over time (SI Appendix, Fig. S25 and Dataset S23). The above results suggest that a diverse microbiota can begin to colonize the rhizoplane and endosphere as early as 24 h after transplanting into soil. We next evaluated how β-diversities shift over time in each rhizocompartment. We compared the time-series microbial communities with the previous greenhouse experiment microbial communities of M104 in Davis soil (Fig. 6 B and C). β-Diversity measurements of the time-series data indicated that microbiome samples from each compartment are separable by time. Furthermore, the rhizoplane and endosphere microbiomes from the later time point in the time-series data (13 d) approach the endosphere and rhizoplane microbiome compositions for plants that have been grown in the greenhouse for 42 d.

There are slight shifts in the distribution of phyla over time; however, there are significant distinctions between the compartments starting as early as 24 h after transplantation into soil (Fig. 6D, SI Appendix, Figs. S24B and S26, and Dataset S24). Because each phylum consists of diverse OTUs that could exhibit very different behaviors during acquisition, we next examined the dynamics and colonization patterns of specific OTUs within the time-course experiment. The core set of 92 endosphere-enriched OTUs obtained from the previous greenhouse experiment (SI Appendix, Fig. S9C) was analyzed for relative abundances at different time points (Fig. 6E). Of the 92 core endosphere-enriched microbes present in the greenhouse experiment, 53 OTUs were detectable in the endosphere in the time-course experiment. The average abundance profile over time revealed a colonization pattern for the core endospheric microbiome. Relative abundance of the core endosphere-enriched microbiome peaks early (3 d) in the rhizosphere and then decreases back to a steady, low level for the remainder of the time points. Similarly, the rhizoplane profile shows an increase after 3 d with a peak at 8 d with a decline at 13 d. The endosphere generally follows the rhizoplane profile, except that relative abundance is still increasing at 13 d. These results suggest that the core endospheric microbes are first attracted to the rhizosphere and then locate to the rhizoplane, where they attach before migration to the root interior. To summarize, microbiome acquisition from soil appears to occur relatively rapidly, initiating within 24 h and approaching steady state within 14 d. The dynamics of accumulation suggest a multistep process, in which the rhizosphere and rhizoplane are likely to play key roles in determining the compositions of the interior and exterior components of the root-associated microbiome (Discussion).

Discussion

Factors Affecting the Composition of Root-Associated Microbiomes. The data presented here provide a characterization of the microbiome of rice, involving the combination of finer structural

Fig. 5. OTU coabundance network reveals modules of OTUs associated with methane cycling. (A) Subset of the entire network corresponding to 11 modules with methane cycling potential. Each node represents one OTU and an edge is drawn between OTUs if they share a Pearson correlation of greater than or equal to 0.6. (B) Depiction of module 119 showing the relationship between methanogens, syntrophs, methanotrophs, and other methane cycling taxonomies. Each node represents one OTU and is labeled by the presumed function of that OTU’s taxonomy in methane cycling. An edge is drawn between two OTUs if they have a Pearson correlation of greater than or equal to 0.6. (C) Mean abundance profile for OTUs in module 119 across all rhizocompartments and field sites. The position along the x axis corresponds to a different field site. Error bars represent SE. The x and y axes represent no particular scale.
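The edge rule in the Fig. 5 caption (connect two OTUs when their Pearson correlation is at least 0.6) can be sketched in a few lines. This is only an illustration under stated assumptions: the OTU names and abundance values are made up, and the function names `pearson` and `coabundance_edges` are hypothetical, not taken from the paper's pipeline.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles.
    Note: a constant profile has zero variance and would divide by zero."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coabundance_edges(profiles, threshold=0.6):
    """Return (otu_a, otu_b, r) for every OTU pair with r >= threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges

# Hypothetical abundances of three OTUs across five samples.
profiles = {
    "otu_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "otu_b": [2.0, 4.1, 6.2, 7.9, 10.0],  # tracks otu_a closely
    "otu_c": [5.0, 1.0, 4.0, 2.0, 3.0],   # unrelated pattern
}
edges = coabundance_edges(profiles)
```

Only the otu_a/otu_b pair passes the threshold here; modules such as module 119 would then come from clustering the resulting graph, which this sketch does not attempt.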

Edwards et al. PNAS Early Edition | 7 of 10


Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS

Page 31: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Rice Developmental Time Series

resolution and deeper sequencing than previous plant microbiome studies and using both controlled greenhouse and field studies covering a geographical range of cultivation. Specifically, we have been able to characterize in-depth the compositions of three distinct rhizocompartments (the rhizosphere, rhizoplane, and endosphere) and gain insights into the effects of external factors on each of these compartments. We note that a detailed characterization of plant rhizoplane microbiota in relation to the rhizosphere and the endosphere has not been previously attempted. To achieve this, we successfully adapted protocols for removal of rhizoplane microbes from the endosphere of Arabidopsis roots (9, 10). Because the fractional abundance of organellar reads in the rhizosphere, rhizoplane, and endosphere exhibits a clear increasing gradient (SI Appendix, Fig. S27), we hypothesize that we are isolating the rhizoplane fraction via disruption of the rhizodermis, consistent with direct EM observations on Arabidopsis roots following sonication (9, 10). The fine structure approach we have used combined with depth of sequencing allowed us to analyze over 250,000 OTUs, an order of magnitude greater than in any single plant species to date.

Under controlled greenhouse conditions, the rhizocompartments described the largest source of variation in the microbial communities sampled (Dataset S5A). The pattern of separation between the microbial communities in each compartment is consistent with a spatial gradient from the bulk soil across the rhizosphere and rhizoplane into the endosphere (Fig. 1C). Similarly, microbial diversity patterns within samples hold the same pattern where there is a gradient in α-diversity from the rhizosphere to the endosphere (Fig. 1B). Enrichment and depletion of certain microbes across the rhizocompartments indicates that microbial colonization of rice roots is not a passive process and that plants have the ability to select for certain microbial consortia or that some microbes are better at filling the root colonizing niche. Similar to studies in Arabidopsis, we found that the relative abundance of Proteobacteria is increased in the endosphere compared with soil, and that the relative abundances of Acidobacteria and Gemmatimonadetes decrease from the soil to the endosphere (9–11), suggesting that the distribution of different bacterial phyla inside the roots might be similar for all land plants (Fig. 1D and Dataset S6). Under controlled greenhouse conditions, soil type described the second largest source of variation within the microbial communities of each sample. However, the soil source did not affect the pattern of separation between the rhizospheric compartments, suggesting that the rhizocompartments exert a recruitment effect on microbial consortia independent of the microbiome source.

By using differential OTU abundance analysis in the compartments, we observed that the rhizosphere serves an enrichment role for a subset of microbial OTUs relative to bulk soil (Fig. 2). Further, the majority of the OTUs enriched in the rhizosphere are simultaneously enriched in the rhizoplane and/or endosphere of rice roots (Fig. 2B and SI Appendix, Fig. S16B), consistent with a recruitment model in which factors produced by the root attract taxa that can colonize the endosphere. We found that the rhizoplane, although enriched for OTUs that are also enriched in the endosphere, is also uniquely enriched for a subset of OTUs, suggesting that the rhizoplane serves as a specialized niche for some taxa. Conversely, the vast majority of microbes depleted in the rhizoplane are also depleted in the endosphere (Fig. 2C and SI Appendix, Fig. S16C), suggesting that the selectivity for colonization of the interior occurs at the rhizoplane and that the rhizoplane may serve an important gating role for limiting microbial penetrance into the endosphere. It is important to note that the community structure we observe in each compartment is likely not simply caused by the plant alone. Microbial community structural differences between the compartments may be attributable also to microbial interactions involving both competition and cooperation.

In the case of field plants, we observed that the largest source of microbiome separation was due to cultivation site, rather than the spatial compartments (Dataset S5 O and P). These results are in contrast to the controlled greenhouse experiment where the soil effect was the second largest source of variation, suggesting the geography may be more important for determining the composition of the root microbiome than soil structure alone (Dataset S5A). These results differ from the results in the maize microbiome study, where microbial communities showed clear separation by state but not very much by geographic location within the same state (12). However, we note that in our study the locations within California were separated by distances of up to ∼125 km, vs. a maximum separation of ∼40 km in the intrastate locations of the maize study. Other factors that might account for the different results in our study include the number of field sites examined (eight, vs. three intrastate fields examined in the maize study), increased sequencing depth, different resolution because spatial compartments in maize roots were not separately analyzed, or possibly intrinsic differences between cultivated rice and maize.

Our design of the field experiment allowed us to test for cultivation practice effects on the rice root-associated microbiome,

Fig. 6. Time-series analysis of root-associated microbial communities reveals distinct microbiome colonization patterns. (A) Ratios of microbial to organellar (plastidial and mitochondrial) 16S rRNA gene reads in the endosphere after transplantation into Davis soil. The 42-d time point corresponds to the earlier greenhouse experiment data (Fig. 1) subsetted to M104 in Davis soil. Mean percentages of the ratios are depicted with each bar. (B) PCoA of the time-series experiment and the greenhouse experiment subsetted to plants growing in Davis soil and colored by rhizospheric compartment. (C) The same PCoA as in B colored by collection day after transplantation into soil. (D) Average relative abundance for select phyla over the course of microbiome acquisition. (E) Average abundance profile of 53 out of the 92 core endosphere-enriched OTUs in each rhizospheric compartment. Error bars represent SE.

8 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1414592112 Edwards et al.

Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS

Page 32: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Tree from Woese. 1987. Microbiological Reviews 51:221

Example III: rRNA Not Perfect

Lesson 5: Nothing is Perfect

Page 33: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Tree from Woese. 1987. Microbiological Reviews 51:221

Taxa Phylogeny III: rRNA Not Perfect

Page 34: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

rRNA Copy # Correction by Phylogeny

Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S Gene Copy Number Information Improves Estimates of Microbial Diversity and Abundance. PLoS Comput Biol 8(10): e1002743. doi:10.1371/journal.pcbi.1002743
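The idea behind the copy number correction can be illustrated with a small worked example: divide each taxon's 16S read count by its 16S gene copy number, then renormalize. This is a minimal sketch, not the method of Kembel et al., who estimate copy numbers for unsequenced taxa from phylogeny; here the copy numbers and read counts are simply assumed inputs, and the function name is hypothetical.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Convert 16S read counts to relative cell abundances.

    read_counts:  {taxon: number of 16S reads}
    copy_numbers: {taxon: assumed 16S gene copies per genome}
    """
    per_genome = {t: read_counts[t] / copy_numbers[t] for t in read_counts}
    total = sum(per_genome.values())
    return {t: v / total for t, v in per_genome.items()}

# Hypothetical data: taxon_x has 7 rRNA operons, taxon_y has 1.
counts = {"taxon_x": 700, "taxon_y": 300}
copies = {"taxon_x": 7, "taxon_y": 1}
abundance = correct_for_copy_number(counts, copies)
```

Uncorrected, taxon_x looks like 70% of the community; after correction it drops to 25%, which is the kind of shift that makes rRNA read counts an imperfect proxy for abundance.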

Jessica Green@jessicaleegreen

Steven Kembel@stevenkembel

Martin Wu

Page 35: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

DNA extraction

PCR

Sequence all genes

Phylogenetic tree

Shotgun

GeneX

E. coli Humans

GeneX

Yeast

GeneX GeneX

Phylotyping

Phylogeny in Shotgun Metagenomics

Page 36: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

RecA vs. rRNA

Eisen 1995 Journal of Molecular Evolution 41: 1105-1123.

Page 37: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

RecA vs. rRNA

Eisen 1995 Journal of Molecular Evolution 41: 1105-1123.

Lesson 6: Keep Going Back

to Your Past

Page 38: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Phylotyping w/ Protein Markers

AMPHORA

http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7


sequences are not conserved at the nucleotide level [29]. As a result, the nr database does not actually contain many more protein marker sequences that can be used as references than those available from complete genome sequences.

Comparison of phylogeny-based and similarity-based phylotyping

Although our phylogeny-based phylotyping is fully automated, it still requires many more steps than, and is slower than, similarity based phylotyping methods such as MEGAN [30]. Is it worth the trouble? Similarity based phylotyping works by searching a query sequence against a reference database such as NCBI nr and deriving taxonomic information from the best matches or 'hits'. When species that are closely related to the query sequence exist in the reference database, similarity-based phylotyping can work well. However, if the reference database is a biased sample or if it contains no closely related species to the query, then the top hits returned could be misleading [31]. Furthermore, similarity-based methods require an arbitrary similarity cut-off value to define the top hits. Because individual bacterial genomes and proteins can evolve at very different rates, a universal cut-off that works under all conditions does not exist. As a result, the final results can be very subjective.

In contrast, our tree-based bracketing algorithm places the query sequence within the context of a phylogenetic tree and only assigns it to a taxonomic level if that level has adequate sampling (see Materials and methods [below] for details of the algorithm). With the well sampled species Prochlorococcus marinus, for example, our method can distinguish closely related organisms and make taxonomic identifications at the species level. Our reanalysis of the Sargasso Sea data placed 672 sequences (3.6% of the total) within a P. marinus clade. On the other hand, for sparsely sampled clades such as Aquifex, assignments will be made only at the phylum level. Thus, our phylogeny-based analysis is less susceptible to data sampling bias than a similarity based approach, and it makes

Figure 3. Major phylotypes identified in Sargasso Sea metagenomic data. The metagenomic data previously obtained from the Sargasso Sea was reanalyzed using AMPHORA and the 31 protein phylogenetic markers. The microbial diversity profiles obtained from individual markers are remarkably consistent. The breakdown of the phylotyping assignments by markers and major taxonomic groups is listed in Additional data file 5.

[Figure 3 plot. y axis: relative abundance (0 to 0.8). x axis: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified proteobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Acidobacteria, Thermotogae, Fusobacteria, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified bacteria. Series (31 markers): dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf.]

Martin Wu
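To make the contrast with best-hit methods concrete, here is a rough sketch of the bracketing idea on this slide: assign a query only down to the deepest taxonomic rank on which the reference sequences bracketing it in the tree agree. This is an illustration only and is not AMPHORA's actual algorithm; the function name, the simplified four-rank lineage, and the choice of neighbors are all assumptions.

```python
# Simplified rank list (an assumption; real lineages have more ranks).
RANKS = ["domain", "phylum", "genus", "species"]

def shared_taxonomy(lineages):
    """Return the (rank, name) pairs on which all lineages agree,
    walking from the root rank toward species and stopping at the
    first disagreement."""
    shared = []
    for rank, names in zip(RANKS, zip(*lineages)):
        if len(set(names)) != 1:
            break
        shared.append((rank, names[0]))
    return shared

pmar = ("Bacteria", "Cyanobacteria", "Prochlorococcus", "Prochlorococcus marinus")
syn = ("Bacteria", "Cyanobacteria", "Synechococcus", "Synechococcus sp.")

# Query bracketed by two P. marinus references: species-level call.
well_sampled = shared_taxonomy([pmar, pmar])
# Query bracketed by P. marinus and a Synechococcus: phylum-level call only.
sparse = shared_taxonomy([pmar, syn])
```

The point of the design is that the resolution of the answer follows the sampling of the tree around the query, rather than a fixed similarity cut-off.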

Page 39: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

GOS 1

GOS 2

GOS 3

GOS 4

GOS 5

Phylogenetic ID of Novel Lineages

Wu et al PLoS One 2011

Dongying Wu

Page 40: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Phylogenetic Diversity of Metagenomes

cally defined by a sequence similarity threshold) in the sample as equally related. Newer β diversity measures that incorporate phylogenetic information are more powerful because they account for the degree of divergence between sequences (13, 18, 29, 30). Phylogenetic β diversity measures can also be either quantitative or qualitative depending on whether abundance is taken into account. The original, unweighted UniFrac measure (13) is a qualitative measure. Unweighted UniFrac measures the distance between two communities by calculating the fraction of the branch length in a phylogenetic tree that leads to descendants in either, but not both, of the two communities (Fig. 1A). The fixation index (FST), which measures the distance between two communities by comparing the genetic diversity within each community to the total genetic diversity of the communities combined (18), is a quantitative measure that accounts for different levels of divergence between sequences. The phylogenetic test (P test), which measures the significance of the association between environment and phylogeny (18), is typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).

Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are applicable. However, weighted UniFrac has a major advantage over FST because it can be used to combine data in which different parts of the 16S rRNA were sequenced (e.g., when nonoverlapping sequences can be combined into a single tree using full-length sequences as guides). We use two different data sets to illustrate how analyses with quantitative and qualitative β diversity measures can lead to dramatically different conclusions about the main factors that structure microbial diversity. Specifically, qualitative measures that disregard relative abundance can better detect effects of different founding populations, such as the source of bacteria that first colonize the gut of newborn mice and the effects of factors that are restrictive for microbial growth such as temperature. In contrast, quantitative measures that account for the relative abundance of microbial lineages can reveal the effects of more transient factors such as nutrient availability.
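The unweighted UniFrac definition above (the fraction of branch length leading to descendants in one community but not both) reduces to a short calculation once each branch is annotated with the communities found among its descendants. A minimal sketch, assuming that annotation has already been done; the branch list is a made-up toy tree and the function name is hypothetical.

```python
def unweighted_unifrac(branches):
    """branches: list of (branch_length, set_of_communities_below_branch).

    Returns unique branch length / total branch length, counting only
    branches that lead to at least one sampled sequence."""
    unique = sum(length for length, comms in branches if len(comms) == 1)
    total = sum(length for length, comms in branches if comms)
    return unique / total

# Toy tree: one branch leads only to community A, one only to B,
# and one is shared by both.
toy_branches = [
    (1.0, {"A"}),
    (2.0, {"B"}),
    (1.0, {"A", "B"}),
]
distance = unweighted_unifrac(toy_branches)
```

Because only presence or absence below each branch matters, adding duplicate sequences to either community leaves this value unchanged, which is why the measure is qualitative.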

MATERIALS AND METHODS

Weighted UniFrac. Weighted UniFrac is a new variant of the original unweighted UniFrac measure that weights the branches of a phylogenetic tree based on the abundance of information (Fig. 1B). Weighted UniFrac is thus a quantitative measure of β diversity that can detect changes in how many sequences from each lineage are present, as well as detect changes in which taxa are present. This ability is important because the relative abundance of different kinds of bacteria can be critical for describing community changes. In contrast, the original, unweighted UniFrac (Fig. 1A) is a qualitative β diversity measure because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them).

The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:

$$u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$$

Here, n is the total number of branches in the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences that descend from branch i in communities A and B, respectively, and A_T and B_T are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, A_i and B_i are divided by A_T and B_T.

If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:

$$D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right)$$

Here, d_j is the distance of sequence j from the root, A_j and B_j are the numbers of times the sequences were observed in communities A and B, respectively, and A_T and B_T are the total numbers of sequences from communities A and B, respectively.
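The two quantities defined above, the raw value u (a sum over branches) and the scaling factor D (a sum over sequences), can be checked on a toy tree. This is a minimal sketch under stated assumptions: the tree has two tips attached directly to the root, so each tip's branch length equals its distance from the root, and the counts are invented. The function name and argument layout are hypothetical.

```python
def weighted_unifrac(branches, sequences, a_total, b_total):
    """branches:  list of (b_i, A_i, B_i), one entry per branch.
    sequences: list of (d_j, A_j, B_j), one entry per distinct sequence.
    Returns (u, D, u / D)."""
    u = sum(b * abs(a / a_total - bb / b_total) for b, a, bb in branches)
    d = sum(dist * (a / a_total + bb / b_total) for dist, a, bb in sequences)
    return u, d, u / d

# Toy tree: tip 1 (length 1.0) seen 3 times in A and once in B;
# tip 2 (length 2.0) seen once in A and 3 times in B.
toy = [(1.0, 3, 1), (2.0, 1, 3)]
u, d, normalized = weighted_unifrac(toy, toy, a_total=4, b_total=4)
# u = 1.0*|3/4 - 1/4| + 2.0*|1/4 - 3/4| = 1.5
# D = 1.0*(3/4 + 1/4) + 2.0*(1/4 + 3/4) = 3.0
```

If the two communities shared no tips (all of A on tip 1, all of B on tip 2), u would equal D and the normalized value would be 1, matching the stated range of 0 for identical and 1 for nonoverlapping communities.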

Clustering with normalized u values treats each sample equally instead of

TABLE 1. Measurements of diversity

Measure | Measurement of α diversity | Measurement of β diversity
Only presence/absence of taxa considered | Qualitative (species richness) | Qualitative
Additionally accounts for the no. of times that each taxon was observed | Quantitative (species richness and evenness) | Quantitative

FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.

VOL. 73, 2007 PHYLOGENETICALLY COMPARING MICROBIAL COMMUNITIES 1577

Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214

Jessica Green

Steven Kembel

Katie Pollard

Page 41: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Phylosift/pplacer Workflow

Input Sequences

rRNA workflow

protein workflow

profile HMMs used to align candidates to reference alignment

Taxonomic Summaries

parallel option

hmmalign multiple alignment

LAST fast candidate search

pplacer phylogenetic placement

LAST fast candidate search

LAST fast candidate search

search input against references

hmmalign multiple alignment

hmmalign multiple alignment

Infernal multiple alignment

LAST fast candidate search

<600 bp

>600 bp

Sample Analysis & Comparison

Krona plots, Number of reads placed

for each marker gene

Edge PCA, Tree visualization, Bayes factor tests

each input sequence scanned against both workflows

Aaron Darling @koadman

Erik Matsen @ematsen

Holly Bik @hollybik

Guillaume Jospin @guillaumejospin

Darling AE, Jospin G, Lowe E, Matsen FA IV, Bik HM, Eisen JA. (2014) PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2:e243 http://dx.doi.org/10.7717/peerj.243

Erik Lowe

Page 42: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Whole Genome Tree of 2000 Taxa

Lang JM, Darling AE, Eisen JA (2013) Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees and Supermatrices. PLoS ONE 8(4): e62510. doi:10.1371/journal.pone.0062510

Jenna Lang@jennnomics

Aaron Darling@koadman

Page 43: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Phylosift Markers

• PMPROK – Dongying Wu’s Bac/Arch markers

• Eukaryotic Orthologs – Parfrey 2011 paper

• 16S/18S rRNA

• Mitochondria - protein-coding genes

• Viral Markers – Markov clustering on genomes

• Codon Subtrees – finer scale taxonomy

• Extended Markers – plastids, gene families

Page 44: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

PhyEco Markers

Phylogenetic group | Genome Number | Gene Number | Marker Candidates
Archaea | 62 | 145415 | 106
Actinobacteria | 63 | 267783 | 136
Alphaproteobacteria | 94 | 347287 | 121
Betaproteobacteria | 56 | 266362 | 311
Gammaproteobacteria | 126 | 483632 | 118
Deltaproteobacteria | 25 | 102115 | 206
Epsilonproteobacteria | 18 | 33416 | 455
Bacteroides | 25 | 71531 | 286
Chlamydiae | 13 | 13823 | 560
Chloroflexi | 10 | 33577 | 323
Cyanobacteria | 36 | 124080 | 590
Firmicutes | 106 | 312309 | 87
Spirochaetes | 18 | 38832 | 176
Thermi | 5 | 14160 | 974
Thermotogae | 9 | 17037 | 684

Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE 8(10): e77033. doi:10.1371/journal.pone.0077033

Page 45: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Edge PCA: Identify lineages that explain most variation among samples

Edge PCA - Matsen and Evans 2013

Output: Edge PCA

Page 46: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

QIIME Phylotyping and Phylogenetic Ecology

Fig. S6. A set of 96 OTUs mainly consisting of Proteobacteria is enriched across every compartment in the greenhouse experiment. (A) Number of OTUs and the phyla and classes they belong to that are enriched across all rhizocompartments in the greenhouse experiment. (B) A subset of the Proteobacteria and the classes and families they belong to in the OTUs that are enriched across all rhizocompartments in the greenhouse.

https://evomics.org/2014/01/the-glories-of-the-gut-ask-a-fat-mouse/

Lesson 7: Don’t Accept

When You Are Defeated

Page 47: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Example IV: Functional Evolution

Page 48: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

My Study Organisms

Tree from Woese. 1987. Microbiological Reviews 51:221

Page 49: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

1st Genome Sequence

Fleischmann et al. 1995

Page 50: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

TIGR Genome Projects

Tree from Woese. 1987. Microbiological Reviews 51:221

Page 51: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

1st Genome Sequence

Fleischmann et al. 1995

Lesson 8: If you can’t beat them, critique

them or join them

Page 52: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

• Leveraging an understanding of the evolution of function to better predict functions

Function & Phylogeny

Page 53: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

PHYLOGENETIC PREDICTION OF GENE FUNCTION

[Figure: flowchart of the METHOD, worked through for EXAMPLE A (genes 1 to 6) and EXAMPLE B (genes 1A, 2A, 3A, 1B, 2B, 3B from Species 1, 2, and 3). Steps: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. A further panel is titled ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Other labels in the figure: Duplication?, Duplication, Ambiguous.]

Based on Eisen, 1998 Genome Res 8: 163-167.

Phylogenomics
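The last two steps of the flowchart (overlay known functions onto the tree, then infer the function of the gene of interest) can be sketched as a walk outward from the query gene through successively larger clades. This is a simplified illustration, not the procedure from Eisen 1998: the nested-tuple tree, the rule of stopping at the first clade that contains any annotation, and the function labels are all assumptions made for the example. Gene names 1A to 3B echo the slide.

```python
def leaves(tree):
    """All leaf names under a nested-tuple tree (a bare string is a leaf)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def clades_containing(tree, query):
    """Yield the leaf lists of clades containing query, smallest first."""
    if isinstance(tree, str):
        return
    for child in tree:
        if query in leaves(child):
            yield from clades_containing(child, query)
            yield leaves(tree)
            return

def predict_function(tree, query, known):
    """Use the closest annotated relatives; report disagreement honestly."""
    for clade in clades_containing(tree, query):
        functions = {known[g] for g in clade if g != query and g in known}
        if len(functions) == 1:
            return functions.pop()
        if len(functions) > 1:
            return "ambiguous"
    return "unknown"

# Hypothetical gene tree with two paralog groups, A and B.
gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known = {"1A": "function A", "3A": "function A",
         "1B": "function B", "3B": "function B"}
call_2a = predict_function(gene_tree, "2A", known)
call_2b = predict_function(gene_tree, "2B", known)
# A query whose nearest annotated relatives disagree is flagged.
call_q = predict_function(("q", ("x", "y")), "q",
                          {"x": "function A", "y": "function B"})
```

A best-hit approach would pick whichever annotated sequence scores highest, even across a duplication; placing the query in the tree first is what lets the paralog groups be told apart.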

Page 54: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

PHYLOGENETIC PREDICTION OF GENE FUNCTION

[Figure: flowchart of the METHOD, worked through for EXAMPLE A (genes 1 to 6) and EXAMPLE B (genes 1A, 2A, 3A, 1B, 2B, 3B from Species 1, 2, and 3). Steps: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. A further panel is titled ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Other labels in the figure: Duplication?, Duplication, Ambiguous.]

Based on Eisen, 1998 Genome Res 8: 163-167.

Phylogenomics

Lesson 9: If you invent your own omics word,

you are stuck with it so use it for

branding

Page 55: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Phylogenomics ~~ Phylotyping

Eisen et al. 1992. J. Bact. 174: 3416

Page 56: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Phylogenomics ~~ Phylotyping

Eisen et al. 1992. J. Bact. 174: 3416

Lesson 10: Stealing (with

acknowledgement) is OK

Page 57: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Proteorhodopsin Functional Diversity

Venter et al., Science 304: 66. 2004

Page 58: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

• Leveraging understanding of gene gain and loss to better predict genome functions

Lesson 11: Who you hang out

with matters

Page 59: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Carboxydothermus hydrogenoformans

• Isolated from a Russian hotspring

• Thermophile (grows at 80°C)

• Anaerobic

• Grows very efficiently on CO (Carbon Monoxide)

• Produces hydrogen gas

• Low GC Gram positive (Firmicute)

• Genome Determined (Wu et al. 2005 PLoS Genetics 1: e65.)

Page 62: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Non-Homology Predictions: Phylogenetic Profiling

• Step 1: Search all genes in organisms of interest against all other genomes

• Ask: Yes or No, is each gene found in each other species

• Cluster genes by distribution patterns (profiles)
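The three steps above can be sketched as follows, assuming the search step has already produced, for each gene, the set of genomes in which a homolog was found. The gene and genome names are hypothetical, and grouping by identical profiles is the simplest possible clustering; real analyses typically allow near-matches.

```python
from collections import defaultdict

def cluster_by_profile(presence, genomes):
    """Group genes that share the same yes/no distribution across genomes.

    presence: {gene: set of genomes where a homolog was found}
    genomes:  ordered list of genomes defining the profile columns
    """
    clusters = defaultdict(list)
    for gene, found_in in presence.items():
        profile = tuple(g in found_in for g in genomes)
        clusters[profile].append(gene)
    return dict(clusters)

genomes = ["genome_1", "genome_2", "genome_3"]
presence = {
    "gene_a": {"genome_1", "genome_3"},
    "gene_b": {"genome_1", "genome_3"},  # same distribution as gene_a
    "gene_c": {"genome_2"},
}
clusters = cluster_by_profile(presence, genomes)
```

Genes that land in the same cluster, such as gene_a and gene_b here, are candidates for a shared role even when they have no sequence similarity to each other, which is why the slide calls this a non-homology prediction.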

Page 64: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

B. subtilis new sporulation genes

J Bacteriol. 2013 Jan;195(2):253-60. doi: 10.1128/JB.01778-12

Bjorn Traag

Richard Losick

Page 65: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Tree from Woese. 1987. Microbiological Reviews 51:221

Example V: More Gaps

Lesson 12: Keep Returning to the Same Theme Over and Over

and Over

Page 66: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Yet Another Map

Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree

Page 67: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Genomes Poorly Sampled

Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree

Page 68: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

TIGR Tree of Life Project

Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree

Page 69: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Genomic Encyclopedia of Bacteria & Archaea

Wu et al. 2009 Nature 462, 1056-1060

Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree

Page 70: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Genomic Encyclopedia of Bacteria & Archaea

Wu et al. 2009 Nature 462, 1056-1060

Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree

Page 71: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Family Diversity vs. PD

Wu et al. 2009 Nature 462, 1056-1060

Page 72: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

GEBA Cyanobacteria

Shih et al. 2013. PNAS 10.1073/pnas.1217107110

light-harvesting strategies. The majority of cyanobacteria absorb light mainly with soluble pigment–protein complexes called phycobilisomes, in contrast to eukaryotes, which use membrane-bound light-harvesting complexes (LHCs). However, an increasing number of transmembrane proteins involved in cyanobacterial light harvesting are being identified, such as Pcb and IsiA (22, 23). These proteins are analogous in function to eukaryotic LHCs. Because of the growing number of proteins and names, an overarching nomenclature has been proposed to name this protein family the chlorophyll binding proteins (CBPs), which are characterized by six transmembrane helices and the ability to bind chlorophyll (24).

With the increase in number and diversity of genomes, we find that CBPs are widely distributed across the cyanobacterial phylum: 67% (84 of 126) of cyanobacterial genomes have, in addition to the phycobilisomes, genes that putatively function as membrane-bound light-harvesting proteins. In our phylogenetic analysis, the increase in sequence diversity reveals strong support for various subclades that we have provisionally named CBPIV, -V, and -VI (Fig. 3A and SI Appendix, Fig. S5). Although not yet experimentally demonstrated, members of CBPIV, -V, and -VI are expected to bind chlorophyll because they contain positionally conserved histidine and glutamine residues that ligate chlorophyll in confirmed chlorophyll-binding CBPs (SI Appendix, Fig. S6). Some of these proteins, such as CBPIV, have previously been annotated as PsbC homologs (25), because all CBP proteins are thought to have a common evolutionary origin with the psbC gene (24). Because of the vast enrichment of cyanobacterial protein sequences, the increase from two to six known CBPVI sequences augments phylogenetic resolution (bootstrap support of 85%), allowing us to more confidently assert that there is a separate and distinct CBPVI subfamily. On the basis of our phylogenetic analysis of the CBP family, and consistent with previous studies (26), there seems to be a substantial amount of gene duplication and horizontal gene transfer among CBPIV, -V, and -VI. In some genomes, CBPIV and CBPV are found in a gene cluster with other CBP proteins, including IsiA (Fig. 3C), suggestive of the potential for lateral transfer of gene clusters encoding light-harvesting proteins, as documented in marine cyanobacteria (27). Interestingly, many proteins of the CBPV clade also contain a C-terminal extension (SI Appendix, Fig. S7) with homology to the PsaL subunit of photosystem I (PSI). Notably, two distinct subclades within the CBPV family seem to have independently lost the PsaL domains, reflecting the modularity of this C-terminal extension. Homology modeling and insertion of the PsaL-like domain into the PSI structure (Fig. 3B and SI Appendix, Fig. S8) suggests how the CBPV protein could theoretically be incorporated as an ancillary light-harvesting polypeptide into a monomeric, but not trimeric, PSI. Although scattered observations of members of these CBP protein clades

[Figure: (A) phylogenetic tree with subclades A, B1, B2, B3, C1, C2, C3, D, E, F, G; scale bar 0.3; plastid groups labeled Paulinella, Glaucophyte, Green, Red, Chromalveolates. (B) bar chart.]

Fig. 2. Implications on plastid evolution. (A) Maximum-likelihood phylogenetic tree of plastids and cyanobacteria, grouped by subclades (Fig. 1). The red dot (bootstrap support = 97%) represents the primary endosymbiosis event that gave rise to the Archaeplastida lineage, made up of Glaucophytes (orange), Rhodophytes (red), Viridiplantae (green), and Chromalveolates (brown). The independent primary endosymbiosis in the amoeba Paulinella chromatophora is shown in purple. (B) Number of predicted eukaryotic, nuclear genes transferred from a cyanobacterial endosymbiont. Colors correspond to the lineage organisms as above. Light and dark shades of colors represent before and after adding the CyanoGEBA genomes, respectively.

4 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1217107110 Shih et al.

Page 73: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Haloarchaeal GEBA-like

Lynch et al. (2012) PLoS ONE 7(7): e41389. doi:10.1371/journal.pone.0041389

Page 74: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

The Dark Matter of Biology

From Wu et al. 2009 Nature 462, 1056-1060

Page 75: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015


Number of SAGs from Candidate Phyla

Site                                    OD1   OP11   OP3   SAR406
Site A: Hydrothermal vent                 4      1     -        -
Site B: Gold Mine                         6     13     2        -
Site C: Tropical gyres (Mesopelagic)      -      -     -        2
Site D: Tropical gyres (Photic zone)      1      -     -        -

Sample collections at 4 additional sites are underway.

Phil Hugenholtz

GEBA Uncultured

Page 76: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

JGI Dark Matter Project

Workflow:

environmental samples (n=9)

isolation of single cells (n=9,600)

whole genome amplification (n=3,300)

SSU rRNA gene based identification (n=2,000)

genome sequencing, assembly and QC (n=201)

draft genomes (n=201)

Sample sites: SAK, HSM, ETL, TG, HOT, GOM, GBS, EPR, TA

Habitat types: seawater, brackish/freshwater, hydrothermal, sediment, bioreactor

[Figure: phylogenetic tree of BACTERIA and ARCHAEA showing where the single-cell genomes fall. Legible candidate phylum labels with proposed names: WS3 (Latescibacteria), OP3 (Omnitrophica), NKB19 (Hydrogenedentes), EM19 (Calescamantes), CD12 (Aerophobetes), OP8 (Aminicenantes), OP9 (Atribacteria), WWE1 (Cloacamonetes), OP1 (Acetothermia), GN02 (Gracilibacteria), OD1 (Parcubacteria), OP11 (Microgenomates), DSEG (Aenigmarchaea), pMC2A384 (Diapherotrites).]

Figure panels:

archaeal toxins (Nanoarchaea)

lytic murein transglycosylase

stringent response (Diapherotrites, Nanoarchaea)

sigma factor (Diapherotrites, Nanoarchaea)

oxidoreductase

HGT from Eukaryotes (Nanoarchaea)

archaeal type purine synthesis (Microgenomates)

UGA recoded for Gly (Gracilibacteria)

Woyke et al. Nature 2013.

Page 77: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

A Genomic Encyclopedia of Microbes (GEM)

Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree

Page 78: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Tetrahymena Genome Project

Page 79: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

A Genomic Encyclopedia of Microbes (GEM)

Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree

Page 80: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Tree from Woese. 1987. Microbiological Reviews 51:221

Example VI: Beyond Sequence

Lesson 13: Don’t Overdo It

With That Theme

Page 81: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

DNA extraction

PCR

Sequence all genes

Shotgun

Shotgun Metagenomics

Page 82: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Wu et al. 2006 PLoS Biology 4: e188.

Baumannia makes vitamins and cofactors

Sulcia makes amino acids

Phylogenetic Binning

Page 83: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

HiC Crosslinking & Sequencing

Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2:e415 http://dx.doi.org/10.7717/peerj.415

Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the synthetic microbial community are shown before and after filtering, along with the percent of total constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon, species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.

Sequence   Alignment    % of Total   Filtered     % of aligned   Length      GC      #R.S.
Lac0       10,603,204   26.17%       10,269,562   96.85%         2,291,220   0.462     629
Lac1          145,718    0.36%          145,478   99.84%            13,413   0.386       3
Lac2          691,723    1.71%          665,825   96.26%            35,595   0.385      16
Lac        11,440,645   28.23%       11,080,865   96.86%         2,340,228   0.46      648
Ped         2,084,595    5.14%        2,022,870   97.04%         1,832,387   0.373     863
BL21       12,882,177   31.79%        2,676,458   20.78%         4,558,953   0.508     508
K12         9,693,726   23.92%        1,218,281   12.57%         4,686,137   0.507     568
E. coli    22,575,903   55.71%        3,894,739   17.25%         9,245,090   0.51     1076
Bur1        1,886,054    4.65%        1,797,745   95.32%         2,914,771   0.68      144
Bur2        2,536,569    6.26%        2,464,534   97.16%         3,809,201   0.672     225
Bur         4,422,623   10.91%        4,262,279   96.37%         6,723,972   0.68      369
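The percentage columns in Table 1 can be recomputed from the read counts. The sketch below is illustrative and not from the paper: it assumes "% of Total" is each replicon's share of all aligned reads and "% of aligned" is filtered reads divided by aligned reads, which matches the printed values. The dictionary and function names are hypothetical.

```python
# Read counts copied from Table 1 (per replicon, before and after filtering).
ALIGNED = {
    "Lac0": 10_603_204, "Lac1": 145_718, "Lac2": 691_723,
    "Ped": 2_084_595, "BL21": 12_882_177, "K12": 9_693_726,
    "Bur1": 1_886_054, "Bur2": 2_536_569,
}
FILTERED = {
    "Lac0": 10_269_562, "Lac1": 145_478, "Lac2": 665_825,
    "Ped": 2_022_870, "BL21": 2_676_458, "K12": 1_218_281,
    "Bur1": 1_797_745, "Bur2": 2_464_534,
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Assumption: the denominator for "% of Total" is the sum over all replicons.
TOTAL_ALIGNED = sum(ALIGNED.values())

for name in ALIGNED:
    share = percent(ALIGNED[name], TOTAL_ALIGNED)
    kept = percent(FILTERED[name], ALIGNED[name])
    print(f"{name:5s} {share:6.2f}% of total, {kept:6.2f}% kept after filtering")
```

The low retention for the two E. coli strains (about 21% and 13%) stands out against roughly 95% or more for every other replicon.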

Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is shown for read pairs mapping to each chromosome. For each read pair the minimum path length on the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded. The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and plotted.

E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1; (Lieberman-Aiden et al., 2009)). We observed a minor depletion of alignments spanning the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137) due to edge effects induced by BWA treating the sequence as a linear chromosome rather than circular.
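The procedure in the Figure 1 caption (minimum path length on a circular chromosome, a 1000 bp cutoff, 100 equal bins over 2.5 Mb, normalization to sum to 1) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the function names and the input format (a list of position pairs on one chromosome) are hypothetical.

```python
def circular_distance(pos_a, pos_b, chrom_len):
    """Shortest distance between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % chrom_len
    return min(d, chrom_len - d)


def insert_distribution(pairs, chrom_len, max_range=2_500_000,
                        n_bins=100, min_sep=1000):
    """Bin circular read-pair distances and normalize the bins to sum to 1.

    pairs: iterable of (pos_a, pos_b) for read pairs on one chromosome
    (hypothetical input format).
    """
    bin_width = max_range / n_bins
    counts = [0] * n_bins
    for pos_a, pos_b in pairs:
        d = circular_distance(pos_a, pos_b, chrom_len)
        # Discard close pairs (likely ordinary paired-end products)
        # and anything outside the plotted range.
        if d < min_sep or d >= max_range:
            continue
        counts[int(d // bin_width)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


if __name__ == "__main__":
    k12_len = 4_686_137  # E. coli K12 chromosome length from Table 1
    toy_pairs = [(0, 500), (0, 30_000), (0, 30_100), (0, 1_000_000)]
    dist = insert_distribution(toy_pairs, k12_len)
    print(dist[1], dist[40])
```

Using the circular distance means a pair mapping near coordinates 0 and 4686137 counts as close together, which is the case a linear treatment of the sequence gets wrong.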

Beitel et al. (2014), PeerJ, DOI 10.7717/peerj.415 9/19

Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs associating each genomic replicon in the synthetic community is shown as a heat map (see color scale, blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.

reference assemblies of the members of our synthetic microbial community with the same alignment parameters as were used in the top ranked clustering (described above). We first counted the number of Hi-C reads associating each reference assembly replicon (Fig. 2; Table S3), observing that Hi-C data associated replicons within the same species (cell) orders of magnitude more frequently than it associated replicons from different species. The rate of within-species association was 98.8% when ignoring read pairs mapping less than 1,000 bp apart. Including read pairs <1,000 bp inflated this figure to 99.97%. Fig. 3 illustrates this by visualizing the graph of contigs and their associations. Similarly, for the two E. coli strains (K12, BL21) we observed the rate of within-strain association to be 96.36%. When evaluated on genes unique to each strain (where read mapping to each strain would be unambiguous), the self-association rate was observed to be >99%.
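A within-species association rate of the kind reported above could be computed roughly as follows. This is a hedged sketch, not the paper's pipeline: the replicon-to-species mapping follows the Table 1 legend, while the function name, the tuple format for read pairs, and the use of plain linear distance for the 1,000 bp cutoff are assumptions made for illustration.

```python
# Replicon labels and species assignments as given in the Table 1 legend.
REPLICON_TO_SPECIES = {
    "Lac0": "L. brevis", "Lac1": "L. brevis", "Lac2": "L. brevis",
    "Ped": "P. pentosaceus",
    "K12": "E. coli", "BL21": "E. coli",
    "Bur1": "B. thailandensis", "Bur2": "B. thailandensis",
}


def within_species_rate(read_pairs, min_sep=1000):
    """Fraction of Hi-C read pairs whose two ends map to the same species.

    read_pairs: iterable of (replicon_a, pos_a, replicon_b, pos_b)
    (hypothetical format). Pairs on the same replicon closer than min_sep
    are skipped. Simplification: distance is linear, not circular.
    """
    same = 0
    total = 0
    for rep_a, pos_a, rep_b, pos_b in read_pairs:
        if rep_a == rep_b and abs(pos_a - pos_b) < min_sep:
            continue
        total += 1
        if REPLICON_TO_SPECIES[rep_a] == REPLICON_TO_SPECIES[rep_b]:
            same += 1
    return same / total if total else 0.0


if __name__ == "__main__":
    toy = [
        ("Lac0", 100, "Lac0", 50_000),  # same replicon, far apart
        ("Lac1", 10, "Lac0", 2_000),    # plasmid to chromosome, same species
        ("K12", 5, "Bur1", 9),          # different species
        ("Ped", 100, "Ped", 300),       # too close, skipped
    ]
    print(within_species_rate(toy))
```

Setting `min_sep=0` keeps the short-range pairs, which in the paper raised the reported rate from 98.8% to 99.97%.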

We observed that the rate of association of L. brevis plasmids 1 and 2 with each other and with the L. brevis chromosome was at least 100-fold higher than with the other constituents of the synthetic community (Fig. 2). Chromosome and plasmid Hi-C contact maps show that the plasmids associate with sequences throughout the L. brevis chromosome (Fig. 4; Figs. S3–S5) and exhibit the expected enrichment near restriction sites. This demonstrates that metagenomic Hi-C can be used to associate plasmids to specific strains in microbial communities as well as to determine cell co-localization of plasmids with one another.

Variant graph connectedness

Algorithms that reconstruct single-molecule genotypes from samples containing two or more closely-related strains or chromosomal haplotypes depend on reads or read pairs that indicate whether pairs of variants coexist in the same DNA molecule. Such algorithms


Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count thereof depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend) with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded. Contig associations were normalized for variation in contig size.

typically represent the reads and variant sites as a variant graph wherein variant sites are represented as nodes, and sequence reads define edges between variant sites observed in the same read (or read pair). We reasoned that variant graphs constructed from Hi-C data would have much greater connectivity (where connectivity is defined as the mean path length between randomly sampled variant positions) than graphs constructed from mate-pair sequencing data, simply because Hi-C inserts span megabase distances. Such connectivity should, in theory, enable more accurate reconstruction of single-molecule genotypes from smaller amounts of data. Furthermore, by linking distant sites with fewer intermediate nodes in the graph, estimates of linkage disequilibrium at distant sites (from a mixed population) are likely to have greater precision.

To evaluate whether Hi-C produces more connected variant graphs we compared the connectivity of variant graphs constructed from Hi-C data to those constructed from simulated mate-pair data (with average inserts of 5 kb, 10 kb, 20 kb, and 40 kb). To exclude paired-end products from the analysis, Hi-C reads with inserts under 1 kb were excluded from the analysis. For each variant graph constructed from these inputs, 10,000 variant position pairs were sampled at random, with 94.75% and 100% of these pairs belonging to the same connected graph component of the Hi-C and 40 kb variant graphs, respectively.


Figure 4 Hi-C contact maps for replicons of Lactobacillus brevis. Contact maps show the number of Hi-C read pairs associating each region of the L. brevis genome. The L. brevis chromosome (Lac0, (A), Spearman rank correlation) and plasmids (Lac1, (B); Lac2, (C)) show enrichment for local associations (bright diagonal band). Interactions between Lac1 and Lac0 (D) and Lac2 and Lac0 (E) are shown. All except Lac0 are log-scaled. Circularity of Lac0 became apparent after transforming data with the Spearman rank correlation (computed for each matrix element between the row and column sharing that element) in place of log transformation (A) indicated by the high number of contacts between the ends of the sequence. In all plots, pixels are sized to represent interactions between blocks sized at 1% of the interacting genomes. The number of HindIII restriction sites in each region of sequence is shown as a histogram on the left and top of each panel.

These rates fell to 6.21%, 16.6%, and 32.38% for the 5 kb, 10 kb, and 20 kb mate-pair variant graphs, respectively (Table 3).

Across conditions, variant graphs differed in terms of their connectivity, with Hi-C graphs showing the greatest connectivity. Despite having simulated an equal number of reads for each mate-pair distance, the numbers of variant positions linked by such reads was different across conditions. We observed that the variant graph derived from Hi-C data (>1 kb inserts, no alignment filtering), despite having the lowest number of variant links, had the lowest mean and maximum path length (5.47, 11; Table 3). Path length was not correlated with distance within Hi-C variant graphs, in contrast to the mate-pair conditions (Fig. 5). The lengths of paths between variant pairs in the mate-pair graphs did increase with distance, reaching maximums of 71, 96, 94, and 111 in the 5 kb, 10 kb, 20 kb, and 40 kb cases, respectively. We further examined the effect of alignment quality and completeness filtering and observed that in the latter case such filtering vastly reduced the rate at which variant positions occur within the same connected graph component.
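The connectivity measurements described above (the fraction of randomly sampled variant pairs that share a connected component, and the mean and maximum path length between them) can be sketched with a breadth-first search. This is an illustrative sketch under assumptions, not the authors' implementation: the function names, the edge-list input, and the fixed random seed are all hypothetical.

```python
import random
from collections import defaultdict, deque


def build_variant_graph(links):
    """Build an undirected graph from (variant_a, variant_b) links.

    Each link stands for two variant sites seen in the same read or read pair.
    """
    graph = defaultdict(set)
    for u, v in links:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def path_length(graph, src, dst):
    """Shortest path in edges via BFS, or None if src and dst are unconnected."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph[node]:
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def connectivity(graph, n_samples=10_000, seed=0):
    """Sample variant pairs; return (connected fraction, mean path, max path)."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    lengths = []
    for _ in range(n_samples):
        a, b = rng.sample(nodes, 2)
        d = path_length(graph, a, b)
        if d is not None:
            lengths.append(d)
    fraction = len(lengths) / n_samples
    mean = sum(lengths) / len(lengths) if lengths else None
    return fraction, mean, max(lengths, default=None)


if __name__ == "__main__":
    # Toy graph: short-range links form a chain; one long-range link is a shortcut.
    chain = [(i, i + 1) for i in range(20)]
    print(connectivity(build_variant_graph(chain), n_samples=1000))
    print(connectivity(build_variant_graph(chain + [(0, 20)]), n_samples=1000))
```

In the toy example, the single long-range link shortens paths between distant variants, which is the effect the text attributes to megabase-scale Hi-C inserts.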

DISCUSSION

This study demonstrates that Hi-C sequencing data provide valuable information for metagenome analyses that are not currently obtainable by other methods. By applying Hi-C to a synthetic microbial community we showed that genomic DNA was associated


Chris Beitel @datscimed

Aaron Darling @koadman

Page 84: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Sequence Isn’t Everything

PB-PSB1 (Purple sulfur bacteria)

PB-SRB1 (Sulfate reducing bacteria)

(sulfate)

(sulfide)

Wilbanks, E.G. et al (2014). Environmental Microbiology

Lizzy Wilbanks @lizzywilbanks

Page 85: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

12C, 12C14N, 32S

Biomass (RGB composite)

0.044 0.080

34S-incorporation (34S/32S ratio)

Wilbanks, E.G. et al (2014). Environmental Microbiology

Transfer of 34S from SRB to PSB

Page 86: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Long Reads Help, A Lot

Hiseq & Miseq

100-250 bp

Moleculo

2-20 kb

Pacbio RSII

2-20 kb

Micky Kertesz, Tim Blauwcamp

Meredith Ashby, Cheryl Heiner

Illumina-based “synthetic long reads”

Real-time single molecule sequencing

(p4-c2, p5-c3)

295 Megabases 474 Megabases 61 Gigabases

Page 87: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Light-responsive sulfate reducer?

rhodopsin

w/ Susumu Yoshizawa

Page 88: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Lesson 14: Asking for, and

getting, help, is a good thing

Page 89: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Seagrass Microbiome

>1000 samples collected. Not a blade of seagrass touched.

YEAR ONE

Page 90: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015
Page 91: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

ZEN (Zostera Experimental Network)

25 partner sites

leaves, roots, sediment, and water samples

Page 92: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015
Page 93: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015
Page 94: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

MICROBES

Page 95: Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015

Acknowledgements

• GEBA:
  • $$: DOE-JGI, DSMZ
  • Eddy Rubin, Phil Hugenholtz, Hans-Peter Klenk, Nikos Kyrpides, Tanya Woyke, Dongying Wu, Aaron Darling, Jenna Lang
• GEBA Cyanobacteria
  • $$: DOE-JGI
  • Cheryl Kerfeld, Dongying Wu, Patrick Shih
• Haloarchaea
  • $$$ NSF
  • Marc Facciotti, Aaron Darling, Erin Lynch
• Phylosift
  • $$$ DHS
  • Aaron Darling, Erik Matsen, Holly Bik, Guillaume Jospin
• iSEEM:
  • $$: GBMF
  • Katie Pollard, Jessica Green, Martin Wu, Steven Kembel, Tom Sharpton, Morgan Langille, Guillaume Jospin, Dongying Wu
• aTOL
  • $$: NSF
  • Naomi Ward, Jonathan Badger, Frank Robb, Martin Wu, Dongying Wu
• Others (not mentioned in detail)
  • $$: NSF, NIH, DOE, GBMF, DARPA, Sloan
  • Frank Robb, Craig Venter, Doug Rusch, Shibu Yooseph, Nancy Moran, Colleen Cavanaugh, Josh Weitz
  • EisenLab: Srijak Bhatnagar, Russell Neches, Lizzy Wilbanks, Holly Bik