phylogeny-driven approaches to microbial & microbiome studies: talk by jonathan eisen at ucsb...
TRANSCRIPT
Phylogeny-Driven Approaches to Studies of Microbial and Microbiome Diversity
Jonathan A. Eisen, University of California, Davis
@phylogenomics
February 7, 2015, UCSB EEMB Graduate Student Symposium
Some Lessons I Think I Have Learned
Lesson 1: Go With Your Obsessions
Open Science
X
Social Media & Science
X
RedSox
X
Microbial Evolution
Lesson 2: History (of species, genes, people, science) Matters
Example I: Lost in Graduate School?
Get A Map
Tree from Woese. 1987. Microbiological Reviews 51:221
Map for Graduate School
Carl Woese
Limited Sampling of RRR Studies
My Study Organisms
H. volcanii Excision Repair
[Plot: H. volcanii UV Repair Label 7 (45 J/m2). X axis: Avg. Mol. Wt. (Base Pairs), 0 to 18000. Series: 45 J/m2 Dark 24 Hours; 45 J/m2 Photoreac.; 45 J/m2 t0; 0 J/m2 t0.]
By Grombo - from Wikipedia
[Plot: UV Survival E. coli vs H. volcanii. X axis: UV J/m2, 0 to 400. Y axis: Relative Survival, log scale from 1E-07 to 1. Series: H. volcanii WFD11; E. coli NR10125 mfd+; E. coli NR10121 mfd-.]
From Eisen 1998. PhD Thesis.
Map for Graduate School
Lesson 3: Go Fishing Where Nobody Else Has
Example II: Rice Microbiomes and Phylogeny
Joseph Edwards @Bulk_Soil
Sundar @sundarlab
Cameron Johnson
Srijak Bhatnagar @srijakbhatnagar
Edwards et al. 2015. Structure, variation, and assembly of the root-associated
microbiomes of rice. PNAS
Supplementary Figures
Fig. S1. Map depicting soil collection locations for greenhouse experiment.
Fig. S2. Sampling and collection of the rhizocompartments. Roots are collected from rice plants and soil is shaken off the roots to leave ~1 mm of soil around the roots. The ~1 mm of soil
PCR and phylogenetic analysis of rRNA genes
DNA extraction, PCR, sequence rRNA genes, sequence alignment = data matrix, phylogenetic tree
PCR: makes lots of copies of the rRNA genes in sample
rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2 5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’
Data matrix:
E. coli A G A C A G
Humans T A T A G T
Yeast T A C A G T
rRNA1 A C A C A C
rRNA3 C A C T G T
rRNA4 C A C A G T
Phylogeny: E. coli, Humans, Yeast, rRNA1, rRNA2, rRNA3, rRNA4
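To make the "sequence alignment = data matrix" idea concrete, here is a minimal sketch that counts differences between aligned rows. It uses only the rRNA3, rRNA4, and Yeast rows shown on the slide; the function names are made up for illustration, and a raw mismatch count is a simplification, not the tree-building method used in the talk.

```python
from itertools import combinations

# Rows copied from the slide's data matrix (six aligned positions each).
matrix = {
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "Yeast": "TACAGT",
}

def hamming(a: str, b: str) -> int:
    """Count aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def pairwise_distances(rows: dict) -> dict:
    """Return the mismatch count for every unordered pair of rows."""
    return {(p, q): hamming(rows[p], rows[q]) for p, q in combinations(rows, 2)}

if __name__ == "__main__":
    for (p, q), d in pairwise_distances(matrix).items():
        print(p, q, d)
```

A distance table like this is one possible input to a tree-building step; a real analysis would use full-length alignments and an explicit evolutionary model.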
STAP
An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP)
Dongying Wu1*, Amber Hartman1,6, Naomi Ward4,5, Jonathan A. Eisen1,2,3
1 UC Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 Section of Evolution and Ecology, College of Biological Sciences, University of California Davis, Davis, California, United States of America, 3 Department of Medical Microbiology and Immunology, School of Medicine, University of California Davis, Davis, California, United States of America, 4 Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, United States of America, 5 Center of Marine Biotechnology, Baltimore, Maryland, United States of America, 6 The Johns Hopkins University, Department of Biology, Baltimore, Maryland, United States of America
Abstract
Comparative analysis of small-subunit ribosomal RNA (ss-rRNA) gene sequences forms the basis for much of what we know about the phylogenetic diversity of both cultured and uncultured microorganisms. As sequencing costs continue to decline and throughput increases, sequences of ss-rRNA genes are being obtained at an ever-increasing rate. This increasing flow of data has opened many new windows into microbial diversity and evolution, and at the same time has created significant methodological challenges. Those processes which commonly require time-consuming human intervention, such as the preparation of multiple sequence alignments, simply cannot keep up with the flood of incoming data. Fully automated methods of analysis are needed. Notably, existing automated methods avoid one or more steps that, though computationally costly or difficult, we consider to be important. In particular, we regard both the building of multiple sequence alignments and the performance of high quality phylogenetic analysis to be necessary. We describe here our fully-automated ss-rRNA taxonomy and alignment pipeline (STAP). It generates both high-quality multiple sequence alignments and phylogenetic trees, and thus can be used for multiple purposes including phylogenetically-based taxonomic assignments and analysis of species diversity in environmental samples. The pipeline combines publicly-available packages (PHYML, BLASTN and CLUSTALW) with our automatic alignment, masking, and tree-parsing programs. Most importantly, this automated process yields results comparable to those achievable by manual analysis, yet offers speed and capacity that are unattainable by manual efforts.
Citation: Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS ONE 3(7): e2566. doi:10.1371/journal.pone.0002566
Editor: Jean-Nicolas Volff, Ecole Normale Superieure de Lyon, France
Received January 31, 2008; Accepted May 26, 2008; Published July 2, 2008
Copyright: © 2008 Wu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The National Science Foundation "Assembling the Tree of Life" Grant No. 0228651. The final work on this project was funded by the Gordon and Betty Moore Foundation (grant #1660 to Jonathan Eisen).
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
Introduction
ss-RNA gene sequence analysis as a tool for microbial systematics and ecology
Phylogenetic analysis of rRNA gene sequences (particularly ss-rRNA, i.e., the small subunit rRNA) has led to important advances in microbiology, such as the discovery of a third branch on the tree of life (the archaea) [1] and the realization that the microbes that can be grown in pure culture represent but a small fraction, in terms of both phylogenetic types and total numbers of cells of the microbes, found in nature [2]. The power of ss-rRNA for phylogenetic analysis can be attributed to many factors, including its presence in all cellular organisms, its favorable patterns of sequence conservation that enable study of both recent and ancient evolutionary events, and the ease with which this gene can be cloned and sequenced from new organisms [3]. The sequencing of ss-rRNA genes from new species is greatly facilitated by the presence of highly conserved regions at several positions along the gene [4]. The conservation of these regions allows one to design and use broadly targeted oligonucleotide primers that work on a wide diversity of species for both sequencing and amplification by the polymerase chain reaction (PCR). In fact, it is now standard procedure to sequence the ss-rRNA gene when a new microbe has been isolated [5,6].
The ss-rRNA gene has become a key target for environmental microbiology studies largely because through the use of broadly targeted primers, one can use PCR to amplify in a single reaction the ss-rRNA genes from a wide diversity of organisms present in an environmental sample [7,8]. The amplified products can then be characterized in multiple ways such as through restriction digestion [9,10], denaturing gradient gel electrophoresis [11], hybridization to arrays [12], or sequencing. As sequencing continues to decrease in cost and difficulty, we believe it will become the preferred option and thus we focus on sequence analysis here.
Once DNA sequences of environmental ss-rRNA genes are in hand, multiple types of analyses can be used to characterize the organisms and communities from which they were obtained. For example phylogenetic analysis of the sequences can reveal what types of microbial organisms are present in a sample. In addition, very closely related ss-rRNA sequences can be grouped together into phylotypes or operational taxonomic units (OTUs), groupings which
PLoS ONE | www.plosone.org 1 July 2008 | Volume 3 | Issue 7 | e2566
multiple alignment and phylogeny was deemed unfeasible. However, this we believe can compromise the value of the results.
For example, the delineation of OTUs has also been automated via tools that do not make use of alignments or phylogenetic trees (e.g., Greengenes). This is usually done by carrying out pairwise comparisons of sequences and then clustering sequences that have better than some cutoff threshold of similarity with each other. This approach can be powerful (and reasonably efficient) but it too has limitations. In particular, since multiple sequence alignments are not used, one cannot carry out standard phylogenetic analyses. In addition, without multiple sequence alignments one might end up comparing and contrasting different regions of a sequence depending on what it is paired with.
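The pairwise-comparison-plus-cutoff idea described above can be sketched as follows. This is an illustrative toy, not the method of any particular tool: the read names and sequences are invented, the reads are assumed to be the same length so that identity can be computed position by position, and single-linkage grouping is just one possible clustering rule.

```python
from itertools import combinations

def percent_identity(a: str, b: str) -> float:
    """Fraction of positions that match between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_otus(seqs: dict, cutoff: float) -> list:
    """Single-linkage grouping: any pair at or above the cutoff shares an OTU."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Walk up to the representative of x's group, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())

# Hypothetical toy reads, invented for this example.
reads = {
    "read_a": "ACGTACGTAC",
    "read_b": "ACGTACGTAT",
    "read_c": "TTGCACGGAC",
}

if __name__ == "__main__":
    print(cluster_otus(reads, cutoff=0.9))
```

Note that nothing here builds a multiple sequence alignment or a tree, which is exactly the limitation the passage points out.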
The limitations of avoiding multiple sequence alignments and phylogenetic analysis are readily apparent in tools to classify sequences. For example, the Ribosomal Database Project's Classifier program [29] focuses on composition characteristics of each sequence (e.g., oligonucleotide frequency) and assigns taxonomy based upon clustering genes by their composition. Though this is fast and completely automatable, it can be misled in cases where distantly related sequences have converged on similar composition, something known to be a major problem in ss-rRNA sequences [30]. Other taxonomy assignment systems focus primarily on the similarity of sequences. The simplest of these is to use BLASTN to search a sequence database (e.g., Genbank) and to then use information about the top match to assign some sort of taxonomy information to new sequences. Such similarity-based approaches are analogous to using top blast matches to predict the functions of genes and have similar limitations. Though fast, such approaches are not ideal because the most similar sequence may not in fact be the most closely related sequence due to the vagaries of evolution such as unequal rates of change in different lineages or convergent evolution [31–35].
Despite the clear advantages of using multiple sequence alignments and phylogenetic analyses for many aspects of ss-rRNA analyses, there are only a few examples of attempts to generate these outputs in a highly or completely automated manner. The most comprehensive tool we are aware of is the BIBI software package [36], which takes new sequences, identifies similar sequences in a database using BLASTN and then generates a new multiple sequence alignment and then produces phylogenetic trees from the alignment. Users can then view the trees to make taxonomic assignments based upon phylogenetic position of query sequences relative to known ones. Though BIBI is a quantum leap more advanced than most available similarity-based classification tools it does have some limitations. For example, the generation of new alignments for each sequence is computationally costly, and does not take advantage of available curated alignments that make use of ss-RNA secondary structure to guide the primary sequence alignment. Perhaps most importantly however is that the tool is not fully automated. In addition, it does not generate multiple sequence alignments for all sequences in a dataset which would be necessary for doing many analyses.
Automated methods for analyzing rRNA sequences are also available at the web sites for multiple rRNA centric databases, such as Greengenes and the Ribosomal Database Project (RDPII). Though these and other web sites offer diverse powerful tools, they do have some limitations. For example, not all provide multiple sequence alignments as output and few use phylogenetic approaches for taxonomy assignments or other analyses. More importantly, all provide only web-based interfaces and their integrated software (e.g., alignment and taxonomy assignment) cannot be locally installed by the user. Therefore, the user cannot take advantage of the speed and computing power of parallel processing such as is available on linux clusters, or locally alter and potentially tailor these programs to their individual computing needs (Table 1).
Given the limited automated tools that are available, researchers have had to choose between two non-ideal options: manually generating and/or curating alignments (an expensive and slow process which can handle only a limited number of sequences) or using the non-phylogenetic and non-alignment based methods that can be automated more readily.
We describe here the development of a fully-automated, high-throughput method that meets many of the key requirements of ss-rRNA sequence analysis. First, this method generates high quality multiple sequence alignments that can be used for phylogenetic reconstructions as well as for diversity measures such as the identification of OTUs. Secondly, the method generates a phylogenetic tree for each query sequence and assigns that sequence to a taxonomic group based upon its position in the tree relative to other known sequences. The alignments and phylogenetic tree outputs of this program can be used for input into a variety of other software tools such as DOTUR (for identifying OTUs) and UNIFRAC (for phylogenetic based community comparisons) [26,37]. We refer to this method as STAP: a Small Subunit rRNA Taxonomy and Alignment Pipeline.
A key advantage of STAP is that it is the only fully automated method available that can be locally installed by the user and is
Table 1. Comparison of STAP's computational abilities relative to existing commonly-used ss-RNA analysis tools.
 | STAP | ARB | Greengenes | RDP
Installed where? | Locally | Locally | Web only | Web only
User interface | Command line | GUI | Web portal | Web portal
Parallel processing | YES | NO | NO | NO
Manual curation for taxonomy assignment | NO | YES | NO | NO
Manual curation for alignment | NO | YES | NO* | NO
Open source | YES** | NO | NO | NO
Processing speed | Fast | Slow | Medium | Medium
It is important to note that STAP is the only software that runs on the command line and can take advantage of parallel processing on linux clusters and, further, is more amenable to downstream code manipulation.
*Note: Greengenes alignment output is compatible with upload into ARB and downstream manual alignment.
**The STAP program itself is open source, the programs it depends on are freely available but not open source.
doi:10.1371/journal.pone.0002566.t001
ss-rRNA Taxonomy Pipeline
STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment algorithm, the alignments from the STAP database remain intact, while gaps are inserted and nucleotides are trimmed for the query sequence according to the profile defined by the previous alignments from the databases. Thus the accuracy and quality of the alignment generated at this step depends heavily on the quality of the Bacterial/Archaeal ss-rRNA alignments from the Greengenes project or the Eukaryotic ss-rRNA alignments from the RDPII project.
Phylogenetic analysis using multiple sequence alignments rests on the assumption that the residues (nucleotides or amino acids) at the same position in every sequence in the alignment are homologous. Thus, columns in the alignment for which "positional homology" cannot be robustly determined must be excluded from subsequent analyses. This process of evaluating homology and eliminating questionable columns, known as masking, typically requires time-consuming, skillful, human intervention. We designed an automated masking method for ss-rRNA alignments, thus eliminating this bottleneck in high-throughput processing.
First, an alignment score is calculated for each aligned column by a method similar to that used in the CLUSTALX package [42]. Specifically, an R-dimensional sequence space representing all the possible nucleotide character states is defined. Then for each aligned column, the nucleotide populating that column in each of the aligned sequences is assigned a score in each of the R dimensions (Sr) according to the IUB matrix [42]. The consensus "nucleotide" for each column (X) also has R dimensions, with the score for each dimension (Xr) calculated as the average of the scores for that column in that dimension (average of Sr). Thus the score of the consensus nucleotide is a mathematical expression describing the average "nucleotide" in that column for that alignment.
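The consensus "nucleotide" calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the STAP code: it takes R = 4 (A, C, G, T), uses a simple one-hot score in place of the IUB matrix, and does not handle gaps or ambiguity codes. The function names and the toy alignment are made up for this example.

```python
NUCLEOTIDES = "ACGT"  # R = 4 dimensions; one-hot scoring is an assumption, not the IUB matrix

def score_vector(base: str) -> list:
    """Score one nucleotide in each of the R dimensions (Sr)."""
    if base not in NUCLEOTIDES:
        raise ValueError(f"unsupported character in this sketch: {base!r}")
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]

def consensus_vector(column: str) -> list:
    """Average the per-sequence scores in each dimension (Xr = mean of Sr)."""
    if not column:
        raise ValueError("column must contain at least one nucleotide")
    vectors = [score_vector(b) for b in column]
    return [sum(v[r] for v in vectors) / len(vectors) for r in range(len(NUCLEOTIDES))]

def column_consensus(alignment: list) -> list:
    """Compute the consensus vector X for every column of an alignment."""
    return [consensus_vector("".join(col)) for col in zip(*alignment)]

if __name__ == "__main__":
    toy_alignment = ["CACTGT", "CACAGT", "TACAGT"]  # toy rows for illustration
    for i, x in enumerate(column_consensus(toy_alignment), start=1):
        print(i, [round(v, 2) for v in x])
```

A fully conserved column gives a consensus that sits exactly on one nucleotide, while a mixed column gives a point between nucleotides; how STAP turns that into a masking decision is not covered by this excerpt.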
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002
Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001
Dongying Wu
Amber Hartman
Naomi Ward
WATERS
Hartman et al. BMC Bioinformatics 2010, 11:317 http://www.biomedcentral.com/1471-2105/11/317
sequence rDNA (the genes for ribosomal RNA) in particular those for small-subunit ribosomal RNA (ss-rRNA). These studies revealed a large amount of previously undetected microbial diversity [1,11-13]. Researchers focused on the small subunit rRNA gene not only because of the ease with which it can be PCR amplified, but also because it has variable and highly conserved regions, it is thought to be universally distributed among all living organisms, and it is useful for inferring phylogenetic relationships [14,15]. Since then, "cultivation-independent technologies" have brought a revolution to the field of microbiology by allowing scientists to study a wide and complex amount of diversity in many different habitats and environments [16-18]. The general premise of these methods remains relatively unchanged from the initial experiments two decades ago and relies on straightforward molecular biology techniques and bioinformatics tools from ecology, evolutionary biology and DNA sequencing projects.
Briefly, the lab work involved in 16S rDNA surveys begins with environmental samples (e.g., soil or water) from which total genomic DNA is extracted. Next, the 16S rDNA is PCR-amplified with pan-bacterial or pan-archaeal primers (i.e., primers designed to amplify as many known bacteria or archaea as possible), cloned into a sequencing vector, and then sequenced (or directly sequenced without cloning in next generation sequencing) resulting in large collections of diverse microbial 16S rDNA sequences from these different samples. As sequencing costs have continually declined, environmental microbiology surveys have expanded correspondingly and 16S rDNA datasets have grown increasingly complex.
The size and complexity of data sets introduce a new challenge - analyses that one could carry out manually on small data sets now must be aided or run entirely on computers. And those analyses that previously were carried out computationally now must be made more efficient to have any hopes of being completed in a timely manner [7,19].
How then is the microbial community sequencing data converted from reads off a sequencing machine to bar graphs, network diagrams, and biological conclusions? Fortunately, even as data sets have expanded, most researchers analyzing rDNA sequence data sets, even when they are very large, have a similar set of goals in their analysis. For example, most studies are interested in assigning a microbial identity to the 16S rDNA sequences and determining the proportion of these organisms in each sequence collection. And to achieve these (and related goals), a similar set of steps are used (Fig. 1) including aligning the rDNA sequences in a dataset to each other so that they are comparable, removing chimeric sequences generated during PCR, identifying closely related sets of sequences (also known as operational taxonomic units or OTUs), removing redundant sequences above a certain percent identity cutoff, assigning putative taxonomic identifiers to each sequence or representative of a group, inferring a phylogenetic tree of the sequences, and comparing the phylogenetic structure of different samples to each other and to the larger bacterial or archaeal tree of life.
Over the last few years, a large number of software tools and web applications have become available to carry out each of the above steps (e.g., [20,21] for chimera checking, [22] for phylogenetic comparisons, STAP for taxonomy assignments). In practice, even as new software became available, researchers still have to act as the drivers of the workflow. At each step in this process, different types of software must be chosen and employed, each with distinct data formatting requirements, invocation methods, and each associated with a variety of post-analysis steps that may be selected and applied. Even after all of these steps have been completed, a wide variety of statistical and visualization tools are applied to these results to interpret and represent these data. In this context, there is a clear need for tools that will run a comprehensive set of analyses all linked together into one system. Very recently, two such systems have been released - mothur and QIIME. WATERS is our effort in this regard with some key differences compared to mothur and QIIME.
Figure 1 Overview of WATERS. Schema of WATERS where white boxes indicate "behind the scenes" analyses that are performed in WATERS. Quality control files are generated for white boxes, but not otherwise routinely analyzed. Black arrows indicate that metadata (e.g., sample type) has been overlaid on the data for downstream interpretation. Colored boxes indicate different types of results files that are generated for the user for further use and biological interpretation. Colors indicate different types of WATERS actors from Fig. 2 which were used: green, Diversity metrics, WriteGraphCoordinates, Diversity graphs; blue, Taxonomy, BuildTree, Rename Trees, Save Trees; CreateUnifrac; yellow, CreateOtuTable, CreateCytoscape, CreateOTUFile; white, remaining unnamed actors.
Align
Check chimeras
Cluster
Build Tree
Assign Taxonomy
Tree w/ Taxonomy
Diversity statistics & graphs
Unifrac files
Cytoscape network
OTU table
Hartman et al 2010. W.A.T.E.R.S.: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences. BMC Bioinformatics 2010, 11:317 doi:10.1186/1471-2105-11-317
default is 97% and 99%), and they are also generated for every metadata variable comparison that the user includes.
Data pruning
To assist in troubleshooting and quality control, WATERS returns to the user three fasta files of sequences that were removed at various steps in the workflow. A short_sequences.fas file is created that contains all
Figure 3 Biologically similar results automatically produced by WATERS on published colonic microbiota samples. (A) Rarefaction curves similar to curves shown in Eckburg et al. Fig. 2; 70-72, indicate patient numbers, i.e., 3 different individuals. (B) Weighted Unifrac analysis based on phylogenetic tree and OTU data produced by WATERS very similar to Eckburg et al. Fig. 3B. (C) Neighbor-joining phylogenetic tree (Quicktree) representing the sequences analyzed by WATERS, which is clearly similar to Fig. S1 in Eckburg et al.
Amber Hartman
Bertram Ludaescher
alignment used to build the profile, resulting in a multiple sequence alignment of full-length reference sequences and metagenomic reads. The final step of the alignment process is a quality control filter that 1) ensures that only homologous SSU-rRNA sequences from the appropriate phylogenetic domain are included in the final alignment, and 2) masks highly gapped alignment columns (see Text S1).
We use this high quality alignment of metagenomic reads and reference sequences to construct a fully-resolved, phylogenetic tree and hence determine the evolutionary relationships between the reads. Reference sequences are included in this stage of the analysis to guide the phylogenetic assignment of the relatively short metagenomic reads. While the software can be easily extended to incorporate a number of different phylogenetic tools capable of analyzing metagenomic data (e.g., RAxML [27], pplacer [28], etc.), PhylOTU currently employs FastTree as a default method due to its relatively high speed-to-performance ratio and its ability to construct accurate trees in the presence of highly-gapped data [29]. After construction of the phylogeny, lineages representing reference sequences are pruned from the tree. The resulting phylogeny of metagenomic reads is then used to compute a PD distance matrix in which the distance between a pair of reads is defined as the total tree path distance (i.e., branch length) separating the two reads [30]. This tree-based distance matrix is subsequently used to hierarchically cluster metagenomic reads via MOTHUR into OTUs in a fashion similar to traditional PID-based analysis [31]. As with PID clustering, the hierarchical algorithm can be tuned to produce finer or coarser clusters, corresponding to different taxonomic levels, by adjusting the clustering threshold and linkage method.
To evaluate the performance of PhylOTU, we employed statistical comparisons of distance matrices and clustering results for a variety of data sets. These investigations aimed 1) to compare PD versus PID clustering, 2) to explore overlap between PhylOTU clusters and recognized taxonomic designations, and 3) to quantify the accuracy of PhylOTU clusters from shotgun reads relative to those obtained from full-length sequences.
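The PD distance described above, the total branch length on the tree path between two reads, can be sketched as follows. This is a toy illustration, not PhylOTU's implementation: the tree, node names, and branch lengths are invented, and the tree is stored as a plain edge list rather than read from a tree file.

```python
from itertools import combinations

# Hypothetical toy tree as (node, node, branch_length) edges.
EDGES = [
    ("root", "n1", 0.1),
    ("n1", "read1", 0.2),
    ("n1", "read2", 0.3),
    ("root", "read3", 0.5),
]

def build_adjacency(edges: list) -> dict:
    """Turn an edge list into an undirected adjacency map."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj

def path_distance(adj: dict, start: str, goal: str) -> float:
    """Sum branch lengths along the unique tree path from start to goal."""
    stack = [(start, None, 0.0)]
    while stack:
        node, came_from, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adj[node]:
            if nxt != came_from:  # a tree has no cycles, so never step straight back
                stack.append((nxt, node, dist + w))
    raise ValueError(f"no path between {start} and {goal}")

def pd_matrix(edges: list, leaves: list) -> dict:
    """Tree path distance for every unordered pair of leaves."""
    adj = build_adjacency(edges)
    return {(a, b): path_distance(adj, a, b) for a, b in combinations(leaves, 2)}

if __name__ == "__main__":
    for pair, d in pd_matrix(EDGES, ["read1", "read2", "read3"]).items():
        print(pair, round(d, 3))
```

A matrix like this plays the same role for clustering as a PID matrix, but the distances come from the tree instead of from percent identity.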
PhylOTU Clusters Recapitulate PID Clusters
We sought to identify how PD-based clustering compares to commonly employed PID-based clustering methods by applying the two methods to the same set of sequences. Both PID-based clustering and PhylOTU may be used to identify OTUs from overlapping sequences. Therefore we applied both methods to a dataset of 508 full-length bacterial SSU-rRNA sequences (reference sequences; see above) obtained from the Ribosomal Database Project (RDP) [25]. Recent work has demonstrated that PID is more accurately calculated from pairwise alignments than multiple sequence alignments [32–33], so we used ESPRIT, which implements pairwise alignments, to obtain a PID distance matrix for the reference sequences [32]. We used PhylOTU to compute a PD distance matrix for the same data. Then, we used MOTHUR to hierarchically cluster sequences into OTUs based on both PID and PD. For each of the two distance matrices, we employed a range of clustering thresholds and three different definitions of linkage in the hierarchical clustering algorithm: nearest-neighbor, average, and furthest-neighbor.
To statistically evaluate the similarity of cluster composition between each pair of clustering results, we used two summary statistics that together capture the frequency with which sequences are co-clustered in both analyses: true conjunction rate (i.e., the proportion of pairs of sequences derived from the same cluster in the first analysis that also are clustered together in the second analysis) and true disjunction rate (i.e., the proportion of pairs of sequences derived from different clusters in the first analysis that also are not clustered together in the second analysis) (see Methods
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
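The two summary statistics defined in the excerpt above (true conjunction rate and true disjunction rate) can be sketched in a few lines. This is only an illustration of the definitions as quoted, not the PhylOTU authors' code; the function name `co_clustering_rates`, the dictionary input format, and the toy cluster labels are all assumptions made for the example.

```python
from itertools import combinations


def co_clustering_rates(first, second):
    """Return (true_conjunction_rate, true_disjunction_rate).

    `first` and `second` map each sequence ID to a cluster label
    (hypothetical input format chosen for this sketch).
    """
    if set(first) != set(second):
        raise ValueError("both clusterings must cover the same sequences")

    same_first = same_both = 0   # pairs co-clustered in the first analysis
    diff_first = diff_both = 0   # pairs separated in the first analysis

    for a, b in combinations(sorted(first), 2):
        together_first = first[a] == first[b]
        together_second = second[a] == second[b]
        if together_first:
            same_first += 1
            same_both += together_second
        else:
            diff_first += 1
            diff_both += not together_second

    # A rate is undefined (NaN) when the first analysis has no such pairs.
    conjunction = same_both / same_first if same_first else float("nan")
    disjunction = diff_both / diff_first if diff_first else float("nan")
    return conjunction, disjunction


# Toy labels standing in for PID-based and PD-based OTU assignments.
pid_clusters = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
pd_clusters = {"s1": "X", "s2": "X", "s3": "X", "s4": "Y"}
rates = co_clustering_rates(pid_clusters, pd_clusters)
```

Note that both rates are asymmetric: swapping which clustering counts as the "first analysis" can change the values, which is why the excerpt speaks of pairs of clustering results.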
PhylOTU
Tom Sharpton @tjsharpton
QIIME Phylotyping and Phylogenetic Ecology
Fig. S6. A set of 96 OTUs mainly consisting of Proteobacteria is enriched across every compartment in the greenhouse experiment. (A) Number of OTUs and the phyla and classes they belong to that are enriched across all rhizocompartments in the greenhouse experiment. (B) A subset of the Proteobacteria and the classes and families they belong to in the OTUs that are enriched across all rhizocompartments in the greenhouse.
https://evomics.org/2014/01/the-glories-of-the-gut-ask-a-fat-mouse/
Lesson 4: Accept When You
Are Defeated
Rice Microbiome: Variation w/in Plant
Joseph Edwards
@Bulk_Soil
Sundar @sundarlab
Cameron Johnson
Srijak Bhatnagar
@srijakbhatnagar
To address some of these questions, we have undertaken an exhaustive characterization of the root-associated microbiome of rice. Rice is a major crop plant and a staple food for half of the world’s population. Metagenomic and proteomic approaches have been used to identify different microbial genes present in the rice microbiome (17, 18), but an extensive characterization of microbiome composition and variation has not been performed. Rice cultivation also contributes to global methane, accounting for an estimated 10–20% of anthropogenic emissions, due to the growth of methanogenic archaea in the vicinity of rice roots (19). Here we have used deep sequencing of microbial 16S rRNA genes to detect over 250,000 operational taxonomic units (OTUs), with a structural resolution of three distinct compartments (rhizosphere, rhizoplane, and endosphere) and extending over multiple factors contributing to variation, both under controlled greenhouse conditions as well as different field environments. The large datasets from the different conditions sampled in this study were used for identification of putative microbial consortia involved in processes such as methane cycling. Through dynamic studies of the microbiome composition, we provide insights into the process of root microbiome assembly.
Results
Root-Associated Microbiomes Form Three Spatially Separable Compartments Exhibiting Distinct and Overlapping Microbial Communities. Sterilized rice seeds were germinated and grown under controlled greenhouse conditions in soil collected from three rice fields across the Central Valley of California (SI Appendix, Fig. S1). We analyzed the bacterial and archaeal microbiomes from three separate rhizocompartments: the rhizosphere, rhizoplane, and endosphere (Fig. 1A). Because the root microbiome has been shown to correlate with the developmental stage of the plant (10), the root-associated microbial communities were sampled at 42 d (6 wk), when rice plants from all genotypes were well-established in the soil but still in their vegetative phase of growth. For our study, the rhizosphere compartment was composed of ∼1 mm of soil tightly adhering to the root surface that is not easily shaken from the root (SI Appendix, Fig. S2). The rhizoplane compartment microbiome was derived from the suite of microbes on the root surface that cannot be removed by washing in buffer but is removed by sonication (SI Appendix, Materials and Methods). The endosphere compartment microbiome, composed of the microbes inhabiting the interior of the root, was isolated from the same roots left after sonication. Unplanted soil pots were used as a control to differentiate plant effects from general edaphic factors.
The V4-V5 region of the 16S rRNA gene was amplified using
PCR and sequenced using the Illumina MiSeq platform. A total of 10,554,651 high-quality sequences was obtained with a median read count per sample of 51,970 (range: 2,958–203,371; Dataset S2). The high-quality reads were clustered using >97% sequence identity into 101,112 microbial OTUs. Low-abundance OTUs (<5 total counts) were discarded, resulting in 27,147 OTUs. The resulting OTU counts in each library were normalized using the trimmed mean of M values method. This method was chosen due to its sensitivity for detecting differentially abundant taxa compared with traditional microbiome normalization techniques such as rarefaction and relative abundance (20). Measures of within-sample diversity (α-diversity) revealed a diversity gradient from the endosphere to the rhizosphere (Fig. 1B and Dataset S4). Endosphere communities had the lowest α-diversity and the rhizosphere had the highest α-diversity. The mean α-diversity was higher in the rhizosphere than bulk soil; however, the difference in α-diversity between these two compartments cannot be considered as statistically significant (Wilcoxon test; Dataset S4).
Unconstrained principal coordinate analyses (PCoAs) of
weighted and unweighted UniFrac distances were performed to investigate patterns of separation between microbial communities (SI Appendix, Materials and Methods). The UniFrac distance is based on taxonomic relatedness, where the weighted UniFrac (WUF) metric takes abundance of taxa into consideration whereas the unweighted UniFrac (UUF) does not and is thus more sensitive to rare taxa. In both the WUF and UUF PCoAs, the rhizocompartments separate across the first principal coordinate, indicating that the largest source of variation in root-associated microbial communities is proximity to the root (Fig. 1C, WUF and SI Appendix, Fig. S4, UUF). Moreover, the pattern of separation is consistent with a gradient of microbial populations from the exterior of the root, across the rhizoplane, and into the interior of the root. Permutational multivariate analysis of variance (PERMANOVA) corroborates that rhizospheric compartmentalization comprises the largest source of variation within the microbiome data when using a WUF distance metric (46.62%, P < 0.001; Dataset S5A). PERMANOVA using a UUF distance, however, describes rhizospheric compartmentalization as having the second largest source of variation behind soil type (18.07%, P < 0.001; Dataset S5H). In addition to PERMANOVA, we also performed partial canonical analysis of principal coordinates (CAP) on both the WUF and UUF metrics to quantify the variance attributable to each experimental variable (SI Appendix, Materials and Methods). This technique differs from unconstrained PCoA in that technical factors can be controlled for in the analysis and the analysis can be constrained to any factor of interest to better understand the quantitative impact of the factor on the microbial composition.
Using this technique to control for soil type, cultivar, and technical factors (biological replicate, sequencing batch, and planting container), we found that in agreement with the PERMANOVA results, microbial communities vary significantly between rhizocompartments (34.2% of variance, P = 0.005, WUF, SI Appendix, Fig. S5A and 22.6% of variance, P = 0.005, UUF, SI Appendix, Fig. S5C).
There are notable differences in the proportions of various
phyla across the compartments that are consistent across every tested soil (Fig. 1D). The endosphere has a significantly greater proportion of Proteobacteria and Spirochaetes than the rhizosphere or bulk soil, whereas Acidobacteria, Planctomycetes, and Gemmatimonadetes are mostly depleted in the endosphere
Fig. 1. Root-associated microbial communities are separable by rhizocompartment and soil type. (A) A representation of a rice root cross-section depicting the locations of the microbial communities sampled. (B) Within-sample diversity (α-diversity) measurements between rhizospheric compartments indicate a decreasing gradient in microbial diversity from the rhizosphere to the endosphere independent of soil type. Estimated species richness was calculated as e^(Shannon_entropy). The horizontal bars within boxes represent median. The tops and bottoms of boxes represent 75th and 25th quartiles, respectively. The upper and lower whiskers extend 1.5× the interquartile range from the upper edge and lower edge of the box, respectively. All outliers are plotted as individual points. (C) PCoA using the WUF metric indicates that the largest separation between microbial communities is spatial proximity to the root (PCo 1) and the second largest source of variation is soil type (PCo 2). (D) Histograms of phyla abundances in each compartment and soil. B, bulk soil; E, endosphere; P, rhizoplane; S, rhizosphere; Sac, Sacramento.
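The caption's richness estimate, e raised to the Shannon entropy, can be illustrated with a small sketch. It assumes the entropy is computed with the natural logarithm on raw per-OTU counts for one sample; the excerpt does not say which counts (raw or normalized) the authors used, and the function names here are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural log) of one sample's OTU counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample must contain at least one read")
    # OTUs with zero counts contribute nothing to the sum.
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def effective_richness(counts):
    """Estimated species richness as e ** (Shannon entropy)."""
    return math.exp(shannon_entropy(counts))


# Made-up counts: four equally abundant OTUs versus one dominant OTU.
even_sample = [10, 10, 10, 10]
skewed_sample = [40, 0, 0, 0]
even_richness = effective_richness(even_sample)
skewed_richness = effective_richness(skewed_sample)
```

With perfectly even counts the value equals the number of OTUs present, and it falls toward 1 as a single OTU comes to dominate, which is why it reads as an "effective" number of species.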
2 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1414592112 Edwards et al.
Edwards et al. 2015. Structure, variation, and assembly of the root-associated
microbiomes of rice. PNAS
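To make the weighted versus unweighted UniFrac distinction in the excerpt concrete, here is a toy sketch. It uses one common formulation (unweighted: the fraction of branch length found in only one community; weighted, unnormalized: branch length times the difference in relative abundance below that branch) and is not the pipeline the authors ran. The tree encoding as `(branch_length, tips_below)` pairs, the function names, and the counts are all assumptions for illustration.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Fraction of observed branch length unique to one sample."""
    present_a = {tip for tip, n in sample_a.items() if n > 0}
    present_b = {tip for tip, n in sample_b.items() if n > 0}
    unique = shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & present_a)
        in_b = bool(tips & present_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    if unique + shared == 0:
        raise ValueError("neither sample contains any tip of the tree")
    return unique / (unique + shared)


def weighted_unifrac(branches, sample_a, sample_b):
    """Unnormalized weighted UniFrac: sum of b_i * |A_i/A_T - B_i/B_T|."""
    total_a = sum(sample_a.values())
    total_b = sum(sample_b.values())
    distance = 0.0
    for length, tips in branches:
        frac_a = sum(sample_a.get(t, 0) for t in tips) / total_a
        frac_b = sum(sample_b.get(t, 0) for t in tips) / total_b
        distance += length * abs(frac_a - frac_b)
    return distance


# Toy tree ((t1:1,t2:1):1,(t3:1,t4:1):1), one entry per branch.
toy_branches = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t3"}), (1.0, {"t4"}),
    (1.0, {"t1", "t2"}), (1.0, {"t3", "t4"}),
]
community_a = {"t1": 9, "t2": 1}
community_b = {"t1": 1, "t2": 1, "t3": 8}
uuf = unweighted_unifrac(toy_branches, community_a, community_b)
wuf = weighted_unifrac(toy_branches, community_a, community_b)
```

In the unweighted version a taxon with a single read counts as much as a dominant one, which matches the excerpt's remark that UUF is more sensitive to rare taxa.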
Rice Genotype Affects Microbiome
using the UUF metric (26.7%, P = 0.005; SI Appendix, Fig. S5D). This discrepancy is likely due to differences between the WUF and UUF distance metric: Soil type might have more of an effect on frequency of rare taxa than abundant taxa, and thus the UUF metric has a larger effect size for soil type. Compartments of plants grown in distinct soils have commonalities in differentially abundant OTUs (Dataset S9), sharing 92 endosphere-enriched OTUs, 71 rhizoplane-enriched OTUs, and 10 rhizosphere OTUs (SI Appendix, Fig. S8 J, I, and H, respectively, and SI Appendix, Fig. S9). In agreement with the PCoA analysis, Davis and Arbuckle shared a significant overlap in OTUs enriched in the endosphere and rhizoplane (P = 2.22 × 10^−16 and 7.86 × 10^−7, respectively, hypergeometric test; SI Appendix, Fig. S8 I and J) but not the rhizosphere (P = 0.52, hypergeometric test; SI Appendix, Fig. S8H). The Sacramento soil did not share significant overlaps in compartment-enriched OTUs with the other sites.
The enrichment/depletion effects within each rhizosphere compartment vary by soil. Rhizosphere compartments of plants in Davis and Arbuckle soils exhibited higher enrichment/depletion ratios (72/3 and 53/17, respectively) than plants in Sacramento soil (78/116) (SI Appendix, Fig. S8A). The level of enrichment is similar between each soil in the rhizosphere; however, the depletion level is higher in Sacramento soil than in Arbuckle or Davis. Chemical analysis of the soils showed that the nutrient compositions of the soils did not show any exceptional trends (Dataset S7). The Davis and Arbuckle fields were similar in pH and nitrate, magnesium, and phosphorus content, whereas the Arbuckle and Sacramento fields were similar in potassium, calcium, and iron content. Taken together, these results indicate that each soil contains a different pool of microbes and that the plant is not restricted to specific OTUs but instead draws from available OTUs in the pool to organize its microbiome. Nevertheless, the distribution of phyla across the different compartments was similar for all three soil types (Fig. 1D), suggesting that the overall recruitment of OTUs is governed by a set of factors that result in a consistent representation of phyla independent of soil type.
Microbial Communities in the Rhizocompartments Are Influenced by Rice Genotype. To investigate the relationship between rice genotype and the root microbiome, domesticated rice varieties cultivated in widely separated growing regions were tested. Six cultivated rice varieties spanning two species within the Oryza genus were grown for 42 d in the greenhouse before sampling. Asian rice (Oryza sativa) cultivars M104, Nipponbare (both temperate japonica varieties), IR50, and 93-11 (both indica varieties) were grown alongside two cultivars of African cultivated rice Oryza glaberrima, TOg7102 (Glab B) and TOg7267 (Glab E). PERMANOVA indicated that rice genotype accounted for a significant amount of variation between microbial communities when using WUF (2.41% of the variance, P < 0.001; Dataset S5A) and UUF (1.54% of the variance, P < 0.066; Dataset S5H); however, visual representations for clustering patterns of the genotypes were not evident on the first two axes of unconstrained PCoA ordinations (SI Appendix, Fig. S10). We then used CAP analysis to quantify the effect of rice genotype on the microbial communities. By focusing on rice cultivar and controlling for compartment, soil type, and technical factors, we found that genotypic differences in rice have a significant effect on root-associated microbial communities (5.1%, P = 0.005, WUF, Fig. 3A and 3.1%, P = 0.005, UUF, SI Appendix, Fig. S11A). Ordination of the resulting CAP analysis revealed clustering patterns of the cultivars that are only partially consistent with genetic lineage for both the WUF and UUF metrics. The two japonica cultivars clustered together and the two O. glaberrima cultivars clustered together; however, the indica cultivars were split, with 93-11 clustering with the O. glaberrima cultivars and IR50 clustering with the japonica cultivars.
To analyze how the genotypic effect manifests in individual
rhizocompartments, we separated the whole dataset to focus on each compartment individually and conducted CAP analysis controlling for soil type and technical factors. The rhizosphere
had the greatest genotypic effect on the microbiome (30.3%, P = 0.005, WUF, SI Appendix, Fig. S11B and 10.5%, P = 0.005, UUF, SI Appendix, Fig. S11E). The clustering patterns of the cultivars in the rhizosphere were similar to the clustering patterns exhibited when conducting CAP analysis on the whole data using all rhizocompartments. Again, the japonica and O. glaberrima cultivars clustered separately, whereas the indica cultivars were split between the japonica and O. glaberrima clusters. This clustering pattern is maintained in the rhizoplane communities (SI Appendix, Fig. S11 C and F); however, it breaks down in the endosphere compartment communities, which coincidently are the least affected by rice genotype (12.8%, P = 0.005, WUF, SI Appendix, Fig. S11D and 8.5%, P = 0.028, UUF, SI Appendix, Fig. S11G). α-Diversity measurements within the rhizosphere show a notable difference between the cultivars (P = 3.12E-06, ANOVA), with the O. glaberrima cultivars exhibiting high diversity relative to the japonica cultivars, especially in Arbuckle soil (Fig. 3B and Dataset S11). Again, the two japonica cultivars were more similar to the indica cultivar IR50, and the two O. glaberrima cultivars were more similar to the indica cultivar 93-11. These patterns in α-diversity were not evident when examining other compartments (SI Appendix, Fig. S12). To explain which OTUs accounted for the genotypic effects in each rhizocompartment, we performed differential OTU abundance analyses between the cultivars (Dataset S12). In total, we found 125 OTUs that were affected by the plant genotype in at least one rhizocompartment. The rhizosphere had the most OTUs that were significantly impacted by genotype (SI Appendix, Fig. S13). This is consistent with the results from PERMANOVA and the CAP analyses.
Geographical Effects on the Microbiomes of Field-Grown Plants. We sought to determine whether the results from greenhouse plants were generalizable to cultivated rice and to investigate other factors that might affect the microbiome under field conditions. We therefore characterized the root-associated microbiomes of field rice plants distributed across eight geographically separate sites across California’s Sacramento Valley (Fig. 4A). These eight sites were operated under two cultivation practices: organic cultivation and a more conventional cultivation practice termed “ecofarming” (see below). Because genotype explained the least variance in the greenhouse data, we limited the analysis to one cultivar, S102, a California temperate japonica variety that is widely cultivated by commercial growers and is closely related to M104 (26). Field samples were collected from vegetatively growing rice plants in flooded fields and the previously defined rhizocompartments were analyzed as before. Unfortunately, collection of bulk soil controls for the field experiment was not
Fig. 3. Host plant genotype significantly affects microbial communities in the rhizospheric compartments. (A) Ordination of CAP analysis using the WUF metric constrained to rice genotype. (B) Within-sample diversity measurements of rhizosphere samples of each cultivar grown in each soil. Estimated species richness was calculated as e^(Shannon_entropy). The horizontal bars within boxes represent median. The tops and bottoms of boxes represent 75th and 25th quartiles, respectively. The upper and lower whiskers extend 1.5× the interquartile range from the upper edge and lower edge of the box, respectively. All outliers are plotted as individual points.
Edwards et al. 2015. Structure, variation, and assembly of the root-associated
microbiomes of rice. PNAS
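The hypergeometric overlap test mentioned in the excerpt (for OTUs enriched in the same compartment in two soils) can be sketched with the standard library. This is a generic upper-tail hypergeometric probability, not the authors' script; the excerpt does not state which OTU universe was used, so the function name, its parameters, and all numbers below are placeholders.

```python
from math import comb


def overlap_pvalue(universe, size_a, size_b, overlap):
    """P(overlap >= observed) for two sets drawn from a shared universe.

    universe : total number of OTUs that could have been called enriched
    size_a   : number of enriched OTUs in soil A
    size_b   : number of enriched OTUs in soil B
    overlap  : number of OTUs enriched in both soils
    """
    if not 0 <= overlap <= min(size_a, size_b) <= universe:
        raise ValueError("inconsistent set sizes")
    total_ways = comb(universe, size_b)
    # Sum the probability of every overlap at least as large as observed.
    tail_ways = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return tail_ways / total_ways


# Placeholder numbers only: 1,000 testable OTUs, 100 and 120 enriched
# in two soils, 40 enriched in both.
p_example = overlap_pvalue(1000, 100, 120, 40)
```

A small p-value here means the two soils share more enriched OTUs than expected if each soil's enriched set were an independent random draw from the same universe.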
Rice: Cultivation Site Effects
Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS
possible, because planting densities in California commercial rice fields are too high to find representative soil that is unlikely to be affected by nearby plants. Amplification and sequencing of the field microbiome samples yielded 13,349,538 high-quality sequences (median: 54,069 reads per sample; range: 12,535–148,233 reads per sample; Dataset S13). The sequences were clustered into OTUs using the same criteria as the greenhouse experiment, yielding 222,691 microbial OTUs and 47,983 OTUs with counts >5 across the field dataset.
We found that the microbial diversity of field rice plants is
significantly influenced by the field site. α-Diversity measurements of the field rhizospheres indicated that the cultivation site significantly impacts microbial diversity (SI Appendix, Fig. S14A, P = 2.00E-16, ANOVA and Dataset S14). Unconstrained PCoA using both the WUF and UUF metrics showed that microbial communities separated by field site across the first axis (Fig. 4B, WUF and SI Appendix, Fig. S14B, UUF). PERMANOVA agreed with the unconstrained PCoA in that field site explained the largest proportion of variance between the microbial communities for field plants (30.4% of variance, P < 0.001, WUF, Dataset S5O and 26.6% of variance, P < 0.001, UUF, Dataset S5P). CAP analysis constrained to field site and controlled for rhizocompartment, cultivation practice, and technical factors (sequencing batch and biological replicate) agreed with the PERMANOVA results in that the field site explains the largest proportion of variance between the root-associated microbial communities in field plants (27.3%, P = 0.005, WUF, SI Appendix, Fig. S15A and 28.9%, P = 0.005, UUF, SI Appendix, Fig. S15E), suggesting that geographical factors may shape root-associated microbial communities.
Rhizospheric Compartmentalization Is Retained in Field Plants. Similar to the greenhouse plants, the rhizospheric microbiomes of field plants are distinguishable by compartment. α-Diversity of the field plants again showed that the rhizosphere had the highest microbial diversity, whereas the endosphere had the least
diversity for all fields tested (SI Appendix, Fig. S14A and Dataset S15). PCoA of the microbial communities from field plants using the WUF and UUF distance metrics showed that the rhizocompartments separate across PCo 2 (Fig. 4C, WUF and SI Appendix, Fig. S14C, UUF). PERMANOVA indicated that the separation in the rhizospheric compartments explained the second largest source of variation of the factors that were tested (20.76%, P < 0.001, WUF, Dataset S5O and 7.30%, P < 0.001, UUF, Dataset S5P). CAP analysis of the field plants’ microbiomes constrained to the rhizocompartment factor and controlled for field site, cultivation practice, and technical factors agreed with PERMANOVA that a significant proportion of the variance between microbial communities is explained by rhizocompartment (20.9%, P = 0.005, WUF, SI Appendix, Fig. S15C and 10.9%, P = 0.005, UUF, SI Appendix, Fig. S15G).
Taxonomic distributions of phyla for the field plants were
overall similar to the greenhouse plants: Proteobacteria, Chloroflexi, and Acidobacteria make up the majority of the rice microbiota. The taxonomic gradients from the rhizosphere to the endosphere are maintained in the field plants for Acidobacteria, Proteobacteria, Spirochaetes, Gemmatimonadetes, Armatimonadetes, and Planctomycetes. However, unlike for greenhouse plants, the distribution of Actinobacteria generally showed an increasing trend from the rhizosphere to the endosphere of field plants (SI Appendix, Fig. S14E and Dataset S16).
We again performed differential abundance tests between the
OTUs in the compartments of field-grown plants (SI Appendix, Fig. S16). We found a set of 32 OTUs that were enriched in the endosphere compartment between every cultivation site, potentially representing a core field rice endospheric microbiome (SI Appendix, Fig. S17). The set of 32 OTUs consisted of Deltaproteobacteria in the genus Anaeromyxobacter and Spirochaetes, Actinobacteria, and Alphaproteobacteria in the family Methylocystaceae. Interestingly, 11 of the 32 core field endosphere OTUs were also found to be enriched in the endosphere compartment of greenhouse plants (SI Appendix, Fig. S18). Three of these OTUs were classifiable at the family level. These OTUs consisted of taxa in the families Kineosporiaceae, Rhodocyclaceae, and Myxococcaceae, all of which are also enriched in the Arabidopsis root endosphere microbiome (10).
Cultivation Practice Results in Discernible Differences in the Microbiomes. The rice fields that we sampled from were cultivated under two practices, organic farming and a variation of conventional cultivation called ecofarming (27). Ecofarming differs from organic farming in that chemical fertilizers, fungicide use, and herbicide use are all permitted but growth of transgenic rice and use of postharvest fumigants are not permitted. Although cultivation practice itself does significantly affect α-diversity of the rhizospheric compartments overall (P = 0.008, ANOVA; Dataset S14), there is also a significant interaction between the cultivation practice used and the rhizocompartments (P = 3.52E-07, ANOVA; Dataset S14), indicating that the α-diversities of some rhizocompartments are affected differentially by cultivation practice. The α-diversity within the rhizosphere compartment varied significantly by cultivation practice, with the mean α-diversity being higher in ecofarmed rhizospheres than organic rhizospheres (P = 0.001, Wilcoxon test; Dataset S14), whereas not in the endosphere and rhizoplane microbial communities (P = 0.51 and 0.75, respectively, Wilcoxon tests; Dataset S14). Under nonconstrained PCoA, the cultivation practices are separable across principal coordinates 2 and 3 for both the WUF metric (Fig. 4D) and UUF metric (SI Appendix, Fig. S14D). PERMANOVA of the microbial communities was in agreement with the PCoAs in that cultivation practice has a significant impact on the rhizospheric microbial communities of field rice plants (8.47%, P < 0.001, WUF, Dataset S5O and 6.52%, P < 0.001, UUF, Dataset S5P). CAP analysis of the field plants constrained to cultivation practice agreed with the PERMANOVA results that there are significant differences between microbial communities from organic and ecofarmed rice plants (6.9% of the variance,
Fig. 4. Root-associated microbiomes from field-grown plants are separable by cultivation site, rhizospheric compartment, and cultivation practice. (A) Map depicting the locations of the field experiment collection sites across California’s Central Valley. Circles represent organic-cultivated sites whereas triangles represent ecofarm-cultivated sites. (Scale bar, 10 mi.) (B) PCoA using the WUF method colored to depict the various sample collection sites. (C) Same PCoA in B colored by rhizospheric compartment. (D) Same PCoA in B and C depicting second and third axes and colored by cultivation practice.
Rice: Functional Enrichment x Genotype
and mitochondrial) reads to analyze microbial abundance in the endosphere over time (Fig. 6A). Using this technique, we confirmed the sterility of seedling roots before transplantation. We found that microbial penetrance into the endosphere occurred at or before 24 h after transplantation and that the proportion of microbial reads to organellar reads increased over the first 2 wk after transplantation (Fig. 6A). To further support the evidence for microbiome acquisition within the first 24 h, we sampled root endospheric microbiomes from sterilely germinated seedlings before transplanting into Davis field soil as well as immediately after transplantation and 24 h after transplantation (SI Appendix, Fig. S24). The root endospheres of sterilely germinated seedlings, as well as seedlings transplanted into Davis field soil for 1 min, both had a very low percentage of microbial reads compared with organellar reads (0.22% and 0.71%), with the differences not statistically significant (P = 0.1, Wilcoxon test). As before, endospheric microbial abundance increased significantly, by >10-fold after 24 h in field soil (3.95%, P = 0.05, Wilcoxon test). We conclude that brief soil contact does not strongly increase the proportion of microbial reads, and therefore the increase in microbial reads at 24 h is indicative of endophyte acquisition within 1 d after transplantation.
α-Diversity significantly varied by rhizocompartment (P < 2E-16; Dataset S23) and there was a significant interaction between rhizocompartment and collection time (P = 0.042; Dataset S23); however, when each rhizocompartment was analyzed individually, the bulk soil was the only compartment that showed a significant amount of variation in α-diversity over time (SI Appendix, Fig. S25 and Dataset S23). The above results suggest that a diverse microbiota can begin to colonize the rhizoplane and endosphere as early as 24 h after transplanting into soil.
We next evaluated how β-diversities shift over time in each rhizocompartment. We compared the time-series microbial communities with the previous greenhouse experiment microbial communities of M104 in Davis soil (Fig. 6 B and C). β-Diversity measurements of the time-series data indicated that microbiome samples from each compartment are separable by time. Furthermore, the rhizoplane and endosphere microbiomes from the later time point in the time-series data
(13 d) approach the endosphere and rhizoplane microbiome compositions for plants that have been grown in the greenhouse for 42 d.
There are slight shifts in the distribution of phyla over time;
however, there are significant distinctions between the compartments starting as early as 24 h after transplantation into soil (Fig. 6D, SI Appendix, Figs. S24B and S26, and Dataset S24). Because each phylum consists of diverse OTUs that could exhibit very different behaviors during acquisition, we next examined the dynamics and colonization patterns of specific OTUs within the time-course experiment. The core set of 92 endosphere-enriched OTUs obtained from the previous greenhouse experiment (SI Appendix, Fig. S9C) was analyzed for relative abundances at different time points (Fig. 6E). Of the 92 core endosphere-enriched microbes present in the greenhouse experiment, 53 OTUs were detectable in the endosphere in the time-course experiment. The average abundance profile over time revealed a colonization pattern for the core endospheric microbiome. Relative abundance of the core endosphere-enriched microbiome peaks early (3 d) in the rhizosphere and then decreases back to a steady, low level for the remainder of the time points. Similarly, the rhizoplane profile shows an increase after 3 d with a peak at 8 d with a decline at 13 d. The endosphere generally follows the rhizoplane profile, except that relative abundance is still increasing at 13 d. These results suggest that the core endospheric microbes are first attracted to the rhizosphere and then locate to the rhizoplane, where they attach before migration to the root interior. To summarize, microbiome acquisition from soil appears to occur relatively rapidly, initiating within 24 h and approaching steady state within 14 d. The dynamics of accumulation suggest a multistep process, in which the rhizosphere and rhizoplane are likely to play key roles in determining the compositions of the interior and exterior components of the root-associated microbiome (Discussion).
Discussion
Factors Affecting the Composition of Root-Associated Microbiomes. The data presented here provide a characterization of the microbiome of rice, involving the combination of finer structural
Fig. 5. OTU coabundance network reveals modules of OTUs associated with methane cycling. (A) Subset of the entire network corresponding to 11 modules with methane cycling potential. Each node represents one OTU and an edge is drawn between OTUs if they share a Pearson correlation of greater than or equal to 0.6. (B) Depiction of module 119 showing the relationship between methanogens, syntrophs, methanotrophs, and other methane cycling taxonomies. Each node represents one OTU and is labeled by the presumed function of that OTU’s taxonomy in methane cycling. An edge is drawn between two OTUs if they have a Pearson correlation of greater than or equal to 0.6. (C) Mean abundance profile for OTUs in module 119 across all rhizocompartments and field sites. The position along the x axis corresponds to a different field site. Error bars represent SE. The x and y axes represent no particular scale.
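The edge rule in the caption (connect two OTUs when their Pearson correlation is at least 0.6) can be sketched as follows. This covers only edge construction on made-up abundance profiles; how the authors then grouped OTUs into modules is not described in this excerpt and is not attempted here. The function names and the OTU identifiers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile is treated as uncorrelated here
    return cov / sqrt(var_x * var_y)


def coabundance_edges(profiles, threshold=0.6):
    """Return OTU pairs whose correlation is >= threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if pearson(profiles[a], profiles[b]) >= threshold
    ]


# Made-up abundances of three OTUs across four samples.
toy_profiles = {
    "otu_1": [1.0, 2.0, 3.0, 4.0],
    "otu_2": [2.0, 4.0, 6.0, 8.5],
    "otu_3": [4.0, 3.0, 2.0, 1.0],
}
edges = coabundance_edges(toy_profiles)
```

Because the rule is one-sided, strongly anticorrelated OTUs such as `otu_1` and `otu_3` above get no edge under this threshold.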
Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS
Rice Developmental Time Series
resolution and deeper sequencing than previous plant microbiome studies and using both controlled greenhouse and field studies covering a geographical range of cultivation. Specifically, we have been able to characterize in-depth the compositions of three distinct rhizocompartments (the rhizosphere, rhizoplane, and endosphere) and gain insights into the effects of external factors on each of these compartments. We note that a detailed characterization of plant rhizoplane microbiota in relation to the rhizosphere and the endosphere has not been previously attempted. To achieve this, we successfully adapted protocols for removal of rhizoplane microbes from the endosphere of Arabidopsis roots (9, 10). Because the fractional abundance of organellar reads in the rhizosphere, rhizoplane, and endosphere exhibits a clear increasing gradient (SI Appendix, Fig. S27), we hypothesize that we are isolating the rhizoplane fraction via disruption of the rhizodermis, consistent with direct EM observations on Arabidopsis roots following sonication (9, 10). The fine structure approach we have used combined with depth of sequencing allowed us to analyze over 250,000 OTUs, an order
of magnitude greater than in any single plant species to date.
Under controlled greenhouse conditions, the rhizocompartments described the largest source of variation in the microbial communities sampled (Dataset S5A). The pattern of separation between the microbial communities in each compartment is consistent with a spatial gradient from the bulk soil across the rhizosphere and rhizoplane into the endosphere (Fig. 1C). Similarly, microbial diversity patterns within samples hold the same pattern where there is a gradient in α-diversity from the rhizosphere to the endosphere (Fig. 1B). Enrichment and depletion of certain microbes across the rhizocompartments indicates that microbial colonization of rice roots is not a passive process and that plants have the ability to select for certain microbial consortia or that some microbes are better at filling the root colonizing niche. Similar to studies in Arabidopsis, we found that the relative abundance of Proteobacteria is increased in the endosphere compared with soil, and that the relative abundances of Acidobacteria and Gemmatimonadetes decrease from the soil to the endosphere (9–11), suggesting that the distribution of different bacterial phyla inside the roots might be similar for all land plants (Fig. 1D and Dataset S6). Under controlled greenhouse conditions, soil type described the second largest source of variation within the microbial communities of each sample. However, the soil source did not affect the pattern of separation between the rhizospheric compartments, suggesting that the rhizocompartments exert a recruitment effect on microbial consortia independent of the microbiome source.
By using differential OTU abundance analysis in the compartments, we observed that the rhizosphere serves an enrichment role for a subset of microbial OTUs relative to bulk soil (Fig. 2). Further, the majority of the OTUs enriched in the rhizosphere are simultaneously enriched in the rhizoplane and/or endosphere of rice roots (Fig. 2B and SI Appendix, Fig. S16B), consistent with a recruitment model in which factors produced by the root attract taxa that can colonize the endosphere. We found that the rhizoplane, although enriched for OTUs that are also enriched in the endosphere, is also uniquely enriched for a subset of OTUs, suggesting that the rhizoplane serves as a specialized niche for some taxa. Conversely, the vast majority of microbes depleted in the rhizoplane are also depleted in the endosphere (Fig. 2C and SI Appendix, Fig. S16C), suggesting that the selectivity for colonization of the interior occurs at the rhizoplane and that the rhizoplane may serve an important gating role for limiting microbial penetrance into the endosphere. It is important to note that the community structure we observe in each compartment is likely not simply caused by the plant alone. Microbial community structural differences between the compartments may be attributable also to microbial interactions involving both competition and cooperation.
In the case of field plants, we observed that the largest source
of microbiome separation was due to cultivation site, rather thanthe spatial compartments (Dataset S5 O and P). These resultsare in contrast to the controlled greenhouse experiment wherethe soil effect was the second largest source of variation, sug-gesting the geography may be more important for determiningthe composition of the root microbiome than soil structure alone(Dataset S5A). These results differ from the results in the maizemicrobiome study, where microbial communities showed clearseparation by state but not very much by geographic locationwithin the same state (12). However, we note that in our studythe locations within California were separated by distances of upto ∼125 km, vs. a maximum separation of ∼40 km in the in-trastate locations of the maize study. Other factors that mightaccount for the different results in our study include the numberof field sites examined (eight, vs. three intrastate fields examinedin the maize study), increased sequencing depth, different reso-lution because spatial compartments in maize roots were notseparately analyzed, or possibly intrinsic differences betweencultivated rice and maize.Our design of the field experiment allowed us to test for cul-
tivation practice effects on the rice root-associated microbiome,
Fig. 6. Time-series analysis of root-associated microbial communities reveals distinct microbiome colonization patterns. (A) Ratios of microbial to organellar (plastidial and mitochondrial) 16S rRNA gene reads in the endosphere after transplantation into Davis soil. The 42-d time point corresponds to the earlier greenhouse experiment data (Fig. 1) subsetted to M104 in Davis soil. Mean percentages of the ratios are depicted with each bar. (B) PCoA of the time-series experiment and the greenhouse experiment subsetted to plants growing in Davis soil and colored by rhizospheric compartment. (C) The same PCoA as in B colored by collection day after transplantation into soil. (D) Average relative abundance for select phyla over the course of microbiome acquisition. (E) Average abundance profile of 53 out of the 92 core endosphere-enriched OTUs in each rhizospheric compartment. Error bars represent SE.
Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS. www.pnas.org/cgi/doi/10.1073/pnas.1414592112
Tree from Woese. 1987. Microbiological Reviews 51:221
Example III: rRNA Not Perfect
Lesson 5: Nothing is Perfect
Tree from Woese. 1987. Microbiological Reviews 51:221
Taxa Phylogeny III: rRNA Not Perfect
rRNA Copy # Correction by Phylogeny
Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S Gene Copy Number Information Improves Estimates of Microbial Diversity and Abundance. PLoS Comput Biol 8(10): e1002743. doi:10.1371/journal.pcbi.1002743
Jessica Green@jessicaleegreen
Steven Kembel@stevenkembel
Martin Wu
DNA extraction
PCR
Sequence all genes
Phylogenetic tree
Shotgun
GeneX
E. coli Humans
GeneX
Yeast
GeneX GeneX
Phylotyping
Phylogeny in Shotgun Metagenomics
RecA vs. rRNA
Eisen 1995 Journal of Molecular Evolution 41: 1105-1123.
Lesson 6: Keep Going Back
to Your Past
Phylotyping w/ Protein Markers
AMPHORA
Wu and Eisen. Genome Biology 2008, 9:R151. http://genomebiology.com/2008/9/10/R151
sequences are not conserved at the nucleotide level [29]. As a result, the nr database does not actually contain many more protein marker sequences that can be used as references than those available from complete genome sequences.

Comparison of phylogeny-based and similarity-based phylotyping
Although our phylogeny-based phylotyping is fully automated, it still requires many more steps than, and is slower than, similarity based phylotyping methods such as MEGAN [30]. Is it worth the trouble? Similarity based phylotyping works by searching a query sequence against a reference database such as NCBI nr and deriving taxonomic information from the best matches or 'hits'. When species that are closely related to the query sequence exist in the reference database, similarity-based phylotyping can work well. However, if the reference database is a biased sample or if it contains no closely related species to the query, then the top hits returned could be misleading [31]. Furthermore, similarity-based methods require an arbitrary similarity cut-off value to define the top hits. Because individual bacterial genomes and proteins can evolve at very different rates, a universal cut-off that works under all conditions does not exist. As a result, the final results can be very subjective.

In contrast, our tree-based bracketing algorithm places the query sequence within the context of a phylogenetic tree and only assigns it to a taxonomic level if that level has adequate sampling (see Materials and methods [below] for details of the algorithm). With the well sampled species Prochlorococcus marinus, for example, our method can distinguish closely related organisms and make taxonomic identifications at the species level. Our reanalysis of the Sargasso Sea data placed 672 sequences (3.6% of the total) within a P. marinus clade. On the other hand, for sparsely sampled clades such as Aquifex, assignments will be made only at the phylum level. Thus, our phylogeny-based analysis is less susceptible to data sampling bias than a similarity based approach, and it makes
Figure 3. Major phylotypes identified in Sargasso Sea metagenomic data. The metagenomic data previously obtained from the Sargasso Sea was reanalyzed using AMPHORA and the 31 protein phylogenetic markers. The microbial diversity profiles obtained from individual markers are remarkably consistent. The breakdown of the phylotyping assignments by markers and major taxonomic groups is listed in Additional data file 5.
[Bar chart. y-axis: Relative abundance (0 to 0.8). x-axis: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified proteobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Acidobacteria, Thermotogae, Fusobacteria, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified bacteria. One bar series per marker: dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf.]
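The contrast between well sampled and sparsely sampled clades can be illustrated with a small sketch. This is not AMPHORA's bracketing algorithm (the excerpt defers those details to its Materials and methods); it is a simplified stand-in that assigns a query only to the ranks shared by all of its neighbors in the tree. The function name, the four-rank lineages, and the choice of neighbors are all assumptions made for illustration.

```python
def shared_lineage(lineages):
    """Return the taxonomic names shared by every lineage, from the root down."""
    shared = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break  # neighbors disagree at this rank, so stop assigning here
        shared.append(names[0])
    return shared


# Hypothetical lineages (domain, phylum, genus, species) of the reference
# sequences that bracket a query in the tree.
dense_neighbors = [
    ("Bacteria", "Cyanobacteria", "Prochlorococcus", "Prochlorococcus marinus"),
    ("Bacteria", "Cyanobacteria", "Prochlorococcus", "Prochlorococcus marinus"),
]
sparse_neighbors = [
    ("Bacteria", "Aquificae", "Aquifex", "Aquifex aeolicus"),
    ("Bacteria", "Aquificae", "Hydrogenobacter", "Hydrogenobacter thermophilus"),
]

print(shared_lineage(dense_neighbors)[-1])   # species-level assignment
print(shared_lineage(sparse_neighbors)[-1])  # phylum-level assignment only
```

With dense sampling the neighbors agree down to species; with sparse sampling they agree only at the phylum, so the assignment stops there rather than trusting a single top hit.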
Martin Wu
GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Phylogenetic ID of Novel Lineages
Wu et al PLoS One 2011
Dongying Wu
Phylogenetic Diversity of Metagenomes
...typically defined by a sequence similarity threshold) in the sample as equally related. Newer β diversity measures that incorporate phylogenetic information are more powerful because they account for the degree of divergence between sequences (13, 18, 29, 30). Phylogenetic β diversity measures can also be either quantitative or qualitative depending on whether abundance is taken into account. The original, unweighted UniFrac measure (13) is a qualitative measure. Unweighted UniFrac measures the distance between two communities by calculating the fraction of the branch length in a phylogenetic tree that leads to descendants in either, but not both, of the two communities (Fig. 1A). The fixation index (FST), which measures the distance between two communities by comparing the genetic diversity within each community to the total genetic diversity of the communities combined (18), is a quantitative measure that accounts for different levels of divergence between sequences. The phylogenetic test (P test), which measures the significance of the association between environment and phylogeny (18), is typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).

Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are applicable. However, weighted UniFrac has a major advantage over FST because it can be used to combine data in which different parts of the 16S rRNA were sequenced (e.g., when nonoverlapping sequences can be combined into a single tree using full-length sequences as guides). We use two different data sets to illustrate how analyses with quantitative and qualitative β diversity measures can lead to dramatically different conclusions about the main factors that structure microbial diversity. Specifically, qualitative measures that disregard relative abundance can better detect effects of different founding populations, such as the source of bacteria that first colonize the gut of newborn mice and the effects of factors that are restrictive for microbial growth such as temperature. In contrast, quantitative measures that account for the relative abundance of microbial lineages can reveal the effects of more transient factors such as nutrient availability.
MATERIALS AND METHODS
Weighted UniFrac. Weighted UniFrac is a new variant of the original unweighted UniFrac measure that weights the branches of a phylogenetic tree based on the abundance of information (Fig. 1B). Weighted UniFrac is thus a quantitative measure of β diversity that can detect changes in how many sequences from each lineage are present, as well as detect changes in which taxa are present. This ability is important because the relative abundance of different kinds of bacteria can be critical for describing community changes. In contrast, the original, unweighted UniFrac (Fig. 1A) is a qualitative β diversity measure because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them).

The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:

$$u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$$

Here, n is the total number of branches in the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences that descend from branch i in communities A and B, respectively, and A_T and B_T are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, A_i and B_i are divided by A_T and B_T.
If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the equation as follows:

$$D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right)$$

Here, d_j is the distance of sequence j from the root, A_j and B_j are the numbers of times the sequences were observed in communities A and B, respectively, and A_T and B_T are the total numbers of sequences from communities A and B, respectively.
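The two equations above can be checked by hand on a toy tree. The sketch below is a minimal illustration, not the published implementation: the tree ((t1:1, t2:1):0.5, t3:2), the sequence counts, and the function names are all made up for this example, and the tree is supplied as precomputed per-branch and per-tip tuples rather than parsed from a file.

```python
def raw_weighted_unifrac(branches, a_total, b_total):
    """u: sum over branches of length * |A_i/A_T - B_i/B_T|."""
    return sum(length * abs(a / a_total - b / b_total)
               for length, a, b in branches)


def scaling_factor(tips, a_total, b_total):
    """D: sum over sequences of root distance * (A_j/A_T + B_j/B_T)."""
    return sum(dist * (a / a_total + b / b_total)
               for dist, a, b in tips)


# Hypothetical tree ((t1:1, t2:1):0.5, t3:2).
# Community A holds 3 x t1 and 1 x t2; community B holds 1 x t2 and 3 x t3.
A_TOTAL, B_TOTAL = 4, 4
branches = [        # (branch length, A descendants, B descendants)
    (1.0, 3, 0),    # branch to t1
    (1.0, 1, 1),    # branch to t2
    (0.5, 4, 1),    # internal branch above t1 and t2
    (2.0, 0, 3),    # branch to t3
]
tips = [            # (distance from root, A count, B count)
    (1.5, 3, 0),    # t1
    (1.5, 1, 1),    # t2
    (2.0, 0, 3),    # t3
]

u = raw_weighted_unifrac(branches, A_TOTAL, B_TOTAL)
D = scaling_factor(tips, A_TOTAL, B_TOTAL)
print(u, D, u / D)
```

Here u = 0.75 + 0 + 0.375 + 1.5 = 2.625 and D = 1.125 + 0.75 + 1.5 = 3.375, so the normalized value is 7/9. Two communities with identical relative abundances give u = 0, and two communities sharing no lineages give u/D = 1.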
Clustering with normalized u values treats each sample equally instead of
TABLE 1. Measurements of diversity
Measure | Measurement of α diversity | Measurement of β diversity
Only presence/absence of taxa considered | Qualitative (species richness) | Qualitative
Additionally accounts for the no. of times that each taxon was observed | Quantitative (species richness and evenness) | Quantitative
FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
VOL. 73, 2007 PHYLOGENETICALLY COMPARING MICROBIAL COMMUNITIES 1577
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
Jessica Green
Steven Kembel
Katie Pollard
Phylosift/pplacer Workflow
Input Sequences
rRNA workflow
protein workflow
profile HMMs used to align candidates to reference alignment
Taxonomic Summaries
parallel option
hmmalign multiple alignment
LAST fast candidate search
pplacer phylogenetic placement
LAST fast candidate search
LAST fast candidate search
search input against references
hmmalign multiple alignment
hmmalign multiple alignment
Infernal multiple alignment
LAST fast candidate search
<600 bp
>600 bp
Sample Analysis & Comparison
Krona plots, Number of reads placed
for each marker gene
Edge PCA, Tree visualization, Bayes factor tests
each input sequence scanned against both workflows
Aaron Darling @koadman
Erik Matsen @ematsen
Holly Bik @hollybik
Guillaume Jospin @guillaumejospin
Darling AE, Jospin G, Lowe E, Matsen FA IV, Bik HM, Eisen JA. (2014) PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2:e243 http://dx.doi.org/10.7717/peerj.243
Erik Lowe
Whole Genome Tree of 2000 Taxa
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees and Supermatrices. PLoS ONE 8(4): e62510. doi:10.1371/journal.pone.0062510
Jenna Lang@jennnomics
Aaron Darling@koadman
Phylosift Markers
• PMPROK – Dongying Wu’s Bac/Arch markers
• Eukaryotic Orthologs – Parfrey 2011 paper
• 16S/18S rRNA
• Mitochondria - protein-coding genes
• Viral Markers – Markov clustering on genomes
• Codon Subtrees – finer scale taxonomy
• Extended Markers – plastids, gene families
PhyEco Markers
Phylogenetic group Genome Number Gene Number Marker Candidates
Archaea 62 145415 106
Actinobacteria 63 267783 136
Alphaproteobacteria 94 347287 121
Betaproteobacteria 56 266362 311
Gammaproteobacteria 126 483632 118
Deltaproteobacteria 25 102115 206
Epsilonproteobacteria 18 33416 455
Bacteroides 25 71531 286
Chlamydiae 13 13823 560
Chloroflexi 10 33577 323
Cyanobacteria 36 124080 590
Firmicutes 106 312309 87
Spirochaetes 18 38832 176
Thermi 5 14160 974
Thermotogae 9 17037 684
Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE 8(10): e77033. doi:10.1371/journal.pone.0077033
Edge PCA: Identify lineages that explain most variation among samples
Edge PCA - Matsen and Evans 2013
Output: Edge PCA
QIIME Phylotyping and Phylogenetic Ecology
Fig. S6. A set of 96 OTUs mainly consisting of Proteobacteria is enriched across every compartment in the greenhouse experiment. (A) Number of OTUs and the phyla and classes they belong to that are enriched across all rhizocompartments in the greenhouse experiment. (B) A subset of the Proteobacteria and the classes and families they belong to in the OTUs that are enriched across all rhizocompartments in the greenhouse.
https://evomics.org/2014/01/the-glories-of-the-gut-ask-a-fat-mouse/
Lesson 7: Don’t Accept
When You Are Defeated
Example IV: Functional Evolution
My Study Organisms
Tree from Woese. 1987. Microbiological Reviews 51:221
1st Genome Sequence
Fleischmann et al. 1995
TIGR Genome Projects
Tree from Woese. 1987. Microbiological Reviews 51:221
1st Genome Sequence
Fleischmann et al. 1995
Lesson 8: If you can’t beat them, critique
them or join them
• Leveraging an understanding of the evolution of function to better predict functions
Function & Phylogeny
PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure. METHOD steps, in order: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. EXAMPLE A (genes 1 to 6) and EXAMPLE B (genes 1A to 3B, from Species 1, 2, and 3) show gene trees with inferred duplications marked and one case labeled Ambiguous, alongside the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN).]
Based on Eisen, 1998 Genome Res 8: 163-167.
Phylogenomics
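The method steps in the figure (overlay known functions onto the gene tree, then infer the likely function of the gene of interest) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure from Eisen 1998: the tree is a nested tuple, the gene names follow the figure's 1A to 3B labels, the function labels are hypothetical, and the rule used (take the function shared by annotated genes in the smallest clade that contains the query and has any annotation) is just one simple way to read a tree.

```python
def leaves(tree):
    """List the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def predict_function(tree, query, known):
    """Infer the query's function from its closest annotated relatives.

    Assumes the query is a leaf of the tree (otherwise next() raises).
    """
    # Collect the clades that contain the query, from the root downward.
    path = []
    node = tree
    while not isinstance(node, str):
        path.append(node)
        node = next(child for child in node if query in leaves(child))
    # Walk back up from the smallest clade until an annotated relative appears.
    for clade in reversed(path):
        functions = {known[g] for g in leaves(clade) if g != query and g in known}
        if len(functions) == 1:
            return functions.pop()
        if len(functions) > 1:
            return "ambiguous"
    return None


# Hypothetical gene tree: a duplication produced the A and B subfamilies.
gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known = {"1A": "function A", "3A": "function A",
         "1B": "function B", "3B": "function B"}

print(predict_function(gene_tree, "2A", known))
print(predict_function(gene_tree, "2B", known))
```

Gene 2A is predicted from its A-subfamily neighbors and 2B from its B-subfamily neighbors, even though all six genes are homologs. When the closest annotated relatives disagree, the sketch reports "ambiguous" instead of picking one.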
Lesson 9: If you invent your own omics word,
you are stuck with it so use it for
branding
Phylogenomics ~~ Phylotyping
Eisen et al. 1992. J. Bact. 174: 3416
Lesson 10: Stealing (with
acknowledgement) is OK
Proteorhodopsin Functional Diversity
Venter et al., Science 304: 66. 2004
• Leveraging understanding of gene gain and loss to better predict genome functions
Lesson 11: Who you hang out
with matters
Carboxydothermus hydrogenoformans
• Isolated from a Russian hot spring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (carbon monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome determined (Wu et al. 2005 PLoS Genetics 1: e65.)
Homologs of Sporulation Genes
Wu et al. 2005 PLoS Genetics 1: e65.
Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.
Non-Homology Predictions: Phylogenetic Profiling
• Step 1: Search all genes in organisms of interest against all other genomes
• Ask: Yes or No, is each gene found in each other species
• Cluster genes by distribution patterns (profiles)
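The three steps above can be sketched as follows. This is a minimal illustration, assuming the search step has already been run and reduced to a set of genomes in which each gene has a hit; the gene and genome names are hypothetical, and grouping by identical profile is the simplest possible clustering (real analyses would typically tolerate near-matches).

```python
from collections import defaultdict


def build_profiles(gene_hits, genomes):
    """Turn per-gene hit sets into yes/no presence profiles across genomes."""
    return {gene: tuple(genome in hits for genome in genomes)
            for gene, hits in gene_hits.items()}


def cluster_by_profile(profiles):
    """Group genes that share exactly the same presence/absence profile."""
    clusters = defaultdict(list)
    for gene, profile in profiles.items():
        clusters[profile].append(gene)
    return dict(clusters)


# Hypothetical search results: genomes in which each gene has a homolog.
genomes = ["sporulator_1", "sporulator_2", "nonsporulator_1"]
gene_hits = {
    "geneA": {"sporulator_1", "sporulator_2"},
    "geneB": {"sporulator_1", "sporulator_2"},
    "geneC": {"sporulator_1", "sporulator_2", "nonsporulator_1"},
}

clusters = cluster_by_profile(build_profiles(gene_hits, genomes))
for profile, genes in clusters.items():
    print(profile, genes)
```

Genes A and B share a profile restricted to the sporulating genomes, so they cluster together as candidates for a shared role, while the universally present gene C falls in its own group.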
Sporulation Gene Profile
Wu et al. 2005 PLoS Genetics 1: e65.
B. subtilis new sporulation genes
J Bacteriol. 2013 Jan;195(2):253-60. doi: 10.1128/JB.01778-12
Bjorn Traag
Richard Losick
Tree from Woese. 1987. Microbiological Reviews 51:221
Example V: More Gaps
Lesson 12: Keep Returning to the Same Theme Over and Over
and Over
Yet Another Map
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
Genomes Poorly Sampled
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
TIGR Tree of Life Project
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
Genomic Encyclopedia of Bacteria & Archaea
Wu et al. 2009 Nature 462, 1056-1060
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
Family Diversity vs. PD
Wu et al. 2009 Nature 462, 1056-1060
GEBA Cyanobacteria
Shih et al. 2013. PNAS 10.1073/pnas.1217107110
light-harvesting strategies. The majority of cyanobacteria absorb light mainly with soluble pigment–protein complexes called phycobilisomes, in contrast to eukaryotes, which use membrane-bound light-harvesting complexes (LHCs). However, an increasing number of transmembrane proteins involved in cyanobacterial light harvesting are being identified, such as Pcb and IsiA (22, 23). These proteins are analogous in function to eukaryotic LHCs. Because of the growing number of proteins and names, an overarching nomenclature has been proposed to name this protein family the chlorophyll binding proteins (CBPs), which are characterized by six transmembrane helices and the ability to bind chlorophyll (24).

With the increase in number and diversity of genomes, we find that CBPs are widely distributed across the cyanobacterial phylum: 67% (84 of 126) of cyanobacterial genomes have, in addition to the phycobilisomes, genes that putatively function as membrane-bound light-harvesting proteins. In our phylogenetic analysis, the increase in sequence diversity reveals strong support for various subclades that we have provisionally named CBPIV, -V, and -VI (Fig. 3A and SI Appendix, Fig. S5). Although not yet experimentally demonstrated, members of CBPIV, -V, and -VI are expected to bind chlorophyll because they contain positionally conserved histidine and glutamine residues that ligate chlorophyll in confirmed chlorophyll-binding CBPs (SI Appendix, Fig. S6). Some of these proteins, such as CBPIV, have previously been annotated as PsbC homologs (25), because all CBP proteins are thought to have a common evolutionary origin with the psbC gene (24). Because of the vast enrichment of cyanobacterial protein sequences, the increase from two to six known CBPVI sequences augments phylogenetic resolution (bootstrap support of 85%), allowing us to more confidently assert that there is a separate and distinct CBPVI subfamily. On the basis of our phylogenetic analysis of the CBP family, and consistent with previous studies (26), there seems to be a substantial amount of gene duplication and horizontal gene transfer among CBPIV, -V, and -VI. In some genomes, CBPIV and CBPV are found in a gene cluster with other CBP proteins, including IsiA (Fig. 3C), suggestive of the potential for lateral transfer of gene clusters encoding light-harvesting proteins, as documented in marine cyanobacteria (27). Interestingly, many proteins of the CBPV clade also contain a C-terminal extension (SI Appendix, Fig. S7) with homology to the PsaL subunit of photosystem I (PSI). Notably, two distinct subclades within the CBPV family seem to have independently lost the PsaL domains, reflecting the modularity of this C-terminal extension. Homology modeling and insertion of the PsaL-like domain into the PSI structure (Fig. 3B and SI Appendix, Fig. S8) suggests how the CBPV protein could theoretically be incorporated as an ancillary light-harvesting polypeptide into a monomeric, but not trimeric, PSI. Although scattered observations of members of these CBP protein clades
[Figure labels: subclades A, B1, B2, B3, C1, C2, C3, D, E, F, G; plastid lineages Paulinella, Glaucophyte, Green, Red, Chromalveolates; scale bar 0.3; panels A and B.]
Fig. 2. Implications on plastid evolution. (A) Maximum-likelihood phylogenetic tree of plastids and cyanobacteria, grouped by subclades (Fig. 1). The red dot (bootstrap support = 97%) represents the primary endosymbiosis event that gave rise to the Archaeplastida lineage, made up of Glaucophytes (orange), Rhodophytes (red), Viridiplantae (green), and Chromalveolates (brown). The independent primary endosymbiosis in the amoeba Paulinella chromatophora is shown in purple. (B) Number of predicted eukaryotic, nuclear genes transferred from a cyanobacterial endosymbiont. Colors correspond to the lineage organisms as above. Light and dark shades of colors represent before and after adding the CyanoGEBA genomes, respectively.
Haloarchaeal GEBA-like
Lynch et al. (2012) PLoS ONE 7(7): e41389. doi:10.1371/journal.pone.0041389
The Dark Matter of Biology
From Wu et al. 2009 Nature 462, 1056-1060
Number of SAGs from Candidate Phyla
Site | OD1 | OP11 | OP3 | SAR406
Site A: Hydrothermal vent | 4 | 1 | - | -
Site B: Gold Mine | 6 | 13 | 2 | -
Site C: Tropical gyres (Mesopelagic) | - | - | - | 2
Site D: Tropical gyres (Photic zone) | 1 | - | - | -
Sample collections at 4 additional sites are underway.
Phil Hugenholtz
GEBA Uncultured
JGI Dark Matter Project
environmental samples (n=9)
isolation of single cells (n=9,600)
whole genome amplification (n=3,300)
SSU rRNA gene based identification (n=2,000)
genome sequencing, assembly and QC (n=201)
draft genomes (n=201)
[Figure residue; recoverable content follows.]
Sampling sites: SAK, HSM, ETL, TG, HOT, GOM, GBS, EPR, TA
Sample types: seawater, brackish/freshwater, hydrothermal, sediment, bioreactor
Tree of BACTERIA and ARCHAEA, including candidate phyla with proposed names: WS3 (Latescibacteria), OP3 (Omnitrophica), NKB19 (Hydrogenedentes), EM19 (Calescamantes), CD12 (Aerophobetes), OP8 (Aminicenantes), OP9 (Atribacteria), WWE1 (Cloacamonetes), OP1 (Acetothermia), GN02 (Gracilibacteria), OD1 (Parcubacteria), OP11 (Microgenomates), DSEG (Aenigmarchaea), pMC2A384 (Diapherotrites)
Panels:
archaeal toxins (Nanoarchaea)
lytic murein transglycosylase; murein (peptidoglycan)
stringent response (Diapherotrites, Nanoarchaea)
sigma factor (Diapherotrites, Nanoarchaea)
oxidoreductase; HGT from Eukaryotes (Nanoarchaea)
archaeal type purine synthesis (Microgenomates)
UGA recoded for Gly (Gracilibacteria)
Woyke et al. Nature 2013.
A Genomic Encyclopedia of Microbes (GEM)
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
Tetrahymena Genome Project
Tree from Woese. 1987. Microbiological Reviews 51:221
Example VI: Beyond Sequence
Lesson 13: Don’t Overdo It
With That Theme
DNA extraction
PCR
Sequence all genes
Shotgun
Shotgun Metagenomics
Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
Phylogenetic Binning
HiC Crosslinking & Sequencing
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2:e415 http://dx.doi.org/10.7717/peerj.415
Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the synthetic microbial community are shown before and after filtering, along with the percent of total constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon, species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.
Sequence Alignment % of Total Filtered % of aligned Length GC #R.S.
Lac0 10,603,204 26.17% 10,269,562 96.85% 2,291,220 0.462 629
Lac1 145,718 0.36% 145,478 99.84% 13,413 0.386 3
Lac2 691,723 1.71% 665,825 96.26% 35,595 0.385 16
Lac 11,440,645 28.23% 11,080,865 96.86% 2,340,228 0.46 648
Ped 2,084,595 5.14% 2,022,870 97.04% 1,832,387 0.373 863
BL21 12,882,177 31.79% 2,676,458 20.78% 4,558,953 0.508 508
K12 9,693,726 23.92% 1,218,281 12.57% 4,686,137 0.507 568
E. coli 22,575,903 55.71% 3,894,739 17.25% 9,245,090 0.51 1076
Bur1 1,886,054 4.65% 1,797,745 95.32% 2,914,771 0.68 144
Bur2 2,536,569 6.26% 2,464,534 97.16% 3,809,201 0.672 225
Bur 4,422,623 10.91% 4,262,279 96.37% 6,723,972 0.68 369
Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is shown for read pairs mapping to each chromosome. For each read pair the minimum path length on the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded. The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and plotted.
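The calculation described in this caption can be sketched as below. It is a minimal illustration rather than the authors' code: the function names and the read-pair coordinates are made up, while the 1000 bp cutoff, the 2.5 Mb range, the 100 bins, and the E. coli K12 length of 4,686,137 bp come from the excerpt.

```python
def circular_distance(pos_a, pos_b, chrom_len):
    """Minimum path length between two positions on a circular chromosome."""
    direct = abs(pos_a - pos_b)
    return min(direct, chrom_len - direct)


def insert_distribution(pairs, chrom_len, max_dist=2_500_000,
                        n_bins=100, min_dist=1000):
    """Bin read-pair distances and normalize the bins to sum to 1."""
    counts = [0] * n_bins
    bin_width = max_dist / n_bins
    for pos_a, pos_b in pairs:
        dist = circular_distance(pos_a, pos_b, chrom_len)
        if dist < min_dist or dist >= max_dist:
            continue  # drop pairs closer than the cutoff or beyond the range
        counts[int(dist // bin_width)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


K12_LENGTH = 4_686_137
# Hypothetical read-pair coordinates on the K12 chromosome.
pairs = [
    (0, 500),                   # 500 bp apart: discarded
    (0, 30_000),                # 30 kb apart
    (100, 60_100),              # 60 kb apart
    (0, K12_LENGTH - 40_000),   # 40 kb apart once the wrap-around is used
]
dist = insert_distribution(pairs, K12_LENGTH)
print(circular_distance(10, 4_686_100, K12_LENGTH))
print(dist[1], dist[2])
```

Taking the shorter way around the circle matters for pairs that straddle the linearization point: positions 10 and 4,686,100 are 47 bp apart on the circle, not 4.69 Mb.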
E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1; (Lieberman-Aiden et al., 2009)). We observed a minor depletion of alignments spanning the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137) due to edge effects induced by BWA treating the sequence as a linear chromosome rather than circular.
Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs associating each genomic replicon in the synthetic community is shown as a heat map (see color scale, blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.
reference assemblies of the members of our synthetic microbial community with the same alignment parameters as were used in the top ranked clustering (described above). We first counted the number of Hi-C reads associating each reference assembly replicon (Fig. 2; Table S3), observing that Hi-C data associated replicons within the same species (cell) orders of magnitude more frequently than it associated replicons from different species. The rate of within-species association was 98.8% when ignoring read pairs mapping less than 1,000 bp apart. Including read pairs <1,000 bp inflated this figure to 99.97%. Fig. 3 illustrates this by visualizing the graph of contigs and their associations. Similarly, for the two E. coli strains (K12, BL21) we observed the rate of within-strain association to be 96.36%. When evaluated on genes unique to each strain (where read mapping to each strain would be unambiguous), the self-association rate was observed to be >99%.
We observed that the rate of association of L. brevis plasmids 1 and 2 with each other andwith the L. brevis chromosome was at least 100-fold higher than with the other constituentsof the synthetic community (Fig. 2). Chromosome and plasmid Hi-C contact maps showthat the plasmids associate with sequences throughout the L. brevis chromosome (Fig. 4;Figs. S3–S5) and exhibit the expected enrichment near restriction sites. This demonstratesthat metagenomic Hi-C can be used to associate plasmids to specific strains in microbialcommunities as well as to determine cell co-localization of plasmids with one another.
Variant graph connectednessAlgorithms that reconstruct single-molecule genotypes from samples containing two ormore closely-related strains or chromosomal haplotypes depend on reads or read pairsthat indicate whether pairs of variants coexist in the same DNA molecule. Such algorithms
Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count thereof depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend) with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded. Contig associations were normalized for variation in contig size.
typically represent the reads and variant sites as a variant graph wherein variant sites are represented as nodes, and sequence reads define edges between variant sites observed in the same read (or read pair). We reasoned that variant graphs constructed from Hi-C data would have much greater connectivity (where connectivity is defined as the mean path length between randomly sampled variant positions) than graphs constructed from mate-pair sequencing data, simply because Hi-C inserts span megabase distances. Such connectivity should, in theory, enable more accurate reconstruction of single-molecule genotypes from smaller amounts of data. Furthermore, by linking distant sites with fewer intermediate nodes in the graph, estimates of linkage disequilibrium at distant sites (from a mixed population) are likely to have greater precision.
To evaluate whether Hi-C produces more connected variant graphs we compared the connectivity of variant graphs constructed from Hi-C data to those constructed from simulated mate-pair data (with average inserts of 5 kb, 10 kb, 20 kb, and 40 kb). To exclude paired-end products, Hi-C reads with inserts under 1 kb were removed from the analysis. For each variant graph constructed from these inputs, 10,000 variant position pairs were sampled at random, with 94.75% and 100% of these pairs belonging to the same connected graph component of the Hi-C and 40 kb variant graphs, respectively.
Figure 4 Hi-C contact maps for replicons of Lactobacillus brevis. Contact maps show the number of Hi-C read pairs associating each region of the L. brevis genome. The L. brevis chromosome (Lac0, (A), Spearman rank correlation) and plasmids (Lac1, (B); Lac2, (C)) show enrichment for local associations (bright diagonal band). Interactions between Lac1 and Lac0 (D) and Lac2 and Lac0 (E) are shown. All except Lac0 are log-scaled. Circularity of Lac0 became apparent after transforming data with the Spearman rank correlation (computed for each matrix element between the row and column sharing that element) in place of log transformation (A) indicated by the high number of contacts between the ends of the sequence. In all plots, pixels are sized to represent interactions between blocks sized at 1% of the interacting genomes. The number of HindIII restriction sites in each region of sequence is shown as a histogram on the left and top of each panel.
These rates fell to 6.21%, 16.6%, and 32.38% for the 5 kb, 10 kb, and 20 kb mate-pair variant graphs, respectively (Table 3).
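One reading of the Spearman transformation mentioned in the Figure 4 caption is that each element (i, j) of a square contact matrix is replaced by the rank correlation between row i and column j. The sketch below implements that reading in plain Python; the function names are illustrative, and assigning average ranks to ties and returning 0 for constant rows are assumptions rather than details taken from the paper.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman_transform(matrix):
    """Element (i, j) becomes the Spearman correlation of row i and column j.

    Assumes a square matrix (e.g. a replicon's contacts with itself).
    """
    row_ranks = [average_ranks(row) for row in matrix]
    col_ranks = [average_ranks(col) for col in zip(*matrix)]
    return [[pearson(r, c) for c in col_ranks] for r in row_ranks]
```

Because the transform compares whole contact profiles rather than single counts, two bins at opposite ends of the linearized sequence score highly when they share neighbours, which is consistent with the circularity signal the caption describes.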
Across conditions, variant graphs differed in terms of their connectivity, with Hi-C graphs showing the greatest connectivity. Despite having simulated an equal number of reads for each mate-pair distance, the number of variant positions linked by such reads differed across conditions. We observed that the variant graph derived from Hi-C data (>1 kb inserts, no alignment filtering), despite having the lowest number of variant links, had the lowest mean and maximum path length (5.47, 11; Table 3). Path length was not correlated with distance within Hi-C variant graphs, in contrast to the mate-pair conditions (Fig. 5). The lengths of paths between variant pairs in the mate-pair graphs did increase with distance, reaching maximums of 71, 96, 94, and 111 in the 5 kb, 10 kb, 20 kb, and 40 kb cases, respectively. We further examined the effect of alignment quality and completeness filtering and observed that in the latter case such filtering vastly reduced the rate at which variant positions occur within the same connected graph component.
DISCUSSION
This study demonstrates that Hi-C sequencing data provide valuable information for metagenome analyses that is not currently obtainable by other methods. By applying Hi-C to a synthetic microbial community we showed that genomic DNA was associated
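The connectivity comparison can be illustrated with a small sketch: build an undirected graph from variant links, sample random variant pairs, and report the fraction that fall in the same connected component together with the mean and maximum shortest-path length. The function names, the input format, and the use of breadth-first search are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict, deque


def build_variant_graph(links):
    """Undirected graph: nodes are variant positions, edges come from reads
    (or read pairs) that cover both variants."""
    graph = defaultdict(set)
    for a, b in links:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def path_length(graph, start, goal):
    """Number of edges on a shortest path, or None if the nodes are unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def connectivity_summary(graph, variants, n_samples=10_000, seed=0):
    """Sample variant pairs; return (fraction connected, mean path, max path)."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_samples):
        a, b = rng.sample(variants, 2)  # variants must be a list or tuple
        d = path_length(graph, a, b)
        if d is not None:
            lengths.append(d)
    frac = len(lengths) / n_samples
    mean_len = sum(lengths) / len(lengths) if lengths else float("nan")
    max_len = max(lengths) if lengths else None
    return frac, mean_len, max_len
```

In a chain-like graph, which is what short-insert links tend to produce, path length grows with genomic distance; a few long-range links shorten paths across the whole graph, which matches the contrast drawn between the Hi-C and mate-pair conditions.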
Chris Beitel @datscimed
Aaron Darling @koadman
Sequence Isn’t Everything
PB-PSB1 (Purple sulfur bacteria)
PB-SRB1 (Sulfate reducing bacteria)
(sulfate)
(sulfide)
Wilbanks, E.G. et al (2014). Environmental Microbiology
Lizzy Wilbanks @lizzywilbanks
12C, 12C14N, 32S
Biomass (RGB composite)
0.044 0.080
34S-incorporation (34S/32S ratio)
Wilbanks, E.G. et al (2014). Environmental Microbiology
Transfer of 34S from SRB to PSB
Long Reads Help, A Lot
HiSeq & MiSeq
100-250 bp
Moleculo
2-20 kb
PacBio RSII
2-20 kb
Micky Kertesz, Tim Blauwcamp
Meredith Ashby, Cheryl Heiner
Illumina-based “synthetic long reads”
Real-time single molecule sequencing
(p4-c2, p5-c3)
295 Megabases 474 Megabases 61 Gigabases
Light-responsive sulfate reducer?
rhodopsin
w/ Susumu Yoshizawa
Lesson 14: Asking for, and getting, help is a good thing
Seagrass Microbiome
>1000 samples collected. Not a blade of seagrass touched.
YEAR ONE
ZEN (Zostera Experimental Network)
25 partner sites leaves, roots, sediment, and water samples
MICROBES
Acknowledgements
• GEBA:
  • $$: DOE-JGI, DSMZ
  • Eddy Rubin, Phil Hugenholtz, Hans-Peter Klenk, Nikos Kyrpides, Tanya Woyke, Dongying Wu, Aaron Darling, Jenna Lang
• GEBA Cyanobacteria
  • $$: DOE-JGI
  • Cheryl Kerfeld, Dongying Wu, Patrick Shih
• Haloarchaea
  • $$$ NSF
  • Marc Facciotti, Aaron Darling, Erin Lynch
• Phylosift
  • $$$ DHS
  • Aaron Darling, Erik Matsen, Holly Bik, Guillaume Jospin
• iSEEM:
  • $$: GBMF
  • Katie Pollard, Jessica Green, Martin Wu, Steven Kembel, Tom Sharpton, Morgan Langille, Guillaume Jospin, Dongying Wu
• aTOL
  • $$: NSF
  • Naomi Ward, Jonathan Badger, Frank Robb, Martin Wu, Dongying Wu
• Others (not mentioned in detail)
  • $$: NSF, NIH, DOE, GBMF, DARPA, Sloan
  • Frank Robb, Craig Venter, Doug Rusch, Shibu Yooseph, Nancy Moran, Colleen Cavanaugh, Josh Weitz
• EisenLab: Srijak Bhatnagar, Russell Neches, Lizzy Wilbanks, Holly Bik