UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Upload: jonathan-eisen

Post on 10-May-2015

DESCRIPTION

UC Davis EVE161 Lecture 7 by Jonathan Eisen @phylogenomics

TRANSCRIPT

Page 1: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

Lecture 7:

EVE 161: Microbial Phylogenomics

Lecture #7:

Era II: rRNA sequencing and analysis

UC Davis, Winter 2014

Instructor: Jonathan Eisen

Page 2: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Where we are going and where we have been

• Previous lecture: 6: Era II: PCR and major groups

• Current lecture: 7: Era II: rRNA sequencing and analysis

• Next lecture: 8: Era II: rRNA ecology

Page 11: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

All Analysis Should Be Guided by Goals

• Taxonomic assignment for sequences (i.e., what type of organism is the sequence from)
– Best via phylogenetic analysis of sequences
– Sometimes done with BLAST

• Ecological characterization of community
– Grouping into species / classifying
– Have we sampled enough?
– Number of species
– Relative abundance

• Comparisons between communities
– Taxonomy
– Ecological metrics

• Phylogenetic diversity

Page 12: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

All Analysis Should Be Guided by Goals

• Other Goals from rRNA analysis?

Page 13: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

rRNA Workflow

• General workflow
– Sample collection and DNA extraction
– rRNA PCR
– Sequence
– Alignment
– Cluster sequences into groups (known as operational taxonomic units or OTUs)
– Measure relative abundance of OTU by # of sequences in that group
– Try to assign a taxonomy to each OTU

• Caveats
– Copy number varies extensively
– Not all organisms amplified
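
The clustering and relative-abundance steps of the workflow above can be sketched in a few lines. This is a minimal illustration, not the method used in the course: it assumes the reads are already aligned to equal length, uses a simple greedy centroid scheme, and takes the identity threshold as a free parameter. The function names, the toy reads, and the 0.9 threshold are all hypothetical.

```python
from collections import Counter


def identity(seq_a, seq_b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


def greedy_otu_clustering(aligned_seqs, threshold):
    """Assign each sequence to the first OTU whose centroid it matches at >= threshold."""
    centroids = []    # one representative sequence per OTU
    assignments = []  # OTU index for each input sequence, in input order
    for seq in aligned_seqs:
        for otu_index, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignments.append(otu_index)
                break
        else:
            # No existing centroid is similar enough: start a new OTU.
            centroids.append(seq)
            assignments.append(len(centroids) - 1)
    return assignments


def relative_abundance(assignments):
    """Relative abundance of each OTU = share of sequences assigned to it."""
    counts = Counter(assignments)
    total = len(assignments)
    return {otu: n / total for otu, n in counts.items()}


# Hypothetical toy reads (already aligned, length 10).
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTACGTAC"]
assignments = greedy_otu_clustering(reads, threshold=0.9)
abundances = relative_abundance(assignments)
print(assignments, abundances)
```

Because of the caveats listed on the slide (rRNA copy number varies, and not all organisms amplify), these sequence shares are best read as relative abundance of sequences, not of cells.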

Page 14: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

What to Actually Measure in the Microbiome

• Lists
– Taxa
– Genes

• Summary statistics
– Alpha diversity = within sample
– Beta diversity = between samples
– (and hope these reflect something about functional properties)

• Estimation vs. measurement

Page 15: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

rRNA PCR

Page 16: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

rRNA PCR

Page 17: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Page 18: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Degenerate PCR

Conserved sequence shared by all species

5’-TWCGTSGARCTGCACGGVACCGGYAC-3’

* = ambiguities in the sequence (W, S, R, V, Y)

IUPAC degeneracies:

W = A or T
S = G or C
R = A or G
V = C or G or A
Y = C or T

2*2*2*3*2 = 48 different primer sequences
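
The count of 48 can be checked by expanding the degenerate primer programmatically. This is a minimal sketch: the lookup table covers only the IUPAC codes that appear on the slide plus the four plain bases, and the function names are hypothetical.

```python
from itertools import product

# Only the IUPAC codes used on the slide, plus the four unambiguous bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "GC", "R": "AG", "V": "ACG", "Y": "CT",
}


def degeneracy(primer):
    """Number of distinct sequences encoded by a degenerate primer."""
    total = 1
    for base in primer:
        total *= len(IUPAC[base])
    return total


def expand(primer):
    """List every concrete primer sequence encoded by the degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]


primer = "TWCGTSGARCTGCACGGVACCGGYAC"
variants = expand(primer)
print(degeneracy(primer), len(variants))
```

A degenerate primer ordered from a vendor is in effect this pool of sequences, so each individual sequence is present at only a fraction of the total primer concentration.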

Page 19: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Alignment

Page 20: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Page 21: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

rRNA OTUs

Page 22: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Clustering (and picking OTUs)

singletons

Page 23: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Clustering (and picking OTUs)

Page 24: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Clustering (and picking OTUs)

Page 25: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

rRNA phylogenetic trees

Page 26: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Diversity 1: Alpha Diversity

• Alpha diversity is (basically) a measure of the diversity within a single sample

• Types of alpha diversity
– Total # of species = richness
– Phylogenetic diversity of species = PD
– Total # of genes = genetic richness
– Phylogenetic diversity of genes = genetic PD

Page 27: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Rarefaction Curves

[...] person’s mouth are predicted to yield Chao1 estimates that fall within this range. Because the CIs overlap, one cannot reject the null hypothesis at the significance level of 0.05 that there is no difference between the richness of the mouth and gut communities. The CIs do not address how close the estimates are to the true total richness (i.e., bias) or whether these samples are representative of other people’s mouths or guts.

Another question is how much more sampling is needed to detect a significant difference between two estimates, which in this case differ by only 12 OTUs. The range of the CIs initially increases with sample size, peaks, and then decreases exponentially. To obtain a rough idea of how much further sampling would be needed to detect a statistically significant difference, we estimated the size of the CIs for larger samples by extrapolating from the decreasing portion of these curves. Negative exponential curves for both the mouth [$f(x) = 270e^{-0.0046x}$] and gut [$f(x) = 120e^{-0.0026x}$] data fit well ($r^2 = 0.90$ and $r^2 = 0.87$, respectively). From these curves, it appears that a sample of about 1,000 clones (four times the original number) would be needed to detect a significant difference between these communities (Fig. 5).
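
As a rough numerical check of the extrapolation described in the paragraph above, the two fitted curves can be evaluated directly. This is a sketch under one assumption: that each fitted function gives the average size of the 95% CI (in OTUs) as a function of the number of clones sampled. The function names are hypothetical.

```python
import math


def ci_size_mouth(n_clones):
    """Fitted negative exponential for the mouth data: 270 * exp(-0.0046 * x)."""
    return 270 * math.exp(-0.0046 * n_clones)


def ci_size_gut(n_clones):
    """Fitted negative exponential for the gut data: 120 * exp(-0.0026 * x)."""
    return 120 * math.exp(-0.0026 * n_clones)


# Evaluate the fitted curves at the original and at larger sample sizes.
for n in (264, 500, 1000):
    print(n, round(ci_size_mouth(n), 1), round(ci_size_gut(n), 1))
```

At 1,000 clones both fitted values fall below the 12-OTU difference between the two Chao1 estimates, which seems consistent with the rough figure quoted in the text.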

Rarefaction curves yield the same pattern of relative diversity as Chao1; significantly more OTUs are observed in the gut sample than the mouth sample (Fig. 6). At the highest shared sample size (264 clones), 79 OTUs are observed in the gut versus 59 OTUs in the mouth, and the 95% CIs do not overlap. As discussed in the previous section, however, rarefaction curves do not address the precision of the observed species richness. Thus, although the rarefaction curves suggest that the gut community is more diverse than the mouth community, we cannot address the statistical significance of this evidence with rarefaction curves.

Aquatic mesocosms. Bohannan and Leibold (unpublished data) sampled bacterial diversity from three outdoor aquatic mesocosms designed to mimic small ponds. The mesocosms varied along a gradient of increasing primary productivity and decreasing eukaryotic algal diversity, and all received the same inoculum. DNA was extracted from samples from each mesocosm, and a region of 16S rDNA was PCR amplified with Bacteria-specific primers, the amplicons were cloned, and the clones were sequenced. The sequences were grouped into OTUs using a definition of 95% similarity.

Bohannan and Leibold sequenced 158, 128, and 174 clones from the low-, intermediate-, and high-productivity mesocosms, respectively. The Chao1 estimates suggest that OTU richness varies positively with productivity. The lowest productivity pond contained 54 OTUs (95% CIs, 42 and 80), the intermediate pond contained 58 OTUs (43 and 90), and the high-productivity pond contained an estimated 95 OTUs (73 and 140). The richness of the high- and low-productivity ponds is significantly different at the 0.10 level (Fig. 7). Furthermore, the Chao1 estimates for the high-productivity pond have not yet stabilized (Fig. 7), suggesting that further sampling will result in a greater difference in richness between the ponds with low and high productivity.

FIG. 4. Chao1 estimates of human mouth (E) and gut (F) bacterial richness as a function of sample size. Error bars are 95% CIs and were calculated with the variance formula derived by Chao (8). The dashed lines are error bars for the mouth. The solid lines are error bars for the gut.

FIG. 5. Average size of the 95% CIs of Chao1 estimates for bacteria in the human mouth (E) and gut (F) as sample size increases. These CIs are the same as in Fig. 4, but only the decreasing portions of the CIs are plotted. The curves are fitted negative exponential curves [mouth, $f(x) = 270e^{-0.0046x}$, $r^2 = 0.90$; gut, $f(x) = 120e^{-0.0026x}$, $r^2 = 0.87$].

FIG. 6. Rarefaction curves of observed OTU richness in human mouth (E) and gut (F) bacterial samples. The error bars are 95% CIs and were calculated from the variance of the number of OTUs drawn in 100 randomizations at each sample size.
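
The rarefaction procedure described in the Fig. 6 legend (repeated random draws at each sample size) can be sketched as follows. This is an illustrative version and not the EstimateS implementation: it subsamples clones without replacement and averages the number of OTUs seen. The clone library, the seed, and the function name are hypothetical.

```python
import random
from statistics import mean


def rarefaction_curve(otu_labels, sample_sizes, n_randomizations=100, seed=0):
    """Mean number of distinct OTUs observed in random subsamples of each size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    curve = {}
    for size in sample_sizes:
        richness = [
            len(set(rng.sample(otu_labels, size)))  # draw without replacement
            for _ in range(n_randomizations)
        ]
        curve[size] = mean(richness)
    return curve


# Hypothetical clone library: one OTU label per sequenced clone.
clones = ["otu1"] * 30 + ["otu2"] * 10 + ["otu3"] * 5 + ["otu4", "otu5", "otu6"]
curve = rarefaction_curve(clones, [5, 10, 20, 40, len(clones)])
for size, otus in curve.items():
    print(size, otus)
```

As the excerpt notes, the spread across these randomizations describes reordering within one collected sample, so it is not a confidence interval on the richness of the community itself.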

VOL. 67, 2001 MINIREVIEW 4403

Page 28: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Observed vs. Estimated Alpha Diversity

[...] problem of not being able to measure bias. (This assumes that the bias of an estimator does not differ so radically among communities that it disrupts the relative order of the estimates. In the absence of alternative evidence, this initial assumption seems appropriate.)

Chao (8) derives a closed-form solution for the variance of $S_{\mathrm{Chao1}}$:

$$\mathrm{Var}(S_{\mathrm{Chao1}}) = n_2\left(\frac{m^4}{4} + m^3 + \frac{m^2}{2}\right), \quad \text{where } m = \frac{n_1}{n_2}$$

This formula estimates the precision of Chao1; that is, it estimates the variance of richness estimates that one expects from multiple samples. A closed-form solution of variance for the ACE has not yet been derived.
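
The variance formula above can be computed alongside the estimator itself. The excerpt does not show the estimator, so this sketch assumes the standard form $S_{\mathrm{Chao1}} = S_{\mathrm{obs}} + n_1^2 / (2 n_2)$, where $n_1$ is the number of OTUs seen exactly once (singletons) and $n_2$ the number seen exactly twice (doubletons). The function name and the example counts are hypothetical.

```python
def chao1_with_variance(otu_counts):
    """Return (Chao1 richness estimate, variance) from per-OTU clone counts."""
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)                     # observed richness
    n1 = sum(1 for c in counts if c == 1)   # singletons
    n2 = sum(1 for c in counts if c == 2)   # doubletons
    if n2 == 0:
        # Assumption of this sketch: refuse rather than pick a correction.
        raise ValueError("this form of Chao1 is undefined without doubletons")
    s_chao1 = s_obs + n1 ** 2 / (2 * n2)
    m = n1 / n2
    variance = n2 * (m ** 4 / 4 + m ** 3 + m ** 2 / 2)
    return s_chao1, variance


# Hypothetical library: 8 observed OTUs, 4 singletons, 2 doubletons.
example_counts = [10, 5, 2, 2, 1, 1, 1, 1]
estimate, variance = chao1_with_variance(example_counts)
print(estimate, variance)
```

The estimate depends only on the observed richness and the rarest classes, which is why the number of singletons gets so much attention when OTUs are picked.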

Comparisons of relative species richness based on rarefaction may seem more reliable than comparisons using extrapolations that require a number of assumptions, but rarefaction is limited for two reasons. First, rarefaction compares samples, not communities. The error bars around a rarefaction curve describe the variation due to reordering of subsamples within the collected sample, not the precision of the observed richness. In contrast, a measure of precision would describe the variation in the number of species expected to be observed if the community were sampled repeatedly. It is possible to estimate the precision of rarefaction curves, for instance, by bootstrapping (20). Error bars derived by this method allow the detection of significant differences in observed richness between communities.

Second, the rank order of observed richness values does not necessarily correspond to relative total richness, because rarefaction analyses do not exclude the possibility that the species accumulation curves cross at a higher sample size (34). In contrast, species richness estimators take the shape of the accumulation curve into account to determine total richness. Thus, in theory these estimators can predict a crossover of the accumulation curves and thereby better predict relative total richness.

CASE STUDIES

In terms of both underlying assumptions and their ability to be evaluated, nonparametric estimators are a promising tool for assessing microbial diversity. To further investigate their potential, we applied these techniques to four microbial data sets. In particular, we compared the use of nonparametric estimators with the rarefaction approach and investigated how the precision of their estimates changes with sample size. These four data sets were among the largest available and represented a range of habitat types and environmental gradients. We came across a number of additional data sets that would also have been appropriate for these analyses (19, 53), although others of comparable size were too diverse to be analyzed with these techniques (5, 45).

The analyses were performed with EstimateS (version 5.0.1; R. Colwell, University of Connecticut [http://viceroy.eeb.uconn.edu/estimates]). For the purposes of inputting data into the program, we treated each cloned sequence as a separate sample. We ran 100 randomizations for all tests. Further randomizations did not change the results.

Human mouth and gut. Two of the best-sampled microbial communities are from human habitats. Kroes et al. (33) sampled subgingival plaque from a human mouth. They used PCR to amplify the bacterial 16S rDNA, created clone libraries from the amplified DNA, and then sequenced 264 clones. Kroes et al. defined an OTU as a 16S rDNA sequence group in which sequences differed by ≤1%. By this definition, they found 59 distinct OTUs from their sample of 264 16S rDNA sequences. Although the accumulation curve does not reach an asymptote, it is not linear (Fig. 3). Thus, we can try to estimate total OTU richness. For these data, the Chao1 estimator levels off at 123 OTUs, suggesting that, after that point, the Chao1 estimate is relatively independent of sample size. In contrast, the ACE does not plateau as sample size increases, indicating that the estimate is not independent of sample size.

Suau et al. (65) investigated the diversity of bacteria in a human gut. Similar to Kroes et al. (33), they amplified, cloned, and sequenced 16S rDNA fragments. Their definition of an OTU differed slightly from that in the Kroes et al. study, however; they define an OTU as a 16S rDNA sequence group in which sequences differed by ≤2%. With this definition, they identified 82 OTUs from 284 clones.

Because the two studies use slightly different definitions of an OTU, the data for the mouth and gut bacteria are not entirely comparable. Their contrast does demonstrate the application of these approaches, however. After an initial increase, the mean Chao1 estimate for both communities is relatively level as sample size increases, and therefore we can compare the estimates at the highest sample size for each community (Fig. 4). We used a log transformation to calculate the confidence intervals (CIs) because the distribution of estimates is not normal (8). Given the OTU definitions, total richness of the mouth and gut bacterial communities is not significantly different, as estimated by Chao1. Chao1 estimates that the mouth community has 123 OTUs (95% CIs, 93 and 180), and the gut community has 135 OTUs (95% CIs, 110 and 170).

What do the CIs say about the Chao1 estimate? The CIs estimate the precision of the richness estimates. In other words, 95% of new samples of 264 clones from the same [...]

FIG. 3. Observed and estimated OTU richness of bacteria in a human mouth (33) versus sample size. The number of OTUs observed for a given sample size, or the accumulation curve, is averaged over 50 simulations (E). Estimated OTU richness is plotted for Chao1 (F) and ACE (Œ) estimators.

4402 MINIREVIEW APPL. ENVIRON. MICROBIOL.

ERRATUM

Counting the Uncountable: Statistical Approaches to Estimating Microbial Diversity

JENNIFER B. HUGHES, JESSICA J. HELLMANN, TAYLOR H. RICKETTS, AND BRENDAN J. M. BOHANNAN

Department of Biological Sciences, Stanford University, Stanford, California 94305-5020

Volume 67, no. 10, p. 4399–4406, 2001. Page 4402, legend to Fig. 3: lines 4 and 5 should read “. . .simulations (Œ). Estimated OTU richness is plotted for Chao1 (!) and ACE (") estimators.”

448 ERRATUM APPL. ENVIRON. MICROBIOL.

Page 29: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Rank Abundance Curves

[...] communities plotted in Fig. 1 are well sampled, and the samples therefore contain considerable information about total richness. The two intermediate curves provide the most telling comparison, however. Even though the moth sample is much larger than the mouth bacteria sample (4,538 versus 264 individuals), the shape of the curves is similar. In other words, the communities have been sampled with roughly equivalent intensity relative to their overall richness.

Another way to compare how well communities have been sampled is to plot their rank-abundance curves. The species are ordered from most to least abundant on the x axis, and the abundance of each type observed is plotted on the y axis. The moth and soil bacteria communities exhibit a similar pattern (Fig. 2), one that is typical of superdiverse communities such as tropical insects. A few species in the sample are abundant, but most are rare, producing the long right-hand tail on the rank-abundance curve.
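
The ordering described in the paragraph above (species ranked from most to least abundant, abundance on the y axis) takes only a few lines to compute. This is a minimal sketch with a hypothetical sample and function name; plotting is left out.

```python
from collections import Counter


def rank_abundance(otu_labels):
    """Abundances ordered from the most to the least abundant type."""
    counts = Counter(otu_labels)
    return [count for _, count in counts.most_common()]


# Hypothetical sample: a few abundant types and a tail of rare ones.
sample = ["a"] * 5 + ["b"] * 2 + ["c", "d", "e"]
abundances_by_rank = rank_abundance(sample)
for rank, abundance in enumerate(abundances_by_rank, start=1):
    print(rank, abundance)
```

Plotting rank against abundance for such a list gives the long right-hand tail of rare types that the text describes.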

If these organisms were sampled on the same spatial scale, there is no doubt that soil bacterial diversity would be higher than moth diversity. These comparisons suggest, however, that our ability to sample bacterial diversity in a human mouth or in a few grams of some soils may be similar to our ability to sample moth diversity in a few hundred square kilometers of tropical forest. Thus, at least for some communities, microbiologists may be able to coopt techniques that ecologists use to estimate and compare the richness of macroorganisms.

Ultimately, microbes, like tropical insects, are too diverse to count exhaustively. While it would be useful to know the actual diversity of different microbial communities, most diversity questions address how diversity changes across biotic and abiotic gradients, such as disturbance, productivity, area, latitude, and resource heterogeneity. The answers to these questions require knowing only relative diversities among sites, over time, and under different treatment regimens. Using this approach, the relationships between insect diversity and many environmental variables have been well studied (50, 57, 63, 64), even though estimates of the total number of insect species range over three orders of magnitude (22, 54).

SOME POSSIBLE TOOLS: RAREFACTION AND RICHNESS ESTIMATORS

A variety of statistical approaches have been developed to compare and estimate species richness from samples of macroorganisms. In this section, we consider the suitability of four approaches for microbial diversity studies.

The first approach, rarefaction, has been adopted recently by a number of microbiologists (4, 19, 40). Rarefaction compares observed richness among sites, treatments, or habitats that have been unequally sampled. A rarefied curve results from averaging randomizations of the observed accumulation curve (25). The variance around the repeated randomizations allows one to compare the observed richness among samples, but it is distinct from a measure of confidence about the actual richness in the communities.

In contrast to rarefaction, richness estimators estimate the total richness of a community from a sample, and the estimates can then be compared across samples. These estimators fall into three main classes: extrapolation from accumulation curves, parametric estimators, and nonparametric estimators (14, 23, 47). To date, we have found only two studies that apply richness estimators to microbial data (33, 43).

Most curve extrapolation methods use the observed accumulation curve to fit an assumed functional form that models the process of observing new species as sampling effort increases. The asymptote of this curve, or the species richness expected at infinite effort, is then estimated. These models include the Michaelis-Menten equation (13, 51) and the negative exponential function (61). The benefit of estimating diversity with such extrapolation methods is that once a species has been counted, it does not need to be counted again. Hence, a surveyor can focus effort on identifying new, generally rarer, species. The downside is that for diverse communities in which [...]

FIG. 1. Accumulation curves for Michigan plants (�; n = 1,783) (26), Costa Rican birds (Œ; n = 5,007) (J. B. Hughes, unpublished data), human oral bacteria (E; n = 264) (33), Costa Rican moths (⇥; n = 4,538) (56), and East Amazonian soil bacteria (F; n = 98) (6). Curves are averaged over 100 simulations using the computer program EstimateS and are standardized for the number of individuals and species observed.

FIG. 2. Rank-abundance curves for (a) tropical moths (n = 4,538) (56) and (b) temperate soil bacteria (n = 137) (39). The two most abundant species of moths (396 and 173 individuals) are excluded from panel a to shorten the y axis.

4400 MINIREVIEW APPL. ENVIRON. MICROBIOL.

Page 30: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Diversity 2: Beta Diversity

• Beta diversity is (basically) a measure of the similarity in diversity between samples

• Types of beta diversity
– Species presence/absence
– Shared phylogenetic diversity
– Gene presence / absence
– Shared phylogenetic diversity of genes

• Frequently used as values for PCA or PCoA analysis
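
One simple way to turn species presence/absence into a between-sample value is the Jaccard distance. The slide does not name a specific metric, so Jaccard is used here purely as an illustration. The sample names, taxa, and function names are hypothetical.

```python
def jaccard_distance(sample_a, sample_b):
    """1 - (shared taxa / total taxa) for two presence/absence samples."""
    a, b = set(sample_a), set(sample_b)
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(a & b) / len(union)


def distance_matrix(samples):
    """Pairwise Jaccard distances, keyed by (sample name, sample name)."""
    names = sorted(samples)
    return {
        (x, y): jaccard_distance(samples[x], samples[y])
        for x in names
        for y in names
    }


# Hypothetical samples listing the taxa detected in each.
samples = {
    "gut": {"Bacteroides", "Faecalibacterium", "Escherichia"},
    "mouth": {"Streptococcus", "Escherichia", "Faecalibacterium", "Veillonella"},
    "soil": {"Streptomyces", "Bacillus"},
}
matrix = distance_matrix(samples)
print(matrix[("gut", "mouth")], matrix[("gut", "soil")])
```

A matrix like this is the kind of input an ordination such as PCoA takes. The phylogenetic variants listed on the slide would additionally need a tree, which this sketch does not cover.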

Page 31: UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics

Variability in Health vs. Disease

Almost all (99.96%) of the phylogenetically assigned genes belonged to the Bacteria and Archaea, reflecting their predominance in the gut. Genes that were not mapped to orthologous groups were clustered into gene families (see Methods). To investigate the functional content of the prevalent gene set we computed the total number of orthologous groups and/or gene families present in any combination of n individuals (with n = 2–124; see Fig. 2c). This rarefaction analysis shows that the ‘known’ functions (annotated in eggNOG or KEGG) quickly saturate (a value of 5,569 groups was observed): when sampling any subset of 50 individuals, most have been detected. However, three-quarters of the prevalent gut functionalities consists of uncharacterized orthologous groups and/or completely novel gene families (Fig. 2c). When including these groups, the rarefaction curve only starts to plateau at the very end, at a much higher level (19,338 groups were detected), confirming that the extensive sampling of a large number of individuals was necessary to capture this considerable amount of novel/unknown functionality.

Bacterial functions important for life in the gut

The extensive non-redundant catalogue of the bacterial genes from the human intestinal tract provides an opportunity to identify bacterial functions important for life in this environment. There are functions necessary for a bacterium to thrive in a gut context (that is, the ‘minimal gut genome’) and those involved in the homeostasis of the whole ecosystem, encoded across many species (the ‘minimal gut metagenome’). The first set of functions is expected to be present in most or all gut bacterial species; the second set in most or all individuals’ gut samples.

To identify the functions encoded by the minimal gut genome we use the fact that they should be present in most or all gut bacterial species and therefore appear in the gene catalogue at a frequency above that of the functions present in only some of the gut bacterial species. The relative frequency of different functions can be deduced from the number of genes recruited to different eggNOG clusters, after normalization for gene length and copy number (Supplementary Fig. 10a, b). We ranked all the clusters by gene frequencies and determined the range that included the clusters specifying well-known essential bacterial functions, such as those determined experimentally for a well-studied firmicute, Bacillus subtilis²⁷, hypothesizing that additional clusters in this range are equally important. As expected, the range that included most of B. subtilis essential clusters (86%) was at the very top of the ranking order (Fig. 5). Some 76% of the clusters with essential genes of Escherichia coli²⁸ were within this range, confirming the validity of our approach. This suggests that 1,244 metagenomic clusters found within the range (Supplementary Table 10; termed ‘range clusters’ hereafter) specify functions important for life in the gut.

We found two types of functions among the range clusters: those required in all bacteria (housekeeping) and those potentially specific for the gut. Among many examples of the first category are the functions that are part of main metabolic pathways (for example, central carbon metabolism, amino acid synthesis), and important protein complexes (RNA and DNA polymerase, ATP synthase, general secretory apparatus). Not surprisingly, projection of the range clusters on the KEGG metabolic pathways gives a highly integrated picture of the global gut cell metabolism (Fig. 6a).

The putative gut-specific functions include those involved in adhesion to the host proteins (collagen, fibrinogen, fibronectin) or in harvesting sugars of the globoseries glycolipids, which are carried on blood and epithelial cells. Furthermore, 15% of range clusters encode functions that are present in <10% of the eggNOG genomes (see Supplementary Fig. 11) and are largely (74.3%) not defined (Fig. 6b). Detailed studies of these should lead to a deeper comprehension of bacterial life in the gut.

To identify the functions encoded by the minimal gut metagenome, we computed the orthologous groups that are shared by individuals of our cohort. This minimal set, of 6,313 functions, is much larger than the one estimated in a previous study⁸. There are only 2,069 functionally annotated orthologous groups, showing that they gravely underestimate the true size of the common functional complement among individuals (Fig. 6c). The minimal gut metagenome includes a considerable fraction of functions (~45%) that are present in <10% of the sequenced bacterial genomes (Fig. 6c, inset). These otherwise rare functionalities that are found in each of the 124 individuals may be necessary for the gut ecosystem. Eighty per cent of these orthologous groups contain genes with at best poorly characterized function, underscoring our limited knowledge of gut functioning.

Of the known fraction, about 5% codes for (pro)phage-related proteins, implying a universal presence and possible important ecological role of bacteriophages in gut homeostasis. The most striking secondary metabolism that seems crucial for the minimal metagenome relates, not unexpectedly, to biodegradation of complex sugars and glycans harvested from the host diet and/or intestinal lining. Examples include degradation and uptake pathways for pectin (and its monomer, rhamnose) and sorbitol, sugars which are omnipresent in fruits and vegetables, but which are not or poorly absorbed by humans. As some gut microorganisms were found to degrade both of them²⁹,³⁰, this capacity seems to be selected for by the gut ecosystem as a non-competitive source of energy. Besides these, capacity to ferment, for example, mannose, fructose, cellulose and sucrose is also part of the minimal metagenome. Together, these emphasize the [...]

[Figure 5 plot: x axis, Cluster rank (1 to 10,001); y axis, Cluster (%) (0 to 40); the "Range" region is marked]

Figure 5 | Clusters that contain the B. subtilis essential genes. The clusters were ranked by the number of genes they contain, normalized by average length and copy number (see Supplementary Fig. 10), and the proportion of clusters with the essential B. subtilis genes was determined for successive groups of 100 clusters. Range indicates the part of the cluster distribution that contains 86% of the B. subtilis essential genes.

[Figure 4 plot: PC1 versus PC2; groups: Healthy, Crohn’s disease, Ulcerative colitis; P value: 0.031]

Figure 4 | Bacterial species abundance differentiates IBD patients and healthy individuals. Principal component analysis with health status as instrumental variables, based on the abundance of 155 species with ≥1% genome coverage by the Illumina reads in at least 1 individual of the cohort, was carried out with 14 healthy individuals and 25 IBD patients (21 ulcerative colitis and 4 Crohn’s disease) from Spain (Supplementary Table 1). Two first components (PC1 and PC2) were plotted and represented 7.3% of whole inertia. Individuals (represented by points) were clustered and centre of gravity computed for each class; P-value of the link between health status and species abundance was assessed using a Monte-Carlo test (999 replicates).

ARTICLES NATURE | Vol 464 | 4 March 2010

62 © 2010 Macmillan Publishers Limited. All rights reserved

Qin et al. 2010. Nature.

Figure 4 | Bacterial species abundance differentiates IBD patients and healthy individuals.

It is possible to go backwards from these patterns to see which taxa or genes drive the clustering patterns