microbial phylogenomics (eve161) class 15: shotgun metagenomics

EVE 161: Microbial Phylogenomics. Era IV: Shotgun Metagenomics. UC Davis, Winter 2016. Instructor: Jonathan Eisen.

Upload: jonathan-eisen

Post on 21-Jan-2018


Page 1: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

EVE 161: Microbial Phylogenomics

Era IV: Shotgun Metagenomics

UC Davis, Winter 2016 Instructor: Jonathan Eisen


Page 2: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics


Environmental Genome Shotgun Sequencing of the Sargasso Sea

J. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3 Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3 Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3 Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6 Michael W. Lomas,6 Ken Nealson,5 Owen White,3 Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6 Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4 Hamilton O. Smith1

We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity.

Microorganisms are responsible for most of the biogeochemical cycles that shape the environment of Earth and its oceans. Yet, these organisms are the least well understood on Earth, as the ability to study and understand the metabolic potential of microorganisms has been hampered by the inability to generate pure cultures. Recent studies have begun to explore environmental bacteria in a culture-independent manner by isolating DNA from environmental samples and transforming it into large insert clones. For example, a previously unknown light-driven proton pump, proteorhodopsin, was discovered within a bacterial artificial chromosome (BAC) from the genome of a SAR86 ribotype (1), and soil microbial DNA libraries have been constructed and screened for specific activities (2).

Here we have applied whole-genome shotgun sequencing to environmental-pooled DNA samples to test whether new genomic approaches can be effectively applied to gene and species discovery and to overall environmental characterization. To help ensure a tractable pilot study, we sampled in the Sargasso Sea, a nutrient-limited, open ocean environment. Further, we concentrated on the genetic material captured on filters sized to isolate primarily microbial inhabitants of the environment, leaving detailed analysis of dissolved DNA and viral particles on one end of the size spectrum and eukaryotic inhabitants on the other, for subsequent studies.

The Sargasso Sea. The northwest Sargasso Sea, at the Bermuda Atlantic Time-series Study site (BATS), is one of the best-studied and arguably most well-characterized regions of the global ocean. The Gulf Stream represents the western and northern boundaries of this region and provides a strong physical boundary, separating the low nutrient, oligotrophic open ocean from the more nutrient-rich waters of the U.S. continental shelf. The Sargasso Sea has been intensively studied as part of the 50-year time series of ocean physics and biogeochemistry (3, 4) and provides an opportunity for interpretation of environmental genomic data in an oceanographic context. In this region, formation of subtropical mode water occurs each winter as the passage of cold fronts across the region erodes the seasonal thermocline and causes convective mixing, resulting in mixed layers of 150 to 300 m depth. The introduction of nutrient-rich deep water, following the breakdown of seasonal thermoclines into the brightly lit surface waters, leads to the blooming of single cell phytoplankton, including two cyanobacteria species, Synechococcus and Prochlorococcus, that numerically dominate the photosynthetic biomass in the Sargasso Sea.

Surface water samples (170 to 200 liters) were collected aboard the RV Weatherbird II from three sites off the coast of Bermuda in February 2003. Additional samples were collected aboard the SV Sorcerer II from “Hydrostation S” in May 2003. Sample site locations are indicated on Fig. 1 and described in table S1; sampling protocols were fine-tuned from one expedition to the next (5). Genomic DNA was extracted from filters of 0.1 to 3.0 μm, and genomic libraries with insert sizes ranging from 2 to 6 kb were made as described (5). The prepared plasmid clones were sequenced from both ends to provide paired-end reads at the J. Craig Venter Science Foundation Joint Technology Center on ABI 3730XL DNA sequencers (Applied Biosystems, Foster City, CA). Whole-genome random shotgun sequencing of the Weatherbird II samples (table S1, samples 1 to 4) produced 1.66 million reads averaging 818 bp in length, for a total of approximately 1.36 Gbp of microbial DNA sequence. An additional 325,561 sequences were generated from the Sorcerer II samples (table S1, samples 5 to 7), yielding approximately 265 Mbp of DNA sequence.

Environmental genome shotgun assembly. Whole-genome shotgun sequencing projects have traditionally been applied to identify the genome sequence(s) from one particular organism, whereas the approach taken here is intended to capture representative sequence from many diverse organisms simultaneously. Variation in genome size and relative abundance determines the depth of coverage of any particular organism in the sample at a given level of sequencing and has strong implications for both the application of assembly algorithms and for the metrics used in evaluating the resulting assembly. Although we would expect abundant species to be deeply covered and well assembled, species of lower abundance may be represented by only a few sequences. For a single genome analysis, assembly coverage depth in unique regions should approximate a Poisson distribution. The mean of this distribution can be estimated from the observed data, looking at the depth of coverage of contigs generated before any scaffolding. The assembler used in this study, the Celera Assembler (6), uses this value to heuristically identify clearly unique regions to form the backbone of the final assembly within the scaffolding phase. However, when the starting material consists of a mixture of genomes of varying abundance, a threshold estimated in this way would classify samples from the most abundant organism(s) as repetitive, due to their greater-than-average depth of coverage, paradoxically leaving the most abundant organisms poorly assembled. We therefore used manual curation of an initial

1 The Institute for Biological Energy Alternatives, 2 The Center for the Advancement of Genomics, 1901 Research Boulevard, Rockville, MD 20850, USA. 3 The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. 4 The J. Craig Venter Science Foundation Joint Technology Center, 5 Research Place, Rockville, MD 20850, USA. 5 University of Southern California, 223 Science Hall, Los Angeles, CA 90089-0740, USA. 6 Bermuda Biological Station for Research, Inc., 17 Biological Lane, St George GE 01, Bermuda.

*To whom correspondence should be addressed. E-mail: [email protected]

RESEARCH ARTICLE

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org 66

Page 3: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity.

Page 4: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Sargasso Sea

assembly to identify a set of large, deeply assembling nonrepetitive contigs. This was used to set the expected coverage in unique regions (to 23×) for a final run of the assembler. This allowed the deep contigs to be treated as unique sequence when they would otherwise be labeled as repetitive. We evaluated our final assembly results in a tiered fashion, looking at well-sampled genomic regions separately from those barely sampled at our current level of sequencing.
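The coverage problem described in this passage can be illustrated with a small Python sketch. It is not the Celera Assembler's actual heuristic; it is a simplified stand-in that flags a contig as "repetitive" when its depth is implausibly high under a Poisson model with a given expected depth. The contig depths, the `alpha` cutoff, and the function names are all hypothetical; only the 23× value comes from the paper.

```python
import math

def poisson_tail(k, mean):
    """Return P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)

def flag_repetitive(depths, expected, alpha=1e-3):
    """Flag contigs whose depth is implausibly high for unique sequence."""
    return [poisson_tail(int(d), expected) < alpha for d in depths]

# Hypothetical per-contig depths: six contigs from rare organisms,
# two from one abundant organism.
depths = [2, 2, 3, 3, 2, 3, 24, 22]

# Single-genome style estimate: mean depth over all contigs.
naive_mean = sum(depths) / len(depths)
naive_flags = flag_repetitive(depths, naive_mean)

# Manually curated expectation (23x is the value reported in the paper).
curated_flags = flag_repetitive(depths, 23)

print("naive:  ", naive_flags)
print("curated:", curated_flags)
```

With the naive mean, the two deep contigs from the abundant organism are flagged as repeats; with the curated expectation they are treated as unique sequence, which is the effect the authors describe.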

The 1.66 million sequences from the Weatherbird II samples (table S1; samples 1 to 4; stations 3, 11, and 13), were pooled and assembled to provide a single master assembly for comparative purposes. The assembly generated 64,398 scaffolds ranging in size from 826 bp to 2.1 Mbp, containing 256 Mbp of unique sequence and spanning 400 Mbp. After assembly, there remained 217,015 paired-end reads, or “mini-scaffolds,” spanning 820.7 Mbp as well as an additional 215,038 unassembled singleton reads covering 169.9 Mbp (table S2, column 1). The Sorcerer II samples provided almost no assembly, so we consider for these samples only the 153,458 mini-scaffolds, spanning 518.4 Mbp, and the remaining 18,692 singleton reads (table S2, column 2). In total, 1.045 Gbp of nonredundant sequence was generated. The lack of overlapping reads within the unassembled set indicates that lack of additional assembly was not due to algorithmic limitations but to the relatively limited depth of sequencing coverage given the level of diversity within the sample.

The whole-genome shotgun (WGS) assembly has been deposited at DDBJ/EMBL/GenBank under the project accession AACY00000000, and all traces have been deposited in a corresponding TraceDB trace archive. The version described in this paper is the first version, AACY01000000. Unlike a conventional WGS entry, we have deposited not just contigs and scaffolds but the unassembled paired singletons and individual singletons in order to accurately reflect the diversity in the sample and allow searches across the entire sample within a single database.

Genomes and large assemblies. Our analysis first focused on the well-sampled genomes by characterizing scaffolds with at least 3× coverage depth. There were 333 scaffolds comprising 2226 contigs and spanning 30.9 Mbp that met this criterion (table S3), accounting for roughly 410,000 reads, or 25% of the pooled assembly data set. From this set of well-sampled material, we were able to cluster and classify assemblies by organism; from the rare species in our sample, we used sequence similarity based methods together with computational gene finding to obtain both qualitative and quantitative estimates of genomic and functional diversity within this particular marine environment.
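The figures quoted for the well-sampled scaffolds can be sanity-checked with a few lines of Python. This is a rough sketch: it assumes the 818 bp average read length and the 1.66 million pooled reads reported elsewhere in the paper apply here without trimming, so the resulting depth is only an order-of-magnitude estimate. The function and variable names are illustrative.

```python
def mean_depth(n_reads, mean_read_bp, span_bp):
    """Average coverage depth = total sequenced bases / bases spanned."""
    return n_reads * mean_read_bp / span_bp

reads_in_deep_scaffolds = 410_000   # "roughly 410,000 reads"
pooled_reads = 1_660_000            # Weatherbird II reads pooled for assembly
span_bp = 30_900_000                # 30.9 Mbp spanned by the 333 scaffolds
mean_read_bp = 818                  # assumption: untrimmed average read length

fraction = reads_in_deep_scaffolds / pooled_reads
depth = mean_depth(reads_in_deep_scaffolds, mean_read_bp, span_bp)

print(f"fraction of pooled reads: {fraction:.1%}")
print(f"approximate mean depth:   {depth:.1f}x")
```

The fraction comes out close to the 25% stated in the text, and the implied average depth is well above the 3× cutoff, consistent with a set dominated by a few deeply covered organisms.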

We employed several criteria to sort the major assembly pieces into tentative organism “bins”; these include depth of coverage, oligonucleotide frequencies (7), and similarity to previously sequenced genomes (5). With these techniques, the majority of sequence assigned to the most abundant species (16.5 Mbp of the 30.9 Mb in the main scaffolds) could be separated based on several corroborating indicators. In particular, we identified a distinct group of scaffolds representing an abundant population clearly related to Burkholderia (fig. S2) and two groups of scaffolds representing two distinct strains closely related to the published Shewanella oneidensis genome (8) (fig. S3). There is a group of scaffolds assembling at over 6× coverage that appears to represent the genome of a SAR86 (table S3). Scaffold sets representing a conglomerate of Prochlorococcus strains (Fig. 2), as well as an uncultured marine archaeon, were also identified (table S3; Fig. 3). Additionally, 10 putative mega plasmids were found in the main scaffold set, covered at depths ranging from 4× to 36× (indicated with shading in table S3 with nine depicted in
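One of the binning criteria named above, oligonucleotide frequency, can be sketched in Python as follows. This is a minimal illustration, not the method of reference (7): it assumes tetranucleotides (k = 4), ignores reverse complements, and compares profiles with plain Euclidean distance. The sequences and function names are made up for the example.

```python
import math
from collections import Counter
from itertools import product

def kmer_profile(seq, k=4):
    """Return relative frequencies of all 4**k DNA k-mers, in a fixed order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing characters other than A/C/G/T are left out of the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

def profile_distance(a, b):
    """Euclidean distance between two k-mer frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical scaffolds: two AT-rich fragments and one GC-rich fragment.
scaffold_a = "ATATTAAATTTAATATAT" * 20
scaffold_b = scaffold_a[5:] + scaffold_a[:5]
scaffold_c = "GCGCCGGGCCGCGGCGCC" * 20

pa, pb, pc = (kmer_profile(s) for s in (scaffold_a, scaffold_b, scaffold_c))
print(profile_distance(pa, pb) < profile_distance(pa, pc))
```

Scaffolds with similar composition end up close together in this space, so they can be grouped into the same tentative bin; in practice this signal would be combined with coverage depth and similarity to known genomes, as the passage describes.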

Fig. 1. MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003. The station locations are overlain with their respective identifications. Note the elevated levels of chlorophyll (green color shades) around station 3, which are not present around stations 11 and 13.

Fig. 2. Gene conservation among closely related Prochlorococcus. The outermost concentric circle of the diagram depicts the completed genomic sequence of Prochlorococcus marinus MED4 (11). Fragments from environmental sequencing were compared to this completed Prochlorococcus genome and are shown in the inner concentric circles and were given boxed outlines. Genes for the outermost circle have been assigned pseudospectrum colors based on the position of those genes along the chromosome, where genes nearer to the start of the genome are colored in red, and genes nearer to the end of the genome are colored in blue. Fragments from environmental sequencing were subjected to an analysis that identifies conserved gene order between those fragments and the completed Prochlorococcus MED4 genome. Genes on the environmental genome segments that exhibited conserved gene order are colored with the same color assignments as the Prochlorococcus MED4 chromosome. Colored regions on the environmental segments exhibiting color differences from the adjacent outermost concentric circle are the result of conserved gene order with other MED4 regions and probably represent chromosomal rearrangements. Genes that did not exhibit conserved gene order are colored in black.


http://www.sciencemag.org/content/304/5667/66

Page 5: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

• Why did the authors focus on the Sargasso Sea?

Page 6: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

The Sargasso Sea. The northwest Sargasso Sea, at the Bermuda Atlantic Time-series Study site (BATS), is one of the best-studied and arguably most well-characterized regions of the global ocean. The Gulf Stream represents the western and northern boundaries of this region and provides a strong physical boundary, separating the low nutrient, oligotrophic open ocean from the more nutrient-rich waters of the U.S. continental shelf. The Sargasso Sea has been intensively studied as part of the 50-year time series of ocean physics and biogeochemistry (3, 4) and provides an opportunity for interpretation of environmental genomic data in an oceanographic context. In this region, formation of subtropical mode water occurs each winter as the passage of cold fronts across the region erodes the seasonal thermocline and causes convective mixing, resulting in mixed layers of 150 to 300 m depth. The introduction of nutrient-rich deep water, following the breakdown of seasonal thermoclines into the brightly lit surface waters, leads to the blooming of single cell phytoplankton, including two cyanobacteria species, Synechococcus and Prochlorococcus, that numerically dominate the photosynthetic biomass in the Sargasso Sea.

Page 7: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics
Page 8: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Sargasso Sea (from Wikipedia)

Page 9: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Sargasso Sea (from Wikipedia)

Page 10: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics
Page 11: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

© 1990 Nature Publishing Group

http://www.nature.com/nature/journal/v345/n6270/abs/345060a0.html

Page 12: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

© 1990 Nature Publishing Group


Page 13: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

© 1990 Nature Publishing Group

Page 14: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

APPLIED AND ENVIRONMENTAL MICROBIOLOGY, June 1991, p. 1707-1713, Vol. 57, No. 6

0099-2240/91/061707-07$02.00/0
Copyright © 1991, American Society for Microbiology

Phylogenetic Analysis of a Natural Marine Bacterioplankton Population by rRNA Gene Cloning and Sequencing

THERESA B. BRITSCHGI AND STEPHEN J. GIOVANNONI*

Department of Microbiology, Oregon State University, Corvallis, Oregon 97331

Received 5 February 1991/Accepted 25 March 1991

The identification of the prokaryotic species which constitute marine bacterioplankton communities has been a long-standing problem in marine microbiology. To address this question, we used the polymerase chain reaction to construct and analyze a library of 51 small-subunit (16S) rRNA genes cloned from Sargasso Sea bacterioplankton genomic DNA. Oligonucleotides complementary to conserved regions in the 16S rDNAs of eubacteria were used to direct the synthesis of polymerase chain reaction products, which were then cloned by blunt-end ligation into the phagemid vector pBluescript. Restriction fragment length polymorphisms and hybridizations to oligonucleotide probes for the SAR11 and marine Synechococcus phylogenetic groups indicated the presence of at least seven classes of genes. The sequences of five unique rDNAs were determined completely. In addition to 16S rRNA genes from the marine Synechococcus cluster and the previously identified but uncultivated microbial group, the SAR11 cluster [S. J. Giovannoni, T. B. Britschgi, C. L. Moyer, and K. G. Field, Nature (London) 345:60-63], two new gene classes were observed. Phylogenetic comparisons indicated that these belonged to unknown species of α- and γ-proteobacteria. The data confirm the earlier conclusion that a majority of planktonic bacteria are new species previously unrecognized by bacteriologists.

Within the past decade, it has become evident that bacterioplankton contribute significantly to biomass and biogeochemical activity in planktonic systems (4, 36). Until recently, progress in identifying the microbial species which constitute these communities was slow, because the majority of the organisms present (as measured by direct counts) cannot be recovered in cultures (5, 14, 16). The discovery a decade ago that unicellular cyanobacteria are widely distributed in the open ocean (34) and the more recent observation of abundant marine prochlorophytes (3) have underscored the uncertainty of our knowledge about bacterioplankton community structure.

Molecular approaches are now providing genetic markers for the dominant bacterial species in natural microbial populations (6, 21, 33). The immediate objectives of these studies are twofold: (i) to identify species, both known and novel, by reference to sequence data bases; and (ii) to construct species-specific probes which can be used in quantitative ecological studies (7, 29).

Previously, we reported rRNA genetic markers and quantitative hybridization data which demonstrated that a novel lineage belonging to the α-proteobacteria was abundant in the Sargasso Sea; we also identified novel lineages which were very closely related to cultivated marine Synechococcus spp. (6). In a study of a hot-spring ecosystem, Ward and coworkers examined cloned fragments of 16S rRNA cDNAs and found numerous novel lineages but no matches to genes from cultivated hot-spring bacteria (33). These results suggested that natural ecosystems in general may include species which are unknown to microbiologists.

Although any gene may be used as a genetic marker, rRNA genes offer distinct advantages. The extensive use of 16S rRNAs for studies of microbial systematics and evolution has resulted in large computer data bases, such as the RNA Data Base Project, which encompass the phylogenetic diversity found within culture collections. rRNA genes are highly conserved and therefore can be used to examine distant phylogenetic relationships with accuracy. The presence of large numbers of ribosomes within cells and the regulation of their biosyntheses in proportion to cellular growth rates make these molecules ideally suited for ecological studies with nucleic acid probes.

Here we describe the partial analysis of a 16S rDNA library prepared from photic zone Sargasso Sea bacterioplankton using a general approach for cloning eubacterial 16S rDNAs. The Sargasso Sea, a central oceanic gyre, typifies the oligotrophic conditions of the open ocean, providing an example of a microbial community adapted to extremely low-nutrient conditions. The results extend our previous observations of high genetic diversity among closely related lineages within this microbial community and provide rDNA genetic markers for two novel eubacterial lineages.

* Corresponding author.

MATERIALS AND METHODS

Collection and amplification of rDNA. A surface (2-m) bacterioplankton population was collected from hydrostation S (32° 4'N, 64° 23'W) in the Sargasso Sea as previously described (8). The polymerase chain reaction (24) was used to produce double-stranded rDNA from the mixed-microbial-population genomic DNA prepared from the plankton sample. The amplification primers were complementary to conserved regions in the 5' and 3' regions of the 16S rRNA genes. The amplification primers were the universal 1406R (ACGGGCGGTGTGTRC) and eubacterial 68F (TNANACATGCAAGTCGAKCG) (17). The reaction conditions were as follows: 1 μg of template DNA, 10 μl of 10× reaction buffer (500 mM KCl, 100 mM Tris HCl [pH 9.0 at 25°C], 15 mM MgCl2, 0.1% gelatin, 1% Triton X-100), 1 U of Taq DNA polymerase (Promega Biological Research Products, Madison, Wis.), 1 μM (each) primer, 200 μM (each) dATP, dCTP, dGTP, and dTTP in a 100-μl total volume; 1 min at 94°C, 1 min at 60°C, 4 min at 72°C; 30 cycles. Polymerase chain
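The two primers quoted above contain degenerate positions (R, N, K) written in the standard IUPAC nucleotide codes. As a hedged illustration of what that means in practice, the Python sketch below counts how many concrete sequences each primer stands for and turns a primer into a regular expression. The helper names are hypothetical, and the sketch treats primers as plain strings on one strand; it does not model annealing or reverse complements.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(primer):
    """Number of distinct unambiguous sequences a degenerate primer encodes."""
    n = 1
    for base in primer.upper():
        n *= len(IUPAC[base])
    return n

def primer_regex(primer):
    """Compile a regex that matches every sequence the primer encodes."""
    parts = []
    for base in primer.upper():
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else "[" + options + "]")
    return re.compile("".join(parts))

primer_1406r = "ACGGGCGGTGTGTRC"        # as printed in the paper
primer_68f = "TNANACATGCAAGTCGAKCG"     # as printed in the paper

print(degeneracy(primer_1406r), degeneracy(primer_68f))
print(bool(primer_regex(primer_68f).fullmatch("TGAAACATGCAAGTCGAGCG")))
```

The single R in 1406R doubles the number of encoded sequences, while the two N positions and one K in 68F multiply out to 4 × 4 × 2 variants, which is how a "eubacterial" primer can cover many lineages at once.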


MARINE BACTERIOPLANKTON 1711

[FIG. 4 tree; taxon labels recovered from the figure: Escherichia coli, Vibrio harveyi, Vibrio anguillarum, Oceanospirillum linum, SAR92, Caulobacter crescentus, SAR83, Erythrobacter OCh 114, Hyphomicrobium vulgare, Rhodopseudomonas marina, Rochalimaea quintana, Agrobacterium tumefaciens, SAR1, SAR95, SAR11, Liverwort CP, Anacystis nidulans, Prochlorothrix, SAR6, SAR7, SAR100, SAR139. Scale bar: 0 to 0.15.]

FIG. 4. Phylogenetic tree showing relationships of the rDNA clones from the Sargasso Sea to representative, cultivated species (2, 6, 18, 30, 31, 32, 35, 40). Positions of uncertain homology in regions containing insertions and deletions were omitted from the analysis. Evolutionary distances were calculated by the method of Jukes and Cantor (15), which corrects for the effects of superimposed mutations. The phylogenetic tree was determined by a distance matrix method (20). The tree was rooted with the sequence of Bacillus subtilis (38). Sequence data not referenced were provided by C. R. Woese and R. Rossen.
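The Jukes and Cantor correction mentioned in the caption converts the observed fraction of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). A minimal Python sketch follows; the aligned sequences are invented for illustration, and skipping any column that is not A/C/G/T in both sequences is a simplification of the caption's "positions of uncertain homology were omitted".

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap or ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor distance: expected substitutions per site given p."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical aligned fragments differing at 1 of 10 comparable sites.
a = "ACGTACGTAC-G"
b = "ACGTACGTAA-G"

p = p_distance(a[:10], b[:10])
print(round(p, 3), round(jukes_cantor(p), 4))
```

The corrected distance is always at least as large as the raw proportion, because some sites will have changed more than once; that is the "superimposed mutations" effect the caption refers to.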

phototroph phyla were conserved in the corresponding genes, which could also be identified by overall primary sequence similarities. The oxygenic phototroph genes cloned from the bacterioplankton population were similar to the 16S rRNA genes of cultured marine Synechococcus spp. isolated from the same oceanic region (similarity, .0.96; 33). Moreover, the cloned oxygenic phototroph genes contained all of the eubacterial and cyanobacterial signature nucleotides described by Woese (37). The differences among the genes were localized in hypervariable regions of the molecule and were confirmed by compensatory base changes across helices.

When multiple genes of a given type were sequenced, as with the SAR11 and oxygenic phototroph groups, a pattern of sequence variation which indicated the presence of clusters of very closely related genes emerged. We reported this phenomenon earlier and further confirm it with the additional sequences presented here (SAR95, SAR100, and SAR139). Three approaches indicated that the sources of this variation were biological, rather than artifactual. First, in all cases compensatory changes were found in bases paired across helices. Second, identical primary sequences and variations were observed when SAR11 cluster genes amplified from natural population Sargasso Sea DNA with specific primers (SAR11 F and 1406 R) were sequenced directly (unpublished data). Third, the distributions of the variations within the sequences were inconsistent with the explanation that chimeric gene artifacts had occurred. For example, there were no genes containing ambiguous combinations of signature nucleotides, tracts of unpaired bases in helical regions, or mosaic patterns of similarity in dot plots. A noteworthy exception occurred in the SAR11 cluster, in which all members exhibited an inversion on the order of a very highly conserved base pair between positions 570 and 880 (G-C to C-G). We note with caution that some classes of chimeric gene artifacts, particularly those involving multiple exchanges among very closely related genes, would be impossible to detect.

There are several alternative explanations for the variability within the sequence clusters. Microheterogeneities in 16S rDNA gene families may account for some of the variability, particularly among very closely related genes (e.g., SAR1 and SAR95; SAR100 and SAR139). There currently are no documented cases of microheterogeneities in 16S rDNA gene families, although many investigators informally report their occurrence. It is likely that some of the sequence variability among the more distantly related sequences is a consequence of the divergence and speciation of cellular lineages. Wood and Townsend reported high genetic diversity among marine Synechococcus isolates on the basis of RFLP data from four loci (39). The clone model of bacterial population structure, which states that natural populations are mixtures of independent clonal cell lines (26), predicts that genetic diversity will be high in bacterial populations with large effective population sizes. We estimate that the actual marine Synechococcus population size is ca. 10^26. Viral predation may also play a role by promoting the divergence of lineages through disruptive selection (22).

The two novel eubacterial 16S rRNA lineages described here, SAR83 and SAR92, could both be identified as proteobacteria. SAR92, a member of the γ-proteobacteria, formed no closer specific relationships. Since the γ-proteobacteria are diverse in metabolic capacity, an ecological role for SAR92 could not be deduced from the observed phylogenetic relationships.

The specific phylogenetic relationship between SAR83 and the marine species E. longus OCh 114 suggests the intriguing possibility that these organisms might also share


Page 15: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

• What did they sample and how did they sample?

Page 16: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Sargasso Sea

assembly to identify a set of large, deeply as-sembling nonrepetitive contigs. This was used toset the expected coverage in unique regions (to23!) for a final run of the assembler. This al-lowed the deep contigs to be treated as uniquesequence when they would otherwise be labeledas repetitive. We evaluated our final assemblyresults in a tiered fashion, looking at well-sampledgenomic regions separately from those barelysampled at our current level of sequencing.

The 1.66 million sequences from theWeatherbird II samples (table S1; samples 1 to4; stations 3, 11, and 13), were pooled andassembled to provide a single master assemblyfor comparative purposes. The assembly gener-ated 64,398 scaffolds ranging in size from 826bp to 2.1 Mbp, containing 256 Mbp of uniquesequence and spanning 400 Mbp. After assem-bly, there remained 217,015 paired-end reads,or “mini-scaffolds,” spanning 820.7 Mbp aswell as an additional 215,038 unassembled sin-gleton reads covering 169.9 Mbp (table S2,column 1). The Sorcerer II samples providedalmost no assembly, so we consider for thesesamples only the 153,458 mini-scaffolds, span-ning 518.4 Mbp, and the remaining 18,692singleton reads (table S2, column 2). In total,1.045 Gbp of nonredundant sequence was gen-erated. The lack of overlapping reads within theunassembled set indicates that lack of addition-al assembly was not due to algorithmic limita-tions but to the relatively limited depth of se-quencing coverage given the level of diversitywithin the sample.

The whole-genome shotgun (WGS) assemblyhas been deposited at DDBJ/EMBL/GenBankunder the project accession AACY00000000,and all traces have been deposited in a corre-sponding TraceDB trace archive. The versiondescribed in this paper is the first version,AACY01000000. Unlike a conventional WGSentry, we have deposited not just contigs andscaffolds but the unassembled paired singletonsand individual singletons in order to accurate-ly reflect the diversity in the sample andallow searches across the entire sample with-in a single database.Genomes and large assemblies. Our

analysis first focused on the well-sampled ge-nomes by characterizing scaffolds with at least3! coverage depth. There were 333 scaffoldscomprising 2226 contigs and spanning 30.9Mbp that met this criterion (table S3), account-ing for roughly 410,000 reads, or 25% of thepooled assembly data set. From this set of well-sampled material, we were able to cluster andclassify assemblies by organism; from the rarespecies in our sample, we used sequence similar-ity based methods together with computationalgene finding to obtain both qualitative and quan-titative estimates of genomic and functional diver-sity within this particular marine environment.

We employed several criteria to sort themajor assembly pieces into tentative organism“bins”; these include depth of coverage, oligo-

nucleotide frequencies (7), and similarity topreviously sequenced genomes (5). With thesetechniques, the majority of sequence assignedto the most abundant species (16.5 Mbp of the30.9 Mb in the main scaffolds) could be sepa-rated based on several corroborating indicators.In particular, we identified a distinct group ofscaffolds representing an abundant populationclearly related to Burkholderia (fig. S2) andtwo groups of scaffolds representing two dis-tinct strains closely related to the published

Shewanella oneidensis genome (8) (fig. S3).There is a group of scaffolds assembling at over6! coverage that appears to represent the ge-nome of a SAR86 (table S3). Scaffold setsrepresenting a conglomerate of Prochlorococ-cus strains (Fig. 2), as well as an unculturedmarine archaeon, were also identified (table S3;Fig. 3). Additionally, 10 putative mega plasmidswere found in the main scaffold set, coveredat depths ranging from 4! to 36! (indicatedwith shading in table S3 with nine depicted in

Fig. 1. MODIS-Aqua satellite image ofocean chlorophyll in the Sargasso Sea gridabout the BATS site from 22 February2003. The station locations are overlainwith their respective identifications. Notethe elevated levels of chlorophyll (greencolor shades) around station 3, which arenot present around stations 11 and 13.

Fig. 2. Gene conser-vation among closelyrelated Prochlorococ-cus. The outermostconcentric circle ofthe diagram depictsthe competed genom-ic sequence of Pro-chlorococcus marinusMED4 (11). Fragmentsfrom environmentalsequencing were com-pared to this complet-ed Prochlorococcus ge-nome and are shown inthe inner concentriccircles and were givenboxed outlines. Genesfor the outermost cir-cle have been as-signed psuedospec-trum colors based onthe position of thosegenes along the chro-mosome, where genesnearer to the start ofthe genome are col-ored in red, and genesnearer to the end of the genome are colored in blue. Fragments from environmental sequencingwere subjected to an analysis that identifies conserved gene order between those fragments andthe completed Prochlorococcus MED4 genome. Genes on the environmental genome segmentsthat exhibited conserved gene order are colored with the same color assignments as theProchlorococcus MED4 chromosome. Colored regions on the environmental segments exhibitingcolor differences from the adjacent outermost concentric circle are the result of conserved geneorder with other MED4 regions and probably represent chromosomal rearrangements. Genes thatdid not exhibit conserved gene order are colored in black.

R E S E A R C H A R T I C L E

www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004 67


http://www.sciencemag.org/content/304/5667/66

Page 17: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Surface water samples (170 to 200 liters) were collected aboard the RV Weatherbird II from three sites off the coast of Bermuda in February 2003. Additional samples were collected aboard the SV Sorcerer II from “Hydrostation S” in May 2003. Sample site locations are indicated on Fig. 1 and described in table S1; sampling protocols were fine-tuned from one expedition to the next (5). Genomic DNA was extracted from filters of 0.1 to 3.0 μm, and genomic libraries with insert sizes ranging from 2 to 6 kb were made as described (5). The prepared plasmid clones were sequenced from both ends to provide paired-end reads at the J. Craig Venter Science Foundation Joint Technology Center on ABI 3730XL DNA sequencers (Applied Biosystems, Foster City, CA). Whole-genome random shotgun sequencing of the Weatherbird II samples (table S1, samples 1 to 4) produced 1.66 million reads averaging 818 bp in length, for a total of approximately 1.36 Gbp of microbial DNA sequence. An additional 325,561 sequences were generated from the Sorcerer II samples (table S1, samples 5 to 7), yielding approximately 265 Mbp of DNA sequence.
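As a quick consistency check on the totals quoted above, read count times mean read length should reproduce the reported yield. The sketch below uses only numbers from the text; the variable names are illustrative, and the implied Sorcerer II read length is a derived value, not one stated in the paper.

```python
# Sanity-check the sequencing totals quoted in the text.
# All input numbers come from the text; variable names are illustrative.

weatherbird_reads = 1.66e6      # reads from Weatherbird II samples 1 to 4
mean_read_length_bp = 818       # reported average read length in bp

weatherbird_total_bp = weatherbird_reads * mean_read_length_bp
print(f"Weatherbird II: {weatherbird_total_bp / 1e9:.2f} Gbp")

sorcerer_reads = 325_561        # sequences from Sorcerer II samples 5 to 7
sorcerer_total_bp = 265e6       # approximate yield reported in the text

# Implied mean read length for the Sorcerer II reads (derived, not reported).
sorcerer_mean_bp = sorcerer_total_bp / sorcerer_reads
print(f"Sorcerer II implied mean read length: {sorcerer_mean_bp:.0f} bp")
```

The first figure rounds to the 1.36 Gbp given in the text, and the implied Sorcerer II read length comes out close to the Weatherbird II average.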

Page 18: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics
Page 19: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics
Page 20: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics
Page 21: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics
Page 22: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

• What are some questions one might want to answer about the Sargasso Sea samples / sequences?

Page 23: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Who is out there?

Page 24: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

rRNA phylotyping from metagenomics


http://www.sciencemag.org/content/304/5667/66

Page 25: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

• How can one do phylotyping with genes other than rRNA?

Page 26: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Shotgun Sequencing Allows Alternative Anchors (e.g., RecA)

Page 27: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Who is out there?

using the curated TIGR role categories (5). A breakdown of predicted genes by category is given in Table 1.

The samples analyzed here represent only specific size fractions of the sampled environment, dictated by the pore size of the collection filters. By our selection of filter pore sizes, we deliberately focused this initial study on the identification and analysis of microbial organisms. However, we did examine the data for the presence of eukaryotic content as well. Although the bulk of known protists are 10 μm and larger, there are some known in the range of 1 to 1.5 μm in diameter [for example, Ostreococcus tauri (15) and the Bolidomonas species (16)], and such organisms could potentially work their way through a 0.80 μm prefilter. An initial screening for 18S ribosomal RNA (rRNA), a commonly used eukaryotic marker, identified 69 18S rRNA genes, with 63 of these on singletons and the remaining 6 on very small, low-coverage assemblies. These 18S rRNAs are similar to uncultured marine eukaryotes and are indicative of a eukaryotic presence but inconclusive on their own. Because bacterial DNA contains a much greater density of genes than eukaryotic DNA, the relative proportion of gene content can be used as another indicator to distinguish eukaryotic material in our sample. An inverse relation was observed between the pore size of the pre-filters and collection filters and the fraction of sequence coding for genes (table S5). This relation, together with the presence of 18S rRNA genes in the samples, is strong evidence that eukaryotic material was indeed captured.

Diversity and species richness. Most phylogenetic surveys of uncultured organisms have been based on studies of rRNA genes using polymerase chain reaction (PCR) with primers for highly conserved positions in those genes. More than 60,000 small subunit rRNA sequences from a wide diversity of prokaryotic taxa have been reported (17). However, PCR-based studies are inherently biased, because not all rRNA genes amplify with the same “universal” primers. Within our shotgun sequence data and assemblies, we identified 1164 distinct small subunit rRNA genes or fragments of genes in the Weatherbird II assemblies and another 248 within the Sorcerer II reads (5). Using a 97% sequence similarity cutoff to distinguish unique phylotypes, we identified 148 previously unknown phylotypes in our sample when compared against the RDP II database (17). With a 99% similarity cutoff, this number increases to 643. Though sequence similarity is not necessarily an accurate predictor of functional conservation and sequence divergence does not universally correlate with the biological notion of “species,” defining species (also known as phylotypes) by sequence similarity within the rRNA genes is the accepted standard in studies of uncultured microbes. All sampled rRNAs were then assigned to taxonomic groups using an automated rRNA classification program (5). Our samples are dominated by rRNA genes from Proteobacteria (primarily members of the α, β, and γ subgroups) with moderate contributions from Firmicutes (low-GC Gram positive), Cyanobacteria, and species in the CFB phyla (Cytophaga, Flavobacterium, and Bacteroides) (fig. S4A; Fig. 6). The patterns we see are similar in broad outline to those observed by rRNA PCR studies from the Sargasso Sea (18), but with some quantitative differences that reflect either biases in PCR studies or differences in the species found in our sample versus those in other studies.
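The effect of the similarity cutoff on the count of "previously unknown" phylotypes can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the best-hit identities are made up, `count_novel` is a hypothetical helper, and the sketch ignores the further step of grouping the novel sequences with one another, so it is not the procedure used in the paper.

```python
# Hypothetical sketch: count rRNA sequences that look "novel" at a given
# identity cutoff. The identities below are made-up best-hit values against a
# reference database; they are NOT data from the paper.

def count_novel(best_hit_identities, cutoff):
    """Return how many sequences have no database hit at or above `cutoff`."""
    return sum(1 for identity in best_hit_identities if identity < cutoff)

best_hits = [0.999, 0.985, 0.972, 0.968, 0.951, 0.993, 0.978, 0.930]

novel_at_97 = count_novel(best_hits, 0.97)
novel_at_99 = count_novel(best_hits, 0.99)
print(novel_at_97, novel_at_99)
```

Raising the cutoff can only keep or increase the count, which matches the direction of the change from 148 to 643 reported in the text.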

An additional disadvantage associated with relying on rRNA for estimates of species diversity and abundance is the varying number of copies of rRNA genes between taxa (more than an order of magnitude among prokaryotes) (19). Therefore, we constructed phylogenetic trees (fig. S4, B to E) using other represented phylogenetic markers found in our data set, [RecA/RadA, heat shock protein 70 (HSP70), elongation factor Tu (EF-Tu), and elongation factor G (EF-G)]. Each marker gene interval in our data set (with a minimum length of 75 amino acids) was assigned to a putative taxonomic group using the phylogenetic analysis described for rRNA. For example, our data set contains over 600 recA homologs from throughout the bacterial phylogeny, including representatives of Proteobacteria, low- and high-GC Gram positives, Cyanobacteria, green sulfur and green nonsulfur bacteria, and other groups. Assignment to phylogenetic groups shows a broad consensus among the different phylogenetic markers. For most taxa, the rRNA-based proportion is the highest or lowest in comparison to the other markers. We believe this is due to the large amount of variation in copy number of rRNA genes between species. For example, the rRNA-based estimate of the proportion of γ-Proteobacteria is the highest, while the estimate for cyanobacteria is the lowest, which is consistent with the reports that members of the γ-Proteobacteria frequently have more than five rRNA operon copies, whereas cyanobacteria frequently have fewer than three (19).
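To see why copy-number variation skews rRNA-based proportions, one can divide each group's rRNA count by an assumed copy number and renormalize. The counts and copy numbers below are illustrative placeholders chosen to echo the "more than five" versus "fewer than three" contrast in the text; they are not values from the paper, and `copy_corrected_proportions` is a hypothetical helper.

```python
# Hypothetical sketch of rRNA copy-number correction. Counts and copy numbers
# are illustrative placeholders, not values from the paper.

def copy_corrected_proportions(rrna_counts, copies_per_genome):
    """Divide each group's rRNA gene count by its assumed copy number,
    then renormalize so the proportions sum to 1."""
    genome_equivalents = {
        group: rrna_counts[group] / copies_per_genome[group]
        for group in rrna_counts
    }
    total = sum(genome_equivalents.values())
    return {group: value / total for group, value in genome_equivalents.items()}

rrna_counts = {"gamma-Proteobacteria": 60, "Cyanobacteria": 20}
copies = {"gamma-Proteobacteria": 6, "Cyanobacteria": 2}   # assumed values

raw_total = sum(rrna_counts.values())
raw = {group: count / raw_total for group, count in rrna_counts.items()}
corrected = copy_corrected_proportions(rrna_counts, copies)
print(raw)
print(corrected)
```

In this toy case the raw rRNA counts suggest a 3:1 ratio, while the copy-corrected estimate is 1:1, which is the kind of distortion single-copy protein markers avoid.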

Just as phylogenetic classification is strengthened by a more comprehensive marker set, so too is the estimation of species richness. In this analysis, we define “genomic” species as a clustering of assemblies or unassembled reads more than 94% identical on the nucleotide level. This cutoff, adjusted for the protein-coding marker genes, is roughly comparable to the 97% cutoff traditionally used for rRNA. Thus

Fig. 6. Phylogenetic diversity of Sargasso Sea sequences using multiple phylogenetic markers. The relative contribution of organisms from different major phylogenetic groups (phylotypes) was measured using multiple phylogenetic markers that have been used previously in phylogenetic studies of prokaryotes: 16S rRNA, RecA, EF-Tu, EF-G, HSP70, and RNA polymerase B (RpoB). The relative proportion of different phylotypes for each sequence (weighted by the depth of coverage of the contigs from which those sequences came) is shown. The phylotype distribution was determined as follows: (i) Sequences in the Sargasso data set corresponding to each of these genes were identified using HMM and BLAST searches. (ii) Phylogenetic analysis was performed for each phylogenetic marker identified in the Sargasso data separately compared with all members of that gene family in all complete genome sequences (only complete genomes were used to control for the differential sampling of these markers in GenBank). (iii) The phylogenetic affinity of each sequence was assigned based on the classification of the nearest neighbor in the phylogenetic tree.
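The coverage weighting mentioned in the caption can be sketched as follows. This is a minimal illustration, assuming each marker sequence already carries a phylotype label (from step iii) and the depth of coverage of its source contig; the records and the function name `weighted_phylotype_proportions` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sketch of the coverage weighting described in the Fig. 6 caption.
# Each record is (assigned phylotype, depth of coverage of the source contig);
# the records are illustrative, not data from the paper.

def weighted_phylotype_proportions(records):
    """Sum contig coverage depth per phylotype and normalize to proportions."""
    totals = defaultdict(float)
    for phylotype, depth in records:
        totals[phylotype] += depth
    grand_total = sum(totals.values())
    return {phylotype: depth / grand_total for phylotype, depth in totals.items()}

marker_hits = [
    ("Alphaproteobacteria", 6.0),
    ("Alphaproteobacteria", 2.0),
    ("Cyanobacteria", 1.0),
    ("Firmicutes", 1.0),
]
proportions = weighted_phylotype_proportions(marker_hits)
print(proportions)
```

Weighting by depth lets a sequence from a deeply covered contig count for more than one from a singleton, so the proportions track organism abundance rather than the number of distinct assembled sequences.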


Page 28: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

defined, the mean number of species at the point of deepest coverage was 451 [averaged over the six genes analyzed; range 341 to 569 (Table 2)]; this serves as the most conservative estimate of species richness.
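The mean and range quoted here can be reproduced from the observed "species" column of Table 2. The sketch below copies those six values; the variable names are illustrative.

```python
# Observed "species" counts per marker, copied from Table 2.
observed_species = {
    "AtpD": 456, "GyrB": 569, "Hsp70": 515,
    "RecA": 341, "RpoB": 428, "TufA": 397,
}

counts = list(observed_species.values())
mean_species = sum(counts) / len(counts)
print(mean_species, min(counts), max(counts))
```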

Although counts of observed species in a sample are directly obtainable, the true number of distinct species within a sample is almost certainly greater than that which can be observed by finite sequence sampling. That is, given a diverse sample at any given level of sequencing, random sampling is likely to entirely miss some subset of the species at lowest abundance. We considered three approaches to estimating the true diversity: nonparametric methods for small sample corrections (20), parametric methods assuming a log-normal distribution of species abundance (21), and a novel method based on fitting the observed depth of coverage to a theoretical model of assembly progress for a sample corresponding to a mixture of organisms at different abundances. All three methods agree on a minimum of at least 300 species per sample, with more than 1000 species in the combined sample (5). We describe in detail a model based on assembly depth of coverage (5). Assuming standard random models for shotgun sequencing, sequencing of an environmental sample should result in depths of sequence coverage reflecting a mixture of Poisson distributions. We computed the empirical distribution of coverage depth at every position in the full set of assemblies (including single fragment contigs, but not counting gaps between contigs), and compared it with hand-constructed mixtures of Poisson distributions. The three depth of coverage–based models shown in Table 3 indicate that there are at least 1800 species in the combined sample, and that a minimum of 12-fold deeper sampling would be required to obtain 95% of the unique sequence. However, these are only lower bounds. The depth of coverage modeling is consistent with as much as 80% of the assembled sequence being contributed by organisms at very low individual abundance and thus would be compatible with total diversity orders of magnitude greater than the lower bound. The assembly coverage data also implies that more than 100 Mbp of genome (i.e., probably more than 50 species) is present at coverage high enough to permit assembly of a complete or nearly complete genome were we to sequence to 5- to 10-fold greater sampling depth.

Taking the well-known marine rRNA clade SAR11 as an example, one can readily get a more tangible view of the diversity within our sample. The SAR11 rRNA group accounts for 26% of all RNA clones that have been identified by culture-independent PCR amplification of seawater (22) and has been found in nearly every pelagic marine bacterioplankton community. However, there are very few cultured representatives of this clade (23), and little is known about its metabolic diversity. In total, 89 scaffolds and 291 singletons from our data set contain a SAR11 rRNA sequence. However, even with these nearly 400 representatives of SAR11 within our sample, assembly depth of coverage ranges only from 0.94- to 2.2-fold, and the largest scaffold is quite small at 21,000 bp. This indicates a much more diverse population of organisms than previously attributed to SAR11 based on rRNA PCR methods.

Variability in species abundance. A key advantage to the random sampling approach described here is that it allows assessment of the stoichiometry (that is, estimation of the relative abundance of the dominant organisms). For example, we found that more than half of all assemblies with more than 50 fragments were from organisms that were not at equal relative abundance in all of the samples collected, suggesting widespread “patchiness”

Table 2. Diversity of ubiquitous single copy protein coding phylogenetic markers. Protein column uses symbols that identify six proteins encoded by exactly one gene in virtually all known bacteria. Sequence ID specifies the GenBank identifier for corresponding E. coli sequence. Ortholog cutoff identifies BLASTx e-value chosen to identify orthologs when querying the E. coli sequence against the complete Sargasso Sea data set. Maximum fragment depth shows the number of reads satisfying the ortholog cutoff at the point along the query for which this value is maximal. Observed “species” shows the number of distinct clusters of reads from the maximum fragment depth column, after grouping reads whose containing assemblies had an overlap of at least 40 bp with ≥ 94% nucleotide identity (single-link clustering). Singleton “species” shows the number of distinct clusters from the observed “species” column that consist of a single read. Most abundant column shows the fraction of the maximum fragment depth that consists of single largest cluster.

Protein  Sequence ID    Ortholog cutoff  Max. fragment depth  Observed “species”  Singleton “species”  Most abundant (%)
AtpD     NTL01EC03653   1e-32            836                  456                 317                  6
GyrB     NTL01EC03620   1e-11            924                  569                 429                  4
Hsp70    NT01EC0015     1e-31            812                  515                 394                  4
RecA     NTL01EC02639   1e-21            592                  341                 244                  8
RpoB     NTL01EC03885   1e-41            669                  428                 331                  7
TufA     NTL01EC03262   1e-41            597                  397                 307                  3
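The single-link clustering named in the Table 2 caption can be sketched with a small union-find structure: any two reads linked by a qualifying overlap end up in the same cluster, even if they are only connected through intermediates. This is a minimal sketch under stated assumptions; the pairwise identities are made up, the function `single_link_clusters` is hypothetical, and the 40 bp overlap requirement is assumed to have been applied before the identities were computed.

```python
# Hypothetical sketch of single-link clustering at a 94% identity cutoff.
# The pairwise identities are illustrative; a real analysis would derive them
# from overlaps (at least 40 bp) between the assemblies containing each read.

def single_link_clusters(n_items, pairwise_identity, cutoff=0.94):
    """Group items so that any pair at or above `cutoff` shares a cluster."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), identity in pairwise_identity.items():
        if identity >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for item in range(n_items):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())

identities = {(0, 1): 0.97, (1, 2): 0.95, (0, 2): 0.90, (3, 4): 0.80}
clusters = single_link_clusters(5, identities)
singletons = [cluster for cluster in clusters if len(cluster) == 1]
print(clusters, len(singletons))
```

Reads 0 and 2 are only 90% identical to each other, yet they land in one cluster because both link to read 1. That chaining is characteristic of single-link clustering and tends to merge clusters, which fits the text's description of the resulting counts as conservative.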

Table 3. Diversity models based on depth of coverage. Each row corresponds to an abundance class of organisms. The first column in each model “fr(asm)” gives the fraction of the assembly consensus modeled due to organisms at an abundance giving “Depth” coverage depth (second column) in the sample. The third column [“E(s)”] gives the fraction of such a genome expected to be sampled. The fourth column (“Genomes”) gives the resulting estimated number of genomes in the abundance class.

Model 1                              Model 2                              Model 3
fr(asm) Depth  E(s)      Genomes     fr(asm) Depth  E(s)      Genomes     fr(asm) Depth  E(s)      Genomes
0.0055  25     1.00E+00  2.5         0.0055  25     1.00E+00  2.5         0.0055  25     1.00E+00  2.5
0.005   21     1.00E+00  2.3         0.005   21     1.00E+00  2.3         0.005   21     1.00E+00  2.3
0.0035  13     1.00E+00  1.6         0.0035  13     1.00E+00  1.6         0.0035  13     1.00E+00  1.6
0.004   9      1.00E+00  1.8         0.004   9      1.00E+00  1.8         0.004   9      1.00E+00  1.8
0.008   7      9.99E-01  3.6         0.0088  7      9.99E-01  4           0.0088  7      9.99E-01  4
0.0047  6      9.98E-01  2.1         0.0047  6      9.98E-01  2.1         0.0047  6      9.98E-01  2.1
0.01    4      9.82E-01  4.6         0.029   2.4    9.09E-01  14.4        0.029   2.4    9.09E-01  14.4
0.0258  2.4    9.09E-01  12.8        0.096   2      8.65E-01  50          0.097   2      8.65E-01  50.5
0.07    2      8.65E-01  36.4        0.0235  1      6.32E-01  16.7        0.0225  1      6.32E-01  16
0.8635  0.25   2.21E-01  1,756.7     0.06    0.5    3.93E-01  68.6        0.06    0.5    3.93E-01  68.6
-       -      -         -           0.76    0.09   8.61E-02  3,973.6     0.66    0.124  1.17E-01  2,546.7
-       -      -         -           -       -      -         -           0.1     0.001  1.00E-03  45,022.5
Total                    1824.4      Total                    4137.6      Total                    47,733
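The E(s) column appears consistent with the standard Poisson result for random shotgun sampling, in which a genome sequenced to depth d has an expected sampled fraction of 1 - exp(-d). That identification is an inference from the tabulated values, not a formula quoted from the paper; the sketch below simply compares the two, and `expected_fraction_sampled` is a hypothetical name.

```python
import math

# Sketch: expected sampled fraction of a genome at a given coverage depth,
# assuming reads land uniformly at random (Poisson model). Treating E(s) as
# 1 - exp(-depth) is an inference from the tabulated values, not a formula
# quoted from the paper.

def expected_fraction_sampled(depth):
    return 1.0 - math.exp(-depth)

# (depth, E(s)) pairs copied from Table 3.
table_rows = [(4, 0.982), (2.4, 0.909), (2, 0.865), (1, 0.632),
              (0.5, 0.393), (0.25, 0.221), (0.124, 0.117), (0.09, 0.0861)]

for depth, tabulated in table_rows:
    print(depth, round(expected_fraction_sampled(depth), 4), tabulated)
```

At low depth the expected fraction is close to the depth itself, which is why the rare abundance classes at the bottom of each model, with only a small fraction of each genome sampled, account for nearly all of the estimated genomes.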


http://www.sciencemag.org/content/304/5667/66

Page 29: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

MS 1093857: Environmental Genome Shotgun Sequencing of the Sargasso Sea Venter et al., revised

Figure S6. Accumulation curve for rpoB. Observed (black) OTU counts for rpoB (based on the fragment grouping summarized in Table 2), as well as the Chao1-corrected estimate of total species (red; see (3)). Points are mean values of 1000 shufflings of the observed data, while bars show 90% confidence intervals.
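An accumulation curve with a Chao1-corrected estimate, as described in the caption, can be sketched as follows. This is a minimal illustration, not the authors' code: `reads` is a made-up list of OTU labels, the function names are hypothetical, and the sketch uses the classic Chao1 formula with the usual bias-corrected fallback when no OTU is seen exactly twice. Confidence intervals are omitted.

```python
import random
from collections import Counter

# Hypothetical sketch of an accumulation curve with a Chao1 estimate.
# `reads` is a made-up list of OTU labels, one per read, not data from the paper.

def chao1(otu_labels):
    """Classic Chao1: S_obs + F1^2 / (2 * F2), where F1 and F2 are the numbers
    of OTUs seen exactly once and exactly twice. Falls back to the
    bias-corrected form when no OTU is seen exactly twice."""
    abundance = Counter(otu_labels)
    s_obs = len(abundance)
    f1 = sum(1 for n in abundance.values() if n == 1)
    f2 = sum(1 for n in abundance.values() if n == 2)
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 * f1 / (2.0 * f2)

def accumulation_curve(otu_labels, n_shuffles=1000, seed=0):
    """Mean observed OTU count after each read, averaged over random orderings."""
    rng = random.Random(seed)
    labels = list(otu_labels)
    sums = [0.0] * len(labels)
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        seen = set()
        for i, label in enumerate(labels):
            seen.add(label)
            sums[i] += len(seen)
    return [total / n_shuffles for total in sums]

reads = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"] * 2 + ["e", "f", "g", "h"]
curve = accumulation_curve(reads, n_shuffles=200)
estimate = chao1(reads)
print(estimate, curve[0], curve[-1])
```

With 8 observed OTUs, 4 singletons and 2 doubletons, the estimate exceeds the observed count: many singletons relative to doubletons signal that rare OTUs remain unsampled.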

http://www.sciencemag.org/content/304/5667/66

Page 30: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

What are they doing?

Page 31: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Community Function?

• Many attempts to treat community as a bag of genes

• Then run pathway prediction tools on the entire data set to try to predict “community metabolism”

• Does not work very well


Page 32: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Functional Inference from Metagenomics

• Can work well for individual genes


Page 33: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

What are they doing?

identified and curated genes. With the vast majority of the Sargasso sequence in short (less than 10 kb), unassociated scaffolds and singletons from hundreds of different organisms, it is impractical to apply this approach. Instead, we developed an evidence-based gene finder (5). Briefly, evidence in the form of protein alignments to sequences in the bacterial portion of the nonredundant amino acid (nraa) data set (13) was used to determine the most likely coding frame. Likewise, approximate start and stop positions were determined from the bounding coordinates of the alignments and refined to identify specific start and stop codons. This approach identified 1,214,207 genes covering over 700 MB of the total data set. This represents approximately an order of magnitude more sequences than currently archived in the curated SwissProt database (14), which contains 137,885 sequence entries at the time of writing; roughly the same number of sequences as have been deposited into the uncurated REM-TrEMBL database (14) since its inception in 1996. After excluding all intervals covered by previously identified genes, additional hypothetical genes were identified on the basis of the presence of conserved open reading frames (5). A total of 69,901 novel genes belonging to 15,601 single link clusters were identified. The predicted genes were categorized

Fig. 5. Prochlorococcus-related scaffold 2223290 illustrates the assembly of a broad community of closely related organisms, distinctly nonpunctate in nature. The image represents (A) global structure of Scaffold 2223290 with respect to assembly and (B) a sample of the multiple sequence alignment. Blue segments, contigs; green segments, fragments; and yellow segments, stages of the assembly of fragments into the resulting contigs. The yellow bars indicate that fragments were initially assembled in several different pieces, which in places collapsed to form the final contig structure. The multiple sequence alignment for this region shows a homogenous blend of haplotypes, none with sufficient depth of coverage to provide a separate assembly.

Table 1. Gene count breakdown by TIGR role category. Gene set includes those found on assemblies from samples 1 to 4 and fragment reads from samples 5 to 7. A more detailed table, separating Weatherbird II samples from the Sorcerer II samples is presented in the SOM (table S4). Note that there are 28,023 genes which were classified in more than one role category.

TIGR role category                                            Total genes
Amino acid biosynthesis                                            37,118
Biosynthesis of cofactors, prosthetic groups, and carriers         25,905
Cell envelope                                                      27,883
Cellular processes                                                 17,260
Central intermediary metabolism                                    13,639
DNA metabolism                                                     25,346
Energy metabolism                                                  69,718
Fatty acid and phospholipid metabolism                             18,558
Mobile and extrachromosomal element functions                       1,061
Protein fate                                                       28,768
Protein synthesis                                                  48,012
Purines, pyrimidines, nucleosides, and nucleotides                 19,912
Regulatory functions                                                8,392
Signal transduction                                                 4,817
Transcription                                                      12,756
Transport and binding proteins                                     49,185
Unknown function                                                   38,067
Miscellaneous                                                       1,864
Conserved hypothetical                                            794,061

Total number of roles assigned                                  1,242,230
Total number of genes                                           1,214,207
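Two quick checks on the Table 1 totals: the gap between roles assigned and genes, and the share of role assignments that fall under "Conserved hypothetical". The numbers are copied from the table; the variable names are illustrative.

```python
# Totals copied from Table 1.
total_roles_assigned = 1_242_230
total_genes = 1_214_207
conserved_hypothetical = 794_061

# Role assignments beyond one per gene.
extra_assignments = total_roles_assigned - total_genes
print(extra_assignments)

# Share of all role assignments labeled "Conserved hypothetical".
share = conserved_hypothetical / total_roles_assigned
print(f"{share:.1%}")
```

The difference matches the 28,023 multiply-classified genes noted in the caption, and roughly 64% of role assignments carry no specific functional category, which is worth keeping in mind when reading the table as a picture of community function.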

http://www.sciencemag.org/content/304/5667/66

Page 34: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

Functional Diversity of Proteorhodopsins?

http://www.sciencemag.org/content/304/5667/66

Page 35: Microbial Phylogenomics (EVE161) Class 15: Shotgun Metagenomics

letters to nature

NATURE | VOL 411 | 14 JUNE 2001 | www.nature.com 787

proteorhodopsin molecules are present in the membranes of native marine bacterioplankton.

The flash-photolysis data permit estimation of the cellular concentration of proteorhodopsin. Assuming (1) the flash yield of the environmental pigment is similar to that of proteorhodopsin expressed in E. coli (that is, 10.5% absorption change at 500 nm per absorption unit of pigment at 527 nm), we calculate 0.033 absorption units of proteorhodopsin at 527 nm (0.35 μg of proteorhodopsin protein per litre of sea water). Flash spectroscopy was necessary to detect the proteorhodopsin pigments because there is much greater absorption in the visible range by other pigments in the sample, which show peaks of 1.6, 1.3, 0.52 and 0.73 absorption units, at 437, 468, 647 and 673 nm, respectively.

Assuming also (2) a molar absorption coefficient of 50,000 M^-1 cm^-1 at the absorption maximum, (3) fluorescence in situ hybridization counts of a total of 5.6 × 10^10 SAR86 cells in the concentrated sample (~10% of the total bacteria; see Methods), (4) 50% recovery of membranes from the cells, and (5) that the uncultivated SAR86 group is the principal bacterioplankton group using these pigments, we calculate that there are 2.4 × 10^4 proteorhodopsin molecules per SAR86 cell.

This value is on the order of the concentration of bacteriorhodopsin in a H. salinarum cell, in which substantial portions of the membrane surface area of the cell consist of bacteriorhodopsin in a tightly packed crystalline array (8). For comparison, 2.4 × 10^4 bacteriorhodopsin molecules densely packed in the purple-membrane lattice would cover a 0.6-μm diameter, flat circular area of membrane, a significant portion of the surface of a single cell (3). This number of molecules is sufficient to produce substantial amounts of ATP under illumination (9). Therefore, the high density of proteorhodopsin in the SAR86 membrane indicated by our calculations strongly suggests that this protein has a significant role in the physiology of these bacteria in situ.

To explore the possible existence of other proteorhodopsins, we screened the same mixed-population bacterial artificial chromosome (BAC) library (10) in which proteorhodopsin was initially discovered, with non-degenerate polymerase chain reaction (PCR) primers (1). Several additional proteorhodopsin-containing BAC clones were found in the library. These proteorhodopsins were similar, but did differ in their amino-acid sequences when compared with the original proteorhodopsin (for example, see clones 31A8, 64A5 and 40E8; Figs 3 and 4). Using the same non-degenerate primers, we could also amplify by PCR proteorhodopsin genes from bacterioplankton DNA extracts, including those from Monterey Bay (MB clones; Fig. 3), the Southern Ocean (Palmer station; PAL clones) and waters of the central North Pacific Ocean (Hawaii Ocean Time series station (11); HOT clones).

We detected 15 different variants of proteorhodopsin in the PCR-generated Monterey Bay proteorhodopsin gene library, falling into three clusters (Fig. 3) that share at least 97% identity over 248 amino acids (Fig. 4) and 93% identity at the DNA level. Two proteorhodopsin genes from Monterey, 40E8 and 64A5, were expressed in E. coli and produced absorption spectra very similar to the original proteorhodopsin (1) (clone 31A8) isolated from the same waters (data not shown).

Remarkably, all the proteorhodopsin genes that were amplified by PCR from Antarctic marine bacterioplankton were different from those of Monterey Bay (Fig. 3), sharing a maximum of 78% identity over 248 amino acids with the Monterey clade (Fig. 4). The changes in amino-acid sequences were not restricted to the hydrophilic

[Figure 2 plot: laser-flash transients in three panels (untreated membranes; hydroxylamine-treated membranes; retinal-reconstituted membranes). Scale bars: 10^-3 AU, 10^-1 s.]

Figure 2 Laser flash-induced transients at 500 nm of a Monterey Bay bacterioplankton membrane preparation. Top, before addition of hydroxylamine; middle, after 0.2 M hydroxylamine treatment at pH 7.0, 18 °C, with 500-nm illumination for 30 min; bottom, after centrifuging twice with resuspension in 100 mM phosphate buffer, pH 7.0, followed by addition of 5 μM all-trans retinal and incubation for 1 h.

[Figure 3 tree: topology not recoverable from the transcript. Scale bar: 0.01. Clade labels: "Monterey Bay and shallow HOT"; "Antarctica and deep HOT". Leaf labels: MB 0m2, MB 0m1, MB 20m2, MB 20m5, MB 40m12, MB 100m9, HOT 75m3, HOT 75m8, HOT 75m1, PAL B1, PAL B5, PAL B6, PAL E7, PAL E1, PAL B7, PAL B2, PAL B8, MB 100m10, MB 100m5, MB 100m7, MB 20m12, MB 40m1, MB 40m5, BAC 40E8, BAC 31A8, PAL E6, BAC 64A5, HOT 0m1, HOT 75m4.]

Figure 3 Phylogenetic analysis of the inferred amino-acid sequence of cloned proteorhodopsin genes. Distance analysis of 220 positions was used to calculate the tree by neighbour-joining using the PaupSearch program of the Wisconsin Package version 10.0 (Genetics Computer Group; Madison, Wisconsin). H. salinarum bacteriorhodopsin was used as an outgroup, and is not shown. Scale bar represents number of substitutions per site. Bold names indicate the proteorhodopsins that were spectrally characterized in this study.

© 2001 Macmillan Magazines Ltd
