an alignment-free metapeptide strategy for metaproteomic...

9
An Alignment-Free MetapeptideStrategy for Metaproteomic Characterization of Microbiome Samples Using Shotgun Metagenomic Sequencing Damon H. May, Emma Timmins-Schiman, Molly P. Mikan, § H. Rodger Harvey, § Elhanan Borenstein, ,,Brook L. Nunn, and William S. Noble* ,,Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195-5065, United States § Department of Ocean, Earth & Atmospheric Sciences, Old Dominion University, Norfolk, Virginia 23529, United States Santa Fe Institute, Santa Fe, New Mexico 87501, United States ABSTRACT: In principle, tandem mass spectrometry can be used to detect and quantify the peptides present in a microbiome sample, enabling functional and taxonomic insight into microbiome metabolic activity. However, the phylogenetic diversity constituting a particular microbiome is often unknown, and many of the organisms present may not have assembled genomes. In ocean microbiome samples, with particularly diverse and uncultured bacterial communities, it is dicult to construct protein databases that contain the bulk of the peptides in the sample without losing detection sensitivity due to the overwhelming number of candidate peptides for each tandem mass spectrum. We describe a method for deriving metapeptides(short amino acid sequences that may be represented in multiple organisms) from shotgun metagenomic sequencing of microbiome samples. In two ocean microbiome samples, we constructed site-specic metapeptide databases to detect more than one and a half times as many peptides as by searching against predicted genes from an assembled metagenome and roughly three times as many peptides as by searching against the NCBI environmental proteome database. The increased peptide yield has the potential to enrich the taxonomic and functional characterization of sample metaproteomes. KEYWORDS: microbial ecology, metaproteomics, metagenomics, mass spectrometry, microbial communities 1. INTRODUCTION Because the ocean microbial community is the dominant driver of ocean biogeochemical processes such as the carbon cycle, a quantitative understanding of the ocean microbial taxa performing important functions is essential. 13 Due to culture technique limitations on mixed microbial communities, methods for examining whole microbiomes in situ are needed. Metaproteomic analysis of ocean samples has the power to detect peptides from thousands of proteins over a wide range of taxonomic groups within a single analysis. 47 Accordingly, metaproteomics has been used to investigate the functional roles of ocean microbes in a variety of ecological contexts. 5,7,8 However, the success of high-throughput proteomics on ocean samples has been limited by a lack of detection sensitivity. 9 The majority of organisms active in the ocean microbiome do not have assembled genomes. 10 Public databases can provide partial metaproteome coverage, but without a precise guide to which organisms are present in the sample, those databases must be extremely large in order to accommodate as much sequence variation as possible. Searching against such very large databases severely and negatively impacts search sensitivity. 1113 Because of the diculty of constructing a protein database that accurately reects an ocean bacterial microbiome, ocean metaproteomics experiments typically only detect a small proportion of the potentially detectable peptides in a sample. 5,14 As sequencing technologies have become more accessible, meta-omicsstudies have integrated metagenomic, metatran- scriptomic, and metaproteomic data. For example, databases for metaproteomic searches can be constructed using genes predicted from an assembled metagenome. 8 However, this approach can lead to low peptide detection sensitivity for two reasons. First, many gene fragments present in sequencing reads cannot be reliably assembled into longer contigs, so they will be missing from the gene prediction. Second, the process of optimal metagenome assembly requires expertise not necessar- ily shared by all researchers wishing to do metaproteomics analysis, and if not done optimally, then the metagenome may fail to contain much of the variation present in the sequencing data. For both of these reasons, even metaproteomic databases based on site-specic assembled metagenomes tend to provide substantially incomplete coverage of the sample metapro- teome. 13,15,16 Received: March 17, 2016 Published: July 11, 2016 Article pubs.acs.org/jpr © 2016 American Chemical Society 2697 DOI: 10.1021/acs.jproteome.6b00239 J. Proteome Res. 2016, 15, 26972705

Upload: others

Post on 31-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

  • An Alignment-Free “Metapeptide” Strategy for MetaproteomicCharacterization of Microbiome Samples Using ShotgunMetagenomic SequencingDamon H. May,† Emma Timmins-Schiffman,† Molly P. Mikan,§ H. Rodger Harvey,§

    Elhanan Borenstein,†,‡,∥ Brook L. Nunn,† and William S. Noble*,†,‡

    †Department of Genome Sciences and ‡Department of Computer Science and Engineering, University of Washington, Seattle,Washington 98195-5065, United States§Department of Ocean, Earth & Atmospheric Sciences, Old Dominion University, Norfolk, Virginia 23529, United States∥Santa Fe Institute, Santa Fe, New Mexico 87501, United States

    ABSTRACT: In principle, tandem mass spectrometry can beused to detect and quantify the peptides present in amicrobiome sample, enabling functional and taxonomic insightinto microbiome metabolic activity. However, the phylogeneticdiversity constituting a particular microbiome is oftenunknown, and many of the organisms present may not haveassembled genomes. In ocean microbiome samples, withparticularly diverse and uncultured bacterial communities, it isdifficult to construct protein databases that contain the bulk ofthe peptides in the sample without losing detection sensitivitydue to the overwhelming number of candidate peptides for each tandem mass spectrum. We describe a method for deriving“metapeptides” (short amino acid sequences that may be represented in multiple organisms) from shotgun metagenomicsequencing of microbiome samples. In two ocean microbiome samples, we constructed site-specific metapeptide databases todetect more than one and a half times as many peptides as by searching against predicted genes from an assembled metagenomeand roughly three times as many peptides as by searching against the NCBI environmental proteome database. The increasedpeptide yield has the potential to enrich the taxonomic and functional characterization of sample metaproteomes.

    KEYWORDS: microbial ecology, metaproteomics, metagenomics, mass spectrometry, microbial communities

    1. INTRODUCTION

    Because the ocean microbial community is the dominant driverof ocean biogeochemical processes such as the carbon cycle, aquantitative understanding of the ocean microbial taxaperforming important functions is essential.1−3 Due to culturetechnique limitations on mixed microbial communities,methods for examining whole microbiomes in situ are needed.Metaproteomic analysis of ocean samples has the power todetect peptides from thousands of proteins over a wide range oftaxonomic groups within a single analysis.4−7 Accordingly,metaproteomics has been used to investigate the functionalroles of ocean microbes in a variety of ecological contexts.5,7,8

    However, the success of high-throughput proteomics on oceansamples has been limited by a lack of detection sensitivity.9

    The majority of organisms active in the ocean microbiomedo not have assembled genomes.10 Public databases canprovide partial metaproteome coverage, but without a preciseguide to which organisms are present in the sample, thosedatabases must be extremely large in order to accommodate asmuch sequence variation as possible. Searching against suchvery large databases severely and negatively impacts searchsensitivity.11−13 Because of the difficulty of constructing aprotein database that accurately reflects an ocean bacterial

    microbiome, ocean metaproteomics experiments typically onlydetect a small proportion of the potentially detectable peptidesin a sample.5,14

    As sequencing technologies have become more accessible,“meta-omics” studies have integrated metagenomic, metatran-scriptomic, and metaproteomic data. For example, databases formetaproteomic searches can be constructed using genespredicted from an assembled metagenome.8 However, thisapproach can lead to low peptide detection sensitivity for tworeasons. First, many gene fragments present in sequencingreads cannot be reliably assembled into longer contigs, so theywill be missing from the gene prediction. Second, the process ofoptimal metagenome assembly requires expertise not necessar-ily shared by all researchers wishing to do metaproteomicsanalysis, and if not done optimally, then the metagenome mayfail to contain much of the variation present in the sequencingdata. For both of these reasons, even metaproteomic databasesbased on site-specific assembled metagenomes tend to providesubstantially incomplete coverage of the sample metapro-teome.13,15,16

    Received: March 17, 2016Published: July 11, 2016

    Article

    pubs.acs.org/jpr

    © 2016 American Chemical Society 2697 DOI: 10.1021/acs.jproteome.6b00239J. Proteome Res. 2016, 15, 2697−2705

    pubs.acs.org/jprhttp://dx.doi.org/10.1021/acs.jproteome.6b00239

  • An alternative approach takes advantage of the fact that mostof the organisms present in many microbiome samples areprokaryotes, and therefore high proportions of their genomesare protein-coding. Tools such as MetaGeneAnnotator,17,18

    Orphelia,16 and FragGeneScan19 predict gene fragmentsdirectly from sequencing reads, without assembling the readsinto contigs. These approaches can be used to constructmetaproteomic databases suitable for database search. AsCantarel et al.13 demonstrated, these databases enable a greaterpeptide yield via database search than other methods, withsensitivity greatly dependent on the specifics of the approach todatabase construction. However, the goal of these tools issensitive gene prediction rather than peptide detection, sodatabases containing translations of their raw gene fragmentoutput can be extremely large. This can lead to impracticallylong running times for database searches and, moreimportantly, reduced peptide detection sensitivity.In the approach described here, we begin with either the

    gene fragments predicted by MetaGeneAnnotator or six-frametranslations of raw reads. We trim and filter these sequences tobuild a database of “metapeptides”: short amino acid sequencesderived from open reading frame fragments found in individualreads that are more likely to be identifiable via LC-MS/MS(Figure 1A). This approach exploits more of the metagenomic

    data than an approach based on an assembled metagenome,incorporating reads that fail to be integrated into a contig aswell as all of the sample variation for each gene sequence whileavoiding a loss of sensitivity due to overinclusivity. It is bothmore complete and more focused on the sample at hand than astrategy based on public databases, potentially including

    sequences never before observed in any organism and excludingsequences from species not present in the sample.To evaluate the utility of our metapeptide approach, we

    compared the sets of peptide sequences detected in two ArcticOcean microbiome samples at a 1% false discovery rate (FDR)via database search against three different databases (Figure1B): the NCBI nonredundant database of environmentalprotein sequences (env_nr), which is commonly used tointerrogate ocean and soil microbiome samples,4,20 a databasederived from a metagenome assembled from shotgunmetagenomic sequencing of the two Arctic Ocean samples,and metapeptide databases constructed from the samesequencing reads.Two microbiome samples were collected from the Arctic

    Ocean: one sample from the surface chlorophyll maximumlayer in the Bering Strait (BSt) and one from bottom waters inthe Chuckchi Sea (CS). A total of 2050 peptides were detectedin the BSt sample by searching the environmental database. Ametagenome-derived database search yielded 2.12 times asmany peptides, and a metapeptide search yielded 3.37 times asmany peptides. Results were similar in the CS sample, thoughwith many fewer peptides detected in each search. Integratingthe results from all three databases further increased peptideyield.This substantial advantage in peptide yield contributes

    greatly to the taxonomic and potential functional classificationof the sample proteomes. We used Unipept21,22 to infer thelowest common ancestor taxon for peptides detected in eachsearch, as well as the list of Gene Ontology (GO) “biologicalprocess” categories associated with proteins containing eachpeptide. Comparison of the results revealed a much richertaxonomic characterization of the proteins present in thesamples from the metapeptide search than from either of theother methods, and a much higher number of detected peptideswith the potential for functional annotation. Thus, in additionto dramatically increasing the number of peptides detected in agiven ocean sample, the metapeptide-based approach cansignificantly expand our understanding of the organismsproducing the biochemically active molecules in a microbiome.This understanding is crucial to developing a functional modelof the microbiome.

    2. METHODS

    The data described in the following sections may bedownloaded at http://noble.gs.washington.edu/proj/metapeptide.

    2.1. Experimental Methods

    2.1.1. Sample Collection. Water samples were collected inAugust of 2013 from the Bering Strait (BSt) chlorophyllmaximum layer (7 m depth, 65°43.44″ N, 168°57.42″ W) andfrom the more northern Chukchi Sea (CS) bottom waters(55.5 m depth, 72°47.624″ N, 16°53.89″ W) using a 24-bottleCTD (conductivity, temperature, and depth) rosette (10 LGeneral Oceanics Niskin X). The measurement of integratedwater column chlorophyll was 226.88 mg/m2 at station BSt and2.64 mg/m2 at station CS. As our previous work has shown, toexamine bacterial contributions, it is essential to remove thevery high background contribution from algal inhabitants.23

    Also, oceanic marine bacteria are typically smaller than bacteriain gut biomes or freshwater systems, with the majority passing a1.0 μm filter.24,25 Accordingly, a 15 L water sample wasprefiltered through two high-volume cartridges (10 μm and

    Figure 1. Multiple approaches for metaproteomics of microbiomesamples. (A) After high-throughput sequencing, metapeptide databaseconstruction begins with six-frame translations of raw sequencingreads, or with gene fragments predicted from reads. Amino acidsequences are trimmed to their outermost tryptic sites to yieldmetapeptide sequences. Candidate metapeptides are filtered on per-base quality scores, sequence length and other features. Passingcandidates are added to the database. (B) Alternative proteomicsworkflows. Microbiome samples are subjected to shotgun meta-genomic sequencing and LC-MS/MS analysis. MS/MS spectra aresearched with Comet against the NCBI environmental database,against predicted genes from an assembled metagenome, or against ametapeptide database, resulting in peptide yields of different size.Photographs copyright 2016 Damon May.

    Journal of Proteome Research Article

    DOI: 10.1021/acs.jproteome.6b00239J. Proteome Res. 2016, 15, 2697−2705

    2698

    http://noble.gs.washington.edu/proj/metapeptidehttp://noble.gs.washington.edu/proj/metapeptidehttp://dx.doi.org/10.1021/acs.jproteome.6b00239

  • then 1 μm) to remove larger eukaryotes, and the filtratecomprising the bacterial microbiome was then collected on aglass fiber filter (GF/F) with nominal pore size of 0.7 μm.Filters were flash frozen and stored at −80 °C until extraction.2.1.2. Metagenome DNA Extraction, Library Prepara-

    tion, and Sequencing. Filters were sliced, and DNAextraction was accomplished using the protocol developed forplanktonic biomass on Sterivex filters, as described in Wright etal.26 Briefly, DNA was extracted from the collected cells usingphenol/chloroform and chloroform extractions. DNA was thenpurified using a cesium chloride density gradient. ExtractedDNA was sheared to

  • Results of searches of individual samples against multipledatabases were integrated as follows. PSMs from searchesagainst all databases were combined into a single tab-delimitedfile of features for input to Percolator.34 For each database, anew binary feature was added to the combined feature fileindicating whether the PSM was derived from a search againstthat database. Percolator was then used to analyze thecombined set, thereby computing a discriminant score foreach PSM. For each scan with multiple PSMs (from multipledatabases), all but the highest-scoring PSM were removed.Peptide-level FDR was then calculated as described above.2.2.4. Taxonomic and Functional Inference. Detected

    peptides were given taxonomic assignments by Unipept version1.1.0. For all tryptic peptides with no missed cleavages presentin UniProtKB, Unipept assigns a lowest common ancestor(LCA) taxon from the NCBI Taxonomy Database, the most-granular taxon common to all organisms containing thepeptide. For peptides with missed tryptic cleavages, Unipeptcalculates an LCA based on the LCAs associated with allcompletely cleaved peptide sequences contained in the peptide.No such standard methods currently exist for assigning

    functional annotations to detected peptide sequences, so weestimated the maximum number of peptides that could beassigned functional annotations. We used Unipept to retrieveall of the proteins containing each detected peptide, along withtheir GO category annotations. GO annotations are dividedinto three namespaces: “biological process”, “molecularfunction”, and “cellular compartment”. We declared a peptideto be potentially functionally informative if at least one proteincontaining it was annotated with at least one GO category inthe “biological process” namespace.

    3. RESULTS

    3.1. Gene Fragment Predictions from Deep ShotgunMetagenomics Sequencing Are Not Directly Usable forProteomics Database Search

    Shotgun sequencing of the BSt and CS samples generated 171million and 245 million reads, respectively. We evaluated threedifferent gene prediction tools: FragGeneScan, Orphelia, andMetaGeneAnnotator. None of these tools were originallydeveloped for this high depth of coverage, nor have they beenupdated to accommodate high-depth sequencing. On the BStreads, MetaGeneAnnotator ran to completion in 3.5 h,producing 133 million fragments. Orphelia quickly exceeded100 GB of memory usage; as its output on smaller inputs was33 times the size of the output of MetaGene, with no scoringmechanism to use for filtering, we decided not to pursue itfurther. After 5 days of running time, FragGeneScan had notyet completed, and its output on smaller inputs was 32 timesthe size of the output of MetaGene, so we decided not topursue it further.The 133 million fragments produced by MetaGeneAnnotator

    contained 222 million unique peptides and required more than4 days to search one replicate against. However, 177 million ofthe peptides in the database represented ragged ends ofpeptides terminating at the beginning or end of a metagenomicsequencing read and did not represent a detectable trypticpeptide.

    3.2. Environmental and Assembled MetagenomeDatabases Provide Incomplete Coverage of Peptides inOcean Samples

    Next, we quantified the extent to which a public database and ametagenome-derived database could be used to detect thepeptides present in the two ocean microbiome samples. Theenvironmental and metagenome databases contained 119million and 11 million peptides, respectively. All three replicatesof each sample were searched against both databases, and theset of peptides detected with FDR < 0.01 in searches againsteach database was determined with Percolator, as describedabove (Figure 2).

    For the BSt sample, of the 4344 peptides present in themetagenome database and detected in the metagenome search,61.2% did not occur in the environmental database; similarly,46.4% of the 2050 peptides present in the environmentaldatabase and detected in the environmental search were absentfrom the metapeptide database. This high complementarityindicates that large numbers of peptides present in the sampleare absent from each database. Furthermore, of the 1708peptides present in both databases (and therefore potentiallydetectable by either search) and detected in one or bothsearches, only 1.3% were detected in the environmentaldatabase search but not in the metapeptide database search.By contrast, 35.7% were detected in the metapeptide search butnot in the environmental search. Those peptides were presentin the environmental database, so we conclude that the failure

    Figure 2. Peptides detected in different searches. (A) Line plots ofpeptide false discovery rate (FDR, horizontal axis) vs number ofpeptide sequences detected at that FDR in the Bering Strait (BSt) andChukchi Sea (CS) samples (vertical axis), when searched four differentways. Dashed line indicates peptide yield at FDR > 0.01 from searchesagainst the NCBI environmental database (2050 in BSt and 1317 inCS), against the metagenome-derived database (4344 and 1877),against metapeptides derived from the sample being searched (6918and 3606), and integrated results from all three databases (7508 and3871). (B) Detected peptide counts at FDR < 0.01 in themetagenome, metapeptide, and integrated database searches as apercentage of the counts detected in environmental search.

    Journal of Proteome Research Article

    DOI: 10.1021/acs.jproteome.6b00239J. Proteome Res. 2016, 15, 2697−2705

    2700

    http://dx.doi.org/10.1021/acs.jproteome.6b00239

  • to detect them is due to a loss of statistical power stemmingfrom the much larger size of the environmental database.3.3. Searching Metapeptide Databases Increases PeptideYield and Enriches Taxonomic and FunctionalCharacterization

    Next, we evaluated the ability of our metapeptide databases toincrease peptide detection sensitivity relative to the environ-mental and metagenome databases. The metapeptide databasesconstructed from the shotgun metagenomic sequencing readsfrom the BSt and CS samples contained 12 million and 14million peptides, respectively. The BSt metapeptide database(12 million peptides) contained 2 million peptides in commonwith the environmental database (129 million peptides) and 4million in common with the metagenome database (11 millionpeptides). All MS/MS replicates of the BSt and CS oceanmicrobiome samples were searched against the metapeptidedatabase constructed from the sample being searched, and theset of peptides detected with FDR < 0.01 was derived withPercolator. In the BSt and CS samples, the numbers of peptidesdetected were 1.59 and 1.92 times the number detected bysearching against the metagenome-derived database and 3.37times and 2.74 times the number detected by searching againstthe environmental database, respectively (Figure 2).To determine the reasons for this larger peptide yield, we

    compared the sets of peptides detected by searching the BStspectra against the metapeptide and environmental databases(Figure 3). Of the 6918 peptides present in the metapeptide

    database and detected in the metapeptide search, 67.8% did notoccur in the environmental database; by contrast, only 27.5% ofthe 2050 peptides present in the environmental database anddetected in the environmental search were absent from themetapeptide database. This discrepancy suggests that themetapeptide database contains more of the peptides present

    in the sample. Furthermore, of the 2261 peptides present inboth databases and detected in one or both searches, only 1.5%were detected in the environmental database search but not inthe metapeptide database search. By contrast, 34.2% weredetected in the metapeptide search but not in the environ-mental search. Those peptides were present in the environ-mental database, so we conclude that the failure to detect themis due to a loss of statistical power stemming from the muchlarger size of the environmental database. We also comparedthe sets of peptides detected by searching the BSt spectraagainst the metapeptide and metagenome databases, which areof much more similar size. Of the 6918 peptides present in themetapeptide database and detected in the metapeptide search,41.9% did not occur in the environmental database; by contrast,only 7.9% of the 4344 peptides present in the metagenomedatabase and detected in the metagenome search were absentfrom the metapeptide database, suggesting more completesample coverage in the metapeptide database. Furthermore, ofthe 4250 peptides present in both databases and detected inone or both searches, 5.3% were detected in the metagenomesearch but not in the metapeptide database search, whereas5.8% were detected in the metapeptide search but not in themetagenome search. This suggests that the metapeptidedatabase advantage is essentially due to greater coverage andthat the searches of the two similarly sized databases areroughly equally sensitive.By themselves, detected peptide sequences provide limited

    information about a sample. However, the peptides can be usedto provide important insight into the sample’s communitycomposition. Accordingly, we assessed the extent to which theadditional peptides detected using the metapeptide databasecan enrich the taxonomic and functional classification of themetaproteome.For taxonomic inference, we used the Unipept tool to assign

    a least common ancestor (LCA) taxon to all possible peptidesdetected in a search of the BSt sample replicates against a givendatabase. The metapeptide search detected 1.28 times as manypeptides that were assigned LCAs as the metagenome-deriveddatabase search, and 1.76 times as many as the environmentaldatabase search. At every taxonomic rank more granular thanclass, the highest number of taxa were detected by theintegrated search, followed by the metapeptide, metagenome,and then environmental searches (Figure 4). The same orderwas observed when examining the number of peptides with anLCA at each taxonomic rank. As a side note, the number ofpeptides detected by a given database search with an LCA at agiven rank decreases monotonically with the granularity of therank, which is a reflection of the LCA ranks of detectablepeptides as a whole, as determined by Unipept based onUniProt annotations. The number of taxa detected increaseswith rank granularity until the rank of species, which shows amodest decline from the rank of genus. This relationship is alsoa function of the LCA ranks of detectible peptides and does notreflect any particular characteristics of the various searches.Because many metapeptides are likely from unsequenced

    microbes not present in public protein databases (and thereforeuninformative to Unipept), an important question is whetherthe detected peptides that were present in the metapeptidedatabase but absent from the environmental database conferredany taxonomic information via this method. Considering thepeptides detected in the metapeptide database search, thepercentage of peptides that are assignable to an LCA byUnipept is much greater for the subset of those peptides that

    Figure 3. Database and detected peptide comparisons. (A) The BStmetapeptide database contains roughly 12 million tryptic peptides.The environmental database contains 129 million, with an intersectionbetween the two databases of 2 million peptides. The metagenomedatabase contains 11 million peptides, with 4 million peptides incommon with the BSt metapeptide database. (B) Searching against theBSt metapeptide database detects 6918 unique peptides at FDR <0.01, vs 2050 when searching against the environmental database, with1452 in common. Of the 2261 peptides detected in either search thatwere present in both databases, 1452 were detected in both searches,774 were only detected in the BSt metapeptide database search, and 35were only detected in the environmental database search.

    Journal of Proteome Research Article

    DOI: 10.1021/acs.jproteome.6b00239J. Proteome Res. 2016, 15, 2697−2705

    2701

    http://dx.doi.org/10.1021/acs.jproteome.6b00239

  • are present in the environmental database (70.8%) than for thesubset that are absent (16.3%). However, because 67.8% of thepeptides detected in the metapeptide database search are absentfrom the environmental database, in absolute terms 32.6% ofthe peptides assignable to LCAs come from that latter group.Thus, both the greater peptide detection sensitivity and thegreater peptide coverage afforded by the metapeptide databasecontribute to its increased potential for metaproteometaxonomic classification.To estimate the number of peptides with potential functional

    annotations, we used Unipept to locate all of the proteinscontaining each detected peptide, along with their associatedGO categories. The metapeptide search detected 1.50 times asmany peptides associated with one or more GO categories inthe “biological process” namespace as the metagenome-deriveddatabase search and 2.12 as many as the environmentaldatabase search.3.4. Combining Results from Multiple Databases FurtherIncreases Peptide Coverage

    Although the metapeptide databases are the most valuableindividual databases for searching these samples, a higheroverall peptide yield can be obtained by combining results frommultiple databases. PSMs from searches against the environ-mental, metagenome, and metapeptide databases wereintegrated as described above. In the BSt and CS samples,respectively, 1.09 and 1.07 times as many peptides weredetected by this method as by searching against the individual

    metapeptide databases (Figure 2). For context, the largest set ofpeptides that could possibly be detected at FDR 0.01 by anymethod combining these three searches is the union of allpeptides detected at FDR 0.01 in each of the three separatedatabase searches of each sample. This number was 1.16 and1.18 times the number of peptides detected via metapeptidesearches of the two samples, respectively. We determined thatthis modest increase was statistically significant via an analysisof the three technical replicates for each sample. In themetapeptide database searches, technical replicates of the BStand CS samples detected a mean of 4691.3 and 2991.6peptides, respectively, with a mean pairwise intersection of3075.0 and 2377.7 peptides between replicates, respectively.The per-replicate means for the BSt and CS samples in theintegrated searches were 5090.0 and 3193.0, respectively, andfor each sample, the increase was statistically significant by bothone-tailed and two-tailed paired t tests at p < 0.05. In terms oftaxonomic inference, the integrated searches of the BSt and CSsamples detected 1.13 and 1.12 times as many peptidesassignable to LCAs compared with a metapeptide search(Figure 4 for BSt comparison), with more taxa observed atevery taxonomic rank lower than superkingdom.This method of integrating database search results yielded

    14.2% more peptides at FDR < 0.01 as searching aconcatenated database combining the environmental, meta-genome and metapeptide databases. Because of the reducedstatistical power of a search against a larger database, searchingthe concatenated database yielded 5.0% fewer peptides thansearching the metapeptide database alone. The extra Percolatorfeatures representing the database against which each PSM wasmade were of modest benefit, increasing peptide yield by 4.3%versus an integrated search with those features removed.

    3.5. Metapeptide Databases from Two MicrobiomeSamples Can Be Used to Interrogate Each Other

    Constructing a metapeptide database is a relatively expensiveendeavor, requiring library preparation, short read sequencing,and computational time, so it would be convenient to use asingle database to interrogate the metaproteome from multiplesamples. Our two samples are from two different locations andfrom two very different positions in the water column(chlorophyll maximum layer and bottom water). In each case,overall peptide yield from a database search against themetapeptide database derived from the other sample was a largeimprovement over the yield from a search against theenvironmental database (2.17 and 1.95 times as many peptides,respectively). In each case, however, searching a sample againstits site-specific metapeptide database detected many morepeptides than searching against the database derived from theother sample. Notably, the BSt sample appeared to benefitgreatly from a search against the BSt database rather thanagainst the CS database (1.55 times as many peptides), whereasthe effect in the opposite direction was not as pronounced(1.40 times as many). A potential explanation for this differencelies in the depth from which the two samples were taken: theBSt sample, from the upper water column, is expected tocontain more biodiversity than the CS sample taken from thebottom layer, which has no light.

    3.6. Filtering Protocols Are Critical to ResultingMetapeptide Database Size

    Prior to filtering, the trimmed MetaGene output contained 51million tryptic peptides. To investigate the effects of thefiltering criteria, we systematically varied each parameter while

    Figure 4. Taxonomic inference summary. Bar charts comparing thetaxonomic information derived from four different searches of the BStsample: against the NCBI environmental database, against themetagenome-derived database, against site-specific metapeptides, andintegrating results from all three databases. (A) Counts of taxadetected, by rank from superkingdom to species. (B) Counts ofpeptides associated with an LCA at each rank.

    Journal of Proteome Research Article

    DOI: 10.1021/acs.jproteome.6b00239J. Proteome Res. 2016, 15, 2697−2705

    2702

    http://dx.doi.org/10.1021/acs.jproteome.6b00239

  • leaving the remaining parameters set to the values described insection 2.2. The results (Figure 5) demonstrate that filtering

    metapeptides based on the support of two or more reads andthe use of MetaGeneAnnotator fragments rather than six-frametranslations of raw reads had particularly large effects ondatabase size, reducing the number of unique tryptic peptidesby 74.0 and 51.8%, respectively, and decreasing search time bysimilar proportions.To investigate the effect of database filtering parameters on

    peptide detection sensitivity, we generated a small sample set of24 000 MS/MS spectra from the BSt sample (8000 randomspectra from each replicate run) to compare the number ofpeptides detected at FDR < 0.01 by searching each database.Beginning with MetaGeneAnnotator fragments rather thanwith a six-frame translation of raw reads increased detectedpeptides by 9.0%, demonstrating that the MetaGeneAnnotatoris valuable but not crucial to the metapeptide strategy. TheMetaGeneAnnotator score was not useful as a filteringcriterion: higher score thresholds resulted in monotonicallylower peptide yield. Requiring two or more reads increaseddetected peptides by 8.4%. Higher read count thresholdsmonotonically reduced yield. Sufficiently restrictive values foreach parameter reduce peptide yield much more severely (datanot shown). However, in general, within the range of valuesshown in Figure 5 the reduction in yield was minor, suggestinga relative robustness of the parameter settings.

    4. DISCUSSIONIn this work, we have demonstrated the value of interrogatingmicrobial metaproteomes by constructing metapeptide data-bases from site-specific shotgun metagenomic sequencing reads.These databases afford much greater peptide detectionsensitivity than the NCBI environmental database or a databaseof genes predicted from an assembled metagenome. Fur-

    thermore, we have shown that a database derived from onesample can be used to interrogate another sample from adifferent location and position in the water column. Bycombining metapeptide databases from a variety of samples,sequencing efforts could potentially be centralized to an extent,and metapeptide databases integrated into existing metapro-teomics workflows such as the MetaProteomeAnalyzer.35 Inprinciple, these methods should be applicable to othermicrobiomes, such as riverine and soil-derived microbialcommunities, in which prokaryotes dominate the microbiomeand the great majority of organisms are unsequenced. Thesemethods may also provide additional sensitivity in a better-understood environment such as the human gut microbiome.From a practical standpoint, the larger environmental

    database required more computational time for database searchthan the metapeptide databases. On our hardware, the threeBSt sample replicates, with an average of 104 000 MS/MSscans, took us an average of 1.15 h to search against the BStdatabase and an average of 14 h to search against theenvironmental database. The much larger raw database ofMetaGeneAnnotator fragments took more than 4 days tosearch. The strategy of trimming reads to the outermost trypticsites and the filtering criteria applied are responsible for themuch smaller metapeptide database size, making the metapep-tide database easier to integrate into a proteomics pipeline.De novo sequencing is another strategy for increasing peptide

    detection sensitivity. Although this method can yield manyconfident partial peptide sequences, it is less effective atconfidently detecting full-length peptides. Furthermore, as aresult of codon degeneracy, de novo sequencing also cannoteasily link detected peptides with their correspondingnucleotide sequences for taxonomic annotation. As othershave noted, the space of peptides likely to be present in ametaproteomics sample should remain tractable to a databasesearch approach if search databases are constructed with an eyetoward detection sensitivity.36 However, de novo sequencingremains a viable approach for assignment of spectra that cannotbe assigned with database search.Some of the peptide sequences detected by metapeptide

    database search are present in organisms with publicly availablegenomes, enabling putative taxonomic assignment usingexisting peptide-based tools and enriching taxonomic character-ization. However, a large proportion of peptides detected bysearching against metapeptide databases have never beenreported in an assembled genome. In future work, we willplace those peptides within a taxonomic hierarchy usingsequence homology. This may be accomplished using all ofthe nucleotide sequences of the reads that contributed to theinclusion of each metapeptide in the database.Sequence homology could also be used to infer the putative

    function of proteins containing these detected peptides. Withboth taxonomic and functional assignments, a large number ofdetected peptides could be used in comparisons of the activityof various microbes between samples. This research willquantify the protein functions responsible for chemicaltransformations at meaningful taxonomic levels, therebyexposing microbial ecosystems at the molecular level toimprove our understanding of their interactions and biologicalroles. Applying this approach in conjunction with recentadvances in quantitative proteomics can bring about afundamental change in how we view, analyze, and modelmicrobial ecosystems.

    Figure 5. Metapeptide parameter comparison. Comparisons betweenmetapeptide databases constructed with different filtering parameters.The bars on the far left represent a database of MetaGene fragmentstrimmed to outermost tryptic ends but otherwise are unfiltered. Eachgroup of three bars represents three different values for a singleparameter, with all other parameters set as described in section 2.2. Ineach group, the middle value represents the value used to generate theresults described in Figures 2−5. The N/A value for MetaGene scorerepresents the use of raw six-frame translations of reads instead ofMetaGene output. (A) Millions of tryptic peptides in each database.(B) Percent difference in counts of peptides detected in a search of24 000 scans from the three BSt sample replicates at FDR < 0.01against each database, as compared with search against the unfiltered,trimmed MetaGene database.

    Journal of Proteome Research Article

    DOI: 10.1021/acs.jproteome.6b00239J. Proteome Res. 2016, 15, 2697−2705

    2703

    http://dx.doi.org/10.1021/acs.jproteome.6b00239

  • An important area for future research lies in the developmentof improved methods for combining search results frommultiple databases. The approach we have adopted here reliesupon the machine-learning algorithm Percolator to calibratescores between the different database searches. A morepowerful approach might be to adopt a strategy similar tocascade search,37 searching against, in order, the metapeptide,metagenome, and environmental databases. In future work, weplan to develop and validate a statistical method for combiningcascade search with a machine-learning postprocessing stepsuch as Percolator.The software tools described here have been implemented in

    Python 2.7. The software (including source code) and datadescribed in this manuscript may be downloaded at http://noble.gs.washington.edu/proj/metapeptide.

    ■ AUTHOR INFORMATIONCorresponding Author

    *E-mail: [email protected]. Tel.: (206) 221-4973.Notes

    The authors declare no competing financial interest.

    ■ ACKNOWLEDGMENTSResearch reported in this publication was supported by theNational Institute of General Medical Sciences of the NationalInstitutes of Health under award number P41 GM103533, theNational Science Foundation Directorate for Geosciencesunder award numbers OCE-1233014 and OCE-1233589, andthe National Defense Science and Engineering GraduateFellowship (NDSEG) Program.

    ■ REFERENCES(1) Allison, S. D.; Martiny, J. B. H. Colloquium paper: resistance,resilience, and redundancy in microbial communities. Proc. Natl. Acad.Sci. U. S. A. 2008, 105 (Supplement 1), 11512.(2) Pinhassi, J.; Azam, F.; Hemphal̈a,̈ J.; Long, R. A.; Martinez, J.;Zweifel, U. L.; Hagström, Å. Coupling between bacterioplanktonspecies composition, population dynamics, and organic matterdegradation. Aquat. Microb. Ecol. 1999, 17 (1), 13−26.(3) Azam, F.; Fenchel, T.; Field, J. G.; Gray, J. C.; Meyer-Reil, L. A.;Thingstad, F. The ecological role of water-column microbes in the sea.Mar. Ecol.: Prog. Ser. 1983, 10 (3), 257−264.(4) Morris, R. M.; Nunn, B. L.; Frazar, C.; Goodlett, D. R.; Ting, Y.S.; Rocap, G. Comparative metaproteomics reveals ocean-scale shiftsin microbial nutrient utilization and energy transduction. ISME J.2010, 4 (5), 673−685.(5) Georges, A. A.; El-Swais, H.; Craig, S. E.; Li, W. K. W.; Walsh, D.A Metaproteomic analysis of a winter to spring succession in coastalnorthwest Atlantic Ocean microbial plankton. ISME J. 2014, 8 (6),1301−1313.(6) Yoshida, M.; Yamamoto, K.; Suzuki, S. Metaproteomiccharacterization of dissolved organic matter in coastal seawater. J.Oceanogr. 2014, 70 (1), 105−113.(7) Hawley, A. K.; Brewer, H. M.; Norbeck, A. D.; Pasǎ-Tolic, L.;Hallam, S. J. Metaproteomics reveals differential modes of metaboliccoupling among ubiquitous oxygen minimum zone microbes. Proc.Natl. Acad. Sci. U. S. A. 2014, 111 (31), 11395−11400.(8) Teeling, H.; Fuchs, B. M.; Becher, D.; Klockow, C.; Gardebrecht,A.; Bennke, C. M; Kassabgy, M.; Huang, S.; Mann, A. J; Waldmann, J.;et al. Substrate-Controlled Succession of Marine BacterioplanktonPopulations Induced by a Phytoplankton Bloom. Science 2012, 336,608−611.(9) Muth, T.; Kolmeder, C. A.; Salojar̈vi, J.; Keskitalo, S.; Varjosalo,M.; Verdam, F. J.; Rensen, S. S.; Reichl, U.; de Vos, W. M.; Rapp, E.;

    Martens, L. Navigating through metaproteomics data: A logbook ofdatabase searching. Proteomics 2015, 15 (20), 3439−3453.(10) Rappe,́ M. S.; Giovannoni, S. J. The uncultured microbialmajority. Annu. Rev. Microbiol. 2003, 57, 369−394.(11) Nesvizhskii, A. I. A survey of computational methods and errorrate estimation procedures for peptide and protein identification inshotgun proteomics. J. Proteomics 2010, 73 (11), 2092−2123.(12) Noble, W. S. Mass spectrometrists should search only forpeptides they care about. Nat. Methods 2015, 12 (7), 605−608.(13) Cantarel, B. L.; Erickson, A. R.; VerBerkmoes, N. C.; Erickson,B. K.; Carey, P. A.; Pan, C.; Shah, M.; Mongodin, E. F.; Jansson, J. K.;Fraser-Liggett, C. M.; Hettich, R. L. Strategies for metagenomic-guided whole-community proteomics of complex microbial environ-ments. PLoS One 2011, 6, e27173.(14) Keiblinger, K. M.; Wilhartitz, I. C.; Schneider, T.; Roschitzki, B.;Schmid, E.; Eberl, L.; Riedel, K.; Zechmeister-Boltenstern, S. Soilmetaproteomics - Comparative evaluation of protein extractionprotocols. Soil Biol. Biochem. 2012, 54, 14−24.(15) Xiong, W.; Abraham, P. E.; Li, Z.; Pan, C.; Hettich, R. L.Microbial metaproteomics for characterizing the range of metabolicfunctions and activities of human gut microbiota. Proteomics 2015, 15(20), 3424−3438.(16) Hoff, K. J.; Lingner, T.; Meinicke, P.; Tech, M. Orphelia:Predicting genes in metagenomic sequencing reads. Nucleic Acids Res.2009, 37, W101−W105.(17) Noguchi, H.; Park, J.; Takagi, T. MetaGene: Prokaryotic genefinding from environmental genome shotgun sequences. Nucleic AcidsRes. 2006, 34 (19), 5623−5630.(18) Noguchi, H.; Taniguchi, T.; Itoh, T. MetaGeneAnnotator:Detecting species-specific patterns of ribosomal binding site for precisegene prediction in anonymous prokaryotic and phage genomes. DNARes. 2008, 15 (6), 387−396.(19) Rho, M.; Tang, H.; Ye, Y. FragGeneScan: Predicting genes inshort and error-prone reads. Nucleic Acids Res. 2010, 38 (20), e191.(20) Moore, E. K.; Nunn, B. L.; Faux, J. F.; Goodlett, D. R.; Harvey,H. R. Evaluation of electrophoretic protein extraction and database-driven protein identification from marine sediments. Limnol. Ocean-ogr.: Methods 2012, 10, 353−366.(21) Mesuere, B.; Devreese, B.; Debyser, G.; Aerts, M.; Vandamme,P.; Dawyndt, P. Unipept: Tryptic peptide-based biodiversity analysis ofmetaproteome samples. J. Proteome Res. 2012, 11 (12), 5773−5780.(22) Mesuere, B.; Debyser, G.; Aerts, M.; Devreese, B.; Vandamme,P.; Dawyndt, P. The Unipept metaproteomics analysis pipeline.Proteomics 2015, 15 (8), 1437−1442.(23) Moore, E. K.; Nunn, B. L.; Goodlett, D. R.; Harvey, R. H.Identifying and tracking proteins through the marine water column:Insights into the inputs and preservation mechanisms of protein insediments. Geochim. Cosmochim. Acta 2012, 83, 324−359.(24) Kirchman, D.Microbial ecology of the oceans; John Wiley & Sons:Hoboken, NJ, 2010.(25) Needham, D. M.; Fuhrman, J. A. Pronounced daily succession ofphytoplankton, archaea and bacteria following a spring bloom. NatureMicrobiology 2016, 1 (4), 16005.(26) Wright, J. J.; Lee, S.; Zaikova, E.; Walsh, D. A.; Hallam, S. J.DNA extraction from 0.22 microM Sterivex filters and cesium chloridedensity gradient centrifugation. J. Visualized Exp. 2009, 31, 3−6.(27) Cox, M. P.; Peterson, D. A.; Biggs, P. J. SolexaQA: At-a-glancequality assessment of Illumina second-generation sequencing data.BMC Bioinf. 2010, 11 (1), 485.(28) Ewing, B.; Hillier, L.; Wendl, M. C.; Green, P. Base-Calling ofAutomated Sequencer Traces Using Phred. II. Error Probabilities.Genome Res. 1998, 8, 186.(29) Nunn, B. L.; Slattery, K.; Cameron, K. A.; Timmins-Schiffman,E.; Junge, K. Proteomics of Colwellia psychrerythraea at subzerotemperatures - a life with limited movement, flexible membranes andvital DNA repair. Environ. Microbiol. 2015, 17, 2319−2335.(30) Kultima, J. R.; Sunagawa, S.; Li, J.; Chen, W.; Chen, H.; Mende,D. R.; Arumugam, M.; Pan, Q.; Liu, B.; Qin, J.; Wang, J.; Bork, P.

    Journal of Proteome Research Article

    DOI: 10.1021/acs.jproteome.6b00239J. Proteome Res. 2016, 15, 2697−2705

    2704

    http://noble.gs.washington.edu/proj/metapeptidehttp://noble.gs.washington.edu/proj/metapeptidemailto:[email protected]://dx.doi.org/10.1021/acs.jproteome.6b00239

  • MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit.PLoS One 2012, 7 (10), e47656.(31) Eng, J. K.; Jahan, T. A.; Hoopmann, M. R. Comet: an opensource tandem mass spectrometry sequence database search tool.Proteomics 2013, 13 (1), 22−24.(32) Granholm, V.; Navarro, J. F.; Noble, W. S.; Kal̈l, L. Determiningthe calibration of confidence estimation procedures for uniquepeptides in shotgun proteomics. J. Proteomics 2013, 80 (27), 123−131.(33) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increasedconfidence in large-scale protein identifications by mass spectrometry.Nat. Methods 2007, 4 (3), 207−214.(34) Kal̈l, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss,M. J. A semi-supervised machine learning technique for peptideidentification from shotgun proteomics datasets. Nat. Methods 2007, 4,923−925.(35) Muth, T.; Behne, A.; Heyer, R.; Kohrs, F.; Benndorf, D.;Hoffmann, M.; Lehteva,̈ M.; Reichl, U.; Martens, L.; Rapp, E. TheMetaProteomeAnalyzer: A Powerful Open-Source Software Suite forMetaproteomics Data Analysis and Interpretation. J. Proteome Res.2015, 14, 1557−1565.(36) Saito, M. A.; Dorsk, A.; Post, A. F.; Mcilvin, M. R.; Rappe,́ M. S.;Ditullio, G. R.; Moran, D. M. Needles in the blue sea: Sub-speciesspecificity in targeted protein biomarker analyses within the vastoceanic microbial metaproteome. Proteomics 2015, 15 (20), 3521−3531.(37) Kertesz-Farkas, A.; Keich, U.; Noble, W. S. Tandem massspectrum identification via cascaded search. J. Proteome Res. 2015, 14(8), 3027−3038.

    Journal of Proteome Research Article

    DOI: 10.1021/acs.jproteome.6b00239J. Proteome Res. 2016, 15, 2697−2705

    2705

    http://dx.doi.org/10.1021/acs.jproteome.6b00239