refsoil: a reference database of soil microbial genomes · 5/14/2016 · 1 refsoil: a reference...
TRANSCRIPT
RefSoil:Areferencedatabaseofsoilmicrobialgenomes1Runningtitle:Areferencedatabaseofsoilmicrobialgenomes23Jinlyung Choi1, Fan Yang1, Ramunas Stepanauskas2, Erick Cardenas3, Aaron4Garoutte4,RyanWilliams1, JaredFlater1, JamesMTiedje4,KirstenS.Hofmockel5,6,5BrianGelder1,AdinaHowe1§671Agricultural and Biosystems Engineering, Iowa State University, Iowa, United8States92BigelowLaboratoryforOceanSciences,Maine,UnitedStates10RamunasStepanauskas113Department of Microbiology & Immunology, University of British Columbia,12Vancouver,Canada134CenterforMicrobialEcology,MichiganStateUniversity,Michigan,UnitedStates145Environmental Molecular Sciences Laboratory, Pacific Northwest National15Laboratory,Washington,UnitedStates166DepartmentofEcology,EvolutionandOrganismalBiology, IowaStateUniversity,17Iowa,UnitedStates1819§Correspondingauthor2021Correspondingauthoraddresses:22Name:AdinaHowe23Postaladdress:1201SukupHallAmesIA,50011UnitedStates24Telephone:+1-515-294-017625E-mail:adina@iastate.edu262728SubjectCategories29Microbialpopulationandcommunityecology30313233343536373839404142434445
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
Abstract4647A database of curated genomes is needed to better assess soil microbial48communities and their processes associatedwith differing landmanagement and49environmental impacts. Interpreting soil metagenomic datasets with existing50sequence databases is challenging because these datasets are biased towards51medical andbiotechnology researchand can result inmisleading annotations.We52havecuratedadatabaseof922genomesofsoil-associatedorganisms(888bacteria53and 34 archaea). Using this database,we evaluated phyla and functions that are54enriched in soils as well as those thatmay be underrepresented in RefSoil. Our55comparisonofRefSoiltosoilamplicondatasetsallowedustoidentifytargetsthatif56cultured or sequenced would significantly increase the biodiversity represented57withinRefSoil.Todemonstratetheopportunitiestoaccesstheseunderrepresented58targets, we employed single cell genomics in a pilot experiment to sequence 1459genomes. This effort demonstrates the value of RefSoil in the guidance of future60researcheffortsandthecapabilityofsinglecellgenomicsasapracticalmeanstofill61theexistinggenomicdatagaps.6263Introduction6465
Microbialpopulationsinthesoilimpactthestructureandfertilityofoursoils66by contributing to nutrient availability, soil stability, plant productivity, and67protection against disease and pathogens. Soil microbiology has a rich history of68isolatingandcharacterizingrepresentativesfromthesoiltoassesstheirimpactson69soil health and stability. However, despite significant efforts to isolate microbes70from the soil, we have accessed only a small fraction of its biodiversity, with71estimates of less than 1 to 50 percent of species, even with novel isolation72techniques(VanHTPhamandKim,2012;Vartoukianetal.,2010;Schoenbornetal.,732004;Burmølleetal.,2009;Janssenetal.,2002;Lingetal.,2015).Withadvancesin74DNA sequencing technologies,we cannowdirectly interrogate soilmetagenomes,75allowing for the characterization of soil communities without the need to first76cultivate isolates. However, our ability to annotate and characterize the retrieved77genes is dependent on the availability of informative reference gene or genome78databases.The largestresourceof fullgenomereferencesequences forannotating79genesistheNCBIRefSeqdatabase(Tatusovaetal.,2013),whichcontainsatotalof8057,993 genomes (Release 74). While broadly useful, this database is not81representativeofthelargemajorityofsoilmicrobiomes,andthemajorityofgenes82in previously published soilmetagenomes (65-90%) cannot be annotated against83knowngenes(Delmontetal.,2012;Fiereretal.,2012).Further,contributionstothe84NCBI databases have largely originated from human health and biotechnology85research efforts that can mislead annotations of genes originating from soil86microbiomes(e.g.,annotationsthatareclearlynotcompatiblewithlifeinsoil).87 Soil microbiologists are not the first to face the problem of a limited88reference database. The NIH Human Microbiome Project (HMP) recognized the89criticalneedforawell-curatedreferencegenomedatasetanddevelopedareference90
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
catalogof3,000genomesthatwereisolatedandsequencedfromhuman-associated91microbial populations(Huttenhoweretal., 2012).Thispublicly available reference92setofmicrobialisolatesandtheirgenomicsequencesaidsintheanalysisofhuman93microbiomesequencingdata(Segataetal.,2012;Wuetal.,2009)andalsoprovides94strainsforwhichisolates(bothculturecollectionsandnucleicacids)areavailableas95resources for experiments. Following this successful model of the HMP, we have96developed a curated database of reference genome sequences that originate from97soil. This database, called RefSoil, represents the current state of knowledge of98sequenced soil genomes. In addition to helping us to understand known soil99microbiology, RefSoil aids in the identification of knowledge gaps that need to be100filledtoimproveourunderstandingofsoilbiodiversity.Wediscusstheimportance101the RefSoil database and demonstrate exciting opportunities to expand this102databaseanditsapplicationsinthefuture.103104MaterialsandMethods105106CreationofRefSoilDatabase107In order to create a soil-specific reference genome dataset, we targeted genome108sequences based on evidence that isolates had origins in soil systems. A total of1096,646completegenomeswereobtainedfromtheGenomesOnLineDatabase(GOLD,110https://gold.jgi.doe.gov,October9th,2014);theGOLDdatabasewaschosenbecause111oftheavailabilityofmetadata,particularlyenvironmentalorigin,relatedtogenome112sequences.Thesegenomeswerefurtherselectedbasedonsoil-associationwiththe113following criteria: (1) Within the GOLD database, organism information and114organismmetadata(knownhabitats,ecosystemcategory,andecosystemtype)was115requiredtoidentifyorganismasoriginatinginsoil-associatedcategories.Organisms116frommarine and deep sea environmentswere excluded, though these organisms117often were identified as soil-associated. Additionally, obligate host-associated118pathogensandextremophileswereexcluded,astheseorganismsareunlikelytobe119present in the absence of their host or in representative soil samples. We120consideredtheseorganismstooftenbeunderverystrongselectivepressuresthat121can lead to reduced genomes or high rates of recombination that are difficult to122assess soil-specific trends. (2) For organisms that lacked appropriatemetadata in123GOLD, a Google Custom Search was used to query all available websites for the124organism name and soil-related terms (rhizosphere, soil, sand, mud, or nodule).125Priorities were placed on the following websites:126http://www.ncbi.nlm.nih.gov/pubmed/, http://www.straininfo.net/,127https://microbewiki.kenyon.edu/index.php/MicrobeWiki,128https://en.wikipedia.org/, and https://scholar.google.com. Genomes with an129association with at least one webpage containing search query phrases were130includedinRefSoil.Thisapproachwastestedwithknownsoil-associatedorganisms,131verifyingthat itwouldreliablypredictsoilassociationfortheseorganisms.Within132resultinggenomes,duplicatedchromosomesandstrainswereremoved.Ifmultiple133genome accession numberswere associatedwith a single strain, themost recent134genomesequencewaschosen.NCBIGenbankannotationfilesassociatedwitheach135
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
strain were obtained on 2/18/16. The genomes contained within RefSoil are136providedinSupplementaryTable1.137138CharacterizationoforganismsinRefSoil139
All RefSoil genome sequences were associated with their NCBI RefSeq140accession IDandobtained fromNCBI. UsingNCBIGenbankCDSannotations, 16s141rRNAgenesequenceswere identified. Ifmultiple16SrRNAgenesequenceswere142present,thefirstsequencefromthefirstchromosome(byNCBIindex)wasselected143asrepresentativeforthegenome.Threegenomeslackedannotationsofa16SrRNA144geneanda16SrRNAgeneHMMmodelwasusedand identifiedanadditional16S145rRNAsequence(Guoetal.,2015).Intotal,88616SrRNAgenesequenceswereused146to build a phylogenetic tree for bacterial genomes within RefSoil (Fig. 1). All147bacterial 16S rRNA gene sequences were aligned using RDP’s bacterial model148(Infernal 1.1.1 (Nawrocki and Eddy, 2013)), and a Maximum-likelihood149phylogenetic tree was constructed based on the Jukes-Cantor model by Fasttree150(Priceetal.,2010)andvisualizedwithGraphlan(R3.2.2,version0.9.7)(Fig.1).The151scripts for these approaches are publicly available on https://github.com/germs-152lab/ref_soil. RefSoil genomes were extracted from corresponding genomes153containedwithinNCBIRefSeq(release74).AnnotationsoftaxonomyforRefSoiland154RefSeqgeneswereobtainedfromNCBI(February19,2016)(SupplementaryFig.1).155RefSoil genomeswere submitted to RAST and genes were annotated using SEED156subsystem ontology (Supplementary Table 1). Among 1,811,233 unique genes,157approximately 39.1% of themwere assigned tomultiple SEED subsystem level 1158categories.Forthepurposeofthisstudy,weincludedallannotationsatsubsystem159level 1 for each unique gene,which expanded the total gene counts to 2,619,643.160The percent abundance of genes from each phylum in RefSoil database was also161adjusted accordingly by including the abundance of genes that were assigned to162multiplesubsystemlevel1categories(Fig.2andSupplementTable2).Thenumber163of phyla identified in each subsystem level 1 categories was summarized in164SupplementalTable3.165166EMPdatabasesusedinthisstudy167
Atotalof15,481amplicondatasetsthatformtheEMPdatasetwereavailable168to compare environmental soil amplicons to RefSoil (total 5,594,412 OTUs,169clusteredat97%)(Rideoutetal.,2014).Soil sampleswereselectedbasedon their170association with soil metadata resulting in a total of 2,476,795 OTUs from 3035171samples. A total of 2,158 unique taxonomy assignmentswere identified for EMP172OTUs using the RDP Classifier (Wang et al., 2007) and used to construct a173phylogenetic tree with the Graphlan package (R 3.2.2, version 0.9.7) (Fig. 3).174Abundances for each taxonomic assignment were calculated as the sum of175abundanceofOTUsassociatedwiththattaxonomy.176
To evaluate thepresenceofRefSoil genomes inEMP,EMPandRefSoil 16S177rRNA gene amplicons were compared by alignment with BLAST, requiring an178alignmentwithgreaterthan97%similarity,aminimumalignmentof72bp,andE-179value ≤ 1e-5. If multiple hits were aligned, the hit with the lowest E-value was180selected as representative. Using these criteria, a total of 53,538 EMPOTUswere181
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
associated RefSoil 16s rRNA genes (1.4% of total EMP OTUs; 10.2% of all EMP182amplicons).183
SoilorderforEMPsampleslocatedintheUnitedStates(1817samples)was184obtained based on GPS location associatedwith samplemetadata; locationswere185onlyconsideredvalidifthecoordinatesenteredwereactuallylocatedintheUnited186States (1627samples). ValidGPSpointswere then locatedwithina soilorderby187queryingtheUSDANRCSGlobalSoilRegionsmap(Reich)(AccessedonJanuary26,1882016) for soil order plus a Rock/Sand/Ice category using ArcGIS 10.3.1. RefSoil189representatives for EMPOTUswere determined by similarity as described above190(SupplementaryTable4).191
192Most-wantedsoilOTUs193
TheselectionofthemostwantedOTUsforRefSoilbasedonEMP-associated194ampliconswasbasedonpresenceinEMPsamples,usingthecriteriaofEMP-RefSoil195similarityasdescribedabove,andabundance(numberofampliconsassociatedwith196the OTU in EMP samples). Candidate OTUs were ranked based on its observed197frequencyinallEMPsamplesandabundanceinEMPamplicons(Top100shownin198SupplementaryTable5and6).TaxonomywasassignedbytheRDPclassifier(Wang199etal., 2007). The 21OTUs present in both ‘themost abundance’ and ‘themost200frequent’arelistedintheTable1.201202Singlecellgenomics203
Forsinglecellgenomics,asoilsamplewascollectedfrom0-10cmdepthina204residentialgardeninNobleboro,Maine(44°5'48.10"N,69°29'10.56"W)onMay5th,2052015. Approximately five grams of the sample were mixed with 30 mL sterile-206filtered phosphate-saline buffer (PBS), vortexed for 30 s atmaximum speed, and207centrifuged for30sat2,000rpm.Theobtainedsupernatantwasdiluted tobelow208105cellxmL-1withPBS,pre-screenedthrougha40μmmesh-sizecellstrainer(BD),209and incubated with SYTO-9 DNA stain (5 μM; Invitrogen) for 10-60 min. The210generationofsingleamplifiedgenomes(SAGs)andtheirgenomicsequencingwere211performed by the Bigelow Laboratory Single Cell Genomics Center212(scgc.bigelow.org),aspreviouslydescribed(Stepanauskasetal.).SAGsrepresenting213the "most-wanted list"were selected based on 16S rRNA gene BLAST alignments214withgreaterthan97%similarityoveratleast72bp.215
216Codeandsequencingdataavailability217 The analysis code used to generate the results is available from218https://github.com/germs-lab/ref_soil.All14single-cellsequencingdatasetshave219been deposited at the NCBI. Sequencing data with NCBI accession identifier are220listedinTable2.221
222223224225226227
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
Results228229PhylogeneticandfunctionalcharacterizationofRefSoil230
RefSoil is a manually curated database containing a total 922 genomes of231soil-associated organisms, comprised of 888 bacteria and 34 archaea232(SupplementaryTable1).ThegenomeswithinRefSoil(soil-associated)andRefSeq233(all environments) reference databases were compared. Both genome databases234contained similar dominant phyla, including Proteobacteria, Firmicutes, and235Actinobacteria, comprising over 91% and 88% of genomes in RefSoil and RefSeq236bacteria, respectively. RefSoil contained higher proportions of Armatimonadetes,237Germmatimonadetes, Thermodesulfobacteria, Acidobacteria, Nitrospirae, and238ChloroflexithanRefSeq,suggestingthatthesephylamaybeenrichedinthesoilor239underrepresented in the RefSeq database. A total of eleven RefSeq-associated240phylums were absent from RefSoil indicating that these phyla may be absent or241difficult to cultivate in soil environments (Supplementary Fig. 1). The 16S rRNA242genesforRefSoilbacterialgenomeswereobtainedfromNCBIGenbankannotations243and aligned to construct a phylogenetic tree representing RefSoil phylogenetic244diversity(Fig.1). 245 RefSoil bacterial and archaeal genomes were further annotated with the246Rapid Annotation using Subsystem Technology (RAST, v 2.0(Aziz et al., 2008))247(SupplementaryTable1),resulting intheannotationofatotalof1,811,233RAST-248associated genes. RAST annotations, unlike GenBank annotations, include249classificationintofunctionalontologiesorsubsystems,allowingforthecomparison250of broad functions. Overall, 78% of the bacterial and archaeal genes could be251classifiedintofunctionalsubsystems(SupplementaryTable2).Forgenesassociated252with each subsystem,we evaluated the phylogenetic origins of RefSoil genes and253compared the phlya distribution of annotated genes with those within the254cumulative RefSoil database (Fig. 2). If the proportion of genes associatedwith a255phylum within a functional subsystem was greater than its representation in256RefSoil,weconsideredtherepresentationofthephylaenrichedinthisfunction(Fig.2572,SupplementaryTable3).Forexample,weobservedthatinmostsubsystems(15258out of 26 subsystems), Proteobacteria-associated geneswere enriched relative to259their representation within RefSoil (56% of all RefSoil genes associated with260Proteobacteria). Additionally, Actinobacteria, Crenarchaeota, and Proteobacteria261genes were enriched in functions related tometabolism of aromatic compounds;262Firmicutes and Proteobacteria genes were enriched in functions related to iron263acquisitionandmetabolism;and functioncategorydormancyandsporulationwas264enrichedwith Firmicutes genes only. These enrichments indicate that organisms265from a small number of phyla may have advantages over other organisms in266environmentswherethesefunctionsareimportant.267268RefSoilcomparedtosequencesoriginatingfromsoilenvironments269
Genomes in RefSoil represent cultivated strains originating from soils that270have been isolated and often characterized in laboratory conditions. In order to271estimate their natural abundance in soils, we compared RefSoil to amplicon272sequencingdatasetsfromglobalsoilmicrobiomesintheEarthMicrobiomeProject273
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
(EMP)(Rideoutetal.,2014;Gilbertetal.,2014).Amplicons(16SrRNAgenes)from274EMP datasets were clustered at 97% sequence similarity to determine275representative operational taxonomic units (OTUs)(Rideout et al., 2014). We276selected only EMP samples thatwere associatedwith soil samples, resulting in a277totalof3,035samples.Withinthesesoilsamples,weobservedthatthemajorityof278OTUswererare(e.g.,onlyobservedinafewsamples),with76%ofOTUsobserved279inlessthantensoilsamples,and1%ofOTUsrepresenting81%oftotalabundance280inEMP.281
ComparingRefSoil16SrRNAgenestoEMPamplicons,weobservedthatthe282majorityofRefSoil16SrRNAsequenceswerehighlysimilartoEMPamplicons(over28387% of RefSoil bacterial and archaeal 16s rRNA genes shared greater than 97%284similaritywithaminimum72bpalignmentlength).Incontrast,99%(2,442,432of2852,476,795) of EMP amplicons did not share high similarity (greater than 97%286similarity)toRefSoilgenes,suggestingthatEMPsoilsamplescontainmuchhigher287diversity than represented within RefSoil. While RefSoil genomes represent soil288microbes that have been well-studied in the laboratory, the EMP amplicons289represent microbes that have been observed in soils around the world but not290necessarilywell-characterized (e.g., limited to the observation of the presence of29116SrRNAgenesequences).292
293Soilmicrobiomesassociatedwithdifferentsoiltypes294
In1975,asoiltaxonomywasdevelopedbytheUnitedStatesDepartmentof295Agriculture (USDA) and the National Resources Conservation Service, which296separates soils into twelve orders that are based on their physical, chemical, or297biologicalproperties(SoilSurvayStaff1999,1999). Despite theavailabilityof this298classification, it israrely incorporatedintosoilmicrobiomesurveys. UsingRefSoil299and estimated abundances from similar EMP OTUs, we evaluated microbial300distribution in various soil orders. WeobtainedGPSdata fromEMP soil samples301originating from theUnitedStates thatallowedus toobtain the soil classification.302Within these EMP samples, the most represented soil orders included Mollisols303(58%, grassland fertile soils) andAlfisols (37%, fertile soils typicallyunder forest304vegetation) (Supplementary Table 4). The abundance ofOTUs similar to RefSoil305geneswasusedtoestimatethemembershipofwell-characterizedisolatesinthese306soilsamples(SupplementaryFig.2).Mollisols,AlfisolsandVertisols(soilswithhigh307claycontentwithpronouncedchangesinmoisture)wereassociatedwiththemost308RefSoil representatives,whileGelisols (cold climate soils),Ultisols (soilswith low309cationexchange),andsand/rock/icecontainedtheleast(SupplementaryTable4).310
311IdentificationofthemostbeneficialfuturetargetsforRefSoil312 EMP OTUs that do not share high similarity with 16S rRNA genes from313RefSoil represent current knowledge gaps for which we lack cultivated isolates.314These gaps were visually identified by overlaying the presence of highly similar315relatives in RefSoil (similarity >97%) to branches in a phylogenetic tree of soil-316associated EMP amplicons (Fig. 3). We evaluated the presence of OTUs in EMP317librariesbasedontheirpresenceinEMPsamples(frequency)andtheircumulative318observed abundance. We identified themost observed and abundant EMP OTUs319
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
thatdidnotsharehighsimilaritytoRefsoiltogeneratea“RefSoil’smostwantedlist”320oftargetsthat,ifisolatedorsequenced,couldprovideinformationonprevalentbut321uncharacterized lineages (Table 1, Supplementary Table 5-6). OTUs sharing322similaritytoVerrucomicrobia(8OTUs)andAcidobacteria(6OTUs)wereamongthe323most abundant and frequently observed EMP OTUs that are not currently324represented inRefSoil (Table1). Thesetargets,observed innaturebutnot inour325database,would be ideal to isolate and characterize to better understand the soil326biodiversity.327
328Singlecellgenomicsinsoil329
Sequencing-basedapproachesprovideanalternativetoaccessingthe330genomesofsoilorganismswithoutcultivation.Previouseffortshaveusedassembly331ofgenomesfrommetagenomes(Labontéetal.,2015;Martijnetal.,2015;Fieldetal.,3322014)andsinglecellgenomics(Gawadetal.,2016;Lasken,2012;Stepanauskas,3332012)toobtaingenomicblueprintsofyetunculturedmicrobialgroups.Toevaluate334theeffectivenessofsinglecellgenomicsonsoilcommunities,weperformedapilot-335scaleexperimentonaresidentialgardensoilinMaine.The16SrRNAgenewas336successfullyrecoveredfrom109ofthe317singleamplifiedgenomes(SAGs).This33734%16SrRNAgenerecoveryrateiscomparabletosinglecellgenomicsstudiesin338marine,freshwaterandotherenvironments(Martinez-Garciaetal.,2011;Rinkeet339al.,2013;Swanetal.,2011).The16SrRNAgenesof14oftheseSAGs,belongingto340Proteobacteria,Actinobacteria,Nitrospirae,Verrucomicrobia,Planktomycetes,341AcidobacteriaandChloroflexi,wereselectedbasedontheirlackofrepresentation342withinRefSoil.GenomicsequencingofthoseSAGsresultedinacumulativeassembly343of23Mbp(Table2,SupplementaryTable7).WeestimatedtheabundanceofOTUs344similartosinglecell16SrRNAgenestorangefrom5E-7to2E-2%ofEMPOTU345abundances.Theseabundancesareverylowbutcomparabletotheaverageand346medianrelativeabundanceofOTUswithinEMP(4E-3%and1E-6%,respectively).347IfthesedraftgenomeswereaddedtoRefSoil,these14SAGswouldincrease348RefSoil’srepresentationofEMPampliconsby7%byabundance.349350Discussion351352
Advances in sequencing techniques for utilizing culture-independent353approaches have created new opportunities for understanding soil microbiology354and its impact on soil health, stability, andmanagement.We provide RefSoil as a355community genomic resource to provide high-quality curation of organisms that356originate in soil studies, allowing us tomore accurately annotate soil sequencing357datasets. This database is an important initial effort to provide improved soil358references, and we acknowledge that it is far from a complete representation of359soil’s biodiversity. In fact, compared to EMP soils, RefSoil represents only 2% of360observedEMPOTUsand10%byabundanceoftheobservedsoilOTUs,highlighting361themagnitudeofsoilbiodiversityandthelimitationsofourcurrentdatabasebased362on cultivated representatives. These results are consistent with a recent effort363demonstratingthatweareverylimitedinourknowledgeofnotonlysoilbutlife’s364diversity and advocating that much can be learned by including uncultivated365
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
organisms(Hugetal., 2016). We showherehowRefSoil canbeused to evaluate366which organisms to target thatwouldmost effectively increase our knowledge of367soilbiodiversity.Forexample,ifgenomereferenceswereavailableforthetopmost368wanted organisms identified in this effort (Table 1), we could expand RefSoil’s369representationofEMPsoilsby58%byabundance.370
Notably, many of these targets have previously been observed to be371recalcitranttocultivationwithstandardlaboratorymedia,resultingintheirabsence372in current genome databases. Acidobacteria, for example, is known to be slow373growinganddifficulttocultivate(NunesdaRochaetal.,2009)whilealsoobserved374to be highly abundant in soil (33% of EMP amplicons by abundance). Another375fastidiousbacteria,Verrucomicrobia(Bergmannetal.,2011),werealsoobservedto376behighlyabundant(12.5%)inEMPbutnotwellrepresentedinRefSoil(twoof888377bacterial genomes). Despite their absence from cultivated isolates, both378Acidobacteria andVerrucomicrobia havebeenobserved to be critical for nutrient379cycling insoils(NunesdaRochaetal.,2009;Wardetal.,2009;Fiereretal.,2013;380Martinez-Garciaetal.,2012).381
Novel isolation and culturing techniques will help us to access these382previously difficult to grow bacteria andwill also be complemented by emerging383sequencing technologies. Inparticular, single-cellgenomicsholdgreatpromise to384providegenomiccharacterizationoflineagesthataredifficulttoculture(Gawadet385al., 2016; Lasken, 2012; Stepanauskas, 2012). In our pilot experiment, we386demonstrate, for the first time, that single-cell genomics is applicable on soil387samplesand iswellsuited torecover thegenomic information fromabundantbut388yet uncultured taxonomic groups. The 14 sequenced SAGs have significantly389increasedtheextenttowhichRefSoilrepresentsthepredominantsoillineagesfrom390asinglesample.Muchlargersinglecellgenomicsprojectsarefeasibleandhavebeen391employedinpriorstudiesofotherenvironments(Kashtanetal.,2014;Rinkeetal.,3922013;Swanetal.,2013).Thecontinued,rapidimprovementsinthistechnologyare393likelytoleadtofurtherscalability,offeringapracticalmeanstofilltheexistinggaps394intheRefSoildatabaseandbiodiversitymorebroadly.395 TheRefSoil database is also a tool that allows us to characterize currently396known soil bacteria phylogeny and functions. This database spans 24 phyla of397bacteria and archaea. While genes related to microbial growth and reproduction398(e.g., DNA, RNA, and protein metabolism) originate from diverse phyla, key399functions related to metabolism of aromatic compounds, iron metabolism, and400dormancy and sporulationwere observed to be enriched fromonly a fewRefSoil401phyla, suggesting that these phyla may have specialized functions within soil402communities. Specifically, Proteobacteria-associated genes were enriched in403functionsrelatedtomotility,chemotaxis,andmembranetransportation(Figure2),404suggestingthatProteobacteriaarelikelytobeefficientatacquiringreadilyavailable405nutrients and elements for growth(Fierer et al., 2007). Genes related to406Proteobacteria and Actinobacteria were also found dominant among RefSoil407genomes in subsystems related to the metabolism of aromatic compounds. This408observation is consistent with the association of Proteobacteria and humic409substances utilization and the contribution of Actinobacteria to plant material410degradation(Goddenetal.,1992;Fuchsetal.,2011).411
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
By comparing RefSoil to other databases, we identified biases for specific412phyla in RefSoil; in particular, Firmicutes are observed frequently in RefSoil but413were not observed to be highly abundant in soil environments (5.7% of all EMP414amplicons). Firmicutes have been well-studied as pathogens(Buffie and Pamer,4152013;Kamadaetal.,2013;Rupniketal.,2009),likelybiasingtheirrepresentationin416ourdatabasesandconsequentlytheirannotationsinsoilstudies. Akeyadvantage417to the development of the RefSoil database is the opportunity to evaluate these418biases and to identify targets for future curation that would create a more419representative database resource for the soil. To this end, we evaluated the420representationofRefSoilinvarioussoils.421
Consistent with previous observations that microbial community422composition is correlated with soil environments(Fierer et al., 2012; Fierer and423Jackson, 2006), we observed specific RefSoil membership associated with soil424taxonomy. NotallsoiltaxonomicorderswereequallyrepresentedinRefSoil,with425themostgenomesassociatedwithMollisols,orgrasslandsoils. Thisbias is likely426duetoincreasedresearchinthesesoilsduetotheirimportanceforagricultureand427productivity. In contrast, for Gellisol permafrost soils, which cover over 20% of428Earth’s terrestrial surface(Koven etal., 2011) and are a significant carbon source429fromourbiospheretotheatmosphere,weobserveonlyasmall fractionofknown430soilmicrobesarepresent. Previousstudiessuggestthathighlevelsofnoveltycan431be observed in permafrost microbiomes(Koven et al., 2011; Mackelprang et al.,4322011), suggesting that these soils could significantly benefit from improved433references.434 AnotheradvantagetotheRefSoildatabaseisthatitrepresentsgenomesfor435which strains should currently be available and for which we have high-quality436genomes. As a consequence, genomes within RefSoil could help to inform and437designsoilmicrobiologyexperiments.Forexample,itisknownthatnitrogencycling438genesareabundantinagriculturalsoils(Xueetal.,2013;Philippotetal.,2007;Long439et al., 2012). A mock community of isolates known for participating in nitrogen440cycling could be generated and proportionally designed to mimic soil conditions441usingRefSoil: these strainsmight includemicroorganisms related toStreptomyces442venezuelae ATCC 10712 (assimilatory nitrate reductase),Bacillus anthracis (nitric443oxide reductase), Pseudomonas brassicacearum (nitrous oxide reductase),444Halomonas elongata DSM 2581 (ammonia monooxygenase), Pseudomonas445fluorescens SBW25 (ammonia monooxygenase), and Pleurocapsa sp. PCC 7327446(nitrogen fixation). Combining RefSoil and available amplicon datasets, one could447estimatetheproportionsofstrainsasobservedinsoilsamples.448
In conclusion, RefSoil is an important first step towards building a more449comprehensive, well-curated database. Though it is far from complete, we have450beenabletouseRefSoilasatooltoidentifykeyphylathatrepresentgapsinknown451soil diversity and underrepresented phyla. Going forward, this reference will be452expanded as isolation and cultivation-independent technologies continue to453improve.454455456457
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
Acknowledgments458We are grateful for the support of the NSF Terragenome International Soil459MetagenomeSequencingConsortium forproviding a collaborativeworkshopwith460thesoilcommunity fordiscussionsto improvethisproject. Thismaterial isbased461uponworksupportedbytheU.S.DepartmentofEnergy,OfficeofScience,Officeof462BiologicalandEnvironmentalResearch,underAwardNumberSC0010775.463464Contributions E.C., A.G., A.H., and J.T. curated Refsoil; J.C. and A.H. collected the465genomeinformationtobuildthedatabase;J.C.,J.F.,A.H.,R.W,andF.Y.,characterized466andcomparedRefSoiltoEMPdatasets;B.G.providedsoilordersforEMPsamples;467R.S.performedsinglecellgenomics;J.T.,K.H.andA.H.supervisedthework;J.C.,F.Y.,468andA.H.wrotethemanuscriptwithcontributionsfromallotherauthors.469470ConflictofInterest471Theauthorsdeclarenoconflictofinterests.472473
474
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
References475AzizRK,BartelsD,BestAA,DeJonghM,DiszT,EdwardsRA,etal.(2008).TheRAST476Server:RapidAnnotationsusingSubsystemsTechnology.BMCGenomics9:75.477
BergmannGT,BatesST,EilersKG,LauberCL,CaporasoJG,WaltersWA,etal.478(2011).Theunder-recognizeddominanceofVerrucomicrobiainsoilbacterial479communities.SoilBiologyandBiochemistry43:1450–1455.480
BuffieCG,PamerEG.(2013).Microbiota-mediatedcolonizationresistanceagainst481intestinalpathogens.NatRevImmunol13:790–801.482
BurmølleM,JohnsenK,Al-SoudWA,HansenLH,SørensenSJ.(2009).Thepresence483ofembeddedbacterialpureculturesinagarplatesstimulatetheculturabilityofsoil484bacteria.JournalofMicrobiologicalMethods79:166–173.485
DelmontTO,PrestatE,KeeganKP,FaubladierM,RobeP,ClarkIM,etal.(2012).486Structure,fluctuationandmagnitudeofanaturalgrasslandsoilmetagenome.ISMEJ4876:1677–1687.488
FieldEK,SczyrbaA,LymanAE,HarrisCC,WoykeT,StepanauskasR,etal.(2014).489GenomicinsightsintotheuncultivatedmarineZetaproteobacteriaatLoihi490Seamount.ISMEJ9:857–870.491
FiererN,BradfordMA,JacksonRB.(2007).Towardanecologicalclassificationof492soilbacteria.Ecology88:1354–1364.493
FiererN,JacksonRB.(2006).Thediversityandbiogeographyofsoilbacterial494communities.PNatlAcadSciUSA103:626–631.495
FiererN,LadauJ,ClementeJC,LeffJW,OwensSM,PollardKS,etal.(2013).496ReconstructingtheMicrobialDiversityandFunctionofPre-AgriculturalTallgrass497PrairieSoilsintheUnitedStates.Science342:621–624.498
FiererN,LeffJW,AdamsBJ,NielsenUN,BatesST,LauberCL,etal.(2012).Cross-499biomemetagenomicanalysesofsoilmicrobialcommunitiesandtheirfunctional500attributes.PNatlAcadSciUSA109:21390–21395.501
FuchsG,BollM,HeiderJ.(2011).Microbialdegradationofaromaticcompounds—502fromonestrategytofour.NatRevMicro9:803–816.503
GawadC,KohW,QuakeSR.(2016).Single-cellgenomesequencing:currentstateof504thescience.NatRevGenet17:175–188.505
GilbertJA,JanssonJK,KnightR.(2014).TheEarthMicrobiomeproject:successes506andaspirations.BMCBiol12:69.507
GoddenB,BallAS,HelvensteinP,MccarthyAJ,PenninckxMJ.(1992).Towards508
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
elucidationofthelignindegradationpathwayinactinomycetes.JournalofGeneral509Microbiology138:2441–2448.510
GuoJ,ColeJR,ZhangQ,BrownCT,TiedjeJM.(2015).MicrobialCommunityAnalysis511withRibosomalGeneFragmentsfromShotgunMetagenomesSchlossPD(ed).512AppliedandEnvironmentalMicrobiology82:157–166.513
HugLA,BakerBJ,AnantharamanK,BrownCT,ProbstAJ,CastelleCJ,etal.(2016).A514newviewofthetreeoflife.NatMicrobiol16048.515
HuttenhowerC,GeversD,KnightR,AbubuckerS,BadgerJH,ChinwallaAT,etal.516(2012).Structure,functionanddiversityofthehealthyhumanmicrobiome.Nature517486:207–214.518
JanssenPH,YatesPS,GrintonBE,TaylorPM,SaitM.(2002).ImprovedCulturability519ofSoilBacteriaandIsolationinPureCultureofNovelMembersoftheDivisions520Acidobacteria,Actinobacteria,Proteobacteria,andVerrucomicrobia.Appliedand521EnvironmentalMicrobiology68:2391–2396.522
KamadaN,ChenGY,InoharaN,NúñezG.(2013).Controlofpathogensand523pathobiontsbythegutmicrobiota.NatImmunol14:685–690.524
KashtanN,RoggensackSE,RodrigueS,ThompsonJW,BillerSJ,CoeA,etal.(2014).525Single-CellGenomicsRevealsHundredsofCoexistingSubpopulationsinWild526Prochlorococcus.Science344:416–420.527
KovenCD,RingevalB,FriedlingsteinP,CiaisP,CaduleP,KhvorostyanovD,etal.528(2011).Permafrostcarbon-climatefeedbacksaccelerateglobalwarming.PNatl529AcadSciUSA108:14769–14774.530
LabontéJM,SwanBK,PoulosB,LuoH,KorenS,HallamSJ,etal.(2015).Single-cell531genomics-basedanalysisofvirus–hostinteractionsinmarinesurface532bacterioplankton.ISMEJ9:2386–2399.533
LaskenRS.(2012).Genomicsequencingofunculturedmicroorganismsfromsingle534cells.NatRevMicro10:631–640.535
LingLL,SchneiderT,PeoplesAJ,SpoeringAL,EngelsI,ConlonBP,etal.(2015).A536newantibiotickillspathogenswithoutdetectableresistance.Nature517:455–459.537
LongA,HeitmanJ,TobiasC,PhilipsR,SongB.(2012).Co-OccurringAnammox,538Denitrification,andCodenitrificationinAgriculturalSoils.Appliedand539EnvironmentalMicrobiology79:168–176.540
MackelprangR,WaldropMP,DeAngelisKM,DavidMM,ChavarriaKL,BlazewiczSJ,541etal.(2011).Metagenomicanalysisofapermafrostmicrobialcommunityrevealsa542rapidresponsetothaw.Nature480:368–371.543
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
MartijnJ,SchulzF,Zaremba-NiedzwiedzkaK,ViklundJ,StepanauskasR,Andersson544SGE,etal.(2015).Single-cellgenomicsofarareenvironmental545alphaproteobacteriumprovidesuniqueinsightsintoRickettsiaceaeevolution.ISMEJ5469:2373–2385.547
Martinez-GarciaM,BrazelDM,SwanBK,ArnostiC,ChainPSG,ReitengaKG,etal.548(2012).CapturingSingleCellGenomesofActivePolysaccharideDegraders:An549UnexpectedContributionofVerrucomicrobiaRavelJ(ed).PLoSONE7:e35314.550
Martinez-GarciaM,SwanBK,PoultonNJ,GomezML,MaslandD,SierackiME,etal.551(2011).High-throughputsingle-cellsequencingidentifiesphotoheterotrophsand552chemoautotrophsinfreshwaterbacterioplankton.ISMEJ6:113–123.553
NawrockiEP,EddySR.(2013).Infernal1.1:100-foldfasterRNAhomologysearches.554Bioinformatics29:2933–2935.555
NunesdaRochaU,VanOverbeekL,VanElsasJD.(2009).Explorationofhitherto-556unculturedbacteriafromtherhizosphere.FEMSMicrobiologyEcology69:313–328.557
PhilippotL,HallinS,SchloterM.(2007).EcologyofDenitrifyingProkaryotesin558AgriculturalSoil.In:AdvancesinAgronomyVol.96.Elsevier,pp249–305.559
PriceMN,DehalPS,ArkinAP.(2010).FastTree2–ApproximatelyMaximum-560LikelihoodTreesforLargeAlignmentsPoonAFY(ed).PLoSONE5:e9490.561
ReichP.Globalsoilsuborders.562http://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/use/?cid=nrcs142p2_054563013.564
RideoutJR,HeY,Navas-MolinaJA,WaltersWA,UrsellLK,GibbonsSM,etal.(2014).565Subsampledopen-referenceclusteringcreatesconsistent,comprehensiveOTU566definitionsandscalestobillionsofsequences.PeerJ2:e545–25.567
RinkeC,SchwientekP,SczyrbaA,IvanovaNN,AndersonIJ,ChengJ-F,etal.(2013).568Insightsintothephylogenyandcodingpotentialofmicrobialdarkmatter.Nature569499:431–437.570
RupnikM,WilcoxMH,GerdingDN.(2009).Clostridiumdifficileinfection:new571developmentsinepidemiologyandpathogenesis.NatRevMicro7:526–536.572
SchoenbornL,YatesPS,GrintonBE,HugenholtzP,JanssenPH.(2004).LiquidSerial573DilutionIsInferiortoSolidMediaforIsolationofCulturesRepresentativeofthe574Phylum-LevelDiversityofSoilBacteria.AppliedandEnvironmentalMicrobiology70:5754363–4366.576
SegataN,WaldronL,BallariniA,NarasimhanV,JoussonO,HuttenhowerC.(2012).577Metagenomicmicrobialcommunityprofilingusinguniqueclade-specificmarker578
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
genes.NatMeth9:811–814.579
SoilSurvayStaff1999.(1999).SoilTaxonomy:Abasicsystemofsoilclassification580formakingandinterpretingsoilsurveys.2nded.U.S.DepartmentofAgriculture581Handbook436:NaturalResourcesConservationService.582
StepanauskasR.(2012).Singlecellgenomics:anindividuallookatmicrobes.583CurrentOpinioninMicrobiology15:613–620.584
StepanauskasR,FergussonE,BrownJ,PoultonN,TupperB,LabontéJM,etal.585Improvedwholegenomeamplificationandtheintegratedstudyofgenomicand586opticalpropertiesofindividualcellsandviruses.Manuscriptinpreparation.587
SwanBK,Martinez-GarciaM,PrestonCM,SczyrbaA,WoykeT,LamyD,etal.(2011).588PotentialforChemolithoautotrophyAmongUbiquitousBacteriaLineagesinthe589DarkOcean.Science333:1296–1300.590
SwanBK,TupperB,SczyrbaA,LauroFM,Martinez-GarciaM,GonzalezJM,etal.591(2013).Prevalentgenomestreamliningandlatitudinaldivergenceofplanktonic592bacteriainthesurfaceocean.PNatlAcadSciUSA110:11463–11468.593
TatusovaT,CiufoS,FedorovB,O'NeillK,TolstoyI.(2013).RefSeqmicrobial594genomesdatabase:newrepresentationandannotationstrategy.NucleicAcidsRes59542:D553–D559.596
VanHTPham,KimJ.(2012).Cultivationofunculturablesoilbacteria.Trendsin597Biotechnology30:475–484.598
VartoukianSR,PalmerRM,WadeWG.(2010).Strategiesforcultureof‘unculturable’599bacteria.FEMSMicrobiologyLetters309:1–7.600
WangQ,GarrityGM,TiedjeJM,ColeJR.(2007).NaiveBayesianClassifierforRapid601AssignmentofrRNASequencesintotheNewBacterialTaxonomy.Appliedand602EnvironmentalMicrobiology73:5261–5267.603
WardNL,ChallacombeJF,JanssenPH,HenrissatB,CoutinhoPM,WuM,etal.604(2009).ThreeGenomesfromthePhylumAcidobacteriaProvideInsightintothe605LifestylesofTheseMicroorganismsinSoils.AppliedandEnvironmentalMicrobiology60675:2046–2056.607
WuD,HugenholtzP,MavromatisK,PukallR,DalinE,IvanovaNN,etal.(2009).A608phylogeny-drivengenomicencyclopaediaofBacteriaandArchaea.Nature462:6091056–1060.610
XueK,WuL,DengY,HeZ,VanNostrandJ,RobertsonPG,etal.(2013).Functional611GeneDifferencesinSoilMicrobialCommunitiesfromConventional,Low-Input,and612OrganicFarmlands.AppliedandEnvironmentalMicrobiology79:1284–1292.613
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
FigureLegends614
Figure1.PhylogenetictreeofRefSoil615Phylogenetictreeofaligned16SrRNAgenesoriginatingfromRefSoilbacterial616genomes.A:Acidobacteria,B:Actinobacteria,C:Aquificae,D:Armatimonadetes,E:617Bacteroidetes,F:Chlamydiae,G:Chlorobi,H:Chloroflexi,J:Cyanobacteria,K:618Deferribacteres,L:Deinococcus-Thermus,N:Firmicutes,O:Fusobacteria,P:619Gemmatimonadetes,Q:Nitrospirae,R:Planctomycetes,S:Proteobacteria,T:620Spirochaetes,U:Synergistetes,V:Tenericutes,W:Thermotogae,X:Verrucomicrobia.621622Figure2.Functionalanalysis623ThedistributionofphylogeneticoriginsassociatedwithRefSoilfunctional624subsystemsasannotatedbyRAST(panelA);theoverallphylogeneticdistributionof625allgenesinRefSoil(panelB).626627Figure3.PhylogenetictreeofEMPOTUsclusteredbytaxonomy.628RingI(green)representsthecumulativelog-scaledabundanceofOTUsinEMPsoil629samples.RingII(red)representsEMPOTUsthatsharegreaterthan97%gene630similarity(toRefSoil16SrRNAgenes;ringIII(blue)indicatesthatthese16SrRNA631genessharedsimilaritytosortedcellsthatwereselectedforsinglecellgenomics.A:632Acidobacteria,B:Actinobacteria,C:Aquificae,D:Armatimonadetes,E:Bacteroidetes,633F:Chlamydiae,G:Chlorobi,H:Chloroflexi,I:Crenarchaeota,K:Deferribacteres,L:634Deinococcus-Thermus,M:Euryarchaeota,N:Firmicutes,O:Fusobacteria,P:635Gemmatimonadetes,Q:Nitrospirae,R:Planctomycetes,S:Proteobacteria,T:636Spirochaetes,U:Synergistetes,V:Tenericutes,W:Thermotogae,X:Verrucomicrobia,637Y:Cyanobacteria/Chloroplast.638639TableLegends640Table1.RefSoil’smostwantedOTUs641RefSoil’smostwantedOTUsbasedonobservedfrequencyandabundanceinEMP642soilsamples.TaxonomyforOTUsareassignedbyRDPclassifier(Wangetal.,2007).643*:(Rideoutetal.,2014).644645Table2.Single-cellamplifiedgenomes646Taxonomicclassificationofsingle-cellamplifiedgenomesandtheabundanceofthe647mostsimilarEMPOTU.*:(Rideoutetal.,2014).648649SupplementaryDescription650SupplementaryFigure1.Theabundancedistributionofphyla651Theabundancedistributionofphyla(log(abundance)+1)presentintheRefSoil652databasecomparedtoNCBI’sRefSeq(A)andphylathatareenrichedinRefSoil653comparedtoRefSeq(B).654655SupplementaryFigure2.Averagerelativeabundanceforvarioussoilorders656AveragerelativeabundanceofEMPOTUs(sharingsimilaritywithRefSoilgenes)for657varioussoilordersasclassifiedbyNRCSSoilTaxonomy.658
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
659660SupplementaryTable1.TheRefSoilDatabase661662SupplementaryTable2.RefSoilgenesinRASTsubsystemfunctions663TotalnumberofRefSoilgenesobservedinRASTsubsystemfunctions.664665SupplementaryTable3.SubsystemlevelfunctionsinRefSoil666Thenumberofuniquephylaandthenumberofenrichedphylaassociatedwith667encodedsubsystemlevelfunctionsinRefSoil.(Bolditalicized:enrichedwithgenes668associatedwiththreeorlessphyla)669670SupplementaryTable4.AbundanceofEMPOTUsinsoilorder671ThetotalabundanceofEMPOTUsbysoilorderinsoilsamplesoriginatinginthe672UnitedStates.673674SupplementaryTable5.100mostabundantOTUs675100mostabundantOTUsinEMPsoilsamples676677SupplementaryTable6.100mostfrequentOTUs678100mostfrequentOTUsinEMPsoilsamples679680SupplementaryTable7.Singlecellgenomicproperties681 682
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
Table1.RefSoil’smostwantedOTUs683RefSoil’smostwantedOTUsbasedonobservedfrequencyandabundanceinEMP684soilsamples.TaxonomyforOTUsareassignedbyRDPclassifier(Wangetal.,2007).685686
687688689
Phylum Class
4457032 Verrucomicrobia Spartobacteria 1 8,007,453 2,312
4471583 Verrucomicrobia Spartobacteria 0.92 2,937,242 1,935
101868 Verrucomicrobia Spartobacteria 1 1,828,706 2,151
559213 Firmicutes Bacilli 1 1,606,757 2,410
1105039 Verrucomicrobia Spartobacteria 1 1,295,847 2,102
807954 Bacteroidetes Sphingobacteriia 0.98 875,988 2,546
4342972 Verrucomicrobia Spartobacteria 0.97 750,557 2,386
4423681 Gemmatimonadetes Gemmatimonadetes 0.25 689,209 1,954
1109646 Verrucomicrobia Spartobacteria 1 554,748 2,012
610188 Acidobacteria Acidobacteria_Gp6 0.99 553,476 2,694
4373456 Acidobacteria Acidobacteria_Gp4 1 397,255 2,150
4341176 Verrucomicrobia Subdivision3 0.99 383,900 2,098
720217 Proteobacteria Deltaproteobacteria 0.74 345,621 2,073
4314933 Acidobacteria Acidobacteria_Gp1 0.8 333,428 1,974
205391 Acidobacteria Acidobacteria_Gp3 0.87 327,162 2,190
4378940 Acidobacteria Acidobacteria_Gp6 1 310,421 2,499
946250 Verrucomicrobia Spartobacteria 1 300,544 2,386
3122801 Acidobacteria Acidobacteria_Gp1 0.97 273,493 2,107
4450676 Proteobacteria Alphaproteobacteria 1 255,424 1,963
206514 Proteobacteria Alphaproteobacteria 0.92 207,460 2,075
4463040 Firmicutes Clostridia 0.94 204,483 2,096
ClosestMatchinRDPClassiferOTUIDin
(Rideoutet
al.,2014)
RDPClassifier
SimilarityScore
Abundance
(total
amplicons)
Number
of
Samples
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
Table2.Single-cellamplifiedgenomes690Taxonomicclassificationofsingle-cellamplifiedgenomesandtheabundanceofthe691mostsimilarEMPOTU.692693
694695696
NCBIID Phylum Class EMPIDin(Rideoutetal.,2014) AbundanceRelative
Abundance(%)
LSTF00000000 Proteobacteria β-proteobacteria New.52.CleanUp.ReferenceOTU25474 81,630 2.13E-02
LSTI00000000 Actinobacteria Thermoleophilia 551344 35,500 9.27E-03
LSTA00000000 Verrucomicrobia Spartobacteria New.54.CleanUp.ReferenceOTU60738 5,076 1.33E-03
LSTC00000000 Nitrospirae Nitrospira New.22.CleanUp.ReferenceOTU3952 4,337 1.13E-03
LSTH00000000 Planctomycetes Planctomycetia 521158 4,161 1.09E-03
LSSZ00000000 Planctomycetes Planctomycetia New.5.CleanUp.ReferenceOTU188938 2,434 6.36E-04
LSTE00000000 Proteobacteria γ-proteobacteria NA NA NA
LSTB00000000 Proteobacteria α-proteobacteria New.52.CleanUp.ReferenceOTU4588 309 8.07E-05
LSSY00000000 Actinobacteria Thermoleophilia 904196 103 2.69E-05
LSTG00000000 Verrucomicrobia Opitutae New.5.CleanUp.ReferenceOTU291124 69 1.80E-05
LSTD00000000 Planctomycetes Planctomycetia New.59.CleanUp.ReferenceOTU1223798 44 1.15E-05
LSTJ00000000 Verrucomicrobia Verrucomicrobiae New.7.CleanUp.ReferenceOTU84651 11 2.87E-06
LSTK00000000 Acidobacteria Acidobacteriia New.59.CleanUp.ReferenceOTU999655 4 1.04E-06
LSSX00000000 Chloroflexi TK17 New.59.CleanUp.ReferenceOTU481621 2 5.22E-07
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
697698Figure1.PhylogenetictreeofRefSoil699Phylogenetictreeofaligned16SrRNAgenesoriginatingfromRefSoilbacterial700genomes.A:Acidobacteria,B:Actinobacteria,C:Aquificae,D:Armatimonadetes,E:701Bacteroidetes,F:Chlamydiae,G:Chlorobi,H:Chloroflexi,J:Cyanobacteria,K:702Deferribacteres,L:Deinococcus-Thermus,N:Firmicutes,O:Fusobacteria,P:703Gemmatimonadetes,Q:Nitrospirae,R:Planctomycetes,S:Proteobacteria,T:704Spirochaetes,U:Synergistetes,V:Tenericutes,W:Thermotogae,X:Verrucomicrobia.705706
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
707Figure2.Functionalanalysis708ThedistributionofphylogeneticoriginsassociatedwithRefSoilfunctional709subsystemsasannotatedbyRAST(panelA);theoverallphylogeneticdistributionof710allgenesinRefSoil(panelB).711712713714715716717
A B
0
25
50
75
100
Mem
bran
e Tr
ansp
ort
Mot
ility
and
Che
mot
axis
Met
abol
ism
of A
rom
atic
Com
poun
dsR
espi
ratio
nSu
lfur M
etab
olis
mSt
ress
Res
pons
eIro
n ac
quis
ition
and
met
abol
ism
Pota
ssiu
m m
etab
olis
mR
egul
atio
n an
d C
ell s
igna
ling
Viru
lenc
e, D
isea
se a
nd D
efen
sePh
ages
, Pro
phag
es, T
rans
posa
ble
elem
ents
, Pla
smid
sN
itrog
en M
etab
olis
mC
ell W
all a
nd C
apsu
leSe
cond
ary
Met
abol
ism
Amin
o Ac
ids
and
Der
ivativ
esR
NA
Met
abol
ism
Phos
phor
us M
etab
olis
mC
ofac
tors
, Vita
min
s, P
rost
hetic
Gro
ups,
Pig
men
tsPr
otei
n M
etab
olis
mC
arbo
hydr
ates
Fatty
Aci
ds, L
ipid
s, a
nd Is
opre
noid
sD
NA
Met
abol
ism
Nuc
leos
ides
and
Nuc
leot
ides
Cel
l Div
isio
n an
d C
ell C
ycle
Phot
osyn
thes
isD
orm
ancy
and
Spo
rula
tion
Ref
Soil
Perc
ent C
ontr
ibut
ion
(%)
PhylumFusobacteriaSynergistetesTenericutesDeferribacteresGemmatimonadetesChlamydiaePlanctomycetesArmatimonadetesVerrucomicrobiaAquificaeNitrospiraeThermotogaeChlorobiDeinococcus−ThermusCrenarchaeotaChloroflexiAcidobacteriaSpirochaetesEuryarchaeotaBacteroidetesCyanobacteriaActinobacteriaFirmicutesProteobacteria
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint
718Figure3.PhylogenetictreeofEMPOTUsclusteredbytaxonomy.719RingI(green)representsthecumulativelog-scaledabundanceofOTUsinEMPsoil720samples.RingII(red)representsEMPOTUsthatsharegreaterthan97%gene721similarity(toRefSoil16SrRNAgenes;ringIII(blue)indicatesthatthese16SrRNA722genessharedsimilaritytosortedcellsthatwereselectedforsinglecellgenomics.A:723Acidobacteria,B:Actinobacteria,C:Aquificae,D:Armatimonadetes,E:Bacteroidetes,724F:Chlamydiae,G:Chlorobi,H:Chloroflexi,I:Crenarchaeota,K:Deferribacteres,L:725Deinococcus-Thermus,M:Euryarchaeota,N:Firmicutes,O:Fusobacteria,P:726Gemmatimonadetes,Q:Nitrospirae,R:Planctomycetes,S:Proteobacteria,T:727Spirochaetes,U:Synergistetes,V:Tenericutes,W:Thermotogae,X:Verrucomicrobia,728Y:Cyanobacteria/Chloroplast.729730731732733734
.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint