refsoil: a reference database of soil microbial genomes · 5/14/2016  · 1 refsoil: a reference...

22
RefSoil: A reference database of soil microbial genomes 1 Running title: A reference database of soil microbial genomes 2 3 Jinlyung Choi 1 , Fan Yang 1 , Ramunas Stepanauskas 2 , Erick Cardenas 3 , Aaron 4 Garoutte 4 , Ryan Williams 1 , Jared Flater 1 , James M Tiedje 4 , Kirsten S. Hofmockel 5, 6 , 5 Brian Gelder 1 , Adina Howe 6 7 1 Agricultural and Biosystems Engineering, Iowa State University, Iowa, United 8 States 9 2 Bigelow Laboratory for Ocean Sciences, Maine, United States 10 Ramunas Stepanauskas 11 3 Department of Microbiology & Immunology, University of British Columbia, 12 Vancouver, Canada 13 4 Center for Microbial Ecology, Michigan State University, Michigan, United States 14 5 Environmental Molecular Sciences Laboratory, Pacific Northwest National 15 Laboratory, Washington, United States 16 6 Department of Ecology, Evolution and Organismal Biology, Iowa State University, 17 Iowa, United States 18 19 § Corresponding author 20 21 Corresponding author addresses: 22 Name: Adina Howe 23 Postal address: 1201 Sukup Hall Ames IA, 50011 United States 24 Telephone: +1-515-294-0176 25 E-mail: [email protected] 26 27 28 Subject Categories 29 Microbial population and community ecology 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 14, 2016. . https://doi.org/10.1101/053397 doi: bioRxiv preprint

Upload: others

Post on 03-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

RefSoil:Areferencedatabaseofsoilmicrobialgenomes1Runningtitle:Areferencedatabaseofsoilmicrobialgenomes23Jinlyung Choi1, Fan Yang1, Ramunas Stepanauskas2, Erick Cardenas3, Aaron4Garoutte4,RyanWilliams1, JaredFlater1, JamesMTiedje4,KirstenS.Hofmockel5,6,5BrianGelder1,AdinaHowe1§671Agricultural and Biosystems Engineering, Iowa State University, Iowa, United8States92BigelowLaboratoryforOceanSciences,Maine,UnitedStates10RamunasStepanauskas113Department of Microbiology & Immunology, University of British Columbia,12Vancouver,Canada134CenterforMicrobialEcology,MichiganStateUniversity,Michigan,UnitedStates145Environmental Molecular Sciences Laboratory, Pacific Northwest National15Laboratory,Washington,UnitedStates166DepartmentofEcology,EvolutionandOrganismalBiology, IowaStateUniversity,17Iowa,UnitedStates1819§Correspondingauthor2021Correspondingauthoraddresses:22Name:AdinaHowe23Postaladdress:1201SukupHallAmesIA,50011UnitedStates24Telephone:+1-515-294-017625E-mail:adina@iastate.edu262728SubjectCategories29Microbialpopulationandcommunityecology30313233343536373839404142434445

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 2: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

Abstract4647A database of curated genomes is needed to better assess soil microbial48communities and their processes associatedwith differing landmanagement and49environmental impacts. Interpreting soil metagenomic datasets with existing50sequence databases is challenging because these datasets are biased towards51medical andbiotechnology researchand can result inmisleading annotations.We52havecuratedadatabaseof922genomesofsoil-associatedorganisms(888bacteria53and 34 archaea). Using this database,we evaluated phyla and functions that are54enriched in soils as well as those thatmay be underrepresented in RefSoil. Our55comparisonofRefSoiltosoilamplicondatasetsallowedustoidentifytargetsthatif56cultured or sequenced would significantly increase the biodiversity represented57withinRefSoil.Todemonstratetheopportunitiestoaccesstheseunderrepresented58targets, we employed single cell genomics in a pilot experiment to sequence 1459genomes. This effort demonstrates the value of RefSoil in the guidance of future60researcheffortsandthecapabilityofsinglecellgenomicsasapracticalmeanstofill61theexistinggenomicdatagaps.6263Introduction6465

Microbialpopulationsinthesoilimpactthestructureandfertilityofoursoils66by contributing to nutrient availability, soil stability, plant productivity, and67protection against disease and pathogens. Soil microbiology has a rich history of68isolatingandcharacterizingrepresentativesfromthesoiltoassesstheirimpactson69soil health and stability. However, despite significant efforts to isolate microbes70from the soil, we have accessed only a small fraction of its biodiversity, with71estimates of less than 1 to 50 percent of species, even with novel isolation72techniques(VanHTPhamandKim,2012;Vartoukianetal.,2010;Schoenbornetal.,732004;Burmølleetal.,2009;Janssenetal.,2002;Lingetal.,2015).Withadvancesin74DNA sequencing technologies,we cannowdirectly interrogate soilmetagenomes,75allowing for the characterization of soil communities without the need to first76cultivate isolates. However, our ability to annotate and characterize the retrieved77genes is dependent on the availability of informative reference gene or genome78databases.The largestresourceof fullgenomereferencesequences forannotating79genesistheNCBIRefSeqdatabase(Tatusovaetal.,2013),whichcontainsatotalof8057,993 genomes (Release 74). While broadly useful, this database is not81representativeofthelargemajorityofsoilmicrobiomes,andthemajorityofgenes82in previously published soilmetagenomes (65-90%) cannot be annotated against83knowngenes(Delmontetal.,2012;Fiereretal.,2012).Further,contributionstothe84NCBI databases have largely originated from human health and biotechnology85research efforts that can mislead annotations of genes originating from soil86microbiomes(e.g.,annotationsthatareclearlynotcompatiblewithlifeinsoil).87 Soil microbiologists are not the first to face the problem of a limited88reference database. The NIH Human Microbiome Project (HMP) recognized the89criticalneedforawell-curatedreferencegenomedatasetanddevelopedareference90

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 3: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

catalogof3,000genomesthatwereisolatedandsequencedfromhuman-associated91microbial populations(Huttenhoweretal., 2012).Thispublicly available reference92setofmicrobialisolatesandtheirgenomicsequencesaidsintheanalysisofhuman93microbiomesequencingdata(Segataetal.,2012;Wuetal.,2009)andalsoprovides94strainsforwhichisolates(bothculturecollectionsandnucleicacids)areavailableas95resources for experiments. Following this successful model of the HMP, we have96developed a curated database of reference genome sequences that originate from97soil. This database, called RefSoil, represents the current state of knowledge of98sequenced soil genomes. In addition to helping us to understand known soil99microbiology, RefSoil aids in the identification of knowledge gaps that need to be100filledtoimproveourunderstandingofsoilbiodiversity.Wediscusstheimportance101the RefSoil database and demonstrate exciting opportunities to expand this102databaseanditsapplicationsinthefuture.103104MaterialsandMethods105106CreationofRefSoilDatabase107In order to create a soil-specific reference genome dataset, we targeted genome108sequences based on evidence that isolates had origins in soil systems. A total of1096,646completegenomeswereobtainedfromtheGenomesOnLineDatabase(GOLD,110https://gold.jgi.doe.gov,October9th,2014);theGOLDdatabasewaschosenbecause111oftheavailabilityofmetadata,particularlyenvironmentalorigin,relatedtogenome112sequences.Thesegenomeswerefurtherselectedbasedonsoil-associationwiththe113following criteria: (1) Within the GOLD database, organism information and114organismmetadata(knownhabitats,ecosystemcategory,andecosystemtype)was115requiredtoidentifyorganismasoriginatinginsoil-associatedcategories.Organisms116frommarine and deep sea environmentswere excluded, though these organisms117often were identified as soil-associated. Additionally, obligate host-associated118pathogensandextremophileswereexcluded,astheseorganismsareunlikelytobe119present in the absence of their host or in representative soil samples. We120consideredtheseorganismstooftenbeunderverystrongselectivepressuresthat121can lead to reduced genomes or high rates of recombination that are difficult to122assess soil-specific trends. (2) For organisms that lacked appropriatemetadata in123GOLD, a Google Custom Search was used to query all available websites for the124organism name and soil-related terms (rhizosphere, soil, sand, mud, or nodule).125Priorities were placed on the following websites:126http://www.ncbi.nlm.nih.gov/pubmed/, http://www.straininfo.net/,127https://microbewiki.kenyon.edu/index.php/MicrobeWiki,128https://en.wikipedia.org/, and https://scholar.google.com. Genomes with an129association with at least one webpage containing search query phrases were130includedinRefSoil.Thisapproachwastestedwithknownsoil-associatedorganisms,131verifyingthat itwouldreliablypredictsoilassociationfortheseorganisms.Within132resultinggenomes,duplicatedchromosomesandstrainswereremoved.Ifmultiple133genome accession numberswere associatedwith a single strain, themost recent134genomesequencewaschosen.NCBIGenbankannotationfilesassociatedwitheach135

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 4: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

strain were obtained on 2/18/16. The genomes contained within RefSoil are136providedinSupplementaryTable1.137138CharacterizationoforganismsinRefSoil139

All RefSoil genome sequences were associated with their NCBI RefSeq140accession IDandobtained fromNCBI. UsingNCBIGenbankCDSannotations, 16s141rRNAgenesequenceswere identified. Ifmultiple16SrRNAgenesequenceswere142present,thefirstsequencefromthefirstchromosome(byNCBIindex)wasselected143asrepresentativeforthegenome.Threegenomeslackedannotationsofa16SrRNA144geneanda16SrRNAgeneHMMmodelwasusedand identifiedanadditional16S145rRNAsequence(Guoetal.,2015).Intotal,88616SrRNAgenesequenceswereused146to build a phylogenetic tree for bacterial genomes within RefSoil (Fig. 1). All147bacterial 16S rRNA gene sequences were aligned using RDP’s bacterial model148(Infernal 1.1.1 (Nawrocki and Eddy, 2013)), and a Maximum-likelihood149phylogenetic tree was constructed based on the Jukes-Cantor model by Fasttree150(Priceetal.,2010)andvisualizedwithGraphlan(R3.2.2,version0.9.7)(Fig.1).The151scripts for these approaches are publicly available on https://github.com/germs-152lab/ref_soil. RefSoil genomes were extracted from corresponding genomes153containedwithinNCBIRefSeq(release74).AnnotationsoftaxonomyforRefSoiland154RefSeqgeneswereobtainedfromNCBI(February19,2016)(SupplementaryFig.1).155RefSoil genomeswere submitted to RAST and genes were annotated using SEED156subsystem ontology (Supplementary Table 1). Among 1,811,233 unique genes,157approximately 39.1% of themwere assigned tomultiple SEED subsystem level 1158categories.Forthepurposeofthisstudy,weincludedallannotationsatsubsystem159level 1 for each unique gene,which expanded the total gene counts to 2,619,643.160The percent abundance of genes from each phylum in RefSoil database was also161adjusted accordingly by including the abundance of genes that were assigned to162multiplesubsystemlevel1categories(Fig.2andSupplementTable2).Thenumber163of phyla identified in each subsystem level 1 categories was summarized in164SupplementalTable3.165166EMPdatabasesusedinthisstudy167

Atotalof15,481amplicondatasetsthatformtheEMPdatasetwereavailable168to compare environmental soil amplicons to RefSoil (total 5,594,412 OTUs,169clusteredat97%)(Rideoutetal.,2014).Soil sampleswereselectedbasedon their170association with soil metadata resulting in a total of 2,476,795 OTUs from 3035171samples. A total of 2,158 unique taxonomy assignmentswere identified for EMP172OTUs using the RDP Classifier (Wang et al., 2007) and used to construct a173phylogenetic tree with the Graphlan package (R 3.2.2, version 0.9.7) (Fig. 3).174Abundances for each taxonomic assignment were calculated as the sum of175abundanceofOTUsassociatedwiththattaxonomy.176

To evaluate thepresenceofRefSoil genomes inEMP,EMPandRefSoil 16S177rRNA gene amplicons were compared by alignment with BLAST, requiring an178alignmentwithgreaterthan97%similarity,aminimumalignmentof72bp,andE-179value ≤ 1e-5. If multiple hits were aligned, the hit with the lowest E-value was180selected as representative. Using these criteria, a total of 53,538 EMPOTUswere181

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 5: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

associated RefSoil 16s rRNA genes (1.4% of total EMP OTUs; 10.2% of all EMP182amplicons).183

SoilorderforEMPsampleslocatedintheUnitedStates(1817samples)was184obtained based on GPS location associatedwith samplemetadata; locationswere185onlyconsideredvalidifthecoordinatesenteredwereactuallylocatedintheUnited186States (1627samples). ValidGPSpointswere then locatedwithina soilorderby187queryingtheUSDANRCSGlobalSoilRegionsmap(Reich)(AccessedonJanuary26,1882016) for soil order plus a Rock/Sand/Ice category using ArcGIS 10.3.1. RefSoil189representatives for EMPOTUswere determined by similarity as described above190(SupplementaryTable4).191

192Most-wantedsoilOTUs193

TheselectionofthemostwantedOTUsforRefSoilbasedonEMP-associated194ampliconswasbasedonpresenceinEMPsamples,usingthecriteriaofEMP-RefSoil195similarityasdescribedabove,andabundance(numberofampliconsassociatedwith196the OTU in EMP samples). Candidate OTUs were ranked based on its observed197frequencyinallEMPsamplesandabundanceinEMPamplicons(Top100shownin198SupplementaryTable5and6).TaxonomywasassignedbytheRDPclassifier(Wang199etal., 2007). The 21OTUs present in both ‘themost abundance’ and ‘themost200frequent’arelistedintheTable1.201202Singlecellgenomics203

Forsinglecellgenomics,asoilsamplewascollectedfrom0-10cmdepthina204residentialgardeninNobleboro,Maine(44°5'48.10"N,69°29'10.56"W)onMay5th,2052015. Approximately five grams of the sample were mixed with 30 mL sterile-206filtered phosphate-saline buffer (PBS), vortexed for 30 s atmaximum speed, and207centrifuged for30sat2,000rpm.Theobtainedsupernatantwasdiluted tobelow208105cellxmL-1withPBS,pre-screenedthrougha40μmmesh-sizecellstrainer(BD),209and incubated with SYTO-9 DNA stain (5 μM; Invitrogen) for 10-60 min. The210generationofsingleamplifiedgenomes(SAGs)andtheirgenomicsequencingwere211performed by the Bigelow Laboratory Single Cell Genomics Center212(scgc.bigelow.org),aspreviouslydescribed(Stepanauskasetal.).SAGsrepresenting213the "most-wanted list"were selected based on 16S rRNA gene BLAST alignments214withgreaterthan97%similarityoveratleast72bp.215

216Codeandsequencingdataavailability217 The analysis code used to generate the results is available from218https://github.com/germs-lab/ref_soil.All14single-cellsequencingdatasetshave219been deposited at the NCBI. Sequencing data with NCBI accession identifier are220listedinTable2.221

222223224225226227

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 6: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

Results228229PhylogeneticandfunctionalcharacterizationofRefSoil230

RefSoil is a manually curated database containing a total 922 genomes of231soil-associated organisms, comprised of 888 bacteria and 34 archaea232(SupplementaryTable1).ThegenomeswithinRefSoil(soil-associated)andRefSeq233(all environments) reference databases were compared. Both genome databases234contained similar dominant phyla, including Proteobacteria, Firmicutes, and235Actinobacteria, comprising over 91% and 88% of genomes in RefSoil and RefSeq236bacteria, respectively. RefSoil contained higher proportions of Armatimonadetes,237Germmatimonadetes, Thermodesulfobacteria, Acidobacteria, Nitrospirae, and238ChloroflexithanRefSeq,suggestingthatthesephylamaybeenrichedinthesoilor239underrepresented in the RefSeq database. A total of eleven RefSeq-associated240phylums were absent from RefSoil indicating that these phyla may be absent or241difficult to cultivate in soil environments (Supplementary Fig. 1). The 16S rRNA242genesforRefSoilbacterialgenomeswereobtainedfromNCBIGenbankannotations243and aligned to construct a phylogenetic tree representing RefSoil phylogenetic244diversity(Fig.1). 245 RefSoil bacterial and archaeal genomes were further annotated with the246Rapid Annotation using Subsystem Technology (RAST, v 2.0(Aziz et al., 2008))247(SupplementaryTable1),resulting intheannotationofatotalof1,811,233RAST-248associated genes. RAST annotations, unlike GenBank annotations, include249classificationintofunctionalontologiesorsubsystems,allowingforthecomparison250of broad functions. Overall, 78% of the bacterial and archaeal genes could be251classifiedintofunctionalsubsystems(SupplementaryTable2).Forgenesassociated252with each subsystem,we evaluated the phylogenetic origins of RefSoil genes and253compared the phlya distribution of annotated genes with those within the254cumulative RefSoil database (Fig. 2). If the proportion of genes associatedwith a255phylum within a functional subsystem was greater than its representation in256RefSoil,weconsideredtherepresentationofthephylaenrichedinthisfunction(Fig.2572,SupplementaryTable3).Forexample,weobservedthatinmostsubsystems(15258out of 26 subsystems), Proteobacteria-associated geneswere enriched relative to259their representation within RefSoil (56% of all RefSoil genes associated with260Proteobacteria). Additionally, Actinobacteria, Crenarchaeota, and Proteobacteria261genes were enriched in functions related tometabolism of aromatic compounds;262Firmicutes and Proteobacteria genes were enriched in functions related to iron263acquisitionandmetabolism;and functioncategorydormancyandsporulationwas264enrichedwith Firmicutes genes only. These enrichments indicate that organisms265from a small number of phyla may have advantages over other organisms in266environmentswherethesefunctionsareimportant.267268RefSoilcomparedtosequencesoriginatingfromsoilenvironments269

Genomes in RefSoil represent cultivated strains originating from soils that270have been isolated and often characterized in laboratory conditions. In order to271estimate their natural abundance in soils, we compared RefSoil to amplicon272sequencingdatasetsfromglobalsoilmicrobiomesintheEarthMicrobiomeProject273

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 7: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

(EMP)(Rideoutetal.,2014;Gilbertetal.,2014).Amplicons(16SrRNAgenes)from274EMP datasets were clustered at 97% sequence similarity to determine275representative operational taxonomic units (OTUs)(Rideout et al., 2014). We276selected only EMP samples thatwere associatedwith soil samples, resulting in a277totalof3,035samples.Withinthesesoilsamples,weobservedthatthemajorityof278OTUswererare(e.g.,onlyobservedinafewsamples),with76%ofOTUsobserved279inlessthantensoilsamples,and1%ofOTUsrepresenting81%oftotalabundance280inEMP.281

ComparingRefSoil16SrRNAgenestoEMPamplicons,weobservedthatthe282majorityofRefSoil16SrRNAsequenceswerehighlysimilartoEMPamplicons(over28387% of RefSoil bacterial and archaeal 16s rRNA genes shared greater than 97%284similaritywithaminimum72bpalignmentlength).Incontrast,99%(2,442,432of2852,476,795) of EMP amplicons did not share high similarity (greater than 97%286similarity)toRefSoilgenes,suggestingthatEMPsoilsamplescontainmuchhigher287diversity than represented within RefSoil. While RefSoil genomes represent soil288microbes that have been well-studied in the laboratory, the EMP amplicons289represent microbes that have been observed in soils around the world but not290necessarilywell-characterized (e.g., limited to the observation of the presence of29116SrRNAgenesequences).292

293Soilmicrobiomesassociatedwithdifferentsoiltypes294

In1975,asoiltaxonomywasdevelopedbytheUnitedStatesDepartmentof295Agriculture (USDA) and the National Resources Conservation Service, which296separates soils into twelve orders that are based on their physical, chemical, or297biologicalproperties(SoilSurvayStaff1999,1999). Despite theavailabilityof this298classification, it israrely incorporatedintosoilmicrobiomesurveys. UsingRefSoil299and estimated abundances from similar EMP OTUs, we evaluated microbial300distribution in various soil orders. WeobtainedGPSdata fromEMP soil samples301originating from theUnitedStates thatallowedus toobtain the soil classification.302Within these EMP samples, the most represented soil orders included Mollisols303(58%, grassland fertile soils) andAlfisols (37%, fertile soils typicallyunder forest304vegetation) (Supplementary Table 4). The abundance ofOTUs similar to RefSoil305geneswasusedtoestimatethemembershipofwell-characterizedisolatesinthese306soilsamples(SupplementaryFig.2).Mollisols,AlfisolsandVertisols(soilswithhigh307claycontentwithpronouncedchangesinmoisture)wereassociatedwiththemost308RefSoil representatives,whileGelisols (cold climate soils),Ultisols (soilswith low309cationexchange),andsand/rock/icecontainedtheleast(SupplementaryTable4).310

311IdentificationofthemostbeneficialfuturetargetsforRefSoil312 EMP OTUs that do not share high similarity with 16S rRNA genes from313RefSoil represent current knowledge gaps for which we lack cultivated isolates.314These gaps were visually identified by overlaying the presence of highly similar315relatives in RefSoil (similarity >97%) to branches in a phylogenetic tree of soil-316associated EMP amplicons (Fig. 3). We evaluated the presence of OTUs in EMP317librariesbasedontheirpresenceinEMPsamples(frequency)andtheircumulative318observed abundance. We identified themost observed and abundant EMP OTUs319

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 8: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

thatdidnotsharehighsimilaritytoRefsoiltogeneratea“RefSoil’smostwantedlist”320oftargetsthat,ifisolatedorsequenced,couldprovideinformationonprevalentbut321uncharacterized lineages (Table 1, Supplementary Table 5-6). OTUs sharing322similaritytoVerrucomicrobia(8OTUs)andAcidobacteria(6OTUs)wereamongthe323most abundant and frequently observed EMP OTUs that are not currently324represented inRefSoil (Table1). Thesetargets,observed innaturebutnot inour325database,would be ideal to isolate and characterize to better understand the soil326biodiversity.327

328Singlecellgenomicsinsoil329

Sequencing-basedapproachesprovideanalternativetoaccessingthe330genomesofsoilorganismswithoutcultivation.Previouseffortshaveusedassembly331ofgenomesfrommetagenomes(Labontéetal.,2015;Martijnetal.,2015;Fieldetal.,3322014)andsinglecellgenomics(Gawadetal.,2016;Lasken,2012;Stepanauskas,3332012)toobtaingenomicblueprintsofyetunculturedmicrobialgroups.Toevaluate334theeffectivenessofsinglecellgenomicsonsoilcommunities,weperformedapilot-335scaleexperimentonaresidentialgardensoilinMaine.The16SrRNAgenewas336successfullyrecoveredfrom109ofthe317singleamplifiedgenomes(SAGs).This33734%16SrRNAgenerecoveryrateiscomparabletosinglecellgenomicsstudiesin338marine,freshwaterandotherenvironments(Martinez-Garciaetal.,2011;Rinkeet339al.,2013;Swanetal.,2011).The16SrRNAgenesof14oftheseSAGs,belongingto340Proteobacteria,Actinobacteria,Nitrospirae,Verrucomicrobia,Planktomycetes,341AcidobacteriaandChloroflexi,wereselectedbasedontheirlackofrepresentation342withinRefSoil.GenomicsequencingofthoseSAGsresultedinacumulativeassembly343of23Mbp(Table2,SupplementaryTable7).WeestimatedtheabundanceofOTUs344similartosinglecell16SrRNAgenestorangefrom5E-7to2E-2%ofEMPOTU345abundances.Theseabundancesareverylowbutcomparabletotheaverageand346medianrelativeabundanceofOTUswithinEMP(4E-3%and1E-6%,respectively).347IfthesedraftgenomeswereaddedtoRefSoil,these14SAGswouldincrease348RefSoil’srepresentationofEMPampliconsby7%byabundance.349350Discussion351352

Advances in sequencing techniques for utilizing culture-independent353approaches have created new opportunities for understanding soil microbiology354and its impact on soil health, stability, andmanagement.We provide RefSoil as a355community genomic resource to provide high-quality curation of organisms that356originate in soil studies, allowing us tomore accurately annotate soil sequencing357datasets. This database is an important initial effort to provide improved soil358references, and we acknowledge that it is far from a complete representation of359soil’s biodiversity. In fact, compared to EMP soils, RefSoil represents only 2% of360observedEMPOTUsand10%byabundanceoftheobservedsoilOTUs,highlighting361themagnitudeofsoilbiodiversityandthelimitationsofourcurrentdatabasebased362on cultivated representatives. These results are consistent with a recent effort363demonstratingthatweareverylimitedinourknowledgeofnotonlysoilbutlife’s364diversity and advocating that much can be learned by including uncultivated365

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 9: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

organisms(Hugetal., 2016). We showherehowRefSoil canbeused to evaluate366which organisms to target thatwouldmost effectively increase our knowledge of367soilbiodiversity.Forexample,ifgenomereferenceswereavailableforthetopmost368wanted organisms identified in this effort (Table 1), we could expand RefSoil’s369representationofEMPsoilsby58%byabundance.370

Notably, many of these targets have previously been observed to be371recalcitranttocultivationwithstandardlaboratorymedia,resultingintheirabsence372in current genome databases. Acidobacteria, for example, is known to be slow373growinganddifficulttocultivate(NunesdaRochaetal.,2009)whilealsoobserved374to be highly abundant in soil (33% of EMP amplicons by abundance). Another375fastidiousbacteria,Verrucomicrobia(Bergmannetal.,2011),werealsoobservedto376behighlyabundant(12.5%)inEMPbutnotwellrepresentedinRefSoil(twoof888377bacterial genomes). Despite their absence from cultivated isolates, both378Acidobacteria andVerrucomicrobia havebeenobserved to be critical for nutrient379cycling insoils(NunesdaRochaetal.,2009;Wardetal.,2009;Fiereretal.,2013;380Martinez-Garciaetal.,2012).381

Novel isolation and culturing techniques will help us to access these382previously difficult to grow bacteria andwill also be complemented by emerging383sequencing technologies. Inparticular, single-cellgenomicsholdgreatpromise to384providegenomiccharacterizationoflineagesthataredifficulttoculture(Gawadet385al., 2016; Lasken, 2012; Stepanauskas, 2012). In our pilot experiment, we386demonstrate, for the first time, that single-cell genomics is applicable on soil387samplesand iswellsuited torecover thegenomic information fromabundantbut388yet uncultured taxonomic groups. The 14 sequenced SAGs have significantly389increasedtheextenttowhichRefSoilrepresentsthepredominantsoillineagesfrom390asinglesample.Muchlargersinglecellgenomicsprojectsarefeasibleandhavebeen391employedinpriorstudiesofotherenvironments(Kashtanetal.,2014;Rinkeetal.,3922013;Swanetal.,2013).Thecontinued,rapidimprovementsinthistechnologyare393likelytoleadtofurtherscalability,offeringapracticalmeanstofilltheexistinggaps394intheRefSoildatabaseandbiodiversitymorebroadly.395 TheRefSoil database is also a tool that allows us to characterize currently396known soil bacteria phylogeny and functions. This database spans 24 phyla of397bacteria and archaea. While genes related to microbial growth and reproduction398(e.g., DNA, RNA, and protein metabolism) originate from diverse phyla, key399functions related to metabolism of aromatic compounds, iron metabolism, and400dormancy and sporulationwere observed to be enriched fromonly a fewRefSoil401phyla, suggesting that these phyla may have specialized functions within soil402communities. Specifically, Proteobacteria-associated genes were enriched in403functionsrelatedtomotility,chemotaxis,andmembranetransportation(Figure2),404suggestingthatProteobacteriaarelikelytobeefficientatacquiringreadilyavailable405nutrients and elements for growth(Fierer et al., 2007). Genes related to406Proteobacteria and Actinobacteria were also found dominant among RefSoil407genomes in subsystems related to the metabolism of aromatic compounds. This408observation is consistent with the association of Proteobacteria and humic409substances utilization and the contribution of Actinobacteria to plant material410degradation(Goddenetal.,1992;Fuchsetal.,2011).411

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 10: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

By comparing RefSoil to other databases, we identified biases for specific412phyla in RefSoil; in particular, Firmicutes are observed frequently in RefSoil but413were not observed to be highly abundant in soil environments (5.7% of all EMP414amplicons). Firmicutes have been well-studied as pathogens(Buffie and Pamer,4152013;Kamadaetal.,2013;Rupniketal.,2009),likelybiasingtheirrepresentationin416ourdatabasesandconsequentlytheirannotationsinsoilstudies. Akeyadvantage417to the development of the RefSoil database is the opportunity to evaluate these418biases and to identify targets for future curation that would create a more419representative database resource for the soil. To this end, we evaluated the420representationofRefSoilinvarioussoils.421

Consistent with previous observations that microbial community422composition is correlated with soil environments(Fierer et al., 2012; Fierer and423Jackson, 2006), we observed specific RefSoil membership associated with soil424taxonomy. NotallsoiltaxonomicorderswereequallyrepresentedinRefSoil,with425themostgenomesassociatedwithMollisols,orgrasslandsoils. Thisbias is likely426duetoincreasedresearchinthesesoilsduetotheirimportanceforagricultureand427productivity. In contrast, for Gellisol permafrost soils, which cover over 20% of428Earth’s terrestrial surface(Koven etal., 2011) and are a significant carbon source429fromourbiospheretotheatmosphere,weobserveonlyasmall fractionofknown430soilmicrobesarepresent. Previousstudiessuggestthathighlevelsofnoveltycan431be observed in permafrost microbiomes(Koven et al., 2011; Mackelprang et al.,4322011), suggesting that these soils could significantly benefit from improved433references.434 AnotheradvantagetotheRefSoildatabaseisthatitrepresentsgenomesfor435which strains should currently be available and for which we have high-quality436genomes. As a consequence, genomes within RefSoil could help to inform and437designsoilmicrobiologyexperiments.Forexample,itisknownthatnitrogencycling438genesareabundantinagriculturalsoils(Xueetal.,2013;Philippotetal.,2007;Long439et al., 2012). A mock community of isolates known for participating in nitrogen440cycling could be generated and proportionally designed to mimic soil conditions441usingRefSoil: these strainsmight includemicroorganisms related toStreptomyces442venezuelae ATCC 10712 (assimilatory nitrate reductase),Bacillus anthracis (nitric443oxide reductase), Pseudomonas brassicacearum (nitrous oxide reductase),444Halomonas elongata DSM 2581 (ammonia monooxygenase), Pseudomonas445fluorescens SBW25 (ammonia monooxygenase), and Pleurocapsa sp. PCC 7327446(nitrogen fixation). Combining RefSoil and available amplicon datasets, one could447estimatetheproportionsofstrainsasobservedinsoilsamples.448

In conclusion, RefSoil is an important first step towards building a more449comprehensive, well-curated database. Though it is far from complete, we have450beenabletouseRefSoilasatooltoidentifykeyphylathatrepresentgapsinknown451soil diversity and underrepresented phyla. Going forward, this reference will be452expanded as isolation and cultivation-independent technologies continue to453improve.454455456457

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 11: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

Acknowledgments458We are grateful for the support of the NSF Terragenome International Soil459MetagenomeSequencingConsortium forproviding a collaborativeworkshopwith460thesoilcommunity fordiscussionsto improvethisproject. Thismaterial isbased461uponworksupportedbytheU.S.DepartmentofEnergy,OfficeofScience,Officeof462BiologicalandEnvironmentalResearch,underAwardNumberSC0010775.463464Contributions E.C., A.G., A.H., and J.T. curated Refsoil; J.C. and A.H. collected the465genomeinformationtobuildthedatabase;J.C.,J.F.,A.H.,R.W,andF.Y.,characterized466andcomparedRefSoiltoEMPdatasets;B.G.providedsoilordersforEMPsamples;467R.S.performedsinglecellgenomics;J.T.,K.H.andA.H.supervisedthework;J.C.,F.Y.,468andA.H.wrotethemanuscriptwithcontributionsfromallotherauthors.469470ConflictofInterest471Theauthorsdeclarenoconflictofinterests.472473

474

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 12: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

References475AzizRK,BartelsD,BestAA,DeJonghM,DiszT,EdwardsRA,etal.(2008).TheRAST476Server:RapidAnnotationsusingSubsystemsTechnology.BMCGenomics9:75.477

BergmannGT,BatesST,EilersKG,LauberCL,CaporasoJG,WaltersWA,etal.478(2011).Theunder-recognizeddominanceofVerrucomicrobiainsoilbacterial479communities.SoilBiologyandBiochemistry43:1450–1455.480

BuffieCG,PamerEG.(2013).Microbiota-mediatedcolonizationresistanceagainst481intestinalpathogens.NatRevImmunol13:790–801.482

BurmølleM,JohnsenK,Al-SoudWA,HansenLH,SørensenSJ.(2009).Thepresence483ofembeddedbacterialpureculturesinagarplatesstimulatetheculturabilityofsoil484bacteria.JournalofMicrobiologicalMethods79:166–173.485

DelmontTO,PrestatE,KeeganKP,FaubladierM,RobeP,ClarkIM,etal.(2012).486Structure,fluctuationandmagnitudeofanaturalgrasslandsoilmetagenome.ISMEJ4876:1677–1687.488

FieldEK,SczyrbaA,LymanAE,HarrisCC,WoykeT,StepanauskasR,etal.(2014).489GenomicinsightsintotheuncultivatedmarineZetaproteobacteriaatLoihi490Seamount.ISMEJ9:857–870.491

FiererN,BradfordMA,JacksonRB.(2007).Towardanecologicalclassificationof492soilbacteria.Ecology88:1354–1364.493

FiererN,JacksonRB.(2006).Thediversityandbiogeographyofsoilbacterial494communities.PNatlAcadSciUSA103:626–631.495

FiererN,LadauJ,ClementeJC,LeffJW,OwensSM,PollardKS,etal.(2013).496ReconstructingtheMicrobialDiversityandFunctionofPre-AgriculturalTallgrass497PrairieSoilsintheUnitedStates.Science342:621–624.498

FiererN,LeffJW,AdamsBJ,NielsenUN,BatesST,LauberCL,etal.(2012).Cross-499biomemetagenomicanalysesofsoilmicrobialcommunitiesandtheirfunctional500attributes.PNatlAcadSciUSA109:21390–21395.501

FuchsG,BollM,HeiderJ.(2011).Microbialdegradationofaromaticcompounds—502fromonestrategytofour.NatRevMicro9:803–816.503

GawadC,KohW,QuakeSR.(2016).Single-cellgenomesequencing:currentstateof504thescience.NatRevGenet17:175–188.505

GilbertJA,JanssonJK,KnightR.(2014).TheEarthMicrobiomeproject:successes506andaspirations.BMCBiol12:69.507

GoddenB,BallAS,HelvensteinP,MccarthyAJ,PenninckxMJ.(1992).Towards508

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 13: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

elucidationofthelignindegradationpathwayinactinomycetes.JournalofGeneral509Microbiology138:2441–2448.510

GuoJ,ColeJR,ZhangQ,BrownCT,TiedjeJM.(2015).MicrobialCommunityAnalysis511withRibosomalGeneFragmentsfromShotgunMetagenomesSchlossPD(ed).512AppliedandEnvironmentalMicrobiology82:157–166.513

HugLA,BakerBJ,AnantharamanK,BrownCT,ProbstAJ,CastelleCJ,etal.(2016).A514newviewofthetreeoflife.NatMicrobiol16048.515

HuttenhowerC,GeversD,KnightR,AbubuckerS,BadgerJH,ChinwallaAT,etal.516(2012).Structure,functionanddiversityofthehealthyhumanmicrobiome.Nature517486:207–214.518

JanssenPH,YatesPS,GrintonBE,TaylorPM,SaitM.(2002).ImprovedCulturability519ofSoilBacteriaandIsolationinPureCultureofNovelMembersoftheDivisions520Acidobacteria,Actinobacteria,Proteobacteria,andVerrucomicrobia.Appliedand521EnvironmentalMicrobiology68:2391–2396.522

KamadaN,ChenGY,InoharaN,NúñezG.(2013).Controlofpathogensand523pathobiontsbythegutmicrobiota.NatImmunol14:685–690.524

KashtanN,RoggensackSE,RodrigueS,ThompsonJW,BillerSJ,CoeA,etal.(2014).525Single-CellGenomicsRevealsHundredsofCoexistingSubpopulationsinWild526Prochlorococcus.Science344:416–420.527

KovenCD,RingevalB,FriedlingsteinP,CiaisP,CaduleP,KhvorostyanovD,etal.528(2011).Permafrostcarbon-climatefeedbacksaccelerateglobalwarming.PNatl529AcadSciUSA108:14769–14774.530

LabontéJM,SwanBK,PoulosB,LuoH,KorenS,HallamSJ,etal.(2015).Single-cell531genomics-basedanalysisofvirus–hostinteractionsinmarinesurface532bacterioplankton.ISMEJ9:2386–2399.533

LaskenRS.(2012).Genomicsequencingofunculturedmicroorganismsfromsingle534cells.NatRevMicro10:631–640.535

LingLL,SchneiderT,PeoplesAJ,SpoeringAL,EngelsI,ConlonBP,etal.(2015).A536newantibiotickillspathogenswithoutdetectableresistance.Nature517:455–459.537

LongA,HeitmanJ,TobiasC,PhilipsR,SongB.(2012).Co-OccurringAnammox,538Denitrification,andCodenitrificationinAgriculturalSoils.Appliedand539EnvironmentalMicrobiology79:168–176.540

MackelprangR,WaldropMP,DeAngelisKM,DavidMM,ChavarriaKL,BlazewiczSJ,541etal.(2011).Metagenomicanalysisofapermafrostmicrobialcommunityrevealsa542rapidresponsetothaw.Nature480:368–371.543

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 14: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

MartijnJ,SchulzF,Zaremba-NiedzwiedzkaK,ViklundJ,StepanauskasR,Andersson544SGE,etal.(2015).Single-cellgenomicsofarareenvironmental545alphaproteobacteriumprovidesuniqueinsightsintoRickettsiaceaeevolution.ISMEJ5469:2373–2385.547

Martinez-GarciaM,BrazelDM,SwanBK,ArnostiC,ChainPSG,ReitengaKG,etal.548(2012).CapturingSingleCellGenomesofActivePolysaccharideDegraders:An549UnexpectedContributionofVerrucomicrobiaRavelJ(ed).PLoSONE7:e35314.550

Martinez-GarciaM,SwanBK,PoultonNJ,GomezML,MaslandD,SierackiME,etal.551(2011).High-throughputsingle-cellsequencingidentifiesphotoheterotrophsand552chemoautotrophsinfreshwaterbacterioplankton.ISMEJ6:113–123.553

NawrockiEP,EddySR.(2013).Infernal1.1:100-foldfasterRNAhomologysearches.554Bioinformatics29:2933–2935.555

NunesdaRochaU,VanOverbeekL,VanElsasJD.(2009).Explorationofhitherto-556unculturedbacteriafromtherhizosphere.FEMSMicrobiologyEcology69:313–328.557

PhilippotL,HallinS,SchloterM.(2007).EcologyofDenitrifyingProkaryotesin558AgriculturalSoil.In:AdvancesinAgronomyVol.96.Elsevier,pp249–305.559

PriceMN,DehalPS,ArkinAP.(2010).FastTree2–ApproximatelyMaximum-560LikelihoodTreesforLargeAlignmentsPoonAFY(ed).PLoSONE5:e9490.561

ReichP.Globalsoilsuborders.562http://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/use/?cid=nrcs142p2_054563013.564

RideoutJR,HeY,Navas-MolinaJA,WaltersWA,UrsellLK,GibbonsSM,etal.(2014).565Subsampledopen-referenceclusteringcreatesconsistent,comprehensiveOTU566definitionsandscalestobillionsofsequences.PeerJ2:e545–25.567

RinkeC,SchwientekP,SczyrbaA,IvanovaNN,AndersonIJ,ChengJ-F,etal.(2013).568Insightsintothephylogenyandcodingpotentialofmicrobialdarkmatter.Nature569499:431–437.570

RupnikM,WilcoxMH,GerdingDN.(2009).Clostridiumdifficileinfection:new571developmentsinepidemiologyandpathogenesis.NatRevMicro7:526–536.572

SchoenbornL,YatesPS,GrintonBE,HugenholtzP,JanssenPH.(2004).LiquidSerial573DilutionIsInferiortoSolidMediaforIsolationofCulturesRepresentativeofthe574Phylum-LevelDiversityofSoilBacteria.AppliedandEnvironmentalMicrobiology70:5754363–4366.576

SegataN,WaldronL,BallariniA,NarasimhanV,JoussonO,HuttenhowerC.(2012).577Metagenomicmicrobialcommunityprofilingusinguniqueclade-specificmarker578

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 15: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

genes.NatMeth9:811–814.579

SoilSurvayStaff1999.(1999).SoilTaxonomy:Abasicsystemofsoilclassification580formakingandinterpretingsoilsurveys.2nded.U.S.DepartmentofAgriculture581Handbook436:NaturalResourcesConservationService.582

StepanauskasR.(2012).Singlecellgenomics:anindividuallookatmicrobes.583CurrentOpinioninMicrobiology15:613–620.584

StepanauskasR,FergussonE,BrownJ,PoultonN,TupperB,LabontéJM,etal.585Improvedwholegenomeamplificationandtheintegratedstudyofgenomicand586opticalpropertiesofindividualcellsandviruses.Manuscriptinpreparation.587

SwanBK,Martinez-GarciaM,PrestonCM,SczyrbaA,WoykeT,LamyD,etal.(2011).588PotentialforChemolithoautotrophyAmongUbiquitousBacteriaLineagesinthe589DarkOcean.Science333:1296–1300.590

SwanBK,TupperB,SczyrbaA,LauroFM,Martinez-GarciaM,GonzalezJM,etal.591(2013).Prevalentgenomestreamliningandlatitudinaldivergenceofplanktonic592bacteriainthesurfaceocean.PNatlAcadSciUSA110:11463–11468.593

TatusovaT,CiufoS,FedorovB,O'NeillK,TolstoyI.(2013).RefSeqmicrobial594genomesdatabase:newrepresentationandannotationstrategy.NucleicAcidsRes59542:D553–D559.596

VanHTPham,KimJ.(2012).Cultivationofunculturablesoilbacteria.Trendsin597Biotechnology30:475–484.598

VartoukianSR,PalmerRM,WadeWG.(2010).Strategiesforcultureof‘unculturable’599bacteria.FEMSMicrobiologyLetters309:1–7.600

WangQ,GarrityGM,TiedjeJM,ColeJR.(2007).NaiveBayesianClassifierforRapid601AssignmentofrRNASequencesintotheNewBacterialTaxonomy.Appliedand602EnvironmentalMicrobiology73:5261–5267.603

WardNL,ChallacombeJF,JanssenPH,HenrissatB,CoutinhoPM,WuM,etal.604(2009).ThreeGenomesfromthePhylumAcidobacteriaProvideInsightintothe605LifestylesofTheseMicroorganismsinSoils.AppliedandEnvironmentalMicrobiology60675:2046–2056.607

WuD,HugenholtzP,MavromatisK,PukallR,DalinE,IvanovaNN,etal.(2009).A608phylogeny-drivengenomicencyclopaediaofBacteriaandArchaea.Nature462:6091056–1060.610

XueK,WuL,DengY,HeZ,VanNostrandJ,RobertsonPG,etal.(2013).Functional611GeneDifferencesinSoilMicrobialCommunitiesfromConventional,Low-Input,and612OrganicFarmlands.AppliedandEnvironmentalMicrobiology79:1284–1292.613

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 16: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

FigureLegends614

Figure1.PhylogenetictreeofRefSoil615Phylogenetictreeofaligned16SrRNAgenesoriginatingfromRefSoilbacterial616genomes.A:Acidobacteria,B:Actinobacteria,C:Aquificae,D:Armatimonadetes,E:617Bacteroidetes,F:Chlamydiae,G:Chlorobi,H:Chloroflexi,J:Cyanobacteria,K:618Deferribacteres,L:Deinococcus-Thermus,N:Firmicutes,O:Fusobacteria,P:619Gemmatimonadetes,Q:Nitrospirae,R:Planctomycetes,S:Proteobacteria,T:620Spirochaetes,U:Synergistetes,V:Tenericutes,W:Thermotogae,X:Verrucomicrobia.621622Figure2.Functionalanalysis623ThedistributionofphylogeneticoriginsassociatedwithRefSoilfunctional624subsystemsasannotatedbyRAST(panelA);theoverallphylogeneticdistributionof625allgenesinRefSoil(panelB).626627Figure3.PhylogenetictreeofEMPOTUsclusteredbytaxonomy.628RingI(green)representsthecumulativelog-scaledabundanceofOTUsinEMPsoil629samples.RingII(red)representsEMPOTUsthatsharegreaterthan97%gene630similarity(toRefSoil16SrRNAgenes;ringIII(blue)indicatesthatthese16SrRNA631genessharedsimilaritytosortedcellsthatwereselectedforsinglecellgenomics.A:632Acidobacteria,B:Actinobacteria,C:Aquificae,D:Armatimonadetes,E:Bacteroidetes,633F:Chlamydiae,G:Chlorobi,H:Chloroflexi,I:Crenarchaeota,K:Deferribacteres,L:634Deinococcus-Thermus,M:Euryarchaeota,N:Firmicutes,O:Fusobacteria,P:635Gemmatimonadetes,Q:Nitrospirae,R:Planctomycetes,S:Proteobacteria,T:636Spirochaetes,U:Synergistetes,V:Tenericutes,W:Thermotogae,X:Verrucomicrobia,637Y:Cyanobacteria/Chloroplast.638639TableLegends640Table1.RefSoil’smostwantedOTUs641RefSoil’smostwantedOTUsbasedonobservedfrequencyandabundanceinEMP642soilsamples.TaxonomyforOTUsareassignedbyRDPclassifier(Wangetal.,2007).643*:(Rideoutetal.,2014).644645Table2.Single-cellamplifiedgenomes646Taxonomicclassificationofsingle-cellamplifiedgenomesandtheabundanceofthe647mostsimilarEMPOTU.*:(Rideoutetal.,2014).648649SupplementaryDescription650SupplementaryFigure1.Theabundancedistributionofphyla651Theabundancedistributionofphyla(log(abundance)+1)presentintheRefSoil652databasecomparedtoNCBI’sRefSeq(A)andphylathatareenrichedinRefSoil653comparedtoRefSeq(B).654655SupplementaryFigure2.Averagerelativeabundanceforvarioussoilorders656AveragerelativeabundanceofEMPOTUs(sharingsimilaritywithRefSoilgenes)for657varioussoilordersasclassifiedbyNRCSSoilTaxonomy.658

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 17: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

659660SupplementaryTable1.TheRefSoilDatabase661662SupplementaryTable2.RefSoilgenesinRASTsubsystemfunctions663TotalnumberofRefSoilgenesobservedinRASTsubsystemfunctions.664665SupplementaryTable3.SubsystemlevelfunctionsinRefSoil666Thenumberofuniquephylaandthenumberofenrichedphylaassociatedwith667encodedsubsystemlevelfunctionsinRefSoil.(Bolditalicized:enrichedwithgenes668associatedwiththreeorlessphyla)669670SupplementaryTable4.AbundanceofEMPOTUsinsoilorder671ThetotalabundanceofEMPOTUsbysoilorderinsoilsamplesoriginatinginthe672UnitedStates.673674SupplementaryTable5.100mostabundantOTUs675100mostabundantOTUsinEMPsoilsamples676677SupplementaryTable6.100mostfrequentOTUs678100mostfrequentOTUsinEMPsoilsamples679680SupplementaryTable7.Singlecellgenomicproperties681 682

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 18: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

Table1.RefSoil’smostwantedOTUs683RefSoil’smostwantedOTUsbasedonobservedfrequencyandabundanceinEMP684soilsamples.TaxonomyforOTUsareassignedbyRDPclassifier(Wangetal.,2007).685686

687688689

Phylum Class

4457032 Verrucomicrobia Spartobacteria 1 8,007,453 2,312

4471583 Verrucomicrobia Spartobacteria 0.92 2,937,242 1,935

101868 Verrucomicrobia Spartobacteria 1 1,828,706 2,151

559213 Firmicutes Bacilli 1 1,606,757 2,410

1105039 Verrucomicrobia Spartobacteria 1 1,295,847 2,102

807954 Bacteroidetes Sphingobacteriia 0.98 875,988 2,546

4342972 Verrucomicrobia Spartobacteria 0.97 750,557 2,386

4423681 Gemmatimonadetes Gemmatimonadetes 0.25 689,209 1,954

1109646 Verrucomicrobia Spartobacteria 1 554,748 2,012

610188 Acidobacteria Acidobacteria_Gp6 0.99 553,476 2,694

4373456 Acidobacteria Acidobacteria_Gp4 1 397,255 2,150

4341176 Verrucomicrobia Subdivision3 0.99 383,900 2,098

720217 Proteobacteria Deltaproteobacteria 0.74 345,621 2,073

4314933 Acidobacteria Acidobacteria_Gp1 0.8 333,428 1,974

205391 Acidobacteria Acidobacteria_Gp3 0.87 327,162 2,190

4378940 Acidobacteria Acidobacteria_Gp6 1 310,421 2,499

946250 Verrucomicrobia Spartobacteria 1 300,544 2,386

3122801 Acidobacteria Acidobacteria_Gp1 0.97 273,493 2,107

4450676 Proteobacteria Alphaproteobacteria 1 255,424 1,963

206514 Proteobacteria Alphaproteobacteria 0.92 207,460 2,075

4463040 Firmicutes Clostridia 0.94 204,483 2,096

ClosestMatchinRDPClassiferOTUIDin

(Rideoutet

al.,2014)

RDPClassifier

SimilarityScore

Abundance

(total

amplicons)

Number

of

Samples

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 19: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

Table2.Single-cellamplifiedgenomes690Taxonomicclassificationofsingle-cellamplifiedgenomesandtheabundanceofthe691mostsimilarEMPOTU.692693

694695696

NCBIID Phylum Class EMPIDin(Rideoutetal.,2014) AbundanceRelative

Abundance(%)

LSTF00000000 Proteobacteria β-proteobacteria New.52.CleanUp.ReferenceOTU25474 81,630 2.13E-02

LSTI00000000 Actinobacteria Thermoleophilia 551344 35,500 9.27E-03

LSTA00000000 Verrucomicrobia Spartobacteria New.54.CleanUp.ReferenceOTU60738 5,076 1.33E-03

LSTC00000000 Nitrospirae Nitrospira New.22.CleanUp.ReferenceOTU3952 4,337 1.13E-03

LSTH00000000 Planctomycetes Planctomycetia 521158 4,161 1.09E-03

LSSZ00000000 Planctomycetes Planctomycetia New.5.CleanUp.ReferenceOTU188938 2,434 6.36E-04

LSTE00000000 Proteobacteria γ-proteobacteria NA NA NA

LSTB00000000 Proteobacteria α-proteobacteria New.52.CleanUp.ReferenceOTU4588 309 8.07E-05

LSSY00000000 Actinobacteria Thermoleophilia 904196 103 2.69E-05

LSTG00000000 Verrucomicrobia Opitutae New.5.CleanUp.ReferenceOTU291124 69 1.80E-05

LSTD00000000 Planctomycetes Planctomycetia New.59.CleanUp.ReferenceOTU1223798 44 1.15E-05

LSTJ00000000 Verrucomicrobia Verrucomicrobiae New.7.CleanUp.ReferenceOTU84651 11 2.87E-06

LSTK00000000 Acidobacteria Acidobacteriia New.59.CleanUp.ReferenceOTU999655 4 1.04E-06

LSSX00000000 Chloroflexi TK17 New.59.CleanUp.ReferenceOTU481621 2 5.22E-07

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 20: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

697698Figure1.PhylogenetictreeofRefSoil699Phylogenetictreeofaligned16SrRNAgenesoriginatingfromRefSoilbacterial700genomes.A:Acidobacteria,B:Actinobacteria,C:Aquificae,D:Armatimonadetes,E:701Bacteroidetes,F:Chlamydiae,G:Chlorobi,H:Chloroflexi,J:Cyanobacteria,K:702Deferribacteres,L:Deinococcus-Thermus,N:Firmicutes,O:Fusobacteria,P:703Gemmatimonadetes,Q:Nitrospirae,R:Planctomycetes,S:Proteobacteria,T:704Spirochaetes,U:Synergistetes,V:Tenericutes,W:Thermotogae,X:Verrucomicrobia.705706

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 21: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

707Figure2.Functionalanalysis708ThedistributionofphylogeneticoriginsassociatedwithRefSoilfunctional709subsystemsasannotatedbyRAST(panelA);theoverallphylogeneticdistributionof710allgenesinRefSoil(panelB).711712713714715716717

A B

0

25

50

75

100

Mem

bran

e Tr

ansp

ort

Mot

ility

and

Che

mot

axis

Met

abol

ism

of A

rom

atic

Com

poun

dsR

espi

ratio

nSu

lfur M

etab

olis

mSt

ress

Res

pons

eIro

n ac

quis

ition

and

met

abol

ism

Pota

ssiu

m m

etab

olis

mR

egul

atio

n an

d C

ell s

igna

ling

Viru

lenc

e, D

isea

se a

nd D

efen

sePh

ages

, Pro

phag

es, T

rans

posa

ble

elem

ents

, Pla

smid

sN

itrog

en M

etab

olis

mC

ell W

all a

nd C

apsu

leSe

cond

ary

Met

abol

ism

Amin

o Ac

ids

and

Der

ivativ

esR

NA

Met

abol

ism

Phos

phor

us M

etab

olis

mC

ofac

tors

, Vita

min

s, P

rost

hetic

Gro

ups,

Pig

men

tsPr

otei

n M

etab

olis

mC

arbo

hydr

ates

Fatty

Aci

ds, L

ipid

s, a

nd Is

opre

noid

sD

NA

Met

abol

ism

Nuc

leos

ides

and

Nuc

leot

ides

Cel

l Div

isio

n an

d C

ell C

ycle

Phot

osyn

thes

isD

orm

ancy

and

Spo

rula

tion

Ref

Soil

Perc

ent C

ontr

ibut

ion

(%)

PhylumFusobacteriaSynergistetesTenericutesDeferribacteresGemmatimonadetesChlamydiaePlanctomycetesArmatimonadetesVerrucomicrobiaAquificaeNitrospiraeThermotogaeChlorobiDeinococcus−ThermusCrenarchaeotaChloroflexiAcidobacteriaSpirochaetesEuryarchaeotaBacteroidetesCyanobacteriaActinobacteriaFirmicutesProteobacteria

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint

Page 22: RefSoil: A reference database of soil microbial genomes · 5/14/2016  · 1 RefSoil: A reference database of soil microbial genomes 2 Running title: A reference database of soil microbial

718Figure3.PhylogenetictreeofEMPOTUsclusteredbytaxonomy.719RingI(green)representsthecumulativelog-scaledabundanceofOTUsinEMPsoil720samples.RingII(red)representsEMPOTUsthatsharegreaterthan97%gene721similarity(toRefSoil16SrRNAgenes;ringIII(blue)indicatesthatthese16SrRNA722genessharedsimilaritytosortedcellsthatwereselectedforsinglecellgenomics.A:723Acidobacteria,B:Actinobacteria,C:Aquificae,D:Armatimonadetes,E:Bacteroidetes,724F:Chlamydiae,G:Chlorobi,H:Chloroflexi,I:Crenarchaeota,K:Deferribacteres,L:725Deinococcus-Thermus,M:Euryarchaeota,N:Firmicutes,O:Fusobacteria,P:726Gemmatimonadetes,Q:Nitrospirae,R:Planctomycetes,S:Proteobacteria,T:727Spirochaetes,U:Synergistetes,V:Tenericutes,W:Thermotogae,X:Verrucomicrobia,728Y:Cyanobacteria/Chloroplast.729730731732733734

.CC-BY 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted May 14, 2016. . https://doi.org/10.1101/053397doi: bioRxiv preprint