
Research

© The Authors (2008). Journal compilation © New Phytologist (2008). New Phytologist (2009) 181: 471–477. www.newphytologist.org

Blackwell Publishing Ltd

An outlook on the fungal internal transcribed spacer sequences in GenBank and the introduction of a web-based tool for the exploration of fungal diversity

Martin Ryberg¹, Erik Kristiansson², Elisabet Sjökvist¹ and R. Henrik Nilsson¹

¹Department of Plant and Environmental Sciences, University of Gothenburg, PO Box 461, 405 30 Göteborg, Sweden; ²Department of Zoology, University of Gothenburg, PO Box 463, 405 30 Göteborg, Sweden

Summary

• The environmental and distributional data associated with fungal internal transcribed spacer (ITS) sequences in GenBank are investigated and a new web-based tool with which these sequences can be explored is introduced.
• All fungal ITS sequences in GenBank were classified as either identified to species level or insufficiently identified and compared using BLAST. The results are made available as a biweekly updated web service that can be queried to retrieve all insufficiently identified sequences (IIS) associated with any fungal genus.
• The most commonly available annotation items in GenBank are isolation source (55%); country of origin (50%); and specific host (38%). The molecular sampling of fungi shows a bias towards North America, Europe, China, and Japan whereas vast geographical areas remain effectively unexplored. Mycorrhizal and parasitic genera are on average associated with more IIS than are saprophytic taxa. Glomus, Alternaria, and Tomentella are the genera represented by the highest number of insufficiently identified ITS sequences in GenBank.
• The web service presented (http://andromeda.botany.gu.se/emerencia.html#genus_search) offers new means, particularly for mycorrhizal and plant pathogenic fungi, to examine the IIS in GenBank in a taxon-oriented framework and to explore their metadata in an easily accessible and time-efficient manner.

Author for correspondence: Martin Ryberg
Tel: +46 31 786 48 07
Fax: +46 31 786 25 60
Email: [email protected]

Received: 28 June 2008
Accepted: 11 September 2008

New Phytologist (2009) 181: 471–477 doi: 10.1111/j.1469-8137.2008.02667.x

Key words: environmental samples, fungal distribution, fungal ecology, fungi, metadata analysis, mycorrhiza, sequence databases.

Introduction

Fungi are a large and diverse group of organisms that serve many essential ecological functions, such as wood and litter decomposition, mycorrhizal symbiosis, and parasitism. Many aspects of the lives of fungi are difficult to study, however, as most species are only observable when they form conspicuous fruiting bodies. By contrast, the main part of the fungal life cycle takes place as a somatic mycelium inside, or otherwise associated with, the substrate inhabited. The possibilities of using molecular methods have therefore facilitated a deeper understanding of fungal ecology (e.g. Crozier et al., 2006; Taylor & McCormick, 2008). This is particularly true for mycorrhizal communities (Horton & Bruns, 2001). As sporulation structures such as fruiting bodies and conidiophores can be linked to species descriptions, it is possible to infer the identity of sequences from environmental samples by correlating them to sequences from specimens of known identity. In mycology, sequences from the internal transcribed spacer (ITS) region of the nuclear ribosomal DNA are commonly used for the identification of fungi (Kõljalg et al., 2005; Naumann et al., 2007; Nilsson et al., 2008). However, although this is one of the most frequently sequenced regions, ITS sequences from well-identified fruiting bodies are estimated to be available for < 1% of the hypothesized number of fungal species (Nilsson et al., 2005). There are thus many sequences from environmental samples whose species affiliation remains unknown in that they cannot be satisfactorily matched to a sequence of known taxonomic identity (cf. Horton et al., 2005; Bastias et al., 2006; Kjøller, 2006).

The International Nucleotide Sequence Database (INSD: GenBank, European Molecular Biology Laboratory (EMBL), and DNA Database of Japan (DDBJ); Benson et al., 2008) is the major open repository for sequence data. As a part of the documentation of a scientific study, most international journals require that all sequences used in a manuscript be made available through such public databases. Consequently, as a part of the publication process of environmental studies, many sequences that are not identified to the species level (i.e. that are insufficiently identified) are submitted to the INSD. Although these sequences are far from rigorously and homogeneously annotated, their metadata can provide valuable information on the ecology and distribution of a group of organisms or contribute important data in a phylogenetic context (Weiss et al., 2004; Porter et al., 2008; Ryberg et al., 2008).

Here we investigate the taxonomic distribution of all insufficiently identified fungal ITS sequences in INSD (referred to as ‘insufficiently identified sequences’ (IIS)). We also compare the IIS with fully identified fungal ITS sequences (referred to as ‘fully identified sequences’ (FIS)) and examine the classes of metadata that are available for both types of sequence. Finally, we introduce a web service that enables searches for IIS associated with any user-specified fungal genus. This offers new possibilities to explore the IIS in a taxon-oriented way and to synthesize pertinent information on taxonomy, ecology, and distribution of the focal genus for subsequent use.

Materials and Methods

The software package emerencia (Nilsson et al., 2005; http://andromeda.botany.gu.se/emerencia.html) was used to download all fungal ITS sequences from GenBank and to separate fully identified sequences (i.e. the FIS) from those without a full species name (i.e. the IIS). The sequences were stored in two separate tables in a MySQL 5.16.3 database (http://www.mysql.com). In addition, the variable subregion ITS2 was detected and extracted using the hidden Markov models of Nilsson et al. (2008) and was also stored in two separate tables. BLAST 2.2.9 (Altschul et al., 1997) was used to find the closest FIS for all IIS. A Perl script (Supporting Information Text S1) was implemented to search the database for the species among the identified sequences that form the best BLAST match of at least one IIS for the full sequence data. The results were used to calculate how many IIS are associated with each genus. For the genera associated with more than 100 IIS, the number of fully identified sequences and the number of species they belong to according to their taxonomic annotation were obtained from the database. The total number of species for these genera was obtained from Kirk et al. (2001) as this is the most comprehensive recent work including such estimates. In addition, a function to perform searches to determine which IIS are associated with any user-specified genus was constructed (Supporting Information Text S2). To evaluate the proportion of IIS that are assigned by BLAST to an incorrect genus, 100 IIS were randomly selected and aligned with their 10 closest BLAST matches using Clustal W 1.83 (Thompson et al., 1994). These alignments were then investigated to determine if the assigned generic affiliations were probable.
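The separation and counting steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perl script of Text S1: the rule used by emerencia to decide whether a name is a full species name is not given here, so the token list in `_UNCERTAIN` is a guess, and the function names and example accessions are hypothetical.

```python
from collections import Counter

# Tokens taken to mark a name as NOT identified to species level.
# This list is an assumption of the sketch, not emerencia's actual rule.
_UNCERTAIN = {"sp.", "sp", "cf.", "aff.", "uncultured", "unidentified", "fungal"}


def is_fully_identified(organism):
    """Return True if the organism name looks like a full species name."""
    words = organism.split()
    if len(words) < 2:
        return False
    return not any(word.lower() in _UNCERTAIN for word in words)


def count_iis_per_genus(best_hits):
    """Count IIS per genus from (iis_accession, best_match_species) pairs.

    The genus is taken as the first word of the species name of the
    best-matching fully identified sequence.
    """
    counts = Counter()
    for _iis_accession, fis_species in best_hits:
        counts[fis_species.split()[0]] += 1
    return counts


# Hypothetical best-hit pairs; the accessions are made up for illustration.
best_hits = [
    ("IIS_0001", "Glomus intraradices"),
    ("IIS_0002", "Glomus mosseae"),
    ("IIS_0003", "Tomentella sublilacina"),
]
iis_per_genus = count_iis_per_genus(best_hits)
```

In the study itself the equivalent counts were derived from BLAST results stored in the database; the sketch only shows the bookkeeping once each IIS has been paired with its best-matching FIS.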

To give the user of the web service an overview of the present state of ITS sampling of fungi, the ITS sequences were compared for taxonomic affiliation, identification status, and ecological roles. Furthermore, information on the classes of metadata that come bundled with the FIS and the IIS, respectively, was obtained through parsing all entries in Perl for information in the Features field of the GenBank annotation. The software package of Nilsson et al. (2006) was used to investigate the number of fully/insufficiently identified sequences submitted to INSD over time and the proportion of IIS that originate from unpublished studies.
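The metadata survey amounts to checking, for each entry, which qualifiers are present in its Features field. A minimal Python sketch of that idea follows; the original parsing was done in Perl and is not reproduced here. The example records are invented, the function names are hypothetical, and the qualifier names (`country`, `host`, `isolation_source`, `lat_lon`) are assumed to follow the GenBank flat-file `/key="value"` style.

```python
import re

# Matches /key="value" qualifiers; values may wrap over several lines.
_QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_qualifiers(features_text):
    """Return {qualifier: value} for a Features block, with whitespace collapsed."""
    return {key: " ".join(value.split())
            for key, value in _QUALIFIER.findall(features_text)}


def qualifier_coverage(records, keys):
    """Percentage of records that carry each qualifier in `keys`."""
    if not records:
        return {key: 0.0 for key in keys}
    parsed = [parse_qualifiers(record) for record in records]
    return {key: 100.0 * sum(key in p for p in parsed) / len(parsed)
            for key in keys}


# Two invented Features blocks, for illustration only.
records = [
    '/organism="uncultured fungus"\n/isolation_source="root tip"\n'
    '/country="Sweden: Gothenburg"',
    '/organism="uncultured Tomentella"\n/host="Picea abies"',
]
coverage = qualifier_coverage(
    records, ["country", "host", "isolation_source", "lat_lon"])

# A country annotation may carry a more precise locality after a colon.
country, _, region = parse_qualifiers(records[0])["country"].partition(":")
region = region.strip()
```

Counting presence rather than inspecting values keeps the sketch simple, but it also means that loosely applied or uninformative annotations are counted as present.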

Results

The genus search

A new search function has been added to the emerencia web service and is accessible through http://andromeda.botany.gu.se/emerencia.html#genus_search. By providing a genus name, the user can retrieve a list of all insufficiently identified INSD sequences found to be associated with the genus through BLAST searches. The user is given the choice to base the search on the entire ITS region (default) or only the more variable sublocus ITS2 to lessen the impact of the very conserved 5.8S and Large Subunit (LSU) regions. The ITS2 option does, however, mean a loss of c. 6% of the available sequences (i.e. sequences in which the ITS2 could not be detected for whatever reason) and the default mode is what is used in this paper. The IIS are output together with their best BLAST match and are extensively linked to more detailed presentations of each entry and to tools and resources to assist in the quality control of the data (Supporting Information Text S3). The output is also summarized in three tables. The first table presents the source literature of the IIS and details how many sequences are associated with each reference (including links to the study in Google Scholar; http://scholar.google.com). The second table lists the source literature of the identified sequences that constitute the best BLAST matches of the IIS. Finally, the third table specifies the individual species (in the genus) that form the best BLAST match of at least one IIS, the individual sequences that constitute these best BLAST matches, and the number of IIS that are associated with each species. It is also possible to have the IIS output in the FASTA format (Pearson & Lipman, 1988) for incorporation into, for example, alignment programs, phylogenetic analyses, or additional quality control steps. Sequences inadvertently submitted to the National Center for Biotechnology Information (NCBI) as reverse complements are detected and displayed correctly in the web service as long as they feature the last third of the 5.8S.
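For readers who post-process the exported sequences, re-orienting a reverse-complemented entry and writing FASTA output can be sketched as below. This is a generic Python illustration, not code from the web service: the detection of orientation (which the service bases on the 5.8S region) is not shown, and the function names, header and sequence are hypothetical.

```python
# Complement table for unambiguous nucleotides plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        lines.extend(sequence[i:i + width]
                     for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Hypothetical entry that was found to be stored in reverse orientation.
flipped = reverse_complement("ATGCN")
fasta_text = to_fasta([("example_IIS", flipped)], width=4)
```

Ambiguity codes other than N are left unchanged by this table, which is a simplification a production tool would need to address.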

A closer look at the fungal ITS sequences

There were 50 956 (65%) fully identified and 27 364 (35%) insufficiently identified ITS sequences in GenBank as of February 2008. The proportion of IIS among the fungal ITS sequences being submitted to INSD each year increases steadily such that their deposition rate now parallels that of FIS (Fig. 1). The IIS were found to be associated with 1148 different fungal genera, which represents 57% of the 2014 fungal genera present in the FIS data set of INSD. A total of 260 (23%) of the 1148 genera had only one IIS associated with them while 391 (34%) genera had ≥ 10, and 49 (4%) genera had > 100 associated sequences. Of the 49 genera with > 100 associated sequences, 15 (30%) are well-known mycorrhiza formers and 5 (10%) include mycorrhizal, or putatively mycorrhizal, species while 25 (51%) are known to include parasites (Table 1). Twenty-six (53%) of the 49 genera associated with > 100 sequences were found to belong to the Ascomycota, 21 (42%) to the Basidiomycota, one (2%) to the Glomeromycota, and one (2%) to the Zygomycota. The full list of genera is automatically updated and available through the emerencia web service (http://andromeda.botany.gu.se/genuslist.html). Of the 100 IIS that were investigated more thoroughly as a quality assessment measure, one was determined to belong to a genus other than the genus to which it had been assigned by BLAST and it was not possible to assign 13 to any genus with certainty. Of these 13, about half (54%) were classified as associated with various anamorphic Ascomycota genera.

The proportion of species that have been sequenced varies considerably among the 49 genera with > 100 associated IIS. The mycorrhizal genera with more than 10 species in total (as estimated by Kirk et al., 2001) have rather few species represented as FIS (15–54% of the species) while saprophytic and parasitic genera are generally better represented (2–150% of the estimated species represented as FIS). Indeed, many of the parasitic genera are represented by more species than they have been estimated to contain (Table 1).
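The proportion in question is simply the number of species represented as FIS divided by the estimated total number of species in the genus. As a worked illustration in Python, using three rows of Table 1 (the variable names are hypothetical; the figures are those given in the table):

```python
# (species represented as FIS, estimated total number of species), Table 1.
table1_rows = {
    "Glomus": (30, 85),
    "Cortinarius": (440, 2000),
    "Fusarium": (75, 50),
}

# Percentage of the estimated species that are represented as FIS.
species_coverage = {
    genus: round(100.0 * sequenced / estimated, 1)
    for genus, (sequenced, estimated) in table1_rows.items()
}
```

This gives 35.3% for Glomus, 22.0% for Cortinarius and 150.0% for Fusarium; a value above 100% means that more species names occur in the INSD annotations than the genus has been estimated to contain.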

Considering geographical metadata, the co-ordinates of the collection locality were given for only 5% of the IIS. A more encouraging 50% of the entries were explicitly annotated with a country of origin and of these a full 51% had a more precise geographical annotation (state, province, or similar; Table 2). For the FIS, the corresponding values were even lower: only 0.5% of the entries had geographical co-ordinates, and a modest 37% had an explicit country annotation. Taken together, the sequences were found to originate from all continents; the IIS from a total of 102 different countries and the FIS from 158 countries (Supporting Information Text S3). Despite this broad geographical sampling there was a clear overrepresentation of sequences from North America, Europe, China, and Japan while other regions have been less well sampled (Fig. 2). A specific host was given for 38% of the IIS and 18% of the FIS (Table 2; see Supporting Information Text S3 for a more detailed list) and isolation source was specified for 55% and 10%, respectively; both fields are good sources of information on host species for the fungi. The isolation source field can also provide more specific information on the substrate of the fungi. Additionally, 47% of the IIS and 65% of the FIS had auxiliary annotations in the Note field.

Although not examined here, the most important source of information for any entry is probably the publication from which the sequence originates. However, more than half (58%) of the IIS were marked as unpublished, but the real number is probably considerably lower as many sequence authors neglect, or take a very long time, to update the sequence annotations in INSD when their work is published (Nilsson et al., 2006).

Discussion

The increasing use of sequence-based methods for identification of fungi has opened new windows to the scientific pursuit of ecology. The knowledge gained from such data has, for example, changed our view of the ectomycorrhizal community in showing that it is much more species-rich and site-specific than previously thought (Horton & Bruns, 2001). It has moreover provided new insights into the ecology of many different groups of fungi where mycorrhizal associations are common, such as the Pezizales, Sebacinales, and Agaricales (Tedersoo et al., 2006; Selosse et al., 2007; Ryberg et al., 2008).

The present analysis of the IIS in INSD reveals strong biases in terms of both geography and taxon sampling. There is a preponderance of studies on mycorrhizal fungi and of studies on important parasites, while decay fungi have been less often recovered in environmental studies. This particular bias is further demonstrated by the fact that the overwhelming majority of the IIS annotated with a specific host were associated with plants (Table 2). We also found a geographical bias towards North America, Western Europe, and China and Japan and away from Africa and parts of Asia and South America (Fig. 2), an observation that also holds true for the FIS. This bias away from tropical areas is most unfortunate as they are predicted to be particularly rich in undescribed species (Hawksworth, 2004). Although falling short of giving a full description of diversity, sequencing of environmental samples might be an efficient way to obtain an overview of the richness of these areas (cf. Hebert et al., 2003). The lack, or low number, of FIS from these areas reduces the chances of integrating such environmental sequences into a more informative taxonomic framework.

Fig. 1 The number of fungal internal transcribed spacer (ITS) sequences submitted to (or modified in) the International Nucleotide Sequence Database (INSD) each year divided into fully identified and insufficiently identified sequences with respect to taxonomic affiliation. Fully identified sequences are represented as light grey and insufficiently identified sequences as dark grey.

Table 1 The 49 fungal genera that have more than 100 insufficiently identified sequences (IIS) associated with them

IIS¹  FIS (species)²  Genus (phylum)³      Estimated species⁴  Nutritional mode⁵
1279  426 (30)        Glomus (G)           85                  M
946   409 (60)        Alternaria (A)       50                  P/S
762   79 (28)         Tomentella (B)       75                  M
724   1195 (440)      Cortinarius (B)      2000                M
582   65 (11)         Sclerotinia (A)      8                   P
564   1439 (75)       Fusarium (A)         50                  S/P
530   306 (145)       Russula (B)          750                 M
498   671 (2)         Thanatephorus (B)    11                  P
422   269 (43)        Cladosporium (A)     60                  S/P
402   718 (49)        Trichoderma (A)      34                  Sym/P
392   28 (11)         Tulasnella (B)       46                  S/M/Sym
379   630 (78)        Cryptococcus (B)     37                  S/P
348   20 (9)          Sebacina (B)         6                   M
348   59 (5)          Piloderma (B)        6                   M
346   246 (64)        Rhizopogon (B)       150                 M
326   198 (24)        Phoma (A)            40                  P
311   457 (129)       Lactarius (B)        400                 M
303   270 (110)       Inocybe (B)          500                 M
295   21 (9)          Ceratobasidium (B)   11                  P/S/M?
291   121 (14)        Lophodermium (A)     103                 P/S/Sym?
283   867 (72)        Hypocrea (A)         100                 S/P?
278   35 (9)          Mortierella (Z)      90                  S/P
277   909 (163)       Penicillium (A)      223                 S
230   67 (1)          Cenococcum (A)       1                   M
226   331 (63)        Tricholoma (B)       200                 M
222   945 (126)       Candida (A)          165                 S/Sym/P
201   40 (2)          Epicoccum (A)        2                   S/Sym?
195   110 (9)         Phialocephala (A)    20                  S/M/P
182   67 (7)          Cadophora (A)        −                   P/S/Sym
181   429 (64)        Puccinia (B)         4000                P
172   154 (18)        Diaporthe (A)        75                  P/S
160   84 (2)          Aureobasidium (A)    7                   P/Sym
156   80 (38)         Phaeosphaeria (A)    45                  P/S
155   149 (32)        Phomopsis (A)        100                 P
152   846 (115)       Mycosphaerella (A)   500                 P
142   21 (3)          Leptodontidium (A)   10                  P/Sym
140   852 (37)        Tuber (A)            63                  M
128   227 (33)        Ganoderma (B)        50                  S/P
128   185 (9)         Gibberella (A)       10                  P/Sym?
128   7 (3)           Craterellus (B)      20                  M
127   16 (2)          Tylospora (B)        2                   M
125   164 (14)        Nectria (A)          28                  P
123   119 (33)        Leucoagaricus (B)    75                  S
122   4 (1)           Amphinema (B)        4                   M
121   38 (2)          Meliniomyces (A)     −                   M?
116   266 (50)        Rhodotorula (B)      34                  S/Sym?
113   281 (63)        Pestalotiopsis (A)   50                  P/S/Sym
109   70 (30)         Xylaria (A)          100                 S/Sym
104   2 (2)           Tremellodendron (B)  8                   M/Sym

¹Number of IIS associated with the genus through BLAST.
²Number of fully identified sequences; the number of species these sequences represent (according to the International Nucleotide Sequence Database (INSD) annotation) is given in parentheses.
³Genus name; phylum affiliation given in parentheses as: A, Ascomycota; B, Basidiomycota; G, Glomeromycota; Z, Zygomycota.
⁴Estimated total number of species in the genus according to Kirk et al. (2001).
⁵Main nutritional mode: M, mycorrhizal; P, parasitic; S, saprophytic; Sym, symbiotic (other than mycorrhizal). Uncertainty in the nutritional mode is indicated by ?.

There is furthermore a bias in how well represented different genera are, as seen by the proportion of their species for which FIS are available. Several genera were represented by more species in GenBank than they have been estimated to contain. This could partly reflect rapid taxonomic development in these genera over the last few years, which has rendered the estimates of Kirk et al. (2001) somewhat outdated. It could also be a result of taxonomic difficulties such as widespread use of synonymous names, particularly in relation to anamorphic stages, or use of names that are not formally published. The opposite problem, that most genera are poorly covered by DNA sequencing efforts, is, however, more substantial and more pressing. The fact that less than half of the species in most of the larger mycorrhizal genera have been sequenced emphasizes the urgent need for extended sequencing efforts targeted at these fungi. Unfortunately, such projects rarely seem to rank high in the priorities of the funding agencies. It would be preferable if such efforts were to involve type specimens or well annotated and easily accessible voucher specimens to avoid adding to the taxonomic complications in INSD (cf. Bidartondo et al., 2008). If the ambition is to be able to determine environmental sequences to species level, it seems especially important to focus on species in Glomus, Cortinarius, Tomentella, and Russula as these account for a large proportion of the IIS (Table 1). However, Nilsson et al. (2006) showed that as many as 6% of the fungal ITS sequences lack a satisfactory BLAST match altogether, which is likely to be a reflection of the complete absence of sequences in INSD for many genera. In order to increase taxonomic resolution in molecular environmental studies, such genera should be prime targets for sequencing efforts.

In this study, BLAST was used to infer the taxonomic affiliation of IIS. This represents a fast and efficient way to perform similarity searches for large data sets, but it does not give reliable results on taxonomic identity in all situations (Koski & Golding, 2001). The number of IIS per genus should therefore not be interpreted as the absolute abundance of each genus but rather as an estimation to be used for further investigations. Our evaluation of the method does, however, show that it is reasonably reliable. It is also well known that the taxonomic annotations of INSD are not always correct, which may negatively affect our ability to infer the taxonomic identity of the IIS. It has been shown that c. 5% of the IIS are best matched by a sequence with a questionable species annotation (Nilsson et al., 2006), but the proportion of sequences identified to the wrong genus is probably lower. The genus list of Table 1 is therefore a presentation of the data in INSD as seen through emerencia and represents a blend of the natural occurrence of the genera, methodological challenges, and the efforts of the mycological community.

In allowing the user to retrieve all IIS associated with any specified genus, the web service unites and automates two hitherto separate and arduous processes: (1) the iteration of BLAST over all fully identified sequences of the focal genus, and (2) the manual parsing of the BLAST results to pinpoint IIS with a relation to the genus. The sequences thus obtained are likely to represent constituent species of the focal genus, although this relation may have gone unnoticed before as a result of the poor state of the taxonomic annotation of the entries. The metadata associated with the entries are thus pertinent also to the genus itself, which opens up the possibility of synthesizing as yet incompletely explored information at the generic level. Areas where this data mining may have particular potential include: (1) nutritional mode(s) of the genus (where the IIS can be used as evidence to bind species of the focal genus to nutritional modes); (2) the taxonomic span of the genus (where the IIS fall inside the genus and yet do not produce a satisfactory match to any of its species, suggesting the presence of hitherto unsequenced taxa); and (3) the geographical distribution of the genus (where the IIS have been reported from locations extending beyond the known geographical range of the genus). These pursuits are further facilitated by the web service through the generation of summaries of the literature annotation for the entries in question and through the possibility of examining the underlying pairwise alignments. The entries may be exported for further sequence analysis and they are hyperlinked to GenBank, Google, and emerencia.

Table 2 Available metadata in the Features field for the insufficiently (IIS) and fully (FIS) identified fungal internal transcribed spacer (ITS) sequences in the International Nucleotide Sequence Database (INSD)

Type of metadata        IIS (%)   FIS (%)
Country                 50        37
Latitude/longitude      5         0.5
Note                    47        65
Specific host¹          38        18
  Vascular plants       35.3      15.7
  Nonvascular plants    0.4       0
  Fungi                 0.1       0.2
  Animals               2.2       2
  Other substrates      0.07      0.001
Isolation source²       55        10
Specimen voucher        7         21
Collected by            3         1
Identified by           0.4       1

The table shows the percentage of fully identified and insufficiently identified sequences, respectively, with an annotation for the different annotation fields.
¹Poorly applicable to generalists and often applied in a loose sense in INSD.
²Poorly applicable to unculturable fungi and often applied in a loose sense in INSD.

IIS form more than a third of the fungal ITS sequences in INSD, and their proportion is steadily increasing. As this study shows, the IIS are often better annotated with respect to environmental and geographical data than the FIS. This indicates that the IIS, if used in the proper taxonomic context, could contribute valuable data on the ecology, distribution, and taxonomy of fungi. The new search function of the emerencia web service presented here represents a means to retrieve such IIS associated with any user-specified genus in a format that makes this auxiliary information readily accessible to the scientific community.

Fig. 2 Maps illustrating the number of insufficiently identified (a) and fully identified (b) fungal internal transcribed spacer (ITS) sequences originating from each country according to their International Nucleotide Sequence Database (INSD) country annotation. Full specifications and complementary information are provided in Supporting Information Text S3.

Acknowledgements

Financial support for this study was received from the Royal Society of Arts and Sciences in Göteborg (MR) and the Carl Stenholm Foundation (RHN and MR). Fig. 2 was compiled in collaboration with Villa Geografica. We are also grateful for comments and suggestions by Tom Bruns and two anonymous reviewers on a previous draft of this paper.

References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25: 3389–3402.

Bastias BA, Xu Z, Cairney JWG. 2006. Influence of long-term repeated prescribed burning on mycelial communities of ectomycorrhizal fungi. New Phytologist 172: 149–158.

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. 2008. GenBank. Nucleic Acids Research 36: D25–D30.

Bidartondo MI, Bruns TD, Blackwell M, Edwards I, Taylor AFS, Horton T, Zhang N, Kõljalg U, May G, Kuyper TW et al. 2008. Preserving accuracy in GenBank. Science 319: 1616.

Crozier J, Thomas SE, Aime MC, Evans HC, Holmes KA. 2006. Molecular characterization of fungal endophytic morphospecies isolated from stems and pods of Theobroma cacao. Plant Pathology 55: 783–791.

Hawksworth DL. 2004. Fungal diversity and its implications for genetic resource collections. Studies in Mycology 50: 9–18.

Hebert PDN, Cywinska A, Ball SL, deWaard JR. 2003. Biological identifications through DNA barcodes. Proceedings of the Royal Society B 270: 313–321.

Horton TR, Bruns TD. 2001. The molecular revolution in ectomycorrhizal ecology: peeking into the black-box. Molecular Ecology 10: 1855–1871.

Horton TR, Molina R, Hood K. 2005. Douglas-fir ectomycorrhizae in 40- and 400-yr-old stands: mycobiont availability to late successional western hemlock. Mycorrhiza 15: 393–403.

Kirk PM, Cannon PF, David JC, Stalpers JA, eds. 2001. Dictionary of the fungi, 9th edn. Wallingford, UK: CAB International.

Kjøller R. 2006. Disproportionate abundance between ectomycorrhizal root tips and their associated mycelia. FEMS Microbiology Ecology 58: 214–224.

Kõljalg U, Larsson K-H, Abarenkov K, Nilsson RH, Alexander IJ, Eberhardt U, Erland S, Høiland K, Kjøller R, Larsson E et al. 2005. UNITE: a database providing web-based methods for the molecular identification of ectomycorrhizal fungi. New Phytologist 166: 1063–1068.

Koski LB, Golding GB. 2001. The closest BLAST hit is often not the nearest neighbour. Journal of Molecular Evolution 52: 540–542.

Naumann A, Navarro-González M, Sánchez-Hernández O, Hoegger PJ, Kües U. 2007. Correct identification of wood-inhabiting fungi by ITS analysis. Current Trends in Biotechnology and Pharmacy 1: 41–61.

Nilsson RH, Kristiansson E, Ryberg M, Hallenberg N, Larsson K-H. 2008. Intraspecific ITS variability in the kingdom Fungi as expressed in the international sequence databases and its implications for molecular species identification. Evolutionary Bioinformatics 4: 193–201.

Nilsson RH, Kristiansson E, Ryberg M, Larsson K-H. 2005. Approaching the taxonomic affiliation of unidentified sequences in public databases – an example from the mycorrhizal fungi. BMC Bioinformatics 6: 178.

Nilsson RH, Ryberg M, Kristiansson E, Abarenkov K, Larsson K-H, Kõljalg U. 2006. Taxonomic reliability of DNA sequences in public sequence databases: a fungal perspective. PLoS ONE 1: e59.

Pearson WR, Lipman DJ. 1988. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, USA 85: 2444–2448.

Porter TM, Schadt CW, Rizvi L, Martin AP, Schmidt SK, Scott-Denton L, Vilgalys R, Moncalvo J-M. 2008. Widespread occurrence and phylogenetic placement of a soil clone group adds a prominent new branch to the fungal tree of life. Molecular Phylogenetics and Evolution 46: 635–644.

Ryberg M, Nilsson RH, Kristiansson E, Töpel M, Jacobsson S, Larsson E. 2008. Mining metadata from unidentified ITS sequences in GenBank: a case study in Inocybe (Basidiomycota). BMC Evolutionary Biology 8: 50.

Selosse M-A, Setaro S, Glatard F, Richard F, Urcelay C, Weiß M. 2007. Sebacinales are common mycorrhizal associates of Ericaceae. New Phytologist 174: 864–878.

Taylor DL, McCormick MK. 2008. Internal transcribed spacer primers and sequences for improved characterization of basidiomycetous orchid mycorrhizas. New Phytologist 177: 1020–1033.

Tedersoo L, Hansen K, Perry BA, Kjøller R. 2006. Molecular and morphological diversity of pezizalean ectomycorrhiza. New Phytologist 170: 581–596.

Thompson JD, Higgins DG, Gibson TJ. 1994. Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and matrix choice. Nucleic Acids Research 22: 4673–4680.

Weiss M, Selosse M-A, Rexer K-H, Urban A, Oberwinkler F. 2004. Sebacinales: a hitherto overlooked cosm of heterobasidiomycetes with a broad mycorrhizal potential. Mycological Research 108: 1003–1010.

Supporting Information

Additional supporting information may be found in the online version of this article.

Text S1 A Perl script (readable as ASCII [.txt] format) to compile a list of genera that form the best BLAST match of at least one insufficiently identified sequence (IIS).

Text S2 A Perl CGI script (readable as ASCII [.txt] format) to search for insufficiently identified sequences (IIS) associated with a user-specified genus.

Text S3 Supplementary tables and figures: number of sequences for different host taxa and different countries and figure depicting the new search function of the emerencia web service.

Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office.