Benchmarking of Methods for Genomic Taxonomy.

RUNNING TITLE: Benchmarking of Methods for Genomic Taxonomy.

Mette V. Larsen 1*, Salvatore Cosentino 1, Oksana Lukjancenko 1, Dhany Saputra 1, Simon Rasmussen 1, Henrik Hasman 2, Thomas Sicheritz-Pontén 1, Frank M. Aarestrup 2, David W. Ussery 1,3, Ole Lund 1

1 Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
2 National Food Institute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
3 Comparative Genomics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

*Corresponding author, e-mail: [email protected]

JCM Accepts, published online ahead of print on 26 February 2014. J. Clin. Microbiol. doi:10.1128/JCM.02981-13. Copyright © 2014, American Society for Microbiology. All Rights Reserved.

ABSTRACT

One of the first questions that emerges when encountering a prokaryotic organism of interest is what it is, that is, which species it is. The 16S rRNA gene formed the basis of the first method for sequence-based taxonomy and has had a tremendous impact on the field of microbiology. Nevertheless, the method has been found to have a number of shortcomings.

In the current study we trained and benchmarked five methods for whole genome sequence based prokaryotic species identification on a common dataset of complete genomes: 1) SpeciesFinder, which is based on the complete 16S rRNA gene, 2) Reads2Type, which searches for species-specific 50-mers in either the 16S rRNA gene or the gyrB gene (for the Enterobacteriaceae family), 3) the rMLST method, which samples up to 53 ribosomal genes, 4) TaxonomyFinder, which is based on species-specific functional protein domain profiles, and finally 5) KmerFinder, which examines the number of co-occurring k-mers. The performances of the methods were subsequently evaluated on three datasets of short sequence reads or draft genomes from public databases. In total, the evaluation sets constituted sequence data from more than 11,000 isolates covering 159 genera and 243 species. Our results indicate that methods that only sample chromosomal, core genes have difficulties in distinguishing closely related species, which only recently diverged. The KmerFinder method had the overall highest accuracy and correctly identified 93% to 97% of the isolates in the evaluation sets.

INTRODUCTION

Rapid identification of the species of isolated bacteria is essential for surveillance for human and animal health and for choosing the optimal treatment and control measures. Since the beginning of microbiology more than a century ago, this has to a large extent been based on morphology and biochemical testing. For more than 30 years, however, 16S rRNA sequence data has served as the backbone for the classification of prokaryotes (1) and tremendous amounts of 16S rRNA sequences are available in public repositories (2-4). Due to the conserved nature of the 16S rRNA gene, the resolution is nevertheless often too low to adequately resolve different species and sometimes not even adequate for genus delineation (5, 6). Furthermore, many prokaryotic genomes contain several copies of the 16S rRNA gene with substantial inter-gene variation (7, 8). It is also considered problematic that this gene represents only a tiny fraction, roughly 0.1% or less, of the coding part of a microbial genome (9).

Second- and third-generation sequencing techniques have the potential to revolutionize the classification and characterization of prokaryotes and are now being used routinely in some clinical microbiology labs. However, so far no consensus on how to utilize the vast amount of information in Whole Genome Sequence (WGS) data has emerged (10). Nevertheless, a number of different methods have been proposed. Roughly, they can be divided into those that require annotation of genes in the data and those that employ the nucleotide sequences directly (9).

One of the first attempts to employ WGS data for taxonomic purposes was carried out in 1999 (11). At the time, 13 completely sequenced genomes of unicellular organisms were available and a distance-based phylogeny was constructed on the basis of presence and absence of suspected orthologous (direct common ancestry) gene pairs. Later it was recognized that methods that take into account gene content can be greatly influenced by Horizontal Gene Transfer (HGT) and alternative methods were developed that used homologous groups (gene family content) (12) or protein domains (13).

Functional protein domains also form the basis of a recent approach developed by our group (Lukjancenko O., Thomsen M. C. F., Larsen M. V., Ussery D. W., submitted for publication, (14)). Here, the protein domains are combined into functional profiles of which some are species-specific and can thus be used for inferring taxonomy.

As an extension of 16S rRNA analysis, which focuses on a single locus, Super Multilocus Sequence Typing (SuperMLST) has been proposed (15). It relies on the selection of a set of genes that are highly conserved and hence can be used with any organism. In a publication from 2012, Jolley et al. suggested that 53 genes encoding ribosomal proteins be used for bacterial classification in an approach called ribosomal MLST (rMLST) (16). Not all 53 genes were found in all bacterial genomes, but due to the relatively high number of sampled loci, this is not considered problematic. The rMLST method forms the basis of a proposed reclassification of Neisseria species (17) and has also been used for analyzing human Campylobacter isolates (18).

It is also possible to employ the sequence data directly without pre-annotation of genes. This can, for instance, be done using BLAST (19). An alternative, faster approach would be to look at k-mers (substrings of k nucleotides in DNA sequence data) and use the number of co-occurring k-mers in two bacterial genomes as a measure of evolutionary relatedness. Using the k-mer based approach, we have developed a method, KmerFinder, which examines all regions of the genomes, not only core genes (20). Furthermore, a gene segment will still score highly if it has been transposed within the genome, since only the flanking regions will be mismatched.

In the current study we have trained five different methods for species identification on a common dataset of complete prokaryotic genomes: 1) SpeciesFinder, which serves as the baseline, as it is based solely upon the 16S rRNA gene, 2) Reads2Type, a variant hereof, which searches for species-specific 50-mers, predominantly within the 16S rRNA gene, with the help of non-species-specific 50-mers to quickly narrow down the search, 3) rMLST, which predicts species by examining 53 ribosomal genes, 4) TaxonomyFinder, which is based on species-specific functional protein domain profiles, and finally 5) KmerFinder, which predicts species by examining the number of overlapping 16-mers.
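The idea of counting co-occurring k-mers can be illustrated with a minimal sketch. It is not the KmerFinder implementation: the function names and the in-memory dictionary index are assumptions made for illustration. The k-mer length of 16, the optional prefix filter, and the alphabetical tie-break follow the description of KmerFinder given in the methods of this paper.

```python
from collections import Counter


def kmers(seq, k=16, prefix=""):
    """Return the set of k-mers in seq (step size one) that start with prefix."""
    seq = seq.upper()
    return {seq[i:i + k]
            for i in range(len(seq) - k + 1)
            if seq.startswith(prefix, i)}


def build_index(references, k=16, prefix=""):
    """Map each k-mer to the set of reference names that contain it.

    references is a hypothetical dict of {species name: genome sequence}.
    """
    index = {}
    for name, seq in references.items():
        for kmer in kmers(seq, k, prefix):
            index.setdefault(kmer, set()).add(name)
    return index


def predict(query, index, k=16, prefix=""):
    """Return the reference sharing the most k-mers with the query, or None."""
    shared = Counter()
    for kmer in kmers(query, k, prefix):
        for name in index.get(kmer, ()):
            shared[name] += 1
    if not shared:
        return None
    top = max(shared.values())
    # Ties are broken alphabetically by name.
    return min(name for name, count in shared.items() if count == top)
```

Because the k-mers are collected in a set and looked up regardless of position, a segment that has moved within the genome still contributes almost all of its k-mers to the count.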

The publicly available databases contain ample amounts of WGS data from prokaryotes, enabling us to conduct a large-scale benchmark study of the proposed methods. Hence, the process of reaching a consensus on how WGS data should optimally be used for prokaryotic taxonomy is initiated.

MATERIAL & METHODS

Dataset

Training Data

In August 2011 a total of 1,647 complete genomes originating from Bacteria (1,535) and Archaea (112) were downloaded from the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/genome). For each genome, the annotated taxonomy according to GenBank was compared to the taxonomy according to Entrez, which was retrieved using the taxonomy module of BioPerl. Discrepancies were checked and corrected manually. For each genome, it was also examined if the annotated name was in accordance with the List of Prokaryotic names with Standing in Nomenclature (http://www.bacterio.cict.fr/allnames.html) (21). When possible, names that were not in accordance were corrected to valid ones. In this way, 1,426 genomes were assigned to 847 approved genera and species names. The remaining 221 genomes, which were either only assigned to a genus, e.g., Vibrio spp., or assigned to species with informal names, e.g., Synechoccus islandicus, were kept in the training data under the assumption that they will influence the different methods for species identification equally. An overview of the training data is available in Supplementary Table 1.

Evaluation Data

Three datasets were generated for the purpose of evaluating the methods. The first consisted of assembled complete or draft genomes with assigned species, which were downloaded from NCBI in September 2012 and were not already part of the training data. Only genomes assigned to species that were also present in the training data were included. The set was called NCBIdrafts and consisted of genomes from 695 isolates covering 81 genera and 149 species. The set includes three Archaea: two Methanobrevibacter smithii and one Sulfolobus solfataricus. An overview of the data can be seen in Supplementary Table 2.

Furthermore, in January 2012, 11,768 sets of Illumina raw reads with assigned species were downloaded from the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra) (22). 10,517 of them had been sequenced by the Illumina Genome Analyzer II sequencer, while the remaining 1,251 had been sequenced by the Illumina HiSeq 2000 sequencer. 1,361 sets of reads originated from species that were not part of the training data and were removed. The final SRAreads dataset consisted of 8,798 sets of paired-end reads and 1,609 sets of single reads, 10,407 sets in total.

For the short reads of the SRAreads set, the optimal k-mer length was estimated and used for de novo assembly as described previously (23) using Velvet 1.1.04 (24). The resulting set of draft genomes constituted the SRAdrafts evaluation set. To measure the quality of the draft assemblies, the N50 values were calculated (25). The draft assemblies had an average N50 of 77,018, ranging from 101 to 779,945 (see Supplementary Figure 1), an average number of scaffolds of 697, and an average size of 3,301 kilobases.

The SRAreads and SRAdrafts sets both cover 167 different species from 120 genera with more than 5,000 strains from the Streptococcus, Staphylococcus and Salmonella genera. There are no species from Archaea. An overview of the SRAreads and SRAdrafts sets is available in Supplementary Table 3.

Methods for species identification

SpeciesFinder

SpeciesFinder predicts the prokaryotic species based on the 16S rRNA gene. The concept of using the 16S rRNA gene for taxonomic purposes goes back to 1977 (1), but the implementation used in this study was developed by our group. A 16S database was built from the genomes of the common training data using RNAmmer (26). The species predictions were performed differently depending on the input type. If the input was short reads, the prediction was done as follows:
I. The reads were mapped against the 16S database using the Smith-Waterman Burrows-Wheeler aligner (BWA) (27).
II. The mapped reads were assembled using Trinity (28) to obtain the 16S rRNA sequences.
III. The BLAST algorithm (19) was used to search the output from Trinity against the 16S database.
IV. The best BLAST hit (see below) was chosen and the species associated with the best hit was given as the final prediction.

When the input sequence was a draft or complete genome, the prediction was performed as follows:
I. The 16S rRNA gene was predicted from the input sequence using RNAmmer.
II. Using the BLAST algorithm, the predicted sequence was aligned against the 16S database.
III. The best BLAST hit (see below) was chosen and the species associated with it was given as the final prediction.

The best BLAST hit was chosen by ranking the output from the BLAST alignment by the best cumulative rank of coverage, percent identity, bitscore, number of mismatches, and number of gaps. The highest ranked hit was chosen for the prediction.
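The cumulative-rank selection of the best hit can be sketched as follows. This is an illustration under stated assumptions, not the SpeciesFinder code: the dictionary keys, the use of dense ranking for tied values, and keeping the first hit when cumulative ranks tie are all choices made here, since the text does not specify them.

```python
def best_hit(hits):
    """Pick the hit with the lowest summed rank over five criteria.

    hits is a hypothetical list of dicts with the keys 'coverage',
    'identity', 'bitscore', 'mismatches' and 'gaps'.
    """
    if not hits:
        return None
    # (criterion, True if larger values are better)
    criteria = [("coverage", True), ("identity", True), ("bitscore", True),
                ("mismatches", False), ("gaps", False)]
    totals = [0] * len(hits)
    for key, higher_is_better in criteria:
        # Dense ranking: equal values share a rank, and rank 1 is best.
        distinct = sorted({h[key] for h in hits}, reverse=higher_is_better)
        rank_of = {value: rank for rank, value in enumerate(distinct, start=1)}
        for i, hit in enumerate(hits):
            totals[i] += rank_of[hit[key]]
    # min() keeps the first hit if several share the lowest cumulative rank.
    best = min(range(len(hits)), key=lambda i: totals[i])
    return hits[best]
```

Summing ranks rather than raw values keeps criteria on different scales (percentages, bit scores, counts) from dominating one another.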

SpeciesFinder is freely available at http://cge.cbs.dtu.dk/services/SpeciesFinder/.

rMLST

The rMLST method predicts bacterial species based on 53 ribosomal genes originally defined by Jolley et al. (16). The set of genes can be used in an approach similar to Multilocus Sequence Typing (MLST), where each locus in the query genome is considered identical or non-identical to alleles of the corresponding locus in the reference database, and an allelic profile based on arbitrary numbers assigned to each of the alleles in the database is generated accordingly. Since the strains that we compare are more diverse than the ones compared in MLST, it is likely that many loci would have no identical matches in the database, making a simple cluster analysis based on allelic profiles problematic. To improve the resolution of the method, in our implementation of rMLST, the nucleotide sequence of each locus is aligned to the alleles in the reference database and a measure of the similarity of the locus and the best matching allele is used subsequently, as described below.

Briefly, for each of the genomes in the training data, the 53 ribosomal genes were extracted by BLAST and provided to us by Keith Jolley, Department of Zoology, University of Oxford, UK. In this way, for each genome, a gene collection of up to 53 ribosomal genes was assigned. To predict the species of a query genome, the query genome was first aligned to each gene collection using Blat (29). Only hits with at least 95% identity and 95% coverage were considered as a potential match. If there were several potential matches, the best match was selected based on the best cumulative rank of coverage, percent identity, bitscore, number of mismatches, and number of gaps in the alignments. The final prediction was given as the organism with the highest number of best hits across all genes. Our implementation of rMLST performs predictions for draft or complete genomes, but not short reads.

TaxonomyFinder

The TaxonomyFinder method is based on taxonomy group-specific protein profiles developed by our group (Lukjancenko O., Thomsen M. C. F., Larsen M. V., Ussery D. W., submitted for publication (14)). It performs predictions for draft or complete genomes, but not for short reads. The common training data was used to create the taxonomy-specific profile database. Briefly, for each genome functional profiles were assigned based on three collections of Hidden Markov Model (HMM) databases: PfamA (30), TIGRFAM (31), and Superfamily (32). Genes that did not match any entry in the HMM databases were clustered using CD-HIT (33). Further, genomes were grouped according to the taxonomy level, either phylum or species, and profiles that were specific to each taxonomic group were extracted. Profiles were considered specific to a taxonomic group if they were conserved in 30-100% of the genomes within a phylum/species group and absent in all genomes outside of the group. The actual threshold for conservation depended on the size of the group, with large groups having smaller thresholds for conservation. The workflow of the TaxonomyFinder method is a four-step process, which includes:
I. Open-reading frame prediction using Prodigal (34).
II. Construction of functional profiles from protein-coding sequences.
III. Assignment of functional profiles.
IV. Functional profile comparison to the taxonomy-specific profile database. The number of architectures matched to each of the taxonomy groups is recorded, and the fraction of taxa-specific genes (score) is calculated. The best-matching taxonomy group is selected based on a consensus of the best score and highest number of matched architectures.

TaxonomyFinder is freely available at http://cge.cbs.dtu.dk/services/TaxonomyFinder/.

KmerFinder

The KmerFinder method was developed by our group and predicts prokaryotic species based on the number of overlapping (co-occurring) k-mers, i.e. 16-mers, between the query genome and genomes in a reference database (20). Initially, all genomes in the common training data were split into overlapping 16-mers with step-size one, meaning that if the first 16-mer is initiated at position N and ends at position N+15, the next 16-mer is initiated at position N+1 and ends at position N+16, and so on. To reduce the size of the final 16-mer database, only 16-mers with the prefix ATGAC were kept. These 16-mers were stored in a hash table with links to the original genomes. The length of the k-mers was chosen to be 16, since a parallel study showed that this resulted in the highest performance of the method (results not shown). The prefix, ATGAC, was initially selected in an attempt to focus the 16-mers on coding regions (ATG is the start codon for protein coding sequences), while the "A" and "C" were chosen arbitrarily as the two first nucleotides when sorting the four nucleotides alphabetically. Later studies have shown that the nucleotide sequence of the prefix has little influence on the performance of the method as long as strongly repetitive sequences, e.g. CCCCC or AAAAA, are omitted (data not shown). When performing the prediction, the species of the query genome is predicted to be identical to the species of the genome in the training data with which it has the highest number of 16-mers in common regardless of position. In the case of ties, the species were sorted alphabetically according to their name and the first species selected. The input for KmerFinder can be draft or complete genomes as well as short reads. KmerFinder is freely available at http://cge.cbs.dtu.dk/services/KmerFinder/.

Reads2Type

Reads2Type was developed by our group and identifies the prokaryotic species based on a database of 50-mer probes generated from chosen marker genes (Saputra D., Rasmussen S., Larsen M.V., Haddad N., Aarestrup F.M., Lund O., and Sicheritz-Pontén T., submitted for publication). The version of Reads2Type evaluated in this study requires short reads as input. For bacterial species not belonging to the Enterobacteriaceae family, the 50-mer database relies on the 16S rRNA locus, while for Enterobacteriaceae, the gyrB locus is used. Briefly, the following steps were applied for building the 50-mer probe database:
I. 16S rRNA sequences of the complete bacterial genomes of the common training set were predicted using RNAmmer (26).
II. For species belonging to the Enterobacteriaceae family the gyrB sequences were downloaded from NCBI.
III. The above sequences were pooled and all possible 50-bp fragments were generated from that pool.
IV. 16S rRNA probes unique for Enterobacteriaceae were removed from the pool of 50-mers.
V. All 50-mer duplicates associated with the conserved regions of different strains of the same species were removed.
VI. To further reduce the size of the final 50-mer database, 25 consecutive 50-mers previously fragmented from one ≥ 50 bp stretch of 16S rRNA belonging to the same list of organisms were removed.

The resulting 50-mer probe database consists of a number of sequences found uniquely in one species, as well as other sequences shared between several species. Subsequently, each read was compressed into a suffix tree, which is a data structure for fast string matching. The compressed short reads were aligned to the 50-mer probe database using a hierarchical "narrow-down approach" strategy, i.e. when a compressed read matched a probe belonging to a group of species, a much smaller probe database excluding other species was created on the fly, which speeds up processing of the remaining reads and identification of the species.

The Reads2Type method is freely available as a web server (http://cge.cbs.dtu.dk/services/Reads2Type/) and as a console program. The web-based Reads2Type is unique in not requiring the short read file to be uploaded to the server. Instead, the 4.6 MB 50-mer probe database is automatically transferred into the client computer's memory before initiating the species identification. All computations needed for the species identification are performed entirely on the client's computer, minimizing the data transfer and avoiding the network bottleneck on the server.
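The "narrow-down approach" can be sketched in a few lines. This is a simplified illustration rather than the Reads2Type implementation: it replaces the suffix-tree matching with dictionary lookups of read substrings, skips the probe-pruning steps IV to VI, and uses hypothetical function names and input formats.

```python
def build_probes(markers, k=50):
    """Map every k-mer of each marker sequence to the species containing it.

    markers is a hypothetical dict of {species name: marker gene sequence}.
    """
    probes = {}
    for species, seq in markers.items():
        for i in range(len(seq) - k + 1):
            probes.setdefault(seq[i:i + k], set()).add(species)
    return probes


def identify(reads, probes, k=50):
    """Return the sorted list of species still possible after all reads.

    A single-element list means the species was resolved; an empty list
    means no read matched any probe.
    """
    candidates = None  # no probe has matched yet
    for read in reads:
        for i in range(len(read) - k + 1):
            group = probes.get(read[i:i + k])
            if group is None or group == candidates:
                continue  # no match, or no new information
            candidates = group
            if len(candidates) == 1:
                return sorted(candidates)
            # Narrow down: keep only probes of the remaining candidates,
            # restricted to those candidates (the caller's dict is untouched).
            probes = {p: s & candidates
                      for p, s in probes.items() if s & candidates}
    return sorted(candidates) if candidates else []
```

Each time a shared probe matches, the working probe set shrinks to the species in that group, so later reads are compared against fewer probes until a species-specific probe settles the call.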

Testing the speed

The speed of the methods was evaluated on non-published internal data from up to 450 strains covering eight species (Enterococcus faecalis, Enterococcus faecium, Escherichia coli, Escherichia fergusonii, Klebsiella pneumoniae, Salmonella enterica, Staphylococcus aureus, and Vibrio cholerae) that had been sequenced by the Illumina sequencing method. Draft genomes were de novo assembled as described above for the SRAdrafts set. The speed was tested on a cluster with x86_64 architecture, 128 nodes, 4 cores per node, 30 or 7G per node. SpeciesFinder used four cores per job, TaxonomyFinder up to ten cores per job, and the other methods one core per job.

RESULTS

Five methods for species identification were trained on a common dataset of completed prokaryotic genomes. The performances of the methods were subsequently evaluated on three datasets of draft genomes or short sequence reads.

Performances on NCBI draft genomes

The SpeciesFinder, rMLST, TaxonomyFinder, and KmerFinder methods are able to perform species predictions on draft or completed prokaryotic genomes. Their performances were evaluated on the NCBIdrafts set of 695 draft genomes covering 149 species. Supplementary File 1 lists all predictions, while Figure 1A summarizes the results. Overall, SpeciesFinder, which is based on the 16S rRNA gene, had the poorest performance, only correctly identifying 76% of the isolates down to species level. KmerFinder, which is based on co-occurring 16-mers, had the highest performance and correctly identified 93% of the isolates. For only three isolates (0.43%), KmerFinder did not even get the genus correct. These three isolates were two E. coli predicted as Shigella sonnei and one Providencia alcalifaciens predicted as Yersinia pestis.

The NCBIdrafts set contained three Archaeal isolates: two M. smithii and one S. solfataricus. SpeciesFinder, TaxonomyFinder, and KmerFinder predicted the species of all three isolates correctly, while rMLST, which was only intended for characterization of Bacteria (16), predicted the M. smithii correctly, but was unable to make a prediction for the S. solfataricus.

The overlap in predictions of SpeciesFinder, rMLST, TaxonomyFinder, and KmerFinder was examined and is illustrated in Figure 2A. All four methods correctly identified 428 out of 695 isolates (62%), and all methods misidentified the same six isolates. These six isolates were also misidentified by the BLAST-based method. Table 1 lists these six isolates. Since all five methods agreed on these predictions, the isolates are possibly wrongly annotated. Alternatively, the annotations of the isolates in the training data that the predictions were based on are incorrect.

TABLE 1 HERE

As seen in Figure 2A, isolate predictions agreed upon by several methods are more accurate than predictions unique to a particular method. However, the KmerFinder method made unique predictions for 36 isolates of which 20 were in concordance with the annotation.

Predictions for the most common species in the dataset were examined 372 more closely and illustrated in Figure 3 and in Supplementary File 4. In general, 373 the ‘wrong’ predictions by SpeciesFinder (that is, the ones that were in 374 disagreement with the NCBI annotation) were typically scattered, often 375 consisting of a few wrong predictions of each type. The rMLST method was, on 376 the other hand, more consistent in its incorrect predictions. As an example, the 377 rMLST method wrongly annotated all 14 Bacillus anthracis isolates as Bacillus 378 thuringiensis, all 8 Brucella abortus as Brucella suis, and all 6 Burkholderia mallei 379 as Burkholderia pseudomallei. In general, all four methods had difficulties 380 identifying species within the Bacillus genus, such as isolates annotated as B. 381 thuringiensis, but predicted to be Bacillus cereus or vice versa. Another mistake 382 common to all methods was Streptococcus mitis being predicted as Streptococcus 383 oralis or Streptococcus pneumoniae. Also, none of the methods were able to 384 correctly identify all annotated E. coli isolates, but identified at least some of 385 them as Shigella spp. SpeciesFinder and TaxonomyFinder both had problems 386 identifying the Borrelia burgorferi isolates, while SpeciesFinder and rMLST had 387 problems distinguishing Yersinia pestis from Yersinia pseudotuberculosis. 388 SpeciesFinder was the only method that had difficulties identifying 389 Mycobacterium tuberculosis isolates, often predicting them to be Mycobacterium 390 bovis. 391 392 Performances on SRA draft genomes 393 The SpeciesFinder, rMLST, TaxonomyFinder, and KmerFinder methods 394 were next evaluated on the SRAdrafts set of 10,407 draft genomes covering 167 395 species. The performances on the draft genomes, for which the methods were 396

on May 23, 2018 by guest

http://jcm.asm

.org/D

ownloaded from

Page 17: ä t u ã ä w ä s - Journal of Clinical Microbiologyjcm.asm.org/content/early/2014/02/20/JCM.02981-13.full.pdfww for more than 30 years, 16S rRNA sequence data has served as the

able to make a prediction, are depicted in Figure 1B, while the overlap in predictions is illustrated in Figure 2B. Again, SpeciesFinder had the lowest performance with only 84% correct predictions. The rMLST, TaxonomyFinder, and KmerFinder methods had almost equal performances of 94%, 95%, and 95%, respectively. There was, however, a difference in the percentage of draft genomes for which each of the methods failed to make any prediction. SpeciesFinder and KmerFinder were the most robust methods, failing to make predictions for only 0.2% and 0.4% of the draft genomes, respectively. TaxonomyFinder was not able to make a prediction for 1.8% of the draft genomes, and rMLST not for 3.5%. That rMLST was the least robust method is at least partly due to our implementation of the method, where only hits with at least 95% identity and 95% coverage were considered a potential match. On the other hand, the N50 values for the draft genomes for which SpeciesFinder and KmerFinder could not make a prediction were approximately half the size of the corresponding values for rMLST and TaxonomyFinder (data not shown), meaning that the quality of the draft genomes has to be higher for rMLST and TaxonomyFinder to be able to make a prediction. This is in accordance with these methods relying on the presence of many complete genes.

Predictions for the most common species in the dataset are shown in Figure 4 and in Supplementary File 4. As seen previously when evaluating on the NCBIdrafts set, the rMLST method was more consistent in its predictions for a given species than the other methods. For instance, rMLST predicted all 15 Mycobacterium bovis isolates to be M. tuberculosis. As also seen when evaluating on the NCBIdrafts set, it is evident that all methods had difficulties distinguishing E. coli from species within the Shigella genus. Furthermore, species within the
Brucella genus were often wrongly identified. In particular, only TaxonomyFinder was able to correctly identify most Brucella abortus isolates. Some of the common problems that were obvious when evaluating on the NCBIdrafts set were not obvious when evaluating on the SRAdrafts set, since the problematic species were too scarcely represented here. For instance, there were only five species from the Bacillus genus and only one S. mitis in SRAdrafts. The difference in species distribution between the NCBIdrafts and SRAdrafts sets also explains why SpeciesFinder, TaxonomyFinder and rMLST all have increased performance on the SRAdrafts set: While more than half of the isolates in the SRAdrafts set belong to the Salmonella, Staphylococcus or Streptococcus genera, which none of the methods have particular problems identifying, these genera constitute less than 20% of NCBIdrafts. Conversely, the NCBIdrafts set contains a high proportion of the problematic species E. coli (8.8%) and the genus Bacillus (10%). The corresponding proportions for SRAdrafts are 3.5% E. coli and 0.05% isolates of the Bacillus genus. Furthermore, the NCBIdrafts set is proportionally more diverse, consisting of 149 species, while the almost 15 times larger SRAdrafts set consists of only 168 different species.

Performances on short reads from SRA

Only three of the methods were able to perform species predictions directly on short reads, without first assembling the reads. These methods were SpeciesFinder, KmerFinder, and Reads2Type. Their performances on the SRAreads set of 10,407 sets of short reads representing 168 species are shown in Figure 1C.
Again, the SpeciesFinder method had the poorest performance, with 86% of the isolates being correctly predicted. Reads2Type performed marginally better (87%), while KmerFinder achieved 97% correct predictions.

Figure 2C illustrates the overlap in predictions between the three methods, while predictions for the most common species are shown in Supplementary Figure 2. In general, the results correspond to those observed for the SRAdrafts set.

Speed

The speed of the methods was evaluated on a subset of draft genomes and short reads as described in Material and Methods. Since the actual speed experienced by the user will depend on a number of factors, for instance, the network bandwidth capacity of the client computer and the number of jobs queued at the server, the relative speed of the different methods in comparison to each other is more relevant than the absolute speed.

TABLE 2 HERE

DISCUSSION

In the present study we trained five different methods for prokaryotic species identification on a common dataset and evaluated their performances on three datasets of draft genomes or short sequence reads.

The SpeciesFinder method is based on the 16S rRNA gene, which has served as the backbone of prokaryotic systematics since 1977 (1). Accordingly, sequencing of the 16S rRNA gene is a well-established method for identification
of prokaryotes and has in all likelihood been used for annotating some of the isolates in the training and evaluation sets. In the light of this potential advantage of the SpeciesFinder method over the other methods, it is noteworthy that it had the lowest performance on all evaluation sets. Previous studies have, however, also pointed to the many limitations of the 16S rRNA gene for taxonomic purposes (5-9). Examples, which are also observed in this study, include its inadequacy for the delineation of species within the Borrelia burgdorferi sensu lato complex and the Mycobacterium tuberculosis complex (35). Similarly, in silico studies of the applicability of the 16S rRNA gene for the identification of medically important bacteria led the authors to conclude that although the method is useful for identification to the genus level, it is only able to identify 62% of anaerobic bacteria (36) and less than 30% of aerobic bacteria (37) confidently to the species level.

The performance of SpeciesFinder was surpassed only marginally by Reads2Type. This is not surprising, since the two methods are conceptually very similar: SpeciesFinder utilizes the entire 16S rRNA gene of approximately 1,540 nucleotides, while for most species, Reads2Type searches for species-specific 50-mers in the same gene. In terms of its future usability, Reads2Type has, however, one advantage over the other methods: Like most of the other methods it is available as a web-server, but uniquely it does not require the read data to be uploaded to the server. Instead, a small 50-mer database is transferred to the user’s computer and all computations are performed there. As a result, bottleneck problems on the server are avoided and the data transfer is minimized, which may be particularly advantageous for users with limited Internet access.
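
The idea of searching reads for species-specific 50-mers can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the Reads2Type implementation: the function names `build_kmer_db` and `predict_species`, the rule of keeping only k-mers that occur in a single species, and the choice of the species with the most hits are all hypothetical, and reverse-complement matching is ignored for brevity.

```python
from collections import Counter


def build_kmer_db(marker_seqs, k=50):
    """Map each k-mer that occurs in exactly one species to that species.

    marker_seqs: dict of species name -> marker gene sequence (assumed input).
    """
    owners = {}
    for species, seq in marker_seqs.items():
        for i in range(len(seq) - k + 1):
            owners.setdefault(seq[i:i + k], set()).add(species)
    # Keep only k-mers that are specific to a single species.
    return {kmer: next(iter(s)) for kmer, s in owners.items() if len(s) == 1}


def predict_species(reads, kmer_db, k=50):
    """Return the species with the most species-specific k-mer hits, or None."""
    hits = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            species = kmer_db.get(read[i:i + k])
            if species is not None:
                hits[species] += 1
    return hits.most_common(1)[0][0] if hits else None
```

Because such a lookup table only holds marker k-mers, it can stay small, which is consistent with the database being transferred to the client rather than the reads being uploaded.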

While SpeciesFinder and Reads2Type only sample one locus, the rMLST method samples up to 53 loci, all of them ribosomal genes located on the chromosome of the bacteria. When evaluated on the dataset of SRA draft genomes, rMLST, TaxonomyFinder, and KmerFinder performed equally well. However, on the more diverse and difficult set of NCBI draft genomes, the rMLST method performed only marginally better than SpeciesFinder and significantly worse than TaxonomyFinder and KmerFinder. In particular, the rMLST method consistently made incorrect identifications of a number of closely related species, e.g., Y. pestis versus Y. pseudotuberculosis (38) and M. tuberculosis versus M. bovis (39). Also, rMLST consistently predicted the human pathogen B. anthracis to be B. thuringiensis. The latter is used extensively as a biological pesticide and is generally not considered harmful to humans. B. anthracis and B. thuringiensis are both members of the B. cereus group and genetically very similar, with most of the disease and host specificity being attributable to their plasmid content (40, 41). It has even been suggested that all members of the B. cereus group should be considered to be B. cereus and only subsequently be differentiated by their plasmids (42). Hence, given that rMLST samples only chromosomal core genes, it is not surprising that the method fails to distinguish these isolates. A similar example is given by the rMLST method identifying all E. coli isolates as Shigella sonnei. Although Shigella spp. isolates have been awarded their own genus, their separation from Escherichia spp. is mainly historical (43-45). To be sure, some of the mistakes commonly made by rMLST as well as the other methods highlight taxa that are intrinsically difficult to distinguish due to a sub-optimal initial classification: Although Shigella spp. have for several years been considered a sub-strain of E.
coli, the practical implications of renaming it are considered insurmountable. On another note, the rMLST method was not developed for usage with a fixed training set, but rather with all known alleles. Accordingly, the performance of the method is expected to improve with increased size of the reference rMLST database, which is currently expanding rapidly (Keith Jolley, Department of Zoology, University of Oxford, UK, personal communication).

The TaxonomyFinder method was the second most accurate method on the set of NCBI draft genomes and was among the top performers on the SRA drafts set. In contrast to the other methods, it does not work directly on the nucleotide sequence of the isolates, but rather on the proteome, utilizing functional protein domain profiles for the species prediction. It was the slowest of the tested methods, but in return for the extra time, the user is rewarded with an annotated genome.

The KmerFinder method performs its predictions on the basis of co-occurring k-mers, regardless of their location in the chromosome. It had the overall highest accuracy, works on complete or draft genomes as well as short reads, and was found to be both very robust and fast. Furthermore, the KmerFinder method holds promise for future improvements, as the implementation used for this study was very simple: Only the raw number of co-occurring k-mers between the query and reference genome was considered, although a parallel analysis indicates that the performance could be improved even further if more sophisticated measures were used, also taking into account the total number of k-mers in the query and reference genome. KmerFinder took approximately 9 s per query genome, which makes it the fastest of the tested methods. To test the general applicability of sampling the entire genome and not pre-selected genes
or sets of genes for the species prediction, we also implemented a whole-genome BLAST-based method. The method used hit aggregation of significant matches between the query genome and all genomes in the common training set. As the final prediction, the species for which the query genome had the most bases matched was selected. The performance of this whole-genome BLAST-based method was tested on the NCBIdrafts and SRAdrafts evaluation sets and found to be very similar to that of KmerFinder (see Supplementary File 4). The method was, however, almost 20 times slower than KmerFinder, taking approximately 3 min per genome.

It has previously been noted that some of the isolates present in public databases, and hence used in this study, are wrongly annotated (17, 46, 47). Based on the current study, it is likely that at least the six isolates from the NCBIdrafts set that all methods identified as something different from the annotated species are wrongly annotated, or alternatively most closely related to an isolate in the common training data that is wrongly annotated. In agreement with this, one of the isolates has indeed been re-annotated since we initially downloaded the data. Of the remaining five isolates, two B. cereus isolates were found to be most closely related to the B. weihenstephanensis strain KBAB4 of the common training set. This strain is the single representative of the species in the public database and not the type strain. Hence there is no guarantee that the sequenced strain represents the named taxon (48). The same is the case for the C. botulinum strain C Eklund, which is predicted to be a Clostridium novyi based on its close resemblance to C. novyi strain NT of the training set. Clostridium novyi strain NT is the only representative of this species in the database and not the type strain. Obviously, all the evaluated methods are highly dependent on the
size and the accuracy of the set of genomes that they are trained on. Accordingly, all methods have the potential to improve their performance in the future, when more genomes become available and the present mistakes in the public databases are corrected. Another way to ensure future improvement is to combine the individual predictions of the methods and let the final predicted species of a query genome be decided by a majority vote. We are currently planning to implement such a system.

In the current study we only included species in the evaluation sets that were also present in the training set. We have hence not tested how the methods would perform when presented with a species not included in the training set. SpeciesFinder searches for the closest match in the query genome to a database of 16S rRNA genes. If the species of the query genome is not represented in the database, the closest match is likely to be of a closely related species, but the method also tests whether the percent identity and coverage of the 16S rRNA gene match are above 98% and marks the prediction as “failed” if the match is below this threshold. The rMLST method searches for closest matches in a database of 53 different ribosomal genes. In our implementation, the method will not provide an output if the percent identity and coverage of the matches are below a threshold of 95%, and hence, for species that are not represented in the training set, it will only be able to select a closely related species. Other implementations of the rMLST method would, however, not necessarily have this limitation. The TaxonomyFinder method uses species- or phylum-specific protein profiles, and would hence identify the correct phylum if the species of the query genome was not in the training set. Along with the predicted species, KmerFinder outputs the number of co-occurring k-mers that the selection was based on. A high
number of k-mers indicates that the identification is probable, while low numbers of k-mers indicate that the predicted species is likely to be a related species, and that the actual species is not in the training data. Further investigations would be necessary to identify a threshold for the number of k-mers to make this distinction.

While some taxonomists consider the goal of bacterial taxonomy to “mirror the order of nature and describe the evolutionary order back to the origin of life” (6, 49), a more pragmatic and applied view is likely to be advantageous for epidemiological purposes, where most outbreaks last less than six months. The number of prokaryotic genomes in public databases is currently sufficiently high to replace theoretical views of which loci to sample for optimal species identification with actual testing of how different approaches perform. One locus (the 16S rRNA gene) was initially used for sequence-based examination of relationships between bacteria, and when the approach was found to have limitations, more loci were added in MLST and MLSA (50, 51). The addition of still more loci has been suggested for improving MLSA even further (16, 35). This study suggests that an optimal approach should not be limited to a finite number of genes, but rather look at the entire genome.

CONCLUSION

The 16S rRNA gene has served prokaryotic taxonomy well for more than 30 years, but the emergence of second- and third-generation sequencing technologies enables the use of WGS data with the potential of higher resolution and more phylogenetically accurate classifications. Methods that sample the
entire genome, not just core genes located on the chromosome, seem particularly well suited for taking up the baton.

ACKNOWLEDGEMENTS

This work was supported by the Center for Genomic Epidemiology at the Technical University of Denmark and funded by grant 09-067103/DSF from the Danish Council for Strategic Research.

We are grateful to John Damm Sørensen for excellent technical assistance. We are grateful to Keith Jolley, Department of Zoology, University of Oxford, UK, for providing us with the rMLST genes for the genomes of the training data.

REFERENCES

1. Fox GE, Pechman KJ, Woese CR. 1977. Comparative cataloging of 16S ribosomal ribonucleic acid: molecular approach to procaryotic systematics. Int J Syst Bacteriol 27:44-57.
2. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL. 2006. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72:5069-5072.
3. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glockner FO. 2007. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 35:7188-7196.
4. Ludwig W, Strunk O, Westram R, Richter L, Meier H, Yadhukumar, Buchner A, Lai T, Steppi S, Jobb G, Forster W, Brettske I, Gerber S, Ginhart AW, Gross O, Grumann S, Hermann S, Jost R, Konig A, Liss T, Lussmann R, May M, Nonhoff B, Reichel B, Strehlow R, Stamatakis A, Stuckmann N, Vilbig A, Lenke M, Ludwig T, Bode A, Schleifer KH. 2004. ARB: a software environment for sequence data. Nucleic Acids Res 32:1363-1371.
5. Tindall BJ, Rossello-Mora R, Busse HJ, Ludwig W, Kampfer P. 2010. Notes on the characterization of prokaryote strains for taxonomic purposes. Int J Syst Evol Microbiol 60:249-266.
6. Kampfer P. 2012. Systematics of prokaryotes: the state of the art. Antonie Van Leeuwenhoek 101:3-11.
7. Tindall BJ, Schneider S, Lapidus A, Copeland A, Glavina Del Rio T, Nolan M, Lucas S, Chen F, Tice H, Cheng JF, Saunders E, Bruce D, Goodwin L, Pitluck S, Mikhailova N, Pati A, Ivanova N, Mavrommatis K, Chen A, Palaniappan K, Chain P, Land M, Hauser L, Chang YJ, Jeffries CD, Brettin T, Han C, Rohde M, Goker M, Bristow J, Eisen JA, Markowitz V, Hugenholtz P, Klenk HP, Kyrpides NC, Detter JC. 2009. Complete genome sequence of Halomicrobium mukohataei type strain (arg-2). Stand Genomic Sci 1:270-277.
8. Walcher M, Skvoretz R, Montgomery-Fullerton M, Jonas V, Brentano S. 2013. Description of an Unusual Neisseria meningitidis Isolate Containing and Expressing Neisseria gonorrhoeae-Specific 16S rRNA Gene Sequences. J Clin Microbiol 51:3199-3206.
9. Klenk HP, Goker M. 2010. En route to a genome-based classification of Archaea and Bacteria? Syst Appl Microbiol 33:175-182.
10. Koser CU, Ellington MJ, Cartwright EJ, Gillespie SH, Brown NM, Farrington M, Holden MT, Dougan G, Bentley SD, Parkhill J, Peacock SJ. 2012. Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. PLoS Pathog 8:e1002824.
11. Snel B, Bork P, Huynen MA. 1999. Genome phylogeny based on gene content. Nat Genet 21:108-110.
12. House CH, Fitz-Gibbon ST. 2002. Using homolog groups to create a whole-genomic tree of free-living organisms: an update. J Mol Evol 54:539-547.
13. Yang S, Doolittle RF, Bourne PE. 2005. Phylogeny determined by protein domain content. Proc Natl Acad Sci U S A 102:373-378.
14. Lukjancenko O. 2013. PanFunPro: PAN-genome analysis based on FUNctional PROfiles. F1000Research 2.
15. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. 2006. Toward automatic reconstruction of a highly resolved tree of life. Science 311:1283-1287.
16. Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, Wimalarathna H, Harrison OB, Sheppard SK, Cody AJ, Maiden MC. 2012. Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiology 158:1005-1015.
17. Bennett JS, Jolley KA, Earle SG, Corton C, Bentley SD, Parkhill J, Maiden MC. 2012. A genomic approach to bacterial taxonomy: an examination and proposed reclassification of species within the genus Neisseria. Microbiology 158:1570-1580.
18. Cody AJ, McCarthy ND, Jansen van Rensburg M, Isinkaye T, Bentley SD, Parkhill J, Dingle KE, Bowler IC, Jolley KA, Maiden MC. 2013. Real-time genomic epidemiological evaluation of human campylobacter isolates by use of whole-genome multilocus sequence typing. J Clin Microbiol 51:2526-2534.
19. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.
20. Hasman H, Saputra D, Sicheritz-Ponten T, Lund O, Svendsen CA, Frimodt-Moller N, Aarestrup FM. 2014. Rapid whole-genome sequencing for detection and characterization of microorganisms directly from clinical samples. J Clin Microbiol 52:139-146.
21. Euzeby JP. 1997. List of Bacterial Names with Standing in Nomenclature: a folder available on the Internet. Int J Syst Bacteriol 47:590-592.
22. Kodama Y, Shumway M, Leinonen R. 2012. The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res 40:D54-56.
23. Larsen MV, Cosentino S, Rasmussen S, Friis C, Hasman H, Marvig RL, Jelsbak L, Sicheritz-Ponten T, Ussery DW, Aarestrup FM, Lund O. 2012. Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol 50:1355-1361.
24. Zerbino DR, Birney E. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821-829.
25. Miller JR, Koren S, Sutton G. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95:315-327.
26. Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW. 2007. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100-3108.
27. Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589-595.
28. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29:644-652.
29. Kent WJ. 2002. BLAT--the BLAST-like alignment tool. Genome Res 12:656-664.
30. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer EL, Eddy SR, Bateman A, Finn RD. 2012. The Pfam protein families database. Nucleic Acids Res 40:D290-301.
31. Haft DH, Selengut JD, White O. 2003. The TIGRFAMs database of protein families. Nucleic Acids Res 31:371-373.
32. Gough J, Karplus K, Hughey R, Chothia C. 2001. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 313:903-919.
33. Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658-1659.
34. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119.
35. Almeida LA, Araujo R. 2013. Highlights on molecular identification of closely related species. Infect Genet Evol 13:67-75.
36. Woo PC, Chung LM, Teng JL, Tse H, Pang SS, Lau VY, Wong VW, Kam KL, Lau SK, Yuen KY. 2007. In silico analysis of 16S ribosomal RNA gene
sequencing-based methods for identification of medically important anaerobic bacteria. J Clin Pathol 60:576-579.
37. Teng JL, Yeung MY, Yue G, Au-Yeung RK, Yeung EY, Fung AM, Tse H, Yuen KY, Lau SK, Woo PC. 2011. In silico analysis of 16S rRNA gene sequencing based methods for identification of medically important aerobic Gram-negative bacteria. J Med Microbiol 60:1281-1286.
38. Achtman M, Zurth K, Morelli G, Torrea G, Guiyoule A, Carniel E. 1999. Yersinia pestis, the cause of plague, is a recently emerged clone of Yersinia pseudotuberculosis. Proc Natl Acad Sci U S A 96:14043-14048.
39. Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, Whittam TS, Musser JM. 1997. Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A 94:9869-9874.
40. Rasko DA, Altherr MR, Han CS, Ravel J. 2005. Genomics of the Bacillus cereus group of organisms. FEMS Microbiol Rev 29:303-329.
41. Jimenez G, Urdiain M, Cifuentes A, Lopez-Lopez A, Blanch AR, Tamames J, Kampfer P, Kolsto AB, Ramon D, Martinez JF, Codoner FM, Rossello-Mora R. 2013. Description of Bacillus toyonensis sp. nov., a novel species of the Bacillus cereus group, and pairwise genome comparisons of the species of the group by means of ANI calculations. Syst Appl Microbiol 36:383-391.
42. Helgason E, Okstad OA, Caugant DA, Johansen HA, Fouet A, Mock M, Hegna I, Kolsto AB. 2000. Bacillus anthracis, Bacillus cereus, and Bacillus thuringiensis--one species on the basis of genetic evidence. Appl Environ Microbiol 66:2627-2630.
43. Lan R, Reeves PR. 2002. Escherichia coli in disguise: molecular origins of Shigella. Microbes Infect 4:1125-1132.
44. Lukjancenko O, Wassenaar TM, Ussery DW. 2010. Comparison of 61 sequenced Escherichia coli genomes. Microb Ecol 60:708-720.
45. Karaolis DK, Lan R, Reeves PR. 1994. Sequence variation in Shigella sonnei (Sonnei), a pathogenic clone of Escherichia coli, over four continents and 41 years. J Clin Microbiol 32:796-802.
46. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM. 2007. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol 57:81-91.
47. Yarza P, Richter M, Peplies J, Euzeby J, Amann R, Schleifer KH, Ludwig W, Glockner FO, Rossello-Mora R. 2008. The All-Species Living Tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst Appl Microbiol 31:241-250.
48. Richter M, Rossello-Mora R. 2009. Shifting the genomic gold standard for the prokaryotic species definition. Proc Natl Acad Sci U S A 106:19126-19131.
49. Kampfer P, Glaeser SP. 2012. Prokaryotic taxonomy in the sequencing era--the polyphasic approach revisited. Environ Microbiol 14:291-317.
50. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, Feil EJ, Stackebrandt E, Van de Peer Y, Vandamme P, Thompson FL, Swings J. 2005. Opinion: Re-evaluating prokaryotic species. Nat Rev Microbiol 3:733-739.
51. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang 799 Q, Zhou J, Zurth K, Caugant DA, Feavers IM, Achtman M, Spratt BG. 800 1998. Multilocus sequence typing: a portable approach to the 801 identification of clones within populations of pathogenic microorganisms. 802 Proc Natl Acad Sci U S A 95:3140-3145. 803 804 805 806 TABLES 807 TABLE 1: Isolates of the NCBIdrafts set for which all five methods predict the 808 species to be different from what it is annotated as. 809 RefSeq ID Strain name Annotated species Predicted species

NZ_ACLX00000000 AH621 uid55161 Bacillus cereus Bacillus weihenstephanensis

NZ_ACMD00000000 BDRD ST196 uid55169 Bacillus cereus Bacillus weihenstephanensis

NZ_ABDQ00000000 C Eklund uid54841 Clostridium botulinum

Clostridium novyi

NZ_ABXZ00000000 FTG uid55313 Francisella novicida Francisella tularensisNZ_AHIE00000000 DC283 uid86627 Pantoea stewartii Pantoea ananatis

NZ_AEPO00000000* ATCC 49296 uid61461 Streptococcus sanguinis

Streptococcus oralis

* NZ_AEPO00000000 has been re-annotated as S. oralis since we collected the data in 2011.

TABLE 2: Speed of the tested methods.

Method            Speed on draft genomes (mm:ss)   Speed on short reads (mm:ss)
SpeciesFinder     00:13                            03:14
Reads2Type        NA                               01:20
rMLST             00:45                            NA
TaxonomyFinder    11:33                            NA
KmerFinder        00:09                            03:10

FIGURE LEGENDS

Figure 1: Performance of the five methods for species identification on A: NCBIdrafts, B: SRAdrafts, C: SRAreads. The rMLST and TaxonomyFinder methods only take draft or complete genomes as input, while Reads2Type only works for short reads. “Correct (genus and species)”: Predicted genus and species are in accordance with the annotation. “Only genus correct”: The predicted genus is in accordance with the annotation, but the species is not. “Not even genus correct”: Neither predicted genus nor species is in accordance with the annotation.

Figure 2: Overlap in predictions by the five methods for species identification. Numbers written in regular font indicate the number of isolates for which the predicted species corresponds to the annotated species. Numbers written in italics indicate the number of isolates for which the predicted and annotated species differ. A: The 16S, rMLST, KmerFinder, and TaxonomyFinder methods evaluated on the NCBIdrafts set. B: The 16S, rMLST, KmerFinder, and TaxonomyFinder methods evaluated on the SRAdrafts set. C: The 16S, KmerFinder, and Reads2Type methods evaluated on the SRAreads set.

Figure 3: Predictions for the most common species of the NCBIdrafts set. For each method, the results for a given species are shown only if the method made a prediction for five or more isolates annotated as this species (e.g., if there are five isolates annotated as species A in the dataset, but the method was not able to make a prediction for one of the isolates, the species is not shown), or if two or more isolates are predicted as this species (e.g., there are no isolates annotated as species B in the dataset, but two isolates annotated as species C are predicted to be species B, then species B is shown). A: Predictions by SpeciesFinder. B: Predictions by rMLST. C: Predictions by TaxonomyFinder. D: Predictions by KmerFinder.

Figure 4: Predictions for the most common species in the SRAdrafts dataset. For each method, the results for a given species are shown only if the method made a prediction for ten or more isolates annotated as this species, or if two or more isolates are predicted as this species. A: Predictions by SpeciesFinder. B: Predictions by rMLST. C: Predictions by TaxonomyFinder. D: Predictions by KmerFinder.
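The display rule stated in the legends of Figures 3 and 4 can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `species_to_show`, its parameters, and the input layout (pairs of annotated and predicted species, with `None` where a method made no prediction) are hypothetical. The sketch also assumes, following the example given in the Figure 3 legend, that the "two or more isolates" condition counts isolates annotated as a different species.

```python
from collections import Counter


def species_to_show(isolates, min_annotated=5, min_predicted=2):
    """Return the set of species that would be displayed for one method.

    isolates: iterable of (annotated_species, predicted_species) pairs;
    predicted_species is None when the method made no prediction.
    min_annotated: 5 for the Figure 3 rule, 10 for the Figure 4 rule.
    """
    annotated_with_prediction = Counter()
    predicted_from_other = Counter()
    for annotated, predicted in isolates:
        if predicted is None:
            continue  # no prediction: the isolate does not count
        annotated_with_prediction[annotated] += 1
        # Assumption: only isolates annotated as another species count here.
        if predicted != annotated:
            predicted_from_other[predicted] += 1
    shown = {s for s, n in annotated_with_prediction.items() if n >= min_annotated}
    shown |= {s for s, n in predicted_from_other.items() if n >= min_predicted}
    return shown


# Made-up data mirroring the two examples in the Figure 3 legend:
# five isolates annotated as A (one without a prediction), and two
# isolates annotated as C that are predicted to be B.
example = [("A", "A")] * 4 + [("A", None)] + [("C", "B")] * 2
print(species_to_show(example))
```

With this input only species B is displayed: species A falls below the threshold because one of its five isolates has no prediction. Passing `min_annotated=10` gives the stricter threshold used for the SRAdrafts set.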


[Figure 1: bar charts. Panels: A: NCBI draft genomes; B: SRA draft genomes; C: SRA short reads. Y axis: Frequency. Categories: Correct (genus and species); Only genus correct; Not even genus correct.]

Figure 1: Performance of the five methods for species identification on A: NCBIdrafts, B: SRAdrafts, C: SRAreads. The rMLST and TaxonomyFinder methods only take draft or complete genomes as input, while Reads2Type only works for short reads. “Correct (genus and species)”: Predicted genus and species are in accordance with the annotation. “Only genus correct”: The predicted genus is in accordance with the annotation, but the species is not. “Not even genus correct”: Neither predicted genus nor species is in accordance with the annotation.
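The three outcome categories defined in the Figure 1 legend can be expressed as a short Python sketch. It is an illustration only, not the scoring code used in the study. It assumes that annotations and predictions are plain "Genus species" strings compared case-insensitively; the function names and the example pairs are hypothetical, and isolates without a prediction are not modelled.

```python
from collections import Counter

CORRECT = "Correct (genus and species)"
GENUS_ONLY = "Only genus correct"
WRONG = "Not even genus correct"


def split_binomial(name):
    """Split 'Genus species' into lower-cased (genus, species)."""
    genus, _, species = name.strip().lower().partition(" ")
    return genus, species.strip()


def classify_prediction(annotated, predicted):
    """Assign one prediction to a category from the Figure 1 legend."""
    ann_genus, ann_species = split_binomial(annotated)
    pred_genus, pred_species = split_binomial(predicted)
    if ann_genus != pred_genus:
        return WRONG
    if ann_species != pred_species:
        return GENUS_ONLY
    return CORRECT


def category_frequencies(pairs):
    """Fraction of predictions in each category, as plotted on the y axis."""
    counts = Counter(classify_prediction(a, p) for a, p in pairs)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in (CORRECT, GENUS_ONLY, WRONG)}
    return {c: counts[c] / total for c in (CORRECT, GENUS_ONLY, WRONG)}


# Made-up (annotated, predicted) pairs, for illustration only.
pairs = [
    ("Streptococcus mitis", "Streptococcus mitis"),
    ("Streptococcus mitis", "Streptococcus oralis"),
    ("Escherichia coli", "Escherichia coli"),
    ("Escherichia coli", "Salmonella enterica"),
]
print(category_frequencies(pairs))
```

Comparing the genus first keeps the categories mutually exclusive: a prediction is only checked at species level once its genus matches the annotation.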


[Figure 2 panels: A: NCBI draft genomes; B: SRA draft genomes; C: SRA short reads.]

Figure 2: Overlap in predictions by the five methods for species identification. Numbers written in regular font indicate the number of isolates for which the predicted species corresponds to the annotated species. Numbers written in italics indicate the number of isolates for which the predicted and annotated species differ. A: The 16S, rMLST, KmerFinder and TaxonomyFinder methods evaluated on the NCBIdrafts set. B: The 16S, rMLST, TaxonomyFinder, and KmerFinder methods evaluated on the SRAdrafts set. C: The 16S, KmerFinder, and Reads2Type methods evaluated on the SRAreads set.
