Recent Computational Advances in Metagenomics (RCAM’15)

RCAM program committee

Institut Pasteur
Tuesday, Oct. 6, 2015

RCAM benefits from the financial support of Institut Pasteur (www.pasteur.fr) and of the "Metaomics of Microbial Ecosystems (M2E)" metaprogram (www.mem.inra.fr) from the French National Institute for Agricultural Research (INRA).


RCAM 2015 Workshop Schedule
Tuesday 6th of October, 2015

Time           Speaker                    Title
09:00 - 09:30                             Welcome coffee
09:30 - 10:30  Martin S. Lindner (Key)    Augmenting taxonomic profiling with coverage information to improve sensitivity and specificity
10:30 - 11:00  Vitor C. Piro              DUDes: a top-down taxonomic profiler for metagenomics samples
11:00 - 11:30  Ari Ugarte                 META-CLADE: a new tool to identify domains and functionally annotate metagenomic and metatranscriptomic sequences
11:30 - 12:30  Christopher Quince (Key)   Inferring microbial species and strains directly from metagenome data
12:30 - 14:15                             Lunch break
14:15 - 15:15  Alex L. Mitchell (Key)     The EBI Metagenomics Portal - a free to use analysis platform for metagenomic data
15:15 - 15:45  Maria Bernard              FROGS: Find Rapidly OTU with Galaxy Solution
15:45 - 16:15  Daniel H. Huson            An Efficient Pipeline for Microbiome Analysis
16:15 - 16:45                             Coffee break
16:45 - 17:15  Mohamed Mysara             From Reads to OTUs. Improved Algorithms for Preprocessing Amplicon Sequencing data
17:15 - 17:45  Gaëtan Benoit              Simka: Fast kmer-based method for estimating the similarity between numerous metagenomic datasets
17:45 - 18:15  Gregory Kucherov           Improved computational techniques for k-mer-based metagenomic classification
18:15                                     Closing remarks


Contents

1 Augmenting taxonomic profiling with coverage information to improve sensitivity and specificity (keynote)
2 DUDes: a top-down taxonomic profiler for metagenomics samples
3 META-CLADE: a new tool to identify domains and functionally annotate metagenomic and metatranscriptomic sequences
4 Inferring microbial species and strains directly from metagenome data (keynote)
5 The EBI Metagenomics Portal - a free to use analysis platform for metagenomic data (keynote)
6 FROGS: Find Rapidly OTU with Galaxy Solution
7 An Efficient Pipeline for Microbiome Analysis
8 From Reads to OTUs. Improved Algorithms for Preprocessing Amplicon Sequencing data
9 Simka: Fast kmer-based method for estimating the similarity between numerous metagenomic datasets
10 Improved computational techniques for k-mer-based metagenomic classification


Augmenting taxonomic profiling with coverage information to improve sensitivity and specificity

Martin S. Lindner1,2

1 Robert Koch Institut, Berlin
2 4-Antibody, Basel

Metagenomic samples typically consist of a mixture of genomic material from multiple (in particular microbial) organisms. One of the key challenges in metagenomics, denoted as Taxonomic Profiling, is to disentangle the genomic ravel and identify the organisms present in a sample. In this talk, I will show how genome coverage information can be used to circumvent typical pitfalls causing low sensitivity and specificity of current approaches in difficult situations.

Our idea was to fit mixtures of discrete probability distributions to genome coverage profiles with the Expectation Maximization algorithm. With this information, we can calculate the average coverage in the covered areas of the genomes, handle spike-like artifacts, and estimate the similarity between the reference genome and the organism in the sample. In our taxonomic profiling tool MicrobeGPS, we use this information to cluster reference genomes into groups, each representing one organism in the sample. In addition to quantitative measures such as number of reads and relative abundance, our approach provides further information on the identity and reliability of the observed organisms. This simplifies the interpretation and leads to higher sensitivity and specificity of the results.
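To illustrate the idea of fitting a mixture of discrete distributions to a coverage profile with Expectation Maximization, here is a minimal sketch. It assumes a two-component Poisson mixture (one component for uncovered positions, one for covered positions); the abstract does not state which distributions MicrobeGPS uses, and the function names, starting values and toy coverage profile are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k given mean lam (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def fit_poisson_mixture(counts, lams, weights, n_iter=100):
    """Fit a mixture of Poisson distributions to counts with EM.

    lams and weights are the initial component means and mixing weights.
    This is an illustrative model choice, not the MicrobeGPS model.
    """
    lams, weights = list(lams), list(weights)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for k in counts:
            joint = [w * poisson_pmf(k, lam) for w, lam in zip(weights, lams)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: update mixing weights and component means.
        for c in range(len(lams)):
            r_sum = sum(r[c] for r in resp)
            if r_sum == 0.0:
                continue  # component explains nothing; leave it unchanged
            weights[c] = r_sum / len(counts)
            mean = sum(r[c] * k for r, k in zip(resp, counts)) / r_sum
            # Keep the mean strictly positive so log(lam) stays defined.
            lams[c] = max(mean, 1e-6)
    return lams, weights


# Toy per-position coverage: 60 uncovered positions, 40 covered at about 10x.
coverage = [0] * 60 + [9, 10, 11, 10, 12, 8, 10, 11, 9, 10] * 4
lams, weights = fit_poisson_mixture(coverage, lams=[0.5, 5.0], weights=[0.5, 0.5])
```

In this toy profile the second component's mean estimates the average coverage within the covered part of the genome, and its weight estimates the covered fraction, which is the kind of information the abstract describes extracting.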


DUDes: a top-down taxonomic profiler for metagenomics samples

Piro, Vitor C., Lindner, Martin S., Renard, Bernhard Y.

Research Group Bioinformatics (NG4), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany

The fast increase of complete genome sequences available in public databases has allowed better predictions of the microbial content of sequenced environmental and clinical samples. The identification of species and their quantification are common tasks in metagenomics and pathogen detection studies. The most recent techniques are built on mapping the sequenced reads against a reference database (e.g., whole genomes, marker genes, proteins) and performing further analysis. Although these methods have proved useful in many scenarios, there is still room for improvement in species- and strain-level detection, mainly for low-abundance organisms.

We propose a new method: DUDes, a reference-based taxonomic profiler that introduces a novel top-down approach to analyze metagenomic NGS samples. Unlike the main taxonomic profiling tools, which base their predictions on estimated abundances in the sample, DUDes does not use abundances directly as a means of identification. Our method identifies possible candidates by comparing the strength of the read mapping in each node of the taxonomic tree in an iterative manner. Instead of using the lowest common ancestor (LCA), a commonly used bottom-up approach to solve ambiguities in identifications, we propose a new approach: the deepest uncommon descendent (DUD). Whereas the LCA method solves ambiguous identifications by going back one taxonomic level to the lowest common ancestor, the DUD approach starts at the root node and tries to go for deeper taxonomic levels, even when ambiguities are found. That way it is possible to have less conservative identifications in higher taxonomic levels. Besides, when the provided data does not allow a specific identification on higher levels, the method can identify a set of probable candidates. Permutation tests are performed to estimate p-values between nodes and to identify which of them are present on each level.

We showed in experiments that DUDes works for single and multiple organisms and can identify low-abundance taxonomic groups with high precision.
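As a rough illustration of how a permutation test can yield a p-value when comparing two nodes of the taxonomic tree, the sketch below runs a generic two-sample permutation test on the difference in mean mapping score. The test statistic, the function name and the toy scores are assumptions made for illustration; the abstract does not specify the statistic DUDes actually uses.

```python
import random


def permutation_p_value(scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in mean score.

    scores_a and scores_b are hypothetical per-read mapping scores
    collected at two sibling nodes of the taxonomy.
    """
    rng = random.Random(seed)
    n_a = len(scores_a)
    n_b = len(scores_b)
    observed = abs(sum(scores_a) / n_a - sum(scores_b) / n_b)
    pooled = list(scores_a) + list(scores_b)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled scores to the two nodes.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# One node clearly attracts stronger mappings than its sibling.
p_different = permutation_p_value([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
# Two nodes with indistinguishable mapping strength.
p_same = permutation_p_value([5, 5, 5], [5, 5, 5])
```

A small p-value suggests that one node is better supported than the other, so a top-down traversal could continue into it; a large p-value suggests keeping both nodes as candidates.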


META-CLADE: a new tool to identify domains and functionally annotate metagenomic and metatranscriptomic sequences

Ari Ugarte, Juliana Bernardes, Alessandra Carbone

Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238 CNRS - Université Pierre et Marie Curie, 75006 Paris, France

The improvements of next-generation sequencing have allowed researchers to study the genomic diversity in microbial communities. The increased complexity of metagenomics data poses computational challenges in assembling, annotating, and classifying genomic fragments from multiple organisms. Domain identification provides insights into the biological function of a protein. Hence, domain annotation is a crucial step to identify and quantify the genes in a microbial community that are known and those that are completely new. Traditional protein annotation methods describe known domains with probabilistic models representing the consensus among homologous domain sequences. When relevant signals become too weak to be identified by consensus, annotation attempts fail.

CLADE [1], a new method for protein domain identification which achieves highly accurate predictions for single genomes compared to the HMMER methodology [2], is based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. CLADE uses an extension of the probabilistic model library in order to characterize local models and improve signal detection. CLADE has been used to develop META-CLADE [3], a new protein domain annotation tool for metagenomics and metatranscriptomics.

In order to evaluate META-CLADE performance, we simulated a dataset containing 500,000 reads of Roche's 454 FLX Titanium sequencer. We built this data set from 40 marine bacterial and archaeal complete genome sequences assuming equal abundance. Genes predicted by FragGeneScan [4] in simulated reads were translated to proteins and annotated with META-CLADE and HMMER using the Pfam27 [5] domain database. META-CLADE identifies substantially more domains than HMMER in simulated reads (~30% more detected domains). Besides the improvement in domain recognition, META-CLADE agrees with 99.5% of HMMER domain predictions and reinforces the signal of agreed domains.

To show that this new method is suitable for real data, it was applied to 5 data sets collected from 5 different ocean stations containing unicellular marine eukaryotic metatranscriptomic sequences [6]. META-CLADE outperforms the HMMER methodology in domain recognition and in signal detection in agreed domains for all data sets. Domains identified by each method were mapped for functional annotation using Pfam2GO [7], and a list of GO Terms [8,9] for each sample was obtained. META-CLADE allows extending the list of significant GO Terms. Moreover, it gives a better resolution of significant GO Terms and highlights the functional characteristics of each sample. In conclusion, our results show that META-CLADE is suitable not only for domain recognition but also for improving functional annotation in metagenomics and metatranscriptomics studies [10].

[1] High performance domain identification in proteins reached with the agreement of many profiles and domain co-occurrence. J. Bernardes, G. Zaverucha, C. Vaquero, A. Carbone. Submitted (2015).

[2] HMMER web server: interactive sequence similarity searching. R.D. Finn, J. Clements, S.R. Eddy. Nucleic Acids Research (2011) Web Server Issue 39:W29-W37.

[3] Meta-clade: a highly precise annotation method for metagenomic samples. A. Ugarte, J. Bernardes, A. Carbone. In preparation (2015).

[4] FragGeneScan: Predicting Genes in Short and Error-prone Reads. Mina Rho, Haixu Tang, and Yuzhen Ye. Nucleic Acids Research (2010).

[5] The Pfam protein families database. R.D. Finn, A. Bateman, J. Clements, P. Coggill, R.Y. Eberhardt, S.R. Eddy, A. Heger, K. Hetherington, L. Holm, J. Mistry, E.L.L. Sonnhammer, J. Tate, M. Punta. Nucleic Acids Research (2014) Database Issue 42:D222-D230.

[6] The impact of temperature on marine phytoplankton resource allocation and metabolism. Toseland A., Daines S. J., Clark J. R., Kirkham A., Strauss J., Uhlig C., Lenton T. M., Valentin K., Pearson G. A., Moulton V., Mock T. (2013). Nature Climate Change 3, 979–984.

[7] Pfam2GO. Mitchell et al. (2015) Nucleic Acids Research 43:D213-D221.

[8] Ashburner et al. Gene ontology: tool for the unification of biology (2000) Nat Genet 25(1):25-9.

[9] The Gene Ontology Consortium. Gene Ontology Consortium: going forward. (2015) Nucl Acids Res 43 Database issue D1049–D1056.

[10] A new approach to the functional annotation of metagenomic samples. A. Ugarte, T. Mock, A. Falciatore, A. Carbone. In preparation (2015).


Inferring microbial species and strains directly from metagenome data

Dr Christopher Quince1

1 Warwick Medical School, University of Warwick

In metagenome sequencing, DNA from an entire microbial community is sequenced, typically with short reads. The assemblies produced from these studies are usually highly fragmented, comprising hundreds of thousands of partial assemblies or contigs. This is an inevitable consequence of intra- and inter-genomic repeats. Only from very simple communities can complete genomes be assembled. However, determining which contigs derive from which species or strain is almost as useful as a complete genome, revealing gene complement. Metagenome analyses often comprise multiple samples from longitudinal analysis of the same community over time or horizontal sampling of multiple similar communities. We exploit this in a method, CONCOCT: Clustering cONtigs on COverage and ComposiTion, that combines sequence composition and coverage across multiple samples to automatically cluster contigs into species genomes. CONCOCT uses a dimensionality reduction coupled to a Gaussian mixture model, fit using a variational Bayesian algorithm, which automatically identifies the optimal number of clusters. We demonstrate high recall and precision rates on artificial as well as real human gut metagenome datasets.

We then extend this principle, developing a probabilistic model of variant frequencies across samples on core genes within species clusters. These frequencies depend on the relative abundances of strains in each sample and their haplotype. Using a Gibbs sampling algorithm, we can use this model to reconstruct the abundances of the strains and their genotypes on the core genes. These genotypes can then be used to determine the phylogenetic relationships between the strains present. Finally, we can apply this information to all the contigs associated with the species to reconstruct the accessory genomes of the different strains. This provides a methodology for de novo extraction of strain genome composition from metagenome analyses that does not rely on long read sequencing.
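To make the notion of combining sequence composition with coverage across samples more concrete, the sketch below builds one feature vector per contig from k-mer frequencies and per-sample coverage. It is a simplified illustration only: the pseudocount, the log transform, the function names and the toy contig are assumptions, and the dimensionality reduction and Gaussian mixture clustering mentioned in the abstract are not shown.

```python
import math
from itertools import product


def composition_vector(seq, k=4):
    """Relative frequency of every DNA k-mer in seq, with a pseudocount of 1."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {kmer: 1 for kmer in kmers}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skips k-mers containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[kmer] / total for kmer in kmers]


def contig_features(seq, coverages, k=4):
    """Concatenate k-mer composition with log-scaled per-sample coverage."""
    return composition_vector(seq, k) + [math.log(c + 1.0) for c in coverages]


# Hypothetical contig with its mean coverage in three samples; k=2 keeps it short.
features = contig_features("ACGTACGTTTGACCA", coverages=[12.0, 0.5, 30.2], k=2)
```

Contigs from the same genome are expected to have similar composition and to co-vary in coverage across samples, so vectors of this kind can be passed to a clustering step.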


The EBI Metagenomics Portal - a free to use analysis platform for metagenomic data

Alex L. Mitchell1

1 European Bioinformatics Institute

Metagenomics, the analysis of genetic material from microbial communities inhabiting different environments, is an exciting and fast-growing field of research. Metagenomic analyses have recently been successfully applied to a variety of areas, including agriculture, food manufacture and spoilage, bioenergy production, the elucidation of antibiotic resistance mechanisms, and animal and human health.

With the increased popularity of the method and the diminishing cost of sequencing, data volumes are becoming increasingly cumbersome to process, analyse and navigate. More often than not, the computational overhead is the bottleneck of a metagenomics experiment. To help alleviate this situation, we have brought together a cross-disciplinary team of computer scientists, bioinformaticians, statisticians and biologists to produce the EBI Metagenomics Portal. Some of the recent technical developments necessary to deal with the 10 terabytes of sequence data from the first phase of the Tara Oceans project processed by the Portal will be presented. The functional and taxonomic composition of this flagship project, although vast in size, can be summarised in just 5 files, amounting to only a few megabytes.

The EBI Metagenomics Portal (https://www.ebi.ac.uk/metagenomics/) is a free-to-use analysis and archiving resource for the metagenomics research community. Covering data submission, archiving and sharing functions, community standards-compliant meta-data curation, and rich functional and taxonomic diversity analyses, the service has attracted a growing user base world-wide. The website provides access to analysis results for tens of billions of sequences, drawn from thousands of runs from hundreds of different projects across disparate biomes. An overview of the features of the website will also be presented, describing the data submission and analysis processes, and highlighting the ways in which it can be used to interrogate and compare metagenomic samples.


FROGS: Find Rapidly OTU with Galaxy Solution

Frédéric ESCUDIÉ1*, Lucas AUER2*, Maria BERNARD3, Laurent CAUQUIL4, Katia VIDAL4, Sarah MAMAN4, Mahendra MARIADASSOU5, Guillermina HERNANDEZ-RAQUET2, Géraldine PASCAL4.

1 Bioinformatics platform Toulouse Midi-Pyrenees, MIAT, INRA Auzeville CS 52627 31326 Castanet Tolosan cedex, France
2 Université de Toulouse, INSA, UPS, LISBP, F-31077 Toulouse Cedex 4, France ; INRA, UMR792 ISBP, F-31400 Toulouse, France
3 INRA, UMR1313, SIGENAE, F-78352 Jouy-en-Josas, France
4 INRA, UMR1388, F-31326 Castanet-Tolosan, France, Université de Toulouse INPT ENSAT, UMR1388, F-31326 Castanet-Tolosan, France, Université de Toulouse INPT ENVT, UMR1388, F-31076 Toulouse, France
5 INRA, Unité MaIAGE, F-78352 Jouy-en-Josas, France

* These authors have contributed equally to the present work.

Corresponding author: [email protected]

Abstract: High-throughput sequencing of 16S/18S RNA amplicons has opened new horizons in the study of microbial communities. With sequencing at great depth, current processing pipelines struggle to run rapidly, and the most effective solutions are often designed for specialists. These tools are designed to give both the abundance table of operational taxonomic units (OTUs) and their taxonomic affiliation. In this context we developed the pipeline FROGS: "Find Rapidly OTU with Galaxy Solution". Developed for the Galaxy platform [1-3], FROGS was designed to be run in two modes: with or without demultiplexed sequences. A preprocessing tool merges paired sequences into contigs with FLASH [4], cleans the data with cutadapt [5], removes chimeras with VSEARCH [6] and dereplicates sequences with a home-made Python script. The clustering tool runs SWARM [7], which uses a local clustering threshold rather than the global clustering threshold used by other software. This tool generates the OTU abundance table. The affiliation tool returns the taxonomic affiliation of each OTU using both RDPClassifier [8] and NCBI Blast+ [9] on Silva SSU 119 and 123 [10]. Finally, the post-processing tool allows users to process this table with user-specified filters and provides statistical results and numerous graphical illustrations of these data. FROGS has been developed to be very fast even on large amounts of MiSeq data by using cutting-edge tools and an optimized design. It is also portable to all Galaxy platforms with a minimum of informatics and architecture dependencies. FROGS was tested on several simulated data sets. The tool proved extremely rapid, robust and highly sensitive for the detection of OTUs, with very few false positives compared to other pipelines widely used by the community.
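The dereplication step mentioned above can be pictured with a short sketch. This is a hypothetical stand-in, not the FROGS script, which the abstract does not show: it simply collapses identical sequences and records how often each one occurs. Treating sequences case-insensitively is an assumption.

```python
from collections import Counter


def dereplicate(sequences):
    """Collapse identical sequences into (sequence, count) pairs.

    Pairs are sorted by decreasing abundance, then alphabetically.
    """
    counts = Counter(seq.upper() for seq in sequences)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy amplicon reads; the names and data are illustrative only.
reads = ["ACGT", "acgt", "TTGA", "ACGT", "TTGA", "GGCC"]
unique = dereplicate(reads)
```

Keeping the counts alongside the unique sequences lets later steps, such as clustering, work on far fewer sequences without losing abundance information.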

1. Blankenberg, D., et al., Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol, 2010. Chapter 19: p. Unit 19 10 1-21.
2. Giardine, B., et al., Galaxy: a platform for interactive large-scale genome analysis. Genome Res, 2005. 15(10): p. 1451-5.
3. Goecks, J., et al., Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 2010. 11(8): p. R86.
4. Magoc, T. and S.L. Salzberg, FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 2011. 27(21): p. 2957-63.
5. Martin, M., Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 2011. 17(1): p. 10-12.
6. Flouri, T., et al., the VSEARCH GitHub repository, release 1.0.16, doi 10.5281/zenodo.15524.
7. Mahé, F., et al., Swarm: robust and fast clustering method for amplicon-based studies. PeerJ, 2014(2:e593).
8. Wang, Q., G. M. Garrity, J. M. Tiedje, and J. R. Cole, Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol, 2007. 73(16): p. 5261-7.
9. Camacho, C., et al., BLAST+: architecture and applications. BMC Bioinformatics, 2009. 10: p. 421.
10. Quast, C., et al., The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res, 2013. 41(Database issue): p. D590-6.


An Efficient Pipeline for Microbiome Analysis

Daniel H. Huson∗†, Benjamin Buchfink∗ and Hans-Joachim Ruscheweyh∗‡

August 28, 2015

Microbiome analysis using metagenome shotgun sequencing is computationally challenging. A typical project may involve hundreds of samples, each represented by tens of millions of DNA sequencing reads. For functional analysis, it is necessary to align all reads against a comprehensive protein reference database, such as NCBI-NR, which currently contains over 60 million sequences.

Here we present a simple, highly efficient pipeline for analyzing metagenomic reads. Our pipeline has four parts.

1. DIAMOND [1] is used to compare all DNA sequencing reads against the NR database. Our program aligns Illumina reads against protein references at up to 20,000 times the speed of BLASTX, while achieving a similar level of sensitivity. Input is a file of reads in FASTA or FASTQ format (usually gzipped).

2. Then, for each sample, the resulting file (“DIAMOND alignment archive” file with suffix .daa) is analyzed using a new program called “daa2rma” that performs taxonomic and functional binning of all reads. As described in [2], reads are mapped to the NCBI taxonomy using the LCA algorithm and functional analysis is performed by mapping reads to SEED, COG and/or KEGG. The output is an RMA file (MEGAN “Read Match Archive” file with suffix .rma).

3. The resulting RMA files are then made accessible via the local network or over the world wide web using a new program called MeganServer [3].

4. Project members access the RMA files remotely to interactively analyze and compare their datasets using an upcoming new version 6 of MEGAN [4].
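The LCA assignment mentioned in step 2 can be sketched in its simplest form as follows. This is the naive lowest-common-ancestor rule on a child-to-parent map, assuming a read's significant alignments have already been reduced to a list of taxa; the MEGAN implementation applies additional filtering that is not described in this abstract, and the toy taxonomy and function names are illustrative.

```python
def lineage(taxon, parent):
    """Path from taxon up to the root, following the child -> parent map."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node that lies on the root path of every taxon in taxa."""
    taxa = list(taxa)
    # Ordered from the first taxon up to the root, so index 0 is the deepest.
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    return common[0]


# Toy taxonomy as a child -> parent map.
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}

# A read hitting two Escherichia species is placed on the genus.
genus_level = lowest_common_ancestor(["Escherichia coli", "Escherichia albertii"], parent)
# A read hitting two different genera is placed on the family.
family_level = lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"], parent)
```

The more ambiguous a read's hits are, the higher in the taxonomy it is placed, which is why LCA binning is conservative.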

This pipeline minimizes computational time and disk space, and makes it easy to access results. In an ongoing project, the alignment of one billion Illumina reads against microbial NR using DIAMOND took one day on a single server, and the subsequent taxonomic and functional analysis using daa2rma took about 4 hours. Making files accessible via MeganServer takes no additional time.

All three programs, DIAMOND, MeganServer and an alpha-test version of MEGAN6, are available from: http://ab.inf.uni-tuebingen.de/data/software.

Figure 1: Microbiome analysis pipeline. FASTQ files produced by a sequencer are aligned against a protein reference database using DIAMOND. Taxonomic and functional analyses of the resulting alignments are performed using daa2rma. MeganServer publishes the resulting files to the web. Authorized project members access the files remotely using MEGAN6.

∗ Center for Bioinformatics, Universität Tübingen
† Life Sciences Institute, National University of Singapore
‡ Department of Biosystems Science and Engineering, ETH Zurich (Basel), SIB Swiss Institute of Bioinformatics, Basel, and Scientific IT Services, Research Informatics, ETH Zurich (Basel)

References

[1] Buchfink, B., Xie, C., Huson, D.H.: Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60 (2015)

[2] Huson, D.H., Mitra, S., Weber, N., Ruscheweyh, H.-J., Schuster, S.C.: Integrative analysis of environmental sequences using MEGAN 4. Genome Research 21, 1552–1560 (2011)

[3] Ruscheweyh, H.-J., Huson, D.H.: MeganServer - Easy interactive access to large-scale metagenome data. In preparation (2015)

[4] Huson, D.H., et al.: MEGAN6 - Microbiome analysis involving hundreds of samples and billions of reads. In preparation (2015)


From Reads to OTUs. Improved Algorithms for Preprocessing Amplicon Sequencing data

Mohamed Mysara1,2,3, Natalie Leys1, Jeroen Raes2,3,4 and Pieter Monsieurs1

1 Unit of Microbiology, Belgian Nuclear Research Centre (SCK-CEN), Mol, Belgium.
2 Department of Bioscience Engineering, Vrije Universiteit Brussel, Brussels, Belgium.
3 VIB Center for the Biology of Disease, VIB, Leuven, Belgium.
4 Department of Microbiology and Immunology, REGA institute, KU Leuven, Belgium.

A major breakthrough in microbial ecology has been realized by clonal amplification of the 16S or 18S rRNA gene for the assessment of microbial diversity in a specific environment, thereby omitting the time-consuming and challenging culturing approach. This approach has been accelerated via the introduction of high-throughput sequencing technologies, leading to a dramatic increase of marker gene sequencing studies for the assessment of microbial communities.

In the most straightforward approach, the reads from different samples are pre-processed and clustered based on their sequence similarity to each other, commonly named the operational taxonomic units (OTUs) approach. Several algorithms and pipelines have been proposed to address pre-processing of raw data originating from different sequencing platforms, including 454 GS-FLX, 454 GS-FLX+, MiSeq, and PacBio. In order to assess the suitability of each of the technologies, we used different mock communities, i.e. samples with a known composition (varying between 16 and 60 species), either produced in-house or obtained from publicly available samples. Regardless of the sequencing platform used, all technologies suffer from the presence of erroneous sequences: (i) chimeras, i.e. artificial (non-biological) sequences mainly introduced by the PCR reaction during sample preparation, and (ii) sequencing errors produced by the sequencing platform itself. For both types of sequencing errors, we developed novel preprocessing algorithms to remove or correct erroneous reads.

First, a machine learning method called CATCh (Combining Algorithms to Track Chimeras) was developed, which integrates the output of existing chimera detection tools into a new, more powerful method. In a comparative study with other chimera detection tools, CATCh was shown to outperform all other tools, predicting up to 9% more chimeras than could be obtained with the best individual tool (Fig 1). Second, NoDe (Noise Detector) was introduced as an algorithm to correct existing 454 pyrosequencing errors, thereby decreasing the number of reads and nucleotides that are disregarded by the current state-of-the-art denoising algorithms (Fig 2). NoDe was benchmarked against state-of-the-art denoising algorithms and outperformed all other existing denoising tools in reduction of the error rate (reduction of 75%) and in computational cost (15 times faster than the best individual tool). Third, as the 454 pyrosequencing platform is in many microbial diversity assessments being replaced with the more cost-effective Illumina MiSeq technology, the IPED (Illumina Paired End Denoiser) algorithm was developed to handle error correction in Illumina MiSeq sequencing data, as the first tool in the field. It uses an artificial intelligence-based classifier trained to identify Illumina's errors and remove them, reducing the error rate by 73% (Fig 2).


The combined effect of improved algorithms for chimera removal and error correction had a positive effect on the clustering of reads into operational taxonomic units, with an almost perfect correlation between the number of clusters and the number of species present in the mock communities. Indeed, when applying our improved pipeline containing CATCh and NoDe to a 454 pyrosequencing mock dataset, our pipeline could reduce the number of OTUs to 28 (i.e. close to 18, the correct number of species present in the 454 pyrosequencing mock community). In contrast, running the straightforward pipeline without our algorithms included would inflate the number of OTUs to 98. Similarly, when tested on Illumina MiSeq sequencing data obtained for a mock community, using a pipeline integrating CATCh and IPED, the number of OTUs returned was 33 (i.e. close to the real number of 21 species present in the Illumina MiSeq mock community), while a much higher number of 86 OTUs was obtained using the default mothur pipeline. Our algorithms are freely available via our website, http://science.sckcen.be/en/Institutes/EHS/MCB/MIC/Bioinformatics/, and can easily be integrated into other 16S rRNA data analysis pipelines (e.g. mothur).

References

- Mysara M., Leys N., Raes J., Monsieurs P. NoDe: a fast error-correction algorithm for pyrosequencing amplicon reads. In: BMC Bioinformatics, 16:88 (2015), p. 1-15. ISSN 1471-2105
- Mysara M., Saeys Y., Leys N., Raes J., Monsieurs P. CATCh, an Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing Studies. In: Applied and Environmental Microbiology, 81:5 (2015), p. 1573-1584. ISSN 0099-2240
- Mysara M., Raes J., Leys N., Monsieurs P. IPED: A highly efficient denoising tool for Illumina paired-end 16S rRNA amplicon sequencing data. PLOS Computational Biology, submitted (2015).

Figure 1: Plot indicating the error rate spread over different read positions, after being treated with denoising algorithm (454 on the left, MiSeq on the right).

Figure 2: Plot indicating the effect of having 5% indels on the sensitivity of different tools.


Simka: Fast kmer-based method for estimating the similarity between numerous metagenomic datasets

Gaëtan BENOIT1, Pierre PETERLONGO1, Dominique LAVENIER1 and Claire LEMAITRE1

1 Inria Rennes Bretagne Atlantique, UMR 6074 Irisa, Genscale, 263 Avenue Général Leclerc, Campus de Beaulieu, 35042, Rennes, Cedex, France

Corresponding author: [email protected]

Abstract

Comparative metagenomics aims to provide high-level information based on DNA material sequenced from different environments. The purpose is mainly to estimate the proximity between two or more environmental sites at the genomic level. One way to estimate similarity is to count the number of similar DNA fragments. From a computational point of view, the problem is thus to calculate the intersections between datasets of reads. Resorting to traditional methods such as all-versus-all sequence alignment is not feasible for current metagenomic projects. For instance, the Tara Oceans project involves hundreds of datasets of more than 100M reads each.

Maillet et al. defined the following heuristic in their method called Commet [1]: two reads are considered similar if they share t non-overlapping kmers (words of length k). This method is currently the fastest but still does not scale to Tara Oceans samples.
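The heuristic can be illustrated with a minimal sketch. It is not the Commet implementation: the function names, the greedy left-to-right scan, and the choice to enforce non-overlap on the second read only are assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all words of length k occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def share_t_kmers(read_a, read_b, k, t):
    """Return True if read_b contains at least t non-overlapping kmers of read_a.

    Assumption: read_b is scanned greedily from left to right, and after each
    hit the scan jumps k positions so that counted kmers do not overlap.
    """
    index = kmers(read_a, k)
    hits = 0
    i = 0
    while i + k <= len(read_b):
        if read_b[i:i + k] in index:
            hits += 1
            if hits >= t:
                return True
            i += k  # skip past the matched kmer
        else:
            i += 1
    return hits >= t
```

With k = 4, the reads ACGTACGTTTGA and ACGTACGTCCCC share two non-overlapping kmers (ACGT twice), so they count as similar for t = 2 but not for t = 3.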

To tackle this issue, we introduce a new method, called Simka [2], which computes the similarity between two datasets based on their shared kmers. To scale to large metagenomic projects, we use the GATB library [3], which provides a kmer counting tool able to count the kmers of N datasets simultaneously. Counting kmers also offers new possibilities such as filtering out low-frequency kmers, which potentially contain sequencing errors. Simka also efficiently provides new similarity functions. The first is based on Bray-Curtis, a well-known similarity function in ecology, which reflects species abundance. The second computes the Jaccard similarity between the datasets and thus reflects the presence and absence of species.
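As a rough illustration of the two similarity functions, the sketch below computes them on kmer count tables built in memory. It does not use GATB and is not Simka's code: the function names and the `min_count` filter parameter are hypothetical, and the formulas are the textbook Bray-Curtis and Jaccard definitions, which may differ in detail from those used by Simka.

```python
from collections import Counter


def count_kmers(reads, k, min_count=1):
    """Count kmers over all reads, dropping those seen fewer than min_count times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer: c for kmer, c in counts.items() if c >= min_count}


def jaccard_similarity(a, b):
    """Presence/absence similarity: shared distinct kmers over all distinct kmers."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0  # convention chosen here for two empty datasets
    return len(a.keys() & b.keys()) / len(union)


def bray_curtis_similarity(a, b):
    """Abundance-based similarity: 2 * sum of min counts / sum of all counts."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # convention chosen here for two empty datasets
    shared = sum(min(c, b[kmer]) for kmer, c in a.items() if kmer in b)
    return 2 * shared / total
```

Setting `min_count=2` discards kmers seen only once in a dataset, which is one simple way to express the low-frequency filter mentioned above.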

Simka was tested and compared to the state of the art on 21 Tara Oceans samples. The results show that our kmer-based similarity function is very close to the read-based ones. Regarding sample proximity, the different methods identify the same clusters of datasets. The fastest state-of-the-art method required a few weeks to compute all the intersections, whereas Simka took only 4 hours.

[1] COMMET: comparing and combining multiple metagenomic datasets. N. Maillet, G. Collet, T. Vannier, D. Lavenier, P. Peterlongo. IEEE BIBM, 2014

[2] Simka: fast kmer-based method for estimating the similarity between numerous metagenomic datasets. G. Benoit, P. Peterlongo, D. Lavenier, C. Lemaitre. Hal – Inria, 2015

[3] GATB: Genome Assembly & Analysis Tool Box. E. Drezen, G. Rizk, R. Chikhi, C. Deltel, C. Lemaitre, P. Peterlongo, D. Lavenier. 10.1093/Bioinformatics/btu406, 2014


Improved computational techniques for k-mer-based metagenomic classification

Karel Brinda, Maciej Sykulski, Gregory Kucherov

Laboratoire d’Informatique Gaspard-Monge, Université Paris-Est & CNRS, France

Metagenomics is a powerful approach to study the genetic content of environmental samples that has been strongly promoted by NGS technologies. A way to improve the accuracy of metagenomic classification is to match the metagenome against as large a set of known genomic sequences as possible. With many thousands of completed microbial genomes available today, modern metagenomic projects match their samples against genomic databases of tens of billions of bp [12].

Alignment-based classifiers [9] proceed by aligning metagenome sequences to each of the known genomes from the reference database, in order to use the best alignment score as an estimator of the phylogenetic “closeness” between the sequence and the genome. While this approach can be envisaged for small datasets (both metagenome and database) and is actually used in such software tools as Megan [4] or PhymmBL [2] (see [9] for more), it is infeasible on the scale of modern metagenomic projects. On the other hand, there exists a multitude of specialized tools for aligning NGS reads – BWA [7], Novoalign (http://www.novocraft.com/), GEM [10], Bowtie [6], just to mention a few popular ones – which perform alignment at a higher speed and are adapted to the specificities of NGS-produced sequences. Still, aligning multimillion read sets against thousands of genomes remains computationally difficult even with optimized tools. Furthermore, read alignment algorithms are usually designed to compute high-scoring alignments only, and are often unable to report low-quality alignments. As a result, a large fraction of reads may remain unmapped [8].

To cope with increasingly large metagenomic projects, alignment-free methods have recently come into use. Those methods do not compute read alignments and therefore do not offer their benefits, such as gene identification. Two recently released tools – LMAT [1] and Kraken [12] – perform metagenomic classification of NGS reads based on the analysis of shared k-mers between an input read and each genome from a pre-compiled database. Given a taxonomic tree involving the species of the database, those tools “map” each read to a node of the tree, thus reporting the most specific taxon or clade that the read gets associated with. Mapping is done by sliding through all k-mers occurring in the read and determining, for each of them, the genomes of the database containing the k-mer. Based on the obtained counts and the tree topology, the algorithms of [1, 12] assign the read to the tree node “best explaining” the counts. Further similar tools have been published in recent months [11, 5].
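The mapping step can be sketched on a toy example. This is a simplified illustration, not the algorithm of LMAT or Kraken: the taxonomy, genome strings and function names are hypothetical, each k-mer is stored with the lowest common ancestor (LCA) of the genomes containing it, and "best explaining" is approximated by the highest sum of hits along the root-to-node path, with ties going to the deeper node.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {"root": None, "Bacteria": "root",
          "E.coli": "Bacteria", "B.subtilis": "Bacteria"}


def lineage(node, parent):
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two taxa in the tree."""
    ancestors = set(lineage(a, parent))
    return next(n for n in lineage(b, parent) if n in ancestors)


def build_index(genomes, k, parent):
    """Map each k-mer to the LCA of all genomes that contain it."""
    index = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            index[kmer] = lca(index[kmer], taxon, parent) if kmer in index else taxon
    return index


def classify(read, index, k, parent):
    """Assign the read to the node with the best root-to-node hit count."""
    windows = (read[i:i + k] for i in range(len(read) - k + 1))
    hits = Counter(index[w] for w in windows if w in index)
    if not hits:
        return None  # no k-mer of the read occurs in the database

    def score(node):
        path = lineage(node, parent)
        return (sum(hits[n] for n in path), len(path))

    return max(hits, key=score)
```

For example, with `genomes = {"E.coli": "ACGTACGGA", "B.subtilis": "ACGTTTTCC"}` and k = 4, the shared k-mer ACGT is stored at Bacteria, so the read ACGT is assigned to Bacteria while ACGTACGG, which also contains k-mers unique to the first genome, is assigned to E.coli.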

In this work, we report on a computational improvement of the methods of [1, 12]. One source of improvement comes from using spaced k-mers rather than contiguous k-mers. Through a series of computational experiments, we show that this can significantly increase the accuracy of metagenomic classification of NGS reads [3]. In particular, we illustrate this by a series of large-scale metagenomic classification experiments with modified Kraken software [12] extended by the possibility of dealing with spaced seeds. Experiments have been performed on databases of size 3.3 Gb to 4.1 Gb and metagenomes (both simulated and real) of 10,000 to 50,000 reads.
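To make the notion of spaced k-mers concrete, here is a minimal sketch of extracting them with a binary seed mask. The function name and the example seed 1101 are assumptions for illustration; they are not taken from the modified Kraken software or from the seeds evaluated in [3].

```python
def spaced_kmers(seq, seed):
    """Extract spaced k-mers from seq using a seed mask of '1' and '0'.

    At each window of length len(seed), only the characters aligned with a
    '1' are kept; positions aligned with '0' are ignored ("don't care").
    """
    if not seed or set(seed) - {"0", "1"}:
        raise ValueError("seed must be a non-empty string of '0' and '1'")
    keep = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    return ["".join(seq[start + i] for i in keep)
            for start in range(len(seq) - span + 1)]
```

A seed made only of 1s gives ordinary contiguous k-mers. With the seed 1101, the sequences ACGT and ACTT both yield ACT, so a mismatch at the ignored position does not prevent a match.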

We also present some other computational improvements, in particular a new indexing structure for the reference database: a tree of Bloom filters. This data structure is currently being implemented in a new software tool, and we report on its development.


References

[1] S. K. Ames, D. A. Hysom, et al. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics, 29(18):2253–2260, Sep 2013.

[2] A. Brady and S. Salzberg. PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat. Methods, 8(5):367, May 2011.

[3] K. Brinda, M. Sykulski, and G. Kucherov. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics, July 2015. 10.1093/bioinformatics/btv419.

[4] D. H. Huson, S. Mitra, et al. Integrative analysis of environmental sequences using MEGAN4. Genome Res., 21(9):1552–1560, Sep 2011.

[5] J. Kawulok and S. Deorowicz. CoMeta: Classification of Metagenomes Using k-mers. PLoS ONE, 10(4):e0121453, 2015.

[6] B. Langmead, C. Trapnell, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10(3):R25, 2009.

[7] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, Jul 2009.

[8] S. Lindgreen, K. L. Adair, and P. Gardner. An evaluation of the accuracy and speed of metagenome analysis tools. bioRxiv, 2015.

[9] S. S. Mande, M. H. Mohammed, et al. Classification of metagenomic sequences: methods and challenges. Brief. Bioinformatics, 13(6):669–681, Nov 2012.

[10] S. Marco-Sola, M. Sammeth, et al. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods, 9(12):1185–1188, Dec 2012.

[11] R. Ounit, S. Wanamaker, T. J. Close, and S. Lonardi. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16:236, 2015.

[12] D. E. Wood and S. L. Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol., 15(3):R46, 2014.
