
  • Recent Computational Advances in Metagenomics (RCAM’15)

    RCAM program committee

    Institut Pasteur Tuesday, Oct. 6, 2015

    RCAM benefits from the financial support of Institut Pasteur (www.pasteur.fr) and the "Metaomics of Microbial Ecosystems (M2E)" metaprogram (www.mem.inra.fr) of the French National Institute for Agricultural Research (INRA).


  • RCAM 2015 Workshop Schedule, Tuesday 6th of October, 2015

    Time | Speaker | Title
    09:00 - 09:30 | | Welcome coffee
    09:30 - 10:30 | Martin S. Lindner (Key) | Augmenting taxonomic profiling with coverage information to improve sensitivity and specificity
    10:30 - 11:00 | Vitor C. Piro | DUDes: a top-down taxonomic profiler for metagenomics samples
    11:00 - 11:30 | Ari Ugarte | META-CLADE : a new tool to identify domains and functionally annotate metagenomic and metatranscriptomic sequences
    11:30 - 12:30 | Christopher Quince (Key) | Inferring microbial species and strains directly from metagenome data
    12:30 - 14:15 | | Lunch break
    14:15 - 15:15 | Alex L. Mitchell (Key) | The EBI Metagenomics Portal - a free to use analysis platform for metagenomic data
    15:15 - 15:45 | Maria Bernard | FROGS: Find Rapidly OTU with Galaxy Solution
    15:45 - 16:15 | Daniel H. Huson | An Efficient Pipeline for Microbiome Analysis
    16:15 - 16:45 | | Coffee break
    16:45 - 17:15 | Mohamed Mysara | From Reads to OTUs. Improved Algorithms for Preprocessing Amplicon Sequencing data
    17:15 - 17:45 | Gaëtan Benoit | Simka : Fast kmer-based method for estimating the similarity between numerous metagenomic datasets
    17:45 - 18:15 | Gregory Kucherov | Improved computational techniques for k-mer-based metagenomic classification
    18:15 | | Closing remarks


  • Contents

    1 Augmenting taxonomic profiling with coverage information to improve sensitivity and specificity (keynote)

    2 DUDes: a top-down taxonomic profiler for metagenomics samples

    3 META-CLADE : a new tool to identify domains and functionally annotate metagenomic and metatranscriptomic sequences

    4 Inferring microbial species and strains directly from metagenome data (keynote)

    5 The EBI Metagenomics Portal - a free to use analysis platform for metagenomic data (keynote)

    6 FROGS: Find Rapidly OTU with Galaxy Solution

    7 An Efficient Pipeline for Microbiome Analysis

    8 From Reads to OTUs. Improved Algorithms for Preprocessing Amplicon Sequencing data

    9 Simka : Fast kmer-based method for estimating the similarity between numerous metagenomic datasets

    10 Improved computational techniques for k-mer-based metagenomic classification


  • Augmenting taxonomic profiling with coverage information to improve sensitivity and specificity

    Martin S. Lindner (1, 2)

    (1) Robert Koch Institut, Berlin; (2) 4-Antibody, Basel

    Metagenomic samples typically consist of a mixture of genomic material from multiple (in particular microbial) organisms. One of the key challenges in metagenomics, known as taxonomic profiling, is to untangle this mixture and identify the organisms present in a sample. In this talk, I will show how genome coverage information can be used to circumvent typical pitfalls that cause low sensitivity and specificity of current approaches in difficult situations.

    Our idea was to fit mixtures of discrete probability distributions to genome coverage profiles with the Expectation Maximization algorithm. With this information, we can calculate the average coverage in the covered areas of the genomes, handle spike-like artifacts, and estimate the similarity between the reference genome and the organism in the sample. In our taxonomic profiling tool MicrobeGPS, we use this information to cluster reference genomes into groups, each representing one organism in the sample. In addition to quantitative measures such as number of reads and relative abundance, our approach provides further information on the identity and reliability of the observed organisms. This simplifies interpretation and leads to higher sensitivity and specificity of the results.
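    To make the idea of fitting a mixture of discrete distributions to a coverage profile concrete, here is a minimal sketch. It assumes a deliberately simple two-component model that is not taken from the talk: an "uncovered" component that always yields depth 0 and a "covered" component that is Poisson distributed. The function names, the simulated data and the choice of a zero-inflated Poisson are all illustrative assumptions; the mixture actually used in MicrobeGPS may differ.

```python
import math
import random


def fit_zero_inflated_poisson(coverage, n_iter=200):
    """EM fit of an assumed two-component mixture (illustrative only).

    Component 'uncovered' always produces depth 0; component 'covered'
    is Poisson(lam). Returns (covered_fraction, lam), where lam is the
    estimated average coverage within the covered regions.
    """
    n = len(coverage)
    n_zero = sum(1 for depth in coverage if depth == 0)
    total_depth = sum(coverage)
    covered_fraction = 0.5
    lam = max(total_depth / n, 1e-9)
    for _ in range(n_iter):
        # E-step: probability that a zero-depth position is nevertheless
        # part of a covered region (positions with depth > 0 are covered).
        p_covered_zero = covered_fraction * math.exp(-lam)
        denom = p_covered_zero + (1.0 - covered_fraction)
        resp_zero = p_covered_zero / denom if denom > 0 else 0.0
        # M-step: re-estimate the mixing weight and the Poisson mean.
        covered_weight = (n - n_zero) + n_zero * resp_zero
        covered_fraction = covered_weight / n
        lam = total_depth / covered_weight if covered_weight > 0 else 1e-9
    return covered_fraction, lam


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


# Simulated profile (hypothetical data): 60% of positions covered at depth ~8.
rng = random.Random(1)
coverage = [poisson_sample(rng, 8.0) if rng.random() < 0.6 else 0
            for _ in range(5000)]
covered_fraction, mean_covered_depth = fit_zero_inflated_poisson(coverage)
```

    The point of the sketch is that the naive genome-wide mean depth (about 4.8 for this simulation) underestimates the depth in the regions that are actually covered, whereas the mixture separates "how much of the reference is covered" from "how deeply".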


  • DUDes: a top-down taxonomic profiler for metagenomics samples

    Piro, Vitor C., Lindner, Martin S., Renard, Bernhard Y.

    Research Group Bioinformatics (NG4), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany

    The rapid increase in complete genome sequences available in public databases has allowed better predictions of the microbial content of sequenced environmental and clinical samples. The identification of species and their quantification are common tasks in metagenomics and pathogen detection studies. The most recent techniques are built on mapping the sequenced reads against a reference database (e.g., whole genomes, marker genes, proteins) and performing further analysis. Although these methods have proved useful in many scenarios, there is still room for improvement in species- and strain-level detection, mainly for low-abundance organisms. We propose a new method: DUDes, a reference-based taxonomic profiler that introduces a novel top-down approach to analyze metagenomic NGS samples. Unlike the main taxonomic profiling tools, which base their predictions on estimated abundances in the sample, DUDes does not use abundances directly as a means of identification. Our method identifies possible candidates by comparing the strength of the read mapping in each node of the taxonomic tree in an iterative manner. Instead of using the lowest common ancestor (LCA), a commonly used bottom-up approach to solve ambiguities in identifications, we propose a new approach: the deepest uncommon descendent (DUD). Whereas the LCA method resolves ambiguous identifications by going back one taxonomic level to the lowest common ancestor, the DUD approach starts at the root node and tries to go to deeper taxonomic levels, even when ambiguities are found. This allows less conservative identifications at higher taxonomic levels. Moreover, when the provided data does not allow a specific identification at higher levels, the method can identify a set of probable candidates. Permutation tests are performed to estimate p-values between nodes and to decide which nodes are present at each level. We showed in experiments that DUDes works for single and multiple organisms and can identify low-abundance taxonomic groups with high precision.
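    The top-down idea can be illustrated with a toy sketch. This is not the DUDes implementation: the tree, the per-node mapping scores, the one-sided permutation test on mean scores and the 0.05 threshold are all assumptions made for illustration. Starting at the root, the walk descends into a child only when its mapping scores are significantly stronger than those of every sibling; otherwise it stops and reports the siblings that cannot be separated as a candidate set.

```python
import random
from statistics import mean


def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """One-sided p-value for mean(a) > mean(b), obtained by shuffling labels."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean(pooled[:len(a)]) - mean(pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def top_down(children, scores, root, alpha=0.05):
    """Walk from the root towards the leaves while one child clearly wins.

    Returns (deepest_node_reached, candidates). `candidates` holds a single
    node when the walk ends unambiguously, or several sibling nodes when the
    data cannot separate them.
    """
    node = root
    while children.get(node):
        ranked = sorted(children[node], key=lambda c: mean(scores[c]),
                        reverse=True)
        best = ranked[0]
        # Siblings that are NOT significantly weaker than the best child.
        tied = [c for c in ranked[1:]
                if permutation_pvalue(scores[best], scores[c]) >= alpha]
        if tied:
            return node, [best] + tied
        node = best
    return node, [node]


# Hypothetical taxonomy and per-read mapping scores (illustrative values).
children = {"root": ["Bacteria", "Archaea"],
            "Bacteria": ["E. coli", "S. enterica"]}
scores = {
    "Bacteria": [0.90, 0.95, 0.92, 0.88, 0.91, 0.93, 0.90, 0.94],
    "Archaea": [0.20, 0.25, 0.22, 0.18, 0.21, 0.23, 0.20, 0.24],
    "E. coli": [0.90, 0.80, 0.85, 0.95, 0.90, 0.80],
    "S. enterica": [0.88, 0.82, 0.86, 0.93, 0.90, 0.80],
}
result = top_down(children, scores, "root")
```

    In this toy data the walk moves from the root into Bacteria, because its scores clearly dominate those of Archaea, and then stops with both species as candidates because their scores are statistically indistinguishable. An LCA-style resolution of the same ambiguity would report only Bacteria.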

  • META-CLADE : a new tool to identify domains and functionally annotate metagenomic and metatranscriptomic sequences.

    Ari Ugarte, Juliana Bernardes, Alessandra Carbone. Laboratory of Computational and Quantitative Biology (LCQB) UMR 7238 CNRS - Université Pierre et Marie Curie, 75006, Paris, France.

    Improvements in next-generation sequencing have allowed researchers to study the genomic diversity of microbial communities. The increased complexity of metagenomics data poses computational challenges in assembling, annotating, and classifying genomic fragments from multiple organisms. Domain identification provides insights into the biological function of a protein. Hence, domain annotation is a crucial step in identifying and quantifying the genes in a microbial community that are known and those that are completely new. Traditional protein annotation methods describe known domains with probabilistic models representing the consensus among homologous domain sequences. When relevant signals become too weak to be identified by consensus, annotation attempts fail. CLADE [1], a new method for protein domain identification that achieves highly accurate predictions for single genomes compared to the HMMER methodology [2], is based on the observation that many structural and functional protein constraints are not globally conserved across all species but might be locally conserved in separate clades. CLADE uses an extension of the probabilistic model library to characterize local models and improve signal detection. CLADE has been used to develop META-CLADE [3], a new protein domain annotation tool for metagenomics and metatranscriptomics. To evaluate META-CLADE performance, we simulated a dataset containing 500,000 reads from Roche's 454 FLX Titanium sequencer. We built this data set from 40 complete marine bacterial and archaeal genome sequences, assuming equal abundance. Genes predicted by FragGeneScan [4] in the simulated reads were translated to proteins and annotated with META-CLADE and HMMER using the Pfam27 [5] domain database. META-CLADE identifies substantially more domains than HMMER in the simulated reads (~30% more detected domains). Besides the improvement in domain recognition, META-CLADE agrees with 99.5% of HMMER domain predictions and reinforces the signal of agreed domains. To show that this new method is suitable for real data, it was applied to 5 data sets collected from 5 different ocean stations and containing unicellular marine eukaryotic metatranscriptomic sequences [6]. META-CLADE outperforms the HMMER methodology in domain recognition and in signal detection in agreed domains for all data sets. Domains identified by each method were mapped for functional annotation using Pfam2GO [7], and a list of GO Terms [8,9] was obtained for each sample. META-CLADE allows extending the list of significant GO Terms. Moreover, it permits to have a better resolution of significant GO
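    The functional annotation step mentioned above, mapping identified Pfam domains to GO Terms, can be sketched as follows. This is an illustration, not the authors' pipeline: the sample lines are written in the style of the publicly distributed pfam2go mapping file and should be checked against an actual release, and the domain hits and function names are hypothetical.

```python
from collections import Counter

# Illustrative lines in the style of the pfam2go mapping file (assumption:
# "Pfam:<accession> <name> > GO:<term name> ; <GO id>", "!" starts a comment).
SAMPLE_PFAM2GO = """\
!sample lines for illustration only
Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor activity ; GO:0004930
Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor signaling pathway ; GO:0007186
Pfam:PF00005 ABC_tran > GO:ATP binding ; GO:0005524
"""


def parse_pfam2go(text):
    """Return {pfam_accession: [go_id, ...]} from pfam2go-style text."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue
        left, right = line.split(" > ", 1)
        pfam_id = left.split()[0].split(":", 1)[1]
        go_id = right.rsplit(" ; ", 1)[1].strip()
        mapping.setdefault(pfam_id, []).append(go_id)
    return mapping


def count_go_terms(domain_hits, mapping):
    """Count GO terms over a list of Pfam accessions found in one sample.

    Accessions without a GO mapping are skipped.
    """
    counts = Counter()
    for pfam_id in domain_hits:
        counts.update(mapping.get(pfam_id, []))
    return counts


mapping = parse_pfam2go(SAMPLE_PFAM2GO)
# Hypothetical domain hits for one sample; PF99999 has no mapping here.
go_counts = count_go_terms(["PF00001", "PF00005", "PF00005", "PF99999"], mapping)
```

    Running the same counting on the domain lists produced by two annotation tools gives per-sample GO Term profiles that can be compared directly.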
