Post on 05-Jan-2016
CompostBin : A DNA composition based metagenomic binning algorithm
Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen
UC Davis schatterji@ucdavis.edu
Overview of Talk
o Metagenomics and the binning problem.
o CompostBin
The Microbial World
Exploring the Microbial World
Culturing: majority of microbes currently unculturable; no ecological context.
Molecular surveys (e.g. 16S rRNA): “who is out there?” “what are they doing?”
Metagenomics
Interpreting Metagenomic Data
Nature of metagenomic data
o Mosaic
o Intraspecies polymorphism
o Fragmentary
New sequencing technologies
o Enormous amount of data
o Short reads
Metagenomic Binning
Classification of sequences by taxa
Binning in Action
Glassy Winged Sharpshooter (Homalodisca coagulata).
Feeds on plant xylem (poor in organic nutrients).
Microbial Endosymbionts
Current Binning Methods
o Assembly
o Align with reference genome
o Database search [MEGAN, BLAST]
o Phylogenetic analysis
o DNA composition [TETRA, PhyloPythia]
Current Binning Methods
o Need closely related reference genomes.
o Poor performance on short fragments.
o Sanger sequence reads are 500-1000 bp long.
o Current assembly methods are unreliable.
o Complex communities are hard to bin.
Overview of Talk
o Metagenomics and the binning problem.
o CompostBin
Genome Signatures
Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? Yes [Karlin et al. 1990s]
What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?
Imperfect World
Horizontal gene transfer
o Recent estimates [Ge et al. 2005]: varies between 0-6% of genes, typically ~2%.
o But… amelioration
DNA-composition metrics
The K-mer frequency metric: CompostBin uses hexamers.
Working with K-mers for binning
o Curse of dimensionality: O(4^K) independent dimensions.
o Statistical noise increases with decreasing fragment lengths.
o Project data into a lower dimensional space to decrease noise.
o Principal Component Analysis.
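As a rough illustration of the K-mer metric and the projection step, the Python sketch below counts hexamers in each read and projects the resulting 4^6 = 4096-dimensional frequency vectors onto the first few principal components. The function names, the normalization by total count, and the use of NumPy's SVD are assumptions made for this example, not details taken from the talk.

```python
from itertools import product

import numpy as np

K = 6  # the talk states that CompostBin uses hexamers
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(seq):
    """Return the normalized hexamer count vector (length 4**K) of one read."""
    counts = np.zeros(len(KMERS))
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        j = INDEX.get(seq[i:i + K])  # k-mers with N or other symbols are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def pca_project(X, n_components=3):
    """Project the rows of X onto the top principal components (plain PCA)."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Each read becomes one row of `X`; `pca_project(X)` then returns one low-dimensional point per read. Shorter reads contribute fewer hexamers per vector, which is where the statistical noise mentioned above comes from.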
PCA separates species
Gluconobacter oxydans [65% GC] and Rhodospirillum rubrum [61% GC]
Effect of Skewed Relative Abundance
B. anthracis and L. monocytogenes
Figure panels: abundance 1:1; abundance 20:1
A Weighting Scheme
For each read, find overlap with other sequences.
Calculate the redundancy of each position (figure example: 4 5 5 3).
Weight is inverse of average redundancy.
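A minimal sketch of the weighting rule, in Python. The interval representation of overlaps and the choice to count the read itself once at every position are assumptions for illustration; the talk only states that the weight is the inverse of the average redundancy.

```python
import numpy as np


def position_redundancy(read_length, overlaps):
    """Per-position redundancy of one read.

    `overlaps` is a list of half-open (start, end) intervals on this read,
    one per overlapping sequence (a hypothetical input format). The read
    itself is counted once at every position (an assumption).
    """
    redundancy = np.ones(read_length)
    for start, end in overlaps:
        redundancy[start:end] += 1
    return redundancy


def read_weight(redundancy):
    """Weight of a read: the inverse of its average per-position redundancy."""
    return 1.0 / np.mean(redundancy)
```

With the redundancies 4, 5, 5, 3 from the slide, the average is 4.25, so `read_weight([4, 5, 5, 3])` is 1/4.25, roughly 0.235. Reads from an abundant species overlap many others and are therefore down-weighted.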
Weighted PCA
Calculate the weighted mean µw:
$\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$
Calculate the weighted covariance matrix Mw:
$M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$
PCs are eigenvectors of Mw. Use first three PCs for further analysis.
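The weighted PCA step can be sketched in Python as follows. It assumes the weighted mean is the weight-scaled sum of the data vectors divided by N, and the weighted covariance matrix is the weight-scaled sum of outer products of the centered vectors, which is one reading of the slide's formulas. The function name and the use of `numpy.linalg.eigh` are choices made for this example.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the leading eigenvectors of the weighted covariance.

    X: (N, d) array, one composition vector per read.
    w: (N,) array of read weights.
    Returns (projected points, principal components as columns).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    mu_w = (w[:, None] * X).sum(axis=0) / n        # weighted mean
    centered = X - mu_w
    M_w = (w[:, None] * centered).T @ centered     # weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(M_w)         # M_w is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]
    pcs = eigvecs[:, order]
    return centered @ pcs, pcs
```

With all weights equal to 1 this reduces to ordinary PCA, so any difference between the two projections comes from the weights alone.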
Weighted PCA separates species
B. anthracis and L. monocytogenes, abundance 20:1
Figure panels: PCA; weighted PCA
Un-supervised Classification?
Semi-Supervised Classification
31 marker genes [courtesy Martin Wu]
o Omni-present
o Relatively immune to lateral gene transfer
Reads containing these marker genes can be classified with high reliability.
Semi-supervised Classification
Use a semi-supervised version of the normalized cut algorithm
The Semi-supervised Normalized Cut Algorithm
1. Calculate the K-nearest neighbor graph from the point set.
2. Update graph with marker information.
   o If two nodes are from the same species, add an edge between them.
   o If two nodes are from different species, remove any edge between them.
3. Bisect the graph using the normalized-cut algorithm.
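The three steps can be sketched in Python as below. This is an illustrative version under stated assumptions, not the authors' implementation: edges are unweighted, the bisection uses the standard spectral relaxation of normalized cut (second eigenvector of the normalized Laplacian), and the split point is chosen by sweeping thresholds for the smallest cut value. All function names are hypothetical.

```python
import numpy as np


def knn_graph(points, k):
    """Step 1: symmetric 0/1 adjacency matrix of the k-nearest-neighbor graph."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    A = np.zeros_like(dist)
    for i in range(len(points)):
        A[i, np.argsort(dist[i])[:k]] = 1.0
    return np.maximum(A, A.T)


def apply_markers(A, labels):
    """Step 2: labels[i] is a species id for marker-bearing reads, else None."""
    A = A.copy()
    marked = [i for i, lab in enumerate(labels) if lab is not None]
    for i in marked:
        for j in marked:
            if i != j:
                # Same species: force an edge. Different species: remove it.
                A[i, j] = 1.0 if labels[i] == labels[j] else 0.0
    return A


def ncut_value(A, side):
    """Normalized-cut cost of splitting the nodes into `side` and its complement."""
    cut = A[side][:, ~side].sum()
    vol_a, vol_b = A[side].sum(), A[~side].sum()
    if vol_a == 0 or vol_b == 0:
        return np.inf
    return cut / vol_a + cut / vol_b


def normalized_cut_bisect(A):
    """Step 3: spectral relaxation of normalized cut; returns a boolean side per node."""
    degree = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)      # eigenvalues in ascending order
    y = d_inv_sqrt * vecs[:, 1]          # relaxed cluster indicator
    candidates = np.sort(y)[:-1]         # thresholds leaving both sides non-empty
    best = min(candidates, key=lambda t: ncut_value(A, y > t))
    return y > best
```

A single bisection is then `normalized_cut_bisect(apply_markers(knn_graph(points, k), labels))`, where `points` are the projected reads and `labels` come from reads carrying marker genes.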
Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
Apply algorithm recursively
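One way to read "apply recursively" is sketched below in Python. The talk does not say which bin is split next or when to stop; this sketch assumes the target number of bins is known and always splits the largest bin, and `bisect` stands for any two-way splitter, such as a normalized-cut bisection. These choices are assumptions for illustration.

```python
def recursive_binning(items, bisect, n_bins):
    """Split `items` into up to `n_bins` groups by repeated bisection.

    `bisect(group)` must return two lists that partition `group`.
    The largest bin is split first (an assumed policy).
    """
    bins = [list(items)]
    while len(bins) < n_bins:
        largest = max(bins, key=len)
        if len(largest) < 2:
            break  # nothing left that can be split
        left, right = bisect(largest)
        if not left or not right:
            break  # the splitter could not separate this bin
        bins.remove(largest)
        bins.extend([left, right])
    return bins


def split_at_mean(values):
    """Toy splitter for demonstration: values below the mean vs. the rest."""
    mean = sum(values) / len(values)
    return [v for v in values if v < mean], [v for v in values if v >= mean]
```

For example, `recursive_binning([1, 2, 3, 10, 11, 12, 20, 21, 22], split_at_mean, 3)` returns three bins that together contain every input value exactly once.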
Testing
Simulate metagenomic sequencing
o Sanger reads
o Variables: number of species, relative abundance, GC content, phylogenetic diversity
Test on a “real” dataset where answer is well-established.
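A simulated test set of this kind could be generated along the following lines. This Python sketch assumes error-free reads, read lengths drawn uniformly from the 500-1000 bp Sanger range mentioned earlier, and a species chosen for each read with probability proportional to its relative abundance (ignoring genome size). The function name and parameters are hypothetical, and the talk does not describe the simulator actually used.

```python
import random


def simulate_reads(genomes, abundance, n_reads, min_len=500, max_len=1000, seed=0):
    """Sample (species, read) pairs from genomes according to relative abundance.

    genomes:   dict mapping species name to its genome sequence.
    abundance: dict mapping species name to a relative abundance (e.g. 20 and 1).
    """
    rng = random.Random(seed)
    names = list(genomes)
    weights = [abundance[name] for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = genomes[name]
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        reads.append((name, genome[start:start + length]))
    return reads
```

Because each read keeps its source species, the binning result can be scored directly against the known answer.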
Results
Conclusions/Future Directions
Satisfactory performance
o No training on existing genomes
o Sanger reads
o Low number of species
Future work
o Holy Grail: complex communities
o Semi-supervised projection?
o Hybrid assembly/binning
Acknowledgements
UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann
UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan
Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff