Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority Sourav Chatterji UC Davis Genome Center [email protected]

Page 1: Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority

Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority

Sourav Chatterji
UC Davis Genome Center
[email protected]

Page 2

Background

Page 3

The Microbial World

Page 4

Exploring the Microbial World

• Culturing
  – Majority of microbes currently unculturable.
  – No ecological context.

• Molecular Surveys (e.g. 16S rRNA)
  – “Who is out there?”
  – “What are they doing?”

Page 5

Environmental Shotgun Sequencing

Page 6

Interpreting Metagenomic Data

• Nature of Metagenomic Data
  – Mosaic
  – Fragmentary
• New Sequencing Technologies
  – Enormous amount of data
  – Short reads

Page 7

Overview of Talk

• Metagenomic Binning
• PhyloMetagenomics
• The Big Picture / Future Work

Page 8

Overview of Talk

• Metagenomic Binning
  – Background
  – CompostBin [to appear in RECOMB 2008]
• PhyloMetagenomics
• The Big Picture

Page 9

Metagenomic Binning

Classification of sequences by taxa

Page 10

Current Binning Methods

• Assembly
• Alignment with a reference genome
• Database search [MEGAN, BLAST]
• Phylogenetic analysis
• DNA composition [TETRA, PhyloPythia]

Page 11

Current Binning Methods

• Need closely related reference genomes.
• Poor performance on short fragments.
  – Sanger sequence reads 500-1000 bp long.
  – Current assembly methods unreliable.
• Complex communities hard to bin.

Page 12

Genome Signatures

• Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms?
  – Yes [Karlin et al. 1990s]

• What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

Page 13

DNA-composition metrics

The K-mer Frequency Metric

CompostBin uses hexamers
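As an illustration of the K-mer frequency metric, the following is a minimal sketch that turns one read into a normalized K-mer frequency vector, with K = 6 as the default because the slide says CompostBin uses hexamers. The function name `kmer_frequencies`, the lexicographic ordering of the vector, and the choice to skip windows containing non-ACGT characters are assumptions made for this example, not details taken from the talk.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries in lexicographic order (AA..A, AA..C, ...).
    Windows containing characters other than A, C, G, T are ignored
    (an assumption of this sketch).
    """
    seq = seq.upper()
    # Count every length-k window of the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Usage: dimer frequencies of a short toy sequence.
vector = kmer_frequencies("ACGTACGT", k=2)
print(len(vector))  # 16 dimensions for k = 2
```

For hexamers the vector has 4^6 = 4096 entries per read, which is the dimensionality problem the next slide addresses.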

Page 14

DNA-composition metrics

• Working with K-mers for binning.
  – Curse of dimensionality: O(4^K) independent dimensions.
  – Statistical noise increases with decreasing fragment lengths.
• Project data into a lower-dimensional space to decrease noise.
  – Principal Component Analysis.

Page 15

PCA separates species

Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]

Page 16

Effect of Skewed Relative Abundance

B. anthracis and L. monocytogenes

Abundance 1:1 Abundance 20:1

Page 17

A Weighting Scheme

For each read, find overlap with other sequences

Page 18

A Weighting Scheme

Calculate the redundancy of each position.

4 5 5 3

Weight is inverse of average redundancy.
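The weighting rule can be sketched in a few lines. This example assumes the overlaps of a read with other reads have already been found (for instance by pairwise alignment) and are given as half-open intervals in the read's own coordinates; it also assumes that each position counts the read itself once. The function name `read_weight` and both conventions are illustrative, since the slides do not spell them out.

```python
def read_weight(read_length, overlaps):
    """Weight of one read: the inverse of its average per-position redundancy.

    overlaps: list of (start, end) half-open intervals, in this read's own
    coordinates, that are covered by other overlapping reads.
    Assumption: every position counts the read itself once, so the
    redundancy of a position is at least 1.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    redundancy = [1] * read_length
    for start, end in overlaps:
        for pos in range(max(0, start), min(read_length, end)):
            redundancy[pos] += 1
    average = sum(redundancy) / read_length
    return 1.0 / average


# Usage: a 4-position toy read whose redundancies come out as 4, 5, 5, 3,
# matching the numbers shown on the slide.
weight = read_weight(4, [(0, 4), (0, 4), (0, 3), (1, 3)])
print(weight)  # 4 / 17, the inverse of the average redundancy 4.25
```

Reads from an abundant species sit in deep coverage and receive small weights, which is what lets a rare species contribute comparably to the analysis.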

Page 19

Weighted PCA

• Calculate the weighted mean µw:

$$\mu_w = \sum_{i=1}^{N} w_i X_i$$

• Calculate the weighted covariance matrix Mw:

$$M_w = \sum_{i=1}^{N} w_i \,(X_i - \mu_w)(X_i - \mu_w)^T$$

• Principal components are eigenvectors of Mw.
  – Use first three PCs for further analysis.
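A minimal NumPy sketch of the weighted PCA on this slide follows. It assumes each row of `X` is one read's K-mer frequency vector and that the weights are normalized to sum to 1 (the slide does not state the normalization). The function name `weighted_pca` and its return values are choices made for this example, not CompostBin's actual code.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the top eigenvectors of the weighted covariance.

    X: (N, D) array, one feature vector per read.
    w: (N,) non-negative read weights, normalized here to sum to 1
       (an assumption of this sketch).
    Returns (projected, components) with shapes (N, n_components)
    and (D, n_components).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu_w = w @ X                                  # weighted mean
    centered = X - mu_w
    M_w = (centered * w[:, None]).T @ centered    # weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(M_w)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components


# Usage on random toy data: 100 "reads" with 16 features each.
rng = np.random.default_rng(0)
X = rng.random((100, 16))
projected, components = weighted_pca(X, np.ones(100))
print(projected.shape)  # (100, 3)
```

With equal weights this reduces to ordinary PCA (up to the usual 1/N versus 1/(N-1) scaling, which does not change the eigenvectors).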

Page 20

Weighted PCA separates species

B. anthracis and L. monocytogenes, abundance 20:1

PCA Weighted PCA

Page 21

Un-supervised Classification

Page 22

Semi-Supervised Classification

• 31 Marker Genes [courtesy Martin Wu]
  – Omnipresent
  – Relatively immune to lateral gene transfer

• Reads containing these marker genes can be classified with high reliability.

Page 23

Semi-supervised Classification

Use a semi-supervised version of the normalized cut algorithm

Page 24

The Semi-supervised Normalized Cut Algorithm

1. Calculate the K-nearest neighbor graph (KNN-graph) from the point set.

2. Update the KNN-graph with information from marker genes.

3. Bisect the graph using the normalized-cut algorithm.
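The three steps can be sketched with NumPy as below. Several details are assumptions for illustration, because the slides do not specify them: the marker-gene update rule (reads with the same marker-derived label get a strong edge, reads with different labels lose their edge), the `strength` value, and the small `background` affinity added so the graph is always connected. The bisection uses the standard spectral relaxation of the normalized cut (Shi and Malik), thresholding the second-smallest eigenvector at zero.

```python
import numpy as np


def knn_graph(points, k):
    """Step 1: symmetric 0/1 adjacency matrix of the k-nearest-neighbor graph."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros_like(dist)
    rows = np.repeat(np.arange(len(points)), k)
    W[rows, nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                 # keep an edge if either end chose it


def add_marker_information(W, labels, strength=10.0):
    """Step 2 (assumed rule): labels[i] is a taxon label for reads carrying
    a marker gene and None otherwise. Same label: strong edge. Different
    labels: edge removed."""
    W = W.copy()
    labeled = [i for i, lab in enumerate(labels) if lab is not None]
    for a in labeled:
        for b in labeled:
            if a != b:
                W[a, b] = strength if labels[a] == labels[b] else 0.0
    return W


def normalized_cut_bisect(W, background=1e-4):
    """Step 3: spectral relaxation of the normalized cut.

    Returns a boolean array giving the side of each point. The tiny
    background affinity is an assumption that keeps the graph connected.
    """
    n = len(W)
    W = W + background * (1.0 - np.eye(n))
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    laplacian = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(laplacian)       # eigenvalues in ascending order
    return d_inv_sqrt * vecs[:, 1] > 0


# Usage: two well-separated toy clusters, one marker-labeled read in each.
points = [(i, j) for i in range(4) for j in range(5)]
points += [(i + 20, j + 20) for i in range(4) for j in range(5)]
labels = [None] * len(points)
labels[0], labels[20] = "taxon A", "taxon B"
side = normalized_cut_bisect(add_marker_information(knn_graph(points, k=4), labels))
```

Applying the bisection again to each side yields more than two bins, which is the recursive generalization described on the following slides.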

Page 25

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Apply algorithm recursively

Page 26

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Page 27

Testing

• Simulate Metagenomic Sequencing
  – Variables
    • Number of species
    • Relative abundance
    • GC content
    • Phylogenetic diversity
• Test on a “real” dataset where the answer is well-established.
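A toy version of such a simulation is sketched below, covering three of the listed variables (number of species, relative abundance, GC content). It is deliberately simplified: the genomes are random i.i.d. sequences rather than real genomes, reads are error-free, and abundance is treated as the proportion of reads per species. The function names are hypothetical and this is not the simulation used in the talk.

```python
import random


def random_genome(length, gc_content, rng):
    """Random sequence with the requested expected GC fraction (i.i.d. bases)."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def simulate_reads(genomes, abundances, n_reads, read_length, seed=0):
    """Sample (species_index, read) pairs.

    A species is picked in proportion to its abundance, then a read of
    fixed length is cut from a uniformly random position of its genome.
    Genomes shorter than read_length raise ValueError.
    """
    rng = random.Random(seed)
    species = range(len(genomes))
    reads = []
    for _ in range(n_reads):
        s = rng.choices(species, weights=abundances, k=1)[0]
        start = rng.randrange(len(genomes[s]) - read_length + 1)
        reads.append((s, genomes[s][start:start + read_length]))
    return reads


# Usage: two species at 20:1 relative abundance with Sanger-like 800 bp reads.
rng = random.Random(1)
genomes = [random_genome(5000, 0.35, rng), random_genome(5000, 0.65, rng)]
reads = simulate_reads(genomes, abundances=[20, 1], n_reads=2000, read_length=800)
```

Because the species label of every simulated read is known, the binning output can be scored directly against it.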

Page 28
Page 29

Future Directions

• Holy Grail: Complex Communities
• Semi-supervised methods
  – More marker genes
  – Semi-supervised projection?
• Hybrid Methods
  – Assembly Information
  – Population Genetic Information

Page 30

Overview of Talk

• Metagenomic Binning
• Phylo-Metagenomics
  – Applications
  – Incorporating Alignment Accuracy
• The Big Picture / Future Work

Page 31

Garcia Martin et al., Nat. Biotechnology (2006)

Population Structure of Communities

Page 32

Yooseph et al., PLoS Biology (2007)

Gene Family Characterization

Page 33
Page 34

Manual Masking

• Requires skilled and tedious manual intervention
• Subjective and non-reproducible
• Impractical for high-throughput data
  – Frequently ignored. “Garbage in, garbage out”

Page 35

Gblocks

Page 36

Probabilistic Masking using pair-HMMs

• Probabilistic formulation of alignment problem.
• Can answer additional questions
  – Alignment Reliability
  – Sub-optimal Alignments

Durbin et al., Cambridge University Press (1998)

Page 37

Probabilistic Masking

• What is the probability that residues x_i and y_j are homologous?
• Posterior probability that residues x_i and y_j are homologous:

$$\Pr[x_i \diamond y_j \mid x, y] = \frac{\Pr[x_i \diamond y_j,\, x,\, y]}{\Pr[x, y]}$$

• Can be calculated efficiently for all pairs (and gaps) in quadratic time.
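The quadratic-time computation is the forward-backward algorithm for pair HMMs described in Durbin et al. The sketch below uses a textbook three-state model (match state M, gap states X and Y) without an explicit end state. The transition parameters `delta` and `epsilon`, the emission values, and the function name `match_posteriors` are placeholder assumptions for illustration, not the parameters used in the talk.

```python
def match_posteriors(x, y, delta=0.1, epsilon=0.3,
                     p_same=0.2125, p_diff=0.0125, q=0.25):
    """Posterior probability that x[i] and y[j] are aligned, for all i, j.

    delta: gap-open probability, epsilon: gap-extend probability.
    p_same / p_diff: match-state emission for identical / different bases.
    q: gap-state emission. All values are illustrative assumptions.
    Returns an n-by-m list of lists (0-based indices).
    """
    n, m = len(x), len(y)

    def emit(a, b):
        return p_same if a == b else p_diff

    def table():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f*[i][j] covers prefixes x[:i], y[:j] ending in that state.
    fM, fX, fY = table(), table(), table()
    fM[0][0] = 1.0                      # the begin state behaves like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = emit(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - epsilon) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + epsilon * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + epsilon * fY[i][j - 1])

    # Backward pass: b*[i][j] covers the remaining suffixes given that state.
    bM, bX, bY = table(), table(), table()
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                bM[i][j] = bX[i][j] = bY[i][j] = 1.0
                continue
            em = emit(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            ex = q * bX[i + 1][j] if i < n else 0.0
            ey = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * em + delta * (ex + ey)
            bX[i][j] = (1 - epsilon) * em + epsilon * ex
            bY[i][j] = (1 - epsilon) * em + epsilon * ey

    total = fM[n][m] + fX[n][m] + fY[n][m]      # Pr[x, y]
    return [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]


# Usage: posterior match matrix for two short toy sequences.
posterior = match_posteriors("ACGTAC", "ACTAC")
```

Both passes fill three tables of size (n+1) by (m+1), so the cost is proportional to the product of the sequence lengths.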

Page 38

Scoring Multiple Alignments

• Calculate the “posterior probability matrix” and distances d_ij between every pair of sequences.
• Weighted “sum of pairs” score for column r:

$$\frac{\sum_{i,j} d_{ij} \,\Pr[r_i \diamond r_j]}{\sum_{i,j} d_{ij}}$$
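The column score is a distance-weighted average of pairwise posteriors, which a short sketch makes concrete. It assumes the pairwise distances and the per-column posterior probabilities are already available as square matrices and that each unordered pair of sequences is counted once; the slide writes only a sum over i, j, and the function name `column_score` is hypothetical.

```python
def column_score(dist, posterior):
    """Weighted sum-of-pairs score for one alignment column.

    dist[i][j]: distance d_ij between sequences i and j.
    posterior[i][j]: posterior probability that the residues of sequences
    i and j in this column are homologous.
    Assumption: each unordered pair (i < j) is counted once.
    """
    n = len(dist)
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            numerator += dist[i][j] * posterior[i][j]
            denominator += dist[i][j]
    return numerator / denominator if denominator > 0 else 0.0


# Usage: three sequences, one column.
dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
posterior = [[1.0, 0.9, 0.5], [0.9, 1.0, 0.7], [0.5, 0.7, 1.0]]
print(column_score(dist, posterior))  # (0.9*1 + 0.5*3 + 0.7*2) / 6
```

Weighting by distance means agreement between distant sequences counts for more than agreement between near-identical ones.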

Page 39

Testing

The Balibase 3.0 Benchmark Database

Page 40

Testing

• Realign sequences using MSA programs like ClustalW.

• Sensitivity: the fraction of correctly aligned columns that are masked as good.

• Specificity: the fraction of incorrectly aligned columns that are masked as bad.
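These two definitions translate directly into code. The sketch assumes two per-column boolean lists: whether the realigned column agrees with the reference alignment, and whether the mask kept it as good. The function and argument names are hypothetical.

```python
def mask_sensitivity_specificity(correct, kept):
    """Evaluate an alignment mask column by column.

    correct[c]: True if column c is correctly aligned (per the reference).
    kept[c]:    True if the mask marks column c as good.
    Returns (sensitivity, specificity); a value is NaN when undefined.
    """
    pairs = list(zip(correct, kept))
    good_kept = sum(1 for c, k in pairs if c and k)
    bad_masked = sum(1 for c, k in pairs if not c and not k)
    n_correct = sum(1 for c, _ in pairs if c)
    n_incorrect = len(pairs) - n_correct
    sensitivity = good_kept / n_correct if n_correct else float("nan")
    specificity = bad_masked / n_incorrect if n_incorrect else float("nan")
    return sensitivity, specificity


# Usage: five columns, three of them correctly aligned.
sens, spec = mask_sensitivity_specificity(
    correct=[True, True, True, False, False],
    kept=[True, True, False, False, True],
)
print(sens, spec)  # 2/3 and 1/2
```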

Page 41

Performance

Gblocks

Prob Mask

Sensitivity Specificity

97% 93%

53% 94%

Page 42

The Final Result

A Phylogenetic Database/Pipeline (with Martin Wu)

Page 43

Overview of Talk

• Metagenomic Binning
• Phylo-Metagenomics
• The Big Picture / Future Work

Page 44

Population Structure

Venter et al., Science (2004)

How to integrate information from multiple markers?

Page 45

Species-species Interactions

Page 46

Interactions in Microbial Communities

Page 47

Time Series Data

Ruan et al., Bioinformatics (2006)

Page 48

Interaction Networks in Microbial Communities

Ruan et al., Bioinformatics (2006)

Page 49

Functional Profiling

Prediction of Gene Function

Prediction of Metabolic Pathway

Page 50

Functional Profiling (with Binning)

McCutcheon and Moran, PNAS (2007)

Page 51

Single Cell Genomics

Hutchison and Venter, Nature Biotechnology (2006)

Page 52

Single Cell Genomics

Reads From Single Cell “Simulated” Contamination

Page 53

The Big Picture

Microbial Community

Metagenomic Sampling Single Cell Genomics

Population Structure Functional Profiling

Species Interaction Network

Time Series Data

Page 54

Acknowledgements

UC Davis
• Jonathan Eisen
• Martin Wu
• Dongying Wu
• Ichitaro Yamazaki
• Amber Hartman
• Marcel Huntemann

UC Berkeley
• Lior Pachter
• Richard Karp
• Ambuj Tewari
• Narayanan Manikandan

Princeton University
• Simon Levin
• Josh Weitz
• Jonathan Dushoff