Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority Sourav Chatterji UC Davis Genome Center [email protected]

Background

The Microbial World

Exploring the Microbial World

• Culturing
  – Majority of microbes currently unculturable.
  – No ecological context.

• Molecular Surveys (e.g. 16S rRNA)
  – “who is out there?”
  – “what are they doing?”

Environmental Shotgun Sequencing

Interpreting Metagenomic Data

• Nature of Metagenomic Data
  – Mosaic
  – Fragmentary

• New Sequencing Technologies
  – Enormous amount of data
  – Short Reads

Overview of Talk

• Metagenomic Binning
• PhyloMetagenomics
• The Big Picture / Future Work

Overview of Talk

• Metagenomic Binning
  – Background
  – CompostBin [to appear in RECOMB 2008]
• PhyloMetagenomics
• The Big Picture

Metagenomic Binning

Classification of sequences by taxa

Current Binning Methods

• Assembly
• Align with Reference Genome
• Database Search [MEGAN, BLAST]
• Phylogenetic Analysis
• DNA Composition [TETRA, PhyloPythia]

Current Binning Methods

• Need closely related reference genomes.
• Poor performance on short fragments.
  – Sanger sequence reads 500-1000 bp long.
  – Current assembly methods unreliable.
• Complex Communities Hard to Bin.

Genome Signatures

• Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms?
  – Yes [Karlin et al. 1990s]

• What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

DNA-composition metrics

The K-mer Frequency Metric

CompostBin uses hexamers

DNA-composition metrics

• Working with K-mers for Binning.
  – Curse of Dimensionality: O(4^K) independent dimensions.
  – Statistical noise increases with decreasing fragment lengths.
• Project data into a lower dimensional space to decrease noise.
  – Principal Component Analysis.
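
The two steps above (K-mer frequency vectors, then PCA) can be sketched as follows. This is a minimal illustration, not the CompostBin code: the function names `kmer_frequencies` and `pca_project` are hypothetical, windows containing non-ACGT characters are simply skipped, and reverse complements are not merged, which a real tool may well do.

```python
import itertools
import numpy as np

def kmer_frequencies(seq, k=6):
    """Return the 4^k-dimensional vector of k-mer frequencies of a DNA string."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # None for windows with N or other symbols
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def pca_project(X, n_components=3):
    """Project the rows of X onto the top principal components (unweighted PCA)."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

With hexamers (k=6) each read becomes a 4096-dimensional vector, which is why short reads give noisy estimates and why a projection to a few components helps.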

PCA separates species

Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]

Effect of Skewed Relative Abundance

B. anthracis and L. monocytogenes

Abundance 1:1 Abundance 20:1

A Weighting Scheme

For each read, find overlap with other sequences

A Weighting Scheme

Calculate the redundancy of each position (figure example: 4, 5, 5, 3).

Weight is the inverse of the average redundancy.
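
As a worked example of the weighting rule, the sketch below computes a read's weight from the intervals where other reads overlap it. Several things here are assumptions made for illustration: the function name `read_weight`, the representation of overlaps as half-open `(start, end)` intervals in read coordinates, counting the read itself once at every position, and reading the figure values 4, 5, 5, 3 as per-position redundancies.

```python
import numpy as np

def read_weight(read_length, overlaps):
    """Weight of a read = 1 / (average per-position redundancy).

    overlaps: list of half-open (start, end) intervals, in this read's
    coordinates, covered by other reads (assumed to be precomputed).
    """
    redundancy = np.ones(read_length)  # assumption: the read covers itself once
    for start, end in overlaps:
        redundancy[start:end] += 1
    return 1.0 / redundancy.mean()

# Redundancies 4, 5, 5, 3 average to 4.25, so the weight is 1 / 4.25.
example = read_weight(4, [(0, 4), (0, 4), (0, 3), (1, 3)])
```

Reads from an abundant species overlap many other reads and are down-weighted, so they no longer dominate the principal components.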

Weighted PCA

• Calculate the weighted mean µw:

$$\mu_w = \sum_{i=1}^{N} w_i X_i$$

• Calculate the weighted covariance matrix Mw:

$$M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$$

• Principal Components are eigenvectors of Mw.
  – Use first three PCs for further analysis.
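
A minimal NumPy sketch of the weighted PCA described on this slide follows. It assumes the weights are normalized to sum to 1 (otherwise the weighted mean above would not be a mean), and the function name `weighted_pca` is hypothetical.

```python
import numpy as np

def weighted_pca(X, w, n_components=3):
    """Weighted PCA: rows of X are data points, w holds one weight per row."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # assumption: weights normalized to sum to 1
    mu = w @ X                           # weighted mean
    centered = X - mu
    cov = (centered * w[:, None]).T @ centered  # weighted covariance matrix
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    components = vecs[:, order]          # columns are the principal components
    return centered @ components, components, mu
```

Setting all weights equal recovers ordinary PCA, up to a constant scaling of the covariance matrix.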

Weighted PCA separates species

B. anthracis and L. monocytogenes, abundance 20:1

PCA Weighted PCA

Un-supervised Classification

Semi-Supervised Classification

• 31 Marker Genes [courtesy Martin Wu]
  – Omnipresent
  – Relatively Immune to Lateral Gene Transfer

• Reads containing these marker genes can be classified with high reliability.

Semi-supervised Classification

Use a semi-supervised version of the normalized cut algorithm

The Semi-supervised Normalized Cut Algorithm

1. Calculate the K-nearest neighbor graph (KNN-graph) from the point set.

2. Update the KNN-graph with information from marker genes.

3. Bisect the graph using the normalized-cut algorithm.
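
The three steps can be sketched as below. This is an illustration under stated assumptions, not the CompostBin implementation: the slides do not say how marker-gene information updates the graph, so the sketch assumes an edge is added between marker-labelled reads of the same taxon and removed between reads of different taxa. The bisection uses the standard spectral relaxation of the normalized cut (sign of the second eigenvector of the normalized Laplacian). All function names are hypothetical.

```python
import numpy as np

def knn_graph(points, k):
    """Step 1: symmetric, unweighted k-nearest-neighbour adjacency matrix."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)      # a point is not its own neighbour
    adj = np.zeros_like(dists)
    for i, row in enumerate(dists):
        adj[i, np.argsort(row)[:k]] = 1.0
    return np.maximum(adj, adj.T)

def apply_marker_labels(adj, labels):
    """Step 2 (assumed rule): link same-taxon marker reads, unlink different ones.

    labels: dict mapping point index -> taxon label for reads with marker genes.
    """
    adj = adj.copy()
    marked = list(labels)
    for a in marked:
        for b in marked:
            if a != b:
                adj[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    return adj

def normalized_cut_bisect(adj):
    """Step 3: spectral approximation of the normalized cut; returns a boolean side per point."""
    deg = np.maximum(adj.sum(axis=1), 1e-12)   # guard against isolated nodes
    inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)              # eigenvalues in ascending order
    fiedler = inv_sqrt * vecs[:, 1]            # second-smallest eigenvector
    return fiedler > 0
```

The sketch assumes the updated graph is connected. If removing edges disconnects it, the second eigenvector is no longer unique, and splitting by connected components first would be the safer choice.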

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Apply algorithm recursively

Generalization to multiple bins

Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Testing

• Simulate Metagenomic Sequencing
  – Variables
    • Number of species
    • Relative abundance
    • GC content
    • Phylogenetic Diversity

• Test on a “real” dataset where answer is well-established.
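
A toy version of such a simulation is sketched below, covering three of the variables listed (number of species, relative abundance, GC content). It is only an illustration: genomes are random strings rather than real sequences, the probability of sampling a species is assumed proportional to abundance times genome length, reads are error-free, and the names `random_genome` and `simulate_reads` are hypothetical.

```python
import numpy as np

def random_genome(length, gc, rng):
    """Random DNA string with the requested expected GC fraction."""
    probs = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choice(list("ACGT"), size=length, p=probs))

def simulate_reads(genomes, abundance, n_reads, read_len, rng):
    """Sample labelled, error-free reads from a mixture of genomes.

    genomes: dict name -> sequence; abundance: dict name -> relative cell count.
    """
    names = list(genomes)
    # Assumption: a species contributes DNA in proportion to abundance x genome length.
    mass = np.array([abundance[s] * len(genomes[s]) for s in names], dtype=float)
    picks = rng.choice(len(names), size=n_reads, p=mass / mass.sum())
    reads = []
    for idx in picks:
        genome = genomes[names[idx]]
        start = rng.integers(0, len(genome) - read_len + 1)
        reads.append((names[idx], genome[start:start + read_len]))
    return reads
```

Because every read keeps its true species label, the output of a binning method can be scored directly against it.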

Future Directions

• Holy Grail: Complex Communities
• Semi-supervised methods
  – More marker genes
  – Semi-supervised projection?
• Hybrid Methods
  – Assembly Information
  – Population Genetic Information

Overview of Talk

• Metagenomic Binning
• Phylo-Metagenomics
  – Applications
  – Incorporating Alignment Accuracy
• The Big Picture / Future Work

Garcia Martin et al., Nat. Biotechnology (2006)

Population Structure of Communities

Yooseph et al., PLoS Biology (2007)

Gene Family Characterization

Manual Masking

• Requires skilled and tedious manual intervention
• Subjective and non-reproducible
• Impractical for high-throughput data
  – Frequently ignored. “Garbage in, garbage out”

Gblocks

Probabilistic Masking using pair-HMMs

• Probabilistic formulation of alignment problem.

• Can answer additional questions
  – Alignment Reliability
  – Sub-optimal Alignments

Durbin et al., Cambridge University Press (1998)

Probabilistic Masking

• What is the probability that residues xi and yj are homologous?

• Posterior probability that residues xi and yj are homologous, given the sequences x and y:

$$\Pr[x_i \diamond y_j] = \frac{\Pr[x_i \diamond y_j,\ x,\ y]}{\Pr[x, y]}$$

• Can be calculated efficiently for all pairs (and gaps) in quadratic time.
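
The quadratic-time computation is the forward-backward algorithm on a pair-HMM. The sketch below follows the three-state model (match M, gap states X and Y) described in Durbin et al.; the transition parameters `delta`, `eps`, `tau` and the DNA emission values are arbitrary illustrative choices, not values from the talk, and the function name `match_posteriors` is hypothetical.

```python
import numpy as np

def match_posteriors(x, y, delta=0.1, eps=0.3, tau=0.05, p_match=0.2, q=0.25):
    """Posterior probability that x[i] and y[j] are aligned, for all i, j."""
    n, m = len(x), len(y)
    p_mis = (1.0 - 4.0 * p_match) / 12.0     # so the 16 pair emissions sum to 1
    emit = lambda a, b: p_match if a == b else p_mis
    t_mm = 1.0 - 2.0 * delta - tau           # M -> M
    t_gm = 1.0 - eps - tau                   # X -> M and Y -> M

    # Forward pass: f*[i, j] covers the prefixes x[:i], y[:j].
    fM = np.zeros((n + 1, m + 1)); fX = np.zeros_like(fM); fY = np.zeros_like(fM)
    fM[0, 0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i, j] = emit(x[i - 1], y[j - 1]) * (
                    t_mm * fM[i - 1, j - 1] + t_gm * (fX[i - 1, j - 1] + fY[i - 1, j - 1]))
            if i > 0:
                fX[i, j] = q * (delta * fM[i - 1, j] + eps * fX[i - 1, j])
            if j > 0:
                fY[i, j] = q * (delta * fM[i, j - 1] + eps * fY[i, j - 1])
    total = tau * (fM[n, m] + fX[n, m] + fY[n, m])   # Pr[x, y]

    # Backward pass: b*[i, j] covers the suffixes x[i:], y[j:].
    bM = np.zeros((n + 1, m + 1)); bX = np.zeros_like(bM); bY = np.zeros_like(bM)
    bM[n, m] = bX[n, m] = bY[n, m] = tau
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            diag = emit(x[i], y[j]) * bM[i + 1, j + 1] if i < n and j < m else 0.0
            down = q * bX[i + 1, j] if i < n else 0.0
            right = q * bY[i, j + 1] if j < m else 0.0
            bM[i, j] = t_mm * diag + delta * (down + right)
            bX[i, j] = t_gm * diag + eps * down
            bY[i, j] = t_gm * diag + eps * right

    assert np.isclose(total, bM[0, 0])       # both passes must agree on Pr[x, y]
    return fM[1:, 1:] * bM[1:, 1:] / total, total
```

For real sequences the products underflow quickly, so a practical version would work in log space or rescale each row.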

Scoring Multiple Alignments

• Calculate the “posterior probability matrix” and distances dij between every pair of sequences.

• Weighted “sum of pairs” score for column r:

$$\frac{\sum_{i,j} d_{ij} \Pr[r_i \diamond r_j]}{\sum_{i,j} d_{ij}}$$
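
The column score is a distance-weighted average of pairwise posteriors, which the following sketch computes for one column. It assumes the posteriors for the residues in that column have already been collected into a symmetric matrix, and it sums over unordered pairs i < j (for symmetric inputs this gives the same ratio as summing over all i ≠ j). The function name `column_score` is hypothetical.

```python
import numpy as np

def column_score(post, dist):
    """Weighted sum-of-pairs score for one alignment column.

    post[i, j]: posterior that the residues of sequences i and j in this
                column are homologous (assumed precomputed).
    dist[i, j]: pairwise distance d_ij used as the weight.
    """
    pairs = np.triu_indices(len(dist), k=1)   # unordered pairs i < j
    return float((dist[pairs] * post[pairs]).sum() / dist[pairs].sum())
```

Weighting by distance means that agreement between distantly related sequences counts for more than agreement between near-identical ones.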

Testing

The BAliBASE 3.0 Benchmark Database

Testing

• Realign sequences using MSA programs such as ClustalW.

• Sensitivity: for all correctly aligned columns, the fraction that has been masked as good

• Specificity: for all incorrectly aligned columns, the fraction that has been masked as bad
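
The two definitions translate directly into code. In this sketch, `correct` marks columns that agree with the benchmark alignment and `kept` marks columns the masking method labelled as good; both names, and the function name, are assumptions made for illustration.

```python
def masking_sensitivity_specificity(correct, kept):
    """Sensitivity and specificity of an alignment mask.

    correct[c]: True if column c is correctly aligned (per the benchmark).
    kept[c]:    True if the mask labelled column c as good.
    """
    n_correct = sum(correct)
    n_incorrect = len(correct) - n_correct
    kept_correct = sum(1 for c, k in zip(correct, kept) if c and k)
    dropped_incorrect = sum(1 for c, k in zip(correct, kept) if not c and not k)
    sensitivity = kept_correct / n_correct if n_correct else float("nan")
    specificity = dropped_incorrect / n_incorrect if n_incorrect else float("nan")
    return sensitivity, specificity
```

For example, with three correct columns of which two are kept, and two incorrect columns of which one is masked out, the function returns a sensitivity of 2/3 and a specificity of 1/2.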

Performance

Gblocks

Prob Mask

Sensitivity Specificity

97% 93%

53% 94%

The Final Result

A Phylogenetic Database/Pipeline (with Martin Wu)

Overview of Talk

• Metagenomic Binning
• Phylo-Metagenomics
• The Big Picture / Future Work

Population Structure

Venter et al., Science (2004)

How to integrate information from multiple markers?

Species-species Interactions

Interactions in Microbial Communities

Time Series Data

Ruan et al., Bioinformatics (2006)

Interaction Networks in Microbial Communities

Ruan et al., Bioinformatics (2006)

Functional Profiling

Prediction of Gene Function Prediction of Metabolic Pathway

Functional Profiling (with Binning)

McCutcheon and Moran, PNAS (2007)

Single Cell Genomics

Hutchison and Venter, Nature Biotechnology (2006)

Single Cell Genomics

Reads From Single Cell “Simulated” Contamination

The Big Picture

Microbial Community

Metagenomic Sampling Single Cell Genomics

Population Structure Functional Profiling

Species Interaction Network

Time Series Data

Acknowledgements

UC Davis
• Jonathan Eisen
• Martin Wu
• Dongying Wu
• Ichitaro Yamazaki
• Amber Hartman
• Marcel Huntemann

UC Berkeley
• Lior Pachter
• Richard Karp
• Ambuj Tewari
• Narayanan Manikandan

Princeton University
• Simon Levin
• Josh Weitz
• Jonathan Dushoff