introduction to metagenomics
TRANSCRIPT
GACTGACTGACTGACTGACT
ACTGACGACTGA
CTGACTGACTGACT
ACTGACGACTGA
CTGACTGACTGACT
ACTGACGACTGA
CTGACTGACTGACT
ACTGACGACTGA
CTGACTGACTGACT
ACTGACGACTGA
CTGACTGACTGACT
ACTGACGACTGA
CTGACTG
Introduction to metagenomics
Thomas Haverkamp
[email protected]: @Thomieh
Overview
• Introduction
• Sequence classification for metagenomes
• MEGAN in brief
• Oil well metagenome
• The exercise…
The bacterial tree of life
Lasken & McLean, Nature Rev. Genetics, 2014
Ultra-small bacteria
OP11 and OD1 were detected using metagenomic analysis of 0.2 μm filtered water.
The genome size is < 1 Mbp.
These bacteria rely on other community members for basic resources.
Metagenomics
Metagenome: the collective genome of all the microorganisms in an environment. (Handelsman et al., 1998)
Metagenomics is the study of genetic material recovered from an environmental sample.
Metagenomics
Who is there? What are they doing?
Methods for microbial communities
Amplicon-based analysis
• rRNA gene regions (e.g. 16S SSU rRNA, ITS)
• protein-coding genes: rpoB, nifH, IRS, cytC, … (fungene.cme.msu.edu)
Microarrays – require knowledge of the community in advance.
• PhyloChip (taxonomic)
• Geochip (metabolic)
Shotgun sequencing – complete community analysis.
Shotgun metagenomics
Venter et al., 2004
1.2 million unknown genes
High throughput sequencing
Source: https://flxlexblog.wordpress.com/
Newest: Illumina HiSeq X 10, > 1 Tb of sequence data
Metagenomics
http://metagenomics.anl.gov/
2012
Metagenomics
http://metagenomics.anl.gov/
2015
Sequence classification
Sequence classification (binning) is the process of separating sequence data into bins using specific information.
Sequence classification by: 1) sequence composition
Tetranucleotide frequency (k-mer counting)
Clustering of reads (e.g. swarm, cd-hit)
Sequence (co-)assembly (MetaHit, MetaVelvet)
Differential coverage of contigs (GroopM, CONCOCT)
Advantage: reads of unknown origin can be classified into a bin.
Disadvantage: composition alone cannot determine the taxonomy or function of the reads.
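To illustrate composition-based binning, here is a minimal Python sketch that computes tetranucleotide (4-mer) frequencies for a single sequence. The function name is hypothetical and the code is not taken from any of the tools named above. It also skips a step that binning tools commonly apply, namely merging each k-mer with its reverse complement.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each of the 256 tetranucleotides."""
    seq = seq.upper()
    # Slide a window of width 4 over the sequence and count every 4-mer.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Only windows made of A, C, G, T contribute (windows with N are ignored).
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


freqs = tetranucleotide_frequencies("GACTGACTGACTGACTGACT")
top = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:4]
print(top)
```

Each contig or read then becomes a 256-dimensional vector. Sequences from the same genome tend to have similar vectors, which is what PCA or clustering methods exploit.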
GroopM workflow
Imelfort et al., PeerJ, 2014
Binning of synthetic contigs
Imelfort et al., PeerJ, 2014
Figure panels: PCA of tetranucleotide frequencies (composition binning) and GroopM coverage binning
Input data: 1159 genomes
Sequence classification
Sequence classification (binning) is the process of separating sequence data into bins using specific information.
Sequence classification by: 2) sequence similarity
Compare sequences to a reference database (e.g. BLAST, bwa, bowtie)
Use phylogenetics to classify sequences.
Advantage: one can determine the taxonomy and function of reads.
Disadvantage: reads with no similarity to database sequences cannot be classified.
Using the best blast hit
Blog of Nick Loman: http://nickloman.github.io/
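A best-hit assignment can be sketched in a few lines of Python. The sketch assumes BLAST was run with tabular output (`-outfmt 6`), whose default twelve columns start with the query and subject identifiers and end with the e-value and bit score. The function name and the example rows are hypothetical.

```python
import csv
import io


def best_hits(handle):
    """Map each query to its highest-scoring (subject, bit score) pair."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Made-up example rows in the default 12-column tabular layout.
example_blast = io.StringIO(
    "read1\trefA\t98.0\t100\t2\t0\t1\t100\t5\t104\t1e-50\t180\n"
    "read1\trefB\t97.0\t100\t3\t0\t1\t100\t9\t108\t4e-50\t178\n"
    "read2\trefC\t85.0\t90\t13\t0\t1\t90\t1\t90\t1e-20\t95\n"
)
for read, (subject, score) in best_hits(example_blast).items():
    print(read, subject, score)
```

In this example read1 hits refA and refB with nearly equal scores, so the best hit alone hides the ambiguity. That is the situation the LCA approach used by MEGAN is designed to handle.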
Sequence classification
Nucleotide composition: CompostBin, PCA analysis of k-mer frequencies, Self-Organizing Maps (different variants), MetaCluster, PhyloPythia, Naïve Bayes classifier (NBC), etc.
Sequence similarity: MEGAN*, SOrt-ITEMS, Treephyler, COMET, MetaPhlAn, PhyloSift, Kraken, etc.
Both: Phymm / PhymmBL, PhyloPythia, RAIphy, Metaxa2* (rRNA), PhylOTU (rRNA), MLTreeMap, RITA, STAMP, WGSQuikr.
Differential coverage: GroopM, CONCOCT, Blobology
See also: Logares et al., 2012 / * In this course
MetaPhlAn vs PhyloSift
MetaPhlAn: Metagenomic Phylogenetic Analysis
• Uses a database of taxon-specific marker genes
• Works well with known ecosystems, e.g. gut communities
PhyloSift:
• Uses a database of 37 universal proteins & rRNA genes
• Designed to classify using phylogenies
Both databases are smaller than NCBI NR.
Depending on your ecosystem, one will work better than the other.
MetaPhlAn output
https://bitbucket.org/nsegata/metaphlan/wiki/MetaPhlAn_Pipelines_Tutorial
PhyloSift output
http://sourceforge.net/p/krona/home/krona/
Interactive Krona plots
Only one sample
MEGAN
(Huson et al., Genome Research, 2007)
• Developed for characterization of metagenomic shotgun reads
• LCA assignment based on BLAST bitscore
• Support for paired-end reads and comparison of datasets.
• Latest version can analyze RDP files / QIIME OTU files
• Analysis of metabolism via SEED, KEGG or COG maps
• Comparison of multiple metagenomes (> 2)
Why use MEGAN?
Easy to work with on a desktop / laptop computer.
Extra things needed: Java, a BLAST server
MEGAN gives a visualization of BLAST results
• Study diversity
• Compare samples
• Contamination filtering
• Special gene of interest
• Extraction of sequences based on taxonomic / metabolic information.
The basics of MEGAN
MEGAN uses BLAST, a database and a taxonomy file
• BLASTN: nucleotide query against a nucleotide database.
• BLASTX: translated nucleotide query against a protein database.
• Which database?
One of the many available databases, such as the NCBI non-redundant database, or your own custom database.
• Taxonomy: NCBI taxonomy, or your own custom taxonomy
The BLAST output file is used to bin sequences into specific taxa using the LCA assignment algorithm.
The basics of MEGAN
• The LCA algorithm = “Lowest Common Ancestor” algorithm
“In this approach, every read is assigned to some taxon. If the read aligns very specifically only to a single taxon, then it is assigned to that taxon. The less specifically a read hits taxa, the higher up in the taxonomy it is placed. Reads that hit ubiquitously may even be assigned to the root node of the NCBI taxonomy.”
(the MEGAN manual)
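The idea in the quote can be sketched in Python as follows. This is an illustration under stated assumptions, not MEGAN's actual implementation. The taxonomy is a hypothetical, heavily simplified toy tree, and the `top_percent` filter (keep hits whose bit score is within a given percentage of the best hit) is only meant to resemble the kind of threshold MEGAN exposes.

```python
# Hypothetical toy taxonomy: each taxon points to its parent.
TOY_PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Deltaproteobacteria": "Proteobacteria",
    "Pelobacter": "Deltaproteobacteria",
    "Desulfovibrio": "Deltaproteobacteria",
    "Firmicutes": "Bacteria",
    "Archaea": "root",
    "Methanococcales": "Archaea",
}


def lineage(taxon, parent=TOY_PARENT):
    """Return the path from the root down to the given taxon."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = parent[taxon]
    return path[::-1]


def lowest_common_ancestor(taxa, parent=TOY_PARENT):
    """Walk down from the root while all lineages still agree."""
    lca = "root"
    for ranks in zip(*(lineage(t, parent) for t in taxa)):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


def assign_read(hits, top_percent=10.0, parent=TOY_PARENT):
    """Assign a read from its (taxon, bit score) hits; None if there are no hits."""
    if not hits:
        return None
    best = max(score for _, score in hits)
    cutoff = best * (1 - top_percent / 100)
    kept = {taxon for taxon, score in hits if score >= cutoff}
    return lowest_common_ancestor(kept, parent)


print(assign_read([("Pelobacter", 200)]))
print(assign_read([("Pelobacter", 200), ("Desulfovibrio", 195)]))
print(assign_read([("Pelobacter", 200), ("Methanococcales", 198)]))
```

A read that hits only Pelobacter stays at Pelobacter. A read that hits two genera about equally well moves up to their shared class. A read that hits both a bacterium and an archaeon ends up at the root.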
Multiple samples
Comparison between reads assigned to phosphorus metabolism and nitrogen metabolism
Reads were annotated using MG-RAST
Metagenomics servers
• MG-RAST* (http://metagenomics.anl.gov/)
• IMG/M (http://img.jgi.doe.gov/)
• WebMGA (http://weizhong-lab.ucsd.edu/metagenomic-analysis/)
• METAGENassist* (http://www.metagenassist.ca/METAGENassist/faces/Home.jsp)
• Real-Time metagenomics (https://edwards.sdsu.edu/RTMg/)
*Can also be used for amplicon sequences
Sampling an oil field
Figure labels: sampling site; Xpand Pressure Flask; pressure 27.3 bar; depths 0 m, 350 m, 2850 m, 2950 m; subsea sediments; seal sediment; reservoir sediment; temperature 85 °C; pressure 253 bar.
The metagenome data
454 sequencing:
- raw reads: 702 607 (492 bp)
- clean reads: 362 562 (415 bp)
Newbler assembly
- Assembled contigs: 13 400 (longest 50 Kbp)
94% of reads assembled.
Kotlar et al., 2011
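As a quick sanity check on these numbers, the short Python sketch below works out the fraction of reads that survived cleaning and the approximate amount of clean sequence. It assumes the 415 bp figure is the mean length of the clean reads, which the slide does not state explicitly.

```python
raw_reads, clean_reads = 702_607, 362_562
clean_read_length = 415  # bp; assumed to be the mean length of clean reads

retained = clean_reads / raw_reads
clean_bases = clean_reads * clean_read_length

print(f"reads retained after cleaning: {retained:.1%}")           # 51.6%
print(f"approximate clean sequence: {clean_bases / 1e6:.1f} Mbp")  # 150.5 Mbp
```

So roughly half of the raw reads passed cleaning, leaving about 150 Mbp of sequence as input for the assembly.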
Contig GC content
Kotlar et al., 2011
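GC content per contig is straightforward to compute from a FASTA file. The following Python sketch uses hypothetical function names and made-up example contigs. It counts only A, C, G and T, so ambiguous bases such as N do not affect the fraction.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_content(seq):
    """Return the fraction of G and C among the unambiguous bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


# Made-up contigs for illustration only.
example_fasta = ">contig1\nGGCC\nGCAT\n>contig2 low GC\nATATATGC\n"
gc_by_contig = {
    name: gc_content(seq) for name, seq in read_fasta(example_fasta.splitlines())
}
print(gc_by_contig)  # {'contig1': 0.75, 'contig2': 0.25}
```

Plotting a histogram of these values for all contigs gives the kind of GC distribution this slide refers to.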
Taxonomy and metabolism
Figure 2 (panels a and b)
Legend, taxonomy: alpha-Proteobacteria, gamma-Proteobacteria, delta/epsilon-Proteobacteria, unclassified Proteobacteria, Firmicutes, Thermotogae, Synergistaceae, other/unclassified Bacteria, Methanococcales, Thermococcales, other/unclassified Archaea, Eukaryota, not assigned, no hits
Legend, metabolism: sulfur-reducing bacteria, methanogens, others
Groups with more than 2000 reads assigned:
- Delta/epsilon-Proteobacteria
- Methanococcales
- Unclassified bacteria
Based on known metabolism annotations:
- Sulfur-reducing bacteria
- Methanogens
- Others
Kotlar et al., 2011
MEGAN classifications
Figure labels: GC content > 50 % and < 50 %
Major taxa are separated by different GC content
Adaptation to environment
Kotlar et al., 2011
Genome comparisons
Figure S7 (panels a, b, c)
Missing CRISPR genes
anti-viral defence
no viruses?
Kotlar et al., 2011
Testing a metagenome assembly
Figure S6: relative activity (y-axis, -1 to 8) versus temperature (°C, x-axis, 30 to 90); legend labels: pNTA1, pGS-21a
Blue: Pelobacter carbinolicus enolase
Green: E. coli enolase
Expression of both proteins in E. coli
The Pelobacter enolase is less temperature sensitive
Kotlar et al., 2011