introduction to metagenomics


Upload: thomas-haverkamp

Post on 15-Jul-2015


Introduction to metagenomics

Thomas Haverkamp

[email protected] Twitter: @Thomieh

[email protected]


Overview

• Introduction

• Sequence classification for metagenomes

• MEGAN in brief

• Oil well metagenome

• The exercise…


The bacterial tree of life

Lasken & McLean, Nature Reviews Genetics, 2014


Ultra-small bacteria


OP11 and OD1 were detected using metagenomic analysis of 0.2 μm filtered water.

The genome size is < 1 Mbp.

These bacteria rely on other community members for basic resources.


Metagenomics

Metagenome: the collective genome of all the microorganisms in an environment. (Handelsman et al., 1998)

Metagenomics is the study of genetic material recovered from an environmental sample.


Metagenomics


Who is there? What are they doing?


Methods for microbial communities

Amplicon based analysis

• rRNA marker regions (e.g. 16S SSU rRNA, ITS)

• Protein-coding genes: rpoB, nifH, IRS, cytC, … (see fungene.cme.msu.edu)

Microarrays: require knowledge of the community in advance.

• PhyloChip (taxonomic)

• Geochip (metabolic)

Shotgun sequencing – complete community analysis.


Shotgun metagenomics

Venter et al., 2004

1.2 million unknown genes


High throughput sequencing

Source: https://flxlexblog.wordpress.com/

Newest: Illumina HiSeq X Ten, > 1 Tb of sequence data


Metagenomics

http://metagenomics.anl.gov/

2012


Metagenomics

http://metagenomics.anl.gov/

2015


Sequence Classification

Sequence classification (binning) is the process of separating sequence data into bins using specific information.

Sequence classification by: 1) sequence composition

• Tetranucleotide frequency (k-mer counting)

• Clustering of reads (e.g. swarm, cd-hit)

• Sequence (co-)assembly (MetaHit, MetaVelvet)

• Differential coverage of contigs (GroopM, CONCOCT)

Advantage: reads of unknown origin can still be classified into a bin.

Disadvantage: composition alone cannot determine the taxonomy or function of the reads.
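To make the composition-based idea concrete, here is a minimal Python sketch of tetranucleotide frequency (k-mer counting) profiles. It is not taken from any of the tools named above: the function names, the toy contigs and the choice of Euclidean distance are assumptions for illustration only. Real binners typically also merge reverse-complement k-mers and use more elaborate statistics.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence, k=4):
    """Return the relative frequency of every possible k-mer over A, C, G, T."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing N or other ambiguity codes are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


def profile_distance(a, b):
    """Euclidean distance between two k-mer frequency profiles."""
    return sum((a[kmer] - b[kmer]) ** 2 for kmer in a) ** 0.5


# Toy contigs (hypothetical): a and b share a composition, c does not.
contig_a = "ACGTACGTACGTACGTACGT"
contig_b = "ACGTACGTACGTACGT"
contig_c = "GGGGCCCCGGGGCCCCGGGG"

profile_a = tetranucleotide_frequencies(contig_a)
profile_b = tetranucleotide_frequencies(contig_b)
profile_c = tetranucleotide_frequencies(contig_c)

print("a vs b:", round(profile_distance(profile_a, profile_b), 3))
print("a vs c:", round(profile_distance(profile_a, profile_c), 3))
```

Contigs with similar profiles end up close together and can be grouped into the same bin, even when no reference sequence is known for them.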


GroopM workflow

Imelfort et al., PeerJ, 2014


Binning of synthetic contigs

Imelfort et al., PeerJ, 2014

[Figure panels: PCA of tetranucleotide frequencies (binning); GroopM coverage binning]

Input data: 1159 genomes


Sequence Classification

Sequence classification (binning) is the process of separating sequence data into bins using specific information.

Sequence classification by: 2) sequence similarity

• Compare sequences to a reference database (e.g. BLAST, bwa, bowtie)

• Use phylogenetics to classify sequences

Advantage: the taxonomy and function of reads can be determined.

Disadvantage: reads with no similarity to database sequences cannot be classified.
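The trade-off above can be illustrated with a small Python sketch that reads BLAST tabular output (the standard 12-column `-outfmt 6` layout) and separates reads with usable hits from reads without. The helper names, the example rows and the bit score threshold of 50 are assumptions chosen for illustration, not values from the slides.

```python
import csv
import io
from collections import defaultdict

# Column order of the default BLAST+ tabular format (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def hits_per_read(handle, min_bitscore=50.0):
    """Group hits by read, keeping only hits at or above min_bitscore."""
    hits = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(BLAST_COLUMNS, row))
        score = float(record["bitscore"])
        if score >= min_bitscore:
            hits[record["qseqid"]].append((record["sseqid"], score))
    return dict(hits)


def split_classified(read_ids, hits):
    """Return (reads with usable hits, reads without)."""
    classified = [r for r in read_ids if r in hits]
    unclassified = [r for r in read_ids if r not in hits]
    return classified, unclassified


# Hypothetical example rows; read3 has no hit at all.
EXAMPLE_ROWS = [
    ["read1", "refA", "98.0", "100", "2", "0", "1", "100", "1", "100", "1e-50", "180"],
    ["read1", "refB", "85.0", "100", "15", "0", "1", "100", "1", "100", "1e-10", "60"],
    ["read2", "refC", "70.0", "40", "12", "0", "1", "40", "1", "40", "0.5", "35"],
]
EXAMPLE_TABLE = "\n".join("\t".join(row) for row in EXAMPLE_ROWS) + "\n"

example_hits = hits_per_read(io.StringIO(EXAMPLE_TABLE))
classified, unclassified = split_classified(["read1", "read2", "read3"], example_hits)
print("classified:", classified)
print("unclassified:", unclassified)
```

Reads that fall below the threshold or have no hit remain unclassified, which is exactly the limitation of similarity-based methods noted above.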


Using the best BLAST hit

Blog of Nick Loman: http://nickloman.github.io/


Sequence Classification

Nucleotide composition: CompostBin, PCA analysis of k-mer frequencies, Self-Organizing Maps (different variants), MetaCluster, PhyloPythia, Naïve Bayes classifier (NBC), etc.

Sequence similarity: MEGAN*, SOrt-ITEMS, Treephyler, CoMet, MetaPhlAn, PhyloSift, Kraken, etc.

Both: Phymm / PhymmBL, Phylophytia, RAIphy, Metaxa2* (rRNA), PhylOTU (rRNA), MLTreeMap, RITA, STAMP, WGSQuikr.

Differential coverage: GroopM, CONCOCT, Blobology

See also: Logares et al., 2012 / * In this course


MetaPhlAn vs PhyloSift

MetaPhlAn (Metagenomic Phylogenetic Analysis):

• Uses a database of taxon-specific marker genes

• Works well with known ecosystems, e.g. gut communities

PhyloSift:

• Uses a database of 37 universal proteins and rRNA genes

• Designed to classify using phylogenies

Both databases are smaller than NCBI NR.

Depending on your ecosystem, one will work better than the other.


MetaPhlAn output

https://bitbucket.org/nsegata/metaphlan/wiki/MetaPhlAn_Pipelines_Tutorial


PhyloSift output

http://sourceforge.net/p/krona/home/krona/

Interactive Krona plots

Only one sample


MEGAN

(Huson et al., Genome Research, 2007)

• Developed for characterization of metagenomic shotgun reads

• LCA assignment based on BLAST bitscore

• Support for paired-end reads and comparison of datasets.

• Latest version can analyze RDP files / QIIME OTU files

• Analysis of metabolism via SEED, KEGG or COG maps

• Comparison of multiple metagenomes (> 2)


Why use MEGAN?

Easy to work with on a desktop / laptop computer:

Extra things needed: Java, a BLAST server

MEGAN gives a visualization of BLAST results

• Study diversity

• Compare samples

• Contamination filtering

• Specific genes of interest

• Extraction of sequences based on taxonomic / metabolic information


The basics of MEGAN

MEGAN uses BLAST, a database and a taxonomy file

• BLASTN: nucleotide queries against a nucleotide database.

• BLASTX: translated nucleotide queries against a protein database.

• Which database?

One of the many available databases, such as the NCBI non-redundant (NR) database, or your own custom database.

• Taxonomy: NCBI taxonomy, or your own custom taxonomy

The BLAST output file is used to bin sequences into specific taxa with the LCA assignment algorithm.


The basics of MEGAN

• The LCA algorithm = “Lowest Common Ancestor” algorithm

“In this approach, every read is assigned to some taxon. If the read aligns very specifically only to a single taxon, then it is assigned to that taxon. The less specifically a read hits taxa, the higher up in the taxonomy it is placed. Reads that hit ubiquitously may even be assigned to the root node of the NCBI taxonomy.”

(the MEGAN manual)
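The idea in the quote can be sketched in a few lines of Python. This is a simplified illustration, not MEGAN's actual implementation: the toy taxonomy, the function names and the `min_score` and `top_percent` values are assumptions made for the example.

```python
# Toy taxonomy (hypothetical subset): each taxon maps to its parent.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Archaea": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Deltaproteobacteria": "Proteobacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Methanococcales": "Archaea",
}


def lineage(taxon):
    """Return the path from the root down to the given taxon."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path[::-1]


def lowest_common_ancestor(taxa):
    """Deepest taxon shared by the lineages of all given taxa."""
    lca = None
    for level in zip(*(lineage(t) for t in taxa)):
        if len(set(level)) != 1:
            break  # lineages diverge below this point
        lca = level[0]
    return lca


def assign_read(hits, min_score=50.0, top_percent=10.0):
    """Assign a read from (taxon, bitscore) hits; None if no usable hit.

    Only hits scoring within top_percent of the best bit score are used,
    so weak secondary hits do not push the read towards the root.
    """
    kept = [(taxon, score) for taxon, score in hits if score >= min_score]
    if not kept:
        return None
    best = max(score for _, score in kept)
    cutoff = best * (1.0 - top_percent / 100.0)
    return lowest_common_ancestor({t for t, s in kept if s >= cutoff})


# A read hitting two proteobacterial classes almost equally well.
print(assign_read([("Deltaproteobacteria", 200.0), ("Gammaproteobacteria", 190.0)]))
# A read hitting Bacteria and Archaea equally well ends up at the root.
print(assign_read([("Deltaproteobacteria", 200.0), ("Methanococcales", 198.0)]))
```

The less specific the hits, the higher in the taxonomy the read is placed, which matches the behaviour described in the manual.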


Multiple samples

Comparison between reads assigned to phosphorus metabolism and nitrogen metabolism

Reads were annotated using MG-RAST


Metagenomics servers

• MG-RAST* (http://metagenomics.anl.gov/)

• IMG/M (http://img.jgi.doe.gov/)

• WebMGA (http://weizhong-lab.ucsd.edu/metagenomic-analysis/)

• METAGENassist* (http://www.metagenassist.ca/METAGENassist/faces/Home.jsp)

• Real-Time metagenomics (https://edwards.sdsu.edu/RTMg/)

*Can also be used for amplicon sequences


Sampling an oil field

[Figure: schematic of the sampling site. Labels: Xpand pressure flask; pressure 27.3 bar; depth markers 0 m, 350 m, 2850 m and 2950 m; subsea sediments; seal sediment; reservoir sediment; temperature 85 °C; pressure 253 bar.]


The metagenome data

454 sequencing:

- raw reads: 702 607 (492 bp)

- clean reads: 362 562 (415 bp)

Newbler assembly

- Assembled contigs: 13 400 (longest 50 Kbp)

94% of reads assembled.

Kotlar et al., 2011


Contig GC content

Kotlar et al., 2011


Taxonomy and metabolism

[Figure 2, panels a and b. Taxonomy legend: alpha-Proteobacteria, gamma-Proteobacteria, delta/epsilon-Proteobacteria, unclassified Proteobacteria, Firmicutes, Thermotogae, Synergistaceae, other/unclassified Bacteria, Methanococcales, Thermococcales, other/unclassified Archaea, Eukaryota, not assigned, no hits. Metabolism legend: sulfur-reducing bacteria, methanogens, others.]

Groups with more than 2000 reads assigned:

- Delta/epsilon-Proteobacteria

- Methanococcales

- Unclassified bacteria

Based on known metabolism annotations:

- Sulfur-reducing bacteria

- Methanogens

- Others

Kotlar et al., 2011


MEGAN classifications

[Figure labels: > 50 %, < 50 %]

Major taxa are separated by different GC content

Adaptation to environment

Kotlar et al., 2011
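As a rough illustration of how contigs can be separated by GC content, the Python sketch below computes the GC fraction per contig from FASTA text and splits the contigs at 50 %. The threshold mirrors the figure labels, which are assumed here to refer to GC content; the contig names and sequences are made up, and this is not the analysis code used by Kotlar et al.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def split_by_gc(records, threshold=0.5):
    """Split contig names into (high GC, low GC) lists at the threshold."""
    high, low = [], []
    for name, sequence in records:
        (high if gc_content(sequence) > threshold else low).append(name)
    return high, low


# Hypothetical contigs for demonstration only.
EXAMPLE_FASTA = """>contig1
GGCCGGCCGGAT
>contig2
ATATATTTAAGC
ATAT
""".splitlines()

high_gc, low_gc = split_by_gc(read_fasta(EXAMPLE_FASTA))
print("GC > 50 %:", high_gc)
print("GC <= 50 %:", low_gc)
```

Combining such a split with the taxonomic assignment of each contig shows whether major taxa occupy distinct GC ranges.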


Kotlar et al., 2011


Genome comparisons

[Figure S7, panels a, b and c]

Missing CRISPR genes

Anti-viral defence

No viruses?

Kotlar et al., 2011


Testing a metagenome assembly

[Figure S6: relative activity (y-axis, -1 to 8) versus temperature in °C (x-axis, 30 to 90). Legend: pNTA1, pGS-21a.]

Blue: Pelobacter carbinolicus enolase. Green: E. coli enolase.

Expression of both proteins in E. coli

The Pelobacter enolase is less temperature sensitive

Kotlar et al., 2011


Any questions?

[email protected]
