Studying the microbiome

Mick Watson
Head of Bioinformatics, Edinburgh Genomics, University of Edinburgh
Research Group Leader, The Roslin Institute, University of Edinburgh

Edinburgh Genomics
• Genomics facility based at the University of Edinburgh
• Available for collaborations on an academic, non-profit basis
• Formed from merger of
– ARK-Genomics
– The GenePool
• Funded by three major UK bio research councils
• A range of technologies and expertise available

http://genomics.ed.ac.uk

Prevailing theory of the individual
• An individual consists of at least 10x as many bacterial cells as “host” cells *
• Each individual is a “supra-organism”
– a composite of host and microbial cells that contribute the functions necessary for the individual to survive
• The genetic landscape of any individual is a composite of the host genome and the genomes of the millions of microbial symbionts that live on and within that individual
• It is clearly important to take a holistic view when examining any animal phenotype

My focus
• Move from discovery science to applied science
• From “What’s there?” to “What can we do with it?”

• The “ten times” figure comes from a paper in 1972, and is estimated from 1g of human faeces

• More modern estimates range from equal to 100 times!

• American Society for Microbiology 2014 report puts the ratio closer to 3:1

• Panel included Peter Turnbaugh

• There’s still more of them though….

http://www.bostonglobe.com/ideas/2014/09/13/your-body-mostly-microbes-actually-have-idea/qlcoKot4wfUXecjeVaFKFN/story.html

Microbiome research is undergoing a crisis
Please don’t make things worse
• Crisis 1
– The correlation/causation fallacy. For example….
– Patients with type II diabetes have a different gut microbiome compared to healthy patients
– Does the microbiome cause diabetes?
– Or do they have a different microbiome because they have diabetes? (therefore different diet)
• Crisis 2
– A lot of people want to do it, but don’t know how
– Errors, bad experimental design, incorrect conclusions

What is the microbiome?
“the ecological community of commensal, symbiotic, and pathogenic microorganisms that literally share our body space”
- Joshua Lederberg

Note: includes fungi, protists, archaea, bacteria, algae, viruses etc.

(whisper it: most “microbiome” studies only look at bacteria/archaea)

How do we study the microbiome?
• Marker gene vs shotgun metagenomics
• Marker gene
– 16S / 18S / ITS
– Amplify this and compare
• Metagenomics
– Extract all DNA
– Fragment, sequence, interpret
• In theory, the latter is the least biased*

16S studies are not metagenomics

http://phylogenomics.blogspot.co.uk/2012/08/referring-to-16s-surveys-as.html, http://biomickwatson.wordpress.com/2014/01/12/youre-probably-not-doing-metagenomics/


16S
• rRNA component of the prokaryotic small ribosomal subunit
• Present in all (?) bacterial/archaeal genomes, contains constant and hypervariable regions
• Hypervariable regions may give “species specific” signatures

16S process
• Current sequencing technologies can’t sequence the whole thing
• Design primers in constant regions and PCR
• Amplify 1 or more hypervariable regions
• Cluster similar sequences into OTUs
• Compare to 16S database and assign phylogenetic group
• Compare abundance across sample groups (QIIME, Mothur)
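The clustering and abundance steps above can be sketched in a few lines. This is a toy illustration only, not how QIIME or Mothur work internally: it assumes reads have already been trimmed to equal length, uses a simple per-position identity instead of a real alignment, and the 97% threshold and the function names are assumptions chosen for the example.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Greedy centroid clustering: the most abundant unique sequences seed OTUs first.

    Returns a list of (centroid_sequence, read_count) pairs.
    The 0.97 default is an assumed, conventional cut-off.
    """
    unique = Counter(reads)
    otus = []  # each entry is [centroid_sequence, read_count]
    for seq, n in unique.most_common():
        for otu in otus:
            if identity(seq, otu[0]) >= threshold:
                otu[1] += n  # close enough to an existing centroid
                break
        else:
            otus.append([seq, n])  # start a new OTU
    return [(centroid, count) for centroid, count in otus]


def relative_abundance(otus):
    """Convert OTU read counts into proportions of the sample."""
    total = sum(count for _, count in otus)
    return {centroid: count / total for centroid, count in otus}
```

Relative abundances computed this way are what get compared across sample groups; they inherit every bias introduced upstream by primers and PCR.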

16S problems
• Some genomes have multiple copies of the 16S gene
• The constant regions aren’t constant
– Design degenerate primers
– Some primers pick up certain groups better than others
– A perfect match primer will amplify better than one containing mismatches
• The abundances from 16S are wrong, we simply hope that they are consistently wrong across samples
• Absence is really difficult to prove, and wrong to assume
• Chimeras: PCR artefacts consisting of 16S gene fragments from two different molecules

• Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol. 2005 71(12):7724-36.
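To make the primer-bias point concrete, here is a minimal sketch of how a degenerate primer can be scored against a template using the standard IUPAC ambiguity codes. The primer and template sequences in the usage note are made up for illustration, and the function names are hypothetical; real primer evaluation would also consider both strands and position-dependent mismatch effects.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatches(primer: str, site: str) -> int:
    """Count positions where the template base is not allowed by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and site must be the same length")
    return sum(base not in IUPAC[code] for code, base in zip(primer, site))


def best_site(primer: str, template: str):
    """Return (position, mismatches) for the best-matching window, or None."""
    k = len(primer)
    hits = [
        (mismatches(primer, template[i:i + k]), i)
        for i in range(len(template) - k + 1)
    ]
    if not hits:
        return None
    mm, pos = min(hits)  # fewest mismatches, then leftmost position
    return pos, mm
```

For example, with the invented primer `ACGYTM`, `best_site("ACGYTM", "TTACGCTATT")` finds a perfect match at position 2, while a template carrying `ACGATG` at the same site would score two mismatches and would be expected to amplify less efficiently.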

SEQUENCING TECHNOLOGIES

Sequencing: what’s on the market?
• Illumina
– Advantages: highly accurate; cheap; industry leader; multiple platforms
– Disadvantages: slower than Ion; short reads
– Output per run: HiSeq X Ten: 18Tb; HiSeq X: 1.8Tb; 2500 HO: 600Gb -> 1Tb; 2500 RO: 180Gb; NextSeq: 140Gb; MiSeq: 25Gb
• Ion Torrent
– Advantages: fast; cheap machine
– Disadvantages: very poor on homopolymers; doesn’t match Illumina on throughput
– Output per run: PGM: 2Gb; Proton P1: 10Gb; Proton P2: 30Gb
• PacBio
– Advantages: long reads; single molecule
– Disadvantages: high error rate, needs correction; low throughput; expensive machine
– Output per run: 300-500Mb
• Oxford Nanopore MinION
– Advantages: long reads; single molecule; cheap; portable
– Disadvantages: high error rate; unknown quantity
– Output per run: unknown
• Complete Genomics
– Advantages: highly accurate; cheap
– Disadvantages: limited to human; black box
– Output per run: unknown; human genomes can be purchased

Illumina read lengths
• HiSeq X Ten (Human only): 100PE
• HiSeq 2500: V4 125PE, V3R 150PE, V3H 100PE
• NextSeq: 150PE
• MiSeq: V2 250PE, V3 300PE

16S sequencing strategy?
• Platform: MiSeq
• Theoretically (allowing ~20bp of overlap to merge each read pair):
– 2x150bp can sequence ~280bp amplicon
– 2x250bp can sequence ~480bp amplicon
– 2x300bp can sequence ~580bp amplicon

Important paper
• Amongst other things, sequenced a mock community with different sequencing and bioinformatics strategies
• Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol. 2013 79(17):5112-20.
• Three 16S regions sequenced using 2x250bp
– V4 (~250 bp), V34 (~430 bp), and V45 (~375 bp) regions
– In the mock community, there should be 20 OTUs

16S sequencing strategy?
• The only strategy that got close to the correct result is complete overlap of 2x250bp MiSeq reads
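The arithmetic behind “complete overlap” can be made explicit. The sketch below computes how many bases of an amplicon are covered by both reads of a pair, using the approximate region lengths quoted for the three 16S regions; the function name is hypothetical, and primers, adapters and quality trimming are ignored.

```python
def read_overlap(read_length: int, amplicon_length: int) -> int:
    """Number of amplicon bases covered by both reads of a pair.

    Assumes each read starts at one end of the amplicon and runs inwards.
    """
    overlap = 2 * read_length - amplicon_length
    # Overlap can be neither negative nor longer than the read or the amplicon.
    return max(0, min(overlap, read_length, amplicon_length))


# Approximate region lengths as quoted for the 2x250bp runs.
regions = {"V4": 250, "V34": 430, "V45": 375}

for name, length in regions.items():
    shared = read_overlap(250, length)
    print(f"{name}: {shared} bp overlap ({shared / length:.0%} of the amplicon)")
```

Only V4 is read in full from both directions, so every base is sequenced twice and disagreements between the two reads can be used to resolve errors. For V34 and V45 most of the amplicon is covered by a single read.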

SHOTGUN METAGENOMICS

Shotgun metagenomics
• Take ecosystem, extract all DNA and sequence it
• Should be unbiased, right?... Right?

• (NB: issues on the next few slides are also issues for marker gene studies)

Extraction protocol

“we found that each DNA extraction method resulted in unique community patterns”

“We observed significant differences in distribution of bacterial taxa depending on the method.”

Storage

“Samples frozen with and without glycerol as cryoprotectant indicated a major loss of Bacteroidetes in unprotected samples”

• In the chicken caecum, Bacteroidetes dominate, followed by Firmicutes:

• Nordentoft S et al (2011) The influence of the cage system and colonisation of Salmonella Enteritidis on the microbial gut flora of laying hens studied by T-RFLP and 454 pyrosequencing. BMC Microbiol. 11:187

• In the chicken caecum, Firmicutes dominate, with few Proteobacteria and no Bacteroidetes:

• Danzeisen JL et al (2011). Modulations of the chicken cecal microbiome and metagenome in response to anticoccidial and growth promoter treatment. PLOS ONE. 6(11):e27949.

• Did I mention that microbiome research is undergoing a crisis?

• It gets worse…..

Contamination

• Sequenced a pure culture of Salmonella bongori
• Extracted DNA using different kits
• Did serial dilutions of the pure culture to assess impact of contaminating species

The kits
• FastDNA Spin Kit For Soil (FP), MoBio UltraClean Microbial DNA Isolation Kit (MB), QIAamp DNA Stool Mini Kit (QIA) and PSP Spin Stool DNA Plus kit (PSP)

FP had a stable kit profile dominated by Burkholderia, PSP was dominated by Bradyrhizobium, while the QIA kit had the most complex mix of bacterial DNA. Bradyrhizobiaceae, Burkholderiaceae, Chitinophagaceae, Comamonadaceae, Propionibacteriaceae and Pseudomonadaceae were present in at least three quarters of the dilutions from PSP, FP and QIA kits. However, relative abundances of taxa at the Family level varied according to kit: FP was marked by Burkholderiaceae and Enterobacteriaceae, PSP was marked by Bradyrhizobiaceae and Chitinophagaceae. The contamination in the QIA kit was relatively diverse in comparison to the other kits, and included higher proportions of Aerococcaceae, Bacillaceae, Flavobacteriaceae, Microbacteriaceae, Paenibacillaceae, Planctomycetaceae and Polyangiaceae than the other kits. Kit MB did not have a distinct contaminant profile and varied from dilution to dilution due to paucity of reads.

“These metagenomic results therefore clearly show that contamination becomes the dominant feature of sequence data from low biomass samples, and that the kit used to extract DNA can have an impact on the observed bacterial diversity”
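One way to picture the dilution experiment is to track, for each dilution of the pure culture, the share of reads that do not belong to the organism that was actually put in the tube. The sketch below does only that. The read counts are invented for illustration and are not taken from the study, and the function name is hypothetical.

```python
def contaminant_fraction(counts: dict, expected: str) -> float:
    """Fraction of classified reads NOT assigned to the expected organism."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads in sample")
    return (total - counts.get(expected, 0)) / total


# Invented genus-level read counts for a dilution series of a pure culture.
# Illustrative numbers only; they are not results from any real experiment.
dilutions = {
    "undiluted": {"Salmonella": 9900, "Bradyrhizobium": 60, "Burkholderia": 40},
    "1:1000": {"Salmonella": 6000, "Bradyrhizobium": 2500, "Burkholderia": 1500},
    "1:100000": {"Salmonella": 500, "Bradyrhizobium": 6000, "Burkholderia": 3500},
}

for name, counts in dilutions.items():
    fraction = contaminant_fraction(counts, "Salmonella")
    print(f"{name}: {fraction:.1%} of reads are not Salmonella")
```

The same calculation applied to a negative (no-template) control gives a direct picture of what the kit and reagents contribute on their own.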

From Salter et al: “Tellingly, Laurence et al [1] recently demonstrated with an in silico analysis that Bradyrhizobium is a common contaminant of sequencing datasets including the 1000 Human Genome Project”

1. Laurence M, Hatzis C, Brash DE. Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes. PLoS One. 2014 9(5):e97876.

Adenoids are at the back of the nasal cavity
Bradyrhizobium is a soil bacterium

Confounding factors

ANYWAY…..

Shotgun metagenomics
• Can assemble
– MetaVelvet, Meta-IDBA, Ray Meta, MetAMOS
– Different techniques for partitioning
• Coverage, sequence composition, connectivity
• MetaWatt, CONCOCT
– Predict genes: Glimmer-MG, FragGeneScan
• Use reference
– Kraken, PhyloSift, MetaPhlAn, HUMAnN
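As a rough illustration of what “sequence composition” means when partitioning assembled contigs, the sketch below computes GC content and a normalised tetranucleotide profile for a contig, plus a simple distance between two profiles. It is not how MetaWatt or CONCOCT are implemented; the function names are hypothetical, reverse-complement k-mers are not merged, and coverage and connectivity are left out.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Proportion of G and C among the unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_profile(seq: str, k: int = 4) -> list:
    """Normalised frequencies of all 4**k A/C/G/T k-mers, in a fixed order.

    Windows containing any other character (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def profile_distance(p: list, q: list) -> float:
    """Euclidean distance between two k-mer profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```

Contigs from the same genome tend to have similar profiles, so small distances suggest they belong in the same bin; in practice this signal is combined with coverage, since contigs from one organism should also be sequenced to similar depth.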

All-in-one solution
• EBI Metagenomics

• Hunter S, et al. EBI metagenomics--a new resource for the analysis and archiving of metagenomic data. Nucleic Acids Res. 2014 42(Database issue):D600-6.

CONCLUSIONS

Conclusions
• I love microbiome research (honestly!)
• Really, incredibly exciting… but….
• Every step counts
• Be very careful, at all stages
• 16S – cheap, biased but effective
• WGS – expensive, information rich, less biased
• Beware contamination, include controls
