Amplicon Sequencing, De-Noising and Diversity Estimation
Anders Lanzén
Dept. of Biology & Centre for Geobiology, University of Bergen
Tuesday, 17 April 2012
Outline
• Introduction, sequencing technologies
• Different types of de-noising and filtering
• AmpliconNoise
• Chimera removal
• OTU Clustering
• Parametric diversity estimation
Introduction
• NGS has revolutionised microbial ecology
• Diversity - number of unique sequences / OTUs
• Taxonomical structure - what is in the sample?
• In prokaryotes, typically SSU rRNA (16S)
• First results from studies were striking (e.g. Sogin: 30k species from 600k reads), but vastly overestimated diversity of “the rare biosphere” [Kunin et al 2009, Quince et al 2009]
NGS Technologies
Name             ~Reads / run         ~Cost / run  Read length  AmpliconNoise
454 GS Titanium  1M                   €10,000      450 bp       Yes
IonTorrent       2M                   €1,000       240 bp       Soon?
Illumina HiSeq   80M (per lane, PE)   €2,000       100-150 bp   No
The noise problem
• No resequencing possible (no cloning step)
• 454: 0.25% error rate after filtering --> 3% of reads with > 4% incorrect bases --> huge overprediction of OTUs without filtering (>50,000 per run)
• In addition: PCR noise, incl. chimeric sequences
Denoising Pipelines
• 454 Sequencing
• Filtering
• Clustering (once or more)
• Chimera removal
• OTU clustering and taxonomical classification
• Base-calling (not in AmpliconNoise)
Noise Removal Pipelines
PyroNoise (Quince et al. 2009)
• Distance metric: Flowgram
• Algorithm: Probabilistic, iterative
Pyrotagger (Kunin et al. 2010)
• Distance metric: Sequence
• Algorithm: Minimum distance threshold, one pass agglomerative
• Source: Pyrotagger website
Single-linkage preclustering (SLP) (Huse et al. 2010)
• Distance metric: Sequence
• Algorithm: Minimum distance threshold, one pass agglomerative
• Source: VAMPS website
DeNoiser (Reeder & Knight 2010)
• Distance metric: Flowgram
• Algorithm: Minimum distance threshold, agglomerative
• Source: QIIME
AmpliconNoise (Quince et al., BMC Bioinformatics 2011)
• Distance metric: PyroNoise - flowgram; SeqNoise - sequence
• Algorithm: Probabilistic, iterative
• Source: Google Code - http://code.google.com/p/ampliconnoise/; Erick Matsen - https://github.com/fhcrc/ampliconnoise; MOTHUR
Flowgrams
The pyrosequencing raw data
• Sequence is read as light intensity when bases are washed over the 454 plate in the order T C A G
Example: 0.03 1.03 0.09 0.12 0.89 0.09 0.09 1.01 0.11 1.03 0.12 0.12 2.00 0.12 1.92
Translation: CTGCTTAA
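The translation on this slide can be reproduced with a few lines of Python. This is a minimal sketch of naive base calling, assuming each intensity is simply rounded to the nearest whole homopolymer length; the function name `call_bases` is made up for illustration and is not part of any 454 software.

```python
FLOW_ORDER = "TCAG"  # nucleotide wash order stated on the slide

def call_bases(flowgram, flow_order=FLOW_ORDER):
    """Translate flow intensities into bases by rounding each intensity."""
    bases = []
    for i, intensity in enumerate(flowgram):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest whole homopolymer length (never negative).
        length = max(0, int(intensity + 0.5))
        bases.append(nucleotide * length)
    return "".join(bases)

example = [0.03, 1.03, 0.09, 0.12, 0.89, 0.09, 0.09, 1.01,
           0.11, 1.03, 0.12, 0.12, 2.00, 0.12, 1.92]
print(call_bases(example))  # CTGCTTAA
```

Because every value is rounded independently, an intensity close to a half-integer can tip either way, which is the source of the noise discussed next.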
Flowgrams
• Since intensity values are inexact, similar flowgrams can lead to very different sequences after base-calling
Read #1 TGGGGGCAAAAA
        ||||||  ||||
Read #2 TGGGGGGCAAAA
Similarity = 10/12 = 83%
[Figure: flow intensities of Read #1 and Read #2 for flows labelled A G T C A. Values shown: 4.47, 0.96, 0.04, 5.52, 0.01 and 4.51, 1.01, 0.05, 5.49, 0.02]
• Solution: Flowgram-to-sequence clustering. This needs a distance metric!
• Calculation of the probability density P(I|n), where I = signal intensity and n = homopolymer length
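One way to turn P(I|n) into a flowgram distance is to sum negative log-densities over all flows. The sketch below assumes a Gaussian density with mean n and a standard deviation that grows with n. Both the Gaussian form and the parameters `sigma0` and `slope` are illustrative assumptions, not the distributions used by PyroNoise or AmpliconNoise.

```python
import math

def log_signal_density(intensity, n, sigma0=0.04, slope=0.03):
    """log P(I|n) under an ASSUMED Gaussian: mean n, sd sigma0 + slope * n."""
    sigma = sigma0 + slope * n
    z = (intensity - n) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def flowgram_distance(flowgram, perfect):
    """Mean negative log-density of observed flows given ideal lengths."""
    total = 0.0
    for intensity, n in zip(flowgram, perfect):
        total -= log_signal_density(intensity, n)
    return total / len(flowgram)

observed = [0.03, 1.03, 0.09, 0.12, 0.89]
close = flowgram_distance(observed, [0, 1, 0, 0, 1])  # matching sequence
far = flowgram_distance(observed, [0, 1, 0, 0, 2])    # one extra base
print(close < far)  # True
```

A distance of this kind is small when a read's flows are plausible noisy versions of the perfect flowgram, which is what a flowgram-to-sequence clusterer needs.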
AmpliconNoise
• Samples in 454 sff file format
• Split on barcodes
• Filter flowgrams
• Extract flowgrams
• PyroNoise: probabilistic flowgram clusterer (removes pyrosequencing errors)
• SeqNoise: probabilistic sequence clusterer (removes PCR single base errors)
• Perseus: chimera classifier (removes PCR chimeras)
• OTU construction: using complete linkage and exact pairwise alignments (use best available OTU construction algorithm)
AmpliconNoise Algorithm
1. Pre-filtering: Truncates at base 360 or first noisy flow
2. PyroDist: Computes the distance matrix between all flowgrams and perfect flowgrams (in the first step taken from hierarchical clustering of flowgrams to base-called sequences)
3. PyroNoise: Bayesian probability of each flowgram coming from a particular perfect flowgram --> reassignment of flowgrams to the most likely sequence they come from, until convergence. Iterative ML algorithm
4. Base calling, giving unique sequences cleaned of sequencing noise
5. SeqDist and SeqNoise: Cleaning of PCR point mutation artefacts
6. Perseus (or PerseusM): Probabilistic self-referenced chimera removal
Chimera removal - Perseus
• Assumes that a chimera’s parents will be in the dataset with equal or greater frequency than the chimera
• For each sequence, directly search for possible parents and break points that give a close match
• Only search for parents amongst more abundant sequences
• Define chimera index, reflecting the probability that the sequence was generated by evolution
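The parent and break point search can be sketched as below. This is a deliberately simplified illustration, not Perseus itself: it assumes equal-length, pre-aligned sequences, uses plain mismatch counts instead of alignments, and stops before any chimera index or classification step. The function names are invented for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_chimera_fit(query, abundances):
    """Find the two-parent model closest to `query`.

    `abundances` maps equal-length, pre-aligned sequences to read counts.
    Returns (mismatches, parent_a, parent_b, breakpoint), or None when
    fewer than two candidate parents exist.
    """
    # Candidate parents: at least as abundant as the query itself.
    parents = [s for s, n in abundances.items()
               if s != query and n >= abundances[query]]
    best = None
    for a in parents:
        for b in parents:
            if a == b:
                continue
            for k in range(1, len(query)):
                # Left part compared with parent a, right part with parent b.
                d = hamming(query[:k], a[:k]) + hamming(query[k:], b[k:])
                if best is None or d < best[0]:
                    best = (d, a, b, k)
    return best

reads = {"AAAAAAAAAA": 100, "CCCCCCCCCC": 80, "AAAAACCCCC": 3}
fit = best_chimera_fit("AAAAACCCCC", reads)
single = min(hamming("AAAAACCCCC", s) for s in reads if s != "AAAAACCCCC")
print(fit)     # (0, 'AAAAAAAAAA', 'CCCCCCCCCC', 5)
print(single)  # 5
```

In this toy case the rare read is explained perfectly by two abundant parents (0 mismatches) but poorly by any single one (5 mismatches), which is the kind of contrast a chimera index is meant to capture.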
Results
• Pyrosequencing (FLX) of artificial community:
• 90 clones with “known sequence”, i.e. Sanger-sequenced to depth
• V5 + V6 hypervariable regions
• Different concentrations
• Multiple alignment to determine OTUs at different cutoffs.
• The correct result is known!
Benchmarking
No more “Rare Biosphere”?
• Some environments still appear very diverse
• Rift Valley soda lakes: 11,835 filtered reads gave 863 OTUs (3%) without noise removal
• Following noise removal, 585 OTUs were obtained, a 1/3 reduction in observed diversity
• Abundance distribution still highly skewed
How to Use
• http://code.google.com/p/ampliconnoise/
• Manual with installation instructions and detailed description of scripts, programs etc.
• Requires MPI and the GNU Scientific Library (GSL) to run on a cluster or SMP (for all but small datasets)
• Big datasets can be pre-clustered to reduce time.
Convenience Scripts
• RunPreSplit: Normal pipeline for de-noising and OTU-clustering a one-barcode sample from GS Titanium
• RunPreSplitXL: For bigger datasets that require a pre-splitting step
• RunTitanium: Normal pipeline for de-noising SFF file with all barcodes in it
• ampliconflow.jar: Java library for processing of SFF files and OTUs
• Also gives statistics and some diversity indices
Concepts
Operational Taxonomic Unit (OTU)
• Used instead of species concept
• Typically 97% similarity cluster of 16S gene
• Meaningful?
• Ecotype studies suggest 99%
• Multiple 16S genes
• Subunit cloning
OTU Clustering
• Purpose: construct clusters with similar sequences, e.g. to estimate diversity more robustly
• Agglomerative hierarchical clustering or “linkage clustering”
• Distance matrix from multiple alignment used for distance between sequences
• Produces OTUs (Operational Taxonomic Units), typically with 3% maximum distance limit
Linkage Clustering
Start by linking together sequences with the least distance.
Then group together clusters, as long as distance is <3%.
Linkage Clustering
• 3 Types of Linkage clustering, using different ways to measure distance between clusters
• 1) Maximum / Complete Linkage Clustering
• 2) Minimum / Single Linkage Clustering
• 3) Average Linkage Clustering (UPGMA)
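The three linkage types differ only in how the distance between two clusters is derived from the pairwise sequence distances. The sketch below is a small pure-Python illustration on a made-up 4-sequence distance matrix; it is not how any production OTU tool is implemented, and real datasets need far more efficient algorithms.

```python
def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pairs = [dist[i][j] for i in c1 for j in c2]
    if linkage == "complete":   # maximum pairwise distance
        return max(pairs)
    if linkage == "single":     # minimum pairwise distance
        return min(pairs)
    if linkage == "average":    # mean pairwise distance (UPGMA)
        return sum(pairs) / len(pairs)
    raise ValueError("unknown linkage: " + linkage)

def linkage_clustering(dist, cutoff=0.03, linkage="complete"):
    """Merge the closest clusters while their distance is below `cutoff`."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] >= cutoff:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical distances: sequences 0-1-2 form a chain, 3 is an outlier.
D = [[0.000, 0.020, 0.035, 0.100],
     [0.020, 0.000, 0.020, 0.100],
     [0.035, 0.020, 0.000, 0.100],
     [0.100, 0.100, 0.100, 0.000]]
for rule in ("complete", "single", "average"):
    print(rule, linkage_clustering(D, 0.03, rule))
```

With this matrix, complete linkage keeps sequence 2 in its own OTU because it is 3.5% from sequence 0, whereas single and average linkage merge the whole chain. Complete linkage therefore guarantees that no two members of an OTU are further apart than the cutoff.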
Chao-1
• Commonly used, non-parametric heuristic (Chao, 1987)
• Based on #OTUs, #singletons and #doubletons
• Gives lower-bound of diversity
• Increases with sequencing depth and is very sensitive to un-removed noisy sequences
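For reference, the classic Chao-1 estimate is S_obs + F1^2 / (2 * F2), where S_obs is the number of observed OTUs, F1 the number of singletons and F2 the number of doubletons. The sketch below uses made-up OTU counts and falls back to the bias-corrected form S_obs + F1 * (F1 - 1) / 2 when there are no doubletons.

```python
def chao1(otu_counts):
    """Chao-1 richness estimate from a list of per-OTU read counts."""
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)                      # observed OTUs
    f1 = sum(1 for c in counts if c == 1)    # singletons
    f2 = sum(1 for c in counts if c == 2)    # doubletons
    if f2 == 0:
        # Bias-corrected form avoids division by zero.
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 * f1 / (2.0 * f2)

clean = [10, 5, 2, 2, 1, 1, 1, 1]
noisy = clean + [1, 1, 1, 1]   # four spurious singleton OTUs from noise
print(chao1(clean))  # 12.0
print(chao1(noisy))  # 28.0
```

Because singletons enter the formula squared, four noisy reads are enough here to more than double the estimate, which illustrates the sensitivity to un-removed noise.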
[Figure from Gihring et al (2009), “Pyrosequencing exacerbates sample size bias”, Environmental Microbiology. © 2011 Society for Applied Microbiology and Blackwell Publishing Ltd]
A Novel Bayesian Approach
• Quince et al (2008), “The rational exploration of microbial diversity”, ISME Journal (to appear)
• Bayes’ theorem: P(parameters|data) ∝ P(data|parameters) × P(parameters), i.e. posterior ∝ likelihood × prior
In this case: fit an underlying parametric taxa-abundance distribution and total diversity to the observed taxa-abundance distribution
Parametric Bayesian Approach (Quince et al, 2009)
• 3 different distributions tried:
• Log-normal
• Inverse Gaussian
• Sichel distribution
• Some others shown to be poorer fit to samples (Exponential, gamma and mixtures)
• Maximise posterior probability using Markov chain Monte-Carlo sampling of parameters
• Verified on census rainforest tree dataset (Barro Colorado Island complete census of 222,655 trees identified to 303 species)
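The MCMC step can be illustrated on a much simpler problem than the one in the paper. The sketch below is a toy Metropolis sampler for a single Poisson rate with a flat prior, using made-up counts; it does not implement the log-normal, inverse Gaussian or Sichel abundance models, and all names and settings are illustrative assumptions.

```python
import math
import random

def log_posterior(lam, data):
    """log[P(data|lam) * P(lam)] up to a constant, flat prior on lam > 0."""
    if lam <= 0:
        return float("-inf")
    return sum(x * math.log(lam) - lam for x in data)

def metropolis(data, n_steps=20000, step=0.5, start=1.0, seed=1):
    """Random-walk Metropolis sampling of the posterior of lam."""
    rng = random.Random(seed)
    lam = start
    current = log_posterior(lam, data)
    samples = []
    for _ in range(n_steps):
        proposal = lam + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            lam, current = proposal, candidate
        samples.append(lam)
    return samples[n_steps // 4:]  # discard the first quarter as burn-in

data = [3, 5, 4, 6, 2, 4, 5, 3]   # hypothetical counts
samples = sorted(metropolis(data))
mean = sum(samples) / len(samples)
lower = samples[int(0.025 * len(samples))]
median = samples[len(samples) // 2]
upper = samples[int(0.975 * len(samples))]
print(round(mean, 2), round(lower, 2), round(median, 2), round(upper, 2))
```

The sorted samples give a lower bound, median and upper bound directly. For this toy model the exact posterior is a gamma distribution with mean 33/8 = 4.125, so the sampled mean should land close to that value.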
Application to Arctic soils
Name               Soil type    Age (yrs)  Sample size  Clean no.  3% OTUs  3% Chao
Midtre Lovenbre 7  Tundra       2000       33,445       23,105     1,936    2,751
Midtre Lovenbre 1  Rock         10         35,569       24,004     1,578    2,339
Knutsenheia        Desert       n.a.       21,474       14,338     1,599    2,459
Storholmen         Island       1000       23,679       14,586     1,554    2,349
14th July no. 2    Bird cliffs  n.a.       32,661       23,264     1,763    2,405
14th July no. 3    Bird cliffs  n.a.       19,796       11,846     1,398    2,056
Arctic soils
Name               Log-normal      Inverse Gaussian  Sichel
Midtre Lovenbre 7  4542:5369:6546  3615:4006:4520    2994:3234:3565
Midtre Lovenbre 1  3787:4628:5994  2939:3307:3803    2421:2654:2990
Knutsenheia        3892:4728:6032  3179:3616:4275    2752:3117:3934
Storholmen         4099:5126:6758  3208:3705:4438    2746:3151:3905
14th July no. 2    3815:4476:5427  2979:3257:3624    2698:2933:3282
14th July no. 3    2990:3552:4420  2521:2827:3281    2176:2416:2777
90% sampling efforts
Name               Log-normal  Inverse Gaussian  Sichel
Midtre Lovenbre 7  1.92e+06    2.58e+05          1.16e+05
Midtre Lovenbre 1  2.94e+06    2.84e+05          1.26e+05
Knutsenheia        1.37e+06    1.97e+05          1.16e+05
Storholmen         2.25e+06    2.32e+05          1.32e+05
14th July no. 2    1.61e+06    1.88e+05          1.25e+05
14th July no. 3    5.98e+05    1.19e+05          6.63e+04