Amplicon Sequencing, De-Noising and Diversity Estimation

Anders Lanzén

Dept. of Biology & Centre for Geobiology, University of Bergen

Tuesday, 17 April 2012

Outline

• Introduction, sequencing technologies

• Different types of de-noising and filtering

• AmpliconNoise

• Chimera removal

• OTU Clustering

• Parametric diversity estimation

Introduction

• NGS has revolutionised microbial ecology

• Diversity - number of unique sequences / OTUs

• Taxonomical structure - what is in the sample?

• In prokaryotes, typically SSU rRNA (16S)

• First results from studies were striking (e.g. Sogin: 30k species from 600k reads), but vastly overestimated diversity of “the rare biosphere” [Kunin et al 2009, Quince et al 2009]

NGS Technologies

Name              ~Reads / run         ~Cost / run   Read length   AmpliconNoise
454 GS Titanium   1M                   €10,000       450 bp        Yes
IonTorrent        2M                   €1,000        240 bp        Soon?
Illumina HiSeq    80M (per lane, PE)   €2,000        100-150 bp    No

The noise problem

• No resequencing possible (no cloning step)

• 454: 0.25% error rate after filtering, which leaves 3% of reads with >4% incorrect bases

• Result: huge overprediction of OTUs without filtering (>50,000 per run)

• In addition: PCR noise, including chimeric sequences

Denoising Pipelines

• 454 Sequencing

• Filtering

• Clustering (once or more)

• Chimera removal

• OTU Clustering and taxonomical classification

• Base-calling (not in AmpliconNoise)

Noise Removal Pipelines

• PyroNoise (Quince et al. 2009). Distance metric: flowgram. Algorithm: probabilistic, iterative.

• Pyrotagger (Kunin et al. 2010). Distance metric: sequence. Algorithm: minimum distance threshold, one-pass agglomerative. Source: Pyrotagger website.

• Single-linkage preclustering (SLP) (Huse et al. 2010). Distance metric: sequence. Algorithm: minimum distance threshold, one-pass agglomerative. Source: VAMPS website.

• DeNoiser (Reeder & Knight 2010). Distance metric: flowgram. Algorithm: minimum distance threshold, agglomerative.

• AmpliconNoise (Quince et al., BMC Bioinformatics 2011). Distance metric: flowgram (PyroNoise) and sequence (SeqNoise). Algorithm: probabilistic, iterative. Source: Google Code (http://code.google.com/p/ampliconnoise/), Erick Matsen (https://github.com/fhcrc/ampliconnoise), MOTHUR.

Flowgrams

The pyrosequencing raw data

• Sequence is read as light intensity when bases are washed over the 454 plate in the order T C A G

Example: 0.03 1.03 0.09 0.12 0.89 0.09 0.09 1.01 0.11 1.03 0.12 0.12 2.00 0.12 1.92

Translation: CTGCTTAA
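The translation above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the simplest possible base-caller: each flow intensity is rounded to the nearest integer homopolymer length. The function name `call_bases` is made up for illustration and is not part of any 454 or AmpliconNoise tool.

```python
FLOW_ORDER = "TCAG"  # nucleotide flow order stated on the slide


def call_bases(flowgram, flow_order=FLOW_ORDER):
    """Translate flow intensities into bases by rounding each flow.

    Assumption: the homopolymer length is the intensity rounded to the
    nearest integer. Real base-callers are more elaborate.
    """
    bases = []
    for index, intensity in enumerate(flowgram):
        nucleotide = flow_order[index % len(flow_order)]
        homopolymer_length = int(intensity + 0.5)  # intensities are non-negative
        bases.append(nucleotide * homopolymer_length)
    return "".join(bases)


example = [0.03, 1.03, 0.09, 0.12, 0.89, 0.09, 0.09, 1.01,
           0.11, 1.03, 0.12, 0.12, 2.00, 0.12, 1.92]
print(call_bases(example))  # CTGCTTAA
```

Rounding is exactly where noise enters: an intensity of 1.45 and one of 1.55 are nearly identical signals, yet they are called as one and two bases respectively.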

Flowgrams

• Since intensity values are inexact, similar flowgrams can lead to very different sequences after base-calling

Read #1 TGGGGGCAAAAA
        ||||||  ||||
Read #2 TGGGGGGCAAAA

Similarity = 10/12 = 83%

[Figure: signal intensities of Read #1 and Read #2 for the flows A G T C A; values shown: 0.96, 0.04, 1.01, 0.05]

• Solution: Flowgram-to-sequence clustering. This needs a distance metric!

• Calculation of the probability density P(I|n), where I = signal intensity and n = homopolymer length
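One way to turn P(I|n) into a distance is to score an observed flowgram against a noise-free ("perfect") flowgram by its average negative log-likelihood. The sketch below is illustrative only. It assumes a Gaussian density centred on n whose width grows with n; the parameters `sigma0` and `sigma_step` are invented values, and the distributions actually used by PyroNoise are not reproduced here.

```python
import math


def log_flow_density(intensity, n, sigma0=0.06, sigma_step=0.03):
    """log P(I | n) under an ASSUMED Gaussian model (not the tool's own).

    The mean is the homopolymer length n; the standard deviation grows
    linearly with n (hypothetical parameters sigma0 and sigma_step).
    """
    sigma = sigma0 + sigma_step * n
    z = (intensity - n) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def flowgram_distance(flowgram, perfect_flowgram):
    """Mean negative log-likelihood of the observed flows given a perfect flowgram."""
    total = 0.0
    for intensity, n in zip(flowgram, perfect_flowgram):
        total -= log_flow_density(intensity, n)
    return total / len(flowgram)


observed = [0.03, 1.03, 0.09, 0.12, 0.89]
candidate_a = [0, 1, 0, 0, 1]  # perfect flowgram of one candidate sequence
candidate_b = [0, 1, 0, 0, 2]  # same, but with a longer final homopolymer

# The smaller distance marks the more likely source sequence.
print(flowgram_distance(observed, candidate_a) < flowgram_distance(observed, candidate_b))
```

Working on flow intensities rather than called bases means that a signal of 1.45 is treated as only slightly less compatible with n = 1 than with n = 2, instead of being forced into one of them.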

AmpliconNoise

• Samples in 454 sff file format

• Split on barcodes

• Filter flowgrams

• Extract flowgrams

• PyroNoise: probabilistic flowgram clusterer. Removes pyrosequencing errors

• SeqNoise: probabilistic sequence clusterer. Removes PCR single base errors

• Perseus: chimera classifier. Removes PCR chimeras

• OTU construction: using complete linkage and exact pairwise alignments. Use best available OTU construction algorithm

AmpliconNoise Algorithm

1. Pre-filtering: truncates reads at base 360 or at the first noisy flow

2. PyroDist: calculates the distance matrix between all flowgrams and perfect flowgrams (in the first step taken from hierarchical clustering of flowgrams to base-called sequences)

3. PyroNoise: Bayesian probability of each flowgram coming from a particular perfect flowgram, followed by reassignment of flowgrams to the most likely sequence they come from, until convergence (iterative ML algorithm)

4. Base calling, to unique sequences cleaned from sequencing noise

5. SeqDist and SeqNoise: cleaning of PCR point mutation artefacts

6. Perseus (or PerseusM): probabilistic self-referenced chimera removal
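The pre-filtering step (step 1) can be sketched as a simple truncation. The slide does not define "noisy", so the sketch assumes a flow is noisy when its intensity falls in an ambiguous band between 0 and 1 bases (`noisy_low` to `noisy_high`, both hypothetical defaults). It also applies the 360 cutoff to flow positions as a simplification. The function name is made up and this is not the AmpliconNoise implementation.

```python
def prefilter_flowgram(flows, max_flows=360, noisy_low=0.5, noisy_high=0.7):
    """Truncate a flowgram at max_flows or at the first noisy flow.

    Assumptions (not from the slides): a flow is 'noisy' if its intensity
    lies in [noisy_low, noisy_high], and the cutoff counts flows.
    """
    kept = []
    for value in flows[:max_flows]:
        if noisy_low <= value <= noisy_high:
            break  # everything from the first ambiguous signal on is discarded
        kept.append(value)
    return kept


print(prefilter_flowgram([1.02, 0.05, 0.98, 0.61, 1.00]))  # [1.02, 0.05, 0.98]
```

Truncating rather than discarding the whole read keeps the reliable start of each flowgram, at the cost of shorter reads.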

Chimera removal - Perseus

• Assumes that a chimera’s parents will be in the dataset with equal or greater frequency than the chimera

• For each sequence, directly search for possible parents and break points that give a close match

• Only search for parents amongst more abundant sequences

• Define a chimera index reflecting the probability that the sequence was generated by evolution
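The parent and break-point search can be illustrated with a deliberately simplified sketch. It assumes all sequences are already aligned and of equal length, compares them by Hamming distance, and flags a sequence when a two-parent model matches it strictly better than any single parent does. Perseus itself uses alignments and a probabilistic chimera index; none of that is reproduced here, and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def best_two_parent_model(query, parents):
    """Return (mismatches, left_parent, right_parent, split) of the best chimeric model."""
    best = None
    for left in parents:
        for right in parents:
            if left == right:
                continue
            for split in range(1, len(query)):
                model = left[:split] + right[split:]
                mismatches = hamming(query, model)
                if best is None or mismatches < best[0]:
                    best = (mismatches, left, right, split)
    return best


def flag_chimeras(abundance):
    """abundance maps sequence -> read count; returns the suspected chimeras."""
    flagged = []
    for query, count in abundance.items():
        # Candidate parents: other sequences at least as abundant as the query.
        parents = [s for s, c in abundance.items() if s != query and c >= count]
        if len(parents) < 2:
            continue
        best_single = min(hamming(query, p) for p in parents)
        best_model = best_two_parent_model(query, parents)
        if best_model[0] < best_single:
            flagged.append(query)
    return flagged


reads = {
    "AAAAAAAAAA": 100,  # abundant parent
    "CCCCCCCCCC": 80,   # abundant parent
    "AAAAACCCCC": 5,    # left half of one parent, right half of the other
    "AAAAAAAAAT": 3,    # point error, not a chimera
}
print(flag_chimeras(reads))  # ['AAAAACCCCC']
```

Restricting the parent search to sequences of equal or greater abundance keeps the approach self-referenced: no external reference database is needed.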

Results

• Pyrosequencing (FLX) of artificial community:

• 90 clones with “known sequence”, i.e. Sanger-sequenced to depth.

• V5 + V6 hypervariable regions

• Different concentrations

• Multiple alignment to determine OTUs at different cutoffs.

• The correct result is known!

Benchmarking

No more “Rare Biosphere”?

• Some environments still appear very diverse

• Rift Valley soda lakes: 11,835 filtered reads gave 863 OTUs at 3% without noise removal

• Following noise removal, 585 OTUs are obtained, a 1/3 reduction in observed diversity

• Abundance distribution still highly skewed

How to Use

• http://code.google.com/p/ampliconnoise/

• Manual with installation instructions and detailed description of scripts, programs etc.

• Requires MPI and the GNU Scientific Library (GSL) to run on a cluster or SMP machine (for all but small datasets)

• Big datasets can be pre-clustered to reduce time.

Convenience Scripts

• RunPreSplit: Normal pipeline for de-noising and OTU-clustering a one-barcode sample from GS Titanium

• RunPreSplitXL: For bigger datasets that require pre-splitting step

• RunTitanium: Normal pipeline for de-noising SFF file with all barcodes in it

• ampliconflow.jar: Java library for processing of SFF files and OTUs

• Also gives statistics and some diversity indices

Concepts

Operational Taxonomic Unit (OTU)

• Used instead of species concept

• Typically 97% similarity cluster of 16S gene

• Meaningful?

• Ecotype studies suggest 99%

• Multiple 16S genes

• Subunit cloning

OTU Clustering

• Purpose: construct clusters with similar sequences, e.g. to estimate diversity more robustly

• Agglomerative hierarchical clustering or “linkage clustering”

• Distance matrix from multiple alignment used for distance between sequences

• Produces OTUs (Operational Taxonomic Units), typically with 3% maximum distance limit

Linkage Clustering

Start by linking together sequences with the least distance.

Then group together clusters, as long as distance is <3%.

Linkage Clustering

• 3 Types of Linkage clustering, using different ways to measure distance between clusters

• 1) Maximum / Complete Linkage Clustering

• 2) Minimum / Single Linkage Clustering

• 3) Average Linkage Clustering (UPGMA)
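The three linkage types differ only in how the distance between two clusters is derived from the pairwise sequence distances. The following is a minimal pure-Python sketch that assumes a precomputed symmetric distance matrix and merges clusters while their distance stays below the cutoff. The function name and the toy matrix are illustrative, not taken from any specific tool.

```python
def linkage_cluster(dist, cutoff=0.03, linkage="complete"):
    """Agglomerative clustering of items 0..n-1 from a full distance matrix."""

    def cluster_distance(a, b):
        pairwise = [dist[i][j] for i in a for j in b]
        if linkage == "complete":   # maximum: furthest pair of members
            return max(pairwise)
        if linkage == "single":     # minimum: closest pair of members
            return min(pairwise)
        if linkage == "average":    # UPGMA: mean over all pairs
            return sum(pairwise) / len(pairwise)
        raise ValueError("unknown linkage: " + linkage)

    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d >= cutoff:
            break  # no remaining pair is closer than the cutoff
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


D = [
    [0.00, 0.01, 0.05, 0.20],
    [0.01, 0.00, 0.02, 0.20],
    [0.05, 0.02, 0.00, 0.20],
    [0.20, 0.20, 0.20, 0.00],
]
print(linkage_cluster(D, linkage="single"))    # [[0, 1, 2], [3]]
print(linkage_cluster(D, linkage="complete"))  # [[0, 1], [2], [3]]
```

In the toy matrix, sequence 2 is close to sequence 1 but 5% away from sequence 0. Single linkage chains all three into one OTU, whereas complete linkage keeps every member of an OTU within the cutoff of every other member.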

Chao-1

• Commonly used, non-parametric heuristic (Chao, 1987)

• Based on #OTUs, #singletons and #doubletons

• Gives lower-bound of diversity

• Increases with sequencing depth and very sensitive to un-removed noisy sequences
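For reference, the classic form of the estimator is S_obs + F1² / (2·F2), where S_obs is the number of observed OTUs, F1 the number of singletons and F2 the number of doubletons. The sketch below implements that form and falls back to the bias-corrected variant F1·(F1 − 1) / 2 when there are no doubletons. The function name is made up for illustration.

```python
from collections import Counter


def chao1(otu_counts):
    """Chao-1 richness estimate from a list of per-OTU read counts."""
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)                  # observed OTUs
    frequency = Counter(counts)
    f1, f2 = frequency[1], frequency[2]  # singletons, doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used here only to avoid division by zero.
    return s_obs + f1 * (f1 - 1) / 2.0


print(chao1([10, 5, 1, 1, 1, 2, 2]))  # 7 + 3*3 / (2*2) = 9.25
```

Because the estimate is driven entirely by singletons and doubletons, every noisy read that survives as its own singleton OTU inflates it, which is why de-noising matters so much for this statistic.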

From Gihring et al. (2009): Pyrosequencing exacerbates sample size bias

A Novel Bayesian Approach

• Quince et al (2008), “The rational exploration of microbial diversity”, ISME Journal (to appear)

• Bayes’ theorem: P(parameters|data) ∝ P(data|parameters) × P(parameters), i.e. posterior ∝ likelihood × prior

• In this case: find the underlying parametric distribution and total diversity that best fit the observed taxa-abundance distribution and sample diversity

Parametric Bayesian Approach (Quince et al, 2009)

• 3 different distributions tried:

• Log-normal

• Inverse Gaussian

• Sichel distribution

• Some others shown to be poorer fit to samples (Exponential, gamma and mixtures)

• Maximise posterior probability using Markov chain Monte-Carlo sampling of parameters

• Verified on census rainforest tree dataset (Barro Colorado Island complete census of 222,655 trees identified to 303 species)

Application to Arctic soils

Name                Soil type     Age (yrs)   Sample size   Clean no.   3% OTUs   3% Chao
Midtre Lovenbre 7   Tundra        2000        33,445        23,105      1,936     2,751
Midtre Lovenbre 1   Rock          10          35,569        24,004      1,578     2,339
Knutsenheia         Desert        n.a.        21,474        14,338      1,599     2,459
Storholmen          Island        1000        23,679        14,586      1,554     2,349
14th July no. 2     Bird cliffs   n.a.        32,661        23,264      1,763     2,405
14th July no. 3     Bird cliffs   n.a.        19,796        11,846      1,398     2,056

Arctic soils

Name                Log-normal       Inverse Gaussian   Sichel
Midtre Lovenbre 7   4542:5369:6546   3615:4006:4520     2994:3234:3565
Midtre Lovenbre 1   3787:4628:5994   2939:3307:3803     2421:2654:2990
Knutsenheia         3892:4728:6032   3179:3616:4275     2752:3117:3934
Storholmen          4099:5126:6758   3208:3705:4438     2746:3151:3905
14th July no. 2     3815:4476:5427   2979:3257:3624     2698:2933:3282
14th July no. 3     2990:3552:4420   2521:2827:3281     2176:2416:2777

90% sampling efforts

Name                Log-normal   Inverse Gaussian   Sichel
Midtre Lovenbre 7   1.92e+06     2.58e+05           1.16e+05
Midtre Lovenbre 1   2.94e+06     2.84e+05           1.26e+05
Knutsenheia         1.37e+06     1.97e+05           1.16e+05
Storholmen          2.25e+06     2.32e+05           1.32e+05
14th July no. 2     1.61e+06     1.88e+05           1.25e+05
14th July no. 3     5.98e+05     1.19e+05           6.63e+04
