Amplicon Sequencing, De-Noising and Diversity Estimation

Anders Lanzén
Dept. of Biology & Centre for Geobiology, University of Bergen

Tuesday, 17 April 2012

Outline

• Introduction, sequencing technologies

• Different types of de-noising and filtering

• AmpliconNoise

• Chimera removal

• OTU Clustering

• Parametric diversity estimation

Introduction

• NGS has revolutionised microbial ecology

• Diversity - number of unique sequences / OTUs

• Taxonomical structure - what is in the sample?

• In prokaryotes, typically SSU rRNA (16S)

• First results from such studies were striking (e.g. Sogin: 30k species from 600k reads), but vastly overestimated the diversity of “the rare biosphere” [Kunin et al 2009, Quince et al 2009]

NGS Technologies

| Name | ~Reads / run | ~Cost / run | ~Read length | AmpliconNoise |
|---|---|---|---|---|
| 454 GS Titanium | 1M | €10,000 | 450 bp | Yes |
| IonTorrent | 2M | €1,000 | 240 bp | Soon? |
| Illumina HiSeq | 80M (per lane, PE) | €2,000 | 100-150 bp | No |

The noise problem

• No resequencing possible (no cloning step)

• 454: 0.25% error rate after filtering --> 3% of reads with > 4% incorrect bases

• --> Huge overprediction of OTUs without filtering (>50,000 per run)

• + PCR noise, incl. chimeric sequences

Denoising Pipelines

[Diagram: pipeline stages]

• 454 Sequencing

• Base-calling (not in AmpliconNoise)

• Filtering

• Clustering (once or more)

• Chimera removal

• OTU clustering and taxonomic classification

Noise Removal Pipelines

| Method | Distance metric | Algorithm | Source |
|---|---|---|---|
| PyroNoise (Quince et al. 2009) | Flowgram | Probabilistic, iterative | |
| Pyrotagger (Kunin et al. 2010) | Sequence | Minimum distance threshold, one-pass agglomerative | Pyrotagger website |
| Single-linkage preclustering (SLP) (Huse et al. 2010) | Sequence | Minimum distance threshold, one-pass agglomerative | VAMPS website |
| DeNoiser (Reeder & Knight 2010) | Flowgram | Minimum distance threshold, agglomerative | QIIME |
| AmpliconNoise (Quince et al., BMC Bioinformatics 2011) | PyroNoise: flowgram; SeqNoise: sequence | Probabilistic, iterative | Google Code: http://code.google.com/p/ampliconnoise/; Erick Matsen: https://github.com/fhcrc/ampliconnoise; MOTHUR |

Flowgrams: The pyrosequencing raw data

• Sequence is read as light intensity when bases are washed over the 454 plate in the order T C A G

Example: 0.03 1.03 0.09 0.12 0.89 0.09 0.09 1.01 0.11 1.03 0.12 0.12 2.00 0.12 1.92

Translation: CTGCTTAA
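The translation above can be reproduced with a minimal Python sketch. It assumes the simplest possible base-calling rule (round each flow intensity to the nearest integer homopolymer length); the function name `call_bases` is made up for this illustration, and real base-callers are more elaborate.

```python
FLOW_ORDER = "TCAG"  # nucleotide flow order stated on the slide


def call_bases(flowgram, flow_order=FLOW_ORDER):
    """Translate flow intensities into a sequence by naive rounding."""
    sequence = []
    for index, intensity in enumerate(flowgram):
        base = flow_order[index % len(flow_order)]
        # Assumption: homopolymer length = intensity rounded to nearest integer
        homopolymer_length = int(intensity + 0.5)
        sequence.append(base * homopolymer_length)
    return "".join(sequence)


example = [0.03, 1.03, 0.09, 0.12, 0.89, 0.09, 0.09, 1.01,
           0.11, 1.03, 0.12, 0.12, 2.00, 0.12, 1.92]
print(call_bases(example))  # CTGCTTAA
```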

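Flowgrams

• Since intensity values are inexact, similar flowgrams can lead to very different sequences after base-calling

Read #1  TGGGGGCAAAAA
         ||||||  ||||
Read #2  TGGGGGGCAAAA

Similarity = 10/12 = 83%

[Figure: flowgram intensities of the two reads over the flows A G T C A. The empty flows read 0.01-0.05; the others differ only slightly: G 5.49 vs 5.52, C 1.01 vs 0.96, A 4.51 vs 4.47. After rounding, these give Read #1 (five G, five A) and Read #2 (six G, four A).]

• Solution: Flowgram-to-sequence clustering. This needs a distance metric!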
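The contrast between the two views of the same reads can be checked in a small Python sketch. The flow values are those shown in the figure; which near-zero value belongs to which empty flow is an assumption, and the summed absolute difference is only an illustrative stand-in, not the probabilistic distance used by PyroNoise.

```python
# Flow intensities over the flows A G T C A, as in the figure.
# Assumption: the assignment of the near-zero values to the empty flows.
read1_flows = [0.05, 5.49, 0.02, 1.01, 4.51]
read2_flows = [0.04, 5.52, 0.01, 0.96, 4.47]

read1_seq = "TGGGGGCAAAAA"
read2_seq = "TGGGGGGCAAAA"


def sequence_identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def flow_difference(f, g):
    """Summed absolute intensity difference (illustrative metric only)."""
    return sum(abs(x - y) for x, y in zip(f, g))


identity = sequence_identity(read1_seq, read2_seq)
difference = flow_difference(read1_flows, read2_flows)
print(f"sequence identity after base-calling: {identity:.0%}")
print(f"summed flow intensity difference: {difference:.2f}")
```

The reads differ by well under one flow unit in total, yet share only 10 of 12 positions once base-called, which is why clustering is done on flowgrams rather than on called sequences.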

• Calculation of the probability density P(I|n), where I = signal intensity and n = homopolymer length
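As a rough illustration of what such a density looks like, the sketch below assumes P(I|n) is Gaussian with mean n and a standard deviation that grows linearly with n. Both the Gaussian shape and the parameters `base_sd` and `sd_per_base` are assumptions made for this example; the real distributions are estimated from data and are not given on the slide.

```python
import math


def flow_density(intensity, n, base_sd=0.04, sd_per_base=0.03):
    """Toy P(I|n): Gaussian centred on n with spread growing with n (assumed)."""
    sd = base_sd + sd_per_base * n
    z = (intensity - n) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def most_likely_length(intensity, max_length=10):
    """Homopolymer length with the highest density for this intensity."""
    return max(range(max_length + 1), key=lambda n: flow_density(intensity, n))


# A clean signal is unambiguous ...
print(most_likely_length(1.03))
# ... whereas 5.49 is almost equally plausible for n = 5 and n = 6.
print(flow_density(5.49, 5), flow_density(5.49, 6))
```

Under a model of this kind, long homopolymers are inherently ambiguous, which is the motivation for a probabilistic flowgram distance instead of simple rounding.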

AmpliconNoise

[Diagram: AmpliconNoise workflow]

• Input: samples in 454 sff file format

• Pre-processing: split on barcodes; filter flowgrams; extract flowgrams

• PyroNoise: probabilistic flowgram clusterer (removes pyrosequencing errors)

• SeqNoise: probabilistic sequence clusterer (removes PCR single-base errors)

• Perseus: chimera classifier (removes PCR chimeras)

• OTU construction: using complete linkage and exact pairwise alignments (use best available OTU construction algorithm)

AmpliconNoise Algorithm

1. Pre-filtering: truncates each read at base 360 or at the first noisy flow

2. PyroDist: computes the distance matrix between all flowgrams and perfect flowgrams (in the first step taken from hierarchical clustering of flowgrams to base-called sequences)

3. PyroNoise: Bayesian probability of each flowgram having been generated by a particular perfect flowgram --> reassignment of flowgrams to the sequence they most likely come from, until convergence. Iterative ML algorithm

4. Base calling, giving unique sequences cleaned of sequencing noise

5. SeqDist and SeqNoise: cleaning of PCR point mutation artefacts

6. Perseus (or PerseusM): probabilistic self-referenced chimera removal
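The iterative reassignment idea in steps 2 and 3 can be sketched as a toy Python loop. This is a simplified hard-assignment variant written for illustration only: the mean absolute flow difference, the exponential likelihood, the `sigma` parameter and the example data are all assumptions, not the PyroDist/PyroNoise implementation.

```python
import math


def flow_distance(flowgram, perfect):
    """Mean absolute difference between observed and ideal flows (toy metric)."""
    return sum(abs(a - b) for a, b in zip(flowgram, perfect)) / len(perfect)


def denoise(flowgrams, centres, sigma=0.1, max_iter=50):
    """Assign each flowgram to its most probable perfect flowgram until stable."""
    weights = [1.0 / len(centres)] * len(centres)  # relative abundances
    assignment = None
    for _ in range(max_iter):
        new_assignment = []
        for f in flowgrams:
            # Score = abundance of the centre x likelihood of the flowgram
            scores = [w * math.exp(-flow_distance(f, c) / sigma)
                      for w, c in zip(weights, centres)]
            new_assignment.append(scores.index(max(scores)))
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        weights = [assignment.count(k) / len(flowgrams)
                   for k in range(len(centres))]
    return assignment, weights


# Hypothetical data: two perfect flowgrams differing in one homopolymer length
centres = [[1, 0, 0, 2], [1, 0, 0, 3]]
flowgrams = [
    [1.02, 0.05, 0.03, 2.10],
    [0.97, 0.02, 0.06, 1.95],
    [1.01, 0.04, 0.02, 2.05],
    [1.00, 0.03, 0.04, 2.52],  # ambiguous last flow
]
assignment, weights = denoise(flowgrams, centres)
print(assignment, weights)
```

In this example the ambiguous read is marginally closer to the second centre on distance alone, but once abundances are taken into account it is reassigned to the abundant sequence, so a single noisy read no longer creates a spurious unique sequence.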

Chimera removal - Perseus

• Assumes that a chimera’s parents will be in the dataset with equal or greater frequency than the chimera

• For each sequence directly:

  • Search for possible parents and break points that give a close match

  • Only search for parents amongst more abundant sequences

• Define chimera index: reflecting the probability that the sequence was generated by evolution
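The parent and break-point search can be sketched in Python as follows. This is a toy version under strong assumptions: sequences are pre-aligned and of equal length, fit is measured by simple mismatch counts, and the `margin` threshold replaces the chimera index and classifier that Perseus actually uses. All function names are made up for the illustration.

```python
def mismatches(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def best_two_parent_model(query, parents):
    """Best (mismatches, left parent, right parent, break point) over all pairs."""
    best = None
    for left in parents:
        for right in parents:
            if left == right:
                continue
            for break_point in range(1, len(query)):
                d = (mismatches(query[:break_point], left[:break_point])
                     + mismatches(query[break_point:], right[break_point:]))
                if best is None or d < best[0]:
                    best = (d, left, right, break_point)
    return best


def looks_chimeric(query, abundances, margin=3):
    """Flag a query that two parents explain much better than any single one."""
    # Only sequences at least as abundant as the query may act as parents
    parents = [s for s, n in abundances.items()
               if s != query and n >= abundances[query]]
    if len(parents) < 2:
        return False
    chimera_fit = best_two_parent_model(query, parents)[0]
    single_fit = min(mismatches(query, p) for p in parents)
    return single_fit - chimera_fit >= margin  # margin is a hypothetical cutoff


# Hypothetical abundances: two abundant sequences and one rare hybrid
abundances = {"AAAAAAAAAA": 50, "CCCCCCCCCC": 40, "AAAAACCCCC": 3}
flags = {seq: looks_chimeric(seq, abundances) for seq in abundances}
print(flags)
```

Restricting candidate parents to sequences at least as abundant as the query is what makes the search self-referenced: no external reference database is needed.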

Results

• Pyrosequencing (FLX) of artificial community:

• 90 clones with “known sequence”, i.e. Sanger-sequenced to depth

• V5 + V6 hypervariable regions

• Different concentrations

• Multiple alignment to determine OTUs at different cutoffs.

• The correct result known!

Benchmarking

No more “Rare Biosphere”?

• Some environments still appear very diverse

• Rift Valley soda lakes: 11,835 filtered reads gave 863 3% OTUs without noise removal

• Following noise removal, 585 OTUs were obtained: a 1/3 reduction in observed diversity

• Abundance distribution still highly skewed

How to Use

• http://code.google.com/p/ampliconnoise/

• Manual with installation instructions and detailed description of scripts, programs etc.

• Requires MPI and the GNU Scientific Library (GSL); should be run on a cluster or SMP machine for all but small datasets

• Big datasets can be pre-clustered to reduce time.

Convenience Scripts

• RunPreSplit: Normal pipeline for de-noising and OTU-clustering a one-barcode sample from GS Titanium

• RunPreSplitXL: For bigger datasets that require pre-splitting step

• RunTitanium: Normal pipeline for de-noising SFF file with all barcodes in it

• ampliconflow.jar: Java library for processing of SFF files and OTUs

• Also gives statistics and some diversity indices

Concepts: Operational Taxonomic Unit (OTU)

• Used instead of species concept

• Typically 97% similarity cluster of 16S gene

• Meaningful?

• Ecotype studies suggest 99%

• Multiple 16S genes

• Subunit cloning

OTU Clustering

• Purpose: construct clusters with similar sequences, e.g. to estimate diversity more robustly

• Agglomerative hierarchical clustering or “linkage clustering”

• Distance matrix from multiple alignment used for distance between sequences

• Produces OTUs (Operational Taxonomic Units), typically with 3% maximum distance limit

Linkage Clustering

Start by linking together sequences with the least distance.

Then group together clusters, as long as distance is <3%.

• 3 Types of Linkage clustering, using different ways to measure distance between clusters

• 1) Maximum / Complete Linkage Clustering

• 2) Minimum / Single Linkage Clustering

• 3) Average Linkage Clustering (UPGMA)
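The three linkage types differ only in how the distance between two clusters is derived from the pairwise sequence distances. A minimal Python sketch, assuming a precomputed distance matrix and the 3% cutoff mentioned above (the function name and the example distances are made up for illustration):

```python
def linkage_cluster(dist, cutoff=0.03, linkage="complete"):
    """Agglomerative clustering of items 0..n-1 from a full distance matrix."""
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        pairwise = [dist[i][j] for i in a for j in b]
        if linkage == "complete":   # maximum pairwise distance
            return max(pairwise)
        if linkage == "single":     # minimum pairwise distance
            return min(pairwise)
        if linkage == "average":    # mean pairwise distance (UPGMA)
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unknown linkage: {linkage}")

    while len(clusters) > 1:
        # Find the closest pair of clusters
        d, x, y = min((cluster_distance(clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters)))
        if d >= cutoff:
            break  # nothing left to merge below the cutoff
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical distances between four sequences
dist = [
    [0.000, 0.005, 0.040, 0.100],
    [0.005, 0.000, 0.010, 0.100],
    [0.040, 0.010, 0.000, 0.100],
    [0.100, 0.100, 0.100, 0.000],
]
for method in ("complete", "single", "average"):
    print(method, linkage_cluster(dist, linkage=method))
```

With these distances, complete linkage keeps sequence 2 separate because it is 4% from sequence 0, while single and average linkage merge it, so the same data yield three or two OTUs depending on the linkage chosen.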

Chao-1

• Commonly used, non-parametric heuristic (Chao, 1987)

• Based on #OTUs, #singletons and #doubletons

• Gives lower-bound of diversity

• Increases with sequencing depth and very sensitive to un-removed noisy sequences
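For reference, the estimator is usually written as S_obs + F1² / (2·F2), where S_obs is the number of observed OTUs, F1 the number of singletons and F2 the number of doubletons. A minimal Python sketch follows; the fallback used when there are no doubletons is the commonly used bias-corrected form, and the example abundances are made up.

```python
from collections import Counter


def chao1(otu_abundances):
    """Chao-1 lower-bound richness estimate from per-OTU read counts."""
    counts = [n for n in otu_abundances if n > 0]
    observed = len(counts)
    frequency = Counter(counts)
    singletons, doubletons = frequency[1], frequency[2]
    if doubletons > 0:
        return observed + singletons ** 2 / (2 * doubletons)
    # Bias-corrected form, used here when there are no doubletons
    return observed + singletons * (singletons - 1) / 2


# Hypothetical sample: 8 OTUs, 4 singletons, 2 doubletons
print(chao1([10, 5, 1, 1, 1, 1, 2, 2]))  # 12.0
```

Because the estimate depends so strongly on the singleton count, every noisy read that survives as its own OTU inflates it, which is why it is so sensitive to un-removed noise.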

[Figure from Gihring et al (2009), “Pyrosequencing exacerbates sample size bias”. © 2011 Society for Applied Microbiology and Blackwell Publishing Ltd, Environmental Microbiology]

A Novel Bayesian Approach

• Quince et al (2008), “The rational exploration of microbial diversity”, ISME Journal (to appear)

• Bayes’ theorem: P(parameters|data) ∝ P(data|parameters) × P(parameters), i.e. posterior ∝ likelihood × prior

In this case: find the parameters of an underlying parametric distribution, and the sample diversity, that best fit the observed taxa-abundance distribution and diversity

Parametric Bayesian Approach (Quince et al, 2009)

• 3 different distributions tried:

  • Log-normal

  • Inverse Gaussian

  • Sichel distribution

• Some others shown to be poorer fit to samples (Exponential, gamma and mixtures)

• Maximise posterior probability using Markov chain Monte-Carlo sampling of parameters

• Verified on census rainforest tree dataset (Barro Colorado Island complete census of 222,655 trees identified to 303 species)

Application to Arctic soils

| Name | Soil type | Age (yrs) | Sample size | Clean no. | 3% OTUs | 3% Chao |
|---|---|---|---|---|---|---|
| Midtre Lovenbre 7 | Tundra | 2000 | 33,445 | 23,105 | 1,936 | 2,751 |
| Midtre Lovenbre 1 | Rock | 10 | 35,569 | 24,004 | 1,578 | 2,339 |
| Knutsenheia | Desert | n.a. | 21,474 | 14,338 | 1,599 | 2,459 |
| Storholmen | Island | 1000 | 23,679 | 14,586 | 1,554 | 2,349 |
| 14th July no. 2 | Bird cliffs | n.a. | 32,661 | 23,264 | 1,763 | 2,405 |
| 14th July no. 3 | Bird cliffs | n.a. | 19,796 | 11,846 | 1,398 | 2,056 |

Arctic soils

| Name | Log-normal | Inverse Gaussian | Sichel |
|---|---|---|---|
| Midtre Lovenbre 7 | 4542:5369:6546 | 3615:4006:4520 | 2994:3234:3565 |
| Midtre Lovenbre 1 | 3787:4628:5994 | 2939:3307:3803 | 2421:2654:2990 |
| Knutsenheia | 3892:4728:6032 | 3179:3616:4275 | 2752:3117:3934 |
| Storholmen | 4099:5126:6758 | 3208:3705:4438 | 2746:3151:3905 |
| 14th July no. 2 | 3815:4476:5427 | 2979:3257:3624 | 2698:2933:3282 |
| 14th July no. 3 | 2990:3552:4420 | 2521:2827:3281 | 2176:2416:2777 |

90% sampling efforts

| Name | Log-normal | Inverse Gaussian | Sichel |
|---|---|---|---|
| Midtre Lovenbre 7 | 1.92e+06 | 2.58e+05 | 1.16e+05 |
| Midtre Lovenbre 1 | 2.94e+06 | 2.84e+05 | 1.26e+05 |
| Knutsenheia | 1.37e+06 | 1.97e+05 | 1.16e+05 |
| Storholmen | 2.25e+06 | 2.32e+05 | 1.32e+05 |
| 14th July no. 2 | 1.61e+06 | 1.88e+05 | 1.25e+05 |
| 14th July no. 3 | 5.98e+05 | 1.19e+05 | 6.63e+04 |