snpfilt: a pipeline for reference-free ... - fudan...

Post on 11-Jul-2020


SnpFilt: A pipeline for reference-free assembly-based identification of SNPs in bacterial genomes

A/Prof Ruiting Lan
University of New South Wales, Australia

Why interested in SNPs in bacteria?

• Genome sequencing for public health microbiology
– Outbreak investigations
– Disease transmission

Salmonella outbreak 1

• Outbreak 1 occurred in a residential college
• 16 cases of gastroenteritis among students and staff over two days
• MLVA profile 3-11-7-12-523
• Chocolate mousse as possible common food source
• 13 human isolates and 6 mousse isolates were sequenced

Outbreak 1

Genes (rotated column headers; their alignment to the SNP columns was lost in extraction): stfC, STM0270, STM0328.s, allP, fepA, mrdB, mrdA, ybeV, gltL, ybiS, rpsA, rpoS, nlpD, barA, STM3073, arcB, mreB, yhhK, mtlR, rpoZ, rbsR, ilvD, rplL, yjdE, hfq, mpl, arcA

AA changes (same caveat): N -> D, Y -> C, K -> N, N -> D, L -> R, A -> V, H -> L, D -> G, E -> V, S -> I, H -> D, H -> R, A -> V, S -> R, R -> H, Q -> STOP, S -> A, V -> A, Q -> STOP, K -> T

SNP sites are shown relative to the consensus; "." means the isolate matches the consensus at that site.

Lab No.    Source  Epidemiological link  SNP sites
Consensus                                A A G A A T C G C C T A A C C A C A C G C A G C T C T A T C T
1687       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . G . . . . . .
1688       Human   Yes                   . . . . . . . . . . . . . . G . . . . . . . . . . . . . . . .
1689       Human   Yes                   . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . .
1690       Human   Yes                   G . . G G A . . T T . . . . . G . . . . . . A . . T . C . T C
1691       Human   Yes                   . . . . . . . A . . C . . . . . . . . . . . . . . . . . . . .
1692       Human   Yes                   . . . . . . . . . . . . . T . . . . . . T . . . . . A . C . .
1693       Human   Yes                   . G . . . . . . . . . . T . . . . . . . . . . . . . . . . . .
1694       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . .
1695       Human   Yes                   . . T . . . T . . . . G . . . . T . T . . . . . . . . . . . .
1696       Human   Yes                   . . . . . . . . . . . . . . . . . C . . . C . . . . . . . . .
1697       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1698       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1699       Human   Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1700       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1701       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1702       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1703       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . .
1704       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1705       Mousse  Yes                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
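The SNP counts implied by the Outbreak 1 table can be checked with a few lines of Python. This is only an illustrative sketch: the consensus and the rows for isolates 1690 and 1692 are copied from the table, the all-dot row stands for isolate 1700, and the helper names `expand` and `snp_distance` are hypothetical, not part of SnpFilt.

```python
# Consensus and isolate rows copied from the Outbreak 1 table.
# "." means "same base as the consensus at this site".
CONSENSUS = "A A G A A T C G C C T A A C C A C A C G C A G C T C T A T C T".split()

ROWS = {
    "1690": "G . . G G A . . T T . . . . . G . . . . . . A . . T . C . T C",
    "1692": ". . . . . . . . . . . . . T . . . . . . T . . . . . A . C . .",
    "1700": " ".join("." * len(CONSENSUS)),  # identical to the consensus at every site
}


def expand(row, consensus=CONSENSUS):
    """Turn a dotted table row into explicit bases, one per SNP site."""
    calls = row.split()
    if len(calls) != len(consensus):
        raise ValueError("row and consensus must cover the same SNP sites")
    return [base if call == "." else call for call, base in zip(calls, consensus)]


def snp_distance(a, b):
    """Number of SNP sites at which two isolates differ."""
    return sum(x != y for x, y in zip(a, b))


alleles = {lab: expand(row) for lab, row in ROWS.items()}

for lab, bases in alleles.items():
    print(lab, "differs from the consensus at", snp_distance(bases, CONSENSUS), "sites")
print("1690 vs 1692:", snp_distance(alleles["1690"], alleles["1692"]))
```

The same pairwise distance is the quantity a SNP-difference cut-off would be applied to when deciding whether an isolate belongs to an outbreak.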

[Figure: diagram relating the Outbreak 1 isolates (1687 to 1705) by SNP differences; numbers on the branches are SNP counts. Legend: Human - Epidemiologically confirmed; Food/contaminated source.]

Octavia et al. JCM 2015 53:1063

Salmonella outbreak 2

• Outbreak 2 was linked to ready-to-eat food from the same bakery in metropolitan Sydney
• 27 cases
• MLVA type 3-9-8-12-523
• 11 isolates sequenced
– 9 isolates from patients with salmonellosis at the time of the outbreak and residing nearby
  • 4 confirmed as outbreak cases based on descriptive case series
  • 4 unrelated following PHU investigation
  • 1 unknown link (patient did not attend PHU interview)
– 1 isolate from a boot swab and 1 from dirty egg shell rinse from the bakery

Outbreak 2

Gene name (alignment to the SNP columns was lost in extraction): stiC, cydC, cysK, yhgH, dlhH
AA change (same caveat): T -> I, H -> Y

Lab No.    Source                 Date of collection  Epidemiological link  SNP sites
Consensus                                                                   A A G A A A C
1837       Human                  27-Apr-12           No                    . . . . . G .
1838       Human                  26-Apr-12           No                    . . . . . . .
1839       Human                  26-Apr-12           No                    . . . . . . .
1840       Human                  24-Apr-12           Yes                   . . . . . . .
1841       Human                  24-Apr-12           No                    . . . . . . A
1842       Human                  24-Apr-12           Yes                   . . . . . . .
1843       Human                  23-Apr-12           Yes                   . . . . . . .
1844       Human                  23-Apr-12           Yes                   . . . . . . .
1845       Human                  10-Apr-12           Unknown               G G A C G . .
1846       Boot Swab Row 4        03-May-12           Yes                   . . . . . . .
1847       Dirty Egg Shell Rinse  03-May-12           Yes                   . . . . . . .

[Figure: diagram relating the Outbreak 2 isolates (1837 to 1847) by SNP differences; numbers on the branches are SNP counts. Legend: Human - Epidemiologically confirmed; Human - Unknown epidemiological link; Human - Epidemiologically unlinked; Food/contaminated source.]

Octavia et al. JCM 2015 53:1063

Is it a SNP?

Mapping based: Filter reads by QUALITY -> BWA mapping -> SNPs -> Filter SNPs

Assembly based: Filter reads by QUALITY -> De novo assembly (Velvet) -> progressiveMauve alignment -> SNPs

Common SNPs: SNPs called by both approaches

Why reference free?

• The SNPs discovered depend on the reference used

• SNPs are culled in regions of high SNP density (Zhou et al. PLoS Genetics 2013)

SNPfilt workflow

NGS raw reads -> Assembly (SPAdes) -> Map reads to contigs (BWA) -> Apply filters -> SNPs
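The first three workflow steps could be driven from Python roughly as follows. This is a sketch under stated assumptions rather than the actual SnpFilt implementation: the function name, file names and directory layout are hypothetical, and only basic, widely documented invocations of `spades.py`, `bwa index` and `bwa mem` are used. The function only builds the command lines, so nothing is executed here.

```python
from pathlib import Path


def assembly_and_mapping_commands(reads_1, reads_2, workdir):
    """Build (but do not run) the commands for assembly and read re-mapping.

    reads_1 / reads_2: paired-end FASTQ files (hypothetical paths).
    workdir: output directory (hypothetical layout).
    """
    assembly_dir = Path(workdir) / "assembly"
    # SPAdes writes its assembled contigs to contigs.fasta in the output directory.
    contigs = assembly_dir / "contigs.fasta"
    return [
        # 1. De novo assembly of the raw reads.
        ["spades.py", "-1", reads_1, "-2", reads_2, "-o", str(assembly_dir)],
        # 2. Index the contigs so reads can be mapped back to them.
        ["bwa", "index", str(contigs)],
        # 3. Map the same reads to the contigs (bwa mem writes SAM to stdout).
        ["bwa", "mem", str(contigs), reads_1, reads_2],
    ]


for command in assembly_and_mapping_commands("R1.fastq.gz", "R2.fastq.gz", "out"):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run` once the tools are installed; the filtering and SNP-calling steps that follow are specific to the pipeline and are not sketched here.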

SNP calling performance metrics

                        Actual sequence
SNP call algorithm      SNP                     Not SNP
SNP                     True positives (TP)     False positives (FP)
Not SNP                 False negatives (FN)    True negatives (TN)

Precision = TP / (TP + FP)

Sensitivity = TP / (TP + FN)
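As a quick worked example of the two metrics, the sketch below computes precision and sensitivity from confusion-matrix counts. The counts in the usage lines are made up for illustration and do not come from the talk.

```python
def precision(tp, fp):
    """Fraction of called SNPs that are real: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no SNPs were called, precision is undefined")
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """Fraction of real SNPs that were called: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no true SNPs exist, sensitivity is undefined")
    return tp / (tp + fn)


# Hypothetical counts: 45 true SNPs called, 5 spurious calls, 15 true SNPs missed.
print("precision  =", precision(tp=45, fp=5))
print("sensitivity =", sensitivity(tp=45, fn=15))
```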

Choice of assemblers

• ABySS
• CABOG
• MIRA
• MaSuRCA
• SGA
• SOAPdenovo
• SPAdes
• Velvet

[Figure: (A) sensitivity and (B) precision, each on a 0.0 to 1.0 axis, for the eight assemblers on four datasets: M. abscessus (HiSeq), M. abscessus (MiSeq), R. sphaeroides (HiSeq), R. sphaeroides (MiSeq). Assemblies from the GAGE-B study.]

SNP filters

• F1) Regions of excessive coverage
– The running mean of the read coverage across a window of 100 bases is greater than the median + 2 median deviations across the whole assembly

• F2) Low mapping quality
– Mapping quality < 58, for any site within a neighbourhood of 400 bases

• F3) Low coverage
– < 20 reads, or 0 supporting reads in either the forward or reverse direction

• F4) Low forward coverage
– < 10 reads in the forward direction, for any site within a neighbourhood of 20 bases

• F5) High heterogeneity
– The number of supporting reads < 70% for any site within a neighbourhood of 20 bases

• F6) Low base quality
– At least 50 bases within a window of 2000 bases have base quality < q.thres, where q.thres is the mean - 3 standard deviations of quality scores across the whole assembly
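Two of these filters, F1 and F3, can be sketched in plain Python to make the thresholds concrete. This is an illustration under assumptions, not the SnpFilt code: "median deviation" is read here as the median absolute deviation, the 100-base window is taken as centred on each site and truncated at contig ends, and all function and parameter names are hypothetical.

```python
from statistics import median


def running_mean(values, window):
    """Centred running mean; the window is truncated at the ends (assumption)."""
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)
    half = window // 2
    means = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        means.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return means


def excessive_coverage(coverage, window=100, k=2.0):
    """F1 sketch: flag sites whose running-mean coverage exceeds median + k * MAD.

    MAD (median absolute deviation) is an assumed reading of "median deviation".
    """
    med = median(coverage)
    mad = median(abs(c - med) for c in coverage)
    threshold = med + k * mad
    return [m > threshold for m in running_mean(coverage, window)]


def low_coverage(forward_support, reverse_support, depth, min_depth=20):
    """F3 sketch: fewer than min_depth reads, or no supporting read on one strand."""
    return depth < min_depth or forward_support == 0 or reverse_support == 0


# Toy per-base coverage with a 10-base spike, e.g. a collapsed repeat.
coverage = [30] * 50 + [300] * 10 + [30] * 50
flags = excessive_coverage(coverage, window=5)
print("sites flagged by F1:", sum(flags))
print("F3 flags a site with 25 reads but no reverse support:", low_coverage(5, 0, 25))
```

The neighbourhood-based filters (F2, F4, F5) would additionally spread each flag to nearby sites, which is omitted here for brevity.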

Effect of filters: GAGE-B assemblies

                            M. abscessus   M. abscessus   R. sphaeroides   R. sphaeroides
                            (HiSeq)        (MiSeq)        (HiSeq)          (MiSeq)
Filter                      TN    FN       TN    FN       TN    FN         TN    FN
F6: low quality             24    9        8     0        107   1          0     0
F5: high heterogeneity      0     0        61    0        42    0          297   0
F4: low forward coverage    0     0        6     0        2     2          11    0
F3: low coverage            0     0        58    2        0     3          0     2
F2: low mapping quality     0     0        10    10       27    0          3     3
F1: excessive coverage      0     7        0     7        22    0          0     0

Effect of filters: known genomes

                            E. coli K12          M. tuberculosis F11   S. pneumoniae TIGR4
Filter                      Sites     Errors     Sites     Errors      Sites     Errors
F6: low quality             50026     29         244705    151         0         0
F5: high heterogeneity      3652      40         1706      38          91023     121
F4: low forward coverage    8621      0          1365      0           750679    1
F3: low coverage            33832     0          7565      0           33057     4
F2: low mapping quality     15062     4          36744     14          104219    12
F1: excessive coverage      469390    6          375689    0           10357     0
F0: reliable sites          4017250   0          3713565   0           1937574   0
Total assembly size         4694957              4386568               2963539
Genome size                 4641652              4424435               2163340

Coverage required for full SNP calls

[Figure: number of true positives (TPs, axis 0 to 40) against read depth (axis 20 to 100) for MiSeq and NextSeq data.]

Conclusions

• Reference-free assembly-based discovery of SNPs

• Unreliable regions are removed based on the quality and coverage of re-aligned reads

• At least 40-fold coverage is required for reliable and complete SNP calls
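To relate the 40-fold figure to a sequencing run, mean coverage can be estimated as (number of reads x read length) / genome size. The sketch below applies that standard approximation; the function names are hypothetical, the genome size is the E. coli K12 value from the known-genomes table, and the 250-base read length is the MiSeq 2 x 250 bp format mentioned elsewhere in the slides. It ignores read trimming and uneven coverage.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Approximate mean fold-coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_for_coverage(target_coverage, read_length, genome_size):
    """Smallest number of reads giving at least the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


E_COLI_K12_GENOME_SIZE = 4641652  # from the "known genomes" table
READ_LENGTH = 250                 # MiSeq 2 x 250 bp reads

needed = reads_for_coverage(40, READ_LENGTH, E_COLI_K12_GENOME_SIZE)
print("reads needed for 40-fold coverage:", needed)
print("resulting mean coverage:", round(mean_coverage(needed, READ_LENGTH, E_COLI_K12_GENOME_SIZE), 2))
```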

Acknowledgments

• Dr Carmen Chan
• Dr Sophie Octavia
• A/Prof Vitali Sintchenko
• Dr Qinning Wang

• Funding support from National Health and Medical Research Council of Australia

Is it a SNP?

• MiSeq (2 x 250 bp) sequencing
• Mapping reads to the reference genome (LT2): Burrows-Wheeler Aligner (BWA)
• Identification of SNPs: SAMtools
• De novo assembly: Velvet, SPAdes
• Alignment of contigs and scaffolds: progressiveMauve
• Manual verification of SNPs and nature of SNPs: custom scripts

Reads correction by QUAKE

• When coverage is low, correction is worthwhile

Three more outbreaks

[Figure: diagrams relating isolates by SNP differences for Outbreak 3, Outbreak 4 and Outbreak 5 (isolates in the range 1808 to 1853); numbers on the branches are SNP counts. Legend: Human - Epidemiologically confirmed; Human - Unknown epidemiological link; Human - Epidemiologically unlinked; Food/contaminated source.]

Octavia et al. JCM 2015 53:1063

Is it part of the outbreak? Cut-off based on SNP differences

[Figure: the SNP-difference diagrams for Outbreaks 1 to 5 shown side by side. Legend: Human - Epidemiologically confirmed; Human - Unknown epidemiological link; Human - Epidemiologically unlinked; Food/contaminated source.]

Octavia et al. JCM 2015 53:1063

SNP calling performance metrics

• True positives (true SNPs)
• True negatives (true not SNPs)
• False positives (called SNPs but not true SNPs)
• False negatives (SNPs but not called SNPs)

SNP detection

• Quality control is very important
– Filter reads
– Correct reads
– Filter SNPs
– Manual checking of SNPs

• Filter reads by QUALITY / Correct reads
• BWA mapping / SNP site extraction (4352)
• Filter >= 20 reads coverage (3322)
• Divide SNPs into three categories
– Sites that contain >= 70% SNP supporting reads (945)
– Sites that contain 30% to < 70% SNP supporting reads (443)
– Sites that contain < 30% SNP supporting reads (1934)
• Annotations on the diagram: Discard; 1.1% genuine SNPs thrown away; 1.8% false positives
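The three-way split by supporting-read fraction can be written as a small classifier. This is a sketch only: the function name and category labels are hypothetical, and the example sites are invented; just the 20-read and 30%/70% thresholds come from the slide. Integer arithmetic is used so that sites exactly on a threshold are classified without floating-point surprises.

```python
from collections import Counter


def categorise_site(supporting_reads, depth, min_depth=20):
    """Assign a candidate SNP site to a category by its supporting-read fraction."""
    if depth < min_depth:
        return "low_coverage"          # removed by the >= 20 reads filter
    if 10 * supporting_reads >= 7 * depth:
        return "high_support"          # >= 70% SNP supporting reads
    if 10 * supporting_reads >= 3 * depth:
        return "intermediate_support"  # 30% to < 70% SNP supporting reads
    return "low_support"               # < 30% SNP supporting reads


# Invented (supporting reads, depth) pairs for illustration.
sites = [(18, 20), (10, 25), (2, 40), (15, 15), (14, 20), (6, 20)]
print(Counter(categorise_site(s, d) for s, d in sites))
```

On the slide, the three categories (945 + 443 + 1934) add up to the 3322 sites that passed the coverage filter.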
