
Streaming algorithms for real-time analysis of Oxford Nanopore sequencing data

Minh Duc Cao
[email protected]

Institute for Molecular Bioscience
The University of Queensland

London Calling 2016
May 27, 2016

Minh Duc Cao, The University of Queensland London Calling 2016

Streaming algorithms

Real-time analysis:
  Answer the biological questions quickly, e.g., infection diagnosis
  Run sequencing only until the answers are obtained
  Decide on complementary experiments
  Save time, save money

Streaming algorithms:
  Process data input as a stream
  Continuously make inferences and update the confidence level
  Are robust to noise and scalable to massive data sets
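The idea of updating an inference and its confidence level as data arrive can be illustrated with a small sketch. This is not the method used in the talk: it assumes, purely for illustration, that each read has already been labelled with a species, estimates the proportion of one species with a Wilson score interval, and stops once the interval is narrow enough. The names `wilson_interval`, `stream_proportion` and `max_width` are hypothetical.

```python
import math
from typing import Iterable, Iterator, Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return 0.0, 1.0
    p = hits / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def stream_proportion(labels: Iterable[str], target: str,
                      max_width: float = 0.1) -> Iterator[Tuple[int, float, float, float]]:
    """Consume labels one at a time and yield (n, estimate, low, high) after each.

    Stops as soon as the interval is narrower than max_width, in the spirit of
    running sequencing only until the answer is obtained.
    """
    hits = 0
    for n, label in enumerate(labels, start=1):
        hits += (label == target)
        low, high = wilson_interval(hits, n)
        yield n, hits / n, low, high
        if high - low < max_width:
            return


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Hypothetical stream: each read is labelled with the species it matched.
    reads = (rng.choice(["E. coli"] * 3 + ["S. aureus"]) for _ in range(100_000))
    for n, est, low, high in stream_proportion(reads, "E. coli"):
        pass
    print(f"stopped after {n} reads: {est:.2f} ({low:.2f}-{high:.2f})")
```

Because the function is a generator, the caller sees an updated estimate after every read and can stop consuming the stream at any point.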


Real-time analysis workflow

[Figure: workflow diagram]
Preparation: DNA extraction (2 hours), Library preparation (2.5 hours), Sequencing setup (0.5 hours)
Run simultaneously: MinION sequencing, Basecalling (Metrichor), Fastq extraction, Scaffold assemblies, Species typing, Strain typing, Resistance profile


Fastq extraction

[Figure: real-time analysis workflow. DNA extraction (2 hours), Library preparation (2.5 hours), Sequencing setup (0.5 hours), MinION sequencing, Basecalling (Metrichor), Fastq extraction, Scaffold assemblies, Species typing, Strain typing, Resistance profile]

(Cao et al, 2015): Bioinformatics, DOI: 10.1093/bioinformatics/btv658
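As an illustration of what consuming extracted reads as a stream can look like downstream, here is a minimal sketch of a FASTQ reader that yields one record at a time. It is not the implementation from the cited paper; it assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines), and `FastqRecord` and `read_fastq` are hypothetical names.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(stream: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream as they arrive."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of stream
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        sequence = stream.readline().rstrip("\n")
        plus = stream.readline().rstrip("\n")
        quality = stream.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    demo = io.StringIO("@read1\nACGT\n+\nIIII\n")
    for record in read_fastq(demo):
        print(record.name, len(record.sequence))
```

Passing `sys.stdin` instead of the `io.StringIO` demo lets the same loop sit at the end of a shell pipe and process reads while sequencing is still running.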


Scaffold and complete genome assemblies

[Figure: scaffolding pipeline. Labels: Pre-assemblies; Stream of long reads; BWA-MEM (aligning); Stream of alignment records; pairing; Stream of bridges; connecting; Extending scaffolds; repeats; continuing process; output in real-time]

(Cao et al, 2016): bioRxiv, DOI: 10.1101/054783
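The flow from alignment records to bridges to extended scaffolds can be sketched in a heavily simplified form. The code below is an illustration only and not the algorithm from the cited preprint: it assumes each alignment is reduced to a hypothetical `(read_name, contig_name)` pair, that all alignments of one read are adjacent in the stream, and it ignores orientation, distances and repeats. A bridge is accepted once `min_support` reads (an assumed parameter) link the same two contigs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# (read_name, contig_name): a hypothetical simplification of an alignment record.
Alignment = Tuple[str, str]


class ScaffoldSets:
    """Union-find over contigs: contigs in one set belong to one scaffold."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, contig: str) -> str:
        self.parent.setdefault(contig, contig)
        while self.parent[contig] != contig:
            self.parent[contig] = self.parent[self.parent[contig]]  # path halving
            contig = self.parent[contig]
        return contig

    def union(self, a: str, b: str) -> bool:
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        self.parent[root_b] = root_a
        return True


def stream_joins(alignments: Iterable[Alignment],
                 min_support: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield a contig pair each time a bridge gains enough support to merge two scaffolds."""
    support: Dict[Tuple[str, str], int] = defaultdict(int)
    scaffolds = ScaffoldSets()
    current_read: Optional[str] = None
    contigs: List[str] = []  # contigs hit by the read currently being collected

    def flush() -> Iterator[Tuple[str, str]]:
        # Consecutive contigs hit by one long read form candidate bridges.
        for a, b in zip(contigs, contigs[1:]):
            if a == b:
                continue
            key = (a, b) if a < b else (b, a)
            support[key] += 1
            if support[key] >= min_support and scaffolds.union(a, b):
                yield key

    for read, contig in alignments:
        if read != current_read:
            yield from flush()
            current_read, contigs = read, []
        contigs.append(contig)
    yield from flush()
```

Joins are emitted as soon as the supporting reads have been seen, so the output grows while the input stream is still being produced.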

MinION sequencing

Sequence two K. pneumoniae strains with the MinION:

Strain                    Chemistry  Date      Coverage  Del    Ins    Mis     Unaligned
BAA-2146 (NDM-1 strain)   R7         Sep 2014  33-X      9.5%   6.3%   15.3%   13.3%
13883 (type strain)       R7.3       Dec 2014  18-X      7.9%   6.0%   12.9%   11%

Del, Ins and Mis are the Emp. errors columns.
[Figure columns: Phred Quality Scores; Read Length Distribution]


Scaffolding and completing genome results

[Figure: number of contigs and N50 (Mb) against coverage (-fold), one panel each for K. pneumoniae BAA-2146 (33X), K. pneumoniae 13883 (18X) and S. cerevisiae W303 (190X)]

K. pneumoniae BAA-2146 (33X)
Methods    #Contig  N50 (Mb)  Mis-assemblies  Errors/100Kb  CPU hrs
SPAdes          90      0.28               0           4.7     15.6
-Hybrid         17      3.10               1           6.6     16.1
-SSPACE         53      0.40               4          12.7     17.9
-LINK           31      0.55               5          16.1     19.7
-npScarf         5      5.40               0          20.0     17.2
NaS             29      0.34              15          18.9      328
Nanocorr        68      0.14               8         141.3      314

K. pneumoniae 13883 (18X)
Methods    #Contig  N50 (Mb)  Mis-assemblies  Errors/100Kb  CPU hrs
SPAdes          69      0.30               5           6.2     17.0
-Hybrid         15      0.73              19           8.0     17.0
-SSPACE         36      0.69              13          12.4     18.4
-LINK           17      1.53              18          16.1     18.1
-npScarf         4      4.92              21          10.8     17.5
NaS             38      0.39              36          10.2    199.7
Nanocorr        60      0.15              16         118.3      163

S. cerevisiae W303 (190X)
Methods    #Contig  N50 (Mb)  Mis-assemblies  Errors/100Kb  CPU hrs
SPAdes         364      0.16              29         124.1     20.5
-Hybrid        240      0.35              68         158.1     67.8
-SSPACE        263      0.39              89         136.7     52.0
-LINK          161      0.58              83         143.0     47.5
-npScarf        19      0.91              82         141.9     41.8
NaS            121      0.15             123         140.1     9951
Nanocorr       108      0.60             133         197.0     7480
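The N50 values in these results follow the usual definition: the contig length L such that contigs of length at least L together cover at least half of the assembly. A minimal sketch of that calculation follows (the function name `n50` is just an illustrative choice, and the lengths in the usage note are made up).

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly
```

For example, `n50([2, 2, 2, 3, 3, 4, 8, 8])` returns 8: the total is 32, and the two 8-unit contigs already cover half of it.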


Pathogenicity island reconstruction

[Figure: gene map of the reconstructed pathogenicity island, the contigs covering it, and the sequence data required to join them (15 Mb, 54 Mb and 65 Mb). Gene labels: SapF, SapD, SapC, intA, munIM, YqaJ, recT, ParA, LexA, CII, ydaU, dnaC, IS26, aadA, sul1, ebr, GCN5-like, IS6100, IS26, hin, ltrA, umuD, umuC, FRG]


Pathogen identification

(Cao et al, 2015): bioRxiv, DOI: 10.1101/019356


Bacteria identification

Species typing

[Figure: proportion of each species against sequencing time (min), with sequencing yield (reads) on the second axis. Panels: K. pneumoniae ATCC BAA-2146; K. pneumoniae ATCC 13883; Mixture 75% E. coli + 25% S. aureus. Legend: K. pneumoniae, E. coli, S. aureus, Sequencing yield]

Strain typing

[Figure: probability of each strain against sequencing time (min), with yield (reads) on the second axis. Panels: K. pneumoniae ATCC BAA-2146; K. pneumoniae ATCC 13883; Mixture 75% E. coli + 25% S. aureus (two panels). Legend: K. pneumoniae ST11, K. pneumoniae ST3, E. coli ST73, E. coli ST526, S. aureus ST243, S. aureus ST12, S. aureus ST1460, S. aureus ST291]


Antibiotic resistance identification

K. pneumoniae BAA-2146 (NDM-1 strain): 27 genes

Time (mins)  Data (reads)  Sens. (%)  Spec. (%)  Genes
30                   1228       22.2      100.0  mphA, blaSHV, strA, blaTEM, strB, blaCTX
60                   2613       55.5      100.0  blaLEN, sul2, blaOXA, aac3, aac6, blaCMY, blaCFE, blaLAT, blaBIL
90                   3844       74.1      100.0  QnrB, aadA, oqxA, tetA, oqxB
120                  5258       77.8      100.0  dfrA
240                 10788       81.2      100.0  blaOKP
270                 11931       85.2      100.0  rmtC
300                 13022       92.6      100.0  sul1, sul3
540                 20200       96.3      100.0  fosA
600                 21546      100.0      100.0  blaNDM
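The sensitivity column appears consistent with a simple reading: the cumulative number of genes reported so far divided by the 27 genes in the strain. The sketch below works through that arithmetic under that assumption, which the slide itself does not state. The per-time-point counts are tallied from the gene lists in the table, and `cumulative_sensitivity` is a hypothetical helper name.

```python
TOTAL_GENES = 27  # resistance genes in K. pneumoniae BAA-2146, as stated on the slide

# Number of genes first listed at each time point (minutes), counted from the table.
NEW_GENES = {30: 6, 60: 9, 90: 5, 120: 1, 240: 1, 270: 1, 300: 2, 540: 1, 600: 1}


def cumulative_sensitivity(new_genes, total):
    """Return {minutes: sensitivity %}, with sensitivity = genes detected so far / total."""
    detected = 0
    result = {}
    for minutes in sorted(new_genes):
        detected += new_genes[minutes]
        result[minutes] = round(100.0 * detected / total, 1)
    return result


if __name__ == "__main__":
    for minutes, sens in cumulative_sensitivity(NEW_GENES, TOTAL_GENES).items():
        print(f"{minutes:>4} min  {sens:5.1f}%")
```

Under this assumption the calculation gives 22.2% at 30 minutes (6 of 27) and 100.0% at 600 minutes, matching the slide. At 60 and 240 minutes it gives 55.6% and 81.5%, slightly different from the 55.5 and 81.2 shown, so the slide may have been computed or rounded differently.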


A glimpse of R9 flowcells

Sequence on an R9 flowcell a mixture of:
  E. coli ESBL (40%)
  E. faecium VRE-VanA (20%)
  S. aureus MRSA (20%)
  S. aureus VRSA (20%)


A glimpse of R9 flowcells

Time (mins)  Data (reads)  Genes
10                   1230  ermA, ermG, spc, norA
15                   2152  Van, VanH, VanS
20                   2968  dfrA, aac6, VanX, tetU, aadA, sul1, sul3, aadD
30                   3964  fusB, mphA
40                   5754  VanY, cat, VanZ, VanA
50                   6804  tetM, tetS
120                  9426  aph6, msrC, mecA
150                 13198  catpC221, fosA, blaOXA
240                 15138  blaCTX, cfr
480                 20412  blaZ, vgaA
780                 26184  vgaALC, ermC, aac3, blaCMY, blaLAT, blaBIL
1020                36040  dfrA29


Summary and outlook

Scaffold and complete bacterial assemblies with < 30-fold coverage
Identify pathogen species and strain with 1000 reads (< 0.5 hours of sequencing)
Detect antibiotic resistance profile in a few hours of sequencing
We expect the times to be significantly shortened:
  Higher throughput with upcoming models: MinION MkII, PromethION
  Quicker library preparation

Software available:
  https://github.com/mdcao/npReader
  https://github.com/mdcao/npScarf
  https://github.com/mdcao/npAnalysis


Acknowledgements

Lachlan Coin
Matthew Cooper
Devika Ganesamoorthy
Son Nguyen
Alysha Elliott
Tania Duarte
Vivian Zhang
Nick Hamilton
Derek Benson
Dominique Gorse
Michael Thang
