Streaming algorithms for real-time analysis of Oxford Nanopore sequencing data
Minh Duc Cao
Institute for Molecular Bioscience
The University of Queensland
London Calling 2016
May 27, 2016
Minh Duc Cao, The University of Queensland London Calling 2016
Streaming algorithms
Real-time analysis:
  Answer the biological questions quickly, e.g., infection diagnosis
  Run sequencing only until the answers are obtained
  Decide complementary experiments
  Save time, save money
Streaming algorithms:
  Process data input as a stream
  Continuously make inferences and update the confidence level
  Are robust to noise and scalable to massive data sets
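The idea of updating a confidence level on every read and stopping once the answer is clear can be sketched as below. This is only an illustration and not the method used in the talk: it assumes each read has already been given a species label, and it uses a Wilson score interval as the stopping rule. The names wilson_interval, sequence_until_confident and max_half_width are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def sequence_until_confident(read_labels, target, max_half_width=0.05):
    """Consume one read label at a time and stop as soon as the interval
    for the proportion of `target` reads is narrow enough.

    Returns (reads_used, low, high), or None if the stream is empty.
    """
    hits = 0
    result = None
    for n, label in enumerate(read_labels, start=1):
        hits += label == target          # running count, no reads are stored
        low, high = wilson_interval(hits, n)
        result = (n, low, high)
        if (high - low) / 2 <= max_half_width:
            break                        # confident enough: stop reading
    return result


# Hypothetical labelled stream imitating a 75% / 25% mixture.
stream = ["E. coli", "E. coli", "E. coli", "S. aureus"] * 500
reads_used, low, high = sequence_until_confident(stream, "E. coli")
print(reads_used, round(low, 3), round(high, 3))
```

On this made-up stream the loop stops after a few hundred reads instead of consuming all 2,000, which is the "run sequencing only until the answers are obtained" behaviour in miniature.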
Real-time analysis workflow
DNA extraction (2 hours) -> Library preparation (2.5 hours) -> Sequencing setup (0.5 hours)
Run simultaneously:
  MinION sequencing
  Basecalling (Metrichor)
  Fastq extraction
  Scaffold assemblies
  Species typing
  Strain typing
  Resistance profile
Fastq extraction
[Workflow diagram: DNA extraction (2 hours) -> Library preparation (2.5 hours) -> Sequencing setup (0.5 hours) -> MinION sequencing -> Basecalling (Metrichor) -> Fastq extraction -> Scaffold assemblies, Species typing, Strain typing, Resistance profile]
(Cao et al, 2015): Bioinformatics, DOI: 10.1093/bioinformatics/btv658
Scaffold and complete genome assemblies
[Pipeline diagram]
  Stream of long reads + Pre-assemblies
  -> aligning (BWA-MEM) -> Stream of alignment records
  -> pairing -> Stream of bridges
  -> connecting -> Extending scaffolds (continuing process, output in real-time)
  Other label in the diagram: repeats
(Cao et al, 2016): bioRxiv, DOI: 10.1101/054783
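How a stream of alignment records can turn into a stream of bridges is sketched below. This is a simplified illustration and not npScarf's actual implementation: it assumes each alignment record has been reduced to a (read_id, contig_id) pair, treats any read that aligns to two different contigs as a bridge between them, and ignores orientation, distance and repeats. The function name stream_bridges and the min_support parameter are hypothetical.

```python
from collections import defaultdict


def stream_bridges(alignments, min_support=2):
    """Yield a contig pair as soon as `min_support` distinct reads bridge it.

    `alignments` is any iterable of (read_id, contig_id) pairs, consumed
    one record at a time in arrival order.
    """
    contigs_by_read = defaultdict(set)   # read -> contigs it has hit so far
    support = defaultdict(set)           # contig pair -> reads bridging it
    for read_id, contig_id in alignments:
        seen = contigs_by_read[read_id]
        for other in seen:
            if other == contig_id:
                continue                 # same contig again, not a bridge
            pair = tuple(sorted((other, contig_id)))
            reads = support[pair]
            before = len(reads)
            reads.add(read_id)
            # Report the pair only at the moment it crosses the threshold.
            if before < min_support <= len(reads):
                yield pair
        seen.add(contig_id)


# Hypothetical records: reads r1 and r2 both span contigs c1 and c2.
records = [("r1", "c1"), ("r1", "c2"), ("r2", "c2"),
           ("r2", "c1"), ("r3", "c2"), ("r3", "c3")]
for bridge in stream_bridges(records):
    print("join", bridge)
```

Because the function is a generator, a downstream step can extend scaffolds as each bridge arrives rather than waiting for the sequencing run to finish.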
MinION sequencing
Sequence two K. pneumoniae strains with the MinION:
Strain                   Chemistry  Date      Coverage  Del   Ins   Mis    Unaligned
BAA-2146 (NDM-1 strain)  R7         Sep 2014  33-X      9.5%  6.3%  15.3%  13.3%
13883 (type strain)      R7.3       Dec 2014  18-X      7.9%  6.0%  12.9%  11%
Del, Ins, Mis and Unaligned are the empirical error figures.
[Figures: Phred quality scores and read length distribution for each strain]
Scaffolding and completing genome results
[Figure: number of contigs and N50 (Mb) against coverage (-fold) for K. pneumoniae BAA-2146 (33X), K. pneumoniae 13883 (18X) and S. cerevisiae W303 (190X)]

K. pneumoniae BAA-2146 (33X)
Methods    #Contig  N50 (Mb)  Mis-assemblies  Errors/100Kb  CPU hrs
SPAdes     90       0.28      0               4.7           15.6
-Hybrid    17       3.10      1               6.6           16.1
-SSPACE    53       0.40      4               12.7          17.9
-LINK      31       0.55      5               16.1          19.7
-npScarf   5        5.40      0               20.0          17.2
NaS        29       0.34      15              18.9          328
Nanocorr   68       0.14      8               141.3         314

K. pneumoniae 13883 (18X)
Methods    #Contig  N50 (Mb)  Mis-assemblies  Errors/100Kb  CPU hrs
SPAdes     69       0.30      5               6.2           17.0
-Hybrid    15       0.73      19              8.0           17.0
-SSPACE    36       0.69      13              12.4          18.4
-LINK      17       1.53      18              16.1          18.1
-npScarf   4        4.92      21              10.8          17.5
NaS        38       0.39      36              10.2          199.7
Nanocorr   60       0.15      16              118.3         163

S. cerevisiae W303 (190X)
Methods    #Contig  N50 (Mb)  Mis-assemblies  Errors/100Kb  CPU hrs
SPAdes     364      0.16      29              124.1         20.5
-Hybrid    240      0.35      68              158.1         67.8
-SSPACE    263      0.39      89              136.7         52.0
-LINK      161      0.58      83              143.0         47.5
-npScarf   19       0.91      82              141.9         41.8
NaS        121      0.15      123             140.1         9951
Nanocorr   108      0.60      133             197.0         7480
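For readers unfamiliar with the N50 column, the short sketch below computes it under the usual definition: the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the assembly size. The function name n50 and the example lengths are illustrative only and do not come from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached half of the assembly size
            return length
    raise ValueError("n50() needs at least one contig")


# Made-up contig lengths (arbitrary units), total 32.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))   # the two longest contigs cover half: 8
```

A larger N50 together with fewer contigs indicates a more contiguous assembly, which is how the rows of the table can be compared.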
Pathogenicity island reconstruction
[Figure: gene map of the reconstructed pathogenicity island]
Gene labels: SapF, SapD, SapC, intA, munIM, YqaJ, recT, ParA, LexA, CII, ydaU, dnaC, IS26, aadA, sul1, ebr, GCN5-like, IS6100, IS26, hin, ltrA, umuD, umuC, FRG
Tracks: Contigs; Sequence data required to join (values shown: 15 Mb x2, 54 Mb x5, 65 Mb x2)
Pathogen identification
(Cao et al, 2015): bioRxiv, DOI: 10.1101/019356
Bacteria identification
Species typing
[Figure: Proportion (0 to 1) and Yield (reads, 0 to 2,000) against Sequencing time (min). Panels: K. pneumoniae ATCC BAA-2146 (0 to 40 min); K. pneumoniae ATCC 13883 (0 to 40 min); Mixture 75% E. coli + 25% S. aureus (0 to 60 min). Legend: K. pneumoniae, E. coli, S. aureus, Sequencing yield]
Strain typing
[Figure: Probability (0 to 1) and Yield (reads, 0 to 3,000) against Sequencing time (min). Panels: K. pneumoniae ATCC BAA-2146 (0 to 40 min); K. pneumoniae ATCC 13883 (0 to 40 min); Mixture 75% E. coli + 25% S. aureus (0 to 60 min); e) Mixture 75% E. coli + 25% S. aureus (0 to 100 min). Legend: K. pneumoniae ST11, K. pneumoniae ST3, E. coli ST73, E. coli ST526, S. aureus ST243, S. aureus ST12, S. aureus ST1460, S. aureus ST291]
Antibiotic resistance identification
K. pneumoniae BAA-2146 (NDM-1 strain): 27 genes
Time (mins)  Genes                                                                   Sens. (%)  Spec. (%)  Data (reads)
30           mphA, blaSHV, strA, blaTEM, strB, blaCTX                                22.2       100.0      1228
60           blaLEN, sul2, blaOXA, aac3, aac6, blaCMY, blaCFE, blaLAT, blaBIL        55.5       100.0      2613
90           QnrB, aadA, oqxA, tetA, oqxB                                            74.1       100.0      3844
120          dfrA                                                                    77.8       100.0      5258
240          blaOKP                                                                  81.2       100.0      10788
270          rmtC                                                                    85.2       100.0      11931
300          sul1, sul3                                                              92.6       100.0      13022
540          fosA                                                                    96.3       100.0      20200
600          blaNDM                                                                  100.0      100.0      21546
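The Sens. and Spec. columns can be read with the standard definitions, sketched below: sensitivity is the fraction of the strain's 27 resistance genes detected so far, and specificity is the fraction of absent genes that were correctly not reported. The cumulative counts (6 genes by 30 minutes, all 27 by 600 minutes) are tallied from the gene column; the count of 50 absent genes is a hypothetical placeholder, since the slides do not give it.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of genes really present that have been detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of absent genes that were correctly not reported."""
    return true_neg / (true_neg + false_pos)


TOTAL_GENES = 27  # resistance genes in K. pneumoniae BAA-2146, per the slide

# (minutes, cumulative genes detected), tallied from the gene column.
for minutes, detected in [(30, 6), (600, 27)]:
    sens = sensitivity(detected, TOTAL_GENES - detected)
    print(f"{minutes} min: sensitivity {100 * sens:.1f}%")

# Hypothetical: 50 absent genes and no false calls gives 100% specificity.
print(f"specificity {100 * specificity(50, 0):.1f}%")
```

A specificity of 100.0 in every row therefore means that no gene absent from the strain was reported at any time point.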
A glimpse of R9 flowcells
Sequence on an R9 flowcell a mixture of:
  E. coli ESBL (40%)
  E. faecium VRE-VanA (20%)
  S. aureus MRSA (20%)
  S. aureus VRSA (20%)
Time (mins)  Genes                                              Data (reads)
10           ermA, ermG, spc, norA                              1230
15           Van, VanH, VanS                                    2152
20           dfrA, aac6, VanX, tetU, aadA, sul1, sul3, aadD     2968
30           fusB, mphA                                         3964
40           VanY, cat, VanZ, VanA                              5754
50           tetM, tetS                                         6804
120          aph6, msrC, mecA                                   9426
150          catpC221, fosA, blaOXA                             13198
240          blaCTX, cfr                                        15138
480          blaZ, vgaA                                         20412
780          vgaALC, ermC, aac3, blaCMY, blaLAT, blaBIL         26184
1020         dfrA29                                             36040
Summary and outlook
Scaffold and complete bacterial assemblies with < 30-fold coverage
Identify pathogen species and strain with 1000 reads (< 0.5 hours sequencing)
Detect antibiotic resistance profile in a few hours of sequencing
We expect the times to be significantly shortened:
  Higher throughput with upcoming models: MinION MkII, PromethION
  Quicker library preparation
Software available:
  https://github.com/mdcao/npReader
  https://github.com/mdcao/npScarf
  https://github.com/mdcao/npAnalysis