150514 jts london_calling
TRANSCRIPT
Jared Simpson
Ontario Institute for Cancer Research &
Department of Computer Science, University of Toronto
Error correction, assembly and consensus algorithms
for MinION data
London Calling, May 14th, 2015
Our collaboration
An overview of NGS assembly
• Illumina data: short reads, very accurate, very deep
• nearly all Illumina assembly is based on exact matching algorithms
• fragmented assemblies
• Algorithms for Illumina data do not work for long, noisy reads
• PacBio developed a pipeline (“HGAP”) to assemble their data
• We used this recipe as a starting point but with custom components
Long read assembly pipeline
• Input reads → Error correction → Celera Assembler → Consensus → Genome Assembly
Input Data
• First challenge is finding overlaps for reads with 15-20% errors
Overlap Detection
• We use daligner (github.com/thegenemyers/daligner) to compute overlaps
Partial Order Graphs
• add read GCTACGAT that we want to correct to graph
• add sequence GCTCGAT to graph
• add sequence GCTCGATT to graph
• maximum weight path GCTCGAT is the corrected read
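The graph-building and path-selection steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the partial order aligner used in the pipeline: the three sequences are given already aligned into columns (a real partial order alignment computes this itself), and each edge is scored as its read support minus half the number of reads, a scoring rule chosen here so that weakly supported edges are penalised. The function names are hypothetical.

```python
from collections import defaultdict

def build_graph(aligned_rows):
    """Build a DAG from pre-aligned sequences; a node is (column, base)."""
    edge_weight = defaultdict(int)
    for row in aligned_rows:
        path = [(col, base) for col, base in enumerate(row) if base != "-"]
        for u, v in zip(path, path[1:]):
            edge_weight[(u, v)] += 1  # one unit of support per sequence using the edge
    return edge_weight

def max_weight_path(edge_weight, n_reads):
    """Highest-scoring path; edge score = support - n_reads / 2 (an assumption)."""
    # Edges always point to a later column, so sorting by column is a topological order.
    nodes = sorted({node for edge in edge_weight for node in edge})
    incoming = defaultdict(list)
    for (u, v), weight in edge_weight.items():
        incoming[v].append((u, weight - n_reads / 2))
    best, prev = {}, {}
    for v in nodes:
        best[v], prev[v] = 0.0, None  # a path may start at any node
        for u, score in incoming[v]:
            if best[u] + score > best[v]:
                best[v], prev[v] = best[u] + score, u
    end = max(nodes, key=lambda node: best[node])
    bases = []
    while end is not None:  # trace the best path back to its start
        bases.append(end[1])
        end = prev[end]
    return "".join(reversed(bases))

# The read to correct plus the two overlapping sequences, aligned by hand.
rows = ["GCTACGAT-", "GCT-CGAT-", "GCT-CGATT"]
corrected = max_weight_path(build_graph(rows), n_reads=len(rows))
print(corrected)
```

With these inputs the extra A (one supporting sequence) and the trailing T (one supporting sequence) both score below zero, so the path through the well-supported bases is preferred.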
Error Correction
Contig Assembly
• Celera Assembler produces one contig at 98.5% identity
Assembly Polishing
• Consensus problem is viewed as choosing a sequence C' that maximizes the probability of the event data D:
$$C' = \arg\max_{S \in \mathcal{C}} P(D \mid S)$$
where
$$P(D \mid S) = \prod_{k=1}^{r} P(e_{i,k}, e_{i+1,k}, \ldots, e_{j,k} \mid S, \Theta)$$
Selecting a Consensus
• Start from the current consensus GCTACGATCGACTTACGA
• Mutate: generate single-base substitutions, deletions and insertions, and score each candidate S by log P(D|S)

Candidate S            log P(D|S)
ACTACGATCGACTTACGA     -190
CCTACGATCGACTTACGA     -187
TCTACGATCGACTTACGA     -192
...
-CTACGATCGACTTACGA     -176
G-TACGATCGACTTACGA     -191
GC-ACGATCGACTTACGA     -193
GCT-CGATCGACTTACGA     -168
...
GACTACGATCGACTTACGA    -198
GCCTACGATCGACTTACGA    -191
GGCTACGATCGACTTACGA    -195
GTCTACGATCGACTTACGA    -181

• Select new consensus: GCT-CGATCGACTTACGA (-168), then mutate again
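The mutate-and-select loop can be sketched as a greedy hill climb. In the talk the score is the signal-level P(D|S); the sketch below swaps in a toy score (negative total edit distance to a few base-called reads) purely so that the example runs on its own. `toy_score`, `mutations` and `polish` are hypothetical names, and the reads are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def toy_score(candidate, reads):
    """Stand-in for log P(D|S): higher is better. Not the signal-level score."""
    return -sum(edit_distance(candidate, read) for read in reads)

def mutations(seq, alphabet="ACGT"):
    """All single-base deletions, substitutions and insertions of seq."""
    out = set()
    for i in range(len(seq)):
        out.add(seq[:i] + seq[i + 1:])                  # deletion
        for base in alphabet:
            if base != seq[i]:
                out.add(seq[:i] + base + seq[i + 1:])   # substitution
    for i in range(len(seq) + 1):
        for base in alphabet:
            out.add(seq[:i] + base + seq[i:])           # insertion
    return out

def polish(consensus, reads, score=toy_score):
    """Move to the best-scoring mutation until no mutation improves the score."""
    best = score(consensus, reads)
    while True:
        candidate = max(sorted(mutations(consensus)), key=lambda s: score(s, reads))
        candidate_score = score(candidate, reads)
        if candidate_score <= best:
            return consensus, best
        consensus, best = candidate, candidate_score

reads = ["GCTCGATCGACTTACGA", "GCTCGATCGACTTACGA", "GCTCGATCGACTACGA"]
polished, final_score = polish("GCTACGATCGACTTACGA", reads)
print(polished, final_score)
```

Passing a different `score` function is the only change needed to plug in a signal-level likelihood.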
Pore Models
Generating Events
• What do we expect events from a given sequence to look like?
[Figure, built up over several slides: sample current for the sequence GCTACGATT; x-axis: time (s), 0.00 to 1.25; y-axis: current (pA), 0 to 100]
Event Detection
[Figure: sample current trace; x-axis: time (s), 0.00 to 1.25; y-axis: current (pA), 0 to 100]

Event   mean current (pA)   current stdv   duration (s)
1       60.3                0.7            0.521
2       40.6                1.0            0.112
3       52.2                2.0            0.356
4       54.1                1.2            0.291
5       49.5                1.5            0.141
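Each row of the event table summarises a run of raw samples. A minimal sketch of that summary step follows, assuming the segment boundaries are already known (the talk does not describe the segmentation method itself) and using an invented sample rate and invented samples; `summarise_events` is a hypothetical helper.

```python
import statistics

def summarise_events(samples, boundaries, sample_rate_hz):
    """Summarise each segment [start, end) of the raw current as one event."""
    events = []
    for start, end in zip(boundaries, boundaries[1:]):
        segment = samples[start:end]
        events.append({
            "mean_pa": statistics.fmean(segment),      # mean current
            "stdv_pa": statistics.pstdev(segment),     # spread within the event
            "duration_s": (end - start) / sample_rate_hz,
        })
    return events

# Invented samples (pA) with three level changes; boundaries are sample indices.
samples = [60.0, 61.0, 59.0, 60.0, 40.0, 41.0, 39.0, 52.0, 52.5, 51.5, 52.0]
events = summarise_events(samples, boundaries=[0, 4, 7, 11], sample_rate_hz=100.0)
for number, event in enumerate(events, 1):
    print(number, round(event["mean_pa"], 1), round(event["stdv_pa"], 1), event["duration_s"])
```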
A simple model
• What is the probability of observing events E given a sequence S?
• Assuming for the moment there are no missing or extra events:
$$P(e_1, e_2, \ldots, e_n \mid s_1, s_2, \ldots, s_n, \Theta) = \prod_{i=1}^{n} P(e_i \mid s_i, \mu_{s_i}, \sigma_{s_i})$$
$$P(e_i \mid k, \mu_k, \sigma_k) = \mathcal{N}(\mu_k, \sigma_k^2)$$
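The simple model translates directly into a sum of Gaussian log densities, one per event. The sketch below assumes one event per 5-mer, as the slide does; the pore model values in `PORE_MODEL` are invented for illustration and are not a real MinION model.

```python
import math

# Hypothetical pore model: 5-mer -> (mean current in pA, standard deviation).
PORE_MODEL = {
    "GCTAC": (60.0, 1.0),
    "CTACG": (41.0, 1.5),
    "TACGA": (52.0, 2.0),
}

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood(event_means, sequence, model, k=5):
    """log P(e_1..e_n | s_1..s_n, model) with exactly one event per k-mer."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if len(kmers) != len(event_means):
        raise ValueError("the simple model needs one event per k-mer")
    return sum(log_normal_pdf(e, *model[kmer]) for e, kmer in zip(event_means, kmers))

# Three events for the three 5-mers of GCTACGA.
print(log_likelihood([60.3, 40.6, 52.2], "GCTACGA", PORE_MODEL))
```

Working in log space avoids underflow when the product runs over thousands of events.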
Complications
[Figure, repeated over several slides with annotations: sample current trace; x-axis: time (s), 0.0 to 0.6; y-axis: current (pA), 30 to 70]
• Is this an event?
• One event or two?
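Nanopore HMM
• must consider:
  • over-segmentation
  • under-segmentation
  • missed short events
• HMM:
  • M states: match event to 5-mers
  • E states: extra observation of an event
  • K states: no event observed for a 5-mer
• The joint probability of a state path π and the events is
$$P(\pi, e_1, e_2, \ldots, e_n \mid S, \Theta) = \prod_{i=1}^{n} P(e_i \mid \pi_i, \mu_{s_i}, \sigma_{s_i}) \, P(\pi_i \mid \pi_{i-1}, S)$$
• Summing over all paths gives the term used in P(D|S):
$$P(e_1, e_2, \ldots, e_n \mid S, \Theta) = \sum_{\pi} P(\pi, e_1, e_2, \ldots, e_n \mid S, \Theta)$$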
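The sum over paths is computed with the forward algorithm rather than by enumeration. The sketch below is a deliberately simplified stand-in, not the M/E/K topology from the slide: a path sits on one 5-mer per event and between events may stay (an extra event), advance one 5-mer (a match) or advance two (one 5-mer produced no event). The fixed transition probabilities and the (mean, stdv) pairs are assumptions made for this example.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def logsumexp(values):
    """log(sum(exp(v))) computed stably; -inf entries contribute nothing."""
    finite = [v for v in values if v != -math.inf]
    if not finite:
        return -math.inf
    top = max(finite)
    return top + math.log(sum(math.exp(v - top) for v in finite))

def forward_log_prob(events, kmer_params, p_stay=0.1, p_step=0.8, p_skip=0.1):
    """log P(e_1..e_n | S) summed over all paths of the simplified HMM.

    kmer_params[j] is the (mean, stdv) of the j-th 5-mer of S.
    """
    moves = {0: math.log(p_stay), 1: math.log(p_step), 2: math.log(p_skip)}
    m = len(kmer_params)
    # Assumption: the first event is emitted by the first 5-mer.
    f = [-math.inf] * m
    f[0] = log_normal_pdf(events[0], *kmer_params[0])
    for event in events[1:]:
        g = [-math.inf] * m
        for j in range(m):
            incoming = [f[j - d] + lp for d, lp in moves.items() if j - d >= 0]
            g[j] = logsumexp(incoming) + log_normal_pdf(event, *kmer_params[j])
        f = g
    # Assumption: the last event is emitted by the last 5-mer.
    return f[m - 1]

params = [(60.0, 1.0), (41.0, 1.5), (52.0, 2.0)]
print(forward_log_prob([60.3, 40.6, 52.2], params))        # one event per 5-mer
print(forward_log_prob([60.3, 60.1, 40.6, 52.2], params))  # first event split in two
print(forward_log_prob([60.3, 52.2], params))              # middle event missed
```

All three event sequences receive a finite score, which is the point of adding the extra-event and skip transitions to the simple one-event-per-5-mer model.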
Transition Probabilities
• Probability of not observing an event is a function of the absolute difference between the (expected) currents of successive 5-mers
[Figure, repeated over two slides: sample current trace; x-axis: time (s), 0.0 to 0.6; y-axis: current (pA), 30 to 70]
[Figure: skip probability (y-axis, 0.1 to 0.5) against absolute difference in pA (x-axis, 0 to 20)]
Assembly Accuracy
[Figure, panels A-D, repeated over two slides: 5-mer count in reference (x-axis, 0 to 10000) against 5-mer count in the draft assembly and in the polished assembly (y-axis, 0 to 10000); k-mer counts (0 to 12000) for TTTTT, AAAAA, TTTTG, CAAAA, CTTTT, AAAAG, CCCCC and GGGGG, draft vs. reference and polished vs. reference]
• Draft: 98.5% accuracy
• Polished: 99.5% accuracy
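The 5-mer comparison behind the Assembly Accuracy figure amounts to counting every overlapping 5-mer in the reference and in the assembly. A minimal sketch, assuming a single strand (no reverse complement handling) and toy sequences invented for illustration; `kmer_counts` and `compare_counts` are hypothetical names.

```python
from collections import Counter

def kmer_counts(sequence, k=5):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def compare_counts(reference, assembly, kmers, k=5):
    """Return {kmer: (count in reference, count in assembly)}."""
    ref, asm = kmer_counts(reference, k), kmer_counts(assembly, k)
    return {kmer: (ref[kmer], asm[kmer]) for kmer in kmers}

# Toy example: the assembly has lost one A from a homopolymer run.
reference = "CT" + "A" * 7 + "GTACA"
assembly = "CT" + "A" * 6 + "GTACA"
comparison = compare_counts(reference, assembly, ["AAAAA", "TTTTT"])
print(comparison)
```

Plotting the two counts against each other for all 1024 5-mers gives a scatter of the kind shown in the figure; points off the diagonal mark 5-mers that the assembly over- or under-represents.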
Aligning Events to a Reference
• HMM can also align events to a reference genome
• Read about it here: http://simpsonlab.github.io/2015/04/08/eventalign/
Planned Improvements
• Model dwell duration to better call homopolymers, e.g. CTAAAAAAAAAAAAGTACA
• SNP calling/genotyping:
$$P(g_i \mid D) = \frac{P(D \mid g_i) \, P(g_i)}{P(D)}$$
• Improve scalability to handle larger genomes
• Use signal data during error correction
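The genotype posterior is Bayes' rule applied to per-genotype likelihoods. A small sketch under stated assumptions: the log-likelihoods and the uniform prior below are invented, and `genotype_posteriors` is a hypothetical helper, not part of any released tool.

```python
import math

def genotype_posteriors(log_likelihoods, priors):
    """P(g_i | D) from log P(D | g_i) and P(g_i), normalised by P(D)."""
    log_joint = {g: ll + math.log(priors[g]) for g, ll in log_likelihoods.items()}
    top = max(log_joint.values())  # subtract the maximum before exponentiating
    evidence = sum(math.exp(v - top) for v in log_joint.values())
    return {g: math.exp(v - top) / evidence for g, v in log_joint.items()}

# Hypothetical log-likelihoods for two candidate alleles at one site.
posteriors = genotype_posteriors({"A": -168.0, "G": -176.0}, {"A": 0.5, "G": 0.5})
print(posteriors)
```

Subtracting the largest log value first keeps the exponentials in range; exponentiating values near -170 directly would still work, but real likelihoods over long reads are far smaller.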
Acknowledgements & Code
• Collaborators:
  • Nick Loman, Josh Quick (Birmingham)
  • Jonathan Dursi (OICR)
• Code:
  • github.com/jts/nanocorrect (error correction)
  • github.com/jts/nanopolish (signal-level algorithms)
  • github.com/jts/nanopore-paper-analysis (reproduce our paper)