TRANSCRIPT
Indexing strategies depend on the kind of sequence comparison
https://peerj.com/articles/808/
http://www.langmead-lab.org/teaching-materials/
Indexing strategies: Hash tables; Suffix array, trie and BWT
|DB| >> |Query|: 1-10 Gb vs 10-15 Kb
|DB| << |Queries|: 3 Gb vs 1 Tb
Sequence comparison algorithms: BLAST, BWA and Bowtie2, BLAT
Indexed object: DB
Figure axes: sequence divergence, query size
CLAUDIA CHICA C3BI HANDS-ON NGS COURSE – IPP - 23RD NOV 2016
MAPPING
Compeau and Pevzner, Bioinformatics Algorithms: An Active-Learning Approach. 2014.
The Burrows-Wheeler transform: introduction
DB ⇆ BWT(DB) (Burrows-Wheeler Transform)
Permutes characters so that similar contexts are clustered together ⇒ fast retrieval and compression
Example: all the contexts of the word "and" in the Watson and Crick paper about DNA structure
Data structure    Memory usage
Suffix trie       |DB|·(|DB|+1)/2
Suffix tree       k|DB| with k~20
BWT               ~2|DB|
DB = Genome, |DB| = Genome size
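To get a feel for the table, the three formulas can be evaluated for a concrete genome size. The sketch below assumes a 3 Gb genome (the DB size quoted on the first slide) and reports memory in the same unspecified units as the table; the function name `index_memory` is made up for this illustration.

```python
def index_memory(db_size: int, k: int = 20) -> dict:
    """Evaluate the memory formulas from the table, in the table's own units."""
    return {
        "suffix trie": db_size * (db_size + 1) // 2,  # |DB|(|DB|+1)/2
        "suffix tree": k * db_size,                   # k|DB| with k~20
        "BWT": 2 * db_size,                           # ~2|DB|
    }


if __name__ == "__main__":
    genome_size = 3_000_000_000  # assumption: a 3 Gb genome
    for structure, units in index_memory(genome_size).items():
        print(f"{structure:12s} {units:.2e}")
```

The quadratic growth of the suffix trie is what rules it out for whole genomes, while the suffix tree and the BWT both grow linearly with very different constants.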
The Burrows-Wheeler transform: construction
Genome = GATGCGAGAGATG$
Form all cyclic rotations of GATGCGAGAGATG$
GATGCGAGAGATG$
$GATGCGAGAGATG
G$GATGCGAGAGAT
TG$GATGCGAGAGA
ATG$GATGCGAGAG
GATG$GATGCGAGA
AGATG$GATGCGAG
GAGATG$GATGCGA
AGAGATG$GATGCG
GAGAGATG$GATGC
CGAGAGATG$GATG
GCGAGAGATG$GAT
TGCGAGAGATG$GA
ATGCGAGAGATG$G
Sort the rotations lexicographically ($ comes first)
$GATGCGAGAGATG
AGAGATG$GATGCG
AGATG$GATGCGAG
ATG$GATGCGAGAG
ATGCGAGAGATG$G
CGAGAGATG$GATG
G$GATGCGAGAGAT
GAGAGATG$GATGC
GAGATG$GATGCGA
GATG$GATGCGAGA
GATGCGAGAGATG$
GCGAGAGATG$GAT
TG$GATGCGAGAGA
TGCGAGAGATG$GA
Burrows-Wheeler transform = last column = GGGGGGTCAA$TAA
Given the construction procedure, the last column is also the string containing, for each row of the sorted matrix, the character that precedes that row's first character in the text.
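The construction procedure above can be sketched in a few lines of Python. This is the naive version that materialises every rotation, so it is only meant for small examples like the one on the slide; the function name `bwt` is an assumption of this sketch.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform by sorting all cyclic rotations.

    Assumes text ends with a single '$' sentinel. In Python's default
    string ordering '$' sorts before the letters A, C, G, T.
    """
    if not text.endswith("$") or text.count("$") != 1:
        raise ValueError("text must end with exactly one '$'")
    # Form all cyclic rotations of the text.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Sort them lexicographically ($ comes first).
    rotations.sort()
    # The transform is the last column of the sorted matrix.
    return "".join(row[-1] for row in rotations)


if __name__ == "__main__":
    print(bwt("GATGCGAGAGATG$"))  # GGGGGGTCAA$TAA
```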
The Burrows-Wheeler transform: inversion
DB ⇆ BWT(DB)
GATGCGAGAGATG$ ⇆ GGGGGGTCAA$TAA
If I have the last column of the BWT matrix, I have the first one for free. Why?
Every column of the matrix contains the same characters and the rows are sorted, so the first column is the last column sorted lexicographically:
GGGGGGTCAA$TAA → $AAAACGGGGGGTT
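The "first column for free" observation already gives a simple, if wasteful, inversion: prepend the last column to the rows rebuilt so far, sort, and repeat once per character. The sketch below is that textbook procedure; the function names are assumptions of this illustration, and it keeps the whole matrix in memory.

```python
def first_column(last_column: str) -> str:
    """The first column is the last column sorted lexicographically."""
    return "".join(sorted(last_column))


def inverse_bwt_naive(last_column: str) -> str:
    """Rebuild the sorted rotation matrix one column at a time."""
    n = len(last_column)
    rows = [""] * n
    for _ in range(n):
        # Prepend the last column to the known row prefixes, then re-sort.
        rows = sorted(last_column[i] + rows[i] for i in range(n))
    # The original text is the rotation that ends with the sentinel.
    return next(row for row in rows if row.endswith("$"))


if __name__ == "__main__":
    print(first_column("GGGGGGTCAA$TAA"))       # $AAAACGGGGGGTT
    print(inverse_bwt_naive("GGGGGGTCAA$TAA"))  # GATGCGAGAGATG$
```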
The Burrows-Wheeler transform: inversion
The letters of the BWT text are in the same “relative” order as in the sorted text.
(SORTED TEXT)
The generality of the first-last property of the BWT matrix
Given a symbol S of the string G and the corresponding BWT matrix of G:
the k-th occurrence of S in FirstColumn and the k-th occurrence of S in LastColumn correspond to the same
position of S in G.
$GATGCGAGAGATG
AGAGATG$GATGCG
AGATG$GATGCGAG
ATG$GATGCGAGAG
ATGCGAGAGATG$G
CGAGAGATG$GATG
G$GATGCGAGAGAT
GAGAGATG$GATGC
GAGATG$GATGCGA
GATG$GATGCGAGA
GATGCGAGAGATG$
GCGAGAGATG$GAT
TG$GATGCGAGAGA
TGCGAGAGATG$GA
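The first-last property can be checked directly on the example. The sketch below labels every symbol in the first and last columns with its occurrence number and records which position of G it came from; the property holds if both columns give the same answer for every label. All function names are assumptions of this illustration.

```python
def suffix_array(text: str) -> list:
    """Starting position of the suffix that begins each sorted row."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def ranked(column) -> list:
    """Label each symbol with its occurrence number, e.g. ('A', 3) = 3rd A."""
    seen = {}
    labels = []
    for symbol in column:
        seen[symbol] = seen.get(symbol, 0) + 1
        labels.append((symbol, seen[symbol]))
    return labels


def first_last_positions(text: str):
    """Map every (symbol, k) label to its position in text, per column."""
    n = len(text)
    sa = suffix_array(text)
    first = ranked(text[i] for i in sa)      # first column of the matrix
    last = ranked(text[i - 1] for i in sa)   # last column (i - 1 wraps to '$')
    first_pos = {label: sa[row] for row, label in enumerate(first)}
    last_pos = {label: (sa[row] - 1) % n for row, label in enumerate(last)}
    return first_pos, last_pos


if __name__ == "__main__":
    first_pos, last_pos = first_last_positions("GATGCGAGAGATG$")
    print(first_pos == last_pos)  # True
```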
The Burrows-Wheeler transform: first-last property used for more efficient inversion
Memory usage
Genome reconstruction requires 2|DB| memory space
Genome: G A T G C G A G A G A T G $
Reconstructed text    Origin of the new character
$                     Last character
G $                   Character before $
T G $                 Character before 1st G
A T G $               Character before 1st T
G A T G $             Character before 3rd A
A G A T G $           Character before 4th G
…
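The backward walk in the table can be sketched with one auxiliary array that maps each row of the last column to the row of the first column holding the same occurrence of the symbol. This is a minimal illustration, assuming a single `$` sentinel; the names `inverse_bwt` and `lf` are choices of this sketch, and the temporary `order` list is an implementation convenience rather than part of the method.

```python
def inverse_bwt(last_column: str) -> str:
    """Invert the BWT using the first-last property."""
    n = len(last_column)
    # A stable sort of the row indices by symbol pairs the k-th occurrence
    # of a symbol in the first column with its k-th occurrence in the last.
    order = sorted(range(n), key=lambda i: last_column[i])
    lf = [0] * n  # lf[row in last column] = row in first column
    for first_row, last_row in enumerate(order):
        lf[last_row] = first_row
    # Row 0 starts with '$'; its last character is the one before '$'.
    row = 0
    reversed_text = []
    for _ in range(n - 1):
        reversed_text.append(last_column[row])
        row = lf[row]
    return "".join(reversed(reversed_text)) + "$"


if __name__ == "__main__":
    print(inverse_bwt("GGGGGGTCAA$TAA"))  # GATGCGAGAGATG$
```

Only the last column and one integer array are kept during the walk, which is in line with the 2|DB| figure on the slide.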
The Burrows-Wheeler transform: pattern matching
Genome = GATGCGAGAGATG$
Pattern = GAGA
Is GAGA in GATGCGAGAGATG$? Where? How many times?
The BWT alone answers: 2 matches. But where?
Suffix array: holds the starting position of the suffix beginning each row
Memory usage
Pattern matching requires 2|DB| + |DB| memory space
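A minimal sketch of this search, processing the pattern from its last character to its first and narrowing a range of rows at each step. The suffix array then turns the final row range into genome positions. Function and variable names are assumptions of this illustration, and the full occurrence table is kept only for clarity, so this sketch uses more memory than the figure quoted on the slide.

```python
def build_index(text: str):
    """Suffix array, first-row table and occurrence counts for text ending in '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in sa)  # BWT (i - 1 wraps to '$')
    # first_row[c] = first row of the sorted matrix that starts with c
    first_row, total = {}, 0
    for c in sorted(set(last)):
        first_row[c] = total
        total += last.count(c)
    # occ[c][i] = number of occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in first_row}
    for i, symbol in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (symbol == c)
    return sa, first_row, occ


def find(pattern: str, index) -> list:
    """Return the sorted 0-based start positions of pattern in the text."""
    sa, first_row, occ = index
    top, bottom = 0, len(sa)  # half-open range of matching rows
    for c in reversed(pattern):
        if c not in first_row:
            return []
        top = first_row[c] + occ[c][top]
        bottom = first_row[c] + occ[c][bottom]
        if top >= bottom:
            return []
    return sorted(sa[top:bottom])


if __name__ == "__main__":
    index = build_index("GATGCGAGAGATG$")
    print(find("GAGA", index))  # [5, 7]
```

The size of the final row range gives the "how many" answer without the suffix array; the suffix array is only needed for the "where".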
https://blog.sbgenomics.com
Complete mapping algorithm: Seed and extend
Seeds are extended at multiple sites.
Extension is the more expensive step.
Extension: must deal with polymorphism (SNPs), sequencing errors, indel events, etc.
Seed
Extend
Seed-and-vote mapping paradigm
Yang Liao et al. Nucl. Acids Res. 2013;41:e108
Choose the mapped genomic location of the read directly from the seed
Seed & vote approach achieves local alignment simultaneously in multiple parts of the read
In-fill step with dynamic programming to complete the alignment
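The voting idea can be sketched as follows: several seeds taken from the read are looked up exactly, each hit votes for the read start it implies, and the location with the most votes wins. This is a simplified illustration and not the implementation from the cited paper; the names, the `step` parameter and the example sequences are assumptions, and the in-fill step with dynamic programming is left out.

```python
from collections import Counter, defaultdict


def build_kmer_index(genome: str, k: int) -> dict:
    """Hash table from each k-mer to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def seed_and_vote(read: str, index: dict, k: int, step: int):
    """Return (best_start, votes), or None when no seed hits the index."""
    votes = Counter()
    for offset in range(0, len(read) - k + 1, step):
        for hit in index.get(read[offset:offset + k], []):
            # Each seed hit votes for the read start position it implies.
            votes[hit - offset] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]


if __name__ == "__main__":
    genome = "GATGCGAGAGATG"
    index = build_kmer_index(genome, k=3)
    print(seed_and_vote("GCGAGAG", index, k=3, step=2))  # (3, 3)
```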
Sequence database growth: Compression, Indexing and Pattern matching
Compression: Shannon entropy (1948), Huffman coding (1950), Lempel-Ziv compression algorithm (1977), Arithmetic coding (1984), Burrows-Wheeler Transform (1994), Compressed suffix arrays (2005)
Indexing: Suffix tries and trees (1977), Suffix arrays (1993), MegaBLAST index (2008)
Pattern matching: BLAST (1990), BLAT (2002), Bowtie (2009), Mappers explosion
Growth in GenBank base pairs
http://en.wikipedia.org/wiki/GenBank
Mappers explosion
Fonseca N, Bioinfo, 2012
DNA, RNA, Bisulfite, miRNA
Several algorithms
• read length
• parallelisation
• DNA/RNA
• Indels
• Splicing
Bowtie2: alignment modes
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
Bowtie2: main parameters
D = # seed extension attempts
R = # of re-seeding attempts
N = # mismatches per seed
L = seed length
i = seed interval length
In this case:
• the read has 30 characters
• seed length is 10
• seed interval is 6
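The seed layout for this case can be worked out with a short sketch: seeds of length L start every i characters for as long as a full seed still fits in the read. The function name and the 30-character read are hypothetical, and only forward-strand seeds are shown; Bowtie 2 also takes seeds from the reverse complement of the read.

```python
def extract_seeds(read: str, seed_length: int, interval: int) -> list:
    """Return (start, seed) pairs for seeds spaced `interval` apart."""
    last_start = len(read) - seed_length
    return [(start, read[start:start + seed_length])
            for start in range(0, last_start + 1, interval)]


if __name__ == "__main__":
    read = "ACGT" * 7 + "AC"  # hypothetical read of 30 characters
    for start, seed in extract_seeds(read, seed_length=10, interval=6):
        print(start, seed)
```

With a 30-character read, seed length 10 and interval 6, this gives four seeds starting at offsets 0, 6, 12 and 18; a fifth seed at offset 24 would run past the end of the read.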
Exercise: guess parameter values for the default mapping modes
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
Bowtie2: reporting
Report modes
• Report best: default mode
• Search for n: -k mode
• Report all: -a mode
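As an illustration of how the three report modes differ on the command line, the sketch below only assembles the argument list and does not run anything. The helper `bowtie2_command` and the file names are hypothetical; `-x`, `-U`, `-S`, `-k`, `-a` and the `--very-fast` / `--very-sensitive` presets are options described in the Bowtie 2 manual.

```python
def bowtie2_command(index, reads, out_sam, report="best", k=None, preset=None):
    """Build a bowtie2 argument list for unpaired reads (nothing is executed)."""
    cmd = ["bowtie2"]
    if preset is not None:
        cmd.append(f"--{preset}")      # e.g. very-fast, very-sensitive
    if report == "k":
        if k is None:
            raise ValueError("report='k' needs a value for k")
        cmd += ["-k", str(k)]          # search for up to k alignments per read
    elif report == "all":
        cmd.append("-a")               # report all alignments
    elif report != "best":
        raise ValueError(f"unknown report mode: {report}")
    # Default mode (report best) needs no extra option.
    cmd += ["-x", index, "-U", reads, "-S", out_sam]
    return cmd


if __name__ == "__main__":
    # Hypothetical index and file names.
    print(" ".join(bowtie2_command("genome_idx", "reads.fq", "best.sam")))
    print(" ".join(bowtie2_command("genome_idx", "reads.fq", "k5.sam",
                                   report="k", k=5)))
    print(" ".join(bowtie2_command("genome_idx", "reads.fq", "all.sam",
                                   report="all", preset="very-sensitive")))
```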
Aligning pairs
Report concordant:
• Expected orientation
• Expected length (200 - 500 bp)
Discordant:
• Forward - forward
• Reverse - forward
• Length < 200 bp or > 500 bp
What are the expected orientation and length of paired-end reads?
Why are discordant reads interesting?
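The concordant/discordant rule on this slide can be written as a small classifier. The sketch assumes that the expected orientation is forward - reverse (inferred from the discordant cases listed above) and takes the 200 - 500 bp range from the slide; the function name and the strand encoding are choices of this illustration.

```python
def classify_pair(upstream_strand: str, downstream_strand: str,
                  fragment_length: int,
                  min_length: int = 200, max_length: int = 500) -> str:
    """Classify an aligned pair as 'concordant' or 'discordant'.

    Strands are '+' (forward) or '-' (reverse), given for the upstream
    and downstream mate in genome order.
    """
    # Assumption: forward - reverse is the expected orientation.
    expected_orientation = (upstream_strand, downstream_strand) == ("+", "-")
    expected_length = min_length <= fragment_length <= max_length
    if expected_orientation and expected_length:
        return "concordant"
    return "discordant"


if __name__ == "__main__":
    print(classify_pair("+", "-", 350))   # concordant
    print(classify_pair("+", "+", 350))   # discordant: forward - forward
    print(classify_pair("-", "+", 350))   # discordant: reverse - forward
    print(classify_pair("+", "-", 1200))  # discordant: length > 500 bp
```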
Exercise: mapping with bowtie2
OBJECTIVE: Identify the parameters that can improve the mapping efficiency.
Ref  Dataset  mapping  effort          mapped    # unmapped  % m   # perfect unireads  # multireads  # multireads: random choice  % ran
KP   SB107    bowtie2  very-fast       11759847  125591      99%   11483699            276148        185845                       67%
     SB107    bowtie2  very-sensitive  11765199  120239      99%   11472607            292592        186225                       64%
     SB107    bwa                      11958806  429         100%  0                   11959235      0                            0%
     kp_sim   bowtie2  very-sensitive  8041510   3843928     68%   7836636             204874        138885                       68%
KO   SB107    bowtie2  very-sensitive  2516898   9368540     21%   2385106             131792        85761                        65%