Indexing strategies depend on the kind of sequence comparison

https://peerj.com/articles/808/

http://www.langmead-lab.org/teaching-materials/

Hash tables

Suffix array, trie and BWT

|DB| >> |Query| (DB: 1-10 Gb; query: 10-15 Kb)

|DB| << |Queries| (DB: 3 Gb; queries: 1 Tb)

Sequence comparison algorithms    Indexed object
BLAST
BWA and Bowtie2
BLAT                              DB

[Figure axes: sequence divergence; query size]

CLAUDIA CHICA C3BI HANDS-ON NGS COURSE – IPP - 23RD NOV 2016

MAPPING

Compeau and Pevzner, Bioinformatics Algorithms: An Active-Learning Approach. 2014.

The Burrows-Wheeler transform: introduction

DB ⇆ BWT(DB) (Burrows-Wheeler Transform)

Permutes characters in such a way that similar contexts are clustered together ⇒ fast retrieval and compression

[Figure: all the contexts of the word "and" in the Watson and Crick paper about DNA structure]

Data structure    Memory usage
Suffix trie       |DB|·(|DB|+1)/2
Suffix tree       k·|DB|, with k ~ 20
BWT               ~2·|DB|

DB = Genome; |DB| = Genome size


The Burrows-Wheeler transform: construction

Genome = GATGCGAGAGATG$

Form all cyclic rotations of GATGCGAGAGATG$

GATGCGAGAGATG$
$GATGCGAGAGATG
G$GATGCGAGAGAT
TG$GATGCGAGAGA
ATG$GATGCGAGAG
GATG$GATGCGAGA
AGATG$GATGCGAG
GAGATG$GATGCGA
AGAGATG$GATGCG
GAGAGATG$GATGC
CGAGAGATG$GATG
GCGAGAGATG$GAT
TGCGAGAGATG$GA
ATGCGAGAGATG$G

Sort the rotations lexicographically ($ comes first)

$GATGCGAGAGATG
AGAGATG$GATGCG
AGATG$GATGCGAG
ATG$GATGCGAGAG
ATGCGAGAGATG$G
CGAGAGATG$GATG
G$GATGCGAGAGAT
GAGAGATG$GATGC
GAGATG$GATGCGA
GATG$GATGCGAGA
GATGCGAGAGATG$
GCGAGAGATG$GAT
TG$GATGCGAGAGA
TGCGAGAGATG$GA

Burrows-Wheeler transform = last column = GGGGGGTCAA$TAA

Given the construction procedure, the last column also holds, for each character of the sorted text (first column), the character that precedes it in the genome.
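The construction can be sketched in a few lines of Python. This is a naive illustration that builds the full rotation matrix, which real mappers avoid; the function name `bwt` is chosen here for the example only.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted cyclic rotations (naive sketch)."""
    assert text.endswith("$"), "text must end with the sentinel $"
    # Form all cyclic rotations of the text.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Sort lexicographically; '$' sorts before A, C, G and T in ASCII.
    rotations.sort()
    # The transform is the last column of the sorted matrix.
    return "".join(row[-1] for row in rotations)


genome = "GATGCGAGAGATG$"
print(bwt(genome))  # GGGGGGTCAA$TAA
```

The run of six G's at the start of the output shows how characters with similar contexts end up clustered together.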


The Burrows-Wheeler transform: inversion

DB ⇆ BWT(DB)

GATGCGAGAGATG$ ⇆ GGGGGGTCAA$TAA

If I have the last column of the BWT matrix, I have the first one for free. Why? Both columns contain the same characters, and the first column is sorted.

GGGGGGTCAA$TAA

sort lexicographically

$AAAACGGGGGGTT


The Burrows-Wheeler transform: inversion

The occurrences of each letter appear in the same “relative” order in the BWT text as in the sorted text.


The Burrows-Wheeler transform: first-last property used for more efficient inversion

The generality of the first-last property of the BWT matrix

Given a symbol S of the string G and the corresponding BWT matrix of G: the k-th occurrence of S in FirstColumn and the k-th occurrence of S in LastColumn correspond to the same position of S in G.

BWT matrix of G = GATGCGAGAGATG$ (FirstColumn = sorted text, LastColumn = BWT):

$GATGCGAGAGATG
AGAGATG$GATGCG
AGATG$GATGCGAG
ATG$GATGCGAGAG
ATGCGAGAGATG$G
CGAGAGATG$GATG
G$GATGCGAGAGAT
GAGAGATG$GATGC
GAGATG$GATGCGA
GATG$GATGCGAGA
GATGCGAGAGATG$
GCGAGAGATG$GAT
TG$GATGCGAGAGA
TGCGAGAGATG$GA

Memory usage

Genome reconstruction requires 2|DB| memory space
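The first-last property can be checked empirically. The sketch below labels every symbol of both columns with its occurrence number and compares the genome positions they refer to. The helper names `ranked` and `first_last_holds` are made up for this illustration.

```python
def ranked(column):
    """Label each symbol with its occurrence number: G G T -> (G,1) (G,2) (T,1)."""
    seen = {}
    labels = []
    for ch in column:
        seen[ch] = seen.get(ch, 0) + 1
        labels.append((ch, seen[ch]))
    return labels


def first_last_holds(text: str) -> bool:
    n = len(text)
    # Row r of the sorted matrix is the rotation starting at position order[r].
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    first = ranked(text[i] for i in order)           # FirstColumn symbols
    last = ranked(text[(i - 1) % n] for i in order)  # LastColumn symbols
    # Genome position of each labelled symbol, seen from each column.
    first_pos = {label: i for label, i in zip(first, order)}
    last_pos = {label: (i - 1) % n for label, i in zip(last, order)}
    return first_pos == last_pos


print(first_last_holds("GATGCGAGAGATG$"))  # True
```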


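Reconstructing G A T G C G A G A G A T G $ backwards from the BWT:

Symbol in FirstColumn    Read in LastColumn                Reconstructed so far
$ (last character)       Character before $ is G           G $
1st G                    Character before 1st G is T       T G $
1st T                    Character before 1st T is A       A T G $
3rd A                    Character before 3rd A is G       G A T G $
4th G                    Character before 4th G is A       A G A T G $
…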
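The walk above can be sketched as follows. Only the last column and one array of row links are stored, in line with the 2|DB| memory estimate. The name `inverse_bwt` and the array `lf` are chosen here for illustration.

```python
def inverse_bwt(last: str) -> str:
    """Rebuild the text from its BWT using the first-last property."""
    n = len(last)
    # Sorting the rows of the last column by symbol yields the first column.
    # Python's sort is stable, so the k-th G of the last column becomes the k-th G
    # of the first column.
    rows_by_symbol = sorted(range(n), key=lambda r: last[r])
    # lf[r] = row whose FirstColumn symbol is the same genome symbol as last[r].
    lf = [0] * n
    for first_row, last_row in enumerate(rows_by_symbol):
        lf[last_row] = first_row
    # Start at row 0 (the row beginning with '$') and walk backwards through the genome.
    out = ["$"]
    row = 0
    for _ in range(n - 1):
        out.append(last[row])  # character preceding the current one
        row = lf[row]
    return "".join(reversed(out))


print(inverse_bwt("GGGGGGTCAA$TAA"))  # GATGCGAGAGATG$
```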

The Burrows-Wheeler transform: pattern matching

Genome = GATGCGAGAGATG$

Pattern = GAGA

Is GAGA in GATGCGAGAGATG$? How many times? Where?

The BWT matrix gives the count (2 matches), but where?

Suffix array: holds the starting position of each suffix beginning a row

Memory usage

Pattern matching requires 2|DB| + |DB| memory space
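A minimal sketch of the search: the pattern is matched from its last character to its first, narrowing a range of rows in the BWT matrix, and the suffix array then converts rows into genome positions. The function names and the linear-scan `occ` are simplifications for this example; real indexes precompute occurrence counts.

```python
def build_index(text: str):
    """Suffix array, BWT and first-occurrence table of a '$'-terminated text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    first_row = {}
    for row, ch in enumerate(sorted(bwt)):
        first_row.setdefault(ch, row)       # first row starting with ch
    return sa, bwt, first_row


def occ(bwt: str, ch: str, row: int) -> int:
    """Occurrences of ch in bwt[:row] (linear scan, for illustration only)."""
    return bwt.count(ch, 0, row)


def find(pattern: str, sa, bwt, first_row):
    lo, hi = 0, len(bwt)  # half-open range of rows that still match
    for ch in reversed(pattern):
        if ch not in first_row:
            return []
        lo = first_row[ch] + occ(bwt, ch, lo)
        hi = first_row[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])  # 0-based start positions in the genome


index = build_index("GATGCGAGAGATG$")
print(find("GAGA", *index))  # [5, 7]
```

The number of matches is known as soon as the row range is final (`hi - lo`); the suffix array is only needed to answer "where".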


https://blog.sbgenomics.com

Complete mapping algorithm: Seed and extend

Seeds are extended at multiple sites.

Extension is the more expensive step.

Extension: must deal with polymorphism (SNPs), sequencing errors, indel events, etc.

Seed

Extend
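A toy sketch of seed and extend, under simplifying assumptions: seeds are exact k-mers looked up in a hash table, and extension is a plain mismatch count over the whole read. Real mappers extend with dynamic programming so that indels are handled too. All names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int):
    """Hash table: k-mer -> list of genome positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def seed_and_extend(read, genome, index, k, max_mismatches):
    """Return {genome start: mismatches} for every accepted placement of the read."""
    hits, tried = {}, set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            start = pos - offset  # read start implied by this seed hit
            if start in tried or start < 0 or start + len(read) > len(genome):
                continue
            tried.add(start)
            # Extension step (simplified): count mismatches, no indels.
            window = genome[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return hits


genome = "GATGCGAGAGATG"
index = build_kmer_index(genome, 3)
print(seed_and_extend("GCGTGAG", genome, index, 3, 1))  # {3: 1}
```

The seed GAG hits the genome at several sites, so extension is attempted at each candidate start, which is why extension dominates the cost.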


Seed-and-vote mapping paradigm

Yang Liao et al. Nucl. Acids Res. 2013;41:e108

Choose the mapped genomic location of the read directly from the seed

Seed & vote approach achieves local alignment simultaneously in multiple parts of the read

In-fill step with dynamic programming to complete the alignment
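The voting idea can be illustrated with a toy sketch. This is not the published implementation: seed length, step and all names are arbitrary choices for the example, and the in-fill step is left out.

```python
from collections import Counter, defaultdict


def seed_and_vote(read: str, genome: str, k: int, step: int = 1):
    """Each seed votes for the read start it implies; the most voted start wins."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    votes = Counter()
    for offset in range(0, len(read) - k + 1, step):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1  # genome position where the read would start
    # (start, number of votes), or None when no seed matched.
    return votes.most_common(1)[0] if votes else None


print(seed_and_vote("GCGTGAG", "GATGCGAGAGATG", k=3))  # (3, 2)
```

Seeds on both sides of the mismatch vote for the same start, so the location is chosen without extending any single seed.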


Sequence database growth: Compression, Indexing and Pattern matching

Compression: Shannon entropy (1948), Huffman coding (1950), Lempel-Ziv compression algorithm (1977), Arithmetic coding (1984), Burrows-Wheeler Transform (1994), Compressed suffix arrays (2005)

Indexing: Suffix tries and trees (1977), Suffix arrays (1993), MegaBLAST index (2008)

Pattern matching: BLAST (1990), BLAT (2002), Bowtie (2009), Mappers explosion

[Figure: growth in GenBank base pairs. Source: http://en.wikipedia.org/wiki/GenBank]


Mappers explosion

[Figure: mappers by data type (DNA, RNA, Bisulfite, miRNA). Fonseca N, Bioinfo, 2012]

Several algorithms
• read length
• parallelisation
• DNA/RNA
• Indels
• Splicing


Bowtie2: alignment modes


http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

Bowtie2: main parameters

D = # seed extension attempts
R = # of re-seeding attempts
N = # mismatches per seed
L = seed length
i = seed interval length

In this case:
• the read has 30 characters
• seed length is 10
• seed interval is 6
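The effect of seed length and seed interval can be sketched as below. This only illustrates the arithmetic; it is not Bowtie2 code, the 30-character read is an arbitrary example, and only the forward strand is considered.

```python
def extract_seeds(read: str, seed_length: int, interval: int):
    """Return (offset, seed) pairs taken every `interval` bases along the read."""
    return [
        (offset, read[offset:offset + seed_length])
        for offset in range(0, len(read) - seed_length + 1, interval)
    ]


read = "TAGCTACGCTCTACGCTATCATGCATAAAC"  # 30 characters
seeds = extract_seeds(read, seed_length=10, interval=6)
print([offset for offset, _ in seeds])  # [0, 6, 12, 18]
```

A fifth seed would start at offset 24 and run past the end of the read, so four seeds are extracted.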


Exercise: guess parameter values for the default mapping modes

http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

Bowtie2: reporting

Report modes
• Report best: default mode
• Search for up to n alignments: -k mode
• Report all: -a mode


Aligning pairs

Report concordant:
• Expected orientation
• Expected length (200 - 500 bp)

Discordant:
• Forward - forward
• Reverse - forward
• Length < 200 bp or > 500 bp

What are the expected orientation and length of paired-end reads?

Why are discordant reads interesting?
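The concordance rule can be written as a small check. It assumes that the expected orientation is forward - reverse and uses the 200 - 500 bp window from the slide; the function name and strand encoding are hypothetical.

```python
def classify_pair(strand1: str, strand2: str, fragment_length: int,
                  min_len: int = 200, max_len: int = 500) -> str:
    """Classify a mapped pair from mate strands ('+' or '-') and fragment length in bp."""
    # Assumption: the library is expected to map forward - reverse.
    expected_orientation = (strand1, strand2) == ("+", "-")
    expected_length = min_len <= fragment_length <= max_len
    if expected_orientation and expected_length:
        return "concordant"
    return "discordant"


print(classify_pair("+", "-", 350))  # concordant
print(classify_pair("+", "+", 350))  # discordant (forward - forward)
print(classify_pair("+", "-", 650))  # discordant (length > 500 bp)
```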

Exercise: mapping with bowtie2

OBJECTIVE: Identify the parameters that can improve the mapping efficiency.

Ref  Dataset  Mapping  Effort          # mapped  # unmapped  % m   # perfect unireads  # multireads  # multireads: random choice  % ran
KP   SB107    bowtie2  very-fast       11759847  125591      99%   11483699            276148        185845                       67%
KP   SB107    bowtie2  very-sensitive  11765199  120239      99%   11472607            292592        186225                       64%
KP   SB107    bwa      -               11958806  429         100%  0                   11959235      0                            0%
KP   kp_sim   bowtie2  very-sensitive  8041510   3843928     68%   7836636             204874        138885                       68%
KO   SB107    bowtie2  very-sensitive  2516898   9368540     21%   2385106             131792        85761                        65%
