Suffix arrays, BWT and FM-index

Alan Medlar Wednesday 16th March 2016


Outline

• Lecture: Technical background for read mapping tools used in this course

  • Suffix array

  • Burrows-Wheeler transform (BWT)

  • FM-index

• Lab session: Using BWA to map paired-end data against the human genome, SAM/BAM files, etc


Read mapping

• Sequencers can generate up to 100 million reads per sample

• Human genome is ~3 billion basepairs

• Need to map reads to the genome to discover variants (SNVs, indels), counts (gene expression)


Preliminaries

• String

• sequence of characters,

• e.g. "banana", "ATGC", "MDLISTFS"

• Alphabet { A, C, G, T, $ }, { A-Z, a-z, $ }

• Lexicographical order

• $ < A < C < G < T


Preliminaries

• Prefix

• non-empty substring that is the beginning of another string (left-to-right)

• e.g. "banana", "ATGC", "MDLISTFS"

• Suffix

• non-empty substring that is the ending of another string (right-to-left)

• e.g. "banana", "ATGC", "MDLISTFS"


Naïve exact search

• Text = "banana"

• Query = "nana"

• Linear search

Sliding the query along the text, one position at a time:

B A N A N A
N A N A
  N A N A
    N A N A

Naïve search is too slow

• Human genome ~3 billion basepairs

• Read 100 basepairs

• Complexity of search scales linearly with the length of the text!
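The linear scan from the "banana"/"nana" example can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name `naive_search` is made up for this sketch. Each of the up to n - m + 1 alignments may cost up to m character comparisons, which is where the O(mn) bound comes from.

```python
def naive_search(text: str, query: str) -> list[int]:
    """Return every offset at which query occurs in text (linear scan)."""
    hits = []
    n, m = len(text), len(query)
    # Try each alignment of the query against the text.
    for start in range(n - m + 1):
        # Compare character by character; stop at the first mismatch.
        k = 0
        while k < m and text[start + k] == query[k]:
            k += 1
        if k == m:
            hits.append(start)
    return hits


print(naive_search("banana", "nana"))  # [2]
print(naive_search("banana", "ana"))   # [1, 3]
```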


Suffix array

• Introduced by Manber and Myers (1990) as a space efficient alternative to suffix tree (independently by Gonnet (1987))

• Sorted array of all suffixes of a given text

• Allows fast search of very large texts (e.g. genomes)


SA: building

B A N A N A $

• $ is lexicographically lower than all other characters in the alphabet and cannot appear in the text otherwise

All suffixes of the text, with their offsets:

B A N A N A $   0
A N A N A $     1
N A N A $       2
A N A $         3
N A $           4
A $             5
$               6

Sorted lexicographically:

$               6
A $             5
A N A $         3
A N A N A $     1
B A N A N A $   0
N A $           4
N A N A $       2
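A minimal sketch of the construction, assuming the text ends in a single "$" sentinel. It sorts the suffix offsets by the suffix they start, which is easy to read but far too slow and memory-hungry for a genome; real tools use specialised construction algorithms. The name `build_suffix_array` is hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start offsets by the suffix itself."""
    # Assumption of this sketch: exactly one "$" sentinel, at the end.
    assert text.endswith("$") and text.count("$") == 1
    # "$" sorts before the letters in ASCII, matching $ < A < C < G < T.
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("BANANA$")
print(sa)  # [6, 5, 3, 1, 0, 4, 2]
```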

SA: querying

• Search for prefixes in the suffix array that match our query string

• SA is sorted, so we can use binary search!

Query: N A N A

$               6
A $             5
A N A $         3
A N A N A $     1
B A N A N A $   0
N A $           4
N A N A $       2   (match)
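The binary search over the sorted suffixes can be sketched as below. This is an illustrative version under simple assumptions (plain Python strings, a naively built suffix array); `sa_search` is a hypothetical name. Two binary searches find the first and last rows whose suffix starts with the query, so every occurrence is reported.

```python
def sa_search(text: str, sa: list[int], query: str) -> list[int]:
    """Return all offsets in text where query occurs, using the suffix array."""
    m = len(query)

    def prefix(row: int) -> str:
        # First m characters of the suffix in this row of the suffix array.
        return text[sa[row]:sa[row] + m]

    # Lower bound: first row whose prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first row whose prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "BANANA$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(sa_search(text, sa, "NANA"))  # [2]
print(sa_search(text, sa, "ANA"))   # [1, 3]
```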

SA vs. naïve search

• Searching the human genome (~3 billion basepairs, n) for a single-end read (100 basepairs, m)

• Naïve search O(mn)

• Suffix array search O(m log(n))


n      O(n)   O(log(n))
8      8      3
16     16     4
32     32     5
64     64     6
128    128    7
256    256    8
512    512    9
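Plugging the genome-scale numbers into the two bounds gives a feel for the gap. This is only back-of-the-envelope arithmetic: it treats the O() expressions as step counts, ignores constant factors, and assumes a base-2 logarithm as in the table.

```python
import math

n = 3_000_000_000  # genome length in basepairs
m = 100            # read length in basepairs

naive_steps = m * n                          # O(mn)
sa_steps = m * math.ceil(math.log2(n))       # O(m log(n))

print(naive_steps)  # 300000000000
print(sa_steps)     # 3200
```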


Good enough for read mapping?

• Human genome is ~3 billion basepairs

• Assume 5 bytes per basepair (1 byte characters, 4 byte integers) = ~14 GB

• NGS data really hit in 2009 (16 GB RAM at the time was a luxury!)
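The ~14 GB figure can be checked with a short calculation, assuming "GB" here means 2^30 bytes and that each basepair costs one byte for the character plus one 4-byte suffix array entry.

```python
n = 3_000_000_000            # basepairs in the human genome (approximate)
bytes_per_bp = 1 + 4         # 1-byte character + 4-byte integer offset

total_bytes = n * bytes_per_bp
print(total_bytes)                      # 15000000000
print(round(total_bytes / 2**30, 1))    # 14.0
```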


Burrows-Wheeler transform

• Invented by Burrows and Wheeler (1994) while working at DEC

• Used in compression (.bz2 files)

• Interested in three things:

  • how to perform BWT

  • why BWT is useful for compression

  • how to reverse BWT


BWT

B A N A N A $

All rotations of the text:

B A N A N A $
A N A N A $ B
N A N A $ B A
A N A $ B A N
N A $ B A N A
A $ B A N A N
$ B A N A N A

Sorted lexicographically (the last column is the BWT):

$ B A N A N A
A $ B A N A N
A N A $ B A N
A N A N A $ B
B A N A N A $
N A $ B A N A
N A N A $ B A
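The rotate, sort, take-last-column recipe translates almost directly into Python. This is a teaching sketch (it materialises every rotation, so memory grows quadratically with the text length); `bwt_via_rotations` is a made-up name.

```python
def bwt_via_rotations(text: str) -> str:
    """Build all rotations, sort them, and read off the last column."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort()  # "$" sorts before letters in ASCII
    return "".join(row[-1] for row in rotations)


print(bwt_via_rotations("banana$"))  # annb$aa
```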

BWT compression

• T = "banana$"

• BWT(T) = "annb$aa"


• T = "peter_piper_picked_a_peck_of_pickled_peppers_a_peck_of_pickled_peppers_peter_piper_picked_if_peter_piper_picked_a_peck_of_pickled_peppers_wheres_the_peck_of_pickled_peppers_peter_piper_picked"

• BWT(T) = "ddsddkkkkaeaaddddsfsrrrrffffrrrrss___eeeeiiiiiiiieeeeeeeehppppkkkkllllpppppppptttthpppprppppiooootwpppppppp_ppppcccccccccccckkkk____________iiiipppp_______________eeeeeeeeeeeeeeeeerrrereeee__"
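One way to see why the long runs of identical characters help is to run-length encode the transformed text. The sketch below is an assumption-laden illustration, not how bzip2 actually encodes its output: it uses a naive rotation-based BWT and a plain (character, count) run-length encoding. Trying it on a longer, repetitive text and comparing the number of runs before and after the transform shows the effect more clearly than "banana$" does.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """Naive BWT: sort all rotations and take the last column."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse each run of equal characters into (character, run length)."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


print(run_length_encode(bwt("banana$")))
# [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```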


Relation to suffix array

$ B A N A N A
A $ B A N A N
A N A $ B A N
A N A N A $ B
B A N A N A $
N A $ B A N A
N A N A $ B A

• BWT matrix truncated at "$" in each row is the suffix array of the same text

• BWT can be computed directly from the suffix array
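Because row i of the sorted rotation matrix starts with the suffix at offset SA[i], its last character is simply the character just before that offset in T. A minimal sketch, assuming a "$"-terminated text and a naively built suffix array (the function name is hypothetical):

```python
def bwt_from_suffix_array(text: str, sa: list[int]) -> str:
    """BWT[i] is the character preceding suffix sa[i] in the text."""
    # For sa[i] == 0, text[-1] wraps around to the final "$" sentinel.
    return "".join(text[i - 1] for i in sa)


text = "banana$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(sa)                               # [6, 5, 3, 1, 0, 4, 2]
print(bwt_from_suffix_array(text, sa))  # annb$aa
```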


Reverse BWT

• It is not very useful to compress something if we cannot get the original text back!

• BWT'(BWT(T)) = T


LF mapping (T-rank)

B A N A N A $

T-rank: B0 A0 N0 A1 N1 A2 $

F                 L
$  B0 A0 N0 A1 N1 A2
A2 $  B0 A0 N0 A1 N1
A1 N1 A2 $  B0 A0 N0
A0 N0 A1 N1 A2 $  B0
B0 A0 N0 A1 N1 A2 $
N1 A2 $  B0 A0 N0 A1
N0 A1 N1 A2 $  B0 A0

• Ns in the L column are sorted by their "right context", same as Ns in F column!

LF mapping (B-rank)

B A N A N A $

T-rank: B0 A0 N0 A1 N1 A2 $

B-rank: B0 A2 N1 A1 N0 A0 $

F                 L
$  B0 A2 N1 A1 N0 A0
A0 $  B0 A2 N1 A1 N0
A1 N0 A0 $  B0 A2 N1
A2 N1 A1 N0 A0 $  B0
B0 A2 N1 A1 N0 A0 $
N0 A0 $  B0 A2 N1 A1
N1 A1 N0 A0 $  B0 A2

• F column contains very little information, just counts of each character

row  F   L
0    $   A0
1    A0  N0
2    A1  N1
3    A2  B0
4    B0  $
5    N0  A1
6    N1  A2

Character counts: { $:1, A:3, B:1, N:2 }

Which row contains N1 in the F column?

• Skip $ (+1)

• Skip As (+3)

• Skip Bs (+1)

• Skip first N (+1) = 6
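The "skip" arithmetic needs only the character counts, which is why the F column never has to be stored explicitly. A small sketch, with a hypothetical helper name and the counts recomputed from the BWT string on every call for simplicity:

```python
from collections import Counter


def first_column_row(bwt: str, ch: str, rank: int) -> int:
    """Row of the F column holding the occurrence of ch with the given B-rank."""
    counts = Counter(bwt)
    # Skip every character that sorts before ch ...
    skipped = sum(c for other, c in counts.items() if other < ch)
    # ... then skip the earlier copies of ch itself.
    return skipped + rank


print(first_column_row("ANNB$AA", "N", 1))  # 6
print(first_column_row("ANNB$AA", "B", 0))  # 4
```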

Reverse BWT

• Use B-ranking to reverse BWT, recreating the text T from right-to-left

row  F   L
0    $   A0
1    A0  N0
2    A1  N1
3    A2  B0
4    B0  $
5    N0  A1
6    N1  A2

Character counts: { $:1, A:3, B:1, N:2 }

Text recovered after each step:

$
A0 $
N0 A0 $
A1 N0 A0 $
N1 A1 N0 A0 $
A2 N1 A1 N0 A0 $
B0 A2 N1 A1 N0 A0 $
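The right-to-left reconstruction can be sketched as follows. It assumes a single "$" sentinel that sorts below every other character, and it precomputes all B-ranks with one pass over L, which is simple but not how a space-efficient index would do it. The name `inverse_bwt` is made up for this sketch.

```python
from collections import Counter


def inverse_bwt(bwt: str) -> str:
    """Recover the text from its BWT by repeated LF steps, right to left."""
    counts = Counter(bwt)
    # first_row[ch]: row of the first occurrence of ch in the F column.
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    # b_rank[i]: number of earlier occurrences of bwt[i] in the L column.
    seen = Counter()
    b_rank = []
    for ch in bwt:
        b_rank.append(seen[ch])
        seen[ch] += 1
    # Row 0 starts with "$"; its L character is the one preceding "$" in T.
    row, out = 0, ["$"]
    while bwt[row] != "$":
        ch = bwt[row]
        out.append(ch)
        row = first_row[ch] + b_rank[row]  # jump to the same character in F
    return "".join(reversed(out))


print(inverse_bwt("annb$aa"))  # banana$
```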

FM-index

• All BWT allows us to do is compress text

• Ferragina and Manzini (2000) "Full-text index in Minute space"

• Combine BWT with other auxiliary data structures to get an index

• Space savings: e.g. human genome (3 billion bp)

  • SA = ~14 GB (5 bytes/bp)

  • FM = ~1.5 GB (BWT itself at 2 bits/bp is ~0.75 GB; the remainder is auxiliary data structures)


Cannot search BWT like SA

$ B A N A N A
A $ B A N A N
A N A $ B A N
A N A N A $ B
B A N A N A $
N A $ B A N A
N A N A $ B A

• Rotation matrix contains the suffix array

• But we only store F and L columns, so binary search of prefixes not possible


BWT search

• In SA, we matched successively longer prefixes (left-to-right) of query string (binary search)

• In BWT, we will match successively longer suffixes (right-to-left) of query string (reverse BWT transform)

F                 L
$  B0 A2 N1 A1 N0 A0
A0 $  B0 A2 N1 A1 N0
A1 N0 A0 $  B0 A2 N1
A2 N1 A1 N0 A0 $  B0
B0 A2 N1 A1 N0 A0 $
N0 A0 $  B0 A2 N1 A1
N1 A1 N0 A0 $  B0 A2

Query: N A N A

• We know BWT contains the query, but unlike SA, we do not know the location of the match in T!
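Matching the query right-to-left can be sketched as a shrinking range of rows. This is an illustrative version only: the rank query `occ` is a linear scan over the BWT here, whereas a real FM-index answers it from a precomputed auxiliary structure. The names `backward_search`, `first_row` and `occ` are assumptions of this sketch.

```python
from collections import Counter


def backward_search(bwt: str, query: str) -> tuple[int, int]:
    """Return the half-open row range [lo, hi) of rows that start with query."""
    counts = Counter(bwt)
    # first_row[ch]: row of the first occurrence of ch in the F column.
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def occ(ch: str, upto: int) -> int:
        # Occurrences of ch in bwt[:upto] (linear scan, for clarity only).
        return bwt.count(ch, 0, upto)

    lo, hi = 0, len(bwt)
    # Extend the match one character at a time, from the end of the query.
    for ch in reversed(query):
        if ch not in first_row:
            return (0, 0)
        lo = first_row[ch] + occ(ch, lo)
        hi = first_row[ch] + occ(ch, hi)
        if lo >= hi:
            return (0, 0)  # empty range: query does not occur
    return (lo, hi)


print(backward_search("annb$aa", "nana"))  # (6, 7)
print(backward_search("annb$aa", "ana"))   # (2, 4)
```

The size of the returned range is the number of occurrences; the rows themselves still have to be translated into offsets in T.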

Page 88: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair

BWT search

F L$ B0 A2 N1 A1 N0 A0

A0 $ B0 A2 N1 A1 N0

A1 N0 A0 $ B0 A2 N1

A2 N1 A1 N0 A0 $ B0

B0 A2 N1 A1 N0 A0 $N0 A0 $ B0 A2 N1 A1

N1 A1 N0 A0 $ B0 A2N A N A

$ 6A $ 5A N A $ 3A N A N A $ 1B A N A N A $ 0N A $ 4N A N A $ 2

Idea: just store SA as well?

Page 89: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair

BWT search

F L$ B0 A2 N1 A1 N0 A0

A0 $ B0 A2 N1 A1 N0

A1 N0 A0 $ B0 A2 N1

A2 N1 A1 N0 A0 $ B0

B0 A2 N1 A1 N0 A0 $N0 A0 $ B0 A2 N1 A1

N1 A1 N0 A0 $ B0 A2N A N A

$ 6A $A N A $ 3A N A N A $B A N A N A $ 0N A $N A N A $

Idea 2: store part of SA?

Page 90: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair

BWT search

F L$ B0 A2 N1 A1 N0 A0

A0 $ B0 A2 N1 A1 N0

A1 N0 A0 $ B0 A2 N1

A2 N1 A1 N0 A0 $ B0

B0 A2 N1 A1 N0 A0 $N0 A0 $ B0 A2 N1 A1

N1 A1 N0 A0 $ B0 A2N A N A

$ 6A $A N A $ 3A N A N A $B A N A N A $ 0N A $N A N A $

... and walk backwards through the BWT!

+1

Page 91: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair

BWT search

F L$ B0 A2 N1 A1 N0 A0

A0 $ B0 A2 N1 A1 N0

A1 N0 A0 $ B0 A2 N1

A2 N1 A1 N0 A0 $ B0

B0 A2 N1 A1 N0 A0 $N0 A0 $ B0 A2 N1 A1

N1 A1 N0 A0 $ B0 A2N A N A

$ 6A $A N A $ 3A N A N A $B A N A N A $ 0N A $N A N A $

... and walk backwards through the BWT!

+1+1

Page 92: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair

BWT search

F L$ B0 A2 N1 A1 N0 A0

A0 $ B0 A2 N1 A1 N0

A1 N0 A0 $ B0 A2 N1

A2 N1 A1 N0 A0 $ B0

B0 A2 N1 A1 N0 A0 $N0 A0 $ B0 A2 N1 A1

N1 A1 N0 A0 $ B0 A2N A N A

$ 6A $A N A $ 3A N A N A $B A N A N A $ 0N A $N A N A $

... and walk backwards through the BWT!

+1+1

Page 93: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair

BWT search

[Figure: BWT matrix of T = BANANA$ with SA samples 6, 3 and 0 stored for rows 0, 2 and 4. The walk starts at row 6, where the query NANA matches.]

... and walk backwards through the BWT!

Row 6 → row 3 → row 4 (sample 0): +1+1, so NANA occurs at offset 0 + 2 = 2 in T

Full SA for comparison: 6 5 3 1 0 4 2 (row 6 holds 2)


BWT search

• Finding a location takes a bounded number of steps if the sampled offsets are evenly spaced in T, not in the SA! With one sample every k positions of T, at most k-1 backward steps are needed

• Trade off space (RAM) against time (how long lookups take)
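The lookup described above can be sketched in a few lines of Python. This is a toy illustration rather than how BWA implements it: the suffix array is built naively by sorting, rank is computed by scanning, and the names `build_index`, `lf`, `locate` and `sample_every` are made up for this example.

```python
def build_index(text, sample_every):
    """Build the BWT, the first-column offsets and SA samples for a '$'-terminated text."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix array, toy sizes only
    bwt = [text[i - 1] for i in sa]                # character preceding each suffix
    # c_table[c] = number of characters in T that are strictly smaller than c
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # Keep SA[row] only where the offset is a multiple of sample_every (evenly spaced in T)
    samples = {row: pos for row, pos in enumerate(sa) if pos % sample_every == 0}
    return bwt, c_table, samples


def lf(bwt, c_table, row):
    """LF mapping: the row whose first character is the same occurrence as L[row]."""
    ch = bwt[row]
    rank = bwt[:row].count(ch)  # naive rank; a real index uses an auxiliary structure
    return c_table[ch] + rank


def locate(bwt, c_table, samples, row):
    """Walk backwards through the BWT until a sampled row is reached, counting steps."""
    steps = 0
    while row not in samples:
        row = lf(bwt, c_table, row)
        steps += 1
    return samples[row] + steps


bwt, c_table, samples = build_index("BANANA$", sample_every=3)
print(locate(bwt, c_table, samples, row=6))  # row 6 is the suffix NANA$; prints 2
```

With `sample_every = k` the index keeps roughly n/k offsets instead of n, and each lookup costs at most k-1 LF steps, which is the space/time tradeoff named on the slide.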


Things we left out

• Rank calculations in the BWT need to be fast! This needs another auxiliary data structure

• We only covered exact matching; read alignment must tolerate mismatches (e.g. a SNP in the read that is not in the genome)

• Other details: store forwards and backwards indices of the genome due to the sequencing error profile
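One common choice for the auxiliary rank structure is a table of checkpoints: cumulative character counts are stored every few rows, and a rank query scans only the short stretch after the nearest checkpoint. The sketch below assumes that design; `build_checkpoints`, `rank` and the spacing `every` are illustrative names, not taken from any particular tool.

```python
def build_checkpoints(bwt, every):
    """Store the character counts of bwt[:i] for i = 0, every, 2*every, ..."""
    checkpoints = [{}]  # counts in the empty prefix bwt[:0]
    running = {}
    for i, ch in enumerate(bwt, start=1):
        running[ch] = running.get(ch, 0) + 1
        if i % every == 0:
            checkpoints.append(dict(running))  # snapshot of counts in bwt[:i]
    return checkpoints


def rank(bwt, checkpoints, every, ch, row):
    """Occurrences of ch in bwt[:row], scanning at most every-1 characters."""
    block = row // every
    count = checkpoints[block].get(ch, 0)
    for i in range(block * every, row):
        if bwt[i] == ch:
            count += 1
    return count


bwt = "ANNB$AA"  # BWT of BANANA$
checkpoints = build_checkpoints(bwt, every=3)
print(rank(bwt, checkpoints, 3, "A", 6))  # number of As in bwt[:6]; prints 2
```

A larger `every` saves memory but lengthens the scan, the same kind of tradeoff as for the sampled suffix array.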


Lab exercises

BWA, SAM/BAM format, samtools


Reference Data

• Reference genome

• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.fa.gz

• md5sum chromosome22.fa.gz (168c78298e731128ee622cf422e70f1e)

• gunzip chromosome22.fa.gz

• du -h chromosome22.fa (49 MB, genome is 2.9 GB)

• less chromosome22.fa (where's the DNA?)

• grep -nv "^N" chromosome22.fa | head (line 175169!)


bwa index

• Indexing options:

• bwa index

• Index human chromosome 22 (~1.5 mins, genome takes ~1.5 hours):

• bwa index chromosome22.fa


bwa index output

bash-3.2$ bwa index chromosome22.fa

[bwa_index] Pack FASTA... 0.52 sec

[bwa_index] Construct BWT for the packed sequence...

[BWTIncCreate] textLength=101636936, availableWord=19151484

[BWTIncConstructFromPacked] 10 iterations done. 31590664 characters processed.

[BWTIncConstructFromPacked] 20 iterations done. 58359704 characters processed.

[BWTIncConstructFromPacked] 30 iterations done. 82148056 characters processed.

[BWTIncConstructFromPacked] 40 iterations done. 101636936 characters processed.

[bwt_gen] Finished constructing BWT in 40 iterations.

[bwa_index] 74.38 seconds elapse.

[bwa_index] Update BWT... 0.38 sec

[bwa_index] Pack forward-only FASTA... 0.37 sec

[bwa_index] Construct SA from BWT and Occ... 13.67 sec

[main] Version: 0.7.12-r1039

[main] CMD: bwa index chromosome22.fa

[main] Real time: 93.689 sec; CPU: 89.320 sec


Read Data

• Paired-end reads

• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.reads_1.fastq.gz

• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.reads_2.fastq.gz

• md5sum chromosome22.reads_1.fastq.gz chromosome22.reads_2.fastq.gz

de1cd26056c61571de5cdf246ede60d3 chromosome22.reads_1.fastq.gz

2be64fb5848c2997af0ab8fab416d539 chromosome22.reads_2.fastq.gz

• gunzip chromosome22.reads_1.fastq.gz (and the other file)

• less chromosome22.reads_1.fastq
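When paging through the FASTQ file, it helps to know that each read usually occupies four lines: a header starting with "@", the sequence, a separator starting with "+", and the base qualities. The sketch below groups lines accordingly. It assumes this plain four-line layout (no wrapped sequences), and the function name `read_fastq` and the example record are made up for illustration.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        yield header[1:], seq, qual


# Made-up record, only to show the shape of the data
example = ["@read1/1\n", "ACGT\n", "+\n", "IIII\n"]
for name, seq, qual in read_fastq(example):
    print(name, seq, qual)
```

For the real data one would pass an open file handle, e.g. `read_fastq(open("chromosome22.reads_1.fastq"))`.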


bwa mapping options

• Several alignment options:

• bwa mem (70bp+ Illumina, 454, IonTorrent, Sanger)

• bwa bwasw (Smith-Waterman, frequent gaps)

• bwa aln/samse/sampe (short reads, original)


bwa mem

• Mapping paired-end data

• bwa mem [options] <idxbase> <in1.fq> <in2.fq>

• bwa mem -t 4 chromosome22.fa chromosome22.reads_1.fastq chromosome22.reads_2.fastq > chromosome22.sam

• -t specifies the number of CPUs to use


Sequence Alignment/Map format (SAM)

• SAM is a TAB-delimited text format that we can inspect with a pager:

• less -S chromosome22.sam

• Each row represents an alignment and has at least 11 fields

• Specification: https://samtools.github.io/hts-specs/SAMv1.pdf


SAM fields

• Column 1: read name

• Column 3: reference sequence name (in our case "22")

• Column 4: reference sequence position (reads were extracted from a 2 Mbase region)
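The columns listed above are three of the 11 mandatory SAM fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL in the specification). A minimal sketch of splitting one alignment line into named fields follows; `SamRecord`, `parse_sam_line` and the example line are invented for illustration, and header lines (starting with "@") are assumed to have been skipped already.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # column 1: read name
    flag: int    # column 2: bitwise flag
    rname: str   # column 3: reference sequence name
    pos: int     # column 4: 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: list   # optional fields after column 11


def parse_sam_line(line):
    """Split one TAB-delimited alignment line into its mandatory fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 TAB-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


# Made-up alignment line, not taken from the lab data
line = "read1\t99\t22\t10732771\t60\t4M\t=\t10732900\t133\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.qname, rec.rname, rec.pos)
```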


SAM flags

• SAM flags in column 2 describe mapping result

• https://broadinstitute.github.io/picard/explain-flags.html
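The flag is a bit field, so it can also be decoded locally. The sketch below uses the bit values from the SAM specification; the descriptions are paraphrased, and `SAM_FLAGS` and `explain_flag` are names chosen for this example.

```python
# (bit, meaning) pairs as defined in the SAM specification
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the meanings of all bits that are set in a SAM flag."""
    return [meaning for bit, meaning in SAM_FLAGS if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(explain_flag(99))
```

For a properly paired read on the forward strand whose mate maps to the reverse strand, the flag is typically 99 for the first read and 147 for the second.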


SAM post-processing

• Convert SAM file to BAM format:

• samtools view -Sb -o chromosome22.unsorted.bam chromosome22.sam

• Sort BAM file:

• samtools sort -o chromosome22.bam chromosome22.unsorted.bam

• Index BAM file:

• samtools index chromosome22.bam


samtools tview

• View alignment in console (in pileup format https://en.wikipedia.org/wiki/Pileup_format ):

• samtools tview chromosome22.bam chromosome22.fa

• Scroll with arrow keys (but remember beginning of chr22 is all Ns)

• Type "g" (without quotes) and type "22:10732771" to get to a region where reads are mapped

• Get to help screen by typing "?"

• Exit with "q"