![Page 1: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/1.jpg)
Suffix arrays, BWT and FM-index
Alan Medlar Wednesday 16th March 2016
![Page 2: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/2.jpg)
Outline
• Lecture: Technical background for read mapping tools used in this course
• Suffix array
• Burrows-Wheeler transform (BWT)
• FM-index
• Lab session: Using BWA to map paired-end data against the human genome, SAM/BAM files, etc
![Page 3: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/3.jpg)
Read mapping
• Sequencers can generate up to 100 million reads per sample
• Human genome is ~3 billion basepairs
• Need to map reads to the genome to discover variants (SNVs, indels) or to obtain counts (gene expression)
![Page 4: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/4.jpg)
Preliminaries
• String
• sequence of characters,
• e.g. "banana", "ATGC", "MDLISTFS"
• Alphabet { A, C, G, T, $ }, { A-Z, a-z, $ }
• Lexicographical order
• $ < A < C < G < T
![Page 5: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/5.jpg)
Preliminaries
• Prefix
• non-empty substring that is the beginning of another string (left-to-right)
• e.g. "ban" is a prefix of "banana", "AT" of "ATGC", "MDL" of "MDLISTFS"
• Suffix
• non-empty substring that is the ending of another string (right-to-left)
• e.g. "nana" is a suffix of "banana", "GC" of "ATGC", "TFS" of "MDLISTFS"
![Page 6: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/6.jpg)
Naïve exact search
• Text = "banana"
• Query = "nana"
• Linear search
B A N A N A
N A N A          offset 0: mismatch
  N A N A        offset 1: mismatch
    N A N A      offset 2: match
![Page 16: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/16.jpg)
Naïve search is too slow
• Human genome ~3 billion basepairs
• A read is ~100 basepairs
• Complexity of search scales linearly with the length of the text!
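The linear scan described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of any read mapper; the function name `naive_search` is a hypothetical choice made for this sketch.

```python
def naive_search(text: str, query: str) -> list[int]:
    """Return every offset at which query occurs in text (linear scan)."""
    hits = []
    # Try each possible alignment of the query against the text, left to right.
    for offset in range(len(text) - len(query) + 1):
        # Compare character by character; all() stops at the first mismatch.
        if all(text[offset + j] == query[j] for j in range(len(query))):
            hits.append(offset)
    return hits


print(naive_search("banana", "nana"))  # [2]
print(naive_search("banana", "ana"))   # [1, 3]
```

The outer loop visits every offset of the text, which is why the running time grows linearly with the length of the text.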
![Page 17: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/17.jpg)
Suffix array
• Introduced by Manber and Myers (1990) as a space-efficient alternative to the suffix tree (and independently by Gonnet (1987))
• Sorted array of all suffixes of a given text
• Allows fast search of very large texts (e.g. genomes)
![Page 18: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/18.jpg)
SA: building
B A N A N A $
• $ is lexicographically lower than all other characters in the alphabet and cannot appear anywhere else in the text
• All suffixes of the text, with their start offsets:
Suffix          Offset
B A N A N A $   0
A N A N A $     1
N A N A $       2
A N A $         3
N A $           4
A $             5
$               6
• Suffixes sorted lexicographically (the offset column is the suffix array):
Suffix          Offset
$               6
A $             5
A N A $         3
A N A N A $     1
B A N A N A $   0
N A $           4
N A N A $       2
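The construction shown above (list all suffixes, then sort them) can be sketched directly in Python. This is a minimal sketch for small texts: sorting whole suffix strings is simple but far too slow and memory-hungry for a genome, where specialised construction algorithms are used instead. The name `build_suffix_array` is an assumption of this sketch.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start offsets of all suffixes of text, in sorted suffix order.

    Assumes text ends with the sentinel "$", which sorts before letters in ASCII.
    """
    # Sort the offsets 0..n-1 by the suffix that starts at each offset.
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("banana$")
print(sa)  # [6, 5, 3, 1, 0, 4, 2]
for offset in sa:
    print(offset, "banana$"[offset:])
```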
![Page 24: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/24.jpg)
SA: querying
• Search the suffix array for suffixes whose prefix matches our query string
• SA is sorted, so we can use binary search!
![Page 25: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/25.jpg)
Suffix          Offset
$               6
A $             5
A N A $         3
A N A N A $     1
B A N A N A $   0
N A $           4
N A N A $       2
Query: N A N A
• N A N A is a prefix of the last row (N A N A $), so the query occurs in the text at offset 2
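A binary search over the sorted suffixes might look like the sketch below. It compares the query with the first m characters of each probed suffix and finds the block of rows that start with the query. The function names are hypothetical, and the suffix array is built with a naive sort purely to keep the example self-contained.

```python
def build_suffix_array(text: str) -> list[int]:
    # Naive construction, fine for a toy example.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_search(text: str, sa: list[int], query: str) -> list[int]:
    """Return the sorted text offsets at which query occurs."""
    m = len(query)

    # Lower bound: first row whose m-character prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first row whose m-character prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    # Rows start..lo-1 all begin with the query.
    return sorted(sa[start:lo])


text = "banana$"
sa = build_suffix_array(text)
print(sa_search(text, sa, "nana"))  # [2]
print(sa_search(text, sa, "ana"))   # [1, 3]
```

Each probe costs at most m character comparisons and there are about log2(n) probes per bound.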
![Page 39: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/39.jpg)
SA vs. naïve search
• Searching the human genome (~3 billion basepairs, n) for a single-end read (100 basepairs, m)
• Naïve search O(mn)
• Suffix array search O(m log(n))
n O(n) O(log(n))
8 8 3
16 16 4
32 32 5
64 64 6
128 128 7
256 256 8
512 512 9
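To put the two bounds side by side for the numbers on this slide, the short calculation below plugs n = 3 billion and m = 100 into m·n and m·log2(n). These are worst-case comparison counts from the big-O expressions with constant factors ignored, so treat them as rough orders of magnitude only.

```python
import math

n = 3_000_000_000  # text length: human genome, ~3 billion basepairs
m = 100            # query length: one 100 bp read

naive_comparisons = m * n                     # O(mn)
sa_comparisons = m * math.ceil(math.log2(n))  # O(m log(n)), log base 2

print(naive_comparisons)  # 300000000000
print(sa_comparisons)     # 3200
```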
![Page 41: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/41.jpg)
Good enough for read mapping?
• Human genome is ~3 billion basepairs
• Assume 5 bytes per basepair (1 byte characters, 4 byte integers) = ~14 GB
• NGS data really hit in 2009 (16 GB RAM at the time was a luxury!)
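The ~14 GB figure follows from simple arithmetic, reproduced here under the slide's assumptions (one byte per character of text plus one 4-byte integer offset per suffix). The constant names are chosen for this sketch only.

```python
GENOME_BP = 3_000_000_000  # ~3 billion basepairs
BYTES_PER_CHAR = 1         # the text itself
BYTES_PER_OFFSET = 4       # one suffix array entry per basepair

total_bytes = GENOME_BP * (BYTES_PER_CHAR + BYTES_PER_OFFSET)
total_gib = total_bytes / 2**30

print(f"{total_gib:.1f} GiB")  # 14.0 GiB
```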
![Page 42: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/42.jpg)
Burrows-Wheeler transform
• Invented by Burrows and Wheeler (1994) while working at DEC
• Used in compression (.bz2 files)
• Interested in three things:
• how to perform BWT
• why BWT is useful for compression
• how to reverse BWT
![Page 43: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/43.jpg)
BWT
B A N A N A $
• All rotations of the text:
B A N A N A $
A N A N A $ B
N A N A $ B A
A N A $ B A N
N A $ B A N A
A $ B A N A N
$ B A N A N A
• Rotations sorted lexicographically (the last column is the BWT):
$ B A N A N A
A $ B A N A N
A N A $ B A N
A N A N A $ B
B A N A N A $
N A $ B A N A
N A N A $ B A
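The rotate, sort, and take-the-last-column procedure can be written out literally, as in the sketch below. It builds the full rotation matrix, so it is only suitable for short strings; the function name `bwt` is an assumption of this example.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via the explicit rotation matrix.

    Assumes text ends with a unique sentinel "$" that sorts before all
    other characters (true for ASCII letters).
    """
    # Every rotation of the text, one per starting position.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort()
    # The transform is the last column of the sorted matrix.
    return "".join(row[-1] for row in rotations)


print(bwt("banana$"))  # annb$aa
```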
![Page 51: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/51.jpg)
BWT compression
• T = "banana$"
• BWT(T) = "annb$aa"
![Page 52: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/52.jpg)
BWT compression
• T = "peter_piper_picked_a_peck_of_pickled_peppers_a_peck_of_pickled_peppers_peter_piper_picked_if_peter_piper_picked_a_peck_of_pickled_peppers_wheres_the_peck_of_pickled_peppers_peter_piper_picked"
• BWT(T) = "ddsddkkkkaeaaddddsfsrrrrffffrrrrss___eeeeiiiiiiiieeeeeeeehppppkkkkllllpppppppptttthpppprppppiooootwpppppppp_ppppcccccccccccckkkk____________iiiipppp_______________eeeeeeeeeeeeeeeeerrrereeee__"
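One way to see why the transformed string is easier to compress is to count runs of identical characters, since long runs are cheap to encode. The sketch below uses a plain run-length encoding as a stand-in. This is only an illustration: real compressors such as bzip2 apply further stages after the BWT.

```python
from itertools import groupby


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into (character, run length)."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


print(run_length_encode("banana$"))  # 7 runs, each of length 1
print(run_length_encode("annb$aa"))  # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```

The same function can be applied to the longer tongue-twister strings above to compare their run counts.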
![Page 53: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/53.jpg)
Relation to suffix array
$ B A N A N A
A $ B A N A N
A N A $ B A N
A N A N A $ B
B A N A N A $
N A $ B A N A
N A N A $ B A
• BWT matrix truncated at "$" in each row is the suffix array of the same text
• BWT can be computed directly from the suffix array
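Computing the BWT from the suffix array needs no rotation matrix: the BWT character for each row is the character just before that row's suffix in the text (wrapping around to "$" for the suffix at offset 0). A minimal sketch, with hypothetical function names and a naive suffix array construction for self-containment:

```python
def build_suffix_array(text: str) -> list[int]:
    # Naive construction, fine for a toy example.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text: str, sa: list[int]) -> str:
    """BWT character for each row = character preceding that suffix in the text."""
    # For offset 0, text[-1] wraps around to the final "$", which is what we want.
    return "".join(text[offset - 1] for offset in sa)


text = "banana$"
print(bwt_from_sa(text, build_suffix_array(text)))  # annb$aa
```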
![Page 54: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/54.jpg)
Reverse BWT
• It is not very useful to compress something if we cannot get the original text back!
• BWT'(BWT(T)) = T
![Page 55: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/55.jpg)
LF mapping (T-rank)
B A N A N A $
B0 A0 N0 A1 N1 A2 $   (T-rank: each character numbered by its order of occurrence in T)
F = first column, L = last column of the sorted rotation matrix:
$ B0 A0 N0 A1 N1 A2
A2 $ B0 A0 N0 A1 N1
A1 N1 A2 $ B0 A0 N0
A0 N0 A1 N1 A2 $ B0
B0 A0 N0 A1 N1 A2 $
N1 A2 $ B0 A0 N0 A1
N0 A1 N1 A2 $ B0 A0
• Ns in the L column are sorted by their "right context", same as Ns in F column!
![Page 62: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/62.jpg)
LF mapping (B-rank)
B A N A N A $
B0 A0 N0 A1 N1 A2 $   (T-rank)
B0 A2 N1 A1 N0 A0 $   (B-rank)
F = first column, L = last column of the sorted rotation matrix:
$ B0 A2 N1 A1 N0 A0
A0 $ B0 A2 N1 A1 N0
A1 N0 A0 $ B0 A2 N1
A2 N1 A1 N0 A0 $ B0
B0 A2 N1 A1 N0 A0 $
N0 A0 $ B0 A2 N1 A1
N1 A1 N0 A0 $ B0 A2
• F column contains very little information, just counts of each character
![Page 65: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/65.jpg)
LF mapping (B-rank)
Row  F    L
0    $    A0
1    A0   N0
2    A1   N1
3    A2   B0
4    B0   $
5    N0   A1
6    N1   A2
{ $:1, A:3, B:1, N:2 }
Which row contains N1 in the F column?
• Skip $ (+1)
• Skip As (+3)
• Skip Bs (+1)
• Skip first N (+1)
= 6
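The "skip" arithmetic above generalises to any character and B-rank: add up the counts of all smaller characters, then add the rank. A small sketch, where the function name `f_row` is hypothetical and character order is plain ASCII order, in which "$" sorts first:

```python
def f_row(counts: dict[str, int], char: str, b_rank: int) -> int:
    """Row of the F column holding the occurrence of char with the given B-rank."""
    # Skip every character that sorts before char...
    skipped = sum(n for c, n in counts.items() if c < char)
    # ...then skip the earlier occurrences of char itself.
    return skipped + b_rank


counts = {"$": 1, "A": 3, "B": 1, "N": 2}
print(f_row(counts, "N", 1))  # 6
print(f_row(counts, "A", 0))  # 1
```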
![Page 68: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/68.jpg)
Reverse BWT• Use B-ranking to reverse BWT, recreating the text T
from right-to-left
LA0
N0
N1
B0
$A1
A2
F0 $1 A0
2 A1
3 A2
4 B0
5 N0
6 N1
B0 A2 N1 A1 N0 A0 $
$
{ $:1, A:3, B:1, N:2 }
![Page 69: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/69.jpg)
Reverse BWT
LA0
N0
N1
B0
$A1
A2
F0 $1 A0
2 A1
3 A2
4 B0
5 N0
6 N1
B0 A2 N1 A1 N0 A0 $
A0 $
{ $:1, A:3, B:1, N:2 }
• Use B-ranking to reverse BWT, recreating the text T from right-to-left
![Page 70: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/70.jpg)
Reverse BWT
LA0
N0
N1
B0
$A1
A2
F0 $1 A0
2 A1
3 A2
4 B0
5 N0
6 N1
B0 A2 N1 A1 N0 A0 $
N0 A0 $
{ $:1, A:3, B:1, N:2 }
• Use B-ranking to reverse BWT, recreating the text T from right-to-left
![Page 71: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/71.jpg)
Reverse BWT
LA0
N0
N1
B0
$A1
A2
F0 $1 A0
2 A1
3 A2
4 B0
5 N0
6 N1
B0 A2 N1 A1 N0 A0 $
A1 N0 A0 $
{ $:1, A:3, B:1, N:2 }
• Use B-ranking to reverse BWT, recreating the text T from right-to-left
![Page 72: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/72.jpg)
Reverse BWT
LA0
N0
N1
B0
$A1
A2
F0 $1 A0
2 A1
3 A2
4 B0
5 N0
6 N1
B0 A2 N1 A1 N0 A0 $
N1 A1 N0 A0 $
{ $:1, A:3, B:1, N:2 }
• Use B-ranking to reverse BWT, recreating the text T from right-to-left
![Page 73: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/73.jpg)
Reverse BWT
LA0
N0
N1
B0
$A1
A2
F0 $1 A0
2 A1
3 A2
4 B0
5 N0
6 N1
B0 A2 N1 A1 N0 A0 $
A2 N1 A1 N0 A0 $
{ $:1, A:3, B:1, N:2 }
• Use B-ranking to reverse BWT, recreating the text T from right-to-left
![Page 74: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/74.jpg)
Reverse BWT
(F and L columns, T and character counts unchanged)
Reconstructed (complete): B0 A2 N1 A1 N0 A0 $
• Use B-ranking to reverse BWT, recreating the text T from right-to-left
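The right-to-left reconstruction walked through on these slides can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `reverse_bwt` is hypothetical, the input is assumed to contain exactly one `$` that sorts before every other character, and ranks are computed with a simple linear pass.

```python
from collections import Counter


def reverse_bwt(bwt: str) -> str:
    """Rebuild T from its BWT (the L column), right to left."""
    counts = Counter(bwt)

    # first[c] = row of the F column where the block of character c starts.
    first = {}
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # B-ranking: ranks[i] = how many times bwt[i] occurred earlier in bwt.
    seen = Counter()
    ranks = []
    for c in bwt:
        ranks.append(seen[c])
        seen[c] += 1

    row = 0      # row 0 of the sorted rotations starts with '$'
    text = "$"   # T ends with '$', so start there and prepend
    while bwt[row] != "$":
        c = bwt[row]
        text = c + text
        # The k-th c in L is the same text character as the k-th c in F.
        row = first[c] + ranks[row]
    return text


print(reverse_bwt("annb$aa"))
```

For the lowercase version of the running example, the loop visits rows 0, 1, 5, 2, 6, 3, 4, prepending a, n, a, n, a, b in turn.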
![Page 75: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/75.jpg)
![Page 76: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/76.jpg)
FM-index
• All BWT allows us to do is compress text
• Ferragina and Manzini (2000) "Full-text index in Minute space"
• Combine BWT with other auxiliary data structures to get an index
• Space savings: e.g. Human genome (3 billion bp)
• SA = ~14 GB (5 bytes/bp)
• FM = ~1.5 GB (2 bits/bp)
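The space figures can be checked with a little arithmetic. The sketch below assumes exactly 3 billion basepairs and reports sizes in GiB (2^30 bytes). It shows that a BWT stored at 2 bits/bp is well under 1 GB by itself, so the ~1.5 GB figure presumably includes the auxiliary data structures; that reading is an assumption, since the slide does not break the number down.

```python
GENOME_BP = 3_000_000_000   # assumed genome length from the slide
GIB = 2 ** 30

sa_bytes = GENOME_BP * 5          # suffix array at 5 bytes per basepair
bwt_bytes = GENOME_BP * 2 // 8    # BWT alone at 2 bits per basepair

print(f"SA:  {sa_bytes / GIB:.2f} GiB")
print(f"BWT: {bwt_bytes / GIB:.2f} GiB (before auxiliary structures)")
```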
![Page 77: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/77.jpg)
Cannot search BWT like SA
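| F | | | | | | L |
| --- | --- | --- | --- | --- | --- | --- |
| $ | B | A | N | A | N | A |
| A | $ | B | A | N | A | N |
| A | N | A | $ | B | A | N |
| A | N | A | N | A | $ | B |
| B | A | N | A | N | A | $ |
| N | A | $ | B | A | N | A |
| N | A | N | A | $ | B | A |
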
• Rotation matrix contains the suffix array
• But we only store F and L columns, so binary search of prefixes not possible
![Page 78: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/78.jpg)
![Page 79: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/79.jpg)
BWT search
• In SA, we matched successively longer prefixes (left-to-right) of query string (binary search)
• In BWT, we will match successively longer suffixes (right-to-left) of query string (reverse BWT transform)
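A minimal sketch of this right-to-left matching, often called backward search, is given below. The function name `backward_search` and the half-open row range it returns are illustrative choices, and the rank helper `occ` is a deliberately slow linear scan; a real index would use a faster rank structure.

```python
from collections import Counter


def backward_search(bwt: str, query: str) -> tuple[int, int]:
    """Return the half-open range [lo, hi) of sorted-rotation rows whose
    prefix equals query, or (0, 0) if the query does not occur."""
    counts = Counter(bwt)

    # first[c] = row of the F column where the block of character c starts.
    first = {}
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def occ(c: str, i: int) -> int:
        # Occurrences of c in bwt[:i]; linear scan for clarity only.
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)
    for c in reversed(query):   # extend the matched suffix one character left
        if c not in first:
            return (0, 0)
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return (0, 0)
    return (lo, hi)


print(backward_search("annb$aa", "nana"))
```

With the lowercase running example, the range narrows as [1, 4) for "a", [5, 7) for "na", [2, 4) for "ana" and finally [6, 7) for "nana". The result is a row range only; it does not yet say where in T the match starts.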
![Page 80: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/80.jpg)
BWT search
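| F | | | | | | L |
| --- | --- | --- | --- | --- | --- | --- |
| $ | B0 | A2 | N1 | A1 | N0 | A0 |
| A0 | $ | B0 | A2 | N1 | A1 | N0 |
| A1 | N0 | A0 | $ | B0 | A2 | N1 |
| A2 | N1 | A1 | N0 | A0 | $ | B0 |
| B0 | A2 | N1 | A1 | N0 | A0 | $ |
| N0 | A0 | $ | B0 | A2 | N1 | A1 |
| N1 | A1 | N0 | A0 | $ | B0 | A2 |

Query: N A N A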
![Page 81: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/81.jpg)
![Page 82: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/82.jpg)
![Page 83: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/83.jpg)
![Page 84: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/84.jpg)
![Page 85: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/85.jpg)
![Page 86: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/86.jpg)
![Page 87: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/87.jpg)
BWT search
We know BWT contains the query, but unlike SA, we do not know the location
of the match in T!
![Page 88: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/88.jpg)
BWT search
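| Rotation (F ... L) | Suffix | SA |
| --- | --- | --- |
| $ B0 A2 N1 A1 N0 A0 | $ | 6 |
| A0 $ B0 A2 N1 A1 N0 | A $ | 5 |
| A1 N0 A0 $ B0 A2 N1 | A N A $ | 3 |
| A2 N1 A1 N0 A0 $ B0 | A N A N A $ | 1 |
| B0 A2 N1 A1 N0 A0 $ | B A N A N A $ | 0 |
| N0 A0 $ B0 A2 N1 A1 | N A $ | 4 |
| N1 A1 N0 A0 $ B0 A2 | N A N A $ | 2 |

Query: N A N A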
Idea: just store SA as well?
![Page 89: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/89.jpg)
BWT search
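| Rotation (F ... L) | Suffix | Stored SA |
| --- | --- | --- |
| $ B0 A2 N1 A1 N0 A0 | $ | 6 |
| A0 $ B0 A2 N1 A1 N0 | A $ | |
| A1 N0 A0 $ B0 A2 N1 | A N A $ | 3 |
| A2 N1 A1 N0 A0 $ B0 | A N A N A $ | |
| B0 A2 N1 A1 N0 A0 $ | B A N A N A $ | 0 |
| N0 A0 $ B0 A2 N1 A1 | N A $ | |
| N1 A1 N0 A0 $ B0 A2 | N A N A $ | |

Query: N A N A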
Idea 2: store part of SA?
![Page 90: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/90.jpg)
BWT search
(Rotation matrix with stored SA entries 6, 3 and 0; query: N A N A)
... and walk backwards through the BWT!
+1
![Page 91: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/91.jpg)
BWT search
(Rotation matrix with stored SA entries 6, 3 and 0; query: N A N A)
... and walk backwards through the BWT!
+1 +1
![Page 92: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/92.jpg)
![Page 93: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/93.jpg)
BWT search
(Rotation matrix with stored SA entries 6, 3 and 0; query: N A N A)
... and walk backwards through the BWT!
+1 +1: two backward steps from the matching row (suffix N A N A $) reach the stored row B A N A N A $ (offset 0), so the match starts at offset 0 + 2 = 2 in T
SA: 6 5 3 1 0 4 2
![Page 94: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/94.jpg)
BWT search
• Finding the location takes constant time if the stored offsets are evenly spaced in T, not in the SA! With every k-th position of T stored, at most k − 1 backward steps are needed
• Make tradeoff between space (RAM) and time (how long lookups take)
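The walk back to a stored offset can be sketched as follows. The function name `locate` and the dictionary `sampled_sa` (row number to offset in T) are hypothetical. The sample mirrors the slides, where only offsets 0, 3 and 6 of T are kept, i.e. every third text position. Ranks are again computed with a plain linear pass.

```python
from collections import Counter


def locate(bwt: str, sampled_sa: dict[int, int], row: int) -> int:
    """Resolve the offset in T for a row of the sorted rotations by walking
    backwards through the BWT until a row with a stored offset is reached.
    Assumes offset 0 of T is always among the stored samples."""
    counts = Counter(bwt)
    first = {}
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    seen = Counter()
    ranks = []
    for c in bwt:
        ranks.append(seen[c])
        seen[c] += 1

    steps = 0
    while row not in sampled_sa:
        c = bwt[row]
        row = first[c] + ranks[row]   # move to the suffix one position earlier in T
        steps += 1
    return sampled_sa[row] + steps


bwt = "annb$aa"
sampled_sa = {0: 6, 2: 3, 4: 0}       # rows whose offsets in T are 6, 3 and 0
print(locate(bwt, sampled_sa, 6))
```

Row 6 is the match for "nana" in the lowercase running example: two backward steps lead to row 4 (stored offset 0), giving offset 2. A larger sampling interval saves memory but lengthens the walk, which is the space/time tradeoff named on the slide.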
![Page 95: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/95.jpg)
Things we left out
• Rank calculations in the BWT need to be fast! Needs another auxiliary data structure
• Only covered exact matching, read alignment requires mismatches (e.g. SNP in read, not in genome)
• Other details: store forwards and backwards indices of genome due to sequencing error profile
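For the first point, one common approach to fast rank queries is to store cumulative character counts at regular checkpoints and scan only the short stretch after the nearest checkpoint. The sketch below is an illustration under that assumption, not a description of any particular tool's internals; the class name `OccCheckpoints` and the `step` parameter are hypothetical.

```python
class OccCheckpoints:
    """Rank support for a BWT string using counts stored every `step` characters."""

    def __init__(self, bwt: str, step: int = 4):
        self.bwt = bwt
        self.step = step
        running = {c: 0 for c in sorted(set(bwt))}
        # checkpoints[k] holds the character counts of bwt[:k * step].
        self.checkpoints = [dict(running)]
        for i, c in enumerate(bwt, start=1):
            running[c] += 1
            if i % step == 0:
                self.checkpoints.append(dict(running))

    def occ(self, c: str, i: int) -> int:
        """Number of occurrences of c in bwt[:i]."""
        if c not in self.checkpoints[0]:
            return 0
        k = i // self.step
        count = self.checkpoints[k][c]
        # Scan at most step - 1 characters after the checkpoint.
        for j in range(k * self.step, i):
            if self.bwt[j] == c:
                count += 1
        return count


index = OccCheckpoints("annb$aa", step=4)
print(index.occ("a", 7))
```

A smaller `step` means more memory for checkpoints and shorter scans, so the same space/time tradeoff as for the stored SA offsets applies here.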
![Page 96: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/96.jpg)
Lab exercises
BWA, SAM/BAM format, samtools
![Page 97: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/97.jpg)
Reference Data
• Reference genome
• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.fa.gz
• md5sum chromosome22.fa.gz (168c78298e731128ee622cf422e70f1e)
• gunzip chromosome22.fa.gz
• du -h chromosome22.fa (49 MB, genome is 2.9 GB)
• less chromosome22.fa (where's the DNA?)
• grep -nv "^N" chromosome22.fa | head (line 175169!)
![Page 98: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/98.jpg)
bwa index
• Indexing options:
• bwa index
• Index human chromosome 22 (~1.5 mins, genome takes ~1.5 hours):
• bwa index chromosome22.fa
![Page 99: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/99.jpg)
bwa index output
bash-3.2$ bwa index chromosome22.fa
[bwa_index] Pack FASTA... 0.52 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=101636936, availableWord=19151484
[BWTIncConstructFromPacked] 10 iterations done. 31590664 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 58359704 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 82148056 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 101636936 characters processed.
[bwt_gen] Finished constructing BWT in 40 iterations.
[bwa_index] 74.38 seconds elapse.
[bwa_index] Update BWT... 0.38 sec
[bwa_index] Pack forward-only FASTA... 0.37 sec
[bwa_index] Construct SA from BWT and Occ... 13.67 sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa index chromosome22.fa
[main] Real time: 93.689 sec; CPU: 89.320 sec
![Page 100: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/100.jpg)
Read Data
• Paired-end reads
• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.reads_1.fastq.gz
• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.reads_2.fastq.gz
• md5sum chromosome22.reads_1.fastq.gz chromosome22.reads_2.fastq.gz
de1cd26056c61571de5cdf246ede60d3 chromosome22.reads_1.fastq.gz
2be64fb5848c2997af0ab8fab416d539 chromosome22.reads_2.fastq.gz
• gunzip chromosome22.reads_1.fastq.gz (and the other file)
• less chromosome22.reads_1.fastq
![Page 101: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/101.jpg)
bwa mapping options
• Several alignment options:
• bwa mem (70bp+ Illumina, 454, IonTorrent, Sanger)
• bwa bwasw (Smith-Waterman, frequent gaps)
• bwa aln/samse/sampe (short reads, original)
![Page 102: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/102.jpg)
bwa mem
• Mapping paired-end data
• bwa mem [options] <idxbase> <in1.fq> <in2.fq>
• bwa mem -t 4 chromosome22.fa chromosome22.reads_1.fastq chromosome22.reads_2.fastq > chromosome22.sam
• -t specifies the number of threads to use
![Page 103: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/103.jpg)
Sequence Alignment/Map format (SAM)
• SAM format is a TAB-delimited text file that we can inspect with a pager:
• less -S chromosome22.sam
• After the header lines (starting with "@"), each row represents an alignment with at least 11 fields
• Specification: https://samtools.github.io/hts-specs/SAMv1.pdf
![Page 104: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/104.jpg)
SAM fields
• Column 1: read name
• Column 3: reference sequence name (in our case "22")
• Column 4: reference sequence position (reads were extracted from 2Mbase region)
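To make the column layout concrete, the sketch below splits one alignment line into the 11 mandatory SAM fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) plus any optional tags. The names `SamRecord` and `parse_sam_line` are hypothetical, and the example line is made up for illustration; it is not taken from chromosome22.sam.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    qname: str   # column 1: read name
    flag: int    # column 2: bitwise flag
    rname: str   # column 3: reference sequence name
    pos: int     # column 4: 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple  # optional fields, kept as raw strings


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; header lines (starting with '@') return None."""
    if line.startswith("@"):
        return None
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 TAB-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


# Made-up example line, for illustration only.
example = "read_1\t99\t22\t10732771\t60\t5M\t=\t10732900\t134\tACGTA\tIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record.qname, record.rname, record.pos)
```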
![Page 105: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/105.jpg)
SAM flags
• SAM flags in column 2 describe mapping result
• https://broadinstitute.github.io/picard/explain-flags.html
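The flag is a sum of bit values, so it can also be decoded locally. The sketch below uses the bit meanings defined in the SAM specification; the function name `explain_flag` and the wording of the descriptions are illustrative.

```python
# Bit values as defined in the SAM specification; descriptions paraphrased.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag: int) -> list[str]:
    """List the properties whose bits are set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(explain_flag(99))
```

For example, 99 = 1 + 2 + 32 + 64, a properly paired first read whose mate maps to the reverse strand.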
![Page 106: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/106.jpg)
SAM post-processing
• Convert SAM file to BAM format:
• samtools view -Sb -o chromosome22.unsorted.bam chromosome22.sam
• Sort BAM file:
• samtools sort -o chromosome22.bam chromosome22.unsorted.bam
• Index BAM file:
• samtools index chromosome22.bam
![Page 107: Suffix arrays, BWT and FM-indexloytynojalab.biocenter.helsinki.fi/data/evogeno/SA... · FM-index Alan Medlar Wednesday 16th March 2016. Outline ... • Assume 5 bytes per basepair](https://reader034.vdocuments.site/reader034/viewer/2022043010/5fa2327fd6efbe34f0348daa/html5/thumbnails/107.jpg)
samtools tview
• View alignment in console (in pileup format https://en.wikipedia.org/wiki/Pileup_format ):
• samtools tview chromosome22.bam chromosome22.fa
• Scroll with arrow keys (but remember beginning of chr22 is all Ns)
• Type "g" (without quotes) and type "22:10732771" to get to a region where reads are mapped
• Get to help screen by typing "?"
• Exit with "q"