Edit Distance – Levenshtein Sequence Alignment – Needleman & Wunsch
Not in Book
CS380 Algorithm Design and Analysis
EDIT DISTANCE
http://en.wikipedia.org/wiki/Levenshtein_distance
Edit Distance
• Mutation in DNA is evolutionary.
• DNA replication errors cause the following changes to nucleotides, leading to “edited” DNA texts:
  o Substitutions
  o Insertions
  o Deletions
Edit Distance: Definition
• Introduced by Vladimir Levenshtein in 1966
• The Edit Distance between two strings is the minimum number of editing operations needed to transform one string into another
• Operations are:
  o Insertion of a symbol
  o Deletion of a symbol
  o Substitution of one symbol for another
Example
• How would you transform:
  o X: TGCATAT
• To the string:
  o Y: ATCCGAT
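One possible answer, as a sketch: the four edits below turn X into Y. The helper `apply_edits` and its tuple encoding of operations are illustrative assumptions, not part of the lecture. Positions are 0-based and refer to the string as it stands after the previous edits.

```python
def apply_edits(s, edits):
    """Apply a list of edits to s, one after another.

    Each edit is a tuple using 0-based positions in the current string:
      ("ins", pos, ch)  insert ch before position pos
      ("del", pos)      delete the symbol at position pos
      ("sub", pos, ch)  replace the symbol at position pos with ch
    """
    for edit in edits:
        op, pos = edit[0], edit[1]
        if op == "ins":
            s = s[:pos] + edit[2] + s[pos:]
        elif op == "del":
            s = s[:pos] + s[pos + 1:]
        elif op == "sub":
            s = s[:pos] + edit[2] + s[pos + 1:]
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return s


script = [
    ("ins", 0, "A"),  # TGCATAT  -> ATGCATAT
    ("del", 5),       # ATGCATAT -> ATGCAAT
    ("sub", 4, "G"),  # ATGCAAT  -> ATGCGAT
    ("sub", 2, "C"),  # ATGCGAT  -> ATCCGAT
]
result = apply_edits("TGCATAT", script)
print(result)  # prints ATCCGAT
```

Finding such a script by hand does not show that it is the shortest one. That is the question the table-based method answers.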
Edit Distance
• How many insertions, deletions, substitutions will transform one string into another?
• Backtracking will give us the steps used to convert one string to another
Recursive Solution
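• Let $d_{i,j}$ = the minimum edit distance of $x_1 x_2 \dots x_i$ and $y_1 y_2 \dots y_j$
• With $d_{i,0} = i$ and $d_{0,j} = j$, for $i, j > 0$:

$$
d_{i,j} = \min
\begin{cases}
d_{i,j-1} + 1 & \text{Insertion to X} \\
d_{i-1,j-1} + 1 & \text{Substitution, if } x_i \neq y_j \\
d_{i-1,j} + 1 & \text{Deletion from X} \\
d_{i-1,j-1} & \text{Match, if } x_i = y_j
\end{cases}
$$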
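A minimal sketch of filling the table of d values bottom-up, assuming unit cost for each insertion, deletion, and substitution. The function name `edit_distance` and the list-of-lists table are choices made here for illustration.

```python
def edit_distance(x, y):
    """Return the Levenshtein distance between strings x and y."""
    m, n = len(x), len(y)
    # d[i][j] = minimum edit distance between x[:i] and y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # delete all i symbols of x
    for j in range(1, n + 1):
        d[0][j] = j  # insert all j symbols of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Diagonal move: free on a match, cost 1 on a substitution.
            diag = d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1)
            d[i][j] = min(
                d[i][j - 1] + 1,  # insertion to X
                d[i - 1][j] + 1,  # deletion from X
                diag,             # match or substitution
            )
    return d[m][n]


print(edit_distance("TGCATAT", "ATCCGAT"))  # prints 4
```

The table has (m + 1)(n + 1) entries and each is filled in constant time, so both time and space are proportional to the product of the string lengths.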
Backtracking
• No need to keep track of the arrows
• Just know that:
  o Match/Substitution: Diagonal
  o Insertion: Horizontal (Left)
  o Deletion: Vertical (Up)
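As a sketch of backtracking without stored arrows: assuming rows of the table are indexed by X and columns by Y, each step can be re-derived by checking which neighbour could have produced the current value. The name `edit_script` and the tuple format of the steps are illustrative assumptions. When several moves are tied, this sketch prefers the diagonal, so it returns one optimal script among possibly many.

```python
def edit_script(x, y):
    """Return (distance, ops), where ops lists the steps turning x into y."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i][j - 1] + 1, d[i - 1][j] + 1)

    # Walk back from the bottom-right corner, re-deriving each move from the
    # neighbouring values instead of from stored arrows.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if x[i - 1] == y[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:  # diagonal
                name = "match" if cost == 0 else "substitute"
                ops.append((name, x[i - 1], y[j - 1]))
                i, j = i - 1, j - 1
                continue
        if j > 0 and d[i][j] == d[i][j - 1] + 1:  # horizontal (left)
            ops.append(("insert", "-", y[j - 1]))
            j -= 1
        else:  # vertical (up)
            ops.append(("delete", x[i - 1], "-"))
            i -= 1
    ops.reverse()
    return d[m][n], ops


distance, steps = edit_script("ATCGTT", "AGTTAC")
print(distance)  # prints 4
for step in steps:
    print(step)
```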
Example
• X = ATCGTT
• Y = AGTTAC
SEQUENCE ALIGNMENT
Kleinberg, Tardos, Algorithm Design, Pearson Addison Wesley, 2006, p 278
http://www.aw-bc.com/info/kleinberg/
Sequence Alignment
• Edit Distance:
  o Gave the minimum number of changes to convert one string into another
• Sequence Alignment:
  o Maximizes the similarity by giving weights to types of differences
Sequence Alignment
• Needleman-Wunsch
• Similarity based on gaps and mismatches
• Generalized form of Levenshtein
  o additional parameters:
    § gap penalty, $\delta$
    § mismatch cost ( $\alpha_{x,y}$ ; $\alpha_{x,x} = 0$ )
Recurrence
• Two strings $x_1 \dots x_m$ and $y_1 \dots y_n$
• In an optimal alignment, M, at least one of the following is true:
  o $(x_m, y_n)$ is in M
  o $x_m$ is not matched
  o $y_n$ is not matched
Recurrence
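• So, for i and j > 0:

$$
\mathrm{OPT}(i, j) = \min
\begin{cases}
\alpha_{x_i, y_j} + \mathrm{OPT}(i-1, j-1) \\
\delta + \mathrm{OPT}(i-1, j) \\
\delta + \mathrm{OPT}(i, j-1)
\end{cases}
$$

• Here $\mathrm{OPT}(i, j)$ is the minimum cost of aligning $x_1 \dots x_i$ with $y_1 \dots y_j$, with $\mathrm{OPT}(i, 0) = i\delta$ and $\mathrm{OPT}(0, j) = j\delta$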
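The three cases for an optimal alignment translate into a table-filling procedure. The sketch below assumes the gap penalty is passed as a number `delta` and the mismatch cost as a function `alpha(p, q)`. These names, and the function name `alignment_cost`, are illustrative choices rather than the lecture's own code.

```python
def alignment_cost(x, y, delta, alpha):
    """Return the minimum alignment cost of x and y.

    delta is the gap penalty; alpha(p, q) is the cost of matching symbol p
    of x with symbol q of y (expected to be 0 when p == q).
    """
    m, n = len(x), len(y)
    # opt[i][j] = minimum cost of aligning x[:i] with y[:j]
    opt = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        opt[i][0] = i * delta  # every symbol of x[:i] faces a gap
    for j in range(n + 1):
        opt[0][j] = j * delta  # every symbol of y[:j] faces a gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            opt[i][j] = min(
                alpha(x[i - 1], y[j - 1]) + opt[i - 1][j - 1],  # (x_i, y_j) in M
                delta + opt[i - 1][j],                          # x_i not matched
                delta + opt[i][j - 1],                          # y_j not matched
            )
    return opt[m][n]


def unit(p, q):
    """Mismatch cost 1 for different symbols, 0 for identical ones."""
    return 0 if p == q else 1


# With delta = 1 and unit mismatch cost this reduces to Levenshtein distance.
print(alignment_cost("ATCGTT", "AGTTAC", 1, unit))  # prints 4
```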
Example
• Assume that:
  o δ = 2
  o α (v, v) = 1 (two different vowels)
  o α (c, c) = 1 (two different consonants)
  o α (v, c) = 3 (a vowel and a consonant)
• What is the cost of aligning the strings:
  o mean
  o name
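As a check on the arithmetic, here is one candidate alignment and its cost. It assumes v stands for a vowel, c for a consonant, that the listed costs apply to two different letters, and that identical letters cost 0. Each column is either a matched pair or a symbol facing a gap (shown as a dash):

$$
\begin{array}{ccccc}
\text{m} & \text{e} & \text{a} & \text{n} & - \\
\text{n} & - & \text{a} & \text{m} & \text{e}
\end{array}
\qquad
\text{cost} = \alpha(c, c) + \delta + 0 + \alpha(c, c) + \delta = 1 + 2 + 0 + 1 + 2 = 6
$$

For comparison, the alignment with no gaps pairs m with n, e with a, a with m, and n with e, for a cost of 1 + 1 + 3 + 3 = 8. Filling in the full table for these two strings gives 6 as the minimum.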
SPACE-EFFICIENT SEQUENCE ALIGNMENT
Sequence Alignment Space Usage
• $O(n^2)$ is pretty low space usage for short strings
• However, for a 10 GB genome, you’d need a huge amount of memory
• Can we use less?
  o Hirschberg’s algorithm
  o 1975
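A rough estimate of the scale involved. It assumes one byte per nucleotide, so that a 10 GB genome has about $10^{10}$ symbols, that two sequences of this length are aligned, and that each table entry takes 4 bytes. All three figures are assumptions made for illustration:

$$
n \approx 10^{10}
\;\Rightarrow\;
n^2 \approx 10^{20} \text{ entries}
\;\Rightarrow\;
4 \times 10^{20} \text{ bytes} = 400 \text{ exabytes}
$$

Keeping only two columns of the table instead needs about $2n \times 4 = 8 \times 10^{10}$ bytes, or roughly 80 GB.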
Linear Space for Alignment Scores
• If you are only interested in the cost of the alignment, you only need O(n) space
• How?
  o When filling the entries, we only ever look at the current and previous columns
  o Only keep those two in memory
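A minimal sketch of the two-column idea, assuming columns are indexed by positions in Y and each column has one entry per prefix of X. The names `alignment_cost_linear_space`, `delta`, and `alpha` are illustrative, with `alpha(p, q)` standing for the mismatch cost as a function.

```python
def alignment_cost_linear_space(x, y, delta, alpha):
    """Minimum alignment cost of x and y, keeping only two columns."""
    m = len(x)
    # prev[i] = cost of aligning x[:i] with y[:j-1]; it starts as column 0.
    prev = [i * delta for i in range(m + 1)]
    for j in range(1, len(y) + 1):
        # curr[0]: the empty prefix of x against y[:j] is all gaps.
        curr = [j * delta] + [0] * m
        for i in range(1, m + 1):
            curr[i] = min(
                alpha(x[i - 1], y[j - 1]) + prev[i - 1],  # x_i matched with y_j
                delta + curr[i - 1],                      # x_i not matched
                delta + prev[i],                          # y_j not matched
            )
        prev = curr  # the older column is no longer needed
    return prev[m]


def unit(p, q):
    """Mismatch cost 1 for different symbols, 0 for identical ones."""
    return 0 if p == q else 1


print(alignment_cost_linear_space("ATCGTT", "AGTTAC", 1, unit))  # prints 4
```

The result is the same number a full table would give, but the discarded columns mean the alignment itself can no longer be read off by backtracking.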
Space-Efficient-Alignment (X, Y)
Actual Alignment
• How do we recover the actual alignment?
• Do we need the entire matrix?
Divide-and-Conquer-Alignment (X,Y)
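A sketch of one way to recover the alignment in linear working space, in the spirit of Hirschberg's divide-and-conquer idea. It is a common variant and not necessarily the pseudocode from the original slide. It splits Y in half, uses a forward and a backward two-column pass to find where an optimal alignment crosses the split in X, and recurses on both sides. The names `last_column`, `hirschberg`, `delta`, and `alpha` are assumptions made for illustration.

```python
def last_column(x, y, delta, alpha):
    """Return a list c where c[i] is the cost of aligning x[:i] with all of y."""
    prev = [i * delta for i in range(len(x) + 1)]
    for j in range(1, len(y) + 1):
        curr = [j * delta] + [0] * len(x)
        for i in range(1, len(x) + 1):
            curr[i] = min(
                alpha(x[i - 1], y[j - 1]) + prev[i - 1],
                delta + curr[i - 1],
                delta + prev[i],
            )
        prev = curr
    return prev


def hirschberg(x, y, delta, alpha):
    """Return (row_x, row_y): x and y padded with '-' in an optimal alignment."""
    m, n = len(x), len(y)
    if n == 0:
        return x, "-" * m
    if m == 0:
        return "-" * n, y
    if n == 1:
        # Base case: pair y[0] with the cheapest x[k], unless gaps are cheaper.
        k = min(range(m), key=lambda i: alpha(x[i], y[0]))
        if alpha(x[k], y[0]) + (m - 1) * delta <= (m + 1) * delta:
            return x, "-" * k + y + "-" * (m - k - 1)
        return x + "-", "-" * m + y
    mid = n // 2
    # f[i]: cost of aligning x[:i] with the left half of y.
    f = last_column(x, y[:mid], delta, alpha)
    # g[k]: cost of aligning the last k symbols of x with the right half of y.
    g = last_column(x[::-1], y[mid:][::-1], delta, alpha)
    # Split x where the two halves together are cheapest.
    q = min(range(m + 1), key=lambda i: f[i] + g[m - i])
    left_x, left_y = hirschberg(x[:q], y[:mid], delta, alpha)
    right_x, right_y = hirschberg(x[q:], y[mid:], delta, alpha)
    return left_x + right_x, left_y + right_y


def unit(p, q):
    """Mismatch cost 1 for different symbols, 0 for identical ones."""
    return 0 if p == q else 1


row_x, row_y = hirschberg("ATCGTT", "AGTTAC", 1, unit)
print(row_x)
print(row_y)
```

Each level of the recursion does work proportional to the size of the subproblems it receives, and those sizes shrink geometrically, so the running time stays within a constant factor of filling the full table once.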