Sequence Local Alignment using Directed Acyclic Word Graph Do Huy Hoang

Post on 22-Feb-2016

Page 1: Sequence Local Alignment using Directed Acyclic Word Graph

Sequence Local Alignment using Directed Acyclic Word Graph

Do Huy Hoang

Page 2: Sequence Local Alignment using Directed Acyclic Word Graph

SEQUENCE ALIGNMENT

Page 3: Sequence Local Alignment using Directed Acyclic Word Graph

Sequence Similarity

• Alignment
– Arrange DNA/protein sequences to show the similarity
• A gap symbol (conventionally “-”) denotes an insertion/deletion event

Page 4: Sequence Local Alignment using Directed Acyclic Word Graph

Other variations

• Edit distance
• Longest common substring
• Affine gap scoring
• Using scoring matrix (BLOSUM, PAM)

Page 5: Sequence Local Alignment using Directed Acyclic Word Graph

Alignment score computation

• Needleman–Wunsch – Dynamic programming
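The dynamic program named on this slide can be sketched as follows. This is a minimal illustration that assumes a linear gap cost; the function name `needleman_wunsch` and the default scores are chosen here for illustration and do not come from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b under a linear gap cost.

    Only two rows of the DP table are kept, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # a[i-1] against a gap
                            curr[j - 1] + gap))  # b[j-1] against a gap
        prev = curr
    return prev[-1]
```

With `match=0, mismatch=-1, gap=-1` the returned score is the negated edit distance, which ties this recurrence to the "Edit distance" variation listed on the previous slide.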

Page 6: Sequence Local Alignment using Directed Acyclic Word Graph

Other variations

Name | Problem | Worst time | Average time | Memory
Four Russians | Edit distance 1,0 | M*N/log(N) | <not good> | MN
Ukkonen | Global edit (linear cost) | ND | N+D^2 | D^2
Waterman | Local alignment | MN | MN | MN
Tree tree | Local alignment | M^2 N^2 | <close to M^2 N^2> |
BWTSW | Meaningful local alignment | M N^2 | M N^0.68 |

Page 7: Sequence Local Alignment using Directed Acyclic Word Graph

Local alignment

• Local alignment
– Find the best alignment between two substrings, one from each sequence
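A minimal sketch of the local alignment recurrence (the "Waterman" row of the earlier table) is given below. It assumes a linear gap cost for brevity; the function name `smith_waterman` and the default gap value are illustrative choices, not taken from the slides.

```python
def smith_waterman(a, b, match=1, mismatch=-3, gap=-5):
    """Best local alignment score of a and b, plus the cell where it ends.

    Returns (score, (i, j)): the best alignment ends at a[i-1] and b[j-1].
    A score is never allowed to drop below zero, which lets an alignment
    start at any pair of positions.
    """
    best, best_end = 0, (0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev[j - 1] + sub,   # extend diagonally
                        prev[j] + gap,       # a[i-1] against a gap
                        curr[j - 1] + gap)   # b[j-1] against a gap
            curr.append(score)
            if score > best:
                best, best_end = score, (i, j)
        prev = curr
    return best, best_end
```

For example, `smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG")` finds the shared block `ACGTACG` and ignores the unrelated flanks.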

Page 8: Sequence Local Alignment using Directed Acyclic Word Graph

BWTSW

Page 9: Sequence Local Alignment using Directed Acyclic Word Graph

• BWTSW
– Motivation
  • Scoring 75% similarity
  • Most entries of the local alignment table are zero
  • Meaningful alignment
– Suffix tree
– Meaningful alignment
– Meaningful alignment with gap
– How good is it?

Page 10: Sequence Local Alignment using Directed Acyclic Word Graph

Meaningful alignment (1)

• Sequence similarity sometimes implies functional similarity.
• Biologists are usually NOT interested in sequences with less than 70% similarity.
• BLAST score
– Match = 1
– Mismatch = -3
– Open Gap = -5
– Extending Gap = -2

Page 11: Sequence Local Alignment using Directed Acyclic Word Graph

Meaningful alignment (2)

• BLAST score
– Match = 1
– Mismatch = -3
– Open Gap = -5
– Extending Gap = -2
– More than 75% of the aligned positions must match to have a nonzero score
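The threshold implied by these scores can be worked out directly. The short derivation below assumes a gapless alignment with L aligned positions, m of which are matches; the symbols L and m are introduced here for illustration.

```latex
\begin{align*}
\text{Score} &= (+1)\,m + (-3)\,(L - m) = 4m - 3L, \\
\text{Score} > 0 &\iff m > \tfrac{3}{4}\,L .
\end{align*}
```

So a gapless alignment scores above zero only when more than 75% of its positions match, which is the 75% figure quoted on the BWTSW outline slide. Gaps cost more than mismatches, so they only raise the required fraction of matches.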

Page 12: Sequence Local Alignment using Directed Acyclic Word Graph

Meaningful alignment (3)

• BLAST score
– Match = 1
– Mismatch = -3
– Open Gap = -5
– Extending Gap = -2
• How many nonzero entries are there in the local alignment DP table?

Page 13: Sequence Local Alignment using Directed Acyclic Word Graph

How to improve?

• Idea:
– Do not store zero-score entries
– Use the suffix tree to prune early

Page 14: Sequence Local Alignment using Directed Acyclic Word Graph

BWTSW details

• FM index for suffix tree representation
• Prune zero entries
• Store DP vector using linked list
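The pruning idea can be sketched without an FM index. The code below is a simplified illustration, not the BWTSW implementation: it walks the suffix trie of the text implicitly (a node is just the list of text positions that follow its occurrences), uses a linear gap cost instead of the affine BLAST gaps, and stores each DP row as a dictionary of positive entries in place of a linked list. All names are hypothetical.

```python
def extend_row(row, c, pattern, match, mismatch, gap):
    """DP row of a trie child reached by character c; keeps positive entries only."""
    m = len(pattern)
    new_row = {}
    # Only columns next to a stored entry of the parent row can become positive.
    cands = sorted({j + d for j in row for d in (0, 1) if 1 <= j + d <= m})
    k = 0
    j = cands[0] if cands else None
    while j is not None:
        score = float("-inf")
        if j - 1 in row:        # c aligned with pattern[j-1]
            score = row[j - 1] + (match if pattern[j - 1] == c else mismatch)
        if j in row:            # c aligned with a gap
            score = max(score, row[j] + gap)
        if j - 1 in new_row:    # pattern[j-1] aligned with a gap
            score = max(score, new_row[j - 1] + gap)
        if score > 0:
            new_row[j] = score
        while k < len(cands) and cands[k] <= j:
            k += 1
        nxt = cands[k] if k < len(cands) else None
        if score + gap > 0 and j < m:
            nxt = j + 1         # a gap can carry a positive score one column on
        j = nxt
    return new_row

def pruned_trie_alignment(text, pattern, match=1, mismatch=-3, gap=-5):
    """Best local alignment score, found by a pruned walk of the suffix trie."""
    n, best = len(text), 0
    root_row = {j: 0 for j in range(len(pattern) + 1)}  # empty alignment
    stack = [(list(range(n)), root_row)]
    while stack:
        positions, row = stack.pop()
        children = {}
        for p in positions:     # group occurrences by their next character
            if p < n:
                children.setdefault(text[p], []).append(p + 1)
        for c, next_positions in children.items():
            new_row = extend_row(row, c, pattern, match, mismatch, gap)
            if new_row:         # prune subtrees with no positive entry
                best = max(best, max(new_row.values()))
                stack.append((next_positions, new_row))
    return best
```

Under a linear gap cost the result equals the ordinary local alignment score, because an optimal local alignment can always be chosen so that every prefix of it has a positive score, and such alignments are never pruned.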

Page 15: Sequence Local Alignment using Directed Acyclic Word Graph

Analysis

• Text length = N
• Pattern length = M
• Alphabet size = σ

Page 16: Sequence Local Alignment using Directed Acyclic Word Graph

Average running time (1)

• Let F(L) be the number of pairs of strings of length L for which Score(S1,S2) > 0
– F(L) = Sizeof{(S1,S2) : Len(S1)=Len(S2)=L, Score(S1,S2)>0}
– F(L) counts the number of pairs with 75% identity.
• F(L) = sum(i=0..L/4, Binomial(L,i) * (σ-1)^i)
• F(L) ≤ k1 * k2^L
• F(log(N)) ≤ k3 * N^0.68
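The counting formula can be checked numerically for small L. The sketch below reads the garbled factor as (σ-1)^i, with σ the alphabet size, and reads the sum as counting, for one fixed S1, the strings S2 that differ from it in at most L/4 positions; both readings are assumptions, and the function names are illustrative.

```python
from itertools import product
from math import comb

def count_high_identity(length, sigma=4):
    """Strings S2 within length // 4 mismatches of a fixed S1 (closed form)."""
    return sum(comb(length, i) * (sigma - 1) ** i
               for i in range(length // 4 + 1))

def brute_force(length, alphabet="ACGT"):
    """Same count by enumerating every S2; only feasible for small lengths."""
    s1 = alphabet[0] * length
    return sum(1 for s2 in product(alphabet, repeat=length)
               if sum(x != y for x, y in zip(s1, s2)) <= length // 4)

for length in (4, 8):
    print(length, count_high_identity(length), brute_force(length))
```

Dividing the count by `sigma ** length` gives the probability that a random S2 is this close to S1, which is the quantity the next slide uses.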

Page 17: Sequence Local Alignment using Directed Acyclic Word Graph

Average running time (2)

• Given S1, Pr(Score(S1,S2) > 0 | S1) = F(L)/σ^L
• For M < log(N)
– The number of entries is
– O(M * F(M)) < O(log(N) * F(log(N)))
• For M > log(N)
– O(M * N * F(M) / σ^L)
• On average
– Time = O(M * F(log(N))) = O(M * N^0.68)

Page 18: Sequence Local Alignment using Directed Acyclic Word Graph

DAWG

Page 19: Sequence Local Alignment using Directed Acyclic Word Graph

Possible improvement of BWTSW

• Worst-case running time O(N^2 * M)
– When M = N
– O(M * N^0.68 + M^3) when the pattern is a substring of the text
• What about ST vs. ST?

Page 20: Sequence Local Alignment using Directed Acyclic Word Graph

• What we used in BWTSW is a suffix trie (not a suffix tree).
– #Prove it#

• A suffix trie has O(N^2) nodes

• DAWG is a similar structure with O(N) nodes
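The quadratic size of the suffix trie is easy to observe on small inputs, because the trie has exactly one node per distinct substring (the root being the empty string). The helper below is a brute-force illustration with a hypothetical name; it is not meant for long texts.

```python
def suffix_trie_nodes(text):
    """Number of suffix-trie nodes = number of distinct substrings, empty one included."""
    n = len(text)
    return len({text[i:j] for i in range(n + 1) for j in range(i, n + 1)})

print(suffix_trie_nodes("abcbc"))            # the slides' running example
print(suffix_trie_nodes("a" * 10 + "b" * 10))  # a string of length 20
```

For a text of the form a^k b^k every a^i b^j with 0 ≤ i, j ≤ k is a distinct substring, so the trie has (k+1)^2 nodes, which grows quadratically in the text length 2k.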

Page 21: Sequence Local Alignment using Directed Acyclic Word Graph

DAWG (1)

Page 22: Sequence Local Alignment using Directed Acyclic Word Graph

DAWG (2)

• DAWG: Directed Acyclic Word Graph
• A DAWG is an acyclic automaton that recognizes all the substrings of the given string.

Page 23: Sequence Local Alignment using Directed Acyclic Word Graph

DAWG (3)

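• Example:
– DAWG of “abcbc”

[Figure: DAWG of “abcbc”. States are labeled with the substrings they represent: a; b; ab; bc, c; abc; abcb, bcb, cb; abcbc, bcbc, cbc. Edges are labeled a, b, or c.]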

Page 24: Sequence Local Alignment using Directed Acyclic Word Graph

DAWG (4)

• End-set view

[Figure: the DAWG of “abcbc” drawn twice: once with each state labeled by its end-set ({0,1,2,3,4,5}; {1}; {2}; {2,4}; {3,5}; {3}; {4}; {5}) and once with the substring labels of the previous slide.]

Page 25: Sequence Local Alignment using Directed Acyclic Word Graph

Trivial DAWG construction

• Using end-set classes

[Figure: the end-set view and the substring view of the DAWG of “abcbc”, repeated from the previous slide.]
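The trivial construction can be sketched directly from the end-set definition: compute the set of end positions of every substring, merge substrings with equal end-sets into one state, and add an edge labeled c from the class of x to the class of xc. The function name `trivial_dawg` and the returned data layout are illustrative; the approach takes time polynomial in the text length and is meant only for small examples.

```python
from collections import defaultdict

def trivial_dawg(text):
    """Build the DAWG of text from end-set classes.

    Returns (classes, edges):
      classes maps an end-set (frozenset of positions) to its substrings,
      edges maps (end-set, character) to the target end-set.
    """
    n = len(text)
    end_sets = defaultdict(set)
    for i in range(n + 1):
        for j in range(i, n + 1):
            end_sets[text[i:j]].add(j)  # this occurrence ends at position j
    classes = defaultdict(set)
    for sub, ends in end_sets.items():
        classes[frozenset(ends)].add(sub)
    edges = {}
    for sub, ends in end_sets.items():
        for c in set(text):
            if sub + c in end_sets:
                edges[(frozenset(ends), c)] = frozenset(end_sets[sub + c])
    return dict(classes), edges

classes, edges = trivial_dawg("abcbc")
for ends, subs in sorted(classes.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(ends), sorted(subs))
```

For “abcbc” this yields the end-sets shown on the slide, with the empty string in the class {0,1,2,3,4,5}.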

Page 26: Sequence Local Alignment using Directed Acyclic Word Graph

DAWG properties

• For |w|>2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states, and 3|w|-4 edges

Page 27: Sequence Local Alignment using Directed Acyclic Word Graph

D(w) and ST(w^R)

• There is a map between the nodes in the DAWG D(w) and the implicit ST(w^R)
– Example: w=abcbc, w^R=cbcba
• Store DAWG using ST, which uses only o(N) bits

[Figure: the suffix tree ST(w^R) for w^R = “cbcba” shown next to the DAWG of w = “abcbc”.]

Page 28: Sequence Local Alignment using Directed Acyclic Word Graph

D(w) and ST(w^R) (2)

• List all incoming edges of node q in D(w) using ST(w^R)

Page 29: Sequence Local Alignment using Directed Acyclic Word Graph

Local Alignment using DAWG

• Basis

• Induction
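The formulas for the basis and induction steps are not present in this transcript. The sketch below shows one plausible reading, offered as an assumption rather than as the author's recurrence: the basis sets the score of the DAWG root to 0 for every pattern position, and the induction step fills each state's row from its incoming edges in topological order. It uses the standard online suffix-automaton construction and a linear gap cost; all function names are hypothetical.

```python
def build_dawg(text):
    """Online DAWG (suffix automaton) construction; returns (edges, length)."""
    edges, link, length = [{}], [-1], [0]
    last = 0
    for c in text:
        cur = len(edges)
        edges.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and c not in edges[p]:
            edges[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = edges[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:  # split state q by cloning it
                clone = len(edges)
                edges.append(dict(edges[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and edges[p].get(c) == q:
                    edges[p][c] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return edges, length

def dawg_local_alignment(text, pattern, match=1, mismatch=-3, gap=-5):
    """Best local alignment score via a DP over (DAWG state, pattern position)."""
    edges, length = build_dawg(text)
    m = len(pattern)
    rows = [[float("-inf")] * (m + 1) for _ in edges]
    rows[0] = [0] * (m + 1)  # assumed basis: empty alignment at the root
    best = 0
    # Every edge leads to a state with a longer longest string,
    # so sorting by length gives a topological order.
    for u in sorted(range(len(edges)), key=length.__getitem__):
        row = rows[u]
        for j in range(1, m + 1):            # pattern[j-1] against a gap
            row[j] = max(row[j], row[j - 1] + gap)
        best = max(best, max(row))
        for c, v in edges[u].items():        # assumed induction: push along edges
            target = rows[v]
            for j in range(m + 1):
                target[j] = max(target[j], row[j] + gap)  # c against a gap
                if j:
                    sub = match if pattern[j - 1] == c else mismatch
                    target[j] = max(target[j], row[j - 1] + sub)
    return best
```

Each state is visited once and each edge is relaxed once per pattern position, so the work is proportional to the number of DAWG edges times M, in contrast to a walk over a suffix trie that may have quadratically many nodes.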

Page 30: Sequence Local Alignment using Directed Acyclic Word Graph

Extensions

• Meaningful alignment using DAWG
– Prune the nodes whose score is less than zero
• Shortest path pruning style
• Cache log(N) nodes: the worst-case running time is M*N*log(N); the average case is the same for M << N.