23 Jan, 2008 SOFSEM 2008 1
A New Model to Solve the Swap Matching Problem and Efficient Algorithms for Short Patterns
Costas Iliopoulos, M. Sohel Rahman
Classic Pattern Matching
Input: A string T of length n (the text) and a string P of length m (the pattern), both from alphabet Σ.
Output:
- Existence query: whether P occurs in T.
- Computation of the occurrence set: Occ = {i | P = T[i..i + m − 1]}.
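The occurrence set defined above can be computed directly from its definition. The following is a minimal sketch, not the method of the talk; the function name `occurrences` and the sample text are illustrative assumptions.

```python
def occurrences(text, pattern):
    """Return the 1-based positions i with pattern == text[i..i+m-1]."""
    n, m = len(text), len(pattern)
    occ = []
    for i in range(n - m + 1):
        # Compare the window of length m starting at 0-based index i.
        if text[i:i + m] == pattern:
            occ.append(i + 1)
    return occ

# Hypothetical sample text, not the one shown on the slides.
print(occurrences("ACGACTTGACGT", "GAC"))  # [3, 8]
```

The existence query is then simply whether the returned list is non-empty.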
Example
We have GAC at positions 3 and 12: Occ = {3, 12}.
P = GAC
Occ = {5, 14}.
Swap Matching
Figure: a text T (positions 1 to 13) with the pattern P = ACGCT (positions 1 to 5) aligned at each position where it swap-matches.
Occ = {1,5,6}
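For reference, a naive checker for swap matching can be sketched as follows. It assumes the usual definition: P swap-matches T at position i if the window T[i..i+m-1] can be obtained from P by swapping adjacent characters, with each character taking part in at most one swap. The function names and the sample strings are illustrative assumptions, and this quadratic sketch is not the algorithm of the talk.

```python
def swap_match_at(text, pattern, start):
    """True if a swapped version of pattern equals text[start:start+m] (0-based start)."""
    m = len(pattern)
    # ok[j]: pattern[j:] can be matched against the window from offset j onwards.
    ok = [False] * (m + 1)
    ok[m] = True
    for j in range(m - 1, -1, -1):
        # Option 1: pattern[j] stays in place.
        keep = pattern[j] == text[start + j] and ok[j + 1]
        # Option 2: pattern[j] and pattern[j+1] are swapped.
        swap = (j + 1 < m
                and pattern[j] == text[start + j + 1]
                and pattern[j + 1] == text[start + j]
                and ok[j + 2])
        ok[j] = keep or swap
    return ok[0]

def swap_occurrences(text, pattern):
    """1-based positions where pattern swap-matches text."""
    n, m = len(text), len(pattern)
    return [i + 1 for i in range(n - m + 1) if swap_match_at(text, pattern, i)]

# Hypothetical example: "bac" and "acb" are swapped versions of "abc".
print(swap_occurrences("bacacb", "abc"))  # [1, 4]
```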
Motivation
- Swap errors are common during typing.
- The phenomenon of swaps occurs in gene mutations and duplications.
Existing results
- 2000: Amir, Aumann, Landau, Lewenstein, Lewenstein: O(n m^(1/3) log m log σ)
- 1998: Amir, Landau, Lewenstein, Lewenstein: O(n log^2 m) (some very special cases)
- 2003: Amir, Cole, Hariharan, Lewenstein, Porat: O(n log m log σ)
Here σ = min(m, |Σ|).
All of these results use FFT.
Existing results
Some related variants have also been investigated in the literature:
- Approximate version: Amir, Lewenstein, Porat (2002)
- Weighted version: Zhang, Guo, Iliopoulos (2004)
Our Contribution
- A new graph-theoretic model.
- An algorithm running in O((m/w) n log m) time, where w is the machine word size.
- For word-size patterns: O(n log m).
- The first efficient non-FFT algorithm for swap matching.
The new Model
T-Graph
Figure: a text T of length 15 (positions 1 to 15) and its T-Graph.
P-Graph
Figure: the P-Graph of P = acbab (positions 1 to 5).
Figure: the P-Graph of P = accab (positions 1 to 5).
So…
P swap-matches T if and only if the P-Graph matches the T-Graph.
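The equivalence above can be tried out with a small sketch. It assumes a particular layout of the P-Graph, which the transcript does not preserve: each column j has up to three nodes, at level 0 (label P[j]), level -1 (label P[j-1]) and level +1 (label P[j+1]), and edges go from level 0 or -1 to level 0 or +1 of the next column, and from level +1 only to level -1. The names `p_graph`, `NEXT` and `p_graph_matches_at` are hypothetical.

```python
def p_graph(pattern):
    """Columns of the (assumed) P-Graph: column j maps level -> character label."""
    m = len(pattern)
    cols = []
    for j in range(m):
        col = {0: pattern[j]}
        if j > 0:
            col[-1] = pattern[j - 1]   # second half of a swap with column j-1
        if j + 1 < m:
            col[+1] = pattern[j + 1]   # first half of a swap with column j+1
        cols.append(col)
    return cols

# Assumed edges between consecutive columns: level -> allowed next levels.
NEXT = {-1: (0, +1), 0: (0, +1), +1: (-1,)}

def p_graph_matches_at(text, pattern, start):
    """True if some path through the P-Graph spells text[start:start+m]."""
    cols = p_graph(pattern)
    reachable = {lvl for lvl, ch in cols[0].items() if ch == text[start]}
    for j in range(1, len(cols)):
        reachable = {nxt for lvl in reachable for nxt in NEXT[lvl]
                     if cols[j].get(nxt) == text[start + j]}
    return bool(reachable)

print(p_graph_matches_at("cabab", "acbab", 0))  # True: first two characters swapped
print(p_graph_matches_at("accab", "acbab", 0))  # False: no valid path
```

Because a path that enters level +1 is forced to level -1 in the next column, every path corresponds to a set of disjoint adjacent swaps.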
An Efficient Algorithm
Degenerate strings
Let Σ = {A, C, G, T}. Then we can get 2^4 − 1 = 15 non-empty sets of letters.
At each position of a degenerate string we have one of those sets.
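The count of 15 can be checked by enumerating the non-empty subsets of the alphabet. This is a small illustrative sketch; the variable names and the sample degenerate string are assumptions.

```python
from itertools import combinations

SIGMA = "ACGT"

# All non-empty subsets of the alphabet, smallest first.
subsets = [frozenset(c)
           for k in range(1, len(SIGMA) + 1)
           for c in combinations(SIGMA, k)]
print(len(subsets))  # 15

# A degenerate string is then a sequence of such sets (hypothetical example).
degenerate = [frozenset("A"), frozenset("CG"), frozenset("ACGT")]
print(all(s in subsets for s in degenerate))  # True
```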
Degenerate strings…
The 15 non-empty sets over {A, C, G, T}:
{A,C,G,T}
{A,C,G} {A,C,T} {A,G,T} {C,G,T}
{A,C} {A,G} {A,T} {C,G} {C,T} {G,T}
{A} {C} {G} {T}
Degenerate strings…
Figure: a degenerate string X of length 7 (positions 1 to 7), with a set of letters at each position.
Degenerate strings: Equality/Match
Figure: the degenerate string X (positions 1 to 7) and a degenerate string Y of length 3.
X[3] =d Y[1]. Why?
Because X[3] ∩ Y[1] = {A}.
Y =d X[1..3]
Y =d X[3..5]
Y =d X[4..6]
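Degenerate equality as used above (two positions match when their sets intersect) can be sketched as follows. The strings `X` and `Y` below are hypothetical and are not the ones on the slide; the function names are likewise assumptions.

```python
def deg_equal(x, y):
    """x =d y: same length, and the sets at every position share a letter."""
    return len(x) == len(y) and all(a & b for a, b in zip(x, y))

def deg_occurrences(x, y):
    """1-based positions i with y =d x[i..i+m-1]."""
    m = len(y)
    return [i + 1 for i in range(len(x) - m + 1) if deg_equal(x[i:i + m], y)]

# Hypothetical degenerate strings over {A, C, G, T}.
X = [set("A"), set("CG"), set("T"), set("AC"), set("G")]
Y = [set("AT"), set("C")]
print(deg_occurrences(X, Y))  # [1, 3]
```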
P-Graph => Degenerate String
Figure: the P-Graph of P = acbab (positions 1 to 5) and the degenerate string read off its columns:
[ac][abc][abc][ab][ab]
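Reading the degenerate string off the P-Graph amounts to collecting, for each position j, the characters P[j-1], P[j] and P[j+1]. A minimal sketch, with a hypothetical function name:

```python
def degenerate_from_pattern(pattern):
    """Set at position j: the characters of pattern at positions j-1, j, j+1."""
    m = len(pattern)
    return [set(pattern[max(0, j - 1):min(m, j + 2)]) for j in range(m)]

deg = degenerate_from_pattern("acbab")
print(["".join(sorted(s)) for s in deg])  # ['ac', 'abc', 'abc', 'ab', 'ab']
```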
Swap Match vs Deg. Match
Figure: the degenerate string [ac][abc][abc][ab][ab] obtained from P, aligned against a text T of length 10.
According to degenerate matching: OK!
According to swap matching: NOT OK!
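The gap between the two notions can be reproduced on a single window. The sketch below assumes P = acbab and the window accab, chosen for illustration; the helper names are hypothetical. Every character of the window lies in the corresponding set of the degenerate string, yet no set of disjoint adjacent swaps turns P into the window.

```python
def degenerate_from_pattern(pattern):
    """Set at position j: the characters of pattern at positions j-1, j, j+1."""
    m = len(pattern)
    return [set(pattern[max(0, j - 1):min(m, j + 2)]) for j in range(m)]

def deg_match(window, pattern):
    """Degenerate match: each window character lies in the set for its position."""
    return all(c in s for c, s in zip(window, degenerate_from_pattern(pattern)))

def swap_match(window, pattern):
    """Swap match: window equals pattern after disjoint adjacent swaps."""
    m = len(pattern)
    ok = [False] * (m + 1)
    ok[m] = True
    for j in range(m - 1, -1, -1):
        keep = pattern[j] == window[j] and ok[j + 1]
        swap = (j + 1 < m and pattern[j] == window[j + 1]
                and pattern[j + 1] == window[j] and ok[j + 2])
        ok[j] = keep or swap
    return ok[0]

print(deg_match("accab", "acbab"))   # True
print(swap_match("accab", "acbab"))  # False
```

A degenerate match is therefore necessary but not sufficient for a swap match.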
Why Doesn't It Work?
Figure: the text T (positions 1 to 10), the degenerate string [ac][abc][abc][ab][ab], the P-Graph of P = acbab (positions 1 to 5), and the string accab (positions 1 to 5).
Forbidden Graph
Figure: the Forbidden Graph.
Our Algorithm
Shift-Or Algorithm
The concept of the Forbidden Graph
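As background for the first ingredient, the classic Shift-Or algorithm for exact matching can be sketched as below. This is the textbook version only, without the Forbidden Graph extension; the function name is an assumption, and Python integers stand in for machine words.

```python
def shift_or(text, pattern):
    """Classic Shift-Or: 1-based start positions of exact occurrences."""
    m = len(pattern)
    full = (1 << m) - 1
    # D[c] has bit j cleared exactly when pattern[j] == c (0 means "match").
    D = {}
    for j, c in enumerate(pattern):
        D[c] = D.get(c, full) & ~(1 << j)
    R = full          # no prefix of the pattern is matched yet
    occ = []
    for i, c in enumerate(text):
        # Shift in a 0 at bit 0, then OR with the mask of the current character.
        R = ((R << 1) | D.get(c, full)) & full
        if R & (1 << (m - 1)) == 0:
            occ.append(i - m + 2)   # window ending at 0-based index i
    return occ

print(shift_or("ACGACTTGACGT", "GAC"))  # [3, 8]
```

Each text character costs one shift and one OR per machine word of the pattern, which is where the m/w factor in the running time comes from.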
D-Mask
P = accab => [ac][ac][ac][abc][ab]
Row  Set  D_a  D_b  D_c  D_X
1    ac   0    1    0    1
2    ac   0    1    0    1
3    ac   0    1    0    1
4    abc  0    0    0    1
5    ab   0    0    1    1
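The D-Mask for P = accab can be rebuilt from the degenerate string with a few lines. This is a sketch under the Shift-Or convention that 0 means "match"; the function names are hypothetical, and the column "X" is assumed to stand for any character that does not occur in the pattern.

```python
def d_masks(pattern, alphabet):
    """D[c] has bit j cleared when c is one of pattern[j-1], pattern[j], pattern[j+1]."""
    m = len(pattern)
    full = (1 << m) - 1
    masks = {c: full for c in alphabet}
    masks["X"] = full   # assumed: "X" is a character outside the pattern
    for j in range(m):
        for c in set(pattern[max(0, j - 1):min(m, j + 2)]):
            masks[c] &= ~(1 << j)
    return masks

def rows(mask, m):
    """Bits of a mask listed for rows 1..m."""
    return [(mask >> j) & 1 for j in range(m)]

D = d_masks("accab", "abc")
for c in "abcX":
    print(c, rows(D[c], 5))
# a [0, 0, 0, 0, 0]
# b [1, 1, 1, 0, 0]
# c [0, 0, 0, 0, 1]
# X [1, 1, 1, 1, 1]
```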
F-Mask
Figure: the Forbidden Graph and the F-Mask table, with one column per character pair (a,a), (a,b), (b,b), (c,c), (c,a), (X,X) and one row per pattern position 1 to 5.
Computing R matrix
Figure: the text T (positions 1 to 15), the pattern P = accab (rows 1 to 5), and the R matrix filled in column by column.
Each new column of R is obtained from the previous column in three steps: Shift, then Or with the D-Mask of the current text character, then Or with an F-Mask entry.
Steps shown:
- Shift, Or D_a, Or F(X,a)
- Shift, Or D_c, Or F(a,c)
- Shift, Or D_a, Or F(c,a)
- Shift, Or D_b, Or F(c,b)
The remaining columns of R are filled in the same way.
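The column update described on these slides can be sketched as a single bitwise step. The sketch below assumes P = accab, uses the D-Mask values from the D-Mask slide, and includes only the three F-Mask entries that the first three steps use, with values as far as they can be read from the transcript (the F(c,a) value in particular is an assumption). It also assumes the F-Mask is indexed by the previous and the current text character, with X standing for "no previous character". All names are hypothetical.

```python
M = 5                      # pattern length for P = accab
FULL = (1 << M) - 1

def from_rows(bits):
    """Pack [row1, ..., row5] into an integer, row j in bit j-1."""
    return sum(b << j for j, b in enumerate(bits))

def to_rows(mask):
    return [(mask >> j) & 1 for j in range(M)]

# D-Mask columns for P = accab (0 means "match").
D = {"a": from_rows([0, 0, 0, 0, 0]),
     "b": from_rows([1, 1, 1, 0, 0]),
     "c": from_rows([0, 0, 0, 0, 1])}

# Only the F-Mask entries used below; values are assumptions read from the slides.
F = {("X", "a"): from_rows([0, 0, 0, 0, 0]),
     ("a", "c"): from_rows([0, 0, 0, 0, 0]),
     ("c", "a"): from_rows([0, 0, 0, 0, 1])}

def step(R, prev, cur):
    """One column update: Shift, Or D-Mask, Or F-Mask."""
    return ((R << 1) | D[cur] | F[(prev, cur)]) & FULL

R, prev, history = FULL, "X", []
for cur in "aca":          # text prefix implied by the steps D_a, D_c, D_a
    R = step(R, prev, cur)
    history.append(to_rows(R))
    prev = cur

for column in history:
    print(column)
# [0, 1, 1, 1, 1]
# [0, 0, 1, 1, 1]
# [0, 0, 0, 1, 1]
```

A 0 in row 5 of a column would report a match ending at that text position.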
Running Time
Computing D-Masks: O((m/w)(m + |Σ|))
Computing F-Masks: O((m/w) m log m)
Computing R values: O((m/w) n log m)
Total: O((m/w) n log m)
For short patterns (m ~ w): O(n log m)
Future Work
- Explore the possibilities of using graph pattern matching.
- Experimental work: a forthcoming paper contains experiments using biological examples.
The End
Thank you very much