
A New Model to Solve the Swap Matching Problem and Efficient Algorithms for Short Patterns

Costas Iliopoulos, M. Sohel Rahman

23 Jan, 2008 SOFSEM 2008

Classic Pattern Matching

Input:
A string T of length n (the text)
A string P of length m (the pattern)
Both strings are drawn from the alphabet Σ.

Output:
Whether P occurs in T (existence query)
Occ = {i | P = T[i..i + m − 1]} (computation of the occurrence set)
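The occurrence set can be computed straight from its definition. The following is a minimal sketch of that definition, not an efficient algorithm; the function name `occurrences` and the sample text are made up for illustration, and positions are 1-based as on the slides.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Return Occ = {i | P = T[i..i+m-1]} using 1-based positions."""
    n, m = len(text), len(pattern)
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


# Hypothetical sample text, not the one shown on the slides.
sample_text = "TTGACCAGTGACA"
occ = occurrences(sample_text, "GAC")   # computation of the occurrence set
exists = bool(occ)                      # existence query
```

This takes O(nm) time in the worst case, which is why the classic linear-time algorithms exist.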


Example

We have GAC at positions 3 and 12, so Occ = {3, 12}.

P = GAC

Occ = {5, 14}.

Swap Matching

P = ACGCT

[Figure: a text T with positions 1 to 13, with the pattern P (positions 1 to 5) and its swapped versions aligned under the matching positions.]

Occ = {1, 5, 6}
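A direct way to check a swap match is to scan the window left to right and, at each mismatch, require that the next two characters form a swapped pair. The sketch below assumes the usual definition (swaps involve adjacent characters and no character takes part in more than one swap); the function names and the sample text are illustrative and are not taken from the slides.

```python
def swap_matches_at(text: str, pattern: str, start: int) -> bool:
    """True if pattern swap matches text at the 0-based index start."""
    m = len(pattern)
    window = text[start:start + m]
    if len(window) < m:
        return False
    j = 0
    while j < m:
        if pattern[j] == window[j]:
            j += 1  # character is in place
        elif (j + 1 < m and pattern[j] == window[j + 1]
              and pattern[j + 1] == window[j]):
            j += 2  # adjacent pair is swapped; both positions are used up
        else:
            return False
    return True


def swap_occurrences(text: str, pattern: str) -> list[int]:
    """1-based positions where pattern swap matches text."""
    return [i + 1 for i in range(len(text) - len(pattern) + 1)
            if swap_matches_at(text, pattern, i)]


# Hypothetical text; the text drawn on the slide is not reproduced here.
occ = swap_occurrences("CAGCTACGTC", "ACGCT")
```

Taking the in-place option first is safe: if both options apply at a position, the two pattern characters are equal and the swap changes nothing. The check costs O(m) per position, so O(nm) overall.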

Motivation

Swap errors are common during typing.

The phenomenon of swaps occurs in gene mutations and duplications.

Existing results

2000: Amir, Aumann, Landau, Lewenstein, Lewenstein: O(n m^(1/3) log m log σ)
1998: Amir, Landau, Lewenstein, Lewenstein: O(n log^2 m) (some very special cases)
2003: Amir, Cole, Hariharan, Lewenstein, Porat: O(n log m log σ)

Here σ = min(m, |Σ|).

All of these results use FFT.

Existing results (continued)

Some related variants are also investigated in the literature:
Approximate version: Amir, Lewenstein, Porat (2002)
Weighted version: Zhang, Guo, Iliopoulos (2004)

Our Contribution

A new graph-theoretic model
O((m/w) n log m) time, where w is the machine word size
For word-size patterns: O(n log m)
The first efficient non-FFT algorithm for swap matching


The new Model

T-Graph

[Figure: a text T of length 15 over the letters a, b, c, with positions 1 to 15, and the corresponding T-graph.]

P-Graph

P = a c b a b

[Figure: the P-graph of P, with positions 1 to 5.]

P-Graph

P = a c c a b

[Figure: the P-graph of P, with positions 1 to 5.]

So…

P swap matches T if and only if the P-Graph swap matches the T-Graph.
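The graph view can be made concrete as a path search. The sketch below rests on an assumed reading of the P-graph, since the figures are not recoverable from the transcript: each position j has up to three vertices, labelled P[j+1] (row -1, first half of a swap), P[j] (row 0, no swap) and P[j-1] (row +1, second half of a swap), and an edge leads from a vertex to the rows that may legally follow it. All names are illustrative.

```python
def p_graph_rows(pattern):
    """Vertex labels of the assumed three-row P-graph (0-based positions).

    Row -1 at position j holds P[j+1], row 0 holds P[j], row +1 holds P[j-1];
    None marks a vertex that does not exist.
    """
    m = len(pattern)
    return {
        -1: [pattern[j + 1] if j + 1 < m else None for j in range(m)],
        0: list(pattern),
        +1: [pattern[j - 1] if j >= 1 else None for j in range(m)],
    }


# Assumed edges: the rows that may follow a given row at the next position.
NEXT_ROWS = {0: (0, -1), -1: (+1,), +1: (0, -1)}


def spells_path(window, pattern):
    """True if window labels a complete path through the P-graph of pattern."""
    m = len(pattern)
    if m == 0 or len(window) != m:
        return False
    rows = p_graph_rows(pattern)
    # A path starts in row 0 or row -1 (row +1 has no vertex at the start).
    live = {r for r in (0, -1) if rows[r][0] == window[0]}
    for j in range(1, m):
        live = {s for r in live for s in NEXT_ROWS[r]
                if rows[s][j] == window[j]}
    # A path must not stop in the middle of a swap (row -1).
    return any(r in (0, +1) for r in live)
```

Under these assumptions a window of the text swap matches P exactly when it spells such a path; for example `spells_path("cabab", "acbab")` follows rows -1, +1, 0, 0, 0.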


An Efficient Algorithm

Degenerate strings

Let Σ = {A, C, G, T}. Then we can get 2^4 − 1 = 15 non-empty sets of letters.

At each position of a degenerate string we have one of those sets.

Degenerate strings…

The 15 non-empty sets of letters over {A, C, G, T}:
{A, C, G, T}
{A, C, G}, {A, C, T}, {A, G, T}, {C, G, T}
{A, C}, {A, G}, {A, T}, {C, G}, {C, T}, {G, T}
{A}, {C}, {G}, {T}

Degenerate strings…

[Figure: a degenerate string X with positions 1 to 7, each position holding a set of letters.]

Degenerate strings: Equality/Match

[Figure: the degenerate string X (positions 1 to 7) and a degenerate string Y of length 3.]

X[3] =d Y[1]. Why? Because X[3] ∩ Y[1] = {A}.

Y =d X[1..3]
Y =d X[3..5]
Y =d X[4..6]
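The matching rule on this slide (two positions match when their letter sets intersect) translates into a few lines of code. This is a minimal sketch; the strings `X` and `Y` below are hypothetical examples, not the ones drawn on the slide, and each position is written as a string of its letters.

```python
def positions_match(x_set, y_set):
    """Two positions match (=d) when their letter sets intersect."""
    return bool(set(x_set) & set(y_set))


def deg_occurrences(x, y):
    """1-based positions i with Y =d X[i..i+|Y|-1]."""
    k = len(y)
    return [i + 1 for i in range(len(x) - k + 1)
            if all(positions_match(x[i + j], y[j]) for j in range(k))]


# Hypothetical degenerate strings over {A, C, G, T}.
X = ["T", "AC", "A", "CG", "AG", "AT"]
Y = ["AT", "C", "AG"]
occ = deg_occurrences(X, Y)
```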

P-Graph => Degenerate String

[Figure: the P-graph of P = a c b a b, positions 1 to 5.]

The corresponding degenerate string:
Position 1: {a, c}
Position 2: {a, b, c}
Position 3: {a, b, c}
Position 4: {a, b}
Position 5: {a, b}
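The degenerate string appears to collect, at each position, the pattern character at that position together with its left and right neighbours. A minimal sketch under that assumption (the function name is illustrative):

```python
def degenerate_from_pattern(pattern):
    """Set of letters allowed at each position: P[j-1], P[j] and P[j+1]."""
    m = len(pattern)
    return [set(pattern[max(0, j - 1):min(m, j + 2)]) for j in range(m)]


sets_acbab = degenerate_from_pattern("acbab")
sets_accab = degenerate_from_pattern("accab")
```

For P = acbab this gives the sets {a, c}, {a, b, c}, {a, b, c}, {a, b}, {a, b}.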

Swap Match vs Deg. Match

[Figure: the degenerate string of the pattern aligned against a text T of length 10.]

According to degenerate matching: OK!
According to swap matching: NOT OK!
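The gap between the two notions is easy to reproduce on a small case. The sketch below uses P = acbab and a hypothetical window `ccbab` (not the text from the slide): every character of the window is allowed by the degenerate string, yet no set of disjoint adjacent swaps turns P into the window. Function names are illustrative.

```python
def degenerate_sets(pattern):
    """Letters allowed at each position: P[j-1], P[j] and P[j+1]."""
    return [set(pattern[max(0, j - 1):j + 2]) for j in range(len(pattern))]


def deg_match(window, pattern):
    """Degenerate match: each window character lies in the position's set."""
    return len(window) == len(pattern) and all(
        ch in allowed for ch, allowed in zip(window, degenerate_sets(pattern)))


def swap_match(window, pattern):
    """Swap match: window equals pattern up to disjoint adjacent swaps."""
    m = len(pattern)
    if len(window) != m:
        return False
    j = 0
    while j < m:
        if pattern[j] == window[j]:
            j += 1
        elif (j + 1 < m and pattern[j] == window[j + 1]
              and pattern[j + 1] == window[j]):
            j += 2
        else:
            return False
    return True


pattern = "acbab"
window = "ccbab"  # hypothetical: position 1 takes c, but position 2 is not a
deg_ok = deg_match(window, pattern)
swap_ok = swap_match(window, pattern)
```

Here the degenerate match accepts the window and the swap match rejects it: the c at position 1 would start a swap, and position 2 would then have to hold a.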

Why Doesn't It Work?

[Figure: the text T (positions 1 to 10) shown with the degenerate string and the P-graph (positions 1 to 5).]

Forbidden Graph

[Figure: the forbidden graph.]


Our Algorithm

Shift-Or Algorithm

The concept of the Forbidden Graph
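For reference, the classic Shift-Or algorithm for exact matching can be sketched as follows. This is the textbook bit-parallel scheme, not the swap-matching variant of the talk; Python integers stand in for machine words, and the names are illustrative.

```python
def shift_or(text, pattern):
    """Classic Shift-Or exact matching; returns 1-based start positions."""
    m = len(pattern)
    if m == 0:
        return []
    full = (1 << m) - 1
    # masks[c] has bit j cleared exactly when pattern[j] == c.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, full) & ~(1 << j)
    state = full  # all ones: no prefix of the pattern is matched yet
    hits = []
    for i, ch in enumerate(text):
        # Shift brings in a 0 at bit 0; Or kills prefixes that ch breaks.
        state = ((state << 1) | masks.get(ch, full)) & full
        if state & (1 << (m - 1)) == 0:
            hits.append(i - m + 2)
    return hits
```

Bit j of `state` is 0 exactly when the first j+1 pattern characters match the text ending at the current position, so a 0 in the top bit signals an occurrence.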

D-Mask

P = a c c a b

Position  Set        D_a  D_b  D_c  D_X
1         {a, c}     0    1    0    1
2         {a, c}     0    1    0    1
3         {a, c}     0    1    0    1
4         {a, b, c}  0    0    0    1
5         {a, b}     0    0    1    1
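The D-masks on this slide can be derived mechanically from the pattern. The sketch below assumes that a 0 bit means "this character is allowed at this position" (the Shift-Or convention) and that X stands for any character that does not occur in the pattern; both readings are assumptions, and the function names are illustrative.

```python
def d_masks(pattern, alphabet):
    """D-mask per character: bit j is 0 when the character is allowed at
    position j+1 of the degenerate string (P[j-1], P[j] or P[j+1])."""
    m = len(pattern)
    full = (1 << m) - 1
    masks = {ch: full for ch in alphabet}
    for j in range(m):
        for ch in set(pattern[max(0, j - 1):j + 2]):
            masks[ch] = masks.get(ch, full) & ~(1 << j)
    return masks


def column(mask, m):
    """Bits of a mask listed from position 1 to position m."""
    return [(mask >> j) & 1 for j in range(m)]


masks = d_masks("accab", "abcX")  # "X" plays the role of any other character
```

For P = accab, `column(masks["b"], 5)` gives 1, 1, 1, 0, 0 and `column(masks["c"], 5)` gives 0, 0, 0, 0, 1.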

F-Mask

[Figure: the forbidden graph together with the F-mask table, with rows for positions 1 to 5 and columns indexed by character pairs: (a,a), (a,b), (b,b), (c,c), (c,a), (X,X).]

Computing R matrix

[Figure: step-by-step computation of the R matrix, with rows 1 to 5 for the pattern positions and one column per position of the text of length 15. Each new column is the previous column shifted, then OR-ed with the D-mask of the current text character and with an F-mask indexed by a character pair. The steps shown use D_a with F(X,a), D_c with F(a,c), D_a with F(c,a), and D_b with F(c,b); the last slide shows the completed matrix.]
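The Shift and Or steps with the D-masks alone can be sketched as below. This is only the degenerate-matching part of the update: the OR with the F-mask is deliberately left out, because its construction cannot be recovered from this transcript, so the positions returned are candidates that may include false swap matches. The function name is illustrative and Python integers stand in for machine words.

```python
def degenerate_shift_or(text, pattern):
    """1-based positions where text matches the degenerate string of pattern.

    Only the Shift and the Or with D are performed; the F-mask step of the
    slides is omitted, so the result can contain non-swap matches.
    """
    m = len(pattern)
    full = (1 << m) - 1
    masks = {}
    for j in range(m):
        for ch in set(pattern[max(0, j - 1):j + 2]):
            masks[ch] = masks.get(ch, full) & ~(1 << j)
    r = full  # current column of R, bit j for pattern position j+1
    hits = []
    for i, ch in enumerate(text):
        r = ((r << 1) | masks.get(ch, full)) & full  # Shift, then Or with D
        if m and not r & (1 << (m - 1)):
            hits.append(i - m + 2)
    return hits
```

With P = accab, the text `cacab` is reported at position 1 and is a genuine swap match, while `ccccb` is also reported at position 1 although it is not one; ruling out such cases is the job of the forbidden graph.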

Running Time

Computing D-masks: O((m/w)(m + |Σ|))
Computing F-masks: O((m/w) m log m)
Computing R values: O((m/w) n log m)

Total: O((m/w) n log m)
Short patterns (m ~ w): O(n log m)
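The total follows by adding the three phases. A short derivation, assuming m ≤ n and |Σ| ≤ n (assumptions not stated on the slide), with w taken to be the machine word size:

```latex
\begin{align*}
O\!\left(\tfrac{m}{w}\,(m + |\Sigma|)\right)
  + O\!\left(\tfrac{m}{w}\, m \log m\right)
  + O\!\left(\tfrac{m}{w}\, n \log m\right)
  &= O\!\left(\tfrac{m}{w}\, n \log m\right)
  && \text{since } m \le n \text{ and } |\Sigma| \le n, \\
O\!\left(\tfrac{m}{w}\, n \log m\right)
  &= O(n \log m)
  && \text{when } m \sim w, \text{ so that } \tfrac{m}{w} = O(1).
\end{align*}
```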

Future Works

Explore the possibilities of using graph pattern matching
Experimental works

The forthcoming paper contains experimental works using biological examples.


The End

Thank you very much
