1
Average-Optimal Multiple Approximate String Matching
Kimmo Fredriksson, Gonzalo Navarro
ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, Pages 1-47
Professor R.C.T. Lee
Speaker: K.W. Liu
2
The Problem
• The approximate string matching problem:
Given a text T[1...n] and a pattern P[1...m] over a finite alphabet Σ of size σ, find the approximate occurrences of P in T, allowing at most k differences (insertions, deletions, substitutions).
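As a point of reference, the edit distance ED used on the following slides can be computed with the classic dynamic-programming recurrence. The sketch below is only illustrative; the function name `edit_distance` is our own choice and does not come from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] = distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

An approximate occurrence of P is then any substring S of T with `edit_distance(S, P) <= k`.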
3
For a window of size m-k, if there exists a substring S1 in this window such that its edit distance to every substring of P is greater than k, we can move P.
Our algorithm scans the window from the right, as shown below:
[Fig. 3: T and P aligned, with the window of size m - k and the substring S1 marked in T]
4
For a window of size m-k, if there exists a suffix S1 of the window such that its edit distance to every substring of P is greater than k, we move P to S1.
Our algorithm scans the window from the right, as shown below:
[Fig. 3: T and P aligned, with the window of size m - k and the suffix S1 marked in T]
5
But how do we know that ED(S1, S2) > k for every substring S2 of P?
We use a very useful lemma.
6
Lemma
Consider strings Q and P. Let Q be divided into q1, q2, …, qn as shown below:
Q: qn … q2 q1
For each qi, let pi be the substring of P such that ED(qi, pi) is the smallest among all substrings of P.
If $\sum_{i=1}^{n} \mathrm{ED}(q_i, p_i) > k$, then $\mathrm{ED}(Q, P) > k$.
7
Proof: Divide P into n pieces as shown below:
Q: qn … q2 q1
P: p'n … p'2 p'1
We assume that $\sum_{i=1}^{n} \mathrm{ED}(q_i, p_i) > k$.
Therefore, $\sum_{i=1}^{n} \mathrm{ED}(q_i, p'_i) > k$, because $\mathrm{ED}(q_i, p_i)$ is the smallest.
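The step from the pieces back to the whole strings can be spelled out as a chain of inequalities. This is a sketch under one assumption that the slide leaves implicit: the pieces p'1, …, p'n are the parts of P aligned with q1, …, qn in an optimal alignment of Q with P, so the cost of that alignment splits across the pieces.

```latex
\mathrm{ED}(Q, P)
  \;\ge\; \sum_{i=1}^{n} \mathrm{ED}(q_i, p'_i)
  \;\ge\; \sum_{i=1}^{n} \mathrm{ED}(q_i, p_i)
  \;>\; k
```

The first inequality holds because the part of the alignment covering q_i costs at least ED(q_i, p'_i); the second holds because p_i is the closest substring of P to q_i.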
8
To determine whether ED(S1, S2) > k, we may use the lemma.
We divide the window into small pieces: t1, t2, …,ta.
For each ti, we find the substring pi in P where ED(pi,ti) is the smallest.
[Fig. 7: window W in T divided from the right into pieces t1, t2, …; each piece ti is paired with its closest substring pi of P]
9
As soon as $\sum \mathrm{ED}(t_i, p_i) > k$, according to Lemma 1, we know that $\mathrm{ED}(S_1, S_2) > k$.
Now we can move P.
If, at the end of the window, $\sum \mathrm{ED}(t_i, p_i) \le k$, we cannot make any conclusion. We have to use dynamic programming to find whether $\mathrm{ED}(W, P) \le k$.
10
In general, to find such a pi, we may use dynamic programming [Sellers 1980].
However, we may use a special kind of small piece.
It is customary to call a small piece of size L an L-gram.
Let us use 2-grams.
11
Note that for two strings of length 2, the edit distance between them is equal to the Hamming distance between them.
Thus, we may use 2-grams in our algorithm.
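The claim about length-2 strings can be checked by brute force. The sketch below assumes the DNA alphabet {a, c, g, t} used in the later examples; the helper names `edit_distance` and `hamming` are our own.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


alphabet = "acgt"  # assumed alphabet
grams = ["".join(g) for g in product(alphabet, repeat=2)]

# Compare both distances on every pair of 2-grams.
all_equal = all(edit_distance(x, y) == hamming(x, y) for x in grams for y in grams)
print(all_equal)
```

The equality does not carry over to longer strings: for "abc" and "bca" the Hamming distance is 3, but deleting the leading "a" and appending it again gives an edit distance of 2.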
12
Our algorithm
• Make a table D that stores, for each possible 2-gram over the finite alphabet, the smallest edit distance between that 2-gram and all substrings of the pattern P.
• This is done in the preprocessing stage.
13
Example T = ctagggaataatttacaatt P = ttaatatat k = 1
c t a g g g a a t a a t t t a c a a t t
← m-k →
Smallest edit distance between “aa” and all substrings of P = 0
Smallest edit distance between “gg” and all substrings of P = 2
∴ Σ = 0 + 2 = 2 > k
14
Example T = ctagggaataatttacaatt P = ttaatatat k = 1
c t a g g g a a t a a t t t a c a a t t
← m-k →
Smallest edit distance between “tt” and all substrings of P = 0
Smallest edit distance between “aa” and all substrings of P = 0
Smallest edit distance between “at” and all substrings of P = 0
Smallest edit distance between “ga” and all substrings of P = 1
∴ Σ = 0 + 0 + 0 + 1 = 1 = k
[Figure: T shown twice, once with a region of size m+2k marked and once with the window of size m-k marked; the window positions are labelled i and i+1]
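The two windows of this example can be replayed in a few lines. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the per-gram distances are computed on the fly instead of being read from a precomputed table, and the window offsets (0 and 5, zero-based) are read off the 2-grams listed on the slides.

```python
def min_dist_to_substrings(gram: str, pattern: str) -> int:
    """Smallest edit distance between gram and any substring of pattern."""
    col = list(range(len(gram) + 1))  # column for the empty prefix of pattern
    best = col[-1]                    # distance to the empty substring
    for pc in pattern:
        new = [0]                     # a substring of pattern may start anywhere
        for i, gc in enumerate(gram, start=1):
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + (gc != pc)))
        col = new
        best = min(best, col[-1])
    return best


def scan_window(window: str, pattern: str, k: int, l: int = 2):
    """Read l-grams from the right end of window and sum their distances.

    Returns (total, start), where start is the offset of the last l-gram read.
    total > k means the window cannot contain an occurrence.
    """
    total, start = 0, len(window)
    while start - l >= 0:
        start -= l
        total += min_dist_to_substrings(window[start:start + l], pattern)
        if total > k:
            break
    return total, start


T = "ctagggaataatttacaatt"
P = "ttaatatat"
k = 1
w = len(P) - k                         # window size m - k = 8

first = scan_window(T[0:w], P, k)      # window "ctagggaa"
second = scan_window(T[5:5 + w], P, k) # window "gaataatt"
print(first, second)
```

In the first window the sum exceeds k at the 2-gram "gg" (offset 4), and the second window starts one position later, at offset 5. In the second window the sum never exceeds k, so the filter alone cannot rule out an occurrence.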
15
Example T = ctagggaataatttacaatt P = ttaatatat k = 1
c t a g g g a a t a a t t t a c a a t t
← m-k →
← m+k → (window i+1)
We use dynamic programming to find the edit distance between “gaataattta” and P.
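The verification step can be sketched with the usual dynamic-programming scheme for approximate matching, in which a match may start at any position of the text region. The function name `best_match_distance` is hypothetical, and the code illustrates the idea rather than the implementation used in the paper.

```python
def best_match_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    col = list(range(len(pattern) + 1))  # column before reading any text
    best = col[-1]
    for tc in text:
        new = [0]  # free start: an occurrence may begin at any text position
        for i, pc in enumerate(pattern, start=1):
            new.append(min(
                col[i] + 1,               # skip the text character
                new[i - 1] + 1,           # skip the pattern character
                col[i - 1] + (pc != tc),  # match or substitute
            ))
        col = new
        best = min(best, col[-1])
    return best


P = "ttaatatat"
region = "gaataattta"  # the m + k characters verified on this slide
k = 1

has_occurrence = best_match_distance(P, region) <= k
print(has_occurrence)
```

For this region no substring is within k = 1 differences of P, so the check reports no occurrence.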
16
Example T = ctagggaataatttacaatt P = ttaatatat k = 1
c t a g g g a a t a a t t t a c a a t t
← m-k →
c t a g g g a a t a a t t t a c a a t t
← m-k →
17
In the preprocessing stage, we build a table D that records, for each possible l-gram (a string of length l over the alphabet), the smallest edit distance between that l-gram and all substrings of P.
18
D table: example (step by step)
For example, P = aacaccgaa. Initially every entry is 2:
2-gram  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp       2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
For P = a a c a c c g a a a vs “aa”
19
For P = a a c a c c g a a a vs “ac”
2-gram  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp       0  0  2  2  2  2  2  2  2  2  2  2  2  2  2  2
20
For P = a a c a c c g a a a vs “ag”
For P = a a c a c c g a a a vs “at”
For P = a a c a c c g a a a vs “ca”
For P = a a c a c c g a a a vs “cc”
2-gram  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp       0  0  1  1  0  0  2  2  2  2  2  2  2  2  2  2
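The whole table can be filled in the same way for every 2-gram. The sketch below assumes the alphabet {a, c, g, t} and takes P = aacaccgaa as written at the start of this example; `build_d_table` and `min_dist_to_substrings` are hypothetical names.

```python
from itertools import product


def min_dist_to_substrings(gram: str, pattern: str) -> int:
    """Smallest edit distance between gram and any substring of pattern."""
    col = list(range(len(gram) + 1))
    best = col[-1]  # distance to the empty substring
    for pc in pattern:
        new = [0]   # a substring of pattern may start anywhere
        for i, gc in enumerate(gram, start=1):
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + (gc != pc)))
        col = new
        best = min(best, col[-1])
    return best


def build_d_table(pattern: str, alphabet: str, l: int = 2) -> dict:
    """Map every l-gram over alphabet to its smallest distance to a substring of pattern."""
    grams = ("".join(g) for g in product(alphabet, repeat=l))
    return {g: min_dist_to_substrings(g, pattern) for g in grams}


D = build_d_table("aacaccgaa", "acgt")
print(" ".join(D))                           # 2-grams in the order used on the slides
print("  ".join(str(v) for v in D.values())) # the corresponding D values
```

The first six entries (aa through cc) come out as 0 0 1 1 0 0, matching the table above. During the search, each 2-gram read from the window then costs a single table lookup.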
21
Time complexity
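The average complexity of the algorithm is
$O\left(\frac{n\,(k + \log_\sigma (rm))}{m}\right)$ for $\alpha < \frac{1}{2} - O\left(\frac{1}{\sqrt{\sigma}}\right)$,
where r is the number of patterns and α = k/m is the difference ratio.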
22
The end
23
[BYN2000] New models and algorithms for multidimensional approximate pattern matching. BAEZA-YATES, R. AND NAVARRO, G. 2000. Journal of Discrete Algorithms 1, 1, 21–49. Special issue on Matching Patterns.
[BYN2002] New and faster filters for multiple approximate string matching. BAEZA-YATES, R. AND NAVARRO, G. 2002. Random Structures and Algorithms 20, 23–49.
[BYR99] Modern Information Retrieval. BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Addison-Wesley, Reading, MA.
[BYN99] Faster approximate string matching. BAEZA-YATES, R. A. AND NAVARRO, G. 1999. Algorithmica 23, 2, 127–158.
[CL94] Sublinear approximate string matching and biological applications. CHANG, W. AND LAWLER, E. 1994. Algorithmica 12, 4/5, 327–344.
[CM94] Approximate string matching and local similarity. CHANG, W. AND MARR, T. 1994. In Proceedings of 5th Combinatorial Pattern Matching (CPM’94). LNCS, vol. 807. Springer-Verlag, Berlin, 259–273.
[CCGJLPR94] Speeding up two string matching algorithms. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Algorithmica 12, 4/5, 247–267.
24
[CR94] Text Algorithms. CROCHEMORE, M. AND RYTTER, W. 1994. Oxford University Press, Oxford, UK.
[DM79] Automatic Speech and Speaker Recognition. DIXON, R. AND MARTIN, T., Eds. 1979. IEEE Press.
[EL90] A review of segmentation and contextual analysis techniques for text recognition. ELLIMAN, D. AND LANCASTER, I. 1990. Pattern Recogn. 23, 3/4, 337–346.
[F2003] Row-wise tiling for the Myers’ bit-parallel approximate string matching algorithm. FREDRIKSSON, K. 2003. In Proceedings of 10th Symposium on String Processing and Information Retrieval (SPIRE’03). LNCS, vol. 2857. Springer-Verlag, Berlin, 66–79.
[FN2003] Average-optimal multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2003. In Proceedings of 14th Combinatorial Pattern Matching (CPM’03). LNCS, vol. 2676. 109–128.
[FN2004] Improved single and multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2004. In Proceedings of 15th Combinatorial Pattern Matching (CPM’04). LNCS, vol. 3109. Springer-Verlag, Berlin, 457–471.
25
[GL89] Simple and efficient string matching with k mismatches. GROSSI, R. AND LUCCIO, F. 1989. Information Processing Letters 33, 3, 113–120.
Practical fast searching in strings. HORSPOOL, R. 1980. Software Practice and Experience 10, 501–506.
[HFN2004] Increased bit-parallelism for approximate string matching. HYYRÖ, H., FREDRIKSSON, K., AND NAVARRO, G. 2004. In Proceedings of 3rd Workshop on Efficient and Experimental Algorithms (WEA’04). LNCS, vol. 3059. Springer-Verlag, Berlin, 285–298.
[HN2002] Faster bit-parallel approximate string matching. HYYRÖ, H. AND NAVARRO, G. 2002. In Proceedings of 13th Combinatorial Pattern Matching (CPM’02). LNCS, vol. 2373. Springer-Verlag, Berlin, 203–224. Extended version to appear in Algorithmica.
[JTU96] A comparison of approximate string matching algorithms. JOKINEN, P., TARHIO, J., AND UKKONEN, E. 1996. Software Practice and Experience 26, 12, 1439–1458.
[K92] Techniques for automatically correcting words in text. KUKICH, K. 1992. ACM Computing Surveys 24, 4, 377–439.
[KS94] A pattern-matching model for intrusion detection. KUMAR, S. AND SPAFFORD, E. 1994. In Proceedings of National Computer Security Conference. 11–21.
[LT94] On the searchability of electronic ink. LOPRESTI, D. AND TOMKINS, A. 1994. In Proceedings of 4th International Workshop on Frontiers in Handwriting Recognition. 156–165.
26
[MM96] Approximate multiple string search. MUTH, R. AND MANBER, U. 1996. In Proceedings of 7th Combinatorial Pattern Matching (CPM’96). LNCS, vol. 1075. Springer-Verlag, Berlin, 75–86.
[M99] A fast bit-vector algorithm for approximate string matching based on dynamic programming. MYERS, E.W. 1999. J. ACM 46, 3, 395–415.
[N2001] A guided tour to approximate string matching. NAVARRO, G. 2001. ACM Computing Surveys 33, 1, 31–88.
[NB99] Very fast and simple approximate string matching. NAVARRO, G. AND BAEZA-YATES, R. 1999. Inf. Process. Lett. 72, 65–70.
[NB2001] Improving an algorithm for approximate pattern matching. NAVARRO, G. AND BAEZA-YATES, R. 2001. Algorithmica 30, 4, 473–502.
[NF2004] Average complexity of exact and approximate multiple string matching. NAVARRO, G. AND FREDRIKSSON, K. 2004. Theor. Comput. Sci. 321, 2-3, 283–290.
[NR2000] Fast and flexible string matching by combining bitparallelism and suffix automata. NAVARRO, G. AND RAFFINOT, M. 2000. ACM J. Exp. Algorithmics 5, 4.
[NR2002] Flexible Pattern Matching in Strings—Practical on-line Search Algorithms for Texts and Biological Sequences. NAVARRO, G. AND RAFFINOT, M. 2002. Cambridge University Press, Cambridge, UK.
[NSTT2000] Indexing text with approximate q-grams. NAVARRO, G., SUTINEN, E., TANNINEN, J., AND TARHIO, J. 2000. In Proceedings of 11th Combinatorial Pattern Matching (CPM’00). LNCS, vol. 1848. Springer-Verlag, Berlin, 350–363.
27
[PS80] Decision trees and random access machines. PAUL, W. AND SIMON, J. 1980. In Proceedings of International Symposium on Logic and Algorithmic (Zurich). 331–340.
[SK83] Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. SANKOFF, D. AND KRUSKAL, J., Eds. 1983. Addison-Wesley, Reading, MA.
[S80] The theory and computation of evolutionary distances: Pattern recognition. SELLERS, P. 1980. J. Algorithms 1, 359–373.
[ST96] Filtration with q-samples in approximate string matching. SUTINEN, E. AND TARHIO, J. 1996. In Proceedings of 7th Combinatorial Pattern Matching. LNCS, vol. 1075. Springer-Verlag, Berlin, 50–63.
[TU93] Approximate Boyer–Moore string matching. TARHIO, J. AND UKKONEN, E. 1993. SIAM J. Comput. 22, 2, 243–260.
[U85] Finding approximate patterns in strings. UKKONEN, E. 1985. J. Algorithms 6, 132–137.
[W95] Introduction to Computational Biology. WATERMAN, M. 1995. Chapman and Hall, London.
[Y79] The complexity of pattern matching for a random string. YAO, A. C. 1979. SIAM J. Comput. 8, 3, 368–387.