1 backward nondeterministic dawg matching algorithm speaker: l. c. chen advisor: prof. r. c. t. lee...
![Page 1: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/1.jpg)
1
Backward Nondeterministic DAWG Matching Algorithm
Speaker: L. C. Chen
Advisor: Prof. R. C. T. Lee
A Bit-parallel Approach to Suffix Automata: Fast Extended String Matching,
Navarro, G. and Raffinot, M., Lecture Notes in Computer Science, Vol. 1448, 1998, pp. 14-33
![Page 2: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/2.jpg)
2
Problem Definition:
Input : A text T and a pattern P.
Output : All the locations where P matches T.
![Page 3: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/3.jpg)
3
This algorithm uses Rule 1, the Suffix to Prefix Rule:
For a window to have any chance of matching the pattern, some suffix of the window must be equal to a prefix of the pattern.
T
P
![Page 4: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/4.jpg)
4
Find the longest suffix U of the window which is equal to some prefix of P. Shift the pattern so that this prefix is aligned with U:
T
P
U
![Page 5: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/5.jpg)
5
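Example
T = GCA TCGACAGAC TATACAGTACG (the window is “TCGACAGAC”)
P = GACGGATCA
∵ The longest suffix of the window which is equal to a prefix of P is “GAC” (length 3) and P has length 9, slide the window by 9 - 3 = 6.
T = GCATCGACAGACTATACAGTACG
P = GACGGATCA (now aligned so that its prefix “GAC” lies under the “GAC” that ended the old window)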
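The shift in this example can be checked with a short sketch. It is written in Python, which the slides do not use, and it is a direct, non-optimized check rather than the algorithm presented later; the function name `longest_suffix_prefix` and the variable names are illustrative assumptions.

```python
def longest_suffix_prefix(window, pattern):
    """Length of the longest suffix of `window` that equals a prefix of `pattern`."""
    for length in range(min(len(window), len(pattern)), 0, -1):
        if window.endswith(pattern[:length]):
            return length
    return 0

text = "GCATCGACAGACTATACAGTACG"
pattern = "GACGGATCA"
start = 3                                   # the window begins after "GCA"
window = text[start:start + len(pattern)]   # "TCGACAGAC"
u = longest_suffix_prefix(window, pattern)  # length of "GAC"
shift = len(pattern) - u
print(window, u, shift)                     # TCGACAGAC 3 6
```

If the whole window equals the pattern, the function returns the pattern length; that case is an occurrence rather than a shift of 0, and a full search has to handle it separately.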
![Page 6: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/6.jpg)
6
We give an example to introduce how this algorithm finds the longest suffix of the window which is equal to a prefix of P.
![Page 7: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/7.jpg)
7
Text : ABDDCCDBADEGGGGJJ
Pattern : BADADCEAD
Example:
We want to find the longest suffix of “BDDCCDBAD” which is also a prefix of the pattern.
![Page 8: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/8.jpg)
8
Text : ABDDCCDBADEGGGGJJ
Pattern : BADADCEAD
Example:
First, we read “D”.
![Page 9: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/9.jpg)
9
Text : ABDDCCDBADEGGGGJJ
Pattern : BADADCEAD
Example:
We find all occurrences of “D” in the pattern.
![Page 10: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/10.jpg)
10
Text : ABDDCCDBADEGGGGJJ
Pattern : BADADCEAD
Example:
We read the next character “A”.
We check whether the character to the left of each occurrence of “D” is “A” or not.
![Page 11: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/11.jpg)
11
Text : ABDDCCDBADEGGGGJJ
Pattern : BADADCEAD
Example:
Thus, we find all occurrences of “AD” in the pattern.
![Page 12: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/12.jpg)
12
Text : ABDDCCDBADEGGGGJJ
Pattern : BADADCEAD
Example:
We read the next character “B”.
We check whether the character to the left of each occurrence of “AD” is “B” or not.
![Page 13: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/13.jpg)
13
Text : ABDDCCDBADEGGGGJJ
Pattern : BADADCEAD
Example:
We find that the substring “BAD” is in the pattern. Note that “BAD” is also a prefix of P.
![Page 14: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/14.jpg)
14
Text : ABDDCCDBADEGGGGJJ
Pattern : BADADCEAD
Example:
We read the next character “D”.
We cannot find a character “D” to the left of the substring “BAD”. We report that “BAD” is the longest suffix of “BDDCCDBAD” which is equal to a prefix of P.
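The steps above can be sketched as follows. This is an illustrative Python version, not code from the slides: it keeps the set of start positions at which the suffix read so far occurs in the pattern, and the names `backward_scan` and `starts` are assumptions.

```python
def backward_scan(window, pattern):
    """Read `window` right to left; return its longest suffix that is a prefix of `pattern`."""
    # starts: 0-based positions in `pattern` where the suffix read so far begins.
    # The empty suffix "begins" everywhere, including just past the end.
    starts = set(range(len(pattern) + 1))
    best = 0
    for k, c in enumerate(reversed(window), start=1):
        # Extend every occurrence one character to the left, if that character is c.
        starts = {i - 1 for i in starts if i > 0 and pattern[i - 1] == c}
        if not starts:
            break              # the suffix read so far is not a substring of the pattern
        if 0 in starts:
            best = k           # the suffix of length k is also a prefix of the pattern
    return window[len(window) - best:]

print(backward_scan("BDDCCDBAD", "BADADCEAD"))  # BAD
```

The scan stops as soon as no occurrence can be extended, which is the situation reached after reading the fourth character “D” in this example.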
![Page 15: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/15.jpg)
15
Text : ABDDCCDDADEGGGGJJ
Pattern : ACDADCEAD
Another example:
We want to find the longest suffix of “BDDCCDDAD” which is also a substring of the pattern.
![Page 16: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/16.jpg)
16
Text : ABDDCCDDADEGGGGJJ
Pattern : ACDADCEAD
Another example:
First, we find all occurrences of “D” in the pattern.
![Page 17: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/17.jpg)
17
Text : ABDDCCDDADEGGGGJJ
Pattern : ACDADCEAD
Another example:
Then we look for all occurrences of “AD” in the pattern. The “D” which is preceded by “C” is a mismatch.
![Page 18: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/18.jpg)
18
Text : ABDDCCDDADEGGGGJJ
Pattern : ACDADCEAD
Another example:
Then we find all occurrences of “AD” in the pattern.
![Page 19: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/19.jpg)
19
Text : ABDDCCDDADEGGGGJJ
Pattern : ACDADCEAD
Another example:
We look for all occurrences of “DAD” in the pattern. The “AD” which is preceded by “E” is a mismatch.
![Page 20: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/20.jpg)
20
Text : ABDDCCDDADEGGGGJJ
Pattern : ACDADCEAD
Another example:
We find all occurrences of “DAD” in the pattern.
![Page 21: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/21.jpg)
21
Text : ABDDCCDDADEGGGGJJ
Pattern : ACDADCEAD
Another example:
We look for all occurrences of “DDAD” in the pattern. The only “DAD” is preceded by “C”, so this is a mismatch.
![Page 22: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/22.jpg)
22
Text : ABDDCCDDADEGGGGJJ
Pattern : ACDADCEAD
Another example:
We look for all occurrences of “DDAD” in the pattern. There is no substring “DDAD” in the pattern.
No suffix of “BDDCCDDAD” is equal to a prefix of P.
![Page 23: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/23.jpg)
23
The idea explained above is the main idea of this algorithm. Next, we use the bit-parallel method to implement it.
![Page 24: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/24.jpg)
24
We use bits to store the positions of a character in P.
Example:
P: CABBCAD
For character “A”, we store A: 0100010
For character “B”, we store B: 0011000
For character “C”, we store C: 1000100
For character “D”, we store D: 0000001
For the characters that do not exist in P, we store *: 0000000
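A minimal sketch of how such a table could be built, assuming Python integers as bit vectors with the leftmost bit standing for the first character of P (the convention used on the slide). The function name `build_masks` is illustrative and not from the source.

```python
def build_masks(pattern):
    """Map each character of `pattern` to a bit mask of its positions."""
    m = len(pattern)
    masks = {}
    for j, c in enumerate(pattern):
        # Position j (0-based, from the left) is stored in bit m - 1 - j.
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - j))
    return masks

masks = build_masks("CABBCAD")
for c in "ABCD":
    print(c, format(masks[c], "07b"))
# A 0100010
# B 0011000
# C 1000100
# D 0000001
```

A character that does not occur in P has no entry; `masks.get(c, 0)` then yields the all-zero mask shown for “*”.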
![Page 25: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/25.jpg)
25
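Text: ABCABCABA, ∑={A,B,C,D}
Pattern: CABBCAD
A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000
Here, we explain how to use bit-parallelism to find the substrings of the pattern which are equal to a suffix of the window. The window is the first 7 characters of the text, “ABCABCA”, and it is read from right to left.
We use a mask D to record where the suffix of the window read so far occurs in the pattern. Initially, all bits of D are 1.
D: 1111111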
![Page 26: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/26.jpg)
26
Text: ABCABCABA, ∑={A,B,C,D}
Pattern: CABBCAD
A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000
We read “A”, the last character of the window.
D: 1111111
A: 0100010
D AND A: 0100010
D = 0100010 << 1 = 1000100
<<1: left shift by one bit.
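As a small illustration, this update can be reproduced with integer bit operations. This is a sketch in Python, which the slides do not use; the names `full`, `d` and `mask_a` are illustrative, and the result is truncated to 7 bits on the assumption that bits shifted past the left end are dropped, as in the later steps of the example.

```python
full = 0b1111111         # 7 bits, one per pattern position
d = 0b1111111            # initial mask D
mask_a = 0b0100010       # table entry for "A"

# AND with the character mask, then shift left by one bit.
d = ((d & mask_a) << 1) & full
print(format(d, "07b"))  # 1000100
```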
![Page 27: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/27.jpg)
27
Text: ABCABCABA, ∑={A,B,C,D}
Pattern: CABBCAD
A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000
D: 1000100
C: 1000100
D AND C: 1000100
The leftmost bit is 1, so we know “CA” is a suffix of the window which is equal to a prefix of the pattern.
D = 1000100 << 1 = 0001000
![Page 28: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/28.jpg)
28
Text: ABCABCABA, ∑={A,B,C,D}
Pattern: CABBCAD
A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000
D: 0001000
B: 0011000
D AND B: 0001000
We know “BCA” is a substring of the pattern.
D = 0001000 << 1 = 0010000
![Page 29: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/29.jpg)
29
Text: ABCABCABA, ∑={A,B,C,D}
Pattern: CABBCAD
A: 0100010 B: 0011000 C: 1000100 D: 0000001 other: 0000000
D: 0010000
A: 0100010
D AND A: 0000000
There is no substring “ABCA” in the pattern.
![Page 30: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/30.jpg)
30
Text: ABCABCABA, ∑={A,B,C,D}
Pattern: CABBCAD
“BCA” is the longest suffix of the window which is a substring of the pattern, and its suffix “CA” is a prefix of the pattern.
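The walk-through on the preceding slides can be sketched as one backward pass over the window. This is an illustrative Python version under the same bit convention (leftmost bit = first pattern character); the function name `scan_window` and the returned trace are assumptions added for the illustration.

```python
def scan_window(window, pattern):
    """Return (length of longest suffix of window that is a prefix of pattern, trace of D AND mask)."""
    m = len(pattern)
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - j))
    full = (1 << m) - 1      # m ones
    high = 1 << (m - 1)      # leftmost bit: an occurrence starting at the first pattern position
    d = full
    best = 0
    trace = []
    for k, c in enumerate(reversed(window), start=1):
        d &= masks.get(c, 0)
        trace.append(format(d, "0{}b".format(m)))
        if d == 0:
            break            # the suffix read so far does not occur in the pattern
        if d & high:
            best = k         # the suffix of length k is a prefix of the pattern
        d = (d << 1) & full
    return best, trace

print(scan_window("ABCABCA", "CABBCAD"))
# (2, ['0100010', '1000100', '0001000', '0000000'])
```

The trace lists the result of each AND in the order the characters are read (A, C, B, A), and the returned length 2 corresponds to “CA”.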
![Page 31: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/31.jpg)
31
We take another example:
Text: ABCABCCBA, ∑={A,B,C,D}
Pattern: ACBCCBD
![Page 32: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/32.jpg)
32
Example:
Text: ABCABCCBA, ∑={A,B,C,D}
Pattern: ACBCCBD
First, we build:
A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000
![Page 33: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/33.jpg)
33
Example:
Text: ABCABCCBA, ∑={A,B,C,D}
Pattern: ACBCCBD
A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000
D: 1111111
![Page 34: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/34.jpg)
34
Example:
Text: ABCABCCBA, ∑={A,B,C,D}
Pattern: ACBCCBD
A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000
D: 1111111
![Page 35: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/35.jpg)
35
Example:
Text: ABCABCCBA, ∑={A,B,C,D}
Pattern: ACBCCBD
A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000
D: 1111111
C: 0101100
D AND C: 0101100
Where there is a “1”, there is a substring “C” in the pattern.
We set D = 0101100 << 1 = 1011000
![Page 36: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/36.jpg)
36
Example:
Text: ABCABCCBA, ∑={A,B,C,D}
Pattern: ACBCCBD
A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000
D: 1011000
C: 0101100
D AND C: 0001000
Where there is a “1”, there is a substring “CC” in the pattern.
We set D = 0001000 << 1 = 0010000
![Page 37: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/37.jpg)
37
Example:
Text: ABCABCCBA, ∑={A,B,C,D}
Pattern: ACBCCBD
A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000
D: 0010000
B: 0010010
D AND B: 0010000
Where there is a “1”, there is a substring “BCC” in the pattern.
We set D = 0010000 << 1 = 0100000
![Page 38: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/38.jpg)
38
Example:
Text: ABCABCCBA, ∑={A,B,C,D}
Pattern: ACBCCBD
A: 1000000 B: 0010010 C: 0101100 D: 0000001 others: 0000000
D: 0100000
A: 1000000
D AND A: 0000000
There is no substring “ABCC” in the pattern.
The leftmost bit was never 1 after an AND, so no suffix of the window is equal to a prefix of the pattern.
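Putting the pieces together, a complete search could look like the sketch below. It combines the backward bit-parallel scan with the shift rule from the earlier slides (shift by the pattern length minus the longest proper suffix of the window that is a prefix of P). This is an illustrative Python version and not code from the slides or the paper; the name `bndm_search` is an assumption. Python integers are unbounded, so the usual restriction that the pattern fits in one machine word does not show up here.

```python
def bndm_search(text, pattern):
    """Return the 0-based start positions of all occurrences of `pattern` in `text`."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - j))
    full = (1 << m) - 1
    high = 1 << (m - 1)
    hits = []
    pos = 0
    while pos <= n - m:
        d = full
        shift = m                      # default: no suffix of the window is a prefix of P
        for k in range(1, m + 1):      # k = number of window characters read so far
            d &= masks.get(text[pos + m - k], 0)
            if d == 0:
                break
            if d & high:
                if k == m:
                    hits.append(pos)   # the whole window equals the pattern
                else:
                    shift = m - k      # align the prefix of length k with this suffix
            d = (d << 1) & full
        pos += shift
    return hits

print(bndm_search("ABCABCCBA", "ACBCCBD"))             # []
print(bndm_search("GCATCGACAGACTATACAGTACG", "ACAG"))  # [6, 15]
```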
![Page 39: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/39.jpg)
39
Time Complexity:
If the length of the text is n and the length of the pattern is m,
the time complexity of this algorithm is O(mn) in the worst case.
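To make the worst case concrete: when the text and the pattern consist of one repeated character, every window is read completely and the shift is always 1, so about m(n - m + 1) characters are inspected. The sketch below counts inspections under the shift rule from the earlier slides; `count_inspections` is a hypothetical helper written for this illustration, in Python.

```python
def count_inspections(text, pattern):
    """Count the text characters read by the backward bit-parallel scan."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - j))
    full, high = (1 << m) - 1, 1 << (m - 1)
    inspected, pos = 0, 0
    while pos <= n - m:
        d, shift = full, m
        for k in range(1, m + 1):
            inspected += 1
            d &= masks.get(text[pos + m - k], 0)
            if d == 0:
                break
            if (d & high) and k < m:
                shift = m - k
            d = (d << 1) & full
        pos += shift
    return inspected

print(count_inspections("A" * 20, "A" * 5))  # 80, i.e. (20 - 5 + 1) * 5
```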
![Page 40: 1 Backward Nondeterministic DAWG Matching Algorithm Speaker: L. C. Chen Advisor: Prof. R. C. T. Lee A Bit-parallel Approach to Suffix Automata: Fast Extended](https://reader036.vdocuments.site/reader036/viewer/2022062407/56649d695503460f94a47ac8/html5/thumbnails/40.jpg)
40
Reference
• [BG92] A new approach to text searching, Baeza-Yates, R. and Gonnet, G. H., Communications of the ACM, Vol. 35, 1992, pp. 74-82.
• [BEH89] Average sizes of suffix trees and DAWGs, Blumer, A., Ehrenfeucht, A. and Haussler, D., Discrete Applied Mathematics, Vol. 24, 1989, pp. 37-45.
• [BM77] A fast string searching algorithm, Boyer, R. S. and Moore, J. S., Communications of the ACM, Vol. 20, 1977, pp. 762-772.
• [GM98] A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching, Navarro, G. and Raffinot, M., In Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 1448, Springer-Verlag, Berlin, 1998, pp. 14-33.