finding characteristic substrings from compressed texts
DESCRIPTION
Finding Characteristic Substrings from Compressed Texts. Shunsuke Inenaga Kyushu University, Japan Hideo Bannai Kyushu University, Japan. Text Mining and Text Compression. Text mining is a task of finding some rule and/or knowledge from given textual data. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/1.jpg)
Finding Characteristic Substringsfrom Compressed Texts
Shunsuke InenagaKyushu University, Japan
Hideo BannaiKyushu University, Japan
![Page 2: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/2.jpg)
Text Mining and Text Compression
Text mining is a task of finding some rule and/or knowledge from given textual data.
Text compression is to reduce a space to store given textual data by removing redundancy.
compress
decompress
![Page 3: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/3.jpg)
Our Contribution
We present efficient algorithms to find characteristic substrings (patterns) from given compressed strings directly (i.e., without decompression).
Longest repeating substring (LRS)Longest non-overlapping repeating substring (LNRS)Most frequent substring (MFS)Most frequent non-overlapping substring (MFNS)Left and right contexts of given pattern
![Page 4: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/4.jpg)
X1 = aX2 = bX3 = X1X2 X4 = X3X1
X5 = X3X4
X6 = X5X5
X7 = X4X6
X8 = X7X5 T =
SLP T
Text Compression by Straight Line Program
SLP T is a CFG in the Chomsky normal form which generates language {T}.
![Page 5: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/5.jpg)
X1 = aX2 = bX3 = X1X2 X4 = X3X1
X5 = X3X4
X6 = X5X5
X7 = X4X6
X8 = X7X5 T =
SLP T
Text Compression by Straight Line Program
Encodings of the LZ-family, run-length, Sequitur, etc. can quickly be transformed into SLP.
![Page 6: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/6.jpg)
Exponential Compression by SLP
Highly repetitive texts can be exponentially large w.r.t. the corresponding SLP-compressed texts.Text T = ababab…ab (T is an N repetition of ab)
SLP T : X1 = a, X2 = b, X3 = X1X2, X4 = X3X3, X5 = X4X4, ... , Xn = Xn-1Xn-1
N = O(2n)Any algorithms that decompress given SLP-compressed texts can take exponential time!We present efficient (i.e., polynomial-time) algorithms without decompression.
![Page 7: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/7.jpg)
Finding Longest Repeating Substring
Input: SLP T which generates text TOutput: A longest repeating substring (LRS) of T
T≠
≠
T = aabaabcabaabb
Example
![Page 8: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/8.jpg)
Key Observation – 6 Cases of Occurrences of LRS
Xi
Xl Xr
Xi
Xl Xr
Xi
Xl Xr
Xi
Xl Xr
Xi
Xl Xr
Case 1
Xi
Xl Xr
Case 2 Case 3
Case 6Case 5Case 4
![Page 9: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/9.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 10: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/10.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 1
![Page 11: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/11.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 1
![Page 12: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/12.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 13: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/13.jpg)
Xi
Xl Xr
Case 2
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 14: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/14.jpg)
Xi
Xl Xr
Case 2
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 15: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/15.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
![Page 16: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/16.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 3
LRS of Xi of Case 3 is
the longest common substring of Xl and
Xr.
![Page 17: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/17.jpg)
Longest Common Substring of Two SLPs
Theorem 1 [Matsubara et al. 2009]
For every pair of variables Xl and Xr, we can compute a longest common substring of Xl and Xr in total of O(n4logn) time.
n is num. of variables in SLP T
![Page 18: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/18.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 4
![Page 19: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/19.jpg)
Xi
Case 4
Xl XrXj
![Page 20: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/20.jpg)
Xj
Xi
Case 4-1
Xl Xr
Xj
XtXs
![Page 21: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/21.jpg)
Case 4-1
Xl
Xt
Overlap of Xl and
Xt
![Page 22: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/22.jpg)
Xj
Xi
Case 4-1
Xl Xr
Xj
XtXs
Overlap of Xl and
Xt
Expand overlap
![Page 23: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/23.jpg)
Xj
Xi
Case 4-2
Xl Xr
Xj
XtXs
Overlap of Xr and
Xs
Expand overlap
![Page 24: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/24.jpg)
Set of Overlaps
X
Y
Set of length of overlaps of X and Y
![Page 25: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/25.jpg)
a a b a a b aa b a a b a b b
a b a a b a b b
a b a a b a b b
Set of Overlaps
OL(aabaaba, abaababb) = {1, 3, 6}
X
Y
Y
Y
![Page 26: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/26.jpg)
Set of Overlaps
Lemma 1 [Kaprinski et al. 1997]
For every pair of variables Xi and Yj,OL(Xi, Yj) forms O(n) arithmetic progressions.
Lemma 2 [Kaprinski et al. 1997]
For every pair of variables Xi and Yj,OL(Xi, Yj) can be computed in total of O(n4logn)
time.
n is num. of variables in SLP T
![Page 27: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/27.jpg)
Case 4
Lemma 3
For every variable Xi, a longest repeating substring
in Case 4 can be computed in O(n3logn) time.[Sketch of proof]•We can expand all elements of each arithmetic progression of OL(Xi, Xj) in O(nlogn) time.•The size of OL(Xi, Xj) is O(n) by Lemma 1.•There are at most n-1 descendants Xj of Xi.
![Page 28: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/28.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Xi
Xl Xr
Case 5
Symmetric to Case 4
![Page 29: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/29.jpg)
Algorithm to Compute LRS
Input: SLP TOutput: LRS of text Tforeach variable Xi of SLP T do
compute LRS of Case 1;compute LRS of Case 2;compute LRS of Case 3;compute LRS of Case 4;compute LRS of Case 5;compute LRS of Case 6;
return two positions and the length of the “longest” LRS above;
Similarly to Case 4
Xl Xr
Case 6
Xi
![Page 30: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/30.jpg)
Finding Longest Repeating Substring
Theorem 2
For any SLP T which generates text T, we can compute an LRS of T in O(n4logn) time.
n is num. of variables in SLP T
![Page 31: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/31.jpg)
Finding Longest Non-Overlapping Repeating Substring
Input: SLP T which generates text TOutput: A longest non-overlapping repeating substring (LNRS) of T
T = ababababab
Example
LRS of T is ababababLRNS of T is abab
![Page 32: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/32.jpg)
Finding Longest Non-Overlapping Repeating Substring
Theorem 3
For any SLP T which generates text T, we can compute an LNRS of T in O(n6logn) time.
n is num. of variables in SLP T
![Page 33: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/33.jpg)
Finding Most Frequent Substring
Input: SLP T which generates text TOutput: A most frequent substring (MFS) of T
The solution is always the empty
string e.
![Page 34: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/34.jpg)
Finding Most Frequent Substring
Input: SLP T which generates text TOutput: A most frequent substring (MFS) of T of length 2
T
![Page 35: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/35.jpg)
Algorithm to Compute MFS
Input: SLP TOutput: MFS of text Tforeach substring P of T of length 2 do
construct an SLP P which generates substring P;compute num. of occurrences of P in T;
return substring of maximum num. of occurrences;
|S|2 substrings of length 2
Y3
Y1 Y2
a b
Lemma 4For every pair of variables Xi and Yj, the number of
occurrences of Yj in Xi can be computed in total of O(n2) time.
![Page 36: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/36.jpg)
Finding Most Frequent Substring
Theorem 4
For any SLP T which generates text T, we can compute an MFS of T of length 2 in O(|S|2n2) time.
n is num. of variables in SLP T
![Page 37: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/37.jpg)
T = aaaaababab
Example
Finding Most Frequent Non-Overlapping Substring
Input: SLP T which generates text TOutput: A most frequent non-overlapping substring (MFNS) of T of length 2
MFS of T of length 2 is aaMFNS of T of length 2 is ab
![Page 38: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/38.jpg)
Finding Most Frequent Non-Overlapping Substring
Theorem 5
For any SLP T which generates text T, we can compute an MFNS of T of length 2 in O(n4logn) time.
n is num. of variables in SLP T
![Page 39: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/39.jpg)
T = bbaabaabbaabbExample
Computing Left and Right Contexts of Given Pattern
Input: Two SLPs T and P which generate text T and pattern P, respectivelyOutput: Substring aPb of T such that
a (resp. b) always precedes (resp. follows) P in Ta and b are as long as possible
P = aba = bab = e
![Page 40: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/40.jpg)
Computing Left and Right Contexts of Given Pattern
Examples of applications of computing left and right contexts of patterns are:
Blog spam detection [Narisawa et al. 2007]Compute maximal extension of most frequent substrings (MFS)
T
![Page 41: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/41.jpg)
Boundary Lemma [1/2]
Lemma 5 [Miyazaki et al. 1997]For any SLP variables X = XlXr and Y, the
occurrences of Y that touch or cover the boundary of X form a single arithmetic progression.
X
Xl Xr
YY
Y
![Page 42: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/42.jpg)
u u v
u u v
u u v
Boundary Lemma [2/2]
Lemma 5 [Miyazaki et al. 1997](Cont.) If the number of elements in the
progression is more than 2, then the step of the progression is the smallest period of Y.
X
Xl Xr
YY
Y
![Page 43: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/43.jpg)
a
a
a
u u v
u u v
u u v
Left and Right Contexts
X
Xl Xr
Yb
b
b
The left context a of Y in X is a suffix of u.
The right context b of Y in X is a prefix of uv[|v|: |uv|].
![Page 44: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/44.jpg)
Computing Left and Right Contexts of Given Pattern
Theorem 6For any SLPs T and P which generate
text T and pattern P, respectively, we can compute the left and right contexts of P in T in O(n4logn) time.
n is num. of variables in SLP T
![Page 45: Finding Characteristic Substrings from Compressed Texts](https://reader035.vdocuments.site/reader035/viewer/2022062302/568165cb550346895dd8d491/html5/thumbnails/45.jpg)
Conclusions and Future Work
We presented polynomial time algorithms to find characteristic substrings of given SLP-compressed texts.
Our algorithms are more efficient than any algorithms that work on uncompressed strings.
Would it be possible to efficiently find other types of substrings from SLP-compressed texts?
Squares (substrings of form xx)Cubes (substrings of form xxx)Runs (maximal substrings of form xk with k ≥ 2)Gapped palindromes (substrings of form xyxR with |y| ≥ 1)