bundled suffix trees - units.it
TRANSCRIPT
Introduction Bundled Suffix Trees An application
BUNDLED SUFFIX TREES
Luca Bortolussi1 Francesco Fabris2 Alberto Policriti1
1Department of Mathematics and Computer ScienceUniversity of Udine
2Department of Mathematics and Computer ScienceUniversity of Trieste
IFIP TCS 2006, Santiago, Chile, 23rd–24th August 2006
Introduction Bundled Suffix Trees An application
Outline
1 IntroductionSuffix Trees
2 Bundled Suffix TreesEncoding Approximate InformationDefinitionSize and Construction
3 An applicationComputing Surprise MeasuresSummary
Introduction Bundled Suffix Trees An application
Suffix Trees
bcabbabc#
Gusfield D., Algorithms on strings, trees andsequences, Cambridge University Press, 1997.
E. Ukkonen. On-line construction of suffix-trees.Algorithmica, 14:249-260, 1995.
A Suffix Tree is a data structurerevealing the internal structureof a string.They occupy O(n) space andcan be built in O(n) time.
They are efficient for:
Exact String Matching
Longest Exact CommonSubstring Problem
Identifying ExactlyRepeated Patterns
Introduction Bundled Suffix Trees An application
Limitations of Suffix Trees
bcabbabc#
Gusfield D., Algorithms on strings, trees andsequences, Cambridge University Press, 1997.
Landau G.M., Vishkin U., Efficient StringMatching with k Mismatches, TheoreticalComputer Science, 43, 239-249, 1986.
Suffix Trees cannot dealnaturally with approximatestring matching problems.(Hamming or Edit distance)
Two difficult problems:
Longest CommonApproximate SubstringProblem
Extraction of approximatelyrepeated patterns
Introduction Bundled Suffix Trees An application
Extending Suffix Trees
THE TARGETExtending Suffix Trees in order to solve in a simple way someclasses of approximate string matching problems.
Bundled Suffix TreesBundled Suffix Trees extend suffix Trees.
They incorporate approximate information;They can be used like Suffix Trees for:
Longest Common Approximate Substring ProblemExtraction of approximately repeated patterns
Introduction Bundled Suffix Trees An application
Approximate Matching
Character matching is a relation among letters(in fact, it is the equality relation)
We model approximate matching as a non-transitive relationamong letters:
two strings “match” if all their letters are in relation.
Introduction Bundled Suffix Trees An application
Approximate Matching
Character matching is a relation among letters(in fact, it is the equality relation)
We model approximate matching as a non-transitive relationamong letters:
two strings “match” if all their letters are in relation.
Introduction Bundled Suffix Trees An application
Non-Transitive Relation: An Example
Modeling a relation based on Hamming Distance
Start from a basic alphabet (e.g. binary: A = {0, 1})Construct an alphabet composed of macrocharacters(e.g. A = {00, 01, 10, 11})Two letters x , y ∈ A are in relation if and only ifdH(x , y) ≤ D (e.g. D = 1).
The Relation Graph
00 ↔ 01l l
10 ↔ 11
Relation is non-transitive
It encapsulates a(restricted) form ofdistance.
Introduction Bundled Suffix Trees An application
Bundled Suffix Tree: An Example
bcabbabca ↔ b ↔ c We start from the suffix
tree for the string.
Let’s compare suffix 3 andsuffix 1:
b c a b b a b cm m m m m 6ma b b a c c
After bcabb in the tree, weput a red node with label 3.
Due to symmetry, there isalso a red node with label1 after abbab .
Introduction Bundled Suffix Trees An application
Bundled Suffix Tree: An Example
bcabbabca ↔ b ↔ c We start from the suffix
tree for the string.
Let’s compare suffix 3 andsuffix 1:
b c a b b a b cm m m m m 6ma b b a c c
After bcabb in the tree, weput a red node with label 3.
Due to symmetry, there isalso a red node with label1 after abbab .
Introduction Bundled Suffix Trees An application
Bundled Suffix Tree: An Example
bcabbabca ↔ b ↔ c We start from the suffix
tree for the string.
Let’s compare suffix 3 andsuffix 1:
b c a b b a b cm m m m m 6ma b b a c c
After bcabb in the tree, weput a red node with label 3.
Due to symmetry, there isalso a red node with label1 after abbab .
Introduction Bundled Suffix Trees An application
Bundled Suffix Tree: An Example
bcabbabca ↔ b ↔ c We start from the suffix
tree for the string.
Let’s compare suffix 3 andsuffix 1:
b c a b b a b cm m m m m 6ma b b a c c
After bcabb in the tree, weput a red node with label 3.
Due to symmetry, there isalso a red node with label1 after abbab .
Introduction Bundled Suffix Trees An application
Bundled Suffix Tree: An Example
bcabbabc ;a ↔ b ↔ c
If we do this process for everycouple of suffixes, we build aBundled Suffix Tree!
Note that this data structure isin the middle between a suffixtree and a suffix trie.
Introduction Bundled Suffix Trees An application
Bundled Suffix Tree: An Example
bcabbabc ;a ↔ b ↔ c
Bundled Suffix Trees can beused to:
solve the LongestCommon ApproximateSubstring Problem withrespect to a given relation(just find the lowest rednode).
extract information aboutapproximately repeatedpatterns.
Introduction Bundled Suffix Trees An application
How Big?
The number of red nodesinserted depends on:
the relation
the structure of the text.
In the worst case, the numberof red nodes is quadratic in thelength of the text S. Example
On average, the number of red nodes is limited by
m1+δ, δ = log1/p+ C.
( m is the length of the text, p+ is the normalized frequency of themost common letter in S, C depends on the relation)
1 + δ is slightly greater than one! Example
Introduction Bundled Suffix Trees An application
How Fast?
Naive AlgorithmThe naive algorithm for building a BuST tries to “match”every suffix of the textalong every branch of the suffix tree,until a “mismatch” is found.
It can be quadratic in the worst case.
An analysis based on the average shape of a suffix treeshows that its average complexity is bounded by m1+δ′
(δ′
just slightly greater that δ).
W. Szpankowski. A Generalized Suffix Tree and its (Un)expected Asymptotic Behaviors. SIAM J. Comput.22(6): 1176-1198 (1993)
P. Jacquet, B. McVey, W. Szpankowski. Compact Suffix Trees Resemble PATRICIA Tries: LimitingDistribution of Depth, Journal of the Iranian Statistical Society, 3, 139-148, 2004.
Introduction Bundled Suffix Trees An application
Faster
Efficient AlgorithmWe found an “McCreight-like” algorithm that is linear in the sizeof the output.
IntuitionsIt processes the suffixes backwards.
It is based on the concept of inverse suffix links. Show Details
It identifies the red nodes for suffix i by processing the rednodes for suffix i + 1. Show Details
Introduction Bundled Suffix Trees An application
Experimental Results
We have implemented the naive algorithm for theconstruction of BuST.
We have tested it with relations induced by hammingdistance, defined over DNA-macrocharacters.
With macrocharacters of size 4 (X ↔ Y ⇔ dH(X , Y ) ≤ 1)the algorithm can process texts of length 100K in fewseconds.
The number of red nodes grows tamely. Show Details
Introduction Bundled Suffix Trees An application
Measures of surprise: exact case
z-score
δ(α) =f (α)− E(α)
N(α)
f (α) is the observed frequency of α
E(α) is the expected frequency of α
N(α) is a normalization factor (e.g. the variance or itsfirst-order approximation).
MonotonicityIf f (α) = f (αβ) then δ(α) ≤ δ(αβ).
δ needs to be computed only for maximal strings at a fixedfrequency. These are exactly the strings ending at nodesof the Suffix Tree.
Introduction Bundled Suffix Trees An application
Computing the z-score
bcabbabc# Using a Suffix Tree, we cancompute and store thez-score for all “interesting”substrings of a given text inlinear time and space(given that we cancompute E and N in lineartime and space).
A. Apostolico, M.E. Block, S. Lonardi. Monotonyof surprise and the large-scale quest for unusualwords. Journal of Computational Biology, 7(3-4),2003.
Introduction Bundled Suffix Trees An application
Measures of Surprise in the Approximate World
bcabbabc ;a ↔ b ↔ c
Let’s consider asoccurrences of β in α allthe substrings β′ that arein relation with β.
Reasoning as in the exactcase, we can use a BuSTto compute the z-score forall interesting substrings ofα in time and spaceproportional to the BuST’ssize.
Introduction Bundled Suffix Trees An application
Measures of Surprise in the Approximate World
If we use an Hamming-like relation built on macrocharacters,we are counting all the occurrences of a string with distancebounded by a threshold proportional to the string’s length.
Pros and ConsPros :
the algorithm runs in time proportional to the number ofmaximal substrings (w.r.t. δ).BuST provides a compact way to store and retrieve thisinformation.
Cons :the macrocharacters introduce rigidity (we can countcompute the z-score only for strings of length multiple of themacrocharacter’s size).the distance must be distributed evenly amongmacrocharacters.
Introduction Bundled Suffix Trees An application
Conclusions
We have introduced Bundled Suffix Trees, a new datastructure extending suffix trees.
Given a relation among characters encoding some sort ofapproximate information, a BuST reveals the innerstructure of the strings w.r.t. this relation (all thisinformation is internal w.r.t. the processed string).
BuST can be used for all the problems related to the innerstructure of the string, like computation of approximatedfrequency.
The structure is based on a very general concept ofnon-transitive relation among characters. The use ofHamming-like relation on tuples is just a possible example.
Its size is slightly more than linear on average, and there’sa fast (McCreight-like) algorithm to build it.
Dimension of BuST Efficient Algorithm
Quadratic BuST
Let’s consider the text
a . . . a︸ ︷︷ ︸m
c . . . c︸ ︷︷ ︸m
b . . . b︸ ︷︷ ︸2m
,
over {a, b, c, d}, with
a ↔ bl ld ↔ c
The number of nodessurrounded by the red boxis quadratic in m!
Return
Dimension of BuST Efficient Algorithm
Quadratic BuST
Let’s consider the text
a . . . a︸ ︷︷ ︸m
c . . . c︸ ︷︷ ︸m
b . . . b︸ ︷︷ ︸2m
,
over {a, b, c, d}, with
a ↔ bl ld ↔ c
The number of nodessurrounded by the red boxis quadratic in m!
Return
Dimension of BuST Efficient Algorithm
Quadratic BuST
Let’s consider the text
a . . . a︸ ︷︷ ︸m
c . . . c︸ ︷︷ ︸m
b . . . b︸ ︷︷ ︸2m
,
over {a, b, c, d}, with
a ↔ bl ld ↔ c
The number of nodessurrounded by the red boxis quadratic in m!
Return
Dimension of BuST Efficient Algorithm
The exponent δ
Return
Dimension of BuST Efficient Algorithm
The exponent δ
Return
Dimension of BuST Efficient Algorithm
Test
Number of macrocharacters of length 4 over DNA alphabet.Test strings are generated according to a uniform p.d.
Return
Dimension of BuST Efficient Algorithm
Inverse Suffix Links
A crucial role in the fastconstruction of suffix trees isplayed by suffix links.
Suffix links are pointers fromnodes with path label xα tonodes with path label α.
Whenever there is a node withpath label xα, there’s also a nodewith path label α.
Return
Dimension of BuST Efficient Algorithm
Inverse Suffix Links
Inverse suffix links are pointersfrom nodes with path label α topositions in the tree labeled xα,for each x in the alphabet suchthat xα is a substring of S.
They can point in the middle ofan arc.
If a ISL takes from α to xα, it islabeled with x .
Return
Dimension of BuST Efficient Algorithm
The Algorithm
Red nodes for suffix S[i] can becomputed from red nodes forsuffix S[i + 1], using InverseSuffix Links.
Suppose a red node for suffixS[i + 1] is just under a “black”node with path label α.
From this node, we can cross allinverse suffix links labeled withcharacters in relation with S(i).
With a skip and count trick, wecan identify the positions of rednodes for S[i].
Return
Dimension of BuST Efficient Algorithm
The Algorithm
Red nodes for suffix S[i] can becomputed from red nodes forsuffix S[i + 1], using InverseSuffix Links.
Suppose a red node for suffixS[i + 1] is just under a “black”node with path label α.
From this node, we can cross allinverse suffix links labeled withcharacters in relation with S(i).
With a skip and count trick, wecan identify the positions of rednodes for S[i].
Return
Dimension of BuST Efficient Algorithm
The Algorithm
Red nodes for suffix S[i] can becomputed from red nodes forsuffix S[i + 1], using InverseSuffix Links.
Suppose a red node for suffixS[i + 1] is just under a “black”node with path label α.
From this node, we can cross allinverse suffix links labeled withcharacters in relation with S(i).
With a skip and count trick, wecan identify the positions of rednodes for S[i].
Return