sparse suffix sorting€¦ · karp-rabin fingerprints (rolling hash) [karp and rabin, `87] • q is...

Post on 10-Jul-2020

7 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Sparse Suffix Sorting

Tomohiro I

!  Sorting suffixes of a string T is a fundamental task having a lot of applications such as text indexing.

Suffix sorting

!  Sorting suffixes of a string T is a fundamental task having a lot of applications such as text indexing.

Suffix sorting

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Suffix sorting

lionlyingonlyonlyon$ ionlyingonlyonlyon$ onlyingonlyonlyon$ nlyingonlyonlyon$ lyingonlyonlyon$ yingonlyonlyon$ ingonlyonlyon$ ngonlyonlyon$ gonlyonlyon$ onlyonlyon$ nlyonlyon$ lyonlyon$ yonlyon$ onlyon$ nlyon$ lyon$ yon$ on$ n$ $

all suffixes of T

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

$ gonlyonlyon$ ingonlyonlyon$ ionlyingonlyonlyon$ lionlyingonlyonlyon$ lyingonlyonlyon$ lyon$ lyonlyon$ n$ ngonlyonlyon$ nlyingonlyonlyon$ nlyon$ nlyonlyon$ on$ onlyingonlyonlyon$ onlyon$ onlyonlyon$ yingonlyonlyon$ yon$ yonlyon$

20 9 7 2 1 5

16 12 19 8 4

15 11 18 3

14 10 6

17 13

sort

Suffix sorting

lionlyingonlyonlyon$ ionlyingonlyonlyon$ onlyingonlyonlyon$ nlyingonlyonlyon$ lyingonlyonlyon$ yingonlyonlyon$ ingonlyonlyon$ ngonlyonlyon$ gonlyonlyon$ onlyonlyon$ nlyonlyon$ lyonlyon$ yonlyon$ onlyon$ nlyon$ lyon$ yon$ on$ n$ $

$ gonlyonlyon$ ingonlyonlyon$ ionlyingonlyonlyon$ lionlyingonlyonlyon$ lyingonlyonlyon$ lyon$ lyonlyon$ n$ ngonlyonlyon$ nlyingonlyonlyon$ nlyon$ nlyonlyon$ on$ onlyingonlyonlyon$ onlyon$ onlyonlyon$ yingonlyonlyon$ yon$ yonlyon$

20 9 7 2 1 5

16 12 19 8 4

15 11 18 3

14 10 6

17 13

all suffixes of T

sort

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Suffix Array of T

Suffix sorting

lionlyingonlyonlyon$ ionlyingonlyonlyon$ onlyingonlyonlyon$ nlyingonlyonlyon$ lyingonlyonlyon$ yingonlyonlyon$ ingonlyonlyon$ ngonlyonlyon$ gonlyonlyon$ onlyonlyon$ nlyonlyon$ lyonlyon$ yonlyon$ onlyon$ nlyon$ lyon$ yon$ on$ n$ $

$ gonlyonlyon$ ingonlyonlyon$ ionlyingonlyonlyon$ lionlyingonlyonlyon$ lyingonlyonlyon$ lyon$ lyonlyon$ n$ ngonlyonlyon$ nlyingonlyonlyon$ nlyon$ nlyonlyon$ on$ onlyingonlyonlyon$ onlyon$ onlyonlyon$ yingonlyonlyon$ yon$ yonlyon$

20 9 7 2 1 5

16 12 19 8 4

15 11 18 3

14 10 6

17 13

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Suffix Array of T

lyon

Suffix sorting

lionlyingonlyonlyon$ ionlyingonlyonlyon$ onlyingonlyonlyon$ nlyingonlyonlyon$ lyingonlyonlyon$ yingonlyonlyon$ ingonlyonlyon$ ngonlyonlyon$ gonlyonlyon$ onlyonlyon$ nlyonlyon$ lyonlyon$ yonlyon$ onlyon$ nlyon$ lyon$ yon$ on$ n$ $

$ gonlyonlyon$ ingonlyonlyon$ ionlyingonlyonlyon$ lionlyingonlyonlyon$ lyingonlyonlyon$ lyon$ lyonlyon$ n$ ngonlyonlyon$ nlyingonlyonlyon$ nlyon$ nlyonlyon$ on$ onlyingonlyonlyon$ onlyon$ onlyonlyon$ yingonlyonlyon$ yon$ yonlyon$

20 9 7 2 1 5

16 12 19 8 4

15 11 18 3

14 10 6

17 13

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Suffix Array of T

lyon

Suffix sorting

lionlyingonlyonlyon$ ionlyingonlyonlyon$ onlyingonlyonlyon$ nlyingonlyonlyon$ lyingonlyonlyon$ yingonlyonlyon$ ingonlyonlyon$ ngonlyonlyon$ gonlyonlyon$ onlyonlyon$ nlyonlyon$ lyonlyon$ yonlyon$ onlyon$ nlyon$ lyon$ yon$ on$ n$ $

$ gonlyonlyon$ ingonlyonlyon$ ionlyingonlyonlyon$ lionlyingonlyonlyon$ lyingonlyonlyon$ lyon$ lyonlyon$ n$ ngonlyonlyon$ nlyingonlyonlyon$ nlyon$ nlyonlyon$ on$ onlyingonlyonlyon$ onlyon$ onlyonlyon$ yingonlyonlyon$ yon$ yonlyon$

20 9 7 2 1 5

16 12 19 8 4

15 11 18 3

14 10 6

17 13

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Suffix Array of T

I found lyon in 3 steps!

Sparse suffix sorting

If we are interested in some (small) set of suffixes of T.

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

lionlyingonlyonlyon$ ionlyingonlyonlyon$ onlyingonlyonlyon$ nlyingonlyonlyon$ lyingonlyonlyon$ yingonlyonlyon$ ingonlyonlyon$ ngonlyonlyon$ gonlyonlyon$ onlyonlyon$ nlyonlyon$ lyonlyon$ yonlyon$ onlyon$ nlyon$ lyon$ yon$ on$ n$ $

Sparse suffix sorting

T = l i o n l y i n g o n l y o n l y o n $ lionlyingonlyonlyon$ ionlyingonlyonlyon$ onlyingonlyonlyon$ nlyingonlyonlyon$ lyingonlyonlyon$ yingonlyonlyon$ ingonlyonlyon$ ngonlyonlyon$ gonlyonlyon$ onlyonlyon$ nlyonlyon$ lyonlyon$ yonlyon$ onlyon$ nlyon$ lyon$ yon$ on$ n$ $

lionlyingonlyonlyon$ lyonlyon$ on$ onlyon$ onlyonlyon$

1 12 18 14 10

Sparse Suffix Array (SSA)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

For sparse indexing, the sparse suffix array is enough, which is much smaller than full suffix array.

sort

If we are interested in some (small) set of suffixes of T.

!   It can be solved in O(n) time by computing full suffix array and make it sparse, but it needs O(n) space in addition to the text.

!  An alternative way is to use any sorting algorithm for a set of strings (not for suffixes) that uses O(b) additional space, but it takes Ω(bn) time.

Problem

Given T of length n and set of b positions, sort the designated suffixes.

Using O(s) additional space for some s ∈ [b, n], how fast can we sort sparse suffixes?

Question

Monte-Carlo Algorithm

Randomized algorithms that may return a wrong output with low probability.

We can win the gamble with high probability!!

!  Monte-Carlo algorithm using Karp-Rabin fingerprints. ! Time-space tradeoffs: for any s ∈ [b, n]

O(n+(bn/s)log s)) time using O(s) additional space. ! When s = b, the time complexity is O(n log b). ! When s = b log b, the time complexity is O(n).

Monte-Carlo Algorithm

Randomized algorithms that may return a wrong output with low probability.

We can win the gamble with high probability!!

Karp-Rabin fingerprints (rolling hash) [Karp and Rabin, `87]

•  q is a prime number. •  r is chosen from [0, q-1] uniformly at random.

For sufficiently large q, the hash function becomes perfect with high probability, i.e., T[i..i+l] = T[i’..i’+l] iff FP[i..i+l] = FP[i’..i’+l].

T i j

FP[i.. j]= T[k]⋅ r j−kk=i

j

∑ modq

Karp-Rabin fingerprints (rolling hash) [Karp and Rabin, `87]

i j

FP[i.. j +1]= FP[i.. j]⋅ r +T[ j +1]modq

FP[k +1.. j]= FP[i.. j]−FP[i..k]⋅ r j−kmodq

i j k+1

i j

i j+1

T

We can compute FPs incrementally.

We can “subtract” prefix FP’s influence.

T 1 n

We can preprocess T in O(n) time and O(s) space so that the fingerprint for any substring of length l can be computed in O(min{l, n/s}) time.

Lemma

FP-2

FP-4

n/s 2n/s 3n/s 4n/s

O(s)

FP-1

FP-3

FP-3

FP-1

FP-2

FP-4

T 1 n n/s 2n/s 3n/s 4n/s

We can preprocess T in O(n) time and O(s) space so that the fingerprint for any substring of length l can be computed in O(min{l, n/s}) time.

Lemma

O(s)

FP-short

FP-long

FP-1

FP-3

!  The algorithm constructs sparse suffix tree.

Monte-Carlo algorithm

1

12

18

10

14

ionlying…

yonlyon$

lyon

l

on

ly…

$ $

lionlyingonlyonlyon$ lyonlyon$ on$ onlyon$ onlyonlyon$

1 12 18 14 10

Sparse Suffix Array

Sparse Suffix Tree: •  Compacted trie representing all designated suffixes. •  For each node, child edges are sorted. •  Since edge labels can be encoded by pointers to T,

sparse suffix tree can be represented in O(b) space.

Overview: gradual refinement

1

10

12

18

14

lionlying…

lyonlyon$ onlyon$

onlyonlyon$

on$

sort edges

1

12

18

10

14

ionlying…

yonlyon$

lyon

l

on

ly…

$ $

goal

1

10

14

18

12

only

lionlying…

lyonlyon$

on$

on$

only…

1

10

14

12

18

lyon on

$

lionlying…

$

ly…

lyonlyon$

1

12

10

18

14

ionlying…

yonlyon$

lyon

l

on ly…

$ $

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

l-strict (unsorted) sparse suffix trees

sort edges

1

12

18

10

14

ionlying…

yonlyon$

lyon

l

on

ly…

$ $

goal

2-strict 4-strict {32, 16, 8}-strict

1-strict

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

1

10

12

18

14

lionlying…

lyonlyon$ onlyon$

onlyonlyon$

on$

1

10

14

18

12

only

lionlying…

lyonlyon$

on$

on$

only…

1

10

14

12

18

lyon on

$

lionlying…

$

ly…

lyonlyon$

1

12

10

18

14

ionlying…

yonlyon$

lyon

l

on ly…

$ $

l-strict (unsorted) sparse suffix trees

For any internal node of l-strict tree, any pair of child edge labels has common prefix of length less than l.

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

1

10

12

18

14

lionlying…

lyonlyon$ onlyon$

onlyonlyon$

on$

1

10

14

18

12

only

lionlying…

lyonlyon$

on$

on$

only…

1

10

14

12

18

lyon on

$

lionlying…

$

ly…

lyonlyon$

1

12

10

18

14

ionlying…

yonlyon$

lyon

l

on ly…

$ $

2-strict 4-strict {32, 16, 8}-strict

1-strict Condition of l-strict tree

1

10

12

18

14

lionlying…

lyonlyon$ onlyon$

onlyonlyon$

on$

1

12

10

18

14

ionlying…

yonlyon$

lyon

l

on ly…

$ $

From 4-strict tree to 2-strict tree

{32, 16, 8}-strict

1-strict

2-strict 4-strict

For every edge label of 4-strict tree, we compute FP for the prefix of length 2, and merge edge labels having the same value.

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

1

10

14

18

12

only

lionlying…

lyonlyon$

on$

on$

only…

1

10

14

12

18

lyon on

$

lionlying…

$

ly…

lyonlyon$

Gradual refinement

T = l i o n l y i n g o n l y o n l y o n $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

1

10

12

18

14

lionlying…

lyonlyon$ onlyon$

onlyonlyon$

on$

1

10

14

18

12

only

lionlying…

lyonlyon$

on$

on$

only…

1

10

14

12

18

lyon on

$

lionlying…

$

ly…

lyonlyon$

1

12

10

18

14

ionlying…

yonlyon$

lyon

l

on ly…

$ $

2-strict 4-strict {32, 16, 8}-strict

1-strict

!  For each refinement step, the FPs can be computed in O(b min{l, n/s}) time.

!  For the first log s steps, the cost is O((bn/s)log s). !  After log s steps, since l < n/2log s = n/s, the cost is

bn/s + bn/2s + bn/4s + … + b = O(bn/s)

The total cost for gradual refinement

The total cost for FP computation is O((bn/s)log s). Proof.

!  For each refinement step, we conduct radix grouping of O(b) integer keys with radix s, which can be done in O(b logsn) time.

The total cost for gradual refinement

Proof.

The total cost for merging edge labels is O(b log n logsn) = O(b log2n / log s) = O((bn/s)log s).

Grouping by distribute-and-collect

432 52

52

978

978

52

k elements associated with keys (integers in [0, n])

Empty-initialized buckets of size n

0 1 … 51 52 53 … 431 432 434 … 977 978 979 … n

Grouping by distribute-and-collect

432 52

52

978

978

52

k elements associated with keys (integers in [0, n])

0 1 … 51 52 53 … 431 432 434 … 977 978 979 … n

Empty-initialized buckets of size n

Grouping by distribute-and-collect

432 52

52

978

978

52

k elements associated with keys (integers in [0, n])

0 1 … 51 52 53 … 431 432 434 … 977 978 979 … n

O(k) time Empty-initialized buckets of size n

Grouping by distribute-and-collect

432 52

52

978

978

52

k elements associated with keys (integers in [0, n])

0 1 2 3 4 5 6 7 8 9

Empty-initialized buckets of size 10

Grouping by distribute-and-collect

432 52

52

978

978

52

k elements associated with keys (integers in [0, n])

0 1 2 3 4 5 6 7 8 9

Empty-initialized buckets of size 10

Grouping by distribute-and-collect

432 52

52

978

978

52

k elements associated with keys (integers in [0, n])

0 1 2 3 4 5 6 7 8 9

432 52

52 978 978

52

Empty-initialized buckets of size 10

Grouping by distribute-and-collect

k elements associated with keys (integers in [0, n])

0 1 2 3 4 5 6 7 8 9

432 52

52 978 978

52

432 52

52

978

978

52

Empty-initialized buckets of size 10

Grouping by distribute-and-collect

k elements associated with keys (integers in [0, n])

0 1 2 3 4 5 6 7 8 9

432 52

52 978 978

52

432 52

52

978

978

52 432

52 52

52

Empty-initialized buckets of size 10

Grouping by distribute-and-collect

k elements associated with keys (integers in [0, n])

0 1 2 3 4 5 6 7 8 9

432 52

52 978 978

52

432 52

52

978

978

52 432

52 52

52

978 978

432

52 52

52

978 978

O(k log10n) time Empty-initialized buckets of size 10

Grouping by distribute-and-collect Radix grouping with radix s

k elements associated with keys (integers in [0, n])

0 1 2 3 … … s-4 s-3 s-2 s-1

432 52

52

978

978

52 432

52 52

52

978 978

O(k logsn) time

logsn steps

Empty-initialized buckets of size s

!  For each refinement step, we conduct radix grouping of O(b) integer keys with radix s, which can be done in O(b logsn) time.

The total cost for gradual refinement

Proof.

The total cost for merging edge labels is O(b log n logsn) = O(b log2n / log s) = O((bn/s)log s).

For any s ∈ [b, n], we can construct sparse suffix tree correctly with high probability in O(n+(bn/s)log s)) time using O(s) additional space.

!  When s = b, the time complexity is O(n log b). !  When s = b log b, the time complexity is O(n).

Monte-Carlo algorithm

Theorem

The total cost for FP computation is O((bn/s)log s).

The total cost for merging edge labels is O(b log n logsn) = O(b log2n / log s) = O((bn/s)log s).

!  Follow the gradual refinement steps for this instance.

Exercise

1

3

6

9

7

doortodoortodortmund$

odoortodortmund$

doortodortmund$

ortodoortodortmund$

ortodortmund$

T = d o o r t o d o o r t o d o r t m u n d $ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

?

13

12

odortmund$

dortmund$

top related