ETRI: Linear-Time Search in Suffix Arrays. July 14, 2003. Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park

Page 1

Linear-Time Search in Suffix Arrays

July 14, 2003

Jeong Seop Sim, Dong Kyue Kim

Heejin Park, Kunsoo Park

Page 2

Suffix arrays

Suffix array of text T: the lexicographically sorted list of all suffixes of T.

Page 3

Suffix arrays: example for T = abbabaababbb#

The suffixes of T,

abbabaababbb# (1)
bbabaababbb# (2)
babaababbb# (3)
…
b# (12)
# (13)

are stored in lexicographical order.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

# is a special character that is lexicographically smaller than every other character.

Page 4

Suffix arrays: example for T = abbabaababbb#

The suffixes of T are

abbabaababbb# (1)
bbabaababbb# (2)
babaababbb# (3)
…
b# (12)
# (13)

An actual suffix array stores only the starting positions of the suffixes in T; for convenience, we assume that the suffixes themselves are stored. In the listing that follows, each row gives the index in the suffix array, the starting position in T, and the suffix.

1 13 #

2 6 a a b a b b b #

3 4 a b a a b a b b b #

4 7 a b a b b b #

5 1 a b b a b a a b a b b b #

6 9 a b b b #

7 12 b #

8 5 b a a b a b b b #

9 3 b a b a a b a b b b #

10 8 b a b b b #

11 11 b b #

12 2 b b a b a a b a b b b #

13 10 b b b #
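
The listing above can be reproduced with a few lines of Python. This is only a minimal sketch: it builds the array by plain sorting, which is far slower than the linear-time constructions the talk refers to, and the function name `build_suffix_array` is an illustrative assumption. It relies on `#` sorting before `a` and `b` in ASCII.

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of all suffixes in sorted order.

    Naive construction by sorting the suffixes themselves; adequate for a
    13-character example, not for large texts.
    """
    return sorted(range(1, len(text) + 1), key=lambda pos: text[pos - 1:])


T = "abbabaababbb#"  # '#' is smaller than 'a' and 'b' in ASCII
SA = build_suffix_array(T)

# Print index in the suffix array, starting position in T, and the suffix.
for index, pos in enumerate(SA, start=1):
    print(index, pos, T[pos - 1:])
```

Storing only `SA` (the positions) is what an actual suffix array does; the suffix text is recovered on demand as `T[pos - 1:]`.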

Page 5

Suffix arrays

Definition: s-suffixes

Suffixes starting with string s

Examples: a-suffixes, ba-suffixes, …

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 6

Suffix arrays vs. suffix trees

Construction time: Suffix Array = Suffix Tree

Space: Suffix Array = Suffix Tree

In practice, suffix arrays are more space-efficient than suffix trees.

Search time (p = |P|, n = |T|)

Suffix Array: O(p + log n)

Suffix Tree: O(p log |Σ|), O(p |Σ|)

Page 7

Contribution

Construction time: Suffix Array = Suffix Tree

Space: Suffix Array = Suffix Tree

In practice, suffix arrays are more space-efficient than suffix trees.

Search time

Suffix Array: O(p + log n), O(p log |Σ|)

Suffix Tree: O(p log |Σ|), O(p |Σ|)

Page 8

The meaning of our contribution

Construction time: Suffix Array = Suffix Tree

Space: Suffix Array = Suffix Tree

In practice, suffix arrays are more space-efficient than suffix trees.

Search time

Suffix Array: O(p + log n), O(p log |Σ|)

Suffix Tree: O(p log |Σ|), O(p |Σ|)

Search time: SA ST

Page 9

The meaning of our contribution

Construction time: Suffix Array = Suffix Tree

Space: Suffix Array = Suffix Tree

In practice, suffix arrays are more space-efficient than suffix trees.

Search time

Suffix Array: O(p + log n), O(p log |Σ|)

Suffix Tree: O(p log |Σ|), O(p |Σ|)

Search time: SA ST

Suffix arrays are more powerful than suffix trees.

Page 10

Our search algorithm

Page 11

Search in a suffix array

Definition: search in a suffix array

Input: a pattern P and the suffix array of T

Output: all P-suffixes of T

Page 12

Search in a suffix array

All ab-suffixes are neighbors.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = ab

T = abbabaababbb#

Find all ab-suffixes.

A search example

Page 13

Search in a suffix array

We only need to find the first and the last ab-suffixes, because all other ab-suffixes are stored between them.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = ab

T = abbabaababbb#

A search example
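
To make the "first and last" idea concrete, here is a small sketch that locates both boundaries with two binary searches over the suffix array. Note that this is the classical O(p log n) approach, shown only for orientation; it is not the backward-search algorithm of this talk. The names `build_suffix_array` and `find_range` are illustrative assumptions.

```python
def build_suffix_array(text):
    """1-based starting positions of the suffixes in sorted order (naive sort)."""
    return sorted(range(1, len(text) + 1), key=lambda pos: text[pos - 1:])


def find_range(text, sa, pattern):
    """Return (first, last) 1-based indices of the pattern-suffixes, or None."""
    def prefix(index):
        # First len(pattern) characters of the suffix stored at this index.
        start = sa[index - 1] - 1
        return text[start:start + len(pattern)]

    # Leftmost index whose truncated suffix is >= pattern.
    lo, hi = 1, len(sa) + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Leftmost index whose truncated suffix is > pattern.
    hi = len(sa) + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    last = lo - 1

    return (first, last) if first <= last else None


T = "abbabaababbb#"
SA = build_suffix_array(T)
print(find_range(T, SA, "ab"))  # (3, 6)
```

Every index between `first` and `last` holds an ab-suffix, so the starting positions of all occurrences are `SA[first - 1:last]`.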

Page 14

Related work

In developing our search algorithm, we adopt an idea suggested by Ferragina and Manzini (FOCS 2001):

Search for a pattern in a file compressed by the Burrows-Wheeler compression algorithm.

Search P from the last character to the first character of P.

P = ababaaabbabaaabb

We adopt this backward pattern searching idea.

Page 15

Algorithm outline

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = aba

T = abbabaababbb#

Outline of our search algorithm

We find all aba-suffixes by searching P backward.

Our algorithm has p stages. (In this case, there are 3 stages.)

Page 16

Algorithm outline

P = aba

T = abbabaababbb#

Outline of our search algorithm

We find all aba-suffixes by searching P backward.

Stage 1: find all a-suffixes.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 17

Algorithm outline

P = aba

T = abbabaababbb#

Outline of our search algorithm

We find all aba-suffixes by searching P backward.

Stage 1: find all a-suffixes.

Stage 2: find all ba-suffixes.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 18

Algorithm outline

P = aba

T = abbabaababbb#

Outline of our search algorithm

We find all aba-suffixes by searching P backward.

Stage 1: find all a-suffixes.

Stage 2: find all ba-suffixes.

Stage 3: find all aba-suffixes.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #
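
The three stages can be checked by brute force. The sketch below simply scans the sorted suffixes for each suffix of P in turn (a, then ba, then aba); it illustrates what each stage must output, not how the talk's algorithm computes it. The function name `stage_ranges` is an illustrative assumption.

```python
def stage_ranges(text, pattern):
    """For each stage, return (s, first, last): the range of s-suffixes,
    where s is the suffix of the pattern handled in that stage."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    result = []
    for stage in range(1, len(pattern) + 1):
        s = pattern[len(pattern) - stage:]  # last `stage` characters of P
        indices = [i for i, suffix in enumerate(suffixes, start=1)
                   if suffix.startswith(s)]
        result.append((s, indices[0], indices[-1]) if indices else (s, None, None))
    return result


for s, first, last in stage_ranges("abbabaababbb#", "aba"):
    print(f"{s}-suffixes: indices {first} to {last}")
```

For P = aba this reports indices 2 to 6 for the a-suffixes, 8 to 10 for the ba-suffixes, and 3 to 4 for the aba-suffixes.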

Page 19

Elaborate stage 2

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = aba

We explain a stage by elaborating stage 2.

We find all ba-suffixes using the a-suffixes found in stage 1.

We find the first ba-suffix from the first a-suffix and the last ba-suffix from the last a-suffix.

Page 20

Elaborate stage 2

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = aba

We explain a stage by elaborating stage 2.

We only explain how to find the first ba-suffix from the first a-suffix; finding the last ba-suffix is similar.

Page 21

Elaborate stage 2

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

To find the first ba-suffix, we count the number of suffixes that precede ba-suffixes in this suffix array.

P = aba

Page 22

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Elaborate stage 2

Suffixes preceding ba-suffixes are divided into two categories.

- A-type: Suffixes starting with characters lexicographically smaller than b (#-suffixes, a-suffixes).

- B-type: Suffixes starting with the same character b and preceding ba-suffixes.

We count A-type and B-type suffixes in different ways.

Page 23

Count the number of A-type suffixes

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

The number of A-type suffixes = The number of #-suffixes and a-suffixes = The position of the last a-suffix.

A-type

Page 24

Count the number of A-type suffixes

We generate an array that stores the positions of the last #-suffix, the last a-suffix, and the last b-suffix.

With this array, we can count A-type suffixes in O(1) time.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

# 1

a 6

b 13
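
A small sketch of this array follows. The summary slide calls it the M array, so the code uses that name; the helper names `last_positions` and `a_type_counts` are illustrative assumptions. One scan over the sorted suffixes fills M, and a second pass over the (sorted) alphabet turns it into a lookup table so that the A-type count for any character is a single access.

```python
def last_positions(sorted_suffixes):
    """M[c]: index of the last suffix starting with c (one scan)."""
    M = {}
    for index, suffix in enumerate(sorted_suffixes, start=1):
        M[suffix[0]] = index  # later indices overwrite earlier ones
    return M


def a_type_counts(M):
    """For each character c, the number of suffixes that start with a
    character smaller than c, i.e. the A-type count used when c is prepended."""
    counts, previous_last = {}, 0
    for c in sorted(M):
        counts[c] = previous_last
        previous_last = M[c]
    return counts


T = "abbabaababbb#"
suffixes = sorted(T[i:] for i in range(len(T)))
M = last_positions(suffixes)
A = a_type_counts(M)
print(M)       # {'#': 1, 'a': 6, 'b': 13}
print(A["b"])  # 6
```

For stage 2 of the example (prepending b), the A-type count is `A["b"]`, the position of the last a-suffix.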

Page 25

Count the number of A-type suffixes

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Array

Space: O(|Σ|)

Time: O(n) (one scan)

# 1

a 6

b 13

Page 26

Count the number of B-type suffixes

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Count B-type suffixes: b-suffixes preceding ba-suffixes.

B-type

Page 27

Count the number of B-type suffixes

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

B-type suffixes: b-suffixes preceding ba-suffixes.

The suffix obtained by removing the leftmost b from a B-type suffix appears in the suffix subarray that precedes the a-suffixes found in stage 1.

B-type

Page 28

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Count the number of B-type suffixes

The number of B-type suffixes equals the number of suffixes, in the suffix subarray that precedes the a-suffixes, whose previous character is b.

We count this with array N.

Let U be the conceptual array of previous characters of suffixes.

U: b b b a # b b a b a b a a

Page 29

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Count the number of B-type suffixes

U: b b b a # b b a b a b a a

Array N (n|Σ| entries)

 i   #  a  b
 1   0  0  1
 2   0  0  2
 3   0  0  3
 4   0  1  3
 5   1  1  3
 6   1  1  4
 7   1  1  5
 8   1  2  5
 9   1  2  6
10   1  3  6
11   1  3  7
12   1  4  7
13   1  5  7

N[i,b] stores the number of suffixes whose previous character is b in the suffix subarray SA[1,i].

Example: N[7,b] = 5
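
As an illustration, U and N can be built directly from the suffix array. The sketch below assumes that the suffix starting at position 1 takes the last character of T (here `#`) as its previous character, which matches the U column shown on the slide; the function names are illustrative assumptions. N is stored as a list of per-character counters with an extra all-zero row for i = 0.

```python
def previous_chars(text, sa):
    """U[i-1]: the character that precedes the suffix at index i.

    For the suffix starting at position 1, Python's text[-1] wraps around to
    the last character of the text (an assumption matching the slide's U)."""
    return [text[pos - 2] for pos in sa]


def prefix_counts(U, alphabet):
    """N[i][c]: number of occurrences of c in U[1..i]; N[0] is all zeros."""
    N = [dict.fromkeys(alphabet, 0)]
    for ch in U:
        row = dict(N[-1])  # copy the previous row, then count this character
        row[ch] += 1
        N.append(row)
    return N


T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda pos: T[pos - 1:])
U = previous_chars(T, SA)
N = prefix_counts(U, sorted(set(T)))

print(" ".join(U))  # b b b a # b b a b a b a a
print(N[7]["b"])    # 5
```

This table has n|Σ| entries, which is the space cost the following slides set out to reduce.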

Page 30

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Count the number of B-type suffixes

U: b b b a # b b a b a b a a

We can count B-type suffixes in O(1) time by accessing an entry of N.
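
Putting the A-type and B-type counts together, one stage becomes two table lookups per boundary. The sketch below is one reading of the slides under stated assumptions: the first boundary is (A-type count) + (B-type count) + 1, and the last boundary is handled symmetrically by counting up to the old last index. The names `backward_search` and `smaller` are illustrative, N is stored in full (n|Σ| entries), and the pattern is assumed not to contain `#`.

```python
def backward_search(text, pattern):
    """Return (first, last) 1-based indices of the pattern-suffixes, or None."""
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda pos: text[pos - 1:])
    U = [text[pos - 2] for pos in sa]  # previous characters (wraps for position 1)
    alphabet = sorted(set(text))

    # smaller[c]: number of suffixes starting with a character smaller than c
    # (the A-type count when c is prepended).
    smaller, total = {}, 0
    for c in alphabet:
        smaller[c] = total
        total += text.count(c)

    # N[i][c]: occurrences of c in U[1..i] (used for the B-type count).
    N = [dict.fromkeys(alphabet, 0)]
    for ch in U:
        row = dict(N[-1])
        row[ch] += 1
        N.append(row)

    first, last = 1, n  # before stage 1, every suffix is a candidate
    for c in reversed(pattern):  # one stage per pattern character
        if c not in smaller:
            return None
        first = smaller[c] + N[first - 1][c] + 1
        last = smaller[c] + N[last][c]
        if first > last:
            return None
    return first, last


print(backward_search("abbabaababbb#", "aba"))  # (3, 4)
```

Each stage does a constant number of lookups, so the loop runs in O(p) time once the tables exist; the price is the n|Σ| space of N.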

Page 31

Array N

Space: O(n |Σ|)

An alternative way

Space: O(n)

O(log |Σ|) time for counting B-type suffixes.

Page 32

Query for N[i,b] (counting B-type suffixes)

O(log n) time

O(log |Σ|) time

Page 33

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Query for N[i,b]: O(log n) time

U: b b b a # b b a b a b a a

In the O(log n)-time algorithm, we generate an array whose i-th entry stores the location of the i-th b in U.

i-th b     1  2  3  4  5  6  7
location   1  2  3  6  7  9  11

Page 34

Query for N[i,b]: O(log n) time

U: b b b a # b b a b a b a a

i-th b     1  2  3  4  5  6  7
location   1  2  3  6  7  9  11

Counting the suffixes whose previous character is b in SA[1,8] is the same as counting the b's in U[1,8].

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 35

Query for N[i,b]: O(log n) time

U: b b b a # b b a b a b a a

i-th b     1  2  3  4  5  6  7
location   1  2  3  6  7  9  11

Find the largest value not exceeding 8 in this array.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 36

Query for N[i,b]: O(log n) time

U: b b b a # b b a b a b a a

i-th b     1  2  3  4  5  6  7
location   1  2  3  6  7  9  11

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

To find 7 in this array, we perform binary search, which takes O(log n) time.

Page 37

Query for N[i,b]: O(log n) time

U: b b b a # b b a b a b a a

i-th b     1  2  3  4  5  6  7
location   1  2  3  6  7  9  11

The index of 7 in this array, which is 5, is the number of b's in U[1,8].

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #
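
The query walked through on the last few slides can be sketched with Python's standard `bisect` module. The helper names `build_locations` and `count_up_to` are illustrative assumptions; `bisect_right` returns how many stored locations are at most i, which is exactly the index of the largest value not exceeding i.

```python
from bisect import bisect_right


def build_locations(U):
    """For each character, the sorted list of its 1-based locations in U."""
    locations = {}
    for i, ch in enumerate(U, start=1):
        locations.setdefault(ch, []).append(i)
    return locations


def count_up_to(locations, i, c):
    """N[i,c]: number of occurrences of c in U[1..i], by binary search."""
    return bisect_right(locations.get(c, []), i)


U = list("bbba#bbababaa")
locations = build_locations(U)
print(locations["b"])                   # [1, 2, 3, 6, 7, 9, 11]
print(count_up_to(locations, 8, "b"))   # 5
```

The location lists hold n entries in total, so the space is O(n), and each query is one binary search over at most n entries.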

Page 38

Query for N[i,b]: O(log n) time

U: b b b a # b b a b a b a a

Locations of b in U: 1 2 3 6 7 9 11

Locations of a in U: 4 8 10 12 13

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Locations of # in U: 5

Generally, we require such arrays for all characters (#, a, b).

O(n) space

Page 39

Query for N[i,b]

O(log n) time

O(log |Σ|) time

Page 40

Query for N[i,b]: O(log |Σ|) time

U: b b b a # b b a b a b a a

For the last character of each |Σ|-sized block of U, we compute the entries of N.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Divide U into |Σ|-sized blocks.

Entries of N for the last character of each block:

 i   #  a  b
 3   0  0  3
 6   1  1  4
 9   1  2  6
12   1  4  7

Page 41

Query for N[i,b]: O(log |Σ|) time

U: b b b a # b b a b a b a a

For the other entries in each block, we generate a data structure similar to the one used in the O(log n)-time algorithm.

O(log |Σ|) time for binary search.

Still O(n) space in total.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Entries of N for the last character of each |Σ|-sized block:

 i   #  a  b
 3   0  0  3
 6   1  1  4
 9   1  2  6
12   1  4  7
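
One possible way to realize the blocked scheme is sketched below. The slides do not spell out the in-block data structure, so this is an assumption: it keeps the per-character location lists and restricts the binary search to the at most |Σ| locations that can fall inside the current block, using the sampled entry of N as the starting offset. The class name `BlockedCounts` is illustrative.

```python
from bisect import bisect_right


class BlockedCounts:
    """Answer N[i,c] queries from sampled rows of N plus location lists."""

    def __init__(self, U, alphabet):
        self.block = max(1, len(alphabet))            # block size |Σ|
        self.locations = {c: [] for c in alphabet}    # n entries in total
        self.samples = [dict.fromkeys(alphabet, 0)]   # row of N before block 1
        running = dict.fromkeys(alphabet, 0)
        for i, ch in enumerate(U, start=1):
            self.locations[ch].append(i)
            running[ch] += 1
            if i % self.block == 0:                   # last character of a block
                self.samples.append(dict(running))

    def count(self, i, c):
        """Number of occurrences of c in U[1..i]."""
        full_blocks = i // self.block
        base = self.samples[full_blocks][c]           # count in the complete blocks
        locs = self.locations[c]
        # Fewer than `block` further occurrences can lie in the partial block,
        # so the binary search inspects at most |Σ| candidates.
        hi = min(len(locs), base + self.block)
        return bisect_right(locs, i, base, hi)


U = list("bbba#bbababaa")
counts = BlockedCounts(U, ["#", "a", "b"])
print(counts.samples[1:])     # sampled rows of N at i = 3, 6, 9, 12
print(counts.count(8, "b"))   # 5
```

The samples hold about (n/|Σ|)·|Σ| = n counters and the location lists hold n entries, so the total space stays O(n) while each query searches a window of size at most |Σ|.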

Page 42

Summary

p stages

Each stage:

Count A-type suffixes
Time: O(1)
Space: O(n) for M array

Count B-type suffixes
Time: O(log |Σ|)
Space: O(n) for computing the value of an entry of N

In total, O(p log |Σ|) time with O(n) space.

Page 43

Conclusion

In a suffix array, one can choose the O(p log |Σ|) or the O(p + log n) search-time algorithm depending on the alphabet size.

Suffix arrays are more powerful than suffix trees.