bbm 202 algorithms - hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · radix. number of...

178
May. 9, 2013 BBM 202 - ALGORITHMS STRING SORTS, TRIES DEPT. OF COMPUTER ENGINEERING ERKUT ERDEM Acknowledgement: The course slides are adapted from the slides prepared by R. Sedgewick and K. Wayne of Princeton University.

Upload: others

Post on 22-Mar-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

May. 9, 2013

BBM 202 - ALGORITHMS

STRING SORTS, TRIES

DEPT. OF COMPUTER ENGINEERING

ERKUT ERDEM

Acknowledgement:  The  course  slides  are  adapted  from  the  slides  prepared  by  R.  Sedgewick  and  K.  Wayne  of  Princeton  University.

Page 2: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

STRING SORTS, TRIES

‣ String sorts‣ Key-indexed counting‣ LSD radix sort‣ MSD radix sort‣ 3-way radix quicksort‣ Suffix arrays

‣ Tries

Page 3: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

3

String processing

String. Sequence of characters.

Important fundamental abstraction.• Information processing.

• Genomic sequences.

• Communication systems (e.g., email).

• Programming systems (e.g., Java programs).

• …

“ The digital information that underlies biochemistry, cell

biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data

structure of an organism's biology. ” — M. V. Olson

Page 4: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

4

The char data type

C char data type. Typically an 8-bit integer.• Supports 7-bit ASCII.

• Need more bits to represent certain characters.

Java char data type. A 16-bit unsigned integer.• Supports original 16-bit Unicode.

• Supports 21-bit Unicode 3.0 (awkwardly).

6676.5 Q Data Compression

ASCII encoding. When you HexDump a bit-stream that contains ASCII-encoded charac-ters, the table at right is useful for reference. Given a 2-digit hex number, use the first hex digit as a row index and the second hex digit as a column reference to find the character that it encodes. For example, 31 encodes the digit 1, 4A encodes the letter J, and so forth. This table is for 7-bit ASCII, so the first hex digit must be 7 or less. Hex numbers starting with 0 and 1 (and the numbers 20 and 7F) correspond to non-printing control charac-ters. Many of the control characters are left over from the days when physical devices like typewriters were controlled by ASCII input; the table highlights a few that you might see in dumps. For example SP is the space character, NUL is the null character, LF is line-feed, and CR is carriage-return.

!" #$%%&'(, working with data compression requires us to reorient our thinking about standard input and standard output to include binary encoding of data. BinaryStdIn and BinaryStdOut provide the methods that we need. They provide a way for you to make a clear distinction in your client programs between writing out information in-tended for file storage and data transmission (that will be read by programs) and print-ing information (that is likely to be read by humans).

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI

1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US

2 SP ! “ # $ % & ‘ ( ) * + , - . /

3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?

4 @ A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ \ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z { | } ~ DEL

Hexadecimal to ASCII conversion table

4

The char data type

C char data type. Typically an 8-bit integer.

・Supports 7-bit ASCII.

・Can represent only 256 characters.

Java char data type. A 16-bit unsigned integer.

・Supports original 16-bit Unicode.

・Supports 21-bit Unicode 3.0 (awkwardly).

6676.5 Q Data Compression

ASCII encoding. When you HexDump a bit-stream that contains ASCII-encoded charac-ters, the table at right is useful for reference. Given a 2-digit hex number, use the first hex digit as a row index and the second hex digit as a column reference to find the character that it encodes. For example, 31 encodes the digit 1, 4A encodes the letter J, and so forth. This table is for 7-bit ASCII, so the first hex digit must be 7 or less. Hex numbers starting with 0 and 1 (and the numbers 20 and 7F) correspond to non-printing control charac-ters. Many of the control characters are left over from the days when physical devices like typewriters were controlled by ASCII input; the table highlights a few that you might see in dumps. For example SP is the space character, NUL is the null character, LF is line-feed, and CR is carriage-return.

!" #$%%&'(, working with data compression requires us to reorient our thinking about standard input and standard output to include binary encoding of data. BinaryStdIn and BinaryStdOut provide the methods that we need. They provide a way for you to make a clear distinction in your client programs between writing out information in-tended for file storage and data transmission (that will be read by programs) and print-ing information (that is likely to be read by humans).

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI

1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US

2 SP ! “ # $ % & ‘ ( ) * + , - . /

3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?

4 @ A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ \ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z { | } ~ DEL

Hexadecimal to ASCII conversion table

U+1D50AU+2202U+00E1U+0041

Unicode characters

Page 5: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

5

I (heart) Unicode

Page 6: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

String data type. Sequence of characters (immutable).

Length. Number of characters.Indexing. Get the ith character.Substring extraction. Get a contiguous sequence of characters. String concatenation. Append one character to end of another string.

6

The String data type

Fundamental constant-time String operations

0 1 2 3 4 5 6 7 8 9 10 11 12

A T T A C K A T D A W N s

s.charAt(3)

s.length()

s.substring(7, 11)

Page 7: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

7

The String data type: Java implementation

public final class String implements Comparable<String>{ private char[] val; // characters private int offset; // index of first char in array private int length; // length of string private int hash; // cache of hashCode()

public int length() { return length; }

public char charAt(int i) { return value[i + offset]; } private String(int offset, int length, char[] val) { this.offset = offset; this.length = length; this.val = val; } public String substring(int from, int to) { return new String(offset + from, to - from, val); } …

X X A T T A C K X

0 1 2 3 4 5 6 7 8

val[]

offset

length

copy of reference to

original char array

Page 8: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

8

The String data type: performance

String data type. Sequence of characters (immutable).Underlying implementation. Immutable char[] array, offset, and length.

Memory. 40 + 2N bytes for a virgin String of length N.

can use byte[] or char[] instead of String to save space

(but lose convenience of String data type)

StringString

operation guarantee extra space

length() 1 1

charAt() 1 1

substring() 1 1

concat() N N

Page 9: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

9

The StringBuilder data type

StringBuilder data type. Sequence of characters (mutable).Underlying implementation. Resizing char[] array and length.

Remark. StringBuffer data type is similar, but thread safe (and slower).

StringString StringBuilderStringBuilder

operation guarantee extra space guarantee extra space

length() 1 1 1 1

charAt() 1 1 1 1

substring() 1 1 N N

concat() N N 1 * 1 *

* amortized

Page 10: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

10

String vs. StringBuilder

Q. How to efficiently reverse a string?

A.

B.

public static String reverse(String s) { String rev = ""; for (int i = s.length() - 1; i >= 0; i--) rev += s.charAt(i); return rev; }

public static String reverse(String s) { StringBuilder rev = new StringBuilder(); for (int i = s.length() - 1; i >= 0; i--) rev.append(s.charAt(i)); return rev.toString(); }

quadratic time

linear time

Page 11: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

11

String challenge: array of suffixes

Q. How to efficiently form array of suffixes?

a a c a a g t t t a c a a g c

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

input string

0 a a c a a g t t t a c a a g c1 a c a a g t t t a c a a g c2 c a a g t t t a c a a g c3 a a g t t t a c a a g c4 a g t t t a c a a g c5 g t t t a c a a g c6 t t t a c a a g c7 t t a c a a g c8 t a c a a g c9 a c a a g c10 c a a g c11 a a g c12 a g c13 g c14 c

suffixes

Page 12: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

12

String vs. StringBuilder

Q. How to efficiently form array of suffixes?

A.

B.

public static String[] suffixes(String s) { int N = s.length(); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = s.substring(i, N); return suffixes; }

public static String[] suffixes(String s) { int N = s.length(); StringBuilder sb = new StringBuilder(s); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = sb.substring(i, N); return suffixes; }

linear time and

linear space

quadratic time and

quadratic space

Page 13: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

13

Longest common prefix

Q. How long to compute length of longest common prefix?

Running time. Proportional to length D of longest common prefix.Remark. Also can compute compareTo() in sublinear time.

public static int lcp(String s, String t) { int N = Math.min(s.length(), t.length()); for (int i = 0; i < N; i++) if (s.charAt(i) != t.charAt(i)) return i; return N; }

p r e f i x

p r e f e t c h

0 1 2 3 4 5 6 7

linear time (worst case)

sublinear time (typical case)

Page 14: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Digital key. Sequence of digits over fixed alphabet.Radix. Number of digits R in alphabet.

Alphabets

14

604 CHAPTER 6 ! Strings

holds the frequencies in Count is an example of a character-indexed array. With a Java String, we have to use an array of size 256; with Alphabet, we just need an array with one entry for each alphabet character. This savings might seem modest, but, as you will see, our algorithms can produce huge numbers of such arrays, and the space for arrays of size 256 can be prohibitive.

Numbers. As you can see from our several of the standard Alphabet examples, we of-ten represent numbers as strings. The methods toIndices() coverts any String over a given Alphabet into a base-R number represented as an int[] array with all values between 0 and R!1. In some situations, doing this conversion at the start leads to com-pact code, because any digit can be used as an index in a character-indexed array. For example, if we know that the input consists only of characters from the alphabet, we could replace the inner loop in Count with the more compact code

int[] a = alpha.toIndices(s); for (int i = 0; i < N; i++) count[a[i]]++;

name R() lgR() characters

BINARY 2 1 01

OCTAL 8 3 01234567

DECIMAL 10 4 0123456789

HEXADECIMAL 16 4 0123456789ABCDEF

DNA 4 2 ACTG

LOWERCASE 26 5 abcdefghijklmnopqrstuvwxyz

UPPERCASE 26 5 ABCDEFGHIJKLMNOPQRSTUVWXYZ

PROTEIN 20 5 ACDEFGHIKLMNPQRSTVWY

BASE64 64 6ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef ghijklmnopqrstuvwxyz0123456789+/

ASCII 128 7 ASCII characters

EXTENDED_ASCII 256 8 extended ASCII characters

UNICODE16 65536 16 Unicode characters

Standard alphabets

Page 15: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

STRING SORTS

‣ Key-indexed counting‣ LSD radix sort‣ MSD radix sort‣ 3-way radix quicksort‣ Suffix arrays

Page 16: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Review: summary of the performance of sorting algorithms

Frequency of operations = key compares.

Lower bound. ~ N lg N compares required by any compare-based algorithm.

Q. Can we do better (despite the lower bound)?A. Yes, if we don't depend on key compares.

16

algorithm guarantee random extra space stable? operations on keys

insertion sort N2 / 2 N2 / 4 1 yes compareTo()

mergesort N lg N N lg N N yes compareTo()

quicksort 1.39 N lg N * 1.39 N lg N c lg N no compareTo()

heapsort 2 N lg N 2 N lg N 1 no compareTo()

* probabilistic

Page 17: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Key-indexed counting: assumptions about keys

Assumption. Keys are integers between 0 and R - 1.

Implication. Can use key as an array index.

Applications.• Sort string by first letter.

• Sort class roster by section.

• Sort phone numbers by area code.

• Subroutine in a sorting algorithm. [stay tuned]

Remark. Keys may have associated data ⇒can't just count up number of keys of each value.

17

Anderson 2 Harris 1Brown 3 Martin 1Davis 3 Moore 1Garcia 4 Anderson 2Harris 1 Martinez 2Jackson 3 Miller 2Johnson 4 Robinson 2Jones 3 White 2Martin 1 Brown 3Martinez 2 Davis 3Miller 2 Jackson 3Moore 1 Jones 3Robinson 2 Taylor 3Smith 4 Williams 3Taylor 3 Garcia 4Thomas 4 Johnson 4Thompson 4 Smith 4White 2 Thomas 4Williams 3 Thompson 4Wilson 4 Wilson 4

Typical candidate for key-indexed counting

input sorted result

keys aresmall integers

section (by section) name

Page 18: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

18

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

R=6

use a for 0b for 1c for 2

d for 3e for 4f for 5

Page 19: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

a 0

b 2

c 3

d 1

e 2

f 1

- 3

19

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

count

frequencies

offset by 1

[stay tuned]

r count[r]

Page 20: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 0

b 2

c 5

d 6

e 8

f 9

- 12

20

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

r count[r]

compute

cumulates

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i]; 6 keys < d, 8 keys < e

so d’s go in a[6] and a[7]

Page 21: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 0

b 2

c 5

d 6

e 8

f 9

- 12

21

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

r count[r]

0

1

2

3

4

5

6

7

8

9

10

11

i aux[i]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

Page 22: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 0

b 2

c 5

d 7

e 8

f 9

- 12

22

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0

1

2

3

4

5

6 d

7

8

9

10

11

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 23: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 2

c 5

d 7

e 8

f 9

- 12

23

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2

3

4

5

6 d

7

8

9

10

11

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 24: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 2

c 6

d 7

e 8

f 9

- 12

24

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2

3

4

5 c

6 d

7

8

9

10

11

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 25: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 2

c 6

d 7

e 8

f 10

- 12

25

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2

3

4

5 c

6 d

7

8

9 f

10

11

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 26: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 2

c 6

d 7

e 8

f 11

- 12

26

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2

3

4

5 c

6 d

7

8

9 f

10 f

11

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 27: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 3

c 6

d 7

e 8

f 11

- 12

27

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2 b

3

4

5 c

6 d

7

8

9 f

10 f

11

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 28: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 3

c 6

d 8

e 8

f 11

- 12

28

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2 b

3

4

5 c

6 d

7 d

8

9 f

10 f

11

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

r count[r]

Page 29: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 4

c 6

d 8

e 8

f 11

- 12

29

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2 b

3 b

4

5 c

6 d

7 d

8

9 f

10 f

11

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 30: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 4

c 6

d 8

e 8

f 12

- 12

30

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2 b

3 b

4

5 c

6 d

7 d

8

9 f

10 f

11 f

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 31: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 5

c 6

d 8

e 8

f 12

- 12

31

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2 b

3 b

4 b

5 c

6 d

7 d

8

9 f

10 f

11 f

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 32: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 1

b 5

c 6

d 8

e 9

f 12

- 12

32

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1

2 b

3 b

4 b

5 c

6 d

7 d

8 e

9 f

10 f

11 f

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 33: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 2

b 5

c 6

d 8

e 9

f 12

- 12

33

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

0 a

1 a

2 b

3 b

4 b

5 c

6 d

7 d

8 e

9 f

10 f

11 f

r count[r]

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

move

items

i aux[i]

Page 34: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

a 2

b 5

c 6

d 8

e 9

f 12

- 12

34

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

move

items

0 a

1 a

2 b

3 b

4 b

5 c

6 d

7 d

8 e

9 f

10 f

11 f

r count[r]

i aux[i]

Page 35: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

a 2

b 5

c 6

d 8

e 9

f 12

- 12

35

Key-indexed counting demo

i a[i]

0 a

1 a

2 b

3 b

4 b

5 c

6 d

7 d

8 e

9 f

10 f

11 fcopy

back

0 a

1 a

2 b

3 b

4 b

5 c

6 d

7 d

8 e

9 f

10 f

11 f

r count[r]

i aux[i]

Page 36: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

a 0

b 2

c 3

d 1

e 2

f 1

- 3

36

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

count

frequencies

offset by 1

[stay tuned]

r count[r]

Page 37: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

a 0

b 2

c 5

d 6

e 8

f 9

- 12

37

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

r count[r]

compute

cumulates

 int  N  =  a.length;  int[]  count  =  new  int[R+1];

 for  (int  i  =  0;  i  <  N;  i++)        count[a[i]+1]++;

 for  (int  r  =  0;  r  <  R;  r++)        count[r+1]  +=  count[r];

 for  (int  i  =  0;  i  <  N;  i++)        aux[count[a[i]]++]  =  a[i];

 for  (int  i  =  0;  i  <  N;  i++)        a[i]  =  aux[i]; 6 keys < d, 8 keys < e

so d’s go in a[6] and a[7]

Page 38: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

a 2

b 5

c 6

d 8

e 9

f 12

- 12

38

Key-indexed counting demo

i a[i]

0 d

1 a

2 c

3 f

4 f

5 b

6 d

7 b

8 f

9 b

10 e

11 a

move

items

0 a

1 a

2 b

3 b

4 b

5 c

6 d

7 d

8 e

9 f

10 f

11 f

r count[r]

i aux[i]

Page 39: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Goal. Sort an array a[] of N integers between 0 and R - 1.

• Count frequencies of each letter using key as index.

• Compute frequency cumulates which specify destinations.

• Access cumulates using key as index to move items.

• Copy back into original array.

int N = a.length; int[] count = new int[R+1];

for (int i = 0; i < N; i++) count[a[i]+1]++;

for (int r = 0; r < R; r++) count[r+1] += count[r];

for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];

for (int i = 0; i < N; i++) a[i] = aux[i];

a 2

b 5

c 6

d 8

e 9

f 12

- 12

39

Key-indexed counting demo

i a[i]

0 a

1 a

2 b

3 b

4 b

5 c

6 d

7 d

8 e

9 f

10 f

11 fcopy

back

0 a

1 a

2 b

3 b

4 b

5 c

6 d

7 d

8 e

9 f

10 f

11 f

r count[r]

i aux[i]

Page 40: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Key-indexed counting: analysis

Proposition. Key-indexed counting uses ~ 11 N + 4 R array accesses to sort

N items whose keys are integers between 0 and R - 1.

Proposition. Key-indexed counting uses extra space proportional to N + R.

Stable?

40

Anderson 2 Harris 1Brown 3 Martin 1Davis 3 Moore 1Garcia 4 Anderson 2Harris 1 Martinez 2Jackson 3 Miller 2Johnson 4 Robinson 2Jones 3 White 2Martin 1 Brown 3Martinez 2 Davis 3Miller 2 Jackson 3Moore 1 Jones 3Robinson 2 Taylor 3Smith 4 Williams 3Taylor 3 Garcia 4Thomas 4 Johnson 4Thompson 4 Smith 4White 2 Thomas 4Williams 3 Thompson 4Wilson 4 Wilson 4

Distributing the data (records with key 3 highlighted)

count[]1 2 3 40 3 8 140 4 8 140 4 9 140 4 10 140 4 10 151 4 10 151 4 11 151 4 11 161 4 12 162 4 12 162 5 12 162 6 12 163 6 12 163 7 12 163 7 12 173 7 13 173 7 13 183 7 13 193 8 13 193 8 14 193 8 14 20

a[0]

a[1]

a[2]

a[3]

a[4]

a[5]

a[6]

a[7]

a[8]

a[9]

a[10]

a[11]

a[12]

a[13]

a[14]

a[15]

a[16]

a[17]

a[18]

a[19]

aux[0]

aux[1]

aux[2]

aux[3]

aux[4]

aux[5]

aux[6]

aux[7]

aux[8]

aux[9]

aux[10]

aux[11]

aux[12]

aux[13]

aux[14]

aux[15]

aux[16]

aux[17]

aux[18]

aux[19]

for (int i = 0; i < N; i++) aux[count[a[i].key(d)]++] = a[i];

Page 41: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

STRING SORTS

‣ Key-indexed counting‣ LSD radix sort‣ MSD radix sort‣ 3-way radix quicksort‣ Suffix arrays

Page 42: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Least-significant-digit-first string sort

LSD string (radix) sort.• Consider characters from right to left.

• Stably sort using dth character as the key (using key-indexed counting).

42

0 d a b

1 a d d

2 c a b

3 f a d

4 f e e

5 b a d

6 d a d

7 b e e

8 f e d

9 b e d

10 e b b

11 a c e

0 d a b

1 c a b

2 f a d

3 b a d

4 d a d

5 e b b

6 a c e

7 a d d

8 f e d

9 b e d

10 f e e

11 b e e

sort key (d=1)

0 a c e

1 a d d

2 b a d

3 b e d

4 b e e

5 c a b

6 d a b

7 d a d

8 e b b

9 f a d

10 f e d

11 f e e

sort key (d=0)

0 d a b

1 c a b

2 e b b

3 a d d

4 f a d

5 b a d

6 d a d

7 f e d

8 b e d

9 f e e

10 b e e

11 a c e

sort must be stable

(arrows do not cross)

sort key (d=2)

Page 43: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

43

LSD string sort: correctness proof

Proposition. LSD sorts fixed-length strings in ascending order.

Pf. [by induction on i]After pass i, strings are sorted by last i characters.

• If two strings differ on sort key,key-indexed sort puts them in proper relative order.

• If two strings agree on sort key,stability keeps them in proper relative order.

0 d a b

1 c a b

2 f a d

3 b a d

4 d a d

5 e b b

6 a c e

7 a d d

8 f e d

9 b e d

10 f e e

11 b e e

0 a c e

1 a d d

2 b a d

3 b e d

4 b e e

5 c a b

6 d a b

7 d a d

8 e b b

9 f a d

10 f e d

11 f e e

sorted from

previous passes

(by induction)

sort key

Page 44: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

44

LSD string sort: Java implementation

key-indexed

counting

public class LSD{ public static void sort(String[] a, int W) { int R = 256; int N = a.length; String[] aux = new String[N];

for (int d = W-1; d >= 0; d--) { int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i].charAt(d) + 1]++; for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i].charAt(d)]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i]; } }}

do key-indexed counting

for each digit from right to left

radix R

fixed-length W strings

Page 45: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Summary of the performance of sorting algorithms

Frequency of operations.

Q. What if strings do not have same length?

45

algorithm guarantee random extra space stable? operations on keys

insertion sort N2 / 2 N2 / 4 1 yes compareTo()

mergesort N lg N N lg N N yes compareTo()

quicksort 1.39 N lg N * 1.39 N lg N c lg N no compareTo()

heapsort 2 N lg N 2 N lg N 1 no compareTo()

LSD † 2 W N 2 W N N + R yes charAt()

* probabilistic

† fixed-length W keys

Page 46: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Problem. Sort a huge commercial database on a fixed-length key.Ex. Account number, date, Social Security number, ...

Which sorting method to use?• Insertion sort.

• Mergesort.

• Quicksort.

• Heapsort.

• LSD string sort.

46

String sorting challenge 1

B14-99-8765

756-12-AD46

CX6-92-0112

332-WX-9877

375-99-QWAX

CV2-59-0221

387-SS-0321

KJ-00-12388

715-YT-013C

MJ0-PP-983F

908-KK-33TY

BBN-63-23RE

48G-BM-912D

982-ER-9P1B

WBL-37-PB81

810-F4-J87Q

LE9-N8-XX76

908-KK-33TY

B14-99-8765

CX6-92-0112

CV2-59-0221

332-WX-23SQ

332-6A-9877

256 (or 65,536) counters;

Fixed-length strings sort in W passes.

Page 47: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

47

String sorting challenge 2a

Problem. Sort one million 32-bit integers.Ex. Google (or presidential) interview.

Which sorting method to use?• Insertion sort.

• Mergesort.

• Quicksort.

• Heapsort.

• LSD string sort.

Google CEO Eric Schmidt interviews Barack Obama

Page 48: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

48

String sorting challenge 2b

Problem. Sort huge array of random 128-bit numbers.Ex. Supercomputer sort, internet router.

Which sorting method to use?• Insertion sort.

• Mergesort.

• Quicksort.

• Heapsort.

• LSD string sort.

01110110111011011101...1011101

Page 49: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

49

String sorting challenge 2b

Problem. Sort huge array of random 128-bit numbers.Ex. Supercomputer sort, internet router.

Which sorting method to use?• Insertion sort.

• Mergesort.

• Quicksort.

• Heapsort.

• LSD string sort.✓

Divide each word into eight 16-bit “chars”

216 = 65,536 counters.

Sort in 8 passes.

01110110111011011101...1011101

Page 50: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

50

String sorting challenge 2b

Problem. Sort huge array of random 128-bit numbers.Ex. Supercomputer sort, internet router.

Which sorting method to use?• Insertion sort.

• Mergesort.

• Quicksort.

• Heapsort.

• LSD string sort.✓

Divide each word into eight 16-bit “chars”

216 = 65,536 counters

LSD sort on leading 32 bits in 2 passes

Finish with insertion sort

Examines only ~25% of the data

Page 51: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

51

How to take a census in 1900s?

1880 Census. Took 1,500 people 7 years to manually process data.

Herman Hollerith. Developed counting and sorting machine to automate.• Use punch cards to record data (e.g., gender, age).

• Machine sorts one column at a time (into one of 12 bins).

• Typical question: how many women of age 20 to 30?

1890 Census. Finished months early and under budget!

punch card (12 holes per column)Hollerith tabulating machine and sorter

Page 52: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

52

How to get rich sorting in 1900s?

Punch cards. [1900s to 1950s]• Also useful for accounting, inventory, and business processes.

• Primary medium for data entry, storage, and processing.

Hollerith's company later merged with 3 others to form Computing Tabulating Recording Corporation (CTRC); the company was renamed in 1924.

IBM 80 Series Card Sorter (650 cards per minute)

Page 53: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

LSD string sort: a moment in history (1960s)

53

card punch punched cards card reader mainframe line printer

To sort a card deck

- start on right column

- put cards into hopper

- machine distributes into bins

- pick up cards (stable)

- move left one column

- continue until sorted

card sorter

Page 54: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

STRING SORTS

‣ Key-indexed counting‣ LSD radix sort‣ MSD radix sort‣ 3-way radix quicksort‣ Suffix arrays

Page 55: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

55

MSD string (radix) sort.• Partition array into R pieces according to first character

(use key-indexed counting).

• Recursively sort all strings that start with each character(key-indexed counts delineate subarrays to sort).

Most-significant-digit-first string sort

0 d a b

1 a d d

2 c a b

3 f a d

4 f e e

5 b a d

6 d a d

7 b e e

8 f e d

9 b e d

10 e b b

11 a c e

0 a d d

1 a c e

2 b a d

3 b e e

4 b e d

5 c a b

6 d a b

7 d a d

8 e b b

9 f a d

10 f e e

11 f e d

sort key

0 a d d

1 a c e

2 b a d

3 b e e

4 b e d

5 c a b

6 d a b

7 d a d

8 e b b

9 f a d

10 f e e

11 f e d

sort subarrays

recursively

count[]

a 0

b 2

c 5

d 6

e 8

f 9

- 12

Page 56: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

56

MSD string sort: example

shesellsseashellsbytheseashoretheshellsshesellsaresurelyseashells

arebyshesellsseashellsseashoreshellsshesellssurelyseashellsthethe

arebysellsseashellsseasellsseashellssheshoreshellsshesurelythethe

input

arebyseaseashellsseashellssellssellsshesheshellsshoresurelythethe

output

arebyseashellsseaseashellssellssellssheshoreshellsshesurelythethe

arebyseaseashellsseashellssellssellssheshoreshellsshesurelythethe

arebyseaseashellsseashellssellssellssheshoreshellsshesurelythethe

arebyseaseashellsseashellssellssellssheshoreshellsshesurelythethe

arebyseasseashellsseashellssellssellssheshellsshoreshesurelythethe

arebyseaseashellsseashellssellssellssheshellsshoreshesurelythethe

arebyseaseashellsseashellssellssellssheshellssheshoresurelythethe

arebyseaseashellsseashellssellssellssheshellssheshoresurelythethe

arebyseaseashellsseashellssellssellssheshellssheshoresurelythethe

arebyseaseashellsseashellssellssellssheshellssheshoresurelythethe

arebyseaseashellsseashellssellssellsshesheshellsshoresurelythethe

arebyseaseashellsseashellssellssellsshesheshellsshoresurelythethe

arebyseaseashellsseashellssellssellsshesheshellsshoresurelythethe

Trace of recursive calls for MSD string sort (no cuto! for small subarrays, subarrays of size 0 and 1 omitted)

end-of-stringgoes before any

char value

need to examineevery characterin equal keys

d

lo

hi

Page 57: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Variable-length strings

Treat strings as if they had an extra char at end (smaller than any char).

C strings. Have extra char '\0' at end ⇒ no extra work needed.57

0 s e a -1

1 s e a s h e l l s -1

2 s e l l s -1

3 s h e -1

4 s h e -1

5 s h e l l s -1

6 s h o r e -1

7 s u r e l y -1

she before shells

private static int charAt(String s, int d){ if (d < s.length()) return s.charAt(d); else return -1;}

why smaller?

Page 58: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

58

MSD string sort: Java implementation

public static void sort(String[] a){ aux = new String[a.length]; sort(a, aux, 0, a.length, 0);}

private static void sort(String[] a, String[] aux, int lo, int hi, int d){ if (hi <= lo) return; int[] count = new int[R+2]; for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++; for (int r = 0; r < R+1; r++) count[r+1] += count[r]; for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i]; for (int i = lo; i <= hi; i++) a[i] = aux[i - lo]; for (int r = 0; r < R; r++) sort(a, aux, lo + count[r], lo + count[r+1] - 1, d+1);}

key-indexed counting

sort R subarrays recursively

can recycle aux[] array

but not count[] array

Page 59: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

59

MSD string sort: potential for disastrous performanceObservation 1. Much too slow for small subarrays.• Each function call needs its own count[] array.

• ASCII (256 counts): 100x slower than copy pass for N = 2.

• Unicode (65,536 counts): 32,000x slower for N = 2.

Observation 2. Huge number of small subarraysbecause of recursion.

a[]

0 b

1 a

count[]

aux[]

0 a

1 b

Page 60: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

60

Cutoff to insertion sort

Solution. Cutoff to insertion sort for small subarrays.• Insertion sort, but start at dth character.

• Implement less() so that it compares starting at dth character.

public static void sort(String[] a, int lo, int hi, int d) { for (int i = lo; i <= hi; i++) for (int j = i; j > lo && less(a[j], a[j-1], d); j--) exch(a, j, j-1); }

private static boolean less(String v, String w, int d) { return v.substring(d).compareTo(w.substring(d)) < 0; }

in Java, forming and comparing

substrings is faster than directly

comparing chars with charAt()

Page 61: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Number of characters examined.• MSD examines just enough characters to sort the keys.

• Number of characters examined depends on keys.

• Can be sublinear in input size!

61

MSD string sort: performance

1EIO4021HYL4901ROZ5722HXE7342IYE2302XOR8463CDB5733CVP7203IGJ3193KNA3823TAV8794CQP7814QGI2844YHV229

1DNB3771DNB3771DNB3771DNB3771DNB3771DNB3771DNB3771DNB3771DNB3771DNB3771DNB3771DNB3771DNB3771DNB377

Non-randomwith duplicates(nearly linear)

Random(sublinear)

Worst case(linear)

Characters examined by MSD string sort

arebyseaseashellsseashellssellssellsshesheshellsshoresurelythethe

compareTo() based sorts

can also be sublinear!

Page 62: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Summary of the performance of sorting algorithms

Frequency of operations.

62

algorithm guarantee random extra space stable? operations on keys

insertion sort N2 / 2 N2 / 4 1 yes compareTo()

mergesort N lg N N lg N N yes compareTo()

quicksort 1.39 N lg N * 1.39 N lg N c lg N no compareTo()

heapsort 2 N lg N 2 N lg N 1 no compareTo()

LSD † 2 N W 2 N W N + R yes charAt()

MSD ‡ 2 N W N log R N N + D R yes charAt()

* probabilistic

† fixed-length W keys

‡ average-length W keys

D = function-call stack depth

(length of longest prefix match)

Page 63: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

63

MSD string sort vs. quicksort for strings

Disadvantages of MSD string sort.• Accesses memory "randomly" (cache inefficient).

• Inner loop has a lot of instructions.

• Extra space for count[].

• Extra space for aux[].

Disadvantage of quicksort.• Linearithmic number of string compares (not linear).

• Has to rescan many characters in keys with long prefix matches.

Goal. Combine advantages of MSD and quicksort.

Page 64: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

STRING SORTS

‣ Key-indexed counting‣ LSD radix sort‣ MSD radix sort‣ 3-way radix quicksort‣ Suffix arrays

Page 65: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

she

sells

seashells

by

the

sea

shore

the

shells

she

sells

are

surely

seashells

Overview. Do 3-way partitioning on the dth character.

• Less overhead than R-way partitioning in MSD string sort.

• Does not re-examine characters equal to the partitioning char(but does re-examine characters not equal to the partitioning char).

65

3-way string quicksort (Bentley and Sedgewick, 1997)

partitioning item

use first character to

partition into

"less", "equal", and "greater"

subarraysrecursively sort subarrays,

excluding first character

for middle subarray

by

are

seashells

she

seashells

sea

shore

surely

shells

she

sells

sells

the

the

Page 66: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

she

sells

seashells

by

the

sea

shore

the

shells

she

sells

are

surely

seashells

66

3-way string quicksort: trace of recursive calls

by

are

seashells

she

seashells

sea

shore

surely

shells

she

sells

sells

the

the

Trace of first few recursive calls for 3-way string quicksort (subarrays of size 1 not shown)

partitioning item

are

by

seashells

she

seashells

sea

shore

surely

shells

she

sells

sells

the

the

are

by

seashells

sea

seashells

sells

sells

shells

she

surely

shore

she

the

the

are

by

seashells

sells

seashells

sea

sells

shells

she

surely

shore

she

the

the

Page 67: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

67

3-way string quicksort: Java implementation

private static void sort(String[] a) { sort(a, 0, a.length - 1, 0); }

private static void sort(String[] a, int lo, int hi, int d) { if (hi <= lo) return; int lt = lo, gt = hi; int v = charAt(a[lo], d); int i = lo + 1; while (i <= gt) { int t = charAt(a[i], d); if (t < v) exch(a, lt++, i++); else if (t > v) exch(a, i, gt--); else i++; }

sort(a, lo, lt-1, d); if (v >= 0) sort(a, lt, gt, d+1); sort(a, gt+1, hi, d); }

3-way partitioning

(using dth character)

sort 3 subarrays recursively

to handle variable-length strings

Page 68: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Standard quicksort.• Uses ~ 2 N ln N string compares on average.

• Costly for keys with long common prefixes (and this is a common case!)

3-way string (radix) quicksort.• Uses ~ 2 N ln N character compares on average for random strings.

• Avoids re-comparing long common prefixes.

68

3-way string quicksort vs. standard quicksort

Jon L. Bentley* Robert Sedgewick#

Abstract We present theoretical algorithms for sorting and

searching multikey data, and derive from them practical C implementations for applications in which keys are charac- ter strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algo- rithms date back at least to the 1960s but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching.

1. Introduction Section 2 briefly reviews Hoare’s [9] Quicksort and

binary search trees. We emphasize a well-known isomor- phism relating the two, and summarize other basic facts.

The multikey algorithms and data structures are pre- sented in Section 3. Multikey Quicksort orders a set of II vectors with k components each. Like regular Quicksort, it partitions its input into sets less than and greater than a given value; like radix sort, it moves on to the next field once the current input is known to be equal in the given field. A node in a ternary search tree represents a subset of vectors with a partitioning value and three pointers: one to lesser elements and one to greater elements (as in a binary search tree) and one to equal elements, which are then pro- cessed on later fields (as in tries). Many of the structures and analyses have appeared in previous work, but typically as complex theoretical constructions, far removed from practical applications. Our simple framework opens the door for later implementations.

The algorithms are analyzed in Section 4. Many of the analyses are simple derivations of old results.

Section 5 describes efficient C programs derived from the algorithms. The first program is a sorting algorithm

Fast Algorithms for Sorting and Searching Strings

that is competitive with the most efficient string sorting programs known. The second program is a symbol table implementation that is faster than hashing, which is com- monly regarded as the fastest symbol table implementa- tion. The symbol table implementation is much more space-efficient than multiway trees, and supports more advanced searches.

In many application programs, sorts use a Quicksort implementation based on an abstract compare operation, and searches use hashing or binary search trees. These do not take advantage of the properties of string keys, which are widely used in practice. Our algorithms provide a nat- ural and elegant way to adapt classical algorithms to this important class of applications.

Section 6 turns to more difficult string-searching prob- lems. Partial-match queries allow “don’t care” characters (the pattern “so.a”, for instance, matches soda and sofa). The primary result in this section is a ternary search tree implementation of Rivest’s partial-match searching algo- rithm, and experiments on its performance. “Near neigh- bor” queries locate all words within a given Hamming dis- tance of a query word (for instance, code is distance 2 from soda). We give a new algorithm for near neighbor searching in strings, present a simple C implementation, and describe experiments on its efficiency.

Conclusions are offered in Section 7.

2. Background Quicksort is a textbook divide-and-conquer algorithm.

To sort an array, choose a partitioning element, permute the elements such that lesser elements are on one side and greater elements are on the other, and then recursively sort the two subarrays. But what happens to elements equal to the partitioning value? Hoare’s partitioning method is binary: it places lesser elements on the left and greater ele- ments on the right, but equal elements may appear on either side.

* Bell Labs, Lucent Technologies, 700 Mountam Avenue, Murray Hill. NJ 07974; [email protected].

# Princeton University. Princeron. NJ. 08514: [email protected].

Algorithm designers have long recognized the desir- irbility and difficulty of a ternary partitioning method. Sedgewick [22] observes on page 244: “Ideally, we would llke to get all [equal keys1 into position in the file, with all

360

Page 69: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

69

3-way string quicksort vs. MSD string sort

MSD string sort.• Is cache-inefficient.

• Too much memory storing count[].

• Too much overhead reinitializing count[] and aux[].

3-way string quicksort.• Has a short inner loop.

• Is cache-friendly.

• Is in-place.

Bottom line. 3-way string quicksort is the method of choice for sorting strings.

library of Congress call numbers

Page 70: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Summary of the performance of sorting algorithms

Frequency of operations.

70

algorithm guarantee random extra space stable? operations on keys

insertion sort N2 / 2 N2 / 4 1 yes compareTo()

mergesort N lg N N lg N N yes compareTo()

quicksort 1.39 N lg N * 1.39 N lg N c lg N no compareTo()

heapsort 2 N lg N 2 N lg N 1 no compareTo()

LSD † 2 N W 2 N W N + R yes charAt()

MSD ‡ 2 N W N log R N N + D R yes charAt()

3-way string quicksort 1.39 W N lg N * 1.39 N lg N log N + W no charAt()

* probabilistic

† fixed-length W keys

‡ average-length W keys

Page 71: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

STRING SORTS

‣ Key-indexed counting‣ LSD radix sort‣ MSD radix sort‣ 3-way radix quicksort‣ Suffix arrays

Page 72: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

% more tale.txtit was the best of timesit was the worst of timesit was the age of wisdomit was the age of foolishnessit was the epoch of beliefit was the epoch of incredulityit was the season of lightit was the season of darknessit was the spring of hopeit was the winter of despair...

72

Keyword-in-context search

Given a text of N characters, preprocess it to enable fast substring search(find all occurrences of query string context).

Applications. Linguistics, databases, web search, word processing, ….

% java KWIC tale.txt 15searcho st giless to search for contrabandher unavailing search for your fathele and gone in search of her husbandt provinces in search of impoverishe dispersing in search of other carrin that bed and search the straw hold

better thingt is a far far better thing that i do than some sense of better things else forgottewas capable of better things mr carton ent

characters of

surrounding context

Page 73: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

73

Suffix sort

a a c a a g t t t a c a a g c

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

input string

0 a a c a a g t t t a c a a g c1 a c a a g t t t a c a a g c2 c a a g t t t a c a a g c3 a a g t t t a c a a g c4 a g t t t a c a a g c5 g t t t a c a a g c6 t t t a c a a g c7 t t a c a a g c8 t a c a a g c9 a c a a g c10 c a a g c11 a a g c12 a g c13 g c14 c

form suffixes

0 a a c a a g t t t a c a a g c11 a a g c3 a a g t t t a c a a g c9 a c a a g c1 a c a a g t t t a c a a g c12 a g c4 a g t t t a c a a g c14 c10 c a a g c2 c a a g t t t a c a a g c13 g c5 g t t t a c a a g c8 t a c a a g c7 t t a c a a g c6 t t t a c a a g c

sort suffixes to bring repeated substrings together

Page 74: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

• Preprocess: suffix sort the text.

• Query: binary search for query; scan until mismatch.

74

Keyword-in-context search: suffix-sorting solution

⋮632698 s e a l e d _ m y _ l e t t e r _ a n d _ …

713727 s e a m s t r e s s _ i s _ l i f t e d _ …

660598 s e a m s t r e s s _ o f _ t w e n t y _ …

67610 s e a m s t r e s s _ w h o _ w a s _ w i …

4430 s e a r c h _ f o r _ c o n t r a b a n d …

42705 s e a r c h _ f o r _ y o u r _ f a t h e …

499797 s e a r c h _ o f _ h e r _ h u s b a n d …

182045 s e a r c h _ o f _ i m p o v e r i s h e …

143399 s e a r c h _ o f _ o t h e r _ c a r r i …

411801 s e a r c h _ t h e _ s t r a w _ h o l d …

158410 s e a r e d _ m a r k i n g _ a b o u t _ …

691536 s e a s _ a n d _ m a d a m e _ d e f a r …

536569 s e a s e _ a _ t e r r i b l e _ p a s s …

484763 s e a s e _ t h a t _ h a d _ b r o u g h …⋮

KWIC search for "search" in Tale of Two Cities

Page 75: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

75

Longest repeated substring

Given a string of N characters, find the longest repeated substring.

Applications. Bioinformatics, cryptanalysis, data compression, ...

a a c a a g t t t a c a a g c a t g a t g c t g t a c t a g g a g a g t t a t a c t g g t c g t c a a a c c t g a a c c t a a t c c t t g t g t g t a c a c a c a c t a c t a c t g t c g t c g t c a t a t a t c g a g a t c a t c g a a c c g g a a g g c c g g a c a a g g c g g g g g g t a t a g a t a g a t a g a c c c c t a g a t a c a c a t a c a t a g a t c t a g c t a g c t a g c t c a t c g a t a c a c a c t c t c a c a c t c a a g a g t t a t a c t g g t c a a c a c a c t a c t a c g a c a g a c g a c c a a c c a g a c a g a a a a a a a a c t c t a t a t c t a t a a a a

Page 76: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

76

Longest repeated substring: a musical application

Visualize repetitions in music. http://www.bewitched.com

Mary Had a Little Lamb

Bach's Goldberg Variations

Page 77: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

77

Longest repeated substring

Given a string of N characters, find the longest repeated substring.

Brute-force algorithm.• Try all indices i and j for start of possible match.

• Compute longest common prefix (LCP) for each pair.

Analysis. Running time ≤ D N 2 , where D is length of longest match.

i

a a c a a g t t t a c a a g c

j

Page 78: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

78

Longest repeated substring: a sorting solution

a a c a a g t t t a c a a g c

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

input string

0 a a c a a g t t t a c a a g c1 a c a a g t t t a c a a g c2 c a a g t t t a c a a g c3 a a g t t t a c a a g c4 a g t t t a c a a g c5 g t t t a c a a g c6 t t t a c a a g c7 t t a c a a g c8 t a c a a g c9 a c a a g c10 c a a g c11 a a g c12 a g c13 g c14 c

form suffixes

0 a a c a a g t t t a c a a g c11 a a g c3 a a g t t t a c a a g c9 a c a a g c1 a c a a g t t t a c a a g c12 a g c4 a g t t t a c a a g c14 c10 c a a g c2 c a a g t t t a c a a g c13 g c5 g t t t a c a a g c8 t a c a a g c7 t t a c a a g c6 t t t a c a a g c

sort suffixes to bring repeated substrings together

compute longest prefix between adjacent suffixes

a a c a a g t t t a c a a g c

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Page 79: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

public String lrs(String s) { int N = s.length();

String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = s.substring(i, N);

Arrays.sort(suffixes);

String lrs = ""; for (int i = 0; i < N-1; i++) { int len = lcp(suffixes[i], suffixes[i+1]); if (len > lrs.length()) lrs = suffixes[i].substring(0, len); } return lrs; }

79

Longest repeated substring: Java implementation

% java LRS < mobydick.txt,- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th

create suffixes

(linear time and space)

sort suffixes

find LCP between

adjacent suffixes in

sorted order

Page 80: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

80

Sorting challenge

Problem. Five scientists A, B, C, D, and E are looking for long repeated substring in a genome with over 1 billion nucleotides.

• A has a grad student do it by hand.

• B uses brute force (check all pairs).

• C uses suffix sorting solution with insertion sort.

• D uses suffix sorting solution with LSD string sort.

• E uses suffix sorting solution with 3-way string quicksort.

Q. Which one is more likely to lead to a cure cancer?

but only if LRS is not long (!)

Page 81: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

input file characters brute suffix sort length of LRS

LRS.java 2,162 0.6 sec 0.14 sec 73

amendments.txt 18,369 37 sec 0.25 sec 216

aesop.txt 191,945 1.2 hours 1.0 sec 58

mobydick.txt 1.2 million 43 hours † 7.6 sec 79

chromosome11.txt 7.1 million 2 months † 61 sec 12,567

pi.txt 10 million 4 months † 84 sec 14

pipi.txt 20 million forever † ??? 10 million

81

Longest repeated substring: empirical analysis

† estimated

Page 82: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Bad input: longest repeated substring very long.• Ex: same letter repeated N times.

• Ex: two copies of the same Java codebase.

LRS needs at least 1 + 2 +3 + ... + D character compares,where D = length of longest match

Running time. Quadratic (or worse) in the length of the longest match.82

Suffix sorting: worst-case input

0 t w i n s t w i n s1 w i n s t w i n s2 i n s t w i n s3 n s t w i n s4 s t w i n s5 t w i n s6 w i n s7 i n s8 n s9 s

form suffixes

9 i n s8 i n s t w i n s7 n s 6 n s t w i n s5 s 4 s t w i n s3 t w i n s 2 t w i n s t w i n s1 w i n s 0 w i n s t w i n s

sorted suffixes

Page 83: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

83

Suffix sorting challenge

Problem. Suffix sort an arbitrary string of length N.

Q. What is worst-case running time of best algorithm for problem?• Quadratic.

• Linearithmic.

• Linear.

• Nobody knows.suffix trees (beyond our scope)✓Manber's algorithm✓

Page 84: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

84

Suffix sorting in linearithmic time

Manber's MSD algorithm overview. • Phase 0: sort on first character using key-indexed counting sort.

• Phase i: given array of suffixes sorted on first 2i-1 characters,

create array of suffixes sorted on first 2i characters.

Worst-case running time. N lg N.

• Finishes after lg N phases.

• Can perform a phase in linear time. (!) [ahead]

Page 85: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

17 01 a b a a a a b c b a b a a a a a 016 a 03 a a a a b c b a b a a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 06 a b c b a b a a a a a 015 a a 014 a a a 013 a a a a 012 a a a a a 010 a b a a a a a 00 b a b a a a a b c b a b a a a a a 09 b a b a a a a a 011 b a a a a a 07 b c b a b a a a a a 02 b a a a a b c b a b a a a a a 08 c b a b a a a a a 0

85

Linearithmic suffix sort example: phase 0

0 b a b a a a a b c b a b a a a a a 01 a b a a a a b c b a b a a a a a 02 b a a a a b c b a b a a a a a 03 a a a a b c b a b a a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 06 a b c b a b a a a a a 07 b c b a b a a a a a 08 c b a b a a a a a 09 b a b a a a a a 010 a b a a a a a 011 b a a a a a 012 a a a a a 013 a a a a 014 a a a 015 a a 016 a 017 0

key-indexed counting sort (first character)

sorted

original suffixes

Page 86: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

86

Linearithmic suffix sort example: phase 1

17 016 a 012 a a a a a 03 a a a a b c b a b a a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 013 a a a a 015 a a 014 a a a 06 a b c b a b a a a a a 01 a b a a a a b c b a b a a a a a 010 a b a a a a a 00 b a b a a a a b c b a b a a a a a 09 b a b a a a a a 011 b a a a a a 02 b a a a a b c b a b a a a a a 07 b c b a b a a a a a 08 c b a b a a a a a 0

0 b a b a a a a b c b a b a a a a a 01 a b a a a a b c b a b a a a a a 02 b a a a a b c b a b a a a a a 03 a a a a b c b a b a a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 06 a b c b a b a a a a a 07 b c b a b a a a a a 08 c b a b a a a a a 09 b a b a a a a a 010 a b a a a a a 011 b a a a a a 012 a a a a a 013 a a a a 014 a a a 015 a a 016 a 017 0

sorted

index sort (first two characters)original suffixes

Page 87: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

87

Linearithmic suffix sort example: phase 2

17 016 a 015 a a 014 a a a 03 a a a a b c b a b a a a a a 012 a a a a a 013 a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 01 a b a a a a b c b a b a a a a a 010 a b a a a a a 06 a b c b a b a a a a a 02 b a a a a b c b a b a a a a a 0 a 011 b a a a a a 00 b a b a a a a b c b a b a a a a a 09 b a b a a a a a 07 b c b a b a a a a a 08 c b a b a a a a a 0

0 b a b a a a a b c b a b a a a a a 01 a b a a a a b c b a b a a a a a 02 b a a a a b c b a b a a a a a 03 a a a a b c b a b a a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 06 a b c b a b a a a a a 07 b c b a b a a a a a 08 c b a b a a a a a 09 b a b a a a a a 010 a b a a a a a 011 b a a a a a 012 a a a a a 013 a a a a 014 a a a 015 a a 016 a 017 0

sorted

index sort (first four characters)original suffixes

Page 88: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

88

Linearithmic suffix sort example: phase 3

17 016 a 015 a a 014 a a a 013 a a a a 012 a a a a a 03 a a a a b c b a b a a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 010 a b a a a a a 01 a b a a a a b c b a b a a a a a 06 a b c b a b a a a a a 011 b a a a a a 02 b a a a a b c b a b a a a a a 0 a 09 b a b a a a a a 00 b a b a a a a b c b a b a a a a a 07 b c b a b a a a a a 08 c b a b a a a a a 0

0 b a b a a a a b c b a b a a a a a 01 a b a a a a b c b a b a a a a a 02 b a a a a b c b a b a a a a a 03 a a a a b c b a b a a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 06 a b c b a b a a a a a 07 b c b a b a a a a a 08 c b a b a a a a a 09 b a b a a a a a 010 a b a a a a a 011 b a a a a a 012 a a a a a 013 a a a a 014 a a a 015 a a 016 a 017 0

finished (no equal keys)

index sort (first eight characters)original suffixes

Page 89: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

17 016 a 015 a a 014 a a a 03 a a a a b c b a b a a a a a 012 a a a a a 013 a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 01 a b a a a a b c b a b a a a a a 010 a b a a a a a 06 a b c b a b a a a a a 02 b a a a a b c b a b a a a a a 0 a 011 b a a a a a 00 b a b a a a a b c b a b a a a a a 09 b a b a a a a a 07 b c b a b a a a a a 08 c b a b a a a a a 0

0 b a b a a a a b c b a b a a a a a 01 a b a a a a b c b a b a a a a a 02 b a a a a b c b a b a a a a a 03 a a a a b c b a b a a a a a 04 a a a b c b a b a a a a a 05 a a b c b a b a a a a a 06 a b c b a b a a a a a 07 b c b a b a a a a a 08 c b a b a a a a a 09 b a b a a a a a 010 a b a a a a a 011 b a a a a a 012 a a a a a 013 a a a a 014 a a a 015 a a 016 a 017 0

89

Constant-time string compare by indexing into inverse

0 + 4 = 4

9 + 4 = 13

suffixes4[13] ≤ suffixes4[4] (because inverse[13] < inverse[4])so suffixes8[9] ≤ suffixes8[0]

0 14

1 9

2 12

3 4

4 7

5 8

6 11

7 16

8 17

9 15

10 10

11 13

12 5

13 6

14 3

15 2

16 1

17 0

inverseindex sort (first four characters)original suffixes

Page 90: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

90

Suffix sort: experimental results

† estimated

algorithm mobydick.txt aesopaesop.txt

brute-force 36.000 † 4000 †

quicksort 9.5 167

LSD not fixed length not fixed length

MSD 395 out of memory

MSD with cutoff 6.8 162

3-way string quicksort 2.8 400

Manber MSD 17 8.5

time to suffix sort (seconds)

Page 91: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

String sorting summary

We can develop linear-time sorts.• Key compares not necessary for string keys.

• Use characters as index in an array.

We can develop sublinear-time sorts.• Should measure amount of data in keys, not number of keys.

• Not all of the data has to be examined.

3-way string quicksort is asymptotically optimal.• 1.39 N lg N chars for random data.

Long strings are rarely random in practice.• Goal is often to learn the structure!

• May need specialized algorithms.

91

Page 92: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

STRING SORTS, TRIES

‣ String sorts

‣ Tries‣ R-way tries‣ Ternary search tries‣ Character-based operations

Page 93: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Review: summary of the performance of symbol-table implementationsOrder of growth of the frequency of operations.

Q. Can we do better?A. Yes, if we can avoid examining the entire key, as with string sorting.

93

implementation

typical caseordered operations

implementation

search insert deleteoperations on keys

red-black BST log N log N log N yes compareTo()

hash table 1 † 1 † 1 † noequals()

hashcode()

• † under uniform hashing assumption

Page 94: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

String symbol table. Symbol table specialized to string keys.

Goal. Faster than hashing, more flexible than BSTs.

94

String symbol table basic API

public class StringST<Value> public class StringST<Value>

StringST()StringST() create an empty symbol table

void put(String key, Value val)put(String key, Value val) put key-value pair into the symbol table

Value get(String key)get(String key) return value paired with given key

void delete(String key)delete(String key) delete key and corresponding value

⋮⋮

Page 95: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

95

String symbol table implementations cost summary

Challenge. Efficient performance for string keys.

Parameters

•N = number of strings

•L = length of string

•R = radix

file size words distinct

moby.txt 1.2 MB 210 K 32 K

actors.txt 82 MB 11.4 M 900 K

character accesses (typical case)character accesses (typical case)character accesses (typical case)character accesses (typical case) dedupdedup

implementationsearch

hit

search

missinsert

space

(references)moby.txt actors.txt

red-black BST L + c lg 2 N c lg 2 N c lg 2 N 4N 1.40 97.4

hashing

(linear probing)L L L 4N to 16N 0.76 40.6

Page 96: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

‣ R-way tries‣ Ternary search tries‣ Character-based operations

TRIES

Page 97: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Tries. [from retrieval, but pronounced "try"]• Store characters in nodes (not keys).

• Each node has R children, one for each possible character.

• Store values in nodes corresponding to last characters in keys.

97

Tries

e

r

e

a l

l

s

b

y

o

h

e

t

7

50

3

1

6

4

s

l

l

e

h

s

root

link to trie for all keys

that start with slink to trie for all keys

that start with she

value for she in node

corresponding to last

key character

key value

by 4

sea 6

sells 1

she 0

shells 3

shore 7

the 5

for now, we do not

draw null links

Page 98: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Follow links corresponding to each character in the key.• Search hit: node where search ends has a non-null value.

• Search miss: reach a null link or node where search ends has null value.

98

Search in a trie

e

r

get("shells")

e

a l

l

s

b

y

o

h

e

t

7

50

3

1

6

4

ss

ll

ll

ee

hh

ss

return value associated

with last key character

(return 3)

3

Page 99: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Follow links corresponding to each character in the key.• Search hit: node where search ends has a non-null value.

• Search miss: reach a null link or node where search ends has null value.

99

Search in a trie

e

r

get("she")

e

a l

l

s

b

y

o

h

e

t

7

50

3

1

6

4

s

l

l

ee

hh

ss

search may terminated

at an intermediate node

(return 0)

0

Page 100: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Follow links corresponding to each character in the key.• Search hit: node where search ends has a non-null value.

• Search miss: reach a null link or node where search ends has null value.

100

Search in a trie

e

r

get("shell")

e

a l

l

s

b

y

o

h

e

t

7

50

3

1

6

4

s

ll

ll

ee

hh

ss

no value associated

with last key character

(return null)

Page 101: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Follow links corresponding to each character in the key.• Search hit: node where search ends has a non-null value.

• Search miss: reach a null link or node where search ends has null value.

101

Search in a trie

e

r

get("shelter")

e

a l

l

s

b

y

o

h

e

t

7

50

3

1

6

4

s

l

ll

ee

hh

ss

no link to 't'

(return null)

Page 102: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Follow links corresponding to each character in the key.• Encounter a null link: create new node.

• Encounter the last character of the key: set value in that node.

102

Insertion into a trie

e

r

put("shore", 7)

e

a l

l

s

e

l

s

b

y

l

o

h

e

t

hh

ss

7

50

3

1

6

4

Page 103: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Trie construction demo

103

trie

Page 104: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

e

Trie construction demo

104

put("she", 0)

h

s

0

value is in node

corresponding to

last character

key is sequence

of characters from

root to value

Page 105: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

e

Trie construction demo

105

trie

h

s

0

Page 106: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

h

e

Trie construction demo

106

trie

s

0

Page 107: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

h

e

Trie construction demo

107

put("sells", 1)

s

l

l

e

ss

1

0

Page 108: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

h

e

Trie construction demo

108

trie

l

l

s

s

e

1

0

Page 109: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

h

e

Trie construction demo

109

trie

l

l

s

s

e

0

1

Page 110: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

h

ea

Trie construction demo

110

put("sea", 2)

l

l

s

ee

ss

2

1

0

Page 111: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

h

ea

Trie construction demo

111

trie

l

l

s

s

e

1

2 0

Page 112: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

a

Trie construction demo

112

put("shells", 3)

l

l

s

e

s

l

l

ee

hh

ss

3

1

2 0

Page 113: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

a

Trie construction demo

113

trie

l

l

s

l

s

s

l

he

e

3

1

2 0

Page 114: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

y

b

a

Trie construction demo

114

put("by", 4)

l

l

s

l

s

s

l

he

e

4

3

1

2 0

Page 115: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

b

y

a

Trie construction demo

115

trie

l

l

s

l

s

s

l

he

e

3

1

2

4

0

Page 116: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

b

y

a

Trie construction demo

116

put("the", 5)

l

l

s

l

s

s

l

he

e e

h

t

5

3

1

2

4

0

Page 117: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

a

Trie construction demo

117

trie

e

l

l

s

l

s

b

y

s

l

h h

t

e e 5

3

1

2

4

0

Page 118: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

2a

Trie construction demo

118

put("sea", 6)

l

l

s

e

l

s

b

y

l

h h

e

t

a

ee

ss

6

overwrite

old value with

new value

5

3

1

4

0

Page 119: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Trie construction demo

119

trie

e

a l

l

s

e

l

s

b

y

s

l

h h

e

t

5

3

1

6

4

0

Page 120: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Trie construction demo

120

trie

e

a l

l

s

e

l

s

b

y

s

l

h h

e

t

3

1

6

4

0 5

Page 121: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

e

r

Trie construction demo

121

put("shore", 7)

e

a l

l

s

e

l

s

b

y

l

o

h

e

t

hh

ss

7

5

3

1

6

4

0

Page 122: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Trie construction demo

122

trie

e

a l

l

s

e

l

s

b

y

s

l

o

r

e

h

t

h

e 5

7

3

1

6

4

0

Page 123: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

123

Trie representation: Java implementation

Node. A value, plus references to R nodes.

private static class Node{ private Object value; private Node[] next = new Node[R];}

Trie representation

each node hasan array of links

and a value

characters are implicitlydefined by link index

s

h

e 0

e

l

l

s 1

a

s

h

e

e

l

l

s

a0

1

22

neither keys nor

characters are

explicitly stored

use Object instead of Value since

no generic array creation in Java

Page 124: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

public class TrieST<Value>{ private static final int R = 256; private Node root;

private static class Node { /* see previous slide */ } public void put(String key, Value val) { root = put(root, key, val, 0); }

private Node put(Node x, String key, Value val, int d) { if (x == null) x = new Node(); if (d == key.length()) { x.val = val; return x; } char c = key.charAt(d); x.next[c] = put(x.next[c], key, val, d+1); return x; }

124

R-way trie: Java implementation

extended ASCII

Page 125: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

public boolean contains(String key) { return get(key) != null; } public Value get(String key) { Node x = get(root, key, 0); if (x == null) return null; return (Value) x.val; }

private Node get(Node x, String key, int d) { if (x == null) return null; if (d == key.length()) return x; char c = key.charAt(d); return get(x.next[c], key, d+1); }

}

125

R-way trie: Java implementation (continued)

cast needed

Page 126: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Trie performance

Search hit. Need to examine all L characters for equality.

Search miss.• Could have mismatch on first character.

• Typical case: examine only a few characters (sublinear).

Space. R null links at each leaf.(but sublinear space possible if many short strings share common prefixes)

Bottom line. Fast search hit and even faster search miss, but wastes space.

126

Page 127: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

To delete a key-value pair:• Find the node corresponding to key and set value to null.

• If that node has all null links, remove that node (and recur).

127

Deletion in an R-way trie

e

r

delete("shells")

e

a l

l

s

b

y

o

h

e

t

7

50

3

1

6

4

s

set value to nulls

l

l

e

h

s

null value and links

(delete node)

Page 128: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

128

String symbol table implementations cost summary

R-way trie.• Method of choice for small R.

• Too much memory for large R.

Challenge. Use less memory, e.g., 65,536-way trie for Unicode!

character accesses (typical case)character accesses (typical case)character accesses (typical case)character accesses (typical case) dedupdedup

implementationsearch

hit

search

missinsert

space

(references)moby.txt actors.txt

red-black BST L + c lg 2 N c lg 2 N c lg 2 N 4N 1.40 97.4

hashing

(linear probing)L L L 4N to 16N 0.76 40.6

R-way trie L log R N L (R+1) N 1.12out of

memory

Page 129: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

129

Digression: out of memory?

“ 640 K ought to be enough for anybody. ” — (mis)attributed to Bill Gates, 1981 (commenting on the amount of RAM in personal computers)

“ 64 MB of RAM may limit performance of some Windows XP features; therefore, 128 MB or higher is recommended for best performance. ” — Windows XP manual, 2002

“ 64 bit is coming to desktops, there is no doubt about that. But apart from Photoshop, I can't think of desktop applications where you would need more than 4GB of physical memory, which is what you have to have in order to benefit from this technology. Right now, it is costly. ” — Bill Gates, 2003

Page 130: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Digression: out of memory?

A short (approximate) history.

130

machine yearaddress

bits

addressable

memory

typical actual

memorycost

PDP-8 1960s 12 6 KB 6 KB $16K

PDP-10 1970s 18 256 KB 256 KB $1M

IBM S/360 1970s 24 4 MB 512 KB $1M

VAX 1980s 32 4 GB 1 MB $1M

Pentium 1990s 32 4 GB 1 GB $1K

Xeon 2000s 64 enough 4 GB $100

?? future 128+ enough enough $1

“ 512-bit words ought to be enough for anybody. ” — Kevin Wayne, 1995

Page 131: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

‣ R-way tries‣ Ternary search tries‣ Character-based operations

TRIES

Page 132: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

132

Ternary search tries

• Store characters and values in nodes (not keys).

• Each node has three children: smaller (left), equal (middle), larger (right).

Jon L. Bentley* Robert Sedgewick#

Abstract We present theoretical algorithms for sorting and

searching multikey data, and derive from them practical C implementations for applications in which keys are charac- ter strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algo- rithms date back at least to the 1960s but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching.

1. Introduction Section 2 briefly reviews Hoare’s [9] Quicksort and

binary search trees. We emphasize a well-known isomor- phism relating the two, and summarize other basic facts.

The multikey algorithms and data structures are pre- sented in Section 3. Multikey Quicksort orders a set of II vectors with k components each. Like regular Quicksort, it partitions its input into sets less than and greater than a given value; like radix sort, it moves on to the next field once the current input is known to be equal in the given field. A node in a ternary search tree represents a subset of vectors with a partitioning value and three pointers: one to lesser elements and one to greater elements (as in a binary search tree) and one to equal elements, which are then pro- cessed on later fields (as in tries). Many of the structures and analyses have appeared in previous work, but typically as complex theoretical constructions, far removed from practical applications. Our simple framework opens the door for later implementations.

The algorithms are analyzed in Section 4. Many of the analyses are simple derivations of old results.

Section 5 describes efficient C programs derived from the algorithms. The first program is a sorting algorithm

Fast Algorithms for Sorting and Searching Strings

that is competitive with the most efficient string sorting programs known. The second program is a symbol table implementation that is faster than hashing, which is com- monly regarded as the fastest symbol table implementa- tion. The symbol table implementation is much more space-efficient than multiway trees, and supports more advanced searches.

In many application programs, sorts use a Quicksort implementation based on an abstract compare operation, and searches use hashing or binary search trees. These do not take advantage of the properties of string keys, which are widely used in practice. Our algorithms provide a nat- ural and elegant way to adapt classical algorithms to this important class of applications.

Section 6 turns to more difficult string-searching prob- lems. Partial-match queries allow “don’t care” characters (the pattern “so.a”, for instance, matches soda and sofa). The primary result in this section is a ternary search tree implementation of Rivest’s partial-match searching algo- rithm, and experiments on its performance. “Near neigh- bor” queries locate all words within a given Hamming dis- tance of a query word (for instance, code is distance 2 from soda). We give a new algorithm for near neighbor searching in strings, present a simple C implementation, and describe experiments on its efficiency.

Conclusions are offered in Section 7.

2. Background Quicksort is a textbook divide-and-conquer algorithm.

To sort an array, choose a partitioning element, permute the elements such that lesser elements are on one side and greater elements are on the other, and then recursively sort the two subarrays. But what happens to elements equal to the partitioning value? Hoare’s partitioning method is binary: it places lesser elements on the left and greater ele- ments on the right, but equal elements may appear on either side.

* Bell Labs, Lucent Technologies, 700 Mountam Avenue, Murray Hill. NJ 07974; [email protected].

# Princeton University. Princeron. NJ. 08514: [email protected].

Algorithm designers have long recognized the desir- irbility and difficulty of a ternary partitioning method. Sedgewick [22] observes on page 244: “Ideally, we would llke to get all [equal keys1 into position in the file, with all

360

Page 133: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

• Store characters and values in nodes (not keys).

• Each node has three children: smaller (left), equal (middle), larger (right).

133

Ternary search tries

TST representation of a trie

each node hasthree links

link to TST for all keysthat start with s

link to TST for all keysthat start witha letter before s

t

h

e 8

a

r

e 12

s

h u

e 10

e

l

l

s 11

l

l

s 15

r 0

e

l

y 13

o

7

r

e

b

y 4

a 14

t

h

e 8

a

r

e 12

s

hu

e 10

e

l

l

s 11

l

l

s 15

r 0

e

l

y 13

o

7

r

e

b

y 4

a14

Page 134: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Follow links corresponding to each character in the key.• If less, take left link; if greater, take right link.

• If equal, take the middle link and move to the next key character.

Search hit. Node where search ends has a non-null value. Search miss. Reach a null link or node where search ends has null value.

134

Search in a TST

TST search example

return valueassociated with

last key character

match: take middle link,move to next char

mismatch: take left or right link, do not move to next char

t

h

e 8

a

r

e 12

s

hu

e 10

e

l

l

s 11

l

l

s 15

r

e

l

y 13

o

7

r

e

b

y 4

a14

get("sea")

Page 135: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

135

Search in a TST

o

r

e 7

t

h

e 5

b

y 4

a

get("sea")

e

h

s

e

l

l

s 1

6

l

s

l

3

0

a

l

e

h

s

return value associated

with last key character

6

Page 136: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

136

Search in a TST

o

r

e 7

t

h

e 5

b

y 4

a

get("shelter")

e

h

s

e

l

l

s 1

6

l

s

l

3

0e

h

s

l

l

no link to 't'

(return null)

Page 137: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Ternary search trie insertion demo

137

ternary search trie

Page 138: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Ternary search trie insertion demo

138

put("she", 0)

h

s

value is in node

corresponding to

last character

key is sequence

of characters from

root to value

using middle links

e 0

Page 139: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Ternary search trie insertion demo

139

put("she", 0)

e

h

s

0

Page 140: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

l

e

Ternary search trie insertion demo

140

put("sells", 1)

e

h

s

s 1

0

l

h

s

Page 141: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Ternary search trie insertion demo

141

ternary search trie

e

h

s

e

l

l

s 1

0

Page 142: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

a2

ll

l

s 1

ee

Ternary search trie insertion demo

142

put("sea", 2)

e

h

s

0

h

s

Page 143: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

a

Ternary search trie insertion demo

143

ternary search trie

e

h

s

e

l

l

s 1

2

0

Page 144: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

s 3

l

a

Ternary search trie insertion demo

144

put("shells", 3)

e

h

s

e

l

l

s 1

2

l

0e

h

s

Page 145: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

a

Ternary search trie insertion demo

145

ternary search trie

e

h

s

e

l

l

s 1

2

l

s

l

3

0

Page 146: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

b

y 4

a

Ternary search trie insertion demo

146

put("by", 4)

e

h

s

e

l

l

s 1

2

l

s

l

3

0

s

Page 147: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

b

y 4

a

Ternary search trie insertion demo

147

ternary search trie

e

h

s

e

l

l

s 1

2

l

s

l

3

0

Page 148: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

e 5

h

tb

y 4

a

Ternary search trie insertion demo

148

put("the", 5)

e

h

s

e

l

l

s 1

2

l

s

l

3

0

s

Page 149: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

t

h

e 5

b

y 4

a

Ternary search trie insertion demo

149

ternary search trie

e

h

s

e

l

l

s 1

2

l

s

l

3

0

Page 150: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

a

l

l

s 1

26

overwrite

old value with

new value

a

l

ee

t

h

e 5

b

y 4

Ternary search trie insertion demo

150

put("sea", 6)

e

h

s

l

s

l

3

0

h

s

Page 151: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

t

h

e 5

b

y 4

a

Ternary search trie insertion demo

151

ternary search trie

e

h

s

e

l

l

s 1

6

l

s

l

3

0

Page 152: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

e 7

r

o

t

h

e 5

b

y 4

a

Ternary search trie insertion demo

152

put("shore", 7)

e

h

s

e

l

l

s 1

6

l

s

l

3

0e

h

s

Page 153: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

o

r

e 7

t

h

e 5

b

y 4

a

Ternary search trie insertion demo

153

ternary search trie

e

h

s

e

l

l

s 1

6

l

s

l

3

0

Page 154: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Ternary search trie insertion demo

154

ternary search trie

e

a

l

l

s

e

l

s

b

y h

l

o

r

e

t

h

e

s

5

7

3

1

6

4

0

Page 155: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

26-way trie. 26 null links in each leaf.

TST. 3 null links in each leaf.

155

26-way trie vs. TST

26-way trie (1035 null links, not shown)

TST (155 null links)

nowfortipilkdimtagjotsobnobskyhutacebetmeneggfewjayowljoyrapgigweewascabwadcawcuefeetapagotarjamdugand

Page 156: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

A TST node is five fields:• A value.

• A character c.

• A reference to a left TST.

• A reference to a middle TST.

• A reference to a right TST.

156

TST representation in Java

private class Node{ private Value val; private char c; private Node left, mid, right;}

Trie node representations

s

e h u

link for keysthat start with s

link for keysthat start with su

hue

standard array of links (R = 26) ternary search tree (TST)

s

Page 157: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

157

TST: Java implementation

public class TST<Value>{ private Node root;

private class Node { /* see previous slide */ }

public void put(String key, Value val) { root = put(root, key, val, 0); }

private Node put(Node x, String key, Value val, int d) { char c = key.charAt(d); if (x == null) { x = new Node(); x.c = c; } if (c < x.c) x.left = put(x.left, key, val, d); else if (c > x.c) x.right = put(x.right, key, val, d); else if (d < key.length() - 1) x.mid = put(x.mid, key, val, d+1); else x.val = val; return x; }

Page 158: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

158

TST: Java implementation (continued)

public boolean contains(String key) { return get(key) != null; } public Value get(String key) { Node x = get(root, key, 0); if (x == null) return null; return x.val; }

private Node get(Node x, String key, int d) { if (x == null) return null; char c = key.charAt(d); if (c < x.c) return get(x.left, key, d); else if (c > x.c) return get(x.right, key, d); else if (d < key.length() - 1) return get(x.mid, key, d+1); else return x; }}

Page 159: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

159

String symbol table implementation cost summary

Remark. Can build balanced TSTs via rotations to achieve L + log Nworst-case guarantees.

Bottom line. TST is as fast as hashing (for string keys), space efficient.

character accesses (typical case)character accesses (typical case)character accesses (typical case)character accesses (typical case) dedupdedup

implementationsearch

hit

search

missinsert

space

(references)moby.txt actors.txt

red-black BST L + c lg 2 N c lg 2 N c lg 2 N 4 N 1.40 97.4

hashing

(linear probing)L L L 4 N to 16 N 0.76 40.6

R-way trie L log R N L (R + 1) N 1.12out of

memory

TST L + ln N ln N L + ln N 4 N 0.72 38.7

Page 160: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

160

TST vs. hashing

Hashing.• Need to examine entire key.

• Search hits and misses cost about the same.

• Performance relies on hash function.

• Does not support ordered symbol table operations.

TSTs. • Works only for strings (or digital keys).

• Only examines just enough key characters.

• Search miss may involve only a few characters.

• Supports ordered symbol table operations (plus others!).

Bottom line. TSTs are:• Faster than hashing (especially for search misses).

More flexible than red-black BSTs. [stay tuned]

Page 161: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

‣ R-way tries‣ Ternary search tries‣ Character-based operations

TRIES

Page 162: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Character-based operations. The string symbol table API supports several useful character-based operations.

Prefix match. Keys with prefix "sh": "she", "shells", and "shore".

Wildcard match. Keys that match ".he": "she" and "the".

Longest prefix. Key that is the longest prefix of "shellsort": "shells".

162

String symbol table API

key value

by 4

sea 6

sells 1

she 0

shells 3

shore 7

the 5

Page 163: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Remark. Can also add other ordered ST methods, e.g., floor() and rank().

163

String symbol table API

public class StringST<Value> public class StringST<Value> public class StringST<Value>

StringST() create a symbol table with string keys

void put(String key, Value val) put key-value pair into the symbol table

Value get(String key) value paired with key

void delete(String key) delete key and corresponding value

Iterable<String> keys() all keys

Iterable<String> keysWithPrefix(String s) keys having s as a prefix

Iterable<String> keysThatMatch(String s) keys that match s (where . is a wildcard)

String longestPrefixOf(String s) longest key that is a prefix of s

Page 164: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

To iterate through all keys in sorted order:• Do inorder traversal of trie; add keys encountered to a queue.

• Maintain sequence of characters on path from root to node.

164

Warmup: ordered iteration

o

r

e 7

t

h

e 5

s

h

e 0

e

l

l

s 1

l

l

s 3

b

y 4

a 6

bbys

seseasel

sellsells

shshe

shellshells

shoshor

shoretth

the

by

by sea

by sea sells

by sea sells she

by sea sells she shells

by sea sells she shells shore

by sea sells she shells shore the

Collecting the keys in a trie (trace)

key q

keysWithPrefix("");

Page 165: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

To iterate through all keys in sorted order:• Do inorder traversal of trie; add keys encountered to a queue.

• Maintain sequence of characters on path from root to node.

165

Ordered iteration: Java implementation

public Iterable<String> keys(){ Queue<String> queue = new Queue<String>(); collect(root, "", queue); return queue;}

private void collect(Node x, String prefix, Queue<String> q){ if (x == null) return; if (x.val != null) q.enqueue(prefix); for (char c = 0; c < R; c++) collect(x.next[c], prefix + c, q);}

sequence of characters

on path from root to x

Page 166: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Find all keys in symbol table starting with a given prefix.

Ex. Autocomplete in a cell phone, search bar, text editor, or shell.• User types characters one at a time.

• System reports all matching strings.

166

Prefix matches

Page 167: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Find all keys in symbol table starting with a given prefix.

167

Prefix matches

o

r

e 7

t

h

e 5

s

h

e 0

e

l

l

s 1

l

l

s 3

b

y 4

a 6

find subtrie for allkeys beginning with "sh"

o

r

e 7

t

h

e 5

s

h

e 0

e

l

l

s 1

l

l

s 3

b

y 4

a 6

collect keysin that subtrie

keysWithPrefix("sh");

Pre!x match in a trie

shsheshelshell

shellsshoshorshore

she

she shells

she shells shore

key q

public Iterable<String> keysWithPrefix(String prefix){ Queue<String> queue = new Queue<String>(); Node x = get(root, prefix, 0); collect(x, prefix, queue); return queue;}

root of subtrie for all strings

beginning with given prefix

Page 168: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Use wildcard . to match any character in alphabet.

168

Wildcard matches

co....er

coalizer

coberger

codifier

cofaster

cofather

cognizer

cohelper

colander

coleader

...

compiler

...

composer

computer

cowkeper

.c...c.

acresce

acroach

acuracy

octarch

science

scranch

scratch

scrauch

screich

scrinch

scritch

scrunch

scudick

scutock

Page 169: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

Search as usual if character is not a period;go down all R branches if query character is a period.

169

Wildcard matches

public Iterable<String> keysThatMatch(String pat) { Queue<String> queue = new Queue<String>(); collect(root, "", 0, pat, queue); return queue; }

private void collect(Node x, String prefix, String pat, Queue<String> q) { if (x == null) return; int d = prefix.length(); if (d == pat.length() && x.val != null) q.enqueue(prefix); if (d == pat.length()) return; char next = pat.charAt(d); for (char c = 0; c < R; c++) if (next == '.' || next == c) collect(x.next[c], prefix + c, pat, q); }

Page 170: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

170

Longest prefix

Find longest key in symbol table that is a prefix of query string.

Ex. To send packet toward destination IP address, router chooses IP addressin routing table that is longest prefix match.

Note. Not the same as floor:

represented as 32-bit

binary number for IPv4

(instead of string)

floor("128.112.100.16") = "128.112.055.15"

"128"

"128.112"

"128.112.055"

"128.112.055.15"

"128.112.136"

"128.112.155.11"

"128.112.155.13"

"128.222"

"128.222.136"

longestPrefixOf("128.112.136.11") = "128.112.136"

longestPrefixOf("128.112.100.16") = "128.112"

longestPrefixOf("128.166.123.45") = "128"

Page 171: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

171

Longest prefix

Find longest key in symbol table that is a prefix of query string.• Search for query string.

• Keep track of longest key encountered.

Possibilities for longestPrefixOf()

s

h

e 0

e

l

l

s 1

l

l

s 3

a 2

"she" "shell"

search ends atend of string

value is not null return she

s

h

e 0

e

l

l

s 1

l

l

s 3

a 2 search ends at end of stringvalue is nullreturn she

(last key on path)

"shellsort"

s

h

e 0

e

l

l

s 1

l

l

s 3

a 2

search ends at null link

return shells(last key on path)

search ends at null link

return she(last key on path)

"shelters"

s

h

e 0

e

l

l

s 1

l

l

s 3

a 2

Page 172: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

172

Longest prefix: Java implementation

Find longest key in symbol table that is a prefix of query string.• Search for query string.

• Keep track of longest key encountered.

public String longestPrefixOf(String query) { int length = search(root, query, 0, 0); return query.substring(0, length); }

private int search(Node x, String query, int d, int length) { if (x == null) return length; if (x.val != null) length = d; if (d == query.length()) return length; char c = query.charAt(d); return search(x.next[c], query, d+1, length); }

Page 173: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

173

T9 texting

Goal. Type text messages on a phone keypad.

Multi-tap input. Enter a letter by repeatedly pressing a key until the desired letter appears.

T9 text input.• Find all words that correspond to given sequence of numbers.

• Press 0 to see all completion options.

Ex. hello• Multi-tap: 4 4 3 3 5 5 5 5 5 5 6 6 6

• T9: 4 3 5 5 6

www.t9.com

"a much faster and more fun way to enter text"

Page 174: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

174

A letter to t9.com

To: [email protected]: Tue, 25 Oct 2005 14:27:21 -0400 (EDT)

Dear T9 texting folks,

I enjoyed learning about the T9 text systemfrom your webpage, and used it as an examplein my data structures and algorithms class.However, one of my students noticed a bugin your phone keypad

http://www.t9.com/images/how.gif

Somehow, it is missing the letter s. (!)

Just wanted to bring this information toyour attention and thank you for your website.

Regards,

Kevin

where the @#$% is the "s" ???

Page 175: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

175

A world without “s” ??

To: "'Kevin Wayne'" <[email protected]>Date: Tue, 25 Oct 2005 12:44:42 -0700

Thank you Kevin.

I am glad that you find T9 o valuable for yourcla. I had not noticed thi before. Thank forwriting in and letting u know.

Take care,

Brooke nyder OEM Dev upport AOL/Tegic Communication 1000 Dexter Ave N. uite 300 eattle, WA 98109

ALL INFORMATION CONTAINED IN THIS EMAIL IS CONSIDEREDCONFIDENTIAL AND PROPERTY OF AOL/TEGIC COMMUNICATIONS

Page 176: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

176

Patricia trie

Patricia trie. [Practical Algorithm to Retrieve Information Coded in Alphanumeric]

• Remove one-way branching.

• Each node represents a sequence of characters.

• Implementation: one step beyond this course.

Applications.• Database search.

• P2P network search.

• IP routing tables: find longest prefix match.

• Compressed quad-tree for N-body simulation.

• Efficiently storing and querying XML documents.

Also known as: crit-bit tree, radix tree.

1

1 2

2

put("shells", 1);put("shellfish", 2);

Removing one-way branching in a trie

h

e

l

f

i

s

h

l

s

s s

shell

fish

internalone-way

branching

externalone-way

branching

standardtrie

no one-waybranching

Page 177: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

177

Suffix tree

Suffix tree.• Patricia trie of suffixes of a string.

• Linear-time construction: beyond this course.

Applications.• Linear-time: longest repeated substring, longest common substring,

longest palindromic substring, substring search, tandem repeats, ….

• Computational biology databases (BLAST, FASTA).

BANANAS A NA S

NA S S NAS

NAS S

suffix tree for BANANAS

Page 178: BBM 202 ALGORITHMS - Hacettepeerkut/bbm202.s13/w10-string-sorts-tries.pdf · Radix. Number of digits R in alphabet. Alphabets 14 604 CHAPTER 6! Strings holds the frequencies in Count

178

String symbol tables summary

A success story in algorithm design and analysis.

Red-black BST.• Performance guarantee: log N key compares.

• Supports ordered symbol table API.

Hash tables.• Performance guarantee: constant number of probes.

• Requires good hash function for key type.

Tries. R-way, TST.• Performance guarantee: log N characters accessed.

• Supports character-based operations.

Bottom line. You can get at anything by examining 50-100 bits (!!!)