
Copyright 2004-2006 Curt Hill

Tries

An N-Way tree with unusual properties


The word

• Derived from the middle of "retrieve"
• Pronounced either like “try” or “tree”
• Pronouncing it as “try” is less confusing, so this is what we will use

• However, before looking at a trie we must consider a radix sort


Radix sort

• In the olden days we had card decks
• We typically put a sequence number in the last 8 columns
• If the deck ever got shuffled we took it to operations and they could re-sort it based on that sequence
• They used a machine called a card sorter
• It had one input and 12 output bins
  – A card had 12 rows


It worked like this:

• The operator set which single column it would sort on
• The deck was read in and each card was put into one of 12 slots based on the value in that column
• If two cards had the same value then they hit the slot in the same order that they were originally in
• Many common sorts, such as quicksort and heapsort, do not preserve the input order of equal keys; a sort that does is called stable


How should we sort a deck?

• Sort first or sort last?
• Sort first
  – Most of us would sort on the first digit and then have 10 decks
  – Then sort each of those decks on the second digit, giving 100 decks
  – Then sort each of those decks on the third digit, giving 1000 decks
  – etc


Sort Last

• Sort last only works because the card sorter maintains order for equal keys
• Sort on the last digit
  – Recombine the decks into one, in slot order
  – Next sort on the next-to-last digit
    • Recombine the decks into one, in slot order
  – Keep doing this until you run out of digits


Example

• Consider the following data:
• 434, 214, 123, 432, 124, 431, 223
• Three passes based on three digits
• Sort on last digit
  – 431, 432, 123, 223, 434, 214, 124
• Sort on middle digit
  – 214, 123, 223, 124, 431, 432, 434
• Sort on first digit
  – 123, 124, 214, 223, 431, 432, 434
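
The three passes above can be sketched in C++ roughly as follows. This is only an illustration of the sort-last method, assuming non-negative integer keys and 10 bins (one per decimal digit) rather than the 12 slots of a real card sorter; the name radix_sort is made up for this sketch.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Least-significant-digit radix sort on non-negative integers.
// Each pass mimics one run through the card sorter: every card is
// dropped into one of 10 bins by a single digit, then the bins are
// recombined in slot order. Appending to a bin keeps cards with equal
// digits in arrival order, which is what the method depends on.
std::vector<int> radix_sort(std::vector<int> deck, int digits) {
    int divisor = 1;  // selects the digit for this pass: 1, 10, 100, ...
    for (int pass = 0; pass < digits; ++pass) {
        std::array<std::vector<int>, 10> bins;
        for (int card : deck) {
            assert(card >= 0);  // assumption: keys are non-negative
            bins[(card / divisor) % 10].push_back(card);
        }
        deck.clear();
        for (const auto& bin : bins) {
            deck.insert(deck.end(), bin.begin(), bin.end());
        }
        divisor *= 10;
    }
    return deck;
}
```

Calling radix_sort on the data above with digits set to 1 reproduces the first pass, and with digits set to 3 it gives the fully sorted deck.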


Quaint?

• The importance of the radix sort at this point is that it deals with the key as a sequence of digits rather than a unified whole

• A trie will do the same
• It will also combine tree searches and subscripting


Subscripting as a search

• Pros
  – The advantage of a vector or an array is that subscripting is extremely quick
  – The advantage of a binary or B tree is that the key can have any form
  – Hashing attempts to make a key from things that are not a key
• Cons
  – Vectors/arrays only allow integer subscripts
  – The O(log n) searches of trees are much slower than the constant-time O(1) searches of arrays
  – Hash tables maul the sorted order


Trie Again

• The trie is an attempt to bring subscripting back to searching

• The key concept is to think of a string, not as a single indivisible item, but as a sequence of characters
  – String is the most general key

• Works well for dense keys


The organization of the trie

• A trie is a multiway tree where no search occurs on the keys
  – Instead a subscript evaluation
• Suppose that we have a string of 5 digits for a key
  – Each node will contain 10 possibilities, one for each digit
• Therefore the trie is a 10-way tree
• The root node has one subtree for each digit


Root and three of 10 descendants

[Figure: the root node and three of its ten child nodes; every node is an array of ten slots labeled 9 8 7 6 5 4 3 2 1 0]


Notes

• If the key is constant length
  – Only the leaves have any data
  – The path to the leaf is the key
  – The digits are not actually stored in the nodes
  – Just the pointers to subnodes
• If the key is not constant length then each node has the data corresponding to that key
• Subscripting and pointer dereferencing are used


Searching

• We use the first digit of the search key to find the proper subtree
• Each subtree of the root then uses the second digit as the basis to find the correct subtree
• At level N the Nth character of the key is used as a subscript into the node to find the subtree
• The depth of the tree is the length of the longest key
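
A minimal sketch of this search for digit keys, assuming C++14 or later. The names DigitNode, trie_insert and trie_find are invented for illustration and are not taken from the demonstration program discussed later.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Hypothetical node for a trie keyed on strings of decimal digits.
// No digit is stored anywhere: the slot position is the digit.
struct DigitNode {
    std::array<std::unique_ptr<DigitNode>, 10> child;  // one slot per digit
    const std::string* data = nullptr;  // set only where a key ends
};

// Walk one level per digit, creating nodes as needed, then attach the data.
void trie_insert(DigitNode& root, const std::string& key, const std::string* data) {
    DigitNode* node = &root;
    for (char c : key) {
        assert(c >= '0' && c <= '9');
        auto& slot = node->child[c - '0'];  // a subscript, not a comparison
        if (!slot) slot = std::make_unique<DigitNode>();
        node = slot.get();
    }
    node->data = data;
}

// Follow one pointer per digit; a null slot means the key is absent.
const std::string* trie_find(const DigitNode& root, const std::string& key) {
    const DigitNode* node = &root;
    for (char c : key) {
        if (c < '0' || c > '9') return nullptr;
        node = node->child[c - '0'].get();
        if (!node) return nullptr;
    }
    return node->data;
}
```

For a 5-digit key, trie_find does five subscript evaluations and five pointer dereferences and never compares one key against another.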


Example

• Consider a structure that contains English words, of which there are more than a million
• Each entry points to a definition and other stuff
  – Perhaps a file ID if the whole structure does not fit in memory
• Twenty-six entries in root
• Contains words:
  – a, an, and, am, any


Four levels of Another Trie

[Figure: a trie holding a, an, and, am, any. Each node is an array of 26 pointers (a b c d … x y z) plus a data item. Level 0 (the root) has zero length items and its data is NULL; level 1 has length one items (a); level 2 has length two items (an, am); the last level shown holds and, any.]


Notes

• One pointer for each descendant
  – The letter that goes with each pointer does not need to be stored since no comparisons are done
• One data item for the word currently constructed
  – Not the word itself (that is the key) but the data corresponding to that key
  – I will not show insertion/deletion etc. because it follows naturally from the structure and previous experience on trees
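
For readers who want to see the node layout spelled out, here is one possible sketch for variable-length lowercase words, where every node (not just the leaves) may carry data. WordNode, add_word and look_up are hypothetical names, and the code assumes an ASCII-style character set where 'a' through 'z' are contiguous.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Hypothetical node: 26 child pointers plus the data for the word spelled
// by the path from the root to this node. The word itself is never stored.
struct WordNode {
    std::array<std::unique_ptr<WordNode>, 26> child;
    std::unique_ptr<std::string> definition;  // null if the path is not a word
};

void add_word(WordNode& root, const std::string& word, const std::string& definition) {
    WordNode* node = &root;
    for (char c : word) {
        assert(c >= 'a' && c <= 'z');
        auto& slot = node->child[c - 'a'];
        if (!slot) slot = std::make_unique<WordNode>();
        node = slot.get();
    }
    node->definition = std::make_unique<std::string>(definition);
}

// Returns the stored definition, or nullptr if the word is not present.
const std::string* look_up(const WordNode& root, const std::string& word) {
    const WordNode* node = &root;
    for (char c : word) {
        if (c < 'a' || c > 'z') return nullptr;
        node = node->child[c - 'a'].get();
        if (!node) return nullptr;
    }
    return node->definition.get();
}
```

After adding a, an, and, am and any, the node reached by "an" holds data of its own and also has children for "and" and "any", while a lookup of "ant" stops at a null pointer.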


Searching Trees

• How does this Trie compare with other tree searches?

• Searching a binary tree is a log2 operation
  – Each node examination splits the tree into (hopefully) two nearly equal pieces

– Log2 of 1 million is about 20 (19.9)


BTree

• The cost of searching a BTree is harder to determine
• Suppose that N=4
• Each node should hold between 4 and 8 keys
• Then the search should be a log6 operation
• Each node examination splits the tree into approximately six nearly equal pieces
• Log6 of 1 million is 7.71
• However, in traversing from root to leaf in these 7 searches we also searched 7 or 8 nodes, each of which had approximately 6 items in it
• Hence we had some sort of inspection of about 50 items


Trie

• Searching a trie using characters is a log26 operation
  – Each node examination splits the tree into 26 pieces, but you can be sure they are not equal, especially if the keys follow English letter frequency
• If the key is a product id with sequential numbers/letters then an equal distribution is possible
  – Log26 of 1 million is 4.24
  – Hence the tree is much shallower
  – The root to leaf search did four subscript evaluations rather than four searches
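
The three logarithms quoted for one million keys can be checked with a small helper. The function name levels is made up here, and the result is only the idealized depth for a perfectly even split.

```cpp
#include <cassert>
#include <cmath>

// Approximate number of levels needed to tell `keys` items apart when
// every node examination splits the remaining items `fanout` ways.
double levels(double keys, double fanout) {
    assert(keys > 0 && fanout > 1);
    return std::log(keys) / std::log(fanout);  // change of base
}
```

levels(1e6, 2) is about 19.9 for the binary tree, levels(1e6, 6) about 7.7 for the BTree, and levels(1e6, 26) about 4.2 for the trie.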


Balance

• There is none unless the keys accidentally cause balance

• Balance is something that can be done when searching but not subscripting


Space utilization

• Tries are preferred only for dense keys

• What is a dense key?
• A key where adjacent keys are relatively close to each other
• The key space has few holes in it
• English words are usually sparse


Example

• In a 55,000 entry dictionary "gunfire" and "gunlock" are adjacent
  – These have the first three characters the same, but how many letter combinations are between them?
  – There are five letters between the f and l
  – There should be 5 * 26 * 26 * 26 = 87,880 pseudo words of length 7 between them, which is greater than 87,000
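
The estimate can be cross-checked by treating each lowercase word as a base-26 number. This is a sketch with invented names (base26, strings_between), assuming both words have the same length and contain only 'a' through 'z'. It counts every 7-letter string strictly between the two words, so it also includes the strings whose fourth letter is f or l that the 5 * 26 * 26 * 26 figure leaves out.

```cpp
#include <cassert>
#include <string>

// Treat a lowercase word as a base-26 number (a = 0, ..., z = 25).
long long base26(const std::string& word) {
    long long value = 0;
    for (char c : word) {
        assert(c >= 'a' && c <= 'z');
        value = value * 26 + (c - 'a');
    }
    return value;
}

// Number of same-length letter strings strictly between low and high.
long long strings_between(const std::string& low, const std::string& high) {
    assert(low.size() == high.size());
    assert(base26(low) < base26(high));
    return base26(high) - base26(low) - 1;
}
```

strings_between("gunfire", "gunlock") comes to 109,127, so roughly 109,000 unused keys sit between two adjacent entries of a 55,000 entry dictionary.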


Example Again

• This neglects other gunf and gunl words as well as differing lengths
  – Social security numbers, telephone numbers, product numbers are much more likely to be dense
• If a node of 26 items only uses three of them then it is pretty wasteful of space even if the subscripting is quick
• In the dictionary example, the second level has a number of letter combinations that do not exist
  – Most two consonant pairs do not exist, except as abbreviations
  – bb, bc, bf


Tree Nodes

• We seldom want to take a trie to the bitter end unless we have a manageable key
  – Key must be dense
  – Must be evenly distributed key
  – Such a key is often numeric
  – Combinations must occur with equal frequency
• A hybrid tree has a mixture of formats


First Example

• License plate numbers in ND have three letters and three digits
  – A binary tree would need a height of 24
  – A BTree of N = 4 would still need 9 levels
• For a trie the top three levels would have letters and the bottom three digits
  – Would only be six levels deep
  – The quickest possible search
  – Two distinct forms of nodes
  – One form of hybrid


Hybrids again

• The more common practice is to use the trie for a small number of levels and then switch to another data structure

• The top structure is usually a trie for its high fanout

• The bottom structure is usually a binary tree for memory structures and a BTree for disk structures, but other things may also be used


Demonstration Notes

• What follows is a trie program with some unusual features
  – Consider these before observing the code


Preprocessor commands

• This code has preprocessor conditional compilations in it
• It tailors the Trie to accept a key that is either only digits or only uppercase letters
• There is either a definition of TRIE_LETTERS or not
  – The value of this is not important, not even given; the question is whether it is defined or not
• Since the preprocessor is finished before the compiler starts we can generate the compiler input
• In this case we end up with two different versions
• Great amounts of similarity but still different
• Notice it also extends to the main C++ file
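
The general shape of such conditional compilation might be something like the sketch below. Only the macro name TRIE_LETTERS comes from the notes above; kFanout, kFirst, TrieNode and subscript are assumptions made for illustration and need not match the actual program.

```cpp
#include <cassert>
#include <cstddef>

// Only whether TRIE_LETTERS is defined matters, not its value.
#ifdef TRIE_LETTERS
constexpr std::size_t kFanout = 26;  // keys are uppercase letters
constexpr char kFirst = 'A';
#else
constexpr std::size_t kFanout = 10;  // keys are decimal digits
constexpr char kFirst = '0';
#endif

// The same source text yields a 26-way or a 10-way node.
struct TrieNode {
    TrieNode* child[kFanout] = {};  // all slots start out null
    void* data = nullptr;
};

// Convert a key character to a subscript for whichever version was built.
inline std::size_t subscript(char c) {
    assert(c >= kFirst && static_cast<std::size_t>(c - kFirst) < kFanout);
    return static_cast<std::size_t>(c - kFirst);
}
```

With GCC or Clang the letters version can be selected by passing -DTRIE_LETTERS on the command line; leaving it off builds the digits version.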


Key density

• What happens when a word like "AARDVARK" is inserted into an empty trie?
  – This is the problem of a sparse key
• What would happen to a binary tree or B tree in the same situation?
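
To make the cost concrete, the sketch below counts how many nodes an insertion has to allocate. Node and insert_counting are hypothetical names, and keys are assumed to be uppercase letters only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

struct Node {
    std::array<std::unique_ptr<Node>, 26> child;
    bool is_word = false;
};

// Insert an uppercase word and report how many new nodes it required.
std::size_t insert_counting(Node& root, const std::string& word) {
    std::size_t created = 0;
    Node* node = &root;
    for (char c : word) {
        assert(c >= 'A' && c <= 'Z');
        auto& slot = node->child[c - 'A'];
        if (!slot) {
            slot = std::make_unique<Node>();
            ++created;
        }
        node = slot.get();
    }
    node->is_word = true;
    return created;
}
```

Inserting "AARDVARK" into an empty trie creates 8 nodes, one per letter. Those nodes hold 8 * 26 = 208 pointer slots, of which only 7 are non-null. A binary tree or B tree in the same situation would store a single key in a single node.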


Subscripting into the node

• In both find and insert the search is trivial
• Extract the letter, adjust by the beginning of the alphabet and use as a subscript
• This is simple because we only allow a key to be a string of uppercase letters
• What would we do if we needed to allow any characters allowed in a word?
  – Such as hyphen, space, apostrophe or digits
  – Not all characters are allowed, just some


The what if

• This complicates the lookup and slows the simple lookup
• The procedure would be something like this:
  – If the character is a letter, adjust as always
  – If the character is a digit, adjust by subtracting the zero and adding 26
  – Else do a case on the character and merely assign the subscript directly
• The more characters the more costly and the less simple this search will be
• This lack of simplicity will hinder the speed of the trie and make it less desirable, based on the probability of the characters
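
The procedure above might be written like this. The choice of alphabet (26 uppercase letters, 10 digits, then hyphen, space and apostrophe) and the slot numbers 36 to 38 are assumptions for the sketch, as are the names kFanout and subscript_for.

```cpp
// Hypothetical alphabet: 26 uppercase letters, 10 digits and three
// punctuation marks, so each node would need 39 slots.
constexpr int kFanout = 39;

// Returns the slot for a key character, or -1 if it is not allowed.
int subscript_for(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';       // 0..25, adjust as always
    if (c >= '0' && c <= '9') return c - '0' + 26;  // 26..35
    switch (c) {                                    // assign directly
        case '-':  return 36;
        case ' ':  return 37;
        case '\'': return 38;
        default:   return -1;
    }
}
```

A letter is resolved after one range test, a digit after two, and a punctuation mark only after both tests and the switch, so the cost depends on how likely each kind of character is.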


Iterator

• Notice the handshaking that has to go on between the Trie and Trie_iterator classes
  – Both have to tell the other what is going on
• This could be a recursive routine, but uses a stack and loop instead
• Notice the word of a Trie node needs to be displayed first
• Every leaf contains an array of NULL pointers and the pointer to the data item
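
A stack-and-loop traversal in this spirit could look like the sketch below. It is not the Trie_iterator class itself and leaves out the handshaking between the two classes; Node, add_word and all_words are invented names, and keys are assumed to be uppercase letters.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <stack>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::array<std::unique_ptr<Node>, 26> child;
    bool is_word = false;
};

void add_word(Node& root, const std::string& word) {
    Node* node = &root;
    for (char c : word) {
        assert(c >= 'A' && c <= 'Z');
        auto& slot = node->child[c - 'A'];
        if (!slot) slot = std::make_unique<Node>();
        node = slot.get();
    }
    node->is_word = true;
}

// List every stored word in alphabetical order without recursion.
std::vector<std::string> all_words(const Node& root) {
    std::vector<std::string> out;
    // Each entry pairs a node with the key spelled by the path to it.
    std::stack<std::pair<const Node*, std::string>> pending;
    pending.emplace(&root, "");
    while (!pending.empty()) {
        const Node* node = pending.top().first;
        std::string prefix = pending.top().second;
        pending.pop();
        // A node's own word comes before anything in its subtrees.
        if (node->is_word) out.push_back(prefix);
        // Push children in reverse so that 'A' is popped before 'Z'.
        for (int i = 25; i >= 0; --i) {
            if (node->child[i]) {
                pending.emplace(node->child[i].get(),
                                prefix + static_cast<char>('A' + i));
            }
        }
    }
    return out;
}
```

Emitting a node's word before visiting its children is what puts "AN" ahead of "AND" and "ANY" in the output.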
