Copyright 2004-2006 Curt Hill
Tries
An N-Way tree with unusual properties
Copyright 2004-2006 Curt Hill
The word
• Derived from the middle of retrieve• Pronounced either like “try” or
“tree”• Pronouncing as “try” is less
confusing, so this is what we will use
• However, before looking at a trie we must consider a radix sort
Copyright 2004-2006 Curt Hill
Radix sort• In the olden days we had card decks• We typically put a sequence number in
the last 8 columns• If the deck every got shuffled we took it
to operations and they could resort it based on that sequence
• They used a machine called a card sorter
• It had one input and 12 output bins– A card had 12 rows
Copyright 2004-2006 Curt Hill
It worked like this:• The operator set which single
column it would sort• The deck was read in and put into
one of 12 slots based on the value in that column
• If two cards were the same then they hit the slot in the same order that they were originally in
• Most sorts do not preserve the input order in equal keys
Copyright 2004-2006 Curt Hill
How should we sort a deck?
• Sort first or sort last?• Sort first
– Most of us would sort on the first character and then have 10 decks
– Then sort each of those decks into 100 decks
– Then sort each of those decks into 1000 decks
– etc
Copyright 2004-2006 Curt Hill
Sort Last• Sort last – only works because the card
sorter maintains order for equal keys• Sort on the last digit
– Recombine the decks into one but based on their slot order
– Next sort on next to last digit• Recombine the decks into one but based on their
slot order
– Keep doing this until you run out of digits
Copyright 2004-2006 Curt Hill
Example• Consider the following data:
• 434,214,123,432,124,431,223
• Three passes based on three digits• Sort on last digit
– 431, 432, 123, 223, 434, 214,124• Sort on middle digit
– 214, 123, 223, 124, 431, 432, 434• Sort on first digit
– 123, 124, 214, 223, 431, 432, 434
Copyright 2004-2006 Curt Hill
Quaint?
• The importance of the radix sort at this point is that it deals with the key as a sequence of digits rather than a unified whole
• A trie will do the same• It will also combine tree searches
and subscripting
Copyright 2004-2006 Curt Hill
Subscripting as a search
• Pros– The advantage of a vector or an array is
that subscripting is extremely quick– The advantage of a binary or B tree is that
the key can have any form– Hashing attempts to make a key from
things that are not a key
• Cons– Vector/arrays only allow integer subscripts– Trees O(log n) search are much slower than
O(C) searches of arrays– Hash tables maul the sorted order
Copyright 2004-2006 Curt Hill
Trie Again
• The trie is an attempt to bring subscripting back to searching
• The key concept is to think of a string, not as a single indivisible item, but as a sequence of characters– String is the most general key
• Works well for dense keys
Copyright 2004-2006 Curt Hill
The organization of the trie
• A trie is a multiway tree where no search occurs on the keys– Instead a subscript evaluation
• Suppose that we have a string of 5 digits for a key– Each node will contain 10 possibilities –
one for each digit
• Therefore the trie is 10-way tree• The root node has one subtree for
each digit
Copyright 2004-2006 Curt Hill
Root and three of 10 descendents
9 8 7 6 5 4 3 2 1 0
9 8 7 6 5 4 3 2 1 0
9 8 7 6 5 4 3 2 1 0
9 8 7 6 5 4 3 2 1 0
Copyright 2004-2006 Curt Hill
Notes• If the key is constant length
– Only the leaves have any data– The path to leaf is the key– The digits are not actually stored in
the nodes– Just the pointers to subnodes
• If the key is not constant length then each node has the data corresponding to that key
• Subscripting and pointer dereferencing is used
Copyright 2004-2006 Curt Hill
Searching• We use the first digit of the search
key to find the proper subtree• Each subtree of the root, then uses
the second letter as the basis to find the correct subtree
• At level N then the Nth letter of the word is used as a subscript of the node to find the subtree
• The depth of the tree is the length of the longest key
Copyright 2004-2006 Curt Hill
Example• Consider a structure that contains
English words of which there are more than a million
• Each entry points to a definition and other stuff– Perhaps a file ID if the whole structure
does not fit in memory
• Twenty six entries in root• Contains words:
– a, an, and, am, any
Copyright 2004-2006 Curt Hill
Four levels of Another Trie
a b c d … x y z
a b … z… m n
NULL
a
a b c d … x y z an
a b c d … x y z and
a b c d … x y z am
a b c d … x y z any
Level 0 haszero length items
Level 1 haslength one items
Level 2 haslength two items
Copyright 2004-2006 Curt Hill
Notes
• One pointer for each descendent– The data for the pointers is not needed
since no comparisons are done
• One data item for the word currently constructed– Not the word itself (that is the key) but
the data corresponding to that key– I will not show insertion/deletion etc.
because it follows naturally from the structure and previous experience on trees
Copyright 2004-2006 Curt Hill
Searching Trees
• How does this Trie compare with other tree searches?
• Searching a binary tree is a log2 operation– Each node examination splits the tree
into (hopefully) two nearly equal pieces
– Log2 of 1 million is about 20 (19.9)
Copyright 2004-2006 Curt Hill
BTree• Searching a BTree is a harder to determine
operation• Suppose that N=4• Each node should be between 4 and 8 keys• Then the search should be log6 operation• Each node examination splits the tree into
approximately six nearly equal pieces• Log6 of 1 million is 7.71• However, in the traversing from root to leaf in
these 7 searches we also searched 7 or 8 nodes each of which had approximately 6 items in them
• Hence we had some sort of inspection of about 50 items
Copyright 2004-2006 Curt Hill
Trie• Searching a trie using characters is a
log26 operation– Each node examination splits the tree into 26
pieces but be sure they are not equal, especially if they use English letter frequency
• If a product id with sequential numbers/letters then an equal distribution is possible– Log26 of 1 million is 4.24– Hence the tree is much shallower– The root to leaf search did four subscript
evaluations rather than four searches
Copyright 2004-2006 Curt Hill
Balance
• There is none unless the keys accidentally cause balance
• Balance is something that can be done when searching but not subscripting
Copyright 2004-2006 Curt Hill
Space utilization
• Tries are preferred only for dense keys
• What is a dense key?• A key where adjacent keys are
relatively close to each other• The key space has few holes in it• English words are usually sparse
Copyright 2004-2006 Curt Hill
Example• In a 55,000 entry dictionary
"gunfire" and "gunlock" are adjacent– These have the first three characters
the same but how many permutations are between them?
– There are five letters between the f and l
– There should be 5 * 26 * 26 * 26 pseudo words of length 7 between them which is greater than 87,000
Copyright 2004-2006 Curt Hill
Example Again• This neglects other gunf and gunl words
as well as differing lengths– Social security numbers, telephone
numbers, product numbers are much more likely to be dense
• If a node of 26 items only uses three of them then it is pretty wasteful of space even if the subscripting is quick
• In the dictionary example, the second level has a number of letter combinations that do not exist– Most two consonant pairs do not exist,
except as abbreviations– bb, bc, bf
Copyright 2004-2006 Curt Hill
Tree Nodes• We seldom want to take a trie to
the bitter end unless we have a manageable key– Key must be dense– Must be evenly distributed key– Such a key is often numeric– Combinations must occur with equal
frequency
• A hybrid tree has a mixture of formats
Copyright 2004-2006 Curt Hill
First Example• License plate numbers in ND have three
letters and three digits– A binary tree would need a height of 24– A BTree of N = 4 would still need 9 levels
• For a trie the top three levels would have letters and the bottom three digits– Would only be six levels deep– The quickest possible search– Two distinct forms of nodes– One form of hybrid
Copyright 2004-2006 Curt Hill
Hybrids again
• The more common practice is to use the trie for a small number of levels and then switch to another data structure
• The top structure is usually a trie for its high fanout
• The bottom structure is usually a binary tree for memory structures and a BTree for disk structures, but other things may also be used
Copyright 2004-2006 Curt Hill
Demonstration Notes
• What follows is a trie program with some unusual features– Consider these before observing code
Copyright 2004-2006 Curt Hill
Preprocessor commands• This code has preprocessor conditional
compilations in it• It tailors the Trie to either accept a key that is
only digits or only uppercase letters• There is either a definition of TRIE_LETTERS or
not– The value of this is not important, not even given,
but the question is it defined or not• Since the preprocessor is finished before the
compiler starts we can generate the compiler input
• In this case we end up with two different versions
• Great amounts of similarity but still different• Notice it also extends to the main C++ file
Copyright 2004-2006 Curt Hill
Key density
• What happens when a word like "AARDVARK" is inserted into an empty trie?– This is the problem of a sparse key
• What would happen to binary or B tree that had the same situation?
Copyright 2004-2006 Curt Hill
Subscripting into the node• In both find and insert the search is trivial• Extract the letter, adjust by the
beginning of the alphabet and use as a subscript
• This is simple because we only allow a key to be a string of uppercase letters
• What would we do if we needed to allow any characters allowed in a word?– Such as hyphen, space, apostrophe or digits– Not all characters are allowed, just some
Copyright 2004-2006 Curt Hill
The what if• This complicates the lookup and slows
the simple lookup• The procedure would be something like
this:– If the character is a letter, adjust as always– If the character is a digit, adjust by
subtracting the zero and adding 26– Else do a case on the character and merely
assign the subscript directly• The more characters the more costly
and the less simple this search will be• This lack of simplicity will hinder the
speed of the trie and make it less desirable, based on the probability of the characters
Copyright 2004-2006 Curt Hill
Iterator• Notice the handshaking that has to go on
between the Trie and Trie_iterator class– Both have to tell the other what is going on
• This could be a recursive routine, but uses a stack and loop instead
• Notice the word of a Trie node needs to be displayed first
• Every leaf contains an array of NULL pointers and the pointer to the data item