Dictionaries and Hash Tables
Dictionary
A dictionary, in computer science, implies a container that stores key-element pairs called items, and allows for quick retrieval.
– Items must be stored in a way that allows them to be located with the key
– It is not necessary to store the items in order, so a dictionary may be unordered or ordered
Dictionary ADT
Operations in a Dictionary ADT:
– int size()
– bool isEmpty()
– iter elements()
– iter keys()
– pos find( key )
– iter findAll( key )
– void insertItem( key, elem )
– void removeElement( key )
– void removeAllElements( key )
Dictionary Examples
Natural language dictionary
• word is key
• element contains word, definition, pronunciation, etc.
Web pages
• URL is key
• HTML or other file is element
Any typical database (e.g. student record)
• has one or more search keys
• each key may require its own organizational dictionary
Implementing a Dictionary
There are many ways a dictionary can be implemented. Some of them are:
– Log file or Audit Trail
– Ordered Dictionary and Binary search trees
– Hash table
Log File or Audit Trail
This is the simplest way to implement a dictionary. It uses an unordered vector, list or sequence to store the key-element pairs.
void insertItem( key, elem ) Each new item is appended at the end – O(1)
pos find( key ) Scan the entire list and examine each key – O(n)
void removeElement( key ) Scan the entire list to find the item, then remove it – O(n)
This allows for fast insertions. However, find and removal are slow.
– Good solution for items that are stored frequently but retrieved rarely, such as archived database and operating system transactions
– Storing log files
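The three operations above can be sketched in C++ as follows. This is only a minimal illustration under stated assumptions: the class name LogFileDictionary, the int keys and the std::string elements are hypothetical choices made for the example, and a std::vector stands in for the unordered sequence.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical log-file dictionary: items live in an unordered vector.
class LogFileDictionary {
public:
    // Append at the end: O(1) amortized for std::vector.
    void insertItem(int key, const std::string& elem) {
        items_.push_back(std::make_pair(key, elem));
    }

    // Scan every item: O(n). Returns a pointer to the element, or nullptr.
    const std::string* find(int key) const {
        for (std::size_t i = 0; i < items_.size(); ++i) {
            if (items_[i].first == key) return &items_[i].second;
        }
        return nullptr;
    }

    // Scan for the first matching item, then erase it: O(n).
    void removeElement(int key) {
        for (std::size_t i = 0; i < items_.size(); ++i) {
            if (items_[i].first == key) {
                items_.erase(items_.begin() + i);
                return;
            }
        }
    }

    std::size_t size() const { return items_.size(); }

private:
    std::vector<std::pair<int, std::string> > items_;
};
```

Insertion never looks at the existing items, which is why this layout suits data that is written often and read rarely.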
Ordered Dictionary ADT
All of the Dictionary operations, e.g. find(k), insertItem(k,e), removeElement(k)
Additional operations
pos closestBefore( key )
pos closestAfter( key )
Look-Up Tables
A look-up table is an implementation of an ordered dictionary ( eg. trigonometry table )
Here is an example, where all items are stored in a vector, in ascending order of the keys.
[Figure: array A with index positions 0–10, holding the keys 5, 13, 15, 16, 21, 26 and 37 in ascending order]
Lookup Table Performance
In a look-up table, inserting or removing may require shifting elements
Example: Insert an item with a key of 2
[Figure: array A (indices 0–10) before and after the insertion; the key 2 is smaller than every stored key, so all n elements are shifted to make room]
insertItem(k,e) takes O(n) time in the worst case
removeElement(k) takes O(n) time in the worst case
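The shifting cost can be seen in a short C++ sketch. It assumes the look-up table is a std::vector<int> of keys only; the function names insertSorted and removeSorted are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Insert key into a vector that is kept in ascending order.
// Finding the slot takes O(log n); the shift done by insert() takes O(n).
void insertSorted(std::vector<int>& table, int key) {
    std::vector<int>::iterator pos =
        std::lower_bound(table.begin(), table.end(), key);
    table.insert(pos, key);  // moves every later element one slot right
}

// Remove one occurrence of key if present; later elements shift left.
bool removeSorted(std::vector<int>& table, int key) {
    std::vector<int>::iterator pos =
        std::lower_bound(table.begin(), table.end(), key);
    if (pos == table.end() || *pos != key) return false;
    table.erase(pos);
    return true;
}
```

Inserting the key 2 into {5, 13, 15, 16, 21, 26, 37} is the worst case, because the new key belongs at index 0 and every existing element has to move.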
Lookup Table – find(k)
However, since the items in a lookup table are ordered, we can implement find(k) with a binary search algorithm.
A binary search algorithm (or binary chop) is a technique for finding a particular value in a sorted linear array by ruling out half of the data at each step. A binary search examines the middle element, makes a comparison to determine whether the desired value comes before or after it, and then searches the remaining half in the same manner. A binary search is an example of a divide and conquer algorithm.
Binary Search
Example: find(22)
[Figure: four steps of a binary search for the key 22 in the sorted array A (indices 0–15) holding 2, 4, 5, 7, 8, 9, 12, 14, 17, 19, 22, 25, 27, 28, 33, 37. The range between low and high is halved at each step until low = mid = high at the position of key 22.]
Binary Search Algorithm
Algorithm BinarySearch( A, k, low, high )
  if low > high then
    return Null
  else
    mid = (low + high) / 2
    if ( k == key(mid) ) then
      return Position(mid)
    else if ( k < key(mid) ) then
      return BinarySearch( A, k, low, mid - 1 )
    else
      return BinarySearch( A, k, mid + 1, high )
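A direct C++ rendering of this algorithm might look like the sketch below. It assumes the table is a sorted std::vector<int> of keys and returns -1 where the pseudocode returns Null; both choices are assumptions made for the example.

```cpp
#include <vector>

// Recursive binary search over a sorted vector.
// Returns the index of k, or -1 if k is absent (the pseudocode's Null).
int binarySearch(const std::vector<int>& A, int k, int low, int high) {
    if (low > high) return -1;
    int mid = low + (high - low) / 2;  // same midpoint, avoids int overflow
    if (k == A[mid]) return mid;
    if (k < A[mid]) return binarySearch(A, k, low, mid - 1);
    return binarySearch(A, k, mid + 1, high);
}
```

For the 16 keys of the find(22) example, the call binarySearch(A, 22, 0, 15) inspects indices 7, 11, 9 and 10 in turn and returns 10.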
Hash Tables
In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that the hash table uses to locate the desired value.
With a good hash function, this is usually the most efficient way to implement a dictionary: lookups, insertions and removals take O(1) expected time.
Hash Table
Bucket Arrays
A bucket array for a hash table is an array A of size N, where each cell of A is thought of as a ‘bucket’, and N defines the capacity of the array.
Example
– Small company with fewer than 100 employees
– Each employee has an ID number in the range 0–99
– Store employee records in an array, so that the employee ID number matches the array index
[Figure: bucket array A; cell 0 is EMPTY, cell 1 holds 01 Turing, A., cell 2 holds 02 Babbage, C., cell 3 is EMPTY, cell 4 holds 04 Gates, W., and so on]
Bucket Arrays
If the keys are unique, then searches, insertions and removals in the bucket array take worst-case time of O(1).
However, bucket arrays have 2 drawbacks.
– The array requires a capacity of N (the maximum number of elements possible), even when far fewer items are actually stored
– The key has to be an integer in the range [0, N-1]
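A bucket array for the employee example could be sketched as below. The class name BucketArray, the std::string records and the separate used_ flags are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical bucket array: the employee ID is used directly as the index.
class BucketArray {
public:
    explicit BucketArray(std::size_t capacity)
        : names_(capacity), used_(capacity, false) {}

    // O(1): the key must be an integer in [0, N-1].
    void insertItem(std::size_t id, const std::string& name) {
        assert(id < names_.size());
        names_[id] = name;
        used_[id] = true;
    }

    // O(1): returns nullptr when the bucket is empty or id is out of range.
    const std::string* find(std::size_t id) const {
        if (id >= names_.size() || !used_[id]) return nullptr;
        return &names_[id];
    }

    // O(1): marks the bucket as empty again.
    void removeElement(std::size_t id) {
        if (id < names_.size()) used_[id] = false;
    }

private:
    std::vector<std::string> names_;  // one bucket per possible key
    std::vector<bool> used_;          // which buckets hold a record
};
```

Both drawbacks are visible here: the constructor allocates all N buckets up front, and an ID outside [0, N-1] cannot be stored at all.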
Hash Functions
A good hash function is essential for good hash table performance. If a hash function tends to produce similar values, slow searches will result.
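Example
– Small company with fewer than 100 employees
– Already uses a 5-digit ID number
A simple hash function for this example is ( ID % 100 )
[Figure: bucket array A indexed by ID % 100; cell 0 is EMPTY, cell 1 holds 55301 Turing, A., cell 2 holds 81202 Babbage, C., cell 3 is EMPTY, cell 4 holds 77404 Gates, W., and so on]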
Hash Functions
A hash function is a way of creating a small digital "fingerprint" from any kind of data. The function chops and mixes the data to create the fingerprint, often called a hash value. A good hash function is one that yields few hash collisions in expected input domains.
To do this, the index into the hash table's array is generally calculated in two steps:
– A generic hash value is calculated to map the key to an integer ( hash code )
– This value is reduced to a valid array index ( compression map )
Hash Code
Take an arbitrary key k and assign it an integer value h. Then h is known as the hash code or hash value of k.
key -> integer
This integer h does not need to be in the range of the array that is being used for hashing and may even be a negative number, but we want the set of hash codes assigned to our keys to avoid collisions as much as possible.
Hash coding can be done in many ways:
– Integer cast
– Summing components
– Polynomial accumulations
Hash code – Integer Cast
int hashCode( int key )
{
    return key;
}
int hashCode( char key )
{
    return hashCode( int(key) ); // cast it to an integer
}
Hash code – Summing Components
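This applies when the long datatype has twice as many bits as the int datatype, e.g. 32 bits for int and 64 bits for long.
Treat the high-order bits as one integer and the low-order bits as another, then sum them.
int hashCode( long key )
{
    typedef unsigned long ulong;
    unsigned hi = unsigned( ulong(key) >> 32 ); // high-order 32 bits
    unsigned lo = unsigned( ulong(key) );       // low-order 32 bits
    return hashCode( int( hi + lo ) );          // unsigned sum wraps instead of overflowing
}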
Hash code – Summing Components Applied to Strings
One approach is to sum the ASCII values of all the chars in the string
– Problem: too many collisions because many different words will have the same result
– For example, stop, tops, pots, spot
ASCII
s = 115
t = 116
o = 111
p = + 112
Hashcode = 454
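The collision problem is easy to reproduce with a few lines of C++. The function name sumHashCode is hypothetical, and the sketch assumes ASCII text.

```cpp
#include <string>

// Hash code formed by summing the character codes of the string.
int sumHashCode(const std::string& s) {
    int sum = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        sum += static_cast<unsigned char>(s[i]);
    }
    return sum;
}
```

Because addition ignores the order of the characters, sumHashCode("stop"), sumHashCode("tops"), sumHashCode("pots") and sumHashCode("spot") all return 454.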
Hash code – Polynomial Accumulation
Better approach for string keys
– Modify each char’s ASCII value by a number based on its position in the string
– Then sum the results
– Where x_i represents the i-th char, k is the total number of chars, and a is a constant (but not 1), the following formula can be used:
$x_0 a^{k-1} + x_1 a^{k-2} + \dots + x_{k-2} a + x_{k-1}$
Example: assume that the string is “stop” and a = 10
s = 115 * 10^3 = 115000
t = 116 * 10^2 = 11600
o = 111 * 10^1 = 1110
p = 112 * 10^0 = + 112
Hashcode = 127822
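The polynomial can be evaluated without computing any powers by using Horner's rule, which rewrites it as ((x0 * a + x1) * a + x2) * a + ... and yields the same value. The sketch below assumes ASCII text; the name polyHashCode and the use of unsigned arithmetic (so that long strings wrap around rather than overflow) are choices made for this example.

```cpp
#include <string>

// Polynomial hash code evaluated with Horner's rule.
// Unsigned arithmetic wraps around on overflow instead of being undefined.
unsigned int polyHashCode(const std::string& s, unsigned int a) {
    unsigned int h = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        h = h * a + static_cast<unsigned char>(s[i]);
    }
    return h;
}
```

With a = 10, polyHashCode("stop", 10) gives 127822 as in the worked example, while "tops" gives 128335, so the anagrams no longer collide.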
Compression Maps
This is the second step of the hash function. Once we have a hash code, we need to map it to an integer in the range of array index numbers.
This can be accomplished in many ways:
– Truncation
– Truncation and Summation
– Division method
– MAD method
Compression Maps - Truncation
One way would be to simply ignore parts of the key and use the remaining part.
E.g.:
employee number: 15436578
number of buckets: 1000
possibility 1: k = last 3 digits = 578
possibility 2: k = digits 4, 6 and 8 = 358
This is a fast scheme, but it often fails to give an even distribution of keys throughout the table, because the ignored digits play no part in the result.
Compression Maps – Truncation and Summation
This method might use a combination of truncating and summing parts of the key.
E.g.:
employee number: 15496578
number of buckets: 1000
possibility: partition the key into groups of 3 digits, add them together and truncate if necessary
k = 154 + 965 + 78 = 1197, truncated to 197
This provides a better spread than simple truncation, but it still does not prevent collisions.
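Both truncation variants can be sketched in C++ as below. The function names truncateKey and foldKey are hypothetical, the keys are assumed to be non-negative, and the digit groups are taken from the left as in the example.

```cpp
#include <string>

// Truncation: keep only the last three decimal digits of the key.
int truncateKey(long key) {
    return static_cast<int>(key % 1000);
}

// Truncation and summation: split the decimal digits into groups of three
// from the left, add the groups, then keep the last three digits of the sum.
// Assumes key is non-negative.
int foldKey(long key) {
    std::string digits = std::to_string(key);
    int sum = 0;
    for (std::string::size_type i = 0; i < digits.size(); i += 3) {
        sum += std::stoi(digits.substr(i, 3));
    }
    return sum % 1000;
}
```

truncateKey(15436578) returns 578 and foldKey(15496578) returns 197, matching the two worked examples.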
Compression Maps - Division Method
int k = hashCode( key );
int index = abs(k) % ARRAY_SIZE;
The size of the array should be a prime number. This reduces the number of collisions and spreads out the distribution of hashed values, because regular patterns in the keys are less likely to share a factor with the array size.
Example: Keys = {200, 210, 220, 230, …, 600}
– If array size = 100 (a non-prime number), every hash code collides with others
– If array size = 101 (a prime number), there are fewer collisions for each hash code
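The effect of a prime array size can be checked by counting how many distinct buckets the example keys occupy. In this sketch the function names divisionCompress and distinctBuckets are hypothetical, and the keys themselves are used as hash codes.

```cpp
#include <cassert>
#include <cstdlib>
#include <set>

// Division method: reduce a hash code to an index in [0, arraySize).
int divisionCompress(int hashCode, int arraySize) {
    assert(arraySize > 0);
    return std::abs(hashCode) % arraySize;
}

// Count how many distinct buckets the keys 200, 210, ..., 600 land in.
int distinctBuckets(int arraySize) {
    std::set<int> used;
    for (int key = 200; key <= 600; key += 10) {
        used.insert(divisionCompress(key, arraySize));
    }
    return static_cast<int>(used.size());
}
```

The 41 keys fall into only 10 buckets when the array size is 100, but into 41 different buckets when it is 101.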
Compression Maps - MAD Method
This is another method to convert the hash code into a known range. MAD stands for “Multiply, Add, and Divide”:
int k = hashCode( key );
int i = abs(a * k + b) % ARRAY_SIZE;
where
– a and b are non-negative integers
– (a % ARRAY_SIZE) must not be 0
– a and b are chosen at random when the program is written
Example: Keys = {200, 210, 220, 230, …, 600} where a = 8, b = 7, array size = 100
200 => (8*200+7) % 100 => 7
210 => (8*210+7) % 100 => 87
220 => (8*220+7) % 100 => 67
230 => (8*230+7) % 100 => 47
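The MAD computation can be wrapped in a small function, sketched here. The name madCompress is hypothetical, and the intermediate product is widened to long long as a precaution against int overflow, which the original snippet does not address.

```cpp
#include <cassert>
#include <cstdlib>

// MAD (multiply, add, divide) compression map.
// a and b are the constants from the text; a % arraySize must not be 0.
int madCompress(int hashCode, int a, int b, int arraySize) {
    assert(arraySize > 0 && a % arraySize != 0);
    long long v = static_cast<long long>(a) * hashCode + b;  // widen before multiplying
    return static_cast<int>(std::llabs(v) % arraySize);
}
```

With a = 8, b = 7 and an array size of 100, the keys 200, 210, 220 and 230 map to 7, 87, 67 and 47, as in the example.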
Collisions
Keys are not required to be unique, and a hash function is not required to generate a unique value for each key. This means that more than one element may be mapped to the same position, which is called a collision.
Collisions
Two different keys are mapped to the same location in the array
Best approach – minimize collisions by picking a good hash function
Example
– A bad hash function is ( key % 100 ) because it is too likely to cause collisions. ( key % 101 ) is better.
Collisions
If two keys hash to the same index, the corresponding records cannot be stored in the same location. So, if it's already occupied, we must find another location to store the new record, and do it so that we can find it when we look it up later on.
Example
– Previous hash function of ( ID % 100 ) is too likely to cause collisions
[Figure: bucket array A with 55301 Turing, A. in cell 1, 81202 Babbage, C. in cell 2 and 77404 Gates, W. in cell 4; the new record 38104 McNealy, S. also hashes to cell 4 and collides]
Collision Handling
There are a number of collision resolution techniques, but the most popular are chaining and open addressing.
Two different approaches
– Chaining
– Open addressing
Chaining
Separate chaining is a method for dealing with collisions. The hash table is an array of linked lists. Data elements that hash to the same value are stored in a linked list originating from the index equivalent of their hash value.
– Each location in the hash table holds a pointer to a list
– Each list can hold many items
– As long as the hash function is good, the lists will be small because there will be few collisions
Separate Chaining Example
[Figure: hash table A with 13 cells (indices 0–12), each pointing to a linked list; the lists shown hold the keys {41, 28, 54}, {18}, {36, 10} and {90, 12, 38, 25}, and each list ends in NULL]
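A chained table can be sketched with an array of std::list objects. This is a minimal illustration under stated assumptions: the class name ChainedTable is hypothetical, only non-negative int keys are stored (no elements), and the compression map is simply the key modulo the number of buckets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical chained hash table storing int keys only.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t buckets) : table_(buckets) {
        assert(buckets > 0);
    }

    // Append the key to the list that starts at its bucket.
    void insertItem(int key) { table_[index(key)].push_back(key); }

    // Search only the one list the key can be in.
    bool find(int key) const {
        const std::list<int>& chain = table_[index(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Unlink the first matching node from its list, if any.
    void removeElement(int key) {
        std::list<int>& chain = table_[index(key)];
        std::list<int>::iterator it = std::find(chain.begin(), chain.end(), key);
        if (it != chain.end()) chain.erase(it);
    }

    std::size_t chainLength(std::size_t bucket) const {
        return table_[bucket].size();
    }

private:
    // Assumes non-negative keys.
    std::size_t index(int key) const {
        return static_cast<std::size_t>(key) % table_.size();
    }

    std::vector<std::list<int> > table_;
};
```

With 13 buckets, the keys 90, 12, 38 and 25 all land in bucket 12 and form a single list of length 4, so a search for any of them walks at most four nodes.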
Open Addressing
This is a method where each bucket stores at most one item. If multiple elements map to the same bucket, some method must be used to find an empty bucket
• Linear probing
h’(k) = ( h(k) + j ) mod N where j = 0, 1, 2, 3, . . .
» Keep adding 1 to the index to find an empty bucket
• Quadratic probing
h’(k) = ( h(k) + j² ) mod N where j = 0, 1, 2, 3, . . .
• Double hashing
h’(k) = ( h(k) + j * h’’(k) ) mod N where j = 0, 1, 2, 3, . . .
where h’’(k) = q – (k mod q ) for a constant q (typically a prime smaller than N)
Linear Probing
If a bucket is already occupied, then try the next available bucket
[Figure: bucket array A before and after inserting 38104 McNealy, S.; the record hashes to cell 4, which already holds 77404 Gates, W., so it is stored in the next available bucket]
Linear Probing – insertItem(k,e)
If a location is already occupied, then try the next available location
Example:
– h(k) = ( (k % cap) + j ) mod cap where j = 0, 1, 2, 3, . . .
– Insert the following keys into hash table A with cap = 11: {13,26,5,37,16,21,15}
Resulting table ( - marks an empty location):
index:  0   1   2   3   4   5   6   7   8   9  10
A:      -   -  13   -  26   5  37  16  15   -  21
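The insertion rule can be traced with a short C++ sketch. It assumes non-negative int keys and uses -1 as a hypothetical marker for an empty location; the names linearInsert and EMPTY_SLOT are not from the original.

```cpp
#include <cassert>
#include <vector>

const int EMPTY_SLOT = -1;  // hypothetical marker; assumes keys are non-negative

// Insert key with linear probing.
// Returns the index where the key was stored, or -1 if the table is full.
int linearInsert(std::vector<int>& table, int key) {
    const int cap = static_cast<int>(table.size());
    assert(cap > 0);
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j) % cap;  // try the next location each time
        if (table[i] == EMPTY_SLOT) {
            table[i] = key;
            return i;
        }
    }
    return -1;
}
```

With a capacity of 11, the keys {13,26,5,37,16,21,15} end up at indices 2, 4, 5, 6, 7, 10 and 8 respectively. Key 15 needs five probes because locations 4 to 7 are already taken, which shows how linear probing builds up clusters.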
Linear Probing – Using Lazy Deletes
Problem:
– If the find() operation is looking for a key, it stops looking when it gets to an empty location and assumes the key isn’t there
– If several items whose keys hash to the same location are stored in the hash table with linear probing and then one of them is deleted, a “hole” is created, and find() might stop prematurely
• Solution: Implement removeElement so that it never empties a location, it just marks the location “FREE”; find() then keeps probing past FREE locations
Example: after removing key 5, location 5 is marked FREE
index:  0      1      2   3      4   5     6   7   8   9      10
A:      EMPTY  EMPTY  13  EMPTY  26  FREE  37  16  15  EMPTY  21
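Lazy deletion needs a third slot state in addition to empty and occupied. The sketch below is one possible arrangement, assuming non-negative int keys and linear probing; the names State, Slot, lazyFind, lazyInsert and lazyRemove are all hypothetical.

```cpp
#include <vector>

// Slot states for lazy deletion. FREE marks a slot whose item was removed.
enum State { EMPTY, OCCUPIED, FREE };

struct Slot {
    State state;
    int key;
    Slot() : state(EMPTY), key(0) {}
};

// Returns the index of key, or -1. Probing continues past FREE slots
// and stops only at a truly EMPTY slot.
int lazyFind(const std::vector<Slot>& table, int key) {
    const int cap = static_cast<int>(table.size());
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j) % cap;
        if (table[i].state == EMPTY) return -1;
        if (table[i].state == OCCUPIED && table[i].key == key) return i;
    }
    return -1;
}

// Stores key in the first EMPTY or FREE slot on its probe path.
// Returns the index used, or -1 if every slot is occupied.
int lazyInsert(std::vector<Slot>& table, int key) {
    const int cap = static_cast<int>(table.size());
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j) % cap;
        if (table[i].state != OCCUPIED) {
            table[i].state = OCCUPIED;
            table[i].key = key;
            return i;
        }
    }
    return -1;
}

// Marks the slot FREE instead of emptying it.
bool lazyRemove(std::vector<Slot>& table, int key) {
    int i = lazyFind(table, key);
    if (i < 0) return false;
    table[i].state = FREE;
    return true;
}
```

After the example keys are inserted and 5 is removed, lazyFind for 15 still walks through locations 4, 5, 6 and 7 and finds the key at index 8. A later insertion may reuse the FREE location.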
Quadratic Probing
Quadratic Probing is another open addressing strategy to deal with collisions. It uses the following formula:
h(k) = ( (k % cap) + j² ) mod cap where j = 0, 1, 2, 3, . . .
Example: {13,26,5,37,16,21,15}
((37 % 11) + 0²) % 11 = 4 //collision
((37 % 11) + 1²) % 11 = 5 //collision
((37 % 11) + 2²) % 11 = 8 //OK
Resulting table ( - marks an empty location):
index:  0   1   2   3   4   5   6   7   8   9  10
A:      -   -  13   -  26   5  16   -  37  15  21
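The same insertion sequence can be run with quadratic probing. As before, this sketch assumes non-negative int keys and a hypothetical -1 marker for empty locations; it gives up after cap probes, which is a simplification chosen for the example.

```cpp
#include <cassert>
#include <vector>

const int EMPTY_SLOT = -1;  // hypothetical marker; assumes keys are non-negative

// Insert key with quadratic probing.
// Returns the index used, or -1 if no empty location was found in cap probes.
int quadraticInsert(std::vector<int>& table, int key) {
    const int cap = static_cast<int>(table.size());
    assert(cap > 0);
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j * j) % cap;  // offsets 0, 1, 4, 9, ...
        if (table[i] == EMPTY_SLOT) {
            table[i] = key;
            return i;
        }
    }
    return -1;
}
```

With a capacity of 11, key 37 is stored at index 8 after two collisions, and key 15 is stored at index 9 after probing indices 4, 5, 8 and 2.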
Quadratic Probing Pros and Cons
Advantages
– Avoids primary clustering (the long runs of filled locations that linear probing builds up)
Disadvantages
– Creates secondary clustering – a different pattern of filled array locations
– If the load factor is 0.5 or more, an empty location may not be found even if one exists
Double Hashing
Double hashing is another alternative to linear probing where, if there’s a collision, a second, different hash function h’’ sets the step between probes
h’(k) = ( h(k) + j * h’’(k) ) mod N where j = 0, 1, 2, 3, . . .
and where h’’(k) = q – (k mod q )
Written out in full:
h(k) = ( (key % cap) + (j * ( q – ( key % q ) ) ) ) % cap where j = 0, 1, 2, 3, . . .
Example: {13,26,5,37}
Let q = 7
h(37) = ( (37 % 11) + (j * ( 7 – ( 37 % 7 ) ) ) ) % 11
j = 0: (4 + 0 * 5) % 11 = 4 //collision
j = 1: (4 + 1 * 5) % 11 = 9 //OK
Resulting table ( - marks an empty location):
index:  0   1   2   3   4   5   6   7   8   9  10
A:      -   -  13   -  26   5   -   -   -  37   -
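A sketch of the double hashing insertion follows. It assumes non-negative int keys, a hypothetical -1 marker for empty locations, and a caller-supplied q; the function name doubleHashInsert is not from the original.

```cpp
#include <cassert>
#include <vector>

const int EMPTY_SLOT = -1;  // hypothetical marker; assumes keys are non-negative

// Insert key with double hashing, using the step q - (key % q).
// Returns the index used, or -1 if no empty location was found in cap probes.
int doubleHashInsert(std::vector<int>& table, int key, int q) {
    const int cap = static_cast<int>(table.size());
    assert(cap > 0 && q > 0);
    const int step = q - (key % q);  // second hash value, always in 1..q
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j * step) % cap;
        if (table[i] == EMPTY_SLOT) {
            table[i] = key;
            return i;
        }
    }
    return -1;
}
```

With a capacity of 11 and q = 7, key 37 first tries index 4, which holds 26, and then moves ahead by its own step of 5 to index 9. Two keys that collide at the same index usually have different steps, so they follow different probe paths.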
Load Factor
The load factor of a hash table is the ratio of the number of items in the hash table to the number of buckets, and is denoted by λ (lambda)
– Expresses how “full” the hash table has become
– As a rule of thumb, should be kept below 0.75
– Example
capacity = 11
items stored = 7
load factor = 7/11 = 0.64
Rehashing
Maximum load factor, based on experimental data:
– 0.5 for open addressing schemes
– 0.9 for separate chaining
If the load factor is above that threshold, then the table should be resized
– New table should be at least double the size of the old table so that the time cost can be amortized
– Hash function should be modified so that its compression map uses the new capacity
– Rehash the data: take each item out of the old array and insert it into the new one using the new hash function
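The load factor check and the rehash step can be sketched together. This example assumes an open addressing table of non-negative int keys with linear probing and a hypothetical -1 marker for empty locations. It simply doubles the capacity; rounding the new size up to a prime would be a natural refinement that is not shown here.

```cpp
#include <cstddef>
#include <vector>

const int EMPTY_SLOT = -1;  // hypothetical marker; assumes keys are non-negative

// Ratio of stored items to buckets.
double loadFactor(const std::vector<int>& table) {
    std::size_t used = 0;
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i] != EMPTY_SLOT) ++used;
    }
    return static_cast<double>(used) / static_cast<double>(table.size());
}

// Linear probing insert; the compression map depends on the table's size.
bool insertKey(std::vector<int>& table, int key) {
    const std::size_t cap = table.size();
    for (std::size_t j = 0; j < cap; ++j) {
        std::size_t i = (static_cast<std::size_t>(key) % cap + j) % cap;
        if (table[i] == EMPTY_SLOT) {
            table[i] = key;
            return true;
        }
    }
    return false;
}

// Build a table of twice the capacity and reinsert every key, so that
// each key is placed using the new capacity.
std::vector<int> rehash(const std::vector<int>& old) {
    std::vector<int> bigger(old.size() * 2, EMPTY_SLOT);
    for (std::size_t i = 0; i < old.size(); ++i) {
        if (old[i] != EMPTY_SLOT) insertKey(bigger, old[i]);
    }
    return bigger;
}
```

A caller might write `if (loadFactor(table) > 0.5) table = rehash(table);` after each insertion. For the 7 keys of the earlier example, the load factor drops from 7/11 to 7/22 after one rehash.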