copyright © 2003-2005curt hill hashing a quick lookup strategy
TRANSCRIPT
Copyright © 2003-2005Curt Hill
Hashing
A quick lookup strategy
Copyright © 2003-2005Curt Hill
What is a hash?
• Hashing is another name for key transformation
• The original key is usually a character string or other sparse key
• The result is usually a dense integer key
Copyright © 2003-2005Curt Hill
Components
• Every hash scheme has three pieces– A hash function– A collection of buckets– Collision scheme
• The hash function transforms the key into an integer 0 – N
• The buckets are numbered 0 – N and hold the resulting data
Copyright © 2003-2005Curt Hill
Basics
• The idea: – Have N containers for the data– Have a function that maps the original
key to a number in the range 0 – (N-1)– One access retrieves the data
regardless of size– When it works it is the fastest way to
lookup a value
Copyright © 2003-2005Curt Hill
Hash function
• Convert a sparse key into a dense key in the form of an integer
• The input is the real key which is a sparse key– Sparse keys use very few of very
many possible key values
• Typical key is a character string while the result needs an integer
Copyright © 2003-2005Curt Hill
Typical hash function
• Input is a character string• Output is an integer in range 0 - N• Sum the ordinal value of each character
of the string• Divide the result by N and take the
remainder– This guarantees the right range
• Number theory tells us to make N prime
Copyright © 2003-2005Curt Hill
Example values with N = 256
• “Abcdef” returns 53• “Hi there” returns 233• “ABCDEF” returns 149• “FEDCBA” returns 149• “A character string” returns 197 • “A big character string” returns 23• “Zoology” returns 243• See the pattern yet?
Copyright © 2003-2005Curt Hill
Hash functions
• Computing the function is easy• Two problems
– One result produced by two different keys – called a collision
– The order of the original key is scrambled in the result
• Very good for an equality test and very bad for a range test
Copyright © 2003-2005Curt Hill
Is it that simple?• Not usually• The previous scheme maps all
words with exactly the same characters to the same integer value
• Many variation– Do a different computation on even
and odd characters so a re-arrangment will produce different values
Copyright © 2003-2005Curt Hill
Collisions
• Two keys giving the same integer– Assume each bucket may only hold
one record
• Prevent hashing from the optimal search technique it could be
• Almost every hashing scheme needs a collision strategy
• There have been many variations
Copyright © 2003-2005Curt Hill
Collision Strategies
• Linear probing– Add one to index until right one is
found
• Quadratic probing– Square the index when the collision
occurs
• Rehash– Secondary hash function
• Chaining– Link duplicates in an overflow area
Copyright © 2003-2005Curt Hill
Collision Strategies
0
1
5
6
7
4
2
3
6 m
6 y
2 e
2 t
1 d
5 c
0
5
4
3
2
1
6
7
1 d
2 e
5 c
6 m
2 t
6 y
Linear Quadratic Chaining
0
1
2
3
4
5 5 c
6
7
2 e
6 m1 d
2 t
6 y
Copyright © 2003-2005Curt Hill
Free Space
• As the free space decreases the collisions increase
• The best plan is to have about twice as many buckets as will be needed
Copyright © 2003-2005Curt Hill
Effect of load using linear probing
Load factor Number of probes
10% 1.06
25% 1.17
50% 1.5
75% 2.5
90% 5.5
95% 10.5
Copyright © 2003-2005Curt Hill
Dynamics
• A hash table is much harder to both grow and shrink
• Increasing or decreasing the number of buckets requires:– Allocate a new table (larger or smaller)– Rehashing every item in old into the new
• Deleting a record (without resizing) is also a problem because of collisions
Copyright © 2003-2005Curt Hill
Deletion of a key
• The problem is collisions• If the key has no collisions just
mark the bucket as empty• If it has collisions
– All of them need to be rehashed– Finding them depends on the collision
strategy
Copyright © 2003-2005Curt Hill
Two more terms
• A perfect hash is one that has no collisions– Only occurs in situations with two
conditions:– Keys are fixed and known in advance– The hash function is tailored to these
keys• A minimal hash has no empty
buckets
Copyright © 2003-2005Curt Hill
Data Skew• A good hash function spreads the
keys uniformly among the integer range
• How well does this work when the data is not uniformly distributed?
• For example consider – Names - There are many more Smiths
than Garnjobsts– Numbers – there are many more 101
courses than most other numbers • Can the hash function still give a
good distribution?
Copyright © 2003-2005Curt Hill
Summary
• Hashing works best when it works well– Stable index– Good distribution from the hash
function
• It is much harder to make dynamic than trees