CSE 326: Hashing
David Kaplan
Dept of Computer Science & Engineering, Autumn 2001

Reminder: Dictionary ADT
Dictionary operations
  insert, find, delete, create, destroy
Stores values associated with user-specified keys
  values may be any (homogeneous) type
  keys may be any (homogeneous) comparable type
Example entries (key: value)
  Adrien: Roller-blade demon
  Hannah: C++ guru
  Dave: Older than dirt
  …
insert(Donald, l33t haxtor)
find(Adrien) returns (Adrien, Roller-blade demon)

Dictionary Implementations So Far

| | Insert | Find | Delete |
|---|---|---|---|
| Unsorted list | O(1) | O(n) | O(n) |
| Trees | O(log n) | O(log n) | O(log n) |
| Sorted array | O(n) | O(log n) | O(n) |
| Array special case, known keys {1, …, K} | O(1) | O(1) | O(1) |

ADT Legalities: A Digression on Keys
Methods are the contract between an ADT and the outside agent (client code)
  Ex: Dictionary contract is {insert, find, delete}
  Ex: Priority Q contract is {insert, deleteMin}
Keys are the currency used in transactions between an outside agent and ADT
  Ex: insert(key), find(key), delete(key)
So … How about O(1) insert/find/delete for any key type?

Hash Table Goal: Key as Index
We can access a record as a[5]
We want to access a record as a[“Hannah”]
[Figure: an integer-indexed array (slot 2: Adrien, roller-blade demon; slot 5: Hannah, C++ guru) beside a key-indexed array (slot “Adrien”: Adrien, roller-blade demon; slot “Hannah”: Hannah, C++ guru)]

Hash Table Approach
[Figure: f(x) maps the keys Hannah, Dave, Adrien, Donald, Ed to table slots]
But… is there a problem with this pipe-dream?

Hash Table Dictionary Data Structure
Hash function: maps keys to integers
  Result: Can quickly find the right spot for a given entry
Unordered and sparse table
  Result:
    Cannot efficiently list all entries
    Cannot efficiently find min, max, ordered ranges
[Figure: f(x) maps the keys Hannah, Dave, Adrien, Donald, Ed to table slots]

Hash Table Taxonomy
[Figure: the keys Hannah, Dave, Adrien, Donald, Ed pass through the hash function f(x) into the table; two keys that land in the same slot are labeled a collision]
load factor λ = (# of entries in table) / tableSize

Agenda: Hash Table Design Decisions
What should the hash function be?
What should the table size be?
How should we resolve collisions?

Hash Function
Hash function maps a key to a table index

Value & find(Key & key) {
  int index = hash(key) % tableSize;
  return Table[index];
}

What Makes A Good Hash Function?
Fast runtime
  O(1) and fast in practical terms
Distributes the data evenly
  hash(a) % size ≠ hash(b) % size
Uses the whole hash table
  for all 0 ≤ i < size, ∃ k such that hash(k) % size = i

Good Hash Function for Integer Keys
Choose
  tableSize is prime
  hash(n) = n
Example: tableSize = 7
  insert(4)
  insert(17)
  find(12)
  insert(9)
  delete(17)
[Figure: empty table with slots 0 through 6]

Good Hash Function for Strings?
Let $s = s_1 s_2 s_3 s_4 \dots s_n$: choose
  $\text{hash}(s) = s_1 + s_2 \cdot 128 + s_3 \cdot 128^2 + s_4 \cdot 128^3 + \dots + s_n \cdot 128^{n-1}$
Think of the string as a base 128 (aka radix 128) number
Problems:
  hash(“really, really big”) = well… something really, really big
  hash(“one thing”) % 128 = hash(“other thing”) % 128

String Hashing Issues and Techniques
Minimize collisions
  Make tableSize and radix relatively prime
  Typically, make tableSize not a multiple of 128
Simplify computation
  Use Horner’s Rule

int hash(String s) {
  int h = 0;
  for (int i = s.length() - 1; i >= 0; i--) {
    h = (s[i] + 128*h) % tableSize;
  }
  return h;
}
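The Horner's Rule loop above can be tried out as a small standalone function. This is only a sketch: the name `hornerHash`, the use of `std::string`, and passing `tableSize` as a parameter are assumptions made for illustration, and characters are assumed to be 7-bit ASCII.

```cpp
#include <string>

// Hypothetical helper: treats s as a base-128 number whose lowest-order
// digit is s[0], and reduces modulo tableSize at every step so the
// running value stays small.
int hornerHash(const std::string& s, int tableSize) {
    int h = 0;
    for (int i = static_cast<int>(s.length()) - 1; i >= 0; --i) {
        // Assumes 7-bit ASCII characters (values 0..127).
        h = (static_cast<unsigned char>(s[i]) + 128 * h) % tableSize;
    }
    return h;
}
```

With a table size of 128, `hornerHash("one thing", 128)` and `hornerHash("other thing", 128)` both reduce to the code of the first character, which is the collision problem the slides warn about when tableSize and the radix share factors.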

Good Hashing: Multiplication Method
Hash function is defined by size plus a parameter A
  hA(k) = floor(size * (k*A mod 1)) where 0 < A < 1
Example: size = 10, A = 0.485
  hA(50) = floor(10 * (50*0.485 mod 1))
         = floor(10 * (24.25 mod 1))
         = floor(10 * 0.25)
         = floor(2.5) = 2
no restriction on size!
when building a static table, we can try several values of A
more computationally intensive than a single mod
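A minimal sketch of the multiplication method in floating point follows. The function name `multiplicationHash` and its parameter order are assumptions for illustration; "mod 1" is taken to mean the fractional part, and the result is rounded down to get an integer slot.

```cpp
#include <cmath>

// Hypothetical helper: floor(size * frac(k * A)) for 0 < A < 1.
int multiplicationHash(long k, int size, double A) {
    double product = static_cast<double>(k) * A;
    double frac = product - std::floor(product);  // "k*A mod 1"
    return static_cast<int>(std::floor(size * frac));
}
```

Calling `multiplicationHash(50, 10, 0.485)` reproduces the worked example: the fractional part of 24.25 is 0.25, and 10 * 0.25 rounds down to slot 2.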

Hashing Dilemma
Suppose your Worst Enemy 1) knows your hash function; 2) gets to decide which keys to send you?
Faced with this enticing possibility, Worst Enemy decides to:
  a) Send you keys which maximize collisions for your hash function.
  b) Take a nap.
Moral: No single hash function can protect you!
Faced with this dilemma, you:
  a) Give up and use a linked list for your Dictionary.
  b) Drop out of software, and choose a career in fast foods.
  c) Run and hide.
  d) Proceed to the next slide, in hope of a better alternative.

Universal Hashing¹
Suppose we have a set K of possible keys, and a finite set H of hash functions that map keys to entries in a hashtable of size m.
Definition: H is a universal collection of hash functions if and only if …
  For any two distinct keys k1, k2 in K, there are at most |H|/m functions h in H for which h(k1) = h(k2).
So … if we randomly choose a hash function from H, our chances of collision are no more than if we get to choose hash table entries at random!
[Figure: hash functions h, hi, hj from H map keys k1, k2 in K to table slots 0, 1, …, m-1]
¹ Motivation: see previous slide (or visit http://www.burgerking.com/jobs)

Random Hashing – Not!
How can we “randomly choose a hash function”?
  Certainly we cannot randomly choose hash functions at runtime, interspersed amongst the inserts, finds, deletes! Why not?
  We can, however, randomly choose a hash function each time we initialize a new hashtable.
Conclusions
  Worst Enemy never knows which hash function we will choose – neither do we!
  No single input (set of keys) can always evoke worst-case behavior

Good Hashing: Universal Hash Function A (UHFa)
Parameterized by prime table size and vector:
  a = <a0 a1 … ar> where 0 <= ai < size
Represent each key as r + 1 integers where ki < size
  size = 11, key = 39752 ==> <3,9,7,5,2>
  size = 29, key = “hello world” ==> <8,5,12,12,15,23,15,18,12,4>
$h_a(k) = \left( \sum_{i=0}^{r} a_i k_i \right) \bmod \text{size}$

UHFa: Example
Context: hash strings of length 3 in a table of size 131
let a = <35, 100, 21>
ha(“xyz”) = (35*120 + 100*121 + 21*122) % 131 = 129
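The dot-product form of UHFa can be checked with a few lines of code. This is a sketch under assumptions: the name `uhfA` and the use of `std::vector<int>` for both the coefficient vector and the key digits are illustrative choices, not part of the slides.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: dot product of the coefficient vector a and the
// key digits k, reduced modulo the (prime) table size.
int uhfA(const std::vector<int>& a, const std::vector<int>& k, int size) {
    long long sum = 0;
    for (std::size_t i = 0; i < a.size() && i < k.size(); ++i) {
        // Reduce after each term so the running sum stays small.
        sum = (sum + static_cast<long long>(a[i]) * k[i]) % size;
    }
    return static_cast<int>(sum);
}
```

For the example above, "xyz" has character codes 120, 121, 122, so `uhfA({35, 100, 21}, {120, 121, 122}, 131)` evaluates 18862 mod 131, which is 129.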

Thinking about UHFa
Strengths:
  works on any type as long as you can form ki’s
  if we’re building a static table, we can try many values of the hash vector <a>
  random <a> has guaranteed good properties no matter what we’re hashing
Weaknesses:
  must choose prime table size larger than any ki

Good Hashing: Universal Hash Function 2 (UHF2)
Parameterized by j, a, and b:
  j * size should fit into an int
  a and b must be less than size
hj,a,b(k) = ((ak + b) mod (j*size)) / j

UHF2: Example
Context: hash integers in a table of size 16
let j = 32, a = 100, b = 200
hj,a,b(1000) = ((100*1000 + 200) % (32*16)) / 32
             = (100200 % 512) / 32
             = 360 / 32
             = 11 (integer division)
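The UHF2 formula translates almost directly into code. The sketch below assumes a function name `uhf2` and uses `long long` throughout to sidestep overflow; the final division is integer division, which is why 360 / 32 gives 11.

```cpp
// Hypothetical helper: h_{j,a,b}(k) = ((a*k + b) mod (j*size)) / j,
// using integer arithmetic so the final division truncates.
int uhf2(long long k, long long j, long long a, long long b, long long size) {
    return static_cast<int>(((a * k + b) % (j * size)) / j);
}
```

`uhf2(1000, 32, 100, 200, 16)` follows the worked example step by step: 100200 % 512 is 360, and 360 / 32 truncates to 11.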

Thinking about UHF2
Strengths
  if we’re building a static table, we can try many parameter values
  random a,b has guaranteed good properties no matter what we’re hashing
  can choose any size table
  very efficient if j and size are powers of 2 (why?)
Weaknesses
  need to turn non-integer keys into integers

Hash Function Summary
Goals of a hash function
  reproducible mapping from key to table index
  evenly distribute keys across the table
  separate commonly occurring keys (neighboring keys?)
  fast runtime
Some hash function candidates
  h(n) = n % size
  h(n) = string as base 128 number % size
  Multiplication hash: compute percentage through the table
  Universal hash function A: dot product with random vector
  Universal hash function 2: next pseudo-random number

Hash Function Design Considerations
Know what your keys are
Study how your keys are distributed
Try to include all important information in a key in the construction of its hash
Try to make “neighboring” keys hash to very different places
Prune the features used to create the hash until it runs “fast enough” (very application dependent)

Handling Collisions
Pigeonhole principle says we can’t avoid all collisions
  try to hash without collision n keys into m slots with n > m
  try to put 6 pigeons into 5 holes
What do we do when two keys hash to the same entry?
  Separate Chaining: put a little dictionary in each entry
  Open Addressing: pick a next entry to try within hashtable
Terminology madness :-(
  Separate Chaining sometimes called Open Hashing
  Open Addressing sometimes called Closed Hashing

Separate Chaining
Put a little dictionary at each entry
  Commonly, unordered linked list (chain)
  Or, choose another Dictionary type as appropriate (search tree, hashtable, etc.)
Properties
  λ can be greater than 1
  performance degrades with length of chains
  Alternate Dictionary type (e.g. search tree, hashtable) can speed up secondary search
[Figure: table with slots 0 through 6; h(a) = h(d) and h(e) = h(b), so a and d share one chain, e and b share another, and c sits alone]

Separate Chaining Code

// private helper: the bucket (little dictionary) that key k hashes to
Dictionary & findBucket(const Key & k) {
  return table[hash(k) % table.size];
}

void insert(const Key & k, const Value & v) {
  findBucket(k).insert(k, v);
}

Value & find(const Key & k) {
  return findBucket(k).find(k);
}

// named remove because delete is a reserved word in C++
void remove(const Key & k) {
  findBucket(k).remove(k);
}
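To make the bucket idea concrete, here is a small runnable sketch of a separate-chaining dictionary. Several things are assumptions for illustration: the class name `ChainedTable`, string keys and values, `std::hash` as the hash function, and a `std::list` chain in each bucket. The erase operation is called `remove` because `delete` is a reserved word in C++.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Sketch of a separate-chaining dictionary from strings to strings.
// Each table entry is an unordered linked list (chain) of key/value pairs.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t tableSize) : table_(tableSize) {}

    void insert(const std::string& k, const std::string& v) {
        auto& bucket = findBucket(k);
        for (auto& entry : bucket) {
            if (entry.first == k) { entry.second = v; return; }  // overwrite
        }
        bucket.emplace_back(k, v);
    }

    // Returns a pointer to the stored value, or nullptr if k is absent.
    const std::string* find(const std::string& k) {
        for (auto& entry : findBucket(k)) {
            if (entry.first == k) return &entry.second;
        }
        return nullptr;
    }

    void remove(const std::string& k) {
        findBucket(k).remove_if(
            [&k](const std::pair<std::string, std::string>& e) {
                return e.first == k;
            });
    }

private:
    using Bucket = std::list<std::pair<std::string, std::string>>;

    // The bucket (little dictionary) that key k hashes to.
    Bucket& findBucket(const std::string& k) {
        return table_[std::hash<std::string>{}(k) % table_.size()];
    }

    std::vector<Bucket> table_;
};
```

Even a one-slot table (`ChainedTable t(1)`) still works, since every key simply lands in the same chain; it just degrades to a linked-list search.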

Load Factor in Separate Chaining
Search cost
unsuccessful search:
successful search:
Desired load factor:

Open Addressing
Allow one key at each table entry
  two objects that hash to the same spot can’t both go there
  first one there gets the spot
  next one must go in another spot
Properties
  λ ≤ 1
  performance degrades with difficulty of finding right spot
[Figure: table with slots 0 through 6 holding a, c, e, d, b, one key per slot; h(a) = h(d) and h(e) = h(b), so d and b are displaced to other slots]

Probing
Requires collision resolution function f(i)
Probing how to:
  First probe - given a key k, hash to h(k)
  Second probe - if h(k) is occupied, try h(k) + f(1)
  Third probe - if h(k) + f(1) is occupied, try h(k) + f(2)
  And so forth
Probing properties
  we force f(0) = 0
  ith probe is to (h(k) + f(i)) mod size
  if i reaches size - 1, the probe has failed
  depending on f(), the probe may fail sooner
  long sequences of probes are costly!

Linear Probing
f(i) = i
Probe sequence is
  h(k) mod size
  (h(k) + 1) mod size
  (h(k) + 2) mod size
  …

bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash(k) % size;  // reduce first so the index is in range
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + 1) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

Linear Probing Example
tableSize = 7, h(k) = k % 7

| Operation | k % 7 | Probes | Slot used |
|---|---|---|---|
| insert(76) | 6 | 1 | 6 |
| insert(93) | 2 | 1 | 2 |
| insert(40) | 5 | 1 | 5 |
| insert(47) | 5 | 3 | 0 |
| insert(10) | 3 | 1 | 3 |
| insert(55) | 6 | 3 | 1 |

Final table: slot 0 = 47, 1 = 55, 2 = 93, 3 = 10, 4 = empty, 5 = 40, 6 = 76
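The insert sequence in this example (76, 93, 40, 47, 10, 55 into a table of size 7) can be replayed with a short simulation. This is a sketch under assumptions: non-negative integer keys, h(k) = k % size, a sentinel value `kEmpty` for unused slots, and a hypothetical helper `insertLinear` that reports how many probes were needed.

```cpp
#include <vector>

const int kEmpty = -1;  // sentinel for an unused slot (keys assumed >= 0)

// Hypothetical helper: inserts key with linear probing, h(k) = k % size.
// Returns the number of probes used, or -1 if the table is full.
int insertLinear(std::vector<int>& table, int key) {
    const int size = static_cast<int>(table.size());
    for (int i = 0; i < size; ++i) {
        int slot = (key % size + i) % size;  // f(i) = i
        if (table[slot] == kEmpty) {
            table[slot] = key;
            return i + 1;
        }
    }
    return -1;
}
```

Starting from `std::vector<int> table(7, kEmpty)`, inserting 47 takes three probes (slots 5 and 6 are taken, slot 0 is free), and 55 likewise takes three probes before settling in slot 1.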

Load Factor in Linear Probing
For any λ < 1, linear probing will find an empty slot
Search cost (for large table sizes)
  successful search: $\frac{1}{2}\left(1 + \frac{1}{1 - \lambda}\right)$
  unsuccessful search: $\frac{1}{2}\left(1 + \frac{1}{(1 - \lambda)^2}\right)$
Linear probing suffers from primary clustering
Performance quickly degrades for λ > 1/2

Quadratic Probing
f(i) = i^2
Probe sequence:
  h(k) mod size
  (h(k) + 1) mod size
  (h(k) + 4) mod size
  (h(k) + 9) mod size
  …

bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash(k) % size, i = 0;  // reduce first so the index is in range
  do {
    entry = &table[probePoint];
    i++;
    // i^2 - (i-1)^2 = 2i - 1, so each step adds the next odd number
    probePoint = (probePoint + (2*i - 1)) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

Good Quadratic Probing Example
tableSize = 7, h(k) = k % 7

| Operation | k % 7 | Probes | Slot used |
|---|---|---|---|
| insert(76) | 6 | 1 | 6 |
| insert(40) | 5 | 1 | 5 |
| insert(48) | 6 | 2 | 0 |
| insert(5) | 5 | 3 | 2 |
| insert(55) | 6 | 3 | 3 |

Final table: slot 0 = 48, 1 = empty, 2 = 5, 3 = 55, 4 = empty, 5 = 40, 6 = 76

Bad Quadratic Probing Example
tableSize = 7, h(k) = k % 7

| Operation | k % 7 | Probes | Slot used |
|---|---|---|---|
| insert(76) | 6 | 1 | 6 |
| insert(93) | 2 | 1 | 2 |
| insert(40) | 5 | 1 | 5 |
| insert(35) | 0 | 1 | 0 |
| insert(47) | 5 | fails | none |

insert(47) probes only slots 5, 6, 2, 0 (i^2 mod 7 takes only the values 0, 1, 4, 2), and all four are occupied, even though slots 1, 3, 4 are empty
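Both quadratic probing examples can be replayed with a short simulation. As a sketch, it assumes non-negative integer keys, h(k) = k % size, a sentinel `kEmpty`, and a hypothetical helper `insertQuadratic` that gives up after `size` probes.

```cpp
#include <vector>

const int kEmpty = -1;  // sentinel for an unused slot (keys assumed >= 0)

// Hypothetical helper: inserts key with quadratic probing, f(i) = i*i.
// Returns the number of probes used, or -1 if no probed slot was free.
int insertQuadratic(std::vector<int>& table, int key) {
    const int size = static_cast<int>(table.size());
    for (int i = 0; i < size; ++i) {
        int slot = (key % size + i * i) % size;
        if (table[slot] == kEmpty) {
            table[slot] = key;
            return i + 1;
        }
    }
    return -1;  // every slot the probe sequence can reach is occupied
}
```

In the bad example the table holds 76, 93, 40, 35 (load factor 4/7, above ½) and inserting 47 returns -1 even though three slots are still empty, because the probe sequence only ever reaches slots 5, 6, 2 and 0.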

Quadratic Probing Succeeds for λ ≤ ½
If size is prime and λ ≤ ½, then quadratic probing will find an empty slot in size/2 probes or fewer.
  show for all 0 ≤ i, j ≤ size/2 and i ≠ j:
    (h(x) + i^2) mod size ≠ (h(x) + j^2) mod size
  by contradiction: suppose that for some i, j:
    (h(x) + i^2) mod size = (h(x) + j^2) mod size
    i^2 mod size = j^2 mod size
    (i^2 - j^2) mod size = 0
    [(i + j)(i - j)] mod size = 0
  since size is prime, it must divide (i + j) or (i - j)
  but how can i + j = 0 or i + j = size when i ≠ j and i, j ≤ size/2?
  same for (i - j) mod size = 0

Quadratic Probing May Fail for λ > ½
For any i larger than size/2, there is some j smaller than i that adds with i to equal size (or a multiple of size). D’oh!

Load Factor in Quadratic Probing
For any λ ≤ ½, quadratic probing will find an empty slot
For λ > ½, quadratic probing may find a slot
Quadratic probing does not suffer from primary clustering
Quadratic probing does suffer from secondary clustering
  How could we possibly solve this?

Double Hashing
f(i) = i*hash2(k)
Probe sequence:
  h1(k) mod size
  (h1(k) + 1*h2(k)) mod size
  (h1(k) + 2*h2(k)) mod size
  …

bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k) % size, delta = hash2(k);  // reduce first so the index is in range
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + delta) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

A Good Double Hash Function…
  …is quick to evaluate.
  …differs from the original hash function.
  …never evaluates to 0 (mod size).
One good choice:
  Choose a prime p < size
  Let hash2(k) = p - (k mod p)

Double Hashing Example (p=5)
tableSize = 7, h1(k) = k % 7, hash2(k) = 5 - (k % 5)

| Operation | k % 7 | hash2(k) | Probes | Slot used |
|---|---|---|---|---|
| insert(76) | 6 | not needed | 1 | 6 |
| insert(93) | 2 | not needed | 1 | 2 |
| insert(40) | 5 | not needed | 1 | 5 |
| insert(47) | 5 | 5 - (47%5) = 3 | 2 | 1 |
| insert(10) | 3 | not needed | 1 | 3 |
| insert(55) | 6 | 5 - (55%5) = 5 | 2 | 4 |

Final table: slot 0 = empty, 1 = 47, 2 = 93, 3 = 10, 4 = 55, 5 = 40, 6 = 76
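The double hashing example can also be replayed in code. This sketch assumes non-negative integer keys, h1(k) = k % size, the slide's hash2(k) = p - (k mod p), a sentinel `kEmpty`, and a hypothetical helper `insertDouble` that returns the probe count.

```cpp
#include <vector>

const int kEmpty = -1;  // sentinel for an unused slot (keys assumed >= 0)

// Secondary hash from the slides: p - (k mod p), which is never 0.
int hash2(int key, int p) { return p - (key % p); }

// Hypothetical helper: inserts key using double hashing with
// h1(k) = k % size. Returns the number of probes used, or -1 on failure.
int insertDouble(std::vector<int>& table, int key, int p) {
    const int size = static_cast<int>(table.size());
    const int delta = hash2(key, p);
    int slot = key % size;
    for (int i = 0; i < size; ++i) {
        if (table[slot] == kEmpty) {
            table[slot] = key;
            return i + 1;
        }
        slot = (slot + delta) % size;  // step by hash2(k) each time
    }
    return -1;
}
```

With size 7 and p = 5, key 47 collides at slot 5 and steps by 3 to slot 1, while key 55 collides at slot 6 and steps by 5 to slot 4, so each needs only two probes.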

Load Factor in Double Hashing
For any λ < 1, double hashing will find an empty slot (given appropriate table size and hash2)
Search cost appears to approach optimal (random hash):
  successful search: $\frac{1}{\lambda} \ln \frac{1}{1 - \lambda}$
  unsuccessful search: $\frac{1}{1 - \lambda}$
No primary clustering and no secondary clustering
One extra hash calculation

Deletion in Open Addressing
[Figure: a 7-slot linear-probing table holds 0, 1, 2 in slots 0, 1, 2 and 7 in slot 3. After delete(2) empties slot 2, find(7) probes slots 0, 1, 2 and stops at the empty slot. Where is it?!]
Must use lazy deletion!
On insertion, treat a (lazily) deleted item as an empty slot
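A minimal sketch of lazy deletion follows, assuming linear probing with h(k) = k % size and non-negative integer keys. The names `LazyTable`, `SlotState` and `locate` are hypothetical. Each slot carries a three-way state so that a search can walk past deleted slots while an insert can reuse them.

```cpp
#include <vector>

enum class SlotState { Empty, Occupied, Deleted };

struct Slot {
    SlotState state = SlotState::Empty;
    int key = 0;
};

// Linear-probing table with h(k) = k % size and lazy deletion.
class LazyTable {
public:
    explicit LazyTable(int size) : slots_(size) {}

    bool contains(int key) const { return locate(key) >= 0; }

    bool insert(int key) {
        if (contains(key)) return true;
        const int size = static_cast<int>(slots_.size());
        for (int i = 0; i < size; ++i) {
            Slot& s = slots_[(key % size + i) % size];
            // A lazily deleted slot is reusable, just like an empty one.
            if (s.state != SlotState::Occupied) {
                s.state = SlotState::Occupied;
                s.key = key;
                return true;
            }
        }
        return false;  // table is full
    }

    void remove(int key) {
        int pos = locate(key);
        if (pos >= 0) slots_[pos].state = SlotState::Deleted;  // mark only
    }

private:
    // Returns the index holding key, or -1. Probing continues past
    // Deleted slots and stops only at a truly Empty slot.
    int locate(int key) const {
        const int size = static_cast<int>(slots_.size());
        for (int i = 0; i < size; ++i) {
            int pos = (key % size + i) % size;
            const Slot& s = slots_[pos];
            if (s.state == SlotState::Empty) return -1;
            if (s.state == SlotState::Occupied && s.key == key) return pos;
        }
        return -1;
    }

    std::vector<Slot> slots_;
};
```

Replaying the figure: after inserting 0, 1, 2, 7 into a 7-slot table and calling `remove(2)`, `contains(7)` still succeeds because the search steps over the deleted slot instead of stopping there.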

The Squished Pigeon Principle
Insert using Open Addressing cannot work with λ ≥ 1.
Insert using Open Addressing with quadratic probing may not work with λ > ½.
With Separate Chaining or Open Addressing, large load factors lead to poor performance!
How can we relieve the pressure on the pigeons?
  Hint: what happens when we overrun array storage in a {queue, stack, heap}?
  What else must happen with a hashtable?

Rehashing
When λ gets “too large” (over some constant threshold), rehash all elements into a new, larger table:
  takes O(n), but amortized O(1) as long as we (just about) double table size on the resize
  spreads keys back out, may drastically improve performance
  gives us a chance to retune parameterized hash functions
  avoids failure for Open Addressing techniques
  allows arbitrarily large tables starting from a small table
  clears out lazily deleted items
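One way the rehashing policy might look in code is sketched below. The details are assumptions, not taken from the slides: a linear-probing table of non-negative integers, an initial size of 7, a threshold of λ = ½, and growth to the next prime at or above twice the old size. The names `RehashingTable` and `nextPrime` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

const int kEmpty = -1;  // sentinel for an unused slot (keys assumed >= 0)

// Smallest prime >= n (trial division; fine for a sketch).
std::size_t nextPrime(std::size_t n) {
    if (n < 2) return 2;
    for (;; ++n) {
        bool prime = true;
        for (std::size_t d = 2; d * d <= n; ++d) {
            if (n % d == 0) { prime = false; break; }
        }
        if (prime) return n;
    }
}

class RehashingTable {
public:
    RehashingTable() : table_(7, kEmpty), count_(0) {}

    void insert(int key) {
        // Grow before the load factor would exceed 1/2.
        if (2 * (count_ + 1) > table_.size()) rehash();
        if (place(table_, key)) ++count_;
    }

    bool contains(int key) const {
        const std::size_t size = table_.size();
        for (std::size_t i = 0; i < size; ++i) {
            int v = table_[(key % size + i) % size];
            if (v == kEmpty) return false;
            if (v == key) return true;
        }
        return false;
    }

    std::size_t tableSize() const { return table_.size(); }
    std::size_t count() const { return count_; }

private:
    // Linear-probing insert; returns false if key was already present.
    static bool place(std::vector<int>& t, int key) {
        const std::size_t size = t.size();
        for (std::size_t i = 0; i < size; ++i) {
            int& v = t[(key % size + i) % size];
            if (v == key) return false;
            if (v == kEmpty) { v = key; return true; }
        }
        return false;
    }

    // Roughly double the table (to a prime) and re-insert every key.
    void rehash() {
        std::vector<int> bigger(nextPrime(2 * table_.size()), kEmpty);
        for (int v : table_) {
            if (v != kEmpty) place(bigger, v);
        }
        table_.swap(bigger);
    }

    std::vector<int> table_;
    std::size_t count_;
};
```

Because each resize roughly doubles the table, the O(n) cost of a rehash is spread over the inserts that preceded it, which is the amortized O(1) argument from the slide.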

Case Study
Spelling dictionary
  30,000 words
  static
  arbitrary(ish) preprocessing time
Goals
  fast spell checking
  minimal storage
Practical notes
  almost all searches are successful – Why?
  words average about 8 characters in length
  30,000 words at 8 bytes/word ~ .25 MB
  pointers are 4 bytes
  there are many regularities in the structure of English words

Case Study: Design Considerations
Possible Solutions
  sorted array + binary search
  Separate Chaining
  Open Addressing + linear probing
Issues
  Which data structure should we use?
  Which type of hash function should we use?

Case Study: Storage
Assume words are strings and entries are pointers to strings
[Figure: three layouts side by side: Array + binary search, Separate Chaining, Open addressing]
How many pointers does each use?

Case Study: Analysis

| | storage | time |
|---|---|---|
| Binary search | n pointers + words = 360KB | log2 n ≈ 15 probes per access, worst case |
| Separate Chaining | 2n + n/λ pointers + words (λ = 1 ⇒ 600KB): a word pointer and a next pointer per chain node, plus n/λ table entries | 1 + λ/2 probes per access on average (λ = 1 ⇒ 1.5 probes) |
| Open Addressing | n/λ pointers + words (λ = 0.5 ⇒ 480KB) | (1 + 1/(1 - λ))/2 probes per access on average (λ = 0.5 ⇒ 1.5 probes) |

What to do, what to do? …

Perfect Hashing
When we know the entire key set in advance …
  Examples: programming language keywords, CD-ROM file list, spelling dictionary, etc.
… then perfect hashing lets us achieve:
  Worst-case O(1) time complexity!
  Worst-case O(n) space complexity!

Perfect Hashing Technique
Static set of n known keys
Separate chaining, two-level hash
  Primary hash table size = n
  jth secondary hash table size = nj^2 (where nj keys hash to slot j in primary hash table)
Universal hash functions in all hash tables
Conduct (a few!) random trials, until we get collision-free hash functions
[Figure: primary hash table with slots 0 through 6, each pointing to its secondary hash table]
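The two-level scheme can be sketched for a static set of integers. Several choices here are assumptions rather than the slides' method: the universal family used is h(k) = ((a*k + b) mod p) mod m with p = 2^31 - 1 (swapped in for UHFa/UHF2 for brevity), keys are assumed distinct and non-negative, the primary function is chosen once rather than retried to bound total storage, and the names `UniversalHash`, `randomHash` and `PerfectTable` are hypothetical.

```cpp
#include <cstddef>
#include <random>
#include <vector>

const long long kPrime = 2147483647LL;  // 2^31 - 1, larger than any int key

// One member of the family h(k) = ((a*k + b) mod p) mod m.
struct UniversalHash {
    long long a = 1, b = 0;
    std::size_t m = 1;
    std::size_t operator()(int key) const {
        return static_cast<std::size_t>(
            ((a * key + b) % kPrime) % static_cast<long long>(m));
    }
};

UniversalHash randomHash(std::mt19937& rng, std::size_t m) {
    UniversalHash h;
    h.a = std::uniform_int_distribution<long long>(1, kPrime - 1)(rng);
    h.b = std::uniform_int_distribution<long long>(0, kPrime - 1)(rng);
    h.m = m;
    return h;
}

class PerfectTable {
public:
    // keys must be distinct and non-negative (assumption of this sketch).
    explicit PerfectTable(const std::vector<int>& keys, unsigned seed = 326) {
        std::mt19937 rng(seed);
        const std::size_t n = keys.empty() ? 1 : keys.size();
        primary_ = randomHash(rng, n);
        std::vector<std::vector<int>> groups(n);
        for (int k : keys) groups[primary_(k)].push_back(k);
        secondary_.resize(n);
        slots_.resize(n);
        for (std::size_t j = 0; j < n; ++j) {
            const std::size_t mj = groups[j].size() * groups[j].size();
            if (mj == 0) continue;
            bool collision = true;
            while (collision) {  // a few random trials per bucket
                collision = false;
                secondary_[j] = randomHash(rng, mj);
                slots_[j].assign(mj, -1);
                for (int k : groups[j]) {
                    int& cell = slots_[j][secondary_[j](k)];
                    if (cell != -1) { collision = true; break; }
                    cell = k;
                }
            }
        }
    }

    // Two hash evaluations and one comparison: worst-case O(1).
    bool contains(int key) const {
        if (key < 0) return false;
        const std::size_t j = primary_(key);
        if (slots_[j].empty()) return false;
        return slots_[j][secondary_[j](key)] == key;
    }

private:
    UniversalHash primary_;
    std::vector<UniversalHash> secondary_;
    std::vector<std::vector<int>> slots_;  // -1 marks an unused cell
};
```

A lookup never probes: it hashes once into the primary table, once into that slot's secondary table, and compares a single cell.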

Perfect Hashing Theorems¹
Theorem: If we store n keys in a hash table of size n^2 using a randomly chosen universal hash function, then the probability of any collision is < ½.
Theorem: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, then
  $E\left[\sum_{j=0}^{m-1} n_j^2\right] < 2n$
where nj is the number of keys hashing to slot j.
Corollary: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function and we set the size of each secondary hash table to mj = nj^2, then:
  a) The expected amount of storage required for all secondary hash tables is less than 2n.
  b) The probability that the total storage used for all secondary hash tables exceeds 4n is less than ½.
¹ Intro to Algorithms, 2nd ed. Cormen, Leiserson, Rivest, Stein

Perfect Hashing Conclusions
Perfect hashing theorems set tight expected bounds on sizes and collision behavior of all the hash tables (primary and all secondaries).
Conduct a few random trials of universal hash functions, by simply varying UHF parameters, until we get a set of UHFs and associated table sizes which deliver …
  Worst-case O(1) time complexity!
  Worst-case O(n) space complexity!

Extendible Hashing: Cost of a Database Query
I/O to CPU ratio is 300-to-1!

Extendible Hashing
Hashing technique for huge data sets
  optimizes to reduce disk accesses
  each hash bucket fits on one disk block
  better than B-Trees if order is not important – why?
Table contains
  buckets, each fitting in one disk block, with the data
  a directory that fits in one disk block is used to hash to the correct bucket

Extendible Hash Table
Directory entry: key prefix (first k bits) and a pointer to the bucket with all keys starting with its prefix
Each block contains keys matching on first j ≤ k bits, plus the data associated with each key
directory for k = 3: 000 001 010 011 100 101 110 111
buckets (j in parentheses):
  (2) 00001 00011 00100 00110
  (2) 01001 01011 01100
  (3) 10001 10011
  (3) 10101 10110 10111
  (2) 11001 11011 11100 11110

Inserting (easy case)
directory (k = 3): 000 001 010 011 100 101 110 111
Before:
  (2) 00001 00011 00100 00110
  (2) 01001 01011 01100
  (3) 10001 10011
  (3) 10101 10110 10111
  (2) 11001 11100 11110
insert(11011)
After: the (2) bucket for prefix 11 has room, so only it changes
  (2) 11001 11011 11100 11110

Splitting a Leaf
directory (k = 3): 000 001 010 011 100 101 110 111
Before:
  (2) 00001 00011 00100 00110
  (2) 01001 01011 01100
  (3) 10001 10011
  (3) 10101 10110 10111
  (2) 11001 11011 11100 11110
insert(11000)
After: the full (2) bucket for prefix 11 splits into two (3) buckets
  (3) 11000 11001 11011
  (3) 11100 11110

Splitting the Directory
1. insert(10010)
   But, no room to insert and no adoption!
2. Solution: Expand directory
3. Then, it’s just a normal split.
Before: directory (k = 2): 00 01 10 11
  (2) 01101
  (2) 10000 10001 10011 10111
  (2) 11001 11110
After expanding: directory (k = 3): 000 001 010 011 100 101 110 111

If Extendible Hashing Doesn’t Cut It
Store only pointers to the items
  + (potentially) much smaller M
  + fewer items in the directory
  – one extra disk access!
Rehash
  + potentially better distribution over the buckets
  + fewer unnecessary items in the directory
  – can’t solve the problem if there’s simply too much data
What if these don’t work?
  use a B-Tree to store the directory!

Hash Wrap
Collision resolution
  • Separate Chaining
    Expand beyond hashtable via secondary Dictionaries
    Allows λ > 1
  • Open Addressing
    Expand within hashtable
    Secondary probing: {linear, quadratic, double hash}
    λ ≤ 1 (by definition!)
    λ ≤ ½ (by preference!)
Rehashing
  Tunes up hashtable when λ crosses the line
Hash functions
  Simple integer hash: prime table size
  Multiplication method
  Universal hashing guarantees no (always) bad input
Perfect hashing
  Requires known, fixed keyset
  Achieves O(1) time, O(n) space - guaranteed!
Extendible hashing
  For disk-based data
  Combine with b-tree directory if needed