Hashing
CSE 331, Section 2
James Daly
Reminders
• Homework 3 is out
  • Due Thursday in class
• Spring Break is next week
• Homework 4 is out
  • Due after Spring Break
Review: Sets
• Containers for determining membership in a group
• Elements are unique
• Two main types
  • Ordered tree sets
  • Unordered hash sets
Language   Ordered      Unordered
C++        set          unordered_set
Java       TreeSet      HashSet
C#         SortedSet    HashSet
Review: Set Operations
• Add / Insert
• Remove / Delete
• Exists / Find
• Size / IsEmpty
• Iterator
• Clear / RemoveAll
• Sometimes Union (AddAll) / Intersection (RetainAll)
Direct Addressing Table
• An element with key k is stored in slot k
• Search(T, k) = O(1)
• Insertion(T, k) = O(1)
• Deletion(T, k) = O(1)
• Problem: number of keys can be large (2^32)
[Diagram: direct-address table T holding keys 1, 2, 6, 7, 9]
Hashing
• Store an element with key k in slot h(k)
• h(k) maps the universe U of keys into slots of a hash table
• Example
  • T with slots [0, 1, …, m – 1]
  • h: U → {0, 1, …, m – 1}
• Key → O(1) hash → address
Diagram
[Diagram: h(k) maps the actual keys, a subset of the universe U, to slots 0, 1, 2, …, m – 1 of table T]
Example
• Students with unique IDs
  • A: 10001
  • B: 10002
  • C: 10003
• h(s) = s.id % 10
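The student example can be sketched in a few lines of Python. The `Student` class and the ten-slot `table` are assumptions made for illustration; only the IDs and h(s) = s.id % 10 come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Student:
    name: str
    id: int


def h(s: Student, m: int = 10) -> int:
    """Map a student to one of m slots using the ID."""
    return s.id % m


students = [Student("A", 10001), Student("B", 10002), Student("C", 10003)]

table = [None] * 10              # slots 0..9
for s in students:
    table[h(s)] = s              # these three IDs do not collide

slots = {s.name: h(s) for s in students}   # A -> 1, B -> 2, C -> 3
```

A student with ID 10011 would also land in slot 1, which is the collision problem raised next.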
Problem
• What if several keys hash to the same value?
• Several solutions
  • Knock out the old
  • Discard the new
  • Chaining (keep a list)
  • Probing (try another location)
Chaining
[Diagram: table T with slots 0, 1, 2, …, m – 1; elements A, B, and C are stored in lists (chains) attached to their slots]
Chained Hash
• Insert(T, k)
  • Insert k into the list T[h(k)]: O(1)
• Search(T, k)
  • Search for an element with key k in list T[h(k)]
  • O(|T[h(k)]|): the size of the chain at h(k)
• Deletion
  • Delete element with key k in list T[h(k)]: O(|T[h(k)]|)
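A minimal sketch of a chained hash set, assuming Python lists as the chains and Python's built-in `hash` as h. The class name `ChainedHashSet` and its method names are hypothetical. Unlike the O(1) insert on the slide, this version walks the chain first so that elements stay unique.

```python
class ChainedHashSet:
    """Hash set with chaining: each slot holds a list of keys."""

    def __init__(self, m: int = 8):
        self.m = m                                # number of slots
        self.slots = [[] for _ in range(m)]       # one chain per slot
        self.n = 0                                # number of stored keys

    def _chain(self, k):
        return self.slots[hash(k) % self.m]

    def search(self, k) -> bool:
        # Cost is proportional to the length of the chain at h(k).
        return k in self._chain(k)

    def insert(self, k) -> None:
        chain = self._chain(k)
        if k not in chain:                        # keep elements unique
            chain.append(k)
            self.n += 1

    def delete(self, k) -> bool:
        chain = self._chain(k)
        if k in chain:
            chain.remove(k)
            self.n -= 1
            return True
        return False


s = ChainedHashSet(m=10)
for key in (10001, 10011, 10002):
    s.insert(key)
# 10001 and 10011 both hash to slot 1 and share a chain.
```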
Chained Hash
[Diagram: two chained tables T with slots 0, 1, 2, …, m – 1, one labeled "Bad" and one labeled "Good"; annotation: "Lots of stuff"]
Analysis
• Assumption: simple uniform hashing
  • Each key is equally likely to be hashed to any slot
  • Independent of the other keys
• Load Factor: average number of keys per slot
  • α = n / m
• Expected search cost:
  • Θ(1 + α): hash cost + search through the list
  • Θ(1) if α = O(1)
Analysis
• Load factor is more important than the table size!
Birthday Problem
• What is the probability that there will be no collisions?
• Approximately 45 people in the room
• Probability everyone has a different birthday?
• Load factor: 45 / 365 ≈ 12.3%
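The birthday question can be checked numerically. This sketch assumes 365 equally likely birthdays (simple uniform hashing into m = 365 slots); the function name `prob_no_collision` is made up for illustration.

```python
def prob_no_collision(n: int, m: int = 365) -> float:
    """Probability that n keys hashed uniformly into m slots all land in different slots."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m      # the (i+1)-th key must avoid the i slots already taken
    return p


p45 = prob_no_collision(45)   # 45 people, 365 possible birthdays
```

With 45 people the result is only about 6%, so a collision is very likely even though the load factor is just 12.3%.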
Hashing
• Two central problems
  • Design a good hash function
    • Distributes keys uniformly into the table
    • Regularity in the key distribution should not affect the uniformity
    • (shouldn’t use only half the slots with even numbers)
  • Resolve collisions
Hash Functions
• A good hash function:
  • Has equal probability of hashing a key to each slot
  • Must be fast
Sample Hash Function
• Hash function for integers
  • h(x) = x mod b
  • For some constant b
• Consider b = 2^r
  • 1011101_2 mod 2^3 = 101_2 = 5
  • h(x) returns the last r bits
  • Not good! Too easy to game.
• Typically b is chosen to be prime
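A short sketch of why a power-of-two modulus is weak: it keeps only the low r bits. The sample keys and the prime 7 are arbitrary choices for illustration.

```python
def h_mod(x: int, b: int) -> int:
    return x % b


x = 0b1011101                          # 93 in decimal
r = 3
low_bits = h_mod(x, 2 ** r)            # same as masking: x & 0b111
masked = x & ((1 << r) - 1)

# Three keys that differ only in their high bits:
keys = [0b0000101, 0b1111101, 0b1011101]
slots_pow2 = [h_mod(k, 8) for k in keys]    # all collide
slots_prime = [h_mod(k, 7) for k in keys]   # spread out for these keys
```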
String hash functions
• “pt” = <112, 116> (ASCII values)
• Sum of ASCII values
  • 112 + 116 = 228
  • Same as for “tp” (bad)
• Weighted sum
  • 112 * 1 + 116 * 2 = 344
  • Same as “rs”: 114 * 1 + 115 * 2 = 344
String hash functions
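• Geometric Series
  • h(a0 a1 a2 a3) = a0 b^0 + a1 b^1 + a2 b^2 + a3 b^3
• More generally
  • h(a0 a1 … a(n-1)) = sum of ai b^i for i = 0, …, n – 1
• Usually b is chosen to be prime
• Java uses this with b = 31 for String.hashCode() (with the powers in reverse order, so the first character gets the highest power)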
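The three string hashes discussed on these slides, sketched in Python. The function names are invented for illustration; b = 31 follows the slide, and the optional table size `m` is an assumption.

```python
def sum_hash(s: str) -> int:
    """Plain sum of character codes: ignores character order."""
    return sum(ord(c) for c in s)


def weighted_hash(s: str) -> int:
    """Weighted sum with weights 1, 2, 3, ...: still collides easily."""
    return sum(ord(c) * (i + 1) for i, c in enumerate(s))


def poly_hash(s: str, b: int = 31, m=None) -> int:
    """Geometric-series hash: a0*b^0 + a1*b^1 + a2*b^2 + ..."""
    h = 0
    power = 1
    for c in s:
        h += ord(c) * power
        power *= b
    return h if m is None else h % m


same_sum = sum_hash("pt") == sum_hash("tp")                # True: both 228
same_weighted = weighted_hash("pt") == weighted_hash("rs") # True: both 344
distinct_poly = len({poly_hash("pt"), poly_hash("tp"), poly_hash("rs")})
```

The polynomial version separates all three strings before any reduction mod the table size; collisions are still possible once the value is reduced to a slot index.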
Other hash functions
• Lots of them!
  • Murmur Hash
  • Fowler-Noll-Vo
• Some have different purposes
  • Cryptographic (non-invertible)
    • Used to validate integrity of message
    • SHA-1
    • MD5
Open vs Closed Addressing
• Talked about Closed Addressing so far
  • Item always ends up in the same slot
  • Uses chaining or similar structure
• Open Addressing
  • Item may end up in different location
  • Probes alternate locations if the item isn’t found
Open Addressing
• No storage is used outside of the table itself
• Insertion systematically probes the table until an empty slot is found
• Hash function depends on both the keys and the probe number
  • h : U × {0, 1, …, m – 1} → {0, 1, …, m – 1}
  • Probe sequence <h(k, 0), h(k, 1), …, h(k, m – 1)> should be a permutation of {0, 1, …, m – 1}
Linear Probing
• Given an ordinary hash function h’(k),
  • h(k, i) = (h’(k) + i) mod m
• Example:
  • h’(k) = k mod 11
  • h(k, i) = ((k mod 11) + i) mod 11
Example
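[Diagram: table with slots 0 to 10, initially empty]
Insert 15:
  h’(15) = 15 mod 11 = 4
  h(15, 0) = 4 (empty, so 15 goes in slot 4)
Insert 4:
  h’(4) = 4
  h(4, 0) = 4 (occupied)
  h(4, 1) = 5 (empty, so 4 goes in slot 5)
Insert 16:
  h’(16) = 16 mod 11 = 5
  h(16, 0) = 5 (occupied)
  h(16, 1) = 5 + 1 = 6 (empty, so 16 goes in slot 6)
Result: slot 4 holds 15, slot 5 holds 4, slot 6 holds 16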
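The linear probing example can be replayed with a small sketch. It assumes `None` marks an empty slot and uses h’(k) = k mod m with m = 11, as in the example; the function name is hypothetical.

```python
def linear_probe_insert(table: list, k: int) -> int:
    """Insert k using h(k, i) = (k mod m + i) mod m; return the slot used."""
    m = len(table)
    for i in range(m):                    # probe numbers 0 .. m-1
        slot = (k % m + i) % m
        if table[slot] is None:           # first empty slot wins
            table[slot] = k
            return slot
    raise RuntimeError("table is full")


table = [None] * 11
placed = [linear_probe_insert(table, k) for k in (15, 4, 16)]
# 15 -> slot 4, 4 -> slot 5, 16 -> slot 6
```

The three keys end up in adjacent slots even though only two of them share a home slot; runs like this are what the next slide calls primary clustering.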
Primary Clustering
Double Hashing
• Given two ordinary hash functions h1(k) and h2(k)
• h(k, i) = (h1(k) + i * h2(k)) mod m
• h2 must be non-zero
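A sketch of the double hashing probe sequence. The choices h1(k) = k mod 13 and h2(k) = 1 + (k mod 11) are the ones used in the example slides; the function name `probe_sequence` is an assumption.

```python
def probe_sequence(k: int, m: int = 13, m2: int = 11) -> list:
    """Slots visited by h(k, i) = (h1(k) + i * h2(k)) mod m for i = 0 .. m-1."""
    h1 = k % m
    h2 = 1 + (k % m2)          # always between 1 and m2, so never zero
    return [(h1 + i * h2) % m for i in range(m)]


first_probes_14 = probe_sequence(14)[:3]   # [1, 5, 9]
first_probes_98 = probe_sequence(98)[:3]   # [7, 5, 3]
```

Because m = 13 is prime and h2(k) is never a multiple of 13, each sequence visits every slot exactly once, which is the permutation requirement for open addressing.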
Example
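[Diagram: table T with slots 0 to 12; 79 is in slot 1, 69 in slot 4, 72 in slot 7, 50 in slot 11, and 98 is also stored]
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
Insert 14:
  h1(14) = 14 mod 13 = 1
  h2(14) = 1 + (14 mod 11) = 4
  h(14, 0) = 1 (occupied)
  h(14, 1) = 1 + 4 = 5 (empty, so 14 goes in slot 5)
Delete 72:
  h1(72) = 72 mod 13 = 7
  h(72, 0) = 7 (found, so 72 is removed)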
Example
[Diagram: table T with slots 0 to 12, now with 14 in slot 5 and slot 7 empty]
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
Delete 98:
  h1(98) = 7
  h2(98) = 11
  h(98, 0) = 7
  h(98, 1) = 5
  h(98, 2) = 3
When can we stop?
Rehashing
• Efficiency degrades as load factor increases
  • Dependent on number of items and table size
• Need to increase the table size occasionally when adding items
• Need to move items
  • Requires slight adjustments to the hash function
  • Mod by new table size
Rehashing
• Rehashing requires Θ(n) time
• Don’t increase size by a fixed amount
  • Causes average time to also be Θ(n)
• Grow by a multiplicative factor instead (double)
  • Θ(n) once every Θ(n) inserts
  • Amortized Θ(1) time per insert
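The difference between the two growth policies can be seen by counting how many items get moved during rehashes. This is a counting sketch only (no actual table); the starting capacity of 1 and the fixed increment of 16 are arbitrary assumptions.

```python
def total_moves(n_inserts: int, grow, start: int = 1) -> int:
    """Count items moved by rehashing while inserting n_inserts keys."""
    capacity, size, moves = start, 0, 0
    for _ in range(n_inserts):
        if size == capacity:          # table is full: rehash everything
            moves += size
            capacity = grow(capacity)
        size += 1
    return moves


n = 1024
doubling = total_moves(n, lambda c: 2 * c)    # multiplicative growth
fixed = total_moves(n, lambda c: c + 16)      # grow by a constant
```

With doubling, the total number of moves stays below 2n, so the rehashing cost averages out to a constant per insert; with a fixed increment the total grows quadratically in n.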
Rehashing
[Diagram: the keys 7, 9, 11, 15, 23, 73, 86 stored in a chained table with 7 slots (0 to 6), then rehashed into a table with 13 slots (0 to 12)]
Applications – Pattern Matching
• For a given string, sub, test whether it is a substring of another (larger) string S
[Diagram: the pattern ACGT compared against positions of the string S]
Sub = “ACGT”
|S| = n, |Sub| = m
Cost = O(mn)
RabinKarp(s[1..n], sub[1..m])
    hsub ← hash(sub[1..m])
    For i = 1 to n – m + 1
        hs ← hash(s[i..i+m-1])
        If hs = hsub
            If s[i..i+m-1] = sub
                Return i
    Return not found
String comparison: h(s1) = h(s2) does not mean s1 = s2
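A runnable sketch of Rabin-Karp in Python. It assumes a polynomial rolling hash with base b = 256 and prime modulus q = 1,000,003, neither of which is specified on the slide, so that each window hash is updated in O(1) instead of being recomputed. Indices are 0-based here, unlike the 1-based pseudocode, and -1 stands for "not found".

```python
def rabin_karp(s: str, sub: str, b: int = 256, q: int = 1_000_003) -> int:
    """Return the first 0-based index where sub occurs in s, or -1."""
    n, m = len(s), len(sub)
    if m == 0:
        return 0
    if m > n:
        return -1

    high = pow(b, m - 1, q)              # weight of the window's first character
    hsub = hs = 0
    for i in range(m):                   # hash of sub and of the first window
        hsub = (hsub * b + ord(sub[i])) % q
        hs = (hs * b + ord(s[i])) % q

    for i in range(n - m + 1):
        # Equal hashes do not prove equality, so confirm with a real comparison.
        if hs == hsub and s[i:i + m] == sub:
            return i
        if i < n - m:
            # Roll the window: drop s[i], shift, append s[i + m].
            hs = ((hs - ord(s[i]) * high) * b + ord(s[i + m])) % q
    return -1


pos = rabin_karp("TTACGTAA", "ACGT")     # window starting at index 2 matches
```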
Bloom-Filter
• Set membership detection
  • Space-efficient data structure used to test for membership of a set
• Uses several hash functions and a bitset
  • Every bit hi(k) must be set for k to be reported as in the set
• Allows false positives, but not false negatives
  • Probably “yes”, definitely “no”
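A minimal Bloom filter sketch. Deriving the several hash functions by salting SHA-256 with the index i is an assumption made to keep the example self-contained; the class name, the 64-bit size, and the choice of three hash functions are likewise illustrative.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 64, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits            # the bitset

    def _positions(self, key: str):
        # h_i(key): salt the key with i to get num_hashes different hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key: str) -> bool:
        # True means "probably yes"; False means "definitely no".
        return all(self.bits[p] for p in self._positions(key))


bf = BloomFilter()
for key in ("X", "Y", "Z"):
    bf.add(key)
```

After the three adds, `might_contain` returns True for X, Y, and Z. For any other key it may return either value, since unrelated keys can happen to hit bits that are already set.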
Bloom-Filter
[Diagram: a bitset with six bits set to 1, representing the set {X, Y, Z}]
Map / Dictionary
• Abstract data type representing a partial function
• Relates keys to values
• Keys are unique
• Values might not be
• Two main types (like sets)
  • Ordered tree maps
  • Unordered hash maps
Language   Ordered            Unordered
C++        map                unordered_map
Java       TreeMap            HashMap
C#         SortedDictionary   Dictionary
Map / Dictionary
[Diagram: a set of keys, each mapped to a value]
Map / Dictionary Methods
• Insert / Put: inserts tuple
• Get / At: gets value from key
  • Often indexer (operator[]) to do both put and get
• Remove / Delete: removes tuple by key
• KeyExists
• Iterator
• Size / IsEmpty
• Clear
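For comparison, the same operations on Python's built-in `dict`, which is an unordered hash map. Python is not one of the languages in the table above, so this is only an illustration; the names and numbers are taken from the phone-book example in the slides.

```python
phone_book = {}

phone_book["Mary"] = "555-1243"        # Insert / Put through the indexer
phone_book["John"] = "555-3612"
phone_book["Luke"] = "555-7293"

mary = phone_book["Mary"]              # Get / At through the indexer
missing = phone_book.get("Sarah")      # Get without raising: None if absent
has_john = "John" in phone_book        # KeyExists

del phone_book["Luke"]                 # Remove / Delete by key

names = sorted(phone_book)             # Iterator over keys (sorted for display)
size = len(phone_book)                 # Size
```

Calling `phone_book.clear()` afterwards removes every entry.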
TreeMap
[Diagram: binary search tree ordered by name]
Key: Name, Value: Cell #
John: 555-3612
Jacob: 555-3147
Mary: 555-1243
Mathew: 555-2179
Luke: 555-7293
Mark: 555-3479
Sarah: 555-5394
Mary? → 555-1243
Hash Map
Slot   Entry
0      Sarah, 555-5394
1
2      John, 555-3612
3      Jacob, 555-3147
4
5      Mary, 555-1243
6      Mathew, 555-2179
7      Mark, 555-3479
8
9      Luke, 555-7293
Mary?
h(Mary) = 5 → 555-1243