hashing cse 331 section 2 james daly. reminders homework 3 is out due thursday in class spring break...

Hashing

CSE 331, Section 2

James Daly

Reminders

• Homework 3 is out
    • Due Thursday in class
• Spring Break is next week
• Homework 4 is out
    • Due after Spring Break

Review: Sets

• Containers for determining membership in a group
• Elements are unique
• Two main types
    • Ordered tree sets
    • Unordered hash sets

Language   Ordered     Unordered
C++        set         unordered_set
Java       TreeSet     HashSet
C#         SortedSet   HashSet

Review: Set Operations

• Add / Insert
• Remove / Delete
• Exists / Find
• Size / IsEmpty
• Iterator
• Clear / RemoveAll
• Sometimes Union (AddAll) / Intersection (RetainAll)

Direct Addressing Table

• An element with key k is stored in slot k
    • Search(T, k) = O(1)
    • Insertion(T, k) = O(1)
    • Deletion(T, k) = O(1)
• Problem: number of keys can be large (2^32)

Diagram: table T holding keys 1, 2, 6, 7, and 9, each in the slot with the same index.
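A minimal sketch of a direct addressing table, assuming small non-negative integer keys. The class name `DirectAddressTable` and the `universe_size` parameter are illustrative, not from the slides. It shows why every operation is O(1), and why the approach breaks down when the universe is as large as 2^32: the table needs one slot per possible key.

```python
class DirectAddressTable:
    """One slot per possible key; key k lives in slot k."""

    def __init__(self, universe_size):
        # Memory is proportional to the size of the key universe,
        # not to the number of keys actually stored.
        self.slots = [None] * universe_size

    def insert(self, key, value=True):
        self.slots[key] = value      # O(1)

    def search(self, key):
        return self.slots[key]       # O(1); None means "absent"

    def delete(self, key):
        self.slots[key] = None       # O(1)


table = DirectAddressTable(10)
for k in (1, 2, 6, 7, 9):            # the keys from the slide's diagram
    table.insert(k)
```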

Hashing

• Store an element with key k in slot h(k)
• h(k) maps the universe U of keys into slots of a hash table
• Example
    • T with slots [0, 1, …, m – 1]
    • h: U → {0, 1, …, m – 1}
• Key → O(1) hash → address

Diagram

The hash function h(k) maps the actual keys, a subset of the universe U, to slots 0, 1, 2, …, m – 1 of the table T.

Example

• Students with unique IDs
    • A: 10001
    • B: 10002
    • C: 10003
• h(s) = s.id % 10
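The example above can be sketched in a few lines. The `Student` dataclass and the ten-slot list are assumptions made for illustration; only the IDs and h(s) = s.id % 10 come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Student:
    name: str
    id: int


def h(s, m=10):
    """Hash a student to a slot using the last decimal digit of the ID."""
    return s.id % m


students = [Student("A", 10001), Student("B", 10002), Student("C", 10003)]

table = [None] * 10
for s in students:
    table[h(s)] = s                  # A -> slot 1, B -> slot 2, C -> slot 3
```

With these three IDs there are no collisions. A fourth student with ID 10011 would hash to slot 1 as well, which is the problem the next slide raises.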

Problem

• What if several keys hash to the same value?
• Several solutions
    • Knock out the old
    • Discard the new
    • Chaining (keep a list)
    • Probing (try another location)

Chaining

Diagram: table T with slots 0, 1, 2, …, m – 1; the elements A, B, and C are kept in chains (lists) attached to the slots.

Chained Hash

• Insert(T, k)
    • Insert k into the list T[h(k)]: O(1)
• Search(T, k)
    • Search for an element with key k in list T[h(k)]
    • O(|T[h(k)]|): the size of the chain at h(k)
• Delete(T, k)
    • Delete element with key k in list T[h(k)]: O(|T[h(k)]|)
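The three operations can be sketched as a small chained hash set. This is one possible implementation, not the course's reference code: the class name `ChainedHashSet`, the default table size, and the use of Python's built-in `hash` are all assumptions. Note that this sketch checks for duplicates on insert to keep set semantics, so its insert costs O(chain length); the O(1) bound on the slide applies when the key is known to be absent.

```python
class ChainedHashSet:
    """Hash set that resolves collisions by keeping a list per slot."""

    def __init__(self, m=8):
        self.m = m
        self.table = [[] for _ in range(m)]

    def _chain(self, k):
        # T[h(k)]: the list that holds every key hashing to this slot.
        return self.table[hash(k) % self.m]

    def insert(self, k):
        chain = self._chain(k)
        if k not in chain:           # duplicate check scans the chain
            chain.append(k)

    def search(self, k):
        return k in self._chain(k)   # O(|T[h(k)]|)

    def delete(self, k):
        chain = self._chain(k)
        if k in chain:
            chain.remove(k)          # O(|T[h(k)]|)


s = ChainedHashSet(m=8)
for key in (1, 9, 17):               # all three hash to slot 1 when m = 8
    s.insert(key)
```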

Chained Hash

Diagram: two chained tables T with slots 0, 1, 2, …, m – 1, labeled Bad and Good. The bad table has lots of stuff in a single chain.

Analysis

• Assumption: simple uniform hashing
    • Each key is equally likely to be hashed to any slot
    • Independent of the other keys
• Load Factor: average number of keys per slot
    • α = n / m (n keys stored in m slots)
• Expected search cost:
    • Θ(1 + α): hash cost + search through the list
    • Θ(1) if α = O(1)

Analysis

• Load factor is more important than the table size!

Birthday Problem

• What is the probability that there will be no collisions?
• Approximately 45 people in the room
    • Probability everyone has a different birthday?
    • Load factor: 45 / 365 ≈ 12.3%
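The question on this slide can be answered with a short calculation. The sketch below assumes 365 equally likely birthdays (simple uniform hashing into m = 365 slots); the function name is illustrative.

```python
def prob_no_collision(n, m):
    """Probability that n uniformly hashed keys land in n distinct slots out of m."""
    p = 1.0
    for i in range(n):
        # The (i + 1)-th key must avoid the i slots already taken.
        p *= (m - i) / m
    return p


p = prob_no_collision(45, 365)
load_factor = 45 / 365
print(f"load factor = {load_factor:.3f}, P(no collision) = {p:.3f}")
```

Even at a load factor of about 0.12, the chance of having no collision at all is only around 6%, so a hash table needs a collision strategy long before it is full.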

Hashing

• Two central problems
    • Design a good hash function
        • Distributes keys uniformly into the table
        • Regularity in the key distribution should not affect the uniformity
        • (e.g. keys that are all even shouldn’t end up using only half the slots)
    • Resolve collisions

Hash Functions

• A good hash function:
    • Has equal probability of hashing a key to each slot
    • Must be fast

Sample Hash Function

• Hash function for integers
    • h(x) = x mod b
    • For some constant b
• Consider b = 2^r
    • 1011101_2 mod 2^3 = 101_2 = 5
    • h(x) returns the last r bits
    • Not good! Too easy to game.
• Typically b is chosen to be prime
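A short sketch of why b = 2^r is a weak choice. The key list is a made-up example of regular input (multiples of 8); everything else follows the slide.

```python
def h(x, b):
    return x % b


x = 0b1011101                        # 93 in decimal
r = 3

# Modding by 2^r just keeps the last r bits.
last_r_bits = x & ((1 << r) - 1)

# Hypothetical regular input: every key is a multiple of 8.
keys = [8, 16, 24, 32, 40]
slots_pow2 = [h(k, 8) for k in keys]     # all collide in slot 0
slots_prime = [h(k, 7) for k in keys]    # spread over different slots
```

With b = 8 every key lands in slot 0, while the prime b = 7 spreads the same keys over slots 1 through 5.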

String hash functions

• “pt” = <112, 116> (ASCII values)
• Sum of ASCII values
    • 112 + 116 = 228
    • Same as for “tp” (bad)
• Weighted sum
    • 112 * 1 + 116 * 2 = 344
    • Same as “rs”: 114 * 1 + 115 * 2 = 344

String hash functions

• Geometric Series
    • h(a_0 a_1 a_2 a_3) = a_0 b^0 + a_1 b^1 + a_2 b^2 + a_3 b^3
• More generally
    • h(a_0 a_1 … a_{n-1}) = \sum_{i=0}^{n-1} a_i b^i
• Usually b is chosen to be prime
• Java uses this with b = 31 for String.hashCode()
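A sketch of the geometric-series hash, plus a Python imitation of Java's `String.hashCode()`. Java's documented formula gives the highest power of 31 to the first character (s[0]*31^(n-1) + … + s[n-1]) and wraps around in 32-bit signed arithmetic, so it is the same idea with the exponents reversed. The function names are illustrative, and the Java imitation assumes ASCII input.

```python
def poly_hash(s, b=31):
    """h(a_0 ... a_{n-1}) = sum of a_i * b^i, as written on the slide."""
    h, power = 0, 1
    for c in s:
        h += ord(c) * power
        power *= b
    return h


def java_style_hash(s):
    """Imitates Java's String.hashCode() for ASCII strings."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF        # keep the low 32 bits
    return h - (1 << 32) if h >= (1 << 31) else h  # reinterpret as signed


print(poly_hash("pt"), poly_hash("tp"))   # order now matters
print(java_style_hash("pt"))
```

Unlike the plain sum and the weighted sum, swapping two characters changes the result here.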

Other hash functions

• Lots of them!
    • Murmur Hash
    • Fowler-Noll-Vo
• Some have different purposes
    • Cryptographic (non-invertible)
        • Used to validate integrity of message
        • SHA-1
        • MD5

Open vs Closed Addressing

• Talked about Closed Addressing
    • Item always ends up in the same slot
    • Uses chaining or similar structure
• Open Addressing
    • Item may end up in different location
    • Probes alternate locations if the item isn’t found

Open Addressing

• No storage is used outside of the table itself

• Insertion systematically probes the table until an empty slot is found

• Hash function depends on both the key and the probe number
    • h : U × {0, 1, …, m – 1} → {0, 1, …, m – 1}
    • Probe sequence <h(k, 0), h(k, 1), …, h(k, m – 1)> should be a permutation of {0, 1, …, m – 1}

Linear Probing

• Given ordinary hash function h’(k),
    • h(k, i) = (h’(k) + i) mod m
• Example:
    • h’(k) = k mod 11
    • h(k, i) = ((k mod 11) + i) mod 11

Example

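Table T with slots 0 through 10 (m = 11).

Insert 15:
h’(15) = 15 mod 11 = 4
h(15, 0) = 4 → 15 goes in slot 4

Insert 4:
h’(4) = 4
h(4, 0) = 4 (occupied by 15)
h(4, 1) = 5 → 4 goes in slot 5

Insert 16:
h’(16) = 16 mod 11 = 5
h(16, 0) = 5 (occupied by 4)
h(16, 1) = 5 + 1 = 6 → 16 goes in slot 6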
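The three insertions can be replayed with a small linear-probing sketch. It assumes h’(k) = k mod m and uses `None` to mark an empty slot; the function name is illustrative.

```python
def linear_probe_insert(table, k):
    """Insert k using h(k, i) = (h'(k) + i) mod m with h'(k) = k mod m."""
    m = len(table)
    for i in range(m):
        slot = (k % m + i) % m
        if table[slot] is None:      # first empty slot in the probe sequence
            table[slot] = k
            return slot
    raise RuntimeError("table is full")


table = [None] * 11
slots = [linear_probe_insert(table, k) for k in (15, 4, 16)]
print(slots)
```

The keys end up in three consecutive slots even though only two of them share a home slot. Runs like this are what the next slide calls primary clustering.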

Primary Clustering

Double Hashing

• Given two ordinary hash functions h1(k) and h2(k)

• h(k, i) = (h1(k) + i * h2(k)) mod m

• h2 must be non-zero
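A minimal sketch of the probe sequence under double hashing. It borrows h1(k) = k mod 13 and h2(k) = 1 + (k mod 11) from the example in these slides; the function name `probe_sequence` is illustrative.

```python
def h1(k):
    return k % 13


def h2(k):
    return 1 + (k % 11)              # always in 1..11, so never zero


def probe_sequence(k, m=13):
    """Slots visited for key k: h(k, i) = (h1(k) + i * h2(k)) mod m."""
    return [(h1(k) + i * h2(k)) % m for i in range(m)]


print(probe_sequence(14)[:3])
print(probe_sequence(98)[:3])
```

Because m = 13 is prime and h2(k) is never a multiple of 13, each probe sequence visits every slot exactly once. If h2(k) were 0, the sequence would stay on h1(k) forever.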

Example

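Table T with slots 0 through 12 (m = 13); occupied slots include 1 (79), 4 (69), 7 (72), and 11 (50).

h1(k) = k mod 13
h2(k) = 1 + (k mod 11)

Insert 14:
h1(14) = 14 mod 13 = 1
h2(14) = 1 + (14 mod 11) = 4
h(14, 0) = 1 (occupied by 79)
h(14, 1) = 1 + 4 = 5 → 14 goes in slot 5

Delete 72:
h1(72) = 72 mod 13 = 7
h(72, 0) = 7 → 72 is found in slot 7 and removed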

Example

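Table T after the previous operations: occupied slots include 1 (79), 4 (69), 5 (14), and 11 (50); slot 7 is now empty.

h1(k) = k mod 13
h2(k) = 1 + (k mod 11)

Delete 98:
h1(98) = 7
h2(98) = 11
h(98, 0) = 7
h(98, 1) = 5
h(98, 2) = 3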

When can we stop?

Rehashing

• Efficiency degrades as load factor increases
    • Dependent on number of items and table size
• Need to increase the table size occasionally when adding items
    • Need to move items
    • Requires slight adjustments to the hash function
        • Mod by new table size

Rehashing

• Rehashing requires Θ(n) time
• Don’t increase size by a fixed amount
    • Causes average time to also be Θ(n)
• Grow by a multiplicative factor instead (double)
    • Θ(n) once every Θ(n) inserts
    • Amortized Θ(1) time per insert

Rehashing

Diagram: the keys of a small chained hash table are moved into a larger table T with slots 0 through 12, each key placed by hashing with the new table size.

Applications – Pattern Matching

• For a given string, sub, test whether it is a substring of another (larger) string S

Sub = “ACGT”

|S| = n, |Sub| = m

Cost of comparing Sub against every position of S: O(m n)

RabinKarp(s[1..n], sub[1..m])
    hsub ← hash(sub[1..m])
    For i = 1 to n – m + 1
        hs ← hash(s[i..i+m-1])
        If hs = hsub
            If s[i..i+m-1] = sub
                Return i
    Return not found

The string comparison is still needed: h(s1) = h(s2) does not mean s1 = s2
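A runnable sketch of the pseudocode above, with one refinement the slide does not spell out: a rolling polynomial hash, so each window's hash is updated in O(1) instead of recomputed from scratch. The base `b = 256` and modulus `q = 1_000_003` are assumed values, and indices are 0-based, with -1 standing in for "not found".

```python
def rabin_karp(s, sub, b=256, q=1_000_003):
    """Return the first index where sub occurs in s, or -1."""
    n, m = len(s), len(sub)
    if m == 0:
        return 0
    if m > n:
        return -1

    high = pow(b, m - 1, q)          # weight of the window's leading character
    hsub = hs = 0
    for j in range(m):               # hash sub and the first window of s
        hsub = (hsub * b + ord(sub[j])) % q
        hs = (hs * b + ord(s[j])) % q

    for i in range(n - m + 1):
        # Equal hashes do not prove equality, so confirm with a real comparison.
        if hs == hsub and s[i:i + m] == sub:
            return i
        if i < n - m:
            # Roll the window: drop s[i], shift, append s[i + m].
            hs = ((hs - ord(s[i]) * high) * b + ord(s[i + m])) % q
    return -1


print(rabin_karp("TTACGTAA", "ACGT"))
```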

Bloom-Filter

• Set membership detection
    • Space-efficient data structure used to test for membership in a set
• Uses several hash functions and a bitset
    • The bit at each h_i(k) must be set for k to be in the set
• Allows false positives, but not false negatives
    • Probably “yes”, definitely “no”

Bloom-Filter

Diagram: a bitset in which the bits selected by the hash functions for the elements of {X, Y, Z} are set to 1.

Map / Dictionary

• Abstract data type representing a partial function

• Relates keys to values
• Keys are unique
• Values might not be
• Two main types (like sets)
    • Ordered tree maps
    • Unordered hash maps

Language   Ordered            Unordered
C++        map                unordered_map
Java       TreeMap            HashMap
C#         SortedDictionary   Dictionary

Map / Dictionary

Keys Values

Map / Dictionary Methods

• Insert / Put: inserts tuple
• Get / At: gets value from key
    • Often indexer (operator[]) to do both put and get
• Remove / Delete: removes tuple by key
• KeyExists
• Iterator
• Size / IsEmpty
• Clear

TreeMap

Key: Name, Value: Cell #

Name     Cell #
John     555-3612
Jacob    555-3147
Mary     555-1243
Mathew   555-2179
Luke     555-7293
Mark     555-3479
Sarah    555-5394

Mary? → 555-1243

Hash Map

Slot   Entry
0      Sarah, 555-5394
1
2      John, 555-3612
3      Jacob, 555-3147
4
5      Mary, 555-1243
6      Mathew, 555-2179
7      Mark, 555-3479
8
9      Luke, 555-7293

Mary? h(Mary) = 5 → 555-1243
