Hashing CSE 331 Section 2 James Daly

Upload: johnathan-warner

Post on 26-Dec-2015


Page 1: Hashing CSE 331 Section 2 James Daly. Reminders Homework 3 is out Due Thursday in class Spring Break is next week Homework 4 is out Due after Spring Break

Hashing

CSE 331, Section 2
James Daly


Reminders

• Homework 3 is out
  • Due Thursday in class
• Spring Break is next week
• Homework 4 is out
  • Due after Spring Break


Review: Sets

• Containers for determining membership in a group

• Elements are unique
• Two main types
  • Ordered tree sets
  • Unordered hash sets

Language   Ordered     Unordered
C++        set         unordered_set
Java       TreeSet     HashSet
C#         SortedSet   HashSet


Review: Set Operations

• Add / Insert
• Remove / Delete
• Exists / Find
• Size / IsEmpty
• Iterator
• Clear / RemoveAll
• Sometimes Union (AddAll) / Intersection (RetainAll)


Direct Addressing Table

• An element with key k is stored in slot k
• Search(T, k) = O(1)
• Insertion(T, k) = O(1)
• Deletion(T, k) = O(1)
• Problem: the number of possible keys can be large (2^32)

[Diagram: table T holding the keys 1, 2, 6, 7, 9]
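
A minimal Python sketch of a direct-address table, assuming small non-negative integer keys. The class name `DirectAddressTable` and the universe size of 10 are illustrative choices, not part of the slides.

```python
class DirectAddressTable:
    """Direct-address set for integer keys in range(universe_size)."""

    def __init__(self, universe_size):
        # One slot per possible key: this is why a universe of 2^32 keys is a problem.
        self.slots = [False] * universe_size

    def insert(self, k):
        self.slots[k] = True      # O(1)

    def delete(self, k):
        self.slots[k] = False     # O(1)

    def search(self, k):
        return self.slots[k]      # O(1)


table = DirectAddressTable(10)    # hypothetical universe {0, ..., 9}
for key in (1, 2, 6, 7, 9):
    table.insert(key)
```

Every operation is a single array access, but the memory cost grows with the size of the key universe rather than with the number of stored keys.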


Hashing

• Store an element with key k in slot h(k)
• h(k) maps the universe U of keys into the slots of a hash table
• Example
  • T with slots [0, 1, …, m – 1]
  • h: U → {0, 1, …, m – 1}
• Key → O(1) hash → address


Diagram

[Diagram: h(k) maps the actual keys, a subset of the universe U, to slots 0, 1, 2, …, m – 1 of table T]


Example

• Students with unique IDs
  • A: 10001
  • B: 10002
  • C: 10003
• h(s) = s.id % 10
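
The student example can be sketched in Python as follows. The `Student` record type and the parameter `m` are assumptions made for illustration; the slide only specifies the IDs and h(s) = s.id % 10.

```python
from collections import namedtuple

# Hypothetical record type; only the id field matters for hashing.
Student = namedtuple("Student", ["name", "id"])


def h(s, m=10):
    """Hash a student to one of m slots using the ID."""
    return s.id % m


students = [Student("A", 10001), Student("B", 10002), Student("C", 10003)]
slots = {s.name: h(s) for s in students}
```

With these IDs the three students land in slots 1, 2, and 3, so there are no collisions.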


Problem

• What if several keys hash to the same value?

• Several solutions
  • Knock out the old
  • Discard the new
  • Chaining (keep a list)
  • Probing (try another location)


Chaining

[Diagram: table T with slots 0, 1, 2, …, m – 1; the elements A, B, and C are kept in a list: A, B, C]


Chained Hash

• Insert(T, k)
  • Insert k into the list T[h(k)]: O(1)
• Search(T, k)
  • Search for an element with key k in list T[h(k)]
  • O(|T[h(k)]|): the size of the chain at h(k)
• Deletion
  • Delete element with key k in list T[h(k)]: O(|T[h(k)]|)
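
The three chained operations can be sketched in Python like this. The class name `ChainedHashSet`, the default table size, and the use of Python's built-in `hash` are assumptions for illustration. Because this sketch enforces set uniqueness, its insert scans the chain first; the O(1) insert on the slide applies when that check is skipped.

```python
class ChainedHashSet:
    """Hash set that resolves collisions by keeping a list per slot."""

    def __init__(self, m=8):
        self.table = [[] for _ in range(m)]

    def _chain(self, k):
        # h(k): Python's built-in hash reduced to a slot index.
        return self.table[hash(k) % len(self.table)]

    def insert(self, k):
        chain = self._chain(k)
        if k not in chain:        # uniqueness check: O(chain length)
            chain.append(k)       # the append itself is O(1)

    def search(self, k):
        return k in self._chain(k)    # O(|T[h(k)]|)

    def delete(self, k):
        chain = self._chain(k)
        if k in chain:
            chain.remove(k)           # O(|T[h(k)]|)


s = ChainedHashSet(10)
for key in (10001, 10011, 10002):     # 10001 and 10011 share slot 1
    s.insert(key)
```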


Chained Hash

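[Diagram: two tables T with slots 0, 1, 2, …, m – 1. Bad: lots of stuff chained in a single slot. Good: the elements are spread across the slots.]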


Analysis

• Assumption: simple uniform hashing
  • Each key is equally likely to be hashed to any slot
  • Independent of the other keys
• Load factor: average number of keys per slot
  • α = n / m
• Expected search cost:
  • Θ(1 + α): hash cost + search through the list
  • Θ(1) if α = O(1)


Analysis

• Load factor is more important than the table size!


Birthday Problem

• What is the probability that there will be no collisions?
• Approximately 45 people in the room
• Probability everyone has a different birthday?
• Load factor: 45 / 365 ≈ 12.3%
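
A short Python sketch of the birthday calculation, assuming 365 equally likely birthdays (the function name `prob_no_collision` is illustrative). The n-th person must avoid the n – 1 birthdays already taken, so the probabilities multiply.

```python
def prob_no_collision(n, m=365):
    """Probability that n keys hashed uniformly into m slots are all distinct."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m   # the (i+1)-th key must miss the i occupied slots
    return p


p = prob_no_collision(45)   # roughly 0.06
load_factor = 45 / 365      # roughly 0.123
```

Even at a load factor of about 12%, a collision-free table is unlikely, which is why collision resolution cannot be skipped.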


Hashing

• Two central problems
  • Design a good hash function
    • Distributes keys uniformly into the table
    • Regularity in the key distribution should not affect the uniformity
    • (shouldn’t use only half the slots with even numbers)
  • Resolve collisions


Hash Functions

• A good hash function:
  • Has equal probability of hashing a key to each slot
  • Must be fast


Sample Hash Function

• Hash function for integers
  • h(x) = x mod b
  • For some constant b
• Consider b = 2^r
  • 1011101_2 mod 2^3 = 101_2 = 5
  • h(x) returns the last r bits
  • Not good! Too easy to game.
• Typically b is chosen to be prime
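
The effect of choosing b = 2^r can be checked with a few lines of Python. The key list below is a made-up example of regular input (multiples of 8), not data from the slides.

```python
x = 0b1011101                         # 93 in decimal
r = 3
last_r_bits = x & ((1 << r) - 1)      # mask off everything but the last r bits
mod_result = x % (1 << r)             # x mod 2^r gives the same value


def h(x, b):
    return x % b


keys = [8, 16, 24, 32, 40]            # hypothetical keys sharing their low 3 bits
power_of_two_slots = [h(k, 8) for k in keys]   # every key lands in slot 0
prime_slots = [h(k, 7) for k in keys]          # keys spread over slots 1..5
```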


String hash functions

• “pt” = <112, 116> (ASCII values)
• Sum of ASCII values
  • 112 + 116 = 228
  • Same as for “tp” (bad)
• Weighted sum
  • 112 * 1 + 116 * 2 = 344
  • Same as “rs”: 114 * 1 + 115 * 2 = 344
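
Both string hashes and their collisions can be reproduced in Python. The function names `sum_hash` and `weighted_sum_hash` are illustrative.

```python
def sum_hash(s):
    """Sum of the character codes; ignores character order."""
    return sum(ord(c) for c in s)


def weighted_sum_hash(s):
    """Character codes weighted by their 1-based position."""
    return sum(i * ord(c) for i, c in enumerate(s, start=1))


pt_sum, tp_sum = sum_hash("pt"), sum_hash("tp")                     # both 228
pt_weighted, rs_weighted = weighted_sum_hash("pt"), weighted_sum_hash("rs")  # both 344
```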


String hash functions

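• Geometric series
  • h(a_0 a_1 a_2 a_3) = a_0 b^0 + a_1 b^1 + a_2 b^2 + a_3 b^3
• More generally
  • h(a_0 a_1 … a_(n–1)) = Σ_{i=0}^{n–1} a_i b^i
• Usually b is chosen to be prime
• Java uses this with b = 31 for String.hashCode() (with the highest power applied to the first character)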
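
A Python sketch of the geometric-series hash, plus an emulation of the Java variant. `poly_hash` and `java_string_hash` are illustrative names; the Java emulation assumes characters in the Basic Multilingual Plane, so that Python's `ord` matches Java's `char` values, and wraps to a signed 32-bit integer as Java's `int` arithmetic does.

```python
def poly_hash(s, b=31):
    """h(a_0 ... a_(n-1)) = sum of a_i * b^i, with no reduction applied."""
    return sum(ord(c) * b**i for i, c in enumerate(s))


def java_string_hash(s):
    """Emulates Java's String.hashCode(): s[0]*31^(n-1) + ... + s[n-1]."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF   # keep 32 bits, like Java int overflow
    return h - (1 << 32) if h >= (1 << 31) else h   # reinterpret as signed


pt = poly_hash("pt")            # 112 + 116 * 31
tp = poly_hash("tp")            # 116 + 112 * 31
java_pt = java_string_hash("pt")  # 112 * 31 + 116
```

Unlike the plain and weighted sums, the two orderings of the same letters now hash to different values.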


Other hash functions

• Lots of them!
  • Murmur Hash
  • Fowler-Noll-Vo
• Some have different purposes
  • Cryptographic (non-invertible)
    • Used to validate integrity of message
    • SHA-1
    • MD5


Open vs Closed Addressing

• Talked about Closed Addressing
  • Item always ends up in the same slot
  • Uses chaining or similar structure
• Open Addressing
  • Item may end up in a different location
  • Probes alternate locations if the item isn’t found


Open Addressing

• No storage is used outside of the table itself

• Insertion systematically probes the table until an empty slot is found

• Hash function depends on both the key and the probe number
  • h : U × {0, 1, …, m – 1} → {0, 1, …, m – 1}
  • Probe sequence <h(k, 0), h(k, 1), …, h(k, m – 1)> should be a permutation of {0, 1, …, m – 1}


Linear Probing

• Given ordinary hash function h’(k),
  • h(k, i) = (h’(k) + i) mod m
• Example:
  • h’(k) = k mod 11
  • h(k, i) = ((k mod 11) + i) mod 11


Example

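[Table T with slots 0 to 10]

Insert 15:
  h’(15) = 15 mod 11 = 4
  h(15, 0) = 4 (empty, so 15 goes in slot 4)

Insert 4:
  h’(4) = 4
  h(4, 0) = 4 (occupied by 15)
  h(4, 1) = 5 (empty, so 4 goes in slot 5)

Insert 16:
  h’(16) = 16 mod 11 = 5
  h(16, 0) = 5 (occupied by 4)
  h(16, 1) = 5 + 1 = 6 (empty, so 16 goes in slot 6)

Result: 15 in slot 4, 4 in slot 5, 16 in slot 6

Primary Clustering: occupied slots build up into long runs, which lengthens later probes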
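
The linear probing walkthrough can be replayed in Python. The function name `linear_probe_insert` and the use of `None` for an empty slot are assumptions of this sketch.

```python
def linear_probe_insert(table, k):
    """Insert k using h(k, i) = ((k mod m) + i) mod m; returns the slot used."""
    m = len(table)
    for i in range(m):                  # probe numbers 0, 1, ..., m - 1
        slot = (k % m + i) % m
        if table[slot] is None:         # None marks an empty slot
            table[slot] = k
            return slot
    raise RuntimeError("table is full")


table = [None] * 11
placed = [linear_probe_insert(table, k) for k in (15, 4, 16)]   # slots 4, 5, 6
```

The three keys end up in adjacent slots even though only two of them share a home slot, which is the clustering effect named on the slide.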


Double Hashing

• Given two ordinary hash functions h1(k) and h2(k)

• h(k, i) = (h1(k) + i * h2(k)) mod m

• h2 must be non-zero


Example

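[Table T with slots 0 to 12; recoverable entries: 79 in slot 1, 69 in slot 4, 72 in slot 7, 50 in slot 11]

h1(k) = k mod 13
h2(k) = 1 + (k mod 11)

Insert 14:
  h1(14) = 14 mod 13 = 1
  h2(14) = 1 + (14 mod 11) = 4
  h(14, 0) = 1 (occupied by 79)
  h(14, 1) = 1 + 4 = 5 (14 goes in slot 5)

Delete 72:
  h1(72) = 72 mod 13 = 7
  h(72, 0) = 7 (72 is removed from slot 7)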
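
A Python sketch of the double hashing insert for key 14. The starting table contents (79, 69, 72, 50 in slots 1, 4, 7, 11) are an assumption based on what can be read from the slide's diagram, and the function names are illustrative.

```python
M = 13


def h1(k):
    return k % 13


def h2(k):
    return 1 + (k % 11)     # always in 1..11, so never zero


def probe(k, i):
    """h(k, i) = (h1(k) + i * h2(k)) mod m."""
    return (h1(k) + i * h2(k)) % M


def insert(table, k):
    for i in range(M):
        slot = probe(k, i)
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("table is full")


table = [None] * M
table[1], table[4], table[7], table[11] = 79, 69, 72, 50   # assumed starting state
slot_for_14 = insert(table, 14)                             # probes 1, then 5
probe_sequence = [probe(14, i) for i in range(M)]
```

Because 13 is prime and h2(k) is never zero, the step size shares no factor with the table size, so the probe sequence visits every slot exactly once.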


Example

[Table T with slots 0 to 12; recoverable entries: 79 in slot 1, 69 in slot 4, 14 in slot 5, 50 in slot 11]

h1(k) = k mod 13
h2(k) = 1 + (k mod 11)

Delete 98:
  h1(98) = 7
  h2(98) = 11
  h(98, 0) = 7
  h(98, 1) = 5
  h(98, 2) = 3

When can we stop?


Rehashing

• Efficiency degrades as load factor increases
  • Dependent on number of items and table size
• Need to increase the table size occasionally when adding items
• Need to move items
  • Requires slight adjustments to the hash function
  • Mod by new table size


Rehashing

• Rehashing requires Θ(n) time
• Don’t increase size by a fixed amount
  • Causes average time to also be Θ(n)
• Grow by a multiplicative factor instead (double)
  • Θ(n) once every Θ(n) inserts
  • Amortized Θ(1) time per insert


Rehashing

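[Diagram: a table with 7 slots (0 to 6) holding the keys 7, 9, 11, 15, 23, 73, and 86 is rehashed into a table with 13 slots (0 to 12). In the new table, 15 is in slot 2, 7 in slot 7, 73 and 86 in slot 8, 9 in slot 9, 23 in slot 10, and 11 in slot 11.]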


Applications – Pattern Matching

• For a given string, sub, test whether it is a substring of another (larger) string S

[Diagram: Sub = “ACGT” is compared against S at each position]

• |S| = n
• |Sub| = m
• Cost = O(m n)


RabinKarp(s[1..n], sub[1..m])

  hsub ← hash(sub[1..m])
  For i = 1 to n – m + 1
    hs ← hash(s[i..i+m-1])
    If hs = hsub
      If s[i..i+m-1] = sub
        Return i
  Return not found

String comparison: h(s1) = h(s2) does not mean s1 = s2, so a hash match must be confirmed by comparing the strings
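
A runnable Python sketch of Rabin-Karp. It swaps in a rolling polynomial hash, which updates the window hash in O(1) instead of rehashing each window from scratch as the pseudocode does. The `base` and `mod` values are arbitrary assumptions, and indices are 0-based, with -1 standing for "not found".

```python
def rabin_karp(s, sub, base=256, mod=1_000_000_007):
    """Return the first index where sub occurs in s, or -1 if it does not."""
    n, m = len(s), len(sub)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)          # weight of the window's first character
    h_sub = h_s = 0
    for j in range(m):                    # hash sub and the first window of s
        h_sub = (h_sub * base + ord(sub[j])) % mod
        h_s = (h_s * base + ord(s[j])) % mod
    for i in range(n - m + 1):
        # Equal hashes do not prove equality, so confirm with a real comparison.
        if h_s == h_sub and s[i:i + m] == sub:
            return i
        if i + m < n:
            # Roll the window: drop s[i], shift, and append s[i + m].
            h_s = ((h_s - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return -1


index = rabin_karp("TTACGTACGTS", "ACGT")
```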


Bloom-Filter

• Set membership detection
  • Space-efficient data structure used to test for membership of a set
• Uses several hash functions and a bitset
  • Every bit h_i(k) must be set for k to be reported as in the set
• Allows false positives, but not false negatives
  • Probably “yes”, definitely “no”
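
A minimal Bloom filter sketch in Python. Deriving the several hash functions by salting SHA-256 with the function index is an assumption of this sketch, as are the class name, the 64-bit size, and the choice of 3 hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # h_i(key): salt the key with the index i to get independent-looking hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for item in ("X", "Y", "Z"):
    bf.add(item)
```

An added key always finds its own bits set, so there are no false negatives. A key that was never added can still find all of its bits set by other keys, which is the false positive case.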


Bloom-Filter

[Diagram: bitset with six bits set to 1 for the set {X, Y, Z}]


Map / Dictionary

• Abstract data type representing a partial function

• Relates keys to values
• Keys are unique
• Values might not be
• Two main types (like sets)
  • Ordered tree maps
  • Unordered hash maps

Language   Ordered            Unordered
C++        map                unordered_map
Java       TreeMap            HashMap
C#         SortedDictionary   Dictionary


Map / Dictionary

[Diagram: keys mapped to values]


Map / Dictionary Methods

• Insert / Put: inserts tuple
• Get / At: gets value from key
  • Often indexer (operator[]) to do both put and get
• Remove / Delete: removes tuple by key
• KeyExists
• Iterator
• Size / IsEmpty
• Clear


TreeMap

Key: Name, Value: Cell #

[Diagram: tree map whose nodes hold the following entries]

John     555-3612
Jacob    555-3147
Mary     555-1243
Mathew   555-2179
Luke     555-7293
Mark     555-3479
Sarah    555-5394

Mary? → 555-1243


Hash Map

Slot   Entry
0      Sarah, 555-5394
1      (empty)
2      John, 555-3612
3      Jacob, 555-3147
4      (empty)
5      Mary, 555-1243
6      Mathew, 555-2179
7      Mark, 555-3479
8      (empty)
9      Luke, 555-7293

Mary? h(Mary) = 5 → 555-1243