introduction to algorithms

Introduction to Algorithms Jiafen Liu Sept. 2013


Page 1: Introduction to Algorithms

Introduction to Algorithms

Jiafen Liu

Sept. 2013

Page 2: Introduction to Algorithms

Today’s Tasks

Hashing

• Direct access tables

• Choosing good hash functions

– Division Method

– Multiplication Method

• Resolving collisions by chaining

• Resolving collisions by open addressing

Page 3: Introduction to Algorithms

Symbol-Table Problem

• Hashing comes up in compilers, where it is known as the symbol-table problem.

• Suppose a table S holds n records.

• Operations on S:

– INSERT(S, x)

– DELETE(S, x)

– SEARCH(S, k)

• Dynamic Set vs Static Set

Page 4: Introduction to Algorithms

The Simplest Case

• Suppose that the keys are drawn from the set U ⊆ {0, 1, …, m–1}, and that the keys are distinct.

• Direct-access table: set up an array T[0 . . m–1] such that

T[k] = x if x ∈ S and key[x] = k,

T[k] = NIL otherwise.

• In the worst case, each of the 3 operations takes

– Θ(1) time.

• Limitations of a direct-access table?

– The range of keys can be large, e.g. 64-bit numbers.

– Keys such as character strings are difficult to represent as array indices.

• Hashing: try to keep the table small while keeping the running time constant on average.
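The direct-access table described above can be sketched in a few lines of Python. This is a minimal illustration, assuming integer keys in {0, …, m–1}; the class name `DirectAccessTable` and the use of `None` for NIL are choices made for this sketch, not part of the slides.

```python
class DirectAccessTable:
    """Direct-access table for distinct integer keys in {0, ..., m-1}."""

    def __init__(self, m):
        self.slots = [None] * m  # None plays the role of NIL

    def insert(self, key, record):
        self.slots[key] = record  # Theta(1): index directly by the key

    def delete(self, key):
        self.slots[key] = None

    def search(self, key):
        return self.slots[key]


table = DirectAccessTable(8)
table.insert(3, "record with key 3")
print(table.search(3))  # the stored record
print(table.search(5))  # None: no record has key 5
```

The array has one slot per possible key, which is exactly why the approach breaks down when the key range is large.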

Page 5: Introduction to Algorithms

Naïve Hashing

• Solution: Use a hash function h to map the keys of records in S into {0, 1, …, m–1}.

[Figure: a hash function h maps the keys k1, …, k5 into slots 0 … m–1 of table T; k5 lands in the same slot as another key.]

Page 6: Introduction to Algorithms

Collisions

• When a record to be inserted maps to an already occupied slot in T, a collision occurs.

• The simplest way to resolve collisions?

– Link records in the same slot into a list.

[Figure: slot i of T holds the chain 49 → 86 → 52, because h(49) = h(86) = h(52) = i.]
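Resolving collisions by chaining can be sketched as follows. This is a minimal illustration under stated assumptions: the hash function is `key % m`, Python lists stand in for the linked lists, and the keys 10, 17, 24 are example values chosen so that they collide when m = 7 (they are not the keys from the figure).

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]  # one chain per slot

    def _h(self, key):
        return key % self.m  # assumed hash function for this sketch

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # key already present: overwrite
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self._h(key)
        self.slots[idx] = [(k, v) for k, v in self.slots[idx] if k != key]


t = ChainedHashTable(7)
for key, value in [(10, "a"), (17, "b"), (24, "c")]:
    t.insert(key, value)
print(t.slots[3])    # all three records share slot 3
print(t.search(17))  # found by walking the chain
```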

Page 7: Introduction to Algorithms

Worst Case of Chaining

• What’s the worst case of chaining?

– Every key hashes to the same slot. The table turns into a single linked list.

• Access time in the worst case?

– Θ(n), if we assume the size of S is n.

Page 8: Introduction to Algorithms

Average Case of Chaining

• In order to analyze the average case

– we should know all possible inputs and their probabilities.

– We don’t know the exact distribution, so we have to make assumptions.

• Here, we make the assumption of simple uniform hashing:

– Each key k in S is equally likely to be hashed to any slot in T, independent of other keys.

• Simple uniform hashing includes an independence assumption.

Page 9: Introduction to Algorithms

Average Case of Chaining

• Let n be the number of keys in the table, and let m be the number of slots.

• Under the simple uniform hashing assumption, what is the probability that two given keys hash to the same slot?

– 1/m.

• Define the load factor of T to be α = n/m. What does that mean?

– The average number of keys per slot.

Page 10: Introduction to Algorithms

Search Cost

• The expected time for an unsuccessful search for a record with a given key is?

– Θ(1 + α): Θ(1) to apply the hash function and access the slot, plus Θ(α) to search the list.

• If α = O(1), the expected search time is Θ(1).

• How about a successful search?

– It has the same asymptotic bound.

– Reserved for your homework.

Page 11: Introduction to Algorithms

Choosing a hash function

• The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice.

– A good hash function should distribute the keys uniformly into all the slots.

– Regularity of the key distribution should not affect this uniformity.

• For example, all the keys might be even numbers.

• The simplest way to distribute keys to m slots evenly?

Page 12: Introduction to Algorithms

Division Method

• Assume all keys are integers, and define

h(k) = k mod m.

• Advantage: simple, and usually practical.

• Caution:

– Be careful about the choice of modulus m.

– It doesn't work well for every table size m.

• Example: if we pick m with a small divisor d.

Page 13: Introduction to Algorithms

Deficiency of Division Method

• Deficiency: if we pick m with a small divisor d.

– Example: d = 2, so that m is an even number.

– Suppose it happens that all keys are even.

– What happens to the hash table?

– We will never hash anything to an odd-numbered slot.
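The even-keys deficiency is easy to observe. A small sketch, assuming the table sizes m = 10 and m = 11 and the key range 0, 2, …, 198 purely for illustration:

```python
from collections import Counter


def division_hash(k, m):
    return k % m


even_keys = range(0, 200, 2)

# Slots actually used with an even modulus versus a prime modulus.
used_even_m = Counter(division_hash(k, 10) for k in even_keys)
used_prime_m = Counter(division_hash(k, 11) for k in even_keys)

print(sorted(used_even_m))   # [0, 2, 4, 6, 8]: odd slots are never used
print(sorted(used_prime_m))  # all slots 0 .. 10 are used
```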

Page 14: Introduction to Algorithms

Deficiency of Division Method

• Extreme deficiency: m = 2^r, so that m is built entirely from the small divisor 2.

• If k = (1011000111011010)_2 and m = 2^6, what does the hash value turn out to be?

– h(k) = (011010)_2, the 6 low-order bits of k.

• The hash value doesn’t depend on all the bits of k.

• Suppose all the low-order bits are the same and all the high-order bits differ: every key then hashes to the same slot.
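The example on this slide can be checked directly. The list `same_low_bits` is a hypothetical set of keys built for this sketch so that only the high-order bits differ.

```python
k = 0b1011000111011010
m = 2 ** 6

h = k % m
print(format(h, "06b"))  # 011010, the six low-order bits of k

# Keys that differ only in their high-order bits all collide.
same_low_bits = [(hi << 6) | 0b011010 for hi in range(5)]
print({key % m for key in same_low_bits})  # {26}: a single slot
```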

Page 15: Introduction to Algorithms

How to choose modulus?

• Heuristics for choosing the modulus m:

– Choose m to be a prime.

– Make m not close to a power of two or ten.

• The division method is not a really good one:

– Sometimes, making the table size a prime is inconvenient. We often want a table of size 2^r.

– Also, division takes more time to compute than multiplication or addition on most computers.

Page 16: Introduction to Algorithms

Another method: Multiplication

• The multiplication method is a little more complicated but superior.

• Assume that all keys are integers, m = 2^r, and our computer has w-bit words.

• Define h(k) = (A·k mod 2^w) rsh (w–r):

– A is an odd integer in the range 2^(w–1) < A < 2^w.

– (Both the highest bit and the lowest bit are 1.)

– rsh is the “bitwise right-shift” operator.

• Multiplication modulo 2^w is fast compared to division, and the rsh operator is fast.

• Tip: Don’t pick A too close to 2^(w–1) or 2^w.

Page 17: Introduction to Algorithms

Example of multiplication method

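• Suppose that m = 8 = 2^3 (so r = 3), and that our computer has w = 7-bit words:

• We choose A = 1 0 1 1 0 0 1

• k = 1 1 0 1 0 1 1

• A·k = 1 0 0 1 0 1 0 | 0 1 1 | 0 0 1 1

– The high 7 bits (1001010) are ignored by mod.

– The low 4 bits (0011) are ignored by rsh.

– The remaining 3 bits give h(k) = 0 1 1.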
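The multiplication method and the worked example can be reproduced with a bit mask and a shift. A minimal sketch; the function name `multiplication_hash` is chosen here for illustration.

```python
def multiplication_hash(k, A, w, r):
    """h(k) = (A*k mod 2^w) rsh (w - r), giving a slot in 0 .. 2^r - 1."""
    low_w_bits = (A * k) & ((1 << w) - 1)  # A*k mod 2^w
    return low_w_bits >> (w - r)           # keep the top r of those w bits


A = 0b1011001
k = 0b1101011
h = multiplication_hash(k, A, w=7, r=3)
print(format(A * k, "014b"))  # 10010100110011
print(format(h, "03b"))       # 011
```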

Page 18: Introduction to Algorithms

Another way to resolve collisions

• We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.

• There's another way, open addressing, whose idea is: no storage for links.

• We systematically probe the table until an empty slot is found.

Page 19: Introduction to Algorithms

Open Addressing

• The hash function depends on both the key and the probe number:

h : U × {0, 1, …, m–1} → {0, 1, …, m–1}

(universe of keys × probe number → slot number)

• The probe sequence ⟨h(k,0), h(k,1), …, h(k,m–1)⟩ should be a permutation of {0, 1, …, m–1}.

Page 20: Introduction to Algorithms

Implementation of Insertion

• What about HASH-SEARCH(T,k)?

Page 21: Introduction to Algorithms

Implementation of Searching
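A minimal Python sketch of insertion and search under open addressing. It assumes a linear probe function built on `key % m` (any h(k, i) whose probe sequence is a permutation of the slots would do), and the names `probe`, `hash_insert` and `hash_search` are chosen for this sketch.

```python
def probe(key, i, m):
    # Assumed probe function: linear probing on top of key mod m.
    return (key % m + i) % m


def hash_insert(T, key):
    m = len(T)
    for i in range(m):
        j = probe(key, i, m)
        if T[j] is None:      # first empty slot on the probe sequence
            T[j] = key
            return j
    raise OverflowError("hash table overflow")


def hash_search(T, key):
    m = len(T)
    for i in range(m):
        j = probe(key, i, m)
        if T[j] is None:      # empty slot: the key cannot lie further along
            return None
        if T[j] == key:
            return j
    return None


T = [None] * 8
for key in (10, 18, 26, 3):
    hash_insert(T, key)
print(T)                   # [None, None, 10, 18, 26, 3, None, None]
print(hash_search(T, 26))  # 4
print(hash_search(T, 34))  # None
```

Search follows exactly the probe sequence that insertion used, which is why it may stop at the first empty slot.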

Page 22: Introduction to Algorithms

More about Open Addressing

• The hash table may fill up.

– The number of elements must be less than or equal to the table size.

• Deletion is difficult. Why?

– Suppose we remove a key from the table, and later somebody searches for a different key.

– The probe sequence of that search happens to pass through the slot of the deleted key.

– The search finds an empty slot there and concludes that the key is not in the table, even though it may lie further along the probe sequence.

• So we should mark deleted slots rather than empty them.
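Marking deleted slots can be sketched with a sentinel value. This is an illustration under stated assumptions: linear probing on `key % m`, a sentinel object named `DELETED`, and an `insert` that assumes the key is not already in the table.

```python
EMPTY = None
DELETED = object()  # sentinel marking a slot whose key was removed


def probe(key, i, m):
    return (key % m + i) % m  # assumed: linear probing


def insert(T, key):
    # Assumes key is not already present; reuses the first free or marked slot.
    for i in range(len(T)):
        j = probe(key, i, len(T))
        if T[j] is EMPTY or T[j] is DELETED:
            T[j] = key
            return j
    raise OverflowError("hash table overflow")


def search(T, key):
    for i in range(len(T)):
        j = probe(key, i, len(T))
        if T[j] is EMPTY:
            return None          # truly empty: stop searching
        if T[j] is not DELETED and T[j] == key:
            return j             # marked slots are skipped, not treated as empty
    return None


def delete(T, key):
    j = search(T, key)
    if j is not None:
        T[j] = DELETED


T = [EMPTY] * 8
for key in (10, 18, 26):
    insert(T, key)
delete(T, 18)
print(search(T, 26))  # 4: still found, because slot 3 is marked, not emptied
```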

Page 23: Introduction to Algorithms

Example of open addressing

Page 24: Introduction to Algorithms

Example of open addressing

Page 25: Introduction to Algorithms

Example of open addressing

Page 26: Introduction to Algorithms

Example of open addressing

Page 27: Introduction to Algorithms

Some heuristics about probe

• We can keep a global record of the largest number of probes that any insertion has needed.

– A search never needs to look at more than that number of slots.

• There are many ideas for forming a probe sequence effectively.

• The simplest one is?

– Linear probing.

Page 28: Introduction to Algorithms

The simplest probing strategy

• Linear probing: given an ordinary hash function h(k), linear probing uses

h(k,i) = (h(k,0) + i) mod m, where h(k,0) = h(k)

• Advantage: simple.

• Disadvantage?

– Primary clustering.

Page 29: Introduction to Algorithms

Primary Clustering

• Linear probing suffers from primary clustering: long runs of occupied slots build up in regions of the table.

– Any key that hashes into such a region has to probe through the whole run.

– What’s more, long runs tend to grow longer, which increases the average search time.

Page 30: Introduction to Algorithms

Another probing strategy

• Double hashing: given two ordinary hash functions h1(k) and h2(k), double hashing uses

h(k,i) = (h1(k) + i·h2(k)) mod m

• If h2(k) is relatively prime to m, double hashing generally produces excellent results.

– One way is to make m a power of 2 and design h2(k) to produce only odd numbers.
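A short sketch of a double-hashing probe sequence with m a power of 2 and an odd h2. The two hash functions used here, `k % m` and `2 * (k % (m // 2)) + 1`, are assumptions made for illustration only.

```python
def double_hash_sequence(k, m):
    h1 = k % m                    # assumed first hash function
    h2 = 2 * (k % (m // 2)) + 1   # assumed second hash function: always odd
    return [(h1 + i * h2) % m for i in range(m)]


m = 16
seq = double_hash_sequence(45, m)
print(seq)
print(sorted(seq) == list(range(m)))  # True: every slot is probed exactly once
```

Because an odd h2(k) shares no factor with a power of 2, the sequence visits all m slots before repeating.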

Page 31: Introduction to Algorithms

Analysis of open addressing

• We make the assumption of uniform hashing:

– Each key is equally likely to have any one of the m! permutations as its probe sequence, independent of other keys.

• Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1–α).

Page 32: Introduction to Algorithms

Proof of the theorem

Proof:

• At least one probe is always necessary.

• With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.

• With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.

• With probability (n–2)/(m–2), the third probe hits an occupied slot, etc.

• And then how to prove?

• Observe that (n–i)/(m–i) < n/m = α for i = 1, 2, …, n.

Page 33: Introduction to Algorithms

Proof of the theorem

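• Therefore, the expected number of probes is

1 + (n/m)(1 + ((n–1)/(m–1))(1 + ((n–2)/(m–2))(⋯(1 + 1/(m–n+1))⋯)))

≤ 1 + α(1 + α(1 + α(⋯(1 + α)⋯)))

≤ 1 + α + α^2 + α^3 + ⋯

= 1/(1–α) (geometric series).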

Page 34: Introduction to Algorithms

Implications of the theorem

• If α is constant, then accessing an open-addressed hash table takes constant time.

• If the table is half full, then the expected number of probes is?

– At most 1/(1–0.5) = 2.

• If the table is 90% full, then the expected number of probes is?

– At most 1/(1–0.9) = 10.

• Filling the table close to capacity makes hashing slow.
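The bound can be compared with the exact expectation under uniform hashing by evaluating the nested probabilities from the proof: a first probe, then with probability n/m a second, then with probability (n–1)/(m–1) a third, and so on. A sketch, assuming a table of m = 100 slots for illustration; the function name `expected_probes` is chosen here.

```python
from fractions import Fraction


def expected_probes(n, m):
    """Exact expected probes in an unsuccessful search (uniform hashing, n < m)."""
    expected = Fraction(1)  # after n occupied probes, one final probe must succeed
    # Evaluate 1 + (n/m)(1 + ((n-1)/(m-1))(1 + ...)) from the inside out.
    for i in reversed(range(n)):
        expected = 1 + Fraction(n - i, m - i) * expected
    return expected


m = 100
for n in (50, 90):
    alpha = Fraction(n, m)
    exact = expected_probes(n, m)
    bound = 1 / (1 - alpha)
    print(n, float(exact), float(bound))  # the exact value stays below the bound
```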

Page 35: Introduction to Algorithms

Still Hashing

• Universal hashing

• Perfect hashing

Page 36: Introduction to Algorithms

A weakness of hashing

• Problem: For any hash function h, there exists a bad set of keys that all hash to the same slot.

– It causes the average access time of a hash table to skyrocket.

– An adversary can pick all keys from {k : h(k) = i} for some slot i.

• IDEA: Choose the hash function at random, independently of the keys.

Page 37: Introduction to Algorithms

Universal hashing

Page 38: Introduction to Algorithms

Universality is good

• Theorem:

• Let h be a hash function chosen at random from a universal set H of hash functions.

• Suppose h is used to hash n arbitrary keys into the m slots of a table T.

• Then for a given key x, we have:

E[number of collisions with x] < n/m.

Page 39: Introduction to Algorithms

Universality theorem

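• Proof. Let C_x be the random variable denoting the total number of collisions of keys in T with x, and let c_xy = 1 if h(x) = h(y), and c_xy = 0 otherwise. Then C_x is the sum of c_xy over all keys y ≠ x in T.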

Page 40: Introduction to Algorithms

Universality theorem

Since E[c_xy] = 1/m for each key y ≠ x in T, linearity of expectation gives E[C_x] = (n–1)/m < n/m.

Page 41: Introduction to Algorithms

Constructing a universal set of hash functions

• One method to construct a set of universal hash functions:

• Let m be prime. Decompose key k into r+1 digits, each with value in the set {0, 1, …, m–1}.

– That is, let k = ⟨k0, k1, …, kr⟩, where 0 ≤ ki < m.

• Randomized strategy:

– Pick a = ⟨a0, a1, …, ar⟩, where each ai is chosen randomly from {0, 1, …, m–1}.

• Define

h_a(k) = (a0·k0 + a1·k1 + ⋯ + ar·kr) mod m.

Page 42: Introduction to Algorithms

One method of Construction

• How big is H = {h_a}?

– |H| = m^(r+1).

• Theorem. The set H = {h_a} is universal.

• Proof.

• Suppose that x = ⟨x0, x1, …, xr⟩ and y = ⟨y0, y1, …, yr⟩ are distinct keys.

• Thus, they differ in at least one digit position.

• Without loss of generality, assume this is position 0.

• For how many h_a ∈ H do x and y collide?

Page 43: Introduction to Algorithms

One method of Construction

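• h_a(x) = h_a(y), which implies that

a0·x0 + a1·x1 + ⋯ + ar·xr ≡ a0·y0 + a1·y1 + ⋯ + ar·yr (mod m).

• Equivalently, we have

a0(x0 – y0) + a1(x1 – y1) + ⋯ + ar(xr – yr) ≡ 0 (mod m),

which implies that

a0(x0 – y0) ≡ –[a1(x1 – y1) + ⋯ + ar(xr – yr)] (mod m).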

Page 44: Introduction to Algorithms

Fact from number theory

Page 45: Introduction to Algorithms

Back to the proof

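• We just have

a0(x0 – y0) ≡ –[a1(x1 – y1) + ⋯ + ar(xr – yr)] (mod m),

and since x0 ≠ y0, an inverse (x0 – y0)^(–1) must exist, which implies that

a0 ≡ –[a1(x1 – y1) + ⋯ + ar(xr – yr)] · (x0 – y0)^(–1) (mod m).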

• Thus, for any choices of a1, a2, …, ar, exactly one choice of a0 causes x and y to collide.

Page 46: Introduction to Algorithms

Proof

• How many h_a will cause x and y to collide?

– There are m choices for each of a1, a2, …, ar, but once these are chosen, exactly one choice for a0 causes x and y to collide.

• Thus, the number of functions h_a that cause x and y to collide is m^r · 1 = m^r = |H|/m.
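The counting argument can be checked by brute force on a tiny instance. This sketch assumes m = 5 and r = 1 (two-digit keys), values picked only to keep the enumeration small; the function `h` implements the dot product modulo m.

```python
from itertools import product

m, r = 5, 1  # small prime m and r + 1 = 2 digits, chosen for illustration


def h(a, k):
    """Dot product of the digit vectors a and k, modulo m."""
    return sum(ai * ki for ai, ki in zip(a, k)) % m


keys = list(product(range(m), repeat=r + 1))    # every possible digit vector
family = list(product(range(m), repeat=r + 1))  # one hash function per vector a

x, y = (1, 3), (4, 3)  # distinct keys that differ in position 0
colliding = [a for a in family if h(a, x) == h(a, y)]
print(len(family), len(colliding))  # 25 5, i.e. |H| and |H|/m
```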

Page 47: Introduction to Algorithms

Perfect hashing

• Requirement: Given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.

• IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!

Page 48: Introduction to Algorithms

Example of Perfect hashing

Page 49: Introduction to Algorithms

Collisions at level 2

• Theorem. Let H be a class of universal hash functions for a table of size m = n^2. If we use a random h ∈ H to hash n keys into the table, the expected number of collisions is at most 1/2.

• Proof. By the definition of universality, the probability that two given keys collide under h is 1/m = 1/n^2. There are n(n–1)/2 pairs of keys that can possibly collide, so the expected number of collisions is (n(n–1)/2) · (1/n^2) < 1/2.

Page 50: Introduction to Algorithms

A fact from probability theory

• Markov’s inequality says that for any nonnegative random variable X and any t > 0, we have Pr{X ≥ t} ≤ E[X]/t.

• Theorem. The probability of no collisions is at least 1/2.

• Proof. Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2.

• Conclusion: Just by testing random hash functions in H, we’ll quickly find one that works.
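Testing random hash functions until one is collision-free can be sketched as below. Assumptions for this illustration: the dot-product family with a prime table size m = 37 (the smallest prime at least n^2 for n = 6), six arbitrary example keys, and a seeded `random.Random` so the run is repeatable.

```python
import random


def digits(k, m, r):
    """Base-m digits <k0, k1, ..., kr> of the integer k."""
    return [(k // m ** i) % m for i in range(r + 1)]


def make_hash(a, m):
    r = len(a) - 1
    return lambda k: sum(ai * ki for ai, ki in zip(a, digits(k, m, r))) % m


def find_collision_free(keys, m, r, rng):
    """Draw random functions h_a until one has no collisions on keys."""
    tries = 0
    while True:
        tries += 1
        a = [rng.randrange(m) for _ in range(r + 1)]
        h = make_hash(a, m)
        if len({h(k) for k in keys}) == len(keys):
            return h, tries


keys = [12, 95, 230, 511, 777, 1024]  # n = 6 example keys, all below 37^2
m, r = 37, 1                          # 37 is the smallest prime >= n^2 = 36
h, tries = find_collision_free(keys, m, r, random.Random(0))
print(tries, sorted(h(k) for k in keys))
```

Since each draw succeeds with probability at least 1/2, the expected number of draws is at most 2.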

Page 51: Introduction to Algorithms