Introduction to Algorithms

Jiafen Liu

Sept. 2013

Today’s Tasks

Hashing

• Direct access tables

• Choosing good hash functions
– Division Method
– Multiplication Method

• Resolving collisions by chaining

• Resolving collisions by open addressing

Symbol-Table Problem

• Hashing comes up in compilers, where it is known as the symbol-table problem.

• Suppose a table S holds n records.

• Operations on S:
– INSERT(S, x)
– DELETE(S, x)
– SEARCH(S, k)

• Dynamic Set vs Static Set

The Simplest Case

• Suppose that the keys are drawn from the set U ⊆ {0, 1, …, m–1}, and keys are distinct.

• Direct-access table: set up an array T[0 . . m–1] with

T[k] = x if x ∈ S and key[x] = k,

T[k] = NIL otherwise.

• In the worst case, the 3 operations take time of

– Θ(1)

• Limitations of direct-access tables?
– The range of keys can be large: 64-bit numbers, or character strings (which are difficult to represent as array indices).

• Hashing: Try to keep the table small, while preserving the property of constant running time.
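The direct-access table described above can be sketched in Python as follows. This is only an illustration: the class name `DirectAccessTable` and its method names are hypothetical, and `None` plays the role of NIL.

```python
class DirectAccessTable:
    """Array T[0..m-1]; slot k holds the record with key k, or None (NIL)."""

    def __init__(self, m):
        self.slots = [None] * m

    def insert(self, key, record):
        self.slots[key] = record      # Theta(1)

    def delete(self, key):
        self.slots[key] = None        # Theta(1)

    def search(self, key):
        return self.slots[key]        # Theta(1); None means "not present"


table = DirectAccessTable(8)
table.insert(3, "record with key 3")
```

The table needs one slot per possible key, which is exactly why it stops being practical once the key range is large.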

Naïve Hashing

• Solution: Use a hash function h to map the keys of records in S into {0, 1, …, m–1}.

[Figure: the hash function h maps the keys k1, …, k5 of S into slots 0 … m–1 of table T; h(k4) = h(k5), so k4 and k5 land in the same slot.]

Collisions

• When a record to be inserted maps to an already occupied slot in T, a collision occurs.

• The simplest way to resolve collisions?
– Link records in the same slot into a list.

[Figure: slot i of T points to the chain 49 → 86 → 52, since h(49) = h(86) = h(52) = i.]
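A minimal sketch of chaining in Python, under two assumptions that are not from the slides: the hash function is k mod m, and each chain is a Python list rather than a linked list. The class name `ChainedHashTable` is illustrative.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                   # assumed hash function for this sketch

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)       # overwrite an existing key
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self._h(key)
        self.slots[idx] = [(k, v) for k, v in self.slots[idx] if k != key]


t = ChainedHashTable(m=17)
for key in (49, 86, 52):
    t.insert(key, f"record {key}")
```

With m = 17, the keys 86 and 52 both hash to slot 1 and share a chain, while 49 goes to slot 15.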

Worst Case of Chaining

• What’s the worst case of chaining?
– Every key hashes to the same slot, and the table degenerates into a single linked list.

• Access time in the worst case?
– Θ(n), if we assume the size of S is n.

Average Case of Chaining

• In order to analyze the average case
– we should know all possible inputs and their probabilities.
– We don’t know the exact distribution, so we make assumptions.

• Here, we make the assumption of simple uniform hashing:
– Each key k in S is equally likely to be hashed to any slot in T, independent of other keys.

• Simple uniform hashing includes an independence assumption.

Average Case of Chaining

• Let n be the number of keys in the table, and let m be the number of slots.

• Under the simple uniform hashing assumption, what is the probability that two given keys hash to the same slot?

– 1/m.

• Define the load factor of T to be α = n/m. What does that mean?

– The average number of keys per slot.

Search Cost

• The expected time for an unsuccessful search for a record with a given key is?
– Θ(1 + α): the 1 is for applying the hash function and accessing the slot, and the α is for searching the list.

• If α = O(1), expected search time = Θ(1)

• How about a successful search?
– It has the same asymptotic bound.
– Reserved for your homework.
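The α term can be checked with a small simulation. The sketch below assumes simple uniform hashing is modeled by drawing a slot uniformly at random for each key; the function name `average_unsuccessful_cost` is hypothetical.

```python
import random

def average_unsuccessful_cost(n, m, seed=0):
    """Hash n keys into m chains and return the mean chain length, i.e. the
    mean number of list elements an unsuccessful search examines when its
    key is equally likely to land in any slot."""
    rng = random.Random(seed)
    chain_length = [0] * m
    for _ in range(n):
        chain_length[rng.randrange(m)] += 1   # simulated simple uniform hashing
    return sum(chain_length) / m

alpha = average_unsuccessful_cost(n=3000, m=1000)
```

Because the chain lengths always sum to n, the mean over the m slots is n/m = α no matter how the keys fall; the uniformity assumption is what lets us say the searched-for key lands in each slot with probability 1/m.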

Choosing a hash function

• The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice.
– A good hash function should distribute the keys uniformly into all the slots.
– Regularity of the key distribution should not affect this uniformity. For example, all the keys might be even numbers.

• The simplest way to distribute keys to m slots evenly?

Division Method

• Assume all keys are integers, and define

h(k) = k mod m.

• Advantage: simple, and usually practical.

• Caution:
– Be careful about the choice of modulus m.
– It doesn't work well for every table size m.

• Example: problems arise if we pick m with a small divisor d.

Deficiency of Division Method

• Deficiency: if we pick m with a small divisor d.
– Example: d = 2, so that m is an even number.
– Suppose it happens that all keys are even.
– What happens to the hash table?
– We will never hash anything to an odd-numbered slot.
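A quick check of this deficiency in Python. The table sizes 10 and 11 and the key range are arbitrary choices for illustration.

```python
def h(k, m):
    return k % m                        # division method

m = 10                                  # even table size: small divisor d = 2
even_keys = range(0, 200, 2)
used_slots = sorted({h(k, m) for k in even_keys})

m_prime = 11                            # a prime table size, for comparison
used_slots_prime = sorted({h(k, m_prime) for k in even_keys})
```

With m = 10 only the even slots 0, 2, 4, 6, 8 are ever used; with the prime m = 11 the same keys reach every slot.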

Deficiency of Division Method

• Extreme deficiency: if m = 2^r, that is, every prime factor of m is the small divisor 2.

• If k = 1011000111011010₂ and m = 2^6, what does the hash value turn out to be?
– h(k) = 011010₂ = 26: just the 6 low-order bits of k.

• The hash value doesn’t depend on all the bits of k, only on the r low-order bits.

• Suppose all the low-order bits are the same, and all the high-order bits differ: then every key hashes to the same slot.
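The example can be verified directly. The sketch below uses the slide's key with m = 2^6 and shows that taking k mod 2^r is the same as masking off the r low-order bits; `k_other` is a hypothetical second key that differs only in a high-order bit.

```python
k = 0b1011000111011010
r = 6
m = 2 ** r

h = k % m                       # division method with m = 2^r
low_bits = k & (m - 1)          # mask that keeps only the r low-order bits

# Flipping a high-order bit of k leaves the hash value unchanged.
k_other = k ^ (1 << 15)
```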

How to choose modulus?

• Heuristics for choosing modulus m:
– Choose m to be a prime.
– Make m not close to a power of two or ten.

• The division method is not a really good one:
– Sometimes, making the table size a prime is inconvenient. We often want to create a table of size 2^r.
– The other reason is that division takes more time to compute than multiplication or addition on computers.

Another method: Multiplication

• The multiplication method is a little more complicated but superior.

• Assume that all keys are integers, m = 2^r, and our computer has w-bit words.

• Define h(k) = (A·k mod 2^w) rsh (w–r):
– A is an odd integer in the range 2^(w–1) < A < 2^w.
– (Both the highest bit and the lowest bit are 1)
– rsh is the “bitwise right-shift” operator.

• Multiplication modulo 2^w is fast compared to division, and the rsh operator is fast.

• Tip: Don’t pick A too close to 2^(w–1) or 2^w.

Example of multiplication method

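• Suppose that m = 8 = 2^3, r = 3, and that our computer has w = 7-bit words:

• We choose A = 1011001₂ (= 89)

• k = 1101011₂ (= 107)

• A·k = 10010100110011₂ (= 9523): the high 7 bits 1001010 are ignored by mod 2^w, the low 4 bits 0011 are ignored by rsh, and the 3 bits in between give h(k) = 011₂ = 3.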
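The multiplication method and the worked example can be sketched in Python as below. The function name `multiplication_hash` is illustrative; Python integers are unbounded, so the reduction mod 2^w is done explicitly with a bit mask rather than by machine-word overflow.

```python
def multiplication_hash(k, A, w, r):
    """h(k) = (A*k mod 2^w) rsh (w - r), for a table of size m = 2^r."""
    assert A % 2 == 1 and 2 ** (w - 1) < A < 2 ** w, "A must be odd, with top bit set"
    product_mod = (A * k) & ((1 << w) - 1)   # A*k mod 2^w: keep the low w bits
    return product_mod >> (w - r)            # keep the top r of those w bits

A = 0b1011001   # 89
k = 0b1101011   # 107
slot = multiplication_hash(k, A, w=7, r=3)
```

Here A·k = 9523 = 10010100110011₂, the low 7 bits are 0110011₂, and shifting right by 4 leaves 011₂, so the key goes to slot 3.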

Another way to solve collision

• We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.

• There's another way, open addressing. The idea: no storage is used for links.

• Insertion systematically probes the table until an empty slot is found.

Open Addressing

• The hash function depends on both the key and probe number:

h : U × {0, 1, …, m–1} → {0, 1, …, m–1}

(universe of keys × probe number → slot number)

• The probe sequence ⟨h(k,0), h(k,1), …, h(k,m–1)⟩ should be a permutation of {0, 1, …, m–1}.

Implementation of Insertion

• What about HASH-SEARCH(T,k)?

Implementation of Searching
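A possible Python rendering of insertion and search under open addressing is sketched below. It is not the slides' own pseudocode: the probe function `probe` is an assumption (the home slot k mod m followed by the next slots in order), and `None` marks an empty slot.

```python
def probe(k, i, m):
    """Assumed probe function for this sketch: home slot k mod m, then onward."""
    return (k % m + i) % m

def hash_insert(T, k):
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is None:          # empty slot found
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def hash_search(T, k):
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is None:          # an empty slot ends the probe sequence
            return None
        if T[j] == k:
            return j
    return None

T = [None] * 8
positions = [hash_insert(T, k) for k in (10, 18, 26, 3)]
```

Keys 10, 18 and 26 all have home slot 2, so they occupy slots 2, 3 and 4; key 3 then finds slots 3 and 4 taken and settles in slot 5. Search follows the same probe sequence as insertion and stops at the first empty slot.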

More about Open Addressing

• The hash table may fill up.
– The number of elements must be less than or equal to the table size.

• Deletion is difficult. Why?
– Suppose we remove a key from the table, and later someone searches for a different key.
– The probe sequence of that search happens to hit the slot of the key we deleted.
– The search finds an empty slot and concludes that the key is not in the table, even though it may lie further along the probe sequence.

• We should keep deleted slots marked.
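One common way to keep deleted slots marked is a sentinel value, often called a tombstone. The sketch below assumes a linear probe function and uses a hypothetical `DELETED` sentinel; search skips over marked slots, while insertion may reuse them.

```python
DELETED = object()    # sentinel marking a slot whose key was removed

def probe(k, i, m):
    return (k % m + i) % m        # assumed probe function for this sketch

def insert(T, k):
    for i in range(len(T)):
        j = probe(k, i, len(T))
        if T[j] is None or T[j] is DELETED:   # deleted slots can be reused
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def search(T, k):
    for i in range(len(T)):
        j = probe(k, i, len(T))
        if T[j] is None:          # truly empty: k cannot be further along
            return None
        if T[j] is not DELETED and T[j] == k:
            return j              # DELETED slots are skipped, not treated as empty
    return None

def delete(T, k):
    j = search(T, k)
    if j is not None:
        T[j] = DELETED

T = [None] * 8
for key in (10, 18, 26):          # all three keys have home slot 2
    insert(T, key)
delete(T, 18)                     # slot 3 becomes DELETED, not empty
```

After deleting 18, a search for 26 still walks past slot 3 and finds the key in slot 4. Had slot 3 been reset to empty, that search would have stopped early and reported 26 as missing.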

Example of open addressing

Some heuristics about probe

• We can record, globally, the largest number of probes that any insertion has needed.
– A search never needs to look at more slots than that number.

• There are lots of ideas about forming a probe sequence effectively.

• The simplest one is?
– Linear probing.

The simplest probing strategy

• Linear probing: given an ordinary hash function h′(k), linear probing uses

h(k,i) = (h′(k) + i) mod m

• Advantage: Simple

• Disadvantage?
– Primary clustering
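The probe sequence produced by linear probing can be listed explicitly. In this sketch the ordinary hash function is assumed to be k mod m, and the function name `linear_probe_sequence` is illustrative.

```python
def linear_probe_sequence(k, m):
    """Probe sequence h(k,0), h(k,1), ..., h(k,m-1) for linear probing."""
    home = k % m                  # assumed ordinary hash function
    return [(home + i) % m for i in range(m)]

seq = linear_probe_sequence(k=13, m=8)
```

For k = 13 and m = 8 the sequence starts at slot 5 and wraps around: 5, 6, 7, 0, 1, 2, 3, 4. It visits every slot exactly once, so it is a permutation of {0, …, m–1} as required.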

Primary Clustering

• Linear probing suffers from primary clustering, where regions of the hash table get full.
– Anything that hashes into such a region has to look through all of its occupied slots.
– What’s more, long runs of occupied slots build up and tend to get longer, increasing the average search time.

Another probing strategy

• Double hashing: given two ordinary hash functions h1(k), h2(k), double hashing uses

h(k,i) = (h1(k) + i·h2(k)) mod m

• If h2(k) is relatively prime to m, double hashing generally produces excellent results.
– A common choice is to make m a power of 2 and design h2(k) to produce only odd numbers.
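A sketch of a double-hashing probe sequence with m a power of 2. The two hash functions here are assumptions chosen for illustration: h1(k) = k mod m, and h2(k) = (2·⌊k/m⌋ + 1) mod m, which is always odd.

```python
def double_hash_sequence(k, m):
    """Probe sequence for double hashing with m a power of 2."""
    h1 = k % m
    h2 = (2 * (k // m) + 1) % m   # always odd, hence relatively prime to m = 2^r
    return [(h1 + i * h2) % m for i in range(m)]

seq = double_hash_sequence(k=45, m=8)
```

For k = 45 and m = 8 we get h1 = 5 and h2 = 3, so the probes step through the table in strides of 3. Because the stride is relatively prime to m, every slot is visited exactly once.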

Analysis of open addressing

• We make the assumption of uniform hashing:
– Each key is equally likely to have any one of the m! permutations as its probe sequence, independent of other keys.

• Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1–α).

Proof of the theorem

Proof:

• At least one probe is always necessary.

• With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.

• With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.

• With probability (n–2)/(m–2), the third probe hits an occupied slot, etc.

• And then how to prove?

• Observe that (n–i)/(m–i) < n/m = α for i = 1, 2, …, n.

Proof of the theorem

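• Therefore, the expected number of probes is

$$1 + \frac{n}{m}\left(1 + \frac{n-1}{m-1}\left(1 + \frac{n-2}{m-2}\left(\cdots\left(1 + \frac{1}{m-n+1}\right)\cdots\right)\right)\right)$$

$$\le 1 + \alpha\left(1 + \alpha\left(1 + \alpha\left(\cdots(1 + \alpha)\cdots\right)\right)\right) \le 1 + \alpha + \alpha^2 + \alpha^3 + \cdots = \sum_{i=0}^{\infty} \alpha^i = \frac{1}{1-\alpha}$$

(geometric series)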

Implications of the theorem

• If α is constant, then accessing an open-addressed hash table takes constant time.

• If the table is half full, then the expected number of probes is?
– At most 1/(1–0.5) = 2.

• If the table is 90% full, then the expected number of probes is?
– At most 1/(1–0.9) = 10.

• Filling the table nearly to capacity makes hashing slow.
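To see how tight the bound is, the nested probability expression from the proof can be evaluated exactly with rational arithmetic. This is a sketch under the uniform hashing assumption; the function names are illustrative.

```python
from fractions import Fraction

def expected_unsuccessful_probes(n, m):
    """Exact value of 1 + n/m (1 + (n-1)/(m-1) (1 + ...)) for n keys, m slots."""
    expected = Fraction(1)
    # Evaluate the nested expression from the innermost term outward.
    for i in range(n - 1, -1, -1):
        expected = 1 + Fraction(n - i, m - i) * expected
    return expected

def bound(n, m):
    return 1 / (1 - Fraction(n, m))     # the theorem's 1/(1 - alpha)

half = expected_unsuccessful_probes(50, 100)
ninety = expected_unsuccessful_probes(90, 100)
```

For a table of 100 slots the exact expectations are 101/51 (about 1.98) at half full and 101/11 (about 9.18) at 90% full, both below the bounds 2 and 10.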

Still Hashing

• Universal hashing

• Perfect hashing

A weakness of hashing

• Problem: For any hash function h, there exists a bad set of keys that all hash to the same slot.
– It causes the average access time of a hash table to skyrocket.
– An adversary can pick all keys from {k : h(k) = i} for some slot i.

• IDEA: Choose the hash function at random, independently of the keys.

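Universal hashing

• Definition. Let U be a universe of keys, and let H be a finite collection of hash functions, each mapping U to {0, 1, …, m–1}. H is universal if, for all x, y ∈ U with x ≠ y, the number of functions h ∈ H with h(x) = h(y) is |H|/m.

• That is, if h is chosen at random from H, the chance of a collision between x and y is 1/m.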

Universality is good

• Theorem:

• Let h be a hash function chosen at random from a universal set H of hash functions.

• Suppose h is used to hash n arbitrary keys into the m slots of a table T.

• Then for a given key x, we have:

E[number of collisions with x] < n/m.

Universality theorem

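• Proof. Let C_x be the random variable denoting the total number of collisions of keys in T with x, and let

$$c_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y), \\ 0 & \text{otherwise.} \end{cases}$$

Universality theorem

• Note that E[c_xy] = 1/m and that C_x is the sum of c_xy over all keys y ≠ x in T. Therefore

$$E[C_x] = E\Big[\sum_{y \in T - \{x\}} c_{xy}\Big] = \sum_{y \in T - \{x\}} E[c_{xy}] = \sum_{y \in T - \{x\}} \frac{1}{m} = \frac{n-1}{m} < \frac{n}{m}.$$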

Construction universal hash function set

• One method to construct a set of universal hash functions:

• Let m be prime. Decompose key k into r+1 digits, each with value in the set {0, 1, …, m–1}.
– That is, let k = ⟨k_0, k_1, …, k_r⟩, where 0 ≤ k_i < m.

• Randomized strategy:
– Pick a = ⟨a_0, a_1, …, a_r⟩ where each a_i is chosen randomly from {0, 1, …, m–1}.

• Define

$$h_a(k) = \sum_{i=0}^{r} a_i k_i \bmod m.$$
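A sketch of this construction in Python, assuming the hash function is the dot product of the random vector a with the base-m digits of k, reduced mod m. The helper names `digits_base_m` and `random_hash` are hypothetical, and the digits are taken low-order first.

```python
import random

def digits_base_m(k, m, r):
    """Decompose k into r+1 base-m digits <k0, k1, ..., kr>, low digit first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)
        k //= m
    return digits

def random_hash(m, r, rng=random):
    """Pick a = <a0, ..., ar> at random and return (a, h_a)."""
    a = [rng.randrange(m) for _ in range(r + 1)]
    def h_a(k):
        return sum(ai * ki for ai, ki in zip(a, digits_base_m(k, m, r))) % m
    return a, h_a

m, r = 7, 2                       # m prime; keys must be below m**(r+1) = 343
a, h = random_hash(m, r, random.Random(1))
```

The function is drawn once, when the table is created, and then used for every key. The randomness is in the choice of a, not in the keys.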

One method of Construction

• How big is H = {h_a}?

– |H| = m^(r+1).

• Theorem. The set H = {h_a} is universal.

• Proof.

• Let x = ⟨x_0, x_1, …, x_r⟩ and y = ⟨y_0, y_1, …, y_r⟩ be distinct keys.

• Thus, they differ in at least one digit position.

• Without loss of generality, position 0.

• For how many h_a ∈ H do x and y collide?

One method of Construction

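• h_a(x) = h_a(y), which implies that

$$\sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}.$$

• Equivalently, we have

$$\sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}, \quad\text{or}\quad a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}.$$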

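Fact from number theory

• Theorem. Let m be prime. For any z ∈ {1, 2, …, m–1}, there exists a unique z^(–1) in the same set such that z · z^(–1) ≡ 1 (mod m).

• Example: m = 7. The inverses of z = 1, 2, 3, 4, 5, 6 are 1, 4, 5, 2, 3, 6.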
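Modular inverses for a prime modulus can be computed in Python with the three-argument built-in `pow`, which accepts a negative exponent from Python 3.8 onward. The modulus 7 below is just an example value.

```python
m = 7                                                # a prime modulus
inverses = {z: pow(z, -1, m) for z in range(1, m)}   # requires Python 3.8+
```

Every nonzero residue gets exactly one inverse; for instance 3 · 5 = 15 ≡ 1 (mod 7).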

Back to the proof

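• We just have

$$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m},$$

and since x_0 ≠ y_0, an inverse (x_0 – y_0)^(–1) must exist, which implies that

$$a_0 \equiv \left(-\sum_{i=1}^{r} a_i (x_i - y_i)\right) \cdot (x_0 - y_0)^{-1} \pmod{m}.$$

• Thus, for any choices of a_1, a_2, …, a_r, exactly one choice of a_0 causes x and y to collide.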

Proof

• How many h_a will cause x and y to collide?

– There are m choices for each of a_1, a_2, …, a_r, but once these are chosen, exactly one choice for a_0 causes x and y to collide.

• Thus, the number of h_a that cause x and y to collide is m^r · 1 = m^r = |H|/m.
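The count can be checked by brute force for a tiny instance. The sketch below assumes the dot-product family h_a(k) = (a_0 k_0 + … + a_r k_r) mod m, represents each key directly by its digit vector, and uses m = 3, r = 2 so that the enumeration stays small; the function names are illustrative.

```python
from itertools import product

def h(a, key_digits, m):
    return sum(ai * ki for ai, ki in zip(a, key_digits)) % m

def colliding_functions(x, y, m):
    """Count vectors a in {0..m-1}^(r+1) with h_a(x) == h_a(y)."""
    return sum(1 for a in product(range(m), repeat=len(x))
               if h(a, x, m) == h(a, y, m))

m, r = 3, 2
keys = list(product(range(m), repeat=r + 1))      # every key as a digit vector
counts = {colliding_functions(x, y, m)
          for x in keys for y in keys if x != y}
```

Every pair of distinct keys collides under exactly m^r = 9 of the m^(r+1) = 27 functions, which is |H|/m.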

Perfect hashing

• Requirement: Given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.

• IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!

Example of Perfect hashing

Collisions at level 2

• Theorem. Let H be a class of universal hash functions for a table of size m = n^2. If we use a random h ∈ H to hash n keys into the table, the expected number of collisions is at most 1/2.

• Proof. By the definition of universality, the probability that two given keys collide under h is 1/m = 1/n^2. There are n(n–1)/2 pairs of keys that can possibly collide, so the expected number of collisions is

$$\binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}.$$

A fact from probability theory

• Markov’s inequality says that for any nonnegative random variable X and any t > 0, we have Pr{X ≥ t} ≤ E[X]/t.

• Theorem. The probability of no collisions is at least 1/2.

• Proof. Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2.

• Conclusion: Just by testing random hash functions in H, we’ll quickly find one that works.
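The trial-and-error search for a collision-free function can be sketched as follows. Several details are assumptions made for illustration: the functions come from the dot-product family over base-m digits, the table size is n^2 rounded up to the next prime (that family needs a prime m), and the key set is made up.

```python
import random

def next_prime(n):
    """Smallest prime >= n (trial division; fine for small sketches)."""
    def is_prime(p):
        return p >= 2 and all(p % d for d in range(2, int(p ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def find_collision_free_hash(keys, rng):
    """Try random dot-product hash functions until no two keys collide."""
    n = len(keys)
    m = next_prime(max(n * n, 2))          # about n^2 slots, rounded up to a prime
    r = 0
    while m ** (r + 1) <= max(keys):       # enough base-m digits for every key
        r += 1
    trials = 0
    while True:
        trials += 1
        a = [rng.randrange(m) for _ in range(r + 1)]
        def h(k, a=a):
            total = 0
            for ai in a:                   # consume base-m digits, low digit first
                total += ai * (k % m)
                k //= m
            return total % m
        if len({h(k) for k in keys}) == n: # all slots distinct: no collisions
            return h, m, trials

keys = [40, 22, 37, 86, 52, 49, 1009]
h, m, trials = find_collision_free_hash(keys, random.Random(0))
```

Since each random function is collision-free with probability at least 1/2, the expected number of trials is at most 2. The price is a quadratic-size table, which is why the two-level scheme applies it only to the small sets of keys that share a first-level slot.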
