hashing cs 110: data structures and algorithms first semester, 2010-2011

HashingCS 110: Data Structures and

Algorithms

First Semester, 2010-2011

Hashing

►In a dictionary, if it can be arranged such that the key is also the index to the array that stores the entries, searching and inserting items would be very fast

►Example: empdata[1000]index = employee ID number►Search for employee with ID number 500► return empdata[500]►Running Time: O(1)

Hash Table

►A data structure implemented as an array of objects, where the search keys correspond to the array indices

►Insert and find operations involve straight forward array accesses: O(1) time complexity

About Hash Tables

►In the first example shown, it was relatively easy since employee number is an integer

►A few problems may arise in different situations

About Hash Table

►Problem 1: possible integer key values might be too large; creating an appropriate array might be impractical►Need to map large integer values to

smaller array indices►Problem 2: What if the key is a word in

the English Alphabet (e.g. last names)►Need to map names to integers

(indices)

Large Values to Small Values

►Hash function: converts a number from a large range into a number from a smaller range (the range of array indices)

►Size of the array►Rule of thumb: the array size should be

about twice the size of the data set►For 50,000 words, use an array of

100,000 elements

Hash Function and Modulo

►Simplest Hash Function: achieved by using the modulo function (returns the remainder)►For example, 33 % 10 = 3►General Formula:

LargeNumber % SmallRange

Hash Functions for Names

►Sum of Digits Method►Map the alphabet A to Z to the

numbers 1 to 26 (a=1, b=2, c=3, etc)►Add the total of the letters►For example, “cats”

►c=3, a=1, t=20, s=19, 3+1+20+19=43►“cats” will be stored using index 43

►Use modulo to map to a smaller array

Collisions

►Problem►Too many words with the same index►“was”, “tin”, “give”, “tend”, “moan”,

“tick” and several other words add to 43

►These are called collisions: case where two different search keys hash to the same index value

Collisions

►Can occur even when dealing with integers►Suppose the size of the hash table is

100►Keys 158 and 358 hash to the same

value when using the modulo hash function

Collision Resolution Policy

►Need to know what to do when a collision occurs; i.e. during an insert operation; What if the array slot is already occupied?

►Most common policy: go to the next available slot►“wrap around” the array if necessary

Collision Resolution Policy

►Consequence: when searching, use the hash function, first check whether the element is the one you are looking for

►If not, try the next slots►How do you know if the element is

not in the array?

Probe Sequence

►Sequence of indices that serve as array slots where a key value would map to

►The first index in the probe sequence is the home position; the value of the hash function

►The next indices are the alternative slots

Probe Sequence

►Suppose the array size is 10, and the hash function is h(K) = K%10.

►The probe sequence for K=25 is:►5,6,7,8,9,0,1,2,3,4►Here, we assume that most common

collision resolution policy of going to the next slot: p(K,i) = I

►Goal: exhaust array slots

Hash Table Operations

►Insert object Obj with key value K► home h(K)for i 0 to M-1 do pos = (home + p(K,i)) % 10 if HT[pos].getKey() = K then throw exception “error” // or overwrite it else if HT[pos] is null then HT[pos] Obj break;


►Finding an object with key value K► home h(K)for i 0 to M-1 do pos = (home + p(K,i)) % 10 if HT[pos].getKey() = K then return HT[pos] else if HT[pos] is null then throw exception “not found”


►Although insert and find run in O(1) time during typical conditions, the time complexity in the worst-case is O(n)

►Something to think about: characterize the worst-case scenarios for insert and find

Removing Elements

►Removing an element from a hash table during a delete operation poses a problem

►If we set the corresponding hash table entry to null, then succeeding find operations might not work properly► Recall that for the find algorithm, seeing a

null means a target element is not found but in fact the element might be in a next slot

Removing Elements

►Solution: tombstone►Arrange it so that deleted entries

seem null when inserting, but don’t seem null when searching

►Requires a simple flag on the objects stored

Hash Tables in Java

►java.util.Hashtable►Important methods for Hashtable

class►put(Object key, Object entry)►Object get(Object key)► remove(Object key)►boolean constainsKey(Object key)

Summary

►Hash tables implement the dictionary data structure and enable O(1) insert, find, and remove operations►Caveat: O(n) in the worst-case

because of the possibility of collisions

Summary

►Requires a hash function(maps keys to array indices) and a collision resolution policy►Probe sequence depicts a sequence

of array slots that an object would occupy, given its key

►In Java: use the Hashtable class

hashing cs 110: data structures and algorithms first semester, 2010-2011

Documents