log files. o(n) data structure exercises 16.1

Log Files

9:00:12  May 6, 2004  231808  DS CITY WINNIPEG TAX/TAX  $480.00
9:01:34  May 6, 2004  452203  DS HYDR BPY/FAC           $101.71
9:02:45  May 6, 2004  764808  PR HOBBY HOBBY INC        $259.93
9:02:47  May 6, 2004  457221  DS ENBRIDGE BPY/FAC       $212.96
9:02:56  May 6, 2004  234621  IB 2146 PORTAGE           $300.00
9:04:01  May 6, 2004  111345  PR WAL-MART #2055         $183.00
9:04:23  May 6, 2004  457524  CK NO.110                 $53.15
9:04:25  May 6, 2004  234979  DS MTS BPY/FAC            $36.10

This is a dictionary of bank transactions.

How can we find the transaction for the account 111345?

log file: An implementation of a dictionary that uses an unordered vector, list, or sequence to store the key-element pairs. A log file is also called an audit trail.

Examples: bank transactions, computer log files.

Computer log file

maydin    pts/22  wnpgmb11dc1-res-  Sat Apr 17 23:10
pzhou     pts/24  io.uwinnipeg.ca   Sat Apr 17 15:43
igwizon   pts/23  io.uwinnipeg.ca   Sat Apr 17 15:31
sliao     pts/22  wnpgmb02dc1-res-  Sat Apr 17 15:31
dbetanco  pts/22  io.uwinnipeg.ca   Sat Apr 17 14:32
dchiu     pts/22  h24-76-245-128.w  Sat Apr 17 13:55
igwizon   pts/32  io.uwinnipeg.ca   Sat Apr 17 13:53
swang4    pts/32  wnpgmb02dc1-180-  Sat Apr 17 09:40
jkwok     pts/32  io.uwinnipeg.ca   Sat Apr 17 00:41
clim1     pts/22  wnpgmb11dc1-res-  Fri Apr 16 17:42
sliao     pts/32  wnpgmb09dc1-65-8  Fri Apr 16 17:06
pzhou     pts/22  io.uwinnipeg.ca   Fri Apr 16 15:32
jpark3    pts/32  wnpgmb11dc1-res-  Fri Apr 16 15:30
sliao     pts/17  142.132.40.26     Fri Apr 16 14:33
jpark3    pts/32  slk-170-133-res.  Fri Apr 16 14:31
jpark3    pts/32  slk-170-133-res.  Fri Apr 16 13:51
maydin    pts/32  io.uwinnipeg.ca   Fri Apr 16 12:58
maydin    pts/32  wnpgmb11dc1-166-  Fri Apr 16 12:09
ttsukamo  pts/32  io.uwinnipeg.ca   Fri Apr 16 11:49

Formally, we say that a log file is an implementation of a dictionary D using a sequence S to store the items of D in arbitrary order (unordered sequence implementation).

[Figure: the items of an unordered sequence, with a marker labelled "Next insertion".]

Characteristics of a log file:

1. It is an unordered list.
2. Inserting an item is easy, while searching for an item with a given key takes some effort.
3. It is a good choice if we only need to search for items occasionally.

Assume the size of the log file is n.

Method                 Running time
insertItem(k,e)        fast
findElement(k)         O(n)
removeElement(k)       O(n)
findAllElements(k)     O(n)
removeAllElements(k)   O(n)
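The running times above can be illustrated with a minimal sketch of a log file backed by an ArrayList. The class name LogFile, the nested Item class, and the use of null in place of NO_SUCH_KEY are assumptions made for this example, not part of the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a dictionary stored as an unordered sequence.
class LogFile<K, E> {
    // One key-element pair.
    private static final class Item<K, E> {
        final K key;
        final E element;
        Item(K key, E element) { this.key = key; this.element = element; }
    }

    private final List<Item<K, E>> items = new ArrayList<>();

    // Append at the end; no search is needed, so insertion is fast.
    void insertItem(K key, E element) {
        items.add(new Item<>(key, element));
    }

    // Scan the sequence from the front; O(n) in the worst case.
    // Returns null where the slides would return NO_SUCH_KEY.
    E findElement(K key) {
        for (Item<K, E> item : items) {
            if (item.key.equals(key)) return item.element;
        }
        return null;
    }

    // Find the first item with the key and remove it; also O(n).
    E removeElement(K key) {
        for (int i = 0; i < items.size(); i++) {
            if (items.get(i).key.equals(key)) return items.remove(i).element;
        }
        return null;
    }

    int size() { return items.size(); }

    public static void main(String[] args) {
        LogFile<Integer, String> log = new LogFile<>();
        log.insertItem(231808, "DS CITY WINNIPEG TAX/TAX $480.00");
        log.insertItem(111345, "PR WAL-MART #2055 $183.00");
        System.out.println(log.findElement(111345)); // PR WAL-MART #2055 $183.00
    }
}
```

Answering the earlier question about account 111345 therefore means walking the sequence until a matching key turns up.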

Data Structure Exercises 16.1

Hash Tables

Recall that in Java, variables of objects of a working class are in fact references to the objects. What is stored in a variable is the memory location of the object. Therefore a variable name, for example flower, is associated with an object through the corresponding memory address.

[Figure: the variable flower refers to the object “Rose”.]

hash table: A mapping of a key object to an integer in the range [0, N-1], where N is the capacity, that is, the number of key objects considered.

[Figure: the variable flower holds the address 0x2004BA0 of the object “Rose”; in the same way, the key “Bill Scott” is associated with the integer 46210, which locates the record for “Bill Scott”.]

There are two components in a hash table: a bucket array and a hash function.

bucket array: An array A of size N, where each cell of A is thought of as a “bucket” (namely, a container of elements) and the integer N defines the capacity of the array, that is, the number of cells in A. (Each cell may accommodate more than one item.)

[Figure: a bucket array with cells indexed 0 to 9.]

If the keys are integers and each key k is unique, we can access the items easily with this arrangement by placing the item in the bucket A[k].

[Figure: the item (4,C) is stored in cell 4 of a bucket array with cells 0 to 9.]

collision: More than one element has the same key. There are methods to handle collisions. However, we want to avoid collisions if we can.

[Figure: the items (4,B) and (4,C) both belong in cell 4 of the bucket array.]

Storing an item with an integer key k in A[k] seems very efficient. However, there are drawbacks with this approach.

The first drawback is that the capacity of the array A may have to be much larger than what we need.

Example:

Department   Department number   Student number
History      200                 200000-200999
Physics      400                 400000-400999

If we use the student numbers as keys (integers), do we have to allocate an array of size N > 400,000 to accommodate at most 2,000 students?

The second drawback is that keys are often not integers.

Example: We want to implement an English dictionary.

Key      Element
depose   v. 1. To remove from office or a position of power. 2. To testify, esp. in writing.

How can we use a bucket array to store it?

The solution is to use a mapping function to map an arbitrary key to an integer. Examples:

1. A function h( n ) that returns integers in the range [0-2000] when n is in the range [0 - 400000].

2. A function h( word ) that returns integers in the range [0-10000] when word is any English word, “depose” for example.

hash function: A function that maps each key k in the dictionary to an integer in the range [0 - N-1], where N is the capacity of the bucket array for the hash table.

Now, rather than storing the item (k,e) in A[k], we store it in A[h(k)].

[Figure: with h( 200004 ) = 4, the item (200004,A) is stored in cell 4 of a bucket array with cells 0 to 9.]

A good hash function is one that minimizes collisions and is easy to compute. In practice, a hash function usually consists of two steps. The first step maps a key to an integer, called the hash code. The second step maps the hash code to an integer within the range of indices of the bucket array; this step is called the compression map.

[Figure: arbitrary key -> hash code, an integer in (-∞, +∞) -> compression map -> [0, N-1], the indices of the bucket array.]

hash code (hash value): The integer assigned to a key k.

1. Integers may be positive or negative.
2. The function that assigns an integer to a key should avoid collisions as much as possible.
3. Equivalent keys should have the same integer assigned.

The Object class in Java has a hashCode() method that returns an integer. In practice, we usually need to override this function to make it suitable for our purposes. Let us consider a few of the approaches.

For data types that can be converted to an integer automatically, such as byte, short, int, and char, we get a good hash code simply by converting the value to an integer (int). For the float type, we can use the method Float.floatToIntBits(x) to convert a real number x to an integer.

Examples:

Type    Value        Hash code
char    'T'          84
byte    115          115
short   3020         3020
int     4089254562   4089254562
float   3.14159      1078530000
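As a small sketch of these conversions, the overloads below return the hash code for each primitive type. The class name PrimitiveHashCodes and its method names are made up for illustration; only Float.floatToIntBits is a standard library call.

```java
// Hypothetical helper class; names are illustrative only.
class PrimitiveHashCodes {
    static int hashCode(char c)  { return c; }  // automatic widening to int
    static int hashCode(byte b)  { return b; }
    static int hashCode(short s) { return s; }
    static int hashCode(int i)   { return i; }

    // Reinterpret the 32 bits of the float as an int.
    static int hashCode(float x) { return Float.floatToIntBits(x); }

    public static void main(String[] args) {
        System.out.println(hashCode('T'));           // 84
        System.out.println(hashCode((byte) 115));    // 115
        System.out.println(hashCode((short) 3020));  // 3020
        System.out.println(hashCode(3.14159f));      // 1078530000
    }
}
```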

Summing Components

Recall that long and double in Java cannot be converted to int automatically. Variables of these types require more storage than an int variable. In this case, the Summing Components approach splits the storage into two 32-bit parts and calculates the sum of their integer representations.

    static int hashCode( long i ) {
        return ( int )(( i >> 32 ) + ( int ) i );
    }

Examples:

i                i >> 32   (int)i       sum
12084            0         12084        12084
64533853369376   15025     1969746976   1969762001

For values of double type, the method Double.doubleToLongBits( x ) can be used to convert the value to a long first.

x         i                     i >> 32      (int)i       sum
1.0       4607182418800017408   1072693248   0            1072693248
3.14159   4614256650576692846   1074340345   4028335726   5102676071

Note: for 3.14159, the table shows the two 32-bit parts as unsigned values, and their sum 5102676071 does not fit in an int. The overflow is discarded automatically by the cast, so the hash code actually returned is 807708775.
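The following sketch wraps the method above in a class and adds an overload for double that goes through Double.doubleToLongBits. The class name SummingComponents and the double overload are additions for this example.

```java
// Hypothetical wrapper class around the summing-components method.
class SummingComponents {
    // Add the high and low 32-bit halves; the outer cast drops any overflow.
    static int hashCode(long i) {
        return (int) ((i >> 32) + (int) i);
    }

    // Convert the double to its 64-bit pattern first, then sum the halves.
    static int hashCode(double x) {
        return hashCode(Double.doubleToLongBits(x));
    }

    public static void main(String[] args) {
        System.out.println(hashCode(12084L));               // 12084
        System.out.println(hashCode(64533853369376L));      // 1969762001
        System.out.println(hashCode(1.0));                  // 1072693248
        System.out.println(hashCode(4614256650576692846L)); // 807708775
    }
}
```

The last call uses the 64-bit pattern listed for 3.14159 in the table and shows the value left after the overflow is dropped.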

In general, the Summing Components approach can be extended to keys with m components. Let the key be k = (x0, x1, …, xm-1). We may use the following expression as its hash code:

$$\text{hash code} = \sum_{i=0}^{m-1} x_i$$

Examples: We decompose words into characters and compute the sum of their values.

Word     Hash code
temp01   535
temp10   535
stop     454
tops     454

For temp01, the sum is v(t) + v(e) + v(m) + v(p) + v(0) + v(1) = 535.

Note that the words “temp01” and “temp10” have the same hash code, and so do “stop” and “tops”. Therefore this approach is not good for strings.
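A short sketch of this character sum, assuming v(c) is the character's numeric code as returned by charAt. The class name SumHash is chosen for the example.

```java
// Hypothetical sketch: hash a string by summing its character codes.
class SumHash {
    static int hashCode(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i); // v(c): the numeric value of the character
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hashCode("temp01")); // 535
        System.out.println(hashCode("temp10")); // 535
        System.out.println(hashCode("stop"));   // 454
        System.out.println(hashCode("tops"));   // 454
    }
}
```

Because addition ignores order, any rearrangement of the same characters collides.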

Polynomial Hash Codes

In this approach, we choose a constant a ≠ 1 and calculate the integer value

$$\text{hash code} = x_0 a^{m-1} + x_1 a^{m-2} + \cdots + x_{m-2} a + x_{m-1},$$

where the key is k = (x0, x1, …, xm-2, xm-1).

Examples: With a = 9, we have the following values:

Word     Hash code
temp01   7601359
temp10   7601367
stop     94342
tops     94678

Notice that now the hash codes are no longer the same.
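The polynomial can be evaluated without computing powers by using Horner's rule, which is one possible way to code it. In the sketch below the class name PolynomialHash and the parameter a are illustrative; int overflow, if it occurs for long strings, simply wraps around.

```java
// Hypothetical sketch: polynomial hash code evaluated with Horner's rule.
class PolynomialHash {
    // Computes x0*a^(m-1) + x1*a^(m-2) + ... + x(m-2)*a + x(m-1).
    static int hashCode(String s, int a) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * a + s.charAt(i); // shift previous terms up one power of a
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hashCode("temp01", 9)); // 7601359
        System.out.println(hashCode("temp10", 9)); // 7601367
        System.out.println(hashCode("stop", 9));   // 94342
        System.out.println(hashCode("tops", 9));   // 94678
    }
}
```

Java's own String.hashCode() is documented as a polynomial of the same form with the constant 31.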

A carefully chosen value of the constant a can reduce the number of collisions significantly. Good values include 33, 37, 39, and 41 according to some experimental studies.

Cyclic Shift Hash Codes

Example: 5-bit cyclic shift

[Figure: in a 5-bit cyclic shift, the five leading bits 10010 of the bit string are moved from the front to the end.]

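    static int hashCode( String s ) {
        int h = 0;
        for( int i = 0; i < s.length(); i++ ) {
            // 5-bit cyclic shift; >>> shifts in zeros, so the sign bit is not copied
            h = ( h << 5 ) | ( h >>> 27 );
            // add in the next character
            h += ( int )s.charAt( i );
        }
        return h;
    }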

Example: s = “two”

i   Step                                h (binary)
0   shift: 5-bit cyclic shift of h      0
0   add:   h += ( int )s.charAt( i )    1110100
1   shift: 5-bit cyclic shift of h      111010000000
1   add:   h += ( int )s.charAt( i )    111011110111
2   shift: 5-bit cyclic shift of h      11101111011100000
2   add:   h += ( int )s.charAt( i )    11101111101001111
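The same trace can be reproduced with the standard method Integer.rotateLeft, which performs the cyclic shift in one call. This is an alternative way to write the shift, shown as a sketch; the class name CyclicShiftHash is made up for the example.

```java
// Hypothetical sketch: cyclic shift hash code using Integer.rotateLeft.
class CyclicShiftHash {
    static int hashCode(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = Integer.rotateLeft(h, 5); // top 5 bits wrap around to the bottom
            h += s.charAt(i);             // add in the next character
        }
        return h;
    }

    public static void main(String[] args) {
        int h = hashCode("two");
        System.out.println(h);                         // 122703
        System.out.println(Integer.toBinaryString(h)); // 11101111101001111
    }
}
```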

The second step in a hash function is to map the hash code into the range [0, N-1]. There are two popular approaches: the division method and the multiply add and divide (MAD) method.

The Division Method

In this method, the compression map is given by

$$h(k) = k \bmod N$$

Examples: N = 100

hash code   compressed
200         0
205         5
430         30
500         0
505         5

Notice that there are collisions.

N = 101

hash code   compressed
200         99
205         3
430         26
500         96
505         0

In general, if N is a prime number, it helps reduce collisions.
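A minimal sketch of the division method follows. It uses Math.floorMod rather than the % operator because hash codes may be negative and % would then return a negative index; that choice, and the class name DivisionCompression, are assumptions of this example.

```java
// Hypothetical sketch: compression map by the division method.
class DivisionCompression {
    // Maps any int hash code into the range [0, n-1].
    static int compress(int hashCode, int n) {
        return Math.floorMod(hashCode, n);
    }

    public static void main(String[] args) {
        int[] codes = {200, 205, 430, 500, 505};
        for (int code : codes) {
            System.out.println(code + " -> " + compress(code, 100)
                    + " (N = 100), " + compress(code, 101) + " (N = 101)");
        }
    }
}
```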

The second approach is the multiply add and divide (MAD) method. In this method, the mapping is given by

$$h(k) = (ak + b) \bmod N$$

where N is a prime number, and a and b are nonnegative integers chosen at random when the compression function is determined, so that a mod N ≠ 0. This method is more sophisticated and generally works better than the division method.
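The MAD method can be sketched as below. The values a = 3, b = 7, and N = 101 are fixed here only so the output is reproducible; in practice a and b would be drawn at random as described above. The class name MadCompression is illustrative, and long arithmetic is used so that a*k + b does not overflow for small a and b.

```java
// Hypothetical sketch: multiply add and divide (MAD) compression map.
class MadCompression {
    private final long a;
    private final long b;
    private final int n;

    MadCompression(long a, long b, int n) {
        if (a < 0 || b < 0) throw new IllegalArgumentException("a and b must be nonnegative");
        if (a % n == 0) throw new IllegalArgumentException("a mod N must not be 0");
        this.a = a;
        this.b = b;
        this.n = n;
    }

    // h(k) = (a*k + b) mod N, always in the range [0, N-1].
    int compress(int hashCode) {
        return (int) Math.floorMod(a * hashCode + b, (long) n);
    }

    public static void main(String[] args) {
        MadCompression mad = new MadCompression(3, 7, 101); // example values only
        System.out.println(mad.compress(200)); // (600 + 7) mod 101 = 1
        System.out.println(mad.compress(205)); // (615 + 7) mod 101 = 16
    }
}
```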

Collision-Handling Schemes

Recall that if there is no collision, we can store the item (k, e) in the bucket array cell A[h(k)]. However, collisions do occur from time to time. In this case, two different keys, k1 and k2, cause the hash function to return the same value: h(k1) = h(k2). Therefore we cannot store the item directly in A[h(k)].

There are two schemes to handle collisions:

1. Separate Chaining
2. Open Addressing

Separate Chaining

In this approach, what is stored in A[h(k)] is a reference to a sequence Sk rather than the item. In turn, the items whose keys have the same hash function value h(k) are all stored in Sk. The sequence Sk can be implemented as a log file.

Algorithms for fundamental dictionary operations

Algorithm findElement(k):
    Assign the sequence A[h(k)] to a variable B
    if B is empty then
        return NO_SUCH_KEY
    else
        return B.findElement(k)

[Figure: with h( k ) = 4, B = A[h(k)] refers to the sequence S in cell 4, which is searched for the key k.]

Algorithm insertItem(k,e):
    if A[h(k)] is empty then
        Create a new, initially empty, sequence-based dictionary B
        Assign B to A[h(k)]
    else
        Assign A[h(k)] to B
    B.insertItem(k, e)

[Figure: with h( k ) = 4, the item (k, e) is inserted into the sequence S that B refers to in cell 4.]

Algorithm removeElement(k):
    Assign A[h(k)] to B
    if B is empty then
        return NO_SUCH_KEY
    else
        return B.removeElement(k)

[Figure: with h( k ) = 4, B = A[h(k)] refers to the sequence S in cell 4, from which the item with key k is removed.]

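Example: Consider a dictionary of size 13. The hash function is h(k) = k mod 13. There are 10 items in the dictionary:

k    h(k)
10   10
12   12
18   5
25   12
28   2
36   10
38   12
41   2
54   2
90   12

[Figure: the bucket array A with cells 0 to 12. Cell 2 chains the keys 28, 41, and 54; cell 5 holds 18; cell 10 chains 10 and 36; cell 12 chains 12, 25, 38, and 90. All other cells are empty.]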
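The three algorithms and the example above can be put together in a small sketch. It assumes int keys, String elements, h(k) = k mod N, ArrayList-based chains, and null in place of NO_SUCH_KEY; the class name ChainedHashTable and the helper chainLength are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: hash table with separate chaining.
class ChainedHashTable {
    private static final class Item {
        final int key;
        final String element;
        Item(int key, String element) { this.key = key; this.element = element; }
    }

    private final List<List<Item>> buckets = new ArrayList<>();
    private final int capacity;

    ChainedHashTable(int capacity) {
        this.capacity = capacity;
        for (int i = 0; i < capacity; i++) buckets.add(null); // all cells start empty
    }

    private int h(int key) { return Math.floorMod(key, capacity); }

    void insertItem(int key, String element) {
        List<Item> b = buckets.get(h(key));
        if (b == null) {                 // empty cell: create its sequence first
            b = new ArrayList<>();
            buckets.set(h(key), b);
        }
        b.add(new Item(key, element));
    }

    // Returns null where the slides would return NO_SUCH_KEY.
    String findElement(int key) {
        List<Item> b = buckets.get(h(key));
        if (b == null) return null;
        for (Item item : b) {
            if (item.key == key) return item.element;
        }
        return null;
    }

    String removeElement(int key) {
        List<Item> b = buckets.get(h(key));
        if (b == null) return null;
        for (int i = 0; i < b.size(); i++) {
            if (b.get(i).key == key) return b.remove(i).element;
        }
        return null;
    }

    // Number of items chained in one cell.
    int chainLength(int cell) {
        List<Item> b = buckets.get(cell);
        return b == null ? 0 : b.size();
    }

    public static void main(String[] args) {
        ChainedHashTable table = new ChainedHashTable(13);
        int[] keys = {10, 12, 18, 25, 28, 36, 38, 41, 54, 90};
        for (int k : keys) table.insertItem(k, "e" + k);
        System.out.println(table.chainLength(12)); // 4
        System.out.println(table.findElement(54)); // e54
    }
}
```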

Open Addressing

In separate chaining, extra memory blocks have to be allocated for the sequences of the buckets. To save memory space, items can be stored directly in a bucket while collisions are handled by other means. These are called open addressing schemes.

Linear Probing - A simple open addressing scheme

Assume that we want to insert an item (k, e) and i = h(k). The operation goes like this: If the bucket A[i] is not empty, we try the next bucket A[(i+1) mod N]. If this is not empty either, we try A[(i+2) mod N], and so on, until we find an empty bucket.

Example:

[Figure: a bucket array with cells 0 to 10 and h(k) = k mod 11, holding the keys 13, 26, 5, 37, 16, and 21; a new element with key = 15 is to be inserted.]

The findElement operation needs to search consecutive buckets, starting from A[h(k)], until either the item or an empty bucket is found.

[Figure: the bucket array with cells 0 to 10 and h(k) = k mod 11, now also holding the key 15; find the item (15, 15).]

The removeElement operation is more complicated. When an item is removed, we have to shift items in the bucket array to fill the empty spot while leaving alone those that are in their correct locations.

Example:

[Figure: the bucket array with cells 0 to 10 and h(k) = k mod 11; remove the item (37, 37).]

We have to shift (15,15) while leaving (18,18) alone:

[Figure: the bucket array before and after the shift.]

To avoid complications like this, we can use a special item called REMOVED_ITEM to replace the removed item.

[Figure: the bucket array before and after the removal; the cell of the removed item now holds REMOVED_ITEM.]

On the other hand, the findElement operation should skip this item when it searches.

Example:

[Figure: the bucket array with cells 0 to 10 and h(k) = k mod 11, containing a REMOVED_ITEM cell; find the item (15, 15).]

And the insertItem operation should replace it with the new item.

Example:

[Figure: the bucket array with cells 0 to 10 and h(k) = k mod 11, containing a REMOVED_ITEM cell; a new element with key = 15 is inserted.]
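The three linear probing operations, with REMOVED_ITEM as the marker, can be sketched as follows. The sketch stores keys only (as in the (k, k) items of the examples), uses h(k) = k mod N, and returns cell indices so the probing is visible; the class name LinearProbingTable and these return conventions are assumptions of the example.

```java
// Hypothetical sketch: open addressing with linear probing and REMOVED_ITEM.
class LinearProbingTable {
    private static final Object REMOVED_ITEM = new Object(); // special marker
    private final Object[] cells;                            // null = empty bucket

    LinearProbingTable(int capacity) { cells = new Object[capacity]; }

    private int h(int key) { return Math.floorMod(key, cells.length); }

    // Returns the cell used, or -1 if no free cell exists.
    int insertItem(int key) {
        int i = h(key);
        for (int j = 0; j < cells.length; j++) {
            int cell = (i + j) % cells.length;
            // An empty bucket or a REMOVED_ITEM marker can take the new item.
            if (cells[cell] == null || cells[cell] == REMOVED_ITEM) {
                cells[cell] = key;
                return cell;
            }
        }
        return -1;
    }

    // Returns the cell holding the key, or -1 if the key is absent.
    int findElement(int key) {
        int i = h(key);
        for (int j = 0; j < cells.length; j++) {
            int cell = (i + j) % cells.length;
            if (cells[cell] == null) return -1;        // empty bucket ends the search
            if (cells[cell].equals(key)) return cell;  // the marker never equals a key
        }
        return -1;
    }

    // Replaces the item with REMOVED_ITEM instead of shifting other items.
    boolean removeElement(int key) {
        int cell = findElement(key);
        if (cell < 0) return false;
        cells[cell] = REMOVED_ITEM;
        return true;
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(11);
        for (int k : new int[] {13, 26, 5, 37, 16, 21}) t.insertItem(k);
        System.out.println(t.insertItem(15));  // 8
        t.removeElement(37);
        System.out.println(t.findElement(15)); // 8
    }
}
```

The marker keeps later searches from stopping early at the vacated cell, at the cost of cells that look occupied until they are reused.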

One of the disadvantages of linear probing is that it tends to cluster the items of the dictionary into contiguous runs. This causes searches to slow down quite a bit. To avoid this, we can use quadratic probing.

Quadratic Probing

Rather than searching the buckets A[(i + j) mod N] for j = 0, 1, 2, …, we search the buckets A[(i + j^2) mod N].

Example: An insertItem operation

[Figure: a bucket array with cells 0 to 10 and h(k) = k mod 11, holding the keys 13, 26, 5, 37, 16, and 21; a new element with key = 15 is to be inserted.]

Though quadratic probing avoids the clustering problems that occur with linear probing, it has its own clustering problem, called secondary clustering. This may leave it unable to find an empty bucket even when empty buckets are available.
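A small sketch of the quadratic probe sequence follows. The cell positions used in main (13 in cell 2, 26 in cell 4, 5 in cell 5, 37 in cell 6, 16 in cell 7, 21 in cell 10) are an assumption about how the example array was filled; the class and method names are illustrative.

```java
// Hypothetical sketch: finding a free cell with quadratic probing.
class QuadraticProbing {
    // Probes cells (i + j*j) mod N for j = 0, 1, 2, ... and returns the first
    // empty one, or -1 if none of the probed cells is empty.
    static int findEmptyCell(Integer[] table, int key) {
        int n = table.length;
        int i = Math.floorMod(key, n);
        for (int j = 0; j < n; j++) {
            int cell = (i + j * j) % n;
            if (table[cell] == null) return cell;
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[11];
        table[2] = 13; table[4] = 26; table[5] = 5;
        table[6] = 37; table[7] = 16; table[10] = 21;
        // Key 15 probes cells 4, 5, then 8.
        System.out.println(findEmptyCell(table, 15)); // 8

        // For i = 4 and N = 11 the probes only ever reach cells 4, 5, 8, 2, 9, 7.
        Integer[] unlucky = new Integer[11];
        for (int cell : new int[] {2, 4, 5, 7, 8, 9}) unlucky[cell] = 0;
        System.out.println(findEmptyCell(unlucky, 15)); // -1, although 5 cells are free
    }
}
```

The second call shows the failure described above: the probe sequence revisits the same six cells and never reaches the empty ones.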

Double Hashing

In this approach, we search the buckets A[(i + f(j)) mod N] for j = 0, 1, 2, …, where f(j) = j*h’(k) and h’(k) is the secondary hash function. The secondary hash function is not allowed to evaluate to zero. A common choice is

h’(k) = q – (k mod q)

where q < N is some prime number, that is, a number divisible only by 1 and itself.

Example:

$$h(k) = k \bmod 11, \qquad f(j) = j \cdot h'(k), \qquad h'(k) = 7 - (k \bmod 7)$$

For k = 15, i = h(k) = 4 and h’(k) = 6.

j   f(j)   i + f(j)
0   0      4
1   6      10
2   12     16
3   18     22
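The probe sequence in this example can be computed with the short sketch below, which also reduces i + f(j) modulo N to give the actual cell indices. The class name DoubleHashing and the method names are chosen for illustration.

```java
// Hypothetical sketch: probe sequence for double hashing.
class DoubleHashing {
    static int h(int k, int n) { return Math.floorMod(k, n); }

    // Secondary hash function h'(k) = q - (k mod q); never zero.
    static int hPrime(int k, int q) { return q - Math.floorMod(k, q); }

    // First `count` cells probed for key k: (i + j*h'(k)) mod N.
    static int[] probeSequence(int k, int n, int q, int count) {
        int[] cells = new int[count];
        int i = h(k, n);
        for (int j = 0; j < count; j++) {
            cells[j] = (i + j * hPrime(k, q)) % n;
        }
        return cells;
    }

    public static void main(String[] args) {
        // k = 15, N = 11, q = 7: i + f(j) = 4, 10, 16, 22
        for (int cell : probeSequence(15, 11, 7, 4)) {
            System.out.println(cell); // 4, 10, 5, 0
        }
    }
}
```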

The Ordered Dictionary ADT

Recall that keys in a dictionary may not have a total order relation. On the other hand, if a total order relation on the keys is defined, the dictionary is an ordered dictionary.

Example: In an ordered dictionary, we have a few elements: {(1,E), (2,C), (5,A), (7,B), (8,D)}. How can we find the element with a key that is closest to 3?

In addition to the methods for the dictionary abstract data type we have already learned, such as findElement, insertItem, and removeElement, an ordered dictionary also supports the following methods:

closestKeyBefore(k): Return the key of the item with the largest key less than or equal to k.
    Input: Object (key); Output: Object (key)

closestElemBefore(k): Return the element of the item with the largest key less than or equal to k.
    Input: Object (key); Output: Object (element)

closestKeyAfter(k): Return the key of the item with the smallest key greater than or equal to k.
    Input: Object (key); Output: Object (key)

closestElemAfter(k): Return the element of the item with the smallest key greater than or equal to k.
    Input: Object (key); Output: Object (element)

Example: The table shows the effect of a series of operations on an ordered dictionary with five elements: {(1,E), (2,C), (5,A), (7,B), (8,D)}

Operation              Output
closestKeyBefore(3)    2
closestKeyBefore(7)    7
closestKeyBefore(6)    5
closestElemBefore(2)   C
closestElemBefore(9)   D
closestElemBefore(0)   NO_SUCH_KEY
closestKeyAfter(3)     5
closestKeyAfter(7)     7
closestKeyAfter(6)     7
closestElemAfter(2)    C
closestElemAfter(9)    NO_SUCH_KEY
closestElemAfter(0)    E
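For comparison, java.util.TreeMap offers operations with the same meaning, so the table can be reproduced with it. This is a stand-in from the standard library, not the ordered dictionary ADT of these slides: floorKey and floorEntry play the role of the "Before" methods, ceilingKey and ceilingEntry the role of the "After" methods, and null takes the place of NO_SUCH_KEY. The class name OrderedDictionaryDemo is made up for the example.

```java
import java.util.TreeMap;

// Hypothetical demo: the ordered dictionary example using java.util.TreeMap.
class OrderedDictionaryDemo {
    static TreeMap<Integer, String> sample() {
        TreeMap<Integer, String> d = new TreeMap<>();
        d.put(1, "E");
        d.put(2, "C");
        d.put(5, "A");
        d.put(7, "B");
        d.put(8, "D");
        return d;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> d = sample();
        System.out.println(d.floorKey(3));                 // 2  (closestKeyBefore)
        System.out.println(d.floorEntry(9).getValue());    // D  (closestElemBefore)
        System.out.println(d.floorEntry(0));               // null (NO_SUCH_KEY)
        System.out.println(d.ceilingKey(6));               // 7  (closestKeyAfter)
        System.out.println(d.ceilingEntry(0).getValue());  // E  (closestElemAfter)
        System.out.println(d.ceilingEntry(9));             // null (NO_SUCH_KEY)
    }
}
```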
