data structures1 overview(1) o(1) access to files variation of the relative file record number for a...
TRANSCRIPT
Data Structures 1
Overview(1)
O(1) access to files Variation of the relative file Record number for a record is not arbitrary;
rather, it is obtained by a hashing function H applied to the primary key, H(key)
Record numbers generated should be uniformly and randomly distributed such that 0 < H(key) < N
Overview
Data Structures 2
Overview(2)
A hash function is like a block box that produces an address every time you drop in a key
All parts of the key should be used by the hashing function H so that a lot of records with similar keys do not all hash to the same location
Given two random keys X, Y and N slots, the probability H(X)=H(Y) is 1/N; in this case, X and Y are called synonyms and a collision occurs
Overview
Data Structures 3
Introduction
Hash function : h(k) Transforms a key K into an address
Hash vs other index Sequential search : O(N) Binary search : O(log2N) B(B+) Tree index : O(logkN) where k records i
n an index node Hash : O(1)
11.1 Introduction
Data Structures 4
A Simple Hashing Scheme(1)
NameASCII Code for
First Two LettersProduct Home
Address
BALL
LOWELL
TREE
66 65
76 96
84 82
66 X 65 = 4,290
76 X 96 = 6,004
84 X 82 = 6,888
4,290
6,004
6,888
11.1 Introduction
Data Structures 5
A Simple Hashing Scheme(2)
LOWELL’shome
address
K=LOWELL
h(K)
Address
Address Record
key 1234
0
56
... ...
LOWELL . . .4
11.1 Introduction
Data Structures 6
Idea behind Hash-based Files
Record with hash key i is stored in node i All record with hash key h are stored in node h Primary blocks of data level nodes are stored
sequentially Contents of the root node can be expressed by a
simple function: Address of data level node for record with primary key k = address of node 0 + H(k)
In literature on hash-based files, primary blocks of data level nodes are called buckets
11.1 Introduction
Data Structures 7
e.g. Hash-based File
0 1 2 3 4 5 6
root node
70 15 50 1 30 51 10 45 3 11 60 61124 40 20 55
57 14 15
11.1 Introduction
Data Structures 8
Hashing(1)
Hashing Functions :
Consider a primary key consisting of
a string of 12 letters and
a file with 100,000 slots.
Since 2612 >> 105,
So synonyms (collisions) are inevitable!
11.1 Introduction
Data Structures 9
Hashing(2) Possible means of hashing
First, because 12 characters = 12 bytes = 3(32-bit) words, partition into 3 words and perform folding as follows : either
(a) add modulo 232, or (b) combine using exclusive OR, or (c) invent your own method
Next, let R = V mod N R => Record-number V => Value obtained in the above N => Number of Slots so 0< R < N
If N has many small factors, a poor distribution leading to many collisions can occur. Normally, N is chosen to be a primary number
11.1 Introduction
Data Structures 10
If M = number of records, N = number of available slots, P(k) = probability of k records hashing to the same slot
then P(k) = where f is the loading factor M/N
As f --> 1, we know that p(0)->1/e and p(1) -> 1/e. The other (1-1/e) of the records must hash into (1-2/e) of
the slots, for an average of 2.4 slot. So many synonyms!!
Hashing(3)
MK
1N x 1
N1- ~~
ek*k!
fk
11.1 Introduction
Data Structures 11
Collision
Collision Situation in which a record is hashed to an
address that does not have sufficient room to store the record
Perfect hashing : impossible! Different key, same hash value (Different record, same address)
Solutions Spread out the records Use extra memory Put more than one record at a single address
11.1 Introduction
Data Structures 12
A Simple Hashing Algorithm
Step 1. Represent the key in numerical
form If the key is a string : take the ASCII code If the key is a number : nothing to be done
e.g.. LOWELL = 76 79 87 69 76 76 32 32 32 32 32 32 L O W E L L ( 6 blanks )
11.2 A Simple Hashing Algorithm
Data Structures 13
Step 2 . Fold and Add Fold
76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32
Add parts into one integer
(Suppose we use 15bit integer expression, 32767 is limit)
7679 + 8769 + 7676 + 3232 + 3232 + 3232 = 33820 > 32767 (overflow!)
Largest addend : 9090 ( ‘ZZ’ ) Largest allowable result : 32767 - 9090 = 19937 Ensure no intermediate sum exceeds using ‘mod’
( 7679 + 8769 ) mod 19937 = 16448 (16448 + 7676 ) mod 19937 = 4187 ( 4187 + 3232 ) mod 19937 = 7419 ( 7419 + 3232 ) mod 19937 = 10651 ( 10651 + 3232) mod 19937 = 13883
11.2 A Simple Hashing Algorithm
Data Structures 14
a = s mod n a : home address s : the sum produced in step 2 n : the number of addresses in the file e.g.. a = 13883 mod 100 = 83 (s = 13883, n = 100)
A prime number is usually used for the divisor because primes tend to distribute remainders much more uniformly than do nonprimes
So, we chose a prime number as close as possible to the desired size of the address space (eg, a file with 75 records, a good choice for n is 101, then the file will become 74.3 percent full
Step 3. Divide by size of the address space
11.2 A Simple Hashing Algorithm
Data Structures 15
Hashing Functions and Record Distributions
Distributing records among addressAcceptable
ABCDEFG
123456789
10
Record Address
Best
ABCDEFG
123456789
10
Record Address
ABCDEFG
123456789
10
Record Address
Worst
11.3 Hashing Functions and Record Distributions
Data Structures 16
Some other hashing methods
Better-than-random Examine keys for a pattern Fold parts of the key Divide the key by a number
When the better-than-random methods do not work ----- randomize! Square the key and take the middle Radix transformation
11.3 Hashing Functions and Record Distributions
Data Structures 17
How Much Extra Memory Should Be Used?
Packing Density = # of records# of spaces
rN
=
The more records are packed, the more likely a collision will occur
11.4 How Much Extra Memory Should Be Used?
Data Structures 18
Poisson Distribution
p(x) = (r/N)xe -r/N
x!(poisson distribution)
N = the number of available addressesr = the number of records to be storedx = the number of records assigned to a given address
p(x) : the probability that a given address will have x records assigned to it after the hashing function has been applied to all n records
( x records 가 collision 할 확률 )
11.4 How Much Extra Memory Should Be Used?
Data Structures 19
Predicting Collisions for Different Packing Densities
# of addresses no record assigned : N X P(0)
# of addresses one record assigned : N X
P(1)
# of addresses more than two assigned :
N X [P(2) + P(3) + P(4) + ...]
# of overflows : 1 X NP(2) + 2 X NP(3) +...
Percentage of overflow records :
N
1 X NP(2) + 2 X NP(3) + 3 X NP(4) ...X 100
11.4 How Much Extra Memory Should Be Used?
Data Structures 20
The larger space, the less overflows
Packing Density(%)
Synonym as % of records
102030405060708090100
4.89.413.617.621.424.828.131.234.136.8
11.4 How Much Extra Memory Should Be Used?
N addresses
r records
r/N
Data Structures 21
Collision Resolution by Progressive Overflow
Progressive overflow (= linear probing)
Insert a new record 1. Take home address if empty 2. Otherwise, next several addresses are
searched in sequence, until an empty one is found
3. If no more space -- wrapping around
11.5 Collision Resolution by Progressive Overflow
Data Structures 22
Progressive Overflow(Cont’d)
York’s homeaddress (busy)
KeyYorkHashRoutine Address
0
1
5
6
7
8
9
2nd try (busy)3rd try (busy)4th try (open)York’s actual address
....
........
Novak. . .
Rosen. . .
Jasper. . .
Morely. . .
....6
11.5 Collision Resolution by Progressive Overflow
Data Structures 23
Progressive Overflow(Cont'd)
KeyBlue
HashRoutine Address
0
1
2
....
........
....
99
97
98
99 Jello...
Wrapping around
11.5 Collision Resolution by Progressive Overflow
Data Structures 24
Progressive Overflow(Cont'd)
Search a record with a hash function value k: from home address k, look at successive records, until Found, or An open address is encountered
Worst case When the record does not exist and the file is full
11.5 Collision Resolution by Progressive Overflow
Data Structures 25
Progressive Overflow(Cont'd)
- Search length : # of accesses required to retrieve a record (from secondary memory)
202122232425
Adams. . .Bates. . .Cole. . .Dean. . .Evans. . .
...
......
ActualAddress
Home Address
2021212220
Search length
11225
Average search length = total search lengthtotal # of records
= 2.2
11.5 Collision Resolution by Progressive Overflow
Data Structures 26
Progressive Overflow(Cont'd)
With perfect hashing function : average search length = 1
Average search length of no greater than 2.0 are generally considered acceptable
Averagesearchlength
Packing density20 60 80 10040
5
4
3
2
1
11.5 Collision Resolution by Progressive Overflow
Data Structures 27
Storing More Than One Record per Address : Buckets
Bucket : a block of records sharing the same address
Key
GreenHallJerkKingLandMarxNutt
Home Address
30 30 32 33 33 33 33
Green ... Hall ...
Jenks ...King... Land... Marks...
30313233
Bucket contents
(Nutt... is an overflow record)
Bucketaddress
11.6 Storing More Than One Record per Address : Buckets
Data Structures 28
Effects of Buckets on Performance
N : # of addresses b : # of records fit in a bucket bN : # of available locations for records Packing density = r/bN
# of overflow recordsN X [ 1XP(b+1) + 2XP(b+2) + 3XP(b+3)...]
As the bucket size gets larger, performance continues to improve
11.6 Storing More Than One Record per Address : Buckets
Data Structures 29
Bucket Implementation
A full bucket
An empty bucket
Two entries
/ / / / / / / / / / / / / / /
ARNSWORTHJONES / / / / / / / / / / / / / / /
ARNSWORTHJONES STOCKTON BRICE TROOP
0
2
5
/ / / / // / / / /
Collision counter =< bucket size
11.6 Storing More Than One Record per Address : Buckets
Data Structures 30
Bucket Implementation(Cont'd)
Initializing and Loading Creating empty space Use hash values and find the bucket to
store If the home bucket is full, continue to look
at successive buckets Problems when
No empty space exists Duplicate keys occur
11.6 Storing More Than One Record per Address : Buckets
Data Structures 31
Making Deletions
The slot freed by the deletion hinders(disturb) later searches Use tombstones and reuse the freed slots
Record
AdamsJonesMorrisSmith
Homeaddress
5 6 6 5
Adams...Jones...Morris...Smith...
5
6
7
8
Adams...Jones...######Smith...
5
6
7
8
A tombstone for Morris
Delete Morris
11.7 Making Deletions
Data Structures 32
Other Collision Resolution Techniques
Double hashing : avoid clustering with a second hash function for overflow records
Chained progressive overflow : each home address contains a pointer to the record with the same address
Chaining with a separate overflow area : move all overflow records to a separate overflow area
Scatter tables : Hash file contains only pointers to records (like indexing)
11.8 Other Collision Resolution Techniques
Data Structures 33
Overflow File
When building the file, if a collision occurs, place the new synonym into a separate area of the file called the overflow section
This method is not recommended; if there is a high load factor,
there will either be overflow from this overflow section
or it will be organized sequentially and performance suffers;
if there is a low load factor, much space is wasted
11.8 Other Collision Resolution Techniques
Data Structures 34
Linear Probing(1)
When a synonym is identified, search forward from the address given by the hash function (the natural address) until an empty slot is located, and store this record there
This is an example of open addressing (examining a predictable sequence of slots for an empty one)
11.8 Other Collision Resolution Techniques
Data Structures 35
Linear Probing(2)A S E A R C H I N G E X A M P L E
S
A
A
C
A
E
E
G
H
I
X
E
L
M
N
P
R
key : hash : 1 0 5 1 18 3 8 9 14 7 5 5 1 13 16 12 5
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
inse
rtio
n se
qu
ence
Memory SpaceA
AA C
E
E E G H I
E E G H I X
11.
8 O
ther
Col
lisio
n R
esol
utio
n T
echn
ique
s
Data Structures 36
Rehashing(1)
Another form of open addressing In linear probing, if synonym occurred, incremented r by 1
and searched next location In rehashing, use a second hash function for the
displacement: D = (FOLD(key) mod P) + 1, where P < N is another prime number
This method has the advantage of avoiding congestion, because each synonym under the first hash function likely uses a different displacement D, and this examines a different sequence of slots
11.8 Other Collision Resolution Techniques
Data Structures 37
Rehashing(where P=3)(2)A S E A R C H I N G E X A M P L Ekey :
hash : 1 0 5 1 18 3 8 9 14 7 5 5 1 13 16 12 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
S
A
C
E
G
H
I
L
M
N
P
R
inse
rtio
n se
qu
ence
Memory SpaceA
A
E
E
G
H
A
A A
E
E
E
N X
H
11 .
8 O
t her
Co l
lisio
n R
esol
u tio
n T
echn
ique
s
Data Structures 38
Chaining without Replacement(1)
Uses pointers to build linked lists within the file; each linked list contains each set of synonyms; the head of the list is the record at the natural address of
these synonyms When a record is added whose natural address is occupied,
it is added to the list whose head is at the natural address Linear probing usually used to find where to put a new
synonym, although rehashing could be used as well
11.8 Other Collision Resolution Techniques
Data Structures 39
Chaining without Replacement(2)
A problem is that linked lists can coalesce: Suppose that H(R1) = H(R2) = H(R4) = i and H(R3) = H(R5) = i+1, and records are added in the order R1, R2, R3, R4, R5.
Then the lists for natural addresses i and i+1 coalesce. Periodic reorganization shortens such lists.
Let FWD and BWD be forward and backward
pointers along these chains (doubly-linked)
11.8 Other Collision Resolution Techniques
Data Structures 40
Chaining without Replacement(3)
R1
R2
R3
R4
R5
H
H
i
i+1
11.8 Other Collision Resolution Techniques
Data Structures 41
Chaining with Replacement(4)
Eliminates problem with deletion that caused abandonment to be necessary
Further reduces search lengths When inserting a new record, if the slot at the natural address is occupied by a record
for which it is not the natural address, then record is relocated so the new record may be
replaced at its natural address Synonym chains can never coalesce, so a record can be deleted even if is the head of a chain, simply by moving the second record on the chain to its
natural address ( ABANDON thus is no longer necessary )
11.8 Other Collision Resolution Techniques
Data Structures 42
Chaining with Replacement(5)
R1R2 H
R1R3 HR2 H
ii+1
R1R3R2
i
i+1
R4R5
after R1, R2 added after R3 hashes to i+1 finally
HH
2 chains : R1 - R2 -R4 R3 - R5
11.8 Other Collision Resolution Techniques
Data Structures 43
Patterns of Record Access
Pareto Principle ( 80/20 Rule of Thumb): 80 % of the accesses are performed on 20 % of the records!
The concepts of “the Vital Few and the Trivial Many” 20 % of the fisherman catch 80 % of the fish 20 % of the burglars steal 80 % of the loot
If we know the patterns of record access ahead, we can do many intelligent and effective things!
Sometimes we can know or guess the access patterns Very useful hints for file syetms or DBMSs Intelligent placement of records
fast accesses less collisions