CpSc 3220 File and Database Processing
![Page 1: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/1.jpg)
CpSc 3220 File and Database Processing
Hashing
![Page 2: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/2.jpg)
Exercise – Build a B+-Tree
• Construct an order-4 B+-tree for the following set of key values: (2, 3, 5, 7, 11, 17, 9, 6, 29, and 4)
• Assume the tree is initially empty and values are added in ascending order.
• Now delete keys 2, 5, and 17.
![Page 3: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/3.jpg)
Objectives
• Survey Hashing Concepts
• Investigate Hashing Algorithms
• Study Collision Reduction
• Analyze Performance
• Investigate File Deterioration
• Look at Patterns of Access
![Page 4: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/4.jpg)
Schematic View of Hash File
[Figure: schematic of a hash file. Key x is passed through the hash function, which selects the record space that holds the record for that key.]
![Page 5: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/5.jpg)
Basic Hashing Concepts
• A hash file contains a fixed number of record spaces.
• Each record space is of a fixed size.
• A hash function determines the address of a record space for a given key.
• A hash function may give the same address for two different records.
• A single address for different keys is called a collision.
• Different keys that give identical addresses are called synonyms.
• A hash function that gives no collisions is called a perfect hash function.
![Page 6: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/6.jpg)
Objectives for a Hash File Package
• Keep collisions ‘low’
  – Spread out (distribute) records over address space
  – Use extra memory (increase address space)
  – Put more than one record per address
• Handle collisions efficiently
![Page 7: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/7.jpg)
Outline for a Simple Hashing Algorithm
1. Put Key in numerical form
2. Fold and Add to reduce numerical form to ‘integer’ size
3. Divide by the size of the address space and use remainder as RRN address (offset) of Key
![Page 8: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/8.jpg)
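Simple Hash Function (when Key is an alphanumeric string)

    #include <string>

    const int FILE_SIZE = 101;  // number of record spaces; 101 is an assumed example value

    int Hash(std::string key) {
        int sum = 0;
        if (key.length() % 2 == 1)
            key += ' ';  // pad with a blank to make the length even
        // Fold the key two characters at a time and add, taking the remainder
        // at each step so the sum stays within integer range.
        for (std::size_t j = 0; j < key.length(); j += 2)
            sum = (sum + 256 * (unsigned char)key[j] + (unsigned char)key[j + 1]) % FILE_SIZE;
        return sum;
    }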
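One property of fold-and-add is easy to miss: because addition is commutative, keys built from the same character pairs in a different order always hash to the same address. The sketch below restates the slide's function with the file size passed as a parameter; the name `foldAndAddHash` and the `fileSize` parameter are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>

// Fold-and-add hash in the style of the slide's Hash function.
// fileSize (the number of record spaces) is passed explicitly here, which is
// an assumption for illustration; the slide uses a FILE_SIZE constant.
int foldAndAddHash(std::string key, int fileSize) {
    if (key.size() % 2 == 1) {
        key += ' ';  // pad with a blank so characters can be taken in pairs
    }
    int sum = 0;
    for (std::size_t j = 0; j < key.size(); j += 2) {
        int pair = 256 * static_cast<unsigned char>(key[j])
                 + static_cast<unsigned char>(key[j + 1]);
        sum = (sum + pair) % fileSize;  // reduce at each step to keep sum small
    }
    return sum;
}
```

With this sketch, `foldAndAddHash("abcd", 101)` and `foldAndAddHash("cdab", 101)` return the same address, so the two keys are synonyms under this function for any file size.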
![Page 9: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/9.jpg)
Hash Function Distribution
• Uniform (Perfect)
• Random
• Worse than random

We will look at random distributions.
![Page 10: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/10.jpg)
Predicting Record Distribution
If r records are distributed randomly into N spaces, the probability that a given address will have exactly x records assigned to it is

$$p(x) = \frac{r!}{(r-x)!\,x!}\left(1-\frac{1}{N}\right)^{r-x}\left(\frac{1}{N}\right)^{x}$$

• p(0): probability that an address is not used
• p(1): probability that an address is assigned exactly one record (no collision occurs)
• p(2): probability that an address is assigned two records (1 collision occurs)
• etc.

Difficult to compute for large values of r and N.
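The formula can still be evaluated numerically if the factorials are handled in logarithmic form. The following is a minimal sketch under that assumption; the function name `binomialProbability` is hypothetical, and `std::lgamma` (log of the gamma function, where `lgamma(n + 1) = ln(n!)`) stands in for the factorials.

```cpp
#include <cmath>

// Probability that a given address receives exactly x of r records when the
// records are distributed randomly over N addresses (the binomial formula).
// Factorials are computed as logarithms so that large r does not overflow.
double binomialProbability(int r, int N, int x) {
    if (x < 0 || x > r) return 0.0;
    if (N == 1) return x == r ? 1.0 : 0.0;  // all records land on the only address
    double logChoose = std::lgamma(r + 1.0) - std::lgamma(r - x + 1.0)
                     - std::lgamma(x + 1.0);
    double logP = logChoose
                + (r - x) * std::log(1.0 - 1.0 / N)
                + x * std::log(1.0 / N);
    return std::exp(logP);
}
```

For two records and two addresses, this gives p(0) = 0.25, p(1) = 0.5, and p(2) = 0.25, which can be checked by listing the four equally likely placements.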
![Page 11: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/11.jpg)
Poisson’s Function
For large values of r and N, p(x) can be approximated by this function:

$$p(x) = \frac{(r/N)^{x}\, e^{-(r/N)}}{x!}$$

The value r/N is the ratio of the number of records to the number of address spaces. If only one record is placed in each space, it is a measure of the fraction of storage space that will be used (the packing density).
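As a rough illustration of how the Poisson function is used, the sketch below evaluates p(x) and scales it by N to estimate how many addresses receive exactly x records. The function names are hypothetical, and r > 0 and N > 0 are assumed.

```cpp
#include <cmath>

// Poisson approximation to the probability that an address receives exactly
// x records, given r records and N addresses.
double poissonProbability(double r, double N, int x) {
    double ratio = r / N;
    // (r/N)^x * e^(-(r/N)) / x!, with x! computed as tgamma(x + 1)
    return std::pow(ratio, x) * std::exp(-ratio) / std::tgamma(x + 1.0);
}

// Expected number of addresses (out of N) with exactly x records assigned.
double expectedAddresses(double r, double N, int x) {
    return N * poissonProbability(r, N, x);
}
```

For example, with r = N = 1000 the ratio r/N is 1, so p(0) = e^(-1) and roughly 368 of the 1000 addresses would be expected to stay unused.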
![Page 12: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/12.jpg)
From Page 484 of File Structures by Folk, Zoellick, and Riccardi
![Page 13: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/13.jpg)
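Collision Resolution Using Progressive Overflow (Linear Probing)
[Figure: schematic of progressive overflow. Key3 is hashed into a file whose record spaces already hold the records for Key0, Key1, and Key2.]

$H_i = (\text{hash}(\text{key}) + i) \bmod \text{TableSize}$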
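The probing rule can be sketched as a small in-memory table. This is a minimal illustration, not a file-based implementation: the struct name, the `-1` marker for a free space, and the use of `key % TableSize` as the hash function are all assumptions, and keys are taken to be non-negative integers.

```cpp
#include <vector>

const int EMPTY_SPACE = -1;  // marks a free record space (assumed convention)

struct ProbingTable {
    std::vector<int> slots;

    explicit ProbingTable(int tableSize) : slots(tableSize, EMPTY_SPACE) {}

    int tableSize() const { return static_cast<int>(slots.size()); }
    int hash(int key) const { return key % tableSize(); }  // assumed hash function

    // Stores key at the first free address H_i = (hash(key) + i) mod TableSize.
    // Returns the address used, or -1 if every space is occupied.
    int insert(int key) {
        for (int i = 0; i < tableSize(); ++i) {
            int address = (hash(key) + i) % tableSize();
            if (slots[address] == EMPTY_SPACE) {
                slots[address] = key;
                return address;
            }
        }
        return -1;
    }

    // Returns the number of probes needed to find key, or 0 if it is absent.
    int probesToFind(int key) const {
        for (int i = 0; i < tableSize(); ++i) {
            int address = (hash(key) + i) % tableSize();
            if (slots[address] == EMPTY_SPACE) return 0;  // a free space ends the search
            if (slots[address] == key) return i + 1;
        }
        return 0;
    }
};
```

In a table of 7 spaces, inserting 10, 17, and 24 (all with home address 3) places them at addresses 3, 4, and 5, so finding 24 afterwards takes three probes.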
![Page 14: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/14.jpg)
Average search length (ASL) = (total number of probes) / (number of records)
![Page 15: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/15.jpg)
Address Spaces Can Hold More Than One Record
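[Figure: addresses that each hold more than one record space, with a counter of the records stored at each address (2, 1, 2, 0, 2, 1, 0) and the stored keys a, r, k, x, w, d, b, and t.]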
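Packing Density = r/(bN)
Address Density = r/N
where r is the number of records, N is the number of addresses, and b is the number of records each address can hold.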
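The effect of holding more than one record per address can be estimated with the Poisson function, using the address density r/N as its ratio. The sketch below is one way to do that, assuming a random hash function; both function names are hypothetical, and r > 0 is assumed.

```cpp
#include <cmath>

// Packing density for r records stored in N addresses that hold b records each.
double packingDensity(double r, double N, int b) {
    return r / (b * N);
}

// Estimated fraction of the r records that do not fit at their home address,
// assuming records are distributed randomly so the Poisson function applies.
double expectedOverflowFraction(double r, double N, int b) {
    double density = r / N;         // address density: average records per address
    double p = std::exp(-density);  // p(0)
    double cumulative = p;          // running total p(0) + ... + p(x)
    double storedAtHome = 0.0;      // expected records kept at home per address
    for (int x = 1; x <= b; ++x) {
        p *= density / x;           // p(x) computed from p(x - 1)
        cumulative += p;
        storedAtHome += x * p;      // an address assigned x <= b records keeps all x
    }
    storedAtHome += b * (1.0 - cumulative);  // an address assigned more keeps only b
    return (density - storedAtHome) / density;
}
```

Under these assumptions, 750 records in 1000 addresses of one record each and 750 records in 500 addresses of two records each both have a packing density of 0.75, but the estimated overflow fraction is lower for the second layout.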
![Page 16: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/16.jpg)
Implementation Issues
• Loading a Hash File
• Deletions
  – Tombstones
  – Performance Effects
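The reason deletions need tombstones shows up in a small sketch: if a deleted record's space were simply marked free, a search for a synonym stored further along the probe sequence would stop too early. The code below is an in-memory illustration with progressive overflow; the struct name, the sentinel values, and the `key % size` hash are assumptions, and keys are taken to be non-negative integers.

```cpp
#include <vector>

const int FREE_SLOT = -1;  // never used: a search may stop here
const int TOMBSTONE = -2;  // held a record that was deleted: keep searching

struct TombstoneTable {
    std::vector<int> slots;

    explicit TombstoneTable(int tableSize) : slots(tableSize, FREE_SLOT) {}

    int size() const { return static_cast<int>(slots.size()); }

    // Returns the address holding key, or -1 if key is not in the table.
    int find(int key) const {
        for (int i = 0; i < size(); ++i) {
            int address = (key % size() + i) % size();
            if (slots[address] == FREE_SLOT) return -1;  // tombstones do not stop the search
            if (slots[address] == key) return address;
        }
        return -1;
    }

    // Stores key in the first free or tombstoned space; returns false if full.
    bool insert(int key) {
        for (int i = 0; i < size(); ++i) {
            int address = (key % size() + i) % size();
            if (slots[address] == FREE_SLOT || slots[address] == TOMBSTONE) {
                slots[address] = key;
                return true;
            }
        }
        return false;
    }

    // Replaces the record with a tombstone so later synonyms stay reachable.
    bool remove(int key) {
        int address = find(key);
        if (address < 0) return false;
        slots[address] = TOMBSTONE;
        return true;
    }
};
```

Tombstones keep searches correct but still cost probes, which is one source of the performance effects listed above: a file with many deletions can have longer searches than its current record count suggests.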
![Page 17: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/17.jpg)
Other Collision Resolution Techniques
• Quadratic Hashing
  – $H(i) = (\text{hash}(\text{key}) + i^2) \bmod TS$, where TS is the table size
• Double Hashing
  – $H(i) = (\text{hash}(\text{key}) + f(i)) \bmod TS$, where $f(i) = i \cdot \text{hash2}(\text{key})$
  – Note that hash2(key) must never be zero
• Separate Overflow Area
• Chained Overflow with Separate Overflow Area
• Scatter Tables
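The two probe formulas can be written as small functions that return the address examined on probe i. This is a sketch under stated assumptions: keys are non-negative integers, `tableSize` is at least 2, and the particular form of `hash2` is one common choice picked here for illustration, not something the slides specify.

```cpp
// Address examined on probe i (i = 0, 1, 2, ...) under quadratic hashing,
// for a record whose home address is home.
int quadraticProbe(int home, int i, int tableSize) {
    return (home + i * i) % tableSize;
}

// Second hash function for double hashing. The form 1 + key % (tableSize - 1)
// is an assumed choice; its result is always at least 1, so it is never zero.
int hash2(int key, int tableSize) {
    return 1 + key % (tableSize - 1);
}

// Address examined on probe i under double hashing, with f(i) = i * hash2(key).
int doubleHashProbe(int home, int key, int i, int tableSize) {
    return (home + i * hash2(key, tableSize)) % tableSize;
}
```

If hash2(key) were zero, every probe would return to the home address and the search would never move on, which is why the restriction above matters.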
![Page 18: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/18.jpg)
Patterns of Record Access
• 20 percent of records account for 80 percent of activity
• Most active records must be in home address or performance deteriorates
![Page 19: CpSc 3220 File and Database Processing](https://reader036.vdocuments.site/reader036/viewer/2022062410/568162aa550346895dd32bf2/html5/thumbnails/19.jpg)
Summary
• Hashing provides O(1) direct access performance.
• If the hash function produces collisions, the average search length (ASL) may increase.
• Collisions can be reduced by:
  – Spreading out records (choosing a better hash function)
  – Using extra memory
  – Using buckets
• The Poisson distribution allows us to analyze hash file performance.
• Better overflow handling can reduce ASL.
• Record deletion requires special handling.
• Consider record access patterns.
• Hashing does not provide efficient sequential access.
• Hashing requires that we fix the file size in advance.