Chapter 5: Record Storage & Primary File Organizations

Upload: elaina-burnley

Posted on 11-Dec-2015

TRANSCRIPT

Page 1

Chapter 5: Record Storage & Primary File Organizations

Page 2

Storage

• There are two general types of storage media used with computers:
– Primary Storage - all storage media that can be operated on directly by the CPU (RAM, L1 and L2 cache memory)
– Secondary Storage - hard drives, CDs, and tape

Page 3

Memory Hierarchies & Storage Devices

• The memory hierarchy is organized by speed of access. That speed comes with a price tag: cost varies inversely with access time. Like cars, the faster the memory access, the more it costs.

Page 4

Primary Storage Level of Memory

• The primary storage level of memory is generally made up of 3 levels:
– L1 cache, which is located on the CPU
– L2 cache, which is located near the CPU
– Main memory, the RAM figure that is often referred to in computer advertisements

Page 5

Secondary Storage Level of Memory

• The secondary storage level of memory may be made up of 4 levels:
– Flash memory (EEPROM)
– Hard drives
– CD-ROMs
– Tape

Page 6

Figure 5.1

Page 7

Terms Used in the Hardware Description of Hard Drives

• Capacity - The number of bytes it can store.

• Single-sided vs. Double-sided - States if the disk/platter is written on one or both sides.

• Disk Pack - A collection of disks/platters that are assembled together into a pack.

• Track - A circular band of small width on a disk surface. A disk surface will have many tracks.

Page 8

Terms Used in the Hardware Description of Hard Drives

• Sector - A segment or arc of a track.

• Block - A division of a track into equal-sized portions by the operating system.

• Interblock Gaps - Fixed-size segments that separate the blocks.

• Read/Write Head - Actually reads/writes the information to and from the disk.

Page 9

Terms Used in the Hardware Description of Hard Drives

• Cylinder - The set of tracks with the same diameter across the disk surfaces of a disk pack.

Page 10

Figure 5.2

Page 11

Terms Used in Measuring Disk Operations

• Seek Time (s) - The time it takes to position the read/write head over the desired track. It will be given in any problem that requires it.

• Rotational Delay (rd) - The average amount of time it takes the desired block to rotate into position under the read/write head: rd = (1/2) × (1/p) min, where p is the rotational speed of the disk in rpm.

Page 12

Terms Used in Measuring Disk Operations

• Transfer Rate (tr) - The rate at which information can be transferred to or from the disk: tr = (track size)/(1/p min).

• Block Transfer Time (btt) - The time it takes to transfer the data once the read/write head has been positioned: btt = B/tr msec, where B is the block size in bytes.

Page 13

Terms Used in Measuring Disk Operations

• Bulk Transfer Rate (btr) - The rate at which multiple blocks can be read/written to contiguous blocks: btr = (B/(B+G)) × tr bytes/msec, where G is the interblock gap size in bytes.

• Rewrite Time (Trw) - The time it takes, after a block is read, to write that same block back to the disk; i.e., the time for one revolution.

Page 14

Computing Times

• Given:
– Seek time (s) = 10 msec
– Rotational speed = 3600 rpm
– Track size = 50 KB
– Block size (B) = 512 bytes
– Interblock gap (G) = 128 bytes

Page 15

Problems for Disk Operations

• Compute the average time it takes to transfer 1 block on this system.

• Compute the average time it takes to transfer 20 non-contiguous blocks that are located on the same track.

• Compute the average time it takes to transfer 20 contiguous blocks.
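The three problems above can be worked out directly from the chapter's formulas. This is a minimal sketch, assuming 1 KB = 1024 bytes and that each non-contiguous block on the same track costs a full average rotational delay:

```python
# Disk timing worked example for the given parameters.
s = 10.0                    # seek time, msec
rpm = 3600                  # rotational speed
rev = 60_000 / rpm          # time for one revolution: ~16.67 msec
rd = rev / 2                # average rotational delay: ~8.33 msec
track = 50 * 1024           # track size in bytes (assuming 1 KB = 1024 bytes)
B, G = 512, 128             # block size and interblock gap, in bytes

tr = track / rev            # transfer rate: 3072 bytes/msec
btt = B / tr                # block transfer time: ~0.167 msec
btr = (B / (B + G)) * tr    # bulk transfer rate: 2457.6 bytes/msec

one_block = s + rd + btt                # seek + delay + transfer: 18.5 msec
noncontig_20 = s + 20 * (rd + btt)      # re-position for every block: 180.0 msec
contig_20 = s + rd + 20 * B / btr       # one seek, one delay, bulk rate: 22.5 msec
print(one_block, noncontig_20, contig_20)
```

The contrast between 180 msec and 22.5 msec is the point of the exercise: contiguous placement pays the seek and rotational delay only once.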

Page 16

Parallelizing Disk Access Using RAID

• RAID - Stands for Redundant Arrays of Inexpensive Disks or Redundant Arrays of Independent Disks.

• RAIDs are used to provide increased reliability, increased performance or both.

Page 17

RAID Levels

• Level 0 - has no redundancy and the best write performance but its read performance is not as good as level 1.

• Level 1 - uses mirrored disks which provide redundancy and improved read performance.

• Level 2 - provides redundancy using Hamming Codes

Page 18

RAID Levels

• Level 3 - uses a single parity disk.

• Level 4 and 5 - use block-level data striping with level 5 distributing the data across all the disks.

• Level 6 - uses the P + Q redundancy scheme making use of the Reed-Soloman codes to protect against the failure of 2 Disks.
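The single-parity idea behind levels 3 through 5 can be illustrated with XOR; the data bytes below are made up for illustration:

```python
# Single-parity reconstruction (the idea behind RAID levels 3-5).
# Hypothetical data bytes striped across three disks, plus a parity byte:
d1, d2, d3 = 0b1010, 0b0110, 0b1100
parity = d1 ^ d2 ^ d3        # stored on the parity disk

# If disk 2 fails, its contents are the XOR of everything that survives:
recovered = d1 ^ d3 ^ parity
print(recovered == d2)       # True
```

Because XOR is its own inverse, any single lost disk can be rebuilt from the survivors; protecting against two failures requires the P + Q scheme of level 6.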

Page 19

Figure 5.4

Page 20

Fig 5.5

Page 21

Fig 5.6

Page 22

Records

• Record is the term used to refer to a number of related values or items. Each value or item is stored in a field of a specific data type.

• Records may be of either fixed or variable lengths.

Page 23

Variable Length Records in Files

• There are several reasons records of the same record type may be of variable length:
– Variable-length fields
– Repeating fields

• For efficiency reasons different record types may be clustered in a file.

Page 24

Fig 5.7

Page 25

Spanned vs. Unspanned Records

• When the records of a file are stored on disk, they are placed in blocks of a fixed size, which rarely matches the record size. So when the record size is smaller than the block size and the block size is not a multiple of the record size, a decision must be made: store each record entirely in one block and leave unused space (unspanned), or allow a record to be split across two blocks (spanned).
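The unused space in the unspanned case is easy to quantify with the blocking factor; the 75-byte record size here is hypothetical:

```python
B = 512            # block size in bytes
R = 75             # record size in bytes (hypothetical)

# Unspanned organization stores whole records only, so the blocking
# factor (records per block) is the integer part of B / R.
bfr = B // R                 # 6 records fit in each block
wasted = B - bfr * R         # 62 bytes left unused in every block
print(bfr, wasted)
```

A spanned organization would use those 62 bytes for the start of the next record and continue it in the following block, at the cost of records crossing block boundaries.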

Page 26

Fig 5.8

Page 27

File Operations

• Files may be stored either in contiguous blocks or by linking the blocks together. There are advantages and disadvantages to both methods.

• Operations on files can be grouped into two types: retrieval and update. Retrieval involves only a read, while an update involves reading, modifying, and writing.

Page 28

File Structure

• Heap (Pile) Files

• Hash (Direct) Files

• Ordered (Sorted) Files

• B - Trees

Page 29

• Once the data has been brought into memory, an instruction can access it in 0.00000004 seconds on a machine running at 25 MIPS. The disparity between memory access time and disk access time is enormous: we can perform 625,000 instructions in the time it takes to read/write one disk page.

• To put this in human terms: you are typing a letter for your boss and find a word you cannot make out, so you leave him a voice-mail message. Since you were told to do nothing else, you patiently wait for his reply, doing nothing! Unfortunately, he just went on vacation and does not get your message for 3 weeks.

• This is similar to the computer waiting 0.025 seconds to get the needed data into memory from a disk read.

Page 30

Heap (Pile) Files (Unordered)

• Insertions - Very efficient

• Search - Very inefficient (Linear Search)

• Deletion - Very inefficient
– Lazy deletion

• Problems?

• When are they Used?

Page 31

Ordered (Sorted) Files of Records

• Records are stored based on the value contained in one of their fields called the ordering field.

• If the ordering field is also a key field, then the field is better described as an ordering key.

Page 32

Advantages of Ordered Files

• Reading of the records in order of the ordering field is extremely efficient.

• Finding the next record is fast.

• Finding records based on a query of the ordering field is efficient. (binary search).

• Binary search may be done on the blocks as well.
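The two-level search implied above (binary search on blocks, then within a block) might be sketched as follows; the block contents are made-up ordering-key values:

```python
import bisect

def find(blocks, key):
    """Binary-search sorted blocks by their largest key, then inside one block."""
    block_max = [b[-1] for b in blocks]        # highest ordering-key per block
    i = bisect.bisect_left(block_max, key)     # first block that could hold key
    if i == len(blocks):
        return None
    j = bisect.bisect_left(blocks[i], key)     # binary search within the block
    if j < len(blocks[i]) and blocks[i][j] == key:
        return (i, j)                          # (block number, offset in block)
    return None

blocks = [[1, 3, 5], [7, 9, 11], [13, 15]]     # a tiny hypothetical ordered file
print(find(blocks, 9))    # (1, 1)
print(find(blocks, 4))    # None
```

Only one block needs to be read from disk once the block-level search has narrowed the candidates, which is why the block-level binary search matters more than the in-block one.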

Page 33

Disadvantages of Ordered Files

• Searches on non-ordering fields are inefficient.

• Insertion and deletion of records are very expensive.

• Solutions to these problems?

Page 34

Hashing Techniques

• A record's placement is determined by the value in its hash field. A hash (randomizing) function is applied to this value to yield the address of the disk block where the record is stored. For most records, we need only a single block access to retrieve the record.

Page 35

Internal Hashing

• Internal Hashing is implemented as a hash table through the use of an array of records. (In memory)

• The array has an index range of 0 to M-1, and a function that transforms the hash field value into an integer between 0 and M-1 is used. A common one is h(K) = K mod M.

Page 36

Internal Hashing (con’t)

• Collisions occur when a hash field value of a record being inserted hashes to an address that already contains a different record.

• The process of finding another position for this record is called collision resolution.

Page 37

Collision Resolution

• Open Addressing- Places the record to be inserted in the first available position subsequent to the hash address.

• Chaining - A pointer field is added to each record location. When an overflow occurs this pointer is set to point to overflow blocks making a linked list.

Page 38

Collision Resolution (con’t)

• Multiple hashing - If an overflow occurs a second hash function is used to find a new location. If that location is also filled either another hash function is applied or open addressing is used.
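Open addressing, the first of the schemes above, can be sketched in a few lines; the table size M = 7 and the keys are made up for illustration:

```python
M = 7                   # table size (a prime, per the usual advice for mod hashing)
table = [None] * M

def h(k):
    return k % M        # h(K) = K mod M

def insert(k):
    # Open addressing: scan forward from the hash address for a free slot.
    for probe in range(M):
        i = (h(k) + probe) % M
        if table[i] is None:
            table[i] = k
            return i
    raise RuntimeError("table full")

def search(k):
    for probe in range(M):
        i = (h(k) + probe) % M
        if table[i] is None:    # an empty slot means the key is absent
            return None
        if table[i] == k:
            return i
    return None

insert(10)              # h(10) = 3, stored at slot 3
insert(17)              # h(17) = 3 -> collision, stored at slot 4
print(search(17))       # 4
```

Note that the early exit in `search` assumes no deletions have punched holes in probe sequences; with lazy deletion, searches must skip over deletion markers instead of stopping at them.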

Page 39

Fig 5.10 Page 140

Page 40

Goals of the Hash Function

• The goals of a good hash function are to distribute the records uniformly over the address space while minimizing collisions, so as to avoid wasting space.

• Research has shown:
– A fill ratio of 70% to 90% works best.
– When using a mod function, M should be a prime number.

Page 41

External Hashing for Disk Files

• External hashing makes use of buckets, each of which can hold multiple records.

• A bucket is either a block or a cluster of contiguous blocks.

• The hash function maps a key into a relative bucket number, rather than an absolute block address for the bucket.

Page 42

Types of External Hashing

• Using a fixed address space is called static hashing.

• Dynamically changing address space:
– Extendible hashing*
– Linear hashing**

* With a Directory

** Without a Directory

Page 43

Static Hashing

• Under Static Hashing a fixed number of buckets (M) is allocated.

• Based on the hash value, a bucket number is determined in the block directory array, which yields the block address.

• If n records fit into each block, this method allows up to n × M records to be stored.

Page 44

Fig 5.11 Page 143

Page 45

Fig 5.12 Page 144

Page 46

Extendible Hashing

• In Extendible Hashing, a directory is maintained as an array of 2^d bucket addresses, where d, called the global depth of the directory, is the number of high-order (leftmost) bits of the hash value that are used. However, there does NOT have to be a DISTINCT bucket for each directory entry.

• A local depth d’ is stored with each bucket to indicate the number of bits used for that bucket.

Page 47

Figure 5.13 Page 146

Page 48

Overflow (Bucket Splitting)

• When an overflow occurs in a bucket, that bucket is split: a new bucket is dynamically allocated and the contents of the old bucket are redistributed between the old and new buckets based on the increased local depth d'+1 of both these buckets.

Page 49

Overflow (Bucket Splitting)

• Now the new bucket’s address must be added to the directory.

• If the overflow occurred in a bucket whose current local depth d’ is less than or equal to the global depth d adjust the directory entries accordingly. (No change in the directory size is made.)

Page 50

Overflow (Bucket Splitting)

• If the overflow occurred in a bucket whose current local depth d’ is now greater than the global depth d you must increase the global depth accordingly.

• This results in a doubling of the directory size for each time d is increased by 1 and appropriate adjustment of the entries.
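The splitting and directory-doubling procedure described above can be sketched compactly. This sketch is illustrative, not the chapter's exact scheme: it uses the low-order bits of the hash rather than the high-order bits (which makes directory doubling a simple copy), and a tiny bucket capacity of 2 records:

```python
class Bucket:
    def __init__(self, depth, capacity=2):
        self.depth = depth              # local depth d'
        self.capacity = capacity
        self.items = []

class ExtendibleHash:
    def __init__(self, capacity=2):
        self.global_depth = 1           # global depth d
        self.capacity = capacity
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        # Low-order bits of the hash select the directory entry
        # (the chapter's description uses high-order bits instead).
        return hash(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key):
        return key in self.directory[self._index(key)].items

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.items) < bucket.capacity:
            bucket.items.append(key)
        else:
            self._split(bucket)         # overflow: split, then retry
            self.insert(key)

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            self.directory += self.directory     # double the directory
            self.global_depth += 1
        bucket.depth += 1                        # both buckets get depth d'+1
        new_bucket = Bucket(bucket.depth, self.capacity)
        high_bit = 1 << (bucket.depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:     # half the entries move over
                self.directory[i] = new_bucket
        old, bucket.items = bucket.items, []
        for k in old:                            # redistribute the records
            self.directory[self._index(k)].items.append(k)
```

Note how `_split` only doubles the directory when the overflowing bucket's local depth already equals the global depth; otherwise it just repoints half of the existing entries, matching the two cases on the preceding slides.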

Page 51

Slide showing how buckets are split under Extendible Hashing.

Page 52

Shrinking Extendible Hashing Files

• The generally used principle for shrinking extendible hashing files is that the directory may shrink when d > d' for all buckets after a deletion occurs.

• Buckets may be combined when each of the buckets to be combined is less than half full and they have the same bit pattern except for the d'-th bit, e.g., d' = 3 and the bit patterns 110 and 111.

Page 53

Linear Hashing

• Linear Hashing allows the hash file to expand and shrink its number of buckets dynamically without needing a directory.

• It starts with M buckets numbered 0 to M-1 and uses the mod hash function

h_i(K) = K mod M

as the initial hash function.

Page 54

Linear Hashing (Con’t)

• Overflow is handled by maintaining an individual overflow chain for each bucket.

• It works by methodically splitting the original buckets, starting with bucket 0: the contents of bucket 0 are redistributed between bucket 0 and bucket M (the new bucket) using a secondary hash function

h_i+1(K) = K mod 2M

Page 55

Linear Hashing (Con’t)

• This splitting of buckets is done in order (0, 1, …, M-1) REGARDLESS of which bucket the collision occurred in. To keep track of the next bucket to be split we use n; after bucket 0 is split, n is incremented to 1.

• When a record hashes to a bucket number less than n, we use the secondary hash function to determine which of the two buckets it belongs in.

Page 56

Linear Hashing (Con't)

• When all of the original M buckets have been split, we have 2M buckets and n = M.

• We then reset M to 2M and n to 0, and change our secondary hash function to be our primary hash function.

• Shrinking of the file is done based on the load factor, using the reverse of splitting.
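The whole scheme above can be sketched as follows. This is illustrative only: real implementations usually trigger splits from a load-factor threshold, whereas this sketch splits bucket n whenever any bucket overflows its capacity, and it models overflow chains simply as extra entries in a bucket's list:

```python
class LinearHash:
    def __init__(self, m=4, capacity=2):
        self.m = m                     # M for the current round
        self.n = 0                     # next bucket to be split
        self.capacity = capacity      # records per bucket before overflow
        self.buckets = [[] for _ in range(m)]

    def _addr(self, key):
        a = key % self.m               # h_i(K) = K mod M
        if a < self.n:                 # bucket already split this round:
            a = key % (2 * self.m)     # use h_i+1(K) = K mod 2M
        return a

    def insert(self, key):
        bucket = self.buckets[self._addr(key)]
        bucket.append(key)             # overflow keys just chain in the list
        if len(bucket) > self.capacity:
            self._split_next()         # split bucket n, whichever overflowed

    def _split_next(self):
        self.buckets.append([])        # the new bucket, number n + M
        old, self.buckets[self.n] = self.buckets[self.n], []
        self.n += 1
        for k in old:                  # redistribute with h_i+1
            self.buckets[k % (2 * self.m)].append(k)
        if self.n == self.m:           # round complete: 2M buckets exist
            self.m *= 2                # h_i+1 becomes the primary function
            self.n = 0                 # and a new round can begin

    def lookup(self, key):
        return key in self.buckets[self._addr(key)]
```

The key property is that no directory is needed: `_addr` decides between h_i and h_i+1 purely by comparing the bucket number with the split pointer n.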

Page 57

Slide showing how to split using linear hashing.