storage and file structure - casmy.fit.edu/.../slides/storageandfilestructure.pdfa single spindle...


Post on 20-Apr-2020



©Silberschatz, Korth and Sudarshan

Storage and File Structure

Overview of Physical Storage Media

Buffer Management

File Organization


Properties of Physical Storage Media

Types of storage media differ in terms of:

Speed of data access

Cost per unit of data

Reliability:

Data loss on power failure or system (software) crash

Physical failure of the storage device

=> What is the value of data (e.g., salon example)?


Classification of Storage Media

Storage can be classified as:

Volatile:

Content is lost when power is switched off

Includes primary storage (cache, main-memory)

Non-volatile:

Contents persist even when power is switched off

Secondary and tertiary storage, plus battery-backed up RAM


Storage Hierarchy, Cont.

Storage can also be classified as:

Primary Storage:

Fastest media

Volatile

Includes cache and main memory

Secondary Storage:

Moderately fast access time

Non-volatile

Includes flash memory and magnetic disks

Also called on-line storage

Tertiary Storage:

Slow access time

Non-volatile

Includes magnetic tape and optical storage

Also called off-line storage


Primary Storage

Cache:

Volatile

Fastest and most costly form of storage

Managed by the computer system hardware

Proprietary

Main memory (also known as RAM):

Volatile

Fast access (10s to 100s of nanosecs; 1 nanosec = 10^-9 seconds)

Generally too small (or too expensive) to store the entire database

=> The above comment notwithstanding, main memory databases do exist, especially for clusters.


Physical Storage Media, Cont.

Flash memory:

Non-volatile

Widely used in embedded devices such as digital cameras

SD cards, USB drives, and solid-state drives

Data can be written at a location only once, but location can be erased and written to again

Can support only a limited number of write/erase cycles

Reads are roughly as fast as main memory

Writes are slow (few microseconds), erase is slower

Cost per unit of storage roughly similar to main memory


Physical Storage Media, Cont.

Magnetic-disk:

Non-volatile

• Survives power failures and system (software) crashes

• Failure can destroy data, but is very rare (??????)

Primary medium for database storage*

Typically stores the entire database

• Capacity of individual disks is currently in the 100s of GBs or TBs

• Much larger capacity than main or flash memory

Direct/random-access, unlike magnetic tape, which is sequential access.


Physical Storage Media, Cont.

Optical storage:

Non-volatile

Data is read optically from a spinning disk using a laser

CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms

Write-once, read-many (WORM) optical disks used for archival storage (CD-R and DVD-R)

Multiple write versions available (CD-RW, DVD-RW, and DVD-RAM)

Reads and writes are slower than with magnetic disk


Physical Storage Media, Cont.

Tape storage:

Non-volatile

Data is accessed sequentially, consequently known as sequential-access

Much slower than a disk

Very high capacity (40 to 300 GB tapes available)

Storage costs are much cheaper than for a disk, but high quality drives can be very expensive

Used mainly for backup (recover from disk failure), archive and transfer of very large amounts of data


Storage Hierarchy

[Figure: storage hierarchy diagram, annotated with speed, cost, and volatility]


Magnetic Hard Disk Mechanism

NOTE: Diagram is only a simplification of actual disk drives


Magnetic Disks

Disk assembly consists of:

A single spindle that spins continually (at 7200 or 10000 RPM typically)

Multiple disk platters (typically 2 to 5)

Surface of platter divided into circular tracks:

Over 16,000 tracks per platter on typical hard disks

Each track is divided into sectors:

A sector is the smallest unit of data that can be read or written

Typically 512 bytes

Typical sectors per track: 200 (on inner tracks) to 400 (on outer tracks)

Head-disk assemblies:

One head per platter, mounted on a common arm, each very close to its platter.

Reads or writes magnetically encoded information

“Cylinder i” consists of the ith track of all the platters


Magnetic Disks, Cont.

Disks are the primary performance bottleneck in a database system, in part because of the need for physical movement.

To read/write a sector:

disk arm swings to position head on right track – seek time (4-10 ms)

sector rotates under read/write head – rotational latency (4-11 ms)

data is read/written as sector passes under head – transfer rate (25-100 MB/s)

Access time - The time it takes from when a read or write request is issued to when data transfer begins, i.e., seek time + rotational latency.
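The access-time definition above can be sketched numerically. This is a minimal Python sketch, not taken from the slides: the function names and the sample drive (10000 RPM, 6 ms seek, 50 MB/s) are assumptions chosen for illustration, and average rotational latency is taken as half of one revolution.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one full revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def access_time_ms(seek_ms: float, rpm: float) -> float:
    """Access time = seek time + average rotational latency."""
    return seek_ms + avg_rotational_latency_ms(rpm)


def transfer_time_ms(num_bytes: int, rate_mb_per_s: float) -> float:
    """Time to move num_bytes once the head is in position (1 MB taken as 10**6 bytes)."""
    return num_bytes / (rate_mb_per_s * 1_000_000) * 1000.0


if __name__ == "__main__":
    seek_ms, rpm = 6.0, 10_000          # assumed sample drive, not from the slides
    print("access time (ms):", access_time_ms(seek_ms, rpm))
    print("4 KB transfer (ms):", transfer_time_ms(4096, 50.0))
```

With these assumed figures the positioning cost is several milliseconds while the transfer itself is a small fraction of a millisecond, which is why the later slides focus on reducing arm movement.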


Performance Measures of Disks, Cont.

Multiple disks may share an interface controller, so the rate at which the interface controller can handle data is also important

ATA-5: 66 MB/s, SCSI-3: 40 MB/s, Fibre Channel: 256 MB/s

Mean time to failure (MTTF) - The average time the disk is expected to run continuously without any failure.

Typically 3 to 5 years

Probability of failure of new disks is quite low, corresponding to a theoretical MTTF of 30,000 to 1,200,000 hours

An MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on average one will fail every 1200 hours

MTTF decreases as disk ages
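The fleet arithmetic above can be checked with a short Python sketch. It assumes disks fail independently and share the same MTTF; the function names are illustrative, not from the slides.

```python
def hours_between_failures(mttf_hours: float, num_disks: int) -> float:
    """Expected hours between failures somewhere in a fleet of identical, independent disks."""
    return mttf_hours / num_disks


def hours_to_days(hours: float) -> float:
    """Convert hours to days."""
    return hours / 24.0


if __name__ == "__main__":
    # 1000 disks, each with an MTTF of 1,200,000 hours
    print(hours_between_failures(1_200_000, 1000))
    # 100 disks, each with an MTTF of 100,000 hours, expressed in days
    print(hours_to_days(hours_between_failures(100_000, 100)))
```

The same division explains why a large array fails far more often than any single disk in it.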


Magnetic Disks (Cont.)

The term “controller” is used (primarily) in two ways, both in the book and other literature; the book does not distinguish between the two very well.

Disk controller: (use #1)

Packaged within the disk

Accepts high-level commands to read or write a sector

Initiates actions such as moving the disk arm to the right track and actually reading or writing the data

Computes and attaches checksums to each sector for reliability

Performs re-mapping of bad sectors


Disk Subsystem

Interface controller: (use #2)

Multiple disks are typically connected to the system bus through an interface controller (i.e., a host adapter)

Many functions (checksum, bad sector re-mapping) are often carried out by individual disk controllers; this reduces load on the interface controller


Disk Subsystem

The distribution of work between the disk controller and interface controller depends on the interface standard.

Disk interface standard families:

ATA (AT Attachment)/IDE

SCSI (Small Computer System Interface)

Fibre Channel, etc.

Several variants of each standard (different speeds and capabilities)

One computer can have many interface controllers, of the same or different type

=> Like disks, interface controllers are also a performance bottleneck.


Techniques for Optimization of Disk-Block Access

Several techniques are employed to minimize the negative effects of disk and controller bottlenecks

File organization:

Most DBMSs allow the DBA to manipulate the file organization of a database in a variety of ways.

De-fragmentation utilities

Organize data on disk based on how it will be accessed

• Store related information (e.g., data in the same file) on the same or nearby cylinders.

• Pad data to align it on word boundaries

• Spread tables across disks or controllers


Techniques for Optimization of Disk-Block Access

Block size adjustment:

A block is a contiguous sequence of sectors from a single track

A DBMS transfers data between disk and main memory in blocks

Sizes range from 512 bytes to several kilobytes

• Smaller blocks: more transfers from disk

• Larger blocks: may waste space due to partially filled blocks

• Typical block sizes range from 4 to 16 kilobytes

Block size can be adjusted to accommodate workload

Disk-arm-scheduling algorithms:

Order pending accesses so that disk arm movement is minimized

One example is the “Elevator algorithm”
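A minimal Python sketch of elevator-style scheduling follows. It assumes requests are identified by cylinder number and implements the common variant that reverses direction at the last pending request rather than at the edge of the disk; the function names and the sample request queue are illustrative assumptions.

```python
def elevator_order(pending: list[int], head: int, moving_up: bool = True) -> list[int]:
    """Order pending cylinder requests: sweep in one direction, then reverse."""
    lower = sorted(c for c in pending if c < head)
    upper = sorted(c for c in pending if c >= head)
    if moving_up:
        return upper + lower[::-1]   # serve higher cylinders first, then sweep back down
    return lower[::-1] + upper       # serve lower cylinders first, then sweep back up


def total_arm_movement(order: list[int], head: int) -> int:
    """Total number of cylinders the arm crosses when serving requests in the given order."""
    moved = 0
    for cylinder in order:
        moved += abs(cylinder - head)
        head = cylinder
    return moved


if __name__ == "__main__":
    queue = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed sample requests
    start = 53
    print("arrival order:", total_arm_movement(queue, start))
    print("elevator order:", total_arm_movement(elevator_order(queue, start), start))
```

For this sample queue, serving requests in arrival order moves the arm across 640 cylinders, while the elevator order needs 299.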


Techniques for Optimization of Disk-Block Access, Cont.

Log disk:

A disk devoted to the transaction log

Writes to a log disk are very fast since no seeks are required

Often provided with battery back-up

Nonvolatile write buffers/RAM:

Battery backed-up RAM or flash memory

Typically associated with a disk or storage device

Blocks to be written are first written to the non-volatile RAM buffer

Disk controller writes data to disk whenever the disk has no other requests

Allows processes to continue without waiting on write operations

Data is safe even in the event of power failure

Allows writes to be reordered to minimize disk arm movement


RAID

Redundant Arrays of Independent Disks (RAID):

Disk organization techniques that exploit a collection of disks, i.e., striping, mirroring, parity

Speed, reliability, or both are increased

Collection appears as a single disk to the system

Originally “inexpensive” disks

Ironically, using multiple disks increases the risk of failure (not data loss):

A system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approximately 41 days)


Storage Access

A DBMS will typically have several files allocated to it for storage, which are (usually) formatted and managed by the DBMS.

Each file typically maps to either a disk, disk partition or storage device.

Each file is partitioned into blocks (a.k.a. pages).

A block consists of one or more contiguous sectors.

A block is the smallest unit of DBMS storage allocation & transfer.

Each block is partitioned into records.

Each record is partitioned into fields.


Buffer Management

The DBMS will transfer blocks of data between RAM and disk, in a manner similar to an operating system.

The DBMS seeks to minimize the number of block transfers between the disk and memory.

The portion of main memory available to store copies of disk blocks is called the buffer (a.k.a. the cache).

The DBMS subsystem responsible for allocating buffer space in main memory is called the buffer manager.


Buffer Manager Algorithm

Programs call the buffer manager when they need a (disk) block.

If a requested block is already in the buffer then the requesting program is given the address of the block in main memory.

If the block is not in the buffer then the buffer manager:

Allocates buffer space for the block, throwing out another block, if required.

Writes out to disk the thrown out block if it has been modified since the last time it was retrieved.

Reads the requested block from disk to the buffer once space is available.

Passes the address of the block in main memory to requester.
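The steps above can be sketched as a toy buffer manager in Python. Everything here is an illustrative assumption rather than any DBMS's actual interface: the class and method names are hypothetical, a dict stands in for the disk, and the victim is simply the first unpinned block in load order (the replacement policy is discussed on later slides).

```python
class BufferManager:
    """Toy buffer manager: returns in-memory blocks, evicting an unpinned block when full."""

    def __init__(self, capacity, disk):
        self.capacity = capacity   # number of blocks the buffer can hold
        self.disk = disk           # dict: block id -> bytes (stand-in for a real disk)
        self.frames = {}           # block id -> bytearray held in memory
        self.dirty = set()         # blocks modified since they were read
        self.pinned = set()        # blocks that must not be replaced
        self.disk_writes = 0

    def get_block(self, block_id):
        """Return the in-memory copy of a block, reading it from disk if necessary."""
        if block_id not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[block_id] = bytearray(self.disk[block_id])
        return self.frames[block_id]

    def mark_dirty(self, block_id):
        """Record that the in-memory copy of a block has been modified."""
        self.dirty.add(block_id)

    def _evict(self):
        """Throw out one unpinned block, writing it back first if it was modified."""
        # Placeholder policy (an assumption): first unpinned block in load order.
        victim = next((b for b in self.frames if b not in self.pinned), None)
        if victim is None:
            raise RuntimeError("every buffer block is pinned")
        if victim in self.dirty:
            self.disk[victim] = bytes(self.frames[victim])
            self.dirty.discard(victim)
            self.disk_writes += 1
        del self.frames[victim]
```

Returning the bytearray itself plays the role of "passing the address of the block in main memory": the caller modifies the buffered copy in place and calls `mark_dirty` so the change is written back on eviction.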


Buffer-Replacement Policies

How is a block selected for replacement?

Most operating systems use Least Recently Used (LRU).

In a DBMS, LRU can be a bad strategy for certain access patterns.

Queries have well-defined access patterns (such as sequential scans), and so a DBMS can predict future references (an OS cannot).

Hybrid strategies are common.

Statistical information pertaining to the data can be exploited.

Well-known access patterns can be exploited.

Simple Examples:

The data dictionary is frequently accessed, so keep those blocks in the buffer.

Index blocks are used frequently, so keep them in the buffer.


Buffer-Replacement Policies, Cont.

Some terminology...

A disk block that has been loaded into the buffer and is not allowed to be replaced is said to be pinned.

At any given time a buffer block is in one of the following states:

Unused (free) – does not contain the copy of a block from disk.

Used, but not pinned – contains the copy of a block from disk, which is available for replacement.

Pinned – contains the copy of a block from disk, but which is not available for replacement.


Buffer-Replacement Policies, Cont.

Page replacement strategies:

Most recently used (MRU) – Replace the most recently used unpinned block. The moment a disk block in the buffer becomes unpinned, it becomes the most recently used block.

Least recently used (LRU) – Replace, of all unpinned disk blocks in the buffer, the one referenced least recently.


Buffer-Replacement Policies, Cont.

Recall the borrower and customer relations:

borrower = (customer-name, loan-number)

customer = (customer-name, customer-street, customer-city)

Consider a query that performs a join on borrower and customer:

select borrower.customer-name, loan-number, customer-street, customer-city

from borrower, customer

where borrower.customer-name = customer.customer-name;

Suppose that borrower consists of blocks b1, b2, and b3, that customer consists of blocks c1, c2, c3, c4, and c5, and that the buffer can fit five (5) blocks total.


Buffer-Replacement Policies, Cont.

Now consider the following query plan for the query:

Use a nested-loop join algorithm:

for each tuple b of borrower do          // scan borrower sequentially
    for each tuple c of customer do      // scan customer sequentially
        if b[customer-name] = c[customer-name] then begin
            x[customer-name] := b[customer-name];
            x[loan-number] := b[loan-number];
            x[street] := c[customer-street];
            x[city] := c[customer-city];
            include x in the result of the query;
        end if;
    end for;
end for;

Use LRU for block replacement:

− Read in a block of borrower, and keep it, i.e., pin it, in the buffer until the last tuple in that block has been processed.

− Once all of the tuples from that block have been processed, unpin it.

− If a customer block is to be brought into the buffer, and another block must be moved out in order to make room, then move out the least recently used block (borrower, or customer).


Buffer-Replacement Policies, Cont.

Applying the LRU replacement algorithm results in the following for the first two tuples of b1:

[Figure: sequence of buffer snapshots under LRU, each holding b1 plus up to four of the customer blocks c1 to c5]

6 block replacements

Notice that each time a customer block is read into the buffer, the block replaced is the next block required!


Buffer-Replacement Policies, Cont.

Suppose an MRU block replacement strategy is used:

Read in a block of borrower, and keep it, i.e., pin it, in the buffer until the last tuple in that block has been processed.

Once all of the tuples from that block have been processed, unpin it.

If a customer block is to be brought into the buffer, and another block must be moved out in order to make room, then move out the most recently used block (borrower, or customer).


Buffer-Replacement Policies, Cont.

Applying the MRU replacement algorithm results in the following for the first two tuples of b1:

[Figure: sequence of buffer snapshots under MRU, each holding b1 plus up to four of the customer blocks c1 to c5]

2 block replacements
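The LRU and MRU replacement counts for this example can be reproduced with a small Python simulation. It is a sketch under the slide's stated setup (a five-block buffer, b1 pinned, customer blocks c1 to c5 scanned once per borrower tuple); the function name and the list-based recency tracking are assumptions made for illustration.

```python
def count_replacements(accesses, capacity, policy, pinned=()):
    """Count block replacements for a sequence of block accesses under 'LRU' or 'MRU'."""
    buffer = list(pinned)   # ordered from least to most recently used
    replacements = 0
    for block in accesses:
        if block in buffer:
            buffer.remove(block)              # hit: only the recency changes
        elif len(buffer) >= capacity:
            candidates = [b for b in buffer if b not in pinned]
            victim = candidates[0] if policy == "LRU" else candidates[-1]
            buffer.remove(victim)
            replacements += 1
        buffer.append(block)                  # block is now the most recently used
    return replacements


if __name__ == "__main__":
    customer_scan = ["c1", "c2", "c3", "c4", "c5"]
    accesses = customer_scan * 2              # first two tuples of b1
    for policy in ("LRU", "MRU"):
        print(policy, count_replacements(accesses, 5, policy, pinned=("b1",)))
```

Under these assumptions the simulation reports 6 replacements for LRU and 2 for MRU, matching the counts stated on the slides.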


File Management

Traditional “low-level” storage DBMS issues:

Avoiding large numbers of block reads - algorithmically, the bar has been raised

Records crossing block boundaries

Dangling pointers

Disk fragmentation

Allocating free space

Storage “mechanisms,” which provide partial solutions:

Record packing

Free/reserved space lists

Reserved space

Record splitting

Mixed block formatting

Slotted-page structure


File Management

Recall:

A database is stored as a collection of files.

A file consists of a sequence of blocks.

A block consists of a sequence of records.

A record consists of a sequence of fields.

Assumptions:

Record size may be fixed or variable.

Record size is usually small relative to the block size, but might be larger.

Each file may have records of one type (one relation) or multiple types (more than one relation).

Sometimes tuples need to be sorted, other times not.

Which assumptions apply determine which techniques are appropriate.


Record Packing

Record “Packing” - basic approach:

Pack records as tightly as possible.

Works best for fixed-size records:

Store record i, where i >= 0, starting from byte n × i, where n is the record size in bytes.

Record access is simple, given i.

Add each new record to the end

Deletion of record i options (here n denotes the index of the last record, not the record size):

Move records i + 1, ..., n to i, ..., n – 1 (shift)

Move record n to i
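The offset rule and the two deletion options can be sketched in Python over a bytearray. The helper names and the 16-byte default record size are assumptions for illustration only.

```python
RECORD_SIZE = 16  # assumed record size in bytes


def record_offset(i: int, n: int = RECORD_SIZE) -> int:
    """Byte offset of record i (i >= 0) when records of size n are packed tightly."""
    return n * i


def read_record(data: bytearray, i: int, n: int = RECORD_SIZE) -> bytes:
    """Return a copy of record i."""
    start = record_offset(i, n)
    return bytes(data[start:start + n])


def delete_by_shift(data: bytearray, i: int, n: int = RECORD_SIZE) -> None:
    """Option 1: remove record i and shift every later record down by one position."""
    del data[n * i:n * (i + 1)]


def delete_by_move_last(data: bytearray, i: int, n: int = RECORD_SIZE) -> None:
    """Option 2: overwrite record i with the last record, then drop the last position."""
    last = len(data) // n - 1
    if i != last:
        data[n * i:n * (i + 1)] = data[n * last:n * (last + 1)]
    del data[n * last:]
```

Shifting preserves record order but touches every later record; moving the last record touches only one record but changes its position, so any pointer to it must be updated.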


Free Lists

Free lists – connect free space (or used space) into a linked list.

Advantages:

Insertion & deletion can be done with few block I/Os (assuming tuples aren’t sorted).

Records don’t have to move; eliminates dangling pointers.

Free lists are used by virtually all DBMSs at the file level.
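A free list over fixed-size record slots might look like the following Python sketch. The `SlotFile` and `FreeSlot` names are hypothetical, and a Python list stands in for the file; deleted slots are chained together so that insertion reuses them without moving any existing record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FreeSlot:
    """Marker stored in a deleted slot; points at the next free slot, if any."""
    next_free: Optional[int]


class SlotFile:
    """Fixed-size record slots with deleted slots chained into a free list."""

    def __init__(self):
        self.slots = []         # each entry is a record or a FreeSlot marker
        self.free_head = None   # index of the first free slot, or None

    def insert(self, record) -> int:
        """Store a record in the first free slot, or append if no slot is free."""
        if self.free_head is None:
            self.slots.append(record)
            return len(self.slots) - 1
        index = self.free_head
        self.free_head = self.slots[index].next_free
        self.slots[index] = record
        return index

    def delete(self, index: int) -> None:
        """Free a slot by pushing it onto the head of the free list."""
        self.slots[index] = FreeSlot(self.free_head)
        self.free_head = index
```

Because surviving records never change slots, pointers to them stay valid after any sequence of inserts and deletes.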


Record Packing

For mixed or variable sized records:

Attach an end-of-record (⊥) character to the end of each record.

Pack the records as tightly as possible.

Issues:

Deletion, insertion and record growth result in 1) lots of record movement or 2) fragmentation.

Record movement potentially results in dangling pointers.


Reserved Space

Reserved space – use the maximum required space for each record:

Can be used for repeating fields.

Also with varying length attributes, e.g., a name field.

Main issue - Could waste significant space.


Record Splitting

Record Splitting:

A single variable-length record is stored as a collection of smaller records, linked with pointers.

Can be used even if the maximum record length is not known.

Once again, pointers could be used to link free and occupied records.

Disadvantages:

Records are more likely to cross block boundaries

Space is wasted in all records except the first in a chain


Mixed Block Formatting

Mixed Block Formatting – format blocks differently (sometimes in different files):

Anchor block

Overflow block

Records are likely to cross block boundaries.

Frequently used for very large attributes such as BLOBS, text, etc.


Slotted Page Structure

A Slotted Page (i.e., block) is formatted with three sections:

Header

Free-space

Records

Header contents:

Number of record entries

Pointer to the end of free space in the block

Location and size of each record

Key points:

External pointers refer to header entries rather than directly to records.

Records can be moved within a page to eliminate empty space; header must be updated.
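A slotted page can be sketched in Python as follows. This is a simplified illustration: the class name is hypothetical, the header is kept as a Python list rather than serialized into the block, and records are packed from the end of the block toward the front, so the slot number handed out by `insert` remains valid even when records are moved.

```python
class SlottedPage:
    """Simplified slotted page: header entries point at records packed at the end of the block."""

    def __init__(self, size: int = 64):
        self.data = bytearray(size)
        self.free_end = size    # end of free space; records occupy data[free_end:]
        self.slots = []         # header entries: (offset, length), or None if deleted

    def insert(self, record: bytes) -> int:
        """Place a record just below the existing ones and return its slot number."""
        if len(record) > self.free_end:
            raise ValueError("not enough free space in page")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1

    def read(self, slot: int) -> bytes:
        """Follow the header entry to the record's current location."""
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        """Mark the slot as deleted and reclaim its space."""
        self.slots[slot] = None
        self._compact()

    def _compact(self) -> None:
        """Repack live records against the end of the block and update their header entries."""
        live = [(s, self.read(s)) for s, entry in enumerate(self.slots) if entry is not None]
        end = len(self.data)
        for slot, record in live:
            end -= len(record)
            self.data[end:end + len(record)] = record
            self.slots[slot] = (end, len(record))
        self.free_end = end
```

External references would store a (page, slot number) pair, so compaction changes only the offsets inside the header and never invalidates a pointer.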


File Organization

More sophisticated techniques are typically used in combination with the preceding storage mechanisms:

Hashing

Sequential/sorted

Clustering

These techniques can improve query performance substantially.


Sequential File Organization

Records in a file are ordered by a search-key.

Good for point queries and range queries.

Maintaining physical sequential order is difficult with insertions & deletions.


Sequential File Organization

Use pointers to keep track of logical record order:

Probably used in conjunction with a free-list.


Sequential File Organization

Insertion – locate the position where the record is to be inserted.

If there is free space there then insert directly.

If no free space, insert the record in another location off the free-list.

In either case, pointer chain must be updated.

Deletion – maintain pointer chains in the obvious way.

Main issue - file may degenerate over time; reorganize the file from time to time to restore sequential order.
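The pointer-chain approach can be sketched in Python as below. The class name is hypothetical and a list stands in for the file; for brevity new records are always appended at the physical end, whereas the slides also allow inserting in place or taking a location off the free list.

```python
class ChainedSequentialFile:
    """Records kept in arrival order physically, with next-pointers giving search-key order."""

    def __init__(self):
        self.records = []   # physical order; each entry is [key, next_index]
        self.head = None    # index of the record with the smallest key

    def insert(self, key) -> int:
        """Append the record physically, then splice it into the pointer chain."""
        self.records.append([key, None])
        new = len(self.records) - 1
        prev, cur = None, self.head
        while cur is not None and self.records[cur][0] <= key:
            prev, cur = cur, self.records[cur][1]
        self.records[new][1] = cur
        if prev is None:
            self.head = new
        else:
            self.records[prev][1] = new
        return new

    def scan(self):
        """Yield keys in search-key order by following the pointer chain."""
        cur = self.head
        while cur is not None:
            yield self.records[cur][0]
            cur = self.records[cur][1]

    def reorganize(self) -> None:
        """Rewrite the file so that physical order matches search-key order again."""
        keys = list(self.scan())
        self.records = [[k, i + 1] for i, k in enumerate(keys)]
        if self.records:
            self.records[-1][1] = None
        self.head = 0 if self.records else None
```

Before reorganization, a scan in key order jumps around the file; afterwards the chain runs through physically adjacent records.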


Clustering File Organization

A clustered file organization stores several relations in one file.

Clustered organization of customer and depositor:

Good for some queries involving:

depositor ⋈ customer

an individual customer and their accounts.

Bad for others, e.g., involving only customer.


Large Objects – A Special Case

Large objects:

text documents

images

computer aided designs

audio and video data

Most DBMS vendors provide supporting types:

binary large objects (blobs)

character large objects (clobs)

text, image, etc.

Large objects are stored in ad-hoc ways:

Stored in a special storage “heap” on disk, or split into many blocks distributed on disk (record splitting).

Typically stored in a contiguous sequence of bytes or blocks when brought into memory.

Might be preferable to disallow direct access to data using SQL, and only allow access through a 3rd party file system API.


End of Chapter


Data Dictionary Storage

The data dictionary (a.k.a. system catalog) stores metadata:

Information about relations

• names of relations

• names and types of attributes

• names and definitions of views

• integrity constraints

Physical file organization information

• How relation is stored (sequential, hash, etc.)

• Physical location of relation

– operating system file name, or

– disk addresses of blocks containing the records

Statistical and descriptive data

• number of tuples in each relation

User and accounting information, including passwords

Information about indices (Chapter 12)


Data Dictionary Storage (Cont.)

Catalog structure can use either:

specialized data structures designed for efficient access

a set of relations, with existing system features used to ensure efficient access

The latter alternative is the standard.

A possible catalog representation:

Relation-metadata = (relation-name, number-of-attributes, storage-organization, location)

Attribute-metadata = (attribute-name, relation-name, domain-type, position, length)

User-metadata = (user-name, encrypted-password, group)

Index-metadata = (index-name, relation-name, index-type, index-attributes)

View-metadata = (view-name, definition)
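A minimal sketch of the "catalog as a set of relations" alternative, using Python's built-in sqlite3 module purely as a stand-in for the DBMS. The table and column names mirror two of the schemas above; the sample `account` relation, its attributes, and the `describe` helper are hypothetical and only illustrate how an ordinary query can answer a catalog lookup.

```python
import sqlite3

# In-memory database standing in for the DBMS's own storage (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE relation_metadata (
    relation_name        TEXT PRIMARY KEY,
    number_of_attributes INTEGER,
    storage_organization TEXT,
    location             TEXT
);
CREATE TABLE attribute_metadata (
    attribute_name TEXT,
    relation_name  TEXT REFERENCES relation_metadata(relation_name),
    domain_type    TEXT,
    position       INTEGER,
    length         INTEGER,
    PRIMARY KEY (attribute_name, relation_name)
);
""")

# Hypothetical relation used only for illustration.
conn.execute(
    "INSERT INTO relation_metadata VALUES (?, ?, ?, ?)",
    ("account", 2, "sequential", "account.dat"),
)
conn.executemany(
    "INSERT INTO attribute_metadata VALUES (?, ?, ?, ?, ?)",
    [
        ("account_number", "account", "char", 1, 10),
        ("balance", "account", "numeric", 2, 8),
    ],
)

def describe(relation):
    """Return (attribute, type, length) for a relation, in column order."""
    rows = conn.execute(
        "SELECT attribute_name, domain_type, length "
        "FROM attribute_metadata WHERE relation_name = ? ORDER BY position",
        (relation,),
    )
    return rows.fetchall()
```

Because the catalog is itself relational, the same query machinery used for user data serves metadata lookups, which is the point of the second alternative.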


Improvement of Reliability via Redundancy

RAID improves reliability by using two techniques:

Parity Information

Mirroring

Parity Information (basic):

Makes use of the exclusive-or operator

1 ⊕ 1 ⊕ 0 ⊕ 1 = 1

0 ⊕ 1 ⊕ 1 ⊕ 0 = 0

Enables the detection of single-bit errors

Could be applied to a collection of disks
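The parity idea above can be sketched in a few lines of Python. The function names `parity_bit` and `has_error` are illustrative, not from the slides; the sketch shows that XOR-based parity flags a single flipped bit but misses a double flip.

```python
from functools import reduce
from operator import xor

def parity_bit(bits):
    """Parity of a bit sequence: the XOR of all its bits."""
    return reduce(xor, bits, 0)

def has_error(bits, stored_parity):
    """True if the recomputed parity disagrees with the stored parity."""
    return parity_bit(bits) != stored_parity

data = [1, 1, 0, 1]
stored = parity_bit(data)      # parity saved alongside the data

one_flip = [1, 0, 0, 1]        # a single bit changed: detected
two_flips = [0, 0, 0, 1]       # two bits changed: parity matches again
```

Here `has_error(one_flip, stored)` is true while `has_error(two_flips, stored)` is false, which is why plain parity is described as detecting single-bit errors only.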


Improvement of Reliability via Redundancy

Mirroring:

For every disk, keep a duplicate copy

Every write is carried out on both disks (in parallel)

If one disk in a pair fails, data still available

Different reads can take place from either disk and, in particular, from both disks at the same time

Data loss would occur only if 1) a disk fails, and 2) its mirror disk fails before the first is repaired:

Probability of both events is very small

If MTTF is 100,000 hours and mean time to repair is 10 hours, then mean time to data loss is approximately 57,000 years for a mirrored pair of disks
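The 57,000-year figure can be reproduced with a short calculation. The sketch below assumes the usual approximation for a mirrored pair with independent failures, mean time to data loss ≈ MTTF² / (2 × MTTR); the slide states only the inputs and the result, so the formula and the function name are assumptions here.

```python
HOURS_PER_YEAR = 24 * 365

def mean_time_to_data_loss_hours(mttf_hours, mttr_hours):
    """Approximate MTTDL for a mirrored pair, assuming independent failures."""
    return mttf_hours ** 2 / (2 * mttr_hours)

hours = mean_time_to_data_loss_hours(100_000, 10)   # 500 million hours
years = hours / HOURS_PER_YEAR                      # a little over 57,000
```

The independence assumption is optimistic: correlated failures (shared power, same manufacturing batch, ageing disks) shorten the real figure.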


Improvement in Performance via Parallelism

RAID improves performance mainly through the use of parallelism.

Two main goals of parallelism in a disk system:

Parallelize large accesses to reduce response time (transfer rate).

Load balance multiple small accesses to increase throughput (I/O rate).

Parallelism is achieved primarily through striping, but also through mirroring.

Striping can be done at varying levels:

bit-level

block-level (variable size)


Improvement in Performance via Parallelism

Bit-level striping:

Abstractly, the data to be stored can be thought of as a list of bytes.

Suppose there are eight disks, and write bit i of each byte to disk i.

In theory, each access can read data at eight times the rate of a single disk (reduces transfer time, and hence improves response time).

Each byte read or written ties up all 8 disks (reduces throughput).

Not commonly used, particularly in the form described here.
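As an illustration of the eight-disk scheme just described, the following sketch models each disk as a Python list of bits. The names `stripe_bits` and `unstripe_bits` are hypothetical, and numbering bit 0 as the least significant bit is an assumption.

```python
def stripe_bits(data: bytes, n_disks: int = 8):
    """Return n_disks lists; list i holds bit i of every byte in data."""
    disks = [[] for _ in range(n_disks)]
    for byte in data:
        for i in range(n_disks):
            disks[i].append((byte >> i) & 1)   # bit 0 = least significant
    return disks

def unstripe_bits(disks):
    """Reassemble the original bytes; every disk contributes to every byte."""
    out = bytearray()
    for k in range(len(disks[0])):
        byte = 0
        for i, disk in enumerate(disks):
            byte |= disk[k] << i
        out.append(byte)
    return bytes(out)

disks = stripe_bits(b"RAID")
restored = unstripe_bits(disks)
```

The inner loop in `unstripe_bits` touches all eight lists for each byte, mirroring the slide's point that every access ties up every disk.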

Block-level striping:

Abstractly, the data to be stored can be thought of as a list of blocks, numbered 0..(k-1).

Suppose there are n disks, numbered 0..(n-1).

Store block i of a file on disk (i mod n).

Requests for different blocks can run in parallel if the blocks reside on different disks (improves throughput).

A request for a long sequence of blocks can use all disks in parallel (improves response time).
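The placement rule "block i goes to disk (i mod n)" can be sketched as follows. The per-disk offset `i // n` and both function names are assumptions added for illustration; the slide specifies only the disk number.

```python
def place_block(i, n):
    """Map logical block i to (disk, offset on that disk) for n disks."""
    return i % n, i // n

def disks_touched(first_block, count, n):
    """Set of disks involved in reading `count` consecutive blocks."""
    return {place_block(i, n)[0] for i in range(first_block, first_block + count)}

# With 4 disks: a long sequential read spreads over all disks,
# while a single-block read occupies only one of them.
long_read = disks_touched(0, 8, 4)
short_read = disks_touched(2, 1, 4)
```

A single-block request leaves the other disks free to serve different requests (throughput), while a long request engages every disk at once (response time).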


RAID Levels

Different RAID organizations, or RAID “levels,” have differing cost, performance and reliability characteristics.

Levels vary by source and vendor:

Standard levels: 0, 1, 2, 3, 4, 5, 6

Nested levels: 0+1, 1+0, 5+0, 5+1

Several other, non-standard, vendor specific levels exist as well

Our book: 0, 1, 2, 3, 4, 5

Performance and reliability are not linear in the level #.

It is helpful to compare each level to every other level, and to the single disk option.


RAID Levels

RAID Level 0:

Block-level striping.

Used in high-performance applications where data loss is not critical.

RAID Level 1:

Mirrored disks with block striping.

Offers the best write performance, according to the authors.

Sometimes called 0+1, 1+0, 01 or 10.


RAID Levels, Cont.

RAID Level 2:

Memory-Style Error-Correcting-Codes (ECC) with bit-level striping.

Parity information is used to detect, locate and correct errors.

Outdated: disks contain embedded functionality to detect and locate sector errors, so only a single parity disk is needed (for correction).

Also, disks don’t read/write individual bits.


RAID Levels, Cont.

RAID Level 3:

Bit-level striping, with parity.

Individual disks report errors.

To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit disk).

Faster response time than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O.

When writing data, corresponding parity bits must also be computed and written to a parity bit disk.

Subsumes Level 2 (provides all its benefits, at lower cost).
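Recovery by XOR can be sketched as below. Disk contents are modelled as short byte strings (XOR acts bit by bit, so the granularity does not affect the result); the sample values and the name `xor_blocks` are made up for illustration.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for k, b in enumerate(block):
            result[k] ^= b
    return bytes(result)

# Three data disks plus one parity disk (hypothetical contents).
data_disks = [b"\x0f\xa0", b"\x33\x55", b"\xff\x00"]
parity_disk = xor_blocks(data_disks)

# Suppose data disk 1 fails and reports the failure itself.
survivors = [data_disks[0], data_disks[2], parity_disk]
recovered = xor_blocks(survivors)
```

Because the failed disk identifies itself, one parity disk suffices to rebuild it: XORing the survivors cancels every known contribution and leaves the missing data.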


RAID Levels (Cont.)

RAID Level 4:

Block-level striping, with parity.

Individual disks report errors.

To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit disk).

When writing a data block, the corresponding block of parity bits must also be recomputed and written to the parity disk

• For a single block write – use the old parity block, old value of current block and new value of current block (2 block reads + 2 block writes)

• For a large sequential block write - use the new values of blocks corresponding to the parity block.
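The two parity-update cases above can be sketched as follows. The helper names are hypothetical; the small-write rule used is new parity = old parity XOR old data XOR new data, which follows from the XOR definition of parity.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equally sized byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_parity, old_data, new_data):
    """Single-block write: needs only the old parity and old data block
    (2 reads), then writes the new data and new parity (2 writes)."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)

def full_stripe_parity(blocks):
    """Large sequential write: parity computed from the new blocks alone."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = xor_bytes(parity, block)
    return parity

stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]   # made-up block contents
parity = full_stripe_parity(stripe)

new_block = b"\xff\xee"
updated = small_write_parity(parity, stripe[1], new_block)
```

Both routes give the same parity, but the small-write path never reads the untouched data blocks, which is where the "2 block reads + 2 block writes" count comes from.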


RAID Levels (Cont.)

RAID Level 4: (Cont.)

Provides higher I/O rates for independent block reads than Level 3

Provides higher transfer rates for large, multi-block reads & writes compared to a single disk.

Parity block is a bottleneck for independent block writes.


RAID Levels (Cont.)

RAID Level 5:

Block-level striping with distributed parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.

For example, with 5 disks, parity block for ith set of N blocks is stored on disk (i mod 5), with the data blocks stored on the other 4 disks.
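The five-disk example can be sketched as a small layout function. The function name is hypothetical, and listing the data disks in ascending order is an assumption; the slide fixes only which disk holds the parity block.

```python
def raid5_stripe_layout(i, n_disks=5):
    """For the i-th set of blocks, return (parity disk, data disks)."""
    parity_disk = i % n_disks
    data_disks = [d for d in range(n_disks) if d != parity_disk]
    return parity_disk, data_disks

# Parity rotates across all disks over consecutive sets of blocks.
parity_positions = [raid5_stripe_layout(i)[0] for i in range(10)]
```

Since the parity position cycles through every disk, parity writes are spread evenly instead of queuing on one dedicated parity disk as in Level 4.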


RAID Levels (Cont.)

RAID Level 5 (Cont.)

Provides the same benefits as Level 4, but minimizes the parity disk bottleneck.

Higher I/O rates than Level 4.

RAID Level 6:

P+Q Redundancy scheme; similar to Level 5, but uses additional information to guard against multiple disk failures.

Better reliability than Level 5 at a higher cost; not used as widely.


Choice of RAID Level

The “best” level for any particular application is not always clear.

Factors in choosing RAID level:

Monetary cost of disks, controllers or RAID storage devices

Reliability requirements

Performance requirements:

• Throughput vs. response time

• Short vs. long I/O

• Reads vs. writes

• During failure and rebuilding of a failed disk


Choice of RAID Level, Cont.

The book's analysis:

RAID 0 is only used when (data) reliability is not important.

Levels 2 and 4 are never used since they are subsumed by 3 and 5.

Level 3 is not used (typically) since bit-level striping forces single block reads to access all disks, wasting disk arm movement.

Sometimes vendors advertise a version of level 3 using byte-level striping.

Level 6 is rarely used since levels 1 and 5 offer adequate safety for almost all applications.

So competition is between 1 and 5 only.


Choice of RAID Level (Cont.)

Level 1 provides much better write performance than level 5:

Level 5 requires at least 2 block reads and 2 block writes to write a single block

Level 1 only requires 2 block writes

Level 1 has higher storage cost than level 5, however:

I/O requirements have increased greatly, e.g. for Web servers.

When enough disks have been bought to satisfy the required I/O rate, they often have spare storage capacity…

In that case there is often no extra monetary cost for Level 1!

Level 5 is preferred for applications with low update rate, and large amounts of data.

Level 1 is preferred for all other applications.


Other RAID Issues

Hardware vs. software RAID.

Local storage buffers.

Hot-swappable disks.

Spare components: disks, fans, power supplies, controllers.

Battery backup.