6/3/20141 psus cs 587 chapter 9, disks and files the storage hierarchy disks mechanics performance...

04/10/23 1PSU’s CS 587

Chapter 9, Disks and Files The Storage Hierarchy Disks

Mechanics Performance RAID

Disk Space Management Buffer Management Files of Records

Format of a Heap File Format of a Data Page Format of Records

04/10/23 2PSU’s CS 587

Learning objectives

Given disk parameters, compute storage needs and read times

Given a reminder about what each level means, be able to derive any figures on the RAID performance slide

Describe the pros and cons of alternative structures for files, pages and records

04/10/23 3PSU’s CS 587

A (Very) Simple Hardware Model

mainmemory

I/O bridge

bus interface

ALU

register file

CPU chip

system bus memory bus

disk controller

graphicsadapter

USBcontroller

mousekeyboard monitor

disk

I/O bus Expansion slots forother devices suchas network adapters.

04/10/23 4PSU’s CS 587

Storage Options

Level               Capacity            Access Time       Cost
Registers           1k-2k bytes         1 Tc              Way expensive
Caches              10s-1000s KBytes    2-20 Tc           $10 / MByte
Main Memory         GBytes              300-1000 Tc       $0.03 / MB (eBay)
Hard Disk / Flash   100s GBytes         10 ms = 30M Tc    $0.10 / GB (eBay)
Tape                Infinite            Forever           Way cheap

[Figure: log-scale plot of relative bandwidth improvement (1 to 10000) against relative latency improvement (1 to 100) for Processor, Memory, Network, and Disk, with a reference line where latency improvement = bandwidth improvement]

04/10/23 5PSU’s CS 587

Memory “Hierarchy”

Level                                    Capacity            Access Time       Cost
Registers                                1k-2k bytes         1 Tc              Way expensive
Cache - SRAM (may be multiple levels!)   10s-1000s KBytes    2-20 Tc           $10 / MByte
Memory - DRAM                            GBytes              300-1000 Tc       $0.03 / MB (eBay)
Disk                                     100s GBytes         10 ms = 30M Tc    $0.10 / GB (eBay)
Tape                                     Infinite            Forever           Way cheap

Staging between adjacent levels (upper levels are faster, lower levels are larger):

Between              Unit moved         Staged by        Xfer size
Registers / Cache    Instr. operands    prog./compiler   1-8 bytes
Cache / Memory       Blocks             cache cntl       8-128 bytes
Memory / Disk        Pages              OS               4K+ bytes
Disk / Tape          Files              user/operator    Gbytes

04/10/23 6PSU’s CS 587

Why Does “Hierarchy” Work?

Locality: programs access a relatively small portion of the address space at any instant of time.

Two different types:
• Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
• Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

Spring 2007 Portland State University 51

Principles in Computer Design

• Make the common case fast
  – e.g. Micro-VAX vs. VAX-11
• Amdahl’s Law (or law of diminishing returns)
  – After a while, making the common case fast doesn’t help any more
• Locality of reference
  – I might want to see you again soon (temporal)
  – I want to visit your neighbors when I visit you (spatial)
• Golden handcuffs
  – ISAs, like diamonds, are forever

04/10/23 7PSU’s CS 587

9.1 The Memory Hierarchy

Typical storage hierarchy as used by an RDBMS:
• Primary storage: main memory (RAM) for currently used data
• Secondary storage: disk or flash memory for the main database
  • http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf
  • What are other reasons besides cost to use disk?
• Tertiary storage: tapes, DVDs for archiving older versions of the data

Other factors:
• Caches at every level
• Controllers, protocols
• Network connections

04/10/23 8PSU’s CS 587

What is FLASH Memory, Anyway?

• Floating-gate transistor
  • Presence of charge => “0”
  • Erased electrically or by UV (EPROM)
• Performance
  • Reads like DRAM (~ns)
  • Writes like disk (~ms); a write is a complex operation

04/10/23 9PSU’s CS 587

Components of a Disk

[Figure: disk components, labeled: platters, spindle, disk head, arm assembly, arm movement, tracks, sector]

• Platters are always spinning (say, 120 rps).
• One head reads/writes at any one time.
• To read a record:
  • position arm (seek)
  • engage head
  • wait for data to spin by
  • read (transfer data)

04/10/23 10PSU’s CS 587

More terminology

• Each track is made up of fixed-size sectors.
• Page size is a multiple of sector size.
• A platter typically has data on both surfaces.
• All the tracks that you can reach from one position of the arm are called a cylinder (imaginary!).

[Figure: disk diagram, labeled: platters, spindle, disk head, arm assembly, arm movement, tracks, sector]

04/10/23 11PSU’s CS 587

Disks Technology Background

                 CDC Wren I, 1983    Seagate 373453, 2003               Improvement
Rotation speed   3600 RPM            15000 RPM                          4X
Capacity         0.03 GBytes         73.4 GBytes                        2500X
Tracks/Inch      800                 64000                              80X
Bits/Inch        9550                533,000                            60X
Platters         Three 5.25”         Four 2.5” (in 3.5” form factor)
Bandwidth        0.6 MBytes/sec      86 MBytes/sec                      140X
Latency          48.3 ms             5.7 ms                             8X
Cache            none                8 MBytes

04/10/23 12PSU’s CS 587

Typical Disk Drive Statistics (2008)

• Sector size: 512 bytes
• Seek time
  • Average: 4-10 ms
  • Track to track: 0.6-1.0 ms
• Average rotational delay: 3 to 5 ms (rotational speed 10,000 RPM to 5,400 RPM)
• Transfer time (sustained data rate): 0.3-0.1 msec per 8K page, or 25-75 MB/second
• Density: 12-18 GB/in^2

04/10/23 13PSU’s CS 587

Disk Capacity

• Capacity: maximum number of bits that can be stored.
  • Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes
• Capacity is determined by:
  • Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
  • Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
  • Areal density (bits/in^2): product of recording and track density.
• Modern disks partition tracks into disjoint subsets called recording zones
  • Each track in a zone has the same number of sectors, determined by the circumference of the innermost track.
  • Each zone has a different number of sectors/track

04/10/23 14PSU’s CS 587

Cost of Accessing Data on Disk

• Time to access (read/write) a disk block: Taccess = Tavg seek + Tavg rotation + Tavg transfer
  • seek time (moving arms to position disk head on track)
  • rotational delay (waiting for block to rotate under head): half a rotation, on average
  • transfer time (actually moving data to/from disk surface)
• Key to lower I/O cost: reduce seek/rotation delays! No way to avoid transfer time…
• Textbook measures query cost by NUMBER of page I/Os
  • Implies all I/Os have the same cost, and that CPU time is free
  • This is a common simplification.
• Real DBMSs (in the optimizer) would consider sequential vs. random disk reads
  • Because sequential reads are much faster
  • and would count CPU time.

04/10/23 15PSU’s CS 587

Disk Parameters Practice

A 2-platter disk rotates at 7,200 rpm. Each track contains 256KB.

How many cylinders are required to store an 8 Gigabyte file?

What is the average rotational delay, in milliseconds?
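One way to work the two practice questions, as a sketch rather than the official answer. It assumes that both surfaces of each platter hold data (so a cylinder has 4 tracks), that 1 KB = 2^10 bytes, and that 1 Gigabyte = 2^30 bytes; the function names are illustrative.

```python
import math

def cylinders_needed(file_bytes, platters, track_bytes, surfaces_per_platter=2):
    """Cylinders needed to hold file_bytes, assuming every surface stores data."""
    tracks_per_cylinder = platters * surfaces_per_platter
    cylinder_bytes = tracks_per_cylinder * track_bytes
    return math.ceil(file_bytes / cylinder_bytes)

def avg_rotational_delay_ms(rpm):
    """Average rotational delay is half of one full rotation."""
    ms_per_rotation = 60_000 / rpm
    return ms_per_rotation / 2

print(cylinders_needed(8 * 2**30, platters=2, track_bytes=256 * 2**10))  # 8192
print(round(avg_rotational_delay_ms(7200), 2))                           # 4.17
```

With decimal units (1 GB = 10^9 bytes, as on the Disk Capacity slide) the cylinder count would come out smaller, so the answer depends on which convention is intended.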

04/10/23 16PSU’s CS 587

Disk Access Time Example

Given:
• Rotational rate = 7,200 RPM
• Average seek time = 9 ms
• Avg # sectors/track = 400

Derived:
• Tavg rotation = 1/2 x (60 secs / 7200 rotations) x 1000 ms/sec ≈ 4 ms
• Tavg transfer = (60 secs / 7200 rotations) x (1/400 rotation per sector) x 1000 ms/sec ≈ 0.02 ms
• Taccess = 9 ms + 4 ms + 0.02 ms ≈ 13 ms

Important points:
• Access time is dominated by seek time and rotational latency.
• First bit in a sector is the most expensive, the rest are free.
• SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
  • Disk is about 40,000 times slower than SRAM,
  • 2,500 times slower than DRAM.
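The derivation in this example can be checked with a few lines of Python. This is a minimal sketch of the formula Taccess = Tavg seek + Tavg rotation + Tavg transfer for reading one sector; the function and parameter names are made up for illustration.

```python
def access_time_ms(rpm, avg_seek_ms, sectors_per_track):
    """Return (rotation, transfer, total) in ms for reading a single sector."""
    ms_per_rotation = 60_000 / rpm
    t_rotation = ms_per_rotation / 2                  # half a rotation on average
    t_transfer = ms_per_rotation / sectors_per_track  # one sector passes under the head
    return t_rotation, t_transfer, avg_seek_ms + t_rotation + t_transfer

rot, xfer, total = access_time_ms(rpm=7200, avg_seek_ms=9, sectors_per_track=400)
print(f"rotation={rot:.3f} ms, transfer={xfer:.4f} ms, total={total:.3f} ms")
```

The unrounded rotation term is about 4.17 ms, so the slide's 4 ms is a rounding; seek and rotation together account for nearly all of the total.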

04/10/23 17PSU’s CS 587

So, How far away is the data?

Storage level           Clock ticks    Analogy: where the data is    Analogy: time to fetch
Registers               1              My head                       1 min
On-chip cache           2              This room
On-board cache          10             This campus                   10 min
Memory                  100            Sacramento                    1.5 hr
Disk                    10^6           Pluto                         2 years
Tape / optical robot    10^9           Andromeda                     2,000 years

From http://research.microsoft.com/~gray/papers/AlphaSortSigmod.doc

04/10/23 18PSU’s CS 587

Block, page and record sizes

• Block – According to text, smallest unit of I/O.
• Page – often used in place of block.
• “Typical” record size: commonly hundreds, sometimes thousands of bytes
  • Unlike the toy records in textbooks
• “Typical” page size: 4K, 8K

04/10/23 19PSU’s CS 587

Effect of page size on read time

• Suppose rotational delay is 4 ms, average seek time is 6 ms, and transfer speed is 0.5 msec/8K.
• This graph shows the time required to read 1 Gig of data for different page sizes.

[Figure: time to read 1 Gig in minutes (y-axis, 0 to 25) against page size in multiples of 8K (x-axis, 1 to 16)]

04/10/23 20PSU’s CS 587

Why the difference?

What accounts for the difference in times to read one Gigabyte on the previous graph?
• Assume: rotational delay 4 ms, average seek time 6 ms, transfer speed 0.5 msec/8K
• Transfer time
  • (2^30/2^13 8K blocks) x (0.5 msec/8K) = 66 secs ~= one minute
• How many reads?
  • Page size 8K: there are 2^30/2^13 = 2^17 = 128K reads
  • Page size 64K: there are 1/8th that many reads = 16K reads
• Time taken by rotational delays and seeks
  • Each read requires a rotational delay and a seek, totalling 10 msec.
  • 8K: (128K reads) x (10 msec/read) = 1,311 secs ~= 22 minutes
  • 64K: 1/8 of that, or 164 secs ~= 3 minutes
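The arithmetic on this slide can be reproduced for any page size with a short Python sketch. It assumes, as the slide does, that every page read pays one seek plus one rotational delay and that transfer time depends only on the amount of data; the function name and defaults are illustrative.

```python
def read_time_seconds(total_bytes, page_bytes, seek_ms=6.0,
                      rotation_ms=4.0, transfer_ms_per_8k=0.5):
    """Return (positioning_seconds, transfer_seconds) for reading total_bytes."""
    reads = total_bytes // page_bytes
    positioning = reads * (seek_ms + rotation_ms) / 1000        # one seek + rotation per read
    transfer = (total_bytes / 8192) * transfer_ms_per_8k / 1000  # independent of page size
    return positioning, transfer

GIG = 2**30
for multiple in (1, 8, 16):
    pos, xfer = read_time_seconds(GIG, multiple * 8192)
    print(f"{multiple * 8}K pages: positioning {pos:.0f} s, transfer {xfer:.0f} s")
```

Only the positioning term shrinks as pages grow, which is why the curve on the graph flattens toward the roughly one-minute transfer time.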

04/10/23 21PSU’s CS 587

Moral of the Story

• As page size increases, read (and write) time reduces to transfer time, a big savings.
• So why not use a huge page size?
  • Wastes memory space if you don’t need all that is read
  • Wastes read time if you don’t need all that is read
• What applications could use a large page size?
  • Those that sequentially access data
• The problem with a small page size is that pages get scattered across the disk.
• Turn the page….

04/10/23 22PSU’s CS 587

Faster I/O, even with a small page size

• Even if the page size is small, you can achieve fast I/O by storing a file’s data as follows:
  • Consecutive pages on same track, followed by
  • Consecutive tracks on same cylinder, followed by
  • Consecutive cylinders adjacent to each other
  • First two incur no seek time or rotational delay; the seek for the third is only one track.
• What is saved with this storage pattern?
• How is this storage pattern obtained?
  • Disk defragmenter and its relatives/predecessors
    • Also places frequently used files near the spindle
• When data is in this storage pattern, the application can do sequential I/O
  • Otherwise it must do random I/O

04/10/23 23PSU’s CS 587

More Hardware Issues

• Disk Controllers
  • Interface from disks to bus
  • Checksums, remap bad sectors, driver mgt, etc
• Interface protocols and MB per second xfer rates
  • IDE/EIDE/ATA/PATA, SATA: 133
  • SCSI: 640
    • BUT for a single device, SCSI is inferior
  • Faster network technologies such as Fibre Channel
• Storage Area Networks (SANs)
  • Disk farm networked to servers
  • Servers can be heterogeneous – a primary advantage
  • Centralized management

9. Disks

04/10/23 24PSU’s CS 587

Dependability

• Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics:
  1. Mean Time To Failure (MTTF) measures reliability
  2. Failures In Time (FIT) = 1/MTTF, the rate of failures
     • Traditionally reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures service interruption
  • Mean Time Between Failures (MTBF) = MTTF + MTTR
• Module availability measures service as it alternates between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
• Module availability = MTTF / (MTTF + MTTR)

04/10/23 25PSU’s CS 587

Example calculating reliability

If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules

Example: Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk) 1 disk controller (0.5M hour MTTF) and 1 power supply (0.2M hour MTTF)

04/10/23 26PSU’s CS 587

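Example calculating reliability

Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF) and 1 power supply (0.2M hour MTTF):

\begin{align*}
\text{FailureRate} &= 10 \times (1/1{,}000{,}000) + 1/500{,}000 + 1/200{,}000 \\
&= (10 + 2 + 5)/1{,}000{,}000 \\
&= 17/1{,}000{,}000 \\
&= 17{,}000 \ \text{FIT} \\
\text{MTTF} &= 1{,}000{,}000{,}000/17{,}000 \approx 59{,}000 \ \text{hours}
\end{align*}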
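The same calculation as a small Python sketch, assuming exponentially distributed lifetimes so that component failure rates simply add. The function name is illustrative.

```python
def system_fit_and_mttf(component_mttfs_hours):
    """Sum per-component failure rates; return (FIT, system MTTF in hours)."""
    failure_rate_per_hour = sum(1.0 / mttf for mttf in component_mttfs_hours)
    fit = failure_rate_per_hour * 1e9      # failures per billion hours
    mttf_hours = 1.0 / failure_rate_per_hour
    return fit, mttf_hours

# 10 disks, 1 disk controller, 1 power supply
components = [1_000_000] * 10 + [500_000, 200_000]
fit, mttf = system_fit_and_mttf(components)
print(f"FIT = {fit:.0f}, MTTF = {mttf:.0f} hours")
```

The system MTTF comes out far below that of any single component, because every component is a potential point of failure.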

04/10/23 27PSU’s CS 587

9.2 RAID [587]

• Disk Array: Arrangement of several disks that gives abstraction of a single, large disk.
• Goals: Increase performance and reliability.
• Two main techniques:
  • Data striping: Data is partitioned; size of a partition is called the striping unit. Partitions are distributed over several disks.
  • Redundancy: More disks => more failures. Redundant information allows reconstruction of data if a disk fails.

9.Disks

04/10/23 28PSU’s CS 587

Data Striping

• CPUs go fast, disks don’t. How can disks keep up?
• CPUs do work in parallel. Can disks?
• Answer: Partition data across D disks (see next slide).
• If partition unit is a page:
  • A single page I/O request is no faster
  • Multiple I/O requests can run at aggregated bandwidth
• Number of pages in a partition unit is called the depth of the partition.
• Contrary to text, partition units of a bit are almost never used and partition units of a byte are rare.

04/10/23 29PSU’s CS 587

Data Striping (RAID Level 0)

Disk 0    Disk 1    Disk 2    ...    Disk D-1
0         1         2         ...    D-1
D         D+1       D+2       ...    2D-1
2D        2D+1      2D+2      ...    3D-1
…         …         …                …
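The layout in this picture is a round-robin mapping from logical block number to disk. A minimal sketch, assuming a striping unit of one block and D disks numbered 0 to D-1 (the function name is illustrative):

```python
def locate_block(block_no, num_disks):
    """Map a logical block number to (disk, row) under block striping."""
    disk = block_no % num_disks    # blocks go round-robin across the disks
    row = block_no // num_disks    # position of the block on that disk
    return disk, row

D = 4
for block in (0, 1, D, 2 * D + 1, 3 * D - 1):
    print(block, locate_block(block, D))
```

Blocks i, i+D, i+2D, … all land on disk i, which is why D consecutive blocks can be transferred in parallel.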

04/10/23 30PSU’s CS 587

Redundancy

• Striping is seductive, but remember reliability!
  • MTTF of a disk is about 6 years
  • If we stripe over 24 disks, what is MTTF?
• Solution: redundancy
  – Parity: corrects single failures
  – Others: detect where the failure is, and correct multiple failures
  – But failure location is provided by controller
  – Redundancy may require more than one check bit
• Redundancy makes writes slower – why?
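A small sketch of both points on this slide. The MTTF estimate assumes independent disks with exponentially distributed lifetimes, so failure rates add. The parity part shows how a single XOR parity block rebuilds one lost block once the controller has reported which disk failed. The block contents and helper name are made up for illustration.

```python
from functools import reduce

# MTTF of a 24-disk stripe with no redundancy: any one failure loses data.
array_mttf_years = 6 / 24          # 0.25 years, i.e. about 3 months

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]   # one block per data disk
parity = xor_blocks(data)                          # stored on the check disk

# Suppose disk 1 fails: XOR the surviving data blocks with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])   # True
```

Parity alone cannot say which disk is wrong; it only recovers the contents once the failed disk is known, which matches the remark that the failure location is provided by the controller.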

04/10/23 31PSU’s CS 587

RAID Levels

• Standardized by SNIA (www.snia.org)
• Vary in practice
• For each level, decide (assume single user)
  • Number of disks required to hold D disks of data.
  • Speedup s (compared to 1 disk) for
    • S/R (Sequential/Random) R/W (Reads/Writes)
    • Random: each I/O is one block
    • Sequential: each I/O is one stripe
  • Number of disks/blocks that can fail w/o data loss
• Level 0: Block Striped, No redundancy
  • Picture is 2 slides back

04/10/23 32PSU’s CS 587

JBOD, RAID Level 1

• JBOD: Just a Bunch of Disks

[Figure: JBOD and Level 1 disk layouts; each disk holds its own blocks 0, 1, 2, 3, …]

• Level 1: Mirrored (two identical JBODs – no striping)

04/10/23 33PSU’s CS 587

RAID Level 0+1: Stripe + Mirror

Striped copy:

Disk 0    Disk 1    Disk 2    ...    Disk D-1
0         1         2         ...    D-1
D         D+1       D+2       ...    2D-1
2D        2D+1      2D+2      ...    3D-1
…         …         …                …

Mirror copy (same block layout):

Disk D    Disk D+1  Disk D+2  ...    Disk 2D-1
0         1         2         ...    D-1
D         D+1       D+2       ...    2D-1
2D        2D+1      2D+2      ...    3D-1
…         …         …                …

04/10/23 34PSU’s CS 587

RAID Level 4

• Block-Interleaved Parity (not common)
  – One check disk, uses one bit of parity.
  – How to tell if there is a failure, or which disk failed?
  – Read-modify-write
  – Disk D is a bottleneck

Disk 0    Disk 1    Disk 2    ...    Disk D-1    Disk D
0         1         2         ...    D-1         P
D         D+1       D+2       ...    2D-1        P
2D        2D+1      2D+2      ...    3D-1        P
…         …         …                …           …

04/10/23 35PSU’s CS 587

RAID Level 5

• Level 5: Block-Interleaved Distributed Parity

Disk 0    Disk 1    ...    Disk D-2    Disk D-1    Disk D
0         1         ...    D-2         D-1         P
D         D+1       ...    2D-2        P           2D-1
2D        2D+1      ...    P           3D-2        3D-1
…         …                …           …           …

• Level 6: Like 5, but 2 parity bits/disks
  • Can survive loss of 2 disks/blocks

04/10/23 36PSU’s CS 587

Notation on the next slide

• #Disks
  • Number of disks required to hold D disks worth of data using this RAID level
• Read/write speedup of blocks in a single file:
  • SR: Sequential read
  • RR: Random read
  • SW: Sequential write
  • RW: Random write
• Failure Tolerance
  • How many disks can fail without loss of data
• Internal Data
  • s = blocks transferred in the time it takes to transfer one block of data from one disk.
• These numbers are theoretical!
  • YMMV…and vary significantly!

04/10/23 37PSU’s CS 587

RAID Performance

Level   #Disks   SR speedup   RR speedup   SW speedup   RW speedup   Failure Tolerance
0       D        s=D          1≤s≤D        s=D          1≤s≤D        0
1       2D       s=2          s=2          s=1**        s=1**        D*
0+1     2D       s=2D         2≤s≤2D       s=D**        1≤s≤D**      D*
5       D+1      s=D          1≤s≤D        s=D          Varies       1

* If no two are copies of each other
** Note: can’t write both mirrors at once – why?

04/10/23 38PSU’s CS 587

Small Writes on Levels 4 and 5

Levels 4 and 5 require a read-modify-write cycle for all writes, since the parity block must be read and modified.

On small writes this can be very expensive

This is another justification for Log Based File Systems (see your OS course)
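The read-modify-write cycle can be sketched as follows, assuming XOR parity: the new parity is the old parity with the old data XORed out and the new data XORed in. A small write therefore costs two reads (old data, old parity) and two writes (new data, new parity). The helper names and block contents are illustrative.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_data, old_parity, new_data):
    """New parity after overwriting one data block in a stripe."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)

# A stripe of three data blocks and its parity block.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])

# Overwrite block 1 without reading blocks 0 and 2.
new_block = b"\xff\x00"
new_parity = small_write_parity(stripe[1], parity, new_block)

# Same result as recomputing parity over the whole updated stripe.
full = xor_bytes(xor_bytes(stripe[0], new_block), stripe[2])
print(new_parity == full)   # True
```

The shortcut avoids touching the other data disks, but the parity block is still read and rewritten on every small write, which is the expense the slide refers to.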

04/10/23 39PSU’s CS 587

Which RAID Level is best?

• If data loss is not a problem: Level 0
• If storage cost is not a problem: Level 0+1
• Else: Level 5
• Software support
  • Linux: 0, 1, 4, 5 (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html)
  • Windows: 0, 1, 5 (http://www.techimo.com/articles/index.pl?photo=149)

04/10/23 40PSU’s CS 587

9.3, 9.4.1: Covered earlier

9.Disks

04/10/23 41PSU’s CS 587

9.4.2 DBMS vs. OS File System

• OS does disk space & buffer mgmt: why not let OS manage these tasks? [715]
  • Differences in OS support: portability issues
  • Some limitations, e.g., files can’t span disks.
  • Buffer management in DBMS requires ability to:
    • pin a page in buffer pool, force a page to disk (important for implementing CC & recovery),
    • adjust replacement policy, and pre-fetch pages based on access patterns in typical DB operations.
      • Sometimes MRU is the best replacement policy: for example, for a scan or a loop that does not fit.
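The MRU remark can be illustrated with a toy buffer pool simulation. This is only a sketch under simple assumptions: no pinning, no dirty pages, and a loop over 5 pages with 4 buffer frames. The function name is illustrative.

```python
from collections import OrderedDict

def count_misses(references, frames, policy):
    """Count page faults for a reference string; policy is 'LRU' or 'MRU'."""
    pool = OrderedDict()   # keys ordered from least to most recently used
    misses = 0
    for page in references:
        if page in pool:
            pool.move_to_end(page)   # mark as most recently used
            continue
        misses += 1
        if len(pool) >= frames:
            # last=False evicts the least recently used page, last=True the most recent
            pool.popitem(last=(policy == "MRU"))
        pool[page] = True
    return misses

loop = list(range(5)) * 4            # scan pages 0..4 four times
print(count_misses(loop, frames=4, policy="LRU"))   # 20
print(count_misses(loop, frames=4, policy="MRU"))   # 8
```

Under LRU every reference misses, because the page about to be needed is always the one just evicted (sequential flooding); MRU keeps most of the loop resident.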

9.Disks

04/10/23 42PSU’s CS 587

9.5 Files of Records

• Page or block is OK when doing I/O, but higher levels of DBMS operate on records, and files of records.
• FILE: A collection of pages, each containing a collection of records. Must support:
  • insert/delete/modify record
  • read a particular record (specified using record id)
  • scan all records (possibly with some conditions on the records to be retrieved)

9.Disks

04/10/23 43PSU’s CS 587

9.5.1 Unordered (Heap) Files

Simplest file structure contains records in no particular order.

As file grows and shrinks, disk pages are allocated and de-allocated.

To support record level operations, we must:
• keep track of the pages in a file
• keep track of free space on pages
• keep track of the records on a page

There are at least two alternatives for keeping track of heap files.

9.Disks

04/10/23 44PSU’s CS 587

Heap File Implemented as a List

• The header page id and heap file name must be stored someplace.
• Each page contains 2 `pointers’ plus data.

[Figure: a header page heads two linked lists of data pages, one of full pages and one of pages with free space]

9.Disks

04/10/23 45PSU’s CS 587

Heap File Using a Page Directory

• The entry for a page can include the number of free bytes on the page.
• The directory is a collection of pages; linked list implementation is just one alternative.
  • Much smaller than linked list of all HF pages!

[Figure: a header page starts the DIRECTORY, whose entries point to Data Page 1, Data Page 2, …, Data Page N]

9.Disks

04/10/23 46PSU’s CS 587

Comparing Heap File Implementations

• Assume
  • 100 directory entries per page
  • U full pages, E pages with free space
  • D directory pages
  • Then D = (U+E)/100
  • Note that D is two orders of magnitude less than U or E
• Cost to find a page with enough free space
  • List: E/2
  • Directory: (D/2) + 1
• Cost to move a page from Full to Free (e.g., when a record is deleted)
  • List: 3, Directory: 1
• Can you think of some other operations?

04/10/23 47PSU’s CS 587

9.6 Page Formats: Fixed Length Records

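[Figure: two page layouts for fixed-length records.
 PACKED: slots 1 to N hold records contiguously, followed by free space; the end of the page stores N, the number of records.
 UNPACKED, BITMAP: slots 1 to M, some of them empty; the end of the page stores a bitmap with one bit per slot (M ... 3 2 1) and M, the number of slots.]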

9.Disks

04/10/23 48PSU’s CS 587

Packed vs Unpacked Page Formats

• Record ID (RID, TID) = (page#, slot#), in all page formats
  • Note that indexes are filled with RIDs
  • Data entries in alternatives 2 and 3 are (key, RID..)
• Packed
  • Stores more records
  • RIDs change when a record is deleted
    • This may not be acceptable.
• Unpacked
  • RID does not change
  • Less data movement when deleting

04/10/23 49PSU’s CS 587

Page Formats: Variable Length Records

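[Figure: page i holding variable-length records with Rid = (i,1), Rid = (i,2), …, Rid = (i,N). The SLOT DIRECTORY at the end of the page has one entry per slot (N … 2 1; the example entries shown are 20, 16, 24), the number of slots (N), and a pointer to the start of free space.]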

9.Disks

04/10/23 50PSU’s CS 587

Slotted Page Format

• Intergalactic standard, for fixed-length records also.
• How to deal with free space fragmentation?
  • Pack records, lazily
  • Note that RIDs don’t change
• How are updates handled which expand the size of a record?
  • Forwarding flag to new location

http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html

postgresql-8.3.1\src\include\storage\bufpage.h
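A toy slotted page in Python, to make the "pack lazily, RIDs don't change" idea concrete. This is a simplified sketch, not the PostgreSQL layout referenced above: the slot directory is kept in a Python list rather than inside the page bytes, its space is not charged against the page, and forwarding of grown records is left out. The class and method names are illustrative.

```python
class SlottedPage:
    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []        # slot directory: (offset, length), or None if deleted
        self.free_start = 0    # pointer to start of free space

    def insert(self, record):
        """Store a record and return its slot number (the slot# part of the RID)."""
        if self.free_start + len(record) > len(self.data):
            self.compact()     # lazy packing: only when space runs out
        if self.free_start + len(record) > len(self.data):
            raise ValueError("page full")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        for slot_no, entry in enumerate(self.slots):
            if entry is None:  # reuse a deleted slot
                self.slots[slot_no] = (offset, len(record))
                return slot_no
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1

    def read(self, slot_no):
        offset, length = self.slots[slot_no]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot_no):
        self.slots[slot_no] = None   # the hole is reclaimed later by compact()

    def compact(self):
        """Slide live records together; slot numbers, and so RIDs, stay the same."""
        packed = bytearray(len(self.data))
        pos = 0
        for slot_no, entry in enumerate(self.slots):
            if entry is None:
                continue
            offset, length = entry
            packed[pos:pos + length] = self.data[offset:offset + length]
            self.slots[slot_no] = (pos, length)
            pos += length
        self.data = packed
        self.free_start = pos
```

Because callers hold slot numbers rather than byte offsets, compaction can move records freely inside the page without invalidating any RID stored in an index.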

04/10/23 51PSU’s CS 587

9.7 Record Formats: Fixed Length

• Information about field types same for all records in a file; stored in system catalogs.
• Finding i’th field does not require scan of record.

[Figure: a record with fields F1 F2 F3 F4 of lengths L1 L2 L3 L4, starting at base address B; the address of F3 is B+L1+L2]

9.Disks

04/10/23 52PSU’s CS 587

Record Formats: Variable Length

• Two alternative formats (# fields is fixed):
  1. Fields delimited by special symbols: a field count (4), then F1 $ F2 $ F3 $ F4 $
  2. Array of field offsets: an array of offsets at the start of the record, then F1 F2 F3 F4
• Second offers direct access to i’th field, efficient storage of nulls (special don’t know value); small directory overhead.
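A sketch of the second format, the array of field offsets. The encoding details here are assumptions for illustration: n+1 little-endian 2-byte offsets at the front of the record, followed by the field bytes, with a zero-length field standing in for a null.

```python
import struct

def encode_record(fields):
    """Header of n+1 two-byte offsets, then the field bytes back to back."""
    n = len(fields)
    header_len = 2 * (n + 1)
    offsets = [header_len]
    for field in fields:
        offsets.append(offsets[-1] + len(field))   # end of this field = start of next
    return struct.pack(f"<{n + 1}H", *offsets) + b"".join(fields)

def get_field(record, i):
    """Fetch field i (0-based) directly, without scanning the earlier fields."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    return record[start:end]

rec = encode_record([b"Ann", b"", b"Portland"])
print(get_field(rec, 0), get_field(rec, 1), get_field(rec, 2))
```

Reading field i takes two header lookups regardless of how long the earlier fields are, which is the direct access the slide mentions; with the delimiter format the record would have to be scanned from the start.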

9.Disks
