6/3/20141 psus cs 587 chapter 9, disks and files the storage hierarchy disks mechanics performance...
TRANSCRIPT
04/10/23 1PSU’s CS 587
Chapter 9, Disks and Files
• The Storage Hierarchy
• Disks
– Mechanics
– Performance
– RAID
• Disk Space Management
• Buffer Management
• Files of Records
– Format of a Heap File
– Format of a Data Page
– Format of Records
04/10/23 2PSU’s CS 587
Learning objectives
Given disk parameters, compute storage needs and read times
Given a reminder about what each level means, be able to derive any figures on the RAID performance slide
Describe the pros and cons of alternative structures for files, pages and records
04/10/23 3PSU’s CS 587
A (Very) Simple Hardware Model
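[Figure: a CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge. The I/O bridge connects over the memory bus to main memory and over the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]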
04/10/23 4PSU’s CS 587
Storage Options
Level               Capacity            Access Time      Cost
Registers           1k-2k bytes         1 Tc             Way expensive
Caches              10s-1000s KBytes    2-20 Tc          $10 / MByte
Main Memory         GBytes              300-1000 Tc      $0.03 / MB (eBay)
Hard Disk / Flash   100s GBytes         10 ms = 30M Tc   $0.10 / GB (eBay)
Tape                Infinite            Forever          Way cheap
[Figure: log-scale plot of relative bandwidth improvement (1 to 10000) against relative latency improvement (1 to 100) for Processor, Memory, Network, and Disk, with a reference line where latency improvement = bandwidth improvement.]
04/10/23 5PSU’s CS 587
Memory “Hierarchy”
Upper levels are faster; lower levels are larger.
Level                                     Capacity            Access Time      Cost
Registers                                 1k-2k bytes         1 Tc             Way expensive
Cache - SDRAM (may be multiple levels!)   10s-1000s KBytes    2-20 Tc          $10 / MByte
Memory - DRAM                             GBytes              300-1000 Tc      $0.03 / MB (eBay)
Disk                                      100s GBytes         10 ms = 30M Tc   $0.10 / GB (eBay)
Tape                                      Infinite            Forever          Way cheap
Staging between adjacent levels:
Levels                 Unit moved         Managed by       Xfer size
Registers <-> Cache    Instr. operands    prog./compiler   1-8 bytes
Cache <-> Memory       Blocks             cache cntl       8-128 bytes
Memory <-> Disk        Pages              OS               4K+ bytes
Disk <-> Tape          Files              user/operator    Gbytes
04/10/23 6PSU’s CS 587
Why Does “Hierarchy” Work?
• Locality: programs access a relatively small portion of the address space at any instant of time
• Two different types
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
Spring 2007 Portland State University 51
Principles in Computer Design
• Make the common case fast
– e.g. Micro-VAX vs. VAX-11
• Amdahl’s Law (or law of diminishing returns)
– After a while, making the common case fast doesn’t help any more
• Locality of reference
– I might want to see you again soon (temporal)
– I want to visit your neighbors when I visit you (spatial)
• Golden handcuffs
– ISAs, like diamonds, are forever
04/10/23 7PSU’s CS 587
9.1 The Memory Hierarchy
Typical storage hierarchy as used by an RDBMS:
• Primary storage: main memory (RAM) for currently used data
• Secondary storage: disk, flash memory for the main database
– http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf
– What are other reasons besides cost to use disk?
• Tertiary storage: tapes, DVDs for archiving older versions of the data
• Other factors
– Caches at every level
– Controllers, protocols
– Network connections
04/10/23 8PSU’s CS 587
What is FLASH Memory, Anyway?
• Floating gate transistor
– Presence of charge => “0”
– Erase electrically or with UV (EPROM)
• Performance
– Reads like DRAM (~ns)
– Writes like DISK (~ms). Write is a complex operation
04/10/23 9PSU’s CS 587
Components of a Disk
• Platters are always spinning (say, 120 rps).
• One head reads/writes at any one time.
• To read a record:
– position arm (seek)
– engage head
– wait for data to spin by
– read (transfer data)
[Figure: disk assembly with labels: platters, spindle, disk head, arm assembly, arm movement, tracks, sector.]
04/10/23 10PSU’s CS 587
More terminology
• Each track is made up of fixed-size sectors.
• Page size is a multiple of sector size.
• A platter typically has data on both surfaces.
• All the tracks that you can reach from one position of the arm are called a cylinder (imaginary!).
[Figure: disk assembly with labels: platters, spindle, disk head, arm assembly, arm movement, tracks, sector.]
04/10/23 11PSU’s CS 587
Disks Technology Background
                 CDC Wren I, 1983    Seagate 373453, 2003
Rotation speed   3600 RPM            15000 RPM (4X)
Capacity         0.03 GBytes         73.4 GBytes (2500X)
Tracks/Inch      800                 64000 (80X)
Bits/Inch        9550                533,000 (60X)
Platters         Three 5.25”         Four 2.5” (in 3.5” form factor)
Bandwidth        0.6 MBytes/sec      86 MBytes/sec (140X)
Latency          48.3 ms             5.7 ms (8X)
Cache            none                8 MBytes
04/10/23 12PSU’s CS 587
Typical Disk Drive Statistics (2008)
• Sector size: 512 bytes
• Seek time
– Average: 4-10 ms
– Track to track: 0.6-1.0 ms
• Average rotational delay: 3 to 5 ms (rotational speed 10,000 RPM to 5,400 RPM)
• Transfer time (sustained data rate): 0.3-0.1 msec per 8K page, or 25-75 MB/second
• Density: 12-18 GB/in^2
04/10/23 13PSU’s CS 587
Disk Capacity
• Capacity: maximum number of bits that can be stored.
– Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes
• Capacity is determined by:
– Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
– Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
– Areal density (bits/in^2): product of recording and track density.
• Modern disks partition tracks into disjoint subsets called recording zones
– Each track in a zone has the same number of sectors, determined by the circumference of the innermost track.
– Each zone has a different number of sectors/track
04/10/23 14PSU’s CS 587
Cost of Accessing Data on Disk
• Time to access (read/write) a disk block: Taccess = Tavg seek + Tavg rotation + Tavg transfer
– seek time (moving arms to position disk head on track)
– rotational delay (waiting for block to rotate under head): half a rotation, on average
– transfer time (actually moving data to/from disk surface)
• Key to lower I/O cost: reduce seek/rotation delays! No way to avoid transfer time…
• Textbook measures query cost by NUMBER of page I/Os
– Implies all I/Os have the same cost, and that CPU time is free
– This is a common simplification.
• Real DBMSs (in the optimizer) would consider sequential vs. random disk reads, because sequential reads are much faster, and would count CPU time.
04/10/23 15PSU’s CS 587
Disk Parameters Practice
A 2-platter disk rotates at 7,200 rpm. Each track contains 256KB.
• How many cylinders are required to store an 8 Gigabyte file?
• What is the average rotational delay, in milliseconds?
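One way to check your answers is a short script. This is only a sketch under assumptions the slide does not state: both surfaces of each platter hold data (so a cylinder has 4 tracks), 1 KB = 2^10 bytes, and 1 Gigabyte = 2^30 bytes. The function names are made up for illustration.

```python
# Assumptions (not on the slide): both surfaces of every platter are used,
# 1 KB = 2**10 bytes and 1 Gigabyte = 2**30 bytes.
PLATTERS = 2
SURFACES_PER_PLATTER = 2
TRACK_BYTES = 256 * 2**10
FILE_BYTES = 8 * 2**30
RPM = 7200


def cylinders_needed(file_bytes, track_bytes, tracks_per_cylinder):
    """Number of whole cylinders needed to hold file_bytes."""
    cylinder_bytes = track_bytes * tracks_per_cylinder
    return -(-file_bytes // cylinder_bytes)  # ceiling division


def avg_rotational_delay_ms(rpm):
    """Average rotational delay is half of one full rotation."""
    ms_per_rotation = 60_000 / rpm
    return ms_per_rotation / 2


tracks_per_cylinder = PLATTERS * SURFACES_PER_PLATTER
print(cylinders_needed(FILE_BYTES, TRACK_BYTES, tracks_per_cylinder))
print(round(avg_rotational_delay_ms(RPM), 2))
```

With different assumptions (for example, only one usable surface per platter) the cylinder count changes, so state your assumptions when you answer.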
04/10/23 16PSU’s CS 587
Disk Access Time Example
Given:
• Rotational rate = 7,200 RPM
• Average seek time = 9 ms
• Avg # sectors/track = 400
Derived:
• Tavg rotation = 1/2 x (60 secs/7200 RPM) x 1000 ms/sec ≈ 4 ms
• Tavg transfer = 60/7200 RPM x 1/400 secs/track x 1000 ms/sec ≈ 0.02 ms
• Taccess = 9 ms + 4 ms + 0.02 ms ≈ 13 ms
Important points:
• Access time is dominated by seek time and rotational latency.
• First bit in a sector is the most expensive, the rest are free.
• SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
– Disk is about 40,000 times slower than SRAM,
– 2,500 times slower than DRAM.
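The derivation above can be sketched in a few lines of Python. The function name is hypothetical. Note that the slide rounds the rotational delay to 4 ms; without rounding the total comes out slightly higher.

```python
def access_time_ms(rpm, avg_seek_ms, sectors_per_track):
    """Return (seek, rotation, transfer, total) in milliseconds for one sector."""
    ms_per_rotation = 60_000 / rpm
    t_rotation = ms_per_rotation / 2                  # half a rotation on average
    t_transfer = ms_per_rotation / sectors_per_track  # one sector passes under the head
    total = avg_seek_ms + t_rotation + t_transfer
    return avg_seek_ms, t_rotation, t_transfer, total


seek, rotation, transfer, total = access_time_ms(7200, 9.0, 400)
print(f"rotation {rotation:.2f} ms, transfer {transfer:.3f} ms, total {total:.2f} ms")
```

Seek plus rotation accounts for almost all of the total, which is the first "important point" on the slide.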
04/10/23 17PSU’s CS 587
So, How far away is the data?
Storage level          Clock ticks   How far away    Travel time
Registers              1             My head         1 min
On-chip cache          2             This room
On-board cache         10            This campus     10 min
Memory                 100           Sacramento      1.5 hr
Disk                   10^6          Pluto           2 years
Tape / optical robot   10^9          Andromeda       2,000 years
From http://research.microsoft.com/~gray/papers/AlphaSortSigmod.doc
04/10/23 18PSU’s CS 587
Block, page and record sizes
• Block – according to text, smallest unit of I/O.
• Page – often used in place of block.
• “Typical” record size: commonly hundreds, sometimes thousands of bytes
– Unlike the toy records in textbooks
• “Typical” page size: 4K, 8K
04/10/23 19PSU’s CS 587
Effect of page size on read time
• Suppose rotational delay is 4 ms, average seek time 6 ms, transfer speed 0.5 msec/8K.
• This graph shows the time required to read 1 Gig of data for different page sizes.
[Figure: plot of read time in minutes (0 to 25) against page size in multiples of 8K (1 to 16).]
04/10/23 20PSU’s CS 587
Why the difference?
• What accounts for the difference, in times to read one Gigabyte, on the previous graph?
• Assume: rotational delay 4 ms, average seek time 6 ms, transfer speed 0.5 msec/8K
• Transfer time
– (2^30/2^13 8K blocks) x (0.5 msec/8K) = 66 secs ~= one minute
• How many reads?
– Page size 8K: there are 2^30/2^13 = 2^17 = 128K reads
– Page size 64K: there are 1/8th that many reads = 16K reads
• Time taken by rotational delays and seeks
– Each read requires a rotational delay and a seek, totalling 10 msec.
– 8K: (128K reads) x (10 msec/read) = 1,311 secs ~= 22 minutes
– 64K: 1/8 of that, or 164 secs ~= 3 minutes
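The same arithmetic can be sketched in Python to try other page sizes. This assumes, as the slide does, that every page read pays one seek and one rotational delay; the function name and defaults are illustrative only.

```python
def read_time_seconds(total_bytes, page_bytes,
                      seek_ms=6.0, rotation_ms=4.0, transfer_ms_per_8k=0.5):
    """Time to read total_bytes when each page read costs a seek and a rotation."""
    reads = total_bytes // page_bytes
    positioning_ms = reads * (seek_ms + rotation_ms)
    transfer_ms = (total_bytes / 8192) * transfer_ms_per_8k  # independent of page size
    return (positioning_ms + transfer_ms) / 1000


ONE_GIG = 2**30
for multiple in (1, 2, 4, 8, 16):
    seconds = read_time_seconds(ONE_GIG, multiple * 8192)
    print(f"page size {multiple * 8}K: {seconds / 60:.1f} minutes")
```

The transfer term stays near one minute for every page size; only the positioning term shrinks as pages grow.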
04/10/23 21PSU’s CS 587
Moral of the Story
• As page size increases, read (and write) time reduces to transfer time, a big savings.
• So why not use a huge page size?
– Wastes memory space if you don’t need all that is read
– Wastes read time if you don’t need all that is read
• What applications could use a large page size?
– Those that sequentially access data
• The problem with a small page size is that pages get scattered across the disk.
– Turn the page….
04/10/23 22PSU’s CS 587
Faster I/O, even with a small page size
• Even if the page size is small, you can achieve fast I/O by storing a file’s data as follows:
– Consecutive pages on same track, followed by
– Consecutive tracks on same cylinder, followed by
– Consecutive cylinders adjacent to each other
– First two incur no seek time or rotational delay, seek for third is only one-track.
• What is saved with this storage pattern?
• How is this storage pattern obtained?
– Disk defragmenter and its relatives/predecessors
– Also places frequently used files near the spindle
• When data is in this storage pattern, the application can do sequential I/O
– Otherwise it must do random I/O
04/10/23 23PSU’s CS 587
More Hardware Issues
• Disk Controllers
– Interface from disks to bus
– Checksums, remap bad sectors, driver mgt, etc.
• Interface protocols and MB per second xfer rates
– IDE/EIDE/ATA/PATA, SATA: 133
– SCSI: 640
– BUT for a single device, SCSI is inferior
– Faster network technologies such as Fibre Channel
• Storage Area Networks (SANs)
– Disk farm networked to servers
– Servers can be heterogeneous – a primary advantage
– Centralized management
9. Disks
04/10/23 24PSU’s CS 587
Dependability
• Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics:
1. Mean Time To Failure (MTTF) measures reliability
2. Failures In Time (FIT) = 1/MTTF, the rate of failures
– Traditionally reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures service interruption
– Mean Time Between Failures (MTBF) = MTTF + MTTR
• Module availability measures service as it alternates between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
• Module availability = MTTF / (MTTF + MTTR)
04/10/23 25PSU’s CS 587
Example calculating reliability
If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules
Example: Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk) 1 disk controller (0.5M hour MTTF) and 1 power supply (0.2M hour MTTF)
04/10/23 26PSU’s CS 587
Example calculating reliability
Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF) and 1 power supply (0.2M hour MTTF):
FailureRate = 10 x (1/1,000,000) + 1/500,000 + 1/200,000
            = (10 + 2 + 5)/1,000,000
            = 17/1,000,000 per hour
            = 17,000 FIT
MTTF = 1,000,000,000/17,000 ≈ 59,000 hours
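The same calculation as a small Python sketch, assuming exponentially distributed lifetimes so that failure rates add. The function name is illustrative.

```python
def system_failure_rate_per_hour(component_mttf_hours):
    """With exponential lifetimes, the system failure rate is the sum of component rates."""
    return sum(1.0 / mttf for mttf in component_mttf_hours)


# 10 disks, 1 disk controller, 1 power supply (MTTF in hours)
components = [1_000_000] * 10 + [500_000] + [200_000]

rate = system_failure_rate_per_hour(components)
fit = rate * 1e9          # failures per billion hours
system_mttf = 1.0 / rate  # hours

print(f"FIT = {fit:.0f}, MTTF = {system_mttf:.0f} hours")
```

The system MTTF is well below that of its weakest component (the 0.2M hour power supply), because every component is a possible point of failure.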
04/10/23 27PSU’s CS 587
9.2 RAID [587]
• Disk Array: arrangement of several disks that gives abstraction of a single, large disk.
• Goals: increase performance and reliability.
• Two main techniques:
– Data striping: data is partitioned; size of a partition is called the striping unit. Partitions are distributed over several disks.
– Redundancy: more disks => more failures. Redundant information allows reconstruction of data if a disk fails.
04/10/23 28PSU’s CS 587
Data Striping
• CPUs go fast, disks don’t. How can disks keep up?
• CPUs do work in parallel. Can disks?
• Answer: Partition data across D disks (see next slide).
• If partition unit is a page:
– A single page I/O request is no faster
– Multiple I/O requests can run at aggregated bandwidth
• Number of pages in a partition unit is called the depth of the partition.
• Contrary to text, partition units of a bit are almost never used and partition units of a byte are rare.
04/10/23 29PSU’s CS 587
Data Striping (RAID Level 0)
Disk 0   Disk 1   Disk 2   ...   Disk D-1
0        1        2        ...   D-1
D        D+1      D+2      ...   2D-1
2D       2D+1     2D+2     ...   3D-1
…        …        …              …
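The layout above amounts to round-robin placement. A minimal sketch, assuming a striping unit of one block and a hypothetical helper name:

```python
def locate_block(block_no, num_disks):
    """Map a logical block number to (disk index, block offset on that disk)."""
    return block_no % num_disks, block_no // num_disks


D = 4  # example number of disks
for block in (0, 1, D, D + 1, 2 * D + 2, 3 * D - 1):
    disk, offset = locate_block(block, D)
    print(f"block {block} -> disk {disk}, offset {offset}")
```

Blocks 0 to D-1 land on different disks, so a sequential read of one stripe can keep all D disks busy at once.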
04/10/23 30PSU’s CS 587
Redundancy
• Striping is seductive, but remember reliability!
– MTTF of a disk is about 6 years
– If we stripe over 24 disks, what is MTTF?
• Solution: redundancy
– Parity: corrects single failures
– Others: detect where the failure is, and correct multiple failures
– But failure location is provided by controller
– Redundancy may require more than one check bit
• Redundancy makes writes slower – why?
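To make the parity bullet concrete, here is a sketch of XOR parity over equal-sized blocks. It assumes the controller has already told us which disk failed, as the slide notes; the function names are made up.

```python
from functools import reduce


def parity(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity_block):
    """Rebuild the single missing block by XOR-ing the survivors with the parity."""
    return parity(list(surviving_blocks) + [parity_block])


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # one block per data disk
p = parity(data)

# Suppose the disk holding data[1] fails:
recovered = reconstruct([data[0], data[2]], p)
print(recovered == data[1])
```

A single parity block can rebuild one lost block, but it cannot by itself say which block is bad. That is why the failure location has to come from the controller.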
04/10/23 31PSU’s CS 587
RAID Levels
• Standardized by SNIA (www.snia.org)
• Vary in practice
• For each level, decide (assume single user)
– Number of disks required to hold D disks of data.
– Speedup s (compared to 1 disk) for S/R (Sequential/Random) R/W (Reads/Writes)
– Random: each I/O is one block
– Sequential: each I/O is one stripe
– Number of disks/blocks that can fail w/o data loss
• Level 0: Block Striped, No redundancy
– Picture is 2 slides back
04/10/23 32PSU’s CS 587
JBOD, RAID Level 1
• JBOD: Just a Bunch of Disks
[Figure: block layouts for JBOD and for a mirrored pair, drawn across Disk 0, Disk 1, Disk 2, ..., Disk D-1.]
• Level 1: Mirrored (two identical JBODs – no striping)
04/10/23 33PSU’s CS 587
RAID Level 0+1: Stripe + Mirror
Disk 0   Disk 1   Disk 2   ...   Disk D-1
0        1        2        ...   D-1
D        D+1      D+2      ...   2D-1
2D       2D+1     2D+2     ...   3D-1
…        …        …              …
Mirror copy (identical contents):
Disk D   Disk D+1   Disk D+2   ...   Disk 2D-1
0        1          2          ...   D-1
D        D+1        D+2        ...   2D-1
2D       2D+1       2D+2       ...   3D-1
…        …          …                …
04/10/23 34PSU’s CS 587
RAID Level 4
• Block-Interleaved Parity (not common)
– One check disk, uses one bit of parity.
– How to tell if there is a failure, or which disk failed?
– Read-modify-write
– Disk D is a bottleneck
Disk 0   Disk 1   Disk 2   ...   Disk D-1   Disk D
0        1        2        ...   D-1        P
D        D+1      D+2      ...   2D-1       P
2D       2D+1     2D+2     ...   3D-1       P
…        …        …              …          P
04/10/23 35PSU’s CS 587
RAID Level 5
• Level 5: Block-Interleaved Distributed Parity
Disk 0   Disk 1   ...   Disk D-2   Disk D-1   Disk D
0        1        ...   D-2        D-1        P
D        D+1      ...   2D-2       P          2D-1
2D       2D+1     ...   P          3D-2       3D-1
…        …              …          …          …
• Level 6: Like 5, but 2 parity bits/disks
– Can survive loss of 2 disks/blocks
04/10/23 36PSU’s CS 587
Notation on the next slide
• #Disks
– Number of disks required to hold D disks worth of data using this RAID level
• Read/write speedup of blocks in a single file:
– SR: Sequential read
– RR: Random read
– SW: Sequential write
– RW: Random write
• Failure Tolerance
– How many disks can fail without loss of data
• Internal Data
– s = Blocks transferred in the time it takes to transfer one block of data from one disk.
• These numbers are theoretical!
– YMMV…and vary significantly!
04/10/23 37PSU’s CS 587
RAID Performance
Level   #Disks   SR speedup   RR speedup   SW speedup   RW speedup   Failure Tolerance
0       D        s=D          1≤s≤D        s=D          1≤s≤D        0
1       2D       s=2          s=2          s=1**        s=1**        D*
0+1     2D       s=2D         2≤s≤2D       s=D**        1≤s≤D**      D*
5       D+1      s=D          1≤s≤D        s=D          Varies       1
* If no two are copies of each other
** Note: can’t write both mirrors at once – why?
04/10/23 38PSU’s CS 587
Small Writes on Levels 4 and 5
Levels 4 and 5 require a read-modify-write cycle for all writes, since the parity block must be read and modified.
On small writes this can be very expensive
This is another justification for Log Based File Systems (see your OS course)
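A sketch of why a small write is expensive: updating one data block means reading the old data and the old parity, then writing the new data and the new parity. The helper names are hypothetical, and blocks are modelled as equal-length byte strings.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data, new_data, old_parity):
    """New parity after overwriting one data block (read-modify-write).

    Costs 2 reads (old data, old parity) and 2 writes (new data, new parity),
    but does not touch the other data disks in the stripe.
    """
    return xor_bytes(old_parity, xor_bytes(old_data, new_data))


stripe = [b"\x0f", b"\xf0", b"\x3c"]
old_parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])

new_block = b"\xaa"
new_parity = small_write_parity(stripe[1], new_block, old_parity)

# Same result as recomputing parity over the whole updated stripe:
full = xor_bytes(xor_bytes(stripe[0], new_block), stripe[2])
print(new_parity == full)
```

So one logical write turns into four physical I/Os. On Level 4 all of the parity I/Os hit the same check disk.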
04/10/23 39PSU’s CS 587
Which RAID Level is best?
• If data loss is not a problem: Level 0
• If storage cost is not a problem: Level 0+1
• Else: Level 5
• Software support
– Linux: 0, 1, 4, 5 (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html)
– Windows: 0, 1, 5 (http://www.techimo.com/articles/index.pl?photo=149)
04/10/23 40PSU’s CS 587
9.3, 9.4.1: Covered earlier
04/10/23 41PSU’s CS 587
9.4.2 DBMS vs. OS File System
• OS does disk space & buffer mgmt: why not let OS manage these tasks? [715]
• Differences in OS support: portability issues
• Some limitations, e.g., files can’t span disks.
• Buffer management in DBMS requires ability to:
– pin a page in buffer pool, force a page to disk (important for implementing CC & recovery),
– adjust replacement policy, and pre-fetch pages based on access patterns in typical DB operations.
– Sometimes MRU is the best replacement policy: for example, for a scan or a loop that does not fit.
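To see why MRU can beat LRU, here is a toy simulation of a loop over 5 pages with only 4 buffer frames. This is a simplified sketch (no pinning, no dirty pages), and the function name is made up.

```python
def count_misses(references, num_frames, policy):
    """Count buffer misses for a page reference string; policy is 'LRU' or 'MRU'."""
    frames = []  # ordered from least recently used to most recently used
    misses = 0
    for page in references:
        if page in frames:
            frames.remove(page)  # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(frames) == num_frames:
                victim_index = 0 if policy == "LRU" else -1
                frames.pop(victim_index)
        frames.append(page)
    return misses


loop = list(range(5)) * 4  # scan pages 0..4 four times
print("LRU misses:", count_misses(loop, 4, "LRU"))
print("MRU misses:", count_misses(loop, 4, "MRU"))
```

With LRU every reference misses (sequential flooding), because the page about to be needed is always the one just evicted. MRU keeps most of the loop resident. An OS buffer cache does not normally let the application make this choice.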
04/10/23 42PSU’s CS 587
9.5 Files of Records
• Page or block is OK when doing I/O, but higher levels of DBMS operate on records, and files of records.
• FILE: A collection of pages, each containing a collection of records. Must support:
– insert/delete/modify record
– read a particular record (specified using record id)
– scan all records (possibly with some conditions on the records to be retrieved)
04/10/23 43PSU’s CS 587
9.5.1 Unordered (Heap) Files
Simplest file structure contains records in no particular order.
As file grows and shrinks, disk pages are allocated and de-allocated.
To support record level operations, we must:
• keep track of the pages in a file
• keep track of free space on pages
• keep track of the records on a page
There are at least two alternatives for keeping track of heap files.
04/10/23 44PSU’s CS 587
Heap File Implemented as a List
The header page id and Heap file name must be stored someplace.
Each page contains 2 `pointers’ plus data.
[Figure: a header page points to two linked lists of data pages: one list of full pages and one list of pages with free space.]
04/10/23 45PSU’s CS 587
Heap File Using a Page Directory
The entry for a page can include the number of free bytes on the page.
The directory is a collection of pages; linked list implementation is just one alternative. Much smaller than linked list of all HF pages!
[Figure: a DIRECTORY, starting at the header page, whose entries point to Data Page 1, Data Page 2, ..., Data Page N.]
04/10/23 46PSU’s CS 587
Comparing Heap File Implementations
• Assume
– 100 directory entries per page.
– U full pages, E pages with free space
– D directory pages
– Then D = (U+E)/100
– Note that D is two orders of magnitude less than U or E
• Cost to find a page with enough free space
– List: E/2
– Directory: (D/2) + 1
• Cost to move a page from Full to Free (e.g., when a record is deleted)
– List: 3, Directory: 1
• Can you think of some other operations?
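Plugging example numbers into the formulas above shows the gap. The values of U and E are made up for illustration, and costs are in page I/Os as on the slide.

```python
import math


def directory_pages(full_pages, free_pages, entries_per_page=100):
    """D = (U + E) / 100, rounded up to whole pages."""
    return math.ceil((full_pages + free_pages) / entries_per_page)


def find_free_page_cost_list(free_pages):
    """Average I/Os to find a page with enough space in the linked-list layout: E/2."""
    return free_pages / 2


def find_free_page_cost_directory(dir_pages):
    """Average I/Os in the directory layout: D/2 directory pages plus the data page."""
    return dir_pages / 2 + 1


U, E = 9000, 1000  # hypothetical file: 9000 full pages, 1000 with free space
D = directory_pages(U, E)
print("directory pages:", D)
print("list cost:", find_free_page_cost_list(E))
print("directory cost:", find_free_page_cost_directory(D))
```

For this hypothetical file the directory is roughly ten times cheaper to search. The ratio depends on how many pages have free space.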
04/10/23 47PSU’s CS 587
9.6 Page Formats: Fixed Length Records
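[Figure: two page layouts for fixed-length records. PACKED: slots 1 to N hold records contiguously, followed by free space; the page stores N, the number of records. UNPACKED, BITMAP: slots 1 to M, some of them empty; the page stores M, the number of slots, and a bitmap (M ... 3 2 1) marking which slots hold records.]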
04/10/23 48PSU’s CS 587
Packed vs Unpacked Page Formats
• Record ID (RID, TID) = (page#, slot#), in all page formats
– Note that indexes are filled with RIDs
– Data entries in alternatives 2 and 3 are (key, RID..)
• Packed
– stores more records
– RIDs change when a record is deleted
– This may not be acceptable.
• Unpacked
– RID does not change
– Less data movement when deleting
04/10/23 49PSU’s CS 587
Page Formats: Variable Length Records
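[Figure: page i holding variable-length records with Rid = (i,1), Rid = (i,2), ..., Rid = (i,N). A SLOT DIRECTORY at the end of the page holds one entry per slot (N ... 2 1; the example shows entries 20, 16, 24), the number of slots (N), and a pointer to the start of free space.]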
04/10/23 50PSU’s CS 587
Slotted Page Format
• Intergalactic standard, for fixed length records also.
• How to deal with free space fragmentation?
– Pack records lazily
– Note that RIDs don’t change
• How are updates handled which expand the size of a record?
– Forwarding flag to new location
• http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html
• postgresql-8.3.1\src\include\storage\bufpage.h
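A toy slotted page in Python, to show why RIDs survive compaction. This is a teaching sketch, not PostgreSQL's layout: the slot directory is kept in a Python list instead of inside the page bytes, slot numbers start at 0, and the class and method names are invented.

```python
class SlottedPage:
    """Toy slotted page: records are packed in a byte array, slots point at them."""

    def __init__(self, page_no, size=4096):
        self.page_no = page_no
        self.data = bytearray(size)
        self.free_start = 0  # pointer to start of free space
        self.slots = []      # slot i -> (offset, length), or None if deleted

    def insert(self, record):
        if self.free_start + len(record) > len(self.data):
            raise ValueError("page full")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        self.slots.append((offset, len(record)))
        return (self.page_no, len(self.slots) - 1)  # the RID

    def read(self, slot_no):
        entry = self.slots[slot_no]
        if entry is None:
            raise KeyError("record was deleted")
        offset, length = entry
        return bytes(self.data[offset:offset + length])

    def delete(self, slot_no):
        self.slots[slot_no] = None  # leaves a hole; the slot number is not reused here

    def compact(self):
        """Pack live records together (done lazily); slot numbers stay the same."""
        packed = bytearray(len(self.data))
        pos = 0
        for i, entry in enumerate(self.slots):
            if entry is None:
                continue
            offset, length = entry
            packed[pos:pos + length] = self.data[offset:offset + length]
            self.slots[i] = (pos, length)
            pos += length
        self.data = packed
        self.free_start = pos


page = SlottedPage(page_no=7, size=64)
rid_a = page.insert(b"alpha")
rid_b = page.insert(b"bravo!")
rid_c = page.insert(b"charlie")
page.delete(rid_b[1])
page.compact()
print(rid_c, page.read(rid_c[1]))
```

The record bytes move during `compact`, but `rid_c` still works because only the slot entry changed. That indirection is the point of the slot directory.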
04/10/23 51PSU’s CS 587
9.7 Record Formats: Fixed Length
Information about field types same for all records in a file; stored in system catalogs.
Finding i’th field does not require scan of record.
[Figure: a record starting at base address B with fields F1 F2 F3 F4 of lengths L1 L2 L3 L4. The address of F3 is B+L1+L2.]
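The address calculation in the figure generalizes to any field. A minimal sketch, with the field lengths standing in for what the system catalog would store (the lengths below are made up):

```python
def field_address(base, field_lengths, i):
    """Address of the i'th field (1-based): base plus the lengths of all earlier fields."""
    return base + sum(field_lengths[:i - 1])


lengths = [4, 8, 2, 16]  # L1..L4, hypothetical catalog entries
B = 1000                 # hypothetical base address of the record

print(field_address(B, lengths, 1))  # F1 starts at B
print(field_address(B, lengths, 3))  # F3 starts at B + L1 + L2
```

No scan of the record is needed: the offsets depend only on the catalog, so they can be computed once per file.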
04/10/23 52PSU’s CS 587
Record Formats: Variable Length
• Two alternative formats (# fields is fixed):
– Fields delimited by special symbols: a field count (4), then F1 $ F2 $ F3 $ F4 $
– Array of field offsets, followed by the fields F1 F2 F3 F4
• Second offers direct access to i’th field, efficient storage of nulls (special don’t know value); small directory overhead.
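A sketch of the second format using Python's `struct` module. The details are assumptions for illustration: offsets are 2-byte little-endian integers, there is one extra offset marking the end of the record, field numbers start at 0, and a zero-length field stands for null (so this toy version cannot tell an empty string from a null).

```python
import struct


def encode_record(fields):
    """Encode fields (bytes or None) as an offset array followed by the field bytes."""
    n = len(fields)
    header_size = 2 * (n + 1)  # n start offsets plus one end-of-record offset
    offsets = []
    pos = header_size
    body = b""
    for field in fields:
        offsets.append(pos)
        if field is not None:
            body += field
            pos += len(field)
    offsets.append(pos)
    return struct.pack(f"<{n + 1}H", *offsets) + body


def get_field(record, i):
    """Direct access to field i (0-based) using two adjacent offsets; None means null."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    return None if start == end else record[start:end]


rec = encode_record([b"Ann", None, b"Portland", b"42"])
print(get_field(rec, 0), get_field(rec, 1), get_field(rec, 2), get_field(rec, 3))
```

A null costs no data bytes, only its slot in the offset array, and reading field i never requires scanning the fields before it.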