CS252 Graduate Computer Architecture
Lecture 18
I/O, Buses, Queuing Theory
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
4/6/2009 cs252-S09, Lecture 18 2
Review: Reed-Solomon Codes
[Figure: generator matrix G is a 7x5 Vandermonde matrix; the row for evaluation point i (i = 1..7) is (1, i, i^2, i^3, i^4), so multiplying G by the data vector (a0, …, a4) produces the codeword (P(1), …, P(7)).]
• Reed-Solomon codes (non-systematic):
– Data as coefficients, code space as values of the polynomial:
– P(x) = a0 + a1·x + a2·x^2 + a3·x^3 + a4·x^4
– Coded: P(1), P(2), …, P(6), P(7)
• Called a Vandermonde matrix: maximum rank
• Different representation (this H' and G are not related):
– Clear that all combinations of two or fewer columns are independent, so d = 3
– Very easy to pick whatever d you happen to want: add more rows
• Fast, systematic versions of Reed-Solomon:
– Cauchy Reed-Solomon, others
[Figure: a 2x7 parity-check matrix H' built from the evaluation points 1..7 (one row of all ones, one row of the distinct points); any two of its columns are independent, giving d = 3.]
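To make the encoding idea above concrete, here is a minimal sketch of non-systematic Reed-Solomon over the small prime field GF(101). The field, the data values, and the helper names are assumptions chosen for illustration (real systems typically use GF(2^8)); the structure is the slide's: data symbols become the coefficients a0..a4, the codeword is P(1)..P(7), and any 5 of the 7 values recover the polynomial.

```python
P = 101  # field modulus (assumption for this sketch)

def rs_encode(data, n=7):
    """Evaluate P(x) = a0 + a1*x + ... + a4*x^4 at x = 1..n (Vandermonde encode)."""
    return [sum(a * pow(x, i, P) for i, a in enumerate(data)) % P
            for x in range(1, n + 1)]

def rs_recover(points):
    """Lagrange-interpolate the degree-4 polynomial from any 5 (x, y) pairs
    and return its values at x = 1..7, i.e. the full codeword."""
    def eval_at(x):
        total = 0
        for i, (xi, yi) in enumerate(points):
            num, den = 1, 1
            for j, (xj, _) in enumerate(points):
                if i != j:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, P - 2, P)) % P   # modular inverse of den
        return total
    return [eval_at(x) for x in range(1, 8)]

data = [3, 1, 4, 1, 5]                   # a0..a4
code = rs_encode(data)                   # P(1)..P(7)
assert rs_recover(list(zip(range(1, 8), code))[:5]) == code   # any 5 of 7 values suffice
```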
4/6/2009 cs252-S09, Lecture 18 3
Motivation: Who Cares About I/O?
• CPU performance: 60% per year
• I/O system performance limited by mechanical delays (disk I/O)
– < 10% per year (I/Os per sec or MB per sec)
• Amdahl's Law: system speed-up limited by the slowest part!
– 10% I/O & 10x CPU => 5x performance (lose 50%)
– 10% I/O & 100x CPU => 10x performance (lose 90%)
• I/O bottleneck:
– Diminishing fraction of time in CPU
– Diminishing value of faster CPUs
4/6/2009 cs252-S09, Lecture 18 4
I/O Systems
[Figure: processor with cache attached to a Memory - I/O bus; main memory and I/O controllers for disks, graphics, and network also sit on the bus, and the controllers raise interrupts to the processor.]
4/6/2009 cs252-S09, Lecture 18 5
What is a bus?
• A bus is:
– a shared communication link
– a single set of wires used to connect multiple subsystems
• A bus is also a fundamental tool for composing large, complex systems
– systematic means of abstraction
[Figure: processor (control + datapath), memory, input, and output connected by a single bus.]
4/6/2009 cs252-S09, Lecture 18 6
Advantages of Buses
• Versatility:
– New devices can be added easily
– Peripherals can be moved between computer systems that use the same bus standard
• Low cost:
– A single set of wires is shared in multiple ways
[Figure: processor, memory, and several I/O devices sharing a single bus.]
4/6/2009 cs252-S09, Lecture 18 7
Disadvantages of Buses
• A bus creates a communication bottleneck
– The bandwidth of the bus can limit the maximum I/O throughput
• The maximum bus speed is largely limited by:
– The length of the bus
– The number of devices on the bus
– The need to support a range of devices with:
» Widely varying latencies
» Widely varying data transfer rates
[Figure: processor, memory, and several I/O devices sharing a single bus.]
4/6/2009 cs252-S09, Lecture 18 8
The General Organization of a Bus
• Control lines:
– Signal requests and acknowledgments
– Indicate what type of information is on the data lines
• Data lines carry information between the source and the destination:
– Data and addresses
– Complex commands
4/6/2009 cs252-S09, Lecture 18 9
Master versus Slave
• A bus transaction includes two parts:
– Issuing the command (and address) – the request
– Transferring the data – the action
• The master is the one who starts the bus transaction by:
– issuing the command (and address)
• The slave is the one who responds to the address by:
– Sending data to the master if the master asks for data
– Receiving data from the master if the master wants to send data
[Figure: the bus master issues the command; data can go either way between master and slave.]
4/6/2009 cs252-S09, Lecture 18 10
Types of Buses
• Processor-Memory Bus (design specific)
– Short and high speed
– Only needs to match the memory system
» Maximize memory-to-processor bandwidth
– Connects directly to the processor
– Optimized for cache block transfers
• I/O Bus (industry standard)
– Usually lengthy and slower
– Needs to match a wide range of I/O devices
– Connects to the processor-memory bus or backplane bus
• Backplane Bus (standard or proprietary)
– Backplane: an interconnection structure within the chassis
– Allows processors, memory, and I/O devices to coexist
– Cost advantage: one bus for all components
4/6/2009 cs252-S09, Lecture 18 11
A Computer System with One Bus:
Backplane Bus
• A single bus (the backplane bus) is used for:
– Processor-to-memory communication
– Communication between I/O devices and memory
• Advantages: simple and low cost
• Disadvantages: slow, and the bus can become a major bottleneck
• Example: IBM PC-AT
[Figure: processor, memory, and I/O devices all attached to one backplane bus.]
4/6/2009 cs252-S09, Lecture 18 12
A Two-Bus System
• I/O buses tap into the processor-memory bus via bus adaptors:
– Processor-memory bus: mainly for processor-memory traffic
– I/O buses: provide expansion slots for I/O devices
• Apple Macintosh II
– NuBus: processor, memory, and a few selected I/O devices
– SCSI bus: the rest of the I/O devices
[Figure: a processor-memory bus with several bus adaptors, each bridging to a separate I/O bus.]
4/6/2009 cs252-S09, Lecture 18 13
A Three-Bus System (+ backside cache)
• A small number of backplane buses tap into the processor-memory bus
– The processor-memory bus is used only for processor-memory traffic
– The I/O buses are connected to the backplane bus
• Advantage: loading on the processor bus is greatly reduced
[Figure: processor with a backside L2 cache bus; bus adaptors bridge the processor-memory bus to a backplane bus, which in turn hosts the I/O buses.]
4/6/2009 cs252-S09, Lecture 18 14
The move from Parallel to Serial I/O
• Shared parallel bus wires
– Clock rate limited by clock skew across a long bus (~100 MHz)
– High power to drive a large number of loaded bus lines
– Central bus arbiter adds latency to each transaction
– Expensive parallel connectors and backplanes/cables
– Examples: VMEbus, Sbus, ISA bus, PCI, SCSI, IDE
• Dedicated point-to-point serial links
– Point-to-point links run at multi-gigabit speed using advanced clock/signal encoding (requires lots of circuitry at each end)
– Lower power since only one well-behaved load
– Multiple simultaneous transfers
– Cheap cables and connectors (trade greater endpoint transistor cost for lower physical wiring cost); customize bandwidth per device with multiple links
– Examples: Ethernet, InfiniBand, PCI Express, SATA, USB, FireWire, etc.
[Figure: CPU, I/O interface, and two devices sharing a bus with a central bus arbiter, versus the same components joined by dedicated point-to-point serial links.]
4/6/2009 cs252-S09, Lecture 18 15
Main Components of the Intel Pentium 4 Chipset
• Northbridge:
– Handles memory
– Graphics
• Southbridge: I/O
– PCI bus
– Disk controllers
– USB controllers
– Audio
– Serial I/O
– Interrupt controller
– Timers
4/6/2009 cs252-S09, Lecture 18 16
How does the processor actually talk to the device?
• CPU interacts with a controller
– Contains a set of registers that can be read and written
– May contain memory for request queues or bit-mapped images
• Regardless of the complexity of the connections and buses, the processor accesses registers in two ways:
– I/O instructions: in/out instructions
» Example from the Intel architecture: out 0x21,AL
– Memory-mapped I/O: load/store instructions
» Registers/memory appear in the physical address space
» I/O accomplished with ordinary load and store instructions
[Figure: the CPU, regular memory, and an interrupt controller sit on the processor-memory bus, with bus adaptors leading to other devices or buses. The device controller's bus interface exposes read/write and control/status registers plus addressable memory and/or queues, reachable either through a port (e.g., port 0x20) or a memory-mapped region (e.g., 0x8f008020); the controller raises interrupt requests back to the CPU.]
4/6/2009 cs252-S09, Lecture 18 17
Example: Memory-Mapped Display Controller
• Memory-mapped:
– Hardware maps control registers and display memory into the physical address space
» Addresses set by hardware jumpers or by programming at boot time
– Simply writing to display memory (also called the "frame buffer") changes the image on screen
» Addr: 0x8000F000—0x8000FFFF
– Writing a graphics description to the command-queue area
» Say, enter a set of triangles that describe some scene
» Addr: 0x80010000—0x8001FFFF
– Writing to the command register may cause on-board graphics hardware to do something
» Say, render the above scene
» Addr: 0x0007F004
• Can protect with page tables
[Figure: physical address space showing display memory at 0x8000F000, the graphics command queue at 0x80010000 (up to 0x80020000), and the status/command registers at 0x0007F000/0x0007F004.]
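A hedged sketch of how a user-level driver might poke such a memory-mapped controller. The physical addresses come from the slide; mapping them through /dev/mem, the command value, and the surrounding code are assumptions of this sketch (it needs root privileges and a machine that actually has such a device).

```python
import mmap, os, struct

PAGE = mmap.PAGESIZE
FRAME_BUF_PA = 0x8000F000   # frame buffer base (from the slide)
CMD_REG_PA   = 0x0007F004   # command register (from the slide)

fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)

# Map the page containing the frame buffer and write one byte: a plain
# store into display memory changes what appears on screen.
fb = mmap.mmap(fd, PAGE, offset=FRAME_BUF_PA & ~(PAGE - 1))
fb[FRAME_BUF_PA & (PAGE - 1)] = 0xFF

# Map the page containing the command register and write a (hypothetical)
# "render" command; the on-board graphics hardware would pick it up.
regs = mmap.mmap(fd, PAGE, offset=CMD_REG_PA & ~(PAGE - 1))
off = CMD_REG_PA & (PAGE - 1)
regs[off:off + 4] = struct.pack("<I", 1)
```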
4/6/2009 cs252-S09, Lecture 18 18
Hard Disk Drives

[Figure: a Western Digital drive and an IBM/Hitachi Microdrive, with a side view of the read/write head]
http://www.storagereview.com/guide/
4/6/2009 cs252-S09, Lecture 18 19
Historical Perspective
• 1956 IBM RAMAC — early 1970s Winchester
– Developed for mainframe computers, proprietary interfaces
– Steady shrink in form factor: 27 in. to 14 in.
• Form factor and capacity drive the market more than performance
• 1970s developments
– 5.25 inch floppy disk form factor (microcode into mainframe)
– Emergence of industry-standard disk interfaces
• Early 1980s: PCs and first-generation workstations
• Mid 1980s: client/server computing
– Centralized storage on file server
» accelerates disk downsizing: 8 inch to 5.25 inch
– Mass-market disk drives become a reality
» industry standards: SCSI, IPI, IDE
» 5.25 inch to 3.5 inch drives for PCs; end of proprietary interfaces
• 1990s: laptops => 2.5 inch drives
• 2000s: shift to perpendicular recording
– 2007: Seagate introduces 1 TB drive
– 2009: Seagate/WD promise 2 TB drives
4/6/2009 cs252-S09, Lecture 18 20
Disk History

[Figure: disk drives over time, annotated with data density (Mbit/sq. in.) and capacity of the unit shown (MBytes)]
• 1973: 1.7 Mbit/sq. in., 140 MBytes
• 1979: 7.7 Mbit/sq. in., 2,300 MBytes
source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces"
4/6/2009 cs252-S09, Lecture 18 21
Disk History
• 1989: 63 Mbit/sq. in., 60,000 MBytes
• 1997: 1,450 Mbit/sq. in., 2,300 MBytes
• 1997: 3,090 Mbit/sq. in., 8,100 MBytes
source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces"
4/6/2009 cs252-S09, Lecture 18 22
Seagate Barracuda
• 2 TB! 400 Gb/in² areal density
• 4 platters, 2 heads each
• 3.5" platters
• Perpendicular recording
• 7200 RPM
• 4.2 ms latency (?)
• 100 MB/sec transfer speed
• 32 MB cache
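Quick sanity check on the numbers above (my own arithmetic, not from the slide): half a revolution at 7200 RPM matches the quoted ~4.2 ms latency, and the same style of calculation gives the time to stream the whole drive at 100 MB/sec.

```python
rpm = 7200
avg_rot_ms = 0.5 * 60_000 / rpm          # half a revolution = ~4.17 ms
full_read_hours = 2e12 / 100e6 / 3600    # 2 TB at 100 MB/s = ~5.6 hours
print(avg_rot_ms, full_read_hours)
```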
4/6/2009 cs252-S09, Lecture 18 23
Properties of a Hard Magnetic Disk
• Properties
– Independently addressable element: sector
» OS always transfers groups of sectors together: "blocks"
– A disk can directly access any given block of information it contains (random access). Can access any file either sequentially or randomly.
– A disk can be rewritten in place: it is possible to read/modify/write a block from the disk
• Typical numbers (depending on the disk size):
– 500 to more than 20,000 tracks per surface
– 32 to 800 sectors per track
» A sector is the smallest unit that can be read or written
• Zoned bit recording
– Constant bit density: more sectors on outer tracks
– Speed varies with track location
[Figure: platters divided into tracks and sectors.]
4/6/2009 cs252-S09, Lecture 18 24
MBits per Square Inch: DRAM as % of Disk over Time

[Figure: DRAM areal density as a percentage of disk areal density, 1974–1998 (y-axis 0%–50%), with callouts 0.2 vs. 1.7 Mb/si, 9 vs. 22 Mb/si, and 470 vs. 3000 Mb/si]
source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces"
4/6/2009 cs252-S09, Lecture 18 25
Nano-layered Disk Heads
• The special sensitivity of the disk head comes from the "Giant Magneto-Resistive effect" (GMR)
• IBM is (was) the leader in this technology
– Same technology as the TMJ-RAM breakthrough
[Figure: cross-section of a disk head, including the coil for writing.]
4/6/2009 cs252-S09, Lecture 18 26
Disk Figure of Merit: Areal Density
• Bits recorded along a track
– Metric is Bits Per Inch (BPI)
• Number of tracks per surface
– Metric is Tracks Per Inch (TPI)
• Disk designs brag about bit density per unit area
– Metric is Bits Per Square Inch: Areal Density = BPI x TPI

Year    Areal Density (Mbit/in²)
1973    2
1979    8
1989    63
1997    3,090
2000    17,100
2006    130,000
2007    164,000
2009    400,000
[Figure: areal density versus year, 1970–2010, on a log scale from 1 to 1,000,000 Mbit/in²]
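The table above also lets you check the growth rates quoted later in the conclusion. This small sketch (my own arithmetic, not from the slides) computes the compound annual growth rate between two table entries.

```python
def cagr(d0, d1, years):
    """Compound annual growth rate between two areal-density data points."""
    return (d1 / d0) ** (1 / years) - 1

print(cagr(2, 400_000, 2009 - 1973))   # ~0.40, i.e. ~40%/year over the whole span
print(cagr(63, 17_100, 2000 - 1989))   # ~0.66, i.e. ~66%/year during the 1990s
```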
4/6/2009 cs252-S09, Lecture 18 27
Newest Technology: Perpendicular Recording
• In perpendicular recording:
– Bit densities are much higher
– The magnetic material is placed on top of a magnetic underlayer that mirrors the recording head and effectively doubles the recording field
4/6/2009 cs252-S09, Lecture 18 28
Disk I/O Performance

Response Time = Queue + Disk Service Time

[Figure: a request flows from the user thread through a queue (OS paths) to the controller and then the disk.]
• Performance of disk drive/file system
– Metrics: response time, throughput
– Contributing factors to latency:
» Software paths (can be loosely modeled by a queue)
» Hardware controller
» Physical disk media
• Queuing behavior:
– Can lead to a big increase in latency as utilization approaches 100%
[Figure: response time (ms, 0–300) versus throughput/utilization (0%–100% of total bandwidth); the curve rises sharply as utilization nears 100%.]
4/6/2009 cs252-S09, Lecture 18 29
Magnetic Disk Characteristics
• Cylinder: all the tracks under the heads at a given point on all surfaces
• Read/write data is a three-stage process:
– Seek time: position the head/arm over the proper track (into the proper cylinder)
– Rotational latency: wait for the desired sector to rotate under the read/write head
– Transfer time: transfer a block of bits (sector) under the read/write head
• Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Xfer Time
• Highest bandwidth:
– transfer a large group of blocks sequentially from one track
[Figure: platter geometry showing sectors, tracks, cylinders, and heads; the request path runs through a software queue (device driver), the hardware controller, and the media time (seek + rotation + transfer) before the result returns.]
4/6/2009 cs252-S09, Lecture 18 30
Disk Time Example
• Disk parameters:
– Transfer size is 8 KB
– Advertised average seek is 12 ms
– Disk spins at 7200 RPM
– Transfer rate is 4 MB/sec
• Controller overhead is 2 ms
• Assume the disk is idle, so there is no queuing delay
• Disk Latency = Queuing Time + Seek Time + Rotation Time + Xfer Time + Ctrl Time
• What is the average disk access time for a sector?
– Avg seek + avg rotational delay + transfer time + controller overhead
– 12 ms + [0.5 / (7200 RPM / 60 s per min)] x 1000 ms/s + [8192 bytes / (4x10^6 bytes/s)] x 1000 ms/s + 2 ms
– 12 + 4.17 + 2.05 + 2 = 20.22 ms
• Advertised seek time assumes no locality: in practice typically 1/4 to 1/3 of the advertised seek time: 12 ms => 4 ms
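The same arithmetic, spelled out as a short script (the values come from the slide; the variable names are mine):

```python
seek_ms   = 12.0        # advertised average seek
rpm       = 7200
xfer_rate = 4e6         # bytes/s
size      = 8192        # bytes per transfer
ctrl_ms   = 2.0         # controller overhead

rot_ms  = 0.5 * 60_000 / rpm         # half a revolution = 4.17 ms
xfer_ms = size / xfer_rate * 1000    # 2.05 ms
total   = seek_ms + rot_ms + xfer_ms + ctrl_ms
print(total)                         # ~20.2 ms, matching the slide
```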
4/6/2009 cs252-S09, Lecture 18 31
Typical Numbers for a Magnetic Disk
• Average seek time as reported by the industry:
– Typically in the range of 4 ms to 12 ms
– Due to locality of disk references, may only be 25% to 33% of the advertised number
• Rotational latency:
– Most disks rotate at 3,600 to 7,200 RPM (up to 15,000 RPM or more)
– Approximately 16 ms to 8 ms per revolution, respectively
– The average latency to the desired information is halfway around the disk: 8 ms at 3,600 RPM, 4 ms at 7,200 RPM
• Transfer time is a function of:
– Transfer size (usually a sector): 1 KB per sector
– Rotation speed: 3,600 RPM to 15,000 RPM
– Recording density: bits per inch on a track
– Diameter: ranges from 1 in. to 5.25 in.
– Typical values: 2 to 50 MB per second
• Controller time?
– Depends on the controller hardware—need to examine each case individually
4/6/2009 cs252-S09, Lecture 18 32
Introduction to Queuing Theory

[Figure: a queuing system — arrivals enter a queue, are serviced, and depart.]
• What about queuing time?
– Let's apply some queuing theory
– Queuing theory applies to long-term, steady-state behavior: arrival rate = departure rate
• Little's Law: mean # tasks in system = arrival rate x mean response time
– Observed by many; Little was the first to prove it
– Simple interpretation: you should see the same number of tasks in the queue when entering as when leaving
• Applies to any system in equilibrium, as long as nothing in the black box is creating or destroying tasks
– Typical queuing theory doesn't deal with transient behavior, only steady-state behavior
[Figure: request path — queue, controller, disk.]
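A tiny single-server FIFO simulation (an illustration I am adding, with assumed arrival and service rates) that checks Little's Law: the time-average number of tasks in the system comes out equal to the arrival rate times the mean response time.

```python
import random

lam, mu, N = 5.0, 8.0, 200_000     # arrival rate, service rate, # customers (assumed)
t_arr, prev_depart = 0.0, 0.0
arrivals, departures = [], []
for _ in range(N):
    t_arr += random.expovariate(lam)               # memoryless interarrival times
    start = max(t_arr, prev_depart)                # FIFO, single server
    prev_depart = start + random.expovariate(mu)   # memoryless service times
    arrivals.append(t_arr)
    departures.append(prev_depart)

horizon   = departures[-1]
mean_resp = sum(d - a for a, d in zip(arrivals, departures)) / N        # mean response time
L         = sum(d - a for a, d in zip(arrivals, departures)) / horizon  # time-avg # in system
print(L, lam * mean_resp)   # the two numbers agree: Little's Law
```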
4/6/2009 cs252-S09, Lecture 18 33
Background: Use of Random Distributions
• The server spends a variable time with customers
– Mean (average): m1 = Σ p(T)·T
– Variance: σ² = Σ p(T)·(T − m1)² = Σ p(T)·T² − m1²
– Squared coefficient of variance: C = σ²/m1²
(an aggregate description of the distribution)
• Important values of C:
– No variance, or deterministic: C = 0
– "Memoryless" or exponential: C = 1
» The past tells nothing about the future
» Many complex systems (or aggregates) are well described as memoryless
– Disk response times: C ≈ 1.5 (the majority of seeks are shorter than the average)
• Mean residual wait time, m1(z):
– Mean time one must wait for the server to complete its current task
– Can derive m1(z) = ½·m1·(1 + C)
» Not just ½·m1, because that doesn't capture the variance
– C = 0 ⇒ m1(z) = ½·m1;  C = 1 ⇒ m1(z) = m1
[Figure: two example distributions of service times with the same mean (m1), one memoryless (exponential) and one not.]
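The quantities above, computed for a made-up discrete service-time distribution (the distribution itself is an assumption chosen for illustration):

```python
dist = {1.0: 0.7, 5.0: 0.2, 20.0: 0.1}      # service time (ms) -> probability

m1  = sum(p * T for T, p in dist.items())               # mean
var = sum(p * (T - m1) ** 2 for T, p in dist.items())   # variance
C   = var / m1 ** 2                                      # squared coefficient of variance
m1z = 0.5 * m1 * (1 + C)                                 # mean residual wait time
print(m1, C, m1z)
```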
4/6/2009 cs252-S09, Lecture 18 34
A Little Queuing Theory: Mean Wait Time

[Figure: arrivals at rate λ enter a queue in front of a server with service rate μ = 1/Tser.]
• Parameters that describe our system:
– λ: mean number of arriving customers/second
– Tser: mean time to service a customer ("m1")
– C: squared coefficient of variance = σ²/m1²
– μ: service rate = 1/Tser
– u: server utilization (0 ≤ u ≤ 1): u = λ/μ = λ·Tser
• Parameters we wish to compute:
– Tq: time spent in the queue
– Lq: length of the queue = λ·Tq (by Little's Law)
• Basic approach:
– Customers ahead of us must finish; mean time = Lq·Tser
– If something is at the server, it takes m1(z) to complete on average
» Chance the server is busy = u, so the mean time is u·m1(z)
• Computation of wait time in queue (Tq):
– Tq = Lq·Tser + u·m1(z)
4/6/2009 cs252-S09, Lecture 18 35
Mean Residual Wait Time: m1(z)

[Figure: a timeline of n consecutive services T1, T2, T3, …, Tn, with a random arrival point landing somewhere inside one of them.]
• Imagine n samples
– There are n·P(Tx) samples of size Tx
– Total time occupied by samples of size Tx: n·P(Tx)·Tx
– Total time for n services: Σx n·P(Tx)·Tx = n·Tser
– Chance of arriving during a service of length Tx: n·P(Tx)·Tx / (n·Tser) = P(Tx)·Tx / Tser
– Average remaining time if we land in Tx: ½·Tx
– Finally, the average residual time m1(z):
m1(z) = Σx (½·Tx)·(P(Tx)·Tx / Tser) = E[T²] / (2·Tser) = ½·(σ² + Tser²) / Tser = ½·Tser·(1 + C)
4/6/2009 cs252-S09, Lecture 18 36
A Little Queuing Theory: M/G/1 and M/M/1
• Computation of wait time in queue (Tq):
– Tq = Lq·Tser + u·m1(z)
– Tq = λ·Tq·Tser + u·m1(z)        (Little's Law: Lq = λ·Tq)
– Tq = u·Tq + u·m1(z)              (definition of utilization: u = λ·Tser)
– Tq·(1 − u) = m1(z)·u
– Tq = m1(z)·u/(1 − u)  ⇒  Tq = Tser x ½(1+C) x u/(1 − u)
• Notice that as u → 1, Tq → ∞!
• Assumptions so far:
– System in equilibrium; no limit to the queue; works First-In-First-Out
– Times between two successive arrivals are random and memoryless ("M" for the C = 1 exponential distribution)
– The server can start on the next customer immediately after the prior one finishes
• General service distribution (no restrictions), 1 server:
– Called an M/G/1 queue: Tq = Tser x ½(1+C) x u/(1 − u)
• Memoryless service distribution (C = 1):
– Called an M/M/1 queue: Tq = Tser x u/(1 − u)
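The two closed forms, transcribed directly into helper functions (the function names are mine; the formulas are the slide's):

```python
def mg1_tq(t_ser, u, C):
    """M/G/1 time in queue: Tser * (1+C)/2 * u/(1-u)."""
    return t_ser * 0.5 * (1 + C) * u / (1 - u)

def mm1_tq(t_ser, u):
    """M/M/1 time in queue: the C = 1 special case, Tser * u/(1-u)."""
    return mg1_tq(t_ser, u, C=1.0)
```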
4/6/2009 cs252-S09, Lecture 18 37
A Little Queuing Theory: An Example
• Example usage statistics:
– User requests 10 x 8 KB disk I/Os per second
– Requests & service exponentially distributed (C = 1.0)
– Avg. service = 20 ms (from controller + seek + rotation + transfer)
• Questions:
– How utilized is the disk? » Ans: server utilization, u = λ·Tser
– What is the average time spent in the queue? » Ans: Tq
– What is the number of requests in the queue? » Ans: Lq
– What is the avg response time for a disk request? » Ans: Tsys = Tq + Tser
• Computation:
– λ (avg # arriving customers/s) = 10/s
– Tser (avg time to service a customer) = 20 ms (0.02 s)
– u (server utilization) = λ x Tser = 10/s x 0.02 s = 0.2
– Tq (avg time/customer in queue) = Tser x u/(1 − u) = 20 x 0.2/(1 − 0.2) = 20 x 0.25 = 5 ms (0.005 s)
– Lq (avg length of queue) = λ x Tq = 10/s x 0.005 s = 0.05
– Tsys (avg time/customer in system) = Tq + Tser = 25 ms
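The same computation as a script (numbers from the slide; variable names are mine):

```python
lam, t_ser = 10.0, 0.020          # 10 requests/s, 20 ms average service time
u     = lam * t_ser               # utilization = 0.2
t_q   = t_ser * u / (1 - u)       # M/M/1 time in queue = 5 ms
l_q   = lam * t_q                 # Little's Law: 0.05 requests in queue
t_sys = t_q + t_ser               # response time = 25 ms
print(u, t_q, l_q, t_sys)
```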
4/6/2009 cs252-S09, Lecture 18 38
Use Arrays of Small Disks?

[Figure: conventional designs use four different disk form factors (14", 10", 5.25", 3.5") from low end to high end; a disk array uses just one 3.5" disk design.]
• Katz and Patterson asked in 1987: can smaller disks be used to close the gap in performance between disks and CPUs?
4/6/2009 cs252-S09, Lecture 18 39
Array Reliability
• Reliability of N disks = reliability of 1 disk ÷ N
– 50,000 hours ÷ 70 disks = 700 hours
– Disk system MTTF: drops from 6 years to 1 month!
• Arrays (without redundancy) are too unreliable to be useful!
• Hot spares support reconstruction in parallel with access: very high media availability can be achieved
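The MTTF arithmetic behind the slide's numbers, assuming independent disk failures:

```python
mttf_disk_hours = 50_000
n_disks = 70
mttf_array_hours = mttf_disk_hours / n_disks   # ~714 hours, i.e. roughly one month
print(mttf_array_hours)
```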
4/6/2009 cs252-S09, Lecture 18 40
Redundant Arrays of Disks
• Files are "striped" across multiple spindles
• Redundancy yields high data availability
• Disks will fail
• Contents are reconstructed from data redundantly stored in the array
– Capacity penalty to store it
– Bandwidth penalty to update it
• Techniques:
– Mirroring/Shadowing (high capacity cost)
– Horizontal Hamming Codes (overkill)
– Parity & Reed-Solomon Codes
– Failure Prediction (no capacity overhead!): VaxSimPlus — technique is controversial
4/6/2009 cs252-S09, Lecture 18 41
Redundant Arrays of Disks
RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its "shadow" (recovery group); very high availability can be achieved
• Bandwidth sacrifice on write: a logical write = two physical writes
• Reads may be optimized
• Most expensive solution: 100% capacity overhead
• Targeted for high-I/O-rate, high-availability environments
4/6/2009 cs252-S09, Lecture 18 42
Redundant Arrays of Disks
RAID 5+: High I/O Rate Parity
• A logical write becomes four physical I/Os
• Independent writes are possible because of interleaved parity
• Reed-Solomon codes ("Q") for protection during reconstruction
• Targeted for mixed applications

[Figure: data striped across the disk columns with rotating parity; logical disk addresses increase across stripes, each composed of stripe units:]
D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
...
4/6/2009 cs252-S09, Lecture 18 43
Problems of Disk Arrays: Small Writes

RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes
[Figure: writing new data D0' into a stripe D0 D1 D2 D3 P: (1) read old data D0 and (2) read old parity P; XOR the old data, the new data, and the old parity to form the new parity P'; then (3) write new data D0' and (4) write new parity P', yielding the stripe D0' D1 D2 D3 P'.]
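A sketch of the read-modify-write parity update described above (pure Python; the block-access helpers passed in are hypothetical, but the XOR logic is exactly the RAID-5 small-write algorithm):

```python
def raid5_small_write(read_block, write_block, data_disk, parity_disk, new_data):
    old_data   = read_block(data_disk)      # (1) read old data
    old_parity = read_block(parity_disk)    # (2) read old parity
    # new parity = old parity XOR old data XOR new data
    new_parity = bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))
    write_block(data_disk, new_data)        # (3) write new data
    write_block(parity_disk, new_parity)    # (4) write new parity
```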
4/6/2009 cs252-S09, Lecture 18 44
System Availability: Orthogonal RAIDs

[Figure: an array controller fanning out to several string controllers, each driving a string of disks; redundancy groups run orthogonally to the strings.]
• Data recovery group: unit of data redundancy
• Redundant support components: fans, power supplies, controller, cables
• End-to-end data integrity: internal parity-protected data paths
4/6/2009 cs252-S09, Lecture 18 45
System-Level Availability
• Goal: no single points of failure
• Fully dual redundant: each host has duplicated I/O controllers and array controllers leading down to shared recovery groups of disks
• With duplicated paths, higher performance can be obtained when there are no failures
[Figure: two hosts, each with redundant I/O controllers and array controllers, connected to the same recovery groups.]
OceanStore: Global-Scale Persistent Storage
4/6/2009 cs252-S09, Lecture 18 47
Utility-based Infrastructure
[Figure: a wide-area OceanStore spanning providers such as Pac Bell, Sprint, IBM, AT&T, and a Canadian OceanStore.]
• Service provided by a confederation of companies
– Monthly fee paid to one service provider
– Companies buy and sell capacity from each other
4/6/2009 cs252-S09, Lecture 18 48
Important P2P Technology (Decentralized Object Location and Routing)
[Figure: a DOLR overlay routing requests for object GUIDs (GUID1, GUID2) to the nodes holding them.]
4/6/2009 cs252-S09, Lecture 18 49
Peer-to-peer systems can be very stable
(May 2003: 1.5 TB over 4 hours; in JSAC, to appear)
4/6/2009 cs252-S09, Lecture 18 50
The Path of an OceanStore Update
[Figure: clients send updates to the inner-ring servers, which disseminate the result through multicast trees to second-tier caches.]
4/6/2009 cs252-S09, Lecture 18 51
Archival Dissemination of Fragments
4/6/2009 cs252-S09, Lecture 18 52
Aside: Why Erasure Coding? High durability/overhead ratio!
• Exploit the law of large numbers for durability!
• With a 6-month repair interval, Fraction of Blocks Lost Per Year (FBLPY):
– Replication: 0.03
– Fragmentation: 10^-35
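A rough model of where numbers like these come from (all parameters below are assumptions for illustration, not the values behind the original plot): a replicated block is lost if every replica dies within one repair epoch, while an erasure-coded block is lost only if so many fragments die that fewer than the required m of n survive.

```python
from math import comb

p = 0.05             # assumed per-copy/per-fragment failure probability per repair epoch
epochs_per_year = 2  # 6-month repair interval

def replication_fblpy(replicas):
    per_epoch = p ** replicas                       # all replicas lost in one epoch
    return 1 - (1 - per_epoch) ** epochs_per_year

def erasure_fblpy(n, m):
    # lost if more than n - m fragments fail in one epoch (fewer than m survive)
    per_epoch = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n - m + 1, n + 1))
    return 1 - (1 - per_epoch) ** epochs_per_year

print(replication_fblpy(2), erasure_fblpy(32, 16))  # coding wins by many orders of magnitude
```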
4/6/2009 cs252-S09, Lecture 18 53
Conclusion
• Disk industry growing rapidly; improves:
– bandwidth 40%/yr,
– areal density 60%/year, $/MB faster?
• Disk Time = queue + controller + seek + rotate + transfer
• Advertised average seek time benchmark is much greater than the average seek time in practice
• Redundancy is useful to gain reliability
– Redundant disks + controllers + etc. (RAID)
– Geographical-scale systems (OceanStore)
• Queueing theory: W = ½(1+C) x u/(1 - u) x Tser; for memoryless service (C = 1): W = u/(1 - u) x Tser