RADIO: Managing the Performance of Large, Distributed Storage Systems
TRANSCRIPT
![Page 1: RADIO: Managing the Performance of Large, Distributed ...blogs.iec.cat/sct/wp-content/uploads/sites/19/2011/02/...2009/07/07 · writeback based on utilization, QoS app app I/O scheduler](https://reader033.vdocuments.site/reader033/viewer/2022060916/60a92d202b438d7f4a7e6f12/html5/thumbnails/1.jpg)
RADIO: Managing the Performance of Large, Distributed Storage Systems
Scott A. Brandt
Carlos Maltzahn, Anna Povzner, Roberto Pineiro, Andrew Shewmaker, and Tim Kaldewey
Computer Science Department, University of California, Santa Cruz
and Richard Golding and Ted Wong, IBM Almaden Research Center
UPC—July 7, 2009
Who am I?
• Professor, Computer Science Department, UC Santa Cruz
• Director, UCSC/LANL Institute for Scalable Scientific Data Management (ISSDM)
• Director, UCSC Systems Research Laboratory (SRL)
• Background
• 1999 Ph.D. CS, Colorado
• 1987/1993 B. Math/MS CS, Minnesota
• 1982-1994 Programmer/Research Scientist/VP, CPT, B-Tree, Honeywell SRC, Theseus Research, Alliant TechSystems RTS, Secure Computing
• My Research
• High-performance petascale storage
• Real-time systems
• Performance management and virtualization
• Active object-based storage
• Other Research
• Secure operating systems
• Asynchronous circuits
• Real-time image processing
Distributed systems need performance guarantees
• Many distributed systems and applications need (or want) I/O performance guarantees
  • Multimedia, high-performance simulation, transaction processing, virtual machines, service level agreements, real-time data capture, sensor networks, ...
  • Systems tasks like backup and recovery
  • Even so-called best-effort applications
• Providing such guarantees is difficult because it involves:
  • Multiple interacting resources
• Dynamic workloads
• Interference among workloads
• Non-commensurable metrics: CPU utilization, network throughput, cache space, disk bandwidth
In a nutshell
• Big distributed systems
  • Serve many users/jobs
  • Process petabytes of data
• Data center design
  • Use rules of thumb
  • Over-provision
  • Isolate
• Ad hoc performance management approaches create marginal storage systems that cost more than necessary
• A better system would guarantee each user the performance they need from the CPUs, memory, disks, and network
Outline
1. Problem: Managing the performance of large, distributed storage systems
2. Approach: End-to-end performance management
3. Model: RAD
4. Instances:
  • Disk
  • Network
  • Buffer cache
5. Application: Data Center Performance Management and Monitoring
End-to-end I/O performance guarantees
• Goal: Improve end-to-end performance management in large distributed systems
  • Manage performance
  • Isolate traffic
  • Provide high performance
• Targets: High-performance storage (LLNL), data centers (LANL), satellite communications (IBM), virtual machines (VMware), sensor networks, ...
• Approach:
1. Develop a uniform model for managing performance
2. Apply it to each resource
3. Integrate the solutions
Our current target
• High-performance I/O
  • From client, across network, through server, to disk
  • Up to hundreds of thousands of processing nodes
  • Up to tens of thousands of I/O nodes
  • Big, fat network interconnect
  • Up to thousands of storage nodes with cache and disk
• Challenges
  • Interference between I/O streams, variability of workloads, variety of resources, variety of applications, legacy code, system management tasks, scale
Stages in the I/O path
[Diagram: on each client, apps feed an I/O scheduler, client cache, and network transport; on the storage server, network transport, server cache, and disk storage. Annotations: flow control with one client; connection management between clients; I/O selection and head scheduling; prefetch and writeback based on utilization and QoS; integration between client and server caches.]
1. Disk I/O
2. Server cache
3. Flow control across network
• Within one client’s session and between clients
4. Client cache
System architecture
[Diagram: clients send requests to a QoS broker, which makes utilization reservations at the storage servers; admitted I/O then flows across the network and server caches to the disks.]
• Client: task, host, distributed application, VM, file, ...
• Reservations made via broker
• Specify workload: throughput, read/write ratio, burstiness, etc.
• Broker does admission control
• Requirements + workload are translated to utilization
• Utilizations are summed to see if they are feasible
• Once admitted, I/O streams are guaranteed (subject to workload adherence)
• Disk, caches, network controllers maintain guarantees
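The broker's admission test can be sketched as follows. The `Broker` class, resource names, and utilization numbers here are hypothetical; the rule from the slide is only that requirements and workload are translated to utilizations, summed, and admitted if feasible.

```python
# Hypothetical sketch of the broker's admission control: the class
# and resource names are invented; the slide's rule is that
# reservations are translated to utilizations, summed, and admitted
# only if the totals remain feasible.

class Broker:
    def __init__(self, capacities):
        # e.g. {"disk0": 1.0, "net0": 1.0}: capacity as utilization
        self.capacity = dict(capacities)
        self.allocated = {r: 0.0 for r in capacities}

    def admit(self, stream, demands):
        """demands maps resource -> utilization, e.g. {"disk0": 0.25}."""
        # Admission control: every touched resource must stay feasible.
        if any(self.allocated[r] + u > self.capacity[r]
               for r, u in demands.items()):
            return False                   # reject: infeasible reservation
        for r, u in demands.items():       # admit: record the utilization
            self.allocated[r] += u
        return True                        # stream is now guaranteed

broker = Broker({"disk0": 1.0, "net0": 1.0})
assert broker.admit("media1", {"disk0": 0.4, "net0": 0.2})
assert broker.admit("media2", {"disk0": 0.5})
assert not broker.admit("bg", {"disk0": 0.2})   # 0.4 + 0.5 + 0.2 > 1.0
```

Once a stream is admitted, its guarantee holds only subject to workload adherence, as the slide notes; policing the declared workload is a separate mechanism.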
Achieving robust guaranteeable resources
• Goal: Unified resource management algorithms capable of providing
  • Good performance
  • Arbitrarily hard or soft performance guarantees with
    • Arbitrary resource allocations
    • Arbitrary timing / granularity
  • Complete isolation between workloads
• All resources: CPU, disk, network, server cache, client cache
➡Virtual resources indistinguishable from “real” resources with fractional performance
Isolation is key
• CPU
  • 20% of a 3 GHz CPU should be indistinguishable from a 600 MHz CPU
  • Running: compiler, editor, audio, video
• Disk
  • 20% of a disk with 100 MB/second bandwidth should be indistinguishable from a disk with 20 MB/second bandwidth
  • Serving: 1 stream, n streams, sequential, random
Scott’s epistemology of virtualization
• Virtual Machines and LUNs provide good HW virtualization
• Question: Given perfect HW virtualization, how can a process tell the difference between a virtual resource and a real resource?
• Answer: By not getting its share of the resource when it needs it
Observation
• Resource management consists of two distinct decisions
  • Resource allocation: how many resources to allocate?
  • Dispatching: when to provide the allocated resources?
• Most resource managers conflate them
  • Best-effort, proportional-share, real-time
Separating them is powerful!
• Separately managing resource allocation and dispatching gives direct control over the delivery of resources to tasks
• Enables direct, integrated support of all types of timeliness needs
[Diagram: a 2×2 space of resource allocation (constrained/unconstrained) vs. dispatching (constrained/unconstrained): hard real-time (both constrained), soft real-time (missed deadlines tolerated), rate-based, and best-effort (CPU-bound, I/O-bound).]
The resource allocation/dispatching (RAD) scheduling model
[Diagram: a process's rate (share of resources) and deadlines (times at which allocation must equal share) feed a dispatcher as a series of jobs with budgets and deadlines.]
Supporting different timeliness requirements with RAD
[Diagram: hard real-time (period, WCET), soft real-time (period, ACET), rate-based (rate bounds), and best-effort (priority Pi) requirements are mapped by a scheduling policy onto a set of jobs with budgets and deadlines, which a common scheduling mechanism dispatches via the runtime system.]
Rate-Based Earliest Deadline (RBED) CPU scheduler
[Diagram: the scheduling policy produces rates and deadlines; the scheduling mechanism (EDF + timers) dispatches the resulting set of jobs with budgets and deadlines.]
• Processes have rate & period
  • ∑ rates ≤ 100%
  • Periods based on processing characteristics, latency needs, etc.
• Jobs have budget & deadline
  • budget = rate * period
  • Deadlines based on period or other characteristics
• Jobs dispatched via Earliest Deadline First (EDF)
  • Budgets enforced with timers
• Guaranteeing all budgets & deadlines guarantees all rates & periods
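A minimal sketch of the RBED rules above: admission keeps ∑rates ≤ 100%, each job's budget is rate × period, and jobs are dispatched earliest-deadline-first with budgets enforced. The process parameters and the one-tick simulation loop are illustrative assumptions; the real scheduler enforces budgets with timers, not a fixed tick.

```python
# Illustrative sketch of RBED: resource allocation (rates/periods)
# separated from dispatching (EDF over jobs with budgets/deadlines).

class Proc:
    def __init__(self, name, rate, period):
        self.name, self.rate, self.period = name, rate, period
        self.budget = rate * period      # budget = rate * period
        self.deadline = period           # first deadline: end of period 1

def admit(procs, p):
    # Resource allocation: total rate must stay at or below 100%
    return sum(q.rate for q in procs) + p.rate <= 1.0

def edf_pick(procs):
    # Dispatching: run the ready process with the earliest deadline
    ready = [p for p in procs if p.budget > 1e-9]
    return min(ready, key=lambda p: p.deadline) if ready else None

def simulate(procs, horizon):
    timeline = []
    for t in range(horizon):
        for p in procs:                  # replenish at period boundaries
            if t > 0 and t % p.period == 0:
                p.budget = p.rate * p.period
                p.deadline = t + p.period
        p = edf_pick(procs)
        timeline.append(p.name if p else "idle")
        if p:
            p.budget -= 1.0              # budget enforcement (timer abstracted)
    return timeline

procs = []
for spec in [("A", 0.5, 4), ("B", 0.25, 8)]:
    p = Proc(*spec)
    assert admit(procs, p)               # sum of rates stays <= 100%
    procs.append(p)
assert not admit(procs, Proc("C", 0.3, 4))   # 0.75 + 0.3 would exceed 100%

timeline = simulate(procs, 8)
assert timeline.count("A") == 4 and timeline.count("B") == 2  # rates met
```

Over the 8-tick horizon, A (rate 0.5) gets 4 ticks and B (rate 0.25) gets 2, each by its deadlines, which is exactly the "all budgets & deadlines ⟹ all rates & periods" property the slide states.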
Adapting RAD to disk, network, and buffer cache
• Fahrrad: guaranteed disk request scheduling (Anna Povzner, UCSC)
• RADoN: guaranteeing storage network performance (Andrew Shewmaker, UCSC and LANL)
• Radium: buffer management for I/O guarantees (Roberto Pineiro, UCSC)
Guaranteed disk request scheduling
• Goals
  • Hard and soft performance guarantees
  • Isolation between I/O streams
  • Good I/O performance
• Challenging because disk I/O is:
  • Stateful
  • Non-deterministic
  • Non-preemptable
  • Best- and worst-case times vary by 3–4 orders of magnitude
Fahrrad
• Manages disk time instead of disk throughput
• Adapts RAD/RBED to disk I/O
• Reorders aggressively to provide good performance, without violating guarantees
[Diagram: I/O streams A, B, C, and a best-effort (BE) stream feed Fahrrad, which schedules the disk.]
A bit more detail
• Reservations in terms of disk time utilization and period (granularity)
• All I/Os feasible before the earliest deadline in the system are moved to a Disk Scheduling Set (DSS)
• I/Os in the DSS are issued in the most efficient way
• I/O charging model is critical
• Overhead reservation ensures exact utilization
  • 2 WCRTs per period for “context switches”
  • 1 WCRT per period to ensure the last I/Os complete
  • 1 WCRT for the process with the shortest period, due to non-preemptability
Fahrrad outperforms Linux
• Workload
  • Media 1: 400 sequential I/Os per second (20%)
  • Media 2: 800 sequential I/Os per second (40%)
• Transaction: short bursts of random I/Os at random times (30%)
• Background: random (10%)
• Result: Better isolation AND better throughput
[Plots: per-stream throughput over time under Linux vs. Linux with Fahrrad.]
New work: virtual disks
• Provide workload-independent performance guarantees
• Isolate from other workloads concurrently accessing the device
• LUNs virtualize storage capacity
• Fahrrad virtualizes storage performance
Fahrrad virtual disks
• Implemented with the Fahrrad real-time I/O scheduler
• Guarantee a reserved and isolated share of the time on the storage device
  • Hard guarantees on performance isolation
• Virtual disk throughput is the same as the equivalent standalone throughput
• Amount of data transferred: ∀i, Di(x%, t) = Di(100%, x%·t)
Guaranteeing performance isolation
• Virtual disk reservation: disk share (utilization) and time granularity (period)
  • Account for all extra (inter-stream) seeks
• Reserve overhead utilization to do them
• Charge each I/O stream for all of the time it uses, including inter- and intra-stream seeks
• Reservation = Disk Share + Overhead utilization
[Diagram: disk time divided among reservations such as 25% at 1 s, 30% at 250 ms, and 19% at 1 min granularity.]
Extra seeks
• Intra-stream seeks caused by workload
• Inter-stream seeks caused by reservations
  • Low time granularity causes more frequent seeks
• At most two extra seeks per stream per period
  • To and away from the stream to meet deadlines
➡Extra seeks caused by un-queued requests
Charging model and reservations
• Charge streams responsible for inter-stream seeking
  • From overhead: for seeks caused by reservations
• From reservation: for seeks caused by bursty behavior
• Overhead utilization needed for hard guarantees
• Overhead utilization = WCRT/p + 2*WCRT/p + WCRT/p
• We can trade-off hard guarantees for lower overhead by assuming less than worst-case request time
[Diagram annotations: guarantee reserved utilization; account for inter-stream seeks; maintain 2 outstanding requests.]
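Plugging assumed numbers into the overhead formula above makes the cost concrete; the WCRT, period, and share values below are illustrative, not from the talk.

```python
# Worked example of: Overhead utilization = WCRT/p + 2*WCRT/p + WCRT/p.
# The WCRT and period values are assumed for illustration.
WCRT = 0.025              # worst-case request time: 25 ms (assumed)
p = 1.0                   # reservation period: 1 s (assumed)

last_ios  = WCRT / p      # 1 WCRT: ensure the last I/Os complete
switches  = 2 * WCRT / p  # 2 WCRTs: inter-stream "context switches"
shortest  = WCRT / p      # 1 WCRT: shortest period, non-preemptability

overhead = last_ios + switches + shortest        # = 4 * WCRT / p
reservation = 0.25 + overhead                    # disk share + overhead
# A 25 ms WCRT over a 1 s period costs 10% overhead, so a 25% share
# needs a 35% reservation; halving the period doubles the overhead.
assert abs(overhead - 0.10) < 1e-9
assert abs(reservation - 0.35) < 1e-9
```

Because overhead scales as WCRT/p, the fine-grained (short-period) reservations shown in the diagram are exactly the expensive ones, which is the trade-off the soft-guarantee slides later exploit.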
Performance: guaranteeing throughput
• Throughput is determined by reservation and workload
• Each virtual disk reserves 20% with 1-second granularity
[Plot: amount of data transferred (MB) vs. run length of the semi-sequential stream (MB); curves for sequential, semi-sequential, and random virtual disks (each 20% share, 150 s).]
Performance: guaranteeing throughput
• Throughput is determined by reservation and workload
• Each virtual disk reserves 20% with 1-second granularity
[Plot: amount of data transferred (MB) vs. run length of the semi-sequential stream (MB); virtual-disk curves (20% share, 150 s) and standalone curves (100%, 30 s) for sequential, semi-sequential, and random streams.]
Performance: Controlling throughput
• Each virtual disk is isolated from the other
• Performance is fully determined by the reservation and workload
[Plots: data transferred (MB) and disk time reservation (%) vs. reserved share for the sequential stream.]
Performance: Controlling latency
• Reservation granularity bounds latency: period = latency/2
• Virtual device serves periodic semi-sequential stream and shares storage with random background stream. Four experiments for different period reservations.
[Plots: fraction of I/Os vs. latency with upper bounds marked, and utilization vs. period of the virtual disk.]
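The granularity/latency rule above can be applied directly when sizing a reservation; the latency targets below are assumed for illustration.

```python
# The slide's rule: reservation period = latency / 2.  Intuition: an
# I/O released just after a period boundary can wait out the rest of
# that period and is then served within the next one, so the worst
# case is about two periods.  Target values below are assumed.

def period_for_latency(latency_bound_ms):
    return latency_bound_ms / 2.0

for target in (50, 200, 1000):        # desired latency bounds in ms
    period = period_for_latency(target)
    assert 2 * period == target       # e.g. 200 ms bound -> 100 ms period
```

Combined with the overhead formula, this shows why tight latency bounds are costly: a 50 ms bound forces a 25 ms period, and overhead utilization grows as the period shrinks.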
Performance: Isolation guarantees
• Hard guarantees require high overhead (proportional to reservation granularity)
• Three virtual disks each serving one sequential stream with many outstanding I/Os share a storage system with a random background stream.
[Plots: disk time reservation (%) and data transferred (MB) vs. period of virtual disk 3.]
Performance: Soft guarantees w/isolation
• Overhead based on less than worst-case I/O time
• Increased short-term throughput variation
• A virtual disk (10%, 1 sec) runs one sequential stream with a 400 I/O/sec arrival rate and shares the system with 5 virtual disks, each running one random stream.
[Plots: disk time reservation (%) and data transferred (%) vs. percentile of observed service times.]
Performance: Soft guarantees w/isolation
• Linux fails to support Cello99 (variation up to 30% from standalone)
• Fahrrad Virtual Disks provide Cello99 and OpenMail performance close to standalone
• Cello99 and OpenMail virtual disks share the system with a random background stream.
[Plots: throughput (I/Os per second) over time under Linux vs. Fahrrad Virtual Disks.]
Fahrrad Virtual Disks
1. Guarantee throughput by accounting for overhead and guaranteeing utilization
2. Guarantee isolation between workloads by accurately accounting for all disk time
3. Provide high throughput (w/guarantees) by minimizing interference between workloads
4. Result: performance of virtual disk depends only on reservation, workload, and performance of device
Guaranteeing storage network performance
• Goals
  • Hard and soft performance guarantees
  • Isolation between I/O streams
  • Good I/O performance
• Challenging because network I/O is:
  • Distributed
  • Non-deterministic (due to collisions or switch queue overflows)
  • Non-preemptable
• Assumption: closed network
What we want
[Diagram: three clients connected through the network to three servers, with reserved shares of 30%, 50%, and 20%.]
What we have
• Switched fat tree w/full bisection bandwidth
• Issue 1: Capacity of shared links
• Issue 2: Switch queue contention
Congestion in a simple switch model
• Each transmit port on the switch is a collision domain
[Diagram: an 8-port switch; tx/rx ports connect through the switch fabric, each transmit port fronted by a shared FIFO queue.]
Congestion in a simple switch model
• One of the packets arriving at the same switch transmit port is delayed on the queue
[Diagram: ports 1 and 2 both send to port 5, so streams 1 and 2 congest at port 5's transmit queue.]
Congestion in a simple switch model
• Delayed packets from unrelated streams affect each other on the queue
[Diagram: ports 1 and 2 send to port 5 while ports 3 and 4 send to port 8; the annotations mark streams 1 and 2, 3 and 4, and 2 and 4 congesting in the shared queues.]
TCP
• “Those who do not understand TCP are destined to reimplement it.” (Jon Postel)
• Ack-clocked flow control
• Packet loss based congestion control
• Sawtooth throughput
• Incast throughput collapse
Network resource usage measurements
• Round trip time RTTi = Ci - Si
• Combines queueing effects on forward and reverse path + response time
[Timeline: Si and Ci are both read from the sender's clock (clock 1).]
Network resource usage measurements
• One-way delay OWDi = Ri - Si
• Isolates queueing effects on the forward path, but
• Requires synchronized clocks
[Timeline: Si is read from clock 1 and Ri from clock 2.]
Network resource usage measurements
• Relative forward delay RFDi,j = (Rj - Ri) - (Sj - Si)
• Isolates queueing effects on the forward path, and
• Does not require synchronized clocks
  • But they must be relatively stable
[Timeline: Si and Sj are read from clock 1; Ri and Rj from clock 2.]
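The three delay measurements above can be written directly from their definitions; the timestamp values are invented for illustration.

```python
# The three measurements from the slides, computed from send (S),
# receive (R), and completion (C) timestamps.  Timestamps are made up.

def rtt(S_i, C_i):
    # Round trip time: forward + reverse queueing + response time,
    # both stamps read from the sender's clock.
    return C_i - S_i

def owd(S_i, R_i):
    # One-way delay: forward-path queueing only, but S and R come
    # from different clocks, which must be synchronized.
    return R_i - S_i

def rfd(S_i, R_i, S_j, R_j):
    # Relative forward delay: forward-path queueing without clock
    # synchronization; only clock *stability* is required, since a
    # constant offset between the two clocks cancels out.
    return (R_j - R_i) - (S_j - S_i)

# Packet i sent at 10.0 and received at 12.0; packet j sent at 10.5
# and received at 13.1: packet j saw 0.6 more forward queueing delay.
assert rtt(10.0, 14.0) == 4.0
assert owd(10.0, 12.0) == 2.0
assert abs(rfd(10.0, 12.0, 10.5, 13.1) - 0.6) < 1e-9
```

Note how any fixed offset added to both Ri and Rj leaves the RFD unchanged, which is exactly why it needs stable rather than synchronized clocks.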
RADoN
• A reservation has a network share (utilization) and a time granularity (period)
• Two real-time scheduling algorithms
  • Earliest Deadline First (EDF): absolute deadlines
  • Least Laxity First (LLF): relative laxities
[Timeline: release … now … deadline; laxity is the slack remaining between now and the deadline.]
Approximating optimal scheduling
• Flow control: throttling senders
  • Execution time (per period): e = utilization * period
  • Budget in packets: m = e * packets_per_second
• Congestion control: avoiding switch contention by adjusting the wait time between packets
  • Percent budget: %budget = (1 - %laxity) = e/(d - t)
  • Packet wait time: w = wmin / %budget
  • Wait-time step: w∆ = -|wi - wmin|/2
  • New wait time: wi+1 = min(wmax, max(wmin, wi + w∆))
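A minimal sketch of these sender-side rules follows. Two readings are ours, not stated on the slide: we take e = utilization × period and m = e × packets_per_second as the dimensionally consistent forms, and we apply the wait-time step to the current wait time before clamping. All numeric parameters are assumed.

```python
# Sketch of RADoN's sender-side pacing under our reading of the
# slide's formulas (see lead-in); all numbers are illustrative.

def packet_budget(utilization, period, packets_per_second):
    e = utilization * period          # execution time per period (s)
    return e * packets_per_second     # budget in packets

def next_wait(w_i, w_min, w_max, e, d, t):
    pct_budget = e / (d - t)          # %budget = (1 - %laxity) = e/(d-t)
    w_target = w_min / pct_budget     # relax pacing when laxity is high
    delta = -abs(w_i - w_min) / 2     # step the wait time toward w_min
    return min(w_max, max(w_min, w_i + delta)), w_target

# 20% of the network with a 1 s period at 10,000 packets/s:
assert abs(packet_budget(0.20, 1.0, 10_000) - 2000) < 1e-9  # pkts/period

# Halfway to the deadline with half the work left: %budget = 1, so
# the target wait time is w_min, and the pacing steps toward it.
w_next, w_target = next_wait(w_i=4.0, w_min=1.0, w_max=10.0,
                             e=0.5, d=1.0, t=0.5)
assert w_target == 1.0
assert w_next == 2.5                  # 4.0 - |4.0 - 1.0| / 2
```

As laxity shrinks (d − t approaches e), %budget approaches 1 and the sender paces at its minimum wait time; ample laxity yields %budget < 1 and a proportionally longer wait between packets.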
Queue modeling: single network stream
• No contention: 765 Mbps w/no lost packets
[Plot: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models.]
Queue modeling: punctuated stream
• Contention: 5 bursts of 250 Mbps
[Plot: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models, with lost packets marked.]
The median filter detects congestion before packets are lost, and detects the queue draining after congestion
Queue modeling: punctuated adaptive stream
• Contention: 5 bursts of 250 Mbps
Adapting to median-filter model decreases packet loss
[Plot: queue depth (packets) vs. packet sequence number for the basic, median-filter, and pathload models, with lost packets marked]
Userspace RADoN prototype
• Detects congestion using Relative Forward Delay
• Responds to congestion using RAD real-time theory
• Decreases packet loss significantly
• Improves goodput
• Requires no global knowledge or synchronization
• Ongoing: RADoN kernel implementation
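Relative Forward Delay needs no clock synchronization, because only the change in one-way delay matters. A minimal sketch (the function name and timestamp representation are assumptions):

```python
def relative_forward_delay(send_times, recv_times):
    """One-way delay of each packet relative to the minimum delay seen.

    The sender/receiver clock offset cancels out because it shifts
    every delay by the same constant; a growing RFD means packets
    are queuing at the bottleneck.
    """
    delays = [r - s for s, r in zip(send_times, recv_times)]
    base = min(delays)
    return [d - base for d in delays]
```

The RFD series, scaled by the bottleneck rate, yields per-packet queue-depth estimates of the kind the queue models above track.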
Buffer management for I/O guarantees
• Goals
  • Hard and soft performance guarantees
  • Isolation between I/O streams
  • Improved I/O performance
• Challenging because:
  • The buffer is space-shared rather than time-shared
  • Space limits time guarantees
  • Best- and worst-case behavior are the opposite of the disk's
  • Buffering affects performance in non-obvious ways
Guarantees in the buffer cache
• Role
  • Improve performance
  • Preserve & enhance guarantees
• App-specific guarantees:
  • Hard at core
  • Soft when possible
  • Predictable
• Hard isolation
• Device time utilization
[Diagram: Apps 1 through n issue I/O through a resource broker, buffer cache, and disk scheduler down to the disk]
Buffering roles in storage servers
• Staging and de-staging data
  • Decouples sender and receiver
• Speed matching
  • Allows slower and faster devices to communicate
• Traffic shaping
  • Shapes traffic to optimize the performance of interfacing devices
• Assumption: reuse primarily occurs at the client
Radium
• I/O into and out of the buffer have rates and time granularities (periods)
• Period transformation: the period into the cache may be shorter than from cache to disk
• Rate transformation: the rate into the cache may be higher than the disk can support
• Partition the cache based on I/O characteristics and performance requirements
• Cache policies enhance performance within constraints determined by I/O requirements
  • Use slack to prefetch reads and delay writes
[Diagram: Apps 1 through 3 issue I/O through the buffer cache and disk scheduler to the disk]
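Period transformation can be sketched as simple batching. This illustrative model (the class name and integer batch size are assumptions, not Radium's code) absorbs one small write per short app period and emits one large transfer per longer disk period:

```python
class PeriodTransformer:
    """Batch writes arriving each app period into one disk-period transfer."""

    def __init__(self, disk_period, app_period):
        assert disk_period % app_period == 0
        self.batch = disk_period // app_period  # app periods per disk period
        self.pending = []

    def app_write(self, block):
        """Buffer one write; return a full batch once a disk period's
        worth of writes has accumulated, else None."""
        self.pending.append(block)
        if len(self.pending) == self.batch:
            out, self.pending = self.pending, []
            return out
        return None
```

The disk then sees one large sequential burst per long period, which is also where the slack for delaying writes comes from.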
Enhancing guarantees in the buffer cache
• Reclaim unused resources (e.g., unused overhead)
  • Use slack to prefetch reads and delay writes
  • Allow more unguaranteed services
• Resource redistribution (buffer swapping) accommodates burstiness
• Period transformation: the period into the cache may be shorter than from cache to disk
• Rate transformation: the rate into the cache may be higher than the disk can support
Managing a sequential workload
[Bar charts: throughput (thousands of I/Os per sec) of a sequential workload, executed in isolation (left) and combined with a random workload under reservations (right), for no-cache, monolithic, and Radium caches over the fifo, noop, deadline, anticipatory, cfq (50%/50%), quanta (50%/50%, 2 sec), and RAD (50%/50%, 2 sec) schedulers, against the target performance reservation]
Managing a random workload
[Bar charts: throughput (tens of I/Os per sec) of a random workload, executed in isolation (left) and combined with a sequential workload under reservations (right), for no-cache, monolithic, and Radium caches over the fifo, noop, deadline, anticipatory, cfq (50%/50%), quanta (50%/50%, 2 sec), and RAD (50%/50%, 2 sec) schedulers, against the target performance reservation]
Managing combined workloads
[Bar charts: combined throughput (thousands of I/Os per sec) of the random (top) and sequential (bottom) workloads under no cache, monolithic, and Radium caches, for the fifo, noop, deadline, anticipatory, cfq, quanta, and RAD schedulers, against the target performance]
Controlling throughput w/mixed workloads
[Plots: actual vs. target throughput (I/Os per sec) for two streams (2 sec periods) against their ideal lines, under Radium+FIFO, Radium+NOOP, Radium+deadline, Radium+Anticipatory, Radium+CFQ, and Radium+RAD]
Consistent and predictable throughput for arbitrary reservations
Controlling latency w/mixed workloads
[Plots: cumulative distribution of latency (ms) for streams with periods of 1 sec, 750 ms, 500 ms, and 250 ms, with per-period upper bounds marked, under Radium+CFQ and Radium+RAD]
Precise control over the service times of each stream
Results w/complex workloads
[Plots: average throughput (I/Os per sec) over time for Radium+CFQ and Radium+RAD with five streams: soft streams 1 and 2 (period 500 ms), hard stream 3 (period 500 ms), and greedy streams 4 and 5 (period 1 sec)]
Reasonable control with complex workloads
Data center performance management
• Big distributed systems
  • Serve many users/jobs
  • Process petabytes of data
• Data center design
  • Use rules of thumb
  • Over-provision
  • Isolate
• Ad hoc performance management creates marginal storage systems that cost more than necessary
• A better system would guarantee each user the performance they need from the CPUs, memory, disks, and network
Data center performance mgmt. goals
1. A first-principles model for data center perf. mgmt.
2. Full-system performance metrics for client processing nodes, buffer cache, network, server buffer cache, and disk
3. Performance visualization by application, client node, reservation, or device
4. Application workload profiling and modeling
5. Full system performance provisioning and management based on all of the above
6. Online machine-learning based performance monitoring for real-time diagnostics
RADIX
• $1 million from UC Lab Fee program
• Based on schedulers and workload-independent utilization metrics from our E2E QoS research
• Plan
  1. Performance model and metrics
  2. Tools for profiling, prediction, and planning
  3. Operating system components
  4. Performance monitors and visualization tools
• Case study: LANL data centers
Conclusion
• Distributed I/O performance management requires management of many separate components
• An integrated approach is needed
• RAD provides the basis for a solution
• It has been successfully applied to several resources: CPU, disk, network, and buffer cache
• We are on our way to an integrated solution
• There are many useful applications: Data center performance management, full storage virtualization, ...