Latency as a Performability Metric for Internet Services
Pete [email protected]
Outline
1. Performability background/review
2. Latency-related concepts
3. Project status
• Initial test results
• Current issues
Motivation
• A goal of the ROC project: develop metrics to evaluate new recovery techniques
• Problem: the basic concept of availability assumes the system is either “up” or “down” at a given time
• “Nines” only describe the fraction of uptime over a certain interval
Why Is Availability Insufficient?
• Availability doesn’t describe the durations or frequencies of individual outages
– Both can strongly influence user perception of the service, as well as revenue
• Availability doesn’t capture the system’s capacity to support degraded service
– degraded performance during failures
– reduced data quality during high load (Web)
What Is “Performability”?
• A combination of performance and dependability measures
• Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults¹
– Concept from the traditional fault-tolerant systems community, ca. 1978
– Has since been applied to other areas, but is still not in widespread use

¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Performability Example
Discrete-time Markov chain (DTMC) model of a RAID-5 disk array¹
• p_i(t) = probability that the system is in state i at time t
• State 0: normal operation; state 1: one disk failed, repair necessary; state 2: failure (data loss)
• λ = failure rate of a single disk drive, μ = disk repair rate, D = number of data disks
• Transition rates: state 0 → 1 at (D+1)λ, state 1 → 0 at μ, state 1 → 2 at Dλ
• w_i(t) = reward in state i (disk I/O operations/sec): w0(t), w1(t), w2(t)

¹ Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
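The chain above can be sketched in a few lines of Python. The numeric rates and rewards below are illustrative assumptions, not values from the thesis:

```python
# Minimal sketch of the slide's DTMC; LAM, MU, D, and W are assumed values.
LAM = 0.001   # per-step failure probability of a single disk (assumed)
MU = 0.1      # per-step repair probability (assumed)
D = 4         # number of data disks

# States: 0 = normal, 1 = one disk failed (repairing), 2 = data loss.
# Row-stochastic transition matrix P[i][j] = Pr(next state j | current state i).
P = [
    [1 - (D + 1) * LAM, (D + 1) * LAM, 0.0],
    [MU, 1 - MU - D * LAM, D * LAM],
    [0.0, 0.0, 1.0],  # data loss is absorbing
]

W = [250.0, 100.0, 0.0]  # reward w_i: disk I/O ops/sec in state i (assumed)

def step(p):
    """One DTMC step: p(t+1) = p(t) * P."""
    return [sum(p[i] * P[i][j] for i in range(3)) for j in range(3)]

def expected_reward(p):
    """Performability reward E[w(t)] = sum_i p_i(t) * w_i."""
    return sum(pi * wi for pi, wi in zip(p, W))

p = [1.0, 0.0, 0.0]  # start in normal operation
for _ in range(1000):
    p = step(p)
print(p, expected_reward(p))
```

Iterating the chain shows the expected reward decaying from the fault-free 250 ops/sec as probability mass leaks into the degraded and failed states.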
Visualizing Performability
[Figure: throughput (I/O operations/sec) over time. Throughput drops from the normal level to a degraded level between FAILURE and RECOVER, through the DETECT and REPAIR phases, lowering the average throughput.]
Metrics for Web Services
• Throughput: requests/sec
• Latency: render time, time to first byte
• Data quality
– harvest (response completeness)
– yield (% of queries answered)¹

¹ E. Brewer, Lessons from Giant-Scale Internet Services, 2001
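Brewer’s two data-quality metrics can be made concrete with toy numbers (assumed here, not measured): yield counts how many queries get any answer, harvest counts how complete each answer is.

```python
# Toy illustration (assumed numbers) of yield and harvest for a
# partitioned service during a partial failure.
queries_offered = 1000
queries_answered = 980        # the rest timed out or were shed
partitions_total = 8
partitions_reachable = 7      # one data partition was down

yield_fraction = queries_answered / queries_offered    # fraction of queries answered
harvest = partitions_reachable / partitions_total      # completeness of each answer
```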
Applications of Metrics
• Modeling the expected failure-related performance of a system, prior to deployment
• Benchmarking the performance of an existing system during various recovery phases
• Comparing the reliability gains offered by different recovery strategies
Related Projects
• HP: Automating Data Dependability
– uses “time to data access” as one objective for storage systems
• Rutgers: PRESS/Mendosus
– evaluated throughput of the PRESS server during injected failures
• IBM: Autonomic Storage
• Numerous ROC projects
Arguments for Using Latency as a Metric
• Originally, performability metrics were meant to capture the end-user experience¹
• Latency better describes the experience of an end user of a web site
– response time >8 sec = site abandonment = lost income²
• Throughput describes the raw processing ability of a service
– best used to quantify expenses

¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
² Zona Research and Keynote Systems, The Need for Speed II, 2001
Current Progress
• Using the Mendosus fault-injection system on a 4-node PRESS web server (both from Rutgers)
• Running latency-based performability tests on the cluster
– Inject faults during a load test
– Record page-load times before, during, and after faults
Test Setup
• Normal version: cooperative caching
• HA version: cooperative caching + heartbeat monitoring
[Figure: test clients send requests through an emulated switch to the PRESS web server nodes running under Mendosus; the nodes exchange caching info and pages, and responses return to the clients.]
Effect of Component Failure on Performability Metrics
[Figure: throughput and latency as performability metrics plotted against time, each deviating from normal between FAILURE and REPAIR.]
Observations
• Below saturation, throughput is more dependent on load than latency is
• Above saturation, latency is more dependent on load
[Figure: request timeline over five time steps, e.g. throughput 6/s with latency 0.14 s, throughput 3/s with latency 0.14 s, throughput 7/s with latency 0.4 s.]
How to Represent Latency?
• Average response time over a given time period
– Make a distinction between “render time” and “time to first byte”?
• Deviation from baseline latency
– Impose a greater penalty for deviations toward longer wait times?
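Both candidate representations can be sketched directly; the sample times, the baseline, and the asymmetric weight below are hypothetical illustrations, not measurements from the cluster.

```python
# Two candidate latency summaries from the slide, with assumed inputs.
BASELINE = 0.14  # sec: assumed fault-free latency
samples = [0.12, 0.15, 0.40, 2.10, 0.13]  # hypothetical page-load times

# 1. Average response time over the measurement period
average = sum(samples) / len(samples)

# 2. Mean deviation from baseline, penalizing slowdowns more than speedups
def deviation(samples, baseline=BASELINE, slow_weight=2.0):
    total = 0.0
    for s in samples:
        d = s - baseline
        # slower-than-baseline deviations count double (assumed weight)
        total += slow_weight * d if d > 0 else -d
    return total / len(samples)
```

The plain average hides one bad outlier among many fast responses; the weighted deviation surfaces it, at the cost of choosing a baseline and a penalty weight.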
Response Time with Load Shedding Policy
[Figure: response time (sec) vs. time, with an 8 s abandonment threshold and a lower load-shedding threshold; between FAILURE and REPAIR, X users get a “server too busy” message.]
Load Shedding Issues
• Load shedding means returning 0% data quality: a different kind of performability metric
• To combine load shedding and latency, define a “demerit” system:
– “Server too busy” msg: 3 demerits
– >8 sec response time: 1 demerit/sec
• Such systems quickly lose generality, however
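One reading of the slide’s weights, sketched below; the function name and data shape are illustrative, and “1 demerit/sec” is interpreted here as charging only the time past the 8 s threshold (charging the full response time would be another valid reading).

```python
# Hypothetical demerit scoring combining load shedding and latency.
ABANDON_THRESHOLD = 8.0  # sec: response time at which users abandon the site
SHED_COST = 3.0          # demerits per "server too busy" response
SLOW_COST = 1.0          # demerits per second past the threshold

def demerits(responses):
    """responses: response times in seconds, or None for a shed request."""
    total = 0.0
    for r in responses:
        if r is None:                    # request was shed (0% data quality)
            total += SHED_COST
        elif r > ABANDON_THRESHOLD:      # slow enough to risk abandonment
            total += SLOW_COST * (r - ABANDON_THRESHOLD)
    return total
```

A single scalar makes runs comparable, but as the slide notes, the chosen weights bake in one site’s economics and the score quickly loses generality.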
Further Work
• Collect more experimental results!
• Compare throughput- and latency-based results of the normal and high-availability versions of PRESS
• Evaluate the usefulness of “demerit” systems to describe the user experience (latency and data quality)