Latency as a Performability Metric for Internet Services
Pete [email protected]
Outline
1. Performability background/review
2. Latency-related concepts
3. Project status
• Initial test results
• Current issues
Motivation
• A goal of the ROC project: develop metrics to evaluate new recovery techniques
• Problem: the basic concept of availability assumes the system is either “up” or “down” at a given time
• “Nines” only describe the fraction of uptime over a certain interval
Why Is Availability Insufficient?
• Availability doesn’t describe the durations or frequencies of individual outages
– Both can strongly influence user perception of the service, as well as revenue
• Availability doesn’t capture the system’s capacity to support degraded service
– degraded performance during failures
– reduced data quality during high load (Web)
What Is “Performability”?
• A combination of performance and dependability measures
• Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults¹
– Concept from the traditional fault-tolerant systems community, ca. 1978
– Has since been applied to other areas, but is still not in widespread use

¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Performability Example
Discrete-time Markov chain (DTMC) model of a RAID-5 disk array¹
• p_i(t) = probability that the system is in state i at time t
• State 0: normal operation; state 1: one disk failed, repair necessary; state 2: failure (data loss)
• λ = failure rate of a single disk drive, μ = disk repair rate, D = number of data disks
• Transition rates: state 0 → 1 at (D+1)λ, state 1 → 0 at μ, state 1 → 2 at Dλ
• w_i(t) = reward in state i (disk I/O operations/sec): w0(t), w1(t), w2(t)

¹ Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
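The chain above can be sketched in a few lines of Python. The numeric rates and rewards below are illustrative assumptions, not values from the thesis:

```python
# Minimal sketch of the slide's DTMC; LAM, MU, D, and W are assumed values.
LAM = 0.001   # per-step failure probability of a single disk (assumed)
MU = 0.1      # per-step repair probability (assumed)
D = 4         # number of data disks

# States: 0 = normal, 1 = one disk failed (repairing), 2 = data loss.
# Row-stochastic transition matrix P[i][j] = Pr(next state j | current state i).
P = [
    [1 - (D + 1) * LAM, (D + 1) * LAM, 0.0],
    [MU, 1 - MU - D * LAM, D * LAM],
    [0.0, 0.0, 1.0],  # data loss is absorbing
]

W = [250.0, 100.0, 0.0]  # reward w_i: disk I/O ops/sec in state i (assumed)

def step(p):
    """One DTMC step: p(t+1) = p(t) * P."""
    return [sum(p[i] * P[i][j] for i in range(3)) for j in range(3)]

def expected_reward(p):
    """Performability reward E[w(t)] = sum_i p_i(t) * w_i."""
    return sum(pi * wi for pi, wi in zip(p, W))

p = [1.0, 0.0, 0.0]  # start in normal operation
for _ in range(1000):
    p = step(p)
print(p, expected_reward(p))
```

Iterating the chain shows the expected reward decaying from the fault-free 250 ops/sec as probability mass leaks into the degraded and failed states.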
Visualizing Performability
[Figure: throughput (I/O operations/sec) over time. Throughput drops from the normal level to a degraded level between FAILURE and RECOVER, through the DETECT and REPAIR phases, lowering the average throughput.]
Metrics for Web Services
• Throughput: requests/sec
• Latency: render time, time to first byte
• Data quality
– harvest (response completeness)
– yield (% of queries answered)¹

¹ E. Brewer, Lessons from Giant-Scale Internet Services, 2001
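Brewer’s two data-quality metrics can be made concrete with toy numbers (assumed here, not measured): yield counts how many queries get any answer, harvest counts how complete each answer is.

```python
# Toy illustration (assumed numbers) of yield and harvest for a
# partitioned service during a partial failure.
queries_offered = 1000
queries_answered = 980        # the rest timed out or were shed
partitions_total = 8
partitions_reachable = 7      # one data partition was down

yield_fraction = queries_answered / queries_offered    # fraction of queries answered
harvest = partitions_reachable / partitions_total      # completeness of each answer
```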
Applications of Metrics
• Modeling the expected failure-related performance of a system, prior to deployment
• Benchmarking the performance of an existing system during various recovery phases
• Comparing the reliability gains offered by different recovery strategies
Related Projects
• HP: Automating Data Dependability
– uses “time to data access” as one objective for storage systems
• Rutgers: PRESS/Mendosus
– evaluated throughput of the PRESS server during injected failures
• IBM: Autonomic Storage
• Numerous ROC projects
Arguments for Using Latency as a Metric
• Originally, performability metrics were meant to capture the end-user experience¹
• Latency better describes the experience of an end user of a web site
– response time >8 sec = site abandonment = lost income²
• Throughput describes the raw processing ability of a service
– best used to quantify expenses

¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
² Zona Research and Keynote Systems, The Need for Speed II, 2001
Current Progress
• Using the Mendosus fault-injection system on a 4-node PRESS web server (both from Rutgers)
• Running latency-based performability tests on the cluster
– Inject faults during a load test
– Record page-load times before, during, and after faults
Test Setup
• Normal version: cooperative caching
• HA version: cooperative caching + heartbeat monitoring
[Figure: test clients send requests through an emulated switch to the PRESS web server nodes running under Mendosus; the nodes exchange caching info and pages, and responses return to the clients.]
Effect of Component Failure on Performability Metrics
[Figure: throughput and latency as performability metrics plotted against time, each deviating from normal between FAILURE and REPAIR.]
Observations
• Below saturation, throughput is more dependent on load than latency is
• Above saturation, latency is more dependent on load
[Figure: request timeline over five time steps, e.g. throughput 6/s with latency 0.14 s, throughput 3/s with latency 0.14 s, throughput 7/s with latency 0.4 s.]
How to Represent Latency?
• Average response time over a given time period
– Make a distinction between “render time” and “time to first byte”?
• Deviation from baseline latency
– Impose a greater penalty for deviations toward longer wait times?
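Both candidate representations can be sketched directly; the sample times, the baseline, and the asymmetric weight below are hypothetical illustrations, not measurements from the cluster.

```python
# Two candidate latency summaries from the slide, with assumed inputs.
BASELINE = 0.14  # sec: assumed fault-free latency
samples = [0.12, 0.15, 0.40, 2.10, 0.13]  # hypothetical page-load times

# 1. Average response time over the measurement period
average = sum(samples) / len(samples)

# 2. Mean deviation from baseline, penalizing slowdowns more than speedups
def deviation(samples, baseline=BASELINE, slow_weight=2.0):
    total = 0.0
    for s in samples:
        d = s - baseline
        # slower-than-baseline deviations count double (assumed weight)
        total += slow_weight * d if d > 0 else -d
    return total / len(samples)
```

The plain average hides one bad outlier among many fast responses; the weighted deviation surfaces it, at the cost of choosing a baseline and a penalty weight.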
Response Time with Load Shedding Policy
[Figure: response time (sec) vs. time, with an 8 s abandonment threshold and a lower load-shedding threshold; between FAILURE and REPAIR, X users get a “server too busy” message.]
Load Shedding Issues
• Load shedding means returning 0% data quality: a different kind of performability metric
• To combine load shedding and latency, define a “demerit” system:
– “Server too busy” msg: 3 demerits
– >8 sec response time: 1 demerit/sec
• Such systems quickly lose generality, however
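One reading of the slide’s weights, sketched below; the function name and data shape are illustrative, and “1 demerit/sec” is interpreted here as charging only the time past the 8 s threshold (charging the full response time would be another valid reading).

```python
# Hypothetical demerit scoring combining load shedding and latency.
ABANDON_THRESHOLD = 8.0  # sec: response time at which users abandon the site
SHED_COST = 3.0          # demerits per "server too busy" response
SLOW_COST = 1.0          # demerits per second past the threshold

def demerits(responses):
    """responses: response times in seconds, or None for a shed request."""
    total = 0.0
    for r in responses:
        if r is None:                    # request was shed (0% data quality)
            total += SHED_COST
        elif r > ABANDON_THRESHOLD:      # slow enough to risk abandonment
            total += SLOW_COST * (r - ABANDON_THRESHOLD)
    return total
```

A single scalar makes runs comparable, but as the slide notes, the chosen weights bake in one site’s economics and the score quickly loses generality.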
Further Work
• Collect more experimental results!
• Compare throughput- and latency-based results of the normal and high-availability versions of PRESS
• Evaluate the usefulness of “demerit” systems to describe the user experience (latency and data quality)