TRANSCRIPT
Comprehensive Depiction of Configuration-dependent Performance
Anomalies in Distributed Server Systems
Christopher Stewart, Ming Zhong,
Kai Shen, and Thomas O’Neill
University of Rochester
Presented at the 2nd USENIX Workshop on Hot Topics in System Dependability
Context
- Distributed server systems (example: J2EE application servers)
- Many system configurations: switches that control runtime execution
- Wide range of workload conditions: exogenous demands for system resources
Example (J2EE) runtime conditions:
- System configurations: concurrency limit, component placement
- Workload conditions: request rate
Presumptions
- Performance expectations based on knowledge of system design are reasonable
  - Lead developers: high-level algorithms
  - Administrators: day-to-day experience
- Example expectation, Little's Law: the average number of requests in the system equals the average arrival rate times the average time each request spends in the system
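As a concrete instance of this expectation (standard Little's Law, L = λ × W): at an arrival rate of λ = 100 requests/second and an average time in system of W = 0.05 seconds, we expect on average L = 100 × 0.05 = 5 requests in the system.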
Real Performance Anomalies
[Figure: actual vs. expected throughput across component placement strategies; anomalies appear where actual throughput falls below the expectation.]
Problem Statement
- Dependable performance is important for system management: QoS scheduling, SLA negotiations
- Performance anomalies (runtime conditions in which performance falls below expectations) are not uncommon
Goals
- Previous work: anomaly characterization can aid the debugging process and guide online avoidance [AGU-SOSP99, QUI-SOSP05, CHE-NSDI04, COH-SOSP05, KEL-WORLDS05]
  - Focused on specific runtime conditions (e.g., those encountered during a particular execution)
- We wish to depict all anomalous conditions
- Comprehensive depictions can:
  - Aid the debugging of production systems before distribution
  - Enable preemptive avoidance of anomalies in live systems
Approach
Our depictions are derived in a 3-step process:
1. Generate performance expectations by building a comprehensive whole-system performance model
2. Search for anomalous runtime conditions
3. Extrapolate a comprehensive anomaly depiction
Challenges:
- The model must consider a wide range of system configurations
- A systematic method is needed to determine the anomaly error threshold
- An appropriate method is needed to detect correlations between runtime conditions and anomalies
(A minimal code sketch of the pipeline follows.)
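A minimal sketch of how this 3-step pipeline might look in code. The function names and inputs are our illustration, not the paper's tool; we assume runtime conditions are numeric configuration/workload vectors and borrow an off-the-shelf decision tree from scikit-learn:

    from sklearn.tree import DecisionTreeClassifier, export_text

    def depict_anomalies(conditions, measured, expect, threshold):
        # Step 1: model-predicted performance for each runtime condition.
        expected = [expect(c) for c in conditions]
        # Step 2: flag conditions whose relative expectation error
        # exceeds the chosen anomaly error threshold.
        labels = [int(abs(m - e) / e > threshold)
                  for m, e in zip(measured, expected)]
        # Step 3: fit an interpretable decision tree over the runtime
        # conditions and return its rules as the anomaly depiction.
        tree = DecisionTreeClassifier(max_depth=3).fit(conditions, labels)
        return export_text(tree)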
Outline
- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree-based anomaly depiction
- Preliminary results
- Discussion / conclusion
Comprehensive Performance Expectations
- Modeling the configuration space is hard:
  - Configurations have complex effects on performance
  - Considering a wide range of configurations increases model complexity
- Our modeling methodology:
  - Build performance models as a hierarchy of sub-models
  - Sub-models can be independently adjusted to consider new system configurations
Rules for Our Sub-Model Hierarchies
- The output of each sub-model is a workload property
  - Workload property: internal demand for system resources (e.g., CPU consumption)
- The inputs to each sub-model are either (1) workload properties or (2) system configuration settings
- Sub-models at the highest level produce performance expectations
- Workload properties at the lowest level, called canonical workload properties, can be measured independently of system configurations
A Hierarchy of Sub-Models
- We leverage the workload properties of earlier work [STE-NSDI05]
- Advantage: sub-models have meaning
- Limitation: configuration dependencies may make sub-models complex
[Figure: Hierarchy of sub-models for J2EE application servers. Canonical workload properties (component CPU usage without caching; component communication need without caching) feed sub-models 1 and 2 (average request CPU usage and communication need at each component), which take the cache coherence configuration as input. These feed sub-models 3 and 4 (average request CPU usage and communication need at each machine), which take the component placement and remote invocation method configurations as input. These in turn feed sub-model 5 (average request response time) and sub-model 6 (system throughput), which take the service concurrency level configuration as input.]
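As an illustration of how one sub-model in such a hierarchy might be written, here is a simplistic analogue of sub-model 3 (per-machine CPU usage from per-component CPU usage plus the component placement configuration). The data and function names are invented for illustration, not taken from the paper:

    def machine_cpu(component_cpu, placement):
        # Aggregate per-component CPU demand (a workload property) onto
        # machines according to the component placement configuration.
        usage = {}
        for comp, cpu in component_cpu.items():
            node = placement[comp]
            usage[node] = usage.get(node, 0.0) + cpu
        return usage

    # Example: five components placed across three nodes.
    cpu = {1: 0.02, 2: 0.05, 3: 0.01, 4: 0.03, 5: 0.02}  # CPU sec/request
    place = {1: "node2", 2: "node1", 3: "node2", 4: "node3", 5: "node2"}
    print(machine_cpu(cpu, place))  # per-machine CPU demand per request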
Outline (next: determination of the anomaly error threshold)
Determination of the Anomaly Error Threshold
- Slight discrepancies between actual and expected performance should sometimes be tolerated
- Leniency depends on the end use of the depiction:
  - For online avoidance, focus on error magnitude: large errors may induce poor management decisions; choose the threshold via sensitivity analysis of system management functions
  - For debugging, focus on targeted performance bugs: noisy depictions will mislead debuggers; group anomalies with the same root cause
Anomaly Error Threshold for Debugging
- Observation: anomaly manifestations due to the same root cause are more likely to share similar error magnitudes than unrelated anomaly manifestations
- Root causes can therefore be grouped by clustering runtime conditions on their expectation error
Anomaly Error Threshold for Debugging (continued)
- Knee points mark cluster boundaries
- Knee-point selection:
  - A higher-magnitude knee emphasizes large anomalies
  - A lower-magnitude knee captures multiple anomalies
- Validation: we notice that knee points disappear when the underlying problems are resolved
[Figure: Expectation error clustering. Cumulative percentage of sample runtime conditions (0-100%) plotted against the conditions sorted on expectation error, for response time and throughput; knee points in the curves mark cluster boundaries.]
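One plausible way to locate such knee points automatically, sketched under the assumption that a knee is a large jump between adjacent sorted errors; the min_jump knob is our invention, not from the paper:

    import numpy as np

    def find_knees(errors, min_jump=0.1):
        # Sort per-condition expectation errors and return positions where
        # the jump between adjacent sorted errors exceeds min_jump of the
        # full error range: candidate cluster boundaries ("knee points").
        e = np.sort(np.asarray(errors, dtype=float))
        jumps = np.diff(e)
        span = (e[-1] - e[0]) or 1.0
        return [i + 1 for i, j in enumerate(jumps) if j > min_jump * span]

    print(find_knees([0.01, 0.02, 0.03, 0.40, 0.42, 0.95]))  # -> [3, 5]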
Outline (next: decision-tree-based anomaly depiction)
Decision-Tree-Based Anomaly Depictions
- Decision trees correlate anomalies with problematic runtime conditions
- Interpretable, unlike neural nets, SVMs, or perceptrons
- Require no prior knowledge, unlike Bayesian trees [COH-OSDI04]
- Versatile; example rules extracted from a tree:
  - If a=0: anomaly
  - If a=1, b=0: normal
  - If a=1, b=1: anomaly
- White-box usage for debugging: the depiction provides hints; prefer shorter, easily interpreted trees
- Black-box usage for avoidance: prefer longer, more precise trees
[Figure: Example decision tree over runtime conditions a, b, c, with branches labeled 0/1. A runtime condition such as a=0, b=1, c=2, ... is classified by following branches to a leaf; leaves report probabilities such as anomaly 80% prob., normal 70% prob., and anomaly 90% prob.]
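To make the example concrete, here is a toy reproduction of the rules above using an off-the-shelf decision tree (illustrative only; real depictions are trained on observed anomaly labels over many runtime conditions):

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Runtime conditions (a, b) labeled per the slide's rules:
    # a=0 -> anomaly; a=1, b=0 -> normal; a=1, b=1 -> anomaly.
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [1, 1, 0, 1]
    tree = DecisionTreeClassifier().fit(X, y)
    print(export_text(tree, feature_names=["a", "b"]))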
Design Recap
We wish to depict performance anomalies across a wide range of system configurations and workload conditions:
1. Derive performance expectations via a hierarchy of sub-models
2. Search for anomalous runtime conditions using a carefully selected anomaly error threshold
3. Use decision trees to extrapolate a comprehensive anomaly depiction
Outline (next: preliminary results)
Depiction-Assisted Debugging
- System: JBoss
- 8 runtime conditions (including application type); 4-machine cluster, 2.66 GHz CPUs
- Found and fixed 3 performance anomalies; one is shown in detail below
[Figure: Depiction of a real performance anomaly. A decision tree splits on application type (container-managed persistence, CMP) and then on component placement strategy; leaves are 79%, 68%, 87%, and 88% anomalous. The placements shown (node: components) are 1:{2}, 2:{1,3,5}, 3:{4}; 1:{4,5}, 2:{1,2,3}, 3:none; 1:none, 2:{1,2,4}, 3:{3,5}; and 1:{5}, 2:{1,2,4}, 3:{3}. Root cause: a misunderstood J2EE configuration that manifests when multiple components are placed on node 2.]
Discovered Anomalies
1. A misunderstood J2EE configuration caused remote invocations to unintentionally execute locally
2. A mishandled out-of-memory error under high concurrency caused the Tomcat 5.0 servlet container to drop requests
3. A circular dependency in the component invocation sequences caused connection timeouts under certain component placement strategies
Outline (next: discussion / conclusion)
Discussion
- Limitations:
  - Cannot detect non-deterministic anomalies
  - Is a discrepancy due to model inaccuracy or a performance anomaly? Answering requires manual investigation, but the model is much less complex than the system
  - Debugging is still a manual process
- Future work:
  - Short term: investigate more system configurations; depict anomalies in more systems
  - Long term: more systematic depiction-assisted debugging methods
Take Away
- Comprehensive depictions of performance anomalies over a wide range of runtime conditions can aid debugging and avoidance
- We have designed and implemented an approach to:
  - Model a wide range of system configurations
  - Determine anomalous conditions
  - Depict the anomalies in an easy-to-interpret fashion
- We have already used our approach to find 3 performance bugs