TRANSCRIPT
Comprehensive Depiction of Configuration-dependent Performance
Anomalies in Distributed Server Systems
Christopher Stewart, Ming Zhong,
Kai Shen, and Thomas O’Neill
University of Rochester
Presented at the 2nd USENIX Workshop on Hot Topics in System Dependability
Context
- Distributed server systems (example: J2EE application servers)
- Many system configurations: switches that control runtime execution
- Wide range of workload conditions: exogenous demands for system resources
Example (J2EE) runtime conditions:
- System configurations: concurrency limit, component placement
- Workload conditions: request rate
Presumptions
- Performance expectations based on knowledge of system design are reasonable
  - Lead developers: high-level algorithms
  - Administrators: day-to-day experience
- Example expectation, Little's Law: the average number of requests in the system equals the average arrival rate times the average time each request spends in the system
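As a concrete instance of this expectation (standard Little's Law, L = λ × W): at an arrival rate of λ = 100 requests/second and an average time in system of W = 0.05 seconds, we expect on average L = 100 × 0.05 = 5 requests in the system.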
Real Performance Anomalies
[Figure: actual vs. expected throughput across component placement strategies; anomalies appear where actual throughput falls below the expectation.]
Problem Statement
- Dependable performance is important for system management: QoS scheduling, SLA negotiations
- Performance anomalies (runtime conditions in which performance falls below expectations) are not uncommon
Goals
- Previous work: anomaly characterization can aid the debugging process and guide online avoidance [AGU-SOSP99, QUI-SOSP05, CHE-NSDI04, COH-SOSP05, KEL-WORLDS05]
  - Focused on specific runtime conditions (e.g., those encountered during a particular execution)
- We wish to depict all anomalous conditions
- Comprehensive depictions can:
  - Aid the debugging of production systems before distribution
  - Enable preemptive avoidance of anomalies in live systems
Approach
Our depictions are derived in a 3-step process:
1. Generate performance expectations by building a comprehensive whole-system performance model
2. Search for anomalous runtime conditions
3. Extrapolate a comprehensive anomaly depiction
Challenges:
- The model must consider a wide range of system configurations
- A systematic method is needed to determine the anomaly error threshold
- An appropriate method is needed to detect correlations between runtime conditions and anomalies
(A minimal code sketch of the pipeline follows.)
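A minimal sketch of how this 3-step pipeline might look in code. The function names and inputs are our illustration, not the paper's tool; we assume runtime conditions are numeric configuration/workload vectors and borrow an off-the-shelf decision tree from scikit-learn:

    from sklearn.tree import DecisionTreeClassifier, export_text

    def depict_anomalies(conditions, measured, expect, threshold):
        # Step 1: model-predicted performance for each runtime condition.
        expected = [expect(c) for c in conditions]
        # Step 2: flag conditions whose relative expectation error
        # exceeds the chosen anomaly error threshold.
        labels = [int(abs(m - e) / e > threshold)
                  for m, e in zip(measured, expected)]
        # Step 3: fit an interpretable decision tree over the runtime
        # conditions and return its rules as the anomaly depiction.
        tree = DecisionTreeClassifier(max_depth=3).fit(conditions, labels)
        return export_text(tree)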
Outline
- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree-based anomaly depiction
- Preliminary results
- Discussion / conclusion
Comprehensive Performance Expectations
- Modeling the configuration space is hard:
  - Configurations have complex effects on performance
  - Considering a wide range of configurations increases model complexity
- Our modeling methodology:
  - Build performance models as a hierarchy of sub-models
  - Sub-models can be independently adjusted to consider new system configurations
Rules for Our Sub-Model Hierarchies
- The output of each sub-model is a workload property
  - Workload property: internal demand for system resources (e.g., CPU consumption)
- The inputs to each sub-model are either (1) workload properties or (2) system configuration settings
- Sub-models at the highest level produce performance expectations
- Workload properties at the lowest level, called canonical workload properties, can be measured independently of system configurations
A Hierarchy of Sub-Models
- We leverage the workload properties of earlier work [STE-NSDI05]
- Advantage: sub-models have meaning
- Limitation: configuration dependencies may make sub-models complex
[Figure: Hierarchy of sub-models for J2EE application servers. Canonical workload properties (component CPU usage without caching; component communication need without caching) feed sub-models 1 and 2 (average request CPU usage and communication need at each component), which take the cache coherence configuration as input. These feed sub-models 3 and 4 (average request CPU usage and communication need at each machine), which take the component placement and remote invocation method configurations as input. These in turn feed sub-model 5 (average request response time) and sub-model 6 (system throughput), which take the service concurrency level configuration as input.]
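As an illustration of how one sub-model in such a hierarchy might be written, here is a simplistic analogue of sub-model 3 (per-machine CPU usage from per-component CPU usage plus the component placement configuration). The data and function names are invented for illustration, not taken from the paper:

    def machine_cpu(component_cpu, placement):
        # Aggregate per-component CPU demand (a workload property) onto
        # machines according to the component placement configuration.
        usage = {}
        for comp, cpu in component_cpu.items():
            node = placement[comp]
            usage[node] = usage.get(node, 0.0) + cpu
        return usage

    # Example: five components placed across three nodes.
    cpu = {1: 0.02, 2: 0.05, 3: 0.01, 4: 0.03, 5: 0.02}  # CPU sec/request
    place = {1: "node2", 2: "node1", 3: "node2", 4: "node3", 5: "node2"}
    print(machine_cpu(cpu, place))  # per-machine CPU demand per request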
Outline (next: determination of the anomaly error threshold)
Determination of the Anomaly Error Threshold
- Slight discrepancies between actual and expected performance should sometimes be tolerated
- Leniency depends on the end use of the depiction:
  - For online avoidance, focus on error magnitude: large errors may induce poor management decisions; choose the threshold via sensitivity analysis of system management functions
  - For debugging, focus on targeted performance bugs: noisy depictions will mislead debuggers; group anomalies with the same root cause
Anomaly Error Threshold for Debugging
- Observation: anomaly manifestations due to the same root cause are more likely to share similar error magnitudes than unrelated anomaly manifestations
- Root causes can therefore be grouped by clustering runtime conditions on their expectation error
Anomaly Error Threshold for Debugging (continued)
- Knee points mark cluster boundaries
- Knee-point selection:
  - A higher-magnitude knee emphasizes large anomalies
  - A lower-magnitude knee captures multiple anomalies
- Validation: we notice that knee points disappear when the underlying problems are resolved
[Figure: Expectation error clustering. Cumulative percentage of sample runtime conditions (0-100%) plotted against the conditions sorted on expectation error, for response time and throughput; knee points in the curves mark cluster boundaries.]
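One plausible way to locate such knee points automatically, sketched under the assumption that a knee is a large jump between adjacent sorted errors; the min_jump knob is our invention, not from the paper:

    import numpy as np

    def find_knees(errors, min_jump=0.1):
        # Sort per-condition expectation errors and return positions where
        # the jump between adjacent sorted errors exceeds min_jump of the
        # full error range: candidate cluster boundaries ("knee points").
        e = np.sort(np.asarray(errors, dtype=float))
        jumps = np.diff(e)
        span = (e[-1] - e[0]) or 1.0
        return [i + 1 for i, j in enumerate(jumps) if j > min_jump * span]

    print(find_knees([0.01, 0.02, 0.03, 0.40, 0.42, 0.95]))  # -> [3, 5]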
Outline (next: decision-tree-based anomaly depiction)
Decision-Tree-Based Anomaly Depictions
- Decision trees correlate anomalies with problematic runtime conditions
- Interpretable, unlike neural nets, SVMs, or perceptrons
- Require no prior knowledge, unlike Bayesian trees [COH-OSDI04]
- Versatile; example rules extracted from a tree:
  - If a=0: anomaly
  - If a=1, b=0: normal
  - If a=1, b=1: anomaly
- White-box usage for debugging: the depiction provides hints; prefer shorter, easily interpreted trees
- Black-box usage for avoidance: prefer longer, more precise trees
[Figure: Example decision tree over runtime conditions a, b, c, with branches labeled 0/1. A runtime condition such as a=0, b=1, c=2, ... is classified by following branches to a leaf; leaves report probabilities such as anomaly 80% prob., normal 70% prob., and anomaly 90% prob.]
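To make the example concrete, here is a toy reproduction of the rules above using an off-the-shelf decision tree (illustrative only; real depictions are trained on observed anomaly labels over many runtime conditions):

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Runtime conditions (a, b) labeled per the slide's rules:
    # a=0 -> anomaly; a=1, b=0 -> normal; a=1, b=1 -> anomaly.
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [1, 1, 0, 1]
    tree = DecisionTreeClassifier().fit(X, y)
    print(export_text(tree, feature_names=["a", "b"]))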
Design Recap
We wish to depict performance anomalies across a wide range of system configurations and workload conditions:
1. Derive performance expectations via a hierarchy of sub-models
2. Search for anomalous runtime conditions using a carefully selected anomaly error threshold
3. Use decision trees to extrapolate a comprehensive anomaly depiction
Outline (next: preliminary results)
Depiction-Assisted Debugging
- System: JBoss
- 8 runtime conditions (including application type); 4-machine cluster, 2.66 GHz CPUs
- Found and fixed 3 performance anomalies; one is shown in detail below
[Figure: Depiction of a real performance anomaly. A decision tree splits on application type (container-managed persistence, CMP) and then on component placement strategy; leaves are 79%, 68%, 87%, and 88% anomalous. The placements shown (node: components) are 1:{2}, 2:{1,3,5}, 3:{4}; 1:{4,5}, 2:{1,2,3}, 3:none; 1:none, 2:{1,2,4}, 3:{3,5}; and 1:{5}, 2:{1,2,4}, 3:{3}. Root cause: a misunderstood J2EE configuration that manifests when multiple components are placed on node 2.]
Discovered Anomalies
1. A misunderstood J2EE configuration caused remote invocations to unintentionally execute locally
2. A mishandled out-of-memory error under high concurrency caused the Tomcat 5.0 servlet container to drop requests
3. A circular dependency in the component invocation sequences caused connection timeouts under certain component placement strategies
Outline (next: discussion / conclusion)
Discussion
- Limitations:
  - Cannot detect non-deterministic anomalies
  - Is a discrepancy due to model inaccuracy or a performance anomaly? Answering requires manual investigation, but the model is much less complex than the system
  - Debugging is still a manual process
- Future work:
  - Short term: investigate more system configurations; depict anomalies in more systems
  - Long term: more systematic depiction-assisted debugging methods
Take Away
- Comprehensive depictions of performance anomalies over a wide range of runtime conditions can aid debugging and avoidance
- We have designed and implemented an approach to:
  - Model a wide range of system configurations
  - Determine anomalous conditions
  - Depict the anomalies in an easy-to-interpret fashion
- We have already used our approach to find 3 performance bugs