Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems

Nezih Yigitbasi (1), Matthieu Gallet (2), Derrick Kondo (3), Alexandru Iosup (1), Dick Epema (1)
(1) TU Delft, (2) École Normale Supérieure de Lyon, (3) INRIA
Failures Do Happen
• "… Build a computing system with 10 thousand servers with an MTBF of 30 years each, and watch one fail per day …" (Jeff Dean, Google Fellow, LADIS'09 keynote)
• "… Average worker deaths per MapReduce job is 1.2 …" (MapReduce, OSDI'04)
• "… 20-45% failures in TeraGrid …" (Khalili et al., GRID'06)
• "… During the month of March 2005 on one dedicated cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job …" (Rob Pike et al., Google)
Are Failures Independent?
• Common assumption: failures are independent
• Is this realistic for large-scale distributed systems? We already know that space correlations exist
• Time correlations may impact:
  • Proactive fault-tolerance solutions
  • Design decisions
  • Checkpointing & scheduling decisions (e.g., migrating computation at the beginning of a predicted peak)

M. Gallet, N. Yigitbasi, B. Javadi, D. Kondo, A. Iosup, D. Epema, A Model for Space-Correlated Failures in Large-Scale Distributed Systems, Euro-Par 2010.
Our Goals
• GOAL 1: Investigate whether failures have time correlations
• GOAL 2: Model the time-varying behavior of failures (peaks)
Outline
• Background
• Our Approach
• Analysis of Time-Correlation
• Modeling the Peaks of Failures
• Conclusions
Why Not Root-Cause Analysis?
• Root-cause analysis is certainly useful, but it faces several challenges:
  • Systems are large and complex
  • Not all subsystems provide detailed information
  • Little monitoring/debugging support
  • Environment-specific or temporary failures
  • Huge volume of failure data (19 systems in this study)
Failure Trace Archive (FTA): http://fta.inria.fr

Provides:
• Availability traces of diverse distributed systems of different scales
• A standard format for failure events
• Tools for parsing & analysis

Enables:
• Comparing models/algorithms using identical data sets
• Evaluating the generality/specificity of models/algorithms across different types of systems
• Analyzing the evolution of availability across time scales
• And many more …
FTA Schema
• Hierarchical trace format
• Resource-centric
• Event-based
• Associated metadata
• Codes for different components and events
• Available in raw, tabbed, and MySQL formats
Sample Trace
Each record carries (a parsing sketch follows below):
• Identifiers for the event/component/node/platform
• Node name
• Type of event: unavailability/availability
• Event start/stop time (UNIX time)
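As a concrete illustration, the sketch below parses events from the tabbed format in Python. The column order, the event-type code, and the sample values are assumptions for this example; the FTA schema documentation is authoritative.

```python
import csv
import io
from collections import namedtuple

# Assumed column order for illustration only; the FTA schema defines the
# authoritative field layout and event-type codes.
Event = namedtuple("Event", "event_id component_id node_id platform_id "
                            "node_name event_type start_time stop_time")

def load_events(lines):
    """Parse availability/unavailability events from tab-separated lines."""
    return [Event(*row[:8]) for row in csv.reader(lines, delimiter="\t")]

# Inline sample row (made-up values) standing in for a real trace file;
# with a real trace, pass an open file object instead.
sample = io.StringIO("1\t1\t42\t1\tnode042\t0\t1109635200\t1109638800\n")
for e in load_events(sample):
    print(e.node_name, e.event_type, int(e.stop_time) - int(e.start_time))
```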
Our Approach (1): Outline
• Traces: nineteen failure traces from the FTA, mostly from production systems
• Analysis: use the auto-correlation of the failure rate time series
• Modeling: fit well-known probability distributions to the failure data to model failure peaks
Our Approach (2): Traces
• 100K+ hosts
• ~1.2M failure events
• 15+ years of operation in total
• http://fta.inria.fr
Our Approach (3): Analysis
• Auto-Correlation Function (ACF): the similarity between observations as a function of the time lag between them
• A mathematical tool for finding repeating patterns, used here to assess time correlations (a computation sketch follows)
• Values range over [-1, 1]; magnitudes near 0 indicate weak correlation, magnitudes near 1 strong correlation
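To make the analysis step concrete, here is a minimal Python/NumPy sketch of the sample ACF applied to a synthetic hourly failure-count series. The paper does not specify its tooling, and the data here is generated, not taken from the FTA.

```python
import numpy as np

def acf(series, max_lag):
    """Sample autocorrelation of a 1-D series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)  # proportional to the sample variance
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# Synthetic hourly failure counts with an injected daily cycle; real input
# would be counts derived from an FTA trace.
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
rate = 3 + 2 * np.sin(2 * np.pi * hours / 24)  # daily cycle in the mean
counts = rng.poisson(rate)

r = acf(counts, max_lag=48)
print("lag 24:", round(r[24], 2))  # clearly positive => daily pattern
```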
Our Approach (4): Modeling
• We fit five probability distributions to the empirical data: Exponential, Weibull, Pareto, Log-Normal, and Gamma
• Maximum likelihood estimation + goodness-of-fit tests (a fitting sketch follows)
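A hedged sketch of this step with SciPy: the mapping from the slide's distribution names to SciPy distributions and the use of the Kolmogorov-Smirnov test are our assumptions, not the paper's exact procedure.

```python
from scipy import stats

# The five candidates from the slides, mapped to SciPy distributions
# (this mapping is our assumption, not the paper's code).
CANDIDATES = {
    "exponential": stats.expon,
    "weibull":     stats.weibull_min,
    "pareto":      stats.pareto,
    "lognormal":   stats.lognorm,
    "gamma":       stats.gamma,
}

def fit_candidates(data):
    """MLE-fit each candidate, then run a Kolmogorov-Smirnov test.

    Note: KS p-values are optimistic when the parameters were estimated
    from the same data; the paper's exact test procedure may differ.
    """
    results = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(data)  # maximum likelihood estimation
        _, p_value = stats.kstest(data, dist.cdf, args=params)
        results[name] = (params, p_value)
    return results

# Synthetic stand-in for empirical inter-arrival times
data = stats.weibull_min.rvs(0.8, scale=600, size=500, random_state=1)
for name, (params, p) in fit_candidates(data).items():
    print(f"{name:12s} p-value = {p:.3f}")
```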
Analysis (1): Auto-correlation
[Figure: autocorrelation of the failure rate for several systems, e.g., WEBSITES]
• Many systems exhibit moderate/strong auto-correlation at moderate/short time lags (GRID5K, LDNS, SKYPE, …)
Analysis (2): Auto-correlation
[Figure: autocorrelation of the failure rate for TERAGRID]
• A small number of systems exhibit low auto-correlation (TeraGrid, PNNL, NOTRE-DAME)
Analysis (3): Failure Patterns
[Figure: failure rates for MICROSOFT and SKYPE, with daily/weekly cycles highlighted]
• Systems with similar usage patterns have similar failure patterns
Analysis (4): Workload Intensity vs. Failure Rate
[Figure: workload intensity and failure rate over time for GRID5000]
• There is a strong correlation between the workload intensity and the failure rate in some systems (a correlation sketch follows)
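One simple way to quantify such a relationship (illustrative; not necessarily the method used in the study) is the Pearson correlation between aligned workload and failure-rate series:

```python
import numpy as np

# Synthetic aligned hourly series standing in for one system's data;
# in practice these would come from the workload and failure traces.
rng = np.random.default_rng(0)
workload = rng.poisson(100, size=24 * 30).astype(float)   # jobs per hour
failures = 0.05 * workload + rng.normal(0, 1, size=workload.size)

# Pearson correlation; values near +1 mean the failure rate rises and
# falls with workload intensity.
r = np.corrcoef(workload, failures)[0, 1]
print(f"workload vs. failure rate: r = {r:.2f}")
```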
Failure Peaks (1): Model
[Figure: failure rate over time with mean μ and threshold μ + kσ; the model captures (1) peak duration, (2) time between peaks, (3) failure inter-arrival time during peaks, and (4) failure duration during peaks]
Failure Peaks (2): Identification
Our goal:
• Balance between capturing the extreme system behavior and characterizing an important part of the system failures

We use a threshold to isolate peaks:
• μ + kσ, where k is a positive number
• Large k: few periods, explaining only a small fraction of failures
• Small k: more failures, of probably very different characteristics

We use k = 1:
• Tried k = {0.5, 0.9, 1.0, 1.1, 1.25, 1.5, 2.0}
• Over all traces, the average fraction of downtime and the average number of failures are close (see the Technical Report); a sketch of the thresholding rule follows this list
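A minimal sketch of the μ + kσ thresholding rule, assuming an hourly failure-rate series; the function and variable names are ours.

```python
import numpy as np

def find_peaks(rate, k=1.0):
    """Return the threshold and the index ranges of peak periods, where a
    peak is a maximal run of values above mean + k * std (the mu + k*sigma
    rule from the slides)."""
    rate = np.asarray(rate, dtype=float)
    threshold = rate.mean() + k * rate.std()
    peaks, start = [], None
    for i, value in enumerate(rate):
        if value > threshold and start is None:
            start = i                     # a peak begins
        elif value <= threshold and start is not None:
            peaks.append((start, i))      # half-open interval [start, i)
            start = None
    if start is not None:                 # series ends while still in a peak
        peaks.append((start, len(rate)))
    return threshold, peaks

# Example on a synthetic hourly failure-rate series
rng = np.random.default_rng(0)
series = rng.poisson(3, size=200)
series[50:55] += 15                       # injected failure burst
print(find_peaks(series, k=1.0)[1])       # ~[(50, 55)]
```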
Failure Peaks (3): Modeling Results (1)
1. On average, 50%-95% of the system downtime is caused by failures that originate during peaks, yet the fraction of peaks is below 10% for all platforms
2. Average peak durations are on the order of 1-2 hours
3. The average time between peaks is on the order of 15-80 hours
4. The average failure inter-arrival time (IAT) over the entire trace is about 9x the IAT during peaks
Failure Peaks (4): Modeling Results (2)
5. The Exponential distribution is not a good fit for the IAT during peaks, the time between peaks, or the failure duration during peaks; traditional models are not enough
6. Model parameters do not follow a heavy-tailed distribution; goodness-of-fit results (p-values) for the Pareto distribution are very low
7. The Weibull and Log-Normal distributions provide the best fit; see the paper for the parameters (a sampling sketch follows)
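Once fitted, such a model can drive simulation, e.g., drawing synthetic failure inter-arrival times for a peak. A hedged SciPy sketch; the parameter values below are placeholders, not the paper's estimates.

```python
from scipy import stats

# Placeholder Weibull parameters (shape, loc, scale); the per-system
# estimates are in the paper and are not reproduced here.
shape, loc, scale = 0.7, 0.0, 600.0

# 1000 synthetic failure inter-arrival times (in seconds) that could drive
# a trace-based simulation of a failure peak
iats = stats.weibull_min.rvs(shape, loc=loc, scale=scale, size=1000,
                             random_state=42)
print(iats.mean())
```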
Conclusions (1)

Large-Scale Study
• Nineteen traces, most of which are from production systems
• 100K+ hosts, ~1.2M failure events, 15+ years of operation
• Four new traces available in the FTA (3 CONDOR + 1 TERAGRID)

GOAL 1: Analysis
• Failures exhibit strong periodic behavior & time correlation
• Systems with similar usage patterns have similar failure patterns
• Strong correlation between workload intensity and failure rate
Conclusions (2)

GOAL 2: Modeling
• We model the peak duration, the time between peaks, the failure IAT during peaks, and the failure duration during peaks
• On average, 50%-95% of the system downtime is caused by failures that originate during peaks (fraction of peaks < 10%)
• The Weibull & Log-Normal distributions provide a good fit
M.N.Yigitbasi@tudelft.nl
http://www.st.ewi.tudelft.nl/~nezih/

More Information:
• Guard-g Project: http://guardg.st.ewi.tudelft.nl/
• The Failure Trace Archive: http://fta.inria.fr
• PDS publication database: http://www.pds.twi.tudelft.nl

Thank you! Questions? Comments?
[Backup figure: example signals (random noise, a sine wave, sine plus random noise) plotted over time t, together with their autocorrelation functions]
Autocorrelation Function
[Figure: autocorrelation coefficient (from -1 to +1) vs. lag k (0 to 100), showing significant positive correlation at short lags]
Autocorrelation Function
[Figure: autocorrelation coefficient vs. lag k; no statistically significant correlation beyond a certain lag]
Long-Range Dependence
• For most processes (e.g., Poisson or compound Poisson), the autocorrelation function drops to zero very quickly, usually immediately or exponentially fast
• For self-similar processes, the autocorrelation function drops very slowly, i.e., hyperbolically toward zero, and may never reach zero (an illustrative comparison follows)
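To see this contrast numerically, compare the sample ACF of white noise with that of a strongly autocorrelated AR(1) series. This is illustrative only: AR(1) is still short-range dependent, and a genuinely long-range dependent sample would require, e.g., fractional Gaussian noise.

```python
import numpy as np

def acf1(x, k):
    """Sample autocorrelation of x at a single lag k >= 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

rng = np.random.default_rng(0)
n = 4000

# White noise: the ACF is ~0 for every lag k >= 1.
noise = rng.normal(size=n)

# AR(1) with phi = 0.95: the ACF decays slowly, though still exponentially,
# so it only mimics the slow decay of a long-range dependent process.
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.95 * ar1[t - 1] + rng.normal()

for lag in (1, 10, 50):
    print(lag, round(acf1(noise, lag), 2), round(acf1(ar1, lag), 2))
```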
Autocorrelation Function
[Figure: autocorrelation coefficient vs. lag k, contrasting a typical long-range dependent process (slow decay) with a typical short-range dependent process (fast decay)]