Lawrence Livermore National Laboratory
Greg Bronevetsky, in collaboration with Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz
Statistical Fault Detection and Analysis with AutomaDeD
LLNL-PRES-424502 Option:Additional Information
Reliability is a Critical Challenge in Large Systems
Need tools to detect faults, identify causes
• Fault tolerance: requires fault detection
• System management: need to know what failed
Faults come from various causes
• Hardware: soft errors, marginal circuits, physical degradation, design bugs
• Software: coding bugs, misconfigurations
In General, Fault Detection and Fault Tolerance Are Undecidable
Option 1: Make all applications fault resilient
• Application-specific solutions hard to design
• Many applications
• How does fault resilience compose?
Option 2: Develop approximate fault detection, tolerate via checkpointing and similar techniques
• Statistically model application behavior
• Look for deviations from model behavior
• Identify model components that likely caused deviation
Focus on Modeling Individual MPI Applications
Primary goal is fault detection for HPC applications
• Model behavior of single MPI application
• Detect deviations from norm
• Identify origin of deviation in time/space
Other branches of field
• Model system component interactions
• Model application as dataflow graph of modules
• Model micro-architecture state as vulnerable/non-vulnerable (ACE analysis)
Goal: Detect Unusual Application Behavior, Identify Cause
[Figure: processes of an MPI application, with three kinds of comparison]
• Single run, spatial: differences between behavior of processes
• Single run, temporal: differences between one time point and others
• Multiple runs: differences between behavior of runs
Semi-Markov Models
SMM: transition system
• Nodes: application states
• Edges: transitions from one state to another
  • Probability of transition
  • Time spent in prior state before transition
[Figure: example SMM with states A, B, C and D; edges labeled probability / time: .2 / 5μs, .7 / 15μs, .1 / 500μs]
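To make the definition concrete, here is a minimal Python sketch of how such a model might be built from one task's trace. The trace format, the function name `build_smm`, and the assignment of the three edge labels to destinations B, C and D are assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict

def build_smm(trace):
    """Build a semi-Markov model from one task's trace.

    trace: list of (state, entry_time_us) pairs in execution order
           (a hypothetical format chosen for this sketch).
    Returns {src: {dst: (probability, [times spent in src before moving to dst])}}.
    """
    durations = defaultdict(lambda: defaultdict(list))
    # Each consecutive pair of events is one observed transition src -> dst.
    for (src, t_in), (dst, t_next) in zip(trace, trace[1:]):
        durations[src][dst].append(t_next - t_in)

    smm = {}
    for src, edges in durations.items():
        total = sum(len(times) for times in edges.values())
        smm[src] = {dst: (len(times) / total, times) for dst, times in edges.items()}
    return smm

# Hypothetical trace: state A is left ten times (2x to B, 7x to C, 1x to D).
trace, clock = [], 0
for dst, dwell in [("B", 5)] * 2 + [("C", 15)] * 7 + [("D", 500)]:
    trace.append(("A", clock))
    clock += dwell              # time spent in A before the transition
    trace.append((dst, clock))
    clock += 1                  # time spent in dst before returning to A

smm = build_smm(trace)
```

With this trace, the edges leaving A carry probabilities .2, .7 and .1 and dwell times of 5, 15 and 500 μs, matching the labels in the figure.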
SMMs Represent Application Control Flow
SMM states correspond to
• Calls to MPI
• Code between MPI calls
Different state for different calling context

Application Code

    main() {
      MPI_Init();
      … Computation …
      MPI_Send(…, 1, MPI_INTEGER, …);
      for(…)
        foo();
      MPI_Recv(…, 1, MPI_INTEGER, …);
      MPI_Finalize();
    }

    foo() {
      MPI_Send(…, 1024, MPI_DOUBLE, …);
      … Computation …
      MPI_Recv(…, 1024, MPI_DOUBLE, …);
      … Computation …
    }

Semi-Markov Model

[Figure: SMM for the code above, with states main() Init, main() Send-INT, main() foo() Send-DBL, main() foo() Recv-DBL, main() Recv-INT and main() Finalize, plus Computation states between MPI calls]
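One way to obtain a different state for each calling context is to fold the call stack into the state name. The sketch below is illustrative only: the function `smm_state`, its parameters and the "/" separator are hypothetical and not taken from the tool.

```python
def smm_state(call_stack, mpi_call=None, datatype=None):
    """Name an SMM state from its calling context.

    call_stack: function names, outermost first (hypothetical input format).
    mpi_call:   MPI operation name, or None for computation between MPI calls.
    datatype:   short datatype tag such as "INT" or "DBL", if any.
    """
    context = "/".join(f"{fn}()" for fn in call_stack)
    if mpi_call is None:
        return f"{context}/Computation"
    suffix = f"-{datatype}" if datatype else ""
    return f"{context}/{mpi_call}{suffix}"

# The same MPI operation yields distinct states in distinct contexts.
send_in_main = smm_state(["main"], "Send", "INT")
send_in_foo = smm_state(["main", "foo"], "Send", "DBL")
```

Here `send_in_main` and `send_in_foo` are separate states, so the time distribution learned for the send in `foo()` is not mixed with the one for the send in `main()`.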
Transitions Represent Time Spent at States
During execution each transition is observed multiple times
Time series of transition times: [t1, t2, …, tn]
Represented as probability distribution
• Gaussian
• Histogram
[Figure: SMM edges labeled .2 / 5μs, .7 / 15μs, .1 / 500μs]
Transitions Represent Time Spent at States
Gaussian
• Cheaper
• Lower accuracy
Histogram
• More expensive
• Greater accuracy
[Figure: Gaussian and histogram fits to the same data samples; axes are time values vs. probabilities. The histogram panel shows bucket counts joined by line connectors, with a Gaussian tail.]
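The trade-off can be seen on a small bimodal sample. The two classes below are a simplified sketch, not the tool's implementation: the histogram here has fixed-width buckets and returns zero outside the observed range, and it omits the line connectors and Gaussian tail shown in the figure. All names and the sample values are assumptions.

```python
import math

class GaussianModel:
    """Two parameters per transition: cheap, but blind to multi-modal timing."""
    def __init__(self, samples):
        n = len(samples)
        self.mean = sum(samples) / n
        var = sum((x - self.mean) ** 2 for x in samples) / n
        self.std = math.sqrt(var) or 1e-9  # guard against zero variance

    def density(self, x):
        z = (x - self.mean) / self.std
        return math.exp(-0.5 * z * z) / (self.std * math.sqrt(2 * math.pi))

class HistogramModel:
    """One count per bucket: more state to keep, but follows the data's shape."""
    def __init__(self, samples, buckets=10):
        self.lo, self.hi = min(samples), max(samples)
        self.width = (self.hi - self.lo) / buckets or 1.0
        self.counts = [0] * buckets
        self.n = len(samples)
        for x in samples:
            self.counts[self._bucket(x)] += 1

    def _bucket(self, x):
        return min(int((x - self.lo) / self.width), len(self.counts) - 1)

    def density(self, x):
        if x < self.lo or x > self.hi:
            return 0.0  # simplification: no tail outside the observed range
        return self.counts[self._bucket(x)] / (self.n * self.width)

# Hypothetical transition times (μs) with two modes, around 11 and around 51.
times = [10, 11, 12, 10, 11, 12, 50, 51, 52, 50, 51, 52]
gauss, hist = GaussianModel(times), HistogramModel(times)
```

A time of 30 μs was never observed, and `hist.density(30)` is zero, while the Gaussian, whose mean sits between the two modes, rates 30 μs as more likely than the observed 11 μs.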
Using SMMs to Help Detect Faults
Hardware faults → behavior abnormalities
Given sample runs, learn time distribution on each transition (top and bottom 0% or 10% of each transition’s times removed)
If some transition takes an unusual amount of time, declare it an error
[Figure: transition-time distribution; axes are time values vs. probabilities]
Detection threshold computed from maximum normal variation
Need threshold to separate normal, abnormal timing
Threshold = lowest probability observed in set of sample runs (top and bottom 1% removed)
[Figure: transition-time distribution; axes are time values vs. probabilities]

                      Nothing Removed   Top/Bottom 10% Removed
False Positive Rate   0%                19%
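The detection rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the evaluated detector: it uses a Gaussian per transition, computes the threshold over the trimmed samples, and all function names and sample values are hypothetical.

```python
import math

def trim(samples, fraction):
    """Drop the top and bottom `fraction` of the sorted samples."""
    ordered = sorted(samples)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]

def fit_gaussian(samples):
    """Return a density function fitted to the samples."""
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples)) or 1e-9
    def density(t):
        z = (t - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
    return density

def train_detector(sample_times, trim_fraction=0.0):
    """Learn one transition's distribution and its detection threshold."""
    kept = trim(sample_times, trim_fraction)
    density = fit_gaussian(kept)
    # Threshold = lowest probability assigned to any retained sample time.
    threshold = min(density(t) for t in kept)
    return density, threshold

def is_fault(density, threshold, t):
    """Flag a transition time that is less likely than anything seen in training."""
    return density(t) < threshold

# Hypothetical sample-run times (μs) for one transition.
times = [13, 14, 14, 15, 15, 15, 15, 16, 16, 17]
density0, thr0 = train_detector(times, 0.0)      # nothing removed
density10, thr10 = train_detector(times, 0.10)   # top/bottom 10% removed
```

With nothing removed, a time of 17 μs is accepted because it was seen in training; with 10% trimmed from each end it is flagged. That tighter threshold is one plausible reason for a higher false positive rate in the trimmed configuration, while a 500 μs delay is flagged either way.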
Evaluated Fault Detector Using Fault Injection
NAS Parallel Benchmarks
• 16-process runs
• Input class A
• Used BT, CG, FT, MG, LU and SP (EP and IS use MPI in very simple ways)
Local delays (FIN_LOOP): 1, 5, 10 sec
MPI message drop (DROP_MESG) or repetition (REP_MESG)
Extra CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread
Rates of Fault Detection Within 1ms of Injection
[Figure: stacked bar chart, 0% to 100%, comparing the "supervised" and "supervFilt(10)" detectors for each injected fault: FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, DROP_MESG, REP_MESG, CPU_THR, MEM_THR. Bar segments: no detection, false detection before injection, detection of fault within 1ms, detection after 1ms.]
• Filtering usually improves detection rates
• Single-point events easier to detect than persistent changes
SMMs used to Help Identify Software Faults in MPI Applications
User knows application has fault but needs help to focus on cause
Help identify point where fault first manifests as change in application behavior
Key tasks on faulty run:
• Identify time period of manifestation
• Identify task where fault first manifested
• Identify code region where fault first manifested
Focus on the Time Period of Unusual Behavior
User marks phase boundaries in code
Compute SMM for each task/phase
[Figure: execution divided into phases; each phase holds one SMM per task (Task 1, Task 2, …, Task n)]
Focus on the Time Period of Abnormal Behavior
Find phase with most unusual SMMs
• If sample runs available, compare faulty run’s SMMs to sample runs’ SMMs
• If none available, compare each phase to others
[Figure: per-phase SMMs of the faulty run alongside those of a sample run]
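Both comparison modes can be sketched as follows. The slides do not state which dissimilarity measure is used, so this example assumes a simple one (sum of absolute differences between edge probabilities) and reduces each phase to a single SMM rather than one per task. The names `smm_distance` and `most_unusual_phase` are hypothetical.

```python
def smm_distance(a, b):
    """Assumed dissimilarity: L1 distance between edge-probability maps."""
    edges = set(a) | set(b)
    return sum(abs(a.get(e, 0.0) - b.get(e, 0.0)) for e in edges)

def most_unusual_phase(faulty_phases, sample_phases=None):
    """Return (index of the most unusual phase, per-phase scores)."""
    scores = []
    for i, smm in enumerate(faulty_phases):
        if sample_phases is not None:
            # Sample run available: compare against the same phase of that run.
            score = smm_distance(smm, sample_phases[i])
        else:
            # No sample run: compare against the other phases of this run.
            others = [p for j, p in enumerate(faulty_phases) if j != i]
            score = sum(smm_distance(smm, o) for o in others) / len(others)
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__), scores

# Hypothetical data: edge -> transition probability, one map per phase.
normal = {("A", "B"): 0.5, ("A", "C"): 0.5}
skewed = {("A", "B"): 0.1, ("A", "C"): 0.9}
faulty_run = [normal, normal, skewed, normal]

with_samples, _ = most_unusual_phase(faulty_run, [normal] * 4)
without_samples, _ = most_unusual_phase(faulty_run)
```

In this example both modes single out the third phase (index 2), though in general the no-sample mode only works when most phases behave alike.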
Cluster Tasks According to Behavior to Identify Abnormal Task
User provides application’s natural cluster count k
Use sample execution to compute clustering threshold τ that produces k clusters
• Use sample runs if available
• Otherwise, compute τ from start of execution
During real runs cluster tasks using threshold τ
[Figure: master-worker example with Tasks 1 to 9, shown clustered twice: once in a normal run and once with a bug in Task 9]
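The calibration of τ can be illustrated with a small sketch. The slides do not name the clustering algorithm, so this example assumes single-linkage clustering (tasks within τ of each other, directly or through a chain, share a cluster) and summarizes each task by a single hypothetical number instead of a full SMM. All names and values are assumptions.

```python
def cluster(points, tau, dist):
    """Single-linkage clustering: link any two tasks whose distance is <= tau."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= tau:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def calibrate_tau(points, k, dist):
    """Smallest pairwise distance that, used as tau, yields exactly k clusters."""
    candidates = sorted({dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]})
    for tau in candidates:
        if len(cluster(points, tau, dist)) == k:
            return tau
    raise ValueError("no threshold yields k clusters")

def gap(a, b):
    return abs(a - b)

# Hypothetical master-worker sample run: Task 1 is the master, Tasks 2-9 are workers.
sample_run = [100.0, 10.0, 11.0, 10.5, 10.2, 10.8, 11.1, 10.4, 10.9]
tau = calibrate_tau(sample_run, k=2, dist=gap)

# Hypothetical faulty run: Task 9 (index 8) now behaves differently.
faulty_run = sample_run[:8] + [40.0]
clusters = cluster(faulty_run, tau, gap)
```

With τ calibrated on the sample run, the faulty run produces three clusters instead of two, and Task 9 sits alone in the extra one.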
Cluster Tasks According to Behavior to Identify Abnormal Task
Compare tasks in each cluster to their behavior in
• Sample runs
• Start of execution
Most abnormal is identified
Transition most responsible for difference identified as origin
[Figure: clustered Tasks 1 to 9, with a bug in Task 9]
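A simple way to picture "transition most responsible" is to split a task's deviation into per-transition contributions and rank them. The sketch below assumes each transition is summarized by its mean time and that the contribution is the absolute difference from the reference; neither choice is confirmed by the slides, and the names are hypothetical.

```python
def rank_transitions(task_smm, reference_smm):
    """Order transitions by how much each contributes to the task's deviation."""
    edges = set(task_smm) | set(reference_smm)
    contribution = {
        e: abs(task_smm.get(e, 0.0) - reference_smm.get(e, 0.0)) for e in edges
    }
    # Largest contribution first; ties broken by edge name for determinism.
    return sorted(contribution, key=lambda e: (-contribution[e], e))

# Hypothetical mean transition times (μs): reference behavior vs. the abnormal task.
reference = {("Send", "Comp"): 15.0, ("Comp", "Recv"): 5.0, ("Recv", "Send"): 7.0}
task9 = {("Send", "Comp"): 15.5, ("Comp", "Recv"): 500.0, ("Recv", "Send"): 7.0}

ranking = rank_transitions(task9, reference)
origin = ranking[0]
```

Here the transition from the computation state to the receive is ranked first, so it would be reported as the likely origin.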
Evaluated Fault Detector Using Fault Injection
NAS Parallel Benchmarks
• 16-task, Class A: BT, CG, FT, MG, LU and SP
2000 injection experiments per application
• Local livelock/deadlock (FIN_LOOP, INF_LOOP)
• Message drop (DROP_MESG), repetition (REP_MESG)
• CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread
Examined variants of training runs
• 20 training runs with no faults
• 20 training runs, 10% have fault
• No training runs
Phase Detection Accuracy
Accuracy ~90% for loops and message drops, ~60% for extra threads
• Training significantly better than no training (10% bug training is close)
• Histograms better than Gaussians
[Figure: bar chart of phase detection accuracy, 0% to 100%, per fault type: CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP. Series: Fault10 - Gauss, Fault10 - Histogram, NoFault - Gauss, NoFault - Histogram, NoSample - Gauss, NoSample - Histogram. Callouts: training vs. no training; NoFault sample vs. some faults; Gaussian vs. histogram.]
Cluster Isolation Accuracy
Results assume phase detected accurately
Accuracy of cluster isolation highly variable
• Depends on propagation of fault’s effects
• Accuracy up to 90% for extra threads
• Poor detection elsewhere since no information on event timing
[Figure: bar chart of cluster isolation accuracy, 0% to 100%, per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP) for BT, CG, FT, LU, MG and SP]
Cluster Isolation Accuracy
Extended cluster isolation with information on event order
Focuses on first abnormal transition
Significantly better accuracy for loop faults
[Figure: "Abnormal Transition - Fault10": bar chart of cluster isolation accuracy, 0% to 100%, per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP) for BT, CG, FT, LU, MG and SP]
Transition Isolation
Accuracy: injected transition in top 5 candidates
• Accuracy ~90% for loop faults
• Highly variable for others
• Less variable if event order information is used
[Figure: bar chart of transition isolation accuracy, 0% to 100%, per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP) for BT, CG, FT, LU, MG and SP]
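The accuracy measure on this slide is a top-5 hit rate. A minimal sketch of how it could be computed over a set of injection experiments follows; the function name, the input format and the transition labels T1 to T9 are hypothetical.

```python
def top_n_accuracy(experiments, n=5):
    """Fraction of experiments whose injected transition is among the top n candidates.

    experiments: list of (ranked_candidates, injected_transition) pairs,
                 with candidates ordered from most to least suspicious.
    """
    hits = sum(1 for ranked, injected in experiments if injected in ranked[:n])
    return hits / len(experiments)

# Hypothetical outcomes of four injection experiments.
experiments = [
    (["T3", "T1", "T7", "T2", "T9", "T4"], "T3"),  # hit: ranked first
    (["T1", "T2", "T3", "T4", "T5", "T6"], "T5"),  # hit: ranked fifth
    (["T1", "T2", "T3", "T4", "T5", "T6"], "T6"),  # miss: ranked sixth
    (["T8", "T2"], "T2"),                          # hit: short candidate list
]
accuracy = top_n_accuracy(experiments)
```

Three of the four hypothetical experiments count as hits, giving an accuracy of 0.75.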
Abnormality Detection Helps Illuminate MVAPICH Bug
Job execution script failed to clean up at job end, left runaway processes on nodes
Simulated by executing BT (16- and 64-task runs) concurrently with LU, MG or SP (16-task runs)
Experiments show
• Average SMM difference in regular BT runs
• Difference between BT runs with interference and no-interference runs
• Overlap execution during initial portion of BT run
[Figure: 16-task BT / 16-task SP/LU/MG. SMM deviation score (log scale, 1E+1 to 1E+5) per phase (1 to 10). Series: AVG No-Interference, Concurrent SP, Concurrent LU, Concurrent MG.]
[Figure: 64-task BT / 16-task SP/LU/MG. SMM deviation score (log scale, 1E+2 to 1E+5) per phase (1 to 10). Series: AVG No-Interference, Concurrent SP, Concurrent LU, Concurrent MG.]
Behavior Modeling is Critical Component of Fault Detection and Analysis
Complex behavior of applications and systems
Statistical models provide accurate summary
Promising results
• Quick detection of faults
• Focused localization of root causes
Ongoing work
• Scaling implementations to real HPC systems
• Improving accuracy through
  • More data
  • Models custom-tailored to applications