lawrence livermore national laboratory greg bronevetsky in collaboration with ignacio laguna,...

Lawrence Livermore National Laboratory

Greg Bronevetskyin collaboration with

Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz

Statistical Fault Detection and Analysis with AutomaDeD

LLNL-PRES-424502 Option:Additional Information

Reliability is a Critical Challenge in Large Systems

Need tools to detect faults, identify causes• Fault tolerance : requires fault detection• System management: need to know what failed

Faults come from various causes• Hardware: soft errors, marginal circuits, physical

degradation, design bugs• Software: coding bugs, misconfigurations

In General Fault Detection and Fault Tolerance is Undecidable

Option 1: Make all applications fault resilient• Application-specific solutions hard to design• Many applications• How does fault resilience compose?

Option 2: Develop approximate fault detection, tolerate via checkpointing et al• Statistically model application behavior• Look for deviations from model behavior• Identify model components that likely caused

deviation

In General Fault Detection and Fault Tolerance is Undecidable

Option 2: Develop approximate fault detection, tolerate via checkpointing et al• Statistically model application behavior• Look for deviations from model behavior• Identify model components that likely caused deviation

Application Model

Focus on Modeling Individual MPI Applications

Primary goal is fault detection for HPC applications• Model behavior of single MPI application• Detect deviations from norm• Identify origin of deviation in time/space

Other branches of field• Model system component interactions • Model application as dataflow graph of modules• Model micro-architecture state as vulnerable/non-

vulnerable (ACE analysis)

Goal: Detect Unusual Application Behavior, Identify Cause

. . . . . . . . .

Single Run - SpatialDifferences between behavior of processes

Single Run - TemporalDifferences between

one time point and others

Multiple RunsDifferences between

behavior of runs

MPI Application

Semi-Markov Models

SMM - Transition system Nodes: application states Edges: transitions from one state to another

• Probability of transition• Time spent in prior state before transition

.2 / 5μs

.7 / 15μs

.1 / 500μs

SMMs Represent Application Control Flow

SMM states correspond to• Calls to MPI • Code Between MPI Calls

Computation

main()foo()Send-DBL

Computation

main()foo()Recv-DBL

Computation

main()Finalize

main()Initmain() { MPI_Init() … Computation … MPI_Send(…, 1, MPI_INTEGER, …); for(…) foo(); MPI_Recv(…, 1, MPI_INTEGER, …); MPI_Finalize();}

foo() { MPI_Send(…, 1024, MPI_DOUBLE, …); …Computation… MPI_Recv(…, 1024, MPI_DOUBLE, …); …Computation…}

Application Code Semi-Markov Model

main()Send-INT

main()Recv-INT

Different statefor different

calling context

Transitions Represent Time Spent at States

During execution each transition observed multiple timesTime series of transition times: [t1, t2, …, tn]

Represented as probability distribution• Gaussian• Histogram

.2 / 5μs

.7 / 15μs

.1 / 500μs

Transitions Represent Time Spent at States

Gaussian

Histogram

Time Values

HistogramBucket Counts

Gaussian Tail

Line Connectors

Time Values

Probabilities

DataSamples

• Cheaper• Lower Accuracy

• More Expensive• Greater Accuracy

Using SMMs to Help Detect Faults

Hardware faults → behavior abnormalities Given sample runs, learn time distribution on

each transition (Top and bottom 0% or 10% of each

transition’s times removed)

If some transition takes an unusual amount of time, declare it an error

Time Values

Probabilities

Detection threshold computed from maximum normal variation

Need threshold to separate normal, abnormal timing

Threshold = lowest probability observed in set of sample runs (Top and bottom 1% removed)

Time Values

Probabilities

Nothing Removed Top/Bottom 10% Removed

False Positive Rate 0% 19%

Evaluated Fault Detector Using Fault Injection

NAS Parallel Benchmarks• 16-process runs• Input class A• Used BT, CG, FT, MG,LU and SP

(EP and IS use MPI in very simple ways)

Local delays (FIN_LOOP): 1, 5, 10 sec MPI message drop (DROP_MESG) or repetition

(REP_MESG) Extra CPU-intensive (CPU_THR) or Memory-

intensive (MEM_THR) thread

Rates of Fault Detection Within 1ms of Injection

NoDetection

False DetectionBefore Injection

Detection of FaultWithin 1ms

DetectionAfter 1ms

FIN_LOOP-1 FIN_LOOP-5 FIN_LOOP-10 DROP_MESG REP_MESG CPU_THR MEM_THR

100% noDetect

preDetect

Fault Detect(1ms)

postDetect

Filtering Usually Improves

Detection Rates

Single-Point EventsEasier to Detect ThanPersistent Changes

SMMs used to Help Identify Software Faults in MPI Applications

User knows application has fault but needs help to focus on cause

Help identify point where fault first manifests as change in application behavior

Key tasks on faulty run:• Identify time period of manifestation• Identify task where fault first manifested• Identify code region where fault first manifested

Focus on the Time Period of Unusual Behavior

User marks phase boundaries in code Compute SMM for each task/phase

Task 1

Task 2

Task n

Task 1

Task 2

Task n

Task 1

Task 2

Task n

Task 1

Task 2

Task n

Task 1

Task 2

Task n

Focus on the Time Period of Abnormal Behavior

Find phase with most unusual SMMs If sample runs available, compare faulty run’s

SMMs to sample runs’ SMMs

If none available, compare each phase to others

Faulty Run Sample Run

Cluster Tasks According to Behavior to Identify Abnormal Task

User provides application’s natural cluster count k Use sample execution to compute clustering

threshold τ that produces k clusters• Use sample runs if available• Otherwise, compute τ from start of execution

During real runs cluster tasks using threshold τ

Task 1 Task 2 Task n

Task 3

Task 4 Task 5 Task 6

Task 1

Task 2

Master-Worker

Task 3

Task 1

Task 2

Bug in Task 9

Cluster Tasks According to Behavior to Identify Abnormal Task

Compare tasks in each cluster to their behavior in• Sample runs• Start of execution

Most abnormal is identified

Transition most responsible for difference identified as origin

Task 3

Task 1

Task 2

Bug in Task 9

Evaluated Fault Detector Using Fault Injection

NAS Parallel Benchmarks• 16-task, Class A: BT, CG, FT, MG,LU and SP

2000 injection experiments per application• Local livelock/deadlock (FIN_LOOP, INF_LOOP)• Message drop (DROP_MESG), repetition (REP_MESG) • CPU-intensive (CPU_THR) or Memory-intensive

(MEM_THR) thread Examined variants of training runs

• 20 training runs with no faults• 20 training runs, 10% have fault• No training runs

Phase Detection Accuracy

Accuracy ~90% for Loops and Message drops, ~60% for Extra threads• Training significantly better than no training

(10% bug training is close)• Histograms better than Gaussians

Fault10 - Gauss

Fault10 - Histogram

NoFault - Gauss

NoFault - Histogram

NoSample - Gauss

NoSample - Histogram

CPU_THR

DROP_MSG

REP_MSG

OOP-10

Training vsNo TrainingNoFault Sample

vs Some FaultsGaussian vs Histogram

Cluster Isolation Accuracy

Results assume phase detected accurately Accuracy of Cluster Isolation highly variable

• Depends on propagation of fault’s effects• Accuracy upto 90% for extra threads• Poor detection

elsewhere sinceno informationon event timing

CPU_THR

DROP_M

REP_MSG

Cluster Isolation Accuracy

Extended cluster isolation with information on event order

Focuses on first abnormal transition Significantly better accuracy for loop faults

CPU_THR

DROP_M

REP_MSG

100%Abnormal Transition - Fault10

Transition Isolation

Accuracy: injected transition in top 5 candidates• Accuracy ~90% for Loop faults• Highly variable for others• Less variable if event order information is used

CPU_THR

DROP_M

REP_MSG

Abnormality Detection Helps Illuminate MVAPICH Bug

Job execution script failed clean up at job end, left runaway processes on nodes

Simulated by executing BT (16- and 64-task runs) concurrently with LU, MG or SP (16-task runs)

Experiments show • Average SMM difference in regular BT runs• Difference between BT runs with interference

and no-interference runs• Overlap execution during initial portion of BT run

and no-interference runs

1 2 3 4 5 6 7 8 9 101E+1

1E+516-task BT / 16-task SP/LU/MG

AVG No-InterferenceConcurrent SPConcurrent LUConcurrent MG

and no-interference runs

1 2 3 4 5 6 7 8 9 101E+2

1E+564-task BT / 16-task SP/LU/MG

AVG No-InterferenceConcurrent SPConcurrent LUConcurrent MG

Behavior Modeling is Critical Component of Fault Detection and Analysis

Complex behavior of applications and systems Statistical models provide accurate summary Promising results

• Quick detection of faults• Focused localization of root causes

Ongoing work• Scaling implementations to real HPC systems• Improving accuracy through

More data Models custom-tailored to applications

lawrence livermore national laboratory greg bronevetsky in collaboration with ignacio laguna,...

foo mpi

mpi code

approximate fault detection

init computation mpi

application statesedges

unusual application

prior state

causesfault tolerance

Documents

cs717 detection of control flow errors survey of hardware...

mike noeth, frank mueller north carolina state university...

· projekt budowy placu zabaw przy publicznej szkole...

cs717 checking data structures greg bronevetsky. cs717 the...

binary-level tools for floating-point correctness analysis...

openmp technical report 8: version 5.1 preview ·...

introduction to geometric nonlinear control...

polityka a w∏adza -...

theoretical program checking greg bronevetsky. background...

cs717 checking specific algorithms greg bronevetsky

tanzima z. islam, saurabh bagchi, rudolf eigenmann –...

the scalable process topology interface of mpi 2 ·...

cs717 detection and tolerance of complex faults in computing...

towards novel hbb application platform: experimental...

fotografia na całej stronieartur lozióski ewelina luczka...

powerpoint presentation · noo 000 888 w . title:...

bronis ropė. Žalioji mintis kaip žiedinės ekonomikos

powerpoint presentation · better workplaces better world"'...

powerpoint presentation...powerpoint presentation author:...

nps……..chds university and agency partnership initiative...