
Page 1: final Vrisha HPDC'11 presentation

1  

Vrisha: Using Scaling Properties of Parallel Programs for Bug Detection and Localization

Bowen Zhou, Milind Kulkarni and Saurabh Bagchi
School of Electrical and Computer Engineering

Purdue University

Scale-dependent Bugs in Parallel Programs

•  What are they?
– Bugs that arise often in large-scale runs while staying invisible in small-scale ones
– Examples? Race conditions, integer overflow, etc.
– Severe impact on HPC systems: performance degradation, silent data corruption

•  Difficult to detect, let alone localize


2  

Scale-dependent Bugs in Parallel Programs

•  Why do parallel programs have such bugs?
– Coded and tested on small-scale development machines with small-scale test cases
– Deployed on large-scale production systems

When Statistical Debugging Meets Scale-dependent Bugs

[Figure: Communication volume (0–400) for Training, Normal, and Buggy runs]

Previous Statistical Debugging

•  Previous methods of statistical debugging for parallel programs [Mirgorodskiy SC'06] [Gao SC'07] [Kasick FAST'10]


3  

When Statistical Debugging Meets Scale-dependent Bugs

[Figure: Communication volume (0–400) vs. Scale (0–200); Training, Normal, and Buggy runs]

Our Proposed Method

•  To address scale-dependent bugs with previous methods, you must:
– Either be very restricted in your feature selection
– Or have access to a bug-free run on the large-scale system

•  Our method solves the problems of previous methods:
– Capable of modeling the scaling behavior of a parallel program
– Only needs access to bug-free runs from small-scale systems

Key  Insights  

•  The "natural scaling behavior" of an application can be captured and used for bug detection

•  Instead of looking for deviations from scale-invariant behavior, we look for deviations from the scaling trend


4  

Contributions

•  We build Vrisha to use the scale of a run as a parameter to a statistical model for debugging scale-dependent bugs

•  Vrisha builds a model that deduces the correlation between the scale of a run and program behavior

•  Vrisha detects both application and library bugs with low overhead, as validated with two real bugs from a popular MPI implementation

Example of Scale-dependent Bugs

•  A bug in MPI_Allgather in MPICH2-1.1
– Allgather is a collective communication that lets every process gather data from all participating processes

[Figure: Allgather; processes P1, P2, P3 each end up with the data of all three]


5  

Example of Scale-dependent Bugs

•  MPICH2 uses distinct algorithms to do Allgather in different situations

•  The optimal algorithm is selected based on the total amount of data received by each process

[Figure: RECURSIVE DOUBLING vs. RING communication patterns among processes P0–P7]

Example of Scale-dependent Bugs

int MPIR_Allgather (
    ......
    int recvcount,
    MPI_Datatype recvtype,
    MPID_Comm *comm_ptr )
{
    int comm_size, rank;
    int curr_cnt, dst, type_size, left, right, jnext, comm_size_is_pof2;
    ......
    if ((recvcount*comm_size*type_size < MPIR_ALLGATHER_LONG_MSG) &&
        (comm_size_is_pof2 == 1)) {
        /* Short or medium size message and power-of-two no. of processes.
         * Use recursive doubling algorithm */
    ......
    else if (recvcount*comm_size*type_size < MPIR_ALLGATHER_SHORT_MSG) {
        /* Short message and non-power-of-two no. of processes. Use
         * Bruck algorithm (see description above). */
    ......
    else {
        /* long message or medium-size message and non-power-of-two
         * no. of processes. use ring algorithm. */
    ......

recvcount*comm_size*type_size can easily overflow a 32-bit integer on large systems and fail the if statement


6  

Key Observation

•  We observe that in parallel programs, some program properties are predictable from the scale of the run

•  These are called scale-determined properties

[Figure: Number of Comm Partners and Total Bytes Sent across 1-node, 2-node, 4-node, and 8-node runs; both are scale-determined]

[Figure: Communication vs. Scale; a model trained on 16- to 256-node runs predicts behavior at 65536 nodes, and the observation is compared against the prediction]

Bug Detection

•  Localization of the bug is done by identifying the properties that cause the gap and the code regions that affect these properties

•  A large gap means a bug is detected!


7  

Vrisha's Workflow

a.  Collect bug-free data from training runs at different scales
b.  Aggregate the data into scale and program-property features
c.  Build a model from scale and properties
d.  Collect data from the production run
e.  Perform detection and diagnosis for the production run

[Figure: Vrisha's workflow: (a) training runs at different scales (n=16, 32, 64); (b) data collection into Scale and Behavior features; (c) model building; (d) production run at n=128; (e) detection and diagnosis]

Challenges  

•  What features should we use to build the model of scaling behavior?



8  

Control  Feature  

•  We generalize the concept of "scale" to control features

•  A set of parameters given by the system or user:
– number of processors in the system
– command-line arguments
– input size

•  Control  features  are  the  predictors  of  program  behavior  

Observa2onal  Feature  

•  To characterize program behavior, we use observational features

•  A set of vantage points in the source code to profile various runtime properties

•  Observational features are the manifestations of program behavior


9  

Observational Feature

•  What does Vrisha use as observational features?
– Each unique call stack at the socket level beneath the MPI library serves as a distinct vantage point
– The amount of data communicated at each vantage point is recorded as an observational feature

•  Why?
– Captures both application and library bugs
– No need to modify application code
– Imposes low overhead
– Provides source-level diagnosis

Challenges  

•  What model should we use to capture the relationship between scale and behavior?
– It must describe both linear and non-linear relationships



10  

Model: Canonical Correlation Analysis

[Figure: control feature X is projected to Xu, observational feature Y to Yv, and the correlation between the projections is maximized]

max_{u,v} corr(Xu, Yv)   such that   ||u|| = ||v|| = 1

Model:  Kernel  CCA  

[Figure: X and Y are first mapped by ϕ(·) to ϕ(X) and ϕ(Y), then projected to ϕ(X)u and ϕ(Y)v, with the correlation between the projections maximized]

We use the "kernel trick", a popular machine-learning technique, to transform non-linear relationships into linear ones via a non-linear mapping ϕ(·)
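For reference, the kernelized objective can be written out explicitly (standard KCCA notation, assumed rather than taken from the paper): by the representer theorem the directions can be expressed as u = Σ_i α_i ϕ(x_i) and v = Σ_i β_i ϕ(y_i), so only the kernel matrices (K_x)_{ij} = k(x_i, x_j) = ⟨ϕ(x_i), ϕ(x_j)⟩ are ever needed. One common regularized form is

```latex
\max_{\alpha,\beta}\;
\frac{\alpha^{\top} K_x K_y\, \beta}
     {\sqrt{\alpha^{\top}\!\left(K_x^{2} + \kappa I\right)\alpha}\;
      \sqrt{\beta^{\top}\!\left(K_y^{2} + \kappa I\right)\beta}}
```

where κ > 0 is a small regularization constant; without it, KCCA with expressive kernels can attain a trivial perfect correlation.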


11  

Challenges  

•  How does Vrisha do bug detection and diagnosis?


Bug Detection

[Figure: production-run features X' and Y' are mapped to ϕ(X') and ϕ(Y') and projected to ϕ(X')u and ϕ(Y')v; in a buggy run the projections are poorly correlated]

We declare a bug in the production run if the correlation between the control features X' and the observational features Y', in the subspace defined by u and v, falls below a certain threshold


12  

Bug Localization

•  Normalize the communication volumes of the training and production runs to the same scale

•  Compare the training (bug-free) runs with the production (buggy) run

•  Identify the call stacks with the biggest changes as the potential bug locations

Evaluation

•  We validated Vrisha with two real bugs from the MPICH2 library, and Vrisha was able to detect and localize both of them

•  Experiment setup:
– 16-node cluster
– dual-socket 2.2 GHz quad-core CPUs
– 512 KB L2 cache, 8 GB RAM
– MPICH2-1.1.1


13  

Detect the Bug in Allgather

•  The bug is configured to trigger only in the 16-node run
•  A KCCA model is built solely on 4- to 15-node runs
•  Vrisha reveals this scale-dependent bug through the very low correlation in the 16-node buggy run

Locate the Bug in Allgather

•  All the training runs, such as the 4- and 8-node runs, share the same pattern

•  The 16-node run shows a different pattern

•  The most radical difference occurs between features #9 and #16, which leads us to find that the program makes the wrong choice at the buggy if statement


14  

Vrisha's Overhead

•  We measured the average overhead of Vrisha with the NAS Parallel Benchmarks

•  Vrisha's instrumentation overhead is less than 8% of execution time on average

•  Vrisha takes less than 30 ms to build the model and less than 5 ms to detect a bug, equivalent to around 1/10,000 of the running time of these benchmarks on average

Conclusion  

•  Some program properties in parallel programs are scale-determined and can be exploited to detect and localize scale-dependent bugs

•  We built a system called Vrisha that leverages this idea and successfully detected two real bugs from the MPICH2 library with low overhead