Triage: Diagnosing Production Run Failures at the User's Site
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign


Page 1

Page 2

Despite all of our effort, production runs still fail.
What do we do about these failures?

Page 3

What is (currently) done about end-user failures?

Dumps leave much manual effort to diagnose
We still need to reproduce the bug; this is hard, if not impossible, to do

Page 4

Why on-site diagnosis of production run failures?
Production run bugs are valuable:
  Not caught in testing
  Potentially environment-specific
  Causing real damage to end users
We can't diagnose production failures off-site:
  Reproduction is hard
  The programmer doesn't have the end-user environment
  Privacy concerns limit even the reports we do get
We must diagnose at the end-user's site

Page 5

What do we mean by diagnosis?

Diagnosis traces back from the failure to the underlying fault:
  a fault (root cause: a buggy line of code) triggers an error (incorrect state, e.g. a smashed stack), which becomes a failure (service interruption)
Core dumps tell you about the failure
Bug detection tells you about some errors
Existing diagnosis tools are offline

Page 6

What do we need to perform diagnosis? (1)
We need information about the failure:
  What is the fault, the error, the propagation tree?
Off-site:
  Repeatedly inspect the bug (e.g. with a debugger)
  Run analysis tools targeted at the failure, or at suspected failures
Off-site techniques don't work on-site:
  Reproducing the bug is non-trivial
  We don't know what specific failures will occur
  Existing analysis tools are too expensive

Page 7

What do we need to perform diagnosis? (2)
We need guidance as to what to do next:
  What analysis should we perform, what is likely to work well, and what variables are interesting?
Off-site: the programmer decides, based on past knowledge
On-site, there is no programmer; any decisions as to action must be made automatically

Page 8

What do we need to perform diagnosis? (3)
We need to try "what-ifs" with the execution:
  If we change this input, what happens? What if we skip this function?
Off-site: programmers run many input variations, even with differing code
This is difficult on-site:
  Most replay focuses on minimizing variance
  We can't understand what the results mean

Page 9

What does Triage contribute?

Enables on-site diagnosis:
  Uses systems techniques to make offline analysis tools feasible on-site
  Addresses the three previous challenges
Allows a new technique, delta analysis
Human study with real programmers and real bugs:
  Shows large savings in time-to-fix

Page 10

Overview

Introduction
Addressing the three challenges
Diagnosis process & design
Experimental results
  Human study
  Overhead
Related work
Conclusions

Page 11

Getting information about the failure

Checkpoint/re-execution can capture the bug:
  The environment, input, memory state, etc.
  Everything we need to reproduce the bug
Benefits:
  We can relive the failure over and over
  Dynamically plug in analysis tools "on-demand"
  Makes the expensive cheap
  Normal-run overhead is low too
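The "relive the failure, plug in tools on demand" idea can be sketched in miniature. This is an illustrative Python toy, not Triage's actual in-memory checkpointing subsystem: state is snapshotted with deepcopy, and each replay of the same execution interval can attach a different (possibly expensive) analysis hook.

```python
import copy

class Checkpointer:
    """Minimal in-memory checkpoint/re-execution sketch (illustrative only).

    A snapshot taken during the normal run lets the same execution
    interval be replayed repeatedly, each time with a different analysis
    plugged in; expensive tools only pay their cost during replay of a
    short failing interval, not during the whole production run.
    """
    def __init__(self, state):
        self.snapshot = copy.deepcopy(state)   # taken during the normal run (cheap)

    def replay(self, run_interval, analysis=None):
        state = copy.deepcopy(self.snapshot)   # relive the failure from the checkpoint
        trace = []
        hook = analysis if analysis else (lambda s: None)
        run_interval(state, lambda s: trace.append(hook(s)))
        return state, [t for t in trace if t is not None]

# A toy "execution interval" that mutates state, with an instrumentation hook.
def interval(state, hook):
    for step in range(3):
        state["x"] += step
        hook(state)

cp = Checkpointer({"x": 0})
plain, _ = cp.replay(interval)                        # deterministic replay, no analysis
_, watched = cp.replay(interval, lambda s: s["x"])    # same replay with a "detector" attached
```

The key property is that both replays start from the identical snapshot, so attaching a tool does not change what execution is being examined.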

Page 12

Guidance about what to do next

A human-like diagnosis protocol can guide the diagnosis process:
  Repeated replay lets us diagnose incrementally
  Based on past results, we can pick the next step
    E.g. if the bug doesn't always repeat, we should look for races

Stage | Goal
1     | failure/error type & location
2     | failure-triggering conditions
3     | fault-related code & variables
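The stage table above can be read as a small decision procedure: pick the next analysis from the current stage and what earlier replays observed. The dispatch rules below (e.g. choosing happens-before detection when the failure does not always repeat) are illustrative assumptions in the spirit of the slide, not Triage's exact protocol.

```python
def next_analysis(stage, observations):
    """Pick the next diagnosis step from past replay results (illustrative rules)."""
    if stage == 1:
        # Stage 1: identify failure/error type & location; try cheap checks first.
        return "bounds_checking" if observations.get("memory_error") else "assertion_checking"
    if stage == 2:
        # Stage 2: find triggering conditions; a failure that does not always
        # repeat suggests a race, so pick race detection (happens-before).
        return "happens_before" if not observations.get("always_repeats") else "input_testing"
    if stage == 3:
        # Stage 3: locate fault-related code & variables with heavyweight tools.
        return "dynamic_slicing"
    raise ValueError("unknown stage")
```

Each replay feeds its findings back into `observations`, so the protocol incrementally narrows from "what failed" to "which code and variables caused it".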

Page 13

Trying “what-ifs” with the execution

Flexible re-execution lets us play with what-ifs
Three types of re-execution:
  Plain: deterministic
  Loose: allow some variance
  Wild: introduce (potentially large) variations
Triage extracts how they differ with delta analysis

Page 14

Main idea of Triage

How to get information about the failure?
  Capture the bug with checkpoint/re-execution
  Relive the bug with various diagnostic techniques
How to decide what to do?
  Use a human-like protocol to select analyses
  Incrementally increase our understanding of the bug
How to try out "what-if" scenarios?
  Flexible re-execution allows varied executions
  Delta analysis points out what makes them different

Page 15

Overview

Introduction
Addressing the three challenges
Diagnosis process & design
Experimental results
  Human study
  Overhead
Related work
Conclusions

Page 16

Triage Architecture

Checkpointing Subsystem
Analysis Tools (e.g. backward slicing, bug detection)
Control Unit (Protocol)

Page 17

Triage vs. Rx

Both are in-memory; both support variations in execution
Triage has no output commit
Triage has no need for safety; it can even skip code
Triage considers why the failure occurs; it tries to analyze the failure

Page 18

Failure analysis & delta generation (stages 1 and 2)
Analysis tools (slowdown): bounds checking (1.1x), assertion checking (1x), happens-before (12x), atomicity detection (60x), static core analysis (1x), taint analysis (2x), dynamic slicing (1000x), symbolic execution (1000x), lockset analysis (20x)
Delta-generation variations: rearrange allocation, drop inputs, mutate inputs, pad buffers, change file state, drop code, reschedule threads, change libraries, reorder messages
The differences caused by variations are useful as well
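A few of the input-side variations above (drop inputs, mutate inputs, pad buffers) can be sketched as follows. The function name and the specific byte-level mutations are made up for illustration; the environment-side variations (rescheduling threads, changing libraries, reordering messages) are beyond a pure-input sketch.

```python
import random

def generate_deltas(failing_input, seed=0):
    """Produce varied re-execution inputs from one failing input (sketch only).

    Mirrors three of the slide's variation kinds:
      drop inputs   - remove a byte
      mutate inputs - flip a byte
      pad buffers   - append slack space so overruns land harmlessly
    """
    rng = random.Random(seed)          # deterministic, so variations are replayable
    variants = []
    if failing_input:
        i = rng.randrange(len(failing_input))
        variants.append(("drop", failing_input[:i] + failing_input[i + 1:]))
        j = rng.randrange(len(failing_input))
        flipped = bytes([failing_input[j] ^ 0xFF])
        variants.append(("mutate", failing_input[:j] + flipped + failing_input[j + 1:]))
    variants.append(("pad", failing_input + b"\x00" * 16))
    return variants

deltas = generate_deltas(b"ABCD")
```

Each variant is then replayed from the checkpoint; runs that no longer fail become the comparison points for delta analysis.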

Page 19

Delta analysis

Two runs over the program's basic blocks:
  Run 1: A B C D E F G
  Run 2: A B C X E G Y
Compute the basic block vector for each run; the third vector marks the blocks executed in only one of the two runs:
  Run 1: {A:1 B:1 C:1 D:1 X:0 E:1 F:1 G:1 Y:0}
  Run 2: {A:1 B:1 C:1 D:0 X:1 E:1 F:0 G:1 Y:1}
  Diff:  {A:0 B:0 C:0 D:1 X:1 E:0 F:1 G:0 Y:1}
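The basic block vectors above can be computed mechanically: mark each block 1 if a run executed it and 0 otherwise, then XOR two runs' vectors to find the blocks executed in only one of them. A minimal sketch reproducing the example's numbers:

```python
def block_vector(trace, universe):
    """Basic block vector: 1 if the run executed the block, else 0."""
    executed = set(trace)
    return {b: int(b in executed) for b in universe}

def vector_diff(v1, v2):
    """Blocks executed in exactly one of the two runs (bitwise XOR)."""
    return {b: v1[b] ^ v2[b] for b in v1}

UNIVERSE = list("ABCDXEFGY")          # every block seen across the runs
run1 = block_vector(list("ABCDEFG"), UNIVERSE)
run2 = block_vector(list("ABCXEGY"), UNIVERSE)
delta = vector_diff(run1, run2)       # D, X, F, Y differ between the runs
```

The differing blocks (D, X, F, Y) are exactly where the failing and non-failing executions diverge, which is the signal delta analysis reports.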

Page 20

Delta analysis

From delta generation's many runs, Triage finds the "most similar" by comparing the basic block vectors
Triage then diffs the two closest runs, computing the minimum edit distance (aka the shortest edit script):
  A B C D E F G
  A B C X E G Y
  (change D to X, delete F, insert Y)
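A shortest edit script between two block sequences can be computed with standard sequence diffing; this sketch uses Python's difflib, which approximates the minimal script and is sufficient for the example above:

```python
from difflib import SequenceMatcher

def edit_script(a, b):
    """Edit script turning block sequence a into b: (op, a_part, b_part) tuples."""
    ops = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op != "equal":
            ops.append((op, a[i1:i2], b[j1:j2]))
    return ops

# The two closest runs from the slide's example.
script = edit_script(list("ABCDEFG"), list("ABCXEGY"))
```

For these sequences the script is "replace D with X, delete F, insert Y", matching the slide's diff markers.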

Page 21

A bug in TAR

char *get_directory_contents (char *path, dev_t device)
{
  struct accumulator *accumulator;
  /* Recursively scan the given PATH. */
  {
    char *dirp = savedir (path);
    char const *entry;
    size_t entrylen;
    char *name_buffer;
    size_t name_buffer_size;
    size_t name_length;
    struct directory *directory;
    enum children children;

    if (! dirp)
      savedir_error (path);
    errno = 0;

    name_buffer_size = strlen (path) + NAME_FIELD_SIZE;
    name_buffer = xmalloc (name_buffer_size + 2);
    strcpy (name_buffer, path);
    if (! ISSLASH (path[strlen (path) - 1]))
      strcat (name_buffer, "/");
    name_length = strlen (name_buffer);

    directory = find_directory (path);
    children = directory ? directory->children : CHANGED_CHILDREN;

    accumulator = new_accumulator ();

    if (children != NO_CHILDREN)
      for (entry = dirp;
           (entrylen = strlen (entry)) != 0;   /* Segmentation fault: null pointer dereference */
           entry += entrylen + 1)

char *savedir (const char *dir)
{
  DIR *dirp;
  struct dirent *dp;
  char *name_space;
  size_t allocated = NAME_SIZE_DEFAULT;
  size_t used = 0;
  int save_errno;

  dirp = opendir (dir);
  if (dirp == NULL)
    return NULL;                               /* Execution difference */

  name_space = xmalloc (allocated);

  errno = 0;
  while ((dp = readdir (dirp)) != NULL)
    {
      char const *entry = dp->d_name;
      if (entry[entry[0] != '.' ? 0 : entry[1] != '.' ? 1 : 2] != '\0')
        {
          size_t entry_size = strlen (entry) + 1;
          if (used + entry_size < used)
            xalloc_die ();
          if (allocated <= used + entry_size)
            {
              do
                {
                  if (2 * allocated < allocated)
                    xalloc_die ();
                  allocated *= 2;
                }
              while (allocated <= used + entry_size);

Page 22

Sample Triage report

Failure point: segfault in lib strlen; stack & heap OK
Bug detection: deterministic bug; null pointer at incremen.c:207
Fault propagation:
  dirp = opendir (dir);
  if (dirp == NULL) return NULL;
  dirp = savedir (path);
  entry = dirp;
  strlen(entry)

Page 23

Results – Human Study

We tested Triage with a human study:
  15 programmers drawn from faculty, research programmers, and graduate students (no undergraduates!)
  Measured time to repair bugs, with and without Triage
  Everybody got core dumps, sample inputs, instructions on how to replicate, and access to many debugging tools, including Valgrind
  3 simple toy bugs & 2 real bugs: the TAR bug you just saw, and a copy-paste error in BC

Page 24

Time to fix a bug

We hope that the report is easy to check
We cut out the reproduction step; this is quite unfair to Triage
Also, we imposed a time limit; going over is counted as the maximum time

Without Triage: reproduce -> find failure -> find error -> find fault -> fix it
With Triage: check the Triage report -> fix it

Page 25

Results – Human study

For the real bugs, Triage strongly helps (47%)
  Better than 99.99% confidence that time-to-fix with Triage < time-to-fix without

Page 26

Results – Other Bugs

  Application | Δ Generation  | Δ Analysis | Dynamic Slicing
  Apache      | input element | 12%        | 8 instructions
  Apache      | input element | 69%        | 3 instructions
  CVS         | --            | --         | 4 functions
  MySQL       | interleaving  | --         | --
  Squid       | 1 character   | 71%        | 6 instructions
  BC          | array padding | 98%        | 3 instructions
  Linux-ext   | --            | --         | 6 instructions
  MAN         | --            | --         | 9 functions
  NCOMP       | --            | --         | 5 instructions
  TAR         | file perms    | 68%        | 6 instructions

Page 27

Results – Normal Run Overhead

Identical to checkpoint system (Rx) overhead: under 5%

Page 28

Results – Diagnosis Overhead

CPU-bound is the worst case; still reasonable because we're only redoing 200ms
Delta analysis is somewhat costly; it should be run in the background

Page 29

Related work

Checkpointing & re-execution: Zap [Osman, OSDI'02], TTVM [King, USENIX'05]
Bug detection & diagnosis: Valgrind [Nethercote], CCured [Necula, POPL'02], Purify [Hastings, USENIX'92], Eraser [Savage, TOCS'97], [Netzer, PPoPP'91], backward slicing [Weiser, CACM'82], and innumerable others
Execution variation:
  Input variation: delta debugging [Zeller, FSE'02], fuzzing [B. So]
  Environment variation: Rx [Qin, SOSP'05], DieHard [Berger, PLDI'06]

Page 30

Conclusions & Future Work

On-site diagnosis can be made feasible:
  Checkpointing can effectively capture the failure
  Expensive off-line analysis can be done on-site
  Privacy issues are minimized
Also useful for in-house testing: reduces the manual portion of analysis
Future work: automatic bug hot fixes, visualization of delta analysis

Page 31

Thank you

Questions?

Special thanks to Hewlett-Packard for student scholarship support.

This work was supported by NSF, DoE, and Intel.