Triage: Diagnosing Production Run Failures at the User's Site
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign


Page 1

Page 2

Despite all of our effort, production runs still fail.
What do we do about these failures?

Page 3

What is (currently) done about end-user failures?

Dumps leave much manual effort to diagnose
We still need to reproduce the bug; this is hard, if not impossible, to do

Page 4

Why on-site diagnosis of production run failures?
Production run bugs are valuable:
  Not caught in testing
  Potentially environment-specific
  Causing real damage to end users
We can't diagnose production failures off-site:
  Reproduction is hard
  The programmer doesn't have the end-user environment
  Privacy concerns limit even the reports we do get
We must diagnose at the end-user's site

Page 5

What do we mean by diagnosis?

Diagnosis traces back from the failure to the underlying fault:
  a fault (root cause: a buggy line of code) triggers an error (incorrect state, e.g. a smashed stack), which becomes a failure (service interruption)
Core dumps tell you about the failure
Bug detection tells you about some errors
Existing diagnosis tools are offline

Page 6

What do we need to perform diagnosis? (1)
We need information about the failure:
  What is the fault, the error, the propagation tree?
Off-site:
  Repeatedly inspect the bug (e.g. with a debugger)
  Run analysis tools targeted at the failure, or at suspected failures
Off-site techniques don't work on-site:
  Reproducing the bug is non-trivial
  We don't know what specific failures will occur
  Existing analysis tools are too expensive

Page 7

What do we need to perform diagnosis? (2)
We need guidance as to what to do next:
  What analysis should we perform, what is likely to work well, and what variables are interesting?
Off-site: the programmer decides, based on past knowledge
On-site, there is no programmer; any decisions as to action must be made automatically

Page 8

What do we need to perform diagnosis? (3)
We need to try "what-ifs" with the execution:
  If we change this input, what happens? What if we skip this function?
Off-site: programmers run many input variations, even with differing code
This is difficult on-site:
  Most replay focuses on minimizing variance
  We can't understand what the results mean

Page 9

What does Triage contribute?

Enables on-site diagnosis:
  Uses systems techniques to make offline analysis tools feasible on-site
  Addresses the three previous challenges
Allows a new technique, delta analysis
Human study with real programmers and real bugs:
  Shows large savings in time-to-fix

Page 10

Overview

Introduction
Addressing the three challenges
Diagnosis process & design
Experimental results
  Human study
  Overhead
Related work
Conclusions

Page 11

Getting information about the failure

Checkpoint/re-execution can capture the bug:
  The environment, input, memory state, etc.
  Everything we need to reproduce the bug
Benefits:
  We can relive the failure over and over
  Dynamically plug in analysis tools "on-demand"
  Makes the expensive cheap
  Normal-run overhead is low too
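The "relive the failure, plug in tools on demand" idea can be sketched in miniature. This is an illustrative Python toy, not Triage's actual in-memory checkpointing subsystem: state is snapshotted with deepcopy, and each replay of the same execution interval can attach a different (possibly expensive) analysis hook.

```python
import copy

class Checkpointer:
    """Minimal in-memory checkpoint/re-execution sketch (illustrative only).

    A snapshot taken during the normal run lets the same execution
    interval be replayed repeatedly, each time with a different analysis
    plugged in; expensive tools only pay their cost during replay of a
    short failing interval, not during the whole production run.
    """
    def __init__(self, state):
        self.snapshot = copy.deepcopy(state)   # taken during the normal run (cheap)

    def replay(self, run_interval, analysis=None):
        state = copy.deepcopy(self.snapshot)   # relive the failure from the checkpoint
        trace = []
        hook = analysis if analysis else (lambda s: None)
        run_interval(state, lambda s: trace.append(hook(s)))
        return state, [t for t in trace if t is not None]

# A toy "execution interval" that mutates state, with an instrumentation hook.
def interval(state, hook):
    for step in range(3):
        state["x"] += step
        hook(state)

cp = Checkpointer({"x": 0})
plain, _ = cp.replay(interval)                        # deterministic replay, no analysis
_, watched = cp.replay(interval, lambda s: s["x"])    # same replay with a "detector" attached
```

The key property is that both replays start from the identical snapshot, so attaching a tool does not change what execution is being examined.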

Page 12

Guidance about what to do next

A human-like diagnosis protocol can guide the diagnosis process:
  Repeated replay lets us diagnose incrementally
  Based on past results, we can pick the next step
    E.g. if the bug doesn't always repeat, we should look for races

Stage | Goal
1     | failure/error type & location
2     | failure-triggering conditions
3     | fault-related code & variables
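The stage table above can be read as a small decision procedure: pick the next analysis from the current stage and what earlier replays observed. The dispatch rules below (e.g. choosing happens-before detection when the failure does not always repeat) are illustrative assumptions in the spirit of the slide, not Triage's exact protocol.

```python
def next_analysis(stage, observations):
    """Pick the next diagnosis step from past replay results (illustrative rules)."""
    if stage == 1:
        # Stage 1: identify failure/error type & location; try cheap checks first.
        return "bounds_checking" if observations.get("memory_error") else "assertion_checking"
    if stage == 2:
        # Stage 2: find triggering conditions; a failure that does not always
        # repeat suggests a race, so pick race detection (happens-before).
        return "happens_before" if not observations.get("always_repeats") else "input_testing"
    if stage == 3:
        # Stage 3: locate fault-related code & variables with heavyweight tools.
        return "dynamic_slicing"
    raise ValueError("unknown stage")
```

Each replay feeds its findings back into `observations`, so the protocol incrementally narrows from "what failed" to "which code and variables caused it".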

Page 13

Trying “what-ifs” with the execution

Flexible re-execution lets us play with what-ifs
Three types of re-execution:
  Plain: deterministic
  Loose: allow some variance
  Wild: introduce (potentially large) variations
Triage extracts how they differ with delta analysis

Page 14

Main idea of Triage

How to get information about the failure?
  Capture the bug with checkpoint/re-execution
  Relive the bug with various diagnostic techniques
How to decide what to do?
  Use a human-like protocol to select analyses
  Incrementally increase our understanding of the bug
How to try out "what-if" scenarios?
  Flexible re-execution allows varied executions
  Delta analysis points out what makes them different

Page 15

Overview

Introduction
Addressing the three challenges
Diagnosis process & design
Experimental results
  Human study
  Overhead
Related work
Conclusions

Page 16

Triage Architecture

Checkpointing Subsystem
Analysis Tools (e.g. backward slicing, bug detection)
Control Unit (Protocol)

Page 17

Triage vs. Rx

Both are in-memory; both support variations in execution
Triage has no output commit
Triage has no need for safety; it can even skip code
Triage considers why the failure occurs; it tries to analyze the failure

Page 18

Failure analysis & delta generation (stages 1 and 2)
Analysis tools (slowdown): bounds checking (1.1x), assertion checking (1x), happens-before (12x), atomicity detection (60x), static core analysis (1x), taint analysis (2x), dynamic slicing (1000x), symbolic execution (1000x), lockset analysis (20x)
Delta-generation variations: rearrange allocation, drop inputs, mutate inputs, pad buffers, change file state, drop code, reschedule threads, change libraries, reorder messages
The differences caused by variations are useful as well
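A few of the input-side variations above (drop inputs, mutate inputs, pad buffers) can be sketched as follows. The function name and the specific byte-level mutations are made up for illustration; the environment-side variations (rescheduling threads, changing libraries, reordering messages) are beyond a pure-input sketch.

```python
import random

def generate_deltas(failing_input, seed=0):
    """Produce varied re-execution inputs from one failing input (sketch only).

    Mirrors three of the slide's variation kinds:
      drop inputs   - remove a byte
      mutate inputs - flip a byte
      pad buffers   - append slack space so overruns land harmlessly
    """
    rng = random.Random(seed)          # deterministic, so variations are replayable
    variants = []
    if failing_input:
        i = rng.randrange(len(failing_input))
        variants.append(("drop", failing_input[:i] + failing_input[i + 1:]))
        j = rng.randrange(len(failing_input))
        flipped = bytes([failing_input[j] ^ 0xFF])
        variants.append(("mutate", failing_input[:j] + flipped + failing_input[j + 1:]))
    variants.append(("pad", failing_input + b"\x00" * 16))
    return variants

deltas = generate_deltas(b"ABCD")
```

Each variant is then replayed from the checkpoint; runs that no longer fail become the comparison points for delta analysis.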

Page 19

Delta analysis

Two runs over the program's basic blocks:
  Run 1: A B C D E F G
  Run 2: A B C X E G Y
Compute the basic block vector for each run; the third vector marks the blocks executed in only one of the two runs:
  Run 1: {A:1 B:1 C:1 D:1 X:0 E:1 F:1 G:1 Y:0}
  Run 2: {A:1 B:1 C:1 D:0 X:1 E:1 F:0 G:1 Y:1}
  Diff:  {A:0 B:0 C:0 D:1 X:1 E:0 F:1 G:0 Y:1}
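The basic block vectors above can be computed mechanically: mark each block 1 if a run executed it and 0 otherwise, then XOR two runs' vectors to find the blocks executed in only one of them. A minimal sketch reproducing the example's numbers:

```python
def block_vector(trace, universe):
    """Basic block vector: 1 if the run executed the block, else 0."""
    executed = set(trace)
    return {b: int(b in executed) for b in universe}

def vector_diff(v1, v2):
    """Blocks executed in exactly one of the two runs (bitwise XOR)."""
    return {b: v1[b] ^ v2[b] for b in v1}

UNIVERSE = list("ABCDXEFGY")          # every block seen across the runs
run1 = block_vector(list("ABCDEFG"), UNIVERSE)
run2 = block_vector(list("ABCXEGY"), UNIVERSE)
delta = vector_diff(run1, run2)       # D, X, F, Y differ between the runs
```

The differing blocks (D, X, F, Y) are exactly where the failing and non-failing executions diverge, which is the signal delta analysis reports.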

Page 20

Delta analysis

From delta generation's many runs, Triage finds the "most similar" by comparing the basic block vectors
Triage then diffs the two closest runs, computing the minimum edit distance (aka the shortest edit script):
  A B C D E F G
  A B C X E G Y
  (change D to X, delete F, insert Y)
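A shortest edit script between two block sequences can be computed with standard sequence diffing; this sketch uses Python's difflib, which approximates the minimal script and is sufficient for the example above:

```python
from difflib import SequenceMatcher

def edit_script(a, b):
    """Edit script turning block sequence a into b: (op, a_part, b_part) tuples."""
    ops = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op != "equal":
            ops.append((op, a[i1:i2], b[j1:j2]))
    return ops

# The two closest runs from the slide's example.
script = edit_script(list("ABCDEFG"), list("ABCXEGY"))
```

For these sequences the script is "replace D with X, delete F, insert Y", matching the slide's diff markers.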

Page 21

A bug in TAR

char *get_directory_contents (char *path, dev_t device)
{
  struct accumulator *accumulator;
  /* Recursively scan the given PATH. */
  {
    char *dirp = savedir (path);
    char const *entry;
    size_t entrylen;
    char *name_buffer;
    size_t name_buffer_size;
    size_t name_length;
    struct directory *directory;
    enum children children;

    if (! dirp)
      savedir_error (path);
    errno = 0;

    name_buffer_size = strlen (path) + NAME_FIELD_SIZE;
    name_buffer = xmalloc (name_buffer_size + 2);
    strcpy (name_buffer, path);
    if (! ISSLASH (path[strlen (path) - 1]))
      strcat (name_buffer, "/");
    name_length = strlen (name_buffer);

    directory = find_directory (path);
    children = directory ? directory->children : CHANGED_CHILDREN;

    accumulator = new_accumulator ();

    if (children != NO_CHILDREN)
      for (entry = dirp;
           (entrylen = strlen (entry)) != 0;   /* Segmentation fault: null pointer dereference */
           entry += entrylen + 1)

char *savedir (const char *dir)
{
  DIR *dirp;
  struct dirent *dp;
  char *name_space;
  size_t allocated = NAME_SIZE_DEFAULT;
  size_t used = 0;
  int save_errno;

  dirp = opendir (dir);
  if (dirp == NULL)
    return NULL;                               /* Execution difference */

  name_space = xmalloc (allocated);

  errno = 0;
  while ((dp = readdir (dirp)) != NULL)
    {
      char const *entry = dp->d_name;
      if (entry[entry[0] != '.' ? 0 : entry[1] != '.' ? 1 : 2] != '\0')
        {
          size_t entry_size = strlen (entry) + 1;
          if (used + entry_size < used)
            xalloc_die ();
          if (allocated <= used + entry_size)
            {
              do
                {
                  if (2 * allocated < allocated)
                    xalloc_die ();
                  allocated *= 2;
                }
              while (allocated <= used + entry_size);

Page 22

Sample Triage report

Failure point: segfault in lib strlen; stack & heap OK
Bug detection: deterministic bug; null pointer at incremen.c:207
Fault propagation:
  dirp = opendir (dir);
  if (dirp == NULL) return NULL;
  dirp = savedir (path);
  entry = dirp;
  strlen(entry)

Page 23

Results – Human Study

We tested Triage with a human study:
  15 programmers drawn from faculty, research programmers, and graduate students (no undergraduates!)
  Measured time to repair bugs, with and without Triage
  Everybody got core dumps, sample inputs, instructions on how to replicate, and access to many debugging tools, including Valgrind
  3 simple toy bugs & 2 real bugs: the TAR bug you just saw, and a copy-paste error in BC

Page 24

Time to fix a bug

We hope that the report is easy to check
We cut out the reproduction step; this is quite unfair to Triage
Also, we imposed a time limit; going over is counted as the maximum time

Without Triage: reproduce -> find failure -> find error -> find fault -> fix it
With Triage: check the Triage report -> fix it

Page 25

Results – Human study

For the real bugs, Triage strongly helps (47%)
  Better than 99.99% confidence that time-to-fix with Triage < time-to-fix without

Page 26

Results – Other Bugs

  Application | Δ Generation  | Δ Analysis | Dynamic Slicing
  Apache      | input element | 12%        | 8 instructions
  Apache      | input element | 69%        | 3 instructions
  CVS         | --            | --         | 4 functions
  MySQL       | interleaving  | --         | --
  Squid       | 1 character   | 71%        | 6 instructions
  BC          | array padding | 98%        | 3 instructions
  Linux-ext   | --            | --         | 6 instructions
  MAN         | --            | --         | 9 functions
  NCOMP       | --            | --         | 5 instructions
  TAR         | file perms    | 68%        | 6 instructions

Page 27

Results – Normal Run Overhead

Identical to checkpoint system (Rx) overhead: under 5%

Page 28

Results – Diagnosis Overhead

CPU-bound is the worst case; still reasonable because we're only redoing 200ms
Delta analysis is somewhat costly; it should be run in the background

Page 29

Related work

Checkpointing & re-execution: Zap [Osman, OSDI'02], TTVM [King, USENIX'05]
Bug detection & diagnosis: Valgrind [Nethercote], CCured [Necula, POPL'02], Purify [Hastings, USENIX'92], Eraser [Savage, TOCS'97], [Netzer, PPoPP'91], backward slicing [Weiser, CACM'82], and innumerable others
Execution variation:
  Input variation: delta debugging [Zeller, FSE'02], fuzzing [B. So]
  Environment variation: Rx [Qin, SOSP'05], DieHard [Berger, PLDI'06]

Page 30

Conclusions & Future Work

On-site diagnosis can be made feasible:
  Checkpointing can effectively capture the failure
  Expensive off-line analysis can be done on-site
  Privacy issues are minimized
Also useful for in-house testing: reduces the manual portion of analysis
Future work: automatic bug hot fixes, visualization of delta analysis

Page 31

Thank you

Questions?

Special thanks to Hewlett-Packard for student scholarship support.

This work was supported by NSF, DoE, and Intel.