a human study of fault localization accuracy zachary p. fry westley weimer university of virginia...

A HUMAN STUDY OF FAULT LOCALIZATION ACCURACY

Zachary P. Fry, Westley Weimer
University of Virginia
September 16, 2010

SOFTWARE MAINTENANCE

- Maintenance can account for the majority of the software lifecycle
  - Locating defects in code is a considerable challenge
- What if we knew how easy it was to locate faults in a code base beforehand?
  - Engineer systems to make bug finding easier
  - Concentrate on problem areas
- Could we develop a model that measures this?
  - How would we gather a data set?

PROBLEM – FAULT LOCALIZATION

- We treat fault localization as the task of determining whether a program or code fragment contains a defect and, if so, locating the line where that defect resides
- Research question: Which factors contribute to a human’s ability to detect and locate defects?

PROBLEM – FAULT LOCALIZATION

- We examine four categories of defect and code characteristics
  - Error type
  - Surface and syntactical features
  - Control flow and contextual features
  - Abstraction
- Which of these affect humans’ abilities to locate defects in code?

OUTLINE

- Motivation
- Structure of Model
- Human Study
- Evaluation of Model
- Conclusions

MOTIVATION: AN EXAMPLE

    /** Move a single disk from src to dest. */
    public static void hanoi1(int src, int dest) {
        System.out.println(src + " => " + dest);
    }

    /** Move two disks from src to dest, making use of a spare peg. */
    public static void hanoi2(int src, int dest, int spare) {
        hanoi1(src, dest);
        System.out.println(src + " => " + dest);
        hanoi1(spare, dest);
    }

    /** Move three disks from src to dest, making use of a spare peg. */
    public static void hanoi3(int src, int dest, int spare) {
        hanoi2(src, spare, dest);
        System.out.println(src + " => " + dest);
        hanoi2(spare, dest, src);
    }

- Seeded defect: the first statement of hanoi2 should read hanoi1(src, spare);
- 33% of participants correctly located the defect

TOWERS OF HANOI – VERSION 2

- More complex control flow
  - if/else statement
  - recursion
- Rich commenting
- Descriptive identifiers
- 53% of participants correctly located the fault

    /*******************************************
      Performs the initial call to moveTower to solve the puzzle.
      Moves the disks from tower 1 to tower 3 using tower 2.
    ********************************************/
    public void solve () {
        moveTower (totalDisks, 1, 3, 2);
    }

    /*******************************************
      Moves the specified number of disks from one tower to another
      by moving a subtower of n-1 disks out of the way, moving one
      disk, then moving the subtower back. Base case of 1 disk.
    ********************************************/
    private void moveTower (int numDisks, int start, int end, int temp) {
        if (numDisks == 1)
            moveTower(numDisks-1, temp, end, start);
        else {
            moveTower (numDisks-1, start, temp, end);
            moveOneDisk (start, end);
            moveTower (numDisks-1, temp, end, start);
        }
    }

    /*******************************************
      Prints instructions to move one disk from the specified start
      tower to the specified end tower.
    *******************************************/
    private void moveOneDisk (int start, int end) {
        System.out.println ("Move one disk from " + start + " to " + end);
    }

- Seeded defect: the base case (numDisks == 1) should read moveOneDisk (start, end);

MODEL – OVERVIEW

- We desire a model of human fault localization accuracy that, given source code as input, can predict the likelihood that a human will be able to accurately locate faults within it
- We hypothesize that features relevant to such a model will fall into four categories: fault type, syntax, context, and abstraction
  - Existing work tends to focus on only one of these areas at a time
- Linear regression – trained on human study data
  - Ease of analysis
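
To make the linear regression idea concrete, here is a minimal sketch that fits accuracy against a single code feature with ordinary least squares. It is a deliberate simplification: the model described on the slides combines many features, and the class name, method names, and sample numbers below are hypothetical illustrations, not study data.

```java
/** Sketch: ordinary least squares fit of accuracy against ONE code feature. */
final class AccuracyRegression {

    /** Returns {intercept, slope} minimizing squared error of y ~ intercept + slope * x. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two paired observations");
        }
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("feature has no variance");
        }
        double slope = sxy / sxx;
        return new double[] { meanY - slope * meanX, slope };
    }

    /** Predicted accuracy for a file whose feature value is given. */
    static double predict(double[] model, double feature) {
        return model[0] + model[1] * feature;
    }

    public static void main(String[] args) {
        // Hypothetical data: block nesting level per file vs. fraction of correct answers.
        double[] nesting = { 1.0, 2.0, 3.0, 4.0 };
        double[] accuracy = { 0.60, 0.52, 0.47, 0.35 };
        double[] model = fit(nesting, accuracy);
        System.out.println("intercept=" + model[0] + " slope=" + model[1]);
        System.out.println("predicted accuracy at nesting 5: " + predict(model, 5.0));
    }
}
```

A linear form keeps each coefficient directly interpretable, which matches the "ease of analysis" motivation given for choosing it.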

DEFECT FEATURES

- Error type
  - Adapted and expanded existing Knight taxonomy
  - Sampled from consecutive Mozilla bugs to obtain types and distribution
  - We consider 17 total types of single-line defects
    - Missing statement
    - Uninitialized variable
    - Extra assignment
    - Incorrect type
    - Incorrect constant
    - Incorrect parameter
    - Negated conditional
    - Incorrect method call
    - Incorrect variable
    - …

MODEL – CODE FEATURES

- Code-based features
  - Most measured automatically, some manually
  - 92 total

Syntax
- Block nesting level
- Number of method calls
- Num of local vars
- Num of var declarations
- Num of var uses
- Avg line length
- …

Context
- Avg/Max CFG in-edges
- Avg/Max CFG out-edges
- Avg CFG path length
- Num of CFG edges
- Num of CFG leaves
- Ratio of “ifs” to “elses”
- …

Abstraction
- Num of array-based structures
- Uses underlying data structure
- Implements a heap
- Implements a tree
- Implements reheap
- …
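
As an illustration of how two of the syntax features could be measured automatically, the sketch below computes average line length and the deepest brace nesting from raw source text. The slides do not say how the features were extracted, so this is an assumption: the class and method names are hypothetical, and counting braces is only a rough stand-in for block nesting (it ignores braces inside strings and comments).

```java
/** Sketch: two surface-level features computed directly from source text. */
final class SyntaxFeatures {

    /** Average number of characters per non-blank line. */
    static double averageLineLength(String source) {
        int total = 0;
        int count = 0;
        for (String line : source.split("\n")) {
            if (!line.trim().isEmpty()) {
                total += line.length();
                count++;
            }
        }
        return count == 0 ? 0.0 : (double) total / count;
    }

    /** Deepest level of '{' ... '}' nesting reached anywhere in the text. */
    static int maxBlockNesting(String source) {
        int depth = 0;
        int max = 0;
        for (char c : source.toCharArray()) {
            if (c == '{') {
                depth++;
                max = Math.max(max, depth);
            } else if (c == '}' && depth > 0) {
                depth--;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        String snippet = "void f() {\n  if (x) {\n    y();\n  }\n}";
        System.out.println("avg line length: " + averageLineLength(snippet));
        System.out.println("max nesting: " + maxBlockNesting(snippet));
    }
}
```

A parser-based implementation would be more robust; the text-based version is shown only because it is short and self-contained.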

HUMAN STUDY – PARTICIPANT SELECTION

- 215 fourth-year students and volunteers from the internet (crowdsourcing)
- Monetary reward given for completion to encourage best effort

Subset                 Average Accuracy   Number of Participants
All                    46.3%              65
Accuracy > 40%         55.2%              46
Experience > 4 years   51.5%              34
Experience = 4 years   46.7%              17
Experience < 4 years   33.4%              14

HUMAN STUDY – CODE SELECTION

- Five textbooks
- Three sets of code features to vary or control:
  - Syntax and Surface
  - Control flow and Contextual
  - Abstraction
- Provides similar concepts but differing presentations and/or implementations
- 45 Java files total

HUMAN STUDY – FAULT SEEDING

- Types and distribution based on Mozilla
  - All faults selected are limited to one line for simplicity
- Random seeding
  - Zero or one bug per file
  - Type chosen based on distribution
  - All possible sites enumerated and one is randomly chosen
  - Fault seeded manually, based on actual bugs if possible
- 20-line search-space windows
  - To further control for code length and facilitate quick and accurate response
  - Randomly chosen around the seeded fault location
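
The random parts of the seeding procedure can be sketched as follows: draw a fault type in proportion to an observed distribution, pick one candidate site uniformly, then place a window of up to 20 lines at a random offset that still contains the fault. This is a sketch under stated assumptions, not the study's tooling: the class name, method names, weights, and line numbers are all hypothetical.

```java
import java.util.Random;

/** Sketch of the random choices involved in seeding one fault. */
final class FaultSeeder {

    /** Picks an index with probability proportional to its weight. */
    static int pickWeighted(double[] weights, Random rng) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double r = rng.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) {
                return i;
            }
        }
        return weights.length - 1; // fallback for floating-point round-off
    }

    /** Picks one of the enumerated candidate lines uniformly at random. */
    static int pickSite(int[] candidateLines, Random rng) {
        return candidateLines[rng.nextInt(candidateLines.length)];
    }

    /**
     * Returns {firstLine, lastLine} (1-based, inclusive) of a window of up to
     * `size` lines that lies inside the file and contains faultLine.
     */
    static int[] window(int faultLine, int fileLines, int size, Random rng) {
        int length = Math.min(size, fileLines);
        int lowestStart = Math.max(1, faultLine - length + 1);
        int highestStart = Math.min(faultLine, fileLines - length + 1);
        int start = lowestStart + rng.nextInt(highestStart - lowestStart + 1);
        return new int[] { start, start + length - 1 };
    }

    public static void main(String[] args) {
        Random rng = new Random(2010);
        // Hypothetical types and weights; the real distribution came from sampled Mozilla bugs.
        String[] types = { "Missing statement", "Incorrect constant", "Negated conditional" };
        double[] weights = { 0.5, 0.3, 0.2 };
        int[] candidateLines = { 12, 27, 41, 58 }; // hypothetical sites for the chosen type

        String type = types[pickWeighted(weights, rng)];
        int line = pickSite(candidateLines, rng);
        int[] w = window(line, 80, 20, rng);
        System.out.println(type + " at line " + line + ", window " + w[0] + "-" + w[1]);
    }
}
```

Choosing the window start at random keeps the fault from always sitting at the same position in the window.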

HUMAN STUDY – PROTOCOL

- Each participant sees 30 consecutive files and is asked:
  - Is there a bug in this code?
  - If so, on what line does the bug occur?
  - How difficult do you feel this code is to understand (1-5)?
- Participants cannot execute or automatically search the code – only manual inspection is permitted
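
One way the first two questions could be turned into an accuracy figure is sketched below. The slides do not spell out the scoring rule, so the rule used here is an assumption: an answer counts as correct when the participant reports no bug for a fault-free file, or reports exactly the seeded line otherwise. The sentinel value and all names are hypothetical.

```java
/** Sketch: scoring participant answers against the seeded faults. */
final class ResponseScorer {

    /** Sentinel meaning "no bug" for both the ground truth and the answer. */
    static final int NO_BUG = -1;

    /** True when the answer matches the ground truth exactly (assumed rule). */
    static boolean isCorrect(int seededLine, int reportedLine) {
        return seededLine == reportedLine;
    }

    /** Fraction of files answered correctly. */
    static double accuracy(int[] seededLines, int[] reportedLines) {
        if (seededLines.length != reportedLines.length || seededLines.length == 0) {
            throw new IllegalArgumentException("need one answer per file");
        }
        int correct = 0;
        for (int i = 0; i < seededLines.length; i++) {
            if (isCorrect(seededLines[i], reportedLines[i])) {
                correct++;
            }
        }
        return (double) correct / seededLines.length;
    }

    public static void main(String[] args) {
        int[] seeded = { NO_BUG, 12, 7 };   // hypothetical ground truth for three files
        int[] reported = { NO_BUG, 12, 9 }; // hypothetical participant answers
        System.out.println("accuracy: " + accuracy(seeded, reported));
    }
}
```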

EVALUATION

- Three separate experiments
  1. Examines defect type as related to fault localization accuracy
     - Are certain bugs harder to find?
  2. Examines syntactical, contextual, and abstraction features as related to fault localization accuracy
     - Does our model correlate with actual human ability to locate faults better than existing baselines?
  3. Analysis of individual features
     - What features contribute the most towards humans’ ability to locate defects in source code?

EVALUATION – EXPERIMENT 1

- Goal: relate fault type to fault localization accuracy

EVALUATION – EXPERIMENT 2

- Goal: measure how accurately our model predicts the ease of human fault localization
- Two versions of our model
  - All features vs. only those that are measured automatically
- Baselines
  - Code readability (syntactic and surface features)
  - Cyclomatic complexity (contextual features)
  - “Textbook difficulty” (chapter number in the textbook)
- 10-fold cross validation to mitigate over-fitting
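
For readers unfamiliar with k-fold cross validation, the sketch below shows the partitioning step: shuffle the data indices and deal them into k disjoint test folds, so each item is held out exactly once while the model is trained on the other folds. The class name, method name, and seed are hypothetical, and using 45 items (the number of Java files) is only an illustrative choice, since the slides do not state the unit of the data set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: partitioning n data points into k disjoint test folds. */
final class CrossValidation {

    /** Shuffles indices 0..n-1 and deals them round-robin into k folds. */
    static List<List<Integer>> folds(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            result.get(i % k).add(indices.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(45, 10, 42L);
        for (int f = 0; f < folds.size(); f++) {
            // Train on every fold except f, then evaluate on fold f.
            System.out.println("test fold " + f + ": " + folds.get(f));
        }
    }
}
```

Because every prediction is made for an item the model was not trained on, the reported correlation is less likely to reflect over-fitting.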

EVALUATION – EXPERIMENT 2

- Our model greatly outperforms the baselines
- Automatic-only model does only slightly worse than the full model

EVALUATION – EXPERIMENT 2

- Perceived difficulty is a concrete measure of understandability
- Fault localization accuracy is correlated with understandability
- While baselines do comparatively better, our model correlates in a similar fashion

EVALUATION – FEATURE ANALYSIS

- ANOVA of features with respect to human accuracy

(type) - Feature                                    F       Pr(F)     Dir
abs – uses abstraction: array                       130.9   < 0.001   -
abs – provides abstraction: queue                   54.1    < 0.001   +
syn – ratio of constant to variable assignments     40.4    < 0.001   +
syn – avg block nesting level                       38.9    < 0.001   -
abs – provides abstraction: heap                    28.3    < 0.001   +
syn – max global variables                          25.6    < 0.001   +
abs – uses abstraction: linked list                 25.6    < 0.001   -
syn – ratio simple to constant conditional          20.6    < 0.001   -
cfg – max CFG out-edges per node                    10.0    0.002     -
cfg – avg CFG in-edges per node                     5.8     0.016     +
…
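
As a reminder of what an F value measures, the sketch below computes the textbook one-way ANOVA statistic: variance between groups divided by variance within groups. Applying it to a yes/no feature such as "uses abstraction: array" by splitting accuracies into two groups is an assumption made for illustration; the slides do not describe how the analysis was set up, and the numbers in the example are hypothetical.

```java
/** Sketch: textbook one-way ANOVA F statistic over groups of observations. */
final class OneWayAnova {

    /** F = (between-group sum of squares / (k-1)) / (within-group sum of squares / (n-k)). */
    static double fStatistic(double[][] groups) {
        int k = groups.length;
        int n = 0;
        double grandSum = 0.0;
        for (double[] group : groups) {
            if (group.length == 0) {
                throw new IllegalArgumentException("empty group");
            }
            for (double v : group) {
                grandSum += v;
                n++;
            }
        }
        if (k < 2 || n <= k) {
            throw new IllegalArgumentException("need at least two groups and n > k");
        }
        double grandMean = grandSum / n;

        double ssBetween = 0.0;
        double ssWithin = 0.0;
        for (double[] group : groups) {
            double mean = 0.0;
            for (double v : group) {
                mean += v;
            }
            mean /= group.length;
            ssBetween += group.length * (mean - grandMean) * (mean - grandMean);
            for (double v : group) {
                ssWithin += (v - mean) * (v - mean);
            }
        }
        return (ssBetween / (k - 1)) / (ssWithin / (n - k));
    }

    public static void main(String[] args) {
        // Hypothetical per-file accuracies, split by whether the file uses an array.
        double[] usesArray = { 0.30, 0.35, 0.40 };
        double[] noArray = { 0.50, 0.55, 0.65 };
        System.out.println("F = " + fStatistic(new double[][] { usesArray, noArray }));
    }
}
```

A larger F means the feature separates high-accuracy from low-accuracy files more cleanly relative to the noise within each group.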

CONCLUSION

- We present a human study of 65 participants based on concrete fault localization tasks
- We analyze the effect that the type of defect has on humans’ ability to locate faults
- Based on the source code, we analyze the correlation of surface, control flow, and abstraction features with humans’ ability to locate faults
- We present a model of human fault localization accuracy based on these features that correlates with human accuracy at least four times more strongly than corresponding baselines

Questions?

