
Page 1:

EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR

CLAIRE LE GOUES

SITE VISIT

FEBRUARY 7, 2013

Page 2:

“Benchmarks set standards for innovation, and can encourage or stifle it.”

-Blackburn et al.

Page 3:

2009: 15 papers on automatic program repair*

2011: Dagstuhl seminar on self-repairing programs

2012: 30 papers on automatic program repair*

2013: dedicated program repair track at ICSE

*manually reviewed the results of a search of the ACM digital library for “automatic program repair”

AUTOMATIC PROGRAM REPAIR OVER TIME

Page 4:

Manually sift through bugtraq data.

Indicative example: Axis project for automatically repairing concurrency bugs

• 9 weeks of sifting to find 8 bugs to study.
• Direct quote from Charles Zhang, senior author, on the process: “it's very painful.”

Very difficult to compare against previous or related work or generate sufficiently large datasets.

CURRENT APPROACH

Page 5:

GOAL: HIGH-QUALITY EMPIRICAL EVALUATION

Page 6:

SUBGOAL: HIGH-QUALITY BENCHMARK SUITE

Page 7:

Indicative of important real-world bugs, found systematically in open-source programs.

Support a variety of research objectives.

• “Latitudinal” studies: many different types of bugs and programs

• “Longitudinal” studies: many iterative bugs in one program.

Scientifically meaningful: passing test cases ⇒ repair.

Admit push-button, simple integration with tools like GenProg.

BENCHMARK REQUIREMENTS


Page 9:

http://genprog.cs.virginia.edu

Goal: a large set of important, reproducible bugs in non-trivial programs.

Approach: use historical data to approximate discovery and repair of bugs in the wild.

SYSTEMATIC BENCHMARK SELECTION

Page 10:

Indicative of important real-world bugs, found systematically in open-source programs:

• Add new programs to the set, with as wide a variety of types as possible (support “latitudinal” studies)

Support a variety of research objectives:

• Allow studies of iterative bugs, development, and repair: generate a very large (100) set of bugs in one program (php) (support “longitudinal” studies).

NEW BUGS, NEW PROGRAMS

Page 11:

Program     LOC        Tests   Bugs  Description
fbc            97,000     773     3  Language (legacy)
gmp           145,000     146     2  Multiple precision math
gzip          491,000      12     5  Data compression
libtiff        77,000      78    24  Image manipulation
lighttpd       62,000     295     9  Web server
php         1,046,000  11,995   100  Language (web)
python        407,000     355    11  Language (general)
wireshark   2,814,000      63     7  Network packet analyzer
valgrind      711,000     595     2  Simulator and debugger
vlc           522,000      17    ??  Media player
svn           629,000   1,748    ??  Source control
Total       7,001,000  16,077   163

Page 12:

Indicative of important real-world bugs, found systematically in open-source programs.

Support a variety of research objectives.

• “Latitudinal” studies: many different types of bugs and programs

• “Longitudinal” studies: many iterative bugs in one program.

Scientifically meaningful: passing test cases ⇒ repair.

Admit push-button, simple integration with tools like GenProg.

BENCHMARK REQUIREMENTS

Page 13:

They must exist.

• Sometimes, but not always, true (see: Jonathan Dorn)

TEST CASE CHALLENGES

Page 14:

Program     LOC        Tests   Bugs  Description
fbc            97,000     773     3  Language (legacy)
gmp           145,000     146     2  Multiple precision math
gzip          491,000      12     5  Data compression
libtiff        77,000      78    24  Image manipulation
lighttpd       62,000     295     9  Web server
php         1,046,000  11,995   100  Language (web)
python        407,000     355    11  Language (general)
wireshark   2,814,000      63     7  Network packet analyzer
valgrind      711,000     595     2  Simulator and debugger
Total       5,850,000  14,312   163

BENCHMARKS

Page 15:

They must exist.

• Sometimes, but not always, true (see: Jonathan Dorn)

They should be of high quality.

• This has been a challenge from day 0: nullhttpd.
• Lincoln Labs noticed it too: sort.
• In both cases, adding test cases led to better repairs.

TEST CASE CHALLENGES

Page 16:

They must exist.

• Sometimes, but not always, true (see: Jonathan Dorn)

They should be of high quality.

• This has been a challenge from day 0: nullhttpd.
• Lincoln Labs noticed it too: sort.
• In both cases, adding test cases led to better repairs.

They must be automated to run one at a time, programmatically, from within another framework.

TEST CASE CHALLENGES

Page 17:

Need to be able to compile and run new variants programmatically.

Need to be able to run test cases one at a time.

• It’s not simple, and it becomes increasingly tricky as we scale up to real-world systems.

• Much of the challenge is unrelated to the program in question, instead requiring highly technical knowledge of OS-level details.

PUSH-BUTTON INTEGRATION

Page 18:

Calling a process from within another process:

• system("run test 1"); ...; wait()

wait() returns the process exit status.

This is complex.

• Example: a system call can fail because the OS ran out of memory in creating the process, or because the process itself ran out of memory.

How do we tell the difference?

• Answer: bit masking
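As a concrete illustration, here is a minimal sketch in C of pulling these cases apart with the POSIX bit-mask macros. This is not GenProg's actual harness, and the test-script name is invented:

    /* Minimal sketch of decoding a test's status with the POSIX
     * bit-mask macros. The test script name is hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    int main(void) {
        int status = system("./run_test_1.sh");
        if (status == -1) {
            perror("system");  /* the OS failed to create the child process */
        } else if (WIFEXITED(status)) {
            /* WEXITSTATUS masks out the byte holding the child's exit code */
            printf("test exited with code %d\n", WEXITSTATUS(status));
        } else if (WIFSIGNALED(status)) {
            /* the child itself died, e.g., out of memory or a segfault */
            printf("test killed by signal %d\n", WTERMSIG(status));
        }
        return 0;
    }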

DIGRESSION ON WAIT()

Page 19:

Moral: integration is tricky, and lends itself to human mistakes.

Possibility 1: original programmers make mistakes in developing the test suite.

• Test cases can have bugs, too.

Possibility 2: we (GenProg devs/users) make mistakes in integration.

• A few old php test cases are not up to our standards: faulty bit-shift math for extracting the return-value components.
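For illustration only (invented code, not the actual php-harness bug), the kind of bit-shift slip in question: decoding the status with a bare shift silently reports a crashed test as a passing one:

    /* Shifting unconditionally conflates "killed by a signal" with
     * "exited successfully". Illustrative code, not the php harness. */
    #include <stdio.h>
    #include <sys/wait.h>

    int exit_code_wrong(int status) {
        return status >> 8;  /* garbage when the test crashed */
    }

    int exit_code_right(int status) {
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(void) {
        int status = 11;  /* simulated status: child killed by signal 11 */
        /* wrong: reports exit code 0 (looks like a pass!); right: -1 */
        printf("wrong: %d  right: %d\n",
               exit_code_wrong(status), exit_code_right(status));
        return 0;
    }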

REAL-WORLD COMPLEXITY

Page 20:

Interested in more and better benchmark design, with easy integration (avoiding gnarly OS details).

• Virtual machines provide one approach.

Need a better definition of “high-quality test case” vs. “low-quality test case”:

• Can the empty program pass it?
• Can every program pass it?
• Can the “always crashes” program pass it?
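As a hedged sketch of how such checks could be automated (the script and variant names below are invented, not part of any existing tool):

    /* Hypothetical sanity check for test quality: a test is suspect
     * if a degenerate program can pass it. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    /* Run "./test.sh <variant>" and report whether the test passed. */
    static int passes(const char *variant) {
        char cmd[256];
        snprintf(cmd, sizeof cmd, "./test.sh %s", variant);
        int status = system(cmd);
        return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
    }

    int main(void) {
        /* A high-quality test should fail both degenerate variants. */
        if (passes("./empty_program"))  puts("suspect: the empty program passes");
        if (passes("./always_crashes")) puts("suspect: a crashing program passes");
        return 0;
    }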

INTEGRATION CONCERNS

Page 21:

Over the past year, we have conducted studies of representation and operators for automatic program repair:

• One-point crossover on patch representation.
• Non-uniform mutation operator selection.
• Alternative fault localization framework.

Results on the next slide incorporate “all the bells and whistles”:

• Improvements based on those large-scale studies.
• Manually confirmed quality of testing framework.

CURRENT REPAIR SUCCESS

Page 22:

CURRENT REPAIR SUCCESS

Program     Previous Results   Current Results
fbc         1/3                1/3
gmp         1/2                1/2
gzip        1/5                1/5
libtiff     17/24              17/24
lighttpd    5/9                5/9
php         28/44              55/100
python      1/11               2/11
wireshark   1/7                4/7
valgrind    ---                1/2
Total       55/105             87/163

Page 23:

TRANSITION

Page 24:

REPAIR TEMPLATES

CLAIRE LE GOUES

SHIRLEY PARK

DARPA SITE VISIT

FEBRUARY 7, 2013

Page 25:

BIO + CS INTERACTION


Page 26:

Immune response is equally fast for large and small animals.

• The human lung is ~100x larger than the mouse lung, yet the immune system still finds influenza infections in ~8 hours.

• Successfully balances local search and global response.

Balance between generic and specialized T-cells:

• Rapid response to new pathogens vs. long-term memory of previous infections (cf. vaccines).

IMMUNOLOGY: T-CELLS


Page 27:

[Diagram: the repair loop. An INPUT program is repeatedly MUTATEd; each variant's FITNESS is EVALUATEd; variants are ACCEPTed or DISCARDed; the loop ends with an OUTPUT repair.]
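To make the loop concrete, here is a minimal sketch in C. It is a deliberately simplified (1+1) hill climber, not GenProg itself, which is population-based and implemented in OCaml; the Variant type and the fitness/mutate stubs are invented for illustration:

    /* Minimal sketch of the mutate -> evaluate -> accept/discard loop.
     * All types and helpers are invented stand-ins, not GenProg's API. */
    #include <stdlib.h>

    typedef struct { int edits[16]; int n; } Variant;  /* a candidate patch */

    /* Stand-in for "compile the variant, run the weighted test suite". */
    static double fitness(const Variant *v) {
        (void)v;
        return (double)rand() / RAND_MAX;
    }

    /* Stand-in for applying INSERT/DELETE/REPLACE or a repair template. */
    static Variant mutate(Variant v) {
        if (v.n < 16) v.edits[v.n++] = rand() % 1000;
        return v;
    }

    int main(void) {
        Variant best = { {0}, 0 };                 /* INPUT: the buggy program */
        double best_fit = fitness(&best);
        for (int i = 0; i < 1000; i++) {
            Variant candidate = mutate(best);      /* MUTATE */
            double f = fitness(&candidate);        /* EVALUATE FITNESS */
            if (f > best_fit) {                    /* ACCEPT ... */
                best = candidate;
                best_fit = f;
            }                                      /* ... else DISCARD */
        }
        return 0;                                  /* OUTPUT: best variant */
    }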

Page 28:

Tradeoff between generic mutation actions and more specific action templates:

• Generic: INSERT, DELETE, REPLACE
• Specific:

    if (<x> != NULL) { <code using x> }
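A toy instantiation of that specific template, with invented function and variable names (not from the slides):

    /* Toy instantiation of the null-check template. */
    #include <stdio.h>
    #include <string.h>

    /* Before: crashes when s is NULL. */
    size_t len_buggy(const char *s) { return strlen(s); }

    /* After: the template "if (<x> != NULL) { <code using x> }" applied
     * with x = s around the faulty statement. */
    size_t len_repaired(const char *s) {
        if (s != NULL) {
            return strlen(s);
        }
        return 0;
    }

    int main(void) {
        printf("%zu\n", len_buggy("hi"));      /* 2 */
        printf("%zu\n", len_repaired(NULL));   /* 0, instead of a crash */
        return 0;
    }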

AUTOMATIC SOFTWARE REPAIR


Page 29:


HYPOTHESIS: GENPROG CAN REPAIR MORE BUGS, AND REPAIR BUGS MORE QUICKLY, IF WE AUGMENT MUTATION ACTIONS WITH “REPAIR TEMPLATES.”

Page 30:


Insight: Just like T-cells “remember” previous infections, abstract previous fixes to generate new mutations.

Approach:

• Model previous changes using structured documentation.
• Cluster a large set of changes by similarity.
• Abstract the center of each cluster.

Example:

    if (<x> < 0)
        return 0;
    else
        <code using x>

OPTION 1: PREVIOUS CHANGES

Page 31:


Insight: as when looking something up in a library, existing code provides the best examples of the behavior we want to reproduce.

Approach:

• Generate static paths through C programs.
• Mine API usage patterns from those paths.
• Abstract the patterns into mutation templates.

Example:

    while (it.hasNext())
        <code using it.next()>
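A C-flavored analogue of the same idea (hypothetical; the slide's example is Java-style): a miner that sees many fopen → check → use → fclose paths could abstract them into a reusable template:

    /* Template: f = fopen(<path>, <mode>);
     *           if (f != NULL) { <code using f>; fclose(f); } */
    #include <stdio.h>

    int count_lines(const char *path) {
        int lines = 0;
        FILE *f = fopen(path, "r");
        if (f != NULL) {
            for (int c; (c = fgetc(f)) != EOF; )
                if (c == '\n')
                    lines++;
            fclose(f);
        }
        return lines;
    }

    int main(void) {
        printf("%d\n", count_lines("example.txt"));  /* 0 if the file is absent */
        return 0;
    }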

OPTION 2: EXISTING BEHAVIOR

Page 32:


THIS WORK IS ONGOING.

Page 33:


We are generating a benchmark suite to support GenProg research, integration and tech transfer, and the automatic repair community at large.

Current GenProg results for the 12-hour repair scenario: 87/163 (53%) of the real-world bugs in the dataset repaired.

Repair templates will augment GenProg’s mutation operators to help repair more bugs, and repair bugs more quickly.

CONCLUSIONS