Personalized Defect Prediction

DESCRIPTION
Tian's ASE 2013 presentation

TRANSCRIPT

Personalized Defect Prediction
Tian Jiang, Lin Tan (University of Waterloo)
Sunghun Kim (Hong Kong University of Science and Technology)
1
How to Find Bugs?
• Code Review
• Testing
• Static Analysis
• Dynamic Analysis
• Verification
• Defect Prediction
2
Defect Prediction
[Diagram: Software History → Predictor → Future Defects]
3
Developers are Different
[Chart: % of buggy changes (0-80%) for developers A, B, C, and D vs. the average, for changes involving modulo (%), for, bitwise OR, and continue; Linux Kernel, 2005-2010]
Personalized models can improve performance.
4
Successes in Other Fields
• Google personalized search
• Facebook personalized ad placement
5
Contributions
• Personalized Change Classification (PCC)
  ✦ One model for each developer
• Confidence-based Hybrid PCC (PCC+)
  ✦ Picks predictions with highest confidence
• Evaluate on six C and Java projects
  ✦ Find up to 155 more bugs by inspecting 20% LOC
  ✦ Improve F1 by up to 0.08
6
What is a Change?
[Diagram: a commit (09a02f..., author: John Smith, message: "I submitted some code.") modifies file1.c, file2.c, and file3.c; the commit is split into Change 1, Change 2, and Change 3, one change per file]
Change-Level: Inspect less code to locate a bug.
7
Change Classification (CC)
[Diagram: Software History → Training Instances → Features → Classification Algorithm → Model → predictions for Future Instances]
Training Phase:
1. Label changes with clean or buggy
2. Extract features
3. Build prediction model
Prediction Phase:
4. Predict
8
Label Clean or Buggy
[Sliwerski et al. ’05]
[Diagram: in the revision history, a bug-fixing change (commit 1da57..., message: "I fixed a bug") changes "if (i < 128)" to "if (i <= 128)" in fileA.c. A change is bug-fixing if its message contains the keyword "fix", or the ID of a manually verified bug report [Herzig et al. ’13]. Running git blame on the fixed line traces it back to an earlier change (commit 7a3bc..., message: "new feature") that introduced "if (i < 128)"; that change, fixed by a later change, is labelled buggy.]
9
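The labelling step can be sketched in Python. This is a minimal illustration of the keyword heuristic and the git blame lookup, not the implementation used in the paper; the regex and the porcelain parsing are simplifying assumptions.

```python
import re
import subprocess

def is_bug_fixing(message: str) -> bool:
    # Keyword heuristic [Sliwerski et al. '05]: a commit whose message
    # mentions a fix is treated as a bug-fixing change.
    return re.search(r"\bfix(ed|es)?\b", message, re.IGNORECASE) is not None

def buggy_change_for(repo: str, path: str, line_no: int, fix_commit: str) -> str:
    # git blame on the parent of the fixing commit reports which earlier
    # change last touched the fixed line; that change is labelled buggy.
    out = subprocess.check_output(
        ["git", "-C", repo, "blame", "--porcelain",
         "-L", f"{line_no},{line_no}", f"{fix_commit}^", "--", path],
        text=True)
    return out.split()[0]  # first token of the porcelain output is the SHA

print(is_bug_fixing("I fixed a bug"))  # True
print(is_bug_fixing("new feature"))    # False
```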
Three Types of Features
• Metadata
• Bag-of-Words
• Characteristic Vector
10
Characteristic Vector
Count Abstract Syntax Tree (AST) nodes

for (...; ...; ...) {
  for (...; ...; ...) {
    if (...) ...;
  }
}

for: 2, if: 1, while: 0, ...
11
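The same idea can be sketched with Python's standard ast module. The node types and the code fragment here are illustrative; the original work computes characteristic vectors for C and Java code.

```python
import ast
from collections import Counter

def characteristic_vector(source, node_types=("For", "If", "While")):
    # Parse the fragment and count how often each AST node type occurs.
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    return [counts[t] for t in node_types]

code = """
for i in range(10):
    for j in range(10):
        if i < j:
            pass
"""
print(characteristic_vector(code))  # [2, 1, 0]: two for loops, one if, no while
```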
CC: Training
[Diagram: Training Instances → Model]
12

CC: Prediction
[Diagram: Unlabeled Changes → Model → Predicted Changes]
13
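A toy end-to-end sketch of CC: a 1-nearest-neighbour rule stands in for the classifiers the paper evaluates (decision tree, Naive Bayes, logistic regression), and the feature vectors and labels are invented for illustration.

```python
def train(instances):
    # "Training" a 1-NN model is just storing the labelled changes.
    return list(instances)

def predict(model, features):
    # Classify a future change by the label of its closest training change.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda inst: dist(inst[0], features))[1]

history = [([2, 1, 0], "buggy"), ([0, 3, 1], "clean"),
           ([5, 0, 0], "buggy"), ([1, 1, 2], "clean")]
model = train(history)
print(predict(model, [2, 1, 1]))  # "buggy": nearest neighbour is [2, 1, 0]
```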
PCC: Training
[Diagram: Training Instances are grouped by developer (Dev 1, Dev 2, Dev 3), and one model is trained per developer (Model 1, Model 2, Model 3)]
14

PCC: Prediction
[Diagram: the model matching the change's developer is chosen (e.g. Model 2 for Dev 2) and makes the prediction]
15
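PCC can be sketched on top of any base classifier: group the labelled changes by developer, train one model per developer, and dispatch each change to its author's model. The majority-class "model" below is a deliberately trivial stand-in (it ignores the features), and the data is invented.

```python
from collections import defaultdict

def train(instances):
    # Trivial stand-in model: predict the developer's majority class.
    labels = [label for _, label in instances]
    return max(set(labels), key=labels.count)

def pcc_train(history):
    # history: (developer, features, label) triples.
    by_dev = defaultdict(list)
    for dev, features, label in history:
        by_dev[dev].append((features, label))
    return {dev: train(insts) for dev, insts in by_dev.items()}

def pcc_predict(models, dev, features):
    # Choose the model by developer; the stand-in ignores the features.
    return models[dev]

history = [("alice", [2, 1], "buggy"), ("alice", [3, 0], "buggy"),
           ("bob", [1, 1], "clean"), ("bob", [0, 2], "clean")]
models = pcc_train(history)
print(pcc_predict(models, "alice", [2, 0]))  # "buggy": alice's model answers
```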
PCC+: Prediction
[Diagram: each change is fed to all models (the CC model and the PCC model); a Combiner selects the final prediction]
Feed Changes to All Models
16
Confidence Measure
• Bugginess
  ✦ Probability of a change being buggy
• Confidence Measure
  ✦ Comparable measure of confidence
• Select the prediction with the highest confidence.
17
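The combiner can be sketched as follows. Treating a bugginess probability's distance from 0.5 as the comparable confidence is a simplifying assumption (the paper's measure may differ), and the probabilities are toy values.

```python
def confidence(p_buggy):
    # A bugginess probability near 0 or 1 is a confident prediction;
    # near 0.5 the model is unsure.
    return abs(p_buggy - 0.5)

def combine(bugginess_by_model):
    # Pick the model with the highest confidence and use its prediction.
    name, p = max(bugginess_by_model.items(), key=lambda kv: confidence(kv[1]))
    return "buggy" if p >= 0.5 else "clean"

print(combine({"CC": 0.55, "PCC": 0.10}))  # "clean": PCC is more confident
print(combine({"CC": 0.95, "PCC": 0.60}))  # "buggy": CC is more confident
```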
Research Questions
• RQ1: Do PCC and PCC+ outperform CC?
• RQ2: Does PCC outperform CC in other setups?
  ✦ Classification algorithms
  ✦ Sizes of training sets
18
Two Metrics
• F1-Score
  ✦ Harmonic mean of precision and recall
• Cost Effectiveness
  ✦ Relevant in cost-sensitive scenarios
  ✦ NofB20: number of bugs discovered by inspecting the top 20% of lines of code
19
Cost Effectiveness
[Table: predicted-buggy changes ranked for inspection, with cumulative % of the 100 total LOC]

Cumulative LOC | Change   | LOC
10%            | Buggy #1 | 10
15%            | Buggy #2 | 5
19%            | Buggy #3 | 4
27%            | Buggy #4 | 8
               | Buggy #5 | 12
...            | ...      | ...
Total          |          | 100
20

Buggy #1, #2, and #3 are true bugs and fall within the top 20% of LOC, so NofB20 = 3.
21
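NofB20 can be computed as follows; the ranking by predicted bugginess is assumed to have been done already, and the values reproduce the slide's worked example (100 total LOC, first three predicted changes are true bugs).

```python
def nofb20(ranked_changes, total_loc, budget=0.20):
    # ranked_changes: (loc, is_true_bug) pairs sorted by predicted
    # bugginess. Count the true bugs found before the inspected LOC
    # would exceed the budget (20% of the total by default).
    inspected = bugs = 0
    for loc, is_true_bug in ranked_changes:
        if inspected + loc > budget * total_loc:
            break
        inspected += loc
        bugs += is_true_bug
    return bugs

ranked = [(10, True), (5, True), (4, True), (8, False), (12, False)]
print(nofb20(ranked, total_loc=100))  # 3
```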
Test Subjects

Project      | Language | LOC  | # of Changes
Linux kernel | C        | 7.3M | 429K
PostgreSQL   | C        | 289K | 89K
Xorg         | C        | 1.1M | 46K
Eclipse      | Java     | 1.5M | 73K
Lucene*      | Java     | 828K | 76K
Jackrabbit*  | Java     | 589K | 61K

* With manually labelled bug report data [Herzig et al. ’13]
22
PCC/PCC+ vs. CC (Decision Tree, NofB20)

Project    | CC  | PCC | Delta | PCC+ | Delta
Linux      | 160 | 179 | +19   | 172  | +12
PostgreSQL | 55  | 210 | +155  | 175  | +120
Xorg       | 96  | 159 | +63   | 161  | +65
Eclipse    | 116 | 207 | +91   | 200  | +84
Lucene     | 177 | 254 | +77   | 257  | +80
Jackrabbit | 411 | 449 | +38   | 459  | +48
Average    | -   | -   | +74   | -    | +68

Statistically significant deltas are in bold.
23
PCC and PCC+ outperform CC.
24
Different Classification Algorithms (NofB20)

           | Naive Bayes       | Logistic Regression
Project    | CC  | PCC | Delta | CC  | PCC | Delta
Linux      | 138 | 147 | +9    | 102 | 137 | +35
PostgreSQL | 89  | 113 | +24   | 46  | 56  | +10
Xorg       | 84  | 101 | +17   | 52  | 29  | -23
Eclipse    | 65  | 108 | +43   | 54  | 55  | +1
Lucene     | 152 | 139 | -13   | 30  | 200 | +170
Jackrabbit | 420 | 414 | -6    | 261 | 370 | +109
Average    | -   | -   | +12   | -   | -   | +59

Statistically significant deltas are in bold.
25
Different Training Set Sizes
[Chart: NofB20 (y-axis, 100-300) vs. training set size per developer (x-axis, 10-90), comparing PCC and CC]
26
The improvement persists in other setups.
27
Related Work
• Kim et al., Classifying software changes: Clean or buggy?, TSE ’08
• Bettenburg et al., Think locally, act globally: Improving defect and effort prediction models, MSR ’12
28
Conclusions & Future Work
• PCC and PCC+ improve prediction performance.
• The improvement persists in other setups.
• The personalized approach can be applied to other fields.
  ✦ Recommendation systems
  ✦ Vulnerability prediction
  ✦ Top-crash prediction
29