Micro Interaction Metrics for Defect Prediction (ESEC/FSE 2011)
Micro Interaction Metrics for Defect Prediction
Taek Lee, Jaechang Nam, Dongyun Han, Sunghun Kim, Hoh Peter In
FSE 2011, Hungary, Sep. 5-9
Outline
• Research motivation
• The existing metrics
• The proposed metrics
• Experiment results
• Threats to validity
• Conclusion
Defect Prediction: why is it necessary?
Software quality assurance is inherently a resource-constrained activity!
Predicting defect-prone software entities* helps put the best labor effort on those entities
* functions or code files
Indicators of defects
• Complexity of source code (Chidamber and Kemerer 1994)
• Frequent code changes (Moser et al. 2008)
• Previous defect information (Kim et al. 2007)
• Code dependencies (Zimmermann 2007)
Indeed, where do defects come from?
Human Error!
Programmers make mistakes; consequently, defects are injected and software fails.
Human Errors → Bugs Injected → Software Fails
Programmer Interaction and Software Quality
“Errors are from cognitive breakdown while understanding and implementing
requirements”
- Ko et al. 2005
“Work interruptions or task switching may affect programmer productivity”
- DeLine et al. 2006
Don’t we need to also consider
developers’ interactions
as defect indicators?
…, but the existing indicators
can NOT directly capture
developers’ interactions
Using Mylyn data, we propose novel
“Micro Interaction Metrics (MIMs)”
capturing developers’ interactions
The Mylyn* data is stored as an attachment to the corresponding bug report in XML format
* an Eclipse plug-in for storing and recovering task contexts
<InteractionEvent … Kind=“ ” … StartDate=“ ” EndDate=“ ” … StructureHandle=“ ” … Interest=“ ” … >
(one <InteractionEvent> element per recorded interaction)
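For illustration, a minimal sketch of reading these InteractionEvent entries is shown below; the attribute names follow the slide, and the actual Mylyn attachment schema may contain additional fields.

```python
# Minimal sketch (not the authors' tooling): read InteractionEvent entries
# from a Mylyn task-context XML attachment. Attribute names follow the slide;
# the real Mylyn schema may differ or carry more fields.
import xml.etree.ElementTree as ET

def load_interaction_events(path):
    """Return one dict per <InteractionEvent> element, in document order."""
    events = []
    for ev in ET.parse(path).getroot().iter("InteractionEvent"):
        events.append({
            "kind": ev.get("Kind"),                    # e.g., selection, edit
            "start": ev.get("StartDate"),
            "end": ev.get("EndDate"),
            "handle": ev.get("StructureHandle"),       # touched file/element
            "interest": float(ev.get("Interest", 0)),  # degree of interest (DOI)
        })
    return events
```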
Two levels of MIMs Design
File-level MIMs: specific interactions for a file in a task (e.g., AvgTimeIntervalEditEdit)
Task-level MIMs: property values shared over the whole task (e.g., TimeSpent)
Mylyn Task Logs (example):
10:30  Selection  file A
11:00  Edit       file B
12:30  Edit       file B
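To make the file-level example concrete, here is a hedged sketch of how AvgTimeIntervalEditEdit might be computed from the event list of one task; the exact definition and time units in the paper may differ.

```python
from datetime import datetime

# Hypothetical timestamp format; actual Mylyn logs may encode dates differently.
TIME_FMT = "%Y-%m-%d %H:%M"

def avg_time_interval_edit_edit(events, file_handle):
    """Average gap (minutes) between consecutive Edit events on one file within
    one task; a sketch of the file-level MIM named above, not the paper's code."""
    edits = [e for e in events
             if e["kind"] == "edit" and e["handle"] == file_handle]
    if len(edits) < 2:
        return 0.0
    times = [datetime.strptime(e["start"], TIME_FMT) for e in edits]
    gaps = [(b - a).total_seconds() / 60.0 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)
```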
The Proposed Micro Interaction Metrics
For example, NumPatternSXEY captures this interaction:
“How many times did a programmer Select a file of group X and then Edit a file of group Y in a task activity?”
group X or Y: X if a file shows defect locality* properties, Y otherwise
group H or L: H if a file has a high** DOI value, L otherwise
* hinted by the paper [Kim et al. 2007]  ** threshold: median of degree of interest (DOI) values in a task
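A hedged sketch of how such a pattern could be counted over the ordered event stream of one task follows; the grouping function is a placeholder, and the paper's exact matching rules (e.g., whether intervening events are allowed) may differ.

```python
def num_pattern_select_x_edit_y(events, group_of):
    """Count 'Select a group-X file, then Edit a group-Y file' occurrences in
    one task. `group_of(handle)` is a placeholder returning 'X' for files with
    defect-locality properties and 'Y' otherwise."""
    count = 0
    for prev, curr in zip(events, events[1:]):
        if (prev["kind"] == "selection" and group_of(prev["handle"]) == "X"
                and curr["kind"] == "edit" and group_of(curr["handle"]) == "Y"):
            count += 1
    return count
```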
Bug Prediction Process
STEP1: Counting & Labeling Instances
All the Mylyn task data collectable from Eclipse subprojects (Dec 2005 ~ Sep 2010)
[Timeline figure: Dec 2005 to Sep 2010, split at time P. Task 1, Task 2, Task 3, ..., Task i (touching f1.java, f2.java, f3.java) fall before P; Task i+1, Task i+2, Task i+3, ... fall in the post-defect counting period after P]
The number of counted post-defects (edited files only within bug-fixing tasks): f1.java = 1, f2.java = 1, f3.java = 2, ...
Labeling rule for the file instance: “buggy” (if # of post-defects > 0), “clean” (if # of post-defects = 0)
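A minimal sketch of this counting-and-labeling rule, assuming tasks are available as (start date, is-bug-fix flag, edited files) tuples (a hypothetical representation, not the paper's data structures):

```python
from collections import Counter

def count_post_defects(tasks, split_point):
    """Each file edited within a bug-fixing task after the split point P
    gains one post-defect."""
    counts = Counter()
    for start, is_bug_fix, edited_files in tasks:
        if start > split_point and is_bug_fix:
            counts.update(edited_files)
    return counts

def label(counts, file_name):
    """'buggy' if the file has any post-defect, otherwise 'clean'."""
    return "buggy" if counts[file_name] > 0 else "clean"
```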
STEP2: Extraction of MIMs
[Timeline figure: the metrics extraction period runs from Dec 2005 up to time P and covers Task 1 ~ Task 4, ...; Task 1 edits f3.java, Task 2 edits f1.java, Task 3 edits f2.java, Task 4 edits f1.java and f2.java]
Metrics Computation: a MIM value is computed per file per task and then averaged over the tasks that touched the file:
MIMf3.java = valueTask1
MIMf1.java = (valueTask2 + valueTask4) / 2
MIMf2.java = (valueTask3 + valueTask4) / 2
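The averaging step can be sketched as follows, assuming per-task MIM values are kept as a mapping from task to {file: value} (a hypothetical layout chosen for illustration):

```python
from collections import defaultdict

def aggregate_mim(per_task_values):
    """Average one MIM over all tasks that touched each file, matching the
    example above (e.g., MIMf1.java = (valueTask2 + valueTask4) / 2)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for task_values in per_task_values.values():
        for file_name, value in task_values.items():
            sums[file_name] += value
            counts[file_name] += 1
    return {f: sums[f] / counts[f] for f in sums}
```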
Understand JAVA tool was used for extracting 32 source Code Metrics (CMs)*
* Chidamber and Kemerer, and OO metrics
[Timeline figure: CMs are computed from the last CVS revision before time P]
[Table: list of selected source code metrics]
Fifteen History Metrics (HMs)* were collected from the corresponding CVS repository
* Moser et al.
[Timeline figure: HMs are computed from the CVS revisions up to time P]
[Table: list of history metrics (HMs)]
STEP3: Creating a training corpus
[Table: each file instance pairs its Instance Name and extracted MIMs with a Label (for training a classifier) or with the # of post-defects (for training a regression model)]
STEP4: Building prediction models
Classification and Regression
modeling with different machine
learning algorithms using the
WEKA* tool
* an open source data mining tool
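The models themselves were built with WEKA; purely as an illustration of the classification setup, here is a sketch that uses scikit-learn as a stand-in, with a random forest as one example algorithm.

```python
# Illustrative only: the paper used WEKA; scikit-learn stands in here.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

def train_and_evaluate(X, y):
    """X: one MIM feature vector per file instance; y: 'buggy'/'clean' labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "precision(B)": precision_score(y_te, pred, pos_label="buggy"),
        "recall(B)": recall_score(y_te, pred, pos_label="buggy"),
        "f-measure(B)": f1_score(y_te, pred, pos_label="buggy"),
    }
```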
STEP5: Prediction Evaluation
Classification Measures
• Precision(B): how many instances are really buggy among the buggy-predicted outcomes?
• Recall(B): how many instances are correctly predicted as ‘buggy’ among the really buggy ones?
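In standard terms, with TP, FP, and FN denoting true positives, false positives, and false negatives for the buggy class: Precision(B) = TP / (TP + FP), Recall(B) = TP / (TP + FN), and F-measure(B) = 2 · Precision · Recall / (Precision + Recall).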
STEP5: Prediction Evaluation
Regression Measures (between # of real buggy instances and # of instances predicted as buggy):
• correlation coefficient (-1 ~ 1)
• mean absolute error (0 ~ 1)
• root square error (0 ~ 1)
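As a hedged sketch of these measures (NumPy used here for illustration; the paper relied on the values reported by WEKA):

```python
import numpy as np

def regression_measures(actual, predicted):
    """Correlation coefficient and mean absolute error between the actual and
    the predicted post-defect counts; a sketch, not WEKA's exact computation."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    corr = np.corrcoef(actual, predicted)[0, 1]   # ranges over -1 ~ 1
    mae = np.abs(actual - predicted).mean()
    return corr, mae
```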
T-test with 100 runs of 10-fold cross validation:
Reject H0* and accept H1* if p-value < 0.05 (at the 95% confidence level)
* H0: no difference in average performance, H1: different (better!)
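A minimal sketch of this test, assuming the per-run scores of two models (e.g., F-measures over the 100 x 10-fold runs) are collected into two lists (SciPy used for illustration):

```python
from scipy import stats

def significantly_better(scores_a, scores_b, alpha=0.05):
    """Reject H0 (no difference in average performance) and accept H1 when
    p-value < alpha and model A's mean score is the higher one."""
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    return p_value < alpha and t_stat > 0
```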
Result Summary: MIMs improve prediction accuracy for
1. different Eclipse project subjects
2. different machine learning algorithms
3. different model training periods
[Table: file instances and % of defects]
Prediction for different project subjects
MIM: the proposed metrics CM: source code metrics HM: history metrics
BASELINE: Dummy Classifier
predicts in a purely random manner
e.g., for 12.5% of buggy instances
Precision(B)=12.5%, Recall(B)=50%
F-measure(B)=20%
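This follows directly: a uniformly random classifier predicts “buggy” for half of all instances, so Recall(B) = 50%; its Precision(B) equals the 12.5% base rate of buggy instances; and F-measure(B) = 2 · 0.125 · 0.5 / (0.125 + 0.5) = 0.20, i.e., 20%.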
Prediction for different project subjects
T-test results (significant figures are in bold, p-value < 0.05)
Prediction with different algorithms
T-test results (significant figures are in bold, p-value < 0.05)
Prediction in different training periods
[Timeline: Dec 2005 ~ Sep 2010 is split at time P into a model training period and a model testing period, with splits of 50% : 50%, 70% : 30%, and 80% : 20%]
T-test results (significant figures are in bold, p-value < 0.05)
The top 42 metrics (37%) were MIMs, out of the 113 total metrics (MIMs + CMs + HMs)
Possible Insight
TOP 1: NumLowDOIEdit  TOP 2: NumPatternEXSX  TOP 3: TimeSpentOnEdit
Chances are that more defects are generated if a programmer (TOP 2) repeatedly edits and browses a file, especially one related to previous defects, (TOP 3) spends more of the working time on editing, and (TOP 1) especially when editing files that were accessed less frequently or less recently.
Performance comparison
with regression modeling
for predicting # of post-defects
Predicting Post-Defect Numbers
T-test results (significant figures are in bold, p-value < 0.05)
Threats to Validity
• Systems examined might not be representative
• Systems are all open source projects
• Defect information might be biased
Conclusion
• Our findings exemplify that developers’ interactions can affect software quality
• Our proposed micro interaction metrics significantly improve defect prediction accuracy
• We believe future defect prediction models will use more of developers’ direct, micro-level interaction information; MIMs are a first step towards that
Thank you! Any Questions?
• Problem – Can developers’ interaction information affect software quality (defects)?
• Approach – We proposed novel micro interaction metrics (MIMs) to overcome the limits of the popular static metrics
• Result – MIMs significantly improve prediction accuracy compared to source code metrics (CMs) and history metrics (HMs)
Backup Slides
One possible ARGUMENT: some developers may not have used Mylyn to fix bugs.
This leaves a chance of error in counting post-defects and, as a result, biased labels (i.e., an incorrect % of buggy instances).
We repeated the experiment using the same instances but with a different defect-counting heuristic, a CVS-log-based approach* (sketched below).
* with keywords: “fix”, “bug”, “bug report ID” in change logs
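A hedged sketch of that keyword heuristic; the exact keyword list and matching rules used in the repeated experiment may differ.

```python
import re

# Flag a CVS change-log message as a bug fix if it mentions "fix", "bug",
# or something that looks like a bug report ID (illustrative pattern only).
BUG_FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bugs?|\d{4,})\b", re.IGNORECASE)

def is_bug_fix_log(message):
    return bool(BUG_FIX_PATTERN.search(message))
```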
Prediction with CVS-log-based approach
T-test results (significant figures are in bold, p-value < 0.05)
The CVS-log-based approach reported additional post-defects (a higher % of buggy-labeled instances).
MIMs failed to capture them due to the lack of the corresponding Mylyn data.
Note that it is difficult to
100% guarantee the quality of
CVS change logs
(e.g., no explicit bug ID, missing logs)