Micro Interaction Metrics for Defect Prediction (ESEC/FSE 2011)
Micro Interaction Metrics for Defect Prediction
Taek Lee, Jaechang Nam, Dongyun Han, Sunghun Kim, Hoh Peter In
FSE 2011, Hungary, Sep. 5-9
Outline
• Research motivation
• The existing metrics
• The proposed metrics
• Experiment results
• Threats to validity
• Conclusion
Defect Prediction: why is it necessary?
Software quality assurance is inherently a resource-constrained activity!
Predicting defect-prone software entities* helps put the best labor effort on those entities
* functions or code files
Indicators of defects
• Complexity of source code (Chidamber and Kemerer 1994)
• Frequent code changes (Moser et al. 2008)
• Previous defect information (Kim et al. 2007)
• Code dependencies (Zimmermann 2007)
Indeed, where do defects come from?
Human Error!
Programmers make mistakes; consequently, defects are injected and software fails.
Human Errors → Bugs Injected → Software Fails
Programmer Interaction and Software Quality
“Errors are from cognitive breakdown while understanding and implementing
requirements”
- Ko et al. 2005
“Work interruptions or task switching may affect programmer productivity”
- DeLine et al. 2006
Don’t we need to also consider
developers’ interactions
as defect indicators?
…, but the existing indicators
can NOT directly capture
developers’ interactions
Using Mylyn data, we propose novel
“Micro Interaction Metrics (MIMs)”
capturing developers’ interactions
The Mylyn* data is stored as an attachment to the corresponding bug report in XML format
* an Eclipse plug-in for storing and recovering task contexts
<InteractionEvent … Kind=“ ” … StartDate=“ ” EndDate=“ ” … StructureHandle=“ ” … Interest=“ ” … >
(one <InteractionEvent> element per recorded interaction)
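For illustration, a minimal sketch of reading these InteractionEvent entries is shown below; the attribute names follow the slide, and the actual Mylyn attachment schema may contain additional fields.

```python
# Minimal sketch (not the authors' tooling): read InteractionEvent entries
# from a Mylyn task-context XML attachment. Attribute names follow the slide;
# the real Mylyn schema may differ or carry more fields.
import xml.etree.ElementTree as ET

def load_interaction_events(path):
    """Return one dict per <InteractionEvent> element, in document order."""
    events = []
    for ev in ET.parse(path).getroot().iter("InteractionEvent"):
        events.append({
            "kind": ev.get("Kind"),                    # e.g., selection, edit
            "start": ev.get("StartDate"),
            "end": ev.get("EndDate"),
            "handle": ev.get("StructureHandle"),       # touched file/element
            "interest": float(ev.get("Interest", 0)),  # degree of interest (DOI)
        })
    return events
```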
Two levels of MIMs Design
File-level MIMs: specific interactions for a file in a task (e.g., AvgTimeIntervalEditEdit)
Task-level MIMs: property values shared over the whole task (e.g., TimeSpent)
Mylyn Task Logs (example):
10:30  Selection  file A
11:00  Edit       file B
12:30  Edit       file B
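To make the file-level example concrete, here is a hedged sketch of how AvgTimeIntervalEditEdit might be computed from the event list of one task; the exact definition and time units in the paper may differ.

```python
from datetime import datetime

# Hypothetical timestamp format; actual Mylyn logs may encode dates differently.
TIME_FMT = "%Y-%m-%d %H:%M"

def avg_time_interval_edit_edit(events, file_handle):
    """Average gap (minutes) between consecutive Edit events on one file within
    one task; a sketch of the file-level MIM named above, not the paper's code."""
    edits = [e for e in events
             if e["kind"] == "edit" and e["handle"] == file_handle]
    if len(edits) < 2:
        return 0.0
    times = [datetime.strptime(e["start"], TIME_FMT) for e in edits]
    gaps = [(b - a).total_seconds() / 60.0 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)
```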
The Proposed Micro Interaction Metrics
For example, NumPatternSXEY captures this interaction:
“How many times did a programmer Select a file of group X and then Edit a file of group Y in a task activity?”
group X or Y: X if a file shows defect locality* properties, Y otherwise
group H or L: H if a file has a high** DOI value, L otherwise
* hinted by the paper [Kim et al. 2007]  ** threshold: median of degree of interest (DOI) values in a task
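A hedged sketch of how such a pattern could be counted over the ordered event stream of one task follows; the grouping function is a placeholder, and the paper's exact matching rules (e.g., whether intervening events are allowed) may differ.

```python
def num_pattern_select_x_edit_y(events, group_of):
    """Count 'Select a group-X file, then Edit a group-Y file' occurrences in
    one task. `group_of(handle)` is a placeholder returning 'X' for files with
    defect-locality properties and 'Y' otherwise."""
    count = 0
    for prev, curr in zip(events, events[1:]):
        if (prev["kind"] == "selection" and group_of(prev["handle"]) == "X"
                and curr["kind"] == "edit" and group_of(curr["handle"]) == "Y"):
            count += 1
    return count
```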
Bug Prediction Process
STEP1: Counting & Labeling Instances
All the Mylyn task data collectable from Eclipse subprojects (Dec 2005 ~ Sep 2010)
[Timeline figure: Dec 2005 to Sep 2010, split at time P. Task 1, Task 2, Task 3, ..., Task i (touching f1.java, f2.java, f3.java) fall before P; Task i+1, Task i+2, Task i+3, ... fall in the post-defect counting period after P]
The number of counted post-defects (edited files only within bug-fixing tasks): f1.java = 1, f2.java = 1, f3.java = 2, ...
Labeling rule for the file instance: “buggy” (if # of post-defects > 0), “clean” (if # of post-defects = 0)
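A minimal sketch of this counting-and-labeling rule, assuming tasks are available as (start date, is-bug-fix flag, edited files) tuples (a hypothetical representation, not the paper's data structures):

```python
from collections import Counter

def count_post_defects(tasks, split_point):
    """Each file edited within a bug-fixing task after the split point P
    gains one post-defect."""
    counts = Counter()
    for start, is_bug_fix, edited_files in tasks:
        if start > split_point and is_bug_fix:
            counts.update(edited_files)
    return counts

def label(counts, file_name):
    """'buggy' if the file has any post-defect, otherwise 'clean'."""
    return "buggy" if counts[file_name] > 0 else "clean"
```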
STEP2: Extraction of MIMs
[Timeline figure: the metrics extraction period runs from Dec 2005 up to time P and covers Task 1 ~ Task 4, ...; Task 1 edits f3.java, Task 2 edits f1.java, Task 3 edits f2.java, Task 4 edits f1.java and f2.java]
Metrics Computation: a MIM value is computed per file per task and then averaged over the tasks that touched the file:
MIMf3.java = valueTask1
MIMf1.java = (valueTask2 + valueTask4) / 2
MIMf2.java = (valueTask3 + valueTask4) / 2
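The averaging step can be sketched as follows, assuming per-task MIM values are kept as a mapping from task to {file: value} (a hypothetical layout chosen for illustration):

```python
from collections import defaultdict

def aggregate_mim(per_task_values):
    """Average one MIM over all tasks that touched each file, matching the
    example above (e.g., MIMf1.java = (valueTask2 + valueTask4) / 2)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for task_values in per_task_values.values():
        for file_name, value in task_values.items():
            sums[file_name] += value
            counts[file_name] += 1
    return {f: sums[f] / counts[f] for f in sums}
```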
Understand JAVA tool was used for extracting 32 source Code Metrics (CMs)*
* Chidamber and Kemerer, and OO metrics
[Timeline figure: CMs are computed from the last CVS revision before time P]
[Table: list of selected source code metrics]
Fifteen History Metrics (HMs)* were collected from the corresponding CVS repository
* Moser et al.
[Timeline figure: HMs are computed from the CVS revisions up to time P]
[Table: list of history metrics (HMs)]
STEP3: Creating a training corpus
[Table: each file instance pairs its Instance Name and extracted MIMs with a Label (for training a classifier) or with the # of post-defects (for training a regression model)]
STEP4: Building prediction models
Classification and Regression
modeling with different machine
learning algorithms using the
WEKA* tool
* an open source data mining tool
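The models themselves were built with WEKA; purely as an illustration of the classification setup, here is a sketch that uses scikit-learn as a stand-in, with a random forest as one example algorithm.

```python
# Illustrative only: the paper used WEKA; scikit-learn stands in here.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

def train_and_evaluate(X, y):
    """X: one MIM feature vector per file instance; y: 'buggy'/'clean' labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "precision(B)": precision_score(y_te, pred, pos_label="buggy"),
        "recall(B)": recall_score(y_te, pred, pos_label="buggy"),
        "f-measure(B)": f1_score(y_te, pred, pos_label="buggy"),
    }
```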
STEP5: Prediction Evaluation
Classification Measures
• Precision(B): how many instances are really buggy among the buggy-predicted outcomes?
• Recall(B): how many instances are correctly predicted as ‘buggy’ among the really buggy ones?
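In standard terms, with TP, FP, and FN denoting true positives, false positives, and false negatives for the buggy class: Precision(B) = TP / (TP + FP), Recall(B) = TP / (TP + FN), and F-measure(B) = 2 · Precision · Recall / (Precision + Recall).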
STEP5: Prediction Evaluation
Regression Measures (between # of real buggy instances and # of instances predicted as buggy):
• correlation coefficient (-1 ~ 1)
• mean absolute error (0 ~ 1)
• root square error (0 ~ 1)
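As a hedged sketch of these measures (NumPy used here for illustration; the paper relied on the values reported by WEKA):

```python
import numpy as np

def regression_measures(actual, predicted):
    """Correlation coefficient and mean absolute error between the actual and
    the predicted post-defect counts; a sketch, not WEKA's exact computation."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    corr = np.corrcoef(actual, predicted)[0, 1]   # ranges over -1 ~ 1
    mae = np.abs(actual - predicted).mean()
    return corr, mae
```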
T-test with 100 runs of 10-fold cross validation:
Reject H0* and accept H1* if p-value < 0.05 (at the 95% confidence level)
* H0: no difference in average performance, H1: different (better!)
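A minimal sketch of this test, assuming the per-run scores of two models (e.g., F-measures over the 100 x 10-fold runs) are collected into two lists (SciPy used for illustration):

```python
from scipy import stats

def significantly_better(scores_a, scores_b, alpha=0.05):
    """Reject H0 (no difference in average performance) and accept H1 when
    p-value < alpha and model A's mean score is the higher one."""
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    return p_value < alpha and t_stat > 0
```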
Result Summary: MIMs improve prediction accuracy for
1. different Eclipse project subjects
2. different machine learning algorithms
3. different model training periods
[Table: file instances and % of defects]
Prediction for different project subjects
MIM: the proposed metrics CM: source code metrics HM: history metrics
BASELINE: Dummy Classifier
predicts in a purely random manner
e.g., for 12.5% of buggy instances
Precision(B)=12.5%, Recall(B)=50%
F-measure(B)=20%
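This follows directly: a uniformly random classifier predicts “buggy” for half of all instances, so Recall(B) = 50%; its Precision(B) equals the 12.5% base rate of buggy instances; and F-measure(B) = 2 · 0.125 · 0.5 / (0.125 + 0.5) = 0.20, i.e., 20%.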
Prediction for different project subjects
T-test results (significant figures are in bold, p-value < 0.05)
Prediction with different algorithms
T-test results (significant figures are in bold, p-value < 0.05)
Prediction in different training periods
[Timeline: Dec 2005 ~ Sep 2010 is split at time P into a model training period and a model testing period, with splits of 50% : 50%, 70% : 30%, and 80% : 20%]
T-test results (significant figures are in bold, p-value < 0.05)
The top 42 metrics (37%) were MIMs, out of the 113 total metrics (MIMs + CMs + HMs)
Possible Insight
TOP 1: NumLowDOIEdit  TOP 2: NumPatternEXSX  TOP 3: TimeSpentOnEdit
Chances are that more defects are generated if a programmer (TOP 2) repeatedly edits and browses a file, especially one related to previous defects, (TOP 3) spends more of the working time on editing, and (TOP 1) especially when editing files that were accessed less frequently or less recently.
Performance comparison
with regression modeling
for predicting # of post-defects
Predicting Post-Defect Numbers
T-test results (significant figures are in bold, p-value < 0.05)
Threats to Validity
• Systems examined might not be representative
• Systems are all open source projects
• Defect information might be biased
Conclusion
• Our findings exemplify that developers’ interactions can affect software quality
• Our proposed micro interaction metrics significantly improve defect prediction accuracy
• We believe future defect prediction models will use more of developers’ direct, micro-level interaction information; MIMs are a first step towards that
Thank you! Any Questions?
• Problem – Can developers’ interaction information affect software quality (defects)?
• Approach – We proposed novel micro interaction metrics (MIMs) to overcome the limits of the popular static metrics
• Result – MIMs significantly improve prediction accuracy compared to source code metrics (CMs) and history metrics (HMs)
Backup Slides
One possible ARGUMENT: some developers may not have used Mylyn to fix bugs.
This leaves a chance of error in counting post-defects and, as a result, biased labels (i.e., an incorrect % of buggy instances).
We repeated the experiment using the same instances but with a different defect-counting heuristic, a CVS-log-based approach* (sketched below).
* with keywords: “fix”, “bug”, “bug report ID” in change logs
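A hedged sketch of that keyword heuristic; the exact keyword list and matching rules used in the repeated experiment may differ.

```python
import re

# Flag a CVS change-log message as a bug fix if it mentions "fix", "bug",
# or something that looks like a bug report ID (illustrative pattern only).
BUG_FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bugs?|\d{4,})\b", re.IGNORECASE)

def is_bug_fix_log(message):
    return bool(BUG_FIX_PATTERN.search(message))
```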
Prediction with CVS-log-based approach
T-test results (significant figures are in bold, p-value < 0.05)
The CVS-log-based approach reported additional post-defects (a higher % of buggy-labeled instances).
MIMs failed to capture them due to the lack of the corresponding Mylyn data.
Note that it is difficult to
100% guarantee the quality of
CVS change logs
(e.g., no explicit bug ID, missing logs)