TRANSCRIPT
Cross-Project Build Co-change Prediction
Shane McIntosh
Ahmed E. Hassan
[email protected]@shane_mcintoshshanemcintosh.org
Emad Shihab
David Lo
Xin Xia
[Figure: a build dependency graph translating sources (.tex, .c, .cc) through intermediates (.o, .dvi, .a) into deliverables (.exe, .deb)]
Build systems describe how sources are translated into deliverables
The build system is at the heart of techniques like Continuous Integration (CI)
[Figure: Commit 9719cf0, touching .c and .mk files, flows through Build and Test; the report states that commit 9719cf0 was successfully integrated]
“...nothing can be said to be certain, except death and taxes” - Benjamin Franklin
The Build “Tax”
An Empirical Study of Build Maintenance Effort
S. McIntosh, B. Adams, T. H. D. Nguyen, Y. Kamei, A. E. Hassan [ICSE 2011]
Up to 27% of source changes require build changes, too!
Neglected build maintenance is a frequent cause of build breakage
[Figure: Commit aedd38 touches a .c file but not the .mk file; Build and Test run, and the report states that commit aedd38 broke the build!]
Neglected build maintenance can even impact end users
Not working due to linking of incorrect SQLite library version
When are build changes necessary?
Grouping related changes according to the work items that they address
[Figure: changes to .c files and a .mk file are grouped into transactions by work item, e.g., #1234 “Fix for bug #1234” and #2121 “Add feature #2121” plus missed code in #2121]
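The grouping step above can be sketched in a few lines. This is an illustrative sketch only: the change records, work-item IDs, and file names are invented, and the real study mines them from version control and issue trackers.

```python
# Hypothetical sketch: group individual file changes into work-item
# transactions, then flag transactions that co-change the build system.
# All records below are invented for illustration.
from collections import defaultdict

# Each change record: (work_item_id, changed_file)
changes = [
    ("1234", "parser.c"),
    ("2121", "feature.c"),
    ("2121", "feature_extra.c"),   # the "missed code in #2121"
    ("2121", "Makefile.mk"),
]

def group_by_work_item(changes):
    """Collect all files touched by the same work item into one transaction."""
    transactions = defaultdict(list)
    for work_item, path in changes:
        transactions[work_item].append(path)
    return dict(transactions)

transactions = group_by_work_item(changes)

# A transaction "co-changes" the build if it touches a build file (.mk here).
build_co_change = {wi: any(f.endswith(".mk") for f in files)
                   for wi, files in transactions.items()}
```

The boolean labels produced this way are what a build co-change classifier would later learn to predict from code-change features alone.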
We train classifiers to identify code changes that require build co-changes
[Figure: work items containing .c changes (and sometimes .mk changes) feed a classification model that outputs “Build change necessary” or “No build change necessary”]
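To make the classification step concrete, here is a minimal stand-in classifier, not the paper's actual pipeline: a one-nearest-neighbour rule over hand-made per-transaction features. The feature set and training data are assumptions for illustration.

```python
# Toy build co-change classifier: 1-nearest-neighbour over invented
# per-transaction features (#files changed, #lines added,
# touches a new source file: 0/1). Label 1 = build change necessary.
import math

train = [
    ((1, 10, 0), 0),   # small edit, no build change needed
    ((2, 30, 0), 0),
    ((5, 200, 1), 1),  # adds a new source file -> Makefile must change
    ((8, 500, 1), 1),
]

def predict(x):
    """Return the label of the nearest training transaction (Euclidean)."""
    _, label = min(train, key=lambda t: math.dist(t[0], x))
    return label

pred = predict((6, 300, 1))  # a change that adds a new source file
```

Any off-the-shelf classifier (the prior work used several) would slot in behind the same `predict` interface.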
Prior work shows that within-project build co-change prediction can be accurate
Mining Co-Change Information to Understand when Build Changes are Necessary
S. McIntosh, B. Adams, M. Nagappan, A. E. Hassan [ICSME 2014]
Build co-change classifiers can achieve an AUC of 0.60-0.88
However, a large amount of historical data was used to train the classifiers
What about new projects? …or projects with poorly-recorded historical data?
Can we leverage these large corpora for the small ones?
How well do build co-change prediction models perform on sparse data?
[Figure: precision, recall, F1-score, and AUC (0 to 1) when training on 5%, 50%, and 90% of the data]
Challenge 1: Very small datasets tend to yield models that under-perform
How well do build co-change prediction models perform on other datasets?
[Figure: precision, recall, F1-score, and AUC (0 to 1) for Eclipse => Mozilla, Jazz => Mozilla, and Lucene => Mozilla]
Challenge 2: Cross-project build co-change models tend to under-perform
15
Domain-specific project characteristics may limit the applicability of cross-project models
Training corpus
Testing corpus
Training corpus
16
Classification model
Testing corpus
Domain-specific project characteristics may limit the applicability of cross-project models
Training corpus
16
Classification model
Testing corpus
?
Domain-specific project characteristics may limit the applicability of cross-project models
Using transfer learning to provide some domain knowledge to the training corpus
[Figure: move some training data from the target system into the training corpus, train a classification model on the mixed corpus, then apply it to the testing corpus]
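The data-mixing step can be sketched as follows, under assumed record shapes: a small labelled slice of the target project is moved into the cross-project training corpus, and the remainder is held out for testing. The fraction, seed, and corpus contents are all illustrative.

```python
# Sketch of instance-transfer data mixing: seed the cross-project
# training corpus with a fraction of labelled target-project data.
import random

def mix_corpora(source, target, fraction=0.1, seed=0):
    """Return (training corpus, held-out testing corpus).

    `fraction` of the target project's labelled transactions is moved
    into the training corpus; the rest is kept aside for testing.
    """
    rng = random.Random(seed)
    shuffled = target[:]
    rng.shuffle(shuffled)
    n = max(1, int(len(shuffled) * fraction))
    training = source + shuffled[:n]
    testing = shuffled[n:]
    return training, testing

source = [("eclipse", i) for i in range(100)]  # cross-project data
target = [("mozilla", i) for i in range(50)]   # target-project data
train, test = mix_corpora(source, target, fraction=0.1)
```

Keeping the held-out testing slice untouched is what makes the later evaluation honest: the model only ever sees the small transferred slice of the target project.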
Set aside the testing corpus and use the training corpus to find an appropriate threshold
[Figure: a classification model is fit on the training corpus; incorrectly classified examples drive the training of further models (classification models 1, 2, …, N), and an ensemble of the models is used on the testing corpus]
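A hedged sketch of the threshold-plus-ensemble idea above: train N models, classify by vote, and pick the vote threshold that works best on the training corpus, never on the held-out testing corpus. The "models" here are trivial one-rule stand-ins, and the data is invented.

```python
# Toy ensemble with a training-corpus-tuned vote threshold.
# Features per transaction: (#files changed, #lines added).

def make_stump(feature_index, cutoff):
    """A one-rule model: predict 1 if the feature exceeds the cutoff."""
    return lambda x: 1 if x[feature_index] > cutoff else 0

def ensemble_predict(models, x, vote_threshold):
    """Predict 1 when at least vote_threshold models agree."""
    votes = sum(m(x) for m in models)
    return 1 if votes >= vote_threshold else 0

def pick_vote_threshold(models, labelled_train):
    """Choose the vote threshold that maximises training-set accuracy."""
    best_t, best_acc = 1, -1.0
    for t in range(1, len(models) + 1):
        acc = sum(ensemble_predict(models, x, t) == y
                  for x, y in labelled_train) / len(labelled_train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

train = [((1, 10), 0), ((2, 30), 0), ((5, 200), 1), ((8, 500), 1)]
models = [make_stump(0, 3), make_stump(1, 100), make_stump(0, 6)]
t = pick_vote_threshold(models, train)       # tuned on training data only
pred = ensemble_predict(models, (7, 400), t)  # applied to unseen data
```

Tuning the threshold on the training corpus keeps the testing corpus untouched, which mirrors the "set aside the testing corpus" step in the slides.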
Our approach outperforms baseline cross-project approaches
[Figure: worst measured F-score (0 to 1) for Eclipse, Jazz, Lucene, Mozilla, and the average; our approach vs. ordinary cross-project, AdaBoost, and TrAdaBoost]
37%-42% improvement
Our approach achieves similar results to within-project models
[Figure: worst measured F-score (0 to 1) for Eclipse, Jazz, Lucene, Mozilla, and the average; our approach vs. within-project]
Only a 7% drop in performance
Evaluating our approach
Relative performance: 37%-42% improvement over the baseline; only a 7% drop of the within-project F-measure
Training configuration sensitivity (source/target)
Additional data from the target system slowly improves classifier performance
[Figure: F-score as more target system data is mixed into the source training data]
Evaluating our approach
Relative performance: 37%-42% improvement over the baseline; only a 7% drop of the within-project F-measure
Training configuration sensitivity: F-score tends to improve as more target system data becomes available