Studying Software Quality Using Topic Models
Tse-Hsun (Peter) Chen
Related Publications
Explaining Software Defects Using Topic Models, Tse-Hsun Chen, Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, 9th Working Conference on Mining Software Repositories (MSR). Zurich, Switzerland. June 2-3, 2012 (acceptance rate: 18/64 (28%))
Studying the Effect of Testing on Code Quality using Topic Models, Tse-Hsun Chen, Stephen W. Thomas, Hadi Hemmati, Meiyappan Nagappan, Ahmed E. Hassan, under review for the Journal of Empirical Software Engineering. Springer Press (Impact Factor 1.854).
An Empirical Study of Concerns and Their Ability to Explain Defects in Large Software Systems, Tse-Hsun Chen, Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, to be submitted for IEEE Transactions on Software Engineering (Impact Factor 1.98).
Thesis Statement
Topics, which are approximations of software concerns, can be used to study software quality by better explaining the quality of code and helping allocate software quality assurance efforts effectively.
int readFile(String filePath){
    // reading file
    fp = readFile(filePath)
    if fp == NULL
        return -1
    else
        return fp
}
int manageMemory(int index){
    if mem[index] is not NULL{
        // find free memory
        freeInd = findFreeMemoryLoc()
        goto(freeInd)
    }
}
More Risky Concern
Can we use concerns to study software quality?
Capturing Concerns Using Topic Models
manage memory index mem free ind find free memory loc
read file file path fp file path fp
Topic Models (LDA)
Topic 1: read, file, path, fp, file
Topic 2: manage, memory, mem, free
Topic 3: index, ind, find, loc
Topic membership (Topic 1 / Topic 2 / Topic 3)
readFile: 60 % / 0 % / 40 %
manageMemory: 0 % / 55 % / 45 %
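The word lists on this slide (for example "free ind find free memory loc") look like the result of splitting identifiers into words before running the topic model. A minimal sketch of that preprocessing step follows; the function names split_identifier and preprocess are hypothetical, and the sketch does not remove language keywords or stop words, which a real pipeline would likely do.

```python
import re

def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs stay together; otherwise break at lower-to-upper boundaries.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]

def preprocess(source):
    """Turn source text into the bag of words a topic model would consume."""
    tokens = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        tokens += split_identifier(identifier)
    return tokens

print(preprocess("freeInd = findFreeMemoryLoc()"))
# prints ['free', 'ind', 'find', 'free', 'memory', 'loc']
```

The resulting word lists, one per source file, are what an LDA implementation would take as its documents.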
Studying code quality using topics
Studying code coverage using topics
Code
Things to test
Studying Code Quality Using Topics
How defect-prone are topics?
Can topics help explain software defects?
Are Topics Equally Defect-prone?
If they are, then we CANNOT use topics to study code quality
[MSR 2012]
Measuring Topic Defect-proneness
[Figure: files F1, F2, F3 linked to topics T1, T2, T3, T4]
[MSR 2012]
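The slide does not spell out the formula, so the following is only a sketch under an assumption: a topic's defect density is taken as the sum, over all files, of the file's membership in that topic multiplied by the file's defect density (defects per line of code). The function name, the data layout, and the numbers are all hypothetical.

```python
def topic_defect_density(memberships, defects, loc):
    """memberships: {file: {topic: weight}}; defects and loc: {file: number}."""
    density = {}
    for f, topics in memberships.items():
        file_density = defects[f] / loc[f]  # defects per line of code
        for t, weight in topics.items():
            # Each file contributes in proportion to its membership in the topic.
            density[t] = density.get(t, 0.0) + weight * file_density
    return density

# Made-up data mirroring the F1..F3 / T1..T4 layout of the figure.
memberships = {
    "F1": {"T1": 0.5, "T2": 0.5},
    "F2": {"T2": 1.0},
    "F3": {"T3": 0.25, "T4": 0.75},
}
defects = {"F1": 2, "F2": 0, "F3": 4}
loc = {"F1": 100, "F2": 50, "F3": 200}

for topic, value in sorted(topic_defect_density(memberships, defects, loc).items()):
    print(topic, round(value, 4))
```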
Few Topics are Defect-prone
[Figure: topic defect density per topic; labeled topics: "Jface, Comparison check" and "Task, Eclipse, Task ui, Repository"]
[MSR 2012]
Explaining Defects
Static: Lines of Code
Historical: Pre-release Defects, Code Churn
Topics: Topic Metrics
[MSR 2012]
Explainability of Metrics
Static: Deviance Explained (D1)
Static + Topics: Deviance Explained (D2)
Improvement in Explainability = D2 – D1
[MSR 2012]
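As an illustration of D1 and D2, the sketch below computes deviance explained for a binary defect outcome as 1 minus the ratio of model deviance to null deviance. It assumes logistic-regression-style predicted probabilities; the slide does not state the model type, and the labels and probabilities here are invented.

```python
import math

def deviance(labels, probs):
    """Binomial deviance: -2 times the log-likelihood of 0/1 labels under probs."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -2.0 * total

def deviance_explained(labels, probs):
    """Share of the null (intercept-only) deviance removed by the model."""
    base_rate = sum(labels) / len(labels)
    null = deviance(labels, [base_rate] * len(labels))
    return 1.0 - deviance(labels, probs) / null

labels = [1, 0, 1, 0]                 # hypothetical: 1 = file has a defect
p_static = [0.6, 0.4, 0.6, 0.4]       # hypothetical model with static metrics
p_static_topics = [0.9, 0.1, 0.9, 0.1]  # hypothetical model with static + topic metrics

d1 = deviance_explained(labels, p_static)
d2 = deviance_explained(labels, p_static_topics)
print("improvement in explainability:", d2 - d1)
```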
Using Topics to Explain Defects: Number of Topics
[Figure: files F1, F2, F3 linked to topics T1, T2, T3, T4]
[MSR 2012]
Using Topics to Explain Defects: Number of Defect-prone Topics
[Figure: files F1, F2, F3 linked to topics T1, T2, T3, T4]
[MSR 2012]
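The two per-file metrics named on these slides can be sketched as simple counts over a file's topic memberships. The membership cut-off of 0.01 is an assumption chosen for illustration, as are the function names and sample data.

```python
def number_of_topics(file_topics, threshold=0.01):
    """Count topics whose membership in the file reaches the threshold."""
    return sum(1 for weight in file_topics.values() if weight >= threshold)

def number_of_defect_prone_topics(file_topics, defect_prone, threshold=0.01):
    """Same count, restricted to topics already flagged as defect-prone."""
    return sum(
        1
        for topic, weight in file_topics.items()
        if weight >= threshold and topic in defect_prone
    )

# Hypothetical memberships for one file and a hypothetical defect-prone set.
file_topics = {"T1": 0.6, "T2": 0.0, "T3": 0.4}
defect_prone = {"T3"}

print(number_of_topics(file_topics))                             # prints 2
print(number_of_defect_prone_topics(file_topics, defect_prone))  # prints 1
```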
More Topics, More Defects in File
[Figure: bar charts of Avg. % Improvement in D2 for Number of Topics and Number of Defect-prone Topics; data labels in extraction order: 30 %, 48 %; 21 %, 21 %, 49 %, 0 %, 7 %, 6 %]
[MSR 2012]
Compare with Other Cohesion/Coupling Metrics
Which is better: # of topics or other topic-based metrics?
# of topics?
[TSE 201X]
# Topics Outperforms Others
[Figure: bar chart of Avg. % Improvement in D2 over base, comparing # topics (our metric) with state-of-the-art metrics; data labels in extraction order: 39 %, 3 %, 3 %, 20 %]
[TSE 201X]
We found only a few topics are defect-prone…
Can we allocate MORE testing resources to low-tested but defect-prone topics?
Studying Code Coverage Using Topics
Can we predict low unit tested and high defect-prone topics?
What is the relationship between code coverage and quality?
Measuring Topic Testedness
Topic Testedness: how much a topic is tested
[Figure: diagram with nodes F1, T1, T1, T2]
[EMSE 201X]
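The slide defines topic testedness only informally, so the sketch below is one possible operationalization and not necessarily the one used in the study: it assumes a per-file test coverage ratio is available and takes a topic's testedness as the membership-weighted average of that coverage. All names and values are hypothetical.

```python
def topic_testedness(memberships, coverage):
    """memberships: {file: {topic: weight}}; coverage: {file: ratio in [0, 1]}."""
    weighted, total = {}, {}
    for f, topics in memberships.items():
        for t, weight in topics.items():
            weighted[t] = weighted.get(t, 0.0) + weight * coverage[f]
            total[t] = total.get(t, 0.0) + weight
    # Normalize by total membership so heavily used topics are not favored.
    return {t: weighted[t] / total[t] for t in weighted if total[t] > 0}

memberships = {"F1": {"T1": 0.6, "T2": 0.4}, "F2": {"T2": 1.0}}
coverage = {"F1": 0.5, "F2": 1.0}

for topic, value in sorted(topic_testedness(memberships, coverage).items()):
    print(topic, round(value, 3))
```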
More Unit Tested, Less Defect Prone
[EMSE 201X]
Predict LTHD Topics Accurately
[Figure: bar chart of Avg. F-Measure; data labels in extraction order: 0.8, 0.76, 0.68]
[EMSE 201X]
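The F-measure is conventionally the harmonic mean of precision and recall. A small sketch of how it could be computed for a set of predicted LTHD topics against the actual ones follows; the topic names are made up.

```python
def f_measure(actual, predicted):
    """Harmonic mean of precision and recall for two sets of topic ids."""
    true_positives = len(actual & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)

actual_lthd = {"T1", "T3"}     # hypothetical ground truth
predicted_lthd = {"T1", "T2"}  # hypothetical model output

print(f_measure(actual_lthd, predicted_lthd))  # prints 0.5
```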
Can We Improve Existing Approaches?
Testers usually test at the concern level, but existing approaches do not support this.
Can we HELP existing test allocation approaches?
[EMSE 201X]
Low Overlap With Existing Approach – Prediction Model
Compared: Our Approach vs. Prediction-based Approach
Top N buggy files that may need more testing
Top N buggy files found
On average, only 5.3% of the files overlap
[EMSE 201X]
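The overlap figure on this slide can be illustrated by comparing two top-N file lists. The sketch assumes each approach yields a score per file and that both lists use the same N; the scores and file names are invented.

```python
def top_n(scores, n):
    """Return the n highest-scoring files, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]

def overlap_percent(ranking_a, ranking_b):
    """Percentage of files in ranking_a that also appear in ranking_b."""
    if not ranking_a:
        return 0.0
    shared = set(ranking_a) & set(ranking_b)
    return 100.0 * len(shared) / len(ranking_a)

topic_based = top_n({"a.java": 5, "b.java": 3, "c.java": 1}, 2)
prediction_based = top_n({"a.java": 1, "b.java": 4, "c.java": 6}, 2)

print(overlap_percent(topic_based, prediction_based))  # prints 50.0
```

A low overlap suggests the two approaches point testers at different files, so they may complement rather than replace each other.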
File Defect Density
File Defect Density = Number of Bugs / Lines of Code
A measure for estimating the effort needed to find bugs
Files We Found Have Higher Defect Density
[Figure: bar chart of Avg. % Defect Density Improvements; data labels in extraction order: 64 %, 242 %, 30 %]
[EMSE 201X]
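To make the comparison concrete, the sketch below applies the file defect density definition (number of bugs divided by lines of code) to two sets of files and reports the percentage improvement of one over the other. Averaging per-file densities is an assumption about how the sets are summarized, and the data are invented.

```python
def defect_density(bugs, loc):
    """File defect density: number of bugs per line of code."""
    return bugs / loc

def mean_density(files):
    """files: list of (bugs, loc) pairs; returns the average per-file density."""
    return sum(defect_density(bugs, loc) for bugs, loc in files) / len(files)

def percent_improvement(ours, baseline):
    """How much higher our average density is, as a percentage of the baseline."""
    base = mean_density(baseline)
    return 100.0 * (mean_density(ours) - base) / base

ours = [(4, 100), (2, 100)]      # hypothetical files picked by the topic-based approach
baseline = [(1, 100), (3, 100)]  # hypothetical files picked by the existing approach

print(round(percent_improvement(ours, baseline), 1))  # prints 50.0
```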
Thesis Statement
Topics, which are approximations of software concerns, can be used to study software quality by better explaining the quality of code and helping allocate software quality assurance efforts effectively.
Code
Study Code Quality using Topics
Relationship between defects and topics
Use topics to explain defects
Study Code Coverage using Topics
Relationship between topic testedness and defects
Predict low unit tested and defect-prone topics
Things to test