Noname manuscript No. (will be inserted by the editor)
To What Extent Could We Detect Field Defects?
—An Extended Empirical Study of False Negatives in Static Bug-Finding Tools
Ferdian Thung · Lucia · David Lo ·Lingxiao Jiang · Foyzur Rahman ·Premkumar T. Devanbu
Received: date / Accepted: date
Abstract Software defects can cause much loss. Static bug-finding tools are designed to detect and remove software defects and are believed to be effective. However, do such tools in fact help prevent actual defects that occur in the field and are reported by users? If these tools had been used, would they have detected these field defects and generated warnings that would direct programmers to fix them? To answer these questions, we perform an empirical study that investigates the effectiveness of five state-of-the-art static bug-finding tools (FindBugs, JLint, PMD, CheckStyle, and JCSC) on hundreds of reported and fixed defects extracted from three open source programs (Lucene, Rhino, and AspectJ). Our study addresses the question: To what extent could field defects be detected by state-of-the-art static bug-finding tools? Different from past studies that are concerned with the number of false positives produced by such tools, we address an orthogonal issue: the number of false negatives. We
Ferdian Thung, School of Information Systems, Singapore Management University. E-mail: [email protected]
Lucia, School of Information Systems, Singapore Management University. E-mail: [email protected]
David Lo, School of Information Systems, Singapore Management University. E-mail: [email protected]
Lingxiao Jiang, School of Information Systems, Singapore Management University. E-mail: [email protected]
Foyzur Rahman, University of California, Davis. E-mail: [email protected]
Premkumar T. Devanbu, University of California, Davis. E-mail: [email protected]
2 Ferdian Thung et al.
find that although many field defects could be detected by static bug-finding tools, a substantial proportion of defects could not be flagged. We also analyze the types of tool warnings that are more effective in finding field defects and characterize the types of missed defects. Furthermore, we analyze the effectiveness of the tools in finding field defects of various severities, difficulties, and types.
Keywords False Negatives · Static Bug-Finding Tools · Empirical Study
1 Introduction
Bugs are prevalent in many software systems. The National Institute of Standards and Technology (NIST) has estimated that bugs cost the US economy billions of dollars annually (Tassey, 2002). Bugs are not merely economically harmful; they can also endanger lives and property when mission critical systems malfunction. Clearly, techniques that can detect and reduce bugs would be very beneficial. To achieve this goal, many static analysis tools have been proposed to find bugs. Static bug-finding tools, such as FindBugs (Hovemeyer and Pugh, 2004), JLint (Artho, 2006), PMD (Copeland, 2005), CheckStyle (Burn, 2007), and JCSC (Jocham, 2005), have been shown to be helpful in detecting many bugs, even in mature software (Ayewah et al, 2007). It is thus reasonable to believe that such tools are a useful adjunct to other bug-finding techniques such as testing and inspection.
Although static bug-finding tools are effective in some settings, it is unclear whether the warnings that they generate are really useful. Two issues are particularly important to address: First, many warnings need to correspond to actual defects that would be experienced and reported by users. Second, many actual defects should be captured by the generated warnings. For the first issue, a number of studies have shown that the number of false warnings (i.e., false positives) is too high, and some have proposed techniques to prioritize warnings (Heckman and Williams, 2011, 2009; Ruthruff et al, 2008; Heckman, 2007). While the first issue has received much attention, the second has received less. Many papers on bug detection tools simply report the number of defects that the tools can detect. It is unclear how many defects are missed by these tools (i.e., false negatives). While the first issue is concerned with false positives, the second focuses on false negatives. We argue that both issues deserve equal attention as both impact the quality of software systems. If false positives are not satisfactorily addressed, bug-finding tools become unusable. If false negatives are not satisfactorily addressed, the impact of these tools on software quality would be minimal. On mission critical systems, false negatives may deserve even more attention. Thus, there is a need to investigate the false negative rates of such tools on actual field defects.
Our study tries to fill this research gap by answering the following research question (we use the terms "bug" and "defect" interchangeably; both refer to errors or flaws in software):
To What Extent Could We Detect Field Defects? 3
To what extent could state-of-the-art static bug-finding tools detect fielddefects?
To investigate this research question, we make use of data available in bug-tracking systems and software repositories. Bug-tracking systems, such as Bugzilla or JIRA, record descriptions of bugs that are actually experienced and reported by users. Software repositories contain information on what code elements get changed, removed, or added at different points in time. Such information can be linked together to track bugs and when and how they get fixed. JIRA has the capability to link a bug report with the changed code that fixes the bug. Also, many techniques have been employed to link bug reports in Bugzilla to their corresponding SVN/CVS code changes (Dallmeier and Zimmermann, 2007; Wu et al, 2011a). These data sources provide us with descriptions of actual field defects and their treatments. Based on the descriptions, we are able to infer the root causes of defects (i.e., the faulty lines of code) from the bug treatments. To ensure accurate identification of faulty lines of code, we perform several iterations of manual inspection to identify lines of code that are responsible for the defects. Then, we are able to compare the identified root causes with the lines of code flagged by static bug-finding tools, and to analyze the proportion of field defects that are missed or captured by the tools.
In this work, we perform an exploratory study with five state-of-the-art static bug-finding tools (FindBugs, PMD, JLint, CheckStyle, and JCSC) on three reasonably large open source Java programs (Lucene, Rhino, and AspectJ). We use bugs reported in JIRA for Lucene version 2.9, and the iBugs dataset provided by Dallmeier and Zimmermann (2007) for Rhino and AspectJ. Our manual analysis identifies 200 real-life defects for which we can unambiguously locate the faulty lines of code. We find that many of these defects could be detected by FindBugs, PMD, JLint, CheckStyle, and JCSC, but a number of them remain undetected.
The main contributions of this work are as follows:

1. We examine the number of real-life defects missed by five static bug-finding tools, and evaluate the tools' performance in terms of their false negative rates.
2. We investigate the warning families in various tools that are effective in detecting actual defects.
3. We characterize actual defects that could not be flagged by the static bug-finding tools.
4. We analyze the effectiveness of the static bug-finding tools on defects of various severities, difficulties, and types.
The paper is structured as follows. In Section 2, we present introductory information on various static bug-finding tools. In Section 3, we present our experimental methodology. In Section 4, we present our empirical findings and discuss interesting issues. In Section 5, we describe related work. We conclude with future work in Section 6.
2 Bug-Finding Tools
In this section, we first provide a short survey of different bug-finding tools, which could be grouped into: static, dynamic, and machine learning based. We then present the five static bug-finding tools that we evaluate in this study, namely FindBugs, JLint, PMD, CheckStyle, and JCSC.
2.1 Categorization of Bug-Finding Tools
Many bug-finding tools are based on static analysis techniques (Nielson et al, 2005), such as type systems (Necula et al, 2005), constraint-based analysis (Xie and Aiken, 2007), model checking (Holzmann et al, 2000; Corbett et al, 2000; Beyer et al, 2007), abstract interpretation (Cousot et al, 2009; Cousot and Cousot, 2012), or a combination of various techniques (Ball et al, 2011; GrammaTech, 2012; Visser and Mehlitz, 2005; IBM, 2012). They often produce various false positives, and in theory they should be free of false negatives for the kinds of defects they are designed to detect. However, due to implementation limitations and the fact that a large program often contains defect types that are beyond the designed capabilities of the tools, such tools may still suffer from false negatives with respect to all kinds of defects.
In this study, we analyze several static bug-finding tools that make use of warning patterns for bug detection. These tools are lightweight and can scale to large programs. On the downside, these tools do not consider the specifications of a system, and may miss defects caused by specification violations.
Other bug-finding tools use dynamic analysis techniques, such as dynamic slicing (Weeratunge et al, 2010), dynamic instrumentation (Nethercote and Seward, 2007), directed random testing (Sen et al, 2005; Godefroid et al, 2005; Cadar et al, 2008), and invariant detection (Brun and Ernst, 2004; Gabel and Su, 2010). Such tools often explore particular parts of a program and produce no or few false positives. However, they seldom cover all parts of a program; they are thus expected to have false negatives.
There are also studies on bug prediction with data mining and machine learning techniques, which may have both false positives and negatives. For example, Sliwerski et al (2005) analyze code change patterns that may cause defects. Ostrand et al (2005) use a regression model to predict defects. Nagappan et al (2006) apply principal component analysis on the code complexity metrics of commercial software to predict failure-prone components. Kim et al (2008) predict potential faults from bug reports and fix histories.
2.2 FindBugs
FindBugs was first developed by Hovemeyer and Pugh (2004). It statically analyzes Java bytecode against various families of warnings characterizing common bugs in many systems. Code matching a set of warning patterns is flagged to the user, along with the specific locations of the code.
FindBugs comes with many built-in warnings. These include: null pointer dereference, method not checking for null argument, close() invoked on a value that is always null, test for floating point equality, and many more. There are hundreds of warnings; these fall under a set of warning families including: correctness, bad practice, malicious code vulnerability, multi-threaded correctness, style, internationalization, performance, risky coding practice, etc.
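As a concrete illustration, the toy snippet below contains two patterns of this kind. The class and method names are ours, not code from the studied projects, and the exact warning names vary across FindBugs versions.

```java
// Toy examples of code that pattern-based tools such as FindBugs
// typically warn about (illustrative only).
public class WarningExamples {

    // "Test for floating point equality": == on doubles is unreliable
    // because of rounding error; 0.1 + 0.2 == 0.3 is false in IEEE-754.
    static boolean sumEqualsDirect(double a, double b, double expected) {
        return a + b == expected;                   // likely flagged
    }

    // The usual fix: compare within a small tolerance instead.
    static boolean sumEqualsWithTolerance(double a, double b, double expected) {
        return Math.abs((a + b) - expected) < 1e-9;
    }

    // "Method does not check for null argument": dereferencing s without a
    // null check throws NullPointerException when s is null.
    static int unsafeLength(String s) {
        return s.length();                          // likely flagged
    }
}
```

For instance, sumEqualsDirect(0.1, 0.2, 0.3) returns false even though the sum is mathematically 0.3, while the tolerance-based variant returns true.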
2.3 JLint
JLint, developed by Artho (2006), is a tool to find defects, inconsistent code, and problems with synchronization in multi-threaded applications. Similar to FindBugs, JLint analyzes Java bytecode against a set of warning patterns. It constructs and checks a lock graph, and performs data-flow analysis. Code fragments matching the warning patterns are flagged and reported to the user along with their locations.
JLint provides many warnings, such as potential deadlocks, unsynchronized methods implementing the 'Runnable' interface, finalize() not calling super.finalize(), null references, etc. These warnings are grouped under three families: synchronization, inheritance, and data flow.
2.4 PMD
PMD, developed by Copeland (2005), is a tool that finds defects, dead code, duplicate code, sub-optimal code, and overcomplicated expressions. Different from FindBugs and JLint, PMD analyzes Java source code rather than Java bytecode. PMD also comes with a set of warning patterns and finds locations in code matching these patterns.
PMD provides many warning patterns, such as jumbled incrementer, return from finally block, class cast exception with toArray, misplaced null check, etc. These warning patterns fall into families, such as design, strict exceptions, clone, unused code, String and StringBuffer, security code, etc., which are referred to as rule sets.
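One of these patterns can be made concrete with a toy example of our own (not code from the studied programs): a return in a finally block silently discards any pending exception, which is why PMD-style tools flag it.

```java
// Illustrative only: the "return from finally block" pattern.
public class PmdExamples {

    // The finally block's return discards the pending exception, so the
    // caller never observes the IllegalStateException; it just gets 42.
    static int returnFromFinally() {
        try {
            throw new IllegalStateException("silently lost");
        } finally {
            return 42;    // likely flagged: return from finally block
        }
    }
}
```

Calling returnFromFinally() returns 42 and the exception is swallowed, which is rarely the behavior the programmer intended.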
2.5 CheckStyle
CheckStyle, developed by Burn (2007), is a tool that enforces a coding standard on Java code; its default configuration follows the Sun Code Conventions. Similar to PMD, CheckStyle analyzes Java source code. CheckStyle also comes with a set of warning patterns and finds locations in code matching these patterns.
CheckStyle provides many warning patterns, such as layout issues, class design problems, duplicate code, Javadoc comments, metrics, modifiers, naming conventions, regular expressions, size violations, whitespace, etc. These warning
patterns fall into families, such as sizes, regular expression, whitespace, Java documentation, etc.
2.6 JCSC
JCSC, developed by Jocham (2005), is a tool that enforces a customizable coding standard on Java code and also checks for potentially bad code. Similar to CheckStyle, the default coding standard is the Sun Code Conventions. The standard can define naming conventions for classes, interfaces, fields, and parameters, the structural layout of a type (class/interface), etc. Similar to PMD and CheckStyle, JCSC analyzes Java source code. It also comes with a set of warning patterns and finds locations in code matching these patterns.
JCSC provides many warning patterns, such as empty catch/finally block, switch without default, slow code, inner class issues in classes/interfaces, constructor, method, or field declaration issues, etc. These warning patterns fall into families, such as method, metrics, field, Java documentation, etc.
3 Methodology
We make use of bug-tracking systems, version control systems, and state-of-the-art bug-finding tools. First, we extract bugs and the faulty lines of code that are responsible for the bugs. Next, we run the bug-finding tools on the various program releases before the bugs get fixed. Finally, we compare the warnings given by the bug-finding tools with the real bugs to find false negatives.
3.1 Extraction of Faulty Lines of Code
We analyze two common configurations of bug tracking systems and code repositories to get historically faulty lines of code. One configuration is the combination of CVS as the source control repository and Bugzilla as the bug tracking system. Another configuration is the combination of Git as the source control repository and JIRA as the bug tracking system. We describe how these two configurations could be analyzed to extract the root causes of fixed defects.
Data Extraction: CVS with Bugzilla. For the first configuration, Dallmeier and Zimmermann (2007) have proposed an approach to automatically analyze CVS and Bugzilla to link information. Their approach extracts issue reports and links them to their corresponding CVS commit entries. It can also download the code before and after each change. The code to perform this has been publicly released for several software systems. As our focus is on defects, we remove issue reports that are marked as enhancements (i.e., they are not bug fixes) and the CVS commits that correspond to them.
Data Extraction: Git with JIRA. Git and JIRA have features that make them preferable over CVS and Bugzilla. JIRA explicitly links bug reports to the revisions that fix the corresponding defects. From these fixes, we use Git diff to find the location of the buggy code, and we download the revision prior to the fix by appending the "^" symbol to the hash of the corresponding fix revision. Again, we remove bug reports and Git commits that are marked as enhancements.
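This extraction step can be sketched with plain Git commands. The snippet below builds a throwaway repository with a buggy commit and a fix commit (file name and contents are made up, echoing Figure 2), then recovers the pre-fix version by appending "^" to the fix revision's hash:

```shell
set -e
# Build a toy repository with a buggy version and a fix (made-up content).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "demo@example.com" && git config user.name "demo"
echo 'return doc == NO_MORE_DOCS;' > Scorer.java
git add Scorer.java && git commit -q -m "buggy version"
echo 'return doc != NO_MORE_DOCS;' > Scorer.java
git commit -q -a -m "fix: flip the comparison operator"
fix=$(git rev-parse HEAD)

# Diff between the fix and its parent localizes the treated line.
git diff "$fix^" "$fix" -- Scorer.java

# Appending "^" to the fix hash names the revision prior to the fix;
# check out the buggy version of the file from it.
git checkout -q "$fix^" -- Scorer.java
cat Scorer.java
```

The final cat prints the buggy line again, showing that the pre-fix revision has been recovered.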
Identification of Faulty Lines. The above process gives us a set of real-life defects along with the set of changes that fix them. To find the corresponding root causes, we perform a manual process based on the treatments of the defects. Kawrykow and Robillard (2011) have proposed an approach to remove non-essential changes and convert "dirty" treatments to "clean" treatments. However, their approach still does not recover the root causes of defects.
Our process for locating root causes could not be easily automated by a simple diff operation between two versions of the system (after and prior to the fix), for the following reasons. First, not all changes fix the bug (Kawrykow and Robillard, 2011); some, such as additions of new lines, removals of new lines, and changes in indentation, are only cosmetic changes that make the code aesthetically better. Figure 1 shows such an example. Second, even if all changes are essential, it is not straightforward to identify the defective lines from the fixes. Some fixes introduce additional code, and we need to find the corresponding faulty lines that are fixed by the additional code. We show several examples highlighting the process of extracting root causes from their treatments for a simple and a slightly more complicated case in Figures 2 and 4.
Figure 2 describes a bug fixing activity where one line of code is changed by modifying the operator == to !=. It is easy for us to identify the faulty line: it is the line that gets changed. A more complicated case is shown in Figure 4. There are four sets of faulty lines that could be inferred from the diff: one is line 259 (marked with *), where an unnecessary method call needs to be removed; a similar fault is at line 262; the third set of faulty lines is at lines 838-840, which are condition checks that should be removed; the fourth is at line 887, which misses a pre-condition check. For this case, the diff is much larger than the faulty lines that are manually identified.
Figure 4 illustrates the difficulties in automating the identification of faulty lines. To ensure the fidelity of identified root causes, we perform several iterations of manual inspection. For some ambiguous cases, several of the authors discussed and came to resolutions. Cases that are still deemed ambiguous (i.e., it is unclear or difficult to manually locate the faulty lines) are removed from this empirical study. We show such an example in Figure 3. This example shows a bug fixing activity where an if block is inserted into the code, and it is unclear which lines are really faulty.
At the end of the above process, we get the sets of faulty lines Faulty for all defects in our collection. Notation-wise, we refer to these sets of lines
Fig. 1: Example of Simple Cosmetic Changes. [Figure: buggy vs. fixed versions of AspectJ's org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/ast/ThisJoinPointVisitor.java; the changes only insert and delete commented-out System.err.println debug lines and adjust brace placement in an else-if statement.]

Fig. 2: Identification of Root Causes (Faulty Lines) from Treatments [Simple Case]. [Figure: Lucene 2.9, src/java/org/apache/lucene/search/Scorer.java, line 90 (marked with *): "return doc == NO_MORE_DOCS" in the buggy version becomes "return doc != NO_MORE_DOCS;" in the fixed version.]

Fig. 3: Identification of Root Causes (Faulty Lines) from Treatments [Ambiguous Case]. [Figure: AspectJ's org.aspectj/modules/weaver/src/org/aspectj/weaver/patterns/SignaturePattern.java; the fix inserts an if block (checking kind == Member.STATIC_INITIALIZATION and kind == Member.ADVICE) between lines 87 and 88 of the buggy version, and it is unclear which original lines are faulty.]

Fig. 4: Identification of Root Causes (Faulty Lines) from Treatments [Complex Case]. [Figure: AspectJ's org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/LazyMethodGen.java; the faulty lines (marked with *) are 259 and 262 (unnecessary lng.setStart(null)/lng.setEnd(null) calls that the fix deletes), 838-840 (condition checks on LocalVariableInstruction that should be removed), and 887 (a comparator return missing a pre-condition check for names starting with 'arg'); the diff is much larger than the manually identified faulty lines.]
using an index notation Faulty[]. We refer to the set of lines in Faulty that correspond to the ith defect as Faulty[i].
3.2 Extraction of Warnings
To evaluate the static bug-finding tools, we run the tools on the program versions before the defects get fixed. An ideal bug-finding tool would recover the faulty lines of the program, possibly along with other lines corresponding to other defects lying within the software system. Some of these warnings are false positives, while others are true positives.
For each defect, we extract the version of the program in the repository prior to the bug fix. We have such information already as an intermediate result of the root cause extraction process. We then run the bug-finding tools on these program versions. Because 13.42% of these versions could not be compiled, we remove them from our analysis, which is a threat to the validity of our study (see Section 4.3).
As described in Section 2, each of the bug-finding tools takes in a set of rules describing types of defects and flags code that matches them. By default, we enable all rules/defect types available in the tools, except that we exclude two rule sets from PMD: one related to Android development, and another whose XML configuration file could not be read by PMD (i.e., Coupling).
Each run produces a set of warnings, and each warning flags a set of lines of code. Notation-wise, we refer to the sets of lines of code flagged over all runs as Warning, and the set of lines of code flagged in the run for the ith defect as Warning[i]. Also, we refer to the sets of lines of code for all runs of a particular tool T as WarningT, and similarly we have WarningT[i].
3.3 Extraction of Missed Defects
With the faulty lines Faulty obtained through manual analysis and the lines flagged by the static bug-finding tools Warning, we can look for false negatives, i.e., actual reported and fixed defects that are missed by the tools.
To get these missed defects, for every ith defect, we take the intersection of the sets Faulty[i] and Warning[i] from all bug-finding tools, and the intersection between Faulty[i] and each WarningT[i]. If an intersection is an empty set, we say that the corresponding bug-finding tool misses the ith defect. If the intersection covers a proper subset of the lines in Faulty[i], we say that the bug-finding tool partially captures the ith defect. Otherwise, if the intersection covers all lines in Faulty[i], we say that the bug-finding tool fully captures the ith defect. We differentiate the partial and full cases as developers might be able to recover the other faulty lines, given that some of the faulty lines have been flagged.
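The miss/partial/full classification above can be sketched as a small routine; the class and method names are ours, not from the authors' tooling:

```java
import java.util.HashSet;
import java.util.Set;

public class DefectCoverage {
    enum Outcome { MISS, PARTIAL, FULL }

    // faulty: the manually identified faulty lines Faulty[i];
    // warned: the lines Warning[i] flagged by a tool on the pre-fix version.
    static Outcome classify(Set<Integer> faulty, Set<Integer> warned) {
        Set<Integer> common = new HashSet<>(faulty);
        common.retainAll(warned);          // Faulty[i] ∩ Warning[i]
        if (common.isEmpty()) {
            return Outcome.MISS;           // no faulty line is flagged
        }
        if (common.equals(faulty)) {
            return Outcome.FULL;           // every faulty line is flagged
        }
        return Outcome.PARTIAL;            // only a proper subset is flagged
    }
}
```

For example, if Faulty[i] = {259, 262} and a tool flags only line 262, the defect is partially captured.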
3.4 Overall Approach
Our overall approach is illustrated by the pseudocode in Figure 5. Our approach takes in a bug repository (e.g., Bugzilla or JIRA), a code repository (e.g., CVS or Git), and a bug-finding tool (e.g., FindBugs, PMD, JLint, CheckStyle, or JCSC). For each bug report in the repository, it performs the three steps described in the previous subsections: faulty line extraction, warning identification, and missed defect detection.
The first step corresponds to lines 3-8. We find the bug fix commit corresponding to the bug report. We identify the version prior to the bug fix commit. We perform a diff to find the differences between these two versions. Faulty lines are then extracted by manual analysis. The second step corresponds to lines 9-11. Here, we simply run the bug-finding tool and collect the lines of code flagged by the various warnings. Finally, step three is performed by lines 12-19. Here, we detect cases where the bug-finding tool misses, partially captures, or fully captures a defect. The final statistics are output at line 20.
Procedure IdentifyMissedDefects
Inputs:
    BugRepo  : Bug Repository
    CodeRepo : Code Repository
    BFTool   : Bug-Finding Tool
Output:
    Statistics of Defects that are Missed and Captured (Fully or Partially)
Method:
 1: Let Stats = {}
 2: For each bug report br in BugRepo
 3:   // Step 1: Extract faulty lines of code
 4:   Let fixC = br's corresponding fix commit in CodeRepo
 5:   Let bugC = Revision before fixC in CodeRepo
 6:   Let diff = The difference between fixC and bugC
 7:   Extract faulty lines from diff
 8:   Let Faulty_br = Faulty lines in bugC
 9:   // Step 2: Get warnings
10:   Run BFTool on bugC
11:   Let Warning_br = Flagged lines in bugC by BFTool
12:   // Step 3: Detect missed defects
13:   Let Common = Faulty_br ∩ Warning_br
14:   If Common = {}
15:     Add 〈br, miss〉 to Stats
16:   Else If Common = Faulty_br
17:     Add 〈br, full〉 to Stats
18:   Else
19:     Add 〈br, partial〉 to Stats
20: Output Stats
Fig. 5: Identification of Missed Defects.
4 Empirical Evaluation
In this section we present our research questions, datasets, empirical findings,and threats to validity.
4.1 Research Questions & Datasets
We would like to answer the following research questions:
RQ1 How many real-life reported and fixed defects from Lucene, Rhino, and AspectJ are missed by state-of-the-art static bug-finding tools?
RQ2 What types of warnings reported by the tools are most effective in detecting actual defects?
RQ3 What are some characteristics of the defects missed by the tools?
RQ4 How effective are the tools in finding defects of various severities?
RQ5 How effective are the tools in finding defects of various difficulties?
RQ6 How effective are the tools in finding defects of various types?
We evaluate five static bug-finding tools, namely FindBugs, JLint, PMD, CheckStyle, and JCSC, on three open source projects: Lucene, Rhino, and AspectJ. Lucene from the Apache Software Foundation1 is a general purpose text search engine library. Rhino from the Mozilla Foundation2 is an implementation of JavaScript written in Java. AspectJ from the Eclipse Foundation3 is an aspect-oriented extension of Java. We crawl JIRA for defects tagged for Lucene version 2.9. For Rhino and AspectJ, we analyze the iBugs repository prepared by Dallmeier and Zimmermann (2007). The average sizes of Lucene, Rhino, and AspectJ are around 265,822, 75,544, and 448,176 lines of code (LOC), respectively. Table 1 shows the numbers of unambiguous defects from the three datasets for which we are able to manually locate root causes, together with the total numbers of defects available in the datasets and the average number of faulty lines per defect.
Table 1: Number of Defects for Various Datasets

Dataset | # of Unambiguous Defects | # of Defects | Avg # of Faulty Lines Per Defect
--------+--------------------------+--------------+---------------------------------
Lucene  | 28                       | 57           | 3.54
Rhino   | 20                       | 32           | 9.1
AspectJ | 152                      | 350          | 4.07
1 http://lucene.apache.org/core/
2 http://www.mozilla.org/rhino/
3 http://www.eclipse.org/aspectj/
4.2 Experimental Results
4.2.1 RQ1: Number of Missed Defects
We show the number of defects missed by each of the five tools, and by all of them combined, for Lucene, Rhino, and AspectJ in Table 2. We do not summarize across software projects as the numbers of warnings for different projects differ greatly, and summarization across projects may not be meaningful. We first show the results for all defects, and then zoom into subsets of the defects that span a small number of lines of code, and those that are severe. Our goal is to evaluate the effectiveness of state-of-the-art static bug-finding tools in terms of their false negative rates. We would also like to evaluate whether false negative rates may be reduced if we use all five tools together.
Consider All Defects.
Lucene. For Lucene, as shown in Table 2, we find that with all five tools, 35.7% of all defects could be fully identified (i.e., all faulty lines Faulty[i] of a defect i are flagged). Another 28.6% of all defects could be partially identified (i.e., some but not all faulty lines in Faulty[i] are flagged). Still, 35.7% of all defects could not be flagged by the tools. Thus, the five tools are effective but not very successful in capturing Lucene defects. Among the tools, PMD captures the most defects, followed by FindBugs, CheckStyle, JCSC, and finally JLint.
Rhino. For Rhino, as shown in Table 2, we find that with all five tools, 95% of all defects could be fully identified. Only 5% of all defects could not be captured by the tools. Thus, the tools are very effective in capturing Rhino defects. Among the tools, FindBugs captures the most defects, followed by PMD, CheckStyle, JLint, and finally JCSC.
AspectJ. For AspectJ, as shown in Table 2, we find that with all five tools, 71.1% of all defects could be fully captured. Another 28.3% of all defects could be partially captured. Only 0.7% of the defects could not be captured by any of the tools. Thus, the tools are very effective in capturing AspectJ defects. PMD captures the most defects, followed by CheckStyle, FindBugs, JCSC, and finally JLint.
Also, to compare the five bug-finding tools, we show in Table 3 the average numbers of lines in one defect program version (over all defects) that are flagged by the tools for our subject programs. We note that PMD flags the most lines of code, and likely produces more false positives than the others, while JLint flags the fewest lines of code. With respect to the average sizes of the programs (see Column "Avg # LOC" in Table 4), the tools may flag 0.2% to 56.9% of the whole program as buggy; for example, JLint flags about 515.54 of Lucene's 265,822 lines (roughly 0.2%), while PMD flags about 42,999.3 of Rhino's 75,544 lines (roughly 56.9%).
Table 2: Percentages of Defects that are Missed, Partially Captured, and Fully Captured. The numbers in the parentheses indicate the numbers of actual faulty lines captured, if any.

Tools vs.    Lucene                           Rhino                            AspectJ
Programs     Miss    Partial     Full         Miss    Partial     Full         Miss    Partial      Full
FindBugs     71.4%   14.3% (5)   14.3% (8)    10.0%   0%          90.0% (179)  33.6%   19.7% (91)   46.7% (137)
JLint        82.1%   10.7% (3)   7.1% (4)     80.0%   10.0% (2)   10.0% (3)    97.4%   1.3% (2)     1.3% (2)
PMD          50.0%   14.3% (11)  35.7% (13)   15.0%   5.0% (18)   80.0% (157)  3.9%    27.6% (251)  68.4% (196)
CheckStyle   71.4%   14.3% (5)   14.3% (5)    65%     35% (47)    0%           32.2%   38.2% (272)  29.6% (61)
JCSC         64.3%   25% (15)    10.7% (3)    100%    0%          0%           86.8%   9.9% (38)    3.3% (5)
All          35.7%   28.6% (11)  35.7% (22)   5.0%    0%          95.0% (181)  0.7%    28.3% (274)  71.1% (206)
Table 3: Average Numbers of Lines Flagged Per Defect Version by Various Static Bug-Finding Tools.
Tools vs. Programs   Lucene       Rhino       AspectJ
FindBugs 33,208.82 40,511.55 83,320.09
JLint 515.54 330.6 720.44
PMD 124,281.5 42,999.3 198,065.51
Checkstyle 16,180.6 11,144.65 55,900.13
JCSC 16,682.46 3,573.55 22,010.77
All 126,367.75 52,123.05 220,263
Note that a program version may contain many other issues besides the defects in our dataset, and a tool may generate many warnings for the version. These partially explain the high average numbers of flagged lines in Table 3. To further understand these numbers, we present more statistics about the warnings generated by the tools for each of the three programs in Table 4. Column “Avg # Warning” gives the average number of warnings reported by each tool for each program. Column “Avg # Flagged Line Per Warning” is the average number of lines flagged by one warning report. Column “Avg # Unfixed” is the average number of lines of code that are flagged but are not among the buggy lines fixed in a version. Column “Avg # LOC” is the number of lines of code in a particular program (across all buggy versions).
We notice that the average number of warnings generated by PMD is high. FindBugs reports fewer warnings, but many of its warnings span many consecutive lines of code. Many flagged lines do not correspond to the buggy lines fixed in a version. These could be either false positives or bugs to be found in the future. We do not check the exact number of false positives, though, as doing so would require much manual labor (i.e., checking thousands of warnings in each of the hundreds of program versions), and false positives are the subject of other studies (e.g., Heckman and Williams, 2011). On the other hand, the high numbers of flagged lines raise concerns about the effectiveness of the warnings generated by the tools: Are the warnings effectively correlated with actual defects?
To answer this question, we create random “bug” finding tools that randomly flag some lines of code as buggy according to the distributions of the numbers of lines flagged by each warning generated by each of the bug-finding tools, and compare the bug-capturing effectiveness of such random tools with the actual tools. For each version of the subject programs and each of the actual tools, we run a random tool 10,000 times; each time, the random tool randomly generates the same number of warnings, following the distribution of the numbers of lines flagged by each warning; then, we count the numbers of defects missed, partially captured, and fully captured by the random tool; finally, we compare the effectiveness of the random tools with the actual tools by calculating the p-values shown in Table 5. A value x in each
Table 4: Unfixed Warnings by Various Defect Finding Tools

Software   Tool        Avg # Warning   Avg # Flagged Line   Avg # Unfixed   Avg # LOC
                                       Per Warning
Rhino      FindBugs    177.98          38.72                40,502.6        75,543.8
           JLint       34              1                    330.35
           PMD         11,864.95       21.12                42,990.55
           Checkstyle  34,958.75       1                    11,144.65
           JCSC        24,087          1                    3,573.55
           All         71,726.8        2.95                 52,114.3
Lucene     FindBugs    270.25          469.4                33,208.36       265,821.75
           JLint       685.21          1                    515.29
           PMD         39,993.68       11.74                124,280.64
           Checkstyle  30,779.71       1                    16,680.25
           JCSC        29,056.96       1                    16,681.82
           All         100,513.57      1.46                 126,366.89
AspectJ    FindBugs    802.29          381.76               83,318.59       448,175.94
           JLint       3,937           1                    720.41
           PMD         73,641.86       2.94                 198,062.57
           Checkstyle  93,341.5        1                    55,897.93
           JCSC        198,518.65      1                    22,010.49
           All         326,040.72      1.16                 220,260.31
cell in the table means that our random tool would have an x × 100% chance to obtain results at least as good as the actual tool for either partially or fully capturing the bugs. The values in the table imply that the actual tools may indeed detect more bugs than random tools, although they may produce many false positives. However, some tools on some programs, such as PMD on Lucene, may not be much better than random tools. If there were no correlation between the warnings generated by the actual tools and actual defects, the tools should not perform differently from the random tools. Our results show that this is not the case, at least for some tools on some programs.
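One way this comparison against random tools could be implemented is sketched below (an illustrative Python approximation; the paper does not spell out its exact sampling procedure, and all names here are ours). Each simulated warning flags a random run of consecutive lines whose length follows the real tool's per-warning size distribution:

```python
import random

def random_tool_pvalue(total_loc, faulty_lines, warning_sizes,
                       observed_captured, runs=10000, seed=0):
    """Fraction of random runs that capture at least as many faulty lines
    as the actual tool did (an empirical p-value).

    total_loc: lines of code in the program version
    faulty_lines: set of faulty line numbers of the defect
    warning_sizes: lines flagged by each warning of the actual tool
    observed_captured: number of faulty lines the actual tool captured
    """
    rng = random.Random(seed)
    at_least_as_good = 0
    for _ in range(runs):
        flagged = set()
        for size in warning_sizes:
            # flag `size` consecutive lines starting at a random position
            start = rng.randint(1, max(1, total_loc - size + 1))
            flagged.update(range(start, start + size))
        if len(faulty_lines & flagged) >= observed_captured:
            at_least_as_good += 1
    return at_least_as_good / runs
```

A small p-value means a random tool rarely matches the actual tool, i.e., the tool's warnings correlate with the defect beyond what chance flagging would achieve.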
Table 5: p-values comparing random tools against actual tools.
Tool        Program   Full      Partial or Full
FindBugs    Lucene    0.2766    0.1167
            Rhino     <0.0001   0.0081
            AspectJ   0.5248    0.0015
JLint       Lucene    0.0011    <0.0001
            Rhino     <0.0001   <0.0001
            AspectJ   0.0222    0.0363
PMD         Lucene    0.9996    1
            Rhino     0.9996    1
            AspectJ   0.9996    <0.0001
Checkstyle  Lucene    0.0471    0.0152
            Rhino     0.5085    1
            AspectJ   <0.0001   <0.0001
JCSC        Lucene    0.0152    1
            Rhino     0.3636    1
            AspectJ   0.0006    1
Consider Localized Defects.
Many of the defects that we analyze span more than a few lines of code. Thus, we further focus only on defects that can be localized to a few lines of code (at most five lines of code, which we call localized defects), and investigate the effectiveness of the various bug-finding tools. We show the percentages of localized defects missed by the five tools, for Lucene, Rhino, and AspectJ, in Table 6.
Lucene. Table 6 shows that the tools together fully identify 47.6% of all localized defects and partially identify another 14.3%. This is only slightly higher than the percentage for all defects (see Table 2). The ordering of the tools based on their ability to fully capture localized defects is the same.
Rhino. As shown in Table 6, all five tools together could fully identify 93.3% of all localized defects. This is slightly lower than the percentage for all defects (see Table 2).
AspectJ. Table 6 shows that the tools together fully capture 78.9% of all localized defects and miss only 0.8%. These are better than the percentages for all defects (see Table 2).
Table 6: Percentages of Localized Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     66.7%    14.3%     19.0%    13.3%    0%        86.7%    32.0%    14.8%     53.1%
JLint        80.9%    9.5%      9.5%     86.7%    0%        13.3%    96.9%    1.6%      1.6%
PMD          38.1%    14.3%     47.6%    20.0%    0%        80.0%    2.3%     21.1%     76.6%
Checkstyle   66.67%   19.05%    14.29%   92.31%   7.69%     0%       35.94%   28.9%     35.16%
JCSC         76.19%   9.52%     14.29%   100%     0%        0%       89.84%   6.25%     14.29%
All          38.1%    14.3%     47.6%    6.7%     0%        93.3%    0.8%     20.3%     78.9%
Consider Stricter Analysis.
We also perform a stricter analysis that requires not only that the faulty lines are covered by the warnings, but also that the type of the warnings is directly related to the faulty lines. For example, if the faulty line is a null pointer dereference, then the warning must explicitly say so. Warning reports that cover this line but do not give a reasonable explanation of the warning are not counted as capturing the fault. Again, we manually analyze whether the warnings strictly capture each defect. We focus on defects that are localized to one line of code. There are 7, 5, and 66 such one-line bugs for Lucene, Rhino, and AspectJ, respectively.
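The stricter criterion can be encoded as follows (an illustrative sketch only; in the study this judgment is made manually, and the field names here are ours):

```python
def strictly_captured(fault, warnings):
    """A warning strictly captures a fault only when it covers the faulty
    line AND its category matches the fault type (e.g., a null pointer
    dereference fault needs a null-dereference warning, not just any
    warning that happens to cover the same line)."""
    return any(fault["line"] in w["lines"] and w["category"] == fault["type"]
               for w in warnings)
```

Under this definition, a style warning that happens to cover the faulty line still counts as a miss.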
We show the results of our analysis for Lucene, Rhino, and AspectJ in Table 7. We notice that under this stricter requirement, very few of the defects fully captured by the tools (see Column “Full”) are strictly captured by the same tools (see Column “Strict”). Thus, although the faulty lines might be flagged by the tools, the warning messages from the tools may not contain sufficient information for developers to understand the defects.
Table 7: Numbers of One-Line Defects that are Strictly Captured versus Fully Captured
Tool        Lucene         Rhino          AspectJ
            Strict  Full   Strict  Full   Strict  Full
FindBugs    0       4      1       4      5       42
PMD         0       7      0       5      1       65
JLint       0       0      0       1      1       2
Checkstyle  0       3      0       0      0       33
JCSC        0       3      0       0      0       5
4.2.2 RQ2: Effectiveness of Different Warnings
We show the effectiveness of the various warning families of FindBugs, PMD, JLint, CheckStyle, and JCSC in flagging defects in Tables 8, 9, 10, 11, and 12, respectively. We highlight the top-5 warning families in terms of their ability to fully capture the root causes of the defects.
FindBugs. For FindBugs, as shown in Table 8, in terms of low false negative rates, we find that the best warning families are: Style, Performance, Malicious Code, Bad Practice, and Correctness. Warnings in the style category include a switch statement with one case branch falling through to the next case branch, a switch statement with no default case, an assignment to a local variable that is never used, an unread public or protected field, a referenced variable that contains a null value, etc. Warnings belonging to the malicious code category include a class attribute that should be made final, a class attribute that should be package protected, etc. Furthermore, we find that a violation of each of these warnings would likely flag the entire class that violates it. Thus, a lot of lines
of code would be flagged, giving these warning families a higher chance of capturing the defects. Warnings belonging to the bad practice category include comparison of String objects using == or !=, a method ignoring an exceptional return value, etc. Performance-related warnings include a method concatenating strings using + instead of StringBuffer, etc. Correctness-related warnings include a field that masks a superclass’s field, invocation of toString on an array, etc.
Table 8: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of FindBugs
Warning Family Miss Partial Full
Style            46.5%   9.5%   44.0%
Performance      62.0%   9.0%   29.0%
Malicious Code   68.5%   7.5%   24.0%
Bad Practice     73.5%   4.5%   22.0%
Correctness      74.0%   6.5%   19.5%
PMD. For PMD, as shown in Table 9, we find that the most effective warning categories are: Code Size, Design, Controversial, Optimization, and Naming. Code size warnings flag problems related to code being too large or complex, e.g., the number of acyclic execution paths exceeding 200, a long method, etc. Code size is correlated with defects but does not really inform the type of defect that needs a fix. Design warnings identify suboptimal code implementations; these include simplification of boolean returns, a missing default case in a switch statement, deeply nested if statements, etc., which may not be correlated with defects. Controversial warnings include unnecessary constructors, null assignments, assignments in operands, etc. Optimization warnings cover best practices for improving the efficiency of the code. Naming warnings cover rules pertaining to the preferred names of various program elements.
Table 9: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of PMD
Warning Family Miss Partial Full
Code Size        21.5%   5.0%    73.5%
Design           30.0%   8.5%    61.5%
Controversial    27.5%   19.5%   53.0%
Optimization     38.5%   23.0%   38.5%
Naming           60.0%   23.0%   17.0%
JLint. For JLint, as shown in Table 10, we have three categories: Inheritance, Synchronization, and Data Flow. We find that inheritance warnings are more effective than data flow warnings, which in turn are more effective than synchronization warnings, in detecting defects. Inheritance warnings relate to class inheritance issues, such as: a new
method in a subclass being created with the same name but different parameters as one inherited from the superclass, etc. Synchronization warnings relate to errors in multi-threaded applications, in particular problems related to conflicting use of shared data by multiple threads, such as potential deadlocks, required but missing synchronized keywords, etc. Data flow warnings relate to problems that JLint detects by performing data flow analysis on Java bytecode, such as a null referenced variable, type cast misuse, comparison of Strings using object references, etc.
Table 10: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of JLint
Warning Family    Miss    Partial   Full
Inheritance       93.0%   5.0%      2.0%
Data Flow         93.5%   5.0%      1.5%
Synchronization   99.5%   0%        0.5%
CheckStyle. For CheckStyle, as shown in Table 11, the five most effective categories are: Sizes, Regular expression, White space, Java documentation, and Miscellaneous. A sizes warning occurs when a reasonable limit on program size is exceeded. A regular expression warning occurs when a specific standard pattern is (or is not) present in a code file. A white space warning occurs when there is incorrect whitespace around generic tokens. A Java documentation warning occurs when there is no javadoc or the javadoc does not conform to the standard. Miscellaneous warnings cover issues not included in the other warning categories. Although likely unrelated to the actual causes of the defects, the top warning family is quite effective at catching the defects.
Table 11: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of CheckStyle
Warning Family Miss Partial Full
Sizes                26%     23.5%   50.5%
Regular expression   39.5%   21%     39.5%
White space          34%     28.5%   37.5%
Java documentation   42.5%   32%     25.5%
Miscellaneous        54.5%   28%     17.5%
JCSC. For JCSC, as shown in Table 12, the five most effective categories are: General, Java documentation, Metrics, Field, and Method. General warnings cover common coding issues, such as exceeding the code length limit, a missing default in a switch statement, an empty catch statement, etc. Java documentation warnings cover issues related to the existence of javadoc and its conformance
to the standard. Metrics warnings cover issues related to metrics used for measuring code quality, such as NCSS and NCC. Field warnings cover issues related to the fields of a class, such as a bad modifier order. Method warnings cover issues related to the methods of a class, such as a bad number of method arguments. The warnings generated by JCSC are generally ineffective at detecting defects in the three systems.
Table 12: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of JCSC
Warning Family Miss Partial Full
General              76.5%   16.5%   7%
Java documentation   91%     7%      2%
Metrics              95%     4.5%    0.5%
Field                94.5%   5%      0.5%
Method               98%     2%      0%
4.2.3 RQ3: Characteristics of Missed Defects
There are 12 defects that are missed by all five tools. These defects involve logical or functionality errors, and thus they are difficult to detect for static bug-finding tools that have no knowledge of the specifications of the systems. We can categorize the defects into several categories: method related defects (addition or removal of method calls, changes to parameters passed in to methods), conditional checking related defects (addition, removal, or changes to pre-condition checks), assignment related defects (wrong value being assigned to variables), return value related defects (wrong value being returned), and object usage related defects (missing type cast, etc.). The distribution of the defect categories that are missed by all five tools is listed in Table 13.
Table 13: Distribution of Missed Defect Types
Type Number of Defects
Assignment related defects             1
Conditional checking related defects   6
Return value related defects           1
Method related defects                 3
Object usage related defects           1
A sample defect missed by all five tools is shown in Figure 6; it involves an invocation of a wrong method.
Rhino - mozilla/js/rhino/xmlimplsrc/org/mozilla/javascript/xmlimpl/XML.java
Buggy version (Line 3043): return createEmptyXML(lib);
Fixed version (Line 3043): return createFromJS(lib, "");

Fig. 6: Example of a Defect Missed by All Tools
4.2.4 RQ4: Effectiveness on Defects of Various Severity
Different defects are often labeled with different severity levels. Lucene is tracked using the Jira bug tracking system, while Rhino and AspectJ are tracked using the Bugzilla bug tracking system. Jira has the following predefined severity levels: blocker, critical, major, minor, trivial. Bugzilla has the following predefined severity levels: blocker, critical, major, normal, minor, trivial. Jira has no normal severity level. To standardize the two, we group blocker, critical, and major into a severe group, and normal, minor, and trivial into a non-severe group. Table 14 describes the distribution of severe and non-severe defects among the 200 defects that we analyze in this study.
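The grouping can be written down directly (a minimal Python sketch of the mapping just described; Jira simply never produces the "normal" label):

```python
SEVERE = {"blocker", "critical", "major"}
NON_SEVERE = {"normal", "minor", "trivial"}  # "normal" occurs only in Bugzilla

def severity_group(label):
    """Map a Jira/Bugzilla severity label to our two groups."""
    label = label.lower()
    if label in SEVERE:
        return "severe"
    if label in NON_SEVERE:
        return "non-severe"
    raise ValueError("unknown severity: " + label)
```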
Table 14: Distribution of Defects of Various Severity Levels
Type Number of Defects
Severe       45
Non-Severe   155
We investigate the number of false negatives for each category of defects. We show the results for severe and non-severe defects in Tables 15 and 16, respectively. We find that PMD captures the most severe and non-severe defects in most datasets. For almost all the datasets that we analyze, we also find that the tools that more effectively capture severe defects in a dataset also more effectively capture non-severe defects in the same dataset. For Lucene, PMD and JCSC are the top two tools for severe defects, missing only 33.3% and 50% of the defects, respectively. For non-severe defects in Lucene, PMD and FindBugs are the top two tools, each missing only 62.5% of the defects. For Rhino, PMD and FindBugs are the top two tools: they capture severe defects without missing any, and miss only 16.67% and 11.11% of the non-severe defects, respectively. For AspectJ, PMD and CheckStyle are the top two tools for both severe defects (missing 3.2% and 29.03% of the defects, respectively) and non-severe defects (missing 2.48% and 33.06% of the defects, respectively). FindBugs also effectively captures non-severe defects in AspectJ, missing only 33.06% of them.
Based on the percentage of bugs that could be partially or fully capturedby all the five tools together, the tools could capture severe defects more
Table 15: Percentages of Severe Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     83.3%    0%        16.7%    0%       0%        100.0%   35.5%    25.8%     38.7%
JLint        83.3%    0%        16.7%    100.0%   0%        0%       96.8%    3.2%      0%
PMD          33.3%    8.3%      58.3%    0%       0%        100.0%   3.2%     35.5%     61.3%
CheckStyle   66.67%   8.33%     25%      50%      50%       0%       29.03%   51.61%    19.35%
JCSC         50%      25%       25%      100%     0%        0%       96.77%   3.23%     0%
All          16.67%   25%       58.33%   0%       0%        100.0%   0%       38.71%    61.29%
Table 16: Percentages of Non-Severe Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     62.5%    25%       12.5%    11.11%   0%        88.89%   33.06%   18.18%    48.76%
JLint        81.25%   18.75%    0%       83.33%   11.11%    5.56%    97.52%   0.83%     1.65%
PMD          62.5%    18.75%    18.75%   16.67%   5.56%     77.78%   2.48%    27.27%    70.25%
CheckStyle   75%      18.75%    6.25%    66.67%   33.33%    0%       33.06%   34.71%    32.23%
JCSC         75%      25%       0%       100%     0%        0%       84.3%    11.57%    4.13%
All          50%      31.25%    18.75%   5.56%    0%        94.44%   0.83%    25.62%    73.55%
effectively than non-severe defects. For Lucene, only 16.67% of severe defects but 50% of non-severe defects could not be captured with all five tools. For Rhino and AspectJ, all severe defects could be captured using all five tools, and only 5.56% of Rhino’s and 0.83% of AspectJ’s non-severe defects could not be captured.
Summary. Overall, we find that among the five tools, PMD captures the most defects, both severe and non-severe, in most datasets. Also, the five tools together capture severe defects more effectively than non-severe defects.
4.2.5 RQ5: Effectiveness on Defects of Various Difficulties
Some defects are more difficult to fix than others. Difficulty could be measuredin various ways: time needed to fix a defect, number of bug resolvers involved inthe bug fixing process, number of lines of code churned to fix a bug, and so on.We investigate these different dimensions of difficulty and analyze the numberof false negatives on defects of various difficulties in the following paragraphs.
Consider Time to Fix a Defect.
First, we measure difficulty in terms of the time it takes to fix a defect. Time to fix a defect has been studied in a number of previous studies (Kim and Whitehead Jr, 2006; Weiss et al, 2007; Hosseini et al, 2012). Following Thung et al (2012), we divide time to fix a defect into 2 classes: ≤30 days, and >30 days.
We show the results for ≤30 days and >30 days defects in Tables 17 and 18, respectively. Generally, we find that >30 days defects can be better captured by the tools than ≤30 days defects. By using the five tools together (i.e., the “All” rows in the tables), for defects that are fixed within 30 days, we could partially or fully capture all defects of AspectJ, but miss 14.29% and 40% of the defects of Rhino and Lucene, respectively. For defects that are fixed in more than 30 days, all defects of Lucene and Rhino can be captured by the five tools together, while only 0.79% of AspectJ defects are missed.
Also, we find that among all five tools, PMD most effectively captures both defects fixed within 30 days and those fixed in more than 30 days. The top two most effective tools in capturing defects fixed within 30 days are similar to those for defects fixed in more than 30 days. For Lucene’s defects, PMD and JCSC are the most effective tools for both groups: only 56% and 68% of defects fixed within 30 days could not be captured by PMD and JCSC, respectively; also, PMD could capture all of the defects fixed in more than 30 days, while JCSC, FindBugs, JLint, and CheckStyle could not capture 33.33% of the >30 days defects. For Rhino’s defects, PMD and FindBugs are the most effective tools for both groups: only 28.57% of defects fixed within 30 days could not be identified by PMD and FindBugs; also, FindBugs could capture all of Rhino’s defects fixed in more than 30 days, and only 7.69% of
Table 17: Percentages of Defects Fixed Within 30 Days that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     76%      8%        16%      28.57%   0%        71.43%   11.54%   23.08%    65.38%
JLint        88%      4%        8%       85.71%   0%        14.29%   96.15%   0%        3.85%
PMD          56%      8%        36%      28.57%   0%        71.43%   0%       26.92%    73.08%
CheckStyle   76%      8%        16%      85.71%   14.29%    0%       38.46%   30.77%    30.77%
JCSC         68%      20%       12%      100%     0%        0%       96.15%   3.85%     0%
All          40%      24%       36%      14.29%   0%        85.71%   0%       23.08%    76.92%
Table 18: Percentages of Defects Fixed Above 30 Days that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     33.33%   66.67%    0%       0%       0%        100%     38.1%    19.05%    42.86%
JLint        33.33%   66.67%    0%       84.62%   15.38%    0%       97.62%   1.59%     0.79%
PMD          0%       66.67%    33.33%   7.69%    7.69%     84.62%   2.38%    30.16%    67.46%
CheckStyle   33.33%   66.67%    0%       53.85%   46.15%    0%       30.95%   39.68%    29.37%
JCSC         33.33%   66.67%    0%       100%     0%        0%       84.92%   11.11%    3.97%
All          0%       66.67%    33.33%   0%       0%        100.0%   0.79%    29.37%    69.84%
the >30 days defects could not be captured by PMD. For AspectJ’s defects, PMD, FindBugs, and CheckStyle are the most effective tools for both groups: only 38.46%, 11.54%, and 0% of defects fixed within 30 days could not be identified by CheckStyle, FindBugs, and PMD, respectively; also, only 38.1%, 30.95%, and 2.38% of defects fixed in more than 30 days could not be identified by FindBugs, CheckStyle, and PMD, respectively.
Summary. Overall, we find that the five tools used together are more effective in capturing defects that are fixed in more than 30 days than those fixed within 30 days. This shows that the tools are beneficial, as they can capture the more difficult bugs. Also, among all five tools, PMD most effectively captures both defects fixed within 30 days and those fixed in more than 30 days.
Consider Number of Resolvers for a Defect.
Next, we measure difficulty in terms of the number of bug resolvers. The number of bug resolvers has been studied in a number of previous studies (e.g., Wu et al, 2011b; Xie et al, 2012; Xia et al, 2013). Based on the number of bug resolvers, we divide bugs into 2 classes: ≤2 people, and >2 people. A simple bug typically has only 2 resolvers: a triager who assigns the bug report to a fixer, and the fixer who fixes the bug.
We show the results for defects with ≤2 and >2 resolvers in Tables 19 and 20, respectively. When using the 5 tools together, we find that they are almost equally effective in capturing ≤2 and >2 resolvers defects for the Lucene and AspectJ datasets. For the Lucene dataset, using all 5 tools, we miss the same percentage (i.e., 35.71%) of ≤2 and >2 resolvers defects. For the AspectJ dataset, using all 5 tools, we miss 0% and 1.23% of ≤2 and >2 resolvers defects, respectively. However, for the Rhino dataset, using all 5 tools together, we can more effectively capture >2 resolvers defects than ≤2 resolvers defects, i.e., 0% of >2 resolvers defects and 25% of ≤2 resolvers defects are missed.
Also, we find that among all five tools, PMD most effectively captures both ≤2 and >2 resolvers defects. The top two most effective tools in capturing ≤2 resolvers defects are similar to those for >2 resolvers defects. For Lucene’s defects, PMD and JCSC are the most effective tools for both groups: only 42.86% and 57.14% of ≤2 resolvers defects could not be captured by PMD and JCSC, respectively; also, only 57.14% and 71.43% of >2 resolvers defects could not be identified by PMD and JCSC, respectively. For Rhino’s defects, PMD and FindBugs are the most effective tools for both groups: only 25% and 50% of ≤2 resolvers defects could not be captured by PMD and FindBugs; also, FindBugs could capture all of Rhino’s >2 resolvers defects, while only 12.5% of them could not be captured by PMD. For AspectJ’s defects, PMD, FindBugs, and CheckStyle are the most effective tools for both ≤2 and >2 resolvers defects: only 35.21%, 33.8%, and 2.82% of ≤2 resolvers defects
Table 19: Percentages of Defects Fixed by at Most 2 Resolvers that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     64.29%   21.43%    14.29%   50%      0%        50%      33.8%    18.31%    47.89%
JLint        64.29%   21.43%    14.29%   75%      0%        25%      97.18%   1.41%     1.41%
PMD          42.86%   14.29%    42.86%   25%      25%       50%      2.82%    26.76%    70.42%
CheckStyle   57.14%   14.29%    28.57%   75%      25%       0%       35.21%   35.21%    29.58%
JCSC         57.14%   21.43%    21.43%   100%     0%        0%       88.73%   8.45%     2.82%
All          35.71%   21.43%    42.86%   25%      0%        75%      0%       25.35%    74.65%
Table 20: Percentages of Defects Fixed by More than 2 Resolvers that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     78.57%   7.14%     14.29%   0%       0%        100%     33.33%   20.99%    45.68%
JLint        100%     0%        0%       87.5%    12.5%     0%       97.53%   1.23%     1.23%
PMD          57.14%   14.29%    28.57%   12.5%    0%        87.5%    1.23%    32.1%     66.67%
CheckStyle   85.71%   14.29%    0%       62.5%    37.5%     0%       29.63%   40.74%    29.63%
JCSC         71.43%   28.57%    0%       100%     0%        0%       85.19%   11.11%    3.7%
All          35.71%   35.71%    28.57%   0%       0%        100.0%   1.23%    30.86%    67.9%
could not be captured by CheckStyle, FindBugs, and PMD, respectively; also, only 33.33%, 29.63%, and 1.23% of AspectJ’s >2 resolvers defects could not be captured by FindBugs, CheckStyle, and PMD, respectively.
Summary. Overall, we find that by using the five tools together, we can capture >2 resolvers defects more effectively than ≤2 resolvers defects (at least on the Rhino dataset). This shows that the tools are beneficial, as they can capture the more difficult bugs. Also, among all five tools, PMD most effectively captures both ≤2 and >2 resolvers defects.
Consider Number of Lines of Code Churned for a Defect.
We also measure difficulty in terms of the number of lines of code churned. The number of lines of code churned has been investigated in a number of previous studies, e.g., (Nagappan and Ball, 2005; Giger et al, 2011). Based on the number of lines of code churned, we divide the bugs into 2 classes: ≤10 lines churned, and >10 lines churned.
We show the results for ≤10 and >10 lines churned defects in Tables 21 and 22. The performance of the 5 tools (used together) in capturing ≤10 lines churned defects is similar to their performance in capturing >10 lines churned defects. For Lucene, using the 5 tools together, we can capture ≤10 lines churned defects better than >10 lines churned defects (33.33% vs. 38.46% of the defects are missed, respectively). However, for Rhino, using the 5 tools together, we can capture >10 lines churned defects better than ≤10 lines churned defects (7.14% vs. 0% of the defects are missed, respectively). For AspectJ, using the 5 tools together, we can capture >10 and ≤10 lines churned defects almost equally well (0.78% vs. 0% of the defects are missed, respectively).
We find that among all five tools, PMD most effectively captures both ≤10 and >10 lines churned defects. The top two most effective tools in capturing ≤10 lines churned defects, however, may differ from those capturing >10 lines churned defects.
For Lucene’s defects, PMD and CheckStyle are the most effective tools for ≤10 lines churned defects (i.e., only 40% and 60% of the defects are missed by PMD and CheckStyle, respectively), while PMD and JCSC are the most effective tools for >10 lines churned defects (i.e., only 61.54% of the defects are missed by each). For Rhino’s defects, PMD and FindBugs are the most effective tools for both groups: both tools capture all of Rhino’s >10 lines churned defects, and only 21.43% and 14.29% of Rhino’s ≤10 lines churned defects could not be captured by them. For AspectJ’s defects, PMD, FindBugs, and CheckStyle are the most effective tools for both groups: only 34.11%, 33.33%, and 2.33% of ≤10 lines churned defects could not be identified by CheckStyle, FindBugs, and PMD, respectively; also, only 34.78%, 21.74%, and 0% of AspectJ’s >10 lines churned defects could not be identified by FindBugs, CheckStyle, and PMD, respectively.
Table 21: Percentages of Defects Fixed by Churning at Most 10 Lines that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     66.67%   20%       13.33%   14.29%   0%        85.71%   33.33%   19.38%    47.29%
JLint        80%      13.33%    6.67%    92.86%   0%        7.14%    96.9%    1.55%     1.55%
PMD          40%      20%       40%      21.43%   0%        78.57%   2.33%    27.13%    70.54%
CheckStyle   60%      20%       20%      85.71%   14.29%    0%       34.11%   32.56%    33.33%
JCSC         66.67%   20%       13.33%   100%     0%        0%       89.15%   6.98%     3.88%
All          33.33%   26.67%    40%      7.14%    0%        92.86%   0.78%    25.58%    73.64%
Table 22: Percentages of Defects Fixed by Churning More than 10 Lines that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     76.92%   7.69%    15.38%    0%       0%       100%      34.78%   21.74%   43.48%
JLint        84.62%   7.69%    7.69%     66.67%   33.33%   0%        100%     0%       0%
PMD          61.54%   7.69%    30.77%    0%       16.67%   83.33%    0%       43.48%   56.52%
CheckStyle   84.62%   7.69%    7.69%     16.67%   83.33%   0%        21.74%   69.57%   8.7%
JCSC         61.54%   30.77%   7.69%     100%     0%       0%        73.91%   26.09%   0%
All          38.46%   30.77%   30.77%    0%       0%       100%      0%       43.48%   56.52%
Table 23: Distribution of Defect Types

Type                                Number of Defects
Assignment related defects          27
Conditional related defects         127
Return related defects              11
Method invocation related defects   28
Object usage related defects        7
Summary. Overall, we find that the performance of the 5 tools (used together) in capturing ≤10 lines churned defects is similar to that in capturing >10 lines churned defects. Also, among the five tools, PMD most effectively identifies both ≤10 and >10 lines churned defects.
4.2.6 RQ6: Effectiveness on Defects of Various Types
We consider a rough categorization of defects based on the type of the program elements that are affected by a defect. We categorize defects into the following families (or types): assignment, conditional, return, method invocation, and object usage. A defect belongs to a family if the statement responsible for the defect contains a corresponding program construct. For example, if a defect affects an “if” statement, it belongs to the conditional family. Since a defect can span multiple program statements of different types, it can belong to multiple families; for such cases, we analyze the bug fix and choose one dominant family. Table 23 describes the distribution of defect types among the 200 defects that we analyze in this study.
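The family assignment described above can be illustrated with a keyword heuristic. The regular expressions below are hypothetical simplifications for illustration only; the actual categorization in the study was done manually on the bug fixes. The priority order stands in for the manual choice of one dominant family when a statement matches several.

```python
import re

# Hypothetical patterns for each defect family, in dominance order.
FAMILY_PATTERNS = [
    ("conditional",       re.compile(r"\b(if|else|switch|case)\b")),
    ("return",            re.compile(r"\breturn\b")),
    ("method invocation", re.compile(r"\w+\s*\(")),
    ("assignment",        re.compile(r"[^=!<>]=[^=]")),
    ("object usage",      re.compile(r"\bnew\b|\.\w+")),
]

def dominant_family(statement):
    """Return the first (dominant) family whose construct appears in the
    faulty statement, or 'other' if none matches."""
    for name, pattern in FAMILY_PATTERNS:
        if pattern.search(statement):
            return name
    return "other"
```

For example, dominant_family("count = count + 1;") returns "assignment", while a statement containing an "if" is classified as conditional even if it also invokes a method.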
We investigate the number of false negatives for each category of defects. We show the results for assignment related defects in Table 24. We find that using the five tools together, we could effectively capture (partially or fully) assignment related defects in all datasets: we capture all assignment related defects in Rhino and AspectJ, and only 16.67% of Lucene’s assignment related defects are missed by the tools. We also find that PMD is the most effective tool in capturing assignment related defects, while JLint and JCSC are the least effective. For Lucene, 33.33%, 66.66%, and 66.67% of its assignment related defects are missed by PMD, FindBugs, and CheckStyle, respectively. PMD and FindBugs capture Rhino’s assignment related defects without missing any, and only 11.11% of AspectJ’s assignment related defects are missed by these tools. Thus, besides PMD, FindBugs is also effective in capturing assignment related defects.
We show the results for conditional related defects in Table 25. We find that using the five tools together, we capture all conditional related defects in Rhino, and only 0.99% of AspectJ’s and 38.46% of Lucene’s conditional related defects are missed by the tools. We also find that PMD is the most effective tool in capturing conditional related defects, while JLint is the least effective. For Lucene, PMD misses only 46.15% of conditional related defects, while the other tools miss more than 50% of
Table 24: Percentages of Assignment Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     66.66%   16.67%   16.67%    0%       0%       100%      11.11%   11.11%   77.78%
JLint        100%     0%       0%        100%     0%       0%        94.44%   5.56%    0%
PMD          33.33%   16.67%   50%       0%       0%       100%      11.11%   5.56%    83.33%
CheckStyle   66.67%   16.67%   16.67%    33.33%   66.67%   0%        38.89%   27.78%   33.33%
JCSC         83.33%   16.67%   0%        100%     0%       0%        88.89%   11.11%   0%
All          16.67%   33.33%   50%       0%       0%       100%      0%       5.56%    94.44%
Table 25: Percentages of Conditional Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     61.54%   15.38%   23.08%    7.69%    0%       92.31%    30.69%   23.76%   45.54%
JLint        76.92%   15.38%   7.69%     84.62%   15.38%   0%        98.02%   0.99%    0.99%
PMD          46.15%   15.38%   38.46%    15.38%   7.69%    76.92%    0.99%    34.65%   64.36%
CheckStyle   69.23%   15.38%   15.38%    69.23%   30.77%   0%        30.69%   39.6%    29.7%
JCSC         61.54%   23.08%   15.38%    100%     0%       0%        92.08%   6.93%    0.99%
All          38.46%   23.08%   38.46%    0%       0%       100%      0.99%    33.66%   65.35%
Table 26: Percentages of Return Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     0%       100%     0%        50%      0%       50%       75%      12.5%    12.5%
JLint        0%       100%     0%        50%      0%       50%       100%     0%       0%
PMD          0%       100%     0%        50%      0%       50%       0%       12.5%    87.5%
CheckStyle   0%       100%     0%        100%     0%       0%        37.5%    37.5%    25%
JCSC         0%       100%     0%        100%     0%       0%        62.5%    25%      12.5%
All          0%       100%     0%        50%      0%       50%       0%       12.5%    87.5%
the defects. PMD and FindBugs capture Rhino’s conditional related defects effectively, missing only 15.38% and 7.69% of them respectively, and only 30.69% and 0.99% of AspectJ’s conditional related defects are missed by FindBugs and PMD, respectively. Thus, aside from PMD, FindBugs is also effective in capturing conditional related defects.
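Wherever a table reports results for "the five tools together", a defect counts against the union of all tools' warnings. A minimal sketch of this combination, under the same hypothetical data shapes as before (each tool contributing a set of flagged lines), might look like:

```python
def combined_classification(faulty_lines, warnings_per_tool):
    """Classify a defect against the union of several tools' warnings.
    warnings_per_tool maps a tool name to the set of lines it flags;
    the names and shapes are illustrative, not the authors' actual data."""
    union = set().union(*warnings_per_tool.values())
    covered = faulty_lines & union
    if not covered:
        return "miss"
    return "full" if covered == faulty_lines else "partial"
```

Under this scheme a defect that no single tool fully covers can still be fully captured jointly, e.g. when FindBugs flags one faulty line and PMD the other.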
We show the results for return related defects in Table 26. We find that using the five tools together captures all return related defects in AspectJ and Lucene; however, the tools are less effective on return related defects in Rhino (50% missed defects). We also find that PMD is the most effective tool in capturing return related defects. For Lucene, all tools capture (partially or fully) all the return related defects. For Rhino, 50% of return related defects are missed by PMD, FindBugs, and JLint, while the other tools miss all of them (100% missed defects). For AspectJ, PMD captures all return related defects without missing any, and CheckStyle misses only 37.5% of the defects. The other tools are less effective on AspectJ’s return related defects, as they miss more than 50% of the defects.
We show the results for method invocation related defects in Table 27. We find that using the five tools together captures all method invocation defects of Rhino and AspectJ, but the tools are less effective on method invocation related defects in Lucene (50% missed defects). We also find that PMD is the most effective tool in capturing method invocation related defects. PMD captures all method invocation related defects in Rhino and AspectJ, but it misses 66.67% of Lucene’s method invocation related defects. The other 4 tools each miss at least 66.67% of Lucene’s method invocation related defects. Other than PMD, FindBugs also captures all of Rhino’s method invocation related defects without missing any, while the other 3 tools miss all of Rhino’s method invocation related defects. For AspectJ, CheckStyle also captures the defects effectively with only 28.57% missed, while the other 3 tools each miss more than 40% of AspectJ’s method invocation related defects.
We show the results for object usage related defects in Table 28. We find that using the five tools together, we capture all object usage related defects of Rhino and AspectJ, but the tools are less effective on object usage related defects of Lucene (50% missed defects). We also find that PMD is generally the most effective tool in capturing object usage related defects. PMD captures all object usage related defects in Rhino and AspectJ, but it misses all of Lucene’s object usage related defects. Only JCSC captures 50% of Lucene’s object usage related defects, while the other tools miss these defects entirely. Other than PMD, FindBugs and CheckStyle also capture all of Rhino’s object usage related defects without missing any, while the other tools miss all of Rhino’s object usage related defects. For AspectJ, the tools other than PMD are less effective on object usage related defects: they miss 50% or more of the defects.
Table 27: Percentages of Method Invocation Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     100%     0%       0%        0%       0%       100%      42.86%   14.29%   42.86%
JLint        83.33%   0%       16.67%    100%     0%       0%        95.24%   0%       4.76%
PMD          66.67%   0%       33.33%    0%       0%       100%      0%       33.33%   66.67%
CheckStyle   83.33%   0%       16.67%    100%     0%       0%        28.57%   38.1%    33.33%
JCSC         66.67%   16.67%   16.67%    100%     0%       0%        76.19%   9.52%    14.29%
All          50%      16.67%   33.33%    0%       0%       100%      0%       28.57%   71.43%
Table 28: Percentages of Object Usage Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     100%     0%       0%        0%       0%       100%      75%      0%       25%
JLint        100%     0%       0%        100%     0%       0%        100%     0%       0%
PMD          100%     0%       0%        0%       0%       100%      0%       25%      75%
CheckStyle   100%     0%       0%        0%       100%     0%        50%      50%      0%
JCSC         50%      50%      0%        100%     0%       0%        50%      50%      0%
All          50%      50%      0%        0%       0%       100%      0%       25%      75%
Summary. Overall, we find that PMD outperforms the other tools in capturing various types of defects. The effectiveness of the other tools varies across the datasets; for example, FindBugs is more effective in capturing defects in Rhino, while CheckStyle is more effective in capturing defects in AspectJ. JLint and JCSC are often the least effective in capturing defects in the datasets. Based on the percentage of missed defects, averaging across the defect families, the 5 bug-finding tools (used together) more effectively capture assignment defects, followed by conditional defects. They are less effective in capturing return, method invocation, and object usage related defects.
4.3 Threats to Validity
Threats to internal validity may include experimental biases. We manually extract faulty lines from the changes that fix them; this manual process might be error-prone. We also manually categorize defects into their families; this manual process might also be error-prone. To reduce this threat, we have checked and refined the results. We also exclude defects that could not be unambiguously localized to the responsible faulty lines of code, as well as some versions of our subject programs that cannot be compiled. There might also be implementation errors in the various scripts and code used to collect and evaluate defects.
Threats to external validity are related to the generalizability of our findings. In this work, we only analyze five static bug-finding tools and three open source Java programs. We analyze only one version of the Lucene dataset; for Rhino and AspectJ, we only analyze defects that are available in iBugs. Also, we investigate only defects that get reported and fixed. In the future, we could analyze more programs and more defects to reduce selection bias. We also plan to investigate more bug-finding tools and programs in various languages.
5 Related Work
We summarize related studies on bug finding, warning prioritization, bug triage, and empirical studies on defects.
5.1 Bug Finding
Many bug-finding tools have been proposed in the literature (Nielson et al, 2005; Necula et al, 2005; Holzmann et al, 2000; Sen et al, 2005; Godefroid et al, 2005; Cadar et al, 2008; Sliwerski et al, 2005). A short survey of these tools is provided in Section 2. In this paper, we focus on five static bug-finding tools for Java: FindBugs, PMD, JLint, CheckStyle, and JCSC. These tools are relatively lightweight and have been applied to large systems.
None of these bug-finding tools is able to detect all kinds of bugs. To the best of our knowledge, there is no study on the false negative rates of these tools. We are the first to carry out an empirical study addressing this issue, based on a few hundred real-life bugs from three Java programs. In the future, we plan to investigate other bug-finding tools with more programs written in different languages.
5.2 Warning Prioritization
Many studies deal with the many false positives produced by various bug-finding tools. Kremenek and Engler (2003) use z-ranking to prioritize warnings. Kim and Ernst (2007) prioritize warning categories using historical data. Ruthruff et al (2008) predict actionable static analysis warnings by proposing a logistic regression model that differentiates false positives from actionable warnings. Liang et al (2010) propose a technique to construct a training set for better prioritization of static analysis warnings. A comprehensive survey of static analysis warning prioritization has been written by Heckman and Williams (2011). There are also large-scale studies on the false positives produced by FindBugs (Ayewah and Pugh, 2010; Ayewah et al, 2008).
While past studies on warning prioritization focus on coping with false positives with respect to actionable warnings, we investigate an orthogonal problem on false negatives. False negatives are important as they can cause bugs to go unnoticed and cause harm when the software is used by end users. Analyzing false negatives is also important to guide future research on building additional bug-finding tools.
5.3 Bug Triage
Even when the bug reports, either from a tool or from a human user, are all for real bugs, the sheer number of reports can be huge for a large system. In order to allocate appropriate development and maintenance resources, project managers often need to triage the reports.
Cubranic and Murphy (2004) propose a method that uses text categorization to triage bug reports. Anvik et al (2006) use machine learning techniques to triage bug reports. Hooimeijer and Weimer (2007) construct a descriptive model to measure the quality of bug reports. Park et al (2011) propose CosTriage, which further takes the cost associated with bug reporting and fixing into consideration. Sun et al (2010) use data mining techniques to detect duplicate bug reports.
Different from warning prioritization, these studies address the problem of too many true positives. They are also different from our work, which deals with false negatives.
5.4 Empirical Studies on Defects
Many studies investigate the nature of defects. Pan et al (2009) investigate different bug fix patterns for various software systems. They highlight patterns such as method calls with different actual parameter values, changes of assignment expressions, etc. Chou et al (2001) perform an empirical study of operating system errors.
Closest to our work is the study by Rutar et al (2004) on the comparison of bug-finding tools for Java. They compare the number of warnings generated by various bug-finding tools. However, no analysis has been made as to whether there are false positives or false negatives. The authors commented that: “An interesting area of future work is to gather extensive information about the actual faults in programs, which would enable us to precisely identify false positives and false negatives.” In this study, we address the false negatives mentioned by them.
6 Conclusion and Future Work
Defects can harm software vendors and users. A number of static bug-finding tools have been developed to catch defects. In this work, we empirically study the effectiveness of state-of-the-art static bug-finding tools in preventing real-life defects. We investigate five bug-finding tools, FindBugs, JLint, PMD, CheckStyle, and JCSC, on three programs: Lucene, Rhino, and AspectJ. We analyze 200 fixed real defects and extract the faulty lines of code responsible for these defects from their treatments.
We find that most defects in the programs could be partially or fully captured by combining the reports generated by all bug-finding tools. However, a substantial proportion of defects (e.g., about 36% for Lucene) are still missed. We find that FindBugs and PMD are the best among the bug-finding tools in preventing false negatives. Our stricter analysis shows that although many of these warnings cover faulty lines, the warnings are often too generic, and developers need to inspect the code to find the defects. We find that some bugs are not flagged by any of the bug-finding tools; these bugs involve logical or functionality errors that are difficult to detect without any specification of the system. We also find that the bug-finding tools are more effective in capturing severe than non-severe defects. Also, based on various measures of difficulty, the bug-finding tools are more effective in capturing more difficult bugs. Furthermore, the bug-finding tools perform differently for different bug types: they are more effective in capturing assignment and conditional defects than return, method invocation, and object usage related defects.
As future work, we would like to reduce the threats to external validity by experimenting with even more bugs from more software systems. Based on the characteristics of the missed bugs, we also plan to build a new tool that could help reduce false negatives further.
7 Acknowledgement
We thank the researchers creating and maintaining the iBugs repository, which is publicly available at http://www.st.cs.uni-saarland.de/ibugs/. We also appreciate very much the valuable comments from the anonymous reviewers and our shepherd Andreas Zeller on earlier versions of this paper.
References
Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: ICSE, pp 361–370
Artho C (2006) JLint - find bugs in Java programs. http://jlint.sourceforge.net/
Ayewah N, Pugh W (2010) The Google FindBugs fixit. In: ISSTA, pp 241–252
Ayewah N, Pugh W, Morgenthaler JD, Penix J, Zhou Y (2007) Evaluating static analysis defect warnings on production software. In: PASTE, pp 1–8
Ayewah N, Hovemeyer D, Morgenthaler JD, Penix J, Pugh W (2008) Using static analysis to find bugs. IEEE Software 25(5):22–29
Ball T, Levin V, Rajamani SK (2011) A decade of software model checking with SLAM. Commun ACM 54(7):68–76
Beyer D, Henzinger TA, Jhala R, Majumdar R (2007) The software model checker Blast. STTT 9(5-6):505–525
Brun Y, Ernst MD (2004) Finding latent code errors via machine learning over program executions. In: ICSE, pp 480–490
Burn O (2007) Checkstyle. http://checkstyle.sourceforge.net/
Cadar C, Ganesh V, Pawlowski PM, Dill DL, Engler DR (2008) EXE: Automatically generating inputs of death. ACM Trans Inf Syst Secur 12(2)
Chou A, Yang J, Chelf B, Hallem S, Engler DR (2001) An empirical study of operating system errors. In: SOSP, pp 73–88
Copeland T (2005) PMD Applied. Centennial Books
Corbett JC, Dwyer MB, Hatcliff J, Laubach S, Pasareanu CS, Robby, Zheng H (2000) Bandera: extracting finite-state models from Java source code. In: ICSE, pp 439–448
Cousot P, Cousot R (2012) An abstract interpretation framework for termination. In: POPL, pp 245–258
Cousot P, Cousot R, Feret J, Mauborgne L, Mine A, Rival X (2009) Why does Astree scale up? Formal Methods in System Design 35(3):229–264
Cubranic D, Murphy GC (2004) Automatic bug triage using text categorization. In: SEKE, pp 92–97
Dallmeier V, Zimmermann T (2007) Extraction of bug localization benchmarks from history. In: ASE, pp 433–436
Gabel M, Su Z (2010) Online inference and enforcement of temporal properties. In: ICSE, pp 15–24
Giger E, Pinzger M, Gall H (2011) Comparing fine-grained source code changes and code churn for bug prediction. In: MSR, pp 83–92
Godefroid P, Klarlund N, Sen K (2005) DART: directed automated random testing. In: PLDI, pp 213–223
GrammaTech (2012) CodeSonar. http://www.grammatech.com/products/codesonar/overview.html
Heckman SS (2007) Adaptively ranking alerts generated from automated static analysis. ACM Crossroads 14(1)
Heckman SS, Williams LA (2009) A model building process for identifying actionable static analysis alerts. In: ICST, pp 161–170
Heckman SS, Williams LA (2011) A systematic literature review of actionable alert identification techniques for automated static code analysis. Information & Software Technology 53(4):363–387
Holzmann GJ, Najm E, Serhrouchni A (2000) SPIN model checking: An introduction. STTT 2(4):321–327
Hooimeijer P, Weimer W (2007) Modeling bug report quality. In: ASE, pp 34–43
Hosseini H, Nguyen R, Godfrey MW (2012) A market-based bug allocation mechanism using predictive bug lifetimes. In: 16th European Conference on Software Maintenance and Reengineering (CSMR), pp 149–158
Hovemeyer D, Pugh W (2004) Finding bugs is easy. In: OOPSLA
IBM (2012) T.J. Watson Libraries for Analysis (WALA). http://wala.sourceforge.net
Jocham R (2005) JCSC - Java coding standard checker. http://jcsc.sourceforge.net/
Kawrykow D, Robillard MP (2011) Non-essential changes in version histories. In: ICSE
Kim S, Ernst MD (2007) Prioritizing warning categories by analyzing software history. In: MSR
Kim S, Jr EJW (2006) How long did it take to fix bugs? In: Proceedings of the 2006 International Workshop on Mining Software Repositories, pp 173–174
Kim S, Zimmermann T, Jr EJW, Zeller A (2008) Predicting faults from cached history. In: ISEC, pp 15–16
Kremenek T, Engler DR (2003) Z-ranking: Using statistical analysis to counter the impact of static analysis approximations. In: SAS
Liang G, Wu L, Wu Q, Wang Q, Xie T, Mei H (2010) Automatic construction of an effective training set for prioritizing static analysis warnings. In: ASE
Nagappan N, Ball T (2005) Use of relative code churn measures to predict system defect density. In: ICSE, pp 284–292
Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: ICSE, pp 452–461
Necula GC, Condit J, Harren M, McPeak S, Weimer W (2005) CCured: type-safe retrofitting of legacy software. ACM Trans Program Lang Syst 27(3):477–526
Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. In: PLDI, pp 89–100
Nielson F, Nielson HR, Hankin C (2005) Principles of Program Analysis. Springer
Ostrand TJ, Weyuker EJ, Bell RM (2005) Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng 31:340–355
Pan K, Kim S, Jr EJW (2009) Toward an understanding of bug fix patterns. Empirical Software Engineering 14(3):286–315
Park JW, Lee MW, Kim J, won Hwang S, Kim S (2011) CosTriage: A cost-aware triage algorithm for bug reporting systems. In: AAAI
Rutar N, Almazan CB, Foster JS (2004) A comparison of bug finding tools for Java. In: ISSRE, pp 245–256
Ruthruff JR, Penix J, Morgenthaler JD, Elbaum SG, Rothermel G (2008) Predicting accurate and actionable static analysis warnings: an experimental approach. In: ICSE, pp 341–350
Sen K, Marinov D, Agha G (2005) CUTE: a concolic unit testing engine for C. In: ESEC/SIGSOFT FSE, pp 263–272
Sliwerski J, Zimmermann T, Zeller A (2005) When do changes induce fixes? SIGSOFT Softw Eng Notes 30:1–5
Sun C, Lo D, Wang X, Jiang J, Khoo SC (2010) A discriminative model approach for accurate duplicate bug report retrieval. In: ICSE, pp 45–54
Tassey G (2002) The economic impacts of inadequate infrastructure for software testing. National Institute of Standards and Technology Planning Report 02-3
Thung F, Lo D, Jiang L, Lucia, Rahman F, Devanbu PT (2012) When would this bug get reported? In: ICSM, pp 420–429
Visser W, Mehlitz P (2005) Model checking programs with Java PathFinder. In: SPIN
Weeratunge D, Zhang X, Sumner WN, Jagannathan S (2010) Analyzing concurrency bugs using dual slicing. In: ISSTA, pp 253–264
Weiss C, Premraj R, Zimmermann T, Zeller A (2007) How long will it take to fix this bug? In: Proceedings of the Fourth International Workshop on Mining Software Repositories, MSR '07
Wu R, Zhang H, Kim S, Cheung SC (2011a) ReLink: recovering links between bugs and changes. In: SIGSOFT FSE, pp 15–25
Wu W, Zhang W, Yang Y, Wang Q (2011b) DREX: Developer recommendation with k-nearest-neighbor search and expertise ranking. In: 18th Asia-Pacific Software Engineering Conference (APSEC), pp 389–396
Xia X, Lo D, Wang X, Zhou B (2013) Accurate developer recommendation for bug resolution. In: Proceedings of the 20th Working Conference on Reverse Engineering
Xie X, Zhang W, Yang Y, Wang Q (2012) DRETOM: Developer recommendation based on topic models for bug resolution. In: Proceedings of the 8th International Conference on Predictive Models in Software Engineering, pp 19–28
Xie Y, Aiken A (2007) Saturn: A scalable framework for error detection using boolean satisfiability. ACM Trans Program Lang Syst 29(3)