Noname manuscript No. (will be inserted by the editor)
To What Extent Could We Detect Field Defects?
—An Extended Empirical Study of False Negatives in Static Bug-Finding Tools
Ferdian Thung · Lucia · David Lo ·Lingxiao Jiang · Foyzur Rahman ·Premkumar T. Devanbu
Received: date / Accepted: date
Abstract Software defects can cause much loss. Static bug-finding tools are designed to detect and remove software defects and are believed to be effective. However, do such tools in fact help prevent actual defects that occur in the field and are reported by users? If these tools had been used, would they have detected these field defects and generated warnings that would direct programmers to fix them? To answer these questions, we perform an empirical study that investigates the effectiveness of five state-of-the-art static bug-finding tools (FindBugs, JLint, PMD, CheckStyle, and JCSC) on hundreds of reported and fixed defects extracted from three open source programs (Lucene, Rhino, and AspectJ). Our study addresses the question: To what extent could field defects be detected by state-of-the-art static bug-finding tools? Different from past studies that are concerned with the number of false positives produced by such tools, we address an orthogonal issue: the number of false negatives. We
Ferdian Thung, School of Information Systems, Singapore Management University. E-mail: [email protected]
Lucia, School of Information Systems, Singapore Management University. E-mail: [email protected]
David Lo, School of Information Systems, Singapore Management University. E-mail: [email protected]
Lingxiao Jiang, School of Information Systems, Singapore Management University. E-mail: [email protected]
Foyzur Rahman, University of California, Davis. E-mail: [email protected]
Premkumar T. Devanbu, University of California, Davis. E-mail: [email protected]
2 Ferdian Thung et al.
find that although many field defects could be detected by static bug-finding tools, a substantial proportion of defects could not be flagged. We also analyze the types of tool warnings that are more effective in finding field defects and characterize the types of missed defects. Furthermore, we analyze the effectiveness of the tools in finding field defects of various severities, difficulties, and types.
Keywords False Negatives · Static Bug-Finding Tools · Empirical Study
1 Introduction
Bugs are prevalent in many software systems. The National Institute of Standards and Technology (NIST) has estimated that bugs cost the US economy billions of dollars annually (Tassey, 2002). Bugs are not merely economically harmful; they can also endanger lives and property when mission critical systems malfunction. Clearly, techniques that can detect and reduce bugs would be very beneficial. To achieve this goal, many static analysis tools have been proposed to find bugs. Static bug-finding tools, such as FindBugs (Hovemeyer and Pugh, 2004), JLint (Artho, 2006), PMD (Copeland, 2005), CheckStyle (Burn, 2007), and JCSC (Jocham, 2005), have been shown to be helpful in detecting many bugs, even in mature software (Ayewah et al, 2007). It is thus reasonable to believe that such tools are a useful adjunct to other bug-finding techniques such as testing and inspection.
Although static bug-finding tools are effective in some settings, it is unclear whether the warnings that they generate are really useful. Two issues are particularly important to address: First, many warnings need to correspond to actual defects that would be experienced and reported by users. Second, many actual defects should be captured by the generated warnings. For the first issue, a number of studies have shown that the number of false warnings (i.e., false positives) is too high, and some have proposed techniques to prioritize warnings (Heckman and Williams, 2011, 2009; Ruthruff et al, 2008; Heckman, 2007). While the first issue has received much attention, the second has received less. Many papers on bug detection tools simply report the number of defects that the tools can detect. It is unclear how many defects are missed by these tools (i.e., false negatives). While the first issue is concerned with false positives, the second focuses on false negatives. We argue that both issues deserve equal attention as both impact the quality of software systems. If false positives are not satisfactorily addressed, bug-finding tools become unusable. If false negatives are not satisfactorily addressed, the impact of these tools on software quality would be minimal. On mission critical systems, false negatives may deserve even more attention. Thus, there is a need to investigate the false negative rates of such tools on actual field defects.
Our study tries to fill this research gap by answering the following research question (we use the terms "bug" and "defect" interchangeably; both refer to errors or flaws in software):
To What Extent Could We Detect Field Defects? 3
To what extent could state-of-the-art static bug-finding tools detect fielddefects?
To investigate this research question, we make use of data available in bug-tracking systems and software repositories. Bug-tracking systems, such as Bugzilla or JIRA, record descriptions of bugs that are actually experienced and reported by users. Software repositories contain information on what code elements get changed, removed, or added at different points in time. Such information can be linked together to track bugs and when and how they get fixed. JIRA has the capability to link a bug report with the changed code that fixes the bug. Also, many techniques have been employed to link bug reports in Bugzilla to their corresponding SVN/CVS code changes (Dallmeier and Zimmermann, 2007; Wu et al, 2011a). These data sources provide us with descriptions of actual field defects and their treatments. Based on the descriptions, we are able to infer the root causes of defects (i.e., the faulty lines of code) from the bug treatments. To ensure accurate identification of faulty lines of code, we perform several iterations of manual inspection to identify lines of code that are responsible for the defects. Then, we are able to compare the identified root causes with the lines of code flagged by static bug-finding tools, and to analyze the proportion of field defects that are missed or captured by the tools.
In this work, we perform an exploratory study with five state-of-the-art static bug-finding tools (FindBugs, PMD, JLint, CheckStyle, and JCSC) on three reasonably large open source Java programs (Lucene, Rhino, and AspectJ). We use bugs reported in JIRA for Lucene version 2.9, and the iBugs dataset provided by Dallmeier and Zimmermann (2007) for Rhino and AspectJ. Our manual analysis identifies 200 real-life defects for which we can unambiguously locate the faulty lines of code. We find that many of these defects could be detected by FindBugs, PMD, JLint, CheckStyle, and JCSC, but a number of them remain undetected.
The main contributions of this work are as follows:

1. We examine the number of real-life defects missed by five static bug-finding tools, and evaluate the tools' performance in terms of their false negative rates.
2. We investigate the warning families in various tools that are effective in detecting actual defects.
3. We characterize actual defects that could not be flagged by the static bug-finding tools.
4. We analyze the effectiveness of the static bug-finding tools on defects of various severities, difficulties, and types.
The paper is structured as follows. In Section 2, we present introductory information on various static bug-finding tools. In Section 3, we present our experimental methodology. In Section 4, we present our empirical findings and discuss interesting issues. In Section 5, we describe related work. We conclude with future work in Section 6.
2 Bug-Finding Tools
In this section, we first provide a short survey of different bug-finding tools, which could be grouped into: static, dynamic, and machine learning based. We then present the five static bug-finding tools that we evaluate in this study, namely FindBugs, JLint, PMD, CheckStyle, and JCSC.
2.1 Categorization of Bug-Finding Tools
Many bug-finding tools are based on static analysis techniques (Nielson et al, 2005), such as type systems (Necula et al, 2005), constraint-based analysis (Xie and Aiken, 2007), model checking (Holzmann et al, 2000; Corbett et al, 2000; Beyer et al, 2007), abstract interpretation (Cousot et al, 2009; Cousot and Cousot, 2012), or a combination of various techniques (Ball et al, 2011; GrammaTech, 2012; Visser and Mehlitz, 2005; IBM, 2012). They often produce various false positives, and in theory they should be free of false negatives for the kinds of defects they are designed to detect. However, due to implementation limitations and the fact that a large program often contains defect types that are beyond the designed capabilities of the tools, such tools may still suffer from false negatives with respect to all kinds of defects.
In this study, we analyze several static bug-finding tools that make use of warning patterns for bug detection. These tools are lightweight and can scale to large programs. On the downside, these tools do not consider the specifications of a system, and may miss defects caused by specification violations.
Other bug-finding tools use dynamic analysis techniques, such as dynamic slicing (Weeratunge et al, 2010), dynamic instrumentation (Nethercote and Seward, 2007), directed random testing (Sen et al, 2005; Godefroid et al, 2005; Cadar et al, 2008), and invariant detection (Brun and Ernst, 2004; Gabel and Su, 2010). Such tools often explore particular parts of a program and produce no or few false positives. However, they seldom cover all parts of a program; they are thus expected to have false negatives.
There are also studies on bug prediction with data mining and machine learning techniques, which may have both false positives and negatives. For example, Sliwerski et al (2005) analyze code change patterns that may cause defects. Ostrand et al (2005) use a regression model to predict defects. Nagappan et al (2006) apply principal component analysis on the code complexity metrics of commercial software to predict failure-prone components. Kim et al (2008) predict potential faults from bug reports and fix histories.
2.2 FindBugs
FindBugs was first developed by Hovemeyer and Pugh (2004). It statically analyzes Java bytecode against various families of warnings characterizing common bugs in many systems. Code matching a set of warning patterns is flagged to the user, along with the specific locations of the code.
FindBugs comes with many built-in warnings. These include: null pointer dereference, method not checking for null argument, close() invoked on a value that is always null, test for floating point equality, and many more. There are hundreds of warnings; these fall under a set of warning families including: correctness, bad practice, malicious code vulnerability, multi-threaded correctness, style, internationalization, performance, risky coding practice, etc.
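As a concrete illustration, the toy snippet below contains two patterns of this kind. The class and method names are ours, not code from the studied projects, and the exact warning names vary across FindBugs versions.

```java
// Toy examples of code that pattern-based tools such as FindBugs
// typically warn about (illustrative only).
public class WarningExamples {

    // "Test for floating point equality": == on doubles is unreliable
    // because of rounding error; 0.1 + 0.2 == 0.3 is false in IEEE-754.
    static boolean sumEqualsDirect(double a, double b, double expected) {
        return a + b == expected;                   // likely flagged
    }

    // The usual fix: compare within a small tolerance instead.
    static boolean sumEqualsWithTolerance(double a, double b, double expected) {
        return Math.abs((a + b) - expected) < 1e-9;
    }

    // "Method does not check for null argument": dereferencing s without a
    // null check throws NullPointerException when s is null.
    static int unsafeLength(String s) {
        return s.length();                          // likely flagged
    }
}
```

For instance, sumEqualsDirect(0.1, 0.2, 0.3) returns false even though the sum is mathematically 0.3, while the tolerance-based variant returns true.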
2.3 JLint
JLint, developed by Artho (2006), is a tool to find defects, inconsistent code, and problems with synchronization in multi-threaded applications. Similar to FindBugs, JLint analyzes Java bytecode against a set of warning patterns. It constructs and checks a lock graph, and performs data-flow analysis. Code fragments matching the warning patterns are flagged and reported to the user along with their locations.
JLint provides many warnings, such as potential deadlocks, unsynchronized methods implementing the 'Runnable' interface, finalize() not calling super.finalize(), null references, etc. These warnings are grouped under three families: synchronization, inheritance, and data flow.
2.4 PMD
PMD, developed by Copeland (2005), is a tool that finds defects, dead code, duplicate code, sub-optimal code, and overcomplicated expressions. Different from FindBugs and JLint, PMD analyzes Java source code rather than Java bytecode. PMD also comes with a set of warning patterns and finds locations in code matching these patterns.
PMD provides many warning patterns, such as jumbled incrementer, return from finally block, class cast exception with toArray, misplaced null check, etc. These warning patterns fall into families, such as design, strict exceptions, clone, unused code, String and StringBuffer, security code, etc., which are referred to as rule sets.
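One of these patterns can be made concrete with a toy example of our own (not code from the studied programs): a return in a finally block silently discards any pending exception, which is why PMD-style tools flag it.

```java
// Illustrative only: the "return from finally block" pattern.
public class PmdExamples {

    // The finally block's return discards the pending exception, so the
    // caller never observes the IllegalStateException; it just gets 42.
    static int returnFromFinally() {
        try {
            throw new IllegalStateException("silently lost");
        } finally {
            return 42;    // likely flagged: return from finally block
        }
    }
}
```

Calling returnFromFinally() returns 42 and the exception is swallowed, which is rarely the behavior the programmer intended.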
2.5 CheckStyle
CheckStyle, developed by Burn (2007), is a tool that enforces a coding standard on Java code; its default configuration follows the Sun Code Conventions. Similar to PMD, CheckStyle analyzes Java source code. CheckStyle also comes with a set of warning patterns and finds locations in code matching these patterns.
CheckStyle provides many warning patterns, such as layout issues, class design problems, duplicate code, Javadoc comments, metrics, modifiers, naming conventions, regular expressions, size violations, whitespace, etc. These warning
patterns fall into families, such as sizes, regular expression, whitespace, Java documentation, etc.
2.6 JCSC
JCSC, developed by Jocham (2005), is a tool that enforces a customizable coding standard on Java code and also checks for potentially bad code. Similar to CheckStyle, the default coding standard is the Sun Code Conventions. The standard can define naming conventions for classes, interfaces, fields, and parameters, the structural layout of a type (class/interface), etc. Similar to PMD and CheckStyle, JCSC analyzes Java source code. It also comes with a set of warning patterns and finds locations in code matching these patterns.
JCSC provides many warning patterns, such as empty catch/finally block, switch without default, slow code, inner class issues in classes/interfaces, constructor, method, or field declaration issues, etc. These warning patterns fall into families, such as method, metrics, field, Java documentation, etc.
3 Methodology
We make use of bug-tracking systems, version control systems, and state-of-the-art bug-finding tools. First, we extract bugs and the faulty lines of code that are responsible for the bugs. Next, we run the bug-finding tools on the various program releases before the bugs get fixed. Finally, we compare the warnings given by the bug-finding tools with the real bugs to find false negatives.
3.1 Extraction of Faulty Lines of Code
We analyze two common configurations of bug tracking systems and code repositories to get historically faulty lines of code. One configuration is the combination of CVS as the source control repository and Bugzilla as the bug tracking system. Another configuration is the combination of Git as the source control repository and JIRA as the bug tracking system. We describe how these two configurations could be analyzed to extract the root causes of fixed defects.
Data Extraction: CVS with Bugzilla. For the first configuration, Dallmeier and Zimmermann (2007) have proposed an approach to automatically analyze CVS and Bugzilla to link information. Their approach extracts issue reports and links them to their corresponding CVS commit entries. It can also download the code before and after each change. The code to perform this has been publicly released for several software systems. As our focus is on defects, we remove issue reports that are marked as enhancements (i.e., they are not bug fixes) and the CVS commits that correspond to them.
Data Extraction: Git with JIRA. Git and JIRA have features that make them preferable over CVS and Bugzilla. JIRA explicitly links bug reports to the revisions that fix the corresponding defects. From these fixes, we use Git diff to find the location of the buggy code, and we download the revision prior to the fix by appending the "^" symbol to the hash of the corresponding fix revision. Again, we remove bug reports and Git commits that are marked as enhancements.
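This extraction step can be sketched with plain Git commands. The snippet below builds a throwaway repository with a buggy commit and a fix commit (file name and contents are made up, echoing Figure 2), then recovers the pre-fix version by appending "^" to the fix revision's hash:

```shell
set -e
# Build a toy repository with a buggy version and a fix (made-up content).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "demo@example.com" && git config user.name "demo"
echo 'return doc == NO_MORE_DOCS;' > Scorer.java
git add Scorer.java && git commit -q -m "buggy version"
echo 'return doc != NO_MORE_DOCS;' > Scorer.java
git commit -q -a -m "fix: flip the comparison operator"
fix=$(git rev-parse HEAD)

# Diff between the fix and its parent localizes the treated line.
git diff "$fix^" "$fix" -- Scorer.java

# Appending "^" to the fix hash names the revision prior to the fix;
# check out the buggy version of the file from it.
git checkout -q "$fix^" -- Scorer.java
cat Scorer.java
```

The final cat prints the buggy line again, showing that the pre-fix revision has been recovered.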
Identification of Faulty Lines. The above process gives us a set of real-life defects along with the set of changes that fix them. To find the corresponding root causes, we perform a manual process based on the treatments of the defects. Kawrykow and Robillard (2011) have proposed an approach to remove non-essential changes and convert "dirty" treatments to "clean" treatments. However, their approach still does not recover the root causes of defects.
Our process for locating root causes could not be easily automated by a simple diff operation between two versions of the system (after and prior to the fix), for the following reasons. First, not all changes fix the bug (Kawrykow and Robillard, 2011); some, such as additions of new lines, removals of new lines, and changes in indentation, are only cosmetic changes that make the code aesthetically better. Figure 1 shows such an example. Second, even if all changes are essential, it is not straightforward to identify the defective lines from the fixes. Some fixes introduce additional code, and we need to find the corresponding faulty lines that are fixed by the additional code. We show several examples highlighting the process of extracting root causes from their treatments for a simple and a slightly more complicated case in Figures 2 and 4.
Figure 2 describes a bug fixing activity where one line of code is changed by modifying the operator == to !=. It is easy for us to identify the faulty line: it is the line that gets changed. A more complicated case is shown in Figure 4. There are four sets of faulty lines that could be inferred from the diff: one is line 259 (marked with *), where an unnecessary method call needs to be removed; a similar fault is at line 262; the third set of faulty lines is at lines 838-840, which are condition checks that should be removed; the fourth is at line 887, which misses a pre-condition check. For this case, the diff is much larger than the faulty lines that are manually identified.
Figure 4 illustrates the difficulties in automating the identification of faulty lines. To ensure the fidelity of identified root causes, we perform several iterations of manual inspection. For some ambiguous cases, several of the authors discussed and came to resolutions. Cases that are still deemed ambiguous (i.e., it is unclear or difficult to manually locate the faulty lines) are removed from this empirical study. We show such an example in Figure 3. This example shows a bug fixing activity where an if block is inserted into the code, and it is unclear which lines are really faulty.
At the end of the above process, we get the sets of faulty lines Faulty for all defects in our collection. Notation-wise, we refer to these sets of lines
Fig. 1: Example of Simple Cosmetic Changes. [Figure: buggy vs. fixed versions of AspectJ's org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/ast/ThisJoinPointVisitor.java; the changes only insert and delete commented-out System.err.println debug lines and adjust brace placement in an else-if statement.]

Fig. 2: Identification of Root Causes (Faulty Lines) from Treatments [Simple Case]. [Figure: Lucene 2.9, src/java/org/apache/lucene/search/Scorer.java, line 90 (marked with *): "return doc == NO_MORE_DOCS" in the buggy version becomes "return doc != NO_MORE_DOCS;" in the fixed version.]

Fig. 3: Identification of Root Causes (Faulty Lines) from Treatments [Ambiguous Case]. [Figure: AspectJ's org.aspectj/modules/weaver/src/org/aspectj/weaver/patterns/SignaturePattern.java; the fix inserts an if block (checking kind == Member.STATIC_INITIALIZATION and kind == Member.ADVICE) between lines 87 and 88 of the buggy version, and it is unclear which original lines are faulty.]

Fig. 4: Identification of Root Causes (Faulty Lines) from Treatments [Complex Case]. [Figure: AspectJ's org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/LazyMethodGen.java; the faulty lines (marked with *) are 259 and 262 (unnecessary lng.setStart(null)/lng.setEnd(null) calls that the fix deletes), 838-840 (condition checks on LocalVariableInstruction that should be removed), and 887 (a comparator return missing a pre-condition check for names starting with 'arg'); the diff is much larger than the manually identified faulty lines.]
using an index notation Faulty[]. We refer to the set of lines in Faulty that correspond to the ith defect as Faulty[i].
3.2 Extraction of Warnings
To evaluate the static bug-finding tools, we run the tools on the program versions before the defects get fixed. An ideal bug-finding tool would recover the faulty lines of the program, possibly along with other lines corresponding to other defects lying within the software system. Some of these warnings are false positives, while others are true positives.
For each defect, we extract the version of the program in the repository prior to the bug fix. We have such information already as an intermediate result of the root cause extraction process. We then run the bug-finding tools on these program versions. Because 13.42% of these versions could not be compiled, we remove them from our analysis, which is a threat to the validity of our study (see Section 4.3).
As described in Section 2, each of the bug-finding tools takes in a set of rules describing types of defects and flags code that matches them. By default, we enable all rules/defect types available in the tools, except that we exclude two rule sets from PMD: one related to Android development, and another whose XML configuration file could not be read by PMD (i.e., Coupling).
Each run produces a set of warnings, and each warning flags a set of lines of code. Notation-wise, we refer to the sets of lines of code flagged over all runs as Warning, and the set of lines of code flagged in the run for the ith defect as Warning[i]. Also, we refer to the sets of lines of code for all runs of a particular tool T as WarningT, and similarly we have WarningT[i].
3.3 Extraction of Missed Defects
With the faulty lines Faulty obtained through manual analysis and the lines flagged by the static bug-finding tools Warning, we can look for false negatives, i.e., actual reported and fixed defects that are missed by the tools.
To get these missed defects, for every ith defect, we take the intersection of the sets Faulty[i] and Warning[i] from all bug-finding tools, and the intersection between Faulty[i] and each WarningT[i]. If an intersection is an empty set, we say that the corresponding bug-finding tool misses the ith defect. If the intersection covers a proper subset of the lines in Faulty[i], we say that the bug-finding tool partially captures the ith defect. Otherwise, if the intersection covers all lines in Faulty[i], we say that the bug-finding tool fully captures the ith defect. We differentiate the partial and full cases as developers might be able to recover the other faulty lines, given that some of the faulty lines have been flagged.
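The miss/partial/full classification above can be sketched as a small routine; the class and method names are ours, not from the authors' tooling:

```java
import java.util.HashSet;
import java.util.Set;

public class DefectCoverage {
    enum Outcome { MISS, PARTIAL, FULL }

    // faulty: the manually identified faulty lines Faulty[i];
    // warned: the lines Warning[i] flagged by a tool on the pre-fix version.
    static Outcome classify(Set<Integer> faulty, Set<Integer> warned) {
        Set<Integer> common = new HashSet<>(faulty);
        common.retainAll(warned);          // Faulty[i] ∩ Warning[i]
        if (common.isEmpty()) {
            return Outcome.MISS;           // no faulty line is flagged
        }
        if (common.equals(faulty)) {
            return Outcome.FULL;           // every faulty line is flagged
        }
        return Outcome.PARTIAL;            // only a proper subset is flagged
    }
}
```

For example, if Faulty[i] = {259, 262} and a tool flags only line 262, the defect is partially captured.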
3.4 Overall Approach
Our overall approach is illustrated by the pseudocode in Figure 5. Our approach takes in a bug repository (e.g., Bugzilla or JIRA), a code repository (e.g., CVS or Git), and a bug-finding tool (e.g., FindBugs, PMD, JLint, CheckStyle, or JCSC). For each bug report in the repository, it performs the three steps described in the previous subsections: faulty line extraction, warning identification, and missed defect detection.
The first step corresponds to lines 3-8. We find the bug fix commit corresponding to the bug report. We identify the version prior to the bug fix commit. We perform a diff to find the differences between these two versions. Faulty lines are then extracted by manual analysis. The second step corresponds to lines 9-11. Here, we simply run the bug-finding tool and collect the lines of code flagged by the various warnings. Finally, step three is performed by lines 12-19. Here, we detect cases where the bug-finding tool misses, partially captures, or fully captures a defect. The final statistics are output at line 20.
Procedure IdentifyMissedDefects
Inputs:
    BugRepo  : Bug Repository
    CodeRepo : Code Repository
    BFTool   : Bug-Finding Tool
Output:
    Statistics of Defects that are Missed and Captured (Fully or Partially)
Method:
 1: Let Stats = {}
 2: For each bug report br in BugRepo
 3:   // Step 1: Extract faulty lines of code
 4:   Let fixC = br's corresponding fix commit in CodeRepo
 5:   Let bugC = Revision before fixC in CodeRepo
 6:   Let diff = The difference between fixC and bugC
 7:   Extract faulty lines from diff
 8:   Let Faulty_br = Faulty lines in bugC
 9:   // Step 2: Get warnings
10:   Run BFTool on bugC
11:   Let Warning_br = Flagged lines in bugC by BFTool
12:   // Step 3: Detect missed defects
13:   Let Common = Faulty_br ∩ Warning_br
14:   If Common = {}
15:     Add 〈br, miss〉 to Stats
16:   Else If Common = Faulty_br
17:     Add 〈br, full〉 to Stats
18:   Else
19:     Add 〈br, partial〉 to Stats
20: Output Stats
Fig. 5: Identification of Missed Defects.
4 Empirical Evaluation
In this section we present our research questions, datasets, empirical findings,and threats to validity.
4.1 Research Questions & Datasets
We would like to answer the following research questions:
RQ1 How many real-life reported and fixed defects from Lucene, Rhino, and AspectJ are missed by state-of-the-art static bug-finding tools?
RQ2 What types of warnings reported by the tools are most effective in detecting actual defects?
RQ3 What are some characteristics of the defects missed by the tools?
RQ4 How effective are the tools in finding defects of various severities?
RQ5 How effective are the tools in finding defects of various difficulties?
RQ6 How effective are the tools in finding defects of various types?
We evaluate five static bug-finding tools, namely FindBugs, JLint, PMD, CheckStyle, and JCSC, on three open source projects: Lucene, Rhino, and AspectJ. Lucene from the Apache Software Foundation1 is a general purpose text search engine library. Rhino from the Mozilla Foundation2 is an implementation of JavaScript written in Java. AspectJ from the Eclipse Foundation3 is an aspect-oriented extension of Java. We crawl JIRA for defects tagged for Lucene version 2.9. For Rhino and AspectJ, we analyze the iBugs repository prepared by Dallmeier and Zimmermann (2007). The average sizes of Lucene, Rhino, and AspectJ are around 265,822, 75,544, and 448,176 lines of code (LOC), respectively. Table 1 shows the numbers of unambiguous defects from the three datasets for which we are able to manually locate root causes, together with the total numbers of defects available in the datasets and the average number of faulty lines per defect.
Table 1: Number of Defects for Various Datasets

Dataset | # of Unambiguous Defects | # of Defects | Avg # of Faulty Lines Per Defect
--------+--------------------------+--------------+---------------------------------
Lucene  | 28                       | 57           | 3.54
Rhino   | 20                       | 32           | 9.1
AspectJ | 152                      | 350          | 4.07
1 http://lucene.apache.org/core/
2 http://www.mozilla.org/rhino/
3 http://www.eclipse.org/aspectj/
4.2 Experimental Results
4.2.1 RQ1: Number of Missed Defects
We show the number of defects missed by each of the five tools, and by all of them combined, for Lucene, Rhino, and AspectJ in Table 2. We do not summarize across software projects as the numbers of warnings for different projects differ greatly, and summarization across projects may not be meaningful. We first show the results for all defects, and then zoom into subsets of the defects that span a small number of lines of code, and those that are severe. Our goal is to evaluate the effectiveness of state-of-the-art static bug-finding tools in terms of their false negative rates. We would also like to evaluate whether false negative rates may be reduced if we use all five tools together.
Consider All Defects.
Lucene. For Lucene, as shown in Table 2, we find that with all five tools, 35.7% of all defects could be fully identified (i.e., all faulty lines Faulty[i] of a defect i are flagged). Another 28.6% of all defects could be partially identified (i.e., some but not all faulty lines in Faulty[i] are flagged). Still, 35.7% of all defects could not be flagged by the tools. Thus, the five tools are effective but not very successful in capturing Lucene defects. Among the tools, PMD captures the most defects, followed by FindBugs, CheckStyle, JCSC, and finally JLint.
Rhino. For Rhino, as shown in Table 2, we find that with all five tools, 95% of all defects could be fully identified. Only 5% of all defects could not be captured by the tools. Thus, the tools are very effective in capturing Rhino defects. Among the tools, FindBugs captures the most defects, followed by PMD, CheckStyle, JLint, and finally JCSC.
AspectJ. For AspectJ, as shown in Table 2, we find that with all five tools, 71.1% of all defects could be fully captured. Another 28.3% of all defects could be partially captured. Only 0.7% of the defects could not be captured by any of the tools. Thus, the tools are very effective in capturing AspectJ defects. PMD captures the most defects, followed by CheckStyle, FindBugs, JCSC, and finally JLint.
Also, to compare the five bug-finding tools, we show in Table 3 the average numbers of lines in one defect program version (over all defects) that are flagged by the tools for our subject programs. We note that PMD flags the most lines of code, and likely produces more false positives than the others, while JLint flags the fewest lines of code. With respect to the average sizes of the programs (see Column "Avg # LOC" in Table 4), the tools may flag 0.2% to 56.9% of the whole program as buggy; for example, JLint flags about 515.54 of Lucene's 265,822 lines (roughly 0.2%), while PMD flags about 42,999.3 of Rhino's 75,544 lines (roughly 56.9%).
Table 2: Percentages of Defects that are Missed, Partially Captured, and Fully Captured. The numbers in the parentheses indicate the numbers of actual faulty lines captured, if any.

Tools vs.    Lucene                           Rhino                            AspectJ
Programs     Miss    Partial     Full         Miss    Partial     Full         Miss    Partial      Full
FindBugs     71.4%   14.3% (5)   14.3% (8)    10.0%   0%          90.0% (179)  33.6%   19.7% (91)   46.7% (137)
JLint        82.1%   10.7% (3)   7.1% (4)     80.0%   10.0% (2)   10.0% (3)    97.4%   1.3% (2)     1.3% (2)
PMD          50.0%   14.3% (11)  35.7% (13)   15.0%   5.0% (18)   80.0% (157)  3.9%    27.6% (251)  68.4% (196)
CheckStyle   71.4%   14.3% (5)   14.3% (5)    65%     35% (47)    0%           32.2%   38.2% (272)  29.6% (61)
JCSC         64.3%   25% (15)    10.7% (3)    100%    0%          0%           86.8%   9.9% (38)    3.3% (5)
All          35.7%   28.6% (11)  35.7% (22)   5.0%    0%          95.0% (181)  0.7%    28.3% (274)  71.1% (206)
Table 3: Average Numbers of Lines Flagged Per Defect Version by Various Static Bug-Finding Tools.
Tools vs. Programs   Lucene       Rhino       AspectJ
FindBugs 33,208.82 40,511.55 83,320.09
JLint 515.54 330.6 720.44
PMD 124,281.5 42,999.3 198,065.51
Checkstyle 16,180.6 11,144.65 55,900.13
JCSC 16,682.46 3,573.55 22,010.77
All 126,367.75 52,123.05 220,263
Note that a program version may contain many other issues besides the defects in our dataset, and a tool may generate many warnings for the version. These partially explain the high average numbers of flagged lines in Table 3. To further understand these numbers, we present more statistics about the warnings generated by the tools for each of the three programs in Table 4. Column “Avg # Warning” gives the average number of warnings reported by each tool for each program. Column “Avg # Flagged Line Per Warning” is the average number of lines flagged by one warning report. Column “Avg # Unfixed” is the average number of lines of code that are flagged but are not among the buggy lines fixed in a version. Column “Avg # LOC” is the number of lines of code in a particular program (across all buggy versions).
We notice that the average number of warnings generated by PMD is high. FindBugs reports fewer warnings, but many of its warnings span many consecutive lines of code. Many flagged lines do not correspond to the buggy lines fixed in a version. These could be either false positives or bugs to be found in the future. We do not check the exact number of false positives, though, as doing so would require much manual labor (i.e., checking thousands of warnings in each of the hundreds of program versions), and false positives are the subject of other studies (e.g., Heckman and Williams, 2011). On the other hand, the high numbers of flagged lines raise concerns about the effectiveness of the warnings generated by the tools: Are the warnings effectively correlated with actual defects?
To answer this question, we create random “bug” finding tools that randomly flag some lines of code as buggy according to the distributions of the numbers of lines flagged by each warning generated by each of the bug-finding tools, and compare the bug-capturing effectiveness of such random tools with the actual tools. For each version of the subject programs and each of the actual tools, we run a random tool 10,000 times; each time, the random tool randomly generates the same number of warnings, following the distribution of the numbers of lines flagged by each warning; then, we count the numbers of defects missed, partially captured, and fully captured by the random tool; finally, we compare the effectiveness of the random tools with the actual tools by calculating the p-values shown in Table 5. A value x in each
Table 4: Unfixed Warnings by Various Defect Finding Tools

Software   Tool        Avg # Warning   Avg # Flagged Line   Avg # Unfixed   Avg # LOC
                                       Per Warning
Rhino      FindBugs    177.98          38.72                40,502.6        75,543.8
           JLint       34              1                    330.35
           PMD         11,864.95       21.12                42,990.55
           Checkstyle  34,958.75       1                    11,144.65
           JCSC        24,087          1                    3,573.55
           All         71,726.8        2.95                 52,114.3
Lucene     FindBugs    270.25          469.4                33,208.36       265,821.75
           JLint       685.21          1                    515.29
           PMD         39,993.68       11.74                124,280.64
           Checkstyle  30,779.71       1                    16,680.25
           JCSC        29,056.96       1                    16,681.82
           All         100,513.57      1.46                 126,366.89
AspectJ    FindBugs    802.29          381.76               83,318.59       448,175.94
           JLint       3,937           1                    720.41
           PMD         73,641.86       2.94                 198,062.57
           Checkstyle  93,341.5        1                    55,897.93
           JCSC        198,518.65      1                    22,010.49
           All         326,040.72      1.16                 220,260.31
cell in the table means that our random tool would have an x × 100% chance to obtain results at least as good as the actual tool for either partially or fully capturing the bugs. The values in the table imply that the actual tools may indeed detect more bugs than random tools, although they may produce many false positives. However, some tools on some programs, such as PMD on Lucene, may not be much better than random tools. If there were no correlation between the warnings generated by the actual tools and actual defects, the tools should not perform differently from the random tools. Our results show that this is not the case, at least for some tools on some programs.
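One way this comparison against random tools could be implemented is sketched below (an illustrative Python approximation; the paper does not spell out its exact sampling procedure, and all names here are ours). Each simulated warning flags a random run of consecutive lines whose length follows the real tool's per-warning size distribution:

```python
import random

def random_tool_pvalue(total_loc, faulty_lines, warning_sizes,
                       observed_captured, runs=10000, seed=0):
    """Fraction of random runs that capture at least as many faulty lines
    as the actual tool did (an empirical p-value).

    total_loc: lines of code in the program version
    faulty_lines: set of faulty line numbers of the defect
    warning_sizes: lines flagged by each warning of the actual tool
    observed_captured: number of faulty lines the actual tool captured
    """
    rng = random.Random(seed)
    at_least_as_good = 0
    for _ in range(runs):
        flagged = set()
        for size in warning_sizes:
            # flag `size` consecutive lines starting at a random position
            start = rng.randint(1, max(1, total_loc - size + 1))
            flagged.update(range(start, start + size))
        if len(faulty_lines & flagged) >= observed_captured:
            at_least_as_good += 1
    return at_least_as_good / runs
```

A small p-value means a random tool rarely matches the actual tool, i.e., the tool's warnings correlate with the defect beyond what chance flagging would achieve.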
Table 5: p-values comparing random tools against actual tools.
Tool        Program   Full      Partial or Full
FindBugs    Lucene    0.2766    0.1167
            Rhino     <0.0001   0.0081
            AspectJ   0.5248    0.0015
JLint       Lucene    0.0011    <0.0001
            Rhino     <0.0001   <0.0001
            AspectJ   0.0222    0.0363
PMD         Lucene    0.9996    1
            Rhino     0.9996    1
            AspectJ   0.9996    <0.0001
Checkstyle  Lucene    0.0471    0.0152
            Rhino     0.5085    1
            AspectJ   <0.0001   <0.0001
JCSC        Lucene    0.0152    1
            Rhino     0.3636    1
            AspectJ   0.0006    1
Consider Localized Defects.
Many of the defects that we analyze span more than a few lines of code. Thus, we further focus only on defects that can be localized to a few lines of code (at most five lines of code, which we call localized defects), and investigate the effectiveness of the various bug-finding tools. We show the percentages of localized defects missed by the five tools, for Lucene, Rhino, and AspectJ, in Table 6.
Lucene. Table 6 shows that the tools together fully identify 47.6% of all localized defects and partially identify another 14.3%. This is only slightly higher than the percentage for all defects (see Table 2). The ordering of the tools based on their ability to fully capture localized defects is the same.
Rhino. As shown in Table 6, all five tools together could fully identify 93.3% of all localized defects. This is slightly lower than the percentage for all defects (see Table 2).
AspectJ. Table 6 shows that the tools together fully capture 78.9% of all localized defects and miss only 0.8%. These are better than the percentages for all defects (see Table 2).
Table 6: Percentages of Localized Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     66.7%    14.3%     19.0%    13.3%    0%        86.7%    32.0%    14.8%     53.1%
JLint        80.9%    9.5%      9.5%     86.7%    0%        13.3%    96.9%    1.6%      1.6%
PMD          38.1%    14.3%     47.6%    20.0%    0%        80.0%    2.3%     21.1%     76.6%
Checkstyle   66.67%   19.05%    14.29%   92.31%   7.69%     0%       35.94%   28.9%     35.16%
JCSC         76.19%   9.52%     14.29%   100%     0%        0%       89.84%   6.25%     14.29%
All          38.1%    14.3%     47.6%    6.7%     0%        93.3%    0.8%     20.3%     78.9%
Consider Stricter Analysis.
We also perform a stricter analysis that requires not only that the faulty lines are covered by the warnings, but also that the type of the warnings is directly related to the faulty lines. For example, if the faulty line is a null pointer dereference, then the warning must explicitly say so. Warning reports that cover this line but do not give a reasonable explanation of the warning are not counted as capturing the fault. Again, we manually analyze whether the warnings strictly capture each defect. We focus on defects that are localized to one line of code. There are 7, 5, and 66 such one-line bugs for Lucene, Rhino, and AspectJ, respectively.
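The stricter criterion can be encoded as follows (an illustrative sketch only; in the study this judgment is made manually, and the field names here are ours):

```python
def strictly_captured(fault, warnings):
    """A warning strictly captures a fault only when it covers the faulty
    line AND its category matches the fault type (e.g., a null pointer
    dereference fault needs a null-dereference warning, not just any
    warning that happens to cover the same line)."""
    return any(fault["line"] in w["lines"] and w["category"] == fault["type"]
               for w in warnings)
```

Under this definition, a style warning that happens to cover the faulty line still counts as a miss.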
We show the results of our analysis for Lucene, Rhino, and AspectJ in Table 7. We notice that under this stricter requirement, very few of the defects fully captured by the tools (see Column “Full”) are strictly captured by the same tools (see Column “Strict”). Thus, although the faulty lines might be flagged by the tools, the warning messages from the tools may not contain sufficient information for developers to understand the defects.
Table 7: Numbers of One-Line Defects that are Strictly Captured versus Fully Captured
Tool        Lucene         Rhino          AspectJ
            Strict  Full   Strict  Full   Strict  Full
FindBugs    0       4      1       4      5       42
PMD         0       7      0       5      1       65
JLint       0       0      0       1      1       2
Checkstyle  0       3      0       0      0       33
JCSC        0       3      0       0      0       5
4.2.2 RQ2: Effectiveness of Different Warnings
We show the effectiveness of the various warning families of FindBugs, PMD, JLint, CheckStyle, and JCSC in flagging defects in Tables 8, 9, 10, 11, and 12, respectively. We highlight the top-5 warning families in terms of their ability to fully capture the root causes of the defects.
FindBugs. For FindBugs, as shown in Table 8, in terms of low false negative rates, we find that the best warning families are: Style, Performance, Malicious Code, Bad Practice, and Correctness. Warnings in the style category include a switch statement with one case branch falling through to the next case branch, a switch statement with no default case, an assignment to a local variable that is never used, an unread public or protected field, a referenced variable that contains a null value, etc. Warnings belonging to the malicious code category include a class attribute that should be made final, a class attribute that should be package protected, etc. Furthermore, we find that a violation of each of these warnings would likely flag the entire class that violates it. Thus, a lot of lines
of code would be flagged, giving these warning families a higher chance of capturing the defects. Warnings belonging to the bad practice category include comparison of String objects using == or !=, a method ignoring an exceptional return value, etc. Performance-related warnings include a method concatenating strings using + instead of StringBuffer, etc. Correctness-related warnings include a field that masks a superclass’s field, invocation of toString on an array, etc.
Table 8: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of FindBugs
Warning Family Miss Partial Full
Style            46.5%   9.5%   44.0%
Performance      62.0%   9.0%   29.0%
Malicious Code   68.5%   7.5%   24.0%
Bad Practice     73.5%   4.5%   22.0%
Correctness      74.0%   6.5%   19.5%
PMD. For PMD, as shown in Table 9, we find that the most effective warning categories are: Code Size, Design, Controversial, Optimization, and Naming. Code size warnings flag problems related to code being too large or complex, e.g., the number of acyclic execution paths exceeding 200, a long method, etc. Code size is correlated with defects but does not really inform the type of defect that needs a fix. Design warnings identify suboptimal code implementations; these include simplification of boolean returns, a missing default case in a switch statement, deeply nested if statements, etc., which may not be correlated with defects. Controversial warnings include unnecessary constructors, null assignments, assignments in operands, etc. Optimization warnings cover best practices for improving the efficiency of the code. Naming warnings cover rules pertaining to the preferred names of various program elements.
Table 9: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of PMD
Warning Family Miss Partial Full
Code Size        21.5%   5.0%    73.5%
Design           30.0%   8.5%    61.5%
Controversial    27.5%   19.5%   53.0%
Optimization     38.5%   23.0%   38.5%
Naming           60.0%   23.0%   17.0%
JLint. For JLint, as shown in Table 10, we have three categories: Inheritance, Synchronization, and Data Flow. We find that inheritance warnings are more effective than data flow warnings, which in turn are more effective than synchronization warnings, in detecting defects. Inheritance warnings relate to class inheritance issues, such as: a new
method in a subclass being created with the same name but different parameters as one inherited from the superclass, etc. Synchronization warnings relate to errors in multi-threaded applications, in particular problems related to conflicting use of shared data by multiple threads, such as potential deadlocks, required but missing synchronized keywords, etc. Data flow warnings relate to problems that JLint detects by performing data flow analysis on Java bytecode, such as a null referenced variable, type cast misuse, comparison of Strings using object references, etc.
Table 10: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of JLint
Warning Family    Miss    Partial   Full
Inheritance       93.0%   5.0%      2.0%
Data Flow         93.5%   5.0%      1.5%
Synchronization   99.5%   0%        0.5%
CheckStyle. For CheckStyle, as shown in Table 11, the five most effective categories are: Sizes, Regular expression, White space, Java documentation, and Miscellaneous. A sizes warning occurs when a reasonable limit on program size is exceeded. A regular expression warning occurs when a specific standard pattern is (or is not) present in a code file. A white space warning occurs when there is incorrect whitespace around generic tokens. A Java documentation warning occurs when there is no javadoc or the javadoc does not conform to the standard. Miscellaneous warnings cover issues not included in the other warning categories. Although likely unrelated to the actual causes of the defects, the top warning family is quite effective at catching the defects.
Table 11: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of CheckStyle
Warning Family Miss Partial Full
Sizes                26%     23.5%   50.5%
Regular expression   39.5%   21%     39.5%
White space          34%     28.5%   37.5%
Java documentation   42.5%   32%     25.5%
Miscellaneous        54.5%   28%     17.5%
JCSC. For JCSC, as shown in Table 12, the five most effective categories are: General, Java documentation, Metrics, Field, and Method. General warnings cover common coding issues, such as exceeding the code length limit, a missing default in a switch statement, an empty catch statement, etc. Java documentation warnings cover issues related to the existence of javadoc and its conformance
to the standard. Metrics warnings cover issues related to metrics used for measuring code quality, such as NCSS and NCC. Field warnings cover issues related to the fields of a class, such as a bad modifier order. Method warnings cover issues related to the methods of a class, such as a bad number of method arguments. The warnings generated by JCSC are generally ineffective at detecting defects in the three systems.
Table 12: Percentages of Defects that are Missed, Partially Captured, and Fully Captured for Different Warning Families of JCSC
Warning Family Miss Partial Full
General              76.5%   16.5%   7%
Java documentation   91%     7%      2%
Metrics              95%     4.5%    0.5%
Field                94.5%   5%      0.5%
Method               98%     2%      0%
4.2.3 RQ3: Characteristics of Missed Defects
There are 12 defects that are missed by all five tools. These defects involve logical or functionality errors, and thus they are difficult to detect for static bug-finding tools that have no knowledge of the specifications of the systems. We can categorize the defects into several categories: method related defects (addition or removal of method calls, changes to parameters passed in to methods), conditional checking related defects (addition, removal, or changes to pre-condition checks), assignment related defects (wrong value being assigned to variables), return value related defects (wrong value being returned), and object usage related defects (missing type cast, etc.). The distribution of the defect categories that are missed by all five tools is listed in Table 13.
Table 13: Distribution of Missed Defect Types
Type Number of Defects
Assignment related defects             1
Conditional checking related defects   6
Return value related defects           1
Method related defects                 3
Object usage related defects           1
A sample defect missed by all five tools is shown in Figure 6; it involves an invocation of a wrong method.
Rhino - mozilla/js/rhino/xmlimplsrc/org/mozilla/javascript/xmlimpl/XML.java
Buggy version (Line 3043): return createEmptyXML(lib);
Fixed version (Line 3043): return createFromJS(lib, "");

Fig. 6: Example of a Defect Missed by All Tools
4.2.4 RQ4: Effectiveness on Defects of Various Severity
Different defects are often labeled with different severity levels. Lucene is tracked using the Jira bug tracking system, while Rhino and AspectJ are tracked using the Bugzilla bug tracking system. Jira has the following predefined severity levels: blocker, critical, major, minor, trivial. Bugzilla has the following predefined severity levels: blocker, critical, major, normal, minor, trivial. Jira has no normal severity level. To standardize the two, we group blocker, critical, and major into a severe group, and normal, minor, and trivial into a non-severe group. Table 14 describes the distribution of severe and non-severe defects among the 200 defects that we analyze in this study.
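The grouping can be written down directly (a minimal Python sketch of the mapping just described; Jira simply never produces the "normal" label):

```python
SEVERE = {"blocker", "critical", "major"}
NON_SEVERE = {"normal", "minor", "trivial"}  # "normal" occurs only in Bugzilla

def severity_group(label):
    """Map a Jira/Bugzilla severity label to our two groups."""
    label = label.lower()
    if label in SEVERE:
        return "severe"
    if label in NON_SEVERE:
        return "non-severe"
    raise ValueError("unknown severity: " + label)
```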
Table 14: Distribution of Defects of Various Severity Levels
Type Number of Defects
Severe       45
Non-Severe   155
We investigate the number of false negatives for each category of defects. We show the results for severe and non-severe defects in Tables 15 and 16, respectively. We find that PMD captures the most severe and non-severe defects in most datasets. For almost all the datasets that we analyze, we also find that the tools that more effectively capture severe defects in a dataset also more effectively capture non-severe defects in the same dataset. For Lucene, PMD and JCSC are the top two tools for severe defects, missing only 33.3% and 50% of the defects, respectively. For non-severe defects in Lucene, PMD and FindBugs are the top two tools, each missing only 62.5% of the defects. For Rhino, PMD and FindBugs are the top two tools: they capture severe defects without missing any, and miss only 16.67% and 11.11% of the non-severe defects, respectively. For AspectJ, PMD and CheckStyle are the top two tools for both severe defects (missing 3.2% and 29.03% of the defects, respectively) and non-severe defects (missing 2.48% and 33.06% of the defects, respectively). FindBugs also effectively captures non-severe defects in AspectJ, missing only 33.06% of them.
Based on the percentage of bugs that could be partially or fully capturedby all the five tools together, the tools could capture severe defects more
Table 15: Percentages of Severe Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     83.3%    0%        16.7%    0%       0%        100.0%   35.5%    25.8%     38.7%
JLint        83.3%    0%        16.7%    100.0%   0%        0%       96.8%    3.2%      0%
PMD          33.3%    8.3%      58.3%    0%       0%        100.0%   3.2%     35.5%     61.3%
CheckStyle   66.67%   8.33%     25%      50%      50%       0%       29.03%   51.61%    19.35%
JCSC         50%      25%       25%      100%     0%        0%       96.77%   3.23%     0%
All          16.67%   25%       58.33%   0%       0%        100.0%   0%       38.71%    61.29%
Table 16: Percentages of Non-Severe Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     62.5%    25%       12.5%    11.11%   0%        88.89%   33.06%   18.18%    48.76%
JLint        81.25%   18.75%    0%       83.33%   11.11%    5.56%    97.52%   0.83%     1.65%
PMD          62.5%    18.75%    18.75%   16.67%   5.56%     77.78%   2.48%    27.27%    70.25%
CheckStyle   75%      18.75%    6.25%    66.67%   33.33%    0%       33.06%   34.71%    32.23%
JCSC         75%      25%       0%       100%     0%        0%       84.3%    11.57%    4.13%
All          50%      31.25%    18.75%   5.56%    0%        94.44%   0.83%    25.62%    73.55%
effectively than non-severe defects. For Lucene, only 16.67% of severe defects but 50% of non-severe defects could not be captured with all five tools. For Rhino and AspectJ, all severe defects could be captured using all five tools, and only 5.56% of Rhino’s and 0.83% of AspectJ’s non-severe defects could not be captured.
Summary. Overall, we find that among the five tools, PMD captures the most defects, both severe and non-severe, in most datasets. Also, the five tools together capture severe defects more effectively than non-severe defects.
4.2.5 RQ5: Effectiveness on Defects of Various Difficulties
Some defects are more difficult to fix than others. Difficulty could be measuredin various ways: time needed to fix a defect, number of bug resolvers involved inthe bug fixing process, number of lines of code churned to fix a bug, and so on.We investigate these different dimensions of difficulty and analyze the numberof false negatives on defects of various difficulties in the following paragraphs.
Consider Time to Fix a Defect.
First, we measure difficulty in terms of the time it takes to fix a defect. Time to fix a defect has been studied in a number of previous studies (Kim and Whitehead Jr, 2006; Weiss et al, 2007; Hosseini et al, 2012). Following Thung et al (2012), we divide time to fix a defect into 2 classes: ≤30 days, and >30 days.
We show the results for ≤30 days and >30 days defects in Tables 17 and 18, respectively. Generally, we find that >30 days defects can be better captured by the tools than ≤30 days defects. By using the five tools together (i.e., the “All” rows in the tables), for defects that are fixed within 30 days, we could partially or fully capture all defects of AspectJ, but miss 14.29% and 40% of the defects of Rhino and Lucene, respectively. For defects that are fixed in more than 30 days, all defects of Lucene and Rhino can be captured by the five tools together, while only 0.79% of AspectJ defects are missed.
Also, we find that among all five tools, PMD most effectively captures both defects fixed within 30 days and those fixed in more than 30 days. The top two most effective tools in capturing defects fixed within 30 days are similar to those for defects fixed in more than 30 days. For Lucene’s defects, PMD and JCSC are the most effective tools for both groups: only 56% and 68% of defects fixed within 30 days could not be captured by PMD and JCSC, respectively; also, PMD could capture all of the defects fixed in more than 30 days, while JCSC, FindBugs, JLint, and CheckStyle could not capture 33.33% of the >30 days defects. For Rhino’s defects, PMD and FindBugs are the most effective tools for both groups: only 28.57% of defects fixed within 30 days could not be identified by PMD and FindBugs; also, FindBugs could capture all of Rhino’s defects fixed in more than 30 days, and only 7.69% of
Table 17: Percentages of Defects Fixed Within 30 Days that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     76%      8%        16%      28.57%   0%        71.43%   11.54%   23.08%    65.38%
JLint        88%      4%        8%       85.71%   0%        14.29%   96.15%   0%        3.85%
PMD          56%      8%        36%      28.57%   0%        71.43%   0%       26.92%    73.08%
CheckStyle   76%      8%        16%      85.71%   14.29%    0%       38.46%   30.77%    30.77%
JCSC         68%      20%       12%      100%     0%        0%       96.15%   3.85%     0%
All          40%      24%       36%      14.29%   0%        85.71%   0%       23.08%    76.92%
Table 18: Percentages of Defects Fixed Above 30 Days that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     33.33%   66.67%    0%       0%       0%        100%     38.1%    19.05%    42.86%
JLint        33.33%   66.67%    0%       84.62%   15.38%    0%       97.62%   1.59%     0.79%
PMD          0%       66.67%    33.33%   7.69%    7.69%     84.62%   2.38%    30.16%    67.46%
CheckStyle   33.33%   66.67%    0%       53.85%   46.15%    0%       30.95%   39.68%    29.37%
JCSC         33.33%   66.67%    0%       100%     0%        0%       84.92%   11.11%    3.97%
All          0%       66.67%    33.33%   0%       0%        100.0%   0.79%    29.37%    69.84%
the >30 days defects could not be captured by PMD. For AspectJ’s defects, PMD, FindBugs, and CheckStyle are the most effective tools for both groups: only 38.46%, 11.54%, and 0% of defects fixed within 30 days could not be identified by CheckStyle, FindBugs, and PMD, respectively; also, only 38.1%, 30.95%, and 2.38% of defects fixed in more than 30 days could not be identified by FindBugs, CheckStyle, and PMD, respectively.
Summary. Overall, we find that the five tools used together are more effective in capturing defects that are fixed in more than 30 days than those fixed within 30 days. This shows that the tools are beneficial, as they can capture the more difficult bugs. Also, among all five tools, PMD most effectively captures both defects fixed within 30 days and those fixed in more than 30 days.
Consider Number of Resolvers for a Defect.
Next, we measure difficulty in terms of the number of bug resolvers. The number of bug resolvers has been studied in a number of previous studies (e.g., Wu et al, 2011b; Xie et al, 2012; Xia et al, 2013). Based on the number of bug resolvers, we divide bugs into 2 classes: ≤2 people, and >2 people. A simple bug typically has only 2 resolvers: a triager who assigns the bug report to a fixer, and the fixer who fixes the bug.
We show the results for defects with ≤2 and >2 resolvers in Tables 19 and 20, respectively. When using the 5 tools together, we find that they are almost equally effective in capturing ≤2 and >2 resolvers defects for the Lucene and AspectJ datasets. For the Lucene dataset, using all 5 tools, we miss the same percentage (i.e., 35.71%) of ≤2 and >2 resolvers defects. For the AspectJ dataset, using all 5 tools, we miss 0% and 1.23% of ≤2 and >2 resolvers defects, respectively. However, for the Rhino dataset, using all 5 tools together, we can more effectively capture >2 resolvers defects than ≤2 resolvers defects, i.e., 0% of >2 resolvers defects and 25% of ≤2 resolvers defects are missed.
Also, we find that among all five tools, PMD most effectively captures both ≤2 and >2 resolvers defects. The top two most effective tools in capturing ≤2 resolvers defects are similar to those for >2 resolvers defects. For Lucene’s defects, PMD and JCSC are the most effective tools for both groups: only 42.86% and 57.14% of ≤2 resolvers defects could not be captured by PMD and JCSC, respectively; also, only 57.14% and 71.43% of >2 resolvers defects could not be identified by PMD and JCSC, respectively. For Rhino’s defects, PMD and FindBugs are the most effective tools for both groups: only 25% and 50% of ≤2 resolvers defects could not be captured by PMD and FindBugs; also, FindBugs could capture all of Rhino’s >2 resolvers defects, while only 12.5% of them could not be captured by PMD. For AspectJ’s defects, PMD, FindBugs, and CheckStyle are the most effective tools for both ≤2 and >2 resolvers defects: only 35.21%, 33.8%, and 2.82% of ≤2 resolvers defects
Table 19: Percentages of Defects Fixed by at Most 2 Resolvers that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     64.29%   21.43%    14.29%   50%      0%        50%      33.8%    18.31%    47.89%
JLint        64.29%   21.43%    14.29%   75%      0%        25%      97.18%   1.41%     1.41%
PMD          42.86%   14.29%    42.86%   25%      25%       50%      2.82%    26.76%    70.42%
CheckStyle   57.14%   14.29%    28.57%   75%      25%       0%       35.21%   35.21%    29.58%
JCSC         57.14%   21.43%    21.43%   100%     0%        0%       88.73%   8.45%     2.82%
All          35.71%   21.43%    42.86%   25%      0%        75%      0%       25.35%    74.65%
Table 20: Percentages of Defects Fixed by More than 2 Resolvers that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     78.57%   7.14%     14.29%   0%       0%        100%     33.33%   20.99%    45.68%
JLint        100%     0%        0%       87.5%    12.5%     0%       97.53%   1.23%     1.23%
PMD          57.14%   14.29%    28.57%   12.5%    0%        87.5%    1.23%    32.1%     66.67%
CheckStyle   85.71%   14.29%    0%       62.5%    37.5%     0%       29.63%   40.74%    29.63%
JCSC         71.43%   28.57%    0%       100%     0%        0%       85.19%   11.11%    3.7%
All          35.71%   35.71%    28.57%   0%       0%        100.0%   1.23%    30.86%    67.9%
could not be captured by CheckStyle, FindBugs, and PMD, respectively; also, only 33.33%, 29.63%, and 1.23% of AspectJ’s >2 resolvers defects could not be captured by FindBugs, CheckStyle, and PMD, respectively.
Summary. Overall, we find that by using the five tools together, we can capture >2 resolvers defects more effectively than ≤2 resolvers defects (at least on the Rhino dataset). This shows that the tools are beneficial, as they can capture the more difficult bugs. Also, among all five tools, PMD most effectively captures both ≤2 and >2 resolvers defects.
Consider Number of Lines of Code Churned for a Defect.
We also measure difficulty in terms of the number of lines of code churned. The number of lines of code churned has been investigated in a number of previous studies, e.g., (Nagappan and Ball, 2005; Giger et al, 2011). Based on the number of lines of code churned, we divide the bugs into 2 classes: ≤10 lines churned, and >10 lines churned.
We show the results for ≤10 and >10 lines churned defects in Tables 21 and 22. The performance of the 5 tools (used together) in capturing ≤10 lines churned defects is similar to their performance in capturing >10 lines churned defects. For Lucene, using the 5 tools together, we can capture ≤10 lines churned defects better than >10 lines churned defects (33.33% vs. 38.46% of the defects are missed, respectively). However, for Rhino, using the 5 tools together, we can capture >10 lines churned defects better than ≤10 lines churned defects (7.14% vs. 0% of the defects are missed, respectively). For AspectJ, using the 5 tools together, we can capture >10 and ≤10 lines churned defects almost equally well (0.78% vs. 0% of the defects are missed, respectively).
We find that among all five tools, PMD most effectively captures both ≤10 and >10 lines churned defects. The top two most effective tools in capturing ≤10 lines churned defects, however, may differ from those capturing >10 lines churned defects.
For Lucene’s defects, PMD and CheckStyle are the most effective tools for ≤10 lines churned defects (i.e., only 40% and 60% of the defects are missed by PMD and CheckStyle, respectively), while PMD and JCSC are the most effective tools for >10 lines churned defects (i.e., only 61.54% of the defects are missed by each). For Rhino’s defects, PMD and FindBugs are the most effective tools for both groups: both tools capture all of Rhino’s >10 lines churned defects, and only 21.43% and 14.29% of Rhino’s ≤10 lines churned defects could not be captured by them. For AspectJ’s defects, PMD, FindBugs, and CheckStyle are the most effective tools for both groups: only 34.11%, 33.33%, and 2.33% of ≤10 lines churned defects could not be identified by CheckStyle, FindBugs, and PMD, respectively; also, only 34.78%, 21.74%, and 0% of AspectJ’s >10 lines churned defects could not be identified by FindBugs, CheckStyle, and PMD, respectively.
Table 21: Percentages of Defects Fixed by Churning at Most 10 Lines that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial   Full     Miss     Partial   Full     Miss     Partial   Full
FindBugs     66.67%   20%       13.33%   14.29%   0%        85.71%   33.33%   19.38%    47.29%
JLint        80%      13.33%    6.67%    92.86%   0%        7.14%    96.9%    1.55%     1.55%
PMD          40%      20%       40%      21.43%   0%        78.57%   2.33%    27.13%    70.54%
CheckStyle   60%      20%       20%      85.71%   14.29%    0%       34.11%   32.56%    33.33%
JCSC         66.67%   20%       13.33%   100%     0%        0%       89.15%   6.98%     3.88%
All          33.33%   26.67%    40%      7.14%    0%        92.86%   0.78%    25.58%    73.64%
Table 22: Percentages of Defects Fixed by Churning More than 10 Lines that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     76.92%   7.69%    15.38%    0%       0%       100%      34.78%   21.74%   43.48%
JLint        84.62%   7.69%    7.69%     66.67%   33.33%   0%        100%     0%       0%
PMD          61.54%   7.69%    30.77%    0%       16.67%   83.33%    0%       43.48%   56.52%
CheckStyle   84.62%   7.69%    7.69%     16.67%   83.33%   0%        21.74%   69.57%   8.7%
JCSC         61.54%   30.77%   7.69%     100%     0%       0%        73.91%   26.09%   0%
All          38.46%   30.77%   30.77%    0%       0%       100%      0%       43.48%   56.52%
Table 23: Distribution of Defect Types

Type                                Number of Defects
Assignment related defects          27
Conditional related defects         127
Return related defects              11
Method invocation related defects   28
Object usage related defects        7
Summary. Overall, we find that the performance of the 5 tools (used together) in capturing ≤10 lines churned defects is similar to that in capturing >10 lines churned defects. Also, among the five tools, PMD most effectively identifies both ≤10 and >10 lines churned defects.
4.2.6 RQ6: Effectiveness on Defects of Various Types
We consider a rough categorization of defects based on the type of the program elements that are affected by a defect. We categorize defects into the following families (or types): assignment, conditional, return, method invocation, and object usage. A defect belongs to a family if the statement responsible for the defect contains a corresponding program construct. For example, if a defect affects an “if” statement, it belongs to the conditional family. Since a defect can span multiple program statements of different types, it can belong to multiple families; for such cases, we analyze the bug fix and choose one dominant family. Table 23 describes the distribution of defect types among the 200 defects that we analyze in this study.
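The family assignment described above can be illustrated with a keyword heuristic. The regular expressions below are hypothetical simplifications for illustration only; the actual categorization in the study was done manually on the bug fixes. The priority order stands in for the manual choice of one dominant family when a statement matches several.

```python
import re

# Hypothetical patterns for each defect family, in dominance order.
FAMILY_PATTERNS = [
    ("conditional",       re.compile(r"\b(if|else|switch|case)\b")),
    ("return",            re.compile(r"\breturn\b")),
    ("method invocation", re.compile(r"\w+\s*\(")),
    ("assignment",        re.compile(r"[^=!<>]=[^=]")),
    ("object usage",      re.compile(r"\bnew\b|\.\w+")),
]

def dominant_family(statement):
    """Return the first (dominant) family whose construct appears in the
    faulty statement, or 'other' if none matches."""
    for name, pattern in FAMILY_PATTERNS:
        if pattern.search(statement):
            return name
    return "other"
```

For example, dominant_family("count = count + 1;") returns "assignment", while a statement containing an "if" is classified as conditional even if it also invokes a method.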
We investigate the number of false negatives for each category of defects. We show the results for assignment related defects in Table 24. We find that using the five tools together, we could effectively capture (partially or fully) assignment related defects in all datasets: we capture all assignment related defects in Rhino and AspectJ, and only 16.67% of Lucene’s assignment related defects are missed by the tools. We also find that PMD is the most effective tool in capturing assignment related defects, while JLint and JCSC are the least effective. For Lucene, 33.33%, 66.66%, and 66.67% of its assignment related defects are missed by PMD, FindBugs, and CheckStyle, respectively. PMD and FindBugs capture Rhino’s assignment related defects without missing any, and only 11.11% of AspectJ’s assignment related defects are missed by these tools. Thus, besides PMD, FindBugs is also effective in capturing assignment related defects.
We show the results for conditional related defects in Table 25. We find that using the five tools together, we capture all conditional related defects in Rhino, and only 0.99% of AspectJ’s and 38.46% of Lucene’s conditional related defects are missed by the tools. We also find that PMD is the most effective tool in capturing conditional related defects, while JLint is the least effective. For Lucene, PMD misses only 46.15% of conditional related defects, while the other tools miss more than 50% of
Table 24: Percentages of Assignment Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     66.66%   16.67%   16.67%    0%       0%       100%      11.11%   11.11%   77.78%
JLint        100%     0%       0%        100%     0%       0%        94.44%   5.56%    0%
PMD          33.33%   16.67%   50%       0%       0%       100%      11.11%   5.56%    83.33%
CheckStyle   66.67%   16.67%   16.67%    33.33%   66.67%   0%        38.89%   27.78%   33.33%
JCSC         83.33%   16.67%   0%        100%     0%       0%        88.89%   11.11%   0%
All          16.67%   33.33%   50%       0%       0%       100%      0%       5.56%    94.44%
Table 25: Percentages of Conditional Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     61.54%   15.38%   23.08%    7.69%    0%       92.31%    30.69%   23.76%   45.54%
JLint        76.92%   15.38%   7.69%     84.62%   15.38%   0%        98.02%   0.99%    0.99%
PMD          46.15%   15.38%   38.46%    15.38%   7.69%    76.92%    0.99%    34.65%   64.36%
CheckStyle   69.23%   15.38%   15.38%    69.23%   30.77%   0%        30.69%   39.6%    29.7%
JCSC         61.54%   23.08%   15.38%    100%     0%       0%        92.08%   6.93%    0.99%
All          38.46%   23.08%   38.46%    0%       0%       100%      0.99%    33.66%   65.35%
Table 26: Percentages of Return Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     0%       100%     0%        50%      0%       50%       75%      12.5%    12.5%
JLint        0%       100%     0%        50%      0%       50%       100%     0%       0%
PMD          0%       100%     0%        50%      0%       50%       0%       12.5%    87.5%
CheckStyle   0%       100%     0%        100%     0%       0%        37.5%    37.5%    25%
JCSC         0%       100%     0%        100%     0%       0%        62.5%    25%      12.5%
All          0%       100%     0%        50%      0%       50%       0%       12.5%    87.5%
the defects. PMD and FindBugs capture Rhino’s conditional related defects effectively, missing only 15.38% and 7.69% of them respectively, and only 30.69% and 0.99% of AspectJ’s conditional related defects are missed by FindBugs and PMD, respectively. Thus, aside from PMD, FindBugs is also effective in capturing conditional related defects.
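Wherever a table reports results for "the five tools together", a defect counts against the union of all tools' warnings. A minimal sketch of this combination, under the same hypothetical data shapes as before (each tool contributing a set of flagged lines), might look like:

```python
def combined_classification(faulty_lines, warnings_per_tool):
    """Classify a defect against the union of several tools' warnings.
    warnings_per_tool maps a tool name to the set of lines it flags;
    the names and shapes are illustrative, not the authors' actual data."""
    union = set().union(*warnings_per_tool.values())
    covered = faulty_lines & union
    if not covered:
        return "miss"
    return "full" if covered == faulty_lines else "partial"
```

Under this scheme a defect that no single tool fully covers can still be fully captured jointly, e.g. when FindBugs flags one faulty line and PMD the other.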
We show the results for return related defects in Table 26. We find that using the five tools together captures all return related defects in AspectJ and Lucene; however, the tools are less effective on return related defects in Rhino (50% missed defects). We also find that PMD is the most effective tool in capturing return related defects. For Lucene, all tools capture (partially or fully) all the return related defects. For Rhino, 50% of return related defects are missed by PMD, FindBugs, and JLint, while the other tools miss all of them (100% missed defects). For AspectJ, PMD captures all return related defects without missing any, and CheckStyle misses only 37.5% of the defects. The other tools are less effective on AspectJ’s return related defects, as they miss more than 50% of the defects.
We show the results for method invocation related defects in Table 27. We find that using the five tools together captures all method invocation defects of Rhino and AspectJ, but the tools are less effective on method invocation related defects in Lucene (50% missed defects). We also find that PMD is the most effective tool in capturing method invocation related defects. PMD captures all method invocation related defects in Rhino and AspectJ, but it misses 66.67% of Lucene’s method invocation related defects. The other 4 tools each miss at least 66.67% of Lucene’s method invocation related defects. Other than PMD, FindBugs also captures all of Rhino’s method invocation related defects without missing any, while the other 3 tools miss all of Rhino’s method invocation related defects. For AspectJ, CheckStyle also captures the defects effectively with only 28.57% missed, while the other 3 tools each miss more than 40% of AspectJ’s method invocation related defects.
We show the results for object usage related defects in Table 28. We find that using the five tools together, we capture all object usage related defects of Rhino and AspectJ, but the tools are less effective on object usage related defects of Lucene (50% missed defects). We also find that PMD is generally the most effective tool in capturing object usage related defects. PMD captures all object usage related defects in Rhino and AspectJ, but it misses all of Lucene’s object usage related defects. Only JCSC captures 50% of Lucene’s object usage related defects, while the other tools miss these defects entirely. Other than PMD, FindBugs and CheckStyle also capture all of Rhino’s object usage related defects without missing any, while the other tools miss all of Rhino’s object usage related defects. For AspectJ, the tools other than PMD are less effective on object usage related defects: they miss 50% or more of the defects.
Table 27: Percentages of Method Invocation Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     100%     0%       0%        0%       0%       100%      42.86%   14.29%   42.86%
JLint        83.33%   0%       16.67%    100%     0%       0%        95.24%   0%       4.76%
PMD          66.67%   0%       33.33%    0%       0%       100%      0%       33.33%   66.67%
CheckStyle   83.33%   0%       16.67%    100%     0%       0%        28.57%   38.1%    33.33%
JCSC         66.67%   16.67%   16.67%    100%     0%       0%        76.19%   9.52%    14.29%
All          50%      16.67%   33.33%    0%       0%       100%      0%       28.57%   71.43%
Table 28: Percentages of Object Usage Related Defects that are Missed, Partially Captured, and Fully Captured

Tools vs.    Lucene                      Rhino                       AspectJ
Programs     Miss     Partial  Full      Miss     Partial  Full      Miss     Partial  Full
FindBugs     100%     0%       0%        0%       0%       100%      75%      0%       25%
JLint        100%     0%       0%        100%     0%       0%        100%     0%       0%
PMD          100%     0%       0%        0%       0%       100%      0%       25%      75%
CheckStyle   100%     0%       0%        0%       100%     0%        50%      50%      0%
JCSC         50%      50%      0%        100%     0%       0%        50%      50%      0%
All          50%      50%      0%        0%       0%       100%      0%       25%      75%
Summary. Overall, we find that PMD outperforms the other tools in capturing various types of defects. The effectiveness of the other tools varies across the datasets; for example, FindBugs is more effective in capturing defects in Rhino, while CheckStyle is more effective in capturing defects in AspectJ. JLint and JCSC are often the least effective in capturing defects in the datasets. Based on the percentage of missed defects, averaging across the defect families, the 5 bug-finding tools (used together) more effectively capture assignment defects, followed by conditional defects. They are less effective in capturing return, method invocation, and object usage related defects.
4.3 Threats to Validity
Threats to internal validity may include experimental biases. We manually extract faulty lines from the changes that fix them; this manual process might be error-prone. We also manually categorize defects into their families; this manual process might also be error-prone. To reduce this threat, we have checked and refined the results. We also exclude defects that could not be unambiguously localized to the responsible faulty lines of code, as well as some versions of our subject programs that cannot be compiled. There might also be implementation errors in the various scripts and code used to collect and evaluate defects.
Threats to external validity are related to the generalizability of our findings. In this work, we only analyze five static bug-finding tools and three open source Java programs. We analyze only one version of the Lucene dataset; for Rhino and AspectJ, we only analyze defects that are available in iBugs. Also, we investigate only defects that get reported and fixed. In the future, we could analyze more programs and more defects to reduce selection bias. We also plan to investigate more bug-finding tools and programs in various languages.
5 Related Work
We summarize related studies on bug finding, warning prioritization, bug triage, and empirical studies on defects.
5.1 Bug Finding
Many bug-finding tools have been proposed in the literature (Nielson et al, 2005; Necula et al, 2005; Holzmann et al, 2000; Sen et al, 2005; Godefroid et al, 2005; Cadar et al, 2008; Sliwerski et al, 2005). A short survey of these tools is provided in Section 2. In this paper, we focus on five static bug-finding tools for Java: FindBugs, PMD, JLint, CheckStyle, and JCSC. These tools are relatively lightweight and have been applied to large systems.
None of these bug-finding tools is able to detect all kinds of bugs. To the best of our knowledge, there is no study on the false negative rates of these tools. We are the first to carry out an empirical study addressing this issue, based on a few hundred real-life bugs from three Java programs. In the future, we plan to investigate other bug-finding tools with more programs written in different languages.
5.2 Warning Prioritization
Many studies deal with the many false positives produced by various bug-finding tools. Kremenek and Engler (2003) use z-ranking to prioritize warnings. Kim and Ernst (2007) prioritize warning categories using historical data. Ruthruff et al (2008) predict actionable static analysis warnings by proposing a logistic regression model that differentiates false positives from actionable warnings. Liang et al (2010) propose a technique to construct a training set for better prioritization of static analysis warnings. A comprehensive survey of static analysis warning prioritization has been written by Heckman and Williams (2011). There are also large-scale studies on the false positives produced by FindBugs (Ayewah and Pugh, 2010; Ayewah et al, 2008).
While past studies on warning prioritization focus on coping with false positives with respect to actionable warnings, we investigate an orthogonal problem on false negatives. False negatives are important as they can cause bugs to go unnoticed and cause harm when the software is used by end users. Analyzing false negatives is also important to guide future research on building additional bug-finding tools.
5.3 Bug Triage
Even when the bug reports, either from a tool or from a human user, are all for real bugs, the sheer number of reports can be huge for a large system. In order to allocate appropriate development and maintenance resources, project managers often need to triage the reports.
Cubranic and Murphy (2004) propose a method that uses text categorization to triage bug reports. Anvik et al (2006) use machine learning techniques to triage bug reports. Hooimeijer and Weimer (2007) construct a descriptive model to measure the quality of bug reports. Park et al (2011) propose CosTriage, which further takes the cost associated with bug reporting and fixing into consideration. Sun et al (2010) use data mining techniques to detect duplicate bug reports.
Different from warning prioritization, these studies address the problem of too many true positives. They are also different from our work, which deals with false negatives.
5.4 Empirical Studies on Defects
Many studies investigate the nature of defects. Pan et al (2009) investigate different bug fix patterns for various software systems. They highlight patterns such as method calls with different actual parameter values, changes of assignment expressions, etc. Chou et al (2001) perform an empirical study of operating system errors.
Closest to our work is the study by Rutar et al (2004) on the comparison of bug-finding tools for Java. They compare the number of warnings generated by various bug-finding tools. However, no analysis has been made as to whether there are false positives or false negatives. The authors commented that: “An interesting area of future work is to gather extensive information about the actual faults in programs, which would enable us to precisely identify false positives and false negatives.” In this study, we address the false negatives mentioned by them.
6 Conclusion and Future Work
Defects can harm software vendors and users. A number of static bug-finding tools have been developed to catch defects. In this work, we empirically study the effectiveness of state-of-the-art static bug-finding tools in preventing real-life defects. We investigate five bug-finding tools, FindBugs, JLint, PMD, CheckStyle, and JCSC, on three programs: Lucene, Rhino, and AspectJ. We analyze 200 fixed real defects and extract the faulty lines of code responsible for these defects from their treatments.
We find that most defects in the programs could be partially or fully captured by combining the reports generated by all bug-finding tools. However, a substantial proportion of defects (e.g., about 36% for Lucene) are still missed. We find that FindBugs and PMD are the best among the bug-finding tools in preventing false negatives. Our stricter analysis shows that although many of these warnings cover faulty lines, the warnings are often too generic, and developers need to inspect the code to find the defects. We find that some bugs are not flagged by any of the bug-finding tools; these bugs involve logical or functionality errors that are difficult to detect without any specification of the system. We also find that the bug-finding tools are more effective in capturing severe than non-severe defects. Also, based on various measures of difficulty, the bug-finding tools are more effective in capturing more difficult bugs. Furthermore, the bug-finding tools perform differently for different bug types: they are more effective in capturing assignment and conditional defects than return, method invocation, and object usage related defects.
As future work, we would like to reduce the threats to external validity by experimenting with even more bugs from more software systems. Based on the characteristics of the missed bugs, we also plan to build a new tool that could help reduce false negatives further.
7 Acknowledgement
We thank the researchers creating and maintaining the iBugs repository, which is publicly available at http://www.st.cs.uni-saarland.de/ibugs/. We also appreciate very much the valuable comments from the anonymous reviewers and our shepherd Andreas Zeller on earlier versions of this paper.
References
Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: ICSE, pp 361–370
Artho C (2006) JLint - find bugs in Java programs. http://jlint.sourceforge.net/
Ayewah N, Pugh W (2010) The Google FindBugs fixit. In: ISSTA, pp 241–252
Ayewah N, Pugh W, Morgenthaler JD, Penix J, Zhou Y (2007) Evaluating static analysis defect warnings on production software. In: PASTE, pp 1–8
Ayewah N, Hovemeyer D, Morgenthaler JD, Penix J, Pugh W (2008) Using static analysis to find bugs. IEEE Software 25(5):22–29
Ball T, Levin V, Rajamani SK (2011) A decade of software model checking with SLAM. Commun ACM 54(7):68–76
Beyer D, Henzinger TA, Jhala R, Majumdar R (2007) The software model checker Blast. STTT 9(5-6):505–525
Brun Y, Ernst MD (2004) Finding latent code errors via machine learning over program executions. In: ICSE, pp 480–490
Burn O (2007) Checkstyle. http://checkstyle.sourceforge.net/
Cadar C, Ganesh V, Pawlowski PM, Dill DL, Engler DR (2008) EXE: Automatically generating inputs of death. ACM Trans Inf Syst Secur 12(2)
Chou A, Yang J, Chelf B, Hallem S, Engler DR (2001) An empirical study of operating system errors. In: SOSP, pp 73–88
Copeland T (2005) PMD Applied. Centennial Books
Corbett JC, Dwyer MB, Hatcliff J, Laubach S, Pasareanu CS, Robby, Zheng H (2000) Bandera: extracting finite-state models from Java source code. In: ICSE, pp 439–448
Cousot P, Cousot R (2012) An abstract interpretation framework for termination. In: POPL, pp 245–258
Cousot P, Cousot R, Feret J, Mauborgne L, Mine A, Rival X (2009) Why does Astree scale up? Formal Methods in System Design 35(3):229–264
Cubranic D, Murphy GC (2004) Automatic bug triage using text categorization. In: SEKE, pp 92–97
Dallmeier V, Zimmermann T (2007) Extraction of bug localization benchmarks from history. In: ASE, pp 433–436
Gabel M, Su Z (2010) Online inference and enforcement of temporal properties. In: ICSE, pp 15–24
Giger E, Pinzger M, Gall H (2011) Comparing fine-grained source code changes and code churn for bug prediction. In: MSR, pp 83–92
Godefroid P, Klarlund N, Sen K (2005) DART: directed automated random testing. In: PLDI, pp 213–223
GrammaTech (2012) CodeSonar. http://www.grammatech.com/products/codesonar/overview.html
Heckman SS (2007) Adaptively ranking alerts generated from automated static analysis. ACM Crossroads 14(1)
Heckman SS, Williams LA (2009) A model building process for identifying actionable static analysis alerts. In: ICST, pp 161–170
Heckman SS, Williams LA (2011) A systematic literature review of actionable alert identification techniques for automated static code analysis. Information & Software Technology 53(4):363–387
Holzmann GJ, Najm E, Serhrouchni A (2000) SPIN model checking: An introduction. STTT 2(4):321–327
Hooimeijer P, Weimer W (2007) Modeling bug report quality. In: ASE, pp 34–43
Hosseini H, Nguyen R, Godfrey MW (2012) A market-based bug allocation mechanism using predictive bug lifetimes. In: 16th European Conference on Software Maintenance and Reengineering (CSMR), pp 149–158
Hovemeyer D, Pugh W (2004) Finding bugs is easy. In: OOPSLA
IBM (2012) T.J. Watson Libraries for Analysis (WALA). http://wala.sourceforge.net
Jocham R (2005) JCSC - Java coding standard checker. http://jcsc.sourceforge.net/
Kawrykow D, Robillard MP (2011) Non-essential changes in version histories. In: ICSE
Kim S, Ernst MD (2007) Prioritizing warning categories by analyzing software history. In: MSR
Kim S, Jr EJW (2006) How long did it take to fix bugs? In: Proceedings of the 2006 International Workshop on Mining Software Repositories, pp 173–174
Kim S, Zimmermann T, Jr EJW, Zeller A (2008) Predicting faults from cached history. In: ISEC, pp 15–16
Kremenek T, Engler DR (2003) Z-ranking: Using statistical analysis to counter the impact of static analysis approximations. In: SAS
Liang G, Wu L, Wu Q, Wang Q, Xie T, Mei H (2010) Automatic construction of an effective training set for prioritizing static analysis warnings. In: ASE
Nagappan N, Ball T (2005) Use of relative code churn measures to predict system defect density. In: ICSE, pp 284–292
Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: ICSE, pp 452–461
Necula GC, Condit J, Harren M, McPeak S, Weimer W (2005) CCured: type-safe retrofitting of legacy software. ACM Trans Program Lang Syst 27(3):477–526
Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. In: PLDI, pp 89–100
Nielson F, Nielson HR, Hankin C (2005) Principles of Program Analysis. Springer
Ostrand TJ, Weyuker EJ, Bell RM (2005) Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng 31:340–355
Pan K, Kim S, Jr EJW (2009) Toward an understanding of bug fix patterns. Empirical Software Engineering 14(3):286–315
Park JW, Lee MW, Kim J, won Hwang S, Kim S (2011) CosTriage: A cost-aware triage algorithm for bug reporting systems. In: AAAI
Rutar N, Almazan CB, Foster JS (2004) A comparison of bug finding tools for Java. In: ISSRE, pp 245–256
Ruthruff JR, Penix J, Morgenthaler JD, Elbaum SG, Rothermel G (2008) Predicting accurate and actionable static analysis warnings: an experimental approach. In: ICSE, pp 341–350
Sen K, Marinov D, Agha G (2005) CUTE: a concolic unit testing engine for C. In: ESEC/SIGSOFT FSE, pp 263–272
Sliwerski J, Zimmermann T, Zeller A (2005) When do changes induce fixes? SIGSOFT Softw Eng Notes 30:1–5
Sun C, Lo D, Wang X, Jiang J, Khoo SC (2010) A discriminative model approach for accurate duplicate bug report retrieval. In: ICSE, pp 45–54
Tassey G (2002) The economic impacts of inadequate infrastructure for software testing. National Institute of Standards and Technology Planning Report 02-3
Thung F, Lo D, Jiang L, Lucia, Rahman F, Devanbu PT (2012) When would this bug get reported? In: ICSM, pp 420–429
Visser W, Mehlitz P (2005) Model checking programs with Java PathFinder. In: SPIN
Weeratunge D, Zhang X, Sumner WN, Jagannathan S (2010) Analyzing concurrency bugs using dual slicing. In: ISSTA, pp 253–264
Weiss C, Premraj R, Zimmermann T, Zeller A (2007) How long will it take to fix this bug? In: Proceedings of the Fourth International Workshop on Mining Software Repositories, MSR '07
Wu R, Zhang H, Kim S, Cheung SC (2011a) ReLink: recovering links between bugs and changes. In: SIGSOFT FSE, pp 15–25
Wu W, Zhang W, Yang Y, Wang Q (2011b) DREX: Developer recommendation with k-nearest-neighbor search and expertise ranking. In: 18th Asia-Pacific Software Engineering Conference (APSEC), pp 389–396
Xia X, Lo D, Wang X, Zhou B (2013) Accurate developer recommendation for bug resolution. In: Proceedings of the 20th Working Conference on Reverse Engineering
Xie X, Zhang W, Yang Y, Wang Q (2012) DRETOM: Developer recommendation based on topic models for bug resolution. In: Proceedings of the 8th International Conference on Predictive Models in Software Engineering, pp 19–28
Xie Y, Aiken A (2007) Saturn: A scalable framework for error detection using boolean satisfiability. ACM Trans Program Lang Syst 29(3)