

Research Article
Local versus Global Models for Just-In-Time Software Defect Prediction

Xingguang Yang,1,2 Huiqun Yu,1,3 Guisheng Fan,1 Kai Shi,1 and Liqiong Chen4

1 Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
2 Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai 201112, China
3 Shanghai Engineering Research Center of Smart Energy, Shanghai, China
4 Department of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China

Correspondence should be addressed to Huiqun Yu (yhq@ecust.edu.cn) and Guisheng Fan (gsfan@ecust.edu.cn).

Received 16 January 2019; Revised 14 April 2019; Accepted 23 April 2019; Published 12 June 2019

Academic Editor: Emiliano Tramontana

Copyright © 2019 Xingguang Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Just-in-time software defect prediction (JIT-SDP) is an active topic in software defect prediction, which aims to identify defect-inducing changes. Recently, some studies have found that the variability of defect data sets can affect the performance of defect predictors, and that using local models can help improve the performance of prediction models. However, previous studies have focused on module-level defect prediction, so whether local models are still valid in the context of JIT-SDP is an important open issue. To this end, we compare the performance of local and global models through a large-scale empirical study based on six open-source projects with 227,417 changes. The experiment considers three evaluation scenarios: cross-validation, cross-project-validation, and timewise-cross-validation. To build local models, the experiment uses k-medoids to divide the training set into several homogeneous regions. In addition, logistic regression and effort-aware linear regression (EALR) are used to build classification models and effort-aware prediction models, respectively. The empirical results show that local models perform worse than global models in classification performance. However, local models have significantly better effort-aware prediction performance than global models in the cross-validation and cross-project-validation scenarios. In particular, when the number of clusters k is set to 2, local models obtain the optimal effort-aware prediction performance. Therefore, local models are promising for effort-aware JIT-SDP.

1. Introduction

Today, software plays an important role in people's daily life. However, defective software can bring great economic losses to users and enterprises. According to estimates of the National Institute of Standards and Technology, the economic losses caused by software defects to the United States are as high as $60 billion per year [1]. Therefore, it is necessary to detect and repair defects in software through a large number of software quality assurance activities.

Software defect prediction technology plays an important role in software quality assurance, and it is also an active research topic in the field of software engineering data mining [2, 3]. In actual software development, test resources are always limited, so it is necessary to prioritize the allocation of limited test resources to modules that are more likely to be defective [4]. Software defect prediction technology aims to predict the defect proneness, defect number, or defect density of a program module by using defect prediction models based on statistical or machine learning methods [3]. The results of defect prediction models help us to assign limited test resources more reasonably and improve software quality.

Traditional defect prediction is performed at a coarse-grained level (such as the package or file level). However, when large files are predicted as defect prone by predictors, it can be time-consuming to perform code checks on them [5]. In addition, since large files have often been modified by multiple developers, it is not easy to find the right developer to check the files [6]. To this end, just-in-time software defect prediction (JIT-SDP) technology has been proposed [7–10]. JIT-SDP is performed at the change level, which has a finer granularity than the module level. Once developers submit a change to the program, defect predictors can identify whether the change is defect-inducing immediately. Therefore, JIT-SDP has the advantages of fine granularity, instantaneity, and traceability compared with traditional defect prediction [7].

In order to improve the performance of software defect prediction, researchers have invested a lot of effort. The main research content includes the use of various machine learning methods, feature selection, processing class imbalance, processing noise, etc. [2]. Recently, researchers have proposed a novel idea called local models that may help improve the performance of defect prediction [11–14]. Local models divide all available training data into several regions with similar metrics and train a prediction model for each region. However, the research objects in previous studies are file-level or class-level defect data sets. Whether local models are still valid in the context of JIT-SDP remains an open question.

To this end, we conduct experiments on change-level defect data sets from six open-source projects, comprising 227,417 changes. We empirically compare the classification performance and effort-aware prediction performance of local and global models through a large-scale empirical study. In order to build local models, the experiment uses the k-medoids clustering algorithm to divide the training samples into homogeneous regions. For each region, the experiment builds classification models and effort-aware prediction models based on logistic regression and effort-aware linear regression (EALR), respectively, which are commonly used as baseline models in prior studies [7–9].

The main contributions of this paper are as follows:

(i) We compare the classification performance and effort-aware prediction performance of local and global models in three evaluation scenarios, namely, the cross-validation scenario, the cross-project-validation scenario, and the timewise-cross-validation scenario, for JIT-SDP.

(ii) Considering the influence of the number of clusters k on the performance of local models, we evaluate the classification performance and effort-aware prediction performance of local models with different values of k by varying k from 2 to 10.

(iii) The empirical results show that local models perform worse in classification performance than global models in all three evaluation scenarios. In addition, local models have worse effort-aware prediction performance in the timewise-cross-validation scenario. However, local models have better effort-aware prediction performance than global models in the cross-validation scenario and the cross-project-validation scenario. In particular, when the parameter k is set to 2, local models obtain the optimal performance in the ACC and Popt indicators.

The results of the experiment provide valuable knowledge about the advantages and limitations of local models for the JIT-SDP problem. In order to ensure the repeatability of the experiment and promote future research, we share the code and data of the experiment.

This paper is organized as follows. Section 2 introduces the background and related work of software defect prediction. Section 3 introduces the details of local models. The experimental setup is presented in Section 4. Section 5 demonstrates the experimental results and analysis. Section 6 introduces the threats to validity. Section 7 summarizes the paper and describes future work.

2. Background and Related Work

2.1. Background of Software Defect Prediction. Software defect prediction technology originated in the 1970s [15]. At that time, software systems were less complex, and researchers found that the number of software defects in a program system was related to the number of lines of code. In 1971, Akiyama [16] proposed a formula for the relationship between the number of software defects (D) and lines of code (LOC) in a project: D = 4.86 + 0.018 × LOC. However, as software scale and complexity continue to increase, the formula is too simple to capture the relationship between D and LOC. Therefore, based on empirical knowledge, researchers designed a series of metrics associated with software defects. McCabe [17] believed that code complexity is better correlated with software defects than the number of code lines and proposed the cyclomatic complexity metric to measure code complexity. In 1994, Chidamber and Kemerer [18] proposed the CK metrics for object-oriented program code. In addition to code-based metrics, researchers found that metrics based on software development processes are also associated with software defects. Nagappan and Ball [19] proposed relative code churn metrics to predict defect density at the file level by measuring code churn during software development.

The purpose of defect prediction technology is to predict the defect proneness, number of defects, or defect density of a program module. Taking the prediction of defect proneness as an example, the general process of defect prediction is described in Figure 1:

(1) Program modules are extracted from the software history repository, and the granularity of the modules can be set to package, file, change, or function according to actual requirements.

(2) Based on the software code and the software development process, metrics related to software defects are designed and used to measure program modules. Defect data sets are built by mining defect logs in the version control system and defect reports in the defect tracking system.

(3) Defect data sets undergo the necessary preprocessing (excluding outliers, data normalization, etc.). Defect prediction models are built by using statistical or machine learning algorithms such as Naive Bayes, random forest, and support vector machines. Prediction models are evaluated based on performance indicators such as accuracy, recall, and AUC.

(4) When a new program module appears, the same metrics are used to measure the module, and the prediction models are used to predict whether the module is defect prone (DP) or nondefect prone (NDP).

2.2. Related Work

2.2.1. Just-In-Time Software Defect Prediction. Traditional defect prediction techniques are performed at coarse granularity (package, file, or function). Although these techniques are effective in some cases, they have the following disadvantages [7]: first, the granularity of predictions is coarse, so developers need to check a lot of code to locate the defective code; second, program modules are often modified by multiple developers, so it is difficult to find a suitable expert to perform code checking on the defect-prone modules; third, since the defect prediction is performed in the late stages of the software development cycle, developers may have forgotten the details of the development, which reduces the efficiency of defect detection and repair.

In order to overcome the shortcomings of traditional defect prediction, a new technology called just-in-time software defect prediction (JIT-SDP) has been proposed [7, 20]. JIT-SDP is a fine-grained prediction method whose prediction objects are code changes instead of modules, which has the following merits: first, since the prediction is performed at a fine granularity, the scope of the code inspection can be reduced, saving inspection effort; second, JIT-SDP is performed at the change level, and once a developer submits the modified code, the defect prediction can be executed. Therefore, defect predictions are performed in a more timely manner, and it is easy to find the right developer for buggy changes.

In recent years, research related to JIT-SDP has developed rapidly. Mockus and Weiss [20] designed change metrics to evaluate the probability of defect-inducing changes for initial modification requests (IMR) in a 5ESS network switch project. Yang et al. [21] first applied deep learning technology to JIT-SDP and proposed a method called Deeper. The empirical results show that the Deeper method can detect more buggy changes and has a better F1 indicator than the EALR method. Kamei et al. [10] conducted experiments on 11 open-source projects for JIT-SDP using cross-project models. They found that cross-project models can perform well when the training data are carefully selected. Yang et al. [22] proposed a two-layer ensemble learning method (TLEL), which leverages decision trees and ensemble learning. Experimental results show that TLEL can significantly improve the PofB20 and F1 indicators compared to three baselines. Chen et al. [9] proposed a novel supervised learning method, MULTI, which applies a multiobjective optimization algorithm to software defect prediction. Experimental results show that MULTI is superior to 43 state-of-the-art prediction models. Considering that a commit in software development may be partially defective, involving both defective and nondefective files, Pascarella et al. [23] proposed a fine-grained JIT-SDP model to predict buggy files within a commit, based on ten open-source projects. Experimental results show that 43% of defective commits are mixed, containing both buggy and clean resources, and that their method can obtain 82% AUC-ROC to detect defective files in a commit.

2.2.2. Effort-Aware Software Defect Prediction. In recent years, researchers have pointed out that when using defect prediction models, the cost-effectiveness of the model should be considered [24]. Due to the high cost of checking potentially defective program modules, modules with high defect densities should be prioritized when testing resources are limited. Effort-aware defect prediction aims at finding more buggy changes with limited testing resources and has been extensively studied in module-level (package, file, or function) defect prediction [25, 26]. Kamei et al. [7] applied effort-aware defect prediction to JIT-SDP. In their study, they used code churn (the total number of modified lines of code) as a measure of the effort and proposed the EALR model based on linear regression. Their study showed that 35% of defect-inducing changes can be detected by using only 20% of the effort.

Recently, many supervised and unsupervised learning methods have been investigated in the context of effort-aware JIT-SDP.

Figure 1: The process of software defect prediction.

Yang et al. [8] compared simple unsupervised models with supervised models for effort-aware JIT-SDP and found that many unsupervised models outperform state-of-the-art supervised models. Inspired by the study of Yang et al. [8], Liu et al. [27] proposed a novel unsupervised learning method, CCUM, which is based on code churn. Their study showed that CCUM performed better than the state-of-the-art prediction models at that time. As the results Yang et al. [8] reported are startling, Fu et al. [28] repeated their experiment. Their study indicates that supervised learning methods are better than unsupervised methods when the analysis is conducted on a project-by-project basis; however, their study supports the general goal of Yang et al. [8]. Huang et al. [29] also performed a replication study based on the study of Yang et al. [8] and extended it in the literature [30]. They found three weaknesses of the LT model proposed by Yang et al. [8]: more context switching, more false alarms, and worse F1 than supervised methods. Therefore, they proposed a novel supervised method, CBS+, which combines the advantages of LT and EALR. Empirical results demonstrate that CBS+ can detect more buggy changes than EALR. More importantly, CBS+ can significantly reduce context switching compared to LT.

3. Software Defect Prediction Based on Local Models

Menzies et al. [11] first applied local models to defect prediction and effort estimation in 2011 and extended their work in 2013 [12]. The basic idea of local models is simple: first, cluster all training data into homogeneous regions; then, train a prediction model for each region. A test instance is predicted by one of these predictors [31].

To the best of our knowledge, there are only five studies of local models in the field of defect prediction. Menzies et al. [11, 12] first proposed the idea of local models. Their studies attempted to explore the best way to learn lessons for defect prediction and effort estimation. They adopted the WHERE algorithm to cluster the data into regions with similar properties. By comparing the lessons generated from all the data, from local projects, and from each cluster, their experimental results showed that the lessons learned from clusters have better prediction performance. Scanniello et al. [13] proposed a novel defect prediction method based on software clustering for class-level defect prediction. Their research demonstrated that, for a class to be predicted in a project system, the models built from the classes related to it outperform models built from the classes of the entire system. Bettenburg et al. [14] compared local and global models based on statistical learning techniques for software defect prediction. Experimental results demonstrated that local models had better fit and prediction performance. Herbold et al. [31] introduced local models into the context of cross-project defect prediction. They compared local models with global models and a transfer learning technique. Their findings showed that local models made only a minor difference compared with global models and the transfer learning model. Mezouar et al. [32] compared the performance of local and global models in the context of effort-aware defect prediction. Their study is based on 15 module-level defect data sets and uses k-medoids to build local models. Experimental results showed that local models perform worse than global models.

The processes of local models and global models, and the difference between them, are described in Figure 2. For global models, all training data are used to build prediction models by using statistical or machine learning methods, and the models can be used to predict any test instance. In comparison, local models first build a cluster model and divide the training data into several clusters based on that model. In the prediction phase, each test instance is assigned a cluster number based on the cluster model, and each test instance is predicted by the model trained on the corresponding cluster instead of on all the training data.

In our experiments, we apply local models to the JIT-SDP problem. In previous studies, various clustering methods have been used in local models, such as WHERE [11, 12], MCLUST [14], k-means [14], hierarchical clustering [14], BorderFlow [13], EM clustering [31], and k-medoids [32]. According to the suggestion of Herbold et al. [31], one can use any of these clustering methods. Our experiment adopts k-medoids as the clustering method in local models. K-medoids is a widely used clustering method. Different from the k-means used in a previous study [14], the cluster center of k-medoids is an actual object, which makes it less sensitive to outliers. Moreover, there are mature Python libraries supporting the implementation of k-medoids. In the stage of model building, we use logistic regression and EALR to build the classification model and the effort-aware prediction model in local models, which are used as baseline models in previous JIT-SDP studies [7–9]. The detailed process of local models for JIT-SDP is described in Algorithm 1, and a minimal code sketch of this workflow is given below.
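The following is a minimal Python sketch of this workflow: cluster the training changes with k-medoids, fit one logistic regression per cluster, and route each test change to the classifier of its nearest medoid. It assumes the KMedoids implementation from the scikit-learn-extra package; all function and variable names are illustrative rather than taken from our released code.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn_extra.cluster import KMedoids

def train_local_models(X_train, y_train, k=2):
    # Divide the training set into k clusters and fit one classifier per cluster.
    km = KMedoids(n_clusters=k, random_state=0).fit(X_train)
    classifiers = {}
    for i in range(k):
        mask = km.labels_ == i
        classifiers[i] = LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask])
    return km, classifiers

def predict_local(km, classifiers, X_test):
    # Route each test instance to the classifier trained on its nearest cluster.
    cluster_ids = km.predict(X_test)
    return np.array([classifiers[c].predict(x.reshape(1, -1))[0]
                     for c, x in zip(cluster_ids, X_test)])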

4. Experimental Setup

This article answers the following three research questions through experiments:

(i) RQ1: How do the classification performance and effort-aware prediction performance of local models compare to those of global models in the cross-validation scenario?

(ii) RQ2: How do the classification performance and effort-aware prediction performance of local models compare to those of global models in the cross-project-validation scenario?

(iii) RQ3: How do the classification performance and effort-aware prediction performance of local models compare to those of global models in the timewise-cross-validation scenario?

This section introduces the experimental setup from five aspects: data sets, modeling methods, performance evaluation indicators, data analysis method, and data preprocessing. Our experiment is performed on hardware with an Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz. The operating system is Windows 10. The programming environment for scripts is Python 3.6.


4.1. Data Sets. The data sets used in the experiment are widely used for JIT-SDP and are shared by Kamei et al. [7]. The data sets contain six open-source projects from different application domains. For example, Bugzilla (BUG) is an open-source defect tracking system, Eclipse JDT (JDT) is a full-featured Java integrated development environment plugin for the Eclipse platform, Mozilla (MOZ) is a web browser, and PostgreSQL (POS) is a database system [9]. These data sets were extracted from the CVS repositories of the projects and the corresponding bug reports [7]. The basic information of the data sets is shown in Table 1.

To predict whether a change is defective or not, Kamei et al. [7] designed 14 metrics associated with software defects, which can be grouped into five dimensions (i.e., diffusion, size, purpose, history, and experience). The specific description of these metrics is shown in Table 2. A more detailed introduction to the metrics can be found in the literature [7].

4.2. Modeling Methods. Three modeling methods are used in our experiment. First, in the process of using local models, the training set is divided into multiple clusters; this step is performed by the k-medoids model. Second, the classification performance and effort-aware prediction performance of local and global models are investigated in our experiment; therefore, we use logistic regression and EALR to build classification models and effort-aware prediction models, respectively, which are used as baseline models in previous JIT-SDP studies [7–9]. Details of the three models are given below.

4.2.1. K-Medoids. The first step in local models is to use a clustering algorithm to divide all training data into several homogeneous regions. In this process, the experiment uses k-medoids as the clustering algorithm, which is widely used in different clustering tasks. Different from k-means, the coordinate center of each cluster in k-medoids is a representative object rather than a centroid; hence, k-medoids is less sensitive to outliers. The objective function of k-medoids is defined in Equation (1), where p is a sample in cluster Ci and Oi is the medoid of cluster Ci.

Figure 2: The process of (a) global and (b) local models.

Input: training set D = {(x1, y1), (x2, y2), ..., (xn, yn)}; test set T = {(x1, y1), (x2, y2), ..., (xm, ym)}; the number of clusters k
Output: prediction results R

(1)  begin
(2)    R ← []
(3)    Divide the training set into k clusters:
(4)    clusters, cluster_centers ← k-medoids(D, k)
(5)    Train a classifier for each cluster:
(6)    classifiers ← modeling_algorithm(clusters)
(7)    for instance t in T do
(8)      Assign a cluster number i to instance t:
(9)      i ← calculate_min_distance(t, cluster_centers)
(10)     r ← classifiers[i].predict(t)   (perform prediction)
(11)     R.append(r)
(12)   end
(13) end

Algorithm 1: The application process of local models for JIT-SDP.


$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left| p - O_i \right|^2 \qquad (1)$$

Among the different k-medoids algorithms, the partitioning around medoids (PAM) algorithm is used in the experiment; it was proposed by Kaufman et al. [33] and is one of the most popular k-medoids algorithms.
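To make Equation (1) concrete, the following small sketch evaluates the k-medoids objective for a given partition; the function name and array layout are illustrative.

import numpy as np

def kmedoids_objective(X, medoids, labels):
    # Equation (1): sum, over clusters C_i, of squared distances from each sample p to its medoid O_i.
    return sum(float(np.sum((X[labels == i] - medoid) ** 2))
               for i, medoid in enumerate(medoids))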

4.2.2. Logistic Regression. Similar to prior studies, logistic regression is used to build classification models [7]. Logistic regression is a widely used regression model that can be applied to a variety of classification and regression tasks. Assuming that there are n metrics for a change c, logistic regression can be expressed by Equation (2), where w0, w1, ..., wn denote the coefficients of the logistic regression and the independent variables m1, ..., mn represent the metrics of the change, which are introduced in Table 2. The output of the model, y(c), indicates the probability that the change is buggy:

$$y(c) = \frac{1}{1 + e^{-(w_0 + w_1 m_1 + \cdots + w_n m_n)}} \qquad (2)$$

In our experiment, logistic regression is used to solve classification problems. However, the output of the model is a continuous value in the range 0 to 1. Therefore, for a change c, if the value of y(c) is greater than 0.5, then c is classified as buggy; otherwise, it is classified as clean. This process is defined as follows:

$$Y(c) = \begin{cases} 1, & \text{if } y(c) > 0.5 \\ 0, & \text{if } y(c) \le 0.5 \end{cases} \qquad (3)$$
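A minimal sketch of Equations (2) and (3) with scikit-learn's LogisticRegression follows; names are illustrative.

from sklearn.linear_model import LogisticRegression

def fit_and_classify(X_train, y_train, X_test):
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_prob = clf.predict_proba(X_test)[:, 1]   # y(c) in Equation (2): probability that a change is buggy
    return (y_prob > 0.5).astype(int)          # Y(c) in Equation (3): buggy if y(c) > 0.5, clean otherwise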

4.2.3. Effort-Aware Linear Regression. Arisholm et al. [24] pointed out that the use of defect prediction models should consider not only the classification performance of the model but also the cost-effectiveness of code inspection. Afterwards, a large number of studies focused on effort-aware defect prediction [8, 9, 11, 12]. Effort-aware linear regression (EALR) was first proposed by Kamei et al. [7] for solving effort-aware JIT-SDP and is based on linear regression models. Different from the above logistic regression, the dependent variable of EALR is Y(c)/Effort(c), where Y(c) denotes the defect proneness of a change and Effort(c) denotes the effort required to code-check the change c. According to the suggestion of Kamei et al. [7], the effort for a change can be obtained by calculating the number of modified lines of code.
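The following sketch illustrates the EALR idea under the assumptions stated above (effort measured as modified lines of code, test changes ranked by predicted defect density); it is not the original implementation, and all names are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

def ealr_rank(X_train, y_train, effort_train, X_test):
    # Regress Y(c)/Effort(c) on the change metrics, then rank test changes by
    # predicted defect density so the most defect-dense changes are inspected first.
    density = y_train / np.maximum(effort_train, 1)   # avoid division by zero
    model = LinearRegression().fit(X_train, density)
    return np.argsort(-model.predict(X_test))         # indices of test changes, highest density first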

4.3. Performance Indicators. Four performance indicators are used to evaluate and compare the performance of local and global models for JIT-SDP (i.e., F1, AUC, ACC, and Popt). Specifically, we use F1 and AUC to evaluate the classification performance of prediction models and use ACC and Popt to evaluate their effort-aware prediction performance. The details are as follows.

Table 1: The basic information of the data sets.

Project  Period                   Number of defective changes  Number of changes  Defect rate (%)
BUG      1998/08/26 - 2006/12/16  1696                         4620               36.71
COL      2002/11/25 - 2006/07/27  1361                         4455               30.55
JDT      2001/05/02 - 2007/12/31  5089                         35386              14.38
MOZ      2000/01/01 - 2006/12/17  5149                         98275              5.24
PLA      2001/05/02 - 2007/12/31  9452                         64250              14.71
POS      1996/07/09 - 2010/05/15  5119                         20431              25.06

Table 2: The description of metrics.

Dimension   Metric   Description
Diffusion   NS       Number of modified subsystems
            ND       Number of modified directories
            NF       Number of modified files
            Entropy  Distribution of modified code across each file
Size        LA       Lines of code added
            LD       Lines of code deleted
            LT       Lines of code in a file before the change
Purpose     FIX      Whether or not the change is a defect fix
History     NDEV     Number of developers that changed the files
            AGE      Average time interval between the last and the current change
            NUC      Number of unique last changes to the files
Experience  EXP      Developer experience
            REXP     Recent developer experience
            SEXP     Developer experience on a subsystem


The experiment first uses F1 and AUC to evaluate the classification performance of the models. JIT-SDP aims to predict the defect proneness of a change, which is a classification task. The predicted results can be expressed as a confusion matrix, shown in Table 3, including true positive (TP), false negative (FN), false positive (FP), and true negative (TN).

For classification problems, precision (P) and recall (R) are often used to evaluate the prediction performance of a model:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \qquad (4)$$

However, the two indicators are usually contradictory in their results. In order to evaluate a prediction model comprehensively, we use F1 to evaluate the classification performance of the model; F1 is the harmonic mean of precision and recall, as shown in the following equation:

$$F1 = \frac{2 \times P \times R}{P + R} \qquad (5)$$

The second indicator is AUC. Class imbalance problems are often encountered in software defect prediction, and our data sets are also imbalanced (the number of buggy changes is much smaller than the number of clean changes). In the class-imbalance setting, threshold-based evaluation indicators (such as accuracy, recall rate, and F1) are sensitive to the threshold [4]. Therefore, a threshold-independent evaluation indicator, AUC, is used to evaluate the classification performance of a model. AUC (area under curve) is the area under the ROC (receiver operating characteristic) curve. The drawing process of the ROC curve is as follows: first, sort the instances in descending order of the probability that they are predicted to be positive, and then treat them as positive examples in turn. In each iteration, the true positive rate (TPR) and the false positive rate (FPR) are calculated as the ordinate and the abscissa, respectively, to obtain the ROC curve:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP} \qquad (6)$$
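Both classification indicators can be computed directly with scikit-learn; the function below is a small sketch with illustrative names.

from sklearn.metrics import f1_score, roc_auc_score

def classification_performance(y_true, y_pred, y_prob):
    # F1 follows Equation (5); AUC is computed from the predicted probabilities.
    return f1_score(y_true, y_pred), roc_auc_score(y_true, y_prob)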

The experiment uses ACC and Popt to evaluate the effort-aware prediction performance of local and global models. ACC and Popt are commonly used performance indicators, which have been widely used in previous research [7–9, 25].

The calculation method of ACC and Popt is illustrated in Figure 3, in which the x-axis and y-axis represent the cumulative percentage of code churn (i.e., the number of lines of code added and deleted) and of defect-inducing changes, respectively. To compute the values of ACC and Popt, three curves are included in Figure 3: the optimal model curve, the prediction model curve, and the worst model curve [8]. In the optimal model curve, the changes are sorted in descending order of their actual defect densities. In the worst model curve, the changes are sorted in ascending order of their actual defect densities. In the prediction model curve, the changes are sorted in descending order of their predicted defect densities. Then, the areas between each of the three curves and the x-axis, denoted Area(optimal), Area(m), and Area(worst), respectively, are calculated. ACC indicates the recall of defect-inducing changes when 20% of the effort is invested, based on the prediction model curve. According to [7], Popt can be defined by the following equation:

$$P_{\mathrm{opt}} = 1 - \frac{\mathrm{Area}(\mathrm{optimal}) - \mathrm{Area}(m)}{\mathrm{Area}(\mathrm{optimal}) - \mathrm{Area}(\mathrm{worst})} \qquad (7)$$
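The following sketch shows one way to compute ACC and Popt from the effort values and predicted defect densities; the exact area computation in the original tooling may differ in minor details, and all names are illustrative.

import numpy as np

def _area(y_true, effort, order):
    # Area under the (cumulative effort %, cumulative buggy-change %) curve for a given ordering.
    x = np.concatenate(([0.0], np.cumsum(effort[order]) / np.sum(effort)))
    y = np.concatenate(([0.0], np.cumsum(y_true[order]) / np.sum(y_true)))
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))  # trapezoidal rule

def acc_and_popt(y_true, effort, predicted_density):
    y_true = np.asarray(y_true, dtype=float)
    effort = np.asarray(effort, dtype=float)
    actual_density = y_true / np.maximum(effort, 1)
    pred_order = np.argsort(-np.asarray(predicted_density))   # prediction model curve
    opt_order = np.argsort(-actual_density)                   # optimal model curve
    worst_order = np.argsort(actual_density)                  # worst model curve

    # ACC: recall of buggy changes when 20% of the total code churn has been inspected.
    x = np.cumsum(effort[pred_order]) / np.sum(effort)
    y = np.cumsum(y_true[pred_order]) / np.sum(y_true)
    acc = y[np.searchsorted(x, 0.2)]

    # Popt follows Equation (7).
    area_m = _area(y_true, effort, pred_order)
    area_opt = _area(y_true, effort, opt_order)
    area_worst = _area(y_true, effort, worst_order)
    popt = 1 - (area_opt - area_m) / (area_opt - area_worst)
    return acc, popt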

4.4. Data Analysis Method. The experiment considers three scenarios to evaluate the performance of prediction models: 10 times 10-fold cross-validation, cross-project-validation, and timewise-cross-validation. The details are as follows.

4.4.1. 10 Times 10-Fold Cross-Validation. Cross-validation is performed within the same project, and the basic process of 10 times 10-fold cross-validation is as follows: first, the data set of a project is divided into 10 parts of equal size; subsequently, nine parts are used to train a prediction model, and the remaining part is used as the test set. This process is repeated 10 times in order to reduce random deviation. Therefore, 10 times 10-fold cross-validation produces 10 × 10 = 100 results.
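A sketch of generating the 100 train/test splits with scikit-learn is given below; whether the folds were stratified in the original experiment is not stated, so the stratification here is an assumption.

from sklearn.model_selection import RepeatedStratifiedKFold

def ten_by_ten_splits(X, y, seed=0):
    # Yields the 10 x 10 = 100 train/test index pairs used in the cross-validation scenario.
    splitter = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
    yield from splitter.split(X, y)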

Table 3: Confusion matrix.

                  Prediction result
Actual value      Positive    Negative
Positive          TP          FN
Negative          FP          TN

Figure 3: The performance indicators of ACC and Popt (x-axis: cumulative percentage of code churn; y-axis: cumulative percentage of defect-inducing changes; curves: optimal model, prediction model, and worst model).


4.4.2. Cross-Project-Validation. Cross-project defect prediction has received widespread attention in recent years [34]. The basic idea of cross-project defect prediction is to train a defect prediction model on the defect data sets of one project to predict defects in another project [8]. Given data sets containing n projects, n × (n − 1) run results can be generated for a prediction model. The data sets used in our experiment contain six open-source projects, so 6 × (6 − 1) = 30 prediction results are generated for each prediction model.
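A sketch of enumerating the n × (n − 1) cross-project runs; the function name is illustrative.

def cross_project_pairs(projects):
    # n projects yield n * (n - 1) (train, test) runs; six projects give 30 runs.
    for train_project in projects:
        for test_project in projects:
            if train_project != test_project:
                yield train_project, test_project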

4.4.3. Timewise-Cross-Validation. Timewise-cross-validation is also performed within the same project, but it considers the chronological order of the changes in the data sets [35]. Defect data sets are collected in chronological order during the actual software development process; therefore, changes committed early in the data sets can be used to predict the defect proneness of changes committed later. In the timewise-cross-validation scenario, all changes in the data sets are arranged in chronological order, and the changes in the same month are grouped into the same group. Suppose the data set is divided into n groups; group i and group i + 1 are used to train a prediction model, and group i + 4 and group i + 5 are used as the test set. Therefore, assuming that a project contains n months of data, n − 5 run results can be generated for a prediction model.
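A sketch of generating the timewise splits with pandas follows, assuming a data frame with a datetime column named commit_date (the column name is an assumption).

import pandas as pd

def timewise_splits(changes, date_column="commit_date"):
    # Group changes by calendar month; months i and i+1 train the model,
    # months i+4 and i+5 form the test set, giving n - 5 runs for n months.
    months = changes[date_column].dt.to_period("M")
    ordered = sorted(months.unique())
    for i in range(len(ordered) - 5):
        train = changes[months.isin(ordered[i:i + 2])]
        test = changes[months.isin(ordered[i + 4:i + 6])]
        yield train, test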

The experiment uses the Wilcoxon signed-rank test [36] and Cliff's δ [37] to test the significance of differences in classification performance and prediction performance between local and global models. The Wilcoxon signed-rank test is a popular nonparametric statistical hypothesis test. We use p values to examine whether the difference in performance between local and global models is statistically significant at a significance level of 0.05. The experiment also uses Cliff's δ to determine the magnitude of the difference in prediction performance between local and global models. Usually, the magnitude of the difference can be divided into four levels, i.e., trivial (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), moderate (0.33 ≤ |δ| < 0.474), and large (|δ| ≥ 0.474) [38]. In summary, when the p value is less than 0.05 and Cliff's δ is greater than or equal to 0.147, local models and global models have a significant difference in prediction performance.
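A sketch of this significance test using scipy's Wilcoxon signed-rank test together with a direct implementation of Cliff's δ is given below; names are illustrative.

import numpy as np
from scipy.stats import wilcoxon

def compare_local_vs_global(local_scores, global_scores, alpha=0.05):
    local_scores = np.asarray(local_scores, dtype=float)
    global_scores = np.asarray(global_scores, dtype=float)
    _, p_value = wilcoxon(local_scores, global_scores)              # paired signed-rank test
    diffs = local_scores[:, None] - global_scores[None, :]
    delta = (np.sum(diffs > 0) - np.sum(diffs < 0)) / diffs.size    # Cliff's delta
    significant = (p_value < alpha) and (abs(delta) >= 0.147)
    return p_value, delta, significant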

4.5. Data Preprocessing. Following the suggestion of Kamei et al. [7], it is necessary to preprocess the data sets. This includes the following three steps:

(1) Removing Highly Correlated Metrics. The ND and REXP metrics are excluded because NF and ND, as well as REXP and EXP, are highly correlated. LA and LD are normalized by dividing by LT, since LA and LD are highly correlated. LT and NUC are normalized by dividing by NF, since LT and NUC are highly correlated with NF.

(2) Logarithmic Transformation. Since most metrics are highly skewed, each metric undergoes a logarithmic transformation (except for FIX).

(3) Dealing with Class Imbalance. The data sets used in the experiment have a class imbalance problem, i.e., the number of defect-inducing changes is far smaller than the number of clean changes. Therefore, we perform random undersampling on the training set: clean changes are randomly deleted until their number equals the number of buggy changes. Note that undersampling is not performed on the test set.
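The three preprocessing steps can be sketched as follows, assuming a pandas data frame whose columns follow Table 2 and whose label column is named bug (the label name and frame layout are assumptions).

import numpy as np
import pandas as pd

def preprocess(changes):
    # Step 1: remove highly correlated metrics and normalize LA, LD by LT and LT, NUC by NF.
    changes = changes.drop(columns=["ND", "REXP"])
    lt, nf = changes["LT"].replace(0, 1), changes["NF"].replace(0, 1)
    changes["LA"], changes["LD"] = changes["LA"] / lt, changes["LD"] / lt
    changes["LT"], changes["NUC"] = changes["LT"] / nf, changes["NUC"] / nf
    # Step 2: logarithmic transformation of the skewed metrics (FIX and the label are excluded).
    skewed = [c for c in changes.columns if c not in ("FIX", "bug")]
    changes[skewed] = np.log1p(changes[skewed])
    return changes

def undersample(train, label="bug", seed=0):
    # Step 3: randomly drop clean changes in the training set until both classes are balanced.
    buggy = train[train[label] == 1]
    clean = train[train[label] == 0].sample(n=len(buggy), random_state=seed)
    return pd.concat([buggy, clean]).sample(frac=1, random_state=seed)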

5. Experimental Results and Analysis

5.1. Analysis for RQ1. To investigate and compare the performance of local and global models in the 10 times 10-fold cross-validation scenario, we conduct a large-scale empirical study based on data sets from six open-source projects. The experimental results are shown in Table 4. The first column represents the evaluation indicators for prediction models, including AUC, F1, ACC, and Popt; the second column denotes the project names of the six data sets; the third column shows the performance values of the global models in the four evaluation indicators; and from the fourth column to the twelfth column, the performance values of the local models at different k values are shown. Considering that the number of clusters k may affect the performance of local models, the value of the parameter k is tuned from 2 to 10. Since 100 prediction results are generated in the 10 times 10-fold cross-validation scenario, the values in the table are medians of the experimental results. We do the following processing on the data in Table 4:

(i) We performed the significance test defined in Section 4.4 on the performance of the local and global models. The performance values of local models that are significantly different from those of the global models are shown in bold.

(ii) The performance values of the local models that are significantly better than those of the global models are given in italics.

(iii) The WDL values (i.e., the number of projects for which local models obtain a better, equal, or worse prediction performance than global models) are calculated under different k values and different evaluation indicators.

First, we analyze and compare the classification performance of local and global models. As shown in Table 4, the performance of local models, for all parameters k and all projects, is worse than that of global models in the AUC indicator. Moreover, in most cases, the performance of local models is worse than that of global models (except for the four values in italics) in the F1 indicator. Therefore, it can be concluded that global models perform better than local models in classification performance.

Second, we investigate and compare the effort-aware prediction performance of local and global models. As can be seen from Table 4, local models are valid for effort-aware JIT-SDP. However, local models have different effects on different projects and different parameters k.


First, local models have different effects on different data sets. To describe the effectiveness of local models on different data sets, we calculate the number of times that local models perform significantly better than global models over the 9 values of k and the two evaluation indicators (i.e., ACC and Popt) for each data set. The statistical results are shown in Table 5, where the second row is the number of times that local models perform better than global models for each project. It can be seen that local models have better performance on the projects MOZ and PLA and poor performance on the projects BUG and COL. Based on the number of times that local models outperform global models in the six projects, we rank the projects, as shown in the third row. In addition, we list the sizes of the data sets and their ranks, as shown in the fourth row. It can be seen that the performance of local models has a strong correlation with the size of the data sets: generally, the larger the data set, the more significant the performance improvement of local models.

In view of this situation, we consider that if the size of the data set is too small when using local models, then, since the data set is divided into multiple clusters, the training samples of each cluster are further reduced, which leads to performance degradation of the models. When the data set is large, the performance of the model is not easily limited by the size of the data set, so the advantages of local models are better reflected.

In addition, the number of clusters k also has an impact on the performance of local models. As can be seen from the table, according to the WDL analysis, local models obtain the best ACC indicator when k is set to 2 and obtain the best Popt indicator when k is set to 2 or 6. Therefore, we recommend that, in the cross-validation scenario, when using local models to solve the effort-aware JIT-SDP problem, the parameter k be set to 2.

This section analyzes and compares the classification performance and effort-aware prediction performance of local and global models in the 10 times 10-fold cross-validation scenario. The empirical results show that, compared with global models, local models perform poorly in classification performance but perform better in effort-aware prediction performance. However, the performance of local models is affected by the parameter k and the size of the data sets. The experimental results show that local models obtain better effort-aware prediction performance on projects with larger data sets and when k is set to 2.

5.2. Analysis for RQ2. This section discusses the performance differences between local and global models in the cross-project-validation scenario for JIT-SDP. The results of the experiment are shown in Table 6, in which the first column indicates the four evaluation indicators, including AUC, F1,

Table 4: Local vs. global models in the 10 times 10-fold cross-validation scenario.

Indicator  Project  Global  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
AUC        BUG      0.730   0.704  0.687  0.684  0.698  0.688  0.686  0.686  0.693  0.685
           COL      0.742   0.719  0.708  0.701  0.703  0.657  0.666  0.670  0.668  0.660
           JDT      0.725   0.719  0.677  0.670  0.675  0.697  0.671  0.674  0.671  0.654
           MOZ      0.769   0.540  0.741  0.715  0.703  0.730  0.730  0.725  0.731  0.722
           PLA      0.742   0.733  0.702  0.693  0.654  0.677  0.640  0.638  0.588  0.575
           POS      0.777   0.764  0.758  0.760  0.758  0.753  0.755  0.705  0.757  0.757
           WDL      --      0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6
F1         BUG      0.663   0.644  0.628  0.616  0.603  0.598  0.608  0.610  0.601  0.592
           COL      0.696   0.672  0.649  0.669  0.678  0.634  0.537  0.596  0.604  0.564
           JDT      0.723   0.706  0.764  0.757  0.734  0.695  0.615  0.643  0.732  0.722
           MOZ      0.825   0.374  0.853  0.758  0.785  0.741  0.718  0.737  0.731  0.721
           PLA      0.704   0.678  0.675  0.527  0.477  0.557  0.743  0.771  0.627  0.604
           POS      0.762   0.699  0.747  0.759  0.746  0.698  0.697  0.339  0.699  0.700
           WDL      --      0/0/6  2/0/4  1/1/4  0/1/5  0/0/6  0/1/5  1/0/5  0/1/5  0/1/5
ACC        BUG      0.421   0.452  0.471  0.458  0.424  0.451  0.450  0.463  0.446  0.462
           COL      0.553   0.458  0.375  0.253  0.443  0.068  0.195  0.459  0.403  0.421
           JDT      0.354   0.532  0.374  0.431  0.291  0.481  0.439  0.314  0.287  0.334
           MOZ      0.189   0.386  0.328  0.355  0.364  0.301  0.270  0.312  0.312  0.312
           PLA      0.368   0.472  0.509  0.542  0.497  0.509  0.456  0.442  0.456  0.495
           POS      0.336   0.490  0.397  0.379  0.457  0.442  0.406  0.341  0.400  0.383
           WDL      --      5/0/1  4/1/1  4/1/1  3/1/2  4/1/1  4/1/1  2/2/2  3/1/2  4/1/1
Popt       BUG      0.742   0.747  0.744  0.726  0.738  0.749  0.750  0.760  0.750  0.766
           COL      0.726   0.661  0.624  0.401  0.731  0.295  0.421  0.644  0.628  0.681
           JDT      0.660   0.779  0.625  0.698  0.568  0.725  0.708  0.603  0.599  0.634
           MOZ      0.555   0.679  0.626  0.645  0.659  0.588  0.589  0.643  0.634  0.630
           PLA      0.664   0.723  0.727  0.784  0.746  0.758  0.718  0.673  0.716  0.747
           POS      0.673   0.729  0.680  0.651  0.709  0.739  0.683  0.596  0.669  0.644
           WDL      --      4/1/1  2/2/2  2/1/3  3/2/1  4/1/1  2/3/1  1/2/3  1/3/2  2/1/3


ACC, and Popt; the second column represents the performance values of the global models; and from the third column to the eleventh column, the prediction performance of local models under different parameters k is represented. In our experiment, six data sets are used in the cross-project-validation scenario; therefore, each prediction model produces 30 prediction results. The values in Table 6 represent the medians of the prediction results.

For the values in Table 6, we do the following processing:

(i) Using the significance test method, the performance values of local models that are significantly different from those of global models are shown in bold.

(ii) The performance values of local models that are significantly better than those of global models are given in italics.

(iii) The WDL column analyzes the number of times that local models perform significantly better than global models under different parameters k.

First, we compare the difference in classification performance between local and global models. As can be seen from Table 6, in the AUC and F1 indicators, the classification performance of local models under different parameters k is worse than or equal to that of global models. Therefore, we can conclude that, in the cross-project-validation scenario, local models perform worse than global models in classification performance.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 6 that the performance of local models is better than or equal to that of global models in the ACC and Popt indicators when the parameter k ranges from 2 to 10. Therefore, local models are valid for effort-aware JIT-SDP in the cross-project-validation scenario. In particular, when the number of clusters k is set to 2, local models obtain better ACC and Popt values than global models; when k is set to 5, local models obtain better ACC values. Therefore, it is appropriate to set k to 2, which finds 49.3% of defect-inducing changes when using 20% of the effort, increasing the ACC indicator by 57.0% and the Popt indicator by 16.4%.

In the cross-project-validation scenario, local models perform worse in classification performance but better in effort-aware prediction performance than global models. However, the setting of k in local models has an impact on their effort-aware prediction performance. The empirical results show that when k is set to 2, local models obtain the optimal effort-aware prediction performance.

5.3. Analysis for RQ3. This section compares the prediction performance of local and global models in the timewise-cross-validation scenario. Since, in this scenario, only two months of data in each data set are used to train prediction models, the size of the training sample in each cluster is small. Through experiments, we find that if the parameter k of local models is set too large, the number of training samples in each cluster may be too small or even equal to 0. Therefore, the experiment sets the parameter k of local models to 2. The experimental results are shown in Table 7. Suppose a project contains n months of data; prediction models can then produce n − 5 prediction results in the timewise-cross-validation scenario. Therefore, the third and fourth columns in Table 7 report the median and standard deviation of the prediction results for global and local models. Based on the significance test method, the performance values of local models with significant differences from those of global models are shown in bold. The WDL row summarizes the number of projects for which local models obtain a better, equal, or worse performance than global models in each indicator.

First, we compare the classification performance of local and global models. It can be seen from Table 7 that local models perform worse in the AUC and F1 indicators than global models on all six data sets. Therefore, it can be concluded that local models perform poorly in classification performance compared to global models in the timewise-cross-validation scenario.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 7 that, in the projects BUG and COL, local models have worse ACC and Popt indicators than global models, while in the other four projects (JDT, MOZ, PLA, and POS) local and global models have no significant difference in the ACC and Popt indicators. Therefore, local models perform worse in effort-aware prediction performance than global models in the timewise-cross-validation scenario. This conclusion is inconsistent with the conclusions in Sections 5.1 and 5.2. We consider that the main reason lies in the size of the training sample: in the timewise-cross-validation scenario, only two months of data are used to train prediction models (there are only 452 changes per month on average). When local models are used, since the training set is divided into several clusters, the number of training samples in each cluster is further decreased, which results in a decrease in the effort-aware prediction performance of local models.

In this section, we compare the classification performance and effort-aware prediction performance of local and global models in the timewise-cross-validation scenario. Empirical results show that local models perform worse than global models in both classification performance and effort-aware prediction performance.

Table 5: The number of wins of local models at different k values.

Project                     BUG       COL       JDT        MOZ        PLA        POS
Count                       4         0         7          18         14         11
Rank                        5         6         4          1          2          3
Number of changes (rank)    4620 (5)  4455 (6)  35386 (3)  98275 (1)  64250 (2)  20431 (4)


Therefore, we recommend using global models to solve the JIT-SDP problem in the timewise-cross-validation scenario.

6. Threats to Validity

6.1. External Validity. Although our experiment uses public data sets that have been used extensively in previous studies, we cannot guarantee that the findings of the experiment apply to all other change-level defect data sets. Therefore, the defect data sets of more projects should be mined in order to verify the generalizability of the experimental results.

6.2. Construct Validity. We use F1, AUC, ACC, and Popt to evaluate the prediction performance of local and global models. Because defect data sets are often imbalanced, AUC is more appropriate as a threshold-independent evaluation indicator to evaluate the classification performance of models [4]. In addition, we use F1, a threshold-based evaluation indicator, as a supplement to AUC. Besides, similar to prior studies [7, 9], the experiment uses ACC and Popt to evaluate the effort-aware prediction performance of models. However, we cannot guarantee that the experimental results are valid for all other evaluation indicators, such as precision, recall, Fβ, and PofB20.

6.3. Internal Validity. Internal validity involves errors in the experimental code. In order to reduce this risk, we carefully examined the code of the experiment and referred to the code published in previous studies [7, 8, 29], so as to improve the reliability of the experimental code.

7. Conclusions and Future Work

In this article, we first apply local models to JIT-SDP. Our study aims to compare the prediction performance of local and global models under cross-validation, cross-project-validation, and timewise-cross-validation for JIT-SDP. In order to compare the classification performance and effort-aware prediction performance of local and global models, we first build local models based on the k-medoids method. Afterwards, logistic regression and EALR are used to build classification models and effort-aware prediction models for local and global models. The empirical results show that local models perform worse in classification performance than global models in all three evaluation scenarios. Moreover, local models have worse effort-aware prediction performance in the timewise-cross-validation scenario. However, local models are significantly better than global models in effort-aware prediction performance in the cross-validation and cross-project-validation scenarios. In particular, the optimal effort-aware prediction performance can be obtained when the parameter k of local models is set to 2. Therefore, local models are still promising in the context of effort-aware JIT-SDP.

In the future, first, we plan to consider more commercial projects to further verify the validity of the experimental conclusions. Second, logistic regression and EALR are used in local and global models; considering that the choice of modeling methods affects the prediction performance of local and global models, we hope to add more baselines to local and global models to verify the generalizability of the experimental conclusions.

Data Availability

The experiment uses public data sets shared by Kamei et al. [7], who published the download address of the data sets in their paper.

Table 6: Local vs. global models in the cross-project-validation scenario.

Indicator  Global  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10   WDL
AUC        0.706   0.683  0.681  0.674  0.673  0.673  0.668  0.659  0.680  0.674  0/0/9
F1         0.678   0.629  0.703  0.696  0.677  0.613  0.620  0.554  0.629  0.637  0/4/5
ACC        0.314   0.493  0.396  0.348  0.408  0.393  0.326  0.340  0.342  0.345  2/7/0
Popt       0.656   0.764  0.691  0.672  0.719  0.687  0.619  0.632  0.667  0.624  1/8/0

Table 7: Local vs. global models in the timewise-cross-validation scenario.

Indicator  Project  Global         Local (k=2)
AUC        BUG      0.672 ± 0.125  0.545 ± 0.124
           COL      0.717 ± 0.075  0.636 ± 0.135
           JDT      0.708 ± 0.043  0.645 ± 0.073
           MOZ      0.749 ± 0.034  0.648 ± 0.128
           PLA      0.709 ± 0.054  0.642 ± 0.076
           POS      0.743 ± 0.072  0.702 ± 0.095
           WDL      --             0/0/6
F1         BUG      0.587 ± 0.108  0.515 ± 0.128
           COL      0.651 ± 0.066  0.599 ± 0.125
           JDT      0.722 ± 0.048  0.623 ± 0.161
           MOZ      0.804 ± 0.050  0.696 ± 0.190
           PLA      0.702 ± 0.053  0.618 ± 0.162
           POS      0.714 ± 0.065  0.657 ± 0.128
           WDL      --             0/0/6
ACC        BUG      0.357 ± 0.212  0.242 ± 0.210
           COL      0.426 ± 0.168  0.321 ± 0.186
           JDT      0.356 ± 0.148  0.329 ± 0.189
           MOZ      0.218 ± 0.132  0.242 ± 0.125
           PLA      0.358 ± 0.181  0.371 ± 0.196
           POS      0.345 ± 0.159  0.306 ± 0.205
           WDL      --             0/4/2
Popt       BUG      0.622 ± 0.194  0.458 ± 0.189
           COL      0.669 ± 0.150  0.553 ± 0.204
           JDT      0.654 ± 0.101  0.624 ± 0.142
           MOZ      0.537 ± 0.110  0.537 ± 0.122
           PLA      0.631 ± 0.115  0.645 ± 0.151
           POS      0.615 ± 0.131  0.583 ± 0.167
           WDL      --             0/4/2


In addition, in order to verify the repeatability of our experiment and promote future research, all the experimental code used in this article can be downloaded at https://github.com/yangxingguang/LocalJIT.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the NSF of China under Grant nos. 61772200 and 61702334, the Shanghai Pujiang Talent Program under Grant no. 17PJ1401900, the Shanghai Municipal Natural Science Foundation under Grant nos. 17ZR1406900 and 17ZR1429700, the Educational Research Fund of ECUST under Grant no. ZH1726108, and the Collaborative Innovation Foundation of Shanghai Institute of Technology under Grant no. XTCX2016-20.



the files [6]. To this end, just-in-time software defect prediction (JIT-SDP) technology is proposed [7–10]. JIT-SDP is made at the change level, which has a finer granularity than the module level. Once developers submit a change to the program, defect predictors can identify whether the change is defect-inducing immediately. Therefore, JIT-SDP has the advantages of fine granularity, instantaneity, and traceability compared with traditional defect prediction [7].

In order to improve the performance of software defect prediction, researchers have invested a lot of effort. The main research content includes the use of various machine learning methods, feature selection, class imbalance processing, noise processing, etc. [2]. Recently, researchers have proposed a novel idea called local models that may help improve the performance of defect prediction [11–14]. Local models divide all available training data into several regions with similar metrics and train a prediction model for each region. However, the research objects in previous studies are file-level or class-level defect data sets. Whether local models are still valid in the context of JIT-SDP remains for further study.

To this end, we conduct an experiment on the change-level defect data sets from six open-source projects, including 227417 changes. We empirically compare the classification performance and effort-aware prediction performance of local and global models through a large-scale empirical study. In order to build local models, the experiment uses the k-medoids clustering algorithm to divide the training samples into homogeneous regions. For each region, the experiment builds classification models and effort-aware prediction models based on logistic regression and effort-aware linear regression (EALR), respectively, which are commonly used as baseline models in prior studies [7–9].

The main contributions of this paper are as follows:

(i) We compare the classification performance and effort-aware prediction performance of local and global models in three evaluation scenarios, namely, the cross-validation scenario, the cross-project-validation scenario, and the timewise-cross-validation scenario, for JIT-SDP.

(ii) Considering the influence of the number of clusters k on the performance of local models, we evaluate the classification performance and effort-aware prediction performance of local models with different values of k by adjusting k from 2 to 10.

(iii) The empirical results show that local models perform worse in classification performance than global models in the three evaluation scenarios. In addition, local models have worse effort-aware prediction performance in the timewise-cross-validation scenario. However, local models have better effort-aware prediction performance than global models in the cross-validation scenario and the cross-project-validation scenario. Particularly, when the parameter k is set to 2, local models can obtain optimal performance in the ACC and Popt indicators.

The results of the experiment provide valuable knowledge about the advantages and limitations of local models in the JIT-SDP problem. In order to ensure the repeatability of the experiment and promote future research, we share the code and data of the experiment.

This paper is organized as follows: Section 2 introduces the background and related work of software defect prediction. Section 3 introduces the details of local models. The experimental setup is presented in Section 4. Section 5 demonstrates the experimental results and analysis. Section 6 introduces the threats to validity. Section 7 summarizes the paper and describes future work.

2. Background and Related Work

2.1. Background of Software Defect Prediction. Software defect prediction technology originated in the 1970s [15]. At that time, software systems were less complex, and researchers found that the number of software defects in a program system was related to the number of lines of code. In 1971, Akiyama [16] proposed a formula for the relationship between the number of software defects (D) and the lines of code (LOC) in a project: D = 4.86 + 0.018 × LOC. However, as software scale and complexity continue to increase, the formula is too simple to predict the relationship between D and LOC. Therefore, based on empirical knowledge, researchers designed a series of metrics associated with software defects. McCabe [17] believed that code complexity is better correlated with software defects than the number of code lines and proposed the cyclomatic complexity metric to measure code complexity. In 1994, Chidamber and Kemerer [18] proposed the CK metrics for object-oriented program code. In addition to code-based metrics, researchers found that metrics based on software development processes are also associated with software defects. Nagappan and Ball [19] proposed relative code churn metrics to predict defect density at the file level by measuring code churn during software development.

The purpose of defect prediction technology is to predict the defect proneness, number of defects, or defect density of a program module. Taking the prediction of defect proneness as an example, the general process of defect prediction technology is described in Figure 1.

(1) Program modules are extracted from the software history repository, and the granularity of the modules can be set to package, file, change, or function according to the actual requirement.

(2) Based on the software code and the software development process, metrics related to software defects are designed and used to measure program modules. Defect data sets are built by mining defect logs in the version control system and defect reports in the defect tracking system.

(3) Necessary preprocessing (excluding outliers, data normalization, etc.) is performed on the defect data sets. Defect prediction models are built by using statistical or machine learning algorithms such as Naive Bayes, random forest, and support vector machines. Prediction models are evaluated based on performance indicators such as accuracy, recall, and AUC.

(4) When a new program module appears, the same metrics are used to measure the module, and the prediction models are used to predict whether the module is defect prone (DP) or nondefect prone (NDP).

2.2. Related Work

2.2.1. Just-In-Time Software Defect Prediction. Traditional defect prediction techniques are performed at coarse granularity (package, file, or function). Although these techniques are effective in some cases, they have the following disadvantages [7]: first, the granularity of predictions is coarse, so developers need to check a lot of code to locate the defective code; second, program modules are often modified by multiple developers, so it is difficult to find a suitable expert to perform code checking on the defect-prone modules; third, since the defect prediction is performed in the late stages of the software development cycle, developers may have forgotten the details of the development, which reduces the efficiency of defect detection and repair.

In order to overcome the shortcomings of traditional defect prediction, a new technology called just-in-time software defect prediction (JIT-SDP) is proposed [7, 20]. JIT-SDP is a fine-grained prediction method whose prediction objects are code changes instead of modules, which has the following merits: first, since the prediction is performed at a fine granularity, the scope of the code inspection can be reduced to save the effort of code inspection; second, JIT-SDP is performed at the change level, and once a developer submits the modified code, the defect prediction can be executed. Therefore, defect predictions are performed in a more timely manner, and it is easy to find the right developer for buggy changes.

In recent years, research related to JIT-SDP has developed rapidly. Mockus and Weiss [20] designed change metrics to evaluate the probability of defect-inducing changes for initial modification requests (IMR) in a 5ESS network switch project. Yang et al. [21] first applied deep learning technology to JIT-SDP and proposed a method called Deeper. The empirical results show that the Deeper method can detect more buggy changes and has a better F1 indicator than the EALR method. Kamei et al. [10] conducted an experiment on 11 open-source projects for JIT-SDP using cross-project models. They found that cross-project models can perform well when the training data are carefully selected. Yang et al. [22] proposed a two-layer ensemble learning method (TLEL), which leverages decision trees and ensemble learning. Experimental results show that TLEL can significantly improve the PofB20 and F1 indicators compared to three baselines. Chen et al. [9] proposed a novel supervised learning method, MULTI, which applies a multiobjective optimization algorithm to software defect prediction. Experimental results show that MULTI is superior to 43 state-of-the-art prediction models. Considering that a commit in software development may be partially defective, involving both defective and nondefective files, Pascarella et al. [23] proposed a fine-grained JIT-SDP model to predict buggy files in a commit based on ten open-source projects. Experimental results show that 43% of defective commits are mixed of buggy and clean resources, and their method can obtain 82% AUC-ROC to detect defective files in a commit.

2.2.2. Effort-Aware Software Defect Prediction. In recent years, researchers have pointed out that when using a defect prediction model, the cost-effectiveness of the model should be considered [24]. Due to the high cost of checking potentially defective program modules, modules with high defect densities should be prioritized when testing resources are limited. Effort-aware defect prediction aims at finding more buggy changes with limited testing resources, which has been extensively studied in module-level (package, file, or function) defect prediction [25, 26]. Kamei et al. [7] applied effort-aware defect prediction to JIT-SDP. In their study, they used code churn (the total number of modified lines of code) as a measure of the effort and proposed the EALR model based on linear regression. Their study showed that 35% of defect-inducing changes can be detected by using only 20% of the effort.

Figure 1: The process of software defect prediction.

Recently, many supervised and unsupervised learning methods have been investigated in the context of effort-aware JIT-SDP. Yang et al. [8] compared simple unsupervised models with supervised models for effort-aware JIT-SDP and found that many unsupervised models outperform the state-of-the-art supervised models. Inspired by the study of Yang et al. [8], Liu et al. [27] proposed a novel unsupervised learning method, CCUM, which is based on code churn. Their study showed that CCUM performs better than the state-of-the-art prediction models at that time. As the results Yang et al. [8] reported are startling, Fu et al. [28] repeated their experiment. Their study indicates that supervised learning methods are better than unsupervised methods when the analysis is conducted on a project-by-project basis. However, their study supports the general goal of Yang et al. [8]. Huang et al. [29] also performed a replication study based on the study of Yang et al. [8] and extended it in the literature [30]. They found three weaknesses of the LT model proposed by Yang et al. [8]: more context switching, more false alarms, and worse F1 than supervised methods. Therefore, they proposed a novel supervised method, CBS+, which combines the advantages of LT and EALR. Empirical results demonstrate that CBS+ can detect more buggy changes than EALR. More importantly, CBS+ can significantly reduce context switching compared to LT.

3. Software Defect Prediction Based on Local Models

Menzies et al. [11] first applied local models to defect prediction and effort estimation in 2011 and extended their work in 2013 [12]. The basic idea of local models is simple: first, cluster all training data into homogeneous regions; then, train a prediction model for each region. A test instance will be predicted by one of the predictors [31].

To the best of our knowledge, there are only five studies of local models in the field of defect prediction. Menzies et al. [11, 12] first proposed the idea of local models. Their studies attempted to explore the best way to learn lessons for defect prediction and effort estimation. They adopted the WHERE algorithm to cluster the data into regions with similar properties. Experimental results showed that the lessons learned from clusters have better prediction performance, by comparing the lessons generated from all the data, from local projects, and from each cluster. Scanniello et al. [13] proposed a novel defect prediction method based on software clustering for class-level defect prediction. Their research demonstrated that, for a class to be predicted in a project system, the models built from the classes related to it outperform the models built from the classes of the entire system. Bettenburg et al. [14] compared local and global models based on statistical learning technology for software defect prediction. Experimental results demonstrated that local models had better fit and prediction performance. Herbold et al. [31] introduced local models into the context of cross-project defect prediction. They compared local models with global models and a transfer learning technique. Their findings showed that local models make only a minor difference compared with global models and the transfer learning model. Mezouar et al. [32] compared the performance of local and global models in the context of effort-aware defect prediction. Their study is based on 15 module-level defect data sets and uses k-medoids to build local models. Experimental results showed that local models perform worse than global models.

The process of and difference between local models and global models are described in Figure 2. For global models, all training data are used to build prediction models by using statistical or machine learning methods, and the models can be used to predict any test instance. In comparison, local models first build a cluster model and divide the training data into several clusters based on the cluster model. In the prediction phase, each test instance is assigned a cluster number based on the cluster model, and each test instance is predicted by the model trained on the corresponding cluster instead of on all the training data.

In our experiments, we apply local models to the JIT-SDP problem. In previous studies, various clustering methods have been used in local models, such as WHERE [11, 12], MCLUST [14], k-means [14], hierarchical clustering [14], BorderFlow [13], EM clustering [31], and k-medoids [32]. According to the suggestion of Herbold et al. [31], one can use any of these clustering methods. Our experiment adopts k-medoids as the clustering method in local models. K-medoids is a widely used clustering method. Different from the k-means used in a previous study [14], the cluster center of k-medoids is an actual object, which gives it low sensitivity to outliers. Moreover, there are mature Python libraries supporting the implementation of k-medoids. In the stage of model building, we use logistic regression and EALR to build the classification model and the effort-aware prediction model in local models, which are used as baseline models in previous JIT-SDP studies [7–9]. The detailed process of local models for JIT-SDP is described in Algorithm 1.

4. Experimental Setup

This article answers the following three questions through the experiment:

(i) RQ1: how are the classification performance and effort-aware prediction performance of local models compared to those of global models in the cross-validation scenario?

(ii) RQ2: how are the classification performance and effort-aware prediction performance of local models compared to those of global models in the cross-project-validation scenario?

(iii) RQ3: how are the classification performance and effort-aware prediction performance of local models compared to those of global models in the timewise-cross-validation scenario?

This section introduces the experimental setup from five aspects: data sets, modeling methods, performance evaluation indicators, data analysis method, and data preprocessing. Our experiment is performed on hardware with an Intel(R) Core(TM) i7-7700 CPU at 3.60 GHz. The operating system is Windows 10. The programming environment for the scripts is Python 3.6.


4.1. Data Sets. The data sets used in the experiment are widely used for JIT-SDP and are shared by Kamei et al. [7]. The data sets contain six open-source projects from different application domains. For example, Bugzilla (BUG) is an open-source defect tracking system, Eclipse JDT (JDT) is a full-featured Java integrated development environment plugin for the Eclipse platform, Mozilla (MOZ) is a web browser, and PostgreSQL (POS) is a database system [9]. These data sets are extracted from the CVS repositories of the projects and the corresponding bug reports [7]. The basic information of the data sets is shown in Table 1.

To predict whether a change is defective or not, Kamei et al. [7] designed 14 metrics associated with software defects, which can be grouped into five groups (i.e., diffusion, size, purpose, history, and experience). The specific description of these metrics is shown in Table 2. A more detailed introduction to the metrics can be found in the literature [7].

4.2. Modeling Methods. Three modeling methods are used in our experiment. First, in the process of using local models, the training set is divided into multiple clusters; this process is completed by the k-medoids model. Second, the classification performance and effort-aware prediction performance of local models and global models are investigated in our experiment. Therefore, we use logistic regression and EALR to build classification models and effort-aware prediction models, respectively, which are used as baseline models in previous JIT-SDP studies [7–9]. Details of the three models are given below.

4.2.1. K-Medoids. The first step in local models is to use a clustering algorithm to divide all training data into several homogeneous regions. In this process, the experiment uses k-medoids as the clustering algorithm, which is widely used in different clustering tasks. Different from k-means, the coordinate center of each cluster of k-medoids is a representative object rather than a centroid. Hence, k-medoids is less sensitive to outliers. The objective function of k-medoids can be defined by Equation (1), where p is a sample in the cluster Ci and Oi is the medoid of the cluster Ci:

Figure 2: The process of (a) global and (b) local models.

Input: training set D = {(x1, y1), (x2, y2), ..., (xn, yn)}; test set T = {(x1, y1), (x2, y2), ..., (xm, ym)}; the number of clusters k
Output: prediction results R

begin
    R ← [ ]
    // Divide the training set into k clusters
    clusters, cluster_centers ← k-medoids(D, k)
    // Train a classifier for each cluster
    classifiers ← modeling_algorithm(clusters)
    for each instance t in T do
        // Assign a cluster number i to the instance t
        i ← calculate_min_distance(t, cluster_centers)
        r ← classifiers[i].predict(t)    // perform prediction
        R.append(r)
    end
end

Algorithm 1: The application process of local models for JIT-SDP.
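To make the procedure concrete, the following Python sketch mirrors Algorithm 1. It is only an illustration of the workflow, not the exact experimental script: the helper k_medoids(X, k) is assumed to return a cluster label for every training instance together with the medoid vectors (one possible implementation is sketched in Section 4.2.1), and logistic regression is used as the per-cluster classifier, as in our experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_local_models(X_train, y_train, k, k_medoids):
    """Algorithm 1, steps 3-6: cluster the training set and fit one classifier per cluster.
    k_medoids(X, k) is an assumed helper returning (labels, medoids)."""
    labels, medoids = k_medoids(X_train, k)
    classifiers = {}
    for i in range(k):
        mask = labels == i
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train[mask], y_train[mask])
        classifiers[i] = clf
    return classifiers, medoids

def predict_local_models(X_test, classifiers, medoids):
    """Algorithm 1, steps 7-12: route each test instance to the nearest medoid
    and predict with the classifier trained on that cluster."""
    predictions = []
    for t in X_test:
        i = int(np.argmin(np.linalg.norm(medoids - t, axis=1)))
        predictions.append(classifiers[i].predict(t.reshape(1, -1))[0])
    return np.array(predictions)
```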


E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - O_i|^2.    (1)

Among different k-medoids algorithms, the partitioning around medoids (PAM) algorithm is used in the experiment, which was proposed by Kaufman and Rousseeuw [33] and is one of the most popularly used algorithms for k-medoids.
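For illustration, the sketch below shows one simple way to obtain the cluster labels and medoids used above. It follows an alternating assign/update scheme that minimizes the objective of Equation (1); it is therefore only an approximation of the full PAM swap procedure of Kaufman and Rousseeuw [33], for which a mature library implementation can be used in practice.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Simplified k-medoids: assign each sample to its nearest medoid, then move each
    medoid to the cluster member that minimizes the within-cluster distance sum."""
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_idx = medoid_idx.copy()
        for i in range(k):
            members = np.where(labels == i)[0]
            if len(members) == 0:
                continue  # empty cluster: keep the old medoid
            pairwise = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=2)
            new_idx[i] = members[pairwise.sum(axis=1).argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    dist = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return dist.argmin(axis=1), X[medoid_idx]
```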

4.2.2. Logistic Regression. Similar to prior studies, logistic regression is used to build classification models [7]. Logistic regression is a widely used regression model that can be used for a variety of classification and regression tasks. Assuming that there are n metrics for a change c, logistic regression can be expressed by Equation (2), where w_0, w_1, ..., w_n denote the coefficients of logistic regression, and the independent variables m_1, ..., m_n represent the metrics of the change, which are introduced in Table 2. The output of the model, y(c), indicates the probability that the change is buggy:

y(c) = \frac{1}{1 + e^{-(w_0 + w_1 m_1 + \cdots + w_n m_n)}}.    (2)

In our experiment, logistic regression is used to solve a classification problem. However, the output of the model is a continuous value with a range of 0 to 1. Therefore, for a change c, if the value of y(c) is greater than 0.5, then c is classified as buggy; otherwise, it is classified as clean. The process can be defined as follows:

Y(c) = \begin{cases} 1, & \text{if } y(c) > 0.5, \\ 0, & \text{if } y(c) \le 0.5. \end{cases}    (3)
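As a minimal sketch (with illustrative variable names), the classification step can be written with scikit-learn as follows, applying the 0.5 cut-off of Equation (3) to the probability of Equation (2).

```python
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_test are assumed to hold the change metrics of Table 2 and the labels
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
prob_buggy = clf.predict_proba(X_test)[:, 1]   # y(c) in Equation (2)
pred_label = (prob_buggy > 0.5).astype(int)    # Y(c) in Equation (3)
```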

4.2.3. Effort-Aware Linear Regression. Arisholm et al. [24] pointed out that the use of defect prediction models should consider not only the classification performance of the model but also the cost-effectiveness of code inspection. Afterwards, a large number of studies focused on effort-aware defect prediction [8, 9, 11, 12]. Effort-aware linear regression (EALR) was first proposed by Kamei et al. [7] for solving effort-aware JIT-SDP and is based on linear regression models. Different from the above logistic regression, the dependent variable of EALR is Y(c)/Effort(c), where Y(c) denotes the defect proneness of a change and Effort(c) denotes the effort required to code-check the change c. According to the suggestion of Kamei et al. [7], the effort for a change can be obtained by calculating the number of modified lines of code.
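EALR can thus be sketched as an ordinary linear regression whose target is the defect proneness divided by the effort. In the sketch below we take the code churn (LA + LD) of a change as its effort, which is our own simplifying assumption, consistent with the churn-based effort used for ACC and Popt in Section 4.3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# churn_train holds LA + LD of each training change; a small constant avoids division by zero
effort_train = churn_train + 1e-6
ealr = LinearRegression()
ealr.fit(X_train, y_train / effort_train)            # dependent variable Y(c)/Effort(c)

predicted_density = ealr.predict(X_test)             # higher value = inspect earlier
inspection_order = np.argsort(-predicted_density)    # ranking of changes by predicted defect density
```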

4.3. Performance Indicators. There are four performance indicators used to evaluate and compare the performance of local and global models for JIT-SDP (i.e., F1, AUC, ACC, and Popt). Specifically, we use F1 and AUC to evaluate the classification performance of the prediction models and use ACC and Popt to evaluate their effort-aware prediction performance. The details are as follows.

Table 1: The basic information of the data sets.

Project   Period                   Number of defective changes   Number of changes   Defect rate (%)
BUG       1998/08/26–2006/12/16    1696                          4620                36.71
COL       2002/11/25–2006/07/27    1361                          4455                30.55
JDT       2001/05/02–2007/12/31    5089                          35386               14.38
MOZ       2000/01/01–2006/12/17    5149                          98275               5.24
PLA       2001/05/02–2007/12/31    9452                          64250               14.71
POS       1996/07/09–2010/05/15    5119                          20431               25.06

Table 2: The description of metrics.

Dimension    Metric    Description
Diffusion    NS        Number of modified subsystems
             ND        Number of modified directories
             NF        Number of modified files
             Entropy   Distribution of modified code across each file
Size         LA        Lines of code added
             LD        Lines of code deleted
             LT        Lines of code in a file before the change
Purpose      FIX       Whether or not the change is a defect fix
History      NDEV      Number of developers that changed the files
             AGE       Average time interval between the last and the current change
             NUC       Number of unique last changes to the files
Experience   EXP       Developer experience
             REXP      Recent developer experience
             SEXP      Developer experience on a subsystem


The experiment first uses F1 and AUC to evaluate the classification performance of the models. JIT-SDP technology aims to predict the defect proneness of a change, which is a classification task. The predicted results can be expressed as a confusion matrix, shown in Table 3, including true positive (TP), false negative (FN), false positive (FP), and true negative (TN).

For classification problems, precision (P) and recall (R) are often used to evaluate the prediction performance of a model:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.    (4)

However, the two indicators are usually contradictory in their results. In order to evaluate a prediction model comprehensively, we use F1 to evaluate the classification performance of the model, which is the harmonic mean of precision and recall, as shown in the following equation:

F1 = \frac{2 \times P \times R}{P + R}.    (5)

The second indicator is AUC. Class imbalance problems are often encountered in software defect prediction, and our data sets are also imbalanced (the number of buggy changes is much smaller than the number of clean changes). In class-imbalanced problems, threshold-based evaluation indicators (such as accuracy, recall, and F1) are sensitive to the threshold [4]. Therefore, a threshold-independent evaluation indicator, AUC, is used to evaluate the classification performance of a model. AUC (area under curve) is the area under the ROC (receiver operating characteristic) curve. The drawing process of the ROC curve is as follows: first, sort the instances in descending order of the probability of being predicted as positive, and then treat them as positive examples in turn. In each iteration, the true positive rate (TPR) and the false positive rate (FPR) are calculated as the ordinate and the abscissa, respectively, to obtain the ROC curve:

TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP}.    (6)
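Both classification indicators are directly available in scikit-learn; a small sketch with illustrative variable names is given below.

```python
from sklearn.metrics import f1_score, roc_auc_score

f1 = f1_score(y_test, pred_label)          # threshold-based indicator of Equation (5)
auc = roc_auc_score(y_test, prob_buggy)    # threshold-independent area under the ROC curve
```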

The experiment uses ACC and Popt to evaluate the effort-aware prediction performance of local and global models. ACC and Popt are commonly used performance indicators and have been widely used in previous research [7–9, 25].

The calculation of ACC and Popt is illustrated in Figure 3, in which the x-axis and y-axis represent the cumulative percentage of code churn (i.e., the number of lines of code added and deleted) and of defect-inducing changes, respectively. To compute the values of ACC and Popt, three curves are considered: the optimal model curve, the prediction model curve, and the worst model curve [8]. In the optimal model curve, the changes are sorted in descending order of their actual defect densities. In the worst model curve, the changes are sorted in ascending order of their actual defect densities. In the prediction model curve, the changes are sorted in descending order of their predicted defect densities. Then, the areas between the three curves and the x-axis, namely Area(optimal), Area(m), and Area(worst), are calculated. ACC indicates the recall of defect-inducing changes when 20% of the effort is invested based on the prediction model curve. According to [7], Popt can be defined by the following equation:

P_{opt} = 1 - \frac{\text{Area}(\text{optimal}) - \text{Area}(m)}{\text{Area}(\text{optimal}) - \text{Area}(\text{worst})}.    (7)
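The following sketch illustrates one way to compute ACC and Popt from a predicted ranking. It follows the description above (changes sorted by predicted defect density, code churn as the effort, areas under the cumulative curves) and is a simplified illustration rather than the exact evaluation script used in the experiment.

```python
import numpy as np

def _curve_area(y, effort, order):
    """Area under the cumulative %defect-inducing-changes vs. %effort curve for a given ordering."""
    x = np.cumsum(effort[order]) / effort.sum()
    yc = np.cumsum(y[order]) / y.sum()
    return np.trapz(np.insert(yc, 0, 0.0), np.insert(x, 0, 0.0))

def acc_popt(y_true, predicted_density, effort):
    """y_true: 1 for defect-inducing changes; effort: churn (LA + LD) of each change, assumed > 0."""
    y = np.asarray(y_true, dtype=float)
    effort = np.asarray(effort, dtype=float)
    order = np.argsort(-np.asarray(predicted_density))
    # ACC: recall of defect-inducing changes within the first 20% of the total effort
    cum_effort = np.cumsum(effort[order]) / effort.sum()
    acc = y[order][cum_effort <= 0.2].sum() / y.sum()
    # Popt (Equation (7)): optimal/worst curves are obtained by sorting on the actual defect density
    area_m = _curve_area(y, effort, order)
    area_optimal = _curve_area(y, effort, np.argsort(-(y / effort)))
    area_worst = _curve_area(y, effort, np.argsort(y / effort))
    popt = 1 - (area_optimal - area_m) / (area_optimal - area_worst)
    return acc, popt
```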

4.4. Data Analysis Method. The experiment considers three scenarios to evaluate the performance of the prediction models, namely 10 times 10-fold cross-validation, cross-project-validation, and timewise-cross-validation. The details are as follows.

4.4.1. 10 Times 10-Fold Cross-Validation. Cross-validation is performed within the same project, and the basic process of 10 times 10-fold cross-validation is as follows: first, the data set of a project is divided into 10 parts of equal size; subsequently, nine parts are used to train a prediction model, and the remaining part is used as the test set. This process is repeated 10 times in order to reduce random deviation. Therefore, 10 times 10-fold cross-validation produces 10 × 10 = 100 results.
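This procedure corresponds to repeated stratified 10-fold cross-validation; a compact sketch with scikit-learn is shown below, in which build_model and evaluate are placeholders for the model construction and the indicator computation described in this section.

```python
from sklearn.model_selection import RepeatedStratifiedKFold

rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = []
for train_idx, test_idx in rskf.split(X, y):          # 10 x 10 = 100 train/test pairs
    model = build_model(X[train_idx], y[train_idx])   # global or local model (placeholder)
    scores.append(evaluate(model, X[test_idx], y[test_idx]))
```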

Table 3: Confusion matrix.

                  Prediction result
Actual value      Positive    Negative
Positive          TP          FN
Negative          FP          TN

Figure 3: The performance indicators of ACC and Popt (the x-axis is the cumulative percentage of code churn, the y-axis is the cumulative percentage of defect-inducing changes, and the three curves are the optimal model, the prediction model, and the worst model).


4.4.2. Cross-Project-Validation. Cross-project defect prediction has received widespread attention in recent years [34]. The basic idea of cross-project defect prediction is to train a defect prediction model with the defect data sets of one project to predict defects of another project [8]. Given data sets containing n projects, n × (n − 1) run results can be generated for a prediction model. The data sets used in our experiment contain six open-source projects, so 6 × (6 − 1) = 30 prediction results can be generated for a prediction model.
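With six projects, this simply enumerates all ordered source/target pairs, as in the small sketch below (run_experiment is a placeholder for training on the source project and evaluating on the target project).

```python
from itertools import permutations

projects = ["BUG", "COL", "JDT", "MOZ", "PLA", "POS"]
for source, target in permutations(projects, 2):               # 6 x 5 = 30 ordered pairs
    run_experiment(train_project=source, test_project=target)  # placeholder
```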

4.4.3. Timewise-Cross-Validation. Timewise-cross-validation is also performed within the same project, but it considers the chronological order of changes in the data sets [35]. Defect data sets are collected in chronological order during the actual software development process. Therefore, changes committed early in the data sets can be used to predict the defect proneness of changes committed later. In the timewise-cross-validation scenario, all changes in the data sets are arranged in chronological order, and the changes in the same month are grouped in the same group. Suppose the data sets are divided into n groups. Group i and group i + 1 are used to train a prediction model, and group i + 4 and group i + 5 are used as the test set. Therefore, assuming that a project contains n months of data, n − 5 run results can be generated for a prediction model.
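The generation of the timewise splits can be sketched as follows, assuming the changes of a project have already been grouped by calendar month in chronological order, with each group holding the row indices of that month's changes.

```python
def timewise_splits(month_groups):
    """Yield (train_indices, test_indices) pairs: months i and i+1 form the training set,
    months i+4 and i+5 form the test set, giving n-5 splits for n months of data."""
    n = len(month_groups)
    for i in range(n - 5):
        train_idx = list(month_groups[i]) + list(month_groups[i + 1])
        test_idx = list(month_groups[i + 4]) + list(month_groups[i + 5])
        yield train_idx, test_idx
```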

The experiment uses the Wilcoxon signed-rank test [36] and Cliff's δ [37] to test the significance of differences in classification performance and prediction performance between local and global models. The Wilcoxon signed-rank test is a popular nonparametric statistical hypothesis test. We use p values to examine whether the difference in the performance of local and global models is statistically significant at a significance level of 0.05. The experiment also uses Cliff's δ to determine the magnitude of the difference in prediction performance between local models and global models. Usually, the magnitude of the difference can be divided into four levels, i.e., trivial (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), moderate (0.33 ≤ |δ| < 0.474), and large (|δ| ≥ 0.474) [38]. In summary, when the p value is less than 0.05 and Cliff's δ is greater than or equal to 0.147, local models and global models have a significant difference in prediction performance.
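A small sketch of this statistical comparison is given below, using scipy for the Wilcoxon signed-rank test and a direct implementation of Cliff's δ with the thresholds of Romano et al. [38]; the two input lists are assumed to hold paired performance values of the local and global models.

```python
from scipy.stats import wilcoxon

def cliffs_delta(a, b):
    """Cliff's delta between two samples a and b."""
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))

def magnitude(delta):
    d = abs(delta)
    if d < 0.147:
        return "trivial"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "moderate"
    return "large"

def compare_models(local_scores, global_scores, alpha=0.05):
    _, p_value = wilcoxon(local_scores, global_scores)
    delta = cliffs_delta(local_scores, global_scores)
    significant = (p_value < alpha) and (abs(delta) >= 0.147)
    return p_value, delta, magnitude(delta), significant
```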

4.5. Data Preprocessing. According to the suggestion of Kamei et al. [7], it is necessary to preprocess the data sets. This includes the following three steps (a small illustrative sketch follows the list):

(1) Removing Highly Correlated Metrics. The ND and REXP metrics are excluded because NF and ND, as well as REXP and EXP, are highly correlated. LA and LD are normalized by dividing by LT, since LA and LD are highly correlated. LT and NUC are normalized by dividing by NF, since LT and NUC are highly correlated with NF.

(2) Logarithmic Transformation. Since most metrics are highly skewed, each metric is log-transformed (except for FIX).

(3) Dealing with Class Imbalance. The data sets used in the experiment have a class imbalance problem, i.e., the number of defect-inducing changes is far less than the number of clean changes. Therefore, we perform random undersampling on the training set: by randomly deleting clean changes, the number of clean changes is kept the same as the number of buggy changes. Note that undersampling is not performed on the test set.
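A minimal pandas sketch of these three steps is shown below. It assumes a DataFrame with the metric names of Table 2 plus a label column named bug, and it is meant to illustrate the order of the operations rather than to reproduce the exact preprocessing script.

```python
import numpy as np
import pandas as pd

def preprocess(df, is_training_set=True, seed=0):
    df = df.copy()
    # (1) Remove highly correlated metrics and normalize the related ones
    df = df.drop(columns=["ND", "REXP"])
    df["LA"] = df["LA"] / (df["LT"] + 1)    # +1 guards against zero LT (our assumption)
    df["LD"] = df["LD"] / (df["LT"] + 1)
    df["LT"] = df["LT"] / df["NF"]
    df["NUC"] = df["NUC"] / df["NF"]
    # (2) Logarithmic transformation of every metric except FIX
    for col in df.columns.difference(["FIX", "bug"]):
        df[col] = np.log1p(df[col])
    # (3) Random undersampling of clean changes, applied to the training set only
    if is_training_set:
        buggy = df[df["bug"] == 1]
        clean = df[df["bug"] == 0].sample(n=len(buggy), random_state=seed)
        df = pd.concat([buggy, clean]).sample(frac=1, random_state=seed)
    return df
```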

5. Experimental Results and Analysis

5.1. Analysis for RQ1. To investigate and compare the performance of local and global models in the 10 times 10-fold cross-validation scenario, we conduct a large-scale empirical study based on data sets from six open-source projects. The experimental results are shown in Table 4. The first column represents the evaluation indicators for the prediction models, including AUC, F1, ACC, and Popt; the second column denotes the project names of the six data sets; the third column shows the performance values of the global models in the four evaluation indicators; and from the fourth column to the twelfth column, the performance values of the local models at different k values are shown. Considering that the number of clusters k may affect the performance of local models, the value of the parameter k is tuned from 2 to 10. Since 100 prediction results are generated in the 10 times 10-fold cross-validation scenario, the values in the table are the medians of the experimental results. We do the following processing on the data in Table 4:

(i) We performed the significance test defined in Section 4.4 on the performance of the local and global models. The performance values of local models that are significantly different from those of the global models are shown in bold.

(ii) The performance values of the local models that are significantly better than those of the global models are given in italics.

(iii) The W/D/L values (i.e., the number of projects for which local models obtain a better, equal, or worse prediction performance than global models) are calculated under different k values and different evaluation indicators.

First, we analyze and compare the classification performance of local and global models. As shown in Table 4, the performance of local models, for all parameters k and all projects, is worse than that of global models in the AUC indicator. Moreover, in most cases, the performance of local models is worse than that of global models in the F1 indicator (except for four values in italics). Therefore, it can be concluded that global models perform better than local models in classification performance.

Second, we investigate and compare the effort-aware prediction performance of local and global models. As can be seen from Table 4, local models are valid for effort-aware JIT-SDP. However, local models have different effects on different projects and different parameters k.


First, local models have different effects on different data sets. To describe the effectiveness of local models on different data sets, we calculate the number of times that local models perform significantly better than global models over the 9 k values and the two evaluation indicators (i.e., ACC and Popt) for each data set. The statistical results are shown in Table 5, where the second row is the number of times that local models perform better than global models for each project. It can be seen that local models have better performance on the projects MOZ and PLA and poor performance on the projects BUG and COL. Based on the number of times that local models outperform global models in the six projects, we rank the projects, as shown in the third row. In addition, we list the sizes of the data sets and their ranks, as shown in the fourth row. It can be seen that the performance of local models on a data set has a strong correlation with the size of the data set. Generally, the larger the size of the data set, the more significant the performance improvement of local models.

In view of this situation, we consider that, if the size of the data set is too small, then when using local models, since the data set is divided into multiple clusters, the training samples of each cluster are further reduced, which leads to performance degradation of the models. When the size of the data set is large, the performance of the model is not easily limited by the size of the data set, so the advantages of local models are better reflected.

In addition, the number of clusters k also has an impact on the performance of local models. As can be seen from the table, according to the W/D/L analysis, local models can obtain the best ACC indicator when k is set to 2 and obtain the best Popt indicator when k is set to 2 and 6. Therefore, we recommend that, in the cross-validation scenario, when using local models to solve the effort-aware JIT-SDP problem, the parameter k be set to 2.

This section analyzes and compares the classification performance and effort-aware prediction performance of local and global models in the 10 times 10-fold cross-validation scenario. The empirical results show that, compared with global models, local models perform poorly in classification performance but perform better in effort-aware prediction performance. However, the performance of local models is affected by the parameter k and by the size of the data sets. The experimental results show that local models can obtain better effort-aware prediction performance on projects with a larger data set size, and that they obtain better effort-aware prediction performance when k is set to 2.

5.2. Analysis for RQ2. This section discusses the performance differences between local and global models in the cross-project-validation scenario for JIT-SDP. The results of the experiment are shown in Table 6, in which the first column indicates the four evaluation indicators, including AUC, F1, ACC, and Popt; the second column represents the performance values of the global models; and from the third column to the eleventh column, the prediction performance of the local models under different parameters k is represented. In our experiment, six data sets are used in the cross-project-validation scenario. Therefore, each prediction model produces 30 prediction results. The values in Table 6 represent the medians of the prediction results.

Table 4: Local vs. global models in the 10 times 10-fold cross-validation scenario.

Indicator  Project  Global   Local
                             k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10
AUC        BUG      0.730    0.704   0.687   0.684   0.698   0.688   0.686   0.686   0.693   0.685
           COL      0.742    0.719   0.708   0.701   0.703   0.657   0.666   0.670   0.668   0.660
           JDT      0.725    0.719   0.677   0.670   0.675   0.697   0.671   0.674   0.671   0.654
           MOZ      0.769    0.540   0.741   0.715   0.703   0.730   0.730   0.725   0.731   0.722
           PLA      0.742    0.733   0.702   0.693   0.654   0.677   0.640   0.638   0.588   0.575
           POS      0.777    0.764   0.758   0.760   0.758   0.753   0.755   0.705   0.757   0.757
           W/D/L    —        0/0/6   0/0/6   0/0/6   0/0/6   0/0/6   0/0/6   0/0/6   0/0/6   0/0/6
F1         BUG      0.663    0.644   0.628   0.616   0.603   0.598   0.608   0.610   0.601   0.592
           COL      0.696    0.672   0.649   0.669   0.678   0.634   0.537   0.596   0.604   0.564
           JDT      0.723    0.706   0.764   0.757   0.734   0.695   0.615   0.643   0.732   0.722
           MOZ      0.825    0.374   0.853   0.758   0.785   0.741   0.718   0.737   0.731   0.721
           PLA      0.704    0.678   0.675   0.527   0.477   0.557   0.743   0.771   0.627   0.604
           POS      0.762    0.699   0.747   0.759   0.746   0.698   0.697   0.339   0.699   0.700
           W/D/L    —        0/0/6   2/0/4   1/1/4   0/1/5   0/0/6   0/1/5   1/0/5   0/1/5   0/1/5
ACC        BUG      0.421    0.452   0.471   0.458   0.424   0.451   0.450   0.463   0.446   0.462
           COL      0.553    0.458   0.375   0.253   0.443   0.068   0.195   0.459   0.403   0.421
           JDT      0.354    0.532   0.374   0.431   0.291   0.481   0.439   0.314   0.287   0.334
           MOZ      0.189    0.386   0.328   0.355   0.364   0.301   0.270   0.312   0.312   0.312
           PLA      0.368    0.472   0.509   0.542   0.497   0.509   0.456   0.442   0.456   0.495
           POS      0.336    0.490   0.397   0.379   0.457   0.442   0.406   0.341   0.400   0.383
           W/D/L    —        5/0/1   4/1/1   4/1/1   3/1/2   4/1/1   4/1/1   2/2/2   3/1/2   4/1/1
Popt       BUG      0.742    0.747   0.744   0.726   0.738   0.749   0.750   0.760   0.750   0.766
           COL      0.726    0.661   0.624   0.401   0.731   0.295   0.421   0.644   0.628   0.681
           JDT      0.660    0.779   0.625   0.698   0.568   0.725   0.708   0.603   0.599   0.634
           MOZ      0.555    0.679   0.626   0.645   0.659   0.588   0.589   0.643   0.634   0.630
           PLA      0.664    0.723   0.727   0.784   0.746   0.758   0.718   0.673   0.716   0.747
           POS      0.673    0.729   0.680   0.651   0.709   0.739   0.683   0.596   0.669   0.644
           W/D/L    —        4/1/1   2/2/2   2/1/3   3/2/1   4/1/1   2/3/1   1/2/3   1/3/2   2/1/3


Table 6: Local vs. global models in the cross-project-validation scenario.

Indicator  Global   Local                                                                   W/D/L
                    k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10
AUC        0.706    0.683   0.681   0.674   0.673   0.673   0.668   0.659   0.680   0.674   0/0/9
F1         0.678    0.629   0.703   0.696   0.677   0.613   0.620   0.554   0.629   0.637   0/4/5
ACC        0.314    0.493   0.396   0.348   0.408   0.393   0.326   0.340   0.342   0.345   2/7/0
Popt       0.656    0.764   0.691   0.672   0.719   0.687   0.619   0.632   0.667   0.624   1/8/0

For the values in Table 6, we do the following processing:

(i) Using the significance test method, the performance values of local models that are significantly different from those of global models are shown in bold.

(ii) The performance values of local models that are significantly better than those of global models are given in italics.

(iii) The W/D/L column analyzes the number of times that local models perform better than, equal to, or worse than global models under the different parameters k.

First, we compare the difference in classification performance between local and global models. As can be seen from Table 6, in the AUC and F1 indicators, the classification performance of local models under the different parameters k is worse than or equal to that of global models. Therefore, we can conclude that, in the cross-project-validation scenario, local models perform worse than global models in classification performance.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 6 that the performance of local models is better than or equal to that of global models in the ACC and Popt indicators when the parameter k ranges from 2 to 10. Therefore, local models are valid for effort-aware JIT-SDP in the cross-project-validation scenario. In particular, when the number of clusters k is set to 2, local models obtain better ACC and Popt values than global models; when k is set to 5, local models obtain better ACC values. Therefore, it is appropriate to set k to 2, which can find 49.3% of defect-inducing changes when using 20% of the effort, and which increases the ACC indicator by 57.0% and the Popt indicator by 16.4%.

In the cross-project-validation scenario, local models perform worse in classification performance but better in effort-aware prediction performance than global models. However, the setting of k in local models has an impact on their effort-aware prediction performance. The empirical results show that, when k is set to 2, local models can obtain the optimal effort-aware prediction performance.

5.3. Analysis for RQ3. This section compares the prediction performance of local and global models in the timewise-cross-validation scenario. Since, in the timewise-cross-validation scenario, only two months of data in each data set are used to train the prediction models, the size of the training sample in each cluster is small. Through experiments, we find that, if the parameter k of local models is set too large, the number of training samples in each cluster may be too small or even equal to 0. Therefore, the experiment sets the parameter k of local models to 2. The experimental results are shown in Table 7. Suppose a project contains n months of data; then the prediction models can produce n − 5 prediction results in the timewise-cross-validation scenario. Therefore, the third and fourth columns in Table 7 report the median and standard deviation of the prediction results for global and local models. Based on the significance test method, the performance values of local models with significant differences from those of global models are shown in bold. The row W/D/L summarizes the number of projects for which local models obtain a better, equal, or worse performance than global models in each indicator.

First, we compare the classification performance of local and global models. It can be seen from Table 7 that local models perform worse in the AUC and F1 indicators than global models on the six data sets. Therefore, it can be concluded that local models perform poorly in classification performance compared to global models in the timewise-cross-validation scenario.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 7 that, in the projects BUG and COL, local models have worse ACC and Popt indicators than global models, while in the other four projects (JDT, MOZ, PLA, and POS) local models and global models have no significant difference in the ACC and Popt indicators. Therefore, local models perform worse in effort-aware prediction performance than global models in the timewise-cross-validation scenario. This conclusion is inconsistent with the conclusions in Sections 5.1 and 5.2. For this problem, we consider that the main reason lies in the size of the training sample. In the timewise-cross-validation scenario, only two months of data in the data sets are used to train the prediction models (there are only 452 changes per month on average). When local models are used, since the training set is divided into several clusters, the number of training samples in each cluster is further decreased, which results in a decrease in the effort-aware prediction performance of local models.

In this section, we compare the classification performance and effort-aware prediction performance of local and global models in the timewise-cross-validation scenario. Empirical results show that local models perform worse than global models in both classification performance and effort-aware prediction performance. Therefore, we recommend using global models to solve the JIT-SDP problem in the timewise-cross-validation scenario.

Table 5: The number of wins of local models at different k values.

Project                     BUG        COL        JDT         MOZ         PLA         POS
Count                       4          0          7           18          14          11
Rank                        5          6          4           1           2           3
Number of changes (rank)    4620 (5)   4455 (6)   35386 (3)   98275 (1)   64250 (2)   20431 (4)


Table 7: Local vs. global models in the timewise-cross-validation scenario.

Indicator  Project  Global           Local (k = 2)
AUC        BUG      0.672 ± 0.125    0.545 ± 0.124
           COL      0.717 ± 0.075    0.636 ± 0.135
           JDT      0.708 ± 0.043    0.645 ± 0.073
           MOZ      0.749 ± 0.034    0.648 ± 0.128
           PLA      0.709 ± 0.054    0.642 ± 0.076
           POS      0.743 ± 0.072    0.702 ± 0.095
           W/D/L    —                0/0/6
F1         BUG      0.587 ± 0.108    0.515 ± 0.128
           COL      0.651 ± 0.066    0.599 ± 0.125
           JDT      0.722 ± 0.048    0.623 ± 0.161
           MOZ      0.804 ± 0.050    0.696 ± 0.190
           PLA      0.702 ± 0.053    0.618 ± 0.162
           POS      0.714 ± 0.065    0.657 ± 0.128
           W/D/L    —                0/0/6
ACC        BUG      0.357 ± 0.212    0.242 ± 0.210
           COL      0.426 ± 0.168    0.321 ± 0.186
           JDT      0.356 ± 0.148    0.329 ± 0.189
           MOZ      0.218 ± 0.132    0.242 ± 0.125
           PLA      0.358 ± 0.181    0.371 ± 0.196
           POS      0.345 ± 0.159    0.306 ± 0.205
           W/D/L    —                0/4/2
Popt       BUG      0.622 ± 0.194    0.458 ± 0.189
           COL      0.669 ± 0.150    0.553 ± 0.204
           JDT      0.654 ± 0.101    0.624 ± 0.142
           MOZ      0.537 ± 0.110    0.537 ± 0.122
           PLA      0.631 ± 0.115    0.645 ± 0.151
           POS      0.615 ± 0.131    0.583 ± 0.167
           W/D/L    —                0/4/2

6 Threats to Validity

61 External Validity Although our experiment uses publicdata sets that have extensively been used in previous studieswe cannot yet guarantee that the discovery of the experimentcan be applied to all other change-level defect data sets-erefore the defect data sets of more projects should bemined in order to verify the generalization of the experi-mental results

62 Construct Validity We use F1 AUC ACC and Popt toevaluate the prediction performance of local and globalmodels Because defect data sets are often imbalanced AUCis more appropriate as a threshold-independent evaluationindicator to evaluate the classification performance of

models [4] In addition we use F1 a threshold-basedevaluation indicator as a supplement to AUC Besidessimilar to prior study [7 9] the experiment uses ACC andPopt to evaluate the effort-aware prediction performance ofmodels However we cannot guarantee that the experi-mental results are valid in all other evaluation indicatorssuch as precision recall Fβ and PofB20

63 Internal Validity Internal validity involves errors in theexperimental code In order to reduce this risk we carefullyexamined the code of the experiment and referred to thecode published in previous studies [7 8 29] so as to improvethe reliability of the experiment code

7. Conclusions and Future Work

In this article, we apply local models to JIT-SDP for the first time. Our study aims to compare the prediction performance of local and global models under cross-validation, cross-project-validation, and timewise-cross-validation for JIT-SDP. In order to compare the classification performance and effort-aware prediction performance of local and global models, we first build local models based on the k-medoids method. Afterwards, logistic regression and EALR are used to build classification models and effort-aware prediction models for local and global models. The empirical results show that local models perform worse in classification performance than global models in the three evaluation scenarios. Moreover, local models have worse effort-aware prediction performance in the timewise-cross-validation scenario. However, local models are significantly better than global models in effort-aware prediction performance in the cross-validation and cross-project-validation scenarios. In particular, the optimal effort-aware prediction performance can be obtained when the parameter k of local models is set to 2. Therefore, local models are still promising in the context of effort-aware JIT-SDP.

In the future, first, we plan to consider more commercial projects to further verify the validity of the experimental conclusions. Second, logistic regression and EALR are used in local and global models. Considering that the choice of modeling methods will affect the prediction performance of local and global models, we hope to add more baselines to local and global models to verify the generalization of the experimental conclusions.

Data Availability

The experiment uses public data sets shared by Kamei et al. [7], who have already published the download address of the data sets in their paper. In addition, in order to verify the repeatability of our experiment and promote future research, all the experimental code in this article can be downloaded at https://github.com/yangxingguang/LocalJIT.

Conflicts of Interest

-e authors declare that they have no conflicts of interest

Acknowledgments

-is work was partially supported by the NSF of Chinaunder Grant nos 61772200 and 61702334 Shanghai PujiangTalent Program under Grant no 17PJ1401900 ShanghaiMunicipal Natural Science Foundation under Grant nos17ZR1406900 and 17ZR1429700 Educational ResearchFund of ECUST under Grant no ZH1726108 and Collab-orative Innovation Foundation of Shanghai Institute ofTechnology under Grant no XTCX2016-20

References

[1] M Newman Software Errors Cost US Economy $595 BillionAnnually NIST Assesses Technical Needs of Industry to ImproveSoftware-Testing 2002 httpwwwabeachacomNIST_press_release_bugs_costhtm

[2] Z Li X-Y Jing and X Zhu ldquoProgress on approaches tosoftware defect predictionrdquo IET Software vol 12 no 3pp 161ndash175 2018

[3] X Chen Q Gu W Liu S Liu and C Ni ldquoSurvey of staticsoftware defect predictionrdquo Ruan Jian Xue BaoJournal ofSoftware vol 27 no 1 2016 in Chinese

[4] C Tantithamthavorn S McIntosh A E Hassan andKMatsumoto ldquoAn empirical comparison of model validationtechniques for defect prediction modelsrdquo IEEE Transactionson Software Engineering vol 43 no 1 pp 1ndash18 2017

[5] A G Koru D Dongsong Zhang K El Emam andH Hongfang Liu ldquoAn investigation into the functional formof the size-defect relationship for software modulesrdquo IEEETransactions on Software Engineering vol 35 no 2pp 293ndash304 2009

[6] S Kim E J Whitehead and Y Zhang ldquoClassifying softwarechanges clean or buggyrdquo IEEE Transactions on SoftwareEngineering vol 34 no 2 pp 181ndash196 2008

[7] Y Kamei E Shihab B Adams et al ldquoA large-scale empiricalstudy of just-in-time quality assurancerdquo IEEE Transactions onSoftware Engineering vol 39 no 6 pp 757ndash773 2013

[8] Y Yang Y Zhou J Liu et al ldquoEffort-aware just-in-time defectprediction simple unsupervised models could be better thansupervised modelsrdquo in Proceedings of the 24th ACM SIGSOFTInternational Symposium on Foundations of Software Engi-neering pp 157ndash168 Hong Kong China November 2016

[9] X Chen Y Zhao Q Wang and Z Yuan ldquoMULTI multi-objective effort-aware just-in-time software defect predictionrdquoInformation and Software Technology vol 93 pp 1ndash13 2018

[10] Y Kamei T Fukushima S McIntosh K YamashitaN Ubayashi and A E Hassan ldquoStudying just-in-time defectprediction using cross-project modelsrdquo Empirical SoftwareEngineering vol 21 no 5 pp 2072ndash2106 2016

[11] T Menzies A Butcher A Marcus T Zimmermann andD R Cok ldquoLocal vs global models for effort estimation anddefect predictionrdquo in Proceedings of the 26th IEEEACM

International Conference on Automated Software Engineer-ing (ASE) pp 343ndash351 Lawrence KS USA November 2011

[12] T Menzies A Butcher D Cok et al ldquoLocal versus globallessons for defect prediction and effort estimationrdquo IEEETransactions on Software Engineering vol 39 no 6pp 822ndash834 2013

[13] G Scanniello C Gravino A Marcus and T Menzies ldquoClasslevel fault prediction using software clusteringrdquo in Pro-ceedings of the 2013 28th IEEEACM International Conferenceon Automated Software Engineering (ASE) pp 640ndash645Silicon Valley CA USA November 2013

[14] N Bettenburg M Nagappan and A E Hassan ldquoTowardsimproving statistical modeling of software engineering datathink locally act globallyrdquo Empirical Software Engineeringvol 20 no 2 pp 294ndash335 2015

[15] Q Wang S Wu and M Li ldquoSoftware defect predictionrdquoJournal of Software vol 19 no 7 pp 1565ndash1580 2008 inChinese

[16] F Akiyama ldquoAn example of software system debuggingrdquoInformation Processing vol 71 pp 353ndash359 1971

[17] T J McCabe ldquoA complexity measurerdquo IEEE Transactions onSoftware Engineering vol SE-2 no 4 pp 308ndash320 1976

[18] S R Chidamber and C F Kemerer ldquoAmetrics suite for objectoriented designrdquo IEEE Transactions on Software Engineeringvol 20 no 6 pp 476ndash493 1994

[19] N Nagappan and T Ball ldquoUse of relative code churn mea-sures to predict system defect densityrdquo in Proceedings of the27th International Conference on Software Engineering (ICSE)pp 284ndash292 Saint Louis MO USA May 2005

[20] A Mockus and D M Weiss ldquoPredicting risk of softwarechangesrdquo Bell Labs Technical Journal vol 5 no 2 pp 169ndash180 2000

[21] X Yang D Lo X Xia Y Zhang and J Sun ldquoDeep learningfor just-in-time defect predictionrdquo in Proceedings of the IEEEInternational Conference on Software Quality Reliability andSecurity (QRS) pp 17ndash26 Vancouver BC Canada August2015

[22] X Yang D Lo X Xia and J Sun ldquoTLEL a two-layer ensemblelearning approach for just-in-time defect predictionrdquo In-formation and Software Technology vol 87 pp 206ndash220 2017

[23] L Pascarella F Palomba and A Bacchelli ldquoFine-grained just-in-time defect predictionrdquo Journal of Systems and Softwarevol 150 pp 22ndash36 2019

[24] E Arisholm L C Briand and E B Johannessen ldquoA sys-tematic and comprehensive investigation of methods to buildand evaluate fault prediction modelsrdquo Journal of Systems andSoftware vol 83 no 1 pp 2ndash17 2010

[25] Y Kamei S Matsumoto A Monden K MatsumotoB Adams and A E Hassan ldquoRevisiting common bug pre-diction findings using effort-aware modelsrdquo in Proceedings ofthe 26th IEEE International Conference on Software Mainte-nance (ICSM) pp 1ndash10 Shanghai China September 2010

[26] Y Zhou B Xu H Leung and L Chen ldquoAn in-depth study ofthe potentially confounding effect of class size in fault pre-dictionrdquo ACM Transactions on Software Engineering andMethodology vol 23 no 1 pp 1ndash51 2014

[27] J Liu Y Zhou Y Yang H Lu and B Xu ldquoCode churn aneglectedmetric in effort-aware just-in-time defect predictionrdquoin Proceedings of the ACMIEEE International Symposium onEmpirical Software Engineering and Measurement (ESEM)pp 11ndash19 Toronto ON Canada November 2017

[28] W Fu and T Menzies ldquoRevisiting unsupervised learning fordefect predictionrdquo in Proceedings of the 2017 11th Joint

12 Scientific Programming

Meeting on Foundations of Software Engineering (ESECFSE)pp 72ndash83 Paderborn Germany September 2017

[29] Q Huang X Xia and D Lo ldquoSupervised vs unsupervisedmodels a holistic look at effort-aware just-in-time defectpredictionrdquo in Proceedings of the IEEE International Con-ference on Software Maintenance and Evolution (ICSME)pp 159ndash170 Shanghai China September 2017

[30] Q Huang X Xia and D Lo ldquoRevisiting supervised andunsupervised models for effort-aware just-in-time defectpredictionrdquo Empirical Software Engineering pp 1ndash40 2018

[31] S Herbold A Trautsch and J Grabowski ldquoGlobal vs localmodels for cross-project defect predictionrdquo Empirical Soft-ware Engineering vol 22 no 4 pp 1866ndash1902 2017

[32] M E Mezouar F Zhang and Y Zou ldquoLocal versus globalmodels for effort-aware defect predictionrdquo in Proceedings ofthe 26th Annual International Conference on Computer Sci-ence and Software Engineering (CASCON) pp 178ndash187Toronto ON Canada October 2016

[33] L Kaufman and P J Rousseeuw Finding Groups in Data AnIntroduction to Cluster Analysis Wiley Hoboken NJ USA1990

[34] S Hosseini B Turhan and D Gunarathna ldquoA systematicliterature review and meta-analysis on cross project defectpredictionrdquo IEEE Transactions on Software Engineeringvol 45 no 2 pp 111ndash147 2019

[35] J Sliwerski T Zimmermann and A Zeller ldquoWhen dochanges induce fixesrdquo Mining Software Repositories vol 30no 4 pp 1ndash5 2005

[36] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics Bulletin vol 1 no 6 pp 80ndash83 1945

[37] N Cliff Ordinal Methods for Behavioral Data AnalysisPsychology Press London UK 2014

[38] J Romano J D Kromrey J Coraggio and J SkowronekldquoAppropriate statistics for ordinal level data should we reallybe using t-test and cohensd for evaluating group differenceson the nsse and other surveysrdquo in Annual Meeting of theFlorida Association of Institutional Research pp 1ndash33 2006

Scientific Programming 13



statistical or machine learning algorithms such as Naive Bayes, random forest, and support vector machines. Prediction models are evaluated based on performance indicators such as accuracy, recall, and AUC.

(4) When a new program module appears, the same metrics are used to measure the module, and the prediction models are used to predict whether the module is defect prone (DP) or nondefect prone (NDP).

2.2. Related Work

2.2.1. Just-In-Time Software Defect Prediction. Traditional defect prediction techniques are performed at coarse granularity (package, file, or function). Although these techniques are effective in some cases, they have the following disadvantages [7]: first, the granularity of predictions is coarse, so developers need to check a lot of code to locate the defective code; second, program modules are often modified by multiple developers, so it is difficult to find a suitable expert to perform code checking on the defect-prone modules; third, since the defect prediction is performed in the late stages of the software development cycle, developers may have forgotten the details of the development, which reduces the efficiency of defect detection and repair.

In order to overcome the shortcomings of traditional defect prediction, a new technology called just-in-time software defect prediction (JIT-SDP) has been proposed [7, 20]. JIT-SDP is a fine-grained prediction method whose prediction objects are code changes instead of modules, which has the following merits: first, since the prediction is performed at a fine granularity, the scope of the code inspection can be reduced, saving inspection effort; second, JIT-SDP is performed at the change level, so the defect prediction can be executed as soon as a developer submits the modified code. Therefore, defect predictions are performed in a more timely manner, and it is easier to find the right developer for buggy changes.

In recent years, research related to JIT-SDP has developed rapidly. Mockus and Weiss [20] designed change metrics to evaluate the probability of defect-inducing changes for initial modification requests (IMR) in a 5ESS network switch project. Yang et al. [21] first applied deep learning technology to JIT-SDP and proposed a method called Deeper. The empirical results show that the Deeper method can detect more buggy changes and has a better F1 indicator than the EALR method. Kamei et al. [10] conducted an experiment on 11 open-source projects for JIT-SDP using cross-project models. They found that cross-project models can perform well when the training data are carefully selected. Yang et al. [22] proposed a two-layer ensemble learning method (TLEL), which leverages decision trees and ensemble learning. Experimental results show that TLEL can significantly improve the PofB20 and F1 indicators compared to the three baselines. Chen et al. [9] proposed a novel supervised learning method, MULTI, which applies a multiobjective optimization algorithm to software defect prediction. Experimental results show that MULTI is superior to 43 state-of-the-art prediction models. Considering that a commit in software development may be partially defective, i.e., it may involve both defective and nondefective files, Pascarella et al. [23] proposed a fine-grained JIT-SDP model to predict buggy files in a commit based on ten open-source projects. Experimental results show that 43% of defective commits are mixed by buggy and clean resources, and their method can obtain 82% AUC-ROC to detect defective files in a commit.

2.2.2. Effort-Aware Software Defect Prediction. In recent years, researchers have pointed out that when using a defect prediction model, the cost-effectiveness of the model should be considered [24]. Due to the high cost of checking potentially defective program modules, modules with high defect densities should be prioritized when testing resources are limited. Effort-aware defect prediction aims at finding more buggy changes with limited testing resources, and it has been extensively studied in module-level (package, file, or function) defect prediction [25, 26]. Kamei et al. [7] applied effort-aware defect prediction to JIT-SDP. In their study, they used code churn (the total number of modified lines of code) as a measure of the effort and proposed the EALR model based on linear regression. Their study showed that 35% of defect-inducing changes can be detected by using only 20% of the effort.

Figure 1: The process of software defect prediction.

Recently, many supervised and unsupervised learning methods have been investigated in the context of effort-aware JIT-SDP. Yang et al. [8] compared simple unsupervised models with supervised models for effort-aware JIT-SDP and found that many unsupervised models outperform the state-of-the-art supervised models. Inspired by the study of Yang et al. [8], Liu et al. [27] proposed a novel unsupervised learning method, CCUM, which is based on code churn. Their study showed that CCUM performs better than the state-of-the-art prediction models at that time. As the results Yang et al. [8] reported are startling, Fu et al. [28] repeated their experiment. Their study indicates that supervised learning methods are better than unsupervised methods when the analysis is conducted on a project-by-project basis; however, it supports the general goal of Yang et al. [8]. Huang et al. [29] also performed a replication study based on the study of Yang et al. [8] and extended it in the literature [30]. They found three weaknesses of the LT model proposed by Yang et al. [8]: more context switching, more false alarms, and worse F1 than supervised methods. Therefore, they proposed a novel supervised method, CBS+, which combines the advantages of LT and EALR. Empirical results demonstrate that CBS+ can detect more buggy changes than EALR. More importantly, CBS+ can significantly reduce context switching compared to LT.

3. Software Defect Prediction Based on Local Models

Menzies et al. [11] first applied local models to defect prediction and effort estimation in 2011 and extended their work in 2013 [12]. The basic idea of local models is simple: first, cluster all training data into homogeneous regions; then, train a prediction model for each region. A test instance is predicted by one of these predictors [31].

To the best of our knowledge, there are only five studies of local models in the field of defect prediction. Menzies et al. [11, 12] first proposed the idea of local models. Their studies attempted to explore the best way to learn lessons for defect prediction and effort estimation. They adopted the WHERE algorithm to cluster the data into regions with similar properties. By comparing the lessons generated from all the data, from local projects, and from each cluster, their experimental results showed that the lessons learned from clusters have better prediction performance. Scanniello et al. [13] proposed a novel defect prediction method based on software clustering for class-level defect prediction. Their research demonstrated that, for a class to be predicted in a project system, the models built from the classes related to it outperformed the models built from the classes of the entire system. Bettenburg et al. [14] compared local and global models based on statistical learning techniques for software defect prediction. Experimental results demonstrated that local models had better fit and prediction performance. Herbold et al. [31] introduced local models into the context of cross-project defect prediction. They compared local models with global models and a transfer learning technique. Their findings showed that local models make only a minor difference compared with global models and the transfer learning model. Mezouar et al. [32] compared the performance of local and global models in the context of effort-aware defect prediction. Their study is based on 15 module-level defect data sets and uses k-medoids to build local models. Experimental results showed that local models perform worse than global models.

The processes of local models and global models, and the difference between them, are described in Figure 2. For global models, all training data are used to build prediction models by using statistical or machine learning methods, and the resulting model can be used to predict any test instance. In comparison, local models first build a cluster model and divide the training data into several clusters based on it. In the prediction phase, each test instance is assigned a cluster number based on the cluster model, and the instance is predicted by the model trained on the corresponding cluster instead of on all the training data.

In our experiments, we apply local models to the JIT-SDP problem. In previous studies, various clustering methods have been used in local models, such as WHERE [11, 12], MCLUST [14], k-means [14], hierarchical clustering [14], BorderFlow [13], EM clustering [31], and k-medoids [32]. According to the suggestion of Herbold et al. [31], one can use any of these clustering methods. Our experiment adopts k-medoids as the clustering method in local models. K-medoids is a widely used clustering method. Different from the k-means used in a previous study [14], the cluster center of k-medoids is an actual object, which gives it low sensitivity to outliers. Moreover, there are mature Python libraries supporting the implementation of k-medoids. In the stage of model building, we use logistic regression and EALR to build the classification model and the effort-aware prediction model in local models, which are used as baseline models in previous JIT-SDP studies [7-9]. The detailed process of local models for JIT-SDP is described in Algorithm 1.
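To make Algorithm 1 concrete, the following is a minimal Python sketch of the local-model workflow described in this section: the training changes are partitioned with k-medoids, one logistic regression classifier is trained per cluster, and each test change is routed to the classifier of its nearest medoid. It assumes the KMedoids implementation from the third-party scikit-learn-extra package; the array names (X_train, y_train, X_test) are illustrative and not taken from the original scripts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn_extra.cluster import KMedoids  # assumed third-party PAM/k-medoids implementation


def train_local_models(X_train, y_train, k=2):
    """Cluster the training set with k-medoids and fit one classifier per cluster."""
    clusterer = KMedoids(n_clusters=k, random_state=0).fit(X_train)
    classifiers = []
    for i in range(k):
        mask = clusterer.labels_ == i
        # Assumes each cluster still contains both buggy and clean changes
        # (the training set is class-balanced by undersampling; see Section 4.5).
        classifiers.append(LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask]))
    return clusterer, classifiers


def predict_local_models(clusterer, classifiers, X_test):
    """Route each test change to its nearest medoid and use that cluster's classifier."""
    cluster_ids = clusterer.predict(X_test)
    predictions = np.empty(len(X_test), dtype=int)
    for i, clf in enumerate(classifiers):
        mask = cluster_ids == i
        if mask.any():
            predictions[mask] = clf.predict(X_test[mask])
    return predictions
```

In this sketch, a global model corresponds to the special case k = 1, i.e., a single classifier trained on all training data.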

4. Experimental Setup

This article answers the following three research questions through experiments:

(i) RQ1: how do the classification performance and the effort-aware prediction performance of local models compare with those of global models in the cross-validation scenario?

(ii) RQ2: how do the classification performance and the effort-aware prediction performance of local models compare with those of global models in the cross-project-validation scenario?

(iii) RQ3: how do the classification performance and the effort-aware prediction performance of local models compare with those of global models in the timewise-cross-validation scenario?

This section introduces the experimental setup from five aspects: data sets, modeling methods, performance evaluation indicators, data analysis method, and data preprocessing. Our experiment is performed on hardware with an Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz. The operating system is Windows 10. The programming environment for scripts is Python 3.6.


4.1. Data Sets. The data sets used in the experiment are widely used for JIT-SDP and are shared by Kamei et al. [7]. The data sets contain six open-source projects from different application domains. For example, Bugzilla (BUG) is an open-source defect tracking system, Eclipse JDT (JDT) is a full-featured Java integrated development environment plugin for the Eclipse platform, Mozilla (MOZ) is a web browser, and PostgreSQL (POS) is a database system [9]. These data sets are extracted from the CVS repositories of the projects and the corresponding bug reports [7]. The basic information of the data sets is shown in Table 1.

To predict whether a change is defective or not, Kamei et al. [7] designed 14 metrics associated with software defects, which can be grouped into five dimensions (i.e., diffusion, size, purpose, history, and experience). The specific description of these metrics is shown in Table 2. A more detailed introduction to the metrics can be found in the literature [7].

4.2. Modeling Methods. Three modeling methods are used in our experiment. First, in the process of using local models, the training set is divided into multiple clusters; this step is performed by the k-medoids model. Second, the classification performance and the effort-aware prediction performance of local and global models are investigated in our experiment. Therefore, we use logistic regression and EALR to build classification models and effort-aware prediction models, respectively, which are used as baseline models in previous JIT-SDP studies [7-9]. Details of the three models are given below.

4.2.1. K-Medoids. The first step in local models is to use a clustering algorithm to divide all training data into several homogeneous regions. In this process, the experiment uses k-medoids as the clustering algorithm, which is widely used in different clustering tasks. Different from k-means, the coordinate center of each cluster in k-medoids is a representative object rather than a centroid; hence, k-medoids is less sensitive to outliers. The objective function of k-medoids can be defined by Equation (1), where p is a sample in the cluster C_i and O_i is a medoid of the cluster C_i.

Figure 2: The process of (a) global and (b) local models.

Input: training set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}; test set T = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}; the number of clusters k
Output: prediction results R

(1) begin
(2)   R <- [ ]
(3)   // Divide the training set into k clusters
(4)   clusters, cluster_centers <- k-medoids(D, k)
(5)   // Train a classifier for each cluster
(6)   classifiers <- modeling_algorithm(clusters)
(7)   for instance t in T do
(8)     // Assign a cluster number i to the instance t
(9)     i <- calculate_min_distance(t, cluster_centers)
(10)    r <- classifiers[i].predict(t)   // perform prediction
(11)    R.append(r)
(12)  end
(13) end

Algorithm 1: The application process of local models for JIT-SDP.


E = \sum_{i=1}^{k} \sum_{p \in C_i} \left| p - O_i \right|^2 .   (1)

Among different k-medoids algorithms, the partitioning around medoids (PAM) algorithm is used in the experiment; it was proposed by Kaufman and Rousseeuw [33] and is one of the most popular k-medoids algorithms.
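As a small illustration of Equation (1), the helper below evaluates the k-medoids objective E for a given assignment of points to medoids; the arrays are hypothetical numpy inputs, not data from the study.

```python
import numpy as np


def kmedoids_objective(data, medoids, labels):
    """E = sum over clusters i of the squared distances |p - O_i|^2 of points p in C_i."""
    return sum(
        np.sum((data[labels == i] - medoid) ** 2)
        for i, medoid in enumerate(medoids)
    )
```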

4.2.2. Logistic Regression. Similar to prior studies, logistic regression is used to build classification models [7]. Logistic regression is a widely used regression model that can be applied to a variety of classification and regression tasks. Assuming that there are n metrics for a change c, logistic regression can be expressed by Equation (2), where w_0, w_1, ..., w_n denote the coefficients of the logistic regression and the independent variables m_1, ..., m_n represent the metrics of the change, which are introduced in Table 2. The output of the model, y(c), indicates the probability that the change is buggy:

y(c) = \frac{1}{1 + e^{-(w_0 + w_1 m_1 + \cdots + w_n m_n)}} .   (2)

In our experiment, logistic regression is used to solve a classification problem. However, the output of the model is a continuous value in the range of 0 to 1. Therefore, for a change c, if the value of y(c) is greater than 0.5, then c is classified as buggy; otherwise, it is classified as clean. The process can be defined as follows:

Y(c) = \begin{cases} 1, & \text{if } y(c) > 0.5, \\ 0, & \text{if } y(c) \le 0.5. \end{cases}   (3)
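A minimal sketch of Equations (2) and (3) with scikit-learn's LogisticRegression is shown below; thresholding the predicted probability at 0.5 reproduces the classification rule above. The toy arrays only stand in for the change metrics and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 12))     # toy stand-in for the change metrics
y_train = rng.integers(0, 2, 200)        # toy labels: buggy (1) / clean (0)
X_test = rng.normal(size=(10, 12))

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
buggy_prob = model.predict_proba(X_test)[:, 1]    # y(c), Equation (2)
predicted = (buggy_prob > 0.5).astype(int)        # Y(c), Equation (3)
```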

4.2.3. Effort-Aware Linear Regression. Arisholm et al. [24] pointed out that the use of defect prediction models should consider not only the classification performance of the model but also the cost-effectiveness of code inspection. Since then, a large number of studies have focused on effort-aware defect prediction [8, 9, 11, 12]. Effort-aware linear regression (EALR) was first proposed by Kamei et al. [7] for solving effort-aware JIT-SDP and is based on linear regression. Different from the above logistic regression, the dependent variable of EALR is Y(c)/Effort(c), where Y(c) denotes the defect proneness of a change and Effort(c) denotes the effort required to inspect the change c. According to the suggestion of Kamei et al. [7], the effort for a change can be obtained by calculating the number of modified lines of code.
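Under the definitions above, EALR can be sketched as an ordinary linear regression whose dependent variable is the defect density Y(c)/Effort(c), with the effort approximated by the modified lines of code (LA + LD); changes are then ranked by the predicted density. This is an illustrative reconstruction with toy data, not the original implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 12))               # change metrics
y_train = rng.integers(0, 2, 200)                  # Y(c): buggy (1) or clean (0)
effort = rng.integers(1, 500, 200).astype(float)   # Effort(c): modified lines of code (LA + LD)

# EALR: regress the defect density Y(c) / Effort(c) on the change metrics.
ealr = LinearRegression().fit(X_train, y_train / effort)

# At prediction time, changes are inspected in descending order of predicted density.
X_test = rng.normal(size=(10, 12))
inspection_order = np.argsort(-ealr.predict(X_test))
```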

4.3. Performance Indicators. Four performance indicators are used to evaluate and compare the performance of local and global models for JIT-SDP (i.e., F1, AUC, ACC, and Popt). Specifically, we use F1 and AUC to evaluate the classification performance of the prediction models and use ACC and Popt to evaluate their effort-aware prediction performance. The details are as follows.

Table 1: The basic information of the data sets.

Project   Period                  Defective changes   Total changes   Defect rate (%)
BUG       1998/08/26-2006/12/16   1696                4620            36.71
COL       2002/11/25-2006/07/27   1361                4455            30.55
JDT       2001/05/02-2007/12/31   5089                35386           14.38
MOZ       2000/01/01-2006/12/17   5149                98275           5.24
PLA       2001/05/02-2007/12/31   9452                64250           14.71
POS       1996/07/09-2010/05/15   5119                20431           25.06

Table 2: The description of metrics.

Dimension    Metric    Description
Diffusion    NS        Number of modified subsystems
             ND        Number of modified directories
             NF        Number of modified files
             Entropy   Distribution of modified code across each file
Size         LA        Lines of code added
             LD        Lines of code deleted
             LT        Lines of code in a file before the change
Purpose      FIX       Whether or not the change is a defect fix
History      NDEV      Number of developers that changed the files
             AGE       Average time interval between the last and the current change
             NUC       Number of unique last changes to the files
Experience   EXP       Developer experience
             REXP      Recent developer experience
             SEXP      Developer experience on a subsystem


The experiment first uses F1 and AUC to evaluate the classification performance of the models. JIT-SDP aims to predict the defect proneness of a change, which is a classification task. The predicted results can be expressed as the confusion matrix shown in Table 3, including true positive (TP), false negative (FN), false positive (FP), and true negative (TN).

For classification problems, precision (P) and recall (R) are often used to evaluate the prediction performance of the model:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.   (4)

However, the two indicators are often in tension with each other. In order to evaluate a prediction model comprehensively, we use F1, the harmonic mean of precision and recall, to evaluate the classification performance of the model:

F1 = \frac{2 \times P \times R}{P + R}.   (5)

The second indicator is AUC. Class imbalance problems are often encountered in software defect prediction, and our data sets are also imbalanced (the number of buggy changes is much smaller than the number of clean changes). In class-imbalanced problems, threshold-based evaluation indicators (such as accuracy, recall, and F1) are sensitive to the threshold [4]. Therefore, a threshold-independent evaluation indicator, AUC, is used to evaluate the classification performance of a model. AUC (area under curve) is the area under the ROC (receiver operating characteristic) curve. The drawing process of the ROC curve is as follows: first, sort the instances in descending order of the probability of being predicted positive, and then treat them as positive examples in turn. In each iteration, the true positive rate (TPR) and the false positive rate (FPR) are calculated as the ordinate and the abscissa, respectively, to obtain the ROC curve:

TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP}.   (6)
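For reference, the classification indicators in Equations (4)-(6) can be computed directly with scikit-learn; the labels and scores below are toy values, not results from the study.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                     # actual labels: buggy (1) / clean (0)
y_score = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.1]   # predicted probability of being buggy
y_pred = [1 if s > 0.5 else 0 for s in y_score]

print(precision_score(y_true, y_pred))   # P = TP / (TP + FP)
print(recall_score(y_true, y_pred))      # R = TP / (TP + FN)
print(f1_score(y_true, y_pred))          # F1, Equation (5)
print(roc_auc_score(y_true, y_score))    # AUC, threshold independent
```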

The experiment uses ACC and Popt to evaluate the effort-aware prediction performance of local and global models. ACC and Popt are commonly used performance indicators and have been widely used in previous research [7-9, 25].

The calculation of ACC and Popt is illustrated in Figure 3, where the x-axis and y-axis represent the cumulative percentage of code churn (i.e., the number of lines of code added and deleted) and of defect-inducing changes, respectively. To compute the values of ACC and Popt, three curves are considered: the optimal model curve, the prediction model curve, and the worst model curve [8]. In the optimal model curve, the changes are sorted in descending order of their actual defect densities. In the worst model curve, the changes are sorted in ascending order of their actual defect densities. In the prediction model curve, the changes are sorted in descending order of their predicted defect densities. Then, the areas between each of the three curves and the x-axis, denoted area(optimal), area(m), and area(worst), respectively, are calculated. ACC indicates the recall of defect-inducing changes when 20% of the effort is invested based on the prediction model curve. According to [7], Popt can be defined by the following equation:

P_{opt} = 1 - \frac{area(optimal) - area(m)}{area(optimal) - area(worst)}.   (7)
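The effort-aware indicators can be computed from the predicted defect densities as sketched below: ACC is the recall of buggy changes within the first 20% of the cumulative code churn, and Popt follows Equation (7) with the areas obtained from the optimal, prediction, and worst orderings. This is a minimal illustrative implementation of the definitions above, not the study's evaluation script.

```python
import numpy as np


def curve_area(order, churn, buggy):
    """Area under the cumulative-churn (x) vs. cumulative-defects (y) curve for one ordering."""
    x = np.concatenate(([0.0], np.cumsum(churn[order]) / churn.sum()))
    y = np.concatenate(([0.0], np.cumsum(buggy[order]) / buggy.sum()))
    return np.trapz(y, x)


def acc_at_20(pred_density, churn, buggy, effort_fraction=0.2):
    """Recall of defect-inducing changes when 20% of the total code churn is inspected."""
    order = np.argsort(-pred_density)
    within_budget = np.cumsum(churn[order]) / churn.sum() <= effort_fraction
    return buggy[order][within_budget].sum() / buggy.sum()


def popt(pred_density, churn, buggy):
    """Popt = 1 - (area(optimal) - area(m)) / (area(optimal) - area(worst)), Equation (7)."""
    actual_density = buggy / churn
    area_optimal = curve_area(np.argsort(-actual_density), churn, buggy)
    area_worst = curve_area(np.argsort(actual_density), churn, buggy)
    area_m = curve_area(np.argsort(-pred_density), churn, buggy)
    return 1.0 - (area_optimal - area_m) / (area_optimal - area_worst)
```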

4.4. Data Analysis Method. The experiment considers three scenarios to evaluate the performance of the prediction models: 10 times 10-fold cross-validation, cross-project-validation, and timewise-cross-validation. The details are as follows.

4.4.1. 10 Times 10-Fold Cross-Validation. Cross-validation is performed within the same project, and the basic process of 10 times 10-fold cross-validation is as follows: first, the data set of a project is divided into 10 parts of equal size; subsequently, nine parts are used to train a prediction model and the remaining part is used as the test set. This process is repeated 10 times in order to reduce random variation. Therefore, 10 times 10-fold cross-validation produces 10 x 10 = 100 results.

Table 3: Confusion matrix.

                    Prediction result
Actual value    Positive    Negative
Positive        TP          FN
Negative        FP          TN

Figure 3: The performance indicators of ACC and Popt (x-axis: code churn (%); y-axis: defect-inducing changes (%); curves: optimal model, prediction model, and worst model).


4.4.2. Cross-Project-Validation. Cross-project defect prediction has received widespread attention in recent years [34]. The basic idea of cross-project defect prediction is to train a defect prediction model with the defect data of one project and to use it to predict defects of another project [8]. Given data sets containing n projects, n x (n - 1) runs can be generated for a prediction model. The data sets used in our experiment contain six open-source projects, so 6 x (6 - 1) = 30 prediction results can be generated for each prediction model.
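The n x (n - 1) train/test combinations are simply the ordered pairs of distinct projects, as the short sketch below illustrates.

```python
from itertools import permutations

projects = ["BUG", "COL", "JDT", "MOZ", "PLA", "POS"]
pairs = list(permutations(projects, 2))   # (train project, test project)
print(len(pairs))                         # 6 * (6 - 1) = 30 runs
```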

4.4.3. Timewise-Cross-Validation. Timewise-cross-validation is also performed within the same project, but it considers the chronological order of the changes in the data sets [35]. Defect data sets are collected in chronological order during the actual software development process; therefore, changes committed early in the data sets can be used to predict the defect proneness of changes committed later. In the timewise-cross-validation scenario, all changes in the data sets are arranged in chronological order, and the changes in the same month are grouped into the same group. Suppose the data set is divided into n groups: group i and group i + 1 are used to train a prediction model, and group i + 4 and group i + 5 are used as the test set. Therefore, assuming that a project contains n months of data, n - 5 runs can be generated for a prediction model.
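The timewise splits can be generated as sketched below: changes are grouped by calendar month in chronological order, months i and i + 1 form the training set, and months i + 4 and i + 5 form the test set. The pandas column name commit_date is an assumption made for illustration.

```python
import pandas as pd


def timewise_splits(df, date_column="commit_date"):
    """Yield (train_index, test_index) pairs for timewise-cross-validation."""
    months = df[date_column].dt.to_period("M")
    ordered = sorted(months.unique())
    for i in range(len(ordered) - 5):                    # a project with n months yields n - 5 runs
        train_mask = months.isin(ordered[i:i + 2])       # months i and i + 1
        test_mask = months.isin(ordered[i + 4:i + 6])    # months i + 4 and i + 5
        yield df.index[train_mask], df.index[test_mask]
```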

The experiment uses the Wilcoxon signed-rank test [36] and Cliff's δ [37] to test the significance of differences in classification performance and prediction performance between local and global models. The Wilcoxon signed-rank test is a popular nonparametric statistical hypothesis test. We use p values to examine whether the difference in performance between local and global models is statistically significant at a significance level of 0.05. The experiment also uses Cliff's δ to determine the magnitude of the difference in prediction performance between local and global models. Usually, the magnitude of the difference is divided into four levels, i.e., trivial (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), moderate (0.33 ≤ |δ| < 0.474), and large (|δ| ≥ 0.474) [38]. In summary, when the p value is less than 0.05 and Cliff's δ is greater than or equal to 0.147, local models and global models have a significant difference in prediction performance.
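The statistical procedure can be sketched with SciPy's paired Wilcoxon signed-rank test together with a direct computation of Cliff's δ and the magnitude thresholds given above; the helper below is an illustrative implementation, not the exact script used in the study.

```python
import numpy as np
from scipy.stats import wilcoxon


def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs of observations."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diffs = a[:, None] - b[None, :]
    return (np.sum(diffs > 0) - np.sum(diffs < 0)) / (len(a) * len(b))


def significantly_different(local_scores, global_scores, alpha=0.05):
    """Significant if p < 0.05 (Wilcoxon signed-rank test) and |delta| >= 0.147 (non-trivial)."""
    p_value = wilcoxon(local_scores, global_scores).pvalue
    delta = cliffs_delta(local_scores, global_scores)
    return p_value < alpha and abs(delta) >= 0.147
```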

4.5. Data Preprocessing. According to the suggestion of Kamei et al. [7], it is necessary to preprocess the data sets. This includes the following three steps (a small sketch follows the list):

(1) Removing Highly Correlated Metrics. The ND and REXP metrics are excluded because NF and ND, as well as REXP and EXP, are highly correlated. LA and LD are normalized by dividing by LT, since LA and LD are highly correlated. LT and NUC are normalized by dividing by NF, since LT and NUC are highly correlated with NF.

(2) Logarithmic Transformation. Since most metrics are highly skewed, a logarithmic transformation is applied to each metric (except for FIX).

(3) Dealing with Class Imbalance. The data sets used in the experiment suffer from class imbalance, i.e., the number of defect-inducing changes is far smaller than the number of clean changes. Therefore, we perform random undersampling on the training set: clean changes are randomly deleted until their number equals the number of buggy changes. Note that undersampling is not performed on the test set.
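The three preprocessing steps can be sketched with pandas and numpy as follows. The column names mirror the metrics in Table 2, the label column name bug and the use of log1p are assumptions made for illustration, and the undersampling helper implements the random undersampling described in step (3).

```python
import numpy as np
import pandas as pd


def preprocess(df):
    """Steps (1) and (2): drop correlated metrics, normalize, and log-transform."""
    df = df.drop(columns=["ND", "REXP"])                       # correlated with NF and EXP
    df["LA"], df["LD"] = df["LA"] / df["LT"], df["LD"] / df["LT"]
    df["LT"], df["NUC"] = df["LT"] / df["NF"], df["NUC"] / df["NF"]
    skewed = [c for c in df.columns if c not in ("FIX", "bug")]
    df[skewed] = np.log1p(df[skewed])                          # log transformation (except FIX)
    return df


def undersample(train_df, label="bug", seed=0):
    """Step (3): randomly drop clean changes so both classes have equal size (training set only)."""
    buggy = train_df[train_df[label] == 1]
    clean = train_df[train_df[label] == 0].sample(n=len(buggy), random_state=seed)
    return pd.concat([buggy, clean]).sample(frac=1, random_state=seed)
```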

5. Experimental Results and Analysis

5.1. Analysis for RQ1. To investigate and compare the performance of local and global models in the 10 times 10-fold cross-validation scenario, we conduct a large-scale empirical study based on data sets from six open-source projects. The experimental results are shown in Table 4. The first column lists the evaluation indicators for the prediction models, including AUC, F1, ACC, and Popt; the second column lists the project names of the six data sets; the third column shows the performance values of the global models for the four evaluation indicators; and the fourth to the twelfth columns show the performance values of the local models at different k values. Considering that the number of clusters k may affect the performance of local models, the value of the parameter k is tuned from 2 to 10. Since 100 prediction results are generated in the 10 times 10-fold cross-validation scenario, the values in the table are the medians of the experimental results. We process the data in Table 4 as follows:

(i) We performed the significance test defined in Section 4.4 on the performance of the local and global models. The performance values of local models that are significantly different from those of the global models are shown in bold.

(ii) The performance values of local models that are significantly better than those of the global models are given in italics.

(iii) The W/D/L values (i.e., the number of projects for which local models obtain a better, equal, or worse prediction performance than global models) are calculated under different k values and different evaluation indicators.

First, we analyze and compare the classification performance of local and global models. As shown in Table 4, the performance of local models is worse than that of global models for the AUC indicator for all parameters k and all projects. Moreover, in most cases, the performance of local models is worse than that of global models for the F1 indicator (except for the four values in italics). Therefore, it can be concluded that global models perform better than local models in classification performance.

Second, we investigate and compare the effort-aware prediction performance of local and global models. As can be seen from Table 4, local models are valid for effort-aware JIT-SDP. However, the effect of local models differs across projects and across values of the parameter k.


First, local models have different effects on different data sets. To describe the effectiveness of local models on different data sets, we count, for each data set, the number of times that local models perform significantly better than global models over the 9 values of k and the two evaluation indicators (i.e., ACC and Popt). The statistical results are shown in Table 5, where the second row is the number of wins for each project. It can be seen that local models perform better on the projects MOZ and PLA and poorly on the projects BUG and COL. Based on the number of times that local models outperform global models on the six projects, we rank the projects, as shown in the third row. In addition, we list the sizes of the data sets and their ranks, as shown in the fourth row. It can be seen that the performance of local models on a data set is strongly correlated with its size: generally, the larger the data set, the more significant the performance improvement of local models.

In view of this, we consider that if the size of the data set is too small, then, when local models are used, dividing the data into multiple clusters further reduces the number of training samples in each cluster, which leads to a degradation of model performance. When the data set is large, the performance of the model is not limited by the size of the data set, so the advantages of local models are better reflected.

In addition, the number of clusters k also has an impact on the performance of local models. As can be seen from the table, according to the W/D/L analysis, local models obtain the best ACC indicator when k is set to 2 and the best Popt indicator when k is set to 2 or 6. Therefore, we recommend setting the parameter k to 2 when using local models to solve the effort-aware JIT-SDP problem in the cross-validation scenario.

This section analyzes and compares the classification performance and effort-aware prediction performance of local and global models in the 10 times 10-fold cross-validation scenario. The empirical results show that, compared with global models, local models perform poorly in classification performance but better in effort-aware prediction performance. However, the performance of local models is affected by the parameter k and by the size of the data sets. The experimental results show that local models obtain better effort-aware prediction performance on projects with larger data sets and when k is set to 2.

Table 4: Local vs. global models in the 10 times 10-fold cross-validation scenario.

Indicator   Project   Global   k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10
AUC         BUG       0.730    0.704   0.687   0.684   0.698   0.688   0.686   0.686   0.693   0.685
            COL       0.742    0.719   0.708   0.701   0.703   0.657   0.666   0.670   0.668   0.660
            JDT       0.725    0.719   0.677   0.670   0.675   0.697   0.671   0.674   0.671   0.654
            MOZ       0.769    0.540   0.741   0.715   0.703   0.730   0.730   0.725   0.731   0.722
            PLA       0.742    0.733   0.702   0.693   0.654   0.677   0.640   0.638   0.588   0.575
            POS       0.777    0.764   0.758   0.760   0.758   0.753   0.755   0.705   0.757   0.757
            W/D/L     —        0/0/6   0/0/6   0/0/6   0/0/6   0/0/6   0/0/6   0/0/6   0/0/6   0/0/6
F1          BUG       0.663    0.644   0.628   0.616   0.603   0.598   0.608   0.610   0.601   0.592
            COL       0.696    0.672   0.649   0.669   0.678   0.634   0.537   0.596   0.604   0.564
            JDT       0.723    0.706   0.764   0.757   0.734   0.695   0.615   0.643   0.732   0.722
            MOZ       0.825    0.374   0.853   0.758   0.785   0.741   0.718   0.737   0.731   0.721
            PLA       0.704    0.678   0.675   0.527   0.477   0.557   0.743   0.771   0.627   0.604
            POS       0.762    0.699   0.747   0.759   0.746   0.698   0.697   0.339   0.699   0.700
            W/D/L     —        0/0/6   2/0/4   1/1/4   0/1/5   0/0/6   0/1/5   1/0/5   0/1/5   0/1/5
ACC         BUG       0.421    0.452   0.471   0.458   0.424   0.451   0.450   0.463   0.446   0.462
            COL       0.553    0.458   0.375   0.253   0.443   0.068   0.195   0.459   0.403   0.421
            JDT       0.354    0.532   0.374   0.431   0.291   0.481   0.439   0.314   0.287   0.334
            MOZ       0.189    0.386   0.328   0.355   0.364   0.301   0.270   0.312   0.312   0.312
            PLA       0.368    0.472   0.509   0.542   0.497   0.509   0.456   0.442   0.456   0.495
            POS       0.336    0.490   0.397   0.379   0.457   0.442   0.406   0.341   0.400   0.383
            W/D/L     —        5/0/1   4/1/1   4/1/1   3/1/2   4/1/1   4/1/1   2/2/2   3/1/2   4/1/1
Popt        BUG       0.742    0.747   0.744   0.726   0.738   0.749   0.750   0.760   0.750   0.766
            COL       0.726    0.661   0.624   0.401   0.731   0.295   0.421   0.644   0.628   0.681
            JDT       0.660    0.779   0.625   0.698   0.568   0.725   0.708   0.603   0.599   0.634
            MOZ       0.555    0.679   0.626   0.645   0.659   0.588   0.589   0.643   0.634   0.630
            PLA       0.664    0.723   0.727   0.784   0.746   0.758   0.718   0.673   0.716   0.747
            POS       0.673    0.729   0.680   0.651   0.709   0.739   0.683   0.596   0.669   0.644
            W/D/L     —        4/1/1   2/2/2   2/1/3   3/2/1   4/1/1   2/3/1   1/2/3   1/3/2   2/1/3

5.2. Analysis for RQ2. This section discusses the performance differences between local and global models in the cross-project-validation scenario for JIT-SDP. The results of the experiment are shown in Table 6, in which the first column indicates the four evaluation indicators, including AUC, F1, ACC, and Popt; the second column shows the performance values of the global models; and the third to the eleventh columns show the prediction performance of the local models under different parameters k. In our experiment, six data sets are used in the cross-project-validation scenario; therefore, each prediction model produces 30 prediction results. The values in Table 6 are the medians of these prediction results.

For the values in Table 6, we do the following processing:

(i) Using the significance test method, the performance values of local models that are significantly different from those of the global models are shown in bold.

(ii) The performance values of local models that are significantly better than those of the global models are given in italics.

(iii) The W/D/L column summarizes the number of times (over the different parameters k) that local models perform better, equally, and worse than global models.

First, we compare the difference in classification performance between local and global models. As can be seen from Table 6, for the AUC and F1 indicators, the classification performance of local models under the different parameters k is worse than or equal to that of global models. Therefore, we can conclude that, in the cross-project-validation scenario, local models perform worse than global models in classification performance.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 6 that the performance of local models is better than or equal to that of global models for the ACC and Popt indicators when the parameter k ranges from 2 to 10. Therefore, local models are valid for effort-aware JIT-SDP in the cross-project-validation scenario. In particular, when the number of clusters k is set to 2, local models obtain better ACC and Popt values than global models; when k is set to 5, local models obtain better ACC values. Therefore, it is appropriate to set k to 2, which can find 49.3% of defect-inducing changes when using 20% of the effort, increasing the ACC indicator by 57.0% and the Popt indicator by 16.4%.

In the cross-project-validation scenario, local models perform worse in classification performance but better in effort-aware prediction performance than global models. However, the setting of k in local models has an impact on their effort-aware prediction performance. The empirical results show that when k is set to 2, local models obtain the optimal effort-aware prediction performance.

5.3. Analysis for RQ3. This section compares the prediction performance of local and global models in the timewise-cross-validation scenario. Since, in this scenario, only two months of data in each data set are used to train the prediction models, the size of the training sample in each cluster is small. Through experiments, we find that if the parameter k of local models is set too large, the number of training samples in some clusters may be too small or even equal to 0. Therefore, the experiment sets the parameter k of local models to 2. The experimental results are shown in Table 7. Suppose a project contains n months of data; then the prediction models produce n - 5 prediction results in the timewise-cross-validation scenario. The third and fourth columns in Table 7 report the median and standard deviation of the prediction results for global and local models. Based on the significance test method, the performance values of local models that differ significantly from those of global models are shown in bold. The row W/D/L summarizes the number of projects for which local models obtain a better, equal, or worse performance than global models for each indicator.

First, we compare the classification performance of local and global models. It can be seen from Table 7 that local models perform worse than global models on the AUC and F1 indicators for all 6 data sets. Therefore, it can be concluded that local models perform poorly in classification performance compared to global models in the timewise-cross-validation scenario.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 7 that, for the projects BUG and COL, local models have worse ACC and Popt indicators than global models, while for the other four projects (JDT, MOZ, PLA, and POS), local models and global models show no significant difference in the ACC and Popt indicators. Therefore, local models perform worse in effort-aware prediction performance than global models in the timewise-cross-validation scenario. This conclusion is inconsistent with the conclusions in Sections 5.1 and 5.2. We consider that the main reason lies in the size of the training sample: in the timewise-cross-validation scenario, only two months of data are used to train the prediction models (there are only 452 changes per month on average). When local models are used, the training set is further divided into several clusters, so the number of training samples in each cluster decreases further, which results in a decrease in the effort-aware prediction performance of local models.

Table 5: The number of wins of local models at different k values.

Project                    BUG        COL        JDT         MOZ         PLA         POS
Count                      4          0          7           18          14          11
Rank                       5          6          4           1           2           3
Number of changes (rank)   4620 (5)   4455 (6)   35386 (3)   98275 (1)   64250 (2)   20431 (4)

In this section, we compare the classification performance and effort-aware prediction performance of local and global models in the timewise-cross-validation scenario. The empirical results show that local models perform worse than global models in both classification performance and effort-aware prediction performance. Therefore, we recommend using global models to solve the JIT-SDP problem in the timewise-cross-validation scenario.

6. Threats to Validity

6.1. External Validity. Although our experiment uses public data sets that have been extensively used in previous studies, we cannot guarantee that the findings of the experiment apply to all other change-level defect data sets. Therefore, the defect data sets of more projects should be mined in order to verify the generalizability of the experimental results.

6.2. Construct Validity. We use F1, AUC, ACC, and Popt to evaluate the prediction performance of local and global models. Because defect data sets are often imbalanced, AUC is more appropriate as a threshold-independent evaluation indicator to evaluate the classification performance of models [4]. In addition, we use F1, a threshold-based evaluation indicator, as a supplement to AUC. Besides, similar to prior studies [7, 9], the experiment uses ACC and Popt to evaluate the effort-aware prediction performance of models. However, we cannot guarantee that the experimental results remain valid for all other evaluation indicators, such as precision, recall, Fβ, and PofB20.

6.3. Internal Validity. Internal validity involves errors in the experimental code. In order to reduce this risk, we carefully examined the code of the experiment and referred to the code published in previous studies [7, 8, 29], so as to improve the reliability of the experimental code.

7. Conclusions and Future Work

In this article, we apply local models to JIT-SDP for the first time. Our study aims to compare the prediction performance of local and global models under cross-validation, cross-project-validation, and timewise-cross-validation for JIT-SDP. In order to compare the classification performance and effort-aware prediction performance of local and global models, we first build local models based on the k-medoids method. Afterwards, logistic regression and EALR are used to build classification models and effort-aware prediction models for local and global models. The empirical results show that local models perform worse than global models in classification performance in all three evaluation scenarios. Moreover, local models have worse effort-aware prediction performance in the timewise-cross-validation scenario. However, local models are significantly better than global models in effort-aware prediction performance in the cross-validation and cross-project-validation scenarios. In particular, the optimal effort-aware prediction performance can be obtained when the parameter k of local models is set to 2. Therefore, local models are still promising in the context of effort-aware JIT-SDP.

In the future, we first plan to consider more commercial projects to further verify the validity of the experimental conclusions. Second, logistic regression and EALR are used in the local and global models; considering that the choice of modeling methods affects the prediction performance of local and global models, we hope to add more baselines to local and global models to verify the generalizability of the experimental conclusions.

Table 6: Local vs. global models in the cross-project-validation scenario.

Indicator   Global   k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10    W/D/L
AUC         0.706    0.683   0.681   0.674   0.673   0.673   0.668   0.659   0.680   0.674   0/0/9
F1          0.678    0.629   0.703   0.696   0.677   0.613   0.620   0.554   0.629   0.637   0/4/5
ACC         0.314    0.493   0.396   0.348   0.408   0.393   0.326   0.34    0.342   0.345   2/7/0
Popt        0.656    0.764   0.691   0.672   0.719   0.687   0.619   0.632   0.667   0.624   1/8/0

Table 7: Local vs. global models in the timewise-cross-validation scenario.

Indicator   Project   Global          Local (k=2)
AUC         BUG       0.672 ± 0.125   0.545 ± 0.124
            COL       0.717 ± 0.075   0.636 ± 0.135
            JDT       0.708 ± 0.043   0.645 ± 0.073
            MOZ       0.749 ± 0.034   0.648 ± 0.128
            PLA       0.709 ± 0.054   0.642 ± 0.076
            POS       0.743 ± 0.072   0.702 ± 0.095
            W/D/L     —               0/0/6
F1          BUG       0.587 ± 0.108   0.515 ± 0.128
            COL       0.651 ± 0.066   0.599 ± 0.125
            JDT       0.722 ± 0.048   0.623 ± 0.161
            MOZ       0.804 ± 0.050   0.696 ± 0.190
            PLA       0.702 ± 0.053   0.618 ± 0.162
            POS       0.714 ± 0.065   0.657 ± 0.128
            W/D/L     —               0/0/6
ACC         BUG       0.357 ± 0.212   0.242 ± 0.210
            COL       0.426 ± 0.168   0.321 ± 0.186
            JDT       0.356 ± 0.148   0.329 ± 0.189
            MOZ       0.218 ± 0.132   0.242 ± 0.125
            PLA       0.358 ± 0.181   0.371 ± 0.196
            POS       0.345 ± 0.159   0.306 ± 0.205
            W/D/L     —               0/4/2
Popt        BUG       0.622 ± 0.194   0.458 ± 0.189
            COL       0.669 ± 0.150   0.553 ± 0.204
            JDT       0.654 ± 0.101   0.624 ± 0.142
            MOZ       0.537 ± 0.110   0.537 ± 0.122
            PLA       0.631 ± 0.115   0.645 ± 0.151
            POS       0.615 ± 0.131   0.583 ± 0.167
            W/D/L     —               0/4/2

Data Availability

The experiment uses public data sets shared by Kamei et al. [7], and they have already published the download address of the data sets in their paper. In addition, in order to support the repeatability of our experiment and promote future research, all the experimental code in this article can be downloaded at https://github.com/yangxingguang/LocalJIT.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the NSF of China under Grant nos. 61772200 and 61702334, the Shanghai Pujiang Talent Program under Grant no. 17PJ1401900, the Shanghai Municipal Natural Science Foundation under Grant nos. 17ZR1406900 and 17ZR1429700, the Educational Research Fund of ECUST under Grant no. ZH1726108, and the Collaborative Innovation Foundation of Shanghai Institute of Technology under Grant no. XTCX2016-20.

References

[1] M. Newman, Software Errors Cost U.S. Economy $59.5 Billion Annually: NIST Assesses Technical Needs of Industry to Improve Software-Testing, 2002, http://www.abeacha.com/NIST_press_release_bugs_cost.htm.

[2] Z. Li, X.-Y. Jing, and X. Zhu, "Progress on approaches to software defect prediction," IET Software, vol. 12, no. 3, pp. 161-175, 2018.

[3] X. Chen, Q. Gu, W. Liu, S. Liu, and C. Ni, "Survey of static software defect prediction," Ruan Jian Xue Bao/Journal of Software, vol. 27, no. 1, 2016, in Chinese.

[4] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An empirical comparison of model validation techniques for defect prediction models," IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1-18, 2017.

[5] A. G. Koru, D. Zhang, K. El Emam, and H. Liu, "An investigation into the functional form of the size-defect relationship for software modules," IEEE Transactions on Software Engineering, vol. 35, no. 2, pp. 293-304, 2009.

[6] S. Kim, E. J. Whitehead, and Y. Zhang, "Classifying software changes: clean or buggy?," IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181-196, 2008.

[7] Y. Kamei, E. Shihab, B. Adams et al., "A large-scale empirical study of just-in-time quality assurance," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757-773, 2013.

[8] Y. Yang, Y. Zhou, J. Liu et al., "Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models," in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 157-168, Hong Kong, China, November 2016.

[9] X. Chen, Y. Zhao, Q. Wang, and Z. Yuan, "MULTI: multi-objective effort-aware just-in-time software defect prediction," Information and Software Technology, vol. 93, pp. 1-13, 2018.

[10] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi, and A. E. Hassan, "Studying just-in-time defect prediction using cross-project models," Empirical Software Engineering, vol. 21, no. 5, pp. 2072-2106, 2016.

[11] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. R. Cok, "Local vs. global models for effort estimation and defect prediction," in Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 343-351, Lawrence, KS, USA, November 2011.

[12] T. Menzies, A. Butcher, D. Cok et al., "Local versus global lessons for defect prediction and effort estimation," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 822-834, 2013.

[13] G. Scanniello, C. Gravino, A. Marcus, and T. Menzies, "Class level fault prediction using software clustering," in Proceedings of the 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 640-645, Silicon Valley, CA, USA, November 2013.

[14] N. Bettenburg, M. Nagappan, and A. E. Hassan, "Towards improving statistical modeling of software engineering data: think locally, act globally," Empirical Software Engineering, vol. 20, no. 2, pp. 294-335, 2015.

[15] Q. Wang, S. Wu, and M. Li, "Software defect prediction," Journal of Software, vol. 19, no. 7, pp. 1565-1580, 2008, in Chinese.

[16] F. Akiyama, "An example of software system debugging," Information Processing, vol. 71, pp. 353-359, 1971.

[17] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. SE-2, no. 4, pp. 308-320, 1976.

[18] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, 1994.

[19] N. Nagappan and T. Ball, "Use of relative code churn measures to predict system defect density," in Proceedings of the 27th International Conference on Software Engineering (ICSE), pp. 284-292, Saint Louis, MO, USA, May 2005.

[20] A. Mockus and D. M. Weiss, "Predicting risk of software changes," Bell Labs Technical Journal, vol. 5, no. 2, pp. 169-180, 2000.

[21] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, "Deep learning for just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 17-26, Vancouver, BC, Canada, August 2015.

[22] X. Yang, D. Lo, X. Xia, and J. Sun, "TLEL: a two-layer ensemble learning approach for just-in-time defect prediction," Information and Software Technology, vol. 87, pp. 206-220, 2017.

[23] L. Pascarella, F. Palomba, and A. Bacchelli, "Fine-grained just-in-time defect prediction," Journal of Systems and Software, vol. 150, pp. 22-36, 2019.

[24] E. Arisholm, L. C. Briand, and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," Journal of Systems and Software, vol. 83, no. 1, pp. 2-17, 2010.

[25] Y. Kamei, S. Matsumoto, A. Monden, K. Matsumoto, B. Adams, and A. E. Hassan, "Revisiting common bug prediction findings using effort-aware models," in Proceedings of the 26th IEEE International Conference on Software Maintenance (ICSM), pp. 1-10, Shanghai, China, September 2010.

[26] Y. Zhou, B. Xu, H. Leung, and L. Chen, "An in-depth study of the potentially confounding effect of class size in fault prediction," ACM Transactions on Software Engineering and Methodology, vol. 23, no. 1, pp. 1-51, 2014.

[27] J. Liu, Y. Zhou, Y. Yang, H. Lu, and B. Xu, "Code churn: a neglected metric in effort-aware just-in-time defect prediction," in Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 11-19, Toronto, ON, Canada, November 2017.

[28] W. Fu and T. Menzies, "Revisiting unsupervised learning for defect prediction," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pp. 72-83, Paderborn, Germany, September 2017.

[29] Q. Huang, X. Xia, and D. Lo, "Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 159-170, Shanghai, China, September 2017.

[30] Q. Huang, X. Xia, and D. Lo, "Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction," Empirical Software Engineering, pp. 1-40, 2018.

[31] S. Herbold, A. Trautsch, and J. Grabowski, "Global vs. local models for cross-project defect prediction," Empirical Software Engineering, vol. 22, no. 4, pp. 1866-1902, 2017.

[32] M. E. Mezouar, F. Zhang, and Y. Zou, "Local versus global models for effort-aware defect prediction," in Proceedings of the 26th Annual International Conference on Computer Science and Software Engineering (CASCON), pp. 178-187, Toronto, ON, Canada, October 2016.

[33] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, Hoboken, NJ, USA, 1990.

[34] S. Hosseini, B. Turhan, and D. Gunarathna, "A systematic literature review and meta-analysis on cross project defect prediction," IEEE Transactions on Software Engineering, vol. 45, no. 2, pp. 111-147, 2019.

[35] J. Sliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?," Mining Software Repositories, vol. 30, no. 4, pp. 1-5, 2005.

[36] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.

[37] N. Cliff, Ordinal Methods for Behavioral Data Analysis, Psychology Press, London, UK, 2014.

[38] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek, "Appropriate statistics for ordinal level data: should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys?," in Annual Meeting of the Florida Association of Institutional Research, pp. 1-33, 2006.

Scientific Programming 13


SDP. Yang et al. [8] compared simple unsupervised models with supervised models for effort-aware JIT-SDP and found that many unsupervised models outperform the state-of-the-art supervised models. Inspired by the study of Yang et al. [8], Liu et al. [27] proposed a novel unsupervised learning method, CCUM, which is based on code churn. Their study showed that CCUM performs better than the state-of-the-art prediction model at that time. As the results Yang et al. [8] reported are startling, Fu et al. [28] repeated their experiment. Their study indicates that supervised learning methods are better than unsupervised methods when the analysis is conducted on a project-by-project basis. However, their study supports the general goal of Yang et al. [8]. Huang et al. [29] also performed a replication study based on the study of Yang et al. [8] and extended it in the literature [30]. They found three weaknesses of the LT model proposed by Yang et al. [8]: more context switching, more false alarms, and worse F1 than supervised methods. Therefore, they proposed a novel supervised method, CBS+, which combines the advantages of LT and EALR. Empirical results demonstrate that CBS+ can detect more buggy changes than EALR. More importantly, CBS+ can significantly reduce context switching compared to LT.

3. Software Defect Prediction Based on Local Models

Menzies et al. [11] first applied local models to defect prediction and effort estimation in 2011 and extended their work in 2013 [12]. The basic idea of local models is simple: first cluster all training data into homogeneous regions, and then train a prediction model for each region. A test instance is then predicted by one of these predictors [31].

To the best of our knowledge, there are only five studies of local models in the field of defect prediction. Menzies et al. [11, 12] first proposed the idea of local models. Their studies attempted to explore the best way to learn lessons for defect prediction and effort estimation. They adopted the WHERE algorithm to cluster the data into regions with similar properties. By comparing the lessons generated from all the data, from local projects, and from each cluster, their experiments showed that the lessons learned from clusters yield better prediction performance. Scanniello et al. [13] proposed a novel defect prediction method based on software clustering for class-level defect prediction. Their research demonstrated that, for a class to be predicted in a project system, the models built from the classes related to it outperformed the models built from the entire system. Bettenburg et al. [14] compared local and global models based on statistical learning techniques for software defect prediction. Experimental results demonstrated that local models had better fit and prediction performance. Herbold et al. [31] introduced local models into the context of cross-project defect prediction. They compared local models with global models and a transfer learning technique. Their findings showed that local models made only a minor difference compared with global models and the transfer learning model. Mezouar et al. [32] compared the performance of local and global models in the context of effort-aware defect prediction. Their study is based on 15 module-level defect data sets and uses k-medoids to build local models. Experimental results showed that local models perform worse than global models.

The process of local models and its difference from global models are described in Figure 2. For global models, all training data are used to build a prediction model by using statistical or machine learning methods, and the model can be used to predict any test instance. In comparison, local models first build a cluster model and divide the training data into several clusters based on it. In the prediction phase, each test instance is assigned a cluster number based on the cluster model, and it is predicted by the model trained on the corresponding cluster instead of on all the training data.

In our experiments, we apply local models to the JIT-SDP problem. In previous studies, various clustering methods have been used in local models, such as WHERE [11, 12], MCLUST [14], k-means [14], hierarchical clustering [14], BorderFlow [13], EM clustering [31], and k-medoids [32]. According to the suggestion of Herbold et al. [31], one can use any of these clustering methods. Our experiment adopts k-medoids as the clustering method in local models. K-medoids is a widely used clustering method. Different from the k-means used in a previous study [14], the cluster center of k-medoids is an actual object, which makes it less sensitive to outliers. Moreover, there are mature Python libraries supporting the implementation of k-medoids. In the model building stage, we use logistic regression and EALR to build the classification model and the effort-aware prediction model in local models, which are used as baseline models in previous JIT-SDP studies [7-9]. The detailed process of local models for JIT-SDP is described in Algorithm 1.

4. Experimental Setup

This article answers the following three research questions through experiments:

(i) RQ1: how do the classification performance and effort-aware prediction performance of local models compare with those of global models in the cross-validation scenario?

(ii) RQ2: how do the classification performance and effort-aware prediction performance of local models compare with those of global models in the cross-project-validation scenario?

(iii) RQ3: how do the classification performance and effort-aware prediction performance of local models compare with those of global models in the timewise-cross-validation scenario?

This section introduces the experimental setup from five aspects: data sets, modeling methods, performance evaluation indicators, data analysis method, and data preprocessing. Our experiment is performed on hardware with an Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz. The operating system is Windows 10. The programming environment for the scripts is Python 3.6.


4.1. Data Sets. The data sets used in the experiment are widely used for JIT-SDP and are shared by Kamei et al. [7]. They contain six open-source projects from different application domains. For example, Bugzilla (BUG) is an open-source defect tracking system, Eclipse JDT (JDT) is a full-featured Java integrated development environment plugin for the Eclipse platform, Mozilla (MOZ) is a web browser, and PostgreSQL (POS) is a database system [9]. These data sets are extracted from the CVS repositories of the projects and the corresponding bug reports [7]. The basic information of the data sets is shown in Table 1.

To predict whether a change is defective or not, Kamei et al. [7] designed 14 metrics associated with software defects, which can be grouped into five dimensions (i.e., diffusion, size, purpose, history, and experience). The specific description of these metrics is shown in Table 2. A more detailed introduction to the metrics can be found in the literature [7].

4.2. Modeling Methods. Three modeling methods are used in our experiment. First, in the process of using local models, the training set is divided into multiple clusters; this step is completed by the k-medoids model. Second, the classification performance and effort-aware prediction performance of local models and global models are investigated in our experiment; therefore, we use logistic regression and EALR to build classification models and effort-aware prediction models, respectively, which are used as baseline models in previous JIT-SDP studies [7-9]. Details of the three models are given below.

4.2.1. K-Medoids. The first step in local models is to use a clustering algorithm to divide all training data into several homogeneous regions. In this step, the experiment uses k-medoids as the clustering algorithm, which is widely used in different clustering tasks. Different from k-means, the coordinate center of each cluster in k-medoids is a representative object rather than a centroid; hence, k-medoids is less sensitive to outliers. The objective function of k-medoids is defined in Equation (1), where p is a sample in the cluster C_i and O_i is the medoid of the cluster C_i.

Figure 2: The process of (a) global models and (b) local models. In (a), a single classifier is trained on all training data and applied to every test instance. In (b), a clustering algorithm splits the training data into clusters 1..n, one classifier is trained per cluster, and each test instance is routed to the classifier of its assigned cluster.

Input: training set D = {(x1, y1), (x2, y2), ..., (xn, yn)}; test set T = {(x1, y1), (x2, y2), ..., (xm, ym)}; the number of clusters k
Output: prediction results R

(1) begin
(2)   R ← [ ]
(3)   // Divide the training set into k clusters
(4)   clusters, cluster_centers ← k-medoids(D, k)
(5)   // Train a classifier for each cluster
(6)   classifiers ← modeling_algorithm(clusters)
(7)   for instance t in T do
(8)     // Assign a cluster number i to instance t
(9)     i ← calculate_min_distance(t, cluster_centers)
(10)    r ← classifiers[i].predict(t)   // perform prediction
(11)    R.append(r)
(12)  end
(13) end

Algorithm 1: The application process of local models for JIT-SDP.
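To make the procedure concrete, the following minimal Python sketch mirrors Algorithm 1. It assumes numpy-style feature matrices and a PAM-style k-medoids implementation (here the KMedoids class from scikit-learn-extra is used as a stand-in; the article does not prescribe a specific library), with logistic regression as the per-cluster classifier.

    # Minimal sketch of Algorithm 1 (assumptions: numpy arrays, scikit-learn,
    # and scikit-learn-extra's KMedoids as the PAM-style clustering step).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn_extra.cluster import KMedoids  # assumed k-medoids implementation

    def local_model_predict(X_train, y_train, X_test, k=2):
        """Cluster the training set, fit one classifier per cluster, and route
        each test instance to the classifier of its nearest medoid."""
        km = KMedoids(n_clusters=k, random_state=0).fit(X_train)

        # Train one logistic regression model per cluster; a global model would
        # instead fit a single classifier on the whole training set.
        classifiers = {
            i: LogisticRegression(max_iter=1000).fit(X_train[km.labels_ == i],
                                                     y_train[km.labels_ == i])
            for i in range(k)
        }

        # Assign each test instance to its nearest medoid and predict with the
        # corresponding local classifier.
        cluster_ids = km.predict(X_test)
        return np.array([classifiers[i].predict(x.reshape(1, -1))[0]
                         for i, x in zip(cluster_ids, X_test)])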


E = \sum_{i=1}^{k} \sum_{p \in C_i} \left\| p - O_i \right\|^2.   (1)

Among the different k-medoids algorithms, the partitioning around medoids (PAM) algorithm is used in the experiment, which was proposed by Kaufman et al. [33] and is one of the most popular k-medoids algorithms.

4.2.2. Logistic Regression. Similar to prior studies, logistic regression is used to build classification models [7]. Logistic regression is a widely used regression model that can be applied to a variety of classification and regression tasks. Assuming that there are n metrics for a change c, logistic regression can be expressed by Equation (2), where w0, w1, ..., wn denote the coefficients of logistic regression and the independent variables m1, ..., mn represent the metrics of the change, which are introduced in Table 2. The output of the model, y(c), indicates the probability that the change is buggy.

y(c) = \frac{1}{1 + e^{-(w_0 + w_1 m_1 + \cdots + w_n m_n)}}.   (2)

In our experiment, logistic regression is used to solve a classification problem. However, the output of the model is a continuous value ranging from 0 to 1. Therefore, for a change c, if the value of y(c) is greater than 0.5, then c is classified as buggy; otherwise, it is classified as clean. The process is defined as follows:

Y(c) = \begin{cases} 1, & \text{if } y(c) > 0.5, \\ 0, & \text{if } y(c) \le 0.5. \end{cases}   (3)

4.2.3. Effort-Aware Linear Regression. Arisholm et al. [24] pointed out that the use of defect prediction models should consider not only the classification performance of the model but also the cost-effectiveness of code inspection. Since then, a large number of studies have focused on effort-aware defect prediction [8, 9, 11, 12]. Effort-aware linear regression (EALR) was first proposed by Kamei et al. [7] for effort-aware JIT-SDP and is based on linear regression. Different from the above logistic regression, the dependent variable of EALR is Y(c)/Effort(c), where Y(c) denotes the defect proneness of a change and Effort(c) denotes the effort required to inspect the change c. Following the suggestion of Kamei et al. [7], the effort of a change is obtained by counting its number of modified lines of code.
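As an illustration, a minimal sketch of EALR under the definition above might look as follows. The helper name fit_ealr and the arrays la and ld are illustrative, and the effort of a change is assumed to be LA + LD (its number of modified lines).

    # Minimal sketch of EALR: a linear regression whose dependent variable is
    # Y(c)/Effort(c), with effort assumed to be the number of modified lines.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def fit_ealr(X_train, y_train, la, ld):
        effort = np.maximum(la + ld, 1)   # modified lines of code; avoid division by zero
        density = y_train / effort        # defect proneness per unit of inspection effort
        return LinearRegression().fit(X_train, density)

    # At prediction time, changes are ranked in descending order of the predicted
    # density and inspected until a fixed budget (e.g., 20% of total effort) is spent.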

4.3. Performance Indicators. Four performance indicators are used to evaluate and compare the performance of local and global models for JIT-SDP (i.e., F1, AUC, ACC, and Popt). Specifically, we use F1 and AUC to evaluate the classification performance of prediction models and use ACC and Popt to evaluate their effort-aware prediction performance. The details are as follows.

Table 1: The basic information of the data sets.

Project  Period                  Defective changes  Total changes  Defect rate (%)
BUG      1998/08/26-2006/12/16   1696               4620           36.71
COL      2002/11/25-2006/07/27   1361               4455           30.55
JDT      2001/05/02-2007/12/31   5089               35386          14.38
MOZ      2000/01/01-2006/12/17   5149               98275          5.24
PLA      2001/05/02-2007/12/31   9452               64250          14.71
POS      1996/07/09-2010/05/15   5119               20431          25.06

Table 2: The description of the metrics.

Dimension    Metric   Description
Diffusion    NS       Number of modified subsystems
             ND       Number of modified directories
             NF       Number of modified files
             Entropy  Distribution of modified code across each file
Size         LA       Lines of code added
             LD       Lines of code deleted
             LT       Lines of code in a file before the change
Purpose      FIX      Whether or not the change is a defect fix
History      NDEV     Number of developers that changed the files
             AGE      Average time interval between the last and the current change
             NUC      Number of unique last changes to the files
Experience   EXP      Developer experience
             REXP     Recent developer experience
             SEXP     Developer experience on a subsystem


The experiment first uses F1 and AUC to evaluate the classification performance of the models. JIT-SDP aims to predict the defect proneness of a change, which is a classification task. The predicted results can be expressed as a confusion matrix, shown in Table 3, including true positive (TP), false negative (FN), false positive (FP), and true negative (TN).

For classification problems, precision (P) and recall (R) are often used to evaluate the prediction performance of a model:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.   (4)

However, the two indicators are usually contradictory. In order to evaluate a prediction model comprehensively, we use F1, the harmonic mean of precision and recall, to evaluate the classification performance of the model, as shown in the following equation:

F1 = \frac{2 \times P \times R}{P + R}.   (5)

The second indicator is AUC. Class imbalance problems are often encountered in software defect prediction, and our data sets are also imbalanced (the number of buggy changes is much smaller than the number of clean changes). In the class-imbalanced setting, threshold-based evaluation indicators (such as accuracy, recall, and F1) are sensitive to the threshold [4]. Therefore, a threshold-independent evaluation indicator, AUC, is used to evaluate the classification performance of a model. AUC (area under curve) is the area under the ROC (receiver operating characteristic) curve. The ROC curve is drawn as follows: first, the instances are sorted in descending order of the predicted probability of being positive and are then treated as positive examples in turn; in each iteration, the true positive rate (TPR) and the false positive rate (FPR) are calculated as the ordinate and abscissa, respectively:

TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP}.   (6)

The experiment uses ACC and Popt to evaluate the effort-aware prediction performance of local and global models. ACC and Popt are commonly used performance indicators that have been widely used in previous research [7-9, 25].

The calculation of ACC and Popt is illustrated in Figure 3, where the x-axis and y-axis represent the cumulative percentage of code churn (i.e., the number of lines of code added and deleted) and of defect-inducing changes, respectively. To compute ACC and Popt, three curves are considered [8]: in the optimal model curve, the changes are sorted in descending order of their actual defect densities; in the worst model curve, the changes are sorted in ascending order of their actual defect densities; and in the prediction model curve, the changes are sorted in descending order of their predicted defect densities. The areas between each of the three curves and the x-axis, denoted Area(optimal), Area(m), and Area(worst), are then calculated. ACC indicates the recall of defect-inducing changes when 20% of the effort is invested, based on the prediction model curve. According to [7], Popt is defined by the following equation:

P_{opt} = 1 - \frac{Area(optimal) - Area(m)}{Area(optimal) - Area(worst)}.   (7)
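A minimal sketch of how ACC and Popt can be computed from Equation (7) is given below. The function names are illustrative; effort stands for the code churn of each change and buggy for its 0/1 label.

    # Minimal sketch of the ACC and Popt indicators (assumptions: numpy arrays,
    # trapezoidal areas under the cumulative effort/defect curves).
    import numpy as np

    def effort_curve(effort, buggy, order):
        """Cumulative fraction of defect-inducing changes vs. cumulative effort."""
        e = np.concatenate(([0.0], np.cumsum(effort[order]) / effort.sum()))
        b = np.concatenate(([0.0], np.cumsum(buggy[order]) / buggy.sum()))
        return e, b

    def acc_popt(effort, buggy, predicted_density):
        actual_density = buggy / np.maximum(effort, 1)
        opt   = effort_curve(effort, buggy, np.argsort(-actual_density))
        worst = effort_curve(effort, buggy, np.argsort(actual_density))
        model = effort_curve(effort, buggy, np.argsort(-predicted_density))

        # ACC: recall of buggy changes when 20% of the total effort is inspected.
        acc = np.interp(0.2, model[0], model[1])

        # Popt = 1 - (Area(optimal) - Area(m)) / (Area(optimal) - Area(worst)).
        area = lambda curve: np.trapz(curve[1], curve[0])
        popt = 1 - (area(opt) - area(model)) / (area(opt) - area(worst))
        return acc, popt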

4.4. Data Analysis Method. The experiment considers three scenarios to evaluate the performance of prediction models: 10 × 10-fold cross-validation, cross-project-validation, and timewise-cross-validation. The details are as follows.

4.4.1. 10 Times 10-Fold Cross-Validation. Cross-validation is performed within the same project, and the basic process of 10 × 10-fold cross-validation is as follows: first, the data set of a project is divided into 10 parts of equal size; then, nine parts are used to train a prediction model and the remaining part is used as the test set. This process is repeated 10 times in order to reduce random deviation. Therefore, 10 × 10-fold cross-validation produces 10 × 10 = 100 results.
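For illustration, this loop can be sketched with scikit-learn's RepeatedKFold as follows; build_and_evaluate is a placeholder for training a model and computing the four indicators on one train/test split.

    # Minimal sketch of the 10 x 10-fold cross-validation loop (scikit-learn).
    from sklearn.model_selection import RepeatedKFold

    def ten_times_ten_fold(X, y, build_and_evaluate):
        results = []
        rkf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
        for train_idx, test_idx in rkf.split(X):   # 10 x 10 = 100 train/test pairs
            results.append(build_and_evaluate(X[train_idx], y[train_idx],
                                              X[test_idx], y[test_idx]))
        return results                              # medians of these are reported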

Table 3: Confusion matrix.

                   Prediction result
Actual value       Positive    Negative
Positive           TP          FN
Negative           FP          TN

Figure 3: The performance indicators ACC and Popt. The x-axis shows the cumulative percentage of code churn, the y-axis the cumulative percentage of defect-inducing changes, and the three curves correspond to the optimal model, the prediction model, and the worst model.


4.4.2. Cross-Project-Validation. Cross-project defect prediction has received widespread attention in recent years [34]. Its basic idea is to train a defect prediction model on the defect data of one project and use it to predict the defects of another project [8]. Given data sets containing n projects, n × (n − 1) results can be generated for a prediction model. The data sets used in our experiment contain six open-source projects, so 6 × (6 − 1) = 30 prediction results are generated per prediction model.
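A minimal sketch of this loop is shown below; datasets is assumed to map each project name to its (X, y) matrices, and build_and_evaluate is the same placeholder as above.

    # Minimal sketch of cross-project-validation: every ordered pair of distinct
    # projects yields one run, so six projects give 6 x 5 = 30 runs.
    from itertools import permutations

    def cross_project(datasets, build_and_evaluate):
        results = {}
        for source, target in permutations(datasets, 2):   # n x (n - 1) pairs
            X_tr, y_tr = datasets[source]
            X_te, y_te = datasets[target]
            results[(source, target)] = build_and_evaluate(X_tr, y_tr, X_te, y_te)
        return results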

4.4.3. Timewise-Cross-Validation. Timewise-cross-validation is also performed within the same project, but it considers the chronological order of the changes in the data sets [35]. Defect data are collected in chronological order during the actual software development process; therefore, changes committed early can be used to predict the defect proneness of changes committed later. In the timewise-cross-validation scenario, all changes in the data sets are arranged in chronological order, and the changes of the same month are placed in the same group. Suppose the data set is divided into n groups: groups i and i + 1 are used to train a prediction model, and groups i + 4 and i + 5 are used as the test set. Therefore, assuming that a project contains n months of data, n − 5 results can be generated for a prediction model.
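The monthly grouping can be sketched as follows; months is assumed to be a chronologically ordered list of (X, y) pairs, one per month, and build_and_evaluate is again a placeholder.

    # Minimal sketch of timewise-cross-validation: train on months i and i+1,
    # test on months i+4 and i+5, for every feasible i.
    import numpy as np

    def timewise_cross_validation(months, build_and_evaluate):
        results = []
        for i in range(len(months) - 5):            # n months -> n - 5 runs
            X_tr = np.vstack([months[i][0], months[i + 1][0]])
            y_tr = np.concatenate([months[i][1], months[i + 1][1]])
            X_te = np.vstack([months[i + 4][0], months[i + 5][0]])
            y_te = np.concatenate([months[i + 4][1], months[i + 5][1]])
            results.append(build_and_evaluate(X_tr, y_tr, X_te, y_te))
        return results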

The experiment uses the Wilcoxon signed-rank test [36] and Cliff's δ [37] to test the significance of the differences in classification and prediction performance between local and global models. The Wilcoxon signed-rank test is a popular nonparametric statistical hypothesis test; we use its p value to examine whether the difference in performance between local and global models is statistically significant at a significance level of 0.05. The experiment also uses Cliff's δ to determine the magnitude of the difference in prediction performance between local and global models. Usually, the magnitude of the difference is divided into four levels, i.e., trivial (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), moderate (0.33 ≤ |δ| < 0.474), and large (|δ| ≥ 0.474) [38]. In summary, when the p value is less than 0.05 and Cliff's δ is greater than or equal to 0.147, local models and global models have a significant difference in prediction performance.
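A minimal sketch of this decision rule, using scipy's Wilcoxon signed-rank test and a straightforward Cliff's δ implementation, is given below; the inputs are paired score lists produced by local and global models.

    # Minimal sketch of the significance test: Wilcoxon signed-rank (scipy) plus
    # Cliff's delta, with the thresholds used in the text (0.05 and 0.147).
    from scipy.stats import wilcoxon

    def cliffs_delta(a, b):
        """Fraction of pairs where a > b minus fraction of pairs where a < b."""
        gt = sum(x > y for x in a for y in b)
        lt = sum(x < y for x in a for y in b)
        return (gt - lt) / (len(a) * len(b))

    def significantly_different(local_scores, global_scores,
                                alpha=0.05, threshold=0.147):
        _, p = wilcoxon(local_scores, global_scores)
        delta = cliffs_delta(local_scores, global_scores)
        # Significant difference: p < 0.05 and |delta| at least "small" (>= 0.147).
        return p < alpha and abs(delta) >= threshold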

4.5. Data Preprocessing. Following the suggestion of Kamei et al. [7], the data sets are preprocessed in the following three steps (a brief code sketch follows the list):

(1) Removing Highly Correlated Metrics. The ND and REXP metrics are excluded because NF and ND, and REXP and EXP, are highly correlated. LA and LD are normalized by dividing by LT, since LA and LD are highly correlated. LT and NUC are normalized by dividing by NF, since they are highly correlated with NF.

(2) Logarithmic Transformation. Since most metrics are highly skewed, a logarithmic transformation is applied to each metric (except for FIX).

(3) Dealing with Class Imbalance. The data sets used in the experiment suffer from class imbalance, i.e., the number of defect-inducing changes is far smaller than the number of clean changes. Therefore, we perform random undersampling on the training set: clean changes are randomly deleted until their number equals the number of buggy changes. Note that undersampling is not performed on the test set.
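A minimal sketch of the three steps is given below, assuming the metrics are held in a pandas DataFrame with the column names of Table 2; the guards against division by zero are an added assumption.

    # Minimal sketch of the preprocessing steps described above.
    import numpy as np
    import pandas as pd

    def preprocess(df):
        df = df.copy()
        # (1) Remove highly correlated metrics and normalize the related ones.
        df = df.drop(columns=["ND", "REXP"])
        lt = df["LT"].replace(0, 1)                 # guard against division by zero
        nf = df["NF"].replace(0, 1)
        df["LA"], df["LD"] = df["LA"] / lt, df["LD"] / lt
        df["LT"], df["NUC"] = df["LT"] / nf, df["NUC"] / nf
        # (2) Logarithmic transformation of the skewed metrics (except FIX).
        for col in df.columns.drop("FIX"):
            df[col] = np.log1p(df[col])
        return df

    def undersample(X, y, seed=0):
        # (3) Randomly drop clean changes so both classes have the same size
        #     (applied to the training set only).
        rng = np.random.default_rng(seed)
        buggy, clean = np.where(y == 1)[0], np.where(y == 0)[0]
        keep = np.concatenate([buggy,
                               rng.choice(clean, size=len(buggy), replace=False)])
        return X[keep], y[keep]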

5. Experimental Results and Analysis

5.1. Analysis for RQ1. To investigate and compare the performance of local and global models in the 10 × 10-fold cross-validation scenario, we conduct a large-scale empirical study based on data sets from six open-source projects. The experimental results are shown in Table 4. The first column lists the evaluation indicators (AUC, F1, ACC, and Popt); the second column lists the project names; the third column shows the performance of the global models on the four indicators; and the fourth to twelfth columns show the performance of the local models for different values of k. Considering that the number of clusters k may affect the performance of local models, the parameter k is tuned from 2 to 10. Since 100 prediction results are generated in the 10 × 10-fold cross-validation scenario, the values in the table are medians of the experimental results. We process the data in Table 4 as follows:

(i) We performed the significance test defined in Section 4.4 on the performance of the local and global models. The performance values of local models that are significantly different from those of the global models are shown in bold.

(ii) The performance values of local models that are significantly better than those of the global models are shown in italics.

(iii) The W/D/L values (i.e., the number of projects for which local models obtain a better, equal, or worse prediction performance than global models) are calculated for each value of k and each evaluation indicator.

First, we analyze and compare the classification performance of local and global models. As shown in Table 4, the performance of local models is worse than that of global models on the AUC indicator for all parameters k and all projects. Moreover, in most cases the performance of local models is worse than that of global models on the F1 indicator (except for four values in italics). Therefore, it can be concluded that global models perform better than local models in classification performance.

Second, we investigate and compare the effort-aware prediction performance of local and global models. As can be seen from Table 4, local models are valid for effort-aware JIT-SDP. However, their effect varies across projects and across values of the parameter k.


First, local models have different effects on different data sets. To describe the effectiveness of local models on each data set, we count the number of times that local models perform significantly better than global models over the 9 values of k and the two evaluation indicators (i.e., ACC and Popt). The statistical results are shown in Table 5, where the second row is the number of wins of local models for each project. It can be seen that local models perform better on the projects MOZ and PLA and poorly on the projects BUG and COL. Based on the number of times that local models outperform global models, we rank the six projects as shown in the third row. In addition, we list the sizes of the data sets and their ranks in the fourth row. It can be seen that the performance of local models on a data set is strongly correlated with its size: generally, the larger the data set, the more significant the performance improvement of local models.

We attribute this to the size of the training data. When the data set is too small, dividing it into multiple clusters further reduces the number of training samples per cluster, which degrades the performance of the models. When the data set is large, the performance of the model is less limited by the amount of training data, so the advantages of local models are better reflected.

In addition, the number of clusters k also has an impact on the performance of local models. According to the W/D/L analysis in Table 4, local models obtain the best ACC values when k is set to 2 and the best Popt values when k is set to 2 or 6. Therefore, we recommend setting the parameter k to 2 when using local models for effort-aware JIT-SDP in the cross-validation scenario.

In summary, this section analyzes and compares the classification performance and effort-aware prediction performance of local and global models in the 10 × 10-fold cross-validation scenario. The empirical results show that, compared with global models, local models perform poorly in classification performance but better in effort-aware prediction performance. However, the performance of local models is affected by the parameter k and by the size of the data sets: local models obtain better effort-aware prediction performance on projects with larger data sets and when k is set to 2.

5.2. Analysis for RQ2. This section discusses the performance differences between local and global models in the cross-project-validation scenario for JIT-SDP. The results of the experiment are shown in Table 6.

Table 4: Local vs. global models in the 10 × 10-fold cross-validation scenario.

Indicator  Project  Global  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
AUC        BUG      0.730   0.704  0.687  0.684  0.698  0.688  0.686  0.686  0.693  0.685
           COL      0.742   0.719  0.708  0.701  0.703  0.657  0.666  0.670  0.668  0.660
           JDT      0.725   0.719  0.677  0.670  0.675  0.697  0.671  0.674  0.671  0.654
           MOZ      0.769   0.540  0.741  0.715  0.703  0.730  0.730  0.725  0.731  0.722
           PLA      0.742   0.733  0.702  0.693  0.654  0.677  0.640  0.638  0.588  0.575
           POS      0.777   0.764  0.758  0.760  0.758  0.753  0.755  0.705  0.757  0.757
           W/D/L    -       0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6
F1         BUG      0.663   0.644  0.628  0.616  0.603  0.598  0.608  0.610  0.601  0.592
           COL      0.696   0.672  0.649  0.669  0.678  0.634  0.537  0.596  0.604  0.564
           JDT      0.723   0.706  0.764  0.757  0.734  0.695  0.615  0.643  0.732  0.722
           MOZ      0.825   0.374  0.853  0.758  0.785  0.741  0.718  0.737  0.731  0.721
           PLA      0.704   0.678  0.675  0.527  0.477  0.557  0.743  0.771  0.627  0.604
           POS      0.762   0.699  0.747  0.759  0.746  0.698  0.697  0.339  0.699  0.700
           W/D/L    -       0/0/6  2/0/4  1/1/4  0/1/5  0/0/6  0/1/5  1/0/5  0/1/5  0/1/5
ACC        BUG      0.421   0.452  0.471  0.458  0.424  0.451  0.450  0.463  0.446  0.462
           COL      0.553   0.458  0.375  0.253  0.443  0.068  0.195  0.459  0.403  0.421
           JDT      0.354   0.532  0.374  0.431  0.291  0.481  0.439  0.314  0.287  0.334
           MOZ      0.189   0.386  0.328  0.355  0.364  0.301  0.270  0.312  0.312  0.312
           PLA      0.368   0.472  0.509  0.542  0.497  0.509  0.456  0.442  0.456  0.495
           POS      0.336   0.490  0.397  0.379  0.457  0.442  0.406  0.341  0.400  0.383
           W/D/L    -       5/0/1  4/1/1  4/1/1  3/1/2  4/1/1  4/1/1  2/2/2  3/1/2  4/1/1
Popt       BUG      0.742   0.747  0.744  0.726  0.738  0.749  0.750  0.760  0.750  0.766
           COL      0.726   0.661  0.624  0.401  0.731  0.295  0.421  0.644  0.628  0.681
           JDT      0.660   0.779  0.625  0.698  0.568  0.725  0.708  0.603  0.599  0.634
           MOZ      0.555   0.679  0.626  0.645  0.659  0.588  0.589  0.643  0.634  0.630
           PLA      0.664   0.723  0.727  0.784  0.746  0.758  0.718  0.673  0.716  0.747
           POS      0.673   0.729  0.680  0.651  0.709  0.739  0.683  0.596  0.669  0.644
           W/D/L    -       4/1/1  2/2/2  2/1/3  3/2/1  4/1/1  2/3/1  1/2/3  1/3/2  2/1/3


In Table 6, the first column indicates the four evaluation indicators (AUC, F1, ACC, and Popt); the second column gives the performance values of the global models; and the third to eleventh columns give the prediction performance of the local models under different values of the parameter k. In our experiment, six data sets are used in the cross-project-validation scenario; therefore, each prediction model produces 30 prediction results, and the values in Table 6 are the medians of these results.

We process the values in Table 6 as follows:

(i) Using the significance test method, the performance values of local models that are significantly different from those of global models are shown in bold.

(ii) The performance values of local models that are significantly better than those of global models are shown in italics.

(iii) The W/D/L column summarizes the number of times that local models perform better than, equally to, or worse than global models across the different parameters k.

First, we compare the difference in classification performance between local and global models. As can be seen from Table 6, for the AUC and F1 indicators, the classification performance of local models under each parameter k is worse than or equal to that of global models. Therefore, we conclude that, in the cross-project-validation scenario, local models perform worse than global models in classification performance.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 6 that the performance of local models is better than or equal to that of global models on the ACC and Popt indicators for k from 2 to 10. Therefore, local models are valid for effort-aware JIT-SDP in the cross-project-validation scenario. In particular, when the number of clusters k is set to 2, local models obtain better ACC and Popt values than global models, and when k is set to 5, local models obtain better ACC values. Therefore, it is appropriate to set k to 2, which finds 49.3% of defect-inducing changes when using 20% of the effort and improves the ACC indicator by 57.0% and the Popt indicator by 16.4% relative to global models.

In summary, in the cross-project-validation scenario, local models perform worse in classification performance but better in effort-aware prediction performance than global models. However, the setting of k in local models has an impact on their effort-aware prediction performance; the empirical results show that the optimal effort-aware prediction performance is obtained when k is set to 2.

5.3. Analysis for RQ3. This section compares the prediction performance of local and global models in the timewise-cross-validation scenario. Since only two months of data in each data set are used to train the prediction models in this scenario, the size of the training sample in each cluster is small. Through experimentation, we find that if the parameter k of local models is set too large, the number of training samples in some clusters may be too small or even zero. Therefore, the experiment sets the parameter k of local models to 2. The experimental results are shown in Table 7. Suppose a project contains n months of data; then prediction models produce n − 5 prediction results in the timewise-cross-validation scenario. Therefore, the third and fourth columns in Table 7 report the median and standard deviation of the prediction results for global and local models. Based on the significance test method, the performance values of local models that are significantly different from those of global models are shown in bold. The row W/D/L summarizes the number of projects for which local models obtain a better, equal, or worse performance than global models on each indicator.

First, we compare the classification performance of local and global models. It can be seen from Table 7 that local models perform worse than global models on the AUC and F1 indicators for all six data sets. Therefore, it can be concluded that local models perform poorly in classification performance compared to global models in the timewise-cross-validation scenario.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 7 that, for the projects BUG and COL, local models have worse ACC and Popt values than global models, while for the other four projects (JDT, MOZ, PLA, and POS), local and global models show no significant difference on the ACC and Popt indicators. Therefore, local models perform worse in effort-aware prediction performance than global models in the timewise-cross-validation scenario. This conclusion is inconsistent with the conclusions in Sections 5.1 and 5.2. We consider that the main reason lies in the size of the training sample: in the timewise-cross-validation scenario, only two months of data are used to train the prediction models (on average, there are only 452 changes per month). When local models are used, the training set is further divided into several clusters, so the number of training samples in each cluster decreases again, which results in a decrease in the effort-aware prediction performance of local models.

In this section, we compare the classification performance and effort-aware prediction performance of local and global models in the timewise-cross-validation scenario. The empirical results show that local models perform worse than global models in both classification performance and effort-aware prediction performance. Therefore, we recommend

Table 5: The number of wins of local models at different k values.

Project                    BUG       COL       JDT        MOZ        PLA        POS
Count                      4         0         7          18         14         11
Rank                       5         6         4          1          2          3
Number of changes (rank)   4620 (5)  4455 (6)  35386 (3)  98275 (1)  64250 (2)  20431 (4)


using global models to solve the JIT-SDP problem in the timewise-cross-validation scenario.

6. Threats to Validity

6.1. External Validity. Although our experiment uses public data sets that have been extensively used in previous studies, we cannot guarantee that the findings of the experiment apply to all other change-level defect data sets. Therefore, the defect data sets of more projects should be mined in order to verify the generalizability of the experimental results.

6.2. Construct Validity. We use F1, AUC, ACC, and Popt to evaluate the prediction performance of local and global models. Because defect data sets are often imbalanced, AUC is more appropriate as a threshold-independent evaluation indicator of classification performance [4]. In addition, we use F1, a threshold-based evaluation indicator, as a supplement to AUC. Besides, similar to prior studies [7, 9], the experiment uses ACC and Popt to evaluate the effort-aware prediction performance of the models. However, we cannot guarantee that the experimental results hold for all other evaluation indicators, such as precision, recall, Fβ, and PofB20.

6.3. Internal Validity. Internal validity concerns errors in the experimental code. In order to reduce this risk, we carefully examined the code of the experiment and referred to the code published in previous studies [7, 8, 29], so as to improve the reliability of the experimental code.

7. Conclusions and Future Work

In this article, we apply local models to JIT-SDP for the first time. Our study aims to compare the prediction performance of local and global models under cross-validation, cross-project-validation, and timewise-cross-validation for JIT-SDP. To compare the classification performance and effort-aware prediction performance of local and global models, we first build local models based on the k-medoids method; afterwards, logistic regression and EALR are used to build the classification models and the effort-aware prediction models for local and global models. The empirical results show that local models perform worse in classification performance than global models in all three evaluation scenarios. Moreover, local models have worse effort-aware prediction performance in the timewise-cross-validation scenario. However, local models are significantly better than global models in effort-aware prediction performance in the cross-validation and cross-project-validation scenarios. In particular, the optimal effort-aware prediction performance is obtained when the parameter k of local models is set to 2. Therefore, local models are still promising in the context of effort-aware JIT-SDP.

In the future, we first plan to consider more commercial projects to further verify the validity of the experimental conclusions. Second, logistic regression and EALR are used in the local and global models of this study; considering that the choice of modeling methods affects the prediction performance of local and global models, we plan to add more baselines to local and global models to verify the generalizability of the experimental conclusions.

Data Availability

The experiment uses public data sets shared by Kamei et al. [7], who have already published the download address of

Table 6: Local vs. global models in the cross-project-validation scenario.

Indicator  Global  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10   W/D/L
AUC        0.706   0.683  0.681  0.674  0.673  0.673  0.668  0.659  0.680  0.674  0/0/9
F1         0.678   0.629  0.703  0.696  0.677  0.613  0.620  0.554  0.629  0.637  0/4/5
ACC        0.314   0.493  0.396  0.348  0.408  0.393  0.326  0.340  0.342  0.345  2/7/0
Popt       0.656   0.764  0.691  0.672  0.719  0.687  0.619  0.632  0.667  0.624  1/8/0

Table 7: Local vs. global models in the timewise-cross-validation scenario.

Indicator  Project  Global         Local (k=2)
AUC        BUG      0.672 ± 0.125  0.545 ± 0.124
           COL      0.717 ± 0.075  0.636 ± 0.135
           JDT      0.708 ± 0.043  0.645 ± 0.073
           MOZ      0.749 ± 0.034  0.648 ± 0.128
           PLA      0.709 ± 0.054  0.642 ± 0.076
           POS      0.743 ± 0.072  0.702 ± 0.095
           W/D/L    -              0/0/6
F1         BUG      0.587 ± 0.108  0.515 ± 0.128
           COL      0.651 ± 0.066  0.599 ± 0.125
           JDT      0.722 ± 0.048  0.623 ± 0.161
           MOZ      0.804 ± 0.050  0.696 ± 0.190
           PLA      0.702 ± 0.053  0.618 ± 0.162
           POS      0.714 ± 0.065  0.657 ± 0.128
           W/D/L    -              0/0/6
ACC        BUG      0.357 ± 0.212  0.242 ± 0.210
           COL      0.426 ± 0.168  0.321 ± 0.186
           JDT      0.356 ± 0.148  0.329 ± 0.189
           MOZ      0.218 ± 0.132  0.242 ± 0.125
           PLA      0.358 ± 0.181  0.371 ± 0.196
           POS      0.345 ± 0.159  0.306 ± 0.205
           W/D/L    -              0/4/2
Popt       BUG      0.622 ± 0.194  0.458 ± 0.189
           COL      0.669 ± 0.150  0.553 ± 0.204
           JDT      0.654 ± 0.101  0.624 ± 0.142
           MOZ      0.537 ± 0.110  0.537 ± 0.122
           PLA      0.631 ± 0.115  0.645 ± 0.151
           POS      0.615 ± 0.131  0.583 ± 0.167
           W/D/L    -              0/4/2


the data sets in their paper. In addition, in order to verify the repeatability of our experiment and promote future research, all the experimental code of this article can be downloaded at https://github.com/yangxingguang/LocalJIT.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the NSF of China under Grant nos. 61772200 and 61702334, the Shanghai Pujiang Talent Program under Grant no. 17PJ1401900, the Shanghai Municipal Natural Science Foundation under Grant nos. 17ZR1406900 and 17ZR1429700, the Educational Research Fund of ECUST under Grant no. ZH1726108, and the Collaborative Innovation Foundation of Shanghai Institute of Technology under Grant no. XTCX2016-20.

References

[1] M. Newman, Software Errors Cost U.S. Economy $59.5 Billion Annually: NIST Assesses Technical Needs of Industry to Improve Software-Testing, 2002, http://www.abeacha.com/NIST_press_release_bugs_cost.htm.
[2] Z. Li, X.-Y. Jing, and X. Zhu, "Progress on approaches to software defect prediction," IET Software, vol. 12, no. 3, pp. 161–175, 2018.
[3] X. Chen, Q. Gu, W. Liu, S. Liu, and C. Ni, "Survey of static software defect prediction," Ruan Jian Xue Bao/Journal of Software, vol. 27, no. 1, 2016, in Chinese.
[4] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An empirical comparison of model validation techniques for defect prediction models," IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1–18, 2017.
[5] A. G. Koru, D. Zhang, K. El Emam, and H. Liu, "An investigation into the functional form of the size-defect relationship for software modules," IEEE Transactions on Software Engineering, vol. 35, no. 2, pp. 293–304, 2009.
[6] S. Kim, E. J. Whitehead, and Y. Zhang, "Classifying software changes: clean or buggy?," IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.
[7] Y. Kamei, E. Shihab, B. Adams et al., "A large-scale empirical study of just-in-time quality assurance," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, 2013.
[8] Y. Yang, Y. Zhou, J. Liu et al., "Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models," in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 157–168, Hong Kong, China, November 2016.
[9] X. Chen, Y. Zhao, Q. Wang, and Z. Yuan, "MULTI: multi-objective effort-aware just-in-time software defect prediction," Information and Software Technology, vol. 93, pp. 1–13, 2018.
[10] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi, and A. E. Hassan, "Studying just-in-time defect prediction using cross-project models," Empirical Software Engineering, vol. 21, no. 5, pp. 2072–2106, 2016.
[11] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. R. Cok, "Local vs. global models for effort estimation and defect prediction," in Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 343–351, Lawrence, KS, USA, November 2011.
[12] T. Menzies, A. Butcher, D. Cok et al., "Local versus global lessons for defect prediction and effort estimation," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 822–834, 2013.
[13] G. Scanniello, C. Gravino, A. Marcus, and T. Menzies, "Class level fault prediction using software clustering," in Proceedings of the 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 640–645, Silicon Valley, CA, USA, November 2013.
[14] N. Bettenburg, M. Nagappan, and A. E. Hassan, "Towards improving statistical modeling of software engineering data: think locally, act globally!," Empirical Software Engineering, vol. 20, no. 2, pp. 294–335, 2015.
[15] Q. Wang, S. Wu, and M. Li, "Software defect prediction," Journal of Software, vol. 19, no. 7, pp. 1565–1580, 2008, in Chinese.
[16] F. Akiyama, "An example of software system debugging," Information Processing, vol. 71, pp. 353–359, 1971.
[17] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. SE-2, no. 4, pp. 308–320, 1976.
[18] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.
[19] N. Nagappan and T. Ball, "Use of relative code churn measures to predict system defect density," in Proceedings of the 27th International Conference on Software Engineering (ICSE), pp. 284–292, Saint Louis, MO, USA, May 2005.
[20] A. Mockus and D. M. Weiss, "Predicting risk of software changes," Bell Labs Technical Journal, vol. 5, no. 2, pp. 169–180, 2000.
[21] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, "Deep learning for just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 17–26, Vancouver, BC, Canada, August 2015.
[22] X. Yang, D. Lo, X. Xia, and J. Sun, "TLEL: a two-layer ensemble learning approach for just-in-time defect prediction," Information and Software Technology, vol. 87, pp. 206–220, 2017.
[23] L. Pascarella, F. Palomba, and A. Bacchelli, "Fine-grained just-in-time defect prediction," Journal of Systems and Software, vol. 150, pp. 22–36, 2019.
[24] E. Arisholm, L. C. Briand, and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," Journal of Systems and Software, vol. 83, no. 1, pp. 2–17, 2010.
[25] Y. Kamei, S. Matsumoto, A. Monden, K. Matsumoto, B. Adams, and A. E. Hassan, "Revisiting common bug prediction findings using effort-aware models," in Proceedings of the 26th IEEE International Conference on Software Maintenance (ICSM), pp. 1–10, Shanghai, China, September 2010.
[26] Y. Zhou, B. Xu, H. Leung, and L. Chen, "An in-depth study of the potentially confounding effect of class size in fault prediction," ACM Transactions on Software Engineering and Methodology, vol. 23, no. 1, pp. 1–51, 2014.
[27] J. Liu, Y. Zhou, Y. Yang, H. Lu, and B. Xu, "Code churn: a neglected metric in effort-aware just-in-time defect prediction," in Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 11–19, Toronto, ON, Canada, November 2017.
[28] W. Fu and T. Menzies, "Revisiting unsupervised learning for defect prediction," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pp. 72–83, Paderborn, Germany, September 2017.
[29] Q. Huang, X. Xia, and D. Lo, "Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 159–170, Shanghai, China, September 2017.
[30] Q. Huang, X. Xia, and D. Lo, "Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction," Empirical Software Engineering, pp. 1–40, 2018.
[31] S. Herbold, A. Trautsch, and J. Grabowski, "Global vs. local models for cross-project defect prediction," Empirical Software Engineering, vol. 22, no. 4, pp. 1866–1902, 2017.
[32] M. E. Mezouar, F. Zhang, and Y. Zou, "Local versus global models for effort-aware defect prediction," in Proceedings of the 26th Annual International Conference on Computer Science and Software Engineering (CASCON), pp. 178–187, Toronto, ON, Canada, October 2016.
[33] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, Hoboken, NJ, USA, 1990.
[34] S. Hosseini, B. Turhan, and D. Gunarathna, "A systematic literature review and meta-analysis on cross project defect prediction," IEEE Transactions on Software Engineering, vol. 45, no. 2, pp. 111–147, 2019.
[35] J. Sliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?," Mining Software Repositories, vol. 30, no. 4, pp. 1–5, 2005.
[36] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[37] N. Cliff, Ordinal Methods for Behavioral Data Analysis, Psychology Press, London, UK, 2014.
[38] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek, "Appropriate statistics for ordinal level data: should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys?," in Annual Meeting of the Florida Association of Institutional Research, pp. 1–33, 2006.


41 Data Sets -e data sets used in experiment are widelyused for JIT-SDP which are shared by Kamei et al [7] -edata sets contain six open-source projects from differentapplication domains For example Bugzilla (BUG) is anopen-source defect tracking system Eclipse JDT (JDT) is afull-featured Java integrated development environmentplugin for the Eclipse platform Mozilla (MOZ) is a webbrowser and PostgreSQL (POS) is a database system [9]-ese data sets are extracted from CVS repositories ofprojects and corresponding bug reports [7] -e basic in-formation of the data sets is shown in Table 1

To predict whether a change is defective or not Kameiet al [7] design 14 metrics associated with software defectswhich can be grouped into five groups (ie diffusion sizepurpose history and experience)-e specific description ofthese metrics is shown in Table 2 A more detailed in-troduction to the metrics can be found in the literature [7]

42 Modeling Methods -ere are three modeling methodsthat are used in our experiment First in the process of using

local models the training set is divided into multipleclusters -erefore this process is completed by the k-medoids model Second the classification performance andeffort-aware prediction performance of local models andglobal models are investigated in our experiment -ereforewe use logistic regression and EALR to build classificationmodels and effort-aware prediction models respectivelywhich are used as baseline models in previous JIT-SDPstudies [7ndash9] Details of the three models are shown below

421 K-Medoids -e first step in local models is to use aclustering algorithm to divide all training data into severalhomogeneous regions In this process the experiment usesk-medoids as a clustering algorithm which is widely used indifferent clustering tasks Different from k-means the co-ordinate center of each cluster of k-medoids is a repre-sentative object rather than a centroid Hence k-medoids isless sensitive to the outliers -e objective function of k-medoids can be defined by using Equation (1) where p is asample in the cluster Ci and Oi is a medoid in the cluster Ci

Training data

Test dataPrediction result

Classifier

(a)

Training data Clustering algorithm

Cluster 1

Cluster i

Cluster n

Cluster model

Cluster number i

Classifier i

Prediction result

Test data

(b)

Figure 2 -e process of (a) global and (b) local models

Input training set D (x1 y1) (x2 y2) (xn yn)1113864 1113865 test set T (x1 y1) (x2 y2) (xm ym)1113864 1113865 the number of clusters kOutput prediction results R

(1) begin(2) R⟵[

(3) Divide the training set into k clusters(4) clusters cluster_centers k-medoids (D k)(5) Train a classifier for each cluster(6) classifiersmodeling_algorithm (clusters)(7) for Instance t in T do(8) Assign a cluster number i for each instance t(9) i calculate_min_distance (t cluster_centers)(10) r classifiers [i]predict (t) perform prediction(11) Rappend (r)(12) end(13) end

ALGORITHM 1 -e application process of local models for JIT-SDP

Scientific Programming 5

E 1113944k

i11113944

pϵCi

pminusOi

111386811138681113868111386811138681113868111386811138682 (1)

Among different k-medoids algorithms the partitioningaroundmedoids (PAM) algorithm is used in the experimentwhich was proposed by Kaufman el al [33] and is one of themost popularly used algorithms for k-medoids

422 Logistic Regression Similar to prior studies logisticregression is used to build classification models [7] Logisticregression is a widely used regression model that can be usedfor a variety of classification and regression tasks Assumingthat there are nmetrics for a change c logistic regression canbe expressed by using Equation (2) where w0 w1 wn

denote the coefficients of logistic regression and the in-dependent variables m1 mn represent the metrics of achange which are introduced in Table 2 -e output of themodel y(c) indicates the probability that a change is buggy

y(c) 1

1 + eminus w0+w1m1++wnmn( ) (2)

In our experiment logistic regression is used to solveclassification problems However the output of the model isa continuous value with a range of 0 to 1 -erefore for achange c if the value of y(c) is greater than 05 then c isclassified as buggy otherwise it is classified as clean -eprocess can be defined as follows

Y(c) 1 if y(c)gt 05

0 if y(c)le 051113896 (3)

423 Effort-Aware Linear Regression Arisholm et al [24]pointed out that the use of defect prediction models shouldconsider not only the classification performance of themodel but also the cost-effectiveness of code inspectionAfterwards a large number of studies focus on the effort-aware defect prediction [8 9 11 12] Effort-aware linearregression (EALR) was firstly proposed by Kamei et al [7]for solving effort-aware JIT-SDP which is based on linearregression models Different from the above logistic re-gression the dependent variable of EALR is Y(c)Effort(c)where Y(c) denotes the defect proneness of a change andEffort(c) denotes the effort required to code checking for thechange c According to the suggestion of Kamei et al [7] theeffort for a change can obtained by calculating the number ofmodified lines of code

4.3. Performance Indicators. There are four performance indicators used to evaluate and compare the performance of local and global models for JIT-SDP (i.e., F1, AUC, ACC, and Popt). Specifically, we use F1 and AUC to evaluate the classification performance of prediction models, and we use ACC and Popt to evaluate the effort-aware prediction performance of prediction models. The details are as follows.

Table 1: The basic information of data sets.

Project   Period                  Number of defective changes   Number of changes   Defect rate (%)
BUG       1998/08/26~2006/12/16   1696                          4620                36.71
COL       2002/11/25~2006/07/27   1361                          4455                30.55
JDT       2001/05/02~2007/12/31   5089                          35386               14.38
MOZ       2000/01/01~2006/12/17   5149                          98275               5.24
PLA       2001/05/02~2007/12/31   9452                          64250               14.71
POS       1996/07/09~2010/05/15   5119                          20431               25.06

Table 2: The description of metrics.

Dimension    Metric    Description
Diffusion    NS        Number of modified subsystems
             ND        Number of modified directories
             NF        Number of modified files
             Entropy   Distribution of modified code across each file
Size         LA        Lines of code added
             LD        Lines of code deleted
             LT        Lines of code in a file before the change
Purpose      FIX       Whether or not the change is a defect fix
History      NDEV      Number of developers that changed the files
             AGE       Average time interval between the last and the current change
             NUC       Number of unique last changes to the files
Experience   EXP       Developer experience
             REXP      Recent developer experience
             SEXP      Developer experience on a subsystem


The experiment first uses F1 and AUC to evaluate the classification performance of models. JIT-SDP technology aims to predict the defect proneness of a change, which is a classification task. The predicted results can be expressed as a confusion matrix, shown in Table 3, including true positive (TP), false negative (FN), false positive (FP), and true negative (TN).

For classification problems, precision (P) and recall (R) are often used to evaluate the prediction performance of the model:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}    (4)

However, the two indicators are usually contradictory in their results. In order to evaluate a prediction model comprehensively, we use F1 to evaluate the classification performance of the model, which is the harmonic mean of precision and recall, as shown in the following equation:

F1 = \frac{2 \times P \times R}{P + R}    (5)

The second indicator is AUC. Data class imbalance problems are often encountered in software defect prediction, and our data sets are also imbalanced (the number of buggy changes is much smaller than the number of clean changes). In the class-imbalance problem, threshold-based evaluation indicators (such as accuracy, recall rate, and F1) are sensitive to the threshold [4]. Therefore, a threshold-independent evaluation indicator, AUC, is used to evaluate the classification performance of a model. AUC (area under curve) is the area under the ROC (receiver operating characteristic) curve. The drawing process of the ROC curve is as follows. First, sort the instances in descending order of the probability that they are predicted to be positive, and then treat them as positive examples in turn. In each iteration, the true positive rate (TPR) and the false positive rate (FPR) are calculated as the ordinate and the abscissa, respectively, to obtain an ROC curve:

TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP}    (6)

The experiment uses ACC and Popt to evaluate the effort-aware prediction performance of local and global models. ACC and Popt are commonly used performance indicators which have been widely used in previous research [7-9, 25].

The calculation method of ACC and Popt is shown in Figure 3. In Figure 3, the x-axis and y-axis represent the cumulative percentage of code churn (i.e., the number of lines of code added and deleted) and of defect-inducing changes, respectively. To compute the values of ACC and Popt, three curves are included in Figure 3: the optimal model curve, the prediction model curve, and the worst model curve [8]. In the optimal model curve, the changes are sorted in descending order based on their actual defect densities. In the worst model curve, the changes are sorted in ascending order according to their actual defect densities. In the prediction model curve, the changes are sorted in descending order based on their predicted defect densities. Then, the areas between the three curves and the x-axis, namely Area(optimal), Area(m), and Area(worst), are calculated. ACC indicates the recall of defect-inducing changes when 20% of the effort is invested based on the prediction model curve. According to [7], Popt can be defined by the following equation:

P_{opt} = 1 - \frac{Area(optimal) - Area(m)}{Area(optimal) - Area(worst)}    (7)

4.4. Data Analysis Method. The experiment considers three scenarios to evaluate the performance of prediction models, namely 10 times 10-fold cross-validation, cross-project-validation, and timewise-cross-validation. The details are as follows.

4.4.1. 10 Times 10-Fold Cross-Validation. Cross-validation is performed within the same project, and the basic process of 10 times 10-fold cross-validation is as follows: first, the data set of a project is divided into 10 parts of equal size. Subsequently, nine parts are used to train a prediction model, and the remaining part is used as the test set. This process is repeated 10 times in order to reduce the randomness deviation. Therefore, 10 times 10-fold cross-validation produces 10 × 10 = 100 results.
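A compact way to realize this protocol is sketched below; it is an assumed illustration that uses scikit-learn's RepeatedStratifiedKFold (stratification is an assumption, not stated in the paper), with evaluate() as a hypothetical callback that trains and scores a model on one split.

    # Illustrative 10 times 10-fold cross-validation loop.
    from sklearn.model_selection import RepeatedStratifiedKFold

    def ten_times_ten_fold(X, y, evaluate):
        splitter = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
        results = []
        for train_idx, test_idx in splitter.split(X, y):
            results.append(evaluate(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
        return results   # 100 evaluation results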

Table 3: Confusion matrix.

                   Prediction result
Actual value       Positive    Negative
Positive           TP          FN
Negative           FP          TN

[Figure 3: The performance indicators of ACC and Popt. The x-axis is the cumulative percentage of code churn (0 to 100%), the y-axis is the cumulative percentage of defect-inducing changes (0 to 100%), and the plot shows the optimal model, prediction model, and worst model curves.]


4.4.2. Cross-Project-Validation. Cross-project defect prediction has received widespread attention in recent years [34]. The basic idea of cross-project defect prediction is to train a defect prediction model with the defect data set of one project to predict defects of another project [8]. Given data sets containing n projects, n × (n − 1) run results can be generated for a prediction model. The data sets used in our experiment contain six open-source projects, so 6 × (6 − 1) = 30 prediction results can be generated for a prediction model.

4.4.3. Timewise-Cross-Validation. Timewise-cross-validation is also performed within the same project, but it considers the chronological order of changes in the data sets [35]. Defect data sets are collected in chronological order during the actual software development process. Therefore, changes committed early in the data sets can be used to predict the defect proneness of changes committed later. In the timewise-cross-validation scenario, all changes in the data sets are arranged in chronological order, and the changes in the same month are grouped in the same group. Suppose the data sets are divided into n groups. Group i and group i + 1 are used to train a prediction model, and group i + 4 and group i + 5 are used as the test set. Therefore, assuming that a project contains n months of data, n − 5 run results can be generated for a prediction model.
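The sliding-window grouping described above can be sketched as follows; this is an assumed illustration (using pandas month periods, a hypothetical DataFrame layout with a commit_date column, and a hypothetical evaluate() callback), not the authors' implementation.

    # Illustrative timewise-cross-validation: train on months i and i+1, test on months i+4 and i+5.
    import pandas as pd

    def timewise_cv(data, evaluate):
        months = sorted(data["commit_date"].dt.to_period("M").unique())
        results = []
        for i in range(len(months) - 5):
            period = data["commit_date"].dt.to_period("M")
            train = data[period.isin(months[i:i + 2])]       # months i, i+1
            test = data[period.isin(months[i + 4:i + 6])]    # months i+4, i+5
            results.append(evaluate(train, test))
        return results   # n - 5 evaluation results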

The experiment uses the Wilcoxon signed-rank test [36] and Cliff's δ [37] to test the significance of differences in classification performance and prediction performance between local and global models. The Wilcoxon signed-rank test is a popular nonparametric statistical hypothesis test. We use p values to examine whether the difference in the performance of local and global models is statistically significant at a significance level of 0.05. The experiment also uses Cliff's δ to determine the magnitude of the difference in prediction performance between local models and global models. Usually, the magnitude of the difference can be divided into four levels, i.e., trivial (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), moderate (0.33 ≤ |δ| < 0.474), and large (≥ 0.474) [38]. In summary, when the p value is less than 0.05 and Cliff's δ is greater than or equal to 0.147, local models and global models have significant differences in prediction performance.
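The decision rule above can be expressed compactly; the sketch below uses scipy's Wilcoxon signed-rank test together with a direct pairwise computation of Cliff's δ (the helper itself is written for illustration, since scipy does not provide Cliff's δ).

    # Illustrative significance check: Wilcoxon signed-rank test (scipy) plus Cliff's delta.
    import numpy as np
    from scipy.stats import wilcoxon

    def cliffs_delta(a, b):
        """Cliff's delta: (#pairs with a > b minus #pairs with a < b) / (len(a) * len(b))."""
        a, b = np.asarray(a), np.asarray(b)
        greater = sum((x > b).sum() for x in a)
        less = sum((x < b).sum() for x in a)
        return (greater - less) / (len(a) * len(b))

    def significantly_different(local_scores, global_scores, alpha=0.05, min_delta=0.147):
        """True when p < 0.05 and |Cliff's delta| >= 0.147 (a non-trivial difference)."""
        _, p_value = wilcoxon(local_scores, global_scores)
        return p_value < alpha and abs(cliffs_delta(local_scores, global_scores)) >= min_delta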

4.5. Data Preprocessing. According to the suggestion of Kamei et al. [7], it is necessary to preprocess the data sets. Preprocessing includes the following three steps (a short sketch after the list illustrates them):

(1) Removing Highly Correlated Metrics. The ND and REXP metrics are excluded because NF and ND, as well as REXP and EXP, are highly correlated. LA and LD are normalized by dividing by LT, since LA and LD are highly correlated. LT and NUC are normalized by dividing by NF, since LT and NUC are highly correlated with NF.

(2) Logarithmic Transformation. Since most metrics are highly skewed, a logarithmic transformation is applied to each metric (except for FIX).

(3) Dealing with Class Imbalance. The data sets used in the experiment have the problem of class imbalance, i.e., the number of defect-inducing changes is far less than the number of clean changes. Therefore, we perform random undersampling on the training set: by randomly deleting clean changes, the number of clean changes becomes the same as the number of buggy changes. Note that the undersampling is not performed on the test set.
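These three steps can be sketched as follows; the column names and the DataFrame layout are assumptions made for illustration, and only the training portion would be passed through the undersampling step.

    # Illustrative preprocessing following Section 4.5; column names are assumed.
    import numpy as np
    import pandas as pd

    def preprocess(df):
        df = df.copy()
        df = df.drop(columns=["ND", "REXP"])               # step 1: drop correlated metrics
        df["LA"] = df["LA"] / df["LT"].replace(0, 1)       # normalize LA, LD by LT
        df["LD"] = df["LD"] / df["LT"].replace(0, 1)
        df["LT"] = df["LT"] / df["NF"].replace(0, 1)       # normalize LT, NUC by NF
        df["NUC"] = df["NUC"] / df["NF"].replace(0, 1)
        for col in df.columns.difference(["FIX", "bug"]):  # step 2: log transform
            df[col] = np.log1p(df[col])
        return df

    def undersample(df, label="bug", seed=0):
        """Step 3: randomly drop clean changes so both classes have equal size (training set only)."""
        buggy = df[df[label] == 1]
        clean = df[df[label] == 0].sample(n=len(buggy), random_state=seed)
        return pd.concat([buggy, clean]).sample(frac=1, random_state=seed)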

5. Experimental Results and Analysis

5.1. Analysis for RQ1. To investigate and compare the performance of local and global models in the 10 times 10-fold cross-validation scenario, we conduct a large-scale empirical study based on data sets from six open-source projects. The experimental results are shown in Table 4. As shown in Table 4, the first column represents the evaluation indicators for prediction models, including AUC, F1, ACC, and Popt; the second column denotes the project names of the six data sets; the third column shows the performance values of the global models on the four evaluation indicators; and from the fourth column to the twelfth column, the performance values of the local models at different k values are shown. Considering that the number of clusters k may affect the performance of local models, the value of the parameter k is tuned from 2 to 10. Since 100 prediction results are generated in the 10 times 10-fold cross-validation scenario, the values in the table are medians of the experimental results. We do the following processing on the data in Table 4:

(i) We performed the significance test defined in Section 4.4 on the performance of the local and global models. The performance values of local models that are significantly different from the performance values of the global models are bolded.

(ii) The performance values of the local models which are significantly better than those of the global models are given in italics.

(iii) The W/D/L values (i.e., the number of projects for which local models obtain a better, equal, or worse prediction performance than global models) are calculated under different k values and different evaluation indicators.

First, we analyze and compare the classification performance of local and global models. As shown in Table 4, the performance of local models is worse than that of global models on the AUC indicator for all parameters k and all projects. Moreover, in most cases, the performance of local models is worse than that of global models on the F1 indicator (except for four values in italics). Therefore, it can be concluded that global models perform better than local models in classification performance.

Second, we investigate and compare the effort-aware prediction performance of local and global models. As can be seen from Table 4, local models are valid for effort-aware JIT-SDP. However, local models have different effects on different projects and different parameters k.


First, local models have different effects on different data sets. To describe the effectiveness of local models on different data sets, we calculate the number of times that local models perform significantly better than global models over the 9 values of k and the two evaluation indicators (i.e., ACC and Popt) for each data set. The statistical results are shown in Table 5, where the second row is the number of times that local models perform better than global models for each project. It can be seen that local models have better performance on the projects MOZ and PLA and poor performance on the projects BUG and COL. Based on the number of times that local models outperform global models in the six projects, we rank the projects, as shown in the third row. In addition, we list the sizes of the data sets and their ranks, as shown in the fourth row. It can be seen that the performance of local models on a data set has a strong correlation with the size of the data set. Generally, the larger the size of the data set, the more significant the performance improvement of local models.

In view of this situation, we consider that if the size of the data set is too small when using local models, then, since the data set is divided into multiple clusters, the training samples of each cluster are further reduced, which leads to the performance degradation of the models. When the size of the data set is large, the performance of the model is not easily limited by the size of the data set, so the advantages of local models are better reflected.

In addition, the number of clusters k also has an impact on the performance of local models. As can be seen from the table, according to the W/D/L analysis, local models obtain the best ACC indicator when k is set to 2 and obtain the best Popt indicator when k is set to 2 or 6. Therefore, we recommend that, in the cross-validation scenario, when using local models to solve the effort-aware JIT-SDP problem, the parameter k be set to 2.

This section analyzes and compares the classification performance and effort-aware prediction performance of local and global models in the 10 times 10-fold cross-validation scenario. The empirical results show that, compared with global models, local models perform poorly in classification performance but perform better in effort-aware prediction performance. However, the performance of local models is affected by the parameter k and the size of the data sets. The experimental results show that local models obtain better effort-aware prediction performance on projects with larger data sets and when k is set to 2.

5.2. Analysis for RQ2. This section discusses the performance differences between local and global models in the cross-project-validation scenario for JIT-SDP. The results of the experiment are shown in Table 6, in which the first column indicates the four evaluation indicators, including AUC, F1,

Table 4: Local vs. global models in the 10 times 10-fold cross-validation scenario.

Indicator  Project  Global   k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
AUC        BUG      0.730    0.704  0.687  0.684  0.698  0.688  0.686  0.686  0.693  0.685
           COL      0.742    0.719  0.708  0.701  0.703  0.657  0.666  0.670  0.668  0.660
           JDT      0.725    0.719  0.677  0.670  0.675  0.697  0.671  0.674  0.671  0.654
           MOZ      0.769    0.540  0.741  0.715  0.703  0.730  0.730  0.725  0.731  0.722
           PLA      0.742    0.733  0.702  0.693  0.654  0.677  0.640  0.638  0.588  0.575
           POS      0.777    0.764  0.758  0.760  0.758  0.753  0.755  0.705  0.757  0.757
           W/D/L    -        0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6
F1         BUG      0.663    0.644  0.628  0.616  0.603  0.598  0.608  0.610  0.601  0.592
           COL      0.696    0.672  0.649  0.669  0.678  0.634  0.537  0.596  0.604  0.564
           JDT      0.723    0.706  0.764  0.757  0.734  0.695  0.615  0.643  0.732  0.722
           MOZ      0.825    0.374  0.853  0.758  0.785  0.741  0.718  0.737  0.731  0.721
           PLA      0.704    0.678  0.675  0.527  0.477  0.557  0.743  0.771  0.627  0.604
           POS      0.762    0.699  0.747  0.759  0.746  0.698  0.697  0.339  0.699  0.700
           W/D/L    -        0/0/6  2/0/4  1/1/4  0/1/5  0/0/6  0/1/5  1/0/5  0/1/5  0/1/5
ACC        BUG      0.421    0.452  0.471  0.458  0.424  0.451  0.450  0.463  0.446  0.462
           COL      0.553    0.458  0.375  0.253  0.443  0.068  0.195  0.459  0.403  0.421
           JDT      0.354    0.532  0.374  0.431  0.291  0.481  0.439  0.314  0.287  0.334
           MOZ      0.189    0.386  0.328  0.355  0.364  0.301  0.270  0.312  0.312  0.312
           PLA      0.368    0.472  0.509  0.542  0.497  0.509  0.456  0.442  0.456  0.495
           POS      0.336    0.490  0.397  0.379  0.457  0.442  0.406  0.341  0.400  0.383
           W/D/L    -        5/0/1  4/1/1  4/1/1  3/1/2  4/1/1  4/1/1  2/2/2  3/1/2  4/1/1
Popt       BUG      0.742    0.747  0.744  0.726  0.738  0.749  0.750  0.760  0.750  0.766
           COL      0.726    0.661  0.624  0.401  0.731  0.295  0.421  0.644  0.628  0.681
           JDT      0.660    0.779  0.625  0.698  0.568  0.725  0.708  0.603  0.599  0.634
           MOZ      0.555    0.679  0.626  0.645  0.659  0.588  0.589  0.643  0.634  0.630
           PLA      0.664    0.723  0.727  0.784  0.746  0.758  0.718  0.673  0.716  0.747
           POS      0.673    0.729  0.680  0.651  0.709  0.739  0.683  0.596  0.669  0.644
           W/D/L    -        4/1/1  2/2/2  2/1/3  3/2/1  4/1/1  2/3/1  1/2/3  1/3/2  2/1/3

W/D/L: the number of projects for which local models obtain a better, equal, or worse prediction performance than global models.


ACC, and Popt; the second column represents the performance values of the global models; and from the third column to the eleventh column, the prediction performance of local models under different parameters k is represented. In our experiment, six data sets are used in the cross-project-validation scenario. Therefore, each prediction model produces 30 prediction results. The values in Table 6 represent the medians of the prediction results.

For the values in Table 6, we do the following processing:

(i) Using the significance test method, the performance values of local models that are significantly different from those of global models are bolded.

(ii) The performance values of local models that are significantly better than those of global models are given in italics.

(iii) The W/D/L column analyzes the number of times that local models perform significantly better than global models at different parameters k.

First, we compare the difference in classification performance between local models and global models. As can be seen from Table 6, on the AUC and F1 indicators, the classification performance of local models under different parameters k is worse than or equal to that of global models. Therefore, we can conclude that, in the cross-project-validation scenario, local models perform worse than global models in classification performance.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 6 that the performance of local models is better than or equal to that of global models on the ACC and Popt indicators when the parameter k ranges from 2 to 10. Therefore, local models are valid for effort-aware JIT-SDP in the cross-project-validation scenario. In particular, when the number of clusters k is set to 2, local models obtain better ACC and Popt values than global models; when k is set to 5, local models obtain better ACC values. Therefore, it is appropriate to set k to 2, which can find 49.3% of defect-inducing changes when using 20% of the effort, increasing the ACC indicator by 57.0% and the Popt indicator by 16.4%.

In the cross-project-validation scenario, local models perform worse in classification performance but better in effort-aware prediction performance than global models. However, the setting of k in local models has an impact on the effort-aware prediction performance of local models. The empirical results show that when k is set to 2, local models obtain the optimal effort-aware prediction performance.

5.3. Analysis for RQ3. This section compares the prediction performance of local and global models in the timewise-cross-validation scenario. Since, in the timewise-cross-validation scenario, only two months of data in each data set are used to train prediction models, the size of the training sample in each cluster is small. Through experiments, we find that if the parameter k of local models is set too large, the number of training samples in each cluster may be too small or even equal to 0. Therefore, the experiment sets the parameter k of local models to 2. The experimental results are shown in Table 7. Suppose a project contains n months of data; then prediction models can produce n − 5 prediction results in the timewise-cross-validation scenario. Therefore, the third and fourth columns in Table 7 represent the median and standard deviation of the prediction results for global and local models. Based on the significance test method, the performance values of local models with significant differences from those of global models are bolded. The row W/D/L summarizes the number of projects for which local models obtain a better, equal, or worse performance than global models on each indicator.

First, we compare the classification performance of local and global models. It can be seen from Table 7 that local models perform worse on the AUC and F1 indicators than global models on all 6 data sets. Therefore, it can be concluded that local models perform poorly in classification performance compared to global models in the timewise-cross-validation scenario.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 7 that, on the projects BUG and COL, local models have worse ACC and Popt indicators than global models, while on the other four projects (JDT, MOZ, PLA, and POS), local models and global models have no significant difference in the ACC and Popt indicators. Therefore, local models perform worse in effort-aware prediction performance than global models in the timewise-cross-validation scenario. This conclusion is inconsistent with the conclusions in Sections 5.1 and 5.2. For this problem, we consider that the main reason lies in the size of the training sample. In the timewise-cross-validation scenario, only two months of data in the data sets are used to train prediction models (there are only 452 changes per month on average). When local models are used, since the training set is divided into several clusters, the number of training samples in each cluster is further decreased, which results in a decrease in the effort-aware prediction performance of local models.

In this section, we compare the classification performance and effort-aware prediction performance of local and global models in the timewise-cross-validation scenario. Empirical results show that local models perform worse than global models in both classification performance and effort-aware prediction performance. Therefore, we recommend

Table 5: The number of wins of local models at different k values.

Project                     BUG        COL        JDT         MOZ         PLA         POS
Count                       4          0          7           18          14          11
Rank                        5          6          4           1           2           3
Number of changes (rank)    4620 (5)   4455 (6)   35386 (3)   98275 (1)   64250 (2)   20431 (4)


using global models to solve the JIT-SDP problem in the timewise-cross-validation scenario.

6. Threats to Validity

6.1. External Validity. Although our experiment uses public data sets that have been extensively used in previous studies, we cannot yet guarantee that the findings of the experiment can be applied to all other change-level defect data sets. Therefore, the defect data sets of more projects should be mined in order to verify the generalizability of the experimental results.

6.2. Construct Validity. We use F1, AUC, ACC, and Popt to evaluate the prediction performance of local and global models. Because defect data sets are often imbalanced, AUC is more appropriate as a threshold-independent evaluation indicator to evaluate the classification performance of models [4]. In addition, we use F1, a threshold-based evaluation indicator, as a supplement to AUC. Besides, similar to prior studies [7, 9], the experiment uses ACC and Popt to evaluate the effort-aware prediction performance of models. However, we cannot guarantee that the experimental results are valid for all other evaluation indicators, such as precision, recall, Fβ, and PofB20.

6.3. Internal Validity. Internal validity involves errors in the experimental code. In order to reduce this risk, we carefully examined the code of the experiment and referred to the code published in previous studies [7, 8, 29], so as to improve the reliability of the experimental code.

7. Conclusions and Future Work

In this article, we first apply local models to JIT-SDP. Our study aims to compare the prediction performance of local and global models under cross-validation, cross-project-validation, and timewise-cross-validation for JIT-SDP. In order to compare the classification performance and effort-aware prediction performance of local and global models, we first build local models based on the k-medoids method. Afterwards, logistic regression and EALR are used to build classification models and effort-aware prediction models for local and global models. The empirical results show that local models perform worse in classification performance than global models in all three evaluation scenarios. Moreover, local models have worse effort-aware prediction performance in the timewise-cross-validation scenario. However, local models are significantly better than global models in effort-aware prediction performance in the cross-validation and cross-project-validation scenarios. Particularly, the optimal effort-aware prediction performance can be obtained when the parameter k of local models is set to 2. Therefore, local models are still promising in the context of effort-aware JIT-SDP.

In the future, first, we plan to consider more commercial projects to further verify the validity of the experimental conclusions. Second, logistic regression and EALR are used in local and global models; considering that the choice of modeling methods will affect the prediction performance of local and global models, we hope to add more baselines to local and global models to verify the generalizability of the experimental conclusions.

Data Availability

The experiment uses public data sets shared by Kamei et al. [7], and they have already published the download address of

Table 6: Local vs. global models in the cross-project-validation scenario.

Indicator   Global   k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10    W/D/L
AUC         0.706    0.683   0.681   0.674   0.673   0.673   0.668   0.659   0.680   0.674   0/0/9
F1          0.678    0.629   0.703   0.696   0.677   0.613   0.620   0.554   0.629   0.637   0/4/5
ACC         0.314    0.493   0.396   0.348   0.408   0.393   0.326   0.34    0.342   0.345   2/7/0
Popt        0.656    0.764   0.691   0.672   0.719   0.687   0.619   0.632   0.667   0.624   1/8/0

Table 7: Local vs. global models in the timewise-cross-validation scenario.

Indicator   Project   Global           Local (k = 2)
AUC         BUG       0.672 ± 0.125    0.545 ± 0.124
            COL       0.717 ± 0.075    0.636 ± 0.135
            JDT       0.708 ± 0.043    0.645 ± 0.073
            MOZ       0.749 ± 0.034    0.648 ± 0.128
            PLA       0.709 ± 0.054    0.642 ± 0.076
            POS       0.743 ± 0.072    0.702 ± 0.095
            W/D/L     -                0/0/6
F1          BUG       0.587 ± 0.108    0.515 ± 0.128
            COL       0.651 ± 0.066    0.599 ± 0.125
            JDT       0.722 ± 0.048    0.623 ± 0.161
            MOZ       0.804 ± 0.050    0.696 ± 0.190
            PLA       0.702 ± 0.053    0.618 ± 0.162
            POS       0.714 ± 0.065    0.657 ± 0.128
            W/D/L     -                0/0/6
ACC         BUG       0.357 ± 0.212    0.242 ± 0.210
            COL       0.426 ± 0.168    0.321 ± 0.186
            JDT       0.356 ± 0.148    0.329 ± 0.189
            MOZ       0.218 ± 0.132    0.242 ± 0.125
            PLA       0.358 ± 0.181    0.371 ± 0.196
            POS       0.345 ± 0.159    0.306 ± 0.205
            W/D/L     -                0/4/2
Popt        BUG       0.622 ± 0.194    0.458 ± 0.189
            COL       0.669 ± 0.150    0.553 ± 0.204
            JDT       0.654 ± 0.101    0.624 ± 0.142
            MOZ       0.537 ± 0.110    0.537 ± 0.122
            PLA       0.631 ± 0.115    0.645 ± 0.151
            POS       0.615 ± 0.131    0.583 ± 0.167
            W/D/L     -                0/4/2


the data sets in their paper. In addition, in order to verify the repeatability of our experiment and promote future research, all the experimental code in this article can be downloaded at https://github.com/yangxingguang/LocalJIT.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the NSF of China under Grant nos. 61772200 and 61702334, the Shanghai Pujiang Talent Program under Grant no. 17PJ1401900, the Shanghai Municipal Natural Science Foundation under Grant nos. 17ZR1406900 and 17ZR1429700, the Educational Research Fund of ECUST under Grant no. ZH1726108, and the Collaborative Innovation Foundation of Shanghai Institute of Technology under Grant no. XTCX2016-20.

References

[1] M. Newman, Software Errors Cost U.S. Economy $59.5 Billion Annually: NIST Assesses Technical Needs of Industry to Improve Software-Testing, 2002, http://www.abeacha.com/NIST_press_release_bugs_cost.htm.

[2] Z. Li, X.-Y. Jing, and X. Zhu, "Progress on approaches to software defect prediction," IET Software, vol. 12, no. 3, pp. 161–175, 2018.

[3] X. Chen, Q. Gu, W. Liu, S. Liu, and C. Ni, "Survey of static software defect prediction," Ruan Jian Xue Bao/Journal of Software, vol. 27, no. 1, 2016, in Chinese.

[4] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An empirical comparison of model validation techniques for defect prediction models," IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1–18, 2017.

[5] A. G. Koru, D. Zhang, K. El Emam, and H. Liu, "An investigation into the functional form of the size-defect relationship for software modules," IEEE Transactions on Software Engineering, vol. 35, no. 2, pp. 293–304, 2009.

[6] S. Kim, E. J. Whitehead, and Y. Zhang, "Classifying software changes: clean or buggy?," IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.

[7] Y. Kamei, E. Shihab, B. Adams et al., "A large-scale empirical study of just-in-time quality assurance," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, 2013.

[8] Y. Yang, Y. Zhou, J. Liu et al., "Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models," in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 157–168, Hong Kong, China, November 2016.

[9] X. Chen, Y. Zhao, Q. Wang, and Z. Yuan, "MULTI: multi-objective effort-aware just-in-time software defect prediction," Information and Software Technology, vol. 93, pp. 1–13, 2018.

[10] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi, and A. E. Hassan, "Studying just-in-time defect prediction using cross-project models," Empirical Software Engineering, vol. 21, no. 5, pp. 2072–2106, 2016.

[11] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. R. Cok, "Local vs. global models for effort estimation and defect prediction," in Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 343–351, Lawrence, KS, USA, November 2011.

[12] T. Menzies, A. Butcher, D. Cok et al., "Local versus global lessons for defect prediction and effort estimation," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 822–834, 2013.

[13] G. Scanniello, C. Gravino, A. Marcus, and T. Menzies, "Class level fault prediction using software clustering," in Proceedings of the 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 640–645, Silicon Valley, CA, USA, November 2013.

[14] N. Bettenburg, M. Nagappan, and A. E. Hassan, "Towards improving statistical modeling of software engineering data: think locally, act globally!," Empirical Software Engineering, vol. 20, no. 2, pp. 294–335, 2015.

[15] Q. Wang, S. Wu, and M. Li, "Software defect prediction," Journal of Software, vol. 19, no. 7, pp. 1565–1580, 2008, in Chinese.

[16] F. Akiyama, "An example of software system debugging," Information Processing, vol. 71, pp. 353–359, 1971.

[17] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. SE-2, no. 4, pp. 308–320, 1976.

[18] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.

[19] N. Nagappan and T. Ball, "Use of relative code churn measures to predict system defect density," in Proceedings of the 27th International Conference on Software Engineering (ICSE), pp. 284–292, Saint Louis, MO, USA, May 2005.

[20] A. Mockus and D. M. Weiss, "Predicting risk of software changes," Bell Labs Technical Journal, vol. 5, no. 2, pp. 169–180, 2000.

[21] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, "Deep learning for just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 17–26, Vancouver, BC, Canada, August 2015.

[22] X. Yang, D. Lo, X. Xia, and J. Sun, "TLEL: a two-layer ensemble learning approach for just-in-time defect prediction," Information and Software Technology, vol. 87, pp. 206–220, 2017.

[23] L. Pascarella, F. Palomba, and A. Bacchelli, "Fine-grained just-in-time defect prediction," Journal of Systems and Software, vol. 150, pp. 22–36, 2019.

[24] E. Arisholm, L. C. Briand, and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," Journal of Systems and Software, vol. 83, no. 1, pp. 2–17, 2010.

[25] Y. Kamei, S. Matsumoto, A. Monden, K. Matsumoto, B. Adams, and A. E. Hassan, "Revisiting common bug prediction findings using effort-aware models," in Proceedings of the 26th IEEE International Conference on Software Maintenance (ICSM), pp. 1–10, Shanghai, China, September 2010.

[26] Y. Zhou, B. Xu, H. Leung, and L. Chen, "An in-depth study of the potentially confounding effect of class size in fault prediction," ACM Transactions on Software Engineering and Methodology, vol. 23, no. 1, pp. 1–51, 2014.

[27] J. Liu, Y. Zhou, Y. Yang, H. Lu, and B. Xu, "Code churn: a neglected metric in effort-aware just-in-time defect prediction," in Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 11–19, Toronto, ON, Canada, November 2017.

[28] W. Fu and T. Menzies, "Revisiting unsupervised learning for defect prediction," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pp. 72–83, Paderborn, Germany, September 2017.

[29] Q. Huang, X. Xia, and D. Lo, "Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 159–170, Shanghai, China, September 2017.

[30] Q. Huang, X. Xia, and D. Lo, "Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction," Empirical Software Engineering, pp. 1–40, 2018.

[31] S. Herbold, A. Trautsch, and J. Grabowski, "Global vs. local models for cross-project defect prediction," Empirical Software Engineering, vol. 22, no. 4, pp. 1866–1902, 2017.

[32] M. E. Mezouar, F. Zhang, and Y. Zou, "Local versus global models for effort-aware defect prediction," in Proceedings of the 26th Annual International Conference on Computer Science and Software Engineering (CASCON), pp. 178–187, Toronto, ON, Canada, October 2016.

[33] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, Hoboken, NJ, USA, 1990.

[34] S. Hosseini, B. Turhan, and D. Gunarathna, "A systematic literature review and meta-analysis on cross project defect prediction," IEEE Transactions on Software Engineering, vol. 45, no. 2, pp. 111–147, 2019.

[35] J. Sliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?," Mining Software Repositories, vol. 30, no. 4, pp. 1–5, 2005.

[36] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.

[37] N. Cliff, Ordinal Methods for Behavioral Data Analysis, Psychology Press, London, UK, 2014.

[38] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek, "Appropriate statistics for ordinal level data: should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys?," in Annual Meeting of the Florida Association of Institutional Research, pp. 1–33, 2006.


E 1113944k

i11113944

pϵCi

pminusOi

111386811138681113868111386811138681113868111386811138682 (1)

Among different k-medoids algorithms the partitioningaroundmedoids (PAM) algorithm is used in the experimentwhich was proposed by Kaufman el al [33] and is one of themost popularly used algorithms for k-medoids

422 Logistic Regression Similar to prior studies logisticregression is used to build classification models [7] Logisticregression is a widely used regression model that can be usedfor a variety of classification and regression tasks Assumingthat there are nmetrics for a change c logistic regression canbe expressed by using Equation (2) where w0 w1 wn

denote the coefficients of logistic regression and the in-dependent variables m1 mn represent the metrics of achange which are introduced in Table 2 -e output of themodel y(c) indicates the probability that a change is buggy

y(c) 1

1 + eminus w0+w1m1++wnmn( ) (2)

In our experiment logistic regression is used to solveclassification problems However the output of the model isa continuous value with a range of 0 to 1 -erefore for achange c if the value of y(c) is greater than 05 then c isclassified as buggy otherwise it is classified as clean -eprocess can be defined as follows

Y(c) 1 if y(c)gt 05

0 if y(c)le 051113896 (3)

423 Effort-Aware Linear Regression Arisholm et al [24]pointed out that the use of defect prediction models shouldconsider not only the classification performance of themodel but also the cost-effectiveness of code inspectionAfterwards a large number of studies focus on the effort-aware defect prediction [8 9 11 12] Effort-aware linearregression (EALR) was firstly proposed by Kamei et al [7]for solving effort-aware JIT-SDP which is based on linearregression models Different from the above logistic re-gression the dependent variable of EALR is Y(c)Effort(c)where Y(c) denotes the defect proneness of a change andEffort(c) denotes the effort required to code checking for thechange c According to the suggestion of Kamei et al [7] theeffort for a change can obtained by calculating the number ofmodified lines of code

43 Performance Indicators -ere are four performanceindicators used to evaluate and compare the performance oflocal and global models for JIT-SDP (ie F1 AUC ACCand Popt) Specifically we use F1 and AUC to evaluate theclassification performance of prediction models and useACC and Popt to evaluate the effort-aware prediction per-formance of prediction models -e details are as follows

Table 1 -e basic information of data sets

Project Period Number of defective changes Number of changes defect rateBUG 19980826sim20061216 1696 4620 3671COL 20021125sim20060727 1361 4455 3055JDT 20010502sim20071231 5089 35386 1438MOZ 20000101sim20061217 5149 98275 524PLA 20010502sim20071231 9452 64250 1471POS 19960709sim20100515 5119 20431 2506

Table 2 -e description of metrics

Dimension Metric Description

Diffusion

NS Number of modified subsystemsND Number of modified directoriesNF Number of modified files

Entropy Distribution of modified code across each file

SizeLA Lines of code addedLD Lines of code deletedLT Lines of code in a file before the change

Purpose FIX Whether or not the change is a defect fix

History

NDEV Number of developers that changed the files

AGE Average time interval between the last and thecurrent change

NUC Number of unique last changes to the files

ExperienceEXP Developer experienceREXP Recent developer experienceSEXP Developer experience on a subsystem

6 Scientific Programming

-e experiment first uses F1 and AUC to evaluate theclassification performance of models JIT-SDP technologyaims to predict the defect proneness of a change which is aclassification task-e predicted results can be expressed as aconfusion matrix shown in Table 3 including true positive(TP) false negative (FN) false positive (FP) and truenegative (TN)

For classification problems precision (P) and recall (R)are often used to evaluate the prediction performance of themodel

P TP

TP + FP

R TP

TP + FN

(4)

However the two indicators are usually contradictory intheir results In order to evaluate a prediction modelcomprehensively we use F1 to evaluate the classificationperformance of the model which is a harmonic mean ofprecision and recall and is shown in the following equation

F1 2 times P times RP + R

(5)

-e second indicator is AUC Data class imbalanceproblems are often encountered in software defect pre-diction Our data sets are also imbalanced (the number ofbuggy changes is much less than clean changes) In the class-imbalanced problem threshold-based evaluation indicators(such as accuracy recall rate and F1) are sensitive to thethreshold [4] -erefore a threshold-independent evalua-tion indicator AUC is used to evaluate the classificationperformance of a model AUC (area under curve) is the areaunder the ROC (receiver operating characteristic) curve-edrawing process of the ROC curve is as follows First sort theinstances in a descending order of probability that is pre-dicted to be positive and then treat them as positive ex-amples in turn In each iteration the true positive rate (TPR)and the false positive rate (FPR) are calculated as the or-dinate and the abscissa respectively to obtain an ROCcurve

TPR TP

TP + FN

FPR FP

TN + FP

(6)

-e experiment uses ACC and Popt to evaluate the effort-aware prediction performance of local and global modelsACC and Popt are commonly used performance indicatorswhich have widely been used in previous research [7ndash9 25]

-e calculation method of ACC and Popt is shown inFigure 3 In Figure 3 x-axis and y-axis represent the cu-mulative percentage of code churn (ie the number of linesof code added and deleted) and defect-inducing changes Tocompute the values of ACC and Popt three curves are in-cluded in Figure 3 optimal model curve prediction modelcurve and worst model curve [8] In the optimal modelcurve the changes are sorted in a descending order based on

their actual defect densities In the worst model curve thechanges are sorted in the ascending order according to theiractual defect densities In the prediction model curve thechanges are sorted in the descending order based on theirpredicted defect densities -en the area between the threecurves and the x-axis respectively Area(optimal)Area(m) and Area(worst) are calculated ACC indicatesthe recall of defect-inducing changes when 20 of the effortis invested based on the prediction model curve Accordingto [7] Popt can be defined by the following equation

Popt 1minusarea(optimal)minus area(m)

area(optimal)minus optimal(worst) (7)

44 Data Analysis Method -e experiment considers threescenarios to evaluate the performance of prediction modelswhich are 10 times 10-fold cross-validation cross-project-validation and timewise-cross-validation respectively -edetails are as follows

441 10 Times 10-Fold Cross-Validation Cross-validation isperformed within the same project and the basic process of10 times 10-fold cross-validation is as follows first the datasets of a project are divided into 10 parts with equal sizeSubsequently nine parts of them are used to train a pre-diction model and the remaining part is used as the test set-is process is repeated 10 times in order to reduce therandomness deviation -erefore 10 times 10-fold cross-validation can produce 10 times 10 100 results

Table 3 Confusion matrix

Actual valuePrediction result

Positive NegativePositive TP FNNegative FP TN

Def

ect-i

nduc

ing

chan

ges (

)

100

80

60

40

20

0100806040200

Code churn ()

Optimal modelPrediction modelWorst model

Figure 3 -e performance indicators of ACC and Popt

Scientific Programming 7

442 Cross-Project-Validation Cross-project defect pre-diction has received widespread attention in recent years[34] -e basic idea of cross-project defect prediction is totrain a defect prediction model with the defect data sets of aproject to predict defects of another project [8] Given datasets containing n projects n times (nminus 1) run results can begenerated for a prediction model -e data sets used in ourexperiment contain six open-source projects so6 times (6minus 1) 30 prediction results can be generated for aprediction model

443 Timewise-Cross-Validation Timewise-cross-validationis also performed within the same project but it con-siders the chronological order of changes in the data sets[35] Defect data sets are collected in the chronological orderduring the actual software development process -ereforechanges committed early in the data sets can be used topredict the defect proneness of changes committed late Inthe timewise-cross-validation scenario all changes in thedata sets are arranged in the chronological order and thechanges in the same month are grouped in the same groupSuppose the data sets are divided into n groups Group i andgroup i + 1 are used to train a prediction model and thegroup i + 4 and group i + 5 are used as a test set -ereforeassuming that a project contains nmonths of data sets nminus 5run results can be generated for a prediction model

-e experiment uses the Wilcoxon signed-rank test [36]and Cliffrsquos δ [37] to test the significance of differences inclassification performance and prediction performance be-tween local and global models -e Wilcoxon signed-ranktest is a popular nonparametric statistical hypothesis testWe use p values to examine if the difference of the per-formance of local and global models is statistically significantat a significance level of 005-e experiment also uses Cliffrsquosδ to determine the magnitude of the difference in theprediction performance between local models and globalmodels Usually the magnitude of the difference can bedivided into four levels ie trivial ((|δ|lt 0147)) small(0147le |δ|lt 033) moderate (033le |δ|lt 0474) and large(ge0474) [38] In summary when the p value is less than005 and Cliff rsquos δ is greater than or equal to 0147 localmodels and global models have significant differences inprediction performance

45 Data Preprocessing According to suggestion of Kameiet al [7] it is necessary to preprocess the data sets It includesthe following three steps

(1) Removing Highly Correlated Metrics -e ND andREXP metrics are excluded because NF and NDREXP and EXP are highly correlated LA and LD arenormalized by dividing by LT since LA and LD arehighly correlated LT and NUC are normalized bydividing by NF since LT and NUC are highly cor-related with NF

(2) Logarithmic Transformation Since most metrics arehighly skewed each metric executes logarithmictransformation (except for fix)

(3) Dealing with Class Imbalance -e data sets used inthe experiment have the problem of class imbalanceie the number of defect-inducing changes is far lessthan the number of clean changes -erefore weperform random undersampling on the training setBy randomly deleting clean changes the number ofbuggy changes remains the same as the number ofclean changes Note that the undersampling is notperformed in the test set

5 Experimental Results and Analysis

51 Analysis for RQ1 To investigate and compare the per-formance of local and global models in the 10 times 10-foldcross-validation scenario we conduct a large-scale empiricalstudy based on data sets from six open-source projects -eexperimental results are shown in Table 4 As is shown inTable 4 the first column represents the evaluation indicatorsfor prediction models including AUC F1 ACC and Poptthe second column denotes the project names of six data setsthe third column shows the performance values of the globalmodels in the four evaluation indicators from the fourthcolumn to the twelfth column the performance values of thelocal models at different k values are shown Consideringthat the number of clusters k may affect the performance oflocal models the value of parameter k is tuned from 2 to 10Since 100 prediction results are generated in the 10 times 10-fold cross-validation scenario the values in the table aremedians of the experimental results We do the followingprocessing on the data in Table 4

(i) We performed the significance test defined inSection 44 on the performance of the local andglobal models -e performance values of localmodels that are significantly different from theperformance values of global model are bolded

(ii) -e performance values of the local models whichare significantly better than those of the globalmodels are given in italics

(iii) -e WDL values (ie the number of projects forwhich local models can obtain a better equal andworse prediction performance than global models)are calculated under different k values and differentevaluation indicators

First we analyze and compare the classification per-formance of local and global models As is shown in Table 4firstly all of the performance of local models with differentparameters k and different projects are worse than that ofglobal models in the AUC indicator Secondly in most casesthe performance of local models are worse than that of globalmodels (except for four values in italics) in the F1 indicator-erefore it can be concluded that global models performbetter than local models in the classification performance

Second we investigate and compare effort-aware pre-diction performance for local and global models As can beseen from Table 4 local models are valid for effort-awareJIT-SDP However local models have different effects ondifferent projects and different parameters k

8 Scientific Programming

First local models have different effects on differentdata sets To describe the effectiveness of local models ondifferent data sets we calculate the number of times thatlocal models perform significantly better than globalmodels on 9 k values and two evaluation indicators(ie ACC and Popt) for each data set -e statistical resultsare shown in Table 5 where the second row is the numberof times that local models perform better than globalmodels for each project It can be seen that local modelshave better performance on the projects MOZ and PLA andpoor performance on the projects BUG and COL Based onthe number of times that local models outperform globalmodels in six projects we rank the project as shown in thethird row In addition we list the sizes of data sets and theirrank as shown in the fourth row It can be seen that theperformance of local models on data sets has a strongcorrelation with the size of data sets Generally the largerthe size of the data set the more significant the perfor-mance improvement of local models

In view of this situation we consider that if the size of thedata set is too small when using local models since the datasets are divided into multiple clusters the training samplesof each cluster will be further reduced which will lead to theperformance degradation of models While the size of datasets is large the performance of the model is not easilylimited by the size of the data sets so that the advantages oflocal models are better reflected

In addition the number of clusters k also has an impacton the performance of local models As can be seen from thetable according to the WDL analysis local models canobtain the best ACC indicator when k is set to 2 and obtainthe best Popt indicator when k is set to 2 and 6 -erefore werecommend that in the cross-validation scenario whenusing local models to solve the effort-aware JIT-SDPproblem the parameter k can be set to 2

-is section analyzes and compares the classificationperformance and effort-aware prediction performance oflocal and global models in 10 times 10-fold cross-validationscenario -e empirical results show that compared withglobal models local models perform poorly in the classifi-cation performance but perform better in the effort-awareprediction performance However the performance of localmodels is affected by the parameter k and the size of the datasets -e experimental results show that local models canobtain better effort-aware prediction performance on theproject with larger size of data sets and can obtain bettereffort-aware prediction performance when k is set to 2

52Analysis forRQ2 -is section discusses the performancedifferences between local and global models in the cross-project-validation scenario for JIT-SDP -e results of theexperiment are shown in Table 6 in which the first columnindicates four evaluation indicators including AUC F1

Table 4 Local vs global models in the 10 times 10-fold cross-validation scenario

Indicator Project GlobalLocal

k 2 k 3 k 4 k 5 k 6 k 7 k 8 k 9 k 10

AUC

BUG 0730 0704 0687 0684 0698 0688 0686 0686 0693 0685COL 0742 0719 0708 0701 0703 0657 0666 0670 0668 0660JDT 0725 0719 0677 0670 0675 0697 0671 0674 0671 0654MOZ 0769 0540 0741 0715 0703 0730 0730 0725 0731 0722PLA 0742 0733 0702 0693 0654 0677 0640 0638 0588 0575POS 0777 0764 0758 0760 0758 0753 0755 0705 0757 0757

WDL mdash 006 006 006 006 006 006 006 006 006

F1

BUG 0663 0644 0628 0616 0603 0598 0608 0610 0601 0592COL 0696 0672 0649 0669 0678 0634 0537 0596 0604 0564JDT 0723 0706 0764 0757 0734 0695 0615 0643 0732 0722MOZ 0825 0374 0853 0758 0785 0741 0718 0737 0731 0721PLA 0704 0678 0675 0527 0477 0557 0743 0771 0627 0604POS 0762 0699 0747 0759 0746 0698 0697 0339 0699 0700

WDL mdash 006 204 114 015 006 015 105 015 015

ACC

BUG 0421 0452 0471 0458 0424 0451 0450 0463 0446 0462COL 0553 0458 0375 0253 0443 0068 0195 0459 0403 0421JDT 0354 0532 0374 0431 0291 0481 0439 0314 0287 0334MOZ 0189 0386 0328 0355 0364 0301 0270 0312 0312 0312PLA 0368 0472 0509 0542 0497 0509 0456 0442 0456 0495POS 0336 0490 0397 0379 0457 0442 0406 0341 0400 0383

WDL mdash 501 411 411 312 411 411 222 312 411

Popt

BUG 0742 055 0747 0744 0726 0738 0749 0750 0760 0750 0766COL 0726 0661 0624 0401 0731 0295 0421 0644 0628 0681JDT 0660 0779 0625 0698 0568 0725 0708 0603 0599 0634MOZ 0555 0679 0626 0645 0659 0588 0589 0643 0634 0630PLA 0664 0723 0727 0784 0746 0758 0718 0673 0716 0747POS 0673 0729 0680 0651 0709 0739 0683 0596 0669 0644

WDL mdash 411 222 213 321 411 231 123 132 213

Scientific Programming 9

ACC and Popt the second column represents the perfor-mance values of local models from the third column to theeleventh column the prediction performance of localmodels under different parameters k is represented In ourexperiment six data sets are used in the cross-project-validation scenario -erefore each prediction model pro-duces 30 prediction results -e values in Table 6 representthe median of the prediction results

For the values in Table 6 we do the following processing

(i) Using the significance test method the performancevalues of local models that are significantly differentfrom those of global models are bolded

(ii) -e performance values of local models that aresignificantly better than those of global models aregiven in italics

(iii) -e WDL column analyzes the number of timesthat local models perform significantly better thanglobal models at different parameters k

First we compare the difference in classification per-formance between local models and global models As can beseen from Table 6 in the AUC and F1 indicators theclassification performance of local models under differentparameter k is worse than or equal to that of global models-erefore we can conclude that in the cross-project-validation scenario local models perform worse thanglobal models in the classification performance

Second we compare the effort-aware prediction per-formance of local and global models It can be seen fromTable 6 that the performance of local models is better than orequal to that of global models in the ACC and Popt indicatorswhen the parameter k is from 2 to 10-erefore local modelsare valid for effort-aware JIT-SDP in the cross-validationscenario In particular when the number k of clusters is setto 2 local models can obtain better ACC values and Poptvalues than global models when k is set to 5 local modelscan obtain better ACC values -erefore it is appropriate toset k to 2 which can find 493 defect-inducing changeswhen using 20 effort and increase the ACC indicator by570 and increase the Popt indicator by 164

In the cross-project-validation scenario local modelsperform worse in the classification performance but performbetter in the effort-aware prediction performance thanglobal models However the setting of k in local models hasan impact on the effort-aware prediction performance forlocal models-e empirical results show that when k is set to2 local models can obtain the optimal effort-aware pre-diction performance

53 Analysis for RQ3 -is section compares the predictionperformance of local and global models in the timewise-

cross-validation scenario Since in the timewise-cross-validation scenario only two months of data in each datasets are used to train prediction models the size of thetraining sample in each cluster is small -rough experi-ment we find that if the parameter k of local models is settoo large the number of training samples in each clustermay be too small or even equal to 0 -erefore the ex-periment sets the parameter k of local models to 2 -eexperimental results are shown in Table 7 Suppose aproject contains n months of data and prediction modelscan produce nminus 5 prediction results in the timewise-cross-validation scenario -erefore the third and fourth col-umns in Table 7 represent the median and standard de-viation of prediction results for global and local modelsBased on the significance test method the performancevalues of local models with significant differences fromthose of global models are bold -e row WDL sum-marizes the number of projects for which local models canobtain a better equal and worse performance than globalmodels in each indicator

First, we compare the classification performance of local and global models. It can be seen from Table 7 that local models perform worse than global models on the AUC and F1 indicators for all six data sets. Therefore, it can be concluded that, in the timewise-cross-validation scenario, local models perform poorly in classification performance compared to global models.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 7 that, in the projects BUG and COL, local models have worse ACC and Popt indicators than global models, while in the other four projects (JDT, MOZ, PLA, and POS) local and global models show no significant difference in the ACC and Popt indicators. Therefore, local models perform worse in effort-aware prediction performance than global models in the timewise-cross-validation scenario. This conclusion is inconsistent with the conclusions in Sections 5.1 and 5.2. We consider that the main reason lies in the size of the training sample: in the timewise-cross-validation scenario, only two months of data are used to train prediction models (there are only 452 changes per month on average). When local models are used, the training set is divided into several clusters, so the number of training samples in each cluster is further decreased, which results in a decrease in the effort-aware prediction performance of local models.

In this section, we compare the classification performance and effort-aware prediction performance of local and global models in the timewise-cross-validation scenario. Empirical results show that local models perform worse than global models in both the classification performance and the effort-aware prediction performance. Therefore, we recommend using global models to solve the JIT-SDP problem in the timewise-cross-validation scenario.

Table 5. The number of wins of local models at different k values.

Project                    BUG       COL       JDT        MOZ        PLA        POS
Count                      4         0         7          18         14         11
Rank                       5         6         4          1          2          3
Number of changes (rank)   4620 (5)  4455 (6)  35386 (3)  98275 (1)  64250 (2)  20431 (4)



6. Threats to Validity

6.1. External Validity. Although our experiment uses public data sets that have been extensively used in previous studies, we cannot yet guarantee that the findings of the experiment apply to all other change-level defect data sets. Therefore, the defect data sets of more projects should be mined in order to verify the generalization of the experimental results.

6.2. Construct Validity. We use F1, AUC, ACC, and Popt to evaluate the prediction performance of local and global models. Because defect data sets are often imbalanced, AUC is more appropriate as a threshold-independent evaluation indicator to evaluate the classification performance of models [4]. In addition, we use F1, a threshold-based evaluation indicator, as a supplement to AUC. Besides, similar to prior studies [7, 9], the experiment uses ACC and Popt to evaluate the effort-aware prediction performance of models. However, we cannot guarantee that the experimental results remain valid for all other evaluation indicators, such as precision, recall, Fβ, and PofB20.
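As a reminder of how the effort-aware indicator is obtained, the sketch below restates ACC as defined in Section 4.3: changes are ranked by predicted defect density, and the recall of defect-inducing changes is measured once 20% of the total code churn has been inspected. This is our paraphrase rather than the authors' exact implementation, so tie handling and the budget cutoff may differ.

```python
import numpy as np

def acc_at_20_percent(y_true, churn, predicted_density, effort_fraction=0.2):
    """Recall of defect-inducing changes within the first 20% of code churn,
    inspecting changes in descending order of predicted defect density."""
    order = np.argsort(-np.asarray(predicted_density, dtype=float))
    churn_sorted = np.asarray(churn, dtype=float)[order]
    labels_sorted = np.asarray(y_true)[order]
    budget = effort_fraction * churn_sorted.sum()
    inspected = np.cumsum(churn_sorted) <= budget
    total_defective = labels_sorted.sum()
    return labels_sorted[inspected].sum() / total_defective if total_defective else 0.0
```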

6.3. Internal Validity. Internal validity involves errors in the experimental code. In order to reduce this risk, we carefully examined the code of the experiment and referred to the code published in previous studies [7, 8, 29] so as to improve the reliability of the experiment code.

7. Conclusions and Future Work

In this article, we apply local models to JIT-SDP for the first time. Our study aims to compare the prediction performance of local and global models under cross-validation, cross-project-validation, and timewise-cross-validation for JIT-SDP. In order to compare the classification performance and effort-aware prediction performance of local and global models, we first build local models based on the k-medoids method. Afterwards, logistic regression and EALR are used to build classification models and effort-aware prediction models for local and global models. The empirical results show that local models perform worse than global models in classification performance in all three evaluation scenarios. Moreover, local models have worse effort-aware prediction performance in the timewise-cross-validation scenario. However, local models are significantly better than global models in effort-aware prediction performance in the cross-validation and cross-project-validation scenarios. Particularly, the optimal effort-aware prediction performance can be obtained when the parameter k of local models is set to 2. Therefore, local models are still promising in the context of effort-aware JIT-SDP.

In the future, first, we plan to consider more commercial projects to further verify the validity of the experimental conclusions. Second, only logistic regression and EALR are used in local and global models. Considering that the choice of modeling methods will affect the prediction performance of local and global models, we hope to add more baselines to local and global models to verify the generalization of the experimental conclusions.

Table 6. Local vs. global models in the cross-project-validation scenario.

Indicator   Global   k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10    W/D/L
AUC         0.706    0.683   0.681   0.674   0.673   0.673   0.668   0.659   0.680   0.674   0/0/9
F1          0.678    0.629   0.703   0.696   0.677   0.613   0.620   0.554   0.629   0.637   0/4/5
ACC         0.314    0.493   0.396   0.348   0.408   0.393   0.326   0.340   0.342   0.345   2/7/0
Popt        0.656    0.764   0.691   0.672   0.719   0.687   0.619   0.632   0.667   0.624   1/8/0

(Columns k=2 to k=10 give the local models with different numbers of clusters; W/D/L counts the k values for which the local model wins, draws, and loses against the global model.)

Table 7. Local vs. global models in the timewise-cross-validation scenario.

Indicator   Project   Global            Local (k=2)
AUC         BUG       0.672 ± 0.125     0.545 ± 0.124
            COL       0.717 ± 0.075     0.636 ± 0.135
            JDT       0.708 ± 0.043     0.645 ± 0.073
            MOZ       0.749 ± 0.034     0.648 ± 0.128
            PLA       0.709 ± 0.054     0.642 ± 0.076
            POS       0.743 ± 0.072     0.702 ± 0.095
            W/D/L     -                 0/0/6
F1          BUG       0.587 ± 0.108     0.515 ± 0.128
            COL       0.651 ± 0.066     0.599 ± 0.125
            JDT       0.722 ± 0.048     0.623 ± 0.161
            MOZ       0.804 ± 0.050     0.696 ± 0.190
            PLA       0.702 ± 0.053     0.618 ± 0.162
            POS       0.714 ± 0.065     0.657 ± 0.128
            W/D/L     -                 0/0/6
ACC         BUG       0.357 ± 0.212     0.242 ± 0.210
            COL       0.426 ± 0.168     0.321 ± 0.186
            JDT       0.356 ± 0.148     0.329 ± 0.189
            MOZ       0.218 ± 0.132     0.242 ± 0.125
            PLA       0.358 ± 0.181     0.371 ± 0.196
            POS       0.345 ± 0.159     0.306 ± 0.205
            W/D/L     -                 0/4/2
Popt        BUG       0.622 ± 0.194     0.458 ± 0.189
            COL       0.669 ± 0.150     0.553 ± 0.204
            JDT       0.654 ± 0.101     0.624 ± 0.142
            MOZ       0.537 ± 0.110     0.537 ± 0.122
            PLA       0.631 ± 0.115     0.645 ± 0.151
            POS       0.615 ± 0.131     0.583 ± 0.167
            W/D/L     -                 0/4/2

(Values are median ± standard deviation; W/D/L counts the projects for which the local model wins, draws, and loses against the global model.)

Data Availability

The experiment uses public data sets shared by Kamei et al. [7], and they have already published the download address of the data sets in their paper. In addition, in order to verify the repeatability of our experiment and promote future research, all the experimental code in this article can be downloaded at https://github.com/yangxingguang/LocalJIT.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the NSF of China under Grant nos. 61772200 and 61702334; the Shanghai Pujiang Talent Program under Grant no. 17PJ1401900; the Shanghai Municipal Natural Science Foundation under Grant nos. 17ZR1406900 and 17ZR1429700; the Educational Research Fund of ECUST under Grant no. ZH1726108; and the Collaborative Innovation Foundation of Shanghai Institute of Technology under Grant no. XTCX2016-20.

References

[1] M. Newman, Software Errors Cost U.S. Economy $59.5 Billion Annually: NIST Assesses Technical Needs of Industry to Improve Software-Testing, 2002, http://www.abeacha.com/NIST_press_release_bugs_cost.htm.

[2] Z. Li, X.-Y. Jing, and X. Zhu, "Progress on approaches to software defect prediction," IET Software, vol. 12, no. 3, pp. 161–175, 2018.

[3] X. Chen, Q. Gu, W. Liu, S. Liu, and C. Ni, "Survey of static software defect prediction," Ruan Jian Xue Bao/Journal of Software, vol. 27, no. 1, 2016, in Chinese.

[4] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An empirical comparison of model validation techniques for defect prediction models," IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1–18, 2017.

[5] A. G. Koru, D. Zhang, K. El Emam, and H. Liu, "An investigation into the functional form of the size-defect relationship for software modules," IEEE Transactions on Software Engineering, vol. 35, no. 2, pp. 293–304, 2009.

[6] S. Kim, E. J. Whitehead, and Y. Zhang, "Classifying software changes: clean or buggy?," IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.

[7] Y. Kamei, E. Shihab, B. Adams et al., "A large-scale empirical study of just-in-time quality assurance," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, 2013.

[8] Y. Yang, Y. Zhou, J. Liu et al., "Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models," in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 157–168, Hong Kong, China, November 2016.

[9] X. Chen, Y. Zhao, Q. Wang, and Z. Yuan, "MULTI: multi-objective effort-aware just-in-time software defect prediction," Information and Software Technology, vol. 93, pp. 1–13, 2018.

[10] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi, and A. E. Hassan, "Studying just-in-time defect prediction using cross-project models," Empirical Software Engineering, vol. 21, no. 5, pp. 2072–2106, 2016.

[11] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. R. Cok, "Local vs. global models for effort estimation and defect prediction," in Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 343–351, Lawrence, KS, USA, November 2011.

[12] T. Menzies, A. Butcher, D. Cok et al., "Local versus global lessons for defect prediction and effort estimation," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 822–834, 2013.

[13] G. Scanniello, C. Gravino, A. Marcus, and T. Menzies, "Class level fault prediction using software clustering," in Proceedings of the 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 640–645, Silicon Valley, CA, USA, November 2013.

[14] N. Bettenburg, M. Nagappan, and A. E. Hassan, "Towards improving statistical modeling of software engineering data: think locally, act globally!," Empirical Software Engineering, vol. 20, no. 2, pp. 294–335, 2015.

[15] Q. Wang, S. Wu, and M. Li, "Software defect prediction," Journal of Software, vol. 19, no. 7, pp. 1565–1580, 2008, in Chinese.

[16] F. Akiyama, "An example of software system debugging," Information Processing, vol. 71, pp. 353–359, 1971.

[17] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. SE-2, no. 4, pp. 308–320, 1976.

[18] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.

[19] N. Nagappan and T. Ball, "Use of relative code churn measures to predict system defect density," in Proceedings of the 27th International Conference on Software Engineering (ICSE), pp. 284–292, Saint Louis, MO, USA, May 2005.

[20] A. Mockus and D. M. Weiss, "Predicting risk of software changes," Bell Labs Technical Journal, vol. 5, no. 2, pp. 169–180, 2000.

[21] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, "Deep learning for just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 17–26, Vancouver, BC, Canada, August 2015.

[22] X. Yang, D. Lo, X. Xia, and J. Sun, "TLEL: a two-layer ensemble learning approach for just-in-time defect prediction," Information and Software Technology, vol. 87, pp. 206–220, 2017.

[23] L. Pascarella, F. Palomba, and A. Bacchelli, "Fine-grained just-in-time defect prediction," Journal of Systems and Software, vol. 150, pp. 22–36, 2019.

[24] E. Arisholm, L. C. Briand, and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," Journal of Systems and Software, vol. 83, no. 1, pp. 2–17, 2010.

[25] Y. Kamei, S. Matsumoto, A. Monden, K. Matsumoto, B. Adams, and A. E. Hassan, "Revisiting common bug prediction findings using effort-aware models," in Proceedings of the 26th IEEE International Conference on Software Maintenance (ICSM), pp. 1–10, Shanghai, China, September 2010.

[26] Y. Zhou, B. Xu, H. Leung, and L. Chen, "An in-depth study of the potentially confounding effect of class size in fault prediction," ACM Transactions on Software Engineering and Methodology, vol. 23, no. 1, pp. 1–51, 2014.

[27] J. Liu, Y. Zhou, Y. Yang, H. Lu, and B. Xu, "Code churn: a neglected metric in effort-aware just-in-time defect prediction," in Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 11–19, Toronto, ON, Canada, November 2017.

[28] W. Fu and T. Menzies, "Revisiting unsupervised learning for defect prediction," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pp. 72–83, Paderborn, Germany, September 2017.

[29] Q. Huang, X. Xia, and D. Lo, "Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 159–170, Shanghai, China, September 2017.

[30] Q. Huang, X. Xia, and D. Lo, "Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction," Empirical Software Engineering, pp. 1–40, 2018.

[31] S. Herbold, A. Trautsch, and J. Grabowski, "Global vs. local models for cross-project defect prediction," Empirical Software Engineering, vol. 22, no. 4, pp. 1866–1902, 2017.

[32] M. E. Mezouar, F. Zhang, and Y. Zou, "Local versus global models for effort-aware defect prediction," in Proceedings of the 26th Annual International Conference on Computer Science and Software Engineering (CASCON), pp. 178–187, Toronto, ON, Canada, October 2016.

[33] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, Hoboken, NJ, USA, 1990.

[34] S. Hosseini, B. Turhan, and D. Gunarathna, "A systematic literature review and meta-analysis on cross project defect prediction," IEEE Transactions on Software Engineering, vol. 45, no. 2, pp. 111–147, 2019.

[35] J. Sliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?," Mining Software Repositories, vol. 30, no. 4, pp. 1–5, 2005.

[36] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.

[37] N. Cliff, Ordinal Methods for Behavioral Data Analysis, Psychology Press, London, UK, 2014.

[38] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek, "Appropriate statistics for ordinal level data: should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys?," in Annual Meeting of the Florida Association of Institutional Research, pp. 1–33, 2006.




Page 8: LocalversusGlobalModelsforJust-In-TimeSoftware ...downloads.hindawi.com/journals/sp/2019/2384706.pdf · Just-in-time software defect prediction (JIT-SDP) is an active topic in software

442 Cross-Project-Validation Cross-project defect pre-diction has received widespread attention in recent years[34] -e basic idea of cross-project defect prediction is totrain a defect prediction model with the defect data sets of aproject to predict defects of another project [8] Given datasets containing n projects n times (nminus 1) run results can begenerated for a prediction model -e data sets used in ourexperiment contain six open-source projects so6 times (6minus 1) 30 prediction results can be generated for aprediction model

443 Timewise-Cross-Validation Timewise-cross-validationis also performed within the same project but it con-siders the chronological order of changes in the data sets[35] Defect data sets are collected in the chronological orderduring the actual software development process -ereforechanges committed early in the data sets can be used topredict the defect proneness of changes committed late Inthe timewise-cross-validation scenario all changes in thedata sets are arranged in the chronological order and thechanges in the same month are grouped in the same groupSuppose the data sets are divided into n groups Group i andgroup i + 1 are used to train a prediction model and thegroup i + 4 and group i + 5 are used as a test set -ereforeassuming that a project contains nmonths of data sets nminus 5run results can be generated for a prediction model

-e experiment uses the Wilcoxon signed-rank test [36]and Cliffrsquos δ [37] to test the significance of differences inclassification performance and prediction performance be-tween local and global models -e Wilcoxon signed-ranktest is a popular nonparametric statistical hypothesis testWe use p values to examine if the difference of the per-formance of local and global models is statistically significantat a significance level of 005-e experiment also uses Cliffrsquosδ to determine the magnitude of the difference in theprediction performance between local models and globalmodels Usually the magnitude of the difference can bedivided into four levels ie trivial ((|δ|lt 0147)) small(0147le |δ|lt 033) moderate (033le |δ|lt 0474) and large(ge0474) [38] In summary when the p value is less than005 and Cliff rsquos δ is greater than or equal to 0147 localmodels and global models have significant differences inprediction performance

45 Data Preprocessing According to suggestion of Kameiet al [7] it is necessary to preprocess the data sets It includesthe following three steps

(1) Removing Highly Correlated Metrics -e ND andREXP metrics are excluded because NF and NDREXP and EXP are highly correlated LA and LD arenormalized by dividing by LT since LA and LD arehighly correlated LT and NUC are normalized bydividing by NF since LT and NUC are highly cor-related with NF

(2) Logarithmic Transformation Since most metrics arehighly skewed each metric executes logarithmictransformation (except for fix)

(3) Dealing with Class Imbalance -e data sets used inthe experiment have the problem of class imbalanceie the number of defect-inducing changes is far lessthan the number of clean changes -erefore weperform random undersampling on the training setBy randomly deleting clean changes the number ofbuggy changes remains the same as the number ofclean changes Note that the undersampling is notperformed in the test set

5 Experimental Results and Analysis

51 Analysis for RQ1 To investigate and compare the per-formance of local and global models in the 10 times 10-foldcross-validation scenario we conduct a large-scale empiricalstudy based on data sets from six open-source projects -eexperimental results are shown in Table 4 As is shown inTable 4 the first column represents the evaluation indicatorsfor prediction models including AUC F1 ACC and Poptthe second column denotes the project names of six data setsthe third column shows the performance values of the globalmodels in the four evaluation indicators from the fourthcolumn to the twelfth column the performance values of thelocal models at different k values are shown Consideringthat the number of clusters k may affect the performance oflocal models the value of parameter k is tuned from 2 to 10Since 100 prediction results are generated in the 10 times 10-fold cross-validation scenario the values in the table aremedians of the experimental results We do the followingprocessing on the data in Table 4

(i) We performed the significance test defined inSection 44 on the performance of the local andglobal models -e performance values of localmodels that are significantly different from theperformance values of global model are bolded

(ii) -e performance values of the local models whichare significantly better than those of the globalmodels are given in italics

(iii) -e WDL values (ie the number of projects forwhich local models can obtain a better equal andworse prediction performance than global models)are calculated under different k values and differentevaluation indicators

First we analyze and compare the classification per-formance of local and global models As is shown in Table 4firstly all of the performance of local models with differentparameters k and different projects are worse than that ofglobal models in the AUC indicator Secondly in most casesthe performance of local models are worse than that of globalmodels (except for four values in italics) in the F1 indicator-erefore it can be concluded that global models performbetter than local models in the classification performance

Second we investigate and compare effort-aware pre-diction performance for local and global models As can beseen from Table 4 local models are valid for effort-awareJIT-SDP However local models have different effects ondifferent projects and different parameters k


First, local models have different effects on different data sets. To describe the effectiveness of local models on different data sets, we calculate, for each data set, the number of times that local models perform significantly better than global models over the nine k values and the two evaluation indicators (i.e., ACC and Popt). The statistical results are shown in Table 5, where the second row is the number of times that local models perform better than global models for each project. It can be seen that local models have better performance on the projects MOZ and PLA and poor performance on the projects BUG and COL. Based on the number of times that local models outperform global models in the six projects, we rank the projects as shown in the third row. In addition, we list the sizes of the data sets and their ranks in the fourth row. It can be seen that the performance of local models on a data set has a strong correlation with the size of the data set: generally, the larger the data set, the more significant the performance improvement of local models.
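As a quick illustration (our own check, not part of the original analysis), the rank correlation suggested by Table 5 can be computed directly from the win counts and project sizes listed there:

```python
# Spearman rank correlation between the win counts of local models (Table 5)
# and the number of changes per project.
from scipy.stats import spearmanr

wins = {"BUG": 4, "COL": 0, "JDT": 7, "MOZ": 18, "PLA": 14, "POS": 11}
changes = {"BUG": 4620, "COL": 4455, "JDT": 35386, "MOZ": 98275, "PLA": 64250, "POS": 20431}

rho, pval = spearmanr([wins[p] for p in wins], [changes[p] for p in wins])
print(f"Spearman rho = {rho:.3f}, p = {pval:.3f}")
```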

In view of this situation, we consider that, if the size of the data set is too small, then when local models are used the training samples of each cluster are further reduced, because the data set is divided into multiple clusters, which leads to performance degradation of the models. When the size of the data set is large, the performance of the model is not easily limited by the size of the data set, so the advantages of local models are better reflected.

In addition, the number of clusters k also has an impact on the performance of local models. According to the W/D/L analysis in Table 4, local models obtain the best ACC indicator when k is set to 2 and obtain the best Popt indicator when k is set to 2 or 6. Therefore, we recommend that, in the cross-validation scenario, the parameter k be set to 2 when using local models to solve the effort-aware JIT-SDP problem.

This section analyzes and compares the classification performance and effort-aware prediction performance of local and global models in the 10 × 10-fold cross-validation scenario. The empirical results show that, compared with global models, local models perform poorly in classification performance but perform better in effort-aware prediction performance. However, the performance of local models is affected by the parameter k and the size of the data sets. The experimental results show that local models obtain better effort-aware prediction performance on projects with larger data sets and when k is set to 2.

5.2. Analysis for RQ2. This section discusses the performance differences between local and global models in the cross-project-validation scenario for JIT-SDP. The results of the experiment are shown in Table 6.

Table 4: Local vs. global models in the 10 × 10-fold cross-validation scenario.

Indicator  Project  Global  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
AUC        BUG      0.730   0.704  0.687  0.684  0.698  0.688  0.686  0.686  0.693  0.685
AUC        COL      0.742   0.719  0.708  0.701  0.703  0.657  0.666  0.670  0.668  0.660
AUC        JDT      0.725   0.719  0.677  0.670  0.675  0.697  0.671  0.674  0.671  0.654
AUC        MOZ      0.769   0.540  0.741  0.715  0.703  0.730  0.730  0.725  0.731  0.722
AUC        PLA      0.742   0.733  0.702  0.693  0.654  0.677  0.640  0.638  0.588  0.575
AUC        POS      0.777   0.764  0.758  0.760  0.758  0.753  0.755  0.705  0.757  0.757
AUC        W/D/L    —       0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6  0/0/6
F1         BUG      0.663   0.644  0.628  0.616  0.603  0.598  0.608  0.610  0.601  0.592
F1         COL      0.696   0.672  0.649  0.669  0.678  0.634  0.537  0.596  0.604  0.564
F1         JDT      0.723   0.706  0.764  0.757  0.734  0.695  0.615  0.643  0.732  0.722
F1         MOZ      0.825   0.374  0.853  0.758  0.785  0.741  0.718  0.737  0.731  0.721
F1         PLA      0.704   0.678  0.675  0.527  0.477  0.557  0.743  0.771  0.627  0.604
F1         POS      0.762   0.699  0.747  0.759  0.746  0.698  0.697  0.339  0.699  0.700
F1         W/D/L    —       0/0/6  2/0/4  1/1/4  0/1/5  0/0/6  0/1/5  1/0/5  0/1/5  0/1/5
ACC        BUG      0.421   0.452  0.471  0.458  0.424  0.451  0.450  0.463  0.446  0.462
ACC        COL      0.553   0.458  0.375  0.253  0.443  0.068  0.195  0.459  0.403  0.421
ACC        JDT      0.354   0.532  0.374  0.431  0.291  0.481  0.439  0.314  0.287  0.334
ACC        MOZ      0.189   0.386  0.328  0.355  0.364  0.301  0.270  0.312  0.312  0.312
ACC        PLA      0.368   0.472  0.509  0.542  0.497  0.509  0.456  0.442  0.456  0.495
ACC        POS      0.336   0.490  0.397  0.379  0.457  0.442  0.406  0.341  0.400  0.383
ACC        W/D/L    —       5/0/1  4/1/1  4/1/1  3/1/2  4/1/1  4/1/1  2/2/2  3/1/2  4/1/1
Popt       BUG      0.742   0.747  0.744  0.726  0.738  0.749  0.750  0.760  0.750  0.766
Popt       COL      0.726   0.661  0.624  0.401  0.731  0.295  0.421  0.644  0.628  0.681
Popt       JDT      0.660   0.779  0.625  0.698  0.568  0.725  0.708  0.603  0.599  0.634
Popt       MOZ      0.555   0.679  0.626  0.645  0.659  0.588  0.589  0.643  0.634  0.630
Popt       PLA      0.664   0.723  0.727  0.784  0.746  0.758  0.718  0.673  0.716  0.747
Popt       POS      0.673   0.729  0.680  0.651  0.709  0.739  0.683  0.596  0.669  0.644
Popt       W/D/L    —       4/1/1  2/2/2  2/1/3  3/2/1  4/1/1  2/3/1  1/2/3  1/3/2  2/1/3

In Table 6, the first column indicates the four evaluation indicators, including AUC, F1, ACC, and Popt; the second column represents the performance values of the global models; and from the third column to the eleventh column, the prediction performance of the local models under different parameters k is presented. In our experiment, six data sets are used in the cross-project-validation scenario; therefore, each prediction model produces 30 prediction results. The values in Table 6 represent the medians of the prediction results.

For the values in Table 6, we do the following processing:

(i) Using the significance test method, the performance values of local models that are significantly different from those of global models are bolded.

(ii) The performance values of local models that are significantly better than those of global models are given in italics.

(iii) The W/D/L column reports, over the different parameters k, the number of times that local models perform significantly better than, equal to, and worse than global models.

First, we compare the difference in classification performance between local models and global models. As can be seen from Table 6, for the AUC and F1 indicators, the classification performance of local models under different parameters k is worse than or equal to that of global models. Therefore, we can conclude that, in the cross-project-validation scenario, local models perform worse than global models in classification performance.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 6 that the performance of local models is better than or equal to that of global models on the ACC and Popt indicators when the parameter k ranges from 2 to 10. Therefore, local models are valid for effort-aware JIT-SDP in the cross-project-validation scenario. In particular, when the number of clusters k is set to 2, local models obtain better ACC and Popt values than global models; when k is set to 5, local models obtain better ACC values. Therefore, it is appropriate to set k to 2, with which local models can find 49.3% of the defect-inducing changes when using 20% of the effort, increasing the ACC indicator by 57.0% and the Popt indicator by 16.4%.
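For reference, the two effort-aware indicators can be computed roughly as sketched below, assuming predicted risk scores and a per-change effort proxy (e.g., modified lines of code). The exact normalization of Popt used in [7] may differ in detail; this is one common formulation:

```python
# Hedged sketch of the effort-aware indicators ACC and Popt.
import numpy as np

def acc_at_effort(y_true, scores, effort, budget=0.2):
    """Recall of defect-inducing changes when inspecting the riskiest changes
    first, until `budget` (20%) of the total effort is consumed."""
    y = np.asarray(y_true)
    eff = np.asarray(effort, dtype=float)
    order = np.argsort(-np.asarray(scores))
    within = np.cumsum(eff[order]) <= budget * eff.sum()
    return y[order][within].sum() / max(y.sum(), 1)

def popt(y_true, scores, effort):
    """Normalized area under the effort-vs-recall (Alberg) curve."""
    y = np.asarray(y_true)
    eff = np.asarray(effort, dtype=float)

    def area(order):
        x = np.cumsum(eff[order]) / eff.sum()
        recall = np.cumsum(y[order]) / max(y.sum(), 1)
        return np.trapz(recall, x)

    model = area(np.argsort(-np.asarray(scores)))
    optimal = area(np.lexsort((eff, -y)))   # buggy and cheap changes first
    worst = area(np.lexsort((-eff, y)))     # clean and costly changes first
    return 1 - (optimal - model) / (optimal - worst)
```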

In the cross-project-validation scenario, local models perform worse than global models in classification performance but better in effort-aware prediction performance. However, the setting of k in local models has an impact on their effort-aware prediction performance. The empirical results show that, when k is set to 2, local models obtain the optimal effort-aware prediction performance.

5.3. Analysis for RQ3. This section compares the prediction performance of local and global models in the timewise-cross-validation scenario. Since, in the timewise-cross-validation scenario, only two months of data in each data set are used to train prediction models, the size of the training sample in each cluster is small. Through experiments, we find that, if the parameter k of local models is set too large, the number of training samples in each cluster may be too small or even equal to 0. Therefore, the experiment sets the parameter k of local models to 2. The experimental results are shown in Table 7. Suppose a project contains n months of data; then prediction models produce n − 5 prediction results in the timewise-cross-validation scenario. Therefore, the values in the third and fourth columns of Table 7 are the median and standard deviation of the prediction results for global and local models, respectively. Based on the significance test method, the performance values of local models with significant differences from those of global models are bolded. The row W/D/L summarizes the number of projects for which local models obtain a better, equal, and worse performance than global models on each indicator.
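The split generation itself is simple; the sketch below shows one arrangement that is consistent with the n − 5 prediction results mentioned above (train on months i and i+1, test on months i+4 and i+5, leaving a two-month gap). The exact gap used by the original study is an assumption here:

```python
# Hedged sketch of the timewise-cross-validation splits over months 1..n.
def timewise_splits(n_months):
    for i in range(1, n_months - 4):        # yields n_months - 5 splits
        train_months = (i, i + 1)
        test_months = (i + 4, i + 5)
        yield train_months, test_months

# Example: a project with 12 months of data produces 7 splits;
# the first one trains on months (1, 2) and tests on months (5, 6).
```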

First, we compare the classification performance of local and global models. It can be seen from Table 7 that local models perform worse than global models on the AUC and F1 indicators for all six data sets. Therefore, it can be concluded that local models perform poorly in classification performance compared with global models in the timewise-cross-validation scenario.

Second, we compare the effort-aware prediction performance of local and global models. It can be seen from Table 7 that, for the projects BUG and COL, local models have worse ACC and Popt indicators than global models, while for the other four projects (JDT, MOZ, PLA, and POS), local models and global models have no significant difference in the ACC and Popt indicators. Therefore, local models perform worse in effort-aware prediction performance than global models in the timewise-cross-validation scenario. This conclusion is inconsistent with the conclusions in Sections 5.1 and 5.2. We consider that the main reason lies in the size of the training sample: in the timewise-cross-validation scenario, only two months of data are used to train prediction models (there are only 452 changes per month on average). When local models are used, since the training set is divided into several clusters, the number of training samples in each cluster is further decreased, which results in a decrease in the effort-aware prediction performance of local models.

In this section, we compare the classification performance and effort-aware prediction performance of local and global models in the timewise-cross-validation scenario. The empirical results show that local models perform worse than global models in both classification performance and effort-aware prediction performance. Therefore, we recommend using global models to solve the JIT-SDP problem in the timewise-cross-validation scenario.

Table 5: The number of wins of local models at different k values.

Project                    BUG       COL       JDT        MOZ        PLA        POS
Count                      4         0         7          18         14         11
Rank                       5         6         4          1          2          3
Number of changes (rank)   4620 (5)  4455 (6)  35386 (3)  98275 (1)  64250 (2)  20431 (4)



6. Threats to Validity

6.1. External Validity. Although our experiment uses public data sets that have been extensively used in previous studies, we cannot guarantee that the findings of the experiment apply to all other change-level defect data sets. Therefore, defect data sets of more projects should be mined in order to verify the generalizability of the experimental results.

6.2. Construct Validity. We use F1, AUC, ACC, and Popt to evaluate the prediction performance of local and global models. Because defect data sets are often imbalanced, AUC is more appropriate as a threshold-independent evaluation indicator for the classification performance of models [4]. In addition, we use F1, a threshold-based evaluation indicator, as a supplement to AUC. Besides, similar to prior studies [7, 9], the experiment uses ACC and Popt to evaluate the effort-aware prediction performance of models. However, we cannot guarantee that the experimental results are valid for all other evaluation indicators, such as precision, recall, Fβ, and PofB20.

6.3. Internal Validity. Internal validity concerns errors in the experimental code. In order to reduce this risk, we carefully examined the code of the experiment and referred to the code published in previous studies [7, 8, 29] so as to improve the reliability of the experimental code.

7. Conclusions and Future Work

In this article, we apply local models to JIT-SDP for the first time. Our study aims to compare the prediction performance of local and global models under cross-validation, cross-project-validation, and timewise-cross-validation for JIT-SDP. In order to compare the classification performance and effort-aware prediction performance of local and global models, we first build local models based on the k-medoids method. Afterwards, logistic regression and EALR are used to build classification models and effort-aware prediction models for local and global models. The empirical results show that local models perform worse than global models in classification performance in all three evaluation scenarios. Moreover, local models have worse effort-aware prediction performance in the timewise-cross-validation scenario. However, local models are significantly better than global models in effort-aware prediction performance in the cross-validation and cross-project-validation scenarios. In particular, the optimal effort-aware prediction performance can be obtained when the parameter k of local models is set to 2. Therefore, local models are still promising in the context of effort-aware JIT-SDP.

In the future, we first plan to consider more commercial projects to further verify the validity of the experimental conclusions. Second, logistic regression and EALR are used in local and global models; considering that the choice of modeling methods affects the prediction performance of local and global models, we hope to add more baselines to local and global models to verify the generalizability of the experimental conclusions.

Data Availability

The experiment uses public data sets shared by Kamei et al. [7], who have already published the download address of the data sets in their paper. In addition, in order to verify the repeatability of our experiment and promote future research, all the experimental code in this article can be downloaded at https://github.com/yangxingguang/LocalJIT.

Table 6: Local vs. global models in the cross-project-validation scenario.

Indicator  Global  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10   W/D/L
AUC        0.706   0.683  0.681  0.674  0.673  0.673  0.668  0.659  0.680  0.674  0/0/9
F1         0.678   0.629  0.703  0.696  0.677  0.613  0.620  0.554  0.629  0.637  0/4/5
ACC        0.314   0.493  0.396  0.348  0.408  0.393  0.326  0.340  0.342  0.345  2/7/0
Popt       0.656   0.764  0.691  0.672  0.719  0.687  0.619  0.632  0.667  0.624  1/8/0

Table 7: Local vs. global models in the timewise-cross-validation scenario.

Indicator  Project  Global          Local (k=2)
AUC        BUG      0.672 ± 0.125   0.545 ± 0.124
           COL      0.717 ± 0.075   0.636 ± 0.135
           JDT      0.708 ± 0.043   0.645 ± 0.073
           MOZ      0.749 ± 0.034   0.648 ± 0.128
           PLA      0.709 ± 0.054   0.642 ± 0.076
           POS      0.743 ± 0.072   0.702 ± 0.095
           W/D/L    —               0/0/6
F1         BUG      0.587 ± 0.108   0.515 ± 0.128
           COL      0.651 ± 0.066   0.599 ± 0.125
           JDT      0.722 ± 0.048   0.623 ± 0.161
           MOZ      0.804 ± 0.050   0.696 ± 0.190
           PLA      0.702 ± 0.053   0.618 ± 0.162
           POS      0.714 ± 0.065   0.657 ± 0.128
           W/D/L    —               0/0/6
ACC        BUG      0.357 ± 0.212   0.242 ± 0.210
           COL      0.426 ± 0.168   0.321 ± 0.186
           JDT      0.356 ± 0.148   0.329 ± 0.189
           MOZ      0.218 ± 0.132   0.242 ± 0.125
           PLA      0.358 ± 0.181   0.371 ± 0.196
           POS      0.345 ± 0.159   0.306 ± 0.205
           W/D/L    —               0/4/2
Popt       BUG      0.622 ± 0.194   0.458 ± 0.189
           COL      0.669 ± 0.150   0.553 ± 0.204
           JDT      0.654 ± 0.101   0.624 ± 0.142
           MOZ      0.537 ± 0.110   0.537 ± 0.122
           PLA      0.631 ± 0.115   0.645 ± 0.151
           POS      0.615 ± 0.131   0.583 ± 0.167
           W/D/L    —               0/4/2



Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the NSF of China under Grant nos. 61772200 and 61702334, Shanghai Pujiang Talent Program under Grant no. 17PJ1401900, Shanghai Municipal Natural Science Foundation under Grant nos. 17ZR1406900 and 17ZR1429700, Educational Research Fund of ECUST under Grant no. ZH1726108, and Collaborative Innovation Foundation of Shanghai Institute of Technology under Grant no. XTCX2016-20.

References

[1] M. Newman, Software Errors Cost U.S. Economy $59.5 Billion Annually: NIST Assesses Technical Needs of Industry to Improve Software-Testing, 2002, http://www.abeacha.com/NIST_press_release_bugs_cost.htm.

[2] Z. Li, X.-Y. Jing, and X. Zhu, "Progress on approaches to software defect prediction," IET Software, vol. 12, no. 3, pp. 161–175, 2018.

[3] X. Chen, Q. Gu, W. Liu, S. Liu, and C. Ni, "Survey of static software defect prediction," Ruan Jian Xue Bao/Journal of Software, vol. 27, no. 1, 2016, in Chinese.

[4] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An empirical comparison of model validation techniques for defect prediction models," IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1–18, 2017.

[5] A. G. Koru, D. Zhang, K. El Emam, and H. Liu, "An investigation into the functional form of the size-defect relationship for software modules," IEEE Transactions on Software Engineering, vol. 35, no. 2, pp. 293–304, 2009.

[6] S. Kim, E. J. Whitehead, and Y. Zhang, "Classifying software changes: clean or buggy?" IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.

[7] Y. Kamei, E. Shihab, B. Adams et al., "A large-scale empirical study of just-in-time quality assurance," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, 2013.

[8] Y. Yang, Y. Zhou, J. Liu et al., "Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models," in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 157–168, Hong Kong, China, November 2016.

[9] X. Chen, Y. Zhao, Q. Wang, and Z. Yuan, "MULTI: multi-objective effort-aware just-in-time software defect prediction," Information and Software Technology, vol. 93, pp. 1–13, 2018.

[10] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi, and A. E. Hassan, "Studying just-in-time defect prediction using cross-project models," Empirical Software Engineering, vol. 21, no. 5, pp. 2072–2106, 2016.

[11] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. R. Cok, "Local vs. global models for effort estimation and defect prediction," in Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 343–351, Lawrence, KS, USA, November 2011.

[12] T. Menzies, A. Butcher, D. Cok et al., "Local versus global lessons for defect prediction and effort estimation," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 822–834, 2013.

[13] G. Scanniello, C. Gravino, A. Marcus, and T. Menzies, "Class level fault prediction using software clustering," in Proceedings of the 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 640–645, Silicon Valley, CA, USA, November 2013.

[14] N. Bettenburg, M. Nagappan, and A. E. Hassan, "Towards improving statistical modeling of software engineering data: think locally, act globally!" Empirical Software Engineering, vol. 20, no. 2, pp. 294–335, 2015.

[15] Q. Wang, S. Wu, and M. Li, "Software defect prediction," Journal of Software, vol. 19, no. 7, pp. 1565–1580, 2008, in Chinese.

[16] F. Akiyama, "An example of software system debugging," Information Processing, vol. 71, pp. 353–359, 1971.

[17] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. SE-2, no. 4, pp. 308–320, 1976.

[18] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.

[19] N. Nagappan and T. Ball, "Use of relative code churn measures to predict system defect density," in Proceedings of the 27th International Conference on Software Engineering (ICSE), pp. 284–292, Saint Louis, MO, USA, May 2005.

[20] A. Mockus and D. M. Weiss, "Predicting risk of software changes," Bell Labs Technical Journal, vol. 5, no. 2, pp. 169–180, 2000.

[21] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, "Deep learning for just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 17–26, Vancouver, BC, Canada, August 2015.

[22] X. Yang, D. Lo, X. Xia, and J. Sun, "TLEL: a two-layer ensemble learning approach for just-in-time defect prediction," Information and Software Technology, vol. 87, pp. 206–220, 2017.

[23] L. Pascarella, F. Palomba, and A. Bacchelli, "Fine-grained just-in-time defect prediction," Journal of Systems and Software, vol. 150, pp. 22–36, 2019.

[24] E. Arisholm, L. C. Briand, and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," Journal of Systems and Software, vol. 83, no. 1, pp. 2–17, 2010.

[25] Y. Kamei, S. Matsumoto, A. Monden, K. Matsumoto, B. Adams, and A. E. Hassan, "Revisiting common bug prediction findings using effort-aware models," in Proceedings of the 26th IEEE International Conference on Software Maintenance (ICSM), pp. 1–10, Shanghai, China, September 2010.

[26] Y. Zhou, B. Xu, H. Leung, and L. Chen, "An in-depth study of the potentially confounding effect of class size in fault prediction," ACM Transactions on Software Engineering and Methodology, vol. 23, no. 1, pp. 1–51, 2014.

[27] J. Liu, Y. Zhou, Y. Yang, H. Lu, and B. Xu, "Code churn: a neglected metric in effort-aware just-in-time defect prediction," in Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 11–19, Toronto, ON, Canada, November 2017.

[28] W. Fu and T. Menzies, "Revisiting unsupervised learning for defect prediction," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pp. 72–83, Paderborn, Germany, September 2017.

[29] Q. Huang, X. Xia, and D. Lo, "Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 159–170, Shanghai, China, September 2017.

[30] Q. Huang, X. Xia, and D. Lo, "Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction," Empirical Software Engineering, pp. 1–40, 2018.

[31] S. Herbold, A. Trautsch, and J. Grabowski, "Global vs. local models for cross-project defect prediction," Empirical Software Engineering, vol. 22, no. 4, pp. 1866–1902, 2017.

[32] M. E. Mezouar, F. Zhang, and Y. Zou, "Local versus global models for effort-aware defect prediction," in Proceedings of the 26th Annual International Conference on Computer Science and Software Engineering (CASCON), pp. 178–187, Toronto, ON, Canada, October 2016.

[33] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, Hoboken, NJ, USA, 1990.

[34] S. Hosseini, B. Turhan, and D. Gunarathna, "A systematic literature review and meta-analysis on cross project defect prediction," IEEE Transactions on Software Engineering, vol. 45, no. 2, pp. 111–147, 2019.

[35] J. Sliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?" Mining Software Repositories, vol. 30, no. 4, pp. 1–5, 2005.

[36] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.

[37] N. Cliff, Ordinal Methods for Behavioral Data Analysis, Psychology Press, London, UK, 2014.

[38] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek, "Appropriate statistics for ordinal level data: should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys?" in Annual Meeting of the Florida Association of Institutional Research, pp. 1–33, 2006.

Scientific Programming 13

Computer Games Technology

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

Advances in

FuzzySystems

Hindawiwwwhindawicom

Volume 2018

International Journal of

ReconfigurableComputing

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

thinspArtificial Intelligence

Hindawiwwwhindawicom Volumethinsp2018

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawiwwwhindawicom Volume 2018

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Computational Intelligence and Neuroscience

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018

Human-ComputerInteraction

Advances in

Hindawiwwwhindawicom Volume 2018

Scientic Programming

Submit your manuscripts atwwwhindawicom

Page 9: LocalversusGlobalModelsforJust-In-TimeSoftware ...downloads.hindawi.com/journals/sp/2019/2384706.pdf · Just-in-time software defect prediction (JIT-SDP) is an active topic in software

First local models have different effects on differentdata sets To describe the effectiveness of local models ondifferent data sets we calculate the number of times thatlocal models perform significantly better than globalmodels on 9 k values and two evaluation indicators(ie ACC and Popt) for each data set -e statistical resultsare shown in Table 5 where the second row is the numberof times that local models perform better than globalmodels for each project It can be seen that local modelshave better performance on the projects MOZ and PLA andpoor performance on the projects BUG and COL Based onthe number of times that local models outperform globalmodels in six projects we rank the project as shown in thethird row In addition we list the sizes of data sets and theirrank as shown in the fourth row It can be seen that theperformance of local models on data sets has a strongcorrelation with the size of data sets Generally the largerthe size of the data set the more significant the perfor-mance improvement of local models

In view of this situation we consider that if the size of thedata set is too small when using local models since the datasets are divided into multiple clusters the training samplesof each cluster will be further reduced which will lead to theperformance degradation of models While the size of datasets is large the performance of the model is not easilylimited by the size of the data sets so that the advantages oflocal models are better reflected

In addition the number of clusters k also has an impacton the performance of local models As can be seen from thetable according to the WDL analysis local models canobtain the best ACC indicator when k is set to 2 and obtainthe best Popt indicator when k is set to 2 and 6 -erefore werecommend that in the cross-validation scenario whenusing local models to solve the effort-aware JIT-SDPproblem the parameter k can be set to 2

-is section analyzes and compares the classificationperformance and effort-aware prediction performance oflocal and global models in 10 times 10-fold cross-validationscenario -e empirical results show that compared withglobal models local models perform poorly in the classifi-cation performance but perform better in the effort-awareprediction performance However the performance of localmodels is affected by the parameter k and the size of the datasets -e experimental results show that local models canobtain better effort-aware prediction performance on theproject with larger size of data sets and can obtain bettereffort-aware prediction performance when k is set to 2

52Analysis forRQ2 -is section discusses the performancedifferences between local and global models in the cross-project-validation scenario for JIT-SDP -e results of theexperiment are shown in Table 6 in which the first columnindicates four evaluation indicators including AUC F1

Table 4 Local vs global models in the 10 times 10-fold cross-validation scenario

Indicator Project GlobalLocal

k 2 k 3 k 4 k 5 k 6 k 7 k 8 k 9 k 10

AUC

BUG 0730 0704 0687 0684 0698 0688 0686 0686 0693 0685COL 0742 0719 0708 0701 0703 0657 0666 0670 0668 0660JDT 0725 0719 0677 0670 0675 0697 0671 0674 0671 0654MOZ 0769 0540 0741 0715 0703 0730 0730 0725 0731 0722PLA 0742 0733 0702 0693 0654 0677 0640 0638 0588 0575POS 0777 0764 0758 0760 0758 0753 0755 0705 0757 0757

WDL mdash 006 006 006 006 006 006 006 006 006

F1

BUG 0663 0644 0628 0616 0603 0598 0608 0610 0601 0592COL 0696 0672 0649 0669 0678 0634 0537 0596 0604 0564JDT 0723 0706 0764 0757 0734 0695 0615 0643 0732 0722MOZ 0825 0374 0853 0758 0785 0741 0718 0737 0731 0721PLA 0704 0678 0675 0527 0477 0557 0743 0771 0627 0604POS 0762 0699 0747 0759 0746 0698 0697 0339 0699 0700

WDL mdash 006 204 114 015 006 015 105 015 015

ACC

BUG 0421 0452 0471 0458 0424 0451 0450 0463 0446 0462COL 0553 0458 0375 0253 0443 0068 0195 0459 0403 0421JDT 0354 0532 0374 0431 0291 0481 0439 0314 0287 0334MOZ 0189 0386 0328 0355 0364 0301 0270 0312 0312 0312PLA 0368 0472 0509 0542 0497 0509 0456 0442 0456 0495POS 0336 0490 0397 0379 0457 0442 0406 0341 0400 0383

WDL mdash 501 411 411 312 411 411 222 312 411

Popt

BUG 0742 055 0747 0744 0726 0738 0749 0750 0760 0750 0766COL 0726 0661 0624 0401 0731 0295 0421 0644 0628 0681JDT 0660 0779 0625 0698 0568 0725 0708 0603 0599 0634MOZ 0555 0679 0626 0645 0659 0588 0589 0643 0634 0630PLA 0664 0723 0727 0784 0746 0758 0718 0673 0716 0747POS 0673 0729 0680 0651 0709 0739 0683 0596 0669 0644

WDL mdash 411 222 213 321 411 231 123 132 213

Scientific Programming 9

ACC and Popt the second column represents the perfor-mance values of local models from the third column to theeleventh column the prediction performance of localmodels under different parameters k is represented In ourexperiment six data sets are used in the cross-project-validation scenario -erefore each prediction model pro-duces 30 prediction results -e values in Table 6 representthe median of the prediction results

For the values in Table 6 we do the following processing

(i) Using the significance test method the performancevalues of local models that are significantly differentfrom those of global models are bolded

(ii) -e performance values of local models that aresignificantly better than those of global models aregiven in italics

(iii) -e WDL column analyzes the number of timesthat local models perform significantly better thanglobal models at different parameters k

First we compare the difference in classification per-formance between local models and global models As can beseen from Table 6 in the AUC and F1 indicators theclassification performance of local models under differentparameter k is worse than or equal to that of global models-erefore we can conclude that in the cross-project-validation scenario local models perform worse thanglobal models in the classification performance

Second we compare the effort-aware prediction per-formance of local and global models It can be seen fromTable 6 that the performance of local models is better than orequal to that of global models in the ACC and Popt indicatorswhen the parameter k is from 2 to 10-erefore local modelsare valid for effort-aware JIT-SDP in the cross-validationscenario In particular when the number k of clusters is setto 2 local models can obtain better ACC values and Poptvalues than global models when k is set to 5 local modelscan obtain better ACC values -erefore it is appropriate toset k to 2 which can find 493 defect-inducing changeswhen using 20 effort and increase the ACC indicator by570 and increase the Popt indicator by 164

In the cross-project-validation scenario local modelsperform worse in the classification performance but performbetter in the effort-aware prediction performance thanglobal models However the setting of k in local models hasan impact on the effort-aware prediction performance forlocal models-e empirical results show that when k is set to2 local models can obtain the optimal effort-aware pre-diction performance

53 Analysis for RQ3 -is section compares the predictionperformance of local and global models in the timewise-

cross-validation scenario Since in the timewise-cross-validation scenario only two months of data in each datasets are used to train prediction models the size of thetraining sample in each cluster is small -rough experi-ment we find that if the parameter k of local models is settoo large the number of training samples in each clustermay be too small or even equal to 0 -erefore the ex-periment sets the parameter k of local models to 2 -eexperimental results are shown in Table 7 Suppose aproject contains n months of data and prediction modelscan produce nminus 5 prediction results in the timewise-cross-validation scenario -erefore the third and fourth col-umns in Table 7 represent the median and standard de-viation of prediction results for global and local modelsBased on the significance test method the performancevalues of local models with significant differences fromthose of global models are bold -e row WDL sum-marizes the number of projects for which local models canobtain a better equal and worse performance than globalmodels in each indicator

First we compare the classification performance of localand global models It can be seen from Table 7 that localmodels perform worse on the AUC and F1 indicators thanglobal models on the 6 data sets -erefore it can be con-cluded that local models perform poorly in classificationperformance compared to global models in the timewise-cross-validation scenario

Second we compare the effort-aware prediction per-formance of local and global models It can be seen fromTable 7 that in the projects BUG and COL local models haveworse ACC and Popt indicators than global models while inthe other four projects (JDT MOZ PLA and POS) localmodels and global models have no significant difference inthe ACC and Popt indicators -erefore local models per-form worse in effort-aware prediction performance thanglobal models in the timewise-cross-validation scenario-is conclusion is inconsistent with the conclusions inSections 51 and 52 For this problem we consider that themain reason lies in the size of the training sample In thetimewise-cross-validation scenario only two months of datain the data sets are used to train prediction models (there areonly 452 changes per month on average)When local modelsare used since the training set is divided into several clustersthe number of training samples in each cluster is furtherdecreased which results in a decrease in the effort-awareprediction performance of local models

In this section we compare the classification perfor-mance and effort-aware prediction performance of local andglobal models in the timewise-cross-validation scenarioEmpirical results show that local models perform worse thanglobal models in the classification performance and effort-aware prediction performance -erefore we recommend

Table 5 -e number of wins of local models at different k values

Project BUG COL JDT MOZ PLA POSCount 4 0 7 18 14 11Rank 5 6 4 1 2 3Number of changes (rank) 4620 (5) 4455 (6) 35386 (3) 98275 (1) 64250 (2) 20431 (3)

10 Scientific Programming

using global models to solve the JIT-SDP problem in thetimewise-cross-validation scenario

6 Threats to Validity

61 External Validity Although our experiment uses publicdata sets that have extensively been used in previous studieswe cannot yet guarantee that the discovery of the experimentcan be applied to all other change-level defect data sets-erefore the defect data sets of more projects should bemined in order to verify the generalization of the experi-mental results

62 Construct Validity We use F1 AUC ACC and Popt toevaluate the prediction performance of local and globalmodels Because defect data sets are often imbalanced AUCis more appropriate as a threshold-independent evaluationindicator to evaluate the classification performance of

models [4] In addition we use F1 a threshold-basedevaluation indicator as a supplement to AUC Besidessimilar to prior study [7 9] the experiment uses ACC andPopt to evaluate the effort-aware prediction performance ofmodels However we cannot guarantee that the experi-mental results are valid in all other evaluation indicatorssuch as precision recall Fβ and PofB20

63 Internal Validity Internal validity involves errors in theexperimental code In order to reduce this risk we carefullyexamined the code of the experiment and referred to thecode published in previous studies [7 8 29] so as to improvethe reliability of the experiment code

7 Conclusions and Future Work

In this article we firstly apply local models to JIT-SDP Ourstudy aims to compare the prediction performance of localand global models under cross-validation cross-project-validation and timewise-cross-validation for JIT-SDP Inorder to compare the classification performance and effort-aware prediction performance of local and global models wefirst build local models based on the k-medoids methodAfterwards logistic regression and EALR are used to buildclassification models and effort-aware prediction models forlocal and global models -e empirical results show that localmodels perform worse in the classification performance thanglobal models in three evaluation scenarios Moreover localmodels have worse effort-aware prediction performance in thetimewise-cross-validation scenario However local models aresignificantly better than global models in the effort-awareprediction performance in the cross-validation-scenario andtimewise-cross-validation scenario Particularly the optimaleffort-aware prediction performance can be obtained whenthe parameter k of local models is set to 2 -erefore localmodels are still promising in the context of effort-aware JIT-SDP

In the future first we plan to consider more commercialprojects to further verify the validity of the experimentalconclusions Second logistic regression and EALR are usedin local and global models Considering that the choice ofmodeling methods will affect the prediction performance oflocal and global models we hope to add more baselines tolocal and global models to verify the generalization of ex-perimental conclusions

Data Availability

-e experiment uses public data sets shared by Kamei et al[7] and they have already published the download address of

Table 6 Local vs global models in the cross-project-validation scenario

Indicator GlobalLocal

WDLk 2 k 3 k 4 k 5 k 6 k 7 k 8 k 9 k 10

AUC 0706 0683 0681 0674 0673 0673 0668 0659 0680 0674 009F1 0678 0629 0703 0696 0677 0613 0620 0554 0629 0637 045ACC 0314 0493 0396 0348 0408 0393 0326 034 0342 0345 270Popt 0656 0764 0691 0672 0719 0687 0619 0632 0667 0624 180

Table 7 Local vs global models in the timewise-cross-validationscenario

Indicators Project Global Local (k 2)

AUC

BUG 0672plusmn 0125 0545plusmn 0124COL 0717plusmn 0075 0636plusmn 0135JDT 0708plusmn 0043 0645plusmn 0073MOZ 0749plusmn 0034 0648plusmn 0128PLA 0709plusmn 0054 0642plusmn 0076POS 0743plusmn 0072 0702plusmn 0095

WDL mdash 006

F1

BUG 0587plusmn 0108 0515plusmn 0128COL 0651plusmn 0066 0599plusmn 0125JDT 0722plusmn 0048 0623plusmn 0161MOZ 0804plusmn 0050 0696plusmn 0190PLA 0702plusmn 0053 0618plusmn 0162POS 0714plusmn 0065 0657plusmn 0128

WDL mdash 006

ACC

BUG 0357plusmn 0212 0242plusmn 0210COL 0426plusmn 0168 0321plusmn 0186JDT 0356plusmn 0148 0329plusmn 0189MOZ 0218plusmn 0132 0242plusmn 0125PLA 0358plusmn 0181 0371plusmn 0196POS 0345plusmn 0159 0306plusmn 0205

WDL mdash 042

Popt

BUG 0622plusmn 0194 0458plusmn 0189COL 0669plusmn 0150 0553plusmn 0204JDT 0654plusmn 0101 0624plusmn 0142MOZ 0537plusmn 0110 0537plusmn 0122PLA 0631plusmn 0115 0645plusmn 0151POS 0615plusmn 0131 0583plusmn 0167

WDL mdash 042

Scientific Programming 11

the data sets in their paper In addition in order to verify therepeatability of our experiment and promote future researchall the experimental code in this article can be downloaded athttpsgithubcomyangxingguangLocalJIT

Conflicts of Interest

-e authors declare that they have no conflicts of interest

Acknowledgments

-is work was partially supported by the NSF of Chinaunder Grant nos 61772200 and 61702334 Shanghai PujiangTalent Program under Grant no 17PJ1401900 ShanghaiMunicipal Natural Science Foundation under Grant nos17ZR1406900 and 17ZR1429700 Educational ResearchFund of ECUST under Grant no ZH1726108 and Collab-orative Innovation Foundation of Shanghai Institute ofTechnology under Grant no XTCX2016-20

References

[1] M Newman Software Errors Cost US Economy $595 BillionAnnually NIST Assesses Technical Needs of Industry to ImproveSoftware-Testing 2002 httpwwwabeachacomNIST_press_release_bugs_costhtm

[2] Z Li X-Y Jing and X Zhu ldquoProgress on approaches tosoftware defect predictionrdquo IET Software vol 12 no 3pp 161ndash175 2018

[3] X Chen Q Gu W Liu S Liu and C Ni ldquoSurvey of staticsoftware defect predictionrdquo Ruan Jian Xue BaoJournal ofSoftware vol 27 no 1 2016 in Chinese

[4] C Tantithamthavorn S McIntosh A E Hassan andKMatsumoto ldquoAn empirical comparison of model validationtechniques for defect prediction modelsrdquo IEEE Transactionson Software Engineering vol 43 no 1 pp 1ndash18 2017

[5] A G Koru D Dongsong Zhang K El Emam andH Hongfang Liu ldquoAn investigation into the functional formof the size-defect relationship for software modulesrdquo IEEETransactions on Software Engineering vol 35 no 2pp 293ndash304 2009

[6] S Kim E J Whitehead and Y Zhang ldquoClassifying softwarechanges clean or buggyrdquo IEEE Transactions on SoftwareEngineering vol 34 no 2 pp 181ndash196 2008

[7] Y Kamei E Shihab B Adams et al ldquoA large-scale empiricalstudy of just-in-time quality assurancerdquo IEEE Transactions onSoftware Engineering vol 39 no 6 pp 757ndash773 2013

[8] Y Yang Y Zhou J Liu et al ldquoEffort-aware just-in-time defectprediction simple unsupervised models could be better thansupervised modelsrdquo in Proceedings of the 24th ACM SIGSOFTInternational Symposium on Foundations of Software Engi-neering pp 157ndash168 Hong Kong China November 2016

[9] X Chen Y Zhao Q Wang and Z Yuan ldquoMULTI multi-objective effort-aware just-in-time software defect predictionrdquoInformation and Software Technology vol 93 pp 1ndash13 2018

[10] Y Kamei T Fukushima S McIntosh K YamashitaN Ubayashi and A E Hassan ldquoStudying just-in-time defectprediction using cross-project modelsrdquo Empirical SoftwareEngineering vol 21 no 5 pp 2072ndash2106 2016

[11] T Menzies A Butcher A Marcus T Zimmermann andD R Cok ldquoLocal vs global models for effort estimation anddefect predictionrdquo in Proceedings of the 26th IEEEACM

International Conference on Automated Software Engineer-ing (ASE) pp 343ndash351 Lawrence KS USA November 2011

[12] T Menzies A Butcher D Cok et al ldquoLocal versus globallessons for defect prediction and effort estimationrdquo IEEETransactions on Software Engineering vol 39 no 6pp 822ndash834 2013

[13] G Scanniello C Gravino A Marcus and T Menzies ldquoClasslevel fault prediction using software clusteringrdquo in Pro-ceedings of the 2013 28th IEEEACM International Conferenceon Automated Software Engineering (ASE) pp 640ndash645Silicon Valley CA USA November 2013

[14] N Bettenburg M Nagappan and A E Hassan ldquoTowardsimproving statistical modeling of software engineering datathink locally act globallyrdquo Empirical Software Engineeringvol 20 no 2 pp 294ndash335 2015

[15] Q Wang S Wu and M Li ldquoSoftware defect predictionrdquoJournal of Software vol 19 no 7 pp 1565ndash1580 2008 inChinese

[16] F Akiyama ldquoAn example of software system debuggingrdquoInformation Processing vol 71 pp 353ndash359 1971

[17] T J McCabe ldquoA complexity measurerdquo IEEE Transactions onSoftware Engineering vol SE-2 no 4 pp 308ndash320 1976

[18] S R Chidamber and C F Kemerer ldquoAmetrics suite for objectoriented designrdquo IEEE Transactions on Software Engineeringvol 20 no 6 pp 476ndash493 1994

[19] N Nagappan and T Ball ldquoUse of relative code churn mea-sures to predict system defect densityrdquo in Proceedings of the27th International Conference on Software Engineering (ICSE)pp 284ndash292 Saint Louis MO USA May 2005

[20] A Mockus and D M Weiss ldquoPredicting risk of softwarechangesrdquo Bell Labs Technical Journal vol 5 no 2 pp 169ndash180 2000

[21] X Yang D Lo X Xia Y Zhang and J Sun ldquoDeep learningfor just-in-time defect predictionrdquo in Proceedings of the IEEEInternational Conference on Software Quality Reliability andSecurity (QRS) pp 17ndash26 Vancouver BC Canada August2015

[22] X Yang D Lo X Xia and J Sun ldquoTLEL a two-layer ensemblelearning approach for just-in-time defect predictionrdquo In-formation and Software Technology vol 87 pp 206ndash220 2017

[23] L Pascarella F Palomba and A Bacchelli ldquoFine-grained just-in-time defect predictionrdquo Journal of Systems and Softwarevol 150 pp 22ndash36 2019

[24] E Arisholm L C Briand and E B Johannessen ldquoA sys-tematic and comprehensive investigation of methods to buildand evaluate fault prediction modelsrdquo Journal of Systems andSoftware vol 83 no 1 pp 2ndash17 2010

[25] Y Kamei S Matsumoto A Monden K MatsumotoB Adams and A E Hassan ldquoRevisiting common bug pre-diction findings using effort-aware modelsrdquo in Proceedings ofthe 26th IEEE International Conference on Software Mainte-nance (ICSM) pp 1ndash10 Shanghai China September 2010

[26] Y Zhou B Xu H Leung and L Chen ldquoAn in-depth study ofthe potentially confounding effect of class size in fault pre-dictionrdquo ACM Transactions on Software Engineering andMethodology vol 23 no 1 pp 1ndash51 2014

[27] J Liu Y Zhou Y Yang H Lu and B Xu ldquoCode churn aneglectedmetric in effort-aware just-in-time defect predictionrdquoin Proceedings of the ACMIEEE International Symposium onEmpirical Software Engineering and Measurement (ESEM)pp 11ndash19 Toronto ON Canada November 2017

[28] W Fu and T Menzies ldquoRevisiting unsupervised learning fordefect predictionrdquo in Proceedings of the 2017 11th Joint

12 Scientific Programming

Meeting on Foundations of Software Engineering (ESECFSE)pp 72ndash83 Paderborn Germany September 2017

[29] Q Huang X Xia and D Lo ldquoSupervised vs unsupervisedmodels a holistic look at effort-aware just-in-time defectpredictionrdquo in Proceedings of the IEEE International Con-ference on Software Maintenance and Evolution (ICSME)pp 159ndash170 Shanghai China September 2017

[30] Q Huang X Xia and D Lo ldquoRevisiting supervised andunsupervised models for effort-aware just-in-time defectpredictionrdquo Empirical Software Engineering pp 1ndash40 2018

[31] S Herbold A Trautsch and J Grabowski ldquoGlobal vs localmodels for cross-project defect predictionrdquo Empirical Soft-ware Engineering vol 22 no 4 pp 1866ndash1902 2017

[32] M E Mezouar F Zhang and Y Zou ldquoLocal versus globalmodels for effort-aware defect predictionrdquo in Proceedings ofthe 26th Annual International Conference on Computer Sci-ence and Software Engineering (CASCON) pp 178ndash187Toronto ON Canada October 2016

[33] L Kaufman and P J Rousseeuw Finding Groups in Data AnIntroduction to Cluster Analysis Wiley Hoboken NJ USA1990

[34] S Hosseini B Turhan and D Gunarathna ldquoA systematicliterature review and meta-analysis on cross project defectpredictionrdquo IEEE Transactions on Software Engineeringvol 45 no 2 pp 111ndash147 2019

[35] J Sliwerski T Zimmermann and A Zeller ldquoWhen dochanges induce fixesrdquo Mining Software Repositories vol 30no 4 pp 1ndash5 2005

[36] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics Bulletin vol 1 no 6 pp 80ndash83 1945

[37] N Cliff Ordinal Methods for Behavioral Data AnalysisPsychology Press London UK 2014

[38] J Romano J D Kromrey J Coraggio and J SkowronekldquoAppropriate statistics for ordinal level data should we reallybe using t-test and cohensd for evaluating group differenceson the nsse and other surveysrdquo in Annual Meeting of theFlorida Association of Institutional Research pp 1ndash33 2006

Scientific Programming 13

Computer Games Technology

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

Advances in

FuzzySystems

Hindawiwwwhindawicom

Volume 2018

International Journal of

ReconfigurableComputing

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

thinspArtificial Intelligence

Hindawiwwwhindawicom Volumethinsp2018

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawiwwwhindawicom Volume 2018

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Computational Intelligence and Neuroscience

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018

Human-ComputerInteraction

Advances in

Hindawiwwwhindawicom Volume 2018

Scientic Programming

Submit your manuscripts atwwwhindawicom

Page 10: LocalversusGlobalModelsforJust-In-TimeSoftware ...downloads.hindawi.com/journals/sp/2019/2384706.pdf · Just-in-time software defect prediction (JIT-SDP) is an active topic in software

ACC and Popt the second column represents the perfor-mance values of local models from the third column to theeleventh column the prediction performance of localmodels under different parameters k is represented In ourexperiment six data sets are used in the cross-project-validation scenario -erefore each prediction model pro-duces 30 prediction results -e values in Table 6 representthe median of the prediction results

For the values in Table 6 we do the following processing

(i) Using the significance test method the performancevalues of local models that are significantly differentfrom those of global models are bolded

(ii) -e performance values of local models that aresignificantly better than those of global models aregiven in italics

(iii) -e WDL column analyzes the number of timesthat local models perform significantly better thanglobal models at different parameters k

First we compare the difference in classification per-formance between local models and global models As can beseen from Table 6 in the AUC and F1 indicators theclassification performance of local models under differentparameter k is worse than or equal to that of global models-erefore we can conclude that in the cross-project-validation scenario local models perform worse thanglobal models in the classification performance

Second we compare the effort-aware prediction per-formance of local and global models It can be seen fromTable 6 that the performance of local models is better than orequal to that of global models in the ACC and Popt indicatorswhen the parameter k is from 2 to 10-erefore local modelsare valid for effort-aware JIT-SDP in the cross-validationscenario In particular when the number k of clusters is setto 2 local models can obtain better ACC values and Poptvalues than global models when k is set to 5 local modelscan obtain better ACC values -erefore it is appropriate toset k to 2 which can find 493 defect-inducing changeswhen using 20 effort and increase the ACC indicator by570 and increase the Popt indicator by 164

In the cross-project-validation scenario local modelsperform worse in the classification performance but performbetter in the effort-aware prediction performance thanglobal models However the setting of k in local models hasan impact on the effort-aware prediction performance forlocal models-e empirical results show that when k is set to2 local models can obtain the optimal effort-aware pre-diction performance

53 Analysis for RQ3 -is section compares the predictionperformance of local and global models in the timewise-

cross-validation scenario Since in the timewise-cross-validation scenario only two months of data in each datasets are used to train prediction models the size of thetraining sample in each cluster is small -rough experi-ment we find that if the parameter k of local models is settoo large the number of training samples in each clustermay be too small or even equal to 0 -erefore the ex-periment sets the parameter k of local models to 2 -eexperimental results are shown in Table 7 Suppose aproject contains n months of data and prediction modelscan produce nminus 5 prediction results in the timewise-cross-validation scenario -erefore the third and fourth col-umns in Table 7 represent the median and standard de-viation of prediction results for global and local modelsBased on the significance test method the performancevalues of local models with significant differences fromthose of global models are bold -e row WDL sum-marizes the number of projects for which local models canobtain a better equal and worse performance than globalmodels in each indicator

First we compare the classification performance of localand global models It can be seen from Table 7 that localmodels perform worse on the AUC and F1 indicators thanglobal models on the 6 data sets -erefore it can be con-cluded that local models perform poorly in classificationperformance compared to global models in the timewise-cross-validation scenario

Second we compare the effort-aware prediction per-formance of local and global models It can be seen fromTable 7 that in the projects BUG and COL local models haveworse ACC and Popt indicators than global models while inthe other four projects (JDT MOZ PLA and POS) localmodels and global models have no significant difference inthe ACC and Popt indicators -erefore local models per-form worse in effort-aware prediction performance thanglobal models in the timewise-cross-validation scenario-is conclusion is inconsistent with the conclusions inSections 51 and 52 For this problem we consider that themain reason lies in the size of the training sample In thetimewise-cross-validation scenario only two months of datain the data sets are used to train prediction models (there areonly 452 changes per month on average)When local modelsare used since the training set is divided into several clustersthe number of training samples in each cluster is furtherdecreased which results in a decrease in the effort-awareprediction performance of local models

In this section we compare the classification perfor-mance and effort-aware prediction performance of local andglobal models in the timewise-cross-validation scenarioEmpirical results show that local models perform worse thanglobal models in the classification performance and effort-aware prediction performance -erefore we recommend

Table 5 -e number of wins of local models at different k values

Project BUG COL JDT MOZ PLA POSCount 4 0 7 18 14 11Rank 5 6 4 1 2 3Number of changes (rank) 4620 (5) 4455 (6) 35386 (3) 98275 (1) 64250 (2) 20431 (3)

10 Scientific Programming

using global models to solve the JIT-SDP problem in thetimewise-cross-validation scenario

6 Threats to Validity

61 External Validity Although our experiment uses publicdata sets that have extensively been used in previous studieswe cannot yet guarantee that the discovery of the experimentcan be applied to all other change-level defect data sets-erefore the defect data sets of more projects should bemined in order to verify the generalization of the experi-mental results

62 Construct Validity We use F1 AUC ACC and Popt toevaluate the prediction performance of local and globalmodels Because defect data sets are often imbalanced AUCis more appropriate as a threshold-independent evaluationindicator to evaluate the classification performance of

models [4] In addition we use F1 a threshold-basedevaluation indicator as a supplement to AUC Besidessimilar to prior study [7 9] the experiment uses ACC andPopt to evaluate the effort-aware prediction performance ofmodels However we cannot guarantee that the experi-mental results are valid in all other evaluation indicatorssuch as precision recall Fβ and PofB20

63 Internal Validity Internal validity involves errors in theexperimental code In order to reduce this risk we carefullyexamined the code of the experiment and referred to thecode published in previous studies [7 8 29] so as to improvethe reliability of the experiment code

7 Conclusions and Future Work

In this article we firstly apply local models to JIT-SDP Ourstudy aims to compare the prediction performance of localand global models under cross-validation cross-project-validation and timewise-cross-validation for JIT-SDP Inorder to compare the classification performance and effort-aware prediction performance of local and global models wefirst build local models based on the k-medoids methodAfterwards logistic regression and EALR are used to buildclassification models and effort-aware prediction models forlocal and global models -e empirical results show that localmodels perform worse in the classification performance thanglobal models in three evaluation scenarios Moreover localmodels have worse effort-aware prediction performance in thetimewise-cross-validation scenario However local models aresignificantly better than global models in the effort-awareprediction performance in the cross-validation-scenario andtimewise-cross-validation scenario Particularly the optimaleffort-aware prediction performance can be obtained whenthe parameter k of local models is set to 2 -erefore localmodels are still promising in the context of effort-aware JIT-SDP

In the future, first, we plan to consider more commercial projects to further verify the validity of the experimental conclusions. Second, only logistic regression and EALR are used in the local and global models in this study. Considering that the choice of modeling method will affect the prediction performance of local and global models, we hope to add more baselines to local and global models to verify the generalization of the experimental conclusions.

Data Availability

The experiment uses public data sets shared by Kamei et al. [7], who have already published the download address of the data sets in their paper. In addition, in order to verify the repeatability of our experiment and promote future research, all the experimental code in this article can be downloaded at https://github.com/yangxingguang/LocalJIT.

Table 6: Local vs. global models in the cross-project-validation scenario.

Indicator   Global   Local                                                                           W/D/L
                     k=2     k=3     k=4     k=5     k=6     k=7     k=8     k=9     k=10
AUC         0.706    0.683   0.681   0.674   0.673   0.673   0.668   0.659   0.680   0.674          0/0/9
F1          0.678    0.629   0.703   0.696   0.677   0.613   0.620   0.554   0.629   0.637          0/4/5
ACC         0.314    0.493   0.396   0.348   0.408   0.393   0.326   0.340   0.342   0.345          2/7/0
Popt        0.656    0.764   0.691   0.672   0.719   0.687   0.619   0.632   0.667   0.624          1/8/0

Table 7: Local vs. global models in the timewise-cross-validation scenario.

Indicator   Project   Global          Local (k = 2)
AUC         BUG       0.672 ± 0.125   0.545 ± 0.124
            COL       0.717 ± 0.075   0.636 ± 0.135
            JDT       0.708 ± 0.043   0.645 ± 0.073
            MOZ       0.749 ± 0.034   0.648 ± 0.128
            PLA       0.709 ± 0.054   0.642 ± 0.076
            POS       0.743 ± 0.072   0.702 ± 0.095
            W/D/L     –               0/0/6
F1          BUG       0.587 ± 0.108   0.515 ± 0.128
            COL       0.651 ± 0.066   0.599 ± 0.125
            JDT       0.722 ± 0.048   0.623 ± 0.161
            MOZ       0.804 ± 0.050   0.696 ± 0.190
            PLA       0.702 ± 0.053   0.618 ± 0.162
            POS       0.714 ± 0.065   0.657 ± 0.128
            W/D/L     –               0/0/6
ACC         BUG       0.357 ± 0.212   0.242 ± 0.210
            COL       0.426 ± 0.168   0.321 ± 0.186
            JDT       0.356 ± 0.148   0.329 ± 0.189
            MOZ       0.218 ± 0.132   0.242 ± 0.125
            PLA       0.358 ± 0.181   0.371 ± 0.196
            POS       0.345 ± 0.159   0.306 ± 0.205
            W/D/L     –               0/4/2
Popt        BUG       0.622 ± 0.194   0.458 ± 0.189
            COL       0.669 ± 0.150   0.553 ± 0.204
            JDT       0.654 ± 0.101   0.624 ± 0.142
            MOZ       0.537 ± 0.110   0.537 ± 0.122
            PLA       0.631 ± 0.115   0.645 ± 0.151
            POS       0.615 ± 0.131   0.583 ± 0.167
            W/D/L     –               0/4/2



Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the NSF of China under Grant nos. 61772200 and 61702334, the Shanghai Pujiang Talent Program under Grant no. 17PJ1401900, the Shanghai Municipal Natural Science Foundation under Grant nos. 17ZR1406900 and 17ZR1429700, the Educational Research Fund of ECUST under Grant no. ZH1726108, and the Collaborative Innovation Foundation of Shanghai Institute of Technology under Grant no. XTCX2016-20.

References

[1] M. Newman, Software Errors Cost U.S. Economy $59.5 Billion Annually: NIST Assesses Technical Needs of Industry to Improve Software-Testing, 2002, http://www.abeacha.com/NIST_press_release_bugs_cost.htm.
[2] Z. Li, X.-Y. Jing, and X. Zhu, "Progress on approaches to software defect prediction," IET Software, vol. 12, no. 3, pp. 161–175, 2018.
[3] X. Chen, Q. Gu, W. Liu, S. Liu, and C. Ni, "Survey of static software defect prediction," Ruan Jian Xue Bao/Journal of Software, vol. 27, no. 1, 2016, in Chinese.
[4] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An empirical comparison of model validation techniques for defect prediction models," IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1–18, 2017.
[5] A. G. Koru, D. Zhang, K. El Emam, and H. Liu, "An investigation into the functional form of the size-defect relationship for software modules," IEEE Transactions on Software Engineering, vol. 35, no. 2, pp. 293–304, 2009.
[6] S. Kim, E. J. Whitehead, and Y. Zhang, "Classifying software changes: clean or buggy?," IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.
[7] Y. Kamei, E. Shihab, B. Adams et al., "A large-scale empirical study of just-in-time quality assurance," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, 2013.
[8] Y. Yang, Y. Zhou, J. Liu et al., "Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models," in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 157–168, Hong Kong, China, November 2016.
[9] X. Chen, Y. Zhao, Q. Wang, and Z. Yuan, "MULTI: multi-objective effort-aware just-in-time software defect prediction," Information and Software Technology, vol. 93, pp. 1–13, 2018.
[10] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi, and A. E. Hassan, "Studying just-in-time defect prediction using cross-project models," Empirical Software Engineering, vol. 21, no. 5, pp. 2072–2106, 2016.
[11] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. R. Cok, "Local vs. global models for effort estimation and defect prediction," in Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 343–351, Lawrence, KS, USA, November 2011.
[12] T. Menzies, A. Butcher, D. Cok et al., "Local versus global lessons for defect prediction and effort estimation," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 822–834, 2013.
[13] G. Scanniello, C. Gravino, A. Marcus, and T. Menzies, "Class level fault prediction using software clustering," in Proceedings of the 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 640–645, Silicon Valley, CA, USA, November 2013.
[14] N. Bettenburg, M. Nagappan, and A. E. Hassan, "Towards improving statistical modeling of software engineering data: think locally, act globally," Empirical Software Engineering, vol. 20, no. 2, pp. 294–335, 2015.
[15] Q. Wang, S. Wu, and M. Li, "Software defect prediction," Journal of Software, vol. 19, no. 7, pp. 1565–1580, 2008, in Chinese.
[16] F. Akiyama, "An example of software system debugging," Information Processing, vol. 71, pp. 353–359, 1971.
[17] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. SE-2, no. 4, pp. 308–320, 1976.
[18] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.
[19] N. Nagappan and T. Ball, "Use of relative code churn measures to predict system defect density," in Proceedings of the 27th International Conference on Software Engineering (ICSE), pp. 284–292, Saint Louis, MO, USA, May 2005.
[20] A. Mockus and D. M. Weiss, "Predicting risk of software changes," Bell Labs Technical Journal, vol. 5, no. 2, pp. 169–180, 2000.
[21] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, "Deep learning for just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 17–26, Vancouver, BC, Canada, August 2015.
[22] X. Yang, D. Lo, X. Xia, and J. Sun, "TLEL: a two-layer ensemble learning approach for just-in-time defect prediction," Information and Software Technology, vol. 87, pp. 206–220, 2017.
[23] L. Pascarella, F. Palomba, and A. Bacchelli, "Fine-grained just-in-time defect prediction," Journal of Systems and Software, vol. 150, pp. 22–36, 2019.
[24] E. Arisholm, L. C. Briand, and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," Journal of Systems and Software, vol. 83, no. 1, pp. 2–17, 2010.
[25] Y. Kamei, S. Matsumoto, A. Monden, K. Matsumoto, B. Adams, and A. E. Hassan, "Revisiting common bug prediction findings using effort-aware models," in Proceedings of the 26th IEEE International Conference on Software Maintenance (ICSM), pp. 1–10, Shanghai, China, September 2010.
[26] Y. Zhou, B. Xu, H. Leung, and L. Chen, "An in-depth study of the potentially confounding effect of class size in fault prediction," ACM Transactions on Software Engineering and Methodology, vol. 23, no. 1, pp. 1–51, 2014.
[27] J. Liu, Y. Zhou, Y. Yang, H. Lu, and B. Xu, "Code churn: a neglected metric in effort-aware just-in-time defect prediction," in Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 11–19, Toronto, ON, Canada, November 2017.
[28] W. Fu and T. Menzies, "Revisiting unsupervised learning for defect prediction," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pp. 72–83, Paderborn, Germany, September 2017.



[29] Q. Huang, X. Xia, and D. Lo, "Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction," in Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 159–170, Shanghai, China, September 2017.
[30] Q. Huang, X. Xia, and D. Lo, "Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction," Empirical Software Engineering, pp. 1–40, 2018.
[31] S. Herbold, A. Trautsch, and J. Grabowski, "Global vs. local models for cross-project defect prediction," Empirical Software Engineering, vol. 22, no. 4, pp. 1866–1902, 2017.
[32] M. E. Mezouar, F. Zhang, and Y. Zou, "Local versus global models for effort-aware defect prediction," in Proceedings of the 26th Annual International Conference on Computer Science and Software Engineering (CASCON), pp. 178–187, Toronto, ON, Canada, October 2016.
[33] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, Hoboken, NJ, USA, 1990.
[34] S. Hosseini, B. Turhan, and D. Gunarathna, "A systematic literature review and meta-analysis on cross project defect prediction," IEEE Transactions on Software Engineering, vol. 45, no. 2, pp. 111–147, 2019.
[35] J. Sliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?," Mining Software Repositories, vol. 30, no. 4, pp. 1–5, 2005.
[36] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[37] N. Cliff, Ordinal Methods for Behavioral Data Analysis, Psychology Press, London, UK, 2014.
[38] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek, "Appropriate statistics for ordinal level data: should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys?," in Annual Meeting of the Florida Association of Institutional Research, pp. 1–33, 2006.
