

Research Article

An Improved Method for Cross-Project Defect Prediction by Simplifying Training Data

Peng He,1,2 Yao He,1 Lvjun Yu,1 and Bing Li3

1School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
2Hubei Province Engineering Technology Research Center for Educational Informationization, Wuhan 430062, China
3School of Computer, Wuhan University, Wuhan 430072, China

Correspondence should be addressed to Peng He; penghe@whu.edu.cn

Received 10 December 2017; Revised 23 February 2018; Accepted 15 April 2018; Published 7 June 2018

Academic Editor: Dingli Yu

Copyright © 2018 Peng He et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cross-project defect prediction (CPDP) on projects with limited historical data has attracted much attention. To the best of our knowledge, however, the performance of existing approaches is usually poor, because of low quality cross-project training data. The objective of this study is to propose an improved method for CPDP by simplifying training data, labeled as TDSelector, which considers both the similarity and the number of defects that each training instance has (denoted by defects), and to demonstrate the effectiveness of the proposed method. Our work consists of three main steps. First, we constructed TDSelector in terms of a linear weighted function of instances' similarity and defects. Second, the basic defect predictor used in our experiments was built by using the Logistic Regression classification algorithm. Third, we analyzed the impacts of different combinations of similarity and the normalization of defects on prediction performance and then compared with two existing methods. We evaluated our method on 14 projects collected from two public repositories. The results suggest that the proposed TDSelector method performs, on average, better than both baseline methods, and the AUC values are increased by up to 10.6% and 4.3%, respectively. That is, the inclusion of defects is indeed helpful to select high quality training instances for CPDP. On the other hand, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. An additional experiment also shows that selecting those instances with more bugs directly as training data can further improve the performance of the bug predictor trained by our method.

1. Introduction

Software defect prediction is one of the most active research topics in Software Engineering. Most early studies usually trained predictors (also known as prediction models) from the historical data on software defects/bugs in the same software project and predicted defects in its upcoming release versions [1]. This approach is referred to as Within-Project Defect Prediction (WPDP). However, WPDP has an obvious drawback when a project has limited historical defect data.

To address the above issue, researchers in this field have attempted to apply defect predictors built for one project to other projects [2–7]. This method is termed Cross-Project Defect Prediction (CPDP). The main purpose of CPDP is to predict defect-prone instances (such as classes) in a project based on the defect data collected from other projects on those public software repositories like PROMISE (http://openscience.us/repo/). The feasibility and potential usefulness of cross-project predictors built with a number of software metrics have been validated [1, 3, 5, 6], but how to improve the performance of CPDP models is still an open issue.

Peters et al. [5] argued that selecting appropriate training data from a software repository became a major issue for CPDP. Moreover, some researchers also suggested that the success rate of CPDP models could be drastically improved when using a suitable training dataset [1, 7]. That is to say, the selection of training data of quality could be a key breakthrough on the above issue. Thus, the construction of an appropriate training dataset gathered from a large number of projects on public software repositories is indeed a challenge for CPDP [7].

Hindawi, Mathematical Problems in Engineering, Volume 2018, Article ID 2650415, 18 pages, https://doi.org/10.1155/2018/2650415


As far as we know, although previous studies on CPDP have taken different types of software metrics into account during the process of selecting relevant training samples, none of them considered the number of defects contained in each sample (denoted by defects). But in fact, we argue that it is also an important factor to consider. Fortunately, some studies have empirically demonstrated the relevance of defects to prediction. For example, "modules with faults in the past are likely to have faults in the future" [8]; "17% to 54% of the high-fault files of release i are still high-fault in release i + 1" [9]; "cover 73%–95% of faults by selecting 10% of the most fault prone source code file" [10]; and "the number of defects found in the previous release of file correlates with its current defect count on a high level" [11].

Does the selection of training data considering defects improve the performance of CPDP models? If the answer is "Yes", on the one hand, it is helpful to validate the feasibility of CPDP; on the other hand, it will contribute to better software defect predictors by making full use of those defect datasets available on the Internet.

The objective of our work is to propose an improved method of training data selection for CPDP by introducing the information of defects. Unlike the prior studies similar to our work, such as [5, 12], which focus mainly on the similarity between instances from training set and test set, this paper gives a comprehensive account of two factors, namely, similarity and defects. Moreover, the proposed method, called TDSelector, can automatically optimize their weights to achieve the best result. In brief, our main contributions to the current state of research on CPDP are summarized as follows.

(1) Considering both similarity and defects, we proposed a simple and easy-to-use training data selection method for CPDP (i.e., TDSelector), which is based on an improved scoring scheme that ranks all possible training instances. In particular, we designed an algorithm to calculate their weights automatically so as to obtain the best prediction result.

(2) To validate the effectiveness of our method, we conducted an elaborate empirical study based on 15 datasets collected from PROMISE and AEEEM (http://bug.inf.usi.ch), and the experimental results show that in a specific CPDP scenario (i.e., many-to-one [13]), the TDSelector-based defect predictor outperforms its rivals that were built with two competing methods in terms of prediction precision.

With these technical contributions, our study could complement previous work on CPDP with respect to training data selection. In particular, we provide a reasonable scoring scheme as well as a more comprehensive guideline for developers to choose appropriate training data to train a defect predictor in practice.

The rest of this paper is organized as follows. In Section 2, we reviewed the related work of this topic. Section 3 presents the preliminaries to our work. Section 4 describes the proposed method TDSelector. Section 5 introduces our experimental setup, and Section 6 shows the primary experimental results. A detailed discussion of some issues, including potential threats to the validity of our study, is presented in Section 7. In the end, Section 8 summarizes this paper and presents our future work.

2. Related Work

2.1. Cross-Project Defect Prediction. Many studies were carried out to validate the feasibility of CPDP in the last five years. For example, Turhan et al. [12] proposed a cross-company defect prediction approach using defect data from other companies to build predictors for target projects. They found that the proposed method increased the probability of defect detection at the cost of increasing the false positive rate. Ni et al. [14] proposed a novel method called FeSCH and designed three ranking strategies to choose appropriate features. The experimental results show that FeSCH can outperform WPDP, ALL, and TCA+ in most cases, and its performance is independent of the used classifiers. He et al. [15] compared the performance between CPDP and WPDP using feature selection techniques. The results indicated that for reduced training data WPDP obtained higher precision, but CPDP in turn achieved a better recall or F-measure. Some researchers have also studied the performance of CPDP based on ensemble classifiers and then validated their effects on this issue [16, 17].

Ryu et al. [18] proposed a transfer cost-sensitive boosting method by considering both distributional characteristics and the class imbalance for CPDP. The results show that their method significantly improves CPDP performance. They also [19] proposed a multiobjective naive Bayes learning technique under CPDP environments by taking into account the class-imbalance contexts. The results indicated that their approaches performed better than the single-objective ones and WPDP models. Li et al. [20] compared some famous data filters and proposed a method called HSBF (hierarchical select-based filter) to improve the performance of CPDP. The results demonstrate that the data filter strategy can indeed improve the performance of CPDP significantly. Moreover, when using an appropriate data filter strategy, the defect predictor built from cross-project data can outperform the predictor learned by using within-project data.

Zhang et al. [21] proposed a universal CPDP model, which was built using a large number of projects collected from SourceForge (https://sourceforge.net) and Google Code (https://code.google.com). Their experimental results showed that it was indeed comparable to WPDP. Furthermore, CPDP is feasible for different projects that have heterogeneous metric sets. He et al. [22] first proposed a CPDP-IFS approach based on the distribution characteristics of both source and target projects to overcome this problem. Nam and Kim [23] then proposed an improved method called HDP, where metric selection and metric matching were introduced to build a defect predictor. Their empirical study on 28 projects showed that about 68% of predictions using the proposed approach outperformed or were comparable to WPDP with statistical significance. Jing et al. [24] proposed a unified metric representation (UMR) for heterogeneous defect data; the experiments on 14 public heterogeneous datasets from four different companies indicated that the proposed approach was more effective in addressing the problem. More researches can be found in [25–27].


2.2. Training Data Selection for CPDP. As mentioned in [5, 28], a fundamental issue for CPDP is to select the most appropriate training data for building quality defect predictors. He et al. [29] discussed this problem in detail from the perspective of data granularity, i.e., release level and instance level. They presented a two-step method for training data selection. The results indicated that the predictor built based on naive Bayes could achieve fairly good performance when using the method together with Peter filter [5]. Porto and Simao [30] proposed an Instance Filtering method by selecting the most similar instances from the training dataset, and the experimental results of 36 versions of 11 open-source projects show that the defect predictor built from cross-project data selected by Feature Selection and Instance Filtering can have generally better performances both in classification and in ranking.

With regard to the data imbalance problem of defect datasets, Jing et al. [31] introduced an effective feature learning method called SDA to provide effective solutions for class-imbalance problems of both within-project and cross-project types, by employing the semisupervised transfer component analysis (SSTCA) method to make the distributions of source and target data consistent. The results indicated that their method greatly improved WPDP and CPDP performance. Ryu et al. [32] proposed a method of hybrid instance selection using nearest neighbor (HISNN). Their results suggested that those instances which had strong local knowledge could be identified via nearest neighbors with the same class label. Poon et al. [33] proposed a credibility theory based naive Bayes (CNB) classifier to establish a novel reweighting mechanism between the source projects and target projects, so that the source data could simultaneously adapt to the target data distribution and retain its own pattern. The experimental results demonstrate the significant improvement in terms of the performance metrics considered achieved by CNB over other CPDP approaches.

The above-mentioned existing studies aimed at reducing the gap in prediction performance between WPDP and CPDP. Although they are making progress towards the goal, there is clearly a lot of room for improvement. For this reason, in this paper we proposed a selection approach to training data based on an improved strategy for instance ranking, instead of a single strategy for similarity calculation, which was used in many prior studies [1, 5, 7, 12].

3. Preliminaries

In our context, a defect dataset S contains m instances, which is represented as S = {I_1, I_2, ..., I_m}. Instance I_i is an object class represented as I_i = {f_{i1}, f_{i2}, ..., f_{in}}, where f_{ij} is the jth metric value of instance I_i and n is the number of metrics (also known as features). Given a source dataset S_s and a target dataset S_t, CPDP aims to perform a prediction in S_t using the knowledge extracted from S_s, where S_s ≠ S_t (see Figure 1(a)). In this paper, source and target datasets have the same set of metrics, and they may differ in distributional characteristics of metric values.

To improve the performance of CPDP, several strategies used to select appropriate training data have been put forward (see Figure 1(b)); e.g., Turhan et al. [12] filtered out those irrelevant training instances by returning the k-nearest neighbors for each test instance.

3.1. An Example of Training Data Selection. First, we introduce a typical method for training data selection at the instance level, and a simple example is used to illustrate this method. For the strategy for other levels of training data selection, such as at the release level, please refer to [7].

Figure 2 shows a training set S_s (including five instances) and a test set S_t (including one instance). Here, each instance contains four metrics and a classification label (i.e., 0 or 1). An instance is defect-free (label = 0) only if its defects are equal to 0; otherwise, it is defective (label = 1). According to the k-nearest neighbor method based on Euclidean distance, we can rank all the five training instances in terms of their distances from the test instance. Due to the same nearest distance from the test instance I_test, it is clear that the three instances I_1, I_2, and I_5 are suitable for use as training instances when k is set to 1. Of the three instances, I_2 and I_5 have the same metric values, but I_2 is labeled as a defective instance because it contains a bug. In this case, I_1 will be selected with the same probability as that of I_2, regardless of the number of defects they include.
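To make the ranking step concrete, the following minimal Python sketch ranks the five training instances by Euclidean distance; the metric values are hypothetical stand-ins mirroring the example in Figure 2, and the names S_s and I_test are illustrative only.

```python
import numpy as np

# Hypothetical training set Ss (metrics f1..f4 per instance); the recorded
# defect counts, mirroring Figure 2, are: I1 = 3, I2 = 1, I3 = I4 = I5 = 0.
S_s = np.array([
    [0.1, 0.0, 0.5, 0.0],   # I1
    [0.1, 0.0, 0.0, 0.5],   # I2
    [0.4, 0.3, 0.0, 0.1],   # I3
    [0.0, 0.0, 0.4, 0.0],   # I4
    [0.1, 0.0, 0.0, 0.5],   # I5
])
I_test = np.array([0.1, 0.0, 0.5, 0.5])   # the single test instance

# Rank training instances by their Euclidean distance to the test instance.
dist = np.linalg.norm(S_s - I_test, axis=1)
ranking = np.argsort(dist)                # nearest instances first
print(ranking, dist[ranking])             # I1, I2, and I5 tie at distance 0.5
```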

In this way, those instances most relevant to the test one can be quickly determined. Clearly, the goal of training data selection is to preserve the representative training instances in S_s as much as possible.

3.2. General Process of Training Data Selection. Before presenting our approach, we describe a general selection process of training data, which consists of three main steps: TDS (training dataset) setup, ranking, and duplicate removal.

TDS Setup. For each target project with little historical data, we need to set up an initial TDS where training data are collected from other projects. To simulate this scenario of CPDP in this paper, any defect data from the target project must be excluded from the initial TDS. Note that different release versions of a project actually belong to the same project. A simple example is visualized in Figure 3.

Ranking. Once the initial TDS is determined, an instance will be treated as a metric vector I, as mentioned above. For each test instance, one can calculate its relevance to each training instance and then rank these training instances in terms of their similarity based on software metrics. Note that a wide variety of software metrics, such as source code metrics, process metrics, previous defects, and code churn, have been used as features for CPDP approaches to improve their prediction performance.

Duplicate Removal. Let l be the size of the test set. For each test instance, if we select its k-nearest neighbors from the initial TDS, there are a total of k × l candidate training instances. Considering that these selected instances may not be unique (i.e., a training instance can be the nearest neighbor of multiple test instances), after removing the duplicate ones, they form the final training set, which is a subset of the initial TDS.
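Putting the three steps together, a small sketch of this general process (under the assumption of Euclidean distance and NumPy arrays; the function name is ours, not from the paper) might look as follows.

```python
import numpy as np

def select_training_data(S_s, S_t, k=10):
    """General TDS selection: for each test instance, pick its k nearest
    training instances (Euclidean distance) and merge the picks, removing
    duplicates, to form the final training set (a subset of the initial TDS)."""
    selected = set()
    for t in S_t:
        dist = np.linalg.norm(S_s - t, axis=1)
        selected.update(np.argsort(dist)[:k].tolist())
    return sorted(selected)   # indices into the initial TDS
```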


Figure 1: Two CPDP scenarios used in this paper: (a) general CPDP; (b) improved CPDP using training data selection.

Figure 2: An example of the selection of training instances.


Figure 3: The overall structure of TDSelector for CPDP.

4. Our Approach: TDSelector

To improve the prediction performance of CPDP, we leverage the following observations.

Similar Instances. Given a test instance, we can examine its similar training instances that were labeled before. The defect proneness shared by similar training instances could help us identify the probability that a test instance is defective. Intuitively, two instances are more likely to have the same state if their metric values are very similar.

Number of Defects (defects). During the selection process, when several training instances have the same distance from a test instance, we need to determine which one should be ranked higher. According to our experiences in software defect prediction and other researchers' studies on the quantitative analysis of previous defect prediction approaches [34, 35], we believe that more attention should be paid to those training instances with more defects in practice.

The selection of training data based on instance similarity has been used in some prior studies [5, 12, 35]. However, to the best of our knowledge, the information about defects has not been fully utilized. So, in this paper, we attempt to propose a training data selection approach combining such information and instance similarity.

4.1. Overall Structure of TDSelector. Figure 3 shows the overall structure of the proposed approach to training data selection, named TDSelector. Before selecting appropriate training data for CPDP, we have to set up a test set and its corresponding initial TDS. For a given project treated as the test set, all the other projects (except the target project) available at hand are used as the initial TDS. This is the so-called many-to-one (M2O) scenario for CPDP [13]. It is quite different from the typical O2O (one-to-one) scenario, where only one randomly selected project is treated as the training set for a given target project (namely, test set).

When both of the sets are given, the ranks of training instances are calculated based on the similarity of software metrics and then returned for each test instance. For the initial TDS, we also collect each training instance's defects and thus rank these instances by their defects. Then, we rate each training instance by combining the two types of ranks in some way and identify the top-k training instances for each test instance according to their final scores. Finally, we use the predictor trained with the final TDS to predict defect proneness in the test set. We describe the core component of TDSelector, namely, the scoring scheme, in the following subsection.

4.2. Scoring Scheme. For each instance in the training set and test set, which is treated as a vector of features (namely, software metrics), we calculate the similarity between them in terms of a similarity index (such as cosine similarity, Euclidean distance, and Manhattan distance, as shown in Table 1). Training instances are then ranked by the similarity between each of them and a given test instance.

For instance, the cosine similarity between a training instance I_p and the target instance I_q is computed via their vector representations, described as follows:

$$\mathit{Sim}(I_p, I_q) = \frac{\vec{I}_p \cdot \vec{I}_q}{\|\vec{I}_p\| \times \|\vec{I}_q\|} = \frac{\sum_{i=1}^{n} (f_{pi} \times f_{qi})}{\sqrt{\sum_{i=1}^{n} f_{pi}^2} \times \sqrt{\sum_{i=1}^{n} f_{qi}^2}}, \tag{1}$$

where $\vec{I}_p$ and $\vec{I}_q$ are the metric vectors for I_p and I_q, respectively, and f_{*i} represents the ith metric value of instance I_*.
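As a quick illustration, Eq. (1) translates directly into a few lines of Python; this is only a sketch of the formula, not the authors' implementation.

```python
import numpy as np

def cosine_sim(I_p, I_q):
    """Eq. (1): cosine similarity between two metric vectors I_p and I_q."""
    return float(np.dot(I_p, I_q) / (np.linalg.norm(I_p) * np.linalg.norm(I_q)))

# Example with two hypothetical 4-metric instances.
print(cosine_sim(np.array([0.1, 0.0, 0.5, 0.0]), np.array([0.1, 0.0, 0.5, 0.5])))
```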

Additionally, for each training instance, we also consider the factor defects in order to further enrich the ranking of its relevant instances. The assumption here is that the more the previous defects, the richer the information of an instance. So, we propose a scoring scheme to rank those candidate training instances, defined as below:

$$\mathit{Score}(I_p, I_q) = \alpha \cdot \mathit{Sim}(I_p, I_q) + (1 - \alpha) \cdot N(\mathit{defect}_p), \tag{2}$$

where defect_p represents the defects of I_p, α is a weighting factor (0 ≤ α ≤ 1) which is learned from training data using Algorithm 1 (see Algorithm 1), and N(defect_p) is a function used to normalize defects with values ranging from 0 to 1.
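A sketch of Eq. (2) in the same vein, where `normalize` stands in for any of the N(.) functions of Table 1; the vectorized cosine computation is an assumption of this sketch.

```python
import numpy as np

def tds_scores(S_s, defects, I_q, alpha, normalize):
    """Eq. (2): Score(I_p, I_q) = alpha * Sim(I_p, I_q) + (1 - alpha) * N(defect_p),
    evaluated for every training instance I_p in S_s against one test instance I_q."""
    sim = S_s @ I_q / (np.linalg.norm(S_s, axis=1) * np.linalg.norm(I_q))
    return alpha * sim + (1.0 - alpha) * normalize(defects)
```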


Algorithm 1: Optimizing the parameter α.

Input:
(1) Candidate TDS S_s = {I_s1, I_s2, ..., I_sm}; test set S_t = {I_t1, I_t2, ..., I_tl} (m > l)
(2) defects = {defect(I_s1), defect(I_s2), ..., defect(I_sm)}; k = 10
Output:
(3) α (α ∈ [0, 1])
Method:
(4) Initialize α = 0, S_s(α) = ∅
(5) While (α ≤ 1) do
(6)   For i = 1; i ≤ l; i++
(7)     For j = 1; j ≤ m; j++
(8)       Score(I_ti, I_sj) = α × Sim(I_ti, I_sj) + (1 − α) × N(defect(I_sj))
(9)     End For
(10)    descSort({Score(I_ti, I_sj) | j = 1, ..., m})  // sort the m training instances in descending order
(11)    S_s(α) = S_s(α) ∪ {top-k training instances}  // select the top k instances
(12)  End For
(13)  AUC ⇐ predict(S_s(α) → S_t)  // CPDP prediction result
(14)  α = α + 0.1
(15) End While
(16) Return (α | max_α AUC)
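The following Python sketch mirrors Algorithm 1 under stated assumptions: cosine similarity, linear normalization of defects, and scikit-learn's Logistic Regression standing in for the Weka classifier actually used in the paper. It also assumes every candidate subset contains both classes; a production version would guard against degenerate selections.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimize_alpha(S_s, defects, S_t, y_t, k=10, step=0.1):
    """Grid-search alpha in [0, 1]: score all training instances per test
    instance (Eq. (2)), keep the top-k per test instance, train a predictor on
    the merged set, and return the alpha that maximizes AUC on the test set."""
    y_s = (defects > 0).astype(int)                  # label: defective or not
    norm_d = (defects - defects.min()) / (defects.max() - defects.min() + 1e-12)
    best_alpha, best_auc = 0.0, -1.0
    for alpha in np.arange(0.0, 1.0 + 1e-9, step):
        selected = set()
        for t in S_t:
            sim = S_s @ t / (np.linalg.norm(S_s, axis=1) * np.linalg.norm(t) + 1e-12)
            scores = alpha * sim + (1.0 - alpha) * norm_d    # Eq. (2)
            selected.update(np.argsort(-scores)[:k].tolist())
        idx = sorted(selected)                       # final TDS, duplicates removed
        clf = LogisticRegression(max_iter=1000).fit(S_s[idx], y_s[idx])
        auc = roc_auc_score(y_t, clf.predict_proba(S_t)[:, 1])
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha
```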

Table 1: Similarity indexes and normalization methods used in this paper.

Similarity:
  Cosine: $\cos(X, Y) = \sum_{k=1}^{n} x_k y_k / (\sqrt{\sum_{k=1}^{n} x_k^2} \sqrt{\sum_{k=1}^{n} y_k^2})$
  Euclidean distance: $d(X, Y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$
  Manhattan distance: $d(X, Y) = \sum_{k=1}^{n} |x_k - y_k|$

Normalization:
  Linear: $N(x) = (x - x_{\min}) / (x_{\max} - x_{\min})$
  Logistic: $N(x) = 1 / (1 + e^{-x}) - 0.5$
  Square root: $N(x) = 1 - 1 / \sqrt{1 + x}$
  Logarithmic: $N(x) = \log_{10}(x + 1)$
  Inverse cotangent: $N(x) = \arctan(x) \times 2 / \pi$
Normalization is a commonly used data preprocessing technique in mathematics and computer science [36]. Graf and Borer [37] have confirmed that normalization can improve the prediction performance of classification models. For this reason, we normalize the defects of training instances when using TDSelector. As you know, there are many normalization methods. In this study, we introduce five typical normalization methods used in machine learning [36, 38]. The description and formulas of the five normalization methods are listed in Table 1.
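For reference, the five normalization methods of Table 1 can be written as one-liners; `x` is a NumPy array of defect counts, and for the linear method the minimum and maximum are taken over the candidate TDS.

```python
import numpy as np

# The five N(x) functions of Table 1, applied to defect counts x >= 0.
def linear(x, x_min, x_max):  return (x - x_min) / (x_max - x_min)
def logistic(x):              return 1.0 / (1.0 + np.exp(-x)) - 0.5
def square_root(x):           return 1.0 - 1.0 / np.sqrt(1.0 + x)
def logarithmic(x):           return np.log10(x + 1.0)
def inverse_cotangent(x):     return np.arctan(x) * 2.0 / np.pi
```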

For each test instance, the top-k training instances ranked in terms of their scores will be returned. Hence, the final TDS is composed by merging the sets of the top-k training instances for each test instance, after removing those duplicate instances.

5. Experimental Setup

5.1. Research Questions. Our experiments were conducted to find empirical evidence that answers the following three research questions.

RQ1: Does the Consideration of Defects Improve the Performance of CPDP? Unlike the previous methods [1, 5, 7, 12, 29], TDSelector ranks candidate training instances in terms of both defects and metric-based similarity. To evaluate the effectiveness of the proposed method considering the additional information of defects, we tested TDSelector according to the experimental data described in Section 5.2. According to (2), we also empirically analyzed the impact of the parameter α on prediction results.

RQ2: Which Combination of Similarity and Normalization Is More Suitable for TDSelector? Equation (2) is comprised of two parts, namely, similarity and the normalization of defects. For each part, several commonly used methods can be adopted in our context. To fully take advantage of TDSelector, one would wonder which combination of similarity and normalization should be chosen. Therefore, it is necessary to compare the effects of different combinations of similarity and normalization methods on prediction results and to determine the best one for TDSelector.

RQ3: Can TDSelector-Based CPDP Outperform the Baseline Methods? Cross-project prediction has attracted much research interest in recent years, and a few CPDP approaches using training data selection have also been proposed, e.g., Peter filter based CPDP [5] (labeled as baseline1) and TCA+ (Transfer Component Analysis) based CPDP [39] (labeled as baseline2). To answer the third question, we compared the TDSelector-based CPDP proposed in this paper with the above two state-of-the-art methods.

5.2. Data Collection. To evaluate the effectiveness of TDSelector, in this paper we used 14 open-source projects written in Java on two online public software repositories, namely, PROMISE [40] and AEEEM [41]. The data statistics of the 14 projects in question are presented in Table 2, where #Instance and #Defect are the numbers of instances and defective instances, respectively, and %Defect is the proportion of defective instances to the total number of instances. Each instance in these projects represents a file of object class and consists of two parts, namely, software metrics and defects.

The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The information of defects and 20 source code metrics for the projects on PROMISE have been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D'Ambros et al. [41], and each project on it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].

Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each one in the 10 projects was selected to be the target dataset once, and then we set up a candidate TDS for CPDP, which excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects was used to construct its initial TDS.

5.3. Experiment Design. To answer the three research questions, our experimental procedure, which is designed under the context of M2O in the CPDP scenario, is described as follows.

First, as with many prior studies [1, 5, 15, 35], all software metric values in training and test sets were normalized by using the Z-score method, because these metrics differ in the scales of their numerical values. For the 14 projects on AEEEM and PROMISE, their numbers of software metrics are different, so the training set for a given test set was selected from the same repository.
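For completeness, a one-function sketch of the Z-score step (column-wise over a metric matrix) is shown below; the guard against zero variance is our addition.

```python
import numpy as np

def z_score(X):
    """Standardize each software metric (column) to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0   # guard: leave constant-valued metrics unchanged
    return (X - X.mean(axis=0)) / std
```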

Second, to examine whether the consideration of defects improves the performance of CPDP, we compared our approach TDSelector with NoD, which is a baseline method considering only the similarity between instances, i.e., α = 1 in (2). Since there are three similarity computation methods used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on similarity indexes. The prediction results of each method in question for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size measure of how often the values in one distribution are larger than the values in a second distribution, to compare the results generated through our approach and its corresponding baseline method.
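Cliff's delta itself is straightforward to compute from two lists of AUC values; a naive O(mn) sketch:

```python
import numpy as np

def cliffs_delta(xs, ys):
    """Cliff's delta: (#{x > y} - #{x < y}) / (m * n) over all pairs (x, y)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    gt = (xs[:, None] > ys[None, :]).sum()
    lt = (xs[:, None] < ys[None, :]).sum()
    return (gt - lt) / (xs.size * ys.size)
```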

Because Cliff did not suggest corresponding δ values to represent small, medium, and large effects, we converted Cohen's d effect size to Cliff's δ using the cohd2delta R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 contains descriptors for magnitudes of d = 0.01 to 2.0.

Third, according to the results of the second step of this procedure, 15 combinations based on three typical similarity methods for software metrics and five commonly used normalization functions for defects were examined by the pairwise comparison method. We then determined which combination is more suitable for our approach according to mean, standard deviation, and Cliff's delta effect size.

Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all the 15 test sets using TDSelector and two competing methods (i.e., baseline1 and baseline2 introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.

After this process is completed, we will discuss the answers to the three research questions of our study.

5.4. Classifier and Evaluation Measure. As the underlying machine learning classifier for CPDP, Logistic Regression (LR), which was widely used in many defect prediction literatures [4, 23, 39, 43–46], is also used in this study. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka/). For our experiments, we used the default parameter settings for LR specified in Weka unless otherwise specified.

To evaluate the prediction performance of different methods in this paper, we utilized the area under a Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier will rank a randomly chosen defective class higher than a randomly chosen defect-free one [47], and it is known as a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold that is used to decide whether an instance should be classified as a negative instance [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
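The evaluation step thus reduces to training an LR model on the selected TDS and scoring the target project by AUC. A minimal sketch with stand-in random data follows; the paper uses Weka's LR, so scikit-learn is substituted here as an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 20)), rng.integers(0, 2, size=100)  # stand-in TDS
X_test, y_test = rng.normal(size=(40, 20)), rng.integers(0, 2, size=40)      # stand-in target

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))  # AUC on the test set
```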

6. Experimental Results

6.1. Answer to RQ1. We compared our approach considering defects with the baseline method NoD, which selects training data in terms of cosine similarity only. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. Obviously, the average growth rates of AUC value vary from 5.9% to 9.0% when different normalization methods for defects were utilized. In addition, all the δ values in this table are greater than 0.2.


Table 2: Data statistics of the projects used in our experiments.

Repository  Project  Version  #Instance  #Defect  %Defect
PROMISE  Ant  1.7  745  166  22.3%
PROMISE  Camel  1.6  965  188  19.5%
PROMISE  Ivy  2.0  352  40  11.4%
PROMISE  Jedit  3.2  272  90  33.1%
PROMISE  Lucene  2.4  340  203  59.7%
PROMISE  Poi  3.0  442  281  63.6%
PROMISE  Synapse  1.2  256  86  33.6%
PROMISE  Velocity  1.4  196  147  75.0%
PROMISE  Xalan  2.6  885  411  46.4%
PROMISE  Xerces  1.4  588  437  74.3%
AEEEM  Equinox  1/1/2005–6/25/2008  324  129  39.8%
AEEEM  Eclipse JDT core (Eclipse)  1/1/2005–6/17/2008  997  206  20.7%
AEEEM  Apache Lucene (Lucene2)  1/1/2005–10/8/2008  692  20  2.9%
AEEEM  Mylyn  1/17/2005–3/17/2009  1862  245  13.2%
AEEEM  Eclipse PDE UI (Pde)  1/1/2005–9/11/2008  1497  209  14.0%

Table 3: The mappings between different d values and their effectiveness levels.

Effect size  d  δ
Very small  0.01  0.008
Small  0.20  0.147
Medium  0.50  0.33
Large  0.80  0.474
Very large  1.20  0.622
Huge  2.0  0.811

This indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).

We then compared TDSelector with the baseline methods using the other widely used similarity calculation methods, and the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. TDSelector, compared with the corresponding NoD, achieves average growth rates of AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7, respectively. More specifically, the highest growth rate of AUC value in Table 6 is 43.6% for Equinox, and in Table 7 it is 39.7% for Lucene2. Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can on average improve the performance of those baseline methods without regard to defects.

Table 4: Analyzing the factors similarity and normalization.

Factor  Method  Mean  Std  δ
Similarity  Cosine similarity  0.704  0.082  −0.133
Similarity  Euclidean distance  0.719  0.080  –
Similarity  Manhattan distance  0.682  0.098  −0.193
Normalization  Linear  0.706  0.087  −0.012
Normalization  Logistic  0.710  0.078  –
Normalization  Square root  0.699  0.091  −0.044
Normalization  Logarithmic  0.700  0.086  −0.064
Normalization  Inverse cotangent  0.696  0.097  −0.056

In short, during the process of training data selection, the consideration of defects for CPDP can help us to select higher quality training data, thus leading to better classification results.

6.2. Answer to RQ2. Although the inclusion of defects in the selection of training data of quality is helpful for better performance of CPDP, it is worthy to note that our method completely failed in Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Therefore, which combination of similarity and normalization is more suitable for TDSelector?

First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of any normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factors.


Table 5: The best prediction results obtained by the CPDP approach based on TDSelector with cosine similarity, summarized over the 15 test sets. NoD represents the baseline method (α = 1).

Normalization  AUC (Mean ± Std)  δ
Linear  0.711 ± 0.081  0.338
Logistic  0.711 ± 0.070  0.351
Square root  0.697 ± 0.091  0.249
Logarithmic  0.707 ± 0.083  0.351
Inverse cotangent  0.690 ± 0.092  0.213
NoD (α = 1)  0.652 ± 0.113  –


Table 6: The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance, summarized over the 15 test sets.

Normalization  AUC (Mean ± Std)  δ
Linear  0.719 ± 0.080  0.369
Logistic  0.717 ± 0.075  0.360
Square root  0.715 ± 0.076  0.342
Logarithmic  0.715 ± 0.072  0.324
Inverse cotangent  0.708 ± 0.084  0.280
NoD (α = 1)  0.668 ± 0.096  –


Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance, summarized over the 15 test sets.

Normalization  AUC (Mean ± Std)  δ
Linear  0.696 ± 0.084  0.187
Logistic  0.705 ± 0.084  0.249
Square root  0.680 ± 0.100  0.164
Logarithmic  0.677 ± 0.102  0.116
Inverse cotangent  0.677 ± 0.103  0.133
NoD (α = 1)  0.659 ± 0.105  –


Figure 4: A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow.

If we do not take into account normalization, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation value (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take into account the similarity index, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation value (0.078), and it is followed by the linear normalization method.

Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from the two different aspects is presented in Figure 4.

Then we considered both factors. According to the results in Tables 5, 6, and 7 grouped by different similarity indexes, TDSelector can obtain the best results 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084 when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the value of Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method can still outperform the other 14 combinations.

6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is, on average, better than the two baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. The TDSelector performs better than baseline1 in 14 out of 15 datasets, and it has an advantage over baseline2 in 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training dataset obtained from different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of instances with more than one defect among these defective instances, our method can return more, and the ratio approximates to twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized due to the instances with more defects. The result further validated that the selection of training data considering defects is valuable.

Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.

In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.

7. Discussion

7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of the nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step value of 1. As shown in Figure 5, for the three combinations in question, selecting fewer k-nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.

Interestingly, for the combinations of "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance when k is equal to 10.


Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. Columns are grouped by similarity index; within each group the normalization methods are ordered Linear, Logistic, Square root, Logarithmic, Inverse cotangent.

Combination / Cosine similarity / Euclidean distance / Manhattan distance
Cosine + Linear:  –  0.018  0.084  0.000  0.116  /  −0.049  −0.036  −0.004  −0.013  −0.009  /  0.138  0.049  0.164  0.178  0.169
Euclidean + Linear:  0.049  0.102  0.111  0.062  0.164  /  –  0.036  0.040  0.058  0.089  /  0.209  0.102  0.249  0.276  0.244
Manhattan + Logistic:  −0.049  −0.022  0.022  −0.013  0.111  /  −0.102  −0.076  −0.080  −0.049  −0.031  /  0.053  –  0.124  0.151  0.147


Table 9: A comparison between our approach and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all the three methods in question. The two rightmost columns give the growth rate (%) of the TDSelector ("Euclidean + Linear") AUC value over each baseline.

Test set  Baseline1  Baseline2  +% vs. baseline1  +% vs. baseline2
Ant  0.785  0.803  1.3  −1.0
Xalan  0.657  0.675  10.7  7.7
Camel  0.595  0.624  0.5  −4.2
Ivy  0.789  0.802  4.7  3.0
Jedit  0.694  0.782  14.3  1.4
Lucene  0.608  0.701  −0.8  −14.0
Poi  0.691  0.789  3.3  −9.5
Synapse  0.740  0.748  2.3  1.2
Velocity  0.330  0.331  65.2  64.7
Xerces  0.714  0.753  8.5  2.9
Eclipse  0.706  0.744  10.2  4.6
Equinox  0.587  0.720  23.1  0.3
Lucene2  0.705  0.724  2.5  −0.2
Mylyn  0.631  0.646  9.3  6.8
Pde  0.678  0.737  10.4  1.5
Avg  0.663  0.705  10.6  4.3

δ: baseline1 vs. TDSelector = −0.409; baseline2 vs. TDSelector = −0.009.

Table 10: Comparison of the defective instances of the simplified training dataset obtained from different methods on the Velocity project.

Method  #defect instances / #instances  #instances (defects > 1) / #defect instances
Baseline1  0.375  0.247
Baseline2  0.393  0.291
TDSelector  0.376  0.487

Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE.

The combination of "Manhattan + Logistic", by contrast, achieves its best result as k is set to 7. Even so, that best result is still worse than those of the other two combinations.

Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM.

7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of training data of quality in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The result of this question is of particular concern for developers in practice.

According to Figure 7(a), for the 15 releases, most of them contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as TDSelector-3.


Figure 7: Percentage of defective instances with different numbers of bugs: (a) from the viewpoint of a single dataset (release); (b) from the viewpoint of the whole dataset used in our experiments.

That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
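A sketch of this two-stage split (the threshold of three bugs comes from the paper; the function name is ours):

```python
import numpy as np

def tdselector3_split(defects, threshold=3):
    """Stage 1 of TDSelector-3: instances with at least `threshold` bugs are
    taken directly as training data; the rest stay candidates for the
    score-based selection of Eq. (2)."""
    direct = np.flatnonzero(defects >= threshold)
    remaining = np.flatnonzero(defects < threshold)
    return direct, remaining
```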

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average, TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold for the number of bugs in each training instance (namely, three in this paper) at the first stage. Our approach is then able to be applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.

7.3. Threats to Validity. In this study, we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study would change if we use different settings of the above four factors.

Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis testing methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria that were initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper belong to small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method obviously performs better than baseline1, indicated by |δ| = 0.409 > 0.33.

Threats to external validity emphasize the generalizability of the obtained results. First, the selection of experimental datasets, beyond AEEEM and PROMISE, is the main threat to the validity of the results of our study. All the 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five


[Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector. Three plots, one per combination ("Euclidean + Linear", "Cosine + Linear", and "Manhattan + Logistic"), each showing the AUC values (0.50-0.85) of TDSelector and TDSelector-3 for Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, and Pde. The last column in each of the three plots represents the average AUC value.]

normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from those defect datasets available on the Internet, in order to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are


required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference on Predictive Models in Software Engineering, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," Computer, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software. Ruanjian Xuebao, vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction - classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.


As far as we know, although previous studies on CPDP have taken different types of software metrics into account during the process of selecting relevant training samples, none of them considered the number of defects contained in each sample (denoted by defects). But, in fact, we argue that it is also an important factor to consider. Fortunately, some studies have empirically demonstrated the relevance of defects to prediction. For example, "modules with faults in the past are likely to have faults in the future" [8], "17% to 54% of the high-fault files of release i are still high-fault in release i + 1" [9], "cover 73-95% of faults by selecting 10% of the most fault-prone source code files" [10], and "the number of defects found in the previous release of file correlates with its current defect count on a high level" [11].

Does the selection of training data considering defects improve the performance of CPDP models? If the answer is "Yes", on the one hand, it is helpful to validate the feasibility of CPDP; on the other hand, it will contribute to better software defect predictors by making full use of those defect datasets available on the Internet.

The objective of our work is to propose an improved method of training data selection for CPDP by introducing the information of defects. Unlike the prior studies similar to our work, such as [5, 12], which focus mainly on the similarity between instances from the training set and test set, this paper gives a comprehensive account of two factors, namely, similarity and defects. Moreover, the proposed method, called TDSelector, can automatically optimize their weights to achieve the best result. In brief, our main contributions to the current state of research on CPDP are summarized as follows.

(1) Considering both similarity and defects, we proposed a simple and easy-to-use training data selection method for CPDP (i.e., TDSelector), which is based on an improved scoring scheme that ranks all possible training instances. In particular, we designed an algorithm to calculate their weights automatically so as to obtain the best prediction result.

(2) To validate the effectiveness of our method, we conducted an elaborate empirical study based on 15 datasets collected from PROMISE and AEEEM (http://bug.inf.usi.ch), and the experimental results show that, in a specific CPDP scenario (i.e., many-to-one [13]), the TDSelector-based defect predictor outperforms its rivals that were built with two competing methods in terms of prediction precision.

With these technical contributions, our study could complement previous work on CPDP with respect to training data selection. In particular, we provide a reasonable scoring scheme as well as a more comprehensive guideline for developers to choose appropriate training data to train a defect predictor in practice.

The rest of this paper is organized as follows. In Section 2, we review the related work on this topic. Section 3 presents the preliminaries to our work. Section 4 describes the proposed method TDSelector. Section 5 introduces our experimental setup, and Section 6 shows the primary experimental results. A detailed discussion of some issues, including potential threats to the validity of our study, is presented in Section 7. In the end, Section 8 summarizes this paper and presents our future work.

2. Related Work

2.1. Cross-Project Defect Prediction. Many studies were carried out to validate the feasibility of CPDP in the last five years. For example, Turhan et al. [12] proposed a cross-company defect prediction approach using defect data from other companies to build predictors for target projects. They found that the proposed method increased the probability of defect detection at the cost of an increased false positive rate. Ni et al. [14] proposed a novel method called FeSCH and designed three ranking strategies to choose appropriate features. The experimental results show that FeSCH can outperform WPDP, ALL, and TCA+ in most cases, and its performance is independent of the used classifiers. He et al. [15] compared the performance between CPDP and WPDP using feature selection techniques. The results indicated that, for reduced training data, WPDP obtained higher precision, but CPDP in turn achieved a better recall or F-measure. Some researchers have also studied the performance of CPDP based on ensemble classifiers and then validated their effects on this issue [16, 17].

Ryu et al. [18] proposed a transfer cost-sensitive boosting method by considering both distributional characteristics and the class imbalance for CPDP. The results show that their method significantly improves CPDP performance. They also [19] proposed a multiobjective naive Bayes learning technique under CPDP environments by taking into account the class-imbalance contexts. The results indicated that their approaches performed better than the single-objective ones and WPDP models. Li et al. [20] compared some famous data filters and proposed a method called HSBF (hierarchical select-based filter) to improve the performance of CPDP. The results demonstrate that the data filter strategy can indeed improve the performance of CPDP significantly. Moreover, when using an appropriate data filter strategy, the defect predictor built from cross-project data can outperform the predictor learned by using within-project data.

Zhang et al. [21] proposed a universal CPDP model, which was built using a large number of projects collected from SourceForge (https://sourceforge.net) and Google Code (https://code.google.com). Their experimental results showed that it was indeed comparable to WPDP. Furthermore, CPDP is feasible for different projects that have heterogeneous metric sets. He et al. [22] first proposed a CPDP-IFS approach based on the distribution characteristics of both source and target projects to overcome this problem. Nam and Kim [23] then proposed an improved method called HDP, where metric selection and metric matching were introduced to build a defect predictor. Their empirical study on 28 projects showed that about 68% of predictions using the proposed approach outperformed or were comparable to WPDP with statistical significance. Jing et al. [24] proposed a unified metric representation (UMR) for heterogeneous defect data; the experiments on 14 public heterogeneous datasets from four different companies indicated that the proposed approach was more effective in addressing the problem. More research can be found in [25-27].


2.2. Training Data Selection for CPDP. As mentioned in [5, 28], a fundamental issue for CPDP is to select the most appropriate training data for building quality defect predictors. He et al. [29] discussed this problem in detail from the perspective of data granularity, i.e., release level and instance level. They presented a two-step method for training data selection. The results indicated that the predictor built based on naive Bayes could achieve fairly good performance when using the method together with Peter filter [5]. Porto and Simao [30] proposed an instance filtering method by selecting the most similar instances from the training dataset, and the experimental results on 36 versions of 11 open-source projects show that the defect predictor built from cross-project data selected by feature selection and instance filtering can have generally better performance both in classification and in ranking.

With regard to the data imbalance problem of defect datasets, Jing et al. [31] introduced an effective feature learning method called SDA to provide effective solutions for class-imbalance problems of both within-project and cross-project types, by employing the semisupervised transfer component analysis (SSTCA) method to make the distributions of source and target data consistent. The results indicated that their method greatly improved WPDP and CPDP performance. Ryu et al. [32] proposed a method of hybrid instance selection using nearest neighbor (HISNN). Their results suggested that those instances which had strong local knowledge could be identified via nearest neighbors with the same class label. Poon et al. [33] proposed a credibility theory based naive Bayes (CNB) classifier to establish a novel reweighting mechanism between the source projects and target projects, so that the source data could simultaneously adapt to the target data distribution and retain its own pattern. The experimental results demonstrate the significant improvement achieved by CNB over other CPDP approaches in terms of the performance metrics considered.

The above-mentioned existing studies aimed at reducing the gap in prediction performance between WPDP and CPDP. Although they are making progress towards the goal, there is clearly a lot of room for improvement. For this reason, in this paper we propose a selection approach to training data based on an improved strategy for instance ranking, instead of the single strategy for similarity calculation that was used in many prior studies [1, 5, 7, 12].

3. Preliminaries

In our context, a defect dataset S contains m instances, which is represented as S = {I_1, I_2, ..., I_m}. Instance I_i is an object class represented as I_i = {f_{i1}, f_{i2}, ..., f_{in}}, where f_{ij} is the jth metric value of instance I_i and n is the number of metrics (also known as features). Given a source dataset S_s and a target dataset S_t, CPDP aims to perform a prediction in S_t using the knowledge extracted from S_s, where S_s ≠ S_t (see Figure 1(a)). In this paper, source and target datasets have the same set of metrics, and they may differ in the distributional characteristics of metric values.

To improve the performance of CPDP, several strategies used to select appropriate training data have been put forward (see Figure 1(b)); e.g., Turhan et al. [12] filtered out irrelevant training instances by returning the k-nearest neighbors for each test instance.

3.1. An Example of Training Data Selection. First, we introduce a typical method for training data selection at the instance level, and a simple example is used to illustrate this method. For the strategies at other levels of training data selection, such as the release level, please refer to [7].

Figure 2 shows a training set S_s (including five instances) and a test set S_t (including one instance). Here, each instance contains five metrics and a classification label (i.e., 0 or 1). An instance is defect-free (label = 0) only if its defects are equal to 0; otherwise, it is defective (label = 1). According to the k-nearest neighbor method based on Euclidean distance, we can rank all the five training instances in terms of their distances from the test instance. Due to the same nearest distance from the test instance I_test, it is clear that the three instances I_1, I_2, and I_5 are suitable for use as training instances when k is set to 1. Of the three instances, I_2 and I_5 have the same metric values, but I_2 is labeled as a defective instance because it contains a bug. In this case, I_1 will be selected with the same probability as that of I_2, regardless of the number of defects they include.
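The ranking in this example can be reproduced with a few lines of code (illustrative only; the metric values below are simply those shown in Figure 2):

from math import dist

# Training instances of Figure 2: metrics (f1-f4), label, defects.
train = {
    "I1": ([0.1, 0.0, 0.5, 0.0], 1, 3),
    "I2": ([0.1, 0.0, 0.0, 0.5], 1, 1),
    "I3": ([0.4, 0.3, 0.0, 0.1], 0, 0),
    "I4": ([0.0, 0.0, 0.4, 0.0], 0, 0),
    "I5": ([0.1, 0.0, 0.0, 0.5], 0, 0),
}
i_test = [0.1, 0.0, 0.5, 0.5]

# Rank the five training instances by Euclidean distance to I_test.
for name, (metrics, label, defects) in sorted(
        train.items(), key=lambda kv: dist(kv[1][0], i_test)):
    print(name, round(dist(metrics, i_test), 3), label, defects)
# I1, I2, and I5 tie at distance 0.5, ahead of I4 (0.52) and I3 (0.768).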

In this way, those instances most relevant to the test one can be quickly determined. Clearly, the goal of training data selection is to preserve the representative training instances in S_s as much as possible.

3.2. General Process of Training Data Selection. Before presenting our approach, we describe a general selection process of training data, which consists of three main steps: TDS (training dataset) setup, ranking, and duplicate removal.

TDS Setup. For each target project with little historical data, we need to set up an initial TDS, where training data are collected from other projects. To simulate this scenario of CPDP in this paper, any defect data from the target project must be excluded from the initial TDS. Note that different release versions of a project actually belong to the same project. A simple example is visualized in Figure 3.

Ranking. Once the initial TDS is determined, an instance will be treated as a metric vector I, as mentioned above. For each test instance, one can calculate its relevance to each training instance and then rank these training instances in terms of their similarity based on software metrics. Note that a wide variety of software metrics, such as source code metrics, process metrics, previous defects, and code churn, have been used as features for CPDP approaches to improve their prediction performance.

Duplicate Removal. Let l be the size of the test set. For each test instance, if we select its k-nearest neighbors from the initial TDS, there are a total of k × l candidate training instances. Considering that these selected instances may not be unique (i.e., a training instance can be the nearest neighbor of multiple test instances), after removing the duplicate ones, they form the final training set, which is a subset of the initial TDS.
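A compact sketch of these three steps (the data layout, a dict mapping project names to lists of metric vectors, is a hypothetical simplification):

from math import dist

def build_final_tds(projects, target_name, test_set, k=10):
    """TDS setup, ranking, and duplicate removal in one pass."""
    # TDS setup: pool the instances of every project but the target.
    initial_tds = [inst for name, insts in projects.items()
                   if name != target_name for inst in insts]

    # Ranking: the k nearest neighbours of each test instance;
    # duplicate removal: merge them into a set, so an instance that is
    # a neighbour of several test instances is kept only once.
    final = set()
    for t in test_set:
        ranked = sorted(initial_tds, key=lambda inst: dist(inst, t))
        final.update(tuple(inst) for inst in ranked[:k])
    return [list(inst) for inst in final]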


[Figure 1: Two CPDP scenarios used in this paper. (a) General CPDP: a predictor is trained on the training dataset S_s and tested on the target dataset S_t. (b) Improved CPDP using training data selection: a strategy for instance selection reduces the training dataset S_s to S_s' before the predictor is trained and tested on S_t. Instances are marked as buggy, non-buggy, or unlabeled, each described by metric/feature values f.]

[Figure 2: An example of the selection of training instances. Training set S_s (metrics f1-f4; label with defects in parentheses): I1 = (0.1, 0, 0.5, 0), label 1 (3); I2 = (0.1, 0, 0, 0.5), label 1 (1); I3 = (0.4, 0.3, 0, 0.1), label 0; I4 = (0, 0, 0.4, 0), label 0; I5 = (0.1, 0, 0, 0.5), label 0. Test instance I_test in S_t = (0.1, 0, 0.5, 0.5). Ranking by distance(I_i, I_test): rank 1: I1, I2, I5; rank 2: I4; rank 3: I3.]


[Figure 3: The overall structure of TDSelector for CPDP. Datasets from Project A-v1, Project A-v2, Project B-v1, Project C-v1, and Project C-v2 are split into a test set (here Project C-v1) and a training set (Project A-v1, Project A-v2, and Project B-v1) during TDS setup; candidate instances are then ranked by the similarity of software metrics and by the defects of each instance, and the ranked instances (TDSelector) are fed to the defect predictor.]

4. Our Approach: TDSelector

To improve the prediction performance of CPDP, we leverage the following observations.

Similar Instances. Given a test instance, we can examine its similar training instances that were labeled before. The defect proneness shared by similar training instances could help us identify the probability that a test instance is defective. Intuitively, two instances are more likely to have the same state if their metric values are very similar.

Number of Defects (defects). During the selection process, when several training instances have the same distance from a test instance, we need to determine which one should be ranked higher. According to our experience in software defect prediction and other researchers' studies on the quantitative analysis of previous defect prediction approaches [34, 35], we believe that more attention should be paid to those training instances with more defects in practice.

The selection of training data based on instance similarity has been used in some prior studies [5, 12, 35]. However, to the best of our knowledge, the information about defects has not been fully utilized. So, in this paper, we attempt to propose a training data selection approach combining such information and instance similarity.

4.1. Overall Structure of TDSelector. Figure 3 shows the overall structure of the proposed approach to training data selection, named TDSelector. Before selecting appropriate training data for CPDP, we have to set up a test set and its corresponding initial TDS. For a given project treated as the test set, all the other projects (except the target project) available at hand are used as the initial TDS. This is the so-called many-to-one (M2O) scenario for CPDP [13]. It is quite different from the typical O2O (one-to-one) scenario, where only one randomly selected project is treated as the training set for a given target project (namely, the test set).

When both of the sets are given, the ranks of training instances are calculated based on the similarity of software metrics and then returned for each test instance. For the initial TDS, we also collect each training instance's defects and thus rank these instances by their defects. Then, we rate each training instance by combining the two types of ranks in some way and identify the top-k training instances for each test instance according to their final scores. Finally, we use the predictor trained with the final TDS to predict defect proneness in the test set. We describe the core component of TDSelector, namely, the scoring scheme, in the following subsection.

4.2. Scoring Scheme. For each instance in the training set and test set, which is treated as a vector of features (namely, software metrics), we calculate the similarity between them in terms of a similarity index (such as cosine similarity, Euclidean distance, or Manhattan distance, as shown in Table 1). Training instances are then ranked by the similarity between each of them and a given test instance.

For instance, the cosine similarity between a training instance I_p and the target instance I_q is computed via their vector representations, described as follows:

\[ Sim(I_p, I_q) = \frac{\vec{I}_p \cdot \vec{I}_q}{\|\vec{I}_p\| \times \|\vec{I}_q\|} = \frac{\sum_{i=1}^{n} (f_{pi} \times f_{qi})}{\sqrt{\sum_{i=1}^{n} f_{pi}^2} \times \sqrt{\sum_{i=1}^{n} f_{qi}^2}}, \tag{1} \]

where \vec{I}_p and \vec{I}_q are the metric vectors for I_p and I_q, respectively, and f_{*i} represents the ith metric value of instance I_*.
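For illustration, (1) translates directly into a few lines of code (a sketch; real implementations would need to decide how to handle all-zero metric vectors):

import math

def cosine_similarity(ip, iq):
    """Eq. (1): cosine similarity between two metric vectors."""
    dot = sum(p * q for p, q in zip(ip, iq))
    norm = math.sqrt(sum(p * p for p in ip)) * math.sqrt(sum(q * q for q in iq))
    return dot / norm if norm else 0.0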

Additionally, for each training instance, we also consider the factor defects in order to further enrich the ranking of its relevant instances. The assumption here is that the more the previous defects, the richer the information of an instance. So, we propose a scoring scheme to rank those candidate training instances, defined as below:

\[ Score(I_p, I_q) = \alpha \times Sim(I_p, I_q) + (1 - \alpha) \times N(defect_p), \tag{2} \]

where defect_p represents the defects of I_p, α is a weighting factor (0 ≤ α ≤ 1) which is learned from training data using Algorithm 1 (see Algorithm 1), and N(defect_p) is a function used to normalize defects with values ranging from 0 to 1.
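In code, the scoring scheme of (2) is a one-line weighted sum (a sketch; `normalize` stands for any of the five functions of Table 1):

def score(sim, defects, alpha, normalize):
    """Eq. (2): alpha weighs similarity against the normalized
    defect count of the training instance."""
    return alpha * sim + (1.0 - alpha) * normalize(defects)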


Algorithm 1: Algorithm of parameter optimization (optimizing the parameter α).

Input:
(1) Candidate TDS S_s = {I_s1, I_s2, ..., I_sm}; test set S_t = {I_t1, I_t2, ..., I_tl} (m > l)
(2) defects = {defect(I_s1), defect(I_s2), ..., defect(I_sm)}, and k = 10
Output:
(3) α (α ∈ [0, 1])
Method:
(4) Initialize α = 0, S_s(α) = ∅
(5) While (α ≤ 1) do
(6)   For i = 1; i ≤ l; i++
(7)     For j = 1; j ≤ m; j++
(8)       Score(I_ti, I_sj) = α × Sim(I_ti, I_sj) + (1 − α) × N(defect(I_sj))
(9)     End For
(10)    descSort({Score(I_ti, I_sj) | j = 1, ..., m})  // sort the m training instances in descending order
(11)    S_s(α) = S_s(α) ∪ {top-k training instances}  // select the top k instances
(12)  End For
(13)  AUC ⇐ CPDP prediction on S_t using S_s(α)  // prediction result
(14)  α = α + 0.1
(15) End While
(16) Return the α with the maximum AUC
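Algorithm 1 is essentially a grid search over α with step 0.1; the sketch below restates it in Python, with select_tds(alpha) and evaluate_auc(tds) as hypothetical helpers standing in for lines (6)-(12) and line (13), respectively:

def optimize_alpha(select_tds, evaluate_auc):
    """Grid search for the weighting factor alpha of Eq. (2).

    `select_tds(alpha)` returns the merged top-k training instances
    for every test instance under that alpha; `evaluate_auc(tds)`
    trains a CPDP predictor on them and returns its AUC.
    """
    best_alpha, best_auc = 0.0, -1.0
    for step in range(11):                # alpha = 0.0, 0.1, ..., 1.0
        alpha = step / 10.0
        auc = evaluate_auc(select_tds(alpha))
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha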

Table 1: Similarity indexes and normalization methods used in this paper.

Similarity
  Cosine: $\cos(X, Y) = \sum_{k=1}^{n} x_k y_k / ( \sqrt{\sum_{k=1}^{n} x_k^2} \sqrt{\sum_{k=1}^{n} y_k^2} )$
  Euclidean distance: $d(X, Y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$
  Manhattan distance: $d(X, Y) = \sum_{k=1}^{n} |x_k - y_k|$

Normalization
  Linear: $N(x) = (x - x_{\min}) / (x_{\max} - x_{\min})$
  Logistic: $N(x) = 1 / (1 + e^{-x}) - 0.5$
  Square root: $N(x) = 1 - 1 / \sqrt{1 + x}$
  Logarithmic: $N(x) = \log_{10}(x + 1)$
  Inverse cotangent: $N(x) = \arctan(x) \times 2 / \pi$

Normalization is a commonly used data preprocessing technique in mathematics and computer science [36]. Graf and Borer [37] have confirmed that normalization can improve the prediction performance of classification models. For this reason, we normalize the defects of training instances when using TDSelector. As is well known, there are many normalization methods; in this study, we introduce five typical normalization methods used in machine learning [36, 38]. The descriptions and formulas of the five normalization methods are listed in Table 1.
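The five normalization methods of Table 1 translate directly into the following functions (a transcription of the table's formulas; for the linear method, x_min and x_max would be the minimum and maximum defect counts observed in the candidate TDS):

import math

def linear(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

def square_root(x):
    return 1.0 - 1.0 / math.sqrt(1.0 + x)

def logarithmic(x):
    return math.log10(x + 1.0)

def inverse_cotangent(x):
    return math.atan(x) * 2.0 / math.pi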

For each test instance, the top-k training instances ranked in terms of their scores will be returned. Hence, the final TDS is composed by merging the sets of the top-k training instances for each test instance, after those duplicate instances are removed.

5. Experimental Setup

5.1. Research Questions. Our experiments were conducted to find empirical evidence that answers the following three research questions.

RQ1: Does the Consideration of Defects Improve the Performance of CPDP? Unlike the previous methods [1, 5, 7, 12, 29], TDSelector ranks candidate training instances in terms of both defects and metric-based similarity. To evaluate the effectiveness of the proposed method considering the additional information of defects, we tested TDSelector according to the experimental data described in Section 5.2. According to (2), we also empirically analyzed the impact of the parameter α on prediction results.

RQ2: Which Combination of Similarity and Normalization Is More Suitable for TDSelector? Equation (2) is comprised of two parts, namely, similarity and the normalization of defects. For each part, several commonly used methods can be adopted in our context. To fully take advantage of TDSelector, one would wonder which combination of similarity and normalization should be chosen. Therefore, it is necessary to compare the effects of different combinations of similarity and normalization methods on prediction results and to determine the best one for TDSelector.

RQ3: Can TDSelector-Based CPDP Outperform the Baseline Methods? Cross-project prediction has attracted much research interest in recent years, and a few CPDP approaches using training data selection have also been proposed, e.g., Peter filter based CPDP [5] (labeled as baseline1) and TCA+ (Transfer Component Analysis) based CPDP [39] (labeled as baseline2). To answer the third question, we compared the TDSelector-based CPDP proposed in this paper with the above two state-of-the-art methods.

5.2. Data Collection. To evaluate the effectiveness of TDSelector, in this paper we used 14 open-source projects written in Java, hosted on two online public software repositories, namely, PROMISE [40] and AEEEM [41]. The data statistics of the 14 projects in question are presented in Table 2, where #Instance and #Defect are the numbers of instances and defective instances, respectively, and %Defect is the proportion of defective instances to the total number of instances. Each instance in these projects represents a file of an object class and consists of two parts, namely, software metrics and defects.

The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The information of defects and 20 source code metrics for the projects on PROMISE have been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D'Ambros et al. [41], and each project on it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].

Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each one of the 10 projects was selected to be the target dataset once, and then we set up a candidate TDS for CPDP which excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects is used to construct its initial TDS.

5.3. Experiment Design. To answer the three research questions, our experimental procedure, which is designed under the context of M2O in the CPDP scenario, is described as follows.

First, as with many prior studies [1, 5, 15, 35], all software metric values in the training and test sets were normalized by using the Z-score method, because these metrics differ in the scales of their numerical values. For the 14 projects on AEEEM and PROMISE, their numbers of software metrics are different, so the training set for a given test set was selected from the same repository.
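The Z-score step is standard; as a minimal sketch applied to one metric column:

def z_score(values):
    """Normalize one metric column to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in values]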

Second, to examine whether the consideration of defects improves the performance of CPDP, we compared our approach TDSelector with NoD, a baseline method considering only the similarity between instances, i.e., α = 1 in (2). Since there are three similarity computation methods used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on similarity indexes. The prediction results of each method in question for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size measure of how often the values in one distribution are larger than the values in a second distribution, to compare the results generated through our approach and its corresponding baseline method.

Because Cliff did not suggest corresponding δ values to represent small, medium, and large effects, we converted Cohen's d effect size to Cliff's δ using the cohd2delta R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 contains descriptors for magnitudes of d = 0.01 to 2.0.
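Cliff's delta itself is straightforward to compute from two groups of results (a sketch of the standard definition, not of the tooling the authors used):

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))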

Third, according to the results of the second step of this procedure, 15 combinations based on three typical similarity methods for software metrics and five commonly used normalization functions for defects were examined by the pairwise comparison method. We then determined which combination is more suitable for our approach according to mean, standard deviation, and Cliff's delta effect size.

Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all the 15 test sets using TDSelector and the two competing methods (i.e., baseline1 and baseline2 introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.

After this process is completed, we will discuss the answers to the three research questions of our study.

5.4. Classifier and Evaluation Measure. As the underlying machine learning classifier for CPDP, Logistic Regression (LR), which was widely used in many defect prediction studies [4, 23, 39, 43-46], is also used in this study. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka). For our experiments, we used the default parameter settings for LR specified in Weka unless otherwise specified.

To evaluate the prediction performance of the different methods in this paper, we utilized the area under a Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier will rank a randomly chosen defective class higher than a randomly chosen defect-free one [47], and it is known as a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold that is used to decide whether an instance should be classified as a negative instance [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
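The paper's predictors were built in Weka; an equivalent sketch with scikit-learn (our substitution, not the authors' tooling) trains an LR model and reports its AUC:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Train a Logistic Regression defect predictor and return its
    AUC on the test set, using default-like parameter settings."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    defect_prob = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, defect_prob)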

6. Experimental Results

6.1. Answer to RQ1. We compared our approach considering defects with the baseline method NoD, which selects training data in terms of cosine similarity only. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. Obviously, the average growth rates of AUC value vary from 5.9% to 9.0% when different normalization methods for defects are utilized. In addition, all the δ values in this table are greater than 0.2, which


Table 2: Data statistics of the projects used in our experiments.

Repository  Project                       Version               #Instance  #Defect  %Defect
PROMISE     Ant                           1.7                   745        166      22.3%
PROMISE     Camel                         1.6                   965        188      19.5%
PROMISE     Ivy                           2.0                   352        40       11.4%
PROMISE     Jedit                         3.2                   272        90       33.1%
PROMISE     Lucene                        2.4                   340        203      59.7%
PROMISE     Poi                           3.0                   442        281      63.6%
PROMISE     Synapse                       1.2                   256        86       33.6%
PROMISE     Velocity                      1.4                   196        147      75.0%
PROMISE     Xalan                         2.6                   885        411      46.4%
PROMISE     Xerces                        1.4                   588        437      74.3%
AEEEM       Equinox                       1/1/2005-6/25/2008    324        129      39.8%
AEEEM       Eclipse JDT core (Eclipse)    1/1/2005-6/17/2008    997        206      20.7%
AEEEM       Apache Lucene (Lucene2)       1/1/2005-10/8/2008    692        20       2.9%
AEEEM       Mylyn                         1/17/2005-3/17/2009   1862       245      13.2%
AEEEM       Eclipse PDE UI (Pde)          1/1/2005-9/11/2008    1497       209      14.0%

Table 3: The mappings between different effect size values and their effectiveness levels.

Effect size   d      δ
Very small    0.01   0.008
Small         0.20   0.147
Medium        0.50   0.33
Large         0.80   0.474
Very large    1.20   0.622
Huge          2.0    0.811

indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).

We then compared TDSelector with the baseline methods using the other widely used similarity calculation methods, and the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. TDSelector, compared with the corresponding NoD, achieves average growth rates of AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7, respectively. More specifically, the highest growth rate of AUC value in Table 6 is 43.6% for Equinox, and in Table 7 it is 39.7% for Lucene2. Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of those baseline methods that take no account of defects.

Table 4: Analyzing the factors similarity and normalization.

Factor          Method               Mean    Std     δ
Similarity      Cosine similarity    0.704   0.082   −0.133
Similarity      Euclidean distance   0.719   0.080   -
Similarity      Manhattan distance   0.682   0.098   −0.193
Normalization   Linear               0.706   0.087   −0.012
Normalization   Logistic             0.710   0.078   -
Normalization   Square root          0.699   0.091   −0.044
Normalization   Logarithmic          0.700   0.086   −0.064
Normalization   Inverse cotangent    0.696   0.097   −0.056

In short, during the process of training data selection, the consideration of defects for CPDP can help us to select higher quality training data, thus leading to better classification results.

6.2. Answer to RQ2. Although the inclusion of defects in the selection of training data of quality is helpful for better performance of CPDP, it is worth noting that our method completely failed in Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Therefore, which combination of similarity and normalization is more suitable for TDSelector?

First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of the normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factors.


Table 5: The best prediction results (AUC values) obtained by the CPDP approach based on TDSelector with cosine similarity. NoD represents the baseline method (α = 1 in (2)).

Normalization       Ant    Xalan  Camel  Ivy    Jedit  Lucene Poi    Synapse Velocity Xerces Eclipse Equinox Lucene2 Mylyn  Pde    Mean±Std     δ
Linear              0.813  0.676  0.603  0.793  0.700  0.611  0.758  0.741   0.512    0.742  0.783   0.760   0.739   0.705  0.729  0.711±0.081  0.338
Logistic            0.802  0.674  0.595  0.793  0.665  0.621  0.759  0.765   0.579    0.745  0.773   0.738   0.712   0.707  0.740  0.711±0.070  0.351
Square root         0.799  0.654  0.596  0.807  0.735  0.626  0.746  0.762   0.500    0.740  0.774   0.560   0.722   0.700  0.738  0.697±0.091  0.249
Logarithmic         0.798  0.662  0.594  0.793  0.731  0.611  0.748  0.744   0.500    0.758  0.774   0.700   0.755   0.702  0.741  0.707±0.083  0.351
Inverse cotangent   0.798  0.652  0.592  0.793  0.659  0.611  0.749  0.741   0.500    0.764  0.773   0.556   0.739   0.695  0.734  0.690±0.092  0.213
NoD (α = 1)         0.765  0.652  0.592  0.793  0.536  0.611  0.736  0.741   0.358    0.740  0.639   0.543   0.709   0.665  0.701  0.652±0.113  -


Table 6: The best prediction results (AUC values) obtained by the CPDP approach based on TDSelector with Euclidean distance.

Normalization       Ant    Xalan  Camel  Ivy    Jedit  Lucene Poi    Synapse Velocity Xerces Eclipse Equinox Lucene2 Mylyn  Pde    Mean±Std     δ
Linear              0.795  0.727  0.598  0.826  0.793  0.603  0.714  0.757   0.545    0.775  0.773   0.719   0.722   0.697  0.744  0.719±0.080  0.369
Logistic            0.787  0.750  0.603  0.832  0.766  0.613  0.716  0.767   0.556    0.745  0.773   0.698   0.722   0.690  0.730  0.717±0.075  0.360
Square root         0.796  0.743  0.598  0.820  0.720  0.618  0.735  0.786   0.564    0.737  0.774   0.696   0.722   0.690  0.750  0.715±0.076  0.342
Logarithmic         0.794  0.746  0.598  0.819  0.722  0.607  0.714  0.757   0.573    0.739  0.778   0.722   0.722   0.690  0.748  0.715±0.072  0.324
Inverse cotangent   0.796  0.749  0.603  0.820  0.701  0.623  0.714  0.787   0.538    0.750  0.773   0.589   0.763   0.690  0.722  0.708±0.084  0.280
NoD (α = 1)         0.785  0.681  0.598  0.819  0.600  0.592  0.714  0.757   0.488    0.737  0.657   0.503   0.722   0.690  0.678  0.668±0.096  -


Table 7: The best prediction results (AUC values) obtained by the CPDP approach based on TDSelector with Manhattan distance.

Normalization       Ant    Xalan  Camel  Ivy    Jedit  Lucene Poi    Synapse Velocity Xerces Eclipse Equinox Lucene2 Mylyn  Pde    Mean±Std     δ
Linear              0.804  0.753  0.599  0.816  0.689  0.626  0.695  0.748   0.500    0.749  0.773   0.633   0.692   0.695  0.668  0.696±0.084  0.187
Logistic            0.799  0.760  0.607  0.830  0.674  0.621  0.735  0.794   0.520    0.756  0.773   0.680   0.559   0.695  0.668  0.705±0.084  0.249
Square root         0.795  0.755  0.604  0.816  0.693  0.627  0.704  0.750   0.510    0.749  0.773   0.532   0.523   0.695  0.668  0.680±0.100  0.164
Logarithmic         0.794  0.755  0.603  0.816  0.664  0.589  0.695  0.763   0.524    0.756  0.773   0.532   0.523   0.695  0.668  0.677±0.102  0.116
Inverse cotangent   0.794  0.749  0.608  0.821  0.667  0.609  0.710  0.748   0.500    0.758  0.773   0.532   0.523   0.695  0.668  0.677±0.103  0.133
NoD (α = 1)         0.794  0.704  0.597  0.816  0.642  0.589  0.695  0.748   0.464    0.749  0.693   0.532   0.500   0.695  0.668  0.659±0.105  -

12 Mathematical Problems in Engineering

Figure 4: A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow.

If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), and it is followed by the linear normalization method.

Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the fact that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two aspects is presented in Figure 4.
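To make the scoring scheme concrete, the following is a minimal Python sketch of the three similarity indexes and the five normalization functions listed in Table 1, combined by the weighted score of (2). All function and variable names are ours; in particular, for the two distance indexes a smaller value means a more similar instance, so the distance-to-similarity conversion used here (1/(1 + d)) is an assumption of this sketch rather than a detail specified in the text.

import math

# Similarity indexes of Table 1 (x and y are metric vectors of equal length).
def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def distance_to_similarity(d):
    # Assumed conversion: a larger distance yields a smaller similarity.
    return 1.0 / (1.0 + d)

# Normalization functions of Table 1, applied to a defect count x >= 0.
def n_linear(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min) if x_max > x_min else 0.0

def n_logistic(x):
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

def n_square_root(x):
    return 1.0 - 1.0 / math.sqrt(1.0 + x)

def n_logarithmic(x):
    return math.log10(x + 1.0)

def n_inverse_cotangent(x):
    return math.atan(x) * 2.0 / math.pi

# Eq. (2): Score(Ip, Iq) = alpha * Sim(Ip, Iq) + (1 - alpha) * N(defect_p).
def score(similarity, defects, alpha, normalize=n_logistic):
    return alpha * similarity + (1.0 - alpha) * normalize(defects)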

Then, we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by different similarity indexes, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, judging by the largest number of positive δ values in the table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.

6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is, on average, better than both baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets.

In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained by the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized due to the instances with more defects. This result further validates that selecting training data with defects in mind is valuable.

Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size of 0.009 is too small to be of interest in a particular application.
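For reference, Cliff's delta for two groups of observations (say, the 15 AUC values produced by two methods) is just a normalized count of pairwise wins and losses; a minimal sketch, with names ours:

def cliffs_delta(xs, ys):
    # delta = [#(x > y) - #(x < y)] / (|xs| * |ys|), in [-1, 1];
    # negative values mean xs tends to be smaller than ys.
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))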

In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.

7. Discussion

7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances selected for each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step of 1. As shown in Figure 5, for the three combinations in question, selecting fewer nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.

Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then reach peak performance when k is equal to 10.


Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. The rows are the three best combinations ("Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic"); the columns are the 15 combinations of the three similarity indexes (cosine similarity, Euclidean distance, and Manhattan distance) with the five normalization methods (linear, logistic, square root, logarithmic, and inverse cotangent).

Table 9: A comparison between our approach and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question; the last two columns give the growth rate (%) of the "Euclidean + Linear" TDSelector over each baseline.

Test set   Baseline1   Baseline2   vs. Baseline1 (%)   vs. Baseline2 (%)
Ant        0.785       0.803       +1.3                -1.0
Xalan      0.657       0.675       +10.7               +7.7
Camel      0.595       0.624       +0.5                -4.2
Ivy        0.789       0.802       +4.7                +3.0
Jedit      0.694       0.782       +14.3               +1.4
Lucene     0.608       0.701       -0.8                -14.0
Poi        0.691       0.789       +3.3                -9.5
Synapse    0.740       0.748       +2.3                +1.2
Velocity   0.330       0.331       +65.2               +64.7
Xerces     0.714       0.753       +8.5                +2.9
Eclipse    0.706       0.744       +10.2               +4.6
Equinox    0.587       0.720       +23.1               +0.3
Lucene2    0.705       0.724       +2.5                -0.2
Mylyn      0.631       0.646       +9.3                +6.8
Pde        0.678       0.737       +10.4               +1.5
Avg        0.663       0.705       +10.6               +4.3

Cliff's delta: Baseline1 vs. TDSelector, δ = -0.409; Baseline2 vs. TDSelector, δ = -0.009.

Table 10: Comparison of the defective instances of the simplified training datasets obtained from different methods on the Velocity project.

Method       defect instances / instances   instances (defects > 1) / defect instances
Baseline1    0.375                          0.247
Baseline2    0.393                          0.291
TDSelector   0.376                          0.487

Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k, for the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear").

The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
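As a sketch of how such a sweep over k can be carried out, assume a precomputed score matrix from (2) between every test instance and every candidate training instance; scikit-learn's LogisticRegression is only a stand-in for the Weka LR classifier used in our experiments, and all helper names are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for Weka's LR
from sklearn.metrics import roc_auc_score

def select_top_k(scores, k):
    # scores[i][j] = Score(test_i, candidate_j) from Eq. (2); returns the
    # union of every test instance's top-k candidates, duplicates removed.
    selected = set()
    for row in scores:
        selected.update(np.argsort(row)[::-1][:k].tolist())
    return sorted(selected)

def auc_over_k(scores, X_cand, y_cand, X_test, y_test, k_max=10):
    # One AUC value per k = 1..k_max, as plotted in Figures 5 and 6.
    aucs = []
    for k in range(1, k_max + 1):
        idx = select_top_k(scores, k)
        clf = LogisticRegression(max_iter=1000).fit(X_cand[idx], y_cand[idx])
        aucs.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
    return aucs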

Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k, for the same three combinations).

7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), for the 15 releases, most of them contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as TDSelector-3.

Figure 7: Percentage of defective instances with different numbers of bugs: (a) is shown from the viewpoint of a single dataset (release), while (b) is shown from the viewpoint of the whole dataset used in our experiments.

That is to say, those defective instances that have at least three bugs were chosen directly from the initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS, after removing redundant ones.
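A minimal sketch of this two-stage selection, assuming the same score matrix as in (2); the threshold of three bugs follows the text, and the helper names are ours.

def tdselector3(defects, scores, k=10, threshold=3):
    # Stage 1: candidates with at least `threshold` bugs join the TDS directly.
    direct = {j for j, d in enumerate(defects) if d >= threshold}
    # Stage 2: the remaining candidates are ranked per test instance by the
    # Eq. (2) score, and each test instance contributes its top-k; merging
    # the two stages into one set removes redundant (duplicate) instances.
    rest = [j for j in range(len(defects)) if j not in direct]
    selected = set(direct)
    for row in scores:  # one row of scores per test instance
        ranked = sorted(rest, key=lambda j: row[j], reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)  # indices of the final training dataset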

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly, according to a threshold for the number of bugs in each training instance (namely, three in this paper), at the first stage; our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.

7.3. Threats to Validity. In this study, we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings of the above factors.

Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables, based on the experimental data, are correct or reasonable [50]. In addition to the mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs distinctly better than baseline1, indicated by |δ| = 0.409 > 0.33.

Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets beyond AEEEM and PROMISE is the main threat to the validity of our results. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and described by different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five normalization methods when calculating the score of each candidate training instance.

Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector, for "Euclidean + Linear", "Cosine + Linear", and "Manhattan + Logistic". The last column in each of the three plots represents the average AUC value.

Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and other normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (No. 2016CFB309) and the National Natural Science Foundation of China (Nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167-199, 2012.

[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706-720, 2002.

[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248-256, 2012.

[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91-100, August 2009.

[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409-418, May 2013.

[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.

[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1-10, Baltimore, Maryland, October 2013.

[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66-72, 1998.

[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.

[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489-498, IEEE Computer Society, Washington, DC, USA, May 2007.

[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.

[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540-578, 2009.

[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397-402, July 2015.

[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51-56, July 2017.

[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170-190, 2015.

[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1-22, 2015.

[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455-1473, 2017.

[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1-38, 2015.

[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062-1077, 2016.

[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646-25656, 2017.

[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182-191, June 2014.

[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.

[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508-519, September 2015.

[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496-507, September 2015.

[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.

[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1-45, 2017.

[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91-102, Shanghai, September 2017.

[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45-54, October 2013.

[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.

[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction - classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.

[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321-339, 2017.

[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969-980, 2015.

[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434-441, July 2017.

[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531-577, 2012.

[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101-1118, 2013.

[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.

[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, no. 7, pp. 277-282, Springer-Verlag, 2001.

[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111-117, 2006.

[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382-391, May 2013.

[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69-81, 2010.

[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31-41, IEEE, May 2010.

[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545-555, 2011.

[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311-321, September 2011.

[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13-23, ACM, November 2008.

[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181-190, May 2008.

[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300-310, September 2011.

[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.

[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171-180, September 2012.

[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356-370, 2011.

[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.

[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583-621, 1952.

[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597-599, 2009.

[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43-48, 2010.


Page 3: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

Mathematical Problems in Engineering 3

22 Training Data Selection for CPDP As mentioned in [528] a fundamental issue for CPDP is to select the mostappropriate training data for building quality defect predic-tors He et al [29] discussed this problem in detail from theperspective of data granularity ie release level and instancelevel They presented a two-step method for training dataselection The results indicated that the predictor built basedon naive Bayes could achieve fairly good performance whenusing the method together with Peter filter [5] Porto andSimao [30] proposed an Instance Filtering method by select-ing the most similar instances from the training dataset andthe experimental results of 36 versions of 11 open-sourceprojects show that the defect predictor built from cross-project data selected by Feature Selection and Instance Filter-ing can have generally better performances both in classifica-tion and in ranking

With regard to the data imbalance problem of defectdatasets Jing et al [31] introduced an effective feature learn-ingmethod called SDA to provide effective solutions for class-imbalance problems of both within-project and cross-projecttypes by employing the semisupervised transfer componentanalysis (SSTCA)method tomake the distributions of sourceand target data consistent The results indicated that theirmethod greatly improved WPDP and CPDP performanceRyu et al [32] proposed amethod of hybrid instance selectionusing nearest neighbor (HISNN)Their results suggested thatthose instances which had strong local knowledge could beidentified via nearest neighbors with the same class labelPoon et al [33] proposed a credibility theory based naiveBayes (CNB) classifier to establish a novel reweighting mech-anism between the source projects and target projects sothat the source data could simultaneously adapt to the targetdata distribution and retain its ownpatternThe experimentalresults demonstrate the significant improvement in terms ofthe performance metrics considered achieved by CNB overother CPDP approaches

The above-mentioned existing studies aimed at reducingthe gap in prediction performance between WPDP andCPDP Although they are making progress towards the goalthere is clearly a lot of room for improvement For this reasonin this paper we proposed a selection approach to trainingdata based on an improved strategy for instance rankinginstead of a single strategy for similarity calculation whichwas used in many prior studies [1 5 7 12]

3 Preliminaries

In our context a defect dataset 119878 contains119898 instances whichis represented as 119878 = 1198681 1198682 119868119898 Instance 119868119894 is an objectclass represented as 119868119894 = 1198911198941 1198911198942 119891119894119899 where 119891119894119895 is the 119895thmetric value of instance 119868119894 and 119899 is the number ofmetrics (alsoknown as features) Given a source dataset 119878119904 and a targetdataset 119878119905 CPDP aims to perform a prediction in 119878119905 using theknowledge extracted from 119878119904 where 119878119904 = 119878119905 (see Figure 1(a))In this paper source and target datasets have the same set ofmetrics and they may differ in distributional characteristicsof metric values

To improve the performance of CPDP several strategiesused to select appropriate training data have been put forward

(see Figure 1(b)) eg Turhan et al [12] filtered out thoseirrelevant training instances by returning 119896-nearest neighborsfor each test instance

31 An Example of Training Data Selection First we intro-duce a typical method for training data selection at theinstance level and a simple example is used to illustrate thismethod For the strategy for other levels of training dataselection such as at the release level please refer to [7]

Figure 2 shows a training set 119878119904 (including five instances)and a test set 119878119905 (including an instance) Here each instancecontains five metrics and a classification label (ie 0 or 1)An instance is defect-free (label = 0) only if its defects areequal to 0 otherwise it is defective (label = 1) According tothe 119896-nearest neighbor method based on Euclidean distancewe can rank all the five training instances in terms of theirdistances from the test instance Due to the same nearestdistance from test instance 119868test it is clear that three instances1198681 1198682 and 1198685 are suitable for use as training instances when119896 is set to 1 For the three instances 1198682 and 1198685 have the samemetric values but 1198682 is labeled as a defective instance becauseit contains a bug In this case 1198681 will be selected with the sameprobability as that of 1198682 regardless of the number of defectsthey include

In this way those instances most relevant to the test onecan be quickly determined Clearly the goal of training dataselection is to preserve the representative training instancesin 119878119904 as much as possible

32 General Process of Training Data Selection Before pre-senting our approach we describe a general selection processof training data which consists of three main steps TDS(training dataset) setup ranking and duplicate removal

TDS Setup For each target project with little historical datawe need to set up an initial TDS where training data arecollected from other projects To simulate this scenario ofCPDP in this paper any defect data from the target projectmust be excluded from the initial TDS Note that differentrelease versions of a project actually belong to the sameproject A simple example is visualized in Figure 3

Ranking Once the initial TDS is determined an instancewill be treated as a metric vector 119868 as mentioned above Foreach test instance one can calculate its relevance to eachtraining instance and then rank these training instances interms of their similarity based on software metrics Notethat a wide variety of software metrics such as source codemetrics process metrics previous defects and code churnhave been used as features for CPDP approaches to improvetheir prediction performance

Duplicate Removal Let 119897 be the size of test set For eachtest instance if we select its 119896-nearest neighbors from theinitial TDS there are a total of 119896 times 119897 candidate traininginstances Considering that these selected instances may notbe unique (ie a training instance can be the nearest neighborof multiple test instances) after removing the duplicate onesthey form the final training set which is a subset of the initialTDS

4 Mathematical Problems in Engineering

trainingPredictor

test

Training data set Ss Target data set St

(a) General CPDP

MetricFeature value f

Buggy instance I

Non-buggy instance I

Unlabeled instance I

Strategies for instance selection

Predictor

testtraining

Training data set Ss

Ss Target data set StReduced training data set rsquo

(b) Improved CPDP using training data selection

Figure 1 Two CPDP scenarios used in this paper

01 0 05 0 1(3)

01 0 0 05 1(1)

04 03 0 01 0

0 0 04 0 0

01 0 0 05 0

01 0 05 05

Label (defects)

rank instance

1

2

3

distance(Ii Itest)I1

I2

I3

I4

I4

I3

I5

Itest

f1 f2 f3 f4

I1 I2 I5

St

Ss

Figure 2 An example of the selection of training instances

Mathematical Problems in Engineering 5

Data sets

Project C-v1

Ranked instances(TDSelector)

Defects of each instance

similarity of software metrics

Defect predictor

Project A-v1Project A-v2Project B-v1Project C-v1Project C-v2 Test set

Project A-v1Project A-v2Project B-v1

Training set

ranking

TDS setup

rankingTraining

data

Test data

Figure 3 The overall structure of TDSelector for CPDP

4 Our Approach TDSelector

To improve the prediction performance of CPDP we leveragethe following observations

Similar Instances Given a test instance we can examine itssimilar training instances that were labeled beforeThe defectproneness shared by similar training instances could helpus identify the probability that a test instance is defectiveIntuitively two instances are more likely to have the samestate if their metric values are very similar

Number of Defects (defects) During the selection processwhen several training instances have the same distance froma test instance we need to determine which one shouldbe ranked higher According to our experiences in softwaredefect prediction and other researchersrsquo studies on the quan-titative analysis of previous defect prediction approaches [3435] we believe that more attention should be paid to thosetraining instances with more defects in practice

The selection of training data based on instance similarityhas been used in someprior studies [5 12 35]However to thebest of our knowledge the information about defects has notbeen fully utilized So in this paper we attempt to propose atraining data selection approach combining such informationand instance similarity

41 Overall Structure of TDSelector Figure 3 shows the over-all structure of the proposed approach to training dataselection named TDSelector Before selecting appropriatetraining data for CPDP we have to set up a test set andits corresponding initial TDS For a given project treated asthe test set all the other projects (except the target project)available at hand are used as the initial TDS This is the so-calledmany-to-one (M2O) scenario for CPDP [13] It is quitedifferent from the typical O2O (one-to-one) scenario whereonly one randomly selected project is treated as the trainingset for a given target project (namely test set)

When both of the sets are given the ranks of traininginstances are calculated based on the similarity of softwaremetrics and then returned for each test instance For the

initial TDS we also collect each training instancersquos defectsand thus rank these instances by their defects Then we rateeach training instance by combining the two types of ranks insome way and identify the top-k training instances for eachtest instance according to their final scores Finally we usethe predictor trained with the final TDS to predict defectproneness in the test set We describe the core componentof TDSelector namely scoring scheme in the followingsubsection

42 Scoring Scheme For each instance in training set and testset which is treated as a vector of features (namely softwaremetrics) we calculate the similarity between them in terms ofsimilarity index (such as cosine similarity Euclidean distanceand Manhattan distance as shown in Table 1) Traininginstances are then ranked by the similarity between each ofthem and a given test instance

For instance the cosine similarity between a traininginstance 119868119901 and the target instance 119868119902 is computed via theirvector representations described as follows

119878119894119898 (119868119901 119868119902) =997888rarr119868119901 sdot 997888rarr1198681199021003817100381710038171003817100381711986811990110038171003817100381710038171003817 times 1003817100381710038171003817100381711986811990210038171003817100381710038171003817 =

sum119899119894=1 (119891119901119894 times 119891119902119894)radicsum119899119894=1 1198912119901119894 times radicsum119899119894=1 1198912119902119894 (1)

where 997888rarr119868119901 and 997888rarr119868119902 are the metric vectors for 119868119901 and 119868119902 respec-tively and 119891lowast119894 represents the 119894th metric value of instance 119868lowast

Additionally for each training instance we also considerthe factor defects in order to further enrich the ranking of itsrelevant instances The assumption here is that the more theprevious defects the richer the information of an instance Sowe propose a scoring scheme to rank those candidate traininginstances defined as below

Score (119868119901 119868119902) = 120572 119878119894119898 (119868119901 119868119902) + (1 minus 120572)119873 (119889119890119891119890119888119905119901) (2)

where 119889119890119891119890119888119905119901 represents the defects of 119868119901 120572 is a weightingfactor (0 le 120572 le 1) which is learned from training data usingAlgorithm 1 (see Algorithm 1) and 119873(119889119890119891119890119888119905119901) is a functionused to normalize defects with values ranging from 0 to 1

6 Mathematical Problems in Engineering

Optimizing the parameter 120572Input(1) Candidate TDS 119878119904 = 1198681199041 1198681199042 119868119904119898 test set 119878119905 = 1198681199051 1198681199052 119868119905119897 (119898 gt 119897)(2) 119889119890119891119890119888119905119904 = 119889119890119891119890119888119905(1198681199041) 119889119890119891119890119888119905(1198681199042) 119889119890119891119890119888119905(119868119904119898) and 119896 = 10Output(3) 120572 (120572 isin [0 1])Method(4) Initialize 120572 = 0 119878119904(120572) = 0(5) While (120572 le 1) do(6) For 119894 = 1 119894 le 119897 119894 + +(7) For 119895 = 1 119895 le 119898 119895 + +(8) Score(119868119905119894 119868119904119895) = 120572 119878119894119898(119868119905119894 119868119904119895) + (1 minus 120572)119873 (119889119890119891119890119888119905(119868119904119895))(9) End For(10) descSort (119878119888119900119903119890(119868119905119894 119868119904119895) | 119895 = 1 sdot sdot sdot 119898) sort119898 training instances in descending order(11) 119878119904(120572) = 119878119904(120572) cup Top-119896 training instances select the top 119896 instances(12) End For(13) 119860119880119862 lArr997904 119878119904(120572) CPDP997888997888997888997888rarr 119878119905 prediction result(14) 120572 = 120572 + 01(15) EndWhile(16) Return (120572 | max120572119860119880119862)

Algorithm 1 Algorithm of parameter optimization

Table 1 Similarity indexes and normalization methods used in thispaper

Similarity

Cosine cos (119883 119884) = sum119899119896=1 119909119896119910119896radicsum119899119896=1 1199092119896radicsum119899119896=1 1199102119896

Euclidean distance 119889 (119883 119884) = radic 119899sum119896=1

(119909119896 minus 119910119896)2Manhattan distance 119889 (119883 119884) = 119899sum

119896=1

1003816100381610038161003816119909119896 minus 1199101198961003816100381610038161003816Normalization

Linear 119873(119909) = 119909 minus 119909min119909max minus 119909min

Logistic 119873(119909) = 11 + 119890minus119909 minus 05Square root 119873(119909) = 1 minus 1radic1 + 119909Logarithmic 119873(119909) = log10(119909 + 1)Inverse cotangent 119873(119909) = arctan(119909) lowast 2120587

Normalization is a commonly used data preprocessingtechnique for mathematics and computer science [36] Grafand Borer [37] have confirmed that normalization canimprove prediction performance of classificationmodels Forthis reason we normalize the defects of training instanceswhen using TDSelector As you know there are many nor-malization methods In this study we introduce five typicalnormalization methods used in machine learning [36 38]The description and formulas of the five normalizationmethods are listed in Table 1

For each test instance the top-119896 training instances rankedin terms of their scores will be returned Hence the final

TDS is composed by merging the sets of the top-119896 train-ing instances for each test instance when those duplicateinstances are removed

5 Experimental Setup

51 Research Questions Our experiments were conductedto find empirical evidence that answers the following threeresearch questions

RQ1 Does the Consideration of Defects Improve the Per-formance of CPDP Unlike the previous methods [1 5 712 29] TDSelector ranks candidate training instances interms of both defects and metric-based similarity To eval-uate the effectiveness of the proposed method consideringthe additional information of defects we tested TDSelectoraccording to the experimental data described in Section 52According to (2) we also empirically analyzed the impact ofthe parameter 120572 on prediction results

RQ2 Which Combination of Similarity and NormalizationIs More Suitable for TDSelector Equation (2) is comprisedof two parts namely similarity and the normalization ofdefects For each part several commonly usedmethods can beadopted in our context To fully take advantage of TDSelectorone would wonder which combination of similarity andnormalization should be chosen Therefore it is necessary tocompare the effects of different combinations of similarityand normalization methods on prediction results and todetermine the best one for TDSelector

RQ3 Can TDSelector-Based CPDP Outperform the Base-line Methods Cross-project prediction has attracted muchresearch interest in recent years and a few CPDP approachesusing training data selection have also been proposed eg

Mathematical Problems in Engineering 7

Peter filter based CPDP [5] (labeled as baseline1) and TCA+(Transfer Component Analysis) based CPDP [39] (labeledas baseline2) To answer the third question we comparedTDSelector-based CPDP proposed in this paper with theabove two state-of-the-art methods

52 Data Collection To evaluate the effectiveness of TDSe-lector in this paper we used 14 open-source projects writtenin Java on two online public software repositories namelyPROMISE [40] and AEEEM [41] The data statistics of the 14projects in question are presented in Table 2 where Instanceand Defect are the numbers of instances and defectiveinstances respectively and Defect is the proportion ofdefective instances to the total number of instances Eachinstance in these projects represents a file of object class andconsists of two parts namely software metrics and defects

The first repository PROMISE was collected by Jureczkoand Spinellis [40] The information of defects and 20 sourcecode metrics for the projects on PROMISE have been vali-dated and used in several previous studies [1 7 12 29] Thesecond repository AEEEM was collected by DrsquoAmbros etal [41] and each project on it has 76 metrics including 17source code metrics 15 change metrics 5 previous defectmetrics 5 entropy-of-change metrics 17 entropy-of-source-code metrics and 17 churn-of-source-code metrics AEEEMhas been successfully used in [23 39]

Before performing a cross-project prediction we need todetermine a target dataset (test set) and its candidate TDSFor PROMISE (10 projects) each one in the 10 projects wasselected to be the target dataset once and then we set up acandidate TDS for CPDP which excluded any data from thetarget project For instance if Ivy is selected as test projectdata from the other nine projects was used to construct itsinitial TDS

53 Experiment Design To answer the three research ques-tions our experimental procedure which is designed underthe context of M2O in the CPDP scenario is described asfollows

First as with many prior studies [1 5 15 35] all softwaremetric values in training and test sets were normalized byusing the119885-scoremethod because thesemetrics are differentin the scales of numerical values For the 14 projects onAEEEM and PROMISE their numbers of software metricsare different So the training set for a given test set wasselected from the same repository

Second to examine whether the consideration of defectsimproves the performance of CPDP we compared ourapproach TDSelector with NoD which is a baseline methodconsidering only the similarity between instances ie 120572 = 1in (2) Since there are three similarity computation methodsused in this paper we designed three different TDSelectorsand their corresponding baseline methods based on similar-ity indexesThe prediction results of eachmethod in questionfor the 15 test sets were analyzed in terms of mean valueand standard deviation More specifically we also used Cliff rsquosdelta (120575) [42] which is a nonparametric effect size measureof how often the values in one distribution are larger than the

values in a second distribution to compare the results gen-erated through our approach and its corresponding baselinemethod

Because Cliff did not suggest corresponding 120575 values torepresent small medium and large effects we convertedCohenrsquos 119889 effect size to Cliff rsquos 120575 using cohd2delta R package(httpsrdrriocranorddommancohd2deltahtml) Notethat Table 3 contains descriptors for magnitude of 119889 = 001to 20

Third according to the results of the second step of thisprocedure 15 combinations based on three typical similaritymethods for software metrics and five commonly usednormalization functions for defects were examined by thepairwise comparison method We then determined whichcombination is more suitable for our approach according tomean standard deviation and Cliff rsquos delta effect size

Fourth to further validate the effectiveness of the TDS-elector-based CPDP predictor we conducted cross-projectpredictions for all the 15 test sets using TDSelector and twocompeting methods (ie baseline1 and baseline2 introducedin Section 51) Note that the TDSelector used in this exper-iment was built with the best combination of similarity andnormalization

After this process is completed we will discuss the an-swers to the three research questions of our study

54 Classifier and Evaluation Measure As an underlyingmachine learning classifier for CPDP Logistic Regression(LR) which was widely used in many defect prediction liter-atures [4 23 39 43ndash46] is also used in this study All LR clas-sifiers were implemented withWeka (httpswwwcswaikatoacnzmlweka) For our experiments we used the defaultparameter setting for LR specified in Weka unless otherwisespecified

To evaluate the prediction performance of differentmeth-ods in this paper we utilized the area under a ReceiverOperating Characteristic curve (AUC) AUC is equal to theprobability that a classifier will identify a randomly chosendefective class higher than a randomly chosen defect-freeone [47] known as a useful measure for comparing differentmodels Compared with traditional accuracy measures AUCis commonly used because it is unaffected by class imbalanceand independent of the prediction threshold that is used todecide whether an instance should be classified as a negativeinstance [6 48 49] The AUC value of 05 indicates theperformance of a random predictor and higher AUC valuesindicate better prediction performance

6 Experimental Results

61 Answer to RQ1 We compared our approach consideringdefects with the baseline method NoD that selects trainingdata in terms of cosine similarity Table 5 shows that onaverage TDSelector does achieve an improvement in AUCvalue across the 15 test sets Obviously the average growthrates of AUC value vary from 59 to 90 when differentnormalization methods for defects were utilized In additionall the 120575 values in this table are greater than 02 which

8 Mathematical Problems in Engineering

Table 2 Data statistics of the projects used in our experiments

Repository Project Version Instance Defect Defect

PROMISE

Ant 17 745 166 223Camel 16 965 188 195Ivy 20 352 40 114Jedit 32 272 90 331

Lucene 24 340 203 597Poi 30 442 281 636

Synapse 12 256 86 336Velocity 14 196 147 750Xalan 26 885 411 464Xerces 14 588 437 743

AEEEM

Equinox 112005ndash6252008 324 129 398Eclipse JDT core (Eclipse) 112005ndash6172008 997 206 207Apache Lucene (Lucene2) 112005ndash1082008 692 20 29

Mylyn 1172005ndash3172009 1862 245 132Eclipse PDE UI (Pde) 112005ndash9112008 1497 209 140

Table 3 The mappings between different values and their effective-ness levels

Effect size d 120575Very small 001 0008Small 020 0147Medium 050 033Large 080 0474Very large 120 0622Huge 20 0811

indicates that each group of 15 prediction results obtained byour approach has a greater effect than that of NoD In otherwords our approach outperforms NoD In particular forJedit Velocity Eclipse andEquinox the improvements of ourapproach over NoD are substantial For example when usingthe linear normalizationmethod the AUC values for the fourprojects are increased by 306 430 226 and 394respectively moreover the logistic normalization methodfor Velocity achieves the biggest improvement in AUC value(namely 617)

We then compared TDSelector with the baseline methodsusing other widely used similarity calculation methods andthe results obtained by using Euclidean distance and Man-hattan distance to calculate the similarity between instancesare presented in Tables 6 and 7 TDSelector compared withthe corresponding NoD achieves the average growth ratesof AUC value that vary from 59 to 77 in Table 6 andfrom 27 to 69 in Table 7 respectively More specificallythe highest growth rate of AUC value in Table 6 is 436for Equinox and in Table 7 is 397 for Lucene2 Besides allCliff rsquos delta (120575) effect sizes in these two tables are also greaterthan 01 Hence the results indicate that our approach can onaverage improve the performance of those baseline methodswithout regard to defects

Table 4 Analyzing the factors similarity and normalization

Factor Method Mean Std 120575Similarity

Cosine similarity 0704 0082 minus0133Euclidean distance 0719 0080 -Manhattan distance 0682 0098 minus0193

Normalization

Linear 0706 0087 minus0012Logistic 0710 0078 -

Square root 0699 0091 minus0044Logarithmic 0700 0086 minus0064

Inverse cotangent 0696 0097 minus0056

In short during the process of training data selection theconsideration of defects for CPDP can help us to select higherquality training data thus leading to better classificationresults

62 Answer to RQ2 Although the inclusion of defects inthe selection of training data of quality is helpful for betterperformance of CPDP it is worthy to note that our methodcompletely failed in Mylyn and Pde when computing thesimilarity between instances in terms of Manhattan distance(see the corresponding maximum AUC values in Table 7)This implies that the success ofTDSelector depends largely onthe reasonable combination of similarity and normalizationmethods Therefore which combination of similarity andnormalization is more suitable for TDSelector

First we analyzed the two factors (ie similarity andnormalization) separately For example we evaluated thedifference among cosine similarity Euclidean distance andManhattan distance regardless of any normalization methodused in the experiment The results expressed in terms ofmean and standard deviation are shown in Table 4 wherethey are grouped by factors

Mathematical Problems in Engineering 9

Table 5. The best prediction results obtained by the CPDP approach based on TDSelector with cosine similarity. NoD represents the baseline method (α = 1); "+ (%)" denotes the growth rate of the AUC value relative to NoD, and "-" means no change. Project order: Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde.

Linear (δ = 0.338)
  α:     0.7, 0.9, 0.9, 1.0, 0.9, 1.0, 0.9, 1.0, 0.6, 0.9, 0.8, 0.6, 0.7, 0.7, 0.5
  AUC:   0.813, 0.676, 0.603, 0.793, 0.700, 0.611, 0.758, 0.741, 0.512, 0.742, 0.783, 0.760, 0.739, 0.705, 0.729; Mean±Std = 0.711±0.081
  + (%): 6.3, 3.7, 1.9, -, 30.6, -, 3.0, -, 43.0, 0.3, 22.6, 39.4, 4.1, 5.9, 4.0; avg 9.0

Logistic (δ = 0.351)
  α:     0.7, 0.5, 0.7, 1.0, 0.7, 0.6, 0.6, 0.6, 0.5, 0.5, 0.0, 0.4, 0.7, 0.5, 0.5
  AUC:   0.802, 0.674, 0.595, 0.793, 0.665, 0.621, 0.759, 0.765, 0.579, 0.745, 0.773, 0.738, 0.712, 0.707, 0.740; Mean±Std = 0.711±0.070
  + (%): 4.8, 3.4, 0.5, -, 24.1, 1.6, 3.1, 3.2, 61.7, 0.7, 21.0, 35.5, 0.3, 6.2, 5.6; avg 9.0

Square root (δ = 0.249)
  α:     0.7, 0.7, 0.6, 0.6, 0.7, 0.6, 0.7, 0.9, 0.5, 1.0, 0.4, 0.6, 0.6, 0.6, 0.6
  AUC:   0.799, 0.654, 0.596, 0.807, 0.735, 0.626, 0.746, 0.762, 0.500, 0.740, 0.774, 0.560, 0.722, 0.700, 0.738; Mean±Std = 0.697±0.091
  + (%): 4.4, 0.3, 0.7, 1.8, 37.1, 2.5, 1.4, 2.8, 39.7, -, 21.0, 2.8, 1.7, 5.3, 5.3; avg 6.9

Logarithmic (δ = 0.351)
  α:     0.6, 0.6, 0.9, 1.0, 0.7, 1.0, 0.7, 0.7, 0.5, 0.9, 0.5, 0.5, 0.6, 0.6, 0.6
  AUC:   0.798, 0.662, 0.594, 0.793, 0.731, 0.611, 0.748, 0.744, 0.500, 0.758, 0.774, 0.700, 0.755, 0.702, 0.741; Mean±Std = 0.707±0.083
  + (%): 4.3, 1.5, 0.3, -, 36.4, -, 1.6, 0.4, 39.7, 2.4, 21.2, 28.5, 6.3, 5.5, 5.8; avg 8.5

Inverse cotangent (δ = 0.213)
  α:     0.7, 1.0, 1.0, 1.0, 0.7, 1.0, 0.7, 1.0, 0.6, 0.7, 0.0, 0.7, 0.7, 0.7, 0.7
  AUC:   0.798, 0.652, 0.592, 0.793, 0.659, 0.611, 0.749, 0.741, 0.500, 0.764, 0.773, 0.556, 0.739, 0.695, 0.734; Mean±Std = 0.690±0.092
  + (%): 4.3, -, -, -, 22.9, -, 1.8, -, 39.7, 3.2, 21.0, 2.1, 4.1, 4.4, 4.8; avg 5.9

NoD (α = 1)
  AUC:   0.765, 0.652, 0.592, 0.793, 0.536, 0.611, 0.736, 0.741, 0.358, 0.740, 0.639, 0.543, 0.709, 0.665, 0.701; Mean±Std = 0.652±0.113


Table 6. The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance. Columns follow the same project order and notation as Table 5.

Linear (δ = 0.369)
  α:     0.9, 0.9, 1.0, 0.9, 0.9, 0.8, 1.0, 1.0, 0.8, 0.8, 0.0, 0.6, 1.0, 0.8, 0.8
  AUC:   0.795, 0.727, 0.598, 0.826, 0.793, 0.603, 0.714, 0.757, 0.545, 0.775, 0.773, 0.719, 0.722, 0.697, 0.744; Mean±Std = 0.719±0.080
  + (%): 1.3, 6.8, -, 0.9, 32.2, 1.9, -, -, 11.7, 5.2, 17.6, 43.0, -, 1.1, 9.6; avg 7.7

Logistic (δ = 0.360)
  α:     0.7, 0.8, 0.4, 0.7, 0.7, 0.5, 0.6, 0.9, 0.9, 0.9, 0.0, 0.7, 1.0, 1.0, 0.9
  AUC:   0.787, 0.750, 0.603, 0.832, 0.766, 0.613, 0.716, 0.767, 0.556, 0.745, 0.773, 0.698, 0.722, 0.690, 0.730; Mean±Std = 0.717±0.075
  + (%): 0.3, 10.1, 0.8, 1.6, 27.7, 3.5, 0.3, 1.3, 13.9, 1.1, 17.6, 38.8, -, -, 7.5; avg 7.2

Square root (δ = 0.342)
  α:     0.7, 0.8, 1.0, 0.7, 0.8, 0.6, 0.7, 0.7, 0.7, 1.0, 0.7, 0.8, 1.0, 1.0, 0.9
  AUC:   0.796, 0.743, 0.598, 0.820, 0.720, 0.618, 0.735, 0.786, 0.564, 0.737, 0.774, 0.696, 0.722, 0.690, 0.750; Mean±Std = 0.715±0.076
  + (%): 1.4, 9.1, -, 0.1, 20.0, 4.4, 2.9, 3.8, 15.6, -, 17.8, 38.4, -, -, 10.5; avg 7.0

Logarithmic (δ = 0.324)
  α:     0.7, 0.8, 1.0, 1.0, 0.8, 0.6, 1.0, 1.0, 0.9, 0.9, 0.9, 0.8, 1.0, 1.0, 0.9
  AUC:   0.794, 0.746, 0.598, 0.819, 0.722, 0.607, 0.714, 0.757, 0.573, 0.739, 0.778, 0.722, 0.722, 0.690, 0.748; Mean±Std = 0.715±0.072
  + (%): 1.1, 9.5, -, -, 20.3, 2.5, -, -, 17.4, 0.3, 18.5, 43.6, -, -, 10.3; avg 7.0

Inverse cotangent (δ = 0.280)
  α:     0.8, 0.9, 0.6, 0.8, 0.8, 0.7, 1.0, 0.8, 0.6, 0.7, 0.0, 0.9, 0.9, 1.0, 0.9
  AUC:   0.796, 0.749, 0.603, 0.820, 0.701, 0.623, 0.714, 0.787, 0.538, 0.750, 0.773, 0.589, 0.763, 0.690, 0.722; Mean±Std = 0.708±0.084
  + (%): 1.4, 10.0, 0.8, 0.1, 16.8, 5.2, -, 4.0, 10.2, 1.8, 17.6, 17.0, 5.6, -, 6.4; avg 5.9

NoD (α = 1)
  AUC:   0.785, 0.681, 0.598, 0.819, 0.600, 0.592, 0.714, 0.757, 0.488, 0.737, 0.657, 0.503, 0.722, 0.690, 0.678; Mean±Std = 0.668±0.096


Table 7. The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance. Columns follow the same project order and notation as Table 5.

Linear (δ = 0.187)
  α:     0.8, 0.9, 0.9, 1.0, 0.9, 0.9, 1.0, 1.0, 0.8, 1.0, 0.0, 0.8, 0.9, 1.0, 1.0
  AUC:   0.804, 0.753, 0.599, 0.816, 0.689, 0.626, 0.695, 0.748, 0.500, 0.749, 0.773, 0.633, 0.692, 0.695, 0.668; Mean±Std = 0.696±0.084
  + (%): 1.3, 7.0, 0.3, -, 7.3, 6.3, -, -, 7.8, -, 11.6, 19.0, 39.7, -, -; avg 5.6

Logistic (δ = 0.249)
  α:     0.7, 0.7, 0.8, 0.8, 0.8, 0.7, 0.7, 0.9, 0.6, 0.7, 0.0, 0.9, 0.9, 1.0, 1.0
  AUC:   0.799, 0.760, 0.607, 0.830, 0.674, 0.621, 0.735, 0.794, 0.520, 0.756, 0.773, 0.680, 0.559, 0.695, 0.668; Mean±Std = 0.705±0.084
  + (%): 0.6, 8.0, 1.7, 1.7, 5.0, 5.4, 5.8, 6.1, 12.1, 0.9, 11.6, 27.9, 12.7, -, -; avg 6.9

Square root (δ = 0.164)
  α:     0.9, 0.9, 0.9, 1.0, 0.8, 0.8, 0.9, 0.8, 0.9, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0
  AUC:   0.795, 0.755, 0.604, 0.816, 0.693, 0.627, 0.704, 0.750, 0.510, 0.749, 0.773, 0.532, 0.523, 0.695, 0.668; Mean±Std = 0.680±0.100
  + (%): 0.1, 7.2, 1.2, -, 7.9, 6.5, 1.3, 0.3, 9.9, -, 11.6, -, 4.6, -, -; avg 3.1

Logarithmic (δ = 0.116)
  α:     1.0, 0.9, 0.9, 1.0, 0.9, 1.0, 1.0, 0.8, 0.9, 0.9, 0.0, 1.0, 0.0, 1.0, 1.0
  AUC:   0.794, 0.755, 0.603, 0.816, 0.664, 0.589, 0.695, 0.763, 0.524, 0.756, 0.773, 0.532, 0.523, 0.695, 0.668; Mean±Std = 0.677±0.102
  + (%): -, 7.2, 1.0, -, 3.4, -, -, 2.0, 12.9, 0.9, 11.6, -, 4.6, -, -; avg 2.7

Inverse cotangent (δ = 0.133)
  α:     1.0, 0.9, 0.9, 0.9, 0.9, 0.8, 0.9, 1.0, 0.7, 0.8, 0.0, 1.0, 0.0, 1.0, 1.0
  AUC:   0.794, 0.749, 0.608, 0.821, 0.667, 0.609, 0.710, 0.748, 0.500, 0.758, 0.773, 0.532, 0.523, 0.695, 0.668; Mean±Std = 0.677±0.103
  + (%): -, 6.4, 1.8, 0.6, 3.9, 3.4, 2.2, -, 7.8, 1.2, 11.6, -, 4.6, -, -; avg 2.7

NoD (α = 1)
  AUC:   0.794, 0.704, 0.597, 0.816, 0.642, 0.589, 0.695, 0.748, 0.464, 0.749, 0.693, 0.532, 0.500, 0.695, 0.668; Mean±Std = 0.659±0.105


Figure 4. A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow.

If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation value (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation value (0.078), and it is followed by the linear normalization method.

Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two different aspects is presented in Figure 4.
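For reference, the five normalization functions compared here follow the formulas given in Table 1 earlier in the paper. The Python sketch below is our transcription of those formulas, where x is an instance's defect count and x_min/x_max are its range over the candidate TDS.

```python
import math

def linear(x, x_min, x_max):
    # N(x) = (x - x_min) / (x_max - x_min)
    return (x - x_min) / (x_max - x_min)

def logistic(x):
    # N(x) = 1 / (1 + e^(-x)) - 0.5
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

def square_root(x):
    # N(x) = 1 - 1 / sqrt(1 + x)
    return 1.0 - 1.0 / math.sqrt(1.0 + x)

def logarithmic(x):
    # N(x) = log10(x + 1)
    return math.log10(x + 1.0)

def inverse_cotangent(x):
    # N(x) = arctan(x) * 2 / pi
    return math.atan(x) * 2.0 / math.pi
```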

Then we considered both factors together. According to the results in Tables 5, 6, and 7 grouped by different similarity indexes, TDSelector obtains the best results 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084 when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the value of Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.

6.3. Answer to RQ3. A comparison between our approach and two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than the two baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 in 14 out of 15 datasets, and it has an advantage over baseline2 in 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training dataset obtained from different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized due to the instances with more defects. The result further validates that the selection of training data considering defects is valuable.

Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.

In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.

7. Discussion

7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of the nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step value of 1. As shown in Figure 5, for the three combinations in question, selecting the k-nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.

Interestingly, for the combinations of "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance


Table 8. Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. In each row, the 15 values follow the order: cosine similarity (Linear, Logistic, Square root, Logarithmic, Inverse cotangent) | Euclidean distance (same five) | Manhattan distance (same five); "-" marks the combination compared with itself.

Cosine + Linear:      -, 0.018, 0.084, 0.000, 0.116 | -0.049, -0.036, -0.004, -0.013, -0.009 | 0.138, 0.049, 0.164, 0.178, 0.169
Euclidean + Linear:   0.049, 0.102, 0.111, 0.062, 0.164 | -, 0.036, 0.040, 0.058, 0.089 | 0.209, 0.102, 0.249, 0.276, 0.244
Manhattan + Logistic: -0.049, -0.022, 0.022, -0.013, 0.111 | -0.102, -0.076, -0.080, -0.049, -0.031 | 0.053, -, 0.124, 0.151, 0.147


Table 9. A comparison between our approach and two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question.

Test set   Baseline1   Baseline2   + (%) over Baseline1   + (%) over Baseline2
Ant        0.785       0.803       1.3                    -1.0
Xalan      0.657       0.675       10.7                   7.7
Camel      0.595       0.624       0.5                    -4.2
Ivy        0.789       0.802       4.7                    3.0
Jedit      0.694       0.782       14.3                   1.4
Lucene     0.608       0.701       -0.8                   -14.0
Poi        0.691       0.789       3.3                    -9.5
Synapse    0.740       0.748       2.3                    1.2
Velocity   0.330       0.331       65.2                   64.7
Xerces     0.714       0.753       8.5                    2.9
Eclipse    0.706       0.744       10.2                   4.6
Equinox    0.587       0.720       23.1                   0.3
Lucene2    0.705       0.724       2.5                    -0.2
Mylyn      0.631       0.646       9.3                    6.8
Pde        0.678       0.737       10.4                   1.5
Avg        0.663       0.705       10.6                   4.3

δ (Baseline1 vs. TDSelector) = -0.409; δ (Baseline2 vs. TDSelector) = -0.009.

Table 10. Comparison of the defective instances of the simplified training dataset obtained from different methods on the Velocity project.

Method       #defect instances / #instances   #instances(defects > 1) / #defect instances
Baseline1    0.375                            0.247
Baseline2    0.393                            0.291
TDSelector   0.376                            0.487

Figure 5. The impact of k on prediction results for the 10 test sets from PROMISE (AUC vs. k = 1–10, for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).

when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result as k is set to 7. Even so, this best result is still worse than those of the other two combinations.
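To make the role of k concrete, the following sketch (our own simplification, not the authors' implementation) selects the k nearest training instances for each test instance under Euclidean distance and merges the per-instance picks into one deduplicated training set.

```python
import numpy as np

def select_top_k(train_X, test_X, k=10):
    """Indices of the merged top-k nearest training instances.

    train_X, test_X: 2D arrays of z-scored metric values.
    """
    selected = set()
    for t in test_X:
        dists = np.linalg.norm(train_X - t, axis=1)      # Euclidean distances
        selected.update(np.argsort(dists)[:k].tolist())  # k nearest neighbors
    return sorted(selected)
```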

Figure 6. The impact of k on prediction results for the 5 test sets from AEEEM (same three combinations as Figure 5).

7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of training data of quality in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), for the 15 releases, most of them contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as


Figure 7. Percentage of defective instances with different numbers of bugs: (a) from the viewpoint of a single dataset (release); (b) from the viewpoint of the whole dataset used in our experiments.

TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS, after removing redundant ones.
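A compact way to express this two-stage procedure is sketched below. The function name, the sim/norm callables, and the (metric_vector, defect_count) tuple layout are our own assumptions, while the score follows (2) and the bug threshold of three follows the text.

```python
def tdselector3(candidates, test_instances, alpha, sim, norm, k=10, min_bugs=3):
    """Return indices of the final TDS built by TDSelector-3.

    candidates: list of (metric_vector, defect_count) training instances.
    sim(u, v): similarity of two metric vectors; norm(d): normalized defects.
    """
    # Stage 1: instances with at least `min_bugs` bugs enter the TDS directly.
    picked = {i for i, (_, d) in enumerate(candidates) if d >= min_bugs}
    rest = [i for i in range(len(candidates)) if i not in picked]
    # Stage 2: rank the remaining instances by the score of Eq. (2).
    for t in test_instances:
        ranked = sorted(
            rest,
            key=lambda i: alpha * sim(candidates[i][0], t)
            + (1 - alpha) * norm(candidates[i][1]),
            reverse=True,
        )
        picked.update(ranked[:k])  # top-k per test instance; duplicates merge
    return sorted(picked)
```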

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average, TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold for the number of bugs in each training instance (namely, three in this paper) at the first stage; our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.

7.3. Threats to Validity. In this study, we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study would change if we used different settings of the above factors.

Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria that were initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper belong to small (0.147 ≤ δ < 0.33) and very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method obviously performs better than baseline1, indicated by |δ| = 0.409 > 0.33.

Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of the results of our study. All the 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five


Figure 8. A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector under "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" across the 15 test sets. The last column in each of the three plots represents the average AUC value.

normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from those defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with those similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are


required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 4: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

4 Mathematical Problems in Engineering

trainingPredictor

test

Training data set Ss Target data set St

(a) General CPDP

MetricFeature value f

Buggy instance I

Non-buggy instance I

Unlabeled instance I

Strategies for instance selection

Predictor

testtraining

Training data set Ss

Ss Target data set StReduced training data set rsquo

(b) Improved CPDP using training data selection

Figure 1 Two CPDP scenarios used in this paper

01 0 05 0 1(3)

01 0 0 05 1(1)

04 03 0 01 0

0 0 04 0 0

01 0 0 05 0

01 0 05 05

Label (defects)

rank instance

1

2

3

distance(Ii Itest)I1

I2

I3

I4

I4

I3

I5

Itest

f1 f2 f3 f4

I1 I2 I5

St

Ss

Figure 2 An example of the selection of training instances

Mathematical Problems in Engineering 5

Data sets

Project C-v1

Ranked instances(TDSelector)

Defects of each instance

similarity of software metrics

Defect predictor

Project A-v1Project A-v2Project B-v1Project C-v1Project C-v2 Test set

Project A-v1Project A-v2Project B-v1

Training set

ranking

TDS setup

rankingTraining

data

Test data

Figure 3 The overall structure of TDSelector for CPDP

4 Our Approach TDSelector

To improve the prediction performance of CPDP we leveragethe following observations

Similar Instances Given a test instance we can examine itssimilar training instances that were labeled beforeThe defectproneness shared by similar training instances could helpus identify the probability that a test instance is defectiveIntuitively two instances are more likely to have the samestate if their metric values are very similar

Number of Defects (defects) During the selection processwhen several training instances have the same distance froma test instance we need to determine which one shouldbe ranked higher According to our experiences in softwaredefect prediction and other researchersrsquo studies on the quan-titative analysis of previous defect prediction approaches [3435] we believe that more attention should be paid to thosetraining instances with more defects in practice

The selection of training data based on instance similarityhas been used in someprior studies [5 12 35]However to thebest of our knowledge the information about defects has notbeen fully utilized So in this paper we attempt to propose atraining data selection approach combining such informationand instance similarity

41 Overall Structure of TDSelector Figure 3 shows the over-all structure of the proposed approach to training dataselection named TDSelector Before selecting appropriatetraining data for CPDP we have to set up a test set andits corresponding initial TDS For a given project treated asthe test set all the other projects (except the target project)available at hand are used as the initial TDS This is the so-calledmany-to-one (M2O) scenario for CPDP [13] It is quitedifferent from the typical O2O (one-to-one) scenario whereonly one randomly selected project is treated as the trainingset for a given target project (namely test set)

When both of the sets are given the ranks of traininginstances are calculated based on the similarity of softwaremetrics and then returned for each test instance For the

initial TDS we also collect each training instancersquos defectsand thus rank these instances by their defects Then we rateeach training instance by combining the two types of ranks insome way and identify the top-k training instances for eachtest instance according to their final scores Finally we usethe predictor trained with the final TDS to predict defectproneness in the test set We describe the core componentof TDSelector namely scoring scheme in the followingsubsection

42 Scoring Scheme For each instance in training set and testset which is treated as a vector of features (namely softwaremetrics) we calculate the similarity between them in terms ofsimilarity index (such as cosine similarity Euclidean distanceand Manhattan distance as shown in Table 1) Traininginstances are then ranked by the similarity between each ofthem and a given test instance

For instance the cosine similarity between a traininginstance 119868119901 and the target instance 119868119902 is computed via theirvector representations described as follows

119878119894119898 (119868119901 119868119902) =997888rarr119868119901 sdot 997888rarr1198681199021003817100381710038171003817100381711986811990110038171003817100381710038171003817 times 1003817100381710038171003817100381711986811990210038171003817100381710038171003817 =

sum119899119894=1 (119891119901119894 times 119891119902119894)radicsum119899119894=1 1198912119901119894 times radicsum119899119894=1 1198912119902119894 (1)

where 997888rarr119868119901 and 997888rarr119868119902 are the metric vectors for 119868119901 and 119868119902 respec-tively and 119891lowast119894 represents the 119894th metric value of instance 119868lowast

Additionally for each training instance we also considerthe factor defects in order to further enrich the ranking of itsrelevant instances The assumption here is that the more theprevious defects the richer the information of an instance Sowe propose a scoring scheme to rank those candidate traininginstances defined as below

Score (119868119901 119868119902) = 120572 119878119894119898 (119868119901 119868119902) + (1 minus 120572)119873 (119889119890119891119890119888119905119901) (2)

where 119889119890119891119890119888119905119901 represents the defects of 119868119901 120572 is a weightingfactor (0 le 120572 le 1) which is learned from training data usingAlgorithm 1 (see Algorithm 1) and 119873(119889119890119891119890119888119905119901) is a functionused to normalize defects with values ranging from 0 to 1

6 Mathematical Problems in Engineering

Optimizing the parameter 120572Input(1) Candidate TDS 119878119904 = 1198681199041 1198681199042 119868119904119898 test set 119878119905 = 1198681199051 1198681199052 119868119905119897 (119898 gt 119897)(2) 119889119890119891119890119888119905119904 = 119889119890119891119890119888119905(1198681199041) 119889119890119891119890119888119905(1198681199042) 119889119890119891119890119888119905(119868119904119898) and 119896 = 10Output(3) 120572 (120572 isin [0 1])Method(4) Initialize 120572 = 0 119878119904(120572) = 0(5) While (120572 le 1) do(6) For 119894 = 1 119894 le 119897 119894 + +(7) For 119895 = 1 119895 le 119898 119895 + +(8) Score(119868119905119894 119868119904119895) = 120572 119878119894119898(119868119905119894 119868119904119895) + (1 minus 120572)119873 (119889119890119891119890119888119905(119868119904119895))(9) End For(10) descSort (119878119888119900119903119890(119868119905119894 119868119904119895) | 119895 = 1 sdot sdot sdot 119898) sort119898 training instances in descending order(11) 119878119904(120572) = 119878119904(120572) cup Top-119896 training instances select the top 119896 instances(12) End For(13) 119860119880119862 lArr997904 119878119904(120572) CPDP997888997888997888997888rarr 119878119905 prediction result(14) 120572 = 120572 + 01(15) EndWhile(16) Return (120572 | max120572119860119880119862)

Algorithm 1 Algorithm of parameter optimization

Table 1 Similarity indexes and normalization methods used in thispaper

Similarity

Cosine cos (119883 119884) = sum119899119896=1 119909119896119910119896radicsum119899119896=1 1199092119896radicsum119899119896=1 1199102119896

Euclidean distance 119889 (119883 119884) = radic 119899sum119896=1

(119909119896 minus 119910119896)2Manhattan distance 119889 (119883 119884) = 119899sum

119896=1

1003816100381610038161003816119909119896 minus 1199101198961003816100381610038161003816Normalization

Linear 119873(119909) = 119909 minus 119909min119909max minus 119909min

Logistic 119873(119909) = 11 + 119890minus119909 minus 05Square root 119873(119909) = 1 minus 1radic1 + 119909Logarithmic 119873(119909) = log10(119909 + 1)Inverse cotangent 119873(119909) = arctan(119909) lowast 2120587

Normalization is a commonly used data preprocessingtechnique for mathematics and computer science [36] Grafand Borer [37] have confirmed that normalization canimprove prediction performance of classificationmodels Forthis reason we normalize the defects of training instanceswhen using TDSelector As you know there are many nor-malization methods In this study we introduce five typicalnormalization methods used in machine learning [36 38]The description and formulas of the five normalizationmethods are listed in Table 1

For each test instance the top-119896 training instances rankedin terms of their scores will be returned Hence the final

TDS is composed by merging the sets of the top-119896 train-ing instances for each test instance when those duplicateinstances are removed

5 Experimental Setup

51 Research Questions Our experiments were conductedto find empirical evidence that answers the following threeresearch questions

RQ1 Does the Consideration of Defects Improve the Per-formance of CPDP Unlike the previous methods [1 5 712 29] TDSelector ranks candidate training instances interms of both defects and metric-based similarity To eval-uate the effectiveness of the proposed method consideringthe additional information of defects we tested TDSelectoraccording to the experimental data described in Section 52According to (2) we also empirically analyzed the impact ofthe parameter 120572 on prediction results

RQ2 Which Combination of Similarity and NormalizationIs More Suitable for TDSelector Equation (2) is comprisedof two parts namely similarity and the normalization ofdefects For each part several commonly usedmethods can beadopted in our context To fully take advantage of TDSelectorone would wonder which combination of similarity andnormalization should be chosen Therefore it is necessary tocompare the effects of different combinations of similarityand normalization methods on prediction results and todetermine the best one for TDSelector

RQ3 Can TDSelector-Based CPDP Outperform the Base-line Methods Cross-project prediction has attracted muchresearch interest in recent years and a few CPDP approachesusing training data selection have also been proposed eg

Mathematical Problems in Engineering 7

Peter filter based CPDP [5] (labeled as baseline1) and TCA+(Transfer Component Analysis) based CPDP [39] (labeledas baseline2) To answer the third question we comparedTDSelector-based CPDP proposed in this paper with theabove two state-of-the-art methods

52 Data Collection To evaluate the effectiveness of TDSe-lector in this paper we used 14 open-source projects writtenin Java on two online public software repositories namelyPROMISE [40] and AEEEM [41] The data statistics of the 14projects in question are presented in Table 2 where Instanceand Defect are the numbers of instances and defectiveinstances respectively and Defect is the proportion ofdefective instances to the total number of instances Eachinstance in these projects represents a file of object class andconsists of two parts namely software metrics and defects

The first repository PROMISE was collected by Jureczkoand Spinellis [40] The information of defects and 20 sourcecode metrics for the projects on PROMISE have been vali-dated and used in several previous studies [1 7 12 29] Thesecond repository AEEEM was collected by DrsquoAmbros etal [41] and each project on it has 76 metrics including 17source code metrics 15 change metrics 5 previous defectmetrics 5 entropy-of-change metrics 17 entropy-of-source-code metrics and 17 churn-of-source-code metrics AEEEMhas been successfully used in [23 39]

Before performing a cross-project prediction we need todetermine a target dataset (test set) and its candidate TDSFor PROMISE (10 projects) each one in the 10 projects wasselected to be the target dataset once and then we set up acandidate TDS for CPDP which excluded any data from thetarget project For instance if Ivy is selected as test projectdata from the other nine projects was used to construct itsinitial TDS

53 Experiment Design To answer the three research ques-tions our experimental procedure which is designed underthe context of M2O in the CPDP scenario is described asfollows

First as with many prior studies [1 5 15 35] all softwaremetric values in training and test sets were normalized byusing the119885-scoremethod because thesemetrics are differentin the scales of numerical values For the 14 projects onAEEEM and PROMISE their numbers of software metricsare different So the training set for a given test set wasselected from the same repository

Second to examine whether the consideration of defectsimproves the performance of CPDP we compared ourapproach TDSelector with NoD which is a baseline methodconsidering only the similarity between instances ie 120572 = 1in (2) Since there are three similarity computation methodsused in this paper we designed three different TDSelectorsand their corresponding baseline methods based on similar-ity indexesThe prediction results of eachmethod in questionfor the 15 test sets were analyzed in terms of mean valueand standard deviation More specifically we also used Cliff rsquosdelta (120575) [42] which is a nonparametric effect size measureof how often the values in one distribution are larger than the

values in a second distribution to compare the results gen-erated through our approach and its corresponding baselinemethod

Because Cliff did not suggest corresponding 120575 values torepresent small medium and large effects we convertedCohenrsquos 119889 effect size to Cliff rsquos 120575 using cohd2delta R package(httpsrdrriocranorddommancohd2deltahtml) Notethat Table 3 contains descriptors for magnitude of 119889 = 001to 20

Third according to the results of the second step of thisprocedure 15 combinations based on three typical similaritymethods for software metrics and five commonly usednormalization functions for defects were examined by thepairwise comparison method We then determined whichcombination is more suitable for our approach according tomean standard deviation and Cliff rsquos delta effect size

Fourth to further validate the effectiveness of the TDS-elector-based CPDP predictor we conducted cross-projectpredictions for all the 15 test sets using TDSelector and twocompeting methods (ie baseline1 and baseline2 introducedin Section 51) Note that the TDSelector used in this exper-iment was built with the best combination of similarity andnormalization

After this process is completed we will discuss the an-swers to the three research questions of our study

54 Classifier and Evaluation Measure As an underlyingmachine learning classifier for CPDP Logistic Regression(LR) which was widely used in many defect prediction liter-atures [4 23 39 43ndash46] is also used in this study All LR clas-sifiers were implemented withWeka (httpswwwcswaikatoacnzmlweka) For our experiments we used the defaultparameter setting for LR specified in Weka unless otherwisespecified

To evaluate the prediction performance of differentmeth-ods in this paper we utilized the area under a ReceiverOperating Characteristic curve (AUC) AUC is equal to theprobability that a classifier will identify a randomly chosendefective class higher than a randomly chosen defect-freeone [47] known as a useful measure for comparing differentmodels Compared with traditional accuracy measures AUCis commonly used because it is unaffected by class imbalanceand independent of the prediction threshold that is used todecide whether an instance should be classified as a negativeinstance [6 48 49] The AUC value of 05 indicates theperformance of a random predictor and higher AUC valuesindicate better prediction performance

6 Experimental Results

61 Answer to RQ1 We compared our approach consideringdefects with the baseline method NoD that selects trainingdata in terms of cosine similarity Table 5 shows that onaverage TDSelector does achieve an improvement in AUCvalue across the 15 test sets Obviously the average growthrates of AUC value vary from 59 to 90 when differentnormalization methods for defects were utilized In additionall the 120575 values in this table are greater than 02 which

8 Mathematical Problems in Engineering

Table 2 Data statistics of the projects used in our experiments

Repository Project Version Instance Defect Defect

PROMISE

Ant 17 745 166 223Camel 16 965 188 195Ivy 20 352 40 114Jedit 32 272 90 331

Lucene 24 340 203 597Poi 30 442 281 636

Synapse 12 256 86 336Velocity 14 196 147 750Xalan 26 885 411 464Xerces 14 588 437 743

AEEEM

Equinox 112005ndash6252008 324 129 398Eclipse JDT core (Eclipse) 112005ndash6172008 997 206 207Apache Lucene (Lucene2) 112005ndash1082008 692 20 29

Mylyn 1172005ndash3172009 1862 245 132Eclipse PDE UI (Pde) 112005ndash9112008 1497 209 140

Table 3 The mappings between different values and their effective-ness levels

Effect size d 120575Very small 001 0008Small 020 0147Medium 050 033Large 080 0474Very large 120 0622Huge 20 0811

indicates that each group of 15 prediction results obtained byour approach has a greater effect than that of NoD In otherwords our approach outperforms NoD In particular forJedit Velocity Eclipse andEquinox the improvements of ourapproach over NoD are substantial For example when usingthe linear normalizationmethod the AUC values for the fourprojects are increased by 306 430 226 and 394respectively moreover the logistic normalization methodfor Velocity achieves the biggest improvement in AUC value(namely 617)

We then compared TDSelector with the baseline methodsusing other widely used similarity calculation methods andthe results obtained by using Euclidean distance and Man-hattan distance to calculate the similarity between instancesare presented in Tables 6 and 7 TDSelector compared withthe corresponding NoD achieves the average growth ratesof AUC value that vary from 59 to 77 in Table 6 andfrom 27 to 69 in Table 7 respectively More specificallythe highest growth rate of AUC value in Table 6 is 436for Equinox and in Table 7 is 397 for Lucene2 Besides allCliff rsquos delta (120575) effect sizes in these two tables are also greaterthan 01 Hence the results indicate that our approach can onaverage improve the performance of those baseline methodswithout regard to defects

Table 4 Analyzing the factors similarity and normalization

Factor Method Mean Std 120575Similarity

Cosine similarity 0704 0082 minus0133Euclidean distance 0719 0080 -Manhattan distance 0682 0098 minus0193

Normalization

Linear 0706 0087 minus0012Logistic 0710 0078 -

Square root 0699 0091 minus0044Logarithmic 0700 0086 minus0064

Inverse cotangent 0696 0097 minus0056

In short during the process of training data selection theconsideration of defects for CPDP can help us to select higherquality training data thus leading to better classificationresults

62 Answer to RQ2 Although the inclusion of defects inthe selection of training data of quality is helpful for betterperformance of CPDP it is worthy to note that our methodcompletely failed in Mylyn and Pde when computing thesimilarity between instances in terms of Manhattan distance(see the corresponding maximum AUC values in Table 7)This implies that the success ofTDSelector depends largely onthe reasonable combination of similarity and normalizationmethods Therefore which combination of similarity andnormalization is more suitable for TDSelector

First we analyzed the two factors (ie similarity andnormalization) separately For example we evaluated thedifference among cosine similarity Euclidean distance andManhattan distance regardless of any normalization methodused in the experiment The results expressed in terms ofmean and standard deviation are shown in Table 4 wherethey are grouped by factors

Mathematical Problems in Engineering 9

Table5Th

ebestpredictio

nresults

obtained

bytheCP

DPapproach

basedon

TDSelec

torw

ithCosinesim

ilarityNoD

representsthebaselin

emetho

d+deno

testhe

grow

thrateof

AUC

valuethem

axim

umAU

Cvalueo

fdifferentn

ormalizationmetho

dsisun

derlinedeach

numbersho

wnin

bold

indicatesthatthe

correspo

ndingAU

Cvaluer

isesb

ymorethan10

Cosines

imilarity

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

07

09

09

1009

1009

1006

09

08

06

07

07

05

0338

AUC

0813

0676

0603

0793

0700

0611

0758

0741

0512

0742

0783

0760

0739

0705

0729

0711plusmn0

081

+(

)63

37

19

-30

6

-30

-43

0

03

226

394

41

59

40

90

Logistic

12057207

05

07

107

06

06

06

05

05

004

07

05

05

0351

AUC

0802

0674

0595

0793

0665

0621

0759

0765

0579

0745

0773

0738

0712

0707

0740

0711plusmn0

070

+(

)48

34

05

-24

116

31

32

617

07

210

355

03

62

56

90

Square

root

12057207

07

06

06

07

06

07

09

05

104

06

06

06

06

0249

AUC

0799

0654

0596

0807

0735

0626

0746

0762

0500

0740

0774

0560

0722

0700

0738

0697plusmn0

091

+(

)44

03

07

18

371

25

14

28

397

-210

28

17

53

53

69

Logarithm

ic120572

06

06

09

1007

1007

07

05

09

05

05

06

06

06

0351

AUC

0798

0662

0594

0793

0731

0611

0748

0744

0500

0758

0774

0700

0755

0702

0741

0707plusmn0

083

+(

)43

15

03

-36

4

-16

04

397

24

212

285

63

55

58

85

Inversec

otangent

12057207

1010

1007

1007

1006

07

007

07

07

07

0213

AUC

0798

0652

0592

0793

0659

0611

0749

0741

0500

0764

0773

0556

0739

0695

0734

0690plusmn0

092

+(

)43

--

-229

-18

-

397

32

210

21

41

44

48

59

NoD

(120572=1)

0765

0652

0592

0793

0536

0611

0736

0741

0358

0740

0639

0543

0709

0665

0701

0652plusmn0

113

10 Mathematical Problems in Engineering

Table 6: The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance (organized as in Table 5).

Normalization      Ant   Xalan Camel Ivy   Jedit Lucene Poi   Synapse Velocity Xerces Eclipse Equinox Lucene2 Mylyn Pde   Mean±Std     δ
Linear             0.795 0.727 0.598 0.826 0.793 0.603  0.714 0.757   0.545    0.775  0.773   0.719   0.722   0.697 0.744 0.719±0.080  0.369
Logistic           0.787 0.750 0.603 0.832 0.766 0.613  0.716 0.767   0.556    0.745  0.773   0.698   0.722   0.690 0.730 0.717±0.075  0.360
Square root        0.796 0.743 0.598 0.820 0.720 0.618  0.735 0.786   0.564    0.737  0.774   0.696   0.722   0.690 0.750 0.715±0.076  0.342
Logarithmic        0.794 0.746 0.598 0.819 0.722 0.607  0.714 0.757   0.573    0.739  0.778   0.722   0.722   0.690 0.748 0.715±0.072  0.324
Inverse cotangent  0.796 0.749 0.603 0.820 0.701 0.623  0.714 0.787   0.538    0.750  0.773   0.589   0.763   0.690 0.722 0.708±0.084  0.280
NoD (α = 1)        0.785 0.681 0.598 0.819 0.600 0.592  0.714 0.757   0.488    0.737  0.657   0.503   0.722   0.690 0.678 0.668±0.096  -


Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance (organized as in Table 5).

Normalization      Ant   Xalan Camel Ivy   Jedit Lucene Poi   Synapse Velocity Xerces Eclipse Equinox Lucene2 Mylyn Pde   Mean±Std     δ
Linear             0.804 0.753 0.599 0.816 0.689 0.626  0.695 0.748   0.500    0.749  0.773   0.633   0.692   0.695 0.668 0.696±0.084  0.187
Logistic           0.799 0.760 0.607 0.830 0.674 0.621  0.735 0.794   0.520    0.756  0.773   0.680   0.559   0.695 0.668 0.705±0.084  0.249
Square root        0.795 0.755 0.604 0.816 0.693 0.627  0.704 0.750   0.510    0.749  0.773   0.532   0.523   0.695 0.668 0.680±0.100  0.164
Logarithmic        0.794 0.755 0.603 0.816 0.664 0.589  0.695 0.763   0.524    0.756  0.773   0.532   0.523   0.695 0.668 0.677±0.102  0.116
Inverse cotangent  0.794 0.749 0.608 0.821 0.667 0.609  0.710 0.748   0.500    0.758  0.773   0.532   0.523   0.695 0.668 0.677±0.103  0.133
NoD (α = 1)        0.794 0.704 0.597 0.816 0.642 0.589  0.695 0.748   0.464    0.749  0.693   0.532   0.500   0.695 0.668 0.659±0.105  -


Figure 4: A guideline for choosing suitable similarity indexes and normalization methods, from the two aspects of (1) similarity and (2) normalization; the selection priority is lowered along the direction of the arrow. For similarity, the order is Euclidean distance, then cosine similarity, with "Manhattan + Logistic" as the last resort; for normalization, the order is the logistic method, then the linear method, again ending with "Manhattan + Logistic".

If we do not take into account normalization, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take into account the similarity index, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), and it is followed by the linear normalization method.

Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the fact that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector, from these two different aspects, is presented in Figure 4.
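For reference, the five normalization functions N(x) compared here (defined in Table 1 of the paper) can be written directly in code; a minimal sketch:

```python
import math

def n_linear(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)        # maps defect counts into [0, 1]

def n_logistic(x):
    return 1.0 / (1.0 + math.exp(-x)) - 0.5    # the preferred method above

def n_square_root(x):
    return 1.0 - 1.0 / math.sqrt(1.0 + x)

def n_logarithmic(x):
    return math.log10(x + 1.0)

def n_inverse_cotangent(x):
    return math.atan(x) * 2.0 / math.pi
```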

Then, we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.

6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is, on average, better than both baseline methods, as indicated by the average growth rates of the AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of the AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained by the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, regarding instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as those of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized because of the instances with more defects. This result further validates that the selection of training data considering defects is valuable.

Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.

In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.

7. Discussion

7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances selected for each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value changes from 1 to 10 with a step of 1. As shown in Figure 5, for the three combinations in question, selecting fewer nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because the best results are obtained when k is equal to 10.

Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance


Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. A "-" marks the combination compared with itself; the columns are grouped by similarity index, with the normalization methods Linear, Logistic, Square root, Logarithmic, and Inverse cotangent in that order.

                      Cosine similarity                     Euclidean distance                       Manhattan distance
Cosine + Linear       -      0.018  0.084  0.000  0.116     -0.049 -0.036 -0.004 -0.013 -0.009      0.138  0.049  0.164  0.178  0.169
Euclidean + Linear    0.049  0.102  0.111  0.062  0.164     -      0.036  0.040  0.058  0.089       0.209  0.102  0.249  0.276  0.244
Manhattan + Logistic  -0.049 -0.022 0.022  -0.013 0.111     -0.102 -0.076 -0.080 -0.049 -0.031      0.053  -      0.124  0.151  0.147


Table 9: A comparison between our approach and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question; the two "+(%)" columns give the growth rate of the TDSelector ("Euclidean + Linear") AUC value over each baseline.

Test set   Baseline1  Baseline2  +(%) over Baseline1  +(%) over Baseline2
Ant        0.785      0.803      1.3                  -1.0
Xalan      0.657      0.675      10.7                 7.7
Camel      0.595      0.624      0.5                  -4.2
Ivy        0.789      0.802      4.7                  3.0
Jedit      0.694      0.782      14.3                 1.4
Lucene     0.608      0.701      -0.8                 -14.0
Poi        0.691      0.789      3.3                  -9.5
Synapse    0.740      0.748      2.3                  1.2
Velocity   0.330      0.331      65.2                 64.7
Xerces     0.714      0.753      8.5                  2.9
Eclipse    0.706      0.744      10.2                 4.6
Equinox    0.587      0.720      23.1                 0.3
Lucene2    0.705      0.724      2.5                  -0.2
Mylyn      0.631      0.646      9.3                  6.8
Pde        0.678      0.737      10.4                 1.5
Avg        0.663      0.705      10.6                 4.3

Cliff's delta: δ(Baseline1 vs. TDSelector) = -0.409; δ(Baseline2 vs. TDSelector) = -0.009.

Table 10: Comparison of the defective instances of the simplified training datasets obtained by different methods on the Velocity project.

Method       #defect instances / #instances    #instances (defects > 1) / #defect instances
Baseline1    0.375                             0.247
Baseline2    0.393                             0.291
TDSelector   0.376                             0.487

Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC for k = 1, ..., 10 under the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear").

when k is equal to 10. The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
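A sketch of the sweep behind Figures 5 and 6 follows, reusing select_training_data() from the earlier sketch; scikit-learn's LogisticRegression and roc_auc_score are our stand-ins for the Weka LR classifier and the AUC measure used in the paper, and numpy arrays are assumed for the inputs.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_for_k(train_X, train_y, train_defects, test_X, test_y, alpha, k):
    # Select the top-k training instances per test instance, then train and score.
    idx = select_training_data(test_X, train_X, train_defects, alpha=alpha, k=k)
    clf = LogisticRegression(max_iter=1000).fit(train_X[idx], train_y[idx])
    return roc_auc_score(test_y, clf.predict_proba(test_X)[:, 1])

# AUC as a function of k, as plotted in Figures 5 and 6:
# aucs = [auc_for_k(Xtr, ytr, dtr, Xte, yte, alpha=0.9, k=k) for k in range(1, 11)]
```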

Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC for k = 1, ..., 10 under the same three combinations).

7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of high-quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), for the 15 releases, most instances contain no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total number of instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as TDSelector-3.


Figure 7: Percentage of defective instances with different numbers of bugs: (a) is shown from the viewpoint of a single dataset (release), plotting the proportions of instances with defects < 2 and defects < 3 for each of the 15 releases; (b) is shown from the viewpoint of the whole dataset used in our experiments.

That is to say, those defective instances that have at least three bugs are chosen directly from an initial TDS as training data, while the remaining instances in the TDS are selected in light of (2). All instances from the two parts then form the final TDS, after removing redundant ones.
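A sketch of this two-stage variant follows; the threshold of three bugs comes from the text above, while select_training_data() and all other names are carried over from our earlier illustrative sketch rather than from the original tool.

```python
import numpy as np

def tdselector_3(test_X, train_X, train_defects, alpha=0.7, k=10, min_bugs=3):
    # Stage 1: keep every candidate with at least min_bugs defects outright.
    train_defects = np.asarray(train_defects)
    direct = np.where(train_defects >= min_bugs)[0]
    rest = np.where(train_defects < min_bugs)[0]
    # Stage 2: apply the ordinary TDSelector scoring (Eq. (2)) to the remainder.
    picked = select_training_data(test_X, train_X[rest], train_defects[rest],
                                  alpha=alpha, k=k)
    # Map positions within `rest` back to original indices; the union deduplicates.
    return np.union1d(direct, rest[np.asarray(picked, dtype=int)])
```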

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average, TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold for the number of bugs in each training instance (namely, three in this paper) at the first stage, and our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.

7.3. Threats to Validity. In this study, we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we were to use different settings of the above factors.
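For instance, the α search with step size 0.1 mentioned above amounts to a simple grid search; a sketch, where evaluate_auc is a hypothetical callback that selects a TDS with the given α, trains the LR predictor, and returns its AUC on the test set:

```python
def optimize_alpha(evaluate_auc, step=0.1):
    # Try alpha = 0.0, 0.1, ..., 1.0 and keep the value with the best AUC.
    candidates = [round(i * step, 1) for i in range(int(1 / step) + 1)]
    return max(candidates, key=evaluate_auc)
```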

Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to the mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis testing methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method obviously performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
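For reference, Cliff's delta can be computed directly from two groups of observations (e.g., two sets of 15 AUC values); a small sketch:

```python
def cliffs_delta(xs, ys):
    # delta = (#{x > y} - #{x < y}) / (|xs| * |ys|), ranging over [-1, 1].
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Per the thresholds cited from [52]: 0.008 <= |delta| < 0.147 is "very small"
# and 0.147 <= |delta| < 0.33 is "small".
```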

Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of the results of our study. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five


Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector under the "Euclidean + Linear", "Cosine + Linear", and "Manhattan + Logistic" combinations, across the 15 test sets. The last column in each of the three plots represents the average AUC value.

normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, in order to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, Netherlands, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, USA, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, USA, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, Italy, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software. Ruanjian Xuebao, vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, India, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, Italy, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, Italy, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, p. 1, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, USA, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, Czech Republic, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, no. 7, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, USA, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, Hungary, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, Germany, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, Hungary, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, Sweden, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.



In short during the process of training data selection theconsideration of defects for CPDP can help us to select higherquality training data thus leading to better classificationresults

62 Answer to RQ2 Although the inclusion of defects inthe selection of training data of quality is helpful for betterperformance of CPDP it is worthy to note that our methodcompletely failed in Mylyn and Pde when computing thesimilarity between instances in terms of Manhattan distance(see the corresponding maximum AUC values in Table 7)This implies that the success ofTDSelector depends largely onthe reasonable combination of similarity and normalizationmethods Therefore which combination of similarity andnormalization is more suitable for TDSelector

First we analyzed the two factors (ie similarity andnormalization) separately For example we evaluated thedifference among cosine similarity Euclidean distance andManhattan distance regardless of any normalization methodused in the experiment The results expressed in terms ofmean and standard deviation are shown in Table 4 wherethey are grouped by factors

Mathematical Problems in Engineering 9

Table5Th

ebestpredictio

nresults

obtained

bytheCP

DPapproach

basedon

TDSelec

torw

ithCosinesim

ilarityNoD

representsthebaselin

emetho

d+deno

testhe

grow

thrateof

AUC

valuethem

axim

umAU

Cvalueo

fdifferentn

ormalizationmetho

dsisun

derlinedeach

numbersho

wnin

bold

indicatesthatthe

correspo

ndingAU

Cvaluer

isesb

ymorethan10

Cosines

imilarity

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

07

09

09

1009

1009

1006

09

08

06

07

07

05

0338

AUC

0813

0676

0603

0793

0700

0611

0758

0741

0512

0742

0783

0760

0739

0705

0729

0711plusmn0

081

+(

)63

37

19

-30

6

-30

-43

0

03

226

394

41

59

40

90

Logistic

12057207

05

07

107

06

06

06

05

05

004

07

05

05

0351

AUC

0802

0674

0595

0793

0665

0621

0759

0765

0579

0745

0773

0738

0712

0707

0740

0711plusmn0

070

+(

)48

34

05

-24

116

31

32

617

07

210

355

03

62

56

90

Square

root

12057207

07

06

06

07

06

07

09

05

104

06

06

06

06

0249

AUC

0799

0654

0596

0807

0735

0626

0746

0762

0500

0740

0774

0560

0722

0700

0738

0697plusmn0

091

+(

)44

03

07

18

371

25

14

28

397

-210

28

17

53

53

69

Logarithm

ic120572

06

06

09

1007

1007

07

05

09

05

05

06

06

06

0351

AUC

0798

0662

0594

0793

0731

0611

0748

0744

0500

0758

0774

0700

0755

0702

0741

0707plusmn0

083

+(

)43

15

03

-36

4

-16

04

397

24

212

285

63

55

58

85

Inversec

otangent

12057207

1010

1007

1007

1006

07

007

07

07

07

0213

AUC

0798

0652

0592

0793

0659

0611

0749

0741

0500

0764

0773

0556

0739

0695

0734

0690plusmn0

092

+(

)43

--

-229

-18

-

397

32

210

21

41

44

48

59

NoD

(120572=1)

0765

0652

0592

0793

0536

0611

0736

0741

0358

0740

0639

0543

0709

0665

0701

0652plusmn0

113

10 Mathematical Problems in Engineering

Table6Th

ebestp

redictionresults

obtained

bytheC

PDPapproach

basedon

TDSelec

torw

ithEu

clidean

distance

Euclidean

distance

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

09

09

1009

09

08

1010

08

08

006

1008

08

0369

AUC

0795

0727

0598

0826

0793

0603

0714

0757

0545

0775

0773

0719

0722

0697

0744

0719plusmn0

080

+(

)13

68

-09

322

19

--

117

52

176

430

-11

96

77

Logistic

12057207

08

04

07

07

05

06

09

09

09

007

1010

09

0360

AUC

0787

0750

0603

0832

0766

0613

0716

0767

0556

0745

0773

0698

0722

0690

0730

0717plusmn0

075

+(

)03

101

08

16

277

35

03

13

139

11

176

388

--

75

72

Square

root

12057207

08

1007

08

06

07

07

07

1007

08

1010

09

0342

AUC

0796

0743

0598

0820

0720

0618

0735

0786

0564

0737

0774

0696

0722

0690

0750

0715plusmn0

076

+(

)14

91

-

01

200

44

29

38

156

-178

384

--

105

70

Logarithm

ic120572

07

08

1010

08

06

1010

09

09

09

08

1010

09

0324

AUC

0794

0746

0598

0819

0722

0607

0714

0757

0573

0739

0778

0722

0722

0690

0748

0715plusmn0

072

+(

)11

95

--

203

25

--

174

03

185

436

--

103

70

Inversec

otangent

12057208

09

06

08

08

07

1008

06

07

009

09

1009

0280

AUC

0796

0749

0603

0820

0701

0623

0714

0787

0538

0750

0773

0589

0763

0690

0722

0708plusmn0

084

+(

)14

100

08

01

168

52

-40

102

18

176

170

56

-64

59

NoD

(120572=1)

0785

0681

0598

0819

060

00592

0714

0757

0488

0737

0657

0503

0722

0690

0678

066

8plusmn0

096

Mathematical Problems in Engineering 11

Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance. (Same layout as Table 6.)

Linear (δ = 0.187)
  α:    0.8, 0.9, 0.9, 1.0, 0.9, 0.9, 1.0, 1.0, 0.8, 1.0, 0, 0.8, 0.9, 1.0, 1.0
  AUC:  0.804, 0.753, 0.599, 0.816, 0.689, 0.626, 0.695, 0.748, 0.500, 0.749, 0.773, 0.633, 0.692, 0.695, 0.668; 0.696 ± 0.084
  +(%): 1.3, 7.0, 0.3, -, 7.3, 6.3, -, -, 7.8, -, 11.6, 19.0, 39.7, -, -; avg 5.6

Logistic (δ = 0.249)
  α:    0.7, 0.7, 0.8, 0.8, 0.8, 0.7, 0.7, 0.9, 0.6, 0.7, 0, 0.9, 0.9, 1.0, 1.0
  AUC:  0.799, 0.760, 0.607, 0.830, 0.674, 0.621, 0.735, 0.794, 0.520, 0.756, 0.773, 0.680, 0.559, 0.695, 0.668; 0.705 ± 0.084
  +(%): 0.6, 8.0, 1.7, 1.7, 5.0, 5.4, 5.8, 6.1, 12.1, 0.9, 11.6, 27.9, 12.7, -, -; avg 6.9

Square root (δ = 0.164)
  α:    0.9, 0.9, 0.9, 1.0, 0.8, 0.8, 0.9, 0.8, 0.9, 1.0, 0, 1.0, 0, 1.0, 1.0
  AUC:  0.795, 0.755, 0.604, 0.816, 0.693, 0.627, 0.704, 0.750, 0.510, 0.749, 0.773, 0.532, 0.523, 0.695, 0.668; 0.680 ± 0.1
  +(%): 0.1, 7.2, 1.2, -, 7.9, 6.5, 1.3, 0.3, 9.9, -, 11.6, -, 4.6, -, -; avg 3.1

Logarithmic (δ = 0.116)
  α:    1.0, 0.9, 0.9, 1.0, 0.9, 1.0, 1.0, 0.8, 0.9, 0.9, 0, 1.0, 0, 1.0, 1.0
  AUC:  0.794, 0.755, 0.603, 0.816, 0.664, 0.589, 0.695, 0.763, 0.524, 0.756, 0.773, 0.532, 0.523, 0.695, 0.668; 0.677 ± 0.102
  +(%): -, 7.2, 1.0, -, 3.4, -, -, 2.0, 12.9, 0.9, 11.6, -, 4.6, -, -; avg 2.7

Inverse cotangent (δ = 0.133)
  α:    1.0, 0.9, 0.9, 0.9, 0.9, 0.8, 0.9, 1.0, 0.7, 0.8, 0, 1.0, 0, 1.0, 1.0
  AUC:  0.794, 0.749, 0.608, 0.821, 0.667, 0.609, 0.710, 0.748, 0.500, 0.758, 0.773, 0.532, 0.523, 0.695, 0.668; 0.677 ± 0.103
  +(%): -, 6.4, 1.8, 0.6, 3.9, 3.4, 2.2, -, 7.8, 1.2, 11.6, -, 4.6, -, -; avg 2.7

NoD (α = 1)
  AUC:  0.794, 0.704, 0.597, 0.816, 0.642, 0.589, 0.695, 0.748, 0.464, 0.749, 0.693, 0.532, 0.500, 0.695, 0.668; 0.659 ± 0.105


[Figure 4 (diagram omitted): A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow.]

If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), and it is followed by the linear normalization method.

Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the fact that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector, from these two aspects, is then presented in Figure 4.

Then we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by different similarity indexes, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated Cliff's delta (δ) effect size for every pair of combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in that table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
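To make the preferred configuration concrete, the following minimal sketch (in Python) scores candidate training instances according to (2) with the "Euclidean + Linear" combination and merges the top-k candidates per test instance, as in Algorithm 1. It is an illustrative reimplementation rather than the authors' code; in particular, mapping Euclidean distance to a similarity via 1/(1 + d) is our assumption, since the paper lists the distance itself as the similarity index.

import numpy as np

def tdselector_scores(test_instance, candidates, defects, alpha=0.6):
    # Score(t, s) = alpha * Sim(t, s) + (1 - alpha) * N(defect(s)), cf. equation (2).
    # Similarity from Euclidean distance, mapped into (0, 1] via 1 / (1 + d)
    # (assumed conversion; Table 1 lists the distance itself).
    dist = np.linalg.norm(candidates - test_instance, axis=1)
    sim = 1.0 / (1.0 + dist)
    # Linear normalization of defect counts: N(x) = (x - min) / (max - min).
    x = defects.astype(float)
    span = x.max() - x.min()
    norm = (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * sim + (1.0 - alpha) * norm

def select_training_data(test_set, candidates, defects, alpha=0.6, k=10):
    # Union of the top-k scored candidates per test instance, duplicates removed.
    chosen = set()
    for t in test_set:
        ranking = np.argsort(tdselector_scores(t, candidates, defects, alpha))[::-1]
        chosen.update(ranking[:k].tolist())
    return sorted(chosen)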

6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. Our approach is, on average, better than both baseline methods, indicated by average growth rates of AUC value of 10.6% and 4.3%, respectively, across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In

particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained by the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects is more fully utilized owing to the instances with more defects. This result further validates that selecting training data with consideration of defects is valuable.

Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size of 0.009 is too small to be of interest in a particular application.

In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.

7. Discussion

7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances selected for each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step size of 1. As shown in Figure 5, for the three combinations in question, selecting fewer nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.

Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance


Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. (Columns are grouped by similarity index, each with the five normalization methods in the order Linear, Logistic, Square root, Logarithmic, Inverse cotangent; "-" marks the comparison of a combination with itself.)

Cosine + Linear:
  vs. cosine similarity:    -       0.018   0.084   0.000   0.116
  vs. Euclidean distance:  −0.049  −0.036  −0.004  −0.013  −0.009
  vs. Manhattan distance:   0.138   0.049   0.164   0.178   0.169

Euclidean + Linear:
  vs. cosine similarity:    0.049   0.102   0.111   0.062   0.164
  vs. Euclidean distance:   -       0.036   0.040   0.058   0.089
  vs. Manhattan distance:   0.209   0.102   0.249   0.276   0.244

Manhattan + Logistic:
  vs. cosine similarity:   −0.049  −0.022   0.022  −0.013   0.111
  vs. Euclidean distance:  −0.102  −0.076  −0.080  −0.049  −0.031
  vs. Manhattan distance:   0.053   -       0.124   0.151   0.147


Table 9: A comparison between our approach and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question; the "+(%)" columns give the growth rate of the best TDSelector ("Euclidean + Linear") result over each baseline.

Test set   Baseline1   Baseline2   +(%) vs Baseline1   +(%) vs Baseline2
Ant        0.785       0.803        1.3                 −1.0
Xalan      0.657       0.675       10.7                  7.7
Camel      0.595       0.624        0.5                 −4.2
Ivy        0.789       0.802        4.7                  3.0
Jedit      0.694       0.782       14.3                  1.4
Lucene     0.608       0.701       −0.8                −14.0
Poi        0.691       0.789        3.3                 −9.5
Synapse    0.740       0.748        2.3                  1.2
Velocity   0.330       0.331       65.2                 64.7
Xerces     0.714       0.753        8.5                  2.9
Eclipse    0.706       0.744       10.2                  4.6
Equinox    0.587       0.720       23.1                  0.3
Lucene2    0.705       0.724        2.5                 −0.2
Mylyn      0.631       0.646        9.3                  6.8
Pde        0.678       0.737       10.4                  1.5
Avg        0.663       0.705       10.6                  4.3

Cliff's delta: Baseline1 vs. TDSelector, δ = −0.409; Baseline2 vs. TDSelector, δ = −0.009.

Table 10: Comparison of the defective instances in the simplified training datasets obtained by the different methods on the Velocity project.

Method       #defective instances / #instances   #instances (defects > 1) / #defective instances
Baseline1    0.375                                0.247
Baseline2    0.393                                0.291
TDSelector   0.376                                0.487

[Figure 5 (plot omitted): The impact of k on prediction results for the 10 test sets from PROMISE; AUC versus k (1 to 10) for the combinations "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic".]

when k is equal to 10. The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, this best result is still worse than those of the other two combinations.
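The sweep over k can be sketched as follows, reusing select_training_data() from the sketch in Section 6.2; test_metrics, cand_metrics, cand_defects, and train_and_auc() are hypothetical placeholders for the test set, the candidate TDS, and a helper that trains the LR predictor on the selected TDS and returns its AUC on the test set.

best_k, best_auc = None, 0.0
for k in range(1, 11):                      # k = 1..10 with a step size of 1
    tds = select_training_data(test_metrics, cand_metrics, cand_defects, alpha=0.6, k=k)
    auc = train_and_auc(tds)                # assumed helper: fit LR, score the test set
    if auc > best_auc:
        best_k, best_auc = k, auc
print(best_k, best_auc)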

[Figure 6 (plot omitted): The impact of k on prediction results for the 5 test sets from AEEEM; same combinations and axes as in Figure 5.]

7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether directly selecting defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total number of instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as


[Figure 7 (plots omitted): Percentage of defective instances with different numbers of bugs; (a) is shown from the viewpoint of a single dataset (release), while (b) is shown from the viewpoint of the whole dataset used in our experiments.]

TDSelector-3. That is to say, those defective instances that have at least three bugs are chosen directly from an initial TDS as training data, while the remaining instances in the TDS are selected in light of (2). All instances from the two parts then form the final TDS after redundant ones are removed.
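Under the same assumptions as the earlier sketches (select_training_data() is our illustrative helper, not the authors' code), this two-stage variant can be outlined as follows; the bug-count threshold of three follows the paper.

import numpy as np

def tdselector_3(test_set, candidates, defects, alpha=0.6, k=10, threshold=3):
    # Stage 1: instances with at least `threshold` bugs are taken outright.
    defects = np.asarray(defects)
    direct = set(np.flatnonzero(defects >= threshold).tolist())
    # Stage 2: the remaining candidates are ranked by the usual TDSelector score.
    rest = [i for i in range(len(defects)) if i not in direct]
    picked = select_training_data(test_set, candidates[rest], defects[rest], alpha, k)
    # Merge both parts, mapping stage-2 indices back to the original candidate set.
    return sorted(direct | {rest[i] for i in picked})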

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly, at a first stage, according to a threshold for the number of bugs in each training instance (namely, three in this paper); our approach is then applied to the remaining TDS. Note that an automatic optimization method for this threshold will be investigated in our future work.

7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software

metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study would change if we used different settings of the above factors.

Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis testing methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs substantially better than baseline1, indicated by |δ| = 0.409 > 0.33.
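For completeness, Cliff's delta for two groups of observations can be computed directly from its definition, as in this minimal sketch:

def cliffs_delta(xs, ys):
    # delta = (#{x > y} - #{x < y}) / (|xs| * |ys|), ranging over [-1, 1].
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Example: returns 0.222 here, a "small" effect under the thresholds above.
print(cliffs_delta([0.72, 0.70, 0.69], [0.71, 0.70, 0.68]))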

Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets, beyond AEEEM and PROMISE, is the main threat to the validity of our results. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five


[Figure 8 (plots omitted): A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector, with one plot each for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" across the 15 test sets; the last column in each of the three plots represents the average AUC value.]

normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect prediction. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances to training instances and the defects those instances contain, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method in a comparison with the baseline methods in the context of M2O CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to explore the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.

[2] L. C. Briand, W. L. Melo, and J. Wüst, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.

[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.

[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, The Netherlands, August 2009.

[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, USA, May 2013.

[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, p. 1, Cary, North Carolina, USA, November 2012.

[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference on Predictive Models in Software Engineering, pp. 1–10, Baltimore, Maryland, USA, October 2013.

[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.

[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.

[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.

[11] T. Gyimóthy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.

[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.

[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, USA, July 2015.

[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, Italy, July 2017.

[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.

[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.

[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.

[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.

[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.

[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.

[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, India, June 2014.

[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.

[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, Italy, September 2015.

[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, Italy, September 2015.

[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.

[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.

[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, China, September 2017.

[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, USA, October 2013.

[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.

[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.

[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.

[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.

[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, Czech Republic, July 2017.

[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.

[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.

[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.

[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.

[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.

[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, USA, May 2013.

[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.

[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.

[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.

[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, Hungary, September 2011.

[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.

[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, Germany, May 2008.

[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, Hungary, September 2011.

[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, Sweden, September 2012.

[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.

[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.

[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.

[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.

[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.


Page 6: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

6 Mathematical Problems in Engineering

Optimizing the parameter 120572Input(1) Candidate TDS 119878119904 = 1198681199041 1198681199042 119868119904119898 test set 119878119905 = 1198681199051 1198681199052 119868119905119897 (119898 gt 119897)(2) 119889119890119891119890119888119905119904 = 119889119890119891119890119888119905(1198681199041) 119889119890119891119890119888119905(1198681199042) 119889119890119891119890119888119905(119868119904119898) and 119896 = 10Output(3) 120572 (120572 isin [0 1])Method(4) Initialize 120572 = 0 119878119904(120572) = 0(5) While (120572 le 1) do(6) For 119894 = 1 119894 le 119897 119894 + +(7) For 119895 = 1 119895 le 119898 119895 + +(8) Score(119868119905119894 119868119904119895) = 120572 119878119894119898(119868119905119894 119868119904119895) + (1 minus 120572)119873 (119889119890119891119890119888119905(119868119904119895))(9) End For(10) descSort (119878119888119900119903119890(119868119905119894 119868119904119895) | 119895 = 1 sdot sdot sdot 119898) sort119898 training instances in descending order(11) 119878119904(120572) = 119878119904(120572) cup Top-119896 training instances select the top 119896 instances(12) End For(13) 119860119880119862 lArr997904 119878119904(120572) CPDP997888997888997888997888rarr 119878119905 prediction result(14) 120572 = 120572 + 01(15) EndWhile(16) Return (120572 | max120572119860119880119862)

Algorithm 1 Algorithm of parameter optimization

Table 1 Similarity indexes and normalization methods used in thispaper

Similarity

Cosine cos (119883 119884) = sum119899119896=1 119909119896119910119896radicsum119899119896=1 1199092119896radicsum119899119896=1 1199102119896

Euclidean distance 119889 (119883 119884) = radic 119899sum119896=1

(119909119896 minus 119910119896)2Manhattan distance 119889 (119883 119884) = 119899sum

119896=1

1003816100381610038161003816119909119896 minus 1199101198961003816100381610038161003816Normalization

Linear 119873(119909) = 119909 minus 119909min119909max minus 119909min

Logistic 119873(119909) = 11 + 119890minus119909 minus 05Square root 119873(119909) = 1 minus 1radic1 + 119909Logarithmic 119873(119909) = log10(119909 + 1)Inverse cotangent 119873(119909) = arctan(119909) lowast 2120587

Normalization is a commonly used data preprocessingtechnique for mathematics and computer science [36] Grafand Borer [37] have confirmed that normalization canimprove prediction performance of classificationmodels Forthis reason we normalize the defects of training instanceswhen using TDSelector As you know there are many nor-malization methods In this study we introduce five typicalnormalization methods used in machine learning [36 38]The description and formulas of the five normalizationmethods are listed in Table 1

For each test instance the top-119896 training instances rankedin terms of their scores will be returned Hence the final

TDS is composed by merging the sets of the top-119896 train-ing instances for each test instance when those duplicateinstances are removed

5 Experimental Setup

51 Research Questions Our experiments were conductedto find empirical evidence that answers the following threeresearch questions

RQ1 Does the Consideration of Defects Improve the Per-formance of CPDP Unlike the previous methods [1 5 712 29] TDSelector ranks candidate training instances interms of both defects and metric-based similarity To eval-uate the effectiveness of the proposed method consideringthe additional information of defects we tested TDSelectoraccording to the experimental data described in Section 52According to (2) we also empirically analyzed the impact ofthe parameter 120572 on prediction results

RQ2 Which Combination of Similarity and NormalizationIs More Suitable for TDSelector Equation (2) is comprisedof two parts namely similarity and the normalization ofdefects For each part several commonly usedmethods can beadopted in our context To fully take advantage of TDSelectorone would wonder which combination of similarity andnormalization should be chosen Therefore it is necessary tocompare the effects of different combinations of similarityand normalization methods on prediction results and todetermine the best one for TDSelector

RQ3 Can TDSelector-Based CPDP Outperform the Base-line Methods Cross-project prediction has attracted muchresearch interest in recent years and a few CPDP approachesusing training data selection have also been proposed eg

Mathematical Problems in Engineering 7

Peter filter based CPDP [5] (labeled as baseline1) and TCA+(Transfer Component Analysis) based CPDP [39] (labeledas baseline2) To answer the third question we comparedTDSelector-based CPDP proposed in this paper with theabove two state-of-the-art methods

52 Data Collection To evaluate the effectiveness of TDSe-lector in this paper we used 14 open-source projects writtenin Java on two online public software repositories namelyPROMISE [40] and AEEEM [41] The data statistics of the 14projects in question are presented in Table 2 where Instanceand Defect are the numbers of instances and defectiveinstances respectively and Defect is the proportion ofdefective instances to the total number of instances Eachinstance in these projects represents a file of object class andconsists of two parts namely software metrics and defects

The first repository PROMISE was collected by Jureczkoand Spinellis [40] The information of defects and 20 sourcecode metrics for the projects on PROMISE have been vali-dated and used in several previous studies [1 7 12 29] Thesecond repository AEEEM was collected by DrsquoAmbros etal [41] and each project on it has 76 metrics including 17source code metrics 15 change metrics 5 previous defectmetrics 5 entropy-of-change metrics 17 entropy-of-source-code metrics and 17 churn-of-source-code metrics AEEEMhas been successfully used in [23 39]

Before performing a cross-project prediction we need todetermine a target dataset (test set) and its candidate TDSFor PROMISE (10 projects) each one in the 10 projects wasselected to be the target dataset once and then we set up acandidate TDS for CPDP which excluded any data from thetarget project For instance if Ivy is selected as test projectdata from the other nine projects was used to construct itsinitial TDS

53 Experiment Design To answer the three research ques-tions our experimental procedure which is designed underthe context of M2O in the CPDP scenario is described asfollows

First as with many prior studies [1 5 15 35] all softwaremetric values in training and test sets were normalized byusing the119885-scoremethod because thesemetrics are differentin the scales of numerical values For the 14 projects onAEEEM and PROMISE their numbers of software metricsare different So the training set for a given test set wasselected from the same repository

Second to examine whether the consideration of defectsimproves the performance of CPDP we compared ourapproach TDSelector with NoD which is a baseline methodconsidering only the similarity between instances ie 120572 = 1in (2) Since there are three similarity computation methodsused in this paper we designed three different TDSelectorsand their corresponding baseline methods based on similar-ity indexesThe prediction results of eachmethod in questionfor the 15 test sets were analyzed in terms of mean valueand standard deviation More specifically we also used Cliff rsquosdelta (120575) [42] which is a nonparametric effect size measureof how often the values in one distribution are larger than the

values in a second distribution to compare the results gen-erated through our approach and its corresponding baselinemethod

Because Cliff did not suggest corresponding 120575 values torepresent small medium and large effects we convertedCohenrsquos 119889 effect size to Cliff rsquos 120575 using cohd2delta R package(httpsrdrriocranorddommancohd2deltahtml) Notethat Table 3 contains descriptors for magnitude of 119889 = 001to 20

Third according to the results of the second step of thisprocedure 15 combinations based on three typical similaritymethods for software metrics and five commonly usednormalization functions for defects were examined by thepairwise comparison method We then determined whichcombination is more suitable for our approach according tomean standard deviation and Cliff rsquos delta effect size

Fourth to further validate the effectiveness of the TDS-elector-based CPDP predictor we conducted cross-projectpredictions for all the 15 test sets using TDSelector and twocompeting methods (ie baseline1 and baseline2 introducedin Section 51) Note that the TDSelector used in this exper-iment was built with the best combination of similarity andnormalization

After this process is completed we will discuss the an-swers to the three research questions of our study

54 Classifier and Evaluation Measure As an underlyingmachine learning classifier for CPDP Logistic Regression(LR) which was widely used in many defect prediction liter-atures [4 23 39 43ndash46] is also used in this study All LR clas-sifiers were implemented withWeka (httpswwwcswaikatoacnzmlweka) For our experiments we used the defaultparameter setting for LR specified in Weka unless otherwisespecified

To evaluate the prediction performance of differentmeth-ods in this paper we utilized the area under a ReceiverOperating Characteristic curve (AUC) AUC is equal to theprobability that a classifier will identify a randomly chosendefective class higher than a randomly chosen defect-freeone [47] known as a useful measure for comparing differentmodels Compared with traditional accuracy measures AUCis commonly used because it is unaffected by class imbalanceand independent of the prediction threshold that is used todecide whether an instance should be classified as a negativeinstance [6 48 49] The AUC value of 05 indicates theperformance of a random predictor and higher AUC valuesindicate better prediction performance

6 Experimental Results

61 Answer to RQ1 We compared our approach consideringdefects with the baseline method NoD that selects trainingdata in terms of cosine similarity Table 5 shows that onaverage TDSelector does achieve an improvement in AUCvalue across the 15 test sets Obviously the average growthrates of AUC value vary from 59 to 90 when differentnormalization methods for defects were utilized In additionall the 120575 values in this table are greater than 02 which

8 Mathematical Problems in Engineering

Table 2 Data statistics of the projects used in our experiments

Repository Project Version Instance Defect Defect

PROMISE

Ant 17 745 166 223Camel 16 965 188 195Ivy 20 352 40 114Jedit 32 272 90 331

Lucene 24 340 203 597Poi 30 442 281 636

Synapse 12 256 86 336Velocity 14 196 147 750Xalan 26 885 411 464Xerces 14 588 437 743

AEEEM

Equinox 112005ndash6252008 324 129 398Eclipse JDT core (Eclipse) 112005ndash6172008 997 206 207Apache Lucene (Lucene2) 112005ndash1082008 692 20 29

Mylyn 1172005ndash3172009 1862 245 132Eclipse PDE UI (Pde) 112005ndash9112008 1497 209 140

Table 3 The mappings between different values and their effective-ness levels

Effect size d 120575Very small 001 0008Small 020 0147Medium 050 033Large 080 0474Very large 120 0622Huge 20 0811

indicates that each group of 15 prediction results obtained byour approach has a greater effect than that of NoD In otherwords our approach outperforms NoD In particular forJedit Velocity Eclipse andEquinox the improvements of ourapproach over NoD are substantial For example when usingthe linear normalizationmethod the AUC values for the fourprojects are increased by 306 430 226 and 394respectively moreover the logistic normalization methodfor Velocity achieves the biggest improvement in AUC value(namely 617)

We then compared TDSelector with the baseline methodsusing other widely used similarity calculation methods andthe results obtained by using Euclidean distance and Man-hattan distance to calculate the similarity between instancesare presented in Tables 6 and 7 TDSelector compared withthe corresponding NoD achieves the average growth ratesof AUC value that vary from 59 to 77 in Table 6 andfrom 27 to 69 in Table 7 respectively More specificallythe highest growth rate of AUC value in Table 6 is 436for Equinox and in Table 7 is 397 for Lucene2 Besides allCliff rsquos delta (120575) effect sizes in these two tables are also greaterthan 01 Hence the results indicate that our approach can onaverage improve the performance of those baseline methodswithout regard to defects

Table 4 Analyzing the factors similarity and normalization

Factor Method Mean Std 120575Similarity

Cosine similarity 0704 0082 minus0133Euclidean distance 0719 0080 -Manhattan distance 0682 0098 minus0193

Normalization

Linear 0706 0087 minus0012Logistic 0710 0078 -

Square root 0699 0091 minus0044Logarithmic 0700 0086 minus0064

Inverse cotangent 0696 0097 minus0056

In short during the process of training data selection theconsideration of defects for CPDP can help us to select higherquality training data thus leading to better classificationresults

62 Answer to RQ2 Although the inclusion of defects inthe selection of training data of quality is helpful for betterperformance of CPDP it is worthy to note that our methodcompletely failed in Mylyn and Pde when computing thesimilarity between instances in terms of Manhattan distance(see the corresponding maximum AUC values in Table 7)This implies that the success ofTDSelector depends largely onthe reasonable combination of similarity and normalizationmethods Therefore which combination of similarity andnormalization is more suitable for TDSelector

First we analyzed the two factors (ie similarity andnormalization) separately For example we evaluated thedifference among cosine similarity Euclidean distance andManhattan distance regardless of any normalization methodused in the experiment The results expressed in terms ofmean and standard deviation are shown in Table 4 wherethey are grouped by factors

Mathematical Problems in Engineering 9

Table5Th

ebestpredictio

nresults

obtained

bytheCP

DPapproach

basedon

TDSelec

torw

ithCosinesim

ilarityNoD

representsthebaselin

emetho

d+deno

testhe

grow

thrateof

AUC

valuethem

axim

umAU

Cvalueo

fdifferentn

ormalizationmetho

dsisun

derlinedeach

numbersho

wnin

bold

indicatesthatthe

correspo

ndingAU

Cvaluer

isesb

ymorethan10

Cosines

imilarity

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

07

09

09

1009

1009

1006

09

08

06

07

07

05

0338

AUC

0813

0676

0603

0793

0700

0611

0758

0741

0512

0742

0783

0760

0739

0705

0729

0711plusmn0

081

+(

)63

37

19

-30

6

-30

-43

0

03

226

394

41

59

40

90

Logistic

12057207

05

07

107

06

06

06

05

05

004

07

05

05

0351

AUC

0802

0674

0595

0793

0665

0621

0759

0765

0579

0745

0773

0738

0712

0707

0740

0711plusmn0

070

+(

)48

34

05

-24

116

31

32

617

07

210

355

03

62

56

90

Square

root

12057207

07

06

06

07

06

07

09

05

104

06

06

06

06

0249

AUC

0799

0654

0596

0807

0735

0626

0746

0762

0500

0740

0774

0560

0722

0700

0738

0697plusmn0

091

+(

)44

03

07

18

371

25

14

28

397

-210

28

17

53

53

69

Logarithm

ic120572

06

06

09

1007

1007

07

05

09

05

05

06

06

06

0351

AUC

0798

0662

0594

0793

0731

0611

0748

0744

0500

0758

0774

0700

0755

0702

0741

0707plusmn0

083

+(

)43

15

03

-36

4

-16

04

397

24

212

285

63

55

58

85

Inversec

otangent

12057207

1010

1007

1007

1006

07

007

07

07

07

0213

AUC

0798

0652

0592

0793

0659

0611

0749

0741

0500

0764

0773

0556

0739

0695

0734

0690plusmn0

092

+(

)43

--

-229

-18

-

397

32

210

21

41

44

48

59

NoD

(120572=1)

0765

0652

0592

0793

0536

0611

0736

0741

0358

0740

0639

0543

0709

0665

0701

0652plusmn0

113

10 Mathematical Problems in Engineering

Table6Th

ebestp

redictionresults

obtained

bytheC

PDPapproach

basedon

TDSelec

torw

ithEu

clidean

distance

Euclidean

distance

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

09

09

1009

09

08

1010

08

08

006

1008

08

0369

AUC

0795

0727

0598

0826

0793

0603

0714

0757

0545

0775

0773

0719

0722

0697

0744

0719plusmn0

080

+(

)13

68

-09

322

19

--

117

52

176

430

-11

96

77

Logistic

12057207

08

04

07

07

05

06

09

09

09

007

1010

09

0360

AUC

0787

0750

0603

0832

0766

0613

0716

0767

0556

0745

0773

0698

0722

0690

0730

0717plusmn0

075

+(

)03

101

08

16

277

35

03

13

139

11

176

388

--

75

72

Square

root

12057207

08

1007

08

06

07

07

07

1007

08

1010

09

0342

AUC

0796

0743

0598

0820

0720

0618

0735

0786

0564

0737

0774

0696

0722

0690

0750

0715plusmn0

076

+(

)14

91

-

01

200

44

29

38

156

-178

384

--

105

70

Logarithm

ic120572

07

08

1010

08

06

1010

09

09

09

08

1010

09

0324

AUC

0794

0746

0598

0819

0722

0607

0714

0757

0573

0739

0778

0722

0722

0690

0748

0715plusmn0

072

+(

)11

95

--

203

25

--

174

03

185

436

--

103

70

Inversec

otangent

12057208

09

06

08

08

07

1008

06

07

009

09

1009

0280

AUC

0796

0749

0603

0820

0701

0623

0714

0787

0538

0750

0773

0589

0763

0690

0722

0708plusmn0

084

+(

)14

100

08

01

168

52

-40

102

18

176

170

56

-64

59

NoD

(120572=1)

0785

0681

0598

0819

060

00592

0714

0757

0488

0737

0657

0503

0722

0690

0678

066

8plusmn0

096

Mathematical Problems in Engineering 11

Table7Th

ebestp

redictionresults

obtained

bytheC

PDPapproach

basedon

TDSelec

torw

ithManhatta

ndista

nce

Manhatta

ndistance

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

08

09

09

1009

09

1010

08

100

08

09

1010

0187

AUC

0804

0753

0599

0816

0689

0626

0695

0748

0500

0749

0773

0633

0692

0695

066

80696plusmn0

084

+(

)13

70

03

-73

63

--

78

-116

190

397

--

56

Logistic

12057207

07

08

08

08

07

07

09

06

07

009

09

1010

0249

AUC

0799

0760

0607

0830

0674

0621

0735

0794

0520

0756

0773

0680

0559

0695

066

80705plusmn0

084

+(

)06

80

17

17

50

54

58

61

121

09

116

279

127

--

69

Square

root

12057209

09

09

1008

08

09

08

09

100

100

1010

0164

AUC

0795

0755

060

40816

0693

0627

0704

0750

0510

0749

0773

0532

0523

0695

066

80680plusmn0

1+(

)01

72

12

-79

65

13

03

99

-116

-46

--

31

Logarithm

ic120572

1009

09

1009

1010

08

09

09

010

010

100116

AUC

0794

0755

0603

0816

066

40589

0695

0763

0524

0756

0773

0532

0523

0695

066

80677plusmn0

102

+(

)-

72

10

-34

--

20

129

09

116

-46

--

27

Inversec

otangent

12057210

09

09

09

09

08

09

1007

08

010

010

100133

AUC

0794

0749

0608

0821

0667

060

90710

0748

0500

0758

0773

0532

0523

0695

066

80677plusmn0

103

+(

)-

64

18

06

39

34

22

-78

12

116

-46

--

27

NoD

(120572=1)

0794

0704

0597

0816

064

20589

0695

0748

046

40749

0693

0532

0500

0695

066

80659plusmn0

105

12 Mathematical Problems in Engineering

Euclidean Logistic

Cosine

Linear

Manhattan+Logistic Euclidean

Linear

Cosine

Logistic

Manhattan+Logistic

(1) (2)

Figure 4 A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) andnormalization (see (2)) The selection priority is lowered along the direction of the arrow

If we do not take into account normalization Euclideandistance achieves the maximum mean value 0719 and theminimum standard deviation value 0080 among the threesimilarity indexes followed by cosine similarity ThereforeEuclidean distance and cosine similarity are the first andsecond choices of our approach respectively On the otherhand if we do not take into account similarity index thelogistic normalization method seems to be the most suitablemethod for TDSelector indicated by the maximum meanvalue 0710 and theminimum standard deviation value 0078and it is followed by the linear normalization method

Therefore the logistic normalization method is the pre-ferred way for TDSelector to normalize defects while thelinear normalizationmethod is a possible alternativemethodIt is worth noting that the evidence that all Cliff rsquos delta (120575)effect sizes in Table 4 are negative also supported the resultThen a simple guideline for choosing similarity indexes andnormalization methods for TDSelector from two differentaspects is presented in Figure 4

Then we considered both factors According to theresults in Tables 5 6 and 7 grouped by different similarityindexes TDSelector can obtain the best result 0719 plusmn 00800711 plusmn 0070 and 0705 plusmn 0084 when using ldquoEuclidean +Linearrdquo (short for Euclidean distance + linear normalization)ldquoCosine + Logisticrdquo (short for cosine similarity + logistic nor-malization) and ldquoManhattan + Logisticrdquo (short for Manhat-tan distance + logistic normalization) respectively We alsocalculated the value of Cliff rsquos delta (120575) effect size for every twocombinations under discussion As shown in Table 8 accord-ing to the largest number of positive 120575 values in this tablethe combination of Euclidean distance and the linear normal-ization method can still outperform the other 14 combina-tions

63 Answer to RQ3 A comparison between our approachand two baseline methods (ie baseline1 and baseline2)across the 15 test sets is presented in Table 9 It is obviousthat our approach is on average better than the two baselinemethods indicated by the average growth rates of AUC value(ie 106 and 43) across the 15 test sets The TDSelectorperforms better than baseline1 in 14 out of 15 datasets and ithas an advantage over baseline2 in 10 out of 15 datasets In

particular compared with baseline1 and baseline2 the high-est growth rates of AUC value of our approach reach up to652 and 647 respectively for Velocity We also analyzedthe possible reason in terms of the defective instances ofsimplified training dataset obtained from different methodsTable 10 shows that the proportion of defective instancesin each simplified training dataset is very close How-ever according to instances withmore than one defect amongthese defective instances our method can return more andthe ratio approximates to twice as large as that of the baselinesTherefore a possible explanation for the improvement is thatthe information about defects was more fully utilized due tothe instances with more defects The result further validatedthat the selection of training data considering defects isvaluable

Besides the negative 120575 values in this table also indicatethat our approach outperforms the baseline methods fromthe perspective of distribution though we have to admit thatthe effect size 0009 is too small to be of interest in a particularapplication

In summary since the TDSelector-based defect predictoroutperforms those based on the two state-of-the-art CPDPmethods our approach is beneficial for training data selectionand can further improve the performance of CPDP models

7 Discussion

71 Impact of Top-119896 on Prediction Results The parameter 119896determines the number of the nearest training instances ofeach test instance Since 119896 was set to 10 in our experimentshere we discuss the impact of 119896 on prediction results of ourapproach as its value is changed from 1 to 10 with a stepvalue of 1 As shown in Figure 5 for the three combinationsin question selecting the 119896-nearest training instances (eg119896 le 5) for each test instance in the 10 test sets fromPROMISEhowever does not lead to better prediction results becausetheir best results are obtained when 119896 is equal to 10

Interestingly for the combinations of ldquoEuclidean + Lin-earrdquo and ldquoCosine + Linearrdquo a similar trend of AUC valuechanges is visible in Figure 6 For the five test sets fromAEEEM they achieve stable prediction results when 119896 rangesfrom four to eight and then they reach peak performance

Mathematical Problems in Engineering 13

Table8Pairw

isecomparis

onsb

etweenag

iven

combinatio

nandeach

ofthe15combinatio

nsin

term

sofC

liffrsquosdelta

(120575)effectsize

Cosines

imilarity

Euclidean

distance

Manhatta

ndistance

Linear

Logistic

Square

root

Logarithm

icInverse

cotangent

Linear

Logistic

Square

root

Logarithm

icInverse

cotangent

Linear

Logistic

Square

root

Logarithm

icInverse

cotangent

Cosine+

Linear

-0018

0084

000

00116

minus0049

minus0036

minus0004

minus0013

minus0009

0138

0049

0164

0178

0169

Euclidean

+Linear

0049

0102

0111

0062

0164

-0036

004

00058

0089

0209

0102

0249

0276

0244

Manhatta

n+Lo

gistic

minus0049

minus0022

0022

minus0013

0111

minus0102

minus0076

minus0080

minus0049

minus0031

0053

-0124

0151

0147

14 Mathematical Problems in Engineering

Table 9 A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM The comparison isconducted based on the best prediction results of all the three methods in question

Test set Baseline1 Baseline2 Euclidean + Linear 120575Ant 0785 0803 13 minus10

Baseline1 vs TDSelector minus0409Xalan 0657 0675 107 77Camel 0595 0624 05 minus42Ivy 0789 0802 47 30Jedit 0694 0782 143 14Lucene 0608 0701 minus08 minus140Poi 0691 0789 33 minus95Synapse 0740 0748 23 12Velocity 0330 0331 652 647

Baseline2 vs TDSelector minus0009Xerces 0714 0753 85 29Eclipse 0706 0744 102 46Equinox 0587 0720 231 03Lucene2 0705 0724 25 minus02Mylyn 0631 0646 93 68Pde 0678 0737 104 15Avg 0663 0705 106 43

Table 10 Comparison with the defect instances of simplified train-ing dataset obtained from different methods on Velocity project

119889119890119891119890119888119905 119894119899119904119905119886119899119888119890119904119894119899119904119905119886119899119888119890119904 119894119899119904119905119886119899119888119890119904(119889119890119891119890119888119905119904 gt 1)

119889119890119891119890119888119905 119894119899119904119905119886119899119888119890119904Baseline1 0375 0247Baseline2 0393 0291TDSelector 0376 0487

k10987654321

AUC

09

08

07

06

05

Manhattan+logisticEuclidean+linearCosine+linear

Combination

Figure 5 The impact of 119896 on prediction results for the 10 test setsfrom PROMISE

The combination of "Manhattan + Logistic", by contrast, achieves its best result as k is set to 7. Even so, its best result is still worse than those of the other two combinations.

[Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM. The plot shows AUC (0.5–0.9) against k (1–10) for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear.]

7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of high-quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total number of instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as TDSelector-3.


[Figure 7: Percentage of defective instances with different numbers of bugs. (a) is shown from the viewpoint of a single dataset (release): for each of the 15 releases, the percentage of instances with defects < 2 and with defects < 3. (b) is shown from the viewpoint of the whole dataset used in our experiments: the percentage of instances with four or more defects (y-axis from 0 to 0.016).]

That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
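A minimal sketch of this two-stage selection is given below. It assumes the TDSelector score of (2), score = α · similarity + (1 − α) · normalized defects, has already been computed for every candidate; the identifiers and the global-ranking simplification are ours, not the exact implementation:

    import numpy as np

    def tdselector3(defects, scores, size, threshold=3):
        # Stage 1: instances with at least `threshold` bugs are taken directly.
        direct = [i for i, d in enumerate(defects) if d >= threshold]
        # Stage 2: the remaining slots are filled with the top-scored
        # candidates, where scores[i] follows (2).
        rest = [i for i in np.argsort(scores)[::-1] if i not in set(direct)]
        chosen = direct + rest[:max(0, size - len(direct))]
        # Merging the two parts removes redundant indices while keeping order.
        return list(dict.fromkeys(chosen))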

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly at the first stage according to a threshold for the number of bugs in each training instance (namely, three in this paper); our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.

7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study would change if we used different settings of the above factors.
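To illustrate the third point, the search over the weighting factor α in Algorithm 1 can be sketched as a simple grid search (our own reconstruction; build_tds and train_and_auc are hypothetical helpers standing in for the scoring/selection and the LR training/evaluation steps):

    import numpy as np

    best_alpha, best_auc = 0.0, -1.0
    for alpha in np.arange(0.0, 1.01, 0.1):  # step size 0.1, as in Algorithm 1
        tds = build_tds(alpha)               # hypothetical: score and select candidates
        auc = train_and_auc(tds)             # hypothetical: train LR, compute AUC
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    # A smaller step (e.g., 0.01) widens the search but multiplies the runtime.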

Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because only 15 datasets were collected from PROMISE and AEEEM. According to the criteria that were initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper belong to small (0.147 ≤ δ < 0.33) and very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
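For reference, Cliff's delta can be computed directly from its definition, δ = P(x > y) − P(x < y), estimated over all pairs; the sketch below uses the magnitude thresholds quoted above:

    def cliffs_delta(xs, ys):
        # delta = (#{x > y} - #{x < y}) / (len(xs) * len(ys))
        gt = sum(1 for x in xs for y in ys if x > y)
        lt = sum(1 for x in xs for y in ys if x < y)
        return (gt - lt) / (len(xs) * len(ys))

    def magnitude(delta):
        d = abs(delta)
        if d < 0.147: return "very small"  # 0.008 <= |delta| < 0.147
        if d < 0.330: return "small"       # 0.147 <= |delta| < 0.33
        if d < 0.474: return "medium"
        return "large"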

Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of the results of our study. All the 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five normalization methods when calculating the score of each candidate training instance.


[Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector. Three panels (Euclidean + Linear, Cosine + Linear, Manhattan + Logistic) plot AUC (0.50–0.85) for TDSelector and TDSelector-3 on each project (Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde); the last column in each panel represents the average AUC value.]

Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, in order to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.

[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.

[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.

[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.

[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.

[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, p. 1, Cary, North Carolina, November 2012.

[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.

[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.

[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.

[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.

[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.

[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.

[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.

[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.

[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.

[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.

[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.

[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.

[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.

[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.

[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.

[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.

[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.

[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.

[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.

[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.

[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.

[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.

[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.

[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.

[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.

[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.

[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.

[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.

[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.

[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.

[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, no. 7, pp. 277–282, Springer-Verlag, 2001.

[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.

[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.

[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.

[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.

[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.

[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.

[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.

[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.

[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.

[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.

[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.

[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.

[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.

[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.

[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.



Besides the negative 120575 values in this table also indicatethat our approach outperforms the baseline methods fromthe perspective of distribution though we have to admit thatthe effect size 0009 is too small to be of interest in a particularapplication

In summary since the TDSelector-based defect predictoroutperforms those based on the two state-of-the-art CPDPmethods our approach is beneficial for training data selectionand can further improve the performance of CPDP models

7 Discussion

71 Impact of Top-119896 on Prediction Results The parameter 119896determines the number of the nearest training instances ofeach test instance Since 119896 was set to 10 in our experimentshere we discuss the impact of 119896 on prediction results of ourapproach as its value is changed from 1 to 10 with a stepvalue of 1 As shown in Figure 5 for the three combinationsin question selecting the 119896-nearest training instances (eg119896 le 5) for each test instance in the 10 test sets fromPROMISEhowever does not lead to better prediction results becausetheir best results are obtained when 119896 is equal to 10

Interestingly for the combinations of ldquoEuclidean + Lin-earrdquo and ldquoCosine + Linearrdquo a similar trend of AUC valuechanges is visible in Figure 6 For the five test sets fromAEEEM they achieve stable prediction results when 119896 rangesfrom four to eight and then they reach peak performance

Mathematical Problems in Engineering 13

Table8Pairw

isecomparis

onsb

etweenag

iven

combinatio

nandeach

ofthe15combinatio

nsin

term

sofC

liffrsquosdelta

(120575)effectsize

Cosines

imilarity

Euclidean

distance

Manhatta

ndistance

Linear

Logistic

Square

root

Logarithm

icInverse

cotangent

Linear

Logistic

Square

root

Logarithm

icInverse

cotangent

Linear

Logistic

Square

root

Logarithm

icInverse

cotangent

Cosine+

Linear

-0018

0084

000

00116

minus0049

minus0036

minus0004

minus0013

minus0009

0138

0049

0164

0178

0169

Euclidean

+Linear

0049

0102

0111

0062

0164

-0036

004

00058

0089

0209

0102

0249

0276

0244

Manhatta

n+Lo

gistic

minus0049

minus0022

0022

minus0013

0111

minus0102

minus0076

minus0080

minus0049

minus0031

0053

-0124

0151

0147

14 Mathematical Problems in Engineering

Table 9 A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM The comparison isconducted based on the best prediction results of all the three methods in question

Test set Baseline1 Baseline2 Euclidean + Linear 120575Ant 0785 0803 13 minus10

Baseline1 vs TDSelector minus0409Xalan 0657 0675 107 77Camel 0595 0624 05 minus42Ivy 0789 0802 47 30Jedit 0694 0782 143 14Lucene 0608 0701 minus08 minus140Poi 0691 0789 33 minus95Synapse 0740 0748 23 12Velocity 0330 0331 652 647

Baseline2 vs TDSelector minus0009Xerces 0714 0753 85 29Eclipse 0706 0744 102 46Equinox 0587 0720 231 03Lucene2 0705 0724 25 minus02Mylyn 0631 0646 93 68Pde 0678 0737 104 15Avg 0663 0705 106 43

Table 10 Comparison with the defect instances of simplified train-ing dataset obtained from different methods on Velocity project

119889119890119891119890119888119905 119894119899119904119905119886119899119888119890119904119894119899119904119905119886119899119888119890119904 119894119899119904119905119886119899119888119890119904(119889119890119891119890119888119905119904 gt 1)

119889119890119891119890119888119905 119894119899119904119905119886119899119888119890119904Baseline1 0375 0247Baseline2 0393 0291TDSelector 0376 0487

k10987654321

AUC

09

08

07

06

05

Manhattan+logisticEuclidean+linearCosine+linear

Combination

Figure 5 The impact of 119896 on prediction results for the 10 test setsfrom PROMISE

when 119896 is equal to 10 The combination of ldquoManhattan +Logisticrdquo by contrast achieves the best result as 119896 is set to7 Even so the best result is still worse than those of the othertwo combinations

Manhattan+logisticEuclidean+linearCosine+linear

k10987654321

AUC

09

08

07

06

05

Figure 6 The impact of 119896 on prediction results for the 5 test setsfrom AEEEM

72 Selecting Instances with More Bugs Directly as TrainingData Our experimental results have validated the impactof defects on the selection of training data of quality interms of AUC and we also want to know whether thedirect selection of defective instances with more bugs astraining instances which simplifies the selection process andreduces computation cost would achieve better predictionperformance The result of this question is of particularconcern for developers in practice

According to Figure 7(a) for the 15 releases most ofthem contain instances with no more than two bugs On theother hand the ratio of the instances that have more thanthree defects to the total instances is less than 140 (seeFigure 7(b)) Therefore we built a new TDSelector based onthe number of bugs in each instance which is referred to as

Mathematical Problems in Engineering 15

0

02

04

06

08

1

12

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Perc

enta

ge(in

stanc

e)

release_id

defectslt2defectslt3

(a)

0

0002

0004

0006

0008

001

0012

0014

0016

4 14 24 34 44 54 64 74

Perc

enta

ge(in

stanc

e)

defects

(b)

Figure 7 Percentage of defective instances with different numbers of bugs (a) is shown from the viewpoint of a single dataset (release) while(b) is shown from the viewpoint of the whole dataset used in our experiments

TDSelector-3 That is to say those defective instances thathave at least three bugs were chosen directly from an initialTDS as training data while the remaining instances in theTDS were selected in light of (2) All instances from the twoparts then form the final TDS after removing redundant ones

Figure 8 shows that the results of the two methods differfrom dataset to dataset For Ivy and Xerces collected fromPROMISE TDSelector outperforms TDSelector-3 in all thethree scenarios but only slightly On the contrary for Luceneand Velocity from PROMISE the incremental AUC valuesobtained by using TDSelector-3 with ldquoCosine + Linearrdquoreach up to 0109 and 0235 respectively As shown inFigure 8 on average TDSelector-3 performs better than thecorresponding TDSelector and the average AUC values forldquoCosine + Linearrdquo ldquoEuclidean + Linearrdquo and ldquoManhattan +Logisticrdquo are improved by up to 326 257 and 142respectively Therefore the direct selection of defectiveinstances that contain quite a few bugs can overall furtherimprove the performance of the predictor trained by ourapproach In other words those valuable defective instancescan be screened out quickly according to a threshold for thenumber of bugs in each training instance (namely three inthis paper) at the first stage Our approach is then able to beapplied to the remaining TDS Note that the automaticoptimization method for such a threshold for TDSelector willbe investigated in our future work

73 Threats to Validity In this study we obtained severalinteresting results but potential threats to the validity of ourwork remain

Threats to internal validity concern any confounding fac-tor that may affect our results First the raw data used in thispaper were normalized by using the 119885-score method whilethe baseline method TCA+ provides four normalizationmethods [39] Second unlike TCA+ TDSelector does notintroduce any feature selection method to process software

metricsThird the weighting factor 120572 changes with a step size01 whenAlgorithm 1 calculates themaximum value of AUCThere is no doubt that a smaller step size will result in greatercalculation time Fourth we trained only one type of defectpredictor based on the default parameter settings configuredby the tool Weka because LR has been widely used in pre-vious studies Hence we are indeed aware that the resultsof our study would change if we use different settings of theabove three factors

Threats to statistical conclusion validity focus on whetherconclusions about the relationship among variables basedon the experimental data are correct or reasonable [50] Inaddition to mean value and standard deviation in this paperwe also utilized Cliff rsquos delta effect size instead of hypotheticaltest methods such as the KruskalndashWallis 119867 test [51] to com-pare the results of different methods because there are only15 datasets collected fromPROMISE andAEEEM Accordingto the criteria that were initially suggested by Cohen andexpanded by Sawilowsky [52] nearly all of the effect sizevalues in this paper belong to small (0147 le 120575 lt 033) andvery small (0008 le 120575 lt 0147) This indicates that there is nosignificant difference in AUC value between different com-binations in question though some perform better in termsof mean value and standard deviation However it is clearthat our method obviously performs better than baseline1indicated by |120575| = 0409 gt 033

Threats to external validity emphasize the generalizationof the obtained results First the selection of experimentaldatasetsmdashin addition to AEEEM and PROMISEmdashis themain threat to validate the results of our study All the 14projects used in this paper are written in Java and from theApache Software Foundation and the Eclipse FoundationAlthough our experiments can be repeated with more open-source projects written in other programming languagesand developed with different software metrics the empir-ical results may be different from our main conclusionsSecond we utilized only three similarity indexes and five

16 Mathematical Problems in Engineering

Euclidean+Linear

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

Cosine+Linear

TDSelectorTDSelector-3

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

TDSelectorTDSelector-3

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

TDSelectorTDSelector-3

Manhattan+Logistic

Figure 8 A comparison of prediction performance between TDSelector-3 and the corresponding TDSelectorThe last column in each of thethree plots represents the average AUC value

normalization methods when calculating the score of eachcandidate training instance Therefore the generalizability ofour method for other similarity indexes (such as PearsonCorrelation Coefficient and Mahalanobis distance [53]) andnormalizationmethods has yet to be testedThird to compareour method with TCA+ defect predictors used in this paperwere built using LR implying that the generalizability of ourmethod for other classification algorithms remains unclear

8 Conclusion and Future Work

This study aims to train better defect predictors by selectingthe most appropriate training data from those defect datasetsavailable on the Internet to improve the performance ofcross-project defect predictions In summary the study hasbeen conducted on 14 open-source projects and consists of(1) an empirical validation on the usability of the number of

defects that an instance includes for training data selection(2) an in-depth analysis of our method TDSelector withregard to similarity and normalization and (3) a comparisonbetween our proposed method and the benchmark methods

Compared with those similar previous studies the resultsof this study indicate that the inclusion of defects doesimprove the performance of CPDP predictors With a ratio-nal balance between the similarity of test instances withtraining instances and defects TDSelector can effectivelyselect appropriate training instances so that TDSelector-based defect predictors built by using LR achieve better pre-diction performance in terms of AUC More specifically thecombination of Euclidean distance and linear normalizationis the preferred way for TDSelector In addition our resultsalso demonstrate the effectiveness of the proposed methodaccording to a comparison with the baseline methods in thecontext of M2O in CPDP scenarios Hence we believe thatour approach can be helpful for developers when they are

Mathematical Problems in Engineering 17

required to build suitable predictors quickly for their newprojects because one of our interesting findings is that thosecandidate instances with more bugs can be chosen directly astraining instances

Our future work mainly includes two aspects On the onehandwe plan to validate the generalizability of our studywithmore defect data from projects written in different languagesOn the other hand we will focus on more effective hybridmethods based ondifferent selection strategies such as featureselection techniques [32] Last but not least we also plan todiscuss the possibility of considering not only the numberof defects but also time variables for training data selection(such as bug-fixing time)

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this article

Acknowledgments

The authors greatly appreciate Dr Nam and Dr Pan theauthors of [39] for providing them with the TCA sourceprogram and teaching them how to use itThis work was sup-ported by the Natural Science Foundation of Hubei province(no 2016CFB309) and the National Natural Science Founda-tion of China (nos 61272111 61273216 and 61572371)

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.

[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.

[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.

[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.

[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.

[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.

[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.

[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.

[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.

[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.

[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.

[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.

[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.

[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.

[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.

[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.

[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.

[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.

[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.

[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.

[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.

[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.

[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.

[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.

[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 1 page, 2017.

[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.

[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.

[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.

[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.

[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction - classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.

[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.

[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.

[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.

[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.

[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.

[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.

[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.

[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.

[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.

[40] M. Jureczko and D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.

[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.

[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.

[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.

[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.

[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.

[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.

[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.

[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.

[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.

[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.

[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.

[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.




Page 9: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

Mathematical Problems in Engineering 9

Table5Th

ebestpredictio

nresults

obtained

bytheCP

DPapproach

basedon

TDSelec

torw

ithCosinesim

ilarityNoD

representsthebaselin

emetho

d+deno

testhe

grow

thrateof

AUC

valuethem

axim

umAU

Cvalueo

fdifferentn

ormalizationmetho

dsisun

derlinedeach

numbersho

wnin

bold

indicatesthatthe

correspo

ndingAU

Cvaluer

isesb

ymorethan10

Cosines

imilarity

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

07

09

09

1009

1009

1006

09

08

06

07

07

05

0338

AUC

0813

0676

0603

0793

0700

0611

0758

0741

0512

0742

0783

0760

0739

0705

0729

0711plusmn0

081

+(

)63

37

19

-30

6

-30

-43

0

03

226

394

41

59

40

90

Logistic

12057207

05

07

107

06

06

06

05

05

004

07

05

05

0351

AUC

0802

0674

0595

0793

0665

0621

0759

0765

0579

0745

0773

0738

0712

0707

0740

0711plusmn0

070

+(

)48

34

05

-24

116

31

32

617

07

210

355

03

62

56

90

Square

root

12057207

07

06

06

07

06

07

09

05

104

06

06

06

06

0249

AUC

0799

0654

0596

0807

0735

0626

0746

0762

0500

0740

0774

0560

0722

0700

0738

0697plusmn0

091

+(

)44

03

07

18

371

25

14

28

397

-210

28

17

53

53

69

Logarithm

ic120572

06

06

09

1007

1007

07

05

09

05

05

06

06

06

0351

AUC

0798

0662

0594

0793

0731

0611

0748

0744

0500

0758

0774

0700

0755

0702

0741

0707plusmn0

083

+(

)43

15

03

-36

4

-16

04

397

24

212

285

63

55

58

85

Inversec

otangent

12057207

1010

1007

1007

1006

07

007

07

07

07

0213

AUC

0798

0652

0592

0793

0659

0611

0749

0741

0500

0764

0773

0556

0739

0695

0734

0690plusmn0

092

+(

)43

--

-229

-18

-

397

32

210

21

41

44

48

59

NoD

(120572=1)

0765

0652

0592

0793

0536

0611

0736

0741

0358

0740

0639

0543

0709

0665

0701

0652plusmn0

113

10 Mathematical Problems in Engineering

Table6Th

ebestp

redictionresults

obtained

bytheC

PDPapproach

basedon

TDSelec

torw

ithEu

clidean

distance

Euclidean

distance

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

09

09

1009

09

08

1010

08

08

006

1008

08

0369

AUC

0795

0727

0598

0826

0793

0603

0714

0757

0545

0775

0773

0719

0722

0697

0744

0719plusmn0

080

+(

)13

68

-09

322

19

--

117

52

176

430

-11

96

77

Logistic

12057207

08

04

07

07

05

06

09

09

09

007

1010

09

0360

AUC

0787

0750

0603

0832

0766

0613

0716

0767

0556

0745

0773

0698

0722

0690

0730

0717plusmn0

075

+(

)03

101

08

16

277

35

03

13

139

11

176

388

--

75

72

Square

root

12057207

08

1007

08

06

07

07

07

1007

08

1010

09

0342

AUC

0796

0743

0598

0820

0720

0618

0735

0786

0564

0737

0774

0696

0722

0690

0750

0715plusmn0

076

+(

)14

91

-

01

200

44

29

38

156

-178

384

--

105

70

Logarithm

ic120572

07

08

1010

08

06

1010

09

09

09

08

1010

09

0324

AUC

0794

0746

0598

0819

0722

0607

0714

0757

0573

0739

0778

0722

0722

0690

0748

0715plusmn0

072

+(

)11

95

--

203

25

--

174

03

185

436

--

103

70

Inversec

otangent

12057208

09

06

08

08

07

1008

06

07

009

09

1009

0280

AUC

0796

0749

0603

0820

0701

0623

0714

0787

0538

0750

0773

0589

0763

0690

0722

0708plusmn0

084

+(

)14

100

08

01

168

52

-40

102

18

176

170

56

-64

59

NoD

(120572=1)

0785

0681

0598

0819

060

00592

0714

0757

0488

0737

0657

0503

0722

0690

0678

066

8plusmn0

096

Mathematical Problems in Engineering 11

Table7Th

ebestp

redictionresults

obtained

bytheC

PDPapproach

basedon

TDSelec

torw

ithManhatta

ndista

nce

Manhatta

ndistance

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

08

09

09

1009

09

1010

08

100

08

09

1010

0187

AUC

0804

0753

0599

0816

0689

0626

0695

0748

0500

0749

0773

0633

0692

0695

066

80696plusmn0

084

+(

)13

70

03

-73

63

--

78

-116

190

397

--

56

Logistic

12057207

07

08

08

08

07

07

09

06

07

009

09

1010

0249

AUC

0799

0760

0607

0830

0674

0621

0735

0794

0520

0756

0773

0680

0559

0695

066

80705plusmn0

084

+(

)06

80

17

17

50

54

58

61

121

09

116

279

127

--

69

Square

root

12057209

09

09

1008

08

09

08

09

100

100

1010

0164

AUC

0795

0755

060

40816

0693

0627

0704

0750

0510

0749

0773

0532

0523

0695

066

80680plusmn0

1+(

)01

72

12

-79

65

13

03

99

-116

-46

--

31

Logarithm

ic120572

1009

09

1009

1010

08

09

09

010

010

100116

AUC

0794

0755

0603

0816

066

40589

0695

0763

0524

0756

0773

0532

0523

0695

066

80677plusmn0

102

+(

)-

72

10

-34

--

20

129

09

116

-46

--

27

Inversec

otangent

12057210

09

09

09

09

08

09

1007

08

010

010

100133

AUC

0794

0749

0608

0821

0667

060

90710

0748

0500

0758

0773

0532

0523

0695

066

80677plusmn0

103

+(

)-

64

18

06

39

34

22

-78

12

116

-46

--

27

NoD

(120572=1)

0794

0704

0597

0816

064

20589

0695

0748

046

40749

0693

0532

0500

0695

066

80659plusmn0

105

12 Mathematical Problems in Engineering

Euclidean Logistic

Cosine

Linear

Manhattan+Logistic Euclidean

Linear

Cosine

Logistic

Manhattan+Logistic

(1) (2)

Figure 4 A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) andnormalization (see (2)) The selection priority is lowered along the direction of the arrow

If we do not take into account normalization Euclideandistance achieves the maximum mean value 0719 and theminimum standard deviation value 0080 among the threesimilarity indexes followed by cosine similarity ThereforeEuclidean distance and cosine similarity are the first andsecond choices of our approach respectively On the otherhand if we do not take into account similarity index thelogistic normalization method seems to be the most suitablemethod for TDSelector indicated by the maximum meanvalue 0710 and theminimum standard deviation value 0078and it is followed by the linear normalization method

Therefore the logistic normalization method is the pre-ferred way for TDSelector to normalize defects while thelinear normalizationmethod is a possible alternativemethodIt is worth noting that the evidence that all Cliff rsquos delta (120575)effect sizes in Table 4 are negative also supported the resultThen a simple guideline for choosing similarity indexes andnormalization methods for TDSelector from two differentaspects is presented in Figure 4

Then we considered both factors According to theresults in Tables 5 6 and 7 grouped by different similarityindexes TDSelector can obtain the best result 0719 plusmn 00800711 plusmn 0070 and 0705 plusmn 0084 when using ldquoEuclidean +Linearrdquo (short for Euclidean distance + linear normalization)ldquoCosine + Logisticrdquo (short for cosine similarity + logistic nor-malization) and ldquoManhattan + Logisticrdquo (short for Manhat-tan distance + logistic normalization) respectively We alsocalculated the value of Cliff rsquos delta (120575) effect size for every twocombinations under discussion As shown in Table 8 accord-ing to the largest number of positive 120575 values in this tablethe combination of Euclidean distance and the linear normal-ization method can still outperform the other 14 combina-tions

6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. Our approach is, on average, better than both baseline methods, as indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets and has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained from the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized because of the instances with more defects. This result further validates that selecting training data with defects in mind is valuable.

Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.

In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.

7. Discussion

7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances selected for each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step of 1. As shown in Figure 5, for the three combinations in question, selecting only a few nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because the best results of all three combinations are obtained when k is equal to 10.
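A minimal sketch of this top-k selection, reusing the scoring function sketched earlier (the pooling and deduplication details are our assumption):

```python
def top_k_training_set(test_set, candidates, defect_counts, k=10, alpha=0.9):
    """Pool, for every test instance, the indices of its k
    highest-scoring candidates (k = 10 in the experiments),
    removing duplicates to form the simplified training set."""
    chosen = set()
    for t in test_set:
        scores = tdselector_scores(t, candidates, defect_counts, alpha)
        chosen.update(int(i) for i in scores.argsort()[::-1][:k])
    return sorted(chosen)
```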

Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then reach peak performance when k is equal to 10.


Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. In each row, the five values per group compare against that group's Linear, Logistic, Square root, Logarithmic, and Inverse cotangent variants; "-" marks the self-comparison.

Cosine + Linear: vs. Cosine similarity: -, 0.018, 0.084, 0.000, 0.116; vs. Euclidean distance: −0.049, −0.036, −0.004, −0.013, −0.009; vs. Manhattan distance: 0.138, 0.049, 0.164, 0.178, 0.169
Euclidean + Linear: vs. Cosine similarity: 0.049, 0.102, 0.111, 0.062, 0.164; vs. Euclidean distance: -, 0.036, 0.040, 0.058, 0.089; vs. Manhattan distance: 0.209, 0.102, 0.249, 0.276, 0.244
Manhattan + Logistic: vs. Cosine similarity: −0.049, −0.022, 0.022, −0.013, 0.111; vs. Euclidean distance: −0.102, −0.076, −0.080, −0.049, −0.031; vs. Manhattan distance: 0.053, -, 0.124, 0.151, 0.147


Table 9: A comparison between our approach and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question.

Test set: Baseline1 AUC, Baseline2 AUC, growth of "Euclidean + Linear" over baseline1 (%), growth over baseline2 (%)
Ant: 0.785, 0.803, 1.3, −1.0
Xalan: 0.657, 0.675, 10.7, 7.7
Camel: 0.595, 0.624, 0.5, −4.2
Ivy: 0.789, 0.802, 4.7, 3.0
Jedit: 0.694, 0.782, 14.3, 1.4
Lucene: 0.608, 0.701, −0.8, −14.0
Poi: 0.691, 0.789, 3.3, −9.5
Synapse: 0.740, 0.748, 2.3, 1.2
Velocity: 0.330, 0.331, 65.2, 64.7
Xerces: 0.714, 0.753, 8.5, 2.9
Eclipse: 0.706, 0.744, 10.2, 4.6
Equinox: 0.587, 0.720, 23.1, 0.3
Lucene2: 0.705, 0.724, 2.5, −0.2
Mylyn: 0.631, 0.646, 9.3, 6.8
Pde: 0.678, 0.737, 10.4, 1.5
Avg: 0.663, 0.705, 10.6, 4.3
Cliff's delta (δ): baseline1 vs. TDSelector = −0.409; baseline2 vs. TDSelector = −0.009

Table 10: Comparison of the defective instances of the simplified training datasets obtained from different methods on the Velocity project.

Method: defective instances / instances; instances (defects > 1) / defective instances
Baseline1: 0.375; 0.247
Baseline2: 0.393; 0.291
TDSelector: 0.376; 0.487

Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC against k = 1 to 10 for the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear").

The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.

Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC against k = 1 to 10 for the same three combinations).

7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC. We also want to know whether directly selecting the defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of instances that have more than three defects to the total number of instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as TDSelector-3.


Figure 7: Percentage of defective instances with different numbers of bugs. (a) is shown from the viewpoint of a single dataset (release), plotting for each of the 15 releases the proportions of instances with defects < 2 and defects < 3; (b) is shown from the viewpoint of the whole dataset used in our experiments, plotting the proportion of instances with four or more defects.

That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS, after removing redundant ones.
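The two-stage construction just described can be sketched as follows. The threshold of three bugs comes from the text; the function names and the reuse of the top-k routine sketched in Section 7.1 are our assumptions, not the paper's code.

```python
def tdselector3(test_set, candidates, defect_counts, threshold=3,
                k=10, alpha=0.9):
    """TDSelector-3: instances with at least `threshold` bugs enter
    the training set directly; the rest are ranked by the usual
    score, and the two parts are merged without duplicates."""
    direct = {i for i, d in enumerate(defect_counts) if d >= threshold}
    rest = [i for i in range(len(candidates)) if i not in direct]
    scored = top_k_training_set(test_set,
                                [candidates[i] for i in rest],
                                [defect_counts[i] for i in rest],
                                k, alpha)
    # Map the scored positions back to the original candidate indices.
    return sorted(direct | {rest[i] for i in scored})
```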

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly, according to a threshold for the number of bugs in each training instance (namely, three in this paper), at the first stage. Our approach is then applied to the remaining TDS. Note that an automatic optimization method for this threshold will be investigated in our future work.

7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings of the above factors.
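For reference, the Z-score method mentioned above is the standard transform; a minimal sketch:

```python
import statistics

def z_score(column):
    """Z-score normalization: center each metric on its mean and
    scale by its standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column) or 1.0  # guard constant columns
    return [(x - mu) / sigma for x in column]
```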

Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, our method clearly performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
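Cliff's delta itself is simple to compute; a sketch of the standard definition used in these comparisons:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: the difference between the probability that a
    value from xs exceeds one from ys and the reverse probability."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

For two identical samples it returns 0; values of |δ| below 0.147 fall in the "very small" band used above.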

Threats to external validity emphasize the generalizability of the obtained results. First, the selection of experimental datasets (beyond AEEEM and PROMISE) is the main threat to the validity of the results of our study. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five normalization methods when calculating the score of each candidate training instance.


Figure 8: A comparison of prediction performance (AUC) between TDSelector-3 and the corresponding TDSelector across Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, and Pde, for the combinations "Euclidean + Linear", "Cosine + Linear", and "Manhattan + Logistic". The last column in each of the three plots represents the average AUC value.

Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances to training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method in a comparison with the baseline methods in the context of M2O CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to explore the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.


[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, p. 1, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.


Page 10: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

10 Mathematical Problems in Engineering

Table6Th

ebestp

redictionresults

obtained

bytheC

PDPapproach

basedon

TDSelec

torw

ithEu

clidean

distance

Euclidean

distance

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

09

09

1009

09

08

1010

08

08

006

1008

08

0369

AUC

0795

0727

0598

0826

0793

0603

0714

0757

0545

0775

0773

0719

0722

0697

0744

0719plusmn0

080

+(

)13

68

-09

322

19

--

117

52

176

430

-11

96

77

Logistic

12057207

08

04

07

07

05

06

09

09

09

007

1010

09

0360

AUC

0787

0750

0603

0832

0766

0613

0716

0767

0556

0745

0773

0698

0722

0690

0730

0717plusmn0

075

+(

)03

101

08

16

277

35

03

13

139

11

176

388

--

75

72

Square

root

12057207

08

1007

08

06

07

07

07

1007

08

1010

09

0342

AUC

0796

0743

0598

0820

0720

0618

0735

0786

0564

0737

0774

0696

0722

0690

0750

0715plusmn0

076

+(

)14

91

-

01

200

44

29

38

156

-178

384

--

105

70

Logarithm

ic120572

07

08

1010

08

06

1010

09

09

09

08

1010

09

0324

AUC

0794

0746

0598

0819

0722

0607

0714

0757

0573

0739

0778

0722

0722

0690

0748

0715plusmn0

072

+(

)11

95

--

203

25

--

174

03

185

436

--

103

70

Inversec

otangent

12057208

09

06

08

08

07

1008

06

07

009

09

1009

0280

AUC

0796

0749

0603

0820

0701

0623

0714

0787

0538

0750

0773

0589

0763

0690

0722

0708plusmn0

084

+(

)14

100

08

01

168

52

-40

102

18

176

170

56

-64

59

NoD

(120572=1)

0785

0681

0598

0819

060

00592

0714

0757

0488

0737

0657

0503

0722

0690

0678

066

8plusmn0

096

Mathematical Problems in Engineering 11

Table7Th

ebestp

redictionresults

obtained

bytheC

PDPapproach

basedon

TDSelec

torw

ithManhatta

ndista

nce

Manhatta

ndistance

Ant

Xalan

Camel

Ivy

Jedit

Lucene

Poi

Synapse

Velocity

Xerces

Eclip

seEq

uino

xLu

cene2

Mylyn

Pde

MeanplusmnSt

d120575

Linear 120572

08

09

09

1009

09

1010

08

100

08

09

1010

0187

AUC

0804

0753

0599

0816

0689

0626

0695

0748

0500

0749

0773

0633

0692

0695

066

80696plusmn0

084

+(

)13

70

03

-73

63

--

78

-116

190

397

--

56

Logistic

12057207

07

08

08

08

07

07

09

06

07

009

09

1010

0249

AUC

0799

0760

0607

0830

0674

0621

0735

0794

0520

0756

0773

0680

0559

0695

066

80705plusmn0

084

+(

)06

80

17

17

50

54

58

61

121

09

116

279

127

--

69

Square

root

12057209

09

09

1008

08

09

08

09

100

100

1010

0164

AUC

0795

0755

060

40816

0693

0627

0704

0750

0510

0749

0773

0532

0523

0695

066

80680plusmn0

1+(

)01

72

12

-79

65

13

03

99

-116

-46

--

31

Logarithm

ic120572

1009

09

1009

1010

08

09

09

010

010

100116

AUC

0794

0755

0603

0816

066

40589

0695

0763

0524

0756

0773

0532

0523

0695

066

80677plusmn0

102

+(

)-

72

10

-34

--

20

129

09

116

-46

--

27

Inversec

otangent

12057210

09

09

09

09

08

09

1007

08

010

010

100133

AUC

0794

0749

0608

0821

0667

060

90710

0748

0500

0758

0773

0532

0523

0695

066

80677plusmn0

103

+(

)-

64

18

06

39

34

22

-78

12

116

-46

--

27

NoD

(120572=1)

0794

0704

0597

0816

064

20589

0695

0748

046

40749

0693

0532

0500

0695

066

80659plusmn0

105

12 Mathematical Problems in Engineering

Euclidean Logistic

Cosine

Linear

Manhattan+Logistic Euclidean

Linear

Cosine

Logistic

Manhattan+Logistic

(1) (2)

Figure 4 A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) andnormalization (see (2)) The selection priority is lowered along the direction of the arrow

If we do not take into account normalization Euclideandistance achieves the maximum mean value 0719 and theminimum standard deviation value 0080 among the threesimilarity indexes followed by cosine similarity ThereforeEuclidean distance and cosine similarity are the first andsecond choices of our approach respectively On the otherhand if we do not take into account similarity index thelogistic normalization method seems to be the most suitablemethod for TDSelector indicated by the maximum meanvalue 0710 and theminimum standard deviation value 0078and it is followed by the linear normalization method

Therefore the logistic normalization method is the pre-ferred way for TDSelector to normalize defects while thelinear normalizationmethod is a possible alternativemethodIt is worth noting that the evidence that all Cliff rsquos delta (120575)effect sizes in Table 4 are negative also supported the resultThen a simple guideline for choosing similarity indexes andnormalization methods for TDSelector from two differentaspects is presented in Figure 4

Then we considered both factors According to theresults in Tables 5 6 and 7 grouped by different similarityindexes TDSelector can obtain the best result 0719 plusmn 00800711 plusmn 0070 and 0705 plusmn 0084 when using ldquoEuclidean +Linearrdquo (short for Euclidean distance + linear normalization)ldquoCosine + Logisticrdquo (short for cosine similarity + logistic nor-malization) and ldquoManhattan + Logisticrdquo (short for Manhat-tan distance + logistic normalization) respectively We alsocalculated the value of Cliff rsquos delta (120575) effect size for every twocombinations under discussion As shown in Table 8 accord-ing to the largest number of positive 120575 values in this tablethe combination of Euclidean distance and the linear normal-ization method can still outperform the other 14 combina-tions

63 Answer to RQ3 A comparison between our approachand two baseline methods (ie baseline1 and baseline2)across the 15 test sets is presented in Table 9 It is obviousthat our approach is on average better than the two baselinemethods indicated by the average growth rates of AUC value(ie 106 and 43) across the 15 test sets The TDSelectorperforms better than baseline1 in 14 out of 15 datasets and ithas an advantage over baseline2 in 10 out of 15 datasets In

particular compared with baseline1 and baseline2 the high-est growth rates of AUC value of our approach reach up to652 and 647 respectively for Velocity We also analyzedthe possible reason in terms of the defective instances ofsimplified training dataset obtained from different methodsTable 10 shows that the proportion of defective instancesin each simplified training dataset is very close How-ever according to instances withmore than one defect amongthese defective instances our method can return more andthe ratio approximates to twice as large as that of the baselinesTherefore a possible explanation for the improvement is thatthe information about defects was more fully utilized due tothe instances with more defects The result further validatedthat the selection of training data considering defects isvaluable

Besides the negative 120575 values in this table also indicatethat our approach outperforms the baseline methods fromthe perspective of distribution though we have to admit thatthe effect size 0009 is too small to be of interest in a particularapplication

In summary since the TDSelector-based defect predictoroutperforms those based on the two state-of-the-art CPDPmethods our approach is beneficial for training data selectionand can further improve the performance of CPDP models

7 Discussion

71 Impact of Top-119896 on Prediction Results The parameter 119896determines the number of the nearest training instances ofeach test instance Since 119896 was set to 10 in our experimentshere we discuss the impact of 119896 on prediction results of ourapproach as its value is changed from 1 to 10 with a stepvalue of 1 As shown in Figure 5 for the three combinationsin question selecting the 119896-nearest training instances (eg119896 le 5) for each test instance in the 10 test sets fromPROMISEhowever does not lead to better prediction results becausetheir best results are obtained when 119896 is equal to 10

Interestingly for the combinations of ldquoEuclidean + Lin-earrdquo and ldquoCosine + Linearrdquo a similar trend of AUC valuechanges is visible in Figure 6 For the five test sets fromAEEEM they achieve stable prediction results when 119896 rangesfrom four to eight and then they reach peak performance

Mathematical Problems in Engineering 13

Table8Pairw

isecomparis

onsb

etweenag

iven

combinatio

nandeach

ofthe15combinatio

nsin

term

sofC

liffrsquosdelta

(120575)effectsize

Cosines

imilarity

Euclidean

distance

Manhatta

ndistance

Linear

Logistic

Square

root

Logarithm

icInverse

cotangent

Linear

Logistic

Square

root

Logarithm

icInverse

cotangent

Linear

Logistic

Square

root

Logarithm

icInverse

cotangent

Cosine+

Linear

-0018

0084

000

00116

minus0049

minus0036

minus0004

minus0013

minus0009

0138

0049

0164

0178

0169

Euclidean

+Linear

0049

0102

0111

0062

0164

-0036

004

00058

0089

0209

0102

0249

0276

0244

Manhatta

n+Lo

gistic

minus0049

minus0022

0022

minus0013

0111

minus0102

minus0076

minus0080

minus0049

minus0031

0053

-0124

0151

0147

14 Mathematical Problems in Engineering

Table 9 A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM The comparison isconducted based on the best prediction results of all the three methods in question

Test set Baseline1 Baseline2 Euclidean + Linear 120575Ant 0785 0803 13 minus10

Baseline1 vs TDSelector minus0409Xalan 0657 0675 107 77Camel 0595 0624 05 minus42Ivy 0789 0802 47 30Jedit 0694 0782 143 14Lucene 0608 0701 minus08 minus140Poi 0691 0789 33 minus95Synapse 0740 0748 23 12Velocity 0330 0331 652 647

Baseline2 vs TDSelector minus0009Xerces 0714 0753 85 29Eclipse 0706 0744 102 46Equinox 0587 0720 231 03Lucene2 0705 0724 25 minus02Mylyn 0631 0646 93 68Pde 0678 0737 104 15Avg 0663 0705 106 43

Table 10 Comparison with the defect instances of simplified train-ing dataset obtained from different methods on Velocity project

119889119890119891119890119888119905 119894119899119904119905119886119899119888119890119904119894119899119904119905119886119899119888119890119904 119894119899119904119905119886119899119888119890119904(119889119890119891119890119888119905119904 gt 1)

119889119890119891119890119888119905 119894119899119904119905119886119899119888119890119904Baseline1 0375 0247Baseline2 0393 0291TDSelector 0376 0487

k10987654321

AUC

09

08

07

06

05

Manhattan+logisticEuclidean+linearCosine+linear

Combination

Figure 5 The impact of 119896 on prediction results for the 10 test setsfrom PROMISE

when 119896 is equal to 10 The combination of ldquoManhattan +Logisticrdquo by contrast achieves the best result as 119896 is set to7 Even so the best result is still worse than those of the othertwo combinations

Manhattan+logisticEuclidean+linearCosine+linear

k10987654321

AUC

09

08

07

06

05

Figure 6 The impact of 119896 on prediction results for the 5 test setsfrom AEEEM

72 Selecting Instances with More Bugs Directly as TrainingData Our experimental results have validated the impactof defects on the selection of training data of quality interms of AUC and we also want to know whether thedirect selection of defective instances with more bugs astraining instances which simplifies the selection process andreduces computation cost would achieve better predictionperformance The result of this question is of particularconcern for developers in practice

According to Figure 7(a) for the 15 releases most ofthem contain instances with no more than two bugs On theother hand the ratio of the instances that have more thanthree defects to the total instances is less than 140 (seeFigure 7(b)) Therefore we built a new TDSelector based onthe number of bugs in each instance which is referred to as

Mathematical Problems in Engineering 15

0

02

04

06

08

1

12

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Perc

enta

ge(in

stanc

e)

release_id

defectslt2defectslt3

(a)

0

0002

0004

0006

0008

001

0012

0014

0016

4 14 24 34 44 54 64 74

Perc

enta

ge(in

stanc

e)

defects

(b)

Figure 7 Percentage of defective instances with different numbers of bugs (a) is shown from the viewpoint of a single dataset (release) while(b) is shown from the viewpoint of the whole dataset used in our experiments

TDSelector-3 That is to say those defective instances thathave at least three bugs were chosen directly from an initialTDS as training data while the remaining instances in theTDS were selected in light of (2) All instances from the twoparts then form the final TDS after removing redundant ones

Figure 8 shows that the results of the two methods differfrom dataset to dataset For Ivy and Xerces collected fromPROMISE TDSelector outperforms TDSelector-3 in all thethree scenarios but only slightly On the contrary for Luceneand Velocity from PROMISE the incremental AUC valuesobtained by using TDSelector-3 with ldquoCosine + Linearrdquoreach up to 0109 and 0235 respectively As shown inFigure 8 on average TDSelector-3 performs better than thecorresponding TDSelector and the average AUC values forldquoCosine + Linearrdquo ldquoEuclidean + Linearrdquo and ldquoManhattan +Logisticrdquo are improved by up to 326 257 and 142respectively Therefore the direct selection of defectiveinstances that contain quite a few bugs can overall furtherimprove the performance of the predictor trained by ourapproach In other words those valuable defective instancescan be screened out quickly according to a threshold for thenumber of bugs in each training instance (namely three inthis paper) at the first stage Our approach is then able to beapplied to the remaining TDS Note that the automaticoptimization method for such a threshold for TDSelector willbe investigated in our future work

73 Threats to Validity In this study we obtained severalinteresting results but potential threats to the validity of ourwork remain

Threats to internal validity concern any confounding fac-tor that may affect our results First the raw data used in thispaper were normalized by using the 119885-score method whilethe baseline method TCA+ provides four normalizationmethods [39] Second unlike TCA+ TDSelector does notintroduce any feature selection method to process software

metricsThird the weighting factor 120572 changes with a step size01 whenAlgorithm 1 calculates themaximum value of AUCThere is no doubt that a smaller step size will result in greatercalculation time Fourth we trained only one type of defectpredictor based on the default parameter settings configuredby the tool Weka because LR has been widely used in pre-vious studies Hence we are indeed aware that the resultsof our study would change if we use different settings of theabove three factors

Threats to statistical conclusion validity focus on whetherconclusions about the relationship among variables basedon the experimental data are correct or reasonable [50] Inaddition to mean value and standard deviation in this paperwe also utilized Cliff rsquos delta effect size instead of hypotheticaltest methods such as the KruskalndashWallis 119867 test [51] to com-pare the results of different methods because there are only15 datasets collected fromPROMISE andAEEEM Accordingto the criteria that were initially suggested by Cohen andexpanded by Sawilowsky [52] nearly all of the effect sizevalues in this paper belong to small (0147 le 120575 lt 033) andvery small (0008 le 120575 lt 0147) This indicates that there is nosignificant difference in AUC value between different com-binations in question though some perform better in termsof mean value and standard deviation However it is clearthat our method obviously performs better than baseline1indicated by |120575| = 0409 gt 033

Threats to external validity emphasize the generalizationof the obtained results First the selection of experimentaldatasetsmdashin addition to AEEEM and PROMISEmdashis themain threat to validate the results of our study All the 14projects used in this paper are written in Java and from theApache Software Foundation and the Eclipse FoundationAlthough our experiments can be repeated with more open-source projects written in other programming languagesand developed with different software metrics the empir-ical results may be different from our main conclusionsSecond we utilized only three similarity indexes and five

16 Mathematical Problems in Engineering

Euclidean+Linear

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

Cosine+Linear

TDSelectorTDSelector-3

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

TDSelectorTDSelector-3

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

TDSelectorTDSelector-3

Manhattan+Logistic

Figure 8 A comparison of prediction performance between TDSelector-3 and the corresponding TDSelectorThe last column in each of thethree plots represents the average AUC value

normalization methods when calculating the score of eachcandidate training instance Therefore the generalizability ofour method for other similarity indexes (such as PearsonCorrelation Coefficient and Mahalanobis distance [53]) andnormalizationmethods has yet to be testedThird to compareour method with TCA+ defect predictors used in this paperwere built using LR implying that the generalizability of ourmethod for other classification algorithms remains unclear

8 Conclusion and Future Work

This study aims to train better defect predictors by selectingthe most appropriate training data from those defect datasetsavailable on the Internet to improve the performance ofcross-project defect predictions In summary the study hasbeen conducted on 14 open-source projects and consists of(1) an empirical validation on the usability of the number of

defects that an instance includes for training data selection(2) an in-depth analysis of our method TDSelector withregard to similarity and normalization and (3) a comparisonbetween our proposed method and the benchmark methods

Compared with those similar previous studies the resultsof this study indicate that the inclusion of defects doesimprove the performance of CPDP predictors With a ratio-nal balance between the similarity of test instances withtraining instances and defects TDSelector can effectivelyselect appropriate training instances so that TDSelector-based defect predictors built by using LR achieve better pre-diction performance in terms of AUC More specifically thecombination of Euclidean distance and linear normalizationis the preferred way for TDSelector In addition our resultsalso demonstrate the effectiveness of the proposed methodaccording to a comparison with the baseline methods in thecontext of M2O in CPDP scenarios Hence we believe thatour approach can be helpful for developers when they are

Mathematical Problems in Engineering 17

required to build suitable predictors quickly for their newprojects because one of our interesting findings is that thosecandidate instances with more bugs can be chosen directly astraining instances

Our future work mainly includes two aspects On the onehandwe plan to validate the generalizability of our studywithmore defect data from projects written in different languagesOn the other hand we will focus on more effective hybridmethods based ondifferent selection strategies such as featureselection techniques [32] Last but not least we also plan todiscuss the possibility of considering not only the numberof defects but also time variables for training data selection(such as bug-fixing time)

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this article

Acknowledgments

The authors greatly appreciate Dr Nam and Dr Pan theauthors of [39] for providing them with the TCA sourceprogram and teaching them how to use itThis work was sup-ported by the Natural Science Foundation of Hubei province(no 2016CFB309) and the National Natural Science Founda-tion of China (nos 61272111 61273216 and 61572371)

References

[1] Z He F Shu Y Yang M Li and Q Wang ldquoAn investigationon the feasibility of cross-project defect predictionrdquo AutomatedSoftware Engineering vol 19 no 2 pp 167ndash199 2012

[2] L C Briand W L Melo and J Wust ldquoAssessing the applica-bility of fault-proneness models across object-oriented softwareprojectsrdquo IEEETransactions on Software Engineering vol 28 no7 pp 706ndash720 2002

[3] Y Ma G Luo X Zeng and A Chen ldquoTransfer learning forcross-company software defect predictionrdquo Information andSoftware Technology vol 54 no 3 pp 248ndash256 2012

[4] T Zimmermann N Nagappan H Gall E Giger and BMurphy ldquoCross-project defect prediction A large scale exper-iment on data vs domain vs processrdquo in Proceedings of theJoint 12th European Software Engineering Conference and 17thACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESEC-FSErsquo09 pp 91ndash100 nld August 2009

[5] F Peters T Menzies and A Marcus ldquoBetter cross companydefect predictionrdquo in Proceedings of the 10th InternationalWorking Conference onMining Software Repositories MSR 2013pp 409ndash418 usa May 2013

[6] F RahmanD Posnett andPDevanbu ldquoRecalling the ldquoimpreci-sionrdquo of cross-project defect predictionrdquo inProceedings of the theACM SIGSOFT 20th International Symposium p 1 Cary NorthCarolina November 2012

[7] SHerbold ldquoTraining data selection for cross-project defect pre-dictionrdquo in Proceedings of the the 9th International Conferencepp 1ndash10 Baltimore Maryland October 2013

[8] T M Khoshgoftaar E B Allen R Halstead G P Trio and RM Flass ldquoUsing process history to predict software qualityrdquoTheComputer Journal vol 31 no 4 pp 66ndash72 1998

[9] T J Ostrand and E J Weyuker ldquoThe distribution of faults ina large industrial software systemrdquo ACM SIGSOFT SoftwareEngineering Notes vol 27 no 4 p 55 2002

[10] S Kim T Zimmermann E J Whitehead Jr and A ZellerldquoPredicting faults from cached historyrdquo in Proceedings of the29th International Conference on Software Engineering (ICSErsquo07) pp 489ndash498 IEEE Computer Society Washington DCUSA May 2007

[11] T Gyimothy R Ferenc and I Siket ldquoEmpirical validationof object-oriented metrics on open source software for faultpredictionrdquo IEEE Transactions on Software Engineering vol 31no 10 pp 897ndash910 2005

[12] B Turhan T Menzies A B Bener and J Di Stefano ldquoOn therelative value of cross-company and within-company data fordefect predictionrdquo Empirical Software Engineering vol 14 no5 pp 540ndash578 2009

[13] M Chen and Y Ma ldquoAn empirical study on predicting defectnumbersrdquo in Proceedings of the 27th International Conferenceon Software Engineering andKnowledge Engineering SEKE 2015pp 397ndash402 usa July 2015

[14] C Ni W Liu Q Gu X Chen and D Chen ldquoFeSCH A Fea-ture Selection Method using Clusters of Hybrid-data forCross-Project Defect Predictionrdquo in Proceedings of the 41stIEEE Annual Computer Software and Applications ConferenceCOMPSAC 2017 pp 51ndash56 ita July 2017

[15] P He B Li X Liu J Chen and Y Ma ldquoAn empirical studyon software defect prediction with a simplified metric setrdquoInformation and Software Technology vol 59 pp 170ndash190 2015

[16] T Wang Z Zhang X Jing and L Zhang ldquoMultiple kernelensemble learning for software defect predictionrdquo AutomatedSoftware Engineering vol 23 no 4 pp 1ndash22 2015

[17] J Y He Z P Meng X Chen Z Wang and X Y Fan ldquoSemi-supervised ensemble learning approach for cross-project defectpredictionrdquo Journal of Software Ruanjian Xuebao vol 28 no 6pp 1455ndash1473 2017

[18] D Ryu J-I Jang and J Baik ldquoA transfer cost-sensitive boostingapproach for cross-project defect predictionrdquo Software QualityJournal vol 25 no 1 pp 1ndash38 2015

[19] D Ryu and J Baik ldquoEffective multi-objective naıve Bayes learn-ing for cross-project defect predictionrdquoApplied Soft Computingvol 49 pp 1062ndash1077 2016

[20] Y Li Z Huang Y Wang and B Fang ldquoEvaluating Data Filteron Cross-Project Defect Prediction Comparison and Improve-mentsrdquo IEEE Access vol 5 pp 25646ndash25656 2017

[21] F Zhang A Mockus I Keivanloo and Y Zou ldquoTowards build-ing a universal defect prediction modelrdquo in Proceedings of the11th International Working Conference on Mining Software Re-positories MSR 2014 pp 182ndash191 ind June 2014

[22] P He B Li and Y Ma Towards Cross-Project Defect Predictionwith Imbalanced Feature Sets 2014

[23] J Nam and S Kim ldquoHeterogeneous defect predictionrdquo in Pro-ceedings of the 10th Joint Meeting of the European SoftwareEngineering Conference and the ACM SIGSOFT Symposium onthe Foundations of Software Engineering ESECFSE 2015 pp508ndash519 ita September 2015

[24] X Jing FWu X Dong F Qi and B Xu ldquoHeterogeneous cross-company defect prediction by unifiedmetric representation andCCA-based transfer learningrdquo in Proceedings of the 10th JointMeeting of the European Software Engineering Conference andthe ACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESECFSE 2015 pp 496ndash507 ita September 2015

18 Mathematical Problems in Engineering

[25] Z Li X-Y Jing X Zhu H Zhang B Xu and S Ying ldquoOnthe Multiple Sources and Privacy Preservation Issues for Het-erogeneous Defect Predictionrdquo IEEE Transactions on SoftwareEngineering 1 page 2017

[26] Z Li X-Y Jing F Wu X Zhu B Xu and S Ying ldquoCost-sensitive transfer kernel canonical correlation analysis for het-erogeneous defect predictionrdquoAutomated Software Engineeringpp 1ndash45 2017

[27] Z Li X Jing X Zhu and H Zhang ldquoHeterogeneous DefectPrediction Through Multiple Kernel Learning and EnsembleLearningrdquo in Proceedings of the 2017 IEEE International Confer-ence on Software Maintenance and Evolution (ICSME) pp 91ndash102 Shanghai September 2017




Table 7: The best prediction results (AUC) obtained by the CPDP approach based on TDSelector with Manhattan distance.

Normalization       Ant    Xalan  Camel  Ivy    Jedit  Lucene Poi    Synapse Velocity Xerces Eclipse Equinox Lucene2 Mylyn  Pde    Mean ± Std
Linear              0.804  0.753  0.599  0.816  0.689  0.626  0.695  0.748   0.500    0.749  0.773   0.633   0.692   0.695  0.668  0.696 ± 0.084
Logistic            0.799  0.760  0.607  0.830  0.674  0.621  0.735  0.794   0.520    0.756  0.773   0.680   0.559   0.695  0.668  0.705 ± 0.084
Square root         0.795  0.755  0.604  0.816  0.693  0.627  0.704  0.750   0.510    0.749  0.773   0.532   0.523   0.695  0.668  0.680 ± 0.100
Logarithmic         0.794  0.755  0.603  0.816  0.664  0.589  0.695  0.763   0.524    0.756  0.773   0.532   0.523   0.695  0.668  0.677 ± 0.102
Inverse cotangent   0.794  0.749  0.608  0.821  0.667  0.609  0.710  0.748   0.500    0.758  0.773   0.532   0.523   0.695  0.668  0.677 ± 0.103
NoD (α = 1)         0.794  0.704  0.597  0.816  0.642  0.589  0.695  0.748   0.464    0.749  0.693   0.532   0.500   0.695  0.668  0.659 ± 0.105


Figure 4: A guideline for choosing suitable similarity indexes and normalization methods, from the two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow: (1) Euclidean, then Cosine, then Manhattan + Logistic; (2) Logistic, then Linear, then Manhattan + Logistic.

If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), and it is followed by the linear normalization method.

Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two different aspects is presented in Figure 4.
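Since the five normalization methods are compared repeatedly here, a small sketch may help. The exact formulas are defined earlier in the paper, so the concrete forms below (in particular the logistic and inverse-cotangent variants and the min-max bounds) are illustrative assumptions rather than the authors' exact equations:

    import math

    # Hedged sketch: plausible [0, 1] normalizations of a defect count x.
    # The exact definitions used by TDSelector are given earlier in the
    # paper; these forms are assumptions for illustration only.
    def normalize_defects(x, method, x_min=0.0, x_max=100.0):
        if method == "linear":             # min-max scaling
            return (x - x_min) / (x_max - x_min)
        if method == "logistic":           # sigmoid, maps x >= 0 into [0.5, 1)
            return 1.0 / (1.0 + math.exp(-x))
        if method == "square_root":        # min-max scaling of sqrt(x)
            return (math.sqrt(x) - math.sqrt(x_min)) / (math.sqrt(x_max) - math.sqrt(x_min))
        if method == "logarithmic":        # min-max scaling of log(1 + x)
            return (math.log1p(x) - math.log1p(x_min)) / (math.log1p(x_max) - math.log1p(x_min))
        if method == "inverse_cotangent":  # arctan-based squashing into [0, 1)
            return 2.0 * math.atan(x) / math.pi
        raise ValueError(f"unknown normalization method: {method}")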

We then considered both factors together. According to the results in Tables 5, 6, and 7, grouped by different similarity indexes, TDSelector obtains the best results (0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084) when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the value of Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
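To make the pairing explicit, here is a minimal sketch of the scoring step that such a combination feeds into: each candidate training instance is ranked by a linear weighted sum of its similarity to a test instance and its normalized defect count. The distance-to-similarity conversion and the field names are assumptions for illustration, not the paper's exact implementation:

    import math

    def euclidean_similarity(a, b):
        # Turn the Euclidean distance between two metric vectors into a
        # similarity value in (0, 1]; the exact conversion is an assumption.
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        return 1.0 / (1.0 + d)

    def tdselector_score(candidate, test_metrics, alpha, norm_defects):
        # Linear weighted function of similarity and normalized defects;
        # alpha = 1 reduces to similarity only (the NoD setting).
        sim = euclidean_similarity(candidate["metrics"], test_metrics)
        return alpha * sim + (1.0 - alpha) * norm_defects(candidate["defects"])

With α swept from 0 to 1 in steps of 0.1, the highest-scoring candidates form the simplified training set.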

6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is, on average, better than both baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets and has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained by the different methods. Table 10 shows that the proportions of defective instances in the simplified training datasets are very close. However, when counting instances with more than one defect among these defective instances, our method returns more, and the ratio is roughly twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized because of the instances with more defects. This result further validates that selecting training data with defects taken into account is valuable.

Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size of 0.009 is too small to be of interest in a particular application.

In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.

7. Discussion

7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of the nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step value of 1. As shown in Figure 5, for the three combinations in question, selecting the k-nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
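For orientation, a sketch of where k enters the pipeline (the function and field names are assumptions): the union of the k nearest source instances of every test instance forms the initial candidate training dataset.

    def collect_candidates(test_set, source_instances, k, distance):
        # Initial TDS: for each test instance, keep its k nearest source
        # instances (duplicates across test instances are removed later).
        tds = []
        for t in test_set:
            nearest = sorted(source_instances,
                             key=lambda s: distance(s["metrics"], t))[:k]
            tds.extend(nearest)
        return tds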

Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and they then reach peak performance when k is equal to 10.


Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. The five columns under each similarity index correspond to the Linear, Logistic, Square root, Logarithmic, and Inverse cotangent normalizations.

Cosine + Linear:
  vs. Cosine similarity:     -       0.018   0.084   0.000   0.116
  vs. Euclidean distance:   -0.049  -0.036  -0.004  -0.013  -0.009
  vs. Manhattan distance:    0.138   0.049   0.164   0.178   0.169

Euclidean + Linear:
  vs. Cosine similarity:     0.049   0.102   0.111   0.062   0.164
  vs. Euclidean distance:    -       0.036   0.040   0.058   0.089
  vs. Manhattan distance:    0.209   0.102   0.249   0.276   0.244

Manhattan + Logistic:
  vs. Cosine similarity:    -0.049  -0.022   0.022  -0.013   0.111
  vs. Euclidean distance:   -0.102  -0.076  -0.080  -0.049  -0.031
  vs. Manhattan distance:    0.053   -       0.124   0.151   0.147


Table 9: A comparison between our approach and two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question. The last two columns give the growth rate (%) of the AUC value of TDSelector ("Euclidean + Linear") over each baseline.

Test set   Baseline1   Baseline2   vs. Baseline1 (%)   vs. Baseline2 (%)
Ant        0.785       0.803        1.3                 -1.0
Xalan      0.657       0.675       10.7                  7.7
Camel      0.595       0.624        0.5                 -4.2
Ivy        0.789       0.802        4.7                  3.0
Jedit      0.694       0.782       14.3                  1.4
Lucene     0.608       0.701       -0.8                -14.0
Poi        0.691       0.789        3.3                 -9.5
Synapse    0.740       0.748        2.3                  1.2
Velocity   0.330       0.331       65.2                 64.7
Xerces     0.714       0.753        8.5                  2.9
Eclipse    0.706       0.744       10.2                  4.6
Equinox    0.587       0.720       23.1                  0.3
Lucene2    0.705       0.724        2.5                 -0.2
Mylyn      0.631       0.646        9.3                  6.8
Pde        0.678       0.737       10.4                  1.5
Avg        0.663       0.705       10.6                  4.3

Cliff's delta: baseline1 vs. TDSelector, δ = -0.409; baseline2 vs. TDSelector, δ = -0.009.

Table 10: Comparison of the defective instances of the simplified training datasets obtained by the different methods on the Velocity project.

Method       defect instances / instances   instances (defects > 1) / defect instances
Baseline1    0.375                          0.247
Baseline2    0.393                          0.291
TDSelector   0.376                          0.487

Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k, for the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear").

The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, this best result is still worse than those of the other two combinations.

Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k, for the same three combinations).

7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1/40 (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as TDSelector-3.


Figure 7: Percentage of defective instances with different numbers of bugs. (a) is shown from the viewpoint of a single dataset (release), while (b) is shown from the viewpoint of the whole dataset used in our experiments.

That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
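A compact sketch of this two-stage variant, reusing the tdselector_score sketch from earlier and assuming the same dictionary-based instance representation; the helper is hypothetical and only illustrates the selection logic described above:

    def tdselector_3(candidates, test_metrics, alpha, norm_defects,
                     top_n, bug_threshold=3):
        # Stage 1: instances with >= bug_threshold bugs are kept directly.
        direct = [c for c in candidates if c["defects"] >= bug_threshold]
        # Stage 2: the rest are ranked by the weighted TDSelector score.
        rest = sorted((c for c in candidates if c["defects"] < bug_threshold),
                      key=lambda c: tdselector_score(c, test_metrics,
                                                     alpha, norm_defects),
                      reverse=True)
        # Merge both parts and drop redundant (duplicate) instances.
        final_tds, seen = [], set()
        for c in direct + rest[:top_n]:
            key = tuple(c["metrics"])
            if key not in seen:
                seen.add(key)
                final_tds.append(c)
        return final_tds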

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average, TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold for the number of bugs in each training instance (namely, three in this paper) at the first stage, and our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.

Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector. The last column in each of the three plots represents the average AUC value.

7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings for the above factors.
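As a side note, the Z-score preprocessing mentioned above is simply per-feature standardization; a one-function sketch:

    import numpy as np

    def z_score(X):
        # Standardize each software-metric column to zero mean and unit
        # variance before computing similarities.
        X = np.asarray(X, dtype=float)
        return (X - X.mean(axis=0)) / X.std(axis=0)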

Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria that were initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
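Cliff's delta itself is easy to reproduce; a direct O(n·m) sketch of the statistic used throughout this section:

    def cliffs_delta(xs, ys):
        # delta = (#{x > y} - #{x < y}) / (n * m), ranging over [-1, 1];
        # |delta| < 0.147 is "very small" and < 0.33 "small" (Sawilowsky).
        gt = sum(1 for x in xs for y in ys if x > y)
        lt = sum(1 for x in xs for y in ys if x < y)
        return (gt - lt) / (len(xs) * len(ys))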

Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of our results. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five normalization methods when calculating the score of each candidate training instance; the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances to training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167-199, 2012.

[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706-720, 2002.

[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248-256, 2012.

[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91-100, The Netherlands, August 2009.

[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409-418, USA, May 2013.

[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.

[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1-10, Baltimore, Maryland, October 2013.

[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66-72, 1998.

[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.

[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489-498, IEEE Computer Society, Washington, DC, USA, May 2007.

[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.

[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540-578, 2009.

[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397-402, USA, July 2015.

[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51-56, Italy, July 2017.

[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170-190, 2015.

[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1-22, 2015.

[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455-1473, 2017.

[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1-38, 2015.

[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062-1077, 2016.

[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646-25656, 2017.

[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182-191, India, June 2014.

[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.

[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508-519, Italy, September 2015.

[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496-507, Italy, September 2015.

[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, p. 1, 2017.

[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1-45, 2017.

[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91-102, Shanghai, September 2017.

[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45-54, USA, October 2013.

[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.

[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.

[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321-339, 2017.

[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969-980, 2015.

[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434-441, Czech Republic, July 2017.

[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531-577, 2012.

[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101-1118, 2013.

[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.

[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, no. 7, pp. 277-282, Springer-Verlag, 2001.

[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111-117, 2006.

[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 2013 35th International Conference on Software Engineering (ICSE 2013), pp. 382-391, USA, May 2013.

[40] M. Jureczko and D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer System, Monographs of System Dependability, pp. 69-81, 2010.

[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31-41, IEEE, May 2010.

[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545-555, 2011.

[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311-321, Hungary, September 2011.

[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13-23, ACM, November 2008.

[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181-190, Germany, May 2008.

[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300-310, Hungary, September 2011.

[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.

[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171-180, Sweden, September 2012.

[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356-370, 2011.

[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.

[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583-621, 1952.

[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597-599, 2009.

[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics, Informatics, vol. 8, no. 1, pp. 43-48, 2010.


[44] A Meneely L Williams W Snipes and J Osborne ldquoPredictingfailures with developer networks and social network analysisrdquoin Proceedings of the 16th ACM SIGSOFT International Sympo-sium on Foundations of Software Engineering (SIGSOFT rsquo08) pp13ndash23 ACM November 2008

[45] R Moser W Pedrycz and G Succi ldquoA Comparative analysisof the efficiency of change metrics and static code attributes fordefect predictionrdquo in Proceedings of the 30th International Con-ference on Software Engineering 2008 ICSErsquo08 pp 181ndash190 deuMay 2008

[46] E Shihab A Mockus Y Kamei B Adams and A E HassanldquoHigh-impact defects A study of breakage and surprise defectsrdquoin Proceedings of the 19th ACM SIGSOFT Symposium on Foun-dations of Software Engineering SIGSOFTFSErsquo11 pp 300ndash310hun September 2011

[47] T Fawcett ldquoAn introduction to ROC analysisrdquo Pattern Recogni-tion Letters vol 27 no 8 pp 861ndash874 2006

[48] E Giger M DrsquoAmbros M Pinzger and H C Gall ldquoMethod-level bug predictionrdquo in Proceedings of the 6th ACM-IEEEInternational Symposium on Empirical Software Engineering andMeasurement ESEM 2012 pp 171ndash180 swe September 2012

[49] Q Song Z Jia M Shepperd S Ying and J Liu ldquoA generalsoftware defect-proneness prediction frameworkrdquo IEEE Trans-actions on Software Engineering vol 37 no 3 pp 356ndash370 2011

[50] P C Cozby ldquoMethods in behavioral researchrdquo in McGraw-HillHigher Education 2011

[51] W H Kruskal and W A Wallis ldquoUse of ranks in one-criterionvariance analysisrdquo Journal of the American Statistical Associa-tion vol 47 no 260 pp 583ndash621 1952

[52] S S Sawilowsky ldquoNew Effect Size Rules of Thumbrdquo Journal ofModern Applied Statistical Methods vol 8 no 2 pp 597ndash5992009

[53] S S Choi S H Cha and C C Tappert ldquoA Survey of BinarySimilarity and Distance Measuresrdquo Journal of Systemics Cyber-netics Informatics vol 8 no 1 pp 43ndash48 2010


Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size.

Compared combination             Cosine + Linear    Euclidean + Linear    Manhattan + Logistic
Cosine + Linear                  –                  0.049                 -0.049
Cosine + Logistic                -0.018             0.102                 -0.022
Cosine + Square root             0.084              0.111                 0.022
Cosine + Logarithmic             0.000              0.062                 -0.013
Cosine + Inverse cotangent       0.116              0.164                 0.111
Euclidean + Linear               -0.049             –                     -0.102
Euclidean + Logistic             -0.036             -0.036                -0.076
Euclidean + Square root          -0.004             0.040                 -0.080
Euclidean + Logarithmic          -0.013             0.058                 -0.049
Euclidean + Inverse cotangent    -0.009             0.089                 -0.031
Manhattan + Linear               0.138              0.209                 0.053
Manhattan + Logistic             0.049              0.102                 –
Manhattan + Square root          0.164              0.249                 -0.124
Manhattan + Logarithmic          0.178              0.276                 0.151
Manhattan + Inverse cotangent    0.169              0.244                 0.147
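For reference, the five normalization schemes named in Table 8 can be sketched as below. The formulas are assumed from the usual meanings of these names ("linear" taken as min-max scaling); the paper's exact parameterizations may differ, so this is an illustrative sketch rather than the authors' definitions.

```python
import numpy as np

# Assumed forms of the five normalization schemes compared in Table 8;
# each maps a numpy vector x of defect counts (x >= 0) into a bounded range.
def linear(x):            return (x - x.min()) / (x.max() - x.min())
def logistic(x):          return 1.0 / (1.0 + np.exp(-x))
def square_root(x):       return np.sqrt(x) / np.sqrt(x.max())
def logarithmic(x):       return np.log1p(x) / np.log1p(x.max())
def inverse_cotangent(x): return np.arctan(x) * 2.0 / np.pi
```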


Table 9: A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all the three methods in question; the TDSelector results are those of the "Euclidean + Linear" combination.

Test set    Baseline1    Baseline2    Improvement over Baseline1 (%)    Improvement over Baseline2 (%)
Ant         0.785        0.803        1.3                               -1.0
Xalan       0.657        0.675        10.7                              7.7
Camel       0.595        0.624        0.5                               -4.2
Ivy         0.789        0.802        4.7                               3.0
Jedit       0.694        0.782        14.3                              1.4
Lucene      0.608        0.701        -0.8                              -14.0
Poi         0.691        0.789        3.3                               -9.5
Synapse     0.740        0.748        2.3                               1.2
Velocity    0.330        0.331        65.2                              64.7
Xerces      0.714        0.753        8.5                               2.9
Eclipse     0.706        0.744        10.2                              4.6
Equinox     0.587        0.720        23.1                              0.3
Lucene2     0.705        0.724        2.5                               -0.2
Mylyn       0.631        0.646        9.3                               6.8
Pde         0.678        0.737        10.4                              1.5
Avg         0.663        0.705        10.6                              4.3

Cliff's delta: Baseline1 vs. TDSelector, δ = -0.409; Baseline2 vs. TDSelector, δ = -0.009.

Table 10: Comparison of the defect instances of the simplified training datasets obtained from different methods on the Velocity project.

Method        defect instances / instances    instances (defects > 1) / defect instances
Baseline1     0.375                           0.247
Baseline2     0.393                           0.291
TDSelector    0.376                           0.487
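The two ratios reported in Table 10 can be computed per simplified training set as in the short sketch below; the function name and input format are illustrative, not taken from the paper.

```python
def table10_ratios(defect_counts):
    """Given per-instance defect counts of a simplified training set, return
    (defect instances / instances, instances with defects > 1 / defect instances)."""
    n = len(defect_counts)
    defective = [d for d in defect_counts if d > 0]  # instances with at least one bug
    return len(defective) / n, sum(1 for d in defective if d > 1) / len(defective)
```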

Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC against k = 1 to 10 for the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear").

The combinations "Cosine + Linear" and "Euclidean + Linear" achieve their best results when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.

Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC against k = 1 to 10 for the same three combinations).

7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC. We also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), for the 15 releases, most of their instances contain no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total number of instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as


Figure 7: Percentage of defective instances with different numbers of bugs. (a) is shown from the viewpoint of a single dataset (release), plotting for each of the 15 releases the proportions of instances with defects < 2 and defects < 3, while (b) is shown from the viewpoint of the whole dataset used in our experiments, plotting the proportion of instances per defect count (4 to 74).

TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
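To make the two-stage selection concrete, the following Python sketch illustrates how TDSelector-3 could be implemented. It assumes the weighted score of (2) takes the form alpha * similarity + (1 - alpha) * normalized defects, with Euclidean-distance-based similarity and linear normalization (the preferred combination); the function names, the similarity transform 1 / (1 + distance), and the target-size parameter are illustrative, not the authors' exact code.

```python
import numpy as np

def linear_normalize(x):
    # Linear (min-max) normalization to [0, 1]; constant vectors map to 0.
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def tdselector3(features, defects, test_instance, alpha=0.6, size=500):
    """Illustrative two-stage selection: instances with >= 3 bugs are kept
    directly; the rest are ranked by alpha * similarity + (1 - alpha) * defects."""
    features = np.asarray(features, dtype=float)
    defects = np.asarray(defects, dtype=float)

    direct = np.where(defects >= 3)[0]          # stage 1: taken as-is
    rest = np.where(defects < 3)[0]

    # Euclidean-distance-based similarity to the test instance.
    dist = np.linalg.norm(features[rest] - test_instance, axis=1)
    similarity = 1.0 / (1.0 + dist)

    score = (alpha * linear_normalize(similarity)
             + (1 - alpha) * linear_normalize(defects[rest]))
    ranked = rest[np.argsort(-score)]           # stage 2: best-scored first

    selected = np.concatenate([direct, ranked[:max(0, size - len(direct))]])
    return np.unique(selected)                  # drop redundant indices
```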

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly at the first stage according to a threshold for the number of bugs in each training instance (namely, three in this paper); our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.

7.3. Threats to Validity. In this study, we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study would change if we used different settings of the above factors.
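As a reference for the third point, a minimal sketch of such an α sweep is given below, using scikit-learn's logistic regression with default parameters in place of Weka's LR. The helper select_fn (returning the training indices chosen for a given α) and the held-out validation set are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def best_alpha(select_fn, X_pool, y_pool, X_val, y_val):
    """Sweep alpha from 0 to 1 in steps of 0.1 and keep the value that
    maximizes AUC of a default-parameter LR predictor (illustrative sketch)."""
    best = (None, -1.0)
    for alpha in np.arange(0.0, 1.01, 0.1):
        idx = select_fn(alpha)                  # training instances chosen for this alpha
        clf = LogisticRegression().fit(X_pool[idx], y_pool[idx])
        auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
        if auc > best[1]:
            best = (alpha, auc)
    return best                                 # (best alpha, best AUC)
```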

Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis testing methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria that were initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs better than Baseline1, as indicated by |δ| = 0.409 > 0.33.
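For completeness, Cliff's delta for two groups of observations can be computed directly from its standard definition, δ = (#{x > y} - #{x < y}) / (m * n) over all pairs, as in this short O(mn) sketch (the toy data are illustrative only):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#{x > y} - #{x < y}) / (len(xs) * len(ys)) over all pairs."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Toy usage with two small AUC samples:
print(cliffs_delta([0.78, 0.71, 0.66], [0.66, 0.60, 0.59]))  # 0.888...
```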

Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets, in addition to AEEEM and PROMISE, is the main threat to the validity of our results. All of the 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions.

Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector (AUC per test set for the three combinations "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic", over Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, and Pde). The last column in each of the three plots represents the average AUC value.

Second, we utilized only three similarity indexes and five normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from those defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with those similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables, such as bug-fixing time, for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167-199, 2012.

[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706-720, 2002.

[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248-256, 2012.

[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91-100, The Netherlands, August 2009.

[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409-418, USA, May 2013.

[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.

[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1-10, Baltimore, Maryland, October 2013.

[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66-72, 1998.

[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.

[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489-498, IEEE Computer Society, Washington, DC, USA, May 2007.

[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.

[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540-578, 2009.

[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397-402, USA, July 2015.

[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51-56, Italy, July 2017.

[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170-190, 2015.

[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1-22, 2015.

[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455-1473, 2017.

[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1-38, 2015.

[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062-1077, 2016.

[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646-25656, 2017.

[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182-191, India, June 2014.

[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.

[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508-519, Italy, September 2015.

[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496-507, Italy, September 2015.

[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.

[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1-45, 2017.

[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91-102, Shanghai, September 2017.

[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45-54, USA, October 2013.

[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.

[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.

[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321-339, 2017.

[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969-980, 2015.

[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naïve Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434-441, Czech Republic, July 2017.

[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531-577, 2012.

[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101-1118, 2013.

[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.

[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, no. 7, pp. 277-282, Springer-Verlag, 2001.

[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111-117, 2006.

[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 2013 35th International Conference on Software Engineering (ICSE 2013), pp. 382-391, USA, May 2013.

[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69-81, 2010.

[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31-41, IEEE, May 2010.

[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545-555, 2011.

[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311-321, Hungary, September 2011.

[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13-23, ACM, November 2008.

[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181-190, Germany, May 2008.

[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300-310, Hungary, September 2011.

[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.

[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171-180, Sweden, September 2012.

[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356-370, 2011.

[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.

[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583-621, 1952.

[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597-599, 2009.

[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43-48, 2010.

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 14: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

14 Mathematical Problems in Engineering

Table 9 A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM The comparison isconducted based on the best prediction results of all the three methods in question

Test set Baseline1 Baseline2 Euclidean + Linear 120575Ant 0785 0803 13 minus10

Baseline1 vs TDSelector minus0409Xalan 0657 0675 107 77Camel 0595 0624 05 minus42Ivy 0789 0802 47 30Jedit 0694 0782 143 14Lucene 0608 0701 minus08 minus140Poi 0691 0789 33 minus95Synapse 0740 0748 23 12Velocity 0330 0331 652 647

Baseline2 vs TDSelector minus0009Xerces 0714 0753 85 29Eclipse 0706 0744 102 46Equinox 0587 0720 231 03Lucene2 0705 0724 25 minus02Mylyn 0631 0646 93 68Pde 0678 0737 104 15Avg 0663 0705 106 43

Table 10 Comparison with the defect instances of simplified train-ing dataset obtained from different methods on Velocity project

119889119890119891119890119888119905 119894119899119904119905119886119899119888119890119904119894119899119904119905119886119899119888119890119904 119894119899119904119905119886119899119888119890119904(119889119890119891119890119888119905119904 gt 1)

119889119890119891119890119888119905 119894119899119904119905119886119899119888119890119904Baseline1 0375 0247Baseline2 0393 0291TDSelector 0376 0487

k10987654321

AUC

09

08

07

06

05

Manhattan+logisticEuclidean+linearCosine+linear

Combination

Figure 5 The impact of 119896 on prediction results for the 10 test setsfrom PROMISE

when 119896 is equal to 10 The combination of ldquoManhattan +Logisticrdquo by contrast achieves the best result as 119896 is set to7 Even so the best result is still worse than those of the othertwo combinations

Manhattan+logisticEuclidean+linearCosine+linear

k10987654321

AUC

09

08

07

06

05

Figure 6 The impact of 119896 on prediction results for the 5 test setsfrom AEEEM

72 Selecting Instances with More Bugs Directly as TrainingData Our experimental results have validated the impactof defects on the selection of training data of quality interms of AUC and we also want to know whether thedirect selection of defective instances with more bugs astraining instances which simplifies the selection process andreduces computation cost would achieve better predictionperformance The result of this question is of particularconcern for developers in practice

According to Figure 7(a) for the 15 releases most ofthem contain instances with no more than two bugs On theother hand the ratio of the instances that have more thanthree defects to the total instances is less than 140 (seeFigure 7(b)) Therefore we built a new TDSelector based onthe number of bugs in each instance which is referred to as

Mathematical Problems in Engineering 15

0

02

04

06

08

1

12

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Perc

enta

ge(in

stanc

e)

release_id

defectslt2defectslt3

(a)

0

0002

0004

0006

0008

001

0012

0014

0016

4 14 24 34 44 54 64 74

Perc

enta

ge(in

stanc

e)

defects

(b)

Figure 7 Percentage of defective instances with different numbers of bugs (a) is shown from the viewpoint of a single dataset (release) while(b) is shown from the viewpoint of the whole dataset used in our experiments

TDSelector-3 That is to say those defective instances thathave at least three bugs were chosen directly from an initialTDS as training data while the remaining instances in theTDS were selected in light of (2) All instances from the twoparts then form the final TDS after removing redundant ones

Figure 8 shows that the results of the two methods differfrom dataset to dataset For Ivy and Xerces collected fromPROMISE TDSelector outperforms TDSelector-3 in all thethree scenarios but only slightly On the contrary for Luceneand Velocity from PROMISE the incremental AUC valuesobtained by using TDSelector-3 with ldquoCosine + Linearrdquoreach up to 0109 and 0235 respectively As shown inFigure 8 on average TDSelector-3 performs better than thecorresponding TDSelector and the average AUC values forldquoCosine + Linearrdquo ldquoEuclidean + Linearrdquo and ldquoManhattan +Logisticrdquo are improved by up to 326 257 and 142respectively Therefore the direct selection of defectiveinstances that contain quite a few bugs can overall furtherimprove the performance of the predictor trained by ourapproach In other words those valuable defective instancescan be screened out quickly according to a threshold for thenumber of bugs in each training instance (namely three inthis paper) at the first stage Our approach is then able to beapplied to the remaining TDS Note that the automaticoptimization method for such a threshold for TDSelector willbe investigated in our future work

73 Threats to Validity In this study we obtained severalinteresting results but potential threats to the validity of ourwork remain

Threats to internal validity concern any confounding fac-tor that may affect our results First the raw data used in thispaper were normalized by using the 119885-score method whilethe baseline method TCA+ provides four normalizationmethods [39] Second unlike TCA+ TDSelector does notintroduce any feature selection method to process software

metricsThird the weighting factor 120572 changes with a step size01 whenAlgorithm 1 calculates themaximum value of AUCThere is no doubt that a smaller step size will result in greatercalculation time Fourth we trained only one type of defectpredictor based on the default parameter settings configuredby the tool Weka because LR has been widely used in pre-vious studies Hence we are indeed aware that the resultsof our study would change if we use different settings of theabove three factors

Threats to statistical conclusion validity focus on whetherconclusions about the relationship among variables basedon the experimental data are correct or reasonable [50] Inaddition to mean value and standard deviation in this paperwe also utilized Cliff rsquos delta effect size instead of hypotheticaltest methods such as the KruskalndashWallis 119867 test [51] to com-pare the results of different methods because there are only15 datasets collected fromPROMISE andAEEEM Accordingto the criteria that were initially suggested by Cohen andexpanded by Sawilowsky [52] nearly all of the effect sizevalues in this paper belong to small (0147 le 120575 lt 033) andvery small (0008 le 120575 lt 0147) This indicates that there is nosignificant difference in AUC value between different com-binations in question though some perform better in termsof mean value and standard deviation However it is clearthat our method obviously performs better than baseline1indicated by |120575| = 0409 gt 033

Threats to external validity emphasize the generalizationof the obtained results First the selection of experimentaldatasetsmdashin addition to AEEEM and PROMISEmdashis themain threat to validate the results of our study All the 14projects used in this paper are written in Java and from theApache Software Foundation and the Eclipse FoundationAlthough our experiments can be repeated with more open-source projects written in other programming languagesand developed with different software metrics the empir-ical results may be different from our main conclusionsSecond we utilized only three similarity indexes and five

16 Mathematical Problems in Engineering

Euclidean+Linear

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

Cosine+Linear

TDSelectorTDSelector-3

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

TDSelectorTDSelector-3

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

TDSelectorTDSelector-3

Manhattan+Logistic

Figure 8 A comparison of prediction performance between TDSelector-3 and the corresponding TDSelectorThe last column in each of thethree plots represents the average AUC value

normalization methods when calculating the score of eachcandidate training instance Therefore the generalizability ofour method for other similarity indexes (such as PearsonCorrelation Coefficient and Mahalanobis distance [53]) andnormalizationmethods has yet to be testedThird to compareour method with TCA+ defect predictors used in this paperwere built using LR implying that the generalizability of ourmethod for other classification algorithms remains unclear

8 Conclusion and Future Work

This study aims to train better defect predictors by selectingthe most appropriate training data from those defect datasetsavailable on the Internet to improve the performance ofcross-project defect predictions In summary the study hasbeen conducted on 14 open-source projects and consists of(1) an empirical validation on the usability of the number of

defects that an instance includes for training data selection(2) an in-depth analysis of our method TDSelector withregard to similarity and normalization and (3) a comparisonbetween our proposed method and the benchmark methods

Compared with those similar previous studies the resultsof this study indicate that the inclusion of defects doesimprove the performance of CPDP predictors With a ratio-nal balance between the similarity of test instances withtraining instances and defects TDSelector can effectivelyselect appropriate training instances so that TDSelector-based defect predictors built by using LR achieve better pre-diction performance in terms of AUC More specifically thecombination of Euclidean distance and linear normalizationis the preferred way for TDSelector In addition our resultsalso demonstrate the effectiveness of the proposed methodaccording to a comparison with the baseline methods in thecontext of M2O in CPDP scenarios Hence we believe thatour approach can be helpful for developers when they are

Mathematical Problems in Engineering 17

required to build suitable predictors quickly for their newprojects because one of our interesting findings is that thosecandidate instances with more bugs can be chosen directly astraining instances

Our future work mainly includes two aspects On the onehandwe plan to validate the generalizability of our studywithmore defect data from projects written in different languagesOn the other hand we will focus on more effective hybridmethods based ondifferent selection strategies such as featureselection techniques [32] Last but not least we also plan todiscuss the possibility of considering not only the numberof defects but also time variables for training data selection(such as bug-fixing time)

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this article

Acknowledgments

The authors greatly appreciate Dr Nam and Dr Pan theauthors of [39] for providing them with the TCA sourceprogram and teaching them how to use itThis work was sup-ported by the Natural Science Foundation of Hubei province(no 2016CFB309) and the National Natural Science Founda-tion of China (nos 61272111 61273216 and 61572371)

References

[1] Z He F Shu Y Yang M Li and Q Wang ldquoAn investigationon the feasibility of cross-project defect predictionrdquo AutomatedSoftware Engineering vol 19 no 2 pp 167ndash199 2012

[2] L C Briand W L Melo and J Wust ldquoAssessing the applica-bility of fault-proneness models across object-oriented softwareprojectsrdquo IEEETransactions on Software Engineering vol 28 no7 pp 706ndash720 2002

[3] Y Ma G Luo X Zeng and A Chen ldquoTransfer learning forcross-company software defect predictionrdquo Information andSoftware Technology vol 54 no 3 pp 248ndash256 2012

[4] T Zimmermann N Nagappan H Gall E Giger and BMurphy ldquoCross-project defect prediction A large scale exper-iment on data vs domain vs processrdquo in Proceedings of theJoint 12th European Software Engineering Conference and 17thACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESEC-FSErsquo09 pp 91ndash100 nld August 2009

[5] F Peters T Menzies and A Marcus ldquoBetter cross companydefect predictionrdquo in Proceedings of the 10th InternationalWorking Conference onMining Software Repositories MSR 2013pp 409ndash418 usa May 2013

[6] F RahmanD Posnett andPDevanbu ldquoRecalling the ldquoimpreci-sionrdquo of cross-project defect predictionrdquo inProceedings of the theACM SIGSOFT 20th International Symposium p 1 Cary NorthCarolina November 2012

[7] SHerbold ldquoTraining data selection for cross-project defect pre-dictionrdquo in Proceedings of the the 9th International Conferencepp 1ndash10 Baltimore Maryland October 2013

[8] T M Khoshgoftaar E B Allen R Halstead G P Trio and RM Flass ldquoUsing process history to predict software qualityrdquoTheComputer Journal vol 31 no 4 pp 66ndash72 1998

[9] T J Ostrand and E J Weyuker ldquoThe distribution of faults ina large industrial software systemrdquo ACM SIGSOFT SoftwareEngineering Notes vol 27 no 4 p 55 2002

[10] S Kim T Zimmermann E J Whitehead Jr and A ZellerldquoPredicting faults from cached historyrdquo in Proceedings of the29th International Conference on Software Engineering (ICSErsquo07) pp 489ndash498 IEEE Computer Society Washington DCUSA May 2007

[11] T Gyimothy R Ferenc and I Siket ldquoEmpirical validationof object-oriented metrics on open source software for faultpredictionrdquo IEEE Transactions on Software Engineering vol 31no 10 pp 897ndash910 2005

[12] B Turhan T Menzies A B Bener and J Di Stefano ldquoOn therelative value of cross-company and within-company data fordefect predictionrdquo Empirical Software Engineering vol 14 no5 pp 540ndash578 2009

[13] M Chen and Y Ma ldquoAn empirical study on predicting defectnumbersrdquo in Proceedings of the 27th International Conferenceon Software Engineering andKnowledge Engineering SEKE 2015pp 397ndash402 usa July 2015

[14] C Ni W Liu Q Gu X Chen and D Chen ldquoFeSCH A Fea-ture Selection Method using Clusters of Hybrid-data forCross-Project Defect Predictionrdquo in Proceedings of the 41stIEEE Annual Computer Software and Applications ConferenceCOMPSAC 2017 pp 51ndash56 ita July 2017

[15] P He B Li X Liu J Chen and Y Ma ldquoAn empirical studyon software defect prediction with a simplified metric setrdquoInformation and Software Technology vol 59 pp 170ndash190 2015

[16] T Wang Z Zhang X Jing and L Zhang ldquoMultiple kernelensemble learning for software defect predictionrdquo AutomatedSoftware Engineering vol 23 no 4 pp 1ndash22 2015

[17] J Y He Z P Meng X Chen Z Wang and X Y Fan ldquoSemi-supervised ensemble learning approach for cross-project defectpredictionrdquo Journal of Software Ruanjian Xuebao vol 28 no 6pp 1455ndash1473 2017

[18] D Ryu J-I Jang and J Baik ldquoA transfer cost-sensitive boostingapproach for cross-project defect predictionrdquo Software QualityJournal vol 25 no 1 pp 1ndash38 2015

[19] D Ryu and J Baik ldquoEffective multi-objective naıve Bayes learn-ing for cross-project defect predictionrdquoApplied Soft Computingvol 49 pp 1062ndash1077 2016

[20] Y Li Z Huang Y Wang and B Fang ldquoEvaluating Data Filteron Cross-Project Defect Prediction Comparison and Improve-mentsrdquo IEEE Access vol 5 pp 25646ndash25656 2017

[21] F Zhang A Mockus I Keivanloo and Y Zou ldquoTowards build-ing a universal defect prediction modelrdquo in Proceedings of the11th International Working Conference on Mining Software Re-positories MSR 2014 pp 182ndash191 ind June 2014

[22] P He B Li and Y Ma Towards Cross-Project Defect Predictionwith Imbalanced Feature Sets 2014

[23] J Nam and S Kim ldquoHeterogeneous defect predictionrdquo in Pro-ceedings of the 10th Joint Meeting of the European SoftwareEngineering Conference and the ACM SIGSOFT Symposium onthe Foundations of Software Engineering ESECFSE 2015 pp508ndash519 ita September 2015

[24] X Jing FWu X Dong F Qi and B Xu ldquoHeterogeneous cross-company defect prediction by unifiedmetric representation andCCA-based transfer learningrdquo in Proceedings of the 10th JointMeeting of the European Software Engineering Conference andthe ACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESECFSE 2015 pp 496ndash507 ita September 2015

18 Mathematical Problems in Engineering

[25] Z Li X-Y Jing X Zhu H Zhang B Xu and S Ying ldquoOnthe Multiple Sources and Privacy Preservation Issues for Het-erogeneous Defect Predictionrdquo IEEE Transactions on SoftwareEngineering 1 page 2017

[26] Z Li X-Y Jing F Wu X Zhu B Xu and S Ying ldquoCost-sensitive transfer kernel canonical correlation analysis for het-erogeneous defect predictionrdquoAutomated Software Engineeringpp 1ndash45 2017

[27] Z Li X Jing X Zhu and H Zhang ldquoHeterogeneous DefectPrediction Through Multiple Kernel Learning and EnsembleLearningrdquo in Proceedings of the 2017 IEEE International Confer-ence on Software Maintenance and Evolution (ICSME) pp 91ndash102 Shanghai September 2017

[28] Z He F Peters T Menzies and Y Yang ldquoLearning from open-source projects An empirical study on defect predictionrdquo inProceedings of the 2013 ACM IEEE International Symposium onEmpirical Software Engineering and Measurement ESEM 2013pp 45ndash54 usa October 2013

[29] P He B Li D Zhang and YMa Simplification of Training Datafor Cross-Project Defect Prediction arXiv preprint 2014

[30] F Porto and A Simao ldquoFeature Subset Selection and InstanceFiltering for Cross-project Defect Prediction-Classification andRankingrdquo CLEI Electronic Journal vol 19 no 4 p 17 2016

[31] X-Y Jing FWu X Dong and B Xu ldquoAn Improved SDA BasedDefect Prediction Framework for Both Within-Project andCross-Project Class-Imbalance Problemsrdquo IEEE Transactionson Software Engineering vol 43 no 4 pp 321ndash339 2017

[32] D Ryu J-I Jang and J Baik ldquoA hybrid instance selection usingnearest-neighbor for cross-project defect predictionrdquo Journal ofComputer Science and Technology vol 30 no 5 pp 969ndash9802015

[33] W N Poon K E Bennin J Huang P Phannachitta and J WKeung ldquoCross-project defect prediction using a credibilitytheory based naive bayes classifierrdquo in Proceedings of the 17thIEEE International Conference on Software Quality Reliabilityand Security QRS 2017 pp 434ndash441 cze July 2017

[34] M DrsquoAmbros M Lanza and R Robbes ldquoEvaluating defect pre-diction approaches a benchmark and an extensive comparisonrdquoEmpirical Software Engineering vol 17 no 4-5 pp 531ndash577 2012

[35] B Turhan A Tosun Misirli and A Bener ldquoEmpirical evalua-tion of the effects of mixed project data on learning defect pre-dictorsrdquo Information and Software Technology vol 55 no 6 pp1101ndash1118 2013

[36] J Han and M Kamber Data Mining Concepts and TechniquesElsevierMorgan KaufmannWalthamMass USA 3rd edition2012

[37] A Graf and S Borer ldquoNormalization in Support Vector Ma-chinesrdquo in Prooceedings of Dagm-Symposium on Pattern Recog-nition vol 2191 pp 277ndash282 Springer-Verlag 2001 no 7

[38] S B Kotsiantis D Kanellopoulos and P E Pintelas ldquoDataPreprocessing for Supervised Learningrdquo Enformatika vol 1 no1 pp 111ndash117 2006

[39] J Nam S J Pan and S Kim ldquoTransfer defect learningrdquo inProceedings of the 2013 35th International Conference on Soft-ware Engineering ICSE 2013 pp 382ndash391 usa May 2013

[40] M Jureczko andDD Spinellis ldquoUsingObject-OrientedDesignMetrics to Predict Software Defectsrdquo in Proceedings of theInternational Conference on Dependability of Computer SystemMonographs of System Dependability pp 69ndash81 2010

[41] MDrsquoAmbrosM Lanza and R Robbes ldquoAn extensive compari-son of bug prediction approachesrdquo inProceedings of the 7th IEEE

Working Conference on Mining Software Repositories (MSR rsquo10)pp 31ndash41 IEEE May 2010

[42] G E Macbeth E Razumiejczyk and R D Ledesma ldquoCliff rsquosdelta calculator A non-parametric effect size program for twogroups of observationsrdquo Universitas Psychologica vol 10 no 2pp 545ndash555 2011

[43] T Lee D G Han S Kim and H P In ldquoMicro interaction met-rics for defect predictionrdquo in Proceedings of the 19th ACMSIGSOFT Symposium on Foundations of Software EngineeringSIGSOFTFSErsquo11 pp 311ndash321 hun September 2011

[44] A Meneely L Williams W Snipes and J Osborne ldquoPredictingfailures with developer networks and social network analysisrdquoin Proceedings of the 16th ACM SIGSOFT International Sympo-sium on Foundations of Software Engineering (SIGSOFT rsquo08) pp13ndash23 ACM November 2008

[45] R Moser W Pedrycz and G Succi ldquoA Comparative analysisof the efficiency of change metrics and static code attributes fordefect predictionrdquo in Proceedings of the 30th International Con-ference on Software Engineering 2008 ICSErsquo08 pp 181ndash190 deuMay 2008

[46] E Shihab A Mockus Y Kamei B Adams and A E HassanldquoHigh-impact defects A study of breakage and surprise defectsrdquoin Proceedings of the 19th ACM SIGSOFT Symposium on Foun-dations of Software Engineering SIGSOFTFSErsquo11 pp 300ndash310hun September 2011

[47] T Fawcett ldquoAn introduction to ROC analysisrdquo Pattern Recogni-tion Letters vol 27 no 8 pp 861ndash874 2006

[48] E Giger M DrsquoAmbros M Pinzger and H C Gall ldquoMethod-level bug predictionrdquo in Proceedings of the 6th ACM-IEEEInternational Symposium on Empirical Software Engineering andMeasurement ESEM 2012 pp 171ndash180 swe September 2012

[49] Q Song Z Jia M Shepperd S Ying and J Liu ldquoA generalsoftware defect-proneness prediction frameworkrdquo IEEE Trans-actions on Software Engineering vol 37 no 3 pp 356ndash370 2011

[50] P C Cozby ldquoMethods in behavioral researchrdquo in McGraw-HillHigher Education 2011

[51] W H Kruskal and W A Wallis ldquoUse of ranks in one-criterionvariance analysisrdquo Journal of the American Statistical Associa-tion vol 47 no 260 pp 583ndash621 1952

[52] S S Sawilowsky ldquoNew Effect Size Rules of Thumbrdquo Journal ofModern Applied Statistical Methods vol 8 no 2 pp 597ndash5992009

[53] S S Choi S H Cha and C C Tappert ldquoA Survey of BinarySimilarity and Distance Measuresrdquo Journal of Systemics Cyber-netics Informatics vol 8 no 1 pp 43ndash48 2010

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 15: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

Mathematical Problems in Engineering 15

0

02

04

06

08

1

12

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Perc

enta

ge(in

stanc

e)

release_id

defectslt2defectslt3

(a)

0

0002

0004

0006

0008

001

0012

0014

0016

4 14 24 34 44 54 64 74

Perc

enta

ge(in

stanc

e)

defects

(b)

Figure 7 Percentage of defective instances with different numbers of bugs (a) is shown from the viewpoint of a single dataset (release) while(b) is shown from the viewpoint of the whole dataset used in our experiments

TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
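To make the two-stage selection concrete, the following minimal Python sketch illustrates it. The names (select_tds3, tdselector_score, top_k) are our own illustrative choices, and the distance-to-similarity mapping and the top_k cut-off are assumptions made for the sketch; only the overall structure, taking defect-rich instances directly and scoring the rest by a weighted sum of similarity and normalized defects as in (2), follows the method described above.

    import numpy as np

    def tdselector_score(test_vec, cand_vec, defects, max_defects, alpha=0.6):
        # Weighted sum of similarity and normalized defects, in the spirit
        # of (2); Euclidean distance is mapped to (0, 1] via 1 / (1 + d).
        diff = np.asarray(test_vec, float) - np.asarray(cand_vec, float)
        similarity = 1.0 / (1.0 + np.linalg.norm(diff))
        norm_defects = defects / max_defects if max_defects > 0 else 0.0
        return alpha * similarity + (1.0 - alpha) * norm_defects

    def select_tds3(candidates, test_vec, threshold=3, top_k=100):
        # candidates: list of (feature_vector, defect_count) pairs in the TDS.
        max_defects = max(d for _, d in candidates)
        # Stage 1: defect-rich instances are chosen directly.
        chosen = [(v, d) for v, d in candidates if d >= threshold]
        # Stage 2: the remaining instances are ranked by the weighted score.
        rest = sorted(
            ((v, d) for v, d in candidates if d < threshold),
            key=lambda c: tdselector_score(test_vec, c[0], c[1], max_defects),
            reverse=True,
        )
        # Both parts form the final TDS; redundant instances would be
        # removed before training, as described above.
        return chosen + rest[:top_k]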

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all the three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly at the first stage according to a threshold for the number of bugs in each training instance (namely, three in this paper), and our approach is then applied to the remaining TDS. Note that the automatic optimization of such a threshold for TDSelector will be investigated in our future work.

Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector under "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic". The last column in each of the three plots represents the average AUC value.

7.3. Threats to Validity. In this study, we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study would change if we used different settings of the above factors.
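For illustration, the scan over α can be sketched as follows. Here evaluate_auc is a hypothetical stand-in for building the training set with a given α, training the LR predictor, and measuring its AUC, so this is only a sketch of the search loop under our own assumptions, not the authors' Algorithm 1 itself.

    import numpy as np

    def find_best_alpha(evaluate_auc, step=0.1):
        # Scan alpha over [0, 1] with a fixed step and keep the value that
        # maximizes AUC; halving the step roughly doubles the number of
        # training runs, which is the time cost noted above.
        best_alpha, best_auc = 0.0, float("-inf")
        for alpha in np.arange(0.0, 1.0 + 1e-9, step):
            alpha = round(float(alpha), 2)
            auc = evaluate_auc(alpha)
            if auc > best_auc:
                best_alpha, best_auc = alpha, auc
        return best_alpha, best_auc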

Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized the Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria that were initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper belong to small (0.147 ≤ δ < 0.33) and very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method obviously performs better than baseline1, as indicated by |δ| = 0.409 > 0.33.
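As a reference for the effect size measure used here, a compact Cliff's delta implementation might look as follows; this is a sketch, and the threshold comments simply restate the bands quoted above.

    def cliffs_delta(xs, ys):
        # Cliff's delta: the probability that a value drawn from xs exceeds
        # one drawn from ys, minus the reverse; it ranges over [-1, 1].
        # Bands used above: |delta| < 0.147 "very small", < 0.33 "small".
        greater = sum(1 for x in xs for y in ys if x > y)
        less = sum(1 for x in xs for y in ys if x < y)
        return (greater - less) / (len(xs) * len(ys))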

Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of the results of our study. All the 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
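For completeness, the three similarity indexes compared in this paper can be sketched as below. Mapping the two distance-based indexes to (0, 1] via 1/(1 + d) is our assumption for the sketch, chosen only so that larger values mean greater similarity; it is not necessarily the paper's exact formula.

    import numpy as np

    def similarity(a, b, index="euclidean"):
        # The three similarity indexes used by TDSelector, in sketch form.
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        if index == "cosine":
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if index == "euclidean":
            return 1.0 / (1.0 + np.linalg.norm(a - b))
        if index == "manhattan":
            return 1.0 / (1.0 + np.abs(a - b).sum())
        raise ValueError(f"unknown similarity index: {index}")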

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from those defect datasets available on the Internet, so as to improve the performance of cross-project defect prediction. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).

References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, The Netherlands, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, USA, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, USA, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, Italy, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, India, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, Italy, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, Italy, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 1 page, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, USA, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, Czech Republic, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, no. 7, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, USA, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer System, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, Hungary, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, Germany, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, Hungary, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, Sweden, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 16: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

16 Mathematical Problems in Engineering

Euclidean+Linear

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

Cosine+Linear

TDSelectorTDSelector-3

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

TDSelectorTDSelector-3

050055060065070075080085

Ant

Xala

n

Cam

el Ivy

Jedi

t

Luce

ne Poi

Syna

pse

Velo

city

Xerc

es

Eclip

se

Equi

nox

Luce

ne

Myl

yn Pde

Avg

AUC

TDSelectorTDSelector-3

Manhattan+Logistic

Figure 8 A comparison of prediction performance between TDSelector-3 and the corresponding TDSelectorThe last column in each of thethree plots represents the average AUC value

normalization methods when calculating the score of eachcandidate training instance Therefore the generalizability ofour method for other similarity indexes (such as PearsonCorrelation Coefficient and Mahalanobis distance [53]) andnormalizationmethods has yet to be testedThird to compareour method with TCA+ defect predictors used in this paperwere built using LR implying that the generalizability of ourmethod for other classification algorithms remains unclear

8 Conclusion and Future Work

This study aims to train better defect predictors by selectingthe most appropriate training data from those defect datasetsavailable on the Internet to improve the performance ofcross-project defect predictions In summary the study hasbeen conducted on 14 open-source projects and consists of(1) an empirical validation on the usability of the number of

defects that an instance includes for training data selection(2) an in-depth analysis of our method TDSelector withregard to similarity and normalization and (3) a comparisonbetween our proposed method and the benchmark methods

Compared with those similar previous studies the resultsof this study indicate that the inclusion of defects doesimprove the performance of CPDP predictors With a ratio-nal balance between the similarity of test instances withtraining instances and defects TDSelector can effectivelyselect appropriate training instances so that TDSelector-based defect predictors built by using LR achieve better pre-diction performance in terms of AUC More specifically thecombination of Euclidean distance and linear normalizationis the preferred way for TDSelector In addition our resultsalso demonstrate the effectiveness of the proposed methodaccording to a comparison with the baseline methods in thecontext of M2O in CPDP scenarios Hence we believe thatour approach can be helpful for developers when they are

Mathematical Problems in Engineering 17

required to build suitable predictors quickly for their newprojects because one of our interesting findings is that thosecandidate instances with more bugs can be chosen directly astraining instances

Our future work mainly includes two aspects On the onehandwe plan to validate the generalizability of our studywithmore defect data from projects written in different languagesOn the other hand we will focus on more effective hybridmethods based ondifferent selection strategies such as featureselection techniques [32] Last but not least we also plan todiscuss the possibility of considering not only the numberof defects but also time variables for training data selection(such as bug-fixing time)

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this article

Acknowledgments

The authors greatly appreciate Dr Nam and Dr Pan theauthors of [39] for providing them with the TCA sourceprogram and teaching them how to use itThis work was sup-ported by the Natural Science Foundation of Hubei province(no 2016CFB309) and the National Natural Science Founda-tion of China (nos 61272111 61273216 and 61572371)

References

[1] Z He F Shu Y Yang M Li and Q Wang ldquoAn investigationon the feasibility of cross-project defect predictionrdquo AutomatedSoftware Engineering vol 19 no 2 pp 167ndash199 2012

[2] L C Briand W L Melo and J Wust ldquoAssessing the applica-bility of fault-proneness models across object-oriented softwareprojectsrdquo IEEETransactions on Software Engineering vol 28 no7 pp 706ndash720 2002

[3] Y Ma G Luo X Zeng and A Chen ldquoTransfer learning forcross-company software defect predictionrdquo Information andSoftware Technology vol 54 no 3 pp 248ndash256 2012

[4] T Zimmermann N Nagappan H Gall E Giger and BMurphy ldquoCross-project defect prediction A large scale exper-iment on data vs domain vs processrdquo in Proceedings of theJoint 12th European Software Engineering Conference and 17thACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESEC-FSErsquo09 pp 91ndash100 nld August 2009

[5] F Peters T Menzies and A Marcus ldquoBetter cross companydefect predictionrdquo in Proceedings of the 10th InternationalWorking Conference onMining Software Repositories MSR 2013pp 409ndash418 usa May 2013

[6] F RahmanD Posnett andPDevanbu ldquoRecalling the ldquoimpreci-sionrdquo of cross-project defect predictionrdquo inProceedings of the theACM SIGSOFT 20th International Symposium p 1 Cary NorthCarolina November 2012

[7] SHerbold ldquoTraining data selection for cross-project defect pre-dictionrdquo in Proceedings of the the 9th International Conferencepp 1ndash10 Baltimore Maryland October 2013

[8] T M Khoshgoftaar E B Allen R Halstead G P Trio and RM Flass ldquoUsing process history to predict software qualityrdquoTheComputer Journal vol 31 no 4 pp 66ndash72 1998

[9] T J Ostrand and E J Weyuker ldquoThe distribution of faults ina large industrial software systemrdquo ACM SIGSOFT SoftwareEngineering Notes vol 27 no 4 p 55 2002

[10] S Kim T Zimmermann E J Whitehead Jr and A ZellerldquoPredicting faults from cached historyrdquo in Proceedings of the29th International Conference on Software Engineering (ICSErsquo07) pp 489ndash498 IEEE Computer Society Washington DCUSA May 2007

[11] T Gyimothy R Ferenc and I Siket ldquoEmpirical validationof object-oriented metrics on open source software for faultpredictionrdquo IEEE Transactions on Software Engineering vol 31no 10 pp 897ndash910 2005

[12] B Turhan T Menzies A B Bener and J Di Stefano ldquoOn therelative value of cross-company and within-company data fordefect predictionrdquo Empirical Software Engineering vol 14 no5 pp 540ndash578 2009

[13] M Chen and Y Ma ldquoAn empirical study on predicting defectnumbersrdquo in Proceedings of the 27th International Conferenceon Software Engineering andKnowledge Engineering SEKE 2015pp 397ndash402 usa July 2015

[14] C Ni W Liu Q Gu X Chen and D Chen ldquoFeSCH A Fea-ture Selection Method using Clusters of Hybrid-data forCross-Project Defect Predictionrdquo in Proceedings of the 41stIEEE Annual Computer Software and Applications ConferenceCOMPSAC 2017 pp 51ndash56 ita July 2017

[15] P He B Li X Liu J Chen and Y Ma ldquoAn empirical studyon software defect prediction with a simplified metric setrdquoInformation and Software Technology vol 59 pp 170ndash190 2015

[16] T Wang Z Zhang X Jing and L Zhang ldquoMultiple kernelensemble learning for software defect predictionrdquo AutomatedSoftware Engineering vol 23 no 4 pp 1ndash22 2015

[17] J Y He Z P Meng X Chen Z Wang and X Y Fan ldquoSemi-supervised ensemble learning approach for cross-project defectpredictionrdquo Journal of Software Ruanjian Xuebao vol 28 no 6pp 1455ndash1473 2017

[18] D Ryu J-I Jang and J Baik ldquoA transfer cost-sensitive boostingapproach for cross-project defect predictionrdquo Software QualityJournal vol 25 no 1 pp 1ndash38 2015

[19] D Ryu and J Baik ldquoEffective multi-objective naıve Bayes learn-ing for cross-project defect predictionrdquoApplied Soft Computingvol 49 pp 1062ndash1077 2016

[20] Y Li Z Huang Y Wang and B Fang ldquoEvaluating Data Filteron Cross-Project Defect Prediction Comparison and Improve-mentsrdquo IEEE Access vol 5 pp 25646ndash25656 2017

[21] F Zhang A Mockus I Keivanloo and Y Zou ldquoTowards build-ing a universal defect prediction modelrdquo in Proceedings of the11th International Working Conference on Mining Software Re-positories MSR 2014 pp 182ndash191 ind June 2014

[22] P He B Li and Y Ma Towards Cross-Project Defect Predictionwith Imbalanced Feature Sets 2014

[23] J Nam and S Kim ldquoHeterogeneous defect predictionrdquo in Pro-ceedings of the 10th Joint Meeting of the European SoftwareEngineering Conference and the ACM SIGSOFT Symposium onthe Foundations of Software Engineering ESECFSE 2015 pp508ndash519 ita September 2015

[24] X Jing FWu X Dong F Qi and B Xu ldquoHeterogeneous cross-company defect prediction by unifiedmetric representation andCCA-based transfer learningrdquo in Proceedings of the 10th JointMeeting of the European Software Engineering Conference andthe ACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESECFSE 2015 pp 496ndash507 ita September 2015

18 Mathematical Problems in Engineering

[25] Z Li X-Y Jing X Zhu H Zhang B Xu and S Ying ldquoOnthe Multiple Sources and Privacy Preservation Issues for Het-erogeneous Defect Predictionrdquo IEEE Transactions on SoftwareEngineering 1 page 2017

[26] Z Li X-Y Jing F Wu X Zhu B Xu and S Ying ldquoCost-sensitive transfer kernel canonical correlation analysis for het-erogeneous defect predictionrdquoAutomated Software Engineeringpp 1ndash45 2017

[27] Z Li X Jing X Zhu and H Zhang ldquoHeterogeneous DefectPrediction Through Multiple Kernel Learning and EnsembleLearningrdquo in Proceedings of the 2017 IEEE International Confer-ence on Software Maintenance and Evolution (ICSME) pp 91ndash102 Shanghai September 2017

[28] Z He F Peters T Menzies and Y Yang ldquoLearning from open-source projects An empirical study on defect predictionrdquo inProceedings of the 2013 ACM IEEE International Symposium onEmpirical Software Engineering and Measurement ESEM 2013pp 45ndash54 usa October 2013

[29] P He B Li D Zhang and YMa Simplification of Training Datafor Cross-Project Defect Prediction arXiv preprint 2014

[30] F Porto and A Simao ldquoFeature Subset Selection and InstanceFiltering for Cross-project Defect Prediction-Classification andRankingrdquo CLEI Electronic Journal vol 19 no 4 p 17 2016

[31] X-Y Jing FWu X Dong and B Xu ldquoAn Improved SDA BasedDefect Prediction Framework for Both Within-Project andCross-Project Class-Imbalance Problemsrdquo IEEE Transactionson Software Engineering vol 43 no 4 pp 321ndash339 2017

[32] D Ryu J-I Jang and J Baik ldquoA hybrid instance selection usingnearest-neighbor for cross-project defect predictionrdquo Journal ofComputer Science and Technology vol 30 no 5 pp 969ndash9802015

[33] W N Poon K E Bennin J Huang P Phannachitta and J WKeung ldquoCross-project defect prediction using a credibilitytheory based naive bayes classifierrdquo in Proceedings of the 17thIEEE International Conference on Software Quality Reliabilityand Security QRS 2017 pp 434ndash441 cze July 2017

[34] M DrsquoAmbros M Lanza and R Robbes ldquoEvaluating defect pre-diction approaches a benchmark and an extensive comparisonrdquoEmpirical Software Engineering vol 17 no 4-5 pp 531ndash577 2012

[35] B Turhan A Tosun Misirli and A Bener ldquoEmpirical evalua-tion of the effects of mixed project data on learning defect pre-dictorsrdquo Information and Software Technology vol 55 no 6 pp1101ndash1118 2013

[36] J Han and M Kamber Data Mining Concepts and TechniquesElsevierMorgan KaufmannWalthamMass USA 3rd edition2012

[37] A Graf and S Borer ldquoNormalization in Support Vector Ma-chinesrdquo in Prooceedings of Dagm-Symposium on Pattern Recog-nition vol 2191 pp 277ndash282 Springer-Verlag 2001 no 7

[38] S B Kotsiantis D Kanellopoulos and P E Pintelas ldquoDataPreprocessing for Supervised Learningrdquo Enformatika vol 1 no1 pp 111ndash117 2006

[39] J Nam S J Pan and S Kim ldquoTransfer defect learningrdquo inProceedings of the 2013 35th International Conference on Soft-ware Engineering ICSE 2013 pp 382ndash391 usa May 2013

[40] M Jureczko andDD Spinellis ldquoUsingObject-OrientedDesignMetrics to Predict Software Defectsrdquo in Proceedings of theInternational Conference on Dependability of Computer SystemMonographs of System Dependability pp 69ndash81 2010

[41] MDrsquoAmbrosM Lanza and R Robbes ldquoAn extensive compari-son of bug prediction approachesrdquo inProceedings of the 7th IEEE

Working Conference on Mining Software Repositories (MSR rsquo10)pp 31ndash41 IEEE May 2010

[42] G E Macbeth E Razumiejczyk and R D Ledesma ldquoCliff rsquosdelta calculator A non-parametric effect size program for twogroups of observationsrdquo Universitas Psychologica vol 10 no 2pp 545ndash555 2011

[43] T Lee D G Han S Kim and H P In ldquoMicro interaction met-rics for defect predictionrdquo in Proceedings of the 19th ACMSIGSOFT Symposium on Foundations of Software EngineeringSIGSOFTFSErsquo11 pp 311ndash321 hun September 2011

[44] A Meneely L Williams W Snipes and J Osborne ldquoPredictingfailures with developer networks and social network analysisrdquoin Proceedings of the 16th ACM SIGSOFT International Sympo-sium on Foundations of Software Engineering (SIGSOFT rsquo08) pp13ndash23 ACM November 2008

[45] R Moser W Pedrycz and G Succi ldquoA Comparative analysisof the efficiency of change metrics and static code attributes fordefect predictionrdquo in Proceedings of the 30th International Con-ference on Software Engineering 2008 ICSErsquo08 pp 181ndash190 deuMay 2008

[46] E Shihab A Mockus Y Kamei B Adams and A E HassanldquoHigh-impact defects A study of breakage and surprise defectsrdquoin Proceedings of the 19th ACM SIGSOFT Symposium on Foun-dations of Software Engineering SIGSOFTFSErsquo11 pp 300ndash310hun September 2011

[47] T Fawcett ldquoAn introduction to ROC analysisrdquo Pattern Recogni-tion Letters vol 27 no 8 pp 861ndash874 2006

[48] E Giger M DrsquoAmbros M Pinzger and H C Gall ldquoMethod-level bug predictionrdquo in Proceedings of the 6th ACM-IEEEInternational Symposium on Empirical Software Engineering andMeasurement ESEM 2012 pp 171ndash180 swe September 2012

[49] Q Song Z Jia M Shepperd S Ying and J Liu ldquoA generalsoftware defect-proneness prediction frameworkrdquo IEEE Trans-actions on Software Engineering vol 37 no 3 pp 356ndash370 2011

[50] P C Cozby ldquoMethods in behavioral researchrdquo in McGraw-HillHigher Education 2011

[51] W H Kruskal and W A Wallis ldquoUse of ranks in one-criterionvariance analysisrdquo Journal of the American Statistical Associa-tion vol 47 no 260 pp 583ndash621 1952

[52] S S Sawilowsky ldquoNew Effect Size Rules of Thumbrdquo Journal ofModern Applied Statistical Methods vol 8 no 2 pp 597ndash5992009

[53] S S Choi S H Cha and C C Tappert ldquoA Survey of BinarySimilarity and Distance Measuresrdquo Journal of Systemics Cyber-netics Informatics vol 8 no 1 pp 43ndash48 2010

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 17: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

Mathematical Problems in Engineering 17

required to build suitable predictors quickly for their newprojects because one of our interesting findings is that thosecandidate instances with more bugs can be chosen directly astraining instances

Our future work mainly includes two aspects On the onehandwe plan to validate the generalizability of our studywithmore defect data from projects written in different languagesOn the other hand we will focus on more effective hybridmethods based ondifferent selection strategies such as featureselection techniques [32] Last but not least we also plan todiscuss the possibility of considering not only the numberof defects but also time variables for training data selection(such as bug-fixing time)

Conflicts of Interest

The authors declare that there are no conflicts of interest re-garding the publication of this article

Acknowledgments

The authors greatly appreciate Dr Nam and Dr Pan theauthors of [39] for providing them with the TCA sourceprogram and teaching them how to use itThis work was sup-ported by the Natural Science Foundation of Hubei province(no 2016CFB309) and the National Natural Science Founda-tion of China (nos 61272111 61273216 and 61572371)

References

[1] Z He F Shu Y Yang M Li and Q Wang ldquoAn investigationon the feasibility of cross-project defect predictionrdquo AutomatedSoftware Engineering vol 19 no 2 pp 167ndash199 2012

[2] L C Briand W L Melo and J Wust ldquoAssessing the applica-bility of fault-proneness models across object-oriented softwareprojectsrdquo IEEETransactions on Software Engineering vol 28 no7 pp 706ndash720 2002

[3] Y Ma G Luo X Zeng and A Chen ldquoTransfer learning forcross-company software defect predictionrdquo Information andSoftware Technology vol 54 no 3 pp 248ndash256 2012

[4] T Zimmermann N Nagappan H Gall E Giger and BMurphy ldquoCross-project defect prediction A large scale exper-iment on data vs domain vs processrdquo in Proceedings of theJoint 12th European Software Engineering Conference and 17thACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESEC-FSErsquo09 pp 91ndash100 nld August 2009

[5] F Peters T Menzies and A Marcus ldquoBetter cross companydefect predictionrdquo in Proceedings of the 10th InternationalWorking Conference onMining Software Repositories MSR 2013pp 409ndash418 usa May 2013

[6] F RahmanD Posnett andPDevanbu ldquoRecalling the ldquoimpreci-sionrdquo of cross-project defect predictionrdquo inProceedings of the theACM SIGSOFT 20th International Symposium p 1 Cary NorthCarolina November 2012

[7] SHerbold ldquoTraining data selection for cross-project defect pre-dictionrdquo in Proceedings of the the 9th International Conferencepp 1ndash10 Baltimore Maryland October 2013

[8] T M Khoshgoftaar E B Allen R Halstead G P Trio and RM Flass ldquoUsing process history to predict software qualityrdquoTheComputer Journal vol 31 no 4 pp 66ndash72 1998

[9] T J Ostrand and E J Weyuker ldquoThe distribution of faults ina large industrial software systemrdquo ACM SIGSOFT SoftwareEngineering Notes vol 27 no 4 p 55 2002

[10] S Kim T Zimmermann E J Whitehead Jr and A ZellerldquoPredicting faults from cached historyrdquo in Proceedings of the29th International Conference on Software Engineering (ICSErsquo07) pp 489ndash498 IEEE Computer Society Washington DCUSA May 2007

[11] T Gyimothy R Ferenc and I Siket ldquoEmpirical validationof object-oriented metrics on open source software for faultpredictionrdquo IEEE Transactions on Software Engineering vol 31no 10 pp 897ndash910 2005

[12] B Turhan T Menzies A B Bener and J Di Stefano ldquoOn therelative value of cross-company and within-company data fordefect predictionrdquo Empirical Software Engineering vol 14 no5 pp 540ndash578 2009

[13] M Chen and Y Ma ldquoAn empirical study on predicting defectnumbersrdquo in Proceedings of the 27th International Conferenceon Software Engineering andKnowledge Engineering SEKE 2015pp 397ndash402 usa July 2015

[14] C Ni W Liu Q Gu X Chen and D Chen ldquoFeSCH A Fea-ture Selection Method using Clusters of Hybrid-data forCross-Project Defect Predictionrdquo in Proceedings of the 41stIEEE Annual Computer Software and Applications ConferenceCOMPSAC 2017 pp 51ndash56 ita July 2017

[15] P He B Li X Liu J Chen and Y Ma ldquoAn empirical studyon software defect prediction with a simplified metric setrdquoInformation and Software Technology vol 59 pp 170ndash190 2015

[16] T Wang Z Zhang X Jing and L Zhang ldquoMultiple kernelensemble learning for software defect predictionrdquo AutomatedSoftware Engineering vol 23 no 4 pp 1ndash22 2015

[17] J Y He Z P Meng X Chen Z Wang and X Y Fan ldquoSemi-supervised ensemble learning approach for cross-project defectpredictionrdquo Journal of Software Ruanjian Xuebao vol 28 no 6pp 1455ndash1473 2017

[18] D Ryu J-I Jang and J Baik ldquoA transfer cost-sensitive boostingapproach for cross-project defect predictionrdquo Software QualityJournal vol 25 no 1 pp 1ndash38 2015

[19] D Ryu and J Baik ldquoEffective multi-objective naıve Bayes learn-ing for cross-project defect predictionrdquoApplied Soft Computingvol 49 pp 1062ndash1077 2016

[20] Y Li Z Huang Y Wang and B Fang ldquoEvaluating Data Filteron Cross-Project Defect Prediction Comparison and Improve-mentsrdquo IEEE Access vol 5 pp 25646ndash25656 2017

[21] F Zhang A Mockus I Keivanloo and Y Zou ldquoTowards build-ing a universal defect prediction modelrdquo in Proceedings of the11th International Working Conference on Mining Software Re-positories MSR 2014 pp 182ndash191 ind June 2014

[22] P He B Li and Y Ma Towards Cross-Project Defect Predictionwith Imbalanced Feature Sets 2014

[23] J Nam and S Kim ldquoHeterogeneous defect predictionrdquo in Pro-ceedings of the 10th Joint Meeting of the European SoftwareEngineering Conference and the ACM SIGSOFT Symposium onthe Foundations of Software Engineering ESECFSE 2015 pp508ndash519 ita September 2015

[24] X Jing FWu X Dong F Qi and B Xu ldquoHeterogeneous cross-company defect prediction by unifiedmetric representation andCCA-based transfer learningrdquo in Proceedings of the 10th JointMeeting of the European Software Engineering Conference andthe ACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESECFSE 2015 pp 496ndash507 ita September 2015

18 Mathematical Problems in Engineering

[25] Z Li X-Y Jing X Zhu H Zhang B Xu and S Ying ldquoOnthe Multiple Sources and Privacy Preservation Issues for Het-erogeneous Defect Predictionrdquo IEEE Transactions on SoftwareEngineering 1 page 2017

[26] Z Li X-Y Jing F Wu X Zhu B Xu and S Ying ldquoCost-sensitive transfer kernel canonical correlation analysis for het-erogeneous defect predictionrdquoAutomated Software Engineeringpp 1ndash45 2017

[27] Z Li X Jing X Zhu and H Zhang ldquoHeterogeneous DefectPrediction Through Multiple Kernel Learning and EnsembleLearningrdquo in Proceedings of the 2017 IEEE International Confer-ence on Software Maintenance and Evolution (ICSME) pp 91ndash102 Shanghai September 2017

[28] Z He F Peters T Menzies and Y Yang ldquoLearning from open-source projects An empirical study on defect predictionrdquo inProceedings of the 2013 ACM IEEE International Symposium onEmpirical Software Engineering and Measurement ESEM 2013pp 45ndash54 usa October 2013

[29] P He B Li D Zhang and YMa Simplification of Training Datafor Cross-Project Defect Prediction arXiv preprint 2014

[30] F Porto and A Simao ldquoFeature Subset Selection and InstanceFiltering for Cross-project Defect Prediction-Classification andRankingrdquo CLEI Electronic Journal vol 19 no 4 p 17 2016

[31] X-Y Jing FWu X Dong and B Xu ldquoAn Improved SDA BasedDefect Prediction Framework for Both Within-Project andCross-Project Class-Imbalance Problemsrdquo IEEE Transactionson Software Engineering vol 43 no 4 pp 321ndash339 2017

[32] D Ryu J-I Jang and J Baik ldquoA hybrid instance selection usingnearest-neighbor for cross-project defect predictionrdquo Journal ofComputer Science and Technology vol 30 no 5 pp 969ndash9802015

[33] W N Poon K E Bennin J Huang P Phannachitta and J WKeung ldquoCross-project defect prediction using a credibilitytheory based naive bayes classifierrdquo in Proceedings of the 17thIEEE International Conference on Software Quality Reliabilityand Security QRS 2017 pp 434ndash441 cze July 2017

[34] M DrsquoAmbros M Lanza and R Robbes ldquoEvaluating defect pre-diction approaches a benchmark and an extensive comparisonrdquoEmpirical Software Engineering vol 17 no 4-5 pp 531ndash577 2012

[35] B Turhan A Tosun Misirli and A Bener ldquoEmpirical evalua-tion of the effects of mixed project data on learning defect pre-dictorsrdquo Information and Software Technology vol 55 no 6 pp1101ndash1118 2013

[36] J Han and M Kamber Data Mining Concepts and TechniquesElsevierMorgan KaufmannWalthamMass USA 3rd edition2012

[37] A Graf and S Borer ldquoNormalization in Support Vector Ma-chinesrdquo in Prooceedings of Dagm-Symposium on Pattern Recog-nition vol 2191 pp 277ndash282 Springer-Verlag 2001 no 7

[38] S B Kotsiantis D Kanellopoulos and P E Pintelas ldquoDataPreprocessing for Supervised Learningrdquo Enformatika vol 1 no1 pp 111ndash117 2006

[39] J Nam S J Pan and S Kim ldquoTransfer defect learningrdquo inProceedings of the 2013 35th International Conference on Soft-ware Engineering ICSE 2013 pp 382ndash391 usa May 2013

[40] M Jureczko andDD Spinellis ldquoUsingObject-OrientedDesignMetrics to Predict Software Defectsrdquo in Proceedings of theInternational Conference on Dependability of Computer SystemMonographs of System Dependability pp 69ndash81 2010

[41] MDrsquoAmbrosM Lanza and R Robbes ldquoAn extensive compari-son of bug prediction approachesrdquo inProceedings of the 7th IEEE

Working Conference on Mining Software Repositories (MSR rsquo10)pp 31ndash41 IEEE May 2010

[42] G E Macbeth E Razumiejczyk and R D Ledesma ldquoCliff rsquosdelta calculator A non-parametric effect size program for twogroups of observationsrdquo Universitas Psychologica vol 10 no 2pp 545ndash555 2011

[43] T Lee D G Han S Kim and H P In ldquoMicro interaction met-rics for defect predictionrdquo in Proceedings of the 19th ACMSIGSOFT Symposium on Foundations of Software EngineeringSIGSOFTFSErsquo11 pp 311ndash321 hun September 2011

[44] A Meneely L Williams W Snipes and J Osborne ldquoPredictingfailures with developer networks and social network analysisrdquoin Proceedings of the 16th ACM SIGSOFT International Sympo-sium on Foundations of Software Engineering (SIGSOFT rsquo08) pp13ndash23 ACM November 2008

[45] R Moser W Pedrycz and G Succi ldquoA Comparative analysisof the efficiency of change metrics and static code attributes fordefect predictionrdquo in Proceedings of the 30th International Con-ference on Software Engineering 2008 ICSErsquo08 pp 181ndash190 deuMay 2008

[46] E Shihab A Mockus Y Kamei B Adams and A E HassanldquoHigh-impact defects A study of breakage and surprise defectsrdquoin Proceedings of the 19th ACM SIGSOFT Symposium on Foun-dations of Software Engineering SIGSOFTFSErsquo11 pp 300ndash310hun September 2011

[47] T Fawcett ldquoAn introduction to ROC analysisrdquo Pattern Recogni-tion Letters vol 27 no 8 pp 861ndash874 2006

[48] E Giger M DrsquoAmbros M Pinzger and H C Gall ldquoMethod-level bug predictionrdquo in Proceedings of the 6th ACM-IEEEInternational Symposium on Empirical Software Engineering andMeasurement ESEM 2012 pp 171ndash180 swe September 2012

[49] Q Song Z Jia M Shepperd S Ying and J Liu ldquoA generalsoftware defect-proneness prediction frameworkrdquo IEEE Trans-actions on Software Engineering vol 37 no 3 pp 356ndash370 2011

[50] P C Cozby ldquoMethods in behavioral researchrdquo in McGraw-HillHigher Education 2011

[51] W H Kruskal and W A Wallis ldquoUse of ranks in one-criterionvariance analysisrdquo Journal of the American Statistical Associa-tion vol 47 no 260 pp 583ndash621 1952

[52] S S Sawilowsky ldquoNew Effect Size Rules of Thumbrdquo Journal ofModern Applied Statistical Methods vol 8 no 2 pp 597ndash5992009

[53] S S Choi S H Cha and C C Tappert ldquoA Survey of BinarySimilarity and Distance Measuresrdquo Journal of Systemics Cyber-netics Informatics vol 8 no 1 pp 43ndash48 2010

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 18: An Improved Method for Cross-Project Defect Prediction by …downloads.hindawi.com/journals/mpe/2018/2650415.pdf · Project A-1 Project A-2 Projec B-1 Projec C-1 Projec C-2 Tst et

18 Mathematical Problems in Engineering

[25] Z Li X-Y Jing X Zhu H Zhang B Xu and S Ying ldquoOnthe Multiple Sources and Privacy Preservation Issues for Het-erogeneous Defect Predictionrdquo IEEE Transactions on SoftwareEngineering 1 page 2017

[26] Z Li X-Y Jing F Wu X Zhu B Xu and S Ying ldquoCost-sensitive transfer kernel canonical correlation analysis for het-erogeneous defect predictionrdquoAutomated Software Engineeringpp 1ndash45 2017

[27] Z Li X Jing X Zhu and H Zhang ldquoHeterogeneous DefectPrediction Through Multiple Kernel Learning and EnsembleLearningrdquo in Proceedings of the 2017 IEEE International Confer-ence on Software Maintenance and Evolution (ICSME) pp 91ndash102 Shanghai September 2017

[28] Z He F Peters T Menzies and Y Yang ldquoLearning from open-source projects An empirical study on defect predictionrdquo inProceedings of the 2013 ACM IEEE International Symposium onEmpirical Software Engineering and Measurement ESEM 2013pp 45ndash54 usa October 2013

[29] P He B Li D Zhang and YMa Simplification of Training Datafor Cross-Project Defect Prediction arXiv preprint 2014

[30] F Porto and A Simao ldquoFeature Subset Selection and InstanceFiltering for Cross-project Defect Prediction-Classification andRankingrdquo CLEI Electronic Journal vol 19 no 4 p 17 2016

[31] X-Y Jing FWu X Dong and B Xu ldquoAn Improved SDA BasedDefect Prediction Framework for Both Within-Project andCross-Project Class-Imbalance Problemsrdquo IEEE Transactionson Software Engineering vol 43 no 4 pp 321ndash339 2017

[32] D Ryu J-I Jang and J Baik ldquoA hybrid instance selection usingnearest-neighbor for cross-project defect predictionrdquo Journal ofComputer Science and Technology vol 30 no 5 pp 969ndash9802015

[33] W N Poon K E Bennin J Huang P Phannachitta and J WKeung ldquoCross-project defect prediction using a credibilitytheory based naive bayes classifierrdquo in Proceedings of the 17thIEEE International Conference on Software Quality Reliabilityand Security QRS 2017 pp 434ndash441 cze July 2017

[34] M DrsquoAmbros M Lanza and R Robbes ldquoEvaluating defect pre-diction approaches a benchmark and an extensive comparisonrdquoEmpirical Software Engineering vol 17 no 4-5 pp 531ndash577 2012

[35] B Turhan A Tosun Misirli and A Bener ldquoEmpirical evalua-tion of the effects of mixed project data on learning defect pre-dictorsrdquo Information and Software Technology vol 55 no 6 pp1101ndash1118 2013

[36] J Han and M Kamber Data Mining Concepts and TechniquesElsevierMorgan KaufmannWalthamMass USA 3rd edition2012

[37] A Graf and S Borer ldquoNormalization in Support Vector Ma-chinesrdquo in Prooceedings of Dagm-Symposium on Pattern Recog-nition vol 2191 pp 277ndash282 Springer-Verlag 2001 no 7

[38] S B Kotsiantis D Kanellopoulos and P E Pintelas ldquoDataPreprocessing for Supervised Learningrdquo Enformatika vol 1 no1 pp 111ndash117 2006

[39] J Nam S J Pan and S Kim ldquoTransfer defect learningrdquo inProceedings of the 2013 35th International Conference on Soft-ware Engineering ICSE 2013 pp 382ndash391 usa May 2013

[40] M Jureczko andDD Spinellis ldquoUsingObject-OrientedDesignMetrics to Predict Software Defectsrdquo in Proceedings of theInternational Conference on Dependability of Computer SystemMonographs of System Dependability pp 69ndash81 2010

[41] MDrsquoAmbrosM Lanza and R Robbes ldquoAn extensive compari-son of bug prediction approachesrdquo inProceedings of the 7th IEEE

Working Conference on Mining Software Repositories (MSR rsquo10)pp 31ndash41 IEEE May 2010

[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.

[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, Hungary, September 2011.

[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.

[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, Germany, May 2008.

[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, Hungary, September 2011.

[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, Sweden, September 2012.

[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.

[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.

[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.

[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.

[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
