Research Article

An Improved Method for Cross-Project Defect Prediction by Simplifying Training Data
Peng He,1,2 Yao He,1 Lvjun Yu,1 and Bing Li3
1 School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
2 Hubei Province Engineering Technology Research Center for Educational Informationization, Wuhan 430062, China
3 School of Computer, Wuhan University, Wuhan 430072, China
Correspondence should be addressed to Peng He; penghe@whu.edu.cn
Received 10 December 2017; Revised 23 February 2018; Accepted 15 April 2018; Published 7 June 2018
Academic Editor: Dingli Yu
Copyright © 2018 Peng He et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cross-project defect prediction (CPDP) on projects with limited historical data has attracted much attention. To the best of our knowledge, however, the performance of existing approaches is usually poor because of low-quality cross-project training data. The objective of this study is to propose an improved method for CPDP by simplifying training data, labeled as TDSelector, which considers both the similarity and the number of defects that each training instance has (denoted by defects), and to demonstrate the effectiveness of the proposed method. Our work consists of three main steps. First, we constructed TDSelector in terms of a linear weighted function of instances' similarity and defects. Second, the basic defect predictor used in our experiments was built by using the Logistic Regression classification algorithm. Third, we analyzed the impacts of different combinations of similarity and the normalization of defects on prediction performance and then compared with two existing methods. We evaluated our method on 14 projects collected from two public repositories. The results suggest that the proposed TDSelector method performs, on average, better than both baseline methods, and the AUC values are increased by up to 10.6% and 4.3%, respectively. That is, the inclusion of defects is indeed helpful to select high-quality training instances for CPDP. On the other hand, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. An additional experiment also shows that selecting those instances with more bugs directly as training data can further improve the performance of the bug predictor trained by our method.
1 Introduction
Software defect prediction is one of the most active research topics in Software Engineering. Most early studies usually trained predictors (also known as prediction models) from the historical data on software defects/bugs in the same software project and predicted defects in its upcoming release versions [1]. This approach is referred to as Within-Project Defect Prediction (WPDP). However, WPDP has an obvious drawback when a project has limited historical defect data.
To address the above issue, researchers in this field have attempted to apply defect predictors built for one project to other projects [2–7]. This method is termed Cross-Project Defect Prediction (CPDP). The main purpose of CPDP is to predict defect-prone instances (such as classes) in a project based on the defect data collected from other projects on public software repositories like PROMISE (http://openscience.us/repo). The feasibility and potential usefulness of cross-project predictors built with a number of software metrics have been validated [1, 3, 5, 6], but how to improve the performance of CPDP models is still an open issue.
Peters et al. [5] argued that selecting appropriate training data from a software repository became a major issue for CPDP. Moreover, some researchers also suggested that the success rate of CPDP models could be drastically improved when using a suitable training dataset [1, 7]. That is to say, the selection of quality training data could be a key breakthrough on the above issue. Thus, the construction of an appropriate training dataset gathered from a large number of projects on public software repositories is indeed a challenge for CPDP [7].
Hindawi, Mathematical Problems in Engineering, Volume 2018, Article ID 2650415, 18 pages. https://doi.org/10.1155/2018/2650415
As far as we know, although previous studies on CPDP have taken different types of software metrics into account during the process of selecting relevant training samples, none of them considered the number of defects contained in each sample (denoted by defects). But in fact, we argue that it is also an important factor to consider. Fortunately, some studies have empirically demonstrated the relevance of defects to prediction. For example, "modules with faults in the past are likely to have faults in the future" [8], "17% to 54% of the high-fault files of release i are still high-fault in release i + 1" [9], "cover 73–95% of faults by selecting 10% of the most fault-prone source code files" [10], and "the number of defects found in the previous release of a file correlates with its current defect count on a high level" [11].
Does the selection of training data considering defects improve the performance of CPDP models? If the answer is "Yes", on the one hand, it is helpful to validate the feasibility of CPDP; on the other hand, it will contribute to better software defect predictors by making full use of those defect datasets available on the Internet.
The objective of our work is to propose an improved method of training data selection for CPDP by introducing the information of defects. Unlike the prior studies similar to our work, such as [5, 12], which focus mainly on the similarity between instances from training set and test set, this paper gives a comprehensive account of two factors, namely, similarity and defects. Moreover, the proposed method, called TDSelector, can automatically optimize their weights to achieve the best result. In brief, our main contributions to the current state of research on CPDP are summarized as follows.
(1) Considering both similarity and defects, we proposed a simple and easy-to-use training data selection method for CPDP (i.e., TDSelector), which is based on an improved scoring scheme that ranks all possible training instances. In particular, we designed an algorithm to calculate their weights automatically, so as to obtain the best prediction result.
(2) To validate the effectiveness of our method, we conducted an elaborate empirical study based on 15 datasets collected from PROMISE and AEEEM (http://bug.inf.usi.ch), and the experimental results show that in a specific CPDP scenario (i.e., many-to-one [13]), the TDSelector-based defect predictor outperforms its rivals that were built with two competing methods in terms of prediction precision.
With these technical contributions, our study could complement previous work on CPDP with respect to training data selection. In particular, we provide a reasonable scoring scheme as well as a more comprehensive guideline for developers to choose appropriate training data to train a defect predictor in practice.
The rest of this paper is organized as follows. In Section 2, we review the related work on this topic. Section 3 presents the preliminaries to our work. Section 4 describes the proposed method, TDSelector. Section 5 introduces our experimental setup, and Section 6 shows the primary experimental results. A detailed discussion of some issues, including potential threats to the validity of our study, is presented in Section 7. In the end, Section 8 summarizes this paper and presents our future work.
2 Related Work
2.1 Cross-Project Defect Prediction. Many studies were carried out to validate the feasibility of CPDP in the last five years. For example, Turhan et al. [12] proposed a cross-company defect prediction approach using defect data from other companies to build predictors for target projects. They found that the proposed method increased the probability of defect detection at the cost of an increased false positive rate. Ni et al. [14] proposed a novel method called FeSCH and designed three ranking strategies to choose appropriate features. The experimental results show that FeSCH can outperform WPDP, ALL, and TCA+ in most cases, and its performance is independent of the used classifiers. He et al. [15] compared the performance between CPDP and WPDP using feature selection techniques. The results indicated that, for reduced training data, WPDP obtained higher precision, but CPDP in turn achieved a better recall or F-measure. Some researchers have also studied the performance of CPDP based on ensemble classifiers and then validated their effects on this issue [16, 17].
Ryu et al. [18] proposed a transfer cost-sensitive boosting method by considering both distributional characteristics and the class imbalance for CPDP. The results show that their method significantly improves CPDP performance. They also [19] proposed a multiobjective naive Bayes learning technique under CPDP environments by taking into account the class-imbalance contexts. The results indicated that their approaches performed better than the single-objective ones and WPDP models. Li et al. [20] compared some famous data filters and proposed a method called HSBF (hierarchical select-based filter) to improve the performance of CPDP. The results demonstrate that the data filter strategy can indeed improve the performance of CPDP significantly. Moreover, when using an appropriate data filter strategy, the defect predictor built from cross-project data can outperform the predictor learned by using within-project data.
Zhang et al. [21] proposed a universal CPDP model, which was built using a large number of projects collected from SourceForge (https://sourceforge.net) and Google Code (https://code.google.com). Their experimental results showed that it was indeed comparable to WPDP. Furthermore, CPDP is feasible for different projects that have heterogeneous metric sets. He et al. [22] first proposed a CPDP-IFS approach based on the distribution characteristics of both source and target projects to overcome this problem. Nam and Kim [23] then proposed an improved method called HDP, where metric selection and metric matching were introduced to build a defect predictor. Their empirical study on 28 projects showed that about 68% of predictions using the proposed approach outperformed or were comparable to WPDP with statistical significance. Jing et al. [24] proposed a unified metric representation (UMR) for heterogeneous defect data; the experiments on 14 public heterogeneous datasets from four different companies indicated that the proposed approach was more effective in addressing the problem. More research can be found in [25–27].
Mathematical Problems in Engineering 3
2.2 Training Data Selection for CPDP. As mentioned in [5, 28], a fundamental issue for CPDP is to select the most appropriate training data for building quality defect predictors. He et al. [29] discussed this problem in detail from the perspective of data granularity, i.e., release level and instance level. They presented a two-step method for training data selection. The results indicated that the predictor built based on naive Bayes could achieve fairly good performance when using the method together with the Peter filter [5]. Porto and Simao [30] proposed an Instance Filtering method by selecting the most similar instances from the training dataset, and the experimental results on 36 versions of 11 open-source projects show that the defect predictor built from cross-project data selected by Feature Selection and Instance Filtering can have generally better performances both in classification and in ranking.
With regard to the data imbalance problem of defect datasets, Jing et al. [31] introduced an effective feature learning method called SDA to provide effective solutions for class-imbalance problems of both within-project and cross-project types, by employing the semisupervised transfer component analysis (SSTCA) method to make the distributions of source and target data consistent. The results indicated that their method greatly improved WPDP and CPDP performance. Ryu et al. [32] proposed a method of hybrid instance selection using nearest neighbor (HISNN). Their results suggested that those instances which had strong local knowledge could be identified via nearest neighbors with the same class label. Poon et al. [33] proposed a credibility theory based naive Bayes (CNB) classifier to establish a novel reweighting mechanism between the source projects and target projects, so that the source data could simultaneously adapt to the target data distribution and retain its own pattern. The experimental results demonstrate the significant improvement achieved by CNB over other CPDP approaches in terms of the performance metrics considered.
The above-mentioned existing studies aimed at reducing the gap in prediction performance between WPDP and CPDP. Although they are making progress towards the goal, there is clearly a lot of room for improvement. For this reason, in this paper we proposed a selection approach to training data based on an improved strategy for instance ranking, instead of the single strategy for similarity calculation that was used in many prior studies [1, 5, 7, 12].
3 Preliminaries
In our context, a defect dataset S contains m instances, which is represented as S = {I_1, I_2, ..., I_m}. Instance I_i is an object class represented as I_i = {f_i1, f_i2, ..., f_in}, where f_ij is the jth metric value of instance I_i and n is the number of metrics (also known as features). Given a source dataset S_s and a target dataset S_t, CPDP aims to perform a prediction in S_t using the knowledge extracted from S_s, where S_s ≠ S_t (see Figure 1(a)). In this paper, source and target datasets have the same set of metrics, and they may differ in the distributional characteristics of metric values.
To improve the performance of CPDP, several strategies used to select appropriate training data have been put forward (see Figure 1(b)); e.g., Turhan et al. [12] filtered out those irrelevant training instances by returning the k-nearest neighbors for each test instance.
3.1 An Example of Training Data Selection. First, we introduce a typical method for training data selection at the instance level, and a simple example is used to illustrate this method. For the strategy at other levels of training data selection, such as the release level, please refer to [7].
Figure 2 shows a training set S_s (including five instances) and a test set S_t (including one instance). Here, each instance contains four metrics and a classification label (i.e., 0 or 1). An instance is defect-free (label = 0) only if its number of defects equals 0; otherwise, it is defective (label = 1). According to the k-nearest neighbor method based on Euclidean distance, we can rank all five training instances in terms of their distances from the test instance. Due to the same nearest distance from the test instance I_test, it is clear that the three instances I_1, I_2, and I_5 are suitable for use as training instances when k is set to 1. Of the three instances, I_2 and I_5 have the same metric values, but I_2 is labeled as a defective instance because it contains a bug. In this case, I_1 will be selected with the same probability as that of I_2, regardless of the number of defects they include.
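The distance-based ranking in this example can be reproduced with a short sketch (not the authors' code); the metric values below are transcribed from Figure 2, and the Euclidean distance is computed directly:

```python
import math

# Training instances from Figure 2: name -> (metric vector, label, defects)
train = {
    "I1": ([0.1, 0.0, 0.5, 0.0], 1, 3),
    "I2": ([0.1, 0.0, 0.0, 0.5], 1, 1),
    "I3": ([0.4, 0.3, 0.0, 0.1], 0, 0),
    "I4": ([0.0, 0.0, 0.4, 0.0], 0, 0),
    "I5": ([0.1, 0.0, 0.0, 0.5], 0, 0),
}
test = [0.1, 0.0, 0.5, 0.5]  # the test instance I_test

def euclidean(x, y):
    # Euclidean distance between two metric vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Rank all five training instances by their distance from the test instance
ranked = sorted(train, key=lambda name: euclidean(train[name][0], test))
distances = {name: round(euclidean(train[name][0], test), 3) for name in train}
# I1, I2, and I5 tie at distance 0.5, followed by I4 and then I3
```

Note that I_1, I_2, and I_5 are indistinguishable by distance alone, which is exactly the tie that motivates including defects in the ranking.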
In this way, those instances most relevant to the test one can be quickly determined. Clearly, the goal of training data selection is to preserve the representative training instances in S_s as much as possible.
3.2 General Process of Training Data Selection. Before presenting our approach, we describe a general selection process of training data, which consists of three main steps: TDS (training dataset) setup, ranking, and duplicate removal.
TDS Setup. For each target project with little historical data, we need to set up an initial TDS, where training data are collected from other projects. To simulate this scenario of CPDP in this paper, any defect data from the target project must be excluded from the initial TDS. Note that different release versions of a project actually belong to the same project. A simple example is visualized in Figure 3.
Ranking. Once the initial TDS is determined, an instance will be treated as a metric vector I, as mentioned above. For each test instance, one can calculate its relevance to each training instance and then rank these training instances in terms of their similarity based on software metrics. Note that a wide variety of software metrics, such as source code metrics, process metrics, previous defects, and code churn, have been used as features for CPDP approaches to improve their prediction performance.
Duplicate Removal. Let l be the size of the test set. For each test instance, if we select its k-nearest neighbors from the initial TDS, there are a total of k × l candidate training instances. Considering that these selected instances may not be unique (i.e., a training instance can be the nearest neighbor of multiple test instances), after removing the duplicate ones, they form the final training set, which is a subset of the initial TDS.
Figure 1: Two CPDP scenarios used in this paper: (a) general CPDP, where a predictor is trained on the training dataset S_s and tested on the target dataset S_t; (b) improved CPDP using training data selection, where strategies for instance selection reduce S_s to a smaller training dataset S_s' before the predictor is trained.
Figure 2: An example of the selection of training instances. The training set S_s contains five instances and the test set S_t contains one instance:

Instance   f1    f2    f3    f4    Label (defects)
I_1        0.1   0     0.5   0     1 (3)
I_2        0.1   0     0     0.5   1 (1)
I_3        0.4   0.3   0     0.1   0
I_4        0     0     0.4   0     0
I_5        0.1   0     0     0.5   0
I_test     0.1   0     0.5   0.5   —

Ranked by distance(I_i, I_test): rank 1: I_1, I_2, I_5; rank 2: I_4; rank 3: I_3.
Figure 3: The overall structure of TDSelector for CPDP. In the TDS setup step, labeled datasets (e.g., Project A-v1, A-v2, and B-v1) form the training set, while the target project (e.g., Project C-v1) serves as the test set; training instances are then ranked by TDSelector according to the similarity of software metrics and the defects of each instance, and the ranked result is used to build the defect predictor.
4 Our Approach TDSelector
To improve the prediction performance of CPDP, we leverage the following observations.
Similar Instances. Given a test instance, we can examine its similar training instances that were labeled before. The defect proneness shared by similar training instances could help us identify the probability that a test instance is defective. Intuitively, two instances are more likely to have the same state if their metric values are very similar.
Number of Defects (defects). During the selection process, when several training instances have the same distance from a test instance, we need to determine which one should be ranked higher. According to our experience in software defect prediction and other researchers' studies on the quantitative analysis of previous defect prediction approaches [34, 35], we believe that more attention should be paid to those training instances with more defects in practice.
The selection of training data based on instance similarity has been used in some prior studies [5, 12, 35]. However, to the best of our knowledge, the information about defects has not been fully utilized. So, in this paper, we attempt to propose a training data selection approach combining such information and instance similarity.
4.1 Overall Structure of TDSelector. Figure 3 shows the overall structure of the proposed approach to training data selection, named TDSelector. Before selecting appropriate training data for CPDP, we have to set up a test set and its corresponding initial TDS. For a given project treated as the test set, all the other projects (except the target project) available at hand are used as the initial TDS. This is the so-called many-to-one (M2O) scenario for CPDP [13]. It is quite different from the typical O2O (one-to-one) scenario, where only one randomly selected project is treated as the training set for a given target project (namely, the test set).
When both of the sets are given, the ranks of training instances are calculated based on the similarity of software metrics and then returned for each test instance. For the initial TDS, we also collect each training instance's defects and thus rank these instances by their defects. Then, we rate each training instance by combining the two types of ranks in some way and identify the top-k training instances for each test instance according to their final scores. Finally, we use the predictor trained with the final TDS to predict defect proneness in the test set. We describe the core component of TDSelector, namely, the scoring scheme, in the following subsection.
4.2 Scoring Scheme. Each instance in the training set and test set is treated as a vector of features (namely, software metrics), and we calculate the similarity between instances in terms of a similarity index (such as cosine similarity, Euclidean distance, or Manhattan distance, as shown in Table 1). Training instances are then ranked by the similarity between each of them and a given test instance.
For instance, the cosine similarity between a training instance I_p and the target instance I_q is computed via their vector representations, described as follows:

Sim(I_p, I_q) = \frac{\vec{I}_p \cdot \vec{I}_q}{\|\vec{I}_p\| \times \|\vec{I}_q\|} = \frac{\sum_{i=1}^{n} (f_{pi} \times f_{qi})}{\sqrt{\sum_{i=1}^{n} f_{pi}^2} \times \sqrt{\sum_{i=1}^{n} f_{qi}^2}},    (1)

where \vec{I}_p and \vec{I}_q are the metric vectors for I_p and I_q, respectively, and f_{*i} represents the ith metric value of instance I_*.
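As an illustration, Eq. (1) translates into a few lines of Python (a minimal sketch, not the authors' implementation):

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two metric vectors, as in Eq. (1)."""
    dot = sum(fp * fq for fp, fq in zip(p, q))
    norm_p = math.sqrt(sum(f * f for f in p))
    norm_q = math.sqrt(sum(f * f for f in q))
    return dot / (norm_p * norm_q)

# Vectors pointing in the same direction score 1; orthogonal vectors score 0
```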
Additionally, for each training instance, we also consider the factor defects in order to further enrich the ranking of its relevant instances. The assumption here is that the more the previous defects, the richer the information of an instance. So we propose a scoring scheme to rank those candidate training instances, defined as below:
Score(I_p, I_q) = \alpha \cdot Sim(I_p, I_q) + (1 - \alpha) \cdot N(defect_p),    (2)

where defect_p represents the defects of I_p, α is a weighting factor (0 ≤ α ≤ 1) that is learned from training data using Algorithm 1 (see Algorithm 1), and N(defect_p) is a function used to normalize defects with values ranging from 0 to 1.
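A sketch of the scoring function in Eq. (2) follows; the defect-count range used by `linear_norm` is a hypothetical example, since the actual range depends on the candidate TDS at hand:

```python
def score(sim, defects, alpha, normalize):
    """Eq. (2): weighted blend of similarity and normalized defect count."""
    return alpha * sim + (1 - alpha) * normalize(defects)

def linear_norm(x, x_min=0.0, x_max=10.0):
    # hypothetical defect-count range, for illustration only
    return (x - x_min) / (x_max - x_min)

# alpha = 1 reduces to similarity-only selection (the NoD baseline);
# alpha = 0 ranks candidates by their defect counts alone
```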
Optimizing the parameter α
Input:
(1) Candidate TDS S_s = {I_s1, I_s2, ..., I_sm}, test set S_t = {I_t1, I_t2, ..., I_tl} (m > l)
(2) defects = {defect(I_s1), defect(I_s2), ..., defect(I_sm)}, and k = 10
Output:
(3) α (α ∈ [0, 1])
Method:
(4) Initialize α = 0, S_s(α) = ∅
(5) While (α ≤ 1) do
(6)   For i = 1; i ≤ l; i++
(7)     For j = 1; j ≤ m; j++
(8)       Score(I_ti, I_sj) = α · Sim(I_ti, I_sj) + (1 − α) · N(defect(I_sj))
(9)     End For
(10)    descSort({Score(I_ti, I_sj) | j = 1, ..., m})  // sort m training instances in descending order
(11)    S_s(α) = S_s(α) ∪ {top-k training instances}  // select the top k instances
(12)  End For
(13)  AUC ⇐ S_s(α) −CPDP→ S_t  // prediction result
(14)  α = α + 0.1
(15) End While
(16) Return (α | max_α AUC)

Algorithm 1: Algorithm of parameter optimization.
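Algorithm 1 amounts to a grid search over α. The sketch below (not the authors' code) uses an inverse-distance form of Euclidean similarity as an assumption, since any index from Table 1 could be plugged in, and `evaluate` is a stub standing in for training a CPDP predictor on the selected TDS and measuring AUC on the test set:

```python
import math

def euclidean_sim(x, y):
    # inverse-distance similarity derived from the Euclidean distance in Table 1
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))

def select_training_set(train, test_set, alpha, k, norm):
    """Inner loops of Algorithm 1: the union of the top-k scored candidates
    for every test instance (duplicates collapse in the set union).
    `train` maps instance name -> (metric vector, defect count)."""
    selected = set()
    for t in test_set:
        ranked = sorted(
            train,
            key=lambda name: alpha * euclidean_sim(train[name][0], t)
            + (1 - alpha) * norm(train[name][1]),
            reverse=True,
        )
        selected.update(ranked[:k])
    return selected

def optimize_alpha(train, test_set, k, norm, evaluate):
    """Outer loop of Algorithm 1: try alpha = 0.0, 0.1, ..., 1.0 and keep
    the value that maximizes the AUC reported by `evaluate`."""
    best_alpha, best_auc = 0.0, -1.0
    for step in range(11):
        alpha = step / 10
        auc = evaluate(select_training_set(train, test_set, alpha, k, norm))
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha
```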
Table 1: Similarity indexes and normalization methods used in this paper.

Similarity:
  Cosine: cos(X, Y) = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2} \sqrt{\sum_{k=1}^{n} y_k^2}}
  Euclidean distance: d(X, Y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
  Manhattan distance: d(X, Y) = \sum_{k=1}^{n} |x_k - y_k|

Normalization:
  Linear: N(x) = \frac{x - x_{min}}{x_{max} - x_{min}}
  Logistic: N(x) = \frac{1}{1 + e^{-x}} - 0.5
  Square root: N(x) = 1 - \frac{1}{\sqrt{1 + x}}
  Logarithmic: N(x) = \log_{10}(x + 1)
  Inverse cotangent: N(x) = \arctan(x) \cdot \frac{2}{\pi}
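The five normalization functions of Table 1 translate directly into code (a sketch; x is a nonnegative defect count, and the linear form needs the minimum and maximum defect counts observed in the candidate TDS):

```python
import math

def linear(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)

def logistic(x):
    return 1 / (1 + math.exp(-x)) - 0.5

def square_root(x):
    return 1 - 1 / math.sqrt(1 + x)

def logarithmic(x):
    return math.log10(x + 1)

def inverse_cotangent(x):
    return math.atan(x) * 2 / math.pi

# Each function maps a defect count of 0 to 0 and increases monotonically
```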
Normalization is a commonly used data preprocessing technique in mathematics and computer science [36]. Graf and Borer [37] have confirmed that normalization can improve the prediction performance of classification models. For this reason, we normalize the defects of training instances when using TDSelector. As is well known, there are many normalization methods. In this study, we introduce five typical normalization methods used in machine learning [36, 38]. The descriptions and formulas of the five normalization methods are listed in Table 1.
For each test instance, the top-k training instances ranked in terms of their scores will be returned. Hence, the final TDS is composed by merging the sets of the top-k training instances for each test instance, after those duplicate instances are removed.
5 Experimental Setup
5.1 Research Questions. Our experiments were conducted to find empirical evidence that answers the following three research questions.
RQ1: Does the Consideration of Defects Improve the Performance of CPDP? Unlike the previous methods [1, 5, 7, 12, 29], TDSelector ranks candidate training instances in terms of both defects and metric-based similarity. To evaluate the effectiveness of the proposed method considering the additional information of defects, we tested TDSelector according to the experimental data described in Section 5.2. According to (2), we also empirically analyzed the impact of the parameter α on prediction results.
RQ2: Which Combination of Similarity and Normalization Is More Suitable for TDSelector? Equation (2) is comprised of two parts, namely, similarity and the normalization of defects. For each part, several commonly used methods can be adopted in our context. To take full advantage of TDSelector, one would wonder which combination of similarity and normalization should be chosen. Therefore, it is necessary to compare the effects of different combinations of similarity and normalization methods on prediction results and to determine the best one for TDSelector.
RQ3: Can TDSelector-Based CPDP Outperform the Baseline Methods? Cross-project prediction has attracted much research interest in recent years, and a few CPDP approaches using training data selection have also been proposed, e.g.,
the Peter filter based CPDP [5] (labeled as baseline1) and the TCA+ (Transfer Component Analysis) based CPDP [39] (labeled as baseline2). To answer the third question, we compared the TDSelector-based CPDP proposed in this paper with the above two state-of-the-art methods.
5.2 Data Collection. To evaluate the effectiveness of TDSelector, in this paper we used 14 open-source projects written in Java on two online public software repositories, namely, PROMISE [40] and AEEEM [41]. The data statistics of the projects in question are presented in Table 2, where Instance and Defect are the numbers of instances and defective instances, respectively, and % Defect is the proportion of defective instances to the total number of instances. Each instance in these projects represents a file of an object class and consists of two parts, namely, software metrics and defects.
The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The information of defects and 20 source code metrics for the projects on PROMISE have been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D'Ambros et al. [41], and each project on it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].
Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each of the 10 projects was selected to be the target dataset once, and then we set up a candidate TDS for CPDP that excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects are used to construct its initial TDS.
5.3 Experiment Design. To answer the three research questions, our experimental procedure, which is designed under the context of M2O in the CPDP scenario, is described as follows.
First, as in many prior studies [1, 5, 15, 35], all software metric values in the training and test sets were normalized by using the Z-score method, because these metrics differ in the scales of their numerical values. For the 14 projects on AEEEM and PROMISE, the numbers of software metrics are different, so the training set for a given test set was selected from the same repository.
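Z-score standardization, as used here, subtracts the mean of a metric column and divides by its standard deviation (a generic sketch, not tied to any particular library):

```python
import math

def z_score(values):
    """Z-score standardization of one metric column."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# After standardization the column has mean 0 and unit variance,
# so metrics with different numeric scales become comparable
```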
Second, to examine whether the consideration of defects improves the performance of CPDP, we compared our approach TDSelector with NoD, which is a baseline method considering only the similarity between instances, i.e., α = 1 in (2). Since there are three similarity computation methods used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on similarity indexes. The prediction results of each method in question for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size measure of how often the values in one distribution are larger than the values in a second distribution, to compare the results generated through our approach and its corresponding baseline method.
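Cliff's delta is straightforward to compute from its definition (a sketch for illustration; note that the study itself obtains δ by converting Cohen's d rather than computing it directly):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (all pairs).
    Ranges from -1 to 1; 1 means every x exceeds every y."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```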
Because Cliff did not suggest corresponding δ values to represent small, medium, and large effects, we converted Cohen's d effect size to Cliff's δ using the cohd2delta R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 contains descriptors for magnitudes of d = 0.01 to 2.0.
Third, according to the results of the second step of this procedure, 15 combinations based on three typical similarity methods for software metrics and five commonly used normalization functions for defects were examined by the pairwise comparison method. We then determined which combination is more suitable for our approach according to the mean, standard deviation, and Cliff's delta effect size.
Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all the 15 test sets using TDSelector and two competing methods (i.e., baseline1 and baseline2, introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.
After this process is completed, we will discuss the answers to the three research questions of our study.
5.4 Classifier and Evaluation Measure. As the underlying machine learning classifier for CPDP, Logistic Regression (LR), which was widely used in the defect prediction literature [4, 23, 39, 43–46], is also used in this study. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka). For our experiments, we used the default parameter settings for LR specified in Weka unless otherwise stated.
To evaluate the prediction performance of the different methods in this paper, we utilized the area under the Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier ranks a randomly chosen defective class higher than a randomly chosen defect-free one [47], and it is known as a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold used to decide whether an instance should be classified as a negative instance [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
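Under this probabilistic interpretation, AUC can be computed directly from the predicted scores of the defective and defect-free instances; the sketch below is equivalent to the normalized Mann-Whitney statistic, with ties counted as one half.

```python
def auc(scores_defective, scores_clean):
    """AUC as the probability that a randomly chosen defective instance is
    ranked above a randomly chosen defect-free one (ties count 0.5)."""
    wins = 0.0
    for sd in scores_defective:
        for sc in scores_clean:
            if sd > sc:
                wins += 1.0
            elif sd == sc:
                wins += 0.5
    return wins / (len(scores_defective) * len(scores_clean))
```

A perfect ranking gives 1.0, a reversed ranking gives 0.0, and identical scores give the random-predictor value 0.5.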
6. Experimental Results
6.1. Answer to RQ1. We compared our approach, which considers defects, with the baseline method NoD, which selects training data in terms of cosine similarity only. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. The average growth rates of the AUC value vary from 5.9% to 9.0%, depending on the normalization method used for defects. In addition, all the δ values in this table are greater than 0.2, which
8 Mathematical Problems in Engineering
Table 2: Data statistics of the projects used in our experiments.

Repository | Project | Version | # Instances | # Defects | % Defects
PROMISE | Ant | 1.7 | 745 | 166 | 22.3
PROMISE | Camel | 1.6 | 965 | 188 | 19.5
PROMISE | Ivy | 2.0 | 352 | 40 | 11.4
PROMISE | Jedit | 3.2 | 272 | 90 | 33.1
PROMISE | Lucene | 2.4 | 340 | 203 | 59.7
PROMISE | Poi | 3.0 | 442 | 281 | 63.6
PROMISE | Synapse | 1.2 | 256 | 86 | 33.6
PROMISE | Velocity | 1.4 | 196 | 147 | 75.0
PROMISE | Xalan | 2.6 | 885 | 411 | 46.4
PROMISE | Xerces | 1.4 | 588 | 437 | 74.3
AEEEM | Equinox | 1/1/2005-6/25/2008 | 324 | 129 | 39.8
AEEEM | Eclipse JDT core (Eclipse) | 1/1/2005-6/17/2008 | 997 | 206 | 20.7
AEEEM | Apache Lucene (Lucene2) | 1/1/2005-10/8/2008 | 692 | 20 | 2.9
AEEEM | Mylyn | 1/17/2005-3/17/2009 | 1862 | 245 | 13.2
AEEEM | Eclipse PDE UI (Pde) | 1/1/2005-9/11/2008 | 1497 | 209 | 14.0
Table 3: The mappings between different values and their effectiveness levels.

Effect size | Cohen's d | Cliff's δ
Very small | 0.01 | 0.008
Small | 0.20 | 0.147
Medium | 0.50 | 0.33
Large | 0.80 | 0.474
Very large | 1.20 | 0.622
Huge | 2.0 | 0.811
indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for these four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method achieves the biggest improvement in AUC value for Velocity (namely, 61.7%).
We then compared TDSelector with the baseline methods under the other widely used similarity calculation methods; the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. Compared with the corresponding NoD, TDSelector achieves average growth rates of the AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7. More specifically, the highest growth rate of the AUC value in Table 6 is 43.6% for Equinox, and in Table 7 it is 39.7% for Lucene2. Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of the baseline methods that do not consider defects.
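The growth rates quoted here and in Tables 5-7 are the relative AUC improvement of TDSelector over the corresponding NoD baseline; as a one-line sketch:

```python
def growth_rate(auc_tds, auc_nod):
    """Relative AUC improvement of TDSelector over NoD, in percent."""
    return 100.0 * (auc_tds - auc_nod) / auc_nod

# Example with the Jedit values from Table 5 under linear normalization:
# TDSelector 0.700 vs. NoD 0.536 gives roughly a 30.6% growth rate.
```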
Table 4: Analyzing the factors similarity and normalization.

Factor | Method | Mean | Std | δ
Similarity | Cosine similarity | 0.704 | 0.082 | -0.133
Similarity | Euclidean distance | 0.719 | 0.080 | -
Similarity | Manhattan distance | 0.682 | 0.098 | -0.193
Normalization | Linear | 0.706 | 0.087 | -0.012
Normalization | Logistic | 0.710 | 0.078 | -
Normalization | Square root | 0.699 | 0.091 | -0.044
Normalization | Logarithmic | 0.700 | 0.086 | -0.064
Normalization | Inverse cotangent | 0.696 | 0.097 | -0.056
In short, during the process of training data selection, the consideration of defects for CPDP can help us select higher quality training data, thus leading to better classification results.
6.2. Answer to RQ2. Although including defects in the selection of quality training data is helpful for better CPDP performance, it is worth noting that our method completely failed on Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Which combination of similarity and normalization, then, is most suitable for TDSelector?
First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of the normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factor.
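The five normalization functions compared in Table 4 can be sketched as follows. The paper defines the exact formulas in an earlier section, so the forms below, which map a defect count x >= 0 onto a bounded or monotone score, are illustrative assumptions rather than the authors' definitions.

```python
import math

# Assumed forms for the five normalization functions named in Table 4
# (illustrative sketches, not the paper's exact definitions).
def linear(x, x_min, x_max):
    """Min-max scaling of a defect count into [0, 1]."""
    return (x - x_min) / (x_max - x_min) if x_max > x_min else 0.0

def logistic(x):
    """Sigmoid squashing into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def square_root(x):
    """Damped growth via the square root."""
    return math.sqrt(x)

def logarithmic(x):
    """Damped growth via log10(x + 1), so that 0 maps to 0."""
    return math.log10(x + 1)

def inverse_cotangent(x):
    """Arctangent-based squashing into [0, 1)."""
    return math.atan(x) * 2.0 / math.pi
```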
Table 5: The best prediction results obtained by the CPDP approach based on TDSelector with cosine similarity. NoD represents the baseline method; "+(%)" denotes the growth rate of the AUC value. (In the original typeset table, the maximum AUC value among the normalization methods is underlined for each project, and growth rates above 10% are shown in bold.) Values are listed in the project order Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde; "-" means no improvement over NoD (α = 1).

Linear (δ = 0.338):
α: 0.7, 0.9, 0.9, 1.0, 0.9, 1.0, 0.9, 1.0, 0.6, 0.9, 0.8, 0.6, 0.7, 0.7, 0.5
AUC: 0.813, 0.676, 0.603, 0.793, 0.700, 0.611, 0.758, 0.741, 0.512, 0.742, 0.783, 0.760, 0.739, 0.705, 0.729; mean 0.711 ± 0.081
+(%): 6.3, 3.7, 1.9, -, 30.6, -, 3.0, -, 43.0, 0.3, 22.6, 39.4, 4.1, 5.9, 4.0; avg. 9.0

Logistic (δ = 0.351):
α: 0.7, 0.5, 0.7, 1.0, 0.7, 0.6, 0.6, 0.6, 0.5, 0.5, 0.0, 0.4, 0.7, 0.5, 0.5
AUC: 0.802, 0.674, 0.595, 0.793, 0.665, 0.621, 0.759, 0.765, 0.579, 0.745, 0.773, 0.738, 0.712, 0.707, 0.740; mean 0.711 ± 0.070
+(%): 4.8, 3.4, 0.5, -, 24.1, 1.6, 3.1, 3.2, 61.7, 0.7, 21.0, 35.5, 0.3, 6.2, 5.6; avg. 9.0

Square root (δ = 0.249):
α: 0.7, 0.7, 0.6, 0.6, 0.7, 0.6, 0.7, 0.9, 0.5, 1.0, 0.4, 0.6, 0.6, 0.6, 0.6
AUC: 0.799, 0.654, 0.596, 0.807, 0.735, 0.626, 0.746, 0.762, 0.500, 0.740, 0.774, 0.560, 0.722, 0.700, 0.738; mean 0.697 ± 0.091
+(%): 4.4, 0.3, 0.7, 1.8, 37.1, 2.5, 1.4, 2.8, 39.7, -, 21.0, 2.8, 1.7, 5.3, 5.3; avg. 6.9

Logarithmic (δ = 0.351):
α: 0.6, 0.6, 0.9, 1.0, 0.7, 1.0, 0.7, 0.7, 0.5, 0.9, 0.5, 0.5, 0.6, 0.6, 0.6
AUC: 0.798, 0.662, 0.594, 0.793, 0.731, 0.611, 0.748, 0.744, 0.500, 0.758, 0.774, 0.700, 0.755, 0.702, 0.741; mean 0.707 ± 0.083
+(%): 4.3, 1.5, 0.3, -, 36.4, -, 1.6, 0.4, 39.7, 2.4, 21.2, 28.5, 6.3, 5.5, 5.8; avg. 8.5

Inverse cotangent (δ = 0.213):
α: 0.7, 1.0, 1.0, 1.0, 0.7, 1.0, 0.7, 1.0, 0.6, 0.7, 0.0, 0.7, 0.7, 0.7, 0.7
AUC: 0.798, 0.652, 0.592, 0.793, 0.659, 0.611, 0.749, 0.741, 0.500, 0.764, 0.773, 0.556, 0.739, 0.695, 0.734; mean 0.690 ± 0.092
+(%): 4.3, -, -, -, 22.9, -, 1.8, -, 39.7, 3.2, 21.0, 2.1, 4.1, 4.4, 4.8; avg. 5.9

NoD (α = 1):
AUC: 0.765, 0.652, 0.592, 0.793, 0.536, 0.611, 0.736, 0.741, 0.358, 0.740, 0.639, 0.543, 0.709, 0.665, 0.701; mean 0.652 ± 0.113
Table 6: The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance. Values are listed in the project order Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde; "-" means no improvement over NoD (α = 1).

Linear (δ = 0.369):
α: 0.9, 0.9, 1.0, 0.9, 0.9, 0.8, 1.0, 1.0, 0.8, 0.8, 0.0, 0.6, 1.0, 0.8, 0.8
AUC: 0.795, 0.727, 0.598, 0.826, 0.793, 0.603, 0.714, 0.757, 0.545, 0.775, 0.773, 0.719, 0.722, 0.697, 0.744; mean 0.719 ± 0.080
+(%): 1.3, 6.8, -, 0.9, 32.2, 1.9, -, -, 11.7, 5.2, 17.6, 43.0, -, 1.1, 9.6; avg. 7.7

Logistic (δ = 0.360):
α: 0.7, 0.8, 0.4, 0.7, 0.7, 0.5, 0.6, 0.9, 0.9, 0.9, 0.0, 0.7, 1.0, 1.0, 0.9
AUC: 0.787, 0.750, 0.603, 0.832, 0.766, 0.613, 0.716, 0.767, 0.556, 0.745, 0.773, 0.698, 0.722, 0.690, 0.730; mean 0.717 ± 0.075
+(%): 0.3, 10.1, 0.8, 1.6, 27.7, 3.5, 0.3, 1.3, 13.9, 1.1, 17.6, 38.8, -, -, 7.5; avg. 7.2

Square root (δ = 0.342):
α: 0.7, 0.8, 1.0, 0.7, 0.8, 0.6, 0.7, 0.7, 0.7, 1.0, 0.7, 0.8, 1.0, 1.0, 0.9
AUC: 0.796, 0.743, 0.598, 0.820, 0.720, 0.618, 0.735, 0.786, 0.564, 0.737, 0.774, 0.696, 0.722, 0.690, 0.750; mean 0.715 ± 0.076
+(%): 1.4, 9.1, -, 0.1, 20.0, 4.4, 2.9, 3.8, 15.6, -, 17.8, 38.4, -, -, 10.5; avg. 7.0

Logarithmic (δ = 0.324):
α: 0.7, 0.8, 1.0, 1.0, 0.8, 0.6, 1.0, 1.0, 0.9, 0.9, 0.9, 0.8, 1.0, 1.0, 0.9
AUC: 0.794, 0.746, 0.598, 0.819, 0.722, 0.607, 0.714, 0.757, 0.573, 0.739, 0.778, 0.722, 0.722, 0.690, 0.748; mean 0.715 ± 0.072
+(%): 1.1, 9.5, -, -, 20.3, 2.5, -, -, 17.4, 0.3, 18.5, 43.6, -, -, 10.3; avg. 7.0

Inverse cotangent (δ = 0.280):
α: 0.8, 0.9, 0.6, 0.8, 0.8, 0.7, 1.0, 0.8, 0.6, 0.7, 0.0, 0.9, 0.9, 1.0, 0.9
AUC: 0.796, 0.749, 0.603, 0.820, 0.701, 0.623, 0.714, 0.787, 0.538, 0.750, 0.773, 0.589, 0.763, 0.690, 0.722; mean 0.708 ± 0.084
+(%): 1.4, 10.0, 0.8, 0.1, 16.8, 5.2, -, 4.0, 10.2, 1.8, 17.6, 17.0, 5.6, -, 6.4; avg. 5.9

NoD (α = 1):
AUC: 0.785, 0.681, 0.598, 0.819, 0.600, 0.592, 0.714, 0.757, 0.488, 0.737, 0.657, 0.503, 0.722, 0.690, 0.678; mean 0.668 ± 0.096
Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance. Values are listed in the project order Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde; "-" means no improvement over NoD (α = 1).

Linear (δ = 0.187):
α: 0.8, 0.9, 0.9, 1.0, 0.9, 0.9, 1.0, 1.0, 0.8, 1.0, 0.0, 0.8, 0.9, 1.0, 1.0
AUC: 0.804, 0.753, 0.599, 0.816, 0.689, 0.626, 0.695, 0.748, 0.500, 0.749, 0.773, 0.633, 0.692, 0.695, 0.668; mean 0.696 ± 0.084
+(%): 1.3, 7.0, 0.3, -, 7.3, 6.3, -, -, 7.8, -, 11.6, 19.0, 39.7, -, -; avg. 5.6

Logistic (δ = 0.249):
α: 0.7, 0.7, 0.8, 0.8, 0.8, 0.7, 0.7, 0.9, 0.6, 0.7, 0.0, 0.9, 0.9, 1.0, 1.0
AUC: 0.799, 0.760, 0.607, 0.830, 0.674, 0.621, 0.735, 0.794, 0.520, 0.756, 0.773, 0.680, 0.559, 0.695, 0.668; mean 0.705 ± 0.084
+(%): 0.6, 8.0, 1.7, 1.7, 5.0, 5.4, 5.8, 6.1, 12.1, 0.9, 11.6, 27.9, 12.7, -, -; avg. 6.9

Square root (δ = 0.164):
α: 0.9, 0.9, 0.9, 1.0, 0.8, 0.8, 0.9, 0.8, 0.9, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0
AUC: 0.795, 0.755, 0.604, 0.816, 0.693, 0.627, 0.704, 0.750, 0.510, 0.749, 0.773, 0.532, 0.523, 0.695, 0.668; mean 0.680 ± 0.100
+(%): 0.1, 7.2, 1.2, -, 7.9, 6.5, 1.3, 0.3, 9.9, -, 11.6, -, 4.6, -, -; avg. 3.1

Logarithmic (δ = 0.116):
α: 1.0, 0.9, 0.9, 1.0, 0.9, 1.0, 1.0, 0.8, 0.9, 0.9, 0.0, 1.0, 0.0, 1.0, 1.0
AUC: 0.794, 0.755, 0.603, 0.816, 0.664, 0.589, 0.695, 0.763, 0.524, 0.756, 0.773, 0.532, 0.523, 0.695, 0.668; mean 0.677 ± 0.102
+(%): -, 7.2, 1.0, -, 3.4, -, -, 2.0, 12.9, 0.9, 11.6, -, 4.6, -, -; avg. 2.7

Inverse cotangent (δ = 0.133):
α: 1.0, 0.9, 0.9, 0.9, 0.9, 0.8, 0.9, 1.0, 0.7, 0.8, 0.0, 1.0, 0.0, 1.0, 1.0
AUC: 0.794, 0.749, 0.608, 0.821, 0.667, 0.609, 0.710, 0.748, 0.500, 0.758, 0.773, 0.532, 0.523, 0.695, 0.668; mean 0.677 ± 0.103
+(%): -, 6.4, 1.8, 0.6, 3.9, 3.4, 2.2, -, 7.8, 1.2, 11.6, -, 4.6, -, -; avg. 2.7

NoD (α = 1):
AUC: 0.794, 0.704, 0.597, 0.816, 0.642, 0.589, 0.695, 0.748, 0.464, 0.749, 0.693, 0.532, 0.500, 0.695, 0.668; mean 0.659 ± 0.105
Figure 4: A guideline for choosing suitable similarity indexes and normalization methods from two aspects, similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow: (1) Euclidean distance, then cosine similarity, then "Manhattan + Logistic"; (2) logistic normalization, then linear normalization, then "Manhattan + Logistic".
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the fact that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two aspects is presented in Figure 4.
Then we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (cosine similarity + logistic normalization), and "Manhattan + Logistic" (Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every pair of combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. Our approach is on average better than both baseline methods, as indicated by the average growth rates of the AUC value (10.6% and 4.3%, respectively) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of the AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained by the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects is more fully utilized because of the instances with more defects. This result further validates that the selection of training data considering defects is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
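The two ratios reported in Table 10 can be computed directly from the per-instance defect counts of a simplified training dataset; a small sketch (our illustration, with counts as plain integers):

```python
def defect_ratios(defect_counts):
    """Given per-instance defect counts, return (defective instances / all
    instances, multi-defect instances / defective instances), cf. Table 10."""
    defective = [c for c in defect_counts if c > 0]
    r_defective = len(defective) / len(defect_counts)
    r_multi = sum(1 for c in defective if c > 1) / len(defective)
    return r_defective, r_multi
```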
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
7. Discussion
7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step size of 1. As shown in Figure 5, for the three combinations in question, selecting only the k nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. Columns are ordered as cosine similarity (Linear, Logistic, Square root, Logarithmic, Inverse cotangent), Euclidean distance (same five), and Manhattan distance (same five); "-" marks the comparison of a combination with itself.

Cosine + Linear: -, 0.018, 0.084, 0.000, 0.116, -0.049, -0.036, -0.004, -0.013, -0.009, 0.138, 0.049, 0.164, 0.178, 0.169
Euclidean + Linear: 0.049, 0.102, 0.111, 0.062, 0.164, -, 0.036, 0.040, 0.058, 0.089, 0.209, 0.102, 0.249, 0.276, 0.244
Manhattan + Logistic: -0.049, -0.022, 0.022, -0.013, 0.111, -0.102, -0.076, -0.080, -0.049, -0.031, 0.053, -, 0.124, 0.151, 0.147
Table 9: A comparison between our approach and two baseline methods on the datasets from PROMISE and AEEEM. The comparison is based on the best prediction results of all three methods in question; "+(%)" columns give the growth rate of the AUC value of TDSelector ("Euclidean + Linear") over each baseline.

Test set | Baseline1 | Baseline2 | +(%) vs. baseline1 | +(%) vs. baseline2
Ant | 0.785 | 0.803 | 1.3 | -1.0
Xalan | 0.657 | 0.675 | 10.7 | 7.7
Camel | 0.595 | 0.624 | 0.5 | -4.2
Ivy | 0.789 | 0.802 | 4.7 | 3.0
Jedit | 0.694 | 0.782 | 14.3 | 1.4
Lucene | 0.608 | 0.701 | -0.8 | -14.0
Poi | 0.691 | 0.789 | 3.3 | -9.5
Synapse | 0.740 | 0.748 | 2.3 | 1.2
Velocity | 0.330 | 0.331 | 65.2 | 64.7
Xerces | 0.714 | 0.753 | 8.5 | 2.9
Eclipse | 0.706 | 0.744 | 10.2 | 4.6
Equinox | 0.587 | 0.720 | 23.1 | 0.3
Lucene2 | 0.705 | 0.724 | 2.5 | -0.2
Mylyn | 0.631 | 0.646 | 9.3 | 6.8
Pde | 0.678 | 0.737 | 10.4 | 1.5
Avg. | 0.663 | 0.705 | 10.6 | 4.3

Cliff's delta: baseline1 vs. TDSelector, δ = -0.409; baseline2 vs. TDSelector, δ = -0.009.
Table 10: Comparison of the defective instances of the simplified training dataset obtained by different methods on the Velocity project.

Method | defective instances / instances | instances (defects > 1) / defective instances
Baseline1 | 0.375 | 0.247
Baseline2 | 0.393 | 0.291
TDSelector | 0.376 | 0.487
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE, for the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear".
when k is equal to 10. The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
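The role of k can be made concrete with a sketch of the selection step we assume TDSelector performs: rank the candidate training instances by similarity to each test instance, keep the top k, and merge the per-test-instance picks without duplicates. This is an illustration with a toy similarity function, not the authors' exact implementation.

```python
def top_k_candidates(test_inst, candidates, similarity, k=10):
    """Return the k candidate training instances most similar to one test instance."""
    ranked = sorted(candidates, key=lambda c: similarity(test_inst, c), reverse=True)
    return ranked[:k]

def build_training_set(test_set, candidates, similarity, k=10):
    """Union of each test instance's k nearest candidates, duplicates removed
    while preserving first-seen order."""
    selected = []
    for t in test_set:
        for c in top_k_candidates(t, candidates, similarity, k):
            if c not in selected:
                selected.append(c)
    return selected
```

With k = 10, as in the experiments, up to 10 candidates per test instance enter the training set before scoring.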
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM, for the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear".
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether directly selecting the defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), most of the 15 releases consist largely of instances with no more than two bugs. On the other hand, the ratio of instances that have more than three defects to the total number of instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs. (a) is shown from the viewpoint of a single dataset (release), while (b) is shown from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected according to (2). All instances from the two parts then form the final TDS after removing redundant ones.
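The two-stage selection just described can be sketched as follows; the score function stands in for (2), and the data layout (dicts with a "defects" field) is an assumption for illustration.

```python
def tdselector3(candidates, score, threshold=3):
    """Two-stage TDSelector-3 sketch: instances with at least `threshold` bugs
    are taken directly; the remaining candidates are ranked by the TDSelector
    score function (standing in for (2)) before the final cut."""
    direct = [c for c in candidates if c["defects"] >= threshold]
    rest = [c for c in candidates if c["defects"] < threshold]
    ranked = sorted(rest, key=score, reverse=True)
    return direct + ranked
```

In practice the returned list would still be truncated and deduplicated to form the final TDS.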
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold on the number of bugs in each training instance (namely, three in this paper) at the first stage; our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold will be investigated in our future work.
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study could change under different settings of the above factors.
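The Z-score normalization mentioned above standardizes each metric column to zero mean and unit variance; as a minimal sketch:

```python
def z_score(values):
    """Standardize one metric column to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5
    return [(v - mean) / std for v in values] if std > 0 else [0.0] * n
```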
Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs notably better than baseline1, as indicated by |δ| = 0.409 > 0.33.
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets, which are limited to AEEEM and PROMISE, is the main threat to the validity of our results. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector under "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic". The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study was conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between defects and the similarity of test instances to training instances, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred setting for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method in a comparison with the baseline methods in the context of M2O CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to explore the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z He F Shu Y Yang M Li and Q Wang ldquoAn investigationon the feasibility of cross-project defect predictionrdquo AutomatedSoftware Engineering vol 19 no 2 pp 167ndash199 2012
[2] L C Briand W L Melo and J Wust ldquoAssessing the applica-bility of fault-proneness models across object-oriented softwareprojectsrdquo IEEETransactions on Software Engineering vol 28 no7 pp 706ndash720 2002
[3] Y Ma G Luo X Zeng and A Chen ldquoTransfer learning forcross-company software defect predictionrdquo Information andSoftware Technology vol 54 no 3 pp 248ndash256 2012
[4] T Zimmermann N Nagappan H Gall E Giger and BMurphy ldquoCross-project defect prediction A large scale exper-iment on data vs domain vs processrdquo in Proceedings of theJoint 12th European Software Engineering Conference and 17thACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESEC-FSErsquo09 pp 91ndash100 nld August 2009
[5] F Peters T Menzies and A Marcus ldquoBetter cross companydefect predictionrdquo in Proceedings of the 10th InternationalWorking Conference onMining Software Repositories MSR 2013pp 409ndash418 usa May 2013
[6] F RahmanD Posnett andPDevanbu ldquoRecalling the ldquoimpreci-sionrdquo of cross-project defect predictionrdquo inProceedings of the theACM SIGSOFT 20th International Symposium p 1 Cary NorthCarolina November 2012
[7] SHerbold ldquoTraining data selection for cross-project defect pre-dictionrdquo in Proceedings of the the 9th International Conferencepp 1ndash10 Baltimore Maryland October 2013
[8] T M Khoshgoftaar E B Allen R Halstead G P Trio and RM Flass ldquoUsing process history to predict software qualityrdquoTheComputer Journal vol 31 no 4 pp 66ndash72 1998
[9] T J Ostrand and E J Weyuker ldquoThe distribution of faults ina large industrial software systemrdquo ACM SIGSOFT SoftwareEngineering Notes vol 27 no 4 p 55 2002
[10] S Kim T Zimmermann E J Whitehead Jr and A ZellerldquoPredicting faults from cached historyrdquo in Proceedings of the29th International Conference on Software Engineering (ICSErsquo07) pp 489ndash498 IEEE Computer Society Washington DCUSA May 2007
[11] T Gyimothy R Ferenc and I Siket ldquoEmpirical validationof object-oriented metrics on open source software for faultpredictionrdquo IEEE Transactions on Software Engineering vol 31no 10 pp 897ndash910 2005
[12] B Turhan T Menzies A B Bener and J Di Stefano ldquoOn therelative value of cross-company and within-company data fordefect predictionrdquo Empirical Software Engineering vol 14 no5 pp 540ndash578 2009
[13] M Chen and Y Ma ldquoAn empirical study on predicting defectnumbersrdquo in Proceedings of the 27th International Conferenceon Software Engineering andKnowledge Engineering SEKE 2015pp 397ndash402 usa July 2015
[14] C Ni W Liu Q Gu X Chen and D Chen ldquoFeSCH A Fea-ture Selection Method using Clusters of Hybrid-data forCross-Project Defect Predictionrdquo in Proceedings of the 41stIEEE Annual Computer Software and Applications ConferenceCOMPSAC 2017 pp 51ndash56 ita July 2017
[15] P He B Li X Liu J Chen and Y Ma ldquoAn empirical studyon software defect prediction with a simplified metric setrdquoInformation and Software Technology vol 59 pp 170ndash190 2015
[16] T Wang Z Zhang X Jing and L Zhang ldquoMultiple kernelensemble learning for software defect predictionrdquo AutomatedSoftware Engineering vol 23 no 4 pp 1ndash22 2015
[17] J Y He Z P Meng X Chen Z Wang and X Y Fan ldquoSemi-supervised ensemble learning approach for cross-project defectpredictionrdquo Journal of Software Ruanjian Xuebao vol 28 no 6pp 1455ndash1473 2017
[18] D Ryu J-I Jang and J Baik ldquoA transfer cost-sensitive boostingapproach for cross-project defect predictionrdquo Software QualityJournal vol 25 no 1 pp 1ndash38 2015
[19] D Ryu and J Baik ldquoEffective multi-objective naıve Bayes learn-ing for cross-project defect predictionrdquoApplied Soft Computingvol 49 pp 1062ndash1077 2016
[20] Y Li Z Huang Y Wang and B Fang ldquoEvaluating Data Filteron Cross-Project Defect Prediction Comparison and Improve-mentsrdquo IEEE Access vol 5 pp 25646ndash25656 2017
[21] F Zhang A Mockus I Keivanloo and Y Zou ldquoTowards build-ing a universal defect prediction modelrdquo in Proceedings of the11th International Working Conference on Mining Software Re-positories MSR 2014 pp 182ndash191 ind June 2014
[22] P He B Li and Y Ma Towards Cross-Project Defect Predictionwith Imbalanced Feature Sets 2014
[23] J Nam and S Kim ldquoHeterogeneous defect predictionrdquo in Pro-ceedings of the 10th Joint Meeting of the European SoftwareEngineering Conference and the ACM SIGSOFT Symposium onthe Foundations of Software Engineering ESECFSE 2015 pp508ndash519 ita September 2015
[24] X Jing FWu X Dong F Qi and B Xu ldquoHeterogeneous cross-company defect prediction by unifiedmetric representation andCCA-based transfer learningrdquo in Proceedings of the 10th JointMeeting of the European Software Engineering Conference andthe ACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESECFSE 2015 pp 496ndash507 ita September 2015
18 Mathematical Problems in Engineering
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the Multiple Sources and Privacy Preservation Issues for Heterogeneous Defect Prediction," IEEE Transactions on Software Engineering, p. 1, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous Defect Prediction Through Multiple Kernel Learning and Ensemble Learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: An empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2013, pp. 45–54, USA, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature Subset Selection and Instance Filtering for Cross-project Defect Prediction - Classification and Ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An Improved SDA Based Defect Prediction Framework for Both Within-Project and Cross-Project Class-Imbalance Problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security, QRS 2017, pp. 434–441, Czech Republic, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in Support Vector Machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data Preprocessing for Supervised Learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 2013 35th International Conference on Software Engineering, ICSE 2013, pp. 382–391, USA, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using Object-Oriented Design Metrics to Predict Software Defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: A non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering, SIGSOFT/FSE '11, pp. 311–321, Hungary, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering, ICSE '08, pp. 181–190, Germany, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: A study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering, SIGSOFT/FSE '11, pp. 300–310, Hungary, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2012, pp. 171–180, Sweden, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New Effect Size Rules of Thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A Survey of Binary Similarity and Distance Measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
As far as we know, although previous studies on CPDP have taken different types of software metrics into account during the process of selecting relevant training samples, none of them considered the number of defects contained in each sample (denoted by defects). In fact, we argue that it is also an important factor to consider. Fortunately, some studies have empirically demonstrated the relevance of defects to prediction. For example, "modules with faults in the past are likely to have faults in the future" [8], "17% to 54% of the high-fault files of release i are still high-fault in release i + 1" [9], "cover 73%–95% of faults by selecting 10% of the most fault-prone source code files" [10], and "the number of defects found in the previous release of file correlates with its current defect count on a high level" [11].
Does the selection of training data considering defects improve the performance of CPDP models? If the answer is "Yes," on the one hand it helps to validate the feasibility of CPDP; on the other hand, it will contribute to better software defect predictors by making full use of those defect datasets available on the Internet.
The objective of our work is to propose an improved method of training data selection for CPDP by introducing the information of defects. Unlike prior studies similar to our work, such as [5, 12], which focus mainly on the similarity between instances from training set and test set, this paper gives a comprehensive account of two factors, namely, similarity and defects. Moreover, the proposed method, called TDSelector, can automatically optimize their weights to achieve the best result. In brief, our main contributions to the current state of research on CPDP are summarized as follows.
(1) Considering both similarity and defects, we proposed a simple and easy-to-use training data selection method for CPDP (i.e., TDSelector), which is based on an improved scoring scheme that ranks all possible training instances. In particular, we designed an algorithm to calculate their weights automatically so as to obtain the best prediction result.
(2) To validate the effectiveness of our method, we conducted an elaborate empirical study based on 15 datasets collected from PROMISE and AEEEM (http://bug.inf.usi.ch), and the experimental results show that, in a specific CPDP scenario (i.e., many-to-one [13]), the TDSelector-based defect predictor outperforms its rivals that were built with two competing methods in terms of prediction precision.
With these technical contributions, our study could complement previous work on CPDP with respect to training data selection. In particular, we provide a reasonable scoring scheme as well as a more comprehensive guideline for developers to choose appropriate training data to train a defect predictor in practice.
The rest of this paper is organized as follows. Section 2 reviews the related work on this topic. Section 3 presents the preliminaries to our work. Section 4 describes the proposed method, TDSelector. Section 5 introduces our experimental setup, and Section 6 shows the primary experimental results. A detailed discussion of some issues, including potential threats to the validity of our study, is presented in Section 7. In the end, Section 8 summarizes this paper and presents our future work.
2. Related Work
2.1. Cross-Project Defect Prediction. Many studies were carried out to validate the feasibility of CPDP in the last five years. For example, Turhan et al. [12] proposed a cross-company defect prediction approach using defect data from other companies to build predictors for target projects. They found that the proposed method increased the probability of defect detection at the cost of an increased false positive rate. Ni et al. [14] proposed a novel method called FeSCH and designed three ranking strategies to choose appropriate features. The experimental results show that FeSCH can outperform WPDP, ALL, and TCA+ in most cases, and its performance is independent of the classifiers used. He et al. [15] compared the performance between CPDP and WPDP using feature selection techniques. The results indicated that, for reduced training data, WPDP obtained higher precision, but CPDP in turn achieved a better recall or F-measure. Some researchers have also studied the performance of CPDP based on ensemble classifiers and then validated their effects on this issue [16, 17].
Ryu et al. [18] proposed a transfer cost-sensitive boosting method by considering both distributional characteristics and the class imbalance for CPDP. The results show that their method significantly improves CPDP performance. They also [19] proposed a multiobjective naive Bayes learning technique for CPDP environments by taking the class-imbalance contexts into account. The results indicated that their approaches performed better than the single-objective ones and WPDP models. Li et al. [20] compared several well-known data filters and proposed a method called HSBF (hierarchical select-based filter) to improve the performance of CPDP. The results demonstrate that the data filter strategy can indeed improve the performance of CPDP significantly; moreover, when an appropriate data filter strategy is used, the defect predictor built from cross-project data can outperform the predictor learned by using within-project data.
Zhang et al. [21] proposed a universal CPDP model, which was built using a large number of projects collected from SourceForge (https://sourceforge.net) and Google Code (https://code.google.com). Their experimental results showed that it was indeed comparable to WPDP. Furthermore, CPDP is feasible for different projects that have heterogeneous metric sets. He et al. [22] first proposed a CPDP-IFS approach based on the distribution characteristics of both source and target projects to overcome this problem. Nam and Kim [23] then proposed an improved method called HDP, where metric selection and metric matching were introduced to build a defect predictor. Their empirical study on 28 projects showed that about 68% of predictions using the proposed approach outperformed or were comparable to WPDP with statistical significance. Jing et al. [24] proposed a unified metric representation (UMR) for heterogeneous defect data; the experiments on 14 public heterogeneous datasets from four different companies indicated that the proposed approach was more effective in addressing the problem. More research can be found in [25–27].
2.2. Training Data Selection for CPDP. As mentioned in [5, 28], a fundamental issue for CPDP is to select the most appropriate training data for building quality defect predictors. He et al. [29] discussed this problem in detail from the perspective of data granularity, i.e., release level and instance level. They presented a two-step method for training data selection. The results indicated that the predictor built based on naive Bayes could achieve fairly good performance when using the method together with the Peter filter [5]. Porto and Simao [30] proposed an Instance Filtering method that selects the most similar instances from the training dataset, and the experimental results on 36 versions of 11 open-source projects show that the defect predictor built from cross-project data selected by Feature Selection and Instance Filtering can have generally better performance both in classification and in ranking.
With regard to the data imbalance problem of defect datasets, Jing et al. [31] introduced an effective feature learning method called SDA to provide effective solutions for class-imbalance problems of both within-project and cross-project types, by employing the semisupervised transfer component analysis (SSTCA) method to make the distributions of source and target data consistent. The results indicated that their method greatly improved WPDP and CPDP performance. Ryu et al. [32] proposed a method of hybrid instance selection using nearest neighbor (HISNN). Their results suggested that those instances which had strong local knowledge could be identified via nearest neighbors with the same class label. Poon et al. [33] proposed a credibility theory based naive Bayes (CNB) classifier to establish a novel reweighting mechanism between the source projects and target projects, so that the source data could simultaneously adapt to the target data distribution and retain its own pattern. The experimental results demonstrate the significant improvement, in terms of the performance metrics considered, achieved by CNB over other CPDP approaches.
The above-mentioned studies aimed at reducing the gap in prediction performance between WPDP and CPDP. Although they are making progress towards this goal, there is clearly a lot of room for improvement. For this reason, in this paper we propose a selection approach to training data based on an improved strategy for instance ranking, instead of the single strategy for similarity calculation used in many prior studies [1, 5, 7, 12].
3. Preliminaries
In our context, a defect dataset S contains m instances, represented as S = {I_1, I_2, ..., I_m}. Instance I_i is an object class, represented as I_i = {f_i1, f_i2, ..., f_in}, where f_ij is the jth metric value of instance I_i and n is the number of metrics (also known as features). Given a source dataset S_s and a target dataset S_t, CPDP aims to perform a prediction in S_t using the knowledge extracted from S_s, where S_s ≠ S_t (see Figure 1(a)). In this paper, source and target datasets have the same set of metrics, although they may differ in the distributional characteristics of metric values.
To improve the performance of CPDP, several strategies for selecting appropriate training data have been put forward (see Figure 1(b)); e.g., Turhan et al. [12] filtered out irrelevant training instances by returning the k-nearest neighbors for each test instance.
3.1. An Example of Training Data Selection. First, we introduce a typical method for training data selection at the instance level, using a simple example to illustrate it. For strategies at other levels of training data selection, such as the release level, please refer to [7].
Figure 2 shows a training set S_s (including five instances) and a test set S_t (including one instance). Here each instance contains four metrics and a classification label (i.e., 0 or 1). An instance is defect-free (label = 0) only if its defect count equals 0; otherwise, it is defective (label = 1). According to the k-nearest neighbor method based on Euclidean distance, we can rank all five training instances in terms of their distances from the test instance. Because they share the same nearest distance from the test instance I_test, three instances, I_1, I_2, and I_5, are suitable for use as training instances when k is set to 1. Among these three, I_2 and I_5 have the same metric values, but I_2 is labeled as a defective instance because it contains a bug. In this case, I_1 will be selected with the same probability as that of I_2, regardless of the number of defects they include.
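The distance-based ranking described above can be sketched directly; the metric values below are taken from the worked example, and the variable names are ours:

```python
import math

# Each training instance: (metric vector, label, number of defects),
# using the values from the worked example.
S_s = {
    "I1": ([0.1, 0.0, 0.5, 0.0], 1, 3),
    "I2": ([0.1, 0.0, 0.0, 0.5], 1, 1),
    "I3": ([0.4, 0.3, 0.0, 0.1], 0, 0),
    "I4": ([0.0, 0.0, 0.4, 0.0], 0, 0),
    "I5": ([0.1, 0.0, 0.0, 0.5], 0, 0),
}
I_test = [0.1, 0.0, 0.5, 0.5]

def euclidean(x, y):
    # Straight-line distance between two metric vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Rank the training instances by their distance from the test instance.
ranked = sorted(S_s, key=lambda name: euclidean(S_s[name][0], I_test))
print(ranked)  # I1, I2, and I5 tie at distance 0.5, ahead of I4 and I3
```

Running this confirms the tie discussed in the text: I_1, I_2, and I_5 are all at distance 0.5 from I_test, so a pure distance ranking cannot distinguish them.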
In this way, those instances most relevant to the test instance can be quickly determined. Clearly, the goal of training data selection is to preserve the representative training instances in S_s as much as possible.
3.2. General Process of Training Data Selection. Before presenting our approach, we describe a general selection process of training data, which consists of three main steps: TDS (training dataset) setup, ranking, and duplicate removal.
TDS Setup. For each target project with little historical data, we need to set up an initial TDS whose training data are collected from other projects. To simulate this CPDP scenario in this paper, any defect data from the target project must be excluded from the initial TDS. Note that different release versions of a project actually belong to the same project. A simple example is visualized in Figure 3.
Ranking. Once the initial TDS is determined, an instance is treated as a metric vector I, as mentioned above. For each test instance, one can calculate its relevance to each training instance and then rank these training instances in terms of their similarity based on software metrics. Note that a wide variety of software metrics, such as source code metrics, process metrics, previous defects, and code churn, have been used as features for CPDP approaches to improve their prediction performance.
Duplicate Removal. Let l be the size of the test set. For each test instance, if we select its k-nearest neighbors from the initial TDS, there are a total of k × l candidate training instances. Considering that these selected instances may not be unique (i.e., a training instance can be the nearest neighbor of multiple test instances), after removing the duplicates, they form the final training set, which is a subset of the initial TDS.
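The ranking and duplicate-removal steps can be sketched generically; `dist` stands in for any distance or similarity measure, and all names below are ours:

```python
def select_training_data(train, test, k, dist):
    """For each of the l test instances keep its k nearest training
    instances (k * l candidates in total), then merge the picks and
    drop duplicates to form the final training set."""
    selected = set()
    for t in test:
        nearest = sorted(range(len(train)), key=lambda i: dist(train[i], t))[:k]
        selected.update(nearest)  # an instance picked twice is kept once
    return [train[i] for i in sorted(selected)]

# Tiny demo: 1-D "metric vectors" with absolute difference as the distance.
train = [[0.0], [0.4], [0.9]]
final_tds = select_training_data(train, [[0.1], [0.35]], k=1,
                                 dist=lambda a, b: abs(a[0] - b[0]))
print(final_tds)
```

Because `selected` is a set, the returned training set is exactly the deduplicated union described above, a subset of the initial TDS.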
Figure 1: Two CPDP scenarios used in this paper: (a) general CPDP, where a predictor is trained on a training dataset S_s and tested on a target dataset S_t; (b) improved CPDP using training data selection, where strategies for instance selection reduce S_s to a training dataset S_s' before the predictor is trained.
Figure 2: An example of the selection of training instances. The training set S_s contains five instances with metric values f_1–f_4 and a label (with the defect count in parentheses); the instances are ranked by their Euclidean distance from the test instance I_test in S_t:

I_1: 0.1, 0, 0.5, 0 — label 1 (3 defects) — rank 1
I_2: 0.1, 0, 0, 0.5 — label 1 (1 defect) — rank 1
I_3: 0.4, 0.3, 0, 0.1 — label 0 — rank 3
I_4: 0, 0, 0.4, 0 — label 0 — rank 2
I_5: 0.1, 0, 0, 0.5 — label 0 — rank 1
I_test: 0.1, 0, 0.5, 0.5
Figure 3: The overall structure of TDSelector for CPDP. Datasets from other projects (e.g., Project A-v1, A-v2, B-v1) form the training set during TDS setup, while the target project (e.g., Project C-v1) serves as the test set; training instances are ranked both by the similarity of software metrics and by the defects of each instance, and the ranked instances (TDSelector) are used to train the defect predictor.
4. Our Approach: TDSelector
To improve the prediction performance of CPDP, we leverage the following observations.
Similar Instances. Given a test instance, we can examine its similar training instances that were labeled before. The defect proneness shared by similar training instances can help us identify the probability that a test instance is defective. Intuitively, two instances are more likely to have the same state if their metric values are very similar.
Number of Defects (defects). During the selection process, when several training instances have the same distance from a test instance, we need to determine which one should be ranked higher. According to our experience in software defect prediction and other researchers' quantitative analyses of previous defect prediction approaches [34, 35], we believe that more attention should be paid to those training instances with more defects in practice.
The selection of training data based on instance similarity has been used in some prior studies [5, 12, 35]. However, to the best of our knowledge, the information about defects has not been fully utilized. So, in this paper, we attempt to propose a training data selection approach that combines such information with instance similarity.
4.1. Overall Structure of TDSelector. Figure 3 shows the overall structure of the proposed approach to training data selection, named TDSelector. Before selecting appropriate training data for CPDP, we have to set up a test set and its corresponding initial TDS. For a given project treated as the test set, all the other projects (except the target project) available at hand are used as the initial TDS. This is the so-called many-to-one (M2O) scenario for CPDP [13]. It is quite different from the typical O2O (one-to-one) scenario, where only one randomly selected project is treated as the training set for a given target project (namely, the test set).
When both of the sets are given, the ranks of training instances are calculated based on the similarity of software metrics and then returned for each test instance. For the initial TDS, we also collect each training instance's defects and thus rank these instances by their defects. Then, we rate each training instance by combining the two types of ranks in some way and identify the top-k training instances for each test instance according to their final scores. Finally, we use the predictor trained with the final TDS to predict defect proneness in the test set. We describe the core component of TDSelector, namely, the scoring scheme, in the following subsection.
4.2. Scoring Scheme. Each instance in the training set and test set is treated as a vector of features (namely, software metrics), and we calculate the similarity between them in terms of a similarity index (such as cosine similarity, Euclidean distance, or Manhattan distance, as shown in Table 1). Training instances are then ranked by the similarity between each of them and a given test instance.
For instance, the cosine similarity between a training instance I_p and the target instance I_q is computed via their vector representations as follows:

Sim(I_p, I_q) = (I_p · I_q) / (‖I_p‖ × ‖I_q‖) = Σ_{i=1}^{n} (f_pi × f_qi) / (√(Σ_{i=1}^{n} f_pi²) × √(Σ_{i=1}^{n} f_qi²)),  (1)

where I_p and I_q here denote the metric vectors of the two instances and f_*i represents the ith metric value of instance I_*.
Additionally, for each training instance, we also consider the factor defects in order to further enrich the ranking of its relevant instances. The assumption here is that the more the previous defects, the richer the information of an instance. So we propose a scoring scheme to rank those candidate training instances, defined as below:

Score(I_p, I_q) = α × Sim(I_p, I_q) + (1 − α) × N(defect_p),  (2)

where defect_p represents the defects of I_p, α is a weighting factor (0 ≤ α ≤ 1) which is learned from training data using Algorithm 1 (see Algorithm 1), and N(defect_p) is a function used to normalize defects to values ranging from 0 to 1.
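Equations (1) and (2) can be sketched as follows; the linear normalization and the assumed maximum of 10 defects are illustrative choices of ours, not fixed by the method:

```python
import math

def cosine_sim(p, q):
    # Equation (1): cosine similarity between two metric vectors.
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def score(p_metrics, q_metrics, p_defects, alpha, normalize):
    # Equation (2): weighted combination of similarity and normalized defects.
    return alpha * cosine_sim(p_metrics, q_metrics) + (1 - alpha) * normalize(p_defects)

# Example: identical metric vectors (similarity 1.0), 3 defects normalized
# linearly against an assumed maximum of 10, and a weighting factor of 0.7.
linear = lambda d: d / 10
print(score([0.1, 0.0, 0.5], [0.1, 0.0, 0.5], 3, 0.7, linear))  # 0.7*1.0 + 0.3*0.3
```

With α = 1 the score reduces to pure similarity (the NoD baseline used later in the paper); with α = 0 it ranks purely by normalized defects.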
Optimizing the parameter α
Input:
(1) Candidate TDS S_s = {I_s1, I_s2, ..., I_sm}, test set S_t = {I_t1, I_t2, ..., I_tl} (m > l)
(2) defects = {defect(I_s1), defect(I_s2), ..., defect(I_sm)}, and k = 10
Output:
(3) α (α ∈ [0, 1])
Method:
(4) Initialize α = 0, S_s(α) = ∅
(5) While (α ≤ 1) do
(6)   For i = 1; i ≤ l; i++
(7)     For j = 1; j ≤ m; j++
(8)       Score(I_ti, I_sj) = α × Sim(I_ti, I_sj) + (1 − α) × N(defect(I_sj))
(9)     End For
(10)    descSort({Score(I_ti, I_sj) | j = 1, ..., m})  // sort m training instances in descending order
(11)    S_s(α) = S_s(α) ∪ {top-k training instances}  // select the top k instances
(12)  End For
(13)  AUC ⇐ S_s(α) —CPDP→ S_t  // prediction result
(14)  α = α + 0.1
(15) End While
(16) Return (α | max_α AUC)

Algorithm 1: Optimizing the parameter α.
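Algorithm 1 amounts to a grid search over α in steps of 0.1. A sketch follows; `evaluate_auc` stands in for training and evaluating the CPDP predictor (done with Logistic Regression in the paper), and all names are ours:

```python
def optimize_alpha(train, test, sim, norm, evaluate_auc, k=10):
    """Grid search for the weighting factor alpha (a sketch of Algorithm 1).
    train: list of (metric_vector, defects) pairs; test: list of metric vectors.
    evaluate_auc(selected) must return the AUC of a predictor trained on it."""
    best_alpha, best_auc = 0.0, -1.0
    for step in range(11):                      # alpha = 0.0, 0.1, ..., 1.0
        alpha = step / 10
        selected = set()
        for t in test:
            scores = [(alpha * sim(m, t) + (1 - alpha) * norm(d), idx)
                      for idx, (m, d) in enumerate(train)]
            scores.sort(reverse=True)           # descending by score
            selected.update(idx for _, idx in scores[:k])  # top-k per test instance
        auc = evaluate_auc([train[i] for i in sorted(selected)])
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha
```

Each candidate α yields its own deduplicated training set, and the α whose trained predictor achieves the highest AUC on the test set is returned.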
Table 1: Similarity indexes and normalization methods used in this paper.

Similarity:
  Cosine: cos(X, Y) = Σ_{k=1}^{n} x_k y_k / (√(Σ_{k=1}^{n} x_k²) × √(Σ_{k=1}^{n} y_k²))
  Euclidean distance: d(X, Y) = √(Σ_{k=1}^{n} (x_k − y_k)²)
  Manhattan distance: d(X, Y) = Σ_{k=1}^{n} |x_k − y_k|

Normalization:
  Linear: N(x) = (x − x_min) / (x_max − x_min)
  Logistic: N(x) = 1 / (1 + e^(−x)) − 0.5
  Square root: N(x) = 1 − 1 / √(1 + x)
  Logarithmic: N(x) = log_10(x + 1)
  Inverse cotangent: N(x) = arctan(x) × 2 / π
Normalization is a commonly used data preprocessing technique in mathematics and computer science [36]. Graf and Borer [37] have confirmed that normalization can improve the prediction performance of classification models. For this reason, we normalize the defects of training instances when using TDSelector. Since there are many normalization methods, in this study we introduce five typical ones used in machine learning [36, 38]. The descriptions and formulas of the five normalization methods are listed in Table 1.
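The five normalization functions of Table 1 translate directly into code (a sketch; for the linear form, `x_min` and `x_max` would be the observed bounds of defects in the candidate TDS):

```python
import math

def linear(x, x_min, x_max):
    # Maps x_min..x_max onto 0..1.
    return (x - x_min) / (x_max - x_min)

def logistic(x):
    # Shifted sigmoid: maps x >= 0 into [0, 0.5).
    return 1 / (1 + math.exp(-x)) - 0.5

def square_root(x):
    return 1 - 1 / math.sqrt(1 + x)

def logarithmic(x):
    return math.log10(x + 1)

def inverse_cotangent(x):
    return math.atan(x) * 2 / math.pi

print(round(logistic(0), 2), round(logarithmic(9), 2))  # 0.0 1.0
```

All five are monotonically increasing, so they preserve the defect-based ranking and differ only in how strongly they compress large defect counts.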
For each test instance, the top-k training instances ranked in terms of their scores will be returned. Hence, the final TDS is composed by merging the sets of the top-k training instances for each test instance and removing the duplicate instances.
5. Experimental Setup
5.1. Research Questions. Our experiments were conducted to find empirical evidence that answers the following three research questions.
RQ1: Does the Consideration of Defects Improve the Performance of CPDP? Unlike the previous methods [1, 5, 7, 12, 29], TDSelector ranks candidate training instances in terms of both defects and metric-based similarity. To evaluate the effectiveness of the proposed method considering the additional information of defects, we tested TDSelector on the experimental data described in Section 5.2. According to (2), we also empirically analyzed the impact of the parameter α on prediction results.
RQ2: Which Combination of Similarity and Normalization Is More Suitable for TDSelector? Equation (2) comprises two parts, namely, similarity and the normalization of defects. For each part, several commonly used methods can be adopted in our context. To take full advantage of TDSelector, one would wonder which combination of similarity and normalization should be chosen. Therefore, it is necessary to compare the effects of different combinations of similarity and normalization methods on prediction results and to determine the best one for TDSelector.
RQ3: Can TDSelector-Based CPDP Outperform the Baseline Methods? Cross-project prediction has attracted much research interest in recent years, and a few CPDP approaches using training data selection have also been proposed, e.g., Peter filter based CPDP [5] (labeled as baseline1) and TCA+ (Transfer Component Analysis) based CPDP [39] (labeled as baseline2). To answer the third question, we compared the TDSelector-based CPDP proposed in this paper with the above two state-of-the-art methods.
5.2. Data Collection. To evaluate the effectiveness of TDSelector, in this paper we used 14 open-source projects written in Java from two online public software repositories, namely, PROMISE [40] and AEEEM [41]. The data statistics of the 14 projects in question are presented in Table 2, where Instance and Defect are the numbers of instances and defective instances, respectively, and Defect% is the proportion of defective instances to the total number of instances. Each instance in these projects represents a file of an object class and consists of two parts, namely, software metrics and defects.
The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The information of defects and 20 source code metrics for the projects on PROMISE have been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D'Ambros et al. [41], and each project on it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].
Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each of the 10 projects was selected to be the target dataset once, and then we set up a candidate TDS for CPDP that excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects is used to construct its initial TDS.
5.3. Experiment Design. To answer the three research questions, our experimental procedure, designed under the context of M2O in the CPDP scenario, is described as follows.
First, as in many prior studies [1, 5, 15, 35], all software metric values in the training and test sets were normalized by using the Z-score method, because these metrics differ in the scales of their numerical values. Since the numbers of software metrics for the 14 projects on AEEEM and PROMISE are different, the training set for a given test set was selected from the same repository.
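Z-score standardization rescales each metric to zero mean and unit variance; a minimal per-metric sketch (names are ours):

```python
import statistics

def z_score(values):
    """Standardize one metric column: subtract the mean, divide by the
    (population) standard deviation; constant columns map to zeros."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values] if sd else [0.0] * len(values)

print([round(v, 2) for v in z_score([1, 2, 3, 4, 5])])
# [-1.41, -0.71, 0.0, 0.71, 1.41]
```

Applied column by column, this puts metrics measured on very different scales (e.g., lines of code versus coupling counts) on an equal footing before distances are computed.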
Second, to examine whether the consideration of defects improves the performance of CPDP, we compared our approach TDSelector with NoD, a baseline method that considers only the similarity between instances, i.e., α = 1 in (2). Since three similarity computation methods are used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on the similarity indexes. The prediction results of each method in question for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size measure of how often the values in one distribution are larger than the values in a second distribution, to compare the results generated through our approach and its corresponding baseline method.
Because Cliff did not suggest corresponding δ values to represent small, medium, and large effects, we converted Cohen's d effect size to Cliff's δ using the cohd2delta function of the orddom R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 contains descriptors for magnitudes of d from 0.01 to 2.0.
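Cliff's δ itself has a direct pairwise definition that does not require the conversion: over all pairs, the proportion of values in one sample that exceed those in the other, minus the reverse. A small sketch of that definition (an illustration, not the tooling used in the paper):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    lesser = sum(1 for x in xs for y in ys if x < y)
    return (greater - lesser) / (len(xs) * len(ys))
```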
Third, according to the results of the second step of this procedure, 15 combinations based on three typical similarity methods for software metrics and five commonly used normalization functions for defects were examined by the pairwise comparison method. We then determined which combination is more suitable for our approach according to the mean, standard deviation, and Cliff's delta effect size.
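Equation (2) is not reproduced in this excerpt; based on the description of TDSelector as a linear weighted function of instances' similarity and defects, the scoring of one candidate training instance can be sketched as below. The five normalizer formulas are assumptions on our part: common [0, 1]-bounded maps chosen only to match the names used in Table 4.

```python
import math

# Hypothetical normalizers for a defect count d, given the maximum
# count d_max among candidates; the paper's exact formulas may differ.
NORMALIZERS = {
    "linear":            lambda d, d_max: d / d_max,
    "logistic":          lambda d, d_max: 1.0 / (1.0 + math.exp(-d)),
    "square root":       lambda d, d_max: math.sqrt(d) / math.sqrt(d_max),
    "logarithmic":       lambda d, d_max: math.log(d + 1.0) / math.log(d_max + 1.0),
    "inverse cotangent": lambda d, d_max: 2.0 * math.atan(d) / math.pi,
}

def tdselector_score(similarity, defects, d_max, alpha, norm="linear"):
    """Weighted score of a candidate instance; alpha = 1 ignores
    defects and degenerates to the NoD baseline."""
    return alpha * similarity + (1.0 - alpha) * NORMALIZERS[norm](defects, d_max)
```

Each of the 15 combinations in the experiment pairs one similarity index with one entry of `NORMALIZERS`, with α tuned in steps of 0.1.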
Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all the 15 test sets using TDSelector and two competing methods (i.e., baseline1 and baseline2, introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.
After this process is completed, we discuss the answers to the three research questions of our study.
5.4. Classifier and Evaluation Measure. As the underlying machine learning classifier for CPDP, Logistic Regression (LR), which has been widely used in the defect prediction literature [4, 23, 39, 43–46], is also used in this study. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka/). For our experiments, we used the default parameter settings for LR specified in Weka unless otherwise stated.
To evaluate the prediction performance of the different methods in this paper, we utilized the area under a Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier ranks a randomly chosen defective class higher than a randomly chosen defect-free one [47] and is known as a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold used to decide whether an instance should be classified as a negative instance [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
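The probabilistic reading of AUC above admits a direct pairwise computation (equivalent to the Mann–Whitney statistic); a small sketch, not the Weka implementation used in the experiments:

```python
def auc(defective_scores, clean_scores):
    """Probability that a randomly chosen defective instance is ranked
    above a randomly chosen defect-free one; ties count as 0.5."""
    wins = 0.0
    for d in defective_scores:
        for c in clean_scores:
            if d > c:
                wins += 1.0
            elif d == c:
                wins += 0.5
    return wins / (len(defective_scores) * len(clean_scores))
```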
6. Experimental Results
6.1. Answer to RQ1. We compared our approach, which considers defects, with the baseline method NoD, which selects training data in terms of cosine similarity alone. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. The average growth rates of AUC value vary from 5.9% to 9.0% when different normalization methods for defects are utilized. In addition, all the δ values in this table are greater than 0.2, which
8 Mathematical Problems in Engineering
Table 2: Data statistics of the projects used in our experiments.

| Repository | Project | Version | #Instances | #Defects | %Defects |
| --- | --- | --- | --- | --- | --- |
| PROMISE | Ant | 1.7 | 745 | 166 | 22.3 |
| PROMISE | Camel | 1.6 | 965 | 188 | 19.5 |
| PROMISE | Ivy | 2.0 | 352 | 40 | 11.4 |
| PROMISE | Jedit | 3.2 | 272 | 90 | 33.1 |
| PROMISE | Lucene | 2.4 | 340 | 203 | 59.7 |
| PROMISE | Poi | 3.0 | 442 | 281 | 63.6 |
| PROMISE | Synapse | 1.2 | 256 | 86 | 33.6 |
| PROMISE | Velocity | 1.4 | 196 | 147 | 75.0 |
| PROMISE | Xalan | 2.6 | 885 | 411 | 46.4 |
| PROMISE | Xerces | 1.4 | 588 | 437 | 74.3 |
| AEEEM | Equinox | 1/1/2005–6/25/2008 | 324 | 129 | 39.8 |
| AEEEM | Eclipse JDT core (Eclipse) | 1/1/2005–6/17/2008 | 997 | 206 | 20.7 |
| AEEEM | Apache Lucene (Lucene2) | 1/1/2005–10/8/2008 | 692 | 20 | 2.9 |
| AEEEM | Mylyn | 1/17/2005–3/17/2009 | 1862 | 245 | 13.2 |
| AEEEM | Eclipse PDE UI (Pde) | 1/1/2005–9/11/2008 | 1497 | 209 | 14.0 |
Table 3: The mappings between different d values and their effectiveness levels.

| Effectiveness level | d | δ |
| --- | --- | --- |
| Very small | 0.01 | 0.008 |
| Small | 0.20 | 0.147 |
| Medium | 0.50 | 0.33 |
| Large | 0.80 | 0.474 |
| Very large | 1.20 | 0.622 |
| Huge | 2.0 | 0.811 |
indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).
We then compared TDSelector with the baseline methods using the other widely used similarity calculation methods; the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. Compared with the corresponding NoD, TDSelector achieves average growth rates of AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7, respectively. More specifically, the highest growth rate of AUC value is 43.6% (for Equinox) in Table 6 and 39.7% (for Lucene2) in Table 7. Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of those baseline methods that take no account of defects.
Table 4: Analyzing the factors similarity and normalization.

| Factor | Method | Mean | Std | δ |
| --- | --- | --- | --- | --- |
| Similarity | Cosine similarity | 0.704 | 0.082 | −0.133 |
| Similarity | Euclidean distance | 0.719 | 0.080 | - |
| Similarity | Manhattan distance | 0.682 | 0.098 | −0.193 |
| Normalization | Linear | 0.706 | 0.087 | −0.012 |
| Normalization | Logistic | 0.710 | 0.078 | - |
| Normalization | Square root | 0.699 | 0.091 | −0.044 |
| Normalization | Logarithmic | 0.700 | 0.086 | −0.064 |
| Normalization | Inverse cotangent | 0.696 | 0.097 | −0.056 |
In short, during the process of training data selection, the consideration of defects for CPDP can help us select higher quality training data, thus leading to better classification results.
6.2. Answer to RQ2. Although the inclusion of defects in the selection of quality training data is helpful for better CPDP performance, it is worth noting that our method completely failed on Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Therefore, which combination of similarity and normalization is more suitable for TDSelector?
First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of the normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factor.
Table 5: The best prediction results obtained by the CPDP approach based on TDSelector with cosine similarity. NoD represents the baseline method (α = 1); +(%) denotes the growth rate of AUC value over NoD, and "-" indicates no change.

| Normalization | | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean±Std | δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | α | 0.7 | 0.9 | 0.9 | 1.0 | 0.9 | 1.0 | 0.9 | 1.0 | 0.6 | 0.9 | 0.8 | 0.6 | 0.7 | 0.7 | 0.5 | | 0.338 |
| | AUC | 0.813 | 0.676 | 0.603 | 0.793 | 0.700 | 0.611 | 0.758 | 0.741 | 0.512 | 0.742 | 0.783 | 0.760 | 0.739 | 0.705 | 0.729 | 0.711±0.081 | |
| | +(%) | 6.3 | 3.7 | 1.9 | - | 30.6 | - | 3.0 | - | 43.0 | 0.3 | 22.6 | 39.4 | 4.1 | 5.9 | 4.0 | 9.0 | |
| Logistic | α | 0.7 | 0.5 | 0.7 | 1.0 | 0.7 | 0.6 | 0.6 | 0.6 | 0.5 | 0.5 | 0 | 0.4 | 0.7 | 0.5 | 0.5 | | 0.351 |
| | AUC | 0.802 | 0.674 | 0.595 | 0.793 | 0.665 | 0.621 | 0.759 | 0.765 | 0.579 | 0.745 | 0.773 | 0.738 | 0.712 | 0.707 | 0.740 | 0.711±0.070 | |
| | +(%) | 4.8 | 3.4 | 0.5 | - | 24.1 | 1.6 | 3.1 | 3.2 | 61.7 | 0.7 | 21.0 | 35.5 | 0.3 | 6.2 | 5.6 | 9.0 | |
| Square root | α | 0.7 | 0.7 | 0.6 | 0.6 | 0.7 | 0.6 | 0.7 | 0.9 | 0.5 | 1.0 | 0.4 | 0.6 | 0.6 | 0.6 | 0.6 | | 0.249 |
| | AUC | 0.799 | 0.654 | 0.596 | 0.807 | 0.735 | 0.626 | 0.746 | 0.762 | 0.500 | 0.740 | 0.774 | 0.560 | 0.722 | 0.700 | 0.738 | 0.697±0.091 | |
| | +(%) | 4.4 | 0.3 | 0.7 | 1.8 | 37.1 | 2.5 | 1.4 | 2.8 | 39.7 | - | 21.0 | 2.8 | 1.7 | 5.3 | 5.3 | 6.9 | |
| Logarithmic | α | 0.6 | 0.6 | 0.9 | 1.0 | 0.7 | 1.0 | 0.7 | 0.7 | 0.5 | 0.9 | 0.5 | 0.5 | 0.6 | 0.6 | 0.6 | | 0.351 |
| | AUC | 0.798 | 0.662 | 0.594 | 0.793 | 0.731 | 0.611 | 0.748 | 0.744 | 0.500 | 0.758 | 0.774 | 0.700 | 0.755 | 0.702 | 0.741 | 0.707±0.083 | |
| | +(%) | 4.3 | 1.5 | 0.3 | - | 36.4 | - | 1.6 | 0.4 | 39.7 | 2.4 | 21.2 | 28.5 | 6.3 | 5.5 | 5.8 | 8.5 | |
| Inverse cotangent | α | 0.7 | 1.0 | 1.0 | 1.0 | 0.7 | 1.0 | 0.7 | 1.0 | 0.6 | 0.7 | 0 | 0.7 | 0.7 | 0.7 | 0.7 | | 0.213 |
| | AUC | 0.798 | 0.652 | 0.592 | 0.793 | 0.659 | 0.611 | 0.749 | 0.741 | 0.500 | 0.764 | 0.773 | 0.556 | 0.739 | 0.695 | 0.734 | 0.690±0.092 | |
| | +(%) | 4.3 | - | - | - | 22.9 | - | 1.8 | - | 39.7 | 3.2 | 21.0 | 2.1 | 4.1 | 4.4 | 4.8 | 5.9 | |
| NoD (α = 1) | AUC | 0.765 | 0.652 | 0.592 | 0.793 | 0.536 | 0.611 | 0.736 | 0.741 | 0.358 | 0.740 | 0.639 | 0.543 | 0.709 | 0.665 | 0.701 | 0.652±0.113 | |
Table 6: The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance. NoD represents the baseline method (α = 1); +(%) denotes the growth rate of AUC value over NoD, and "-" indicates no change.

| Normalization | | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean±Std | δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | α | 0.9 | 0.9 | 1.0 | 0.9 | 0.9 | 0.8 | 1.0 | 1.0 | 0.8 | 0.8 | 0 | 0.6 | 1.0 | 0.8 | 0.8 | | 0.369 |
| | AUC | 0.795 | 0.727 | 0.598 | 0.826 | 0.793 | 0.603 | 0.714 | 0.757 | 0.545 | 0.775 | 0.773 | 0.719 | 0.722 | 0.697 | 0.744 | 0.719±0.080 | |
| | +(%) | 1.3 | 6.8 | - | 0.9 | 32.2 | 1.9 | - | - | 11.7 | 5.2 | 17.6 | 43.0 | - | 1.1 | 9.6 | 7.7 | |
| Logistic | α | 0.7 | 0.8 | 0.4 | 0.7 | 0.7 | 0.5 | 0.6 | 0.9 | 0.9 | 0.9 | 0 | 0.7 | 1.0 | 1.0 | 0.9 | | 0.360 |
| | AUC | 0.787 | 0.750 | 0.603 | 0.832 | 0.766 | 0.613 | 0.716 | 0.767 | 0.556 | 0.745 | 0.773 | 0.698 | 0.722 | 0.690 | 0.730 | 0.717±0.075 | |
| | +(%) | 0.3 | 10.1 | 0.8 | 1.6 | 27.7 | 3.5 | 0.3 | 1.3 | 13.9 | 1.1 | 17.6 | 38.8 | - | - | 7.5 | 7.2 | |
| Square root | α | 0.7 | 0.8 | 1.0 | 0.7 | 0.8 | 0.6 | 0.7 | 0.7 | 0.7 | 1.0 | 0.7 | 0.8 | 1.0 | 1.0 | 0.9 | | 0.342 |
| | AUC | 0.796 | 0.743 | 0.598 | 0.820 | 0.720 | 0.618 | 0.735 | 0.786 | 0.564 | 0.737 | 0.774 | 0.696 | 0.722 | 0.690 | 0.750 | 0.715±0.076 | |
| | +(%) | 1.4 | 9.1 | - | 0.1 | 20.0 | 4.4 | 2.9 | 3.8 | 15.6 | - | 17.8 | 38.4 | - | - | 10.5 | 7.0 | |
| Logarithmic | α | 0.7 | 0.8 | 1.0 | 1.0 | 0.8 | 0.6 | 1.0 | 1.0 | 0.9 | 0.9 | 0.9 | 0.8 | 1.0 | 1.0 | 0.9 | | 0.324 |
| | AUC | 0.794 | 0.746 | 0.598 | 0.819 | 0.722 | 0.607 | 0.714 | 0.757 | 0.573 | 0.739 | 0.778 | 0.722 | 0.722 | 0.690 | 0.748 | 0.715±0.072 | |
| | +(%) | 1.1 | 9.5 | - | - | 20.3 | 2.5 | - | - | 17.4 | 0.3 | 18.5 | 43.6 | - | - | 10.3 | 7.0 | |
| Inverse cotangent | α | 0.8 | 0.9 | 0.6 | 0.8 | 0.8 | 0.7 | 1.0 | 0.8 | 0.6 | 0.7 | 0 | 0.9 | 0.9 | 1.0 | 0.9 | | 0.280 |
| | AUC | 0.796 | 0.749 | 0.603 | 0.820 | 0.701 | 0.623 | 0.714 | 0.787 | 0.538 | 0.750 | 0.773 | 0.589 | 0.763 | 0.690 | 0.722 | 0.708±0.084 | |
| | +(%) | 1.4 | 10.0 | 0.8 | 0.1 | 16.8 | 5.2 | - | 4.0 | 10.2 | 1.8 | 17.6 | 17.0 | 5.6 | - | 6.4 | 5.9 | |
| NoD (α = 1) | AUC | 0.785 | 0.681 | 0.598 | 0.819 | 0.600 | 0.592 | 0.714 | 0.757 | 0.488 | 0.737 | 0.657 | 0.503 | 0.722 | 0.690 | 0.678 | 0.668±0.096 | |
Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance. NoD represents the baseline method (α = 1); +(%) denotes the growth rate of AUC value over NoD, and "-" indicates no change.

| Normalization | | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean±Std | δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | α | 0.8 | 0.9 | 0.9 | 1.0 | 0.9 | 0.9 | 1.0 | 1.0 | 0.8 | 1.0 | 0 | 0.8 | 0.9 | 1.0 | 1.0 | | 0.187 |
| | AUC | 0.804 | 0.753 | 0.599 | 0.816 | 0.689 | 0.626 | 0.695 | 0.748 | 0.500 | 0.749 | 0.773 | 0.633 | 0.692 | 0.695 | 0.668 | 0.696±0.084 | |
| | +(%) | 1.3 | 7.0 | 0.3 | - | 7.3 | 6.3 | - | - | 7.8 | - | 11.6 | 19.0 | 39.7 | - | - | 5.6 | |
| Logistic | α | 0.7 | 0.7 | 0.8 | 0.8 | 0.8 | 0.7 | 0.7 | 0.9 | 0.6 | 0.7 | 0 | 0.9 | 0.9 | 1.0 | 1.0 | | 0.249 |
| | AUC | 0.799 | 0.760 | 0.607 | 0.830 | 0.674 | 0.621 | 0.735 | 0.794 | 0.520 | 0.756 | 0.773 | 0.680 | 0.559 | 0.695 | 0.668 | 0.705±0.084 | |
| | +(%) | 0.6 | 8.0 | 1.7 | 1.7 | 5.0 | 5.4 | 5.8 | 6.1 | 12.1 | 0.9 | 11.6 | 27.9 | 12.7 | - | - | 6.9 | |
| Square root | α | 0.9 | 0.9 | 0.9 | 1.0 | 0.8 | 0.8 | 0.9 | 0.8 | 0.9 | 1.0 | 0 | 1.0 | 0 | 1.0 | 1.0 | | 0.164 |
| | AUC | 0.795 | 0.755 | 0.604 | 0.816 | 0.693 | 0.627 | 0.704 | 0.750 | 0.510 | 0.749 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.680±0.1 | |
| | +(%) | 0.1 | 7.2 | 1.2 | - | 7.9 | 6.5 | 1.3 | 0.3 | 9.9 | - | 11.6 | - | 4.6 | - | - | 3.1 | |
| Logarithmic | α | 1.0 | 0.9 | 0.9 | 1.0 | 0.9 | 1.0 | 1.0 | 0.8 | 0.9 | 0.9 | 0 | 1.0 | 0 | 1.0 | 1.0 | | 0.116 |
| | AUC | 0.794 | 0.755 | 0.603 | 0.816 | 0.664 | 0.589 | 0.695 | 0.763 | 0.524 | 0.756 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.677±0.102 | |
| | +(%) | - | 7.2 | 1.0 | - | 3.4 | - | - | 2.0 | 12.9 | 0.9 | 11.6 | - | 4.6 | - | - | 2.7 | |
| Inverse cotangent | α | 1.0 | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.9 | 1.0 | 0.7 | 0.8 | 0 | 1.0 | 0 | 1.0 | 1.0 | | 0.133 |
| | AUC | 0.794 | 0.749 | 0.608 | 0.821 | 0.667 | 0.609 | 0.710 | 0.748 | 0.500 | 0.758 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.677±0.103 | |
| | +(%) | - | 6.4 | 1.8 | 0.6 | 3.9 | 3.4 | 2.2 | - | 7.8 | 1.2 | 11.6 | - | 4.6 | - | - | 2.7 | |
| NoD (α = 1) | AUC | 0.794 | 0.704 | 0.597 | 0.816 | 0.642 | 0.589 | 0.695 | 0.748 | 0.464 | 0.749 | 0.693 | 0.532 | 0.500 | 0.695 | 0.668 | 0.659±0.105 | |
Figure 4: A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow.
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), and it is followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two aspects is presented in Figure 4.
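For reference, the three similarity indexes compared in Table 4 can be sketched over normalized metric vectors as follows (a minimal illustration of the standard definitions, not the paper's implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two metric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_distance(a, b):
    """Straight-line distance between two metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```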
Then we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than the two baseline methods, indicated by average growth rates of AUC value of 10.6% and 4.3%, respectively, across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In
particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained from the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of the instances with more than one defect among these defective instances, our method returns more, and the ratio is roughly twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized owing to the instances with more defects. This result further validates that the selection of training data considering defects is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
7. Discussion
7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of the nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step value of 1. As shown in Figure 5, for the three combinations in question, selecting the k nearest training instances with a small k (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because the best results are obtained when k is equal to 10.
Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear," a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. (Cos, Euc, and Man denote cosine similarity, Euclidean distance, and Manhattan distance; Lin, Log, Sqrt, Ln, and Acot denote the linear, logistic, square root, logarithmic, and inverse cotangent normalization methods.)

| | Cos+Lin | Cos+Log | Cos+Sqrt | Cos+Ln | Cos+Acot | Euc+Lin | Euc+Log | Euc+Sqrt | Euc+Ln | Euc+Acot | Man+Lin | Man+Log | Man+Sqrt | Man+Ln | Man+Acot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cosine + Linear | - | 0.018 | 0.084 | 0.000 | 0.116 | −0.049 | −0.036 | −0.004 | −0.013 | −0.009 | 0.138 | 0.049 | 0.164 | 0.178 | 0.169 |
| Euclidean + Linear | 0.049 | 0.102 | 0.111 | 0.062 | 0.164 | - | 0.036 | 0.040 | 0.058 | 0.089 | 0.209 | 0.102 | 0.249 | 0.276 | 0.244 |
| Manhattan + Logistic | −0.049 | −0.022 | 0.022 | −0.013 | 0.111 | −0.102 | −0.076 | −0.080 | −0.049 | −0.031 | 0.053 | - | 0.124 | 0.151 | 0.147 |
Table 9: A comparison between our approach and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question; the last two columns give the growth rate (%) of "Euclidean + Linear" over each baseline.

| Test set | Baseline1 | Baseline2 | +(%) vs Baseline1 | +(%) vs Baseline2 |
| --- | --- | --- | --- | --- |
| Ant | 0.785 | 0.803 | 1.3 | −1.0 |
| Xalan | 0.657 | 0.675 | 10.7 | 7.7 |
| Camel | 0.595 | 0.624 | 0.5 | −4.2 |
| Ivy | 0.789 | 0.802 | 4.7 | 3.0 |
| Jedit | 0.694 | 0.782 | 14.3 | 1.4 |
| Lucene | 0.608 | 0.701 | −0.8 | −14.0 |
| Poi | 0.691 | 0.789 | 3.3 | −9.5 |
| Synapse | 0.740 | 0.748 | 2.3 | 1.2 |
| Velocity | 0.330 | 0.331 | 65.2 | 64.7 |
| Xerces | 0.714 | 0.753 | 8.5 | 2.9 |
| Eclipse | 0.706 | 0.744 | 10.2 | 4.6 |
| Equinox | 0.587 | 0.720 | 23.1 | 0.3 |
| Lucene2 | 0.705 | 0.724 | 2.5 | −0.2 |
| Mylyn | 0.631 | 0.646 | 9.3 | 6.8 |
| Pde | 0.678 | 0.737 | 10.4 | 1.5 |
| Avg | 0.663 | 0.705 | 10.6 | 4.3 |

Cliff's delta: baseline1 vs TDSelector, δ = −0.409; baseline2 vs TDSelector, δ = −0.009.
Table 10: Comparison of the defective instances of the simplified training datasets obtained from different methods on the Velocity project.

| Method | defect instances / instances | instances (defects > 1) / defect instances |
| --- | --- | --- |
| Baseline1 | 0.375 | 0.247 |
| Baseline2 | 0.393 | 0.291 |
| TDSelector | 0.376 | 0.487 |
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k, for the combinations "Manhattan + Logistic," "Euclidean + Linear," and "Cosine + Linear").
when k is equal to 10. The combination "Manhattan + Logistic," by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
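The candidate-collection step controlled by k can be sketched as follows; the helper names and the use of plain Euclidean distance here are our assumptions for illustration, not the paper's implementation:

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def collect_top_k(test_instances, candidates, k=10):
    """Union of the k nearest candidate training instances over all
    test instances, duplicates removed, forming the simplified TDS."""
    selected = set()
    for t in test_instances:
        ranked = sorted(range(len(candidates)),
                        key=lambda i: euclidean_distance(t, candidates[i]))
        selected.update(ranked[:k])
    return [candidates[i] for i in sorted(selected)]
```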
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k, for the combinations "Manhattan + Logistic," "Euclidean + Linear," and "Cosine + Linear").
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs: (a) is shown from the viewpoint of a single dataset (release), while (b) is shown from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS, after removing redundant ones.
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average, TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear," "Euclidean + Linear," and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold for the number of bugs in each training instance (namely, three in this paper) at the first stage; our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold in TDSelector will be investigated in our future work.
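The two-stage selection just described can be sketched as follows (a minimal illustration; the `bugs` field name and the threshold handling are assumptions on our part):

```python
def tdselector3_split(candidates, bug_threshold=3):
    """First stage of TDSelector-3: candidates with at least
    `bug_threshold` recorded bugs join the training set directly;
    the rest are left for the score-based TDSelector stage."""
    direct = [c for c in candidates if c["bugs"] >= bug_threshold]
    remaining = [c for c in candidates if c["bugs"] < bug_threshold]
    return direct, remaining
```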
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study could change under different settings of the above factors.
Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method obviously performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of the results of our study. All the 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector under the three combinations "Cosine + Linear," "Euclidean + Linear," and "Manhattan + Logistic" across the 15 test sets. The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z He F Shu Y Yang M Li and Q Wang ldquoAn investigationon the feasibility of cross-project defect predictionrdquo AutomatedSoftware Engineering vol 19 no 2 pp 167ndash199 2012
[2] L C Briand W L Melo and J Wust ldquoAssessing the applica-bility of fault-proneness models across object-oriented softwareprojectsrdquo IEEETransactions on Software Engineering vol 28 no7 pp 706ndash720 2002
[3] Y Ma G Luo X Zeng and A Chen ldquoTransfer learning forcross-company software defect predictionrdquo Information andSoftware Technology vol 54 no 3 pp 248ndash256 2012
[4] T Zimmermann N Nagappan H Gall E Giger and BMurphy ldquoCross-project defect prediction A large scale exper-iment on data vs domain vs processrdquo in Proceedings of theJoint 12th European Software Engineering Conference and 17thACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESEC-FSErsquo09 pp 91ndash100 nld August 2009
[5] F Peters T Menzies and A Marcus ldquoBetter cross companydefect predictionrdquo in Proceedings of the 10th InternationalWorking Conference onMining Software Repositories MSR 2013pp 409ndash418 usa May 2013
[6] F RahmanD Posnett andPDevanbu ldquoRecalling the ldquoimpreci-sionrdquo of cross-project defect predictionrdquo inProceedings of the theACM SIGSOFT 20th International Symposium p 1 Cary NorthCarolina November 2012
[7] SHerbold ldquoTraining data selection for cross-project defect pre-dictionrdquo in Proceedings of the the 9th International Conferencepp 1ndash10 Baltimore Maryland October 2013
[8] T M Khoshgoftaar E B Allen R Halstead G P Trio and RM Flass ldquoUsing process history to predict software qualityrdquoTheComputer Journal vol 31 no 4 pp 66ndash72 1998
[9] T J Ostrand and E J Weyuker ldquoThe distribution of faults ina large industrial software systemrdquo ACM SIGSOFT SoftwareEngineering Notes vol 27 no 4 p 55 2002
[10] S Kim T Zimmermann E J Whitehead Jr and A ZellerldquoPredicting faults from cached historyrdquo in Proceedings of the29th International Conference on Software Engineering (ICSErsquo07) pp 489ndash498 IEEE Computer Society Washington DCUSA May 2007
[11] T Gyimothy R Ferenc and I Siket ldquoEmpirical validationof object-oriented metrics on open source software for faultpredictionrdquo IEEE Transactions on Software Engineering vol 31no 10 pp 897ndash910 2005
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, USA, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, Italy, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, India, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, Italy, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, Italy, September 2015.
18 Mathematical Problems in Engineering
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, USA, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, Czech Republic, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, USA, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, Hungary, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, Germany, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, Hungary, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, Sweden, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
2.2. Training Data Selection for CPDP. As mentioned in [5, 28], a fundamental issue for CPDP is to select the most appropriate training data for building quality defect predictors. He et al. [29] discussed this problem in detail from the perspective of data granularity, i.e., release level and instance level, and presented a two-step method for training data selection. Their results indicated that a predictor built on naive Bayes could achieve fairly good performance when the method was used together with the Peter filter [5]. Porto and Simao [30] proposed an instance filtering method that selects the most similar instances from the training dataset; their experiments on 36 versions of 11 open-source projects show that a defect predictor built from cross-project data selected by feature selection and instance filtering generally performs better in both classification and ranking.
With regard to the data imbalance problem of defect datasets, Jing et al. [31] introduced an effective feature learning method called SDA to provide solutions for class-imbalance problems of both the within-project and cross-project types, employing the semisupervised transfer component analysis (SSTCA) method to make the distributions of source and target data consistent. Their results indicated that the method greatly improved both WPDP and CPDP performance. Ryu et al. [32] proposed a method of hybrid instance selection using nearest neighbor (HISNN). Their results suggested that instances with strong local knowledge can be identified via nearest neighbors with the same class label. Poon et al. [33] proposed a credibility theory based naive Bayes (CNB) classifier that establishes a novel reweighting mechanism between the source and target projects, so that the source data can adapt to the target data distribution while retaining its own pattern. Their experimental results demonstrate that CNB achieves significant improvements over other CPDP approaches in terms of the performance metrics considered.
The above-mentioned studies aimed at reducing the gap in prediction performance between WPDP and CPDP. Although they have made progress towards this goal, there is clearly still much room for improvement. For this reason, in this paper we propose a training data selection approach based on an improved strategy for instance ranking, instead of the single similarity-based strategy used in many prior studies [1, 5, 7, 12].
3. Preliminaries
In our context, a defect dataset S contains m instances, represented as S = {I_1, I_2, ..., I_m}. An instance I_i is an object class represented as I_i = {f_i1, f_i2, ..., f_in}, where f_ij is the jth metric value of instance I_i and n is the number of metrics (also known as features). Given a source dataset S_s and a target dataset S_t, CPDP aims to perform a prediction in S_t using the knowledge extracted from S_s, where S_s ≠ S_t (see Figure 1(a)). In this paper, source and target datasets have the same set of metrics, though they may differ in the distributional characteristics of metric values.
To improve the performance of CPDP, several strategies for selecting appropriate training data have been put forward (see Figure 1(b)); e.g., Turhan et al. [12] filtered out irrelevant training instances by returning the k-nearest neighbors of each test instance.
3.1. An Example of Training Data Selection. First, we introduce a typical method for training data selection at the instance level, illustrated with a simple example. For strategies at other levels of training data selection, such as the release level, please refer to [7].
Figure 2 shows a training set S_s (five instances) and a test set S_t (one instance). Each instance contains five metrics and a classification label (i.e., 0 or 1). An instance is defect-free (label = 0) only if its number of defects is 0; otherwise it is defective (label = 1). Using the k-nearest neighbor method based on Euclidean distance, we can rank all five training instances by their distances from the test instance. Because they share the same nearest distance from the test instance I_test, the three instances I_1, I_2, and I_5 are all suitable training instances when k is set to 1. Among the three, I_2 and I_5 have the same metric values, but I_2 is labeled as a defective instance because it contains a bug. In this case, I_1 will be selected with the same probability as I_2, regardless of the number of defects each contains.
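The ranking in this example can be sketched in a few lines of Python; the metric values below are the ones shown in Figure 2, while the code itself is only an illustration, not part of the original study:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two metric vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# training instances from Figure 2: metric vector (f1..f4) and defect count
training = {
    "I1": ([0.1, 0.0, 0.5, 0.0], 3),
    "I2": ([0.1, 0.0, 0.0, 0.5], 1),
    "I3": ([0.4, 0.3, 0.0, 0.1], 0),
    "I4": ([0.0, 0.0, 0.4, 0.0], 0),
    "I5": ([0.1, 0.0, 0.0, 0.5], 0),
}
test_vec = [0.1, 0.0, 0.5, 0.5]

# rank the five training instances by their distance from the test instance
ranked = sorted(training, key=lambda name: euclidean(training[name][0], test_vec))
nearest_dist = euclidean(training[ranked[0]][0], test_vec)
nearest = [n for n in training
           if euclidean(training[n][0], test_vec) == nearest_dist]
# I1, I2, and I5 all lie at the same nearest distance of 0.5
```

Note that the distance alone cannot break the tie among I_1, I_2, and I_5, which is exactly the situation the defect count is later used to resolve.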
In this way, the instances most relevant to the test instance can be quickly determined. Clearly, the goal of training data selection is to preserve the representative training instances in S_s as much as possible.
3.2. General Process of Training Data Selection. Before presenting our approach, we describe a general selection process for training data, which consists of three main steps: TDS (training dataset) setup, ranking, and duplicate removal.
TDS Setup. For each target project with little historical data, we need to set up an initial TDS whose training data are collected from other projects. To simulate this CPDP scenario in this paper, any defect data from the target project must be excluded from the initial TDS. Note that different release versions of a project still belong to the same project. A simple example is visualized in Figure 3.
Ranking. Once the initial TDS is determined, each instance is treated as a metric vector I, as mentioned above. For each test instance, one can calculate its relevance to each training instance and then rank the training instances by their similarity based on software metrics. Note that a wide variety of software metrics, such as source code metrics, process metrics, previous defects, and code churn, have been used as features of CPDP approaches to improve their prediction performance.
Duplicate Removal. Let l be the size of the test set. For each test instance, if we select its k-nearest neighbors from the initial TDS, there are in total k × l candidate training instances. Since these selected instances may not be unique (i.e., a training instance can be the nearest neighbor of multiple test instances), the duplicates are removed, and the remaining instances form the final training set, a subset of the initial TDS.
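The three steps above can be sketched as follows; the function and variable names are ours, purely for illustration, and TDS setup is assumed to have already excluded the target project:

```python
def select_training_data(train, tests, dist, k):
    """Rank the initial TDS for each test instance, keep the top-k,
    and merge the per-test selections with duplicates removed."""
    selected = set()
    for t in tests:
        # rank all training instances by distance from this test instance
        ranked = sorted(range(len(train)), key=lambda i: dist(train[i], t))
        selected.update(ranked[:k])  # duplicates across test instances collapse
    return [train[i] for i in sorted(selected)]

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
train = [[0.0], [0.2], [0.4], [0.9]]
tests = [[0.1], [0.3]]
final_tds = select_training_data(train, tests, manhattan, k=2)
# k * l = 4 candidates here, but only 3 unique instances survive deduplication
```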
Figure 1: Two CPDP scenarios used in this paper. (a) General CPDP: a predictor is trained on the training dataset S_s and tested on the target dataset S_t. (b) Improved CPDP using training data selection: strategies for instance selection reduce S_s to a smaller training set S_s' before the predictor is trained. Instances are described by metric/feature values f and may be buggy, non-buggy, or unlabeled.
Figure 2: An example of the selection of training instances. Each training instance I_1–I_5 in S_s has four metric values (f_1–f_4) and a label with its number of defects in parentheses, e.g., I_1 = (0.1, 0, 0.5, 0) with label 1(3) and I_2 = (0.1, 0, 0, 0.5) with label 1(1); the test instance I_test in S_t is (0.1, 0, 0.5, 0.5). Ranked by Euclidean distance from I_test, rank 1 is shared by I_1, I_2, and I_5, rank 2 is I_4, and rank 3 is I_3.
Figure 3: The overall structure of TDSelector for CPDP. In the TDS setup step, releases of other projects (e.g., A-v1, A-v2, B-v1) form the training set, while a release of the target project (e.g., C-v1) serves as the test set. Training instances are then ranked by TDSelector according to the similarity of software metrics together with the defects of each instance, and the resulting training data are used to build the defect predictor, which is applied to the test data.
4. Our Approach: TDSelector
To improve the prediction performance of CPDP, we leverage the following two observations.
Similar Instances. Given a test instance, we can examine its similar training instances that were labeled before. The defect proneness shared by similar training instances can help us estimate the probability that the test instance is defective. Intuitively, two instances are more likely to have the same state if their metric values are very similar.
Number of Defects (defects). During the selection process, when several training instances have the same distance from a test instance, we need to determine which one should be ranked higher. Based on our experience in software defect prediction and other researchers' quantitative analyses of previous defect prediction approaches [34, 35], we believe that more attention should be paid to training instances with more defects in practice.
The selection of training data based on instance similarity has been used in some prior studies [5, 12, 35]. However, to the best of our knowledge, the information about defects has not been fully utilized. In this paper, we therefore propose a training data selection approach that combines such defect information with instance similarity.
4.1. Overall Structure of TDSelector. Figure 3 shows the overall structure of the proposed approach to training data selection, named TDSelector. Before selecting appropriate training data for CPDP, we have to set up a test set and its corresponding initial TDS. For a given project treated as the test set, all other projects available at hand (except the target project) are used as the initial TDS. This is the so-called many-to-one (M2O) scenario for CPDP [13]; it is quite different from the typical one-to-one (O2O) scenario, where only one randomly selected project is treated as the training set for a given target project (namely, the test set).
When both sets are given, the ranks of training instances are calculated based on the similarity of software metrics and returned for each test instance. For the initial TDS, we also collect each training instance's defects and thus rank these instances by their defects. We then rate each training instance by combining the two types of ranks, and identify the top-k training instances for each test instance according to their final scores. Finally, we use the predictor trained with the final TDS to predict defect proneness in the test set. The core component of TDSelector, namely the scoring scheme, is described in the following subsection.
4.2. Scoring Scheme. Each instance in the training and test sets is treated as a vector of features (namely, software metrics), and we calculate the similarity between instances in terms of a similarity index (such as cosine similarity, Euclidean distance, or Manhattan distance, as shown in Table 1). Training instances are then ranked by the similarity between each of them and a given test instance.
For instance, the cosine similarity between a training instance I_p and the target instance I_q is computed via their vector representations as follows:

$$\mathit{Sim}(I_p, I_q) = \frac{\vec{I_p} \cdot \vec{I_q}}{\|\vec{I_p}\| \times \|\vec{I_q}\|} = \frac{\sum_{i=1}^{n} f_{pi} \times f_{qi}}{\sqrt{\sum_{i=1}^{n} f_{pi}^2} \times \sqrt{\sum_{i=1}^{n} f_{qi}^2}} \qquad (1)$$

where $\vec{I_p}$ and $\vec{I_q}$ are the metric vectors of I_p and I_q, respectively, and f_{*i} represents the ith metric value of instance I_*.
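Equation (1) translates directly into code; the sketch below is ours and uses two made-up metric vectors:

```python
import math

def cosine_similarity(p, q):
    # Sim(I_p, I_q) = (p . q) / (||p|| * ||q||), as in Eq. (1)
    dot = sum(fp * fq for fp, fq in zip(p, q))
    norms = math.sqrt(sum(f * f for f in p)) * math.sqrt(sum(f * f for f in q))
    return dot / norms

# parallel metric vectors have similarity 1, orthogonal ones 0
sim = cosine_similarity([1.0, 2.0, 2.0], [2.0, 4.0, 4.0])
```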
Additionally, for each training instance we also consider the factor defects, in order to further enrich the ranking of its relevant instances. The assumption here is that the more previous defects an instance has, the richer the information it carries. We therefore propose the following scoring scheme to rank candidate training instances:

$$\mathit{Score}(I_p, I_q) = \alpha \cdot \mathit{Sim}(I_p, I_q) + (1 - \alpha) \cdot N(\mathit{defect}_p) \qquad (2)$$

where defect_p represents the defects of I_p, α is a weighting factor (0 ≤ α ≤ 1) that is learned from training data using Algorithm 1, and N(defect_p) is a function used to normalize defects to values ranging from 0 to 1.
Optimizing the parameter α
Input:
(1) Candidate TDS S_s = {I_s1, I_s2, ..., I_sm}; test set S_t = {I_t1, I_t2, ..., I_tl} (m > l)
(2) defects = {defect(I_s1), defect(I_s2), ..., defect(I_sm)}; k = 10
Output:
(3) α (α ∈ [0, 1])
Method:
(4) Initialize α = 0, S_s(α) = ∅
(5) While (α ≤ 1) do
(6)   For i = 1; i ≤ l; i++
(7)     For j = 1; j ≤ m; j++
(8)       Score(I_ti, I_sj) = α · Sim(I_ti, I_sj) + (1 − α) · N(defect(I_sj))
(9)     End For
(10)    descSort({Score(I_ti, I_sj) | j = 1, ..., m})  // sort the m training instances in descending order
(11)    S_s(α) = S_s(α) ∪ {top-k training instances}   // select the top k instances
(12)  End For
(13)  AUC ⇐ S_s(α) --CPDP--> S_t                        // prediction result
(14)  α = α + 0.1
(15) End While
(16) Return (α | max_α AUC)
Algorithm 1: Optimizing the parameter α.
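Algorithm 1 can be rendered compactly in Python. This is a sketch under our assumptions: `evaluate_auc` is a hypothetical caller-supplied hook standing in for step (13), which trains a CPDP predictor on the selected instances and returns its AUC.

```python
def optimize_alpha(score_pairs, evaluate_auc, k=10, step=0.1):
    """Sweep alpha over [0, 1] in steps of 0.1 and return the value whose
    selected training subset yields the highest AUC.
    score_pairs[i][j] = (similarity, normalized_defects) for test
    instance i versus training instance j."""
    best_alpha, best_auc = 0.0, -1.0
    alpha = 0.0
    while alpha <= 1.0 + 1e-9:
        selected = set()
        for row in score_pairs:                        # one row per test instance
            scores = [alpha * sim + (1.0 - alpha) * nd for sim, nd in row]
            ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
            selected.update(ranked[:k])                # top-k, duplicates merged
        auc = evaluate_auc(sorted(selected))           # step (13): train and score
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
        alpha = round(alpha + step, 10)
    return best_alpha

# tiny example: similarity starts to favor training instance 0 at alpha = 0.4,
# and the (made-up) evaluator rewards exactly that selection
best = optimize_alpha([[(0.9, 0.0), (0.0, 0.54)]],
                      lambda sel: 0.9 if sel == [0] else 0.4, k=1)
```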
Table 1: Similarity indexes and normalization methods used in this paper.

Similarity:
  Cosine similarity: cos(X, Y) = Σ_{k=1..n} x_k y_k / (√(Σ_{k=1..n} x_k²) · √(Σ_{k=1..n} y_k²))
  Euclidean distance: d(X, Y) = √(Σ_{k=1..n} (x_k − y_k)²)
  Manhattan distance: d(X, Y) = Σ_{k=1..n} |x_k − y_k|

Normalization:
  Linear: N(x) = (x − x_min) / (x_max − x_min)
  Logistic: N(x) = 1 / (1 + e^(−x)) − 0.5
  Square root: N(x) = 1 − 1 / √(1 + x)
  Logarithmic: N(x) = log_10(x + 1)
  Inverse cotangent: N(x) = arctan(x) · 2 / π
Normalization is a commonly used data preprocessing technique in mathematics and computer science [36]. Graf and Borer [37] confirmed that normalization can improve the prediction performance of classification models. For this reason, we normalize the defects of training instances when using TDSelector. Among the many available normalization methods, we adopt five typical ones used in machine learning [36, 38]; their formulas are listed in Table 1.
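As we read Table 1, the five normalization functions for a defect count x ≥ 0 can be written as follows (a sketch; only linear normalization additionally needs the minimum and maximum defect counts in the TDS):

```python
import math

def n_linear(x, x_min, x_max):
    # Linear: rescale by the range of defect counts in the TDS
    return (x - x_min) / (x_max - x_min)

def n_logistic(x):
    # Logistic: 1 / (1 + e^-x) - 0.5, which maps x >= 0 into [0, 0.5)
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

def n_sqrt(x):
    # Square root: 1 - 1 / sqrt(1 + x)
    return 1.0 - 1.0 / math.sqrt(1.0 + x)

def n_log(x):
    # Logarithmic: log10(x + 1)
    return math.log10(x + 1.0)

def n_arccot(x):
    # Inverse cotangent: arctan(x) * 2 / pi
    return math.atan(x) * 2.0 / math.pi
```

All five map a defect count of 0 to 0 and increase monotonically, but they saturate at different rates, which is precisely the difference RQ2 examines.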
For each test instance, the top-k training instances ranked by their scores are returned. The final TDS is then composed by merging the sets of top-k training instances obtained for each test instance and removing the duplicate instances.
5. Experimental Setup
5.1. Research Questions. Our experiments were conducted to find empirical evidence that answers the following three research questions.
RQ1: Does the Consideration of Defects Improve the Performance of CPDP? Unlike previous methods [1, 5, 7, 12, 29], TDSelector ranks candidate training instances in terms of both defects and metric-based similarity. To evaluate the effectiveness of the proposed method when considering the additional defect information, we tested TDSelector on the experimental data described in Section 5.2. Following (2), we also empirically analyzed the impact of the parameter α on prediction results.
RQ2: Which Combination of Similarity and Normalization Is More Suitable for TDSelector? Equation (2) comprises two parts, namely, similarity and the normalization of defects, and several commonly used methods can be adopted for each part in our context. To take full advantage of TDSelector, one would wonder which combination of similarity and normalization should be chosen. It is therefore necessary to compare the effects of different combinations of similarity and normalization methods on prediction results and to determine the best one for TDSelector.
RQ3: Can TDSelector-Based CPDP Outperform the Baseline Methods? Cross-project prediction has attracted much research interest in recent years, and a few CPDP approaches using training data selection have also been proposed, e.g.,
the Peter filter based CPDP [5] (labeled baseline1) and the TCA+ (Transfer Component Analysis) based CPDP [39] (labeled baseline2). To answer the third question, we compared the TDSelector-based CPDP proposed in this paper with these two state-of-the-art methods.
5.2. Data Collection. To evaluate the effectiveness of TDSelector, we used 14 open-source projects written in Java from two online public software repositories, namely, PROMISE [40] and AEEEM [41]. The data statistics of the 14 projects are presented in Table 2, where Instance and Defect are the numbers of instances and defective instances, respectively, and %Defect is the proportion of defective instances to the total number of instances. Each instance in these projects represents a file of an object class and consists of two parts, namely, software metrics and defects.
The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The defect information and 20 source code metrics for the projects on PROMISE have been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D'Ambros et al. [41]; each project in it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].
Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each of the 10 projects was selected as the target dataset once, and we then set up a candidate TDS for CPDP that excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects are used to construct its initial TDS.
5.3. Experiment Design. To answer the three research questions, our experimental procedure, designed under the M2O context of the CPDP scenario, is described as follows.
First, as in many prior studies [1, 5, 15, 35], all software metric values in the training and test sets were normalized using the Z-score method, because these metrics differ in the scales of their numerical values. Since the projects on AEEEM and PROMISE have different numbers of software metrics, the training set for a given test set was selected from the same repository.
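The Z-score step can be sketched as follows, applied column-by-column to each software metric (this illustration, written by us, uses the population standard deviation):

```python
def z_score(values):
    """Z-score normalization of one metric column: subtract the mean
    and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

scaled = z_score([2.0, 4.0, 6.0])  # now centered at 0 with unit variance
```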
Second, to examine whether the consideration of defects improves the performance of CPDP, we compared TDSelector with NoD, a baseline method that considers only the similarity between instances, i.e., α = 1 in (2). Since three similarity computation methods are used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on the similarity indexes. The prediction results of each method for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size measuring how often the values in one distribution are larger than the values in a second distribution, to compare the results generated by our approach and its corresponding baseline method.
Because Cliff did not suggest the δ values that represent small, medium, and large effects, we converted Cohen's d effect size to Cliff's δ using the cohd2delta R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 contains descriptors for magnitudes from d = 0.01 to 2.0.
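Cliff's delta itself can be computed directly from its definition; the sketch below is ours, not the cohd2delta package used in the study:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: the proportion of pairs where a value from xs
    exceeds a value from ys, minus the proportion where it is smaller."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

d = cliffs_delta([3, 4, 5], [1, 2, 3])  # strongly shifted upward
```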
Third, according to the results of the second step, 15 combinations of the three typical similarity methods for software metrics and the five commonly used normalization functions for defects were examined by pairwise comparison. We then determined which combination is most suitable for our approach according to the mean, standard deviation, and Cliff's delta effect size.
Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all 15 test sets using TDSelector and two competing methods (i.e., baseline1 and baseline2, introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.
After this process is completed, we will discuss the answers to the three research questions of our study.
5.4. Classifier and Evaluation Measure. As the underlying machine learning classifier for CPDP, we use Logistic Regression (LR), which has been widely used in the defect prediction literature [4, 23, 39, 43–46]. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka). Unless otherwise specified, we used Weka's default parameter settings for LR.
To evaluate the prediction performance of the different methods, we used the area under the Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier ranks a randomly chosen defective class higher than a randomly chosen defect-free one [47], and it is a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold used to decide whether an instance should be classified as a negative instance [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
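The probabilistic definition of AUC above can be computed directly for small examples (a sketch written by us; the study itself obtains AUC from Weka):

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen defective instance
    (label 1) is scored above a randomly chosen defect-free one (label 0);
    ties count as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])      # perfect ranking
random_like = auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])  # indistinguishable scores
```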
6. Experimental Results
6.1. Answer to RQ1. We compared our approach considering defects with the baseline method NoD, which selects training data in terms of cosine similarity alone. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. The average growth rates of the AUC value vary from 5.9% to 9.0% when different normalization methods for defects are utilized. In addition, all the δ values in this table are greater than 0.2, which
Table 2: Data statistics of the projects used in our experiments.

Repository  Project                      Version              Instance  Defect  %Defect
PROMISE     Ant                          1.7                  745       166     22.3%
            Camel                        1.6                  965       188     19.5%
            Ivy                          2.0                  352       40      11.4%
            Jedit                        3.2                  272       90      33.1%
            Lucene                       2.4                  340       203     59.7%
            Poi                          3.0                  442       281     63.6%
            Synapse                      1.2                  256       86      33.6%
            Velocity                     1.4                  196       147     75.0%
            Xalan                        2.6                  885       411     46.4%
            Xerces                       1.4                  588       437     74.3%
AEEEM       Equinox                      1/1/2005–6/25/2008   324       129     39.8%
            Eclipse JDT core (Eclipse)   1/1/2005–6/17/2008   997       206     20.7%
            Apache Lucene (Lucene2)      1/1/2005–10/8/2008   692       20      2.9%
            Mylyn                        1/17/2005–3/17/2009  1862      245     13.2%
            Eclipse PDE UI (Pde)         1/1/2005–9/11/2008   1497      209     14.0%

Table 3: The mappings between different values and their effectiveness levels.

Effect size  d     δ
Very small   0.01  0.008
Small        0.20  0.147
Medium       0.50  0.33
Large        0.80  0.474
Very large   1.20  0.622
Huge         2.0   0.811
indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).
We then compared TDSelector with the baseline methods using the other widely used similarity calculation methods; the results obtained with Euclidean distance and Manhattan distance are presented in Tables 6 and 7. Compared with the corresponding NoD, TDSelector achieves average growth rates of the AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7. More specifically, the highest growth rate of the AUC value is 43.6% (for Equinox) in Table 6 and 39.7% (for Lucene2) in Table 7. Besides, all Cliff's delta (δ) effect sizes in these two tables are greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of the baseline methods that disregard defects.
Table 4. Analyzing the factors similarity and normalization.

Factor         Method              Mean   Std    δ
Similarity     Cosine similarity   0.704  0.082  -0.133
               Euclidean distance  0.719  0.080  -
               Manhattan distance  0.682  0.098  -0.193
Normalization  Linear              0.706  0.087  -0.012
               Logistic            0.710  0.078  -
               Square root         0.699  0.091  -0.044
               Logarithmic         0.700  0.086  -0.064
               Inverse cotangent   0.696  0.097  -0.056
In short, during the process of training data selection, taking defects into account for CPDP helps us select higher quality training data, thus leading to better classification results.
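The selection scheme discussed above can be sketched as a linear weighted score combining similarity with a normalized defect count. The helper names below, and the 1/(1 + d) conversion from distance to similarity, are illustrative assumptions rather than the paper's exact formulas:

```python
import math

def euclidean_similarity(a, b):
    """Similarity derived from Euclidean distance between metric vectors;
    the 1/(1 + d) conversion is an illustrative choice."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

def linear_normalize(defects, lo, hi):
    """Linear (min-max) normalization of an instance's defect count."""
    return 0.0 if hi == lo else (defects - lo) / (hi - lo)

def tdselector_score(similarity, norm_defects, alpha):
    """Score(i) = alpha * similarity + (1 - alpha) * normalized defects;
    candidates with the highest scores are kept as training data."""
    return alpha * similarity + (1 - alpha) * norm_defects
```

With alpha = 1 the score degenerates to pure similarity, which is exactly the NoD baseline used in the comparisons.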
6.2. Answer to RQ2. Although including defects in the selection of quality training data is helpful for better CPDP performance, it is worth noting that our method completely failed on Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Therefore, which combination of similarity and normalization is more suitable for TDSelector?
First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of the normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factor.
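The three similarity indexes under comparison can be stated compactly. A minimal sketch over software-metric vectors (not the authors' code):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two metric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_distance(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """City-block (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Note that cosine similarity grows with closeness while the two distances shrink with it, so distances must be inverted or negated before being mixed into a similarity-based score.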
Mathematical Problems in Engineering 9
Table 5. The best prediction results (AUC) obtained by the CPDP approach based on TDSelector with cosine similarity; NoD represents the baseline method (α = 1), and δ is the Cliff's delta effect size relative to NoD.

Cosine similarity  Ant    Xalan  Camel  Ivy    Jedit  Lucene Poi    Synapse Velocity Xerces Eclipse Equinox Lucene2 Mylyn  Pde    Mean±Std     δ
Linear             0.813  0.676  0.603  0.793  0.700  0.611  0.758  0.741   0.512    0.742  0.783   0.760   0.739   0.705  0.729  0.711±0.081  0.338
Logistic           0.802  0.674  0.595  0.793  0.665  0.621  0.759  0.765   0.579    0.745  0.773   0.738   0.712   0.707  0.740  0.711±0.070  0.351
Square root        0.799  0.654  0.596  0.807  0.735  0.626  0.746  0.762   0.500    0.740  0.774   0.560   0.722   0.700  0.738  0.697±0.091  0.249
Logarithmic        0.798  0.662  0.594  0.793  0.731  0.611  0.748  0.744   0.500    0.758  0.774   0.700   0.755   0.702  0.741  0.707±0.083  0.351
Inverse cotangent  0.798  0.652  0.592  0.793  0.659  0.611  0.749  0.741   0.500    0.764  0.773   0.556   0.739   0.695  0.734  0.690±0.092  0.213
NoD (α = 1)        0.765  0.652  0.592  0.793  0.536  0.611  0.736  0.741   0.358    0.740  0.639   0.543   0.709   0.665  0.701  0.652±0.113  -
Table 6. The best prediction results (AUC) obtained by the CPDP approach based on TDSelector with Euclidean distance; NoD represents the baseline method (α = 1), and δ is the Cliff's delta effect size relative to NoD.

Euclidean distance Ant    Xalan  Camel  Ivy    Jedit  Lucene Poi    Synapse Velocity Xerces Eclipse Equinox Lucene2 Mylyn  Pde    Mean±Std     δ
Linear             0.795  0.727  0.598  0.826  0.793  0.603  0.714  0.757   0.545    0.775  0.773   0.719   0.722   0.697  0.744  0.719±0.080  0.369
Logistic           0.787  0.750  0.603  0.832  0.766  0.613  0.716  0.767   0.556    0.745  0.773   0.698   0.722   0.690  0.730  0.717±0.075  0.360
Square root        0.796  0.743  0.598  0.820  0.720  0.618  0.735  0.786   0.564    0.737  0.774   0.696   0.722   0.690  0.750  0.715±0.076  0.342
Logarithmic        0.794  0.746  0.598  0.819  0.722  0.607  0.714  0.757   0.573    0.739  0.778   0.722   0.722   0.690  0.748  0.715±0.072  0.324
Inverse cotangent  0.796  0.749  0.603  0.820  0.701  0.623  0.714  0.787   0.538    0.750  0.773   0.589   0.763   0.690  0.722  0.708±0.084  0.280
NoD (α = 1)        0.785  0.681  0.598  0.819  0.600  0.592  0.714  0.757   0.488    0.737  0.657   0.503   0.722   0.690  0.678  0.668±0.096  -
Table 7. The best prediction results (AUC) obtained by the CPDP approach based on TDSelector with Manhattan distance; NoD represents the baseline method (α = 1), and δ is the Cliff's delta effect size relative to NoD.

Manhattan distance Ant    Xalan  Camel  Ivy    Jedit  Lucene Poi    Synapse Velocity Xerces Eclipse Equinox Lucene2 Mylyn  Pde    Mean±Std     δ
Linear             0.804  0.753  0.599  0.816  0.689  0.626  0.695  0.748   0.500    0.749  0.773   0.633   0.692   0.695  0.668  0.696±0.084  0.187
Logistic           0.799  0.760  0.607  0.830  0.674  0.621  0.735  0.794   0.520    0.756  0.773   0.680   0.559   0.695  0.668  0.705±0.084  0.249
Square root        0.795  0.755  0.604  0.816  0.693  0.627  0.704  0.750   0.510    0.749  0.773   0.532   0.523   0.695  0.668  0.680±0.100  0.164
Logarithmic        0.794  0.755  0.603  0.816  0.664  0.589  0.695  0.763   0.524    0.756  0.773   0.532   0.523   0.695  0.668  0.677±0.102  0.116
Inverse cotangent  0.794  0.749  0.608  0.821  0.667  0.609  0.710  0.748   0.500    0.758  0.773   0.532   0.523   0.695  0.668  0.677±0.103  0.133
NoD (α = 1)        0.794  0.704  0.597  0.816  0.642  0.589  0.695  0.748   0.464    0.749  0.693   0.532   0.500   0.695  0.668  0.659±0.105  -
Figure 4. A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow.
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078); it is followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the fact that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector, from these two aspects, is presented in Figure 4.
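The five normalization methods compared above are not fully specified in this excerpt. The following is a minimal sketch assuming common functional forms for each family; the exact formulas in the paper may differ:

```python
import math

def norm_linear(x, lo, hi):
    """Linear (min-max) scaling to [0, 1]."""
    return 0.0 if hi == lo else (x - lo) / (hi - lo)

def norm_logistic(x):
    """Logistic (sigmoid) function, mapping any x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def norm_sqrt(x, hi):
    """Square-root scaling, assuming x >= 0 and hi = max observed value."""
    return 0.0 if hi == 0 else math.sqrt(x) / math.sqrt(hi)

def norm_log(x, hi):
    """Logarithmic scaling, assuming x >= 0 and hi = max observed value."""
    return 0.0 if hi == 0 else math.log(1 + x) / math.log(1 + hi)

def norm_arccot(x):
    """Inverse-cotangent-style scaling via arctan, mapped to [0, 1)."""
    return math.atan(x) * 2.0 / math.pi
```

All five map a raw defect count into [0, 1] so that it can be mixed with a similarity value on the same scale.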
Then, we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for each pair of combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
6.3. Answer to RQ3. A comparison between our approach and two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than the two baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained from the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, among these defective instances, our method returns more instances with more than one defect, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized because of the instances with more defects. The result further validates that selecting training data with consideration of defects is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
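AUC, the evaluation measure used throughout these comparisons, can be read as the probability that a predictor ranks a randomly chosen defective instance above a randomly chosen clean one. A minimal rank-based implementation (not the authors' code):

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative (ties count half); equivalent to the area under the ROC curve."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A random predictor scores about 0.5 on this measure and a perfect ranking scores 1.0, which is why values such as 0.719 versus 0.652 represent a meaningful gap.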
7 Discussion
7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances kept for each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step of 1. As shown in Figure 5, for the three combinations in question, selecting fewer nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8. Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size ("-" marks the self-comparison; Lin = linear, Logi = logistic, Sqrt = square root, Loga = logarithmic, Acot = inverse cotangent).

                      Cosine similarity                       Euclidean distance                      Manhattan distance
                      Lin     Logi    Sqrt    Loga    Acot    Lin     Logi    Sqrt    Loga    Acot    Lin     Logi    Sqrt    Loga    Acot
Cosine + Linear       -       0.018   0.084   0.000   0.116   -0.049  -0.036  -0.004  -0.013  -0.009  0.138   0.049   0.164   0.178   0.169
Euclidean + Linear    0.049   0.102   0.111   0.062   0.164   -       0.036   0.040   0.058   0.089   0.209   0.102   0.249   0.276   0.244
Manhattan + Logistic  -0.049  -0.022  0.022   -0.013  0.111   -0.102  -0.076  -0.080  -0.049  -0.031  0.053   -       0.124   0.151   0.147
Table 9. A comparison between our approach and the two baseline methods for the data sets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question; the last two columns give the growth rate (%) of the TDSelector result ("Euclidean + Linear") over each baseline.

Test set  Baseline1  Baseline2  vs. Baseline1 (%)  vs. Baseline2 (%)
Ant       0.785      0.803      1.3                -1.0
Xalan     0.657      0.675      10.7               7.7
Camel     0.595      0.624      0.5                -4.2
Ivy       0.789      0.802      4.7                3.0
Jedit     0.694      0.782      14.3               1.4
Lucene    0.608      0.701      -0.8               -14.0
Poi       0.691      0.789      3.3                -9.5
Synapse   0.740      0.748      2.3                1.2
Velocity  0.330      0.331      65.2               64.7
Xerces    0.714      0.753      8.5                2.9
Eclipse   0.706      0.744      10.2               4.6
Equinox   0.587      0.720      23.1               0.3
Lucene2   0.705      0.724      2.5                -0.2
Mylyn     0.631      0.646      9.3                6.8
Pde       0.678      0.737      10.4               1.5
Avg       0.663      0.705      10.6               4.3

Cliff's delta: Baseline1 vs. TDSelector, δ = -0.409; Baseline2 vs. TDSelector, δ = -0.009.
Table 10. Comparison of the defective instances of the simplified training dataset obtained from different methods on the Velocity project.

Method      defect instances / instances  instances (defects > 1) / defect instances
Baseline1   0.375                         0.247
Baseline2   0.393                         0.291
TDSelector  0.376                         0.487
Figure 5. The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k for the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear").
when k is equal to 10. The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, this best result is still worse than those of the other two combinations.
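The candidate set behind these experiments is formed by taking, for each test instance, its k nearest training instances. A minimal sketch of that step, assuming Euclidean distance and simple feature tuples (the function name is illustrative):

```python
import math

def knn_training_set(test_set, candidates, k=10):
    """For each test instance, keep its k nearest candidate training
    instances (by Euclidean distance); the union over all test instances
    forms the initial training data set (TDS)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = set()
    for t in test_set:
        ranked = sorted(range(len(candidates)),
                        key=lambda i: dist(t, candidates[i]))
        selected.update(ranked[:k])  # indices of the k nearest candidates
    return [candidates[i] for i in sorted(selected)]
```

Because the union is deduplicated, increasing k grows the TDS at most linearly in the number of test instances, which is why the sweep from k = 1 to k = 10 remains cheap.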
Figure 6. The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k for the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear").
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7. Percentage of defective instances with different numbers of bugs: (a) is shown from the viewpoint of a single dataset (release), while (b) is shown from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity, also from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly, at a first stage, according to a threshold on the number of bugs in each training instance (namely, three in this paper); our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold will be investigated in our future work.
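The two-stage procedure of TDSelector-3 described above can be sketched as follows. The function and parameter names are illustrative, and `score_fn` stands in for the TDSelector score of (2):

```python
def tdselector3(candidates, score_fn, top_n, bug_threshold=3):
    """Two-stage selection sketch: instances with at least `bug_threshold`
    bugs enter the training set directly; the remaining candidates are
    ranked by the TDSelector score and the best ones fill the quota.
    `candidates` is a list of (features, n_bugs) pairs."""
    direct = [c for c in candidates if c[1] >= bug_threshold]
    rest = [c for c in candidates if c[1] < bug_threshold]
    ranked = sorted(rest, key=lambda c: score_fn(c[0]), reverse=True)
    return direct + ranked[:max(0, top_n - len(direct))]
```

The first stage is a simple threshold filter, which is why it is cheap compared with scoring every candidate; only the remainder pays the cost of similarity computation.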
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study could change under different settings of the above factors.
Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to the mean value and standard deviation, in this paper we also utilized the Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs markedly better than baseline1, indicated by |δ| = 0.409 > 0.33.
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of our results. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8. A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector ("Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic"). The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and other normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8 Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study was conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances to training instances and the defects those instances contain, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method in a comparison with the baseline methods in the context of M2O CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to explore the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167-199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706-720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248-256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91-100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409-418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1-10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66-72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489-498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540-578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397-402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51-56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170-190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1-22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455-1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1-38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naive Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062-1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646-25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182-191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508-519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496-507, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1-45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91-102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45-54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321-339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969-980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434-441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531-577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101-1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, pp. 277-282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111-117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382-391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69-81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31-41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545-555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311-321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13-23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181-190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300-310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171-180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356-370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583-621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597-599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43-48, 2010.
4 Mathematical Problems in Engineering
[Figure 1: Two CPDP scenarios used in this paper. (a) General CPDP: a predictor is trained on training data set Ss and tested on target data set St. (b) Improved CPDP using training data selection: a strategy for instance selection reduces Ss to a reduced training data set Ss′ before the predictor is trained. Instances are labeled buggy, non-buggy, or unlabeled, each described by metric/feature values f.]
[Figure 2: An example of the selection of training instances. Training instances I1–I5 in Ss, each with metric values f1–f4 and a label (#defects), are ranked by distance(Ii, Itest) to a test instance Itest in St.]
[Figure 3: The overall structure of TDSelector for CPDP. From the data sets (e.g., Project A-v1, A-v2, B-v1, C-v1, C-v2), one project version (Project C-v1) forms the test set and the remaining versions form the training set (TDS setup). Candidate training instances are ranked by the defects of each instance and the similarity of software metrics; the ranked instances (TDSelector) supply the training data used to build the defect predictor, which is then applied to the test data.]
4 Our Approach TDSelector
To improve the prediction performance of CPDP, we leverage the following observations.

Similar Instances. Given a test instance, we can examine its similar training instances, which were labeled beforehand. The defect proneness shared by similar training instances can help us estimate the probability that a test instance is defective. Intuitively, two instances are more likely to have the same state if their metric values are very similar.

Number of Defects (defects). During the selection process, when several training instances have the same distance from a test instance, we need to determine which one should be ranked higher. According to our experience in software defect prediction and other researchers' quantitative analyses of previous defect prediction approaches [34, 35], we believe that more attention should be paid to those training instances with more defects in practice.

The selection of training data based on instance similarity has been used in some prior studies [5, 12, 35]. However, to the best of our knowledge, the information about defects has not been fully utilized. So, in this paper, we attempt to propose a training data selection approach that combines such information with instance similarity.
4.1 Overall Structure of TDSelector. Figure 3 shows the overall structure of the proposed approach to training data selection, named TDSelector. Before selecting appropriate training data for CPDP, we have to set up a test set and its corresponding initial TDS. For a given project treated as the test set, all the other projects (except the target project) available at hand are used as the initial TDS. This is the so-called many-to-one (M2O) scenario for CPDP [13]. It is quite different from the typical one-to-one (O2O) scenario, where only one randomly selected project is treated as the training set for a given target project (namely, the test set).

When both of the sets are given, the ranks of training instances are calculated based on the similarity of software metrics and then returned for each test instance. For the initial TDS, we also collect each training instance's defects and thus rank these instances by their defects. Then, we rate each training instance by combining the two types of ranks in some way and identify the top-k training instances for each test instance according to their final scores. Finally, we use the predictor trained with the final TDS to predict defect proneness in the test set. We describe the core component of TDSelector, namely, the scoring scheme, in the following subsection.
4.2 Scoring Scheme. Each instance in the training set and the test set is treated as a vector of features (namely, software metrics), and we calculate the similarity between them in terms of a similarity index (such as cosine similarity, Euclidean distance, or Manhattan distance, as shown in Table 1). Training instances are then ranked by the similarity between each of them and a given test instance.

For instance, the cosine similarity between a training instance $I_p$ and the target instance $I_q$ is computed via their vector representations, described as follows:
$$\mathit{Sim}(I_p, I_q) = \frac{\vec{I_p} \cdot \vec{I_q}}{\|\vec{I_p}\| \times \|\vec{I_q}\|} = \frac{\sum_{i=1}^{n} (f_{pi} \times f_{qi})}{\sqrt{\sum_{i=1}^{n} f_{pi}^2} \times \sqrt{\sum_{i=1}^{n} f_{qi}^2}} \quad (1)$$

where $\vec{I_p}$ and $\vec{I_q}$ are the metric vectors for $I_p$ and $I_q$, respectively, and $f_{*i}$ represents the $i$th metric value of instance $I_*$.
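As a concrete illustration, the cosine similarity of (1) can be written in a few lines of Python (an illustrative sketch, not the authors' implementation):

```python
import math

def cosine_similarity(p, q):
    # Eq. (1): dot product of the two metric vectors divided by the
    # product of their Euclidean norms.
    dot = sum(fp * fq for fp, fq in zip(p, q))
    norm_p = math.sqrt(sum(fp * fp for fp in p))
    norm_q = math.sqrt(sum(fq * fq for fq in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # convention for an all-zero metric vector
    return dot / (norm_p * norm_q)

# Parallel vectors are maximally similar; orthogonal ones are not.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```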
Additionally, for each training instance, we also consider the factor defects in order to further enrich the ranking of its relevant instances. The assumption here is that the more previous defects an instance has, the richer its information. So, we propose a scoring scheme to rank those candidate training instances, defined as below:

$$\mathit{Score}(I_p, I_q) = \alpha \cdot \mathit{Sim}(I_p, I_q) + (1 - \alpha) \cdot N(\mathit{defect}_p) \quad (2)$$

where $\mathit{defect}_p$ represents the defects of $I_p$, $\alpha$ is a weighting factor ($0 \le \alpha \le 1$) that is learned from training data using Algorithm 1 (see Algorithm 1), and $N(\mathit{defect}_p)$ is a function used to normalize defects with values ranging from 0 to 1.
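The weighted combination in (2) is straightforward to sketch; the snippet below pairs it with the linear normalization of Table 1 for concreteness (the helper names are ours, not from the paper):

```python
def linear_norm(x, x_min, x_max):
    # Linear (min-max) normalization from Table 1.
    return (x - x_min) / (x_max - x_min) if x_max > x_min else 0.0

def score(similarity, defects, alpha, all_defects):
    # Eq. (2): Score = alpha * Sim + (1 - alpha) * N(defects).
    n = linear_norm(defects, min(all_defects), max(all_defects))
    return alpha * similarity + (1 - alpha) * n

# A training instance holding the maximum defect count gets N(defects) = 1,
# so with alpha = 0.5 it scores halfway between its similarity and 1.
print(score(0.8, 3, 0.5, [0, 1, 3]))  # 0.9
```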
Algorithm 1: Optimizing the parameter α.

Input:
(1) Candidate TDS S_s = {I_s1, I_s2, ..., I_sm}; test set S_t = {I_t1, I_t2, ..., I_tl} (m > l);
(2) defects = {defect(I_s1), defect(I_s2), ..., defect(I_sm)}, and k = 10.
Output:
(3) α (α ∈ [0, 1]).
Method:
(4) Initialize α = 0, S_s(α) = ∅;
(5) While (α ≤ 1) do
(6)   For i = 1; i ≤ l; i++
(7)     For j = 1; j ≤ m; j++
(8)       Score(I_ti, I_sj) = α · Sim(I_ti, I_sj) + (1 − α) · N(defect(I_sj));
(9)     End For
(10)    descSort({Score(I_ti, I_sj) | j = 1, ..., m});  // sort the m training instances in descending order
(11)    S_s(α) = S_s(α) ∪ {Top-k training instances};  // select the top k instances
(12)  End For
(13)  AUC ⇐ S_s(α) —CPDP→ S_t;  // prediction result
(14)  α = α + 0.1;
(15) End While
(16) Return (α | max_α AUC)
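Algorithm 1 is a plain grid search over α in steps of 0.1. Its loop structure can be sketched as follows; `sim`, `norm_defect`, and `evaluate_auc` are hypothetical hooks standing in for the similarity index, the normalization N, and the CPDP train-and-evaluate step (lines (13) of the algorithm), respectively:

```python
def optimize_alpha(train, test, sim, norm_defect, evaluate_auc, k=10):
    """Grid search over alpha in {0.0, 0.1, ..., 1.0} (Algorithm 1, sketched).

    train: list of (metric_vector, defects) pairs; test: metric vectors.
    evaluate_auc(indices) trains a predictor on the selected training
    instances and returns its AUC on the test set.
    """
    best_alpha, best_auc = 0.0, -1.0
    for step in range(11):
        alpha = step / 10.0
        selected = set()
        for t in test:
            ranked = sorted(
                range(len(train)),
                key=lambda j: alpha * sim(t, train[j][0])
                + (1.0 - alpha) * norm_defect(train[j][1]),
                reverse=True,
            )
            selected.update(ranked[:k])  # top-k per test instance, merged
        auc = evaluate_auc(sorted(selected))
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha
```

In the paper's setting `evaluate_auc` corresponds to running the Logistic Regression predictor of Section 5.4 on the reduced TDS; here it is just a callback so the search itself stays self-contained.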
Table 1: Similarity indexes and normalization methods used in this paper.

Similarity:
| Index | Formula |
|---|---|
| Cosine | $\cos(X, Y) = \sum_{k=1}^{n} x_k y_k \,/\, \big(\sqrt{\sum_{k=1}^{n} x_k^2}\,\sqrt{\sum_{k=1}^{n} y_k^2}\big)$ |
| Euclidean distance | $d(X, Y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$ |
| Manhattan distance | $d(X, Y) = \sum_{k=1}^{n} \lvert x_k - y_k \rvert$ |

Normalization:
| Method | Formula |
|---|---|
| Linear | $N(x) = (x - x_{\min}) / (x_{\max} - x_{\min})$ |
| Logistic | $N(x) = 1/(1 + e^{-x}) - 0.5$ |
| Square root | $N(x) = 1 - 1/\sqrt{1 + x}$ |
| Logarithmic | $N(x) = \log_{10}(x + 1)$ |
| Inverse cotangent | $N(x) = \arctan(x) \times 2/\pi$ |
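The five normalization functions of Table 1 translate directly into code (a sketch; each takes a nonnegative defect count x, and the linear variant additionally needs the observed minimum and maximum):

```python
import math

def linear(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min) if x_max > x_min else 0.0

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

def square_root(x):
    return 1.0 - 1.0 / math.sqrt(1.0 + x)

def logarithmic(x):
    return math.log10(x + 1.0)

def inverse_cotangent(x):
    return math.atan(x) * 2.0 / math.pi

# All five map x = 0 to 0 and grow monotonically with x.
print(logistic(0))      # 0.0
print(square_root(3))   # 0.5
print(logarithmic(9))   # 1.0
```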
Normalization is a commonly used data preprocessing technique in mathematics and computer science [36]. Graf and Borer [37] have confirmed that normalization can improve the prediction performance of classification models. For this reason, we normalize the defects of training instances when using TDSelector. There are many normalization methods; in this study, we introduce five typical normalization methods used in machine learning [36, 38]. The formulas of the five normalization methods are listed in Table 1.
For each test instance, the top-k training instances ranked in terms of their scores will be returned. Hence, the final TDS is composed by merging the sets of the top-k training instances for each test instance, with duplicate instances removed.
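The construction of the final TDS thus amounts to a per-test-instance top-k selection followed by a duplicate-free union (a sketch with a hypothetical `score` callback implementing Eq. (2)):

```python
def build_final_tds(num_candidates, test_set, score, k=10):
    # For each test instance, keep the k highest-scoring training
    # instances; merging through a set removes duplicates.
    final = set()
    for t in test_set:
        ranked = sorted(range(num_candidates),
                        key=lambda j: score(t, j), reverse=True)
        final.update(ranked[:k])
    return sorted(final)

# Two test instances agreeing on the same top-2 yield only two indices.
print(build_final_tds(5, ["t1", "t2"], lambda t, j: j, k=2))  # [3, 4]
```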
5 Experimental Setup
5.1 Research Questions. Our experiments were conducted to find empirical evidence that answers the following three research questions.

RQ1: Does the Consideration of Defects Improve the Performance of CPDP? Unlike the previous methods [1, 5, 7, 12, 29], TDSelector ranks candidate training instances in terms of both defects and metric-based similarity. To evaluate the effectiveness of the proposed method considering the additional information of defects, we tested TDSelector on the experimental data described in Section 5.2. According to (2), we also empirically analyzed the impact of the parameter α on prediction results.
RQ2: Which Combination of Similarity and Normalization Is More Suitable for TDSelector? Equation (2) comprises two parts, namely, similarity and the normalization of defects. For each part, several commonly used methods can be adopted in our context. To take full advantage of TDSelector, one would wonder which combination of similarity and normalization should be chosen. Therefore, it is necessary to compare the effects of different combinations of similarity and normalization methods on prediction results and to determine the best one for TDSelector.
RQ3: Can TDSelector-Based CPDP Outperform the Baseline Methods? Cross-project prediction has attracted much research interest in recent years, and a few CPDP approaches using training data selection have also been proposed, e.g., Peter-filter-based CPDP [5] (labeled as baseline1) and TCA+ (Transfer Component Analysis) based CPDP [39] (labeled as baseline2). To answer the third question, we compared the TDSelector-based CPDP proposed in this paper with the above two state-of-the-art methods.
5.2 Data Collection. To evaluate the effectiveness of TDSelector, we used 14 open-source projects written in Java from two online public software repositories, namely, PROMISE [40] and AEEEM [41]. The data statistics of the 14 projects in question are presented in Table 2, where #Instance and #Defect are the numbers of instances and defective instances, respectively, and %Defect is the proportion of defective instances to the total number of instances. Each instance in these projects represents a file of an object class and consists of two parts, namely, software metrics and defects.

The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The information of defects and 20 source code metrics for the projects on PROMISE have been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D'Ambros et al. [41], and each project on it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].

Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each one of the 10 projects was selected as the target dataset once, and we then set up a candidate TDS for CPDP that excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects are used to construct its initial TDS.
5.3 Experiment Design. To answer the three research questions, our experimental procedure, which is designed under the context of M2O in the CPDP scenario, is described as follows.

First, as in many prior studies [1, 5, 15, 35], all software metric values in the training and test sets were normalized using the Z-score method, because these metrics differ in the scales of their numerical values. Since the numbers of software metrics differ between the AEEEM and PROMISE projects, the training set for a given test set was selected from the same repository.
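The Z-score step can be sketched per metric column as follows (illustrative; not the authors' code):

```python
import statistics

def z_score(column):
    # Standardize one software metric: zero mean, unit (population) std.
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        return [0.0] * len(column)  # a constant metric carries no signal
    return [(v - mean) / std for v in column]

print(z_score([2.0, 4.0]))  # [-1.0, 1.0]
```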
Second, to examine whether the consideration of defects improves the performance of CPDP, we compared our approach TDSelector with NoD, a baseline method that considers only the similarity between instances, i.e., α = 1 in (2). Since three similarity computation methods are used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on similarity indexes. The prediction results of each method in question for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size measuring how often the values in one distribution are larger than the values in a second distribution, to compare the results generated by our approach and its corresponding baseline method.
Because Cliff did not suggest δ values corresponding to small, medium, and large effects, we converted Cohen's d effect size to Cliff's δ using the cohd2delta R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 contains descriptors for magnitudes from d = 0.01 to 2.0.
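Cliff's δ itself has a direct definition, P(x > y) − P(x < y) over all cross-pairs of the two samples, which a short sketch makes concrete:

```python
def cliffs_delta(xs, ys):
    # delta = (#pairs with x > y  -  #pairs with x < y) / (|xs| * |ys|)
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

print(cliffs_delta([2, 3], [0, 1]))  # 1.0 (complete separation)
print(cliffs_delta([1, 2], [1, 2]))  # 0.0 (no effect)
```

This quadratic-time form is fine for the 15-value samples compared in this study; a rank-based formulation would scale better for large samples.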
Third, according to the results of the second step of this procedure, 15 combinations based on three typical similarity methods for software metrics and five commonly used normalization functions for defects were examined by the pairwise comparison method. We then determined which combination is more suitable for our approach according to the mean, standard deviation, and Cliff's delta effect size.
Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all 15 test sets using TDSelector and the two competing methods (i.e., baseline1 and baseline2 introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.

After this process is completed, we will discuss the answers to the three research questions of our study.
5.4 Classifier and Evaluation Measure. As the underlying machine learning classifier for CPDP, Logistic Regression (LR), which has been widely used in the defect prediction literature [4, 23, 39, 43–46], is also used in this study. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka). For our experiments, we used the default parameter settings for LR specified in Weka unless otherwise stated.
To evaluate the prediction performance of the different methods in this paper, we utilized the area under the Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier will rank a randomly chosen defective class higher than a randomly chosen defect-free one [47], and it is a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold used to decide whether an instance should be classified as a negative instance [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
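That probabilistic reading of AUC gives a direct (if quadratic-time) way to compute it from predicted scores, sketched below; ties count as half a win:

```python
def auc_from_scores(defective, defect_free):
    # AUC = P(score of a random defective instance > score of a random
    # defect-free one), with ties counted as 0.5 [47].
    wins = 0.0
    for p in defective:
        for n in defect_free:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(defective) * len(defect_free))

print(auc_from_scores([0.9, 0.8], [0.1, 0.2]))  # 1.0 (perfect ranking)
print(auc_from_scores([0.5], [0.5]))            # 0.5 (random-like)
```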
6 Experimental Results
6.1 Answer to RQ1. We compared our approach considering defects with the baseline method NoD, which selects training data in terms of cosine similarity only. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. The average growth rates of the AUC value vary from 5.9% to 9.0% when different normalization methods for defects are utilized. In addition, all the δ values in this table are greater than 0.2, which
Table 2: Data statistics of the projects used in our experiments.

| Repository | Project | Version | #Instance | #Defect | %Defect |
|---|---|---|---|---|---|
| PROMISE | Ant | 1.7 | 745 | 166 | 22.3 |
| PROMISE | Camel | 1.6 | 965 | 188 | 19.5 |
| PROMISE | Ivy | 2.0 | 352 | 40 | 11.4 |
| PROMISE | Jedit | 3.2 | 272 | 90 | 33.1 |
| PROMISE | Lucene | 2.4 | 340 | 203 | 59.7 |
| PROMISE | Poi | 3.0 | 442 | 281 | 63.6 |
| PROMISE | Synapse | 1.2 | 256 | 86 | 33.6 |
| PROMISE | Velocity | 1.4 | 196 | 147 | 75.0 |
| PROMISE | Xalan | 2.6 | 885 | 411 | 46.4 |
| PROMISE | Xerces | 1.4 | 588 | 437 | 74.3 |
| AEEEM | Equinox | 1/1/2005–6/25/2008 | 324 | 129 | 39.8 |
| AEEEM | Eclipse JDT core (Eclipse) | 1/1/2005–6/17/2008 | 997 | 206 | 20.7 |
| AEEEM | Apache Lucene (Lucene2) | 1/1/2005–10/8/2008 | 692 | 20 | 2.9 |
| AEEEM | Mylyn | 1/17/2005–3/17/2009 | 1862 | 245 | 13.2 |
| AEEEM | Eclipse PDE UI (Pde) | 1/1/2005–9/11/2008 | 1497 | 209 | 14.0 |
Table 3: The mappings between different d values and their effectiveness levels.

| Effect size | d | δ |
|---|---|---|
| Very small | 0.01 | 0.008 |
| Small | 0.20 | 0.147 |
| Medium | 0.50 | 0.33 |
| Large | 0.80 | 0.474 |
| Very large | 1.20 | 0.622 |
| Huge | 2.0 | 0.811 |
indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).
We then compared TDSelector with the baseline methods using the other widely used similarity calculation methods; the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. Compared with the corresponding NoD, TDSelector achieves average growth rates of AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7, respectively. More specifically, the highest growth rate of AUC value in Table 6 is 43.6% for Equinox, and in Table 7 it is 39.7% for Lucene2. Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of those baseline methods that do not consider defects.
Table 4: Analyzing the factors similarity and normalization.

| Factor | Method | Mean | Std | δ |
|---|---|---|---|---|
| Similarity | Cosine similarity | 0.704 | 0.082 | −0.133 |
| Similarity | Euclidean distance | 0.719 | 0.080 | – |
| Similarity | Manhattan distance | 0.682 | 0.098 | −0.193 |
| Normalization | Linear | 0.706 | 0.087 | −0.012 |
| Normalization | Logistic | 0.710 | 0.078 | – |
| Normalization | Square root | 0.699 | 0.091 | −0.044 |
| Normalization | Logarithmic | 0.700 | 0.086 | −0.064 |
| Normalization | Inverse cotangent | 0.696 | 0.097 | −0.056 |
In short, during the process of training data selection, the consideration of defects for CPDP can help us select higher-quality training data, thus leading to better classification results.
6.2 Answer to RQ2. Although the inclusion of defects in the selection of quality training data is helpful for better CPDP performance, it is worth noting that our method completely failed on Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Therefore, which combination of similarity and normalization is more suitable for TDSelector?
First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of the normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factor.
Table 5: The best prediction results obtained by the CPDP approach based on TDSelector with cosine similarity. NoD represents the baseline method (α = 1); +(%) denotes the growth rate of the AUC value over NoD, with – indicating no change.

| Cosine similarity | | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean±Std | δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linear | α | 0.7 | 0.9 | 0.9 | 1.0 | 0.9 | 1.0 | 0.9 | 1.0 | 0.6 | 0.9 | 0.8 | 0.6 | 0.7 | 0.7 | 0.5 | | 0.338 |
| | AUC | 0.813 | 0.676 | 0.603 | 0.793 | 0.700 | 0.611 | 0.758 | 0.741 | 0.512 | 0.742 | 0.783 | 0.760 | 0.739 | 0.705 | 0.729 | 0.711±0.081 | |
| | +(%) | 6.3 | 3.7 | 1.9 | – | 30.6 | – | 3.0 | – | 43.0 | 0.3 | 22.6 | 39.4 | 4.1 | 5.9 | 4.0 | 9.0 | |
| Logistic | α | 0.7 | 0.5 | 0.7 | 1.0 | 0.7 | 0.6 | 0.6 | 0.6 | 0.5 | 0.5 | 0.0 | 0.4 | 0.7 | 0.5 | 0.5 | | 0.351 |
| | AUC | 0.802 | 0.674 | 0.595 | 0.793 | 0.665 | 0.621 | 0.759 | 0.765 | 0.579 | 0.745 | 0.773 | 0.738 | 0.712 | 0.707 | 0.740 | 0.711±0.070 | |
| | +(%) | 4.8 | 3.4 | 0.5 | – | 24.1 | 1.6 | 3.1 | 3.2 | 61.7 | 0.7 | 21.0 | 35.5 | 0.3 | 6.2 | 5.6 | 9.0 | |
| Square root | α | 0.7 | 0.7 | 0.6 | 0.6 | 0.7 | 0.6 | 0.7 | 0.9 | 0.5 | 1.0 | 0.4 | 0.6 | 0.6 | 0.6 | 0.6 | | 0.249 |
| | AUC | 0.799 | 0.654 | 0.596 | 0.807 | 0.735 | 0.626 | 0.746 | 0.762 | 0.500 | 0.740 | 0.774 | 0.560 | 0.722 | 0.700 | 0.738 | 0.697±0.091 | |
| | +(%) | 4.4 | 0.3 | 0.7 | 1.8 | 37.1 | 2.5 | 1.4 | 2.8 | 39.7 | – | 21.0 | 2.8 | 1.7 | 5.3 | 5.3 | 6.9 | |
| Logarithmic | α | 0.6 | 0.6 | 0.9 | 1.0 | 0.7 | 1.0 | 0.7 | 0.7 | 0.5 | 0.9 | 0.5 | 0.5 | 0.6 | 0.6 | 0.6 | | 0.351 |
| | AUC | 0.798 | 0.662 | 0.594 | 0.793 | 0.731 | 0.611 | 0.748 | 0.744 | 0.500 | 0.758 | 0.774 | 0.700 | 0.755 | 0.702 | 0.741 | 0.707±0.083 | |
| | +(%) | 4.3 | 1.5 | 0.3 | – | 36.4 | – | 1.6 | 0.4 | 39.7 | 2.4 | 21.2 | 28.5 | 6.3 | 5.5 | 5.8 | 8.5 | |
| Inverse cotangent | α | 0.7 | 1.0 | 1.0 | 1.0 | 0.7 | 1.0 | 0.7 | 1.0 | 0.6 | 0.7 | 0.0 | 0.7 | 0.7 | 0.7 | 0.7 | | 0.213 |
| | AUC | 0.798 | 0.652 | 0.592 | 0.793 | 0.659 | 0.611 | 0.749 | 0.741 | 0.500 | 0.764 | 0.773 | 0.556 | 0.739 | 0.695 | 0.734 | 0.690±0.092 | |
| | +(%) | 4.3 | – | – | – | 22.9 | – | 1.8 | – | 39.7 | 3.2 | 21.0 | 2.1 | 4.1 | 4.4 | 4.8 | 5.9 | |
| NoD (α = 1) | AUC | 0.765 | 0.652 | 0.592 | 0.793 | 0.536 | 0.611 | 0.736 | 0.741 | 0.358 | 0.740 | 0.639 | 0.543 | 0.709 | 0.665 | 0.701 | 0.652±0.113 | |
Table 6: The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance. NoD represents the baseline method (α = 1); +(%) denotes the growth rate of the AUC value over NoD, with – indicating no change.

| Euclidean distance | | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean±Std | δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linear | α | 0.9 | 0.9 | 1.0 | 0.9 | 0.9 | 0.8 | 1.0 | 1.0 | 0.8 | 0.8 | 0.0 | 0.6 | 1.0 | 0.8 | 0.8 | | 0.369 |
| | AUC | 0.795 | 0.727 | 0.598 | 0.826 | 0.793 | 0.603 | 0.714 | 0.757 | 0.545 | 0.775 | 0.773 | 0.719 | 0.722 | 0.697 | 0.744 | 0.719±0.080 | |
| | +(%) | 1.3 | 6.8 | – | 0.9 | 32.2 | 1.9 | – | – | 11.7 | 5.2 | 17.6 | 43.0 | – | 1.1 | 9.6 | 7.7 | |
| Logistic | α | 0.7 | 0.8 | 0.4 | 0.7 | 0.7 | 0.5 | 0.6 | 0.9 | 0.9 | 0.9 | 0.0 | 0.7 | 1.0 | 1.0 | 0.9 | | 0.360 |
| | AUC | 0.787 | 0.750 | 0.603 | 0.832 | 0.766 | 0.613 | 0.716 | 0.767 | 0.556 | 0.745 | 0.773 | 0.698 | 0.722 | 0.690 | 0.730 | 0.717±0.075 | |
| | +(%) | 0.3 | 10.1 | 0.8 | 1.6 | 27.7 | 3.5 | 0.3 | 1.3 | 13.9 | 1.1 | 17.6 | 38.8 | – | – | 7.5 | 7.2 | |
| Square root | α | 0.7 | 0.8 | 1.0 | 0.7 | 0.8 | 0.6 | 0.7 | 0.7 | 0.7 | 1.0 | 0.7 | 0.8 | 1.0 | 1.0 | 0.9 | | 0.342 |
| | AUC | 0.796 | 0.743 | 0.598 | 0.820 | 0.720 | 0.618 | 0.735 | 0.786 | 0.564 | 0.737 | 0.774 | 0.696 | 0.722 | 0.690 | 0.750 | 0.715±0.076 | |
| | +(%) | 1.4 | 9.1 | – | 0.1 | 20.0 | 4.4 | 2.9 | 3.8 | 15.6 | – | 17.8 | 38.4 | – | – | 10.5 | 7.0 | |
| Logarithmic | α | 0.7 | 0.8 | 1.0 | 1.0 | 0.8 | 0.6 | 1.0 | 1.0 | 0.9 | 0.9 | 0.9 | 0.8 | 1.0 | 1.0 | 0.9 | | 0.324 |
| | AUC | 0.794 | 0.746 | 0.598 | 0.819 | 0.722 | 0.607 | 0.714 | 0.757 | 0.573 | 0.739 | 0.778 | 0.722 | 0.722 | 0.690 | 0.748 | 0.715±0.072 | |
| | +(%) | 1.1 | 9.5 | – | – | 20.3 | 2.5 | – | – | 17.4 | 0.3 | 18.5 | 43.6 | – | – | 10.3 | 7.0 | |
| Inverse cotangent | α | 0.8 | 0.9 | 0.6 | 0.8 | 0.8 | 0.7 | 1.0 | 0.8 | 0.6 | 0.7 | 0.0 | 0.9 | 0.9 | 1.0 | 0.9 | | 0.280 |
| | AUC | 0.796 | 0.749 | 0.603 | 0.820 | 0.701 | 0.623 | 0.714 | 0.787 | 0.538 | 0.750 | 0.773 | 0.589 | 0.763 | 0.690 | 0.722 | 0.708±0.084 | |
| | +(%) | 1.4 | 10.0 | 0.8 | 0.1 | 16.8 | 5.2 | – | 4.0 | 10.2 | 1.8 | 17.6 | 17.0 | 5.6 | – | 6.4 | 5.9 | |
| NoD (α = 1) | AUC | 0.785 | 0.681 | 0.598 | 0.819 | 0.600 | 0.592 | 0.714 | 0.757 | 0.488 | 0.737 | 0.657 | 0.503 | 0.722 | 0.690 | 0.678 | 0.668±0.096 | |
Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance. NoD represents the baseline method (α = 1); +(%) denotes the growth rate of the AUC value over NoD, with – indicating no change.

| Manhattan distance | | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean±Std | δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linear | α | 0.8 | 0.9 | 0.9 | 1.0 | 0.9 | 0.9 | 1.0 | 1.0 | 0.8 | 1.0 | 0.0 | 0.8 | 0.9 | 1.0 | 1.0 | | 0.187 |
| | AUC | 0.804 | 0.753 | 0.599 | 0.816 | 0.689 | 0.626 | 0.695 | 0.748 | 0.500 | 0.749 | 0.773 | 0.633 | 0.692 | 0.695 | 0.668 | 0.696±0.084 | |
| | +(%) | 1.3 | 7.0 | 0.3 | – | 7.3 | 6.3 | – | – | 7.8 | – | 11.6 | 19.0 | 39.7 | – | – | 5.6 | |
| Logistic | α | 0.7 | 0.7 | 0.8 | 0.8 | 0.8 | 0.7 | 0.7 | 0.9 | 0.6 | 0.7 | 0.0 | 0.9 | 0.9 | 1.0 | 1.0 | | 0.249 |
| | AUC | 0.799 | 0.760 | 0.607 | 0.830 | 0.674 | 0.621 | 0.735 | 0.794 | 0.520 | 0.756 | 0.773 | 0.680 | 0.559 | 0.695 | 0.668 | 0.705±0.084 | |
| | +(%) | 0.6 | 8.0 | 1.7 | 1.7 | 5.0 | 5.4 | 5.8 | 6.1 | 12.1 | 0.9 | 11.6 | 27.9 | 12.7 | – | – | 6.9 | |
| Square root | α | 0.9 | 0.9 | 0.9 | 1.0 | 0.8 | 0.8 | 0.9 | 0.8 | 0.9 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | | 0.164 |
| | AUC | 0.795 | 0.755 | 0.604 | 0.816 | 0.693 | 0.627 | 0.704 | 0.750 | 0.510 | 0.749 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.680±0.1 | |
| | +(%) | 0.1 | 7.2 | 1.2 | – | 7.9 | 6.5 | 1.3 | 0.3 | 9.9 | – | 11.6 | – | 4.6 | – | – | 3.1 | |
| Logarithmic | α | 1.0 | 0.9 | 0.9 | 1.0 | 0.9 | 1.0 | 1.0 | 0.8 | 0.9 | 0.9 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | | 0.116 |
| | AUC | 0.794 | 0.755 | 0.603 | 0.816 | 0.664 | 0.589 | 0.695 | 0.763 | 0.524 | 0.756 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.677±0.102 | |
| | +(%) | – | 7.2 | 1.0 | – | 3.4 | – | – | 2.0 | 12.9 | 0.9 | 11.6 | – | 4.6 | – | – | 2.7 | |
| Inverse cotangent | α | 1.0 | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.9 | 1.0 | 0.7 | 0.8 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | | 0.133 |
| | AUC | 0.794 | 0.749 | 0.608 | 0.821 | 0.667 | 0.609 | 0.710 | 0.748 | 0.500 | 0.758 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.677±0.103 | |
| | +(%) | – | 6.4 | 1.8 | 0.6 | 3.9 | 3.4 | 2.2 | – | 7.8 | 1.2 | 11.6 | – | 4.6 | – | – | 2.7 | |
| NoD (α = 1) | AUC | 0.794 | 0.704 | 0.597 | 0.816 | 0.642 | 0.589 | 0.695 | 0.748 | 0.464 | 0.749 | 0.693 | 0.532 | 0.500 | 0.695 | 0.668 | 0.659±0.105 | |
[Figure 4: A guideline for choosing suitable similarity indexes and normalization methods, from the two aspects of similarity (1) and normalization (2). The selection priority is lowered along the direction of the arrow: (1) Euclidean → Cosine → Manhattan + Logistic; (2) Logistic → Linear → Manhattan + Logistic.]
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), followed by the linear normalization method.

Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from the two different aspects is presented in Figure 4.
Then, we considered both factors. According to the results in Tables 5, 6, and 7, grouped by different similarity indexes, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
6.3 Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than the two baseline methods, indicated by the average growth rates of the AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of the AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained from the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of the instances with more than one defect among these defective instances, our method can return more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized due to the instances with more defects. This result further validates that the selection of training data considering defects is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
7 Discussion
7.1 Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step of 1. As shown in Figure 5, for the three combinations in question, selecting the k-nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
Interestingly, for the combinations of "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and they then reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. Columns are grouped by similarity index, each with the five normalizations (Lin = Linear, Log = Logistic, Sqrt = Square root, Logm = Logarithmic, Invcot = Inverse cotangent); "-" marks the self-comparison.

                       Cosine similarity                          Euclidean distance                         Manhattan distance
                       Lin     Log     Sqrt    Logm    Invcot     Lin     Log     Sqrt    Logm    Invcot     Lin     Log     Sqrt    Logm    Invcot
Cosine + Linear        -       0.018   0.084   0.000   0.116      -0.049  -0.036  -0.004  -0.013  -0.009     0.138   0.049   0.164   0.178   0.169
Euclidean + Linear     0.049   0.102   0.111   0.062   0.164      -       0.036   0.040   0.058   0.089      0.209   0.102   0.249   0.276   0.244
Manhattan + Logistic   -0.049  -0.022  0.022   -0.013  0.111      -0.102  -0.076  -0.080  -0.049  -0.031     0.053   -       0.124   0.151   0.147
Table 9: A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all the three methods in question. The improvement columns report the growth rate (%) of the AUC of TDSelector with "Euclidean + Linear" over each baseline.

Test set   Baseline1  Baseline2  Improvement over Baseline1 (%)  Improvement over Baseline2 (%)
Ant        0.785      0.803        1.3     -1.0
Xalan      0.657      0.675       10.7      7.7
Camel      0.595      0.624        0.5     -4.2
Ivy        0.789      0.802        4.7      3.0
Jedit      0.694      0.782       14.3      1.4
Lucene     0.608      0.701       -0.8    -14.0
Poi        0.691      0.789        3.3     -9.5
Synapse    0.740      0.748        2.3      1.2
Velocity   0.330      0.331       65.2     64.7
Xerces     0.714      0.753        8.5      2.9
Eclipse    0.706      0.744       10.2      4.6
Equinox    0.587      0.720       23.1      0.3
Lucene2    0.705      0.724        2.5     -0.2
Mylyn      0.631      0.646        9.3      6.8
Pde        0.678      0.737       10.4      1.5
Avg        0.663      0.705       10.6      4.3

Cliff's delta: Baseline1 vs. TDSelector, δ = -0.409; Baseline2 vs. TDSelector, δ = -0.009.
Table 10: Comparison of the defective instances of the simplified training dataset obtained from different methods on the Velocity project.

Method      defect instances / instances   instances (defects > 1) / defect instances
Baseline1   0.375                          0.247
Baseline2   0.393                          0.291
TDSelector  0.376                          0.487
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
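The sensitivity analysis above can be reproduced with a simple loop over k. The sketch below is ours, not the authors' code: `build_tds` and `evaluate_auc` are placeholders for the TDSelector selection pipeline and the LR-based AUC evaluation, and the toy lambdas only illustrate the bookkeeping.

```python
def k_sensitivity(ks, build_tds, evaluate_auc):
    """Record AUC for each candidate k, as in Figures 5 and 6."""
    results = {}
    for k in ks:
        tds = build_tds(k)            # top-k training instances per test instance
        results[k] = evaluate_auc(tds)
    return results

# Toy example: if AUC grows with k, the best k is the largest value tried
auc = k_sensitivity(range(1, 11), lambda k: k, lambda tds: 0.5 + tds * 0.02)
print(max(auc, key=auc.get))  # 10
```

In the paper's setting, plotting `results` for k = 1..10 yields the curves of Figures 5 and 6.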
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k for the same three combinations).
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC. We also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), for the 15 releases, most of their instances contain no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs: (a) from the viewpoint of a single dataset (release); (b) from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS, after removing redundant ones.
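A minimal sketch of the TDSelector-3 selection rule described above. The helper names are ours: `bug_count` returns an instance's number of bugs, and `score_select` stands in for the score-based ranking of (2) applied to the remaining instances.

```python
def tdselector3(initial_tds, bug_count, score_select, threshold=3):
    """Split the initial TDS: instances with >= threshold bugs are kept directly;
    the rest are ranked and selected by the TDSelector scoring scheme."""
    direct = [i for i in initial_tds if bug_count(i) >= threshold]
    remaining = [i for i in initial_tds if bug_count(i) < threshold]
    selected = score_select(remaining)
    # Merge the two parts and drop duplicates while preserving order
    return list(dict.fromkeys(direct + selected))

# Toy usage: bug counts equal the instance ids; the scorer keeps one instance
print(tdselector3([1, 2, 3, 4], lambda i: i, lambda rest: rest[:1]))  # [3, 4, 1]
```

The threshold of three matches the value used in the paper; as the authors note, tuning it automatically is left as future work.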
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average, TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly, according to a threshold for the number of bugs in each training instance (namely, three in this paper), at the first stage. Our approach is then able to be applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings of the above factors.
Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ |δ| < 0.33) or very small (0.008 ≤ |δ| < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs better than Baseline1, as indicated by |δ| = 0.409 > 0.33.
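Cliff's delta used above can be computed directly from two samples of AUC values. A minimal sketch, using the standard magnitude thresholds (the label names follow the common Romano-style convention rather than the paper's wording):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#(x > y) - #(x < y)) / (len(xs) * len(ys))."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

print(cliffs_delta([1, 2, 3], [1, 2, 3]))  # 0.0
print(magnitude(0.409))  # medium
```

A negative delta means the first sample tends to be smaller than the second, which is why the negative δ values in Table 9 favor TDSelector over the baselines.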
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of the results of our study. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector, for the combinations Euclidean + Linear, Cosine + Linear, and Manhattan + Logistic. The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of the M2O scenario of CPDP. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction - classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer System, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
Figure 3: The overall structure of TDSelector for CPDP.
4. Our Approach: TDSelector
To improve the prediction performance of CPDP, we leverage the following observations.
Similar Instances. Given a test instance, we can examine its similar training instances that were labeled before. The defect proneness shared by similar training instances could help us identify the probability that a test instance is defective. Intuitively, two instances are more likely to have the same state if their metric values are very similar.
Number of Defects (defects). During the selection process, when several training instances have the same distance from a test instance, we need to determine which one should be ranked higher. According to our experience in software defect prediction and other researchers' studies on the quantitative analysis of previous defect prediction approaches [34, 35], we believe that more attention should be paid to those training instances with more defects in practice.
The selection of training data based on instance similarity has been used in some prior studies [5, 12, 35]. However, to the best of our knowledge, the information about defects has not been fully utilized. So, in this paper, we attempt to propose a training data selection approach combining such information with instance similarity.
4.1. Overall Structure of TDSelector. Figure 3 shows the overall structure of the proposed approach to training data selection, named TDSelector. Before selecting appropriate training data for CPDP, we have to set up a test set and its corresponding initial TDS. For a given project treated as the test set, all the other projects (except the target project) available at hand are used as the initial TDS. This is the so-called many-to-one (M2O) scenario for CPDP [13]. It is quite different from the typical O2O (one-to-one) scenario, where only one randomly selected project is treated as the training set for a given target project (namely, the test set).
When both of the sets are given, the ranks of training instances are calculated based on the similarity of software metrics and then returned for each test instance. For the initial TDS, we also collect each training instance's defects and thus rank these instances by their defects. Then, we rate each training instance by combining the two types of ranks in some way and identify the top-k training instances for each test instance according to their final scores. Finally, we use the predictor trained with the final TDS to predict defect proneness in the test set. We describe the core component of TDSelector, namely the scoring scheme, in the following subsection.
4.2. Scoring Scheme. For each instance in the training set and test set, which is treated as a vector of features (namely, software metrics), we calculate the similarity between them in terms of a similarity index (such as cosine similarity, Euclidean distance, and Manhattan distance, as shown in Table 1). Training instances are then ranked by the similarity between each of them and a given test instance.
For instance, the cosine similarity between a training instance I_p and the target instance I_q is computed via their vector representations, described as follows:

\[ Sim(I_p, I_q) = \frac{\vec{I_p} \cdot \vec{I_q}}{\|\vec{I_p}\| \times \|\vec{I_q}\|} = \frac{\sum_{i=1}^{n} (f_{pi} \times f_{qi})}{\sqrt{\sum_{i=1}^{n} f_{pi}^2} \times \sqrt{\sum_{i=1}^{n} f_{qi}^2}}, \tag{1} \]

where \(\vec{I_p}\) and \(\vec{I_q}\) are the metric vectors for I_p and I_q, respectively, and \(f_{*i}\) represents the i-th metric value of instance I_*.
Additionally, for each training instance, we also consider the factor defects in order to further enrich the ranking of its relevant instances. The assumption here is that the more the previous defects, the richer the information of an instance. So we propose a scoring scheme to rank those candidate training instances, defined as below:

\[ Score(I_p, I_q) = \alpha \, Sim(I_p, I_q) + (1 - \alpha) \, N(defect_p), \tag{2} \]

where \(defect_p\) represents the defects of I_p, \(\alpha\) is a weighting factor (\(0 \le \alpha \le 1\)) which is learned from training data using Algorithm 1 (see Algorithm 1), and \(N(defect_p)\) is a function used to normalize defects with values ranging from 0 to 1.
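The scoring scheme in (2) can be sketched as follows, here using the linear normalization from Table 1 (the function and parameter names are ours):

```python
def linear_normalize(x, x_min, x_max):
    # Linear normalization from Table 1; maps a defect count into [0, 1]
    if x_max == x_min:
        return 0.0
    return (x - x_min) / (x_max - x_min)

def tdselector_score(similarity, defects, alpha, min_defects, max_defects):
    # Equation (2): alpha balances metric similarity against defect count
    n = linear_normalize(defects, min_defects, max_defects)
    return alpha * similarity + (1 - alpha) * n

# alpha = 0.6, similarity 0.8, 5 defects in a 0..10 range:
print(tdselector_score(0.8, 5, 0.6, 0, 10))  # 0.68
```

With alpha = 1 the scheme degenerates to pure similarity-based selection; with alpha = 0 it ranks purely by defect count.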
Algorithm 1: Algorithm of parameter optimization.

Optimizing the parameter α
Input:
(1) Candidate TDS S_s = {I_s1, I_s2, ..., I_sm}; test set S_t = {I_t1, I_t2, ..., I_tl} (m > l)
(2) defects = {defect(I_s1), defect(I_s2), ..., defect(I_sm)}, and k = 10
Output:
(3) α (α ∈ [0, 1])
Method:
(4) Initialize α = 0, S_s(α) = ∅
(5) While (α ≤ 1) do
(6)   For i = 1; i ≤ l; i++
(7)     For j = 1; j ≤ m; j++
(8)       Score(I_ti, I_sj) = α · Sim(I_ti, I_sj) + (1 − α) · N(defect(I_sj))
(9)     End For
(10)    descSort({Score(I_ti, I_sj) | j = 1, ..., m})   // sort m training instances in descending order
(11)    S_s(α) = S_s(α) ∪ {top-k training instances}    // select the top k instances
(12)  End For
(13)  AUC ⇐ CPDP from S_s(α) to S_t                     // prediction result
(14)  α = α + 0.1
(15) End While
(16) Return (α | max_α AUC)
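As a runnable sketch (not the authors' implementation), the grid search of Algorithm 1 can be written as below; `sim`, `norm_defect`, and `evaluate_auc` are placeholders for the similarity index, the defect normalization, and the step that trains an LR predictor on the selected instances and computes AUC on the test set:

```python
def optimize_alpha(test_set, candidates, sim, norm_defect, evaluate_auc, k=10):
    """Grid search for alpha in [0, 1] with step 0.1, as in Algorithm 1."""
    best_alpha, best_auc = 0.0, -1.0
    for step in range(11):
        alpha = step / 10.0
        selected = set()
        for t in test_set:
            # Score and sort all candidates for this test instance (lines 7-10)
            scored = sorted(
                candidates,
                key=lambda s: alpha * sim(t, s) + (1 - alpha) * norm_defect(s),
                reverse=True,
            )
            selected.update(scored[:k])  # union of top-k instances (line 11)
        auc = evaluate_auc(selected, test_set)  # CPDP evaluation (line 13)
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha
```

The nested loops mirror lines (6)-(12) of the pseudocode; duplicates across test instances are removed implicitly by the set union.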
Table 1: Similarity indexes and normalization methods used in this paper (sums run over k = 1, ..., n).

Similarity
  Cosine:             cos(X, Y) = Σ x_k y_k / (√(Σ x_k²) · √(Σ y_k²))
  Euclidean distance: d(X, Y) = √(Σ (x_k − y_k)²)
  Manhattan distance: d(X, Y) = Σ |x_k − y_k|

Normalization
  Linear:             N(x) = (x − x_min) / (x_max − x_min)
  Logistic:           N(x) = 1 / (1 + e^(−x)) − 0.5
  Square root:        N(x) = 1 − 1 / √(1 + x)
  Logarithmic:        N(x) = log₁₀(x + 1)
  Inverse cotangent:  N(x) = arctan(x) · 2 / π
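The five normalization methods of Table 1 translate directly into code; a minimal sketch (function names are ours):

```python
import math

def linear(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min) if x_max > x_min else 0.0

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

def square_root(x):
    return 1.0 - 1.0 / math.sqrt(1.0 + x)

def logarithmic(x):
    return math.log10(x + 1.0)

def inverse_cotangent(x):
    return math.atan(x) * 2.0 / math.pi

# All of the parameter-free variants map a defect count of 0 to 0
for f in (logistic, square_root, logarithmic, inverse_cotangent):
    print(f(0.0))  # 0.0
```

Note that, for nonnegative defect counts, logistic, square root, and inverse cotangent stay within [0, 1), while the logarithmic variant exceeds 1 once x > 9.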
Normalization is a commonly used data preprocessing technique in mathematics and computer science [36]. Graf and Borer [37] have confirmed that normalization can improve the prediction performance of classification models. For this reason, we normalize the defects of training instances when using TDSelector. There are many normalization methods; in this study, we introduce five typical normalization methods used in machine learning [36, 38]. The formulas of the five normalization methods are listed in Table 1.
For each test instance, the top-k training instances ranked in terms of their scores will be returned. Hence, the final TDS is composed by merging the sets of the top-k training instances for each test instance, after duplicate instances are removed.
5. Experimental Setup
5.1 Research Questions. Our experiments were conducted to find empirical evidence that answers the following three research questions.
RQ1: Does the Consideration of Defects Improve the Performance of CPDP? Unlike the previous methods [1, 5, 7, 12, 29], TDSelector ranks candidate training instances in terms of both defects and metric-based similarity. To evaluate the effectiveness of the proposed method considering the additional information of defects, we tested TDSelector on the experimental data described in Section 5.2. According to (2), we also empirically analyzed the impact of the parameter α on prediction results.
RQ2: Which Combination of Similarity and Normalization Is More Suitable for TDSelector? Equation (2) comprises two parts, namely, similarity and the normalization of defects. For each part, several commonly used methods can be adopted in our context. To take full advantage of TDSelector, one would wonder which combination of similarity and normalization should be chosen. Therefore, it is necessary to compare the effects of different combinations of similarity and normalization methods on prediction results and to determine the best one for TDSelector.
RQ3: Can TDSelector-Based CPDP Outperform the Baseline Methods? Cross-project prediction has attracted much research interest in recent years, and a few CPDP approaches using training data selection have also been proposed, e.g.,
Mathematical Problems in Engineering 7
Peter filter based CPDP [5] (labeled as baseline1) and TCA+ (Transfer Component Analysis) based CPDP [39] (labeled as baseline2). To answer the third question, we compared the TDSelector-based CPDP proposed in this paper with the above two state-of-the-art methods.
5.2 Data Collection. To evaluate the effectiveness of TDSelector, we used 14 open-source projects written in Java from two online public software repositories, namely, PROMISE [40] and AEEEM [41]. The data statistics of the 14 projects in question are presented in Table 2, where #Instance and #Defect are the numbers of instances and defective instances, respectively, and %Defect is the proportion of defective instances to the total number of instances. Each instance in these projects represents a file of object class and consists of two parts, namely, software metrics and defects.
The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The information of defects and 20 source code metrics for the projects on PROMISE has been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D'Ambros et al. [41], and each project on it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].
Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each of the 10 projects was selected as the target dataset once, and we then set up a candidate TDS for CPDP that excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects is used to construct its initial TDS.
5.3 Experiment Design. To answer the three research questions, our experimental procedure, designed under the context of M2O in the CPDP scenario, is described as follows.
First, as in many prior studies [1, 5, 15, 35], all software metric values in the training and test sets were normalized by using the Z-score method, because these metrics differ in the scales of their numerical values. Since the numbers of software metrics differ between AEEEM and PROMISE, the training set for a given test set was selected from the same repository.
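For reference, a small sketch of the Z-score standardization applied to each metric column (our illustration; Weka and most machine learning toolkits provide this out of the box):

```python
def z_score(values):
    """Z-score standardization: rescale a metric column to mean 0 and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    # A constant column has zero variance; map it to all zeros.
    return [(v - mean) / std for v in values] if std else [0.0] * n
```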
Second, to examine whether the consideration of defects improves the performance of CPDP, we compared our approach TDSelector with NoD, a baseline method that considers only the similarity between instances, i.e., α = 1 in (2). Since three similarity computation methods are used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on similarity indexes. The prediction results of each method in question for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size measure of how often the values in one distribution are larger than the values in a second distribution, to compare the results generated by our approach and its corresponding baseline method.
Because Cliff did not suggest corresponding δ values to represent small, medium, and large effects, we converted Cohen's d effect size to Cliff's δ using the cohd2delta R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 contains descriptors for magnitudes from d = 0.01 to 2.0.
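Cliff's delta itself is straightforward to compute from two samples of AUC results; a small sketch (our illustration):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y) drawn from xs and ys."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

Identical distributions yield δ = 0, while δ = 1 (or −1) means every value of one sample exceeds (or is exceeded by) every value of the other.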
Third, according to the results of the second step of this procedure, 15 combinations based on three typical similarity methods for software metrics and five commonly used normalization functions for defects were examined by the pairwise comparison method. We then determined which combination is more suitable for our approach according to mean, standard deviation, and Cliff's delta effect size.
Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all the 15 test sets using TDSelector and the two competing methods (i.e., baseline1 and baseline2 introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.
After this process is completed, we discuss the answers to the three research questions of our study.
5.4 Classifier and Evaluation Measure. As the underlying machine learning classifier for CPDP, Logistic Regression (LR), which has been widely used in the defect prediction literature [4, 23, 39, 43-46], is also used in this study. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka). For our experiments, we used the default parameter settings for LR specified in Weka unless otherwise stated.
To evaluate the prediction performance of the different methods in this paper, we utilized the area under the Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier ranks a randomly chosen defective class higher than a randomly chosen defect-free one [47], and it is a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold used to decide whether an instance should be classified as a negative instance [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
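The probabilistic definition of AUC quoted above can be computed directly over all defective/defect-free pairs; a sketch (our illustration; in practice Weka reports this value):

```python
def auc(scores, labels):
    """AUC as the probability that a random defective instance (label 1) is
    scored above a random defect-free one (label 0); ties count as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```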
6. Experimental Results
6.1 Answer to RQ1. We compared our approach considering defects with the baseline method NoD, which selects training data in terms of cosine similarity. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. Obviously, the average growth rates of the AUC value vary from 5.9% to 9.0% when different normalization methods for defects are utilized. In addition, all the δ values in this table are greater than 0.2, which
Table 2: Data statistics of the projects used in our experiments.

Repository | Project | Version | #Instance | #Defect | %Defect
PROMISE | Ant | 1.7 | 745 | 166 | 22.3
PROMISE | Camel | 1.6 | 965 | 188 | 19.5
PROMISE | Ivy | 2.0 | 352 | 40 | 11.4
PROMISE | Jedit | 3.2 | 272 | 90 | 33.1
PROMISE | Lucene | 2.4 | 340 | 203 | 59.7
PROMISE | Poi | 3.0 | 442 | 281 | 63.6
PROMISE | Synapse | 1.2 | 256 | 86 | 33.6
PROMISE | Velocity | 1.4 | 196 | 147 | 75.0
PROMISE | Xalan | 2.6 | 885 | 411 | 46.4
PROMISE | Xerces | 1.4 | 588 | 437 | 74.3
AEEEM | Equinox | 1.1.2005-6.25.2008 | 324 | 129 | 39.8
AEEEM | Eclipse JDT core (Eclipse) | 1.1.2005-6.17.2008 | 997 | 206 | 20.7
AEEEM | Apache Lucene (Lucene2) | 1.1.2005-10.8.2008 | 692 | 20 | 2.9
AEEEM | Mylyn | 1.17.2005-3.17.2009 | 1862 | 245 | 13.2
AEEEM | Eclipse PDE UI (Pde) | 1.1.2005-9.11.2008 | 1497 | 209 | 14.0
Table 3: The mappings between different d values and their effectiveness levels.

Effect size | d | δ
Very small | 0.01 | 0.008
Small | 0.20 | 0.147
Medium | 0.50 | 0.33
Large | 0.80 | 0.474
Very large | 1.20 | 0.622
Huge | 2.0 | 0.811
indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).
We then compared TDSelector with the baseline methods using the other widely used similarity calculation methods; the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. Compared with the corresponding NoD, TDSelector achieves average growth rates of the AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7, respectively. More specifically, the highest growth rate of AUC value in Table 6 is 43.6% for Equinox, and in Table 7 it is 39.7% for Lucene2. Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of those baseline methods that do not take defects into account.
Table 4: Analyzing the factors similarity and normalization.

Factor | Method | Mean | Std | δ
Similarity | Cosine similarity | 0.704 | 0.082 | -0.133
Similarity | Euclidean distance | 0.719 | 0.080 | -
Similarity | Manhattan distance | 0.682 | 0.098 | -0.193
Normalization | Linear | 0.706 | 0.087 | -0.012
Normalization | Logistic | 0.710 | 0.078 | -
Normalization | Square root | 0.699 | 0.091 | -0.044
Normalization | Logarithmic | 0.700 | 0.086 | -0.064
Normalization | Inverse cotangent | 0.696 | 0.097 | -0.056
In short, during the process of training data selection, the consideration of defects for CPDP can help us to select higher quality training data, thus leading to better classification results.
6.2 Answer to RQ2. Although the inclusion of defects in the selection of quality training data is helpful for better CPDP performance, it is worth noting that our method completely failed on Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Therefore, which combination of similarity and normalization is more suitable for TDSelector?
First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of the normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factor.
Table 5: The best prediction results obtained by the CPDP approach based on TDSelector with cosine similarity. NoD represents the baseline method, + (%) denotes the growth rate of the AUC value over NoD (a dash means the AUC value is unchanged), the maximum AUC value of the different normalization methods is underlined, and each number shown in bold indicates that the corresponding AUC value rises by more than 10%. Project order in each row: Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde.

Linear (δ = 0.338):
α:    0.7, 0.9, 0.9, 1.0, 0.9, 1.0, 0.9, 1.0, 0.6, 0.9, 0.8, 0.6, 0.7, 0.7, 0.5
AUC:  0.813, 0.676, 0.603, 0.793, 0.700, 0.611, 0.758, 0.741, 0.512, 0.742, 0.783, 0.760, 0.739, 0.705, 0.729 (mean 0.711 ± 0.081)
+(%): 6.3, 3.7, 1.9, -, 30.6, -, 3.0, -, 43.0, 0.3, 22.6, 39.4, 4.1, 5.9, 4.0 (avg 9.0)

Logistic (δ = 0.351):
α:    0.7, 0.5, 0.7, 1.0, 0.7, 0.6, 0.6, 0.6, 0.5, 0.5, 0.0, 0.4, 0.7, 0.5, 0.5
AUC:  0.802, 0.674, 0.595, 0.793, 0.665, 0.621, 0.759, 0.765, 0.579, 0.745, 0.773, 0.738, 0.712, 0.707, 0.740 (mean 0.711 ± 0.070)
+(%): 4.8, 3.4, 0.5, -, 24.1, 1.6, 3.1, 3.2, 61.7, 0.7, 21.0, 35.5, 0.3, 6.2, 5.6 (avg 9.0)

Square root (δ = 0.249):
α:    0.7, 0.7, 0.6, 0.6, 0.7, 0.6, 0.7, 0.9, 0.5, 1.0, 0.4, 0.6, 0.6, 0.6, 0.6
AUC:  0.799, 0.654, 0.596, 0.807, 0.735, 0.626, 0.746, 0.762, 0.500, 0.740, 0.774, 0.560, 0.722, 0.700, 0.738 (mean 0.697 ± 0.091)
+(%): 4.4, 0.3, 0.7, 1.8, 37.1, 2.5, 1.4, 2.8, 39.7, -, 21.0, 2.8, 1.7, 5.3, 5.3 (avg 6.9)

Logarithmic (δ = 0.351):
α:    0.6, 0.6, 0.9, 1.0, 0.7, 1.0, 0.7, 0.7, 0.5, 0.9, 0.5, 0.5, 0.6, 0.6, 0.6
AUC:  0.798, 0.662, 0.594, 0.793, 0.731, 0.611, 0.748, 0.744, 0.500, 0.758, 0.774, 0.700, 0.755, 0.702, 0.741 (mean 0.707 ± 0.083)
+(%): 4.3, 1.5, 0.3, -, 36.4, -, 1.6, 0.4, 39.7, 2.4, 21.2, 28.5, 6.3, 5.5, 5.8 (avg 8.5)

Inverse cotangent (δ = 0.213):
α:    0.7, 1.0, 1.0, 1.0, 0.7, 1.0, 0.7, 1.0, 0.6, 0.7, 0.0, 0.7, 0.7, 0.7, 0.7
AUC:  0.798, 0.652, 0.592, 0.793, 0.659, 0.611, 0.749, 0.741, 0.500, 0.764, 0.773, 0.556, 0.739, 0.695, 0.734 (mean 0.690 ± 0.092)
+(%): 4.3, -, -, -, 22.9, -, 1.8, -, 39.7, 3.2, 21.0, 2.1, 4.1, 4.4, 4.8 (avg 5.9)

NoD (α = 1):
AUC:  0.765, 0.652, 0.592, 0.793, 0.536, 0.611, 0.736, 0.741, 0.358, 0.740, 0.639, 0.543, 0.709, 0.665, 0.701 (mean 0.652 ± 0.113)
Table 6: The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance. Project order in each row: Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde.

Linear (δ = 0.369):
α:    0.9, 0.9, 1.0, 0.9, 0.9, 0.8, 1.0, 1.0, 0.8, 0.8, 0.0, 0.6, 1.0, 0.8, 0.8
AUC:  0.795, 0.727, 0.598, 0.826, 0.793, 0.603, 0.714, 0.757, 0.545, 0.775, 0.773, 0.719, 0.722, 0.697, 0.744 (mean 0.719 ± 0.080)
+(%): 1.3, 6.8, -, 0.9, 32.2, 1.9, -, -, 11.7, 5.2, 17.6, 43.0, -, 1.1, 9.6 (avg 7.7)

Logistic (δ = 0.360):
α:    0.7, 0.8, 0.4, 0.7, 0.7, 0.5, 0.6, 0.9, 0.9, 0.9, 0.0, 0.7, 1.0, 1.0, 0.9
AUC:  0.787, 0.750, 0.603, 0.832, 0.766, 0.613, 0.716, 0.767, 0.556, 0.745, 0.773, 0.698, 0.722, 0.690, 0.730 (mean 0.717 ± 0.075)
+(%): 0.3, 10.1, 0.8, 1.6, 27.7, 3.5, 0.3, 1.3, 13.9, 1.1, 17.6, 38.8, -, -, 7.5 (avg 7.2)

Square root (δ = 0.342):
α:    0.7, 0.8, 1.0, 0.7, 0.8, 0.6, 0.7, 0.7, 0.7, 1.0, 0.7, 0.8, 1.0, 1.0, 0.9
AUC:  0.796, 0.743, 0.598, 0.820, 0.720, 0.618, 0.735, 0.786, 0.564, 0.737, 0.774, 0.696, 0.722, 0.690, 0.750 (mean 0.715 ± 0.076)
+(%): 1.4, 9.1, -, 0.1, 20.0, 4.4, 2.9, 3.8, 15.6, -, 17.8, 38.4, -, -, 10.5 (avg 7.0)

Logarithmic (δ = 0.324):
α:    0.7, 0.8, 1.0, 1.0, 0.8, 0.6, 1.0, 1.0, 0.9, 0.9, 0.9, 0.8, 1.0, 1.0, 0.9
AUC:  0.794, 0.746, 0.598, 0.819, 0.722, 0.607, 0.714, 0.757, 0.573, 0.739, 0.778, 0.722, 0.722, 0.690, 0.748 (mean 0.715 ± 0.072)
+(%): 1.1, 9.5, -, -, 20.3, 2.5, -, -, 17.4, 0.3, 18.5, 43.6, -, -, 10.3 (avg 7.0)

Inverse cotangent (δ = 0.280):
α:    0.8, 0.9, 0.6, 0.8, 0.8, 0.7, 1.0, 0.8, 0.6, 0.7, 0.0, 0.9, 0.9, 1.0, 0.9
AUC:  0.796, 0.749, 0.603, 0.820, 0.701, 0.623, 0.714, 0.787, 0.538, 0.750, 0.773, 0.589, 0.763, 0.690, 0.722 (mean 0.708 ± 0.084)
+(%): 1.4, 10.0, 0.8, 0.1, 16.8, 5.2, -, 4.0, 10.2, 1.8, 17.6, 17.0, 5.6, -, 6.4 (avg 5.9)

NoD (α = 1):
AUC:  0.785, 0.681, 0.598, 0.819, 0.600, 0.592, 0.714, 0.757, 0.488, 0.737, 0.657, 0.503, 0.722, 0.690, 0.678 (mean 0.668 ± 0.096)
Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance. Project order in each row: Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde.

Linear (δ = 0.187):
α:    0.8, 0.9, 0.9, 1.0, 0.9, 0.9, 1.0, 1.0, 0.8, 1.0, 0.0, 0.8, 0.9, 1.0, 1.0
AUC:  0.804, 0.753, 0.599, 0.816, 0.689, 0.626, 0.695, 0.748, 0.500, 0.749, 0.773, 0.633, 0.692, 0.695, 0.668 (mean 0.696 ± 0.084)
+(%): 1.3, 7.0, 0.3, -, 7.3, 6.3, -, -, 7.8, -, 11.6, 19.0, 39.7, -, - (avg 5.6)

Logistic (δ = 0.249):
α:    0.7, 0.7, 0.8, 0.8, 0.8, 0.7, 0.7, 0.9, 0.6, 0.7, 0.0, 0.9, 0.9, 1.0, 1.0
AUC:  0.799, 0.760, 0.607, 0.830, 0.674, 0.621, 0.735, 0.794, 0.520, 0.756, 0.773, 0.680, 0.559, 0.695, 0.668 (mean 0.705 ± 0.084)
+(%): 0.6, 8.0, 1.7, 1.7, 5.0, 5.4, 5.8, 6.1, 12.1, 0.9, 11.6, 27.9, 12.7, -, - (avg 6.9)

Square root (δ = 0.164):
α:    0.9, 0.9, 0.9, 1.0, 0.8, 0.8, 0.9, 0.8, 0.9, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0
AUC:  0.795, 0.755, 0.604, 0.816, 0.693, 0.627, 0.704, 0.750, 0.510, 0.749, 0.773, 0.532, 0.523, 0.695, 0.668 (mean 0.680 ± 0.100)
+(%): 0.1, 7.2, 1.2, -, 7.9, 6.5, 1.3, 0.3, 9.9, -, 11.6, -, 4.6, -, - (avg 3.1)

Logarithmic (δ = 0.116):
α:    1.0, 0.9, 0.9, 1.0, 0.9, 1.0, 1.0, 0.8, 0.9, 0.9, 0.0, 1.0, 0.0, 1.0, 1.0
AUC:  0.794, 0.755, 0.603, 0.816, 0.664, 0.589, 0.695, 0.763, 0.524, 0.756, 0.773, 0.532, 0.523, 0.695, 0.668 (mean 0.677 ± 0.102)
+(%): -, 7.2, 1.0, -, 3.4, -, -, 2.0, 12.9, 0.9, 11.6, -, 4.6, -, - (avg 2.7)

Inverse cotangent (δ = 0.133):
α:    1.0, 0.9, 0.9, 0.9, 0.9, 0.8, 0.9, 1.0, 0.7, 0.8, 0.0, 1.0, 0.0, 1.0, 1.0
AUC:  0.794, 0.749, 0.608, 0.821, 0.667, 0.609, 0.710, 0.748, 0.500, 0.758, 0.773, 0.532, 0.523, 0.695, 0.668 (mean 0.677 ± 0.103)
+(%): -, 6.4, 1.8, 0.6, 3.9, 3.4, 2.2, -, 7.8, 1.2, 11.6, -, 4.6, -, - (avg 2.7)

NoD (α = 1):
AUC:  0.794, 0.704, 0.597, 0.816, 0.642, 0.589, 0.695, 0.748, 0.464, 0.749, 0.693, 0.532, 0.500, 0.695, 0.668 (mean 0.659 ± 0.105)
Figure 4: A guideline for choosing suitable similarity indexes and normalization methods from two aspects: (1) similarity, with priority Euclidean (Linear) → Cosine (Logistic) → Manhattan + Logistic; (2) normalization, with priority Logistic (Cosine) → Linear (Euclidean) → Manhattan + Logistic. The selection priority is lowered along the direction of the arrow.
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from the two different aspects is presented in Figure 4.
Then, we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by different similarity indexes, TDSelector obtains the best results 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084 when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every pair of combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
6.3 Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. Our approach is, on average, better than the two baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In
particular, compared with baseline1 and baseline2, the highest growth rates of the AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained by the different methods. Table 10 shows that the proportions of defective instances in each simplified training dataset are very close. However, in terms of instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects is more fully utilized because of the instances with more defects. This result further validates that the selection of training data considering defects is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
7. Discussion
7.1 Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances selected for each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step value of 1. As shown in Figure 5, for the three combinations in question, selecting the k-nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
Interestingly, for the combinations of "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. Within each similarity group, the five values correspond to the normalization methods Linear, Logistic, Square root, Logarithmic, and Inverse cotangent; a dash marks the self-comparison.

Cosine + Linear:
  vs. Cosine similarity:    -, 0.018, 0.084, 0.000, 0.116
  vs. Euclidean distance:   -0.049, -0.036, -0.004, -0.013, -0.009
  vs. Manhattan distance:   0.138, 0.049, 0.164, 0.178, 0.169

Euclidean + Linear:
  vs. Cosine similarity:    0.049, 0.102, 0.111, 0.062, 0.164
  vs. Euclidean distance:   -, 0.036, 0.040, 0.058, 0.089
  vs. Manhattan distance:   0.209, 0.102, 0.249, 0.276, 0.244

Manhattan + Logistic:
  vs. Cosine similarity:    -0.049, -0.022, 0.022, -0.013, 0.111
  vs. Euclidean distance:   -0.102, -0.076, -0.080, -0.049, -0.031
  vs. Manhattan distance:   0.053, -, 0.124, 0.151, 0.147
Table 9: A comparison between our approach and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is based on the best prediction results of all three methods in question; the two +(%) columns are the growth rates of the AUC value of TDSelector (with "Euclidean + Linear") over each baseline.

Test set | Baseline1 | Baseline2 | +(%) over Baseline1 | +(%) over Baseline2
Ant | 0.785 | 0.803 | 1.3 | -1.0
Xalan | 0.657 | 0.675 | 10.7 | 7.7
Camel | 0.595 | 0.624 | 0.5 | -4.2
Ivy | 0.789 | 0.802 | 4.7 | 3.0
Jedit | 0.694 | 0.782 | 14.3 | 1.4
Lucene | 0.608 | 0.701 | -0.8 | -14.0
Poi | 0.691 | 0.789 | 3.3 | -9.5
Synapse | 0.740 | 0.748 | 2.3 | 1.2
Velocity | 0.330 | 0.331 | 65.2 | 64.7
Xerces | 0.714 | 0.753 | 8.5 | 2.9
Eclipse | 0.706 | 0.744 | 10.2 | 4.6
Equinox | 0.587 | 0.720 | 23.1 | 0.3
Lucene2 | 0.705 | 0.724 | 2.5 | -0.2
Mylyn | 0.631 | 0.646 | 9.3 | 6.8
Pde | 0.678 | 0.737 | 10.4 | 1.5
Avg | 0.663 | 0.705 | 10.6 | 4.3

Cliff's delta: Baseline1 vs. TDSelector, δ = -0.409; Baseline2 vs. TDSelector, δ = -0.009.
Table 10: Comparison of the defective instances of the simplified training dataset obtained by different methods on the Velocity project.

Method | defective instances / instances | instances (defects > 1) / defective instances
Baseline1 | 0.375 | 0.247
Baseline2 | 0.393 | 0.291
TDSelector | 0.376 | 0.487
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k, k = 1 to 10, for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, this best result is still worse than those of the other two combinations.
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k, k = 1 to 10, for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
7.2 Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of instances that have more than three defects to the total number of instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs: (a) from the viewpoint of a single dataset (release); (b) from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs are chosen directly from an initial TDS as training data, while the remaining instances in the TDS are selected according to (2). All instances from the two parts then form the final TDS after removing redundant ones.
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average, TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly, at the first stage, according to a threshold for the number of bugs in each training instance (namely, three in this paper); our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
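The two-stage selection of TDSelector-3 can be sketched as follows (our illustration, not the authors' code; `score_rest` is a hypothetical stand-in for the score-based selection of (2)):

```python
def tdselector3(candidates, defects, score_rest, threshold=3):
    """TDSelector-3 sketch: instances with >= `threshold` bugs go straight into
    the training set; the rest are ranked by the usual TDSelector score via
    the `score_rest` callback. Duplicates are removed, preserving order."""
    direct = [c for c, d in zip(candidates, defects) if d >= threshold]
    rest = [c for c, d in zip(candidates, defects) if d < threshold]
    return list(dict.fromkeys(direct + score_rest(rest)))
```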
7.3 Threats to Validity. In this study, we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings of the above factors.
Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because only 15 datasets were collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method obviously performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of the results of our study. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector for the combinations Cosine + Linear, Euclidean + Linear, and Manhattan + Logistic, across the 15 test sets. The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, in order to improve the performance of cross-project defect predictions. In summary, the study was conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between defects and the similarity of test instances to training instances, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method in a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 2013 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics, Informatics, vol. 8, no. 1, pp. 43–48, 2010.
Optimizing the parameter α
Input:
(1) Candidate TDS S_s = {I_s1, I_s2, ..., I_sm}; test set S_t = {I_t1, I_t2, ..., I_tl} (m > l)
(2) defects = {defect(I_s1), defect(I_s2), ..., defect(I_sm)}, and k = 10
Output:
(3) α (α ∈ [0, 1])
Method:
(4) Initialize α = 0, S_s(α) = ∅
(5) While (α ≤ 1) do
(6)   For i = 1; i ≤ l; i++
(7)     For j = 1; j ≤ m; j++
(8)       Score(I_ti, I_sj) = α · Sim(I_ti, I_sj) + (1 − α) · N(defect(I_sj))
(9)     End For
(10)    descSort({Score(I_ti, I_sj) | j = 1, ..., m})   // sort the m training instances in descending order
(11)    S_s(α) = S_s(α) ∪ {Top-k training instances}    // select the top k instances
(12)  End For
(13)  AUC ⇐ S_s(α) --CPDP--> S_t   // prediction result
(14)  α = α + 0.1
(15) End While
(16) Return (α | max_α AUC)
Algorithm 1: Algorithm of parameter optimization.
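As an illustrative sketch (not the authors' implementation), Algorithm 1's grid search over α can be written in Python. The `similarity`, `normalize`, and `evaluate_auc` callables are hypothetical stand-ins for the similarity index, the defect-normalization function, and the CPDP prediction step described above:

```python
def optimize_alpha(candidates, tests, similarity, normalize, evaluate_auc, k=10):
    """Grid search for alpha in Score = alpha*Sim + (1-alpha)*N(defects).

    candidates   : list of (feature_vector, defect_count) training instances
    tests        : list of feature vectors (test instances)
    similarity   : similarity(test_features, train_features) -> float
    normalize    : normalize(defect_count) -> float in [0, 1]
    evaluate_auc : evaluate_auc(selected_instances) -> AUC of the resulting
                   CPDP predictor on the test set (hypothetical evaluation hook)
    """
    best_alpha, best_auc = 0.0, -1.0
    for step in range(11):                 # alpha = 0.0, 0.1, ..., 1.0
        alpha = step / 10.0
        selected = set()
        for t in tests:
            scores = [
                (alpha * similarity(t, f) + (1 - alpha) * normalize(d), j)
                for j, (f, d) in enumerate(candidates)
            ]
            scores.sort(reverse=True)      # rank candidates in descending order
            selected.update(j for _, j in scores[:k])  # keep top-k per test instance
        auc = evaluate_auc([candidates[j] for j in sorted(selected)])
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha
```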
Table 1: Similarity indexes and normalization methods used in this paper.

Similarity:
  Cosine:              cos(X, Y) = Σ_{k=1}^{n} x_k y_k / (√(Σ_{k=1}^{n} x_k²) · √(Σ_{k=1}^{n} y_k²))
  Euclidean distance:  d(X, Y) = √(Σ_{k=1}^{n} (x_k − y_k)²)
  Manhattan distance:  d(X, Y) = Σ_{k=1}^{n} |x_k − y_k|
Normalization:
  Linear:              N(x) = (x − x_min) / (x_max − x_min)
  Logistic:            N(x) = 1 / (1 + e^(−x)) − 0.5
  Square root:         N(x) = 1 − 1/√(1 + x)
  Logarithmic:         N(x) = log₁₀(x + 1)
  Inverse cotangent:   N(x) = arctan(x) · 2/π
Normalization is a commonly used data preprocessing technique in mathematics and computer science [36]. Graf and Borer [37] have confirmed that normalization can improve the prediction performance of classification models. For this reason, we normalize the defects of training instances when using TDSelector. Since many normalization methods exist, in this study we introduce five typical normalization methods used in machine learning [36, 38]. The descriptions and formulas of the five normalization methods are listed in Table 1.
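The formulas in Table 1 translate directly into code; a minimal sketch of the three similarity indexes and five normalization methods:

```python
import math

# Similarity indexes from Table 1 (X and Y are metric vectors of equal length).
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

# Normalization methods from Table 1, applied to a defect count x >= 0.
def n_linear(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)

def n_logistic(x):
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

def n_sqrt(x):
    return 1.0 - 1.0 / math.sqrt(1.0 + x)

def n_log(x):
    return math.log10(x + 1.0)

def n_arccot(x):
    return math.atan(x) * 2.0 / math.pi
```

Note that Euclidean and Manhattan are distances, so in practice a lower value means a more similar pair, whereas cosine grows with similarity.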
For each test instance, the top-k training instances ranked in terms of their scores will be returned. Hence, the final TDS is composed by merging the sets of the top-k training instances for each test instance, with duplicate instances removed.
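Assuming the pairwise scores have already been computed, the merging step can be sketched as follows (`score_table` rows correspond to test instances, columns to candidate training instances):

```python
def build_final_tds(score_table, k=10):
    """Union of the top-k highest-scoring candidate indices for each test
    instance; using a set removes candidates picked by several test instances."""
    selected = set()
    for row in score_table:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)
```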
5. Experimental Setup
5.1. Research Questions. Our experiments were conducted to find empirical evidence that answers the following three research questions.
RQ1: Does the Consideration of Defects Improve the Performance of CPDP? Unlike the previous methods [1, 5, 7, 12, 29], TDSelector ranks candidate training instances in terms of both defects and metric-based similarity. To evaluate the effectiveness of the proposed method considering the additional information of defects, we tested TDSelector on the experimental data described in Section 5.2. According to (2), we also empirically analyzed the impact of the parameter α on prediction results.
RQ2: Which Combination of Similarity and Normalization Is More Suitable for TDSelector? Equation (2) is comprised of two parts, namely, similarity and the normalization of defects. For each part, several commonly used methods can be adopted in our context. To take full advantage of TDSelector, one would wonder which combination of similarity and normalization should be chosen. Therefore, it is necessary to compare the effects of different combinations of similarity and normalization methods on prediction results and to determine the best one for TDSelector.
RQ3: Can TDSelector-Based CPDP Outperform the Baseline Methods? Cross-project prediction has attracted much research interest in recent years, and a few CPDP approaches using training data selection have also been proposed, e.g.,
the Peter filter based CPDP [5] (labeled as baseline1) and the TCA+ (Transfer Component Analysis) based CPDP [39] (labeled as baseline2). To answer the third question, we compared the TDSelector-based CPDP proposed in this paper with the above two state-of-the-art methods.
5.2. Data Collection. To evaluate the effectiveness of TDSelector, we used 14 open-source projects written in Java from two online public software repositories, namely, PROMISE [40] and AEEEM [41]. The data statistics of the 14 projects in question are presented in Table 2, where Instance and Defect are the numbers of instances and defective instances, respectively, and %Defect is the proportion of defective instances to the total number of instances. Each instance in these projects represents a file of an object class and consists of two parts, namely, software metrics and defects.
The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The information of defects and 20 source code metrics for the projects on PROMISE have been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D'Ambros et al. [41], and each project on it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].
Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each of the 10 projects was selected as the target dataset once, and then we set up a candidate TDS for CPDP which excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects is used to construct its initial TDS.
5.3. Experiment Design. To answer the three research questions, our experimental procedure, which is designed under the context of M2O in the CPDP scenario, is described as follows.
First, as in many prior studies [1, 5, 15, 35], all software metric values in the training and test sets were normalized by using the Z-score method, because these metrics differ in the scales of their numerical values. Since the 14 projects on AEEEM and PROMISE have different numbers of software metrics, the training set for a given test set was selected from the same repository.
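A minimal sketch of Z-score standardization for one metric column (population standard deviation assumed):

```python
import math

def z_score(values):
    """Standardize one metric column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]
```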
Second, to examine whether the consideration of defects improves the performance of CPDP, we compared our approach TDSelector with NoD, a baseline method considering only the similarity between instances, i.e., α = 1 in (2). Since three similarity computation methods are used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on similarity indexes. The prediction results of each method in question for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size measure of how often the values in one distribution are larger than the values in a second distribution, to compare the results generated through our approach and its corresponding baseline method.
Because Cliff did not suggest corresponding δ values to represent small, medium, and large effects, we converted Cohen's d effect size to Cliff's δ using the cohd2delta R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 contains descriptors for magnitudes of d = 0.01 to 2.0.
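Cliff's delta itself can be computed directly from its definition; a minimal sketch (not the cohd2delta package):

```python
def cliffs_delta(xs, ys):
    """Nonparametric effect size: the fraction of cross-group pairs where
    x > y minus the fraction where x < y, over all len(xs)*len(ys) pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))
```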
Third, according to the results of the second step of this procedure, 15 combinations based on three typical similarity methods for software metrics and five commonly used normalization functions for defects were examined by the pairwise comparison method. We then determined which combination is more suitable for our approach according to mean, standard deviation, and Cliff's delta effect size.
Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all the 15 test sets using TDSelector and two competing methods (i.e., baseline1 and baseline2 introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.
After this process is completed, we will discuss the answers to the three research questions of our study.
5.4. Classifier and Evaluation Measure. As the underlying machine learning classifier for CPDP, Logistic Regression (LR), which has been widely used in the defect prediction literature [4, 23, 39, 43–46], is also used in this study. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka/). For our experiments, we used the default parameter settings for LR specified in Weka unless otherwise specified.
To evaluate the prediction performance of the different methods in this paper, we utilized the area under the Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier will rank a randomly chosen defective class higher than a randomly chosen defect-free one [47], and it is a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold used to decide whether an instance should be classified as a negative instance [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
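AUC as defined here equals that pairwise-ranking probability, so it can be computed by comparing every defective/defect-free score pair; a minimal sketch (not Weka's implementation):

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC = probability that a randomly chosen defective (positive) instance
    is scored above a randomly chosen defect-free (negative) one; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```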
6. Experimental Results
6.1. Answer to RQ1. We compared our approach considering defects with the baseline method NoD, which selects training data in terms of cosine similarity. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. Obviously, the average growth rates of AUC value vary from 5.9% to 9.0% when different normalization methods for defects are utilized. In addition, all the δ values in this table are greater than 0.2, which
Table 2: Data statistics of the projects used in our experiments.

Repository | Project                     | Version               | Instance | Defect | %Defect
PROMISE    | Ant                         | 1.7                   | 745      | 166    | 22.3%
PROMISE    | Camel                       | 1.6                   | 965      | 188    | 19.5%
PROMISE    | Ivy                         | 2.0                   | 352      | 40     | 11.4%
PROMISE    | Jedit                       | 3.2                   | 272      | 90     | 33.1%
PROMISE    | Lucene                      | 2.4                   | 340      | 203    | 59.7%
PROMISE    | Poi                         | 3.0                   | 442      | 281    | 63.6%
PROMISE    | Synapse                     | 1.2                   | 256      | 86     | 33.6%
PROMISE    | Velocity                    | 1.4                   | 196      | 147    | 75.0%
PROMISE    | Xalan                       | 2.6                   | 885      | 411    | 46.4%
PROMISE    | Xerces                      | 1.4                   | 588      | 437    | 74.3%
AEEEM      | Equinox                     | 1.1.2005–6.25.2008    | 324      | 129    | 39.8%
AEEEM      | Eclipse JDT core (Eclipse)  | 1.1.2005–6.17.2008    | 997      | 206    | 20.7%
AEEEM      | Apache Lucene (Lucene2)     | 1.1.2005–10.8.2008    | 692      | 20     | 2.9%
AEEEM      | Mylyn                       | 1.17.2005–3.17.2009   | 1862     | 245    | 13.2%
AEEEM      | Eclipse PDE UI (Pde)        | 1.1.2005–9.11.2008    | 1497     | 209    | 14.0%
Table 3: The mappings between different d values and their effectiveness levels.

Effect size | d    | δ
Very small  | 0.01 | 0.008
Small       | 0.20 | 0.147
Medium      | 0.50 | 0.33
Large       | 0.80 | 0.474
Very large  | 1.20 | 0.622
Huge        | 2.0  | 0.811
indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).
We then compared TDSelector with the baseline methods using the other widely used similarity calculation methods; the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. Compared with the corresponding NoD, TDSelector achieves average growth rates of AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7, respectively. More specifically, the highest growth rate of AUC value is 43.6% (for Equinox) in Table 6 and 39.7% (for Lucene2) in Table 7. Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of those baseline methods that do not take defects into account.
Table 4: Analyzing the factors similarity and normalization.

Factor        | Method             | Mean  | Std   | δ
Similarity    | Cosine similarity  | 0.704 | 0.082 | −0.133
Similarity    | Euclidean distance | 0.719 | 0.080 | -
Similarity    | Manhattan distance | 0.682 | 0.098 | −0.193
Normalization | Linear             | 0.706 | 0.087 | −0.012
Normalization | Logistic           | 0.710 | 0.078 | -
Normalization | Square root        | 0.699 | 0.091 | −0.044
Normalization | Logarithmic        | 0.700 | 0.086 | −0.064
Normalization | Inverse cotangent  | 0.696 | 0.097 | −0.056
In short, during the process of training data selection, the consideration of defects for CPDP can help us to select higher quality training data, thus leading to better classification results.
6.2. Answer to RQ2. Although the inclusion of defects in the selection of quality training data is helpful for better CPDP performance, it is worth noting that our method completely failed on Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Therefore, which combination of similarity and normalization is more suitable for TDSelector?
First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of the normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factor.
Table 5: The best prediction results obtained by the CPDP approach based on TDSelector with cosine similarity, summarized across the 15 test sets. NoD represents the baseline method (α = 1); + denotes the average growth rate of AUC value over NoD.

Normalization      | Mean ± Std    | δ     | Avg. +(%)
Linear             | 0.711 ± 0.081 | 0.338 | 9.0
Logistic           | 0.711 ± 0.070 | 0.351 | 9.0
Square root        | 0.697 ± 0.091 | 0.249 | 6.9
Logarithmic        | 0.707 ± 0.083 | 0.351 | 8.5
Inverse cotangent  | 0.690 ± 0.092 | 0.213 | 5.9
NoD (α = 1)        | 0.652 ± 0.113 | -     | -
Table 6: The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance, summarized across the 15 test sets. NoD represents the baseline method (α = 1); + denotes the average growth rate of AUC value over NoD.

Normalization      | Mean ± Std    | δ     | Avg. +(%)
Linear             | 0.719 ± 0.080 | 0.369 | 7.7
Logistic           | 0.717 ± 0.075 | 0.360 | 7.2
Square root        | 0.715 ± 0.076 | 0.342 | 7.0
Logarithmic        | 0.715 ± 0.072 | 0.324 | 7.0
Inverse cotangent  | 0.708 ± 0.084 | 0.280 | 5.9
NoD (α = 1)        | 0.668 ± 0.096 | -     | -
Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance, summarized across the 15 test sets. NoD represents the baseline method (α = 1); + denotes the average growth rate of AUC value over NoD.

Normalization      | Mean ± Std    | δ     | Avg. +(%)
Linear             | 0.696 ± 0.084 | 0.187 | 5.6
Logistic           | 0.705 ± 0.084 | 0.249 | 6.9
Square root        | 0.680 ± 0.100 | 0.164 | 3.1
Logarithmic        | 0.677 ± 0.102 | 0.116 | 2.7
Inverse cotangent  | 0.677 ± 0.103 | 0.133 | 2.7
NoD (α = 1)        | 0.659 ± 0.105 | -     | -
[Figure 4: A guideline for choosing suitable similarity indexes and normalization methods, from two aspects: (1) similarity and (2) normalization. The selection priority is lowered along the direction of the arrow: by similarity, Euclidean distance, then cosine similarity, then Manhattan distance with logistic normalization; by normalization, logistic, then linear.]
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two aspects is presented in Figure 4.
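The choice above turns on which normalization function is applied to an instance's defect count. As a concrete illustration, the sketch below gives one plausible form for each of the five methods named in this paper (Linear, Logistic, Square root, Logarithmic, Inverse cotangent); the exact formulas used by TDSelector are defined earlier in the paper, so the bodies here are assumptions, not the authors' definitions.

```python
import math

# Hedged sketch: five candidate normalization functions for a defect
# count, matching the five names used in the paper. The exact formulas
# are assumptions based on common normalization forms.

def linear(x, x_min, x_max):
    # min-max scaling into [0, 1]
    return (x - x_min) / (x_max - x_min) if x_max > x_min else 0.0

def logistic(x):
    # sigmoid; maps any defect count into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def square_root(x, x_max):
    return math.sqrt(x) / math.sqrt(x_max) if x_max > 0 else 0.0

def logarithmic(x, x_max):
    return math.log(1 + x) / math.log(1 + x_max) if x_max > 0 else 0.0

def inverse_cotangent(x):
    # arctan-based mapping scaled into [0, 1)
    return (2.0 / math.pi) * math.atan(x)

defects = [0, 1, 3, 8]
x_max = max(defects)
print([round(linear(d, 0, x_max), 3) for d in defects])  # → [0.0, 0.125, 0.375, 1.0]
```

All five map a raw defect count into a bounded score so that it can be combined with a similarity value on a comparable scale.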
Then we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by different similarity indexes, TDSelector obtains its best results of 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084 when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than the two baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained from the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, regarding instances with more than one defect among these defective instances, our method returns more, and the ratio is nearly twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized because of the instances with more defects. This result further validates that selecting training data with consideration of defects is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
7. Discussion
7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step of 1. As shown in Figure 5, for the three combinations in question, selecting fewer nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. Each row lists δ values against the five normalization methods (Linear, Logistic, Square root, Logarithmic, Inverse cotangent) under each similarity index; "(self)" marks the combination compared with itself. Decimal points, lost in extraction, have been restored.

Cosine + Linear:
  vs. Cosine similarity: (self), 0.018, 0.084, 0.000, 0.116
  vs. Euclidean distance: −0.049, −0.036, −0.004, −0.013, −0.009
  vs. Manhattan distance: 0.138, 0.049, 0.164, 0.178, 0.169
Euclidean + Linear:
  vs. Cosine similarity: 0.049, 0.102, 0.111, 0.062, 0.164
  vs. Euclidean distance: (self), 0.036, 0.040, 0.058, 0.089
  vs. Manhattan distance: 0.209, 0.102, 0.249, 0.276, 0.244
Manhattan + Logistic:
  vs. Cosine similarity: −0.049, −0.022, 0.022, −0.013, 0.111
  vs. Euclidean distance: −0.102, −0.076, −0.080, −0.049, −0.031
  vs. Manhattan distance: 0.053, (self), 0.124, 0.151, 0.147
Table 9: A comparison between our approach and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question.
Test set | Baseline1 | Baseline2 | AUC growth over baseline1 (%) | AUC growth over baseline2 (%)
Ant      | 0.785 | 0.803 |  1.3 |  −1.0
Xalan    | 0.657 | 0.675 | 10.7 |   7.7
Camel    | 0.595 | 0.624 |  0.5 |  −4.2
Ivy      | 0.789 | 0.802 |  4.7 |   3.0
Jedit    | 0.694 | 0.782 | 14.3 |   1.4
Lucene   | 0.608 | 0.701 | −0.8 | −14.0
Poi      | 0.691 | 0.789 |  3.3 |  −9.5
Synapse  | 0.740 | 0.748 |  2.3 |   1.2
Velocity | 0.330 | 0.331 | 65.2 |  64.7
Xerces   | 0.714 | 0.753 |  8.5 |   2.9
Eclipse  | 0.706 | 0.744 | 10.2 |   4.6
Equinox  | 0.587 | 0.720 | 23.1 |   0.3
Lucene2  | 0.705 | 0.724 |  2.5 |  −0.2
Mylyn    | 0.631 | 0.646 |  9.3 |   6.8
Pde      | 0.678 | 0.737 | 10.4 |   1.5
Avg      | 0.663 | 0.705 | 10.6 |   4.3
The growth-rate columns refer to TDSelector with "Euclidean + Linear". Cliff's delta between distributions: baseline1 vs. TDSelector, δ = −0.409; baseline2 vs. TDSelector, δ = −0.009.
Table 10: Comparison of the defective instances of the simplified training datasets obtained from different methods on the Velocity project.

Method     | defect instances / instances | instances (defects > 1) / defect instances
Baseline1  | 0.375 | 0.247
Baseline2  | 0.393 | 0.291
TDSelector | 0.376 | 0.487
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE. [Chart not reproduced: AUC (0.5 to 0.9) plotted against k (1 to 10) for the combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear".]
when k is equal to 10. The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, this best result is still worse than those of the other two combinations.
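The top-k step discussed above can be sketched as follows; the data, names, and the simple union-with-deduplication policy are illustrative assumptions rather than the paper's exact procedure.

```python
import math

# Hedged sketch: for each test instance, keep its k nearest candidate
# training instances (Euclidean distance over normalized metric vectors)
# and merge the picks into one training set without duplicates.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_training(candidates, test_set, k=10):
    selected, seen = [], set()     # list preserves pick order
    for t in test_set:
        ranked = sorted(range(len(candidates)),
                        key=lambda i: euclidean(candidates[i], t))
        for i in ranked[:k]:
            if i not in seen:
                seen.add(i)
                selected.append(i)
    return selected

candidates = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.2]]
test_set = [[0.0, 0.1]]
print(top_k_training(candidates, test_set, k=2))  # → [0, 3]
```

Raising k grows the merged training set, which matches the observation above that larger k tends to help on these test sets.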
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM. [Chart not reproduced: AUC (0.5 to 0.9) plotted against k (1 to 10) for the same three combinations as in Figure 5.]
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC. We also want to know whether directly selecting the defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total number of instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs. (a) is shown from the viewpoint of a single dataset (release), while (b) is shown from the viewpoint of the whole dataset used in our experiments. [Charts not reproduced: (a) plots, per release (1 to 15), the percentage of instances with defects < 2 and defects < 3; (b) plots the percentage of instances by defect count for counts of four and more.]
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
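A minimal sketch of this two-stage TDSelector-3 idea follows, assuming a generic score_fn in place of the paper's (2) and illustrative instance records; the field names and the keep parameter are assumptions.

```python
# Hedged sketch of TDSelector-3: instances with at least `threshold`
# bugs enter the training set directly; the rest go through the
# score-based TDSelector stage. `score_fn` stands in for Eq. (2).

def tdselector3(instances, score_fn, threshold=3, keep=100):
    direct = [i for i in instances if i["defects"] >= threshold]
    rest = [i for i in instances if i["defects"] < threshold]
    ranked = sorted(rest, key=score_fn, reverse=True)[:keep]
    # merge the two parts while dropping duplicates (by id)
    merged, seen = [], set()
    for inst in direct + ranked:
        if inst["id"] not in seen:
            seen.add(inst["id"])
            merged.append(inst)
    return merged

data = [{"id": 1, "defects": 4, "score": 0.2},
        {"id": 2, "defects": 0, "score": 0.9},
        {"id": 3, "defects": 1, "score": 0.5}]
tds = tdselector3(data, score_fn=lambda i: i["score"], keep=1)
print([i["id"] for i in tds])  # → [1, 2]
```

The many-bug instance (id 1) is admitted regardless of its score, mirroring the first-stage screening described above.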
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold for the number of bugs in each training instance (namely, three in this paper) at the first stage; our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study would change if we used different settings for the above factors.
Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ |δ| < 0.33) or very small (0.008 ≤ |δ| < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method obviously performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
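The magnitude thresholds quoted above can be expressed as a small helper. The labels for values outside the quoted ranges ("negligible" and "medium or larger") are shorthand added here for completeness, not Sawilowsky's full scale.

```python
# Helper matching the |δ| magnitude thresholds quoted in the text
# (very small: 0.008 ≤ |δ| < 0.147; small: 0.147 ≤ |δ| < 0.33).

def delta_magnitude(delta):
    d = abs(delta)
    if d < 0.008:
        return "negligible"
    if d < 0.147:
        return "very small"
    if d < 0.33:
        return "small"
    return "medium or larger"

print(delta_magnitude(-0.409))  # → medium or larger
```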
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets, beyond AEEEM and PROMISE, is the main threat to the validity of the results of our study. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector. The last column in each of the three plots represents the average AUC value. [Charts not reproduced: three bar charts, one for each of "Euclidean + Linear", "Cosine + Linear", and "Manhattan + Logistic", showing the AUC values (0.50 to 0.85) of TDSelector and TDSelector-3 on the 15 test sets (Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde) and their average.]
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between defects and the similarity of test instances with training instances, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, p. 1, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
Peter filter based CPDP [5] (labeled as baseline1) and TCA+ (Transfer Component Analysis) based CPDP [39] (labeled as baseline2). To answer the third question, we compared the TDSelector-based CPDP proposed in this paper with the above two state-of-the-art methods.
5.2. Data Collection. To evaluate the effectiveness of TDSelector, in this paper we used 14 open-source projects written in Java from two online public software repositories, namely PROMISE [40] and AEEEM [41]. The data statistics of the 14 projects in question are presented in Table 2, where Instance and Defect are the numbers of instances and defective instances, respectively, and Defect% is the proportion of defective instances to the total number of instances. Each instance in these projects represents a file of an object class and consists of two parts, namely software metrics and defects.
The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The information of defects and 20 source code metrics for the projects on PROMISE have been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D'Ambros et al. [41], and each project on it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].
Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each of the 10 projects was selected to be the target dataset once, and then we set up a candidate TDS for CPDP that excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects is used to construct its initial TDS.
5.3. Experiment Design. To answer the three research questions, our experimental procedure, which is designed under the context of M2O in the CPDP scenario, is described as follows.
First, as in many prior studies [1, 5, 15, 35], all software metric values in the training and test sets were normalized by using the Z-score method, because these metrics differ in the scales of their numerical values. For the 14 projects on AEEEM and PROMISE, the numbers of software metrics are different, so the training set for a given test set was selected from the same repository.
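The Z-score step can be sketched in a few lines (a minimal per-metric implementation; the use of the population standard deviation is an assumption):

```python
import math

# Minimal sketch of Z-score normalization applied to one software
# metric column before training, as described above.

def z_score(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

loc = [10.0, 20.0, 30.0]
print([round(z, 3) for z in z_score(loc)])  # → [-1.225, 0.0, 1.225]
```

After this transformation every metric has mean 0 and unit variance, which keeps distance-based similarity from being dominated by large-scale metrics.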
Second, to examine whether the consideration of defects improves the performance of CPDP, we compared our approach TDSelector with NoD, a baseline method that considers only the similarity between instances (i.e., α = 1 in (2)). Since three similarity computation methods are used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on the similarity indexes. The prediction results of each method in question for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size that measures how often the values in one distribution are larger than the values in a second distribution, to compare the results generated by our approach and its corresponding baseline method.
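The weighting scheme just described (α = 1 reducing TDSelector to NoD) can be sketched as follows. The exact form of (2) is not reproduced in this section, so the linear combination below is an assumption based on the paper's description of a "linear weighted function" of similarity and defects.

```python
# Hedged sketch of the scoring rule behind TDSelector and NoD:
# a linear weighted combination of a candidate instance's similarity
# to the test set and its normalized defect count.

def tdselector_score(similarity, norm_defects, alpha):
    return alpha * similarity + (1 - alpha) * norm_defects

# NoD is the special case alpha = 1: defects are ignored entirely.
print(tdselector_score(0.8, 0.6, alpha=1.0))  # → 0.8
print(tdselector_score(0.8, 0.6, alpha=0.7))  # → 0.74
```

Candidates are then ranked by this score, so lowering α shifts weight from similarity toward defect information.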
Because Cliff did not suggest corresponding δ values to represent small, medium, and large effects, we converted Cohen's d effect size to Cliff's δ using the cohd2delta R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 contains descriptors for magnitudes of d = 0.01 to 2.0.
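Cliff's delta itself is also simple to compute directly from the two samples (a sketch of the standard definition, independent of the cohd2delta conversion):

```python
# Cliff's delta: the probability that a value from group a exceeds one
# from group b, minus the probability of the reverse; ranges in [-1, 1].

def cliffs_delta(a, b):
    gt = sum(1 for x in a for y in b if x > y)
    lt = sum(1 for x in a for y in b if x < y)
    return (gt - lt) / (len(a) * len(b))

# Illustrative AUC samples (made up, not the paper's data):
print(cliffs_delta([0.70, 0.72, 0.75], [0.60, 0.62, 0.71]))  # ≈ 0.778
```

Identical distributions yield δ = 0, and δ = ±1 means the two groups do not overlap at all.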
Third, according to the results of the second step of this procedure, 15 combinations of three typical similarity methods for software metrics and five commonly used normalization functions for defects were examined by the pairwise comparison method. We then determined which combination is more suitable for our approach according to mean, standard deviation, and Cliff's delta effect size.
Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all the 15 test sets using TDSelector and two competing methods (i.e., baseline1 and baseline2, introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.
After this process is completed, we will discuss the answers to the three research questions of our study.
5.4. Classifier and Evaluation Measure. As the underlying machine learning classifier for CPDP, Logistic Regression (LR), which has been widely used in the defect prediction literature [4, 23, 39, 43–46], is also used in this study. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka/). For our experiments, we used the default parameter settings for LR specified in Weka unless otherwise stated.
To evaluate the prediction performance of the different methods in this paper, we utilized the area under the Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier will rank a randomly chosen defective class higher than a randomly chosen defect-free one [47], and it is known as a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold used to decide whether an instance should be classified as a negative instance [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
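The probabilistic definition of AUC quoted above can be computed directly; a sketch with illustrative names (ties count as 0.5, the usual convention):

```python
# AUC as the probability that a randomly chosen defective instance
# (label 1) receives a higher score than a randomly chosen defect-free
# instance (label 0).
def auc_from_scores(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because only the ranking of scores matters, the measure is insensitive to the classification threshold and to the ratio of defective to clean instances.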
6. Experimental Results
6.1. Answer to RQ1. We compared our approach considering defects with the baseline method NoD, which selects training data in terms of cosine similarity. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. Obviously, the average growth rates of AUC value vary from 5.9% to 9.0% when different normalization methods for defects are utilized. In addition, all the δ values in this table are greater than 0.2, which
8 Mathematical Problems in Engineering
Table 2: Data statistics of the projects used in our experiments.

Repository  Project                      Version               # Instances  # Defects  % Defects
PROMISE     Ant                          1.7                   745          166        22.3
            Camel                        1.6                   965          188        19.5
            Ivy                          2.0                   352          40         11.4
            Jedit                        3.2                   272          90         33.1
            Lucene                       2.4                   340          203        59.7
            Poi                          3.0                   442          281        63.6
            Synapse                      1.2                   256          86         33.6
            Velocity                     1.4                   196          147        75.0
            Xalan                        2.6                   885          411        46.4
            Xerces                       1.4                   588          437        74.3
AEEEM       Equinox                      1.1.2005–6.25.2008    324          129        39.8
            Eclipse JDT core (Eclipse)   1.1.2005–6.17.2008    997          206        20.7
            Apache Lucene (Lucene2)      1.1.2005–10.8.2008    692          20         2.9
            Mylyn                        1.17.2005–3.17.2009   1862         245        13.2
            Eclipse PDE UI (Pde)         1.1.2005–9.11.2008    1497         209        14.0
Table 3: The mappings between different values and their effectiveness levels.

Effect size   d      δ
Very small    0.01   0.008
Small         0.20   0.147
Medium        0.50   0.33
Large         0.80   0.474
Very large    1.20   0.622
Huge          2.0    0.811
indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).
We then compared TDSelector with the baseline methods using the other widely used similarity calculation methods, and the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. Compared with the corresponding NoD, TDSelector achieves average growth rates of AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7, respectively. More specifically, the highest growth rate of AUC value is 43.6% for Equinox in Table 6 and 39.7% for Lucene2 in Table 7. Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of those baseline methods that do not consider defects.
Table 4: Analyzing the factors: similarity and normalization.

Factor          Method               Mean    Std     δ
Similarity      Cosine similarity    0.704   0.082   -0.133
                Euclidean distance   0.719   0.080   -
                Manhattan distance   0.682   0.098   -0.193
Normalization   Linear               0.706   0.087   -0.012
                Logistic             0.710   0.078   -
                Square root          0.699   0.091   -0.044
                Logarithmic          0.700   0.086   -0.064
                Inverse cotangent    0.696   0.097   -0.056
In short, during the process of training data selection, the consideration of defects for CPDP can help us select higher quality training data, thus leading to better classification results.
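The selection rule evaluated in this section, a linear trade-off between similarity and defects in which α = 1 reduces to NoD, can be sketched as follows; the normalization function shown (a scale against a hypothetical maximum of 10 defects) is illustrative only:

```python
# Weighted score of a candidate training instance, in the spirit of the
# paper's Eq. (2): alpha trades similarity off against the normalized
# defect count. alpha = 1 is the similarity-only NoD baseline.
def td_score(similarity, defects, alpha, normalize):
    return alpha * similarity + (1.0 - alpha) * normalize(defects)

# hypothetical linear normalization against a maximum defect count of 10
linear = lambda d: d / 10.0
candidate_score = td_score(0.8, 5, 0.7, linear)
```

Candidates are then ranked by this score, so an instance that is slightly less similar to the test set can still be selected if it carries more defect information.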
6.2. Answer to RQ2. Although the inclusion of defects in the selection of quality training data is helpful for better CPDP performance, it is worth noting that our method completely failed on Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Therefore, which combination of similarity and normalization is more suitable for TDSelector?
First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of the normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factor.
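The three similarity indexes under comparison can be sketched as below; note that converting the two distances into similarities via 1/(1 + distance) is our assumption, not necessarily the paper's exact mapping:

```python
import math

# Cosine similarity of two metric vectors.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Euclidean distance mapped into a similarity in (0, 1].
def euclidean_similarity(a, b):
    return 1.0 / (1.0 + math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

# Manhattan (city-block) distance mapped the same way.
def manhattan_similarity(a, b):
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))
```

Cosine similarity compares only the direction of the metric vectors, while the two distance-based indexes are also sensitive to magnitude, which is one reason the three can rank candidate instances differently.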
Table 5: The best prediction results (AUC) obtained by the CPDP approach based on TDSelector with cosine similarity, per normalization method; NoD (α = 1) represents the baseline method. [The per-project α settings and AUC growth rates of the original table are not recoverable and are omitted.]

Project    Linear  Logistic  Square root  Logarithmic  Inverse cotangent  NoD (α = 1)
Ant        0.813   0.802     0.799        0.798        0.798              0.765
Xalan      0.676   0.674     0.654        0.662        0.652              0.652
Camel      0.603   0.595     0.596        0.594        0.592              0.592
Ivy        0.793   0.793     0.807        0.793        0.793              0.793
Jedit      0.700   0.665     0.735        0.731        0.659              0.536
Lucene     0.611   0.621     0.626        0.611        0.611              0.611
Poi        0.758   0.759     0.746        0.748        0.749              0.736
Synapse    0.741   0.765     0.762        0.744        0.741              0.741
Velocity   0.512   0.579     0.500        0.500        0.500              0.358
Xerces     0.742   0.745     0.740        0.758        0.764              0.740
Eclipse    0.783   0.773     0.774        0.774        0.773              0.639
Equinox    0.760   0.738     0.560        0.700        0.556              0.543
Lucene2    0.739   0.712     0.722        0.755        0.739              0.709
Mylyn      0.705   0.707     0.700        0.702        0.695              0.665
Pde        0.729   0.740     0.738        0.741        0.734              0.701
Mean±Std   0.711±0.081  0.711±0.070  0.697±0.091  0.707±0.083  0.690±0.092  0.652±0.113
δ          0.338   0.351     0.249        0.351        0.213              -
Table 6: The best prediction results (AUC) obtained by the CPDP approach based on TDSelector with Euclidean distance, per normalization method; NoD (α = 1) represents the baseline method. [The per-project α settings and AUC growth rates of the original table are not recoverable and are omitted.]

Project    Linear  Logistic  Square root  Logarithmic  Inverse cotangent  NoD (α = 1)
Ant        0.795   0.787     0.796        0.794        0.796              0.785
Xalan      0.727   0.750     0.743        0.746        0.749              0.681
Camel      0.598   0.603     0.598        0.598        0.603              0.598
Ivy        0.826   0.832     0.820        0.819        0.820              0.819
Jedit      0.793   0.766     0.720        0.722        0.701              0.600
Lucene     0.603   0.613     0.618        0.607        0.623              0.592
Poi        0.714   0.716     0.735        0.714        0.714              0.714
Synapse    0.757   0.767     0.786        0.757        0.787              0.757
Velocity   0.545   0.556     0.564        0.573        0.538              0.488
Xerces     0.775   0.745     0.737        0.739        0.750              0.737
Eclipse    0.773   0.773     0.774        0.778        0.773              0.657
Equinox    0.719   0.698     0.696        0.722        0.589              0.503
Lucene2    0.722   0.722     0.722        0.722        0.763              0.722
Mylyn      0.697   0.690     0.690        0.690        0.690              0.690
Pde        0.744   0.730     0.750        0.748        0.722              0.678
Mean±Std   0.719±0.080  0.717±0.075  0.715±0.076  0.715±0.072  0.708±0.084  0.668±0.096
δ          0.369   0.360     0.342        0.324        0.280              -
Table 7: The best prediction results (AUC) obtained by the CPDP approach based on TDSelector with Manhattan distance, per normalization method; NoD (α = 1) represents the baseline method. [The per-project α settings and AUC growth rates of the original table are not recoverable and are omitted.]

Project    Linear  Logistic  Square root  Logarithmic  Inverse cotangent  NoD (α = 1)
Ant        0.804   0.799     0.795        0.794        0.794              0.794
Xalan      0.753   0.760     0.755        0.755        0.749              0.704
Camel      0.599   0.607     0.604        0.603        0.608              0.597
Ivy        0.816   0.830     0.816        0.816        0.821              0.816
Jedit      0.689   0.674     0.693        0.664        0.667              0.642
Lucene     0.626   0.621     0.627        0.589        0.609              0.589
Poi        0.695   0.735     0.704        0.695        0.710              0.695
Synapse    0.748   0.794     0.750        0.763        0.748              0.748
Velocity   0.500   0.520     0.510        0.524        0.500              0.464
Xerces     0.749   0.756     0.749        0.756        0.758              0.749
Eclipse    0.773   0.773     0.773        0.773        0.773              0.693
Equinox    0.633   0.680     0.532        0.532        0.532              0.532
Lucene2    0.692   0.559     0.523        0.523        0.523              0.500
Mylyn      0.695   0.695     0.695        0.695        0.695              0.695
Pde        0.668   0.668     0.668        0.668        0.668              0.668
Mean±Std   0.696±0.084  0.705±0.084  0.680±0.100  0.677±0.102  0.677±0.103  0.659±0.105
δ          0.187   0.249     0.164        0.116        0.133              -
Figure 4: A guideline for choosing suitable similarity indexes and normalization methods, from the two aspects of similarity (see (1)) and normalization (see (2)); the selection priority is lowered along the direction of the arrow: (1) Euclidean distance, then cosine similarity, then Manhattan distance + logistic; (2) logistic, then linear, then Manhattan + logistic.
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two aspects is presented in Figure 4.
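For concreteness, one plausible realization of the five defect-normalization functions compared above is sketched below; the paper's exact formulas and scalings may differ:

```python
import math

# Candidate normalizations that map a defect count d into [0, 1].
# d_max is the largest defect count among the candidate instances.
def linear_norm(d, d_max):
    return d / d_max

def logistic_norm(d):
    return 1.0 / (1.0 + math.exp(-d))

def sqrt_norm(d, d_max):
    return math.sqrt(d) / math.sqrt(d_max)

def log_norm(d, d_max):
    return math.log(1 + d) / math.log(1 + d_max)

def arccot_norm(d):
    # arctan scaled so that the limit as d grows is 1
    return math.atan(d) / (math.pi / 2)
```

The concave variants (square root, logarithmic, inverse cotangent) compress large defect counts, so an instance with many bugs does not dominate the score; the linear variant preserves the raw proportions.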
We then considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains the best results 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084 when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than both baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In
particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed a possible reason in terms of the defective instances of the simplified training dataset obtained from the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, considering the instances with more than one defect among these defective instances, our method returns more of them, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized because of the instances with more defects. This result further validates that the selection of training data considering defects is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
7. Discussion
7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step size of 1. As shown in Figure 5, for the three combinations in question, selecting the k-nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
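The top-k selection step can be sketched as follows (k = 10 in the experiments; the function names and the toy usage are illustrative):

```python
import heapq

# Keep, for one test instance, its k most similar candidate training
# instances. `sim` can be any of the similarity indexes in the paper.
def top_k(test_instance, candidates, sim, k=10):
    return heapq.nlargest(k, candidates,
                          key=lambda c: sim(test_instance, c))

# toy usage: scalar "instances", closeness used as similarity
nearest = top_k(5.0, [1.0, 4.0, 5.5, 9.0, 4.8],
                lambda a, b: -abs(a - b), k=3)
```

Running this for every test instance and taking the union of the results yields the candidate pool that the scoring function then ranks.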
Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size.

                       Cosine similarity                        Euclidean distance                       Manhattan distance
                       Lin     Logi    Sqrt    Log     InvCot   Lin     Logi    Sqrt    Log     InvCot   Lin     Logi    Sqrt    Log     InvCot
Cosine + Linear        -       0.018   0.084   0.000   0.116    -0.049  -0.036  -0.004  -0.013  -0.009   0.138   0.049   0.164   0.178   0.169
Euclidean + Linear     0.049   0.102   0.111   0.062   0.164    -       0.036   0.040   0.058   0.089    0.209   0.102   0.249   0.276   0.244
Manhattan + Logistic   -0.049  -0.022  0.022   -0.013  0.111    -0.102  -0.076  -0.080  -0.049  -0.031   0.053   -       0.124   0.151   0.147
Table 9 A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM The comparison isconducted based on the best prediction results of all the three methods in question
Test set Baseline1 Baseline2 Euclidean + Linear 120575Ant 0785 0803 13 minus10
Baseline1 vs TDSelector minus0409Xalan 0657 0675 107 77Camel 0595 0624 05 minus42Ivy 0789 0802 47 30Jedit 0694 0782 143 14Lucene 0608 0701 minus08 minus140Poi 0691 0789 33 minus95Synapse 0740 0748 23 12Velocity 0330 0331 652 647
Baseline2 vs TDSelector minus0009Xerces 0714 0753 85 29Eclipse 0706 0744 102 46Equinox 0587 0720 231 03Lucene2 0705 0724 25 minus02Mylyn 0631 0646 93 68Pde 0678 0737 104 15Avg 0663 0705 106 43
Table 10: Comparison of the defective instances of the simplified training dataset obtained from different methods on the Velocity project.

Method       defect instances / instances   instances (defects > 1) / defect instances
Baseline1    0.375                          0.247
Baseline2    0.393                          0.291
TDSelector   0.376                          0.487
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k = 1-10 for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
when k is equal to 10. The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, this best result is still worse than those of the other two combinations.
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k = 1-10 for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs: (a) is shown from the viewpoint of a single dataset (release), with curves for defects < 2 and defects < 3; (b) is shown from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
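The TDSelector-3 split just described can be sketched as follows (the function and variable names are ours):

```python
# Split an initial TDS: instances with at least `threshold` bugs go
# straight into the training set; the rest remain subject to the usual
# TDSelector scoring (threshold = 3 in the paper's experiment).
def split_by_bug_count(instances, bug_counts, threshold=3):
    direct, remaining = [], []
    for inst, bugs in zip(instances, bug_counts):
        (direct if bugs >= threshold else remaining).append(inst)
    return direct, remaining

direct, remaining = split_by_bug_count(["a", "b", "c", "d"], [5, 0, 3, 1])
```

Only the `remaining` part then needs to be scored and ranked, which is where the reduction in computation cost comes from.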
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly at the first stage according to a threshold for the number of bugs in each training instance (namely, three in this paper). Our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study would change if we used different settings for the above factors.
Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis testing methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method obviously performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
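As a reading aid for these criteria, mapping |δ| to the descriptors of Table 3 (Cohen's benchmarks as expanded by Sawilowsky) can be sketched as:

```python
# Classify a Cliff's delta value using the thresholds of Table 3.
def delta_magnitude(delta):
    d = abs(delta)
    for label, threshold in [("huge", 0.811), ("very large", 0.622),
                             ("large", 0.474), ("medium", 0.33),
                             ("small", 0.147), ("very small", 0.008)]:
        if d >= threshold:
            return label
    return "negligible"
```

For instance, the comparison against baseline1 (|δ| = 0.409) lands in the "medium" band, while most table entries fall in the "small" or "very small" bands.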
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets, in addition to AEEEM and PROMISE, is the main threat to the validity of the results of our study. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector, shown as per-project AUC values in three plots (Cosine + Linear, Euclidean + Linear, and Manhattan + Logistic). The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
Table 2: Data statistics of the projects used in our experiments.

Repository  Project                     Version               #Instance  #Defect  %Defect
PROMISE     Ant                         1.7                   745        166      22.3
            Camel                       1.6                   965        188      19.5
            Ivy                         2.0                   352        40       11.4
            Jedit                       3.2                   272        90       33.1
            Lucene                      2.4                   340        203      59.7
            Poi                         3.0                   442        281      63.6
            Synapse                     1.2                   256        86       33.6
            Velocity                    1.4                   196        147      75.0
            Xalan                       2.6                   885        411      46.4
            Xerces                      1.4                   588        437      74.3
AEEEM       Equinox                     1.1.2005–6.25.2008    324        129      39.8
            Eclipse JDT core (Eclipse)  1.1.2005–6.17.2008    997        206      20.7
            Apache Lucene (Lucene2)     1.1.2005–10.8.2008    692        20       2.9
            Mylyn                       1.17.2005–3.17.2009   1862       245      13.2
            Eclipse PDE UI (Pde)        1.1.2005–9.11.2008    1497       209      14.0
Table 3: The mappings between different values and their effectiveness levels.

Effect size   d      δ
Very small    0.01   0.008
Small         0.20   0.147
Medium        0.50   0.33
Large         0.80   0.474
Very large    1.20   0.622
Huge          2.0    0.811
indicates that each group of 15 prediction results obtained by our approach has a greater effect than that of NoD. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).
We then compared TDSelector with the baseline methods using other widely used similarity calculation methods, and the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. Compared with the corresponding NoD, TDSelector achieves average growth rates of AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7, respectively. More specifically, the highest growth rate of AUC value in Table 6 is 43.6% (for Equinox), and in Table 7 it is 39.7% (for Lucene2). Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of those baseline methods built without regard to defects.
Table 4: Analyzing the factors similarity and normalization.

Factor          Method               Mean    Std     δ
Similarity      Cosine similarity    0.704   0.082   −0.133
                Euclidean distance   0.719   0.080   –
                Manhattan distance   0.682   0.098   −0.193
Normalization   Linear               0.706   0.087   −0.012
                Logistic             0.710   0.078   –
                Square root          0.699   0.091   −0.044
                Logarithmic          0.700   0.086   −0.064
                Inverse cotangent    0.696   0.097   −0.056
In short, during the process of training data selection, taking defects into account for CPDP helps us select higher-quality training data, thus leading to better classification results.
6.2. Answer to RQ2. Although including defects in the selection of quality training data is helpful for better CPDP performance, it is worth noting that our method completely failed on Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. Therefore, which combination of similarity and normalization is more suitable for TDSelector?
First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance regardless of the normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factor.
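To make the two factors concrete, the three similarity indexes and the five defect-count normalizations compared above can be sketched as follows. The paper's exact normalization formulas are not reproduced in this excerpt, so the function bodies below are plausible stand-ins (assumptions), not the authors' definitions; the distance-to-similarity conversion 1/(1 + d) is likewise an illustrative choice.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two software-metric vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_similarity(a, b):
    # Euclidean distance turned into a similarity in (0, 1].
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

def manhattan_similarity(a, b):
    # Manhattan (city-block) distance turned into a similarity.
    d = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 / (1.0 + d)

# Five candidate ways to normalize a defect count n into [0, 1),
# given the maximum count n_max observed in the candidate set.
NORMALIZERS = {
    "linear":            lambda n, n_max: n / n_max if n_max else 0.0,
    "logistic":          lambda n, n_max: 2.0 / (1.0 + math.exp(-n)) - 1.0,
    "square_root":       lambda n, n_max: math.sqrt(n / n_max) if n_max else 0.0,
    "logarithmic":       lambda n, n_max: math.log(1 + n) / math.log(1 + n_max) if n_max else 0.0,
    "inverse_cotangent": lambda n, n_max: math.atan(n) * 2.0 / math.pi,
}
```

Each normalizer maps "more defects" to a larger score, which is all the weighted selection step requires.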
Table 5: The best prediction results obtained by the CPDP approach based on TDSelector with Cosine similarity. NoD represents the baseline method; "+(%)" denotes the growth rate of the AUC value over NoD. (In the original layout, the maximum AUC value among the normalization methods is underlined, and numbers in bold indicate that the corresponding AUC value rises by more than 10%.)

Columns, in order: Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde.

Linear:
  α    0.7, 0.9, 0.9, 1.0, 0.9, 1.0, 0.9, 1.0, 0.6, 0.9, 0.8, 0.6, 0.7, 0.7, 0.5; δ = 0.338
  AUC  0.813, 0.676, 0.603, 0.793, 0.700, 0.611, 0.758, 0.741, 0.512, 0.742, 0.783, 0.760, 0.739, 0.705, 0.729; Mean ± Std = 0.711 ± 0.081
  +(%) 6.3, 3.7, 1.9, –, 30.6, –, 3.0, –, 43.0, 0.3, 22.6, 39.4, 4.1, 5.9, 4.0; Avg = 9.0
Logistic:
  α    0.7, 0.5, 0.7, 1.0, 0.7, 0.6, 0.6, 0.6, 0.5, 0.5, 0.0, 0.4, 0.7, 0.5, 0.5; δ = 0.351
  AUC  0.802, 0.674, 0.595, 0.793, 0.665, 0.621, 0.759, 0.765, 0.579, 0.745, 0.773, 0.738, 0.712, 0.707, 0.740; Mean ± Std = 0.711 ± 0.070
  +(%) 4.8, 3.4, 0.5, –, 24.1, 1.6, 3.1, 3.2, 61.7, 0.7, 21.0, 35.5, 0.3, 6.2, 5.6; Avg = 9.0
Square root:
  α    0.7, 0.7, 0.6, 0.6, 0.7, 0.6, 0.7, 0.9, 0.5, 1.0, 0.4, 0.6, 0.6, 0.6, 0.6; δ = 0.249
  AUC  0.799, 0.654, 0.596, 0.807, 0.735, 0.626, 0.746, 0.762, 0.500, 0.740, 0.774, 0.560, 0.722, 0.700, 0.738; Mean ± Std = 0.697 ± 0.091
  +(%) 4.4, 0.3, 0.7, 1.8, 37.1, 2.5, 1.4, 2.8, 39.7, –, 21.0, 2.8, 1.7, 5.3, 5.3; Avg = 6.9
Logarithmic:
  α    0.6, 0.6, 0.9, 1.0, 0.7, 1.0, 0.7, 0.7, 0.5, 0.9, 0.5, 0.5, 0.6, 0.6, 0.6; δ = 0.351
  AUC  0.798, 0.662, 0.594, 0.793, 0.731, 0.611, 0.748, 0.744, 0.500, 0.758, 0.774, 0.700, 0.755, 0.702, 0.741; Mean ± Std = 0.707 ± 0.083
  +(%) 4.3, 1.5, 0.3, –, 36.4, –, 1.6, 0.4, 39.7, 2.4, 21.2, 28.5, 6.3, 5.5, 5.8; Avg = 8.5
Inverse cotangent:
  α    0.7, 1.0, 1.0, 1.0, 0.7, 1.0, 0.7, 1.0, 0.6, 0.7, 0.0, 0.7, 0.7, 0.7, 0.7; δ = 0.213
  AUC  0.798, 0.652, 0.592, 0.793, 0.659, 0.611, 0.749, 0.741, 0.500, 0.764, 0.773, 0.556, 0.739, 0.695, 0.734; Mean ± Std = 0.690 ± 0.092
  +(%) 4.3, –, –, –, 22.9, –, 1.8, –, 39.7, 3.2, 21.0, 2.1, 4.1, 4.4, 4.8; Avg = 5.9
NoD (α = 1):
  AUC  0.765, 0.652, 0.592, 0.793, 0.536, 0.611, 0.736, 0.741, 0.358, 0.740, 0.639, 0.543, 0.709, 0.665, 0.701; Mean ± Std = 0.652 ± 0.113
Table 6: The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance.

Columns, in order: Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde.

Linear:
  α    0.9, 0.9, 1.0, 0.9, 0.9, 0.8, 1.0, 1.0, 0.8, 0.8, 0.0, 0.6, 1.0, 0.8, 0.8; δ = 0.369
  AUC  0.795, 0.727, 0.598, 0.826, 0.793, 0.603, 0.714, 0.757, 0.545, 0.775, 0.773, 0.719, 0.722, 0.697, 0.744; Mean ± Std = 0.719 ± 0.080
  +(%) 1.3, 6.8, –, 0.9, 32.2, 1.9, –, –, 11.7, 5.2, 17.6, 43.0, –, 1.1, 9.6; Avg = 7.7
Logistic:
  α    0.7, 0.8, 0.4, 0.7, 0.7, 0.5, 0.6, 0.9, 0.9, 0.9, 0.0, 0.7, 1.0, 1.0, 0.9; δ = 0.360
  AUC  0.787, 0.750, 0.603, 0.832, 0.766, 0.613, 0.716, 0.767, 0.556, 0.745, 0.773, 0.698, 0.722, 0.690, 0.730; Mean ± Std = 0.717 ± 0.075
  +(%) 0.3, 10.1, 0.8, 1.6, 27.7, 3.5, 0.3, 1.3, 13.9, 1.1, 17.6, 38.8, –, –, 7.5; Avg = 7.2
Square root:
  α    0.7, 0.8, 1.0, 0.7, 0.8, 0.6, 0.7, 0.7, 0.7, 1.0, 0.7, 0.8, 1.0, 1.0, 0.9; δ = 0.342
  AUC  0.796, 0.743, 0.598, 0.820, 0.720, 0.618, 0.735, 0.786, 0.564, 0.737, 0.774, 0.696, 0.722, 0.690, 0.750; Mean ± Std = 0.715 ± 0.076
  +(%) 1.4, 9.1, –, 0.1, 20.0, 4.4, 2.9, 3.8, 15.6, –, 17.8, 38.4, –, –, 10.5; Avg = 7.0
Logarithmic:
  α    0.7, 0.8, 1.0, 1.0, 0.8, 0.6, 1.0, 1.0, 0.9, 0.9, 0.9, 0.8, 1.0, 1.0, 0.9; δ = 0.324
  AUC  0.794, 0.746, 0.598, 0.819, 0.722, 0.607, 0.714, 0.757, 0.573, 0.739, 0.778, 0.722, 0.722, 0.690, 0.748; Mean ± Std = 0.715 ± 0.072
  +(%) 1.1, 9.5, –, –, 20.3, 2.5, –, –, 17.4, 0.3, 18.5, 43.6, –, –, 10.3; Avg = 7.0
Inverse cotangent:
  α    0.8, 0.9, 0.6, 0.8, 0.8, 0.7, 1.0, 0.8, 0.6, 0.7, 0.0, 0.9, 0.9, 1.0, 0.9; δ = 0.280
  AUC  0.796, 0.749, 0.603, 0.820, 0.701, 0.623, 0.714, 0.787, 0.538, 0.750, 0.773, 0.589, 0.763, 0.690, 0.722; Mean ± Std = 0.708 ± 0.084
  +(%) 1.4, 10.0, 0.8, 0.1, 16.8, 5.2, –, 4.0, 10.2, 1.8, 17.6, 17.0, 5.6, –, 6.4; Avg = 5.9
NoD (α = 1):
  AUC  0.785, 0.681, 0.598, 0.819, 0.600, 0.592, 0.714, 0.757, 0.488, 0.737, 0.657, 0.503, 0.722, 0.690, 0.678; Mean ± Std = 0.668 ± 0.096
Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance.

Columns, in order: Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde.

Linear:
  α    0.8, 0.9, 0.9, 1.0, 0.9, 0.9, 1.0, 1.0, 0.8, 1.0, 0.0, 0.8, 0.9, 1.0, 1.0; δ = 0.187
  AUC  0.804, 0.753, 0.599, 0.816, 0.689, 0.626, 0.695, 0.748, 0.500, 0.749, 0.773, 0.633, 0.692, 0.695, 0.668; Mean ± Std = 0.696 ± 0.084
  +(%) 1.3, 7.0, 0.3, –, 7.3, 6.3, –, –, 7.8, –, 11.6, 19.0, 39.7, –, –; Avg = 5.6
Logistic:
  α    0.7, 0.7, 0.8, 0.8, 0.8, 0.7, 0.7, 0.9, 0.6, 0.7, 0.0, 0.9, 0.9, 1.0, 1.0; δ = 0.249
  AUC  0.799, 0.760, 0.607, 0.830, 0.674, 0.621, 0.735, 0.794, 0.520, 0.756, 0.773, 0.680, 0.559, 0.695, 0.668; Mean ± Std = 0.705 ± 0.084
  +(%) 0.6, 8.0, 1.7, 1.7, 5.0, 5.4, 5.8, 6.1, 12.1, 0.9, 11.6, 27.9, 12.7, –, –; Avg = 6.9
Square root:
  α    0.9, 0.9, 0.9, 1.0, 0.8, 0.8, 0.9, 0.8, 0.9, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0; δ = 0.164
  AUC  0.795, 0.755, 0.604, 0.816, 0.693, 0.627, 0.704, 0.750, 0.510, 0.749, 0.773, 0.532, 0.523, 0.695, 0.668; Mean ± Std = 0.680 ± 0.100
  +(%) 0.1, 7.2, 1.2, –, 7.9, 6.5, 1.3, 0.3, 9.9, –, 11.6, –, 4.6, –, –; Avg = 3.1
Logarithmic:
  α    1.0, 0.9, 0.9, 1.0, 0.9, 1.0, 1.0, 0.8, 0.9, 0.9, 0.0, 1.0, 0.0, 1.0, 1.0; δ = 0.116
  AUC  0.794, 0.755, 0.603, 0.816, 0.664, 0.589, 0.695, 0.763, 0.524, 0.756, 0.773, 0.532, 0.523, 0.695, 0.668; Mean ± Std = 0.677 ± 0.102
  +(%) –, 7.2, 1.0, –, 3.4, –, –, 2.0, 12.9, 0.9, 11.6, –, 4.6, –, –; Avg = 2.7
Inverse cotangent:
  α    1.0, 0.9, 0.9, 0.9, 0.9, 0.8, 0.9, 1.0, 0.7, 0.8, 0.0, 1.0, 0.0, 1.0, 1.0; δ = 0.133
  AUC  0.794, 0.749, 0.608, 0.821, 0.667, 0.609, 0.710, 0.748, 0.500, 0.758, 0.773, 0.532, 0.523, 0.695, 0.668; Mean ± Std = 0.677 ± 0.103
  +(%) –, 6.4, 1.8, 0.6, 3.9, 3.4, 2.2, –, 7.8, 1.2, 11.6, –, 4.6, –, –; Avg = 2.7
NoD (α = 1):
  AUC  0.794, 0.704, 0.597, 0.816, 0.642, 0.589, 0.695, 0.748, 0.464, 0.749, 0.693, 0.532, 0.500, 0.695, 0.668; Mean ± Std = 0.659 ± 0.105
Figure 4: A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow: (1) Euclidean → Cosine → Manhattan + Logistic; (2) Logistic → Linear → Manhattan + Logistic.
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), and it is followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the fact that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two aspects is presented in Figure 4.
Then we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains the best results 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084 when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every pair of combinations under discussion. As shown in Table 8, judging by the largest number of positive δ values in that table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
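The scoring that ranks candidate training instances under the preferred "Euclidean + Linear" setting can be sketched as below. The paper's exact weighted function (its equation (2)) is not reproduced in this excerpt, so this is a hedged reading: a linear blend of similarity and normalized defect count, where the default α = 0.7 is purely illustrative (the paper tunes α per dataset).

```python
import math

def tdselector_score(test_vec, cand_vec, cand_defects, max_defects, alpha=0.7):
    # One plausible form of the TDSelector score for a candidate training
    # instance: alpha weights similarity against the normalized defect count.
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(test_vec, cand_vec)))
    similarity = 1.0 / (1.0 + d)  # Euclidean distance -> similarity (assumption)
    norm_defects = cand_defects / max_defects if max_defects else 0.0  # linear normalization
    return alpha * similarity + (1.0 - alpha) * norm_defects
```

Setting α = 1 reduces the score to pure similarity (the NoD baseline), while α = 0 ranks candidates by defect count alone, which matches how the α column in Tables 5 to 7 is interpreted.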
6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than both baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained by the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized due to the instances with more defects. This result further validates that selecting training data with defects in mind is valuable.
Besides, the negative δ values in Table 9 also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
7. Discussion
7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of the nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step of 1. As shown in Figure 5, for the three combinations in question, selecting only the k nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
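The top-k retrieval step discussed here can be sketched as follows. This is an illustrative reading of the selection stage (pool the k nearest candidates of every test instance, dropping duplicates), not the authors' exact implementation; instance representation and tie handling are assumptions.

```python
import math

def build_training_set(test_set, candidates, k=10):
    # For each test instance, take its k nearest candidate instances by
    # Euclidean distance, then merge all picks into one training set,
    # skipping duplicates (a sketch of the top-k pooling step).
    selected = []
    for t in test_set:
        ranked = sorted(candidates, key=lambda c: math.dist(t, c))
        for c in ranked[:k]:
            if c not in selected:
                selected.append(c)
    return selected
```

Varying k from 1 to 10, as in Figures 5 and 6, simply changes the slice taken from each ranked list before scoring.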
Interestingly, for the combinations of "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size.

Columns, in order, are the five normalization methods (Linear, Logistic, Square root, Logarithmic, Inverse cotangent) under each of Cosine similarity, Euclidean distance, and Manhattan distance.

Cosine + Linear:      –, 0.018, 0.084, 0.000, 0.116 | −0.049, −0.036, −0.004, −0.013, −0.009 | 0.138, 0.049, 0.164, 0.178, 0.169
Euclidean + Linear:   0.049, 0.102, 0.111, 0.062, 0.164 | –, 0.036, 0.040, 0.058, 0.089 | 0.209, 0.102, 0.249, 0.276, 0.244
Manhattan + Logistic: −0.049, −0.022, 0.022, −0.013, 0.111 | −0.102, −0.076, −0.080, −0.049, −0.031 | 0.053, –, 0.124, 0.151, 0.147
Table 9: A comparison between our approach and two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question. The last two columns give the growth rate (%) of TDSelector ("Euclidean + Linear") over each baseline.

Test set   Baseline1   Baseline2   +(%) vs Baseline1   +(%) vs Baseline2
Ant        0.785       0.803       1.3                 −1.0
Xalan      0.657       0.675       10.7                7.7
Camel      0.595       0.624       0.5                 −4.2
Ivy        0.789       0.802       4.7                 3.0
Jedit      0.694       0.782       14.3                1.4
Lucene     0.608       0.701       −0.8                −14.0
Poi        0.691       0.789       3.3                 −9.5
Synapse    0.740       0.748       2.3                 1.2
Velocity   0.330       0.331       65.2                64.7
Xerces     0.714       0.753       8.5                 2.9
Eclipse    0.706       0.744       10.2                4.6
Equinox    0.587       0.720       23.1                0.3
Lucene2    0.705       0.724       2.5                 −0.2
Mylyn      0.631       0.646       9.3                 6.8
Pde        0.678       0.737       10.4                1.5
Avg        0.663       0.705       10.6                4.3

δ: Baseline1 vs TDSelector = −0.409; Baseline2 vs TDSelector = −0.009.
Table 10: Comparison of the defective instances of the simplified training datasets obtained by different methods on the Velocity project.

Method       #defective instances / #instances   #instances (defects > 1) / #defective instances
Baseline1    0.375                               0.247
Baseline2    0.393                               0.291
TDSelector   0.376                               0.487
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, this best result is still worse than those of the other two combinations.
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs: (a) from the viewpoint of a single dataset (release); (b) from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
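The two-stage split behind TDSelector-3 can be sketched as below. The (metric_vector, defect_count) pair representation is an assumption for illustration; the second stage (score-based selection over `remaining`) is the ordinary TDSelector step and is omitted here.

```python
def tdselector3_split(candidates, threshold=3):
    # Stage 1 of TDSelector-3: candidates with at least `threshold` bugs are
    # taken directly as training data; the rest are left for the score-based
    # selection. Each candidate is a (metric_vector, defect_count) pair.
    direct = [c for c in candidates if c[1] >= threshold]
    remaining = [c for c in candidates if c[1] < threshold]
    return direct, remaining
```

The threshold of three matches the cutoff used in this paper; tuning it automatically is left as future work by the authors.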
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly at the first stage according to a threshold on the number of bugs in each training instance (namely, three in this paper); our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold in TDSelector will be investigated in our future work.
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings of the above factors.
Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized the Cliff's delta effect size, instead of hypothesis testing methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because only 15 datasets were collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method obviously performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
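Cliff's delta, used throughout this comparison, follows directly from its standard definition; this sketch implements that textbook definition rather than any particular tool [42]:

```python
def cliffs_delta(xs, ys):
    # Cliff's delta: (#pairs with x > y  -  #pairs with x < y) / (|xs| * |ys|).
    # Ranges from -1 (all ys larger) to +1 (all xs larger); 0 means no effect.
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

Feeding it two groups of 15 AUC values reproduces the kind of δ entries reported in Tables 4, 8, and 9.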
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (beyond AEEEM and PROMISE) is the main threat to the validity of our results. All of the 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.

Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector (AUC per test set for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic"). The last column in each of the three plots represents the average AUC value.
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect prediction. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method in a comparison with the baseline methods in the context of M2O CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References

[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction - classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, no. 7, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 2013 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
Table 5: The best prediction results obtained by the CPDP approach based on TDSelector with Cosine similarity. NoD represents the baseline method; "+" in the original denotes the growth rate of the AUC value, the maximum AUC value of the different normalization methods is underlined, and each number shown in bold indicates that the corresponding AUC value rises by more than 10%. [Only the AUC values and Mean±Std could be recovered from the source; the per-project α settings and growth-rate rows are omitted.]

Normalization | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean±Std
Linear | 0.813 | 0.676 | 0.603 | 0.793 | 0.700 | 0.611 | 0.758 | 0.741 | 0.512 | 0.742 | 0.783 | 0.760 | 0.739 | 0.705 | 0.729 | 0.711±0.081
Logistic | 0.802 | 0.674 | 0.595 | 0.793 | 0.665 | 0.621 | 0.759 | 0.765 | 0.579 | 0.745 | 0.773 | 0.738 | 0.712 | 0.707 | 0.740 | 0.711±0.070
Square root | 0.799 | 0.654 | 0.596 | 0.807 | 0.735 | 0.626 | 0.746 | 0.762 | 0.500 | 0.740 | 0.774 | 0.560 | 0.722 | 0.700 | 0.738 | 0.697±0.091
Logarithmic | 0.798 | 0.662 | 0.594 | 0.793 | 0.731 | 0.611 | 0.748 | 0.744 | 0.500 | 0.758 | 0.774 | 0.700 | 0.755 | 0.702 | 0.741 | 0.707±0.083
Inverse cotangent | 0.798 | 0.652 | 0.592 | 0.793 | 0.659 | 0.611 | 0.749 | 0.741 | 0.500 | 0.764 | 0.773 | 0.556 | 0.739 | 0.695 | 0.734 | 0.690±0.092
NoD (α=1) | 0.765 | 0.652 | 0.592 | 0.793 | 0.536 | 0.611 | 0.736 | 0.741 | 0.358 | 0.740 | 0.639 | 0.543 | 0.709 | 0.665 | 0.701 | 0.652±0.113
Table 6: The best prediction results obtained by the CPDP approach based on TDSelector with Euclidean distance. [Only the AUC values and Mean±Std could be recovered from the source; the per-project α settings and growth-rate rows are omitted.]

Normalization | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean±Std
Linear | 0.795 | 0.727 | 0.598 | 0.826 | 0.793 | 0.603 | 0.714 | 0.757 | 0.545 | 0.775 | 0.773 | 0.719 | 0.722 | 0.697 | 0.744 | 0.719±0.080
Logistic | 0.787 | 0.750 | 0.603 | 0.832 | 0.766 | 0.613 | 0.716 | 0.767 | 0.556 | 0.745 | 0.773 | 0.698 | 0.722 | 0.690 | 0.730 | 0.717±0.075
Square root | 0.796 | 0.743 | 0.598 | 0.820 | 0.720 | 0.618 | 0.735 | 0.786 | 0.564 | 0.737 | 0.774 | 0.696 | 0.722 | 0.690 | 0.750 | 0.715±0.076
Logarithmic | 0.794 | 0.746 | 0.598 | 0.819 | 0.722 | 0.607 | 0.714 | 0.757 | 0.573 | 0.739 | 0.778 | 0.722 | 0.722 | 0.690 | 0.748 | 0.715±0.072
Inverse cotangent | 0.796 | 0.749 | 0.603 | 0.820 | 0.701 | 0.623 | 0.714 | 0.787 | 0.538 | 0.750 | 0.773 | 0.589 | 0.763 | 0.690 | 0.722 | 0.708±0.084
NoD (α=1) | 0.785 | 0.681 | 0.598 | 0.819 | 0.600 | 0.592 | 0.714 | 0.757 | 0.488 | 0.737 | 0.657 | 0.503 | 0.722 | 0.690 | 0.678 | 0.668±0.096
Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance. [Only the AUC values and Mean±Std could be recovered from the source; the per-project α settings and growth-rate rows are omitted.]

Normalization | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean±Std
Linear | 0.804 | 0.753 | 0.599 | 0.816 | 0.689 | 0.626 | 0.695 | 0.748 | 0.500 | 0.749 | 0.773 | 0.633 | 0.692 | 0.695 | 0.668 | 0.696±0.084
Logistic | 0.799 | 0.760 | 0.607 | 0.830 | 0.674 | 0.621 | 0.735 | 0.794 | 0.520 | 0.756 | 0.773 | 0.680 | 0.559 | 0.695 | 0.668 | 0.705±0.084
Square root | 0.795 | 0.755 | 0.604 | 0.816 | 0.693 | 0.627 | 0.704 | 0.750 | 0.510 | 0.749 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.680±0.100
Logarithmic | 0.794 | 0.755 | 0.603 | 0.816 | 0.664 | 0.589 | 0.695 | 0.763 | 0.524 | 0.756 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.677±0.102
Inverse cotangent | 0.794 | 0.749 | 0.608 | 0.821 | 0.667 | 0.609 | 0.710 | 0.748 | 0.500 | 0.758 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.677±0.103
NoD (α=1) | 0.794 | 0.704 | 0.597 | 0.816 | 0.642 | 0.589 | 0.695 | 0.748 | 0.464 | 0.749 | 0.693 | 0.532 | 0.500 | 0.695 | 0.668 | 0.659±0.105
[Figure 4] A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow.
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), followed by the linear normalization method.

Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two aspects is presented in Figure 4.
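To make the five candidate normalization schemes concrete, here is a sketch in Python. The exact formulas are our assumptions based on common definitions rather than the paper's verbatim equations; each function maps a defect count into [0, 1] given the minimum and maximum counts in the candidate set:

```python
import math

# Hedged sketch: plausible forms of the five normalization functions
# compared above for scaling an instance's defect count.
def linear(x, lo, hi):
    # Min-max scaling onto [0, 1].
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def logistic(x):
    # Sigmoid; maps counts >= 0 onto [0.5, 1).
    return 1.0 / (1.0 + math.exp(-x))

def square_root(x, hi):
    # Square root scaled by the maximum count.
    return math.sqrt(x) / math.sqrt(hi) if hi > 0 else 0.0

def logarithmic(x, hi):
    # log(1 + x) scaled by the maximum count.
    return math.log(1 + x) / math.log(1 + hi) if hi > 0 else 0.0

def inverse_cotangent(x):
    # arccot(x) rescaled so larger counts map closer to 1.
    return 1.0 - math.atan2(1.0, x) / (math.pi / 2)
```

All five are monotonically increasing, so they preserve the ranking of candidates by defect count and differ only in how strongly large counts are compressed.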
We then considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in that table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
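The winning "Euclidean + Linear" combination can be sketched as a weighted score over candidate training instances. The form score = α·similarity + (1 − α)·normalized defects follows the paper's linear weighted function; converting Euclidean distance into a similarity via 1/(1 + d) is our assumption:

```python
import math

def euclidean_similarity(a, b):
    # Turn the Euclidean distance between two metric vectors into a
    # similarity in (0, 1]; the 1/(1 + d) form is an assumption.
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

def tdselector_score(candidate, test_instance, defects, max_defects, alpha=0.6):
    # Linear weighted sum of similarity and linearly normalized defects;
    # candidates are then ranked by this score, highest first.
    sim = euclidean_similarity(candidate, test_instance)
    norm_defects = defects / max_defects if max_defects > 0 else 0.0
    return alpha * sim + (1 - alpha) * norm_defects
```

In the paper, α itself is tuned (Algorithm 1 searches it with a step size of 0.1), so the default of 0.6 here is purely illustrative.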
6.3. Answer to RQ3. A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than both baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained from the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized because of the instances with more defects. This result further validates that selecting training data with defects in mind is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
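AUC, the measure behind all of these comparisons, can be computed directly from predicted scores via its Mann-Whitney (rank) formulation; this generic sketch is independent of the paper's tooling (the authors built predictors in Weka):

```python
def auc(labels, scores):
    # Probability that a randomly chosen defective instance (label 1)
    # is scored above a randomly chosen clean one (label 0); ties count 0.5.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one instance of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking, which is why AUC values near 0.5 (e.g., Velocity under the baselines) signal an essentially uninformative predictor.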
7 Discussion
7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value changes from 1 to 10 with a step size of 1. As shown in Figure 5, for the three combinations in question, selecting only the k nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
Interestingly, for the combinations of "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. Columns run through the five normalization methods (Linear, Logistic, Square root, Logarithmic, Inverse cotangent) under Cosine similarity, then Euclidean distance, then Manhattan distance; "(self)" marks a combination compared with itself.

Cosine+Linear | vs Cosine: (self), 0.018, 0.084, 0.000, 0.116 | vs Euclidean: -0.049, -0.036, -0.004, -0.013, -0.009 | vs Manhattan: 0.138, 0.049, 0.164, 0.178, 0.169
Euclidean+Linear | vs Cosine: 0.049, 0.102, 0.111, 0.062, 0.164 | vs Euclidean: (self), 0.036, 0.040, 0.058, 0.089 | vs Manhattan: 0.209, 0.102, 0.249, 0.276, 0.244
Manhattan+Logistic | vs Cosine: -0.049, -0.022, 0.022, -0.013, 0.111 | vs Euclidean: -0.102, -0.076, -0.080, -0.049, -0.031 | vs Manhattan: 0.053, (self), 0.124, 0.151, 0.147
Table 9: A comparison between our approach and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question. The last two columns give the growth rate (%) of the TDSelector ("Euclidean + Linear") AUC over baseline1 and baseline2, respectively.

Test set | Baseline1 | Baseline2 | vs Baseline1 (%) | vs Baseline2 (%)
Ant | 0.785 | 0.803 | 1.3 | -1.0
Xalan | 0.657 | 0.675 | 10.7 | 7.7
Camel | 0.595 | 0.624 | 0.5 | -4.2
Ivy | 0.789 | 0.802 | 4.7 | 3.0
Jedit | 0.694 | 0.782 | 14.3 | 1.4
Lucene | 0.608 | 0.701 | -0.8 | -14.0
Poi | 0.691 | 0.789 | 3.3 | -9.5
Synapse | 0.740 | 0.748 | 2.3 | 1.2
Velocity | 0.330 | 0.331 | 65.2 | 64.7
Xerces | 0.714 | 0.753 | 8.5 | 2.9
Eclipse | 0.706 | 0.744 | 10.2 | 4.6
Equinox | 0.587 | 0.720 | 23.1 | 0.3
Lucene2 | 0.705 | 0.724 | 2.5 | -0.2
Mylyn | 0.631 | 0.646 | 9.3 | 6.8
Pde | 0.678 | 0.737 | 10.4 | 1.5
Avg | 0.663 | 0.705 | 10.6 | 4.3

Cliff's delta: baseline1 vs. TDSelector, δ = -0.409; baseline2 vs. TDSelector, δ = -0.009.
Table 10: Comparison of the defective instances of the simplified training dataset obtained from different methods on the Velocity project.

Method | defect instances / instances | instances (defects > 1) / defect instances
Baseline1 | 0.375 | 0.247
Baseline2 | 0.393 | 0.291
TDSelector | 0.376 | 0.487
[Figure 5] The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k, k = 1 to 10, for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
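The parameter k discussed above controls how many nearest candidates each test instance contributes to the initial TDS; a minimal sketch of that step follows (the function name and the use of plain Euclidean distance are our choices):

```python
import math
from heapq import nsmallest

def k_nearest_candidates(test_set, candidates, k=10):
    # Union of each test instance's k nearest candidate training
    # instances (Euclidean distance), with duplicates removed.
    selected = set()
    for t in test_set:
        def dist(c):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(c, t)))
        for c in nsmallest(k, candidates, key=dist):
            selected.add(tuple(c))
    return sorted(list(c) for c in selected)
```

Because the per-test-instance neighbor sets are merged, the resulting TDS grows with both k and the size of the test set, which is why a larger k can supply more (and more diverse) training data.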
[Figure 6] The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k, k = 1 to 10, for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
[Figure 7] Percentage of defective instances with different numbers of bugs: (a) is shown from the viewpoint of a single dataset (release), plotting per release the percentage of instances with defects < 2 and defects < 3; (b) is shown from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
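The two-stage selection of TDSelector-3 can be sketched as follows; the (id, bug_count) candidate representation and the stand-in score_fn are our assumptions, with score_fn playing the role of the paper's scoring formula (2):

```python
def tdselector3(candidates, score_fn, bug_threshold=3, n_scored=10):
    # candidates: list of (instance_id, bug_count) pairs.
    # Stage 1: instances with >= bug_threshold bugs go straight in.
    direct = [c for c in candidates if c[1] >= bug_threshold]
    # Stage 2: rank the remainder by the TDSelector score, keep the top n.
    rest = [c for c in candidates if c[1] < bug_threshold]
    scored = sorted(rest, key=score_fn, reverse=True)[:n_scored]
    # Merge, dropping redundant instances by id.
    final, seen = [], set()
    for c in direct + scored:
        if c[0] not in seen:
            seen.add(c[0])
            final.append(c)
    return final
```

The bug_threshold of three mirrors the cutoff used in this section; as noted below, choosing it automatically is left to future work.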
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly at the first stage according to a threshold for the number of bugs in each training instance (namely, three in this paper), and our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
7.3. Threats to Validity. In this study, we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized with the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings of the above factors.
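For reference, the Z-score normalization mentioned above standardizes each metric column to zero mean and unit variance; a generic sketch (population standard deviation, with constant columns mapped to zeros) is:

```python
import math

def zscore(column):
    # Standardize one metric column to mean 0 and (population) std 1.
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    if std == 0:
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]
```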
Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables, based on the experimental data, are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria that were initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs better than baseline1, indicated by |δ| = 0.409 > 0.33.
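Cliff's delta itself is simple to compute by pairwise comparison; the magnitude labels below follow the commonly cited thresholds (the paper's "very small" band corresponds to "negligible" here):

```python
def cliffs_delta(xs, ys):
    # delta = (#(x > y) - #(x < y)) / (n * m), ranging over [-1, 1].
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    # Thresholds commonly attributed to Romano et al., matching the
    # Cohen/Sawilowsky-style bands cited in the text.
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```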
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (beyond AEEEM and PROMISE) is the main threat to the validity of the results of our study. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
16 Mathematical Problems in Engineering
Euclidean+Linear
050055060065070075080085
Ant
Xala
n
Cam
el Ivy
Jedi
t
Luce
ne Poi
Syna
pse
Velo
city
Xerc
es
Eclip
se
Equi
nox
Luce
ne
Myl
yn Pde
Avg
AUC
Cosine+Linear
TDSelectorTDSelector-3
050055060065070075080085
Ant
Xala
n
Cam
el Ivy
Jedi
t
Luce
ne Poi
Syna
pse
Velo
city
Xerc
es
Eclip
se
Equi
nox
Luce
ne
Myl
yn Pde
Avg
AUC
TDSelectorTDSelector-3
050055060065070075080085
Ant
Xala
n
Cam
el Ivy
Jedi
t
Luce
ne Poi
Syna
pse
Velo
city
Xerc
es
Eclip
se
Equi
nox
Luce
ne
Myl
yn Pde
Avg
AUC
TDSelectorTDSelector-3
Manhattan+Logistic
Figure 8 A comparison of prediction performance between TDSelector-3 and the corresponding TDSelectorThe last column in each of thethree plots represents the average AUC value
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8 Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, in order to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest re-garding the publication of this article
Acknowledgments
The authors greatly appreciate Dr Nam and Dr Pan theauthors of [39] for providing them with the TCA sourceprogram and teaching them how to use itThis work was sup-ported by the Natural Science Foundation of Hubei province(no 2016CFB309) and the National Natural Science Founda-tion of China (nos 61272111 61273216 and 61572371)
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE '09), pp. 91–100, The Netherlands, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, USA, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, USA, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, Italy, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, India, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, Italy, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, Italy, September 2015.
18 Mathematical Problems in Engineering
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, p. 1, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, USA, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, Czech Republic, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, USA, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer System, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, Hungary, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, Germany, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, Hungary, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, Sweden, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
Table 6: The best prediction results (AUC) obtained by the CPDP approach based on TDSelector with Euclidean distance. Each row corresponds to one normalization method; δ is the Cliff's delta effect size. (The per-project values of the weighting factor α and the percentage improvements over NoD in the original table are not reproduced here.)

| Normalization | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean ± Std | δ |
| Linear | 0.795 | 0.727 | 0.598 | 0.826 | 0.793 | 0.603 | 0.714 | 0.757 | 0.545 | 0.775 | 0.773 | 0.719 | 0.722 | 0.697 | 0.744 | 0.719 ± 0.080 | 0.369 |
| Logistic | 0.787 | 0.750 | 0.603 | 0.832 | 0.766 | 0.613 | 0.716 | 0.767 | 0.556 | 0.745 | 0.773 | 0.698 | 0.722 | 0.690 | 0.730 | 0.717 ± 0.075 | 0.360 |
| Square root | 0.796 | 0.743 | 0.598 | 0.820 | 0.720 | 0.618 | 0.735 | 0.786 | 0.564 | 0.737 | 0.774 | 0.696 | 0.722 | 0.690 | 0.750 | 0.715 ± 0.076 | 0.342 |
| Logarithmic | 0.794 | 0.746 | 0.598 | 0.819 | 0.722 | 0.607 | 0.714 | 0.757 | 0.573 | 0.739 | 0.778 | 0.722 | 0.722 | 0.690 | 0.748 | 0.715 ± 0.072 | 0.324 |
| Inverse cotangent | 0.796 | 0.749 | 0.603 | 0.820 | 0.701 | 0.623 | 0.714 | 0.787 | 0.538 | 0.750 | 0.773 | 0.589 | 0.763 | 0.690 | 0.722 | 0.708 ± 0.084 | 0.280 |
| NoD (α = 1) | 0.785 | 0.681 | 0.598 | 0.819 | 0.600 | 0.592 | 0.714 | 0.757 | 0.488 | 0.737 | 0.657 | 0.503 | 0.722 | 0.690 | 0.678 | 0.668 ± 0.096 | n/a |
Table 7: The best prediction results (AUC) obtained by the CPDP approach based on TDSelector with Manhattan distance. Each row corresponds to one normalization method; δ is the Cliff's delta effect size. (The per-project values of the weighting factor α and the percentage improvements over NoD in the original table are not reproduced here.)

| Normalization | Ant | Xalan | Camel | Ivy | Jedit | Lucene | Poi | Synapse | Velocity | Xerces | Eclipse | Equinox | Lucene2 | Mylyn | Pde | Mean ± Std | δ |
| Linear | 0.804 | 0.753 | 0.599 | 0.816 | 0.689 | 0.626 | 0.695 | 0.748 | 0.500 | 0.749 | 0.773 | 0.633 | 0.692 | 0.695 | 0.668 | 0.696 ± 0.084 | 0.187 |
| Logistic | 0.799 | 0.760 | 0.607 | 0.830 | 0.674 | 0.621 | 0.735 | 0.794 | 0.520 | 0.756 | 0.773 | 0.680 | 0.559 | 0.695 | 0.668 | 0.705 ± 0.084 | 0.249 |
| Square root | 0.795 | 0.755 | 0.604 | 0.816 | 0.693 | 0.627 | 0.704 | 0.750 | 0.510 | 0.749 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.680 ± 0.100 | 0.164 |
| Logarithmic | 0.794 | 0.755 | 0.603 | 0.816 | 0.664 | 0.589 | 0.695 | 0.763 | 0.524 | 0.756 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.677 ± 0.102 | 0.116 |
| Inverse cotangent | 0.794 | 0.749 | 0.608 | 0.821 | 0.667 | 0.609 | 0.710 | 0.748 | 0.500 | 0.758 | 0.773 | 0.532 | 0.523 | 0.695 | 0.668 | 0.677 ± 0.103 | 0.133 |
| NoD (α = 1) | 0.794 | 0.704 | 0.597 | 0.816 | 0.642 | 0.589 | 0.695 | 0.748 | 0.464 | 0.749 | 0.693 | 0.532 | 0.500 | 0.695 | 0.668 | 0.659 ± 0.105 | n/a |
Figure 4: A guideline for choosing suitable similarity indexes and normalization methods, from the two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow.
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), and it is followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector, from these two aspects, is presented in Figure 4.
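To make the scoring scheme concrete, the sketch below implements the five normalization functions named in the tables (linear, logistic, square root, logarithmic, and inverse cotangent) together with a weighted score that combines similarity and the normalized defect count, in the spirit of the paper's linear weighted function (Eq. (2)). The exact formulas, in particular the logistic and inverse-cotangent variants and the distance-to-similarity conversion, are illustrative assumptions rather than the authors' published definitions.

```python
import math

# Five candidate ways to normalize an instance's defect count into [0, 1).
# The exact formulas below are illustrative guesses, not the paper's own.
def linear(d, d_max):
    return d / d_max if d_max else 0.0

def logistic(d):
    return 2.0 / (1.0 + math.exp(-d)) - 1.0

def square_root(d, d_max):
    return math.sqrt(d) / math.sqrt(d_max) if d_max else 0.0

def logarithmic(d, d_max):
    return math.log(1 + d) / math.log(1 + d_max) if d_max else 0.0

def inv_cotangent(d):
    return 2.0 * math.atan(d) / math.pi

def euclidean_sim(x, y):
    """Turn the Euclidean distance between two metric vectors into a similarity."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + dist)

def tdselector_score(test_vec, cand_vec, cand_defects, d_max, alpha=0.6):
    """Weighted sum of similarity and normalized defects; alpha balances the two."""
    return alpha * euclidean_sim(test_vec, cand_vec) + (1 - alpha) * linear(cand_defects, d_max)
```

Swapping `linear` for any of the other four normalizers, and `euclidean_sim` for a cosine or Manhattan variant, yields the 15 combinations compared in Tables 5 to 7.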
We then considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (cosine similarity + logistic normalization), and "Manhattan + Logistic" (Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in that table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
6.3 Answer to RQ3. A comparison between our approach and two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than the two baseline methods, indicated by average growth rates of the AUC value of 10.6% and 4.3%, respectively, across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In
particular, compared with baseline1 and baseline2, the highest growth rates of the AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained by the different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, among these defective instances, our method returns more instances with more than one defect, at a ratio approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized because of the instances with more defects. This result further validates that selecting training data with defects taken into account is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size of 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
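Since AUC is the evaluation measure used throughout these comparisons, a minimal rank-based (Mann-Whitney) computation of AUC from predicted scores may help readers reproduce them; this is an illustrative stand-in for the Weka pipeline actually used by the authors:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the probability that a randomly
    chosen defective instance (label 1) is scored above a non-defective one
    (label 0), with ties counted as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A predictor that ranks every defective instance above every clean one scores 1.0; random scoring hovers around 0.5, matching the usual reading of the AUC values reported in Tables 6, 7, and 9.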
7. Discussion
7.1 Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances kept for each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step size of 1. As shown in Figure 5, for the three combinations in question, selecting fewer nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
Interestingly, for the combinations "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. Column abbreviations: Lin = linear, Log = logistic, Sqrt = square root, Ln = logarithmic, Acot = inverse cotangent; the three column groups are cosine similarity (Cos), Euclidean distance (Euc), and Manhattan distance (Man).

| Combination | Cos+Lin | Cos+Log | Cos+Sqrt | Cos+Ln | Cos+Acot | Euc+Lin | Euc+Log | Euc+Sqrt | Euc+Ln | Euc+Acot | Man+Lin | Man+Log | Man+Sqrt | Man+Ln | Man+Acot |
| Cosine + Linear | n/a | 0.018 | 0.084 | 0.000 | 0.116 | -0.049 | -0.036 | -0.004 | -0.013 | -0.009 | 0.138 | 0.049 | 0.164 | 0.178 | 0.169 |
| Euclidean + Linear | 0.049 | 0.102 | 0.111 | 0.062 | 0.164 | n/a | 0.036 | 0.040 | 0.058 | 0.089 | 0.209 | 0.102 | 0.249 | 0.276 | 0.244 |
| Manhattan + Logistic | -0.049 | -0.022 | 0.022 | -0.013 | 0.111 | -0.102 | -0.076 | -0.080 | -0.049 | -0.031 | 0.053 | n/a | 0.124 | 0.151 | 0.147 |
Table 9: A comparison between our approach ("Euclidean + Linear") and the two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question; the two rightmost columns give the growth rate (%) of our AUC value over each baseline.

| Test set | Baseline1 | Baseline2 | vs. Baseline1 (%) | vs. Baseline2 (%) |
| Ant | 0.785 | 0.803 | 1.3 | -1.0 |
| Xalan | 0.657 | 0.675 | 10.7 | 7.7 |
| Camel | 0.595 | 0.624 | 0.5 | -4.2 |
| Ivy | 0.789 | 0.802 | 4.7 | 3.0 |
| Jedit | 0.694 | 0.782 | 14.3 | 1.4 |
| Lucene | 0.608 | 0.701 | -0.8 | -14.0 |
| Poi | 0.691 | 0.789 | 3.3 | -9.5 |
| Synapse | 0.740 | 0.748 | 2.3 | 1.2 |
| Velocity | 0.330 | 0.331 | 65.2 | 64.7 |
| Xerces | 0.714 | 0.753 | 8.5 | 2.9 |
| Eclipse | 0.706 | 0.744 | 10.2 | 4.6 |
| Equinox | 0.587 | 0.720 | 23.1 | 0.3 |
| Lucene2 | 0.705 | 0.724 | 2.5 | -0.2 |
| Mylyn | 0.631 | 0.646 | 9.3 | 6.8 |
| Pde | 0.678 | 0.737 | 10.4 | 1.5 |
| Avg | 0.663 | 0.705 | 10.6 | 4.3 |

Cliff's delta: Baseline1 vs. TDSelector, δ = -0.409; Baseline2 vs. TDSelector, δ = -0.009.
Table 10: Comparison of the defective instances of the simplified training dataset obtained by different methods on the Velocity project.

| Method | defective instances / instances | instances (defects > 1) / defective instances |
| Baseline1 | 0.375 | 0.247 |
| Baseline2 | 0.393 | 0.291 |
| TDSelector | 0.376 | 0.487 |
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear"; AUC on the y-axis, k from 1 to 10 on the x-axis).
when k is equal to 10. The combination "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
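The top-k selection step discussed above can be sketched as follows; the use of plain Euclidean distance on metric vectors and the union of per-test-instance neighbor sets are assumptions made for illustration:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def top_k_candidates(test_set, candidates, k=10):
    """For each test instance, keep its k nearest candidate training instances;
    the union of these neighbor sets (without duplicates) forms the initial TDS."""
    selected = set()
    for t in test_set:
        ranked = sorted(range(len(candidates)), key=lambda i: euclidean(t, candidates[i]))
        selected.update(ranked[:k])
    return [candidates[i] for i in sorted(selected)]
```

With k = 10 (the setting used in the experiments), each test instance contributes up to ten candidates to the simplified training dataset.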
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (combinations "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear"; AUC on the y-axis, k from 1 to 10 on the x-axis).
7.2 Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC. We also want to know whether directly selecting defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1/40 (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs: (a) from the viewpoint of a single dataset (release); (b) from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS, after removing redundant ones.
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold for the number of bugs in each training instance (namely, three in this paper) at the first stage, and our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
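The two-stage TDSelector-3 procedure just described can be sketched as follows; the scoring of the remaining candidates is abstracted behind a `score_fn` parameter, and the optional `budget` argument is a hypothetical knob for capping the TDS size, not part of the original method:

```python
def tdselector3(candidates, defects, score_fn, threshold=3, budget=None):
    """Stage 1: pick every candidate instance with at least `threshold` bugs
    directly.  Stage 2: rank the remaining candidates by the TDSelector score
    and append them (optionally only up to `budget` instances in total)."""
    direct = [i for i, d in enumerate(defects) if d >= threshold]
    taken = set(direct)
    rest = [i for i in range(len(candidates)) if i not in taken]
    ranked = sorted(rest, key=lambda i: score_fn(candidates[i]), reverse=True)
    if budget is not None:
        ranked = ranked[: max(0, budget - len(direct))]
    return [candidates[i] for i in direct + ranked]
```

The threshold of three bugs matches the cut-off used in the paper; tuning it automatically is left as future work by the authors.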
7.3 Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized with the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software
metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured in the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings for the factors above.
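For reference, the Z-score normalization mentioned above standardizes each software-metric column to zero mean and unit variance; a minimal version:

```python
import math

def z_score(column):
    """Standardize one metric column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        return [0.0] * n  # a constant metric carries no information
    return [(v - mean) / std for v in column]
```

Applying this per metric puts projects with very different metric scales on a comparable footing before distances are computed.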
Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to the mean value and standard deviation, in this paper we also utilized the Cliff's delta effect size, instead of hypothesis-testing methods such as the Kruskal-Wallis H test [51], to compare the results of different methods, because only 15 datasets were collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs noticeably better than baseline1, indicated by |δ| = 0.409 > 0.33.
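Cliff's delta, used throughout these comparisons, can be computed directly from two groups of AUC values; the magnitude labels below follow the thresholds cited above, while the 0.474 cut-off separating "medium" from "large" is the commonly used value and an assumption here:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs from the two groups."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    """Magnitude label for |delta|, per the Cohen/Sawilowsky-style thresholds."""
    d = abs(delta)
    if d < 0.147:
        return "very small"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

For example, the reported |δ| = 0.409 between TDSelector and baseline1 falls above the 0.33 "small" threshold, which is why that difference is treated as meaningful.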
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets, beyond AEEEM and PROMISE, is the main threat to the validity of our results. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
(Three plots: "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic"; each compares the AUC of TDSelector and TDSelector-3 for every project and on average.)
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector. The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, in order to improve the performance of cross-project defect prediction. In summary, the study was conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between defects and the similarity of test instances to training instances, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
Mathematical Problems in Engineering 17
required to build suitable predictors quickly for their newprojects because one of our interesting findings is that thosecandidate instances with more bugs can be chosen directly astraining instances
Our future work mainly includes two aspects On the onehandwe plan to validate the generalizability of our studywithmore defect data from projects written in different languagesOn the other hand we will focus on more effective hybridmethods based ondifferent selection strategies such as featureselection techniques [32] Last but not least we also plan todiscuss the possibility of considering not only the numberof defects but also time variables for training data selection(such as bug-fixing time)
Conflicts of Interest
The authors declare that there are no conflicts of interest re-garding the publication of this article
Acknowledgments
The authors greatly appreciate Dr Nam and Dr Pan theauthors of [39] for providing them with the TCA sourceprogram and teaching them how to use itThis work was sup-ported by the Natural Science Foundation of Hubei province(no 2016CFB309) and the National Natural Science Founda-tion of China (nos 61272111 61273216 and 61572371)
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
Mathematical Problems in Engineering 11
Table 7: The best prediction results obtained by the CPDP approach based on TDSelector with Manhattan distance (AUC values; columns are the normalization methods applied to defects, and NoD denotes no use of defects, i.e., α = 1).

Project      Linear       Logistic     Square root  Logarithmic  Inverse cotangent  NoD (α = 1)
Ant          0.804        0.799        0.795        0.794        0.794              0.794
Xalan        0.753        0.760        0.755        0.755        0.749              0.704
Camel        0.599        0.607        0.604        0.603        0.608              0.597
Ivy          0.816        0.830        0.816        0.816        0.821              0.816
Jedit        0.689        0.674        0.693        0.664        0.667              0.642
Lucene       0.626        0.621        0.627        0.589        0.609              0.589
Poi          0.695        0.735        0.704        0.695        0.710              0.695
Synapse      0.748        0.794        0.750        0.763        0.748              0.748
Velocity     0.500        0.520        0.510        0.524        0.500              0.464
Xerces       0.749        0.756        0.749        0.756        0.758              0.749
Eclipse      0.773        0.773        0.773        0.773        0.773              0.693
Equinox      0.633        0.680        0.532        0.532        0.532              0.532
Lucene2      0.692        0.559        0.523        0.523        0.523              0.500
Mylyn        0.695        0.695        0.695        0.695        0.695              0.695
Pde          0.668        0.668        0.668        0.668        0.668              0.668
Mean±Std     0.696±0.084  0.705±0.084  0.680±0.1    0.677±0.102  0.677±0.103        0.659±0.105
Cliff's δ    0.187        0.249        0.164        0.116        0.133              -
Figure 4: A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow: (1) Euclidean distance, then cosine similarity, then Manhattan distance + Logistic; (2) logistic normalization, then linear normalization, then Manhattan distance + Logistic.
If we do not take normalization into account, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take the similarity index into account, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the fact that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two aspects is presented in Figure 4.
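The five normalization methods compared above map a raw defect count onto a bounded range so that it can be combined with a similarity value on equal footing. The exact formulas are not reproduced in this excerpt, so the definitions below (min-max linear scaling, a logistic curve, and square-root, logarithmic, and arctan-based "inverse cotangent" scaling) are a plausible reconstruction, not the authors' exact ones.

```python
import math

def linear(d, d_min, d_max):
    # min-max scaling of the defect count into [0, 1]
    return (d - d_min) / (d_max - d_min) if d_max > d_min else 0.0

def logistic(d):
    # logistic (sigmoid) curve; maps counts >= 0 into [0.5, 1)
    return 1.0 / (1.0 + math.exp(-d))

def square_root(d, d_max):
    # square-root scaling relative to the largest count
    return math.sqrt(d) / math.sqrt(d_max) if d_max > 0 else 0.0

def logarithmic(d, d_max):
    # logarithmic scaling; +1 keeps zero-defect instances defined
    return math.log(d + 1.0) / math.log(d_max + 1.0) if d_max > 0 else 0.0

def inverse_cotangent(d):
    # arctan-based scaling into [0, 1); large counts approach 1
    return math.atan(d) * 2.0 / math.pi
```

Whatever their exact form, all five are monotone in the defect count, so they differ only in how strongly they compress large counts.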
We then considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains its best results, 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084, when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated Cliff's delta (δ) effect size for every pair of combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in that table, the combination of Euclidean distance and linear normalization still outperforms the other 14 combinations.
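The selection rule being tuned here, as described in the abstract and Section 7.1, scores each candidate training instance by a linear weighted combination of its similarity to a test instance and its normalized defect count, then keeps the top-k (k = 10 in the experiments) scored candidates per test instance. A sketch under stated assumptions: the `1 / (1 + distance)` similarity transform, the field names, and the default α are illustrative, not the authors' exact implementation.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def score(candidate, test_metrics, norm_defects, alpha):
    # Linear weighted combination of similarity and normalized defect count.
    # Turning distance into a similarity via 1/(1+d) is an assumption here.
    sim = 1.0 / (1.0 + euclidean(candidate["metrics"], test_metrics))
    return alpha * sim + (1.0 - alpha) * norm_defects(candidate["defects"])

def select_training_data(candidates, test_set, norm_defects, alpha=0.8, k=10):
    selected = {}
    for t in test_set:
        ranked = sorted(candidates,
                        key=lambda c: score(c, t, norm_defects, alpha),
                        reverse=True)
        for c in ranked[:k]:          # keep the k highest-scored candidates
            selected[id(c)] = c       # de-duplicate across test instances
    return list(selected.values())
```

Sweeping α over 0.0, 0.1, ..., 1.0 and keeping the value that maximizes AUC reproduces the role of the weighting factor discussed in the threats-to-validity section.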
6.3. Answer to RQ3. A comparison between our approach and two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than the two baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 in 14 out of 15 datasets, and it has an advantage over baseline2 in 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training datasets obtained from different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, in terms of instances with more than one defect among these defective instances, our method returns more, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized due to the instances with more defects. This result further validates that selecting training data while considering defects is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size of 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
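The growth rates reported in Table 9 are relative AUC improvements over each baseline; a one-line helper makes the computation explicit (the AUC values in the example are illustrative, not taken from the table):

```python
def auc_growth_rate(auc_new, auc_baseline):
    # relative improvement in percent: (new - old) / old * 100
    return (auc_new - auc_baseline) / auc_baseline * 100.0

# Illustrative values only:
print(round(auc_growth_rate(0.72, 0.663), 1))  # -> 8.6
```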
7. Discussion
7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of nearest training instances selected for each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step value of 1. As shown in Figure 5, for the three combinations in question, selecting only the k-nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
Interestingly, for the combinations of "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size ("-" marks the self-comparison).

Compared combination            Cosine + Linear   Euclidean + Linear   Manhattan + Logistic
Cosine, Linear                        -                 0.049               -0.049
Cosine, Logistic                    0.018               0.102               -0.022
Cosine, Square root                 0.084               0.111                0.022
Cosine, Logarithmic                 0.000               0.062               -0.013
Cosine, Inverse cotangent           0.116               0.164                0.111
Euclidean, Linear                  -0.049                 -                 -0.102
Euclidean, Logistic                -0.036               0.036               -0.076
Euclidean, Square root             -0.004               0.040               -0.080
Euclidean, Logarithmic             -0.013               0.058               -0.049
Euclidean, Inverse cotangent       -0.009               0.089               -0.031
Manhattan, Linear                   0.138               0.209                0.053
Manhattan, Logistic                 0.049               0.102                  -
Manhattan, Square root              0.164               0.249                0.124
Manhattan, Logarithmic              0.178               0.276                0.151
Manhattan, Inverse cotangent        0.169               0.244                0.147
Table 9: A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question; the last two columns give the growth rate (%) of the TDSelector (Euclidean + Linear) AUC over each baseline.

Test set   Baseline1   Baseline2   Growth over baseline1 (%)   Growth over baseline2 (%)
Ant        0.785       0.803        1.3                        -1.0
Xalan      0.657       0.675       10.7                         7.7
Camel      0.595       0.624        0.5                        -4.2
Ivy        0.789       0.802        4.7                         3.0
Jedit      0.694       0.782       14.3                         1.4
Lucene     0.608       0.701       -0.8                       -14.0
Poi        0.691       0.789        3.3                        -9.5
Synapse    0.740       0.748        2.3                         1.2
Velocity   0.330       0.331       65.2                        64.7
Xerces     0.714       0.753        8.5                         2.9
Eclipse    0.706       0.744       10.2                         4.6
Equinox    0.587       0.720       23.1                         0.3
Lucene2    0.705       0.724        2.5                        -0.2
Mylyn      0.631       0.646        9.3                         6.8
Pde        0.678       0.737       10.4                         1.5
Avg        0.663       0.705       10.6                         4.3

Cliff's δ: baseline1 vs. TDSelector: -0.409; baseline2 vs. TDSelector: -0.009.
Table 10: Comparison of the defective instances of the simplified training dataset obtained from different methods on the Velocity project.

Method       defect instances / instances   instances (defects > 1) / defect instances
Baseline1    0.375                          0.247
Baseline2    0.393                          0.291
TDSelector   0.376                          0.487
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC versus k = 1, ..., 10 for the Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear combinations).
when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC versus k = 1, ..., 10 for the Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear combinations).
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of high-quality training data in terms of AUC, and we also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs: (a) from the viewpoint of a single dataset (release); (b) from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS, after removing redundant ones.
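The two-stage TDSelector-3 selection just described can be sketched as follows; the `defects` field name and the `score_rest` hook, which stands in for the TDSelector scoring of (2), are assumptions for illustration rather than the authors' code.

```python
def tdselector3(candidates, test_set, score_rest, bug_threshold=3):
    # Stage 1: defective instances with at least `bug_threshold` bugs are
    # taken directly from the initial TDS as training data.
    direct = [c for c in candidates if c["defects"] >= bug_threshold]
    # Stage 2: the remaining candidates are ranked/filtered by the
    # TDSelector score (here delegated to the score_rest callback).
    rest = [c for c in candidates if c["defects"] < bug_threshold]
    scored = score_rest(rest, test_set)
    # The union of both parts, with duplicates removed, forms the final TDS.
    seen, final = set(), []
    for c in direct + scored:
        if id(c) not in seen:
            seen.add(id(c))
            final.append(c)
    return final
```

The threshold of three bugs is the fixed value used in this paper; tuning it automatically is left as future work by the authors.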
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, TDSelector-3 performs on average better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly at the first stage, according to a threshold for the number of bugs in each training instance (namely, three in this paper); our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings for the above factors.
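The Z-score normalization mentioned above standardizes each metric column to zero mean and unit variance before similarities are computed; a minimal standard-library sketch:

```python
import statistics

def z_score(column):
    # standardize one metric column: (x - mean) / stdev
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no signal
    return [(x - mu) / sigma for x in column]
```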
Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to the mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis testing methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs obviously better than baseline1, indicated by |δ| = 0.409 > 0.33.
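Cliff's delta, used throughout these comparisons, counts over all cross-sample pairs how often values of one group exceed the other. A straightforward O(n·m) sketch, with magnitude labels following the Cohen/Sawilowsky thresholds quoted above:

```python
def cliffs_delta(xs, ys):
    # delta = (#pairs with x > y  -  #pairs with x < y) / (n * m), in [-1, 1]
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    d = abs(delta)
    if d < 0.147:
        return "negligible/very small"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

For example, the |δ| = 0.409 reported for baseline1 versus TDSelector falls in the medium band, while the 0.009 for baseline2 is very small.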
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to validating the results of our study. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
[Figure 8 contains three bar charts, one per combination (Cosine + Linear, Euclidean + Linear, Manhattan + Logistic), each plotting the AUC of TDSelector and TDSelector-3 for the 15 projects (Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde) and their average.]
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector. The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect prediction. In summary, the study was conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between defects and the similarity of test instances to training instances, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method in a comparison with the baseline methods in the context of M2O CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, September 2015.
18 Mathematical Problems in Engineering
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
Figure 4: A guideline for choosing suitable similarity indexes and normalization methods from two aspects of similarity (see (1)) and normalization (see (2)). The selection priority is lowered along the direction of the arrow.
If we do not take into account normalization, Euclidean distance achieves the maximum mean value (0.719) and the minimum standard deviation (0.080) among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take into account the similarity index, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value (0.710) and the minimum standard deviation (0.078), and it is followed by the linear normalization method.
Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the evidence that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two different aspects is then presented in Figure 4.
Then we considered both factors. According to the results in Tables 5, 6, and 7 grouped by different similarity indexes, TDSelector obtains the best results of 0.719 ± 0.080, 0.711 ± 0.070, and 0.705 ± 0.084 when using "Euclidean + Linear" (short for Euclidean distance + linear normalization), "Cosine + Logistic" (short for cosine similarity + logistic normalization), and "Manhattan + Logistic" (short for Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every two combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.
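To make these combinations concrete, the pieces can be sketched in a few lines: the three similarity indexes, two of the five normalization methods, and the linear weighted score that joins them. The 1/(1 + d) distance-to-similarity conversion, the exact logistic form, and the function names are our assumptions, not the authors' code.

```python
import math

# Three similarity indexes; distances are turned into similarities
# via 1 / (1 + d), an assumed conversion.
def euclidean_similarity(x, y):
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))

def manhattan_similarity(x, y):
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(x, y)))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

# Two of the normalization methods applied to a defect count.
def linear_norm(x, lo, hi):
    # min-max scaling over the observed defect counts
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def logistic_norm(x):
    return 1.0 / (1.0 + math.exp(-x))

# Linear weighted score of similarity and normalized defects;
# alpha balances the two factors.
def tdselector_score(similarity, norm_defects, alpha=0.5):
    return alpha * similarity + (1.0 - alpha) * norm_defects
```

Under these assumptions, the "Euclidean + Linear" combination would score a candidate instance as `tdselector_score(euclidean_similarity(t, c), linear_norm(defects, min_d, max_d))`.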
6.3. Answer to RQ3. A comparison between our approach and two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is on average better than the two baseline methods, indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. The TDSelector performs better than baseline1 in 14 out of 15 datasets, and it has an advantage over baseline2 in 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason in terms of the defective instances of the simplified training dataset obtained from different methods. Table 10 shows that the proportion of defective instances in each simplified training dataset is very close. However, considering instances with more than one defect among these defective instances, our method returns more of them, and the ratio is approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects was more fully utilized because of the instances with more defects. This result further validates that selecting training data with respect to defects is valuable.
Besides, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.
In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.
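The growth rates reported in Table 9 are, presumably, relative AUC improvements over each baseline, which can be computed as:

```python
# Relative AUC growth rate (%) of a predictor over a baseline,
# matching the percentage columns of Table 9 (assumed formula).
def growth_rate(auc_new, auc_baseline):
    return 100.0 * (auc_new - auc_baseline) / auc_baseline
```

For example, a baseline AUC of 0.330 (Velocity, baseline1) with a 65.2% growth rate corresponds to a TDSelector AUC of roughly 0.545.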
7. Discussion
7.1. Impact of Top-k on Prediction Results. The parameter k determines the number of the nearest training instances of each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step value of 1. As shown in Figure 5, for the three combinations in question, selecting the k-nearest training instances (e.g., k ≤ 5) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.
Interestingly, for the combinations of "Euclidean + Linear" and "Cosine + Linear", a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. For each similarity index, the five values correspond to the linear, logistic, square root, logarithmic, and inverse cotangent normalization methods, in that order ("-" marks the self-comparison).

Combination          | Cosine similarity                     | Euclidean distance                     | Manhattan distance
Cosine + Linear      | -, 0.018, 0.084, 0.000, 0.116         | -0.049, -0.036, -0.004, -0.013, -0.009 | 0.138, 0.049, 0.164, 0.178, 0.169
Euclidean + Linear   | 0.049, 0.102, 0.111, 0.062, 0.164     | -, 0.036, 0.040, 0.058, 0.089          | 0.209, 0.102, 0.249, 0.276, 0.244
Manhattan + Logistic | -0.049, -0.022, 0.022, -0.013, 0.111  | -0.102, -0.076, -0.080, -0.049, -0.031 | 0.053, -, 0.124, 0.151, 0.147
Table 9: A comparison between our approach and two baseline methods for the datasets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all the three methods in question. The last two columns give the AUC growth rate (%) of TDSelector ("Euclidean + Linear") over each baseline.

Test set | Baseline1 | Baseline2 | vs. Baseline1 (%) | vs. Baseline2 (%)
Ant      | 0.785     | 0.803     | 1.3               | -1.0
Xalan    | 0.657     | 0.675     | 10.7              | 7.7
Camel    | 0.595     | 0.624     | 0.5               | -4.2
Ivy      | 0.789     | 0.802     | 4.7               | 3.0
Jedit    | 0.694     | 0.782     | 14.3              | 1.4
Lucene   | 0.608     | 0.701     | -0.8              | -14.0
Poi      | 0.691     | 0.789     | 3.3               | -9.5
Synapse  | 0.740     | 0.748     | 2.3               | 1.2
Velocity | 0.330     | 0.331     | 65.2              | 64.7
Xerces   | 0.714     | 0.753     | 8.5               | 2.9
Eclipse  | 0.706     | 0.744     | 10.2              | 4.6
Equinox  | 0.587     | 0.720     | 23.1              | 0.3
Lucene2  | 0.705     | 0.724     | 2.5               | -0.2
Mylyn    | 0.631     | 0.646     | 9.3               | 6.8
Pde      | 0.678     | 0.737     | 10.4              | 1.5
Avg      | 0.663     | 0.705     | 10.6              | 4.3

Cliff's delta (δ): baseline1 vs. TDSelector, -0.409; baseline2 vs. TDSelector, -0.009.
Table 10: Comparison of the defective instances of the simplified training datasets obtained from different methods on the Velocity project.

Method     | defect instances / instances | instances (defects > 1) / defect instances
Baseline1  | 0.375                        | 0.247
Baseline2  | 0.393                        | 0.291
TDSelector | 0.376                        | 0.487
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE, shown for the "Manhattan + Logistic", "Euclidean + Linear", and "Cosine + Linear" combinations.
when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, this best result is still worse than those of the other two combinations.
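The Top-k step itself can be sketched as follows: each test instance contributes its k nearest candidate training instances, and the picks are merged without duplicates. The function names and the use of Euclidean distance here are illustrative, not taken from the authors' implementation.

```python
import math

def k_nearest(test_instance, candidates, k=10):
    # Rank candidate training instances by their distance to one test instance.
    def dist(c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, test_instance)))
    return sorted(candidates, key=dist)[:k]

def build_candidate_tds(test_set, candidates, k=10):
    # Union of each test instance's k nearest candidates, duplicates dropped.
    picked = []
    for t in test_set:
        for c in k_nearest(t, candidates, k):
            if c not in picked:
                picked.append(c)
    return picked
```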
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM, shown for the same three combinations.
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of high quality training data in terms of AUC. We also want to know whether the direct selection of defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), for the 15 releases, most of them contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1/40 (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs: (a) is shown from the viewpoint of a single dataset (release), while (b) is shown from the viewpoint of the whole dataset used in our experiments.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
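The two-stage selection of TDSelector-3 described above can be sketched as below; the dictionary layout of a candidate instance, the `score_fn` argument, and the `top_n` cutoff are our assumptions for illustration.

```python
def tdselector_3(candidates, score_fn, threshold=3, top_n=100):
    # Stage 1: instances with at least `threshold` bugs go straight in.
    direct = [c for c in candidates if c["defects"] >= threshold]
    # Stage 2: the rest are ranked by the TDSelector score.
    rest = [c for c in candidates if c["defects"] < threshold]
    ranked = sorted(rest, key=score_fn, reverse=True)[:top_n]
    # Merge both parts, removing redundant instances by name.
    seen, final_tds = set(), []
    for c in direct + ranked:
        if c["name"] not in seen:
            seen.add(c["name"])
            final_tds.append(c)
    return final_tds
```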
Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold for the number of bugs in each training instance (namely, three in this paper) at the first stage; our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change if we used different settings of the above factors.
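The weighting-factor search mentioned above (Algorithm 1 sweeping α with a step size of 0.1) can be sketched as a simple grid search. Here `evaluate_auc` stands in for training an LR predictor on the data selected under a given α and measuring its AUC; it is supplied by the caller and is not part of the paper's artifact.

```python
def best_alpha(evaluate_auc, step=0.1):
    # Sweep alpha from 0.0 to 1.0 inclusive and keep the AUC-maximizing value.
    best_a, best_auc = 0.0, float("-inf")
    steps = int(round(1.0 / step))
    for i in range(steps + 1):
        alpha = i * step
        auc = evaluate_auc(alpha)
        if auc > best_auc:
            best_a, best_auc = alpha, auc
    return best_a, best_auc
```

A finer `step` trades extra training runs for a more precise α, which is exactly the calculation-time trade-off noted above.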
Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized the Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria that were initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ δ < 0.33) or very small (0.008 ≤ δ < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs considerably better than baseline1, indicated by |δ| = 0.409 > 0.33.
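Cliff's delta, as used throughout these comparisons, can be computed directly from its standard definition; the magnitude labels below follow the thresholds cited above, and this sketch is ours rather than the calculator of [42].

```python
def cliffs_delta(xs, ys):
    # delta = (#(x > y) - #(x < y)) / (m * n) over all pairs.
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    # Thresholds as stated in the text (Cohen/Sawilowsky).
    d = abs(delta)
    if d < 0.147:
        return "very small"
    if d < 0.33:
        return "small"
    return "medium or large"
```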
Threats to external validity emphasize the generalizability of the obtained results. First, the selection of experimental datasets, namely AEEEM and PROMISE, is the main threat to the validity of the results of our study. All of the 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector for the "Euclidean + Linear", "Cosine + Linear", and "Manhattan + Logistic" combinations. The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from those defect datasets available on the Internet, so as to improve the performance of cross-project defect predictions. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with those similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between the similarity of test instances with training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred way for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method according to a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
18 Mathematical Problems in Engineering
[25] Z Li X-Y Jing X Zhu H Zhang B Xu and S Ying ldquoOnthe Multiple Sources and Privacy Preservation Issues for Het-erogeneous Defect Predictionrdquo IEEE Transactions on SoftwareEngineering 1 page 2017
[26] Z Li X-Y Jing F Wu X Zhu B Xu and S Ying ldquoCost-sensitive transfer kernel canonical correlation analysis for het-erogeneous defect predictionrdquoAutomated Software Engineeringpp 1ndash45 2017
[27] Z Li X Jing X Zhu and H Zhang ldquoHeterogeneous DefectPrediction Through Multiple Kernel Learning and EnsembleLearningrdquo in Proceedings of the 2017 IEEE International Confer-ence on Software Maintenance and Evolution (ICSME) pp 91ndash102 Shanghai September 2017
[28] Z He F Peters T Menzies and Y Yang ldquoLearning from open-source projects An empirical study on defect predictionrdquo inProceedings of the 2013 ACM IEEE International Symposium onEmpirical Software Engineering and Measurement ESEM 2013pp 45ndash54 usa October 2013
[29] P He B Li D Zhang and YMa Simplification of Training Datafor Cross-Project Defect Prediction arXiv preprint 2014
[30] F Porto and A Simao ldquoFeature Subset Selection and InstanceFiltering for Cross-project Defect Prediction-Classification andRankingrdquo CLEI Electronic Journal vol 19 no 4 p 17 2016
[31] X-Y Jing FWu X Dong and B Xu ldquoAn Improved SDA BasedDefect Prediction Framework for Both Within-Project andCross-Project Class-Imbalance Problemsrdquo IEEE Transactionson Software Engineering vol 43 no 4 pp 321ndash339 2017
[32] D Ryu J-I Jang and J Baik ldquoA hybrid instance selection usingnearest-neighbor for cross-project defect predictionrdquo Journal ofComputer Science and Technology vol 30 no 5 pp 969ndash9802015
[33] W N Poon K E Bennin J Huang P Phannachitta and J WKeung ldquoCross-project defect prediction using a credibilitytheory based naive bayes classifierrdquo in Proceedings of the 17thIEEE International Conference on Software Quality Reliabilityand Security QRS 2017 pp 434ndash441 cze July 2017
[34] M DrsquoAmbros M Lanza and R Robbes ldquoEvaluating defect pre-diction approaches a benchmark and an extensive comparisonrdquoEmpirical Software Engineering vol 17 no 4-5 pp 531ndash577 2012
[35] B Turhan A Tosun Misirli and A Bener ldquoEmpirical evalua-tion of the effects of mixed project data on learning defect pre-dictorsrdquo Information and Software Technology vol 55 no 6 pp1101ndash1118 2013
[36] J Han and M Kamber Data Mining Concepts and TechniquesElsevierMorgan KaufmannWalthamMass USA 3rd edition2012
[37] A Graf and S Borer ldquoNormalization in Support Vector Ma-chinesrdquo in Prooceedings of Dagm-Symposium on Pattern Recog-nition vol 2191 pp 277ndash282 Springer-Verlag 2001 no 7
[38] S B Kotsiantis D Kanellopoulos and P E Pintelas ldquoDataPreprocessing for Supervised Learningrdquo Enformatika vol 1 no1 pp 111ndash117 2006
[39] J Nam S J Pan and S Kim ldquoTransfer defect learningrdquo inProceedings of the 2013 35th International Conference on Soft-ware Engineering ICSE 2013 pp 382ndash391 usa May 2013
[40] M Jureczko andDD Spinellis ldquoUsingObject-OrientedDesignMetrics to Predict Software Defectsrdquo in Proceedings of theInternational Conference on Dependability of Computer SystemMonographs of System Dependability pp 69ndash81 2010
[41] MDrsquoAmbrosM Lanza and R Robbes ldquoAn extensive compari-son of bug prediction approachesrdquo inProceedings of the 7th IEEE
Working Conference on Mining Software Repositories (MSR rsquo10)pp 31ndash41 IEEE May 2010
[42] G E Macbeth E Razumiejczyk and R D Ledesma ldquoCliff rsquosdelta calculator A non-parametric effect size program for twogroups of observationsrdquo Universitas Psychologica vol 10 no 2pp 545ndash555 2011
[43] T Lee D G Han S Kim and H P In ldquoMicro interaction met-rics for defect predictionrdquo in Proceedings of the 19th ACMSIGSOFT Symposium on Foundations of Software EngineeringSIGSOFTFSErsquo11 pp 311ndash321 hun September 2011
[44] A Meneely L Williams W Snipes and J Osborne ldquoPredictingfailures with developer networks and social network analysisrdquoin Proceedings of the 16th ACM SIGSOFT International Sympo-sium on Foundations of Software Engineering (SIGSOFT rsquo08) pp13ndash23 ACM November 2008
[45] R Moser W Pedrycz and G Succi ldquoA Comparative analysisof the efficiency of change metrics and static code attributes fordefect predictionrdquo in Proceedings of the 30th International Con-ference on Software Engineering 2008 ICSErsquo08 pp 181ndash190 deuMay 2008
[46] E Shihab A Mockus Y Kamei B Adams and A E HassanldquoHigh-impact defects A study of breakage and surprise defectsrdquoin Proceedings of the 19th ACM SIGSOFT Symposium on Foun-dations of Software Engineering SIGSOFTFSErsquo11 pp 300ndash310hun September 2011
[47] T Fawcett ldquoAn introduction to ROC analysisrdquo Pattern Recogni-tion Letters vol 27 no 8 pp 861ndash874 2006
[48] E Giger M DrsquoAmbros M Pinzger and H C Gall ldquoMethod-level bug predictionrdquo in Proceedings of the 6th ACM-IEEEInternational Symposium on Empirical Software Engineering andMeasurement ESEM 2012 pp 171ndash180 swe September 2012
[49] Q Song Z Jia M Shepperd S Ying and J Liu ldquoA generalsoftware defect-proneness prediction frameworkrdquo IEEE Trans-actions on Software Engineering vol 37 no 3 pp 356ndash370 2011
[50] P C Cozby ldquoMethods in behavioral researchrdquo in McGraw-HillHigher Education 2011
[51] W H Kruskal and W A Wallis ldquoUse of ranks in one-criterionvariance analysisrdquo Journal of the American Statistical Associa-tion vol 47 no 260 pp 583ndash621 1952
[52] S S Sawilowsky ldquoNew Effect Size Rules of Thumbrdquo Journal ofModern Applied Statistical Methods vol 8 no 2 pp 597ndash5992009
[53] S S Choi S H Cha and C C Tappert ldquoA Survey of BinarySimilarity and Distance Measuresrdquo Journal of Systemics Cyber-netics Informatics vol 8 no 1 pp 43ndash48 2010
Table 8: Pairwise comparisons between a given combination and each of the 15 combinations in terms of Cliff's delta (δ) effect size. [Rows: Cosine + Linear, Euclidean + Linear, and Manhattan + Logistic; columns: the 15 combinations of the three similarity measures (Cosine similarity, Euclidean distance, Manhattan distance) with the five normalization methods (Linear, Logistic, Square root, Logarithmic, Inverse cotangent).]
14 Mathematical Problems in Engineering
Table 9: A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question. The two improvement columns report the AUC gain of TDSelector ("Euclidean + Linear") relative to each baseline.

| Test set | Baseline1 | Baseline2 | Improvement over Baseline1 (%) | Improvement over Baseline2 (%) |
|----------|-----------|-----------|-------------------------------|-------------------------------|
| Ant      | 0.785     | 0.803     | 1.3                           | −1.0                          |
| Xalan    | 0.657     | 0.675     | 10.7                          | 7.7                           |
| Camel    | 0.595     | 0.624     | 0.5                           | −4.2                          |
| Ivy      | 0.789     | 0.802     | 4.7                           | 3.0                           |
| Jedit    | 0.694     | 0.782     | 14.3                          | 1.4                           |
| Lucene   | 0.608     | 0.701     | −0.8                          | −14.0                         |
| Poi      | 0.691     | 0.789     | 3.3                           | −9.5                          |
| Synapse  | 0.740     | 0.748     | 2.3                           | 1.2                           |
| Velocity | 0.330     | 0.331     | 65.2                          | 64.7                          |
| Xerces   | 0.714     | 0.753     | 8.5                           | 2.9                           |
| Eclipse  | 0.706     | 0.744     | 10.2                          | 4.6                           |
| Equinox  | 0.587     | 0.720     | 23.1                          | 0.3                           |
| Lucene2  | 0.705     | 0.724     | 2.5                           | −0.2                          |
| Mylyn    | 0.631     | 0.646     | 9.3                           | 6.8                           |
| Pde      | 0.678     | 0.737     | 10.4                          | 1.5                           |
| Avg      | 0.663     | 0.705     | 10.6                          | 4.3                           |

Cliff's delta: Baseline1 vs. TDSelector, δ = −0.409; Baseline2 vs. TDSelector, δ = −0.009.
Table 10: Comparison of the defect instances of the simplified training dataset obtained from different methods on the Velocity project.

| Method     | defect instances / instances | instances(defects > 1) / defect instances |
|------------|------------------------------|-------------------------------------------|
| Baseline1  | 0.375                        | 0.247                                     |
| Baseline2  | 0.393                        | 0.291                                     |
| TDSelector | 0.376                        | 0.487                                     |
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE. [Line plot of AUC (0.5 to 0.9) against k (1 to 10) for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear.]
when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM. [Line plot of AUC (0.5 to 0.9) against k (1 to 10) for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear.]
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of high-quality training data in terms of AUC. We also want to know whether directly selecting defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), for the 15 releases, most instances contain no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total number of instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs. (a) is shown from the viewpoint of a single dataset (release), plotting for each of the 15 releases the percentage of instances with defects < 2 and with defects < 3; (b) is shown from the viewpoint of the whole dataset used in our experiments, plotting the percentage of instances against the number of defects (4 to 74).
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly at the first stage according to a threshold for the number of bugs in each training instance (namely, three in this paper); our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
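The two-stage selection described above can be sketched as follows. The scoring of the second stage follows the paper's description of a linear weighted combination of similarity and normalized defect counts; however, the function name `td_selector_3`, the inverse-distance similarity, the top-k cutoff, and the default values of `alpha`, `k`, and `bug_threshold` are illustrative assumptions rather than the authors' actual implementation.

```python
import numpy as np

def td_selector_3(candidates, bugs, test_set, k=10, alpha=0.6, bug_threshold=3):
    """Two-stage training-data selection (a sketch of TDSelector-3).

    candidates : (n, m) metric matrix of candidate training instances
    bugs       : (n,) number of bugs recorded for each candidate
    test_set   : (t, m) metric matrix of the target-project instances

    Stage 1 takes every instance with >= bug_threshold bugs directly;
    stage 2 scores the remaining instances by a linear weighted sum of
    similarity and linearly normalized defect counts (assumed form).
    """
    n = len(candidates)
    direct = np.where(bugs >= bug_threshold)[0]          # stage 1
    rest = np.setdiff1d(np.arange(n), direct)

    # similarity: inverse of mean Euclidean distance to the test set (assumption)
    dists = np.linalg.norm(candidates[rest, None, :] - test_set[None, :, :], axis=2)
    sim = 1.0 / (1.0 + dists.mean(axis=1))

    # linear (min-max) normalization of defect counts
    b = bugs[rest].astype(float)
    norm_bugs = (b - b.min()) / (b.max() - b.min() + 1e-12)

    score = alpha * sim + (1.0 - alpha) * norm_bugs      # stage 2
    top = rest[np.argsort(score)[::-1][:k]]
    return np.union1d(direct, top)                        # final TDS, duplicates removed
```

In this sketch the final TDS is the union of the directly chosen buggy instances and the top-scored remainder, mirroring the "two parts" described above.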
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are indeed aware that the results of our study would change if we used different settings of the above factors.
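Two of the internal-validity factors above, Z-score preprocessing and the 0.1-step search for the weighting factor α, can be sketched as follows. The function names and the `evaluate_auc` callback (assumed to train a predictor on the data selected with a given α and return its AUC) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def z_score(X):
    """Column-wise Z-score normalization: zero mean, unit variance per metric."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard against constant columns
    return (X - mu) / sigma

def best_alpha(evaluate_auc, step=0.1):
    """Sweep alpha over [0, 1] with the given step and return the value
    that maximizes the AUC reported by evaluate_auc."""
    alphas = np.arange(0.0, 1.0 + 1e-9, step)
    aucs = [evaluate_auc(a) for a in alphas]
    return alphas[int(np.argmax(aucs))]
```

As the text notes, a finer `step` would raise the cost of the sweep linearly, since one predictor must be trained and evaluated per candidate α.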
Threats to statistical conclusion validity focus on whether conclusions about the relationship among variables based on the experimental data are correct or reasonable [50]. In addition to the mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria that were initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect-size values in this paper are small (0.147 ≤ |δ| < 0.33) or very small (0.008 ≤ |δ| < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs better than baseline1, as indicated by |δ| = 0.409 > 0.33.
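Cliff's delta, used throughout this comparison, can be computed directly from its definition: the probability that a value drawn from one sample exceeds a value drawn from the other, minus the reverse probability. The magnitude labels below follow the commonly cited thresholds consistent with the ranges quoted above; the helper names are illustrative.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size between two samples (no ties correction)."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    """Map |delta| to a qualitative label: <0.147 very small (negligible),
    <0.33 small, <0.474 medium, otherwise large."""
    d = abs(delta)
    if d < 0.147:
        return "very small"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

For instance, |δ| = 0.409 between baseline1 and TDSelector falls above the 0.33 boundary, which is why the text treats that difference as meaningful while most pairwise combination differences are not.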
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (in addition to AEEEM and PROMISE) is the main threat to the validity of the results of our study. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector. [Three bar plots, one for each of Cosine + Linear, Euclidean + Linear, and Manhattan + Logistic, showing the AUC (0.50 to 0.85) of TDSelector and TDSelector-3 on each of the 15 test sets.] The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
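The three similarity indexes used in this paper can be sketched as follows. Converting the two distance measures into similarities via 1/(1 + d) is an illustrative assumption, since the paper's exact conversion is not reproduced here; the function names are likewise hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two metric vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_sim(a, b):
    """Euclidean distance mapped to (0, 1]: identical vectors score 1."""
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def manhattan_sim(a, b):
    """Manhattan (L1) distance mapped to (0, 1]."""
    return 1.0 / (1.0 + np.abs(a - b).sum())
```

Swapping in another index (e.g., Pearson correlation) would only require replacing one of these functions, which is precisely the kind of substitution whose effect on TDSelector remains untested.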
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, in order to improve the performance of cross-project defect prediction. In summary, the study has been conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between defects and the similarity of test instances to training instances, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method in a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to discuss the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC-FSE '09, pp. 91–100, The Netherlands, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories, MSR 2013, pp. 409–418, USA, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the "imprecision" of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering, SEKE 2015, pp. 397–402, USA, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference, COMPSAC 2017, pp. 51–56, Italy, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories, MSR 2014, pp. 182–191, India, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2015, pp. 508–519, Italy, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2015, pp. 496–507, Italy, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 1 page, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2013, pp. 45–54, USA, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security, QRS 2017, pp. 434–441, Czech Republic, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM-Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 2013 35th International Conference on Software Engineering, ICSE 2013, pp. 382–391, USA, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering, SIGSOFT/FSE '11, pp. 311–321, Hungary, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering, ICSE '08, pp. 181–190, Germany, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering, SIGSOFT/FSE '11, pp. 300–310, Hungary, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2012, pp. 171–180, Sweden, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
14 Mathematical Problems in Engineering
Table 9: A comparison between our approach and two baseline methods for the data sets from PROMISE and AEEEM. The comparison is conducted based on the best prediction results of all three methods in question. The two rightmost columns give the AUC improvement (%) of TDSelector (Euclidean + Linear) over each baseline.

Test set | Baseline1 | Baseline2 | vs. Baseline1 (%) | vs. Baseline2 (%)
Ant      | 0.785 | 0.803 |  1.3 |  -1.0
Xalan    | 0.657 | 0.675 | 10.7 |   7.7
Camel    | 0.595 | 0.624 |  0.5 |  -4.2
Ivy      | 0.789 | 0.802 |  4.7 |   3.0
Jedit    | 0.694 | 0.782 | 14.3 |   1.4
Lucene   | 0.608 | 0.701 | -0.8 | -14.0
Poi      | 0.691 | 0.789 |  3.3 |  -9.5
Synapse  | 0.740 | 0.748 |  2.3 |   1.2
Velocity | 0.330 | 0.331 | 65.2 |  64.7
Xerces   | 0.714 | 0.753 |  8.5 |   2.9
Eclipse  | 0.706 | 0.744 | 10.2 |   4.6
Equinox  | 0.587 | 0.720 | 23.1 |   0.3
Lucene2  | 0.705 | 0.724 |  2.5 |  -0.2
Mylyn    | 0.631 | 0.646 |  9.3 |   6.8
Pde      | 0.678 | 0.737 | 10.4 |   1.5
Avg      | 0.663 | 0.705 | 10.6 |   4.3

Cliff's δ: Baseline1 vs. TDSelector = -0.409; Baseline2 vs. TDSelector = -0.009.
Table 10: Comparison of the defect instances in the simplified training datasets obtained from different methods on the Velocity project.

Method     | defect instances / instances | instances (defects > 1) / defect instances
Baseline1  | 0.375 | 0.247
Baseline2  | 0.393 | 0.291
TDSelector | 0.376 | 0.487
Figure 5: The impact of k on prediction results for the 10 test sets from PROMISE (AUC vs. k ∈ [1, 10], for the combinations Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
when k is equal to 10. The combination of "Manhattan + Logistic", by contrast, achieves its best result when k is set to 7. Even so, that best result is still worse than those of the other two combinations.
Figure 6: The impact of k on prediction results for the 5 test sets from AEEEM (AUC vs. k ∈ [1, 10], for Manhattan + Logistic, Euclidean + Linear, and Cosine + Linear).
7.2. Selecting Instances with More Bugs Directly as Training Data. Our experimental results have validated the impact of defects on the selection of quality training data in terms of AUC. We also want to know whether directly selecting the defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.
According to Figure 7(a), most of the 15 releases contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total number of instances is less than 1/40 (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as
Figure 7: Percentage of defective instances with different numbers of bugs. (a) is shown from the viewpoint of a single dataset (release), giving the percentage of instances with defects < 2 and defects < 3 for each of the 15 releases; (b) is shown from the viewpoint of the whole dataset used in our experiments, giving the percentage of instances per defect count.
TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS, after removing redundant ones.

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces, collected from PROMISE, TDSelector outperforms TDSelector-3 in all three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with "Cosine + Linear" reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for "Cosine + Linear", "Euclidean + Linear", and "Manhattan + Logistic" are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly at the first stage, according to a threshold for the number of bugs in each training instance (namely, three in this paper); our approach is then applied to the remaining TDS. Note that an automatic optimization method for such a threshold for TDSelector will be investigated in our future work.
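The two-stage selection used by TDSelector-3 can be sketched as follows. This is a minimal illustration under stated assumptions: we take the scoring function of (2) to have the form α·similarity + (1 − α)·normalized defects, with Euclidean distance converted to a similarity via 1/(1 + d) and linear (min-max) normalization of defect counts. The function names and the distance-to-similarity conversion are ours, not the paper's implementation.

```python
import math

def euclidean_similarity(a, b):
    """Similarity derived from Euclidean distance (larger = more similar)."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

def linear_normalize(values):
    """Linear (min-max) normalization to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def tdselector3(candidates, test_center, alpha=0.5, bug_threshold=3, top_k=None):
    """Two-stage training-data selection (sketch of TDSelector-3).

    candidates: list of (metric_vector, n_defects) pairs.
    Stage 1: instances with >= bug_threshold bugs are taken directly.
    Stage 2: the rest are ranked by alpha*similarity + (1 - alpha)*norm(defects)
             and the top_k highest-scoring ones are kept.
    """
    direct = [c for c in candidates if c[1] >= bug_threshold]
    rest = [c for c in candidates if c[1] < bug_threshold]
    sims = [euclidean_similarity(m, test_center) for m, _ in rest]
    defs = linear_normalize([d for _, d in rest])
    ranked = sorted(zip(rest, sims, defs),
                    key=lambda t: alpha * t[1] + (1 - alpha) * t[2],
                    reverse=True)
    return direct + [c for c, _, _ in ranked[:top_k]]
```

In practice, duplicates between the two stages would also be removed, as the text notes.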
7.3. Threats to Validity. In this study we obtained several interesting results, but potential threats to the validity of our work remain.
Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the Z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 calculates the maximum value of AUC; there is no doubt that a smaller step size will result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured by the tool Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study would change under different settings of the above factors.
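The step-wise search over the weighting factor α described above can be sketched as a simple grid scan. Here `evaluate_auc` is a hypothetical hook standing in for "train a predictor on the data selected with this α and measure its AUC" (the train-and-evaluate loop of Algorithm 1); it is not an API from the paper's tooling.

```python
def best_alpha(evaluate_auc, step=0.1):
    """Grid-scan the weighting factor alpha over [0, 1] in fixed steps.

    evaluate_auc: callable alpha -> AUC of the predictor trained on the
    data selected with that alpha (hypothetical stand-in, see lead-in).
    Returns the best alpha and its AUC.
    """
    best_a, best_auc = 0.0, float("-inf")
    a = 0.0
    while a <= 1.0 + 1e-9:
        alpha = round(a, 10)          # suppress float accumulation noise
        auc = evaluate_auc(alpha)
        if auc > best_auc:
            best_a, best_auc = alpha, auc
        a += step
    return best_a, best_auc
```

A smaller `step` widens the search at the cost of proportionally more training runs, which is exactly the trade-off the paragraph notes.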
Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables, drawn from the experimental data, are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis test methods such as the Kruskal–Wallis H test [51], to compare the results of different methods, because there are only 15 datasets collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper are small (0.147 ≤ |δ| < 0.33) or very small (0.008 ≤ |δ| < 0.147). This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, our method clearly performs better than Baseline1, as indicated by |δ| = 0.409 > 0.33.
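Cliff's delta itself is straightforward to compute. The following sketch is ours (not the calculator of [42]); it pairs the statistic with the magnitude thresholds quoted above.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs of xs and ys."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    """Label |delta| using the Cohen/Sawilowsky-style thresholds in the text."""
    d = abs(delta)
    if d < 0.147:
        return "very small"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

For example, the reported Baseline1 vs. TDSelector value of |δ| = 0.409 falls above the "small" band, consistent with the claim that the difference is meaningful.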
Threats to external validity emphasize the generalization of the obtained results. First, the selection of experimental datasets (beyond AEEEM and PROMISE) is the main threat to the validity of our results. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five
Figure 8: A comparison of prediction performance between TDSelector-3 and the corresponding TDSelector, plotted for each of the three combinations (Cosine + Linear, Euclidean + Linear, and Manhattan + Logistic) over the 15 test sets (Ant, Xalan, Camel, Ivy, Jedit, Lucene, Poi, Synapse, Velocity, Xerces, Eclipse, Equinox, Lucene2, Mylyn, Pde). The last column in each of the three plots represents the average AUC value.
normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method for other similarity indexes (such as the Pearson correlation coefficient and the Mahalanobis distance [53]) and other normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, implying that the generalizability of our method for other classification algorithms remains unclear.
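For concreteness, the three similarity indexes used in this paper can be sketched as follows. Converting the Euclidean and Manhattan distances to similarities via 1/(1 + d) is our assumption; the paper's exact conversion is not shown in this excerpt.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between metric vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def euclidean_similarity(a, b):
    """Similarity from Euclidean (L2) distance, mapped into (0, 1]."""
    return 1.0 / (1.0 + math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

def manhattan_similarity(a, b):
    """Similarity from Manhattan (L1) distance, mapped into (0, 1]."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))
```

Swapping in another index (e.g., a Pearson- or Mahalanobis-based similarity) would only require replacing one of these functions, which is why the generalizability question above is largely an empirical one.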
8. Conclusion and Future Work
This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect prediction. In summary, the study was conducted on 14 open-source projects and consists of (1) an empirical validation of the usability of the number of defects that an instance includes for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.
Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. With a rational balance between defects and the similarity of test instances to training instances, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built by using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. In addition, our results also demonstrate the effectiveness of the proposed method in a comparison with the baseline methods in the context of M2O in CPDP scenarios. Hence, we believe that our approach can be helpful for developers when they are
required to build suitable predictors quickly for their new projects, because one of our interesting findings is that those candidate instances with more bugs can be chosen directly as training instances.
Our future work mainly includes two aspects. On the one hand, we plan to validate the generalizability of our study with more defect data from projects written in different languages. On the other hand, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Last but not least, we also plan to explore the possibility of considering not only the number of defects but also time variables, such as bug-fixing time, for training data selection.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei Province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, Netherlands, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, USA, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, USA, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, Italy, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naive Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, India, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, Italy, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, Italy, September 2015.
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, 1 page, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, USA, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction: classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, Czech Republic, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, no. 7, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 2013 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, USA, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, Hungary, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, Germany, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, Hungary, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, Sweden, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.
[5] F Peters T Menzies and A Marcus ldquoBetter cross companydefect predictionrdquo in Proceedings of the 10th InternationalWorking Conference onMining Software Repositories MSR 2013pp 409ndash418 usa May 2013
[6] F RahmanD Posnett andPDevanbu ldquoRecalling the ldquoimpreci-sionrdquo of cross-project defect predictionrdquo inProceedings of the theACM SIGSOFT 20th International Symposium p 1 Cary NorthCarolina November 2012
[7] SHerbold ldquoTraining data selection for cross-project defect pre-dictionrdquo in Proceedings of the the 9th International Conferencepp 1ndash10 Baltimore Maryland October 2013
[8] T M Khoshgoftaar E B Allen R Halstead G P Trio and RM Flass ldquoUsing process history to predict software qualityrdquoTheComputer Journal vol 31 no 4 pp 66ndash72 1998
[9] T J Ostrand and E J Weyuker ldquoThe distribution of faults ina large industrial software systemrdquo ACM SIGSOFT SoftwareEngineering Notes vol 27 no 4 p 55 2002
[10] S Kim T Zimmermann E J Whitehead Jr and A ZellerldquoPredicting faults from cached historyrdquo in Proceedings of the29th International Conference on Software Engineering (ICSErsquo07) pp 489ndash498 IEEE Computer Society Washington DCUSA May 2007
[11] T Gyimothy R Ferenc and I Siket ldquoEmpirical validationof object-oriented metrics on open source software for faultpredictionrdquo IEEE Transactions on Software Engineering vol 31no 10 pp 897ndash910 2005
[12] B Turhan T Menzies A B Bener and J Di Stefano ldquoOn therelative value of cross-company and within-company data fordefect predictionrdquo Empirical Software Engineering vol 14 no5 pp 540ndash578 2009
[13] M Chen and Y Ma ldquoAn empirical study on predicting defectnumbersrdquo in Proceedings of the 27th International Conferenceon Software Engineering andKnowledge Engineering SEKE 2015pp 397ndash402 usa July 2015
[14] C Ni W Liu Q Gu X Chen and D Chen ldquoFeSCH A Fea-ture Selection Method using Clusters of Hybrid-data forCross-Project Defect Predictionrdquo in Proceedings of the 41stIEEE Annual Computer Software and Applications ConferenceCOMPSAC 2017 pp 51ndash56 ita July 2017
[15] P He B Li X Liu J Chen and Y Ma ldquoAn empirical studyon software defect prediction with a simplified metric setrdquoInformation and Software Technology vol 59 pp 170ndash190 2015
[16] T Wang Z Zhang X Jing and L Zhang ldquoMultiple kernelensemble learning for software defect predictionrdquo AutomatedSoftware Engineering vol 23 no 4 pp 1ndash22 2015
[17] J Y He Z P Meng X Chen Z Wang and X Y Fan ldquoSemi-supervised ensemble learning approach for cross-project defectpredictionrdquo Journal of Software Ruanjian Xuebao vol 28 no 6pp 1455ndash1473 2017
[18] D Ryu J-I Jang and J Baik ldquoA transfer cost-sensitive boostingapproach for cross-project defect predictionrdquo Software QualityJournal vol 25 no 1 pp 1ndash38 2015
[19] D Ryu and J Baik ldquoEffective multi-objective naıve Bayes learn-ing for cross-project defect predictionrdquoApplied Soft Computingvol 49 pp 1062ndash1077 2016
[20] Y Li Z Huang Y Wang and B Fang ldquoEvaluating Data Filteron Cross-Project Defect Prediction Comparison and Improve-mentsrdquo IEEE Access vol 5 pp 25646ndash25656 2017
[21] F Zhang A Mockus I Keivanloo and Y Zou ldquoTowards build-ing a universal defect prediction modelrdquo in Proceedings of the11th International Working Conference on Mining Software Re-positories MSR 2014 pp 182ndash191 ind June 2014
[22] P He B Li and Y Ma Towards Cross-Project Defect Predictionwith Imbalanced Feature Sets 2014
[23] J Nam and S Kim ldquoHeterogeneous defect predictionrdquo in Pro-ceedings of the 10th Joint Meeting of the European SoftwareEngineering Conference and the ACM SIGSOFT Symposium onthe Foundations of Software Engineering ESECFSE 2015 pp508ndash519 ita September 2015
[24] X Jing FWu X Dong F Qi and B Xu ldquoHeterogeneous cross-company defect prediction by unifiedmetric representation andCCA-based transfer learningrdquo in Proceedings of the 10th JointMeeting of the European Software Engineering Conference andthe ACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESECFSE 2015 pp 496ndash507 ita September 2015
18 Mathematical Problems in Engineering
[25] Z Li X-Y Jing X Zhu H Zhang B Xu and S Ying ldquoOnthe Multiple Sources and Privacy Preservation Issues for Het-erogeneous Defect Predictionrdquo IEEE Transactions on SoftwareEngineering 1 page 2017
[26] Z Li X-Y Jing F Wu X Zhu B Xu and S Ying ldquoCost-sensitive transfer kernel canonical correlation analysis for het-erogeneous defect predictionrdquoAutomated Software Engineeringpp 1ndash45 2017
[27] Z Li X Jing X Zhu and H Zhang ldquoHeterogeneous DefectPrediction Through Multiple Kernel Learning and EnsembleLearningrdquo in Proceedings of the 2017 IEEE International Confer-ence on Software Maintenance and Evolution (ICSME) pp 91ndash102 Shanghai September 2017
[28] Z He F Peters T Menzies and Y Yang ldquoLearning from open-source projects An empirical study on defect predictionrdquo inProceedings of the 2013 ACM IEEE International Symposium onEmpirical Software Engineering and Measurement ESEM 2013pp 45ndash54 usa October 2013
[29] P He B Li D Zhang and YMa Simplification of Training Datafor Cross-Project Defect Prediction arXiv preprint 2014
[30] F Porto and A Simao ldquoFeature Subset Selection and InstanceFiltering for Cross-project Defect Prediction-Classification andRankingrdquo CLEI Electronic Journal vol 19 no 4 p 17 2016
[31] X-Y Jing FWu X Dong and B Xu ldquoAn Improved SDA BasedDefect Prediction Framework for Both Within-Project andCross-Project Class-Imbalance Problemsrdquo IEEE Transactionson Software Engineering vol 43 no 4 pp 321ndash339 2017
[32] D Ryu J-I Jang and J Baik ldquoA hybrid instance selection usingnearest-neighbor for cross-project defect predictionrdquo Journal ofComputer Science and Technology vol 30 no 5 pp 969ndash9802015
[33] W N Poon K E Bennin J Huang P Phannachitta and J WKeung ldquoCross-project defect prediction using a credibilitytheory based naive bayes classifierrdquo in Proceedings of the 17thIEEE International Conference on Software Quality Reliabilityand Security QRS 2017 pp 434ndash441 cze July 2017
[34] M DrsquoAmbros M Lanza and R Robbes ldquoEvaluating defect pre-diction approaches a benchmark and an extensive comparisonrdquoEmpirical Software Engineering vol 17 no 4-5 pp 531ndash577 2012
[35] B Turhan A Tosun Misirli and A Bener ldquoEmpirical evalua-tion of the effects of mixed project data on learning defect pre-dictorsrdquo Information and Software Technology vol 55 no 6 pp1101ndash1118 2013
[36] J Han and M Kamber Data Mining Concepts and TechniquesElsevierMorgan KaufmannWalthamMass USA 3rd edition2012
[37] A Graf and S Borer ldquoNormalization in Support Vector Ma-chinesrdquo in Prooceedings of Dagm-Symposium on Pattern Recog-nition vol 2191 pp 277ndash282 Springer-Verlag 2001 no 7
[38] S B Kotsiantis D Kanellopoulos and P E Pintelas ldquoDataPreprocessing for Supervised Learningrdquo Enformatika vol 1 no1 pp 111ndash117 2006
[39] J Nam S J Pan and S Kim ldquoTransfer defect learningrdquo inProceedings of the 2013 35th International Conference on Soft-ware Engineering ICSE 2013 pp 382ndash391 usa May 2013
[40] M Jureczko andDD Spinellis ldquoUsingObject-OrientedDesignMetrics to Predict Software Defectsrdquo in Proceedings of theInternational Conference on Dependability of Computer SystemMonographs of System Dependability pp 69ndash81 2010
[41] MDrsquoAmbrosM Lanza and R Robbes ldquoAn extensive compari-son of bug prediction approachesrdquo inProceedings of the 7th IEEE
Working Conference on Mining Software Repositories (MSR rsquo10)pp 31ndash41 IEEE May 2010
[42] G E Macbeth E Razumiejczyk and R D Ledesma ldquoCliff rsquosdelta calculator A non-parametric effect size program for twogroups of observationsrdquo Universitas Psychologica vol 10 no 2pp 545ndash555 2011
[43] T Lee D G Han S Kim and H P In ldquoMicro interaction met-rics for defect predictionrdquo in Proceedings of the 19th ACMSIGSOFT Symposium on Foundations of Software EngineeringSIGSOFTFSErsquo11 pp 311ndash321 hun September 2011
[44] A Meneely L Williams W Snipes and J Osborne ldquoPredictingfailures with developer networks and social network analysisrdquoin Proceedings of the 16th ACM SIGSOFT International Sympo-sium on Foundations of Software Engineering (SIGSOFT rsquo08) pp13ndash23 ACM November 2008
[45] R Moser W Pedrycz and G Succi ldquoA Comparative analysisof the efficiency of change metrics and static code attributes fordefect predictionrdquo in Proceedings of the 30th International Con-ference on Software Engineering 2008 ICSErsquo08 pp 181ndash190 deuMay 2008
[46] E Shihab A Mockus Y Kamei B Adams and A E HassanldquoHigh-impact defects A study of breakage and surprise defectsrdquoin Proceedings of the 19th ACM SIGSOFT Symposium on Foun-dations of Software Engineering SIGSOFTFSErsquo11 pp 300ndash310hun September 2011
[47] T Fawcett ldquoAn introduction to ROC analysisrdquo Pattern Recogni-tion Letters vol 27 no 8 pp 861ndash874 2006
[48] E Giger M DrsquoAmbros M Pinzger and H C Gall ldquoMethod-level bug predictionrdquo in Proceedings of the 6th ACM-IEEEInternational Symposium on Empirical Software Engineering andMeasurement ESEM 2012 pp 171ndash180 swe September 2012
[49] Q Song Z Jia M Shepperd S Ying and J Liu ldquoA generalsoftware defect-proneness prediction frameworkrdquo IEEE Trans-actions on Software Engineering vol 37 no 3 pp 356ndash370 2011
[50] P C Cozby ldquoMethods in behavioral researchrdquo in McGraw-HillHigher Education 2011
[51] W H Kruskal and W A Wallis ldquoUse of ranks in one-criterionvariance analysisrdquo Journal of the American Statistical Associa-tion vol 47 no 260 pp 583ndash621 1952
[52] S S Sawilowsky ldquoNew Effect Size Rules of Thumbrdquo Journal ofModern Applied Statistical Methods vol 8 no 2 pp 597ndash5992009
[53] S S Choi S H Cha and C C Tappert ldquoA Survey of BinarySimilarity and Distance Measuresrdquo Journal of Systemics Cyber-netics Informatics vol 8 no 1 pp 43ndash48 2010
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
16 Mathematical Problems in Engineering
Euclidean+Linear
050055060065070075080085
Ant
Xala
n
Cam
el Ivy
Jedi
t
Luce
ne Poi
Syna
pse
Velo
city
Xerc
es
Eclip
se
Equi
nox
Luce
ne
Myl
yn Pde
Avg
AUC
Cosine+Linear
TDSelectorTDSelector-3
050055060065070075080085
Ant
Xala
n
Cam
el Ivy
Jedi
t
Luce
ne Poi
Syna
pse
Velo
city
Xerc
es
Eclip
se
Equi
nox
Luce
ne
Myl
yn Pde
Avg
AUC
TDSelectorTDSelector-3
050055060065070075080085
Ant
Xala
n
Cam
el Ivy
Jedi
t
Luce
ne Poi
Syna
pse
Velo
city
Xerc
es
Eclip
se
Equi
nox
Luce
ne
Myl
yn Pde
Avg
AUC
TDSelectorTDSelector-3
Manhattan+Logistic
Figure 8 A comparison of prediction performance between TDSelector-3 and the corresponding TDSelectorThe last column in each of thethree plots represents the average AUC value
normalization methods when calculating the score of eachcandidate training instance Therefore the generalizability ofour method for other similarity indexes (such as PearsonCorrelation Coefficient and Mahalanobis distance [53]) andnormalizationmethods has yet to be testedThird to compareour method with TCA+ defect predictors used in this paperwere built using LR implying that the generalizability of ourmethod for other classification algorithms remains unclear
8 Conclusion and Future Work
This study aims to train better defect predictors by selectingthe most appropriate training data from those defect datasetsavailable on the Internet to improve the performance ofcross-project defect predictions In summary the study hasbeen conducted on 14 open-source projects and consists of(1) an empirical validation on the usability of the number of
defects that an instance includes for training data selection(2) an in-depth analysis of our method TDSelector withregard to similarity and normalization and (3) a comparisonbetween our proposed method and the benchmark methods
Compared with those similar previous studies the resultsof this study indicate that the inclusion of defects doesimprove the performance of CPDP predictors With a ratio-nal balance between the similarity of test instances withtraining instances and defects TDSelector can effectivelyselect appropriate training instances so that TDSelector-based defect predictors built by using LR achieve better pre-diction performance in terms of AUC More specifically thecombination of Euclidean distance and linear normalizationis the preferred way for TDSelector In addition our resultsalso demonstrate the effectiveness of the proposed methodaccording to a comparison with the baseline methods in thecontext of M2O in CPDP scenarios Hence we believe thatour approach can be helpful for developers when they are
Mathematical Problems in Engineering 17
required to build suitable predictors quickly for their newprojects because one of our interesting findings is that thosecandidate instances with more bugs can be chosen directly astraining instances
Our future work mainly includes two aspects On the onehandwe plan to validate the generalizability of our studywithmore defect data from projects written in different languagesOn the other hand we will focus on more effective hybridmethods based ondifferent selection strategies such as featureselection techniques [32] Last but not least we also plan todiscuss the possibility of considering not only the numberof defects but also time variables for training data selection(such as bug-fixing time)
Conflicts of Interest
The authors declare that there are no conflicts of interest re-garding the publication of this article
Acknowledgments
The authors greatly appreciate Dr Nam and Dr Pan theauthors of [39] for providing them with the TCA sourceprogram and teaching them how to use itThis work was sup-ported by the Natural Science Foundation of Hubei province(no 2016CFB309) and the National Natural Science Founda-tion of China (nos 61272111 61273216 and 61572371)
References
[1] Z He F Shu Y Yang M Li and Q Wang ldquoAn investigationon the feasibility of cross-project defect predictionrdquo AutomatedSoftware Engineering vol 19 no 2 pp 167ndash199 2012
[2] L C Briand W L Melo and J Wust ldquoAssessing the applica-bility of fault-proneness models across object-oriented softwareprojectsrdquo IEEETransactions on Software Engineering vol 28 no7 pp 706ndash720 2002
[3] Y Ma G Luo X Zeng and A Chen ldquoTransfer learning forcross-company software defect predictionrdquo Information andSoftware Technology vol 54 no 3 pp 248ndash256 2012
[4] T Zimmermann N Nagappan H Gall E Giger and BMurphy ldquoCross-project defect prediction A large scale exper-iment on data vs domain vs processrdquo in Proceedings of theJoint 12th European Software Engineering Conference and 17thACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESEC-FSErsquo09 pp 91ndash100 nld August 2009
[5] F Peters T Menzies and A Marcus ldquoBetter cross companydefect predictionrdquo in Proceedings of the 10th InternationalWorking Conference onMining Software Repositories MSR 2013pp 409ndash418 usa May 2013
[6] F RahmanD Posnett andPDevanbu ldquoRecalling the ldquoimpreci-sionrdquo of cross-project defect predictionrdquo inProceedings of the theACM SIGSOFT 20th International Symposium p 1 Cary NorthCarolina November 2012
[7] SHerbold ldquoTraining data selection for cross-project defect pre-dictionrdquo in Proceedings of the the 9th International Conferencepp 1ndash10 Baltimore Maryland October 2013
[8] T M Khoshgoftaar E B Allen R Halstead G P Trio and RM Flass ldquoUsing process history to predict software qualityrdquoTheComputer Journal vol 31 no 4 pp 66ndash72 1998
[9] T J Ostrand and E J Weyuker ldquoThe distribution of faults ina large industrial software systemrdquo ACM SIGSOFT SoftwareEngineering Notes vol 27 no 4 p 55 2002
[10] S Kim T Zimmermann E J Whitehead Jr and A ZellerldquoPredicting faults from cached historyrdquo in Proceedings of the29th International Conference on Software Engineering (ICSErsquo07) pp 489ndash498 IEEE Computer Society Washington DCUSA May 2007
[11] T Gyimothy R Ferenc and I Siket ldquoEmpirical validationof object-oriented metrics on open source software for faultpredictionrdquo IEEE Transactions on Software Engineering vol 31no 10 pp 897ndash910 2005
[12] B Turhan T Menzies A B Bener and J Di Stefano ldquoOn therelative value of cross-company and within-company data fordefect predictionrdquo Empirical Software Engineering vol 14 no5 pp 540ndash578 2009
[13] M Chen and Y Ma ldquoAn empirical study on predicting defectnumbersrdquo in Proceedings of the 27th International Conferenceon Software Engineering andKnowledge Engineering SEKE 2015pp 397ndash402 usa July 2015
[14] C Ni W Liu Q Gu X Chen and D Chen ldquoFeSCH A Fea-ture Selection Method using Clusters of Hybrid-data forCross-Project Defect Predictionrdquo in Proceedings of the 41stIEEE Annual Computer Software and Applications ConferenceCOMPSAC 2017 pp 51ndash56 ita July 2017
[15] P He B Li X Liu J Chen and Y Ma ldquoAn empirical studyon software defect prediction with a simplified metric setrdquoInformation and Software Technology vol 59 pp 170ndash190 2015
[16] T Wang Z Zhang X Jing and L Zhang ldquoMultiple kernelensemble learning for software defect predictionrdquo AutomatedSoftware Engineering vol 23 no 4 pp 1ndash22 2015
[17] J Y He Z P Meng X Chen Z Wang and X Y Fan ldquoSemi-supervised ensemble learning approach for cross-project defectpredictionrdquo Journal of Software Ruanjian Xuebao vol 28 no 6pp 1455ndash1473 2017
[18] D Ryu J-I Jang and J Baik ldquoA transfer cost-sensitive boostingapproach for cross-project defect predictionrdquo Software QualityJournal vol 25 no 1 pp 1ndash38 2015
[19] D Ryu and J Baik ldquoEffective multi-objective naıve Bayes learn-ing for cross-project defect predictionrdquoApplied Soft Computingvol 49 pp 1062ndash1077 2016
[20] Y Li Z Huang Y Wang and B Fang ldquoEvaluating Data Filteron Cross-Project Defect Prediction Comparison and Improve-mentsrdquo IEEE Access vol 5 pp 25646ndash25656 2017
[21] F Zhang A Mockus I Keivanloo and Y Zou ldquoTowards build-ing a universal defect prediction modelrdquo in Proceedings of the11th International Working Conference on Mining Software Re-positories MSR 2014 pp 182ndash191 ind June 2014
[22] P He B Li and Y Ma Towards Cross-Project Defect Predictionwith Imbalanced Feature Sets 2014
[23] J Nam and S Kim ldquoHeterogeneous defect predictionrdquo in Pro-ceedings of the 10th Joint Meeting of the European SoftwareEngineering Conference and the ACM SIGSOFT Symposium onthe Foundations of Software Engineering ESECFSE 2015 pp508ndash519 ita September 2015
[24] X Jing FWu X Dong F Qi and B Xu ldquoHeterogeneous cross-company defect prediction by unifiedmetric representation andCCA-based transfer learningrdquo in Proceedings of the 10th JointMeeting of the European Software Engineering Conference andthe ACM SIGSOFT Symposium on the Foundations of SoftwareEngineering ESECFSE 2015 pp 496ndash507 ita September 2015
18 Mathematical Problems in Engineering
[25] Z Li X-Y Jing X Zhu H Zhang B Xu and S Ying ldquoOnthe Multiple Sources and Privacy Preservation Issues for Het-erogeneous Defect Predictionrdquo IEEE Transactions on SoftwareEngineering 1 page 2017
[26] Z Li X-Y Jing F Wu X Zhu B Xu and S Ying ldquoCost-sensitive transfer kernel canonical correlation analysis for het-erogeneous defect predictionrdquoAutomated Software Engineeringpp 1ndash45 2017
[27] Z Li X Jing X Zhu and H Zhang ldquoHeterogeneous DefectPrediction Through Multiple Kernel Learning and EnsembleLearningrdquo in Proceedings of the 2017 IEEE International Confer-ence on Software Maintenance and Evolution (ICSME) pp 91ndash102 Shanghai September 2017
[28] Z He F Peters T Menzies and Y Yang ldquoLearning from open-source projects An empirical study on defect predictionrdquo inProceedings of the 2013 ACM IEEE International Symposium onEmpirical Software Engineering and Measurement ESEM 2013pp 45ndash54 usa October 2013
[29] P He B Li D Zhang and YMa Simplification of Training Datafor Cross-Project Defect Prediction arXiv preprint 2014
[30] F Porto and A Simao ldquoFeature Subset Selection and InstanceFiltering for Cross-project Defect Prediction-Classification andRankingrdquo CLEI Electronic Journal vol 19 no 4 p 17 2016
[31] X-Y Jing FWu X Dong and B Xu ldquoAn Improved SDA BasedDefect Prediction Framework for Both Within-Project andCross-Project Class-Imbalance Problemsrdquo IEEE Transactionson Software Engineering vol 43 no 4 pp 321ndash339 2017
[32] D Ryu J-I Jang and J Baik ldquoA hybrid instance selection usingnearest-neighbor for cross-project defect predictionrdquo Journal ofComputer Science and Technology vol 30 no 5 pp 969ndash9802015
[33] W N Poon K E Bennin J Huang P Phannachitta and J WKeung ldquoCross-project defect prediction using a credibilitytheory based naive bayes classifierrdquo in Proceedings of the 17thIEEE International Conference on Software Quality Reliabilityand Security QRS 2017 pp 434ndash441 cze July 2017
[34] M DrsquoAmbros M Lanza and R Robbes ldquoEvaluating defect pre-diction approaches a benchmark and an extensive comparisonrdquoEmpirical Software Engineering vol 17 no 4-5 pp 531ndash577 2012
[35] B Turhan A Tosun Misirli and A Bener ldquoEmpirical evalua-tion of the effects of mixed project data on learning defect pre-dictorsrdquo Information and Software Technology vol 55 no 6 pp1101ndash1118 2013
[36] J Han and M Kamber Data Mining Concepts and TechniquesElsevierMorgan KaufmannWalthamMass USA 3rd edition2012
[37] A Graf and S Borer ldquoNormalization in Support Vector Ma-chinesrdquo in Prooceedings of Dagm-Symposium on Pattern Recog-nition vol 2191 pp 277ndash282 Springer-Verlag 2001 no 7
[38] S B Kotsiantis D Kanellopoulos and P E Pintelas ldquoDataPreprocessing for Supervised Learningrdquo Enformatika vol 1 no1 pp 111ndash117 2006
[39] J Nam S J Pan and S Kim ldquoTransfer defect learningrdquo inProceedings of the 2013 35th International Conference on Soft-ware Engineering ICSE 2013 pp 382ndash391 usa May 2013
[40] M Jureczko andDD Spinellis ldquoUsingObject-OrientedDesignMetrics to Predict Software Defectsrdquo in Proceedings of theInternational Conference on Dependability of Computer SystemMonographs of System Dependability pp 69ndash81 2010
[41] MDrsquoAmbrosM Lanza and R Robbes ldquoAn extensive compari-son of bug prediction approachesrdquo inProceedings of the 7th IEEE
Working Conference on Mining Software Repositories (MSR rsquo10)pp 31ndash41 IEEE May 2010
[42] G E Macbeth E Razumiejczyk and R D Ledesma ldquoCliff rsquosdelta calculator A non-parametric effect size program for twogroups of observationsrdquo Universitas Psychologica vol 10 no 2pp 545ndash555 2011
[43] T Lee D G Han S Kim and H P In ldquoMicro interaction met-rics for defect predictionrdquo in Proceedings of the 19th ACMSIGSOFT Symposium on Foundations of Software EngineeringSIGSOFTFSErsquo11 pp 311ndash321 hun September 2011
[44] A Meneely L Williams W Snipes and J Osborne ldquoPredictingfailures with developer networks and social network analysisrdquoin Proceedings of the 16th ACM SIGSOFT International Sympo-sium on Foundations of Software Engineering (SIGSOFT rsquo08) pp13ndash23 ACM November 2008
[45] R Moser W Pedrycz and G Succi ldquoA Comparative analysisof the efficiency of change metrics and static code attributes fordefect predictionrdquo in Proceedings of the 30th International Con-ference on Software Engineering 2008 ICSErsquo08 pp 181ndash190 deuMay 2008
[46] E Shihab A Mockus Y Kamei B Adams and A E HassanldquoHigh-impact defects A study of breakage and surprise defectsrdquoin Proceedings of the 19th ACM SIGSOFT Symposium on Foun-dations of Software Engineering SIGSOFTFSErsquo11 pp 300ndash310hun September 2011
[47] T Fawcett ldquoAn introduction to ROC analysisrdquo Pattern Recogni-tion Letters vol 27 no 8 pp 861ndash874 2006
[48] E Giger M DrsquoAmbros M Pinzger and H C Gall ldquoMethod-level bug predictionrdquo in Proceedings of the 6th ACM-IEEEInternational Symposium on Empirical Software Engineering andMeasurement ESEM 2012 pp 171ndash180 swe September 2012
[49] Q Song Z Jia M Shepperd S Ying and J Liu ldquoA generalsoftware defect-proneness prediction frameworkrdquo IEEE Trans-actions on Software Engineering vol 37 no 3 pp 356ndash370 2011
[50] P C Cozby ldquoMethods in behavioral researchrdquo in McGraw-HillHigher Education 2011
[51] W H Kruskal and W A Wallis ldquoUse of ranks in one-criterionvariance analysisrdquo Journal of the American Statistical Associa-tion vol 47 no 260 pp 583ndash621 1952
[52] S S Sawilowsky ldquoNew Effect Size Rules of Thumbrdquo Journal ofModern Applied Statistical Methods vol 8 no 2 pp 597ndash5992009
[53] S S Choi S H Cha and C C Tappert ldquoA Survey of BinarySimilarity and Distance Measuresrdquo Journal of Systemics Cyber-netics Informatics vol 8 no 1 pp 43ndash48 2010
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
Mathematical Problems in Engineering 17
required to build suitable predictors quickly for their newprojects because one of our interesting findings is that thosecandidate instances with more bugs can be chosen directly astraining instances
Our future work mainly includes two aspects On the onehandwe plan to validate the generalizability of our studywithmore defect data from projects written in different languagesOn the other hand we will focus on more effective hybridmethods based ondifferent selection strategies such as featureselection techniques [32] Last but not least we also plan todiscuss the possibility of considering not only the numberof defects but also time variables for training data selection(such as bug-fixing time)
Conflicts of Interest
The authors declare that there are no conflicts of interest re-garding the publication of this article
Acknowledgments
The authors greatly appreciate Dr Nam and Dr Pan theauthors of [39] for providing them with the TCA sourceprogram and teaching them how to use itThis work was sup-ported by the Natural Science Foundation of Hubei province(no 2016CFB309) and the National Natural Science Founda-tion of China (nos 61272111 61273216 and 61572371)
References
[1] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, 2012.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706–720, 2002.
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE '09), pp. 91–100, The Netherlands, August 2009.
[5] F. Peters, T. Menzies, and A. Marcus, "Better cross company defect prediction," in Proceedings of the 10th International Working Conference on Mining Software Repositories (MSR 2013), pp. 409–418, USA, May 2013.
[6] F. Rahman, D. Posnett, and P. Devanbu, "Recalling the 'imprecision' of cross-project defect prediction," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, p. 1, Cary, North Carolina, November 2012.
[7] S. Herbold, "Training data selection for cross-project defect prediction," in Proceedings of the 9th International Conference, pp. 1–10, Baltimore, Maryland, October 2013.
[8] T. M. Khoshgoftaar, E. B. Allen, R. Halstead, G. P. Trio, and R. M. Flass, "Using process history to predict software quality," The Computer Journal, vol. 31, no. 4, pp. 66–72, 1998.
[9] T. J. Ostrand and E. J. Weyuker, "The distribution of faults in a large industrial software system," ACM SIGSOFT Software Engineering Notes, vol. 27, no. 4, p. 55, 2002.
[10] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting faults from cached history," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 489–498, IEEE Computer Society, Washington, DC, USA, May 2007.
[11] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[13] M. Chen and Y. Ma, "An empirical study on predicting defect numbers," in Proceedings of the 27th International Conference on Software Engineering and Knowledge Engineering (SEKE 2015), pp. 397–402, USA, July 2015.
[14] C. Ni, W. Liu, Q. Gu, X. Chen, and D. Chen, "FeSCH: a feature selection method using clusters of hybrid-data for cross-project defect prediction," in Proceedings of the 41st IEEE Annual Computer Software and Applications Conference (COMPSAC 2017), pp. 51–56, Italy, July 2017.
[15] P. He, B. Li, X. Liu, J. Chen, and Y. Ma, "An empirical study on software defect prediction with a simplified metric set," Information and Software Technology, vol. 59, pp. 170–190, 2015.
[16] T. Wang, Z. Zhang, X. Jing, and L. Zhang, "Multiple kernel ensemble learning for software defect prediction," Automated Software Engineering, vol. 23, no. 4, pp. 1–22, 2015.
[17] J. Y. He, Z. P. Meng, X. Chen, Z. Wang, and X. Y. Fan, "Semi-supervised ensemble learning approach for cross-project defect prediction," Journal of Software (Ruanjian Xuebao), vol. 28, no. 6, pp. 1455–1473, 2017.
[18] D. Ryu, J.-I. Jang, and J. Baik, "A transfer cost-sensitive boosting approach for cross-project defect prediction," Software Quality Journal, vol. 25, no. 1, pp. 1–38, 2015.
[19] D. Ryu and J. Baik, "Effective multi-objective naïve Bayes learning for cross-project defect prediction," Applied Soft Computing, vol. 49, pp. 1062–1077, 2016.
[20] Y. Li, Z. Huang, Y. Wang, and B. Fang, "Evaluating data filter on cross-project defect prediction: comparison and improvements," IEEE Access, vol. 5, pp. 25646–25656, 2017.
[21] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, "Towards building a universal defect prediction model," in Proceedings of the 11th International Working Conference on Mining Software Repositories (MSR 2014), pp. 182–191, India, June 2014.
[22] P. He, B. Li, and Y. Ma, Towards Cross-Project Defect Prediction with Imbalanced Feature Sets, 2014.
[23] J. Nam and S. Kim, "Heterogeneous defect prediction," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 508–519, Italy, September 2015.
[24] X. Jing, F. Wu, X. Dong, F. Qi, and B. Xu, "Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning," in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015), pp. 496–507, Italy, September 2015.
Mathematical Problems in Engineering
[25] Z. Li, X.-Y. Jing, X. Zhu, H. Zhang, B. Xu, and S. Ying, "On the multiple sources and privacy preservation issues for heterogeneous defect prediction," IEEE Transactions on Software Engineering, p. 1, 2017.
[26] Z. Li, X.-Y. Jing, F. Wu, X. Zhu, B. Xu, and S. Ying, "Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction," Automated Software Engineering, pp. 1–45, 2017.
[27] Z. Li, X. Jing, X. Zhu, and H. Zhang, "Heterogeneous defect prediction through multiple kernel learning and ensemble learning," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 91–102, Shanghai, September 2017.
[28] Z. He, F. Peters, T. Menzies, and Y. Yang, "Learning from open-source projects: an empirical study on defect prediction," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), pp. 45–54, USA, October 2013.
[29] P. He, B. Li, D. Zhang, and Y. Ma, Simplification of Training Data for Cross-Project Defect Prediction, arXiv preprint, 2014.
[30] F. Porto and A. Simao, "Feature subset selection and instance filtering for cross-project defect prediction - classification and ranking," CLEI Electronic Journal, vol. 19, no. 4, p. 17, 2016.
[31] X.-Y. Jing, F. Wu, X. Dong, and B. Xu, "An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 321–339, 2017.
[32] D. Ryu, J.-I. Jang, and J. Baik, "A hybrid instance selection using nearest-neighbor for cross-project defect prediction," Journal of Computer Science and Technology, vol. 30, no. 5, pp. 969–980, 2015.
[33] W. N. Poon, K. E. Bennin, J. Huang, P. Phannachitta, and J. W. Keung, "Cross-project defect prediction using a credibility theory based naive Bayes classifier," in Proceedings of the 17th IEEE International Conference on Software Quality, Reliability and Security (QRS 2017), pp. 434–441, Czech Republic, July 2017.
[34] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: a benchmark and an extensive comparison," Empirical Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012.
[35] B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Information and Software Technology, vol. 55, no. 6, pp. 1101–1118, 2013.
[36] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier/Morgan Kaufmann, Waltham, Mass, USA, 3rd edition, 2012.
[37] A. Graf and S. Borer, "Normalization in support vector machines," in Proceedings of the DAGM Symposium on Pattern Recognition, vol. 2191, pp. 277–282, Springer-Verlag, 2001.
[38] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised learning," Enformatika, vol. 1, no. 1, pp. 111–117, 2006.
[39] J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 2013 35th International Conference on Software Engineering (ICSE 2013), pp. 382–391, USA, May 2013.
[40] M. Jureczko and D. D. Spinellis, "Using object-oriented design metrics to predict software defects," in Proceedings of the International Conference on Dependability of Computer Systems, Monographs of System Dependability, pp. 69–81, 2010.
[41] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," in Proceedings of the 7th IEEE Working Conference on Mining Software Repositories (MSR '10), pp. 31–41, IEEE, May 2010.
[42] G. E. Macbeth, E. Razumiejczyk, and R. D. Ledesma, "Cliff's delta calculator: a non-parametric effect size program for two groups of observations," Universitas Psychologica, vol. 10, no. 2, pp. 545–555, 2011.
[43] T. Lee, D. G. Han, S. Kim, and H. P. In, "Micro interaction metrics for defect prediction," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 311–321, Hungary, September 2011.
[44] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting failures with developer networks and social network analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT '08), pp. 13–23, ACM, November 2008.
[45] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08), pp. 181–190, Germany, May 2008.
[46] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan, "High-impact defects: a study of breakage and surprise defects," in Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT/FSE '11), pp. 300–310, Hungary, September 2011.
[47] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[48] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall, "Method-level bug prediction," in Proceedings of the 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2012), pp. 171–180, Sweden, September 2012.
[49] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Transactions on Software Engineering, vol. 37, no. 3, pp. 356–370, 2011.
[50] P. C. Cozby, Methods in Behavioral Research, McGraw-Hill Higher Education, 2011.
[51] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[52] S. S. Sawilowsky, "New effect size rules of thumb," Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009.
[53] S. S. Choi, S. H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," Journal of Systemics, Cybernetics and Informatics, vol. 8, no. 1, pp. 43–48, 2010.