
How Does Regression Test Prioritization Perform in Real-World Software Evolution?

Yafeng Lu1, Yiling Lou2, Shiyang Cheng1, Lingming Zhang1, Dan Hao2∗, Yangfan Zhou3, Lu Zhang2

1Department of Computer Science, University of Texas at Dallas, 75080, USA
{yxl131230,sxc145630,lingming.zhang}@utdallas.edu

2Key Laboratory of High Confidence Software Technologies (Peking University), MoE, China
Institute of Software, EECS, Peking University, Beijing, 100871, China

{louyiling,haodan,zhanglucs}@pku.edu.cn
3School of Computer Science, Fudan University, 201203, China

[email protected]

ABSTRACT

In recent years, researchers have intensively investigated various topics in test prioritization, which aims to re-order tests to increase the rate of fault detection during regression testing. While the main research focus in test prioritization is on proposing novel prioritization techniques and evaluating them on more and larger subject systems, little effort has been put into investigating the threats to validity in existing work on test prioritization. One main threat to validity is that existing work mainly evaluates prioritization techniques based on simple artificial changes to the source code and tests. For example, the changes to the source code usually include only seeded program faults, whereas the test suite is usually not augmented at all. On the contrary, in real-world software development, software systems usually undergo various source code changes as well as test suite augmentation. Therefore, it is not clear whether the conclusions drawn by existing work in test prioritization from such artificial changes are still valid for real-world software evolution. In this paper, we present the first empirical study to investigate this important threat to validity in test prioritization. We reimplemented 24 variant techniques of both traditional and time-aware test prioritization, and investigated the impacts of software evolution on those techniques based on the version history of 8 real-world Java programs from GitHub. The results show that for both traditional and time-aware test prioritization, test suite augmentation significantly hampers their effectiveness, whereas source code changes alone do not influence their effectiveness much.

1. INTRODUCTION

During software development and maintenance, a software system continuously evolves due to various reasons, e.g., adding new features, fixing bugs, or improving maintainability and efficiency. As changes to the software may introduce bugs, it is necessary to apply regression testing to revalidate the modified system. However, it is costly to perform regression testing. As reported, regression testing consumes 80% of the overall testing budget [1,2], and some test suites take more than seven weeks to run [3]. Therefore, it is important to strike a balance between the tests that ideally should be run in regression testing and the tests that are affordable to run, which is the intuition behind test prioritization [4].

∗Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ICSE ’16, May 14-22, 2016, Austin, TX, USA
© 2016 ACM. ISBN 978-1-4503-3900-1/16/05...$15.00
DOI: http://dx.doi.org/10.1145/2884781.2884874

Following this intuition, test prioritization [3–5] has been proposed to improve the efficiency of regression testing, especially when the testing resources (e.g., time, developers’ efforts) are limited. In particular, test prioritization aims to schedule the execution order of existing tests so as to maximize some testing objective (e.g., the rate of fault detection). In the literature, a huge body of research work has been dedicated to test prioritization, investigating various prioritization algorithms [6–10], coverage criteria [11, 12], and so on [13–15].

However, little work in the literature has studied the threats to validity in test prioritization, especially the threat resulting from software changes. In software evolution, a software system is usually modified by changing its source code and adding tests. However, the existing work on test prioritization is usually evaluated without considering test changes. In particular, for a modified software system, its test suite consists of: (1) existing tests, which are designed to test the software system before modification (denoted as S) and can be used to test the modified software system (denoted as S′), and (2) new tests, which are added to test the modification. Although all these tests are used to test the modified software, none of the existing work on test prioritization is applied or evaluated considering the influence of the added tests. Furthermore, the existing work on test prioritization is usually evaluated based on software whose source code changes include only seeded program faults. That is, the difference between the original software system S and the modified software system S′ lies only in the seeded faults. However, the changes from S to S′ consist of many edits, which are not necessarily related to program faults. In summary, existing work on test prioritization is usually evaluated based on such simplified artificial changes (without real source code changes and test additions), and thus it is not clear whether the conclusions drawn by the existing work are still valid for real-world software evolution.



In this paper, we point out this important threat to validity in existing test prioritization work, and investigate the effectiveness of existing test prioritization in real-world software evolution by conducting an empirical study on 8 open-source projects. To investigate the influence of source code changes, we constructed two types of source code changes: seeding faults into the original software, as existing test prioritization work does, and seeding faults into the modified software (i.e., later versions of a project). To investigate the influence of test additions, we prioritized tests both without considering newly added tests and with all tests including newly added tests. Furthermore, as time budgets are limited in some cases, we also investigated test prioritization considering the influence of time budgets. That is, in the empirical study, we implemented traditional techniques [3, 7, 8] and time-aware techniques [16], each of which is implemented based on coverage information of various granularities (i.e., statement, method, and branch coverage).

The experimental results show that test additions have a significant influence on the effectiveness of test prioritization. When new tests are added, all the traditional and time-aware test prioritization techniques become much less effective. Another interesting finding is that source code changes alone (without test additions) do not have much influence on the effectiveness of test prioritization. Both traditional and time-aware test prioritization techniques perform quite stably in the presence of source code changes without test additions, indicating that out-of-date coverage information (due to code changes) can still be effective for test prioritization.

The paper makes the following contributions:

• Pointing out an important threat to validity in the evaluation of existing test prioritization: the evolution of source code and tests.

• The first empirical study investigating the influence of the threat resulting from real-world software evolution, considering various time-unaware and time-aware prioritization techniques based on various coverage information.

• Interesting findings showing that both traditional and time-aware test prioritization are significantly negatively influenced by test additions, but not influenced much by source code changes alone. Practical guidelines are also learned from the study to help with test prioritization in practice.

2. STUDIED TECHNIQUES

In this section, we explain the technical details about the techniques investigated in the empirical study.

2.1 Traditional Test Prioritization

Given any test suite T and its set of permutations PT, traditional test prioritization aims to find a permutation T′ ∈ PT such that for any permutation T′′ ∈ PT (T′′ ≠ T′), f(T′′) ≤ f(T′), where f is an objective function from a permutation to a real number.

Most existing test prioritization techniques guide their prioritization process based on coverage information, which refers to whether a structural unit is covered by a test. In this section, we explain the prioritization techniques based on statement coverage, but they are applicable to other types of structural coverage (e.g., method coverage or branch coverage), which will be further investigated in Section 3.8.4.

2.1.1 Total&Additional Test Prioritization

The total test prioritization technique schedules the execution order of tests in descending order of the number of statements covered by each test, whereas the additional test prioritization technique schedules the execution order of tests based on the number of statements that are not yet covered by the already selected tests but are covered by each candidate test.

The total&additional test prioritization techniques are simple greedy algorithms, but they are recognized as representative test prioritization techniques due to their effectiveness and are taken as the control techniques in the evaluation of existing work [8, 9].
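For illustration, a minimal Python sketch of the two greedy strategies (not the implementation used in this study; the test names and statement ids below are made up) prioritizes tests from a map of each test to the set of statements it covers:

def total_prioritization(coverage):
    """Total strategy: order tests by descending number of covered statements."""
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def additional_prioritization(coverage):
    """Additional strategy: repeatedly pick the test covering the most statements
    not yet covered by already selected tests; when no test adds new coverage,
    reset the covered set (a common tie-breaking choice, assumed here)."""
    remaining = dict(coverage)
    covered, order = set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            covered = set()
            best = max(remaining, key=lambda t: len(remaining[t]))
        order.append(best)
        covered |= remaining.pop(best)
    return order

# Hypothetical example: three tests covering statement ids.
cov = {"t1": {1, 2, 3}, "t2": {3, 4}, "t3": {5}}
print(total_prioritization(cov))       # ['t1', 't2', 't3']
print(additional_prioritization(cov))  # ['t1', 't2', 't3']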

2.1.2 Search-Based Test Prioritization

Search-based test prioritization was proposed by Li et al. [8], and is an application of search-based software engineering [17, 18]. Typically, search-based test prioritization takes all the permutations as candidate solutions and utilizes heuristics to guide the process of searching for a better execution order of tests. In this empirical study, we use genetic-algorithm-based test prioritization as a representative of search-based test prioritization due to its effectiveness [8].

The genetic-algorithm-based test prioritization technique [8] encodes a permutation of a test suite as an N-sized array, which represents the position of each test in the prioritized test suite. Initially, a set of permutations is randomly generated and taken as the initial population. Each pair of permutations in the population is taken as parent permutations to generate two offspring permutations through crossover at a random position. For each offspring permutation, the mutation operator¹ randomly selects two tests and exchanges their positions. To find an optimal solution, the genetic-algorithm-based test prioritization technique defines a fitness function based on the coverage of tests (e.g., the average percentage of statement coverage).
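A compact sketch of such a genetic algorithm is shown below; it is a simplification rather than Li et al.'s exact encoding, assuming an APSC-style fitness, a cut-and-fill crossover that keeps offspring valid permutations, and arbitrary population/generation parameters:

import random

def apsc(order, coverage):
    """Average percentage of statement coverage: rewards orders that reach
    statements early (simplified form of the fitness used in [8])."""
    first_pos = {}
    for pos, test in enumerate(order, start=1):
        for stmt in coverage[test]:
            first_pos.setdefault(stmt, pos)
    if not first_pos:
        return 0.0
    n, m = len(order), len(first_pos)
    return 1 - sum(first_pos.values()) / (n * m) + 1 / (2 * n)

def crossover(p1, p2):
    """Cut-and-fill crossover: keep a prefix of p1, then the remaining tests
    in p2's order, so the child is still a permutation."""
    k = random.randrange(1, len(p1))
    head = p1[:k]
    return head + [t for t in p2 if t not in head]

def mutate(order):
    """The GA mutation operator: swap two randomly chosen positions."""
    i, j = random.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]

def ga_prioritize(coverage, pop_size=20, generations=50):
    tests = list(coverage)
    population = [random.sample(tests, len(tests)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda o: apsc(o, coverage), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b)
            mutate(child)
            children.append(child)
        population = parents + children
    return max(population, key=lambda o: apsc(o, coverage))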

2.1.3 Adaptive Random Test Prioritization

Based on random test prioritization, Jiang et al. [7] presented a set of adaptive random test prioritization techniques by varying the types of coverage information and the function f2 that determines which test to select in the prioritization process. In particular, adaptive random test prioritization has three types of f2: (1) f2 selects the test that has the largest average distance to the already selected tests, (2) f2 selects the test that has the largest maximum distance to the already selected tests, and (3) f2 selects the test that has the largest minimum distance to the already selected tests. Since the adaptive random test prioritization technique with the last type of f2 has been evaluated to be more effective and efficient [7], it is taken as the representative of adaptive random test prioritization in this paper.
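A minimal sketch of the max-min variant follows, assuming a Jaccard distance over the tests' coverage sets and a fixed-size random candidate set (both are assumptions; Jiang et al. [7] study several distance and candidate-set choices):

import random

def jaccard_distance(a, b):
    """Coverage-set distance between two tests."""
    if not (a or b):
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def art_prioritize(coverage, candidate_size=10):
    """Max-min adaptive random prioritization: from a random candidate set,
    select the test whose minimum distance to the already prioritized tests
    is the largest."""
    remaining = list(coverage)
    random.shuffle(remaining)
    order = [remaining.pop()]
    while remaining:
        candidates = random.sample(remaining, min(candidate_size, len(remaining)))
        best = max(candidates,
                   key=lambda t: min(jaccard_distance(coverage[t], coverage[s])
                                     for s in order))
        order.append(best)
        remaining.remove(best)
    return order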

2.2 Time-Aware Test Prioritization

Considering the time budget, it is sometimes impossible to run all the tests, and thus it is necessary to investigate test prioritization with time constraints.

¹The concept “mutation operator” used here refers to an operation in genetic programming, and is different from the concept used in mutation testing.


Considering the time limit and the varying execution times of tests, time-aware test prioritization has been proposed in the literature [16, 19, 20], and is usually formalized as follows. Given a test suite T, its set of permutations PT, and a time budget timemax, time-aware test prioritization aims to find a permutation T′ ∈ PT such that time(T′) ≤ timemax and, for any permutation T′′ ∈ PT (T′′ ≠ T′) with time(T′′) ≤ timemax, f(T′) ≥ f(T′′), where f and time are the objective and cost functions from permutations to real numbers [16], respectively. Compared with traditional test prioritization, time-aware test prioritization additionally requires the prioritized test suite to satisfy the time constraint.

2.2.1 Total&Additional Test Prioritization

Time-aware total/additional test prioritization is adapted from traditional total/additional test prioritization by considering the time budget [16]. In particular, time-aware total/additional test prioritization first schedules the execution order of tests like traditional total/additional test prioritization, and then keeps the preceding tests whose total execution time does not exceed the time budget, removing the remaining tests.
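The truncation step itself is simple; a minimal sketch is given below (exec_time, mapping each test to its measured running time, is an assumed input; names are illustrative):

def truncate_to_budget(order, exec_time, time_max):
    """Keep the leading tests of a prioritized order whose cumulative
    execution time stays within the time budget."""
    kept, used = [], 0.0
    for test in order:
        if used + exec_time[test] > time_max:
            break
        kept.append(test)
        used += exec_time[test]
    return kept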

2.2.2 Total&Additional Test Prioritization via Integer Linear Programming

Based on traditional total&additional test prioritization, Zhang et al. [16] proposed integer linear programming (ILP) based test prioritization, which formalizes test selection in the process of test prioritization as an ILP model. In particular, Zhang et al. [16] presented a total test prioritization technique via ILP, which first selects a set of tests that maximizes the sum of the number of statements covered by each selected test while keeping the total execution time within the given time budget, and then schedules the execution order of the selected tests using the traditional total strategy. Zhang et al. [16] also presented an additional test prioritization technique via ILP, which selects a set of tests that maximizes the number of distinct statements covered by the selected tests while keeping their total execution time within the given time budget.
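For reference, a standard way to write the selection step of the additional variant as an ILP, with binary variables $x_j$ (test $t_j$ selected) and $y_i$ (statement $s_i$ covered by some selected test), is sketched below; this is a reconstruction consistent with the description above, not necessarily the exact model of Zhang et al. [16]:

\begin{align*}
\text{maximize}   \quad & \textstyle\sum_{i=1}^{m} y_i \\
\text{subject to} \quad & y_i \le \textstyle\sum_{j:\, s_i \in cov(t_j)} x_j, \qquad i = 1,\dots,m,\\
                        & \textstyle\sum_{j=1}^{n} time(t_j)\, x_j \le time_{max},\\
                        & x_j \in \{0,1\}, \quad y_i \in \{0,1\}.
\end{align*}

The total variant replaces the objective with $\sum_{j=1}^{n} |cov(t_j)|\, x_j$ and drops the $y_i$ variables.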

For each technique, we implement its variants based on the various coverage criteria (i.e., method, statement, and branch coverage), and thus we have 8 × 3 = 24 variant techniques.

3. EMPIRICAL STUDY

3.1 Research Questions

This study investigates the following research questions:

• RQ1: How does software evolution influence traditional test prioritization techniques in real-world evolving software systems?

• RQ2: How does software evolution influence time-aware test prioritization techniques in real-world evolving software systems?

• RQ3: How do source code differences alone (i.e., without test additions) influence traditional and time-aware test prioritization techniques?

• RQ4: How do different test prioritization techniques compare with random prioritization in software evolution?

In RQ1 and RQ2, we investigate the impact of real-world software evolution (including both code changes and test additions) on both traditional and time-aware test prioritization. Since the added tests vary across projects, we further distinguish the impacts of added tests from the impacts of source code changes (which cause test coverage changes). That is, in RQ3, we exclude the tests that are newly added, and only investigate the effectiveness of the various prioritization techniques for tests that exist in earlier versions (and thus are only affected by source code changes). Furthermore, in RQ4, we investigate when simple random prioritization can be competitive with the existing techniques.

3.2 Implementation and Supporting Tools

To collect the various types of coverage information, we used on-the-fly bytecode instrumentation, which dynamically instruments classes loaded into the JVM through a Java agent without any modification of the target program. We implemented the code instrumentation based on the ASM bytecode manipulation and analysis framework² within our FaultTracer tool [21, 22]. To implement the ILP-based techniques (i.e., total&additional test prioritization via ILP), we used a mathematical programming solver, GUROBI Optimization³, to represent and solve the equations formulated by the ILP-based techniques. All the test prioritization techniques have been implemented as Java and Python programs. Note that all the tools developed by ourselves are available from our project website. Finally, we used the PIT mutation testing tool⁴ to seed faults into our subject programs.

3.3 Subject Systems, Tests, and Faults

In the empirical study, we selected 8 Java projects from GitHub⁵ that have been widely used in software testing research [23–25]. Each project has accumulated various commits during real-world software evolution. Following prior work [23], for each project we first chose the latest commit to which the tools used could be successfully applied, and then selected up to 10 versions by counting backwards 30 commits each time. Note that each project version has a corresponding JUnit test suite accumulated during software development. In software testing research, previous studies [26–28] have shown that mutation faults are close to real faults and are suitable for software testing experimentation, since real faults are usually hard to find and small in number, making it hard to perform statistical analysis. Furthermore, mutation faults have also been widely used in test prioritization research [9, 12, 29]. Therefore, in this work, we use mutation faults generated by the PIT mutation testing tool to investigate the effectiveness of the various test prioritization techniques.
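A minimal sketch of this version-selection step is given below, assuming a local clone and a linear (--first-parent) history; the repository path and starting commit are illustrative, not taken from the paper:

import subprocess

def select_versions(repo, start_commit="HEAD", step=30, max_versions=10):
    """List commit ids for the studied versions: starting from start_commit,
    keep every `step`-th commit walking backwards, up to max_versions,
    returned oldest first."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", "--first-parent", start_commit],
        capture_output=True, text=True, check=True).stdout.split()
    picked = out[::step][:max_versions]   # rev-list prints newest first
    return list(reversed(picked))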

Table 1 presents the basic information of each subject, including its number of versions (abbreviated as “Ver”). As each project has more than one version, the columns “MinS” and “MaxS” present the minimum and maximum numbers of lines of code for each project, whereas the columns “MinT” and “MaxT” present the minimum and maximum numbers of tests for each project.

²http://asm.ow2.org/
³http://www.gurobi.com/
⁴http://pitest.org/
⁵https://github.com/


Sub            Ver  MinS    MaxS    MinT   MaxT   Ft1  Ft2
jasmine-maven  5    1,640   4,348   7      118    500  10
java-apns      8    1,362   3,839   15     87     410  80
jopt-simple    4    6,636   8,569   394    657    500  500
la4j           9    8,094   12,555  172    625    500  500
scribe-java    8    2,497   5,957   38     99     500  385
vraptor-core   5    31,176  32,997  985    1,124  500  500
assertj-core   8    55,443  67,282  4,055  5,269  500  500
metrics-core   6    11,477  12,536  270    318    500  500

Table 1: Subject statistics

As this empirical study is designed to investigate the influence of test additions, the columns “Ft1” and “Ft2” present the numbers of faults used for the study with test additions and for the study without test additions, respectively⁶.

3.4 Independent Variables

We consider the following independent variables (IVs) in the empirical study.

IV1: Prioritization Scenarios. We consider both (1) traditional test prioritization, which prioritizes and executes all tests, and (2) time-aware test prioritization, which only prioritizes and executes tests within certain time constraints.

IV2: Prioritization Techniques. For each test prioritization scenario, we consider various representative prioritization techniques. For traditional prioritization, we consider the (1) total, (2) additional, (3) search-based, and (4) adaptive random techniques. For time-aware prioritization, we consider the (1) total, (2) additional⁷, (3) total ILP-based, and (4) additional ILP-based techniques. Besides these techniques, we also implement random prioritization as the control technique for both traditional and time-aware prioritization. In random prioritization, we randomly generate 1000 execution orders and take their average results, as suggested by the literature [30].

IV3: Coverage Criteria. Since all the studied techniques rely on code coverage information, we also investigate the influence of coverage criteria. More specifically, we study three widely used coverage criteria: (1) method coverage, (2) statement coverage, and (3) branch coverage.

IV4: Revision Granularities. Existing work mainly applies test prioritization based on the coverage data of a software version vi to the same version with seeded faults. However, in practice, when prioritizing tests for vi, we only have the coverage of an older version, e.g., vi−1. Therefore, besides applying test prioritization based on the coverage of vi to vi (denoted as vi → vi), we also study vi−1 → vi, vi−2 → vi, ..., v1 → vi, to study the influence of larger revisions on test prioritization.

IV5: Time Budgets. In time-aware test prioritization, only tests within a certain time budget are allowed to execute. Following existing work [16], we consider the time budgets b ∈ {5%, 25%, 50%, 75%}, where b refers to the percentage of the execution time of the entire test suite.

3.5 Dependent Variable

To evaluate the effectiveness of the various test prioritization techniques in software evolution, we used the widely used APFD (Average Percentage of Faults Detected) metric [3, 6, 9, 31]. For any given test suite and faulty program, higher APFD values imply a higher fault-detection rate.

⁶Note that the number of faults used can be different for the two settings because different sets of faults are detected by different sets of tests.
⁷We consider the total and additional techniques here because they have been studied as baseline techniques in time-aware prioritization [16].

To better evaluate the effectiveness of time-aware prioritization, we followed existing work on time-aware prioritization and adopted a slight variant of the traditional APFD metric [16, 19]. The details of the variant APFD metric can be found in [19].
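For reference, the standard APFD for a prioritized order of n tests and m detected faults is given below, where TF_i is the position of the first test that detects fault i; the variant of [19] additionally accounts for faults that the executed portion of the suite never detects:

\mathit{APFD} = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n \times m} + \frac{1}{2n}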

3.6 Experimental Setup

For each subject system S, we obtain a set of subsequent revisions, e.g., Sv1, Sv2, ..., Svn. Then, for the latest version Svn, we apply the following process.

First, we construct various faulty versions of Svn. According to existing work [9, 12, 29], a specific program version usually does not contain a large number of faults. Therefore, similar to the existing work [9], we construct faulty versions by grouping 5 faults together. That is, for each program version, we randomly produce up to 100 fault groups, each of which contains 5 randomly selected faults. Note that we guarantee that each selected fault can be detected by at least one test, and that different fault groups do not have common faults. In this way, we have constructed up to 500 faulty versions for each program version.
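A minimal sketch of this grouping step follows; kill_map, mapping each mutation fault to the set of tests that detect it, is an assumed precomputed input:

import random

def build_fault_groups(kill_map, group_size=5, max_groups=100):
    """Form disjoint groups of `group_size` faults, each detectable by at
    least one test."""
    pool = [fault for fault, tests in kill_map.items() if tests]
    random.shuffle(pool)
    groups = []
    while len(pool) >= group_size and len(groups) < max_groups:
        groups.append([pool.pop() for _ in range(group_size)])
    return groups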

Second, we collect three types of coverage (i.e., method, statement, and branch coverage) for Svn and all its earlier versions, i.e., Sv1 to Svn−1. Thus, we have 3n coverage files for each project.

Third, we apply all the studied techniques based on the coverage files of Svn to the faulty versions of Svn, denoted as Svn → Svn. In addition, we also apply the techniques based on the coverage files of earlier versions to the faulty versions of Svn, i.e., Svi → Svn (i ∈ [1, n − 1]). Note that for newly added tests without coverage, we simply put them at the end of the prioritization sequence in alphabetical order.

Finally, we collect all the prioritization APFD values of Svn based on the combination of all our independent variables, and apply data analysis.

3.7 Threats to Validity

The threat to internal validity lies in the implementation of the test prioritization techniques used in the empirical study. To reduce this threat, we used existing tools (e.g., GUROBI) to aid the implementation and reviewed all the code of the test prioritization techniques before conducting the empirical study.

The threats to external validity mainly lie in the subjects and faults. Although the subjects may not be sufficiently representative, especially of programs in other programming languages, all the subjects used in the empirical study are real-world Java projects and have been used in previous studies on software testing [23, 24]. This threat can be further reduced by using more programs in various programming languages (e.g., C, C++, Java, and C#).

In software development, whenever developers detect a regression fault, they usually fix the fault before committing rather than committing a version with failing tests. Therefore, it is hard to find real regression faults in open-source code repositories. In addition, according to the experimental study conducted by Do and Rothermel [29], mutation-generated faults are suitable for use in test prioritization studies. Therefore, we used mutation faults in our evaluation. To further reduce this threat, we performed a preliminary study using all the 27 real-world faults of joda-time from Defects4J [32] to simulate real regression faults. The study further confirms that test additions impact test prioritization results significantly while source code changes do not.



The threat to construct validity mainly lies in the metric used to assess the effectiveness of prioritization techniques. The APFD metric we used has been extensively used in prior work [3, 16, 19], but it does not consider all the costs and benefits related to test prioritization [29]. Further reduction of this threat requires more studies using more metrics.

3.8 Result Analysis

In this section, we present our main experimental findings. Note that all our detailed experimental data and results are available online⁸.

3.8.1 RQ1: Impacts of Software Evolution on Traditional Test Prioritization

Figure 1 presents the boxplots of the APFD values for all four traditional test prioritization techniques using the statement coverage criterion on all projects. The results for the other two coverage criteria are quite close to those of statement coverage and can be found on our webpage. Each sub-figure presents the detailed APFD results when using each version's coverage information to prioritize the tests for the latest version of each project using statement coverage. In each sub-figure, the x-axis shows the versions from which the coverage is collected, the y-axis shows the APFD values for the techniques when prioritizing tests for the latest version, and each boxplot presents the APFD distribution (median, upper/lower quartile, and 95th/5th percentile values) based on different fault groups for each technique using each version's coverage information. We use white, blue, green, orange, and red boxes to represent the random, total, additional, ART, and search-based techniques, respectively. From the figure, we have the following observations:

First, for all projects, all the techniques except the total technique show a clear growth trend when using more up-to-date coverage for prioritization. For example, for java-apns, the median APFD value of the additional strategy is 0.59 when using v1's statement coverage to prioritize tests for v8, and it grows to 0.87 when using v8's statement coverage to prioritize tests for v8. The total strategy does not show a clear growth trend for the majority of the subjects. We believe the reason is that the total technique does not use the coverage information effectively, and thus is not sensitive to coverage changes.

Second, when using the most up-to-date coverage information, for the majority of the subjects the additional and search-based techniques are among the best prioritization techniques, while the ART technique is also competitive. This finding confirms the previous work on search-based test prioritization [8] and adaptive random test prioritization [7], respectively. In addition, our results further demonstrate that the search-based test prioritization technique is superior to the ART technique (to our knowledge, these two techniques have never been compared before).

Third, when the version used for coverage collection is far from the version being prioritized (i.e., massive code changes have occurred), all four techniques become close in performance, demonstrating the importance of using up-to-date coverage information for test prioritization. For example, for la4j, the total, additional, ART, and search-based techniques have median APFD values of 0.53, 0.85, 0.80, and 0.83, respectively, when using the coverage of v8 to prioritize tests for v9. In contrast, the median APFD value of the total technique increases to 0.67, while the median APFD values of all the other techniques drop to around 0.71, when using the coverage of v1 to prioritize tests for v9.

⁸http://utdallas.edu/~yxl131230/icse16support.html


3.8.2 RQ2: Impacts of Software Evolution on Time-Aware Test Prioritization

Figure 2 presents the APFD values for all four time-aware test prioritization techniques (together with random prioritization) using the 50% time constraint based on statement coverage⁹. Each sub-figure presents the detailed APFD results when using each version's coverage information to prioritize the tests for the latest version of each project under the 50% time constraint. The settings for each sub-figure are the same as those in Figure 1. The only difference is that we now use white, blue, green, orange, and red boxes to represent the random, total, additional, ILP-total, and ILP-additional techniques, respectively. From the results, we have the following observations:

First, the additional, ILP-total, and ILP-additional techniques show a clear growth trend when using more up-to-date coverage information for test prioritization. For example, for jopt-simple (with the 5% time constraint), the median APFD value of the additional strategy is 0.38 when using v1's statement coverage to prioritize tests for v4, and it grows to 0.58 when using v4's statement coverage to prioritize tests for v4.

Second, similar to traditional test prioritization without time constraints, the total technique does not show a clear growth trend when using more up-to-date coverage information. For example, for java-apns under the 50% time constraint, surprisingly, when we use the code coverage of v8 to prioritize tests for v8, the median APFD value even falls below 0, while it becomes larger than 0.6 when using less up-to-date coverage information (i.e., from v7). We looked into the code and found the reason to be that version v8 introduced a newly added huge test (i.e., handleTransmissionErrorInQueuedConnection of test class ApnsConnectionCacheTest), which has an average execution time of 248ms (far more than the execution time of the other tests) and covers a huge number of code elements (statements/methods/branches). Therefore, when the total technique uses v8's coverage to prioritize tests, it tends to put that huge test at the very beginning. Then, under the 50% time constraint, only 4 tests can be run when using v8's coverage to prioritize tests for v8. On the contrary, when using less up-to-date coverage information, the newly added huge test does not have coverage at all and is put at the end of the test execution sequence, enabling more tests to be executed. For example, under the 50% time constraint, when using the coverage information from v1 to v7 to prioritize tests for v8, 41, 57, 67, 67, 73, 73, and 75 tests can be executed, respectively. Therefore, the APFD values for those coverage versions are much higher.

Third, when using the most up-to-date coverage information, the ILP-additional technique always performs the best, which confirms the previous study on time-aware test prioritization [16]. Furthermore, our results also demonstrate that the ILP-additional technique is the most stable and effective technique even when using less up-to-date coverage information.

⁹Note that due to space limitations, we present the results for the other coverage criteria and time constraints (similar to the results shown in this paper) on our project homepage.


[Figure 1 shows one APFD boxplot panel per subject: jasmine-maven-plugin, java-apns, jopt-simple, la4j, scribe-java, vraptor-core, metrics-core, and assertj-core (all stmt-cov); x-axis: version whose coverage is used, y-axis: APFD.]

Figure 1: Results for traditional test prioritization techniques (i.e., random, total, additional, ART, and search-based) based on statement coverage (with new tests)

[Figure 2 shows one APFD boxplot panel per subject: jasmine-maven-plugin, java-apns, jopt-simple, la4j, scribe-java, vraptor-core, metrics-core, and assertj-core (all 50% time); x-axis: version whose coverage is used, y-axis: APFD.]

Figure 2: Results for time-aware test prioritization techniques (i.e., random, total, additional, ILP-total, and ILP-additional) with statement coverage and 50% time constraint (with new tests)

[Figure 3 shows one APFD boxplot panel per subject: jasmine-maven-plugin, java-apns, jopt-simple, la4j, scribe-java, vraptor-core, metrics-core, and assertj-core (all stmt-cov); x-axis: version whose coverage is used, y-axis: APFD.]

Figure 3: Results for traditional test prioritization techniques (i.e., random, total, additional, ART, and search-based) based on statement coverage (excluding new tests)


3.8.3 RQ3: Impacts of Only Source Code Changes on Test Prioritization

Similar to Figure 1, Figure 3 presents the APFD values for all four traditional test prioritization techniques using the statement coverage criterion on all projects. Similar to Figure 2, Figure 4 presents the APFD values for all four time-aware test prioritization techniques (together with random prioritization) using the 50% time constraint based on statement coverage. The only change is that in Figure 3 and Figure 4 we consider only the tests that exist in all versions, in order to study the influence of coverage changes alone (i.e., excluding the influence of test additions)¹⁰. Note that the corresponding figures for other coverage and time constraint settings are all available from our webpage. From the two figures, we observe that for traditional test prioritization, the additional and search-based techniques perform the best across all versions. For example, for la4j, when using the coverage from the earliest version to prioritize tests of the latest version, both the additional and search-based techniques have median APFD results of 0.93, whereas the total and ART techniques have median APFD results of 0.78 and 0.91, respectively. For time-aware test prioritization, the ILP-additional technique performs the best across nearly all versions for all time constraints, and the additional technique can be competitive when the time constraints are not tight (e.g., ≥75%).

¹⁰Note that there are only 3 tests in common across all versions of jasmine-maven, thus its boxplots cannot be shown.


[Figure 4 shows one APFD boxplot panel per subject: jasmine-maven-plugin, java-apns, jopt-simple, la4j, scribe-java, vraptor-core, metrics-core, and assertj-core (all 50% time); x-axis: version whose coverage is used, y-axis: APFD.]

Figure 4: Results for time-aware test prioritization techniques (i.e., random, total, additional, ILP-total, and ILP-additional) with statement coverage and 50% time constraint (excluding new tests)

Tech    Sub            | With new tests: p-value  v1 v2 v3 v4 v5 v6 v7 v8 v9 | Without new tests: p-value  v1 v2 v3 v4 v5 v6 v7 v8 v9

Total   jopt-simple    | p<0.05  b b a a – – – – –          | 0.987   a a a a – – – – –
        vraptor-core   | 0.147   a a a a a – – – –          | 0.999   a a a a a – – – –
        metrics-core   | 0.339   a a a a a a – – –          | 1.000   a a a a a a – – –
        java-apns      | p<0.05  b b a a a a a a –          | p<0.05  a a b b b b b b –
        assertj-core   | p<0.05  b b b b b a a a –          | 1.000   a a a a a a a a –
        jasmine-maven  | p<0.05  c bc b a a – – – –         | p<0.05  b a a a a – – – –
        scribe-java    | 0.854   a a a a a a a a –          | 0.960   a a a a a a a a –
        la4j           | p<0.05  a a b bc bcd d bcd bcd cd  | 0.956   a a a a a a a a a

Addit.  jopt-simple    | p<0.05  b b a a – – – – –          | 0.998   a a a a – – – – –
        vraptor-core   | p<0.05  b b ab a a – – – –         | 0.987   a a a a a – – – –
        metrics-core   | 0.991   a a a a a a – – –          | 1.000   a a a a a a – – –
        java-apns      | p<0.05  c c b b b b b a –          | 0.842   a a a a a a a a –
        assertj-core   | p<0.05  d c c bc bc ab a a –       | 1.000   a a a a a a a a –
        jasmine-maven  | p<0.05  c c b a a – – – –          | p<0.05  b a a a a – – – –
        scribe-java    | p<0.05  c c c bc bc ab a a –       | 0.998   a a a a a a a a –
        la4j           | p<0.05  b b c c bc b a a a         | 0.997   a a a a a a a a a

ART     jopt-simple    | p<0.05  b b a a – – – – –          | 0.119   a a a a – – – – –
        vraptor-core   | p<0.05  b ab ab a ab – – – –       | 0.931   a a a a a – – – –
        metrics-core   | p<0.05  bc c b a bc bc – – –       | p<0.05  c a c ab a bc – – –
        java-apns      | p<0.05  d cd bc b b b b a –        | p<0.05  a a a a a a a a –
        assertj-core   | p<0.05  c b b b b ab a a –         | p<0.05  a ab ab ab b ab b ab –
        jasmine-maven  | p<0.05  b b a a a – – – –          | p<0.05  a b b b b – – – –
        scribe-java    | 0.499   a a a a a a a a –          | p<0.05  bcd bc ab d a bcd cd bc –
        la4j           | p<0.05  c c de e cd c ab b a       | p<0.05  e cde cde cd de de ab a bc

Search  jopt-simple    | p<0.05  b b a a – – – – –          | p<0.05  b a ab ab – – – – –
        vraptor-core   | p<0.05  b b b a a – – – –          | 0.933   a a a a a – – – –
        metrics-core   | p<0.05  b ab b a b ab – – –        | p<0.05  b b b b a b – – –
        java-apns      | p<0.05  c c b b b b b a –          | 0.973   a a a a a a a a –
        assertj-core   | p<0.05  d bc c bc bc abc a ab –    | p<0.05  b ab ab ab b b ab a –
        jasmine-maven  | p<0.05  d c b a a – – – –          | p<0.05  b a a a a – – – –
        scribe-java    | p<0.05  c c c bc bc ab a a –       | 0.951   a a a a a a a a –
        la4j           | p<0.05  b b cd d bc b a a a        | p<0.05  c b a ab a bc ab b b

Table 2: ANOVA analysis and Tukey's HSD test among using statement coverage of different versions for traditional prioritization


We also observe that, when new tests are excluded, the effectiveness of all the studied techniques is not influenced much by the use of coverage data from different versions. To illustrate, for the vast majority of the sub-figures in Figures 1 and 2, the effectiveness of all techniques tends to increase when using more up-to-date coverage information. However, for all sub-figures in Figures 3 and 4, the effectiveness of all techniques is stable when using coverage from different versions. To further confirm our observation, we perform a one-way ANOVA analysis [33] at the significance level of 0.05 to investigate whether using coverage information from different versions incurs any significant effectiveness differences within each technique. In addition, we further perform the Tukey HSD post-hoc test to rank the effectiveness when using coverage information of different versions as different groups for each technique. The detailed results for traditional/time-aware test prioritization are shown in Tables 2/3. In each table, Columns 1 and 2 list all the techniques and subjects. Column 3 presents the one-way ANOVA analysis results, and Column 4 presents the detailed Tukey HSD grouping results (“a” denotes the group with the best effectiveness) for using coverage data of different versions when considering new tests. Similarly, Columns 5 and 6 present the corresponding one-way ANOVA analysis and Tukey HSD test results when excluding new tests.
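A minimal sketch of this analysis for one technique on one subject is shown below, assuming SciPy and statsmodels (the paper does not state which statistical tooling was used); apfd_by_version maps a coverage version such as 'v1' to the list of APFD values over fault groups, and the compact letter groupings ("a", "b", ...) would then be derived from the pairwise Tukey results:

from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyze_versions(apfd_by_version, alpha=0.05):
    """One-way ANOVA across coverage versions, plus Tukey HSD post-hoc test."""
    groups = list(apfd_by_version.values())
    _, p_value = f_oneway(*groups)                           # one-way ANOVA
    values = [v for g in groups for v in g]
    labels = [ver for ver, g in apfd_by_version.items() for _ in g]
    tukey = pairwise_tukeyhsd(values, labels, alpha=alpha)   # pairwise comparisons
    return p_value, tukey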

From the tables, we observe that when considering new tests, the effectiveness of each technique differs significantly (rejecting the null hypothesis of the one-way ANOVA analysis with p < 0.05) when using coverage data from different versions for the vast majority of the studied subjects. To illustrate, the search-based technique for traditional test prioritization shows significant differences on all the studied subjects, and all the studied time-aware test prioritization techniques show significant differences on all subjects.


Tech    Sub            | With new tests: p-value  v1 v2 v3 v4 v5 v6 v7 v8 v9 | Without new tests: p-value  v1 v2 v3 v4 v5 v6 v7 v8 v9

Total   jopt-simple    | p<0.05  b b a a – – – – –       | 1.000   a a a a – – – – –
        vraptor-core   | p<0.05  a d c b b – – – –       | 0.942   a a a a a – – – –
        metrics-core   | p<0.05  b b b b b a – – –       | 1.000   a a a a a a – – –
        java-apns      | p<0.05  c b a a a a a d –       | p<0.05  a a b b b b b b –
        assertj-core   | p<0.05  e d d d d c b a –       | 0.785   a a a a a a a a –
        jasmine-maven  | p<0.05  c b a a a – – – –       | p<0.05  b a a a a – – – –
        scribe-java    | p<0.05  a a a a a ab ab b –     | 0.458   a a a a a a a a –
        la4j           | p<0.05  a b c c cd de d d e     | p<0.05  b c cd cd a cd ab cd d

Addit.  jopt-simple    | p<0.05  c c b a – – – – –       | 0.925   a a a a – – – – –
        vraptor-core   | p<0.05  a d b c c – – – –       | 0.271   a a a a a – – – –
        metrics-core   | p<0.05  b b b b b a – – –       | 0.983   a a a a a a – – –
        java-apns      | p<0.05  c b a a a a a c –       | 0.732   a a a a a a a a –
        assertj-core   | p<0.05  e d d d d c b a –       | 1.000   a a a a a a a a –
        jasmine-maven  | p<0.05  d c b a a – – – –       | p<0.05  b a a a a – – – –
        scribe-java    | p<0.05  b b b ab ab a a a –     | p<0.05  b ab ab ab ab a ab ab –
        la4j           | p<0.05  d f g f e c b b a       | p<0.05  ab ab de cde bc bcd e ab a

ILP-T   jopt-simple    | p<0.05  b c a a – – – – –       | 0.933   a a a a – – – – –
        vraptor-core   | p<0.05  c b a a a – – – –       | p<0.05  b b ab a a – – – –
        metrics-core   | p<0.05  b b b b b a – – –       | 0.809   a a a a a a – – –
        java-apns      | p<0.05  d c b b ab ab ab a –    | p<0.05  a a b b b b b b –
        assertj-core   | p<0.05  e d d d d c b a –       | 0.763   a a a a a a a a –
        jasmine-maven  | p<0.05  c b ab a a – – – –      | p<0.05  b a b a a – – – –
        scribe-java    | p<0.05  b ab a a ab ab a a –    | p<0.05  b b b b b b b a –
        la4j           | p<0.05  a b c cd cd e de b b    | p<0.05  c d c c b c bc a a

ILP-A   jopt-simple    | p<0.05  c d b a – – – – –       | 0.970   a a a a – – – – –
        vraptor-core   | p<0.05  d c b a a – – – –       | p<0.05  b ab a a a – – – –
        metrics-core   | p<0.05  c c c ab b a – – –      | p<0.05  b b b a a a – – –
        java-apns      | p<0.05  e d c c bc bc b a –     | p<0.05  ab b a ab a a ab ab –
        assertj-core   | p<0.05  e d d d d c b a –       | 1.000   a a a a a a a a –
        jasmine-maven  | p<0.05  d c b a a – – – –       | p<0.05  b a b a a – – – –
        scribe-java    | p<0.05  d d d cd cd bc ab a –   | 0.209   a a a a a a a a –
        la4j           | p<0.05  c d f e c b a a a       | p<0.05  bcd b e cde de bcd a bc bcd

Table 3: ANOVA analysis and Tukey's HSD test among using statement coverage of different versions for time-aware prioritization

In addition, for most techniques, the Tukey HSD test shows that using more up-to-date coverage data tends to be more effective. For example, the effectiveness of the search-based technique is grouped into “a” or “ab” when using the most up-to-date coverage, while being grouped into “b” to “d” when using the most obsolete coverage data. On the contrary, when excluding new tests, the studied techniques tend not to show significant differences when using coverage data from different versions. In addition, even in the cases where there are statistical differences, the techniques using the most up-to-date coverage data are usually not categorized into the best-performing group. This observation further confirms that source code changes alone (i.e., excluding the impact of new tests) do not impact the effectiveness of test prioritization techniques much.

3.8.4 RQ4: Comparison with Random Prioritization

As simple random techniques can be powerful in software testing research [34, 35], we also investigate the strengths of the random technique in test prioritization. More specifically, we compare all the studied techniques against random prioritization, both considering new tests and excluding new tests. We chose the Wilcoxon Signed-Rank Test [36] for the comparison because it is suitable even when the sample differences are not normally distributed. Tables 4/5 show the detailed Wilcoxon test results at the 0.05 significance level for the traditional/time-aware test prioritization techniques. In each table, Columns 1 and 2 list all the subjects and techniques. Column 3 lists the statistical test results for comparing each technique against random prioritization when using each version's coverage to prioritize all tests for the latest version. Column 4 presents results similar to Column 3 but excluding new tests. We use m to denote that there is no statistical difference, 4 to denote that the studied technique is significantly better, and 6 to denote that random prioritization is significantly better.
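A minimal sketch of one such comparison is shown below, assuming SciPy and paired APFD samples over the same fault groups; the direction check via medians is our own simplification, not necessarily what the authors did:

from statistics import median
from scipy.stats import wilcoxon

def compare_to_random(technique_apfd, random_apfd, alpha=0.05):
    """Paired Wilcoxon signed-rank test of a technique against random
    prioritization over the same fault groups."""
    _, p_value = wilcoxon(technique_apfd, random_apfd)
    if p_value >= alpha:
        return p_value, "no significant difference"
    if median(technique_apfd) > median(random_apfd):
        return p_value, "technique significantly better"
    return p_value, "random significantly better"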

From the two tables, we observe that random prioritization can be competitive when programs evolve with new tests. For example, when prioritizing tests for v8 of java-apns using coverage data from v7, none of the traditional test prioritization techniques can outperform random prioritization. The results on time-aware test prioritization further confirm this finding. For example, when prioritizing tests for v9 of la4j under the 50% time constraint, random prioritization can outperform all techniques when using coverage data earlier than v7. In contrast, when considering only source code changes (i.e., excluding new tests), a number of techniques always significantly outperform random prioritization in both traditional and time-aware test prioritization. For traditional test prioritization, all the studied techniques except the total technique are at least as effective as random prioritization using coverage data from any version for all subjects. For time-aware test prioritization, the additional and ILP-additional techniques are at least as effective as random prioritization using coverage data from any version for all subjects.

In summary, random prioritization can be competitive when new tests are involved in software evolution. However, state-of-the-art techniques can outperform random prioritization stably when excluding new tests.

3.9 Practical Guidelines

Our study reveals a number of interesting findings that can serve as practical guidelines for test prioritization.

Source code changes vs. test additions. Our experimental results demonstrate that the effectiveness of test prioritization techniques can be greatly influenced by software evolution (including source code changes and test additions). However, when only considering source code changes (i.e.,


Sub Tech With new tests Without new testsv1 v2 v3 v4 v5 v6 v7 v8 v9 v1 v2 v3 v4 v5 v6 v7 v8 v9

jasmine-mavenTotal 6(0.00)6(0.00)6(0.00)m(0.94)m(0.92)– – – – m(0.35)m(0.35)m(0.35)m(0.35)m(0.35)– – – –Addit. 6(0.00)6(0.00)m(0.60)4(0.00)4(0.00)– – – – m(0.35)m(0.35)m(0.35)m(0.35)m(0.35)– – – –ART 6(0.00)6(0.00)m(0.31)4(0.03)4(0.04)– – – – m(0.35)m(0.35)m(0.35)m(0.35)m(0.35)– – – –Search 6(0.00)6(0.00)m(0.42)4(0.00)4(0.00)– – – – m(0.35)m(0.35)m(0.35)m(0.35)m(0.35)– – – –

java-apns Total 6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.01)– 4(0.00)4(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)–Addit. 6(0.00)6(0.00)m(0.08)m(0.13)m(0.84)m(0.82)m(0.38)4(0.00)– 4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)–ART 6(0.00)6(0.00)6(0.00)6(0.00)6(0.02)m(0.08)m(0.06)4(0.00)– 4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)–Search 6(0.00)6(0.00)6(0.04)m(0.15)m(0.82)m(0.75)m(0.23)4(0.00)– 4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)–

jopt-simple Total 6(0.00)6(0.00)6(0.00)6(0.00)– – – – – 4(0.00)4(0.00)4(0.00)4(0.00)– – – – –Addit. 6(0.00)6(0.00)m(0.14)4(0.00)– – – – – 4(0.00)4(0.00)4(0.00)4(0.00)– – – – –ART 6(0.00)6(0.00)m(0.38)4(0.00)– – – – – 4(0.00)4(0.00)4(0.00)4(0.00)– – – – –Search 6(0.00)6(0.00)6(0.03)4(0.00)– – – – – 4(0.00)4(0.00)4(0.00)4(0.00)– – – – –

la4j Total 6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00) 6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)Addit. 6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)4(0.00)4(0.00)4(0.00) 4(0.03)m(0.07)m(0.07)m(0.07)4(0.02)m(0.07)4(0.03)m(0.07)m(0.07)ART 6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)4(0.00)m(0.15)4(0.00) m(0.63)4(0.00)4(0.00)4(0.00)m(0.08)4(0.00)4(0.00)4(0.00)4(0.00)Search 6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)6(0.00)4(0.00)4(0.00)4(0.00) m(0.47)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)4(0.00)

scribe-java
  Total   m(0.73) m(0.53) m(0.69) m(0.88) m(0.13) m(0.09) m(0.18) 6(0.03) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  Addit.  m(0.53) m(0.60) m(0.30) m(0.10) m(0.19) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  ART     m(0.06) m(0.38) m(0.23) 6(0.01) 6(0.03) m(0.49) 6(0.02) 6(0.02) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.01) 4(0.00) 4(0.00) 4(0.02) 4(0.00) –
  Search  m(0.44) m(0.53) m(0.31) m(0.10) m(0.24) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –

vraptor-core
  Total   6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) – – – –  |  6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) – – – –
  Addit.  m(0.10) 4(0.02) 4(0.00) 4(0.00) 4(0.00) – – – –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – – –
  ART     6(0.02) m(0.09) m(0.52) m(0.51) m(0.99) – – – –  |  m(0.88) m(0.19) m(0.81) m(0.56) m(0.39) – – – –
  Search  m(0.43) m(0.29) m(0.08) 4(0.00) 4(0.00) – – – –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – – –

metrics-core
  Total   6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) – – –  |  6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) – – –
  Addit.  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.05) 4(0.00) – – –  |  4(0.03) 4(0.03) 4(0.03) 4(0.03) 4(0.02) 4(0.03) – – –
  ART     m(0.43) m(0.91) 4(0.00) 4(0.00) m(0.12) 4(0.00) – – –  |  4(0.00) 4(0.00) 4(0.02) 4(0.00) 4(0.00) 4(0.00) – – –
  Search  m(0.70) 4(0.01) m(0.54) 4(0.00) m(0.88) 4(0.00) – – –  |  m(0.92) m(0.81) m(0.20) m(0.33) 4(0.00) m(0.16) – – –

assertj-core
  Total   m(0.26) m(0.89) m(0.64) m(0.09) m(0.10) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  Addit.  m(0.10) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  ART     m(0.58) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  Search  6(0.01) 4(0.00) 4(0.04) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –

Table 4: Wilcoxon tests between each studied technique and the random technique based on statement coverage for traditional prioritization (p values included in brackets)

Sub / Tech    With new tests: v1 v2 v3 v4 v5 v6 v7 v8 v9  |  Without new tests: v1 v2 v3 v4 v5 v6 v7 v8 v9

jasmine-maven
  Total   6(0.00) 6(0.00) m(0.13) m(0.40) m(0.51) – – – –  |  m(0.50) m(0.50) m(0.50) m(0.50) m(0.50) – – – –
  Addit.  6(0.00) 6(0.00) 4(0.00) 4(0.00) 4(0.00) – – – –  |  m(0.50) m(0.50) m(0.50) m(0.50) m(0.50) – – – –
  ILP-T   6(0.00) 6(0.00) m(0.56) m(0.71) m(0.74) – – – –  |  m(0.50) m(0.50) m(0.50) m(0.50) m(0.50) – – – –
  ILP-A   6(0.00) m(0.68) 4(0.00) 4(0.00) 4(0.00) – – – –  |  m(0.50) m(0.50) m(0.50) m(0.50) m(0.50) – – – –

java-apns
  Total   6(0.00) 6(0.00) 4(0.05) 4(0.03) 4(0.00) 4(0.00) 4(0.00) 6(0.00) –  |  4(0.00) 4(0.00) 6(0.02) 6(0.02) 6(0.02) 6(0.02) 6(0.02) 6(0.02) –
  Addit.  6(0.00) 6(0.01) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 6(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  ILP-T   6(0.00) m(0.17) 4(0.01) 4(0.01) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) m(0.35) m(0.32) m(0.32) m(0.32) m(0.25) m(0.27) –
  ILP-A   6(0.00) m(0.08) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –

jopt-simple
  Total   6(0.00) 6(0.00) 6(0.00) 6(0.00) – – – – –  |  m(0.73) m(0.78) m(0.85) m(0.76) – – – – –
  Addit.  6(0.00) 6(0.00) 4(0.02) 4(0.00) – – – – –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) – – – – –
  ILP-T   6(0.00) 6(0.00) 6(0.00) 6(0.01) – – – – –  |  m(0.09) m(0.10) m(0.35) m(0.20) – – – – –
  ILP-A   6(0.00) 6(0.00) 4(0.01) 4(0.00) – – – – –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) – – – – –

la4j
  Total   6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00)  |  6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00)
  Addit.  6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 4(0.00) 4(0.00) 4(0.00)  |  4(0.00) 4(0.00) m(0.37) m(0.23) 4(0.00) 4(0.00) m(0.22) 4(0.00) 4(0.00)
  ILP-T   6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00)  |  6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00)
  ILP-A   6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 4(0.00) 4(0.00) 4(0.00)  |  4(0.00) 4(0.00) m(0.66) 4(0.01) m(0.05) 4(0.00) 4(0.00) 4(0.00) 4(0.00)

scribe-java
  Total   4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.01) m(0.97) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  Addit.  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  ILP-T   m(0.53) 4(0.02) 4(0.00) 4(0.00) 4(0.01) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  ILP-A   4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –

vraptor-core
  Total   4(0.00) m(0.91) 4(0.00) 4(0.00) 4(0.00) – – – –  |  6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) – – – –
  Addit.  4(0.00) 4(0.03) 4(0.00) 4(0.00) 4(0.00) – – – –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – – –
  ILP-T   4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – – –  |  6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) – – – –
  ILP-A   4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – – –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – – –

metrics-core
  Total   m(0.20) m(0.20) m(0.20) m(0.16) m(0.46) 4(0.00) – – –  |  6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) – – –
  Addit.  6(0.00) 6(0.00) 6(0.00) 6(0.00) 6(0.00) m(0.44) – – –  |  m(0.35) m(0.26) m(0.24) m(0.35) m(0.15) m(0.08) – – –
  ILP-T   4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – –
  ILP-A   4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) – – –

assertj-core
  Total   6(0.00) m(0.49) m(0.43) m(0.13) 4(0.03) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  Addit.  m(0.52) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  ILP-T   6(0.00) m(0.83) m(0.14) 4(0.02) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –
  ILP-A   4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –  |  4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) 4(0.00) –

Table 5: Wilcoxon tests between each studied technique and the random technique based on statement coverage for time-aware prioritization under the 50% time constraint (p values included in brackets)

Source code changes vs. test additions. Our experimental results demonstrate that the effectiveness of test prioritization techniques can be greatly influenced by software evolution (including both source code changes and test additions). However, when only considering source code changes (i.e., excluding test additions), test prioritization techniques perform stably even when there are massive changes (i.e., when coverage data of a very early version is used to prioritize tests for the current version). This finding has implications for both researchers and developers: they need to consider the influence of test additions when proposing new test prioritization techniques or when applying test prioritization in practice.

Traditional prioritization vs. time-aware prioritization. Our experimental results demonstrate that the impacts of software evolution (both source code changes and test additions) are similar for traditional and time-aware test prioritization.

Prioritization using various coverage criteria. Our experimental results on different coverage criteria demonstrate that the impacts of software evolution are similar across the various coverage criteria (detailed data are available online).

Prioritization using various revision granularities. Different revision granularities (i.e., the distance between the version used for coverage collection and the version for which tests are prioritized) can greatly impact test prioritization results. In particular, a finer revision granularity usually produces more effective test prioritization results, because fewer tests tend to be added between closer versions.

Prioritization techniques and random prioritization. Our experimental results show that random prioritization can be competitive when software evolution involves test additions. However, state-of-the-art test prioritization techniques always outperform random prioritization when only source code changes are considered (i.e., excluding new tests). Among the studied techniques, the additional and search-based techniques are the most effective for traditional test prioritization, while the ILP-additional technique is the most effective for time-aware test prioritization.
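To make the strategies named above concrete, the following Python sketch illustrates the total and additional coverage-based strategies in their simplest form. It is a simplified illustration with hypothetical test names and statement identifiers, not the implementation evaluated in our study.

```python
# Simplified sketches of the total and additional coverage-based strategies
# (hypothetical data; not the exact implementation evaluated in this study).
from typing import Dict, List, Set

def total_prioritization(coverage: Dict[str, Set[int]]) -> List[str]:
    """Total strategy: order tests by the number of covered units."""
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def additional_prioritization(coverage: Dict[str, Set[int]]) -> List[str]:
    """Additional strategy: repeatedly pick the test covering the most
    not-yet-covered units, resetting once no remaining test adds coverage."""
    remaining = sorted(coverage)          # deterministic tie-breaking by name
    covered: Set[int] = set()
    order: List[str] = []
    while remaining:
        best = max(remaining, key=lambda t: len(coverage[t] - covered))
        if not coverage[best] - covered:
            if covered:
                covered = set()           # reset and reconsider remaining tests
                continue
            order.extend(remaining)       # remaining tests cover nothing at all
            break
        order.append(best)
        covered |= coverage[best]
        remaining.remove(best)
    return order

# Toy example: per-test statement coverage.
cov = {"t1": {1, 2, 3, 4}, "t2": {3, 4, 5}, "t3": {6}, "t4": {1, 2}}
print(total_prioritization(cov))       # ['t1', 't2', 't4', 't3']
print(additional_prioritization(cov))  # ['t1', 't2', 't3', 't4']
```

Under a time budget, the same greedy loop can simply be cut off once the accumulated (estimated) execution time of the selected tests exceeds the budget, which is the intuition behind the time-aware variants; the ILP-based techniques instead formulate this selection as an optimization problem rather than solving it greedily.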


4. RELATED WORK

In the literature, a considerable amount of research focuses on test prioritization; Yoo and Harman [37] recently conducted a survey on regression test minimization [23, 38–40], selection [6, 24, 41], and prioritization [3, 7, 42–46], which provides a broad view of the state of the art in test prioritization.

Prioritization Strategies. The research in this category mainly focuses on the strategies or algorithms used in test prioritization. Among them, the traditional total and additional strategies [3, 47] are the most widely used. Building on the basic mechanisms of the total and additional techniques, Zhang et al. [9] presented a unified test prioritization approach that bridges the gap between them based on the probability that a test can detect faults, generating a series of prioritization strategies in between. Li et al. [8] transformed test prioritization into a search problem and presented two search algorithms (i.e., a hill-climbing algorithm and a genetic algorithm) for test prioritization. Motivated by random test prioritization, Jiang et al. [7] presented adaptive random test prioritization, which tends to select the test that is the farthest from the already selected tests. Tonella et al. [48] presented a machine-learning-based technique, which prioritizes tests based on user knowledge. Yoo et al. [45] proposed a cluster-based test prioritization technique, which clusters tests based on their dynamic behavior. More recently, Nguyen et al. [49] and Saha et al. [50] proposed prioritizing tests based on information retrieval. In this paper, we do not present a novel test prioritization strategy; rather, we investigate the effectiveness of typical test prioritization techniques through an empirical study.

Coverage Criteria. The research in this category mainly focuses on the coverage criteria used in test prioritization. Besides the most widely used criteria (e.g., statement and method coverage [3], block coverage [51], modified condition/decision coverage [52], and statically estimated method coverage [53, 54]), Elbaum et al. [46] proposed fault-exposing-potential and fault-index coverage to guide prioritization. Mei et al. proposed a family of dataflow testing criteria [55], as well as three levels of coverage criteria (i.e., workflow, XPath, and WSDL) for service-oriented business applications [56]. Xu and Ding [57] proposed transition coverage and round-trip coverage to prioritize aspect tests. Thomas et al. [58] proposed building topics from the linguistic data in tests to guide test prioritization. In this paper, we do not present any new coverage criteria, but investigate the effectiveness of test prioritization based on widely used coverage criteria (i.e., statement coverage, method coverage, and branch coverage).

Constraints. The research in this category mainly focuses on test prioritization under various constraints, e.g., test cost and fault severity [59, 60]. That is, researchers investigated how to prioritize tests under these constraints. In particular, Hou et al. [61] proposed a test prioritization technique with quota constraints for service-centric systems. Kim and Porter [62] proposed a history-based test prioritization technique with time and resource constraints. To address the time constraint, Walcott et al. [19] proposed a genetic algorithm, Zhang et al. [16] proposed an integer-linear-programming-based technique, and Do et al. [63] presented an empirical study investigating the effect of time constraints. In this paper, our empirical study considers test prioritization both without any constraint and with only time constraints.

Measurements. The research in this category mainly focuses on how to measure the effectiveness of test prioritization techniques. To date, the average percentage of faults detected (APFD) [3] is the most widely used metric for evaluating test prioritization. Based on APFD, Elbaum et al. [59, 64] proposed a weighted APFD that assigns higher weights to tests with lower cost and/or with the capability of detecting faults with higher severity. Walcott et al. [19] extended APFD with a time penalty, and Do and Rothermel [65, 66] presented comprehensive economic models to accurately assess trade-offs between techniques in industrial testing environments. In this paper, we use APFD to measure the test prioritization results (a small computation sketch is included at the end of this section), and will consider other measurements in the future.

Empirical Studies. The research in this category mainly focuses on investigating test prioritization by studying the influence of its internal or external factors through empirical studies. In particular, Rothermel et al. [3] investigated the influence of coverage criteria (e.g., statement coverage, branch coverage, and probability of exposing faults). Do et al. [63] investigated the influence of time constraints and found that time constraints affect test prioritization techniques differently. Rothermel et al. [67] investigated the influence of fault types (i.e., real faults, mutation faults, and seeded faults). Elbaum et al. [46] investigated the influence of specific versions, coverage criteria (i.e., statement coverage and function coverage), and predictors of fault-proneness. Similar to this prior work, this paper also presents an empirical study on test prioritization, but it focuses on the influence of the important threat resulting from real software evolution, which has never been investigated before.
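As a reference for the measurement discussion above, the following Python sketch shows how APFD is typically computed from a prioritized test order and fault-detection information. It is an illustrative sketch rather than our experimental harness, and the example data are hypothetical.

```python
# Illustrative APFD computation (hypothetical detection data); follows the
# usual formula APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n),
# where TF_i is the 1-based position of the first test detecting fault i.
from typing import List, Set

def apfd(order: List[str], detected_by: List[Set[str]]) -> float:
    """APFD of a prioritized order `order` (n tests) for m faults, where
    detected_by[i] is the set of tests that detect fault i."""
    n, m = len(order), len(detected_by)
    position = {test: idx + 1 for idx, test in enumerate(order)}  # 1-based
    tf_sum = sum(min(position[t] for t in detectors if t in position)
                 for detectors in detected_by)
    return 1.0 - tf_sum / (n * m) + 1.0 / (2 * n)

# Toy example: 4 tests, 2 faults.
order = ["t3", "t1", "t4", "t2"]
faults = [{"t1"}, {"t2", "t4"}]   # fault 0 found by t1; fault 1 by t2 or t4
print(apfd(order, faults))        # TF values 2 and 3 -> 1 - 5/8 + 1/8 = 0.5
```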

5. CONCLUSION

Although test prioritization has been intensively investigated, little effort has been devoted to investigating its threats to validity. In this paper, we pointed out an important threat resulting from real-world software evolution, which involves both source code changes and test additions and is usually ignored by existing work on test prioritization. Without considering this threat, it is unclear whether the conclusions drawn by existing work remain valid for real-world software evolution. Therefore, we conducted the first empirical study to investigate the influence of this threat on test prioritization and obtained the following main findings: (1) both the traditional and time-aware test prioritization techniques studied in this paper become much less effective when software evolution involves test additions; (2) both traditional and time-aware test prioritization techniques remain stable and effective when only source code changes occur (i.e., without test additions).

6. ACKNOWLEDGEMENTS

We thank August Shi from the University of Illinois for sharing the experimental subjects with us. This research was sponsored by faculty startup funds from the University of Texas at Dallas, the National Basic Research Program of China (973) under Grant No. 2014CB347701, and the National Natural Science Foundation of China under Grant Nos. 61421091, 91318301, 61225007, and 61522201.


7. REFERENCES

[1] C. Kaner, "Improving the maintainability of automated test suites," in Proc. QW, 1997.

[2] P. K. Chittimalli and M. J. Harrold, "Recomputing coverage information to assist regression testing," IEEE TSE, vol. 35, no. 4, pp. 452–469, 2009.

[3] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, "Test case prioritization: an empirical study," in ICSM, pp. 179–188, 1999.

[4] W. E. Wong, J. R. Horgan, S. London, and H. Agrawal, "A study of effective regression testing in practice," in Proc. ISSRE, pp. 264–274, 1997.

[5] M. J. Harrold, "Testing evolving software," JSS, vol. 47, no. 2–3, pp. 173–181, 1999.

[6] G. Rothermel and M. Harrold, "A safe, efficient regression test selection technique," TOSEM, vol. 6, no. 2, pp. 173–210, 1997.

[7] B. Jiang, Z. Zhang, W. K. Chan, and T. H. Tse, "Adaptive random test case prioritization," in ASE, pp. 257–266, 2009.

[8] Z. Li, M. Harman, and R. Hierons, "Search algorithms for regression test case prioritisation," TSE, vol. 33, no. 4, pp. 225–237, 2007.

[9] L. Zhang, D. Hao, L. Zhang, G. Rothermel, and H. Mei, "Bridging the gap between the total and additional test-case prioritization strategies," in ICSE, pp. 192–201, 2013.

[10] D. Hao, L. Zhang, L. Zhang, G. Rothermel, and H. Mei, "A unified test case prioritization approach," TOSEM, vol. 10, pp. 1–31, 2014.

[11] S. e Zehra Haidry and T. Miller, "Using dependency structures for prioritization of functional test suites," TSE, vol. 39, no. 2, pp. 258–275, 2013.

[12] H. Mei, D. Hao, L. Zhang, L. Zhang, J. Zhou, and G. Rothermel, "A static approach to prioritizing JUnit test cases," TSE, vol. 38, no. 6, pp. 1258–1275, 2012.

[13] C. Zhang, A. Groce, and M. A. Alipour, "Using test case reduction and prioritization to improve symbolic execution," in Proc. ISSTA, pp. 160–170, 2014.

[14] L. Zhang, D. Marinov, and S. Khurshid, "Faster mutation testing inspired by test prioritization and reduction," in Proc. ISSTA, pp. 235–245, 2013.

[15] H. Do, S. Mirarab, L. Tahvildari, and G. Rothermel, "The effects of time constraints on test case prioritization: A series of controlled experiments," TSE, vol. 36, no. 5, pp. 593–617, 2010.

[16] L. Zhang, S.-S. Hou, C. Guo, T. Xie, and H. Mei, "Time-aware test-case prioritization using integer linear programming," in ISSTA, pp. 213–224, 2009.

[17] M. Harman, S. A. Mansouri, and Y. Zhang, "Search-based software engineering: Trends, techniques and applications," CSUR, vol. 45, no. 1, p. 11, 2012.

[18] M. Harman and B. F. Jones, "Search based software engineering," IST, vol. 43, no. 14, pp. 833–839, 2001.

[19] K. R. Walcott, M. L. Soffa, G. M. Kapfhammer, and R. S. Roos, "Time aware test suite prioritization," in ISSTA, pp. 1–11, 2006.

[20] S. Alspaugh, K. R. Walcott, M. Belanich, G. M. Kapfhammer, and M. L. Soffa, "Efficient time-aware prioritization with knapsack solvers," in Workshop on Empirical Assessment of Software Engineering Languages and Technologies, pp. 17–31, 2007.

[21] L. Zhang, M. Kim, and S. Khurshid, "Localizing failure-inducing program edits based on spectrum information," in ICSM, pp. 23–32, 2011.

[22] L. Zhang, M. Kim, and S. Khurshid, "FaultTracer: a change impact and regression fault analysis tool for evolving Java programs," in FSE, p. 40, 2012.

[23] A. Shi, A. Gyori, M. Gligoric, A. Zaytsev, and D. Marinov, "Balancing trade-offs in test-suite reduction," in Proc. FSE, pp. 246–256, 2014.

[24] A. Shi, T. Yung, A. Gyori, and D. Marinov, "Comparing and combining test-suite reduction and regression test selection," in FSE, pp. 237–247, 2015.

[25] A. Gyori, A. Shi, F. Hariri, and D. Marinov, "Reliable testing: Detecting state-polluting tests to prevent test dependency," in ISSTA, pp. 223–233, 2015.

[26] J. H. Andrews, L. C. Briand, and Y. Labiche, "Is mutation an appropriate tool for testing experiments?," in ICSE, pp. 402–411, 2005.

[27] H. Do and G. Rothermel, "A controlled experiment assessing test case prioritization techniques via mutation faults," in ICSM, pp. 411–420, 2005.

[28] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, "Are mutants a valid substitute for real faults in software testing," in Proc. FSE, pp. 654–665, 2014.

[29] H. Do and G. Rothermel, "On the use of mutation faults in empirical assessments of test case prioritization techniques," TSE, vol. 32, no. 9, pp. 733–752, 2006.

[30] A. Arcuri and L. Briand, "A practical guide for using statistical tests to assess randomized algorithms in software engineering," in ICSE, pp. 1–10, 2011.

[31] R. K. Saha, L. Zhang, S. Khurshid, and D. E. Perry, "An information retrieval approach for regression test prioritization based on program changes," 2015. To appear.

[32] R. Just, D. Jalali, and M. D. Ernst, "Defects4J: A database of existing faults to enable controlled testing studies for Java programs," in ISSTA, pp. 437–440, 2014.

[33] T. H. Wonnacott and R. J. Wonnacott, Introductory Statistics. Wiley, New York, 1972.

[34] L. Zhang, S.-S. Hou, J.-J. Hu, T. Xie, and H. Mei, "Is operator-based mutant selection superior to random mutant selection?," in ICSE, pp. 435–444, 2010.

[35] Y. Qi, X. Mao, Y. Lei, Z. Dai, and C. Wang, "The strength of random search on automated program repair," in ICSE, pp. 254–265, 2014.

[36] F. Wilcoxon, "Individual comparisons by ranking methods," in Breakthroughs in Statistics, pp. 196–202, 1992.

[37] S. Yoo and M. Harman, "Regression testing minimization, selection and prioritization: a survey," STVR, vol. 22, no. 2, pp. 67–120, 2012.

[38] J. Black, E. Melachrinoudis, and D. Kaeli, "Bi-criteria models for all-uses test suite reduction," in ICSE, pp. 106–115, 2004.

[39] G. Rothermel, M. J. Harrold, J. Von Ronne, and C. Hong, "Empirical studies of test-suite reduction," STVR, vol. 12, no. 4, pp. 219–249, 2002.


[40] L. Zhang, D. Marinov, L. Zhang, and S. Khurshid, "An empirical study of JUnit test-suite reduction," in ISSRE, pp. 170–179, 2011.

[41] T. Ball, "On the limit of control flow analysis for regression test selection," in ISSTA, pp. 134–142, 1998.

[42] J. Chen, Y. Bai, D. Hao, Y. Xiong, H. Zhang, L. Zhang, and B. Xie, "A text-vector based approach to test case prioritization," in ICST, 2016. To appear.

[43] Y. Lou, D. Hao, and L. Zhang, "Mutation-based test-case prioritization in software evolution," in ISSRE, pp. 46–57, 2015.

[44] D. Hao, L. Zhang, L. Zang, Y. Wang, X. Wu, and T. Xie, "To be optimal or not in test-case prioritization," TSE, 2015. To appear.

[45] S. Yoo, M. Harman, P. Tonella, and A. Susi, "Clustering test cases to achieve effective and scalable prioritisation incorporating expert knowledge," in Proc. ISSTA, pp. 201–212, 2009.

[46] S. Elbaum, A. Malishevsky, and G. Rothermel, "Test case prioritization: A family of empirical studies," TSE, vol. 28, no. 2, pp. 159–182, 2002.

[47] S. Elbaum, A. Malishevsky, and G. Rothermel, "Prioritizing test cases for regression testing," in ISSTA, pp. 102–112, 2000.

[48] P. Tonella, P. Avesani, and A. Susi, "Using the case-based ranking methodology for test case prioritization," in Proc. ICSM, pp. 123–133, 2006.

[49] C. D. Nguyen, A. Marchetto, and P. Tonella, "Test case prioritization for audit testing of evolving web services using information retrieval techniques," in Proc. ICWS, pp. 636–643, 2011.

[50] R. K. Saha, L. Zhang, S. Khurshid, and D. E. Perry, "An information retrieval approach for regression test prioritization based on program changes," in Proc. ICSE, 2015. To appear.

[51] H. Do, G. Rothermel, and A. Kinneer, "Empirical studies of test case prioritization in a JUnit testing environment," in ISSRE, pp. 113–124, 2004.

[52] J. A. Jones and M. J. Harrold, "Test-suite reduction and prioritization for modified condition/decision coverage," in ICSM, pp. 92–101, 2001.

[53] L. Zhang, J. Zhou, D. Hao, L. Zhang, and H. Mei, "Prioritizing JUnit test cases in absence of coverage information," in ICSM, pp. 19–28, 2009.

[54] H. Mei, D. Hao, L. Zhang, L. Zhang, J. Zhou, and G. Rothermel, "A static approach to prioritizing JUnit test cases," TSE, vol. 38, no. 6, pp. 1258–1275, 2012.

[55] L. Mei, W. K. Chan, and T. H. Tse, "Data flow testing of service-oriented workflow applications," in ICSE, pp. 371–380, 2008.

[56] L. Mei, Z. Zhang, W. K. Chan, and T. H. Tse, "Test case prioritization for regression testing of service-oriented business applications," in WWW, pp. 901–910, 2009.

[57] D. Xu and J. Ding, "Prioritizing state-based aspect tests," in Proc. ICST, pp. 265–274, 2010.

[58] S. W. Thomas, H. Hemmati, A. E. Hassan, and D. Blostein, "Static test case prioritization using topic models," ESE, vol. 19, no. 1, pp. 182–212, 2014.

[59] S. Elbaum, A. Malishevsky, and G. Rothermel, "Incorporating varying test costs and fault severities into test case prioritization," in ICSE, pp. 329–338, 2001.

[60] H. Park, H. Ryu, and J. Baik, "Historical value-based approach for cost-cognizant test case prioritization to improve the effectiveness of regression testing," in SSIRI, pp. 39–46, 2008.

[61] S.-S. Hou, L. Zhang, T. Xie, and J. Sun, "Quota-constrained test-case prioritization for regression testing of service-centric systems," in ICSM, pp. 257–266, 2008.

[62] J. M. Kim and A. Porter, "A history-based test prioritization technique for regression testing in resource constrained environments," in ICSE, pp. 119–129, 2002.

[63] H. Do, S. Mirarab, L. Tahvildari, and G. Rothermel, "An empirical study of the effect of time constraints on the cost-benefits of regression testing," in FSE, pp. 71–82, 2008.

[64] A. Malishevsky, J. R. Ruthruff, G. Rothermel, and S. Elbaum, "Cost-cognizant test case prioritization," tech. rep., Department of Computer Science and Engineering, University of Nebraska, 2006.

[65] H. Do and G. Rothermel, "Using sensitivity analysis to create simplified economic models for regression testing," in Proc. ISSTA, pp. 51–62, 2008.

[66] H. Do and G. Rothermel, "An empirical study of regression testing techniques incorporating context and lifecycle factors and improved cost-benefit models," in Proc. FSE, pp. 141–151, 2006.

[67] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, "Prioritizing test cases for regression testing," TSE, vol. 27, no. 10, pp. 929–948, 2001.