Test Case Filtering and Prioritization Based on Coverage of Combinations of Program Elements

Wes Masri and Marwa El-Ghali

American Univ. of Beirut, ECE Department

Beirut, Lebanon

[email protected]

Test Case Filtering

• Test case filtering is concerned with selecting from a test suite T a subset T’ that is capable of revealing most of the defects revealed by T

• Approach: select T’ so that it covers all elements covered by T

Test Case Filtering: What to Cover?

• Existing techniques cover singular program elements of varying granularity: methods, statements, branches, def-use pairs, slice pairs and information flow pairs

• Previous studies have shown that increasing the granularity leads to revealing more defects at the expense of larger subsets

Test Case Filtering

• This work explores covering suspicious combinations of simple program elements

• The number of possible combinations is exponential w.r.t. the number of singular elements, so we use an approximation algorithm

• We use a genetic algorithm

Test Case Filtering: Conjectures

I. Combinations of program elements are more likely to characterize complex failures

II. The percentage of failing tests is typically much smaller than that of the passing tests

– Each defect causes a small number of tests to fail

– Given groups of (structurally) similar tests, smaller ones are more likely to be failure-inducing than larger ones

Test Case Filtering: Steps

1) Given a test suite T, generate execution profiles of simple program elements (statements, branches, and def-use pairs)

2) Choose a threshold Mfail for the maximum number of tests that could fail due to a single defect

3) Use the genetic algorithm to generate C’, a set of combinations of simple program elements that were covered by fewer than Mfail tests (the suspicious combinations)

4) Use a greedy algorithm to extract T’, the smallest subset of T that covers all the combinations in C’
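The greedy extraction in step 4 can be sketched as the standard greedy set-cover heuristic. This is a minimal illustration, not the authors' implementation: the name `greedy_filter` and the dictionary layout (a test id mapped to the set of suspicious combinations it covers) are assumptions made for the example. A greedy pass yields a small covering subset, not a provably minimum one.

```python
def greedy_filter(coverage):
    """Pick tests until every combination covered by the suite is covered.

    coverage: dict mapping a test id to the set of combinations it covers
    (hypothetical layout, chosen for illustration).
    """
    uncovered = set().union(*coverage.values()) if coverage else set()
    selected = []
    while uncovered:
        # Take the test that covers the most still-uncovered combinations.
        best = max(coverage, key=lambda t: len(coverage[t] & uncovered))
        selected.append(best)
        uncovered -= coverage[best]
    return selected


coverage = {
    "t1": {"c1", "c2"},
    "t2": {"c2"},
    "t3": {"c3"},
    "t4": {"c1", "c2", "c3"},
}
print(greedy_filter(coverage))  # ['t4']
```

In this toy suite a single test already covers every combination, so T’ shrinks from four tests to one.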

Genetic Algorithm

• A genetic algorithm solves a problem by

– Operating on an initial population of candidate solutions or chromosomes

– Evaluating their quality using a fitness function

– Using transformations to create new generations with improved quality

– Ultimately evolving to a single solution

Fitness Function

• We use the following equation:

fitness(combination) = 1 - %tests

where %tests is the percentage of test cases that exercised the combination. The smaller the percentage, the higher the fitness.

• The aim is to end up with a manageable set of combinations in which each combination occurred in at most Mfail tests
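The fitness computation can be sketched in a few lines. The sketch makes two assumptions that the slides do not state: each execution profile is a set of element identifiers, and %tests is expressed as a fraction between 0 and 1. The element names (`bb1`, `e12`, and so on) are made up for the example.

```python
def fitness(combination, profiles):
    """Return 1 - (fraction of tests whose profile contains the whole combination)."""
    if not profiles:
        raise ValueError("need at least one execution profile")
    # A test exercises a combination only if it covers every element in it.
    hits = sum(1 for profile in profiles if combination <= profile)
    return 1.0 - hits / len(profiles)


profiles = [
    {"bb1", "bb2", "e12"},
    {"bb1", "bb3"},
    {"bb1", "bb2"},
    {"bb3"},
]
print(fitness({"bb1", "bb2"}, profiles))  # exercised by 2 of 4 tests -> 0.5
print(fitness({"bb2", "e12"}, profiles))  # exercised by 1 of 4 tests -> 0.75
```

Under this reading, a combination exercised by at most Mfail of the |T| tests has a fitness of at least 1 - Mfail/|T|, which gives one way to turn the Mfail threshold into a fitness cutoff.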

Initial Population Generation

• Generated from the union of all execution profiles

• Size: 50 in our implementation

• 0 → 0 always, 1 → 1 with small probability P

Example bit vectors (one bit per program element):

… 0 0 0 0 0 1 0 0 1 … 0 0 1 0 1 0 0 1 0 0 0 0 ...

… 1 0 0 0 0 0 1 0 0 … 1 0 0 1 0 0 0 1 0 0 1 0 ...

… 0 1 0 0 0 0 1 0 0 … 0 0 1 0 0 0 0 0 0 1 1 0 ...

… 1 1 0 1 0 1 1 0 1 … 1 0 1 1 1 0 1 1 0 1 1 0 ...
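A possible reading of the population generation is sketched here. It assumes that each chromosome is a bit vector with one bit per program element, that a bit which is 0 in the union of all profiles is always 0 in a chromosome, and that a bit which is 1 in the union is kept with a small probability P. The function name and the default value of `p` are illustrative only; the slides give the population size (50) but not P.

```python
import random


def initial_population(profiles, size=50, p=0.05, seed=None):
    """Build `size` random chromosomes from the union of all profiles.

    Each chromosome is a list of 0/1 bits, one per program element.
    A bit that is 0 in the union stays 0; a bit that is 1 in the union
    is kept as 1 with probability `p` (the default 0.05 is an assumption).
    """
    rng = random.Random(seed)
    # Bitwise OR of all profiles, column by column.
    union = [int(any(bits)) for bits in zip(*profiles)]
    return [
        [1 if bit and rng.random() < p else 0 for bit in union]
        for _ in range(size)
    ]


profiles = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0],
]
population = initial_population(profiles, size=50, p=0.3, seed=1)
print(len(population), population[0])
```

Because every chromosome only sets bits that appear in the union, each candidate combination consists of elements that at least one test actually executed.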

Transformation Operator

• Combines two parent chromosomes to produce a child

• Passes down properties from each, favoring the parent with the higher fitness

• Goal: the child should have a better fitness than its parents

• Replaces the parent with the worse fitness with the child
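The slides do not specify how the operator favors the fitter parent. One plausible choice, shown here purely as an assumption, is a uniform crossover in which each bit is inherited from a parent with probability proportional to that parent's fitness. The helper names `crossover` and `replace_worse` are hypothetical.

```python
import random


def crossover(parent_a, parent_b, fit_a, fit_b, rng=random):
    """Uniform crossover biased toward the fitter parent (assumed scheme).

    Each bit is copied from parent_a with probability
    fit_a / (fit_a + fit_b), otherwise from parent_b.
    """
    total = fit_a + fit_b
    bias = 0.5 if total == 0 else fit_a / total
    return [a if rng.random() < bias else b for a, b in zip(parent_a, parent_b)]


def replace_worse(population, fitnesses, i, j, child, child_fitness):
    """Overwrite the less fit of parents i and j with the child."""
    worse = i if fitnesses[i] <= fitnesses[j] else j
    population[worse] = child
    fitnesses[worse] = child_fitness
    return worse


rng = random.Random(0)
child = crossover([1, 1, 0, 0], [0, 0, 1, 1], 0.9, 0.6, rng)
print(child)
```

With fitness values 0.9 and 0.6, each bit of the child comes from the first parent with probability 0.6, so the child resembles the fitter parent more often without being a copy of it.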

Solution Set

• The obtained solution set contains all the encountered combinations with high-enough fitness values (the suspicious combinations)

Experimental Work

Our subject programs included:

• The JTidy HTML syntax checker and pretty printer; 1000 tests; 8 defects; 47 failures

• The NanoXML XML parser; 140 tests; 4 defects; 20 failures

Experimental Work

• We profiled the following program elements:

– basic blocks or statements (BB)

– basic-block edges or branches (BBE)

– def-use pairs (DUP)

• Next we applied the genetic algorithm to generate the following:

– a pool of BBcomb

– a pool of BBEcomb

– a pool of DUPcomb

– a pool of ALLcomb (combinations of BBs, BBEs and DUPs)

• The values of Mfail we chose for JTidy and NanoXML were 100 and 20, respectively

Profile Type    % Tests Selected    % Defects Revealed
BB              5.3                 55.0
BBcomb          9.6                 65.6
BBE             6.5                 78.7
BBEcomb         10.2                87.5
DUP             11.7                81.2
DUPcomb         14.1                87.5
ALL             12.4                94.8
ALLcomb         14.1                100.0
SliceP          26.7                100.0

JTidy results:

• In the case of ALLcomb, 14.1% of the original test suite was needed to exercise all of the combinations exercised by the original test suite, and these tests revealed all the defects revealed by the original test suite

• In previous work we showed that coverage of slice pairs (SliceP) performed better than coverage of BB, BBE and DUP; this is why we are including the results of SliceP here for comparison.

The above figure compares the various techniques to random sampling:

1. All variations performed better than random sampling

2. BBcomb revealed 10.6% more defects than BB but selected 4.2% more tests

3. BBEcomb revealed 8.8% more defects than BBE but selected 3.7% more tests

4. DUPcomb revealed 6.3% more defects than DUP but selected 2.4% more tests

5. ALLcomb performed better than SliceP, since it revealed all defects, as SliceP did, but selected 12.6% fewer tests

Experimental Work

• Concerning BBcomb, BBEcomb, and DUPcomb, the additional cost due to the selection of more tests might not be well justified, since the rate of improvement is no better than it is for random sampling

• Concerning ALLcomb, not only did it perform better than SliceP, but it is also considerably less costly

– It took 90 seconds on average per test to generate its profiles (i.e., BBs, BBEs and DUPs), whereas it took 1200 seconds per test to generate the SliceP profiles (1 day vs. 2 weeks)

NanoXML observations:

• BB, BBE, DUP, and ALL did not perform any better than random sampling, whereas BBcomb, BBEcomb, DUPcomb, and ALLcomb performed noticeably better

• BBcomb, BBEcomb, DUPcomb, and ALLcomb revealed all the defects, but at relatively high cost, since over 50% of the tests had to be executed

• The cost of running the genetic algorithm and the greedy selection algorithm has to be factored in when comparing our techniques to others

Test Case Prioritization

• Test case prioritization aims at scheduling the tests in T so that the defects are revealed as early as possible

Summary of our technique

• Prioritize combinations in terms of their suspiciousness

• Then assign the priority of a given combination to the tests that cover it

Test Case Prioritization: Steps

1) Identify combinations that were exercised by 1 test; assign that test priority 1, and add it to T’

2) Identify combinations that were exercised by 2 tests; assign those tests priority 2, and add them to T’

3) … and so on … until all tests are prioritized, or Mfail is exceeded, or all combinations were explored

4) Use the greedy algorithm to reduce T’

5) Any remaining tests that were not prioritized will be scheduled to run randomly following the prioritized tests
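Steps 1 to 3 and the leftover set of step 5 can be sketched as follows. The sketch assumes a dictionary that maps each combination to the set of tests covering it, and it treats Mfail as an inclusive upper bound on k; both are choices made for illustration. The greedy reduction of step 4 and the random scheduling of the leftover tests are omitted. The name `prioritize` is hypothetical.

```python
def prioritize(combo_tests, all_tests, m_fail):
    """Give each test the smallest k such that it covers a combination
    exercised by exactly k tests, for k = 1 .. m_fail.

    combo_tests: dict mapping a combination to the set of tests covering it.
    Returns (priority dict, list of tests left unprioritized).
    """
    priority = {}
    for k in range(1, m_fail + 1):
        for tests in combo_tests.values():
            if len(tests) == k:
                for t in tests:
                    # Keep the earliest (most suspicious) priority per test.
                    priority.setdefault(t, k)
        if all(t in priority for t in all_tests):
            break  # every test already has a priority
    remaining = [t for t in all_tests if t not in priority]
    return priority, remaining


combo_tests = {
    "c1": {"t3"},
    "c2": {"t1", "t2"},
    "c3": {"t1", "t2", "t3", "t4"},
}
priority, remaining = prioritize(combo_tests, ["t1", "t2", "t3", "t4"], m_fail=2)
print(priority, remaining)
```

Here `t3` gets priority 1 because it is the only test covering `c1`, `t1` and `t2` get priority 2, and `t4` is left for the random tail of the schedule because it only covers a combination exercised by more than Mfail tests.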

Element     % Tests    % Defects
BBcomb      6.75       56.25
BBEcomb     7.55       81.25
DUPcomb     12.6       87.5
ALLcomb     13.05      100.0

JTidy prioritization results when step 3 is satisfied, i.e., when all tests are prioritized, or Mfail is exceeded, or all combinations were explored

Observation:

Using BBcomb, BBEcomb, and DUPcomb, not all defects were revealed.

Combinations of BB, BBE, and DUP (ALLcomb) are needed to reveal all defects.

NanoXML prioritization results

Observation:

All defects were revealed using BBcomb, BBEcomb, DUPcomb, or ALLcomb, but at a high cost of selected tests.

Element     % Tests    % Defects
BBcomb      50.2       100.0
BBEcomb     50.8       100.0
DUPcomb     52.8       100.0
ALLcomb     53.5       100.0

Conclusion

• Our techniques performed better than similar coverage-based techniques that consider program elements of the same type and that do not take into account their combinations

• Will conduct a more thorough empirical study

• Will use APFD (Average Percentage of Faults Detected) approach to evaluate prioritization
