Implications of Ceiling Effects in Defect Predictors
PROMISE 2008 - Transcript
Outline
• Approach
• Use More Data
• Use Less Data
• Use Even Less Data
• Discussions
• Examples
• Conclusions
Approach
Other research: try changing the data miners. Various data miners: no ground-breaking improvements.
This research: try changing the training data. Sub-sampling: over-, under-, and micro-sampling.
Hypothesis: static code attributes have limited information content.
Predictions:
• Simple learners can extract the limited information content.
• No need for more complex learners.
• Further progress needs increasing the information content in the data.
State-of-the-Art Defect Predictor
Naive Bayes with simple log-filtering:
• Probability of detection (pd): 75%
• Probability of false alarm (pf): 21%
Other data miners failed to achieve such performance: logistic regression, J48, OneR, complex variants of Bayes, and various others available in WEKA...
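The talk shows no code, but the predictor above can be sketched as a hand-rolled Gaussian Naive Bayes on log-filtered attributes. This is a minimal sketch, not the WEKA implementation used in the study; the `eps` guard and the variance-smoothing constant are assumptions.

```python
import numpy as np

def log_filter(X, eps=1e-6):
    """Replace each numeric attribute x with ln(x).
    The eps floor for non-positive values is an assumption."""
    return np.log(np.maximum(X, eps))

class GaussianNB:
    """Minimal Gaussian Naive Bayes, enough to mirror the slide's setup."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu, self.var, self.prior = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.mu[c] = Xc.mean(axis=0)
            self.var[c] = Xc.var(axis=0) + 1e-9  # smoothing to avoid /0
            self.prior[c] = len(Xc) / len(X)
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            # log P(c) + sum_i log N(x_i | mu_c, var_c)
            ll = -0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                               + (X - self.mu[c]) ** 2 / self.var[c], axis=1)
            scores.append(np.log(self.prior[c]) + ll)
        return self.classes[np.argmax(scores, axis=0)]
```

The point of the slide is that even a learner this simple reaches the reported pd/pf ceiling; the fancier learners listed above did not beat it.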
How Much Data: Use More...
Experimental rig:
• Stratify; |Test| = 100 samples
• N = {100, 200, 300, ...}; |Training| = N * 90% samples
• Randomize and repeat 20 times
• Plot N vs. balance
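The rig can be sketched as follows. The `balance` measure (distance from the ideal point pd = 1, pf = 0, normalized to [0, 1]) matches its usual definition in this line of defect-prediction work; the `learner` callback signature and function names are assumptions for illustration.

```python
import math
import random

def balance(pd, pf):
    """1 - (distance from the ideal point pd=1, pf=0) / sqrt(2)."""
    return 1 - math.sqrt((1 - pd) ** 2 + (0 - pf) ** 2) / math.sqrt(2)

def rig(data, learner, sizes=(100, 200, 300), test_size=100,
        repeats=20, seed=1):
    """Sketch of the experimental rig: hold out a test set, train on
    N*90% samples for growing N, reshuffle and repeat, and average the
    balance scores per N. `learner(train, test)` returns (pd, pf)."""
    rng = random.Random(seed)
    curves = {}
    for n in sizes:
        scores = []
        for _ in range(repeats):
            rows = data[:]
            rng.shuffle(rows)
            test, rest = rows[:test_size], rows[test_size:]
            train = rest[:int(n * 0.9)]
            pd_, pf_ = learner(train, test)
            scores.append(balance(pd_, pf_))
        curves[n] = sum(scores) / len(scores)
    return curves
```

Plotting `curves` against N is what reveals the ceiling: balance flattens out well before all available data is used.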
Over/Under Sampling: Use Less...
Software datasets are not balanced: ~10% defective. Target class: defective (modules).
Under-sampling:
• Use all target-class instances, say N
• Pick N from the other class
• Learn theories on 2N instances
Over-sampling:
• Use all instances from the other class, say M (M > N)
• Using the N target-class instances, populate M instances
• Learn theories on 2M instances
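The two schemes above can be sketched in a few lines. "Populate M instances" is implemented here as resampling with replacement, which is an assumption; the slide does not say how the defective class is grown.

```python
import random

def under_sample(defective, others, rng=random):
    """Keep all N target-class (defective) instances and draw N from
    the other class -> a balanced training set of 2N instances."""
    n = len(defective)
    return defective + rng.sample(others, n)

def over_sample(defective, others, rng=random):
    """Keep all M other-class instances and grow the N defective
    instances to M (resampling with replacement is an assumption)
    -> a balanced training set of 2M instances."""
    m = len(others)
    return [rng.choice(defective) for _ in range(m)] + others
```

Under-sampling throws data away; over-sampling keeps it all but duplicates the rare class. The surprise reported on the next slide is that throwing data away costs nothing.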
Over/Under Sampling: Use Less...
• NB/none is still among the best.
• Sampling with J48 does not out-perform NB.
• NB/none is equivalent to NB/under.
• Under-sampling does not harm classifier performance.
• Theories can be learned from a very small sample of the available data.
Micro-Sampling: Use Even Less...
Given N defective modules:
• M = {25, 50, 75, ...} <= N
• Select M defective and M defect-free modules.
• Learn theories on 2M instances.
• Under-sampling corresponds to M = N.
Results: 8/12 datasets -> M = 25; 1/12 datasets -> M = 75; 3/12 datasets -> M = {200, 575, 1025}.
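Micro-sampling, and the sweep that finds the smallest adequate M, can be sketched as below. The `evaluate` callback, the tolerance, and the function names are assumptions; the step size of 25 follows the slide.

```python
import random

def micro_samples(defective, defect_free, m, rng=random):
    """Draw M defective and M defect-free modules -> 2M instances.
    M = N (all defectives) recovers plain under-sampling."""
    return rng.sample(defective, m) + rng.sample(defect_free, m)

def smallest_adequate_m(defective, defect_free, evaluate, tol=0.05):
    """Sweep M = 25, 50, 75, ... and return the smallest M whose score
    is within `tol` of the best seen. `evaluate` maps a training set
    to a performance score (e.g. balance)."""
    ms = range(25, len(defective) + 1, 25)
    scores = {m: evaluate(micro_samples(defective, defect_free, m))
              for m in ms}
    best = max(scores.values())
    return min(m for m, s in scores.items() if s >= best - tol)
```

The reported result is that this sweep usually stops at M = 25, i.e. 50 training instances suffice for most of the studied datasets.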
Discussions
Incremental Case-Based Reasoning (CBR) vs. Automatic Data Miners (ADM): when is CBR preferable to ADM?
• CBR is impractical with a large number of cases, but our results suggest 50 samples are adequate.
• CBR can perform as well as ADM.
• One step further: CBR can perform better than ADM.
Example 1: Requirement Metrics
This does not mean “use requirement docs” all the time!
• Combine features from whatever sources are available.
• Explore whatever is not a black-box approach.
• Consistent with prior research: SE should make use of domain-specific knowledge!
From: Text Mining
To: NLP
Subject: Semantics
Example 2: Simple Weighting
Combine features wisely!
• Black-box feature selection -> NP-hard.
• Information provided by a black-box approach is not necessarily meaningful to humans.
• Information provided by humans is meaningful for black-boxes.
• Check the validity of the NB assumptions!
Example 3: WHICH Rule Learner
Current practice: learn predictors with criteria P; assess predictors with criteria Q. In general, P ≠ Q.
WHICH supports defining P ≈ Q: learn what you will assess later.
(micro20 means only 20+20 samples.)
• WHICH initially creates a sorted stack of all attribute ranges in isolation.
• It then, based on score, randomly selects two rules from the stack, combines them, and places the new rule in the stack in sorted order.
• It continues to do this until a stopping criterion is met.
• WHICH supports both conjunctions and disjunctions.
• If the two selected rules contain different ranges of the same attribute, those ranges are OR'd together instead of AND'd. For example, combining "outlook=sunny AND rain=true" with "outlook=overcast" yields "outlook=[sunny OR overcast] AND rain=true".
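The combine-and-reinsert step above can be sketched as follows, representing a rule as a dict from attribute to the set of allowed ranges. The uniform pick of two rules and the generic `score` callback are simplifications of WHICH's score-biased selection, not its exact implementation.

```python
import random

def combine(rule_a, rule_b):
    """Combine two WHICH rules (attribute -> set of allowed ranges).
    Ranges of the same attribute are OR'd (set union); different
    attributes are AND'd (both constraints kept)."""
    merged = {}
    for attr in rule_a.keys() | rule_b.keys():
        merged[attr] = rule_a.get(attr, set()) | rule_b.get(attr, set())
    return merged

def which_step(stack, score, rng=random):
    """One WHICH iteration: pick two rules (uniformly here, a
    simplification of the score-biased pick), combine them, and
    reinsert the new rule keeping the stack sorted by score."""
    a, b = rng.sample(stack, 2)
    stack.append(combine(a, b))
    stack.sort(key=score, reverse=True)
    return stack
```

Running `which_step` repeatedly until a stopping criterion fires reproduces the loop described in the bullets: the stack grows, and the best-scoring combined rule rises to the top.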
Example 4: NN-Sampling
Within-company (WC) vs. cross-company (CC) data:
• CC data gives a substantial increase in pd... at the cost of a substantial increase in pf.
• CC data should only be used for mission-critical projects.
• Companies should strive for local (WC) data.
Why? CC data contains a larger space of samples... but it also includes irrelevancies.
How to decrease pf? Remove the irrelevancies by sampling from the CC data.
The same patterns appear in both the NASA MDP and the Turkish washing-machine datasets.
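The NN-sampling idea can be sketched as a nearest-neighbor relevancy filter: keep only the CC rows that lie close to some WC row. The choice of Euclidean distance, k = 10, and the function name are assumptions for illustration, not the paper's exact settings.

```python
import math

def nn_filter(wc, cc, k=10):
    """NN-sampling sketch: for each within-company (WC) row, keep its
    k nearest cross-company (CC) rows by Euclidean distance, then
    train on the union of the kept rows."""
    keep = set()
    for w in wc:
        nearest = sorted(range(len(cc)),
                         key=lambda i: math.dist(w, cc[i]))
        keep.update(nearest[:k])
    return [cc[i] for i in sorted(keep)]
```

Training on `nn_filter(wc, cc)` instead of all of `cc` is what prunes the irrelevancies and, per the slide, brings pf back down.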
Conclusions
• Defect predictors are practical tools.
• Limited information content hypothesis: simple learners can extract the limited information content; no need for more complex learners; further progress needs increasing the information content in the data.
• The current research paradigm has reached its limits.
• Black-box methods lack business knowledge; human-in-the-loop CBR tools should take their place. Practical: small samples to examine. Instantaneous: ADM will run fast. Direction: increase the information content.
• Promise data: OK. What about Promise tools? Increase in information content? Build predictors aligned with business goals.
Future Work
• Benchmark human-in-the-loop CBR against ADM.
• Instead of asking which learner, ask which data.
• Better sampling strategies?
Thanks...
Questions?