Implications of Ceiling Effects in Defect Predictors
PROMISE 2008 - Transcript
Outline
• Approach
• Use More Data
• Use Less Data
• Use Even Less Data
• Discussions
• Examples
• Conclusions
Approach
Other research: try changing the data miners. Various data miners: no ground-breaking improvements.
This research: try changing the training data. Sub-sampling: over-, under-, and micro-sampling.
Hypothesis: static code attributes have limited information content.
Predictions:
• Simple learners can extract the limited information content.
• No need for more complex learners.
• Further progress needs increasing the information content in the data.
State-of-the-Art Defect Predictor
Naive Bayes with simple log-filtering:
• Probability of detection (pd): 75%
• Probability of false alarm (pf): 21%
Other data miners failed to achieve such performance: logistic regression, J48, OneR, complex variants of Bayes, and various others available in WEKA...
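The talk shows no code, but the predictor above can be sketched as a hand-rolled Gaussian Naive Bayes on log-filtered attributes. This is a minimal sketch, not the WEKA implementation used in the study; the `eps` guard and the variance-smoothing constant are assumptions.

```python
import numpy as np

def log_filter(X, eps=1e-6):
    """Replace each numeric attribute x with ln(x).
    The eps floor for non-positive values is an assumption."""
    return np.log(np.maximum(X, eps))

class GaussianNB:
    """Minimal Gaussian Naive Bayes, enough to mirror the slide's setup."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu, self.var, self.prior = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.mu[c] = Xc.mean(axis=0)
            self.var[c] = Xc.var(axis=0) + 1e-9  # smoothing to avoid /0
            self.prior[c] = len(Xc) / len(X)
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            # log P(c) + sum_i log N(x_i | mu_c, var_c)
            ll = -0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                               + (X - self.mu[c]) ** 2 / self.var[c], axis=1)
            scores.append(np.log(self.prior[c]) + ll)
        return self.classes[np.argmax(scores, axis=0)]
```

The point of the slide is that even a learner this simple reaches the reported pd/pf ceiling; the fancier learners listed above did not beat it.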
How Much Data: Use More...
Experimental rig:
• Stratify; |Test| = 100 samples
• N = {100, 200, 300, ...}; |Training| = N * 90% samples
• Randomize and repeat 20 times
• Plot N vs. balance
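The rig can be sketched as follows. The `balance` measure (distance from the ideal point pd = 1, pf = 0, normalized to [0, 1]) matches its usual definition in this line of defect-prediction work; the `learner` callback signature and function names are assumptions for illustration.

```python
import math
import random

def balance(pd, pf):
    """1 - (distance from the ideal point pd=1, pf=0) / sqrt(2)."""
    return 1 - math.sqrt((1 - pd) ** 2 + (0 - pf) ** 2) / math.sqrt(2)

def rig(data, learner, sizes=(100, 200, 300), test_size=100,
        repeats=20, seed=1):
    """Sketch of the experimental rig: hold out a test set, train on
    N*90% samples for growing N, reshuffle and repeat, and average the
    balance scores per N. `learner(train, test)` returns (pd, pf)."""
    rng = random.Random(seed)
    curves = {}
    for n in sizes:
        scores = []
        for _ in range(repeats):
            rows = data[:]
            rng.shuffle(rows)
            test, rest = rows[:test_size], rows[test_size:]
            train = rest[:int(n * 0.9)]
            pd_, pf_ = learner(train, test)
            scores.append(balance(pd_, pf_))
        curves[n] = sum(scores) / len(scores)
    return curves
```

Plotting `curves` against N is what reveals the ceiling: balance flattens out well before all available data is used.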
Over/Under Sampling: Use Less...
Software datasets are not balanced: ~10% defective. Target class: defective (modules).
Under-sampling:
• Use all target-class instances, say N
• Pick N from the other class
• Learn theories on 2N instances
Over-sampling:
• Use all instances from the other class, say M (M > N)
• Using the N target-class instances, populate M instances
• Learn theories on 2M instances
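The two schemes above can be sketched in a few lines. "Populate M instances" is implemented here as resampling with replacement, which is an assumption; the slide does not say how the defective class is grown.

```python
import random

def under_sample(defective, others, rng=random):
    """Keep all N target-class (defective) instances and draw N from
    the other class -> a balanced training set of 2N instances."""
    n = len(defective)
    return defective + rng.sample(others, n)

def over_sample(defective, others, rng=random):
    """Keep all M other-class instances and grow the N defective
    instances to M (resampling with replacement is an assumption)
    -> a balanced training set of 2M instances."""
    m = len(others)
    return [rng.choice(defective) for _ in range(m)] + others
```

Under-sampling throws data away; over-sampling keeps it all but duplicates the rare class. The surprise reported on the next slide is that throwing data away costs nothing.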
Over/Under Sampling: Use Less...
• NB/none is still among the best.
• Sampling with J48 does not out-perform NB.
• NB/none is equivalent to NB/under.
• Under-sampling does not harm classifier performance.
• Theories can be learned from a very small sample of the available data.
Micro-Sampling: Use Even Less...
Given N defective modules:
• M = {25, 50, 75, ...} <= N
• Select M defective and M defect-free modules.
• Learn theories on 2M instances.
• Under-sampling corresponds to M = N.
Results: 8/12 datasets -> M = 25; 1/12 datasets -> M = 75; 3/12 datasets -> M = {200, 575, 1025}.
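Micro-sampling, and the sweep that finds the smallest adequate M, can be sketched as below. The `evaluate` callback, the tolerance, and the function names are assumptions; the step size of 25 follows the slide.

```python
import random

def micro_samples(defective, defect_free, m, rng=random):
    """Draw M defective and M defect-free modules -> 2M instances.
    M = N (all defectives) recovers plain under-sampling."""
    return rng.sample(defective, m) + rng.sample(defect_free, m)

def smallest_adequate_m(defective, defect_free, evaluate, tol=0.05):
    """Sweep M = 25, 50, 75, ... and return the smallest M whose score
    is within `tol` of the best seen. `evaluate` maps a training set
    to a performance score (e.g. balance)."""
    ms = range(25, len(defective) + 1, 25)
    scores = {m: evaluate(micro_samples(defective, defect_free, m))
              for m in ms}
    best = max(scores.values())
    return min(m for m, s in scores.items() if s >= best - tol)
```

The reported result is that this sweep usually stops at M = 25, i.e. 50 training instances suffice for most of the studied datasets.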
Discussions
Incremental Case-Based Reasoning (CBR) vs. Automatic Data Miners (ADM): when is CBR preferable to ADM?
• CBR is impractical with a large number of cases, but our results suggest 50 samples are adequate.
• CBR can perform as well as ADM.
• One step further: CBR can perform better than ADM.
Example 1: Requirement Metrics
This does not mean “use requirement docs” all the time!
• Combine features from whatever sources are available.
• Explore whatever is not a black-box approach.
• Consistent with prior research: SE should make use of domain-specific knowledge!
From: Text Mining
To: NLP
Subject: Semantics
Example 2: Simple Weighting
Combine features wisely!
• Black-box feature selection -> NP-hard.
• Information provided by a black-box approach is not necessarily meaningful to humans.
• Information provided by humans is meaningful for black-boxes.
• Check the validity of the NB assumptions!
Example 3: WHICH Rule Learner
Current practice: learn predictors with criteria P; assess predictors with criteria Q. In general, P ≠ Q.
WHICH supports defining P ≈ Q: learn what you will assess later.
(micro20 means only 20+20 samples.)
• WHICH initially creates a sorted stack of all attribute ranges in isolation.
• It then, based on score, randomly selects two rules from the stack, combines them, and places the new rule in the stack in sorted order.
• It continues to do this until a stopping criterion is met.
• WHICH supports both conjunctions and disjunctions.
• If the two selected rules contain different ranges of the same attribute, those ranges are OR'd together instead of AND'd. For example, combining "outlook=sunny AND rain=true" with "outlook=overcast" yields "outlook=[sunny OR overcast] AND rain=true".
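The combine-and-reinsert step above can be sketched as follows, representing a rule as a dict from attribute to the set of allowed ranges. The uniform pick of two rules and the generic `score` callback are simplifications of WHICH's score-biased selection, not its exact implementation.

```python
import random

def combine(rule_a, rule_b):
    """Combine two WHICH rules (attribute -> set of allowed ranges).
    Ranges of the same attribute are OR'd (set union); different
    attributes are AND'd (both constraints kept)."""
    merged = {}
    for attr in rule_a.keys() | rule_b.keys():
        merged[attr] = rule_a.get(attr, set()) | rule_b.get(attr, set())
    return merged

def which_step(stack, score, rng=random):
    """One WHICH iteration: pick two rules (uniformly here, a
    simplification of the score-biased pick), combine them, and
    reinsert the new rule keeping the stack sorted by score."""
    a, b = rng.sample(stack, 2)
    stack.append(combine(a, b))
    stack.sort(key=score, reverse=True)
    return stack
```

Running `which_step` repeatedly until a stopping criterion fires reproduces the loop described in the bullets: the stack grows, and the best-scoring combined rule rises to the top.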
Example 4: NN-Sampling
Within-company (WC) vs. cross-company (CC) data:
• CC data gives a substantial increase in pd... at the cost of a substantial increase in pf.
• CC data should only be used for mission-critical projects.
• Companies should strive for local (WC) data.
Why? CC data contains a larger space of samples... but it also includes irrelevancies.
How to decrease pf? Remove the irrelevancies by sampling from the CC data.
The same patterns appear in both the NASA MDP and the Turkish washing-machine datasets.
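The NN-sampling idea can be sketched as a nearest-neighbor relevancy filter: keep only the CC rows that lie close to some WC row. The choice of Euclidean distance, k = 10, and the function name are assumptions for illustration, not the paper's exact settings.

```python
import math

def nn_filter(wc, cc, k=10):
    """NN-sampling sketch: for each within-company (WC) row, keep its
    k nearest cross-company (CC) rows by Euclidean distance, then
    train on the union of the kept rows."""
    keep = set()
    for w in wc:
        nearest = sorted(range(len(cc)),
                         key=lambda i: math.dist(w, cc[i]))
        keep.update(nearest[:k])
    return [cc[i] for i in sorted(keep)]
```

Training on `nn_filter(wc, cc)` instead of all of `cc` is what prunes the irrelevancies and, per the slide, brings pf back down.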
Conclusions
• Defect predictors are practical tools.
• Limited information content hypothesis: simple learners can extract the limited information content; no need for more complex learners; further progress needs increasing the information content in the data.
• The current research paradigm has reached its limits.
• Black-box methods lack business knowledge; human-in-the-loop CBR tools should take their place. Practical: small samples to examine. Instantaneous: ADM will run fast. Direction: increase the information content.
• Promise data: OK. What about Promise tools? Increase in information content? Build predictors aligned with business goals.
Future Work
• Benchmark human-in-the-loop CBR against ADM.
• Instead of asking which learner, ask which data.
• Better sampling strategies?
Thanks...
Questions?