1 the deviance problem in effort estimation [email protected] promise-2 and defect prediction and...

1

The deviance problem in effort estimation

[email protected]

And defect prediction

And software engineering

2

Variance: confusing prior results?

Software effort estimation
– Jorgensen: most effort estimation is “expert-based”; so “model-based” estimation is a waste of time
– model- vs expert-based studies: 5 better, 5 even, 5 worse

Software defect prediction
– Shepperd & Ince: static code measures un-informative for software quality
– dumb LOC vs McCabe studies: 6 better, 6 even, 6 worse

I smell a rat

Selecting Best Practices for Effort Estimation - Menzies, Chen, Hihn, Lum. TSE 200X

Data Mining Static Code Attributes to Learn Defect Predictors - Menzies, Greenwald, Frank. TSE 200X

3

What you never want to hear…

“This isn't right. This isn't even wrong.”– Wolfgang Pauli

4

Standard disclaimer

An excessive focus on empiricism …
– … stunts the development of novel, pre-experimental speculations

But currently:
– there is no danger of an excess of empiricism in SE
– SE = a field flooded by pre-experimental speculations.

5

Sample experiments

Public domain data

Don’t test using your training data
– N-way cross val

M * randomize order

Straw man

Feature subset selection

Thou shalt script
– you will run it again

Study mean and variance over M * N

[Figure panels: defect prediction; effort estimation]
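The M * N recipe on this slide can be sketched as follows. This is a minimal sketch, not the scripts used for the talk: `train_and_score` is a hypothetical caller-supplied function, `mean_baseline` is a made-up straw man, and the toy data is invented for illustration.

```python
import random
import statistics

def m_by_n_cross_val(rows, train_and_score, m=10, n=10, seed=1):
    """Repeat M times: shuffle the rows, then run N-way cross-validation.

    train_and_score(train, test) is a hypothetical caller-supplied function that
    builds a model from `train` and returns one number scoring it on `test`.
    Returns the list of M * N scores.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(m):
        shuffled = rows[:]
        rng.shuffle(shuffled)              # M * randomize order
        for fold in range(n):              # N-way cross val
            test = shuffled[fold::n]       # every n-th row, starting at `fold`
            train = [r for i, r in enumerate(shuffled) if i % n != fold]
            scores.append(train_and_score(train, test))
    return scores

def mean_baseline(train, test):
    """Straw man: always predict the training mean; score is mean absolute error."""
    guess = statistics.mean(train)
    return statistics.mean(abs(x - guess) for x in test)

# Toy data, invented for illustration only.
data = [float(x) for x in range(1, 41)]
scores = m_by_n_cross_val(data, mean_baseline, m=10, n=10)
print("runs:", len(scores))
print("mean:", statistics.mean(scores), "stdev:", statistics.stdev(scores))
```

Because the whole experiment is one function call, re-running it with a different learner or seed costs nothing, which is the point of "thou shalt script".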

6

Data summation: K.I.S.S.

Combine PD/PF

Compute & sort combined performance deltas, method A vs all others

Summarize as quartiles

400,000 runs
– Nb = naïve Bayes
– J48 = entropy-based decision tree learner
– oneR = straw man
– logNums = log the numerics

Massive FSS

Singletons, including LOC, not enough
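One way to script the summary steps above. The slide does not say how PD and PF are combined, so `combined` below is an assumption (one minus the normalized distance to the ideal point pd=1, pf=0); the run results are made-up numbers for illustration.

```python
import math
import statistics

def combined(pd, pf):
    """Assumed combination: 1 minus normalized distance to the ideal (pd=1, pf=0)."""
    return 1 - math.sqrt((1 - pd) ** 2 + pf ** 2) / math.sqrt(2)

def delta_quartiles(runs, method):
    """Sorted deltas of `method` vs all others, summarized as quartiles.

    runs: dict of method name -> list of (pd, pf), one pair per run,
    with runs listed in the same order for every method.
    Returns (min, q1, median, q3, max) of the deltas.
    """
    mine = [combined(pd, pf) for pd, pf in runs[method]]
    deltas = []
    for other, results in runs.items():
        if other == method:
            continue
        theirs = [combined(pd, pf) for pd, pf in results]
        deltas.extend(a - b for a, b in zip(mine, theirs))
    deltas.sort()
    q1, median, q3 = statistics.quantiles(deltas, n=4)
    return deltas[0], q1, median, q3, deltas[-1]

# Made-up (pd, pf) results for three learners over three runs.
runs = {
    "nb":   [(0.70, 0.20), (0.75, 0.25), (0.80, 0.30)],
    "j48":  [(0.60, 0.20), (0.50, 0.10), (0.65, 0.30)],
    "oneR": [(0.30, 0.10), (0.20, 0.05), (0.35, 0.10)],
}
summary = delta_quartiles(runs, "nb")
print("nb vs all others (min, q1, median, q3, max):", summary)
```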

7

Variance: confusing prior results?

Software effort estimation
– Jorgensen: most effort estimation is “expert-based”; so “model-based” estimation is a waste of time
– model- vs expert-based studies: 5 better, 5 even, 5 worse

Software defect prediction
– Shepperd & Ince: static code measures un-informative for software quality
– dumb LOC vs McCabe studies: 6 better, 6 even, 6 worse

I smell a rat

8

Sources of variance

Software effort estimation

30 * { shuffle,
       test = data[1..10],
       train = data - test,
       <a,b> = LC(train),
       MRE = Estimate(a,b,test) }
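A runnable sketch of the loop above. The slide does not define LC, so this assumes it fits effort = a * size^b by least squares in log space; MRE is the magnitude of relative error, |actual - predicted| / actual. The project data is synthetic and invented for illustration.

```python
import math
import random
import statistics

def lc(train):
    """Assumed 'LC': least-squares fit of log(effort) = log(a) + b * log(size)."""
    xs = [math.log(size) for size, _ in train]
    ys = [math.log(effort) for _, effort in train]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

def mean_mre(a, b, test):
    """Mean magnitude of relative error over the test projects."""
    return statistics.mean(abs(effort - a * size ** b) / effort
                           for size, effort in test)

def repeat_holdout(data, repeats=30, test_size=10, seed=1):
    """30 * { shuffle; hold out 10; fit <a,b> on the rest; score on the hold-out }."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        rows = data[:]
        rng.shuffle(rows)
        test, train = rows[:test_size], rows[test_size:]
        a, b = lc(train)
        results.append((a, b, mean_mre(a, b, test)))
    return results

# Synthetic projects (invented): effort ~ 3 * size^1.1 with multiplicative noise.
gen = random.Random(0)
projects = []
for _ in range(40):
    size = gen.uniform(5, 200)
    projects.append((size, 3 * size ** 1.1 * math.exp(gen.gauss(0, 0.4))))

results = repeat_holdout(projects)
for name, col in (("a", 0), ("b", 1), ("MRE", 2)):
    values = [r[col] for r in results]
    print(name, "min", min(values), "max", max(values))
```

Printing the range of a, b and MRE over the 30 repeats shows how much the learned model moves when only the train/test split changes.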

Software defect prediction

10 * { randomly select 90% of data,
       score each attribute via “INFOGAIN” }
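The sampling loop above can be sketched like this. It is a minimal sketch assuming attributes are already discretized (numeric code measures would need binning first); the module data and attribute names are synthetic, invented for illustration.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Entropy of the class minus the expected entropy after splitting on attr.

    rows: list of (features_dict, label) with discrete feature values.
    """
    labels = [label for _, label in rows]
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attr], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rankings(rows, attrs, repeats=10, fraction=0.9, seed=1):
    """10 * { randomly select 90% of data; rank attributes by info gain }."""
    rng = random.Random(seed)
    out = []
    for _ in range(repeats):
        sample = rng.sample(rows, int(fraction * len(rows)))
        out.append(sorted(attrs, key=lambda a: info_gain(sample, a), reverse=True))
    return out

# Synthetic modules (invented): three weakly informative attributes.
gen = random.Random(0)
attrs = ["loc", "v(g)", "halstead"]
modules = []
for _ in range(200):
    defective = gen.random() < 0.3
    p_hi = 0.6 if defective else 0.4
    features = {a: ("hi" if gen.random() < p_hi else "lo") for a in attrs}
    modules.append((features, defective))

ranks = rankings(modules, attrs)
print("attributes ranked first across samples:", Counter(r[0] for r in ranks))
```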

Numerous candidates for “most informative” attributes

Large deviations confuse comparisons of competing methods

Can be reduced by FSS

Target class: continuous

Target class: discrete

9

What is Feature Subset Selection?

PCA worse (empirically)

INFOGAIN fastest
– Useful for defect detection, e.g. 10,000 modules in defect logs

WRAPPER slowest
– Performs best
– Practical for effort estimation, e.g. dozens of past projects in company databases
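Why WRAPPER is slow: it re-runs the target learner for every candidate subset. The sketch below assumes one common variant, greedy forward selection; the slide does not say which search was used. `evaluate` is a hypothetical caller-supplied scorer (in practice, a cross-validation of the learner on that subset), and `toy_score` and the feature names are invented.

```python
def wrapper_forward_select(features, evaluate):
    """Greedy forward selection.

    evaluate(subset) -> score, higher is better (hypothetical, caller-supplied).
    Adds the single best feature each round; stops when nothing improves.
    """
    selected, best = [], evaluate([])
    remaining = list(features)
    while remaining:
        # One full evaluation of the learner per remaining candidate feature.
        score, pick = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best:
            break
        selected.append(pick)
        remaining.remove(pick)
        best = score
    return selected, best

def toy_score(subset):
    """Invented scorer: two features help, every other feature costs a little."""
    useful = {"kloc": 0.5, "team_exp": 0.3}
    return sum(useful.get(f, -0.05) for f in subset)

selected, best = wrapper_forward_select(["kloc", "team_exp", "lang", "site"], toy_score)
print(selected, best)
```

With dozens of projects each evaluation is cheap, so the repeated learner runs are affordable; with 10,000 modules they are not, which matches the slide's split between WRAPPER and INFOGAIN.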

Turned blue to green

a = 10.1 + 0.3x + 0.9y - 1.2z

1) “wiggle” in x, y, z causes “wiggle” in “a”
2) Removing x, y, z reduces “wiggle” in “a”
3) But can damage mean performance
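A small simulation of points 1 and 2, under assumptions not on the slide: each input wiggles as independent Gaussian noise with mean 0 and standard deviation 1. Under that assumption the variance of "a" is the sum of the squared coefficients of the terms that remain.

```python
import random
import statistics

def sample_a(terms, noise=1.0, n=10000, seed=1):
    """Sample a = 10.1 + sum(coef * input) with independent Gaussian inputs.

    terms: dict of input name -> coefficient (only the inputs kept in the model).
    """
    rng = random.Random(seed)
    return [10.1 + sum(c * rng.gauss(0, noise) for c in terms.values())
            for _ in range(n)]

full = {"x": 0.3, "y": 0.9, "z": -1.2}
without_z = {"x": 0.3, "y": 0.9}

var_full = statistics.variance(sample_a(full))
var_reduced = statistics.variance(sample_a(without_z))
print("variance with x, y, z:", var_full)     # theory: 0.09 + 0.81 + 1.44 = 2.34
print("variance with x, y   :", var_reduced)  # theory: 0.09 + 0.81 = 0.90
```

Point 3 is the other side of the trade: if z really does drive "a", a model without it wiggles less but is wrong on average.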

10

Warning: no single “best” theory

[Figure panels: defect prediction; effort estimation]

11

Committee-based learning

Ensemble-based learning
– bagging, boosting, stacking, etc.

Conclusions by voting across a committee
– 10 identical experts are a waste of money
– 10, slightly different, experts can offer different insights into a problem

12

Using committees

Classification ensembles:
– “Majority vote does as good as anything else” - Tom Dietterich

Numeric prediction ensembles
– Can use other measures: “heuristic rejection rules”
– Theorists: “gasp horror”
– Seasoned cost-estimation practitioners: “of course”
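Both kinds of committee fit in a few lines. This is a sketch under assumptions: the slide names no specific rejection rule or combination method, so the default rule here (reject non-positive effort estimates) and the use of the median are hypothetical choices for illustration.

```python
import statistics
from collections import Counter

def majority_vote(predictions):
    """Classification committee: most common label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

def committee_estimate(estimates, reject=lambda e: e <= 0):
    """Numeric committee: drop estimates a heuristic rule rejects, then take the median.

    `reject` is a hypothetical rule; the default discards non-positive efforts.
    Returns None when every estimate is rejected.
    """
    kept = [e for e in estimates if not reject(e)]
    return statistics.median(kept) if kept else None

print(majority_vote(["defective", "clean", "defective"]))
print(committee_estimate([100, -5, 120, 110]))
```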


Standard statistics failing. T-tests report that none of these are “worse”

13

• For any pair of treatments,
• If one is “worse”
• Vote it off
• Repeat till none “worse”

survivors
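The vote-off loop, sketched with the "worse" test left as a caller-supplied predicate, since the slides do not define it. `worse_median` is a hypothetical rule (median error more than 10% higher), and the error scores are made up.

```python
import statistics
from itertools import combinations

def survivors(treatments, is_worse):
    """Repeatedly vote off any treatment judged worse than another.

    treatments: dict of name -> list of error scores (lower is better).
    is_worse(a_scores, b_scores) -> True when a is judged worse than b.
    """
    alive = dict(treatments)
    changed = True
    while changed:                      # repeat till none "worse"
        changed = False
        for a, b in combinations(list(alive), 2):
            if is_worse(alive[a], alive[b]):
                del alive[a]            # vote it off
            elif is_worse(alive[b], alive[a]):
                del alive[b]
            else:
                continue
            changed = True
            break                       # restart pairing over the survivors
    return sorted(alive)

def worse_median(a, b, margin=0.10):
    """Hypothetical rule: a is worse if its median error exceeds b's by over 10%."""
    return statistics.median(a) > statistics.median(b) * (1 + margin)

# Made-up error scores for three treatments.
errors = {
    "A": [0.30, 0.35, 0.40],
    "B": [0.31, 0.36, 0.38],
    "C": [0.90, 1.10, 1.00],
}
print(survivors(errors, worse_median))
```

More than one treatment can survive, which fits the earlier warning that there is no single "best" theory.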

14

So…

So, those M*N-way cross-vals
– Time to use them.

New research area
– Automatic model selection methods are now required
– Data fusion in biometrics

The technical problem is not the challenge
– Issues with explanation and expectation

15

Why so many unverified ideas in software engineering?

Humans use language to mark territory

– Repeated effect: linguistic drift

Villages, separated by just a few miles, evolve different dialects

– Language choice = who you talk to, what tools you buy

US vs THEM: SUN built JAVA as a weapon against Microsoft

– Result: never-ending stream of new language systems

Vendors want to sell new tools, not assess them.

16

But, the tide is turning

Text mining of NFRs, traceability:
– IEEE RE’06 (Minneapolis, 2006): The Detection and Classification of Non-Functional Requirements. Cleland-Huang, Settimi, Zou, Solc
– IEEE TSE Jan 2006, p 4-19: Advancing Candidate Link Generation for Requirements Tracing. Hayes, Dekhtyar, Sundaram

Software effort estimation
– IEEE TSE 200?: Selecting Best Practices for Effort Estimation. Menzies, Chen, Hihn, Lum

Software defect prediction
– IEEE TSE 200?: Data Mining Static Code Attributes to Learn Defect Predictors. Menzies, Greenwald, Frank

Yes Timmy, senior forums endorse empirical rigor

Best paper
