How to Find Relevant Data for Effort Estimation?



Page 1: How to Find Relevant Data for Effort Estimation ?

How to Find Relevant Data for Effort Estimation?

毛 可 (Mao Ke), 2012-03-28

Page 2: How to Find Relevant Data for Effort Estimation ?

Author

• Ekrem Kocaguneli ( [email protected] )
• Tim Menzies
• Specialties: Data Mining, Effort Estimation
• '11 TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation (TEAK)
• '11 TSE: On the Value of Ensemble Effort Estimation
• '11 ESEM: –
• '10 ASE: When to Use Data from Other Projects for Effort Estimation (short)
• Pre: Relevancy Filtering for Defect Estimation

Page 3: How to Find Relevant Data for Effort Estimation ?

Motivation (Why)

The Locality(1) Assumption
• Data divides best on one attribute:
  – 1. project type (e.g., embedded);
  – 2. development center of the developers;
  – 3. development language;
  – 4. application type (MIS, GNC, etc.);
  – 5. targeted hardware platform;
  – 6. in-house vs. outsourced projects.
• If Locality(1) holds:
  – Hard to use data across these boundaries
  – Models are confined; local data must be collected

Page 4: How to Find Relevant Data for Effort Estimation ?

Motivation (Why)

The Locality(N) Assumption
• Data divides best on a combination of attributes
• If Locality(N) holds:
  – Easier to use data across these boundaries

Page 5: How to Find Relevant Data for Effort Estimation ?

Work

• Cross-vs-within + “relevancy filtering” for effort estimation
  – Cross is as good as within
  – Companies can use others’ data for their estimates
  – If they first apply “relevancy filtering”
• "Cross" performs the same as "local"

Page 6: How to Find Relevant Data for Effort Estimation ?


Technology (How)

• How to find relevant training data?

Page 7: How to Find Relevant Data for Effort Estimation ?


Technology (How)

• Variance Pruning

Page 8: How to Find Relevant Data for Effort Estimation ?

Technology (How)

• TEAK = ABE0 + instance selection
  – '11 TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation
• ABE0 = ABE version 0
  – the most commonly used variant
  – numerics normalized to the range 0 to 1
  – Euclidean distance
  – equal weight for all attributes
  – returns the median effort of the k nearest neighbors
• Instance selection
  – a smart way to adjust the training data
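The ABE0 steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `abe0_estimate`, its parameters, and the default `k=3` are hypothetical and are not taken from the TEAK paper; features are assumed to be purely numeric.

```python
import math
from statistics import median

def abe0_estimate(train_features, train_efforts, test_features, k=3):
    """Hypothetical ABE0 sketch: min-max normalize, Euclidean distance,
    equal attribute weights, median effort of the k nearest neighbors."""
    # Column-wise minimum and maximum taken from the training projects.
    columns = list(zip(*train_features))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(row):
        # Map each attribute to 0..1; constant columns collapse to 0.
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    target = scale(test_features)
    # Unweighted Euclidean distance from the test project to every training project.
    scored = sorted(
        (math.dist(scale(row), target), effort)
        for row, effort in zip(train_features, train_efforts)
    )
    return median(effort for _, effort in scored[:k])
```

For example, `abe0_estimate([[10, 1], [20, 2], [30, 3], [100, 5]], [100, 200, 300, 1000], [22, 2])` takes the median effort of the three closest projects. Test values outside the training range scale outside 0..1 in this sketch, which a fuller implementation might want to handle.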

Page 9: How to Find Relevant Data for Effort Estimation ?

Technology (How)

• TEAK is a variance-based instance selector
• It is built via GAC (greedy agglomerative clustering) trees (binary for even)
• TEAK is a two-pass system
  – First pass selects low-variance relevant projects (instance selection)
  – Second pass retrieves projects to estimate from (instance retrieval)
• Variance pruning
  – > 10% * max(σ²)
  – > (100% + 10%) * max(σ²) ?
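As a rough illustration of the first pruning rule on this slide, the sketch below drops every group of projects whose effort variance exceeds a fraction of the largest variance seen. It is a deliberate simplification: it works on a flat list of clusters rather than on the sub-trees of a GAC tree, and the names `prune_high_variance` and `alpha` are hypothetical.

```python
from statistics import pvariance

def prune_high_variance(clusters, alpha=0.10):
    """Keep only clusters whose effort variance is at most alpha * max variance.

    `clusters` is a non-empty list of lists of effort values; in TEAK proper
    these would be sub-trees of a GAC tree, which this sketch does not build.
    """
    variances = [pvariance(efforts) for efforts in clusters]
    threshold = alpha * max(variances)
    return [efforts for efforts, var in zip(clusters, variances) if var <= threshold]
```

With `alpha=0.10` this corresponds to the "> 10% * max(σ²)" rule read as a pruning condition. The second rule on the slide is marked with a question mark in the source, so it is left out here.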

Page 10: How to Find Relevant Data for Effort Estimation ?

Technology (How)

• TEAK finds local regions important to the estimation of particular cases
• TEAK finds those regions via Locality(N), not Locality(1)

Page 11: How to Find Relevant Data for Effort Estimation ?

Experiments - Datasets

• Public availability: for reproducibility
• Cross-within divisibility
• 6 out of 20+ datasets from PROMISE

Page 12: How to Find Relevant Data for Effort Estimation ?

Experiments - Datasets

For dataset X: subsets X1, X2, X3
• Within
  – TEAK on X1, X2, X3 separately, with LOOCV
• Cross
  – X1 as test, X2+X3 as training data, and so on; N-fold CV
• Repeat 20 times! TEAK is greedy, so results vary with the order of the input data
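The within and cross setups described on this slide could be generated along these lines. The helper names `within_splits` and `cross_splits` are hypothetical, and a project is represented by any Python object.

```python
def within_splits(subset):
    """Leave-one-out: each project is tested against the rest of its own subset."""
    for i, test_project in enumerate(subset):
        yield subset[:i] + subset[i + 1:], test_project

def cross_splits(subsets):
    """Each subset is the test set once; all other subsets are pooled for training."""
    for name, test_projects in subsets.items():
        train = [project
                 for other, projects in subsets.items() if other != name
                 for project in projects]
        yield name, train, test_projects
```

Calling `cross_splits({"X1": [...], "X2": [...], "X3": [...]})` yields three train/test pairs, one per held-out subset. To mimic the 20 repetitions, the project order could be shuffled (for instance with `random.shuffle`) before each run.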

Page 13: How to Find Relevant Data for Effort Estimation ?

Experiments

• Win-Loss-Tie:
• Mann-Whitney test (95%)
  – tests whether the distributions of two populations differ significantly
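One common way to turn the Mann-Whitney test into a win-loss-tie verdict is sketched below: two error samples tie when the test finds no significant difference at the 95% level, otherwise the sample with the lower median error wins. The slide does not spell out the procedure, so this decision rule, the function names, and the normal approximation (with midranks for ties but no tie or continuity correction of the variance) are assumptions.

```python
import math
from statistics import median

def mann_whitney_p(a, b):
    """Two-sided p-value of the Mann-Whitney U test, normal approximation."""
    pooled = sorted((value, group) for group, xs in ((0, a), (1, b)) for value in xs)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for _, group in pooled[i:j] if group == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))

def win_tie_loss(errors_a, errors_b, alpha=0.05):
    """Verdict for method A against method B; lower error is better."""
    if mann_whitney_p(errors_a, errors_b) >= alpha:
        return "tie"
    return "win" if median(errors_a) < median(errors_b) else "loss"
```

The normal approximation is only reasonable for moderately large samples; for small ones an exact test from a statistics library would be the safer choice.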

Page 14: How to Find Relevant Data for Effort Estimation ?

Experiment 1 - Performance Comparison

MAR: Mean Absolute Residual
MdMRE: Median MRE (Magnitude of Relative Error)
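For reference, the error measures named here can be computed as follows, using their usual definitions: MRE = |actual − predicted| / actual, MAR as the mean absolute residual, MdMRE as the median MRE. PRED(25), which also appears in these slides, is the fraction of estimates whose MRE is at most 0.25. The function names are illustrative only.

```python
from statistics import mean, median

def mre(actual, predicted):
    """Magnitude of relative error for one project."""
    return abs(actual - predicted) / actual

def mar(actuals, predictions):
    """Mean absolute residual."""
    return mean(abs(a - p) for a, p in zip(actuals, predictions))

def mdmre(actuals, predictions):
    """Median magnitude of relative error."""
    return median(mre(a, p) for a, p in zip(actuals, predictions))

def pred(actuals, predictions, level=25):
    """Fraction of estimates whose MRE is at most level percent."""
    hits = sum(mre(a, p) <= level / 100 for a, p in zip(actuals, predictions))
    return hits / len(actuals)
```

Lower MAR and MdMRE are better, while a higher PRED(25) is better, so the measures should not be ranked in the same direction.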

Page 15: How to Find Relevant Data for Effort Estimation ?

Experiment 1 - Performance Comparison

Analogy by 1-neighbor (PRED(25) > 0.3 on C81 subsets):

    % Start from the nearest training case's effort scaled by relative size,
    % then adjust by the test/train ratio of each factor (cd* arrays).
    for i = 1:numTestCases
        estimates(i) = effortTrain(nearestCase(i)) * sizeTest(i) / sizeTrain(nearestCase(i));
        for k = 1:numTestFactors
            estimates(i) = estimates(i) * cdTestReady(i,k) / cdTrainReady(nearestCase(i),k);
        end
    end

Analogy by K-neighbor:

Page 16: How to Find Relevant Data for Effort Estimation ?

Experiment 2 – Retrieval Tendency

Page 17: How to Find Relevant Data for Effort Estimation ?

Experiment 2 – Retrieval Tendency

[Figure: Diagonal (WC) vs. off-diagonal (CC) selection percentages, sorted]
[Figure: Percentiles of diagonals and off-diagonals]

Page 18: How to Find Relevant Data for Effort Estimation ?

Conclusion

1. Cross performance is no worse than within performance
2. The probability that the estimator retrieves a training instance from cross data is the same as from within data

Implication:
• Companies can learn from each other’s data
• Locality(N): maybe there are general effects in SE
  – Effects that transcend the boundaries of one company
  – Local vs. global model…

Page 19: How to Find Relevant Data for Effort Estimation ?

Future work

• Check external validity
  – After instance selection, does cross == within?
• Build more repositories
  – More useful than previously thought for effort estimation
• Synonym discovery
  – Cross data can only be used if it shares the same ontology
  – Auto-generate lexicons to map terms between data sets (“LOC” – “size”, “product complexity”)

Page 20: How to Find Relevant Data for Effort Estimation ?

Thanks! Q & A ?
