


ASSESSING SOFTWARE COST ESTIMATION MODELS:

CRITERIA FOR ACCURACY, CONSISTENCY AND REGRESSION

Bruce W.N. Lo and Xiangzhu Gao
School of Multimedia and Information Technology
Southern Cross University
Lismore, NSW Australia 2480

Email: [email protected]

ABSTRACT

One of the problems in software cost estimation is how to evaluate estimation models. Estimation models are usually evaluated against two attributes: estimation accuracy and estimation consistency. A number of measures are reported in the literature, but they have shortcomings. There is no generally accepted standard to evaluate estimation models, and the existing measures sometimes are not consistent among themselves. This paper examines existing measures of estimation accuracy and consistency and proposes two new ones: the weighted mean of quartiles of relative errors (WMQ) as a measure of accuracy and the standard deviation of the ratios of the estimate to actual observation (SDR) as a measure of consistency. In addition, a new regression criterion is proposed to determine model parameters. The proposed measures and criterion were tested with a data set from real world software projects. The results obtained show that these new measures and criterion overcome many of the difficulties of the existing ones.

INTRODUCTION

Since the early 1950s, software development practitioners and researchers have been trying to develop methods to estimate software costs and schedules (Abdel-Hamid, 1990). Software cost estimation models have appeared in the literature over the past two decades (Wrigley et al, 1991). However, the field of software cost estimation is still in its infancy (Kitchenham et al, 1990). The existing cost estimation methods are far from standardised and reliable (Rowlands, 1989). There is a need to evaluate estimation models and improve modelling processes. This paper will focus on how to quantitatively evaluate software cost estimation models. A new approach to determine model parameters is also proposed.

In the field of software cost estimation, estimation models are usually evaluated against two attributes: estimation accuracy and estimation consistency. The rules or measures needed to describe these two attributes will be discussed in this paper.

Measurement may be used for two purposes: assessment and prediction (Fenton, 1994). For assessment, the results of comparing different models may be used to judge which theory is more successful at explaining the behaviour of cost factors (Pfleeger, 1991). Successful models would provide further insight into software development processes. To assess models, we need to have common standards. Although some measures for estimation accuracy and consistency have been introduced in the literature, they are not generally accepted for model assessment (Verner et al, 1992). Sometimes these measures are not consistent among themselves. For example, a model may be better than another one with respect to one measure but poorer with respect to a different measure.

For prediction, the measurement may provide feedback to improve the modelling process so that the model can satisfy the rules of measurement as far as possible. Modelling is associated with measurement (Kan et al, 1994). In this context, it is necessary to define the procedures for determining model parameters and interpreting the results (Fenton, 1994). This leads to the problem of how to determine the model parameters. Unfortunately there is also no generally accepted criterion for researchers to follow in the modelling process. This paper, therefore, will also discuss the criteria for determining model parameters.

The paper is organised into three parts. The first part examines the measures used to evaluate estimation models. To overcome the shortcomings of existing practice, two new measures are proposed: the weighted mean of quartiles of relative errors (WMQ), which provides a measurement of accuracy, and the standard deviation of the ratios of the estimate to actual value (SDR), which provides a measurement of consistency. The second part examines traditional mathematical procedures for formulating costing models and proposes a new regression criterion called least sum of logarithmic ratios of estimate to actual value. This is an unbiased method of finding parameters in a cost estimation model when the functional form of the model is known. The third part assesses the proposed measures and criterion. A data set from real world software projects was used to examine the proposed measures and the regression criterion so as to demonstrate their advantages over the other measures and criteria currently reported in the literature.



MEASURES OF ACCURACY

Accuracy is defined as the measure of how close a result is to its correct value (Deeson, 1991). There are two ways to compare a result and its correct value: their difference and their ratio.

Let $n$ be the number of projects in a data set, $act_i$ be the $i$-th ($i = 1,2,3,\ldots,n$) actual observed value and $est_i$ be the corresponding estimated value. The difference measure of estimation accuracy is based on the difference between estimated value and actual value:

$$est_i - act_i \quad (i = 1,2,3,\ldots,n)$$

The ratio measure of accuracy is based on the ratio of estimated value to actual value:

$$\frac{est_i}{act_i} \quad (i = 1,2,3,\ldots,n)$$

In evaluating the accuracy of software cost estimation models, both difference and ratio measures have been used. These are discussed below.

Difference Measures of Accuracy

(1) Mean of absolute errors (MAE)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|act_i - est_i|$$

(2) Root mean of squares of error (RMSE)

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(act_i - est_i)^2}$$

(3) Coefficient of determination ($R^2$)

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(act_i - est_i)^2}{\sum_{i=1}^{n}(act_i - \overline{act})^2}$$

where $\overline{act} = \frac{1}{n}\sum_{i=1}^{n} act_i$ is the mean of the $n$ actual observed values. For a given set of data, $\sum_{i=1}^{n}(act_i - \overline{act})^2$ is a constant. Therefore $R^2$ is a distance measure.

(4) Mean of residues (MR)

$$MR = \frac{1}{n}\sum_{i=1}^{n}(est_i - act_i)$$
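To make these definitions concrete, here is a minimal Python sketch computing all four difference measures from paired actual and estimated values (the function name and vectorised style are ours, not the paper's):

```python
import numpy as np

def difference_measures(act, est):
    """Compute MAE, RMSE, R^2 and MR as defined above."""
    act = np.asarray(act, dtype=float)
    est = np.asarray(est, dtype=float)
    mae = np.mean(np.abs(act - est))            # mean of absolute errors
    rmse = np.sqrt(np.mean((act - est) ** 2))   # root mean of squares of error
    # coefficient of determination: 1 - SSE / SST
    r2 = 1.0 - np.sum((act - est) ** 2) / np.sum((act - act.mean()) ** 2)
    mr = np.mean(est - act)                     # mean of residues (signed)
    return mae, rmse, r2, mr
```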

Ratio Measures Of Accuracy

(1) Mean (or average) of relative errors (ARE)

$$ARE = \frac{1}{n}\sum_{i=1}^{n}\frac{est_i - act_i}{act_i}$$


(2) Mean of magnitude of relative errors (MRE)

$$MRE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{act_i - est_i}{act_i}\right|$$

(3) Root mean of squared relative errors (RMSRE)

$$RMSRE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{act_i - est_i}{act_i}\right)^2}$$

Many measures are based on the magnitude of relative errors, $mre_i$:

$$mre_i = \left|\frac{act_i - est_i}{act_i}\right|$$

MRE is the most widely used measure in the literature. However, it is influenced by extreme values or outliers. A smaller MRE is not always better than a larger one (Jørgensen, 1994). To address this problem, some single values of $mre_i$ ($i = 1,2,3,\ldots,n$) have been used to measure the estimation accuracy. They are described below.

(4) Prediction at level $l$ (PRED($l$))

Conte et al (1986) put forward a single value measure, prediction at level $l$:

$$PRED(l) = \frac{k}{n}$$

where $k$ is the number of projects in a set of $n$ projects whose $mre \le l$. They suggested that an acceptable accuracy for a model is PRED(0.25) $\ge$ 0.75, which is seldom reached in reality.
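As an illustration, mre, MRE and PRED($l$) can be computed in a few lines; the following Python sketch follows the definitions above (the function names are ours):

```python
import numpy as np

def mre_values(act, est):
    """Magnitudes of relative errors: mre_i = |act_i - est_i| / act_i."""
    act = np.asarray(act, dtype=float)
    est = np.asarray(est, dtype=float)
    return np.abs(act - est) / act

def mean_mre(act, est):
    """MRE: the mean of the mre_i values."""
    return np.mean(mre_values(act, est))

def pred(act, est, level=0.25):
    """PRED(l): the proportion of projects whose mre does not exceed l."""
    return np.mean(mre_values(act, est) <= level)
```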

(5) Third quartile of mre ($Q_3$)

Conte et al's standard can be expressed in another way: 75% of the mre's are less than or equal to 0.25. In terms of quartiles, the third quartile, $Q_3$, of the mre's is less than or equal to 0.25, i.e. $Q_3 \le 0.25$. As mentioned above, this is too high a standard at present, but it should be a goal to pursue. The smaller the $Q_3$, the more accurate the estimation. Both $Q_3$ and PRED(0.25) avoid considering some extremely poorly predicted values, so they eliminate the influence of extreme values.

(6) Other measures

The median of the mre's ($Q_2$, the second quartile) has also been used as one of the measures of accuracy (Jørgensen, 1994). It is only a middle number among the ordered mre's, so it hardly presents an average of all the mre's. Miyazaki et al (1991) believed that mre based measures favour underestimation. (This problem will be discussed in detail later.) To address this problem, they defined the relative error in another way:

$$r_i = \begin{cases} \dfrac{est_i - act_i}{act_i} & (est_i - act_i \ge 0) \\[1ex] \dfrac{est_i - act_i}{est_i} & (est_i - act_i < 0) \end{cases}$$
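In code, this piecewise definition can be expressed as follows (a minimal sketch; the vectorised formulation is ours):

```python
import numpy as np

def miyazaki_r(act, est):
    """Relative error divided by act_i for overestimates (est_i >= act_i)
    and by est_i for underestimates (est_i < act_i)."""
    act = np.asarray(act, dtype=float)
    est = np.asarray(est, dtype=float)
    return np.where(est >= act, (est - act) / act, (est - act) / est)
```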


They argued that by using $r_i$, a balanced evaluation is obtained for both underestimated and overestimated cases. With the above definition of relative error, MRE, RMSRE or PRED(0.25) can be used for accuracy evaluation (Miyazaki et al, 1991).

Fenton (1994) argued that good measurement should be meaningful. In practice, it is generally accepted to express relative error by comparing the error with the actual observed value. With this commonly accepted idea of relative error, if the relative errors are 0.40, 0.50, 0.60 and 0.70 in the case of underestimation, the corresponding $r$ values are 0.67, 1.00, 1.50 and 2.33. In an extreme case, if the relative error is 0.95, which is not unusual, the $r$ value can be as high as 19.00. The variance of the $r_i$ must be large, and the outliers take an important role in MRE. The large variance will also make RMSRE and PRED(0.25) values quite different from those commonly accepted. Moreover, this measure penalises underestimation too much.

Jenson and Bartley (1991) proposed the weighted MRE (WMRE). They believe that one weakness of MRE is that the relative size of each individual observation is not recognised. WMRE is calculated such that each observation's relative error is weighted by the ratio of the observed value to the average of the observed values. They argued that by weighting larger projects more heavily than smaller ones, the WMRE recognises the greater cost associated with estimation errors on larger projects. However, on closer examination, we can prove that their argument is untenable, since

$$WMRE = \frac{1}{n}\sum_{i=1}^{n}\frac{act_i}{\overline{act}}\cdot\frac{|act_i - est_i|}{act_i} = \frac{1}{n\,\overline{act}}\sum_{i=1}^{n}|act_i - est_i|$$

In this expression, it does not appear that larger projects are weighted more heavily than smaller ones. On the contrary, the weight of project size included in the relative error disappears from the formula. So WMRE falls into the category of difference measures.

Occasionally, a combination of a difference measure and a ratio measure is used for accuracy. Jørgensen (1994) introduced PRED2(0.25, 0.5), the percentage of projects with mre < 0.25 or estimation error < 0.5 days. PRED2(0.25, 0.5) is actually PRED(0.25), because "estimation error < 0.5 days" only contributes to the percentage when the estimation is performed for a project of less than two days.

Discussion On Measures Of Accuracy

Kitchenham et al (1990) indicated that we should distinguish between fit accuracy and prediction accuracy, i.e. how well a model fits the data from which it was generated and how good a prediction from the model will be.

To assess fit accuracy, $R^2$ is a good measure. It can be seen from the definition of $R^2$ that it is related to the variance of the actual observed values. The larger this variance is, the easier it is to obtain a large $R^2$. $R^2$ measures the proportion of the variation in the dependent variable that is explained by the regression equation (Kenkel, 1989). The higher the value of $R^2$, the greater the explanatory power of the regression equation. When $R^2$ is close to 0, either the functional form of the model does not fit the data set from which it was generated or more independent variables are needed to further explain the variation in the dependent variable.

To assess prediction accuracy, difference measures are not suitable. With respect to software cost estimation, the prediction error increases with the magnitude of the observed value. The larger the project is, the harder it is to estimate the effort. Difference measures are not adequate if they are used for both large projects and small ones, because they do not take into consideration the size of projects. Therefore difference measures should not be used to assess prediction accuracy, for they penalise the prediction for large projects. On the other hand, relative error is an average of prediction error in every unit of effort, which reflects an "error rate". It takes into account the project size and allows larger absolute error for larger projects. Therefore, ratio measures are more suitable for the assessment of accuracy in software cost estimation.

Among the ratio measures, all mean measures are influenced by very large extreme mre values, which usually occur in software estimation. In the context of accuracy evaluation for software cost estimation, they cannot play the role of a measure of central tendency. Some of them favour underestimation. The single value measures eliminate the influence of outliers, but they can hardly provide an overall accuracy. For example, PRED(0.25) takes into account less than 75% of all the mre's, because Conte's acceptable level can seldom be achieved. Although $Q_3$ indicates that 75% of the projects are estimated with mre less than or equal to $Q_3$, it does not provide any detailed information on these mre values. For the purpose of comparison among different models, the single value measures are rough and stochastic. Moreover, these single value measures and mean measures are not always consistent when they are used to compare different models.


The above discussion may be summarised in three points:

• The difference measure is not suitable for the evaluation of software cost models. The ratio measure is preferable.

• The mean measures take into account every single relative error, but they are significantly influenced by outliers, which frequently occur in software cost estimation. Some of them favour underestimation.

• Single value measures are not influenced by outliers, but they are stochastic and lose too much information. Therefore the evaluation by single value measures is not always reliable when they are used to compare different models.

Because of the above drawbacks of the existing accuracy measures, there is a need for a better measure for accuracy evaluation.

A New Measure Of Accuracy - Weighted Mean of Quartiles of mre's (WMQ)

To resolve the above problems, it is proposed to use the weighted mean of quartiles (WMQ) of mre to measure the prediction accuracy. The third quartile ($Q_3$) is the most important, as 75% of the mre values are less than it, so it is weighted with 75. The second quartile ($Q_2$) is weighted with 50, and the first quartile ($Q_1$) is weighted with 25. The WMQ is defined as:

$$WMQ = \frac{25Q_1 + 50Q_2 + 75Q_3}{150} = \frac{Q_1 + 2Q_2 + 3Q_3}{6}$$

There are two assumptions underlying the use of WMQ:

1) The number of outliers is less than 25% of all the mre values. This assumption is generally true.

2) If the estimation of 75% of the projects is acceptable, the model is desirable. Because of the present low level of estimation accuracy, this assumption is reasonable.
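A minimal Python sketch of the WMQ computation, assuming the quartiles are taken with NumPy's default percentile interpolation (the paper does not specify an interpolation rule):

```python
import numpy as np

def wmq(act, est):
    """Weighted mean of quartiles of mre: WMQ = (25*Q1 + 50*Q2 + 75*Q3) / 150."""
    act = np.asarray(act, dtype=float)
    est = np.asarray(est, dtype=float)
    mre = np.abs(act - est) / act
    q1, q2, q3 = np.percentile(mre, [25, 50, 75])
    return (25 * q1 + 50 * q2 + 75 * q3) / 150  # equals (q1 + 2*q2 + 3*q3) / 6
```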

Unlike the mean measures, the WMQ is not influenced by extreme mre values, and at the same time it provides more general information about the distribution of mre than the single value measures. It will be shown later that the WMQ presents a good average of mre in evaluating the accuracy of software cost estimation. The WMQ is consistent with the MRE provided that the estimation is not obviously biased (tending to under/overestimate) and there are no outliers (extremely large values).

In a sample of $n$ projects an estimation model results in $n$ relative errors, the magnitude of which is expressed as a variable mre. These mre's are arranged in ascending order. The mre can then be expressed as a function of the order number $N$, i.e.

$$mre = f(N)$$

With this equation a smoothed curve ($OQ_1Q_2Q_3Q_4$) can be drawn. This curve is illustrated in Figure 1, where $Q_1$ is the first quartile of mre and also marks the intersection of the curve with the vertical line $N = n_{Q_1}$, where $n_{Q_1}$ ($= 0.25n$) is the ordinal number of $Q_1$, and so on. Through point $O$ and point $Q_3$ a straight line is drawn to intersect the vertical line $N = n$ at point $Q_4'$. The vertical line $N = n_{Q_1}$ meets the straight line $OQ_4'$ at point $Q_1'$; point $Q_2'$ is obtained in the same way.


[Figure 1. Relationship between WMQ and MRE: the smoothed curve $mre = f(N)$ and the straight line $OQ_4'$, plotted against the order number $N$.]

If there are no outliers with respect to mre, we may assume that the area $OQ_1'Q_2'Q_3Q_2Q_1$ is approximately equal to the area $Q_3Q_4'Q_4$. Therefore the area $A_1$ under the smoothed curve is approximately equal to the area $A_2$ of the triangle $OQ_4'n$. Let $k$ be the slope of the straight line $OQ_4'$. Then

$$MRE \approx \frac{A_1}{n} \approx \frac{A_2}{n} = \frac{1}{n}\cdot\frac{n \cdot kn}{2} = 0.5kn$$

If we assume $Q_1 \approx Q_1'$ and $Q_2 \approx Q_2'$, then

$$WMQ = \frac{Q_1 + 2Q_2 + 3Q_3}{6} \approx \frac{0.25kn + 2(0.50kn) + 3(0.75kn)}{6} \approx 0.58kn$$

In this particular case, as $Q_1 < Q_1'$ and $Q_2 < Q_2'$ in Figure 1, WMQ $< 0.58kn$. In the situation where there is a tendency to underestimate, the front part of the curve goes up and the rear part comes down. Hence $Q_1$ is close to $Q_1'$ and $Q_2$ is close to $Q_2'$, and WMQ may be greater than $0.5kn$; that is, WMQ > MRE (the MRE favours underestimation). For an unbiased estimation, $Q_1$ and $Q_2$ are relatively further from $Q_1'$ and $Q_2'$ respectively, and WMQ may be closer to $0.5kn$; that is, WMQ ≈ MRE. This explains why WMQ and MRE are consistent when there are no outliers and the estimation is unbiased.

MEASURES OF CONSISTENCY

A model that is sensitive to the influence of various productivity factors may nonetheless consistently overestimate or underestimate development effort, if the standard productivity rate assumed by the model is significantly different from that of the environment in which the software was developed (Mukhopadhyay et al, 1992). Models developed in different environments do not work very well without calibration. A consistently overestimating or underestimating model is easier to calibrate than an inconsistent one. Therefore, besides accuracy, consistency is another important feature of an estimation model.

Correlation Coefficient Of Estimates And Actual Values (R)

To measure the level of consistency, some researchers have used the correlation coefficient, R, between observed and estimated values (Mukhopadhyay et al, 1992). This measure tests the linear association between the actual values and the estimates. For a highly consistent model, R should be close to 1 ($-1 \le R \le 1$), otherwise it is


close to 0. If R is negative, it indicates that larger actual values are associated with smaller estimates. R is the square root of $R^2$, the coefficient of determination introduced earlier. So R is not consistent with the ratio measure, which has been shown to be more suitable for software cost estimation. Moreover, like $R^2$, R is influenced by the variance of the data: the greater the variance of the actual values, the larger the denominator in the expression. R varies according to not only the estimation accuracy but also the variance of the data. We need a consistency measure which assesses the estimation on the basis of a ratio measure and is independent of the distribution of the actual observations.

A New Measure Of Consistency

Suppose that a model was developed in environment A, and a set of data collected from software projects developed in environment B is used to test the estimation consistency of the model. For each estimation there is a ratio

$$r_i = \frac{est_i}{act_i} \quad (i = 1,2,3,\ldots,n)$$

In the case of consistent estimation, the values of $r_i$ ($i = 1,2,3,\ldots,n$) are close to one another. On the other hand, if the values of $r_i$ spread over a wide range, the estimation is not consistent. The closer to one another the values of $r_i$ are, the more consistent the estimation. Statistically, the standard deviation is a measure of the variation or spread of the $r_i$'s. So it is proposed to use the standard deviation of the $r_i$ (SDR) as a measure of estimation consistency:

$$SDR = \sqrt{\frac{\sum_{i=1}^{n}(r_i - \bar{r})^2}{n-1}}$$

where $\bar{r}$ is the mean of the $r_i$'s ($i = 1,2,3,\ldots,n$). The smaller the SDR, the more consistent the estimation.

It can be shown that the standard deviation of the relative errors is equal to the SDR. Because the assessment of estimation accuracy is based on relative error, SDR is related to estimation accuracy. Therefore SDR can be used to calibrate a model in order to improve the estimation accuracy in different environments.
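A minimal sketch of the SDR computation; the ddof=1 argument gives the n-1 denominator used in the definition above:

```python
import numpy as np

def sdr(act, est):
    """Standard deviation of the ratios r_i = est_i / act_i (n-1 denominator)."""
    r = np.asarray(est, dtype=float) / np.asarray(act, dtype=float)
    return np.std(r, ddof=1)  # sample standard deviation
```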

REGRESSION CRITERIA

In a previous section, it was argued that ratio measures are more suitable for accuracy assessment. We now consider how to determine model parameters so as to satisfy the ratio measures of accuracy.

In modelling costing formulas, regression is the basic technique for determining model parameters. The traditional method employed by most statistical software is least squares (LS) (Khoshgoftaar et al, 1992), which can be expressed as

$$\min\left[\sum_{i=1}^{n}(act_i - est_i)^2\right]$$

This is a criterion of difference measure. It cannot lead to an optimised functional expression of the model, since a ratio measure should be used to assess the prediction accuracy. In modelling formulas for software cost estimation, the criterion of minimising the sum of squared relative errors has been used (Conte et al, 1986). This criterion is not suitable either. In the expression of mre,


$$mre_i = \left|\frac{act_i - est_i}{act_i}\right| = \left|1 - \frac{est_i}{act_i}\right|$$

$act_i$ is positive for all $i$, and $est_i$ should be positive for all $i$. In the case of overestimation, that is $est_i > act_i$, an mre can be greater than 1, but in the case of underestimation, that is $est_i < act_i$, an mre can never be greater than 1. This explains the problem discussed in a previous section, that MRE favours underestimation if it is used as a measure of accuracy. To achieve the least sum of squared relative errors, the regression technique makes as many relative errors negative as possible. Therefore, if the least sum of squared relative errors is used as the regression criterion, the obtained formula will be one that systematically underestimates the effort. It has been found in practice that in most cases of software cost estimation the errors came from underestimation (Lederer et al, 1993). Using the least sum of squared relative errors as the regression criterion would make the situation even more serious.

A New Regression Criterion

From the above discussion we can argue that there are two special requirements for a regression criterion in formulating software costing models:

• It should be in the form of the ratio of $est_i$ to $act_i$.

• It should be unbiased, resulting in neither systematic underestimation nor systematic overestimation.

To meet these two special demands, a new regression criterion for formulating software costing models is proposed:

$$\min\left[\sum_{i=1}^{n}\left(\ln\frac{est_i}{act_i}\right)^2\right]$$

It is obvious that this criterion takes the ratio form of $est_i$ to $act_i$. The closer $\ln\frac{est_i}{act_i}$ is to 0, the closer $\frac{est_i}{act_i}$ is to 1, and the more accurate the estimation. As shown in the above expression, $est_i$ and $act_i$ are symmetric with respect to their positions, since $\left(\ln\frac{est_i}{act_i}\right)^2 = \left(\ln\frac{act_i}{est_i}\right)^2$. Therefore, this criterion produces an estimation formula with neither systematic underestimation nor systematic overestimation.

Computational Procedure With The New Regression Criterion

We can make use of existing statistics software to perform non-linear regression with the proposed criterion. In statistics software, the criterion for finding the optimal parameters is LS:

$$\min\left[\sum_{i=1}^{n}(act_i - est_i)^2\right]$$

Before the non-linear regression is undertaken, the observed values and the functional form of the model can be transformed into their logarithmic counterparts, $act_i^* = \ln(act_i)$ and $est_i^* = \ln(est_i)$. The LS criterion can then be expressed as

$$\min\left[\sum_{i=1}^{n}(act_i^* - est_i^*)^2\right] = \min\left\{\sum_{i=1}^{n}[\ln(act_i) - \ln(est_i)]^2\right\} = \min\left[\sum_{i=1}^{n}\left(\ln\frac{est_i}{act_i}\right)^2\right]$$

For example, suppose we have a set of $n$ pairs of data $(size_i, effort_i)$ ($i = 1,2,3,\ldots,n$). If the functional form of a model is

$$Effort = a(Size) + b$$


we can take the logarithm of both sides of the equation to obtain

$$\ln(Effort) = \ln[a(Size) + b]$$

The regression is performed according to this equation. In this case, the $i$-th estimate is $est_i = a(size_i) + b$ and the actual value is $act_i = effort_i$. Before the regression, we first calculate $\ln(effort_i)$ ($i = 1,2,3,\ldots,n$), which will be used as the values of the dependent variable for the regression. In the regression, the functional form is $\ln[a(Size) + b]$. Therefore we have the criterion

$$\min\left\{\sum_{i=1}^{n}\left[\ln\frac{a(size_i) + b}{effort_i}\right]^2\right\}$$
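In practice this procedure can be carried out with standard non-linear least squares routines. The sketch below uses SciPy's curve_fit; the data values are hypothetical and the starting values p0 are an assumption. Fitting $\ln[a(size)+b]$ to $\ln(effort)$ by ordinary least squares minimises exactly the criterion above:

```python
import numpy as np
from scipy.optimize import curve_fit

def log_model(size, a, b):
    """Logarithm of the example model Effort = a*Size + b."""
    return np.log(a * size + b)

# Hypothetical project data: sizes and observed efforts.
size = np.array([100.0, 250.0, 400.0, 800.0])
effort = np.array([900.0, 2400.0, 3500.0, 8000.0])

# Least squares on the log scale minimises sum of (ln((a*size_i + b)/effort_i))^2.
(a_hat, b_hat), _ = curve_fit(log_model, size, np.log(effort), p0=(10.0, 100.0))
```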

ASSESSMENT OF THE PROPOSED MEASURES AND CRITERION

A set of data quoted by Desharnais (1988) from 81 software projects was used to assess these measures of estimation accuracy and consistency, and the regression criteria. The main features of the data set are summarised below.

Effort range: 546-19,894 person-hours
Function point range: 62-1,116 FP
Team average experience: 0-4 years
Manager experience: 0-7 years
Language levels:
    Level 1: 46 projects were developed in 3GL (COBOL with IMS or IDMS type databases).
    Level 2: 25 projects were developed using a combination of the traditional 3GL approach together with screen and report generators.
    Level 3: 10 projects were implemented in 4GL.

Integrated Software Cost Model

An integrated software cost model was proposed by Gao and Lo (1995) on the basis of COCOMO (Boehm, 1981) and FPA. The integrated model attempts to combine the advantages of these two approaches in a single model. To address the problem of language dependency, the language-weighted function point is introduced. In addition, continuous, rather than discrete, "cost drivers" are used. The general form of the model is

$$Effort = a(WFP)^b \prod_{i=1}^{n} c_i x_i^{d_i}$$

where $a$ and $b$ are constants, $x_i$ is the magnitude of the $i$-th cost driver, $c_i$ and $d_i$ are constants corresponding

to the $i$-th cost driver, $n$ is the number of cost drivers and WFP is the language-weighted FP. The equation appropriate to Desharnais' data set is

$$Effort = a(l_k FP)^b$$

where $l_1 = 1.00$, $l_2 = 0.95$ and $l_3 = 0.20$ are the weights for language levels 1, 2 and 3 respectively (Gao et al, 1995). The exponent $b$ is the responsiveness of Effort to FP: it indicates that Effort increases by $b$% when FP increases by 1%.
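For illustration, the Desharnais-specific equation can be evaluated as follows; the coefficients a and b here are placeholders, not the fitted values from the paper:

```python
# Language-level weights l_k from Gao and Lo (1995).
LANG_WEIGHT = {1: 1.00, 2: 0.95, 3: 0.20}

def integrated_effort(fp, level, a, b):
    """Effort = a * (l_k * FP)^b for a project at the given language level."""
    return a * (LANG_WEIGHT[level] * fp) ** b

# Example call with placeholder coefficients:
# integrated_effort(fp=300, level=2, a=40.0, b=1.0)
```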


Assessment On Regression Criteria

Three criteria for finding the optimal parameters are used in the regression:

• Criterion 1: $\min\left[\sum_{i=1}^{n}(act_i - est_i)^2\right]$

• Criterion 2: $\min\left[\sum_{i=1}^{n}\left(\frac{act_i - est_i}{act_i}\right)^2\right]$

• Criterion 3: $\min\left[\sum_{i=1}^{n}\left(\ln\frac{est_i}{act_i}\right)^2\right]$

In Figure 2 and Figure 3, the relative errors

$$re_i = \frac{est_i - act_i}{act_i} \quad (i = 1,2,3,\ldots,81)$$

are shown in ascending order. Figure 2 shows the relative errors with respect to Criterion 2. It can be seen that more than 70% of the relative errors are negative. This indicates that the formula obtained from Criterion 2 tends towards underestimation. Figure 3 shows the relative errors with respect to Criterion 3. The number of negative relative errors is approximately the same as that of positive relative errors; the formula obtained from Criterion 3 does not result in systematic bias. (Criterion 1 is similar to Criterion 3 with respect to estimation bias, because $act_i$ and $est_i$ can be interchanged in the expression of Criterion 1.) This shows that the regression with Criterion 3 overcomes the tendency to underestimate which occurred with Criterion 2.

Figure 2 Relative errors (Criterion 2) Figure 3 Relative errors (Criterion 3)

Assessment On The Measures Of Accuracy And Consistency

For comparison, $Q_3$, PRED(0.25) and MRE as well as WMQ are computed as measures of prediction accuracy, while R and SDR are computed as measures of estimation consistency. We also use


$$Effort = a(FP)^b$$

for the regression, in order to fully assess the measures of accuracy.

Accuracy Assessment

Table 1 uses FP as the size measure while Table 2 uses WFP. Table 2 also includes consistency results, which are to be compared with Table 3 in the next section.

Table 1 Accuracy of FP as size measure

Criterion    Q3      PRED(0.25)    WMQ     MRE
    1        0.66       0.42       0.45    0.67
    2        0.70       0.11       0.62    0.58
    3        0.56       0.37       0.42    0.58

Table 2 Accuracy and consistency of WFP as size measure

Criterion    Q3      PRED(0.25)    WMQ     MRE     R        SDR
    1        0.43       0.49       0.32    0.33    0.8428   0.439
    2        0.45       0.48       0.34    0.30    0.8389   0.344
    3        0.44       0.56       0.31    0.32    0.8406   0.423

The first observation is that the compared measures of accuracy, $Q_3$, PRED(0.25), WMQ and MRE, are not consistent in both tables. This observation is not unexpected, because $Q_3$ and PRED(0.25) are stochastic. The difference between MRE and WMQ will be discussed later.

The second observation is that Criterion 2 leads to the best MRE and Criterion 1 leads to the worst MRE. As discussed in the previous section, Criterion 1 is based on a difference measure while MRE is a ratio measure. Therefore the MRE of Criterion 1 is larger than that of Criterion 2 and Criterion 3, which are based on relative errors. Criterion 2 leads to an underestimation tendency and MRE favours underestimation; therefore Criterion 2 has the best MRE. Criterion 3 has neither an over- nor an underestimation tendency; therefore the MRE of Criterion 3 lies between those of Criterion 2 and Criterion 1.

The third observation is that if there are no outliers and no underestimation tendency (Criteria 1 and 3 in Table 2), WMQ ≈ MRE. If there are outliers (Criteria 1 and 3 in Table 1), WMQ < MRE, and if an underestimation tendency exists (Criterion 2 in Table 1), WMQ > MRE. As mentioned before, MRE as a measure of accuracy has two drawbacks: it is influenced by outliers and it favours underestimation. WMQ as proposed targets the solution of these two difficulties, and this third observation indicates that WMQ has succeeded in its objective.

The fourth observation is that Criterion 3 leads to the best WMQ for all data conditions. As assessed above, WMQ is a better measure of accuracy, so Criterion 3 is superior to the other two criteria.

Consistency Assessment

In this section, it will first be shown that SDR is a better consistency measure than R, because R is influenced not only by the estimation result but also by the variance of the actual effort, whereas SDR is not influenced by the latter. For this purpose, the data of one project, the one involving the largest development effort, is deleted from the data set. With this change the variance of the actual effort changes. Table 3 shows the accuracy and consistency results from the reduced data set.

Table 3 Accuracy and consistency results from reduced data set

Criterion    Q3      PRED(0.25)    WMQ     MRE     R        SDR
    1        0.44       0.49       0.33    0.34    0.7900   0.441
    2        0.45       0.49       0.33    0.30    0.7885   0.345
    3        0.44       0.56       0.31    0.32    0.7891   0.425

Comparing Table 2 and Table 3, we find that the $Q_3$, WMQ and MRE values of each criterion are virtually the same in the two corresponding situations. The consistency should not be different either, because it measures the quality of a model in a specific environment and should be independent of the variance of the actual effort. However, the change of variance makes the R values decrease by more than 0.0500. On the other hand, the SDR values change by less than 0.002, which is comparatively small. Therefore SDR is superior to R in that it is not influenced by the variance in effort.

Next, it will be demonstrated that a model producing consistent estimation errors is more easily calibrated to improve accuracy than a model producing less consistent errors, if consistency is measured by SDR. The calibration method is described as follows.

With Desharnais' data set, three formulas are obtained by regression with the three criteria described earlier. These formulas are to be calibrated to the environment of Albrecht's data set (Albrecht & Gaffney, 1983). In Albrecht's data set, 15 projects used level 1 language and their FPs fell in the size range of Desharnais' data set. For the other projects, either the languages cannot be categorised into the 3 levels or the FPs are beyond the size range of Desharnais' data set. Therefore only these 15 projects were used for the calibration.

It is assumed that the responsiveness of effort to size (the exponents in the formulas; refer to the section "Integrated Software Cost Model") does not change, but the environments of the two sets of data are different. In Desharnais' environment, the formulas obtained with the three regression criteria are

$$Effort = a_j(WFP)^{b_j} \quad (j = 1,2,3)$$

The environment is the overall level of the cost drivers (Gao et al, 1995), which is expressed by the coefficient $a_j$ ($j = 1,2,3$). For Albrecht's data set, an environment adjustment parameter, $k_j$ ($j = 1,2,3$), is used in the above formulas to account for the environment change:

$$Effort = k_j a_j(WFP)^{b_j} \quad (j = 1,2,3)$$

where the $k_j$ ($j = 1,2,3$) are determined by the following procedure.

Firstly, the estimated efforts $est_{ij}$ ($i = 1,2,3,\ldots,15$, $j = 1,2,3$) are obtained with the formulas developed in Desharnais' environment. Secondly, it is assumed that the $k_j$ ($j = 1,2,3$) satisfy the equations

$$act_i = k_j\, est_{ij} \quad \text{or} \quad 1 = k_j\,\frac{est_{ij}}{act_i} \quad (i = 1,2,3,\ldots,15,\ j = 1,2,3)$$


where the $act_i$ ($i = 1,2,3,\ldots,15$) are the actual values of effort in Albrecht's data set. Thirdly, these 15 equations are summed to give

$$15 = k_j \sum_{i=1}^{15}\frac{est_{ij}}{act_i} \quad (j = 1,2,3)$$

and the above expression is rearranged so that the $k_j$ ($j = 1,2,3$) are obtained:

$$k_j = \frac{15}{\sum_{i=1}^{15} est_{ij}/act_i} \quad (j = 1,2,3)$$
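In code, the calibration factor and its application amount to a few lines (a sketch under the assumptions above; the variable names are ours):

```python
import numpy as np

def calibration_factor(act, est):
    """k = n / sum(est_i / act_i), obtained by summing act_i = k * est_i over the sample."""
    act = np.asarray(act, dtype=float)
    est = np.asarray(est, dtype=float)
    return len(act) / np.sum(est / act)

# Calibrated estimates for the new environment:
# est_new = calibration_factor(act, est) * est
```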

Table 4 and Table 5 show the results of accuracy and consistency before and after the calibration.

Table 4 Results before calibration

Criterion    Q1      Q2      Q3      PRED(0.25)    WMQ     MRE     R        SDR
    1        0.10    0.18    0.49       0.60       0.32    0.36    0.5328   0.57
    2        0.21    0.35    0.57       0.27       0.44    0.42    0.5361   0.42
    3        0.11    0.26    0.47       0.47       0.34    0.36    0.5352   0.53

Table 5 Results after calibration

Criterion    Q1      Q2      Q3      PRED(0.25)    WMQ     MRE     R        SDR
    1        0.10    0.17    0.50       0.60       0.32    0.36    0.5328   0.58
    2        0.09    0.23    0.46       0.60       0.32    0.36    0.5361   0.57
    3        0.10    0.20    0.47       0.60       0.32    0.36    0.5352   0.57

For Criterion 1 in Table 4, the SDR is 0.57 (the largest SDR in this table), so the estimation is the least consistent. Criterion 1 in Table 5 shows that the calibration does not improve the accuracy. For Criterion 3 in Table 4, the SDR is 0.53; this consistency is better than that of Criterion 1. The calibration improves the WMQ from 0.34 in Table 4 to 0.32 in Table 5, while the MRE remains unchanged. For Criterion 2 in Table 4, the SDR is 0.42; this consistency is much better than that of Criterion 1 or Criterion 3. The calibration improves the WMQ from 0.44 in Table 4 to 0.32 in Table 5, and the MRE from 0.42 to 0.36. These results indicate that the smaller the SDR is, the easier it is to improve the accuracy by calibration.

The R values remain the same before and after the calibration: the linear relationship between estimated effort and actual effort does not change, because the calibration multiplies the uncalibrated estimated effort by a constant. On the other hand, all the SDR values after calibration differ from those before calibration. This indicates that R and SDR are not the same: R measures the linear relationship between estimated value and actual value, while SDR measures the variation of the relative errors.

The accuracy results after calibration are comparable with those obtained when the original formulas are used with Desharnais' data set. This observation further emphasises that a model needs to be calibrated if it is to be used in a different environment.


From Table 4 and Table 5 it can also be observed that the MREs are greater than the WMQs in all cases except Criterion 2 of Table 4. As expected, there are extreme values in the cases where MRE > WMQ, and the formula tends to underestimate the effort in Criterion 2 of Table 4, where MRE < WMQ.

SUMMARY

In order to evaluate estimation models and improve the modelling process, we need appropriate methods to measure these models and determine model parameters. This paper reviewed existing methods and proposed new methods to measure estimation accuracy and consistency, and to determine model parameters.

The difference measures of accuracy favour the estimation for small projects. Therefore it is argued in this paper that measures of accuracy should be based on the relative error of estimation. As the MRE (the most widely used measure of accuracy) is influenced by outliers and favours underestimation, and single value measures are stochastic, the WMQ is proposed for accuracy evaluation. The WMQ includes more information on the estimation than single value measures, so it is less stochastic. It is also consistent with MRE when there are no outliers and the estimation is unbiased.

Consistency examines the model's degree of ease of calibration. A consistently overestimating or underestimating model is more easily calibrated than an inconsistent one. The correlation coefficient R between observed and estimated values has been used to evaluate consistency. R favours a data set with large variance: it is determined not only by the estimation but also by the distribution of the actual values. In this paper, the standard deviation of the ratios of the estimate to actual effort (SDR) is proposed as a measure of consistency. The SDR is a measure of the variation or spread of the relative error.

The method of least squares is the conventional regression criterion, but it is based on a difference measure, which is not suitable for software cost estimation. The criterion of minimising the MRE will produce a formula which tends to underestimate the effort. To overcome the shortcomings of these criteria, this paper proposes to use the criterion of least squares of the logarithmic ratio of estimate to actual value for regression.

The proposed measures and criterion share the characteristic that they are based on relative errors of the estimation. Applied to real-world data, the proposed measures and criterion appear superior to the measures and criteria that are currently used in the field and reported in the literature. It is therefore recommended that the proposed measures and criterion be used for further assessment.

REFERENCES

Abdel-Hamid, T.K. (1990) "On the utility of historical project statistics for cost and schedule estimation: results from a simulation-based case study", Journal of Systems & Software, Sep., Vol.13, No.1, pp.71-82.

Albrecht, A.J. & Gaffney, J. (1983) "Software function, source lines of code, and development effort prediction: a software science validation", IEEE Transactions on Software Engineering, Nov., Vol.9, No.6, pp.639-648.

Boehm, B.W. (1981) Software Engineering Economics, Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

Conte, S.D., Dunsmore, H.E. & Shen, V.Y. (1986) Software Engineering Metrics and Models, Benjamin/Cummings Publishing Company, Inc., Menlo Park, Calif.

Deeson, E. (1991) Collins Dictionary of Information Technology, Harper Collins Publishers, Glasgow.

Desharnais, J-M. (1988) "Programme de maitrise en informatique de gestion", University of Montreal, Canada.

Fenton, N. (1994) "Software measurement: a necessary scientific basis", IEEE Transactions on Software Engineering, Mar., Vol.20, No.3, pp.199-206.

Gao, X. & Lo, B. (1995) "An integrated software cost model based on COCOMO and function point analysis", Proceedings of the Software Education Conference (SRIG-ET94), pp.86-93, IEEE Computer Society, Los Alamitos, CA.

Jenson, R.L. & Bartley, J.W. (1991) "Parametric estimation of programming effort: an object-oriented model", Journal of Systems & Software, May, Vol.15, No.2, pp.107-114.

Jørgensen, M. (1994) "A comparative study of software maintenance effort prediction models", Proceedings of the 5th Australian Conference on Information Systems, 27-29 September 1994, Melbourne, Australia, pp.699-709.

Kan, S.H., Basili, V.R. & Shapiro, L.N. (1994) "Software quality: an overview from the perspective of total quality management", IBM Systems Journal, Mar., Vol.33, No.1, pp.4-19.

Kenkel, J.L. (1989) Introductory Statistics for Management and Economics (Third Edition), Duxbury Press, USA.

Khoshgoftaar, T.M., Munson, J.C. et al (1992) "Predictive modeling techniques of software quality from software measures", IEEE Transactions on Software Engineering, Nov., Vol.18, No.11, pp.979-987.


Kitchenham, B.A., Kok, P.A.M. & Kirakowski, J. (1990) "The MERMAID approach to software cost estimation", Proceedings of the Annual ESPRIT Conference, Brussels, November 12-15, 1990, pp.296-314.

Lederer, A.L. & Prasad, J. (1993) "Information systems software cost estimating: a current assessment", Journal of Information Technology, No.8, pp.22-33.

Miyazaki, Y., Takanou, A. & Nozaki, H. (1991) "Method to estimate parameter values in software prediction models", Information & Software Technology, Apr., Vol.33, No.3, pp.239-243.

Mukhopadhyay, T., Vicinanza, S.S. & Prietula, M.J. (1992) "Examining the feasibility of a case-based reasoning model for software effort estimation", MIS Quarterly, Jun., pp.155-170.

Pfleeger, S.L. (1991) "Model of software effort and productivity", Information & Software Technology, Apr., Vol.33, No.3, pp.224-231.

Rowlands, B.H. (1989) "Estimating instructional systems development effort and costs", Proceedings of the 7th Annual Conf. of the Australian Society for Computers in Learning in Tertiary Education, Gold Coast, 1989, pp.340-348.

Verner, J. & Tate, G. (1992) "A software size model", IEEE Transactions on Software Engineering, Apr., Vol.18, No.4, pp.265-278.

Wrigley, C.D. & Dexter, A.S. (1991) "A model for measuring information system size", MIS Quarterly, Jun., pp.245-257.

AJIS vol 5 no 1 September 1997