identification of fuzzy models of software cost estimation



Fuzzy Sets and Systems 145 (2004) 141–163
www.elsevier.com/locate/fss

Identification of fuzzy models of software cost estimation

Zhiwei Xu (a), Taghi M. Khoshgoftaar (b,∗)

(a) Motorola Labs, Schaumburg, IL, USA
(b) Department of Computer Science and Engineering, Empirical Software Engineering Laboratory, Florida Atlantic University, Boca Raton, FL 33431, USA

Abstract

Software cost estimation is one of the most critical tasks in managing software projects. Development costs tend to increase with project complexity, and hence accurate cost estimates are highly desired during the early stages of development. An important objective of the software engineering community has been to develop useful models that constructively explain the software development life-cycle and accurately estimate the cost of software development. Currently used software development effort estimation models, such as COCOMO and Function Point Analysis, do not consistently provide accurate project cost and effort estimates. This is often because important project data available at the time of modeling are vague, imprecise, and incomplete. Traditionally used cost estimation models cannot utilize such vague yet important information.

Fuzzy logic-based cost estimation models are more appropriate when vague and imprecise information is to be accounted for. Such models usually rely on expert knowledge, which, however, is often too general to fit a particular data set because different data sets have different characteristics. We present an innovative fuzzy identification cost estimation modeling technique to deal with linguistic data and to automatically generate fuzzy membership functions and rules. A case study based on the COCOMO81 database compared the proposed model with all three COCOMO models, i.e., Basic, Intermediate, and Detailed. It was observed that the fuzzy identification model provided significantly better cost estimations than the three COCOMO models.

© 2003 Published by Elsevier B.V.

Keywords: Fuzzy identification; Rule generation; Software cost estimation; Fuzzy clustering; COCOMO models

∗ Corresponding author. Tel.: +1-561-297-3994; fax: +1-561-297-2800.
E-mail addresses: [email protected] (Z. Xu), [email protected] (T.M. Khoshgoftaar).

0165-0114/$ - see front matter © 2003 Published by Elsevier B.V.
doi:10.1016/j.fss.2003.10.008

1. Introduction

Estimating the cost and the schedule required to develop a software system is one of the most critical and difficult tasks in managing software projects. While technical and marketing issues have a strong impact on a project's success, poorly managed projects, no matter how advanced the technology, are more likely to fail than succeed. Despite increasing attempts to treat software development as a form of engineering, many projects are still not completed on schedule, with under- or over-estimates of effort each causing their own particular problems. Therefore, in order to reduce budget and schedule overruns and to improve contractual bids for development projects, various software cost estimation models have been developed [1,5,19,22]. In the field of software engineering, the terms cost and effort imply the same concept, and hence in this paper we use the two terms interchangeably.

The rapidly changing nature of software development has made it extremely difficult to develop cost models that continue to yield high prediction accuracies. Software development costs continue to increase and practitioners continue to express their concerns over their inability to accurately predict the costs involved. Thus, one of the most important objectives of the software engineering community has been to develop useful models that constructively explain the software development life-cycle and accurately predict the cost of developing a software product.

A few of the current software development effort estimation models include COCOMO [5], SLIM [19], Estimacs, and Function Point Analysis (FPA) [1,15]. However, no model has proven to be consistent in successfully providing accurate software development effort estimations. This is largely due to the fact that information about software effort is often uncertain, imprecise, and incomplete. In the early stages of software development, it is difficult to build an explicit software effort estimation model at the time of modeling. In such cases, using a fuzzy identification approach is more appropriate.

The application of fuzzy logic concepts to software cost estimation modeling has recently been explored by various researchers. Generally speaking, such efforts can be classified into two categories: (1) using fuzzy numbers for interval prediction [8] and (2) rule-based fuzzy logic [9]. Both of these categories are based on expert knowledge. However, expert knowledge is often too general to fit a particular data set. This motivates us to seek a way to generate rules directly from data, so that the rules show the particular relationship between the independent and dependent variables in a particular data set.

Recently, various methods have been proposed for automatically generating fuzzy if–then rules from numerical data. Most of these methods apply iterative learning procedures or complicated rule generation mechanisms. These algorithms include gradient descent learning methods [10,11,17], genetic algorithm-based methods [12,13], least squares methods [20,21], the fuzzy c-means method [25], and fuzzy-neuro methods [6,16,18]. We also built a fuzzy rule extraction model to classify fault-prone and not fault-prone modules [24]. However, all of the above-mentioned modeling methods can only extract rules from numerical data. Furthermore, most software development effort estimates are required and performed at the earlier stages of development, where important software attributes can only be expressed as vague and imprecise non-numerical values. Consequently, many useful software attributes used for effort estimation are linguistic values, which makes the above-mentioned fuzzy logic methods unsuitable for software effort estimation.

In this paper, we present an innovative fuzzy identification method to deal with linguistic data and to generate fuzzy membership functions and rules automatically. This paper summarizes the concepts of fuzzy identification and its usage in software effort estimation. To the best of our knowledge, this is the first time that linguistic software attributes, recorded in the earlier life cycle stages, have been modeled in this way. A case study based on the COCOMO81 database compared a model built using the proposed fuzzy identification method with the three COCOMO81 models, i.e., Basic, Intermediate, and Detailed. It was observed that the proposed fuzzy identification model provided significantly more accurate cost estimations than all three COCOMO81 estimation models.

The layout of the rest of the paper is as follows. In Section 2, we briefly introduce the principle of software effort estimation models with a focus on the Intermediate COCOMO81 model. Section 3 presents fuzzy identification and the fuzzy identification techniques that could be applied to software effort estimation. In Sections 4 and 5, we describe the validation and analysis of the results obtained from our experiment. Section 6 concludes the paper and gives an overview of future work.

2. Software development effort estimation models

A large portion of the work in the cost estimation field has focused on algorithmic cost modeling. In these methods, mathematical formulas are used to estimate the software project cost and effort based on software metrics and other input parameters. An alternative method is based on a formal model. In a formal model, the formulae used arise from the analysis of historical data. In both cases, the accuracy of the model can be improved by calibrating the model to a project-specific development environment. This involves adjusting the weights of the metrics according to their importance.

2.1. COCOMO81 (COnstructive COst MOdel)

The COCOMO cost estimation model is used by numerous software project managers, and is based on a study of hundreds of software projects. Unlike other cost estimation models, COCOMO is an open model, hence all of its details are published. COCOMO81 was derived from the analysis of 63 software projects in 1981. Boehm proposed three levels of the COCOMO model: Basic, Intermediate, and Detailed.

• The Basic COCOMO81 model is a single-valued static model that computes software development effort (and cost) as a function of the program size expressed in estimated lines of code (LOC).

• The Intermediate COCOMO81 model computes software development effort as a function of the program size and a set of “cost drivers” that include subjective assessments of product, hardware, personnel, and project attributes.

• The Detailed COCOMO81 model incorporates all characteristics of the Intermediate model with an assessment of the cost drivers' impact on each step (i.e., analysis, design, etc.) of the software engineering process.

COCOMO81 models depend primarily on two equations. The first gives the development effort in MM (man-months, person-months, or staff-months, each a measure of 1 month of effort by one person):

$$\mathrm{MM} = a\,(\mathrm{KDSI})^{b}. \tag{1}$$

The second relates effort to development time (TDEV):

$$\mathrm{TDEV} = c\,(\mathrm{MM})^{d}. \tag{2}$$


Table 1
MM for the Basic COCOMO

Development mode    Basic effort equation
Organic             $\mathrm{MM} = 2.4 \times (\mathrm{KDSI})^{1.05}$
Semi-detached       $\mathrm{MM} = 3.0 \times (\mathrm{KDSI})^{1.12}$
Embedded            $\mathrm{MM} = 3.6 \times (\mathrm{KDSI})^{1.20}$

Table 2
TDEV for the Basic COCOMO

Development mode    Basic development time equation
Organic             $\mathrm{TDEV} = 2.5 \times (\mathrm{MM})^{0.38}$
Semi-detached       $\mathrm{TDEV} = 2.5 \times (\mathrm{MM})^{0.35}$
Embedded            $\mathrm{TDEV} = 2.5 \times (\mathrm{MM})^{0.32}$

In the above two equations, KDSI represents the number of thousands of delivered source instructions and is a measure of the program size. The coefficients a, b, c, and d depend on the project development mode. There are three modes of development and they are listed below:

• Organic mode: The project is being developed in a familiar, stable environment and is similar to previously developed projects.

• Embedded mode: The project will require much innovation and is subject to tight, inflexible interface requirements and constraints.

• Semi-detached mode: The project is characterized somewhere between the Organic and Embedded development modes.

2.2. Basic COCOMO model

The Basic COCOMO is the top-level model, and is effective when a rough software effort estimate is needed. The effort equations for each mode of project development are shown in Table 1. The time to develop the project (TDEV) is based on effort and is characterized by the equations listed in Table 2.
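To make the use of Tables 1 and 2 concrete, the following is a minimal Python sketch of the Basic COCOMO calculation. The function name `basic_cocomo`, the dictionary `BASIC_COEFFS`, and the 32 KDSI example size are illustrative assumptions and not part of the original model description; only the coefficients come from the tables.

```python
# Coefficients (a, b) for MM = a * KDSI**b and (c, d) for TDEV = c * MM**d,
# copied from Tables 1 and 2 (Basic COCOMO81).
BASIC_COEFFS = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}


def basic_cocomo(kdsi, mode):
    """Return (MM, TDEV) for a project of `kdsi` thousand delivered source instructions."""
    a, b, c, d = BASIC_COEFFS[mode]
    mm = a * kdsi ** b      # Eq. (1): effort in man-months
    tdev = c * mm ** d      # Eq. (2): development time
    return mm, tdev


if __name__ == "__main__":
    # 32 KDSI is an arbitrary example size, not a value from the case study.
    for mode in BASIC_COEFFS:
        mm, tdev = basic_cocomo(32.0, mode)
        print(f"{mode:14s} MM = {mm:8.1f}  TDEV = {tdev:6.1f}")
```

Because the exponent b grows from Organic to Embedded, the same program size yields a larger effort estimate as the development mode becomes more constrained.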

2.3. Intermediate COCOMO

Boehm [5] suggests that the accuracy of Basic COCOMO is limited because it does not account for differences in hardware, quality and experience of personnel, use of modern tools, and other attributes that are known to have a significant influence on project cost. Intermediate COCOMO adds accuracy to the Basic COCOMO by multiplying “Cost Drivers” into the equation through a new variable, the “Effort Adjustment Factor” (EAF).

The EAF term shown in Table 3 is the product of 15 “Effort Multipliers”, which are listed in Table 4. If the category values of all fifteen cost drivers are “Nominal”, then the EAF term is equal to 1.0, implying that for the Semi-detached mode the Intermediate and Basic COCOMO models would yield the same results. For the Organic and Embedded development modes, the Intermediate coefficients differ from the Basic ones (3.2 versus 2.4, and 2.8 versus 3.6), so with a nominal EAF the project cost estimates would, respectively, increase or decrease. Depending on the cost driver, the EAF is increased or decreased by choosing a lower or higher rating.

Table 3
MM for the Intermediate COCOMO

Development mode    Intermediate effort equation
Organic             $\mathrm{MM} = \mathrm{EAF} \times 3.2 \times (\mathrm{KDSI})^{1.05}$
Semi-detached       $\mathrm{MM} = \mathrm{EAF} \times 3.0 \times (\mathrm{KDSI})^{1.12}$
Embedded            $\mathrm{MM} = \mathrm{EAF} \times 2.8 \times (\mathrm{KDSI})^{1.20}$

Table 4
Intermediate COCOMO multipliers ("-" marks a rating that is not defined for the cost driver)

Cost driver   Very low   Low    Nominal   High   Very high   Extra high
ACAP          1.46       1.19   1.00      0.86   0.71        -
AEXP          1.29       1.13   1.00      0.91   0.82        -
CPLX          0.70       0.85   1.00      1.15   1.30        1.65
DATA          -          0.94   1.00      1.08   1.16        -
LEXP          1.14       1.07   1.00      0.95   -           -
MODP          1.24       1.10   1.00      0.91   0.82        -
PCAP          1.42       1.17   1.00      0.86   0.70        -
RELY          0.75       0.88   1.00      1.15   1.40        -
SCED          1.23       1.08   1.00      1.04   1.10        -
STOR          -          -      1.00      1.06   1.21        1.56
TIME          -          -      1.00      1.11   1.30        1.66
TOOL          1.24       1.10   1.00      0.91   0.83        -
TURN          -          0.87   1.00      1.07   1.15        -
VEXP          1.21       1.10   1.00      0.90   -           -
VIRT          -          0.87   1.00      1.15   1.30        -

The details of the cost driver acronyms used in Table 4 are as follows [5]: ACAP: analyst capability; AEXP: applications experience; CPLX: product complexity; DATA: database size; LEXP: language experience; MODP: modern programming practices; PCAP: programmer capability; RELY: required software reliability; SCED: required development schedule; STOR: main storage constraint; TIME: execution time constraint; TOOL: use of software tools; TURN: computer turnaround time; VEXP: virtual machine experience; and VIRT: virtual machine volatility.

2.4. Detailed COCOMO

The Detailed model differs from the Intermediate model in only one major respect: the Detailed model uses different Effort Multipliers for each phase of a project. Phase-dependent Effort Multipliers are intended to yield better estimates than those of the Intermediate model. The Detailed model defines six life cycle phases: requirements, product design, detailed design, coding and unit testing, integration and testing, and maintenance.

The Detailed COCOMO model illustrates the importance of recognizing the different levels of predictability at each phase of the development cycle. Boehm et al. had the right idea here; however, COCOMO81 by itself is not robust enough to predict project costs accurately at all phases of the development life cycle. By considering an extreme scenario of applying appropriate weights to the requirements analysis phase, a serious flaw in the Detailed COCOMO model is observed. Furthermore, the model evaluates the cost estimate based on inputs that are not very accurate until the later phases of the software design. Hence, the Intermediate model is the more accurate COCOMO model, as compared to both the Basic and Detailed models.

The steps in obtaining an estimate using the Intermediate COCOMO81 model are:

(1) Identify the mode of development for the new product, i.e., Organic, Semi-detached, or Embedded.

(2) Estimate the size of the project in KDSI to derive a nominal effort prediction. Adjust the fifteen cost drivers (Table 4) to reflect the project.

(3) Calculate the effort adjustment factor (EAF).

(4) Calculate the predicted project effort using the equations in Table 3.
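The four steps above can be sketched in Python as follows. This is an illustrative sketch only: the names `effort_adjustment_factor` and `intermediate_cocomo` are assumptions, only three of the fifteen rows of Table 4 are transcribed to keep the example short, and any cost driver not passed in is assumed to be rated Nominal (multiplier 1.00).

```python
import math

# Nominal coefficients (a, b) from Table 3.
INTERMEDIATE_COEFFS = {
    "organic": (3.2, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (2.8, 1.20),
}

# Three rows of Table 4; the remaining cost drivers would be added the same way.
EFFORT_MULTIPLIERS = {
    "ACAP": {"very low": 1.46, "low": 1.19, "nominal": 1.00,
             "high": 0.86, "very high": 0.71},
    "CPLX": {"very low": 0.70, "low": 0.85, "nominal": 1.00,
             "high": 1.15, "very high": 1.30, "extra high": 1.65},
    "RELY": {"very low": 0.75, "low": 0.88, "nominal": 1.00,
             "high": 1.15, "very high": 1.40},
}


def effort_adjustment_factor(ratings):
    """Step 3: product of the multipliers of the rated cost drivers.

    Drivers missing from `ratings` are treated as Nominal (assumption).
    """
    return math.prod(EFFORT_MULTIPLIERS[driver][rating]
                     for driver, rating in ratings.items())


def intermediate_cocomo(kdsi, mode, ratings):
    """Step 4: MM = EAF * a * KDSI**b with (a, b) taken from Table 3."""
    a, b = INTERMEDIATE_COEFFS[mode]
    return effort_adjustment_factor(ratings) * a * kdsi ** b


if __name__ == "__main__":
    # Hypothetical project: 50 KDSI, embedded, capable analysts, complex product.
    ratings = {"ACAP": "high", "CPLX": "very high"}
    print(intermediate_cocomo(50.0, "embedded", ratings))
```

Passing an empty `ratings` dictionary reproduces the all-Nominal case, in which the EAF equals 1.0.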

2.5. Characteristics of COCOMO81

COCOMO is transparent, in that one can observe how the model works, whereas other models such as SLIM do not provide such transparency. Drivers are particularly helpful to the estimator in understanding the impact of different factors that affect project costs. However, the COCOMO81 model demonstrates a few drawbacks, as listed below.

• It is difficult to accurately estimate KDSI during the early stages of the project, when accurate effort estimates are required the most.

• KDSI is not a true size measure but rather a length measure. This makes the model extremely vulnerable to mis-classification of the project development mode.

• Success depends largely on tuning the model, with the use of historical data, to the needs of the organization. However, historical data is usually not available when needed.

3. Fuzzy identification

Systems can be represented by mathematical models of many different forms, such as algebraic equations, differential equations, and finite state machines. A fuzzy model is based on a set of if–then rules that describe the relationships between variables. Basically, a fuzzy model provides an effective way to explore and present the approximate and imprecise nature of the real world. In particular, a fuzzy model appears useful when the systems are not suitable for analysis by conventional quantitative techniques or when the available information on the systems is uncertain or inaccurate. In such a model, the rules establish logical relations between the system's variables by relating qualitative values of one variable to qualitative values of another variable. The qualitative values typically have a clear linguistic interpretation and are referred to as linguistic terms.


The term fuzzy identification [2] usually refers to the techniques and algorithms for constructing fuzzy models from data. There are two main approaches for obtaining a fuzzy model from data:

• Expert knowledge in verbal form is translated into a set of if–then rules. A certain model structure can be created, and parameters of this structure, such as membership functions and weights of rules, can be tuned using input and output data.

• No prior knowledge about the system under study is initially used to formulate the rules, and a fuzzy model is constructed from data based on a certain algorithm. It is expected that the extracted rules and membership functions can explain the system behavior. An expert can modify the rules or supply new ones based upon his or her own experience. The expert tuning is optional in this approach.

Our study focuses on the second approach, i.e., fuzzy models are derived directly and automatically from data. The fuzzy model adopted in our study is the Takagi–Sugeno fuzzy model [20,21], in which the data can be numerical or linguistic. The two major components of a fuzzy model are membership functions and rules. The following sections focus on the algorithms for membership functions, the Takagi–Sugeno fuzzy model, and rule extraction.

3.1. Fuzzy clustering

Prior to determining the membership functions, we need to partition the data for each input variable into a number of clusters, chosen based on experience. These clusters have “fuzzy” boundaries, in the sense that each data value belongs to a given cluster to some degree, implying that the membership of each data observation is neither crisp nor certain. Having decided upon the number of such clusters to be used, we need some procedure to locate their mid-points (or more generally, their centroids) and to determine the associated membership functions and degrees of membership for the data points.

A variety of fuzzy clustering methods have been proposed. In this section, we describe only the fuzzy c-means (FCM) method and its most obvious generalizations. Readers familiar with the FCM algorithm can skip or skim this section. The FCM algorithm is a generalization of the “hard” c-means algorithm. It is closely associated with such early contributors as Bezdek [4] and Dunn [7], and is widely used in fields such as pattern recognition. Suppose we are given a set of n elements or data samples that we wish to classify:

$$X = \{x_1, x_2, \ldots, x_k, \ldots, x_n\}. \tag{3}$$

Each element $x_k$ is an m-dimensional data vector:

$$x_k = [x_{k1}, x_{k2}, \ldots, x_{km}]. \tag{4}$$

The FCM algorithm is generally more suited to a data set whose data points are (approximately) evenly distributed around distinct cluster “centers”. A cluster center is a vector $v_i$ ($i = 1, 2, \ldots, c$) of m components representing a “prototype” for the elements in cluster i. $v_i$ does not necessarily need to be an element of the set. A membership value, which describes the degree of belonging of the kth element in the set, $x_k$, to the ith cluster, can be denoted by

$$\mu_{ik} \in [0, 1] \quad (1 \le i \le c;\ 1 \le k \le n). \tag{5}$$

In the FCM algorithm, a partition matrix U, which reflects a measure of similarity among the elements (i.e., proximity of the data points to each of the cluster centers), is defined as follows:

$$U = \begin{bmatrix} \mu_{11} & \mu_{12} & \cdots & \mu_{1n} \\ \mu_{21} & \mu_{22} & \cdots & \mu_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{c1} & \mu_{c2} & \cdots & \mu_{cn} \end{bmatrix}. \tag{6}$$

The classification criterion is realized by minimizing the following objective function (the performance index) with respect to both the membership values and the cluster centers:

$$J(c) = \sum_{k=1}^{n} \sum_{i=1}^{c} (\mu_{ik})^{w} (d_{ik})^{2}, \tag{7}$$

where $w \in (1, \infty)$ denotes the fuzziness index (a weight on membership values) and $d_{ik}$ is a measure of proximity between the kth data sample $x_k$ and the ith cluster center $v_i$, defined by the Euclidean distance norm,

$$d_{ik} = \|x_k - v_i\| = \left[ \sum_{j=1}^{m} (x_{kj} - v_{ij})^{2} \right]^{1/2}. \tag{8}$$

Since the elements in the set have m features (co-ordinates) to describe their location in feature space, each cluster center also requires m features to determine its location in the same space. Therefore, the ith cluster center $v_i$ is an m-dimensional vector, similar to a data point:

$$v_i = [v_{i1}\ v_{i2}\ \cdots\ v_{im}]. \tag{9}$$

The value of the jth feature of $v_i$ is calculated by the expression:

$$v_{ij} = \frac{\sum_{k=1}^{n} (\mu_{ik})^{w}\, x_{kj}}{\sum_{k=1}^{n} (\mu_{ik})^{w}}, \tag{10}$$

where the computation is done for all features, $j = 1, 2, \ldots, m$.

The fuzzy c-means method is an iterative algorithm, and is described as follows:

(1) Given the desired number of clusters c for the classification of the n elements of the data set, and a real number $w > 1$, assume an initial partition matrix $U^{(0)}$. The iteration number in this algorithm is labeled with superscript t, with t being 0 for the initial guess.

(2) Compute the cluster centers (prototypes) $v_i^{(t)} = [v_{i1}^{(t)}\ v_{i2}^{(t)}\ \cdots\ v_{im}^{(t)}]$ for $i = 1, 2, \ldots, c$, using the expression:

$$v_{ij}^{(t)} = \frac{\sum_{k=1}^{n} (\mu_{ik}^{(t)})^{w}\, x_{kj}}{\sum_{k=1}^{n} (\mu_{ik}^{(t)})^{w}}. \tag{11}$$

(3) Compute the distances from each element in the set to each cluster center, using

$$d_{ik}^{(t)} = \|x_k - v_i^{(t)}\| = \left[ \sum_{j=1}^{m} (x_{kj} - v_{ij}^{(t)})^{2} \right]^{1/2} \tag{12}$$

for all clusters $i = 1, 2, \ldots, c$ and elements $k = 1, 2, \ldots, n$.

(4) Update the membership value of each data point. The updated values $\mu_{ik}^{(t+1)}$ of element k in cluster i are computed by the formula:

$$\mu_{ik}^{(t+1)} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik}^{(t)} / d_{jk}^{(t)} \right)^{2/(w-1)}}. \tag{13}$$

The special form of this formula ensures that the sum of membership values of an element over all clusters equals unity. The partition matrix $U^{(t+1)}$ is then re-computed with these updated membership values as

$$U^{(t+1)} = \begin{bmatrix} \mu_{11}^{(t+1)} & \mu_{12}^{(t+1)} & \cdots & \mu_{1n}^{(t+1)} \\ \mu_{21}^{(t+1)} & \mu_{22}^{(t+1)} & \cdots & \mu_{2n}^{(t+1)} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{c1}^{(t+1)} & \mu_{c2}^{(t+1)} & \cdots & \mu_{cn}^{(t+1)} \end{bmatrix}. \tag{14}$$

(5) The iterative process stops when it has converged under some selected norm. Otherwise, a new iteration is performed, i.e., set $t = t + 1$ and return to step (2). The norm employed for checking convergence might be:

$$\max_{i,k} \left| \mu_{ik}^{(t+1)} - \mu_{ik}^{(t)} \right| \le \epsilon, \tag{15}$$

where $\epsilon$ is a predefined error limit.
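The five steps can be sketched in plain Python as shown below. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `fcm`, the random initialization of $U^{(0)}$, the default values of `w`, `eps`, and `max_iter`, and the handling of a data point that coincides exactly with a cluster center (where Eq. (13) is undefined) are all choices made for this sketch.

```python
import math
import random


def fcm(data, c, w=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means on a list of m-dimensional points.

    Returns (centers, U), where U[i][k] is the membership of point k in cluster i.
    """
    rng = random.Random(seed)
    n, m = len(data), len(data[0])

    # Step 1: random initial partition matrix whose columns sum to one.
    U = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        col = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= col

    for _ in range(max_iter):
        # Step 2: cluster centers, Eq. (11).
        centers = []
        for i in range(c):
            weights = [U[i][k] ** w for k in range(n)]
            total = sum(weights)
            centers.append([sum(weights[k] * data[k][j] for k in range(n)) / total
                            for j in range(m)])

        # Step 3: Euclidean distances, Eq. (12).
        d = [[math.dist(data[k], centers[i]) for k in range(n)] for i in range(c)]

        # Step 4: membership update, Eq. (13).
        new_U = [[0.0] * n for _ in range(c)]
        for k in range(n):
            zero = [i for i in range(c) if d[i][k] == 0.0]
            if zero:
                # Point sits exactly on one or more centers: share the
                # membership among them (convention assumed for this sketch).
                for i in zero:
                    new_U[i][k] = 1.0 / len(zero)
                continue
            for i in range(c):
                s = sum((d[i][k] / d[j][k]) ** (2.0 / (w - 1.0)) for j in range(c))
                new_U[i][k] = 1.0 / s

        # Step 5: stop when the largest membership change is at most eps, Eq. (15).
        change = max(abs(new_U[i][k] - U[i][k]) for i in range(c) for k in range(n))
        U = new_U
        if change <= eps:
            break

    return centers, U


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
    centers, U = fcm(points, c=2)
    print(centers)
```

Each column of the returned matrix sums to one, as noted under Eq. (13); which row corresponds to which cluster depends on the random initialization.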

3.2. Takagi–Sugeno fuzzy model

The Takagi–Sugeno fuzzy model, also known as the TS fuzzy model, was proposed by Takagi, Sugeno and Kang [20,21] in an effort to develop a systematic approach to generating fuzzy rules from a given input–output data set. A typical fuzzy rule in a TS fuzzy model has the form

$$R_i:\ \text{If } x_1 \text{ is } A_{i1} \text{ and } \cdots\ x_p \text{ is } A_{ip} \text{ then } y_i = f(x), \quad i = 1, 2, \ldots, n, \tag{16}$$

where $A_{i1}, \ldots, A_{ip}$ are the fuzzy sets in the antecedent, while $y_i = f(x)$ is a crisp function in the consequent. Usually $y_i = f(x)$ is a polynomial function with respect to the input variable x. When $y_i = f(x)$ is a first-order polynomial function, the resulting fuzzy inference system is called a first-order TS fuzzy model, which was originally proposed in [20,21]. When f is a constant, we have what is known as a zero-order TS fuzzy model, which can be viewed as a special case of the Mamdani fuzzy inference system [14]. The following is a single-input TS fuzzy model:

$$\begin{aligned} &\text{If } X \text{ is small then } Y = 0.1X + 6.4, \\ &\text{If } X \text{ is medium then } Y = -0.5X + 4, \\ &\text{If } X \text{ is large then } Y = X - 2. \end{aligned} \tag{17}$$


Fig. 1. TS fuzzy model with non-fuzzy input membership functions. (a) Antecedent MFs for crisp rules (membership grades of small, medium, and large versus X). (b) Overall I/O curve for crisp rules (Y versus X).

If “small”, “medium”, and “large” are non-fuzzy sets with the membership functions in Fig. 1, then the overall input–output curve is piece-wise linear. On the other hand, if we have fuzzy input membership functions, the overall input–output curve becomes smooth, like the one shown in Fig. 2.
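The difference between the two cases can be illustrated with the short Python sketch below, which evaluates the rules of Eq. (17). The consequents come from Eq. (17); everything else is an assumption made for illustration: the piecewise-linear fuzzy membership functions, the crisp boundaries at ±2.5, and the weighted-average aggregation of rule outputs (the usual choice for TS models). None of these is read off Figs. 1 and 2.

```python
# Consequents of Eq. (17): y = slope * x + intercept for each rule.
RULES = {"small": (0.1, 6.4), "medium": (-0.5, 4.0), "large": (1.0, -2.0)}


def fuzzy_mfs(x):
    """Assumed piecewise-linear membership grades on [-10, 10]."""
    return {
        "small": min(1.0, max(0.0, -x / 5.0)),
        "medium": max(0.0, 1.0 - abs(x) / 5.0),
        "large": min(1.0, max(0.0, x / 5.0)),
    }


def crisp_mfs(x):
    """Assumed non-fuzzy sets: every x belongs to exactly one set."""
    return {
        "small": float(x < -2.5),
        "medium": float(-2.5 <= x <= 2.5),
        "large": float(x > 2.5),
    }


def ts_output(x, mfs):
    """Weighted average of the rule consequents (assumed aggregation)."""
    grades = mfs(x)
    weighted = sum(g * (RULES[name][0] * x + RULES[name][1])
                   for name, g in grades.items())
    return weighted / sum(grades.values())


if __name__ == "__main__":
    for x in (-10.0, -2.5, 0.0, 2.5, 2.6, 10.0):
        print(x, ts_output(x, crisp_mfs), ts_output(x, fuzzy_mfs))
```

With the crisp sets the output jumps between the three lines at the set boundaries, whereas with overlapping fuzzy sets neighbouring consequents are blended and the curve is continuous.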

3.3. Rule extraction

We can generate rules for the fuzzy model based on the membership functions obtained by using FCM. There are various algorithms to generate rules; however, we will focus our efforts on the Takagi–Sugeno (TS) models.

The idea of constructing TS fuzzy models by fuzzy clustering is not new. Yoshinari et al. applied the fuzzy c-elliptotypes algorithm to derive a TS fuzzy model [26], and Babuska and Verbruggen et al. used the GK algorithm [2]. We will present a way of deriving a TS model using FCM. The fuzzy model can be represented as a set of TS rules:

$$R_i:\ \text{If } x \text{ is } A_i \text{ then } y_i = a_i^{T} x + b_i, \quad i = 1, 2, \ldots, n. \tag{18}$$

The antecedent fuzzy set $A_i$ can be extracted from the fuzzy partition matrix by projections. The consequent parameters, $a_i$ and $b_i$, are estimated from the data using least-squares methods.


Fig. 2. TS fuzzy model with fuzzy input membership functions. (c) Antecedent MFs for fuzzy rules (membership grades of small, medium, and large versus X). (d) Overall I/O curve for fuzzy rules (Y versus X).

3.4. Generating antecedent membership function by projection

The principle of this method is to obtain the membership function of each individual antecedent variable by projecting the multi-dimensional fuzzy sets, defined point-wise in the rows of the partition matrix U, onto the axes associated with the individual antecedent variables. There are many projection methods available; in our study we use the axis-orthogonal projection method.

This method projects the fuzzy partition matrix U onto the axes of the antecedent variables $x_j$, $1 \le j \le p$. The TS rules are then expressed in the conjunctive form:

$$R_i:\ \text{If } x_1 \text{ is } A_{i1} \text{ and } \cdots\ x_p \text{ is } A_{ip} \text{ then } y_i = a_i^{T} x + b_i, \quad i = 1, 2, \ldots, n. \tag{19}$$

In order to obtain membership functions for the antecedent fuzzy sets $A_{ij}$, the multi-dimensional fuzzy set defined point-wise in the ith row of the partition matrix U is projected onto the regressors $x_j$ by

$$\mu_{A_{ij}}(x_{jk}) = \mathrm{proj}_j(\mu_{ik}). \tag{20}$$

The projection operator is based on the following two definitions.

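Definition 1 (Point-wise projection). Let $U = (X^{(i)})_{i \in \mathbb{N}_n}$ be a universe of dimension n and let C, S and T be index subsets of $\mathbb{N}_n$ which satisfy the conditions $T = S \cup C$, $S \cap C = \emptyset$ and $S \neq \emptyset$. Point-wise projection of $X^T$ onto $X^S$ is the mapping $\mathrm{red}^T_S : X^T \to X^S$ defined by

$$\mathrm{red}^T_S(x^T) = x^S \quad \text{with} \quad \left( \forall i \in S : (x^S)^{(i)} = (x^T)^{(i)} \right). \tag{21}$$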


Fig. 3. Example of projection from two dimensions to one dimension. (Labels in the figure: µ, A1, A11, A12, x1, x2.)

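Definition 2 (Projection of a fuzzy set). Let $U = (X^{(i)})_{i \in \mathbb{N}_n}$ be a universe of dimension n, and M an index set with $\emptyset \neq M \subseteq \mathbb{N}_n$. The projection of A onto $X^M$ is the mapping $\mathrm{proj}_M : F(X) \to F(X^M)$ defined by

$$\mathrm{proj}_M(\mu(x)) = \sup \left\{ \mu(x') \mid x' \in X \wedge x = \mathrm{red}^{\mathbb{N}_n}_M(x') \right\}. \tag{22}$$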

An example of projection from a two-dimensional space to a one-dimensional space is presented in Fig. 3.
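For a finite data set, the supremum in Eq. (22) reduces to a maximum over the data points that share the same value of the projected variable. The following Python sketch illustrates this for one row of the partition matrix; the function name `project_row` and the toy data are assumptions made for illustration.

```python
def project_row(memberships, data, j):
    """Project one row of U onto feature j.

    For each distinct value of x_kj, keep the largest membership mu_ik among
    the data points having that value (the sup in Eq. (22) on finite data).
    """
    projected = {}
    for mu, x in zip(memberships, data):
        value = x[j]
        projected[value] = max(mu, projected.get(value, 0.0))
    return dict(sorted(projected.items()))


if __name__ == "__main__":
    # Hypothetical two-dimensional points and their memberships in cluster i.
    data = [(1, 10), (1, 20), (2, 10), (3, 30)]
    mu_i = [0.2, 0.9, 0.5, 0.1]
    print(project_row(mu_i, data, 0))  # projection onto the first variable
    print(project_row(mu_i, data, 1))  # projection onto the second variable
```

The result is a point-wise membership function for the antecedent fuzzy set $A_{ij}$, defined only at the observed values of $x_j$.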

3.5. Estimating consequent parameters

There are several approaches for obtaining the consequent parameters; however, in our study we have adopted the weighted least-squares technique. The identification data and the membership degrees of the fuzzy partition are arranged in the following matrices:

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \tag{23}$$

$$W_i = \begin{bmatrix} \mu_{i1} & 0 & \cdots & 0 \\ 0 & \mu_{i2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mu_{iN} \end{bmatrix}. \tag{24}$$

The consequent parameters, $a_i$ and $b_i$, of the rule belonging to the ith cluster are concatenated into a single parameter vector, $\theta_i$, which is given by

$$\theta_i = [a_i^T, b_i]^T. \tag{25}$$


Appending a unitary column to $X$ gives the extended regressor matrix $X_e$:

$$X_e = [X, 1]. \tag{26}$$

The membership degree, $\mu_{ik}$, of the fuzzy partition serves as the weight expressing the relevance of the data pair $(x_k, y_k)$ to the $i$th local model. If the columns of $X_e$ are linearly independent and $\mu_{ik} > 0$ for $1 \le k \le N$, then

$$\theta_i = [X_e^T W_i X_e]^{-1} X_e^T W_i y \tag{27}$$

is the least-squares solution of $y = X_e \theta + \varepsilon$, where the $k$th data pair $(x_k, y_k)$ is weighted by $\mu_{ik}$. The parameters $a_i$ and $b_i$ are given by

$$a_i = [\theta_1, \theta_2, \ldots, \theta_p], \quad b_i = \theta_{p+1}. \tag{28}$$

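However, if the columns of $X_e$ are linearly dependent, we should use the orthogonal factorization of $X_e$. To simplify the computation, it is more efficient to first multiply each row of $X_e$ and $y$ by $\sqrt{\mu_{ik}}$:

$$\tilde{X}_i = \begin{bmatrix} \sqrt{\mu_{i1}}\, x_{e1}^T \\ \sqrt{\mu_{i2}}\, x_{e2}^T \\ \vdots \\ \sqrt{\mu_{iN}}\, x_{eN}^T \end{bmatrix}, \quad \tilde{y}_i = \begin{bmatrix} \sqrt{\mu_{i1}}\, y_1 \\ \sqrt{\mu_{i2}}\, y_2 \\ \vdots \\ \sqrt{\mu_{iN}}\, y_N \end{bmatrix} \tag{29}$$

and then compute $\theta_i$ by

$$\theta_i = [\tilde{X}_i^T \tilde{X}_i]^{-1} \tilde{X}_i^T \tilde{y}_i. \tag{30}$$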
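The weighted least-squares step of this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `consequent_parameters` is hypothetical, and `numpy.linalg.lstsq` (an SVD-based solver) is swapped in for the orthogonal factorization mentioned in the text, since it also copes with linearly dependent columns.

```python
import numpy as np

def consequent_parameters(X, y, mu_i):
    """Weighted least-squares estimate of theta_i = [a_i^T, b_i]^T.

    X    : (N, p) regressor matrix, one sample per row.
    y    : (N,) vector of outputs.
    mu_i : (N,) membership degrees of the samples in cluster i.
    Returns (a_i, b_i).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.sqrt(np.asarray(mu_i, dtype=float))
    # Extended regressor matrix: append a column of ones.
    Xe = np.hstack([X, np.ones((X.shape[0], 1))])
    # Scale every row of Xe and y by sqrt(mu_ik).
    X_tilde = Xe * w[:, None]
    y_tilde = y * w
    # Assumption: SVD-based least squares instead of an explicit inverse;
    # it returns the minimum-norm solution if X_tilde is rank deficient.
    theta, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
    return theta[:-1], theta[-1]
```

Solving the scaled problem directly avoids forming the normal-equation matrix, which tends to be poorly conditioned when regressors are strongly correlated.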

4. Preprocessing data

In the COCOMO81 Intermediate model, the cost driver attributes are grouped into four categories: software product attributes, computer attributes, personnel attributes, and project attributes. They are listed below:

• Product Attributes:
  RELY: Required Software Reliability;
  DATA: Database Size;
  CPLX: Product Complexity.

• Computer Attributes:
  TIME: Execution Time Constraint;
  STOR: Main Storage Constraint;
  VIRT: Virtual Machine Volatility;
  TURN: Computer Turnaround Time.

• Personnel Attributes:
  ACAP: Analyst Capability;
  AEXP: Applications Experience;
  PCAP: Programmer Capability;
  VEXP: Virtual Machine Experience;
  LEXP: Programming Language Experience.

• Project Attributes:
  MODP: Modern Programming Practices;
  TOOL: Use of Software Tool;
  SCED: Required Development Schedule.

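The fuzzy model would be extremely complex if all 15 of the above-mentioned factors were used as individual inputs. Hence, it is more practical to use the four grouped categories as inputs. The value of each category is obtained from the combined contribution of the factors in that category. We first define membership functions for these four categories. Each category is described by 10 membership functions as follows:

$$\begin{aligned}
\mu_{\text{very-low}}(x) &= \frac{1}{1 + \left|\frac{x}{12.5}\right|^{5.00}}, &
\mu_{\text{low}}(x) &= \frac{1}{1 + \left|\frac{x - 20}{12.5}\right|^{5.00}}, \\
\mu_{\text{low-nominal}}(x) &= \frac{1}{1 + \left|\frac{x - 30}{12.5}\right|^{5.00}}, &
\mu_{\text{nominal}}(x) &= \frac{1}{1 + \left|\frac{x - 40}{12.5}\right|^{5.00}}, \\
\mu_{\text{nominal-high}}(x) &= \frac{1}{1 + \left|\frac{x - 50}{12.5}\right|^{5.00}}, &
\mu_{\text{high}}(x) &= \frac{1}{1 + \left|\frac{x - 60}{12.5}\right|^{5.00}}, \\
\mu_{\text{high-very high}}(x) &= \frac{1}{1 + \left|\frac{x - 70}{12.5}\right|^{5.00}}, &
\mu_{\text{very high}}(x) &= \frac{1}{1 + \left|\frac{x - 80}{12.5}\right|^{5.00}}, \\
\mu_{\text{very-extra high}}(x) &= \frac{1}{1 + \left|\frac{x - 90}{12.5}\right|^{5.00}}, &
\mu_{\text{extra high}}(x) &= \frac{1}{1 + \left|\frac{x - 100}{12.5}\right|^{5.00}}.
\end{aligned} \tag{31}$$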
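The ten membership functions of Eq. (31) share one bell-shaped form and differ only in their centres, so they can be generated from a small lookup table. The sketch below is illustrative; the names `CENTERS`, `WIDTH`, `EXPONENT` and `membership` are hypothetical and chosen only for this example.

```python
import numpy as np

# Centres of the ten linguistic levels, read off Eq. (31).
CENTERS = {
    "very-low": 0.0, "low": 20.0, "low-nominal": 30.0, "nominal": 40.0,
    "nominal-high": 50.0, "high": 60.0, "high-very high": 70.0,
    "very high": 80.0, "very-extra high": 90.0, "extra high": 100.0,
}
WIDTH = 12.5     # common denominator inside the absolute value
EXPONENT = 5.0   # common exponent

def membership(x, label):
    """Degree to which the category value x belongs to the given level."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.abs((x - CENTERS[label]) / WIDTH) ** EXPONENT)
```

Each function equals 1 at its centre and falls to 0.5 at a distance of 12.5 from it, so neighbouring levels, whose centres are 10 apart, overlap substantially.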

The following rules are then used to calculate the values of the categories. Let us consider 'RELY' from the Product Attributes in the following example.

(1) Rule:
(2) If RELY is very low then Product Attributes associated with RELY is very low.
(3) If RELY is low then Product Attributes associated with RELY is low.
(4) If RELY is between low and nominal then Product Attributes associated with RELY is low-nominal.
(5) If RELY is nominal then Product Attributes associated with RELY is nominal.
(6) If RELY is between nominal and high then Product Attributes associated with RELY is nominal-high.
(7) If RELY is high then Product Attributes associated with RELY is high.
(8) If RELY is between high and very-high then Product Attributes associated with RELY is high-very high.
(9) If RELY is very-high then Product Attributes associated with RELY is very-high.
(10) If RELY is between very-high and extra-high then Product Attributes associated with RELY is very-extra high.
(11) If RELY is extra-high then Product Attributes associated with RELY is extra-high.

The other factors in Product Attributes use similar rules, from which we obtain the overall fuzzy set Product Attributes associated with RELY, DATA, and CPLX. We use the Larsen Product-Addition inference method [23] to obtain the overall Product Attributes. We then extract a crisp value from the fuzzy set Product Attributes associated with RELY, DATA, and CPLX as a representative value. This process is similar to defuzzification [23] in a fuzzy system, and we apply the "Centroid of Areas" method to calculate the defuzzified values.
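One possible reading of this preprocessing step is sketched below. It rests on several assumptions that the text does not spell out: each cost driver's rating fires exactly one rule with strength 1, the Larsen product scales the consequent membership function by that strength, the scaled functions of all drivers in a category are added, and the centroid is taken over the interval [0, 100]. The names `CENTERS`, `bell` and `category_value` are hypothetical.

```python
import numpy as np

# Centres of the ten bell-shaped category membership functions of Eq. (31).
CENTERS = {
    "very-low": 0.0, "low": 20.0, "low-nominal": 30.0, "nominal": 40.0,
    "nominal-high": 50.0, "high": 60.0, "high-very high": 70.0,
    "very high": 80.0, "very-extra high": 90.0, "extra high": 100.0,
}

def bell(x, center, width=12.5, exponent=5.0):
    """Bell-shaped membership function with the constants of Eq. (31)."""
    return 1.0 / (1.0 + np.abs((x - center) / width) ** exponent)

def category_value(ratings, strengths=None, n_points=10001):
    """Crisp category value from the ratings of its cost drivers.

    ratings   : one linguistic label per cost driver in the category.
    strengths : optional firing strength per rule (assumed 1.0 if omitted).
    """
    x = np.linspace(0.0, 100.0, n_points)  # assumed universe of discourse
    if strengths is None:
        strengths = [1.0] * len(ratings)
    aggregated = np.zeros_like(x)
    for label, strength in zip(ratings, strengths):
        # Larsen product (scale the consequent), then aggregation by addition.
        aggregated += strength * bell(x, CENTERS[label])
    # Centroid of the aggregated set, approximated on the uniform grid.
    return float(np.sum(x * aggregated) / np.sum(aggregated))
```

For example, `category_value(["nominal", "high", "nominal-high"])` would give a representative value for Product Attributes when RELY, DATA and CPLX carry those ratings. Because the universe is truncated at 0 and 100, the centroid of a single level is pulled slightly towards the middle of the range rather than sitting exactly on its centre.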

5. Case study

5.1. Preprocessing data

The COCOMO81 database, which consists of 63 projects [5], was investigated in our study. We preprocessed the database according to Section 4. The extracted representative values for the four categories are listed in Table 5, and our fuzzy model is based on these values. The PN value in the table indicates the sequential project number.

5.2. COCOMO 81 model

We evaluated the Intermediate COCOMO81 model using the Organic model described in Table 3. The cost driver attributes determine a multiplying factor that estimates the effect of the attribute on the software development effort. These multipliers are applied to a COCOMO development effort to obtain a refined estimate of the software development effort. Each cost driver in the Intermediate COCOMO81 model is measured using a rating scale of six linguistic values: very low, low, nominal, high, very high, and extra high. Table 4 lists the effort multipliers used in this model.

Sometimes these six linguistic values are not sufficient to distinguish a particular project attribute; therefore, we added four more values: low-nominal, nominal-high, high-very high, and very-extra high. The multiplier for each of these four extra values is taken as the average of the multipliers of the two ratings between which it falls. For example, nominal-high for RELY can be 1.07. Table 7 shows the fitting results of the three COCOMO81 models.
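As a worked instance of this averaging rule, and assuming the standard Intermediate COCOMO81 effort multipliers for RELY of 1.00 (nominal) and 1.15 (high), which are quoted here from Boehm's published table [5] rather than from this section, the nominal-high multiplier would be

$$\mathrm{EM}_{\text{RELY, nominal-high}} = \frac{1.00 + 1.15}{2} = 1.075,$$

which corresponds to the value of 1.07 quoted above when truncated to two decimals.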


Table 5
COCOMO extracted category data

PN  Product  Computer  Personnel  Project
1   41.1240  54.9800   24.3957    33.5152
2   40.1425  50.0000   50.0000    33.5655
3   46.6479  32.7032   63.8372    53.3200
4   42.5577  35.1900   38.6238    41.2253
5   27.0283  35.1900   51.9920    33.3186
6   25.5631  49.8162   41.5188    25.5631
7   33.3186  30.3048   48.0080    53.3200
8   53.3521  70.9635   43.7821    20.4275
9   53.3521  64.6792   40.2771    46.6800
10  59.8575  59.1114   59.8417    40.0401
11  59.8575  59.1114   59.8417    40.0401
12  53.3521  50.0000   59.8417    40.2374
13  53.3521  54.9800   51.8616    53.0906
14  51.1653  64.6232   44.1198    13.7604
15  59.8575  55.0160   36.7206    41.2253
16  66.2675  53.4798   48.0080    40.0401
17  66.2675  53.4798   55.8463    40.0401
18  66.4345  64.6792   44.0241    17.7181
19  53.3200  47.5121   49.8531    53.3200
20  72.9717  64.8100   45.9852    25.5631
21  59.7626  40.3112   44.0241    53.3200
22  46.6800  54.8142   59.8417    41.2253
23  46.6800  47.5103   59.8417    33.3186
24  27.0283  42.6876   36.4120    66.2675
25  72.9717  45.1858   44.1537    53.3521
26  30.2440  57.4692   44.1537    33.7325
27  46.9094  54.8491   40.0401    40.1425
28  66.4345  64.8100   40.2771    27.0283
29  33.5655  40.0401   36.4120    49.1321
30  33.5655  40.0401   40.7008    49.1321
31  59.7626  72.0865   48.1384    40.0401
32  33.7325  40.0401   55.7159    27.0283
33  72.9717  63.3083   59.8417    53.3200

PN  TIME     STOR      VIRT       TURN
34  53.3200  37.6902   40.0401    33.5152
35  41.1240  57.3124   38.1615    23.4277
36  33.7325  30.3048   40.2771    53.3200
37  17.7181  45.0200   55.8463    40.0401
38  46.6800  30.3048   57.8434    59.7626
39  46.6800  35.1900   67.6406    40.2374
40  33.5655  35.1900   47.9965    33.5655
41  27.0283  30.3048   63.7299    53.3200
42  40.2400  45.1858   50.0170    53.3191
43  46.6809  50.0000   51.9920    46.6800
44  40.2400  52.5078   48.0080    40.2374
45  40.2400  45.1858   48.0080    50.0000
46  40.2400  45.1858   51.9920    46.6800
47  41.1240  30.3048   55.9145    40.2374
48  20.4275  35.1900   42.2813    40.2374
49  33.5655  35.1900   51.8616    59.7626
50  46.6800  54.8491   44.1537    40.0401
51  33.5655  49.8162   36.4120    33.5152
52  20.4275  50.0000   32.2631    25.5631
53  33.7325  64.6792   47.9848    27.0283
54  33.5655  45.1858   51.8508    40.2374
55  17.7181  30.3048   56.0105    33.5655
56  53.3521  59.6632   38.2746    20.4275
57  40.2374  64.6792   32.2631    17.7181
58  59.8575  58.9399   71.6592    46.6800
59  40.2374  42.6876   40.0401    46.6800
60  53.3521  50.0000   36.3781    20.4275
61  40.2374  30.3048   51.9920    53.0906
62  40.1425  59.8121   60.8193    41.2253
63  40.2374  35.1900   59.7229    53.0906

5.3. Fuzzy modeling

We applied fuzzy modeling to the COCOMO81 database. We selected the overall representative values of Product Attributes, Computer Attributes, Personnel Attributes, and Project Attributes, together with KDSI, as independent variables; the dependent variable was Man Month (MM), as described in Section 2. We evaluated the model using the quality of fit and cross validation evaluation techniques.

The initial number of clusters is empirically selected as 5, implying that there are 5 TS rules for the model. However, other values for the number of clusters can also be used. The optimization problem of determining the best number of clusters is beyond the scope of this paper. In the cross validation evaluation technique, 63 iterations of model building and evaluation were performed. At each iteration, one entry from the database is used as the test data, while the remaining 62 entries are used as the fit data. The model is built using the fit data, whereas the test data is used to evaluate the model. The average of the 63 test data evaluations represents the evaluation of the fitted model (Fig. 4).
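The leave-one-out procedure described above can be sketched generically. The fuzzy model itself is not reproduced here: an ordinary linear least-squares model stands in for it purely so the example runs, and the names `leave_one_out`, `fit_linear` and `predict_linear` are hypothetical.

```python
import numpy as np

def leave_one_out(X, y, fit, predict):
    """Return one held-out prediction per sample (leave-one-out).

    fit(X_fit, y_fit) builds a model from the fit data;
    predict(model, X_test) returns predictions for the test rows.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    predictions = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k           # all entries except entry k
        model = fit(X[keep], y[keep])      # build on the remaining n - 1
        predictions[k] = predict(model, X[k:k + 1])[0]
    return predictions

# Stand-in model (an assumption, not the fuzzy model): linear least squares.
def fit_linear(X, y):
    Xe = np.hstack([X, np.ones((len(X), 1))])
    theta, *_ = np.linalg.lstsq(Xe, y, rcond=None)
    return theta

def predict_linear(theta, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ theta
```

With the 63-project database this loop would run 63 times, each time fitting on 62 projects; the held-out predictions can then be compared with the actual efforts to obtain the cross validation error.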

The sampling period is 1 s, and the termination tolerance of the clustering algorithm is 0.01. Fig. 5 shows the projection membership functions for the five independent variables. The cluster centers are shown in Table 6. The quality of fit and cross validation cost estimates are shown in Table 7, whereas the statistical results are shown in Table 8. The quality of fit values represent the cost estimates of the model when the complete fit data set is also used as the test data set, i.e., resubstitution. The cross validation values are the cost estimates of the model based on the leave-one-out strategy.


Fig. 4. Membership function for the COCOMO81 categories. [Plot: membership degree (0 to 1) against category value (0 to 100) for the ten levels Very Low, Low, Low-Nominal, Nominal, Nominal-High, High, High-Very High, Very High, Very-Extra High, and Extra High.]

Fig. 5. Projection membership functions for five inputs. [Plots: membership degree (0 to 1) for each of the inputs u1, u2, u3, u4, and u5.]

The rules obtained for the output, Man Month (MM), are as follows.

1. If $u_1$ is $A_{11}$ and $u_2$ is $A_{12}$ and $u_3$ is $A_{13}$ and $u_4$ is $A_{14}$ and $u_5$ is $A_{15}$ then
   $MM = 32.00u_1 + 34.50u_2 - 28.60u_3 + 51.60u_4 + 3.44u_5 - 2540.00$.

2. If $u_1$ is $A_{21}$ and $u_2$ is $A_{22}$ and $u_3$ is $A_{23}$ and $u_4$ is $A_{24}$ and $u_5$ is $A_{25}$ then
   $MM = -5.94u_1 - 0.73u_2 + 2.20u_3 + 5.80u_4 + 9.43u_5 - 85.40$.

3. If $u_1$ is $A_{31}$ and $u_2$ is $A_{32}$ and $u_3$ is $A_{33}$ and $u_4$ is $A_{34}$ and $u_5$ is $A_{35}$ then
   $MM = 7.39u_1 - 1.53u_2 + 0.48u_3 - 7.01u_4 + 5.82u_5 - 43.90$.

4. If $u_1$ is $A_{41}$ and $u_2$ is $A_{42}$ and $u_3$ is $A_{43}$ and $u_4$ is $A_{44}$ and $u_5$ is $A_{45}$ then
   $MM = 18.90u_1 + 19.90u_2 - 33.60u_3 - 13.60u_4 + 8.60u_5 + 447.00$.

5. If $u_1$ is $A_{51}$ and $u_2$ is $A_{52}$ and $u_3$ is $A_{53}$ and $u_4$ is $A_{54}$ and $u_5$ is $A_{55}$ then
   $MM = -3.01 \times 10^{8} u_1 + 6.48 \times 10^{8} u_2 - 6.27 \times 10^{8} u_3 + 3.79 \times 10^{8} u_4 - 4.13 \times 10^{6} u_5 + 2.86 \times 10^{8}$,

where $u_1$, $u_2$, $u_3$, $u_4$, and $u_5$ represent Product Attributes, Computer Attributes, Personnel Attributes, Project Attributes, and KDSI, respectively.

Table 6
Cluster centers

Rule  u1    u2    u3    u4    u5
1     39.9  51.2  42.4  29.5  169.0
2     40.6  43.1  48.7  40.3  8.5
3     42.3  50.1  45.2  38.4  24.4
4     48.6  50.7  54.9  47.0  66.6
5     65.1  52.5  45.6  40.7  373.0
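How such rules combine into a single estimate can be sketched as follows. The sketch assumes the usual Takagi-Sugeno scheme in which the model output is the average of the rule consequents weighted by the rule firing strengths; the firing strengths are taken as given inputs because computing them requires the fitted antecedent membership functions, which are only shown graphically. Only the consequents of rules 2 and 3 are included, and the names `CONSEQUENTS` and `ts_output` are hypothetical.

```python
import numpy as np

# Consequent coefficients [a_1, ..., a_5, b] of rules 2 and 3 as printed.
CONSEQUENTS = np.array([
    [-5.94, -0.73, 2.20, 5.80, 9.43, -85.40],   # rule 2
    [7.39, -1.53, 0.48, -7.01, 5.82, -43.90],   # rule 3
])

def ts_output(u, firing, consequents=CONSEQUENTS):
    """Weighted-average output of a Takagi-Sugeno rule base.

    u      : input vector (u1, ..., u5).
    firing : non-negative firing strength of each rule (assumed given).
    """
    u = np.asarray(u, dtype=float)
    firing = np.asarray(firing, dtype=float)
    # Local linear model of every rule: MM_i = a_i^T u + b_i.
    local = consequents[:, :-1] @ u + consequents[:, -1]
    return float(firing @ local / firing.sum())
```

When a single rule fires, the output reduces to that rule's linear consequent; when several rules fire, the estimate interpolates between their local models.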

5.4. Discussion

As shown in Tables 7 and 8, the estimation accuracy of the proposed fuzzy modeling approach is better than that of the COCOMO models. The average absolute error (AAE) for cross validation of the fuzzy model (Fuzzy Cross) is only 45.5745, i.e., 32.5715 lower than that of the Intermediate COCOMO fit model. The AAE for quality of fit of the fuzzy model (Fuzzy Fit) is lower still. Furthermore, since the Intermediate COCOMO model has a better AAE than the Basic and Detailed COCOMO models, both the Fuzzy Cross and Fuzzy Fit models also perform better than the Basic and Detailed COCOMO models. Thus, the AAEs for both cross validation and quality of fit of the fuzzy model are significantly better than those of the COCOMO models. As Table 8 also shows, the average relative error (ARE) values for the fuzzy models are significantly lower than those of all three COCOMO models.
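The two summary statistics can be computed as in the sketch below. It assumes the usual definitions, namely the mean of |estimated - actual| for AAE and the mean of |estimated - actual| / actual for ARE; the function names are hypothetical.

```python
import numpy as np

def average_absolute_error(estimated, actual):
    """Mean of |estimated - actual| over all projects (assumed AAE)."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(estimated - actual)))

def average_relative_error(estimated, actual):
    """Mean of |estimated - actual| / actual over all projects (assumed ARE)."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(estimated - actual) / actual))

# Illustration with the first three projects of Table 7
# (Intermediate COCOMO estimates against the real effort).
intermediate = [2218, 1770, 245]
real = [2040, 1600, 243]
aae = average_absolute_error(intermediate, real)
are = average_relative_error(intermediate, real)
```

AAE is dominated by the largest projects, whereas ARE weights every project equally regardless of its size, which is why the two measures can rank models differently.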

In order to statistically verify our observations, we conducted a paired t-test [3] with ARE (as well as AAE) as the response variable. The null and alternative hypotheses (for ARE) are formulated as:

$$H_0:\ \mu\left(MM^{ARE}_{Fuzzy} - MM^{ARE}_{COCOMO}\right) > 0,$$

$$H_A:\ \mu\left(MM^{ARE}_{Fuzzy} - MM^{ARE}_{COCOMO}\right) < 0,$$

where $MM^{ARE}_{Fuzzy}$ and $MM^{ARE}_{COCOMO}$ represent the absolute relative error (ARE) for the fuzzy identification model and the COCOMO model, respectively. We compared the Fuzzy Cross and Fuzzy Fit models with each of the COCOMO models. However, only the results of the ARE comparisons with the Intermediate and Detailed models are presented.
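The paired t statistic used in such a comparison can be sketched with the Python standard library. The sign convention here (baseline error minus candidate error, so that a positive t favours the candidate model) is an assumption, as is the function name `paired_t`; the resulting value would be compared with a one-sided critical value at n - 1 degrees of freedom.

```python
import math
import statistics

def paired_t(errors_baseline, errors_candidate):
    """Paired t statistic for per-project errors of two models.

    Both sequences must list the same projects in the same order.
    """
    diffs = [b - c for b, c in zip(errors_baseline, errors_candidate)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n))

# Illustrative (made-up) relative errors for four projects.
t_value = paired_t([0.30, 0.20, 0.40, 0.25], [0.10, 0.10, 0.20, 0.20])
```

Pairing the errors by project removes the large between-project variation in effort from the comparison, so the test responds only to the per-project difference between the two models.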

When comparing the Fuzzy Cross and Fuzzy Fit models with the Intermediate COCOMO model, the t values obtained were 1.7561 and 1.8254, respectively. Both t values were greater than the critical value $t_{1-\alpha,\,n-1}$. In our case study, $\alpha = 0.05$ and $n = 63$ (the number of projects in the COCOMO database), and consequently the critical value is $t = 1.67$. Therefore, we reject the null hypothesis, $H_0$, and conclude that both fuzzy identification models significantly reduced the average relative error (ARE) in this case study. A similar observation was made when performing t-tests with AAE as the response variable.

When comparing the fuzzy models with the Detailed COCOMO model, the t values obtained (for ARE) were 1.7471 (Fuzzy Cross) and 1.8153 (Fuzzy Fit). Once again, these values are greater than the critical value of 1.67, and a similar conclusion was reached when performing t-tests with AAE as the response variable. In summary, we conclude that the fuzzy identification models for this case study yielded statistically better cost estimation results than all three COCOMO estimation models.

Table 7
MM results of the COCOMO and fuzzy models

PN  Fuzzy     Fuzzy      Intermediate  Detailed  Basic    Real
    cross     fit        COCOMO        COCOMO    COCOMO
1   1665.76   2029.11    2218          2286      1047     2040
2   1738.9    1642.63    1770          1760      2702     1600
3   386.781   290.548    245           248       711      243
4   227.697   257.462    212           207       134      240
5   35.1072   30.0092    39            38        44       33
6   35.2049   42.23228   30            30        10.3     43
7   9.6609    9.2164     9.8           10.2      18       8
8   511.155   1013.03    869           994       147      1075
9   428.083   470.224    397           395       213      423
10  239.023   252.081    214           218       115      321
11  205.23    277.115    243           248       131      218
12  195.67    219.131    238           237       274      201
13  110.603   102.336    108           106       163      79
14  89.466    77.126     60            64        10.3     73
15  66.7695   59.02503   52            51        18       61
16  34.9789   36.8081    38            39        17       40
17  11.1449   12.4844    10.7          10.9      7.8      9
18  11400     11400      11056         12380     3652     11400
19  6105.09   6600       7764          7699      13749    6600
20  6338.81   6400       6536          7571      1698     6400
21  2414.12   2455       1836          1864      2741     2455
22  705.797   778.306    733           728       1003     724
23  609.902   463.14     443           445       640      539
24  454.276   443.364    326           337       463      453
25  451.143   523        430           433       283      523
26  365.347   370.318    339           341       375      387
27  82.227    88.7186    89            82        53       88
28  85.4045   132.093    133           143       35       98
29  8.12141   7.873      7             7.0       7.0      7.3
30  6.05914   6.03522    5.8           5.9       6.4      5.9
31  1013.37   1002.02    962           1057      394      1063
32  815.256   656.699    869           868       1527     702
33  642.263   540.293    529           543       301      605
34  203.147   188.059    201           201       147      230
35  72.2977   76.099     161           162       78       82
36  67.9568   68.6683    33            34        49       55
37  46.7448   43.981     44            44        97       47
38  18.3142   20.4074    20            20        41       12
39  9.744     8.328      8.4           8.3       16       8
40  9.0743    8.3674     8.1           8.0       6.3      8
41  8.02493   6.062      4.7           4.9       14       6
42  39.448    46.794     46            46        54       45
43  74.9033   131.163    102           102       79       83
44  75.0984   125.501    130           128       85       87
45  123.041   95.5027    100           100       91       106
46  150.338   219.995    166           164       167      126
47  32.6361   32.8342    33            32        65       36
48  1155.35   1282.97    1542          1519      1858     1272
49  169.238   151.5083   168           170       469      156
50  191.581   128.538    193           194       163      176
51  130.866   147.1222   114           115       27       122
52  37.8064   51.1202    55            56        22       41
53  16.30243  17.2839    22            22        19.4     14
54  24.881    20.84719   14            14        11.4     20
55  20.7061   17.524364  7.5           7.4       16.6     18
56  826.252   929.524    537           570       188      958
57  244.518   182.428    239           247       93       237
58  131.502   132.405    145           143       171      130
59  78.6103   73.273     68            68        59       70
60  61.6232   67.75      60            61        18       57
61  51.2874   59.1798    47            48        79       50
62  37.4984   35.23229   42            41        36       38
63  9.40078   17.6209    17            17        57       15

Table 8
Statistical results of the COCOMO and fuzzy models

     Fuzzy    Fuzzy    Intermediate  Detailed  Basic
     cross    fit      COCOMO        COCOMO    COCOMO
AAE  45.575   20.073   78.146        99.498    464.659
ARE  0.137    0.134    0.188         0.188     0.602


6. Conclusions

It is well known that software project management teams can greatly benefit from knowing the estimated cost of their software projects. The benefits are greater if accurate cost estimates are obtained during the early stages of the project life cycle. The process and performance behavior of software projects have been used as measures for software cost and effort estimation models. With knowledge of the expected cost of a project, software management teams can control the software development process effectively.

Estimating the process and performance behavior of a software project early in the software life cycle is very challenging and often very difficult to model. This issue is further exacerbated by the fact that important and useful recorded information pertaining to the software cost is often vague, imprecise, and in some cases even linguistic. Traditionally used software cost estimation techniques are not capable of incorporating such vague and imprecise information in their cost estimation models. This prevents the extraction and use of important information that could very well improve a model's project cost estimation.

The commonly used COCOMO cost estimation models have been used to obtain software cost and effort estimations. The models incorporate data collected from several software projects and use this information for their cost and effort estimations. However, they may not provide accurate project cost estimations as software size and complexity increase.

The fuzzy identification software cost estimation technique presented in this paper incorporates important project information that is often vague and imprecise. The proposed estimation technique is an advanced fuzzy logic technique that integrates fuzzy clustering, space projection, fuzzy inference, and defuzzification. The structure of the fuzzy model is very simple, and the number of inference rules is the same as the number of fuzzy clusters. The rule-based preprocessing of data reduces the database size significantly.

In this case study, we applied fuzzy identification to extract rules and membership functions from fuzzy input data. The cost estimation results were then compared with those obtained from the three types of COCOMO cost estimation models, i.e., Basic, Intermediate, and Detailed. For the case study investigated, the cost estimation accuracy of the fuzzy models was significantly better than that of the COCOMO models.

Future research may investigate using the fuzzy modeling approach in other estimation problems, such as project size estimation. Furthermore, the proposed fuzzy identification modeling technique may be investigated using other case studies.

Acknowledgements

We thank the anonymous reviewers for their useful comments, and Dr. Witold Pedrycz for his useful suggestions. We thank Naeem Seliya for his assistance with modifications, editorial reviews and useful suggestions. We also thank Erik Geleyn and Robert M. Szabo for their suggestions. This work was supported in part by Cooperative Agreement NCC 2-1141 from NASA Ames Research Center, Software Technology Division (Independent Verification and Validation Facility). The findings and opinions in this paper belong solely to the authors and are not necessarily those of the sponsors or collaborators.


References

[1] A.J. Albrecht, J.E. Gaffney, Software function, source lines of code, and development effort prediction: a software science validation, IEEE Trans. Software Eng. SE-9 (6) (1983) 639–647.
[2] R. Babuska, Fuzzy Modeling for Control, Kluwer Academic Publishers, Dordrecht, 1999.
[3] M.L. Berenson, D.M. Levine, M. Goldstein, Intermediate Statistical Methods and Applications: A Computer Package Approach, Prentice-Hall, Englewood Cliffs, NJ, 1983.
[4] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[5] B.W. Boehm, Software Engineering Economics, Prentice-Hall, Englewood Cliffs, NJ, 1981.
[6] K.H. Chen, H.L. Chen, H.M. Lee, A multiclass neural network classifier with fuzzy teaching inputs, Fuzzy Sets and Systems 91 (1997) 15–35.
[7] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well separated clusters, J. Cybernet. 3 (1974).
[8] Z. Fei, X. Liu, f-COCOMO: fuzzy constructive cost model in software engineering, IEEE Internat. Conf. Fuzzy Systems, 1992, pp. 331–337.
[9] P.W. Garratt, A.C. Hodgkinson, A neurofuzzy cost estimator, Proc. Software Engineering and Applications, 1999, pp. 401–406.
[10] H. Ichihashi, T. Watanabe, Learning control system by a simplified fuzzy reasoning model, Proc. Information Processing and Management of Uncertainty, 1990, pp. 417–419.
[11] H. Ishibuchi, K. Nozaki, H. Tanaka, Y. Hosaka, M. Matsuda, Empirical study on learning in fuzzy systems by rice taste analysis, Fuzzy Sets and Systems 64 (1994) 129–144.
[12] C.L. Karr, E.J. Gentry, Fuzzy control of pH using genetic algorithms, IEEE Trans. Fuzzy Systems 1 (1993) 46–53.
[13] J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, 1996.
[14] E.H. Mamdani, S. Assilian, An experiment in linguistic synthesis with a fuzzy logic controller, Internat. J. Man Mach. Stud. 7 (1) (1975) 1–13.
[15] J.E. Matson, B.E. Barrett, J.M. Mellichamp, Software development cost estimation using function points, IEEE Trans. Software Eng. 20 (4) (1994) 275–287.
[16] D. Nauck, R. Kruse, A neuro-fuzzy method to learn fuzzy classification rules from data, Fuzzy Sets and Systems 89 (1997) 277–288.
[17] H. Nomura, I. Hayashi, N. Wakami, A learning method of fuzzy inference rules by descent method, Proc. FUZZ-IEEE'92, 1992, pp. 203–210.
[18] T. Pfeufer, M. Ayoubi, Application of a hybrid neuro-fuzzy system to the fault diagnosis of an automotive electromechanical actuator, Fuzzy Sets and Systems 89 (1997) 351–360.
[19] L.H. Putnam, A general empirical solution to the macro software sizing and estimation problem, IEEE Trans. Software Eng., July 1978, pp. 345–361.
[20] M. Sugeno, G.T. Kang, Structure identification of fuzzy model, Fuzzy Sets and Systems 28 (1988) 15–33.
[21] H. Takagi, M. Sugeno, Fuzzy identification of systems and its applications to modeling and control, IEEE Trans. Systems Man Cybernet. 15 (1985) 116–132.
[22] C.E. Walston, A.P. Felix, A method of programming measurement and estimation, IBM Systems J. 16 (1) (1977) 54–73.
[23] Z. Xu, Fuzzy logic control system CAD and study on the improvement of FLC performance, Master Thesis, Guangxi University, Nanning, Guangxi, P.R. China, May 1997.
[24] Z. Xu, Fuzzy logic techniques for software reliability engineering, Ph.D. Dissertation, Florida Atlantic University, Boca Raton, FL, May 2001.
[25] J. Yen, R. Langari, Fuzzy Logic: Intelligence, Control, and Information, Prentice Hall, Inc., Upper Saddle River, NJ, 1999.
[26] Y. Yoshinari, W. Pedrycz, K. Hirota, Construction of fuzzy models through clustering techniques, Fuzzy Sets and Systems 54 (1993) 157–165.