
MAE: A Syntactic Metric Analysis Environment

Warren Harrison University of Portland, Portland, Oregon

A metric analysis environment for the COBOL programming language is presented that, through the use of a commercial data base management system (DBMS), accesses a set of files produced by a static analysis tool. The query language of the host DBMS can then be used to easily implement various metric mappings. Such an approach provides an amazingly flexible source of data for software metric analysis.

1. INTRODUCTION

The importance of software metrics has been acknowledged by many [1-3]. It is thought that the development of improved software metrics and application techniques may result in useful tools to aid in the management of large software projects.

A specific class of metrics, often referred to as syntactic metrics, involves the analysis of various syntactic characteristics of a piece of software. The degree to which each of these selected syntactic characteristics exists in the code is then mapped into (usually) a single number that is purported to measure the “complexity” of the software. Thus, a complexity metric can be viewed as a mapping m such that

$$m(c_1, c_2, \ldots, c_n) \rightarrow \text{measure of complexity}$$

where $c_i$ is the degree to which a particular characteristic exists in a program.
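To make the idea of a metric as a mapping concrete, the following minimal Python sketch treats a program as a set of named characteristic counts and each metric as a function over them. The names `Characteristics`, `Metric`, `apply_all` and the sample counts are assumptions introduced here for illustration; they do not come from MAE.

```python
from typing import Callable, Dict, Mapping

# Hypothetical types: a program is summarized by named characteristic
# counts, and a metric maps those counts to a single number.
Characteristics = Mapping[str, float]
Metric = Callable[[Characteristics], float]


def loc(c: Characteristics) -> float:
    """LOC uses a single characteristic: the line count."""
    return c["lines_of_code"]


def vg(c: Characteristics) -> float:
    """VG is taken here as the number of decisions plus one."""
    return c["decisions"] + 1


def apply_all(metrics: Mapping[str, Metric], c: Characteristics) -> Dict[str, float]:
    """Apply every metric mapping to the same set of characteristics."""
    return {name: metric(c) for name, metric in metrics.items()}


# Invented characteristic counts for one program.
sample = {"lines_of_code": 120, "decisions": 7}
results = apply_all({"LOC": loc, "VG": vg}, sample)
print(results)
```

Because the characteristics are stored once and the mappings are applied on demand, adding a new metric only means adding another function.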

This topic has received a great deal of attention in the past, which has resulted in a proliferation of metrics, most based on the syntactic analysis of the source code. Thus, multiple mappings exist, differing in the number and choice of characteristics and in the algorithm used to map the characteristics to the measure of complexity. Some representative examples of metric mappings are [1, 4, 5]:

LOC(lines of code)

SoftwareScience-E(unique operators, unique operands, total operators, total operands)

SoftwareScience-V(unique operators, unique operands, total operators, total operands)

VG(decisions)

VARS(variables)

LV(lv1, lv2, ..., lvk)

where lines of code is a simple count of the number of lines of code in the software; unique operators and unique operands refer to the number of different operators and operands used; total operators and total operands refer to the total number of these objects used; decisions refers to the number of decision-making statements in the code; variables refers to the number of variables actually used in the software; and lvi refers to the number of live variables at line i in a procedure.

Address correspondence to Warren Harrison, University of Portland, Portland, OR 97203.

The Journal of Systems and Software 8, 57-62 (1988)

© 1988 Elsevier Science Publishing Co., Inc.

2. EVALUATING METRICS

Currently, the plethora of metrics makes it unlikely that any one metric will come to be regarded as the perfect metric in the near future. Therefore, before using a metric, an analyst should assess its performance on historical project data. A metric’s performance can be assessed by comparing other artifacts of complexity (e.g., numerous errors or excessive maintenance costs) associated with the program with the measure that results from applying the metric to the code.

An environment to support such an analysis would be a great contribution to the development and use of software metrics. Such an environment must satisfy several requirements:

1. It must make it easy to evaluate a number of metrics for a particular historical data base.

2. It must be an environment in which it is easy to add metrics or to make minor changes to counting rules and the like used by a particular metric. These new or refined metrics should then be easily applied to previously analyzed projects.

3. Finally, to further the overall field of software metrics, the environment should encourage the sharing of the historical data base among researchers, both within and outside the originating organization.
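The assessment described in this section, comparing a metric's values against an artifact of complexity such as error counts, can be sketched as a simple correlation. The choice of Pearson correlation is an assumption made here, not a method prescribed by the paper, and the four modules and all numbers below are invented for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical historical data: one metric value and one error count
# per module. These figures are not taken from any real project.
vg_values = [3, 5, 8, 12]
error_counts = [1, 2, 4, 6]

r = pearson(vg_values, error_counts)
print(f"correlation between metric and errors: {r:.3f}")
```

A coefficient near 1 would suggest that the metric tracks the chosen artifact on this data set; with real project data the same comparison could be repeated for each candidate metric.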

A common approach is simply to maintain a collection of metric data for various projects that have been completed in the past, and then to continually add metrics for new projects as they are completed. However, the constant development of new metrics and revised interpretations of existing metrics suggests that such a collection will quickly become obsolete.

Another approach is to maintain a collection of the actual source code of past projects. However, this solution presents two major problems: (1) The proprietary nature of industrial software development would no doubt make those who contribute to this data base a bit uncomfortable unless the data base were strictly limited to users within the organization. Unfortunately, this restriction would prevent access by others to the data base. (2) Before new metrics could be applied to the existing data base, the source code would have to be reparsed, further complicating the development of metric tools and increasing the cost of analysis.

An alternative approach is to maintain just the syntactic characteristics and to generate the metrics as needed. The syntactic characteristics could be stored in a data base, and the metric tools could be constructed out of the DBMS’s query language. This option would allow new and/or refined metric tools to be developed quickly and easily, without requiring a great deal of programming effort. Further, views of the schema could be provided that would “alias” various characteristics of the software, allowing the characteristics to be statistically analyzed, but prohibiting access to information needed to regenerate the original source code. This would ensure the confidentiality of proprietary software.

3. IDENTIFYING THE NEEDED DATA

Because of the advantages inherent in the syntactic characteristic data base, such an environment, MAE (Metric Analysis Environment), has been developed. The data needed to fully support this environment are of two types:

1. Product data are essentially data on the syntactic characteristics of the software.

2. Process data encompass those “artifacts of complexity” mentioned in the second section, i.e., information on the performance of programmers applying some programming activity to the code, such as maintenance, testing, and debugging.

The current discussion will focus on the generation and maintenance of product data by MAE. The use and maintenance of process data will be discussed in a future paper.

4. A DATA BASE SCHEMA FOR THE ENVIRONMENT

Clearly, the power available to the user is dependent on the data available from the environment. A reasonable method of describing what is available is to describe the data base schema the environment is built upon.

The current environment provides data for product metrics of COBOL programs. The environment as described is an experimental prototype built upon a modification of the COBOL extensible static analysis tool, or CESAT [6]. The environment is built by using the Datatrieve data management system running under the VMS operating system.¹

Datatrieve is a relational data base management system (DBMS) with a powerful interactive query language that includes such features as a report writer and graphics support. However, the environment could easily be ported to another relational data base system.

The prototype is intended primarily for small student-oriented systems and projects. Therefore it does not currently accommodate the use of external subprograms. However, the current tool is meant only as a prototype; a more powerful version of this environment is currently under development. Figure 1 provides an entity-relationship (E-R) diagram [7] of the data base. A brief description of the four entities in this diagram follows.

1. Paragraph represents each paragraph in the PROCEDURE DIVISION of a particular COBOL program. This entity’s only attribute is the paragraph name.

2. Line represents each line of code in the PROCEDURE DIVISION of a particular COBOL program. The only attribute associated with this entity is the line number.

3. Identifier represents every identifier (identifiers in COBOL are similar to variable names in other programming languages) that is either defined in the DATA DIVISION or used in the PROCEDURE DIVISION of a particular COBOL program. Identifiers also include file names and constants. The two attributes associated with this entity are the identifier name and the identifier type (i.e., data type).

4. Operator represents each operator in the PROCEDURE DIVISION of a particular COBOL program. In fact, it would probably be more properly referred to as “ReservedWord,” in that not every entry in this entity may really be an operator in the classical sense of the word (e.g., THEN is considered an operator). This could be adjusted by simply changing the table of reserved words the tool uses to identify operators. The only attribute associated with this entity is the operator name.

¹ Datatrieve and VMS are trademarks of Digital Equipment Corporation.

Elements within each of the entity sets are “connected” via “relationships,” as shown in Figure 1, whose roles should be self-explanatory. The relationships are actually implemented by using “internal keys” that “link” tuples in the connected entity relations. In the actual MAE implementation, the internal keys can be viewed as an important part of the relations, since the relationship links use the internal keys to “connect” tuples from different relations. As an example, the internal key Identifier ID (IID) connects the entity Identifier with the relationship Assigned_In, which is connected to the entity Line via the internal key Line ID (LID). Thus, the tuple representing a particular identifier is connected with the tuple representing a particular line in which the identifier is assigned a value via the Assigned_In relationship. Figure 2 shows the relational schema that implements this E-R diagram.
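The linking of tuples through internal keys can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Datatrieve, which is an assumption made purely for illustration; the table and column names follow Figure 2, while the single row in each table (including the count of 1) is invented sample data.

```python
import sqlite3

# In-memory SQLite data base standing in for the Datatrieve relations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Identifier (IID INTEGER PRIMARY KEY,
                             IdentifierName TEXT,
                             IdentifierType TEXT);
    CREATE TABLE Line (LID INTEGER PRIMARY KEY, LineContents TEXT);
    -- The relationship stores only internal keys plus an occurrence count.
    CREATE TABLE Assigned_In (LID INTEGER, IID INTEGER, CNT INTEGER);
""")

# Hypothetical sample rows: identifier 3 is assigned once in line 2.
conn.execute("INSERT INTO Identifier VALUES (3, 'outfile', '<file> .')")
conn.execute(
    "INSERT INTO Line VALUES (2, 'sort SortFile, ascending key SortKey, "
    "using InFile giving OutFile.')"
)
conn.execute("INSERT INTO Assigned_In VALUES (2, 3, 1)")

# Follow IID from Identifier to Assigned_In, then LID on to Line.
rows = conn.execute("""
    SELECT i.IdentifierName, l.LID, a.CNT
    FROM Identifier AS i
    JOIN Assigned_In AS a ON a.IID = i.IID
    JOIN Line AS l ON l.LID = a.LID
""").fetchall()
print(rows)
```

The relationship table carries no names or line text of its own, which is what later allows sensitive attributes to be hidden without breaking the links.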

The contents of the “product data base” are generated by a static analysis tool written in COBOL. The tool extracts information on syntactic characteristics of the software and builds a set of indexed files that are then accessed as relations from the Datatrieve query language. Examples of the data that are available are shown in Figures 3a-3d.

Figure 1. Entity-relationship diagram for CESAT data base.

    DTR> :available

    The following ENTITIES have been defined:
    -----------------------------------------
    Identifier(IdView)     IdentifierName*   IdentifierType   IID
    Line(LineView)         LineContents*     LID
    Operator(OpView)       OperatorName*     OID
    Paragraph(ParaView)    ParagraphName*    PID

    The following RELATIONSHIPS have been defined:
    ----------------------------------------------
    Assigned_In      (LID,IID,CNT)
    Executed_In      (LID,OID,CNT)
    Makes_Up         (LID,PID)
    Referenced_In    (PID1,PID2,CNT)
    Used_In          (LID,IID,CNT)

Figure 2. The Datatrieve implementation of the E-R design makes these relations and attributes available. Views to achieve “aliasing” are listed after the relation names in parentheses. Attributes that are not available within the views are marked with an asterisk.

Through the use of Datatrieve’s built-in query language, a multitude of metric mappings can be formulated using small scripts of no more than a dozen lines. Because of this, MAE does not support any particular product metrics. Rather, users are encouraged to develop their own query scripts to implement specific metrics. For demonstration purposes, however, we will include a few metric implementations that have been developed within MAE:

1. The cyclomatic measure (VG) can be easily computed by totaling the number of times each “decision-making” operator occurs and adding one. Thus, this measure can be calculated by using the following simple query:

    PRINT "VG is: ", (TOTAL Cnt OF Operator CROSS Executed_In OVER Oid WITH
        OperatorName = "if", "until", "times") + 1

2. Almost all software science measures can be obtained from n1, n2 (unique operators and operands), N1, and N2 (total operators and operands). Clearly, this information is available from the operator and identifier entities. As an example, consider the following script to compute Halstead’s volume:

    Eta1 = COUNT OF Operators
    Eta2 = COUNT OF Identifiers
    N1 = TOTAL Cnt OF Operators CROSS Executed_In OVER Oid
    N2 = TOTAL Cnt OF Identifier CROSS Used_In OVER Iid +
         TOTAL Cnt OF Identifier CROSS Assigned_In OVER Iid
    Eta = Eta1 + Eta2
    N = N1 + N2
    V = N * (FN$LOG10(Eta) / FN$LOG10(2))
    PRINT "Volume is: ", V

3. LOC (lines of code) can be obtained by making a count of records in the Line entity using the following query (note that this is a count of noncommentary lines of PROCEDURE DIVISION code):

    PRINT "NCSL is: ", COUNT OF Line

4. VAR (number of variables) can be obtained by making a count of the identifiers that are not constants, files, or group items (i.e., a count of elementary items) as follows:

    PRINT "VAR is: ", COUNT OF Identifier WITH IdentifierType NOT CONTAINING
        "(file)", "(constant)", "(group)"

Note that other methods of counting variables (e.g., including group items) could easily be implemented.

It should be clear that other mappings, or variants on these mappings, could be easily implemented using the query language.

    DTR> PRINT Identifier

    IID     IDENTIFIERNAME   IDENTIFIERTYPE
    00001   infile           <file> label records are standard
    00002   recin            pic x ( 80 ) .
    00003   outfile          <file> label records are standard
    00004   recout           pic x ( 80 ) .
    00005   sortfile         <file> .
    00006   sortrec          <group> .
    00007   sortkey          pic xxx .
    00008   filler           pic x ( 77 ) .

Figure 3a. Identifier information maintained by MAE for a sample program.

    DTR> PRINT Operator

    OID     OPERATORNAME
    00001   .
    00002   sort
    00003   ascending
    00004   key
    00005   using
    00006   giving
    00007   stoprun

Figure 3b. Operator information maintained by MAE for a sample program.
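For readers without access to Datatrieve, the four demonstration queries can be approximated in Python with the built-in sqlite3 module. This is a sketch under stated assumptions rather than MAE's implementation: SQLite stands in for Datatrieve, the identifier and operator rows follow Figures 3a and 3b, and every row in Executed_In, Used_In, and Assigned_In (including each count of 1) is an assumed reading of the sample program. The type tags are matched in the `<file>` spelling shown in Figure 3a.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Identifier (IID INTEGER, IdentifierName TEXT, IdentifierType TEXT);
    CREATE TABLE Operator (OID INTEGER, OperatorName TEXT);
    CREATE TABLE Line (LID INTEGER, LineContents TEXT);
    CREATE TABLE Executed_In (LID INTEGER, OID INTEGER, CNT INTEGER);
    CREATE TABLE Used_In (LID INTEGER, IID INTEGER, CNT INTEGER);
    CREATE TABLE Assigned_In (LID INTEGER, IID INTEGER, CNT INTEGER);
""")

# Entity rows as listed in Figures 3a and 3b.
conn.executemany("INSERT INTO Identifier VALUES (?, ?, ?)", [
    (1, "infile", "<file> label records are standard"),
    (2, "recin", "pic x ( 80 ) ."),
    (3, "outfile", "<file> label records are standard"),
    (4, "recout", "pic x ( 80 ) ."),
    (5, "sortfile", "<file> ."),
    (6, "sortrec", "<group> ."),
    (7, "sortkey", "pic xxx ."),
    (8, "filler", "pic x ( 77 ) ."),
])
conn.executemany("INSERT INTO Operator VALUES (?, ?)", [
    (1, "."), (2, "sort"), (3, "ascending"), (4, "key"),
    (5, "using"), (6, "giving"), (7, "stoprun"),
])
conn.executemany("INSERT INTO Line VALUES (?, ?)", [
    (1, "line 1"), (2, "line 2"), (3, "line 3"),  # contents are placeholders
])

# ASSUMED relationship rows: one occurrence of each operator per line.
conn.executemany("INSERT INTO Executed_In VALUES (?, ?, ?)", [
    (1, 1, 1), (2, 2, 1), (2, 3, 1), (2, 4, 1), (2, 5, 1),
    (2, 6, 1), (2, 1, 1), (3, 7, 1), (3, 1, 1),
])
conn.executemany("INSERT INTO Used_In VALUES (?, ?, ?)",
                 [(2, 1, 1), (2, 5, 1), (2, 7, 1)])
conn.execute("INSERT INTO Assigned_In VALUES (2, 3, 1)")


def scalar(sql):
    """Run a query that yields exactly one value and return it."""
    return conn.execute(sql).fetchone()[0]


# VG: total occurrences of decision-making operators, plus one.
vg = scalar("""
    SELECT COALESCE(SUM(e.CNT), 0) + 1
    FROM Operator AS o JOIN Executed_In AS e ON e.OID = o.OID
    WHERE o.OperatorName IN ('if', 'until', 'times')
""")

# Halstead volume: V = N * log2(Eta).
eta1 = scalar("SELECT COUNT(*) FROM Operator")
eta2 = scalar("SELECT COUNT(*) FROM Identifier")
n1 = scalar("""SELECT COALESCE(SUM(e.CNT), 0)
               FROM Operator AS o JOIN Executed_In AS e ON e.OID = o.OID""")
n2 = (scalar("""SELECT COALESCE(SUM(u.CNT), 0)
                FROM Identifier AS i JOIN Used_In AS u ON u.IID = i.IID""")
      + scalar("""SELECT COALESCE(SUM(a.CNT), 0)
                  FROM Identifier AS i JOIN Assigned_In AS a ON a.IID = i.IID"""))
volume = (n1 + n2) * math.log2(eta1 + eta2)

# LOC: number of records in Line.
loc = scalar("SELECT COUNT(*) FROM Line")

# VAR: identifiers that are not files, constants, or group items.
var = scalar("""
    SELECT COUNT(*) FROM Identifier
    WHERE IdentifierType NOT LIKE '%<file>%'
      AND IdentifierType NOT LIKE '%<constant>%'
      AND IdentifierType NOT LIKE '%<group>%'
""")

print(f"VG={vg} V={volume:.2f} LOC={loc} VAR={var}")
```

Each metric stays a short query over the stored characteristics, so a variant counting rule (for example, including group items in VAR) would only change one WHERE clause.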

5. ALIASING OPERANDS AND PARAGRAPH NAMES FOR SECURITY

As pointed out in [8], many industrial organizations are reluctant to share their data with others, since proprietary algorithms and the like may be exposed to unauthorized use or access. However, these fears can be minimized if the actual operand and/or paragraph names are aliased, and the contents of each line are not available to the researcher. This would make the reconstruction of the original code next to impossible. At the same time, however, the same set of metrics can be calculated using views as can be calculated using the actual relations.

This process is accomplished quite easily in a powerful data base environment through the use of views. Thus, views may be defined for each relation to allow access to only the internal keys and other nonsensitive information. As a consequence, the contents of the lines and the names of the identifiers would not be available to the user.

The use of internal keys supports the use of such views, since no other data are required from the tuple to represent the relationship among objects. For example, the relationship among paragraphs via the Referenced_In relation is maintained even if the paragraph names are not known, thus ensuring that a casual user could not ascertain the purpose of the paragraphs, yet could still recognize the relationship.

Each relation has a corresponding view that hides potentially sensitive data. Figure 2 illustrates the views available in MAE in addition to the relations. The views are listed in parentheses after the entity name. Entity attributes that are marked with an asterisk do not participate in the view. For example, a view of Identifier is available called IdView, which contains only the attributes IdentifierType and IID. However, since the various items within the entity sets are linked via internal keys, the identical relationships are still maintained.

It should be clear that the same set of queries can be answered using the views as can be answered using the actual relations, with the exception of queries that actually involve the use of one of the attributes that have been eliminated.

In fact, the use of views provides data for COBOL programs similar to those produced for C programs via the Reduced Form of Harrison and Cook [8]. The set of data available via the Reduced Form has been used to calculate numerous metrics for industrial code systems and has been shown to produce a format from which it is almost impossible to reproduce the original source code.
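The aliasing idea can be sketched with an SQL view that exposes only the internal key. As before, SQLite via Python's sqlite3 module is an assumed stand-in for Datatrieve. The view name ParaView and the relation Referenced_In follow Figure 2; the second paragraph name, the reference count, and the reading of PID1 as the referencing paragraph are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Paragraph (PID INTEGER PRIMARY KEY, ParagraphName TEXT);
    CREATE TABLE Referenced_In (PID1 INTEGER, PID2 INTEGER, CNT INTEGER);

    -- The view omits ParagraphName, leaving only the internal key.
    CREATE VIEW ParaView AS SELECT PID FROM Paragraph;

    -- Hypothetical sample data.
    INSERT INTO Paragraph VALUES (1, '010-main-driver');
    INSERT INTO Paragraph VALUES (2, '020-hypothetical-paragraph');
    INSERT INTO Referenced_In VALUES (1, 2, 3);
""")

# Columns visible through the view: the name is not among them.
view_columns = [d[0] for d in conn.execute("SELECT * FROM ParaView").description]

# The paragraph-to-paragraph relationship is still recoverable via keys.
rows = conn.execute("""
    SELECT p1.PID, p2.PID, r.CNT
    FROM ParaView AS p1
    JOIN Referenced_In AS r ON r.PID1 = p1.PID
    JOIN ParaView AS p2 ON p2.PID = r.PID2
""").fetchall()

print(view_columns, rows)
```

SQLite itself does not restrict access to the base table, so this only shows the shape of the data a researcher would see; actual confidentiality would depend on the host DBMS limiting users to the views.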

6. USING THE ENVIRONMENT

The environment interfaces with its user by means of a combination of menus and free-format input. Upon entering the environment, the user is presented with the following menu of choices:

    MAE 1.1

    (A) Create a Data Base from a COBOL Program
    (B) List COBOL Programs Currently Available
    (C) Analyze an Existing Data Base
    (D) List Existing Data Bases
    (E) Enter the SPSSx Statistical Subsystem
    (F) List the Contents of a File at the Screen
    (G) Send the Contents of a File to the Printer
    (H) Use EDT to Create a File
    (I) Delete a Data Base
    (J) Change Directories
    (K) Exit

    A, B, C, D, E, F, G, H, I, J, K or <CR>? _

Thus, from within the environment, the user may build a data base of syntactic characteristic information for a given COBOL program, analyze a previously created data base of syntactic characteristics using Datatrieve, statistically analyze intermediate files generated from within Datatrieve, or perform a number of utility operations, such as changing directories or destroying an existing data base.

    DTR> PRINT IdentifierName OF Identifier CROSS Used_In OVER Iid

    IDENTIFIERNAME
    infile
    sortfile
    sortkey

Figure 3c. Identifiers that are “used” within the PROCEDURE DIVISION of the sample program.

    DTR> PRINT LineContents OF Line

    LINECONTENTS
    procedure division, 010-main-driver.
    sort SortFile, ascending key SortKey, using InFile giving OutFile.
    stop run.

Figure 3d. Contents of the PROCEDURE DIVISION of the sample program.

7. PROPOSED EXTENSIONS TO THE CURRENT ENVIRONMENT

MAE’s status as an experimental prototype suggests that it still requires refinement. The most useful extension to the environment will be the addition of a facility to accommodate multiple external subprograms. Although no need for this feature was felt while analyzing student projects, when this tool is released for industrial use such a capability must be available.

The development of a method of representing “micro” control flow would be useful to obtain some metrics. Currently, “macro” control flow can be captured through the Referenced_In relationship (i.e., which paragraphs PERFORM or branch to which other paragraphs²). However, the role of IF statements in the control flow is not captured, except in a purely syntactic manner. Naturally, in the event that we want to alias the data, we may not want to have this detail of information available, since it could be used to reconstruct the original source code.

One of the recognized benefits of the entity-relationship approach to data base design is that the data base schema is quite easy to extend as additional data are needed. Thus, as additional information is found to be needed in the data base, it can be added with minimal difficulty.

8. SUMMARY

A metric analysis environment for the COBOL programming language has been presented. This environment is based on the use of a commercial DBMS, accessing a set of files produced by a static analysis tool. Such an approach provides an amazingly flexible source of data for software metric analysis.

REFERENCES

1. S. Conte, H. Dunsmore, and V. Shen, Software Engineering Metrics and Models. Menlo Park, CA: Benjamin/Cummings, 1986.

2. C. Jones, Programming Productivity. New York: McGraw-Hill, 1986.

3. M. Shooman, Software Engineering. New York: McGraw-Hill, 1983.

4. M. Halstead, Elements of Software Science. New York: Elsevier, 1977.

5. T. McCabe, A complexity measure, IEEE Trans. Software Engineering SE-2:308-320 (1976).

6. W. Harrison, An Extensible Static Analysis Tool for COBOL Programs, Proc. ACM Ann. Computer Science Conf. pp. 285-291, 1987.

7. P. Chen, The entity-relationship model-Towards a unified view of data, ACM Trans. Data Base Systems 1(1):9-36 (1976).

8. W. Harrison and C. Cook, A method of sharing industrial software complexity data, SIGPLAN Notices 20(2):42-51 (1985).

² A paragraph is the smallest unit of code that can be referenced from within the COBOL PROCEDURE DIVISION.
