
AN EMPIRICAL STUDY OF RUN-TIME COUPLING AND COHESION SOFTWARE METRICS

Áine Mitchell

Supervisor: Dr. James Power

A Thesis presented for the degree of

Doctor of Philosophy in Computer Science

Department of Computer Science

National University of Ireland, Maynooth

Co. Kildare, Ireland

October 2005


Dedicated to my parents, Patrick and Ann Mitchell


An empirical study of run-time coupling and cohesion software metrics

Áine Mitchell

Submitted for the degree of Doctor of Philosophy

October 2005

Abstract

The extent of coupling and cohesion in an object-oriented system has implications for its external quality. Various static coupling and cohesion metrics have been proposed and used in past empirical investigations; however, none of these takes the run-time properties of a program into account. As program behaviour is a function of its operational environment as well as of the complexity of the source code, static metrics may fail to quantify all the underlying dimensions of coupling and cohesion. Considering both of these influences gives a more comprehensive understanding of the quality of the critical components of a software system. We believe that any measurement of these attributes should include the changes that take place at run-time. For this reason, in this work we address the utility of run-time coupling and cohesion metrics through the empirical evaluation of a selection of run-time measures for these properties. The study is carried out using a comprehensive selection of Java benchmark and real-world programs.

Our first case study investigates the influence of instruction coverage on the relationship between static and run-time coupling metrics. Our second case study defines a new run-time coupling metric that can be used to study object behaviour, and investigates the ability of measures of run-time cohesion to predict such behaviour. Finally, we investigate whether run-time coupling metrics are good predictors of software fault-proneness in comparison to standard coverage measures. To the best of our knowledge, this is the largest empirical study performed to date on the run-time analysis of Java programs.


Declaration

The work in this thesis is based on research carried out at the Department of Computer Science, in the National University of Ireland Maynooth, Co. Kildare, Ireland. No part of this thesis has been submitted elsewhere for any other degree or qualification and it is all my own work unless referenced to the contrary in the text.

Signature:. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Date:. . . . . . . . . . . . . . . . . . . . . . . .

Copyright © 2005 Áine Mitchell.

“The copyright of this thesis rests with the author. No quotations from it should be published without the author’s prior written consent and information derived from it should be acknowledged”.



Acknowledgements

I would like to thank my PhD adviser, Dr. James Power, for his advice, guidance, support, and encouragement throughout my PhD.

A special thanks to my parents, without whose continual support this work would not have been possible.

I would also like to thank all my friends who were there for me throughout it all.

This work has been funded by the Embark Initiative, operated by the Irish Research Council for Science, Engineering and Technology (IRCSET).



Contents

Abstract

Declaration

Acknowledgements

1 Introduction
  1.1 Software Metrics and Complexity
  1.2 Traditional Measures of Complexity
  1.3 Object-Oriented Metrics
  1.4 Definitions of Coupling
  1.5 Definitions of Cohesion
  1.6 Static and Run-time Metrics
  1.7 Factors Influencing Software Metrics
    1.7.1 Coverage
    1.7.2 Metrics and Object Behaviour
    1.7.3 Metrics and Software Testing
  1.8 Aims of Thesis
  1.9 Structure of Thesis

2 Literature Review
  2.1 Static Coupling Metrics
    2.1.1 Chidamber and Kemerer
    2.1.2 Other Coupling Metrics
  2.2 Frameworks for Static Coupling Measurement
    2.2.1 Eder et al.
    2.2.2 Hitz and Montazeri
    2.2.3 Briand et al.
    2.2.4 Revised Framework by Briand et al.
  2.3 Static Cohesion Metrics
    2.3.1 Chidamber and Kemerer
    2.3.2 Other Cohesion Metrics
  2.4 Frameworks for Static Cohesion Measurement
    2.4.1 Eder et al.
    2.4.2 Briand et al.
  2.5 Run-time/Dynamic Coupling Metrics
    2.5.1 Yacoub et al.
    2.5.2 Arisholm et al.
  2.6 Run-time/Dynamic Cohesion Metrics
    2.6.1 Gupta and Rao
  2.7 Other Studies of Dynamic Behaviour
    2.7.1 Dynamic Behaviour Studies
  2.8 Coverage Metrics and Software Testing
    2.8.1 Instruction Coverage
    2.8.2 Alexander and Offutt
  2.9 Previous Work by the Author
  2.10 Definition of Run-time Metrics
    2.10.1 Coupling Metrics
    2.10.2 Cohesion Metrics
  2.11 Conclusion

3 Experimental Design
  3.1 Methods for Collecting Run-time Information
    3.1.1 Instrumenting a Virtual Machine
    3.1.2 Sun’s Java Platform Debug Architecture (JPDA)
    3.1.3 Bytecode Instrumentation
  3.2 Metrics Data Collection Tools (Design Objectives)
    3.2.1 Class-Level Metrics Collection Tool (ClMet)
    3.2.2 Object-Level Metrics Collection Tool (ObMet)
    3.2.3 Static Data Collection Tool (StatMet)
    3.2.4 Coverage Data Collection Tool (InCov)
    3.2.5 Fault Detection Study
  3.3 Test Case Programs
    3.3.1 Benchmark Programs
    3.3.2 Real-World Programs
    3.3.3 Execution of Programs
  3.4 Statistical Techniques
    3.4.1 Descriptive Statistics
    3.4.2 Normality Tests
    3.4.3 Normalising Transformations
    3.4.4 Pearson Correlation Test
    3.4.5 T-Test
    3.4.6 Principal Component Analysis
    3.4.7 Cluster Analysis
    3.4.8 Regression Analysis
    3.4.9 Analysis of Variance (ANOVA)
  3.5 Conclusion

4 Case Study 1: The Influence of Instruction Coverage on the Relationship Between Static and Run-time Coupling Metrics
  4.1 Goals and Hypotheses
  4.2 Experimental Design
  4.3 Results
    4.3.1 Experiment 1: To investigate the relationship between static and run-time coupling metrics
    4.3.2 Experiment 2: The influence of instruction coverage
  4.4 Conclusion

5 Case Study 2: The Impact of Run-time Cohesion on Object Behaviour
  5.1 Goals and Hypotheses
  5.2 Experimental Design
  5.3 Results
    5.3.1 Experiment 1: To determine if objects from the same class behave differently at run-time from the point of view of coupling
    5.3.2 Experiment 2: The influence of cohesion on the NOC
  5.4 Conclusion

6 Case Study 3: A Study of Run-time Coupling Metrics and Fault Detection
  6.1 Goals and Hypotheses
  6.2 Experimental Design
  6.3 Results
    6.3.1 Experiment 1: To examine the relationship between instruction coverage and fault detection
    6.3.2 Experiment 2: To examine the relationship between run-time coupling metrics and fault detection
  6.4 Conclusion

7 Conclusions
  7.1 Contributions
  7.2 Applications of this Work
  7.3 Threats to Validity
    7.3.1 Internal Threats
    7.3.2 External Threats
  7.4 Future Work

Appendix

A Case Study 1: To Investigate the Influence of Instruction Coverage on the Relationship Between Static and Run-time Coupling Metrics
  A.1 PCA Test Results for all programs
    A.1.1 SPECjvm98 Benchmark Suite
    A.1.2 JOlden Benchmark Suite
    A.1.3 Real-World Programs, Velocity, Xalan and Ant
  A.2 Multiple linear regression results for all programs
    A.2.1 SPECjvm98 Benchmark Suite
    A.2.2 JOlden Benchmark Suite
    A.2.3 Real-World Programs, Velocity, Xalan and Ant

B Case Study 2: The Impact of Run-time Cohesion on Object Behaviour
  B.1 PCA Test Results for all programs
    B.1.1 JOlden Benchmark Suite
    B.1.2 Real-World Programs, Velocity, Xalan and Ant
  B.2 Multiple linear regression results for all programs
    B.2.1 JOlden Benchmark Suite
    B.2.2 Real-World Programs, Velocity, Xalan and Ant

C Case Study 3: A Study of Run-time Coupling Metrics and Fault Detection
  C.1 Regression analysis results for real-world programs, Velocity, Xalan and Ant
    C.1.1 For Class Mutants
    C.1.2 For Traditional Mutants
  C.2 Regression analysis results for real-world programs, Velocity, Xalan and Ant
    C.2.1 For Class Mutants
    C.2.2 For Traditional Mutants

D Mutation operators in µJava


List of Figures

1.1 The software quality model shows how different measures of internal quality can characterise the overall quality of a software product

3.1 Components of run-time class-level metrics collection tool, ClMet
3.2 Components of run-time object-level metrics collection tool, ObMet
3.3 Components of static metrics collection tool, StatMet
3.4 Dendrogram: at the cutting line there are two clusters

4.1 PCA test results for all programs for metrics in PC1, PC2 and PC3. In all graphs the bars represent the PCA value obtained for the corresponding metric. PC1 contains the import-level run-time metrics, PC2 contains the export-level run-time metrics and PC3 contains the static CBO metric.

4.2 Multiple linear regression results for class-level metrics (IC_CC and EC_CC). In both graphs the lighter bars represent the R² value for CBO, and the darker bars represent the R² value for CBO and Ic combined.

4.3 Multiple linear regression results for method-level metrics (IC_CM and EC_CM). In both graphs the lighter bars represent the R² value for CBO, and the darker bars represent the R² value for CBO and Ic combined.

5.1 CV of IC_OC for classes from the programs studied. The bars represent the number of classes in each program that have a CV in the corresponding range.

5.2 NOC results of cluster analysis. The bars represent the number of classes in each program that have the corresponding NOC value.

5.3 PCA test results for all programs for metrics in PC1 and PC2. In both graphs the bars represent the PCA value obtained for the corresponding metric. PC1 contains R_LCOM and RW_LCOM; PC2 contains S_LCOM.

5.4 Results from multiple linear regression where Y = NOC. The lighter bars represent the R² value for S_LCOM, and the darker bars represent the R² value for S_LCOM and R_LCOM combined.

6.1 Mutation test results for real-world programs Velocity, Xalan and Ant. In all graphs the bars represent the number of classes that exhibit a percentage mutant kill rate in the corresponding range.

6.2 Regression analysis results for the effectiveness of Ic in predicting class and traditional-level mutations in real-world programs Velocity, Xalan and Ant. The bars represent the R² value for the run-time metric under consideration.

6.3 Regression analysis results for the effectiveness of run-time coupling metrics in predicting class-level mutations in real-world programs Velocity, Xalan and Ant. The bars represent the R² value for the run-time metric under consideration.

7.1 Findings from case study one, showing that our run-time coupling metrics are not simply surrogate measures for static CBO, and that coverage plus static metrics are better predictors of run-time measures than static measures alone.

7.2 Findings from case study two, showing that run-time object-level coupling measures can be used to identify objects that exhibit different behaviours at run-time, and that run-time cohesion measures are good predictors of this type of behaviour.

7.3 Findings from case study three, showing that run-time coupling metrics are good predictors of class-type faults and that instruction coverage is a good predictor of traditional faults in programs.


List of Tables

2.1 Abbreviations for the dynamic coupling metrics of Arisholm et al.

3.1 Description of the SPECjvm98 benchmarks
3.2 Description of the JOlden benchmarks
3.3 Programs used for each case study

4.1 Descriptive statistic results for all programs

5.1 Matrix of unique accesses per object, for objects BlackNode1, ..., BlackNode4 to classes GreyNode, QuadTreeNode and WhiteNode
5.2 Descriptive statistic results for all programs

D.1 Traditional-level mutation operators in µJava
D.2 Class-level mutation operators in µJava


Chapter 1

Introduction

Software metrics have become essential in some disciplines of software engineering. In forward engineering they are used to measure software quality and to estimate the cost and effort of software projects [40]. In the field of software evolution, metrics can be used for identifying stable or unstable parts of software systems, as well as identifying where refactorings can be applied or have been applied [32], and detecting increases or decreases of quality in the structure of evolving software systems. In the field of software re-engineering and reverse engineering, metrics are used for assessing the quality and complexity of software systems, and also to get a basic understanding and provide clues about sensitive parts of software systems [27].

1.1 Software Metrics and Complexity

Software metrics evaluate different aspects of the complexity of a software product. Software complexity was originally defined as “a measurement of the resources that must be expended in developing, testing, debugging, maintenance, user training, operation, and correction of software products” [94]. Complexity has been characterised in terms of seven different levels, the correlation and interdependence of which will determine the overall level of complexity in a software product [44]. The levels are as follows:

• Control Structure


• Module Coupling

• Algorithm

• Code

• Nesting

• Module Cohesion

• Data Structure

However, most metrics measure only one software complexity factor. These foundations of complexity will determine the internal quality of a product.

Internal quality measures are those which are performed in terms of the software product itself and are measurable both during and after the creation of the software product. They have, however, no inherent practical meaning in themselves: to give them meaning they must be characterised in terms of the product’s external quality.

External quality measures are evaluated with respect to how a product relates to its environment and are deemed to be inherently meaningful; examples are the maintainability or testability of a product.

It should be noted that good internal quality is a requirement for good external quality. Figure 1.1 illustrates the software quality model, which depicts the relationship between these measures. Much research has contributed models and measures of both internal software quality attributes and external attributes of a design. Although the relationships between these attributes are for the most part intuitive (e.g., more complex code will require greater effort to maintain), the precise functional form of those relationships can be less clear and is the subject of intense practical and research concern [31]. Empirical validation aims at demonstrating the usefulness of a measure in practice and is, therefore, a crucial activity in establishing the overall validity of a measure [6]. It is therefore the belief of the author that a well-designed empirical study serves to clarify and strengthen the observed relationships.


[Figure 1.1 appears here: a diagram linking internal quality (complexity metrics, coupling, cohesion) to external quality (maintainability, reusability, testability) and to quality in use.]

Figure 1.1: The software quality model shows how different measures of internal quality can characterise the overall quality of a software product

1.2 Traditional Measures of Complexity

The earliest software measure, proposed in the late 1960s, is the Source Lines of Code (SLOC) metric, which is still used today. It measures the amount of code in a software program, and is typically used to estimate the effort that will be required to develop a program, as well as to estimate productivity or effort once the software is produced. Two major types of SLOC measures exist: physical SLOC and logical SLOC. Exact definitions of these measures vary. The most common definition of physical SLOC is a count of “non-blank, non-comment lines” in the text of the program’s source code. Logical SLOC measures attempt to count the number of “statements”, but their specific definitions are tied to specific computer languages. It is therefore much easier to create tools that measure physical SLOC, and physical SLOC definitions are easier to explain. However, physical SLOC measures are sensitive to logically irrelevant formatting and style conventions, while logical SLOC is less sensitive to formatting and style conventions.
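To make the distinction concrete, a physical SLOC counter is almost trivial to write. The sketch below is our own illustration, not a tool used in this thesis; its treatment of block comments is deliberately simplified (comment markers inside string literals are ignored), but it captures the "non-blank, non-comment lines" rule for a Java source file.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PhysicalSloc {

    // Count non-blank, non-comment lines. Block comments are tracked with a
    // simple flag; /* and */ inside string literals are not handled.
    public static long count(String path) throws IOException {
        boolean inBlockComment = false;
        long sloc = 0;
        for (String raw : Files.readAllLines(Paths.get(path))) {
            String line = raw.trim();
            if (inBlockComment) {
                if (line.contains("*/")) inBlockComment = false;   // comment ends
            } else if (line.isEmpty() || line.startsWith("//")) {
                // blank line or line comment: not counted
            } else if (line.startsWith("/*")) {
                inBlockComment = !line.contains("*/");             // comment starts
            } else {
                sloc++;                                            // a physical source line of code
            }
        }
        return sloc;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(count(args[0]) + " physical SLOC");
    }
}

A logical SLOC counter, by contrast, would have to parse the language's statement grammar, which is exactly why such tools are harder to build and explain.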

There are a number of drawbacks to using a crude measure such as LOC as a surrogate measure for different notions of program size such as effort, functionality and complexity. The need for more discriminating measures became especially urgent with the increasing diversity of programming languages, as a LOC in an assembly language is not comparable in effort, functionality, or complexity to a LOC in a high-level language [39].

Thus, from the mid-1970s, there was an increase in the number of different complexity metrics defined. Among the more prevalent were Halstead’s software science metrics [47], which attempted to capture notions of size and complexity beyond simply counting lines of code. Although this work has had a lasting impact, the metrics are principally regarded as an example of confused and inadequate measurement [40].

McCabe defined a measure known as Cyclomatic Complexity [71]. It may be considered as a broad measure of soundness and confidence for a program. It measures the number of linearly-independent paths through a program module and it is intended to be independent of language and language format.
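For example, using the common counting rule V(G) = (number of binary decision points) + 1, where each if, each loop and (in the extended form of the measure) each short-circuit operator contributes one decision point, the following hypothetical method has a cyclomatic complexity of 4:

// Decision points: the for loop, the if, and the && operator,
// so V(G) = 3 + 1 = 4.
static int sumPositiveEvens(int[] values) {
    int sum = 0;
    for (int v : values) {              // decision point 1
        if (v > 0 && v % 2 == 0) {      // decision points 2 (if) and 3 (&&)
            sum += v;
        }
    }
    return sum;
}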

Function points, pioneered by Albrecht [2] in 1977, are a measure of the size of computer applications and the projects that build them. The size is measured from a functional, or user, point of view, and is independent of the computer language, development methodology, technology or capability of the project team used to develop the application. The original metric has been augmented and refined to cover more than the original emphasis on business-related data processing.

However, as object-oriented techniques became more prevalent, there was an increasing need for metrics that could correctly evaluate their properties.

1.3 Object-Oriented Metrics

Object-oriented design and development is becoming very popular in today’s software development environment. Object-oriented development requires not only a different approach to design and implementation, but also a different approach to software metrics. Since object-oriented technology uses objects, and not algorithms, as its fundamental building blocks, the approach to software metrics for object-oriented programs must be different from the standard metrics set. Metrics such as lines of code and cyclomatic complexity have become accepted as standard for traditional functional/procedural programs, and were used to evaluate object-oriented environments at the beginning of the object-oriented design revolution. However, traditional metrics for procedural approaches are not adequate for evaluating object-oriented software, primarily because they are not designed to measure basic elements like classes, objects, polymorphism, and message-passing. Even when adjusted to syntactically analyse object-oriented software, they can only capture a small part of such software and thus provide a weak quality indication [50, 65]. Since then, many object-oriented metrics have been proposed in the literature. The question now is: which object-oriented metrics should a project use? As the quality of object-oriented software, like other software, is a complex concept, there can be no single, simple measure of software quality acceptable to everyone. To assess or improve software quality, you must define the aspects of quality in which you are interested, and then decide how you are going to measure them. By defining quality in a measurable way, you make it easier for other people to understand your viewpoint and to relate your notions to their own [60]. As illustrated in Chapter 2, some of the seminal methods of evaluating an object-oriented design are through the use of measures for coupling and cohesion.

1.4 Definitions of Coupling

Stevens et al. [95] first introduced coupling in the context of structured development techniques. They defined coupling as “the measure of the strength of association established by a connection from one module to another”. They stated that the stronger the coupling between modules, that is, the more inter-related they are, the more difficult these modules are to understand, change and correct, and thus the more complex the resulting software system.

Myers [82] refined the concept of coupling by defining six distinct levels of coupling. However, coupling could only be determined by hand, as the definitions were neither precise nor prescriptive, leaving room for subjective interpretations of the levels.

Constantine and Yourdon [29] also stated that the modularity of a software design can be measured by coupling and cohesion. They stated that coupling between two units reflects the interconnections between the units, and that faults in one unit may affect the coupled unit.

Page-Jones [89] ordered coupling into eight different levels according to their effects on the understandability, maintainability, modifiability and reusability of the coupled modules.

Troy and Zweben [98] showed that coupling between units is a good indicator of the number of faults in software. However, their study was based on subjective interpretation of design documents rather than real code.

Offutt et al. [85] extended the eight levels of coupling to twelve, thus providing a finer-grained measure of coupling. They also described algorithms to automatically measure the coupling level between each pair of units in a program. The coupling levels are defined between pairs of units A and B. For each coupling level the parameters are classified by the way they are used. Uses are classified into computation uses (C-uses) [42], predicate uses (P-uses) and indirect uses (I-uses) [85]. A C-use occurs when a variable is used on the right side of an assignment statement, in an output statement, or in a procedure call. A P-use occurs when a variable is used in a predicate statement. An I-use occurs when a variable is used in an assignment to another variable and the defined variable is later used in a predicate; the I-use is considered to be in the predicate rather than in the assignment.
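A small hypothetical Java fragment makes the three categories concrete for the parameter x:

class UseKinds {
    static void demo(int x) {
        int y = x + 1;            // C-use: x on the right-hand side of an assignment
        System.out.println(x);    // C-use: x in an output statement
        if (x > 0) {              // P-use: x in a predicate
            y++;
        }
        int z = x * 2;            // x is assigned into z here ...
        if (z > 10) {             // ... and z is later used in this predicate, so
            y--;                  // that use of x is an I-use, counted in the predicate
        }
        System.out.println(y);
    }
}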

1.5 Definitions of Cohesion

The cohesion of a module is the extent to which its individual components are needed to perform the same task [40]. Cohesion was first introduced within the context of module design by Stevens et al. [95]. In their definition, the cohesion of a module is measured by inspecting the association between all pairs of its processing elements. The term processing element was defined as an action performed by a module, such as a statement, a procedure call, or something which must be done in a module but which has not yet been reduced to code [29]. Their definition was informal, thereby leaving it open to interpretation. They developed a scale of cohesion that provides an ordinal scale of measurement describing the degree to which the actions performed by a module contribute to a unified function. There are seven categories of cohesion, ranging from the most desirable (functional) to the least desirable (coincidental). They stated that it is possible for a module to exhibit more than one type of cohesion; in this case the module is categorised by its least desirable type of cohesion. In the principles of good software design it is desirable to have highly cohesive modules, preferably functional ones.

Emerson [36, 37] based his cohesion measure on a control flow graph representation of a module. The range of this complexity measure varies from 0 to 1. Emerson indicates that his method for computing cohesion is related to program slicing. He reclassifies the seven levels of cohesion into three.

Ott and Thuss [88] used program slicing to evaluate their cohesion measurements. They reclassified the original seven levels of cohesion into four categories.

Lakhotia [61] codified the natural language definitions of the seven levels of cohesion. He developed a method for computing cohesion based on an analysis of the variable dependence graphs of a module. Pairs of outputs were examined to identify any data or control dependences that exist between the two outputs, and rules were provided for determining the cohesion of the pairs.

1.6 Static and Run-time Metrics

A large number of metrics have been proposed to measure object-oriented design quality. Design metrics can be classified into two categories: static and run-time/dynamic. Static metrics measure what may happen when a program is executed and are said to quantify different aspects of the complexity of the source code. Run-time metrics measure what actually happens when a program is executed; they evaluate the source code’s run-time characteristics and behaviour as well as its complexity.

Despite the rich body of research and practice in developing design quality metrics, there has been less emphasis on run-time metrics for object-oriented designs, mainly because run-time code analysis is more expensive and complex to perform [99]. However, due to polymorphism, dynamic binding, and the common presence of unused (dead) code in software, static coupling and cohesion measures do not perfectly reflect the actual situation taking place amongst classes at run-time. The complex dynamic behaviour of many real-time applications motivates a shift in interest from traditional static metrics to run-time metrics. In this work, we investigate whether useful information on design quality can be provided by run-time measures of coupling and cohesion over and above that which is given by simple static measures. This will determine whether it is worthwhile to continue the investigation into run-time coupling and cohesion metrics and their relationship with external quality.

1.7 Factors Influencing Software Metrics

This section discusses factors which affect software metrics, including coverage and object-level behaviour. The relationship with software testing is also discussed.

1.7.1 Coverage

When relating static and run-time measures, it is important to have a thorough understanding of the degree to which the analysed source code corresponds to the code that is actually executed. In this thesis, this relationship is studied using instruction coverage measures, with regard to the influence of coverage on the relationship between static and dynamic metrics. It is proposed that coverage results have a significant influence on the relationship and thus should always be a measured, recorded factor in any such comparison.


1.7.2 Metrics and Object Behaviour

To date, little work has been done on the analysis of code at the object level, that is, the use of metrics to identify specific object behaviours. We identify this behaviour through the use of run-time object-level coupling metrics. Run-time object-level coupling quantifies the level of dependencies between objects in a system, whereas run-time class-level coupling quantifies the level of dependencies between the classes that implement the methods or variables of the caller object and the receiver object [5]. The class of the object sending or receiving a message may be different from the class implementing the corresponding method, due to the impact of inheritance. We also investigate the ability of run-time cohesion measures to predict such behaviour.

1.7.3 Metrics and Software Testing

Testing is one of the most effort-intensive activities during software development [7]. Much research is directed toward developing new and improved fault detection mechanisms. A number of papers have investigated the relationships between static design metrics and the detection of faults in object-oriented software [6, 15]. However, to date no work has been conducted on the correlation between run-time coupling metrics and fault detection. In this thesis, we investigate whether measures for run-time coupling are good predictors of fault-proneness, an important software quality attribute.

1.8 Aims of Thesis

In summary, the central aim of this thesis is to outline operational definitions for run-time class- and object-level coupling and cohesion metrics suitable for evaluating the quality of an object-oriented application. The motivation for these measures is to complement existing measures that are based on static analysis by actually measuring coupling and cohesion at run-time.

It is also necessary to provide tools for collecting such measures accurately and efficiently for Java systems. Java was chosen as the target language for this analysis because it is executed on a virtual machine, which makes it relatively simple to collect run-time trace information in comparison to languages like C or C++. Java also combines a wide range of language features found in different programming languages, for example an object-oriented model, exception handling and garbage collection. Its portability, robustness, simplicity and security have made it increasingly popular within the software engineering community, underpinning its importance and providing a good selection of sample applications for study.

Finally, a thorough empirical investigation using both Java benchmark and real-world programs needs to be performed. The objectives of this are:

1. To assess the fundamental properties of the run-time measures and to investigate whether they are redundant with respect to the most commonly used coupling and cohesion measures, as defined by Chidamber and Kemerer [26].

2. To examine the influence of test case coverage on the relationship between static and run-time coupling metrics. Intuitively, one would expect that the better the coverage of the test cases used, the better the static and run-time metrics should correlate.

3. To investigate run-time object behaviour, that is, to determine if objects from the same class behave differently at run-time, through the use of object-level coupling metrics.

4. To investigate run-time object behaviour using run-time measures for cohesion.

5. To conduct a study investigating the correlation between run-time coupling measures and fault detection in object-oriented software.

1.9 Structure of Thesis

This thesis describes how coupling and cohesion can be defined and precisely measured based on the run-time analysis of systems. An empirical evaluation of the proposed run-time measures is reported using a selection of benchmark and real-world Java applications. An investigation is conducted to determine if these measures are redundant with respect to their static counterparts. We also determine if coverage has a significant impact on the correlation between static and run-time metrics. We examine object behaviour using a run-time object-level coupling metric, and we investigate the influence of run-time cohesion metrics on this behaviour. Finally, we study the fault detection capabilities of run-time coupling measures.

Chapter 2 presents a literature survey of coupling and cohesion metrics and associated studies. Chapter 3 defines the run-time metrics used in this study and outlines the experimental tools and techniques. Chapter 4 presents a case study on the correlation between static and run-time coupling measures and the influence of coverage on this correlation. Chapter 5 discusses a case study on object behaviour and the impact of cohesion on this. Chapter 6 presents a case study on run-time coupling metrics and fault detection. Chapter 7 presents the final conclusions and discusses future work.


Chapter 2

Literature Review

In this chapter, a comprehensive survey and literature review of existing static and run-time/dynamic measures and frameworks for coupling and cohesion in object-oriented systems is presented. Previous work describing a coupling-based testing approach for object-oriented software is also presented, and the role coverage measures play in software testing is discussed. In Sections 2.1 and 2.3, we present and discuss existing coupling and cohesion measures. Sections 2.2 and 2.4 present alternative frameworks for coupling and cohesion. Measures for the run-time evaluation of coupling and cohesion are presented in Sections 2.5 and 2.6 respectively. Other work in studies of dynamic behaviour is described in Section 2.7. A discussion of coverage metrics and the role they play in software testing is presented in Section 2.8. Previous work by the author is discussed in Section 2.9. Finally, a description of the run-time measures used in the subsequent case studies is provided in Section 2.10.

2.1 Static Coupling Metrics

There exists a large variety of measurements for coupling. A comprehensive review of existing measures performed by Briand et al. [13] found that more than thirty different measures of object-oriented coupling exist. The most prevalent ones are explained in the following subsections.


2.1.1 Chidamber and Kemerer

In their papers [25, 26], Chidamber and Kemerer propose and validate a set of six software metrics for object-oriented systems, including two measures for coupling. As these are the most accepted and widely used coupling metrics, we use these as the basis for our run-time coupling measures.

Coupling Between Objects (CBO)

They first define a measure CBO for a class as “a count of the number of noninheritance related couples with other classes” [25]. An object of a class is coupled to another if the methods of one class use the methods or attributes of the other. They later revised this definition to state that “CBO for a class is a count of the number of other classes to which it is coupled” [26], with a footnote adding that “this includes coupling due to inheritance”.

They state that coupling has an adverse effect on the maintenance, reuse and testing of a design, and that excessive coupling between object classes is detrimental to modular design and prevents reuse, as the more independent a class is, the easier it is to reuse in another application. They state that inter-object class couples should be kept to a minimum in order to improve modularity and promote encapsulation. The larger the number of couples, the higher the sensitivity to changes in other parts of the design, making maintenance more difficult. A measure of coupling is also useful for determining how complex the testing of various parts of a design is likely to be: the higher the inter-object class coupling, the more rigorous the testing needs to be.
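As a small hypothetical illustration, the class Order below uses methods of two other classes, Customer and Inventory, so CBO(Order) = 2; each distinct class is counted once, no matter how many of its members are used.

class Customer {
    String name() { return "alice"; }
}

class Inventory {
    boolean inStock(String item) { return true; }
}

class Order {
    private Customer customer = new Customer();     // coupled to Customer
    private Inventory inventory = new Inventory();  // coupled to Inventory

    boolean confirm(String item) {
        System.out.println("order for " + customer.name());  // uses Customer
        return inventory.inStock(item);                       // uses Inventory
    }
}
// CBO(Order) = 2. Under the symmetric definition quoted above, Customer and
// Inventory are each coupled to Order as well, so each has CBO = 1 here.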

Response For a Class (RFC)

The response set (RS) of a class is a set of methods that can potentially be executed in response to a message received by an object of that class. RFC is simply the number of methods in the set, that is, RFC = #{RS}. A given method is counted only once. Since RFC specifically includes methods called from outside the class, it is also a measure of the potential communication between the class and other classes.


$$ RS = M \cup \Big( \bigcup_{i \in M} R_i \Big) \qquad (2.1) $$

Equation 2.1 gives the response set for a class, where $R_i$ is the set of methods called by method $i$ and $M$ is the set of all methods in the class.

If a large number of methods can be invoked in response to a message, the testing and debugging of the class becomes more complicated, since it requires a greater level of understanding on the part of the tester. The complexity of a class increases with the number of methods that can be invoked from it.
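For instance, in the hypothetical class Session below, M = {open, close}, and the methods call log, connect and shutdown in other classes, so RS = {open, close, log, connect, shutdown} and RFC = 5.

class Logger {
    void log(String msg) { }
}

class Connection {
    void connect() { }
    void shutdown() { }
}

class Session {
    private Logger logger = new Logger();
    private Connection conn = new Connection();

    void open() {              // in M; R_open = {log, connect}
        logger.log("open");
        conn.connect();
    }

    void close() {             // in M; R_close = {shutdown}
        conn.shutdown();
    }
}
// RS = M ∪ R_open ∪ R_close = {open, close, log, connect, shutdown}, so RFC = 5.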

2.1.2 Other Coupling Metrics

In their paper [63], Li and Henry identify a number of metrics that can predict the maintainability of a design. They define two measures: message passing coupling (MPC) and data abstraction coupling (DAC). MPC is defined as the number of send statements defined in a class. The number of send statements sent out from a class may indicate how dependent the implementation of the local methods is on the methods in other classes. MPC only counts invocations of methods of other classes, not its own. DAC is defined as “the number of abstract data types (ADT) defined in a class”. An ADT is defined in a class c if it is the type of an attribute of class c. It is also specified that “the number of variables having an ADT type may indicate the number of data structures dependent on the definitions of other classes”.
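A hypothetical class showing both counts side by side:

class Address { }

class Printer {
    void print(String s) { }
}

class Invoice {
    private Address billing;                   // attribute of class (ADT) type
    private Address shipping;                  // attribute of class (ADT) type
    private Printer printer = new Printer();   // attribute of class (ADT) type
    // DAC(Invoice) = 3: three attributes whose types are other classes.

    void issue() {
        printer.print("header");   // send statement to another class
        printer.print("total");    // counted again
        format();                  // a call to Invoice's own method: not counted
    }
    // MPC(Invoice) = 2: two invocations of methods of other classes.

    private void format() { }
}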

Martin describes two coupling metrics that can be used to measure the quality of an object-oriented design in terms of the interdependence between the subsystems of that design [70]. Afferent coupling (Ca) is the number of classes outside a category that depend upon classes within that category. Efferent coupling (Ce) is the number of classes inside a category that depend upon classes outside that category. A category is a set of classes that belong together in the sense that they achieve some common goal. Martin does not specify exactly what constitutes dependencies between classes.

Abreu et al. present a coupling metric known as the Coupling Factor (COF) for the design quality evaluation of object-oriented software systems [1]. COF is the actual number of client-server relationships between classes that are not related via inheritance, divided by the maximum possible number of such client-server relationships. It is normalised to range between 0 and 1 to allow for comparisons between systems of different sizes. It was not specified how to account for such factors as polymorphism and method overriding.
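For reference, this is usually stated along the following lines, where TC is the total number of classes and is_client(C_i, C_j) is 1 exactly when C_i uses C_j, C_i is not C_j, and the two classes are not related by inheritance (the notation here is ours; [1] gives the precise side conditions):

$$ \mathrm{COF} = \frac{\sum_{i=1}^{TC} \sum_{j=1}^{TC} \mathrm{is\_client}(C_i, C_j)}{TC^2 - TC} $$

The denominator, $TC^2 - TC$, is simply the number of ordered pairs of distinct classes, i.e. the maximum possible number of client-server relationships.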

Lee et al. measure the coupling and cohesion of an object-oriented program based on information flow through the program [62]. They define a measure, information-flow-based coupling (ICP), that counts, for a method m of a class c, the number of methods that are invoked polymorphically from other classes, weighted by the number of parameters of the invoked method. This count can be scaled up to classes and subsystems. They go on to derive two more sets of measures, for inheritance-based coupling (coupling to ancestor classes, IH-ICP) and noninheritance-based coupling (coupling to unrelated classes, NIH-ICP), and deduce that ICP is simply the sum of IH-ICP and NIH-ICP.

Briand et al. perform a comprehensive empirical validation of product measures, such as coupling and cohesion, in object-oriented systems and explore the probability of fault detection in system classes during testing [11]. They define a number of measures which count the number of class-attribute (CA), class-method (CM) and method-method (MM) interactions for each class, taking into account which class the interactions originate from or are directed at, and whether the classes involved are ancestors or other classes. A CA-interaction occurs from class c to class d if an attribute of class c is of type class d. A CM-interaction occurs from class c to class d if a newly defined method of class c has a parameter of type class d. An MM-interaction occurs from class c to class d if a method implemented in class c statically invokes a newly defined or overriding method of class d, or receives a pointer to such a method. This set has sixteen metrics in total.
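The three interaction types can be read directly off code; a minimal hypothetical sketch:

class D {
    void helper() { }
}

class C {
    private D field;          // CA-interaction: an attribute of C has type D

    void setUp(D param) {     // CM-interaction: a method of C has a parameter of type D
        field = param;
    }

    void run() {
        field.helper();       // MM-interaction: a method of C invokes a method of D
    }
}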

2.2 Frameworks for Static Coupling Measurement

Several different authors describe frameworks to characterise different approaches to coupling and to assign relative strengths to different types of coupling. A framework defines what constitutes coupling. This is done in an attempt to determine the potential use of coupling metrics and how different metrics complement each other. There are three existing frameworks:

2.2.1 Eder et al.

Eder et al. identify three different types of relationships [34]: interaction relationships between methods, component relationships between classes, and inheritance between classes. These relationships are then used to derive three different dimensions of coupling, which are classified according to different strengths:

1. Interaction coupling: two methods are said to be interaction coupled if (i) one method invokes the other, or (ii) they communicate via the sharing of data. There are seven types of interaction coupling.

2. Component coupling: two classes c and d are component coupled if d is the type of either (i) an attribute of c, (ii) an input or output parameter of a method of c, (iii) a local variable of a method of c, or (iv) an input or output parameter of a method invoked within a method of c. There are four different degrees of component coupling.

3. Inheritance coupling: two classes c and d are inheritance coupled if one class is an ancestor of the other. There are four degrees of inheritance coupling.

2.2.2 Hitz and Montazeri

Hitz and Montazeri derive two different types of coupling: object-level and class-level coupling [52]. These are determined by the state of an object (the values of its attributes at a given moment at run-time) and the state of an object’s implementation (the class interface and body at a given time in the development cycle).

Class-level coupling (CLC) results from state dependencies between two classes in a system during the development cycle. This can only be determined from a static analysis of the design documents or source code. It is important when considering maintenance and change dependencies, as changes in one class may lead to changes in other classes which use it.

Object-level coupling (OLC) results from state dependencies between two objects during the run-time of a system. This depends on the concrete object structure at run-time, which in turn is determined by the actual input data. It is therefore a function of the design or source code and of the input data at run-time, and is relevant for run-time-oriented activities such as testing and debugging.

2.2.3 Briand et al.

In the framework by Briand et al., coupling is constituted by interactions between classes [14]. The strength is determined by the type of the interaction (class-attribute, class-method, method-method), the relationship between the classes (inheritance, other) and the interaction’s locus of impact (import/client, export/server). They assign no strengths to the different kinds of interactions. There are three basic criteria in the framework, which are as follows:

1. Type of interaction: this determines the mechanism by which two classes are coupled. A class-attribute interaction is present if aggregation occurs, that is, if a class c is the type of an attribute of class d. A class-method interaction occurs if a class c is the type of a parameter of a method md of a class d, or if class c is the return type of md. A method-method interaction occurs if a method md of a class d directly invokes a method mc, or if md receives via a parameter a pointer to mc, thereby invoking mc indirectly.

2. Relationship: an inheritance relationship occurs if a class c is an ancestor of class d or vice versa. Friendship is present if a class c declares class d as its friend, which grants class d access to the non-public elements of c. A third, other relationship holds when no inheritance or friendship relationship is present between classes c and d.

3. Locus: if a class c is involved in an interaction with another class, a distinction is made between export and import coupling. Export coupling is when class c is the used class, or server, in the interaction; import coupling is when class c is the using class, or client.


2.2.4 Revised Framework by Briand et al.

Briand et al. outline a new unified framework for coupling in object-oriented systems

[13]. It is characterised based on the issues identified by comparing existing coupling

frameworks. There are six different criteria in the framework and each criterion

determines one basic aspect of the resulting measure. The criteria are as follows:

1. The type of connection: This determines what constitutes coupling. It is the

type of link between a client and a server item which could be an attribute,

method, or class.

2. The locus of impact: This is import or export coupling. Import coupling anal-

yses attributes, methods, or classes in their role as clients of other attributes,

methods, or classes. Export coupling analyses the attributes, methods, and

classes in their role as servers to other attributes, methods or classes.

3. The granularity of the measure: This is the domain of the measure, that is,

what components are to be measured and how to count coupling connections.

4. The stability of the server: Should both stable and unstable classes be included?
Classes can be (a) stable, that is, not subject to change in the project at hand
(for example, classes imported from libraries), or (b) unstable, that is, subject to
development or modification in the project at hand.

5. Direct or indirect coupling: Should only direct connections be counted or

should indirect connections also be taken into account?

6. Inheritance: Inheritance-based versus noninheritance-based coupling. Also

how to account for polymorphism and how to assign attributes and methods

to classes.

2.3 Static Cohesion Metrics

A large number of alternative measures are proposed for measuring cohesion. Briand

et al. [12] carry out a broad survey on the current state of cohesion measurement


in object-oriented systems and find fifteen separate measurements of cohesion. A

review of these measures is presented in the following subsections.

2.3.1 Chidamber and Kemerer

The Lack of Cohesion in Methods (LCOM1) measure was first suggested by Chi-

damber and Kemerer [25]. It is the most prevalently used cohesion measure today

and therefore is used as the basis for the definition of our run-time cohesion mea-

sures. It is defined as “the degree of similarity of methods” and is theoretically

based on the ontology of objects by Bunge [21]. Within this ontology, the similarity

of things is defined as the set of properties that the things have in common.

For a given class C with a number of methods, M1, M2, ..., Mn, let {Ii} be the

set of instance variables accessed by the method Mi. As there are n methods there

will be n such sets, one set per method. The LCOM metric is then determined by

counting the number of disjoint sets formed by the intersection of the n sets.

However, this was found to be quite ambiguous and the pair later redefined their

metric (LCOM2) [26]. For a class C1 with n methods, M1, . . . ,Mn, let {Ii} be the

set of instance variables referenced by method Mi. There are n such sets I1, ...In.

We can define two disjoint sets:

P = \{(I_i, I_j) \mid I_i \cap I_j = \emptyset\}, \quad Q = \{(I_i, I_j) \mid I_i \cap I_j \neq \emptyset\}    (2.2)

The lack of cohesion in methods is then defined from the cardinality of these sets

by:

LCOM = \begin{cases} |P| - |Q| & \text{if } |P| > |Q| \\ 0 & \text{otherwise} \end{cases}    (2.3)

LCOM is an inverse cohesion measure. An LCOM value of zero indicates a

cohesive class. Cohesiveness of methods within a class is desirable as it promotes

encapsulation. Any measure of disparateness of methods helps identify flaws in the

design of classes. If the value is greater than zero this indicates that the class can

be split into two or more classes, since its variables belong in disjoint sets. Low


cohesion is said to increase complexity, thereby increasing the likelihood of errors

during the development process.
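As a small worked example (the class is hypothetical), consider a class with three methods m1, m2 and m3 whose instance variable sets are I1 = {a, b}, I2 = {b, c} and I3 = {d}. The pair (I1, I2) shares the variable b and so belongs to Q, while the pairs (I1, I3) and (I2, I3) are disjoint and belong to P. Hence |P| = 2, |Q| = 1 and LCOM = 2 − 1 = 1, suggesting that the class might be split, for example by moving m3 and the variable d into a class of their own.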

2.3.2 Other Cohesion Metrics

Briand et al. define a set of cohesion measures for object-based systems [16,17] which

are adapted in [12] to object-oriented systems. For this adaption a class is viewed as

a collection of data declarations and methods. A data declaration is a local, public

type declaration, the class itself or public attributes. There can be data declara-

tion interactions between classes, attributes, types of different classes and methods.

They define the following measures; Ratio of Cohesive Interactions (RCI), Neutral

Ratio of Cohesive Interactions (NRCI), Pessimistic Ratio of Cohesive Interactions

(PRCI) and Optimistic Ratio of Cohesive Interactions (ORCI).

Hitz and Montazeri base their cohesion measurements LCOM3, LCOM4 and

C (Connectivity) on the work of Chidamber and Kemerer [51].

The cohesion measurements by Bieman and Kang are also based on the work of

Chidamber and Kemerer [9]. They define measurements known as Tight Class Co-

hesion (TCC) and Loose Class Cohesion (LCC). These metrics also consider pairs

of methods which use common attributes, however a distinction is made between

methods which access attributes directly or indirectly. They also take inheritance

into account, making suggestions on how to deal with inherited methods and inher-

ited attributes.

Lee et al. propose a set of cohesion measures based on the information flow

through method invocations within a class [62]. For a method m implemented in a

given class c, the cohesion of m is the number of invocations to other methods im-

plemented in class c, weighted by the number of parameters of the invoked methods.

The greater the number of parameters an invoked method has, the more informa-

tion is passed, the stronger the link between the invoking and invoked method. The

cohesion of a class is the sum of the cohesion of its methods. The cohesion of a set

of classes is given by the sum of the cohesion of the classes in the set.

Henderson-Sellers propose a cohesion measure (LCOM5) [49]. They state that

a value of zero is obtained if each method of the class references every attribute


of the class, and they call this “perfect cohesion”. They also state that if each

method of the class references only a single attribute, the measure yields one and

that values between zero and one are to be interpreted as percentages of the perfect

value. They do not state how to deal with inherited methods and attributes.

2.4 Frameworks for Static Cohesion Measurement

Two frameworks are defined in an attempt to outline what constitutes cohesion.

Eder et al. define a framework which aims at providing qualitative criteria for

cohesion and also assigns relative strengths to the different levels of cohesion they

identify within this framework.

A comprehensive framework based on a standard terminology and formalism is

outlined by Briand et al. which can be used (i) to facilitate comparison of existing

cohesion measures, (ii) to facilitate the evaluation and empirical validation of exist-

ing cohesion measures, and (iii) to support the definition of new cohesion measures

and the selection of existing ones based on a particular goal of measurement.

2.4.1 Eder et al.

Eder et al. propose a framework aimed at providing comprehensive, qualitative

criteria for cohesion in object-oriented systems [34]. They modify existing frame-

works for cohesion in the procedural and object-based paradigm to the specifics of

the object-oriented paradigm. They distinguish between three types of cohesion in

an object-oriented system: method, class and inheritance cohesion and state that

various degrees of cohesion exist for each type.

Myers’ classical definition of cohesion [83] is applied to methods for their def-
inition of method cohesion. The elements of a method are statements, local variables
and attributes of the method’s class. They define seven degrees of cohesion, based

on the definition by Myers. From weakest to strongest, the degrees of method co-

hesion are coincidental, logical, temporal, communicational, sequential, procedural

and functional.

Class cohesion addresses the relationships between the elements of a class. The


elements of a class are its non-inherited methods and non-inherited attributes. Eder

et al. use a categorisation of cohesion for abstract data types by Embley and Wood-

field [35] and adapt it to object-oriented systems. They define five degrees of class co-

hesion which are, from weakest to strongest, separable, multifaceted, non-delegated,

concealed and model.

Inheritance cohesion is similar to class cohesion in that it addresses the rela-

tionships between elements of a class. However, inheritance cohesion takes all the

methods and attributes of a class into account, that is, both the inherited and non-

inherited. Inheritance cohesion is strong if inheritance has been used for the purpose

of defining specialized children classes. Inheritance cohesion is weak if it has been

used for the purpose of reusing code. The degrees of inheritance cohesion are the

same as those for class cohesion.

2.4.2 Briand et al.

Briand et al. outline a new framework for cohesion in object-oriented systems [12]

based on the issues identified by comparing the various approaches to measuring

cohesion and the discussion of existing measures outlined in Section 2.3. The frame-

work consists of five criteria, each criterion determining one basic aspect of the

resulting measure.

The five criteria of the framework are:

1. The type of connection, that is, what makes a class cohesive. A connection

within a class is a link between elements of the class which can be attributes,

methods, or data declarations.

2. The domain of the measure: this specifies the objects to be measured, which can

be methods, classes etc.

3. Whether only direct connections or also indirect connections should be counted.

4. How to deal with inheritance, that is, how to assign attributes and methods

to classes and how to account for polymorphism.

5. How to account for access methods and constructors.


2.5 Run-time/Dynamic Coupling Metrics

While there has been considerable work on static metrics there has been little re-

search to date on run-time/dynamic coupling metrics. This section presents the two

most relevant works.

2.5.1 Yacoub et al.

Yacoub et al. propose a set of dynamic coupling metrics designed to evaluate the

change-proneness of a design [99]. These metrics are applied at the early de-

velopment phase to determine design quality. The measures are calculated from

executable object-oriented design models, which are used to model the application

to be tested. They are based on execution scenarios, that is “the measurements are

calculated for parts of the design model that are activated during the execution of a

specific scenario triggered by an input stimulus.” A scenario is the context in which

the metric is applicable. The scenarios are then extended to have an application

scope.

They define two metrics designed to measure the quality of designs at an early

development phase. Export Object Coupling (EOCx(oi, oj)) for an object oi with

respect to an object oj, is defined as the percentage of the number of messages sent

from oi to oj with respect to the total number of messages exchanged during the

execution of a scenario x. Import Object Coupling (IOCx(oi, oj)) for an object oi

with respect to an object oj, is the percentage of the number of messages received by

object oi that were sent by object oj with respect to the total number of messages

exchanged during the execution of a scenario x.

2.5.2 Arisholm et al.

Arisholm et al. define and validate a number of dynamic coupling metrics that

are listed in Table 2.1 [5]. Each dynamic coupling metric name starts with either

I or E to distinguish between import coupling and export coupling, based on the

direction of the method calls. The third letter, C or O, distinguishes whether the entity
of measurement is the object or the class. The remaining letter distinguishes three


Variable Description

IC CC Import, Class Level, Number of Distinct Classes

IC CM Import, Class Level, Number of Distinct Methods

IC CD Import, Class Level, Number of Dynamic Messages

EC CC Export, Class Level, Number of Distinct Classes

EC CM Export, Class Level, Number of Distinct Methods

EC CD Export, Class Level, Number of Dynamic Messages

IC OC Import, Object Level, Number of Distinct Classes

IC OM Import, Object Level, Number of Distinct Methods

IC OD Import, Object Level, Number of Dynamic Messages

EC OC Export, Object Level, Number of Distinct Classes

EC OM Export, Object Level, Number of Distinct Methods

EC OD Export, Object Level, Number of Dynamic Messages

Table 2.1: Abbreviations for the dynamic coupling metrics of Arisholm et al.

types of coupling. The first metric, C, counts the number of distinct classes that

a method in a given class/object uses or is used by. The second metric, M , counts

the number of distinct methods invoked by each method in each class/object while

the third metric, D, counts the total number of dynamic messages sent or received

from one class/object to or from other classes/objects.

Arisholm et al. study the relationship of these measures with the change-

proneness of classes. They find that the dynamic coupling metrics capture addi-

tional properties compared to the static coupling metrics and are good predictors

of the change-proneness of a class. Their study uses a single software system called

Velocity executed with its associated test suite, to evaluate the dynamic coupling

metrics. These test cases are found to originally have 70% method coverage, which

is increased to 90% for the methods that “might contribute to coupling” through

the removal of dead code. However, they did not study the impact of code coverage

on their results nor were results given for programs other than versions of Velocity.


2.6 Run-time/Dynamic Cohesion Metrics

As is the case with the run-time coupling metrics, there has not been much research

into run-time measures for cohesion. This section presents the only available study

to date.

2.6.1 Gupta and Rao

Gupta and Rao conduct a study which measures module cohesion in legacy soft-

ware [46]. Gupta and Rao compare statically calculated metrics against a program

execution based approach of measuring the levels of module cohesion. The results

from this study show that the static approach significantly overestimates the levels

of cohesion present in the software tested. However, Gupta and Rao are considering

programs written in C, where many features of object-oriented programs are not

directly applicable.

2.7 Other Studies of Dynamic Behaviour

In this section we present a review of other work on studies into the dynamic be-

haviour of Java programs. While such research is not directly related to coupling

and cohesion metrics, many of the issues and approaches to measurement are similar.

Indeed, any research that performs both static and dynamic analyses of programs

benefits from being viewed in the context of some overall perspective of the rela-

tionship between the static and dynamic data.

2.7.1 Dynamic Behaviour Studies

A number of studies of the dynamic behaviour of Java programs have been carried

out, mostly for optimisation purposes. Issues such as bytecode usage [45] and mem-

ory utilisation [28] have been studied, along with a comprehensive set of dynamic

measures relating to polymorphism, object creation and hot-spots [33]. However,

none of this work directly addresses the calculation of standard software metrics at

run-time.


The Sable group [33] seek to quantify the behaviour of programs with a concise

and precisely defined set of metrics. They define a set of unambiguous, dynamic, ro-

bust and architecture-independent measures that can be used to categorise programs

according to their dynamic behaviour in five areas which are size, data structure,

memory use, concurrency, and polymorphism. Many of the measurements they

record are of interest to the Java performance community as understanding the dy-

namic behaviour of programs is one important aspect in developing effective new

strategies for optimising compilers and runtime systems. It is important to note

that these are not typical software engineering metrics.

2.8 Coverage Metrics and Software Testing

Dynamic coverage measures are typically used in the field of software testing as

an estimate of the effectiveness of a test suite [10, 72]. Measurements of structural

coverage of code are a means of assessing the thoroughness of testing. The basis

of software testing is that software functionality is characterised by its execution

behaviour. In general, improved test coverage leads to improved fault coverage

and improved software reliability [69]. There are a number of metrics available for

measuring coverage, with increasing support from software tools. Such metrics do

not constitute testing techniques, but can be used as a measure of the effectiveness

of testing techniques. There are many different strategies for testing software, and

there is no consensus among software engineers about which approach is preferable

in a given situation. Test strategies fall into two categories [40]:

• Black-box (closed-box) testing: The test cases are derived from the specifica-

tion or requirements without reference to the code itself or its structure.

• White-box (open-box) testing: The test cases are selected based on knowledge

of the internal program structure.

A number of coverage metrics are based on the traversal of paths through the

control dataflow graph (CDFG) representing the system behaviour. Applying these

metrics to the CDFG representing a single process is a well understood task. The


following coverage metrics are examples of white-box testing techniques and are

based on the CDFG.

2.8.1 Instruction Coverage

Instruction coverage is the simplest structural coverage metric. It is achieved if

every source language statement in the program is executed at least once. With

this technique test cases are selected so that every program statement is executed

at least once. It is also known as statement coverage, segment coverage [84], C1 [7]

and basic block coverage.

The main advantage of this measure is that it can be applied directly to object

code and does not require processing source code. Performance profilers commonly

implement this measure. The main disadvantage of statement coverage is that it is

insensitive to some control structures. In summary, this measure is affected more

by computational statements than by decisions. Due to its ubiquity this was chosen

as the coverage measure that is used in the case studies in this thesis. There are

however, a number of other methods for evaluating the coverage of a program, for

example branch coverage, condition coverage, condition/decision coverage, modified

condition/decision coverage and path coverage.
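To illustrate this insensitivity, consider the following hypothetical method together with a test suite consisting of the single call abs(-1):

int abs(int x) {
    int result = x;
    if (x < 0) {
        result = -x;   // executed by the test abs(-1)
    }
    return result;
}

This single test executes every statement and so achieves 100% instruction coverage, yet the false outcome of the decision (the case x >= 0) is never exercised; branch coverage would report this gap, whereas instruction coverage does not.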

2.8.2 Alexander and Offutt

In their paper [3], Alexander and Offutt describe a coupling-based testing approach

for analysing and testing the polymorphic relationships that occur in object-oriented

software. The traditional notion of software coupling has been updated to apply

to object-oriented software, handling the relationships of aggregation, inheritance

and polymorphism. This allows the introduction of a new integration analysis and

testing technique for data flow interactions within object-oriented software. The

foundation of this technique is the coupling sequence, which is a new abstraction

for representing state space interactions between pairs of method invocations. The

coupling sequence provides the analytical focal point for methods under test and is

the foundation for identifying and representing polymorphic relationships for both


static and dynamic analysis. With this abstraction both testers and developers of

object-oriented programs can analyse and better understand the interactions within

their software. The application of these techniques can result in an increased ability

to find faults and overall higher quality software.

2.9 Previous Work by the Author

A preliminary study was previously conducted on the issues involved in perform-

ing a run-time analysis of Java programs [74]. This study outlined the general

principles involved in performing such an analysis. However, the results did not

offer a justifiable basis for generalisation as the programs analysed were a set of

Java microbenchmarks from the Java Grande Forum Benchmark Suite (JGFBS)

and therefore not representative of real applications. The metrics used were also

of a more primitive nature than the ones used in this study. Also, there was no

investigation made into the perspective of the measures, that is, the influence of

coverage, or the ability to predict external design quality. It did however provide

an indication that the evaluation of software metrics at run-time can provide an

interesting quantitative analysis of a program and that further research in this area

is needed.

The following papers have also been published:

• In [77, 78] studies on the quantification of a variety of run-time class-level

coupling metrics for object-oriented programs are described.

• In [77,79] an empirical investigation into run-time metrics for cohesion is pre-

sented.

• A study into a coverage analysis of Java benchmark suites is described in [20].

• An investigation into how object-level run-time metrics can be used to study

coupling between objects is presented in [81].

• A study of the influence of coverage on the relationship between static and

dynamic coupling metrics is described in [80].


2.10 Definition of Run-time Metrics

This section outlines the run-time metrics used in the remainder of this thesis.

Originally, it was decided to develop a number of run-time metrics for coupling

and cohesion that parallel the standard static object-oriented measures defined by

Chidamber and Kemerer [26]. Later Arisholm et al. defined a set of dynamic

coupling metrics in their paper [5] which closely parallel ours, so for the ease of

comparison it was decided to use their terminology and definitions for the coupling

measures.

The cohesion measures are all novel and are based on our own definitions.

2.10.1 Coupling Metrics

Three decision criteria are used to define and classify the run-time coupling mea-

sures. Firstly, a distinction is made as to whether the entity of measurement is

the object or the class. Run-time object-level coupling quantifies the level of depen-

dencies between objects in a system. Run-time class-level coupling quantifies the

level of dependencies between the classes that implement the methods or variables of

the caller object and the receiver object. The class of the object sending or receiving

a message may be different from the class implementing the corresponding method

due to the impact of inheritance.

Second, the direction of coupling for a class or object is taken into account,

as is outlined in previous static coupling frameworks [13]. This allows for the fact

that in a coupling relationship a class may act as a client or a server, that is, it may

access methods or instance variables from another class (import coupling) or it may

have its own methods or instance variables used (export coupling).

Finally the strength of the coupling relationship is assessed, that is the amount

of association between the classes. To do this it is possible to count either:

1. The number of distinct classes that a method in a given class uses or is used

by.

2. The number of distinct methods invoked by each method in each class.


3. The total number of dynamic messages sent or received from one class to or

from other classes.

Class-Level Metrics

The following are metrics for evaluating class-level coupling:

• IC CC: This determines the number of distinct classes accessed by a class at

run-time.

• IC CM: This determines the number of distinct methods accessed by a class

at run-time.

• IC CD: This determines the number of dynamic messages accessed by a class

at run-time.

• EC CC: This determines the number of distinct classes that are accessed by

other classes at run-time.

• EC CM: This determines the number of distinct methods that are accessed by

other classes at run-time.

• EC CD: This determines the number of dynamic messages that are accessed

by other classes at run-time.
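To make these definitions concrete, consider the following hypothetical fragment, in which a single execution calls A.run() exactly once:

class B {
    void f() { }
    void g() { }
}

class A {
    private B b = new B();
    void run() {
        b.f();    // message 1
        b.f();    // message 2
        b.g();    // message 3
    }
}

For this execution, IC CC for class A is 1 (only class B is accessed), IC CM is 2 (the distinct methods f and g) and IC CD is 3 (three messages in total); symmetrically, EC CC for class B is 1, EC CM is 2 and EC CD is 3.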

Object-Level Metric

To evaluate object-level coupling it was deemed necessary to define just one metric.
Since we want to examine the behaviour of objects at run-time, we require a measure
that is based on the class rather than the method level. Further, only import coupling
is evaluated, as we are interested in examining how classes use other classes at the
object-level rather than how they are used by other classes; export coupling for this
measure was therefore not evaluated.

The following is a measure for evaluating object-level coupling:

• IC OC: Import, Object-Level, Number of Distinct Classes: This measure will

be some function of the static CBO measure, as this measure determines the


classes that can be theoretically accessed at run-time. This is a coarse-grained

measure which will assess class-class coupling at the object-level.

2.10.2 Cohesion Metrics

The following run-time measures are based on the Chidamber and Kemerer static

LCOM measure for cohesion as described in Section 2.3.1. However, a problem with

the original definition for LCOM is its lack of discriminating power. Much of this

arises from the criterion which states that if |P| < |Q|, LCOM is automatically set to zero.

The result of this is a large number of classes with an LCOM of zero so the metric

has little discriminating power between these classes. In an attempt to correct this,

for the purpose of this analysis, we modify the original definition to be:

SLCOM = \frac{|P|}{|P| + |Q|}    (2.4)

SLCOM can range in value from zero to one. This new definition allows for com-

parison across classes; we therefore use this new version as the basis for the definition

of the run-time metrics. As these are cohesion measures they are evaluated at the

class-level only.
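Returning to the hypothetical example of Section 2.3.1, where |P| = 2 and |Q| = 1, LCOM yields 1 while SLCOM = 2/(2 + 1) ≈ 0.67. Had the sets instead given |P| = 1 and |Q| = 2, LCOM would be truncated to zero, whereas SLCOM = 1/3 would still distinguish this class from a perfectly cohesive one (for which SLCOM = 0).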

Run-time Simple LCOM (RLCOM)

RLCOM is a direct extension of the static case, except that now we only count

instance variables that are actually accessed at run-time. Thus, for a set of methods

m1, ..., mn, as before, let {I^R_i} represent the set of instance variables referenced by

method mi at run-time. Two disjoint sets are defined from this:

P^R = \{(I^R_i, I^R_j) \mid I^R_i \cap I^R_j = \emptyset\}, \quad Q^R = \{(I^R_i, I^R_j) \mid I^R_i \cap I^R_j \neq \emptyset\}    (2.5)

We can then define RLCOM as:

RLCOM = \frac{|P^R|}{|P^R| + |Q^R|}    (2.6)


We note that for any method m_i, |I_i| − |I^R_i| ≥ 0; this difference represents the number

of instance variables mentioned in a method’s code, but not actually accessed at

run-time.

Run-time Call-Weighted LCOM (RWLCOM)

It is reasonable to suggest that a heavily accessed variable should make a greater con-

tribution to class cohesion than one which is rarely accessed. However, the RLCOM

metric does not distinguish between the degree of access to instance variables. Thus

a second run-time measure RWLCOM is defined by weighting each instance variable

by the number of times it is accessed at run-time. This metric assesses the strength

of cohesion by taking the number of accesses into account.

As before, consider a class with n methods, m1, . . . ,mn, and let {Ii} be the set

of instance variables referenced by method mi. Define Ni as the number of times

method mi dynamically accesses instance variables from the set {Ii}. Now define a call-weighted version of equation 2.2 by summing over the number

of accesses:

P^W = \sum_{1 \le i,j \le n} \{(N_i + N_j) \mid I_i \cap I_j = \emptyset\}
Q^W = \sum_{1 \le i,j \le n} \{(N_i + N_j) \mid I_i \cap I_j \neq \emptyset\}
\text{where } P^W = 0 \text{ if } \{I_1\}, ..., \{I_n\} = \emptyset    (2.7)

Following equation 2.6 we define:

RWLCOM = \frac{|P^W|}{|P^W| + |Q^W|}    (2.8)

RWLCOM can range in value from zero to one. There is no direct relationship

with SLCOM or RLCOM, as it is based on the “hotness” of a particular program.
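As a hypothetical illustration, take the example class of Section 2.3.1 again, with I1 = {a, b}, I2 = {b, c} and I3 = {d}, and suppose that at run-time the methods perform N1 = 10, N2 = 1 and N3 = 1 instance variable accesses respectively. The single connected pair (m1, m2) contributes QW = N1 + N2 = 11, while the disjoint pairs (m1, m3) and (m2, m3) contribute PW = (N1 + N3) + (N2 + N3) = 13, so RWLCOM = 13/24 ≈ 0.54. The heavy run-time use of the shared variable b thus strengthens the measured cohesion relative to the unweighted value of 2/3 (assuming every mentioned variable is accessed at least once).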


2.11 Conclusion

This chapter outlined the most prevalent metrics for coupling and cohesion and dis-

cussed other work on studies into the dynamic behaviour of Java programs. Mea-

sures for dynamic coverage that are commonly used in the field of software testing

were described. Work and publications by the author were outlined. Finally, a

description of the run-time metrics used in this thesis was provided.


Chapter 3

Experimental Design

This chapter presents an overview of the tools and techniques used to carry out the

run-time empirical evaluation of a set of Java programs together with a detailed

description of the set of programs analysed. A review of the statistical techniques

used to interpret the data is also given.

3.1 Methods for Collecting Run-time Information

There are a number of alternative techniques available for extracting run-time in-

formation from Java programs, each with their own advantages and disadvantages.

3.1.1 Instrumenting a Virtual Machine

There are several open-source implementations of the JVM available, for example

Kaffe [58], Jikes [57] or the Sable VM [59]. As their source code is freely available

this means that all aspects of a running Java program can be observed. However,

due to the logging of bytecode instructions, instrumenting a JVM can result in a

huge amount of data being generated for the simplest of programs. The source code

organisation must be understood and the instrumentation has to be redone for each

new version of the VM. There can also be compatibility issues when compared with

the Java class libraries released by Sun. It has also been found that these VMs are

not very robust. This was the method used for a preliminary study [74]; however, it

was later discarded due to its many disadvantages.


3.1.2 Sun’s Java Platform Debug Architecture (JPDA)

Version 1.4 and later of the Java SDK supports a debugging architecture, the JPDA

[96], that provides event notification for low level JVM operations. A trace program

that handles these events can thus record information about the execution of a Java

program. This method is faster than instrumenting a VM and is more robust. The

same agent works with all VMs supporting the JPDA, and this is currently supported

by both Sun and IBM (although there are some differences). This technique has

proved useful in class-level metrics analysis. However, it is still very time consuming

to generate a profile for a large application and it is difficult to conduct an object-

level analysis using this approach.
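To sketch the flavour of this technique (a minimal, hypothetical example: TargetProgram stands for the program under analysis, error handling is omitted, and this is a simplification rather than the ClMet tool described later), a JDI-based tracer that logs each method entry outside the core libraries could be structured as follows:

import java.util.Map;
import com.sun.jdi.*;
import com.sun.jdi.connect.*;
import com.sun.jdi.event.*;
import com.sun.jdi.request.*;

public class TraceSketch {
    public static void main(String[] argv) throws Exception {
        // Launch the target program in a new JVM via the default connector
        LaunchingConnector conn =
                Bootstrap.virtualMachineManager().defaultConnector();
        Map<String, Connector.Argument> args = conn.defaultArguments();
        args.get("main").setValue("TargetProgram"); // hypothetical target
        VirtualMachine vm = conn.launch(args);

        // Request notification on every method entry outside the JDK
        MethodEntryRequest req =
                vm.eventRequestManager().createMethodEntryRequest();
        req.addClassExclusionFilter("java.*");
        req.addClassExclusionFilter("sun.*");
        req.enable();

        // Event loop: log each method entry until the target VM exits
        while (true) {
            EventSet set = vm.eventQueue().remove();
            for (EventIterator it = set.eventIterator(); it.hasNext(); ) {
                Event e = it.nextEvent();
                if (e instanceof MethodEntryEvent) {
                    Method m = ((MethodEntryEvent) e).method();
                    System.out.println(m.declaringType().name()
                            + "." + m.name());
                } else if (e instanceof VMDeathEvent
                        || e instanceof VMDisconnectEvent) {
                    return;
                }
            }
            set.resume(); // let the target continue after each event set
        }
    }
}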

3.1.3 Bytecode Instrumentation

This involves statically manipulating the bytecode to insert probes, or other track-

ing mechanisms, that record information at runtime. This provides the simplest

approach to dynamic analysis since it does not require implementation specific

knowledge of JVM internals, and imposes little overhead on the running program.

Bytecode instrumentation can be performed using the publicly available Apache

Bytecode Engineering Library (BCEL) [30]. This technique provides object-level

accuracy and therefore was used in the object-level metrics analysis.
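As a minimal sketch of this approach (assuming the BCEL 5.x API; ProbeLog is a hypothetical run-time recording class with a static method hit(int), and this is a simplification rather than the ObMet implementation described later), an entry probe can be spliced into every concrete method of a class file as follows:

import org.apache.bcel.Constants;
import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.Method;
import org.apache.bcel.generic.*;

public class InstrumentSketch {
    public static void main(String[] args) throws Exception {
        // Parse the class file named on the command line
        ClassGen cg = new ClassGen(new ClassParser(args[0]).parse());
        ConstantPoolGen cp = cg.getConstantPool();
        InstructionFactory f = new InstructionFactory(cg, cp);
        int probeId = 0;
        for (Method m : cg.getMethods()) {
            if (m.isAbstract() || m.isNative()) continue;
            MethodGen mg = new MethodGen(m, cg.getClassName(), cp);
            // Build the probe: push a unique id and call ProbeLog.hit(int)
            InstructionList probe = new InstructionList();
            probe.append(new PUSH(cp, probeId++));
            probe.append(f.createInvoke("ProbeLog", "hit", Type.VOID,
                    new Type[] { Type.INT }, Constants.INVOKESTATIC));
            // Splice the probe in before the first original instruction
            mg.getInstructionList().insert(probe);
            mg.setMaxStack();
            cg.replaceMethod(m, mg.getMethod());
        }
        // Overwrite the class file with its instrumented version
        cg.getJavaClass().dump(args[0]);
    }
}

At run-time each executed method then reports its probe identifier, from which per-class access information can be reconstructed offline.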

3.2 Metrics Data Collection Tools (Design Objec-

tives)

The dynamic analysis of any program involves a huge amount of data processing.

However, the level of performance of the collection mechanism was not considered

to be a critical issue. It was only desirable that the analysis could be carried out in

reasonable and practical time. The flexibility of the collection mechanism was a key

issue, as it was necessary to be able to collect a wide variety of dynamic information.


3.2.1 Class-Level Metrics Collection Tool (ClMet)

We have developed a tool for the collection of class-level metrics called ClMet, as

illustrated by Figure 3.1, which utilises the JPDA. This is a multi-tiered debugging

architecture contained within Sun Microsystem’s Java 2 SDK version 1.4. It consists

of two interfaces, the Java Virtual Machine Debug Interface (JVMDI), and the Java

Debug Interface (JDI), and a protocol, the Java Debug Wire Protocol (JDWP).

The first layer of the JPDA, the JVMDI, is a programming interface implemented

by the virtual machine. It provides a way to both inspect the state and control the

execution of applications running in the JVM. The second layer, the JDWP, de-

fines the format of information and requests transferred between the process being

debugged and the debugger front-end which implements the JDI. The JDI, which

comprises the third layer, defines information and requests at the user code level. It

provides introspective access to a running virtual machine’s state, the class, array,

interface, and primitive types, and instances of those types. While a tracer imple-

mentor could directly use the Java Debug Wire Protocol (JDWP) or Java Virtual

Machine Debug Interface (JVMDI), this interface greatly facilitates the integration

of tracing capabilities into development environments. This method was selected

because of the ease with which it is possible to obtain specific information about

the run-time behaviour of a program.

In order to match objects against method calls it is necessary to model the

execution stack of the JVM, as this information is not provided directly by the

JPDA. We have implemented an EventTrace analyser class in Java, which carries

out a stack based simulation of the entire execution in order to obtain information

about the state of the execution stack. This class also implements a filter which

allows the user to specify which events and which of their corresponding fields are

to be captured for processing. This allows a high degree of flexibility in the collection

of the dynamic trace data.

The final component of our collection system is a Metrics class, which is re-

sponsible for calculating the desired metrics on the fly. It is also responsible for

outputting the results in text format. The metrics to be calculated can be specified

from the command line. The addition of the metrics class allows new metrics to be

easily defined as the user need only interact with this class.

Figure 3.1: Components of run-time class-level metrics collection tool, ClMet

3.2.2 Object-Level Metrics Collection Tool (ObMet)

We have developed an object-level metrics collection tool called ObMet, which uses

the BCEL and is based on the Gretel [53] coverage monitoring tool.

The BCEL is an API which can be used to analyse, create, and manipulate

(binary) Java class files. Classes are represented by BCEL objects which contain all

the symbolic information of the given class, such as methods, fields and byte code

instructions. Such objects can be read from an existing file, be transformed by a

program and dumped to a file again.

Figure 3.2: Components of run-time object-level metrics collection tool, ObMet

Figure 3.2 illustrates the components of ObMet. In the first stage the Instrumenter
program takes a list of class files and instruments them. During this phase

the BCEL inserts probes into these files to flag events like method calls or instance

variable accesses. During instrumentation, the class files are changed in-place, and

a file containing information on method and field accesses is created. Each method

and field is given a unique index in this file. When the application is run, each

probe records a “hit” in another file. The Metrics program then calculates the

run-time measures utilising the information in these files.

3.2.3 Static Data Collection Tool (StatMet)

In order to calculate the static metrics it is necessary to convert the binary class

files into a human readable format. The StatMet tool is based on the Gnoloo

disassembler [38], which converts the class files into an Oolong source file. The

Oolong language is an assembly language for the Java Virtual Machine and the

resulting file will be nearly equivalent to the class file format but it will be suitable for

human interpretation. The StatMet tool extends the disassembler with an additional

metrics component which calculates the static metrics from the Oolong code. Figure
3.3 illustrates the components of the StatMet tool.

Figure 3.3: Components of static metrics collection tool, StatMet

3.2.4 Coverage Data Collection Tool (InCov)

In order to calculate the instruction coverage, it is necessary to record, for each

instruction, whether or not it was executed. In fact, well-known techniques exist for

identifying sequences of consecutive instructions, known as basic blocks, that some-

what reduce the instrumentation overhead. Nonetheless, since static code analysis

is required to determine basic block entry points, it seemed most efficient to also

instrument the bytecode during this analysis.

The instrumentation framework uses the Apache Byte Code Engineering Library

(BCEL) [30] along with the Gretel Residual Test Coverage Tool [53]. The Gretel

tool statically works out the basic blocks in a Java class file and inserts a probe

consisting of small sequence of bytecode instructions at each basic block. Whenever

the basic block is executed, the probe code records a “hit” as a simple boolean value.

The number of bytecode instructions in the basic block can then be used to calculate


instruction coverage.

3.2.5 Fault Detection Study

Mutation testing [48, 64] is a fault-based testing technique that measures the effec-

tiveness of test cases. It was first introduced as a way of measuring the accuracy

of test suites. It is based on the assumption that a program will be well tested if

a majority of simple faults are detected and removed. Mutation testing measures

how good a test is by inserting faults into the program under test. Each fault gen-

erates a new program, a mutant, that is slightly different from the original. These

mutant versions of the program are created from the original program by applying

mutation operators, which describe syntactic changes to the programming language.

Test cases are used to execute these mutants with the goal of causing each mutant

to produce incorrect output. The idea is that the tests are adequate if they dis-

tinguish the program from one or more mutants. The cost of mutation testing has

always been a serious issue and many techniques proposed for implementing it have

proved to be too slow for practical adoption. µJava is a tool created for performing

mutation testing on Java programs.
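For example, applying the relational operator replacement (ROR) operator to the hypothetical method below yields a mutant that differs in a single token:

// Original
int max(int a, int b) { return (a > b) ? a : b; }

// ROR mutant: ">" replaced by "<"
int max(int a, int b) { return (a < b) ? a : b; }

A test case such as max(1, 2) kills this mutant, since the original returns 2 while the mutant returns 1; a test like max(2, 2), by contrast, cannot distinguish the two versions.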

µJava

µJava [66, 67] is a mutation system for Java programs. It automatically generates

mutants for both traditional mutation testing and class-level mutation testing. It

can test individual classes and packages of multiple classes.

The method-level or traditional mutants are based on the selective operator set

by Offutt et al. [87]. These (non-OO) mutants are all behavioural in nature. There

are five traditional mutation operators in total. A description of these operators can be found

in Appendix D.1.

The class-level mutation operators were designed for Java classes by Ma, Kwon

and Offutt [68], and were in turn designed from a categorisation of object-oriented

faults by Offutt, Alexander et al. [86]. The object-oriented mutants are created

according to 23 operators that are specialised to object-oriented faults. Each of

these can be categorised based on one of five language feature groups they are related


to. The class-level mutants can also be divided into one of two types: behavioural
mutants are those that change the behaviour of the program, while structural mutants

are those that change the structure of the program. A detailed description of these

mutants can be found in Appendix D.2.

After creating mutants, µJava allows the tester to enter and run tests, and

evaluates the mutation coverage of the tests. Test cases are then added in an attempt

to “kill” the mutants by differentiating the output of the original program from the

mutant programs. Tests are supplied by the users as sequences of method calls to

the classes under test encapsulated in methods in separate classes.

3.3 Test Case Programs

An important technique used in the evaluation of object systems is benchmarking. A

benchmark is a black-box test, even if the source code is available [73]. A benchmark

should consist of two elements:

• The structure of the persistent data.

• The behaviour of an application accessing and manipulating the data.

The process of using a benchmark to assess a particular object system involves exe-

cuting or simulating the behaviour of the application while collecting data reflecting

its performance [54]. A number of different Java benchmarks are available and those

used in the course of this study are discussed in the following subsection.

3.3.1 Benchmark Programs

Benchmark suites are commonly used to measure performance and fulfill many of

the required properties of a test suite. The following were used in this analysis.

SPECjvm98 Benchmark Suite

The SPECjvm98 benchmark suite [8] is typically used to study the architectural

implications of a Java runtime environment. The benchmark suite consists of eight


Application Description

201 compress A popular modified Lempel–Ziv method (LZW) compression program.
202 jess JESS is the Java Expert Shell System, based on NASA's popular CLIPS rule-based expert shell system.
205 raytrace A raytracer that works on a scene depicting a dinosaur.
209 db Data management software written by IBM.
213 javac The Sun Microsystems Java compiler from the JDK 1.0.2.
222 mpegaudio An application that decompresses audio files that conform to the ISO MPEG Layer-3 audio specification.
227 mtrt A variant of 205 raytrace; a dual-threaded program that ray traces an image.
228 jack A Java parser generator from Sun Microsystems, based on the Purdue Compiler Construction Tool Set (PCCTS). This is an early version of what is now called JavaCC.

Table 3.1: Description of the SPECjvm98 benchmarks

Java programs which represent different classes of Java applications as illustrated

by Table 3.1.

These programs were run at the command line prompt and do not include graph-

ics, AWT (graphical interfaces), or networking. The programs were run with a 100%

size execution by specifying a problem size s100 at the command line.

JOlden Benchmark Suite

The original Olden benchmarks are a suite of pointer intensive C programs which

have been translated into Java. They are small, synthetic programs but they were

used as part of this study as each program exhibits a large volume of object creation.


Application Description

bh Solves the N-body problem using hierarchical methods.
bisort Sorts by creating two disjoint bitonic sequences and then merging them.
em3d Simulates the propagation of electromagnetic waves in a 3D object.
health Simulates the Colombian health care system.
mst Computes the minimum spanning tree of a graph.
perimeter Computes the perimeter of a set of quad-tree encoded raster images.
power Solves the Power System Optimization problem.
treeadd Adds the values in a tree.
tsp Computes an estimate of the best Hamiltonian circuit for the travelling salesman problem.
voronoi Computes the Voronoi Diagram of a set of points.

Table 3.2: Description of the JOlden benchmarks

Table 3.2 gives a description of the programs [23].

There are a number of other benchmark suites available that could be used in this
type of study but which were excluded for various reasons. The DaCapo benchmark suite

was excluded as it is still in its beta stage of development. The Java Grande Forum

Benchmark Suite (JGFBS), which was used in a previous study [74], was excluded

as the programs did not exhibit very high levels of coupling and cohesion at run-

time. Other suites such as CaffeineMark were excluded as these are microbenchmark
programs and are therefore not typical of real Java applications.

3.3.2 Real-World Programs

It was deemed desirable to include a number of real-world programs in the analysis

to see if the results are scalable to actual programs. The following were chosen as

both the programs themselves and their source code are publicly available. They all come with a set


of pre-defined test cases that are also publicly available, thus defining both the static

and dynamic context of our work. This contrasts with some other approaches which,

at worst, can use arbitrary software packages, often proprietary, with an ad-hoc set

of test inputs.

Velocity

Velocity (version 1.4.1) is an open-source software system that is part of the Apache

Jakarta Project [55]. It is a Java-based template engine and it permits anyone to

use a simple yet powerful template language to reference objects defined in Java

code. It can be used to generate web pages, SQL, PostScript, and other outputs

from template documents. It can be used either as a standalone utility or as an

integrated component of other systems. The set of JUnit test cases supplied with

the program were used to execute the program.

Xalan-Java

Xalan-Java (version 2.6.0) is an open-source software system that is part of the

Apache XML Project [92]. It is an XSLT processor for transforming XML docu-

ments into HTML, text, or other XML document types. It implements XSL Trans-

formations (XSLT) Version 1.0 and XML Path Language (XPath) Version 1.0. It

can be used from the command line, in an applet or a servlet, or as a module in

other program. A set of JUnit test cases supplied for the program were used for its

execution.

Ant

Ant (version 1.6.1) is a Java-based build tool that is part of the Apache Ant Project

[4]. It is similar to GNU Make but has the full portability of pure Java code. Instead

of writing shell commands, as with Make, the configuration files are XML-based,

calling out a target tree where various tasks are executed.


SPECjvm98 JOlden Velocity Xalan Ant

Case Study 1: X X X X X

Case Study 2: X X X X

Case Study 3: X X X

Table 3.3: Programs used for each case study

3.3.3 Execution of Programs

All the programs except those in the SPEC benchmark suite were compiled using

the javac compiler from Sun's SDK version 1.5.0_01, and all benchmarks were run

using the client virtual machine from this SDK. The programs in the SPEC suite are

distributed in class file format, and were not recompiled or otherwise modified. We

note (in accordance with the license) that the SPEC programs were run individually,

and thus none of these results are comparable with the standard SPECjvm98 metric.

All benchmark suites include not just the programs themselves, but a test harness

to ensure that results from different executions are comparable. Table 3.3 outlines

the programs used for each case study. Not all programs were suitable for use in

every case study and we defer the explanation of this to the relevant chapters.

3.4 Statistical Techniques

The following section presents a detailed review of the statistical techniques used in

this study.

3.4.1 Descriptive Statistics

Descriptive statistics describe patterns and general trends in a data set. They also

aid in explaining the results of more complex statistical techniques. For each case

study a number of descriptive statistics were evaluated from the following:

The Distribution or Mean (X)

\bar{X} = \frac{\sum X}{N}    (3.1)


The mean is the sum of all values (X) divided by the total number of values (N).

The Standard Deviation (s)

s = \sqrt{var} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}    (3.2)

The standard deviation is a measure of the range of values in a set of numbers.

It is used as a measure of the dispersion or variation in a distribution. Simply

put, it tells us how far a typical member of a sample or population is from the

mean value of that sample or population. A large standard deviation suggests that

a typical member is far away from the mean. A small standard deviation suggests

that members are clustered closely around the mean. It is computed as the square

root of the variance.

Many statistical techniques assume that data is normally distributed. If that

assumption can be justified, then 68% of the values are at most one standard devi-

ation away from the mean, 95% of the values are at most two standard deviations

away from the mean, and 99.7% of the values lie within three standard deviations

of the mean.

The Coefficient of Variation (CV )

CV = \frac{\sigma}{\mu} \times 100    (3.3)

CV measures the relative scatter in data with respect to the mean and is calcu-

lated by dividing the standard deviation by the mean. It has no units and can be

expressed as a simple decimal value or reported as a percentage value. When the

CV is small the data scatter relative to the mean is small. When the CV is large

compared to the mean the amount of variation is large. Equation 3.3 defines the

coefficient of variation as a percentage, where µ is the mean and σ is the standard

deviation.
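As a short worked example with the hypothetical sample 2, 4, 4, 4, 5, 5, 7, 9: the mean is 40/8 = 5, the sample standard deviation is \sqrt{32/7} ≈ 2.14, and so CV ≈ (2.14/5) × 100 ≈ 43%; a typical value deviates from the mean by a little under half of the mean's magnitude.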

Skewness

skewness = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^3}{(N - 1)s^3}    (3.4)


Skewness is the tilt (or lack of it) in a distribution. It characterises the degree of

asymmetry of a distribution around its mean. A distribution is symmetric if it looks

the same to the left and right of the centre point. Equation 3.4 gives the formula

for skewness for X1, X2, ..., XN , where X is the mean, s is the standard deviation

and N is the number of data points.

Kurtosis

kurtosis = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)s^4}    (3.5)

Kurtosis is the peakedness of a distribution. Equation 3.5 gives the formula for

kurtosis for X1, X2, ..., XN.

3.4.2 Normality Tests

Many statistical procedures require that the data being analysed follow a normal

data distribution. If this is not the case, then the computed statistics may be

extremely misleading. Normal distributions take the form of a symmetric bell-

shaped curve. Normality can be visually assessed by looking at a histogram of

frequencies, or by looking at a normal probability plot.

A common rule-of-thumb test for normality is to get skewness and kurtosis, then

divide these by the standard errors. Skew and kurtosis should be within the +2

to -2 range when the data are normally distributed. Negative skew is left-leaning,

positive skew right-leaning. Negative kurtosis indicates too many cases in the tails

of the distribution. Positive kurtosis indicates too few cases in the tails.

Shapiro-Wilk’s W Test

W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (3.6)

Formal tests such as the Shapiro-Wilk test may also be applied to assess whether the data is normally distributed. It calculates a W statistic that tests whether a random sample, x_1, x_2, ..., x_n, comes from a normal distribution. W may be thought of as the correlation between the given data and their corresponding normal scores, with W = 1 when the given data are perfectly normal in distribution. When W is significantly smaller than 1, the assumption of normality is not met. The Shapiro-Wilk W test is recommended for small and medium samples up to n = 2000.

Equation 3.6 calculates the W statistic, where x_{(i)} are the ordered sample values and a_i are constants generated from the means, variances and covariances of the order statistics of a sample of size n from a normal distribution [90, 93].

Kolmogorov-Smirnov D Test or K-S Lilliefors test

D = \max_{1 \le i \le N} \left| F(y_i) - \frac{i}{N} \right|    (3.7)

For larger samples, the Kolmogorov-Smirnov test is recommended. For a single

sample of data, this test is used to test whether or not the sample of data is consistent

with a specified distribution function. When there are two samples of data, it is

used to test whether or not these two samples may reasonably be assumed to come

from the same distribution. Equation 3.7 defines the test statistic, where F is the

theoretical cumulative distribution of the distribution being tested which must be a

continuous distribution. The hypothesis regarding the distributional form is rejected

if the test statistic, D, is greater than the critical value obtained from a table. There

are several variations of these tables in the literature [24].
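As an illustration, the sketch below applies both tests to a deliberately skewed artificial sample using SciPy's implementations. The data are invented, and scipy.stats.kstest is used here as a stand-in for the K-S Lilliefors procedure; strictly, the Lilliefors correction applies when the normal parameters are estimated from the sample, as they are in this sketch.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=1.0, sigma=0.8, size=200)   # skewed, non-normal data

# Shapiro-Wilk: W close to 1 supports normality (Equation 3.6)
w_stat, w_p = stats.shapiro(sample)

# Kolmogorov-Smirnov D against a normal distribution fitted to the sample
# (Equation 3.7)
d_stat, d_p = stats.kstest(sample, 'norm',
                           args=(sample.mean(), sample.std(ddof=1)))

print(f"Shapiro-Wilk W = {w_stat:.3f} (p = {w_p:.4f})")
print(f"K-S D = {d_stat:.3f} (p = {d_p:.4f})")
# Small p-values lead to rejecting the hypothesis of normality.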

3.4.3 Normalising Transformations

There are a number of transformations that can be applied to make data approximately normally distributed. To normalise right or positive skew, square root, logarithmic, and inverse (1/x) transforms "pull in" outliers. Inverse transforms are stronger than logarithmic transforms, which are stronger than roots. To correct left or negative skew, first subtract all values from the highest value plus 1, then apply square root, inverse, or logarithmic transforms. Power transforms can be used to correct both types of skew, and finer adjustments can be made by adding a constant, C, in the transform of X: (X + C)^P. Values of P less than one (roots) correct right skew, which is the common situation (a power of 2/3 is common when attempting to normalise). Values of P greater than 1 (powers) correct left skew.

For right skew, decreasing P decreases right skew. Too great a reduction in P will


overcorrect and cause left skew. When the best P is found, further refinements

can be made by adjusting C. For right skew, for instance, subtracting C will de-

crease skew. Logarithmic transformations are appropriate to achieve symmetry in

the central distribution when symmetry of the tails is not important. Square root

transformations are used when symmetry in the tails is important. When both are

important, a fourth root transform may work.
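A minimal sketch of these transformations on illustrative right-skewed data follows; the constants C = 1 and P = 2/3 are example choices rather than prescriptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.lognormal(size=500)              # right- (positively) skewed data

transforms = {
    "raw":       x,
    "sqrt":      np.sqrt(x),             # mildest correction
    "log":       np.log(x),              # stronger correction
    "power 2/3": (x + 1) ** (2 / 3),     # (X + C)^P with C = 1, P = 2/3
}
for name, t in transforms.items():
    print(f"{name:10s} skewness = {stats.skew(t):+.2f}")

# For left (negative) skew: subtract all values from the highest value
# plus 1, turning it into right skew, then apply a transform from above.
reflected = (x.max() + 1) - x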

3.4.4 Pearson Correlation Test

R = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}    (3.8)

The Pearson or product moment correlation test is used to assess if there is a

relationship between two or more variables, in other words it is a measure of the

strength of the relationship between the variables. Given n pairs of data (x_i, y_i), Equation 3.8 computes the correlation coefficient R. R is a number that summarises

the direction and degree (closeness) of linear relations between two variables and is

also known as the Pearson Product-Moment Correlation Coefficient. R ranges from -1 through 0 to +1. The sign (+ or -) of the correlation affects its

interpretation. When the correlation is positive (R > 0), as the value of one variable

increases, so does the other. The closer R is to zero the weaker the relationship. If

a correlation is negative, when one variable increases, the other variable decreases.

The following general categories indicate a quick way of interpreting a calculated R

value [97]:

• 0.0 to 0.2 Very weak to negligible correlation

• 0.2 to 0.4 Weak, low correlation (not very significant)

• 0.4 to 0.6 Moderate correlation

• 0.7 to 0.9 Strong, high correlation

• 0.9 to 1.0 Very strong correlation

The results of such an analysis are displayed in a correlation matrix table.
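For illustration, the sketch below computes R (Equation 3.8) with its significance, and a correlation matrix, for two hypothetical metric vectors; the numbers are invented for the example.

import numpy as np
from scipy import stats

# Hypothetical metric values over the same set of classes
cbo   = np.array([6.2, 4.8, 7.5, 9.1, 8.5, 5.7, 6.0])
ic_cc = np.array([1.7, 3.0, 2.1, 1.8, 3.2, 2.6, 2.7])

r, p = stats.pearsonr(cbo, ic_cc)   # R from Equation 3.8, with its p-value
print(f"R = {r:.3f}, p = {p:.4f}")

# Pairwise R values for several metrics at once: a correlation matrix
print(np.corrcoef(np.vstack([cbo, ic_cc])))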

3.4.5 T-Test

t = \frac{r}{\sqrt{(1 - r^2)/(N - 2)}}    (3.9)


Any relationship between two variables should be assessed for its significance

as well as its strength. A standard two tailed t-test is used to test for statistical

significance as illustrated by equation 3.9. Coefficients are considered significant if

the t-test p-value is below 0.05. The p-value indicates how unlikely a given correlation coefficient, r, would be if there were no relationship in the population. Therefore the smaller the p-value, the more significant the relationship, subject to the usual risks of Type I and Type II errors.
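A minimal sketch of this test, assuming a hypothetical correlation of r = 0.65 over N = 30 data points; it evaluates Equation 3.9 and derives the two-tailed p-value.

import numpy as np
from scipy import stats

def correlation_significance(r, n):
    # Two-tailed t-test for a correlation coefficient (Equation 3.9)
    t = r / np.sqrt((1 - r ** 2) / (n - 2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p

t, p = correlation_significance(r=0.65, n=30)
print(f"t = {t:.3f}, p = {p:.4f}, significant = {p < 0.05}")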

3.4.6 Principal Component Analysis

Principal Component Analysis (PCA) is used to analyse the covariate structure

of the metrics and to determine the underlying structural dimensions they capture.

In other words PCA can tell if all the metrics are likely to be measuring the same

class property. PCA usually generates a large number of principal components. The number retained is decided based on the amount of variance explained by each component. A typical threshold is to retain principal components with eigenvalues (variances) larger than 1.0; this is the Kaiser criterion. There are a number of

stages involved in performing a PCA on a set of data:

1. Select a data set, for example one with two dimensions x and y.

2. Subtract the mean from each of the data dimensions. The mean subtracted

is the average across each dimension, so all the x values have the mean x

subtracted and all the y values will have y subtracted. This produces a data

set whose mean is zero.

3. Calculate the covariance matrix. Formula 3.10 gives the definition for a co-

variance matrix for a set of data with n dimensions, where Cn×n is a matrix

with n rows and n columns, and Dimx is the xth dimension.

C^{n \times n} = (c_{i,j}), \quad c_{i,j} = \mathrm{cov}(Dim_i, Dim_j)    (3.10)

An n-dimensional data set will have \frac{n!}{(n-2)! \times 2} different covariance values. As the data we propose to use is two dimensional, the covariance matrix will be 2 × 2:

C = \begin{pmatrix} \mathrm{cov}(x, x) & \mathrm{cov}(x, y) \\ \mathrm{cov}(y, x) & \mathrm{cov}(y, y) \end{pmatrix}

4. Calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors are unit vectors (their lengths are 1), and each eigenvector has a closely related eigenvalue. These are important as they provide information about patterns in the data.

5. Choosing components and forming a feature vector. In general, once eigen-

vectors are found from the covariance matrix, the next step is to order them

by eigenvalue, highest to lowest. This gives you the components in order of

significance. Some of the components of lesser significance can be ignored. If

some components are left out, the final data set will have less dimensions than

the original. To be precise, if there are originally n dimensions in the data,

and n eigenvectors and eigenvalues are calculated, and the first p eigenvectors

are chosen, then the final data set has only p dimensions.

A feature vector, which is just another name for a matrix of vectors, is constructed by taking the eigenvectors that you want to keep from the list of eigenvectors, and forming a matrix with these eigenvectors in the columns.

FeatureVector = (eig_1 \; eig_2 \; eig_3 \; \ldots \; eig_n)    (3.11)

6. Derive a new data set, for this we simply take the transpose of the vector and

multiply it on the left of the original data set, transposed.

FinalData = RowFeatureVector \times RowDataAdjust    (3.12)

where RowFeatureVector is the matrix with the eigenvectors in the columns

transposed so that the eigenvectors are now in the rows, with the most signif-

icant eigenvector at the top, and RowDataAdjust is the mean-adjusted data


transposed, that is, the data items are in each column, with each row holding

a separate dimension.

See [56] for further details on PCA.
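The sketch below walks through these six steps on a small two-dimensional data set of the kind described above; it is a bare-bones illustration with invented data, using the Kaiser criterion as the default for choosing the number of components p.

import numpy as np

def pca(data, p=None):
    # Steps 2-6 of the procedure above (Equations 3.10-3.12)
    adjusted = data - data.mean(axis=0)            # step 2: subtract the mean
    C = np.cov(adjusted, rowvar=False)             # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # step 4: eigen-decomposition
    order = np.argsort(eigvals)[::-1]              # step 5: sort by eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if p is None:
        p = int((eigvals > 1.0).sum())             # Kaiser criterion
    feature_vector = eigvecs[:, :p]                # Equation 3.11
    final_data = feature_vector.T @ adjusted.T     # step 6: Equation 3.12
    return eigvals, final_data.T

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
eigvals, projected = pca(data, p=1)
print(eigvals, projected.ravel())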

3.4.7 Cluster Analysis

Cluster Analysis is a data exploratory statistical procedure that helps reveal asso-

ciations and structures of data in a domain set [91]. A measure of proximity or

similarity/dissimilarity is needed in order to determine groups from a complex data

set. A wide variety of such measures exist but no consensus prevails over which is

superior. For this study, two widely used dissimilarity measures, Pearson dissimi-

larity and Euclidean distance were chosen. The analysis was repeated using these

two different measures in order to verify the results.

Equation 3.13 defines the Pearson Dissimilarity, where µx and µy are the means

of the first and second sets of data, and σx and σy are the standard deviations of

the first and second sets of data.

d(x, y) = \frac{\frac{1}{n}\sum_i x_i y_i - \mu_x \mu_y}{\sigma_x \sigma_y}    (3.13)

Equation 3.14 defines the Euclidean Distance between two sets of data.

d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}    (3.14)

The next step is to select the most suitable type of clustering algorithm for the

analysis. The agglomerative hierarchical clustering (AHC) algorithm was chosen as

being the most suitable for the specifications of the analysis. Also, it does not require the number of clusters the data should be grouped into to be specified in advance. AHC

algorithms start with singleton clusters, one for each entity. The most similar pair

of clusters are merged, one pair at a time, until a single cluster remains.

Throughout the cluster analysis, there is a symmetric matrix of dissimilarities

maintained between the clusters. Once two clusters have been merged, it is neces-

sary to generate the dissimilarity between the new cluster and every other cluster.


The unweighted pair group average linkage algorithm was employed here as it is

theoretically the best method to use. This algorithm clusters objects based on the

average distance between all pairs.

Suppose we have three clusters A, B and C, with i being the distance between A and C, and j being the distance between B and C. If A and B are the most similar pair of entities and are joined together into a new cluster D, the new distance k between C and D is calculated as given by Equation 3.15.

k = \frac{i \times size(A) + j \times size(B)}{size(A) + size(B)}    (3.15)

The analysis was repeated using Ward’s method to verify the results. With

this method cluster membership is assessed by calculating the total sum of squared

deviations from the mean of a cluster. The criterion for fusion is that it should

produce the smallest possible increase in the error sum of squares.

The output of AHC is usually represented in a special type of tree structure called

a dendrogram, as illustrated by Figure 3.4. Each branch of the tree represents a

cluster and is drawn vertically to the height at which the cluster merges with neighbouring

clusters. The cutting line is a line drawn horizontally across the dendrogram at a

given dissimilarity level to determine the number of clusters. The cutting line is

determined by constructing a histogram of node levels to find where the increase

in dissimilarity is strongest, as then we have reached a level where we are grouping

groups that are already homogeneous. The cutting line is selected before this level

is reached.
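A minimal sketch of this procedure using SciPy's hierarchical clustering follows, with invented data and an arbitrary cutting line of 1.5; the analysis in this thesis instead selects the cutting line from a histogram of node levels as described above, and repeats the clustering with Ward's method and a Pearson-style dissimilarity to verify the grouping.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows are entities, columns are the features they are compared on
X = np.array([[0, 2, 0], [0, 2, 0], [3, 0, 1], [3, 0, 2], [0, 2, 1]])

# Agglomerative clustering with unweighted pair group average linkage
# (Equation 3.15) over Euclidean distances (Equation 3.14)
Z = linkage(X, method='average', metric='euclidean')

# Cut the dendrogram at a dissimilarity level of 1.5
labels = fcluster(Z, t=1.5, criterion='distance')
print(labels)               # cluster membership of each entity
print(len(set(labels)))     # number of clusters at the cutting line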

3.4.8 Regression Analysis

The general computational problem that needs to be solved in linear regression

analysis is to fit a straight line to a set of points [43]. When there is more than

one independent variable, the regression procedures will estimate a linear equation

of the form shown in Equation 3.16, where Y is the dependent variable, Xi stands

for a set of independent variables, a is a constant and each bi is the slope of the

regression line. The constant a is also known as the intercept, and the slope as the

regression coefficient.


Figure 3.4: Dendrogram: At the cutting line there are two clusters


Y = a + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p    (3.16)

The regression line expresses the best prediction of the dependent variable Y

given the independent variables Xi. However, usually there is substantial variation

of the observed points around the fitted regression line. The deviation of a particular

point from the line is known as the residual value. The smaller the variability of

the residual values around the regression line relative to the overall variability, the

better the prediction. In most cases the ratio will fall somewhere between 0.0 and

1.0. If there is no relationship between the X and Y variables the ratio will be

1.0, while if X and Y are perfectly related the ratio will be 0.0. The least squares

method is employed to perform the regression.

The R2 or the coefficient of determination is 1.0 minus this ratio. The R2 value

is an indicator of how well the model fits the data. If we have an R2 close to 1.0 this

indicates that we have accounted for almost all of the variability with the variables

specified in the model.

The correlation coefficient R expresses the degree to which two or more indepen-

dent variables are related to the dependent variable, and it is the square root of R2.

R can assume values between -1 and +1. The sign (plus or minus) of the correlation coefficient indicates the direction of the relationship between the variables. If

it is positive, then the relationship of this variable with the dependent variable is

positive. If it is negative then the relationship is negative. If it is zero then there is

no relationship between the variables.
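As an illustration, the sketch below fits Equation 3.16 by least squares for two hypothetical independent variables; the values are invented for the example.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[6.2, 80], [4.8, 65], [7.5, 90], [9.1, 72],
              [8.5, 95], [5.7, 60], [6.0, 88]])   # two independent variables
y = np.array([1.7, 3.0, 2.1, 1.8, 3.2, 2.6, 2.7]) # dependent variable

model = LinearRegression().fit(X, y)      # least squares fit of Equation 3.16
print("intercept a =", model.intercept_)
print("slopes b_i  =", model.coef_)
print("R^2         =", model.score(X, y))
print("residuals   =", y - model.predict(X))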

3.4.9 Analysis of Variance (ANOVA)

ANOVA is used to test the significance of the variation in the dependent variable that

can be attributed to the regression of one or more independent variables. The results

enable us to determine whether or not the explanatory variables bring significant

information to the model. ANOVA gives a statistical test of the null hypothesis H0,

which is, there is no linear relationship between the variables versus the alternative

hypothesis H1, which is, there is a relationship between the variables.


There are four parts to ANOVA results, the sum of squares, degrees of freedom,

mean squares and the F test. Fisher’s F test, as given by Equation 3.17, is used

to test whether the R2 values are statistically significant. Values are deemed to be

significant at p ≤ 0.05.

F = \frac{R^2 \times (N - K - 1)}{(1 - R^2) \times K}    (3.17)

Here, K is the number of independent variables (two in our case) and N is the

number of observed values.
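A minimal sketch of this test, assuming a hypothetical R2 of 0.45 from a regression with K = 2 independent variables and N = 40 observations:

from scipy import stats

def fisher_f_test(r_squared, n, k):
    # Fisher's F test for the significance of R^2 (Equation 3.17)
    f = (r_squared * (n - k - 1)) / ((1 - r_squared) * k)
    p = stats.f.sf(f, dfn=k, dfd=n - k - 1)   # upper-tail p-value
    return f, p

f, p = fisher_f_test(r_squared=0.45, n=40, k=2)
print(f"F = {f:.2f}, p = {p:.4f}, significant = {p <= 0.05}")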

3.5 Conclusion

A detailed account of the tools and techniques needed to conduct the case studies described in the following chapters was given in this chapter. The programs evaluated in this work were discussed and an outline of the statistical techniques used to analyse the results was provided.


Chapter 4

Case Study 1: The Influence of

Instruction Coverage on the

Relationship Between Static and

Run-time Coupling Metrics

When comparing static and run-time measures it is important to have a thorough

understanding of the degree to which the analysed source code corresponds to the

code that is actually executed. In this chapter this relationship is studied using

instruction coverage measures with regard to the influence of coverage on the rela-

tionship between static and run-time metrics. It is proposed that coverage results

have a significant influence on the relationship and thus should always be a mea-

sured, recorded factor in any such comparison.

An empirical investigation is conducted using a set of six run-time metrics on

seventeen Java benchmark and real-world programs. First, the differences in the

underlying dimensions of coupling captured by the static versus the run-time metrics

are assessed using principal component analysis. Subsequently, multiple regression

analysis is used to study the predictive ability of the static CBO and instruction

coverage data to extrapolate the run-time measures.


4.1 Goals and Hypotheses

The Goal Question Metric/MEtric DEfinition Approach (GQM/MEDEA) frame-

work proposed by Briand et al. [18] was used to set up the experiments for this

study.

Experiment 1:

Goal: To investigate the relationship between static and run-time coupling met-

rics.

Perspective: We would expect some degree of correlation between the run-time

measures for coupling and the static CBO metric. We use a number of statistical

techniques, including principal component analysis, to analyse the covariate struc-

ture of the metrics to determine if they are measuring the same class properties.

Environment: We chose to evaluate a number of Java programs from well

defined publicly-available benchmark suites as well as a number of open source real-

world programs.

Hypothesis:

H0 : Run-time measures for coupling are simply surrogate measures for the static

CBO metric.

H1 : Run-time measures for coupling are not simply surrogate measures for the

static CBO metric.

Experiment 2:

Goal: To examine the relationship between static CBO and run-time coupling metrics, particularly in the context of the influence of instruction coverage.

Perspective: Intuitively, one would expect the better the coverage of the test

cases used the greater the correlation between the static and run-time metrics. We

use multiple regression analysis to determine if there is a significant correlation.


Environment: We chose to evaluate a number of Java programs from well

defined publicly-available benchmark suites as well as a number of open source real-

world programs.

Hypothesis:

H0 : The coverage of the test cases used to evaluate a program has no influence

on the relationship between static and run-time coupling metrics.

H1 : The coverage of the test cases used to evaluate a program has an influence

on the relationship between static and run-time coupling metrics.

4.2 Experimental Design

In order to conduct the practical experiments underlying this study, it was necessary

to select a suite of Java programs and measure:

• the static CBO metric

• the instruction coverage percentages: IC

• the run-time coupling metrics: IC CC, EC CC, IC CM, EC CM, IC CD, EC CD

The static metrics data collection tool StatMet, described in Section 3.2.3, was

used to calculate CBO, while the InCov tool, outlined in Section 3.2.4, was used to

determine the instruction coverage. The run-time metrics were evaluated using the

ClMet tool, which is described in Section 3.2.1.

The set of programs used in this study consist of the benchmark programs JOlden

and SPECjvm98, as well as the real-world programs Velocity, Xalan and Ant. The

SPECjvm98 suite was chosen as it is directly comparable to other studies that use

Java software. The program mtrt was excluded from the investigation as it is multi-

threaded and therefore is not suitable for this type of analysis. The more synthetic

JOlden programs were included to ensure that the study considers programs that create significantly large populations of objects. Three of the programs from the JOlden suite, BiSort, TreeAdd and TSP, were omitted from the analysis as they contained


only two classes, and therefore their results could not be further analysed. A selection of real-world programs was included to ensure that the results scale to all types of programs.

4.3 Results

4.3.1 Experiment 1: To investigate the relationship between

static and run-time coupling metrics

For each program the distribution (mean) and variance (standard deviation)

of each measure across the classes is calculated. These statistics are used to select

metrics that exhibit enough variance to merit further analysis, as a low variance

metric would not differentiate classes very well and therefore would not be a useful

predictor of external quality. Descriptive statistics also aid in explaining the results

of the subsequent analysis.

The descriptive statistic results for each program are summarised in Table 4.1.

The metric values exhibit large variances which makes them suitable candidates for

further analysis.

Principal Component Analysis

Principal Component Analysis (PCA) is used to investigate whether the run-

time coupling metrics are not simply surrogate measures for static CBO.

A similar study was carried out by Arisholm et al. using only the Velocity

program [5]. The work in this chapter extends their work to include fourteen bench-

mark programs as well as three real-world programs in order to demonstrate the

robustness of these results over a larger range and variety of programs.

Appendix A.1 shows the results of the principal component analysis used to in-

vestigate the covariate structure of the static and run-time metrics. Using the Kaiser

criterion to select the number of factors to retain shows that the metrics mostly cap-

ture three orthogonal dimensions in the sample space formed by all measures. In

other words, the coupling is divided along three dimensions for each of the programs analysed.


SPECjvm98 Benchmark Suite

Program         CBO          IC CC        IC CM         IC CD          EC CC        EC CM        EC CD
201 compress    6.24 (6.20)  1.72 (2.11)  4.34 (3.54)   7.56 (5.46)    1.80 (1.16)  4.35 (4.76)  6.56 (4.56)
202 jess        6.99 (4.78)  2.97 (7.21)  4.34 (3.43)   5.45 (4.54)    2.97 (9.01)  4.34 (4.35)  7.56 (6.56)
205 raytrace    7.25 (7.51)  2.14 (4.25)  4.45 (3.54)   7.56 (6.56)    2.06 (1.89)  4.54 (4.53)  6.56 (4.56)
209 db          9.12 (6.60)  1.81 (1.98)  6.56 (4.46)   9.67 (8.68)    1.88 (1.54)  6.45 (5.67)  9.57 (7.65)
213 javac       8.54 (7.15)  3.21 (3.01)  5.45 (4.56)   7.56 (7.56)    3.01 (2.87)  3.45 (4.56)  5.45 (5.65)
222 mpegaudio   5.75 (4.90)  2.60 (2.36)  4.54 (3.56)   7.56 (6.56)    2.60 (2.70)  5.45 (4.56)  5.87 (5.46)
228 jack        6.05 (7.51)  2.68 (5.37)  3.45 (3.43)   5.45 (4.45)    2.68 (2.39)  5.45 (4.56)  7.56 (6.56)

JOlden Benchmark Suite

Program         CBO          IC CC        IC CM         IC CD          EC CC        EC CM        EC CD
BH              5.22 (3.40)  2.62 (2.50)  7.44 (8.86)   8.67 (10.84)   2.33 (1.33)  5.77 (4.44)  6.25 (4.74)
Em3d            4.20 (2.86)  3.22 (0.71)  3.87 (1.01)   4.76 (3.96)    3.75 (1.33)  3.35 (3.49)  4.65 (3.46)
Health          3.43 (3.46)  2.43 (2.46)  3.35 (4.24)   4.25 (5.46)    3.35 (3.46)  3.55 (2.43)  4.46 (4.43)
MST             4.34 (3.45)  3.54 (2.45)  4.23 (3.45)   7.54 (4.54)    3.45 (3.34)  3.45 (2.45)  4.56 (4.32)
Perimeter       5.34 (4.34)  3.34 (3.45)  4.34 (2.45)   8.56 (6.45)    3.54 (3.45)  4.54 (3.43)  6.54 (3.54)
Power           4.50 (2.54)  1.32 (0.45)  5.23 (2.23)   5.64 (2.56)    1.54 (1.45)  4.12 (4.56)  4.67 (5.35)
Voronoi         5.43 (3.46)  2.43 (1.45)  4.54 (0.45)   7.45 (3.46)    3.45 (3.46)  4.45 (2.45)  5.36 (2.46)

Real-World Programs

Program         CBO          IC CC        IC CM         IC CD          EC CC        EC CM        EC CD
Velocity        7.59 (7.57)  4.27 (7.11)  8.45 (10.87)  20.45 (32.14)  3.85 (4.30)  7.54 (9.45)  25.45 (28.45)
Xalan           8.98 (9.92)  4.03 (4.61)  8.54 (8.99)   35.45 (38.14)  2.85 (3.60)  6.54 (7.56)  42.15 (45.12)
Ant             8.49 (7.74)  3.92 (7.91)  7.46 (8.78)   16.75 (17.25)  2.43 (3.51)  7.04 (7.54)  21.23 (20.56)

Table 4.1: Descriptive statistic results for all programs (each cell shows the mean, with the standard deviation in parentheses)



Analysing the definitions of the measures that exhibit high loadings in PC1, PC2

and PC3 yields the following interpretation of the coupling dimensions:

• PC1 = {IC CC, IC CD, IC CM}, the run-time import coupling metrics as

illustrated by Figure 4.1(a).

• PC2 = {EC CC,EC CD,EC CM}, the run-time export coupling metrics

as illustrated by Figure 4.1(b).

• PC3 = {CBO}, the static coupling metric as illustrated by Figure 4.1(c).

Figure 4.1 summarises these results graphically. Overall the PCA results demon-

strate that the run-time coupling metrics are not redundant with the static CBO

metric and that they capture additional dimensions of coupling. This leads us to

reject our null hypothesis H0, to say that run-time measures for coupling are not

simply surrogate measures for the static CBO metric, suggesting that additional

information over and above the information obtainable from the static CBO metrics

can be extracted using run-time metrics. This confirms the findings of Arisholm et

al. for the single Velocity program are applicable across a variety of programs.

The results also indicate that the direction of coupling is a greater determin-

ing factor than the type of coupling, with PC1 containing the three import-based

metrics and PC2 containing the three export-based metrics.

4.3.2 Experiment 2: The influence of instruction coverage

Multiple Regression Analysis

Multiple regression analysis is used to test the hypothesis that instruction coverage

of test cases used to evaluate a program has no influence on the relationship between

static and run-time metrics. The two independent variables are thus the static CBO

metric and the instruction coverage measure Ic; each of the six run-time coupling

metrics in turn is then used as the dependent variable. A full list of these results

can be found in Appendix A.2.
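To make the procedure concrete, the sketch below shows how the additional variation attributable to Ic can be estimated by comparing the R2 of a model using CBO alone against a model using CBO and Ic together. The per-class values are invented for the example and are not drawn from the study data.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-class data: static CBO, instruction coverage Ic (%),
# and a run-time coupling metric (e.g. IC CC) as the dependent variable
cbo = np.array([[6.0], [4.0], [9.0], [7.0], [5.0], [8.0], [3.0], [6.5]])
ic  = np.array([80.0, 55.0, 92.0, 70.0, 60.0, 88.0, 40.0, 75.0])
y   = np.array([2.1, 1.2, 3.4, 2.5, 1.6, 3.0, 0.9, 2.3])

r2_cbo = LinearRegression().fit(cbo, y).score(cbo, y)
both = np.column_stack([cbo, ic])
r2_both = LinearRegression().fit(both, y).score(both, y)

print(f"R^2 (CBO alone)  = {r2_cbo:.3f}")
print(f"R^2 (CBO and Ic) = {r2_both:.3f}")
print(f"extra variation attributable to Ic = {r2_both - r2_cbo:.3f}")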


(a) Results from PCA for IC CC, IC CM and IC CD

(b) Results from PCA for EC CC, EC CM and EC CD

(c) Results from PCA for CBO

Figure 4.1: PCA test results for all programs for metrics in PC1, PC2 and PC3. In all graphs the bars represent the PCA value obtained for the corresponding metric. PC1 contains the import-level run-time metrics, PC2 contains the export-level run-time metrics and PC3 contains the static CBO metric.


First, all R values turned out to be positive for each of the programs used in

this study. This means that there is a positive correlation between the dependent

(run-time metric) and independent variables CBO and Ic. Therefore as the values

for CBO and Ic increase or decrease so will the observed value for the run-time

metric under consideration.

Figures 4.2(a) and 4.2(b) give a pictorial view of the results from the multiple

regression analysis for all programs for class-level run-time coupling, and Figures

4.3(a) and 4.3(b) for method-level run-time coupling. The lighter bars represent the

influence of CBO, while the darker bars represent the influence of both CBO

and Ic. Therefore the difference between these gives the additional amount of the

variation of the run-time metric that can be allocated to the influence of instruction

coverage.

Distinct Classes: IC CC and EC CC

It is immediately apparent from Figures 4.2(a) and 4.2(b) that the instruction cov-

erage is a significant influencing factor. For example, from Figure 4.2(a) it can be

seen that in ten of the programs, Ic accounts for an additional 20% variation.

Two of the programs in Figure 4.2(a) that show little increase, MST and Voronoi, already exhibit a high correlation with CBO alone that would have been difficult to improve on. While the increase is not uniform throughout the programs in Fig-

ure 4.2(a), the overall data demonstrates that instruction coverage is an important

contributory factor.

Figure 4.2(b), representing the contribution of CBO and Ic to export coupling

measured at the class level, presents a sharper contrast. Here, the influence of Ic

is clearly a vital contributing factor, accounting for at least an extra 20% of the

variation in eleven of the seventeen programs. The important factor here is that

the overall contribution of CBO to export coupling is much lower than to import

coupling, as can be seen from contrasting the lighter-shaded bars in Figure 4.2(a)

with those in Figure 4.2(b). Thus classes with a high level of static coupling exhibit

a higher level of import coupling at run-time. This indicates that the coupling being

exercised at run-time is from classes behaving as clients, making use of other class


(a) Results from the multiple linear regression where Y = IC CC.

(b) Results from the multiple linear regression where Y = EC CC.

Figure 4.2: Multiple linear regression results for class-level metrics (IC CC and EC CC). In both graphs the lighter bars represent the R2 value for CBO, and the darker bars represent the R2 value for CBO and Ic combined.


(a) Results from the multiple linear regression where Y = IC CM

(b) Results from the multiple linear regression where Y = EC CM

Figure 4.3: Multiple linear regression results for method-level metrics (IC CM and EC CM). In both graphs the lighter bars represent the R2 value for CBO, and the darker bars represent the R2 value for CBO and Ic combined.


methods, rather than those behaving as servers, offering their methods for use by

others. The greater influence of Ic in export coupling results from there being less

of a drop in its influence between IC CC and EC CC, suggesting that instruction

coverage, as a predictor of coupling, is not as sensitive to the direction of that

coupling.

Distinct Methods: IC CM and EC CM

The results for the IC CM and EC CM , illustrated by Figures 4.3(a) and 4.3(b),

present a similar picture. Both of these run-time metrics are scaled by the number

of methods involved in the coupling relationship. Given that CBO is defined on a

class level, it does surprisingly well in influencing the IC CM metric. Instruction

coverage is also defined at a class level, but nonetheless accounts for roughly an

extra 20% of the variance for five programs, and roughly an extra 10% for five other

programs. The drop between import and export coupling is accentuated here, but

while Figure 4.3(b) shows CBO proving a bad predictor for EC CM , instruction

coverage dramatically improves this for over half the programs studied.

Overall, these results show that coverage has a significant impact on the correla-

tion between static CBO and the four run-time coupling metrics defined for distinct

classes and distinct methods.

Run-time Messages: IC CD and EC CD

The run-time metrics IC CD and EC CD did not exhibit a significant relationship

for any of the programs under consideration and thus are not depicted graphically

here. As these metrics are defined in terms of a count of the number of distinct times

a method was executed, this result was not surprising. It is reasonable to postulate

that such metrics might be more influenced by the “hotness” of a particular method,

and the distribution of execution focus through the program, rather than instruction

coverage data. This was the result we expected for the measures based on the number

of dynamic method calls.


4.4 Conclusion

From our experimental data, using principal component analysis, we showed that

run-time coupling metrics captured different properties than static CBO and there-

fore are not simply surrogate measures for CBO. This indicated that useful infor-

mation beyond that which is provided by CBO may be obtained through the use of

these run-time measures.

Second, we found that the coverage of test cases used to evaluate a program had

a significant impact on the correlation between CBO and run-time coupling metrics

and thus should be a measured, recorded factor in any comparison made. We found

that instruction coverage and CBO together were a better predictor of the run-time metrics based on distinct class (IC CC, EC CC) and distinct method counts (IC CM, EC CM) than CBO alone. Appendix A.2 also shows the results of Fisher's F test, which illustrate that all results were statistically significant at the 5% level of significance.


Chapter 5

Case Study 2: The Impact of

Run-time Cohesion on Object

Behaviour

In this study we present an investigation into the run-time behaviour of objects

in Java programs and whether cohesion metrics are a good predictor of object be-

haviour. Based on the definition of static CBO it would be expected that objects

derived from the same class would exhibit similar coupling behaviour, that is, that

they would be coupled to the same classes and make the same accesses. It is un-

known whether static CBO provides a true measure of coupling between objects, or

whether it is restricted to being a measure of the level of coupling between classes.

To this end, a measure, the Number of Object-Class Clusters (NOC), is proposed

in an attempt to analyse run-time object behaviour. This measure is derived from

a statistical analysis of run-time object-level coupling metrics. Cluster analysis is

used to group objects together based on the similarity of the accesses they make to

other classes. Therefore one would expect objects from the same class to occupy the

same cluster. If more than one cluster is found for a class then it is reasonable to

postulate that the class has objects that are behaving differently at run-time from

the point of view of coupling. A selection of programs are anaylsed to determine if

this is the case.

The second part of this study involves determining the predictive ability of cohe-


sion metrics (both static and run-time) to forecast object behaviour, in other words,

how well they indicate the NOC for a class. First, the differences in the under-

lying dimensions of cohesion captured by the static versus the run-time measures

are assessed using principal component analysis. Subsequently, multiple regression

analysis is used to study the predictive ability of cohesion metrics to extrapolate

NOC for a class. We also wish to determine if a run-time definition of cohesion is a

better predictor of NOC than the static SLCOM version alone.

5.1 Goals and Hypotheses

The GQM/MEDEA framework was used to set up the experiments for this study.

Experiment 1:

Goal: To determine if objects from the same class behave differently at run-time

from the point of view of coupling.

Perspective: We investigate the behaviour of objects at run-time with respect

to coupling using a number of metrics which measure the level of coupling at dif-

ferent layers of granularity. We use a number of statistical techniques capable of

separating objects from a class into groups based on their similarity.

Environment: Since we are studying object behaviour, a set of Java programs

which create a large number of objects at run-time are used. These are supple-

mented with a number of real-world programs to ensure the results are scalable to

genuine programs.

Hypothesis:

H0 : Objects from a class behave similarly at run-time from the point of view of

coupling

H1 : Objects from a class behave differently at run-time from the point of view

of coupling


Experiment 2:

Goal: To determine if a run-time definition for cohesion gives any additional

information about class behaviour over and above the standard static definition.

Perspective: Within a highly cohesive class the components of the class are functionally related; within a class that exhibits low cohesion, they are not. Intu-

itively, one would expect the more cohesive the class the lower the NOC for a class.

We use a number of statistical techniques, including PCA and regression analysis to

determine if there is a significant correlation between static and run-time cohesion

and NOC . We also wish to determine if run-time cohesion is a better predictor of

NOC than the static version alone.

Environment: Since we are studying object behaviour a set of Java programs

which create a large number of objects at run-time are used. These are supple-

mented with a number of real-world programs to ensure the results are scalable to

genuine programs.

Hypothesis:

H0 : Run-time cohesion metrics do not provide additional information about

class behaviour over and above that provided by static SLCOM

H1 : Run-time cohesion metrics provide additional information about class be-

haviour over and above that provided by static SLCOM

5.2 Experimental Design

For this study it was necessary to calculate:

• the run-time object-level coupling metric: IC OC

• the Number of Object-Class Clusters: NOC

• the static SLCOM


• the run-time cohesion metrics: RLCOM, RWLCOM

              GreyNode   QuadTreeNode   WhiteNode
BlackNode1        0            2            0
BlackNode2        0            2            0
BlackNode3        0            2            0
BlackNode4        0            2            0

Table 5.1: Matrix of unique accesses per object, for objects BlackNode1, . . . , BlackNode4 to classes GreyNode, QuadTreeNode and WhiteNode

IC OC was calculated using the object-level run-time metric analysis tool Ob-

Met which is described in Section 3.2.2. In order to test the first hypothesis the

coefficient of variation, CV , was calculated for the IC OC results to determine how

the IC OC values varied across the objects of a class. If the CV for all classes under

consideration is zero then this would lead us to accept the null hypothesis, H0, as all

objects of this classes would be accessing the same variables. However, if there was

variation in the IC OC values, CV > 0, this would lead us to reject H0 and accept

H1, as the objects would be behaving differently at run-time from the point of view

of coupling.

To determine the NOC for a class, one class is fixed and the distribution of

unique accesses per object is determined. A matrix of such values for each class

in the program under consideration is constructed. Table 5.1 gives an example of

such a matrix, where we record the run-time coupling values for individual objects

of class BlackNode, BlackNode1, . . . , BlackNode4, against the classes GreyNode,

QuadTreeNode and WhiteNode. This data is statistically analysed using cluster

analysis to evaluate the behaviour of the objects. This technique groups objects

together based on their similarity. The number of clusters are determined and this

becomes the NOC for that class. In order to accept H0 we would expect objects from

the same class to group together and occupy the same cluster, therefore expecting

values of NOC to be 1. The formation of a number of different clusters, where NOC

> 1, would lead us to reject H0 and accept H1.
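The following sketch shows how NOC could be derived from such a matrix, clustering a class's objects by the similarity of their accesses and counting the resulting clusters. The cutting threshold is an illustrative placeholder; the actual analysis selects the cutting line from the dendrogram as described in Section 3.4.7.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def noc(access_matrix, cut=1.0):
    # Cluster the objects of one class by the accesses they make to other
    # classes, then count the clusters at the cutting line: this is NOC.
    if len(access_matrix) < 2:
        return 1
    Z = linkage(access_matrix, method='average', metric='euclidean')
    return len(set(fcluster(Z, t=cut, criterion='distance')))

# The matrix of Table 5.1: rows are objects BlackNode1..BlackNode4,
# columns are unique accesses to GreyNode, QuadTreeNode and WhiteNode
black_node = np.array([[0, 2, 0], [0, 2, 0], [0, 2, 0], [0, 2, 0]])
print(noc(black_node))   # identical rows form a single cluster: NOC = 1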


JOlden Benchmark Suite

Program      IC OC         NOC          SLCOM          RLCOM          RWLCOM
BH           1.83 (2.74)   2 (0.52)     0.317 (0.30)   0.144 (0.287)  0.248 (0.226)
Em3d         1 (0.5)       6 (-)        0.317 (0.223)  0.190 (0.381)  0.472 (0.572)
Health       2.5 (1.84)    2.5 (1.29)   0.318 (0.223)  0.171 (0.189)  0.335 (0.356)
MST          2 (1.54)      2.5 (2.12)   0.163 (0.283)  0.111 (0.172)  0.252 (0.154)
Perimeter    2.25 (2.6)    2.5 (1.73)   0.136 (0.275)  0.104 (0.285)  0.132 (0.254)
Power        1.66 (1.88)   2 (1.73)     0.151 (0.199)  0.083 (0.204)  0.155 (0.134)
Voronoi      2 (2.12)      4.5 (0.71)   0.373 (0.238)  0.265 (0.363)  0.448 (0.438)

Real-World Programs

Program      IC OC         NOC          SLCOM          RLCOM          RWLCOM
Velocity     6.14 (7.21)   5.1 (2.45)   0.314 (0.385)  0.154 (0.254)  0.398 (0.454)
Xalan        7.45 (8.21)   6.7 (3.45)   0.251 (0.305)  0.198 (0.241)  0.354 (0.484)
Ant          8.11 (8.65)   7.2 (2.56)   0.333 (0.31)   0.247 (0.208)  0.387 (0.355)

Table 5.2: Descriptive statistic results for all programs (each cell shows the mean, with the standard deviation in parentheses)

The static metrics data collection tool StatMet, described in Section 3.2.3, was

used to calculate SLCOM. The ClMet tool, described in Section 3.2.1, was used to

calculate the run-time cohesion metrics.

The analysis was conducted on the programs from the JOlden benchmark suite

as well as the real-world programs Velocity, Xalan and Ant. Three of the programs, BiSort, TSP and TreeAdd, contain too few classes to perform PCA and regression analysis, and are therefore excluded from further analysis. The SPECjvm98 benchmark

programs that were used in the previous study were excluded from this analysis as

they did not exhibit significant volumes of object creation.

5.3 Results

Table 5.2 summarises the descriptive statistic results for each program. The mea-

sures all exhibited large variances which makes them suitable candidates for further

analysis.


5.3.1 Experiment 1: To determine if objects from the same

class behave differently at run-time from the point of

view of coupling

IC OC Results

The IC OC metric is used to investigate whether objects of the same class type are

coupled to the same classes at run-time. The first thing to look at is the CV results

for the IC OC metric, as depicted by Figure 5.1. If all objects from the same class

are behaving in a similar fashion we would expect them to make accesses to the

same classes at run-time. Consequently, there should be little or no variability in

the IC OC values for objects from the same class, for example, two classes from

BH had CV of 0. However, for the classes from the set of programs studied, the

CV varied from 0% to 54.2%. In the cases where the CV > 0, we have classes with

objects that are coupled to different classes at run-time. A class might create one

group of objects that access one set of classes and another that access a different set.

So we have a number of objects from the same class that are behaving differently at run-time at the class-class level. On the basis of these results we can reject H0 and accept H1. One cannot observe such behaviour simply by calculating

the static CBO value for that class.

NOC Results

Figure 5.2 illustrates the NOC values for the programs under consideration. The NOC

values range from one to seven and the bars represent the number of classes from

each program that exhibit that value. Since cluster analysis groups objects together

based on the similarity of the accesses they make to other classes one would expect

objects from the same class to occupy the same cluster (NOC = 1). This was the

case for a large proportion of the classes under consideration, for example 50% of the

classes from the program BH from the JOlden suite exhibited an NOC of 1. Similar

results were obtained with the real-world programs with NOC = 1 for 51% of classes

from Velocity, 49% from Xalan and 48% from Ant. However, there were instances

where more than one cluster was found for a class, for example 50% of the classes



Figure 5.1: CV of IC OC for classes from the programs studied. The bars represent

the number of classes in each program that have CV in the corresponding range.


Figure 5.2: NOC results of cluster analysis. The bars represent the number of classes

in each program that have the corresponding NOC value.

from Perimeter from the JOlden suite had NOC = 4. When more than one cluster

is found we have the situation where a single class is creating groups of objects

that are exhibiting different behaviours at run-time. This leads us to reject H0 and

accept H1 to state that objects from a class can behave differently at run-time from

the point of view of coupling.

Looking at Figures 5.1 and 5.2, there seems to be a relationship between the CV and the number of clusters, with both graphs being markedly similar. In many cases a high CV leads to more than one cluster. Intuitively this makes sense, as it is easy to see how variation in the number of classes used by an object would lead to variation in the variables they use, and consequently to a number of groups of objects behaving differently.

From these findings, it is suggested that the static CBO metric would be better

defined as coupling between classes as it does not necessarily give a true measure of

run-time coupling between objects.


5.3.2 Experiment 2: The influence of cohesion on the NOC

The following statistical analysis is applied to determine first, if run-time cohesion

metrics are redundant with respect to SLCOM and second, if cohesion metrics are

good predictors of NOC .

Principal Component Analysis

Initially, we investigate the relationship between static and run-time cohesion met-

rics. We use PCA to determine if the static and run-time cohesion metrics are

likely to be measuring the same class property, in other words it is used to examine

whether the run-time cohesion metrics are not simply surrogate measures for static

SLCOM .

Appendix B.1 shows the results of the principal component analysis when all

of the cohesion metrics are taken into consideration. Using the Kaiser criterion to

select the number of factors to retain it is found that the metrics mostly capture

two orthogonal dimensions in the sample space formed by all measures. In other

words, cohesion is divided along two dimensions for each of the programs analysed.

Analysing the definitions of the measures that exhibit high loadings in PC1 and

PC2 yields the following interpretation of the cohesion dimensions:

• PC1 = {SLCOM}, the static cohesion metric.

• PC2 = {RLCOM , RWLCOM}, the run-time cohesion metrics.

Figure 5.3 summarises these results graphically. The PCA findings from this

study indicate that no significant information about the cohesiveness of a class

can be gained by evaluating the RWLCOM instead of the simpler RLCOM , as both

metrics belonged to the same principal component. This means that little variance is captured by the RWLCOM that is not already accounted for by RLCOM.

However, the PCA results indicate that RLCOM is not redundant with respect to

SLCOM and that it captures additional information about cohesion. The values show

that RLCOM is not simply an alternative static measure. Clearly, the simple static

calculation of SLCOM masks a considerable amount of detail available at run-time.



Figure 5.3: PCA Test Results for all programs for metrics in PC1 and PC2. In both graphs the bars represent the PCA value obtained for the corresponding metric. PC1 contains RLCOM and RWLCOM. PC2 contains SLCOM.


Multiple Regression Analysis

Next we wish to discover if cohesion metrics are good predictors of object behaviour,

that is, can they be used to deduce the NOC for a class. Multiple regression analysis

is used for this purpose. In this case the dependent variable is the NOC , while the

independent variables are the static SLCOM and the run-time RLCOM and RWLCOM

cohesion metrics. Appendix B.2 gives the results from this analysis.

First, the results show that there is a positive correlation between the NOC

(dependent variable) and the static and run-time cohesion measures (independent

variables), as all R values were positive. This means that as the value for SLCOM ,

RLCOM and RWLCOM increases/decreases so will the observed value for NOC . Intu-

itively this would make sense, as one would expect the more cohesive the class, that

is the lower the LCOM value is, the more the class it is geared toward performing a

single function. Therefore one would expect the number of clusters to be low also.

Figure 5.4 summarises the results of the regression analysis for each of the pro-

grams analysed. The lighter bars represent the influence of SLCOM , while the darker

bars depict the influence of both SLCOM and RLCOM . The difference between the two

indicates the additional amount of variation that can be allocated to the run-time

cohesion metric.

It is apparent from this graph that the RLCOM is a significant factor influencing

NOC , for example for the three real-world programs RLCOM accounts for approxi-

mately an additional 30% variation, while five of the benchmarks exhibit a similar

result. For eight out of the ten programs studied RLCOM was a better predictor of

NOC than SLCOM .

Overall, these results show that cohesion metrics are a good predictor of NOC ,

with run-time cohesion being the superior metric. This leads us to reject our null

hypothesis and state that run-time cohesion metrics provide additional information

about class behaviour over and above that provided by static SLCOM .

Only one program exhibited a significant result when using the RWLCOM mea-

sure, therefore the results have not been summarised graphically. This could be due

to the fact that the metric is defined on a call-weighted basis, which may skew the

results.


Figure 5.4: Results from multiple linear regression where Y=NOC . The lighter bars

represent the R2 for SLCOM , and the darker bars represent the R2 value for SLCOM

and RLCOM combined.


5.4 Conclusion

From this case study, we found that run-time object-level coupling metrics could be

used to investigate object behaviour. Using the IC OC run-time coupling measure

we discovered that objects from the same class exhibited different behaviours at

run-time from the point of view of coupling. Object behaviour was identified by

defining a new metric NOC which groups objects together based on their run-time

coupling properties.

We defined a number of metrics for evaluating run-time cohesion. First, we

proved that these measures were not redundant with respect to the static LCOM

measure and that they captured additional dimensions of cohesion. Next, we inves-

tigated the impact of run-time cohesion metrics on object behaviour using regression

analysis and proved that these run-time cohesion metrics were good predictors of

object behaviour, as identified by the NOC measure. Appendix B.2 gives the results

from this analysis and shows the Fisher's F test results, which state that all results were statistically significant at the 5% level of significance.


Chapter 6

Case Study 3: A Study of

Run-time Coupling Metrics and

Fault Detection

Fault-proneness detection is an interesting concept in many areas of software engi-

neering research. Quality and maintenance effort control depend on the understand-

ing of this concept. In previous years, a large volume of work has been performed

in order to define suitable metrics and models for fault detection [6, 13, 19, 41].

Code coverage has been proposed as an estimator of fault-proneness, but it remains a controversial topic which lacks support from empirical data [22]. In this

case study we investigate whether instruction coverage is a significant predictor of

fault-proneness, an important software quality indicator. This is done by taking

a set of real-world programs, namely Velocity, Xalan and Ant, and introducing

faults into them using the mutation system µJava. Two kinds of mutations are

introduced separately into the programs, traditional and class-type mutations. We

then determine the percentage of mutants killed (MK) by the set of test cases provided

with the programs. Equation 6.1 gives the formula for MK . Regression analysis is

applied to determine if instruction coverage is a good predictor of fault-proneness,

which is defined as the MK for the class for each type of mutation. From previous

work we expect instruction coverage to be a good predictor of non object-oriented

or traditional-type mutants [69].


MK = (Number of mutants killed / Total number of mutants created) × 100        (6.1)
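As a concrete illustration, the following minimal Java sketch computes MK as in
Equation 6.1; the class and method names, and the counts used in main, are our own
for the example and are not part of µJava or the thesis tooling.

/** Minimal sketch of Equation 6.1: the percentage of mutants killed.
 *  Illustrative only; the counts below are not taken from the µJava runs. */
public class MutationScore {

    static double mutantsKilled(int killed, int created) {
        if (created == 0) {
            throw new IllegalArgumentException("no mutants were created");
        }
        return 100.0 * killed / created;
    }

    public static void main(String[] args) {
        // e.g. 42 of 120 mutants killed by the supplied JUnit tests
        System.out.printf("MK = %.1f%%%n", mutantsKilled(42, 120)); // MK = 35.0%
    }
}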

Next, we empirically validate a set of six run-time object-oriented metrics in

terms of their usefulness in predicting fault-proneness. We use regression analysis

again to investigate the ability of these run-time measures in predicting MK for both

types of mutations. From these two experiments we wish to discover if the run-time

measures for coupling are better predictors of fault-proneness than the traditional

coverage measure.

6.1 Goals and Hypotheses

The GQM/MEDEA framework was used to set up the experiments for this study.

Experiment 1:

Goal: To examine the relationship between coverage and fault detection, in the

context of instruction coverage.

Perspective: Code coverage has been proposed as an estimator of testing
effectiveness. Regression analysis is used to assess whether coverage is a better indicator

of fault-proneness in comparison to the run-time coupling metrics. In particular we

investigate whether it is a better detector of traditional or class-type mutations in

programs.

Environment: We chose to evaluate a selection of open source real-world pro-

grams. Each program comes with its own set of JUnit test cases, thus defining both

the static and dynamic context of our work.

Hypothesis:

H0 : Coverage measures are poor detectors of faults in a program.

H1 : Coverage measures are good detectors of faults in a program.


Experiment 2:

Goal: To examine the relationship between run-time coupling metrics and fault

detection.

Perspective: Previous work has shown the static coupling measure CBO is a

good detector of faults in programs [13]. Intuitively, one would expect run-time cou-

pling measures to give a better indication as they are based on an actual execution

of the program. Regression analysis is used to determine if there is a significant

correlation.

Environment: We chose to evaluate a selection of open source real-world pro-

grams. Each program comes with its own set of JUnit test cases, thus defining both

the static and dynamic context of our work.

Hypothesis:

H0 : Run-time coupling metrics are poor detectors of faults in a program.

H1 : Run-time coupling metrics are good detectors of faults in a program.

6.2 Experimental Design

In order to conduct the practical experiments underlying this study, it was necessary

to select a suite of Java programs and measure:

• the instruction coverage percentages: IC

• the mutation coverage of test cases (mutants killed (MK))

• the run-time coupling metrics: IC CC, EC CC, IC CM, EC CM, IC CD, EC CD

The InCov tool, described in Section 3.2.4, was used to determine IC. The
run-time measures were evaluated using the ClMet tool. The mutation system
µJava, described in Section 3.2.5, was used to insert both traditional and class-


level mutants into the test case programs and to determine the MK rates of the test

cases supplied with the programs.

Three open source real-world programs Velocity, Xalan and Ant were evaluated in

this study. The SPECjvm98 and JOlden benchmark programs used in the previous

studies exhibited very poor mutant kill percentages when analysed (most classes
exhibited a 0% mutant kill rate) and were therefore excluded from further analysis.

6.3 Results

Percentage Mutant Kill Rate Results

Figure 6.1 gives the percentages of mutants killed upon the execution of the JUnit

test cases supplied with the programs analysed. Looking at Figure 6.1(a) for the

Velocity program, twenty-three classes exhibit a percentage kill rate of zero for the

class-level mutants, while thirteen classes exhibit the same rate for the traditional

mutants. At the other end of the spectrum for the class-level mutants, six classes

exhibited a percentage kill rate of between 90% and 100%, while seven classes ex-

hibited the same kill rate for the traditional mutants.

In their paper [66], Offutt et al. created test cases by hand for the set of programs
they studied, so that 100% MK was achieved. To date, no one has endeavoured to
apply this mutation system to a set of real programs, so there is no consensus on
what a desirable MK rate would be.

6.3.1 Experiment 1: To examine the relationship between

instruction coverage and fault detection.

Regression Analysis

We investigate the statistical relationship between instruction code coverage and

fault-proneness using regression analysis. The dependent variable is the percentage

mutant kill rate MK , while the independent variable is the instruction coverage mea-

sure Ic for each class. Both class and traditional mutants are evaluated separately.

Appendix C.2 gives the results from this analysis.
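For illustration, the following self-contained Java sketch computes the R2 statistic for
a simple linear regression of MK on Ic; the data points are invented for the example
and do not come from the thesis data.

/** Illustrative sketch of the simple regression in Experiment 1: R^2 of the
 *  mutant kill rate MK (dependent) against instruction coverage Ic
 *  (independent). The data points are invented for the example. */
public class RSquaredSketch {

    // For simple linear regression, R^2 is the squared Pearson correlation.
    static double rSquared(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double num = n * sxy - sx * sy;
        double den = Math.sqrt(n * sxx - sx * sx) * Math.sqrt(n * syy - sy * sy);
        double r = num / den; // Pearson correlation coefficient
        return r * r;         // proportion of variation in y explained by x
    }

    public static void main(String[] args) {
        double[] ic = {20, 35, 50, 65, 80, 95}; // instruction coverage (%)
        double[] mk = {12, 20, 38, 45, 62, 74}; // mutant kill rate (%)
        System.out.printf("R^2 = %.3f%n", rSquared(ic, mk));
    }
}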


(a) Results from Mutation Testing for Velocity.

(b) Results from Mutation Testing for Xalan.

(c) Results from Mutation Testing for Ant.

Figure 6.1: Mutation test results for real-world programs Velocity, Xalan and Ant.

In all graphs the bars represent the number of classes that exhibit a percentage
mutant kill rate in the corresponding range.


Figure 6.2: Regression analysis results for the effectiveness of Ic in predicting class-
and traditional-level mutations in the real-world programs Velocity, Xalan and Ant.
The bars represent the R2 value for each type of mutation.

Figure 6.2 depicts the effect of instruction coverage on fault-proneness for both
types of mutation tested. For all of the programs tested, Ic proved to be a poor
predictor of class-type mutants, with the highest value being 16.7% for Xalan. In
contrast, Ic proved to be an effective indicator of traditional mutations, with values
ranging from 64.5% to 78.9%. This was as expected: coverage is not particularly
effective in evaluating object-oriented programs, so we would not expect it to be a
good predictor of object-oriented faults.

6.3.2 Experiment 2: To examine the relationship between

run-time coupling metrics and fault detection.

Regression Analysis

Regression analysis is used to determine the effectiveness of run-time coupling met-

rics in detecting faults in programs. The dependent variable is the percentage mu-


tant kill rate of the test cases used to execute the programs, while the independent

variables are the six run-time coupling metrics. Both class and traditional mutants

are evaluated separately. Appendix C.1 gives the results from this analysis.

The traditional mutants did not show any relation with the run-time coupling

measures, with only the IC CC metric for the Velocity program exhibiting a signif-

icant correlation. This is in contrast to the results from the previous experiment

where Ic proved to be a poor predictor of class-type mutants but a good predictor

of traditional-type mutants.

Figure 6.3 illustrates the results for the effectiveness of the run-time coupling

metrics IC CC, IC CM, EC CC and EC CM in predicting the MK for class-level

mutations for each of the programs analysed. For two of the programs, Velocity and
Xalan, the IC CC measure was the strongest predictor of MK, at 69% and 59%
respectively. For the Ant program, the EC CC metric had the highest value at 69%,
though the IC CC value was also high at 60%. For all of the programs the

EC CM measure was the poorest predictor. There were five categories of mutations

introduced into the programs by µJava, as illustrated by Table D.2. We would

expect the coupling measures to be a good predictor of those mutations based on

inheritance, polymorphism and overloading. However, we would not expect such

a relationship for those based on Java-specific features and common programming

mistakes. The inclusion of these types of mutations may have negatively skewed the
results.

None of the run-time metrics based on distinct message counts, IC CD and EC CD,

exhibited a significant result and therefore have not been summarised graphically.

As with the case in Section 4.3, this was expected and emphasises the significance

of the predictive capabilities of the other metrics.

Overall, one would expect this kind of result as the class-type mutants are object-

oriented, while the traditional mutations are based on factors like operator replace-

ment and therefore would not be expected to correlate strongly with coupling. This

leads us to reject our null hypothesis for both experiments and state that run-time

coupling metrics are good detectors of class-level faults, while coverage measures

are good detectors of traditional-type faults in a program. We therefore postulate


Figure 6.3: Regression analysis results for the effectiveness of run-time coupling met-

rics in predicting class-level mutations in real-world programs Velocity, Xalan and

Ant. The bars represent the R2 value for the run-time metric under consideration.

a possible role for run-time coupling metrics in fault-proneness detection, specifically
in identifying faults in object-oriented programs.

6.4 Conclusion

In this case study, we used regression analysis to show that run-time coupling
metrics were good detectors of class-type faults in programs, while instruction
coverage was a good detector of traditional-type mutants. Appendix C.1 illustrates
these results and shows that all results were deemed to be statistically significant at
the 5% level of significance. We therefore proposed the run-time coupling metrics as
alternative fault-detection measures, useful for identifying object-oriented faults in
programs.


Chapter 7

Conclusions

In this thesis we presented an empirical investigation into run-time coupling and

cohesion metrics.

The first case study investigated the influence of instruction coverage on the re-

lationship between static and run-time coupling metrics. An empirical investigation

was conducted using the set of run-time metrics proposed by Arisholm et al. on

a large set of Java programs. This set contained programs from the SPECjvm98

and JOlden benchmark suites and also included three real-world programs Velocity,

Xalan and Ant.

The differences in the underlying dimensions of coupling captured by the static

versus the run-time metrics were assessed using principal component analysis. Three

components were identified which contained static CBO, the import-based run-time

metrics, and the export-based run-time metrics. This established that the run-

time metrics were not simply surrogate static measures, which made them suitable

candidates for further analysis.

A study into the predictive ability of the static CBO and instruction coverage

data was then conducted using multiple regression analysis. The purpose of this was

to show how well the static CBO metric and instruction coverage measure Ic could

predict the six run-time metrics under consideration. The PCA analysis placed

import and export based coupling in different components, and this difference was

also seen in the regression analysis. Both CBO and instruction coverage had less

influence overall on the export-based metrics, EC CC and EC CM than on the



Figure 7.1: Findings from case study one, showing that our run-time coupling metrics
are not simply surrogate measures for static CBO, and that coverage plus static metrics
are better predictors of the run-time measures than the static measure alone.

import-based run-time metrics, IC CC and IC CM .

It was shown from the regression analysis that the combination of the static

measure with instruction coverage gave a significantly better prediction of the run-

time behavior of programs than the use of static metrics alone, for the class-based

and method-based metrics. This suggested that the correlation between static and

run-time metrics was as much a factor of coverage as an intrinsic property of the
metrics themselves.

The results for the two run-time metrics based on distinct message counts,
IC CD and EC CD, were not within the chosen significance level, and thus no
determination was made on the relationship for these metrics. Figure 7.1 summarises
the findings from this study.


The second case study looked at run-time object behaviour and whether run-time

cohesion metrics could be used to identify such behaviour.

First, we looked at object behaviour in the context of coupling. We used the

IC OC object-level metric, as defined by Arisholm et al. and defined a new measure

NOC in an attempt to identify objects that were behaving differently at run-time

from the point of view of coupling. We concluded that objects from the same class

could behave differently at run-time from the point of view of coupling due to the

fact that there were classes that exhibited variable CV values for IC OC and NOC

values greater than one

Subsequently, we looked at whether run-time cohesion metrics could be used to

predict object behaviour, as defined by the NOC measure. First, we had to prove

that the run-time cohesion metrics were not redundant with respect to static SLCOM .

The relationship between static and run-time cohesion metrics was investigated using

PCA. Two components were identified containing the static SLCOM and the run-time

cohesion measures RLCOM , RWLCOM . This established that the run-time metrics

were not simply surrogate static measures, making them suitable candidates for

further analysis.

Multiple regression analysis was used to discover if the cohesion metrics were

good predictors of object behaviour. The purpose of this was to show how well the

SLCOM metric and the run-time cohesion measures RLCOM , RWLCOM could predict

NOC . Overall, the results showed that cohesion metrics were a good predictor of

NOC , with run-time cohesion being the superior metric. This led us to conclude that

run-time cohesion metrics provided additional information about class behaviour

over and above that provided by SLCOM . Figure 7.2 depicts the results of this

study.

The third case study investigated whether instruction coverage was a good pre-

dictor of faults in a program. We used regression analysis to determine if this

measure was related to MK , the mutation kill rate of the test cases used. It was

found that IC was a good predictor of traditional-type faults but a poor predictor of

class-type faults which verifies results from previous studies on coverage measures.

Next, we analysed the extent to which the run-time coupling metrics were good



Figure 7.2: Findings from case study two, showing that run-time object-level coupling
measures can be used to identify objects that exhibit different behaviours at run-time,
and that run-time cohesion measures are good predictors of this type of behaviour.



Figure 7.3: Findings from case study three, showing that run-time coupling metrics
are good predictors of class-type faults and that instruction coverage is a good
predictor of traditional-type faults in programs.

detectors of traditional and class-type faults in a program. Our results showed that

the measures IC CC, IC CM, EC CC and EC CM were significantly related to MK

when considering class-type mutations. The results for IC CD and EC CD, the

two run-time metrics based on distinct message counts, were not within the chosen

significance level, and thus no determination was made on the relationship for these

metrics.

The purpose of this study was to determine whether instruction coverage is a better
predictor of fault-proneness than the run-time coupling measures. As we found the
run-time coupling measures to be superior to simple coverage measures in detecting
object-oriented faults in programs, we proposed the run-time coupling metrics as an
alternative measure for fault-proneness, useful for detecting faults in object-oriented
software. Figure 7.3 illustrates the findings from this study.

7.1 Contributions

We have implemented the tools ClMet and ObMet that can be used to perform a

class and object-level analysis of Java programs.


We use the definitions of Arisholm et al. for the set of run-time coupling metrics
in this analysis. We had, however, defined our own set of run-time coupling metrics
prior to the publication of their paper [75, 76]. Due to the similar nature of the
metrics, we then switched to using their definitions for ease of comparison.

We define a number of object-oriented run-time metrics for cohesion and we

investigate their possible utility. To date no one has attempted to do this.

We define a new measure NOC that can be used to study run-time object-

behaviour.

To the best of our knowledge this is the largest empirical study that has been

performed to date on the run-time analysis of Java programs. Previously, a study

was carried out by Arisholm et al. on “Dynamic Coupling Measurement for Object-
Oriented Software”; however, that study included only a single program, Velocity, in
its analysis.

Our study looks at not only Velocity but also the real-world programs Xalan and

Ant as well as seven benchmark programs from the SPECjvm98 suite and seven

programs from the JOlden suite thus making it much wider in scope.

The main findings from our study are as follows:

• We showed run-time coupling metrics capture additional dimensions of cou-

pling and are not simply surrogate measures for static CBO. Therefore, useful

information above that provided by a simple static analysis may be acquired

through the use of run-time metrics.

• Coverage has a significant impact on the correlation between static CBO and

run-time coupling metrics and should be a measured, recorded factor in any

comparison.

• Run-time object-level coupling metrics can be used to investigate object be-

haviour. Using such a measure we discovered that objects from the same class

can behave differently at run-time from the point of view of coupling.

• Run-time cohesion metrics are not redundant with respect to the static SLCOM

measure and capture additional dimensions of cohesion.


• Run-time cohesion metrics are good predictors of run-time object behaviour.

• Run-time coupling metrics based on distinct class and distinct method counts

are good predictors of class-type or object-oriented faults in programs but poor

predictors of traditional-type mutations.

• Coverage is a good predictor of traditional-type faults but a poor predictor of

class-type faults in programs.

7.2 Applications of this Work

Much of the work on the dynamic analysis of Java programs has come from the

language design and compiler community. The work in this thesis forms part of an

increasing link between this community and the software engineering community,

with an emphasis on collecting, analysing and comparing quantitative static and

dynamic data. Other possible examples of this synthesis include relating studies

of polymorphicity with testing inheritance relationships, or relating measures of

program “hot-spots” with metrics based on distinct messages such as IC CD and

EC CD. Run-time metrics may also have a role to play in areas of research such

as reverse engineering and program comprehension, as they contribute to a better

understanding of the behavior of code in its operational environment.

7.3 Threats to Validity

7.3.1 Internal Threats

There are a number of factors which may potentially affect the validity of these run-

time metrics. In this thesis we have chosen only to look at run-time definitions for

coupling and cohesion which are based on the standard static definitions proposed by

Chidamber and Kemerer. Their metric suite for analysing object-oriented software

consists of three additional measurements for evaluating the depth of inheritance tree

(DIT), the number of children (NOC) and the weighted methods per class (WMC).


Our set of run-time measures should be expanded to include run-time definitions for

these also to ensure that the set is fully comprehensive.

The run-time metrics used in this study are rated based on how they perform

in relation to static measurements for coupling and cohesion. However, no study

has definitively shown that any measurement for coupling or cohesion provides any

extra information on design quality over and above that which can be gained simply

by evaluating the much simpler lines of code measure.

7.3.2 External Threats

A general problem with any type of run-time analysis is that the results are based on

dynamic measurement and are thus tied to the inputs or test cases used. Therefore

the use of different test cases may produce different results. Static measurements

however will remain the same regardless of the set of test cases used to execute the

program.

The set of programs used in this study may not be representative of all classes

of Java programs; for example, no GUI-based programs were included in this

analysis.

While the run-time analysis tools ClMet and ObMet made it easy to collect a
wide variety of run-time information from a program and were easy to use, it was
still quite time-consuming to perform a full analysis of a program. Although it was
stated that performance was not really an issue in the design of these tools, if such
a method of evaluating a program were to be marketed to industry, the performance
of the tools would have to be given more serious consideration.

Only one external quality attribute, fault detection, was investigated in this

thesis. Further research needs to be conducted to see how well measures for coupling
and cohesion perform in predicting other important external quality attributes of a
design, such as maintainability, reusability or comprehensibility.

The relationship between internal and external quality attributes is quite intu-
itive; for example, more complex code will require greater effort to maintain. How-
ever, the precise functional form of this relationship is less clear and is the subject
of intense and practical research concern. Using theories of cognition and problem-


solving to help us understand the effects of complexity on software is the subject of

much current research [31].

7.4 Future Work

Future work may involve extending the existing set of coupling and cohesion met-

rics to develop a comprehensive set of run-time object-oriented metrics that can

intuitively quantify aspects of object-oriented applications such as inheritance,
dynamic binding and polymorphism.

Currently there exists no set of benchmarks specifically designed for evaluating
properties of object-oriented programming such as coupling and cohesion; it would
be useful to design such a set of benchmarks for use in similar empirical studies.

Further research could involve designing a run-time profiling tool written in C++
rather than Java. Such a tool could utilise the JVMDI component of the JPDA
directly and therefore would be dynamically linked with the JVM at run-time.
This would probably incur less performance overhead, reducing the time taken to
perform such an analysis.

Another important aspect would be to further investigate the correlation between

run-time metrics and external quality aspects of a design, including investigating

the possibility of using hybrid models that use a combination of static and run-time
metrics to evaluate a design.

It would be interesting to conduct an industrial case study using real commercial

software and data to further verify the results in this thesis.

Other applications of run-time metrics should be investigated; for example, they
could be useful in determining where refactorings have been or could be applied, or
they could be used to aid program comprehension.

This study has focused solely on the evaluation of Java software; it would be
important to investigate whether the run-time metrics give similar results when
used to evaluate other types of object-oriented software, for example C#.

Though the approach and results are of significance to the field, they can also be
used as stepping stones to open up new ways of considering a wider set of internal
quality attributes, their interrelationships, and their independent and interdependent
effects on the external quality aspects of a design.


Appendix A

Case Study 1: To Investigate the

Influence of Instruction Coverage

on the Relationship Between

Static and Run-time Coupling

Metrics

Appendix A.1 contains the PCA test results for the SPECjvm98 and JOlden suites

and for the real-world programs Velocity, Xalan and Ant. Values deemed to be

significant at the level p ≤ 0.05 are highlighted.

Appendix A.2 contains the results from the multiple linear regression used to test

the hypothesis H0, that coverage has no effect on the relationship between static

and run-time metrics for the programs from the SPECjvm98 and JOlden suites and

for the real-world programs Velocity, Xalan and Ant. All significant results are

highlighted.


A.1 PCA Test Results for all programs.

A.1.1 SPECjvm98 Benchmark Suite

201 compress

PC1 PC2 PC3

CBO 0.113 0.014 0.712

IC CC 0.865 0.065 0.186

IC CM 0.766 0.154 0.097

IC CD 0.866 0.073 0.100

EC CC 0.023 0.873 0.176

EC CM 0.143 0.799 0.035

EC CD 0.098 0.834 0.096

202 jess

PC1 PC2 PC3

CBO 0.198 0.187 0.672

IC CC 0.963 0.007 0.005

IC CM 0.912 0.003 0.016

IC CD 0.874 0.032 0.004

EC CC 0.154 0.812 0.002

EC CM 0.298 0.734 0.054

EC CD 0.098 0.923 0.002

205 raytrace

PC1 PC2 PC3

CBO 0.123 0.087 0.723

IC CC 0.834 0.021 0.019

IC CM 0.912 0.017 0.008

IC CD 0.896 0.103 0.001

EC CC 0.198 0.763 0.003

EC CM 0.125 0.709 0.017

EC CD 0.097 0.821 0.002

209 db

PC1 PC2 PC3

CBO 0.012 0.163 0.843

IC CC 0.893 0.088 0.002

IC CM 0.923 0.004 0.000

IC CD 0.976 0.003 0.013

EC CC 0.178 0.763 0.002

EC CM 0.110 0.793 0.027

EC CD 0.087 0.823 0.017

213 javac

PC1 PC2 PC3

CBO 0.187 0.000 0.973

IC CC 0.633 0.083 0.184

IC CM 0.834 0.033 0.023

IC CD 0.723 0.143 0.002

EC CC 0.138 0.834 0.004

EC CM 0.078 0.734 0.012

EC CD 0.067 0.759 0.034

222 mpegaudio

PC1 PC2 PC3

CBO 0.244 0.137 0.583

IC CC 0.943 0.004 0.087

IC CM 0.898 0.034 0.041

IC CD 0.943 0.023 0.001

EC CC 0.034 0.943 0.043

EC CM 0.134 0.754 0.085

EC CD 0.098 0.845 0.005

228 jack

PC1 PC2 PC3

CBO 0.004 0.243 0.634

IC CC 0.605 0.234 0.154

IC CM 0.723 0.194 0.076

IC CD 0.604 0.195 0.098

EC CC 0.194 0.749 0.098

EC CM 0.103 0.694 0.049

EC CD 0.094 0.749 0.104

A.1.2 JOlden Benchmark Suite

BH

PC1 PC2 PC3

CBO 0.403 0.002 0.520

IC CC 0.728 0.224 0.012

IC CM 0.536 0.391 0.001

IC CD 0.555 0.376 0.000

EC CC 0.358 0.522 0.109

EC CM 0.203 0.763 0.025

EC CD 0.203 0.763 0.025

Em3d

PC1 PC2 PC3

CBO 0.134 0.034 0.712

IC CC 0.933 0.013 0.016

IC CM 0.772 0.168 0.039

IC CD 0.772 0.168 0.039

EC CC 0.139 0.702 0.082

EC CM 0.223 0.716 0.039

EC CD 0.223 0.716 0.039

Health

PC1 PC2 PC3

CBO 0.238 0.187 0.521

IC CC 0.956 0.005 0.017

IC CM 0.936 0.024 0.010

IC CD 0.940 0.028 0.009

EC CC 0.076 0.831 0.086

EC CM 0.070 0.919 0.002

EC CD 0.065 0.794 0.003

MST

PC1 PC2 PC3

CBO 0.000 0.013 0.972

IC CC 0.900 0.063 0.032

IC CM 0.956 0.010 0.026

IC CD 0.941 0.012 0.027

EC CC 0.356 0.609 0.033

EC CM 0.121 0.877 0.001

EC CD 0.118 0.881 0.000

Perimeter

PC1 PC2 PC3

CBO 0.231 0.123 0.612

IC CC 0.541 0.169 0.281

IC CM 0.876 0.080 0.002

IC CD 0.905 0.056 0.038

EC CC 0.236 0.752 0.000

EC CM 0.147 0.830 0.023

EC CD 0.142 0.828 0.026

Power

PC1 PC2 PC3

CBO 0.329 0.014 0.626

IC CC 0.617 0.073 0.161

IC CM 0.624 0.338 0.036

IC CD 0.712 0.228 0.041

EC CC 0.022 0.915 0.015

EC CM 0.007 0.880 0.112

EC CD 0.008 0.824 0.164


Voronoi

PC1 PC2 PC3

CBO 0.198 0.213 0.526

IC CC 0.718 0.123 0.069

IC CM 0.812 0.088 0.134

IC CD 0.773 0.176 0.141

EC CC 0.043 0.911 0.005

EC CM 0.067 0.934 0.004

EC CD 0.148 0.834 0.054

A.1.3 Real-World Programs, Velocity, Xalan and Ant

Velocity

PC1 PC2 PC3

CBO 0.384 0.184 0.734

IC CC 0.623 0.034 0.174

IC CM 0.725 0.087 0.231

IC CD 0.684 0.196 0.192

EC CC 0.284 0.684 0.097

EC CM 0.023 0.793 0.005

EC CD 0.174 0.590 0.015

Xalan

PC1 PC2 PC3

CBO 0.316 0.174 0.586

IC CC 0.824 0.184 0.183

IC CM 0.890 0.284 0.284

IC CD 0.795 0.003 0.194

EC CC 0.013 0.834 0.164

EC CM 0.284 0.793 0.023

EC CD 0.384 0.823 0.154

Ant

PC1 PC2 PC3

CBO 0.125 0.254 0.687

IC CC 0.874 0.125 0.125

IC CM 0.789 0.231 0.012

IC CD 0.801 0.324 0.214

EC CC 0.214 0.789 0.124

EC CM 0.141 0.785 0.054

EC CD 0.123 0.754 0.014


A.2 Multiple linear regression results for all programs

A.2.1 SPECjvm98 Benchmark Suite

201 compress

Hypothesis Y R R2 P > F

HCBO IC CC 0.775 0.593 0.003

HCBO,Ic IC CC 0.798 0.602 0.0001

HCBO EC CC 0.634 0.402 0.01

HCBO,Ic EC CC 0.870 0.759 0.007

HCBO IC CD 0.512 0.262 0.421

HCBO,Ic IC CD 0.599 0.359 0.201

HCBO EC CD 0.239 0.057 0.054

HCBO,Ic EC CD 0.422 0.178 0.134

HCBO IC CM 0.762 0.58 0.003

HCBO,Ic IC CM 0.885 0.784 0.006

HCBO EC CM 0.235 0.056 0.04

HCBO,Ic EC CM 0.58 0.336 0.035

202 jess

Hypothesis Y R R2 P > F

HCBO IC CC 0.553 0.306 0.002

HCBO,Ic IC CC 0.703 0.494 0.001

HCBO EC CC 0.428 0.184 0.031

HCBO,Ic EC CC 0.567 0.322 0.023

HCBO IC CD 0.765 0.586 0.145

HCBO,Ic IC CD 0.868 0.754 0.321

HCBO EC CD 0.691 0.748 0.246

HCBO,Ic EC CD 0.723 0.523 0.135

HCBO IC CM 0.762 0.581 0.023

HCBO,Ic IC CM 0.922 0.852 0.012

HCBO EC CM 0.618 0.382 0.001

HCBO,Ic EC CM 0.645 0.416 0.002

205 raytrace

Hypothesis Y R R2 P > F

HCBO IC CC 0.444 0.197 0.021

HCBO,Ic IC CC 0.659 0.434 0.002

HCBO EC CC 0.59 0.349 0.043

HCBO,Ic EC CC 0.669 0.447 0.032

HCBO IC CD 0.256 0.065 0.342

HCBO,Ic IC CD 0.36 0.13 0.365

HCBO EC CD 0.239 0.057 0.123

HCBO,Ic EC CD 0.363 0.132 0.432

HCBO IC CM 0.443 0.196 0.034

HCBO,Ic IC CM 0.599 0.359 0.032

HCBO EC CM 0.422 0.178 0.012

HCBO,Ic EC CM 0.632 0.399 0.032

209 db

Hypothesis Y R R2 P > F

HCBO IC CC 0.419 0.178 0.0001

HCBO,Ic IC CC 0.868 0.754 0.001

HCBO EC CC 0.567 0.322 0.002

HCBO,Ic EC CC 0.881 0.777 0.001

HCBO IC CD 0.691 0.478 0.522

HCBO,Ic IC CD 0.768 0.589 0.263

HCBO EC CD 0.312 0.097 0.609

HCBO,Ic EC CD 0.429 0.184 0.816

HCBO IC CM 0.582 0.338 0.003

HCBO,Ic IC CM 0.703 0.494 0.006

HCBO EC CM 0.313 0.098 0.019

HCBO,Ic EC CM 0.428 0.184 0.016

213 javac

Hypothesis Y R R2 P > F

HCBO IC CC 0.535 0.286 0.005

HCBO,Ic IC CC 0.748 0.559 0.002

HCBO EC CC 0.443 0.196 0.004

HCBO,Ic EC CC 0.531 0.282 0.007

HCBO IC CD 0.512 0.262 0.234

HCBO,Ic IC CD 0.606 0.367 0.176

HCBO EC CD 0.872 0.76 0.765

HCBO,Ic EC CD 0.922 0.85 0.567

HCBO IC CM 0.553 0.306 0.034

HCBO,Ic IC CM 0.76 0.577 0.024

HCBO EC CM 0.321 0.107 0.042

HCBO,Ic EC CM 0.567 0.322 0.034

222 mpegaudio

Hypothesis Y R R2 P > F

HCBO IC CC 0.174 0.032 0.003

HCBO,Ic IC CC 0.452 0.204 0.001

HCBO EC CC 0.296 0.088 0.013

HCBO,Ic EC CC 0.635 0.403 0.006

HCBO IC CD 0.734 0.538 0.165

HCBO,Ic IC CD 0.885 0.784 0.214

HCBO EC CD 0.948 0.899 0.234

HCBO,Ic EC CD 0.978 0.956 0.654

HCBO IC CM 0.753 0.567 0.001

HCBO,Ic IC CM 0.769 0.592 0.002

HCBO EC CM 0.533 0.284 0.021

HCBO,Ic EC CM 0.635 0.403 0.03

228 jack

Hypothesis Y R R2 P > F

HCBO IC CC 0.606 0.367 0.003

HCBO,Ic IC CC 0.966 0.933 0.012

HCBO EC CC 0.512 0.262 0.002

HCBO,Ic EC CC 0.872 0.76 0.003

HCBO IC CD 0.239 0.057 0.465

HCBO,Ic IC CD 0.618 0.382 0.450

HCBO EC CD 0.363 0.132 0.123

HCBO,Ic EC CD 0.419 0.178 0.576

HCBO IC CM 0.585 0.343 0.013

HCBO,Ic IC CM 0.599 0.359 0.002

HCBO EC CM 0.363 0.132 0.045

HCBO,Ic EC CM 0.417 0.174 0.032


A.2.2 JOlden Benchmark Suite

BH

Hypothesis Y R R2 P > F

HCBO IC CC 0.531 0.282 0.038

HCBO,Ic IC CC 0.767 0.588 0.044

HCBO EC CC 0.092 0.008 0.0001

HCBO,Ic EC CC 0.533 0.284 0.0001

HCBO IC CD 0.431 0.185 0.247

HCBO,Ic IC CD 0.617 0.381 0.237

HCBO EC CD 0.443 0.196 0.232

HCBO,Ic EC CD 0.514 0.264 0.398

HCBO IC CM 0.45 0.203 0.024

HCBO,Ic IC CM 0.635 0.403 0.013

HCBO EC CM 0.443 0.196 0.032

HCBO,Ic EC CM 0.514 0.264 0.024

Em3d

Hypothesis Y R R2 P > F

HCBO IC CC 0.617 0.381 0.046

HCBO,Ic IC CC 0.748 0.659 0.001

HCBO EC CC 0.262 0.069 0.03

HCBO,Ic EC CC 0.937 0.878 0.024

HCBO IC CD 0.59 0.349 0.294

HCBO,Ic IC CD 0.591 0.349 0.651

HCBO EC CD 0.02 0.00 0.975

HCBO,Ic EC CD 0.626 0.392 0.608

HCBO IC CM 0.59 0.349 0.194

HCBO,Ic IC CM 0.591 0.349 0.151

HCBO EC CM 0.02 0.000 0.075

HCBO,Ic EC CM 0.626 0.392 0.008

Health

Hypothesis Y R R2 P > F

HCBO IC CC 0.601 0.372 0.04

HCBO,Ic IC CC 0.643 0.414 0.003

HCBO EC CC 0.22 0.048 0.06

HCBO,Ic EC CC 0.254 0.064 0.13

HCBO IC CD 0.659 0.434 0.075

HCBO,Ic IC CD 0.753 0.566 0.124

HCBO EC CD 0.444 0.197 0.27

HCBO,Ic EC CD 0.535 0.286 0.431

HCBO IC CM 0.669 0.447 0.07

HCBO,Ic IC CM 0.76 0.578 0.116

HCBO EC CM 0.444 0.197 0.207

HCBO,Ic EC CM 0.535 0.286 0.431

MST

Hypothesis Y R R2 P > F

HCBO IC CC 0.97 0.941 0.001

HCBO,Ic IC CC 0.972 0.945 0.0001

HCBO EC CC 0.606 0.367 0.002

HCBO,Ic EC CC 0.76 0.577 0.001

HCBO IC CD 0.966 0.933 0.200

HCBO,Ic IC CD 0.987 0.974 0.401

HCBO EC CD 0.239 0.057 0.649

HCBO,Ic EC CD 0.618 0.382 0.486

HCBO IC CM 0.966 0.933 0.002

HCBO,Ic IC CM 0.987 0.974 0.004

HCBO EC CM 0.239 0.057 0.049

HCBO,Ic EC CM 0.618 0.382 0.086

Perimeter

Hypothesis Y R R2 P > F

HCBO IC CC 0.36 0.13 0.306

HCBO,Ic IC CC 0.422 0.178 0.503

HCBO EC CC 0.095 0.009 0.194

HCBO,Ic EC CC 0.599 0.359 0.211

HCBO IC CD 0.512 0.262 0.131

HCBO,Ic IC CD 0.585 0.343 0.230

HCBO EC CD 0.256 0.065 0.476

HCBO,Ic EC CD 0.58 0.336 0.238

HCBO IC CM 0.645 0.416 0.044

HCBO,Ic IC CM 0.66 0.435 0.135

HCBO EC CM 0.256 0.065 0.076

HCBO,Ic EC CM 0.58 0.336 0.038

Power

Hypothesis Y R R2 P > F

HCBO IC CC 0.709 0.502 0.042

HCBO,Ic IC CC 0.713 0.508 0.001

HCBO EC CC 0.635 0.404 0.011

HCBO,Ic EC CC 0.872 0.76 0.001

HCBO IC CD 0.104 0.011 0.844

HCBO,Ic IC CD 0.723 0.523 0.329

HCBO EC CD 0.363 0.132 0.479

HCBO,Ic EC CD 0.632 0.399 0.465

HCBO IC CM 0.067 0.004 0.9

HCBO,Ic IC CM 0.638 0.407 0.456

HCBO EC CM 0.417 0.174 0.010

HCBO,Ic EC CM 0.673 0.453 0.005

Voronoi

Hypothesis Y R R2 P > F

HCBO IC CC 0.922 0.85 0.009

HCBO,Ic IC CC 0.941 0.885 0.0001

HCBO EC CC 0.553 0.306 0.255

HCBO,Ic EC CC 0.561 0.314 0.568

HCBO IC CD 0.762 0.58 0.078

HCBO,Ic IC CD 0.768 0.589 0.263

HCBO EC CD 0.627 0.393 0.183

HCBO,Ic EC CD 0.636 0.405 0.459

HCBO IC CM 0.765 0.586 0.076

HCBO,Ic IC CM 0.77 0.594 0.059

HCBO EC CM 0.627 0.393 0.083

HCBO,Ic EC CM 0.636 0.405 0.029


A.2.3 Real-World Programs, Velocity, Xalan and Ant

Velocity

Hypothesis Y R R2 P > F

HCBO IC CC 0.515 0.265 0.0001

HCBO,Ic IC CC 0.722 0.521 0.001

HCBO EC CC 0.381 0.145 0.014

HCBO,Ic EC CC 0.617 0.381 0.025

HCBO IC CD 0.595 0.354 0.254

HCBO,Ic IC CD 0.741 0.547 0.354

HCBO EC CD 0.677 0.458 0.144

HCBO,Ic EC CD 0.861 0.741 0.214

HCBO IC CM 0.675 0.455 0.005

HCBO,Ic IC CM 0.752 0.565 0.004

HCBO EC CM 0.409 0.167 0.007

HCBO,Ic EC CM 0.506 0.256 0.01

Xalan

Hypothesis Y R R2 P > F

HCBO IC CC 0.453 0.205 0.002

HCBO,Ic IC CC 0.637 0.406 0.001

HCBO EC CC 0.430 0.185 0.002

HCBO,Ic EC CC 0.570 0.325 0.004

HCBO IC CD 0.709 0.502 0.547

HCBO,Ic IC CD 0.892 0.796 0.214

HCBO EC CD 0.830 0.689 0.114

HCBO,Ic EC CD 0.857 0.735 0.147

HCBO IC CM 0.652 0.425 0.006

HCBO,Ic IC CM 0.762 0.581 0.005

HCBO EC CM 0.504 0.254 0.011

HCBO,Ic EC CM 0.624 0.389 0.007

Ant

Hypothesis Y R R2 P > F

HCBO IC CC 0.604 0.365 0.005

HCBO,Ic IC CC 0.765 0.585 0.006

HCBO EC CC 0.453 0.205 0.014

HCBO,Ic EC CC 0.636 0.405 0.018

HCBO IC CD 0.597 0.356 0.154

HCBO,Ic IC CD 0.698 0.487 0.198

HCBO EC CD 0.518 0.268 0.287

HCBO,Ic EC CD 0.667 0.445 0.098

HCBO IC CM 0.725 0.525 0.017

HCBO,Ic IC CM 0.784 0.615 0.025

HCBO EC CM 0.451 0.204 0.042

HCBO,Ic EC CM 0.560 0.314 0.034


Appendix B

Case Study 2: The Impact of

Run-time Cohesion on Object

Behaviour

Appendix B.1 contains the PCA test results for the JOlden benchmark suite and for

the real-world programs Velocity, Xalan and Ant. Values deemed to be significant

at the level p ≤ 0.05 are highlighted.

Appendix B.2 contains the results from the multiple linear regression used to test

the hypothesis H0, that measures of run-time cohesion provide a better indication

of NOC than a static measure alone for the JOlden benchmark programs and the

real-world programs Velocity, Xalan and Ant. All significant results are highlighted.

B.1 PCA Test Results for all programs.

B.1.1 JOlden Benchmark Suite

BH

PC1 PC2

SLCOM 0.214 0.754

RLCOM 0.714 0.214

RWLCOM 0.721 0.101

Em3d

PC1 PC2

SLCOM 0.135 0.812

RLCOM 0.841 0.014

RWLCOM 0.814 0.014

Health

PC1 PC2

SLCOM 0.122 0.789

RLCOM 0.674 0.145

RWLCOM 0.714 0.212

MST

PC1 PC2

SLCOM 0.251 0.712

RLCOM 0.714 0.211

RWLCOM 0.751 0.165

Perimeter

PC1 PC2

SLCOM 0.025 0.912

RLCOM 0.874 0.145

RWLCOM 0.768 0.121

Power

PC1 PC2

SLCOM 0.142 0.775

RLCOM 0.654 0.154

RWLCOM 0.698 0.177


Voronoi

PC1 PC2

SLCOM 0.045 0.901

RLCOM 0.854 0.104

RWLCOM 0.868 0.021

B.1.2 Real-World Programs, Velocity, Xalan and Ant

Velocity

PC1 PC2

SLCOM 0.215 0.614

RLCOM 0.814 0.124

RWLCOM 0.751 0.165

Xalan

PC1 PC2

SLCOM 0.315 0.554

RLCOM 0.714 0.116

RWLCOM 0.641 0.225

Ant

PC1 PC2

SLCOM 0.114 0.712

RLCOM 0.814 0.124

RWLCOM 0.801 0.101

B.2 Multiple linear regression results for all programs.

B.2.1 JOlden Benchmark Suite

BH

Hypothesis Y R R2 P > F

HSLCOM NOC 0.444 0.197 0.016
HSLCOM,RLCOM NOC 0.711 0.507 0.01
HSLCOM NOC 0.105 0.012 0.452
HSLCOM,RWLCOM NOC 0.631 0.398 0.487

Health

Hypothesis Y R R2 P > F

HSLCOM NOC 0.518 0.268 0.012
HSLCOM,RLCOM NOC 0.754 0.568 0.009
HSLCOM NOC 0.445 0.198 0.124
HSLCOM,RWLCOM NOC 0.534 0.285 0.211

Perimeter

Hypothesis Y R R2 P > F

HSLCOM NOC 0.514 0.263 0.002
HSLCOM,RLCOM NOC 0.631 0.398 0.001
HSLCOM NOC 0.366 0.135 0.048
HSLCOM,RWLCOM NOC 0.451 0.203 0.037

Em3d

Hypothesis Y R R2 P > F

HSLCOM NOC 0.365 0.134 0.006
HSLCOM,RLCOM NOC 0.744 0.655 0.005
HSLCOM NOC 0.415 0.173 0.254
HSLCOM,RWLCOM NOC 0.67 0.451 0.354

MST

Hypothesis Y R R2 P > F

HSLCOM NOC 0.235 0.056 0.025
HSLCOM,RLCOM NOC 0.704 0.495 0.012
HSLCOM NOC 0.555 0.308 0.121
HSLCOM,RWLCOM NOC 0.594 0.355 0.241

Power

Hypothesis Y R R2 P > F

HSLCOM NOC 0.177 0.035 0.028
HSLCOM,RLCOM NOC 0.598 0.358 0.035
HSLCOM NOC 0.445 0.198 0.214
HSLCOM,RWLCOM NOC 0.514 0.264 0.277

Voronoi

Hypothesis Y R R2 P > F

HSLCOM NOC 0.523 0.273 0.004
HSLCOM,RLCOM NOC 0.767 0.589 0.002
HSLCOM NOC 0.255 0.064 0.381
HSLCOM,RWLCOM NOC 0.333 0.129 0.358


B.2.2 Real-World Programs, Velocity, Xalan and Ant

Velocity

Hypothesis Y R R2 P > F

HSLCOM NOC 0.445 0.198 0.002
HSLCOM,RLCOM NOC 0.756 0.572 0.001
HSLCOM NOC 0.363 0.132 0.456
HSLCOM,RWLCOM NOC 0.598 0.358 0.345

Xalan

Hypothesis Y R R2 P > F

HSLCOM NOC 0.242 0.06 0.044
HSLCOM,RLCOM NOC 0.621 0.385 0.098
HSLCOM NOC 0.722 0.523 0.287
HSLCOM,RWLCOM NOC 0.869 0.758 0.205

Ant

Hypothesis Y R R2 P > F

HSLCOM NOC 0.455 0.207 0.0001
HSLCOM,RLCOM NOC 0.747 0.558 0.001
HSLCOM NOC 0.633 0.401 0.214
HSLCOM,RWLCOM NOC 0.69 0.747 0.564


Appendix C

Case Study 3: A Study of

Run-time Coupling Metrics and

Fault Detection

Appendix C.1 contains the results from the regression analysis used to test the

hypothesis H0, that run-time coupling metrics are poor detectors of faults in a

program for the set of real-world programs Velocity, Xalan and Ant.

Appendix C.2 presents the results to test the hypothesis H0, that coverage mea-

sures are poor detectors of faults in a program for the real-world programs Velocity,

Xalan and Ant. All significant results are highlighted.


C.1 Regression analysis results for real-world programs, Velocity, Xalan and Ant.

C.1.1 For Class Mutants

Velocity

Hypothesis Y R R2 P > F

HIC CC MK 0.830 0.689 0.002

HIC CM MK 0.766 0.587 0.001

HIC CD MK 0.684 0.468 0.006

HEC CC MK 0.790 0.621 0.007

HEC CM MK 0.754 0.569 0.411

HEC CD MK 0.491 0.241 0.456

Xalan

Hypothesis Y R R2 P > F

HIC CC MK 0.767 0.589 0.003

HIC CM MK 0.705 0.498 0.002

HIC CD MK 0.710 0.504 0.001
HEC CC MK 0.706 0.499 0.046

HEC CM MK 0.706 0.499 0.254

HEC CD MK 0.649 0.421 0.680

Ant

Hypothesis Y R R2 P > F

HIC CC MK 0.773 0.598 0.003

HIC CM MK 0.708 0.501 0.005

HIC CD MK 0.829 0.687 0.001
HEC CC MK 0.749 0.561 0.075

HEC CM MK 0.749 0.561 0.342

HEC CD MK 0.463 0.214 0.127


C.1.2 For Traditional Mutants

Velocity

Hypothesis Y R R2 P > F

HIC CC MK 0.570 0.325 0.048

HIC CM MK 0.644 0.415 0.054

HIC CD MK 0.375 0.141 0.065

HEC CC MK 0.642 0.412 0.115

HEC CM MK 0.463 0.214 0.256

HEC CD MK 0.392 0.154 0.658

Xalan

Hypothesis Y R R2 P > F

HIC CC MK 0.598 0.358 0.091

HIC CM MK 0.567 0.321 0.078

HIC CD MK 0.463 0.214 0.154
HEC CC MK 0.676 0.457 0.254

HEC CM MK 0.459 0.211 0.254

HEC CD MK 0.381 0.145 0.351

Ant

Hypothesis Y R R2 P > F

HIC CC MK 0.740 0.547 0.065

HIC CM MK 0.649 0.421 0.085

HIC CD MK 0.463 0.214 0.159

HEC CD MK 0.577 0.333 0.241

HEC CD MK 0.606 0.367 0.154

HEC CM MK 0.536 0.287 0.054

HEC CD MK 0.459 0.211 0.216

C.2 Regression analysis results for real-world programs, Velocity, Xalan and Ant.

C.2.1 For Class Mutants

Velocity

Hypothesis Y R R2 P > F

HIc MK 0.326 0.106 0.032

Xalan

Hypothesis Y R R2 P > F

HIc MK 0.409 0.167 0.004

Ant

Hypothesis Y R R2 P > F

HIc MK 0.344 0.118 0.005


C.2.2 For Traditional Mutants

Velocity

Hypothesis Y R R2 P > F

HIc MK 0.888 0.789 0.003

Xalan

Hypothesis Y R R2 P > F

HIc MK 0.803 0.645 0.024

Ant

Hypothesis Y R R2 P > F

HIc MK 0.836 0.699 0.019


Appendix D

Mutation operators in µJava

Table D.1 presents a description of the traditional-level mutation operators in µJava.

Table D.2 presents a description of the class-level mutation operators in µJava.

Operator Description

ABS Absolute value insertion

AOR Arithmetic operator replacement

LCR Logical connector replacement

ROR Relational operator replacement

UOI Unary operator insertion

Table D.1: Traditional-level mutation operators in µJava
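To make the operators concrete, here is a hand-written example of an AOR (arithmetic
operator replacement) mutant; it is illustrative only and was not generated by µJava.

/** Illustrative AOR (arithmetic operator replacement) mutant, written by
 *  hand for this example rather than generated by µJava. */
public class AorExample {

    static int original(int a, int b) { return a + b; }

    static int mutant(int a, int b) { return a - b; } // '+' replaced by '-'

    public static void main(String[] args) {
        // A test asserting original(2, 3) == 5 kills this mutant, since
        // mutant(2, 3) == -1 differs from the expected value.
        System.out.println(original(2, 3) + " vs " + mutant(2, 3));
    }
}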


Language Feature Operator Description

Inheritance: IHD Hiding variable deletion

IHI Hiding variable insertion

IOD Overriding method deletion

IOP Overriding method calling position change

IOR Overriding method rename

ISK super keyword deletion

IPC Explicit call of a parent's constructor deletion

Polymorphism: PNC new method call with child class type

PMD Instance variable declaration with parent class type

PPD Parameter variable declaration with child class type

PRV Reference assignment with other comparable type

Overloading: OMR Overloading method contents change

OMD Overloading method deletion

OAO Argument order change

OAN Argument number change

Java-specific features: JTD this keyword deletion

JSC static modifier change

JID Member variable initialization deletion

JDC Java-supported default constructor creation

Common programming EOA Reference assignment and content

mistakes: assignment replacement

EOC Reference comparison and content

comparison replacement

EAM Accessor method change

EMM Modifier method change

Table D.2: Class-level mutation operators in µJava
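Similarly, a hand-written sketch of an IOD (overriding method deletion) mutant
illustrates the class-level operators; again this is our own example, not µJava output.

/** Illustrative IOD (overriding method deletion) mutant, written by hand
 *  for this example rather than generated by µJava. */
public class IodExample {

    static class Parent {
        String name() { return "parent"; }
    }

    static class Child extends Parent {
        // The IOD operator deletes this override; the mutant then inherits
        // Parent.name(), so a test expecting "child" kills the mutant.
        @Override
        String name() { return "child"; }
    }

    public static void main(String[] args) {
        System.out.println(new Child().name()); // prints "child" in the original
    }
}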


Bibliography

[1] F. Abreu, M. Goulo, and R. Esteves. Toward the design quality evaluation of

object-oriented software systems. In Fifth International Conference on Software

Quality, pages 44–57, Austin, Texas, USA, Oct 1995.

[2] A.J. Albrecht. Measuring application development. In IBM Applications Devel-

opment joint SHARE/GUIDE symposium, pages 83–92, Monterey California,

USA, 1979.

[3] R.T. Alexander and J. Offutt. Coupling-based testing of O-O programs. The

Journal of Universal Computer Science, 10(4):391–427, 2004.

[4] The Apache Ant Project. Ant. http://ant.apache.org/.

[5] E. Arisholm, L.C. Briand, and A. Foyen. Dynamic coupling measures for object-

oriented software. IEEE Transactions on Software Engineering, 30(8):491–506,

2004.

[6] V.R. Basili, L.C. Briand, and W.L. Melo. A validation of object-oriented design

metrics as quality indicators. IEEE Transactions on Software Engineering,

22(10):751–761, October 1996.

[7] B. Beizer. Software Testing Techniques. 2nd edition, Van Nostrand Reinhold,

New York, USA, 1990.

[8] Standard Performance Evaluation Corporation SPECjvm98 Benchmarks.

http://www.spec.org/jvm98.


[9] J.M. Bieman and B.K. Kang. Cohesion and reuse in an object-oriented system.

In ACM Symposium on Software Reusability, pages 295–262, Seattle, Washing-

ton, USA, 1995.

[10] R. Binder. Testing Object Oriented Systems: Models, Patterns and Tools. Ad-

dison Wesley, Boston, Massachusetts, USA, October 1999.

[11] L.C. Briand, J. Daly, V. Porter, and J. Wust. A comprehensive empirical valida-

tion of product measures in object-oriented systems. Technical Report ISERN-

98-07, Fraunhofer Institute for Experimental Software Engineering, Germany,

1998.

[12] L.C. Briand, J.W. Daly, and J.K. Wust. A unified framework for cohesion

measurement in object-oriented systems. Empirical Software Engineering: An

International Journal, 3(1):65–117, 1998.

[13] L.C. Briand, J.W. Daly, and J.K. Wust. A unified framework for coupling

measurement in object-oriented systems. IEEE Transactions on Software En-

gineering, 25(1):91–121, Jan/Feb 1999.

[14] L.C. Briand, P. Devanbu, and W. Melo. An investigation into coupling measures

for C++. In 19th International Conference on Software Engineering, pages

412–421, Boston, USA, May 1997.

[15] L.C. Briand, W.L. Melo, and J. Wust. Assessing the applicability of fault-

proneness models across object-oriented software projects. IEEE Transactions

on Software Engineering, 28(7):706–720, 2002.

[16] L.C. Briand, S. Morasca, and V. Basili. Measuring and assessing maintainabil-

ity at the end of high-level design. In International Conference on Software

Maintenance, pages 88–97, Montreal, Canada, 1993.

[17] L.C. Briand, S. Morasca, and V. Basili. Defining and validating high-level design

metrics. Technical Report CS-TR 3301, Department of Computer Science,

University of Maryland, College Park, MD 20742, USA, 1994.


[18] L.C. Briand, S. Morasca, and V.R. Basili. An operational process for goal-

driven definition of measures. IEEE Transactions on Software Engineering,

28(12):1106–1125, December 2002.

[19] L.C. Briand, J.K. Wust, J.W. Daly, and V. Porter. Exploring the relationship

between design measures and software quality in object-oriented systems. The

Journal of Systems and Software, 51:245–273, 2000.

[20] S. Brown, A. Mitchell, and J.F. Power. A coverage analysis of Java benchmark

suites. In IASTED International Conference on Software Engineering, pages

144–150, Innsbruck, Austria, February 15-17 2005.

[21] M. Bunge. Treatise on Basic Philosophy: Ontology II: The World of Systems.

Riedel, Boston, USA, 1972.

[22] X. Cai and M.R. Lyu. The effect of code coverage on fault detection un-

der different testing profiles. In First International Workshop on Advances in

Model-based Testing, pages 1–7, St. Louis, Missouri, USA, 2005.

[23] M.C. Carlisle and A. Rogers. Software caching and computation migration in

olden. In ACM Symposium on Principles and Practice of Parallel Programming,

pages 29–38, Santa Barbara, California, USA, July 1995.

[24] I.M. Chakravarti, R.G. Laha, and J. Roy. Handbook of Methods of Applied

Statistics, volume 1. John Wiley and Sons, New York, USA, 1967.

[25] S.R. Chidamber and C.F. Kemerer. Towards a metrics suite for object-oriented

design. In Object Oriented Programming Systems Languages and Applications,

pages 197–211, Phoenix, Arizona, USA, November 1991.

[26] S.R. Chidamber and C.F. Kemerer. A metrics suite for object-oriented design.

IEEE Transactions on Software Engineering, 20(6):467–493, June 1994.

[27] E.J. Chikofsky and J.H. Cross II. Reverse engineering and design recovery: A

taxonomy. IEEE Software, 7(1):13–17, 1990.


[28] J. Choi, M. Gupta, M.J. Serrano, V.C. Sreedhar, and S.P. Midkiff. Stack allocation and synchronization optimizations for Java using escape analysis. ACM Transactions on Programming Languages and Systems, 25(6):876–910, November 2003.

[29] L.L. Constantine and E. Yourdon. Structured Design. Prentice-Hall, Englewood Cliffs, New Jersey, USA, 1979.

[30] M. Dahm. Byte Code Engineering Library (BCEL), version 5.1, April 25 2004. http://jakarta.apache.org/bcel/.

[31] D.P. Darcy, C.F. Kemerer, S.A. Slaughter, and T.A. Tomayko. The structural complexity of software: An experimental test. IEEE Transactions on Software Engineering, 31(11):982–995, 2005.

[32] S. Demeyer, S. Ducasse, and O. Nierstrasz. Finding refactorings via change metrics. In 15th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 166–178, Minneapolis, Minnesota, USA, 2000.

[33] B. Dufour, K. Driesen, L.J. Hendren, and C. Verbrugge. Dynamic metrics for Java. In Conference on Object-Oriented Programming Systems, Languages and Applications, pages 149–168, Anaheim, California, USA, October 26-30 2003.

[34] J. Eder, G. Kappel, and M. Schrefl. Coupling and cohesion in object-oriented systems. Technical Report 2/93, Department of Information Systems, University of Linz, Linz, Austria, 1993.

[35] D.W. Embley and S.N. Woodfield. Cohesion and coupling for abstract data types. In 6th International Phoenix Conference on Computers and Communications, pages 144–153, Phoenix, Arizona, USA, 1987.

[36] T.J. Emerson. A discriminant metric for module cohesion. In 7th International Conference on Software Engineering, pages 294–303, Orlando, Florida, USA, 1984.


[37] T.J. Emerson. Program testing, path coverage, and the cohesion metric. In Computer Software and Applications Conference, pages 421–431, Chicago, Illinois, USA, 1984.

[38] J. Engel. Programming for the Java Virtual Machine. Addison-Wesley, California, USA, 1999.

[39] N.E. Fenton and M. Neil. Software metrics: Successes, failures and new directions. The Journal of Systems and Software, 47:149–157, 1999.

[40] N.E. Fenton and S.L. Pfleeger. Software Metrics: A Rigorous and Practical Approach. PWS Publishing Company, Boston, Massachusetts, USA, 1997.

[41] F. Fioravanti and P. Nesi. A study on fault-proneness detection of object-oriented systems. In Fifth European Conference on Software Maintenance and Reengineering, pages 121–130, Lisbon, Portugal, March 14-16 2001.

[42] P.G. Frankl and E.J. Weyuker. An applicable family of data flow testing criteria. IEEE Transactions on Software Engineering, 14(10):1483–1498, 1988.

[43] R.J. Freund and W.J. Wilson. Regression Analysis: Statistical Modeling of a Response Variable. Academic Press, 1998.

[44] R.R. Gonzalez. A unified metric of software complexity: Measuring productivity, quality and value. The Journal of Systems and Software, 29(1):17–37, 1995.

[45] D. Gregg, J. Power, and J. Waldron. Platform independent dynamic Java virtual machine analysis: the Java Grande Forum benchmark suite. Concurrency and Computation: Practice and Experience, 15(3-5):459–484, March 2003.

[46] N. Gupta and P. Rao. Program execution based module cohesion measurement. In 16th International Conference on Automated Software Engineering, pages 144–153, San Diego, California, USA, November 2001.

[47] M. Halstead. Elements of Software Science. North-Holland, Amsterdam, 1977.


[48] R.G. Hamlet. Testing programs with the aid of a compiler. IEEE Transactions on Software Engineering, 3(4):279–290, 1977.

[49] B. Henderson-Sellers. Software Metrics. Prentice Hall, Hemel Hempstead, U.K., 1996.

[50] B. Henderson-Sellers and J. Edwards. Object-Oriented Knowledge: The Working Object (Book Two). Prentice Hall, Sydney, Australia, 1994.

[51] M. Hitz and B. Montazeri. Measuring coupling and cohesion in object-oriented systems. In International Symposium on Applied Corporate Computing, pages 25–27, Monterrey, Mexico, October 1995.

[52] M. Hitz and B. Montazeri. Measuring product attributes of object-oriented systems. In Fifth European Software Engineering Conference, pages 124–136, Barcelona, Spain, September 1995.

[53] C. Howells. Gretel: An open-source residual test coverage tool, June 2002. http://www.cs.uoregon.edu/research/perpetual/Software/Gretel/.

[54] T.O. Humphries, A. Klauser, A.L. Wolf, and B.G. Zorn. An infrastructure for generating and sharing experimental workloads for persistent object systems. Software–Practice and Experience, 30:387–417, 2000.

[55] Jakarta. The Apache Jakarta Project. http://jakarta.apache.org/.

[56] I.T. Jolliffe. Principal Component Analysis. Springer Verlag, 2nd edition, 2002.

[57] Jikes JVM. http://www-124.ibm.com/developerworks/oss/jikes/.

[58] Kaffe JVM. http://www.kaffe.org/.

[59] Sable JVM. http://sablevm.org/.

[60] B. Kitchenham and S.L. Pfleeger. Software quality: The elusive target. IEEE Software, pages 12–21, 1996.


[61] A. Lakhotia. Rule-based approach to computing module cohesion. In 15th International Conference on Software Engineering, pages 35–44, Baltimore, Maryland, USA, 1993.

[62] Y.S. Lee, B.S. Liang, S.F. Wu, and F.J. Wang. Measuring the coupling and cohesion of an object-oriented program based on information flow. In International Conference on Software Quality, pages 81–90, Maribor, Slovenia, 1995.

[63] W. Li and S. Henry. Object-oriented metrics that predict maintainability. The Journal of Systems and Software, 23(2):111–122, 1993.

[64] R.J. Lipton, R.A. DeMillo, and F.G. Sayward. Hints on test data selection: Help for the practicing programmer. IEEE Computer, 11(4):34–41, 1978.

[65] M. Lorenz and J. Kidd. Object-Oriented Software Metrics. Prentice Hall Object-Oriented Series, Englewood Cliffs, USA, 1994.

[66] Y. Ma, J. Offutt, and Y. Kwon. MuJava: An automated class mutation system. The Journal of Software Testing, Verification and Reliability, 15(2):97–133, June 2005.

[67] Y.S. Ma, Y.R. Kwon, and J. Offutt. MuJava. http://www.isse.gmu.edu/faculty/ofut/mujava/.

[68] Y.S. Ma, Y.R. Kwon, and J. Offutt. Inter-class mutation operators for Java. In 13th International Symposium on Software Reliability Engineering, pages 352–363, Annapolis, Maryland, USA, November 2002.

[69] Y.K. Malaiya, M.N. Li, J.M. Bieman, and R. Karcich. Software reliability growth with test coverage. IEEE Transactions on Reliability, 51(4):420–426, December 2002.

[70] R. Martin. OO design quality metrics: An analysis of dependencies. In Proceedings Workshop on Pragmatic and Theoretical Directions in Object-Oriented Software Metrics, 1994.


[71] T. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308–320, 1976.

[72] J.D. McGregor and D.A. Sykes. A Practical Guide to Testing Object-Oriented Software. Addison Wesley, March 2001.

[73] P.C. Mehlitz. Performance analysis of Java implementations. http://www.transvirtual.com/presentations/speed/index.html.

[74] A. Mitchell and J.F. Power. Dynamic coupling and cohesion metrics for Java programs. Master's thesis, Department of Computer Science, N.U.I. Maynooth, Co. Kildare, Ireland, August 2002.

[75] A. Mitchell and J.F. Power. Run-time cohesion metrics for the analysis of Java programs - preliminary results from the SPEC and Grande suites. Technical Report NUIM-CS-TR2003-08, Department of Computer Science, N.U.I. Maynooth, Co. Kildare, Ireland, April 2003.

[76] A. Mitchell and J.F. Power. Run-time coupling metrics for the analysis of Java programs - preliminary results from the SPEC and Grande suites. Technical Report NUIM-CS-TR2003-07, Department of Computer Science, N.U.I. Maynooth, Co. Kildare, Ireland, April 2003.

[77] A. Mitchell and J.F. Power. Toward a definition of run-time object-oriented metrics. In 7th ECOOP Workshop on Quantitative Approaches in Object-Oriented Software Engineering, Darmstadt, Germany, July 2003.

[78] A. Mitchell and J.F. Power. An empirical investigation into the dimensions of run-time coupling in Java programs. In 3rd Conference on the Principles and Practice of Programming in Java, pages 9–14, Las Vegas, Nevada, USA, June 16-18 2004.

[79] A. Mitchell and J.F. Power. Run-time cohesion metrics: An empirical investigation. In International Conference on Software Engineering Research and Practice, pages 532–537, Las Vegas, Nevada, USA, June 21-24 2004.


[80] A. Mitchell and J.F. Power. A study of the influence of coverage on the relationship between static and dynamic coupling metrics. Science of Computer Programming, accepted for publication, March 2005.

[81] A. Mitchell and J.F. Power. Using object-level run-time metrics to study coupling between objects. In ACM Symposium on Applied Computing, pages 1456–1463, Santa Fe, New Mexico, USA, March 13-17 2005.

[82] G. Myers. Reliable Software Through Composite Design. Mason and Lipscomb Publishers, New York, USA, 1974.

[83] G. Myers. Composite Structured Design. Van Nostrand Reinhold, New York, USA, 1978.

[84] S. Ntafos. A comparison of some structural testing strategies. IEEE Transactions on Software Engineering, 14(6):868–874, June 1988.

[85] A.J. Offutt, M.J. Harrold, and P. Kolte. A software metrics system for module coupling. The Journal of Systems and Software, 20(3):295–308, 1993.

[86] J. Offutt, R. Alexander, Y. Wu, Q. Xiao, and C. Hutchinson. A fault model for subtype inheritance and polymorphism. In 12th International Symposium on Software Reliability Engineering, pages 84–93, Hong Kong, China, November 2001.

[87] J. Offutt, A. Lee, G. Rothermel, R. Untch, and C. Zapf. An experimental determination of sufficient mutation operators. ACM Transactions on Software Engineering and Methodology, 5(2):99–118, April 1996.

[88] L.M. Ott and J.J. Thuss. The relationship between slices and module cohesion. In 11th International Conference on Software Engineering, pages 198–204, Pittsburgh, Pennsylvania, USA, 1989.

[89] M. Page-Jones. The Practical Guide to Structured Systems Design. Yourdon Press, New York, NY, 1980.


[90] E.S. Pearson and H.O. Hartley. Biometrika Tables for Statisticians, volume 2. Cambridge University Press, Cambridge, England, 1972.

[91] S. Phattarsukol and P. Muenchaisri. Identifying candidate objects using hierarchical clustering analysis. In 8th Asia-Pacific Software Engineering Conference, pages 381–389, Macau, China, December 4-7 2001.

[92] The Apache XML Project. Xalan. http://xml.apache.org/xalan-j/.

[93] S.S. Shapiro and M.B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611, 1965.

[94] M.L. Shooman. Software Engineering: Design, Reliability and Management. McGraw Hill, New York, USA, 1983.

[95] W.P. Stevens, G.J. Myers, and L.L. Constantine. Structured design. IBM Systems Journal, 13(2):115–139, 1974.

[96] Sun Microsystems, Inc. Java Platform Debug Architecture (JPDA). http://java.sun.com/products/jpda.

[97] TimeWeb. Correlation explained. http://www.bized.ac.uk/timeweb/crunching/crunch_relate_expl.htm, 2002.

[98] D.A. Troy and S.H. Zweben. Measuring the quality of structured designs. The Journal of Systems and Software, 2:112–120, 1981.

[99] S.M. Yacoub, H.H. Ammar, and T. Robinson. Dynamic metrics for object-oriented designs. In Software Metrics Symposium, pages 50–61, Boca Raton, Florida, USA, November 4-6 1999.

Note: All URLs correct as of 30th September 2005.