AN EMPIRICAL STUDY OF RUN-TIME COUPLING
AND COHESION SOFTWARE METRICS
Aine Mitchell
Supervisor: Dr. James Power
A Thesis presented for the degree of
Doctor of Philosophy in Computer Science
Department of Computer Science
National University of Ireland, Maynooth
Co. Kildare, Ireland
October 2005
Dedicated to my parents, Patrick and Ann Mitchell
An empirical study of run-time coupling and
cohesion software metrics
Aine Mitchell
Submitted for the degree of Doctor of Philosophy
Oct 2005
Abstract
The extent of coupling and cohesion in an object-oriented system has implications
for its external quality. Various static coupling and cohesion metrics have been
proposed and used in past empirical investigations; however, none of these have
taken the run-time properties of a program into account. As program behaviour is
a function of its operational environment as well as the complexity of the source
code, static metrics may fail to quantify all the underlying dimensions of coupling
and cohesion. By considering both of these influences, one will acquire a more
comprehensive understanding of the quality of critical components of a software
system. We believe that any measurement of these attributes should include changes
that take place at run-time. For this reason, in this work we address the utility of
run-time coupling and cohesion complexity through the empirical evaluation of a
selection of run-time measures for these properties. This study is carried out using
a comprehensive selection of Java benchmark and real world programs.
Our first case study investigates the influence of instruction coverage on the re-
lationship between static and run-time coupling metrics. Our second case study de-
fines a new run-time coupling metric that can be used to study object behaviour and
investigates the ability of measures of run-time cohesion to predict such behaviour.
Finally, we investigate whether run-time coupling metrics are good predictors of
software fault-proneness in comparison to standard coverage measures. To the best
of our knowledge this is the largest empirical study that has been performed to date
on the run-time analysis of Java programs.
Declaration
The work in this thesis is based on research carried out at the Department of Com-
puter Science, in the National University of Ireland Maynooth, Co. Kildare, Ireland.
No part of this thesis has been submitted elsewhere for any other degree or qualifi-
cation and it is all my own work unless referenced to the contrary in the text.
Signature: ..............................  Date: ..............................
Copyright © 2005 Aine Mitchell.
“The copyright of this thesis rests with the author. No quotations from it should be
published without the author’s prior written consent and information derived from
it should be acknowledged”.
iv
Acknowledgements
I would like to thank my PhD adviser, Dr. James Power, for his advice, guidance,
support, and encouragement throughout my PhD effort.
A special thanks to my parents without whose continual support this work would
not have been possible.
I would also like to thank all my friends who were there for me throughout it all.
This work has been funded by the Embark initiative, operated by the Irish
Research Council for Science, Engineering and Technology (IRCSET).
v
Contents

Abstract
Declaration
Acknowledgements

1 Introduction
1.1 Software Metrics and Complexity
1.2 Traditional Measures of Complexity
1.3 Object-Oriented Metrics
1.4 Definitions of Coupling
1.5 Definitions of Cohesion
1.6 Static and Run-time Metrics
1.7 Factors Influencing Software Metrics
1.7.1 Coverage
1.7.2 Metrics and Object Behaviour
1.7.3 Metrics and Software Testing
1.8 Aims of Thesis
1.9 Structure of Thesis

2 Literature Review
2.1 Static Coupling Metrics
2.1.1 Chidamber and Kemerer
2.1.2 Other Coupling Metrics
2.2 Frameworks for Static Coupling Measurement
2.2.1 Eder et al.
2.2.2 Hitz and Montazeri
2.2.3 Briand et al.
2.2.4 Revised Framework by Briand et al.
2.3 Static Cohesion Metrics
2.3.1 Chidamber and Kemerer
2.3.2 Other Cohesion Metrics
2.4 Frameworks for Static Cohesion Measurement
2.4.1 Eder et al.
2.4.2 Briand et al.
2.5 Run-time/Dynamic Coupling Metrics
2.5.1 Yacoub et al.
2.5.2 Arisholm et al.
2.6 Run-time/Dynamic Cohesion Metrics
2.6.1 Gupta and Rao
2.7 Other Studies of Dynamic Behaviour
2.7.1 Dynamic Behaviour Studies
2.8 Coverage Metrics and Software Testing
2.8.1 Instruction Coverage
2.8.2 Alexander and Offutt
2.9 Previous Work by the Author
2.10 Definition of Run-time Metrics
2.10.1 Coupling Metrics
2.10.2 Cohesion Metrics
2.11 Conclusion

3 Experimental Design
3.1 Methods for Collecting Run-time Information
3.1.1 Instrumenting a Virtual Machine
3.1.2 Sun's Java Platform Debug Architecture (JPDA)
3.1.3 Bytecode Instrumentation
3.2 Metrics Data Collection Tools (Design Objectives)
3.2.1 Class-Level Metrics Collection Tool (ClMet)
3.2.2 Object-Level Metrics Collection Tool (ObMet)
3.2.3 Static Data Collection Tool (StatMet)
3.2.4 Coverage Data Collection Tool (InCov)
3.2.5 Fault Detection Study
3.3 Test Case Programs
3.3.1 Benchmark Programs
3.3.2 Real-World Programs
3.3.3 Execution of Programs
3.4 Statistical Techniques
3.4.1 Descriptive Statistics
3.4.2 Normality Tests
3.4.3 Normalising Transformations
3.4.4 Pearson Correlation Test
3.4.5 T-Test
3.4.6 Principal Component Analysis
3.4.7 Cluster Analysis
3.4.8 Regression Analysis
3.4.9 Analysis of Variance (ANOVA)
3.5 Conclusion

4 Case Study 1: The Influence of Instruction Coverage on the Relationship Between Static and Run-time Coupling Metrics
4.1 Goals and Hypotheses
4.2 Experimental Design
4.3 Results
4.3.1 Experiment 1: To investigate the relationship between static and run-time coupling metrics
4.3.2 Experiment 2: The influence of instruction coverage
4.4 Conclusion

5 Case Study 2: The Impact of Run-time Cohesion on Object Behaviour
5.1 Goals and Hypotheses
5.2 Experimental Design
5.3 Results
5.3.1 Experiment 1: To determine if objects from the same class behave differently at run-time from the point of view of coupling
5.3.2 Experiment 2: The influence of cohesion on the NOC
5.4 Conclusion

6 Case Study 3: A Study of Run-time Coupling Metrics and Fault Detection
6.1 Goals and Hypotheses
6.2 Experimental Design
6.3 Results
6.3.1 Experiment 1: To examine the relationship between instruction coverage and fault detection
6.3.2 Experiment 2: To examine the relationship between run-time coupling metrics and fault detection
6.4 Conclusion

7 Conclusions
7.1 Contributions
7.2 Applications of this Work
7.3 Threats to Validity
7.3.1 Internal Threats
7.3.2 External Threats
7.4 Future Work

Appendix

A Case Study 1: To Investigate the Influence of Instruction Coverage on the Relationship Between Static and Run-time Coupling Metrics
A.1 PCA Test Results for all programs
A.1.1 SPECjvm98 Benchmark Suite
A.1.2 JOlden Benchmark Suite
A.1.3 Real-World Programs, Velocity, Xalan and Ant
A.2 Multiple linear regression results for all programs
A.2.1 SPECjvm98 Benchmark Suite
A.2.2 JOlden Benchmark Suite
A.2.3 Real-World Programs, Velocity, Xalan and Ant

B Case Study 2: The Impact of Run-time Cohesion on Object Behaviour
B.1 PCA Test Results for all programs
B.1.1 JOlden Benchmark Suite
B.1.2 Real-World Programs, Velocity, Xalan and Ant
B.2 Multiple linear regression results for all programs
B.2.1 JOlden Benchmark Suite
B.2.2 Real-World Programs, Velocity, Xalan and Ant

C Case Study 3: A Study of Run-time Coupling Metrics and Fault Detection
C.1 Regression analysis results for real-world programs, Velocity, Xalan and Ant
C.1.1 For Class Mutants
C.1.2 For Traditional Mutants
C.2 Regression analysis results for real-world programs, Velocity, Xalan and Ant
C.2.1 For Class Mutants
C.2.2 For Traditional Mutants

D Mutation operators in µJava
List of Figures

1.1 The software quality model shows how different measures of internal quality can characterise the overall quality of a software product

3.1 Components of run-time class-level metrics collection tool, ClMet
3.2 Components of run-time object-level metrics collection tool, ObMet
3.3 Components of static metrics collection tool, StatMet
3.4 Dendrogram: at the cutting line there are two clusters

4.1 PCA test results for all programs for metrics in PC1, PC2 and PC3. In all graphs the bars represent the PCA value obtained for the corresponding metric. PC1 contains the import-level run-time metrics, PC2 contains the export-level run-time metrics and PC3 contains the static CBO metric.
4.2 Multiple linear regression results for class-level metrics (IC_CC and EC_CC). In both graphs the lighter bars represent the R² value for CBO, and the darker bars represent the R² value for CBO and Ic combined.
4.3 Multiple linear regression results for method-level metrics (IC_CM and EC_CM). In both graphs the lighter bars represent the R² value for CBO, and the darker bars represent the R² value for CBO and Ic combined.

5.1 CV of IC_OC for classes from the programs studied. The bars represent the number of classes in each program that have a CV in the corresponding range.
5.2 NOC results of cluster analysis. The bars represent the number of classes in each program that have the corresponding NOC value.
5.3 PCA test results for all programs for metrics in PC1 and PC2. In both graphs the bars represent the PCA value obtained for the corresponding metric. PC1 contains R_LCOM and RW_LCOM; PC2 contains S_LCOM.
5.4 Results from multiple linear regression where Y = NOC. The lighter bars represent the R² value for S_LCOM, and the darker bars represent the R² value for S_LCOM and R_LCOM combined.

6.1 Mutation test results for real-world programs Velocity, Xalan and Ant. In all graphs the bars represent the number of classes that exhibit a percentage mutant kill rate in the corresponding range.
6.2 Regression analysis results for the effectiveness of Ic in predicting class and traditional-level mutations in real-world programs Velocity, Xalan and Ant. The bars represent the R² value for the run-time metric under consideration.
6.3 Regression analysis results for the effectiveness of run-time coupling metrics in predicting class-level mutations in real-world programs Velocity, Xalan and Ant. The bars represent the R² value for the run-time metric under consideration.

7.1 Findings from case study one, which show that our run-time coupling metrics are not simply surrogate measures for static CBO, and that coverage plus static metrics are better predictors of run-time measures than static measures alone.
7.2 Findings from case study two, which show that run-time object-level coupling measures can be used to identify objects exhibiting different behaviours at run-time, and that run-time cohesion measures are good predictors of this type of behaviour.
7.3 Findings from case study three, which show that run-time coupling metrics are good predictors of class-type faults and that instruction coverage is a good predictor of traditional faults in programs.
List of Tables

2.1 Abbreviations for the dynamic coupling metrics of Arisholm et al.
3.1 Description of the SPECjvm98 benchmarks
3.2 Description of the JOlden benchmarks
3.3 Programs used for each case study
4.1 Descriptive statistic results for all programs
5.1 Matrix of unique accesses per object, for objects BlackNode1, ..., BlackNode4 to classes GreyNode, QuadTreeNode and WhiteNode
5.2 Descriptive statistic results for all programs
D.1 Traditional-level mutation operators in µJava
D.2 Class-level mutation operators in µJava
Chapter 1
Introduction
Software metrics have become essential in some disciplines of software engineering.
In forward engineering they are used to measure software quality and to estimate
the cost and effort of software projects [40]. In the field of software evolution,
metrics can be used for identifying stable or unstable parts of software systems,
as well as identifying where refactorings can be applied or have been applied [32],
and detecting increases or decreases of quality in the structure of evolving software
systems. In the field of software re-engineering and reverse engineering, metrics are
used for assessing the quality and complexity of software systems, and also to get a
basic understanding and provide clues about sensitive parts of software systems [27].
1.1 Software Metrics and Complexity
Software metrics evaluate different aspects of the complexity of a software product.
Software complexity was originally defined as “a measurement of the resources that
must be expended in developing, testing, debugging, maintenance, user training,
operation, and correction of software products” [94]. Complexity has been char-
acterised in terms of seven different levels, the correlation and interdependence of
which will determine the overall level of complexity in a software product [44]. The
levels are as follows:
• Control Structure
• Module Coupling
• Algorithm
• Code
• Nesting
• Module Cohesion
• Data Structure
However, most metrics measure only one software complexity factor. These
foundations of complexity will determine the internal quality of a product.
Internal quality measures are those made in terms of the software product itself and are measurable both during and after the creation of the software product. They have, however, no inherent practical meaning in themselves. To give them meaning, they must be characterised in terms of the product's external quality.
External quality measures are evaluated with respect to how a product relates to its environment and are deemed to be inherently meaningful; examples include the maintainability or testability of a product.
It should be noted that good internal quality is a requirement for good external
quality. Figure 1.1 illustrates the software quality model which depicts the relation-
ship between these measures. Much research has contributed models and measures
of both internal software quality attributes and external attributes of a design. Al-
though the relationships between these attributes are for the most part intuitive, e.g.,
more complex code will require greater effort to maintain, the precise functional form
of those relationships can be less clear and is the subject of intense practical and
research concern [31]. Empirical validation aims at demonstrating the usefulness
of a measure in practice and is, therefore, a crucial activity to establish the overall
validity of a measure [6]. Therefore it is the belief of the author that a well-designed
empirical study serves to clarify and strengthen the observed relationships.
Figure 1.1: The software quality model shows how different measures of internal quality can characterise the overall quality of a software product. (The figure relates internal quality, measured by complexity metrics such as coupling and cohesion, to external quality attributes such as maintainability, reusability and testability, and to quality in use.)
1.2 Traditional Measures of Complexity
The earliest software measure, which was proposed in the late 1960s, is the Source
Lines of Code (SLOC) metric, which is still used today. It is used to measure the
amount of code in a software program. It is typically used to estimate the amount of
effort that will be required to develop a program, as well as to estimate productivity
or effort once the software is produced. Two major types of SLOC measures exist:
physical SLOC and logical SLOC. Exact definitions of these measures vary. The
most common definition of physical SLOC is a count of “non-blank, non-comment
lines” in the text of the program’s source code. Logical SLOC measures attempt
to measure the number of “statements”, however their specific definitions are tied
to specific computer languages. Therefore, it is much easier to create tools that
measure physical SLOC, and physical SLOC definitions are easier to explain. How-
ever, physical SLOC measures are sensitive to logically irrelevant formatting and
style conventions, while logical SLOC is less sensitive to formatting and style con-
1.3. Object-Oriented Metrics 4
ventions.
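As a rough illustration of the physical SLOC definition (our own sketch, not one of the tools developed in this thesis), a minimal Java counter of non-blank, non-comment lines might look as follows; for simplicity it ignores block comments that share a line with code, a case a real tool would have to handle:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class SlocCounter {
        // Counts physical SLOC: non-blank, non-comment lines.
        public static long countPhysicalSloc(Path source) throws IOException {
            long sloc = 0;
            boolean inBlockComment = false;
            for (String raw : Files.readAllLines(source)) {
                String line = raw.trim();
                if (inBlockComment) {
                    if (line.contains("*/")) inBlockComment = false;
                    continue;
                }
                if (line.isEmpty() || line.startsWith("//")) continue;
                if (line.startsWith("/*")) {
                    if (!line.contains("*/")) inBlockComment = true;
                    continue;
                }
                sloc++; // a countable physical source line
            }
            return sloc;
        }
    }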
There are a number of drawbacks to using a crude measure such as LOC as a surrogate measure for different notions of program size such as effort, functionality and
complexity. The need for more discriminating measures became especially urgent
with the increasing diversity of programming languages, as LOC in an assembly
language is not comparable in effort, functionality, or complexity to an LOC in a
high-level language [39].
Thus from the mid-1970s there was an increase in the number of different com-
plexity metrics defined. Some of the more prevalent ones were Halstead’s software
science metrics [47], which made an attempt to capture notions of size and com-
plexity beyond simply counting lines of code. Although the work has had a lasting impact, these metrics are principally regarded as an example of confused and inadequate measurement [40].
McCabe defined a measure known as Cyclomatic Complexity [71]. It may be
considered as a broad measure of soundness and confidence for a program. It mea-
sures the number of linearly-independent paths through a program module and it is
intended to be independent of language and language format.
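For example (an illustrative sketch of our own), the following Java method contains two decision points, one for loop and one if statement, so under the usual "number of decisions plus one" computation its cyclomatic complexity is 2 + 1 = 3:

    class ComplexityExample {
        // Two decision points (for, if) + 1 = cyclomatic complexity of 3
        static int sumPositives(int[] values) {
            int total = 0;
            for (int v : values) { // decision point 1: loop condition
                if (v > 0) {       // decision point 2: branch
                    total += v;
                }
            }
            return total;
        }
    }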
Function points, which were pioneered by Albrecht [2] in 1977, are a measure
of the size of computer applications and the projects that build them. The size is
measured from a functional, or user, point of view. It is independent of the computer
language, development methodology, technology or capability of the project team
used to develop the application. The original metric has been augmented and refined
to cover more than the original emphasis on business-related data processing.
However, as object-oriented techniques became more prevalent, there was an increasing need for metrics that could correctly evaluate their properties.
1.3 Object-Oriented Metrics
Object-oriented design and development is becoming very popular in today’s soft-
ware development environment. Object-oriented development requires not only a
different approach to design and implementation but also a different approach to software metrics. Since object-oriented technology uses objects and not algorithms as its fundamental building blocks, the approach to software metrics for object-oriented programs must be different from the standard metrics set. Metrics such as lines of code and cyclomatic complexity have become accepted as standard for traditional functional/procedural programs and were used to evaluate object-oriented
environments at the beginning of the object-oriented design revolution. However,
traditional metrics for procedural approaches are not adequate for evaluating object-
oriented software, primarily because they are not designed to measure basic elements
like classes, objects, polymorphism, and message-passing. Even when adjusted to
syntactically analyse object-oriented software they can only capture a small part of
such software and thus provide a weak quality indication [50, 65]. Since this time
there have been many proposed object-oriented metrics in the literature. The ques-
tion now is, which object-oriented metrics should a project use? As the quality
of object-oriented software, like other software, is a complex concept there can be
no single, simple measure of software quality acceptable to everyone. To assess or
improve software quality in you must define the aspects of quality in which you are
interested, then decide how you are going to measure them. By defining quality
in a measurable way, you make it easier for other people to understand your view-
point and relate your notions to their own [60]. As illustrated in Chapter 2, some of
the seminal methods of evaluating an object-oriented design are through the use of
measures for coupling and cohesion.
1.4 Definitions of Coupling
Stevens et al. [95] first introduced coupling in the context of structured development
techniques. They defined coupling as “the measure of the strength of association
established by a connection from one module to another”. They stated that the
stronger the coupling between modules, that is, the more inter-related they are, the
more difficult these modules are to understand, change and correct and thus the
more complex the resulting software system.
Myers [82] refined the concept of coupling by defining six distinct levels of coupling. However, coupling could only be determined by hand, as the definitions were
neither precise nor prescriptive, leaving room for subjective interpretations of the
levels.
Constantine and Yourdon [29] also stated that the modularity of software design
can be measured by coupling and cohesion. They stated that coupling between two
units reflects the interconnections between units and that faults in one unit may
affect the coupled unit.
Page-Jones [89] ordered coupling into eight different levels according to their
effects on the understandability, maintainability, modifiability and reusability of the
coupled modules.
Troy and Zweben [98] showed that coupling between units is a good indicator
of the number of faults in software. However their study was based on subjective
interpretation of design documents instead of real code.
Offutt et al. [85] extended the eight levels of coupling to twelve, thus providing a
finer grained measure of coupling. They also described algorithms to automatically
measure the coupling level between each pair of units in a program. The coupling
levels are defined between pairs of units A and B. For each coupling level the param-
eters are classified by the way they are used. Uses are classified into computation
uses (C-uses) [42], predicate uses (P-uses) and indirect uses (I-uses) [85]. A C-use
occurs when a variable is used on the right side of an assignment statement, in an
output statement, or a procedure call. A P-use occurs when a variable is used in a
predicate statement. An I-use occurs when a variable is used in an assignment to
another variable and the defined variable is later used in a predicate. The I-use is
considered to be in the predicate rather than in the assignment.
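The following Java fragment (our own illustration; the variable names are hypothetical and not taken from [85]) marks one example of each kind of use:

    class UseExamples {
        static void classifyUses(int a, int b) {
            int x = a + 1;         // C-use of a: right-hand side of an assignment
            System.out.println(x); // C-use of x: used in an output statement
            int flag = b + 1;      // I-use of b: assigned to flag, which is later
                                   // used in a predicate; the I-use is counted
                                   // in the predicate, not in this assignment
            if (flag > 0) {        // P-use of flag: used in a predicate statement
                x = 0;
            }
        }
    }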
1.5 Definitions of Cohesion
The cohesion of a module is the extent to which its individual components are
needed to perform the same task [40]. Cohesion was first introduced within the
context of module design by Stevens et al. [95]. In their definition, the cohesion of a
module is measured by inspecting the association between all pairs of its processing
elements. The term processing element was defined as an action performed by
a module such as a statement, procedure call, or something which must be done
in a module but which has not yet been reduced to code [29]. Their definition was informal, thereby leaving it open to interpretation. They developed a scale of cohesion that provides an ordinal scale of measurement describing the degree to
which the actions performed by a module contribute to a unified function. There
are seven categories of cohesion which range from the most desirable (functional) to
least desirable (coincidental). They stated that it is possible for a module to exhibit
more than one type of cohesion, in this case the module is categorized by its least
desirable type of cohesion. In the principle of good software design it is desirable to
have highly cohesive modules, preferably functional.
Emerson [36,37] based his cohesion measure on a control flow graph representa-
tion of a module. The range of this complexity measure varies from 0 to 1. Emerson
indicates that his method for computing cohesion is related to program slicing. He
reclassifies the seven levels of cohesion into three.
Ott and Thuss [88] used program slicing to evaluate their cohesion measurements.
They reclassified the original seven levels of cohesion into four categories.
Lakhotia [61] codified the natural language definitions of the seven levels of
cohesion. He developed a method for computing cohesion based on an analysis of
the variable dependence graphs of a module. Pairs of outputs were examined to
identify any data or control dependences that exist between the two outputs. Rules
were provided for determining the cohesion of the pairs.
1.6 Static and Run-time Metrics
A large number of metrics have been proposed to measure object-oriented de-
sign quality. Design metrics can be classified into two categories: static and run-
time/dynamic. Static metrics measure what may happen when a program is ex-
ecuted and are said to quantify different aspects of the complexity of the source
code. Run-time metrics measure what actually happens when a program is exe-
cuted. They evaluate the source code’s run-time characteristics and behaviour as
well as its complexity.
Despite the rich body of research and practice in developing design quality met-
rics, there has been less emphasis on run-time metrics for object-oriented designs
mainly due to the fact that a run-time code analysis is more expensive and complex
to perform [99]. However, due to polymorphism, dynamic binding, and the common
presence of unused (dead) code in software, static coupling and cohesion measures
do not perfectly reflect the actual situation taking place amongst classes at run-time.
The complex dynamic behaviour of many real-time applications motivates a shift
in interest from traditional static metrics to run-time metrics. In this work, we in-
vestigate whether useful information on design quality can be provided by run-time
measures of coupling and cohesion over and above that which is given by simple
static measures. This will determine if it is worthwhile to continue the investigation
into run-time coupling and cohesion metrics and their relationship with the external
quality.
1.7 Factors Influencing Software Metrics
This section discusses factors which affect software metrics, including coverage and
object-level behaviour. The relationship with software testing is also discussed.
1.7.1 Coverage
When relating static and run-time measures, it is important to have a thorough un-
derstanding of the degree to which the analysed source code corresponds to the code
that is actually executed. In this thesis, this relationship is studied using instruc-
tion coverage measures with regard to the influence of coverage on the relationship
between static and dynamic metrics. It is proposed that coverage results have a sig-
nificant influence on the relationship and thus should always be a measured, recorded
factor in any such comparison.
1.7.2 Metrics and Object Behaviour
To date little work has been done on the analysis of code at the object-level, that is
the use of metrics to identify specific object behaviours. We identify this behaviour
through the use of run-time object-level coupling metrics. Run-time object-level
coupling quantifies the level of dependencies between objects in a system whereas
run-time class-level coupling quantifies the level of dependencies between the classes
that implement the methods or variables of the caller object and the receiver object
[5]. The class of the object sending or receiving a message may be different from the
class implementing the corresponding method due to the impact of inheritance. We
also investigate the ability of run-time cohesion measures to predict such behaviour.
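A small Java sketch (our own, with hypothetical class names) illustrates why the two levels can differ: the class of the receiver object is Circle, but the class implementing the invoked method is Shape.

    class Shape {
        void draw() { /* implementation inherited by Circle */ }
    }

    class Circle extends Shape {
        // draw() is not overridden; Shape provides the implementation
    }

    class Canvas {
        void render(Shape s) {
            s.draw(); // object-level coupling: a Canvas object couples to a
                      // Circle object; class-level coupling: Canvas couples to
                      // Shape, the class that implements the invoked method
        }

        public static void main(String[] args) {
            new Canvas().render(new Circle());
        }
    }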
1.7.3 Metrics and Software Testing
Testing is one of the most effort-intensive activities during software development [7].
Much research is directed toward developing new and improved fault detection mech-
anisms. A number of papers have investigated the relationships between static design
metrics and the detection of faults in object-oriented software [6, 15]. However, to
date no work has been conducted on the correlation of run-time coupling metrics and
fault detection. In this thesis, we investigate whether measures for run-time coupling
are good predictors of fault-proneness, an important software quality attribute.
1.8 Aims of Thesis
In summary, the central aims of this thesis are to outline operational definitions for
run-time class and object-level coupling and cohesion metrics suitable for evaluating
the quality of an object-oriented application. The motivation for these measures
is to complement existing measures that are based on static analysis by actually
measuring coupling and cohesion at run-time.
It is necessary to provide tools for collecting such measures for Java systems accurately and efficiently. Java was chosen as the target language for this analysis be-
cause Java is executed on a virtual machine which makes it relatively simple to
collect run-time trace information in comparison to languages like C or C++. Java
also combines a wide range of language features found in different programming
languages, for example, an object-oriented model, exception handling and garbage
collection. Its features of portability, robustness, simplicity and security have made
it increasingly popular within the software engineering community, underpinning its
importance and providing a good selection of sample applications for study.
Finally, a thorough empirical investigation using both Java benchmark and real-
world programs needs to be performed. The objectives of this are:
1. To assess the fundamental properties of the run-time measures and to inves-
tigate whether they are redundant with respect to the most commonly used
coupling and cohesion measures, as defined by Chidamber and Kemerer [26].
2. To examine the influence of test case coverage on the relationship between
static and run-time coupling metrics. Intuitively, one would expect the better
the coverage of the test cases used, the better the static and run-time metrics
should correlate.
3. To investigate run-time object behaviour, that is, to determine if objects from
the same class behave differently at run-time, through the use of object-level
coupling metrics.
4. To investigate run-time object behaviour using run-time measures for cohesion.
5. To conduct a study investigating the correlation between run-time coupling
measures and fault detection in object-oriented software.
1.9 Structure of Thesis
This thesis describes how coupling and cohesion can be defined and precisely mea-
sured based on the run-time analysis of systems. An empirical evaluation of the pro-
posed run-time measures is reported using a selection of benchmarks and real-world
Java applications. An investigation is conducted to determine if these measures are
redundant with respect to their static counterparts. We also determine if coverage
has a significant impact on the correlation between static and run-time metrics. We
examine object behaviour using a run-time object-level coupling metric and we investigate the influence of run-time cohesion metrics on this behaviour. Finally, we study
the fault detection capabilities of run-time coupling measures.
Chapter 2 presents a literature survey of coupling and cohesion metrics and
associated studies. Chapter 3 defines the run-time metrics used in this study and
outlines the experimental tools and techniques. Chapter 4 presents a case study on
the correlation between static and run-time coupling measures and the influence of
coverage on this correlation. Chapter 5 discusses a case study on object behaviour
and the impact of cohesion on this. Chapter 6 presents a case study on run-time
coupling metrics and fault detection. Chapter 7 presents the final conclusions and
discusses future work.
Chapter 2
Literature Review
In this chapter, a comprehensive survey and literature review of existing static and
run-time/dynamic measures and frameworks for coupling and cohesion in object-
oriented systems is presented. Previous work describing a coupling-based testing approach for object-oriented software is also reviewed. Finally, the role coverage measures play in software testing is discussed. In Sections 2.1 and 2.3, we present existing coupling and cohesion measures and discuss them. Sections 2.2 and 2.4 present alternative frameworks for coupling and cohesion. Measures for the run-time
evaluation of coupling and cohesion are presented in Sections 2.5 and 2.6 respectively.
Other work in studies of dynamic behaviour is described in Section 2.7. A discussion
of coverage metrics and the role they play in software testing is presented in Section
2.8. Previous work by the author is discussed in Section 2.9. Finally, a description of the run-time measures used in the subsequent case studies is provided in Section 2.10.
2.1 Static Coupling Metrics
There exists a large variety of measurements for coupling. A comprehensive review
of existing measures performed by Briand et al. [13] found that more than thirty
different measures of object-oriented coupling exist. The most prevalent ones are
explained in the following subsections:
2.1.1 Chidamber and Kemerer
In their papers [25, 26] Chidamber and Kemerer propose and validate a set of six
software metrics for object-oriented systems, including two measures for coupling.
As these are the most accepted and widely used coupling metrics, we use these as
the basis for our run-time coupling measures.
Coupling Between Objects (CBO)
They first define a measure CBO for a class as, “a count of the number of nonin-
heritance related couples with other classes” [25]. An object of a class is coupled
to another if the methods of one class use the methods or attributes of the other.
They later revise this definition to state, “CBO for a class is a count of the number
of other classes to which it is coupled ” [26]. A footnote added that “this includes
coupling due to inheritance.”
They state that coupling has an adverse effect on the maintenance, reuse and
testing of a design and that excessive coupling between object classes is detrimental
to modular design and prevents reuse: the more independent a class is, the easier it is to reuse in another application. They state that inter-object class couples should
be kept to a minimum in order to improve modularity and promote encapsulation.
The larger the number of couples, the higher the sensitivity to changes in other parts
of the design, making maintenance more difficult. A measure of coupling is useful
to determine how complex the testing of various parts of a design are likely to be.
The higher the inter-object class coupling the more rigorous the testing needs to be.
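For instance (a hedged sketch with hypothetical classes), under the revised definition the class Order below has CBO = 2, because its methods use the methods or attributes of exactly two other classes:

    class Customer {
        String name;
    }

    class Logger {
        void log(String message) { System.out.println(message); }
    }

    class Order {
        private Customer customer = new Customer();
        private Logger logger = new Logger();

        void place() {
            // uses an attribute of Customer and a method of Logger,
            // so Order is coupled to both: CBO(Order) = 2
            logger.log("order placed for " + customer.name);
        }
    }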
Response for a Class (RFC)
The response set (RS) of a class is a set of methods that can potentially be executed
in response to a message received by an object of that class. RFC is simply the
number of methods in the set, that is, RFC = #{RS}. A given method is counted
only once. Since RFC specifically includes methods called from outside the class,
it is also a measure of the potential communication between the class and other
classes.
RS = M \cup \Bigl[ \bigcup_{i \in M} R_i \Bigr] \qquad (2.1)

Equation 2.1 gives the response set for a class, where R_i is the set of methods called by method i and M is the set of all methods in the class.
If a large number of methods can be invoked in response to a message, the testing
and debugging of the class becomes more complicated since it requires a greater level
of understanding on the part of the tester. The complexity of a class increases with
the number of methods that can be invoked from it.
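A small worked example (ours, not from [26]): class A below has two methods, and between them they call one method of another class, giving RFC(A) = 3.

    class B {
        void remote() { }
    }

    class A {
        private B b = new B();

        void m1() { m2(); }        // R_m1 = {m2}
        void m2() { b.remote(); }  // R_m2 = {B.remote}
    }
    // M = {m1, m2}
    // RS = M ∪ R_m1 ∪ R_m2 = {m1, m2, B.remote}, so RFC(A) = #{RS} = 3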
2.1.2 Other Coupling Metrics
In their paper [63] Li and Henry identify a number of metrics that can predict the
maintainability of a design. They define two measures, message passing coupling
(MPC) and data abstraction coupling (DAC). MPC is defined as the number of
send statements defined in a class. The number of send statements sent out from a
class may indicate how dependent the implementation of the local methods is on the
methods in other classes. MPC only counts invocations of methods of other classes,
not its own. DAC is defined as “the number of abstract data types (ADT) defined
in a class”. An ADT is defined in a class c if it is the type of an attribute of class c.
It is also specified that “the number of variables having an ADT type may indicate
the number of data structures dependent on the definitions of other classes”.
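A hedged Java sketch of both counts (hypothetical classes, reflecting our reading of the definitions above):

    class Address { }

    class Mailer {
        void send(String to) { }
    }

    class Customer {
        private Address home; // attribute of abstract data type Address:
                              // DAC(Customer) = 1

        void notifyCustomer() {
            Mailer mailer = new Mailer();
            mailer.send("hello");   // send statement to another class's method
            mailer.send("goodbye"); // second such invocation: MPC(Customer) = 2
            helper();               // invocation of the class's own method:
                                    // not counted by MPC
        }

        private void helper() { }
    }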
Martin describes two coupling metrics that can be used to measure the quality of
an object-oriented design in terms of the interdependence between the subsystems
of that design [70]. Afferent Coupling (Ca) is the number of classes outside this
category that depend upon classes within this category. Efferent Coupling (Ce) is
the number of classes inside this category that depend upon classes outside this
category. A category is a set of classes that belong together in the sense that
they achieve some common goal. Martin does not specify exactly what constitutes
dependencies between classes.
Abreu et al. present a coupling metric known as Coupling Factor (COF) for the
design quality evaluation of object-oriented software systems [1]. COF is the actual
number of client-server relationships between classes that are not related via inheritance, divided by the maximum possible number of such client-server relationships.
It is normalised to range between 0 and 1 to allow for comparisons for systems of
different sizes. It was not specified how to account for such factors as polymorphism
and method overriding.
Lee et al. measure coupling and cohesion of an object-oriented program based
on information flow through programs [62]. They define a measure, Information-
flow-based coupling (ICP), that counts for a method m of a class c, the number
of methods that are invoked polymorphically from other classes, weighted by the
number of parameters of the invoked method. This count can be scaled up to
classes and subsystems. They go on to derive two more sets of measures which
measure inheritance-based coupling (coupling to ancestor classes (IH-ICP)) and
noninheritance-based coupling (coupling to unrelated classes (NIH-ICP)) and de-
duce that ICP is simply the sum of IH-ICP and NIH-ICP.
Briand et al. perform a comprehensive empirical validation of product measures,
such as coupling and cohesion, in object-oriented systems and explore the probability
of fault detection in system classes during testing [11]. They define a number of
measures which count the number of class-attribute (CA), class-method (CM) and
method-method (MM) interactions for each class. They take into account which
class the interactions originate from or are directed at and the number of ancestor
or other classes. A CA-interaction occurs from class c to class d if an attribute of
class c is of type class d. A CM-interaction occurs from class c to class d if a newly
defined method of class c has a parameter of type class d. An MM-interaction occurs
from class c to class d if a method implemented at class c statically invokes a newly
defined or overriding method of class d, or receives a pointer to such a method. This
set has sixteen metrics in total.
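The three interaction types can be sketched in Java as follows (our own illustration; the classes c and d of the definitions are named C and D here):

    class D {
        void target() { }
    }

    class C {
        D attribute;              // CA-interaction: an attribute of C has type D

        void newMethod(D param) { // CM-interaction: a newly defined method of C
                                  // has a parameter of type D
            param.target();       // MM-interaction: a method of C statically
                                  // invokes a method of D
        }
    }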
2.2 Frameworks for Static Coupling Measurement
Several different authors describe frameworks to characterise different approaches to
coupling and to assign relative strengths to different types of coupling. A framework
defines what constitutes coupling. This is done in an attempt to determine the
potential use of coupling metrics and how different metrics complement each other.
There are three existing frameworks:
2.2.1 Eder et al.
Eder et al. identify three different types of relationships [34]. These are, interaction
relationships between methods, component relationships between classes, and inher-
itance between classes. These relationships are then used to derive three different
dimensions of coupling which are classified according to different strengths:
1. Interaction coupling: Two methods are said to be interaction coupled if i) one
method invokes the other, or ii) they communicate via the sharing of data.
There are seven types of interaction coupling.
2. Component coupling: Two classes c and d are component coupled, if d is the
type of either i) an attribute of c, or ii) an input or output parameter of a
method of c, or iii) a local variable of a method of c, or iv) an input or output
parameter of a method invoked within a method of c. There are four different
degrees of component coupling.
3. Inheritance coupling: two classes c and d are inheritance coupled, if one class
is an ancestor of the other. There are four degrees of inheritance coupling.
2.2.2 Hitz and Montazeri
Hitz and Montazeri derive two different types of coupling, object and class-level cou-
pling [52]. These are determined by the state of an object (value of its attributes at
a given moment at runtime) and state of an object’s implementation (class interface
and body at a given time in the development cycle).
Class level coupling (CLC) results from state dependencies between two classes
in a system during the development cycle. This can only be determined from a static
analysis of the design documents or source code. This is important when considering
maintenance and change dependencies as changes in one class may lead to changes
in other classes which use it.
Object level coupling (OLC) results from state dependencies between two objects
during the run-time of a system. This depends on concrete object structure at run-
time, which in turn is determined by actual input data. Therefore, it is a function
of design or source code and input data at run-time. This is relevant for run-time
oriented activities such as testing and debugging.
2.2.3 Briand et al.
In the framework by Briand et al. coupling is constituted as interactions between
classes [14]. The strength is determined by the type of the interaction (Class-
Attribute, Class-Method, Method-Method), the relationship between the classes (In-
heritance, Other) and the interaction’s locus of impact (Import/Client, Export/Server).
They assign no strengths to the different kinds of interactions. There are three basic
criteria in the framework which are as follows:
1. Type of interaction: This determines the mechanism by which two classes are
coupled. A class-attribute interaction is present if aggregation occurs, that is,
if a class c is the type of an attribute of class d. A class-method interaction
occurs if a class c is the type of a parameter of method md of a class d, or if a
class c is the return type of method md. A method-method interaction occurs
if a md of a class d directly invokes a method mc, or a method md receives via
parameter a pointer to mc thereby invoking mc indirectly.
2. Relationship: An inheritance relationship occurs if a class c is an ancestor of
class d or vice versa. Friendship is present if a class c declares class d as its
friend, which grants class d access to the non-public elements of c. There is
another relationship when no inheritance or friendship relationship is present
between classes c and d.
3. Locus: If a class c is involved in an interaction with another class, a distinction
is made between export and import coupling. Export is when a class c is the
used class or server in the interaction. Import is when a class c is the using
class or client in the interaction.
2.2.4 Revised Framework by Briand et al.
Briand et al. outline a new unified framework for coupling in object-oriented systems
[13]. It is characterised based on the issues identified by comparing existing coupling
frameworks. There are six different criteria in the framework and each criterion
determines one basic aspect of the resulting measure. The criteria are as follows:
1. The type of connection: This determines what constitutes coupling. It is the
type of link between a client and a server item which could be an attribute,
method, or class.
2. The locus of impact: This is import or export coupling. Import coupling anal-
yses attributes, methods, or classes in their role as clients of other attributes,
methods, or classes. Export coupling analyses the attributes, methods, and
classes in their role as servers to other attributes, methods or classes.
3. The granularity of the measure: This is the domain of the measure, that is,
what components are to be measured and how to count coupling connections.
4. The stability of the server: Should both stable and unstable classes be included? Classes can be a) stable, that is, not subject to change in the project at hand (for example, classes imported from libraries), or b) unstable, that is, subject to development or modification in the project at hand.
5. Direct or indirect coupling: Should only direct connections be counted or
should indirect connections also be taken into account?
6. Inheritance: Inheritance-based versus noninheritance-based coupling. Also
how to account for polymorphism and how to assign attributes and methods
to classes.
2.3 Static Cohesion Metrics
A large number of alternative measures are proposed for measuring cohesion. Briand
et al. [12] carry out a broad survey on the current state of cohesion measurement
in object-oriented systems and find fifteen separate measurements of cohesion. A
review of these measures is presented in the following subsections.
2.3.1 Chidamber and Kemerer
The Lack of Cohesion in Methods (LCOM1) measure was first suggested by Chi-
damber and Kemerer [25]. It is the most prevalently used cohesion measure today
and therefore is used as the basis for the definition of our run-time cohesion mea-
sures. It is defined as “the degree of similarity of methods” and is theoretically
based on the ontology of objects by Bunge [21]. Within this ontology, the similarity
of things is defined as the set of properties that the things have in common.
For a given class C with methods M_1, M_2, ..., M_n, let {I_i} be the set of instance variables accessed by method M_i. As there are n methods there will be n such sets, one set per method. The LCOM metric is then determined by counting the number of disjoint sets formed by the intersection of the n sets.
However, this was found to be quite ambiguous and the pair later redefined their
metric (LCOM2) [26]. For a class C_1 with n methods M_1, ..., M_n, let {I_i} be the set of instance variables referenced by method M_i. There are n such sets I_1, ..., I_n.
We can define two disjoint sets:
P = \{(I_i, I_j) \mid I_i \cap I_j = \emptyset\}, \quad Q = \{(I_i, I_j) \mid I_i \cap I_j \neq \emptyset\} \qquad (2.2)
The lack of cohesion in methods is then defined from the cardinality of these sets
by:
\mathrm{LCOM} = \begin{cases} |P| - |Q| & \text{if } |P| > |Q| \\ 0 & \text{otherwise} \end{cases} \qquad (2.3)
LCOM is an inverse cohesion measure. An LCOM value of zero indicates a
cohesive class. Cohesiveness of methods within a class is desirable as it promotes
encapsulation. Any measure of disparateness of methods helps identify flaws in the
design of classes. If the value is greater than zero this indicates that the class can
be split into two or more classes, since its variables belong in disjoint sets. Low
cohesion is said to increase complexity, thereby increasing the likelihood of errors
during the development process.
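A brief worked example of LCOM2 (our own): in the class below, I_move = {x, y}, I_quadrant = {x, y} and I_rename = {label}, so two method pairs share no instance variable and one pair does.

    class Point {
        private int x, y;
        private String label;

        void move(int dx, int dy) { x += dx; y += dy; }       // accesses {x, y}
        int quadrant() { return (x >= 0 && y >= 0) ? 1 : 4; } // accesses {x, y}
        void rename(String s) { label = s; }                  // accesses {label}
    }
    // P = {(move, rename), (quadrant, rename)}, so |P| = 2
    // Q = {(move, quadrant)}, so |Q| = 1
    // LCOM = |P| - |Q| = 1 > 0: the class splits naturally into two parts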
2.3.2 Other Cohesion Metrics
Briand et al. define a set of cohesion measures for object-based systems [16,17] which
are adapted in [12] to object-oriented systems. For this adaption a class is viewed as
a collection of data declarations and methods. A data declaration is a local, public
type declaration, the class itself or public attributes. There can be data declara-
tion interactions between classes, attributes, types of different classes and methods.
They define the following measures: Ratio of Cohesive Interactions (RCI), Neutral
Ratio of Cohesive Interactions (NRCI), Pessimistic Ratio of Cohesive Interactions
(PRCI) and Optimistic Ratio of Cohesive Interactions (ORCI).
Hitz and Montazeri base their cohesion measurements LCOM3, LCOM4 and
C (Connectivity) on the work of Chidamber and Kemerer [51].
The cohesion measurements by Bieman and Kang are also based on the work of
Chidamber and Kemerer [9]. They define measurements known as Tight Class Co-
hesion (TCC) and Loose Class Cohesion (LCC). These metrics also consider pairs
of methods which use common attributes, however a distinction is made between
methods which access attributes directly or indirectly. They also take inheritance
into account, making suggestions on how to deal with inherited methods and inher-
ited attributes.
Lee et al. propose a set of cohesion measures based on the information flow
through method invocations within a class [62]. For a method m implemented in a
given class c, the cohesion of m is the number of invocations to other methods im-
plemented in class c, weighted by the number of parameters of the invoked methods.
The greater the number of parameters an invoked method has, the more informa-
tion is passed, the stronger the link between the invoking and invoked method. The
cohesion of a class is the sum of the cohesion of its methods. The cohesion of a set
of classes is given by the sum of the cohesion of the classes in the set.
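A short hedged sketch of this count (our example; note that a parameterless internal call contributes a weight of zero under a literal reading of the definition, while some formulations weight each call by one plus its parameter count):

    class Account {
        void deposit(int amount) {
            validate(amount, "deposit");  // internal call, 2 parameters: weight 2
        }

        void withdraw(int amount) {
            validate(amount, "withdraw"); // internal call, 2 parameters: weight 2
            audit();                      // internal call, 0 parameters: weight 0
        }

        private void validate(int amount, String operation) { }
        private void audit() { }
    }
    // cohesion(deposit) = 2, cohesion(withdraw) = 2 + 0 = 2
    // cohesion(Account) = 2 + 2 = 4 (sum over its methods)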
Henderson-Sellers proposes a cohesion measure (LCOM5) [49]. He states that
a value of zero is obtained if each method of the class references every attribute
of the class, and he calls this “perfect cohesion”. He also states that if each method of the class references only a single attribute, the measure yields one, and that values between zero and one are to be interpreted as percentages of the perfect value. He does not state how to deal with inherited methods and attributes.
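The thesis text does not reproduce the formula; a commonly cited formulation, consistent with the boundary behaviour just described and added here for reference, is:

\mathrm{LCOM5} = \frac{\frac{1}{a}\sum_{j=1}^{a}\mu(A_j) \; - \; m}{1 - m}

where a is the number of attributes of the class, m the number of methods, and \mu(A_j) the number of methods that reference attribute A_j. If every method references every attribute then \mu(A_j) = m for all j and LCOM5 = 0 (perfect cohesion); if each attribute is referenced by exactly one method then \mu(A_j) = 1 and LCOM5 = (1 - m)/(1 - m) = 1.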
2.4 Frameworks for Static Cohesion Measurement
Two frameworks are defined in an attempt to outline what constitutes cohesion.
Eder et al. define a framework which aims at providing qualitative criteria for
cohesion and also assigns relative strengths to the different levels of cohesion they
identify within this framework.
A comprehensive framework based on a standard terminology and formalism is
outlined by Briand et al. which can be used (i) to facilitate comparison of existing
cohesion measures, (ii) to facilitate the evaluation and empirical validation of exist-
ing cohesion measures, and (iii) to support the definition of new cohesion measures
and the selection of existing ones based on a particular goal of measurement.
2.4.1 Eder et al.
Eder et al. propose a framework aimed at providing comprehensive, qualitative
criteria for cohesion in object-oriented systems [34]. They modify existing frame-
works for cohesion in the procedural and object-based paradigm to the specifics of
the object-oriented paradigm. They distinguish between three types of cohesion in
an object-oriented system: method, class and inheritance cohesion and state that
various degrees of cohesion exist for each type.
Myers' [83] classical definition of cohesion is applied to methods for their def-
inition of method cohesion. Elements of a method are statements, local variables
and attributes of the method's class. They define seven degrees of cohesion, based
on the definition by Myers. From weakest to strongest, the degrees of method co-
hesion are coincidental, logical, temporal, communicational, sequential, procedural
and functional.
Class cohesion addresses the relationships between the elements of a class. The
elements of a class are its non-inherited methods and non-inherited attributes. Eder
et al. use a categorisation of cohesion for abstract data types by Embley and Wood-
field [35] and adapt it to object-oriented systems. They define five degrees of class co-
hesion which are, from weakest to strongest, separable, multifaceted, non-delegated,
concealed and model.
Inheritance cohesion is similar to class cohesion in that it addresses the rela-
tionships between elements of a class. However, inheritance cohesion takes all the
methods and attributes of a class into account, that is, both the inherited and non-
inherited. Inheritance cohesion is strong if inheritance has been used for the purpose
of defining specialised child classes. Inheritance cohesion is weak if it has been
used for the purpose of reusing code. The degrees of inheritance cohesion are the
same as those for class cohesion.
2.4.2 Briand et al.
Briand et al. outline a new framework for cohesion in object-oriented systems [12]
based on the issues identified by comparing the various approaches to measuring
cohesion and the discussion of existing measures outlined in Section 2.3. The frame-
work consists of five criteria, each criterion determining one basic aspect of the
resulting measure.
The five criteria of the framework are:
1. The type of connection, that is, what makes a class cohesive. A connection
within a class is a link between elements of the class which can be attributes,
methods, or data declarations.
2. Domain of the measure: this specifies the objects to be measured, which can
be methods, classes, etc.

3. Whether direct or also indirect connections should be counted.
4. How to deal with inheritance, that is, how to assign attributes and methods
to classes and how to account for polymorphism.
5. How to account for access methods and constructors.
2.5 Run-time/Dynamic Coupling Metrics
While there has been considerable work on static metrics there has been little re-
search to date on run-time/dynamic coupling metrics. This section presents the two
most relevant works.
2.5.1 Yacoub et al.
Yacoub et al. propose a set of dynamic coupling metrics designed to evaluate the
change-proneness of a design [99]. These metrics are applied at the early de-
velopment phase to determine design quality. The measures are calculated from
executable object-oriented design models, which are used to model the application
to be tested. They are based on execution scenarios, that is “the measurements are
calculated for parts of the design model that are activated during the execution of a
specific scenario triggered by an input stimulus.” A scenario is the context in which
the metric is applicable. The scenarios are then extended to have an application
scope.
They define two metrics designed to measure the quality of designs at an early
development phase. Export Object Coupling, EOC_x(o_i, o_j), for an object o_i with
respect to an object o_j, is defined as the percentage of the number of messages sent
from o_i to o_j with respect to the total number of messages exchanged during the
execution of a scenario x. Import Object Coupling, IOC_x(o_i, o_j), for an object o_i
with respect to an object o_j, is the percentage of the number of messages received by
object o_i that were sent by object o_j with respect to the total number of messages
exchanged during the execution of a scenario x.
2.5.2 Arisholm et al.
Arisholm et al. define and validate a number of dynamic coupling metrics that
are listed in Table 2.1 [5]. Each dynamic coupling metric name starts with either
I or E to distinguish between import coupling and export coupling, based on the
direction of the method calls. The third letter, C or O, distinguishes whether the
entity of measurement is the object or the class. The remaining letter distinguishes
three types of coupling.
Variable   Description

IC_CC      Import, Class Level, Number of Distinct Classes
IC_CM      Import, Class Level, Number of Distinct Methods
IC_CD      Import, Class Level, Number of Dynamic Messages
EC_CC      Export, Class Level, Number of Distinct Classes
EC_CM      Export, Class Level, Number of Distinct Methods
EC_CD      Export, Class Level, Number of Dynamic Messages
IC_OC      Import, Object Level, Number of Distinct Classes
IC_OM      Import, Object Level, Number of Distinct Methods
IC_OD      Import, Object Level, Number of Dynamic Messages
EC_OC      Export, Object Level, Number of Distinct Classes
EC_OM      Export, Object Level, Number of Distinct Methods
EC_OD      Export, Object Level, Number of Dynamic Messages

Table 2.1: Abbreviations for the dynamic coupling metrics of Arisholm et al.
The first metric, C, counts the number of distinct classes that
a method in a given class/object uses or is used by. The second metric, M, counts
the number of distinct methods invoked by each method in each class/object while
the third metric, D, counts the total number of dynamic messages sent or received
from one class/object to or from other classes/objects.
Arisholm et al. study the relationship of these measures with the change-
proneness of classes. They find that the dynamic coupling metrics capture addi-
tional properties compared to the static coupling metrics and are good predictors
of the change-proneness of a class. Their study uses a single software system called
Velocity executed with its associated test suite, to evaluate the dynamic coupling
metrics. These test cases are found to originally have 70% method coverage, which
is increased to 90% for the methods that “might contribute to coupling” through
the removal of dead code. However, they did not study the impact of code coverage
on their results nor were results given for programs other than versions of Velocity.
2.6 Run-time/Dynamic Cohesion Metrics
As is the case with the run-time coupling metrics, there has not been much research
into run-time measures for cohesion. This section presents the only available study
to date.
2.6.1 Gupta and Rao
Gupta and Rao conduct a study which measures module cohesion in legacy
software [46], comparing statically calculated metrics against a program-execution-based
approach to measuring the levels of module cohesion. The results
from this study show that the static approach significantly overestimates the levels
of cohesion present in the software tested. However, Gupta and Rao consider
programs written in C, where many features of object-oriented programs are not
directly applicable.
2.7 Other Studies of Dynamic Behaviour
In this section we present a review of other work on studies into the dynamic be-
haviour of Java programs. While such research is not directly related to coupling
and cohesion metrics, many of the issues and approaches to measurement are similar.
Indeed, any research that performs both static and dynamic analyses of programs
benefits from being viewed in the context of some overall perspective of the rela-
tionship between the static and dynamic data.
2.7.1 Dynamic Behaviour Studies
A number of studies of the dynamic behaviour of Java programs have been carried
out, mostly for optimisation purposes. Issues such as bytecode usage [45] and mem-
ory utilisation [28] have been studied, along with a comprehensive set of dynamic
measures relating to polymorphism, object creation and hot-spots [33]. However,
none of this work directly addresses the calculation of standard software metrics at
run-time.
The Sable group [33] seek to quantify the behaviour of programs with a concise
and precisely defined set of metrics. They define a set of unambiguous, dynamic, ro-
bust and architecture-independent measures that can be used to categorise programs
according to their dynamic behaviour in five areas: size, data structure,
memory use, concurrency, and polymorphism. Many of the measurements they
record are of interest to the Java performance community as understanding the dy-
namic behaviour of programs is one important aspect in developing effective new
strategies for optimising compilers and runtime systems. It is important to note
that these are not typical software engineering metrics.
2.8 Coverage Metrics and Software Testing
Dynamic coverage measures are typically used in the field of software testing as
an estimate of the effectiveness of a test suite [10, 72]. Measurement of the structural
coverage of code is a means of assessing the thoroughness of testing. The basis
of software testing is that software functionality is characterised by its execution
behaviour. In general, improved test coverage leads to improved fault coverage
and improved software reliability [69]. There are a number of metrics available for
measuring coverage, with increasing support from software tools. Such metrics do
not constitute testing techniques, but can be used as a measure of the effectiveness
of testing techniques. There are many different strategies for testing software, and
there is no consensus among software engineers about which approach is preferable
in a given situation. Test strategies fall into two categories [40]:
• Black-box (closed-box) testing: The test cases are derived from the specifica-
tion or requirements without reference to the code itself or its structure.
• White-box (open-box) testing: The test cases are selected based on knowledge
of the internal program structure.
A number of coverage metrics are based on the traversal of paths through the
control dataflow graph (CDFG) representing the system behaviour. Applying these
metrics to the CDFG representing a single process is a well understood task. The
following coverage metrics are examples of white-box testing techniques and are
based on the CDFG.
2.8.1 Instruction Coverage
Instruction coverage is the simplest structural coverage metric: it is achieved
when every source language statement in the program is executed at least once, so
test cases are selected to exercise every statement. It is also known as statement
coverage, segment coverage [84], C1 [7] and basic block coverage.
The main advantage of this measure is that it can be applied directly to object
code and does not require processing source code. Performance profilers commonly
implement this measure. The main disadvantage of statement coverage is that it is
insensitive to some control structures. In summary, this measure is affected more
by computational statements than by decisions. Due to its ubiquity this was chosen
as the coverage measure used in the case studies in this thesis. There are,
however, a number of other methods for evaluating the coverage of a program, for
example branch coverage, condition coverage, condition/decision coverage, modified
condition/decision coverage and path coverage.
2.8.2 Alexander and Offutt
In their paper [3], Alexander and Offutt describe a coupling-based testing approach
for analysing and testing the polymorphic relationships that occur in object-oriented
software. The traditional notion of software coupling has been updated to apply
to object-oriented software, handling the relationships of aggregation, inheritance
and polymorphism. This allows the introduction of a new integration analysis and
testing technique for data flow interactions within object-oriented software. The
foundation of this technique is the coupling sequence, which is a new abstraction
for representing state space interactions between pairs of method invocations. The
coupling sequence provides the analytical focal point for methods under test and is
the foundation for identifying and representing polymorphic relationships for both
static and dynamic analysis. With this abstraction both testers and developers of
object-oriented programs can analyse and better understand the interactions within
their software. The application of these techniques can result in an increased ability
to find faults and overall higher quality software.
2.9 Previous Work by the Author
A preliminary study was previously conducted on the issues involved in perform-
ing a run-time analysis of Java programs [74]. This study outlined the general
principles involved in performing such an analysis. However, the results did not
offer a justifiable basis for generalisation as the programs analysed were a set of
Java microbenchmarks from the Java Grande Forum Benchmark Suite (JGFBS)
and therefore not representative of real applications. The metrics used were also
of a more primitive nature than the ones used in this study. Also, there was no
investigation made into the perspective of the measures, that is, the influence of
coverage, or the ability to predict external design quality. It did however provide
an indication that the evaluation of software metrics at run-time can provide an
interesting quantitative analysis of a program and that further research in this area
is needed.
The following papers have also been published:
• In [77, 78] studies on the quantification of a variety of run-time class-level
coupling metrics for object-oriented programs are described.
• In [77,79] an empirical investigation into run-time metrics for cohesion is pre-
sented.
• A study into a coverage analysis of Java benchmark suites is described in [20].
• An investigation into how object-level run-time metrics can be used to study
coupling between objects is presented in [81].
• A study of the influence of coverage on the relationship between static and
dynamic coupling metrics is described in [80].
2.10 Definition of Run-time Metrics
This section outlines the run-time metrics used in the remainder of this thesis.
Originally, it was decided to develop a number of run-time metrics for coupling
and cohesion that parallel the standard static object-oriented measures defined by
Chidamber and Kemerer [26]. Later, Arisholm et al. defined a set of dynamic
coupling metrics in their paper [5] which closely parallel ours, so for ease of
comparison it was decided to use their terminology and definitions for the coupling
measures.
The cohesion measures are all novel and are based on our own definitions.
2.10.1 Coupling Metrics
Three decision criteria are used to define and classify the run-time coupling mea-
sures. Firstly, a distinction is made as to whether the entity of measurement is
the object or the class. Run-time object-level coupling quantifies the level of depen-
dencies between objects in a system. Run-time class-level coupling quantifies the
level of dependencies between the classes that implement the methods or variables of
the caller object and the receiver object. The class of the object sending or receiving
a message may be different from the class implementing the corresponding method
due to the impact of inheritance.
Second, the direction of coupling for a class or object is taken into account,
as is outlined in previous static coupling frameworks [13]. This allows for the fact
that in a coupling relationship a class may act as a client or a server, that is, it may
access methods or instance variables from another class (import coupling) or it may
have its own methods or instance variables used (export coupling).
Finally, the strength of the coupling relationship is assessed, that is, the amount
of association between the classes. To do this it is possible to count one of the following:
1. The number of distinct classes that a method in a given class uses or is used
by.
2. The number of distinct methods invoked by each method in each class.
3. The total number of dynamic messages sent or received from one class to or
from other classes.
Class-Level Metrics
The following are metrics for evaluating class-level coupling:
• IC_CC: This determines the number of distinct classes accessed by a class at
run-time.

• IC_CM: This determines the number of distinct methods accessed by a class
at run-time.

• IC_CD: This determines the number of dynamic messages sent by a class at
run-time.

• EC_CC: This determines the number of distinct classes that access a given
class at run-time.

• EC_CM: This determines the number of distinct methods of a class that are
accessed by other classes at run-time.

• EC_CD: This determines the number of dynamic messages received by a class
from other classes at run-time.
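To make these measures concrete, the following sketch shows how the import-coupling variants could be computed from a dynamic call trace. This is illustrative only and is not the collection tooling used in this thesis; the trace representation and all names are hypothetical. The export measures would be computed analogously by filtering on the receiver rather than the caller.

import java.util.*;

// Hypothetical sketch: each trace entry is {callerClass, receiverClass, method}.
public class ImportCouplingSketch {
    public static void report(String cls, List<String[]> trace) {
        Set<String> classes = new HashSet<>();  // distinct receiver classes -> IC_CC
        Set<String> methods = new HashSet<>();  // distinct receiver methods -> IC_CM
        long messages = 0;                      // all messages sent         -> IC_CD
        for (String[] call : trace) {
            String caller = call[0], receiver = call[1], method = call[2];
            if (!caller.equals(cls) || receiver.equals(cls))
                continue;                       // only messages leaving cls count
            classes.add(receiver);
            methods.add(receiver + "." + method);
            messages++;
        }
        System.out.printf("IC_CC=%d IC_CM=%d IC_CD=%d%n",
                classes.size(), methods.size(), messages);
    }
}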
Object-Level Metric
To evaluate object-level coupling it was deemed necessary to define just one metric.
Since we want to examine the behaviour of objects at run-time, we require a measure
based at the class-level rather than the method-level. Further, only import coupling
was evaluated, as we are interested in examining how classes use other classes at the
object-level rather than how they are used by other classes; export coupling for this
measure was therefore not evaluated.
The following is a measure for evaluating object-level coupling:
• IC_OC: Import, Object-Level, Number of Distinct Classes. This measure will
be some function of the static CBO measure, since CBO determines the classes
that can theoretically be accessed at run-time. This is a coarse-grained measure
which will assess class-class coupling at the object-level.
2.10.2 Cohesion Metrics
The following run-time measures are based on the Chidamber and Kemerer static
LCOM measure for cohesion as described in Section 2.3.1. However, a problem with
the original definition for LCOM is its lack of discriminating power. Much of this
arises from the criterion which states that if |P| < |Q|, LCOM is automatically set to zero.
The result of this is a large number of classes with an LCOM of zero so the metric
has little discriminating power between these classes. In an attempt to correct this,
for the purpose of this analysis, we modify the original definition to be:
SLCOM = \frac{|P|}{|P| + |Q|}   (2.4)
SLCOM can range in value from zero to one. This new definition allows for
comparison across classes; we therefore use this new version as a basis for the definition
of the run-time metrics. As these are cohesion measures they are evaluated at the
class-level only.
Run-time Simple LCOM (RLCOM)
RLCOM is a direct extension of the static case, except that now we only count
instance variables that are actually accessed at run-time. Thus, for a set of methods
m_1, \ldots, m_n, as before, let \{I^R_i\} represent the set of instance variables referenced by
method m_i at run-time. Two disjoint sets are defined from this:

P^R = \{(I^R_i, I^R_j) \mid I^R_i \cap I^R_j = \emptyset\}
Q^R = \{(I^R_i, I^R_j) \mid I^R_i \cap I^R_j \neq \emptyset\}   (2.5)

We can then define RLCOM as:

RLCOM = \frac{|P^R|}{|P^R| + |Q^R|}   (2.6)
We note that for any method m_i, |I_i - I^R_i| \ge 0; this represents the number
of instance variables mentioned in a method's code, but not actually accessed at
run-time.
Run-time Call-Weighted LCOM (RWLCOM)
It is reasonable to suggest that a heavily accessed variable should make a greater con-
tribution to class cohesion than one which is rarely accessed. However, the RLCOM
metric does not distinguish between the degree of access to instance variables. Thus
a second run-time measure RWLCOM is defined by weighting each instance variable
by the number of times it is accessed at run-time. This metric assesses the strength
of cohesion by taking the number of accesses into account.
As before, consider a class with n methods, m_1, \ldots, m_n, and let \{I_i\} be the set
of instance variables referenced by method m_i. Define N_i as the number of times
method m_i dynamically accesses instance variables from the set \{I_i\}. Now define a
call-weighted version of equation 2.2 by summing over the number of accesses:

P^W = \sum_{1 \le i,j \le n} \{(N_i + N_j) \mid I_i \cap I_j = \emptyset\}
Q^W = \sum_{1 \le i,j \le n} \{(N_i + N_j) \mid I_i \cap I_j \neq \emptyset\}
where P^W = 0 if \{I_1\}, \ldots, \{I_n\} = \emptyset   (2.7)

Following equation 2.6 we define:

RWLCOM = \frac{|P^W|}{|P^W| + |Q^W|}   (2.8)
RWLCOM can range in value from zero to one. There is no direct relationship
with SLCOM or RLCOM, as it is based on the "hotness" of a particular program.
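To illustrate equations 2.5 to 2.8 concretely, the sketch below computes RLCOM and RWLCOM for one class from per-method run-time data. This is not the thesis tooling: the data representation is hypothetical, and method pairs are treated as unordered, which leaves both ratios unchanged.

import java.util.*;

// varsPerMethod.get(i) is the set of instance variables method m_i actually
// accessed at run-time; accesses[i] is N_i, its total number of such accesses.
public class LcomSketch {

    static double rlcom(List<Set<String>> varsPerMethod) {
        int p = 0, q = 0, n = varsPerMethod.size();
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (Collections.disjoint(varsPerMethod.get(i), varsPerMethod.get(j))) p++;
                else q++;
        return (p + q == 0) ? 0.0 : (double) p / (p + q);          // equation 2.6
    }

    static double rwlcom(List<Set<String>> varsPerMethod, int[] accesses) {
        long pw = 0, qw = 0;
        int n = varsPerMethod.size();
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (Collections.disjoint(varsPerMethod.get(i), varsPerMethod.get(j)))
                    pw += accesses[i] + accesses[j];               // contributes to P^W
                else
                    qw += accesses[i] + accesses[j];               // contributes to Q^W
        return (pw + qw == 0) ? 0.0 : (double) pw / (pw + qw);     // equation 2.8
    }
}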
2.11 Conclusion
This chapter outlined the most prevalent metrics for coupling and cohesion and dis-
cussed other work on studies into the dynamic behaviour of Java programs. Mea-
sures for dynamic coverage that are commonly used in the field of software testing
were described. Work and publications by the author were outlined. Finally, a
description of the run-time metrics used in this thesis was provided.
Chapter 3
Experimental Design
This chapter presents an overview of the tools and techniques used to carry out the
run-time empirical evaluation of a set of Java programs together with a detailed
description of the set of programs analysed. A review of the statistical techniques
used to interpret the data is also given.
3.1 Methods for Collecting Run-time Information
There are a number of alternative techniques available for extracting run-time in-
formation from Java programs, each with their own advantages and disadvantages.
3.1.1 Instrumenting a Virtual Machine
There are several open-source implementations of the JVM available, for example
Kaffe [58], Jikes [57] or the Sable VM [59]. As their source code is freely available,
all aspects of a running Java program can be observed. However,
due to the logging of bytecode instructions, instrumenting a JVM can result in a
huge amount of data being generated for the simplest of programs. The source code
organisation must be understood and the instrumentation has to be redone for each
new version of the VM. There can also be compatibility issues when compared with
the Java class libraries released by Sun. It has also been found that these VMs are
not very robust. This was the method used for a preliminary study [74]; however, it
was later discarded due to its many disadvantages.
3.1.2 Sun’s Java Platform Debug Architecture (JPDA)
Versions 1.4 and later of the Java SDK support a debugging architecture, the JPDA
[96], that provides event notification for low level JVM operations. A trace program
that handles these events can thus record information about the execution of a Java
program. This method is faster than instrumenting a VM and is more robust. The
same agent works with all VM’s supporting the JPDA and this is currently supported
by both Sun and IBM (although there are some differences). This technique has
proved useful in class-level metrics analysis. However, it is still very time consuming
to generate a profile for a large application and it is difficult to conduct an object-
level analysis using this approach.
3.1.3 Bytecode Instrumentation
This involves statically manipulating the bytecode to insert probes, or other track-
ing mechanisms, that record information at runtime. This provides the simplest
approach to dynamic analysis since it does not require implementation specific
knowledge of JVM internals, and imposes little overhead on the running program.
Bytecode instrumentation can be performed using the publicly available Apache
Bytecode Engineering Library (BCEL) [30]. This technique provides object-level
accuracy and therefore was used in the object-level metrics analysis.
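To give a flavour of this approach, the minimal sketch below uses the BCEL 5.x API to insert a probe call at the entry of every method in a class file. It is an illustration of the general technique rather than the ObMet implementation; the Probe class and the file name are hypothetical.

import org.apache.bcel.Constants;
import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.Method;
import org.apache.bcel.generic.*;

public class InstrumentSketch {
    public static void main(String[] args) throws Exception {
        JavaClass jc = new ClassParser("A.class").parse();
        ClassGen cg = new ClassGen(jc);
        ConstantPoolGen cp = cg.getConstantPool();
        InstructionFactory factory = new InstructionFactory(cg);

        for (Method m : cg.getMethods()) {
            if (m.isAbstract() || m.isNative()) continue;  // no bytecode to probe
            MethodGen mg = new MethodGen(m, cg.getClassName(), cp);
            InstructionList il = mg.getInstructionList();
            InstructionList probe = new InstructionList();
            probe.append(new PUSH(cp, cg.getClassName() + "." + m.getName()));
            probe.append(factory.createInvoke("Probe", "hit", Type.VOID,
                    new Type[]{Type.STRING}, Constants.INVOKESTATIC));
            il.insert(probe);                  // probe executes before original code
            mg.setMaxStack();                  // recompute stack depth after the edit
            cg.replaceMethod(m, mg.getMethod());
        }
        cg.getJavaClass().dump("A.class");     // rewrite the class file in-place
    }
}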
3.2 Metrics Data Collection Tools (Design Objectives)
The dynamic analysis of any program involves a huge amount of data processing.
However, the level of performance of the collection mechanism was not considered
to be a critical issue. It was only desirable that the analysis could be carried out in
reasonable and practical time. The flexibility of the collection mechanism was a key
issue, as it was necessary to be able to collect a wide variety of dynamic information.
3.2.1 Class-Level Metrics Collection Tool (ClMet)
We have developed a tool for the collection of class-level metrics called ClMet, as
illustrated by Figure 3.1, which utilises the JPDA. This is a multi-tiered debugging
architecture contained within Sun Microsystem’s Java 2 SDK version 1.4. It consists
of two interfaces, the Java Virtual Machine Debug Interface (JVMDI), and the Java
Debug Interface (JDI), and a protocol, the Java Debug Wire Protocol (JDWP).
The first layer of the JPDA, the JVMDI, is a programming interface implemented
by the virtual machine. It provides a way to both inspect the state and control the
execution of applications running in the JVM. The second layer, the JDWP, de-
fines the format of information and requests transferred between the process being
debugged and the debugger front-end which implements the JDI. The JDI, which
comprises the third layer, defines information and requests at the user code level. It
provides introspective access to a running virtual machine’s state, the class, array,
interface, and primitive types, and instances of those types. While a tracer imple-
mentor could directly use the Java Debug Wire Protocol (JDWP) or Java Virtual
Machine Debug Interface (JVMDI), the JDI greatly facilitates the integration
of tracing capabilities into development environments. This method was selected
because of the ease with which it is possible to obtain specific information about
the run-time behaviour of a program.
In order to match objects against method calls it is necessary to model the
execution stack of the JVM, as this information is not provided directly by the
JPDA. We have implemented an EventTrace analyser class in Java, which carries
out a stack based simulation of the entire execution in order to obtain information
about the state of the execution stack. This class also implements a filter which
allows the user to specify which events and which of their corresponding fields are
to be captured for processing. This allows a high degree of flexibility in the collection
of the dynamic trace data.
The final component of our collection system is a Metrics class, which is re-
sponsible for calculating the desired metrics on the fly. It is also responsible for
outputting the results in text format. The metrics to be calculated can be specified
from the command line. The addition of the metrics class allows new metrics to be
easily defined, as the user need only interact with this class.

Figure 3.1: Components of run-time class-level metrics collection tool, ClMet
3.2.2 Object-Level Metrics Collection Tool (ObMet)
We have developed an object-level metrics collection tool called ObMet, which uses
the BCEL and is based on the Gretel [53] coverage monitoring tool.
The BCEL is an API which can be used to analyse, create, and manipulate
(binary) Java class files. Classes are represented by BCEL objects which contain all
the symbolic information of the given class, such as methods, fields and byte code
instructions. Such objects can be read from an existing file, be transformed by a
program and dumped to a file again.
Figure 3.2 illustrates the components of ObMet.

Figure 3.2: Components of run-time object-level metrics collection tool, ObMet

In the first stage the Instrumenter program takes a list of class files and instruments
them. During this phase
the BCEL inserts probes into these files to flag events like method calls or instance
variable accesses. During instrumentation, the class files are changed in-place, and
a file containing information on method and field accesses is created. Each method
and field are given a unique index in this file. When the application is run, each
probe records a “hit” in another file. The Metrics program then calculates the
run-time measures utilising the information in these files.
3.2.3 Static Data Collection Tool (StatMet)
In order to calculate the static metrics it is necessary to convert the binary class
files into a human readable format. The StatMet tool is based on the Gnoloo
disassembler [38], which converts the class files into an Oolong source file. The
Oolong language is an assembly language for the Java Virtual Machine and the
resulting file will be nearly equivalent to the class file format but it will be suitable for
human interpretation. The StatMet tool extends the disassembler with an additional
metrics component which calculates the static metrics from the Oolong code. Figure
3.3 illustrates the components of the StatMet tool.

Figure 3.3: Components of static metrics collection tool, StatMet
3.2.4 Coverage Data Collection Tool (InCov)
In order to calculate the instruction coverage, it is necessary to record, for each
instruction, whether or not it was executed. In fact, well-known techniques exist for
identifying sequences of consecutive instructions, known as basic blocks, that some-
what reduce the instrumentation overhead. Nonetheless, since static code analysis
is required to determine basic block entry points, it seemed most efficient to also
instrument the bytecode during this analysis.
The instrumentation framework uses the Apache Byte Code Engineering Library
(BCEL) [30] along with the Gretel Residual Test Coverage Tool [53]. The Gretel
tool statically works out the basic blocks in a Java class file and inserts a probe
consisting of small sequence of bytecode instructions at each basic block. Whenever
the basic block is executed, the probe code records a “hit” as a simple boolean value.
The number of bytecode instructions in the basic block can then be used to calculate
instruction coverage.
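The probe mechanism can be pictured with the following conceptual sketch; it is not Gretel itself, and in practice the arrays would be sized and populated by the instrumenter from its static basic-block analysis.

// One flag per basic block; instruction coverage is the fraction of bytecode
// instructions that belong to blocks whose probe has fired.
public final class CoverageSketch {
    static boolean[] hit;        // set by the inserted probes
    static int[] blockSize;      // bytecode instructions per basic block

    public static void probe(int blockId) {   // called at each block entry
        hit[blockId] = true;
    }

    public static double instructionCoverage() {
        long executed = 0, total = 0;
        for (int i = 0; i < hit.length; i++) {
            total += blockSize[i];
            if (hit[i]) executed += blockSize[i];
        }
        return total == 0 ? 0.0 : 100.0 * executed / total;
    }
}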
3.2.5 Fault Detection Study
Mutation testing [48, 64] is a fault-based testing technique that measures the effec-
tiveness of test cases. It was first introduced as a way of measuring the accuracy
of test suites. It is based on the assumption that a program will be well tested if
a majority of simple faults are detected and removed. Mutation testing measures
how good a test is by inserting faults into the program under test. Each fault gen-
erates a new program, a mutant, that is slightly different from the original. These
mutant versions of the program are created from the original program by applying
mutation operators, which describe syntactic changes to the programming language.
Test cases are used to execute these mutants with the goal of causing each mutant
to produce incorrect output. The idea is that the tests are adequate if they dis-
tinguish the program from one or more mutants. The cost of mutation testing has
always been a serious issue and many techniques proposed for implementing it have
proved to be too slow for practical adoption. µJava is a tool created for performing
mutation testing on Java programs.
µJava
µJava [66, 67] is a mutation system for Java programs. It automatically generates
mutants for both traditional mutation testing and class-level mutation testing. It
can test individual classes and packages of multiple classes.
The method-level or traditional mutants are based on the selective operator set
by Offutt et al. [87]. These (non-OO) mutants are all behavioural in nature. There
are five traditional mutants in total. A description of these mutants can be found
in Appendix D.1.
The class-level mutation operators were designed for Java classes by Ma, Kwon
and Offutt [68], and were in turn designed from a categorisation of object-oriented
faults by Offutt, Alexander et al. [86]. The object-oriented mutants are created
according to 23 operators that are specialised to object-oriented faults. Each of
these can be categorised according to one of five language feature groups to which
they are related. The class-level mutants can also be divided into one of two types:
behavioural mutants are those that change the behaviour of the program, while
structural mutants are those that change the structure of the program.
mutants can be found in Appendix D.2.
After creating mutants, µJava allows the tester to enter and run tests, and
evaluates the mutation coverage of the tests. Test cases are then added in an attempt
to “kill” the mutants by differentiating the output of the original program from the
mutant programs. Tests are supplied by the users as sequences of method calls to
the classes under test encapsulated in methods in separate classes.
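For illustration, consider a hypothetical method-level mutant of the kind produced by a relational operator replacement; a test kills the mutant by driving the original and mutant programs to different outputs.

// Original method under test:
static int max(int a, int b) {
    return (a > b) ? a : b;
}

// Hypothetical mutant: the relational operator '>' is replaced by '<'.
static int maxMutant(int a, int b) {
    return (a < b) ? a : b;
}

// A test asserting max(2, 1) == 2 kills this mutant, since the mutant
// returns 1 while the original returns 2.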
3.3 Test Case Programs
An important technique used in the evaluation of object systems is benchmarking. A
benchmark is a black-box test, even if the source code is available [73]. A benchmark
should consist of two elements:
• The structure of the persistent data.
• The behaviour of an application accessing and manipulating the data.
The process of using a benchmark to assess a particular object system involves exe-
cuting or simulating the behaviour of the application while collecting data reflecting
its performance [54]. A number of different Java benchmarks are available and those
used in the course of this study are discussed in the following subsection.
3.3.1 Benchmark Programs
Benchmark suites are commonly used to measure performance and fulfill many of
the required properties of a test suite. The following were used in this analysis.
SPECjvm98 Benchmark Suite
The SPECjvm98 benchmark suite [8] is typically used to study the architectural
implications of a Java runtime environment. The benchmark suite consists of eight
Application Description
201 compress A popular modified Lempel–Ziv method (LZW) compres-
sion program.
202 jess JESS is the Java Expert Shell System and is based on
NASA's popular CLIPS rule-based expert shell system.
205 raytrace This is a raytracer that works on a scene depicting a di-
nosaur.
209 db Data management software written by IBM.
213 javac This is the Sun Microsystems Java compiler from the JDK
1.0.2.
222 mpegaudio This is an application that decompresses audio files that
conform to the ISO MPEG Layer–3 audio specification.
227 mtrt This is a variant of 205 raytrace. This is a dual–threaded
program that ray traces an image.
228 jack A Java parser generator from Sun Microsystems that is
based on the Purdue Compiler Construction Tool Set (PC-
CTS). This is an early version of what is now called
JavaCC.
Table 3.1: Description of the SPECjvm98 benchmarks
Java programs which represent different classes of Java applications as illustrated
by Table 3.1.
These programs were run at the command line prompt and do not include graph-
ics, AWT (graphical interfaces), or networking. The programs were run with a 100%
size execution by specifying a problem size s100 at the command line.
JOlden Benchmark Suite
The original Olden benchmarks are a suite of pointer intensive C programs which
have been translated into Java. They are small, synthetic programs but they were
used as part of this study as each program exhibits a large volume of object creation.
Application Description
bh Solves the N-body problem using hierarchical methods.
bisort Sorts by creating two disjoint bitonic sequences and then
merging them.
em3d Simulates the propagation of electro-magnetic waves in a 3D
object.
health Simulates the Colombian health care system.
mst Computes the minimum spanning tree of a graph.
perimeter Computes the perimeter of a set of quad-tree encoded
raster images.
power Solves the Power System Optimization problem.
treeadd Adds the values in a tree.
tsp Computes an estimate of the best Hamiltonian circuit for
the travelling salesman problem.
voronoi Computes the Voronoi Diagram of a set of points.
Table 3.2: Description of the JOlden benchmarks
Table 3.2 gives a description of the programs [23].
There are a number of other benchmark suites available that could have been used
in this type of study but were excluded for various reasons. The DaCapo benchmark
suite was excluded as it is still in its beta stage of development. The Java Grande
Forum Benchmark Suite (JGFBS), which was used in a previous study [74], was
excluded as the programs did not exhibit very high levels of coupling and cohesion
at run-time. Other suites, such as CaffeineMark, were excluded as they are
microbenchmark programs and therefore not typical of real Java applications.
3.3.2 Real-World Programs
It was deemed desirable to include a number of real-world programs in the analysis
to see if the results are scalable to actual programs. The following were chosen as
they are all publicly available and so is their source code. They all come with a set
of pre-defined test cases that are also publicly available, thus defining both the static
and dynamic context of our work. This contrasts with some other approaches which,
at worst, can use arbitrary software packages, often proprietary, with an ad-hoc set
of test inputs.
Velocity
Velocity (version 1.4.1) is an open-source software system that is part of the Apache
Jakarta Project [55]. It is a Java-based template engine and it permits anyone to
use a simple yet powerful template language to reference objects defined in Java
code. It can be used to generate web pages, SQL, PostScript, and other outputs
from template documents. It can be used either as a standalone utility or as an
integrated component of other systems. The set of JUnit test cases supplied with
the program were used to execute the program.
Xalan-Java
Xalan-Java (version 2.6.0) is an open-source software system that is part of the
Apache XML Project [92]. It is an XSLT processor for transforming XML docu-
ments into HTML, text, or other XML document types. It implements XSL Trans-
formations (XSLT) Version 1.0 and XML Path Language (XPath) Version 1.0. It
can be used from the command line, in an applet or a servlet, or as a module in
other programs. A set of JUnit test cases supplied with the program was used for its
execution.
Ant
Ant (version 1.6.1) is a Java-based build tool that is part of the Apache Ant Project
[4]. It is similar to GNU Make but has the full portability of pure Java code. Instead
of writing shell commands, as with Make, the configuration files are XML-based,
calling out a target tree where various tasks are executed.
SPECjvm98 JOlden Velocity Xalan Ant
Case Study 1: X X X X X
Case Study 2: X X X X
Case Study 3: X X X
Table 3.3: Programs used for each case study
3.3.3 Execution of Programs
All the programs except those in the SPEC benchmark suite were compiled using
the javac compiler from Sun’s SDK version 1.5.0 01, and all benchmarks were run
using the client virtual machine from this SDK. The programs in the SPEC suite are
distributed in class file format, and were not recompiled or otherwise modified. We
note (in accordance with the license) that the SPEC programs were run individually,
and thus none of these results are comparable with the standard SPECjvm98 metric.
All benchmark suites include not just the programs themselves, but a test harness
to ensure that results from different executions are comparable. Table 3.3 outlines
the programs used for each case study. Not all programs were suitable for use in
every case study and we defer the explanation of this to the relevant chapters.
3.4 Statistical Techniques
This section presents a detailed review of the statistical techniques used in
this study.
3.4.1 Descriptive Statistics
Descriptive statistics describe patterns and general trends in a data set. They also
aid in explaining the results of more complex statistical techniques. For each case
study a number of descriptive statistics were evaluated from the following:
The Distribution or Mean (\bar{X})

\bar{X} = \frac{\sum X}{N}   (3.1)
The mean is the sum of all values (X) divided by the total number of values (N).
The Standard Deviation (s)
s = \sqrt{var} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}   (3.2)
The standard deviation is a measure of the range of values in a set of numbers.
It is used as a measure of the dispersion or variation in a distribution. Simply
put, it tells us how far a typical member of a sample or population is from the
mean value of that sample or population. A large standard deviation suggests that
a typical member is far away from the mean. A small standard deviation suggests
that members are clustered closely around the mean. It is computed as the square
root of the variance.
Many statistical techniques assume that data is normally distributed. If that
assumption can be justified, then 68% of the values are at most one standard devi-
ation away from the mean, 95% of the values are at most two standard deviations
away from the mean, and 99.7% of the values lie within three standard deviations
of the mean.
The Coefficient of Variation (CV)

CV = \frac{\sigma}{\mu} \times 100   (3.3)
CV measures the relative scatter in data with respect to the mean and is calcu-
lated by dividing the standard deviation by the mean. It has no units and can be
expressed as a simple decimal value or reported as a percentage value. When the
CV is small the data scatter relative to the mean is small. When the CV is large
compared to the mean the amount of variation is large. Equation 3.3 defines the
coefficient of variation as a percentage, where µ is the mean and σ is the standard
deviation.
Skewness
skewness = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3}{(N - 1)s^3}   (3.4)
Skewness is the tilt (or lack of it) in a distribution. It characterises the degree of
asymmetry of a distribution around its mean. A distribution is symmetric if it looks
the same to the left and right of the centre point. Equation 3.4 gives the formula
for skewness for X_1, X_2, \ldots, X_N, where \bar{X} is the mean, s is the standard
deviation and N is the number of data points.
Kurtosis
kurtosis = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N - 1)s^4}   (3.5)
Kurtosis is the peakedness of a distribution. Equation 3.5 gives the formula for
kurtosis for X_1, X_2, \ldots, X_N.
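The following sketch ties equations 3.1 to 3.5 together, using the same (N - 1) denominators as above; the data values are illustrative only.

public class DescriptiveStats {
    public static void main(String[] args) {
        double[] x = {4.0, 7.0, 13.0, 16.0};               // illustrative data
        int n = x.length;
        double sum = 0;
        for (double v : x) sum += v;
        double mean = sum / n;                              // equation 3.1

        double ss2 = 0, ss3 = 0, ss4 = 0;
        for (double v : x) {
            double d = v - mean;
            ss2 += d * d; ss3 += d * d * d; ss4 += d * d * d * d;
        }
        double s = Math.sqrt(ss2 / (n - 1));                // equation 3.2
        double cv = s / mean * 100;                         // equation 3.3
        double skewness = ss3 / ((n - 1) * s * s * s);      // equation 3.4
        double kurtosis = ss4 / ((n - 1) * s * s * s * s);  // equation 3.5
        System.out.printf("mean=%.2f s=%.2f CV=%.1f%% skew=%.3f kurt=%.3f%n",
                mean, s, cv, skewness, kurtosis);
    }
}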
3.4.2 Normality Tests
Many statistical procedures require that the data being analysed follow a normal
data distribution. If this is not the case, then the computed statistics may be
extremely misleading. Normal distributions take the form of a symmetric bell-
shaped curve. Normality can be visually assessed by looking at a histogram of
frequencies, or by looking at a normal probability plot.
A common rule-of-thumb test for normality is to compute skewness and kurtosis,
then divide these by their standard errors. Skew and kurtosis should be within the +2
to -2 range when the data are normally distributed. Negative skew is left-leaning,
positive skew right-leaning. Negative kurtosis indicates too many cases in the tails
of the distribution. Positive kurtosis indicates too few cases in the tails.
Shapiro-Wilk’s W Test
W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}   (3.6)
Formal tests such as the Shapiro-Wilk’s test may also be applied to assess
whether the data is normally distributed. It calculates a W statistic that tests
whether a random sample, x1, x2, ..., xn comes from a normal distribution. W may
be thought of as the correlation between given data and their corresponding nor-
mal scores, with W = 1 when the given data are perfectly normal in distribution.
When W is significantly smaller than 1, the assumption of normality is not met.
Shapiro-Wilk's W is recommended for small and medium samples up to n = 2000.
Equation 3.6 calculates the W statistic, where the x_{(i)} are the ordered sample values
and the a_i are constants generated from the means, variances and covariances of the
order statistics of a sample of size n from a normal distribution [90,93].
Kolmogorov-Smirnov D Test or K-S Lilliefors test
D = \max_{1 \le i \le N} \left| F(y_i) - \frac{i}{N} \right|   (3.7)
For larger samples, the Kolmogorov-Smirnov test is recommended. For a single
sample of data, this test is used to test whether or not the sample of data is consistent
with a specified distribution function. When there are two samples of data, it is
used to test whether or not these two samples may reasonably be assumed to come
from the same distribution. Equation 3.7 defines the test statistic, where F is the
theoretical cumulative distribution of the distribution being tested which must be a
continuous distribution. The hypothesis regarding the distributional form is rejected
if the test statistic, D, is greater than the critical value obtained from a table. There
are several variations of these tables in the literature [24].
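As a sketch of equation 3.7, the method below computes the one-sample D statistic against a supplied theoretical CDF; the refinement of also checking the lower step (i - 1)/N is common in practice.

import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

public class KsSketch {
    static double dStatistic(double[] sample, DoubleUnaryOperator cdf) {
        double[] y = sample.clone();
        Arrays.sort(y);
        int n = y.length;
        double d = 0.0;
        for (int i = 1; i <= n; i++) {
            double f = cdf.applyAsDouble(y[i - 1]);
            d = Math.max(d, Math.max(f - (i - 1.0) / n, (double) i / n - f));
        }
        return d;   // reject the distributional hypothesis if d exceeds the critical value
    }
}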
3.4.3 Normalising Transformations
There are a number of transformations that can be applied to approximate data
to become normally distributed. To normalise right or positive skew, square roots,
logarithmic, and inverse (1/x) transforms “pull in” outliers. Inverse transforms are
stronger than logarithmic transforms which are stronger than roots. To correct left
or negative skew, first subtract all values from the highest value plus 1, then apply
square root, inverse, or logarithmic transforms. Power transforms can be used to
correct both types of skew and finer adjustments can be made by adding a con-
stant, C, in the transform of X: (X + C)^P. Values of P less than one (roots) correct
right skew, which is the common situation (using a power of 2/3 is common when
attempting to normalise). Values of P greater than 1 (powers) correct left skew.
For right skew, decreasing P decreases right skew. Too great a reduction in P will
overcorrect and cause left skew. When the best P is found, further refinements
can be made by adjusting C. For right skew, for instance, subtracting C will de-
crease skew. Logarithmic transformations are appropriate to achieve symmetry in
the central distribution when symmetry of the tails is not important. Square root
transformations are used when symmetry in the tails is important. When both are
important, a fourth root transform may work.
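These transformations reduce to a pair of small helpers, sketched below with hypothetical names; x is assumed positive wherever roots or logarithms would be taken.

public class TransformSketch {
    // (X + C)^P: values of p below 1 (roots) correct right skew, above 1 left skew.
    static double[] powerTransform(double[] x, double c, double p) {
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = Math.pow(x[i] + c, p);
        return out;
    }

    // Reflection step for left skew: subtract from (max + 1), then apply a
    // square root, inverse or logarithmic transform to the result.
    static double[] reflect(double[] x) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) max = Math.max(max, v);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (max + 1) - x[i];
        return out;
    }
}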
3.4.4 Pearson Correlation Test
R = \frac{n \sum xy - \sum x \sum y}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}   (3.8)
The Pearson or product moment correlation test is used to assess if there is a
relationship between two or more variables, in other words it is a measure of the
strength of the relationship between the variables. Having n pairs of data (x_i, y_i),
equation 3.8 computes the correlation coefficient (R). R is a number that summarises
the direction and degree (closeness) of linear relations between two variables and is
also known as the Pearson Product-Moment Correlation Coefficient. R can take
values from -1 through 0 to +1. The sign (+ or -) of the correlation affects its
interpretation. When the correlation is positive (R > 0), as the value of one variable
increases, so does the other. The closer R is to zero the weaker the relationship. If
a correlation is negative, when one variable increases, the other variable decreases.
The following general categories indicate a quick way of interpreting a calculated R
value [97]:
• 0.0 to 0.2 Very weak to negligible correlation
• 0.2 to 0.4 Weak, low correlation (not very significant)
• 0.4 to 0.6 Moderate correlation
• 0.7 to 0.9 Strong, high correlation
• 0.9 to 1.0 Very strong correlation
The results of such an analysis are displayed in a correlation matrix table.
3.4.5 T-Test
t = \frac{r}{\sqrt{(1 - r^2)/(N - 2)}}   (3.9)
Any relationship between two variables should be assessed for its significance
as well as its strength. A standard two tailed t-test is used to test for statistical
significance as illustrated by equation 3.9. Coefficients are considered significant if
the t-test p-value is below 0.05. This tells us how unlikely a given correlation
coefficient, r, would be to occur given no relationship in the population. Therefore
the smaller the p-level, the more significant the relationship, subject to the usual
risks of type I and type II errors.
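Equations 3.8 and 3.9 can be sketched directly; the resulting t value is compared against the t distribution with N - 2 degrees of freedom.

public class CorrelationSketch {
    static double pearsonR(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxy += x[i] * y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
        }
        return (n * sxy - sx * sy)
                / Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));  // equation 3.8
    }

    static double tStatistic(double r, int n) {
        return r / Math.sqrt((1 - r * r) / (n - 2));                     // equation 3.9
    }
}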
3.4.6 Principal Component Analysis
Principal Component Analysis (PCA) is used to analyse the covariate structure
of the metrics and to determine the underlying structural dimensions they capture.
In other words PCA can tell if all the metrics are likely to be measuring the same
class property. PCA usually generates a large number of principal components. The
number will be decided based on the amount of variance explained by each compo-
nent. A typical threshold would be retaining principal components with eigenvalues
(variances) larger than 1.0. This is the Kaiser criterion. There are a number of
stages involved in performing a PCA on a set of data:
1. Select a data set, for example one with two dimensions x and y.
2. Subtract the mean from each of the data dimensions. The mean subtracted
is the average across each dimension, so all the x values have the mean \bar{x}
subtracted and all the y values have \bar{y} subtracted. This produces a data
set whose mean is zero.
3. Calculate the covariance matrix. Formula 3.10 gives the definition for a co-
variance matrix for a set of data with n dimensions, where C^{n \times n} is a matrix
with n rows and n columns, and Dim_x is the x-th dimension.

C^{n \times n} = (c_{i,j}), \quad c_{i,j} = cov(Dim_i, Dim_j)   (3.10)
An n-dimensional data set will have \frac{n!}{(n-2)! \times 2} different covariance values. As
the data we propose to use is two dimensional, the covariance matrix will be
2 × 2:

C = \begin{pmatrix} cov(x, x) & cov(x, y) \\ cov(y, x) & cov(y, y) \end{pmatrix}
4. Calculate the eigenvectors and eigenvalues of the covariance matrix. The
eigenvectors are unit vectors (their lengths are 1), and each eigenvector is
paired with an eigenvalue. These are important as they provide information
about patterns in the data.
5. Choosing components and forming a feature vector. In general, once eigen-
vectors are found from the covariance matrix, the next step is to order them
by eigenvalue, highest to lowest. This gives you the components in order of
significance. Some of the components of lesser significance can be ignored. If
some components are left out, the final data set will have less dimensions than
the original. To be precise, if there are originally n dimensions in the data,
and n eigenvectors and eigenvalues are calculated, and the first p eigenvectors
are chosen, then the final data set has only p dimensions.
A feature vector, which is just another name for a matrix of vectors, is
constructed by taking the eigenvectors that you want to keep from the list
of eigenvectors and forming a matrix with these eigenvectors in the columns.
FeatureVector = (eig_1 \; eig_2 \; eig_3 \ldots eig_n)   (3.11)
6. Derive a new data set. For this we simply take the transpose of the feature
vector and multiply it on the left of the original data set, transposed.

FinalData = RowFeatureVector \times RowDataAdjust   (3.12)
where RowFeatureVector is the matrix with the eigenvectors in the columns
transposed so that the eigenvectors are now in the rows, with the most signif-
icant eigenvector at the top, and RowDataAdjust is the mean-adjusted data
transposed, that is, the data items are in each column, with each row holding
a separate dimension.
See [56] for further details on PCA.
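For the two-dimensional case used here, steps 2 to 4 can be made concrete. The sketch below mean-centres the data, forms the 2 x 2 covariance matrix and obtains its eigenvalues in closed form from the characteristic polynomial; a full PCA would additionally recover the corresponding eigenvectors.

public class PcaSketch {
    static double[] covarianceEigenvalues(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;                                  // step 2: mean-centre

        double cxx = 0, cxy = 0, cyy = 0;                  // step 3: covariances
        for (int i = 0; i < n; i++) {
            cxx += (x[i] - mx) * (x[i] - mx);
            cxy += (x[i] - mx) * (y[i] - my);
            cyy += (y[i] - my) * (y[i] - my);
        }
        cxx /= (n - 1); cxy /= (n - 1); cyy /= (n - 1);

        double tr = cxx + cyy, det = cxx * cyy - cxy * cxy;
        double disc = Math.sqrt(tr * tr / 4 - det);        // real, as C is symmetric
        return new double[]{tr / 2 + disc, tr / 2 - disc}; // step 4: eigenvalues
    }
}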
3.4.7 Cluster Analysis
Cluster Analysis is a data exploratory statistical procedure that helps reveal asso-
ciations and structures of data in a domain set [91]. A measure of proximity or
similarity/dissimilarity is needed in order to determine groups from a complex data
set. A wide variety of such measures exist but no consensus prevails over which is
superior. For this study, two widely used dissimilarity measures, Pearson dissimi-
larity and Euclidean distance were chosen. The analysis was repeated using these
two different measures in order to verify the results.
Equation 3.13 defines the Pearson dissimilarity, where \mu_x and \mu_y are the means
of the first and second sets of data, and \sigma_x and \sigma_y are the standard deviations
of the first and second sets of data.

d(x, y) = \frac{\frac{1}{n} \sum_i x_i y_i - \mu_x \mu_y}{\sigma_x \sigma_y}   (3.13)
Equation 3.14 defines the Euclidean distance between two sets of data.

d(x, y) = \sqrt{\sum_{i}^{n} (x_i - y_i)^2}   (3.14)
The next step is to select the most suitable type of clustering algorithm for the
analysis. The agglomerative hierarchical clustering (AHC) algorithm was chosen as
being the most suitable for the specifications of the analysis. Also, it does not require
the number of clusters the data should be grouped into to be specified in advance. AHC
algorithms start with singleton clusters, one for each entity. The most similar pair
of clusters are merged, one pair at a time, until a single cluster remains.
Throughout the cluster analysis, there is a symmetric matrix of dissimilarities
maintained between the clusters. Once two clusters have been merged, it is neces-
sary to generate the dissimilarity between the new cluster and every other cluster.
The unweighted pair group average linkage algorithm was employed here as it is
theoretically the best method to use. This algorithm clusters objects based on the
average distance between all pairs.
Suppose we have three clusters A, B and C, with i being the distance between
A and C, and j being the distance between B and C. If A and B are the most
similar pair of entities and are joined together into a new cluster D, the method of
calculating the new distance k between C and D is given by Equation 3.15.

k = (i \times size(A) + j \times size(B)) / (size(A) + size(B))   (3.15)
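Equation 3.15 amounts to a size-weighted average of the two old distances, as the following one-method sketch shows (names are illustrative).

// After merging clusters A and B into D, the average-linkage distance from D
// to any other cluster C is the size-weighted mean of d(A,C) and d(B,C).
static double mergedDistance(double dAC, double dBC, int sizeA, int sizeB) {
    return (dAC * sizeA + dBC * sizeB) / (double) (sizeA + sizeB);
}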
The analysis was repeated using Ward’s method to verify the results. With
this method cluster membership is assessed by calculating the total sum of squared
deviations from the mean of a cluster. The criterion for fusion is that it should
produce the smallest possible increase in the error sum of squares.
The output of AHC is usually represented in a special type of tree structure called
a dendrogram, as illustrated by Figure 3.4. Each branch of the tree represents a
cluster and is drawn vertically to height where the cluster merges with neighbouring
clusters. The cutting line is a line drawn horizontally across the dendrogram at a
given dissimilarity level to determine the number of clusters. The cutting line is
determined by constructing a histogram of node levels to find where the increase
in dissimilarity is strongest, as then we have reached a level where we are grouping
groups that are already homogeneous. The cutting line is selected before this level
is reached.

Figure 3.4: Dendrogram: At the cutting line there are two clusters
3.4.8 Regression Analysis
The general computational problem that needs to be solved in linear regression
analysis is to fit a straight line to a set of points [43]. When there is more than
one independent variable, the regression procedures will estimate a linear equation
of the form shown in Equation 3.16, where Y is the dependent variable, the X_i stand
for a set of independent variables, a is a constant and each b_i is the slope of the
regression line. The constant a is also known as the intercept, and the slope as the
regression coefficient.
Y = a + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p   (3.16)
The regression line expresses the best prediction of the dependent variable Y
given the independent variables Xi. However, usually there is substantial variation
of the observed points around the fitted regression line. The deviation of a particular
point from the line is known as the residual value. The smaller the variability of
the residual values around the regression line relative to the overall variability, the
better the prediction. This ratio of residual to overall variability always falls between
0.0 and 1.0: if there is no relationship between the X and Y variables the ratio will be
1.0, while if X and Y are perfectly related the ratio will be 0.0. The least squares
method is employed to perform the regression.
The R2 or the coefficient of determination is 1.0 minus this ratio. The R2 value
is an indicator of how well the model fits the data. If we have an R2 close to 1.0 this
indicates that we have accounted for almost all of the variability with the variables
specified in the model.
The correlation coefficient R expresses the degree to which two or more indepen-
dent variables are related to the dependent variable, and it is the square root of R2.
R can assume values between -1 and +1. The sign (plus or minus) of the correlation
coefficient indicates the direction of the relationship between the variables. If
it is positive, then the relationship of this variable with the dependent variable is
positive. If it is negative then the relationship is negative. If it is zero then there is
no relationship between the variables.
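For the single-predictor case these quantities reduce to a few lines of code. The
sketch below fits a least-squares line and reports R2 and R; it is purely illustrative,
and the names are not taken from any tool used in this thesis.

    final class LeastSquares {
        // Fit y = a + b*x by least squares and report R^2, the proportion of
        // the variability in y accounted for by the regression line.
        static double rSquared(double[] x, double[] y) {
            int n = x.length;
            double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
            for (int k = 0; k < n; k++) {
                sumX += x[k]; sumY += y[k];
                sumXY += x[k] * y[k]; sumXX += x[k] * x[k];
            }
            double b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
            double a = (sumY - b * sumX) / n;
            double meanY = sumY / n;
            double residual = 0, total = 0; // sums of squared deviations
            for (int k = 0; k < n; k++) {
                double fitted = a + b * x[k];
                residual += (y[k] - fitted) * (y[k] - fitted);
                total += (y[k] - meanY) * (y[k] - meanY);
            }
            return 1.0 - residual / total; // R^2 = 1 - (residual ratio)
        }

        public static void main(String[] args) {
            double[] x = {1, 2, 3, 4}, y = {2.1, 3.9, 6.2, 7.8};
            double r2 = rSquared(x, y);
            System.out.println("R^2 = " + r2 + ", R = " + Math.sqrt(r2));
        }
    }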
3.4.9 Analysis of Variance (ANOVA)
ANOVA is used to test the significance of the variation in the dependent variable that
can be attributed to the regression of one or more independent variables. The results
enable us to determine whether or not the explanatory variables bring significant
information to the model. ANOVA gives a statistical test of the null hypothesis H0,
that there is no linear relationship between the variables, against the alternative
hypothesis H1, that there is such a relationship.
There are four parts to the ANOVA results: the sum of squares, the degrees of
freedom, the mean squares and the F test. Fisher's F test, as given by Equation 3.17,
is used to test whether the R2 values are statistically significant. Values are deemed
to be significant at p ≤ 0.05.
F = (R2 ∗ (N − K − 1)) / ((1 − R2) ∗ K) (3.17)
Here, K is the number of independent variables (two in our case) and N is the
number of observed values.
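Equation 3.17 translates directly into code. The sketch below computes the F
statistic only; looking up the corresponding p-value for the test at p ≤ 0.05 is left
to standard tables or a statistics library.

    final class FisherF {
        // F statistic of Equation 3.17 for a regression with K independent
        // variables fitted to N observations.
        static double fStatistic(double rSquared, int n, int k) {
            return (rSquared * (n - k - 1)) / ((1.0 - rSquared) * k);
        }

        public static void main(String[] args) {
            // For example, R^2 = 0.6 with K = 2 predictors and N = 20
            // observations gives F = (0.6 * 17) / (0.4 * 2) = 12.75.
            System.out.println(fStatistic(0.6, 20, 2));
        }
    }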
3.5 Conclusion
A detailed account of the tools and techniques needed to conduct the case studies
described in the following chapters was given in this chapter. The programs
evaluated in this work were discussed, and an outline of the statistical techniques
used to analyse the results was provided.
Chapter 4
Case Study 1: The Influence of
Instruction Coverage on the
Relationship Between Static and
Run-time Coupling Metrics
When comparing static and run-time measures it is important to have a thorough
understanding of the degree to which the analysed source code corresponds to the
code that is actually executed. In this chapter this relationship is studied using
instruction coverage measures with regard to the influence of coverage on the rela-
tionship between static and run-time metrics. It is proposed that coverage results
have a significant influence on the relationship and thus should always be a mea-
sured, recorded factor in any such comparison.
An empirical investigation is conducted using a set of six run-time metrics on
seventeen Java benchmark and real-world programs. First, the differences in the
underlying dimensions of coupling captured by the static versus the run-time metrics
are assessed using principal component analysis. Subsequently, multiple regression
analysis is used to study the predictive ability of the static CBO and instruction
coverage data to extrapolate the run-time measures.
4.1 Goals and Hypotheses
The Goal Question Metric/MEtric DEfinition Approach (GQM/MEDEA) frame-
work proposed by Briand et al. [18] was used to set up the experiments for this
study.
Experiment 1:
Goal: To investigate the relationship between static and run-time coupling met-
rics.
Perspective: We would expect some degree of correlation between the run-time
measures for coupling and the static CBO metric. We use a number of statistical
techniques, including principal component analysis, to analyse the covariate struc-
ture of the metrics and to determine if they are measuring the same class properties.
Environment: We chose to evaluate a number of Java programs from well-defined,
publicly available benchmark suites, as well as a number of open source real-world
programs.
Hypothesis:
H0 : Run-time measures for coupling are simply surrogate measures for the static
CBO metric.
H1 : Run-time measures for coupling are not simply surrogate measures for the
static CBO metric.
Experiment 2:
Goal: To examine the relationship between static CBO and run-time coverage
metrics, particularly in the context of the influence of instruction coverage.
Perspective: Intuitively, one would expect the better the coverage of the test
cases used the greater the correlation between the static and run-time metrics. We
use multiple regression analysis to determine if there is a significant correlation.
Environment: We chose to evaluate a number of Java programs from well-defined,
publicly available benchmark suites, as well as a number of open source real-world
programs.
Hypothesis:
H0 : The coverage of the test cases used to evaluate a program has no influence
on the relationship between static and run-time coupling metrics.
H1 : The coverage of the test cases used to evaluate a program has an influence
on the relationship between static and run-time coupling metrics.
4.2 Experimental Design
In order to conduct the practical experiments underlying this study, it was necessary
to select a suite of Java programs and measure:
• the static CBO metric
• the instruction coverage percentages: IC
• the run-time coupling metrics: IC CC, EC CC, IC CM, EC CM, IC CD, EC CD
The static metrics data collection tool StatMet, described in Section 3.2.3, was
used to calculate CBO, while the InCov tool, outlined in Section 3.2.4, was used to
determine the instruction coverage. The run-time metrics were evaluated using the
ClMet tool, which is described in Section 3.2.1.
The set of programs used in this study consist of the benchmark programs JOlden
and SPECjvm98, as well as the real-world programs Velocity, Xalan and Ant. The
SPECjvm98 suite was chosen as it is directly comparable to other studies that use
Java software. The program mtrt was excluded from the investigation as it is multi-
threaded and therefore is not suitable for this type of analysis. The more synthetic
JOlden programs were included to ensure that the study considers programs that
create significantly large populations of objects. Three of the programs from the
JOlden suite, BiSort, TreeAdd and TSP, were omitted from the analysis as they
contain only two classes, so their results could not be analysed further. A selection
of real-world programs was also included to ensure that the results generalise to
realistic programs.
4.3 Results
4.3.1 Experiment 1: To investigate the relationship between
static and run-time coupling metrics
For each program, the central tendency (mean) and variability (standard deviation)
of each measure across the classes are calculated. These statistics are used to select
metrics that exhibit enough variance to merit further analysis, as a low variance
metric would not differentiate classes very well and therefore would not be a useful
predictor of external quality. Descriptive statistics also aid in explaining the results
of the subsequent analysis.
The descriptive statistic results for each program are summarised in Table 4.1.
The metric values exhibit large variances which makes them suitable candidates for
further analysis.
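For reference, these descriptive statistics reduce to the usual per-metric mean and
(sample) standard deviation across the classes of a program, as in the following
sketch; the metric values shown are invented for illustration.

    final class Descriptives {
        static double mean(double[] values) {
            double sum = 0;
            for (double v : values) sum += v;
            return sum / values.length;
        }

        static double standardDeviation(double[] values) {
            double m = mean(values), sumSq = 0;
            for (double v : values) sumSq += (v - m) * (v - m);
            return Math.sqrt(sumSq / (values.length - 1)); // sample SD
        }

        public static void main(String[] args) {
            // Hypothetical CBO values for the classes of one program.
            double[] cbo = {2, 4, 6, 9, 13};
            System.out.println(mean(cbo) + " +/- " + standardDeviation(cbo));
        }
    }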
Principal Component Analysis
Principal Component Analysis (PCA) is used to investigate whether the run-
time coupling metrics are not simply surrogate measures for static CBO.
A similar study was carried out by Arisholm et al. using only the Velocity
program [5]. The work in this chapter extends their work to include fourteen bench-
mark programs as well as three real-world programs in order to demonstrate the
robustness of these results over a larger range and variety of programs.
Appendix A.1 shows the results of the principal component analysis used to in-
vestigate the covariate structure of the static and run-time metrics. Using the Kaiser
criterion to select the number of factors to retain shows that the metrics mostly cap-
ture three orthogonal dimensions in the sample space formed by all measures. In
other words, the coupling is divided along three dimensions for each of the programs
SPECjvm98 Benchmark Suite

Program          CBO           IC_CC         IC_CM          IC_CD           EC_CC         EC_CM         EC_CD
201 compress     6.24 (6.2)    1.72 (2.11)   4.34 (3.54)    7.56 (5.46)     1.80 (1.16)   4.35 (4.76)   6.56 (4.56)
202 jess         6.99 (4.78)   2.97 (7.21)   4.34 (3.43)    5.45 (4.54)     2.97 (9.01)   4.34 (4.35)   7.56 (6.56)
205 raytrace     7.25 (7.51)   2.14 (4.25)   4.45 (3.54)    7.56 (6.56)     2.06 (1.89)   4.54 (4.53)   6.56 (4.56)
209 db           9.12 (6.60)   1.81 (1.98)   6.56 (4.46)    9.67 (8.68)     1.88 (1.54)   6.45 (5.67)   9.57 (7.65)
213 javac        8.54 (7.15)   3.21 (3.01)   5.45 (4.56)    7.56 (7.56)     3.01 (2.87)   3.45 (4.56)   5.45 (5.65)
222 mpegaudio    5.75 (4.90)   2.60 (2.36)   4.54 (3.56)    7.56 (6.56)     2.60 (2.70)   5.45 (4.56)   5.87 (5.46)
228 jack         6.05 (7.51)   2.68 (5.37)   3.45 (3.43)    5.45 (4.45)     2.68 (2.39)   5.45 (4.56)   7.56 (6.56)

JOlden Benchmark Suite

Program          CBO           IC_CC         IC_CM          IC_CD           EC_CC         EC_CM         EC_CD
BH               5.22 (3.40)   2.62 (2.50)   7.44 (8.86)    8.67 (10.84)    2.33 (1.33)   5.77 (4.44)   6.25 (4.74)
Em3d             4.20 (2.86)   3.22 (0.71)   3.87 (1.01)    4.76 (3.96)     3.75 (1.33)   3.35 (3.49)   4.65 (3.46)
Health           3.43 (3.46)   2.43 (2.46)   3.35 (4.24)    4.25 (5.46)     3.35 (3.46)   3.55 (2.43)   4.46 (4.43)
MST              4.34 (3.45)   3.54 (2.45)   4.23 (3.45)    7.54 (4.54)     3.45 (3.34)   3.45 (2.45)   4.56 (4.32)
Perimeter        5.34 (4.34)   3.34 (3.45)   4.34 (2.45)    8.56 (6.45)     3.54 (3.45)   4.54 (3.43)   6.54 (3.54)
Power            4.50 (2.54)   1.32 (0.45)   5.23 (2.23)    5.64 (2.56)     1.54 (1.45)   4.12 (4.56)   4.67 (5.35)
Voronoi          5.43 (3.46)   2.43 (1.45)   4.54 (0.45)    7.45 (3.46)     3.45 (3.46)   4.45 (2.45)   5.36 (2.46)

Real-World Programs

Program          CBO           IC_CC         IC_CM          IC_CD           EC_CC         EC_CM         EC_CD
Velocity         7.59 (7.57)   4.27 (7.11)   8.45 (10.87)   20.45 (32.14)   3.85 (4.30)   7.54 (9.45)   25.45 (28.45)
Xalan            8.98 (9.92)   4.03 (4.61)   8.54 (8.99)    35.45 (38.14)   2.85 (3.60)   6.54 (7.56)   42.15 (45.12)
Ant              8.49 (7.74)   3.92 (7.91)   7.46 (8.78)    16.75 (17.25)   2.43 (3.51)   7.04 (7.54)   21.23 (20.56)

Table 4.1: Descriptive statistic results for all programs; each cell gives the mean
with the standard deviation in parentheses
analysed.
Analysing the definitions of the measures that exhibit high loadings in PC1, PC2
and PC3 yields the following interpretation of the coupling dimensions:
• PC1 = {IC CC, IC CD, IC CM}, the run-time import coupling metrics as
illustrated by Figure 4.1(a).
• PC2 = {EC CC,EC CD,EC CM}, the run-time export coupling metrics
as illustrated by Figure 4.1(b).
• PC3 = {CBO}, the static coupling metric as illustrated by Figure 4.1(c).
Figure 4.1 summarises these results graphically. Overall the PCA results demon-
strate that the run-time coupling metrics are not redundant with the static CBO
metric and that they capture additional dimensions of coupling. This leads us to
reject our null hypothesis H0 and conclude that run-time measures for coupling are
not simply surrogate measures for the static CBO metric, suggesting that additional
information, over and above that obtainable from the static CBO metric, can be
extracted using run-time metrics. This confirms that the findings of Arisholm et
al. for the single Velocity program are applicable across a variety of programs.
The results also indicate that the direction of coupling is a greater determin-
ing factor than the type of coupling, with PC1 containing the three import-based
metrics and PC2 containing the three export-based metrics.
4.3.2 Experiment 2: The influence of instruction coverage
Multiple Regression Analysis
Multiple regression analysis is used to test the hypothesis that instruction coverage
of test cases used to evaluate a program has no influence on the relationship between
static and run-time metrics. The two independent variables are thus the static CBO
metric and the instruction coverage measure Ic; each of the six run-time coupling
metrics in turn is then used as the dependent variable. A full list of these results
can be found in Appendix A.2.
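The shape of such a two-predictor fit can be sketched as follows, solving the normal
equations for Y = a + b1*CBO + b2*Ic directly; this is an illustrative sketch with
invented per-class data, not the statistics package actually used in the analysis.

    final class TwoPredictorRegression {
        // Fit y = a + b1*x1 + b2*x2 by least squares via the normal equations
        // (X^T X) beta = X^T y, solved here by Gauss-Jordan elimination.
        static double[] fit(double[] x1, double[] x2, double[] y) {
            double[][] m = new double[3][3];
            double[] rhs = new double[3];
            for (int k = 0; k < y.length; k++) {
                double[] row = {1.0, x1[k], x2[k]};
                for (int i = 0; i < 3; i++) {
                    rhs[i] += row[i] * y[k];
                    for (int j = 0; j < 3; j++) m[i][j] += row[i] * row[j];
                }
            }
            for (int col = 0; col < 3; col++) {
                int pivot = col; // partial pivoting for numerical stability
                for (int r = col + 1; r < 3; r++)
                    if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) pivot = r;
                double[] tmpRow = m[col]; m[col] = m[pivot]; m[pivot] = tmpRow;
                double tmp = rhs[col]; rhs[col] = rhs[pivot]; rhs[pivot] = tmp;
                for (int r = 0; r < 3; r++) {
                    if (r == col) continue;
                    double factor = m[r][col] / m[col][col];
                    for (int c = 0; c < 3; c++) m[r][c] -= factor * m[col][c];
                    rhs[r] -= factor * rhs[col];
                }
            }
            return new double[] {rhs[0] / m[0][0], rhs[1] / m[1][1], rhs[2] / m[2][2]};
        }

        public static void main(String[] args) {
            // Hypothetical per-class data: CBO, instruction coverage and IC_CC.
            double[] cbo = {2, 5, 7, 9}, ic = {40, 55, 80, 90}, icCc = {1, 2, 4, 5};
            double[] beta = fit(cbo, ic, icCc);
            System.out.printf("a = %.3f, b1 = %.3f, b2 = %.3f%n", beta[0], beta[1], beta[2]);
        }
    }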
(a) Results from PCA for IC CC, IC CM and IC CD
(b) Results from PCA for EC CC, EC CM and EC CD
(c) Results from PCA for CBO
Figure 4.1: PCA test results for all programs for metrics in PC1, PC2 and PC3. In
all graphs the bars represent the PCA value obtained for the corresponding metric.
PC1 contains the import-level run-time metrics, PC2 contains the export-level
run-time metrics and PC3 contains the static CBO metric.
First, all R values turned out to be positive for each of the programs used in
this study. This means that there is a positive correlation between the dependent
variable (the run-time metric) and the independent variables CBO and Ic: as the
values for CBO and Ic increase or decrease, so does the observed value for the
run-time metric under consideration.
Figures 4.2(a) and 4.2(b) give a pictorial view of the results from the multiple
regression analysis for all programs for class-level run-time coupling, and Figures
4.3(a) and 4.3(b) for method-level run-time coupling. The lighter bars represent the
influence of CBO, while the darker bars represent the influence of both CBO
and Ic. The difference between these therefore gives the additional amount of the
variation in the run-time metric that can be attributed to the influence of instruction
coverage.
Distinct Classes: IC CC and EC CC
It is immediately apparent from Figures 4.2(a) and 4.2(b) that instruction cov-
erage is a significant influencing factor. For example, from Figure 4.2(a) it can be
seen that in ten of the programs, Ic accounts for an additional 20% of the variation.
The two programs in Figure 4.2(a) that show little increase, MST and Voronoi,
already exhibit a high correlation with CBO alone, which would have been difficult
to improve on. While the increase is not uniform across the programs in Fig-
ure 4.2(a), the overall data demonstrates that instruction coverage is an important
contributory factor.
Figure 4.2(b), representing the contribution of CBO and Ic to export coupling
measured at the class level, presents a sharper contrast. Here, the influence of Ic
is clearly a vital contributing factor, accounting for at least an extra 20% of the
variation in eleven of the seventeen programs. The important factor here is that
the overall contribution of CBO to export coupling is much lower than to import
coupling, as can be seen from contrasting the lighter-shaded bars in Figure 4.2(a)
with those in Figure 4.2(b). Thus classes with a high level of static coupling exhibit
a higher level of import coupling at run-time. This indicates that the coupling being
exercised at run-time is from classes behaving as clients, making use of other class
(a) Results from the multiple linear regression where Y = IC CC.
(b) Results from the multiple linear regression where Y = EC CC.
Figure 4.2: Multiple linear regression results for class-level metrics (IC CC and
EC CC). In both graphs the lighter bars represent the R2 value for CBO, and the
darker bars represent the R2 value for CBO and Ic combined.
(a) Results from the multiple linear regression where Y = IC CM
(b) Results from the multiple linear regression where Y = EC CM
Figure 4.3: Multiple linear regression results for method-level metrics (IC CM and
EC CM). In both graphs the lighter bars represent the R2 value for CBO, and
the darker bars represent the R2 value for CBO and Ic combined.
methods, rather than those behaving as servers, offering their methods for use by
others. The greater influence of Ic in export coupling results from there being less
of a drop in its influence between IC CC and EC CC, suggesting that instruction
coverage, as a predictor of coupling, is not as sensitive to the direction of that
coupling.
Distinct Methods: IC CM and EC CM
The results for the IC CM and EC CM , illustrated by Figures 4.3(a) and 4.3(b),
present a similar picture. Both of these run-time metrics are scaled by the number
of methods involved in the coupling relationship. Given that CBO is defined on a
class level, it does surprisingly well in influencing the IC CM metric. Instruction
coverage is also defined at a class level, but nonetheless accounts for roughly an
extra 20% of the variance for five programs, and roughly an extra 10% for five other
programs. The drop between import and export coupling is accentuated here, but
while Figure 4.3(b) shows CBO to be a poor predictor for EC CM, instruction
coverage dramatically improves this for over half the programs studied.
Overall, these results show that coverage has a significant impact on the correla-
tion between static CBO and the four run-time coupling metrics defined for distinct
classes and distinct methods.
Run-time Messages: IC CD and EC CD
The run-time metrics IC CD and EC CD did not exhibit a significant relationship
for any of the programs under consideration and thus are not depicted graphically
here. As these metrics are defined in terms of counts of distinct run-time messages
(method executions), this result was not surprising. It is reasonable to postulate
that such metrics might be more influenced by the “hotness” of a particular method,
and the distribution of execution focus through the program, rather than instruction
coverage data. This was the result we expected for the measures based on the number
of dynamic method calls.
4.4 Conclusion
From our experimental data, using principal component analysis, we showed that
run-time coupling metrics captured different properties than static CBO and there-
fore are not simply surrogate measures for CBO. This indicated that useful infor-
mation beyond that which is provided by CBO may be obtained through the use of
these run-time measures.
Second, we found that the coverage of test cases used to evaluate a program had
a significant impact on the correlation between CBO and run-time coupling metrics
and thus should be a measured, recorded factor in any comparison made. We found
that instruction coverage and CBO together were a better predictor of the run-time
metrics based on distinct class (IC CC, EC CC) and distinct method counts (IC CM,
EC CM) than CBO alone. Appendix A.2 includes the Fisher's F test results, which
show that all results were statistically significant at the 5% level of significance.
Chapter 5
Case Study 2: The Impact of
Run-time Cohesion on Object
Behaviour
In this study we present an investigation into the run-time behaviour of objects
in Java programs and whether cohesion metrics are a good predictor of object be-
haviour. Based on the definition of static CBO it would be expected that objects
derived from the same class would exhibit similar coupling behaviour, that is, that
they would be coupled to the same classes and make the same accesses. It is un-
known whether static CBO provides a true measure of coupling between objects, or
whether it is restricted to being a measure of the level of coupling between classes.
To this end, a measure, the Number of Object-Class Clusters (NOC), is proposed
in an attempt to analyse run-time object behaviour. This measure is derived from
a statistical analysis of run-time object-level coupling metrics. Cluster analysis is
used to group objects together based on the similarity of the accesses they make to
other classes. Therefore one would expect objects from the same class to occupy the
same cluster. If more than one cluster is found for a class then it is reasonable to
postulate that the class has objects that are behaving differently at run-time from
the point of view of coupling. A selection of programs is analysed to determine if
this is the case.
The second part of this study involves determining the predictive ability of cohe-
sion metrics (both static and run-time) to forecast object behaviour, in other words,
how well they indicate the NOC for a class. First, the differences in the under-
lying dimensions of cohesion captured by the static versus the run-time measures
are assessed using principal component analysis. Subsequently, multiple regression
analysis is used to study the predictive ability of cohesion metrics to extrapolate
NOC for a class. We also wish to determine if a run-time definition of cohesion is a
better predictor of NOC than the static SLCOM version alone.
5.1 Goals and Hypotheses
The GQM/MEDEA framework was used to set up the experiments for this study.
Experiment 1:
Goal: To determine if objects from the same class behave differently at run-time
from the point of view of coupling.
Perspective: We investigate the behaviour of objects at run-time with respect
to coupling using a number of metrics which measure the level of coupling at dif-
ferent layers of granularity. We use a number of statistical techniques capable of
separating objects from a class into groups based on their similarity.
Environment: Since we are studying object behaviour, a set of Java programs
which create a large number of objects at run-time are used. These are supple-
mented with a number of real-world programs to ensure the results are scalable to
genuine programs.
Hypothesis:
H0 : Objects from a class behave similarly at run-time from the point of view of
coupling.
H1 : Objects from a class behave differently at run-time from the point of view
of coupling.
Experiment 2:
Goal: To determine if a run-time definition for cohesion gives any additional
information about class behaviour over and above the standard static definition.
Perspective: Within a highly cohesive class the components of the class are
functionally related; in a class that exhibits low cohesion, they are not. Intu-
itively, one would expect that the more cohesive the class, the lower its NOC.
We use a number of statistical techniques, including PCA and regression analysis to
determine if there is a significant correlation between static and run-time cohesion
and NOC . We also wish to determine if run-time cohesion is a better predictor of
NOC than the static version alone.
Environment: Since we are studying object behaviour a set of Java programs
which create a large number of objects at run-time are used. These are supple-
mented with a number of real-world programs to ensure the results are scalable to
genuine programs.
Hypothesis:
H0 : Run-time cohesion metrics do not provide additional information about
class behaviour over and above that provided by static SLCOM.
H1 : Run-time cohesion metrics provide additional information about class be-
haviour over and above that provided by static SLCOM.
5.2 Experimental Design
For this study it was necessary to calculate:
• the run-time object-level coupling metric: IC OC
• the Number of Object-Class Clusters: NOC
• the static SLCOM
• the run-time cohesion metrics: RLCOM, RWLCOM

             GreyNode   QuadTreeNode   WhiteNode
BlackNode1   0          2              0
BlackNode2   0          2              0
BlackNode3   0          2              0
BlackNode4   0          2              0

Table 5.1: Matrix of unique accesses per object, for objects
BlackNode1, . . . , BlackNode4 to classes GreyNode, QuadTreeNode and WhiteNode
IC OC was calculated using the object-level run-time metric analysis tool ObMet,
which is described in Section 3.2.2. In order to test the first hypothesis, the
coefficient of variation, CV, was calculated for the IC OC results to determine how
the IC OC values varied across the objects of a class. If the CV for all classes under
consideration is zero, this would lead us to accept the null hypothesis, H0, as all
objects of each class would be making the same accesses. However, if there was
variation in the IC OC values, CV > 0, this would lead us to reject H0 and accept
H1, as the objects would be behaving differently at run-time from the point of view
of coupling.
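As a sketch of this check, assuming per-object IC OC values have been collected
for one class, the CV (the sample standard deviation expressed as a percentage of
the mean) might be computed as follows; the names used are illustrative.

    final class CoefficientOfVariation {
        // CV = (standard deviation / mean) * 100, computed over the IC_OC
        // values of all objects instantiated from a single class.
        static double cv(double[] icOcPerObject) {
            double sum = 0;
            for (double v : icOcPerObject) sum += v;
            double mean = sum / icOcPerObject.length;
            double sumSq = 0;
            for (double v : icOcPerObject) sumSq += (v - mean) * (v - mean);
            double sd = Math.sqrt(sumSq / (icOcPerObject.length - 1));
            return 100.0 * sd / mean;
        }

        public static void main(String[] args) {
            // Identical access counts give CV = 0, consistent with H0 ...
            System.out.println(cv(new double[] {2, 2, 2, 2}));
            // ... while varying access counts give CV > 0, evidence for H1.
            System.out.println(cv(new double[] {1, 2, 4, 5}));
        }
    }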
To determine the NOC for a class, one class is fixed and the distribution of
unique accesses per object is determined. A matrix of such values for each class
in the program under consideration is constructed. Table 5.1 gives an example of
such a matrix, where we record the run-time coupling values for individual objects
of class BlackNode, BlackNode1, . . . , BlackNode4, against the classes GreyNode,
QuadTreeNode and WhiteNode. This data is statistically analysed using cluster
analysis to evaluate the behaviour of the objects. This technique groups objects
together based on their similarity. The number of clusters is determined, and this
becomes the NOC for that class. In order to accept H0 we would expect objects from
the same class to group together and occupy the same cluster, therefore expecting
values of NOC to be 1. The formation of a number of different clusters, where NOC
> 1, would lead us to reject H0 and accept H1.
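To make the procedure concrete, the sketch below groups the object rows of such
a matrix using a simple distance-threshold grouping and reports the number of
groups as NOC. The actual analysis uses the AHC procedure of Section 3.4; this
single-threshold variant, with illustrative names, is only an approximation for
exposition.

    import java.util.Arrays;

    final class NocSketch {
        // Group object rows whose access profiles lie within `threshold` of a
        // group's seed row (Euclidean distance, Equation 3.14) and count groups.
        static int noc(double[][] accessMatrix, double threshold) {
            int n = accessMatrix.length;
            int[] group = new int[n];
            Arrays.fill(group, -1);
            int clusters = 0;
            for (int i = 0; i < n; i++) {
                if (group[i] != -1) continue;
                group[i] = clusters;
                for (int j = i + 1; j < n; j++)
                    if (group[j] == -1
                            && distance(accessMatrix[i], accessMatrix[j]) <= threshold)
                        group[j] = clusters;
                clusters++;
            }
            return clusters;
        }

        static double distance(double[] x, double[] y) {
            double sum = 0;
            for (int k = 0; k < x.length; k++) sum += (x[k] - y[k]) * (x[k] - y[k]);
            return Math.sqrt(sum);
        }

        public static void main(String[] args) {
            // The BlackNode objects of Table 5.1 all share the profile (0, 2, 0),
            // so they fall into a single group and NOC = 1.
            double[][] blackNodes = {{0, 2, 0}, {0, 2, 0}, {0, 2, 0}, {0, 2, 0}};
            System.out.println(noc(blackNodes, 0.5));
        }
    }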
JOlden Benchmark Suite

Program      IC_OC          NOC            SLCOM           RLCOM           RWLCOM
BH           1.83 (2.74)    2 (0.52)       0.317 (0.30)    0.144 (0.287)   0.248 (0.226)
Em3d         1 (0.5)        6 (−)          0.317 (0.223)   0.190 (0.381)   0.472 (0.572)
Health       2.5 (1.84)     2.5 (1.29)     0.318 (0.223)   0.171 (0.189)   0.335 (0.356)
MST          2 (1.54)       2.5 (2.12)     0.163 (0.283)   0.111 (0.172)   0.252 (0.154)
Perimeter    2.25 (2.6)     2.5 (1.73)     0.136 (0.275)   0.104 (0.285)   0.132 (0.254)
Power        1.66 (1.88)    2 (1.73)       0.151 (0.199)   0.083 (0.204)   0.155 (0.134)
Voronoi      2 (2.12)       4.5 (0.71)     0.373 (0.238)   0.265 (0.363)   0.448 (0.438)

Real-World Programs

Program      IC_OC          NOC            SLCOM           RLCOM           RWLCOM
Velocity     6.14 (7.21)    5.1 (2.45)     0.314 (0.385)   0.154 (0.254)   0.398 (0.454)
Xalan        7.45 (8.21)    6.7 (3.45)     0.251 (0.305)   0.198 (0.241)   0.354 (0.484)
Ant          8.11 (8.65)    7.2 (2.56)     0.333 (0.31)    0.247 (0.208)   0.387 (0.355)

Table 5.2: Descriptive statistic results for all programs; each cell gives the mean
with the standard deviation in parentheses (no standard deviation was given for
the NOC of Em3d)
The static metrics data collection tool StatMet, described in Section 3.2.3, was
used to calculate SLCOM. The ClMet tool, described in Section 3.2.1, was used to
calculate the run-time cohesion metrics.
The analysis was conducted on the programs from the JOlden benchmark suite
as well as the real-world programs Velocity, Xalan and Ant. Three of the programs,
BiSort, TSP and TreeAdd, contain too few classes to perform PCA and regression
analysis, and are therefore excluded from further analysis. The SPECjvm98 benchmark
programs that were used in the previous study were excluded from this analysis as
they did not exhibit significant volumes of object creation.
5.3 Results
Table 5.2 summarises the descriptive statistic results for each program. The mea-
sures all exhibited large variances, which makes them suitable candidates for further
analysis.
5.3.1 Experiment 1: To determine if objects from the same
class behave differently at run-time from the point of
view of coupling
IC OC Results
The IC OC metric is used to investigate whether objects of the same class type are
coupled to the same classes at run-time. The first thing to look at is the CV results
for the IC OC metric, as depicted by Figure 5.1. If all objects from the same class
are behaving in a similar fashion we would expect them to make accesses to the
same classes at run-time. Consequently, there should be little or no variability in
the IC OC values for objects from the same class; for example, two classes from
BH had a CV of 0. However, for the classes from the set of programs studied, the
CV varied from 0% to 54.2%. In the cases where the CV > 0, we have classes with
objects that are coupled to different classes at run-time. A class might create one
group of objects that access one set of classes and another that access a different set.
So we have a number of objects from the same class that are behaving differently at
run-time at the class-class level. Due to these results, at the class-class level, we can
reject H0 and accept H1. One cannot observe such behaviour simply by calculating
the static CBO value for that class.
NOC Results
Figure 5.2 illustrates the NOC values for the programs under consideration. The NOC
values range from one to seven and the bars represent the number of classes from
each program that exhibit that value. Since cluster analysis groups objects together
based on the similarity of the accesses they make to other classes one would expect
objects from the same class to occupy the same cluster (NOC = 1). This was the
case for a large proportion of the classes under consideration, for example 50% of the
classes from the program BH from the JOlden suite exhibited an NOC of 1. Similar
results were obtained with the real-world programs with NOC = 1 for 51% of classes
from Velocity, 49% from Xalan and 48% from Ant. However, there were instances
where more than one cluster was found for a class, for example 50% of the classes
Figure 5.1: CV of IC OC for classes from the programs studied. The bars represent
the number of classes in each program that have CV in the corresponding range.
Figure 5.2: NOC results of cluster analysis. The bars represent the number of classes
in each program that have the corresponding NOC value.
from Perimeter from the JOlden suite had NOC = 4. When more than one cluster
is found, we have the situation where a single class is creating groups of objects
that are exhibiting different behaviours at run-time. This leads us to reject H0 and
accept H1 to state that objects from a class can behave differently at run-time from
the point of view of coupling.
Looking at Figures 5.1 and 5.2, there seems to be a relationship between the CV
and the number of clusters, with both graphs being markedly similar. In many cases
a high CV corresponds to more than one cluster. Intuitively this makes sense, as it
is easy to see how variation in the classes used by an object would lead to variation
in the accesses it makes, and consequently to a number of groups of objects
behaving differently.
From these findings, it is suggested that the static CBO metric would be better
defined as coupling between classes as it does not necessarily give a true measure of
run-time coupling between objects.
5.3.2 Experiment 2: The influence of cohesion on the NOC
The following statistical analysis is applied to determine, first, if run-time cohesion
metrics are redundant with respect to SLCOM and, second, if cohesion metrics are
good predictors of NOC.
Principal Component Analysis
Initially, we investigate the relationship between static and run-time cohesion met-
rics. We use PCA to determine if the static and run-time cohesion metrics are
likely to be measuring the same class property, in other words it is used to examine
whether the run-time cohesion metrics are not simply surrogate measures for static
SLCOM .
Appendix B.1 shows the results of the principal component analysis when all
of the cohesion metrics are taken into consideration. Using the Kaiser criterion to
select the number of factors to retain it is found that the metrics mostly capture
two orthogonal dimensions in the sample space formed by all measures. In other
words, cohesion is divided along two dimensions for each of the programs analysed.
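The Kaiser criterion itself is straightforward to express in code: components whose
eigenvalues exceed 1.0 are retained. In the sketch below the eigenvalues are assumed
to come from a PCA of the standardised metric data, and the sample values are
invented for illustration.

    final class KaiserCriterion {
        // Retain the principal components whose eigenvalues exceed 1.0, i.e.
        // those explaining more variance than a single standardised variable.
        static int componentsToRetain(double[] eigenvalues) {
            int retained = 0;
            for (double ev : eigenvalues) if (ev > 1.0) retained++;
            return retained;
        }

        public static void main(String[] args) {
            // Hypothetical eigenvalues for the three cohesion metrics:
            // two components would be retained.
            System.out.println(componentsToRetain(new double[] {1.7, 1.1, 0.2}));
        }
    }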
Analysing the definitions of the measures that exhibit high loadings in PC1 and
PC2 yields the following interpretation of the cohesion dimensions:
• PC1 = {RLCOM, RWLCOM}, the run-time cohesion metrics.
• PC2 = {SLCOM}, the static cohesion metric.
Figure 5.3 summarises these results graphically. The PCA findings from this
study indicate that no significant information about the cohesiveness of a class
can be gained by evaluating the RWLCOM instead of the simpler RLCOM , as both
metrics belonged to the same principal component. This means not enough variance
is captured by the RWLCOM that is not accounted for by RLCOM .
However, the PCA results indicate that RLCOM is not redundant with respect to
SLCOM and that it captures additional information about cohesion. The values show
that RLCOM is not simply an alternative static measure. Clearly, the simple static
calculation of SLCOM masks a considerable amount of detail available at run-time.
Figure 5.3: PCA test results for all programs for metrics in PC1 and PC2. In both
graphs the bars represent the PCA value obtained for the corresponding metric.
PC1 contains RLCOM and RWLCOM. PC2 contains SLCOM.
Multiple Regression Analysis
Next we wish to discover if cohesion metrics are good predictors of object behaviour,
that is, whether they can be used to deduce the NOC for a class. Multiple regression analysis
is used for this purpose. In this case the dependent variable is the NOC , while the
independent variables are the static SLCOM and the run-time RLCOM and RWLCOM
cohesion metrics. Appendix B.2 gives the results from this analysis.
First, the results show that there is a positive correlation between the NOC
(dependent variable) and the static and run-time cohesion measures (independent
variables), as all R values were positive. This means that as the values for SLCOM,
RLCOM and RWLCOM increase or decrease, so will the observed value for NOC. Intu-
itively this makes sense, as one would expect that the more cohesive the class, that
is, the lower its LCOM value, the more it is geared toward performing a
single function. Therefore one would expect the number of clusters to be low also.
Figure 5.4 summarises the results of the regression analysis for each of the pro-
grams analysed. The lighter bars represent the influence of SLCOM , while the darker
bars depict the influence of both SLCOM and RLCOM . The difference between the two
indicates the additional amount of variation that can be allocated to the run-time
cohesion metric.
It is apparent from this graph that the RLCOM is a significant factor influencing
NOC; for example, for the three real-world programs RLCOM accounts for approxi-
mately an additional 30% of the variation, while five of the benchmarks exhibit a
similar result. For eight out of the ten programs studied RLCOM was a better predictor of
NOC than SLCOM .
Overall, these results show that cohesion metrics are a good predictor of NOC ,
with run-time cohesion being the superior metric. This leads us to reject our null
hypothesis and state that run-time cohesion metrics provide additional information
about class behaviour over and above that provided by static SLCOM .
Only one program exhibited a significant result when using the RWLCOM mea-
sure, so the results have not been summarised graphically. This could be due
to the fact that the metric is defined on a call-weighted basis, which may skew the
results.
Figure 5.4: Results from multiple linear regression where Y=NOC . The lighter bars
represent the R2 for SLCOM , and the darker bars represent the R2 value for SLCOM
and RLCOM combined.
5.4 Conclusion
From this case study, we found that run-time object-level coupling metrics could be
used to investigate object behaviour. Using the IC OC run-time coupling measure
we discovered that objects from the same class exhibited different behaviours at
run-time from the point of view of coupling. Object behaviour was identified by
defining a new metric NOC which groups objects together based on their run-time
coupling properties.
We defined a number of metrics for evaluating run-time cohesion. First, we
showed that these measures were not redundant with respect to the static LCOM
measure and that they captured additional dimensions of cohesion. Next, we inves-
tigated the impact of run-time cohesion metrics on object behaviour using regression
analysis and showed that these run-time cohesion metrics were good predictors of
object behaviour, as identified by the NOC measure. Appendix B.2 gives the results
from this analysis, including the Fisher's F test results, which show that all results
were statistically significant at the 5% level of significance.
Chapter 6
Case Study 3: A Study of
Run-time Coupling Metrics and
Fault Detection
Fault-proneness detection is an important concern in many areas of software engi-
neering research; quality control and the management of maintenance effort both
depend on an understanding of it. In previous years, a large volume of work has
been performed in order to define suitable metrics and models for fault detection
[6, 13, 19, 41].
Code coverage has been proposed as an estimator of fault-proneness, but it
remains a controversial topic which lacks support from empirical data [22]. In this
case study we investigate whether instruction coverage is a significant predictor of
fault-proneness, an important software quality indicator. This is done by taking
a set of real-world programs, namely Velocity, Xalan and Ant, and introducing
faults into them using the mutation system µJava. Two kinds of mutations are
introduced separately into the programs, traditional and class-type mutations. We
then determine the percentage mutants killed (MK) by the set of test cases provided
with the programs. Equation 6.1 gives the formula for MK . Regression analysis is
applied to determine if instruction coverage is a good predictor of fault-proneness,
which is defined as the MK for the class for each type of mutation. From previous
work we expect instruction coverage to be a good predictor of non object-oriented
or traditional-type mutants [69].
MK = (Number of mutants killed / Total number of mutants created) ∗ 100 (6.1)
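Equation 6.1 amounts to a simple percentage, as the following sketch shows; the
figures in the example are invented for illustration.

    final class MutationScore {
        // Percentage of mutants killed, MK, as defined by Equation 6.1.
        static double mk(int killed, int created) {
            return 100.0 * killed / created;
        }

        public static void main(String[] args) {
            // E.g. 45 of 60 mutants killed by a class's test cases gives MK = 75.0.
            System.out.println(mk(45, 60));
        }
    }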
Next, we empirically validate a set of six run-time object-oriented metrics in
terms of their usefulness in predicting fault-proneness. We use regression analysis
again to investigate the ability of these run-time measures in predicting MK for both
types of mutations. From these two experiments we wish to discover if the run-time
measures for coupling are better predictors of fault-proneness than the traditional
coverage measure.
6.1 Goals and Hypotheses
The GQM/MEDEA framework was used to set up the experiments for this study.
Experiment 1:
Goal: To examine the relationship between coverage and fault detection, in the
context of instruction coverage.
Perspective: Code coverage has been proposed as an estimator of testing
effectiveness. Regression analysis is used to assess if coverage is a better indicator
of fault-proneness in comparison to the run-time coupling metrics. In particular we
investigate whether it is a better detector of traditional or class-type mutations in
programs.
Environment: We chose to evaluate a selection of open source real-world pro-
grams. Each program comes with its own set of JUnit test cases, thus defining both
the static and dynamic context of our work.
Hypothesis:
H0 : Coverage measures are poor detectors of faults in a program.
H1 : Coverage measures are good detectors of faults in a program.
Experiment 2:
Goal: To examine the relationship between run-time coupling metrics and fault
detection.
Perspective: Previous work has shown the static coupling measure CBO is a
good detector of faults in programs [13]. Intuitively, one would expect run-time cou-
pling measures to give a better indication as they are based on an actual execution
of the program. Regression analysis is used to determine if there is a significant
correlation.
Environment: We chose to evaluate a selection of open source real-world pro-
grams. Each program comes with its own set of JUnit test cases, thus defining both
the static and dynamic context of our work.
Hypothesis:
H0 : Run-time coupling metrics are poor detectors of faults in a program.
H1 : Run-time coupling metrics are good detectors of faults in a program.
6.2 Experimental Design
In order to conduct the practical experiments underlying this study, it was necessary
to select a suite of Java programs and measure:
• the instruction coverage percentages: IC
• the mutation coverage of test cases (mutants killed (MK))
• the run-time coupling metrics: IC CC, EC CC, IC CM, EC CM, IC CD, EC CD
The InCov tool, described in Section 3.2.4, was used to determine IC. The
run-time measures were evaluated using the ClMet tool. The mutation system
µJava, described in Section 3.2.5, was used to insert both traditional and class-
level mutants into the test case programs and to determine the MK rates of the test
cases supplied with the programs.
Three open source real-world programs, Velocity, Xalan and Ant, were evaluated in
this study. The SPECjvm98 and JOlden benchmark programs used in the previous
studies exhibited very poor mutant kill percentages when analysed (most classes
exhibited a 0% mutant kill rate) and were therefore excluded from further analysis.
6.3 Results
Percentage Mutant Kill Rate Results
Figure 6.1 gives the percentages of mutants killed upon the execution of the JUnit
test cases supplied with the programs analysed. Looking at Figure 6.1(a) for the
Velocity program, twenty-three classes exhibit a percentage kill rate of zero for the
class-level mutants, while thirteen classes exhibit the same rate for the traditional
mutants. At the other end of the spectrum for the class-level mutants, six classes
exhibit a percentage kill rate of between 90% and 100%, while seven classes exhibit
the same kill rate for the traditional mutants.
In their paper [66], Offutt et al. created test cases by hand for the set of programs
they studied, so that 100% MK was achieved. To date, no one has endeavoured
to apply this mutation system to a set of real programs, so there is no consensus on
what a desirable MK rate would be.
6.3.1 Experiment 1: To examine the relationship between
instruction coverage and fault detection.
Regression Analysis
We investigate the statistical relationship between instruction code coverage and
fault-proneness using regression analysis. The dependent variable is the percentage
mutant kill rate MK , while the independent variable is the instruction coverage mea-
sure Ic for each class. Both class and traditional mutants are evaluated separately.
Appendix C.2 gives the results from this analysis.
(a) Results from Mutation Testing for Velocity.
(b) Results from Mutation Testing for Xalan.
(c) Results from Mutation Testing for Ant.
Figure 6.1: Mutation test results for real-world programs Velocity, Xalan and Ant.
In all graphs the bars represent the number of classes that exhibit a percentage
mutant kill rate in the corresponding range.
Figure 6.2: Regression analysis results for the effectiveness of Ic in predicting class
and traditional-level mutations in real-world programs Velocity, Xalan and Ant. The
bars represent the R2 value for the run-time metric under consideration.
Figure 6.2 depicts the results on the effects of instruction coverage on fault-
proneness for both types of mutations tested. For all of the programs tested, Ic
proved to be a poor predictor of class-type mutants, with the highest value being
16.7% for Xalan. In contrast, Ic proved to be an effective indicator of traditional
mutations, with values ranging from 64.5% to 78.9%. This was as we expected:
coverage is not particularly effective in evaluating object-oriented programs, so we
would not expect it to be a good predictor of object-oriented faults.
6.3.2 Experiment 2: To examine the relationship between
run-time coupling metrics and fault detection.
Regression Analysis
Regression analysis is used to determine the effectiveness of run-time coupling met-
rics in detecting faults in programs. The dependent variable is the percentage mu-
tant kill rate of the test cases used to execute the programs, while the independent
variables are the six run-time coupling metrics. Both class and traditional mutants
are evaluated separately. Appendix C.1 gives the results from this analysis.
The traditional mutants did not show any relation with the run-time coupling
measures, with only the IC CC metric for the Velocity program exhibiting a signif-
icant correlation. This is in contrast to the results from the previous experiment
where Ic proved to be a poor predictor of class-type mutants but a good predictor
of traditional-type mutants.
Figure 6.3 illustrates the results for the effectiveness of the run-time coupling
metrics IC CC, IC CM, EC CC and EC CM in predicting the MK for class-level
mutations for each of the programs analysed. For two of the programs, Velocity and
Xalan, the IC CC measure was the strongest predictor of MK, at 69% and 59%
respectively. For the Ant program, the EC CC metric had the highest value at 69%,
however the IC CC value for this was also high at 60%. For all of the programs the
EC CM measure was the poorest predictor. There were five categories of mutations
introduced into the programs by µJava, as illustrated by Table D.2. We would
expect the coupling measures to be a good predictor of those mutations based on
inheritance, polymorphism and overloading. However, we would not expect such
a relationship for those based on Java-specific features and common programming
mistakes. The inclusion of these types of mutations may have negitavely skewed the
results.
Neither of the run-time metrics based on distinct message counts, IC CD and EC CD,
exhibited a significant result, and they have therefore not been summarised graphically.
As was the case in Section 4.3, this was expected, and it emphasises the significance
of the predictive capabilities of the other metrics.
Overall, one would expect this kind of result as the class-type mutants are object-
oriented, while the traditional mutations are based on factors like operator replace-
ment and therefore would not be expected to correlate strongly with coupling. This
leads us to reject our null hypothesis for both experiments and state that run-time
coupling metrics are good detectors of class-level faults, while coverage measures
are good detectors of traditional-type faults in a program. We therefore postulate
Figure 6.3: Regression analysis results for the effectiveness of run-time coupling met-
rics in predicting class-level mutations in real-world programs Velocity, Xalan and
Ant. The bars represent the R2 value for the run-time metric under consideration.
a possible utility for run-time coupling metrics for use in fault-proneness detection
with regard to identifying faults in object-oriented programs.
6.4 Conclusion
In this case study, regression analysis was used to show that run-time cou-
pling metrics were good detectors of class-type faults in programs, while instruction
coverage was a good detector of traditional-type mutants. Appendix C.1 illustrates
these results and shows that all results were deemed to be statistically significant at
the 5% level of significance. We therefore proposed the run-time coupling metrics as
alternative measures for fault-detection useful for identifying object-oriented type
faults in programs.
Chapter 7
Conclusions
In this thesis we presented an empirical investigation into run-time coupling and
cohesion metrics.
The first case study investigated the influence of instruction coverage on the re-
lationship between static and run-time coupling metrics. An empirical investigation
was conducted using the set of run-time metrics proposed by Arisholm et al. on
a large set of Java programs. This set contained programs from the SPECjvm98
and JOlden benchmark suites and also included three real-world programs Velocity,
Xalan and Ant.
The differences in the underlying dimensions of coupling captured by the static
versus the run-time metrics were assessed using principal component analysis. Three
components were identified which contained static CBO, the import-based run-time
metrics, and the export-based run-time metrics. This established that the run-
time metrics were not simply surrogate static measures, which made them suitable
candidates for further analysis.
A study into the predictive ability of the static CBO and instruction coverage
data was then conducted using multiple regression analysis. The purpose of this was
to show how well the static CBO metric and instruction coverage measure Ic could
predict the six run-time metrics under consideration. The PCA analysis placed
import and export based coupling in different components, and this difference was
also seen in the regression analysis. Both CBO and instruction coverage had less
influence overall on the export-based metrics, EC CC and EC CM than on the
Figure 7.1: Findings from case study one, which show that our run-time coupling
metrics are not simply surrogate measures for static CBO, and that coverage plus
static metrics are better predictors of run-time measures than static metrics alone.
import-based run-time metrics, IC CC and IC CM .
It was shown from the regression analysis that the combination of the static
measure with instruction coverage gave a significantly better prediction of the run-
time behaviour of programs than the use of static metrics alone, for the class-based
and method-based metrics. This suggested that the correlation between static and
run-time metrics was as much a factor of coverage as an intrinsic property of the
metrics themselves.
The results for the two run-time metrics based on distinct message counts,
IC CD and EC CD, were not within the chosen significance level, and thus no de-
termination was made on the relationship for these metrics. Figure 7.1 summarises
the findings from this study.
The second case study looked at run-time object behaviour and whether run-time
cohesion metrics could be used to identify such behaviour.
First, we looked at object behaviour in the context of coupling. We used the
IC OC object-level metric, as defined by Arisholm et al. and defined a new measure
NOC in an attempt to identify objects that were behaving differently at run-time
from the point of view of coupling. We concluded that objects from the same class
could behave differently at run-time from the point of view of coupling due to the
fact that there were classes that exhibited variable CV values for IC OC and NOC
values greater than one.
Subsequently, we looked at whether run-time cohesion metrics could be used to
predict object behaviour, as defined by the NOC measure. First, we had to show
that the run-time cohesion metrics were not redundant with respect to static SLCOM.
The relationship between static and run-time cohesion metrics was investigated using
PCA. Two components were identified containing the static SLCOM and the run-time
cohesion measures RLCOM , RWLCOM . This established that the run-time metrics
were not simply surrogate static measures, making them suitable candidates for
further analysis.
Multiple regression analysis was used to discover if the cohesion metrics were
good predictors of object behaviour. The purpose of this was to show how well the
SLCOM metric and the run-time cohesion measures RLCOM , RWLCOM could predict
NOC . Overall, the results showed that cohesion metrics were a good predictor of
NOC , with run-time cohesion being the superior metric. This led us to conclude that
run-time cohesion metrics provided additional information about class behaviour
over and above that provided by SLCOM . Figure 7.2 depicts the results of this
study.
The third case study investigated whether instruction coverage was a good pre-
dictor of faults in a program. We used regression analysis to determine if this
measure was related to MK , the mutation kill rate of the test cases used. It was
found that IC was a good predictor of traditional-type faults but a poor predictor of
class-type faults, which verifies results from previous studies on coverage measures.
Next, we analysed the extent to which the run-time coupling metrics were good
Figure 7.2: Findings from case study two, which show that run-time object-level
coupling measures can be used to identify objects that are exhibiting different
behaviours at run-time, and that run-time cohesion measures are good predictors
of this type of behaviour.
Figure 7.3: Findings from case study three that show run-time coupling metrics are
good predictors of class-type faults and instruction coverage is a good predictor of
traditional faults in programs.
detectors of traditional and class-type faults in a program. Our results showed that
the measures IC CC, IC CM, EC CC and EC CM were significantly related to MK
when considering class-type mutations. The results for IC CD and EC CD, the
two run-time metrics based on distinct message counts, were not within the chosen
significance level, and thus no determination was made on the relationship for these
metrics.
The purpose of this study was to determine whether instruction coverage is a bet-
ter predictor of fault-proneness than the run-time coupling measures. As we found
that the run-time coupling measures were superior to simple coverage measures in
detecting object-oriented faults in programs, we proposed the run-time coupling
metrics as an alternative measure for fault-proneness, useful for detecting faults in
object-oriented software. Figure 7.3 illustrates the findings from this study.
7.1 Contributions
We have implemented the tools ClMet and ObMet that can be used to perform a
class and object-level analysis of Java programs.
We use the definitions of Arisholm et al. for the set of run-time coupling metrics
in this analysis. We had, however, defined our own set of run-time coupling metrics
prior to the publication of their paper [75, 76]; due to the similarity in nature of
the metrics, we switched to using their definitions for ease of comparison.
We define a number of object-oriented run-time metrics for cohesion and we
investigate their possible utility. To date no one has attempted to do this.
We define a new measure NOC that can be used to study run-time object-
behaviour.
To the best of our knowledge this is the largest empirical study that has been
performed to date on the run-time analysis of Java programs. Previously, a study
was carried out by Arisholm et al. on “Dynamic Coupling Measurement for Object-
Oriented Software”; however, it included only a single program, Velocity, in the analysis.
Our study looks at not only Velocity but also the real-world programs Xalan and
Ant as well as seven benchmark programs from the SPECjvm98 suite and seven
programs from the JOlden suite thus making it much wider in scope.
The main findings from our study are as follows:
• We showed run-time coupling metrics capture additional dimensions of cou-
pling and are not simply surrogate measures for static CBO. Therefore, useful
information above that provided by a simple static analysis may be acquired
through the use of run-time metrics.
• Coverage has a significant impact on the correlation between static CBO and
run-time coupling metrics and should be a measured, recorded factor in any
comparison.
• Run-time object-level coupling metrics can be used to investigate object be-
haviour. Using such a measure we discovered that objects from the same class
can behave differently at run-time from the point of view of coupling.
• Run-time cohesion metrics are not redundant with respect to the static SLCOM
measure and capture additional dimensions of cohesion.
• Run-time cohesion metrics are good predictors of run-time object behaviour.
• Run-time coupling metrics based on distinct class and distinct method counts are good predictors of class-type (object-oriented) faults in programs but poor predictors of traditional-type faults.
• Coverage is a good predictor of traditional-type faults but a poor predictor of
class-type faults in programs.
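To make the object-level idea concrete, the following minimal sketch (with invented classes; not the ObMet implementation) keeps a separate set of distinct callee classes per object rather than per class, so that two instances of the same class can exhibit different coupling counts:

import java.util.*;

// Minimal sketch of object-level coupling bookkeeping: each object,
// identified here by its identity hash code, gets its own set of
// distinct callee classes, so instances of one class may differ.
public class ObjectCouplingSketch {
    // objectId -> distinct classes that this object has called
    static Map<Integer, Set<String>> perObject = new HashMap<>();

    static void record(Object caller, Object callee) {
        perObject.computeIfAbsent(System.identityHashCode(caller),
                                  k -> new HashSet<>())
                 .add(callee.getClass().getName());
    }

    public static void main(String[] args) {
        Object a = new StringBuilder(), b = new StringBuilder();
        record(a, new Random());             // instance a couples to
        record(a, new ArrayList<String>());  // two distinct classes
        record(b, new Random());             // instance b couples to one
        System.out.println(perObject.get(System.identityHashCode(a)).size()); // 2
        System.out.println(perObject.get(System.identityHashCode(b)).size()); // 1
    }
}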
7.2 Applications of this Work
Much of the work on the dynamic analysis of Java programs has come from the
language design and compiler community. The work in this thesis forms part of an
increasing link between this community and the software engineering community,
with an emphasis on collecting, analysing and comparing quantitative static and
dynamic data. Other possible examples of this synthesis include relating studies of polymorphism with the testing of inheritance relationships, or relating measures of program “hot-spots” with metrics based on distinct messages, such as IC CD and EC CD. Run-time metrics may also have a role to play in areas of research such as reverse engineering and program comprehension, as they contribute to a better understanding of the behaviour of code in its operational environment.
7.3 Threats to Validity
7.3.1 Internal Threats
There are a number of factors which may potentially affect the validity of these run-
time metrics. In this thesis we have chosen only to look at run-time definitions for
coupling and cohesion which are based on the standard static definitions proposed by
Chidamber and Kemerer. Their metric suite for analysing object-oriented software
consists of three additional measurements for evaluating the depth of inheritance tree
(DIT), the number of children (NOC) and the weighted methods per class (WMC).
Our set of run-time measures should be expanded to include run-time definitions for
these also to ensure that the set is fully comprehensive.
The run-time metrics used in this study are rated based on how they perform
in relation to static measurements for coupling and cohesion. However, no study
has definitively shown that any measurement for coupling or cohesion provides any
extra information on design quality over and above that which can be gained simply
by evaluating the much simpler lines of code measure.
7.3.2 External Threats
A general problem with any type of run-time analysis is that the results are based on
dynamic measurement and are thus tied to the inputs or test cases used. Therefore
the use of different test cases may produce different results. Static measurements
however will remain the same regardless of the set of test cases used to execute the
program.
The set of programs used in this study may not be representative of all classes of Java programs; for example, no GUI-based programs were included in this analysis.
While the run-time analysis tools ClMet and ObMet made it easy to collect a wide variety of run-time information from a program and were easy to use, it was still quite time consuming to perform a full analysis of a program. Although it was stated that performance was not a primary concern in the design of these tools, if such a method of evaluating a program were to be marketed to industry, the performance of the tools would have to be given more serious consideration.
Only one external quality attribute, fault detection, was investigated in this thesis. Further research needs to be conducted to see how well measures of coupling and cohesion predict other important external quality attributes of a design, such as maintainability, reusability or comprehensibility.
The relationship between internal and external quality attributes is quite intuitive; for example, more complex code will require greater effort to maintain. However, the precise functional form of this relationship is less clear and is the subject of intense practical research concern. Using theories of cognition and problem-solving to help us understand the effects of complexity on software is the subject of much current research [31].
7.4 Future Work
Future work may involve extending the existing set of coupling and cohesion metrics to develop a comprehensive set of run-time object-oriented metrics that can intuitively quantify aspects of object-oriented applications such as inheritance, dynamic binding and polymorphism.
Currently no set of benchmarks exists that is specifically designed for evaluating properties of object-oriented programs such as coupling and cohesion; it would be useful to design such a set of benchmarks for use in similar empirical studies.
Further research could involve designing a run-time profiling tool written in C++ rather than Java. Such a tool could utilise the JVMDI component of the JPDA directly, and would therefore be dynamically linked with the JVM at run-time. This would probably result in less performance overhead, which in turn would reduce the time taken to perform such an analysis.
Another important aspect would be to further investigate the correlation between run-time metrics and external quality aspects of a design, including investigating the possibility of using hybrid models that combine static and run-time metrics to evaluate a design.
It would be interesting to conduct an industrial case study using real commercial
software and data to further verify the results in this thesis.
Other applications of run-time metrics should be investigated; for example, they could be useful in determining where refactorings have been, or could be, applied, or they could be used to aid program comprehension.
This study has focused solely on the evaluation of Java software; it would be important to investigate whether the run-time metrics give similar results when used to evaluate software written in other object-oriented languages, for example C#.
Though the approach and results are of significance to the field, they can also serve as stepping stones towards considering a wider set of internal quality attributes, their interrelationships, and their independent and interdependent effects on the external quality aspects of a design.
Appendix A
Case Study 1: To Investigate the Influence of Instruction Coverage on the Relationship Between Static and Run-time Coupling Metrics
Appendix A.1 contains the PCA test results for the SPECjvm98 and JOlden suites
and for the real-world programs Velocity, Xalan and Ant. Values deemed to be
significant at the level p ≤ 0.05 are highlighted.
Appendix A.2 contains the results from the multiple linear regression used to test
the hypothesis H0, that coverage has no effect on the relationship between static
and run-time metrics for the programs from the SPECjvm98 and JOlden suites and
for the real-world programs Velocity, Xalan and Ant. All significant results are
highlighted.
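As an illustration of the model comparison underlying this hypothesis test, the sketch below (with invented data; the analysis in the thesis was carried out with a standard statistics package) fits ordinary least squares models using CBO alone and using CBO plus instruction coverage as predictors, and reports the R2 of each, so that the contribution of coverage can be inspected:

public class RegressionSketch {
    // Solve (X^T X) b = X^T y by Gaussian elimination with partial
    // pivoting; adequate for the small systems that arise here.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int p = 0; p < n; p++) {
            int max = p;
            for (int i = p + 1; i < n; i++)
                if (Math.abs(a[i][p]) > Math.abs(a[max][p])) max = i;
            double[] t = a[p]; a[p] = a[max]; a[max] = t;
            double s = b[p]; b[p] = b[max]; b[max] = s;
            for (int i = p + 1; i < n; i++) {
                double f = a[i][p] / a[p][p];
                b[i] -= f * b[p];
                for (int j = p; j < n; j++) a[i][j] -= f * a[p][j];
            }
        }
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = b[i];
            for (int j = i + 1; j < n; j++) sum -= a[i][j] * x[j];
            x[i] = sum / a[i][i];
        }
        return x;
    }

    // R^2 of an ordinary least squares fit of y on the predictor
    // columns xs (an intercept term is added automatically).
    static double rSquared(double[][] xs, double[] y) {
        int n = y.length, k = xs.length + 1;
        double[][] xtx = new double[k][k];
        double[] xty = new double[k];
        for (int r = 0; r < n; r++) {
            double[] row = new double[k];
            row[0] = 1.0;
            for (int c = 0; c < xs.length; c++) row[c + 1] = xs[c][r];
            for (int i = 0; i < k; i++) {
                xty[i] += row[i] * y[r];
                for (int j = 0; j < k; j++) xtx[i][j] += row[i] * row[j];
            }
        }
        double[] beta = solve(xtx, xty);
        double mean = 0;
        for (double v : y) mean += v;
        mean /= n;
        double ssRes = 0, ssTot = 0;
        for (int r = 0; r < n; r++) {
            double fit = beta[0];
            for (int c = 0; c < xs.length; c++) fit += beta[c + 1] * xs[c][r];
            ssRes += (y[r] - fit) * (y[r] - fit);
            ssTot += (y[r] - mean) * (y[r] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        // Invented per-class data: static CBO, instruction coverage,
        // and a run-time coupling value to be predicted.
        double[] cbo     = {2, 5, 3, 8, 4, 7, 6, 1};
        double[] cover   = {0.9, 0.4, 0.7, 0.3, 0.8, 0.5, 0.6, 0.95};
        double[] runtime = {3.1, 6.8, 4.0, 10.5, 4.9, 9.0, 7.7, 1.8};
        System.out.printf("R2 (CBO only):       %.3f%n",
                rSquared(new double[][]{cbo}, runtime));
        System.out.printf("R2 (CBO + coverage): %.3f%n",
                rSquared(new double[][]{cbo, cover}, runtime));
    }
}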
A.1 PCA Test Results for all programs.
A.1.1 SPECjvm98 Benchmark Suite
201 compress
PC1 PC2 PC3
CBO 0.113 0.014 0.712
IC CC 0.865 0.065 0.186
IC CM 0.766 0.154 0.097
IC CD 0.866 0.073 0.100
EC CC 0.023 0.873 0.176
EC CM 0.143 0.799 0.035
EC CD 0.098 0.834 0.096
202 jess
PC1 PC2 PC3
CBO 0.198 0.187 0.672
IC CC 0.963 0.007 0.005
IC CM 0.912 0.003 0.016
IC CD 0.874 0.032 0.004
EC CC 0.154 0.812 0.002
EC CM 0.298 0.734 0.054
EC CD 0.098 0.923 0.002
205 raytrace
PC1 PC2 PC3
CBO 0.123 0.087 0.723
IC CC 0.834 0.021 0.019
IC CM 0.912 0.017 0.008
IC CD 0.896 0.103 0.001
EC CC 0.198 0.763 0.003
EC CM 0.125 0.709 0.017
EC CD 0.097 0.821 0.002
209 db
PC1 PC2 PC3
CBO 0.012 0.163 0.843
IC CC 0.893 0.088 0.002
IC CM 0.923 0.004 0.000
IC CD 0.976 0.003 0.013
EC CC 0.178 0.763 0.002
EC CM 0.110 0.793 0.027
EC CD 0.087 0.823 0.017
213 javac
PC1 PC2 PC3
CBO 0.187 0.000 0.973
IC CC 0.633 0.083 0.184
IC CM 0.834 0.033 0.023
IC CD 0.723 0.143 0.002
EC CC 0.138 0.834 0.004
EC CM 0.078 0.734 0.012
EC CD 0.067 0.759 0.034
222 mpegaudio
PC1 PC2 PC3
CBO 0.244 0.137 0.583
IC CC 0.943 0.004 0.087
IC CM 0.898 0.034 0.041
IC CD 0.943 0.023 0.001
EC CC 0.034 0.943 0.043
EC CM 0.134 0.754 0.085
EC CD 0.098 0.845 0.005
228 jack
PC1 PC2 PC3
CBO 0.004 0.243 0.634
IC CC 0.605 0.234 0.154
IC CM 0.723 0.194 0.076
IC CD 0.604 0.195 0.098
EC CC 0.194 0.749 0.098
EC CM 0.103 0.694 0.049
EC CD 0.094 0.749 0.104
A.1.2 JOlden Benchmark Suite
BH
PC1 PC2 PC3
CBO 0.403 0.002 0.520
IC CC 0.728 0.224 0.012
IC CM 0.536 0.391 0.001
IC CD 0.555 0.376 0.000
EC CC 0.358 0.522 0.109
EC CM 0.203 0.763 0.025
EC CD 0.203 0.763 0.025
Em3d
PC1 PC2 PC3
CBO 0.134 0.034 0.712
IC CC 0.933 0.013 0.016
IC CM 0.772 0.168 0.039
IC CD 0.772 0.168 0.039
EC CC 0.139 0.702 0.082
EC CM 0.223 0.716 0.039
EC CD 0.223 0.716 0.039
Health
PC1 PC2 PC3
CBO 0.238 0.187 0.521
IC CC 0.956 0.005 0.017
IC CM 0.936 0.024 0.010
IC CD 0.940 0.028 0.009
EC CC 0.076 0.831 0.086
EC CM 0.070 0.919 0.002
EC CD 0.065 0.794 0.003
MST
PC1 PC2 PC3
CBO 0.000 0.013 0.972
IC CC 0.900 0.063 0.032
IC CM 0.956 0.010 0.026
IC CD 0.941 0.012 0.027
EC CC 0.356 0.609 0.033
EC CM 0.121 0.877 0.001
EC CD 0.118 0.881 0.000
Perimeter
PC1 PC2 PC3
CBO 0.231 0.123 0.612
IC CC 0.541 0.169 0.281
IC CM 0.876 0.080 0.002
IC CD 0.905 0.056 0.038
EC CC 0.236 0.752 0.000
EC CM 0.147 0.830 0.023
EC CD 0.142 0.828 0.026
Power
PC1 PC2 PC3
CBO 0.329 0.014 0.626
IC CC 0.617 0.073 0.161
IC CM 0.624 0.338 0.036
IC CD 0.712 0.228 0.041
EC CC 0.022 0.915 0.015
EC CM 0.007 0.880 0.112
EC CD 0.008 0.824 0.164
Voronoi
PC1 PC2 PC3
CBO 0.198 0.213 0.526
IC CC 0.718 0.123 0.069
IC CM 0.812 0.088 0.134
IC CD 0.773 0.176 0.141
EC CC 0.043 0.911 0.005
EC CM 0.067 0.934 0.004
EC CD 0.148 0.834 0.054
A.1.3 Real-World Programs, Velocity, Xalan and Ant
Velocity
PC1 PC2 PC3
CBO 0.384 0.184 0.734
IC CC 0.623 0.034 0.174
IC CM 0.725 0.087 0.231
IC CD 0.684 0.196 0.192
EC CC 0.284 0.684 0.097
EC CM 0.023 0.793 0.005
EC CD 0.174 0.590 0.015
Xalan
PC1 PC2 PC3
CBO 0.316 0.174 0.586
IC CC 0.824 0.184 0.183
IC CM 0.890 0.284 0.284
IC CD 0.795 0.003 0.194
EC CC 0.013 0.834 0.164
EC CM 0.284 0.793 0.023
EC CD 0.384 0.823 0.154
Ant
PC1 PC2 PC3
CBO 0.125 0.254 0.687
IC CC 0.874 0.125 0.125
IC CM 0.789 0.231 0.012
IC CD 0.801 0.324 0.214
EC CC 0.214 0.789 0.124
EC CM 0.141 0.785 0.054
EC CD 0.123 0.754 0.014
A.2 Multiple linear regression results for all programs
A.2.1 SPECjvm98 Benchmark Suite
201 compress
Hypothesis Y R R2 P > F
HCBO IC CC 0.775 0.593 0.003
HCBO,Ic IC CC 0.798 0.602 0.0001
HCBO EC CC 0.634 0.402 0.01
HCBO,Ic EC CC 0.870 0.759 0.007
HCBO IC CD 0.512 0.262 0.421
HCBO,Ic IC CD 0.599 0.359 0.201
HCBO EC CD 0.239 0.057 0.054
HCBO,Ic EC CD 0.422 0.178 0.134
HCBO IC CM 0.762 0.58 0.003
HCBO,Ic IC CM 0.885 0.784 0.006
HCBO EC CM 0.235 0.056 0.04
HCBO,Ic EC CM 0.58 0.336 0.035
202 jess
Hypothesis Y R R2 P > F
HCBO IC CC 0.553 0.306 0.002
HCBO,Ic IC CC 0.703 0.494 0.001
HCBO EC CC 0.428 0.184 0.031
HCBO,Ic EC CC 0.567 0.322 0.023
HCBO IC CD 0.765 0.586 0.145
HCBO,Ic IC CD 0.868 0.754 0.321
HCBO EC CD 0.691 0.748 0.246
HCBO,Ic EC CD 0.723 0.523 0.135
HCBO IC CM 0.762 0.581 0.023
HCBO,Ic IC CM 0.922 0.852 0.012
HCBO EC CM 0.618 0.382 0.001
HCBO,Ic EC CM 0.645 0.416 0.002
205 raytrace
Hypothesis Y R R2 P > F
HCBO IC CC 0.444 0.197 0.021
HCBO,Ic IC CC 0.659 0.434 0.002
HCBO EC CC 0.59 0.349 0.043
HCBO,Ic EC CC 0.669 0.447 0.032
HCBO IC CD 0.256 0.065 0.342
HCBO,Ic IC CD 0.36 0.13 0.365
HCBO EC CD 0.239 0.057 0.123
HCBO,Ic EC CD 0.363 0.132 0.432
HCBO IC CM 0.443 0.196 0.034
HCBO,Ic IC CM 0.599 0.359 0.032
HCBO EC CM 0.422 0.178 0.012
HCBO,Ic EC CM 0.632 0.399 0.032
209 db
Hypothesis Y R R2 P > F
HCBO IC CC 0.419 0.178 0.0001
HCBO,Ic IC CC 0.868 0.754 0.001
HCBO EC CC 0.567 0.322 0.002
HCBO,Ic EC CC 0.881 0.777 0.001
HCBO IC CD 0.691 0.478 0.522
HCBO,Ic IC CD 0.768 0.589 0.263
HCBO EC CD 0.312 0.097 0.609
HCBO,Ic EC CD 0.429 0.184 0.816
HCBO IC CM 0.582 0.338 0.003
HCBO,Ic IC CM 0.703 0.494 0.006
HCBO EC CM 0.313 0.098 0.019
HCBO,Ic EC CM 0.428 0.184 0.016
213 javac
Hypothesis Y R R2 P > F
HCBO IC CC 0.535 0.286 0.005
HCBO,Ic IC CC 0.748 0.559 0.002
HCBO EC CC 0.443 0.196 0.004
HCBO,Ic EC CC 0.531 0.282 0.007
HCBO IC CD 0.512 0.262 0.234
HCBO,Ic IC CD 0.606 0.367 0.176
HCBO EC CD 0.872 0.76 0.765
HCBO,Ic EC CD 0.922 0.85 0.567
HCBO IC CM 0.553 0.306 0.034
HCBO,Ic IC CM 0.76 0.577 0.024
HCBO EC CM 0.321 0.107 0.042
HCBO,Ic EC CM 0.567 0.322 0.034
222 mpegaudio
Hypothesis Y R R2 P > F
HCBO IC CC 0.174 0.032 0.003
HCBO,Ic IC CC 0.452 0.204 0.001
HCBO EC CC 0.296 0.088 0.013
HCBO,Ic EC CC 0.635 0.403 0.006
HCBO IC CD 0.734 0.538 0.165
HCBO,Ic IC CD 0.885 0.784 0.214
HCBO EC CD 0.948 0.899 0.234
HCBO,Ic EC CD 0.978 0.956 0.654
HCBO IC CM 0.753 0.567 0.001
HCBO,Ic IC CM 0.769 0.592 0.002
HCBO EC CM 0.533 0.284 0.021
HCBO,Ic EC CM 0.635 0.403 0.03
228 jack
Hypothesis Y R R2 P > F
HCBO IC CC 0.606 0.367 0.003
HCBO,Ic IC CC 0.966 0.933 0.012
HCBO EC CC 0.512 0.262 0.002
HCBO,Ic EC CC 0.872 0.76 0.003
HCBO IC CD 0.239 0.057 0.465
HCBO,Ic IC CD 0.618 0.382 0.450
HCBO EC CD 0.363 0.132 0.123
HCBO,Ic EC CD 0.419 0.178 0.576
HCBO IC CM 0.585 0.343 0.013
HCBO,Ic IC CM 0.599 0.359 0.002
HCBO EC CM 0.363 0.132 0.045
HCBO,Ic EC CM 0.417 0.174 0.032
A.2.2 JOlden Benchmark Suite
BH
Hypothesis Y R R2 P > F
HCBO IC CC 0.531 0.282 0.038
HCBO,Ic IC CC 0.767 0.588 0.044
HCBO EC CC 0.092 0.008 0.0001
HCBO,Ic EC CC 0.533 0.284 0.0001
HCBO IC CD 0.431 0.185 0.247
HCBO,Ic IC CD 0.617 0.381 0.237
HCBO EC CD 0.443 0.196 0.232
HCBO,Ic EC CD 0.514 0.264 0.398
HCBO IC CM 0.45 0.203 0.024
HCBO,Ic IC CM 0.635 0.403 0.013
HCBO EC CM 0.443 0.196 0.032
HCBO,Ic EC CM 0.514 0.264 0.024
Em3d
Hypothesis Y R R2 P > F
HCBO IC CC 0.617 0.381 0.046
HCBO,Ic IC CC 0.748 0.659 0.001
HCBO EC CC 0.262 0.069 0.03
HCBO,Ic EC CC 0.937 0.878 0.024
HCBO IC CD 0.59 0.349 0.294
HCBO,Ic IC CD 0.591 0.349 0.651
HCBO EC CD 0.02 0.00 0.975
HCBO,Ic EC CD 0.626 0.392 0.608
HCBO IC CM 0.59 0.349 0.194
HCBO,Ic IC CM 0.591 0.349 0.151
HCBO EC CM 0.02 0.000 0.075
HCBO,Ic EC CM 0.626 0.392 0.008
Health
Hypothesis Y R R2 P > F
HCBO IC CC 0.601 0.372 0.04
HCBO,Ic IC CC 0.643 0.414 0.003
HCBO EC CC 0.22 0.048 0.06
HCBO,Ic EC CC 0.254 0.064 0.13
HCBO IC CD 0.659 0.434 0.075
HCBO,Ic IC CD 0.753 0.566 0.124
HCBO EC CD 0.444 0.197 0.27
HCBO,Ic EC CD 0.535 0.286 0.431
HCBO IC CM 0.669 0.447 0.07
HCBO,Ic IC CM 0.76 0.578 0.116
HCBO EC CM 0.444 0.197 0.207
HCBO,Ic EC CM 0.535 0.286 0.431
MST
Hypothesis Y R R2 P > F
HCBO IC CC 0.97 0.941 0.001
HCBO,Ic IC CC 0.972 0.945 0.0001
HCBO EC CC 0.606 0.367 0.002
HCBO,Ic EC CC 0.76 0.577 0.001
HCBO IC CD 0.966 0.933 0.200
HCBO,Ic IC CD 0.987 0.974 0.401
HCBO EC CD 0.239 0.057 0.649
HCBO,Ic EC CD 0.618 0.382 0.486
HCBO IC CM 0.966 0.933 0.002
HCBO,Ic IC CM 0.987 0.974 0.004
HCBO EC CM 0.239 0.057 0.049
HCBO,Ic EC CM 0.618 0.382 0.086
Perimeter
Hypothesis Y R R2 P > F
HCBO IC CC 0.36 0.13 0.306
HCBO,Ic IC CC 0.422 0.178 0.503
HCBO EC CC 0.095 0.009 0.194
HCBO,Ic EC CC 0.599 0.359 0.211
HCBO IC CD 0.512 0.262 0.131
HCBO,Ic IC CD 0.585 0.343 0.230
HCBO EC CD 0.256 0.065 0.476
HCBO,Ic EC CD 0.58 0.336 0.238
HCBO IC CM 0.645 0.416 0.044
HCBO,Ic IC CM 0.66 0.435 0.135
HCBO EC CM 0.256 0.065 0.076
HCBO,Ic EC CM 0.58 0.336 0.038
Power
Hypothesis Y R R2 P > F
HCBO IC CC 0.709 0.502 0.042
HCBO,Ic IC CC 0.713 0.508 0.001
HCBO EC CC 0.635 0.404 0.011
HCBO,Ic EC CC 0.872 0.76 0.001
HCBO IC CD 0.104 0.011 0.844
HCBO,Ic IC CD 0.723 0.523 0.329
HCBO EC CD 0.363 0.132 0.479
HCBO,Ic EC CD 0.632 0.399 0.465
HCBO IC CM 0.067 0.004 0.9
HCBO,Ic IC CM 0.638 0.407 0.456
HCBO EC CM 0.417 0.174 0.010
HCBO,Ic EC CM 0.673 0.453 0.005
Voronoi
Hypothesis Y R R2 P > F
HCBO IC CC 0.922 0.85 0.009
HCBO,Ic IC CC 0.941 0.885 0.0001
HCBO EC CC 0.553 0.306 0.255
HCBO,Ic EC CC 0.561 0.314 0.568
HCBO IC CD 0.762 0.58 0.078
HCBO,Ic IC CD 0.768 0.589 0.263
HCBO EC CD 0.627 0.393 0.183
HCBO,Ic EC CD 0.636 0.405 0.459
HCBO IC CM 0.765 0.586 0.076
HCBO,Ic IC CM 0.77 0.594 0.059
HCBO EC CM 0.627 0.393 0.083
HCBO,Ic EC CM 0.636 0.405 0.029
A.2.3 Real-World Programs, Velocity, Xalan and Ant
Velocity
Hypothesis Y R R2 P > F
HCBO IC CC 0.515 0.265 0.0001
HCBO,Ic IC CC 0.722 0.521 0.001
HCBO EC CC 0.381 0.145 0.014
HCBO,Ic EC CC 0.617 0.381 0.025
HCBO IC CD 0.595 0.354 0.254
HCBO,Ic IC CD 0.741 0.547 0.354
HCBO EC CD 0.677 0.458 0.144
HCBO,Ic EC CD 0.861 0.741 0.214
HCBO IC CM 0.675 0.455 0.005
HCBO,Ic IC CM 0.752 0.565 0.004
HCBO EC CM 0.409 0.167 0.007
HCBO,Ic EC CM 0.506 0.256 0.01
Xalan
Hypothesis Y R R2 P > F
HCBO IC CC 0.453 0.205 0.002
HCBO,Ic IC CC 0.637 0.406 0.001
HCBO EC CC 0.430 0.185 0.002
HCBO,Ic EC CC 0.570 0.325 0.004
HCBO IC CD 0.709 0.502 0.547
HCBO,Ic IC CD 0.892 0.796 0.214
HCBO EC CD 0.830 0.689 0.114
HCBO,Ic EC CD 0.857 0.735 0.147
HCBO IC CM 0.652 0.425 0.006
HCBO,Ic IC CM 0.762 0.581 0.005
HCBO EC CM 0.504 0.254 0.011
HCBO,Ic EC CM 0.624 0.389 0.007
Ant
Hypothesis Y R R2 P > F
HCBO IC CC 0.604 0.365 0.005
HCBO,Ic IC CC 0.765 0.585 0.006
HCBO EC CC 0.453 0.205 0.014
HCBO,Ic EC CC 0.636 0.405 0.018
HCBO IC CD 0.597 0.356 0.154
HCBO,Ic IC CD 0.698 0.487 0.198
HCBO EC CD 0.518 0.268 0.287
HCBO,Ic EC CD 0.667 0.445 0.098
HCBO IC CM 0.725 0.525 0.017
HCBO,Ic IC CM 0.784 0.615 0.025
HCBO EC CM 0.451 0.204 0.042
HCBO,Ic EC CM 0.560 0.314 0.034
Appendix B
Case Study 2: The Impact of Run-time Cohesion on Object Behaviour
Appendix B.1 contains the PCA test results for the JOlden benchmark suite and for
the real-world programs Velocity, Xalan and Ant. Values deemed to be significant
at the level p ≤ 0.05 are highlighted.
Appendix B.2 contains the results from the multiple linear regression used to test
the hypothesis H0, that measures of run-time cohesion provide a better indication
of NOC than a static measure alone for the JOlden benchmark programs and the
real-world programs Velocity, Xalan and Ant. All significant results are highlighted.
B.1 PCA Test Results for all programs.
B.1.1 JOlden Benchmark Suite
BH
PC1 PC2
SLCOM 0.214 0.754
RLCOM 0.714 0.214
RWLCOM 0.721 0.101
Em3d
PC1 PC2
SLCOM 0.135 0.812
RLCOM 0.841 0.014
RWLCOM 0.814 0.014
Health
PC1 PC2
SLCOM 0.122 0.789
RLCOM 0.674 0.145
RWLCOM 0.714 0.212
MST
PC1 PC2
SLCOM 0.251 0.712
RLCOM 0.714 0.211
RWLCOM 0.751 0.165
Perimeter
PC1 PC2
SLCOM 0.025 0.912
RLCOM 0.874 0.145
RWLCOM 0.768 0.121
Power
PC1 PC2
SLCOM 0.142 0.775
RLCOM 0.654 0.154
RWLCOM 0.698 0.177
Voronoi
PC1 PC2
SLCOM 0.045 0.901
RLCOM 0.854 0.104
RWLCOM 0.868 0.021
B.1.2 Real-World Programs, Velocity, Xalan and Ant
Velocity
PC1 PC2
SLCOM 0.215 0.614
RLCOM 0.814 0.124
RWLCOM 0.751 0.165
Xalan
PC1 PC2
SLCOM 0.315 0.554
RLCOM 0.714 0.116
RWLCOM 0.641 0.225
Ant
PC1 PC2
SLCOM 0.114 0.712
RLCOM 0.814 0.124
RWLCOM 0.801 0.101
B.2 Multiple linear regression results for all programs.
B.2.1 JOlden Benchmark Suite
BH
Hypothesis Y R R2 P > F
HSLCOM NOC 0.444 0.197 0.016
HSLCOM,RLCOM NOC 0.711 0.507 0.01
HSLCOM NOC 0.105 0.012 0.452
HSLCOM,RWLCOM NOC 0.631 0.398 0.487
Health
Hypothesis Y R R2 P > F
HSLCOM NOC 0.518 0.268 0.012
HSLCOM,RLCOM NOC 0.754 0.568 0.009
HSLCOM NOC 0.445 0.198 0.124
HSLCOM,RWLCOM NOC 0.534 0.285 0.211
Perimeter
Hypothesis Y R R2 P > F
HSLCOM NOC 0.514 0.263 0.002
HSLCOM,RLCOM NOC 0.631 0.398 0.001
HSLCOM NOC 0.366 0.135 0.048
HSLCOM,RWLCOM NOC 0.451 0.203 0.037
Em3d
Hypothesis Y R R2 P > F
HSLCOM NOC 0.365 0.134 0.006
HSLCOM,RLCOM NOC 0.744 0.655 0.005
HSLCOM NOC 0.415 0.173 0.254
HSLCOM,RWLCOM NOC 0.67 0.451 0.354
MST
Hypothesis Y R R2 P > F
HSLCOM NOC 0.235 0.056 0.025
HSLCOM,RLCOM NOC 0.704 0.495 0.012
HSLCOM NOC 0.555 0.308 0.121
HSLCOM,RWLCOM NOC 0.594 0.355 0.241
Power
Hypothesis Y R R2 P > F
HSLCOM NOC 0.177 0.035 0.028
HSLCOM,RLCOM NOC 0.598 0.358 0.035
HSLCOM NOC 0.445 0.198 0.214
HSLCOM,RWLCOM NOC 0.514 0.264 0.277
Voronoi
Hypothesis Y R R2 P > F
HSLCOM NOC 0.523 0.273 0.004
HSLCOM,RLCOM NOC 0.767 0.589 0.002
HSLCOM NOC 0.255 0.064 0.381
HSLCOM,RWLCOM NOC 0.333 0.129 0.358
B.2.2 Real-World Programs, Velocity, Xalan and Ant
Velocity
Hypothesis Y R R2 P > F
HSLCOM NOC 0.445 0.198 0.002
HSLCOM,RLCOM NOC 0.756 0.572 0.001
HSLCOM NOC 0.363 0.132 0.456
HSLCOM,RWLCOM NOC 0.598 0.358 0.345
Xalan
Hypothesis Y R R2 P > F
HSLCOM NOC 0.242 0.06 0.044
HSLCOM,RLCOM NOC 0.621 0.385 0.098
HSLCOM NOC 0.722 0.523 0.287
HSLCOM,RWLCOM NOC 0.869 0.758 0.205
Ant
Hypothesis Y R R2 P > F
HSLCOM NOC 0.455 0.207 0.0001
HSLCOM,RLCOM NOC 0.747 0.558 0.001
HSLCOM NOC 0.633 0.401 0.214
HSLCOM,RWLCOM NOC 0.69 0.747 0.564
Appendix C
Case Study 3: A Study of Run-time Coupling Metrics and Fault Detection
Appendix C.1 contains the results from the regression analysis used to test the
hypothesis H0, that run-time coupling metrics are poor detectors of faults in a
program for the set of real-world programs Velocity, Xalan and Ant.
Appendix C.2 presents the results to test the hypothesis H0, that coverage mea-
sures are poor detectors of faults in a program for the real-world programs Velocity,
Xalan and Ant. All significant results are highlighted.
C.1 Regression analysis results for real-world programs, Velocity, Xalan and Ant.
C.1.1 For Class Mutants
Velocity
Hypothesis Y R R2 P > F
HIC CC MK 0.830 0.689 0.002
HIC CM MK 0.766 0.587 0.001
HIC CD MK 0.684 0.468 0.006
HEC CC MK 0.790 0.621 0.007
HEC CM MK 0.754 0.569 0.411
HEC CD MK 0.491 0.241 0.456
Xalan
Hypothesis Y R R2 P > F
HIC CC MK 0.767 0.589 0.003
HIC CM MK 0.705 0.498 0.002
HIC CD MK 0.710 0.504 0.001
HEC CC MK 0.706 0.499 0.046
HEC CM MK 0.706 0.499 0.254
HEC CD MK 0.649 0.421 0.680
Ant
Hypothesis Y R R2 P > F
HIC CC MK 0.773 0.598 0.003
HIC CM MK 0.708 0.501 0.005
HIC CD MK 0.829 0.687 0.001
HEC CC MK 0.749 0.561 0.075
HEC CM MK 0.749 0.561 0.342
HEC CD MK 0.463 0.214 0.127
C.1.2 For Traditional Mutants
Velocity
Hypothesis Y R R2 P > F
HIC CC MK 0.570 0.325 0.048
HIC CM MK 0.644 0.415 0.054
HIC CD MK 0.375 0.141 0.065
HEC CC MK 0.642 0.412 0.115
HEC CM MK 0.463 0.214 0.256
HEC CD MK 0.392 0.154 0.658
Xalan
Hypothesis Y R R2 P > F
HIC CC MK 0.598 0.358 0.091
HIC CM MK 0.567 0.321 0.078
HIC CD MK 0.463 0.214 0.154
HEC CC MK 0.676 0.457 0.254
HEC CM MK 0.459 0.211 0.254
HEC CD MK 0.381 0.145 0.351
Ant
Hypothesis Y R R2 P > F
HIC CC MK 0.740 0.547 0.065
HIC CM MK 0.649 0.421 0.085
HIC CD MK 0.463 0.214 0.159
HEC CD MK 0.577 0.333 0.241
HEC CD MK 0.606 0.367 0.154
HEC CM MK 0.536 0.287 0.054
HEC CD MK 0.459 0.211 0.216
C.2 Regression analysis results for real-world programs, Velocity, Xalan and Ant.
C.2.1 For Class Mutants
Velocity
Hypothesis Y R R2 P > F
HIc MK 0.326 0.106 0.032
Xalan
Hypothesis Y R R2 P > F
HIc MK 0.409 0.167 0.004
Ant
Hypothesis Y R R2 P > F
HIc MK 0.344 0.118 0.005
C.2.2 For Traditional Mutants
Velocity
Hypothesis Y R R2 P > F
HIc MK 0.888 0.789 0.003
Xalan
Hypothesis Y R R2 P > F
HIc MK 0.803 0.645 0.024
Ant
Hypothesis Y R R2 P > F
HIc MK 0.836 0.699 0.019
Appendix D
Mutation operators in µJava
Table D.1 presents a description of the traditional-level mutation operators in µJava.
Table D.2 presents a description of the class-level mutation operators in µJava.
Operator Description
ABS Absolute value insertion
AOR Arithmetic operator replacement
LCR Logical connector replacement
ROR Relational operator replacement
UOI Unary operator insertion
Table D.1: Traditional-level mutation operators in µJava
Language Feature Operator Description
Inheritance: IHD Hiding variable deletion
IHI Hiding variable insertion
IOD Overriding method deletion
IOP Overriding method calling position change
IOR Overriding method rename
ISK super keyword deletion
IPC Explicit call of a parent's constructor deletion
Polymorphism: PNC new method call with child class type
PMD Instance variable declaration with parent class type
PPD Parameter variable declaration with child class type
PRV Reference assignment with other comparable type
Overloading: OMR Overloading method contents change
OMD Overloading method deletion
OAO Argument order change
OAN Argument number change
Java-specific features: JTD this keyword deletion
JSC static modifier change
JID Member variable initialization deletion
JDC Java-supported default constructor creation
Common programming mistakes:
EOA Reference assignment and content assignment replacement
EOC Reference comparison and content comparison replacement
EAM Accessor method change
EMM Modifier method change
Table D.2: Class-level mutation operators in µJava
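For illustration, the following sketch shows, in comments, what one traditional-level mutant (AOR) and one class-level mutant (JTD) would look like when applied to a small, purely hypothetical class; µJava generates such mutants automatically:

// A hypothetical class used to illustrate two µJava mutants.
class Account {
    private int balance = 100;

    int deposit(int amount) {
        return balance + amount;
        // AOR (traditional-level, arithmetic operator replacement)
        // would produce:   return balance - amount;
    }

    void setBalance(int balance) {
        this.balance = balance;
        // JTD (class-level, "this" keyword deletion) would produce:
        //     balance = balance;
        // which assigns the parameter to itself and silently leaves
        // the field unchanged -- a fault that object-oriented-aware
        // measures are better placed to expose.
    }
}

public class MutantExample {
    public static void main(String[] args) {
        Account a = new Account();
        // Prints 150 on the original program; the AOR mutant prints 50.
        System.out.println(a.deposit(50));
    }
}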
Bibliography
[1] F. Abreu, M. Goulão, and R. Esteves. Toward the design quality evaluation of
object-oriented software systems. In Fifth International Conference on Software
Quality, pages 44–57, Austin, Texas, USA, Oct 1995.
[2] A.J. Albrecht. Measuring application development. In IBM Applications Devel-
opment joint SHARE/GUIDE symposium, pages 83–92, Monterey California,
USA, 1979.
[3] R.T. Alexander and J. Offutt. Coupling-based testing of O-O programs. The
Journal of Universal Computer Science, 10(4):391–427, 2004.
[4] The Apache Ant Project. Ant. http://ant.apache.org/.
[5] E. Arisholm, L.C. Briand, and A. Foyen. Dynamic coupling measures for object-
oriented software. IEEE Transactions on Software Engineering, 30(8):491–506,
2004.
[6] V.R. Basili, L.C. Briand, and W.L. Melo. A validation of object-oriented design
metrics as quality indicators. IEEE Transactions on Software Engineering,
22(10):751–761, October 1996.
[7] B. Beizer. Software Testing Techniques. 2nd edition, Van Nostrand Reinhold,
New York, USA, 1990.
[8] Standard Performance Evaluation Corporation SPECjvm98 Benchmarks.
http://www.spec.org/jvm98.
[9] J.M. Bieman and B.K. Kang. Cohesion and reuse in an object-oriented system.
In ACM Symposium on Software Reusability, pages 259–262, Seattle, Washing-
ton, USA, 1995.
[10] R. Binder. Testing Object Oriented Systems: Models, Patterns and Tools. Ad-
dison Wesley, Boston, Massachusetts, USA, October 1999.
[11] L.C. Briand, J. Daly, V. Porter, and J. Wust. A comprehensive empirical valida-
tion of product measures in object-oriented systems. Technical Report ISERN-
98-07, Fraunhofer Institute for Experimental Software Engineering, Germany,
1998.
[12] L.C. Briand, J.W. Daly, and J.K. Wust. A unified framework for cohesion
measurement in object-oriented systems. Empirical Software Engineering: An
International Journal, 3(1):65–117, 1998.
[13] L.C. Briand, J.W. Daly, and J.K. Wust. A unified framework for coupling
measurement in object-oriented systems. IEEE Transactions on Software En-
gineering, 25(1):91–121, Jan/Feb 1999.
[14] L.C. Briand, P. Devanbu, and W. Melo. An investigation into coupling measures
for C++. In 19th International Conference on Software Engineering, pages
412–421, Boston, USA, May 1997.
[15] L.C. Briand, W.L. Melo, and J. Wust. Assessing the applicability of fault-
proneness models across object-oriented software projects. IEEE Transactions
on Software Engineering, 28(7):706–720, 2002.
[16] L.C. Briand, S. Morasca, and V. Basili. Measuring and assessing maintainabil-
ity at the end of high-level design. In International Conference on Software
Maintenance, pages 88–97, Montreal, Canada, 1993.
[17] L.C. Briand, S. Morasca, and V. Basili. Defining and validating high-level design
metrics. Technical Report CS-TR 3301, Department of Computer Science,
University of Maryland, College Park, MD 20742, USA, 1994.
[18] L.C. Briand, S. Morasca, and V.R. Basili. An operational process for goal-
driven definition of measures. IEEE Transactions on Software Engineering,
28(12):1106–1125, December 2002.
[19] L.C. Briand, J.K. Wust, J.W. Daly, and V. Porter. Exploring the relationship
between design measures and software quality in object-oriented systems. The
Journal of Systems and Software, 51:245–273, 2000.
[20] S. Brown, A. Mitchell, and J.F. Power. A coverage analysis of Java benchmark
suites. In IASTED International Conference on Software Engineering, pages
144–150, Innsbruck, Austria, Feburary 15-17 2005.
[21] M. Bunge. Treatise on Basic Philosophy: Ontology II: The World of Systems.
Riedel, Boston, USA, 1972.
[22] X. Cai and M.R. Lyu. The effect of code coverage on fault detection un-
der different testing profiles. In First International Workshop on Advances in
Model-based Testing, pages 1–7, St. Louis, Missouri, USA, 2005.
[23] M.C. Carlisle and A. Rogers. Software caching and computation migration in
olden. In ACM Symposium on Principles and Practice of Parallel Programming,
pages 29–38, Santa Barbara, California, USA, July 1995.
[24] I.M. Chakravarti, R.G. Laha, and J. Roy. Handbook of Methods of Applied
Statistics, volume 1. John Wiley and Sons, New York, USA, 1967.
[25] S.R. Chidamber and C.F. Kemerer. Towards a metrics suite for object-oriented
design. In Object Oriented Programming Systems Languages and Applications,
pages 197–211, Phoenix, Arizona, USA, November 1991.
[26] S.R. Chidamber and C.F. Kemerer. A metrics suite for object-oriented design.
IEEE Transactions on Software Engineering, 20(6):467–493, June 1994.
[27] E.J. Chikofsky and J.H. Cross II. Reverse engineering and design recovery: A
taxonomy. IEEE Software, 7(1):13–17, 1990.
[28] J. Choi, M. Gupta, M.J. Serrano, V.C. Sreedhar, and S.P. Midkiff. Stack
allocation and synchronization optimizations for Java using escape analysis.
ACM Transactions on Programming Languages and Systems, 25(6):876 – 910,
November 2003.
[29] L.L. Constantine and E. Yourdon. Structured Design. Prentice-Hall, Englewood
Cliffs, New Jersey, USA, 1979.
[30] M. Dahm. Byte Code Engineering Library (BCEL), version 5.1, April 25 2004.
http://jakarta.apache.org/bcel/.
[31] D.P. Darcy, C.F. Kemerer, S.A. Slaughter, and J.E. Tomayko. The structural
complexity of software: An experimental test. IEEE Transactions on Software Engineering, 31(11):982–995, 2005.
[32] S. Demeyer, S. Ducasse, and O. Nierstrasz. Finding refactorings via change
metrics. In 15th ACM SIGPLAN conference on Object-oriented programming,
systems, languages, and applications, pages 166–178, Minneapolis, Minnesota,
USA, 2000.
[33] B. Dufour, K. Driesen, L. J. Hendren, and C. Verbrugge. Dynamic metrics for
Java. In Conference on Object-Oriented Programming Systems, Languages and
Applications, pages 149–168, Anaheim, California, USA, October 26-30 2003.
[34] J. Eder, G. Kappel, and M. Schrefl. Coupling and cohesion in object–oriented
systems. Technical Report 2/93, Department of Information Systems, Univer-
sity of Linz, Linz, Austria, 1993.
[35] D.W. Embley and S.N. Woodfield. Cohesion and coupling for abstract data
types. In 6th International Phoenix Conference on Computers and Communi-
cations, pages 144–153, Phoenix, Arizona, USA, 1987.
[36] T. J. Emerson. A discriminant metric for module cohesion. In 7th International
Conference on Software Engineering, pages 294–303, Orlando, Florida, USA,
1984.
[37] T. J. Emerson. Program testing, path coverage, and the cohesion metric. In
Computer Software Application Conference, pages 421–431, Chicago, Illinois,
USA, 1984.
[38] J. Engel. Programming for the Java Virtual Machine. Addison-Wesley, Cali-
fornia, USA, 1999.
[39] N.E. Fenton and M. Neil. Software metrics: Successes, failures and new direc-
tions. The Journal of Systems and Software, 47:149–157, 1999.
[40] N.E. Fenton and S.L. Pfleeger. Software Metrics: A Rigorous and Practical
Approach. PWS Publishing Company, Boston, Massachusetts, USA, 1997.
[41] F. Fioravanti and P. Nesi. A study on fault-proneness detection of object-
oriented systems. In Fifth European Conference on Software Maintenance and
Reengineering, pages 121–130, Lisbon, Portugal, 14-16 March 2001.
[42] P.G. Frankl and E.J. Weyuker. An applicable family of data flow testing criteria.
IEEE Transactions on Software Engineering, 14(10):1483–1498, 1988.
[43] R.J. Freund and W.J. Wilson. Regression Analysis: Statistical Modeling of a
Response Variable. Academic Press, 1998.
[44] R.R. Gonzalez. A unified metric of software complexity: Measuring produc-
tivity, quality and value. The Journal of Systems and Software, 29(1):17–37,
1995.
[45] D. Gregg, J. Power, and J. Waldron. Platform independent dynamic Java vir-
tual machine analysis: the Java Grande Forum benchmark suite. Concurrency
and Computation: Practice and Experience, 15(3-5):459–484, March 2003.
[46] N. Gupta and P. Rao. Program execution based module cohesion measurement.
In 16th International Conference on Automated on Software Engineering, pages
144–153, San Diego, USA, Nov 2001.
[47] M. Halstead. Elements of Software Science. North-Holland, Amsterdam, 1977.
[48] R. G. Hamlet. Testing programs with the aid of a compiler. IEEE Transactions
on Software Engineering, 3(4):279–290, 1977.
[49] B. Henderson-Sellers. Software Metrics. Prentice Hall, Hemel Hempstead, U.K.,
1996.
[50] B. Henderson-Sellers and J. Edwards. Object-Oriented Knowledge: The Work-
ing Object (Book Two). Prentice Hall, Sydney, Australia, 1994.
[51] M. Hitz and B. Montazeri. Measuring coupling and cohesion in object-oriented
systems. In International Symposium on Applied Corporate Computing, pages
25–27, Monterrey, Mexico, October 1995.
[52] M. Hitz and B. Montazeri. Measuring product attributes of object-oriented
systems. In Fifth European Software Engineering Conference, pages 124 – 136,
Barcelona, Spain, September 1995.
[53] C. Howells. Gretel: An open-source residual test coverage tool, June 2002.
http://www.cs.uoregon.edu/research/perpetual/Software/Gretel/.
[54] T.O. Humphries, A. Klauser, A.L. Wolf, and B.G. Zorn. An infrastructure for
generating and sharing experimental workloads for persistent object systems.
Software–Practice and Experience, 30:387–417, 2000.
[55] Jakarta. The Apache Jakarta Project. http://jakarta.apache.org/.
[56] I.T. Jolliffe. Principal Component Analysis. Springer Verlag, 2nd edition, 2002.
[57] Jikes JVM. http://www-124.ibm.com/developerworks/oss/jikes/.
[58] Kaffe JVM. http://www.kaffe.org/.
[59] Sable JVM. http://sablevm.org/.
[60] B. Kitchenham and S.L. Pfleeger. Software quality: The elusive target. IEEE
Software, pages 12–21, 1996.
[61] A. Lakhotia. Rule-based approach to computing module cohesion. In 15th In-
ternational Conference on Software Engineering, pages 35–44, Baltimore, Mary-
land, USA, 1993.
[62] Y.S. Lee, B.S. Liang, S.F. Wu, and F.J. Wang. Measuring the coupling and
cohesion of an object-oriented program based on information flow. In Interna-
tional Conference on Software Quality, pages 81–90, Maribor, Slovenia, 1995.
[63] W. Li and S. Henry. Object-oriented metrics that predict maintainability. The
Journal of Systems and Software, 23(2):111–122, 1993.
[64] R.J. Lipton, R.A. DeMillo, and F.G. Sayward. Hints on test data selection:
Help for the practicing programmer. IEEE Computer, 11(4):34–41, 1978.
[65] M. Lorenz and J. Kidds. Object-Oriented Software Metrics. Prentice Hall
Object-Oriented Series, Englewood Cliffs, USA, 1994.
[66] Y. Ma, J. Offutt, and Y. Kwon. MuJava: An automated class mutation system.
The Journal of Software Testing, Verification and Reliability, 15(2):97–
133, June 2005.
[67] Y.S. Ma, Y.R. Kwon, and J. Offutt. mujava. http://www.isse.gmu.edu/faculty/ofut/mujava/.
[68] Y.S. Ma, Y.R. Kwon, and J. Offutt. Inter-class mutation operators for Java.
In 13th International Symposium on Software Reliability Engineering, pages
352–363, Annapolis, Maryland, USA, November 2002.
[69] Y.K. Malaiya, M.N. Li, J.M. Bieman, and R. Karcich. Software reliability
growth with test coverage. IEEE Transactions on Reliability, 51(4):420–426,
December 2002.
[70] R. Martin. OO design quality metrics: An analysis of dependencies. In Pro-
ceedings Workshop on Pragmatic and Theoretical Directions in Object-Oriented
Software Metrics, 1994.
[71] T. McCabe. A software complexity measure. IEEE Transactions on Software
Engineering, 2(4):308–320, 1976.
[72] J.D. McGregor and D.A. Sykes. A Practical Guide to Testing Object-oriented
Software. Addison Wesley, March 2001.
[73] P.C. Mehlitz. Performance analysis of Java implementations. http://www.transvirtual.com/presentations/speed/index.html.
[74] A. Mitchell and J.F. Power. Masters thesis: Dynamic coupling and cohesion
metrics for Java programs. Department of Computer Science, N.U.I. Maynooth,
Co. Kildare, Ireland, Aug 2002.
[75] A. Mitchell and J.F. Power. Run-time cohesion metrics for the analysis of
Java programs - preliminary results from the SPEC and Grande suites. Tech-
nical Report NUIM-CS-TR2003-08, Department of Computer Science, N.U.I.
Maynooth, Co. Kildare, Ireland, April 2003.
[76] A. Mitchell and J.F. Power. Run-time coupling metrics for the analysis of
Java programs - preliminary results from the SPEC and Grande suites. Tech-
nical Report NUIM-CS-TR2003-07, Department of Computer Science, N.U.I.
Maynooth, Co. Kildare, Ireland, April 2003.
[77] A. Mitchell and J.F. Power. Toward a definition of run-time object-oriented
metrics. In 7th ECOOP Workshop on Quantitative Approaches in Object-
Oriented Software Engineering, Darmstadt, Germany, July 2003.
[78] A. Mitchell and J.F. Power. An empirical investigation into the dimensions of
run-time coupling in java programs. In 3rd Conference on the Principles and
Practice of Programming in Java, pages 9–14, Las Vegas, Nevada, USA, June
16-18 2004.
[79] A. Mitchell and J.F. Power. Run-time cohesion metrics: An empirical inves-
tigation. In International Conference on Software Engineering Research and
Practice, pages 532–537, Las Vegas, Nevada, USA, June 21-24 2004.
[80] A. Mitchell and J.F. Power. A study of the influence of coverage on the re-
lationship between static and dynamic coupling metrics. Science of Computer
Programming Elsevier, Accepted for Publication, March 2005.
[81] A. Mitchell and J.F. Power. Using object-level run-time metrics to study cou-
pling between objects. In ACM Symposium on Applied Computing, pages 1456–
1463, Santa Fe, New Mexico, USA, March 13-17 2005.
[82] G. Myers. Reliable Software Through Composite Design. Mason and Lipscomb
Publishers, New York, USA, 1974.
[83] G. Myers. Composite Structured Design. Van Nostrand Reinhold, New York,
USA, 1978.
[84] S. Ntafos. A comparison of some structural testing strategies. IEEE Transac-
tions on Software Engineering, 14(6):868–874, June 1988.
[85] A.J. Offutt, M.J. Harrold, and P. Kolte. A software metrics system for module
coupling. The Journal of Systems and Software, 20(3):295–308, 1993.
[86] J. Offutt, R. Alexander, Y. Wu, Q. Xiao, and C. Hutchinson. A fault model for
subtype inheritance and polymorphism. In 12th International Symposium on
Software Reliability Engineering, pages 84–93, Hong Kong, China, November
2001.
[87] J. Offutt, A. Lee, G. Rothermel, R. Untch, and C. Zapf. An experimental
determination of sufficient mutation operators. ACM Transactions on Software
Engineering Methodology, 5(2):99–118, April 1996.
[88] L. M. Ott and J. J. Thuss. The relationship between slices and module cohesion.
In 11th International Conference on Software Engineering, pages 198 – 204,
Pittsburgh, Pennsylvania, USA, 1989.
[89] M. Page-Jones. The Practical Guide to Structured Systems Design. Yourdon
Press, New York, NY, 1980.
[90] E.S. Pearson and H.O. Hartley. Biometrika Tables for Statisticians, volume 2.
Cambridge University Press, Cambridge, England, 1972.
[91] S. Phattarsukol and P. Muenchaisri. Identifying candidate objects using hierar-
chical clustering analysis. In 8th Asia-Pacific Software Engineering Conference,
pages 381–389, Macao, China, December 4-7 2001.
[92] The Apache XML Project. Xalan. http://xml.apache.org/xalan-j/.
[93] S.S. Shapiro and M.B. Wilk. An analysis of variance test for normality (com-
plete samples). Biometrika, 52(3/4):591–611, 1965.
[94] M.L. Shooman. Software Engineering: Design, Reliability and Management.
McGraw Hill, New York, USA, 1983.
[95] W.P. Stevens, G.J. Myers, and L. L. Constantine. Structured design. IBM
Systems Journal, 13(2):115–139, 1974.
[96] Sun Microsystems, Inc. Java Platform Debugger Architecture (JPDA). http://java.sun.com/products/jpda.
[97] TimeWeb. Correlation explained. http://www.bized.ac.uk/timeweb/crunching/crunch_relate_expl.htm, 2002.
[98] D.A. Troy and S.H. Zweben. Measuring the quality of structured designs. The
Journal of Systems and Software, 2:112–120, 1981.
[99] S.M. Yacoub, H.H. Ammar, and T. Robinson. Dynamic metrics for object-
oriented designs. In Software Metrics Symposium, pages 50–61, Boca Raton,
Florida, USA, Nov 4-6 1999.
Note: All URLs correct as of 30th September 2005.