Benchmarking Technical Quality of Software Products
Jose Pedro Correia          Joost Visser
Software Improvement Group
Amsterdam, The Netherlands
[email protected]    [email protected]
Abstract
To enable systematic comparison of technical quality of
(groups of) software products, we have collected measure-
ment data of a wide range of systems into a benchmark
repository. The measurements were taken over the course
of several years of delivering software assessment services
to corporations and public institutes. The granularity of the
collected data follows the layered structure of a model for
software product quality, based on the ISO/IEC 9126 inter-
national standard, which we developed previously.
In this paper, we describe the design of our benchmark
repository, and we explain how it can be used to perform
comparisons of systems. To provide a concrete illustration
of the concept without revealing confidential data, we use a
selection of open source systems as example.
1. Introduction
The Software Improvement Group (SIG) performs
source code analysis on scores of software systems on an
annual basis. In the context of several IT management con-
sultancy services [1, 8, 7, 13], these systems are analyzed on
a per-system basis. The analyses focus on technical qual-
ity of systems (Are they well built?), rather than functional
quality (Do they perform the required functions?). Find-
ings at the technical level (source code) are used as input
to formulate recommendations at the managerial level (sys-
tem, project). The services of SIG are not restricted to a
particular technology, but encompass all types of technol-
ogy mixes, including legacy mainframe systems, object-oriented web-based systems, embedded control software,
and customizations of enterprise resource planning (ERP)
packages.
In this paper, we address the question of how the mea-
surement data collected on a per-system basis can be used to
perform systematic comparisons between systems or groups
of systems. Such comparisons amount to benchmarking of
software products with respect to their technical quality.
In order to make meaningful comparisons possible at
the system level, the low-level source code measurements
of each system must be aggregated. The manner of ag-
gregation should be such that (i) differences in technolo-
gies, architectures, size, functionality, etc. are overcome, but
(ii) sufficient information-value is retained after aggrega-
tion. To meet these requirements, we employ a layered
model of software product quality that we developed pre-
viously [3]. The model summarizes the distribution of a
particular source code metric in a given system into a so-
called quality profile: a vector of the relative code volumes
that fall into risk categories of increasing severity. Subse-
quently, the quality profiles are combined and mapped, first
into ratings for system-level properties, such as complexity
and duplication, and finally into scores for various quality
aspects as defined by the ISO/IEC international standard for
software product quality [5].
The paper is structured as follows. In Section 2, we re-
call our quality model based on ISO/IEC 9126 and explain
the quality profiles used for aggregating source code met-
rics to the system level. In Section 3, we explain the design
of our benchmark repository and we characterize its cur-
rent contents. In Section 4, we explain how the benchmark
repository can be used to perform comparisons of systems.
In Section 5, we discuss related work, and Section 6 concludes.
2. Aggregation of product quality metrics
The software product quality model employed by SIG
was described in simplified form in [3]. The model distin-
guishes three levels: source code metrics, system proper-
ties, and quality sub-characteristics as defined by ISO/IEC
9126 [5]. Data collection occurs at the source code level,
using metrics such as McCabe's complexity, lines of code,
volume of duplicated code blocks, and test coverage per-
centages. The collected measurement values are then aggre-
gated to ratings at the level of properties of the entire sys-
tem, such as volume, code redundancy, unit complexity, and
test quality. Finally, the system property ratings are mapped
to ratings of the ISO/IEC 9126 characteristics, such as the
analysability and changeability sub-characteristics and the
maintainability characteristic. The quality ratings are given
on a 5-point scale, ranging from excellent (++) to poor (--).
At the source code level, many measurements are taken
per unit of code. We summarize the distribution of such
source code metrics throughout a given system into what
we call a quality profile. For example, a McCabe complex-
ity number is derived for each method in a Java system. To
compute the quality profile, the units (methods) are first assigned to four risk categories:
MCC     Risk evaluation
1-10    without much risk
11-20   moderate risk
21-50   high risk
> 50    very high risk
Secondly, the relative volume of each risk category is com-
puted by summing the lines of code of the units of that cate-
gory and dividing by the total lines of code in all units. The
vector of percentages we obtain is the quality profile for unit
complexity for the system (examples follow below).
The complexity quality profile is used to arrive at a rating
of complexity as a system property following this schema:
        max. relative LOC per MCC risk category
rank    moderate    high    very high
++      25%         0%      0%
+       30%         5%      0%
o       40%         10%     0%
-       50%         15%     5%
--      -           -       -
To score a + for example, a system can have at most 5% of
its code lines in methods with high-risk complexity and at
most 30% in methods with moderate-risk complexity.
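The aggregation just described can be sketched as follows. The risk thresholds are taken from the two tables above; the function names and the dictionary representation of profiles are illustrative, not SIG's actual tooling:

```python
def risk_category(mcc):
    """Assign a unit's McCabe complexity (MCC) to a risk category."""
    if mcc <= 10:
        return "low"
    if mcc <= 20:
        return "moderate"
    if mcc <= 50:
        return "high"
    return "very high"

def quality_profile(units):
    """units: list of (mcc, loc) pairs, one per method.
    Returns the relative code volume per risk category."""
    totals = {"low": 0, "moderate": 0, "high": 0, "very high": 0}
    for mcc, loc in units:
        totals[risk_category(mcc)] += loc
    all_loc = sum(totals.values())
    return {cat: loc / all_loc for cat, loc in totals.items()}

# Rating rows from the schema above: max. relative LOC per category.
THRESHOLDS = [
    ("++", 0.25, 0.00, 0.00),
    ("+",  0.30, 0.05, 0.00),
    ("o",  0.40, 0.10, 0.00),
    ("-",  0.50, 0.15, 0.05),
]

def complexity_rating(profile):
    """Map a quality profile to a rank; '--' if no row is satisfied."""
    for rank, moderate, high, very_high in THRESHOLDS:
        if (profile["moderate"] <= moderate
                and profile["high"] <= high
                and profile["very high"] <= very_high):
            return rank
    return "--"
```

For instance, a system with 50% of its code in moderate-risk methods falls outside every row of the schema and is rated "--".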
Other properties have similar evaluation schemes relying
on different risk categorizations and thresholds. This anal-
ysis is done separately for each different technology, after
which the ratings are then averaged with weights according
to the relative volume of each technology in the system.
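The per-technology weighting might be sketched as below. The paper does not specify a numeric scale for the symbolic ranks, so a linear mapping from -- (-2) to ++ (+2) is assumed here for illustration:

```python
# Assumed numeric mapping for the symbolic ranks (not specified above).
SCALE = {"--": -2, "-": -1, "o": 0, "+": 1, "++": 2}

def weighted_rating(parts):
    """parts: list of (rating, loc) per technology in the system.
    Returns the volume-weighted average on the assumed numeric scale."""
    total_loc = sum(loc for _, loc in parts)
    return sum(SCALE[rating] * loc for rating, loc in parts) / total_loc
```

For example, a system that is half ++ Java and half -- C would average out to 0, i.e. an o rating under this assumed scale.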
3. The benchmarking repository
We have created a repository to collect measurement and
rating data for all systems analyzed in our consultancy
practice. We briefly discuss the various kinds of information that
can be stored in the repository.
Overall structure For each system we record the subdi-
vision into its subsystems (high-level modules) and of these
subsystems into their technology parts (programming lan-
guages). The subsystem division is sometimes evident from
the source code or documentation. Sometimes the appropri-
ate division is arrived at after elicitation of design informa-
tion from architects and developers.
Figure 1. Content of the benchmark repository by programming language: Java 44%, C 33%, C++ 15%, other 8%.
General information At the subsystem level, informa-
tion is stored about functionality, architecture, develop-
ment methodology, and development organization. This in-
formation is determined by our consultants through docu-
ment review, stakeholder elicitation, and expert judgment.
We employ taxonomies similar to those of ISBSG [4] and
Jones [6].
Source code measurements For each sub-system, source
code measurements are stored per language. The measure-
ment values are collected using the software analysis toolkit
developed by SIG. For metrics with granularity below the
subsystem level (e.g. block, method, class) the values are
aggregated into quality profiles as explained above. Addi-
tionally, histograms are stored for selected metrics.
Ratings The ratings at the various levels of our quality
model (source code, system properties, ISO/IEC 9126 char-
acteristics) are stored per subsystem and per system. These
include ratings directly derived from source code measure-
ments, as well as ratings that involve the expert opinion of
the consultants that have assessed the systems.
The database currently contains about 70 systems, with
about 160 sub-systems. The distribution of the languages
present in the benchmark is shown in Figure 1 in terms of
percentage of lines of code.
4. Comparing systems
The purpose of collecting benchmark data into a reposi-
tory is to enable systematic comparison of (groups of) soft-
ware products. In this section, we give examples of various
kinds of comparisons. Since we cannot disclose measure-
ment data for the systems of our clients, we use a series of
open source systems as illustration. These systems are listed
in Table 1, ordered by their volume. For a more interesting
comparison, we selected systems with similar functionality.
Name         Functionality  Main PLs  LOC
Jaminid      Web Server     Java      1120
CSQL         DBMS           C/C++     14830
Canopy       Web Server     C         15570
SmallSQL     DBMS           Java      18402
Axion        DBMS           Java      18921
JalistoPlus  DBMS           Java      21943
AOLServer    Web Server     C         45344
HSQLDB       DBMS           Java      65378
SQLite       DBMS           C         69302
H2           DBMS           Java      72102
Tomcat       Web Server     Java      164199
Apache       Web Server     C         203829
602SQL       DBMS           C/C++     274938
Derby        DBMS           Java      307367
Firebird     DBMS           C/C++     357537
PostgreSQL   DBMS           C         497400
MySQL        DBMS           C/C++     843175
Table 1. A selection of OSS web servers and
database systems, ordered by volume.
Figure 2. Comparison of OSS and proprietary systems.
Comparison of groups Based on characteristics such as
functionality, architecture, main programming language, or
development model, the systems in the repository can be
divided into groups. These groups can then be compared
with respect to particular quality aspects.
For example, we can compare the group of open source
systems of Table 1 against closed source systems. In Fig-
ure 2, these two groups are compared with respect to the
risk categories of complexity. For each group, the average
and standard deviation of the percentage of code in each
risk category are plotted. As the graph shows, the proprietary systems perform very similarly to, but slightly better than,
the OSS systems. In particular, the average percentage of
high and very high risk code is lower (7.60% and 3.49% vs.
10.47% and 6.64%).
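The group statistics plotted in Figure 2 amount to a per-category mean and standard deviation over the quality profiles of a group. A minimal sketch, assuming profiles are represented as dictionaries of relative volumes:

```python
from statistics import mean, pstdev

def group_statistics(profiles, categories=("moderate", "high", "very high")):
    """For each risk category, compute the mean and (population)
    standard deviation of the relative code volume across a group
    of quality profiles."""
    return {
        cat: (mean(p[cat] for p in profiles),
              pstdev(p[cat] for p in profiles))
        for cat in categories
    }
```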
Comparison of individual systems to group average In-
dividual systems can be compared to a group of systems.
This group can either be the entire database, or it can be a
Figure 3. Comparison of CSQL and Firebird in relation to proprietary systems.
selection of systems with similar functionality, architecture,
or other traits.
For example, we can focus on the CSQL and Firebird
systems and compare them to the group of proprietary sys-
tems. In Figure 3 we plot the values of these systems for the
three highest complexity risk categories and compare them
with the average values from the proprietary systems. As
the chart shows, the two systems deviate from the average
in different ways. CSQL has more moderate risk code than
the proprietary systems, even exceeding one standard deviation from the average, but has less high risk and no very high risk code. Firebird,
on the other hand, has moderate risk code inside the ex-
pected deviation from the average, but clearly exceeds the
average values for high and very high risk code.
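The deviation check described here, flagging categories where a system exceeds the group average by more than one standard deviation, could be sketched as follows (names are illustrative; stats is a mapping from category to a (mean, standard deviation) pair, as computed for the group):

```python
def deviations(system_profile, stats):
    """Return the risk categories in which a system falls more than
    one standard deviation above the group average."""
    return [cat for cat, (avg, sd) in stats.items()
            if system_profile[cat] > avg + sd]
```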
Comparison of individual systems within a group The
individual systems that are contained within a given group
can all be simultaneously compared with each other. This
reveals the amount of variation within the group as well as
the rank position of each system within the group.
For example, the group of OSS systems can be compared
with respect to complexity by ordering and plotting their
quality profiles. In Figure 4 the systems are ordered by (i)
Figure 4. Comparison of OSS systems by ranking and plotting quality profiles (from least to most complex: CSQL (+), Axion (+), Tomcat (-), Jaminid (-), Jalisto (-), Derby (-), H2 (-), SmallSQL (--), PostgreSQL (--), MySQL (--), HSQLDB (--), AOLServer (--), Apache (--), Canopy (--), Firebird (--), SQLite (--), 602SQL (--); bars show relative volume per risk category, low to very high).
overall complexity score, and (ii) percentage of code in the
high risk category. As the chart reveals, CSQL is the least
complex of the selected OSS systems, while Firebird ranks
third from last.
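The ordering used for Figure 4 can be sketched as a two-key sort. The symbolic-to-numeric mapping of the ranks is an assumption (the paper does not state one), and the data representation is illustrative:

```python
# Assumed numeric mapping for the symbolic ranks.
SCALE = {"--": -2, "-": -1, "o": 0, "+": 1, "++": 2}

def rank_systems(systems):
    """systems: mapping from system name to (rating, quality profile).
    Return names ordered from least to most complex: by overall
    complexity rating first, then by percentage of high-risk code."""
    return sorted(systems,
                  key=lambda name: (-SCALE[systems[name][0]],
                                    systems[name][1]["high"]))
```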
Apart from comparing systems, the repository can be used
to study metrics and their aggregations. In particular, the
stored histograms have been helpful to determine appropri-
ate thresholds for quality profiles, and to select the most
relevant measurements for the quality model.
5. Related work
Productivity benchmarking Jones [6] provides a treat-
ment of benchmarking of software productivity. The focus
is not on the software product, though the functional size
of systems, in terms of function points, and the technical
volume, in terms of lines of code, are taken into account.
The International Software Benchmarking Standards
Group (ISBSG) [4] collects data about software productiv-
ity and disseminates the collected data for benchmarking
purposes. Apart from function points and lines of code, no
software product measures are taken into account.
Benchmarking open source products Spinellis [12]
compared the source code quality of 4 open source oper-
ating system kernels with respect to their measurement val-
ues for a wide range of metrics. For the greater part of these
metrics, he applies averaging to aggregate the values from
code to system level. He concludes that little difference in
quality exists among the studied kernels.
Samoladas et al. [10] measured the maintainability index [9] for successive versions of 6 projects that were OSS
at least at some moment during their development. They
conclude that, as with closed source software, OSS quality
tends to deteriorate over time.
Research benchmarks Demeyer et al. [2] argue for the
use of a benchmark of software systems to validate software evolution research. Sim et al. [11] pose a challenge to
the software engineering research community to establish
a shared benchmark for validating research tools and tech-
niques. Thus, the intent of these benchmarks is to compare
research tools and techniques.
6. Concluding remarks
We have given a brief overview of our benchmark repos-
itory and we have illustrated some of its possible uses for
comparing (groups of) software systems. The type of infor-
mation stored in the repository makes it particularly useful
for comparing technical quality of software products.
Future work The benchmark repository continues to be
fed with new systems, both OSS systems and proprietary
ones analyzed in our consultancy practice. In the future, we
hope to augment the product measurements with productiv-
ity data. Also, we intend to exploit the collected informa-
tion for scientific study of software product metrics and our
quality model.
References
[1] E. Bouwers and R. Vis. Multidimensional software monitoring applied to ERP. In C. Makris and J. Visser, editors, Proc. 2nd Int. Workshop on Software Quality and Maintainability, ENTCS. Elsevier, 2008. To appear.
[2] S. Demeyer, T. Mens, and M. Wermelinger. Towards a software evolution benchmark. In IWPSE '01: Proc. 4th International Workshop on Principles of Software Evolution, pages 174–177, New York, NY, USA, 2001. ACM.
[3] I. Heitlager, T. Kuipers, and J. Visser. A practical model for measuring maintainability. In 6th Int. Conf. on the Quality of Information and Communications Technology (QUATIC 2007), pages 30–39. IEEE Computer Society, 2007.
[4] International Software Benchmarking Standards Group. www.isbsg.org.
[5] ISO. ISO/IEC 9126-1: Software engineering – product quality – part 1: Quality model, 2001.
[6] C. Jones. Software Assessments, Benchmarks, and Best Practices. Addison-Wesley, 2000.
[7] T. Kuipers and J. Visser. A tool-based methodology for software portfolio monitoring. In M. Piattini and M. Serrano, editors, Proc. 1st Int. Workshop on Software Audit and Metrics (SAM 2004), pages 118–128. INSTICC Press, 2004.
[8] T. Kuipers, J. Visser, and G. de Vries. Monitoring the quality of outsourced software. In J. van Hillegersberg et al., editors, Proc. Int. Workshop on Tools for Managing Globally Distributed Software Development (TOMAG 2007). Center for Telematics and Information Technology (CTIT), The Netherlands, 2007.
[9] P. W. Oman and J. R. Hagemeister. Construction and testing of polynomials predicting software maintainability. Journal of Systems and Software, 24(3):251–266, 1994.
[10] I. Samoladas, I. Stamelos, L. Angelis, and A. Oikonomou. Open source software development should strive for even greater code maintainability. Commun. ACM, 47(10):83–87, 2004.
[11] S. E. Sim, S. Easterbrook, and R. C. Holt. Using benchmarking to advance research: a challenge to software engineering. In ICSE '03: Proceedings of the 25th International Conference on Software Engineering, pages 74–83, Washington, DC, USA, 2003. IEEE Computer Society.
[12] D. Spinellis. A tale of four kernels. In W. Schäfer, M. B. Dwyer, and V. Gruhn, editors, ICSE '08: Proceedings of the 30th International Conference on Software Engineering, pages 381–390, New York, May 2008. ACM.
[13] A. van Deursen and T. Kuipers. Source-based software risk assessment. In ICSM '03: Proc. Int. Conference on Software Maintenance, page 385. IEEE Computer Society, 2003.