Benchmarking Technical Quality of Software Products
Jose Pedro Correia          Joost Visser
Software Improvement Group
Amsterdam, The Netherlands
[email protected]    [email protected]
Abstract
To enable systematic comparison of technical quality of
(groups of) software products, we have collected measure-
ment data of a wide range of systems into a benchmark
repository. The measurements were taken over the course
of several years of delivering software assessment services
to corporations and public institutes. The granularity of the
collected data follows the layered structure of a model for
software product quality, based on the ISO/IEC 9126 inter-
national standard, which we developed previously.
In this paper, we describe the design of our benchmark
repository, and we explain how it can be used to perform
comparisons of systems. To provide a concrete illustration
of the concept without revealing confidential data, we use a
selection of open source systems as example.
1. Introduction
The Software Improvement Group (SIG) performs
source code analysis on scores of software systems on an
annual basis. In the context of several IT management con-
sultancy services [1, 8, 7, 13], these systems are analyzed on
a per-system basis. The analyses focus on technical qual-
ity of systems (Are they well built?), rather than functional
quality (Do they perform the required functions?). Find-
ings at the technical level (source code) are used as input
to formulate recommendations at the managerial level (sys-
tem, project). The services of SIG are not restricted to a
particular technology, but encompass all types of technol-
ogy mixes, including legacy mainframe systems, object-oriented web-based systems, embedded control software,
and customizations of enterprise resource planning (ERP)
packages.
In this paper, we address the question of how the mea-
surement data collected on a per-system basis can be used to
perform systematic comparisons between systems or groups
of systems. Such comparisons amount to benchmarking of
software products with respect to their technical quality.
In order to make meaningful comparisons possible at
the system level, the low-level source code measurements
of each system must be aggregated. The manner of ag-
gregation should be such that (i) differences in technolo-
gies, architectures, size, functionality, etc. are overcome, but
(ii) sufficient information-value is retained after aggrega-
tion. To meet these requirements, we employ a layered
model of software product quality that we developed pre-
viously [3]. The model summarizes the distribution of a
particular source code metric in a given system into a so-
called quality profile: a vector of the relative code volumes
that fall into risk categories of increasing severity. Subse-
quently, the quality profiles are combined and mapped, first
into ratings for system-level properties, such as complexity
and duplication, and finally into scores for various quality
aspects as defined by the ISO/IEC international standard for
software product quality [5].
The paper is structured as follows. In Section 2, we re-
call our quality model based on ISO/IEC 9126 and explain
the quality profiles used for aggregating source code met-
rics to the system level. In Section 3, we explain the design
of our benchmark repository and we characterize its cur-
rent contents. In Section 4, we explain how the benchmark
repository can be used to perform comparisons of systems.
In Section 5, we discuss related work, and Section 6 concludes.
2. Aggregation of product quality metrics
The software product quality model employed by SIG
was described in simplified form in [3]. The model distin-
guishes three levels: source code metrics, system proper-
ties, and quality sub-characteristics as defined by ISO/IEC
9126 [5]. Data collection occurs at the source code level,
using metrics such as McCabe's complexity, lines of code,
volume of duplicated code blocks, and test coverage per-
centages. The collected measurement values are then aggre-
gated to ratings at the level of properties of the entire sys-
tem, such as volume, code redundancy, unit complexity, and
test quality. Finally, the system property ratings are mapped
to ratings of the ISO/IEC 9126 characteristics, such as the
analysability and changeability sub-characteristics and the
maintainability characteristic. The quality ratings are given
on a 5-point scale, ranging from excellent (++) to poor (--).
At the source code level, many measurements are taken
per unit of code. We summarize the distribution of such
source code metrics throughout a given system into what
we call a quality profile. For example, a McCabe complex-
ity number is derived for each method in a Java system. To
compute the quality profile, the units (methods) are first assigned to four risk categories:
MCC     Risk evaluation
1-10    without much risk
11-20   moderate risk
21-50   high risk
> 50    very high risk
Secondly, the relative volume of each risk category is com-
puted by summing the lines of code of the units of that cate-
gory and dividing by the total lines of code in all units. The
vector of percentages we obtain is the quality profile for unit
complexity for the system (examples follow below).
The complexity quality profile is used to arrive at a rating
of complexity as a system property following this schema:
        max. relative LOC per MCC risk category
rank    moderate    high    very high
++      25%         0%      0%
+       30%         5%      0%
o       40%         10%     0%
-       50%         15%     5%
--      -           -       -
To score a + for example, a system can have at most 5% of
its code lines in methods with high-risk complexity and at
most 30% in methods with moderate-risk complexity.
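The aggregation just described can be sketched as follows. The risk thresholds are taken from the two tables above; the function names and the dictionary representation of profiles are illustrative, not SIG's actual tooling:

```python
def risk_category(mcc):
    """Assign a unit's McCabe complexity (MCC) to a risk category."""
    if mcc <= 10:
        return "low"
    if mcc <= 20:
        return "moderate"
    if mcc <= 50:
        return "high"
    return "very high"

def quality_profile(units):
    """units: list of (mcc, loc) pairs, one per method.
    Returns the relative code volume per risk category."""
    totals = {"low": 0, "moderate": 0, "high": 0, "very high": 0}
    for mcc, loc in units:
        totals[risk_category(mcc)] += loc
    all_loc = sum(totals.values())
    return {cat: loc / all_loc for cat, loc in totals.items()}

# Rating rows from the schema above: max. relative LOC per category.
THRESHOLDS = [
    ("++", 0.25, 0.00, 0.00),
    ("+",  0.30, 0.05, 0.00),
    ("o",  0.40, 0.10, 0.00),
    ("-",  0.50, 0.15, 0.05),
]

def complexity_rating(profile):
    """Map a quality profile to a rank; '--' if no row is satisfied."""
    for rank, moderate, high, very_high in THRESHOLDS:
        if (profile["moderate"] <= moderate
                and profile["high"] <= high
                and profile["very high"] <= very_high):
            return rank
    return "--"
```

For instance, a system with 50% of its code in moderate-risk methods falls outside every row of the schema and is rated "--".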
Other properties have similar evaluation schemes relying
on different risk categorizations and thresholds. This anal-
ysis is done separately for each different technology, after
which the ratings are then averaged with weights according
to the relative volume of each technology in the system.
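The per-technology weighting might be sketched as below. The paper does not specify a numeric scale for the symbolic ranks, so a linear mapping from -- (-2) to ++ (+2) is assumed here for illustration:

```python
# Assumed numeric mapping for the symbolic ranks (not specified above).
SCALE = {"--": -2, "-": -1, "o": 0, "+": 1, "++": 2}

def weighted_rating(parts):
    """parts: list of (rating, loc) per technology in the system.
    Returns the volume-weighted average on the assumed numeric scale."""
    total_loc = sum(loc for _, loc in parts)
    return sum(SCALE[rating] * loc for rating, loc in parts) / total_loc
```

For example, a system that is half ++ Java and half -- C would average out to 0, i.e. an o rating under this assumed scale.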
3. The benchmarking repository
We have created a repository to collect measurement and
rating data for all systems analyzed in our consultancy
practice. We briefly discuss the various kinds of information that
can be stored in the repository.
Overall structure For each system we record the subdi-
vision into its subsystems (high-level modules) and of these
subsystems into their technology parts (programming lan-
guages). The subsystem division is sometimes evident from
the source code or documentation. Sometimes the appropri-
ate division is arrived at after elicitation of design informa-
tion from architects and developers.
Figure 1. Content of the benchmark repository by programming language: Java 44%, C 33%, C++ 15%, other 8%.
General information At the subsystem level, informa-
tion is stored about functionality, architecture, develop-
ment methodology, and development organization. This in-
formation is determined by our consultants through docu-
ment review, stakeholder elicitation, and expert judgment.
We employ taxonomies similar to those of ISBSG [4] and
Jones [6].
Source code measurements For each sub-system, source
code measurements are stored per language. The measure-
ment values are collected using the software analysis toolkit
developed by SIG. For metrics with granularity below the
subsystem level (e.g. block, method, class) the values are
aggregated into quality profiles as explained above. Addi-
tionally, histograms are stored for selected metrics.
Ratings The ratings at the various levels of our quality
model (source code, system properties, ISO/IEC 9126 char-
acteristics) are stored per subsystem and per system. These
include ratings directly derived from source code measure-
ments, as well as ratings that involve the expert opinion of
the consultants that have assessed the systems.
The database currently contains about 70 systems, with
about 160 sub-systems. The distribution of the languages
present in the benchmark is shown in Figure 1 in terms of
percentage of lines of code.
4. Comparing systems
The purpose of collecting benchmark data into a reposi-
tory is to enable systematic comparison of (groups of) soft-
ware products. In this section, we give examples of various
kinds of comparisons. Since we cannot disclose measure-
ment data for the systems of our clients, we use a series of
open source systems as illustration. These systems are listed
in Table 1, ordered by their volume. For a more interesting
comparison, we selected systems with similar functionality.
Name         Functionality  Main PLs  LOC
Jaminid      Web Server     Java      1120
CSQL         DBMS           C/C++     14830
Canopy       Web Server     C         15570
SmallSQL     DBMS           Java      18402
Axion        DBMS           Java      18921
JalistoPlus  DBMS           Java      21943
AOLServer    Web Server     C         45344
HSQLDB       DBMS           Java      65378
SQLite       DBMS           C         69302
H2           DBMS           Java      72102
Tomcat       Web Server     Java      164199
Apache       Web Server     C         203829
602SQL       DBMS           C/C++     274938
Derby        DBMS           Java      307367
Firebird     DBMS           C/C++     357537
PostgreSQL   DBMS           C         497400
MySQL        DBMS           C/C++     843175
Table 1. A selection of OSS web servers and
database systems, ordered by volume.
Figure 2. Comparison of OSS and proprietary systems.
Comparison of groups Based on characteristics such as
functionality, architecture, main programming language, or
development model, the systems in the repository can be
divided into groups. These groups can then be compared
with respect to particular quality aspects.
For example, we can compare the group of open source
systems of Table 1 against closed source systems. In Fig-
ure 2, these two groups are compared with respect to the
risk categories of complexity. For each group, the average
and standard deviation of the percentage of code in each
risk category are plotted. As the graph shows, the proprietary systems perform very similarly to, but slightly better than,
the OSS systems. In particular, the average percentage of
high and very high risk code is lower (7.60% and 3.49% vs.
10.47% and 6.64%).
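The group statistics plotted in Figure 2 amount to a per-category mean and standard deviation over the quality profiles of a group. A minimal sketch, assuming profiles are represented as dictionaries of relative volumes:

```python
from statistics import mean, pstdev

def group_statistics(profiles, categories=("moderate", "high", "very high")):
    """For each risk category, compute the mean and (population)
    standard deviation of the relative code volume across a group
    of quality profiles."""
    return {
        cat: (mean(p[cat] for p in profiles),
              pstdev(p[cat] for p in profiles))
        for cat in categories
    }
```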
Comparison of individual systems to group average In-
dividual systems can be compared to a group of systems.
This group can either be the entire database, or it can be a
Figure 3. Comparison of CSQL and Firebird in relation to proprietary systems.
selection of systems with similar functionality, architecture,
or other traits.
For example, we can focus on the CSQL and Firebird
systems and compare them to the group of proprietary sys-
tems. In Figure 3 we plot the values of these systems for the
three highest complexity risk categories and compare them
with the average values from the proprietary systems. As
the chart shows, the two systems deviate from the average
in different ways. CSQL has more moderate risk code than
the proprietary systems, even exceeding one standard deviation from the average, but has less high risk and no very high risk code. Firebird,
on the other hand, has moderate risk code inside the ex-
pected deviation from the average, but clearly exceeds the
average values for high and very high risk code.
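The deviation check described here, flagging categories where a system exceeds the group average by more than one standard deviation, could be sketched as follows (names are illustrative; stats is a mapping from category to a (mean, standard deviation) pair, as computed for the group):

```python
def deviations(system_profile, stats):
    """Return the risk categories in which a system falls more than
    one standard deviation above the group average."""
    return [cat for cat, (avg, sd) in stats.items()
            if system_profile[cat] > avg + sd]
```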
Comparison of individual systems within a group The
individual systems that are contained within a given group
can all be simultaneously compared with each other. This
reveals the amount of variation within the group as well as
the rank position of each system within the group.
For example, the group of OSS systems can be compared
with respect to complexity by ordering and plotting their
quality profiles. In Figure 4 the systems are ordered by (i)
Figure 4. Comparison of OSS systems by ranking and plotting quality profiles (from least to most complex: CSQL (+), Axion (+), Tomcat (-), Jaminid (-), Jalisto (-), Derby (-), H2 (-), SmallSQL (--), PostgreSQL (--), MySQL (--), HSQLDB (--), AOLServer (--), Apache (--), Canopy (--), Firebird (--), SQLite (--), 602SQL (--); bars show relative volume per risk category, low to very high).
overall complexity score, and (ii) percentage of code in the
high risk category. As the chart reveals, CSQL is the least
complex of the selected OSS systems, while Firebird ranks
third from last.
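The ordering used for Figure 4 can be sketched as a two-key sort. The symbolic-to-numeric mapping of the ranks is an assumption (the paper does not state one), and the data representation is illustrative:

```python
# Assumed numeric mapping for the symbolic ranks.
SCALE = {"--": -2, "-": -1, "o": 0, "+": 1, "++": 2}

def rank_systems(systems):
    """systems: mapping from system name to (rating, quality profile).
    Return names ordered from least to most complex: by overall
    complexity rating first, then by percentage of high-risk code."""
    return sorted(systems,
                  key=lambda name: (-SCALE[systems[name][0]],
                                    systems[name][1]["high"]))
```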
Apart from comparing systems, the repository can be used
to study metrics and their aggregations. In particular, the
stored histograms have been helpful to determine appropri-
ate thresholds for quality profiles, and to select the most
relevant measurements for the quality model.
5. Related work
Productivity benchmarking Jones [6] provides a treat-
ment of benchmarking of software productivity. The focus
is not on the software product, though the functional size
of systems, in terms of function points, and the technical
volume, in terms of lines of code, are taken into account.
The International Software Benchmarking Standards
Group (ISBSG) [4] collects data about software productiv-
ity and disseminates the collected data for benchmarking
purposes. Apart from function points and lines of code, no
software product measures are taken into account.
Benchmarking open source products Spinellis [12]
compared the source code quality of 4 open source oper-
ating system kernels with respect to their measurement val-
ues for a wide range of metrics. For the greater part of these
metrics, he applies averaging to aggregate the values from
code to system level. He concludes that little difference in
quality exists among the studied kernels.
Samoladas et al. [10] measured the maintainability index [9] for successive versions of 6 projects that were OSS
at least at some moment during their development. They
conclude that, as with closed source software, OSS quality
tends to deteriorate over time.
Research benchmarks Demeyer et al. [2] argue for the
use of a benchmark of software systems to validate software evolution research. Sim et al. [11] pose a challenge to
the software engineering research community to establish
a shared benchmark for validating research tools and tech-
niques. Thus, the intent of these benchmarks is to compare
research tools and techniques.
6. Concluding remarks
We have given a brief overview of our benchmark repos-
itory and we have illustrated some of its possible uses for
comparing (groups of) software systems. The type of infor-
mation stored in the repository makes it particularly useful
for comparing technical quality of software products.
Future work The benchmark repository continues to be
fed with new systems, both OSS systems and proprietary
ones analyzed in our consultancy practice. In the future, we
hope to augment the product measurements with productiv-
ity data. Also, we intend to exploit the collected informa-
tion for scientific study of software product metrics and our
quality model.
References
[1] E. Bouwers and R. Vis. Multidimensional software monitoring applied to ERP. In C. Makris and J. Visser, editors, Proc. 2nd Int. Workshop on Software Quality and Maintainability, ENTCS. Elsevier, 2008. To appear.
[2] S. Demeyer, T. Mens, and M. Wermelinger. Towards a software evolution benchmark. In IWPSE '01: Proc. 4th International Workshop on Principles of Software Evolution, pages 174–177, New York, NY, USA, 2001. ACM.
[3] I. Heitlager, T. Kuipers, and J. Visser. A practical model for measuring maintainability. In 6th Int. Conf. on the Quality of Information and Communications Technology (QUATIC 2007), pages 30–39. IEEE Computer Society, 2007.
[4] International Software Benchmarking Standards Group. www.isbsg.org.
[5] ISO. ISO/IEC 9126-1: Software engineering – product quality – part 1: Quality model, 2001.
[6] C. Jones. Software Assessments, Benchmarks, and Best Practices. Addison-Wesley, 2000.
[7] T. Kuipers and J. Visser. A tool-based methodology for software portfolio monitoring. In M. Piattini and M. Serrano, editors, Proc. 1st Int. Workshop on Software Audit and Metrics (SAM 2004), pages 118–128. INSTICC Press, 2004.
[8] T. Kuipers, J. Visser, and G. de Vries. Monitoring the quality of outsourced software. In J. van Hillegersberg et al., editors, Proc. Int. Workshop on Tools for Managing Globally Distributed Software Development (TOMAG 2007). Center for Telematics and Information Technology (CTIT), The Netherlands, 2007.
[9] P. W. Oman and J. R. Hagemeister. Construction and testing of polynomials predicting software maintainability. Journal of Systems and Software, 24(3):251–266, 1994.
[10] I. Samoladas, I. Stamelos, L. Angelis, and A. Oikonomou. Open source software development should strive for even greater code maintainability. Commun. ACM, 47(10):83–87, 2004.
[11] S. E. Sim, S. Easterbrook, and R. C. Holt. Using benchmarking to advance research: a challenge to software engineering. In ICSE '03: Proceedings of the 25th International Conference on Software Engineering, pages 74–83, Washington, DC, USA, 2003. IEEE Computer Society.
[12] D. Spinellis. A tale of four kernels. In W. Schäfer, M. B. Dwyer, and V. Gruhn, editors, ICSE '08: Proceedings of the 30th International Conference on Software Engineering, pages 381–390, New York, May 2008. ACM.
[13] A. van Deursen and T. Kuipers. Source-based software risk assessment. In ICSM '03: Proc. Int. Conference on Software Maintenance, page 385. IEEE Computer Society, 2003.