
Benchmarking Technical Quality of Software Products

Jose Pedro Correia    Joost Visser
Software Improvement Group
Amsterdam, The Netherlands
[email protected] [email protected]

    Abstract

To enable systematic comparison of the technical quality of (groups of) software products, we have collected measurement data on a wide range of systems into a benchmark repository. The measurements were taken over the course of several years of delivering software assessment services to corporations and public institutes. The granularity of the collected data follows the layered structure of a model for software product quality, based on the ISO/IEC 9126 international standard, which we developed previously.

In this paper, we describe the design of our benchmark repository and explain how it can be used to perform comparisons of systems. To provide a concrete illustration of the concept without revealing confidential data, we use a selection of open source systems as examples.

    1. Introduction

The Software Improvement Group (SIG) performs source code analysis on scores of software systems on an annual basis. In the context of several IT management consultancy services [1, 8, 7, 13], these systems are analyzed on a per-system basis. The analyses focus on the technical quality of systems (Are they well built?), rather than their functional quality (Do they perform the required functions?). Findings at the technical level (source code) are used as input to formulate recommendations at the managerial level (system, project). The services of SIG are not restricted to a particular technology, but encompass all types of technology mixes, including legacy mainframe systems, object-oriented web-based systems, embedded control software, and customizations of enterprise resource planning (ERP) packages.

In this paper, we address the question of how the measurement data collected on a per-system basis can be used to perform systematic comparisons between systems or groups of systems. Such comparisons amount to benchmarking of software products with respect to their technical quality.

In order to make meaningful comparisons possible at the system level, the low-level source code measurements of each system must be aggregated. The manner of aggregation should be such that (i) differences in technologies, architectures, size, functionality, etc. are overcome, but (ii) sufficient information value is retained after aggregation. To meet these requirements, we employ a layered model of software product quality that we developed previously [3]. The model summarizes the distribution of a particular source code metric in a given system into a so-called quality profile: a vector of the relative code volumes that fall into risk categories of increasing severity. Subsequently, the quality profiles are combined and mapped, first into ratings for system-level properties, such as complexity and duplication, and finally into scores for various quality aspects as defined by the ISO/IEC international standard for software product quality [5].

The paper is structured as follows. In Section 2, we recall our quality model based on ISO/IEC 9126 and explain the quality profiles used for aggregating source code metrics to the system level. In Section 3, we explain the design of our benchmark repository and characterize its current contents. In Section 4, we explain how the benchmark repository can be used to perform comparisons of systems.

    2. Aggregation of product quality metrics

The software product quality model employed by SIG was described in simplified form in [3]. The model distinguishes three levels: source code metrics, system properties, and quality sub-characteristics as defined by ISO/IEC 9126 [5]. Data collection occurs at the source code level, using metrics such as McCabe's complexity, lines of code, volume of duplicated code blocks, and test coverage percentages. The collected measurement values are then aggregated to ratings at the level of properties of the entire system, such as volume, code redundancy, unit complexity, and test quality. Finally, the system property ratings are mapped to ratings of the ISO/IEC 9126 characteristics, such as the analysability and changeability sub-characteristics and the maintainability characteristic. The quality ratings are given on a 5-point scale, ranging from excellent (++) to poor (--).

At the source code level, many measurements are taken per unit of code. We summarize the distribution of such source code metrics throughout a given system into what we call a quality profile. For example, a McCabe complexity number is derived for each method in a Java system. To compute the quality profile, the units (methods) are first assigned to four risk categories:

MCC      Risk evaluation
1-10     without much risk
11-20    moderate risk
21-50    high risk
> 50     very high risk

Secondly, the relative volume of each risk category is computed by summing the lines of code of the units of that category and dividing by the total lines of code in all units. The vector of percentages we obtain is the quality profile for unit complexity for the system (examples follow below).

The complexity quality profile is used to arrive at a rating of complexity as a system property, following this schema:

         max. relative LOC per MCC risk category
rank     moderate    high    very high
++       25%          0%      0%
+        30%          5%      0%
o        40%         10%      0%
-        50%         15%      5%
--        -           -       -

To score a +, for example, a system can have at most 5% of its code lines in methods with high-risk complexity and at most 30% in methods with moderate-risk complexity.
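To make the aggregation concrete, the following is a minimal sketch in Java of how a unit-complexity quality profile and its rating could be computed, assuming per-unit McCabe values and line counts are already available; the class and method names are illustrative and are not part of the SIG toolkit.

import java.util.List;

/** Illustrative sketch: per-unit complexity aggregated into a quality profile. */
public class ComplexityProfile {

    /** A unit of code (e.g. a Java method) with its McCabe value and size. */
    public record Unit(int mcCabe, int linesOfCode) {}

    /** Relative LOC per risk category: [low, moderate, high, veryHigh], summing to 1.0. */
    public static double[] qualityProfile(List<Unit> units) {
        double[] locPerCategory = new double[4];
        double totalLoc = 0;
        for (Unit u : units) {
            int category;
            if (u.mcCabe() <= 10)      category = 0; // without much risk
            else if (u.mcCabe() <= 20) category = 1; // moderate risk
            else if (u.mcCabe() <= 50) category = 2; // high risk
            else                       category = 3; // very high risk
            locPerCategory[category] += u.linesOfCode();
            totalLoc += u.linesOfCode();
        }
        for (int i = 0; i < 4; i++) {
            locPerCategory[i] /= totalLoc;
        }
        return locPerCategory;
    }

    /** Map a profile to a rating using the thresholds of the schema above. */
    public static String rate(double[] profile) {
        double moderate = profile[1], high = profile[2], veryHigh = profile[3];
        if (moderate <= 0.25 && high <= 0.00 && veryHigh <= 0.00) return "++";
        if (moderate <= 0.30 && high <= 0.05 && veryHigh <= 0.00) return "+";
        if (moderate <= 0.40 && high <= 0.10 && veryHigh <= 0.00) return "o";
        if (moderate <= 0.50 && high <= 0.15 && veryHigh <= 0.05) return "-";
        return "--";
    }
}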

Other properties have similar evaluation schemes, relying on different risk categorizations and thresholds. This analysis is done separately for each technology, after which the ratings are averaged with weights according to the relative volume of each technology in the system.
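The paper does not state the numeric encoding used for this averaging; as an illustration only, the sketch below assumes the 5-point scale is mapped to the integers 1 (--) through 5 (++) and that each technology's rating is weighted by its share of the system's lines of code.

import java.util.Map;

/** Illustrative sketch: combine per-technology ratings into a system-level rating. */
public class WeightedRating {

    /** Assumed numeric encoding of the 5-point scale: "--" = 1 ... "++" = 5. */
    static final Map<String, Integer> SCALE =
            Map.of("--", 1, "-", 2, "o", 3, "+", 4, "++", 5);

    /**
     * Average per-technology ratings, weighting each technology by its
     * relative volume (lines of code) within the system.
     */
    public static double systemRating(Map<String, String> ratingPerTechnology,
                                      Map<String, Integer> locPerTechnology) {
        double totalLoc = locPerTechnology.values().stream().mapToInt(Integer::intValue).sum();
        double weightedSum = 0;
        for (var entry : ratingPerTechnology.entrySet()) {
            double weight = locPerTechnology.get(entry.getKey()) / totalLoc;
            weightedSum += weight * SCALE.get(entry.getValue());
        }
        return weightedSum; // e.g. 3.7 on the 1..5 scale, to be rounded back to a tier
    }
}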

    3. The benchmarking repository

We have created a repository to collect measurement and rating data for all systems analyzed in our consultancy practice. We briefly discuss the various kinds of information that can be stored in the repository.

Overall structure  For each system we record its subdivision into subsystems (high-level modules), and the subdivision of these subsystems into their technology parts (programming languages). The subsystem division is sometimes evident from the source code or documentation; sometimes the appropriate division is arrived at after elicitation of design information from architects and developers.
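As a rough, hypothetical illustration of this nesting (not the actual schema of the SIG repository), the structure could be modelled as follows.

import java.util.List;
import java.util.Map;

/** Illustrative data model for the benchmark repository (hypothetical, not the SIG schema). */
public class RepositoryModel {

    /** A technology part: one programming language within a subsystem. */
    public record TechnologyPart(String language,
                                 int linesOfCode,
                                 Map<String, double[]> qualityProfiles) {} // metric -> profile

    /** A high-level module of a system, with general information attached. */
    public record Subsystem(String name,
                            String functionality,
                            String architecture,
                            List<TechnologyPart> parts) {}

    /** A complete software product, divided into subsystems, with system-level ratings. */
    public record SoftwareSystem(String name,
                                 List<Subsystem> subsystems,
                                 Map<String, String> ratings) {} // e.g. "maintainability" -> "+"
}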

Figure 1. Content of the benchmark repository by programming language: Java 44%, C 33%, C++ 15%, other 8%.

General information  At the subsystem level, information is stored about functionality, architecture, development methodology, and development organization. This information is determined by our consultants through document review, stakeholder elicitation, and expert judgment. We employ taxonomies similar to those of ISBSG [4] and Jones [6].

Source code measurements  For each subsystem, source code measurements are stored per language. The measurement values are collected using the software analysis toolkit developed by SIG. For metrics with granularity below the subsystem level (e.g. block, method, class), the values are aggregated into quality profiles as explained above. Additionally, histograms are stored for selected metrics.

Ratings  The ratings at the various levels of our quality model (source code, system properties, ISO/IEC 9126 characteristics) are stored per subsystem and per system. These include ratings directly derived from source code measurements, as well as ratings that involve the expert opinion of the consultants who have assessed the systems.

The database currently contains about 70 systems, with about 160 subsystems. The distribution of the languages present in the benchmark is shown in Figure 1, in terms of percentage of lines of code.

    4. Comparing systems

The purpose of collecting benchmark data into a repository is to enable systematic comparison of (groups of) software products. In this section, we give examples of various kinds of comparisons. Since we cannot disclose measurement data for the systems of our clients, we use a series of open source systems as illustrations. These systems are listed in Table 1, ordered by their volume. For a more interesting comparison, we selected systems with similar functionality.


Name         Functionality   Main PLs   LOC
Jaminid      Web Server      Java         1120
CSQL         DBMS            C/C++       14830
Canopy       Web Server      C           15570
SmallSQL     DBMS            Java        18402
Axion        DBMS            Java        18921
JalistoPlus  DBMS            Java        21943
AOLServer    Web Server      C           45344
HSQLDB       DBMS            Java        65378
SQLite       DBMS            C           69302
H2           DBMS            Java        72102
Tomcat       Web Server      Java       164199
Apache       Web Server      C          203829
602SQL       DBMS            C/C++      274938
Derby        DBMS            Java       307367
Firebird     DBMS            C/C++      357537
PostgreSQL   DBMS            C          497400
MySQL        DBMS            C/C++      843175

Table 1. A selection of OSS web servers and database systems, ordered by volume.

Figure 2. Comparison of OSS and proprietary systems.

Comparison of groups  Based on characteristics such as functionality, architecture, main programming language, or development model, the systems in the repository can be divided into groups. These groups can then be compared with respect to particular quality aspects.

For example, we can compare the group of open source systems of Table 1 against closed source systems. In Figure 2, these two groups are compared with respect to the risk categories of complexity. For each group, the average and standard deviation of the percentage of code in each risk category are plotted. As the graph shows, the proprietary systems perform very similarly to, but slightly better than, the OSS systems. In particular, the average percentage of high and very high risk code is lower (7.60% and 3.49% vs. 10.47% and 6.64%).
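Such a group comparison reduces to computing, per risk category, the mean and standard deviation over the quality profiles of each group. A minimal sketch, assuming profiles are available as vectors of fractions (the class and method names are illustrative):

import java.util.List;

/** Illustrative sketch: mean and standard deviation per risk category for a group of systems. */
public class GroupComparison {

    /** Column-wise mean of a list of quality profiles (each [low, moderate, high, veryHigh]). */
    public static double[] mean(List<double[]> profiles) {
        double[] sum = new double[4];
        for (double[] p : profiles)
            for (int i = 0; i < 4; i++) sum[i] += p[i];
        for (int i = 0; i < 4; i++) sum[i] /= profiles.size();
        return sum;
    }

    /** Column-wise (population) standard deviation of a list of quality profiles. */
    public static double[] stdDev(List<double[]> profiles) {
        double[] m = mean(profiles);
        double[] var = new double[4];
        for (double[] p : profiles)
            for (int i = 0; i < 4; i++) var[i] += (p[i] - m[i]) * (p[i] - m[i]);
        double[] sd = new double[4];
        for (int i = 0; i < 4; i++) sd[i] = Math.sqrt(var[i] / profiles.size());
        return sd;
    }
}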

Comparison of individual systems to group average  Individual systems can be compared to a group of systems. This group can either be the entire database, or a selection of systems with similar functionality, architecture, or other traits.

Figure 3. Comparison of CSQL and Firebird in relation to proprietary systems.

For example, we can focus on the CSQL and Firebird systems and compare them to the group of proprietary systems. In Figure 3 we plot the values of these systems for the three highest complexity risk categories and compare them with the average values from the proprietary systems. As the chart shows, the two systems deviate from the average in different ways. CSQL has more moderate risk code than the proprietary systems, even surpassing the standard deviation, but has less high risk and no very high risk code. Firebird, on the other hand, has moderate risk code within the expected deviation from the average, but clearly oversteps the average values for high and very high risk code.
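One way to express such a deviation numerically, although the paper itself presents it only graphically, is to measure each category's distance from the group mean in units of the group's standard deviation; the helper below is hypothetical and assumes a non-zero standard deviation in every category.

/** Illustrative sketch: position of one system's quality profile relative to a group. */
public class SystemVsGroup {

    /**
     * For each risk category, how many group standard deviations the system's
     * fraction lies above (positive) or below (negative) the group mean.
     */
    public static double[] deviations(double[] systemProfile,
                                      double[] groupMean,
                                      double[] groupStdDev) {
        double[] result = new double[systemProfile.length];
        for (int i = 0; i < systemProfile.length; i++) {
            result[i] = (systemProfile[i] - groupMean[i]) / groupStdDev[i];
        }
        return result;
    }
}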

Comparison of individual systems within a group  The individual systems contained within a given group can all be compared with each other simultaneously. This reveals the amount of variation within the group, as well as the rank position of each system within the group.

Figure 4. Comparison of OSS systems by ranking and plotting quality profiles. From least to most complex: CSQL (+), Axion (+), Tomcat (-), Jaminid (-), Jalisto (-), Derby (-), H2 (-), SmallSQL (--), PostgreSQL (--), MySQL (--), HSQLDB (--), AOLServer (--), Apache (--), Canopy (--), Firebird (--), SQLite (--), 602SQL (--).

For example, the group of OSS systems can be compared with respect to complexity by ordering and plotting their quality profiles. In Figure 4 the systems are ordered by (i) overall complexity score, and (ii) percentage of code in the high risk category. As the chart reveals, CSQL is the least complex of the selected OSS systems, while Firebird ranks third from the bottom.
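The two-level ordering used in Figure 4 could be reproduced with a comparator along the following lines; the record and its fields are illustrative and are not taken from the paper.

import java.util.Comparator;
import java.util.List;

/** Illustrative sketch: order systems within a group as in Figure 4. */
public class RankWithinGroup {

    /** A system with its complexity rating (1 = "--" ... 5 = "++") and quality profile. */
    public record RatedSystem(String name, int complexityRating, double[] profile) {}

    /** Sort by (i) overall complexity rating, best first; (ii) fraction of high-risk code. */
    public static List<RatedSystem> rank(List<RatedSystem> systems) {
        return systems.stream()
                .sorted(Comparator
                        .comparingInt(RatedSystem::complexityRating).reversed()
                        .thenComparingDouble(s -> s.profile()[2])) // index 2 = high risk
                .toList();
    }
}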

Apart from comparing systems, the repository can be used to study metrics and their aggregations. In particular, the stored histograms have been helpful to determine appropriate thresholds for quality profiles, and to select the most relevant measurements for the quality model.

    5. Related work

Productivity benchmarking  Jones [6] provides a treatment of benchmarking of software productivity. The focus is not on the software product, though the functional size of systems in terms of function points, and their technical volume in terms of lines of code, are taken into account.

The International Software Benchmarking Standards Group (ISBSG) [4] collects data about software productivity and disseminates the collected data for benchmarking purposes. Apart from function points and lines of code, no software product measures are taken into account.

Benchmarking open source products  Spinellis [12] compared the source code quality of four open source operating system kernels with respect to their measurement values for a wide range of metrics. For the greater part of these metrics, he applies averaging to aggregate the values from code to system level. He concludes that little difference in quality exists among the studied kernels.

Samoladas et al. [10] measured the maintainability index [9] for successive versions of six projects that were OSS at least at some point during their development. They conclude that, like closed source software, OSS quality tends to deteriorate over time.

Research benchmarks  Demeyer et al. [2] argue for the use of a benchmark of software systems to validate software evolution research. Sim et al. [11] pose a challenge to the software engineering research community to establish a shared benchmark for validating research tools and techniques. Thus, the intent of these benchmarks is to compare research tools and techniques.

    6. Concluding remarks

We have given a brief overview of our benchmark repository and we have illustrated some of its possible uses for comparing (groups of) software systems. The type of information stored in the repository makes it particularly useful for comparing the technical quality of software products.

Future work  The benchmark repository continues to be fed with new systems, both OSS systems and proprietary ones analyzed in our consultancy practice. In the future, we hope to augment the product measurements with productivity data. We also intend to exploit the collected information for scientific study of software product metrics and our quality model.

    References

[1] E. Bouwers and R. Vis. Multidimensional software monitoring applied to ERP. In C. Makris and J. Visser, editors, Proc. 2nd Int. Workshop on Software Quality and Maintainability, ENTCS. Elsevier, 2008. To appear.

[2] S. Demeyer, T. Mens, and M. Wermelinger. Towards a software evolution benchmark. In IWPSE '01: Proc. 4th International Workshop on Principles of Software Evolution, pages 174-177, New York, NY, USA, 2001. ACM.

[3] I. Heitlager, T. Kuipers, and J. Visser. A practical model for measuring maintainability. In 6th Int. Conf. on the Quality of Information and Communications Technology (QUATIC 2007), pages 30-39. IEEE Computer Society, 2007.

[4] International Software Benchmarking Standards Group. www.isbsg.org.

[5] ISO. ISO/IEC 9126-1: Software engineering - product quality - part 1: Quality model, 2001.

[6] C. Jones. Software Assessments, Benchmarks, and Best Practices. Addison-Wesley, 2000.

[7] T. Kuipers and J. Visser. A tool-based methodology for software portfolio monitoring. In M. Piattini and M. Serrano, editors, Proc. 1st Int. Workshop on Software Audit and Metrics (SAM 2004), pages 118-128. INSTICC Press, 2004.

[8] T. Kuipers, J. Visser, and G. de Vries. Monitoring the quality of outsourced software. In J. van Hillegersberg et al., editors, Proc. Int. Workshop on Tools for Managing Globally Distributed Software Development (TOMAG 2007). Center for Telematics and Information Technology (CTIT), The Netherlands, 2007.

[9] P. W. Oman and J. R. Hagemeister. Construction and testing of polynomials predicting software maintainability. Journal of Systems and Software, 24(3):251-266, 1994.

[10] I. Samoladas, I. Stamelos, L. Angelis, and A. Oikonomou. Open source software development should strive for even greater code maintainability. Commun. ACM, 47(10):83-87, 2004.

[11] S. E. Sim, S. Easterbrook, and R. C. Holt. Using benchmarking to advance research: a challenge to software engineering. In ICSE '03: Proceedings of the 25th International Conference on Software Engineering, pages 74-83, Washington, DC, USA, 2003. IEEE Computer Society.

[12] D. Spinellis. A tale of four kernels. In W. Schäfer, M. B. Dwyer, and V. Gruhn, editors, ICSE '08: Proceedings of the 30th International Conference on Software Engineering, pages 381-390, New York, May 2008. ACM.

[13] A. van Deursen and T. Kuipers. Source-based software risk assessment. In ICSM '03: Proc. Int. Conference on Software Maintenance, page 385. IEEE Computer Society, 2003.