reengineering of configurations based on mathematical ... · additional key words and phrases:...

Reengineering of Configurations Based onMathematical Concept Analysis

GREGOR SNELTING

Technische Universitat Braunschweig

We apply mathematical concept analysis to the problem of reengineering configurations.Concept analysis will reconstruct a taxonomy of concepts from a relation between objects andattributes. We use concept analysis to infer configuration structures from existing source code.Our tool NORA/RECS will accept source code, where configuration-specific code pieces arecontrolled by the preprocessor. The algorithm will compute a so-called concept lattice,which —when visually displayed — offers remarkable insight into the structure and propertiesof possible configurations. The lattice not only displays tine-grained dependencies betweenconfigurations, but also visualizes the overall quality of configuration structures according tosoftware engineering principles. In a second step, interferences between configurations can beanalyzed in order to restructure or simplify configurations. Interferences showing up in thelattice indicate high coupling and low cohesion between configuration concepts. Source filescan then be simplified according to the lattice structure. Finally, we show how governingexpressions can be simplified by utilizing an isomorphism theorem of mathematical conceptanalysis.

Categories and Subject Descriptors: D.2.6 [Software Engineering]: Interactive Program-ming Environments; D.2.7 [Software Engineering]: Distribution and Maintenance —restruc-turing; uersion control; D.2.9 [Software Engineering: Software Configuration Management

General Terms: Design, Management, Theory

Additional Key Words and Phrases: Concept analysis, concept lattices

1. INTRODUCTION

In his invited talk at the 16th International Conference on SoftwareEngineering, David Parnas said “When a large and important family ofproducts gets out of control, a major effort to restructure it is appropriate.The first step must be to reduce the size of the program family. One mustexamine the various versions to determine why and how they differ”

A preliminary version of parts of this article appeared in the Proceedings of the 16thInternational Conference on Software Engineering, May 1994.Author’s address: Abteilung Softwaretechnologie, Technische Universitat Braunschweig,Gau13strasse17, D-38106 Braunschweig, Germany; email: [email protected]. de.Permission to make digitallhard copy of part or all of this work for personal or classroom useis granted without fee provided that the copies are not made or distributed for profit orcommercial advantage, the copyright notice, the title of the publication, and its date appear,and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, torepublish, to post on servers, or to redistribute to lists, requires prior specific permissionandIor a fee.@ 1996 ACM 1049-331W96/0400-0146 $03.50

ACM Transactionson SoftwareEngineeringandMethodology,Vol.5, No.2, April 1996,Pages146-189

Reengineering of Configurations Based on Mathematical Concept Analysis . 147

IParnas 19941. However no method for reengineering program families wasas yet available, let alone tool support for restructuring.

At the same conference, we presented a first step toward a theory andtools for configuration restructuring [Krone and Snelting 1994]. Based onmathematical concept analysis [Wine 1982], we have shown how configura-tion structures can be inferred from existing source code and how interfer-ences between configurations can be detected.

In this article, we describe in detail how to extract configuration struc-tures from existing source code and how to interpret the obtained struc-tures. Source files must adhere to the paradigm “version selection by#ifdef,” The structure of the configuration space is given in the form of aconcept lattice, which is computed from the relation between code piecesand their governing expressions. When visually displayed, the lattice offersremarkable insight into properties of configurations and into relationshipsbetween configurations.

We then present an algorithm for detecting interferences between config-urations. An interference means that two configurations have common codewhere they should not. Based on the lattice structure and interferenceanalysis, source file simplification can be done by “amputating” parts of theconfiguration space. Finally we show how an isomorphism theorem fromconcept analysis makes it possible to simplify governing expressions withrespect to inferred properties of the configuration space.

Before we begin to explain our reengineering tool NORA/RECS in detail,we would like to give an overview; this might also serve as an extendedabstract for hurried readers. The overview will contain several informaldefinitions, for which the exact formal definitions are provided later in thetext.

1.1 Configuration Reengineering

Software configuration management is the discipline of organizing andcontrolling the evolution of software systems [Tichy 19881. A configurationof a software system is a collection of elements (software components, codepieces, modules, ..) which fulfill a particular purpose. Typically, a config-uration meets the needs of a particular client or platform. Therefore,configuration management must—among other tasks— be able to identifysoftware components or code pieces which have certain features and tobuild a complete system from selected components. Several sophisticatedconfiguration management systems have been developed recently—for ex-ample, Adele [Estublier and Casallas 1994], Shape [Mahler 19941, andClearcase [Leblang 1994].

This article, however, is concerned with reengineering of configurationstructures from existing source code. Therefore we do not make assump-tions about the underlying configuration management model, We justassume that we are given a set of software objects, as well as a set offeatures or attributes, where each configuration is characterized by a set ofattributes. We do not care whether objects are syntactically code pieces,

ACM Transactions on Software Engineering and Methodology. Vol 5, No 2, April 1996.

148 ● Gregor Snelting

components, or modules; whether versions of objects are revisions orvariants; and what architecture the system in question might have. We donot make any assumptions about the structure of the underlying set ofattributes either; on the contrary, discovering such a structure (if it exists)is part of the reengineering process.

The only basic assumption we make is that any configuration consists ofobjects and is selected by a set of attributes. For the time being, weconsider only simple attributes with a dual defined/undefined semantics;later we will see how more complicated attributes can be subsumed by thebasic model. Therefore, the process of version selection can abstractly bedescribed by a configuration function as follows:

Definition 1.1.1. Let O be a set of software objects. Let A be a set ofattributes. A configuration function is a mapping K : 2A -+ 2°. The set ofall possible configurations K(2A) c 220 is called the configuration space,and for any attribute set V c A, the specific object set P = M V) is called aconfiguration selected by V.

This definition does not say how K is determined, and indeed for theconfiguration management systems mentioned above, K will take a veryspecific form. Generally, the approach described in this article workswhenever version selection can abstractly be described by a configurationfunction.

Definition 1.1.2. Let X, Y G A. We say X implies Y if K(X) G K(Y). Xand Y are called interfering, if K(X) n K(Y) # 0.

The notions of configuration implication and interference will be more fullyexplained later. We can now specify the task of configuration reengineeringas tackled in this article: given a configuration function, determine allconfigurations (that is, all K(X) for X c A) and all implications orinterferences between them. Furthermore, visualize the overall structure ofthe configuration space. 1

1.2 Configuration Management by Preprocessing

In this article, we restrict ourselves to a simple and widely used versionselection system —namely, the C preprocessor (CPP). This restriction isintroduced for pragmatic reasons. A lot of code sticking to “configurationmanagement by preprocessing” is around. For example, both the X-Windowsystem and the GNU software (gee, g++, flex, bison, rcs, etc. ) use CPP forconfiguration building; thus there is enough raw material for reengineer-ing. Adapting our approach to more modern configuration managementsystems is possible whenever they can be modeled by configuration func-tions –this would require development of a new front end.

Configuration management by preprocessing is very simple. Con@ura-

lThe latter task is usually called reverse engineering —we do not distinguish betweenreengineering and reverse engineering of configurations.

ACM Transactions on Software Engineering and Methodology, Vol. 5, No. 2, April 1996.


tion-dependent source code pieces are enclosed in #if . . . #endif brackets:

#if E

#endif

E is a so-called go~erning expression, that is, a boolean expression whichmay in particular contain atomic formulas like defined(X) or defined(Y) forpreprocessor symbols X, Y, The often-used #ifdef X . . . #endif is a short cutfor #if defined(X) . . #endif.2 CPP also offers #else, #elif E, and #ifndef Xconstructs. Governing expressions may contain negations, conjunctions,and disjunctions, and #ifdef may be nested.

When starting the compiler, CPP symbols maybe defined (e.g., cc -Dultrix

prog.c ). CPP will evaluate all governing expressions and will include a codepiece only if its governing expression evaluates to true. Code pieces gov-erned by a nested #if are selected only if the surrounding code is selected aswell, Thus by defining CPP symbols a configuration is determined, and theappropriate code pieces are selected and compiled.

There are two basic methods for using preprocessor symbols. The firstmethod introduces a CPP symbol for every target configuration (e.g., AIX,SUN4, ULTRIX); this symbol must be defined if compiling for a specifictarget. Code common to several target configurations is enclosed in adisjunction of CPP symbols:

#if defined(SUN) II defined(ULTRIX) II defined(AIX)

#endif

The second method uses one CPP symbol for each feature of the target

configuration (e. g., HAS_NFS, BSD, HAS_ BCOPYI; code requiring certainfeatures is enclosed in a conjunction of CPP symbols:

#if defined(HAS_BCC)PY) && defined(HAS_NFS)

,..

#endif

If the #ifdef/#endif enclosing a code piece o contains a CPP symbol a, we sayeither o depends on a, or o is governed by a. In the first example, the codepiece is governed by SUN, ULTRIX, AIX, whereas in the second example thecode piece is governed by HAS_ BCOPY and HAS_NFS.

CPP “#ifdef” statements may contain complicated boolean expressions;thus existing programs do not stick to the two basic CPP schemes. As anexample, consider some code pieces from the X-Window tool “xload”; thistool displays various machine load factors (Figure 1). The 724-line programis quite platform dependent: 43 preprocessor symbols are used to control avariety of configurations (e. g., SYSV, macll, ultrix, sun, CRAY, sony). Codepieces not only depend on simple preprocessor symbols, but on arbitraryboolean combinations of such symbols. Furthermore, “#ifdef” and “#define”

‘The current (’PP standard treats X as equivalent to defined (X) & & X != O

ACM Transactions an Software Engineering and Methodology, Vol 5, NO 2. April 1996


#if ( !defined(SVR4) I I ldefined(_STDC_)) && !defined(sgi) && lde -fined (MOTOROLA)

extern void nlisto;#endif#i fdef A1XV3

knlist( namelist, 1, sizeof(struct nlist));#else

nlist( KERNEL_FILE, namelist) ;#endif#ifdef hcx

if (namelist[LOADAV] .n_type ‘- O &&#else

if (namelist[LOADAV] .n_type == O I I#endif /* hcx ●/

namelist[LOADAV] .n_value ‘= O) [xl Oad_error(Wcannot get name list from”, KERNEL_FILE);exit(-l);1

loadavg_seek - namelist[LDADAV] .n_value;#if defined(umips) && defined(SYSTYPE_SYSV)

loadavg_seek &= 0x7fffffff;#endif /* umips && SYSTYPE_SYSV +/#if (de fined (CRAY) && de fined (SYSINFO))

loadavg_seek +- ((char ●) (((struct sysinfo ●)NULL)->aVenKUn)) - ((char●) NULL);#endif /* CRAY && SYSINFO */

kmem = open(KMEf._FILE, O_RDONLY);if (kmem < O) xload_error (” cannot open” , KMEM_FILE);

#endif

Fig. 1, X-Window tool’’xload.c, ”

statements are nested, resultingin a rather incomprehensible source text.Even experienced pro~ammers-will have difflculti& to obtain some insightinto the configuration structure, and when a new configuration variant isto be covered, the introductionof errors is very likely.

Nevertheless, CPP adheres to the general configuratio nscheme describedabove: the objects are code pieces in a source file; the attributes are theCPPsymbols; and the configuration function isjust CPPitself. Given asetof defined CPP symbols, CPP will select the corresponding code piecesand—since there is atotal order, namely, the textual order defined for codepiecesin CPPfiles-automatically builds the corresponding configuration.As we will see later, even negated CPP symbols, disjunctions, etc. can beinterpreted as attributes; thus the notion of a configuration function ispowerful enough to describe the full CPP semantics.

Throughout the rest of this article, we stick to the equations object = codepiece (in a CPP source file) and attribute = (simple or derived) CPP symbol.Code pieces are usually identified by line number intervals, but in severalexamples we will use reman numerals as placeholders for code pieces.

1.3 Configuration Tables

The CPP configuration function K = KCPP can be represented by atwo-dimensional boolean array, the so-called configuration table. The table

ACMTransactions on Softwsre Engineering and Methodology, Vol. 5, No. 2, April 1996.


I.#i fdef DOS

E

DOS 0S2 x_wkl

II.#endlf

!tifdef 0s2II x

.111 Ill x

Uendlf

#lf de flned(DOS) $.L de fined (X_win)Iv x x

Iv v x#end> !

#~fdef X_wlr, VI

v

*end] !

VI

Fig. 2. A small code fragment and its configuration table

T = T~,,,,i, is indexed with objects and attributes, and T[o, a ] = trueindicates that the object o depends on attribute a: any V c A selecting omust contain a.

Definition 1.3.1. Let K be a configuration function. The configurationtable T~ : 0 XA ~Bisdefined by T~[o, a] = true ~b’V~A : 0 ~K(V) >a E V.

Figure 2 presents a very small source text and its configuration table.Section 3 will describe how configuration tables are constructed fromarbitrary CPP source files.

Conversely, any configuration table T determines a configuration func-tion: for o c O, let a(o) = {a E AI TIo, al = true}. Then K7(V) = {o EO IU(o ) c ~. This means that an object is selected by V if all governingattributes are in V. For K = Kct,P, it is easy to show that K~A( V) = K(V)(note that this identity captures the CPP semantics).

Thus for every configuration function K a configuration table T~ can bedefined, and every configuration table T induces a configuration functionK~. But K = K~~ does not always hold. Therefore, not every configurationfunction can correctly be represented by a configuration table. Indeed,there are only 2 ‘) 1A’ = 210 I4 configuration tables, but there are(210 i)( 21A’ ) configuration functions (many of them pathological). Configura-tion functions defined by configuration tables enjoy special properties, e.g.,monotony: X c Y > K(X) c K(Y).

For Adele, Shape, and Clearcase, it is possible to describe their versionselection mechanism by configuration tables. Adele, for example, is basedon software components which have several attributes of the form name =value; version selection is done with a boolean expression over suchattributes, and a component is selected if the selection expression evaluatesto true for the component’s attribute values. Many-valued attributes can bedescribed by so-called scaled configuration tables (which have one columnfor every value of an attribute), and complex selection expressions can betreated with the same techniques as complex governing expressions (seeSection 3).

For strange configuration functions K, the notion of a configuration tablecapturing dependencies of code pieces on attributes might not be adequate.

ACMTransactIonson SoftwareEngineeringand Methodology, V(I1 5, NO 2. April 1996.


Nevertheless, a relation between objects and attributes can always beestablished: TKIo, a] ~ 3V G A : a E V A o E K(V). This can be read as“a influences o” and explains why our approach can be used for quitedifferent configuration management models. Using scaled configurationtables, it is even possible to handle structured object or attribute spaces.But note that “influence” is weaker than “dependency”; thus for pathologi-cal configuration functions which cannot be coded as configuration tables,the information extracted from the “influences” relation is not as precise asfor CPP source files. For the rest of this article, we ignore pathologicalconfiguration tables.

1.4 Concept Lattices

We have seen that it is difllcult to understand source files like “xload.c.”Fortunately, there is a method, called formal concept analysis, which allowsfor the reconstruction of semantic structures from raw data as given in ourcase. Formal concept analysis can be used whenever a relation betweencertain objects and attributes is given. The basic idea goes back to G.Birkhoff, who observed in 1940 that a complete lattice can be associatedwith every binary relation, which offers remarkable insight into the struc-ture of the relation [Birkhoff 1983]. The lattice and its underlying relationcan be reconstructed from each other; hence the method is similar in spiritto Fourier analysis.

Later, the universal algebra group at the Technical University of Darm-stadt elaborated Birkhoff’s basic result. Today, there is not only a sophis-ticated mathematical theory, but also several extensions of concept analy-sis which can be applied to more complicated problem domains. Conceptanalysis has been applied to problems such as classification of finitelattices, analysis of Rembrandt’s paintings, or behavior of drug addicts. Itcan also be used as a knowledge acquisition mechanism, and the structuretheory of concept lattices provides for even more powerful analysis meth-ods.3

In this introduction we will only explain some basic notions. A concept isa pair, consisting of a set of objects and a set of attributes, such that allobjects have all attributes, and all attributes fit to all objects. Suchconcepts represent semantic properties of the underlying problem domain.The concepts form a complete lattice; hence the lattice structure imposes a

partial order on concepts (more specific versus more general), and for twoconcepts, supremum and infimum exist.

In our case, objects are code pieces; attributes are CPP symbols; and theobject-attribute relation is given by the configuration table. Concepts thencorrespond to (partial) configurations and are computed by our tool NOMRECS. In particular,

‘The characterization of (mental and mathematical) concepts has also been studied by otherauthors (e.g., see van Mechelen et al. [1993]), but without the lattice-theoretic underpinningswhich give formal concept analysis its power.


Reengineering of Configurations Based on Mathematical Concept Analysis ● 153

. ----- Elrma

Fi10 AxOn Opums wIatimg graphplace done

Imm --”

IFlle Editmla’mSuffenSam31CmlMcl: 1-1,14-14

l=C2: XJm / 12-12 C4: m / 3-3

Eximlt

“/C43-3

9-9sFig. 3. Concept lattice for Figure 2.

–for each configuration, its extent (the code pieces which make up theconfiguration) and intent (the attributes which govern the configuration )are computed;

—all implications between configurations are computed, where an implica-tion is of the form “Any code piece valid for the sun configuration is validfor the ultrix and sony configuration as well”;

–by computing a lattice of configuration concepts, a taxonomy of configu-

rations is determined;

—interferences between configurations are displayed, where an interferencebetween configurations means that they have common code where theyshould not;

–the overall qua~ity of the configuration structure can be judged accordingto software engineering principles.

The concept lattice for the code in Figure 2 is presented in Figure 3. This

simple lattice already demonstrates basic principles. The lattice elements

are named (Cl, C2, , C6) and are labeled with code pieces or CPPsymbols (or both ). Code pieces are given in form of line number intervals inthe source code: the source file is displayed in the emacs window. If a

concept labeled with code piece o is below or equal a concept labeled withCPP symbol a, then o depends on a. For example, line 9 is governed by

DOS, as C5 (which is labeled with source line interval 9-9) is below C4

(which is labeled DOS). Line 3 is also governed by DOS, as C4 is alsolabeled with interval 3–3. As C5 s C2, line 9 is also governed by X_win.

In fact, a source code piece occurring as label of a concept c is governedby all CPP symbols appearing in concepts abol’e c. These are called theintent of c. By clicking on c, its intent can be displayed. Indeed, the intentbox of C5 shows that line 9 is governed by DOS and X_win (and no other

ACMTransactionson SoftwareEngineeringand Methodology,V(,I 5. No 2, April 1996.


CPP symbol). Complementarily, a CPP symbol occurring as a label of cgoverns all code pieces appearing in concepts below c. These are called theextent of c. The extent box for C4 shows that DOS governs lines 3 and 9(and no other).

The code piece labels of the top element present the code which is notgoverned by any CPP symbol (in the example, lines 1 and 14), while thesymbol labels of the bottom element present those CPP symbols which donot govern anything (none in the example). The extent of the top elementconsists of all code pieces, while the intent of the bottom element consists ofall CPP symbols in the source code. Code governed by more than one CPPsymbol shows up as an infimum in the lattice, while CPP symbols govern-ing more than one code piece show up as a supremum.

CPP symbols which are concerned with different configuration aspects

(e.g., window system variants versus operating system variants) are calledorthogonal, and code which is governed by orthogonal symbols indicates aninterference. In the example, line 9 is governed by both DOS and X_win,indicating that window system and operating system aspects are notclearly separated. Adopting traditional terminology, we call this phenome-non strong coupling between orthogonal configuration aspects. Note thatonly a human can decide whether code shared by various configurationsreveals a harmful interference or a beneficial reuse. Section 4 will discussinterferences and coupling in depth.

1.5 Source Code Simplification

For very chaotic configuration spaces, restructuring is appropriate, andlater we will see how concept analysis can help in restructuring. The firststep is perhaps to reduce the size of the program family (called “amputa-tion” by Parnas [1994]). In a second step a restructuring of the source codein order to obtain more cohesive modules is appropriate. The concept latticeprovides very good insight into the possibility and effect of an amputation.We will see later that amputation can be implemented through partialevaluation of CPP files. Under the assumption that certain CPP symbolsare (or are not) defined, while for others this information is missing,governing expressions can be simplified, and perhaps some code piecesdisappear completely.

Amputation, or source file simplification, can be used if certain configu-rations are no longer needed. Furthermore, special “problematic” versionscan be generated from interferences; this allows for a more detailedinspection of configuration problems on the source code level.

Lattice theory also offers a clever way for simplification of governingexpressions. The lattice is generated by the so-called irreducible elementsalone. These irreducible elements can be used to generate new governingCPP symbols, which lead to simpler governing expressions. The method ismuch more powerful than just boolean simplification, as it takes lattice-specific information into account.

ACM Transactions on Software Engineering and Methodology, Vol. 5, No. 2, April 1996


2. BASIC NOTIONS OF CONCEPT ANALYSIS

2.1 The Concept Lattice

Formal concept analysis starts with a triple c = (0, A, P), called a( formal ) context, where O is a finite set (the so-called objects); A is a finite

set (the so-called attributes ); and P is a relation between O and A; hence

P c O x A, If (~~,cr) G P, we say object o has attribute a. In our case, the

objects are source code pieces; the attributes are governing preprocessor

symbols; and the relation is called a configuration table.For a set of objects X ~ O, we define the set of common attributes

(T(X) : – {a ~ A Vo E X : (o, a ) E P}. Similarly, for a set of attributes Yc A the common objects are defined by T(Y) := {o G O IVu E Y : (o, a )E P}. The mappings (r : 2°- 2A and ~ : 2A -2° form a Galois connection

and can be characterized by the following conditions: for X, Xl, X2 ~ O; Y,Yl, Yzg A

Xl Q Xz + (J(XY) Q CT(X1) and YI c Yz > 7(Y2) c dYl);

that is. both mappings are antimonotone;

X c T(w(X)) and u(X) = o(T(cJ(X)))

as well as

that is, both mappings are extensilje. In particular the common objects ofthe common attributes of an object set are a superset of this object set, andtheir common attributes are equal; this means that both u o T and r D u areclosure operators. For an index set 1 and Xi G O, Y, L A

A (formal) concept is a pair (X, Y), where X c O, Y c A, Y = I~(X), andX = T( Y). Hence, a concept is characterized by a set of objects (called itsextent I and a set of attributes (called its intent ) such that ~1 ) all objectshave all attributes and (2) all attributes fit to all objects. The set of allconcepts is denoted by B( O, A, P). Intuitively, a concept is a maximal filledrectangle in a table like Figure 4, where permutations of lines or columnsdo not matter.

A concept (Xl, Y,) is a subconcept of another concept (Xz, Y2) if Xl c Xz(or, equivalently, Y, ~ Yz I. It is easy to see that this definition imposes apartial order on B(O, A, P); thus we write (Xl, Y1 ) s (X2, Y2 ). Moreover,B(O, A, P} = (B(O, A, P), S) is a complete lattice.

ACN1Transactions on Software Engineering and Methodology, Vol. 5. No. 2, April 1996

156 . Gregor Snelting

SYSV SYSVW mad i3S4 unfix sun AIX CRAY apollo wmy sequent Sllient

1-1o x x x x x x x11-20 x x x x x x x x21-28 x x x x29-40 x x x x x x41-100 x x x101-106 x x x x x x107-115 x x x116-125 x x x126-200 x x x x x201-207] X x x x I x x

1 I

386

06

Fig. 4. A configuration table and its concept lattice.

BASIC THEOREM FOR CONCEPT LATTICES. Let C = (O, A, P) be a context.Then B(O, A, P) is a complete lattice, called the concept lattice of C, forwhich in fimum and supremum are given by

and

This theorem says that in order to compute the infimum (greatest commonsubconcept) of two concepts, their extents must be intersected and their

ACM Transactions on Software Engineering and Methodology, Vol. 5, No. 2, April 1996,


intents joined; the latter set of attributes must then be enlarged in order tofit to the object set of the infimum. Analogously, the supremum (smallestcommon superconcept) of two concepts is computed by intersecting theattributes and joining the objects.

The lattice structure allows for a labeling of the concepts: a concept islabeled with an object, if it is the smallest concept in the lattice subsumingthat object; a concept is labeled with an attribute, if it is the largest conceptsubsuming that attribute. The concept labeled with object o, respectively,the concept labeled with attribute a, is

Y(O) = (T(a({o})), u({o})) and v(a) = (T({a}), d7({a}) )).

The attribute labels of a concept c are written a(c), and the object labels ofc are written W(C). Utilizing this labeling, the extent of c can be obtained bycollecting all objects which appear as labels on concepts below c, and theintent of c is obtained by collecting all attributes which appear above c:

ext(c) = U co(c’) and int(c) = IJ cr(cf).c:< c ~c

For any two attribute sets A and B we say “A implies B“ (written A > B) if

T(A) c T(B) (or equivalently, if B c U(7(A))). This can be read as “anyobject having alI attributes in A also has all attributes in B.” IfA and 23 areintents of concepts C = (7(A), A) and D = (7(B), B), and C s D, thenA > B obviously holds. For the set of all implications, a minimal andcomplete basis can be constructed, which means that any implication canbe deduced from the basis, but that this property is lost if any basisimplication is removed [Duquenne 1987].

The concept lattice can be considered as a graph, that is, a relation. Whathappens if we again apply concept analysis to this derived relation? It turnsout that the concept lattice reproduces itself [Wine 1982]. Thus concepts do not“breed” new concepts; there is no proliferation of virtual information.

The basic theorem was already discovered by Birkhoff [1940]. Later,Wine and Ganter [1993] expanded Birkhoff’s result. Wine gave character-izations of all concept lattices, developed their structure theory, andinvented scaled contexts, a method for handling nonflat attribute spaces.Ganter, besides other contributions, developed efficient algorithms (seeSection 2.3). The interested reader should consult Daveys and Priestley[1990], which contains a chapter on elementary concept analysis. Thosewho are interested in more advanced theory might wish to consult Wine’slecture notes [Wine and Ganter 19931 or Ganter’s advanced introduction[Ganter 19951.

2.2 Interpretation of Concept Lattices

Figure 4 presents a (fictitious) example of a configuration table, wheresource code pieces are given in form of line number intervals. A cross in thetable for code piece o and preprocessor symbol a means that o will only be

ACM Transactions on Software Engineering and Methodology, Vol. 5, No. 2. April 1996


included in a configuration if a is defined. For the time being, we assumethat only simple governing expressions of the form “if defined(x) &&defined(Y) && . . .“ are used. This simplification will be dropped in Section3.

In the corresponding concept lattice, a configuration concept is a subcon-cept of another concept, if it has a smaller extent (i.e., the configuration hasfewer code pieces), or equivalently, a larger intent (i.e., more governingsymbols). Hence, going down in the lattice, we obtain more precise informa-tion about smaller object sets. The lattice elements are labeled with codepieces or CPP symbols (or both). If an element labeled with code piece o isbelow or equal to an element labeled with CPP symbol a (that is, Y(O) sp(a)), then o depends on a.

As an example, consider the concept labeled CRAY, which is in fact theconcept K(CRAY) = ({11-20, 21–28, 29–40, 201-207}, {CRAY, apollo, macll,SYSV}). Indeed, Figure 4 reveals that this concept is a rectangle in theconfiguration table. It reveals a simple fact about the configuration space—namely, that lines 11–20, 2 1–28, 29– 40, 20 1–207 are exactly those whichare governed by CRAY, apollo, macll, SYSV— and vice versa. The conceptlabeled apollo stands for ~(apollo) = ({11-20, 21-28, 29-40, 126-200,201 -207}, {apollo, macl 1, SYSV}), which again is a rectangle in the configu-ration table, higher but leaner than the first one: p(CRAY) s p(apollo).Thus, the CRAY configuration comprises lines 11–20, 21-28, 29-40, 201-207 (and no other), but these lines appear in the apollo configuration aswell.

This example already demonstrates one possible interpretation of aconcept lattice: it can be seen as a hierarchical conceptual clustering ofobjects. Objects are grouped into sets, and the lattice structure imposes ataxonomy on these object sets. The original table can always be recon-structed from the lattice (e.g., the column for i386 has entries for all objectsbelow concept p(i386)–namely, 1-10, 101-1 06– whereas the row labeled41-100 has entries for all attributes above–namely, sun, SYSV, and ultrix).Hence, a context table (i.e., relation) and its concept lattice are analogousto a function and its Fourier transform (which also can be reconstructedfrom each other): concept analysis is similar in spirit to spectral analysis ofcontinuous signals.

The infimum of two concepts (C, V) and (D, W) says which preprocessorsymbols govern the intersection of the extents: (C, V) A (D, W) = (C fl D,u(~(V U W))). Since X c u(7(X)), this symbol set can be larger than justthe ‘intuitive” V U W. In Figure 4, p(sony) A p(ultrix) = ({1-1 O, 11-20,29-40, 116-125, 101-106}, {sun, sony}) A ({1-10, 11-20, 41-100, 101-106,126–200, 201-207}, {sun, ultrix]) = ({l–1 O, 11-20, 101–1 06}, {sun, sony, ultrix,AlX}) = /.L(AIX).

The supremum says which code pieces are governed by the intersection ofthe intents: (C, V) v (D, W) = (?(a(C! U D)), V n W); this can be morecode than just C U D.

If we want to know what an apollo and an ultrix configuration have incommon, we look at the infimum in the lattice, which is labeled 126-200;


Reengmeering of Configurations Based on Mathematical Concept Analysis . 159

going down we see that lines 126–200, 201–207, and 11–20 appear in bothconfigurations. On the other hand, if we want to see which preprocessorsymbols govern both lines 126–200 and 101–106, we look at the supremumof the corresponding concepts, which is ultrix; going up, we see that the sunand the ultrix configurations (and no other) will include both code pieces.

Upward arcs in the lattice diagram can be interpreted as implications: ‘Ifa code piece appears in the sony or ultrix configuration, it will appear in thesun con figuration as well,” or equivalently “If sun is undefined, definingsony or ultrix has no effect. ” Such knowledge is not easily extracted by handfrom a source file like “xload.c”! This example demonstrates the secondmain possible interpretation of a concept lattice: it represents all implica-tions (or dependencies ) between sets of attributes.

How can we use the lattice to determine which code pieces will actuallybe included in a configuration, if a certain set S of preprocessor symbols isdefined? A code piece o is included if all governing symbols are defined. Thegoverning symbols of o are u({o}) = int(y(o)) = Uczy, c,, a(c); these arejust all attribute labels above y(o). Hence the code pieces included aregiven by the configuration function K : 2A * 2°, where K(S) = {o E O Iu({o} ) c S}. All code pieces in K(S) must be above A,,eL~ p(s), as any codepiece further down in the lattice must depend on preprocessor symbols notinS.

One can imagine K(S) as a set of downward paths or “strings” starting fromthe top element; the end of a string is labeled with a selected code piece, andall “nodes” along the string are labeled with CPP symbols in S. The con@-ura-tion function can more easily be computed from the original configurationtable; the lattice representation is intended for other purposes.

2.3 Construction of the Concept Lattice

In order to give the reader an idea of how a concept lattice is constructedfrom a formal context, we describe a simple construction algorithm. The

concept lattice can be constructed either top-down or bottom-up; we de-

scribe the bottom-up version. The algorithm utilizes the fact that, for aconcept (X, Y),

[]y= u(x) = o u {O} = f) d{O}).

OEX OEX

The smallest element is (T(u(LI)), u(fl)). Hence one can start by firstcomputing all the U({o}), which constitute the atoms of the lattice. For anyo = O, this can be done by a simple loop over A with time complexity0( IA I). The other elements are then obtained as suprema of alreadycomputed ones. Due to the basic theorem this can be done by intersectingthe attribute sets of any two elements already constructed, which can againbe done in time 0( IA I). The extent of a lattice element is obtained byapplying T, which needs two nested loops and has time complexity0(1 Ol”jAl).



A hash table is used to store the lattice elements and check whether a

newly constructed element is already in the lattice. Furthermore, one hasto keep track of all pairs of elements to be considered for supremumcomputation. This is clone with a FIFO queue: initially, the queue containsall pairs of atomic elements; the next supremum to be determined is givenby the first element of the queue, and pairs of concepts [c ~, c ~1, where atleast one is newly generated and not c ~ s C2 or C2 s c1, are appended to

the end of the queue.The overall complexity depends on the number of lattice elements. The

largest lattices for contexts of size n X n have 2“ elements; these lattices

are isomorphic to finite boolean algebras freely generated by n elements.Thus the worst-case running time of the construction algorithm is exponen-tial in n. In practice, however, the concept lattice has typically 0( n 2, oreven O(rz ) elements rather than 0( 2“ ), resulting in a typical running timeof 0( n 3,; this makes the method feasible for reasonably large contexts.

Ganter [19871 has found a more efficient algorithm which avoids trackingthe elements and suprema, but is more difficult to understand. Thisalgorithm is used in the Darmstadt implementation of concept analysis. It

has recently been reimplemented for NORA. Lindig [1995] presents some

empirical data on this implementation (Figure 5), which have been ob-tained from random contexts. The first graph shows the time for latticeconstruction as a function of the number of objects; for 1000 objects (codepieces) the analysis needs 100 seconds on a SUN ELC. The second pictureshows the number of concepts as a function of object and attribute cardinal-ity; it shows that this number can indeed grow exponentially. Fortunately,for UNIX source files, we found the number of configuration concepts to be

much lower.

3. FROM SOURCE CODE TO CONFIGURATION LAl_HCES

The tool NORA/RECS for restructuring of configurations accepts sourcecode as input and produces a graphical display of the concept lattice asintermediate representation. Reengineering is then done by analyzinginterferences and sublattices. Source code simplification corresponds to

selection of a specific sublattice; upon the restructurer’s request, the sourcefile is transformed accordingly. The source language is arbitrary, but theinput file must adhere to the conventions of the C preprocessor. NORA/RECS consists of the following phases:

(1) front end: the front end separates code pieces and preprocessor state-ments, syntactically analyzes the latter, and constructs a configurationtable according to the rules described below;

(2) kernel: the kernel reads a configuration table and computes the corre-sponding concept lattice;

(3) visualization: this accepts a description of the concept lattice andproduces a graphical display;

(4) interaction: the lattice can be analyzed, and sublattices can be selected;

ACM ‘1’ransactiona on Software Engineering and Methodology, Vol. 5, No. 2, April 1996.


I03MI-10atlrtxitae—4oanraxJlaa----

160aWtsila4 . .4eomaa-

1Ow920attrlbut~----.. --.:..

. .. .... ........-. ----::: .....-. --- .-..::.-..:

. .------ ......------ .. .... .

.. --”103

g

2.s10 -

1

0.10 200 400 600 Soo 1(K)O 1200 14&l 16G0 1Mo 20M

oblsctswith5 attributeseech

concepts

8000-7000 -

6000 -5000 -40003000 -2000 -1000 -

0 -

61500

,““”

objectswith 5 attributeseach 500500

attributes

Fig-. 5. Time complexity and lattice size for random contexts

(5 ) back end (optional): the source code is simplified according to certainlattice properties.

As usual, NORMRECS is invoked as a UNIX command with the source file

name as a parameter; additional options which control some displayparameters may be added. This section describes the first two phases.

ACM TransactIons on Sofiware Engineenng and Methodology, Vol. 5. Nm 2, April 1996.


3.1 Construction of the Configuration Table

A configuration table describes how code pieces depend on preprocessor

symbols. Configuration tables are used as input to formal concept analysis.According to the basic model of a configuration function, configurationsonly depend on positive attributes. Therefore, complex governing expressionsmust be transformed into sets of simple attributes. The reader should beaware that the often-used notion of a ‘feature” of a configuration is notidentical to the term “attribute” in our setting. A feature may be a complexproperty, while an attribute describes a simple, positive fact. For example, a

configuration can have the feature that it is governed by CPP symbol X,

whether by including or excluding code pieces of the original source file. As we

will see, this is described by two attributes: one has the meaning “it is truethat X is defined,” while the other means “it is true that X is not defined.”

We will now describe how to construct configuration tables from sourcefiles like “xload.c”; due to complex CPP expressions and nested “#ifdef”statements, this process is not trivial. After construction, every complexgoverning expression has been “compiled” to a set of simple, positive

attributes. In the following semiformal construction rules, A, B, C denotepreprocessor symbols, and p-p, n-n, q-q denote code pieces.

Basic Rule. As already mentioned, an entry in the configuration table

for a code piece o and a preprocessor symbol a means that o depends on (oris governed by) a; this means that o will only be included in a configurationif a is defined. Hence the basic rule for code pieces governed by singlepreprocessor symbols is:

. .. p-p... . A . . .#ifdef A

. . . n-n . . . + P-P . . . . ,.,

#endif

. .. q-q... n-n . . . x . . .

q-q . . . . .,.

Conjunctions of Preprocessor Symbols. If a code piece is governed

conjunction of preprocessor symbols, it depends on all of the symbols:

by a

. .. p-p...

#if defined(A) &&

defined(B) &&. . . && defined (C)

. . . n-n . . .

#endif. .. q-q...

,.. A B ,.. c .

p-p . . . .,. . . . . .

n-n .,. x x . . . x . . .

H . . . . . . . . . . . . .



Negated Preprocessor Symbols. If a symbol occurs in negated form, thisnegated symbol needs a column of its own, since a basic formal context canexpress only positive statements. The rule thus is:

#if defined(A)

. .. p-p...

#endif *

,..

#if !defined(A)

., n-n . . .

#endif

L---1--LA!-

l-&l-p-p x T1!A

n-n x

A similar rule applies to #ifdef . . . #else . . #endif. In the theory of conceptlattices, the resulting table is called the “dichotomized context.” Prolog

programmers have known the same trick (explicit rules for negated predi-

cates ) for a long time.

Disjunctions of Preprocessor Symbols. Disjunctions of symbols areslightly more complicated. The basic idea is as follows: in order to handle#if defined(A) II defined(B), we introduce a separate column for A v B. Asboth A and B imply A v B, we must therefore place a cross in the A v Bcolumn whenever we place a cross in the column for A or B. The basic rulefor disjunctions hence is:

#if defined(A)

,. p-p. . .#endif +’

#if defined(A) II defined(B)

n-n, . .

#endif

#if defined(B)

... q-q..,

A B ,.. AIIB+

p-p ,.. x x

n-n .,. x

q-q .,. x ,. x .

#endif

In order to see that this rule is correct, imagine we introduce a new CPPsymbol AorB which is always defined whenever A or B is defined. A IIB isreplaced by AorB, and any code piece dependent on A or B is in additionmade dependent on AorB. This transformation of the source file keeps all

configurations intact. The transformed source code would —according to theconjunction rule —produce exactly the configuration table which is given inthe disjunction rule.

ACM Transactions on Software Engineering and Methodology Vol. 5, No 2, April 1996

164 Q Gregor Snelting

Complex Governing Expressions. In case there are complex conditionsarbitrarily built up from conjunctions, disjunctions, and negations, theseare first transformed into conjunctive normal form by applying the distrib-utive and de Morgan laws.4 Afterward, all expressions are of the form(A IV A2V. VA Z) A(BIV B2V. VBj)A. ..(C1VC2 v“v Ch), where allA ~, B., CP are either simple symbols or negated symbols. Expressions inconjunctive normal form can then be treated by the above rules: foreach negated symbol, as well as for each simple disjunction of the formAI VAZ V... Ai, an additional column is introduced. Additional crossesare then placed according to the disjunction rule (whenever a row containsan entry for A., it must contain an entry for Al v AZ v . . . v Ai).

Arithmetic Expressions. In rare cases, one can find CPP expressions like“#if version >50” – that is, arithmetic CPP expressions which are used forconfiguration management. Our approach however assumes a binary “de-finedhndetined” semantics for CPP expressions. Therefore, arithmetic andrelational expressions are treated as follows: for every arithmetic or rela-tional expression, a new column in the configuration table is introduced.This column is labeled with the complete arithmetic expression, and anentry in the configuration table is made. Thus arithmetic expressions willshow up as concept labels.

NORMRECS does not provide a fine-grained analysis of arithmetic andrelational expressions. But at least it distinguishes between #if A = = 42and #if defined(A), for example. According to the CPP semantics, A must bedefined if its value is 42. Therefore, an implication is added in this (andsimilar) cases: whenever a cross is placed in the column labeled A = = 42,another one is placed in the column for A as well.

Nested #ifde~ #define, and #’undefine Statements. The treatment ofnested ‘#i fdef” is obvious: for any line preceding an “#ifdef” the governingsymbols have already been determined. These are extended by an entry forthe symbol(s) in the new “#ifdef.” Example:

#ifdef A

P- P..A B ,.,

#ifdef B

*P-P ,., x

n-n . . .. . .

. . .

#endifn-n x x

#endif.,.

q- q.. q-q . . . ,..

‘Note that normalizing boolean expressions can have exponential time complexity. But, inpractice, even for source files like “x.load.c” the governing expressions are small enough to betractable.



line 180 ff#lf h?. s_NFS I hac3_. nllnk

,nt

!._l ink, s)

char .(). s! .s

,, .

- Rem.,,, S , ever< lf ,t >s unwritable

:wmr. .nl ink< fN. ENT ?.$111re5. NFS q.”crate.boq”a O“, s

,1 bad_,]nl ink

,nt e

L! l’..lk[k ).) ‘- c1ret”rr, o

. - .,,.0,

, ) f has_NFS

If (e ‘- ENOENT)

return 0,

# end, f

if (c!mod(s. S–IWSR) ‘- 0) [

,,, ,,(> - e

,,? ,Jrn .1

end. f

if has_NFS

else

, .ndif

“end’f

, t r ha_rellame

,f has_NFS

define

$ .1,.

stat,,

,, +(,:,, ,s”1 ink[s )--0 ! errno--ENOENT

r.turr “I’,l, .k (s)

‘lull nk[s <t) llIIk(s, t)

.,, ,

0 1,

I I ba.unlmk I tMs_WS yj’-:”” I hns_NFS has_r.wame I has_renamehm_NFS

181 -1s3 x

190-193 x x

195-196 x x x

198-201 x x

204-204 x x

206-206 x

212-212 x x

214-214 x x x

Fig. 6. Source code and configuration table fragments of the RCS stream editor “rcsedit.c.”

Nested “#define” and “#undefine” statements can also be treated (fordetails, see Krone [1993]). Experience has taught us that a “#define” isseldom used for configuration management, but for definition of constantsor inline functions instead. Hence the current implementation of NORA/RECS ignores “#define” and “#undefine” statements.

As an example for the construction process, consider the stream editor ofthe RCS system [Tichy 1985]. This program is 1656 lines long and uses 21CPP symbols for configuration management. Figure 6 presents a piece ofsource code, starting at line 180. The resulting configuration table is alsopresented in Figure 6; the configuration structure will be analyzed inSection 5. Note the typical treatment of disjunctions, negations, and“#else.”

ACM Transactions on Software Engineering and Methodology. Vol. 5, No. 2, April 1996

166 ● Gregor Srrelting

#ifdef A

. ..1...

#endif

#i fdef B * *

o

DC. ..11... I II m#endif

#ifdef C

,, .111..

#endif

#i fdef A

I.,, . . . .A#ifdef B I

,. .11... a

1‘B

#ifdef C n,Iv,.. III . . .

#endifb;

...IV. . .

#endif

#endif

Fig, 7. An antichain and a chain.

3.2 Basic Patterns in Configuration Lattices

We will now explain some characteristic patterns in concept lattices andprovide some basic examples. The purpose of this section is to demonstratehow certain source code patterns show up in the lattice. The configurationtable is only an intermediate representation and invisible to the user ofNORA/RECS.

Chains and Antichains. An antichain in a lattice is a set of incompa-rable elements. In our case, an antichain in the concept lattice comes fromcode pieces which are governed by different, independent preprocessorsymbols. A lattice which consists of only one antichain plus a top andbottom element is called flat (left-hand side of Figure 7).

A chain in a lattice is a set of elements c ~ < Cz < C3 < , . . which aremutually comparable. In our case, chains result from nested “#ifdef”statements. Note that, in concept lattices, a chain can be interpreted as asequence of implications. For example, in the right-hand side of Figure 7any code piece which depends on C (i.e., code piece III) also depends on B,and any code which depends on B (i.e., code pieces II, IV) also depends onA. According to Section 2.1, this is written as C ~ B ~ A.

Suprema and Infima. A supremum which is not the top element indi-cates that two code pieces are governed by the same “superordinate” CPPsymbol (left-hand side of Figure 8). Simple disjunctions also show up assuprema in the concept lattice (right-hand side of Figure 8). Note that anysupremum consists of two chains and an antichain, and the above explana-tions for chains and antichains still hold.

An infimum which is not the bottom element indicates that a code pieceis governed by two different CPP symbols (left-hand side of Figure 9).Infima may indicate interference (see below), An infimum may also resultfrom disjunctions (right-hand side of Figure 9).

Cascades. Nested ‘#i fdef. . . #elif” structures produce so-called cas-cades in the concept lattice. A cascade resembles the flow diagram of anested if-then-else statement (Figure 10).

ACMTransactions on Software Engineering and Methodology, Vol. 5, No. 2, April 1996.

(+eengineering of Configurations Based on Mathematical Concept Analysis . 167

#i fdef A

.1 ...

#i fdef B .A.11 =$’ 1,’W

#endif B .C11 III

#i fdef C

111, . .

#endif

IV,. .

#endlf

#if defined(A)

. 1 . . .

#endif#if defined(A) I I defined(B) AIIB

11 .,. + / EA 5

#endif I III#if defined(B)

111,

#endif

Fig, 8. Twosuprema

#i fdef A AllB AIIC.1 . . .

#ifdef A11 III

#endif A1 . . .

#if defined(A) && defined(B) ““’* 1

.11... a A#endif #if defined(A) I I defined(B)

B

#endif I //’m “II’”#endif

#ifdef B

.111... “11 #if defined(A) I [ defined(C)

111...#end~f

#endif

Fig, 9. Twoinfima.

#i fdef A A B c 1A !B !C

I.. . =+ I x +

#elif defined(B)II x x A, !AIll

11..,x x x I >

!B, C#elif defined(C)

B /’”

.111,.. H ,C ’111

#endif

Fig. 10. A cascade.

4. QUALITY OF THE CONFIGURATION SPACE

The concept lattice not only displays fine-grained dependencies betweenconfigurations. It also provides good insight into the overall quality of theconfiguration space, as will be explained in this chapter.

Two important software engineering principles are separation ofconcernsand anticipation ofchange. For example, operating system issues shouldbeseparated from user interface issues, and it should be easy to incorporateanother window system into a future version. In traditional softwareengineering, lou’ coupling and high cohesion are considered importantcriteria for good design, which help to achieve separation of concerns andanticipation of change. Coupling measures the interdependence between

ACh~Tr.nsa.tlon. on SoftwareEngineerln~and Methodolo~,Vol 5,N0 2, April 1996.


modules, while cohesion means that the elements of a module are relatedstrongly. If there is low coupling between modules and high cohesionwithin modules, modification of (the implementation of) one module doesnot influence the behavior of others. We use the notions of coupling andcohesion also for configurations. Coupling between configurations meansthat they have common code and hence influence each other; the lattice willprovide an exact measure for the “strongness” of “configuration coupling.”Cohesion within a set of configurations means that all configurations in theset deal with the same configuration aspect.

4.1 Interferences

Coupling arises whenever configurations have common code. As alreadyexplained, such code is determined by the infimum operation. This obser-vation gives rise to the following:

Definition 4.1.1. Two preprocessor symbols a and b interfere if theyhave common code, i.e., the intersection of the extent of their configurationconcepts is not empty: ext(p(a ) A p(b)) # 0. Otherwise, a and b are calledmutually exclusive. Two sets of CPP symbols A and B interfere, if thereexist interfering a E A, b E B; otherwise A and B are called mutuallyexclusive.

Interference is a syntactic phenomenon. Interferences show up wheneverCPP symbols have an infimum in the lattice which is not the bottomelement, and they can easily be detected automatically.5 But it is a muchmore difficult question to decide whether an interference indicates coupling(and therefore should be considered harmful) or whether it just shows thatcertain combinations of features of the target configuration have specificcode.

Definition 4.1.2. Two CPP symbols a and b are called disjoint if theycannot be defined at the same time. They are called orthogonal if they dealwith different and independent aspects of the configuration space, e.g., userinterface variants versus operating system variants. Two symbol sets Aand B are called disjoint, respectively, orthogonal, if this holds for all pairsa~A,b~B.

Disjoint or orthogonal CPP symbols are a semantic phenomenon. There-fore, the definitions of disjoint, respectively, orthogonal, CPP symbols arenot completely formal, and such symbols cannot be detected automatically.From the source code alone, it is in general undecidable whether CPPsymbols are disjoint or orthogonal. Only the restructurer (if anybody)knows whether CPP symbols can be defined at the same time and whetherthey deal with independent configuration aspects.

‘In rare cases, the bottom element may have a nonempty extent and hence induce aninterference.



1#i fdef 13SD

.1 ...#endif

#i fdef SVR4

#i fdef X_ Win

.11. .

#endifi #l frfef BSD

1 ‘“

111 . . .#endif

Iv, #endlf

#i fdef BSD

,.. v.. .#endif

I

$ifdef UNIXI...

,, F’- F“”m’mllx

#endif .,, “MIX,,~*#ifdef DOS

.●,

.0S0 SVR4 .11.,.,,* * Iv #endif

~% ‘ J UNIX11 ,, I.lv

.“#if defined(DOS)l I

● X.wl”[11‘ defined(X_win)11

. 111...

#endif

#ifdef UNIXIv...

#endif#if defined(UNIX)ll

(defined(DOS) &&defined(X–win ))..v,.

#endif

I

I

Fig. 11. Two interferences

Definition 4.1.3. An interference is considered harmful, if it is aninterference between disjoint or orthogonal CPP symbols (or symbol sets).

In order to decide whether interferences are harmful, the restructurershould contribute some knowledge (or educated guesses) about the in-tended meaning of CPP symbols. Often, the names of the CPP symbolsindicate their meaning. Ifabsolutely nothingis known about the configura-tion space, and even the names of the CPP symbols do not indicateanything, CPP symbols must not be considered disjoint or orthogonal; thusevery interference must be considered potentially harmful,

Interfering disjoint CPP symbols indicate dead code, which can beeliminated. Interfering orthogonal CPP symbols are also very suspicious:from a software engineering viewpoint, such an interference indicatescoupling between orthogonal configuration aspects.

Let us consider two examples, which both contain an interference (Figure11). In the first example, code piece III is governed by both BSD and SVR4,but Berkeley UNIX and System 5 UNIX are to the restructurer’s bestknowledge incompatible and hence disjoint. Hence code piece III can bedeleted. Note that the syntactic knowledge provided by the lattice as wellas the semantic knowledge contributed by the restructurer are necessaryfor this decision. In this example, the names of CPP symbols indicate theirmeaning. Note that poorly chosen CPP symbol names can easily misguidethe restructurer.

The second example in Figure 11 uses “virtual” concept labels D(X IIX_win and UNIX ~1DOS which have been generated by normalizing disjunc-tions. Therefore, the dependencies of code piece II on these two “CPPsymbols” is in fact nonexistent: code piece II is governed by DOS, DOS IIX_win, and UNIX II DOS; but according to the absorption law, DOS A (DOS vX_u’in ) A ( tJNIX v DOS) = DOS. Dependencies on system-generated basic

ACM Transaction. on Software Engineering and Methodology. Vol. 5, No 2, April 1996.


disjunctions usually do not show up in the source code directly; they onlyshow that certain combinations of defined CPP symbols select code whichmight conceal a harmful interference. In the example, the interferencemerely states that operating system as well as user interface issues showup in both main configurations and that —worse —there is a cross depen-dency between them.

4.2 Configuration Coupling

We will now discuss how the “configuration-coupling factor” can be ex-tracted from the lattice. If the concept lattice is flat, there are severalconfigurations, but they do not have common code. This means there are nodependencies whatsoever between the possible configurations. Selectingany “feature” of a configuration does not interfere with the selection of anyother “feature.” Hence there is zero coupling between configurations (leftexample in Figure 7).

More generally, low coupling is achieved if the lattice is horizontallydecomposable. A horizontal decomposition is the inverse operation to ahorizontal sum, which will be defined first.

Definition 4.2.1. Let Ll, Lz, . . . , L. be lattices. The horizontal sum ofthese lattices is

~ L,={T, 1} U~ L1\{Tij 1,}

i=l j=l

Definition 4.2.2. A lattice L is called horizontally decomposable if it is

the horizontal sum of smaller lattices: L = X;. ~ Li,

A horizontal sum is obtained by removing the top and bottom of thesummands and adding new “global” top and bottom elements. Conversely, ahorizontal decomposition can be obtained by removing the top and bottomelements of the lattice, determining the strongly connected components ofthe lattice graph (these are called the summands), and adding a new topand bottom element to each summand.

In case a lattice is not horizontally decomposable, it might be that it isdecomposable after a small number of interferences have been removed.This gives rise to the following definition.

Definition 4.2.3. A lattice L is called k-decomposable if

(1) after removal of T and 1, the lattice graph is k-connected in thestandard graph-theoretic sense;6

6A graph has connectivity k if the deletion of any k – 1 vertices fails to disconnect the graph

[Aho et al. 19831.



(2)

(3)

Fig. 12. Horizontal lattice decomposition and a simple interference,

the k vertices x,, .xz, , x~ which disconnect the graph are in fima:x, = a, A bi;

after removal of x,, Xz, . . , xk, the remaining disconnected subgraphsL,, L2, ..., L,, are – after reattachment of T and L to each L, –sublattices of L.

If there is a simple interference between two horizontal summands, the

lattice is l-decomposable (Figure 12). If there are k simple interferencesbetween summands, the lattice is k-decomposable. But there can be evenworse interferences between two sublattices.

Definition 4.2.4, An interference of connectivity k in lattice L consists of

h simple interferences, where ~ is k-connected, and removal of the interfer-

ences splits the lattice into only two sublattices.

Such k-interferences are more difficult to detect and to treat than simple

interferences (see Section 6).Ideally, the lattice is horizontally decomposable (i.e., O-decomposable),

where the CPP symbols in each sublattice deal with the same configurationaspect, e.g., operating system variants, while the CPP symbols in differentsummands are orthogonal or disjoint. This guarantees high cohesion withinone sublattice and low coupling between orthogonal configuration concepts.Paths which are glued together in their top or bottom sections are accept-able, but cross arcs between orthogonal sublattices always indicate harmfulinterference between configurations.

The notion of k-connectivity can be used to give an exact definition ofconfiguration coupling. This definition requires the notion of orthogonalconfiguration aspects, which—as discussed above— may require humanjudgment. Therefore the coupling factor cannot be computed automatically,but this does not imply that the notion of ‘configuration coupling factor” isuseless or imprecise,

Definition 4.2.5. Two orthogonal configuration aspects have couplingfactor k, if

( 1 ) the configuration lattice L is k-connected with horizontal summandsL1, L2, . . ..L~.

(2 ) for each L,, the CPP symbols in L, deal with the same configurationaspect;

(3 ) the CPP symbols in two different summands L,, L, are orthogonal.

ACM Transactions on Software Engineering and Methodolo~, Vol. 5, No 2, April 1996


Flat lattices and horizontal sums are O-decomposable and hence interfer-ence free; they are the optimal structures for configuration managementwith respect to the configuration-coupling factor. Thus concept analysis notonly provides a detailed account of all dependencies, but can serve as aquality assurance tool in order to check for good design of the configurationstructure, or to limit entropy increase as a software system evolves. Indeed,we will see later that analysis of interferences in the RCS system reveals asubtle UNIX bug.

5. THREE CASE STUDIES

We applied NORMRECS to several UNIX programs. NORA/RECS uses thewell-known Sugiyama algorithm [Sugiyama 1981] for graph layout. Thereare special layout algorithms for concept lattices which are much moreclever (e.g., WiHe [1989a; 1989bl and Skorsky [19921); we plan to integratethese in a future version.

As the layout results are not always satisfactory, the user may finallychange the graph layout manually (but the system will maintain integrityof the concept lattice). Indeed, layouts in Figures 13, 14, and 16 aremanually improved.

5.1 Example 1: The RCS Stream Editor

Our first example is the stream editor from the RCS system “rcsedit.c.”This 1656-line program uses 21 preprocessor symbols for configurationmanagement. Part of the source code and the configuration table of“rcsedit.c” have been presented in Figure 6. Construction of the conceptlattice took 0.1 seconds on a Sparcstation 2.

The main window in Figure 13 displays the concept lattice, which has 33concepts.7 There are some additional pop-up windows displaying the extentof C27, the intent of C26, and the labels of C3. A piece of source code,namely, lines 142 1–1430, is displayed in a separate window. All conceptlabels are displayed below the picture.

First of all, the labels of top element C 1 consist of all code pieces notgoverned by anything; in “rcsedit” these are quite numerous. There are nounused CPP symbols, as the bottom element C33 has no labels (this, by theway, is the reason why the bottom element has no name). In the left part ofthe lattice, there are a lot of simple variants, which correspond to specificUNIX features like “has_set_uid” or “has_readlink.” For example, C17shows that “has_readlink” governs lines 1051–1094, 1123, and 1144–1148(and no other) and that these lines are not governed by any other CPPsymbol. Hence for most of the source code, there is no coupling between

7The corresponding figure in Krone and Snelting [1994] was produced with the old version ofNORA/RECS which did not ignore “#define” statements. As explained in Section 3.1, the newversion ignores “#define” and “#undefine” statements. Hence, the old lattice in Krone andSnelting [1994] had some additional “virtual” concepts and interferences. The central interfer-ence is however still present, and the new lattice is a suborder of the old one. The sameremark applies to Figure 13.

ACM Transactions on Softwsre Engineering snd Methodology, Vol. 5, No. 2, April 1996.


Ic.+.+.“

.,, m-rR

hlmtif IhU_ron#m

m do-l fnk(frm. to) 1- 0 ~ –1 : un-link(frcm)

D

# 01seC26 (baO_unllnkOR has_NFS) C3 has_renmO rma-m(frm, to) !- 0

has.NFS 1424-1424 # tf has_UFS

1428-1428 “ M errrm !- ENCENT

1432- 1432 end{f? -?

mm ~m- *

if bad_&wmm: mde t-mda.tif1n.rmmt ng P dmd(to. .

,,.-

C1.

C2

C3

cd

r5

1 164

169 179210 210

236 25241? 42?

48% 488.7) 59B

654 65Q

bfib b6,

6,4 684692 69,6’35 70:708 713725 723-31 731,~. 749753 154’58 104’21096 11211128 11421150 12281234 1316

1322 13281?4- 13?013”J 1385

1396 1396

1401 14021406 16081419 - 14201434 1>43

1549 1656has_setu:d

1547 1547has_rename1424 14241420 1428

1432 1432bad_. _rename

1;94 1394

has_pr OtOtypes12-5 1375

Fig. 13.

C6 ha s_mktempC19

1320 - 1320

1336 1345C20

C7 Open_ can_creat1232 1232

ro has_readlink

1125 1126C9 has_setu, d

1545 1545

C21,

C22

C23c24

Clo ( has_rename OR bad_ b_renaMe)

1J1O 1417C1l has_f chmod

c25c26

1398 - 13991404 1404

C12 bad_a_renameC27C28

1387 1392C13

C291430 - 1430

C14 has_prototypes1372 1373

C15 h.as_mktemp

1318 1318

1330 - 1334

C30C31

c16 open_ can_creat

1230 - 1230C17 has_readlink

1051 10941123 11231144 1148

C18 large_memOry

254 - 254277 400490 490608 652661 661693 693

c32

-15 715

729 72~C33

751 751

Configuration structure of the RCS stream

has.memmove258 275has.memmove

256 256

has_NFS

!has_rename1422 1422

213 213

(bad_ .”llnk OR has_NFSl1s1 18s208 208206 206bas_NFS204 2041426 1426215 235bad_unl, nk190 193198 201195 196

large_ memOry166 167

402 405410 411

429 404

492 530

600 606663 - 664669 672686 688704 706717 723

733 745

756 756bad_ fopen_wplus

40~ 40B

editor

configurations. The concepts below C24/C 10/C21 (concerning networking)form a grid-like cluster. For example, C26 and C29 display the lines solelygoverned by “has_NFS,” respectively, “bad_unlink,” and C30 says that lines

ACMTransactions on Software Engineering and Methodology, Vol. 5, No 2, April 1996.


195–196 are governed by both (the reader should compare these statementsto Figure 6).

But there is an interference manifest in C27, which is the infimum of C3and C26. C3 is labeled “has_rename” (as can be seen in the “labels” box);C26 is labeled “has_NFS”; and C27 is labeled 1426-1426. Thus, line 1426 isgoverned by both “has_NFS” and “has_rename’’–this agrees with thesource code in the text window. “has_rename” has to do with the filesystem, while ‘has_NFS” controls networking. As these should be orthogo-nal, the interference is considered harmful. C30 presents a similar interfer-ence. It seems that C13 is an interference as well, but as C 12 is labeled“bad_a_rename,” both C12 and C3 have to do with the file system.

Thus, although the overall structure is quite good, we suspect thatnetworking issues and file access variants are not clearly separated in“rcsedit.c.)’ And indeed, a comment in the source code explains that, due toan NFS bug, %enarne( )“ can in rare cases destroy the RCS file! Thisproblem has been rediscovered by concept analysis, just by analyzing theconfiguration structure. The example demonstrates that NORNRECS cantrack down bugs, even bugs which the programmers would like to keepcovered: the comment reads “An even rarer NFS bug can occur when clientsretry requests. . . . This not only wrongly deletes B’s lock; it removes the RCSfile! . . . Since this problem afflicts scads of UNIX programs, but is so rarethat nobody seems to be worried about it, we won’t worry either. ”

5.2 Example 2: The TC Shell

Our next example is a popular shell, the “tcsh” developed at Berkeley. Wehave analyzed one of its modules, namely, “sh.exec.c.” This program is 959lines long and uses 24 different preprocessor symbols for configurationmanagement. Construction of the concept lattice took 0.05 seconds on aSparcstation 2; the lattice has 21 elements. In the concept lattice (Figure14), singleton attribute or object labels are displayed in the diagram; theothers can be looked up by clicking at a concept.

The almost flat lattice shows that there are several features of possibleconfigurations, which however can be selected independently. It seems thatthere is an interference between concepts C14 and C 15. But a look at thesource code reveals that both VFORK and FASTHASH have to do with thehash function used; hence there are no dependencies between orthogonalconfiguration concepts. Therefore the configuration structure is perfectaccording to the criteria described in Section 4.

5.3 Example 3: “xload.c”

Let us now come back to our introductory example, “xload.c” (see Figure 1).This program is 724 lines long and uses 43 preprocessor symbols forconfiguration management. Construction of the concept lattice took 0.3seconds on a Sparcstation 2. The resulting lattice has 148 concepts and isshown in Figure 15. It looks pretty chaotic —the program obviously suffersfrom configuration hacking.



❑ IIoSA-RECS ❑aw

w hxlml Opuma *

L@filcOselected

~

I w-iv’Fig. 14. Configuration structure of tcshel[ module “’sh. exec. c,”

Fig. 15. Configuration structure of “xload.c “

NORA/’RECS offers a parameter which controls the maximal nestingdepth of #ifdef statements taken into account. For a small value of thisparameter, many concepts and dependencies will disappear, but the overall

taxonomy of the configuration space is usually still visible.” For “xload,” thelattice displaying only the top four “#ifdef” nesting levels is given in Figure16. Even on the top level there are interferences– namely, C28 andC32–and the central role of C30 does not inspire confidence (C28 is the

“In fact, the resulting lattice is a sublattice of the original one. The technique is reasonable, asmany programs use top-level “#ifde~ statements for important configuration alternatives.whereas lower-level “#ifdef’ statements are used to control minor configuration details,

ACM Transactions on Software Engineering and Methorlrrlogy, V(II 5, No 2, April 1996


! sVR4! uTEK! alliant1 hcxI sequent1 sgi! sun

c15 sequent72 - 7495 97

cl:

C2

C3:

C4 :

C5 ,

c6:

C7:

c8:C9 ,

0-3537 - 3945 47 c16 AIXV3 OR CRAY

63 70Cl? mips OR umivs

49 - 5157 5961 6370 7274 76B5 - 8789 9193 9597 - 99101 103109 111

570 - 709564 - 510! KERNELJ2AD_VAR1ARLE401 - 449455 - 466! KERNEL .FILE

c18 mips104 106

C19 .ltrix OR umips59 - 61

C32:C33,

C20 sun51 - 5356 - 57

C34:

C35

c36:

338 - 39i! XXSN_FILE334 - 336X_AVENRUNfxted468 477490 - 511558 - 560

c21: ,38653 - 56

C22 SYSV115 - 117119 - 125?23 - 724

C23 MOTOROLA47 49

C24 , ! aPol10184 - 185714 - 723

SVR4 OR UTEK OR alliantOR hcx OR sequent OR sgi OR sunapollo125 104

C37: ! macll39 - 4044 - 45C25 : ! SYSV

! SYSV386271 - 272

X_NOT_POS 1X117 119(OSMAJORVERS 10t4- 4 )sonv

c38,

C39,C40 :

40 - 4143 4441 - 43477 - 483

713 - 714c26 : ! KvN_ROUTINES

316 - 317712 - 713

c27 : KW4_R0UT1NES272 - 316

C28 , SYSV3861S5 - 271

c29: ! LOADSTOBC30: 332 - 334

103”- 10410s - 109

511 - 547560 - 562

C41,

C42,317 - 332att35 - 37

Sgi99 - 101SVR4 OR UTEK OR alliantSVR4

word! ward: n-typen_typeFSCALE

KERNEL_FILE624EM_F1LEVm_NANEPROC_NANEBUFJ4ANE

111 - 112114 115

C1O FSCALE

336 - 33S398 - 401449 455466 - 46S483 - 490547 558562 - 564709 - 712

112 - 114Cll: MOTOROLA OR UTEK OR alliantC12: 91 - 93c13: hcx

87 - B9C14 : macll

76 - 85DECh3’

Fig. 16. Top-level configuration structure of “xload,c.”

infimum of C22 and C24, whereas C32 is the infimum of (72 and C30. C22 isSYSV, and C24 is !apollo. C2 is SVR4 II UTEK II alliant II hex II sequent II sgi IIsun. C30 is a set of nine code pieces governed by the sundries SYSV386,!LOADST(JB, and !KVM_ROUTINES). overall, the lattice consists of several

“standalone” configurations (C3, C4, C42, C26, C5, C6, C17, C18, C19), a“SVR4 IIUTEK IIalliant IIhex IIsequent IIsgi IIsun” sublattice (concepts sC2), and a “not macII, not apollo” sublattice (concepts s C24/C37). The topelement C 1 shows that several code pieces are configuration independent

ACMTransactions on Software Engineering and Methodology, Vol. 5, No. 2, April 1996.


(not governed by any CPP symbol), while the bottom element C43 showsthat several CPP symbols are defined, but not used for “#ifdef”-namelythose which are used for definition of constants or inline functions.

6. IJ4TTICE ANALYSIS AND INTERFERENCE DETECTION

Once the lattice has been constructed and laid out, the restructurer mayinspect it in an interactive manner. NORA/RECS offers the followingfunctions:

–for every concept c, its labels a(c) and W(C) can be displayed by a simplemouse click;

–the source code pieces given in o(c) can be displayed;

–lattices can be horizontally decomposed (if possible);

—interferences of minimal connectivity can be computed and displayed (seebelow );

—sublattices can be selected by clicking on their top or bottom elements;

—the intersection of sublattices can be displayed, which contains interfer-ences (see below);

–lattice decompositions and corresponding source file simplifications canbe triggered (see below).

After analysis, the restructurer can execute the restructuring algorithmsdescribed below. He or she may also manually restructure the source codeand repeat the analysis. Hence, NORA/TtECS supports an interactive andincremental way of analyzing configuration structures.

6,1 Automatic Interference Detection

As explained above, interferences indicate coupling between configurations.It is therefore of practical importance to detect interferences automatically.In small lattices, interferences are easy to spot manually, but for biglattices (e.g., Figure 15), tool support for lattice analysis is essential.

The interference analysis algorithm incorporated in NORWRECS tries tohorizontally decompose the lattice in such a way that connectivity isminimal. Lattice decomposition is done in a top-down fashion: top-levelinterferences between big sublattices are detected first, whereas minorinterferences will be detected very late. This helps for restructuring, asinterferences between big sublattices are more likely to indicate errors orbad configuration structure. The sublattices can later be analyzed recur-sively.

The algorithm for detecting interferences of minimal connectivity imple-ments the definitions from Section 4.2. It proceeds as follows:

( 1 ) Try a horizontal decomposition of the lattice. This is done by removingthe top and bottom elements and their outgoing edges; and thendetermine the connected components of the (undirected) concept graphby a standard algorithm [Aho et al. 19831. If successful, there are no

top-level interferences (connectivity k = O). Reattach top and bottom

ACM Transactions on Software Engineering and Methodology. Vol 5. No 2, April 1996.


element to each sublattice, and apply the remaining steps recursively tothe sublattices.

(2) If a sublattice cannot be decomposed horizontally, it may containinterferences. First, simple interferences of connectivity k = 1 areinvestigated. These are detected by removing the top and bottomelements, and then computing the bicormected components [Aho et al.1983] of the remaining graph. A bridge between two biconnectedcomponents which leads to a A-reducible concept node (that is, of theform c = a A b) points to an interference. The node is highlighted.

(3) Often, there is more than one interference between sublattices. Thuswe compute the k-connected components of the lattice graph (withouttop and bottom), where k is minimal. A simple method to determinek-connected sublattices is to consider all sets of k A-reducible conceptnodes and test whether their removal will break the graph into uncon-nected subgraphs.

As discussed above, only the restructurer can decide whether the inter-ference must be resolved or whether it should be ignored. This decisionmust be based on the semantics of the involved CPP symbols. If thesemantics is not documented, the lattice structure provides insight into theextent and intent of configurations.

6.2 Determining Sublattice Intersections

NORA/RECS offers another approach to lattice analysis, namely, explicitselection of sublattices through their maximal elements.

The restructurer may click at a number of concept nodes {cl, . . . . c.}, andthereby select the downward suborder J {cl, . . . . c~} = {x I 3Z : x s Ci}(the dual operation of selecting upward suborders is also supported).g Thedownward suborder is then highlighted (or colored) on the screen. Itprovides interesting information, as the code piece labels in J {c ~, ., . , c.}are those which depend on the intent of one of the Ci. For example, selectingthe downward suborder ~ {apollo, AIX} in Figure 4 displays all code piecesin the “apollo” or “AIX configurations.

Several downward sublattices may be selected this way, with eachhighlighted in a different color.l” The intersection of such sublattices alsocontains interferences; in fact, the maximal elements in the intersection oftwo sublattices are interferences. These are, however, not necessarily ofminimal connectivity. But they are choosen by the restructurer, which maybe more appropriate,

6.3 Example: Analyzing “rcsedit.c”

The configuration lattice of the RCS stream editor was given in Figure 13.After initial horizontal decomposition of the lattice, NORA/RECS immedi-

91fn=l, J{cl, ..., Cn} is a sublattice, but otherwise not all subsets of elements have a.mpremum.10Unfortunately, the current NORA version only supports black and white.



Fig, 17. Automatic interference detection in a sublattice

ately detects two interferences with connectivity 1, namely, C27 and C 13(Figure 17). Interestingly, C30 (which was criticized as a harmful interfer-ence in Section 5.1) is not a top-level interference, as it is part of a grid-likesuborder (in fact, it is an interference of connectivity y 1 in the sublatticeC24/C26/C29/C30; according to the above algorithm, it will only be detectedafter the k = 2-interference C25/C28 has been removed). On the otherhand, in Section 5.1 it was argued that C13 is not a harmful interference.This example shows again that each proposed interference must be in-spected manually and that interferences can be hidden in sublattices andwill only be detected after recursive application of the algorithm.

Should we schedule “rcsedit.c” for restructuring? Lattice analysis showsthat restructuring does not make much sense here, because the originalprogram already had a reasonable configuration structure. As the NFS bugwhich causes the problem obviously cannot be fixed, there is little hope thatthe interferences can be removed.

7. CODE SIMPLIFICATION

According to Parnas [1994], the first step in restructuring must be toreduce the size of the program family. If the restructurer has someknowledge about the old configuration structure and the meaning of the oldpreprocessor symbols, he or she might conclude that certain configurations

ACM Transactions on Software Engineering and Methodology, k’ol 5, N{] 2, April 1996.


1 1sif bad-unlink

int.n_link(s)

char const ●B;

P. Remove S, even if ~t .s .nwritable.

● Ignore .nlznk( ) ENOENT failures, NM

‘/{# ,f bad_unl, nk

lnt e,

,f [..li.k(s) ‘- 0).eturn O;

e - errno,if (chmod(s, S_lWUSR)errno - e;return -1,

# endifreturn unlink(s) ;

1#endif

#if !has_rename# define do_llnk(s, t) lxnk(s, tl

generates bogus ones

-0)[

I

Fig. 18. Partial evaluation of “rcsedit.c” under context expression !defined(has_NFS).

(i.e., certain preprocessor settings) need no longer be supported. Hence thecorresponding code is irrelevant and should be discarded (this is called“amputation” by Parnas). In particular, if certain preprocessor symbols areno longer needed, code depending on them will never be included in anyrestructured configuration and can be deleted. Such a simplification of the

source code is appropriate before more complicated restructuring takesplace.

In this section, we describe how amputation can be implemented viapartial evaluation of CPP files. The process is driven by the lattice, whichprovides excellent insight into the possibilities and effects of an amputa-tion.

7.1 Partial Evaluation of CPP Files Using NORA/lCE

Simplification of configurations relies on partial evaluation of CPP files, atechnique which will be sketched in this section. It is implemented inNORA/ICE, a tool for incremental configuration management based onfeature logic [Zeller 1995; Zeller and Snelting 1995]. NORA/ICE offers –among other features —partial evaluation of preprocessor files. It allowsone to simplify preprocessor files with respect to the information thatcertain (combinations of) governing symbols will (or will not) be defined.NORA/ICE will simplify governing expressions and delete code pieces orpreprocessor statements with respect to a given “context” expression, whichis assumed to be true. The ordinary preprocessor behavior is included asthe “limit case,” namely, that all preprocessor symbols have a known value.NORNICE allows for arbitrary complex context expressions, includingthose which introduce new symbols. Simplification is not just constantfolding, but is based on feature unification [Smolka 19921.

Partial evaluation will considerably reduce the size of the source file, ifseveral preprocessor symbols are known to be always defined or undefined.Figure 18 gives an example of partial evaluation: the source code of

ACMTransactions on Software Engineering and Methodology, Vol. 5, No. Z, April 1996.


“rcsedit.c” (line 180~, see Figure 6) is simplified under the assumption that“has_NFS” is always undefined. The code piece shrinks from 35 lines to 24lines (the inner #if becomes redundant; thus it could shrink to 22 lines, butNORA/ICE does not detect this).

7.2 The Art of Amputation

Generation of Source Code from Sublattices. Once a sublattice has beendetermined – either by horizontal decomposition or by explicit selection – asimplified source file corresponding to this sublattice can be created. Thisworks with every sublattice, but the restructurer should try to achieve highcohesion and low coupling. Also, preprocessor symbols in the sublatticeshould be orthogonal or disjoint from the rest of the lattice.

Source file simplification is done by partial evaluation of CPP files. Letc={cl, c~, ..., Ck} be the concepts not in the sublattice and X = { x ~,. . . . x,, } = U ~= 1 u~c,~ all their attribute labels. Then the source text is fedto NORA/ICE, together with the context expression [ ~definedlx, ), . . ,~defined( x. ) ]. This removes all code pieces not in the configuration sub-space and simplifies the governing expressions for the remaining codepieces. 11 The resulting source code contains only preprocessor symbolswhich appear in the sublattice.

Generation of Problematic Variants from Interferences. Once an inter-

ference has been determined, a special “problematic” variant can be gener-

ated. Let C = {cl, . . . . c~} be the concept nodes which constitute an

interference of connectivity k. C need not necessarily be an antichain; Ccan be a suborder or even a sublattice. Let X = U:. ~ a(c’ ) be all attributesymbols of C, and let W be the attribute symbols in ~ C. Then the sourcetext is fed to NORA/ICE, together with the context expression[defined(a) I a ● W \ X] H I=defined(y) I y c A \ (W U X)]. Thiscreates a source text which contains exactly the configurations containingthe problematic code pieces. Such a specialized source file is useful for ananalysis of interferences on the source code level.

7.3 Example: Simplifying “xload.c”

Let us now apply amputation to “xload.c.” Figure 16 shows that there areseveral configurations for rather uncommon machines. We decide to discardthe following configurations: apollo, X_ NON_ POSIX, sony, CRAY, mips,

umips, att, LOADSTUB, alliant, sequent, UTEK, hex, and sgi.12 Feeding thesource code with context expression “[!defined(apollo) && !defined(X_NOT_Poslx) && . . && !defined(sgi)]” to NORA/ICE, a reduced source file with540 source lines results. The corresponding concept lattice has 75 concepts,which is still a lot, We therefore decide that we consider only configurationswhere KVM.ROUTINES are not available (whatever they may do). Thisresults in a further reduction of the source file; it is now 490 lines long, and

! IActual]y it would be enough to undefine only the maximal elements of CL~Thls is ‘n’ot to be understood as a recommendation to the authors Of “x.lmcl.cl”



❑ 140RA-REC9 !llm@l

brtmt Ia c“’-’%-’”””~ ‘2’’ss”0’’s’s’38”bc16 : ! KERNEL_FILE

185 - 187191 - 191195 198210 213

C41 : SYSV38693 177

C42 : MOTOROLAC43 :C44C45

c1 1-3539 3943 - 43.47 4857 - 5869 7074 7783 - 8387 - 91482 490

C2 ! SYSV OR ! SYSV386179 179103 183

cl?c18 SYSV

220 - 220169 - 189! KME3.F1 LE181 - 181X_NOT_POS 1X

c46 72 - 72C47c48 249 - 249c49 , 207 207

C19 :C20 :

C50 41 - 41C51C21 :

85 - 85SVR4

c52 :c53 : mB8kc54 : 246 - 246C55 m68kc56 243 - 243

C22224 225 C23

C24C25c26

254 - 258271 271280 293341 - 350356 - 356

265 265215 - 215! FSCALE80 BO

C57C58 : 204 204C59 : 201 - 201c60 : WCI1

37 - 37c61 : 202 - 286

c27

479 480 C28 :

C29

mac II

60 67273 - 280

C3 SVR4 OR sunC4 358 362C5 STDC_ OR ! SVR4 295 314

352 - 352365 368193 - 193A1XV350 55

319 - 319325 - 339354 354

c62 ! A1XV3323 323

C6 MOTOROLA OR S3’SVC7 : SVR4C8

C30C31C32c9 ; sun

C1O KERNEL_ LOAD_VARIABLE227 228234 236240 240

c63c6AC33

C34 :C35

238 - 238sun45 - 45SYSV

C65 :C66 :c67 : 474 - 474c6E , ! UOTfJROLA

317 - 317

252 252C1l : ! LEG C36

C37C38C39C40

C12 267 267c13 : USG

261 - 262 c69 : 321 321C70 : 371 - 400c71 , 403 - 472

C14 hDUX 218 - 218c15 hp90008800

231 231 C72

Fig. 19. Simplified xload.c with an interferene of connectivity 1.

the corresponding lattice has 72 concepts. The symmetric case that weassume “defined(KVM_ROUTINES)” leads to a small source file with 16configuration concepts and is not further investigated.

The 72-concept lattice is then subject to horizontal decomposition andinterference analysis (Figure 19). The lattice is horizontally decomposable(see the C21 chain), which shows that X_NOT_POSIX (label of C21) does



not interfere with other configurations. There are three top-level interfer-ences of connectivity k = 1, namely, C29, C33, and C61. Removal of each of

these would isolate C28, C32, respectively, C60 (macll, AIXV3, ! macll), butleave the rest of the lattice unchanged and is therefore not further

investigated. Similarly, k = 2, respectively, k = 3, interferences do notproduce a “clean” lattice decomposition either.

We therefore decide to base further amputation on sublattice selection.Obviously, many concepts are below C2, which is labeled !SYSV II!SYSV386. We therefore produce a simplified source file which handles allconfigurations except “System V“; this is based on the sublattice j {C2}.According to Section 7.3, NORA/RECS collects the concept labels of themaximal elements outside $ {C2}; these are C3, C28, C32, C60, and C21with labels SVR4 IIsun, macll, AIXV3, !macll, and X_ NOT_ POSIX. NORA/RECS then feeds the simplified source file to NORA/ICE, together withcontext expression I~defined( SV R4 ), ~defined{sun ~, Tdefined( rnacII),Tdefined( AIXV3 ), defined( macll ), Tdefined[ X_NOT_POSIX) 1. The result-ing source File consists of 197 lines and can be installed on all non-SystemV platforms {but not CRAY, etc., as these configurations have been ampu-tated before ).

8. SIMPLIFICATION BASED ON CONCEPT LAITICE ISOMORPHISMS

In this final section, we present an algorithm which simplifies governing

expressions by utilizing an isomorphism theorem from concept analysis. Inparticular, disjunctions in governing expressions are eliminated. This is ofpractical value, as disjunctions are very difficult to understand.

Definition 8.1. A lattice element c E L is called join-irreducible, if foralla, b~L, a~c, and b < c it implies a v b < c. c is calledmeet-irreducible, if for all a, b G L, a > c, and b > c itimplies a A b > c.

The join-irreducible elements of L are written J(L ), and the meet-irreduc-ible elements are written M(L).

The join-irreducible elements are those with only one outgoing downwardedge, and the meet-irreducible elements are those with only one outgoingupward edge. It is not surprising that the lattice is generated by theirreducible elements alone. The following isomorphism theorem resemblesa famous result by Birkhoff for distributive lattices.

THEOREM lWILLE 19821. Let L = (B(O, A, P), s ) be a concept lattice.Let X > J(L) and Y z M(L) be supersets of the join- (respectively,meet-irreducible) elements of L. Then L ~ B(X, Y, s ), In particular, L zB(J(L), M(L), s) and L s B(L, L, s). ““

The latter isomorphism (which was already mentioned in Section 2.1)means that a concept lattice reproduces itself. The first isomorphism meansthat a concept lattice is generated by its join- (respectively, meet-irreduc-ible ) elements alone, if these are taken as objects (respectively, attributes].( J( L ). M(L ), s ) is called the reduced context; usually it is considerablysmaller than the original one. In general, any supersets of the irreducible

ACM Transactions on Software Engineering and MethodoloL~. Vol. 5. No. 2. April 1996


elements generate the lattice, if used as objects (respectively, attributes);the corresponding contexts are called partially reduced.

The isomorphism theorem can be used to generate a restructuringproposal as follows. According to the theorem, the concept lattice corre-sponding to the formal context C = (-Y(O), M(L), ~ ) is isomorphic to theoriginal lattice L –the meet-irreducible concepts are enough to “span” theconfiguration space. Hence we can generate a new configuration table. Thisrestructured table has the same rows as the original one (the O in (Y(O),i14(L ), s). For each meet-irreducible concept, we choose one new preproces-sor symbol (if a meet-irreducible concept already has an attribute label,thie can be used instead of generating a new one); these are used as columnlabels. The configuration table is then created according to the rule: for o ~o, a E M(L)

{

true7’[0, a] =

y(o) = p(a)

false otherwise.

This table generates a concept lattice which is isomorphic to the old one.The new table is usually “leaner” than the original one, as it has fewercolumns.

From the new table, the new governing expressions are generated asfollows. Let a~(o) = {al, ..., a.} be the new CPP symbols governing o.Then the new governing expression for o is #if defined(al ) && . . . de-fined. Hence the new governing expression contains only conjunctions,which is a tremendous simplification.

Example 8.1. In order to visualize the above notions, consider the latticein Figure 4. The meet-irreducible elements are those labeled maxll, SYSV,sun, sony, ultrix, i386, sequent, apollo, or CRAY. AIX however is the infimumof sony and ultrix, hence not irreducible. This means that any code piecegoverned by AIX is also governed by sony and ultrix, indicating that AIX isredundant. Hence NORA/RECS proposes to remove all dependencies on AIXand replace them by defined(sony) && defined(ultrix). All configurations arekept intact: as sony and ultrix are probably disjoint,13 they could not bedefined simultaneously; hence defined(sony) && defined(ultrix) can safely beused to replace AIX.

Example 8.2. Consider the second example in Figure 11 (repeated inFigure 20). The meet-irreducible elements are those labeled DOS IIX_win,UNIX II DOS, UNIX II X_win, and UNIX; these labels have been generated bynormalizing complex governing expressions. This allows us to rename theA-irreducible elements and reconstruct a simpler configuration table fromthe isomorphic lattice. We introduce three new preprocessor symbols,named DX, UD, and UX, which stand for DOS II X_win, UNIX II DOS, UNIX IIX_win, respectively. According to the above theorem, the configuration tablein Figure 20 produces an isomorphic concept lattice. From the table, we

ls~member that Fi~re 4 is a fictitious example.


Reenglneering of Configurations Based on Mathematical Concept Analysis . 185

9]idef UNTX

illt(ff?fDOS -. *

1:Cas!x

*, F ;“’x”mDx dD~. ,*

Send, f * UNIXIIX_wl” > =4

!iL? de fined (DOS\ll I’*V b :x

de fined (X_win) @ . UNIX * ● UNIX

111. [1 ~ 1Iv n I1’,,

$$lfde: UNIX

Iv

*er,d>?

*,: .ieflned(UNIX)

(de flned(DOS )&&defined (X_wln))

nendlt

&

m UD Ux UNIXx x x

x xx

xx x 3

xx

#if defined && defined && de fined (UNIX)

.1 .,

#endlf

#~f defined .5&de flned(rJD)

,.t?endif

#~fdef DX

111

#endif

#if defined b& defined && de fined (3NIX)

Iv

#endif

Slf defined .&Gde fined [ux)

v.,

#end, f

Fig. 20. Elimination of disjunctions from source cxdr

obtain the restructured source code, which is much easier to understand, as

it no longer contains any disjunctions. A human restructurer would have a

hard time transforming the governing expressions in a similar manner!

But note that the configuration consisting of code pieces 1, III, IV, and V

can no longer be selected. In the old source file, it could have been selected

by defining both UNIX and DOS. As these are disjoint preprocessor symbols,the configuration need not be preserved. A computation of the configurationfunctions of both source files shows that all other configurations can still beselected in the simplified source code.

Thus by removing nonirreducible symbols, configurations can be lost, Bycomputing the configuration functions NORA/RECS can track down allconfigurations which can no longer be selected and thus present them tothe user. If the user insists on a particular configuration, he or she mayagain add attribute labels (CPP symbols) from the original lattice; therestructured table will be extended accordingly. The generated lattice stillremains the same, as the theorem allows for arbitrary supersets of themeet-irreducible elements. In the example, adding DOS again as a “guard”to II allows us to select I, III, IV, and V; the source code is still considerablysimpler.

In general, all governing expressions are transformed into minimalconjunctive normal form; furthermore, the lattice suggests the introduction

Ac~ ‘r~~~s[l~tlon,onSoftware Engineering and ,Methodology, Vol 5, No 2, Apr!l 1996.


of new CPP symbols. But note that the structure of the lattice remains thesame; hence interferences are not really resolved.

Example 8,3. Once more, we apply the method also to “xload.c.” In thesimplified lattice of Figure 19, the meet-irreducible concepts are C2, C3,C5, C6, C7, C9, C1O, Cll, C13, C14, C16, C20, C21, C22, C27, C28, C32,C35, C36, C41, C42, C53, C55, C60, C62, and C68. All of the other(meet-reducible) concepts except C15, C18, and C32 do not have attributelabels; hence the potential for saving CPP symbols is very low. Indeed,most of the governing expressions cannot be simplified.

This example shows once more that “xload” is a hard “nut to crack”: itslattice cannot be decomposed, and its governing expressions cannot besimplified. Future work must show whether even more powerful methods(such as subdirect decomposition of the lattice) can reveal some structureor whether “xload” must be blamed for irreparable configuration hacking.

9. CONCLUSION

We applied mathematical concept analysis in order to explore the configu-ration space of existing software, NORALRECS not only displays all depen-dencies between configurations (respectively, their features), but also al-lows for simplifying governing expressions. Our preliminary experience canbe summarized as follows:

(1) Keeping track of all possible variants of a software system is importantand hard.

(2) Analyzing relationships among such variants is important as a preludeto restructuring the code to enhance maintainability.

(3) The set of possible variants of a software system can be characterizedby the features of each variant.

(4) Concept lattices provide a theory of such variant spaces.

(5) Tools based on this theory are potentially useful for analyzing andminimizing variant spaces, because they can express interesting, rele-vant characteristics of various subsets of the variants, especially inclu-sion relationships among them.

(6) Applying these tools in several case studies has produced evidence thatthe analysis is useful.

(7) Efforts to use the theory as the basis for code restructuring has onlybegun.

Thus, NORA/RECS is a useful tool for analysis of configurations; as a toolfor restructuring, it is still in its infancy. Future research must showwhether modularization can be achieved automatically. We will investigatewhether the theory of concept analysis can be utilized for restructuring;subdirect decomposition of concept lattices [Wine 1983] looks particularlypromising.

As certain relations between “objects” and “attributes” occur all the timein software engineering, concept analysis is potentially useful in other



kinds of maintenance activities. Besides configuration restructuring, we

consider the following applications of concept analysis:

Analysis of So ftuare Architectures. A software architecture is defined byrelations between components and hence can be subject to concept analysis.This might also help for automatic modularization of old code. Note that incontrast to other reengineering approaches, concept analysis is determinis-tic and always allows one to reconstruct the raw data from the lattice. Forexample, the approach of Schwanke [1991] is based on a similarity functionwhich contains several free parameters and therefore requires tuning.Concept analysis does not use any heuristics and is more transparent.

So ftuare Component Retrieval. Imagine a library where components areindexed by keywords. The relation between components and keywords canbe subject to concept analysis. The resulting lattice allows for incrementalnarrowing of the set of still possible components and gives users feedbackabout the still applicable keywords.

NORA/RECS is part of the inference-based software development envi-ronment NORA.l’~ NORA aims at utilizing inference technology in softwaretools and– besides NORA/RECS– covers the following topics:

–NORA/ICE (incremental configuration engine) offers configuration man-agement based on feature logic [Zeller 1995; Zeller and Snelting 19951;

–NORA/HAMMR (highly adaptive multimethod retrieval) offers softwarecomponent retrieval based on deductive and lattice-theoretic techniques[Fischer et al. 1995a; 1995b; Lindig 19951;

–NORA/HOML (higher-order module language) is a calculus for designingreference architectures, which is based on Au-calculus with dependenttypes IGrosch 19951.

NORA/RECS can be obtained through the WWW page of the softwaretechnology group in Brunswick: http:llwww.cs.tu-bs. delsoftechl.

ACKNOWLEDGMENTS

Several persons contributed to NORA/RECS. Maren Krone detected how totreat disjunctions and implemented the front end. Anke Lewien imple-mented interactive restructuring. Christian Lindig implemented the graphlayouter and the concept lattice algorithm, and he conducted severalexperiments. Martin Skorsky from the Darmstadt algebra group providedmany helpful comments. Andreas Zeller implemented the NORA grapheditor and developed NORA/ICE. Franz-Josef Grosch, Victor Pollara, andthree anonymous reviewers provided valuable comments on a preliminaryversion of this article.

‘ ‘NOILA is a drama hy the Norwegian writer H. Ibsen. Hence. NORA is no real acronym.

ACM Transaction. on Software Engineering and Methodology, Vol. 5. No 2, April 1996.


REFERENCES

AEIO,A., HOPCROFT, J., AND ULLMAN, J. 1983, Data Structures and Algorithms, 2nd ed.Addison-Wesley, Reading, Mass.

BIRKHOFF,G. 1940. Lattice Theory, lsted. American Mathematical Society, Providence, R.I.DAVRY, B. A. AND PRIESTLEY, H. A. 1990. Introduction to Lattices and Order, 2nd ed.

Cambridge University Press, Cambridge, England.DUQUENNE,V. 1987. Contextual implications between attributes and some representation

properties for finite lattices. In Beitruge zur Begriffsanalyse, B. Ganter, R. Willie, and K.Wolff, Eds. B. I. Wissenschaftsverlag, Berlin, 213-240.

ESTUBLIER, J. AND CASALLAS, R. 1994. The Adele configuration manager. In Trends inSoftware. Vol. 2, Configuration Management, W. F. Tichy, Ed. John Wiley and Sons,Chichester, England, 99-133,

FISCHER, B., KIEVERNAGEL,M., AND SNELTING,G. 1995a. Deduction-based software compo-nent retrieval. In Working Notes of the ZJCAI-95 Workshop on Formal Approaches to Reuse

of Proofs, Plans, and Programs (Montreal, Canada).FISCHER, B., KIEVERMAGEL,M., AND STRUCKMANN,W, 1995b. VCR: A VDM-based software

component retrieval tool. In Proceedings of the ICSE-17 Workshop on Formal Approaches toSoftware Engineering (Seattle, Wash.).

GANTER, B. 1987. Algorithmen zur formalen Begriffsanalyse. In Beitrdge zur Begriffsanal-yse, B. Ganter, R. Willie, and K. Wolff, Eds. B.I. Wissenschaftsverlag, Berlin, 241-254.

GANTER,B. 1995. Formal concept analysis–A subjective introduction. Mathematik-Bericht,Technische Universitaet Dresden, Germany.

GANTER, B., WILLE, R., ANDWOLFF, K., Eds. 1987. Beitruge zur Begriffsanalyse. B.I. Wissen-schaftsverlag, Berlin.

GROSCH, F.-J. 1995. No type stamps and no structure stamps–A fully applicative higher-order module language. Informatik-Bericht, TU Braunschweig, May, Submitted for publica-tion.

KRONE, M. 1993. Reverse engineering von konfigurationsstrukturen. Master’s thesis, Tech-nical Univ. of Braunschweig, Germany. Sept. In German.

KRONE, M. AND SNELTING,G. 1994. On the inference of configuration structures from sourcecode. In lCSE-16 Proceedings of the 16th International Conference on Software Engineering.IEEE Computer Society Press, Los Alamitos, Calif., 49-57.

LEBLANG,D. B. 1994. The CM challenge: Configuration management that works. In Trendsin Software. Vol. 2, Configuration Management, W. F, Tichy, Ed. John Wiley and Sons,Chichester, England, 1-37,

LINDIG, C. 1995. Concept-based component retrieval. In Working Notes of the ZJCAI-95Workshop: Formal Approaches to the Reuse of Plans, Proofs, and Programs, J. Kohler, F.Giunchiglia, C. Green, and C, Walther, Eds. 21-25.

MAHLER, A. 1994. Variants: Keeping things together and telling them apart. In Trends inSoftware. Vol. 2, Configuration Management, W. F, Tichy, Ed. John Wiley and Sons,Chichester, England, 39-69.

PARNAS, D. 1994. Software aging. In ICSE-16 Proceedings of the 16th International Confer-ence on Software Engineering. IEEE Computer Society Press, Los Alamitos, Calif., 279-290.

SCHWANKK,R. W. 1991. An intelligent tool for reengineering software modularity. InProceedings of the 13th International Conference on Software Engineering. IEEE ComputerSociety Press, Los Alamitos, Calif., 83-92.

SKORSKY,M. 1992. Endliche Verbaude–Diagramme und Eigenschaften. Ph.D. thesis, Dept.of Mathematics, Technical Univ. of Darmstadt, Germany.

SMOLKA, G. 1992. Feature-constrained logics for unification grammars. J. Logic Program.12, 51-87.

SUGIYAMA, K,, SHOKJIRO, T., mm TODA, M. 1981. Methods for visual understanding ofhierarchical system structures, IEEE Trans. Syst, Man Cybernet. 11, 2, 109-125.

TICHY, W. 1985, RCS–A system for version control. Softw. Pratt. Exper. 15, 7 (July),637-654.



TIcH\, W. 1988. Tools for software configuration management. In Proceedings of the 1stInternational Workshop on Software Configuration Management, J. Winkler. Ed. TeubnerVerlag. Grassau, 1-20.

TICHY, W, F., Ed. 1994, Trends in Software. Vol. 2, Configuration Management. ,John Wileyand Sons, Chichester, England.

VAN MKXHEi.EN, 1,, HAMPTON,J., MICHALSKI, R. S., AND THELTNS, P,, Eds. 1993. Categoriesand Concepts, Academic Press, London.

W1],LK, R. 1982. Restructuring lattice theory: An approach based on hierarchies of concepts.In ordere(i Sets, I. Rival, Ed. Reidel, 445-47o,

WILI~, R. 1983. Subdirect decomposition of concept lattices. Algebra Uniuersalis 17, 275-28?

WILLE, R. 1989a. Geometric representation of concept lattices. In Conceptual and Numeri-cal Analy.si,s ufData, O. Opitz, Ed, Springer-Verlag, Berlin, 239–255,

WI I.LE, R. 1989b. Lattices in data analysis: How to draw them with a computer. In{’la.ss(ficat(orl Qd Related Methods of Data Analysis, H. H. Bock, Ed. North-Holland.Amsterdam, 33-58.

WILLE, R. AND GANTER, B. 1993, Mathematische theorie der formalen begriffsanalyse.Lecture notes, Dept. of Mathematics, Technical Univ. of Darmstadt, Germany.

ZELLW+,A. 1995, A unified version model for configuration management. In Proceedings ofthe SIGSOFT .i’rd Symposium on the Foundations of So ftu,are Engineering. ACM, New York,151-160.

ZELLR~, ,4. ~xn SNEL’I’lX(;, ~. 1995. Handling version sets through feature logic. In Proceed-rng,s of’ thr .’lth European Software Engineering Conference, W. Schafer, Ed. Lecture Notes inComputer Science, vol. 989. Springer-Verlag, Berlin, 191-204.

Received January 1995: revised August 1995; accepted February 1996

A(’M Transactlans on Saftware Engineering and Methodology, Vol 5, No 2. April 1996

reengineering of configurations based on mathematical ... · additional key words and phrases:...

Documents