
Information Systems 26 (2001) 205–236

Data warehouse process management

Panos Vassiliadis a,*, Christoph Quix b, Yannis Vassiliou a, Matthias Jarke b,c

a Department of Electrical and Computer Engineering, Computer Science Division, National Technical University of Athens, Iroon Polytechniou 9, 157 73, Athens, Greece
b Informatik V (Information Systems), RWTH Aachen, 52056 Aachen, Germany
c GMD-FIT, 53754 Sankt Augustin, Germany

Abstract

Previous research has provided metadata models that enable the capturing of the static components of a data warehouse architecture, along with information on different quality factors over these components. This paper complements this work with the modeling of the dynamic parts of the data warehouse. The proposed metamodel of data warehouse operational processes is capable of modeling complex activities, their interrelationships, and the relationship of activities with data sources and execution details. Moreover, the metamodel complements the existing architecture and quality models in a coherent fashion, resulting in a full framework for quality-oriented data warehouse management, capable of supporting the design, administration and especially evolution of a data warehouse. Finally, we exploit our framework to refute the widespread belief that data warehouses can be treated as collections of materialized views. We have implemented this metamodel using the language Telos and the metadata repository system ConceptBase. © 2001 Elsevier Science Ltd. All rights reserved.

Keywords: Data warehousing; Process modeling; Evolution; Quality

1. Introduction

Data warehouses (DW) integrate data from multiple heterogeneous information sources and transform them into a multidimensional representation for decision support applications. Apart from a complex architecture, involving data sources, the data staging area, operational data stores, the global data warehouse, the client data marts, etc., a data warehouse is also characterized by a complex lifecycle. In a permanent design phase, the designer has to produce and maintain a conceptual model and a usually voluminous logical schema, accompanied by a detailed physical design for efficiency reasons. The designer must also deal with data warehouse administrative processes, which are complex in structure, large in number and hard to code; deadlines must be met for the population of the data warehouse and contingency actions taken in the case of errors. Finally, the evolution phase involves a combination of design and administration tasks: as time passes, the business rules of an organization change, new data are requested by the end users, new sources of information become available, and the data warehouse architecture must evolve to efficiently support the decision-making process within the organization that owns the data warehouse.

*Corresponding author. Tel.: +30-1-772-1402; fax: +30-1-772-1442.
E-mail address: [email protected] (P. Vassiliadis).

0306-4379/01/$ - see front matter © 2001 Elsevier Science Ltd. All rights reserved.
PII: S0306-4379(01)00018-7



All the data warehouse components, processes and data should be tracked and administered via a metadata repository. In [1], we presented a metadata modeling approach which enables the capturing of the static parts of the architecture of a data warehouse. The linkage of the architecture model to quality parameters (in the form of a quality model) and its implementation in the metadata repository ConceptBase have been formally described in [2]. Vassiliadis et al. [3] present a methodology for the exploitation of the information found in the metadata repository and the quality-oriented evolution of a data warehouse based on the architecture and quality model. In this paper, we complement these results with metamodels and support tools for the dynamic part of the data warehouse environment: the operational data warehouse processes. The combination of all the data warehouse viewpoints is depicted in Fig. 1.

In the three phases of the data warehouse lifecycle, the interested stakeholders need information on various aspects of the examined processes: what are they supposed to do, how are they implemented, why are they necessary and how do they affect other processes in the data warehouse [4,1]. Like the data warehouse architecture and quality metamodels, the process metamodel assumes the clustering of its entities in logical, physical and conceptual perspectives, each assigned the task of answering one of the aforementioned stakeholder questions. In the rest of this section we briefly present the requirements faced in each phase, our solutions and their expected benefits.

The design and implementation of operational data warehouse processes is a labor-intensive and lengthy procedure, covering 30–80% of the effort and expenses of the overall data warehouse construction [5,6]. For a metamodel to be able to efficiently support the design and implementation tasks, it is imperative to capture at least two essential aspects of data warehouse processes: complexity of structure and relationship with the involved data. In our proposal, the logical perspective is capable of modeling the structure of complex activities and of capturing all the entities of the widely accepted Workflow Management Coalition Standard [7]. The relationship of data warehouse activities with their underlying data stores is captured in terms of SQL definitions.

This simple idea refutes the classical belief that data warehouses are simply collections of materialized views. In previous data warehouse research, directly assigning a naïve view definition to a data warehouse table has been the most common practice. Although this abstraction is elegant and sufficient for the purpose of examining alternative strategies for view maintenance, it is incapable of capturing real-world processes within a data warehouse environment. In our approach, we can deduce the definition of a table in the data warehouse as the outcome of the combination of the processes that populate it. This new kind of definition complements existing approaches, since our approach provides the operational semantics for the content of a data warehouse table, whereas the existing ones give an abstraction of its intensional semantics.

The conceptual process perspective traces the reasons behind the structure of the data warehouse. We extend the demand-oriented concept of dependencies, as in the Actor-Dependency model [4], with the supply-oriented notion of suitability that fits well with the redundancy found often in data warehouses.

Fig. 1. The different viewpoints for the metadata repository of a data warehouse.


As another extension to the Actor-Dependency model, we have generalized the notion of role in order to uniformly trace any person, program or data store participating in the system.

By implementing the metamodel in an object logic, we can exploit the query facilities of the repository to provide support for consistency checking of the design. The deductive capabilities of ConceptBase [8] provide the facilities to avoid manually assigning all the interdependencies of activity roles in the conceptual perspective. It is sufficient to impose rules to deduce these interdependencies from the structure of data stores and activities.

While the design and implementation of the warehouse are performed in a rather controlled environment, the administration of the warehouse has to deal with problems that evolve in an ad-hoc fashion. For example, during the loading of the warehouse, contingency treatment is necessary for the efficient administration of failures. In such events, the knowledge of the structure of a process is important; moreover, the specific traces of executed processes are also required to be tracked down: in an erroneous situation, not only the causes of the failure, but also the progress of the loading process by the time of the failure must be detected, in order to efficiently resume its operation. Still, failures during the warehouse loading are only the tip of the iceberg as far as problems in a data warehouse environment are concerned. This brings up the discussion on data warehouse quality and the ability of a metadata repository to trace it in an expressive and usable fashion. To face this problem, the proposed process metamodel is explicitly linked to our earlier quality metamodel [2]. We complement this linkage by mentioning specific quality factors for the quality dimensions of the ISO 9126 standard for software implementation and evaluation.

Identifying erroneous situations or unsatisfactory quality in the data warehouse environment is not sufficient. The data warehouse stakeholders should be supported in their efforts to react against these phenomena. The above-mentioned suitability notion in the conceptual perspective of the process metamodel allows the definition of recovery actions for potential errors or problems (e.g., alternative paths for the population of the data warehouse) in a straightforward way, during runtime.

Data warehouse evolution is unavoidable as new sources and clients are integrated, business rules change and user requests multiply. The effect of evolving the structure of the warehouse can be predicted by tracing the various interdependencies among the components of the warehouse. We have already mentioned how the conceptual perspective of the metamodel traces interdependencies between all the participants in a data warehouse environment, whether persons, programs or data stores. The prediction of potential impacts (whether of political, structural, or operational nature) is supported by this feature in several ways. To mention the simplest, the sheer existence of dependency links forecasts a potential impact on the architecture of the warehouse in the presence of any changes. More elaborate techniques will also be provided in this paper, by taking into account the particular attributes that participate in these interdependencies and the SQL definitions of the involved processes and data stores. Naturally, the existence of suitability links suggests alternatives for the new structure of the warehouse. We do not claim that our approach is suitable for any kind of process, but focus our attention on the internals of data warehouse systems.

This paper is organized as follows: In Section 2 we present the background work and the motivation for this paper. In Section 3 we describe the process metamodel and in Section 4 we present its linkage to the quality model. In Section 5 we present how the metadata repository can be used for the determination of the operational semantics of the data warehouse tables and for evolution purposes. In Section 6 we present related work and Section 7 presents issues for future research.

2. Background and motivation

In this section we will detail the background work and the motivation behind the proposed metamodel for data warehouse operational processes.


2.1. Background work for the metamodel: the quest for formal models of quality

For decision support, a data warehouse must provide high quality of data and service. Errors in databases have been reported to be in the 10% range and even higher in a variety of applications. Wang et al. [9] report that more than $2 billion of US federal loan money had been lost because of poor data quality at a single agency; manufacturing companies spend over 25% of their sales on wasteful practices, service companies up to 40%. In certain vertical markets (e.g., the public sector) data quality is not an option but a constraint for the proper operation of the data warehouse. Thus, data quality problems seem to introduce even more complexity and computational burden to the loading of the data warehouse. In the DWQ project (Foundations of Data Warehouse Quality [10]), we have attacked the problem of quality-oriented design and administration in a formal way, without sacrificing optimization and practical exploitation of our research results. In this subsection we summarize our results as far as needed for this paper.

In [1] a basic metamodel for data warehouse architecture and quality has been presented, as shown in Fig. 2. The framework describes a data warehouse in three perspectives: a conceptual, a logical and a physical perspective. Each perspective is partitioned into the three traditional data warehouse levels: source, data warehouse and client level.

On the metamodel layer, the framework gives a notation for data warehouse architectures by specifying meta-classes for the usual data warehouse objects like data store, relation, view, etc. On the metadata layer, the metamodel is instantiated with the concrete architecture of a data warehouse, involving its schema definition, indexes, table spaces, etc. The lowest layer in Fig. 2 represents the actual processes and data.

The quality metamodel accompanying the architecture metamodel [2] involves an extension of the Goal-Question-Metric approach [11]. The metamodel introduces the basic entities around quality (including Quality Goals, Quality Queries and Quality Factors), the metadata layer is customized per warehouse with quality scenarios, and the instance layer captures concrete measurements of the quality of a given data warehouse.

The static description of the architecture (left part of Fig. 2) is complemented in this paper with a metamodel of the dynamic data warehouse operational processes.

Fig. 2. Framework for Data Warehouse Architecture [1].


As one can notice in the middle of Fig. 2, we follow again a three-level instantiation: a Process Metamodel deals with generic entities involved in all data warehouse processes (operating on entities found at the data warehouse metamodel level), the Process Model covers the processes of a specific data warehouse by employing instances of the metamodel entities, and the Process Traces capture the execution of the actual data warehouse processes happening in the real world.

In [3], the Goal-Question-Metric (GQM) methodology has been extended in order (a) to capture the interrelationships between different quality factors with respect to a specific quality goal, and (b) to define an appropriate lifecycle that deals with quality goal evaluation and improvement. The methodology comprises a set of steps, involving the design of the quality goal, the evaluation of the current status, the analysis and improvement of this situation, and finally, the re-evaluation of the achieved plan. The metadata repository together with this quality goal definition methodology constitutes a decision support system which helps data warehouse designers and administrators to take relevant decisions, in order to achieve a reasonable quality level which best fits the user requirements.

Throughout our models, we encourage the use of templates for process, quality and architecture objects. This is especially apparent in the metadata layer, where an abstract specification of architecture, process or quality objects, originally coming from the designers of the data warehouse, can be properly specialized by the data warehouse administrator at runtime. For example, a high-level specification of a chain of activities involving extraction, checking for primary and foreign keys and final loading in the warehouse can be customized to specific programs, later in the construction of the warehouse. Practical experience has shown that such templates recur in a data warehouse architecture, i.e., several tables can be populated through very similar programs (especially if the data are coming from the same source). The interested reader is referred to [12,13] for examples of such templates.
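As an illustration (not drawn from the paper itself), a "check primary and foreign keys" step of such a template might be instantiated as SQL statements of the following form; the table and column names (buffer, reject_buffer, hospital, h_id, edate) are hypothetical:

-- Hypothetical instantiation of a "check primary key" step: copy rows
-- whose key combination occurs more than once in the buffer to a reject table.
INSERT INTO reject_buffer
SELECT b.*
FROM buffer b
WHERE (b.h_id, b.edate) IN (SELECT h_id, edate
                            FROM buffer
                            GROUP BY h_id, edate
                            HAVING COUNT(*) > 1);

-- Hypothetical instantiation of a "check foreign key" step: reject rows
-- that reference a hospital unknown to the reference table.
INSERT INTO reject_buffer
SELECT b.*
FROM buffer b
WHERE NOT EXISTS (SELECT 1 FROM hospital h WHERE h.h_id = b.h_id);

A concrete program populating a specific table would then be a specialization of this abstract chain, with the actual table names and keys filled in.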

2.2. The 3 perspectives for the process model

Our process model (cf. Fig. 3) follows the same three perspectives as the architecture model, since the perspectives of the process model operate on objects of the respective perspective of the architecture model. As mentioned in [4], there are different ways to view a process: what steps it consists of (logical perspective), how they are to be performed (physical perspective) and why these steps exist (conceptual perspective). Thus, we view a data warehouse process from three perspectives: a central logical part of the model, which captures the basic structure and data characteristics of a process, its physical counterpart which provides specific details over the actual components that execute the process, and a conceptual perspective which abstractly represents the basic interrelationships between data warehouse stakeholders and processes in a formal way.

Typically, the information about how a process is executed concerns stakeholders who are involved in the everyday use of the process. The information about the structure of the process concerns stakeholders that manage it, while the information relevant to the reasons behind this structure concerns process engineers who are involved in the monitoring or evolution of the process environment.

Fig. 3. The reasoning behind the 3 perspectives of the process metamodel.


Often, all these roles are covered by the data warehouse administration team, although one could also encounter different schemes.

Another important issue shown in Fig. 3 is that we can observe a data flow in each of the three perspectives. In the logical perspective, the modeling is concerned with the functionality of an activity, describing what this particular activity is about in terms of consumption and production of information. In the physical perspective, the details of the execution of the process are the center of the modeling. The most intriguing part, though, is the conceptual perspective, covering why a process exists. This can be due to necessity reasons (in which case the receiver of information depends on the process to deliver the data) and/or suitability reasons (in which case the information provider is capable of providing the requested information).

2.3. Complexity and traces

Data warehouse operational processes are quite complex, in terms of tasks executed within a single process, execution coherence, contingency treatment, etc. A process metamodel should be able to capture this kind of complexity. In Fig. 4 the data warehouse refreshment process is depicted, as described in [14]. The refreshment process is composed of activities, such as Data Extraction, History Management, Data Cleaning, Data Integration, History Management, Update Propagation and Customization. Each of these activities could be executed on a different site. The activities are interlinked through rules, denoted by arrows; in a real-world case study in banking, no fewer than 17 kinds of knowledge sources determined this process [15]. The gray background in Fig. 4 implies that there is a composition hierarchy in the set of data warehouse operational processes. In fact, the need to isolate only a small subset of the overall processes of the warehouse is common. Any metamodel must be able to support zooming in and out of the process structure, in order to achieve this functionality.

The most common reason for this kind of inspection is to avoid or recover from erroneous execution during runtime. The structure of a process is important; moreover, the specific traces of executed processes should be tracked down, too. If the repository is able to capture this kind of information, it gains added value since, ex ante, the data warehouse stakeholders can use it for design purposes (e.g., to select the data warehouse objects necessary for the performance of a task) and, ex post, people can relate the data warehouse objects to decisions, tools and the facts which have happened in the real world [16].

2.4. The data-oriented nature of operational data warehouse processes

Data warehouse activities are of a data-intensive nature in their attempt to push data from the sources to the tables of the data warehouse or the client data marts. We can justify this claim by listing the most common operational processes:

* data extraction processes, which are used for the extraction of information from the legacy systems;

Fig. 4. The Data Warehouse Refreshment process [14].


* data transfer (and loading) processes, used for the instantiation of higher levels of aggregation in the data warehouse with data coming from the sources or lower levels of aggregation;

* data transformation processes, used for the transformation of the propagated data to the desired format;

* data cleaning processes, used to ensure the consistency of the data warehouse data (i.e., the fact that these data respect the database constraints and the business rules);

* computation processes, which are used for the derivation of new information from the stored data (e.g., further aggregation, querying, business logic, etc.).

To deal with the complexity of the data warehouse loading process, specialized Extraction-Transformation-Loading (ETL) tools are available in the market. Their most prominent tasks include:

* the identification of relevant information at the source side,

* the extraction of this information,

* the customization and integration of the information coming from multiple sources into a common format,

* the cleaning of the resulting data set, on the basis of database and business rules, and

* the propagation of the data to the data warehouse and/or data marts.

According to a study for Merrill Lynch [5], ETL and Data Cleaning tools cover a labor-intensive and complex part of the data warehouse processes, estimated to cost at least one third of effort and expenses in the budget of the data warehouse. Demarest [6] mentions that this number can rise up to 80% of the development time in a data warehouse project. Still, due to the complexity and long learning curve of these tools, many organizations turn to in-house development to perform ETL and data cleaning tasks.

2.5. Case study example

To motivate the discussion, we use a part of one of our real-world case studies [17]. The organization collects various data about the annual activities of all the hospitals of a particular region. The source of data, for our example, is a COBOL file, dealing with the annual information by class of beds and hospital (here we use only three classes, namely A, B and C). It yields a specific attribute for each type of class of beds. Periodically, the COBOL file is transferred from the production system to the data warehouse and stored in a "buffer" table of the data warehouse, acting as a mirror of the file.

Then, the tuples of the buffer table are used by computation procedures to further populate a "fact" table inside the data warehouse. Finally, several materialized views are populated with aggregate information and used by client tools for querying.

The entity Type denotes the logical schema for all kinds of data stores. Each Type is characterized by a name and a set of Fields. In our example, we assume the following four Types: CBL, Buffer, Class info and V1. The schemata of these types are depicted in Fig. 5. There are four atomic Activities in the data warehouse: Loading, Cleaning, Computation and Aggregation. The Loading activity simply copies the data from the CBL COBOL file to the Buffer type. H ID is an identifier for the hospital and the three last attributes hold the number of beds per class. The Cleaning activity deletes all entries violating the primary key constraint. The Computation activity transforms the imported data to a different schema; the date is converted from American to European format and the rest of the attributes are converted to a combination (Class id, #Beds). For example, if (03,12/31/1999,30,0,50) is a tuple in the Buffer table, the respective tuples in the Class Info table are {(03,31/Dec/1999,A,30), (03,31/Dec/1999,C,50)}. The Aggregation activity produces the sum of beds by hospital and year. The combination of all the aforementioned activities is captured by the composite activity Populate V1.
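For concreteness, the Computation and Aggregation activities could be rendered as SQL statements of the following form. This is only a sketch consistent with the description above, not necessarily the exact expressions of Fig. 9; the precise attribute names and the date-conversion function EURO_DATE are assumptions.

-- Computation: unpivot the per-class bed counts of Buffer into
-- (Class_id, #Beds) pairs, dropping empty classes, and convert the
-- American date to European format (EURO_DATE is a placeholder function).
INSERT INTO Class_info (H_ID, EDate, Class_id, Beds)
SELECT H_ID, EURO_DATE(ADate), 'A', Class_A FROM Buffer WHERE Class_A > 0
UNION ALL
SELECT H_ID, EURO_DATE(ADate), 'B', Class_B FROM Buffer WHERE Class_B > 0
UNION ALL
SELECT H_ID, EURO_DATE(ADate), 'C', Class_C FROM Buffer WHERE Class_C > 0;

-- Aggregation: total number of beds per hospital and date, appended to V1
-- (the paper describes the grouping as "by hospital and year").
INSERT INTO V1 (H_ID, EDate, Beds)
SELECT H_ID, EDate, SUM(Beds)
FROM Class_info
GROUP BY H_ID, EDate;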

3. The metamodel of data warehouse operational processes

We start the presentation of the metamodel for data warehouse operational processes from the logical perspective, to show how the metamodel deals with the requirements of structure complexity and capturing of data semantics in the next two sections.


Then, in Sections 3.3 and 3.4 we present the physical and the conceptual perspectives. In the former, the requirement of trace logging will be fulfilled too. The full metamodel is presented in Fig. 6.

For the implementation, we have used a meta-database as a repository for meta-information of the data warehouse components. The architecture, quality and process models are represented in Telos [18], a conceptual modeling language for representing knowledge about information systems. A prototype was implemented in the object-oriented deductive database system ConceptBase [16], which provides query facilities and a language for constraints and deductive rules. The implementation of the process metamodel in ConceptBase is straightforward; thus, for reasons of presentation, we follow an informal, bird's-eye view of the model.

3.1. Complexity of the process structure

Following the Workflow Coalition [7], the main entity of the logical perspective is Activity. An activity represents a unit of "work which is processed by a combination of resource and computer applications". Activities can be complex, as captured by the specialization of Activity, namely CompositeActivity. This gives the possibility of zooming in and out of the repository. The components of composite activities are Process Elements. Class ProcessElement is a generalization of the entities Activity and TransitionElement. A transition element is employed for the interconnection of activities participating in a complex activity. The attribute Next captures the sequence of events. Formally, a Process Element is characterized by the following attributes:

* Name: to uniquely identify the ProcessElement within the extension of its class.

* Next: a ProcessElement which is next in the sequence of a composite activity, characterized by:
  * Context: Since two activities can be interrelated in more than one complex DW process, the context of this interrelationship is captured by the relevant CompositeActivity instance.
  * Semantics: denotes whether the next activity in a schedule happens upon successful termination of the previous activity (COMMIT) or if a contingency action is required (ABORT).

A TransitionElement is a specialization of ProcessElement, used to support scheduling of the control flow within a composite activity. This extra functionality is supported by two mechanisms. First, we enrich the Next link with more meaning, by adding a Condition attribute to it. A Condition is a logical expression in Telos denoting that the firing of the next activity is performed when the Condition is met.

Fig. 5. Motivating example.


Second, we specialize the class TransitionElement to four subclasses (not shown in Fig. 6), capturing the basic connectives of activities [7]: Split AND, Split XOR, Join AND, Join XOR. The semantics of these entities are the same as the ones of the WfMC proposal. For example, the Next activity of a Join XOR instance is fired when (a) the Join XOR has at least two activities "pointing" to it through the Next attribute and only one Next Activity (well-formedness constraint), (b) the Condition of the Join XOR is met and (c) at least one of the "incoming" activities has COMMIT semantics in the Next attribute. This behavior can be expressed in Telos with appropriate rules. Similarly, Split AND denotes a point in the process chain where more than one concurrent execution thread is initiated, Split XOR denotes a point where exactly one execution thread is to be initiated (out of many different alternatives) and Join AND acts as a rendezvous point for several concurrent threads, where the execution of the process flow is paused until all incoming activities have completed their execution.

The WfMC proposes two more ways of transition between Activities. Dummy activities perform routing based on condition checking; they are modeled as simple Transition Elements. LOOP activities are captured as instances of CompositeActivity, with an extra attribute: the for condition. Fig. 7 shows the modeling of a composite activity, composed of two sub-activities, where the second is fired when the first activity commits and a Boolean condition is fulfilled. P1 Commited? is a transition element.

3.2. Relationship with data

A Type denotes the schema for all kinds of data stores. Formally, a Type is defined as a specialization of LogicalObject with the following attributes:

Fig. 6. The metamodel for data warehouse operational processes.


* Name: a single name denoting a unique Type instance.

* Fields: a multi-value attribute. In other words, each Type has a name and a set of Fields, exactly like a relation in the relational model.

* Stored: a DataStore, a physical object representing software used to manipulate stored data (e.g., a DBMS). This attribute will be detailed in the description of the physical perspective.

* ModeledFrom: a Concept, an object in the conceptual perspective, providing a description of the type in a more user-friendly manner. A Concept is the generalization of Entities and Relationships in the Entity-Relationship model (not depicted in Fig. 6).

Any kind of physical data store (multidimensional arrays, COBOL files, even reports) can be represented by a Type in the logical perspective. For example, the schema of multidimensional cubes is of the form [D1, ..., Dn, M1, ..., Mm], where the Di represent dimensions (forming the primary key of the cube) and the Mj measures [27]. COBOL files, as another example, are records with fields having two peculiarities: nested records and alternative representations. One can easily unfold the nested records and choose one of the alternative representations.
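As an illustration of this mapping (a sketch of ours, with assumed field names), the Buffer type mirroring the COBOL source file of the case study, and a small cube, could be declared as relational Types along the following lines:

-- The Buffer Type, mirroring the COBOL source file of the case study.
CREATE TABLE Buffer (
    H_ID    CHAR(8),     -- hospital identifier
    ADate   CHAR(10),    -- date, still in the American format of the source
    Class_A INTEGER,     -- beds of class A
    Class_B INTEGER,     -- beds of class B
    Class_C INTEGER      -- beds of class C
);

-- A cube [D1, ..., Dn, M1, ..., Mm]: the dimension fields form the primary key.
CREATE TABLE Beds_Cube (
    Hospital VARCHAR(30),  -- dimension D1
    Year     INTEGER,      -- dimension D2
    Beds     INTEGER,      -- measure M1
    PRIMARY KEY (Hospital, Year)
);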

Each Activity in a data warehouse environment is linked to a set of incoming and outgoing types. We capture the relationship of activities to data by expressing the outcome of a data warehouse process as a function over its input data stores. This function is captured through SQL queries, extended with functions. An Activity is formally characterized by the following attributes:

* Name, Next: inherited from Process Element.

* Input: a multi-valued Type attribute modeling all data stores used by the activity to acquire data.

* Output: a single-valued Type attribute. This attribute models the data store or report where the activity outputs data. The Output attribute is further explained by two attributes:
  * Semantics: a single value belonging to the set {Insert, Update, Delete, Select} (captured as the domain of class ActivityOutSemantics). A process can either add, delete, or update the data in a data store. Also, it can output messages to the user (captured by using a "Message" Type and Select semantics).
  * Expression: a single SQL query (an instance of class SQLQuery) denoting the relationship of the output and the input types, possibly with functions.

* ExecutedBy: a physical Agent (i.e., an application program) executing the Activity. More information on agents will be provided in the description of the physical perspective.

Fig. 7. Example of a composite activity: ON COMMIT(P1), IF <condition> START.


* HasRole: a conceptual description of the activity. This attribute will be properly explained in the description of the conceptual perspective.

Fig. 8 shows how various types of processes can be modeled in our approach.

To gain more insight into the proposed modeling approach, consider the example of Fig. 5. The expressions and semantics for each activity are listed in Fig. 9. All activities append data to the involved types, so they have INS semantics, except for the cleaning process, which deletes data, and thus has DEL semantics. We do not imply that everything should actually be implemented using the employed queries, but rather that the relationship of the input and the output of an activity is expressed as a function, through a declarative language such as SQL.
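As an illustration of such an expression (a sketch of ours, not necessarily identical to the entry in Fig. 9; the key attributes are assumptions), the Cleaning activity with its DEL semantics could be written as:

-- Cleaning: delete Buffer rows violating the primary key of the target,
-- i.e., rows whose key combination occurs more than once.
DELETE FROM Buffer
WHERE (H_ID, ADate) IN (SELECT H_ID, ADate
                        FROM Buffer
                        GROUP BY H_ID, ADate
                        HAVING COUNT(*) > 1);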

3.3. The physical perspective

While the logical perspective covers the structure ("what") of a process, the physical perspective covers the details of its execution ("how"). Each process is executed by an Agent, i.e., an application program. Each Type is assigned to a DataStore (providing information for issues like table spaces, indexes, etc.). An Agent can be formally described as follows:

Fig. 8. Examples of Output attribute for particular kinds of activities.

Fig. 9. Expressions and semantics for the activities of the example.



* Name: to uniquely identify the Agent within the extension of its class.

* Execution parameters: a multi-value string attribute capturing any extra information about the execution of an agent.

* In, Out: physical DataStores communicating with the Agent. The types used by the respective logical activity must be stored within these data stores.

* HasTraces: a multi-value attribute capturing all the execution traces of the Agent.

An Execution Trace is an entity capturing the details of the execution of an activity by the proper Agent:

* TraceID: a unique identification number for each Execution Trace.

* State: a single value belonging to the domain of class AgentStateDomain = {In Progress, Commit, Abort}.

* Init time, Abort time, Commit time: timestamps denoting the Timepoints when the respective events have occurred.

* Context: another Execution Trace. This attribute is used in the case where a script (Agent) is simply the coordinator script for the execution of several other Agents, i.e., in Composite Activities. In this case, the Execution Trace of the Agent of the Composite Activity defines the context for the execution of the coordinated agents.

The information of the physical perspective can be used to trace and monitor the execution of data warehouse processes. Fig. 10 sketches the trace information after a successful execution of the process described in Fig. 5. We show the relationship between the logical and the physical perspective by linking each logical activity to a physical application program. Each agent has a set of execution traces. We can see that the trace of composite activity Populate V1 has as Init time the time of the execution of the first sub-activity and as Commit time the completion time of the last activity. Also, this composite activity defines the context of the execution of its sub-activities: this is captured by properly populating the attribute Context.
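To make this monitoring concrete, one possible relational rendering of execution traces and a monitoring query are sketched below; the actual repository stores traces as Telos objects in ConceptBase, and the table, column names and the sample trace identifier are hypothetical.

-- Illustrative relational encoding of Agent execution traces.
CREATE TABLE Execution_Trace (
    TraceID     INTEGER PRIMARY KEY,
    Agent       VARCHAR(50),
    State       VARCHAR(12) CHECK (State IN ('In_Progress', 'Commit', 'Abort')),
    Init_time   TIMESTAMP,
    Commit_time TIMESTAMP,
    Abort_time  TIMESTAMP,
    Context     INTEGER REFERENCES Execution_Trace(TraceID)  -- enclosing composite run
);

-- Which sub-activities of composite run 100 (a hypothetical trace) had not
-- committed at the moment the run was inspected?
SELECT Agent, State, Init_time
FROM Execution_Trace
WHERE Context = 100
  AND State <> 'Commit';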

3.4. The conceptual perspective

A major purpose behind the introduction of the conceptual perspective is to help stakeholders understand the reasoning behind decisions on the architecture and physical characteristics of data warehouse processes. Our modeling approach captures dependency and suitability relationships among the basic conceptual entities to facilitate the design, administration and evolution of the data warehouse.

Each Type in the logical perspective is the counterpart of a Concept in the conceptual perspective. A concept represents a class of real-world objects, in terms of a conceptual metamodel, e.g., the Entity-Relationship or UML notation.

Fig. 10. Trace information after a successful execution of the process of the example.


Both Types and Concepts are constructed from Fields (representing their attributes), through the attribute fields. We consider Field to be a subtype both of LogicalObject and ConceptualObject. The central conceptual entity is the Role, which generalizes the conceptual counterparts of activities, stakeholders and data stores. The class Role is used to express the interdependencies of these entities, through the attribute RelatesTo. Activity Role, Stakeholder and Concept are specializations of Role for processes, persons and concepts in the conceptual perspective. Formally, a Role is defined as follows:

* RelatesTo: another Role.

* As: a single value belonging to the domain of class RelationshipDomain = {suitable, dependent}.

* Wrt: a multi-valued attribute including instances of class ConceptualObject.

* dueTo: a text string attribute, documenting extra information on the relationship of two roles.

Each Role represents a person, program or data store participating in the environment of a process, charged with a specific task and/or responsibility. An instance of the RelatesTo relationship is a statement about the interrelationship between two roles in the real world, such as 'View V1 relates to table Class Info with respect to the attributes H ID, EDate and #Beds as dependent, due to loading reasons'. Note that, since both data and processes can be characterized by SQL statements, their interrelationship can be traced in terms of attributes.

The conceptual perspective is influenced by the Actor Dependency model [4]. In this model, actors depend on each other for the accomplishment of goals and the delivery of products. The dependency notion is powerful enough to capture the relationships in the context of a data flow, where a data consumer (person, data store or program) depends on the proper function of its data providers to achieve its mission. Our extension can capture suitability as well (e.g., in the case where more than one concept can apply for the population of the aggregation, one concept is suitable to replace the other).

In the process of understanding occurring errors or the design decisions on the architecture of a data warehouse, the conceptual perspective can be exploited in various ways.

1. The design of the data warehouse is supported, since the conceptual model serves as a documentation repository for the reasons behind the structure of the data warehouse. The model allows the tracing of the relationships between any pair of persons, programs or data stores. With minimum query facilities of the metadata repository, these interdependencies do not have to be directly stored in all the cases, but can also be computed incrementally, due to their transitivity.

2. The administration of the data warehouse is facilitated in several ways. The conceptual model is a good roadmap for the quality management of the warehouse and can act as an entry point to the logical perspective, since it can enable the user to pass from the abstract relationships of roles to the structure of the system. During runtime, the suitability notion can be used to obtain solutions to potential errors or problems (e.g., alternative paths for the population of the data warehouse) in a straightforward way.

3. Data warehouse evolution is supported at two levels. At the entity level, the impact of any changes in the architecture of the warehouse can be detected through the sheer existence of dependency links. At the same time, the existence of suitability links suggests alternatives for the new structure of the warehouse. At the attribute level, on the other hand, internal changes in the schema of data stores or the interface of software agents can be detected, by using the details of the relationships of the data warehouse roles. For example, the previous statement for the relationship of view V1 and table Class Info could be interpreted as 'View V1 is affected by any changes to table Class Info and especially the attributes H ID, EDate and #Beds'.


We detail all the aforementioned features and benefits through our example. As we can see in Fig. 11, our conceptual model consists of several entities, including:

* Aggregation Process is an Activity Role, corresponding to the logical activity Aggregation;

* Information By Class and Date (for brevity, "Info By Class" in Fig. 11) is a Concept corresponding to the Type "Class Info";

* Hospital Information By Date (for brevity, "Hospital Info" in Fig. 11) is also a Concept corresponding to the Type "V1";

* Administrator 1, Administrator 2 (for brevity, "Admin 1 and 2" in Fig. 11) as well as End User are the involved Stakeholders, with obvious roles in the system.

Clearly, Hospital Information By Date is an aggregation over Information By Class and Date, and what the conceptual schema actually tells us is that it is the Aggregation Process that performs the aggregation in practice. Due to the data flow, there exists a strong dependency between the aforementioned roles. Hospital Information By Date depends on Aggregation since the latter acts as a data pump for the former. Hospital Information By Date transitively depends on Information By Class. Administrator 1 also depends on Information By Class. Although not depicted in Fig. 11 to avoid overloading the figure, this relationship relies on the idea of political dependency; up to now, it was Administrator 1 who provided the data for this kind of information and, in fact, he still does within the data warehouse environment.

Another interesting point shown in Fig. 11 is the idea of suitability: according to the stated needs of the users, the concept Hospital Information By Date represents information which is suitable for the End User. It is worth stressing that although some of the entities correspond to processes, others to stakeholders and others to data stores, this does not affect the uniformity and the simplicity of the representation. Also, notice that we do not delve into the fields of conceptual multidimensional aggregation (for example, see [19]) or requirements engineering: the specific approach one can adopt is orthogonal to our modeling.

The notion of suitability helps to support data warehouse evolution. As already mentioned, it is the redundancy in the data warehouse that makes suitability so attractive. Consider the following real-world case, where the information in the final view V1 was not 100% consistent. Simple measurements of quality (cf. Section 4) indicated that the COBOL file CBL was responsible for this quality problem (upper part of Fig. 12). Thus, the original data provider was flawed. Out of the many different choices one could possibly have to resolve the problem, the most suitable one proved to be the exploitation of redundancy. An alternative population scheme for view V1 used the source file CBL' as initial input. CBL' is the corresponding data store of the Concept Information By Department and Date ("Info By Dept" in Fig. 12) and captures the number of beds by hospital department (instead of class of beds).

Sometimes, suitability in the conceptual model of a data warehouse can be automatically derived from aggregate reasoners and algorithms proposed by previous research on view containment [20–23]. Again, we would like to stress that suitability in our proposal is not restricted to either persons or resources, but can uniformly cover all the entities in a conceptual model of a data warehouse.

Apart from the support for evolution of the data warehouse at the entity level, the proposed model is also capable of supporting evolution at the attribute level. As mentioned, the relationship between two roles is normally expressed in terms of fields.

Fig. 11. The conceptual perspective for the example.


In our example, Aggregation depends on a subset of the attributes of Info By Class, namely H ID, DATE and #Beds. If the attribute Class ID changes, e.g., due to a change of type, removal, or renaming, the final information of the Concept "Hospital Info" is not affected at all. On the other hand, changes in any of the attributes depicted in Fig. 11 clearly affect the information delivered to end users.

It is interesting to note the political conflict that takes place due to the proposed change. As we can see, removing Info By Class from the data flow automatically affects the Stakeholder Administrator 1, who depends on Info By Class. A simple query in the metadata repository for the dependents of the entity Info By Class could give a warning about the political vibrations coming from such a decision.¹ Finally, we should also mention that hidden within the short story that we have just summarized is the idea of quality measurement, which will be detailed in Section 4.
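Assuming, purely for illustration, that the RelatesTo links of the conceptual perspective were exported to a relational table (the actual query would be posed in Telos against ConceptBase, and all names below are hypothetical), such a warning query could look as follows:

-- Hypothetical relational export of the conceptual RelatesTo links.
CREATE TABLE relates_to (
    source_role VARCHAR(40),   -- the Role that relates to another Role
    target_role VARCHAR(40),   -- the Role it relates to
    as_kind     VARCHAR(10),   -- 'dependent' or 'suitable'
    wrt_fields  VARCHAR(200),  -- fields involved in the relationship
    due_to      VARCHAR(200)   -- documented reason for the relationship
);

-- Who depends directly on the concept Info_By_Class?
SELECT source_role, wrt_fields, due_to
FROM relates_to
WHERE target_role = 'Info_By_Class'
  AND as_kind = 'dependent';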

3.5. Facilitating data warehouse design through consistency checking in the metadata repository

To ensure the validity of the representation in the metadata repository, consistency checks can be performed during the design of the data warehouse, by defining constraints, rules and views in the language Telos. The following view finds out whether the types used as inputs of an activity are stored in the respective data stores used as inputs of the agent that executes the activity.

QueryClass InconsistentInTypes isA Type
with constraint
  c: $ exists d/DataStore ac/Activity ag/Agent
     (ac input this) and (ac executedBy ag) and
     (ag input d) and not (this storedIn d) $
end

Other simple constraints involve the local structure of the process elements. For example, split transition elements must have at least one incoming edge and more than one outgoing edge. The timestamps of the agent should also be consistent with its state. The repository can also be used by external programs to support the execution of consistency checking algorithms as proposed in [24,25].

Fig. 12. Data warehouse evolution. Upper part: original problematic configuration. Lower part: the new configuration.

¹ We take the opportunity to stress the huge importance of politics in the development lifecycle of a data warehouse. See [6,17] for more details.

3.6. Repository support for the automatic derivation of role interdependencies

The Roles of the conceptual perspective can be directly assigned by the data warehouse administrator, or other interested stakeholders. Still, one could argue that this is too much of an administrative burden, given the number of relationships that should be traced in the repository. By deductive reasoning in the metadata repository, we can impose simple rules to deduce these interdependencies from the structure of data stores and activities. For example, we can derive dependency relationships by exploiting the structure of the logical perspective of the metadata repository. Also, interdependencies do not have to be directly stored in all the cases, but can also be computed incrementally, due to the transitivity of their nature. In Fig. 13 we show three simple rules which can be used for the production of role interdependencies. They can also be implemented in the metadata repository.
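The transitivity that these rules exploit can also be sketched in recursive SQL over the hypothetical relates_to table introduced in Section 3.4; this is only an illustration of the reasoning, since the repository itself evaluates such rules deductively in ConceptBase.

-- All roles that directly or transitively depend on a given role.
WITH RECURSIVE depends_on (source_role, target_role) AS (
    SELECT source_role, target_role
    FROM relates_to
    WHERE as_kind = 'dependent'
  UNION
    SELECT d.source_role, r.target_role
    FROM depends_on d
    JOIN relates_to r
      ON r.source_role = d.target_role
     AND r.as_kind = 'dependent'
)
SELECT DISTINCT source_role
FROM depends_on
WHERE target_role = 'CBL';   -- e.g., everything ultimately fed by the COBOL file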

4. Process quality

In this section we present how the process metamodel is linked to the metamodel for data warehouse quality proposed in [2]. Moreover, we complement this quality metamodel with specific dimensions for data warehouse operational processes.

4.1. Terminology for quality management

The quality metamodel in [2] customizes the GQM approach of [11] for data warehouse environments. In this section, we adopt the same metamodel for the operational processes of the data warehouse.

Each object in the data warehouse is linked to a set of quality goals and a set of quality factors (Fig. 15). A quality goal is an abstract requirement, defined on data warehouse objects, and documented by a purpose and the stakeholder interested in it, e.g., 'improve the availability of source S1 until the end of the month in the viewpoint of the data warehouse administrator'. Quality dimensions (e.g., 'availability') are used to group quality goals and factors into different categories. A Quality Factor represents a quantitative assessment of a particular aspect of a data warehouse object, i.e., it relates quality aspects both to actual measurements and expected ranges for these quality values. Finally, the method of measurement is attached to a quality factor through a measuring agent.

The bridge between the abstract, subjective quality goals and the specific, objective quality factors is determined through a set of quality queries, to which quality factor values are provided as possible answers. Such queries are the outcome of the methodological approach described in [3], which offers template quality factors and dimensions defined at the metadata level and instantiates them for the specific data warehouse architecture under examination. As a result of the goal evaluation process, a set of improvements (i.e., design decisions) can be proposed, in order to achieve the expected quality.

4.2. Quality dimensions and factors for data warehouse operational processes

The ISO 9126 standard [ISO97] on software implementation and evaluation provides a general understanding of how to measure the quality of software systems. Data warehouses do not stray from these general guidelines; thus, we adopt the standard as a starting point.

Fig. 13. Simple rules for the production of role interdependencies.


ISO 9126 is based on six high-level quality dimensions (Functionality, Reliability, Usability, Efficiency, Maintainability, Portability). Time and budget constraints in the development of a data warehouse cause the addition of Implementation Effectiveness. The dimensions are analyzed into several sub-dimensions (Fig. 14).

ISO 9126 does not provide specific quality factors. To deal with this shortcoming, Appendix A gives a set of quality factors customized for the case of data warehouse operational processes. It does not detail the whole set of possible factors for all operational data warehouse processes; rather, we intend to come up with a minimal representative set. This set of quality factors can be refined and enriched by the data warehouse stakeholders with customized factors. Once again, we encourage the use of "templates" in a way that fits naturally with the overall metadata framework that we propose.

4.3. Relationships between processes and quality

Quality goals describe intentions of data warehouse users with respect to the status of the data warehouse. In contrast, our process model describes facts about the current and previous status of the data warehouse and what activities are performed in the data warehouse. However, the reason behind the existence of a process is a quality goal. For example, a data cleaning process is executed in the data staging area in order to improve the accuracy of the data warehouse.

Fig. 14. Software quality dimensions [26] and proposed quality factors in data warehouse environments.

P. Vassiliadis et al. / Information Systems 26 (2001) 205–236 221

Page 18: Data warehouse process managementpvassil/publications/2001_IS_DW/IS_2001_DW.pdf · part of the data warehouse environment: the operational data warehouse processes. The combi-nation

improve the accuracy of the data warehouse. Wehave represented this interdependency betweenprocesses and quality goals by establishing arelationship between roles and data warehouseobjects in the conceptual perspective of the processmodel (relationship Imposed On). This is shown inthe upper part of Fig. 15.

Our model is capable of capturing all dependency types mentioned in [4]. Task dependencies, where the dependum is an activity, are captured by assigning the appropriate role to the attribute Wrt of the relationship. Resource dependencies, where the dependum is the availability of a resource, are modeled when fields or concepts populate this attribute. The relationship ExpressedFor relates a role to a high-level quality goal; thus, Goal dependencies, dealing with the possibility of making a condition true in the real world, are captured by the model, too. Soft-goal dependencies are a specialization of goal dependencies, where evaluation cannot be done in terms of concrete quality factors.

The lower part of Fig. 15 represents the relationship between processes and quality on a more operational level. The actions of an agent in the data warehouse affect the expected or measured quality factors of some data warehouse objects. For example, a data cleaning process affects the availability of a source: it decreases the amount of time during which it can be used for regular operations. Consequently, this process will affect the quality factors Average Load, CPU state, Available Memory defined on a Source. All these quality factors are concrete representations of the abstract notion Availability, the relevant quality dimension. The effect of a data warehouse process must always be confirmed by new measurements of the quality factors. Unexpected effects of data warehouse processes can be detected by comparing the measurements with the expected behavior of the process. The measurement of the quality of the particular agents through their own quality factors is analyzed in [1].

A Quality Query provides the methodological bridge to link the high-level, user-oriented, subjective quality goals to the low-level, objective, component-oriented quality factors. The vocabulary (or domain) of quality queries with respect to the process model is the set of data warehouse activities, which can be mapped to reasons (roles) and conditions (of agents) of a specific situation.

Let us return to our hospital example to clarify how the process and the quality metamodels interplay, and how the different perspectives gracefully map to each other. More than one of our experiences in the public sector indicated a need for total quality of data, where no errors were allowed and no information was missing. Thus, the quality goal is '100% quality of data delivered to the end users'.

For the purpose of our example, we narrow this high-level goal to the subgoal of 100% consistency of the produced information. There are two objects involved in this quality goal, namely the quality dimension consistency and the role end user. Both entities, as well as the quality goal itself, belong to the conceptual perspective and can be used to explain why the chain of processes exists: to bring clean, consistent information to the involved stakeholders.

According to the GQM paradigm, a good starting point for examining a situation is to find out its current status. With respect to the elements of the process model, the basic question over process status naturally concerns the correctness dimension: are all activities performing as they should? The question, itself belonging to the logical perspective, involves an object of the logical part of the process metamodel: activities. If one applies the methodology of [3], this question is directly decomposed into five subsequent questions, each involving one of the activities Loading, Cleaning, Computation, Aggregation, and Populate V1.

Fig. 15. Relationships between processes and quality.


The actual quality evaluation of the produced software is done in terms of concrete measurements. In the physical perspective (where the quality factors that provide the measurements belong), the measurements involve specific software agents. In our case, the quality factor Correctness through White Box Software Testing (WBCorrectness in the sequel) provides this kind of measurement.

The discrimination of logical, conceptual and physical perspectives proves useful once more in the quality management of the data warehouse: the quality goals can express "why" things have happened (or should happen) in the data warehouse, the quality questions try to discover "what" actually happens and, finally, the quality factors express "how" this reality is measured (Fig. 16). On top of this, we have organized a seamless integration of the process and quality models, by mapping the objects of the same perspectives to each other.

In Fig. 17 we can also see the assignment of quality factors to the various objects at the physical perspective. We assign the quality factor WBCorrectness to the software agents and the quality factors Consistency and Completeness to the data stores. A simple formula derives the quality of the data stores in the later steps of the data flow from the quality of the previous data stores and software agents. Let us take Consistency, for example:

\mathrm{consistency}(ds) = \prod_{a} \mathrm{correctness}(a) \cdot \prod_{ds'} \mathrm{consistency}(ds')     (1)

where ds is the data store under consideration, a denotes the agents having ds as their output and COMMIT semantics, and ds' is any data store different from ds, serving as input to the agents a.

Clearly, the consistency of the final view V1 depends on the consistency of all the previous data stores and the correctness of all the involved software agents. Although arguably naive, the formula fitted perfectly in our real-world scenario.
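To make formula (1) concrete, here is a purely illustrative instantiation; the intermediate store name T_agg and all numeric values are assumptions for the sake of the example and are not taken from the case study. If V1 is populated by a single agent (say, the Aggregation activity) reading from one data store T_agg, then

\mathrm{consistency}(V1) = \mathrm{correctness}(Aggregation) \cdot \mathrm{consistency}(T_{agg}) = 0.99 \times 0.95 \approx 0.94

i.e., the imperfections of the populating agent and of its input store multiply, so a derived store can never be more consistent than the elements of its population chain.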

Fig. 16. Perspectives and interrelationships for the process and quality metamodels.

Fig. 17. Application of quality factors to the entities of the example.


4.4. Exploitation of quality modeling in the repository for data warehouse administration

The different instantiations of the quality model can be exploited to assist both the design and the administration of the data warehouse. In [27], it has been shown how the classification of quality dimensions, factors and goals can be combined with existing algorithms to address the data warehouse design problem (i.e., the selection of the set of materialized views with the minimum overall "cost" that fulfill all the quality goals set by the involved stakeholders). As far as the administration of the warehouse is concerned, the information stored in the repository may be used to find deficiencies in a data warehouse.

To show how the quality model is exploited, we take the following query. It returns all data cleaning activities which have decreased the accuracy of a data store according to the stored measurements. The significance of the query is that it can show that the implementation of the data cleaning process has become inefficient.

GenericQueryClass DecreasedAccuracy isA DWCleaningAgent with
  parameter
    ds : DataStore
  constraint
    c : $ exists qf1,qf2/DataStoreAccuracy t1,t2,t3/CommitTime v1,v2/Integer
          (qf1 onObject ds) and (qf2 onObject ds) and
          (this affects qf1) and (this affects qf2) and
          (this executedOn t3) and (qf1 when t1) and (qf2 when t2) and
          (t1 < t2) and (t1 < t3) and (t3 < t2) and
          (qf1 achieved v1) and (qf2 achieved v2) and (v1 > v2) $
end

The query has a data store as parameter, i.e., it will return only cleaning processes that are related to the specified data store. The query returns the agents which have worked on the specified data store and which were executed between the measurements of quality factors qf1 and qf2, where the measured value of the newer quality factor is lower than the value of the older one. The attribute executedOn of an agent represents the time when this agent was executed.
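The same check can also be sketched in SQL, assuming a purely hypothetical relational encoding of the repository; the tables quality_measurement and agent_execution and their columns below are illustrative assumptions and do not correspond to an actual ConceptBase export:

  -- Hypothetical encoding: one row per quality measurement and per agent execution.
  -- quality_measurement(object_name, factor_type, value, measured_at)
  -- agent_execution(agent_name, agent_type, object_name, executed_at)
  SELECT ae.agent_name
  FROM   agent_execution    ae,
         quality_measurement qm1,
         quality_measurement qm2
  WHERE  ae.agent_type   = 'DWCleaningAgent'
    AND  ae.object_name  = 'Buffer'              -- plays the role of the parameter ds
    AND  qm1.object_name = ae.object_name
    AND  qm2.object_name = ae.object_name
    AND  qm1.factor_type = 'DataStoreAccuracy'
    AND  qm2.factor_type = 'DataStoreAccuracy'
    AND  qm1.measured_at < ae.executed_at        -- older measurement before the execution
    AND  ae.executed_at  < qm2.measured_at       -- newer measurement after the execution
    AND  qm1.value > qm2.value;                  -- the measured accuracy has decreased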

5. Repository support for data warehouse description and evolution

Summarizing the discussion so far, during the design phase the user can check the consistency of his/her design, to determine any violations of the business logic of the data warehouse or of simple rules over the structure of the data warehouse schema. During the administration phase, we can use the repository to discover quality problems.

In this section, we continue to show how the metadata repository can be exploited in different ways. First, we complement the perception of data warehouses as collections of materialized views with a precise operational description of the content of data warehouse tables. Second, a particular task in the data warehouse lifecycle, data warehouse evolution, is examined separately, in order to determine possible impacts when the schema of a particular table in the data warehouse changes.

5.1. Why data warehouses are not (just) collections of materialized views

Many database researchers have considered data warehouses to be collections of materialized views, organized in strata where the views of a particular stratum are populated from the views of a lower stratum. For example, in [17], where the papers of three major database conferences related to data warehousing between the years 1995 and 1999 are classified into different categories, almost half of the papers (around 46%) deal with view maintenance and integration issues. The papers on view maintenance have focused on algorithms for updating the contents of a view in the presence of changes in the sources. Papers related to integration have targeted the production of a single interface for the processing of distributed heterogeneous data, along with query processing techniques for that purpose and the resolution of conflicts at the schema level. One can observe, thus, that most of the performed research has been dedicated to what should be extracted and loaded, instead of how this process is actually performed. Practical aspects of extraction, loading and conversion processes, such as scheduling, declarative process definition, or data peculiarities (source format, errors, conversions) are clearly neglected (see [17] for a broader discussion).

In summary, although the abstraction of treating data warehouses as strata of materialized views has efficiently served the purpose of investigating the issues of (incremental) view maintenance, it lacks the ability to accurately describe the real content of data warehouse tables, due to the fact that all the intermediate processing, transformation and reorganization of the information is systematically absent from most research efforts.

Our modeling approach follows a different path, by treating data warehouse processes as first-class citizens. The semantic definitions for data warehouse views are not assigned directly by the designer but result from combining the respective definitions of the data warehouse processes. Thus, we can argue that there are two ways to define the semantics of a table in the data warehouse:

* A Type can be defined as a materialized view over its previous Types in the data flow. This is the definition of what the Type should contain ideally, i.e., in a situation where no physical transformations or cleaning exists.

* A Type can be defined as a view again, resulting from the adoption of our modeling approach. In this case, the resulting definition explains how the contents of the Type are actually produced, due to schema heterogeneity and bad quality of data.

Of course, both kinds of definition are useful, but only the former has been taken into consideration in previous research. To remedy this shortcoming, the rest of this subsection is dedicated to showing how we can derive view definitions from the definitions of their populating processes.

To give an intuition of the difference between the two approaches, consider the example of Fig. 5. Ideally, we would like to express the view V1 in terms of the input file CBL (or its relational counterpart, table Buffer). A simple formula suffices to give this intentional semantics:

SELECT H_ID, EUROPEAN(DATE) AS EDATE,
       CLASS_A + CLASS_B + CLASS_C AS SUM_BEDS
FROM   CBL

On the other hand, reality is clearly different. It involves the identification of multiple rows for the same hospital at the same time period, and the restructuring of the information to a normalized format before the loading of data into view V1. The full expression capturing this operational semantics is definitely more complex than the simple SQL query denoting the intentional semantics.
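To give a flavor of such an operational expression, the following sketch folds a hypothetical duplicate-elimination step of the Cleaning activity into the intentional query above; the grouping criterion and the use of MAX for conflict resolution are assumptions for illustration only and do not reproduce the actual expressions of Fig. 19:

  -- Hypothetical operational definition: one row per hospital and reporting date,
  -- resolving duplicate CBL rows before the sum of beds is computed.
  SELECT H_ID, EUROPEAN(DATE) AS EDATE,
         MAX(CLASS_A) + MAX(CLASS_B) + MAX(CLASS_C) AS SUM_BEDS
  FROM   CBL
  GROUP BY H_ID, EUROPEAN(DATE)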

To construct expressions for data warehouse tables with respect to their operational semantics, we constrain ourselves to the case where we are able to construct an acyclic, partially ordered graph of activities (produced by proper queries in ConceptBase). Thus, we can treat the whole set of data warehouse activities as an ordered list. Mutually exclusive, concurrent paths in the partially ordered graph are treated as different lists (the execution trace determines which list is considered each time).

Furthermore, the types belonging to the set SourceSchema, denoting all the types found in the data sources, are treated as the source nodes of the graph. For the rest of the types, we can derive an SQL expression by using existing view reduction algorithms, such as [28] (corrected with the results of [29–31], to which we shall refer as [28+]) and [32–36]. Our algorithm is applicable to graphs of activities that do not involve updates. In most cases, an update operation can be considered as the combination of insertions and deletions or as the application of the appropriate function to the relevant attributes.

The results of the application of this algorithm to our example are shown in Fig. 19. For convenience, we break composite definitions of table expressions into the different lines of Fig. 19. For example, when the third iteration (i=3) refers to the definition of table Buffer, it does so with respect to the definition of line 2 (i=2). The expression of a single type can also be computed locally.


5.2. Repository support for data warehouse evolution

The data warehouse is constantly evolving. New sources are integrated into the overall architecture from time to time. New enterprise and client data stores are built in order to cover novel user requests for information. As time passes, users become more demanding for extra, detailed information. For these reasons, not only the structure but also the processes of the data warehouse evolve.

Fig. 18. Algorithm for extracting the definition of a type in the repository.

Fig. 19. SQL expressions of the types of the example.

The problem, then, is to keep the data warehouse objects and processes consistent in the presence of changes. For example, changing the definition of a materialized view in the data warehouse triggers a chain reaction: the update process must evolve (both the refreshment and the cleaning steps), and the old, historical data must be migrated to the new schema (possibly with respect to new selection conditions). All data stores of the data warehouse and client level which are populated from this particular view must be examined with respect to their schema, content and population processes.

In our approach, we distinguish two kinds of impact of a hypothetical change:

* Direct impact: the change in the data warehouse object requires that some action be taken on an affected object. For example, if an attribute is deleted from a materialized view, then the activity which populates it must also be changed accordingly.

* Implicit impact: the change in the data warehouse object might change the semantics of another object, without necessarily changing the structure of the latter.

Our model enables us to construct a partially ordered graph: for each Type instance, say t, there is a set of types and activities used for the population of t ("before" t), denoted as B(t). Also, there is another set of objects using t for their population ("after" t), denoted as A(t). We can recursively compute the two sets from queries on the metadata repository of process definitions. Queries for the successor and after relationships can be defined in a similar way.
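As a sketch of how such a computation could look outside ConceptBase, the following SQL fragment assumes a hypothetical edge table feeds(src, dst) with one row whenever an object src (a type or an activity) directly feeds an object dst in the population graph; the table, the recursive formulation and the target name V1 are illustrative assumptions only:

  -- B(V1): all types and activities that lie "before" V1 in the population graph.
  WITH RECURSIVE before_t (obj) AS (
      SELECT f.src FROM feeds f WHERE f.dst = 'V1'              -- direct predecessors of t
    UNION
      SELECT f.src FROM feeds f JOIN before_t b ON f.dst = b.obj
  )
  SELECT DISTINCT obj FROM before_t;

The set A(t) is obtained symmetrically, by following the edges in the opposite direction.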

Suppose that the final SQL expression of a type t, say e, changes into e'. In the spirit of [38], we can use the following rules for schema evolution in a data warehouse environment (we consider that the changes abide by the SQL syntax and that the new expression is valid):

* If the select clause of e' has an extra attribute compared to e, then propagate the extra attribute to the base relations: there must be at least one path from one type belonging to a SourceSchema to an activity whose out expression involves the extra attribute. If we delete an attribute from the select clause of a type, it must no longer appear in the select clause of the processes that directly populate the respective type, nor in the following types and the processes that use this type. In the case of attribute addition, the impact is direct for the previous objects B(t) and implicit for the successor objects A(t). For deletion, the impact is direct for both categories.

* If the where clause of e' is more strict than the one of e, the where clause of at least one process belonging to B(t) must change in the same way. If this is not possible, a new process can be added before t, simply deleting the respective tuples through the expression e - e'. If the where clause of e' is less strict than the one of e, subsumption techniques [20–22,39] determine which types can be used to calculate the new expression e' of t. The having clause is treated in the same fashion. The impact is direct for the previous and implicit for the successor objects.

* If an attribute is deleted from the group by clause of e, at least the last activity performing a group-by query should be adjusted accordingly. All subsequent activities in the population chain of t must change too (as if an attribute had been deleted). If this is not feasible, we can add an aggregating process performing this task exactly before t. If an extra attribute is added to the group by clause of e, then at least the last activity performing a group-by query should be adjusted accordingly. The check is performed recursively for the types populating this particular type, too. If this fails, the subsumption techniques mentioned for the where clause can be used for the same purpose again. The impact is direct for both previous and successor objects. Only in the case of attribute addition is it implicit for the successor objects.

Returning to our example, suppose that we decide to remove attribute CLASS_C from the table BUFFER. This change has the following impacts:


* The activities belonging to Previous(BUFFER), namely Loading and Cleaning, must change accordingly, to remove CLASS_C from their select clause.

* The activities belonging to Next(BUFFER), namely Cleaning and Computation, must also change to remove any appearance of CLASS_C in their query expression.

Note that Cleaning participates both in the Previous and in the Next list. Fig. 20 shows the impact of these changes. Attribute CLASS_C is removed from the select list of the Previous activities and from the body of the queries of the Next list.
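As a purely illustrative sketch of such a change (the actual expressions appear in Fig. 20 and are not reproduced here; the cleaning condition below is an assumption), the out expression of an affected activity would simply lose the attribute:

  -- Hypothetical out expression of an activity reading BUFFER, before the change:
  SELECT H_ID, DATE, CLASS_A, CLASS_B, CLASS_C
  FROM   BUFFER
  WHERE  H_ID IS NOT NULL          -- assumed cleaning condition, for illustration

  -- After CLASS_C is removed from BUFFER, the select clause shrinks accordingly:
  SELECT H_ID, DATE, CLASS_A, CLASS_B
  FROM   BUFFER
  WHERE  H_ID IS NOT NULL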

Due to the existence of implicit impacts, we do not provide a fully automated algorithmic solution to the problem, but rather we sketch a methodological set of steps, in the form of suggested actions, to perform this kind of evolution. Similar algorithms for the evolution of views in data warehouses can be found in [38,40]. A tool could easily visualize this evolution plan and allow the user to react to it.

6. Related work

In this section we discuss the state of the art and practice of research efforts, commercial tools and standards in the fields of process and workflow modeling, with particular focus on data warehousing.

6.1. Standards

The standard [7] proposed by the Workflow Management Coalition (WfMC) includes a metamodel for the description of a workflow process specification and a textual grammar for the interchange of process definitions. A workflow process comprises a network of activities, their interrelationships, criteria for starting/ending a process and other information about participants, invoked applications and relevant data. Also, several other entities external to the workflow, such as system and environmental data or the organizational model, are roughly described.

Fig. 20. Impact of the removal of attribute CLASS_C from table BUFFER on the definitions of the affected activities.

The MetaData Coalition (MDC) is an industrial, non-profit consortium which aims to provide a standard definition for enterprise metadata shared between databases, CASE tools and similar applications. The Open Information Model (OIM) [41] is a proposal (led by Microsoft) for the core metadata types found in the operational and data warehousing environment of enterprises. The OIM uses UML both as a modeling language and as the basis for its core model. The OIM is divided into packages that extend UML in order to address different areas of information management. The Database and Warehousing Model is composed of the Database Schema Elements package, the Data Transformations Elements package, the OLAP Schema Elements package and the Record Oriented Legacy Databases package. The Database Schema Elements package contains three other packages: a Schema Elements package (covering the classes modeling tables, views, queries, indexes, etc.), a Catalog and Connections package (covering physical properties of a database and the administration of database connections) and a Data Types package, standardizing a core set of database data types. The Data Transformations Elements package covers basic transformations for relational-to-relational translations. It does not deal with data warehouse process modeling (i.e., it does not cover data propagation, cleaning rules, or querying), but covers in detail the sequence of steps, the functions and mappings employed and the execution traces of data transformations in a data warehouse environment.

6.2. Commercial tools

Basically, commercial ETL tools are responsible for the implementation of the data flow in a data warehouse environment, which is only one (albeit important) of the data warehouse processes. Most ETL tools are of one of two flavors: engine-based or code-generation-based. The former assume that all data have to go through an engine for transformation and processing. In code-generating tools, all processing takes place only at the target or source systems. There is a variety of such tools in the market; we mention three engine-based tools, from Ardent [42], DataMirror [43] and Microsoft [37,44], and one code-generation-based tool, from ETI [45].

6.3. Research efforts

Workflow modeling. There is a growing research interest in the field of workflow management. Sadiq and Orlowska [24] use a simplified workflow model, based on [7], with tasks and control flows as its building elements. The authors present an algorithm for identifying structural conflicts in a control flow specification. The algorithm uses a set of graph reduction rules to test the correctness criteria of deadlock freedom and lack-of-synchronization freedom. In [25] the model is enriched with modeling constructs and algorithms for checking the consistency of workflow temporal constraints. In [46], the authors propose a conceptual model and language for workflows. The model gives the basic entities of a workflow engine and semantics about the execution of a workflow. The proposed model captures the mapping from workflow specification to workflow execution (in particular concerning exception handling). Importance is paid to task interaction, the relationship of workflows to external agents and the access to databases. Other aspects of workflow management are explored in [47,48]. In [49] a general model for transactional workflows is presented. A transactional workflow is defined to consist of several tasks, composed by constructs like ordering, contingency, alternative, conditional and iteration. Nested workflows are also introduced. Furthermore, correctness and acceptable termination schedules are defined over the proposed model. In [50] several interesting research results on workflow management are presented in the fields of electronic commerce, distributed execution and adaptive workflows. A widely used web server for workflow literature is maintained by [51].

Process modeling. Process and workflow modeling have been applied in numerous disciplines. In [16] the authors propose a software process data model to support software information systems with emphasis on the control, documentation and support of decision making for software design and tool integration. Among other features, the model captures the representation of design objects ("what"), design decisions ("why") and design tools ("how"). A recent overview on process modeling is given in [52], where a categorization of the different issues involved in the process engineering field is provided. The proposed framework consists of four different but complementary viewpoints (expressed as "worlds"): the subject world, concerning the definition of the process with respect to the real-world objects; the usage world, concerning the rationale for the process with respect to the way the system is used; the system world, concerning the representation of the processes and the capturing of the specifications of the system functionality; and finally, the development world, capturing the engineering meta-process of constructing process models. Each world is characterized by a set of facets, i.e., attributes describing the properties of a process belonging to it.

Data quality and quality management. There has been a lot of research on the definition and measurement of data quality dimensions [53–56]. A very good review of the research literature is found in [9]. Jarke et al. [1] give an extensive list of quality dimensions for data warehouses, and in particular for data warehouse relations and data. Several goal hierarchies of quality factors have been proposed for software quality. For example, the GE Model [57] suggests 11 criteria of software quality, while Boehm [58] suggests 19 quality factors. ISO 9126 [26] suggests six basic factors, which are further refined into an overall 21 quality factors. In [59] a comparative presentation of these three models is offered and the SATC software quality model is proposed, along with metrics for all their software quality dimensions. In [60] a set of four basic quality dimensions for workflows is also suggested. Variants of the Goal-Question-Metric (GQM) approach are widely adopted in software quality management [11]. A structured overview of the issues and strategies, embedded in a repository framework, can be found in [61]. Jarke et al. [1,10] provide extensive reviews of methodologies employed for quality management, too.

Research focused specifically on ETL. The AJAX data cleaning tool developed at INRIA [62] deals with typical data quality problems, such as the object identity problem, errors due to mistyping and data inconsistencies between matching records. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations (mapping, matching, clustering and merging transformations) that start from some input source data. AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations.

6.4. Relationship of our proposal to state-of-the-art research and practice

Our approach has been influenced by ideas on dependency and workflow modeling stemming from [4,7,16,24,46,63,64]. As far as the standards are concerned, we found both the Workflow Reference Model and the Open Information Model too abstract for the purpose of a repository serving well-focused processes like the ones in a data warehouse environment. First, the relationship of an activity with the data it involves is not really covered, although this would provide extensive information on the data flow in the data warehouse. Second, the separation of perspectives is not clear, since the standards focus only on the structure of the workflows. To compensate for this shortcoming, we employ the basic idea of the Actor-Dependency model [4] to add a conceptual perspective to the definition of a process, capturing the reasons behind its structure. Moreover, we extend [4] with the notion of suitability. As far as data quality and quality engineering are concerned, we have taken into account most of the previous research for our proposed quality dimensions and factors.

7. Conclusions

In this paper we have described a metamodel for data warehouse operational processes, together with techniques to design, administer and facilitate the evolution of the data warehouse through the exploitation of the entities of this metamodel. The metamodel takes advantage of the clustering of its entities into logical, physical and conceptual perspectives, involving a high-level conceptual description which can be linked to the actual structural and physical aspects of the data warehouse architecture. This approach is integrated with the results of previous research, where data warehouse architecture and quality metamodels have been proposed assuming the same categorization.


The physical perspective of the proposed metamodel covers the execution details of data warehouse processes. At the same time, the logical perspective is capable of modeling the structure of complex activities and of capturing all the entities of the Workflow Management Coalition standard. Due to the data-oriented nature of the data warehouse activities, their relationship with data stores is particularly taken care of, through clear and expressive semantics in terms of SQL definitions. This simple idea reverts the classical belief that data warehouses are simply collections of materialized views. Instead of directly assigning a naïve view definition to a data warehouse table, we can deduce its definition as the outcome of the combination of the processes that populate it. This new kind of definition does not necessarily refute the existing approaches, but rather complements them, since the former provides the operational semantics for the content of a data warehouse table, whereas the latter give an abstraction of its intentional semantics. The conceptual perspective is a key part of our approach, as in the Actor-Dependency model [4]. Also, we generalize the notion of role to uniformly capture any person, program or data store participating in the system. Furthermore, the process metamodel is linked to a quality metamodel, thereby facilitating the monitoring of the quality of data warehouse processes and a quality-oriented evolution of the data warehouse.

In the process of taking design decisions or understanding errors occurring over the architecture of a data warehouse, the proposed metamodel can be exploited in various ways. As far as the design of the data warehouse is concerned, simple query facilities of the metadata repository are sufficient to provide support for the consistency checking of the design. Moreover, the entities of the conceptual perspective serve as a documentation repository for the reasons behind the structure of the data warehouse. Second, the administration of the data warehouse is also facilitated in several ways. The measurement of data warehouse quality, through the linkage to a quality model, is crucial in terms of enabling the desired functionality during the everyday use of the warehouse.

Evolution is supported by the role interdependencies in two ways. At the entity level, the impact of any change in the architecture of the warehouse can be detected through the existence of dependency links. At the same time, the existence of suitability links suggests alternatives for the new structure of the warehouse. At the attribute level, on the other hand, internal changes in the schema of data stores or in the interface of software agents can also be detected by using the details of the relationships of the data warehouse roles.

We have used our experiences from real-world cases as a guide for the proposed metamodel. As far as the practical application of our ideas in the real world is concerned, we find that the field of ETL and data warehouse design tools is the most relevant to our research. As a partial evaluation of our ideas and to demonstrate the efficiency of our approach, we have developed a prototype ETL tool.

Research can follow our results in various ways. First, it would be interesting to explore automated ways to assist the involved stakeholders (data warehouse designers and administrators) in populating the metadata repository with the relevant information. An example of how this can be achieved is explained in Section 3.4 for the suitability interrelationships. Specialized tools and algorithms could assist in extending this kind of support to more aspects of the proposed metamodel. Also, in this paper we have dealt only with the operational processes of a data warehouse environment. Design processes in such an environment may not fit this model so smoothly. It would be worth trying to investigate the modeling of the design processes and to capture the trace of their evolution in a data warehouse. Finally, we have used the global-as-view approach for the definitions of the data warehouse processes, i.e., we reduce these definitions in terms of the sources in the warehouse architecture. We plan to investigate the possibility of using the local-as-view approach, which means reducing both the process definitions and the data sources to a global enterprise model. The local-as-view approach appears to be more suitable for environments where a global, corporate view of data is present and thus provides several benefits that the global-as-view approach lacks [65].


Acknowledgements

This research is sponsored in part (a) by the European Esprit Projects "DWQ: Foundations of Data Warehouse Quality", No. 22469, and "MEMO: Mediating and Monitoring Electronic Commerce", No. 26895, (b) by the Deutsche Forschungsgemeinschaft (DFG) under the Collaborative Research Center IMPROVE (SFB 476) and (c) by the Greek General Secretariat of Research and Technology in the context of the project "Data Quality in Decision Support Systems" of the French–Greek Cooperation Programme "Plato", 2000. We would like to thank our DWQ partners who contributed to the progress of this work, especially Mokrane Bouzeghoub, Manfred Jeusfeld, Maurizio Lenzerini, and Timos Sellis. Many thanks are also due to the anonymous reviewers for their useful comments and the interesting issues they have raised.

Appendix A

In this Appendix, we list quality factors for the software involved in a data warehouse environment.

We organize the presentation of these quality factors around the lifecycle stage during which they are introduced. The examined lifecycle stages are requirements analysis, system design, implementation and testing (including the quality of the software module itself), management and administration. For each of these stages, we list the involved stakeholder and its products (depicted in the header of each table). Each of the following tables comprises three columns: the leftmost describes a high-level quality dimension, the middle column shows the related quality factors of the dimension and the rightmost column suggests measurement methods for each quality factor. We make a full presentation, including both generic quality factors, applicable to any kind of software module, and data warehouse-specific quality factors. The former are presented on a gray background. To provide the list of these quality factors we have relied on existing research and practitioner publications, market tools and personal experiences [59,67,66,3,1].


References

[1] M. Jarke, M.A. Jeusfeld, C. Quix, P. Vassiliadis, Architecture and quality in data warehouses: an extended repository approach, Inform. Systems 24 (3) (1999) 229–253 (a previous version appeared in the Proceedings of the 10th Conference on Advanced Information Systems Engineering (CAiSE'98), Pisa, Italy, 1998).
[2] M.A. Jeusfeld, C. Quix, M. Jarke, Design and analysis of quality information for data warehouses, Proceedings of the 17th International Conference on Conceptual Modeling (ER'98), Singapore, 1998, pp. 349–362.
[3] P. Vassiliadis, M. Bouzeghoub, C. Quix, Towards quality-oriented data warehouse usage and evolution, Inform. Systems 25 (2) (2000) 89–115 (a previous version appeared in the Proceedings of the 11th Conference on Advanced Information Systems Engineering (CAiSE'99), Heidelberg, Germany, 1999, pp. 164–179).
[4] E. Yu, J. Mylopoulos, Understanding 'why' in software process modelling, analysis and design, Proceedings of the 16th International Conference on Software Engineering (ICSE), Sorrento, Italy, 1994, pp. 159–168.
[5] C. Shilakes, J. Tylman, Enterprise Information Portals, Enterprise Software Team, 1998. Available at http://www.sagemaker.com/company/downloads/eip/indepth.pdf.
[6] M. Demarest, The politics of data warehousing, 1997. Available at http://www.hevanet.com/demarest/marc/dwpol.html.
[7] Workflow Management Coalition, Interface I: process definition interchange process model, Document Number WfMC TC-1016-P, 1998. Available at www.wfmc.org.
[8] M. Jarke, R. Gallersdörfer, M.A. Jeusfeld, M. Staudt, S. Eherer, ConceptBase - a deductive objectbase for meta data management, J. Intell. Information Systems (Special Issue on Advances in Deductive Object-Oriented Databases) 4 (2) (1995) 167–192.
[9] R.Y. Wang, V.C. Storey, C.P. Firth, A framework for analysis of data quality research, IEEE Trans. Knowledge Data Eng. 7 (4) (1995) 623–640.
[10] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis (Eds.), Fundamentals of Data Warehouses, Springer, Berlin, 2000.
[11] M. Oivo, V. Basili, Representing software engineering models: the TAME goal-oriented approach, IEEE Trans. Software Eng. 18 (10) (1992) 886–898.


[12] C. Quix, M. Jarke, M. Jeusfeld, P. Vassiliadis, M. Lenzerini, D. Calvanese, M. Bouzeghoub, Data warehouse architecture and quality model, DWQ Technical Report DWQ-RWTH-002, 1997. Available at http://www.dbnet.ece.ntua.gr/dwq.
[13] P. Vassiliadis, Data warehouse modeling and quality issues, Ph.D. Thesis, 2000.
[14] M. Bouzeghoub, F. Fabret, M. Matulovic, Modeling data warehouse refreshment process as a workflow application, Proceedings of the Workshop on Design and Management of Data Warehouses (DMDW'99), Heidelberg, Germany, 1999.
[15] E. Schäfer, J.-D. Becker, M. Jarke, DB-PRISM - integrated data warehouses and knowledge networks for bank controlling, Proceedings of the 26th International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, 2000, pp. 715–718.
[16] M. Jarke, M.A. Jeusfeld, T. Rose, A software process data model for knowledge engineering in information systems, Inform. Systems 15 (1) (1990) 85–116.
[17] P. Vassiliadis, Gulliver in the land of data warehousing: practical experiences and observations of a researcher, Proceedings of the Second International Workshop on Design and Management of Data Warehouses (DMDW), Stockholm, Sweden, 2000, pp. 12.1–12.16.
[18] J. Mylopoulos, A. Borgida, M. Jarke, M. Koubarakis, Telos - a language for representing knowledge about information systems, ACM Trans. Inform. Systems 8 (4) (1990) 325–362.
[19] F. Baader, U. Sattler, Description logics with concrete domains and aggregation, Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), Brighton, UK, 1998, pp. 336–340.
[20] D. Srivastava, S. Dar, H.V. Jagadish, A.Y. Levy, Answering queries with aggregation using views, Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB), Bombay, India, 1996, pp. 318–329.
[21] A. Gupta, V. Harinarayan, D. Quass, Aggregate-query processing in data warehousing environments, Proceedings of the 21st Conference on Very Large Data Bases (VLDB), Zurich, Switzerland, 1995, pp. 358–369.
[22] W. Nutt, Y. Sagiv, S. Shurin, Deciding equivalences among aggregate queries, Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'98), Seattle, 1998, pp. 214–223.
[23] S. Chaudhuri, S. Krishnamurthy, S. Potamianos, K. Shim, Optimizing queries with materialized views, Proceedings of the 11th International Conference on Data Engineering (ICDE), Taipei, 1995, pp. 190–200.
[24] W. Sadiq, M. Orlowska, Applying graph reduction techniques for identifying structural conflicts in process models, Proceedings of the 11th Conference on Advanced Information Systems Engineering (CAiSE'99), Heidelberg, Germany, 1999, pp. 195–209.
[25] O. Marjanovic, M. Orlowska, On modeling and verification of temporal constraints in production workflows, Knowledge Inform. Systems 1 (2) (1999) 157–192.
[26] ISO, International Organization for Standardization, ISO/IEC 9126:1991 Information Technology - Software product evaluation - Quality characteristics and guidelines for their use, 1991.
[27] P. Vassiliadis, Modeling multidimensional databases, cubes and cube operations, Proceedings of the 10th International Conference on Statistical and Scientific Database Management (SSDBM), Capri, Italy, 1998, pp. 53–62.
[28] W. Kim, On optimizing an SQL-like nested query, ACM Trans. Database Systems 7 (3) (1982) 443–469.
[29] R.A. Ganski, H.K.T. Wong, Optimization of nested SQL queries revisited, Proceedings of the ACM SIGMOD International Conference on the Management of Data, San Francisco, CA, 1987, pp. 23–33.
[30] M. Muralikrishna, Optimization and dataflow algorithms for nested tree queries, Proceedings of the 15th International Conference on Very Large Data Bases (VLDB), Amsterdam, The Netherlands, 1989, pp. 77–85.
[31] M. Muralikrishna, Improved unnesting algorithms for join aggregate SQL queries, Proceedings of the 18th International Conference on Very Large Data Bases (VLDB), Vancouver, Canada, 1992, pp. 91–102.
[32] U. Dayal, Of nests and trees: a unified approach to processing queries that contain nested subqueries, aggregates and quantifiers, Proceedings of the 13th International Conference on Very Large Data Bases (VLDB), Brighton, UK, 1987, pp. 197–208.
[33] S. Chaudhuri, K. Shim, Optimizing queries with aggregate views, Proceedings of the Fifth International Conference on Extending Database Technology (EDBT), Avignon, France, 1996, pp. 167–182.
[34] I. Mumick, S. Finkelstein, H. Pirahesh, R. Ramakrishnan, Magic is relevant, Proceedings of the ACM SIGMOD International Conference on the Management of Data, Atlantic City, NJ, 1990, pp. 247–258.
[35] H. Pirahesh, J. Hellerstein, W. Hasan, Extensible/rule based query rewrite optimization in Starburst, Proceedings of the ACM SIGMOD International Conference on the Management of Data, San Diego, CA, 1992, pp. 39–48.
[36] A. Levy, I. Mumick, Y. Sagiv, Query optimization by predicate move-around, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Chile, 1994, pp. 96–107.
[37] Microsoft Corp., MS Data Transformation Services. Available at www.microsoft.com/sql.
[38] A. Gupta, I. Mumick, K. Ross, Adapting materialized views after redefinitions, Proceedings of the ACM SIGMOD International Conference on the Management of Data, San Jose, CA, 1995, pp. 211–222.
[39] A.Y. Levy, A.O. Mendelzon, Y. Sagiv, D. Srivastava, Answering queries using views, Proceedings of the 14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), San Jose, CA, 1995, pp. 95–104.


[40] Z. Bellahsène, Structural view maintenance in data warehousing systems, Journées Bases de Données Avancées (BDA'98), Tunis, 1998.
[41] MetaData Coalition, Open Information Model, version 1.0, 1999. Available at www.MDCinfo.com.
[42] Ardent Software, DataStage Suite. Available at http://www.ardentsoftware.com/.
[43] DataMirror Corporation, Transformation Server. Available at http://www.datamirror.com.
[44] P.A. Bernstein, T. Bergstraesser, Meta-data support for data transformations using Microsoft Repository, Bull. Tech. Committee Data Eng. 22 (1) (1999) 9–14.
[45] Evolutionary Technologies Intl., ETI*EXTRACT. Available at http://www.eti.com/.
[46] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Conceptual modeling of workflows, Proceedings of the 14th International Conference on Object-Oriented and Entity-Relationship Modelling (ER'95), Gold Coast, Australia, 1995, pp. 341–354.
[47] F. Casati, M. Fugini, I. Mirbel, An environment for designing exceptions in workflows, Inform. Systems 24 (3) (1999) 255–273.
[48] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Workflow evolution, Data Knowledge Eng. 24 (3) (1998) 211–238.
[49] D. Kuo, M. Lawley, C. Liu, M. Orlowska, A general model for nested transactional workflows, Proceedings of the International Workshop on Advanced Transaction Models and Architecture, India, 1996.
[50] P. Dadam, M. Reichert (Eds.), Enterprise-wide and cross-enterprise workflow management: concepts, systems, applications, GI Workshop Informatik'99, 1999. Available at http://www.informatik.uni-ulm.de/dbis/veranstaltungen/Workshop-Informatik99-Proceedings.pdf.
[51] R. Klamma, Readings in Workflow Management: Annotated and Linked Internet Bibliography, RWTH Aachen. Available at http://sunsite.informatik.rwth-aachen.de/WFBib.
[52] C. Rolland, A comprehensive view of process engineering, Proceedings of the 10th International Conference on Advanced Information Systems Engineering (CAiSE'98), Pisa, Italy, 1998, pp. 1–25.
[53] Y. Wand, R.Y. Wang, Anchoring data quality dimensions in ontological foundations, Comm. ACM 39 (11) (1996) 86–95.
[54] R.Y. Wang, H.B. Kon, S.E. Madnick, Data quality requirements analysis and modeling, Proceedings of the 9th International Conference on Data Engineering, IEEE Computer Society, Vienna, Austria, 1993, pp. 670–677.
[55] R.Y. Wang, A product perspective on total data quality management, Comm. ACM 41 (2) (1998) 58–65.
[56] G.K. Tayi, D.P. Ballou, Examining data quality, Comm. ACM 41 (2) (1998) 54–57.
[57] J.A. McCall, P.K. Richards, G.F. Walters, Factors in software quality, Technical Report, Rome Air Development Center, 1978.
[58] B. Boehm, Software Risk Management, IEEE Computer Society Press, Los Alamitos, CA, 1989.
[59] L. Hyatt, L. Rosenberg, A software quality model and metrics for identifying project risks and assessing software quality, Proceedings of the 8th Annual Software Technology Conference, Utah, 1996.
[60] D. Georgakopoulos, M. Rusinkiewicz, Workflow management: from business process automation to inter-organizational collaboration, Tutorial, 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, 1997.
[61] M. Jarke, K. Pohl, Information systems quality and quality information systems, in: Kendall, Lyytinen, DeGross (Eds.), Proceedings of the IFIP 8.2 Working Conference on the Impact of Computer-Supported Technologies on Information Systems Development, North-Holland, Minneapolis, 1992, pp. 345–375.
[62] H. Galhardas, D. Florescu, D. Shasha, E. Simon, AJAX: an extensible data cleaning tool, Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, TX, 2000, p. 590.
[63] B. Ramesh, V. Dhar, Supporting systems development by capturing deliberations during requirements engineering, IEEE Trans. Software Eng. 18 (6) (1992) 498–510.
[64] D. Georgakopoulos, M. Hornick, A. Sheth, An overview of workflow management: from process modeling to workflow automation infrastructure, Distributed Parallel Databases 3 (1995) 119–153.
[65] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati, A principled approach to data integration and reconciliation in data warehousing, Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW'99), Heidelberg, Germany, 1999.
[66] The Data Warehouse Information Center, An (Informal) Taxonomy of Data Warehouse Data Errors. Available at http://www.dwinfocenter.org/errors.html.
[67] ViaSoft, Visual Recap quality measurements, ViaSoft White Papers, 1996. Available at http://www.viasoft.com/download/white_papers/.
