
Page 1: A Set of QVT Relations to Assure the Correctness of Data ...dbis-group.uni-muenster.de/dbms/media/people/lechtenboerger/publications/er06.pdfA Set of QVT Relations to Assure the Correctness

A Set of QVT Relations to Assure the Correctness of Data Warehouses by Using Multidimensional Normal Forms⋆

Jose-Norberto Mazón¹, Juan Trujillo¹, and Jens Lechtenbörger²

¹ Dept. of Software and Computing Systems, University of Alicante, Spain
{jnmazon,jtrujillo}@dlsi.ua.es

² Dept. of Information Systems, University of Münster, Germany
[email protected]

Abstract. It is widely accepted that a requirement analysis phase is necessary to develop data warehouses (DWs) which adequately represent the information needs of DW users. Moreover, since the DW integrates the information provided by data sources, it is also crucial to take these sources into account throughout the development process to obtain a consistent representation. In this paper, we use multidimensional normal forms to define a set of Query/View/Transformation (QVT) relations to assure that the DW designed from user requirements agrees with the available data sources that will populate the DW. Thus, we propose a hybrid approach to develop DWs, i.e., we first obtain the conceptual schema of the DW from user requirements and then we verify and enforce its correctness against data sources by using a set of QVT relations based on multidimensional normal forms.

1 Introduction

A data warehouse (DW) is commonly described as an integrated collection of historical data in support of decision making that structures information into facts and dimensions based on multidimensional (MD) modeling [1, 2]. Since the DW integrates several data sources, the development of conceptual MD models has traditionally been guided by an analysis of these data sources [3–5]. Considering these data-driven approaches, MNFs (multidimensional normal forms) have been developed [6] to reason, in a rigorous manner, about the quality (faithfulness, completeness, avoidance of redundancies, summarizability) of a conceptual MD model derived from operational data sources.

Nevertheless, in these data-driven approaches the requirement analysis phase is overlooked, thus resulting in an MD model in which the user needs and expectations may not be satisfied [7]. To overcome this problem, several approaches [7–10] advocate a requirement-driven DW design process. However, hardly any of these approaches considers the data sources in the early stages of the development. Therefore, the correctness of the MD model with respect to the data sources cannot be assured and the DW repository cannot be properly populated from these data sources.

⋆ To appear in Proc. 25th International Conference on Conceptual Modeling (ER). © Springer-Verlag Berlin Heidelberg 2006.

In order to reconcile these two points of view (data-driven and requirement-driven), a Model Driven Architecture (MDA) [11] framework for the development of DWs has been described in [12]. Within this approach a conceptual MD model of the DW repository is developed from user requirements. This MD model must then be conformed to the data sources in order to assure its correctness.

In this paper, we focus on presenting a set of Query/View/Transformation (QVT) relations in order to check the correctness of the MD conceptual model against the available data sources within our MDA framework. These QVT relations are based on the MNFs proposed in [6]. The QVT language allows us to easily integrate this approach in our MDA framework for the development of DWs [12], while MNFs enable us to formalize the relationship between the data sources and the MD conceptual model of the DW repository.

The motivation of our approach is as follows: since the DW integrates the information provided by source databases, it is important to check (in early stages of the development) if the requirement-driven MD conceptual model agrees with the available data sources in order to assure that (i) the DW repository will be properly populated from data sources, (ii) the analysis potential provided by the data sources is captured by the MD conceptual model, (iii) redundancies are avoided, and (iv) optional dimension levels, i.e., levels allowing NULL values, are controlled via specialization/generalization to enable context-sensitive summarizability and to avoid inconsistent queries.

To illustrate these benefits, consider the following running example, which is inspired by an example of [6]. We assume that the MD conceptual model for the banking domain shown in Fig. 1 has been derived from analysis requirements without taking data sources into account, e.g., according to the guidelines presented in [9]. The notation of Fig. 1 is based on a UML profile for MD modeling presented in [13] (see Section 4.2 for details). The figure models Account facts which are composed of several measures (balance, turnover, interest, and customerAge) and described by dimensions Organization, Product, Time, and Customer. Due to space constraints, we only focus on the Customer dimension.

Every customer is described in terms of a unique identification number, a name, and a date of birth. Every customer lives in a city which is described with a name and a population. Moreover, customers may be associated with job, gender, industry branch, and contact person. Finally, a city belongs to (Rolls-upTo) exactly one region and exactly one district, while a region belongs to exactly one state.

This model represents a geographical classification where every region falls into exactly one state, while districts and states appear to be unrelated. From a conceptual perspective, this classification seems reasonable. However, if the data sources provide geographical information where every district falls into exactly one state, while regions and states are unrelated, then (i) the source information concerning regions and states cannot be represented faithfully under the MD model and (ii) potential for roll-up queries from level district to level state is not represented, i.e., analysis potential is lost. Moreover, the MD model does not represent the structural information that industry branches and contact persons are assigned only to company customers while job and gender are only applicable to private customers, which poses challenges for summarizability and complicates querying (see [6, 14]). Finally, while it certainly makes sense to analyze the age structure of customers, the measure age is not specific to accounts but only to customers. Thus, this measure should be moved to a different fact schema. To summarize, based on schema information for the data sources, the MD conceptual model shown in Fig. 1 should be improved in a number of ways to obtain the “better” model shown in Fig. 2. Indeed, in this paper we show how to apply QVT relations, which are derived from MNFs, to obtain the model shown in Fig. 2 from the model shown in Fig. 1 by taking source databases into account.

[Figure: UML class diagram. Fact Account with FactAttributes Balance, Turnover, Interest, and CustomerAge; dimensions Organization, Time, Product, and Customer. Level Customer has Descriptor ID and DimensionAttributes Name, DataOfBirth, Job, Gender, Branch, and ContactPerson. Further levels are City (Descriptor Name; DimensionAttribute Population), District (Descriptor Name), Region (Descriptor Name), and State (Descriptor Name; DimensionAttributes Population and Area), connected by <<Rolls-upTo>> associations.]

Fig. 1. MD model for banking domain

The remainder of this paper is structured as follows: Related work is put into perspective next, before necessary background concerning QVT and MNFs is collected in Section 3. Our approach is presented in Section 4 by describing the data source model as well as the MD conceptual model, and defining QVT relations based on MNFs. The application of sample QVT relations is illustrated in Section 5. The paper ends with conclusions and suggestions for future work in Section 6.

[Figure: UML class diagram. Fact Account with FactAttributes Balance, Turnover, and Interest; dimensions Organization, Time, Product, and Customer. Level Customer has Descriptor ID and DimensionAttributes Name and DataOfBirth, and is specialized into Private (DimensionAttributes Job and Gender) and Company (DimensionAttributes Branch and ContactPerson). Further levels are City (Descriptor Name; DimensionAttribute Population), Region (Descriptor Name), District (Descriptor Name), and State (Descriptor Name; DimensionAttributes Population and Area), connected by <<Rolls-upTo>> associations.]

Fig. 2. Improved MD model for banking domain

2 Related Work

In this section, we briefly describe the most relevant approaches for both data-driven and requirement-driven DW development.

Concerning data-driven approaches, in [4], the authors present the Multidimensional Model, a logical model for MD databases. The authors also propose a general design method, aimed at building an MD schema starting from an operational database described by an Entity-Relationship (ER) schema.

In [3], the authors propose the Dimensional-Fact Model (DFM), a particular notation for the DW conceptual design. Moreover, they also propose how to derive a DW schema from the data sources described by ER schemas. Also in [15], the building of a conceptual MD model of the DW repository from the conceptual schemas of the operational data sources is proposed.

In [5], the authors present a method to systematically derive a conceptual MD model from data sources. In that paper a preliminary set of multidimensional normal forms is used to assure the quality of the resulting conceptual model.

Although in each of these data-driven approaches the design steps are described in a systematic and coherent way, the DW design is based only on the operational data sources, which we consider insufficient because the final user requirements are very important in the DW design [7].

Concerning requirement-driven approaches, in [7] an approach is proposed in order to both determine information requirements of DW users and match these requirements with actual data sources. However, no formal approach is given in order to match requirements with data sources.


In [8], the authors propose a requirement elicitation process for DWs by grouping requirements in several levels of abstraction. Their process consists of identifying information that supports decision making via information scenarios. In this process, a Goal-Decision-Information (GDI) diagram is used. Although the derivation of GDI diagrams and information scenarios is described, the relationships between information scenarios and requirements are not properly specified. Moreover, requirements are not conformed to data sources in order to obtain a conceptual MD model.

In [10], the authors present a framework to obtain a conceptual MD model from requirements. This framework uses the data sources to shape hierarchies, and user requirements are used to choose facts, dimensions and measures. However, the authors do not present a formal way to conform data sources and the MD conceptual model.

In summary, we wish to point out that these requirement-driven approaches do not formalize the relation between the data sources and the requirements to verify and enforce the correctness of the resulting DW. Therefore, we propose to use MNFs [6] in a systematic manner, thus formalizing the development of the DW repository by means of (i) obtaining a conceptual MD model from user requirements, and (ii) verifying and enforcing its correctness against the operational data sources. Details on MNFs are presented in the next section.

3 Background

In this section, we provide a brief overview of the building blocks of our approach, namely Query/View/Transformation and Multidimensional Normal Forms.

3.1 Query/View/Transformation Language

The MOF 2.0 Query/View/Transformation (QVT) language [16] is a standard approach for defining formal relations between MOF-compliant models. Furthermore, QVT is an essential part of the MDA standard as a means of defining formal and automatic transformations between models.

QVT consists of two parts: declarative and imperative. The declarative part provides mechanisms to define relations that must hold between the model elements of a set of candidate models (source and target models). This declarative part can be split into two layers according to the level of abstraction: the relational layer that provides graphical and textual notation for a declarative specification of relations, and the core layer that provides a simpler, but verbose, way of defining relations. The imperative part defines operational mappings that extend the declarative part with imperative implementations when it is difficult to provide a purely declarative specification of a relation.

In this paper, we focus on the relational layer of QVT. This layer supports the specification of relationships that must hold between MOF models by means of a relations language. A relation is defined by the following elements:


– Two or more domains: each domain is a set of elements of a source or a target model. The kind of relation between domains must be specified: checkonly (C), i.e., it is only checked if the relation holds or not; and enforced (E), i.e., the target model can be modified to satisfy the relation.

– When clause: it specifies the conditions under which the relation needs to hold (i.e., precondition).

– Where clause: it specifies the condition that must be satisfied by all model elements participating in the relation (i.e., postcondition).

Defining relations by using the QVT language has the following advantages: (i) it is a standard language, (ii) relations are formally established and automatically performed, and (iii) relations can be easily integrated in an MDA approach.
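The elements of a relation listed above can be illustrated with a small Python sketch. It is not QVT syntax and not part of any QVT tool; the names Domain, Relation, holds, and may_modify are hypothetical and chosen only to mirror the terms domain, checkonly/enforced, when clause, and where clause.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical illustration only: these classes are NOT a QVT API.
Binding = Dict[str, str]  # variable name -> name of the bound model element


@dataclass
class Domain:
    model: str  # candidate model the domain belongs to, e.g. "source" or "md"
    kind: str   # "checkonly" (C) or "enforce" (E)


@dataclass
class Relation:
    name: str
    domains: List[Domain]
    when: Callable[[Binding], bool]   # precondition
    where: Callable[[Binding], bool]  # postcondition

    def holds(self, binding: Binding) -> bool:
        # The relation only needs to hold if the when clause is satisfied.
        if not self.when(binding):
            return True
        return self.where(binding)

    def may_modify(self, model: str) -> bool:
        # Only an enforced domain allows its model to be changed.
        return any(d.model == model and d.kind == "enforce" for d in self.domains)


# Toy relation: a source table and an MD Base class must carry the same name.
same_name = Relation(
    name="SameName",
    domains=[Domain("source", "checkonly"), Domain("md", "enforce")],
    when=lambda b: "table" in b and "base" in b,
    where=lambda b: b["table"] == b["base"],
)
```

In this sketch, a binding that does not satisfy the when clause makes the relation hold vacuously, which is one plausible reading of "the conditions under which the relation needs to hold".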

3.2 Multidimensional Normal Forms

The formal guidelines that we are using to formulate our QVT relations in the following are the three multidimensional normal forms 1MNF, 2MNF, and 3MNF presented in [6]. Here, we recapitulate the essence of these normal forms informally. The reader is referred to [6] for formal definitions. Preliminarily we recall that within an MD conceptual model the terminal dimension levels of a fact are those that are attached immediately to the dimensions, i.e., those that provide the finest level of detail within each dimension.

The goal of 1MNF is to ensure that an MD conceptual model “matches” with the information provided by the source databases. More specifically, 1MNF is characterized by four conditions as follows:

1. Faithfulness. The functional dependencies (FDs) implied by the MD model must be a subset of those observed in the source databases. (Otherwise, some source data cannot be represented under the MD model.)

2. Roll-up completeness. The FDs among dimension levels contained in the source databases must be represented as roll-up arcs in the MD model. (Otherwise, analysis potential is lost.)

3. Derivation completeness. The FDs among sets of measures contained in the source databases must be represented via derivation formulas in the MD model. (Otherwise, derivation relationships are lost.)

4. Avoidance of redundancies. Each measure must be assigned to a fact in such a way that the terminal dimension levels of the fact form a key for the measure without transitive dependencies. (Otherwise, a measure is recorded redundantly at the “wrong” level of detail. E.g., in Fig. 1 in the Introduction, measure customerAge was repeated for each account owned by a customer.)
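As a rough illustration of the first two conditions, FDs can be written as pairs and compared with set operations. The sketch below is a simplification under stated assumptions: it compares the FDs syntactically (it does not compute FD closures as the formal definitions in [6] would require), and the helper names fd, faithfulness_violations, and missing_rollups are hypothetical. The FDs are those of the geographical hierarchy in the running example.

```python
from typing import FrozenSet, Iterable, Set, Tuple

FD = Tuple[FrozenSet[str], str]  # (determinant levels, dependent level)


def fd(lhs: Iterable[str], rhs: str) -> FD:
    return (frozenset(lhs), rhs)


def faithfulness_violations(md_fds: Set[FD], source_fds: Set[FD]) -> Set[FD]:
    # Condition 1: every FD implied by the MD model must occur in the sources.
    return md_fds - source_fds


def missing_rollups(md_fds: Set[FD], source_fds: Set[FD]) -> Set[FD]:
    # Condition 2: every FD among levels in the sources must be a roll-up arc.
    return source_fds - md_fds


# Running example: the MD model of Fig. 1 versus the assumed source schema.
md_fds = {
    fd(["City"], "Region"),
    fd(["City"], "District"),
    fd(["Region"], "State"),
}
source_fds = {
    fd(["City"], "Region"),
    fd(["City"], "District"),
    fd(["District"], "State"),
}

violations = faithfulness_violations(md_fds, source_fds)  # Region -> State
lost_potential = missing_rollups(md_fds, source_fds)      # District -> State
```

The two result sets correspond to the two problems named in the Introduction: the Region/State roll-up is not backed by the sources, and the District/State roll-up offered by the sources is missing from the model.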

In addition to 1MNF, the normal forms 2MNF and 3MNF aim to control optional dimension levels by means of so-called contexts of validity. Roughly, a context of validity for an optional dimension level explains the occurrence (and absence) of structural null values (such as NULL for industry branch of private customers in Fig. 1) based on the values of so-called discriminating levels. E.g., for the scenario in Fig. 1, we may assume that in the data sources there is an attribute customerType with values “private” and “company”, which acts as discriminating level, such that a customerType of “private” implies NULL for Branch and ContactPerson, whereas “company” implies NULL for Job and Gender. As argued in [14] and elaborated in more detail in [6], structural NULL values can and should be avoided by suitable introduction of specialization hierarchies. In fact, in [6] it has been shown that 3MNF allows the construction of a class hierarchy of dimension levels with an implementation as a relational database that avoids null values. Note that such a class hierarchy is indeed part of the improved model shown in Fig. 2.

Importantly, the MD model considered in [6] does not provide mechanisms for specialization/generalization explicitly, which necessitates the use of context dependencies. As in this paper we consider a richer MD model that explicitly supports subclassing, we are able to explain the occurrence of NULL values directly by moving an attribute with structural NULL values into the appropriate subclass. As a result, we obtain a simplified approach.

As explained in [6, 14], control over NULL values enables context-sensitive summarizability (e.g., if an analyst rolls up from individual customers to industry branches, then schema information explains that the context of analysis has changed to a subclass of all customers) and avoids inconsistent queries (e.g., a query such as “group private customers by industry branch” can be rejected based on schema information).
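The idea of moving attributes with structural NULL values into subclasses can be sketched as follows. This is only an illustration: the paper reasons on schema information, whereas the sketch infers the grouping from a few made-up sample rows, and the function name subclasses_for as well as the sample values are hypothetical. The attribute customerType is the discriminating level assumed in the text.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

Row = Dict[str, Optional[str]]


def subclasses_for(rows: List[Row], discriminator: str,
                   optional_attrs: List[str]) -> Dict[str, Set[str]]:
    """Group optional attributes by the single discriminator value
    for which they are ever non-NULL."""
    valid_for: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        for attr in optional_attrs:
            if row.get(attr) is not None:
                valid_for[attr].add(row[discriminator])

    subclasses: Dict[str, Set[str]] = defaultdict(set)
    for attr, values in valid_for.items():
        # An attribute valid for exactly one discriminator value
        # can be moved into the subclass named after that value.
        if len(values) == 1:
            subclasses[next(iter(values))].add(attr)
    return dict(subclasses)


# Made-up sample rows for the Customer level of Fig. 1.
rows: List[Row] = [
    {"ID": "1", "customerType": "private", "Job": "teacher", "Gender": "f",
     "Branch": None, "ContactPerson": None},
    {"ID": "2", "customerType": "company", "Job": None, "Gender": None,
     "Branch": "retail", "ContactPerson": "A. Smith"},
]

hierarchy = subclasses_for(rows, "customerType",
                           ["Job", "Gender", "Branch", "ContactPerson"])
```

For these rows the result assigns Job and Gender to "private" and Branch and ContactPerson to "company", which matches the Private and Company subclasses of Fig. 2.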

4 Checking Correctness of the MD Conceptual Model

In this section, we present our approach to check the correctness of a conceptual MD model with respect to the source databases. To this end, we present a set of QVT relations based on MNFs and obtain their inherent desirable design objectives: The resulting MD conceptual model faithfully represents the data sources and captures their analysis potential completely, redundancies are avoided, and NULL values are controlled to allow context-sensitive summarizability and avoid contradictory queries. Our approach consists of two main phases:

First, the elements of the data sources are marked as dimensional elements (fact, dimension, measure and so on). Second, a set of QVT relations between the data source model and the MD conceptual model (previously derived from user requirements) are applied, thus checking and enforcing that the MD conceptual model is aligned with data sources.

4.1 Data Source Model

We assume that the data source model is the relational representation of the data sources in third normal form. (Note that third normal form is not a restriction as well-known algorithms such as Synthesis [17] can transform any input schema into third normal form.) In particular, we use the CWM (Common Warehouse Metamodel) relational metamodel [18] in order to specify this data source model. The CWM relational metamodel is a standard to represent the structure of data resources in a relational database and allows us to represent tables, columns, primary keys, foreign keys, and so on. Since every CWM metamodel is MOF-compliant [18], it can be used as source or target for QVT relations [16].

Furthermore, this data source model must be marked before the QVT relations can be applied. Marking models is a technique that provides mechanisms to extend elements of the models in order to capture additional information [11, 19]. Marks are used in MDA to prepare the models in order to guide the matching between them. A mark represents a concept from one model, which can be applied to an element of another model. These marks indicate how every element of the source model must be matched. In our approach, the data source model is marked by appending a suffix to the name of each element according to the MD conceptual model. In particular, we assume that the data source tables corresponding to MD model elements Fact, Dimension, and Base are marked with FACT, DIM, and BASE, respectively, while data source columns corresponding to FactAttribute, DimensionAttribute, and Descriptor are marked with MEASURE, DA, and D, respectively. Finally, a ForeignKey representing a Rolls-upTo element is marked with ROLLS.
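The marking scheme can be sketched in a few lines of Python. The suffixes are the ones listed above; the underscore separator and the helper names mark and unmark are assumptions made for this illustration, since the text only states that a suffix is appended to the element name.

```python
from typing import Optional, Tuple

# MD concept -> mark, as listed in the text.
MARKS = {
    "Fact": "FACT",
    "Dimension": "DIM",
    "Base": "BASE",
    "FactAttribute": "MEASURE",
    "DimensionAttribute": "DA",
    "Descriptor": "D",
    "Rolls-upTo": "ROLLS",
}

SEPARATOR = "_"  # assumption: the exact separator is not given in the text


def mark(name: str, md_concept: str) -> str:
    """Append the mark for the given MD concept to a source element name."""
    return f"{name}{SEPARATOR}{MARKS[md_concept]}"


def unmark(marked: str) -> Tuple[str, Optional[str]]:
    """Split a marked name into (plain name, mark); mark is None if absent."""
    base, sep, suffix = marked.rpartition(SEPARATOR)
    if sep and suffix in MARKS.values():
        return base, suffix
    return marked, None
```

With these helpers, mark("Customer", "Dimension") yields "Customer_DIM", and unmark recovers the plain name so that it can be compared with the name of the corresponding MD model element.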

4.2 MD Conceptual Model

The conceptual modeling of the DW repository is based on a UML profile for MD modeling presented in [13]. This profile contains the necessary stereotypes in order to elegantly represent the main MD properties at the conceptual level by means of a UML class diagram in which the information is clearly organized into facts and dimensions. These facts and dimensions are represented by Fact and Dimension classes, respectively (each stereotype has its own icon). Fact classes are defined as composite classes in shared aggregation relationships of n Dimension classes. A fact is composed of measures or fact attributes (FactAttribute stereotype). Furthermore, derived measures (and their derivation rules) can also be explicitly represented as tagged values of a FactAttribute.

With respect to dimensions, each level of a classification hierarchy is specified by a Base class. Every Base class can contain several dimension attributes (DimensionAttribute stereotype) and must also contain a Descriptor attribute (D stereotype). An association with a Rolls-upTo stereotype (<<Rolls-upTo>>) between Base classes specifies the relationship between two levels of a classification hierarchy. Within this association, role R represents the direction in which the hierarchy rolls up, whereas role D represents the direction in which the hierarchy drills down. An overview of our UML profile is given in Fig. 3. Apart from these defined stereotypes, the generalization/specialization relationship of UML is used for suitably representing optional dimension levels.

Other MD issues are also defined by this UML profile (degenerate dimensions, degenerate facts, non-strict hierarchies, and so on); however, they are not taken into account in this paper, since only the characteristics related to MNFs are considered.

[Figure: excerpt of the UML metamodel (Classifier, Class, Generalization, Property, Association, and the enumeration AggregationKind with literals none, shared, and composite) extended with the stereotypes Fact, Dimension, Base, FactAttribute, DimensionAttribute, Descriptor, and Rolls-upTo.]

Fig. 3. Extension of the UML with the stereotypes used in this paper.

4.3 QVT Relations

In the following, each QVT relation is described: Check1MNF1_1, Check1MNF1_2, Check1MNF1_3, Check1MNF1_4, Check1MNF2, Check1MNF3, and Check1MNF4 are based on the 1MNF; Check2MNF3MNF is based on both 2MNF and 3MNF.

The relations are applied as follows: first Check1MNF1_1, Check1MNF1_2, Check1MNF1_3, and Check1MNF1_4 are applied in order to check that the FDs of the MD model are contained in those of the sources (first condition of the 1MNF); since both domains are check-only, it is only checked whether there exists a valid match that satisfies these relations, without modifying any model if the domains do not match. If the check fails, there typically is no automatic solution, and the DW developer must redesign the MD conceptual model. (E.g., in our example given in Fig. 1, the user requirements express that Regions roll up to States, whereas the data sources do not provide this information. Thus, either the conceptual model has to be modified as shown in Fig. 2 or the source data has to be aligned with the model.) Otherwise, i.e., if the check succeeds, the remaining relations can be applied to properly modify the MD conceptual model (according to the second, third, and fourth condition of 1MNF as well as according to 2MNF and 3MNF). Therefore, these QVT relations not only check the correctness of the MD conceptual model according to the data sources, but also enforce this correctness by creating the necessary elements of the MD conceptual model until each relation holds.

Throughout the checks, we assume that the names of corresponding elements in both models are equal (apart from the previously added marks) according to a linguistic approach based on name similarity [20]. This issue is captured in the when clause of each relation.

Verify 1MNF (first condition). According to this condition, for every FD in the MD conceptual model we have to check that there is a corresponding FD in the data source model, i.e., the FDs implied by the MD model must be a subset of those observed in the source databases. Therefore, this condition assures that the source data can be properly represented under the MD model. We have defined one QVT relation (see Figs. 4 and 5) for each situation in which an FD arises in the MD conceptual model in order to check if the same FD occurs in the data source model. These situations are as follows:

1. Descriptor determines DimensionAttributes. This is checked by Check1MNF1_1 (see Fig. 4). The elements related to the MD conceptual model are the following: a Base (b), a Descriptor (d) and a DimensionAttribute (da). These elements of the MD conceptual model must be matched against a set of elements of the data source model: a table (t) with a column (c1) which is part of the primary key (pk). This table is marked as a Dimension or Base (m_n_t) and the column (c1) is marked as a Descriptor (m_n_c1). There is also a column (c2) which is functionally determined by the primary key. This column is marked as a DimensionAttribute (m_n_c2).

2. A Rolls-upTo association is an FD between hierarchy levels (Bases). This is checked by Check1MNF1_2 as follows (see Fig. 4): a set of elements that represent two Bases (b1 and b2) related by means of a Rolls-upTo association must be checked against the following pattern in the data source model: a set of elements that represents a table (t1) with a foreign key (fk) that references the other table (t2). This represents a many-to-one relationship in a third normal form relational database. Furthermore, table t1 must be marked as Dimension or Base, t2 as Base and foreign key fk as Rolls-upTo.

3. Derived measures. This is checked by Check1MNF1_3 (see Fig. 4). It checks that if there is a derived FactAttribute (with a derivation rule) in the MD model, then in the data sources there must be a procedure which implements this derivation rule.

Fig. 4. QVT relations based on Multidimensional Normal Forms (1/2)

4. Dimensions (and their terminal dimension levels) functionally determine FactAttributes (i.e., measures). This is checked by Check1MNF1_4 (see Fig. 5). In this relation, a set of elements of the MD conceptual model that represent the relation between a Dimension (d), together with its terminal dimension level, i.e., Base (b), and a Fact (f) together with its attributes (fa) is matched against the following pattern of the data sources: a table (t1) with a column (c) and a primary key (pk) which contains a foreign key that references another table (t2). Table t1 is marked as a Fact, while table t2 is marked as Dimension and column c is marked as FactAttribute.

Fig. 5. QVT relations based on Multidimensional Normal Forms (2/2)
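To make the check-only pattern matching more concrete, the first of the four patterns above (a Base with a Descriptor and a DimensionAttribute matched against a marked table) can be sketched in Python. This is only an illustration under stated assumptions, not QVT and not the relation as drawn in Fig. 4: the classes `Table` and `Column`, the function `matches_check_1mnf1_1`, the idea of matching model elements by name, and the toy table are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    mark: str                 # hypothetical mark, e.g. "Descriptor"
    in_primary_key: bool = False


@dataclass
class Table:
    name: str
    mark: str                 # hypothetical mark, e.g. "Dimension" or "Base"
    columns: list = field(default_factory=list)


def matches_check_1mnf1_1(base, descriptor, attribute, tables):
    """Return True if some marked table exhibits the pattern for (b, d, da).

    Assumption: MD elements and source elements are matched by name.
    """
    for table in tables:
        if table.name != base or table.mark not in {"Dimension", "Base"}:
            continue
        # c1: part of the primary key and marked as Descriptor
        has_descriptor = any(
            c.name == descriptor and c.in_primary_key and c.mark == "Descriptor"
            for c in table.columns
        )
        # c2: a column of the same table marked as DimensionAttribute
        has_attribute = any(
            c.name == attribute and c.mark == "DimensionAttribute"
            for c in table.columns
        )
        if has_descriptor and has_attribute:
            return True
    return False


# Toy data with made-up names; not taken from the paper's figures.
tables = [
    Table("Store", "Base", [
        Column("StoreId", "Descriptor", in_primary_key=True),
        Column("Address", "DimensionAttribute"),
    ]),
]

print(matches_check_1mnf1_1("Store", "StoreId", "Address", tables))  # True
print(matches_check_1mnf1_1("Store", "StoreId", "Manager", tables))  # False
```

In check-only mode a failed match is merely reported; nothing in either model is changed.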

Verify 1MNF (second condition). The Check1MNF2 relation checks this condition, i.e., roll-up completeness (the FDs among dimension levels contained in the source databases must be represented as roll-up arcs in the MD model). Therefore, if this relation holds, a Rolls-upTo association between bases exists in the MD conceptual model whenever there is an FD between columns of different tables in the data source model. This relation is the same as the Check1MNF1_2 relation, but the kind of relation on the MD side is specified as enforced.
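The difference between checking and enforcing roll-up completeness can be illustrated with a short Python sketch. It is a simplification under assumptions, not the QVT relation itself: the function name `enforce_rolls_up`, the tuple layout of the foreign keys, the mark string "Fact-Dimension" and the toy level names are hypothetical, and tables are identified with bases by name.

```python
def enforce_rolls_up(md_rolls_up, source_fks):
    """Add every Rolls-upTo foreign key of the sources that the MD model lacks.

    md_rolls_up: set of (child_base, parent_base) pairs, modified in place.
    source_fks:  iterable of (fk_name, child_table, parent_table, mark).
    Returns the list of associations that had to be created.
    """
    created = []
    for _fk_name, child, parent, mark in source_fks:
        if mark != "Rolls-upTo":
            continue  # only foreign keys marked as Rolls-upTo are relevant
        pair = (child, parent)
        if pair not in md_rolls_up:
            md_rolls_up.add(pair)   # enforcement: create the missing association
            created.append(pair)
    return created


# Toy data with made-up names (hypothetical, not from the paper's example).
md_model = {("Day", "Month")}
foreign_keys = [
    ("FK_ToMonth", "Day", "Month", "Rolls-upTo"),
    ("FK_ToYear", "Month", "Year", "Rolls-upTo"),
    ("FK_ToShop", "Sale", "Shop", "Fact-Dimension"),
]

print(enforce_rolls_up(md_model, foreign_keys))  # [('Month', 'Year')]
```

A check-only variant would return the same list of missing associations without adding them to the MD model.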

Verify 1MNF (third condition). This condition is related to derivation completeness (third condition of the 1MNF). If a certain measure can be computed from a set of other measures, then there is an FD among measures. Therefore, the FDs among measures that appear in the data source model should be reflected as derived FactAttributes of the MD conceptual model. The relation that verifies this condition (Check1MNF3) is the same as Check1MNF1_3 (see Fig. 4), except that the kind of relation on the MD side is specified as enforced.

Verify 1MNF (fourth condition). This condition (avoidance of redundancies) is checked by the Check1MNF4 relation. This relation is the same as the Check1MNF1_4 relation (see Fig. 5), but with an enforced kind on the MD side. Therefore, each measure must be assigned to a Fact (as a FactAttribute) in such a way that the terminal dimension levels of the Fact form a key for the measure without transitive dependencies.

Verify 2MNF and 3MNF. This relation is based on 2MNF and 3MNF. These normal forms control optional dimension levels by avoiding structural NULL values. The aim of this relation is to check or enforce a class hierarchy of dimension levels in order to avoid these NULL values. Since in this paper we consider an MD conceptual model that explicitly supports subclassing, this QVT relation covers both 2MNF and 3MNF by moving an attribute with structural NULL values into the appropriate subclass.

This relation is shown in Fig. 5. A table (t1) with two columns, an optional column (l0) and a discriminating level (l), is matched against a generalization hierarchy: the superclass is a base (b1), and the subclass is another base (b2) with a DimensionAttribute that corresponds to the optional column. Furthermore, we use context dependencies as schema-level constraints to identify discriminating levels, so in the when clause there is a function (isDiscriminatingLevel) that checks whether the column l is a discriminating level according to the table t1 and the other column l0.
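The paper treats context dependencies as schema-level constraints and does not spell out how isDiscriminatingLevel is evaluated. Purely as an illustration, one possible data-level reading is sketched below in Python: a column acts as a discriminating level for an optional column if its value alone decides whether the optional column is NULL. The function name `is_discriminating_level`, the row representation as dictionaries, and the sample rows are assumptions; a real implementation would more likely consult declared constraints than inspect data.

```python
from collections import defaultdict


def is_discriminating_level(rows, level, optional):
    """True if the value of `level` decides whether `optional` is NULL (None).

    Assumption: rows are dicts mapping column names to values.
    """
    nullness = defaultdict(set)
    for row in rows:
        nullness[row[level]].add(row[optional] is None)
    # Every value of the level must map to exactly one NULL state ...
    if any(len(states) != 1 for states in nullness.values()):
        return False
    # ... and the column must really be optional: NULL for some level
    # values and non-NULL for others.
    observed = {next(iter(states)) for states in nullness.values()}
    return observed == {True, False}


# Illustrative rows only; the values are made up.
rows = [
    {"Type_DA": "private", "Job_DA": "nurse"},
    {"Type_DA": "company", "Job_DA": None},
    {"Type_DA": "private", "Job_DA": "chef"},
]
mixed = rows + [{"Type_DA": "private", "Job_DA": None}]

print(is_discriminating_level(rows, "Type_DA", "Job_DA"))   # True
print(is_discriminating_level(mixed, "Type_DA", "Job_DA"))  # False
```

In the second call, private customers appear both with and without a job, so the level no longer explains the NULL values and the test fails.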

5 Sample Applications of QVT Relations

In this section, we show how our QVT relations are applied to assure the correctness of the MD conceptual model of the DW repository against data sources. We use the sample scenario introduced in the Introduction (see Fig. 2). The data source model (already marked) is shown in Fig. 6.

[Figure: object diagram of the marked relational data source model. Tables: Customer_DIM, City_BASE, District_BASE, State_BASE, Region_BASE, Account_FACT. Primary keys: PK_Customer, PK_City, PK_District, PK_State, PK_Region, PK_Account. Foreign keys: FK_ToCity_ROLLS, FK_ToDistrict_ROLLS, FK_ToState_ROLLS, FK_ToRegion_ROLLS, FK_To_Customer. Columns: ID_Customer_D, Name_D, Name_DA, DateOfBirth_DA, City_DA, District_DA, State_DA, Region_DA, Population_DA, Area_DA, Job_DA, Gender_DA, Branch_DA, ContactPerson_DA, Type_DA, Customer, Turnover_MEASURE, Interest_MEASURE, Balance_MEASURE. Elements are connected through /owner, /feature, /namespace, /ownedElement, /uniqueKey and /keyRelationship links.]

Fig. 6. Data sources model for our example


Due to space restrictions, we only describe a subset of the applied relations. These QVT relations are as follows:

Check1MNF2. This relation checks and enforces that FK_ToState_ROLLS, a foreign key in the District_BASE table referencing the State_BASE table (which embodies a many-to-one relationship between districts and states), is represented via a Rolls-upTo association between District base and State base in the MD conceptual model. We point out that this Rolls-upTo association was missing in the requirement-driven MD conceptual model (recall Fig. 1).

Check1MNF4. This relation checks that the Account_FACT table, its primary key (PK_Account), foreign key (FK_To_Customer) to the Customer_DIM table, and its columns (Balance_MEASURE, Turnover_MEASURE, and Interest_MEASURE) correspond to the Account fact (including fact attributes) and the Customer dimension (including the terminal dimension level Customer base).

Check2MNF3MNF. The enforcement of this relation creates subclasses of the Customer base in the MD conceptual model, whose names are determined by the values of the discriminating level Type_DA: company and private. Furthermore, it enforces that the optional columns Job_DA and Gender_DA in the data source model belong to the private subclass in the MD conceptual model, while the optional columns Branch_DA and ContactPerson_DA belong to the company subclass of the Customer base.
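The outcome of enforcing Check2MNF3MNF on this example can be pictured with a small Python sketch. The column and subclass names come from the description above; everything else is an assumption made for illustration: the function names `strip_mark` and `enforce_subclasses`, the dictionary used to represent the generalization hierarchy, and the convention that an attribute in the MD model is named after its column without the `_DA` mark.

```python
def strip_mark(column_name, mark="_DA"):
    """Drop the mark suffix from a column name (naming convention is assumed)."""
    return column_name[:-len(mark)] if column_name.endswith(mark) else column_name


def enforce_subclasses(base, applicability):
    """Build a generalization hierarchy with one subclass per discriminating value.

    applicability maps each value of the discriminating level to the
    optional columns that are meaningful for that value.
    """
    return {
        "superclass": base,
        "subclasses": {
            value: sorted(strip_mark(col) for col in columns)
            for value, columns in applicability.items()
        },
    }


# Values of the discriminating level Type_DA and the optional columns
# assigned to them in the example.
applicability = {
    "private": ["Job_DA", "Gender_DA"],
    "company": ["Branch_DA", "ContactPerson_DA"],
}

hierarchy = enforce_subclasses("Customer", applicability)
print(hierarchy["subclasses"]["private"])  # ['Gender', 'Job']
print(hierarchy["subclasses"]["company"])  # ['Branch', 'ContactPerson']
```

Each optional column ends up in exactly one subclass, so no instance of the Customer base has to carry a structural NULL value for an attribute that does not apply to it.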

6 Conclusions and Future Work

In this paper, we have presented an approach to assure the correctness of an MD conceptual model of the DW repository according to the data sources that will populate this repository. This approach is outlined as follows: we first obtain the MD conceptual schema of the DW from user requirements and then we verify and enforce its correctness against data sources by using a set of QVT relations based on MNFs. By using MNFs, we can assure that the MD conceptual model also satisfies certain desirable properties such as faithfulness, completeness, avoidance of redundancies, and context-sensitive summarizability. Furthermore, QVT relations allow us to integrate this approach into an MDA framework for the development of DWs.

Our immediate future work is to extend our approach by defining QVT relations in order to automatically transform the MD conceptual model into logical models that are closer to the relational implementation. Furthermore, non-strict hierarchies, many-to-many relationships between a fact and a dimension, degenerate facts, and other MD issues should be taken into account. Therefore, MNFs will also assure the correctness of these logical models.

7 Acknowledgements

This work has been partially supported by the METASIGN (TIN2004-00779) project from the Spanish Ministry of Education and Science, by the DADASMECA project (GV05/220) from the Valencia Ministry of Enterprise, University and Science (Spain), and by the DADS (PBC-05-012-2) project from the Castilla-La Mancha Ministry of Education and Science (Spain). Jose-Norberto Mazón is funded by the Spanish Ministry of Education and Science under an FPU grant (AP2005-1360).

References

1. Inmon, W.: Building the Data Warehouse. Wiley & Sons (2002)
2. Kimball, R., Ross, M.: The Data Warehouse Toolkit. Wiley & Sons (2002)
3. Golfarelli, M., Maio, D., Rizzi, S.: The Dimensional Fact Model: A conceptual model for data warehouses. Int. J. Cooperative Inf. Syst. 7(2-3) (1998) 215–247
4. Cabibbo, L., Torlone, R.: A logical approach to multidimensional databases. In Schek, H.J., Saltor, F., Ramos, I., Alonso, G., eds.: EDBT. Volume 1377 of Lecture Notes in Computer Science, Springer (1998) 183–197
5. Hüsemann, B., Lechtenbörger, J., Vossen, G.: Conceptual data warehouse modeling. In Jeusfeld, M.A., Shu, H., Staudt, M., Vossen, G., eds.: DMDW. Volume 28 of CEUR Workshop Proceedings, CEUR-WS.org (2000) 6
6. Lechtenbörger, J., Vossen, G.: Multidimensional normal forms for data warehouse design. Inf. Syst. 28(5) (2003) 415–434
7. Winter, R., Strauch, B.: A method for demand-driven information requirements analysis in data warehousing projects. In: HICSS. (2003) 231
8. Prakash, N., Singh, Y., Gosain, A.: Informational scenarios for data warehouse requirements elicitation. In Atzeni, P., et al., eds.: ER. Volume 3288 of Lecture Notes in Computer Science, Springer (2004) 205–216
9. Mazón, J.N., Trujillo, J., Serrano, M., Piattini, M.: Designing data warehouses: from business requirement analysis to multidimensional modeling. In Cox, K., Dubois, E., Pigneur, Y., Bleistein, S.J., Verner, J., Davis, A.M., Wieringa, R., eds.: REBNITA, University of New South Wales Press (2005) 44–53
10. Giorgini, P., Rizzi, S., Garzetti, M.: Goal-oriented requirement analysis for data warehouse design. In: DOLAP. (2005) 47–56
11. Object Management Group: MDA Guide 1.0.1. http://www.omg.org/cgi-bin/doc?omg/03-06-01 (Visited January 2006)
12. Mazón, J.N., Trujillo, J., Serrano, M., Piattini, M.: Applying MDA to the development of data warehouses. In: DOLAP. (2005) 57–66
13. Luján-Mora, S., Trujillo, J., Song, I.Y.: A UML profile for multidimensional modeling in data warehouses. Data & Knowledge Engineering, In Press (2006)
14. Lehner, W., Albrecht, J., Wedekind, H.: Normal forms for multidimensional databases. In Rafanelli, M., Jarke, M., eds.: SSDBM, IEEE Computer Society (1998) 63–72
15. Tryfona, N., Busborg, F., Christiansen, J.G.B.: starER: A conceptual model for data warehouse design. In: DOLAP, ACM (1999) 3–8
16. Object Management Group: MOF 2.0 Query/View/Transformation. http://www.omg.org/cgi-bin/doc?ptc/2005-11-01 (Visited January 2006)
17. Bernstein, P.A.: Synthesizing third normal form relations from functional dependencies. ACM Trans. Database Syst. 1(4) (1976) 277–298
18. Object Management Group: Common Warehouse Metamodel Specification 1.1. http://www.omg.org/cgi-bin/doc?formal/03-03-02 (Visited January 2006)
19. Mellor, S., Scott, K., Uhl, A., Weise, D.: MDA distilled: principles of Model-Driven Architecture. Addison Wesley (2004)
20. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4) (2001) 334–350