

MapMerge: Correlating Independent Schema Mappings

Bogdan Alexe (UC Santa Cruz)
Mauricio Hernández (IBM Almaden)
Lucian Popa (IBM Almaden)
Wang-Chiew Tan (IBM Almaden & UC Santa Cruz)

ABSTRACT

One of the main steps towards integration or exchange of data is to design the mappings that describe the (often complex) relationships between the source schemas or formats and the desired target schema. In this paper, we introduce a new operator, called MapMerge, that can be used to correlate multiple, independently designed schema mappings of smaller scope into larger schema mappings. This allows a more modular construction of complex mappings from various types of smaller mappings such as schema correspondences produced by a schema matcher or pre-existing mappings that were designed by either a human user or via mapping tools. In particular, the new operator also enables a new "divide-and-merge" paradigm for mapping creation, where the design is divided (on purpose) into smaller components that are easier to create and understand, and where MapMerge is used to automatically generate a meaningful overall mapping. We describe our MapMerge algorithm and demonstrate the feasibility of our implementation on several real and synthetic mapping scenarios. In our experiments, we make use of a novel similarity measure between two database instances with different schemas that quantifies the preservation of data associations. We show experimentally that MapMerge improves the quality of the schema mappings, by significantly increasing the similarity between the input source instance and the generated target instance.

1. Introduction

Schema mappings are essential building blocks for information integration. One of the main steps in the integration or exchange of data is to design the mappings that describe the desired relationships between the various source schemas or source formats and the target schema. Once the mappings are established, they can be used either to support query answering on the (virtual) target schema, a process that is traditionally called data integration [13], or to physically transform the source data into the target format, a process referred to as data exchange [7]. In this paper, we focus on the data exchange aspect, although our mapping generation methods are equally applicable to derive mappings for data integration.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were presented at The 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore. Proceedings of the VLDB Endowment, Vol. 3, No. 1. Copyright 2010 VLDB Endowment 2150-8097/10/09... $10.00.

Many commercial data transformation systems, such as Altova Mapforce¹ and Stylus Studio², to name a few, as well as research prototypes such as Clio [6] or HePToX [3], include mapping design tools that can be used by a human user to derive the data transformation program between a source schema and a target schema. Most of these tools work in two steps. First, a visual interface is used to solicit all known correspondences of elements between the two schemas from the mapping designer. Such correspondences are usually depicted as arrows between the attributes of the source and target schemas. (See, for example, Figure 1(a).) Sometimes, a schema matching module [19] is used to suggest or derive correspondences. Once the correspondences are established, the system interprets them into an executable script, such as an XQuery or SQL query, which can then transform an instance of the source schema into an instance of the target schema. The generated transformation script is usually close to the desired specification. In most cases, however, portions of the script still need to be refined to obtain the desired specification.

We note that for both Clio and HePToX, the correspondences are first compiled into internal schema mapping assertions, or schema mappings in short, which are high-level, declarative, constraint-like statements [12]. These schema mappings are then compiled into the executable script. One advantage of using schema mappings as an intermediate form is that they are more amenable to the formal study of data exchange and data integration [12], as well as to optimization and automatic reasoning. In fact, the main technique that we introduce in this paper represents a form of automatic reasoning on top of schema mappings.

An important drawback of the previously outlined two-step schema mapping design paradigm is that it is hard to retrofit any pre-existing or user-customized mappings back into the mapping tool, since the mapping tool is based on correspondences. Thus, if the mapping tool is restarted, it will regenerate the same fixed transformation script based on the input correspondences, even though some portions of the transformation task may have already been refined (customized) by the user or may already exist (as part of a previous transformation task, for example).

In this paper, we propose a radically different approach to the design of schema mappings where the mapping tool can take as input arbitrary mapping assertions and not just correspondences. This allows for the modular construction of complex and larger mappings from various types of "smaller" mappings that include schema correspondences but also arbitrary pre-existing or customized mappings. An essential ingredient of this approach is a new operator on schema mappings that we call MapMerge, which can be used to automatically correlate the input mappings in a meaningful way.

¹ www.altova.com
² www.stylusstudio.com



[Figure 1(a): a transformation flow from S1 to S3. Schema S1 contains Group(gno, gname) and Works(ename, addr, pname, gno). Schema S2 contains Dept(did, dname), Emp(ename, addr, did), and Proj(pname, budget, did). Schema S3 contains CompSci(did, dname) with Staff(ename, address) and Projects(pname, budget). Arrows t1, t2, t3 relate S1 to S2; mapping mCS maps Dept to the intermediate object CSDept(did, dname), and mappings m, m1, m2 map into S3.]

Input mappings from S1 to S2 (Figure 1(b)):

(t1) for g in Group
     exists d in Dept
     where d.dname = g.gname

(t2) for w in Works, g in Group
     satisfying w.gno = g.gno, w.addr = "NY"
     exists e in Emp, d in Dept
     where e.did = d.did, e.ename = w.ename, e.addr = w.addr,
           d.dname = g.gname

(t3) for w in Works
     exists p in Proj
     where p.pname = w.pname

Output of MapMerge(S1, S2, {t1, t2, t3}) (Figure 1(c)):

for g in Group
  exists d in Dept
  where d.dname = g.gname, d.did = F[g]

for w in Works, g in Group
  satisfying w.gno = g.gno, w.addr = "NY"
  exists e in Emp
  where e.ename = w.ename, e.addr = w.addr, e.did = F[g]

for w in Works, g in Group
  satisfying w.gno = g.gno
  exists p in Proj
  where p.pname = w.pname, p.budget = H1[w], p.did = F[g]

Figure 1: (a) A transformation flow from S1 to S3. (b) Schema mappings from S1 to S2. (c) Output of MapMerge.

1.1 Motivating Example and Overview

To illustrate the ideas, consider first a mapping scenario between the schemas S1 and S2 shown in the left part of Figure 1(a). The goal is data restructuring from two source relations, Group and Works, to three target relations, Emp, Dept, and Proj. In this example, Group (similar to Dept) represents groups of scientists sharing a common area (e.g., a database group, a CS group, etc.). The dotted arrows represent foreign key constraints in the schemas.

Independent Mappings. Assume the existence of the following (independent) schema mappings from S1 to S2. The first mapping is the constraint t1 in Figure 1(b), which corresponds to the arrow t1 in Figure 1(a). This constraint requires every tuple in Group to be mapped to a tuple in Dept such that the group name (gname) becomes the department name (dname). The second mapping is more complex and corresponds to the group of arrows t2 in Figure 1(a). This constraint involves a custom filter condition: every pair of joining tuples of Works and Group for which the addr value is "NY" must be mapped into two tuples of Emp and Dept, sharing the same did value, and with corresponding ename, addr, and dname values. (Note that did is a target-specific field that must exist and plays the role of key / foreign key.) Intuitively, t2 illustrates a pre-existing mapping that a user may have spent time in the past to create. Finally, the third constraint in Figure 1(b) corresponds to the arrow t3 and maps pname from Works to Proj. This is an example of a correspondence that is introduced by a user after loading t1 and the pre-existing mapping t2 into the mapping tool.

The goal of the system is now to (re)generate a "good" overall schema mapping from S1 to S2 based on its input mappings. We note first that the input mappings, when considered in isolation, do not generate an ideal target instance.

Indeed, consider the source instance I in Figure 2. The target instance that is obtained by minimally enforcing the constraints {t1, t2, t3} is the instance J1, also shown in the figure. The first Dept tuple is obtained by applying t1 on the Group tuple (123, CS). There, D1 represents some did value that must be associated with CS in this tuple. Similarly, the Proj tuple, with some unspecified value B for budget and a did value of D3, is obtained via t3. The Emp tuple together with the second Dept tuple are obtained based on t2. As required by t2, these tuples are linked via the same did value D2. Finally, to obtain a target instance that satisfies all the foreign key constraints, we must also have a third tuple in Dept that includes D3 together with some unspecified department name N.

Since the three mapping constraints are not correlated, the three did values (D1, D2, D3) are distinct. (There is no requirement that they must be equal.) As a result, the target instance J1 exhibits the typical problems that arise when uncorrelated mappings are used to transform data: (1) duplication of data (e.g., multiple Dept tuples for CS with different did values), and (2) loss of associations, where tuples are not linked correctly to each other (e.g., we have lost the association between project name Web and department name CS that existed in the source).

Source instance I:
  Group: (gno, gname) = (123, CS)
  Works: (ename, addr, pname, gno) = (John, NY, Web, 123)

Target instance J1 (via {t1, t2, t3}):
  Emp:  (John, NY, D2)
  Dept: (D1, CS), (D2, CS), (D3, N)
  Proj: (Web, B, D3)

Target instance J2 (via MapMerge({t1, t2, t3})):
  Emp:  (John, NY, D)
  Dept: (D, CS)
  Proj: (Web, B', D)

Figure 2: An instance of S1 and two instances of S2.

Correlated Mappings via MapMerge. Consider now the schema mappings that are shown in Figure 1(c) and that are the result of MapMerge applied on {t1, t2, t3}. The notable difference from the input mappings is that all mappings consistently use the same expression, namely the Skolem term F[g], where g denotes a distinct Group tuple, to give values for the did field. The first mapping is the same as t1 but makes explicit the fact that did is F[g]. This mapping creates a unique Dept tuple for each distinct Group tuple. The second mapping is (almost) like t2 with the additional use of the same Skolem term F[g]. Moreover, it also drops the existence requirement for Dept (since this is now implied by the first mapping). Finally, the third mapping differs from t3 by incorporating a join with Group before it can actually use the Skolem term F[g]. As an additional artifact of MapMerge, which we explain later, it also includes a Skolem term H1[w] that assigns values for budget.

The target instance that is obtained by applying the result of MapMerge is the instance J2 shown in Figure 2. The data associations that exist in the source are now correctly preserved in the target. For example, Web is linked to the CS tuple (via D) and also John is linked to the CS tuple (via the same D). Furthermore, there is no duplication of Dept tuples.

Flows of Mappings. Taking the idea of mapping reuse and modularity one step further, an even more compelling use case for MapMerge, in conjunction with mapping composition [8, 14, 17], is the flow-of-mappings scenario [1]. The key idea here is that to produce a data transformation from the source to the target, one can decompose the process into several simpler stages, where each stage maps from or into some intermediate, possibly simpler schema. Moreover, the simpler mappings and schemas play the role of reusable components that can be applied to build other flows. Such abstraction is directly motivated by the development of real-life, large-scale ETL flows such as those typically developed with IBM Information Server (DataStage), Oracle Warehouse Builder, and others.

To illustrate, suppose the goal is to transform data from the schema S1 of Figure 1(a) to a new schema S3, where Staff and Projects information are grouped under CompSci. The mapping or ETL designer may find it easier to first construct the mapping between S1 and S2 (it may also be that this mapping has been derived in a prior design). Furthermore, the schema S2 is a normalized representation of the data, where Dept, Emp, and Proj correspond directly to the main concepts (or types of data) that are being manipulated. Based on this schema, the designer can then produce a mapping mCS from Dept to a more specialized object CSDept, by applying some customized filter condition (e.g., based on the name of the department). The next step is to create the mapping m from CSDept to the target schema. Other independent mappings are similarly defined for Emp and Proj (see m1 and m2).

Once these individual mappings are established, the same problem of correlating the mappings arises. In particular, one has to correlate mCS ∘ m, which is the result of applying mapping composition to mCS and m, with the mappings m1 for Emp and m2 for Proj. This correlation will ensure that all employees and projects of computer science departments are correctly mapped under their correct departments in the target schema.

In this example, composition itself provides another source of mappings to be correlated by MapMerge. While similar to composition in that it is an operator on schema mappings, MapMerge is fundamentally different in that it correlates mappings that share the same source schema and the same target schema. In contrast, composition takes two sequential mappings where the target of the first mapping is the source of the second mapping. Nevertheless, the two operators are complementary, and together they can play a fundamental role in building data flows.

1.2 Contributions and Outline of the Paper

Our main technical contributions are as follows. We give an algorithm for MapMerge, which takes as input arbitrary schema mappings expressed as second-order tgds [8] and generates correlated second-order tgds. As a particularly important case, MapMerge can also take as input a set of raw schema correspondences; thus, it constitutes a replacement for the existing mapping generation algorithms that are used in Clio [18, 10]. We introduce a novel similarity measure that is used to quantify the preservation of data associations from a source database to a target database. We use this measure to show experimentally that MapMerge improves the quality of schema mappings. In particular, we show that the target data that is produced based on the outcome of MapMerge has better quality, in terms of preservation of source associations, than the target data that is produced based on Clio-generated mappings.

Outline. In the next section, we provide some preliminaries on schema mappings and their semantics. In Section 3 we give the main intuition behind MapMerge, while in Section 4 we describe the algorithm. In Section 5, we introduce the similarity measure that quantifies the preservation of associations. We make use of this measure to evaluate the performance of MapMerge on real-life and synthetic mapping scenarios. We discuss related work in Section 6 and conclude in Section 7.

2. Preliminaries

A schema consists of a set of relation symbols, each with an associated set of attributes. Moreover, each schema can have a set of inclusion dependencies modeling foreign key constraints. While we restrict our presentation to the relational case, all our techniques are applicable and implemented in the more general case of the nested relational data model used in [18], where the schemas and mappings can be either relational or XML.

Schema Mappings. A schema mapping is a triple (S, T, Σ) where S is a source schema, T is a target schema, and Σ is a set of second-order tuple-generating dependencies (SO tgds) [8]. In this paper, we use the notation

for ~x in ~S satisfying B1(~x)
exists ~y in ~T where B2(~y) and C(~x, ~y)

for expressing SO tgds. Examples of SO tgds in this notation were already given in Figure 1(b) and Figure 1(c). Here, it suffices to say that ~S represents a vector of source relation symbols (possibly repeated), while ~x represents the tuple variables that are bound, correspondingly, to these relations. A similar notation applies for the exists clause. The conditions B1(~x) and B2(~y) are conjunctions of equalities over the source and, respectively, target variables. The condition C(~x, ~y) is a conjunction of equalities that equate target expressions (e.g., y.A) with either source expressions (e.g., x.B) or Skolem terms of the form F[x1, ..., xi], where F is a function symbol and x1, ..., xi are a subset of the source variables. Skolem terms are used to relate target expressions across different SO tgds. An SO tgd without a Skolem term may also be called, simply, a tuple-generating dependency, or tgd [7].
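To make the notation concrete, the following small sketch (ours, not the paper's; the class and field names are hypothetical) represents an SO tgd as a data structure, using constraint t1 of Figure 1(b) as the example:

```python
from dataclasses import dataclass

@dataclass
class SOTgd:
    """One SO tgd: for <for_clause> satisfying <satisfying>
    exists <exists_clause> where <where>."""
    for_clause: list     # (variable, source relation) pairs
    satisfying: list     # equalities over source expressions
    exists_clause: list  # (variable, target relation) pairs
    where: list          # equalities: target expr = source expr or Skolem term

# Constraint t1 of Figure 1(b): every Group tuple yields a Dept tuple
# whose dname is the group's gname.
t1 = SOTgd(
    for_clause=[("g", "Group")],
    satisfying=[],
    exists_clause=[("d", "Dept")],
    where=[("d.dname", "g.gname")],
)
```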

Note that our SO tgds do not allow equalities between or with Skolem terms in the satisfying clause. While such equalities may be needed for more general purposes [8], they do not play a role for data exchange and can be eliminated, as observed in [24].

Chase-Based Semantics. The semantics that we adopt for a schema mapping (S, T, Σ) is the standard data-exchange semantics [7] where, given a source instance I, the result of "executing" the mapping is the target instance J that is obtained by chasing I with the dependencies in Σ. Since the dependencies in Σ are SO tgds, we actually use an extension of the chase as defined in [8].

Intuitively, the chase provides a way of populating the target instance J in a minimal way, by adding the tuples that are required by Σ. For every instantiation of the for clause of a dependency in Σ such that the satisfying clause is satisfied but the exists and where clauses are not, the chase adds corresponding tuples to the target relations. Fresh new values (also called labeled nulls) are used to give values for the target attributes for which the dependency does not provide a source expression. Additionally, Skolem terms are instantiated by nulls in a consistent way: a term F[x1, ..., xi] is replaced by the same null every time x1, ..., xi are instantiated with the same source tuples. Finally, to obtain a valid target instance, we must chase (if needed) with the target constraints.
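As a rough sketch of this consistent instantiation (our own illustration; the helper name is hypothetical), a chase implementation can memoize one labeled null per Skolem function and argument tuple:

```python
import itertools

_fresh = itertools.count(1)  # generator of fresh labeled nulls N1, N2, ...
_memo = {}                   # one null per (Skolem function, source tuples)

def skolem(func, *args):
    """Return the labeled null for func[args], creating it on first use.
    The same function on the same source tuples always yields the same
    null; different functions or arguments yield distinct nulls."""
    key = (func, args)
    if key not in _memo:
        _memo[key] = f"N{next(_fresh)}"
    return _memo[key]

g = ("123", "CS")          # the sole Group tuple of Figure 2
dept_did = skolem("F", g)  # did of the Dept tuple created for g
emp_did = skolem("F", g)   # did of the Emp tuple created for g
assert dept_did == emp_did # the two tuples are linked via the same null
```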

For our earlier example, the target instance J1 is the result of chasing the source instance I with the tgds in Figure 1(b) and, additionally, with the foreign key constraints. There, the values D1, D2, D3 are nulls that are generated to fill did values for which the tgds do not provide a source expression. The target instance J2 is the result of chasing I with the SO tgds in Figure 1(c). There, D is a null that corresponds to the Skolem term F[g] where g is instantiated with the sole tuple of Group.

In practice, mapping tools such as Clio do not necessarily implement the chase with Σ, but generate queries to achieve a similar result [10, 18].

3. Correlating Mappings: Key Ideas

How do we achieve the systematic and, moreover, correct construction of correlated mappings? After all, we do not want arbitrary correlations between mappings, but rather only to the extent that the natural data associations in the source are preserved and no extra associations are introduced.

There are two key ideas behind MapMerge. The first idea is to exploit the structure and the constraints in the schemas in order to define what natural associations are (for the purpose of the algorithm). Two data elements are considered associated if they are in the same tuple or in two different tuples that are linked via constraints. This idea has been used before in Clio [18], and provides the first (conceptual) step towards MapMerge. For our example, the input mapping t3 in Figure 1(b) is equivalent, in the presence of the source and target constraints, to the following enriched mapping:

t′3: for w in Works, g in Group
     satisfying w.gno = g.gno
     exists p in Proj, d in Dept
     where p.pname = w.pname and p.did = d.did

Intuitively, if we have a w tuple in Works, we also have a joining tuple g in Group, since gno is a foreign key from Works to Group. Similarly, a tuple p in Proj implies the existence of a joining tuple in Dept, since did is a foreign key from Proj to Dept.

Formally, the above rewriting from t3 to t′3 is captured by the well-known chase procedure [2, 15]. The chase is a convenient tool to group together, syntactically, elements of the schema that are associated. The chase by itself, however, does not change the semantics of the mapping. In particular, the above t′3 does not include any additional mapping behavior from Group to Dept.
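This expansion step can be sketched as follows (a toy illustration of ours, not the paper's algorithm; the foreign-key table and helper name are hypothetical, and acyclic foreign keys are assumed): starting from the relations mentioned in a mapping, repeatedly add the relation referenced by each foreign key, together with the corresponding join condition.

```python
import itertools

# Foreign keys of Figure 1(a): (relation, attribute) -> referenced (relation, attribute)
FKS = {
    ("Works", "gno"): ("Group", "gno"),
    ("Proj", "did"): ("Dept", "did"),
    ("Emp", "did"): ("Dept", "did"),
}

def chase_tableau(tableau):
    """tableau: {variable: relation}. Chase with the foreign keys,
    returning the expanded tableau and the accumulated join conditions."""
    tableau = dict(tableau)
    joins = []
    counter = itertools.count(1)
    frontier = list(tableau.items())
    while frontier:
        var, rel = frontier.pop()
        for (r, attr), (r2, attr2) in FKS.items():
            if rel == r:
                v2 = f"v{next(counter)}"  # fresh variable for the referenced relation
                tableau[v2] = r2
                joins.append(f"{var}.{attr} = {v2}.{attr2}")
                frontier.append((v2, r2))
    return tableau, joins

# w in Works pulls in a joining Group tuple; p in Proj pulls in Dept --
# together, this mirrors the enrichment of t3 into t′3.
src_tab, src_joins = chase_tableau({"w": "Works"})
tgt_tab, tgt_joins = chase_tableau({"p": "Proj"})
```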

The second key idea behind MapMerge is that of reusing, or borrowing, mapping behavior from a more general mapping to a more specific mapping. This is a heuristic that changes the semantics of the entire schema mapping and produces an arguably better one, with consolidated semantics.

To illustrate, consider the first mapping constraint in Figure 1(c). This constraint (obtained by Skolemizing the input t1) specifies a general mapping behavior from Group to Dept. In particular, it specifies how to create dname and did from the input record. On the other hand, the above t′3 can be seen as a more specific mapping from a subset of Group (i.e., those groups that have associated Works tuples) to a subset of Dept (i.e., those departments that have associated Proj tuples). At the same time, t′3 does not specify any concrete mapping for the dname and did fields of Dept. We can then borrow the mapping behavior that is already specified by the more general mapping. Thus, t′3 can be enriched to:

t′′3: for w in Works, g in Group
      satisfying w.gno = g.gno
      exists p in Proj, d in Dept
      where p.pname = w.pname and p.did = d.did
            and d.dname = g.gname and d.did = F[g] and p.did = F[g]

where two of the last three equalities represent the "borrowed" behavior, while the last equality is obtained automatically by transitivity. Finally, we can drop the existence of d in Dept with the two conditions for dname and did, since this is repeated behavior that is already captured by the more general mapping from Group to Dept. The resulting constraint is identical³ to the third constraint in Figure 1(c), now correlated with the first one via F[g]. A similar explanation applies for the second constraint in Figure 1(c).

The actual MapMerge algorithm is more complex than intuitivelysuggested above, and is described in detail in the next section.

4. The MapMerge Algorithm

MapMerge takes as input a set {(S, T, Σ1), ..., (S, T, Σn)} of schema mappings over the same source and target schemas, which is equivalent to taking a single schema mapping (S, T, Σ1 ∪ ... ∪ Σn) as input. The algorithm is divided into four phases, and the complete pseudocode is given in the appendix. The first phase decomposes each input mapping assertion into basic components that are, intuitively, easier to merge. In Phase 2, we apply the chase algorithm to compute associations (which we call tableaux) from the source and target schemas, as well as from the source and target assertions of the input mappings. By pairing source and target tableaux, we obtain all the possible skeletons of mappings. The actual work of constructing correlated mappings takes place in Phase 3, where for each skeleton, we take the union of all the basic components generated in Phase 1 that "match" the skeleton. Phase 4 is a simplification phase that also flags conflicts that may arise and that need to be addressed by the user. These conflicts occur when multiple mappings that map to the same portion of the target schema contribute different, irreconcilable behaviors.

³ Modulo the absence of H1[w], which will be explained separately.

4.1 Phase 1: Decompose into Basic SO tgds

The first step of the algorithm decomposes each input SO tgd into a set of simpler SO tgds, called basic SO tgds, that have the same for and satisfying clause as the input SO tgd but have exactly one relation in the exists clause. Intuitively, we break the input mappings into atomic components that each specify mapping behavior for a single target relation. This decomposition step will subsequently allow us to merge mapping behaviors even when they come from different input SO tgds.

In addition to being single-relation in the target, each basic SO tgd gives a complete specification of all the attributes of the target relation. More precisely, each basic SO tgd has the form

for ~x in ~S satisfying B1(~x)
exists y in T where ⋀_{A ∈ Atts(y)} y.A = eA(~x)

where the conjunction in the where clause contains one equality constraint for each attribute of the record y asserted in the target relation T. The expression eA(~x) is either a Skolem term or a source expression (e.g., x.B). Part of the role of the decomposition phase is to assign a Skolem term to every target expression y.A for which the initial mapping does not equate it to a source expression.

For our example, the decomposition algorithm (given in the appendix) obtains the following basic SO tgds from the input mappings t1, t2, and t3 of Figure 1(b):

(b1): for g in Group
      exists d in Dept where d.did = F[g] and d.dname = g.gname

(b2): for w in Works, g in Group satisfying w.gno = g.gno and w.addr = "NY"
      exists e in Emp where e.ename = w.ename and e.addr = w.addr and e.did = G[w, g]

(b′2): for w in Works, g in Group satisfying w.gno = g.gno and w.addr = "NY"
       exists d in Dept where d.did = G[w, g] and d.dname = g.gname

(b3): for w in Works
      exists p in Proj where p.pname = w.pname and p.budget = H1[w] and p.did = H2[w]

The basic SO tgd b1 is obtained from t1; the main difference is that d.did, whose value was unspecified by t1, is now explicitly assigned the Skolem term F[g]. The only argument to F is g because g is the only record variable that occurs in the for clause of t1. Similarly, the basic SO tgd b3 is obtained from t3, with the difference being that p.budget and p.did are now explicitly assigned the Skolem terms H1[w] and, respectively, H2[w].

In the case of t2, we note that we have two existentially quantified variables, one for Emp and one for Dept. Hence, the decomposition algorithm generates two basic SO tgds: the first one maps into Emp and the second one maps into Dept. Observe that b2 and b′2 are correlated and share a common Skolem term G[w, g] that is assigned to both e.did and d.did. Thus, the association between e.did and d.did in the original schema mapping t2 is maintained in the basic SO tgds b2 and b′2.

In general, the decomposition process ensures that associations between target facts that are asserted by the original schema mapping are not lost. The process is similar to the Skolemization procedure that transforms first-order tgds with existentially quantified variables into second-order tgds with Skolem functions (see [8]). After such Skolemization, all the target relations can be separated since they are correlated via Skolem functions. Therefore, the set of basic SO tgds that results after decomposition is equivalent to the input set of mappings.
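To make the decomposition step concrete, the following is a minimal Python sketch. The dictionary encoding of SO tgds and the helper names (fresh_skolem, canon) are our own illustration, not the paper's implementation; the essential point it demonstrates is that target expressions equated by the input tgd receive one shared Skolem term, which keeps the resulting basic tgds correlated.

```python
import itertools

_counter = itertools.count()

def fresh_skolem(args):
    """Mint a fresh Skolem term: a new function name applied to the args."""
    return ('Sk%d' % next(_counter), tuple(args))

def decompose(so_tgd, target_attrs, canon=None):
    """Split an SO tgd into basic SO tgds, one per target relation.

    so_tgd: {'for': [(var, rel)], 'satisfying': [...],
             'exists': [(var, rel)], 'where': {(var, attr): src_expr}}
    target_attrs: relation name -> list of its attributes.
    canon: maps a target expression (var, attr) to the canonical
    representative of its equivalence class under the tgd's
    target-to-target equalities; equivalent expressions then share
    one Skolem term (this is what keeps b2 and b'2 correlated).
    """
    canon = canon or {}
    src_vars = [v for v, _ in so_tgd['for']]
    skolem_of = {}  # equivalence-class representative -> Skolem term
    basics = []
    for y, rel in so_tgd['exists']:
        where = {}
        for a in target_attrs[rel]:
            rep = canon.get((y, a), (y, a))
            expr = so_tgd['where'].get(rep)
            if expr is None:  # unspecified attribute: Skolemize it
                expr = skolem_of.setdefault(rep, fresh_skolem(src_vars))
            where[(y, a)] = expr
        basics.append({**so_tgd, 'exists': [(y, rel)], 'where': where})
    return basics
```

On an encoding of t2, the two resulting basic tgds share one Skolem term for e.did and d.did, mirroring G[w, g] in b2 and b′2.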

4.2 Phase 2: Compute Skeletons of Schema Mappings

Next we apply the chase algorithm to compute syntactic associations (which we call tableaux), from each of the schemas and from the input mappings. Essentially, a schema tableau is constructed by taking each relation symbol in the schema and chasing it with all the referential constraints that apply. The result of such a chase is a tableau that incorporates a set of relations that is closed under referential constraints, together with the join conditions that relate those relations. For each relation symbol in the schema, there is one schema tableau. As in [10, 18], in order to guarantee termination, we stop the chase whenever we encounter cycles in the referential constraints. In our example, there are two source schema tableaux and three target schema tableaux, as follows:

T1 = { g ∈ Group }
T2 = { w ∈ Works, g ∈ Group; w.gno = g.gno }
T3 = { d ∈ Dept }
T4 = { e ∈ Emp, d ∈ Dept; e.did = d.did }
T5 = { p ∈ Proj, d ∈ Dept; p.did = d.did }

Intuitively, schema tableaux represent the categories of data that can exist according to the schema. A Group record can exist independently of records in other relations (hence, the tableau T1). However, the existence of a Works record implies that there must exist a corresponding Group record with identical gno (hence, the tableau T2).
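The schema-tableau construction can be sketched as a small fixpoint over foreign keys. This Python encoding of constraints, variables, and conditions is ours, for illustration only, and assumes acyclic foreign keys (matching the chase-termination condition above):

```python
def chase_tableau(seed_rel, fks):
    """Close a tableau under referential constraints.

    fks: list of (rel, attr, ref_rel, ref_attr) foreign keys.
    Starting from one bound variable over seed_rel, each applicable
    foreign key adds a variable over the referenced relation plus the
    corresponding join condition.
    """
    variables = [('x0', seed_rel)]
    conditions = []
    frontier = [('x0', seed_rel)]
    while frontier:
        v, rel = frontier.pop()
        for (r, a, rr, ra) in fks:
            if r == rel:
                w = 'x%d' % len(variables)
                variables.append((w, rr))
                conditions.append('%s.%s = %s.%s' % (v, a, w, ra))
                frontier.append((w, rr))
    return variables, conditions
```

On the running example's foreign keys, chasing Group yields a tableau like T1, while chasing Works or Emp pulls in the referenced relation and join condition, as in T2 and T4.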

Since the MapMerge algorithm takes as input arbitrary mapping assertions, we also need to generate user-defined mapping tableaux, which are obtained by chasing the source and target assertions of the input mappings with the referential constraints that are applicable from the schemas (see Appendix A). The notion of user-defined tableaux is similar to the notion of user associations in [23]. In our example, there is only one new tableau based on the source assertions of the input mapping t2:

T′2 = { w ∈ Works, g ∈ Group; w.gno = g.gno, w.addr = "NY" }

Furthermore, we then pair every source tableau with every target tableau to form a skeleton. Each skeleton represents the empty shell of a candidate mapping. For our running example, the set of all skeletons at the end of Phase 2 is: {(T1, T3), (T1, T4), (T1, T5), (T2, T3), (T2, T4), (T2, T5), (T′2, T3), (T′2, T4), (T′2, T5)}.

4.3 Phase 3: Match and Apply Basic SO tgds on Skeletons

In this phase, for each skeleton, we first find the set of basic SO tgds that "match" the skeleton. Then, for each skeleton, we apply the basic SO tgds that were found matching, and construct a merged SO tgd. The resulting SO tgd is, intuitively, the "conjunction" of all the basic SO tgds that were found matching.

Matching. We say that a basic SO tgd σ matches a skeleton (T, T′) if there is a pair (h, g) of homomorphisms that "embed" σ into (T, T′). This translates into two conditions. First, the for and satisfying clauses of σ are embedded into T via the homomorphism h. This means that h maps the variables in the for clause of σ to variables of T such that relation symbols are respected and, moreover, the satisfying clause of σ (after applying h) is implied by the conditions of T. Additionally, the exists clause of σ must be embedded into T′ via the homomorphism g. Since σ is a basic SO tgd and there is only one relation in its exists clause, the latter condition essentially states that the target relation in σ must occur in T′.

For our running example, it is easy to see that the basic SO tgd b1 matches the skeleton (T1, T3). In fact, b1 matches every skeleton from Phase 2. On the other hand, the basic SO tgd b2 matches only the skeleton (T′2, T4) under the homomorphisms (h1, h2), where h1 = {w ↦ w, g ↦ g} and h2 = {e ↦ e}. Altogether, we obtain the following matching of basic SO tgds on skeletons:

(T1, T3, b1)   (T1, T4, b1)   (T1, T5, b1)   (T2, T3, b1)
(T2, T4, b1)   (T2, T5, b1 ∧ b3)   (T′2, T3, b1 ∧ b′2)
(T′2, T4, b1 ∧ b2 ∧ b′2)   (T′2, T5, b1 ∧ b′2 ∧ b3)

Note that the basic SO tgds that match a given skeleton may actually come from different input mappings. For example, each of the basic SO tgds that match (T′2, T5) comes from a separate input mapping (from t1, t2, and t3, respectively). In a sense, we aggregate behaviors from multiple input mappings in a given skeleton.
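The matching test can be sketched as a brute-force search for the homomorphism h. The encoding below is our own illustration; in particular, it simplifies the implication check on the satisfying clause to set inclusion of the renamed conditions, whereas the paper checks logical implication:

```python
from itertools import product

def rename(expr, h):
    """Apply a variable homomorphism to an expression like 'w.gno'."""
    if '.' in expr and expr.split('.')[0] in h:
        v, a = expr.split('.')
        return '%s.%s' % (h[v], a)
    return expr  # constants such as '"NY"' pass through

def matches(basic, src_tableau, tgt_tableau):
    """Try to embed a basic SO tgd into a skeleton (src, tgt tableau).
    Returns a homomorphism on the for-variables, or None."""
    src_vars, src_conds = src_tableau
    tgt_vars, _ = tgt_tableau
    # the single target relation of the basic tgd must occur in T'
    if basic['exists'][0][1] not in {r for _, r in tgt_vars}:
        return None
    # try every relation-respecting assignment of the for-variables
    for choice in product(src_vars, repeat=len(basic['for'])):
        h = {}
        if not all(rel == crel and h.setdefault(v, cv) == cv
                   for (v, rel), (cv, crel) in zip(basic['for'], choice)):
            continue
        need = {frozenset(rename(e, h) for e in c)
                for c in basic['satisfying']}
        have = {frozenset(c) for c in src_conds}
        if need <= have:
            return h
    return None
```

On the running example, an encoding of b2 matches a T4-style target tableau but not a T3-style one, since Emp does not occur in the latter.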

Computing merged SO tgds. For each skeleton along with the matching basic SO tgds, we now construct a "merged" SO tgd. For our example, the following SO tgd s8 is constructed from the eighth triple (T′2, T4, b1 ∧ b2 ∧ b′2) shown earlier.

(s8) for w in Works, g in Group satisfying w.gno = g.gno and w.addr = "NY"
     exists e in Emp, d in Dept
     where e.did = d.did
     and d.did = F[g] and d.dname = g.gname
     and e.ename = w.ename and e.addr = w.addr and e.did = G[w, g]
     and d.did = G[w, g]

The variable bindings in the source and target tableaux are taken literally and added to the for and, respectively, exists clause of the new SO tgd. The equalities in T′2 and T4 are also taken literally and added to the satisfying and, respectively, where clause of the SO tgd. More interestingly, for every basic SO tgd σ that matches the skeleton (T′2, T4), we take the where clause of σ (after applying the respective homomorphisms) and add it to the where clause of the new SO tgd. (Note that, by definition of matching, the satisfying clause of σ is automatically implied by the conditions in the source tableau.) The last three lines in the above SO tgd incorporate conditions taken from each of the basic SO tgds that match (T′2, T4) (i.e., from b1, b2, and b′2, respectively).

The constructed SO tgd consolidates the semantics of b1, b2, and b′2 under one merged mapping. Intuitively, since all three basic SO tgds are applicable whenever the source pattern is given by T′2 and the target pattern is given by T4, the resulting SO tgd takes the conjunction of the "behaviors" of the individual basic SO tgds.

Correlations. A crucial point about the above construction is that a target expression may now be assigned multiple expressions. For example, in the above SO tgd, the target expression d.did is equated with two expressions: F[g] via b1, and G[w, g] via b′2. In other words, the semantics of the new constraint requires the values of the two Skolem terms to coincide. This is actually what it means to correlate b1 and b′2. We can represent such correlation, explicitly, as the following conditional equality (implied by the above SO tgd):

for w in Works, g in Group satisfying w.gno = g.gno and w.addr = "NY" ⇒ F[g] = G[w, g]

We use the term residual equality constraint for such an equality constraint where one member in the implied equality is a Skolem term while the other is either a source expression or another Skolem term. Such constraints have to be enforced at runtime when we perform data exchange with the result of MapMerge. In general, Skolem functions are implemented as (independent) lookup tables, where for every different combination of the arguments, the lookup table gives a fresh new null. However, residual constraints will require correlation between the lookup tables. For example, the above constraint requires that the two lookup tables (for F and G) must give the same value whenever w and g are tuples of Works and Group with the same gno value.
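The runtime view of a residual constraint can be sketched as follows. SkolemTable and correlate are hypothetical names of our own; checking the precondition (the Works/Group join with addr = "NY") is left to the caller in this sketch:

```python
import itertools

class SkolemTable:
    """A Skolem function implemented as a lookup table: each new
    combination of arguments yields a fresh labeled null."""
    _ids = itertools.count()

    def __init__(self, name):
        self.name = name
        self.table = {}

    def __call__(self, *args):
        # reuse the null for known arguments, mint one otherwise
        return self.table.setdefault(args, 'N%d' % next(SkolemTable._ids))

F = SkolemTable('F')
G = SkolemTable('G')

def correlate(w, g):
    """Enforce the residual constraint F[g] = G[w, g]: for a (w, g)
    pair satisfying the precondition, G's lookup table reuses F's
    null instead of minting its own."""
    G.table[(w, g)] = F(g)
```

After calling correlate on a qualifying pair, both tables return the same labeled null, which is exactly the correlation the constraint demands.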

To conclude the presentation of Phase 3, we list below the other three merged SO tgds that result after this phase for our example.

(s1) from (T1, T3, b1):
     for g in Group
     exists d in Dept where d.did = F[g] and d.dname = g.gname

(s6) from (T2, T5, b1 ∧ b3):
     for w in Works, g in Group satisfying w.gno = g.gno
     exists p in Proj, d in Dept
     where p.did = d.did
     and d.did = F[g] and d.dname = g.gname
     and p.pname = w.pname and p.budget = H1[w] and p.did = H2[w]

(s9) from (T′2, T5, b1 ∧ b′2 ∧ b3):
     for w in Works, g in Group satisfying w.gno = g.gno and w.addr = "NY"
     exists p in Proj, d in Dept
     where p.did = d.did
     and d.did = F[g] and d.dname = g.gname
     and p.pname = w.pname and p.budget = H1[w] and p.did = H2[w]
     and d.did = G[w, g]

One aspect to note is that not all skeletons generate merged SO tgds. Although we had nine skeletons at the end of Phase 2, only four generate mappings that are neither subsumed nor implied. (See also the appendix.) We use here the technique for pruning subsumed or implied mappings described in [10]. For an example of a subsumed mapping, consider the triple (T1, T4, b1). We do not generate a mapping for this, because its behavior is subsumed by s1, which includes the same basic component b1 but maps into a more "general" tableau, namely T3. Intuitively, we do not want to construct a mapping into T4, which is a larger (more specific) tableau, without actually using the extra part of T4. Implied mappings are those that are logically implied by other mappings. For example, the mapping that would correspond to (T2, T3, b1) is logically implied by s6: they both have the same premise (T2), but s6 asserts facts about a larger tableau (T5, which includes T3) and already covers b1.

Finally, for our example, we also obtain three more residual equality constraints, arising from s6, and stating the pairwise equalities of F[g], H2[w], and G[w, g] (since they are all equal to p.did and d.did, which are also equal to each other).

Since residual equalities cause extra overhead at runtime, it is worthwhile to explore when such constraints can be eliminated without changing the overall semantics. We describe such a method next.

4.4 Phase 4: Eliminate Residual Equality Constraints

The fourth and final phase of the MapMerge algorithm attempts to eliminate as many Skolem terms as possible from the generated SO tgds. The key idea is that, for each residual equality constraint, we attempt to substitute, globally, one member of the equality with the other member. If the substitution succeeds, then there is one less residual equality constraint to enforce during runtime. Moreover, the resulting SO tgds are syntactically simpler.

Consider our earlier residual constraint stating the equality F[g] = G[w, g] (under the conditions of the for and satisfying clauses). The two Skolem terms F[g] and G[w, g] occur globally in multiple SO tgds. To avoid the explicit maintenance and correlation of two lookup tables (for both F and G), we attempt the substitution of either F[g] with G[w, g] or G[w, g] with F[g]. Care must be taken, since such a substitution cannot be arbitrarily applied. First, the substitution can only be applied in SO tgds that satisfy the preconditions of the residual equality constraint. For our example, we cannot apply either substitution to the earlier SO tgd s1, since the precondition requires the existence of a Works tuple that joins with Group. In general, we need to check for the existence of a homomorphism that embeds the preconditions of the residual equality constraint into the for and where clauses of the SO tgd. The second issue is that the direction of the substitution matters. For example, let us substitute F[g] by G[w, g] in every SO tgd that satisfies the preconditions. There are two such SO tgds: s8 and s9. After the substitution, in each of these SO tgds, the equality d.did = F[g] becomes d.did = G[w, g] and can be dropped, since it is already in the where clause. Note, however, that the replacement of F[g] by G[w, g] did not succeed globally. The SO tgds s1 and s6 still refer to F[g]. Hence, we still need to maintain the explicit correlation of the lookup tables for F and G. On the other hand, let us substitute G[w, g] by F[g] in every SO tgd that satisfies the preconditions. Again, there are two such SO tgds: s8 and s9. The outcome is different now: G[w, g] disappears from both s8 and s9 (in favor of F[g]); moreover, it did not appear in s1 or s6 to start with. We say that the substitution of G[w, g] by F[g] has globally succeeded. Following this substitution, the constraint s9 is implied by s6: they both assert the same target tuples, and the source tableau T′2 for s9 is a restriction of the source tableau T2 for s6. Hence, from now on we can discard the constraint s9.

Similarly, based on the other residual equality constraint we had earlier, we can apply the substitution of H2[w] by F[g]. This affects only s6, and the outcome is that H2[w] has been successfully replaced globally. The resulting SO tgds, for our example, are:

(s1) for g in Group
     exists d in Dept where d.did = F[g] and d.dname = g.gname

(s′6) for w in Works, g in Group satisfying w.gno = g.gno
      exists p in Proj, d in Dept
      where p.did = d.did
      and d.did = F[g] and d.dname = g.gname
      and p.pname = w.pname and p.budget = H1[w] and p.did = F[g]

(s′8) for w in Works, g in Group satisfying w.gno = g.gno and w.addr = "NY"
      exists e in Emp, d in Dept
      where e.did = d.did
      and d.did = F[g] and d.dname = g.gname
      and e.ename = w.ename and e.addr = w.addr and e.did = F[g]

As explained in Section 3, both s′6 and s′8 can be simplified by removing the assertions about Dept, since they are implied by s1. The result is then identical to the SO tgds shown in Figure 1(c).

Our example covered only residual equalities between Skolem terms. The case of equalities between a Skolem term and a source expression is similar, with the difference that we form only one substitution (to replace the Skolem term by the source expression).

The exact algorithm for eliminating residual constraints, given in the appendix, is an exhaustive algorithm that forms each possible substitution and attempts to apply it on the existing SO tgds. If the replacement is globally successful, the residual equality constraint that generated the substitution can be eliminated. Then, the algorithm goes on to eliminate other residual constraints on the rewritten SO tgds. If the replacement is not globally successful, the algorithm tries the reverse substitution (if applicable). In general, it may be the case that neither substitution succeeds globally. In such a case, the corresponding residual constraint is kept as part of the output of MapMerge. Thus, the outcome of MapMerge is, in general, a set of SO tgds together with a set of residual equality constraints. (For our example, the latter set is empty.)
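Phase 4's directed substitution can be sketched as follows, with SO tgds reduced to sets of equality strings and a predicate `applies` standing in for the homomorphism test on the precondition (both are our own simplified encoding):

```python
def substitute(sigmas, old, new, applies):
    """Replace Skolem term `old` by `new` in every SO tgd whose
    premise satisfies the residual constraint's precondition, then
    report whether the replacement succeeded globally, i.e. whether
    `old` no longer occurs anywhere.  Duplicate equalities created
    by the substitution collapse automatically in the sets."""
    rewritten = [{eq.replace(old, new) for eq in s} if applies(s) else s
                 for s in sigmas]
    succeeded = all(old not in eq for s in rewritten for eq in s)
    return rewritten, succeeded
```

The direction of the substitution matters, as in the paper's example: replacing G[w,g] by F[g] succeeds globally, while the reverse leaves F[g] behind in mappings the precondition does not reach.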

Finally, the last issue that arises is the case of conflicts in mapping behavior. Conflicts can also be described via constraints, similar to residual equality constraints, but with the main difference that both members of the equality are source expressions (and not Skolem terms). To illustrate, it might be possible that a merged SO tgd asserts that the target expression d.dname is equal both to g.gname (from some input mapping) and to g.code (from some other input mapping, assuming that code is some other source attribute). Then, we obtain conflicting semantics, with two competing source expressions for the same target expression. Our algorithm flags such conflicts, whenever they arise, and returns the mapping to the user to be resolved.

5. Evaluation

To evaluate the quality of the data generated based on MapMerge, we introduce a measure that captures the similarity between a source and a target instance by measuring the amount of data associations that are preserved by the transformation from the source to the target instance. We use this similarity measure in our experiments to show that the mappings derived by MapMerge are better than the input mappings. The experiments are all given in Appendix B.

Similarity measure. The main idea behind our measure is to capture the extent to which the "associations" in a source instance are preserved when transformed into a target instance of a different schema. For each instance, we will compute a single relation that incorporates all the natural associations between data elements that exist in the instance. There are two types of associations we consider. The first type is based on the chase with referential constraints and is naturally captured by tableaux. As seen in Section 4.2, a tableau is a syntactic object that takes the "closure" of each relation under referential constraints. We can then materialize the join query that is encoded in each tableau and select all the attributes that appear in the input relations (without duplicating the foreign key / key attributes). Thus, for each tableau, we obtain a single relation, called a tableau relation, that conveniently materializes together data associations that span multiple relations. For example, the tableau relations for the source instance I in Figure 1 (for tableaux T1 and T2 in Section 4.2) are shown at the top of Figure 3(b). We denote the tableau relations of an instance I of schema S as τS(I), or simply τ(I). The tableau relations τ(J1) and τ(J2) for our running example are also shown in Figure 3.

The second type of association that we consider is based on the notion of full disjunction [11, 20]. Intuitively, the full disjunction of relations R1, ..., Rk, denoted as FD(R1, ..., Rk), captures in a single relation all the associations (via natural join) that exist among tuples of the input relations. The reason for using full disjunction is that tableau relations by themselves do not capture all the associations. For example, consider the association that exists between John and Web in the earlier target instance J2. There, John is an employee in CS, and Web is a project in CS. However, since there is no directed path via foreign keys from John to Web, the two data elements appear in different tableau relations of τ(J2) (namely, DeptEmp and DeptProj). On the other hand, if we take the natural join between DeptEmp and DeptProj, the association between John and Web will appear in the result. Thus, to capture all such associations, we apply an additional step which computes the full disjunction FD(τ(I)) of the tableau relations. This generates a single relation that conveniently captures all the associations in an instance I of schema S. Intuitively, each tuple in this relation corresponds to one association that exists in the data.

Operationally, full disjunction must perform the outer "union" of all the tuples in every input relation, together with all the tuples that arise via all possible natural joins among the input relations. To avoid redundancy, minimal union is used instead of union. This means that in the final relation, tuples that are subsumed by other tuples are pruned. A tuple t is subsumed by a tuple t′ if for all attributes A such that t.A ≠ null, it is the case that t′.A = t.A. We omit here the details of implementing full disjunction, but we point out that such an implementation is part of our experimental evaluation.

For our example, we show FD(τ(J1)), FD(τ(I)), and FD(τ(J2)) at the bottom of Figure 3. There, we use the '―' symbol to represent the SQL null value. We note that FD(τ(J2)) now connects all three of John, Web, and CS in one tuple.
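A minimal fixpoint sketch of full disjunction with minimal union is given below. The encoding of tuples as padded Python tuples, and the simplified join test (agreement on at least one attribute non-null in both tuples), are our own illustration of the semantics in [11, 20], not the paper's implementation:

```python
def pad(t, attrs):
    """Pad a dict-tuple to the full attribute list, using None as null."""
    return tuple(t.get(a) for a in attrs)

def combine(t, u):
    """Join two padded tuples: they must agree on at least one
    attribute that is non-null in both, and conflict on none."""
    joined, merged = False, []
    for a, b in zip(t, u):
        if a is not None and b is not None:
            if a != b:
                return None
            joined = True
        merged.append(a if a is not None else b)
    return tuple(merged) if joined else None

def full_disjunction(relations, attrs):
    """Quadratic fixpoint, fine for small examples: saturate under
    pairwise joins, then prune tuples subsumed by more informative
    ones (minimal union)."""
    tuples = {pad(t, attrs) for r in relations for t in r}
    changed = True
    while changed:
        changed = False
        for t in list(tuples):
            for u in list(tuples):
                m = combine(t, u)
                if m is not None and m not in tuples:
                    tuples.add(m)
                    changed = True
    def subsumed(t):
        return any(u != t and all(a is None or a == b
                                  for a, b in zip(t, u)) for u in tuples)
    return {t for t in tuples if not subsumed(t)}
```

On J2-style data, the DeptEmp and DeptProj tuples join on the shared department values and merge into a single tuple connecting John, Web, and CS, while the two partial tuples are pruned as subsumed.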

Now that we have all the associations in a single relation, one on each side (source or target), we can compare them. More precisely, given a source instance I and a target instance J, we define the similarity between I and J by defining the similarity between FD(τ(I)) and FD(τ(J)). However, when we compare tuples between FD(τ(I)) and FD(τ(J)), we should not compare arbitrary pairs of attributes. Intuitively, to avoid capturing "accidental" preservations, we want to compare tuples based only on their compatible attributes that arise from the mapping. In the following, we assume that all the mappings that we need to evaluate implement the same set V of correspondences between attributes of the source schema S and attributes of the target schema T. This assumption is true for mapping generation algorithms, which start from a set of correspondences and generate a faithful implementation of the correspondences (without introducing new attribute-to-attribute mappings). It is also true for MapMerge and its input, since MapMerge does not introduce any new attribute-to-attribute mappings that are not already specified by the input mappings. Given a set V of correspondences between S and T, we say that an attribute A of S is compatible with an attribute B of T if either (1) there is a direct correspondence between A and B in V, or (2) A is related to an attribute A′ via a foreign key constraint of S, B is related to an attribute B′ via a foreign key constraint of T, and A′ is compatible with B′. For our example, the pairs of compatible attributes (from source to target) are: (gname, dname), (ename, ename), (addr, addr), (pname, pname).

DEFINITION 1 (TUPLE SIMILARITY). Let t1 and t2 be two tuples in FD(τ(I)) and, respectively, FD(τ(J)). The similarity of t1 and t2, denoted as Sim(t1, t2), is defined as:

Sim(t1, t2) = |{A ∈ Atts(t1) | ∃B ∈ Atts(t2), A and B compatible, t1.A = t2.B ≠ null}| / |{A ∈ Atts(t1) | ∃B ∈ Atts(t2), A and B compatible}|

Intuitively, Sim(t1, t2) captures the ratio of the number of values that are actually exported from t1 to t2 versus the number of values that could be exported from t1 according to V. For instance, let t1 be the only tuple in FD(τ(I)) from Figure 3 and t2 the only tuple in FD(τ(J2)). Then, Sim(t1, t2) is 1.0, since t1.A = t2.B for every pair of compatible attributes A and B. Now, let t2 be the first tuple in FD(τ(J1)). Since only t1.gname = t2.dname out of four pairs of compatible attributes, we have that Sim(t1, t2) is 0.25.

DEFINITION 2 (INSTANCE SIMILARITY). The similarity between FD(τ(I)) and FD(τ(J)) is

Sim(FD(τ(I)), FD(τ(J))) = Σ_{t1 ∈ FD(τ(I))} max_{t2 ∈ FD(τ(J))} Sim(t1, t2).
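The two definitions translate directly into code. The dict encoding of tuples and the `compat` pair list are our own illustration; the sketch also assumes each source attribute has at most one compatible target attribute, as in the running example:

```python
def tuple_sim(t1, t2, compat):
    """Definition 1: the fraction of t1's compatible attributes whose
    value is actually preserved (non-null and equal) in t2."""
    pairs = [(a, b) for (a, b) in compat if a in t1 and b in t2]
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs
               if t1[a] is not None and t1[a] == t2[b])
    return hits / len(pairs)

def instance_sim(fd_src, fd_tgt, compat):
    """Definition 2: each source tuple contributes the score of its
    best-matching target tuple."""
    return sum(max(tuple_sim(t1, t2, compat) for t2 in fd_tgt)
               for t1 in fd_src)
```

On encodings of the running example's full disjunctions, this reproduces the scores reported in Figure 3: 0.75 against J1 (via the best-matching tuple) and 1.0 against J2.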



Figure 3: Tableau relations of J1, I, and J2 of Figure 2 and their full disjunctions. [Figure: panel (a) shows the tableau relations of J1 (Dept, DeptEmp, DeptProj) and FD(τS2(J1)); panel (b) shows the tableau relations of I (Group, GroupWorks) and FD(τS1(I)); panel (c) shows the tableau relations of J2 and FD(τS2(J2)). The '―' symbol denotes null; the similarity of FD(τS1(I)) to FD(τS2(J1)) is 0.75 and to FD(τS2(J2)) is 1.]

Figure 3 depicts the similarities Sim(FD(τ(I)), FD(τ(J1))) and Sim(FD(τ(I)), FD(τ(J2))). The former similarity score is obtained by comparing the only tuple in FD(τ(I)) with the best-matching tuple (i.e., the second tuple) in FD(τ(J1)).

6. Related Work

Model management [16] has considered various operators on schema mappings, among which Confluence is closest in spirit to MapMerge. Confluence also operates on mappings with the same source and target schema, and it amounts to taking the conjunction of the constraints in the input mappings. Thus, Confluence does not attempt any correlation of the input mappings. Our work can be seen as a step towards high-level design and optimization in ETL flows [21, 22]. This can be envisioned by incorporating mappings [4] into such flows, and employing operators such as MapMerge and composition to support modularity and reuse.

The instance similarity measure we used to evaluate MapMerge draws its inspiration from the very general notion of Hausdorff distance between subsets of metric spaces, and from the sum of minimum distances measure. We refer to [5] for a discussion of these measures. Moreover, our notion of tuple similarity is loosely based on the well-known Jaccard coefficient. However, the previous measures are symmetric and agnostic to the transformation that produces one database instance from the other. In contrast, our notion is tailored to measure the preservation of data associations from a source database to a target database under a schema mapping.

7. Conclusions

We have presented our MapMerge algorithm and an evaluation of our implementation of MapMerge. Through a similarity measure that computes the amount of data associations that are preserved from one instance to another, our evaluation shows that a given source instance has higher similarity to the target instance obtained through MapMerge when compared to target instances obtained through other mapping tools. As part of our future work, we intend to explore the use of the notion of information loss [9] to compare mappings generated by MapMerge with those generated by other mapping tools. In addition, we would like to further explore applications of MapMerge to flows of mappings.

Acknowledgements. Hernandez and Popa are partially funded by the U.S. Air Force Office for Scientific Research under contracts FA9550-07-1-0223 and FA9550-06-1-0226. Part of the work was done while Alexe was visiting IBM Almaden Research Center. Alexe and Tan are supported by NSF grant IIS-0430994 and NSF grant IIS-0905276. Tan is also supported by an NSF CAREER award IIS-0347065.

8. References

[1] B. Alexe, M. Gubanov, M. A. Hernandez, H. Ho, J.-W. Huang, Y. Katsis, L. Popa, B. Saha, and I. Stanoi. Simplifying Information Integration: Object-Based Flow-of-Mappings Framework for Integration. In BIRTE, pages 108–121. Springer, 2009.
[2] C. Beeri and M. Y. Vardi. A Proof Procedure for Data Dependencies. JACM, 31(4):718–741, 1984.
[3] A. Bonifati, E. Q. Chang, T. Ho, V. S. Lakshmanan, and R. Pottinger. HePToX: Marrying XML and Heterogeneity in Your P2P Databases. In VLDB, pages 1267–1270, 2005.
[4] S. Dessloch, M. A. Hernandez, R. Wisnesky, A. Radwan, and J. Zhou. Orchid: Integrating Schema Mapping and ETL. In ICDE, pages 1307–1316, 2008.
[5] T. Eiter and H. Mannila. Distance Measures for Point Sets and Their Computation. Acta Inf., 34(2):109–133, 1997.
[6] R. Fagin, L. M. Haas, M. A. Hernandez, R. J. Miller, L. Popa, and Y. Velegrakis. Clio: Schema Mapping Creation and Data Exchange. In Conceptual Modeling: Foundations and Applications, pages 198–236. Springer, 2009.
[7] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data Exchange: Semantics and Query Answering. TCS, 336(1):89–124, 2005.
[8] R. Fagin, P. G. Kolaitis, L. Popa, and W. Tan. Composing Schema Mappings: Second-Order Dependencies to the Rescue. TODS, 30(4):994–1055, 2005.
[9] R. Fagin, P. G. Kolaitis, L. Popa, and W. C. Tan. Reverse Data Exchange: Coping with Nulls. In PODS, pages 23–32, 2009.
[10] A. Fuxman, M. A. Hernandez, H. Ho, R. J. Miller, P. Papotti, and L. Popa. Nested Mappings: Schema Mapping Reloaded. In VLDB, pages 67–78, 2006.
[11] C. A. Galindo-Legaria. Outerjoins as Disjunctions. In SIGMOD Conference, pages 348–358, 1994.
[12] P. G. Kolaitis. Schema Mappings, Data Exchange, and Metadata Management. In PODS, pages 61–75, 2005.
[13] M. Lenzerini. Data Integration: A Theoretical Perspective. In PODS, pages 233–246, 2002.
[14] J. Madhavan and A. Y. Halevy. Composing Mappings Among Data Sources. In VLDB, pages 572–583, 2003.
[15] D. Maier, A. O. Mendelzon, and Y. Sagiv. Testing Implications of Data Dependencies. TODS, 4(4):455–469, 1979.
[16] S. Melnik, P. A. Bernstein, A. Halevy, and E. Rahm. Supporting Executable Mappings in Model Management. In SIGMOD, pages 167–178, 2005.
[17] A. Nash, P. A. Bernstein, and S. Melnik. Composition of Mappings Given by Embedded Dependencies. In PODS, pages 172–183, 2005.
[18] L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernandez, and R. Fagin. Translating Web Data. In VLDB, pages 598–609, 2002.
[19] E. Rahm and P. A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4):334–350, 2001.
[20] A. Rajaraman and J. D. Ullman. Integrating Information by Outerjoins and Full Disjunctions. In PODS, pages 238–248, 1996.
[21] A. Simitsis, P. Vassiliadis, and T. K. Sellis. Optimizing ETL Processes in Data Warehouses. In ICDE, pages 564–575, 2005.
[22] P. Vassiliadis, A. Simitsis, and S. Skiadopoulos. Conceptual Modeling for ETL Processes. In DOLAP, pages 14–21, 2002.
[23] Y. Velegrakis, R. J. Miller, and L. Popa. Mapping Adaptation under Evolving Schemas. In VLDB, pages 584–595, 2003.
[24] C. Yu and L. Popa. Semantic Adaptation of Schema Mappings when Schemas Evolve. In VLDB, pages 1006–1017, 2005.



APPENDIX

A. Pseudocode of MapMerge

The main algorithm for MapMerge is given below. This algorithm makes calls to several subroutines, which are listed separately, in the respective subsections.

Algorithm MapMerge(S, T, Σ)
Input: A schema mapping.
Output: (S, T, Σ′) and F, where Σ′ is the correlated schema mapping and F is a set of failed unifications or "residual constraints".

Phase 1. (Decompose into basic SO tgds)
  Initialize the set of basic SO tgds B = ∅
  For each SO tgd σ ∈ Σ do
    Add Decompose(σ) to B

Phase 2. (Compute skeletons of schema mappings)
  Initialize the set of skeletons K = ∅
  Initialize the sets of source and target tableaux Tsrc = ∅, Ttgt = ∅
  Generate the schema tableaux:
    For each relation R ∈ S
      Chase {x ∈ R} with referential constraints in S; add the result to Tsrc
    For each relation Q ∈ T
      Chase {y ∈ Q} with referential constraints in T; add the result to Ttgt
  Generate the user-defined tableaux:
    For each SO tgd σ ∈ Σ of the form
        for ~x in ~R satisfying B1(~x) exists ~y in ~T where B2(~y) ∧ C(~x, ~y)
      Chase {~x ∈ ~R; B1(~x)} with referential constraints in S
        If the result is not implied by Tsrc, add it to Tsrc
      Chase {~y ∈ ~T; B2(~y)} with referential constraints in T
        If the result is not implied by Ttgt, add it to Ttgt
  For each T ∈ Tsrc and T′ ∈ Ttgt do
    Add the skeleton (T, T′) to K

Phase 3. (Match and apply basic SO tgds on skeletons)
  Initialize the list of output constraints Σ′ = ∅
  For each skeleton Ki ∈ K do
    Initialize the set Bi = ∅
    For each σ ∈ B do
      Let Li = Match(σ, Ki)
      If Li ≠ ∅, then add the pair ⟨σ, Li⟩ to Bi
    Update Σ′ to be Σ′ ∪ ConstructSOtgd(Ki, Bi)
  Remove from Σ′ every σ′ such that for some σ′′ ∈ Σ′ with σ′′ ≠ σ′, either σ′′ |= σ′ or σ′′ subsumes σ′

Phase 4. (Eliminate residual equality constraints)
  Initialize the list of failed substitutions F = ∅
  Repeat
    Let U = FindNextSubstitution(Σ′, F)
    If U is a substitution candidate (i.e., not a failure) then
      If U cannot be successfully applied on Σ′ (i.e., Substitute(Σ′, U) fails) then
        Add the failed substitution U to F
  Until no more substitutions can be applied
  Return (Σ′, F) as the output of the algorithm

A.1 Pseudocode used by Phase 1

The algorithm that decomposes an input SO tgd into its set of basic SO tgds is listed below.

Algorithm Decompose(σ)
Input: σ is an input SO tgd.
Output: Σ is the set of basic SO tgds resulting from the decomposition of σ.

Initialize Σ = ∅
Assume the input SO tgd σ is of the form:
    for ~x in ~S satisfying C(~x) exists ~y in ~T where C′(~x, ~y)
The target condition C′ is a conjunction of equalities between source and target expressions, or between target expressions. These equalities partition the source and target expressions in the where clause into a set E of equivalence classes. Associate a fresh Skolem term Fj[~x] with each equivalence class Ej ∈ E.

For each yi in Ti from the exists clause of σ
  Initialize the basic SO tgd σ′ to be: for ~x in ~S satisfying C(~x) exists yi in Ti
  For each attribute A of the record yi do
    If yi.A appears in an equivalence class Ej ∈ E then
      If Ej contains a source expression xk.B then
        Add yi.A = xk.B to the where clause of σ′
      Else add yi.A = Fj[~x] to the where clause of σ′
    Else add yi.A = G[~x] to the where clause of σ′, where G is a fresh Skolem function name
  Add σ′ to Σ
Return Σ as the output of the algorithm
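The equivalence-class bookkeeping at the heart of Decompose can be sketched in Python. This is our own illustration, not the paper's implementation: expressions are plain strings, union-find computes the classes E, and hypothetical Skolem names F1, F2, ... stand in for the fresh terms Fj[~x].

```python
import itertools

def decompose_where(target_vars, equalities, _skolem=itertools.count(1)):
    """Partition the expressions equated in the where clause into equivalence
    classes (union-find), then map each target expression either to a source
    expression in its class or to a fresh Skolem term shared by the class.
    Expressions are strings like "d.did"; an expression counts as a source
    expression iff its variable prefix is not in target_vars."""
    parent = {}

    def find(e):
        parent.setdefault(e, e)
        while parent[e] != e:
            parent[e] = parent[parent[e]]  # path halving
            e = parent[e]
        return e

    for a, b in equalities:
        parent[find(a)] = find(b)  # union the two classes

    classes = {}
    for e in list(parent):
        classes.setdefault(find(e), []).append(e)

    out = {}
    for members in classes.values():
        sources = [e for e in members if e.split(".")[0] not in target_vars]
        term = None  # one shared Skolem term per source-free class
        for e in members:
            if e.split(".")[0] in target_vars:
                if sources:
                    out[e] = sources[0]          # equate to a source expression
                else:
                    term = term or f"F{next(_skolem)}[x]"
                    out[e] = term                # shared fresh Skolem term
    return out

# e.did = d.did involves only target expressions, so both get the same Skolem
# term; d.dname is equated to the source expression g.gname.
print(decompose_where({"d", "e"}, [("e.did", "d.did"), ("d.dname", "g.gname")]))
```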

A.2 Pseudocode for Phase 3

The subroutine that determines whether a basic SO tgd σ matches a skeleton (T, T′) is presented below. If σ matches (T, T′), then the subroutine Match returns a pair of homomorphisms that "embeds" σ into (T, T′). Otherwise, an empty set is returned.

Algorithm Match(σ, (T, T′))
Input: σ is a basic SO tgd; T and T′ are tableaux.
Output: (h, g), where h and g "embed" σ into (T, T′), or ∅ if no such pair exists.

Recall that the input basic SO tgd σ has the form:
    for x1 in S1, x2 in S2, . . . , xn in Sn
    satisfying B(x1, . . . , xn)
    exists y in Q
    where ∧_{A ∈ Atts(y)} y.A = eA(x1, . . . , xn)
The satisfying clause is a conjunction of equalities of the form xi.Ai = xj.Aj or xi.Ai = c, where Ai ∈ Atts(xi), Aj ∈ Atts(xj), and c is a constant. The set Atts(y) denotes the set of attributes of the record y. The where clause contains one equality constraint for each attribute of the y record.

In addition, the tableau T has the form:
    {u1 ∈ R1, u2 ∈ R2, . . . , uk ∈ Rk; CT(u1, . . . , uk)}

If there exists a pair of homomorphisms (h, g) such that
  (1) for every 1 ≤ i ≤ n, if xi ∈ Si according to σ, then h(xi) ∈ Ri according to T,
  (2) CT(u1, . . . , uk) implies B(h(x1), . . . , h(xn)), and
  (3) g(y) ∈ Q according to T′
then
  Return (h, g)
Else
  Return ∅
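The homomorphism search at the core of Match can be sketched by brute force. This is a simplified stand-in of our own: the implication test on conditions is approximated by requiring every rewritten equality of the satisfying clause to appear verbatim among the tableau's constraints, and the target-side homomorphism g is omitted.

```python
from itertools import product

def find_homomorphism(sigma_vars, sigma_eqs, tableau_vars, tableau_eqs):
    """Brute-force search for a homomorphism h from the for clause of a basic
    SO tgd into a tableau: each tgd variable (bound to a relation) must map
    to a tableau variable bound to the same relation, and every equality of
    the satisfying clause, rewritten through h, must appear among the
    tableau's equalities.
    sigma_vars / tableau_vars: dicts mapping a variable to its relation name;
    sigma_eqs / tableau_eqs: sets of ((var, attr), (var, attr)) pairs."""
    svars = list(sigma_vars)
    for image in product(tableau_vars, repeat=len(svars)):
        h = dict(zip(svars, image))
        if any(sigma_vars[x] != tableau_vars[h[x]] for x in svars):
            continue  # h must respect relation names
        rewritten = {tuple(sorted(((h[a], p), (h[b], q))))
                     for (a, p), (b, q) in sigma_eqs}
        known = {tuple(sorted(e)) for e in tableau_eqs}
        if rewritten <= known:
            return h
    return None

# Toy run: match "for w in Works, g in Group satisfying w.gno = g.gno"
# against the tableau {u1 in Works, u2 in Group; u1.gno = u2.gno}.
h = find_homomorphism(
    {"w": "Works", "g": "Group"}, {(("w", "gno"), ("g", "gno"))},
    {"u1": "Works", "u2": "Group"}, {(("u1", "gno"), ("u2", "gno"))})
print(h)  # {'w': 'u1', 'g': 'u2'}
```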

The algorithm that constructs a merged SO tgd by applying the result of the previous Match algorithm on the skeletons is listed below.

Algorithm ConstructSOtgd((T, T′), B)
Input: (T, T′) is a skeleton and B is a set of pairs (σ, (h, g)), where σ is a basic SO tgd and (h, g) "embeds" σ into (T, T′).
Output: A Skolemized SO tgd constructed according to (T, T′, B).

Recall that T and T′ have the form:
    T = {x1 ∈ S1, x2 ∈ S2, . . . , xp ∈ Sp; C(x1, . . . , xp)}
    T′ = {y1 ∈ Q1, y2 ∈ Q2, . . . , yk ∈ Qk; C′(y1, . . . , yk)}

Initialize the SO tgd τ to be:
    for x1 in S1, x2 in S2, . . . , xp in Sp
    satisfying C(x1, . . . , xp)
    exists y1 in Q1, y2 in Q2, . . . , yk in Qk
    where C′(y1, . . . , yk)

For each (σ, (h, g)) ∈ B
  Recall that σ is a basic SO tgd of the form:
      for xj1 in Sj1, xj2 in Sj2, . . . , xjn in Sjn
      satisfying Bj(xj1, . . . , xjn)
      exists y in Qj
      where ∧_{A ∈ Atts(y)} y.A = eA(xj1, . . . , xjn)
  Add Bj(h(xj1), . . . , h(xjn)) to the satisfying clause of τ
  Add the following conjunction of equalities to the where clause of τ:
      ∧_{A ∈ Atts(y)} g(y).A = eA(h(xj1), . . . , h(xjn))
Return τ

We list below the complete set of SO tgds that are constructed in Phase 3 of MapMerge, before the actual step of eliminating the subsumed or implied SO tgds.

(s1) from (T1, T3, b1):
    for g in Group
    exists d in Dept
    where d.did = F[g] and d.dname = g.gname

(s2) from (T1, T4, b1):
    for g in Group
    exists e in Emp, d in Dept
    where d.did = F[g] and d.dname = g.gname

(s3) from (T1, T5, b1):
    for g in Group
    exists p in Proj, d in Dept
    where d.did = F[g] and d.dname = g.gname

(s4) from (T2, T3, b1):
    for w in Works, g in Group
    satisfying w.gno = g.gno
    exists d in Dept
    where d.did = F[g] and d.dname = g.gname

(s5) from (T2, T4, b1):
    for w in Works, g in Group
    satisfying w.gno = g.gno
    exists e in Emp, d in Dept
    where d.did = F[g] and d.dname = g.gname

(s6) from (T2, T5, b1 ∧ b3):
    for w in Works, g in Group
    satisfying w.gno = g.gno
    exists p in Proj, d in Dept
    where p.did = d.did and d.did = F[g] and d.dname = g.gname
      and p.pname = w.pname and p.budget = H1[w] and p.did = H2[w]

(s7) from (T2′, T3, b1 ∧ b2′):
    for w in Works, g in Group
    satisfying w.gno = g.gno and w.addr = "NY"
    exists d in Dept
    where d.did = F[g] and d.did = G[w, g] and d.dname = g.gname

(s8) from (T2′, T4, b1 ∧ b2 ∧ b2′):
    for w in Works, g in Group
    satisfying w.gno = g.gno and w.addr = "NY"
    exists e in Emp, d in Dept
    where e.did = d.did and d.did = F[g] and d.dname = g.gname
      and e.ename = w.ename and e.addr = w.addr
      and e.did = G[w, g] and d.did = G[w, g]

(s9) from (T2′, T5, b1 ∧ b2′ ∧ b3):
    for w in Works, g in Group
    satisfying w.gno = g.gno and w.addr = "NY"
    exists p in Proj, d in Dept
    where p.did = d.did and d.did = F[g] and d.dname = g.gname
      and p.pname = w.pname and p.budget = H1[w] and p.did = H2[w]
      and d.did = G[w, g]

In the above list of constructed SO tgds, s2 is subsumed by s1. Similarly, s3 is subsumed by s1. Moreover, s5 is subsumed by s4, which is in turn logically implied by s6. Finally, s7 is logically implied by s9. The remaining SO tgds are s1, s6, s8, and s9, and none of them is logically implied or subsumed by another. Hence, these four SO tgds are returned by Phase 3 of the algorithm.
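The pruning step of Phase 3 can be illustrated with a toy subsumption check. The criterion below is a deliberate simplification of our own: σ″ subsumes σ′ when both have the same for, satisfying, and where clauses, but σ′ asserts extra, unconstrained existential variables. This is exactly the relationship between s1 and s2 above (s2 adds an unconstrained "e in Emp").

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SoTgd:
    for_clause: frozenset   # variable bindings, e.g. "g in Group"
    satisfying: frozenset   # source equalities
    exists: frozenset       # target variable bindings
    where: frozenset        # target equalities

def subsumes(s_big, s_small):
    """Toy criterion (an assumption of this sketch): same premise and where
    clause, but s_small has strictly more existential variables, all of them
    left unconstrained by the shared where clause."""
    return (s_big.for_clause == s_small.for_clause
            and s_big.satisfying == s_small.satisfying
            and s_big.where == s_small.where
            and s_big.exists < s_small.exists)

def prune(tgds):
    """Keep only the tgds that are not subsumed by another tgd in the set."""
    return [s for s in tgds
            if not any(subsumes(t, s) for t in tgds if t is not s)]

# s1 and s2 from the text: s2 adds an unconstrained "e in Emp".
s1 = SoTgd(frozenset({"g in Group"}), frozenset(),
           frozenset({"d in Dept"}),
           frozenset({"d.did = F[g]", "d.dname = g.gname"}))
s2 = SoTgd(frozenset({"g in Group"}), frozenset(),
           frozenset({"d in Dept", "e in Emp"}),
           frozenset({"d.did = F[g]", "d.dname = g.gname"}))
print(prune([s1, s2]) == [s1])  # True: s2 is pruned
```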

A.3 Pseudocode for Phase 4

The algorithm that forms substitutions to be applied during the elimination of residual equality constraints is listed below. Note that the residual equality constraints are created as needed (in the form of actual substitutions) from the input SO tgds. At the end of MapMerge, all the failed substitutions are returned as the final residual equality constraints.

Algorithm FindNextSubstitution(Σ, F)
Input: Σ is a set of SO tgds; F is a set of substitutions that have failed on previous attempts.
Output: Either (1) U, a substitution candidate that has not been applied on Σ before, or (2) failure, if no substitution candidates can be found.

For each SO tgd σ ∈ Σ
  Recall that σ has the form:
      for x1 in S1, x2 in S2, . . . , xp in Sp
      satisfying B(x1, . . . , xp)
      exists y1 in Q1, y2 in Q2, . . . , yk in Qk
      where ∧_{1 ≤ j ≤ k} ∧_{A ∈ Atts(yj)} yj.A = ejA(x1, . . . , xp)
  Let Cσ be the source context of σ (i.e., its for and satisfying clauses):
      {x1 ∈ S1, x2 ∈ S2, . . . , xp ∈ Sp; B(x1, . . . , xp)}
  For each target expression yj.A in the where clause of σ
    Let {E1, . . . , Em} be the list of source expressions equated with yj.A, directly or indirectly, in the where clause of σ
    If m > 1 then
      There are three cases to consider, depending on the number of source expressions of the form xi.A in the list {E1, . . . , Em}.
      Case 1. There is more than one source expression of the form xi.A, where 1 ≤ i ≤ p.
        Return the conflicting SO tgd σ to the user and exit.
      Case 2. There is exactly one source expression of the form xi.A, where 1 ≤ i ≤ p.
        Wlog, let E1 denote the source expression xi.A in the list.
        Let U = (Cσ, Ei, E1) such that 2 ≤ i ≤ m and U ∉ F.
        If such a U can be found, return U. Otherwise, continue.
      Case 3. There are no source expressions of the form xi.A, where 1 ≤ i ≤ p.
        Let U = (Cσ, Ei, Ej) such that i ≠ j, 1 ≤ i, j ≤ m, and U ∉ F.
        If such a U can be found, return U. Otherwise, continue.
Return failure (no substitutions can be found)

The algorithm that actually applies a substitution on the SO tgds is presented below.

Algorithm Substitute(Σ, U)
Input: Σ is a set of SO tgds; U is a substitution.
Output: Success if U can be applied to Σ; otherwise, failure.

Recall that U is of the form (C, E1, E2), where E1 and E2 are source expressions, and C has the form:
    {x1 ∈ S1, x2 ∈ S2, . . . , xp ∈ Sp; B(x1, . . . , xp)}

For each constraint σ ∈ Σ
  Assume σ is of the form:
      for x′1 in S′1, x′2 in S′2, . . . , x′n in S′n
      satisfying B′(x′1, . . . , x′n)
      exists y1 in Q1, y2 in Q2, . . . , yk in Qk
      where ∧_{1 ≤ j ≤ k} ∧_{A ∈ Atts(yj)} yj.A = ejA(x′1, . . . , x′n)
  If there is a homomorphism h : {x1, . . . , xp} → {x′1, . . . , x′n} such that B′(x′1, . . . , x′n) implies B(h(x1), . . . , h(xp)) then
    Replace h(E1) with h(E2) in σ
  Else
    Revert to the original Σ from the start of the Substitute routine
    Return failure
Return success



B. Experimental Evaluation

We conducted a series of experiments on a set of synthetic and real-life mapping scenarios to evaluate MapMerge. We first report on the synthetic mapping scenarios and, using the similarity measure presented in Section 5, demonstrate a clear improvement in the preservation of data associations when using MapMerge. We then present results for two interesting real-life scenarios, whose characteristics match those of our synthetic scenarios. We have also implemented some of our synthetic scenarios on two commercial mapping systems. The comparison between the mappings generated by these systems and by MapMerge produced results similar to the previous experiments.

We implemented MapMerge in Java as a module of Clio [10]. For all our experiments, we started by creating the mappings with Clio. These mappings were then used as input to the MapMerge operator. To perform the data exchange, we used the query generation component in Clio to obtain SQL queries that implement the mappings in the input and output of MapMerge. These queries were subsequently run on DB2 Express-C 9.7. All results were obtained on a dual Intel Xeon 3.4GHz machine with 4GB of RAM.

B.1 Synthetic Mapping Scenarios

Our synthetic mapping scenarios follow the pattern of transforming data from a denormalized source schema to a target schema containing a number of relational hierarchies, with each hierarchy having at its top an "authority" relation, while the other relations refer to the authority relation through foreign key constraints. For example, Figure 4 shows a source schema that consists of a single relation S. The target schema consists of 2 hierarchies of relations, rooted at relations T1 and T2. Each relation in a hierarchy refers to the root via a foreign key constraint from its F attribute to the K attribute of the root. This type of target schema is typical in ontologies, where a hierarchy of concepts often occurs.

[Figure 4 (schema diagram): the source schema has the single relation S(A1, A11, A12, A2, A21, A22); the target schema has two hierarchies, T1(K, A1) with children T11(F, A11) and T12(F, A12), and T2(K, A2) with children T21(F, A21) and T22(F, A22), where each child's F attribute is a foreign key to the K attribute of its root.]

Figure 4: Synthetic Experimental Scenario

The synthetic scenarios are parameterized by the number of hierarchies in the target schema, as well as by the number of relations referring to the root in each hierarchy. For our experimental settings we choose these two parameters to be equal, and their common value n defines what we call the complexity of the mapping scenario. In Figure 4 we show an example where n = 2. The table in Figure 5 shows the sizes of the experimental scenarios in terms of the number of target relations, together with the execution times for generating Clio mappings and running the MapMerge operator. We notice that the time needed to execute MapMerge is small (less than 2 minutes in our largest scenario) but dominates the overall execution time as the number of target relations grows.
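The parameterization by n can be made concrete with a small generator for the synthetic target schema. The encoding is our own; Figure 4 writes T11 for what the code names T1_1 to keep names unambiguous for larger n.

```python
def synthetic_target_schema(n):
    """Target schema of the synthetic scenario of complexity n: n hierarchies,
    each with a root T_i(K, A_i) and n children T_i_j(F, A_i_j) whose F
    attribute is a foreign key to the root's K attribute."""
    relations = {}
    fkeys = []  # ((child, "F"), (root, "K")) pairs
    for i in range(1, n + 1):
        root = f"T{i}"
        relations[root] = ["K", f"A{i}"]
        for j in range(1, n + 1):
            child = f"T{i}_{j}"
            relations[child] = ["F", f"A{i}_{j}"]
            fkeys.append(((child, "F"), (root, "K")))
    return relations, fkeys

rels, fks = synthetic_target_schema(2)
print(len(rels), len(fks))  # 6 4 -- n(n+1) relations, as in Figure 5's table
```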

The graphs in Figure 5 show the results of our experiments on the synthetic mapping scenarios. For each scenario, the source instance contained 100 tuples populated with randomly generated string values. The first graph shows that the target instances generated using MapMerge mappings are consistently (and considerably) smaller than the instances generated using Clio mappings. Here, we used the total number of atomic data values in the generated target instance as the size of the target instance (i.e., the number of tuples times the relation arity, summed across target relations).
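As a small sketch, the target instance size used in the first graph can be computed as follows (the dict-of-tuples encoding of an instance is our own):

```python
def instance_size(instance):
    """Size of a target instance as defined in the text: the total number of
    atomic data values, i.e., (number of tuples) x (relation arity), summed
    across relations. An instance is modeled as a dict mapping a relation
    name to a list of equal-length tuples."""
    return sum(len(tuples) * len(tuples[0]) if tuples else 0
               for tuples in instance.values())

J = {
    "Dept": [("d1", "CS"), ("d2", "EE")],  # arity 2, 2 tuples -> 4 values
    "Emp":  [("Alice", "NY", "d1")],       # arity 3, 1 tuple  -> 3 values
}
print(instance_size(J))  # 7
```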

The second graph in Figure 5 shows that the source instances have a higher degree of similarity to these smaller MapMerge target instances. The degree of similarity of a source instance I to a target instance J is computed as the ratio of Sim(FD(τ(I)), FD(τ(J))) to Sim(FD(τ(I)), FD(τ(I))), where the latter represents the ideal case where every tuple in FD(τ(I)) is preserved by the target instance; this quantity simplifies to |FD(τ(I))|. We notice that the degree of similarity decreases as the complexity n of the mapping scenario increases. This is because, as n increases, more uncorrelated hierarchies are used in the target schema. In turn, this means that the source relation is broken into more uncorrelated target hierarchies, and hence they are less similar to the source. The graph shows that Clio mappings, when compared to MapMerge mappings, produce target instances that are significantly less similar to the source instance in all cases. Furthermore, the relative improvement when using MapMerge on top of the Clio mappings (shown as the numbers on top of the bars) increases substantially as n becomes larger. Intuitively, this is because most of the Clio mappings map the source data into each root and one of its child relations. On the other hand, MapMerge factors out the common mappings into the root relation and properly correlates the generated tuples for the child relations with the tuples in the root relation. The effect is that all child relations in a hierarchy are correlated by MapMerge, while Clio mappings can only correlate root-child pairs.
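The normalized degree of similarity can be sketched as below. Note that Sim here is a stand-in of our own that merely counts source full-disjunction tuples matched by some target tuple on their non-null attributes; the actual measure of Section 5 is richer.

```python
def covered(s, t):
    """True if target FD tuple t agrees with source FD tuple s on every
    attribute where s is non-null (a simplified tuple-preservation check).
    FD tuples are modeled as dicts; None marks a null from the outer join."""
    return all(t.get(a) == v for a, v in s.items() if v is not None)

def sim(fd_I, fd_J):
    """Count the source full-disjunction tuples preserved in the target."""
    return sum(any(covered(s, t) for t in fd_J) for s in fd_I)

def degree_of_similarity(fd_I, fd_J):
    """Normalized similarity Sim(FD(I), FD(J)) / |FD(I)|, as a percentage;
    the denominator is the ideal case where every source tuple is preserved."""
    return 100.0 * sim(fd_I, fd_J) / len(fd_I)

fd_I = [{"gname": "CS", "ename": "Alice"}, {"gname": "EE", "ename": None}]
fd_J = [{"gname": "CS", "ename": "Alice"}]
print(degree_of_similarity(fd_I, fd_J))  # 50.0
```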

B.2 Real-life Scenarios

We consider two related scenarios from the biological domain in this section. In the first scenario, we mapped Gene Ontology into BioWarehouse4. In the second, we mapped UniProt to BioWarehouse. The BioWarehouse documentation specifies the semantics of the data transformations needed to load data from various biological databases, including Gene Ontology and UniProt. In the GeneOntology scenario, we extracted 1000 tuples for each relation in the source schema of the mapping, while in the UniProt scenario we extracted the first 100 entries for the human genome and converted them from XML to a relational format to use as a source instance in our experiments. Table 1 shows the number of source and target tables mapped, the number of correspondences used for each mapping scenario, and the number of mapping expressions generated by Clio for each scenario.

Mapping        Source     Target     Attribute        Clio
Scenario       relations  relations  Correspondences  Mappings
GeneOntology   3          4          5                4
UniProt        13         10         23               14

Table 1: Characteristics of real-life mapping scenarios

Mapping        Mapping generation  Size of target     Degree of
Scenario       time (s)            instance           similarity (%)
               Clio   MapMerge     Clio    MapMerge   Clio   MapMerge
GeneOntology   1.71   0.34         11557   7801       29.7   35.3
UniProt        2.36   2.13         12923   11446      20.8   75.8

Table 2: Results for real-life mapping scenarios

4 http://biowarehouse.ai.sri.com/



Scenario        Number of         Mapping generation time (s)
Complexity (n)  target relations  Clio     MapMerge
 2                6                0.55      0.42
 4               20                2.07      0.82
 6               42                2.22      2.27
 8               72                5.54      6.14
10              110                7.81     10.67
12              156               10.43     22.98
14              210               19.02     54.94
16              272               31.51    107.99

[Figure 5 also includes two bar charts plotting, against the complexity n of the mapping scenario (2 through 16), the target instance size (number of atomic values, in thousands) and the degree of similarity (%), each with Clio and MapMerge bars; the annotations above the similarity bars (1.5 up to 8.5) give the improvement in the quality of the generated data.]

Figure 5: Experiments on synthetic scenarios

In Table 2 we show the results of applying MapMerge to the mappings generated by Clio in each scenario. The generation time columns show the time needed to generate the Clio mappings and the time needed by MapMerge to process those mappings (i.e., the total execution time is the sum of the two times). The size of target instance columns show the total number of atomic data values in the generated target BioWarehouse instance for each scenario. In both cases, the mappings produced with MapMerge reduced the target instance sizes.

The degree of similarity columns present the similarity measure from Section 5 for each scenario. This similarity is normalized as a percentage to ease comparison across scenarios; the percentage is with respect to the ideal similarity that a mapping can produce for the scenario. As discussed in Section B.1, this ideal similarity is the number of tuples in the tableau full disjunction of the source, i.e., |FD(τ(I))|.

In the two real-life settings, MapMerge is able to further correlate the mappings produced by Clio by reusing behavior in mappings that span different target tableaux, thus improving the degree of similarity. This improvement is very significant in the UniProt scenario, where the target schema has a central relation and twelve satellite relations that point back to the central relation (via a key/foreign key constraint). Here, each Clio mapping maps source data to the central relation and one satellite relation. MapMerge factors out this common part from all Clio mappings and properly correlates all generated tuples to the central relation.

B.3 Commercial Systems

We implemented some of the synthetic scenarios described in Section B.1 in two commercial mapping systems. Provided with only the correspondences from source attributes to target attributes, these systems produced mappings that scored lower than both the Clio and MapMerge mappings with respect to the preservation of data associations. For instance, in the synthetic scenario of complexity 2, while the MapMerge mappings had a result of 50% and the Clio mappings 33%, the result for both commercial systems was only 16%. The main reason behind this result is that these systems do not automatically take advantage of any constraints present on the schemas to better correlate the generated data and increase the preservation of data associations. The mappings generated by these commercial systems need to be manually refined to fix this lack of correlations.
