Using Heterogeneous Equivalences for Query Rewriting in Multidatabase Systems*

Daniela Florescu‡, Louiqa Raschid†, Patrick Valduriez‡

‡INRIA, Rocquencourt
78153 Le Chesnay Cedex, France
[email protected]

†University of Maryland
College Park, MD [email protected] Paper

Abstract

In order to have significant practical impact on future information systems, multidatabase management systems (MDBMS) must be both flexible and efficient. We consider an MDBMS with a common object-oriented model, based on the ODMG standard, and local databases that may be relational or object-oriented. In this context, query rewriting (for optimization) is made difficult by schematic discrepancy and by the need to model mapping information between the multidatabase and local schemas. We address the flexibility issue by representing the mappings from a local schema to the multidatabase schema as a set of heterogeneous object equivalences, in a declarative language. Efficiency is obtained by exploiting these equivalences to rewrite multidatabase OQL queries into equivalent, simplified queries on the local schemas.

1 Introduction

The advent of open systems is increasingly stimulating the development of information systems which can provide high-level integration of heterogeneous information within distributed systems. The heterogeneity typically stems from multiple data models (e.g. relational, object-oriented), different DBMS and legacy applications. Multidatabase systems (MDBMS) will therefore contribute the necessary technology for interoperability of distributed, heterogeneous and autonomous databases [Özsu91a].

*This research has been partially supported by the Advanced Research Project Agency under grant ARPA/ONR 92-J1929 and by the Commission of European Communities under Esprit project IDEA.

An MDBMS must provide transparent access to the participating databases, often called local databases, by hiding distribution and heterogeneity. There are several approaches to multidatabase management. The classic approach is to build a single global schema, which resolves the representational and schematic conflicts of the multiple local schemas, and integrates them within the unified global schema. Thus, each of the local databases is a model for the global schema. Although this approach yields full transparency, its obvious limitation is lack of autonomy, since each local database must be conceptually integrated within the unique global schema. The mapping from each local schema to the global schema is often expressed in some common extended SQL-like data definition and manipulation language, e.g. HOSQL in the Pegasus system [Ahmed91] or SQL/M in the UniSQL/M system [Kim93].

The federated approach [Sheth90] stresses autonomy and flexibility by relying on multiple import schemas, which can be combined as needed at various multidatabase levels. The difference from the global schema approach is that not all conflicts are resolved (within a global schema) and each local database is not a model for the global schema. Similar approaches either specify a mediator knowledge base which has mapping knowledge among different schemas [Qian95, Raschid94], or use a common multidatabase language [Litwin86].

Another approach which can be used for MDBMS is distributed object management [Manola92, Özsu93], which generalizes the federated approach. The idea is to model heterogeneous databases at the appropriate level of granularity as objects in a distributed object space and to provide the capabilities and protocols for object interoperation. This involves the definition of a common object model and common object language. The adoption of this approach as a common integration framework is also illustrated by standardization activities in the OMG [OMG92].
Of particular relevance for object-oriented database interoperability is the ODMG standard [Cattell93], which extends the OMG data model. Since an object model generalizes the relational model, this approach can effectively address interoperability of relational and object databases.

In order to have significant practical impact on future information systems, multidatabase management must be both flexible and efficient. Flexibility is necessary to ease the introduction of local database schemas in the multidatabase schema, without compromising the autonomy of the other databases. Efficiency in processing multidatabase queries is also becoming increasingly important as MDBMS need to scale up to large numbers of local databases. Most of the work on multidatabase query processing assumes a global schema, which makes it possible to reuse distributed query processing techniques [Özsu91b]. However, this approach trades flexibility and autonomy for efficiency.

Other recent proposals for transforming multidatabase queries are based on higher-order query languages [Krishnamurthy91], higher-order logics [Lakshmanan93], or meta-models [Barsalou92]. Each of these depends on using a query language or model that is very different from (and more complex than) the relational or object models and languages that are currently supported.

A few projects have implemented multidatabase query processing. For example, in Pegasus, an HOSQL query on the multidatabase is represented as an E-tree (of subqueries and operations),
in some extended relational algebra, and then decomposed into parametric subqueries, each to be evaluated by a separate DBMS. Some simplification of the queries is also possible, e.g. elimination of the expensive outer-join operation. However, these simplifications are ad hoc and cannot be extended to exploit semantic knowledge about the local schemas.

To summarize, a shortcoming of most previous research in multidatabase query processing is that transformations and simplifications are not specified in a systematic manner. There is neither a methodology nor an architecture within which the knowledge needed for query optimization can be represented, and within which techniques for query optimization can be developed and applied in conjunction with the task of query transformation in multidatabases.

In this paper, we address the issues of flexibility and efficiency in an MDBMS. We follow the distributed object management approach, which is most appealing to us. Thus, our solution can also work with a global or federated schema. The common multidatabase level is the ODMG data model and the multidatabase queries are expressed in the standard Object Query Language (OQL). We simply assume that each application querying the multidatabase accesses a catalog which stores import schemas with the associated schema mappings. We also assume that local databases are relational or object-oriented.

Flexibility is obtained through a uniform definition of both the schema mappings (from the local schema into the multidatabase schema) and semantic knowledge based on integrity constraints in the local schema and the multidatabase schema. Both are expressed as equivalence rules written in a declarative language [Florescu94a, Florescu94b]. Efficiency in rewriting multidatabase queries into equivalent queries on the local schemas is obtained by performing, in a uniform way, the syntactic simplifications and the transformations based on semantic and heterogeneous object equivalences.
The benefits of our research are thus a declarative way to describe the mapping from the local schema into the multidatabase schema, and a systematic and uniform way to perform transformations, based on a variety of knowledge, to produce an equivalent and simpler query.

The paper is organized as follows. Section 2 introduces the multidatabase query processing environment with its MDBMS architecture and the multidatabase language. Section 3 defines semantic and heterogeneous object equivalences and presents an example schema. Section 4 presents the query rewriting process using object equivalences and syntactic simplification. Section 5 illustrates the use of heterogeneous object equivalences for query transformation and simplification. Section 6 concludes.

2 Multidatabase Environment

The introduction of our approach to multidatabase query processing requires the presentation of a simplified MDBMS architecture. Figure 1 shows the major components of a simplified MDBMS architecture and the communication among them in terms of query flow. For simplicity, query results (which should go upward) are not shown. The multidatabase consists of two local databases, each managed by an autonomous DBMS. For instance, let us assume for the sake of discussion that
DBMS1 is object-oriented while DBMS2 is relational.

[Figure 1: MDBMS architecture. Query flow: the multidatabase query Q (MDB) enters the MDBMS, whose query processor uses the MDB catalog (schema mapping); the resulting subqueries on DB1 and DB2 pass through the local application interfaces LAI1 and LAI2 to DBMS1 and DBMS2.]

Following the distributed object management approach, each DBMS is accessed through a Local Application Interface (LAI) component which maps the common multidatabase language into the local language. The MDBMS component plays the role of a Distributed Object Manager (DOM) in [Manola92] and offers multidatabase access using the multidatabase language. It typically provides transaction support and query processing. The MDBMS query processor relies on the multidatabase catalog, which contains the import schemas and the schema mappings, all expressed in the common object model. For instance, it would give the description of DB1 and DB2 in terms of the MDBMS level and the mapping rules from local to MDBMS levels.

An input query Q is expressed on the multidatabase schema in OQL. The query processor uses the catalog to apply schema mappings and produce OQL subqueries, each expressed on a local database schema, but represented in the common object model. For instance, Q1(DB1) is an OQL query expressed on the common object model representation of the object-oriented schema DB1, while Q2(DB2) is an OQL query expressed on the common object model representation of the relational schema DB2. Each subquery is then given to the corresponding LAI, which translates it into the local language. For instance, Q1'(DB1) is a query for the local object DBMS while Q2'(DB2) is a standard SQL query on the corresponding relations. The query processor also produces a query
which integrates the results of the subqueries into a final result.

The multidatabase model and language used to describe each local database is based on the ODMG standard [Cattell93]. We introduce the main elements of the object data model and query language (with minor changes) which are necessary for the rest of the paper.

The object data model is based on a type system. Type expressions are constructed from basic types (atomic built-in types and user-defined object types), through the recursive application of type constructors like tuple, set or list. The set of built-in types we use is {integer, float, bool, string}.

The object types are organized along a subtype hierarchy. All the attributes, associations and operations defined on a supertype are inherited by the subtype. Furthermore, the instances of a subtype satisfy all integrity constraints defined on its supertype. The set of all instances of a given object type and its subtypes is called the extent of this type. Object type extents can be explicitly named, in which case they are automatically maintained.

The set of operators includes built-in operators, user-defined functions and user-defined methods. The built-in operators are comparison and arithmetic operators, aggregation operators (e.g. count, min, max, sum, avg), set operators (e.g. union, except, intersect, flatten, element), list operators (e.g. append, first, last, nth), the set membership operator (in) and the select operator.

An object database is accessed through named variables, each associated with a type expression, which define the persistence roots of the database. The names of these variables and their values are maintained in the catalog. Particular named variables are associated with extents of object types that are automatically maintained. If C is defined as an object type with extent, then extent(C) is a named variable of type {C}.

OQL is a standard declarative (nonprocedural) language for exploring the database contents, starting from its entry points (i.e.
the set of the named variables). It generalizes standard SQL but supports more powerful capabilities. The fundamental difference between standard relational SQL and OQL is that OQL queries are defined as general well-typed expressions over the set of named variables, and do not privilege the select constructor, as SQL does. OQL expressions are syntactically constructed by some combination of user-defined and built-in functions, starting with a finite set of constants and variables, as follows:

    expr: constant
        | variable
        | op(expr, ..., expr)                              (if op is the name of a built-in or user-defined operator)
        | expr.method_name()                               (method call)
        | expr.field_name                                  (field selection)
        | [field_name = expr, ..., field_name = expr]      (tuple constructor)
        | set(expr, ..., expr)                             (set constructor)
        | list(expr, ..., expr)                            (list constructor)
        | new class_name (expr)                            (object creation)
        | select [distinct] expr
          from variable in expr [and variable in expr]*
          [where expr]                                     (selection)
        | exists variable in expr ( expr )                 (existential quantifier)
        | forall variable in expr ( expr )                 (universal quantifier)

Thus, an OQL query is not simply the application of a select operator, as in SQL, and could be a nested expression, through the nested application of the operators (as defined above). An OQL query is restricted to be an OQL expression over the set of particular named variables associated with each schema.

However, the OQL select is a built-in n-ary operator of particular importance, which generalizes the classical select-project-join operators of the relational algebra. The arguments of the select operator can be any collection-oriented expression, including class extents. The condition of the select is optional. The expressions for each input collection, the predicate and the projection are general expressions, which satisfy some type conditions. As a consequence, OQL allows nested selects, dependent joins, and user-defined functions or methods in all clauses of the select operator, as well as quantified predicates.

3 Semantic and Heterogeneous Equivalences

Semantic knowledge regarding the integrity constraints defined in the local schemas and in the multidatabase schema is useful for a correct query translation. For flexibility, such knowledge is declared by means of semantic equivalences [Florescu94a], which we define as follows:

Definition 3.1 Semantic Equivalence
A semantic equivalence is a first order logic formula of the form:

    [forall <var_name> of type <type_expression>]* expr1 ≡ expr2        (1)

where expr1 and expr2 are OQL expressions that only use previously quantified variables and the named variables.
It is denoted by (X, expr1, expr2), where X is the set of variables that are universally quantified.

Semantic equivalences can capture useful knowledge on a schema such as the existence of keys, associations and inverse links, attribute duplication, algebraic properties of user-defined operators, etc. Figures 2, 3 and 4 show the two local database schemas and the multidatabase schema used as examples in our paper, together with the corresponding semantic equivalences. The actual equivalences will be described later together with details of the schemas. In this paper, we focus on semantic equivalences based on integrity constraints, e.g., existence of a key in both object-oriented and relational schemas, functional dependencies in the relational schema, and inverse links in the object schema. We assume the MDBMS is given valid semantic equivalences.
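As an informal illustration of what a key-based equivalence such as S11 (x.name=y.name ≡ x=y) asserts, the following Python sketch checks it by brute force over a small extent: the equivalence holds when both sides agree for every instantiation of x and y. The class `Course1`, the helper `is_valid` and the sample data are hypothetical stand-ins introduced only for this example; they are not part of the system described in the paper.

```python
from itertools import product


class Course1:
    """Toy stand-in for a Course1 object; identity (`is`) plays the role of the OID."""

    def __init__(self, name, domain):
        self.name = name
        self.domain = domain


def is_valid(extent, lhs, rhs):
    """True if lhs(x, y) and rhs(x, y) agree for every pair (x, y) drawn from extent."""
    return all(lhs(x, y) == rhs(x, y) for x, y in product(extent, repeat=2))


def same_name(x, y):
    return x.name == y.name


def same_object(x, y):
    return x is y


# Hypothetical extent in which name really is a key.
courses1 = [Course1("Databases", "database"), Course1("Compilers", "languages")]
valid_before = is_valid(courses1, same_name, same_object)

# Adding a second object with an existing name violates the key constraint.
courses1.append(Course1("Databases", "theory"))
valid_after = is_valid(courses1, same_name, same_object)
```

In this toy setting `valid_before` is true and `valid_after` is false, which is why the rewriting process has to assume that the supplied equivalences reflect constraints actually enforced by the databases.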

Definition 3.2 Valid Semantic Equivalence
An equivalence (X, expr1, expr2) is valid in a database if, for every instantiation of the variables in X, expr1 and expr2 evaluate to the same data.

Similar to semantic equivalences, which are used to incorporate application-specific domain knowledge, we use heterogeneous object equivalences (or simply heterogeneous equivalences) to specify mapping knowledge between entities in the multidatabase schema (i.e. the set of named variables) and entities in the local schemas. Each local database schema can be relational or object. The heterogeneous equivalences are used during the translation process. They are given by the user and they express the relationship between an object instance that is created in the multidatabase and the "values" corresponding to this object, which are imported from the local databases. We suppose that object identifiers are not shared across databases, and any links between entities must necessarily be value-based.

The heterogeneous equivalence defining the named variable var of the multidatabase schema is syntactically defined as var ≡ expression, where expression is an expression over the set of named variables in the multidatabase schema or in the local schemas. These heterogeneous equivalences are particular cases of semantic equivalences and can be used in the same way during query translation. Every occurrence of the named variable var in the input query can be syntactically replaced by the corresponding expression.
In the particular case of class extents, the heterogeneous equivalences are defined as follows.

Definition 3.3 Heterogeneous Equivalence for Class Extents
If C is a class name in the multidatabase schema with extent extent_C, then the corresponding heterogeneous equivalence has the following form:

    extent_C ≡ select new C (proj)
               from var_1 in C_1 and ... and var_n in C_n
               where pred

where the collections C_i, the predicate and the projection are OQL expressions over the set of named variables in the multidatabase schema or in the local schemas.

Objects of class C are created after importing data from the local databases. Each query in the multidatabase schema on the extent of type C will obtain instances of some data values, as specified in the definition. The query will create a new object for each appropriate combination of values. The knowledge of a "key", with an implicit uniqueness constraint, is essential to correctly specify heterogeneous equivalences.

We will illustrate the definition of heterogeneous equivalences with a university database example which will be used in the rest of the paper. Figures 2 and 3 show two local database schemas, together with the corresponding semantic equivalences. Figure 4 shows the corresponding multidatabase schema with the associated semantic equivalences. For readability, we suffix the names in the local databases by their index (1 or 2).

    Local Schema 1

    class Course1
        ( extension courses1
          key name )
    {   attribute string name;
        attribute string domain;
        relationship Professor1 professor; }

    class Person1
        ( key name )
    {   attribute string name;
        attribute string address; }

    class Professor1: Person1
        ( extension professors1 )
    {   attribute string function; }

    class Student1: Person1
        ( extension students1 )
    {   attribute integer reg_nb; }

    class Enroll1
        ( extension enrolls1
          key course, stud )
    {   relationship Course1 course;
        relationship Student1 stud; }

    Semantic Equivalences

    (S11) forall x:Course1 and y:Course1
          x.name=y.name ≡ x=y

    (S12) forall x:Person1 and y:Person1
          x.name=y.name ≡ x=y

    Figure 2: Local schema 1

Local database 1 is object-oriented and stores information on university courses, professors who teach the courses, and students. Class Enroll1 maintains the relationship between a student and the attended courses. The corresponding semantic equivalences express that the field "name" is the key for class Course1 and for class Person1. For instance, equivalence S11 states that two objects of type Course1 are identical (i.e., they have the same object identifier) if and only if they have the same value for the field name.

Local database 2 is relational and stores information on the extra-curricular activity of the persons in the university (professors and students). It contains three tables corresponding to the activities, the persons, and the relationship between persons and their activity. Semantic equivalence S21 expresses that the field "name" is a key for relation Person2.

Information in the multidatabase schema (Figure 4) is structured differently from the local databases. For instance, the many-to-many relationship between students and courses is stored in a set-valued field in class Student rather than in an additional class.

    Local Schema 2

    relation Activity2(aid, name, time, place)
    relation Person2(pid, name, address)
    relation Do2(pid, aid)

    Semantic Equivalences

    (S21) forall x:Person2 and y:Person2
          x.name=y.name ≡ x.pid=y.pid

    Figure 3: Local schema 2

    Multidatabase Schema

    class Course
        ( extension courses
          key name )
    {   attribute string name;
        attribute string domain;
        relationship Professor professor; }

    class Professor
        ( extension professors
          key name )
    {   attribute string name;
        relationship set<Course> teaches; }

    class Student
        ( extension students
          key name )
    {   attribute string name;
        attribute string address;
        attribute set<[activity: string, time: string]> extra_activity;
        attribute set<Course> attends; }

    Semantic Equivalences

    (S31) forall x:Course and y:Course
          x.name=y.name ≡ x=y

    (S32) forall x:Professor and y:Professor
          x.name=y.name ≡ x=y

    (S33) forall x:Course and y:Professor
          x.professor=y ≡ x in y.teaches

    (S34) forall x:Student and y:Student
          x.name=y.name ≡ x=y

    Figure 4: Multidatabase schema

The one-to-many relationship between professors and courses is stored by means of inverse links. Furthermore, some information

in the local databases is ignored at the multidatabase level, e.g. a student's registration number.

The mapping between multidatabase entities and local database entities is specified by heterogeneous equivalences H1, H2 and H3.

Heterogeneous equivalence H1.

    courses ≡ select new Course([name=x1.name,
                                 domain=x1.domain,
                                 professor=y])
              from x1 in courses1 and y in professors and y1 in professors1
              where y.name=y1.name and x1.professor=y1

Heterogeneous equivalence H2.

    students ≡ select new Student([name=x1.name,
                                   address=x1.address,
                                   extra_activity=(select [activity=y2.name, time=y2.time]
                                                   from y2 in Activity2 and z2 in Do2 and x2 in Person2
                                                   where z2.aid=y2.aid and z2.pid=x2.pid
                                                         and x2.name=x1.name),
                                   attends=(select y
                                            from y in courses and y1 in courses1 and z1 in enrolls1
                                            where y.name=y1.name and z1.course=y1 and z1.stud=x1)])
               from x1 in students1

Heterogeneous equivalence H3.

    professors ≡ select new Professor([name=x1.name,
                                       teaches=(select y
                                                from y in courses and y1 in courses1
                                                where y.name=y1.name and y1.professor=x1)])
                 from x1 in professors1

To explain, H1 specifies a heterogeneous equivalence for courses, the particular named variable corresponding to the extent of Course in the multidatabase schema. The values for the corresponding fields of the type Course are obtained by evaluating an OQL query, and a new Course object is created. H1 refers to the named variable courses1, corresponding to the extent of the class Course1, and to professors1, corresponding to the extent of the class Professor1, in local schema 1. It also refers to the named variable professors, corresponding to the extent of the class Professor in the multidatabase schema. This implies that a query is evaluated over the extents of classes Course1 and Professor1 in the local schema, and over the extent of class Professor in the multidatabase schema. H2 and H3 are similar.
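To make the value-based linking in H1 concrete, here is a minimal Python sketch that evaluates the right-hand side of H1 over toy data. It assumes that the multidatabase Professor objects already exist (for instance because H3 was evaluated first); the `Obj` helper and all sample values are hypothetical and only serve the illustration. Python identity (`is`) stands in for equality of object identifiers.

```python
class Obj:
    """Toy object: fields are set from keyword arguments, equality is identity."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


# Local schema 1 (hypothetical sample data).
p1 = Obj(name="Smith", address="Paris", function="lecturer")
professors1 = [p1]
courses1 = [Obj(name="Databases", domain="database", professor=p1)]

# Multidatabase-level Professor objects, assumed to be available already.
professors = [Obj(name="Smith", teaches=set())]

# H1: create one new Course per combination (x1, y, y1) satisfying the predicate
#     y.name = y1.name and x1.professor = y1
courses = [
    Obj(name=x1.name, domain=x1.domain, professor=y)
    for x1 in courses1
    for y in professors
    for y1 in professors1
    if y.name == y1.name and x1.professor is y1
]
```

The new Course object points to the multidatabase Professor found through the shared name, not to the local Professor1 object, which reflects the assumption that object identifiers are not shared across databases.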

4 Query Rewriting

Recall that in the multidatabase architecture, an input OQL query, expressed with respect to the multidatabase schema, is transformed into OQL queries against the local schemas. This translation process can be summarized as follows. The OQL query, posed on the multidatabase schema entities, is first syntactically simplified. There are several rules for syntactic simplification, and they will be discussed in this section. Next, any applicable logical transformations with respect to multidatabase schema entities are applied. This includes both logical equivalences that describe the algebraic properties of the built-in operators, and semantic equivalences based on integrity constraints in the multidatabase schema. In this paper, we do not include examples of the first type of logical equivalence. Next, relevant heterogeneous equivalences are applied so that references to multidatabase schema entities are replaced with references to entities in the local schemas. Every named variable in the multidatabase schema must have at least one heterogeneous equivalence defining it; otherwise a complete translation may not be available. In addition, any applicable logical transformations (based on semantic equivalences with respect to the local schema entities) are applied.

Since the application of the heterogeneous equivalences may introduce new references to entities in the multidatabase schema, this process continues until all such references have been replaced. For example, if we apply the heterogeneous equivalence H1 to substitute for the named variable courses in the multidatabase schema, we introduce a reference to the named variable professors, which then has to be replaced by applying H3.

Heterogeneous equivalences and semantic equivalences are stored in the multidatabase catalog and used during the translation process. They can be used as rewriting rules in order to produce queries equivalent to an input query, using a type-based pattern-matching algorithm.
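The use of an equivalence as a rewriting rule can be sketched in a few lines of Python. This is a deliberately simplified, untyped stand-in and not the type-based algorithm of [Florescu94a]: expressions are nested tuples, pattern variables are strings starting with "?", and the check that the matched subexpressions have the right type (e.g. Course1 for S11) is omitted. The tuple encoding and the function names `match`, `instantiate` and `rewrite` are assumptions made for this example.

```python
def match(pattern, expr, subst):
    """Return a substitution extending subst that makes pattern equal to expr, or None."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in subst:
            return subst if subst[pattern] == expr else None
        return {**subst, pattern: expr}
    if isinstance(pattern, tuple) and isinstance(expr, tuple) and len(pattern) == len(expr):
        for p, e in zip(pattern, expr):
            subst = match(p, e, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == expr else None


def instantiate(pattern, subst):
    """Apply a substitution to a pattern."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        return subst[pattern]
    if isinstance(pattern, tuple):
        return tuple(instantiate(p, subst) for p in pattern)
    return pattern


def rewrite(expr, lhs, rhs):
    """Replace the first subexpression matching lhs by the instantiated rhs."""
    subst = match(lhs, expr, {})
    if subst is not None:
        return instantiate(rhs, subst)
    if isinstance(expr, tuple):
        parts = list(expr)
        for i, part in enumerate(parts):
            new = rewrite(part, lhs, rhs)
            if new != part:
                parts[i] = new
                return tuple(parts)
    return expr


# S11 read left to right: x.name = y.name  becomes  x = y
S11_LHS = ("eq", ("field", "?x", "name"), ("field", "?y", "name"))
S11_RHS = ("eq", "?x", "?y")

# Predicate: x1.domain = "database" and x1.name = z1.course.name
pred = ("and",
        ("eq", ("field", "x1", "domain"), ("const", "database")),
        ("eq", ("field", "x1", "name"), ("field", ("field", "z1", "course"), "name")))

rewritten = rewrite(pred, S11_LHS, S11_RHS)
```

Here `rewritten` encodes x1.domain = "database" and x1 = z1.course. A faithful implementation would additionally verify the types of the bound variables before applying the rule.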
More details on the type-based pattern-matching algorithm can be found in [Florescu94a]. The query rewriting can be described simply as follows:

Definition 4.1 Query Rewriting
Given an equivalence E=(X, expr1, expr2) and a query q, q is rewritten as q', using E, as follows: if there is a subexpression q'' of q, and a substitution θ for the variables X, such that q''=expr1θ, then q' is obtained by replacing q'' in q by expr2θ. If the equivalence is valid, then q' is an equivalent query for q, i.e., they produce the same result.

The correctness of the query translation (i.e., that the final query produces the same answers as the original query) depends on the validity of the equivalences that are used during the translation process. In this paper, we do not address the problem of verifying the validity of semantic and heterogeneous equivalences, and assume that only valid equivalences are provided.

As mentioned earlier, during query rewriting, the application of the rules inferred from the semantic and heterogeneous equivalences is combined with some syntactic simplification rules, which produce simpler equivalent queries. Both the syntactic simplification and the rewriting using the semantic and heterogeneous equivalences are performed uniformly and transparently at the multidatabase level. This is one of the advantages of our approach. We briefly describe the syntactic
simplification rules that will be used in the rest of the paper.

Rule 1. Unnesting the nested query in the from clause. If C_{1,1}, ..., C_{1,n}, C_{2,1}, ..., C_{2,m}, proj, proj', pred and pred' are OQL expressions, then the query:

    select proj
    from var_{1,1} in C_{1,1} and ... and var_{1,i-1} in C_{1,i-1}
         and var_{1,i} in ( select proj'
                            from var_{2,1} in C_{2,1} and ... and var_{2,m} in C_{2,m}
                            where pred' )
         and var_{1,i+1} in C_{1,i+1} and ... and var_{1,n} in C_{1,n}
    where pred

is equivalent to:

    select proj''
    from var_{1,1} in C_{1,1} and ... and var_{1,i-1} in C_{1,i-1}
         and var_{2,1} in C_{2,1} and ... and var_{2,m} in C_{2,m}
         and var_{1,i+1} in C''_{1,i+1} and ... and var_{1,n} in C''_{1,n}
    where pred' and pred''

where proj'', C''_{1,i+1}, ..., C''_{1,n} and pred'' are obtained from proj, C_{1,i+1}, ..., C_{1,n} and pred respectively, by replacing all occurrences of variable var_{1,i} by proj'. The expressions corresponding to collections C_{1,1}, ..., C_{1,i-1} remain unchanged because variable var_{1,i} cannot appear in them.

Rule 2. Eliminate one argument of the select defined by a functional dependency. If C_1, ..., C_n, proj, pred and expr are OQL expressions, then the query:

    select proj
    from var_1 in C_1 and ... and var_i in C_i and ... and var_n in C_n
    where var_i=expr and pred

is equivalent to:

    select proj'
    from var_1 in C_1 and ... and var_{i-1} in C_{i-1} and var_{i+1} in C'_{i+1} and ... and var_n in C'_n
    where pred'

where proj', C'_{i+1}, ..., C'_n and pred' are obtained from proj, C_{i+1}, ..., C_n and pred respectively, by replacing all occurrences of variable var_i by expr. The expressions corresponding to the collections C_1, ..., C_{i-1} remain unchanged because variable var_i cannot appear in them.

Rule 3. Transforming the existential quantifier into explicit join. If C_1, ..., C_{n+1}, proj, pred and pred' are OQL expressions, then the query:

    select [distinct] proj
    from var_1 in C_1 and ... and var_n in C_n
    where pred and exists var_{n+1} in C_{n+1} ( pred' )

is equivalent to:

    select distinct proj
    from var_1 in C_1 and ... and var_n in C_n and var_{n+1} in C_{n+1}
    where pred and pred'

Rule 4. Simplification of a tuple constructor followed by a field selection. If expr_1, ..., expr_n are OQL expressions, then the expression:

    [field_name_1 = expr_1, ..., field_name_n = expr_n].field_name_i

is equivalent to expr_i, and

    (new class_name([field_name_1 = expr_1, ..., field_name_n = expr_n])).field_name_i

is equivalent to expr_i.

5 Using Heterogeneous Object Equivalences

In this section, we illustrate the process of rewriting a query on the multidatabase schema into an equivalent query (still expressed in the multidatabase language) on the local schemas. Query rewriting involves uniform application of heterogeneous equivalences, syntactic simplification and semantic equivalences. We use two example queries, in increasing order of difficulty, on the previous university multidatabase schema. For each query, we trace the rewriting process, highlighting the transformations that are applied at each step. For readability, we underline the subexpression used as a basis for each transformation.

Example 5.1 Query Translation involving one local database
The following query selects the names of all students enrolled in at least one database course:

    select x.name
    from x in students
    where exists z in x.attends (z.domain="database")

By applying Rule 3 (rewriting of the existential quantifier as explicit join), the input query becomes:

    select distinct x.name
    from x in students and z in x.attends
    where z.domain="database"

By replacing the reference to the students collection by the corresponding query given in heterogeneous equivalence H2, we obtain:

    select distinct x.name
    from x in (select new Student([name=x1.name,
                                   address=x1.address,
                                   extra_activity=(select [activity=y2.name, time=y2.time]
                                                   from y2 in Activity2 and z2 in Do2 and x2 in Person2
                                                   where z2.aid=y2.aid and z2.pid=x2.pid
                                                         and x2.name=x1.name),
                                   attends=(select y
                                            from y in courses and y1 in courses1 and z1 in enrolls1
                                            where y.name=y1.name and z1.course=y1 and z1.stud=x1)])
               from x1 in students1)
         and z in x.attends
    where z.domain="database"

This query can be unnested using Rule 1. All occurrences of x in the outer select are replaced by the projection of the inner select, new Student([...]), and the variable x1 defined in the from clause of the inner select is added to the from clause of the outer select. Finally, the tuple constructors followed by field selections are simplified using Rule 4, yielding:

    select distinct x1.name
    from x1 in students1
         and z in (select y
                   from y in courses and y1 in courses1 and z1 in enrolls1
                   where y.name=y1.name and z1.course=y1 and z1.stud=x1)
    where z.domain="database"

The query can now be unnested by using Rule 1 again. All occurrences of z in the outer select are replaced by the projection of the inner select, y; the variables y, y1 and z1 defined in the from clause of the inner select are added to the from clause of the outer select; and the predicate of the inner select is added as a conjunct to the predicate of the outer select. The query becomes:

    select distinct x1.name
    from x1 in students1 and y in courses and y1 in courses1 and z1 in enrolls1
    where y.domain="database" and y.name=y1.name and z1.course=y1 and z1.stud=x1

Since y1 must be equal to z1.course, we can replace all occurrences of y1 by z1.course and eliminate the third argument of the select using Rule 2, yielding:

    select distinct x1.name
    from x1 in students1 and y in courses and z1 in enrolls1
    where y.domain="database" and y.name=z1.course.name and z1.stud=x1

The same transformation can be applied to variable x1, eliminating the first argument of the select operator using Rule 2, which yields:

    select distinct z1.stud.name
    from y in courses and z1 in enrolls1
    where y.domain="database" and y.name=z1.course.name

By renaming variable y into z, we obtain:

    select distinct z1.stud.name
    from z in courses and z1 in enrolls1
    where z.domain="database" and z.name=z1.course.name

By replacing the reference to the multidatabase collection courses by the corresponding query given in the equivalence H1, the previous query is rewritten as:

    select distinct z1.stud.name
    from z in (select new Course([name=x1.name,
                                  domain=x1.domain,
                                  professor=y])
               from x1 in courses1 and y in professors and y1 in professors1
               where y.name=y1.name and x1.professor=y1)
         and z1 in enrolls1
    where z.domain="database" and z.name=z1.course.name

The query can now be unnested using Rule 1. All occurrences of z in the outer select are replaced by the projection of the inner select, new Course([...]); the variables x1, y and y1 defined in the from clause of the inner select are added to the from clause of the outer select; and the predicate of the inner select is added by conjunction to the predicate of the outer select. Finally, the tuple constructors followed by field selections are simplified using Rule 4. The query becomes:

    select distinct z1.stud.name
    from z1 in enrolls1 and x1 in courses1 and y in professors and y1 in professors1
    where x1.domain="database" and x1.name=z1.course.name
          and y.name=y1.name and x1.professor=y1

In local schema 1, semantic equivalence S11 states that two objects of type Course1 have the same name if and only if they are identical. Thus, the query can be rewritten as:

    select distinct z1.stud.name
    from z1 in enrolls1 and x1 in courses1 and y in professors and y1 in professors1
    where x1.domain="database" and x1=z1.course and y.name=y1.name and x1.professor=y1

Since x1 must be equal to z1.course, we can replace all occurrences of x1 by z1.course and eliminate the second argument of the select using Rule 2, yielding:

    select distinct z1.stud.name
    from z1 in enrolls1 and y in professors and y1 in professors1
    where z1.course.domain="database" and y.name=y1.name and z1.course.professor=y1

The same transformation can be applied to variable y1, thereby eliminating the third argument of the select:

    select distinct z1.stud.name
    from z1 in enrolls1 and y in professors
    where z1.course.domain="database" and y.name=z1.course.professor.name

Renaming variable y to t, we obtain:

    select distinct z1.stud.name
    from z1 in enrolls1 and t in professors
    where z1.course.domain="database" and t.name=z1.course.professor.name

By replacing the reference to the collection professors by the corresponding query given in equivalence H1, the query becomes:

    select distinct z1.stud.name
    from z1 in enrolls1
         and t in (select new Professor([
                       name=x1.name,
                       teaches=(select y
                                from y in courses and y1 in courses1
                                where y.name=y1.name and y1.professor=x1)])
                   from x1 in professors1)
    where z1.course.domain="database" and t.name=z1.course.professor.name

The query can now be unnested by applying Rule 1. All occurrences of t in the outer select are replaced by the projection of the inner select, new Professor([...]), and the variable x1 defined in the from clause of the inner select is added to the from clause of the outer select. Finally, the tuple constructors followed by field selections can be simplified using Rule 4. The query becomes:

    select distinct z1.stud.name
    from z1 in enrolls1 and x1 in professors1
    where z1.course.domain="database" and x1.name=z1.course.professor.name

We can now use semantic equivalence S12 in local schema 1, which states that two objects of type Person1 have the same name if and only if they are identical. Since object type Professor1 is a subtype of Person1, it satisfies the integrity constraints defined on its supertype. Thus, the query can be rewritten as:

    select distinct z1.stud.name
    from z1 in enrolls1 and x1 in professors1
    where z1.course.domain="database" and x1=z1.course.professor

Since x1 must be equal to z1.course.professor, we can replace all occurrences of x1 by z1.course.professor and eliminate the second argument using Rule 2. The final query on local database 1 is:

    select distinct z1.stud.name
    from z1 in enrolls1
    where z1.course.domain="database"

Thus, the initial query on the students collection in the multidatabase schema has been rewritten, through multiple steps, into a simple query on the enrolls1 collection in local schema 1. The final query produces the same result as the input query.

Example 5.2 Query involving multiple local databases

The following multidatabase query refers to collections in local databases 1 and 2; it selects the name and address of all students who play tennis on Friday.

    select [name=x.name, address=x.address]
    from x in students
    where exists z in x.extra_activity (z.activity="tennis" and z.time="Friday")

Using Rule 3 (existential quantifier into explicit join), the input query can be rewritten as:

    select distinct [name=x.name, address=x.address]
    from x in students and z in x.extra_activity
    where z.activity="tennis" and z.time="Friday"

By replacing the reference to the students collection by the corresponding query given in equivalence H2, the previous query can be rewritten as:

    select distinct [name=x.name, address=x.address]
    from x in (select new Student([
                   name=x1.name,
                   address=x1.address,
                   extra_activity=(select [activity=y2.name, time=y2.time]
                                   from y2 in Activity2 and z2 in Do2 and x2 in Person2
                                   where z2.aid=y2.aid and z2.pid=x2.pid
                                         and x2.name=x1.name),
                   attends=(select y
                            from y in courses and y1 in courses1 and z1 in enrolls1
                            where y.name=y1.name and z1.course=y1 and z1.stud=x1)])
               from x1 in students1)
         and z in x.extra_activity
    where z.activity="tennis" and z.time="Friday"

The query can now be unnested using Rule 1. All occurrences of x in the outer select are replaced by the projection of the inner select, new Student([...]), and the variable x1 defined in the from clause of the inner select is added to the from clause of the outer select. Finally, the tuple constructors followed by field selections are simplified using Rule 4. The query becomes:

    select distinct [name=x1.name, address=x1.address]
    from x1 in students1
         and z in (select [activity=y2.name, time=y2.time]
                   from y2 in Activity2 and z2 in Do2 and x2 in Person2
                   where z2.aid=y2.aid and z2.pid=x2.pid
                         and x2.name=x1.name)
    where z.activity="tennis" and z.time="Friday"

The query can again be unnested using Rule 1. All occurrences of z in the outer select are replaced by the projection of the inner select, [activity=..., time=...]; the variables y2, z2 and x2 defined in the from clause of the inner select are added to the from clause of the outer select; and the predicate of the inner select is added by conjunction to the predicate of the outer select. The query becomes:

    select distinct [name=x1.name, address=x1.address]
    from x1 in students1 and y2 in Activity2 and z2 in Do2 and x2 in Person2
    where y2.name="tennis" and y2.time="Friday" and z2.aid=y2.aid
          and z2.pid=x2.pid and x2.name=x1.name

The above query refers to entities in local database 1 (students1) and local database 2 (Activity2, Do2 and Person2). With some simple syntactic query analysis and using information on the local entities, we can obtain simpler subqueries, each on a local database.
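
As an informal sanity check of the two unnesting steps above, the following Python sketch evaluates both the query over the H2 view and the fully flattened query on a small data set. It is not part of the rewriting framework: the toy data, the function names, and the omission of the attends field are all assumptions made for illustration, and Python sets stand in for select distinct. The helper by_database is likewise hypothetical; it only suggests how collections might be grouped by their home database during the syntactic analysis mentioned above.

```python
# Invented toy data for the local collections named in Example 5.2.
students1 = [
    {"name": "Ann", "address": "Paris"},
    {"name": "Bob", "address": "Rome"},
]
Person2 = [{"pid": 1, "name": "Ann"}, {"pid": 2, "name": "Bob"}]
Activity2 = [
    {"aid": 10, "name": "tennis", "time": "Friday"},
    {"aid": 11, "name": "chess", "time": "Friday"},
]
Do2 = [{"pid": 1, "aid": 10}, {"pid": 2, "aid": 11}]


def students_view():
    """Multidatabase 'students' collection, built as in H2 (attends omitted)."""
    for x1 in students1:
        extra = [
            {"activity": y2["name"], "time": y2["time"]}
            for y2 in Activity2 for z2 in Do2 for x2 in Person2
            if z2["aid"] == y2["aid"] and z2["pid"] == x2["pid"]
            and x2["name"] == x1["name"]
        ]
        yield {"name": x1["name"], "address": x1["address"],
               "extra_activity": extra}


def nested_query():
    """Query on the multidatabase schema, in the form obtained after Rule 3."""
    return {
        (x["name"], x["address"])
        for x in students_view() for z in x["extra_activity"]
        if z["activity"] == "tennis" and z["time"] == "Friday"
    }


def flat_query():
    """Flattened query on the local collections, after two uses of Rule 1."""
    return {
        (x1["name"], x1["address"])
        for x1 in students1 for y2 in Activity2
        for z2 in Do2 for x2 in Person2
        if y2["name"] == "tennis" and y2["time"] == "Friday"
        and z2["aid"] == y2["aid"] and z2["pid"] == x2["pid"]
        and x2["name"] == x1["name"]
    }


def by_database(collections, home):
    """Group collection names by the local database that holds them."""
    groups = {}
    for c in collections:
        groups.setdefault(home[c], []).append(c)
    return groups


# Assumed catalog information: which local database holds each collection.
HOME = {"students1": "db1", "Activity2": "db2", "Do2": "db2", "Person2": "db2"}
```

On this toy data, nested_query() and flat_query() return the same set, and by_database(["students1", "Activity2", "Do2", "Person2"], HOME) places students1 in one group and the three remaining collections in the other, which matches the split into two subqueries.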
We can apply the inverse of Rule 1 to group in a nested query all the entities that belong to one local database, yielding:

    select distinct z
    from z in (select distinct [name=x1.name, address=x1.address]
               from x1 in students1)
         and t in (select distinct [name=x2.name]
                   from y2 in Activity2 and z2 in Do2 and x2 in Person2
                   where y2.name="tennis" and y2.time="Friday"
                         and z2.aid=y2.aid and z2.pid=x2.pid)
    where z.name=t.name

The multidatabase query has therefore been rewritten into a simplified nested query with two subqueries, each on one local database. The two subqueries have also been simplified using the same rewriting technique.

6 Conclusion

In this paper, we have defined heterogeneous equivalences for specifying schema mappings in MDBMS and shown their usefulness for query rewriting. The context is a MDBMS with an object-oriented, ODMG-based multidatabase model and local databases which may be relational or object-oriented. Since we have followed a distributed object management approach, our solution should also work with a multidatabase or federated schema.

Heterogeneous object equivalences provide a declarative definition to map the entities of a local schema into corresponding ones in the multidatabase data model. Like semantic equivalences, they are expressed in a first-order logic language. Such a declarative definition of schema mappings and semantic knowledge eases the introduction of local databases in the multidatabase, without compromising their autonomy.

Heterogeneous object equivalences ease the transformation of multidatabase queries, expressed in OQL on the multidatabase schema, into equivalent OQL queries on the local schemas. Like semantic equivalences, they can be exploited to simplify the local queries, e.g. by avoiding redundant processing, to gain efficiency. We have demonstrated query rewriting using semantic and heterogeneous equivalences through significant application examples.

The major contribution of the paper is to provide a framework for applying the different kinds of equivalences and simplifications, and to treat all the equivalences in a uniform manner. In this framework, it should also be easy to incorporate other equivalences that further simplify queries, e.g. an integrity constraint which can tell whether a query will fail.

The equivalences proposed in this paper have been validated in the Flora optimizer prototype. The Flora optimizer supports the ODMG data model and query language. It has been operational at INRIA since June 1994. The optimizer currently works on top of the O2 DBMS [Bancilhon92], which is used for local database support.

A natural extension of this work will be the declarative definition of an execution space for heterogeneous query optimization. A heterogeneous execution space should specify the alternative executions for a given multidatabase query. Used with a cost model which predicts the cost of an execution and a search strategy which selects the best execution in the execution space, it will be
critical for performing cost-based heterogeneous query optimization.

References

[Ahmed91] R. Ahmed et al., "The Pegasus Heterogeneous Multidatabase System". IEEE Computer, 24(12), 1991.

[Bancilhon92] F. Bancilhon, C. Delobel, and P. Kanellakis (eds.), Building an Object-oriented Database System - the story of O2. Morgan Kaufmann, San Mateo, CA, 1992.

[Barsalou92] T. Barsalou and D. Gangopadhyay, "M(DM): An Open Framework for Interoperation of Multimodel Multidatabase Systems". Int. Conf. on Data Engineering, Tempe, AZ, February 1992.

[Cattell93] R.G.G. Cattell et al., The Object Database Standard - ODMG 93. Morgan Kaufmann, San Mateo, CA, 1993.

[Florescu94a] D. Florescu and P. Valduriez, "Rule-based Query Processing in the IDEA System". Int. Symp. on Advanced Database Technologies and Their Integration, Nara, Japan, October 1994.

[Florescu94b] D. Florescu, J-R. Gruser, M. Novak, P. Valduriez and M. Ziane, "Design and Implementation of Flora, a Language for Object Algebra". Information Science, to appear, 1994.

[Kim93] W. Kim et al., "On Resolving Schematic Heterogeneity in Multidatabase Systems". Distributed and Parallel Databases, 1(3), 1993.

[Krishnamurthy91] R. Krishnamurthy, W. Litwin and W. Kent, "Language Features for Interoperability of Databases with Schematic Discrepancies". ACM SIGMOD Int. Conf., Denver, CO, May 1991.

[Lakshmanan93] L.V.S. Lakshmanan, F. Sadri and I.N. Subramanian, "On the Logical Foundations of Schema Integration and Evolution in Heterogeneous Database Systems". Int. Conf. on Deductive and Object-Oriented Databases, Phoenix, AZ, March 1993.

[Litwin86] W. Litwin and A. Abdellatif, "Multidatabase Interoperability". IEEE Computer, 19(12), December 1986.

[Manola92] F. Manola et al., "Distributed Object Management". Int. Journal of Intelligent and Cooperative Information Systems, 1(1), March 1992.

[Novak94] M. Novak, G. Gardarin and P. Valduriez, "Flora: a Functional-style Language for Object and Relational Algebra". Int. Conf. on Databases and Expert Systems Applications, Athens, Greece, September 1994.

[OMG92] Object Management Group, The Common Object Request Broker: Architecture and Specification. Framingham, MA, 1992.

[Özsu91a] T. Özsu and P. Valduriez, Principles of Distributed Database Systems. Prentice Hall, Englewood Cliffs, NJ, 1991.

[Özsu91b] T. Özsu and P. Valduriez, "Distributed Databases: Where are we now?". IEEE Computer, 24(8), 1991.

[Özsu93] T. Özsu, U. Dayal and P. Valduriez (eds.), Distributed Object Management. Morgan Kaufmann, San Mateo, CA, 1993.

[Qian95] X. Qian and L. Raschid, "Query Interoperation among Object-oriented and Relational Databases". Int. Conf. on Data Engineering, Taipei, Taiwan, February 1995.

[Raschid94] L. Raschid, Y. Chang and B. Dorr, "Query Transformation Techniques for Interoperable Query Processing in Cooperative Information Systems". Int. Conf. on Cooperative Information Systems, Toronto, May 1994.

[Sheth90] A. Sheth and J. Larson, "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases". ACM Computing Surveys, 22(3), 1990.
