safety and translation of relational calculus

44
Safety and Translation of Relational Calculus Queries ALLEN VAN GELDER University of California at Santa Cruz and RODNEY W. TOPOR The University of Melbourne Notallqueries inrelational calculus can beanswered sensibly when disjunction, negation, and universal quantification are allowed, The class of relation calculus queries or formulas that have sensible answers is called the domam independent class which is known to be undecidable. Subsequent research has focused on identifying large decidable subclasses of domain indepen- dent formulas. In this paper we investigate the properties of two such classes: the et,aluable formulas and the allowed formulas. Although both classes have been defined before, we give simplified definitions, present short proofs of their main properties, and describe a method to incorporate equality. Although evaluable queries have sensible answers, it is not straightforward to compute them efficiently or correctly, We introduce relational algebra normal form for formulas from which form the correct translation into relational algebra istrivlal. We give algorithms to transform anevaluable formula into an equivalent allowed formula and from there into relational algebra normal form, Our algorithms avoid use of the so-called Dom relation, consisting of all constants appearing in the database or the query. Finally, we describe a restriction under which every domain independent formula is evaluable and argue that the class of evaluable formulas is the largest decidable subclass of the domain independent formulas that can be efficiently recognized. Categories and Subject Descriptors: F,4. 1 [Mathematical Logic and Formal Systems]: Mathe- matical Logic—modet theory; H.2,3 [Database Management]: Languages—query ianguages; H.2.3 [Database Management]: Systems–query processing General Terms: Algorithms, Languages, Theory Additional Key Words and Phrases: Allowed formulas, domain independence, evaluable formu- las, existential normal, query translation, relational algebra, relational calculus A preliminary version of this paper was presented at the Sixth ACM Symposium on Principles of Database Systems, 1987. The research of Allen Van Gelder was supported in part by a grant from the IBM Corporation and NSF grants 1ST-84-12791, CCR-8958590, and IR1-8902287. Authors’ Addresses: R. W, Topor, Department of Computer Science, The University of Mel- bourne, Parkville, Victoria 3052, Australia; A, Van Gelder, Baskin Computer Scien;e Center, University of California, Santa Cruz, CA 55064. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery, To copy otherwise, or to republish, requires a ‘fee and/or specific permission. @ 1991 ACM A0362-5915/91/0300-0235 $01.50 ACM Transactions on Database Systems, Vol 16, No 2, June 1991, Pages 235-278.

Upload: trancong

Post on 12-Feb-2017

232 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Safety and translation of relational calculus

Safety and Translation of Relational CalculusQueries

ALLEN VAN GELDER

University of California at Santa Cruz

and

RODNEY W. TOPOR

The University of Melbourne

Notallqueries inrelational calculus can beanswered sensibly when disjunction, negation, and

universal quantification are allowed, The class of relation calculus queries or formulas that have

sensible answers is called the domam independent class which is known to be undecidable.Subsequent research has focused on identifying large decidable subclasses of domain indepen-dent formulas. In this paper we investigate the properties of two such classes: the et,aluable

formulas and the allowed formulas. Although both classes have been defined before, we give

simplified definitions, present short proofs of their main properties, and describe a method toincorporate equality.

Although evaluable queries have sensible answers, it is not straightforward to compute them

efficiently or correctly, We introduce relational algebra normal form for formulas from whichform the correct translation into relational algebra istrivlal. We give algorithms to transform

anevaluable formula into an equivalent allowed formula and from there into relational algebranormal form, Our algorithms avoid use of the so-called Dom relation, consisting of all constants

appearing in the database or the query.Finally, we describe a restriction under which every domain independent formula is evaluable

and argue that the class of evaluable formulas is the largest decidable subclass of the domainindependent formulas that can be efficiently recognized.

Categories and Subject Descriptors: F,4. 1 [Mathematical Logic and Formal Systems]: Mathe-

matical Logic—modet theory; H.2,3 [Database Management]: Languages—query ianguages;

H.2.3 [Database Management]: Systems–query processing

General Terms: Algorithms, Languages, Theory

Additional Key Words and Phrases: Allowed formulas, domain independence, evaluable formu-

las, existential normal, query translation, relational algebra, relational calculus

A preliminary version of this paper was presented at the Sixth ACM Symposium on Principles ofDatabase Systems, 1987. The research of Allen Van Gelder was supported in part by a grant

from the IBM Corporation and NSF grants 1ST-84-12791, CCR-8958590, and IR1-8902287.

Authors’ Addresses: R. W, Topor, Department of Computer Science, The University of Mel-

bourne, Parkville, Victoria 3052, Australia; A, Van Gelder, Baskin Computer Scien;e Center,University of California, Santa Cruz, CA 55064.Permission to copy without fee all or part of this material is granted provided that the copies arenot made or distributed for direct commercial advantage, the ACM copyright notice and the title

of the publication and its date appear, and notice is given that copying is by permission of theAssociation for Computing Machinery, To copy otherwise, or to republish, requires a ‘fee and/or

specific permission.@ 1991 ACM A0362-5915/91/0300-0235 $01.50

ACM Transactions on Database Systems, Vol 16, No 2, June 1991, Pages 235-278.

Page 2: Safety and translation of relational calculus

236 . A Van Gelder and R. W. Topor

1, INTRODUCTION

With the increased interest in development of deductive

and integration of logic programming languages such as

database systems

Prolog with rela-

tional database systems, it has become more important that relational query

systems be able to handle a wider range of relational calculus formulas

correctly and efficiently. In particular, disjunction, negation, and universal

quantification over subformulas, which are excluded from the class of con-

jurzctiue queries [26], should be available. Current “industrial strength”

implementations handle the class of conjunctive queries well but leave much

to be desired in the areas mentioned; we shall give an example later. In

defense of these implementations, we should point out that the large majority

of queries posed by typical users to traditional databases fall into the class of

conjunctive queries.

However, in sophisticated systems of the future we envision the queries

often being generated not by the user typing them in at the terminal but by a

layer of software positioned between the user and the relational database

system. This software will access a large set of deductive rules in addition to

the user’s query in order to construct relational calculus formulas. The Nail!

project at Stanford University [18] and the LDL project at MCC [24, 19] are

two examples of research projects headed in this direction. Such software-

generated queries will surely be nonconjunctive and much more complex

than those originating directly from users; they will require correspondingly

more general back-end translators.

Another important requirement for such systems is the ability to distin-

guish reasonable queries (those than can be answered sensibly) from unrea-

sonable ones. If the literal interpretation of a given query is not reasonable,

we argue that it is better to reject the query than to attempt to guess the

intended interpretation by transforming the query into a nonequivalent one.

This issue is addressed in Examples 2.1 and 5.4.

A preliminary version of this work was presented in a conference [27].

2. PROBLEM STATEMENT AND BACKGROUND

In this paper we shall be concerned with two main questions.

(1) Which relational calculus queries can be answered sensibly?

(2) How can such queries be answered?

For our purposes, answering a query to a database is equivalent to evaluat-

ing a relational calculus formula with respect to an interpretation. 13Y

“sensible” we mean that values in any correct answer are limited to con-

stants that occur in the query or in database relations that occur in the

query.

Not all queries in relational calculus can be answered sensibly. Two simple

examples that cannot be answered sensibly are

F(x)d:f+%,

G(x, .Y)%?V Qy

ACM TransactIons on Database Systems, Vol 16, No 2. June 1991

Page 3: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 237

where P and Q are database relations. 1 F(x) holds for arbitrary x‘s that are

not in the database, and G( x, y) holds for arbitrary y values when Px is

true, and vice versa.

In the following section, we describe previous attempts to characterize

those classes of queries that can be answered sensibly.

Evaluation of relational calculus queries can be performed either by trans-

lation into a set of clauses suitable for a Prolog interpreter [14, 23, 51 or by

translation into a relational algebra expression. Here, we are concerned

solely with the second approach.

Translation of a relational calculus query that includes disjunction and/or

negation is a theoretically solved problem, provided the query is “safe” in the

original sense of Unman [26], as shown there. However, the practical difficult-

ies are such that several commercial database query systems give intu-

itively unexpected results on such queries.

Example 2.1. Consider the following SQL query.

select P. name

from P, Q, R

where P. name = Q. name

or P.name = R.name;

The intended interpretation of this query (ignoring attributes other than

name) is clearly the following relational calculus formula.

P(nanze) A (Q(name) V R(name)).

Surprisingly, if relation R is empty, SQL returns the answer nil, even if

there are matches between P and Q. QUEL behaves similarly, as do other

systems whose query language is derived from QUEL [25].

Even though SQL and QUEL may be defined to behave in this way, we

regard such behavior as confusing and incorrect. We address the problem of

the correct translation and evaluation of such queries below.

2.1 What Are the Problems?

Conjunctive query formulas are those that use only ~ and ~. (Equality can be

represented in conjunctive queries by repetition of variables and substitution

of constants; for simplicity, we do not consider “built-in” predicates such as

< , > , etc. ) The translation of such a formula into an equivalent relational

algebra expression is straightforward and well-known. Informally,

A(u, U, w, x) A l?(u, u, y, z) becomes an equi-join on the columns of u and u,

1Technically, the queries are written as {.x 17 Rx } and {x, y I Rx v Qy}, but the associationbetween named formulas and queries is clear.

ACM Transactions on Database Systems, Vol 16, No 2, June 1991.

Page 4: Safety and translation of relational calculus

238 . A. Van Gelder and R W Topor

and s xA( x, y, z) becomes a projection that eliminates the column for x.

Essentially, all such formulas can be translated.

The situation changes when we introduce disjunction and/or negation. We

intend to handle disjunction algebraically by union and handle negation by

set difference. For example, Pxy v Qxy can be evaluated by P U Q and

Pxy ~ 1 Qx can be evaluated by P cliff Q. More generally, to have a simple

representation in relational algebra, both operands of “v” must have the

same variables, and negations must appear in the form A v 7B where the

free variables of B are a subset of those of A [26].

Queries that do not satisfy these conditions possibly cannot be answered

sensibly, as illustrated by the two earlier examples:

F(x) ‘:f -Px

G(x, y)~fPx V Qy

The two problems here, which are the main problems aside from handling

equality, are

—the terms of a disjunction do not have the same set of free variables and

—a variable in a negative atomic formula is not limited in its range by a

positive atomic formula elsewhere in the formula.

Once we develop tools to handle these problems, then universal quantifiers

will not present any new problems; we will be able to rewrite v x as 1 ~ x ~ at

the appropriate moment.

The situation is really more complicated than it might appear at first

glance because the problem in a subformula can often be cured by some other

part of the overall formula. Thus, even though the query

G(x, y)d~f PzV Qxy

is not reasonable, because it is true for arbitrary y values when Px is true,

the query

F(x)~f:yG(x, y)==3y(Px VQXy)

is reasonable. The naive translation into x I( P U Q), where z ~ means “pro-

ject onto column 1,“ presents problems because the expression P U Q is

undefined. However, in this case, F(x) has an equivalent form,

for which the naive translation P U T I( Q) is correct.

ACM Transactions on Database Systems, Vol 16, No 2, June 1991

Page 5: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 239

Our goal is to develop a systematic method to distinguish the curable

problems, such as the above, from the uncurable ones, such as ~ y( Px v Qy),

and to provide correct transformations for the curable ones.

2.2 Previous Work

There have been several attempts to define a class of “reasonable” qu cries,

i.e., a class with the following desirable properties:

–The constants in the database and the query provide a sufficient domain

for the values in the answer. Formulas with this property are called

domain independent [9, 16, 281.

–There is an efficient way to decide if the query is domain independent and,

if so, to translate the relational calculus formulas into an equivalent

relational algebra expression. Formulas with this property are often called

“safe”, although this term has been used with several definitions.

–There is an efficient way to evaluate the resulting relational algebra

expression.

The class of conjunctive queries has these properties, as shown by Unman

[261, but this class is rather limited. The class of domain indepwzdentformulas [9, 16], by definition the largest class having the first property listed

above, was shown by Nicolas and Demolombe [21] to be equivalent to the

class of definite formulas defined by Kuhns [12]. This class was shown to be

not recursive by DiPaola [8] and Vardi [281.

The class of safe formulas was introduced by Unman [26] to characterize

those relational calculus formulas that had equivalent relational algebra

expressions. This class was not, however, defined purely syntactically and, in

effect, Unman proved that every domain independent formula had an equiva-

lent relational algebra expression. The translation required the construction

of the Dom relation consisting of all constants in the query and in database

relations that occurred in the query. A rather narrower, and syntactic,

definition of safe appears in Unman’s revised edition [25]. To clarify which

definition is under discussion, we shall use “semantically safe” for the

original term [26], and “syntactically safe” for the newer one [251. (Still other

definitions of “safety” have appeared; Kifer compares several of them [111.)

Other researchers have proposed decidable subclasses oft he domain inde-

pendent formulas. These subclasses include the range separable formulas [4],

the range restricted formulas [20, 51, the eualuable formulas [61, the allowed

formulas [23], and the syntactically safe formulas [251. We define (some of)

these classes below, as we discuss them.

Of these classes, the evaluated formulas comprise the largest class, but the

definition of this class by Demolombe [6] occupies three pages; its cc)mplex

definition makes it unwieldy to work with, as evidenced by the fact that it

required ten pages just to prove that it is a subclass of the domain indepen-dent formulas; moreover, there is no attempt there to describe how to

evaluate evaluable formulas, i.e., how to translate them into reliztional

algebra expressions.

ACM Transactions on Database Systems, Vol 16, No. 2, June 1991.

Page 6: Safety and translation of relational calculus

240 . A Van Gelder and R. W Topor

The allowed formulas are a strict subclass of the evaluable formulas. They

are easier to recognize and easier to translate into relational algebra. The

syntactically safe formulas [25] are a strict subset of the allowed formulas

[231, provided that both classes treat equality alike.The range restricted formulas comprise the evaluable formulas that are in

disjunctive normal form or conjunctive normal form [61. In an important step

toward practical evaluation, Decker [5] extended the class of range restricted

formulas and showed how to transform any range restricted formula into an

equivalent z range form that is suitable for Prolog-style “tuple at a time”

evaluation.

Taking another approach, Aylamazan et al. have shown that even queries

that are not domain independent with potentially infinite answers can be

presented finitely, provided the database relations are finite [21. The method

involves forming a Dom + relation consisting of the Dem. relation together

with k + 1 new symbols where the query has k occurrences of equality. The

query is evaluated along standard lines [261 except that the extended domain

Dom + is used instead of Dem. This very interesting theoretical result does

not appear to be practical to implement in its current form due to the

necessity for enumerating Dom + but merits further study.

Another alternative approach is to add set constructors to an algebra-based

query language and to restrict the possible expressions to ensure domain

independence. This approach has been explored by several researchers

[13, 3, 221.

3. SUMMARY OF RESULTS

In this paper we give a much simpler definition of the evaluable formulas.

With this simpler definition, it is easier to prove properties of the evaluable

class and to see the relationship between allowed formulas and evaluable

formulas. We show that the evaluable class is invariant under a set of

well-known equivalences that can be used as rewrite rules (e. g., DeMorgan’s

laws) which we call conservative transformations. This invariance makes it

easy to see that every evaluable formula can be conservatively rewritten in

prenex-literal normal form (Definition 4.2). However, the eualuable property

is not always preserved under distribution of ~ over v or v over A. Using

distribution is apparently a necessary step to put certain formulas into an

equivalent form that can be “transliterated” into relational algebra. (For

example, the reader is invited to try to construct a relational algebra imple-mentation of P.Yy A ( QX V R-y) without using the distributive law. ) This is our

motivation for transforming evaluable formulas into allowed formulas which

are invariant under distribution.

One of our main results in an algorithm that transforms any evaluable

formula into an equivalent allowed formula.

‘By equivalent we shall always mean loglcally equivalent.

ACM TransactIons on Database Systems, Vol 16, No 2, June 1991

Page 7: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 241

Another main result is that every allowed formula can be effectively

translated into an equivalent relational algebra expression.

At this point we should mention two properties of formula transformations

(either into other formulas or into relational algebra expressions) that we

consider unacceptable and wish to avoid. The first property is that the

transformation does not necessarily produce a logically equivalent formula

but is only guaranteed to do so if the input formula is a certain class (such as

the domain independent class). This puts the burden on the user of prcwiding

a correct input formula or getting erroneous results with no warning. The

second unacceptable method is to explicitly form the Dom relation clefined

above. Both these drawbacks are present, for example, in the rewrite rule:

‘Pxy = Donz(x) ADor?z(y) A lPXY

+ Dom ~ Dom — P.

Both our transformation algorithms have the attractive property that such

tactics are not required. The transformation from syntactically safe formulas

to relational algebra expressions given by Unman [25] also avoids such

tactics.

Finally, we argue that the class of evaluable formulas is the largest

practical subclass of domain independent formulas in a certain sense. Essen-

tially, the domain independent class is not recursive because a given formula

may have a subformula that is superficially not domain independent, but is

unsatisfiable, hence is actually domain independent (vacuously). 3 However,

formulas in which no predicate symbol is repeated cannot possibl~y have

unsatisfiable subformulas. We show that formulas in this class are evaluable

if and only if they are domain independent. This means that any extension to

the class of evaluable formulas that remains domain independent must at

least provide for simplifications based on common subexpressions (e.g., sub-

sumption tests) and should probably include some form of inference capabil -

ity (e. g., resolution). Any nontrivial capabilities in this direction may make

the procedure too expensive to serve in a general purpose formula processor.

4. NOTATION AND DEFINITIONS

We assume the reader is familiar with the standard notation and terminol-

ogy of logic, relational calculus, and relational algebra [17, 261. We shall

abbreviate “first order well formed formula” to formula, and “atomic for-

mula” to atom. A literal is either an atom or a negated atom. We assume the

absence of function symbols (other than constants) throughout.

We distinguish two classes of predicates that appear in formulas which we

call edb predicates and restriction predicates.

Definition 4.1. An edb predicate is either a predicate that represents a

database relation or a built-in predicate that represents a finite relation.

—-3The situation is not this simple, but this is the central idea.

ACM Transactmns on Database Systems, Vol. 16, No. 2, June 1991.

Page 8: Safety and translation of relational calculus

242 . A Van Gelder and R W. Topor

A restriction predicate is a built-in predicate whose relation is infinite and

the complement of whose relation is infinite in an infinite domain.

For example, “ unary equality”, represented by atoms of the form x = c

where x is a variable and c is a constant, must be interpreted by a relation of

one tuple, so is edb. However, x = y and x # Y, where x and y are variables,

have infinite interpretations on infinite domains so are restriction predicates.

Note that we have not provided for relations whose complement is finite;

instead we assume the language contains a predicate for the finite comple-

ment but not the infinite relation. For example, it turns out that defining

x # c as a built-in predicate gains nothing. This topic is discussed further in

Section 5.3.

We primarily use P, Q, R and S to denote edb predicate symbols. We use

A, B, . . . . H to denote formulas and sub formulas, a, . . . . d (and integers and

obvious strings) to denote constants, u, ., z to denote variables, and s and t

to denote terms, i.e., constants or variables.

We use 2 to denote a tuple (xl, . . . . x.), where n > 0, and A(x, j) to

denote a formula in which x is a free variable and yl, . . . . y, are (some of)

the remaining free variables of A.

Similarly, we write V? for Vxl . . . Vx~, and write ~1 for qxl . . . qxn. we

also use “Yc’) to denote either v or g or, in the case of 70 Z, a specific sequence

of (possibly mixed) quantifiers. We assume that no quantified variable occurs

outside the scope of its quantifier; i.e., we avoid (~xA(x) ~ ~ xB(x)) and use

instead (~ xl A(xl) ~ ~ Xz B(xz)).

We use - to denote logical equivalence and * to denote logical implica-

tion; both denote relations between formulas, not symbols within formulas.def

In addition, = is often used to mean “is defined as” to givenamestoformulas. We occasionally use “[ ]“ as synonyms for “( )“ for readability.

We adopt the usual definitions (e.g., see Manna [17]) for prenex normal

form, conjunctive normal form, and disjunctive normal form which we abbre-

viate to PNF, CNF and DNF, respectively. We shall also introduce existential

normal form, abbreviate ENF, and relational algebra normal form, abbrevi-

ated RANF. (See Definitions 9.4 and 9.5). In addition, we occasionally refer

to the following normal form.

Definition 4.2. A formula is said to be in prenex-literal normal form(PLNF) if it is in PNF and all negations are immediately above the atoms.

(This is sometimes called negative normal form.)

5. EVALUABLE AND ALLOWED CLASSES OF FORMULAS

In this section we define the classes of evaluable formulas and allowed

formulas, and give some of their properties. The term eualuable is due to

Demolombe [6]. We use the same term because the class is the same,

although our definition is different. (We shall not prove that the classes are

the same, as we do not rely upon this fact; we simply wish to give appropriate

credit. ) Actually, there is a minor difference in that we allow the equality

predicate = which we interpret throughout as the identity relation and allow

ACM Transactions on Database Systems,Vol 16, No 2, June 1991

Page 9: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 243

gen(.c, A) if edb(A) & free(z, A).

gen(t, -v4) if geu(r, pushnot(~A)).

gem(z, 3yA) if dlstznct(x, y) &: gen(.c, A).

gen(z, b’yA) if dzst~nct(x, y) & gen(. e, A).

gen(x, A V B) if gen(x, A) & gen(z, L’).

gen(.r, A A II) if gen(z, A).

gen(x, A A l?) if gen(x, B).

con(x, A) if edb(A) & jree(x, P).

con(x, A) if absent(z, A).

con($, 74) if con(.r, pushnoi(IA)).

Cm(.c, 3yA) if d~sttnct(.r, y) & con(.r, A).

con(z, vfJ.4) if Listinct(-r, y) & co?i(.c, A).

con(x, A V I?) if con(r, A) & con(z, B).

con(r, AA B) if gen(.z, A).

con(r, AA B) if gen(x, B).

co7i(x, A A l?) if con(z, A) & con(x, B).

Fig. 1. The rules that define gen and con

for the possibility of other built-in predicates. Such predicates are not men-

tioned by Demolombe but could easily be incorporated.

5.1 Relations gen and con

To define eualuable and allowed we first need to define certain relations

between variables and (sub)formulas. We have chosen the names gen and

con for these key relations. They are abbreviations for generated and con-

strained. Our generated relation is called restricted by Demolombe [61 and

pos by Topor [233. Also, our constrained relation is similar but not identical

to the positiue relation of Demolombe [61. We prefer to reserve the terms

positive and negative to describe the polarity of atoms or sub formulas within

a formula. A subformula is called positive if it occurs under an even number

of negations and negative if’ it occurs under an odd number.

Definition 5.1. The relations gen and con are defined inductively inFigure 1 by a set of logical rules. We intend that the relations gen and con

hold only when they can be established by a finite number of applications of

these rules; that is, they are the inductive closures of their respective rules.

ACM Transactions on DatabaseSystems,Vol. 16, No 2, June 1991.

Page 10: Safety and translation of relational calculus

244 . A Van Gelder and R. W Topor

Several predicates and functions appear in these rules to support the

definitions of gen and con. We intend that they be interpreted as follows:

— edb( A) holds if A is an atomic formula whose predicate symbol represents

a database relation or a built-in that must have a finite relation (see

Definition 4.1).

–fiee( x, A) holds if variable x occurs (freely) in formula A.

— ab.sent( x, A) holds if variable x does not occur freely in formula A.

— distinct( x, y) holds if x and y are different variables.

–Function pushnot(~A) returns a formula B, provided A is a not an edb

atom, as follows:

-A B

In Figure 1, the &’s separating atoms are interpreted as “and”, x and y as

(object-level) variables, and A and B as (meta-level) variables representing

formulas. For example, the first rule reads, “Variable x is generated in

formula A if A is an edb atom and x occurs freely in A.”

To give some intuition about gen and con, consider a fixed database, and

let Dom( A) be the set consisting of all constants in the formula A and in the

database relations that occur in A [261. We assume finite database relations,

so Dom( A) is finite.

The fact that gen( x, A( x, j)) holds tells us that if A( a, ;) is true, then

a e Dom( A). From a database viewpoint, the values of x are generated when

A( x, j) is evaluated. From a logic programming viewpoint, A( x, j) can be

regarded as a goal that succeeds only by binding x to values in (a subset of)

Dom(A).

Similarly, the fact that con( x, A( x, j)) holds tells us that if A(a, ~) is true,

then either:

—a eDom( A), or

—A( x, ~) is true for all values of x

Figure 2 shows a geometric interpretation of this property, where A has two

free variables. If con holds for both free variables of A( x, y), then the set of

points (x, y) where A( x, y) holds can be represented as a finite collection of

points and lines orthogonal to an axis. All isolated points at which A( .x, y)

holds are in Dom( A) x Dom( A) (a sharper restriction is developed in

Section 8), and all lines intersect an axis at a point in Don-z( A).

ACM Transactions on Database Systems, Vol 16, No 2, June 1991

Page 11: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 245

Y

t

.

Fig. 2.. .

erty for

.

.

Geometric interpretationdef

A(x, y) = px V Qy V Rxy.

of the con prop-

The term “constrained” is used as corz( x, A) holds whenever x is gener-

ated in every disjunct of A in which it occurs. This constrains x but not

necessarily enough to generate values for it. From a logic progyarnming

viewpoint, A can be regarded as a goal that succeeds only by binding x to

values in Dorn( A) or without binding x at all.

LEMMA 5.1. For every variable x and formula A, gen( x, A) implies

con( x, A).

PROOF. Use structural induction on A. ❑

Note that the converse of Lemma 5.1 is false.

Example 5.1. In each of the following formulas con( x, A) holds but

gen( x, A) does not hold.

A~flQy

A ‘~f Pxy V T Qy

A~f(PXY V~QY)A(RXVSY)

Note that x need not appear in A and that con( y, A) does not hold for any of

the formulas.

Example 5.2. Suppose we have a database that contains relations

—Rwy, which holds if project w requires part y, and

–Sxy, which holds if supplier x supplies part y.

Then, a natural question to ask is: “Does some supplier supply all parts

required for project 100?” which can be expressed by the formula

F~f3xvy(7R(100, y) V SXY)

Here, con( x,= R(1OO, y) V Sxy) holds but gen( x,1 R(1OO, y)I V Sxy) dcles not

hold. Neither con(y, TR(100, y) vSxy) nor gen(x, TR(100, y) ‘V Sxy) holds. We

shall return to this example later, in Examples 5.4 and 8.4.

ACM TransactIons on DatabaseSystems,Vol 16, No 2, June 1991

Page 12: Safety and translation of relational calculus

246 0 A. Van Gelder and R. W. Topor

The properties gen and con are used to classify relational calculus formu-

las in the next section.

5.2 Evaluable and Allowed Formulas

We now define the two main classes of relational calculus formulas consid-

ered in this paper.

Definition 5.2. A formula F is evaluable or has the evaluable property if

–For every variable x that is free in F, gen( x, F) holds.

–For every subformula ~ XA of F, con( x, A) holds.

—For every sub formula v XA of F, con( x, TA) holds.

Definition 5.3. A formula F is allowed or has the allowed property ifi

–For every variable x that is free in F, gen( x, F) holds.

–For every subformula ~ XA of F, gen( x, A) holds.

–For every sub formula v XA of F, gen( x, 1A) holds.

Each definition attempts to characterize a large class of formulas that are

provably domain independent. Requirements (2) and (3) of each definition are

designed to ensure that the scope of every quantified variable in a formula is

sufficiently restricted for the entire formula to be domain independent. The

stricter requirements in Definition 5.3 permit a straightforward implementa-

tion of 3 x as projection for allowed formulas. Moreover, the allowed property

is conceptually simpler and can be tested more efficiently than the evaluable

property.

Rather than prove that our definition of evaluable yields the same class as

Demolombe [6], it is easier to reprove the important properties of the class.

We shall show that every evaluable formula (and hence every allowed

formula) is domain independent in Section 10, after developing some more

machinery. First we note the following important relationship between the

two classes of formulas.

THEOREM 5.2. Euery allowed formula is evaluable.

PROOF. Immediate from Lemma 5.1, ❑

Note that the converse of Theorem 5.2 is false.

Example 5.3. The following formula is evaluable but not allowed:

F(y) d:f~x[(Pxyv Qy) /? lRy]

First note that gen( y, F(y)) holds. Next note that, for x, the subformula in

square brackets satisfies con but not gen. Finally note that the following

equivalent variant of F is allowed, indicating the sensitivity of the allowed

property to the form of the formula.

~(y) ~f(3.x[pxy] V Qy) A ‘Ry.

ACM TransactIons on Database Systems, Vol 16, No. 2, June 1991

Page 13: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 247

Example 5.4. Recall Example 5.2 with relations R wy (w requires y) and

Sxy ( x supplies y). The question, “Does some supplier supply all parts

required for project 100?” was expressed by the formula

F*f3xvy[7R(100, y) VSxy] .

We see that F is not allowed because gen( x, 1R(1OO, y) v Sxy) does not hold.

However, con( x, 1R(1OO, y) v Sxy) does hold. Now, the negation of

(1 R(1OO, y) v Sxy) is (R(1OO, y) ~ lSxy), and gen(y, R(1OO, y) A =Sxy) holds

(and hence con also holds). So F is evaluable.

A natural extension of the above question is the apparently harmless

variant: “Which suppliers supply all parts required by project 100?” which

can be expressed by

F1(x)~fvy[=R(lOO, y) V Sxy] .

As x is now free in Fl, the evaluable property requires the stronger gen

relation to hold for x, which it does not. Therefore, F1 is not evaluable.

Intuitively, the problem is that if R(1OO, y) is empty, then Fl( x) is true for

all values of x, which may not be the intended answer.

The question can be expressed in several other ways in attempts to cnpture

its intended meaning more accurately. Examples of such alternative repre -

sensations are

F2(x)~f3zR(100, z) AVy[TR(loo,” y) VSXY]

F3(X)~f3ZSXZ AVY[TR(100, y) VSXY] .

Each of these formulas is evaluable, and F3 is even allowed. Now, however, if

R(1OO, y) is empty, F2(x) is true for no x at all, and F3( x) is true for all

suppliers in the database. As we cannot know which of these three answers is

the intended one, we believe that it is better to ask the user to give a more

precise query than to translate the nonevaluable query into a nonequivalent

evaluable one. It is our thesis, however, that the class of evaluable formulas

is large enough that this situation will rarely arise.

5.3 Equality and Built-in Relations in Evaluable Formulas

The definition of evaluable in this section adopts a “middlle of the road”

approach to equality and other “built-in” predicates. It is conservative with

respect to equality between two variables: gen( x, x = y), con( x, x = y),

gen( x, x # y) and con( x, x # y) never hold. Many formulas with equality are

evaluable despite this conservative approach. Formulas satisfying Defini-

tion 5.2 are called strict-sense evaluable. In Appendix A we describe transfor-

mations that remove many instances of such equalities and yield an “equality

reduced” form. We call formulas that can be transformed into evaluable

formulas by means of these transformations wide-sense evaluable. Our pri-mary motivation for distinguishing between strict- and wide-sense evaluabil -

ity is that we wish to exploit the fact that strict-sense evaluabi[ity is

invariant under many useful transformations, as discussed in Section 16.

ACM Transactions on Database S@ems, VO1 16, No. 2, June 1991.

Page 14: Safety and translation of relational calculus

248 . A Van Gelder and R W, Topor

Fig. 3, The equivalences upon which conservatztw transformations are based, “’%” denotes3orv

On the other hand, defining gen( x, x = c) to hold for “unary equality”

makes sense intuitively, since it is trivial to associate a one-tuple relation

with this atom.

If our system has other built-in predicates, we adopt the corresponding

approach: their atoms satisfy gen and con if and only if they represent

known, finite relations, so are classified as edb. We require that restriction

predicates occur in complementary pairs, such as > and < ; in such cases,

the definition of pushnot is extended in the obvious way (see Definition 5. 1).

In this paper, mainly to simplify parts of the presentation, we assume =

and # (with two variables) are the only restriction predicates. This treat-

ment is convenient but not central to the main topics of the paper.

6. CONSERVATIVE AND DISTRIBUTIVE TRANSFORMATIONS

In this section we study the effects of various logical transformations on the

evaluable and allowed properties of formulas with a view to identifying sets

of transformations under which these properties are invariant or preserved.

To avoid confusion, we use the terms “invariant” and “preserve” as defined

below:

Definition 6.1. Let 2’(A) be a transformation (normally of formulas into

formulas). A property is said to be inuariant under transformation T if T(A)

has the property if and only if A does. A property is said to be preserved

under transformation T if T(A) has the property whenever A does (but

T(A) may have the property even if A does not).

Figure 3 shows some standard equivalences that are frequently useful to

manipulate formulas [1, 17]. Note that the number of atoms and binary

logical operators are invariant under these transformations. We show that

the evaluable property is invariant under transformations based on these

identities.

ACM TransactIons on Database Systems,Vol 16, No. 2, June 1991

Page 15: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 249

Definition 6.2. We say that F can be conservatively transformed to G if G

can be obtained by replacing (one copy of) a subformula of F according to one

of the equivalences in Figure 3, or by a series of such replacements.

LEMMA 6.1. The relations gen and con defined in Figure 1 are invariantunder conservative transformations ((El – 10) of Figure 3). That is, if F( y) can

be conservatively transformed to G( y), then gen( y, F( y)) * gen( y, G( y)), and

similarly for con.

PROOF. This is merely a matter of applying the definitions. For example,

suppose (E1O) is applied, i.e.,

F(X, y)d~fVX(A(X, Y) AB(X, y))

G(x, y)~fvxl A(xl, y)~vxall(xz, y)

(y need not occur in A or B). If con(y, F(x, y)) holds, then con(y, A(x, y) A

B( X, y)) also holds, and at least one of the following three propositions is

true:

—gen( y, A( x, y)) holds: Then gen( y, v xl A(xl, y)) also holds.

—gen(y, B( x, y)) holds: Then gen( y, VX2 B( X2, y)) also holds.

–Both con( y, A( x, y)) and con( y, B(x, y)) hold: Then con( y, Vxl A( xl, y))

and con( y, Vxz B( Xz, y)) also hold.

Thus, in each case, con( y, G( x, y)) holds. The other direction and other cases

are similar. ❑

THEOREM 6.2. If A is evaluable and A can be conservatively transformed to

B, then B is evaluable.

PROOF. The only cases not handled by Lemma 6.1 involve moving the

quantifier for the first argument of a con by means of (E7- 10). For example,

if F is evaluable and contains a subformula

Gd:f(3XA(X)AB)

and, using (E8), we form F’ by replacing G in F by

G’~f3X(A(X)AB)

then we need to verify con( x, A(x) A B). As x is not in 1?, con( x, B) holds.

As F is evaluable, con( x, A( x)) also holds. Consequently, con( x, A(x) A B)

holds, and F’ is thus evaluable.

In the other direction, if F’ is evaluable, then con( x, A(x) A B) holck. It is

impossible that gen( x, B) holds, so either gen( x, A(x)) holds or con( x, A(x))

(and con( x, B)) hold. This establishes that F is evaluable. The other cases

are similar. ❑

COROLLARY 6.3. Every evaluable formula can be conservatively trans-

formed into an equivalent evaluable formula in PLNF (Definition 4.2).

ACM Transactions on Database Systems, Voli 16, No 2, June 1991

Page 16: Safety and translation of relational calculus

250 . A. Van Gelder and R. W. TOPOI

AA(l?v C) = (A AII)v(A AC) (En)

Av(ll AC’) R (Av B) A(Av C) (E12j

3z(z=y A A(z, y)) = A(y, y) (E13)

Vz(X#y V A(x, y)) = A(y)y) (E14)

Fig 4 Other useful equivalences: distributive laws and equality elimination

COROLLARY 6.4. Every evaluable formula can be conservatively trans-

formed into an equivalent evaluable formula that contains no universal quan-

tifiers and has negations only immediately above atoms and existential

quantifiers.

The allowed property may not be preserved by the conservative transfor-

mations (E9– 10), as shown in the next example. In particular, conservatively

transforming an allowed formula into prenex normal form can destroy the

allowed property.

Example 6.1. The allowed formula ~ xA( x) v B, where x is not free in B,

can be conservatively transformed into the prenex normal form ~ X[ A( x) v B],

which is not allowed.

We now consider which properties are invariant under the distributive

laws (En- 12) shown in Figure 4.

LEMMA 6.5. The relation con defined in Figure 1 is invariant under (En)

of Figure 4 (“pushing ands”). That is, con( x, A A (B V C)) holds if and only if

con( x, ( A A B) V ( A ~ C)) holds. In addition, gen is invariant under both

distributive laws (En- 12) of Figure 4.

PROOF. Suppose (El 1) is applied, that is,

F~fAA(BVC)

G~f(AAB)V(AAC).

First suppose con( y, F) holds. (F and G may or may not contain y.) Then at

least one of the following three propositions is true:

–gen( y, A) holds: Then both gen(y, A A 1?) and gen( y, A A C) also hold.

—gen( y, B V C) holds: Then both gen( y, B) and gen( y, C) hold. It followsthat gen(y, A A B) and gen(y, A A C) also hold.

—con(ey, A) and con(y, B v C) both hold: Then con(y, B) and con(y, C) alsohold.

In all three cases we conclude that con( -y, G) holds.If con( y, G) is given, a similar case analysis shows that con( y, f’) holds.

The proofs for g-en are similar. O

As noted by Demolombe [6], “pushing ors” (E 12) does not always preserve

con.

ACM TransactIons on Database Systems, Vol. 16, No 2, June 1991

Page 17: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries o 251

Example 6.2. ([61) Let F be transformed into G by applying (E12), where

F~fPx V (Qxy A ‘~y)

G~f(Pxv Qxy) A (Px V ‘Ry)

Then corz( y, F) holds, but con( y, G) does not.

Our final result in this section summarizes under which transformations

the allowed property is invariant or is preserved. This result is used in the

translation given in Section 9 from allowed formulas into relational algebra

normal form, which requires the use of distributive laws.

THEOREM 6.6. If A is allowed and B is obtained from A by either

— a distributive transformation ( Ell – 12) of Figure 4, or

— a conservative transformation (El –8) of Figure 3, or

— a conservative transformation ( E9– 10) of Figure 3, used in the left-to-right

direction only, then B is also allowed.

PROOF. The distributive laws are immediate from Lemma 6.5. The rest is

similar to Theorem 6.2, except that we need to check that the needed gen

relations are present (instead of con) when (E7– 10) are used. El

Note that, although “pushing ands” (En) preserves con (Lemma 6.5

above), it does not always preserve the evaluable property.

Example 6.3. Let F(z) ‘~f v x 3 yA( x, y, z) be evaluable, where

A(x> y, Z)~fRYZA (@cv 7Px).

Since

TA(x, y, Z) = VRYZV (-l QXAPX),

we have con( x, 1A), as required for F to be evaluable.

“Pushing and” by (El 1) gives

B(x, y, Z)d~f(l?yZAQX)V (RYZA TPX)

and the corresponding G ‘~f v x ~ylil( x, y, z). Now, con( x, 7B) does not hold, so

G is not evaluable. The problem is that “pushing and” in A is the same as

“pushing or” (E12) in 1A. This is the one distributive transformation that

may not preserve con.

7. RANGE RESTRICTED FORMULAS

Range restricted formulas are based on disjunctive and conjunctive normal

forms, and represent one of the first decidable subclasses of domain indepen-

dent formulas to be studied [201. (Our definition is a slight extension to

permit some use of equality.) Putting formulas into normall forms requires

the use of distributive laws (El 1- 12) of Figure 4. Since the distributive lawsdo not always preserve the evaluable property, it is not too surprising that

certain evaluable formulas become nonevaluable if we simply put them into

13NF in an attempt to make an equivalent range restricted formula, as shown

ACM Transactions on DatabaseSystems,Vol 16, No 2, June 1991

Page 18: Safety and translation of relational calculus

252 . A Van Gelder and R W. Topor

by Example 6.3. However, we show that every evaluable formula (and only

those) has an associated pair of formulas in DNF and CNF that satisfy

conditions quite similar to those required for range restricted formulas.

This theorem provides an alternative recognition mechanism for evaluable

formulas.

Definition 7.1. Let Fd~f % 2M be a formula in disjunctive normal form,

where

M~f(D&”. vD, ).

Let M’d~f(Cl ~ “ “ “ ~ Cm) be the conjunctive normal form of M constructed by

applying the distributive law (E 12) of Figure 4. Then F is range restricted if

the following properties hold.

(1) For every free variable x in F, x occurs in a positive literal (other than

x = y) in every D,, i.e., gen(x, M) holds.

(2) For every existentially quantified variable x in F, x occurs in a positive

literal (other than x = y) in every DJ in which x occurs, i.e., con( x, M)

holds.

(3) For every universally quantified variable x in F, x occurs in a negative

literal (other than x # y) in every C~ in which x occurs, i.e., con( x, -M’)

holds.

Item 3 in the above definition was stated differently by Demolombe [6]:

3’. For every universally quantified variable x in F, if x occurs in any positive atom,then there is some conjunction D~ such that every atom of D is negative andcontains x. (Either con( x, T D,) holds for all D, or gerz( x, ~D, ) ~olds for some Dj;

i.e., con(x, T 34) holds. )

The equivalence of the two definitions follows from Lemma 6.5, since TM’ is

obtained from TM by repeatedly “pushing and” (El 1).

THEOREM 7.1. (Demolombe [61) Let F be a formula in disjunctive normalform. Then F is evaluable if and only if F is range restricted.

PROOF. Immediate from the definition, Lemma 6.1, and Lemma 6.5. ❑

Demolombe observes that a similar result holds for formulas in conjunctive

normal form. This theorem can be generalized to apply to all evaluable

formulas as follows.

Definition 7.2. Let en f( F) (resp., dn f( F)) be the conjunctive (resp., dis-

junctive) normal form of formula F constructed by applying conservative

transformations and distributive law (En) (resp., (E12)).

THEOREM 7.2. Let F be a formula with

dnf(F)~f%ZM~~f% 2(Dlv . . . vD~),

cnf( F) ‘~f %IiMC ‘:f ‘%l(C1 A . . .A cm).

ACM Transactions on Database Systems, Vol 16, No 2, June 1991

Page 19: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 253

Then F is evaluable if and only if the following properties hold:

(1) For every free variable x in F, x occurs in a positive literal (other thanx = y) in every D,, i.e., gen(x, M~) holds.

(2) For every existentially quantified variable x in dnf(F), x occurs in apositive literal (other than x = y) in euery DJ in which x occur,~, i. e.,

con( x, M~) holds.

(3) For every universally quantified variable x in cnf(F), x occurs in a

negative literal (other than x + y) in every Ch in which x occurs, i.e.,

con( x, lMC) holds.

PROOF. Let plnf(F) ‘~f %~M be the prenex-literal normal form of F (Defi-

nition 4.2) constructed by applying conservative transformations. By Theo-

rem 6.2, F is evaluable if and only if plnf(F) is evaluable.

Next, by Lemma 6.5 for every free variable x in F, gen( x, F) holds if and

only if gen( x, M~) holds.

Similarly, for every existentially quantified variable x in plnf(F) (and

hence in dnf(F)), con( x, M) holds if and only if con( x, M~) holds.

Finally, note that

con(x, =( Av(B~C)))

* con(x, TAA (TBV TC))

+ COn,(X, (TA~ lB) v (lAA TC)) (Lemma 6.5)

#con(x,7(( AVB)~(AV C)))

In other words, “pushing ors “ in M is the dual of “pushing ands” in 7M.

Thusj for every universally quantified variable x in plnf( F) (and hence in

cnf( F)), con( x, TM) holds if and only if con( x, lMC) holds. U

Again we remark that dnf( F) and cnf( F) may not themselves be evalu -

able, as shown in Example 6.3.

8. TRANSFORMATION INTO AN ALLOWED FORMULA

This section presents a procedure to transform any evaluable formula into an

equivalent allowed formula. The approach used by Decker [5] to convert a

range-restricted formula into “range form” which is nearly the same as

“allowed” can be generalized quite nicely with the aid of the rules for gen

and con in Figure 1.

8.1 Ternary gen and con Relations

The first idea is to add a third argument G(x) to gen and con which

functions as a “generator” of sorts. The modified rules are shown in Figure 5.

When x is free in A, then G(x) will be a disjunction of certain edb a~oms in

A(x) (and possibly the placeholder L described next) used to prove that gen

or con holds. (From now on we adopt the convention of writing A(x), G(x),

and so on only when x is actually free in the formula; the formula may

ACM Transactions on Database Systems, Vol 16, No. 2, June 1991.

Page 20: Safety and translation of relational calculus

254 . A. Van Gelder and R W Topor

gen(z, A, A) if edb(A) & jme(z, A).

gen(z, -1A, G) if gen(z, pushnot(-IA), G).

gen(z, 3uA, G’) if dJst2nct(x, y) & gen(.r, ii, C).

gen(x, VgA, G) if distlnct(x, y) & gen(.c, A, G’).

gen(x, AV B, Gl VG2) if gen(x, A, Gl) & gen(z, B, Gz).

gen(z, AA l?, G) if gen(x, A, G).

gen(z, A A 11, G) if gen(x, B, G).

con(z, A, A) if edb(A) & free(z, A).

con(x, A, 1) if absent(z, A).

con(z, 7A, G) if con(x, pushnot(~A), G).

con(z, 3yA, G) if dJsiinct(z, y) & con(z, A, G).

con(z, VyA, G) if d~s~~nct(.c, y) & con(z, A, G).

con(x, AV B, GI VG2) if con(x, A, Gl) & con(x, B, Gz).

con(x, AA B, G) if gen(.z, A, G).

con(z, AA B, G) if gen(z, B, G).

con(x, AA B, Gl VG2) if con(x, A, Gl) & con(x, B, G2).

Fig 5. Expansion of rules for gen and con to produce “generators”.

contain other free variables besides x.) We see that the G(x) in the conclu-

sion, or head, of each gen rule is constructed naturally from the subgoals.

The G in con is similar, except we need to provide for the possibility that x

does not even occur in A. For this, we introduce “ ~ “ as a placeholder; it

may be thought of as a O-ary predicate whose relation is always empty, that

is false.

Example 8.1. Consider F(y) ‘~f ~ xA(x, v), where

A(x, y)~f(~xyV Qy) /l (Rxy V lSy),

Figure 6 shows how the ternary gen and con apply to the sub formulas of F.

Since A( x, y) is the only quantified subformula in F(y), we see that F(y)

is evaluable, but not allowed. This example is considered further in

Example 8.3.

Definition 8.1. For any formula G( x), actually containing x, and possibly

containing other free variables, ~*G(x) denotes G(x) with all variables

except x existentially quantified. Also, ~*L denotes false.

ACM TransactIons on Database Systems, Vol 16, No 2, June 1991

Page 21: Safety and translation of relational calculus

Safety and Translation of Relational Calculus QUefleS . 255

gen(z, Pzy, Pzy) con(z, Qy, 1)

gen(y, Pzy, Pzy) gen(y, Qy, Qy)

[T, 7~,

\ /

con(z, (Pzy v Qy), Pzy V 1) con(z, (Rzy V TSy), Rzy V~

gen(y, (F’zy v Qy), P.cy V Qy)

\t 1con(z, (Pry V Qy) A (Rzy V +’y), Pzy V Rzy V 1)

gen(y, (Pzy v Qy) A (Rzy V +g), Pzy V Qy)

gen(y, 3t (Psy V Qu) A (&y V~{

Fig. 6 Application of rules for ternary gem and con, from Example 8.1

Definition 8.2. The operation of

applying the following simplifications

f~ alse + true

A ~ false + false

A v false +A

%x false + false

truth value simplification consists of

to a formula as long as possible.

~true + false

A~true+A

A v true + true

Yox true + true

Note that “simplifications” that depend on the law of the excluded middle,

such as A v 1A + true, are not part of this definition.

Definition 8.3. Let Gd~fPl V “ “ “ v Pm, where P, are atoms in formula A.

Then A[ G / false] denotes the formula in which each occurrence of P, in A is

replaced by false.

Example 8.2. Consider the evaluable, but nonallowed, formula

Fd:f 3xA( x), where

A(X) ~fT3y(R(100, y) A I%y).

Recall from Examples 5.2 and 5.4 that F represents the query “Does some

supplier supply all parts required by project 100?” Here, con( x, A( x), Sxy v

~ ) holds, but gen( x, A(x)) fails.

Letting G ‘~f (Sxy v ~), we have ~* G( x) := ~y Sxy. Note also that

A [ Sxy/ false] = 73y[R(10(), y) A ‘false]

Truth value simplification removes the second conjunct. ‘The resulting for-

mula and ~* G( x) play a key role in the transformation of evaluable formulas

into allowed formulas, as discussed further in Example 8.4.

The following lemma partly motivates the definition of the third argu-ments of gen and con. It shows that, for any interpretation (i. e., database),

the set of values of x for which A(x) holds is a subset of those for which G(x)

holds.

ACM TransactIons on DatabaseSystems,Vol 16, No 2, June 1991.

Page 22: Safety and translation of relational calculus

256 . A. Van Gelder and R. W. Topor

LEMMA 8,1. Let gen be defined as in Figure 5. Suppose that

gen( X, A( x), G( x)) holds, where A and G may have other free uariables aswell. Then

A(x) + a* A(x)

~OOF. The first logical implication

second is straightforward by structural

~ yA. ❑

8,2 The Main Algorithm

In the following Algorithm 8.1, which

=+=* G(x).

follows by its definition, and the

induction, observing that v y A *

implements the procedure con-to-

gen( F), we describe a local transformation that, when repeatedly applied,

transforms an evaluable formula into an equivalent allowed formula. Before

applying con-to-gen, it is necessary to check separately that F satisfies gen

on all of its free variables, and to replace v y by 1 ~ y 1 throughout. If F at the

top level is not evaluable due to an internal con violation, an error message

will result; however, recursive calls of con-to-gen are made on subformulas

that may be nonevaluable.

Algorithm 8.1. con-to-gen (F)

Input: A formula F with no universal quantifiers such that con( x, A( x)) holds mevery subformula of the form ~x A( x). (If the latter property is violated, thealgorithm halts with an error message. )

Output: A formula F’ equivalent to F with these properties:1. If gen(x, F) holds, so does gen( x, F’).2, If gen(x, ~ F) holds, so does gen( x, ~ F’).3. Property gen( x, A’( x)) holds in every subformula of the form ~x A(x)

Procedure:1. If F is an atom, then return F.2. If F has the form -A, then return T con-to-gen( A).

3. If F has the form A A B, then return con-to-gen( A) A corz-to-gen( B).4, If F has the form A v B, then return con- to-gen( A) v con-to-gen( B).5 If F has the form ~xA, where x may not occur in A and A may contain

other variables, then:

(a) If gen(z, A(x), G(x)) holds, then return ~x con-to-gen( A(x)).(b) If con( x, A, G) holds, then:

(i) If x is not free in A, and hence G = L , then return con-to-gen( A)

(ii) If x M free in A, then G(x) is a disjunction PI(x) v . . . v P,n(..r) ofatoms in A( x); here n-z > 1 and there may be additional disjuncts

that are c’ L “, Let ,~) be the truth value simplification of A[ G/ false](see Definition 8.3).4 Define:

@f+* G(x) AA(X)] V j’

and return con-to-gen( fi).(c) If con( x, A, G) does not hold (for any G), print an error message, and

halt.

4Quantlfied variables in A(x) are given new names in J, of course

ACM TransactIons on Database Systems, Vol 16, No. 2, JLUE IXII

Page 23: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 257

The key step in Algorithm 8.1 is 5b(ii), where F is rewritten as F. To

justify the algorithm we need to prove both that F = F and that gen

properties in the larger formulas in which F is embedded are preserved when

F is replaced by F. Proof of these facts is deferred to Section 8,3, Here, we

present some examples and discussion of the algorithm.

Example 8.3. Consider F(y) ‘:f ~x A(x, y), where

A(X, y)~f(PXy VQy)A(RXy VTSy)

as in Example 8.1. From that example we saw that gen( x, A( x, y), G) does

not hold (for any G), so F(y) is not allowed. But con( x, A( x, y), Pxy v Rxy v

~ ) does hold, and in step 5b(ii) # (y) is the truth value simplification of

(faz~ev QY) A (falsev ~Sy).

Observe in passing that gen( y, #(y), Qy) holds. Therefore, the alSorithm

defines

~*f~X[dy’[ PXY’VRXy’] AA(x, y)] v (QYA -15Y)

which is an allowed formula equivalent to F,

Example 8.4. Consider the formula Fd:f ~xA( x), where

A(X) XflSy(R(lOO, y) A +xy).

Recall from Examples 5.2, 5.4, and 8.2 that F’ represents the query “Does

some supplier supply all parts required by project 100?” and was found to be

evaluable, but not allowed. From the latter example, we see that .4 of

Algorithm 8.1 is 13 y“( R(1OO, y“). It follows that the algorithm defines

~~fax[~y’sxy’ AA(x)] V ‘3y’’l?(100, y“)

which is an allowed formula equivalent to F.

It follows immediately from the correctness of this algorithm, which is

established below in Theorem 8.6 and Theorem 7.1, that every range re-

stricted formula can also be effectively transformed into an equivalent al-

lowed formula. In this special case, our procedure reduces to a variant of

Decker’s in which ~* G( x) plays the role of Decker’s “range expression” and

w plays the role of his “remainder. ”

Finally, we observe that the expanded rules for gen and con have some

nondeterminacy for conjunctions: the G of either conjunct can be used when

gen holds for both. This choice represents an opportunity for optimization.

8.3 Correctness of Main Algorithm

This subsection addresses the correctness of Algorithm 8.1. The correctness

theorem is supported by a series of lemmas, some of which are rather

technical. We begin with some definitions.

ACM Transactions on DatabaseSystems,Vol. 16, No 2, June 1991

Page 24: Safety and translation of relational calculus

258 . A. Van Gelder and R W. Topor

Definition 8.4. By subdisjunction of PI v . . . v Pn we mean a “disjunc-

tion” of zero or more of the P,; for zero, we use ~ and for one we use just the

atom.

Definition 8.5. A formula is in literal form if the only negations are

immediately above atoms, and cannot be further simplified by combination

with a built-in predicate. (In particular, 7(x = y) and ~(x i+ y) may not

appear. )

Clearly, any formula can be put into literal form while preserving its gen

and con properties, but at the expense of including universal quantifiers. We

shall express several lemmas in terms of formulas in literal normal form, to

facilitate proofs. Although Algorithm 8.1 operates on formulas without uni-

versal quantifiers and not in literal form, the necessary properties estab-

lished in the lemmas carry over to these formulas as well.

LEMMA 8.2. Let A(x) be a formula in literal form for which gen( x, A( x))

fails, but con(x, A(x), G(x)) holds, where G(x)~fP1 v . . . v Pm, as in Step

5b(ii) of Algorithm 8.1. Let B(x) be a sub formula of A(x) and let H(x) be a

subdisjunction of G(x) such that gen(x, B(x), H(x)) holds. (A, B, F, G, and

H may have other free variables.) Then B[ G / false] truth value simplifies to

false.

PROOF. The proof is by induction on h, the height of B. For the basis

h = O, B is an edb atom, so H = B. For h >0, assume the lemma holds for

all subformulas of A with height less than h, and consider each of the

possible principal operators.

If B ‘:f B, v Bz, then there are subdisjunctions HI and H2 such that

gen( x, Bl, 111) and gen( x, B2, H2) both hold, and H = HI v Hz. But then, by

the inductive hypothesis, both Bl[ G / fake] and B2[ G / false] evaluate to

false.

If Bd~f BI ~ Bz, then either gen( x, Bl, H) or gen( x, B2, II) holds. But then,

by the inductive hypothesis, one of Bl[ G / false] and B2[ G / false] evaluates to

false. Similar arguments apply to principal operators ~y and v y.

Finally, if B ‘~f 7B’, the B’ must be an atom, so the premise of the lemma is

false. ❑

LEMMA 8.3. Let A(x) be a formula in literal form for which gen( x, A( x))

fails, but con( x, A( x), G(x)) holds, where G(x) ‘~fP1 v . . . v Pm, as in Step

5b(ii) of Algorithm 8.1. Let B be a subformula of A(x) and let H(x) be a

subdisjunction of G(x) such that gen(x, B) fails, but con(x, B, H(x)) holds.

(A, B, F, G, and H may have other free variables, and H may equal ~ .) LetMB be the truth value simplification of B[G/ false]. Then x is not free in 3B

and T3*G(x)AB - v~.

PROOF. The proof is by induction on h, the height of B. For the basis

h = O, B must be an atom in which x is not free, so H = ~ and Z~ = B. For

ACM Transactions on Database Systems, Vol. 16, No 2, June 1991

Page 25: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 259

h >0, assume the lemma holds for all subformulas of A with height less

than h, and consider each of the possible principal operators.

If B ~fB1 v B2, then there are subdisjunctions HI and Hz such that

con( x, Bl, HI) and con( x, Bz, Hz) both hold, and H = HI V IZz. We consider

cases according to whether gen holds on subformulas. Because gen( x, B) does

not hold, it is impossible that gen( x, Bl) and gen( x, Bz) both hold.

Suppose gen( x, Bl) holds and gen( x, Bz ) fails. Then by tlhe definitions of

gen and con there is some subdisjunction HI of H such that gen( x, Bl, Hl)

holds, and by Lemma 8.2 Bl[ G / false] = false. Therefore, B[ G / false] =

Bz[ G / false], in which x is not free. The inductive hypothesis also i:mplies

that l~*G(x) A Bj + Ii[/false] = J+~. Moreover, by Lemma 8.1, BI = ~*Hl

= ~* G(x). It follows that l~*G(x)AB + w~.

The case when gen( x, IIz) holds and gen( x, Bl) fails is similar. Now

suppose gen( x, Bl) and gen( x, Bz ) both fail. Then the inductive hypothesis

applies to both BI and B2, for con must hold for both and the conclusion

follows easily for B.

When B is of the form Bd~fBl A B2, then neither gen(x, B,) nor gen( x, BJ

can hold, but con must hold, so inductive hypothesis applies to both subfor-

mulas, and the conclusion follows for B. Similar arguments apply to B ‘If ~yBl

and B ‘:f v yBl.

Finally, if B ‘:f lB’, then B’ must be an atom, and must rlOt contain x, so

H = J- as in the base case. ❑

LEMMA 8.4. Let A( x, y) be a formula in literal fbrrn for wd~tch

gen( x, 4x, y)) fails, but con( x, A( x, y), G(x)) holds, where G(x) = PIv ““”v Pm, as in Step 5b(ii) of Algorithm 8.1. Let B(y) be a sub formula of

A( x, y) in which y, distinct from x, is free. (A, B, and G may have other free

variables.) Let JIB be the truth value simplification of B[ G/ false]. Then,

(1) If gen(y, B(y)) holds, then either #~ = false or gen(y, W,B) holds.

(2) If gen( y, -B( y)) holds, then either YB = true or gen( y, 1 SYB) holds

PROOF. The proof is by induction on h, the height of B. For the basis

h = O, B(y) is an atom and ~~~ is either B itself or false. For h > 0, assume

the lemma holds for all subformulas of A with height less than 1, and

consider each of the possible principal operators.

Suppose B(y) ~fB1 v B2. If gen( y, B), then gen( y, B,) anc[ gen( y, Bz). By

the inductive hypothesis, ~~, = false or gen( y, ?l~,) for each of i = 1,2. The

same conclusion for %~ follows immediately from the definition of gen. If

gen(y, =B), then either gen(y, ~Bl) or gen(y, =Bz), say the former. 13y the

inductive hypothesis either Y?~l = false and gen( y, fl~,), and the same con-

clusion follows for ,#~.

Suppose B(y) ‘:f 131A B2. The argument when gen( y, B) hcllds is similar to

the argument for gen ( y, ~B ) above, where B was a disjunction. The argu -

ment when gen( y, lB) holds is similar to the argument for gen( y, B) above,

where B was a disjunction.

ACM TransactIons on Database Systems, Vol. 16, No, 2, June 1991,

Page 26: Safety and translation of relational calculus

260 “ A Van Gelder and R. W. Topor

Suppose B(y)d~f~zBl(y, z). If gen(y, B(y)), then gen(y, Bl(y, z)) holds,

the inductive hypothesis applies, and the conclusion follows. If gen( y, lB( y)),

then gen( y, 1 Bl( y, z)) holds, and the inductive hypothesis applies. Either

$~, = true, in which case # B = ~ zd~ue, or Wz( Y, ??B1) holds, in which caseso does gen(y, ,#B). The cd~ze B(y) = VZBI(.Y, z) is similar.

Finally, suppose B(y) = -III(Y). Then BI is an atom, so gen( y, B( y))

is impossible. If gen(y, = B(y)), then gen(y, Bl(y), Ill(y)). That is, H = Bl,

:?B, = false, and 3B = true. ❑

LEMMA 8.5. Let A(x) be a formula for which gen(x, A(x)) fails, but

con(x, A(x), G(x)) holds, where G(x) d~fPl v . . “ v Pm, as in Step 5b(ii) of

Algorithm 8.1. Let # be the truth value simplification of A( x)[G/ false].

Then,

A(x) = [~* G(x) AA(x)] v m

PROOF. Let A be equivalent to A, but in literal form, and let

A“(X)~fT~*G(X )~A(X)

Clearly, A(x) = A’(x) v A“( x). But it follows from Lemma 8.3 with B = A

that A“( x) * M. This shows that

A(x) + (q* G(x) AA(x))vz.

For the other direction, it is sufficient to show that Y + A( x). This follows

from the facts that each atom in G(x) occurs positively in A( x), and

# = A[G/false]. ❑

THEOREM 8.6. Every evaluable formula can be effectively transformed into

an equivalent allowed formula.

PROOF. By Lemma 8.5, F = F in Step 5b(ii) of Algortihm 8.1. Clearly,

gen( x, 3* G( x) A A( x), G( x)) holds. Let A be equivalent to A, but in literal

form. By Lemma 8.3 with B = A, x is not free in 9A, so x is not free in ,fi’.

We need to check that F preserves gen properties in the larger sub formu-las in which F is embedded. If gen( y, F), then clearly gen( y, A), arzcl by

Lemma 8.4 with B = A, gen(y, Y?A). But J ~ is just the literal form of ~?, so

gen ( y, Y). It follows that gen( y, F). Similarly, if gen( y, 7F), it is sufficient

to show that gen( y, 7 J); but this also follows from Lemma 8.4. The other

steps of the algorithm clearly preserve gen properties, so by structural

ACM TransactIons on Database Systems, Vol 16, No 2, June 1991

Page 27: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 261

induction, if’ the initial formula satisfies gen on all of its free variables, so

does the formula returned.

Also, it is clear that con is preserved in subformulas of A, and A[G/ false]

by step 5b(ii). Since # is the truth value simplification of .A[ G / false], some

of these subformulas may simplify to false or true; however, con holds for

these. (If a subformula of #, say y~~(y), satisfies con( y, <WB(y)) when its

corresponding subformula B(y) in A did not satisfy con( y, B(y)), that

cannot make a nonevaluable F into an evaluable F, because B(y) still occurs

in A.) It follows that Algorithm 8.1, applied to an evaluable formula, returns

an equivalent allowed formula. ❑

9. TRANSLATION INTO A RELATIONAL ALGEBRA EXPRESSION

We now describe a procedure to translate any allowed formula into an

equivalent relational algebra expression, by progressing through two normal

forms, ENF and RANF, to be defined below. In combination with the trans-

formation of the previous section, this allows any evaluable formula to be

translated into an equivalent relational algebra expression.

The translation procedure has three phases:

(1) transformation of the allowed formula into existential normal form (ENF),

(2) transformation of the ENF formula into a relational algebra normal form

(RANF),

(3) translation of the RANF formula into a relational algebra expression.

For these translations, it is convenient to regard “~” and “v” as polyadic

operators on sets of formulas with no order among them. Similarly, “ ~“

applies to a nonempty set of variables: we write s i to denote z xl “ . “ xm.

Our goal in defining RANF is to ensure that the relation for each subfor-

mula that we need to evaluate is finite. We do not need to evaluate ~B if we

have “something to subtract it from”, i.e., if it occurs in the context of

A A lB, but we still need to evaluate A and B, then take their difference.

Thus, we need B’s free variables to be a subset of A’s for this to be

well-defined and finite. Similarly, if we need to evaluate ( A v II), its relation

is finite only if A and B have the same free variables. With these observa-

tions as motivation, we proceed to the formal development.

9.1 Relational Algebra Normal Form

To facilitate defining relation algebra normal form, we begin with prelimi-

nary definitions. From now on, we assume universal quantifiers have been

removed from all formulas by rewriting v x as 1 ~ x ~.

Definition 9.1. The property genall( A) holds for formula A if and only if

gen( x, A) holds for each free variable x in A.

Note that if genall( A) holds, then the relation for A must be finite.

ACM TransactIons on Database Systems, Vol 16, No. 2, June 1991.

Page 28: Safety and translation of relational calculus

262 . A. Van Gelder and R. W Topor

Definition 9.2. A formula (without universal quantifiers) is called simpli-

fied it

(1) There is no occurrence of 17A.

(2) There is no occurrence of 1(x = y), or -(x # y), or negation of any other

restriction atom, as defined in Definition 4.1.

(3) The polyadic operators “A”, “ v“, and “a” are “flattened”; that is:

(a) in subformula Al A “ “ “ A A ~, no operand A, is itself a conjunction,

and n > 2.

(b) in subformula Al V . “ . V A~, no operand A, is itself a disjunction,

and n > 2.

(c) in subformula 3 1A, operand A does not begin with ~.

(4) In every subformal ~ ZA, each variable x, is actually free in A.

We assume the existence of a function simplify that transforms an arbi-

trary formula into an equivalent (simplified) formula that satisfies Definition

9.2. It is important for the analysis that A, V, and ~ are regarded as polyadic

operators and that the formulas are flattened.

Definition 9.3. A simplified formula is negative if its root is “ 1”; other-

wise, it is positive. An arbitrary formula is negative (resp., positive) if its

simplified form is negative (resp., positive).

A subformula of a simplified formula is restrictive if its parent is “A” and it

is either negative or a restriction atom (see Definition 4.1) such as x = y or

x # y. Any other formula is called constructive.

Intuitively, restrictive subformulas need not be evaluated themselves be-

cause they are only needed to remove tuples from another relation that was

evaluated. For example, in A( x, y) A ( x = y) A Tl?( y), subformulas x = y

and lB( y) are restrictive, but B(y) itself is constructive.

Definition 9.4. A formula is in existential normal form (ENF) ifi

(1) It is simplified.

(2) For each disjunction in the formula:

(a) The parent of the disjunction, if it has one, is “A”.

(b) Each operand of the disjunction is a positive formula.

Example 9.1. The following allowed formulas are in ENF.

(pxy/1 l~uQuy) V 3z(Sxyz A 7Rz)

Pxy A (Qx V Ry)

~xy A ‘=z(Qxz A ‘RyZ)

PxA ‘~y(Qy A ‘~ZRXyZ)

Of course, a formula may be in ENF, but not be allowed:

Pz~(Qxv Ry).

Definition 9.4 leads to prohibited parent-child combinations in an ENF

formula, as shown in Figure 7 by nonblank entries; these entries also specify

ACM TransactIons on Database Systems, Vol 16, No 2, June 1991.

Page 29: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 263

Child

Parent v A 3 ~

v s T2R. Fig. 7. Rewrite rules to achieve ENF;

A 8“s” denotes a call to .sm@~Y(F’); blankentries denote acceptable parent-child

3 T9 s combinations,

7 T3 T2t s

I only if every conjunct of A k negative.

rewrite rules to correct the violations. The diagonal entries “s” denote a callto simplify( 1’) (see Definition 9.2), and have highest priority. The off-diago-

nal entries refer to other rewrite rules below which are only applied after

simplification (cf. Figure 3):

l(lAl A””- AIA~)+Alv. ..v A~ (T2)

lAVBI V.. .VBn+l(A AIBIA. .. AIBn) (T2R)

l(A, v””- VAn)+A(lAIA . . . A lAn) (T3)

~2(A(=) V B(=) +(%iA(i) V %/B(l))) (T9).

The restriction to simplified formulas ensures that the following conditions

hold for application of certain rewrite rules:

(1) In T2, all A, must be positive.

(2) In T2R, A must be positive;

The following algorithm uses the rewrite rules to achieve a transformation

into ENF.

Algorithm 9.1. enf(F)

Input: A formula F.

Output: An ENF formula equivalent to F.

Procedure: In this algorithm, “formula” means a tree data ~structure represent-ing a formula. The notation F[ A / B] where A is a subformula of F m cans thedata structure for F is modified by replacing the subtree for A by a tree thatrepresents B.

FO := simplify (F).

While some subformula A of F. matches the left side of one of the rewrite rulesT2, T2R, T3, or T9, do:

Apply the rewrite rule to A, yielding result B.FO := simplify (Fo[ A /Bl).

Return F..

LEMMA S3.1. Algorithm 9.1 transforms any formula into an equivalent

ENF formula. Further, if the input formula is allowed, so is the output.

PROOF. It is clear that FO satisfies Definition 9.4 when the “while”

terminates, as none of the rewrite rules in enf(F) applies. It only remains to

ACM TransactIons on Database Systems, Vol. 16, No. 2, June 1991

Page 30: Safety and translation of relational calculus

264 s A. Van Gelder and R. W. Topor

check that the “while” must terminate. The only possibility of looping

involves T2, T2R, and T3. But, simplification ensures that T2 cannot be

followed T3 at the same point in the formula. As mentioned above, A must

be positive when T2R is used. This ensures that the result cannot match the

left side of T2. Thus a finite number of rule applications are possible at any

given depth in the formula, and no rule increases the number of atoms, so the

total number of possible rule applications is finite.

A consequence of Theorem 6.6 is that the rewrite rules of the algorithm

preserve the “allowed” property. •l

From ENF as a starting point, we can now define relational algebra

normal form.

Definition 9.5. A formula F is in relational algebra normal form (RANF)

if it is in ENF, genall( F) holds, and every proper constructive subformula of

F is in RANF.

Subtractive subformulas need not be in RANF because “~ ~” can be

implemented in relational algebra as “difference”, as described later, while

x = y and x # y become selections. However, a negative subformula that is

not restrictive must be put into RANF. Of course, genall holds for a negative

subformula if and only if it has no free variables.

Example 9.2. Consider again the ENF formulas from Example 9.1. This

formula is in RANF:

Clearly, not every allowed ENF formula is in RANF. Not only are the

following allowed formulas not in RANF, but no conservative transformation

of them yields an RANF formula.

Pxy A (@v RY)

Pxy A 73Z(QXZA ‘@)

PxA 13y(Qy A ldZ&yZ).

It follows from the next lemma that the RANF class is unchanged if the

recursive RANF requirement in its definition is replaced by corresponding

“allowed” requirement.

LEMMA 9.2. An ENF formula F is in RANF if and only if F is allowed andeach constructive subforrnula A of F is allowed.

PROOF. The “if” case follows immediately from the definition of RANF

and the fact that an allowed formula satisfies genall. To prove the “only if”

case, we use structural induction on F. For A an edb atom, genall( A) holds

and A is allowed. Assume positive formulas A, A,, and BJ are allowed, by

inductive hypothesis. Then, by the definitions of “allowed”, gen, and genall:

(1) If genall( Al v . . . v A ~) holds, then each A, has the same free variables

and is not x = y or x # y. Thus, Al V “ . . VAm is allowed.

ACM TransactIons on Database Systems,Vol 16, No 2, June 1991

Page 31: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 265

(2)1fgenall(A1 A”. ”AAn ATBIA . . . A 1 Bn) holds, then each free variable

in each ,13Jalso occurs free in some A, (other than x = y or x # y). Thus,

AI A””” A/ire ATBI A... A lBn is allowed.

(3) Formula ~2A is allowed, since ENF (through simplification) requires that

each variable in 2 must actually be free in A.

(4) If -A is a subformula of F whose parent is not “A”, then genall(TA)

holds only if A has no free variables, in which case -A is also allowed.

The same applies to 1 B1 A . . . A lBn.

(5) If the parent of x = y or x # y is not “A”, the parent cannot satisfy

genall.

Thus any positive formula that can be constructed from A, A,, and BJ with

one more logical operation and satisfies genall is also allowed. ❑

9.2 Transformation into RANF

We now present an algorithm to transform an allowed ENF formula into an

equivalent RANF formula. It is based on repeated application of the rewrite

rules below. Since use of the distribution transformation (T I 1 below) can lead

to exponential growth in formula size, the algorithm strives to use it only

where necessary.

(~~AABIA.. “ABn)+3~(A ABIA. .. ABn) (T8R)

(Alv. ” “VAm)AB +(AIAB)V. .OV (Am A B) (Tll)

=AAB+T(AAB)A B. (T15)

Transformation T15 is justified by El 1 and E2, as follows:

TAAB=(TA AB)v(TBAB)=(TAv TB)AB=~(AAB)AB.

Again by Theorem 6.6, these rewrite rules transform allowed formulas into

allowed formulas. It is easy to see that T8R and T15 transform ENF formulas

into ENF formulas, with the understanding that the rewrite rule application

is followed by resimplification of the result (see Definition 9.2). However, T11

may transform an ENF formula into a non-ENF formula, as in the case:

(DA ~ (( Al V Az) A B). Thus en~ is called after application of Tll in the

following algorithm.

Algorithm 9.2. ranf(F)

Input: ArI allowed ENF formula F.

Output: An RANF formula equivalent to F.

Procedure: Figure 8 illustrates the main ideas of the algorithm. The algorithmcalls simplify and enf, which we described earlier, as well as itself.

In this algorithm, “formula” usually means a tree data structure representinga formula. The notation F[ A /B] where A is a subformula of F means the datastructure for F is modified by replacing the subtree for A by a tree thatrepresents B.

The procedure selects one of the recursive cases T8R, Tll, T115, or exits throughthe “ELSE” case, as described below. In each case, FI is an allowed (notnecessarily proper) subformula of F’ to be replaced. Fz is the equivalent allowedformula that replaces F1. (Recall that genall holds for allowed formulas.) The

defnotation ‘(FI = . . . “ means that FI matches the pattern on the righthand side.

ACM Transactions on Database Systems, Vo~ 16, No 2, June 1991

Page 32: Safety and translation of relational calculus

266 . A Van Gelder and R. W ToDor

Rule Pattern Problem Cure

+

A

e

A

T8R gen(z,,4) fads, but3g

C(.C,z) B(z) gend/(A A C) hold> 3yB(:)

A(z, y) A

A(x, y) C(x, z)

&

A

Tll

A

Agen(z, AI V A2) fails,

v qz, ~) ~(z) but genall((A1 v A2) A v(7) holds

B(z)

AI(z, y) A2(y) A A

A,(z, y) C(z, z) A2(y) c(z, Z)

+

A

P

A

T15 gen(z, A) fails, but C

C(x, z) B(z) IS m RANF. C(z, z) B(z)

A(x) A

A(r) C(x, z)

Fig. 8. Illustration of rewrite rules used m ranf (somewhat simplified)

T8R: If FI ‘~f ~jA A BI A . . . A Bn, and genall( A) does not hold, then:

Let 2 be the set of variables that are free in A such that gen( x,, A) fails.(Since FI is an allowed formula, this set is disjoint from ~.)

LetBIA... A B~ be a prefix (possibly after rearrangement) of BI A . . . AB. such that genall( A A BI A . . . A B*) holds. (At worst k = n, because

genall( Fl) holds, )F2:=3j(A ABIA. .. ABk)ABk+1A. ..~Bn

Return r@(SZ~p~Lfy(F[ F, / F,])).

Tll: If F1d~fGAB1 A . . . A Bn, where Gd~fAl v . . . V A ~ and genall(G) does nothold, then:

LetBIA... A Bk be a prefix (possibly after rearrangement) of BI A . . . A

B. such that genall(G A BI A . . ~ A Bk) holds. (At worst h = n, becausegenall( Fl) holds. )

(Distribute BI A . ~. A Bk over G. Note that each disjunct G, has the sameset of free variables. )

For 15 i < m, do: G,:= ranf(A,ABIA . . . AB~),

F2:=(Gl V.. .VG,n)A Bk+l ABn. ABn

Return ranf( en f(F[ F, / F, ]))

T15: If Fld~f=AABI A... A Bn, and genall( A) does not hold, then:

Let ; be the set of variables x that are free in A, but for whichgen( x,, A) does not hold.

ACM Transactions on Database Systems, Vol 16, No. 2, June 1991.

Page 33: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 267

~et~lA-.. A B~ be a prefix (possibly after rearrangement) of BI A . . . A

Iin such that all elements of ~ are free in Ill A . . . A B~ and genall( BIA . . . A Bk) holds. (At worst k = n, because genall( Fl) holds.)

G:= ranf(Bl A . ~. A Bk).

F2:=1(A AG)ABIA.. .ABn

Return ranf(.simplify(F[ FI /Fzl)).

ELSE: If none of cases T8R, Tll, or T15 apply, then:

Return F.

Correctness of this algorithm is proved in Section 9.3. Details of the

translation into relational algebra are covered in Section 9.4. The above

algorithm has considerable nondeterminacy, which we may view as an

opportunity for optimization in an implementation. Examp”les of the applica-

tion of Algorithm 9.2 are given in Examples 9.3 and 9.5 below.

Example 9.3. This example illustrates the operation c)f Algorithm 9.2.

From this example it appears necessary to be able to interleave the rewrites

of cases T8R, T11, and T15 in the algorithm. The initial formula is

The rewrite rules are applied in the order T11, T15, T8R, T11, giving the

following sequence of formulas, the last of which is in RANF.

(~XYAQX) V(~XYA~YA T~Z[SZA(~ZXV~Z)])

(~XYAQX) V(~XYA~YA l(@A~ZISZA(~ ZXV~Z)]))

(PXYAQX) V(PXYARYA T~Z[PXYASZA (PZXVRZ)])

(Pxy A Qx) v (PXYAI?YA T2Z[SZA ((~xyA~zx) v (( PxYA&)))]).

This example is discussed further in Example 9.4 below.

9.3 Correctness of Translation to RANF

Correctness of Algorithm 9.2, stated in Theorem 9.6, is established in two

stages:

(1) The algorithm terminates (Lemma 9.3 and 9.4).

(2) If none of the cases T8R, Tll, or T15 applies to F, then it is in RANF

(Lemma 9.5).

Termination is shown by the method of multisets, as described by Der-

showitz and Manna [7]. Finite multisets of nonnegative integers are well-

ordered by listing their elements in decreasing order and comparing these

lists lexicographically. We shall associate a finite multiset with each simpli-

fied formula and use induction on multisets. A similar technique to provetermination of a rewrite procedure was used by Lloyd and Topor [14]. Regard

the formula as a tree with atoms at the leaves and edges directed toward the

leaves.

ACM Transactions on DatabaseSystems,Vol. 16, No. 2, June 1991.

Page 34: Safety and translation of relational calculus

268 . A. Van Gelder and R. W. Topor

Definition 9.6. The and-height of a node in a formula is the maximum

number of “A” nodes, including itself, on any path from the node to a leaf.

Definition 9.7. Let A be a constructive (not necessarily proper) subfor-

mula of F, let x, be a free variable of A such that gen( x,, A) fails. Let k be

the and-height of A and let B be the parent of A (if it exists). Then the

element cost of the variable-node pair (x,, A), denoted pl( x,, A), is the

nonnegative integer defined by:

{

2kw~(~,, A) =

if B exists and gen( x,, B) holds

2 k + 1 otherwise,

The node cost of A, denoted pz( A), is the multiset of element costs of pairs

(x,, A) such that PI( x,, A) is defined. The node cost is the empty multiset if

genall( A) holds.

The cost of a formula F, denoted W(3’), is defined to be the multiset union

of the node costs of its constructive nodes. The cost of an RANF formula is

the empty multiset.

Example 9.4. This example shows the succession of values of the cost

function ~ as the formula of Example 9.3 is rewritten in Algorithm 9.2. The

initial formula and its node costs are

I’~f~XYA (QXV(l?YA 7qZ[SZA (12ZXVl?Z)]))

{} {Ax,%} {5X} {3X} {31} {1.L}

Only the node costs of constructive internal nodes, reading left to right, are

shown. The subscripts are annotations to show which variable produced each

element. Collecting, P(F) = {5,4, 4, 3, 3, 1}. The evolution of the cost during

rewriting is shown below.

(PXYAQX) V(PXYARYA ~~X[SZA(P.XVRZ)])

{1’ {1’ {} {3X) {3z)’ {lx} p = {3,3,1}

(F’XYAQX) V(PXYARYA T(PXYAqZ[SZA (PZXVRZ)]))

{} {1’ {} {H2.Z} {3.} {lx} ~ = {3,2,1}

(~xYAQx) v(pxYARYA 13z[pxY ASZA(pZXVRX)])

{)’ {} {} {} {2.X} {lx} /l = {2,1}

(RXYAQX)V (F%YAl?YA Tq,Z[SZA ((~xy AI?zx) v ((pxYAl?z)))])

P={)

The node costs of the intermediate formulas are shown under them. Only

node costs for internal constructive nodes are shown. Recall that A A B A C is

really one ternary conjunction; its node cost is shown under the left “A”.

Observe that the ~’s are lexicographically decreasing as we proceed throughthe intermediate stages. We prove below that this algorithm always produces

a decreasing sequence of p’s.

ACM Transactions on Database Systems, Vol 16, No 2, June 1991

Page 35: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 269

LEMMA 9.3. In case Tll in the body of Algorithm 9.2, K(enf(F[ FI /Fzl )) s

p(F’[Fl/Fzl).

PROOF. The only problem is that T3 or T9 can apply at Fz when k = n,

because in this case Fz ‘~f GI v . . . v Gm. In general, these rewrite rules can

increase the and-height of some nodes, thereby increasing ~z for those nodes.

But Fz is in RANF, so there are no element costs within it to increase. ❑

LEMMA 9.4. In the body of Algorithm 9.2, each recursive call to ranf is

made on an allowed ENF formula whose cost is less than P(F).

PROOF. In case T8R the element costs in p(F) contributed by pairs of the

form (x,, ~ jA) are absent from P(F[ FI /Fz ]). The new nodes are “~ $“ and

“A” in the subformula ~j(A A BI A . . . A Bk). (Recall that A, Bl,. . . . B~ are

all operands of one polyadic “A”, although for notational convenience we

have written several infix symbols.) But these new nodes satisfy genall, so

contribute noting to ~. Moreover, the element costs due to pairs of the form

(x,, A) each decrease by 1 because the parent of A now satisfies genall.

(Calls to simplify can never increase the cost.) Checking the possible cases

for the parent of FI (it must be “v”, “ ~“, or “ -“) shows that

simplify( F[ FI /Fz ]) is in ENF.

In case T11 there are multiple recursive calls. For those that create the

G,, we observe that genall( A, A BI A . . . A B~) holds for all 1 s i s m. Itfollows that these subformulas are allowed and in ENF. We need to show

that their costs are less than p(F). This is immediate if genall( A,) holds,

for there is some AJ that contributes element costs to F but not to this

subformula. But even if genall( A,) fails, the element costs making up the

node cost of A, decrease by 1 because the parent of A, now satisfies genall.

For the last recursive call (of this case), since the G, are in RANF and

have the same free variables, it is obvious that P(F[ FI /Fzl) < K(F). By

Lemma 9.3, the call to enf does not increase p.

In case T15, it is critical that the replicated G is in RANF’, so that p(G) is

the empty set. But now genall( A A G) holds, so the elements of the node cost

of A are reduced by 1 relative to A(F). Checking the possible cases for A (its

root must be “A” or “d” or an atom) shows that simplify ( F[ FI /Fa ]) is in

ENF. ❑

LEMMA 9.5. In the body of Algorithm 9,2, if none of cases T8R, T11, or

T15 applies, then F is in RANF.

PROOF. If F is not in RANF, then let A be a maximal constructive

subformula that is not allowed, which must exist by Lemma 9.2. A must

have some free variables and must be a proper subformula of F, as F is

allowed. By that assumption of maximalit y, the parent of A cannot be d or A.If the parent of A is “~”, the its parent is “A” and is constructive, so must

be allowed. Thus, case T15 applies. (The “ 1“ cannot be the root of F with

free variables in A.)

ACM Transactions on Database Systems, Vol 16, No 2, June 1991

Page 36: Safety and translation of relational calculus

270 . A. Van Gelder and R. W Topor

If the parent of A is “A” and the root of A is “ ~“, then case T8R applies. If

the parent of A is “~” and the root of A is “v”, then case Tll applies. These

are the only possibilities because A is constructive.

It follows that such an A cannot exist in rcznf(F). ❑

THEOREM 9.6. Algorithm 9.2 correctly transforms any allowed formula

into an equivalent RANF formula.

ROOF. Lemmas 9.3 and 9.4 show that recursive calls are always on

formulas that satisfy the input restrictions in the algorithm and have re-

duced p value. Since the range of ~ is well ordered, chains of recursive calls

terminate. Lemma 9.5 shows that the formula returned upon exit is in

RANF. ❑

9.4 Translation into Relational Algebra

The translation of a formula F in relational algebra normal form into an

equivalent relational algebra expression is straightforward; the basics are

given by Unman [26, 251. The translation given earlier [26] required the

construction of the Dom relation consisting of all constants in the formula

and in database relations that occur in the formula. Our translation, like

that in the revised edition [25], does not require such a Dom relation.

We translate A v B into A’ U B’ (possibly after a permutation of columns),

where A’ and B’ are the translations of A and B. Similarly, we translate

A ~ B into A’ w B’ or A’ x B’, and ~ XA into TA’. Atoms x = c are translated

either into an appropriate selection operator or into C, where C is a unary

relation containing the single tuple c. Atoms x = y and x # y (and otherrestriction atoms) are translated into an appropriate selection operator. To

translate A ~ -B, we use the following generalized set difference operator

introduced by Hall et al. [10].

Definition 9.8. Let P and Q be relational expressions such that the

attributes of Q are a subset of those of P. The relational operation general-

ized set difference, P cliff Q, yields the set of tuples in P whose projections

are not in Q. That is,

Pdiff Q~fP-r(P~Q)

where the (equi-)join is on the attributes of Q and the projection is onto the

attributes of P. If P and Q have the same arity, then P cliff Q is simply

P – Q (possibly after a permutation of columns).

More generally, we translate every conjunction Al ~ . . . ~ A ~ ~ lB1A “ “ “ A ~Bn, where the A, and BJ are positive sub formulas, into P’ cliff B:

cliff . . . cliff B;, where P’ is the translation of Al A . . . A Am, and B:, . . . . B;are the translations of the B]. If there are no positive A,( m = O), then each

ACM TransactIons on Database Systems, Vol 16, No 2, June 1991

Page 37: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 271

l?, must be a closed formula, and we use the singleton O-ary relation { <> }

(i.e., true) for P’.Although we have defined P cliff Q in terms of primitive relational

operators, it should be implemented as a primitive in its own right, using

techniques similar to those used for efficient joins. (In fact, cliff is also called

anti-join by Maier [15].) Thus we retain cliff in our final relational algebra

expression.

Example 9.5. We show below, for several allowed formuhils (cf., Example

9.2), the RANF formula and relational algebra expression constructed by the

above procedures.

Allowed: Pxy I? (Qx V Ry)

RANF: (Pxy ~ Qx) V (Pxy A Ry)

Algebra: T12(PM1=1Q) U rlJPrMz=l R)

Allowed: PX A Vy(7Qy V 3ZRXyZ)

RANF: PXA Tqy(Px A QyA TszRxyz)Algebra: P– ~l(Px Q–T12R)

Allowed: Pxy A VZ(-QXZ V RYZ)

RANF: Pxy A ‘3 z(px’y A Qxz A 7RYZ)

Algebra: P – ~12(~121(P ~1=1 Q)diffz, ~=2, ~R)

Allowed: Px A 3 y(Qy A VZ[RZ + ~xyz])

RANF: PXA ~Y(PXA QYA T3Z[PXA QYARZA TSxyzl)

Algebra: PD%.l~l(Px Q – T12KP x Q x R) – S1).

THEOREM 9.7. Every allowed formula can effectively be translated into an

equivalent relational algebra expression.

PRoOF. Theorem 9.6 shows that every allowed formula can effectively be

translated into an equivalent formula in RANF. Structural induction on the

resulting formula then shows that the relational algebra expression produced

by the above (informal) algorithm correctly evaluates the RANF formula,

and hence the original formula. ❑

Many simplifications of the relational algebra expressions produced by the

procedures of this section can be made during their construction. Alterna-

tively, final expressions can be simplified using standard methods [251.

10. EVALUABLE AND DOMAIN INDEPENDENT FORMULAS

In this section we show that the evaluable class is contained in the domain

independent class and that with the restriction to formulas with no repeated

predicates evaluable is equivalent to domain independent. To do so, we use

the fact that domain independent is equivalent to definite, which we now

define, following Nicolas and Demolombe [21].

Definition 10.1. Let I be an interpretation with domain I) for a formula

F, and let pi be the relations assigned by I to the edb predicates P, that

ACM Transactions on Database Systems, Vol. 16, No 2, June 1991.

Page 38: Safety and translation of relational calculus

272 . A Van Gelder and R W Topor

occur in F. Let * be a value not in D. Then the *-extension of I is the

interpretation I with domain D’ = D U {*} that assigns the same relations

p, to the p;edicat~s P, as does I. We denote appropriate cross products of Dand D’ by D and D’, respectively.

Definition 10.2. A formula F is called definite if, for all interpretations I,F is satisfied at the same points in I as in I’, where 1’ is the *-extension of 1.In other words, ii satisfies F in I’ if and only if ii satisfies F in I.

10.1 Evaluable Formulas are Domain independent

We now show that every evaluable formula is domain independent. This was

proved originally by Demolombe [6] for evaluable formulas as defined there.

The statement needs to reexamined because we have used an independent

definition, and have incorporated equality.

Our proof is significantly simpler because of Theorems 8.6 an 9.6, which

state that every evaluable formula has an equivalent RANF formula. Hence

it is sufficient to prove domain independence for RANF formulas.

LEMMA 10.1. Let F(x) be a formula, possibly containing other free

variables besides x. Let I be an interpretation for F with domain D and

e-extension 1’. If gen( X, F) holds, then F does not hold in I’ for any assign-

ment that assigns * to x.

PROOF. Use induction on formula size, which we define to be the number

of atoms plus the number of quantifiers (negations are excluded). For the

basis F is an atom and not of the form x = y; the conclusion is immediate.

For the induction, one of the following cases applies.

F~~fA A B. One of A and B satisfies gen, and therefore by the inductive

hypothesis, does not hold if x is assigned *.

—F~~~A v B. Both of A and B satisfy gen, and therefore by the inductive

hypothesis, do not hold if x is assigned *.

—Fd:f % yA. A satisfies gen, and therefore by the inductive hypothesis, does

not hold if x is assigned *.

—Fd~f 7A. If A is an atom, the conclusion holds vacuously, since gen( x, F) is

false. Otherwise, push the 1 giving G (i.e., pushnot(-A, G) holds). Theneither G is an atom other than x = y, or one of the above cases applies to

G. Cl

LEMMA 10.2. If F is an RANF formula, then F is definite.

PROOF. By Lemma 10.1, it is sufficient to show that gen( x, F) holds for

every free variable x in F and that gen( y, A) holds for every sub formula

=yA of F. But these properties follow immediately from the definition of

RANF. ❑

ACM TransactIons on Database Systems, Vol 16, No. 2, June 1991

Page 39: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 273

THEOREM 10.3. If F is evaluable, then F is definite, and hence is domain

independent.

PROOF. By Theorems 8.6 and 9.6 and Lemma 10.2. ❑

10.2 Evaluable Formulas with No Repeated Predicates

The class of formulas with no repeated predicates is not particularly interest-

ing in its own right. However, if a query processing method does not

recognize the presence of repeated predicates and exploit tlhem by means of

useful transformations, then formulas with repeated predicates are no more

amenable to evaluation than those without repeated predicates. Thus study

of this class can shed light on the limits of such methods.

Essentially, the domain independent class is not recursive because a given

formula may have a subformula that is superficially not domain independent,

but is unsatisfiable, hence (vacuously) domain independent. Even though

unsatisfiability is decidable for formulas with sufficiently simple quantifier

structure (as was shown by Ackermann [1]), we do not consider it practical to

test subformulas for unsatisfiability as part of the procedure that transforms

them into relational algebra. However, formulas in which no predicate

symbol is repeated cannot possibly have unsatisfiable subformulas. We show

that formulas in this class (without equality) are evaluable if and only if they

are domain independent. This means that any extension to the class of

evaluable formulas that remains domain independent, to be implementable,

must at least provide for simplifications based on common subexpressions

(e.g., subsumption tests), and should probably include some form of inference

capability (e. g., resolution).

LEMMA 10.4. Let F be a formula with no repeated predicate symbols and

no equality that is in PLIVF, i. e., Fd~f % 2M( 2, j), where M is quantifier-free

(see Definition 4.2). Suppose M is a conjunction of literals. If F is definite,

then F is evaluable. The same holds if M is a disjunction of literals.

PROOF. Let M~fPl A “ “ “ A P. A lQI A “ “ “ A TQm where each P, and QJ is

an atom of a different predicate. Suppose F is not evaluable. Let D = { a}.

We shall prove F is not definite by constructing an interpretation I with

domain D and *-extension I’ such that F evaluates differently in I and 1’.

This contradiction will prove the lemma.

As F is not evaluable, one of the following three cases applies.

(1) Variable x is universally quantified in F and con( x, 7M) does not hold.

Then some P, contains x. Assign each P, to hold for (a, . . . . a) and assign

each Qj to be empty. Then F holds (on (a, . . . . a) if it has free variables)

in I but not in I’.(2) Case 1 abo~e does not apply, x is ezsistentially quantified in F, and

con( x, M) does not hold. Then x occurs in some QJ and in no P,. Also

universally quantified variables appear in no P,. Assign each P, and each

ACM Transactions on Database Systems. Vol. 16. No 2, June 1991.

Page 40: Safety and translation of relational calculus

274 . A Van Gelder and R. W TODOI

QJ to hold for (a,. . . . a). Then F holds (on (a, , a) if it has free

variables) in I’ but not in I.

(3) Case 1 above does not apply, x is free in F, and gen( x, M) does not hold.Then con( x, M) also does not hold. Proceed as in Case 2.

The case where M is a disjunction of literals is similar. ❑

THEOREM 10.5. Let F be a formula with no repeated predicate symbols and

no equality. Then F is definite if and only if F is evaluable.

PROOF. The “if” part holds by Theorem 10.3 above. By Corollary 6.3 we

may assume F is in PLNF and is given by

where M is quantifier free. We define the size of a formula to be the number

of atoms plus the number of quantifiers in it. For the “only if” part, we show

by induction on size that if F is definite, then we can reduce to the casedef def

covered in Lemma 10.4. The base cases, M = P and M = 7P, are immediate.

Suppose con( x, M) is false for existential variable x of F. If M contains

any v’s we shall derive a contradiction.

(1) M3fA v B. The~,fone of co~$x, A) and con( x, B) fails. Say con( x, A)

fails. Define Ml = A and FI = % ZMl. Suppose FI were not definite; since

B has no predicates in common with A, we can extend any interpretation

over FI to one over F in which B is always false. This would show that

F was not definite, a contradiction. Therefore FI is definite, and by the

inductive hypothesis, evaluable. But con( x, Ml) fails, a contradiction.

(2) M~fA ~ B. Then one of con(x, A) and con(x, 1?) fails. Say con(x, A)

fails. If any v’s occur in A, we can derive a contradiction by recursively

applying steps 1 and 2 to A instead of M. Assuming A is free of v’s, we

work on B. If con( x, B) fails, we can recursively apply steps 1 and 2. The

only other possibility is that con( x, 1?) holds and gen( x, B) fails.

(a) B~f C v D. Either con( x, C) holds and gen(x, C) fails, or con( x, D)

holds and gen( x, D) fails, (or both), say the former. Define Bz ‘~f C,

Mz to be M with B replaced by Bz and Fz to be % ?Mz. Again, if Fz

were not definite, making D “as true as possible” would demonstrate

that F was not definite. But con( x, Mz ) must fail because of thesimilarity to M, so Fz cannot be evaluable, a contradiction of the

inductive hypothesis.

(b) B ~f C ~ D. Then con( x, C) holds and gen( x, C) fails, and con( x, D)

holds and gen( x, D) fails. If any v’s occur in C or D, we can derive a

contradiction by recursively applying steps 2a and 2b to C and Dinstead of B.

Therefore, a contradiction arises unless M contains no V’S. But in this case

Lemma 10.4 applies.

ACM Transactions on Database Systems, Vol 16, No 2, June 1991

Page 41: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 275

The arguments for free variables are similar. The arguments for univer-

sally quantified variables interchange the roles of v and A. ❑

We conjecture that this theorem can be extended to allow some presence of

equality. However, it cannot be extended much in other directions in view of

the fact that (cf., Example 6.2)

~(x)~fVy[(~xA Qy) V (~x A TRY)]

is domain independent but not evaluable.

11. OPEN PROBLEMS

While it is fast (linear time) to determine whether a formula is evaluable, the

actual transformation into allowed form and then into RANF often involves

rewriting the current formula into an equivalent but larger formula. This

suggests some topics for further study:

(1)

(2)

(3)

Can the transformation from evaluable to allowed (Section 8) introduce a

size increase that is exponential in the number of variables, or is there a

polynomial bound?

Can a method that combines our approach with the work of Aylamazan et

al. [2] yield a more efficient evaluation method?

Can a broader subclass of domain independent formulas be recognized

efficiently (i. e., in polynomial time) by using a restricted class of infer-

ence rules such as subsumption and subsumption rescllution to exploit

repeated predicates?

APPENDIX A. Equality Reduction and Wide-Sense Evaluability

In this appendix, we describe transformations that normalize formulas with

respect to equality ( x = y), which we call equality reduction. Certain formu-

las containing equality do not satisfy the requirements for evaluability

initially, but are evaluable after equality reduction. We say that such

formulas are evaluable in the wide sense. Wide-sense evaluability is invariant

under conservative transformations. Since every wide-sense evaluable for-

mula is equivalent to an evaluable formula, it is also domain independent.

LEMMA A. 1. Let Fd~fx = t A A( x, t,j), where t is either a variable or a

constant, and is not required to appear in A( x, t, ?). Then

F= F’~fx=t AA(t, t,j).

PROOF. In any evaluation, either x is assigned the same value as t or

both F and F evaluate to false. ❑

The lemma generalizes the transformations (E13 - 14) in Figure 4 to free

variables.

ACM Transactions on DatabaseSystems,Vcd 16, No 2, June 1991.

Page 42: Safety and translation of relational calculus

276 . A Van Gelder and R. W. Topor

Algorithm A. 1. Equality Reduction

Input: A relational calculus formula F with no universal quantifiers,

Output: An equivalent equality-reduced formula.

Procedure:

1, Apply the following transformation wherever possible.Let A( x, y) be the maximal subformula of F in which x is free. A mayhave other free variables. If A contains an atom x = y (or y = x), where y

is another free variable of A, then:

(a) Define AI(y) to be the formula that results from replacing every occur-rence of x in A by y, and then replacing y = y by true and carrying out

truth value simplification (Def. 8.2).

(b) Define A2( x, y) to be the formula that results from replacing each

occurrence of x = y in A( x, y) by false, and carrying out truth valuesimplification.5 Variables x and/or y may not actually appear in A2.

(c) Replace A( x, y) by

A’(X, Y)~f(X= yAA1(.V)) V(.X#y AA, (X,.Y)).

(d) If x is bound in F, then replace ~xA’( x, y) by

Al(y) vsx[x#y~A, (x, y)].

2. At this point all equalities between two free variables of F that remain can

be put in the form of “case splits” at the top of the formula by appropriately

“pushing ands” (En). For any case of the form x = z A A( z), where x is not

free in A and gerz( z, A) holds, rewrite this case as

x=z~A(x]~A(z)

This typically arises when A originally contained x but it was substituted

for in Step 1 above. In all implementation, we would not actually do it thisway; we would add a column replication primitive to our relational algebra,

The correctness of the algorithm follows from Lemma A. 1 and elementary

arguments.

Definition Al. A formula F is said to be wide-sense eualuable if

Algorithm A. 1 transforms it into an evaluable formula as defined in

Ilefinition 5.2.

Equality reduction can also be carried out on equalities between a variable

and a constant, and even on equalities between two constants, which may beintroduced as a result. Suppose c = d occurs, where c and d are distinct

constants. If the system assumes that the distinct name axiom c # d is

implicit in F, then we can make it explicit at the top level:

F+~#dAF

Now replace c = d by false throughout F and simplify, as in Step lb. Repeat

until all equalities between constants are removed. This extension may

improve efficiency, but does not change the class of wide-sense evaluable

5Bound variables of A are given different names in Al and A2

ACM TransactIons on Database Systems, Vol 16, No, 2, June 1991

Page 43: Safety and translation of relational calculus

Safety and Translation of Relational Calculus Queries . 277

formulas, so we do not consider it further here.

Example A. 1. Consider the formula F(y) ‘:f 3xA( x, y), where

A(x, y)~fPx@xyvx=y),

This formula might arise in a transitive closure calculation: If P represents

nodes at distance at most n from a source, and R represents edges, then Frepresents nodes at distance at most n + 1 from a source. This formula is not

strict-sense evaluable because gen,( y, F) fails to hold.

Noting that gen( x, A( x, y)) does hold, one might be tempted to formulate

some definition of gen in which gen( y,( Pz ~ x = y)) holds “by association”

because gen( x,( Px A x = y)) holds; such a proposal may be found in Unman

[25]. The idea would be to use distribution to locate Px by x = y. Two

drawbacks of relying on distribution to fix formulas are (1) distribution does

not always preserve evaluability, and (2) distribution can substantially in-

crease formula size. Moreover, a general algorithm is not apparent.

Let us instead apply the equality reduction algorithm:

A1(y)~fPy

Az(x, y)d~fpx~ Rxy

F’(Y) ~fPy V~X[X#y APXAl?Xy].

F’ is not only strict-sense evaluable, but is allowed.

Defining a class of formulas by means of an algorithm is not very satisfac-

tory. A better characterization of wide-sense evaluable formulas is a topic for

future research.

ACKNOWLEDGMENTS

We thank Robert Demolombe, who originated the evaluable class of formu-

las, for helpful discussions and for comments on an early draft of this work.

We also thank Hendrik Decker for helpful discussions. We are grateful to

Paris Kanellakis for bringing the article [2] to our attention. Finally, we

thank the anonymous referees for their careful reading of the paper and

numerous helpful suggestions.

REFERENCES

1. ACKERMANN, W, Solclable Cases of the Decision Problem. North-lFIolland, Amsterdam,

1968.

2, AYLAMAZAN, A. K., GICWLA, M. M., STOLBUSHIiIN, A. P., AND SCHWARTZ, G, F. Reduction ofthe relational model with infinite domains to the finite domain case. In Proceedings of

USSR Academy of Science. 286, 2 (1986).

3. BEEIU, C., NAQVI, S , RAMAKRISHNAN, R., SHMUELI, 0., AND TSUR, S. Sets and negation m alogic database language (LDL1). In 6th ACM S.vmposium on Principles of Database Sys-tems. 1987, 21–37.

4 CODD, E. F Relational completeness of data base sublanguages In Dda Base Systems, R.Rustin, Ed. Prentice-Hall, Englewood Cliffs, NJ., 1972, 65-98,

ACM TransactIons on Database Systems, Vol. 16, No 2, June 1991

Page 44: Safety and translation of relational calculus

278 . A Van Gelder and R W Topor

5.

6.

7.

8.

9

10.

11

12.

13

14.

15

16

17

18

19,

20.

21

22.

23

24.

25.

26.

27,

28.

DECKER H Integrity enforcement in deductive databases. In 1st In ternatiorzcsl Conference

on Expert Database Systems. 1986, 271-285,

DEMOLOMBE, R. Syntactical characterization of a subset of domain independent formulas.

Tech. Rep, ONERA-CERT, 1982.

DERSHOWITZ, N AND MANNA Z. Proving termination with multiset orderings Commun,

ACM. 22, 8 (1979), 465-476.

DIPAOL~, R. A, The recursive unsolvabdity of the decision problem for the class of definiteformulas. J, ACM. 16, 2 (1969), 324-324.

FAGIN, R. Horn clauses and database dependencies J. ACM 29, 4 (1982), 952-985

HALL, P., HITCHCOCK, P , .ANDTODD, S An algebra of relations for machine computation.

In ACM Symposwm on Principles of Programmmg Languages 1975, 225-232,

KIFER, M. On safety, domain independence, and capturabdity of database queries. In

Third International Conference on Data and Knozzledge Bases (Jerusalem, 1988),

KUHNS, J L, Answering questions by computer: A logical study, Tech. Rep. RM-5428-PR,

Rand Corp., 1967,

KUPER, G M Logic programming with sets. In 6th ACM Symposium on Prmczples of

Database Systems 1987, 11-20

LLOYD, J W.. AND TOPOR, R. W Making Prolog more expressive. J. Logw Program. 1, 3

(1984), 225-240,

MMER, D The Theory of Relational Databases Computer Science Press, Rockville, Md,,1983.MA~ows~~, J. A. Characterizing data base dependencies, In 8th Colloquium on AZL-

tomata, Languages and Programming Springer Verlag, New York, 1981MANN+ Z. Mathematical Theory of Computation. McGraw-Hill, New York, 1974.

MORRIS, K., ULLMAN, J, D , AND VAN GELDER, A. Design overview of the Nail! system, In

Third International Conference on Logic Programmmg. 1986, 554-568.

N~QVI, S., AND TSUR, S. A Logzc Language for Data and Knowledge Bases. ComputerScience Press, Rockville Md., 1988.

NICOLAS, J, M Logic for improving integrity checking in relational databases Acts Inj(

18, 3 (1982), 227-253

NICOI.AS, J M., AND DEMOLOMBE, R. On the stability of relational queries. Tech. Rep,

ONERA-CERT, 1982.OZSOYO~LU, G , .ANDWANG, H. On set comparison operators, safety, and QBE. Tech, RepCase Western Reserve Umv., Cleveland, Oh., 1987,TOPOR, R Domain independent formulas and databases. Theor. Comput. SCL 52, 3 (1987),

281-307.

TSUR, S,, AND ZANIOLO, C LDL: a logic-based data-language. In Proceedings of the 12th

International Con ference on Very Large Data Bases. 1986, 33-41, See also MCC repDB-150-85,

ULLMAN, J. D Principles of Database and Knowledge-Base Systems. Vol. 1. Computer

Science Press, Rockville, Md , 1988,ULLMAN, J, D. Prmczples of Database Systems. Computer Science Press, Rockville, Md.,1980, (Revised ed, 1982).

VAN GELDER, A , AND TOPOR, R W, Safety and correct translation of relational calculusformulas, In 6th ACM Symposium on Principles of Database Systems. 1987, 313-327.

VARDI, M. The decision problem for database dependencies, Inf Process. Lett. 12, 5 (1981),

251-254.

Received July 1987; revised December 1989; accepted June 1990

ACM TransactIons on Database Systems, Vol 16, No 2, June 1991