1
Query Containment for Conjunctive Queries With Regular Expressions
Daniela Florescu, Alon Levy, Dan Suciu. PODS 1998
Slides by Gala Yadgar.
2
Outline
Semi-structured data and conjunctive queries
Query containment for different query classes
StruQL0 and the data model for it
Substitutions and canonical databases
Semantic criteria for query containment
Query mappings
Syntactic criteria for query containment
Containment for simple StruQL0 queries
3
Semi Structured Data
Data is irregular:
  Attributes may be missing
  The type and cardinality of an attribute may not be known
  The set of attributes may not be known in advance
The schema is unknown in advance
This is an example of a data model where graphs represent databases
[Figure: family graph. Nodes: Abraham, Isaac, Jacob, Rachel, Ishmael, Leah, Rebecca, Hagar. Edge labels: son, wife, o. brother (older brother), y. brother (younger brother).]
4
Languages
Relational calculus: Datalog
  Ancestor(X,Y) :- Father(X,Y)
  Ancestor(X,Y) :- Ancestor(X,Z), Ancestor(Z,Y)
  Notice we have union and recursion. Can also have negation
Conjunctive queries:
  Brother(X,Y) :- Son(Z,X), Son(Z,Y)
  No union (one rule only), no recursion, no negation
StruQL: runs on graphs, the result is a graph
  Query Q:
    where Person{X}, X -> ("paper"|"publication") -> Y
    collect Page{PersonPage(X), PaperPage(Y)}
    link RootPage() -> "person" -> PersonPage(X),
         PersonPage(X) -> "paper" -> PaperPage(Y)
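The two Ancestor rules above can be read as a fixpoint computation, and the Brother rule as a single join. A minimal sketch, assuming that Father(X,Y) means "X is the father of Y" and Son(Z,X) means "Z has son X"; the function names and the sample facts are illustrative and not taken from the paper:

```python
def ancestors(father):
    """Naive bottom-up evaluation of the two Ancestor rules."""
    ancestor = set(father)  # Ancestor(X,Y) :- Father(X,Y)
    while True:
        # Ancestor(X,Y) :- Ancestor(X,Z), Ancestor(Z,Y)
        derived = {(x, y)
                   for (x, z1) in ancestor
                   for (z2, y) in ancestor
                   if z1 == z2}
        if derived <= ancestor:   # nothing new: fixpoint reached
            return ancestor
        ancestor |= derived


def brothers(son):
    """Brother(X,Y) :- Son(Z,X), Son(Z,Y): one rule, no recursion."""
    return {(x, y) for (z1, x) in son for (z2, y) in son if z1 == z2}


# Illustrative facts only
father = {("Abraham", "Isaac"), ("Isaac", "Jacob")}
son = {("Abraham", "Isaac"), ("Abraham", "Ishmael")}

ancestor_result = ancestors(father)
brother_result = brothers(son)
```

The recursive rule needs the loop; the conjunctive query is a single pass over the facts. Note that the Brother rule as written also returns pairs with X = Y.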
5
Query Containment
Find out whether the results of one query are contained in the results of another query
  For all databases
  Formal definition will be given shortly
Good for:
  Finding redundant subgoals in a query
  Testing whether two formulations of a query are equivalent
  Determining independence of database updates
  Rewriting queries using views
6
Known Results
Query containment for first-order conjunctive queries is decidable (and NP-complete)
  Brother(X,Y) :- Son(Z,X), Son(Z,Y)
  OlderBrother(X,Y) :- Son(Z,X), Son(Z,Y), Older(X,Y)
Queries in StruQL can be translated into Datalog
  where Person{X}, X -> ("paper"|"publication") -> Y
  collect Page{PersonPage(X), PaperPage(Y)}
  link RootPage() -> "person" -> PersonPage(X),
       PersonPage(X) -> "paper" -> PaperPage(Y)
  PaperPage(Y) :- Person(X), WrotePaper(X,Y)
  PersonPage(X) :- Person(X), WrotePaper(X,Y)
Containment of Datalog programs is undecidable
All positive results for containment so far are restricted to the case when one of the programs is non-recursive
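For plain conjunctive queries, the classical containment test freezes the variables of one query into constants and evaluates the other query on the result. A minimal sketch of that test on the Brother / OlderBrother pair, assuming a hypothetical encoding of a query as (head, body) with variables written as capitalized strings; the brute-force evaluation is for illustration only and is exponential in the number of variables:

```python
from itertools import product


def is_var(term):
    return term[0].isupper()  # assumed convention: variables start uppercase


def evaluate(query, facts):
    """All head tuples of a conjunctive query over a set of (pred, args) facts."""
    head, body = query
    variables = sorted({t for _, args in body for t in args if is_var(t)})
    constants = sorted({c for _, args in facts for c in args})
    answers = set()
    for values in product(constants, repeat=len(variables)):
        env = dict(zip(variables, values))
        ground = lambda args: tuple(env.get(t, t) for t in args)
        if all((pred, ground(args)) in facts for pred, args in body):
            answers.add(ground(head))
    return answers


def contained_in(q1, q2):
    """Q1 is contained in Q2 iff Q2 returns Q1's frozen head on Q1's frozen body."""
    head1, body1 = q1
    # Freeze each variable X into a fresh constant c_X
    freeze = lambda args: tuple("c_" + t if is_var(t) else t for t in args)
    canonical_facts = {(pred, freeze(args)) for pred, args in body1}
    return freeze(head1) in evaluate(q2, canonical_facts)


brother = (("X", "Y"), [("Son", ("Z", "X")), ("Son", ("Z", "Y"))])
older_brother = (("X", "Y"), [("Son", ("Z", "X")), ("Son", ("Z", "Y")),
                              ("Older", ("X", "Y"))])

older_in_brother = contained_in(older_brother, brother)
brother_in_older = contained_in(brother, older_brother)
```

OlderBrother only adds a subgoal, so it is contained in Brother, while the converse fails because the frozen body of Brother has no Older fact.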
7
New Results
Define StruQL0 as a subset of StruQL
  Leaving out restructuring capabilities
  Similar to conjunctive queries for relational calculus
Give semantic and syntactic criteria for query containment
  StruQL0 identifies a subset of Datalog for which containment is decidable
Show that query containment for a fragment of StruQL0 is NP-complete
8
The Data Model
Labeled directed graphs
  Nodes correspond to objects
  Labels on the edges correspond to attributes
Formally:
  A universe of constants D
  A universe of object identifiers I (I ∩ D = ∅)
  A database DB is a pair (V,E) with V ⊆ I and E ⊆ V × D × V
In the example:
  D = {a,b,c,d}
  V = I = {u1,u2,u3,u4,u5,u6}
  E = {(u1,c,u6), (u1,a,u5),…}
[Figure: example database graph with nodes u1 to u6 and edges labeled a, b, c, d]
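The pair (V,E) maps directly onto a small data structure. A minimal sketch, assuming hypothetical names (Database, make_database, outgoing); only the two edges spelled out on the slide are included, since the rest of E is elided there:

```python
from typing import NamedTuple


class Database(NamedTuple):
    nodes: frozenset  # V, a subset of the object identifiers I
    edges: frozenset  # E, triples (source, label, target)


def make_database(nodes, edges, constants):
    """Build DB = (V, E), checking that E is a subset of V x D x V."""
    nodes, edges = frozenset(nodes), frozenset(edges)
    if nodes & frozenset(constants):
        raise ValueError("I and D must be disjoint")
    for source, label, target in edges:
        if source not in nodes or target not in nodes:
            raise ValueError(f"edge endpoint outside V: {(source, label, target)}")
        if label not in constants:
            raise ValueError(f"label outside D: {label}")
    return Database(nodes, edges)


def outgoing(db, node):
    """Labeled successors of a node, as (label, target) pairs."""
    return {(label, target) for source, label, target in db.edges if source == node}


D = {"a", "b", "c", "d"}
V = {"u1", "u2", "u3", "u4", "u5", "u6"}
E = {("u1", "c", "u6"), ("u1", "a", "u5")}  # only the edges listed on the slide
db = make_database(V, E, D)
```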
9
A StruQL0 Query
Queries are allowed to include regular path expressions over the attributes
  Give the ability to deal with lack of schema
  R := ε | a | _ | L | (R1.R2) | (R1|R2) | R*
Q1 : q1(X,Z) :- X L+ Z, Y a Z, X (a+|(a.b*)) Z
The relation RQ(X,Y,Z,L) has arity 4
RQ contains 4 tuples: {(u1,u1,u5,c),(u1,u1,u5,a),(u2,u1,u5,d),(u2,u2,u3,a)}
Q(DB) is the projection of RQ on X and Z: {(u1,u5),(u2,u5),(u2,u3)}
[Figure: the example database graph (nodes u1 to u6, edge labels a, b, c, d), with the three path expressions of Q1 marked R1, R2, R3]
10
A StruQL0 Query
Q1 : q1(X,Z) :- X L+ Z, Y a Z, X (a+|(a.b*)) Z
Formally:
  Node variables range over the nodes in the graph. Denoted by capital letters
  Arc variables range over the labels of edges in the graph. Denoted by L or Li
  A regular path expression is defined by the grammar:
    R := ε | a | _ | L | (R1.R2) | (R1|R2) | R*
    ε is the empty string
    a is a label constant
    _ denotes any label
    L is an arc (label) variable
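Each production of the grammar above has a direct relational reading: an expression denotes the set of node pairs joined by a matching path. A minimal sketch, assuming a hypothetical tuple encoding of expressions (the tags "eps", "label", "any", "var", "cat", "alt", "star" are illustrative names) and a toy two-edge graph that is not the slide's example. R+ is written as R.R*:

```python
def compose(left, right):
    """Relational composition: (s, t) if s reaches m in left and m reaches t in right."""
    return {(s, t) for s, m in left for m2, t in right if m == m2}


def pairs(expr, nodes, edges, env=None):
    """All (start, end) node pairs joined by a path whose labels match expr."""
    env = env or {}
    kind = expr[0]
    if kind == "eps":      # empty path
        return {(v, v) for v in nodes}
    if kind == "label":    # label constant a
        return {(s, t) for s, a, t in edges if a == expr[1]}
    if kind == "any":      # wildcard _
        return {(s, t) for s, _, t in edges}
    if kind == "var":      # arc variable, looked up in the substitution env
        return {(s, t) for s, a, t in edges if a == env[expr[1]]}
    if kind == "alt":
        return pairs(expr[1], nodes, edges, env) | pairs(expr[2], nodes, edges, env)
    if kind == "cat":
        return compose(pairs(expr[1], nodes, edges, env),
                       pairs(expr[2], nodes, edges, env))
    if kind == "star":     # reflexive-transitive closure by fixpoint
        step = pairs(expr[1], nodes, edges, env)
        closure = {(v, v) for v in nodes}
        while True:
            bigger = closure | compose(closure, step)
            if bigger == closure:
                return closure
            closure = bigger
    raise ValueError(f"unknown expression tag: {kind}")


# Toy graph (an assumption, not the database from the slides)
nodes = {"n1", "n2", "n3"}
edges = {("n1", "a", "n2"), ("n2", "b", "n3")}

a_then_b_star = ("cat", ("label", "a"), ("star", ("label", "b")))   # a.b*
l_plus = ("cat", ("var", "L"), ("star", ("var", "L")))              # L+

ab_pairs = pairs(a_then_b_star, nodes, edges)
l_pairs = pairs(l_plus, nodes, edges, {"L": "b"})
```

This evaluates one conjunct under a fixed binding of the arc variables; a full query would intersect such relations across conjuncts that share node variables.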
11
A StruQL0 Query - Components
Q : q(X) :- Y1 R1 Z1,…, Yn Rn Zn
nvar(Q) ≡ {Y1,…,Yn,Z1,…,Zn} (node variables)
  Need not be distinct
Regular path expressions: {R1,…,Rn}
avar(Q) ≡ the set of arc variables occurring in R1,…,Rn
var(Q) ≡ nvar(Q) ∪ avar(Q)
X ⊆ nvar(Q) (head variables)
Atoms(Q) ≡ the set of constants occurring in R1,…,Rn
Yi Ri Zi, i=1,…,n, are conjuncts
12
A StruQL0 Query - Semantics
Q : q(X) :- Y1 R1 Z1,…, Yn Rn Zn
Semantics: a substitution is a function φ: var(Q) → I ∪ D
  Node variables are mapped to I
  Arc variables are mapped to D
  φ(Yi Ri Zi) denotes the path in DB corresponding to the conjunct (Yi Ri Zi)
  We write φ: Q → DB
Each substitution defines a tuple in the relation RQ
The answer to Q is the projection of RQ on the variables in X
The result of applying Q to a database is Q(DB)
13
A StruQL0 Query
Notice the advantages for semi-structured data:
  Regular path expressions
  Arc variables
For example:
  Q2 : q2(X,Y) :- X L Y
    Query for first-degree relatives
    L can be older brother, younger brother, son, wife, and maybe more (first wife? ex-wife?)
  Q3 : q3(X,Y) :- X ("son"|"daughter")+ (ε|L) Y
    Query for descendants and their relatives
[Figure: family graph. Nodes: Abraham, Isaac, Jacob, Rachel, Ishmael, Leah. Edge labels: son, wife, o. brother (older brother), y. brother (younger brother).]
14
Containment
A query Q1 is contained in a query Q2, written Q1 ⊆ Q2, if Q1(DB) ⊆ Q2(DB) for all databases DB
The queries Q1 and Q2 are equivalent, written Q1 ≡ Q2, if Q1 ⊆ Q2 and Q2 ⊆ Q1
Example:
  Q1 : q1(X,Z) :- X L+ Z, Y a Z, X (a+|(a.b*)) Z
  Q2 : q2(X,Z) :- X a+ Z
  Q1(DB) = {(u1,u5),(u2,u5),(u2,u3)}
  Q2(DB) = {(u1,u5),(u2,u3)}
[Figure: example database graph with nodes u1 to u6 and edges labeled a, b, c, d]
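The two answer sets in the example can be compared directly. A minimal sketch using only the tuples listed on the slide; the helper name compare_on_database is hypothetical:

```python
q1_result = {("u1", "u5"), ("u2", "u5"), ("u2", "u3")}  # Q1(DB) from the slide
q2_result = {("u1", "u5"), ("u2", "u3")}                # Q2(DB) from the slide


def compare_on_database(r1, r2):
    """Compare two answer sets computed over the same database."""
    return {
        "r1_subset_r2": r1 <= r2,
        "r2_subset_r1": r2 <= r1,
        "only_in_r1": r1 - r2,
        "only_in_r2": r2 - r1,
    }


report = compare_on_database(q1_result, q2_result)
```

On this database the tuple (u2,u5) is in Q1(DB) but not in Q2(DB), so Q1 is not contained in Q2. The inclusion Q2(DB) ⊆ Q1(DB) holds here, but a single database can only refute containment, never prove it, because the definition quantifies over all databases.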
15
Canonical Databases - Intuition
A canonical database for Q is a pair (DB,ξ)
  ξ is a substitution
  A bifurcation node for each node variable
  A corresponding internal path for each conjunct
Q: q(X1,X2) :- X1 (a.L.(_)*) X2, X2 (b.c)* Y, X2 (a|L)* Z, Y (c|d) X1
How many canonical databases for a query?
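[Figure: one canonical database for Q. The bifurcation nodes X1, X2, Y, Z are joined by internal paths made of internal nodes; edge labels include a, b, c, d, e, f and the arc variable L.]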
16
Canonical Databases: Formal definition
Q: q(X1,X2) :- X1 (a.L.(_)*) X2, X2 (b.c)* Y, X2 (a|L)* Z, Y (c|d) X1
Each internal node belongs to one internal path, with one outgoing and one incoming edge
The mapping of node variables to bifurcation nodes is surjective
Each arc variable L is mapped to itself
For each conjunct Yi Ri Zi, the path ξ(Yi Ri Zi) is internal and the mapping is one-to-one
[Figure: the same canonical database for Q, with bifurcation nodes, internal nodes and internal paths marked]
17
Semantic Criteria for Query Containment
Query Q has head variables X1,…,Xn and canonical database (DB, ξ)
  (ξ(X1),…,ξ(Xn)) is the canonical tuple
Proposition 1. Given two queries Q, Q’:
  Q ⊆ Q’ iff for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’
18
Proposition 1 Proof
If Q is contained in Q’ then for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’
1. StruQL0 queries are generic: if Q is contained in Q’ for databases over the universe D, then it is also contained in Q’ for databases over D’, where D’ ≡ D ∪ avar(Q)
2. (DB, ξ) contains constants in D, with the addition of the arc variables of Q, so it is a database over D’
3. If Q is contained in Q’, then for each canonical DB over D its canonical tuple is in Q’(DB)
4. According to 1, the canonical tuple of Q is also in Q’(DB’) for a canonical database DB’ over D’
19
Proposition 1 Proof
If Q is contained in Q’ then for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’
Example:
  Q1 : q1(X,Z) :- X L+ Z, Y a Z, X (a+|(a.b*)) Z
  Q2 : q2(X,Z) :- X a+ Z
  Q1(DB) = {(u1,u5),(u2,u5),(u2,u3)}
  Q2(DB) = {(u1,u5),(u2,u3)}
  D’ ≡ D ∪ avar(Q) = {a,b,c,d,L}
  Canonical database for Q2: a single edge labeled a from X to Z
  On this canonical database, (X,Z) ∈ Q1(DB)
[Figure: the example database graph (nodes u1 to u6, edge labels a, b, c, d) next to the canonical database for Q2]
20
Proposition 1 Proof
If for any canonical database (DB, ξ) for Q, its canonical tuple is in the answer of Q’, then Q is contained in Q’
Assume the contrary: Q is not contained in Q’
  There exists some database DB and some tuple u=(u1,…,uk) of nodes and/or label constants in DB, such that u is in Q(DB) but not in Q’(DB)
  We will construct a canonical database which will contradict the assumption
21
Proposition 1 Proof
There exists a substitution φ: Q → DB so that φ(X)=u
We construct (DB0,ξ)
  The bifurcation nodes are {φ(X) | X is in nvar(Q)}
  Define ξ(X) = φ(X) for all X in nvar(Q)
  So the mapping of node variables is the same in both databases
For each conjunct Y R Z we consider the path φ(Y R Z) in DB
  This path is not necessarily simple
  It may contain bifurcation nodes
  This is because DB is not canonical
Example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L.L.L X3
  φ(X1,X2,X3,L)=(A,B,C,a)
  Bifurcation nodes in DB0: A,B,C
[Figure: database DB with nodes A, B, C and edges labeled a, a, a, b]
22
Proposition 1 Proof
Introduce a fresh internal node for every occurrence of a node on the path φ(Y R Z)
  This results in a simple path
In the example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L.L.L X3
[Figure: DB over nodes A, B, C, and the same database after fresh internal nodes are introduced on the path for X1 L.L.L X3]
23
Proposition 1 Proof
Now replace some labels:
  Let A be some non-deterministic automaton equivalent to R, where arc variables are viewed as constants
  By definition, the labels on ξ(Y R Z) are accepted by A
  Replace each label that causes an arc-variable transition in the run of A on ξ(Y R Z) with the corresponding arc variable L
In the example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L.L.L X3
[Figure: the unfolded database over A, B, C before and after the labels on the path for X1 L.L.L X3 are replaced by L]
24
Proposition 1 Proof
Example: Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L.L.L X3
  φ(X1,X2,X3,L)=(A,B,C,a)
  Bifurcation nodes in DB0: A,B,C
[Figure: DB and DB0 side by side over nodes A, B, C, with the morphism ψ: DB0 → DB and the substitution φ’: Q’ → DB0]
25
Proposition 1 Proof
DB0 is a canonical database
We have a graph morphism ψ: DB0 → DB
  Bifurcation nodes are sent to themselves
  Internal nodes are sent to their originating nodes
We assumed that Q is not contained in Q’ even though the canonical tuple for (DB0,ξ) is in the answer Q’(DB0)
So we must have a substitution φ’: Q’ → DB0
Compose φ’ with ψ and get a substitution ψ ○ φ’ : Q’ → DB0 → DB
This implies that u is in the answer of Q’ too, contradicting the assumption □
26
Decidability of containment
We still have an infinite number of canonical databases:
  The internal paths can be of any length
    Q: q(X,Y) :- X L* Y
  The number of substitutions can be infinite
    Q: q(X,Y) :- X _ Y
It is sufficient to examine only databases whose internal paths are no longer than some N which depends only on Q and Q’
Only a set of n × N constants is sufficient, with N from above and n the number of conjuncts in Q. (Only the constants in DQ,Q’ ∪ avar(Q))
The resulting algorithm for containment runs in triple exponential space
  But it shows decidability
27
Path Length is bounded
Remember Ai is the non-deterministic automaton equivalent to Ri, for each conjunct Yi Ri Zi
The path between ξ(Y) and ξ(Z) represents a run of Ai
Its length is bounded by N = |nvar(Q)| × |states(Ai)| + 2
  If a variable appears in the path ξ(Yi Ri Zi) more than |states(Ai)| times, it can be cut short, and still satisfy Ri
Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L+ X3
We must check all the runs of automata in Q’ on paths in the canonical DB of Q
Proof in Appendix A of the full version of the paper
[Figure: canonical database over A, B, C with edges labeled a, b and a path of L-labeled edges]
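The bound above rests on a pigeonhole argument: once the same (point, state) pair repeats along a run, the loop between the two occurrences can be removed. A minimal sketch, assuming a hypothetical representation of a run as a list of (point, state) pairs; the state count of 2 used for the automaton of L+ is an assumption for illustration:

```python
def path_length_bound(num_node_vars, num_states):
    """N = |nvar(Q)| x |states(Ai)| + 2, as stated on the slide."""
    return num_node_vars * num_states + 2


def cut_repeats(run):
    """Remove every loop between two occurrences of the same (point, state) pair."""
    shortened, position = [], {}
    for config in run:
        if config in position:
            # Seen before: drop everything after its first occurrence
            keep = position[config] + 1
            for dropped in shortened[keep:]:
                del position[dropped]
            shortened = shortened[:keep]
        else:
            position[config] = len(shortened)
            shortened.append(config)
    return shortened


# Q: q(X1,X3) :- X1 a X2, X2 b X3, X1 L+ X3 has 3 node variables
bound = path_length_bound(num_node_vars=3, num_states=2)

# A run that visits A in state 0 twice; the loop through B is cut
run = [("A", 0), ("B", 1), ("A", 0), ("C", 1)]
short_run = cut_repeats(run)
```

After cutting, all pairs in the run are distinct, so its length cannot exceed the number of points times the number of states. Each remaining step was also a step of the original run, so the shortened run is still accepted.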
28
Containment by Query mapping
A query mapping f: Q’ → DB sends conjuncts in Q’ to some path in the canonical database of Q
  There exist only finitely many mappings
  They can be encoded in polynomial space
A query mapping f: Q’ → Q can ‘cover’ a canonical database DB for Q
Q ⊆ Q’ iff all query mappings together cover all canonical databases
  All canonical DBs for a query can be described in a regular language WQ
  For each mapping f, there is a regular expression for all databases covered by it, Wf
  Q ⊆ Q’ iff WQ ⊆ ∪f Wf
  Exponential space
[Figure: a mapping f sends the conjunct Y’ A’ Z’ of Q’ to a path of points p1, p2, p3, …, pn-1, pn in Q]
29
Simple StruQL0 queries
The regular expressions Ri in Q are of the form r1.r2...rn, where each ri is either * or a label constant
Examples:
  a.*.b.* and *.*.a.a.* are simple regular expressions
  a*.b and _._ are not simple regular expressions
Given two regular expressions, their containment can be checked in polynomial space
The containment problem of two simple queries is NP-complete
  By reduction to conjunctive queries
First subset including recursion for which deciding containment is no harder than for conjunctive queries
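The shape of a simple regular expression is easy to check mechanically. A minimal sketch, assuming that a bare * stands for any sequence of labels (the slide does not spell this out) and that label constants are alphanumeric strings; the names parse_simple and matches are hypothetical:

```python
def parse_simple(expr):
    """Split r1.r2...rn and check that each ri is '*' or a label constant."""
    parts = expr.split(".")
    for part in parts:
        if part != "*" and not part.isalnum():
            raise ValueError(f"not a simple regular expression: {expr!r}")
    return parts


def skip_stars(parts, positions):
    """A '*' may match the empty sequence, so it can always be stepped over."""
    closed, stack = set(positions), list(positions)
    while stack:
        i = stack.pop()
        if i < len(parts) and parts[i] == "*" and i + 1 not in closed:
            closed.add(i + 1)
            stack.append(i + 1)
    return closed


def matches(parts, word):
    """Does the label sequence `word` match the simple expression `parts`?"""
    positions = skip_stars(parts, {0})
    for label in word:
        moved = set()
        for i in positions:
            if i < len(parts):
                if parts[i] == "*":
                    moved.add(i)        # '*' absorbs the label
                elif parts[i] == label:
                    moved.add(i + 1)    # the label constant is consumed
        positions = skip_stars(parts, moved)
    return len(parts) in positions


simple = parse_simple("a.*.b.*")
accepted = matches(simple, ["a", "c", "b"])
rejected = matches(simple, ["b"])
```

The two non-examples from the slide are rejected by the parser: a*.b applies the star to a label, and _._ uses the wildcard outside a star.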
30
Summary
StruQL0: conjunctive queries with regular expressions
Canonical databases: semantic criteria for query containment
  Containment is decidable
  But in triple exponential space
Query mappings: syntactic criteria for query containment
  Exponential space
Simple StruQL0 queries: a subset for which containment is NP-complete
31
Backup slides:
Containment by Query Mapping
We will show that Q is contained in Q’ iff a certain condition holds on all query mappings from Q’ to Q
Q is a query with n conjuncts: Q: q(X) :- Y1 R1 Z1,…, Yn Rn Zn
  nvar(Q) = {Y1,Z1,…,Yn,Zn}
  Ai is a fixed non-deterministic automaton for each regular expression Ri
A point in Q is either
  A node variable (variable-point)
  A pair (Ai,s) where s is a state in Ai (automaton-point)
points(Q) is the set of points in Q
32
Canonical DB and query points
Nodes in a canonical database DB for Q correspond to points in Q
Several internal nodes in DB may correspond to the same automaton-point
Bifurcation nodes in DB correspond both to variable-points and to automaton points (Ai,s) where s is an initial or terminal state
33
Path in a Query
Given a query Q, a path of points in Q is a sequence p1,…,pn, n≥2
  p2,…,pn-1 are all variable-points (p1, pn can be automaton-points)
  Any two adjacent points are connected in Q:
    If pj, pj+1 are variable-points, there is a conjunct Yi Ri Zi in Q with pj=Yi and pj+1=Zi
    If p1 is an automaton-point (p2 is a variable-point), there exists a conjunct Yi Ri Zi in Q so that Ai is the automaton associated with Ri, and p2 = Zi
    If pn is an automaton-point (pn-1 is a variable-point), there exists a conjunct Yi Ri Zi in Q so that Ai is the automaton associated with Ri, and pn-1 = Yi
    If n=2 and both p1 and p2 are automaton-points, they refer to the same automaton
34
Canonical DB and query path
Let U = u1,u2,…,um be a path in a canonical database DB for Q.
Suppose we drop all internal nodes from u2,…,um-1
Let u1=ui1,ui2,…,uin-1,uin=um be the resulting subsequence
We say that U corresponds to the path of points p1,…,pn iff each uik corresponds to pk, for k=1,…,n
Paths of points rephrase paths in canonical databases
35
Query mapping
Consider some other query Q’
  Ai’ is a non-deterministic automaton for each Ri’ in Q’
  Let X, X’ be head variables in Q, Q’ respectively
A query mapping f: Q’ → Q consists of:
1. Two mappings, f: nvar(Q’) → points(Q) and f: avar(Q’) → DQ,Q’ ∪ avar(Q), so that f(X’) = X
2. A mapping from conjuncts Yi’ Ri’ Zi’ in Q’ to paths of points in Q, f(Yi’ Ri’ Zi’) = p1,…,pn, so that n ≤ |nvar(Q)| × |states(Ai)| + 2, f(Yi’) = p1, f(Zi’) = pn
3. For each conjunct Yi Ri Zi in Q, a total preorder on those variables Z’ in nvar(Q’) for which f(Z’) is an automaton-point corresponding to Ai
   Whenever X’≤Y’ and Y’≤X’ then f(X’)=f(Y’)
36
Query mapping
For some canonical database (DB,ξ), a substitution φ: Q’ → DB is canonical if φ(X’) is the canonical tuple in DB
Condition 1:
  A substitution now sends conjuncts in Q’ to some path in the canonical database, and not variables to nodes and arc variables to arcs
[Figure: a mapping f sends the conjunct Y’ A’ Z’ of Q’ to a path of points p1, p2, p3, …, pn-1, pn in Q]
37
Path Length is bounded
Condition 2:
  The path of points p1,…,pn may have cycles
  Its length is bounded by |nvar(Q)| × |states(Ai)| + 2
  If a variable appears in the path f(Y’ R’ Z’) more than |states(Ai)| times, it can be cut short, and still satisfy R’
[Figure: a mapping f sends the conjunct Y’ A’ Z’ of Q’ to a path of points p1, p2, p3, …, pn-1, pn in Q]
38
Preorder
Condition 3:
  The preorder defines:
    Equivalence classes on the variables (X’≤Y’ and Y’≤X’ ⇒ X’≡Y’)
    A total order on the equivalence classes
  The query mapping imposes such an order on all variables sent by f to points on the same automaton: (A,s1), (A,s2), (A,s3), …
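A total preorder splits into ordered equivalence classes in a few lines. A minimal sketch, assuming the preorder is given as a hypothetical predicate leq(x, y); the variable names and the positions used in the example are illustrative only:

```python
from itertools import groupby


def equivalence_classes(variables, leq):
    """Group the variables of a total preorder into equivalence classes, in order."""
    # In a total preorder, x and y are equivalent exactly when the same
    # number of elements lie below them, so this count orders the classes.
    rank = {x: sum(1 for y in variables if leq(y, x)) for x in variables}
    ordered = sorted(variables, key=lambda x: (rank[x], x))
    return [sorted(group) for _, group in groupby(ordered, key=rank.get)]


# Illustrative preorder: variables compared by a position along one internal path
position = {"X'": 1, "Y'": 1, "Z'": 2}
leq = lambda x, y: position[x] <= position[y]

classes = equivalence_classes(list(position), leq)
```

Here X’ and Y’ fall into one class because each is below the other, and that class precedes the class of Z’.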
39
Substitutions and mappings
A canonical substitution φ: Q’ → DB corresponds to a query mapping f: Q’ → Q if:
1. For each conjunct Y’ R’ Z’ in Q’, the path φ(Y’ R’ Z’) corresponds to the path of points f(Y’ R’ Z’)
2. For any internal path in DB corresponding to Y R Z, the preorder on all variables mapped by φ onto that path coincides with the preorder given by f
There is always a query mapping between two queries
For given Q, Q’, there exist only finitely many mappings
Each mapping can be encoded in polynomial space
40
Containment
A query mapping f: Q’ → Q covers a canonical database DB for Q if there is some canonical substitution φ: Q’ → DB which corresponds to f
  Some query mappings don’t cover any canonical database
Q is contained in Q’ iff all query mappings together cover all canonical databases
  All canonical databases for a query can be described in a regular language WQ
  For each mapping f, there is a regular expression for all databases covered by it, Wf
  Q ⊆ Q’ iff WQ ⊆ ∪f Wf
  This can be computed in exponential space
41
The connection between the syntactic and semantic criteria
Q ⊆ Q’ iff all query mappings together cover all canonical databases
If a query mapping covers a canonical database for Q, then the canonical tuple in the database is in the answer of Q’.
This is implied by the definitions of canonical substitution, of correspondence between a mapping and a substitution, and of “covering” a database.
Both criteria (syntactic and semantic) rely on Proposition 1, but present different algorithms to check containment of two queries.
42
Known results for regular expressions
Containment of regular expressions is PSPACE-complete
  L.J. Stockmeyer and A.R. Meyer. Word problems requiring exponential time. In 5th STOC, pages 1–9. ACM, 1973.
Containment of simple regular expressions is in PTIME
  Tova Milo and Dan Suciu. Index structures for path expressions. In 7th ICDT, pages 277–295. Springer-Verlag, 1999.