Containment of Nested XML Queries
Xin (Luna) Dong, Alon Halevy, Igor Tatarinov
University of Washington
Query Containment
The most fundamental relationship between a pair of queries
Query Q is contained in Q’ if, for any database D, Q(D) is a subset of Q’(D)
Applications of Query Containment
Semantic caching
Reasoning about contents of data sources in data integration
Verification of integrity constraints
Verification of knowledge bases
Determining queries independent of updates
Query answering using views
Query Processing in PDMS
XML Query Containment in Peer Data Management System (PDMS)
Answering queries using views to extract remote data
Removing redundant queries to enhance performance [Tatarinov and Halevy, SIGMOD 2004]
[Figure: a PDMS with four peers (UW, Stanford, Berkeley, UPenn), with labels MWS, MPW, MSB, MBW and QW, QP, QS, QB1, QB2]
Query Containment: Relational vs. XML
                     | Relational                              | XML
Input D              | Sets of tuples                          | An XML instance tree
Output Q(D)          | A set of tuples                         | An XML instance tree
Instance containment | Q(D) ⊆ Q’(D): subset                    | Q(D) ⊆ Q’(D): tree homomorphism
Query containment    | Q ⊆ Q’: for every input D, Q(D) ⊆ Q’(D) | Q ⊆ Q’: for every input D, Q(D) ⊆ Q’(D)
Example – An XML Instance
D:
<project>
<member>Alice</member>
</project>
<project>
<member>Bob</member>
</project>
[Figure: tree view of D: two project nodes, each with one member child, holding Alice and Bob respectively]
Example – An XML Query
Q:
for $x in /project return
<group>{
for $y in $x/member return
<name>{
where $y=“Alice”
return <Alice/>
where $y=“Bob”
return <Bob/>
}</name>
}</group>
[Figure: D (two project nodes, each with one member child, Alice and Bob) and Q(D) (two group nodes, each with one name child, holding Alice and Bob respectively)]
Example – Another XML Query
Q’:
for $x in /project return
<group>{
for $y in /project/member return
<name>{
where $y=“Alice”
return <Alice/>
where $y=“Bob”
return <Bob/>
}</name>
}</group>
[Figure: D (two project nodes, each with one member child, Alice and Bob) and Q’(D) (a group node with two name children, holding Alice and Bob respectively)]
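The two example queries differ only in the path of the inner loop. As a rough sketch, they can be hand-translated into Python over the instance D. The tuple encoding of trees and the helper names (`member_values`, `name_block`, `q`, `q_prime`) are assumptions made for illustration and are not part of the original slides.

```python
# Trees are (tag, children) pairs; text values are modelled as leaf nodes.
D = [
    ("project", [("member", [("Alice", [])])]),
    ("project", [("member", [("Bob", [])])]),
]

def member_values(project):
    """Text values under the <member> children of one <project> node."""
    _, children = project
    return [value for tag, kids in children if tag == "member" for value, _ in kids]

def name_block(value):
    """The <name> block: emits <Alice/> or <Bob/> depending on the member value."""
    return ("name", [(tag, []) for tag in ("Alice", "Bob") if value == tag])

def q(forest):
    """Q: the inner loop ranges over $x/member (members of the current project)."""
    return [("group", [name_block(v) for v in member_values(p)]) for p in forest]

def q_prime(forest):
    """Q': the inner loop ranges over /project/member (members of every project)."""
    everyone = [v for p in forest for v in member_values(p)]
    return [("group", [name_block(v) for v in everyone]) for _ in forest]

print(q(D))        # one <name> per group
print(q_prime(D))  # two <name> children per group
```

In this encoding Q yields one name per group, while Q’ yields a group with both names for each project.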
Example – Tree Homomorphism and Query Containment
Q(D) ⊆ Q’(D): a tree homomorphism maps Q(D) into Q’(D)
Q’(D) ⊄ Q(D): no tree homomorphism exists in this direction (marked X on the slide)
[Figure: Q(D) (two group nodes, each with one name child) next to Q’(D) (a group node with two name children, Alice and Bob)]
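The homomorphism test on answer trees can be sketched in a few lines. This sketch assumes that a homomorphism maps each node to a node with the same tag and preserves parent-child edges, and that it need not be injective. The function names `embeds` and `forest_contained` and the tuple encoding are illustrative assumptions.

```python
# Trees are (tag, children) pairs.
def embeds(a, b):
    """True if tree a maps homomorphically into tree b, root to root."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    return tag_a == tag_b and all(
        any(embeds(ka, kb) for kb in kids_b) for ka in kids_a
    )

def forest_contained(fa, fb):
    """True if every tree of forest fa embeds into some tree of forest fb."""
    return all(any(embeds(a, b) for b in fb) for a in fa)

# Q(D): two groups, each with a single name.
q_d = [
    ("group", [("name", [("Alice", [])])]),
    ("group", [("name", [("Bob", [])])]),
]
# Q'(D): a group holding both names.
q_prime_d = [
    ("group", [("name", [("Alice", [])]), ("name", [("Bob", [])])]),
]

print(forest_contained(q_d, q_prime_d))  # True
print(forest_contained(q_prime_d, q_d))  # False
```

The second check fails because no single group in Q(D) has both a name with Alice and a name with Bob.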
Query Containment Problem
From answer containment to query containment
Q’(D) ⊄ Q(D), so Q’ ⊄ Q
Q(D) ⊆ Q’(D), but does Q ⊆ Q’ hold?
Our problems
Given queries Q and Q’, decide whether Q ⊆ Q’
The complexity of query containment
Previous Work (I)
Relational query containment
Conjunctive queries [Chandra and Merlin, STOC 1977]
Acyclic queries [Yannakakis, VLDB 1981]
Queries with union [Sagiv and Yannakakis, JACM 1980]
Queries with negation [Levy and Sagiv, VLDB 1993]
Queries with arithmetic comparisons [Klug, JACM 1988]
Recursive queries [Shmueli, 1993], [Chaudhuri and Vardi, 1992]
Queries over bags [Ioannidis and Ramakrishnan, 1995]
Previous Work (II)
XML query containment – two new challenges
XPath containment
  With *, // and […] [Miklau and Suciu, PODS 2002]
  With equality testing on tag variables [Deutsch and Tannen, KRDB 2001]
  Conjunctive queries over path expressions [Florescu, Levy and Suciu, PODS 1998]
Nested query containment
Containment Cannot be Determined Solely by Comparing XPath Components
Q:
for $g in /group
where $g/gname/text() = “database”
return
<area>{
for $p in $g/person return
<person>
<name>{$p/text()}</name>
{for $q in $g/paper
where $q/author/text() = $p/text() return
<paper>{$q/title/text()}</paper>}
</person>
}</area>
Q’:
for $g in /group return
<area>{
for $p in $g/person return
<person>
<name>{$p/text()}</name>
<group>{$g/gname/text()}</group>
{for $q in $g/paper
where $q/author/text() = $p/text() return
<paper>{$q/title/text()}</paper>}
</person>
}</area>
Previous Work (II)
XML query containment – two new challenges
XPath containment
  With *, // and […] [Miklau and Suciu, PODS 2002]
  With equality testing on tag variables [Deutsch and Tannen, KRDB 2001]
  Conjunctive queries over path expressions [Florescu, Levy and Suciu, PODS 1998]
Nested query containment
  Complex object query containment [Levy and Suciu, PODS 1997]
Containment of nested XML queries has not been fully studied
Our Focus: Nested XML Queries
Returned tag constants
Conjunctive – no two sibling query blocks return the same tag
XPath:
  HAVE: child axis (/), wildcards (*), branches ([…])
  NOT HAVE: descendant axis (//), arithmetic comparison, union
Here, XPath containment is in PTIME
Complexity Result (I)
Fanout \ Depth | Fixed         | Arbitrary
= 1            | PTIME         | PTIME
Arbitrary      | coNP-complete | in coNEXPTIME
Complexity Result (II)
Query Type  | No tag variables | With tag variables | With unions   | With negation | With //       | With equi-join on tags | With arith. comp.
Un-nested   | PTIME            | PTIME              | coNP-complete | coNP-complete | coNP-complete | NP-complete            | Π₂ᴾ-complete
Fan-out = 1 | PTIME            | PTIME              |
Fixed-depth | coNP-complete    | coNP-complete      |
General     | in coNEXPTIME    | in coNEXPTIME      |
Complexity Result (II)
Query Type  | No tag variables | With tag variables | With unions   | With negation | With //       | With equi-join on tags | With arith. comp.
Un-nested   | PTIME            | PTIME              | coNP-complete | coNP-complete | coNP-complete | NP-complete            | Π₂ᴾ-complete
Fan-out = 1 | PTIME            | PTIME              | coNP-complete | coNP-complete | coNP-complete | NP-complete            | Π₂ᴾ-complete
Fixed-depth | coNP-complete    | coNP-complete      | coNP-complete | coNP-complete | coNP-complete | Π₂ᴾ-complete           | Π₂ᴾ-complete
General     | in coNEXPTIME
Roadmap
Introduction and problem definition
Containment of a subset of XML queries
Query containment is decidable
Query containment in practice
Relaxing the assumptions
Conclusions
Fanout \ Depth | Fixed         | Arbitrary
= 1            | PTIME         | PTIME
Arbitrary      | coNP-complete | in coNEXPTIME
Deciding Q ⊆ Q’?
How to find a property for an infinite number of input XML instances
Standard technique
  Find a finite set of input representatives – Canonical Databases
  Relational query: each canonical database is a minimal input to generate the answer template
XML query answers have an infinite number of shapes
  Find a finite set of answer templates – Canonical Answers
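For the relational case mentioned above, the canonical-database test for conjunctive queries [Chandra and Merlin, STOC 1977] can be sketched as follows: freeze the variables of Q into constants, treat the frozen body as a database, and check whether Q’ returns the frozen head on it. The query encoding (variables as strings starting with `?`) and the names `evaluate` and `contained_in` are assumptions chosen for this sketch.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def evaluate(query, db):
    """Evaluate a conjunctive query (head, body) over db: relation -> set of tuples."""
    head, body = query
    answers = set()

    def match(term, value, binding):
        # Bind a variable on first use; afterwards it must agree with the value.
        if is_var(term):
            return binding.setdefault(term, value) == value
        return term == value

    def extend(i, binding):
        if i == len(body):
            answers.add(tuple(binding[t] if is_var(t) else t for t in head))
            return
        relation, terms = body[i]
        for row in db.get(relation, ()):
            if len(row) != len(terms):
                continue
            new_binding = dict(binding)
            if all(match(t, v, new_binding) for t, v in zip(terms, row)):
                extend(i + 1, new_binding)

    extend(0, {})
    return answers

def contained_in(q1, q2):
    """True if q1 is contained in q2, using the canonical database of q1."""
    head1, body1 = q1
    freeze = lambda t: ("frozen", t) if is_var(t) else t
    canonical_db = {}
    for relation, terms in body1:
        canonical_db.setdefault(relation, set()).add(tuple(freeze(t) for t in terms))
    return tuple(freeze(t) for t in head1) in evaluate(q2, canonical_db)

# Q1(x) :- E(x, y), E(y, z)      Q2(x) :- E(x, y)
q1 = (("?x",), [("E", ("?x", "?y")), ("E", ("?y", "?z"))])
q2 = (("?x",), [("E", ("?x", "?y"))])
print(contained_in(q1, q2))  # True
print(contained_in(q2, q1))  # False
```

A single canonical database suffices here; the slides point out that XML answers have infinitely many shapes, which is why a set of canonical answers is used instead.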
Answer Shapes Determined by the Head Tree
Q’:
for $x in /project return
<group>{
for $y in /project/member return
<name>{where $y=“Alice”
return <Alice/>
where $y=“Bob”
return <Bob/>
}</name>
}</group>
[Figure: the head tree of Q’ (group, with child name, with children Alice and Bob) and the candidate answer shapes drawn from it: group; group with name; group with name and Alice; group with name and Bob]
An Additional Candidate Answer
[Figure: the additional candidate answer (a group node with two name children, holding Alice and Bob respectively), shown next to the earlier candidate answers and the head tree]
Why Consider the Additional Case
[Figure: the instance D (two project nodes, each with one member child, Alice and Bob) with Q(D) (two group nodes, each with one name child) and Q’(D) (a group node with two name children, Alice and Bob)]
What can Serve as Canonical Answers?
Prefix subtrees of the head tree?
  – necessary but not sufficient
Trees contained in the head tree?
  – necessary and sufficient
  – but, too many and too complex
A Head Tree can Have Many Trees Contained in it
[Figure: the head tree (group, with child name, with children Alice and Bob) and several larger trees contained in it, built from repeated group, name, Alice and Bob nodes]
What can Serve as Canonical Answers?
Prefix subtrees of the head tree?
  – necessary but not sufficient
Trees contained in the head tree?
  – necessary and sufficient
  – but, too many and too complex
Our solution: consider only minimal trees that are contained in the head tree
Canonical Answer
A minimal XML instance: no two sibling subtrees where one is contained in the other
Canonical Answer: a minimal XML instance contained in the head tree
Every answer A of query Q corresponds to a unique canonical answer CA, s.t. A ⊆ CA, CA ⊆ A
[Figure: an example answer with redundant name subtrees, the head tree, and the corresponding canonical answer]
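The minimality condition can be illustrated with a small sketch that drops every sibling subtree contained in another sibling. It only covers minimality, not the additional requirement that a canonical answer be contained in the head tree. The names `embeds`, `is_minimal` and `minimize`, and the tuple encoding of trees, are assumptions made for illustration.

```python
# Trees are (tag, children) pairs.
def embeds(a, b):
    """True if tree a maps homomorphically into tree b, root to root."""
    return a[0] == b[0] and all(any(embeds(ka, kb) for kb in b[1]) for ka in a[1])

def is_minimal(tree):
    """No two sibling subtrees where one is contained in the other, at any level."""
    _, kids = tree
    redundant = any(
        i != j and embeds(a, b)
        for i, a in enumerate(kids)
        for j, b in enumerate(kids)
    )
    return not redundant and all(is_minimal(k) for k in kids)

def minimize(tree):
    """Drop sibling subtrees contained in another sibling, bottom-up."""
    tag, kids = tree
    kids = [minimize(k) for k in kids]
    kept = []
    for i, kid in enumerate(kids):
        # For equivalent siblings, keep only the earliest one.
        dominated = any(
            j != i and embeds(kid, other) and (not embeds(other, kid) or j < i)
            for j, other in enumerate(kids)
        )
        if not dominated:
            kept.append(kid)
    return (tag, kept)

answer = ("group", [
    ("name", [("Alice", []), ("Bob", [])]),
    ("name", [("Alice", [])]),   # contained in the first name subtree
])
reduced = minimize(answer)
print(reduced)
print(embeds(answer, reduced) and embeds(reduced, answer))  # True
```

The reduced tree and the original answer embed into each other, which matches the statement A ⊆ CA, CA ⊆ A.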
Canonical Database
Canonical Database: DB_CA
The minimal XML instance to generate CA
[Figure: a canonical answer CA (a group node with two name children, Alice and Bob) and its canonical database DB (project and member nodes holding Alice and Bob)]
for $x in /project return
<group>{
for $y in /project/member return
<name>{
where $y=“Alice”
return <Alice/>
where $y=“Bob”
return <Bob/>
}</name>
}</group>
Sound and Complete Conditions for Nested Query Containment
Theorem 1. Q ⊆ Q’ if and only if, for every canonical database DB of Q, Q(DB) ⊆ Q’(DB)
Theorem 2. Q ⊆ Q’ if and only if, for every canonical answer CA of Q,
  CA is a canonical answer of Q’
  DB’_CA ⊆ DB_CA
Query Containment Algorithm
Algorithm:
for every canonical answer CA of Q do
1. check whether CA is a canonical answer of Q’
2. generate DB_CA and DB’_CA
3. check DB’_CA ⊆ DB_CA
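The control flow of the algorithm can be sketched as a loop with the three steps as pluggable checks. This is only a skeleton: the four callables (`canonical_answers`, `is_canonical_answer`, `canonical_db`, `instance_contained`) are hypothetical placeholders for procedures the slides do not spell out, and the direction of the check in step 3 (DB’_CA contained in DB_CA) is an assumption.

```python
def query_contained(q, q_prime, canonical_answers, is_canonical_answer,
                    canonical_db, instance_contained):
    """Skeleton of the containment test; all four helpers are supplied by the caller."""
    for ca in canonical_answers(q):
        # Step 1: CA must also be a canonical answer of Q'.
        if not is_canonical_answer(ca, q_prime):
            return False
        # Step 2: build the canonical databases of CA under Q and under Q'.
        db = canonical_db(q, ca)
        db_prime = canonical_db(q_prime, ca)
        # Step 3: compare the two canonical databases (direction assumed).
        if not instance_contained(db_prime, db):
            return False
    return True

# Toy stand-ins, only to exercise the control flow: a "query" is a dict mapping
# each canonical answer name to a set that plays the role of its canonical database.
toy_q = {"ca1": {"a", "b"}, "ca2": {"c"}}
toy_q_prime = {"ca1": {"a"}, "ca2": {"c"}, "ca3": {"d"}}

result = query_contained(
    toy_q, toy_q_prime,
    canonical_answers=lambda q: list(q),
    is_canonical_answer=lambda ca, q: ca in q,
    canonical_db=lambda q, ca: q[ca],
    instance_contained=lambda small, big: small <= big,
)
print(result)
```

The running time of such a loop depends on how many canonical answers there are and how large they get.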
Roadmap
Introduction and problem definition
Containment of a subset of XML queries
Query containment is decidable
Query containment in practice
Relaxing the assumptions
Conclusions
Fanout \ Depth | Fixed | Arbitrary
= 1            | ?     | ?
Arbitrary      | ?     | ?
Query Containment Algorithm
Algorithm:
for every canonical answer CA of Q do
1. check whether CA is a canonical answer of Q’
2. generate DB_CA and DB’_CA
3. check DB’_CA ⊆ DB_CA
Polynomial in the size and number of canonical answers
What are the sizes of canonical answers?
What is the number of canonical answers?
Containment of XML Queries with Fanout 1
E.g. d=3 – the depth; m=1 – the maximum fanout
Canonical Answers and Complexity
Number: the depth of the query
Size: bounded by the depth of the query
Complexity: O(d·|Q|·|Q’|)
Theorem: Testing containment of XML Queries with fanout 1 is in PTIME
for $x in /project return
<group>{
for $y in /project/member return
<name>{
where $y =“Alice” return <Alice/>
}</name>
}</group>
[Figure: the three canonical answers of this query: group; group with child name; group with child name with child Alice]
Nesting with fanout 1 does not increase complexity
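The count for the fanout-1 case can be made concrete with a tiny sketch. It assumes, as the figure suggests, that the canonical answers of a fanout-1 query are the root-anchored prefixes of its single-path head tree; the function name `prefix_chains` is made up for illustration.

```python
def prefix_chains(head_path):
    """All root-anchored prefixes of a single-path head tree, shortest first."""
    return [head_path[:k] for k in range(1, len(head_path) + 1)]

# Head tree of the example query: group -> name -> Alice (depth d = 3).
for chain in prefix_chains(["group", "name", "Alice"]):
    print("/".join(chain))
```

Under that assumption a query of depth d has d canonical answers, each of size at most d.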
Roadmap
Introduction and problem definition
Containment of a subset of XML queries
Query containment is decidable
Query containment in practice
Relaxing the assumptions
Conclusions
Fanout \ Depth | Fixed | Arbitrary
= 1            | PTIME | PTIME
Arbitrary      | ?     | ?
Containment of XML Queries with Arbitrary Fanout
E.g. d=4 – the depth; m=3 – the maximum fanout
Canonical Answers Complexity
Number: (formula not preserved in this transcript)
Size: (formula not preserved in this transcript)
Theorem: Testing containment of XML Queries with depth 2 and arbitrary fanout is coNP-hard
[Figure: canonical answer trees with nodes labeled 1, 2 and 3; surviving formula fragments: d-1, d-2, d-1]
Roadmap
Introduction and problem definition
Containment of a subset of XML queries
Query containment is decidable
Query containment in practice
Relaxing the assumptions
Conclusions
Fanout \ Depth | Fixed     | Arbitrary
= 1            | PTIME     | PTIME
Arbitrary      | coNP hard | coNP hard
(annotation on the slide: NOT TIGHT)
Effect of the Depth on Containment of XML Queries
Insight: Kernel Canonical Answer
  The root node has a single child
  In any subtree, a path pattern is repeated no more than cd times
    d – query depth
    c – #(maximum path steps in a query block)
The size of kernel canonical answers
  Polynomial in the query size
  Exponential in the query depth
Theorem:
  Testing containment of XML queries with fixed depth is coNP-complete
  Testing containment of XML queries with arbitrary depth is in coNEXPTIME
Roadmap
Introduction and problem definition
Containment of a subset of XML queries
Query containment is decidable
Query containment in practice
Relaxing the assumptions
Conclusions
Fanout \ Depth | Fixed         | Arbitrary
= 1            | PTIME         | PTIME
Arbitrary      | coNP-complete | in coNEXPTIME
Containment Checking in Practice
Q:
for $g in /group
where $g/gname/text() = “database”
return
<area>{
for $p in $g/person return
<person>
<name>{$p/text()}</name>
{for $q in $g/paper
where $q/author/text() = $p/text() return
<paper>{$q/title/text()}</paper>}
</person>
}</area>
Q’:
for $g in /group return
<area>{
for $p in $g/person return
<person>
<name>{$p/text()}</name>
<group>{$g/gname/text()}</group>
{for $q in $g/paper
where $q/author/text() = $p/text() return
<paper>{$q/title/text()}</paper>}
</person>
}</area>
Analyze element cardinality to reduce the number of canonical answers for containment checking
#canonical answers – originally: 71; after analysis: 2
Roadmap
Introduction and problem definition
Containment of a subset of XML queries
Query containment is decidable
Query containment in practice
Relaxing the assumptions
Conclusions
Fanout \ Depth | Fixed         | Arbitrary
= 1            | PTIME         | PTIME
Arbitrary      | coNP-complete | in coNEXPTIME
An Example Query that Returns Tag Variables
for $x in dbGrp return
<result>{
for $y in $x/proj return
<group>{
for $u in $y/member return <name> $u/text() </name>
for $v in $y/paper return <pub> $v/text() </pub>
}</group>
}</result>
Deciding Query Containment
Leverage previous results – simulation mapping [Levy and Suciu, PODS’97]
Check query simulation mapping for every canonical answer
Complexity
Simulation mapping can be checked in polynomial time in terms of query size
Complexity of checking containment does not rise
Other Extensions
Query Type  | No tag variables | With tag variables | With unions   | With negation | With //       | With equi-join on tags | With arith. comp.
Un-nested   | PTIME            | PTIME              | coNP-complete | coNP-complete | coNP-complete | NP-complete            | Π₂ᴾ-complete
Fan-out = 1 | PTIME            | PTIME              | coNP-complete | coNP-complete | coNP-complete | NP-complete            | Π₂ᴾ-complete
Fixed-depth | coNP-complete    | coNP-complete      | coNP-complete | coNP-complete | coNP-complete | Π₂ᴾ-complete           | Π₂ᴾ-complete
General     | in coNEXPTIME
Conclusions
Contributions
  A sound and complete condition for containment of nested XML queries
  Detailed complexity analysis
Future work
  Fill in the open gap of complexity in case of queries with arbitrary fanout and arbitrary nesting depth
  Evaluate and optimize the containment algorithm with element cardinality analysis
  Answering nested XML queries using views