1
From tree patterns to generalized tree patterns: On efficient evaluation of XQuery
Z.M. Chen, H.V. Jagadish, L.V.S. Lakshmanan, S. Paparizos
(VLDB 2003)
Fatih Gön 2002701366
Mehmet Şenvar 2003700221
Bogazici University Department of Computer Engineering
2
Overview
Motivation: Current approaches to XQuery evaluation are not efficient. A concise XQuery model is needed as the basis for generating an efficient physical evaluation plan.
Main contributions:
• Generalized Tree Pattern (GTP) query model
• Algorithm translating function-free XQuery to GTP
• Physical algebra and algorithm translating GTP to a physical plan
• Schema-aware optimization of GTP and physical plan
3
Motivation
Current approaches
Navigational plan (NAV): traverses down the path by recursively getting all child nodes and filtering out the unwanted ones before the next iteration.
Baseline plan (BASE): uses a TAX operator, which takes a tree pattern and a sequence of trees as input. Some tree patterns may be evaluated repeatedly.
Our approach
Generalized Tree Pattern (GTP): uses GTP as the XQuery model to generate an efficient evaluation plan.
4
Tree pattern query
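[Figure: two tree pattern queries, each consisting of a tree T and a Boolean formula F]
(a) Tree T: root $p with children $l and $s; $l has child $g.
Boolean formula F: $p.tag = person & $s.tag = state & $l.tag = profile & $g.tag = age & $g.content > 25 & $s.content != ‘MI’
(b) Tree T: root $p with child $w; $w has child $t.
Boolean formula F: $p.tag = person & $w.tag = watches & $t.tag = watch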
5
Generalized tree pattern (GTP)
FOR $p IN document(“auction.xml”)//person, $l IN $p/profile
WHERE $l/age > 25 AND $p//state != ‘MI’
RETURN <result> {$p//watches/watch} {$l/interest} </result>
(a) An XQuery example
[Figure: GTP for the query above. Nodes $p, $l, $s, $g carry group number (0); $w and $t carry (1); $i carries (2).]
Boolean formula F: $p.tag = person & $s.tag = state & $l.tag = profile & $i.tag = interest & $w.tag = watches & $t.tag = watch & $g.tag = age & $g.content > 25 & $s.content != ‘MI’
(b) Generalized tree pattern
6
Generalized tree pattern (GTP)
GTP: a pair G=(T,F), where T is a tree and F is a boolean formula.
• Each node of T is labeled by a distinct variable and has an associated group number.
• Each edge of T has a pair of associated labels <x,m>, where x specifies the axis (pc or ad) and m specifies the edge status (mandatory or optional).
• F is a boolean combination of predicates applicable to nodes.
Group: each maximal set of nodes in a GTP connected to each other by paths not involving optional edges. By convention, group 0 includes the GTP root.
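The group definition can be made concrete with a small sketch. The edge-list encoding and the name `assign_groups` are illustrative assumptions, not notation from the paper. The edges follow the example query, assuming the two RETURN-clause branches ($w and $i) hang off optional edges.

```python
from collections import defaultdict

# Hypothetical encoding of the example GTP: (parent, child, axis, status).
EDGES = [
    ("$p", "$l", "pc", "mandatory"),
    ("$p", "$s", "ad", "mandatory"),
    ("$l", "$g", "pc", "mandatory"),
    ("$p", "$w", "ad", "optional"),
    ("$w", "$t", "pc", "mandatory"),
    ("$l", "$i", "pc", "optional"),
]


def assign_groups(root, edges):
    """Return {node: group}. Nodes linked by mandatory edges share a group;
    every optional edge starts a new group. The root is in group 0."""
    children = defaultdict(list)
    for parent, child, _axis, status in edges:
        children[parent].append((child, status))

    groups = {root: 0}
    next_group = 1
    stack = [root]
    while stack:
        node = stack.pop()
        for child, status in children[node]:
            if status == "mandatory":
                groups[child] = groups[node]
            else:
                groups[child] = next_group
                next_group += 1
            stack.append(child)
    return groups


groups = assign_groups("$p", EDGES)
```

Because T is a tree, a node's group is fully determined by the status of the edge to its parent, so one traversal is enough.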
7
Pattern match (Formal Description)
A pattern match of G into a collection of trees C is a partial mapping h: G → C such that:
• h is defined on all group 0 nodes.
• If h is defined on a node in a group, then it is necessarily defined on all nodes in that group.
• h preserves the structural relationships in G.
• h satisfies the boolean formula F.
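The first two conditions on h (total on group 0, all-or-nothing on every other group) can be checked mechanically. The sketch below is only an illustration: `respects_groups`, the `GROUPS` table and the data-node names such as "n1" are hypothetical, and the structural and formula conditions are deliberately left out.

```python
# Hypothetical group table for the example GTP: pattern node -> group number.
GROUPS = {"$p": 0, "$l": 0, "$s": 0, "$g": 0, "$w": 1, "$t": 1, "$i": 2}


def respects_groups(h, groups):
    """Check that the partial mapping h (pattern node -> data node) is defined
    on all of group 0 and on either all or none of each other group."""
    members = {}
    for node, group in groups.items():
        members.setdefault(group, set()).add(node)

    for group, nodes in members.items():
        matched = nodes & set(h)
        if group == 0 and matched != nodes:
            return False  # group 0 must be matched completely
        if matched and matched != nodes:
            return False  # a group is matched as a whole or not at all
    return True


full_core = {"$p": "n1", "$l": "n2", "$s": "n3", "$g": "n4"}
with_watch = dict(full_core, **{"$w": "n5", "$t": "n6"})
half_group = dict(full_core, **{"$w": "n5"})
```

Here `full_core` and `with_watch` are acceptable domains for h, while `half_group` is not, because it matches $w without $t.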
8
Pattern match
A pattern match is a mapping from the pattern nodes to nodes in an XML database such that the formula associated with the pattern, as well as the structural relationships among pattern nodes, are satisfied.
9
Universal GTP
A universal GTP is a GTP G=(T,F) in which some solid edges may be labeled ‘EVERY’.
The ‘SOME’ quantifier is already handled.
E.g.
FOR $o IN document(“auction.xml”)//open_auction
WHERE EVERY $b IN $o/bidder SATISFIES $b/increase > 100
RETURN <result> {$o} </result>
[Figure: universal GTP with root $o (group 0), node $b (group 1) reached through an edge labeled EVERY, and node $i (group 2) below $b]
F_L: pc($o,$b) & $b.tag = bidder
F_R: pc($b,$i) & $i.tag = increase & $i.content > 100
∀$b: [F_L → ∃$i: (F_R)]
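The EVERY condition reads: each bidder $b under $o must have some increase $i whose content exceeds 100. The sketch below evaluates that reading navigationally with Python's standard `xml.etree.ElementTree`. The sample document, its `id` attributes and the function name are made up for illustration, and this is not the GTP evaluation plan itself.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; the id attributes exist only to name the results.
DOC = """
<site>
  <open_auction id="a1">
    <bidder><increase>150</increase></bidder>
    <bidder><increase>200</increase></bidder>
  </open_auction>
  <open_auction id="a2">
    <bidder><increase>150</increase></bidder>
    <bidder><increase>50</increase></bidder>
  </open_auction>
  <open_auction id="a3"/>
</site>
"""


def every_bidder_over(auction, threshold):
    """True if every bidder child has some increase child above threshold.
    An auction without bidders satisfies the condition vacuously."""
    return all(
        any(float(i.text) > threshold for i in b.findall("increase"))
        for b in auction.findall("bidder")
    )


root = ET.fromstring(DOC)
selected = [o.get("id") for o in root.iter("open_auction")
            if every_bidder_over(o, 100)]
```

Auction a2 is rejected because one bidder has an increase of 50, while a3 passes because it has no bidders at all.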
10
Grammar for XQuery Fragment
Function-free XQuery is captured by the following grammar:
FLWR ::= ForClause LetClause WhereClause ReturnClause.
ForClause ::= FOR $fv1 IN E1, … , $fvn IN En.
LetClause ::= LET $lv1 := E1, … , $lvn := En.
ReturnClause ::= RETURN {E1} … {En}.
Ei ::= FLWR | XPATH.
WhereClause ::= WHERE (E1, … , En).
11
Algorithm GTP
Input: a FLWR expression Exp, a context group number g
Output: a GTP or GTPs with a join formula

if (g’s last level != 0) let g = g + “.0”;
foreach (“For $fv in E”) do parse(E, g);
let ng = g;
foreach (“Let $lv := E”) do {
    let ng = ng + 1;
    parse(E, ng);
}
foreach predicate p in WHERE do {
    if (p is “every El satisfies Er”) {
        let ng = ng + 1; parse(El, ng);
        let F_L be the formula associated with the pattern resulting from El;
        let ng = ng + 1; parse(Er, ng);
        let F_R be the formula associated with the pattern resulting from Er;
    } else {
        foreach Ei as p’s argument do parse(Ei, g);
    }
}
foreach “{Ei}” do {
    let ng = ng + 1;
    parse(Ei, ng);
}

Procedure parse
Input: FLWR expression or XPath expression E, context group number g
Output: part of GTP resulting from E
if (E is FLWR expression) GTP(E, g);
else buildTPQ(E);
end procedure
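The group-numbering part of the algorithm can be illustrated on a flat query. The sketch below is a heavy simplification under stated assumptions: it uses plain integer group numbers instead of the dotted numbering, ignores nested FLWR expressions and EVERY predicates, and treats each expression as an opaque path string. The name `assign_group_numbers` is hypothetical.

```python
def assign_group_numbers(for_exprs, let_exprs, where_exprs, return_exprs, g=0):
    """Return a list of (expression, group number) pairs.
    FOR and WHERE expressions stay in the context group g;
    every LET and every RETURN argument opens a new group."""
    assigned = [(expr, g) for expr in for_exprs]
    ng = g
    for expr in let_exprs:
        ng += 1
        assigned.append((expr, ng))
    for expr in where_exprs:
        assigned.append((expr, g))
    for expr in return_exprs:
        ng += 1
        assigned.append((expr, ng))
    return assigned


# The example query, split by clause.
numbering = assign_group_numbers(
    for_exprs=["//person", "$p/profile"],
    let_exprs=[],
    where_exprs=["$l/age", "$p//state"],
    return_exprs=["$p//watches/watch", "$l/interest"],
)
```

For the example query this reproduces the group numbers of the GTP: everything from FOR and WHERE lands in group 0, and the two return arguments get groups 1 and 2.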
12
Algorithm GTPInput: a FLWR expression Exp, a context group number g
Output: a GTP or GTPs with a join formula
The GTP can be informally understood as follows:
1) Find matches for all nodes connected to the root by only solid edges.
2) Next, find matches to the remaining nodes (whose path to the GTP root involves one or more dotted edges), if they exist.
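The two-step reading can be sketched on the example query with Python's standard `xml.etree.ElementTree`. This is a navigational illustration of the semantics only, not the physical plan. The sample document, the `id` attributes and the function name `evaluate` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample data for the person/profile query.
DOC = """
<site>
  <person id="p1">
    <address><state>NY</state></address>
    <profile><age>30</age><interest>music</interest></profile>
    <watches><watch>w1</watch><watch>w2</watch></watches>
  </person>
  <person id="p2">
    <address><state>MI</state></address>
    <profile><age>40</age></profile>
  </person>
  <person id="p3">
    <address><state>CA</state></address>
    <profile><age>28</age></profile>
  </person>
</site>
"""


def evaluate(root):
    results = []
    for p in root.iter("person"):
        # Step 1: nodes reached by solid edges only ($p, $s, $l, $g).
        if not any(s.text != "MI" for s in p.iter("state")):
            continue
        for l in p.findall("profile"):
            if not any(float(g.text) > 25 for g in l.findall("age")):
                continue
            # Step 2: nodes behind dotted edges; empty lists are allowed.
            watches = [t.text for w in p.iter("watches")
                       for t in w.findall("watch")]
            interests = [i.text for i in l.findall("interest")]
            results.append((p.get("id"), watches, interests))
    return results


results = evaluate(ET.fromstring(DOC))
```

Person p3 still appears in the result even though it has no watches and no interest, which is exactly what the optional edges express.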
13
Translating GTP Into an Evaluation Plan
• Avoid repeated matching of similar tree patterns
• Postpone the materialization of nodes as much as possible
• Operators and methods are available in any XML database system
14
Physical algebra
Index Scan ISp(S): outputs each node satisfying the predicate p, using an index over the input trees S.
Filter Fp(S): outputs only those trees of S that satisfy the predicate p. Order is preserved.
Sort Sb(S): sorts the input sequence of trees S on the sorting basis b.
Value Join Jp(S1,S2): a value-based comparison of the two input sequences of trees via the join predicate p. Output order follows the order of the left input S1.
Structural Join SJr(S1,S2): the input tree sequences S1 and S2 must be sorted on node id. The operator joins S1 and S2 on the structural relationship r between each pair. Output is sorted on S1 or S2 as needed. Variants: Outer Structural Join (OSJ), where all of S1 is included in the output, and Structural Semi-Join (SSJ), where only S1 is retained in the output.
Group By Gb(S): the input must be sorted on the grouping basis b. Groups trees by the grouping basis b.
Merge M(S1,…,Sn): the Sj are assumed to have the same cardinality k. For each 1 <= i <= k, merges tree i from each input under an artificial root to produce an output tree. Order is preserved.
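The three structural join variants differ only in what they emit. The sketch below assumes each node id is a (start, end) interval so that ancestry is an interval containment test; the slides only say "node id", so this encoding, the `Node` type and the function names are assumptions. It uses a quadratic nested loop for clarity rather than a merge over sorted inputs, and covers only the ancestor-descendant relationship.

```python
from typing import NamedTuple


class Node(NamedTuple):
    tag: str
    start: int  # assumed interval encoding of the node id
    end: int


def is_ancestor(a, d):
    """a is an ancestor of d if a's interval strictly contains d's."""
    return a.start < d.start and d.end < a.end


def structural_join(s1, s2):
    """SJ: all (ancestor, descendant) pairs, in S1 order."""
    return [(a, d) for a in s1 for d in s2 if is_ancestor(a, d)]


def structural_semi_join(s1, s2):
    """SSJ: only the S1 nodes that have at least one match in S2."""
    return [a for a in s1 if any(is_ancestor(a, d) for d in s2)]


def outer_structural_join(s1, s2):
    """OSJ: like SJ, but unmatched S1 nodes are kept, paired with None."""
    out = []
    for a in s1:
        matches = [d for d in s2 if is_ancestor(a, d)]
        if matches:
            out.extend((a, d) for d in matches)
        else:
            out.append((a, None))
    return out


persons = [Node("person", 1, 10), Node("person", 11, 14)]
profiles = [Node("profile", 2, 5)]
```

With these inputs the second person has no profile, so it disappears from SJ and SSJ but survives in OSJ with an empty partner.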
15
Translating GTP to Physical Plans
• Evaluation Algorithm
• Plan is a DAG where each node is a physical operator or input document
• Helper functions used: findOrder(SJs, $n), getGroupBasis(g), getGroupEvalOrder(G)
16
Stages of Evaluation Algorithm (7 steps)
1. Compute structural joins
2. Filter based on predicates depending on contents of more than 2 pattern nodes
3. Compute value joins
4. Compute aggregation
5. Filter based on predicates depending on aggregate values (if needed)
6. Compute value joins based on aggregate values (if needed)
7. Group return arguments (if any)
17
Physical plan from the GTP
[Figure: physical plan DAG for the example GTP. Tag index scans (IS) and filters (content != ‘MI’ on state, content > 25 on age) feed structural joins labeled person//state, profile/age, person/profile, person/watches, watches/watch and profile/interest. Sort and group-by operators are labeled person, profile. The branches for RETURN ARGUMENT #1 and RETURN ARGUMENT #2 are combined by a merge (M) at the root.]
Legend: F: filter; IS: tag index scan; SSJ: structural semi-join; SJ: structural join; OSJ: outer structural join; S: sort; G: group by; M: merge
18
Schema-Aware Optimization
• Logical Optimization
- simplify GTP by eliminating nodes using DTD or XML schema
• Physical Optimization
- eliminate duplicate operators (e.g. sorting, duplicate elimination)
19
Schema-Aware Optimization
Internal node elimination
a//b//c → a//c,
if the schema implies every path from a to c passes through b.
a/b/c → a//c?
[Figure: GTP with nodes $a, $b, $c simplified to a GTP with nodes $a, $c]
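The condition "every path from a to c passes through b" can be tested on a schema graph: c must be reachable from a, and must stop being reachable once b is removed. The sketch below assumes the schema is given as a dictionary from a tag to its allowed child tags; that representation and both function names are illustrative, not taken from the paper.

```python
def reachable(schema, src, dst, banned=None):
    """True if dst is a possible descendant of src without passing
    through the banned tag."""
    seen = set()
    stack = [src]
    while stack:
        tag = stack.pop()
        for child in schema.get(tag, ()):
            if child == banned or child in seen:
                continue
            if child == dst:
                return True
            seen.add(child)
            stack.append(child)
    return False


def can_drop_internal(schema, a, b, c):
    """a//b//c may be rewritten to a//c if every a-to-c path goes through b."""
    return reachable(schema, a, c) and not reachable(schema, a, c, banned=b)


# Hypothetical schemas: tag -> set of allowed child tags.
SCHEMA_OK = {"a": {"b"}, "b": {"c", "d"}, "d": {"c"}}
SCHEMA_BAD = {"a": {"b", "d"}, "b": {"c"}, "d": {"c"}}
```

In `SCHEMA_BAD` there is a path a/d/c that avoids b, so the internal node cannot be dropped.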
20
Schema-Aware Optimization
Identifying two nodes with same tag
FOR $b IN …//book
WHERE $b/title = ‘DB’
RETURN <x> {$b/title} {$b/year} </x>
[Figure: GTP with root $b and nodes $t, $t2, $y, simplified to root $b with nodes $t, $y]
$t2 can be eliminated
if the schema says every book has at most one title child.
21
Schema-Aware Optimization
Eliminate redundant leaves
FOR $a IN …./a[b]
RETURN {$a/c}
[Figure: GTP with root $a and nodes $c, $b, simplified to root $a with node $c]
$b can be eliminated
if the schema implies every a has at least one b.
22
Schema-Aware Optimization
Elimination of sorting
[Figure: structural join (SJ) person/profile over the inputs person and profile. Example data: person “p1” has children “p2” and “l2”; person “p2” has child “l1”. Join output: {p1 – l2, p2 – l1}. Not both in order!]
Given two sorted inputs, the output will be in either person order or profile order, but not both in general.
However, if the schema implies that no person can have person descendants, the output of the structural join ordered by person node id will also be in profile node id order.
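The ordering argument can be checked on the slide's example. The sketch below assigns each node an assumed document-order position (p1 contains p2 and l2, p2 contains l1) and compares it with a flat layout where persons do not nest. The position tables and function names are illustrative assumptions.

```python
def is_sorted(values):
    return all(x <= y for x, y in zip(values, values[1:]))


def output_orders(pairs, pos):
    """Sort join output by person position, then report whether the
    person column and the profile column are each in document order."""
    by_person = sorted(pairs, key=lambda pair: pos[pair[0]])
    person_ok = is_sorted([pos[p] for p, _ in by_person])
    profile_ok = is_sorted([pos[l] for _, l in by_person])
    return person_ok, profile_ok


# Nested persons: document order is p1, p2, l1, l2.
NESTED_POS = {"p1": 1, "p2": 2, "l1": 3, "l2": 4}
NESTED_PAIRS = [("p1", "l2"), ("p2", "l1")]

# No nesting: document order is p1, l1, p2, l2.
FLAT_POS = {"p1": 1, "l1": 2, "p2": 3, "l2": 4}
FLAT_PAIRS = [("p1", "l1"), ("p2", "l2")]
```

In the nested case the person-ordered output lists l2 before l1, so a further sort on profile would be needed; in the flat case one order implies the other.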
23
Schema-Aware Optimization
Elimination of group-by
{$l/interest}
We must group the return argument results for the FOR variable in general.
However, if schema implies each profile has at most one interest subelement, then grouping on interest can be eliminated.
24
Schema-Aware Optimization
Elimination of duplicate elimination
If the schema implies that watches cannot have watches descendants, the duplicate elimination is unnecessary.
[Figure: $p//watches//watch as a structural join of watches and watch. Example data: watches “ws1” contains watch “w1” and a nested watches “ws2”, which contains watch “w2”. Matches: ws1: {w1, w2}, ws2: {w2}.]
$p//watches/watch?
Note: 1. t cannot have t descendants
2. A can only have one child B
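The duplicate in the example is easy to reproduce. The sketch below builds the nested watches structure with Python's standard `xml.etree.ElementTree` and pairs every watches element with its watch descendants; the `id` attributes and the sample document are assumptions that mirror the names on the slide.

```python
import xml.etree.ElementTree as ET

# Made-up document mirroring the slide: ws2 is nested inside ws1.
DOC = (
    "<person>"
    "<watches id='ws1'>"
    "<watch id='w1'/>"
    "<watches id='ws2'><watch id='w2'/></watches>"
    "</watches>"
    "</person>"
)

root = ET.fromstring(DOC)

# Pair each watches element with every watch descendant (the // axis).
pairs = [(ws.get("id"), w.get("id"))
         for ws in root.iter("watches")
         for w in ws.iter("watch")]

watch_ids = [w for _, w in pairs]
# Order-preserving duplicate elimination.
distinct_ids = list(dict.fromkeys(watch_ids))
```

Because w2 sits below both ws1 and ws2 it is produced twice, which is why a duplicate elimination step is needed unless the schema rules out nested watches.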
25
GTP Simplification
• Algorithm: pruneGTP(G) simplifies a GTP based on child/descendant constraints and avoidance constraints
• Steps (4)
1. Detect emptiness of (sub)queries
2. Identify nodes with same tag
3. Eliminate redundant leaves
4. Eliminate redundant internal nodes
26
Theorem 1 (Optimality)
Let C : set of child/descendant constraints
Let G : GTP
There is a unique GTP Hmin equivalent to G under C, which has the smallest size among all equivalent GTPs.
The GTP simplification algorithm correctly simplifies G to Hmin in polynomial time.
27
Experiments
• TIMBER native XML database
• XMark-generated documents
• P-III 866 MHz
• Windows 2000 Professional
• TIMBER had a 100 MB buffer pool
• 5 executions; the maximum and minimum are discarded and the rest averaged
• 479 MB XML document
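The timing protocol (five runs, drop the largest and smallest, average the rest) amounts to a trimmed mean. The sketch below shows the arithmetic; the function name and the sample timings are made up and are not measurements from the experiments.

```python
def trimmed_mean(timings):
    """Average the timings after discarding one maximum and one minimum."""
    if len(timings) < 3:
        raise ValueError("need at least three timings")
    kept = sorted(timings)[1:-1]
    return sum(kept) / len(kept)


# Hypothetical timings in seconds for five runs of one query.
sample_runs = [4.0, 2.0, 10.0, 3.0, 5.0]
average = trimmed_mean(sample_runs)
```

Dropping the extremes keeps a single slow run, such as the 10.0 above, from dominating the reported figure.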
28
Navigational & Base Plans
I. NAV
– Traverses recursively, getting all children of a node and checking a condition or name before the next iteration
– Dependent on path size and the number of children of each node
II. BASE
– Straightforward tree pattern translation approach that uses set-at-a-time processing
– Unlike GTP, does not make use of tree pattern reuse
29
Interesting Cases
• Parameters: path length, number of return arguments, query selectivity, data materialization cost
• GTP outperforms NAV and BASE for every query by one or two orders of magnitude
• All algorithms are affected by path length, NAV the most
• Query selectivity and number of return arguments do not affect all algorithms: NAV performs the same iteration regardless
• Data materialization cost affects both GTP and BASE, but has little effect on NAV
30
CPU Timings
31
Scalability
• Used 24 MB, 47 MB, 239 MB, 479 MB, 2397 MB documents (Factor 1-5). Results:
• GTP scales linearly with size of database
32
Schema-Aware Optimization Results
• In some cases they greatly enhance performance, but very little in others.
• Work well when data materialization is not the dominating cost.
• Beneficial when the path is of the form many/many/many and is converted to many//many.
33
Related Work
• Navigation-based XQuery processing systems : Galax, Natix, Tamino, TIMBER
• No optimization and plan generation systems exist for XQuery on native systems as a whole
• GTP is 3-20 times faster than the TIMBER system
• Research is ongoing on optimizing XPath expressions using TPQs and schema knowledge
34
Summary & Future Work
• A novel structure called GTP is proposed
• GTPs are used as a basis for physical plan generation and query optimization
• Compared GTP with other methods in an extensive set of tests and observed that GTP wins by at least an order of magnitude
• Presented an algorithm for schema-based simplification of GTP
• Evaluation of GTP on relational XML systems as well as native systems
35
Thanks...
Questions ?