from tree patterns to generalized tree patterns: on efficient evaluation of xquery

From tree patterns to generalized tree patterns: On efficient evaluation of XQuery Z.M. Chen, H.V. Jagadish, L.V.S. Lakshmanan, S. Paparizos (VLDB 2003) Fatih Gön 2002701366 Mehmet Şenvar 2003700221 Bogazici University Department of Computer Engineering

Page 1: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Z.M. Chen, H.V. Jagadish, L.V.S. Lakshmanan, S. Paparizos

(VLDB 2003)

Fatih Gön 2002701366

Mehmet Şenvar 2003700221

Bogazici University Department of Computer Engineering

Page 2: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Overview

Motivation: Current approaches to XQuery evaluation are not efficient. A concise XQuery model is needed as the basis for generating an efficient physical evaluation plan.

Main contributions:
• Generalized Tree Pattern (GTP) query model
• Algorithm translating function-free XQuery to GTP
• Physical algebra and algorithm translating GTP to a physical plan
• Schema-aware optimization of GTP and physical plan

Page 3: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Motivation

Current approaches

Navigational plan (NAV): traverses down the path by recursively fetching all child nodes and filtering out unwanted ones before the next iteration.

Baseline plan (BASE): uses the TAX operator, which takes a tree pattern and a sequence of trees as input. Some tree patterns may be evaluated repeatedly.

Our approach

Generalized Tree Pattern (GTP): uses the GTP as the XQuery model to generate an efficient evaluation plan.

Page 4: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Tree pattern query

(a) Tree T: $p has children $s and $l; $l has child $g.

Boolean formula F: $p.tag = person & $s.tag = state & $l.tag = profile & $g.tag = age & $g.content > 25 & $s.content != ‘MI’

(b) Tree T: $p has child $w; $w has child $t.

Boolean formula F: $p.tag = person & $w.tag = watches & $t.tag = watch

Page 5: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Generalized tree pattern (GTP)

FOR $p IN document(“auction.xml”)//person, $l IN $p/profile

WHERE $l/age > 25 AND $p//state != ‘MI’

RETURN <result> {$p//watches/watch} {$l/interest} </result>

(a) An XQuery example

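(b) Generalized tree pattern

Tree T (group numbers in parentheses): $p (0) has children $s (0), $l (0) and $w (1); $l has children $g (0) and $i (2); $w has child $t (1).

Boolean formula F: $p.tag = person & $s.tag = state & $l.tag = profile & $i.tag = interest & $w.tag = watches & $t.tag = watch & $g.tag = age & $g.content > 25 & $s.content != ‘MI’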

Page 6: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Generalized tree pattern (GTP)

GTP: a pair G = (T, F), where T is a tree and F is a boolean formula.
• Each node of T is labeled by a distinct variable and has an associated group number.
• Each edge of T has a pair of associated labels <x, m>, where x specifies the axis (pc or ad) and m specifies the edge status (mandatory or optional).
• F is a boolean combination of predicates applicable to nodes.

Group: each maximal set of nodes in a GTP connected to each other by paths not involving optional edges. By convention, group 0 includes the GTP root.
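The definition above can be sketched as a small data structure. This is an illustrative sketch only: the class names `Edge` and `GTP`, the flat group numbering, and the assumption that edges are listed top-down are choices made here, not taken from the paper (whose algorithm uses hierarchical group numbers). The edge axes follow the example XQuery on the previous slide.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    parent: str
    child: str
    axis: str        # "pc" (parent-child) or "ad" (ancestor-descendant)
    optional: bool   # True for an optional (dotted) edge


@dataclass
class GTP:
    root: str
    edges: list = field(default_factory=list)

    def groups(self):
        """Map each variable to a group index.

        Nodes connected through mandatory edges share a group; each
        optional edge starts a new group. Assumes edges are listed
        parent-before-child (hypothetical convention for this sketch).
        """
        group_of = {self.root: 0}
        next_group = 1
        for e in self.edges:
            if e.optional:
                group_of[e.child] = next_group
                next_group += 1
            else:
                group_of[e.child] = group_of[e.parent]
        return group_of


# The GTP of the example query: $w and $i hang off optional edges.
example = GTP(root="$p", edges=[
    Edge("$p", "$s", "ad", False),
    Edge("$p", "$l", "pc", False),
    Edge("$l", "$g", "pc", False),
    Edge("$p", "$w", "ad", True),
    Edge("$w", "$t", "pc", False),
    Edge("$l", "$i", "pc", True),
])
print(example.groups())
```

With this edge order the computed groups agree with the group numbers shown for the example GTP: $p, $s, $l, $g in group 0, $w and $t in group 1, $i in group 2.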

Page 7: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Pattern match (Formal Description)

A pattern match of G into a collection of trees C is a partial mapping h: G → C such that:
• h is defined on all group 0 nodes.
• If h is defined on a node in a group, then it is necessarily defined on all nodes in that group.
• h preserves the structural relationships in G.
• h satisfies the boolean formula F.
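The first two conditions, which are what make the mapping partial, can be checked mechanically. The sketch below is an assumption-laden illustration: `respects_groups` is a hypothetical helper name, the mapping h is modelled as a plain dict, and the structural and formula conditions are deliberately left out.

```python
def respects_groups(h, group_of):
    """Check the group conditions of a pattern match.

    h        : dict from pattern variables to data nodes (partial mapping)
    group_of : dict from pattern variables to group numbers
    Only the two group conditions are checked; preserving structure and
    satisfying F are outside this sketch.
    """
    by_group = {}
    for var, g in group_of.items():
        by_group.setdefault(g, []).append(var)

    # Condition 1: h is defined on every group 0 node.
    if any(var not in h for var in by_group.get(0, [])):
        return False

    # Condition 2: within a group, h is defined on all nodes or on none.
    for members in by_group.values():
        defined = [var in h for var in members]
        if any(defined) and not all(defined):
            return False
    return True


# Group numbers of the example GTP; data node names are made up.
group_of = {"$p": 0, "$s": 0, "$l": 0, "$g": 0, "$w": 1, "$t": 1, "$i": 2}
core = {"$p": "person1", "$s": "state1", "$l": "profile1", "$g": "age1"}

print(respects_groups(core, group_of))                       # groups 1 and 2 unmatched
print(respects_groups({**core, "$w": "watches1"}, group_of))  # group 1 only half matched
```

A person with no watches still matches through group 0 alone, while a mapping that binds $w without $t is rejected.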

Page 8: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Pattern match

A pattern match is a mapping from the pattern nodes to nodes in an XML database such that the formula associated with the pattern, as well as the structural relationships among pattern nodes, is satisfied.

Page 9: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Universal GTP

A universal GTP is a GTP G = (T, F) in which some solid edges may be labeled ‘EVERY’.

The ‘SOME’ quantifier is already handled.

E.g. FOR $o IN document(“auction.xml”)//open_auction WHERE EVERY $b IN $o/bidder SATISFIES $b/increase > 100 RETURN <result> {$o} </result>

Tree T (group numbers in parentheses): $o (0) has child $b (1) via an edge labeled EVERY; $b has child $i (2).

F_L: pc($o,$b) & $b.tag = bidder
F_R: pc($b,$i) & $i.tag = increase & $i.content > 100

∀ $b: [F_L → ∃ $i: (F_R)]
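The quantified condition on the EVERY edge can be read as nested all/any over the data. The following sketch uses made-up auction data and a hypothetical helper `satisfies_every`; it only illustrates the semantics of the formula, not how a GTP engine evaluates it.

```python
# Hypothetical data: each open_auction has bidders, each bidder has
# zero or more increase values.
auctions = [
    {"id": "o1", "bidders": [{"increase": [150]}, {"increase": [120, 80]}]},
    {"id": "o2", "bidders": [{"increase": [150]}, {"increase": [90]}]},
    {"id": "o3", "bidders": []},
]


def satisfies_every(auction):
    # For every bidder $b (F_L) there exists an increase $i
    # with content > 100 (F_R).
    return all(
        any(inc > 100 for inc in bidder["increase"])
        for bidder in auction["bidders"]
    )


result = [a["id"] for a in auctions if satisfies_every(a)]
print(result)
```

Auction o2 is rejected because one bidder has no increase above 100, while o3 qualifies vacuously since it has no bidders at all.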

Page 10: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Grammar for XQuery Fragment

Function-free XQuery is captured by the following grammar:

FLWR ::= ForClause LetClause WhereClause ReturnClause.

ForClause ::= FOR $fv1 IN E1, … , $fvn IN En.

LetClause ::= LET $lv1 := E1, … , $lvn := En.

ReturnClause ::= RETURN {E1} … {En}.

Ei ::= FLWR | XPATH.

WhereClause ::= WHERE (E1, … , En).

Page 11: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Algorithm GTP
Input: a FLWR expression Exp, a context group number g
Output: a GTP or GTPs with a join formula

if (g’s last level != 0) let g = g + “.0”;
foreach (“For $fv in E”) do
    parse(E, g);
let ng = g;
foreach (“Let $lv := E”) do {
    let ng = ng + 1;
    parse(E, ng);
}
foreach predicate p in WHERE do {
    if (p is “every El satisfies Er”) {
        let ng = ng + 1;
        parse(El, ng);
        let F_L be the formula associated with the pattern resulting from El;
        let ng = ng + 1;
        parse(Er, ng);
        let F_R be the formula associated with the pattern resulting from Er;
    } else {
        foreach Ei as p’s argument do
            parse(Ei, g);
    }
}
foreach “{Ei}” do {
    let ng = ng + 1;
    parse(Ei, ng);
}

Procedure parse
Input: FLWR expression or XPath expression E, context group number g
Output: part of the GTP resulting from E
if (E is a FLWR expression) GTP(E, g);
else buildTPQ(E);
end procedure

Page 12: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Algorithm GTPInput: a FLWR expression Exp, a context group number g

Output: a GTP or GTPs with a join formula

The GTP can be informally understood as follows:

1) Find matches for all nodes connected to the root by only solid edges.

2) Next, find matches for the remaining nodes (whose path to the GTP root involves one or more dotted edges), if they exist.

Page 13: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Translating GTP Into an Evaluation Plan

• Avoid repeated matching of similar tree patterns

• Postpone the materialization of nodes as much as possible

• Operators and methods are available in any XML database system

Page 14: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Physical algebra

Index Scan ISp(S): outputs each node satisfying the predicate p, using an index, for input trees S.

Filter Fp(S): given trees S, outputs only the trees satisfying the predicate p. Order is preserved.

Sort Sb(S): sorts the input sequence of trees S on the sorting basis b.

Value Join Jp(S1,S2): performs a value-based comparison on the two input sequences of trees via the join predicate p. The output order follows the order of the left input S1.

Structural Join SJr(S1,S2): the input tree sequences S1 and S2 must be sorted by node id. The operator joins pairs from S1 and S2 based on the structural relationship r between them. Output is sorted on S1 or S2 as needed. Variants: the Outer Structural Join (OSJ), where all of S1 is included in the output, and the Structural Semi-Join (SSJ), where only S1 is retained in the output.

Group By Gb(S): the input must be sorted on the grouping basis b. Groups trees on the grouping basis b.

Merge M(S1,…,Sn): the Sj’s are assumed to have the same cardinality k. For each 1 <= i <= k, merges tree i from each input under an artificial root and produces an output tree. Order is preserved.
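The three structural join variants differ only in what they emit for each S1 node. The sketch below assumes a region encoding of node ids, (start, end, level), which is a common choice but is an assumption here, and it uses nested loops for clarity rather than the sorted-input merge that the operator description implies. All function names are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    tag: str
    start: int   # position where the element opens (assumed encoding)
    end: int     # position where the element closes
    level: int   # depth in the document tree


def related(a, d, axis):
    """True if d is a descendant ("ad") or a child ("pc") of a."""
    contains = a.start < d.start and d.end < a.end
    if axis == "ad":
        return contains
    return contains and d.level == a.level + 1


def structural_join(s1, s2, axis):
    # SJ: one output pair per related (S1, S2) pair, in S1 order.
    return [(a, d) for a in s1 for d in s2 if related(a, d, axis)]


def semi_structural_join(s1, s2, axis):
    # SSJ: keep only the S1 nodes that have at least one partner.
    return [a for a in s1 if any(related(a, d, axis) for d in s2)]


def outer_structural_join(s1, s2, axis):
    # OSJ: every S1 node appears; unmatched ones are paired with None.
    out = []
    for a in s1:
        partners = [d for d in s2 if related(a, d, axis)]
        if partners:
            out.extend((a, d) for d in partners)
        else:
            out.append((a, None))
    return out


# Made-up data: the second person has no profile.
persons = [Node("person", 1, 10, 1), Node("person", 11, 14, 1)]
profiles = [Node("profile", 2, 5, 2)]

print(structural_join(persons, profiles, "pc"))
print(semi_structural_join(persons, profiles, "pc"))
print(outer_structural_join(persons, profiles, "pc"))
```

The outer variant is what lets optional GTP edges keep a person in the result even when the optional subpattern has no match.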

Page 15: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Translating GTP to Physical Plans

• Evaluation Algorithm

• Plan is a DAG where each node is a physical operator or input document

• Helper functions used: findOrder(SJs, $n), getGroupBasis(g), getGroupEvalOrder(G)

Page 16: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Stages of Evaluation Algorithm (7 steps)

1. Compute structural joins

2. Filter based on predicates depending on contents of more than 2 pattern nodes

3. Compute value joins

4. Compute aggregation

5. Filter based on predicates depending on aggregate values (if needed)

6. Compute value joins based on aggregate values (if needed)

7. Group return arguments (if any)

Page 17: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Physical plan from the GTP

[Figure: physical plan DAG for the example query. The layout is not recoverable from the transcript; the recoverable labels are listed below.]

Operators: F: filter; IS: tag index scan; SSJ: structural semi-join; SJ: structural join; OSJ: outer structural join; S: sort; G: group by; M: merge

Index scans: person, state, profile, age, watches, watch, interest

Filters: content != ‘MI’ (state), content > 25 (age)

Structural joins: person//state, profile/age, person/profile, person/watches, watches/watch, profile/interest

Sort and group-by basis: person, profile

The merge M at the root combines RETURN ARGUMENT #1 and RETURN ARGUMENT #2.

Page 18: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Schema-Aware Optimization

• Logical Optimization

- simplify the GTP by eliminating nodes using a DTD or XML schema

• Physical Optimization

- eliminate duplicate operators (e.g. sorting, duplicate elimination)

Page 19: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Schema-Aware Optimization

Internal node elimination

a//b//c → a//c, if the schema implies every path from a to c passes through b.

a/b/c → a//c ?

[Figure: the chain $a, $b, $c is reduced to $a, $c.]
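Whether every path from a to c passes through b can be tested on a schema graph by checking that c becomes unreachable from a once b is blocked. This is a minimal sketch under a strong simplification: the schema is modelled as a dict from a tag to the set of tags allowed as its children, which ignores most of what a real DTD or XML schema expresses. The function names are hypothetical.

```python
def reachable(schema, src, dst, blocked=frozenset()):
    """True if dst can occur as a descendant of src without passing
    through any tag in `blocked` (depth-first search on the tag graph)."""
    stack, seen = [src], {src}
    while stack:
        tag = stack.pop()
        for child in schema.get(tag, ()):
            if child in blocked:
                continue
            if child == dst:
                return True
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return False


def can_drop_internal_node(schema, a, b, c):
    """a//b//c may be rewritten to a//c when c is reachable from a
    and every such path goes through b. Assumes a, b, c are distinct."""
    return reachable(schema, a, c) and not reachable(schema, a, c, blocked={b})


# Hypothetical schemas.
every_path_via_b = {"a": {"b"}, "b": {"c", "d"}, "d": {"c"}}
direct_path_exists = {"a": {"b", "c"}, "b": {"c"}}

print(can_drop_internal_node(every_path_via_b, "a", "b", "c"))
print(can_drop_internal_node(direct_path_exists, "a", "b", "c"))
```

In the second schema c may be a direct child of a, so dropping $b would admit matches that the original pattern excludes.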

Page 20: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Schema-Aware Optimization

Identifying two nodes with same tag

FOR $b IN …//book

WHERE $b/title = ‘DB’

RETURN <x> {$b/title} {$b/year} </x>

[Figure: before, $b has children $t, $t2 and $y; after, $b has children $t and $y.]

$t2 can be eliminated if the schema says every book has at most one title child.

Page 21: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Schema-Aware Optimization

Eliminate redundant leaves

FOR $a IN …./a[b]

RETURN {$a/c}

[Figure: before, $a has children $b and $c; after, $a has only child $c.]

$b can be eliminated if the schema implies every a has at least one b.

Page 22: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Schema-Aware Optimization

Elimination of sorting

SJ: person/profile

Given two sorted inputs, the output will be in either person order or profile order, but in general not both.

However, if the schema implies no person can have person descendants, the output of the structural join ordered by person node id will also be in profile node id order.

[Figure: person “p1” has children person “p2” and profile “l2”; “p2” has child profile “l1”. The join output {p1 – l2, p2 – l1} is in person order but not in profile order.]
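The ordering argument can be replayed on the slide's example. In the sketch below, document-order positions are assigned by hand (p1, p2, l1, l2 for the nested case), the non-nested data set is invented for contrast, and `profile_order_preserved` is a hypothetical helper name.

```python
def profile_order_preserved(pairs, pos):
    """Sort join output by person position, then test whether the
    profile positions also come out in ascending order."""
    by_person = sorted(pairs, key=lambda pair: pos[pair[0]])
    profile_positions = [pos[profile] for _, profile in by_person]
    return profile_positions == sorted(profile_positions)


# Nested persons, as in the example: p1 contains p2 and l2; p2 contains l1.
nested_pos = {"p1": 1, "p2": 2, "l1": 3, "l2": 4}
nested_pairs = [("p1", "l2"), ("p2", "l1")]

# Made-up contrast case: no person inside another person.
flat_pos = {"p1": 1, "l1": 2, "p2": 3, "l2": 4}
flat_pairs = [("p1", "l1"), ("p2", "l2")]

print(profile_order_preserved(nested_pairs, nested_pos))
print(profile_order_preserved(flat_pairs, flat_pos))
```

Only the nested case needs an extra sort on profile; when the schema rules out nesting, that sort operator can be dropped from the plan.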

Page 23: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Schema-Aware Optimization

Elimination of group-by

{$l/interest}

We must group the return argument results for the FOR variable in general.

However, if schema implies each profile has at most one interest subelement, then grouping on interest can be eliminated.

Page 24: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Schema-Aware Optimization

Elimination of duplicate elimination

If the schema implies watches cannot have watches descendants, the duplicate elimination is unnecessary.

$p//watches//watch

[Figure: watches “ws1” contains watch “w1” and watches “ws2”; “ws2” contains watch “w2”. Matches per watches node: ws1: {w1, w2}; ws2: {w2}, so w2 is produced twice.]

$p//watches/watch ?

Note:
1. t cannot have t descendants
2. A can only have one child B
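The duplicate in the example comes from evaluating the descendant step once per watches node. A small sketch, with the tree hard-coded from the figure and hypothetical helper names, makes this visible:

```python
# Tree from the example: ws1 contains w1 and ws2; ws2 contains w2.
tag = {"ws1": "watches", "ws2": "watches", "w1": "watch", "w2": "watch"}
children = {"ws1": ["w1", "ws2"], "ws2": ["w2"]}


def descendants(node):
    """All descendants of node, in document order."""
    out = []
    for child in children.get(node, []):
        out.append(child)
        out.extend(descendants(child))
    return out


def watch_descendants(watches_nodes):
    # Evaluate //watch below each watches node and concatenate the results.
    return [d for ws in watches_nodes for d in descendants(ws) if tag[d] == "watch"]


matches = watch_descendants(["ws1", "ws2"])
deduplicated = list(dict.fromkeys(matches))  # order-preserving duplicate elimination

print(matches)
print(deduplicated)
```

If watches could never nest, each watch would be reached from exactly one watches node and the deduplication step would be a no-op.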

Page 25: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

GTP Simplification

• Algorithm: pruneGTP(G) simplifies a GTP based on child/descendant constraints and avoidance constraints

• Steps (4)

1. Detect emptiness of (sub)queries

2. Identify nodes with same tag

3. Eliminate redundant leaves

4. Eliminate redundant internal nodes

Page 26: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Theorem 1 (Optimality)

Let C be a set of child/descendant constraints.

Let G be a GTP.

There is a unique GTP Hmin equivalent to G under C that has the smallest size among all equivalent GTPs.

The GTP simplification algorithm correctly simplifies G to Hmin in polynomial time.

Page 27: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Experiments

• TIMBER native XML database
• XMark generated documents
• P-III 866 MHz
• Windows 2000 Professional
• TIMBER had a 100 MB buffer pool
• 5 executions per query; the maximum and minimum are discarded and the rest averaged
• 479 MB XML document
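The timing protocol (five runs, drop the largest and smallest, average the rest) amounts to a trimmed mean. A minimal sketch follows; the function name is hypothetical and the sample timings are invented, not measurements from the experiments.

```python
def trimmed_average(times):
    """Average after discarding one maximum and one minimum value."""
    if len(times) < 3:
        raise ValueError("need at least three measurements")
    kept = sorted(times)[1:-1]
    return sum(kept) / len(kept)


# Invented timings in seconds for five executions of one query.
runs = [12.0, 10.0, 11.0, 30.0, 9.0]
print(trimmed_average(runs))
```

Dropping the extremes keeps a single cold-cache or interrupted run from dominating the reported figure.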

Page 28: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Navigational & Base Plans

I. NAV
– Traverses recursively, getting all children of a node and checking a condition or name before the next iteration
– Dependent on path size and number of children of each node

II. BASE
– Straightforward tree pattern translation approach that utilizes set-at-a-time processing
– Unlike GTP, does not make use of tree pattern reuse

Page 29: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Interesting Cases

• Parameters: path length, number of return arguments, query selectivity, data materialization cost

• GTP outperforms NAV and BASE for every query by one or two orders of magnitude

• All algorithms are affected by path length, NAV the most

• Query selectivity and number of return arguments do not affect all algorithms: NAV performs the same iterations regardless

• Data materialization cost affects both GTP and BASE, but NAV not much

Page 30: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

CPU Timings

Page 31: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Scalability

• Used 24 MB, 47 MB, 239 MB, 479 MB, 2397 MB documents (Factor 1-5). Results:

• GTP scales linearly with size of database

Page 32: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Schema-Aware Optimization Results

• In some cases it greatly enhances performance, but very little in others.
• Works well when data materialization is not the dominating cost.
• Beneficial when a path of the form many/many/many is converted to many//many.

Page 33: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Related Work

• Navigation-based XQuery processing systems : Galax, Natix, Tamino, TIMBER

• No optimization and plan generation for XQuery on native systems as a whole

• GTP is 3-20 times faster than the TIMBER system

• Research is ongoing on optimizing XPath expressions using TPQs and schema knowledge

Page 34: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Summary & Future Work

• A novel structure called GTP is proposed

• GTPs are used as a basis for physical plan generation and query optimization

• Compared GTP with other methods on an extensive set of tests and observed that GTP wins by at least an order of magnitude.

• Presented an algorithm for schema-based simplification of GTP

• Evaluation of GTP on relational XML systems as well as native systems

Page 35: From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Thanks...

Questions ?