1
Holistic Twig Joins: Optimal XML Pattern Matching
Nicolas Bruno, Nick Koudas, Divesh Srivastava
ACM SIGMOD 2002
Presented by Jun-Ki Min
2
Contents
  Introduction
  Background
  Holistic Path Join Algorithms
  Twig Join Algorithms
  Experimental Evaluation
  Conclusion
3
Introduction
XML
  de facto standard for data exchange and retrieval
  tree-structured data model
4
Introduction
XML query languages
  specify patterns of selection predicates on multiple elements
  the elements have specified tree-structured relationships
  e.g. book[title = ‘XML’]//author[fn = ‘jane’ AND ln = ‘doe’]
5
Introduction
Finding all occurrences of a twig pattern in a database is a core operation
Previous work
  decompose the twig pattern into a set of binary (parent-child and ancestor-descendant) relationships
  match each of the binary relationships
  “stitch” together these basic matches
6
Introduction
Contributions
  Two families of holistic join algorithms
    holistic path join approach
    holistic twig join approach
  Experimental study
7
Background XML Data Model
An XML database is a forest of rooted, ordered, labeled trees.
8
Background
Indexing XML Documents
  Element positions are represented as tuples (DocID, Left:Right, Level), sorted by Left
  Child and descendant relationships between elements are easily determined from these positions
[Figure: position streams per label]
  author: (1,6:20,3) …
  book:   (1,1:150,1) …
  jane:   (1,8:8,5) … (1,43:43,5) …
  …
  title:  (1,2:4,2) (1,65:67,3) …
  XML:    (1,3:3,3) (1,66:66,4) …
  year:   (1,61:63,2) …
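The containment tests that this encoding enables can be sketched as follows. This is a minimal illustration, not the authors' code: the names `Pos`, `is_ancestor` and `is_parent` are hypothetical, and the pairing of labels with tuples (book, title, author, jane) is an assumed reading of the slide's figure.

```python
from typing import NamedTuple

class Pos(NamedTuple):
    """Position tuple (DocID, Left:Right, Level); the class name is hypothetical."""
    doc: int
    left: int
    right: int
    level: int

def is_ancestor(a: Pos, d: Pos) -> bool:
    # d is a descendant of a when d's interval is strictly nested inside a's
    return a.doc == d.doc and a.left < d.left and d.right < a.right

def is_parent(p: Pos, c: Pos) -> bool:
    # a child is a descendant exactly one level below
    return is_ancestor(p, c) and c.level == p.level + 1

# Sample tuples taken from the slide; the label assignment is an assumption.
book = Pos(1, 1, 150, 1)
title = Pos(1, 2, 4, 2)
author = Pos(1, 6, 20, 3)
jane = Pos(1, 8, 8, 5)

book_has_child_title = is_parent(book, title)      # nested, levels 1 and 2
author_contains_jane = is_ancestor(author, jane)   # nested, levels 3 and 5
author_parent_of_jane = is_parent(author, jane)    # two levels apart
```

Both tests need only integer comparisons on two tuples, which is why the streams can be joined without navigating the tree.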
9
Background Twig Pattern Matching
Given a query twig pattern Q and an XML database D, compute the set of all matches for Q on D.
book[title = ‘XML’ AND year = ‘2000’]
10
Background
Previous attempts
  Based on binary joins
    decompose the query into binary relationships
    solve the binary joins against the XML DB
    combine the “basic” matches together
  Main drawbacks
    join-order optimization is required
    intermediate results can be large
Example: book[title = ‘XML’ AND year = ‘2000’]
  ((book JOIN title) JOIN XML) JOIN (year JOIN 2000)
  (((book JOIN year) JOIN 2000) JOIN title) JOIN XML
  many other possibilities
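The intermediate-result drawback can be made concrete with a toy sketch. Everything here is hypothetical: the (left, right) intervals are invented for illustration, the predicates on ‘XML’ and ‘2000’ are ignored, and `structural_join` is a naive nested-loop join rather than any of the binary join algorithms evaluated in the paper.

```python
def structural_join(ancestors, descendants):
    """Naive binary structural join on (left, right) intervals."""
    return [(a, d) for a in ancestors for d in descendants
            if a[0] < d[0] and d[1] < a[1]]

# Invented data: three books, each with a title, only the first with a year.
books = [(1, 6), (7, 10), (11, 14)]
titles = [(2, 3), (8, 9), (12, 13)]
years = [(4, 5)]

book_title = structural_join(books, titles)   # one pair per book
book_year = structural_join(books, years)     # only the first book

# Stitch the two binary results together on the shared book element.
twig = [(b, t, y) for (b, t) in book_title
        for (b2, y) in book_year if b == b2]
```

With this join order, three book-title pairs are materialized although only one of them survives the final stitching step; starting with book JOIN year would have produced a single intermediate pair, which is why join-order optimization matters.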
11
Holistic Joins
Solve the entire twig query in two phases
  1) produce “guaranteed” partial results using one pass
  2) combine (merge-join) the partial results
Partial results are smaller than the final result
  effective encoding of partial results
12
Data Structure
Each node q in the query has associated:
  a stream Tq, with the positions of the elements corresponding to node q, in increasing “left” order
  a stack Sq, with a compact encoding of partial solutions (stacks are chained)
    each stack entry is a pair (position, pointer to an entry in Sparent(q))
13
Data Structure: Result representation
The nodes in each stack Sq lie on a single root-to-leaf path of the XML tree
[Figure: XML fragment, query, matches, and stack encoding]
  XML fragment nodes: A1, C1, A2, C2, B1, D1
  Query: //A//C//D
  Matches: [A1,C1,D1] [A1,C2,D1] [A2,C2,D1]
  Stacks (bottom to top): SA: A1, A2; SC: C1, C2; SD: D1
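A minimal sketch of how the three matches for //A//C//D can be read back out of the chained stacks. The pointer values (C1 to A1, C2 to A2, D1 to C2) are an assumption inferred from the listed matches, and `expand` is a hypothetical helper name.

```python
def expand(stacks, level, upto, suffix=()):
    """Enumerate matches encoded in chained stacks.

    stacks[level] is a list of (name, ptr) entries, bottom first; ptr is the
    index of the top entry of the parent stack at the time of the push.
    Every parent entry at index <= ptr is an ancestor, so all are combined.
    """
    if level < 0:
        return [suffix]
    out = []
    for i in range(upto + 1):
        name, ptr = stacks[level][i]
        out += expand(stacks, level - 1, ptr, (name,) + suffix)
    return out

# Assumed encoding of the slide's example (root stack entries have no parent).
S_A = [("A1", -1), ("A2", -1)]
S_C = [("C1", 0), ("C2", 1)]   # C1 -> A1, C2 -> A2
S_D = [("D1", 1)]              # D1 -> C2
stacks = [S_A, S_C, S_D]

matches = expand(stacks, level=2, upto=len(S_D) - 1)
```

Five stack entries encode three matches here; in general the encoding stays linear in the number of elements even when the number of matches does not.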
14
Path Stack: Holistic Path Queries
Repeatedly constructs stack encodings of partial solutions by iterating through the streams Tq.
Stacks encode the set of partial solutions from the current element in Tq to the root of the XML tree.
WHILE (!eof)
  qN = getMin(q)
  clean stacks
  push TqN’s first element to SqN with the pointer to top(Sparent(qN))
  IF qN is a leaf node, expand solutions
15
Path Stack Example
[Figure: animation of PathStack on an XML fragment with nodes A1, A2, C1, B1, C2, B2, C3, C4 and a path query over A, B, C; stack contents after each step:]
  1. all stacks empty
  2. SA: A1
  3. SA: A1 - A2
  4. SA: A1 - A2; SC: C1
  5. SA: A1 - A2; SB: B1; SC: C1
  6. SA: A1 - A2; SB: B1; SC: C1 - C2; output: A1,B1,C2 and A2,B1,C2
  7. SA: A1; SB: B2
  8. SA: A1; SB: B2; SC: C3; output: A1,B2,C3
  9. SA: A1; SB: B2; SC: C4; output: A1,B2,C4
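The example can be reproduced with a compact PathStack sketch for ancestor-descendant path queries over a single document. Several things are assumptions rather than the slides' content: the (left, right) positions are invented to match the nesting implied by the stack trace, the names `Elem` and `path_stack` are hypothetical, and this variant pops a leaf entry right after expanding it, whereas the slide keeps C1 on its stack.

```python
from typing import NamedTuple

class Elem(NamedTuple):
    name: str
    left: int
    right: int

def path_stack(streams):
    """streams[i]: elements for the i-th query node (root first), sorted by left."""
    n = len(streams)
    cursor = [0] * n
    stacks = [[] for _ in range(n)]  # entries: (Elem, index of parent-stack top)
    results = []

    def expand(level, upto, suffix):
        # Walk the chained stacks upward, combining every reachable ancestor.
        if level < 0:
            results.append(suffix)
            return
        for i in range(upto + 1):
            elem, ptr = stacks[level][i]
            expand(level - 1, ptr, (elem.name,) + suffix)

    while cursor[-1] < len(streams[-1]):          # until the leaf stream ends
        # getMin: the query node whose next element has the smallest left
        live = [q for q in range(n) if cursor[q] < len(streams[q])]
        q = min(live, key=lambda k: streams[k][cursor[k]].left)
        e = streams[q][cursor[q]]
        cursor[q] += 1
        # clean stacks: drop entries that end before e starts
        for s in stacks:
            while s and s[-1][0].right < e.left:
                s.pop()
        ptr = len(stacks[q - 1]) - 1 if q > 0 else -1
        stacks[q].append((e, ptr))
        if q == n - 1:                            # leaf: output, then pop
            expand(q - 1, ptr, (e.name,))
            stacks[q].pop()
    return results

# Hypothetical positions: A1(A2(C1(B1(C2))), B2(C3, C4))
A = [Elem("A1", 1, 16), Elem("A2", 2, 9)]
B = [Elem("B1", 4, 7), Elem("B2", 10, 15)]
C = [Elem("C1", 3, 8), Elem("C2", 5, 6), Elem("C3", 11, 12), Elem("C4", 13, 14)]
matches = path_stack([A, B, C])
```

C1 produces no output because SB is empty when it arrives, so its pointer reaches no B entry; the four outputs appear in the same order as on the slide.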
16
Twig Queries
Naïve adaptation of PathStack
  solve each root-to-leaf path independently
  merge-join the intermediate results
Problem: many intermediate results might not be part of the final answer.
[Figure: query twig with root A and branches B-C and D-E, next to a sample XML tree over nodes A, B, C, D, E, X]
17
Twig Stack
1) Compute only partial solutions that are guaranteed to extend to a final solution.
2) Merge partial solutions to obtain all matches.
WHILE (!eof)
  qN = getNext(q)
  clean stacks
  IF TqN’s first element is part of a solution, push it
  IF qN is a leaf node, expand solutions
getNext may advance the streams in subTree(q) past elements that are guaranteed not to be part of a solution
18
Twig Stack
Key difference between PathStack and TwigStack: before a node hq from Tq is pushed on its stack Sq, TwigStack ensures that
(1) node hq has a descendant hqi in each of the streams Tqi, for qi ∈ children(q)
(2) each node hqi recursively satisfies the first property
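The two-part property can be written as a recursive check. This is a deliberately naive sketch that scans whole streams; it is not the streaming getNext procedure, which inspects only the current head of each stream. The names `contains` and `has_extension` are hypothetical, and the sample positions assume the author/fn/ln streams of the TwigStack example with fn = (3,4), (7,8) and ln = (9,10), (13,14).

```python
def contains(anc, desc):
    """True when interval desc = (left, right) is nested inside anc."""
    return anc[0] < desc[0] and desc[1] < anc[1]

def has_extension(children, streams, q, elem):
    """Property (1) and (2): elem has, for every child query node qi,
    a descendant in stream T_qi that itself satisfies the property."""
    return all(
        any(contains(elem, c) and has_extension(children, streams, qi, c)
            for c in streams[qi])
        for qi in children.get(q, [])
    )

# Query twig: author with branches fn and ln (leaf nodes have no children).
children = {"author": ["fn", "ln"]}
streams = {
    "author": [(2, 5), (6, 11), (12, 15)],
    "fn": [(3, 4), (7, 8)],
    "ln": [(9, 10), (13, 14)],
}
pushable = [e for e in streams["author"]
            if has_extension(children, streams, "author", e)]
```

Under these assumed streams only the author at (6,11) qualifies, so it is the only author element that would reach the stack.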
19
Twig Stack Example
Before an author element is pushed on the stack Sauthor, the current elements of all child streams (Tfn, Tln) are checked.
The partial results are (6,11)(7,8) and (6,11)(9,10), which are then merged to generate the final results.
[Figure: XML tree rooted at allauthors (1,16) with three author elements, and the query twig author with branches fn and ln]
  Streams:
    author: (2,5) (6,11) (12,15)
    fn:     (3,4) (7,8)
    ln:     (9,10) (13,14)
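The merge step for this example can be sketched as a join of the two path-solution lists on their shared author element. This is an illustration only: `merge_path_solutions` is a hypothetical name, and it groups by key with a dictionary instead of performing the sorted merge join the slides refer to.

```python
from collections import defaultdict

def merge_path_solutions(left_paths, right_paths):
    """Join (root, leaf) path solutions of two branches on the shared root."""
    by_root = defaultdict(list)
    for root, leaf in right_paths:
        by_root[root].append(leaf)
    return [(root, leaf_l, leaf_r)
            for root, leaf_l in left_paths
            for leaf_r in by_root.get(root, [])]

# Partial results from the slide: author-fn and author-ln path solutions.
author_fn = [((6, 11), (7, 8))]
author_ln = [((6, 11), (9, 10))]
twig_matches = merge_path_solutions(author_fn, author_ln)
```

Because TwigStack only emitted path solutions for the author at (6,11), every partial result contributes to a final match and nothing is discarded during the merge.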
20
Experiment Environments
Implemented all algorithms in C++, using the file system as a simple storage engine.
Synthetic database
  random XML documents
  parameters: depth, fan-out, number of distinct labels
Techniques compared
  binary join techniques
  PathStack
  TwigStack
21
PathStack vs. Binary Joins
XML database fragment: 1 million nodes.
Path query: A1//A2//A3//A4//A5//A6
Execution time
  Sequential Scan (SS): 1.87s
  PathStack: 2.53s
  Binary Joins: 16.1s to 53.07s
[Figure: bar chart of execution time (seconds), axis 0 to 60, for Binary Joins, PathStack, and SS]
22
PathStack vs. TwigStack
Query: [Figure: query twig over nodes A1 to A7]
Data: a full ternary tree
  first subtree contains only A1, A2, A3 and A4
  second subtree contains only A1, A5, A6 and A7
  third subtree contains all possible nodes
  vary the size of the third subtree relative to the size of the first two, from 8% to 24%
23
PathStack vs. TwigStack
•PathStack produces partial solutions that are discarded at the merge step
24
Conclusion
Developed holistic path join algorithms
Developed TwigStack, which generalizes PathStack for twig queries
  performs better than the binary join approach
Future work
  integrate TwigStack with value-based joins (id-refs, user-defined predicates, etc.)
  incorporate the remaining axes (following, etc.)