![Page 1: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/1.jpg)
DIMACS Streaming Data Working Group II
On the Optimality of the Holistic Twig Join Algorithm
Speaker: Byron Choi (Upenn)
Joint Work with Susan Davidson (Upenn), Malika Mahoui (Upenn) and
Derick Wood (HKUST)
![Page 2: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/2.jpg)
A Scenario
XML Doc. Server
Memory
Small Devices
Memory is shared by many concurrent apps.
Limited computing resources
Streams of elements
Picking up useful elements on the fly
![Page 3: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/3.jpg)
Background
The Model, Data Representation and Assumptions
![Page 4: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/4.jpg)
The Model: Data Streaming Model
- Spend constant time to process each element
- Once processed, an element in a stream is either discarded or stored in main memory
- Each element in the streams is seen only once
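The three constraints above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `single_pass`, `one_shot` and the `keep` predicate are hypothetical names, and the sample tuples are borrowed from the example document used later in the talk.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Node = Tuple[int, int, int, str]  # (preorder, postorder, depth, label)

def single_pass(stream: Iterable[Node], keep: Callable[[Node], bool]) -> List[Node]:
    """Scan the stream once; every element is either stored or discarded."""
    memory: List[Node] = []
    for node in stream:      # each element is seen exactly once
        if keep(node):       # assumed to be a constant-time decision
            memory.append(node)
        # a node that is not kept is gone for good
    return memory

def one_shot(nodes: List[Node]) -> Iterator[Node]:
    """Hypothetical stream source: a generator cannot be rewound."""
    yield from nodes

stream = one_shot([(1, 9, 1, "A"), (2, 7, 2, "B"), (8, 8, 2, "C")])
stored = single_pass(stream, lambda n: n[3] == "A")
leftover = list(stream)      # the stream is exhausted after one scan
print(stored)    # [(1, 9, 1, 'A')]
print(leftover)  # []
```

Using a generator as the stream makes the "only once" restriction concrete: a second scan simply yields nothing.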
![Page 5: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/5.jpg)
Node Representation
- 4-ary tuple: <preorder #, postorder #, depth, label>
- Complexity of Desc, Child, Ances, Parent: O(1)
- Desc(n1, n2) = true if n1.preorder < n2.preorder ∧ n1.postorder > n2.postorder (that is, n2 is a descendant of n1)
- Child(n1, n2) = true if n1.preorder < n2.preorder ∧ n1.postorder > n2.postorder ∧ n1.depth + 1 = n2.depth
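A minimal sketch of the two predicates, written directly from the definitions above. The `Node` class and the lowercase function names are illustrative choices; the sample tuples are taken from the example document slide.

```python
from typing import NamedTuple

class Node(NamedTuple):
    preorder: int
    postorder: int
    depth: int
    label: str

def desc(n1: Node, n2: Node) -> bool:
    """True if n2 is a descendant of n1; two integer comparisons, so O(1)."""
    return n1.preorder < n2.preorder and n1.postorder > n2.postorder

def child(n1: Node, n2: Node) -> bool:
    """True if n2 is a child of n1: a descendant exactly one level deeper."""
    return desc(n1, n2) and n1.depth + 1 == n2.depth

a1 = Node(1, 9, 1, "A")
b1 = Node(2, 7, 2, "B")
b2 = Node(4, 4, 4, "B")
c2 = Node(8, 8, 2, "C")

print(desc(a1, b2))   # True: b2 lies inside a1's numbering interval
print(child(a1, b2))  # False: b2 is three levels below a1
print(child(a1, b1))  # True
print(desc(b1, c2))   # False: c2 starts after b1 has ended
```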
![Page 6: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/6.jpg)
Example Document
a1 (1, 9, 1, A)
  b1 (2, 7, 2, B)
    a2 (3, 6, 3, A)
      b2 (4, 4, 4, B)
      c1 (5, 5, 4, C)
  c2 (8, 8, 2, C)
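The tuples on this slide can be reproduced with a single counter during a depth-first walk. The sketch below is an assumption about how the numbers were produced, not something the slides state: the counter ticks once when a node is entered, and once more when an internal node is left, so a leaf gets equal preorder and postorder values. The tree shape in `DOC` is inferred from the tuples themselves.

```python
from typing import Dict, Tuple

# (name, label, children); shape inferred from the slide's tuples.
DOC = ("a1", "A", [
    ("b1", "B", [
        ("a2", "A", [("b2", "B", []), ("c1", "C", [])]),
    ]),
    ("c2", "C", []),
])

def number_nodes(tree) -> Dict[str, Tuple[int, int, int, str]]:
    """Assign (preorder, postorder, depth, label) using one shared counter."""
    out: Dict[str, Tuple[int, int, int, str]] = {}
    counter = 0

    def visit(node, depth: int) -> None:
        nonlocal counter
        name, label, children = node
        counter += 1                 # tick on entry
        pre = counter
        for c in children:
            visit(c, depth + 1)
        if children:
            counter += 1             # extra tick when leaving an internal node
        out[name] = (pre, counter, depth, label)

    visit(tree, 1)
    return out

numbered = number_nodes(DOC)
for name in sorted(numbered, key=lambda n: numbered[n][0]):
    print(name, numbered[name])
```

Any numbering that preserves the relative preorder and postorder of the nodes would serve the Desc and Child tests equally well.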
![Page 7: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/7.jpg)
Twig Queries
Syntax:
Step ::= / | //
NodeTest ::= symbol
Path ::= Step NodeTest | Step NodeTest Path
Twig ::= Path | Path (Twig, Twig, …, Twig)
Example: //A (//B, //C)
In English: find the A nodes that have a B descendant and a C descendant.
[Figure: query tree with root A and children B and C]
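To make the meaning of the example query concrete, here is a brute-force evaluation of //A (//B, //C) over the six tuples of the example document. It is deliberately naive (it holds every node in memory and is not a streaming algorithm), and the helper name `desc` is illustrative.

```python
from typing import List, Tuple

Node = Tuple[int, int, int, str]  # (preorder, postorder, depth, label)

NODES: List[Node] = [
    (1, 9, 1, "A"), (2, 7, 2, "B"), (3, 6, 3, "A"),
    (4, 4, 4, "B"), (5, 5, 4, "C"), (8, 8, 2, "C"),
]

def desc(n1: Node, n2: Node) -> bool:
    """True if n2 is a descendant of n1."""
    return n1[0] < n2[0] and n1[1] > n2[1]

def has_descendant(anc: Node, label: str) -> bool:
    return any(n[3] == label and desc(anc, n) for n in NODES)

# //A (//B, //C): A nodes with at least one B and one C descendant.
matches = [a for a in NODES
           if a[3] == "A" and has_descendant(a, "B") and has_descendant(a, "C")]
print([a[0] for a in matches])  # [1, 3], i.e. a1 and a2
```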
![Page 8: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/8.jpg)
Twig Join Algorithms
- Containment Join [Jiang et al.]
  - Decompose a twig query into a set of steps
  - Apply relational join algorithms to join the nodes of each step
  - Use customized traditional indexes and estimation methods [SIGMOD03]
- Path Join [Zhang et al.]
  - Decompose a twig query into a set of paths
  - Apply relational join algorithms to join the nodes of each path
- Holistic Twig Join [Bruno et al.]
  - Evaluate the twig query as a whole
![Page 9: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/9.jpg)
Twig Join Algorithms (cont’d)
- The first two approaches may compute large intermediate results and are not suitable for data streaming.
- In this talk we will focus on the third approach: the TwigStack algorithm (Bruno et al., SIGMOD 02).
![Page 10: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/10.jpg)
The TwigStack Algorithm (Overview)
- Associate a stream with each NodeTest
  - The nodes in the stream satisfy the NodeTest
- Asymptotically optimal among the algorithms that read the entire input
  - Scans the streams only once
  - Spends constant memory only on the nodes that are useful, i.e. participate in at least one solution
- Optimality is guaranteed when the query contains descendant edges only.
- Suboptimal when the query contains some child edges
  - Memory is spent on possibly useless nodes.
![Page 11: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/11.jpg)
Problem Statement
Given a twig query and the associated streams, is it possible to find all solutions …
- by using a single forward scan of the streams,
- by paying constant memory only for the useful nodes, and
- by spending constant time on processing each node in the streams?
![Page 12: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/12.jpg)
Main Results So Far
- Assume the data streaming model…
  - There is no optimal holistic twig join algorithm – Theorem 1.
  - The evaluation of twig queries is not memory bounded – Theorem 1.
- By relaxing some restrictions on the data streaming model, we showed…
  - The lower bounds of such relaxed models are still quite high – Theorem 2 and Theorem 3.
![Page 13: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/13.jpg)
Outline
- TwigStack By Examples
- Offline Sorting
- Multiple Scans
- Discussion
- Conclusion
![Page 14: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/14.jpg)
TwigStack By Examples
- Query: //A (//B, //C)
- Document: [Figure: the example document tree with nodes a1, a2, b1, b2, c1, c2]
- Streams: TA = [a1, a2], TB = [b1, b2], TC = [c1, c2]
- pA, pB, pC are the anchors pointing to the “top” of the streams
- Useful nodes are stored in main memory and can be read later
![Page 15: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/15.jpg)
TwigStack By Examples
- Step 0: pA -> a1, pB -> b1, pC -> c1. a1 is useful; TA is advanced, pA -> a2.
- Step 1: b1 is useful; TB is advanced, pB -> b2.
[Figure: the example document tree at each step; nodes stored in memory: a1]
![Page 16: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/16.jpg)
TwigStack By Examples
- Step 2: a2 is useful; TA is advanced, pA -> null.
- Step 3: b2 is useful; TB is advanced, pB -> null.
[Figure: the example document tree at each step; nodes stored in memory: a1 b1, then a1 b1 a2]
![Page 17: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/17.jpg)
TwigStack By Examples
- Step 4: c1 is useful; TC is advanced, pC -> c2.
- Step 5: Printing.
- Step 6: c2 is useful; TC is advanced, pC -> null.
[Figure: the example document tree at each step; nodes stored in memory: a1 b1 a2 b2, then a1 b1]
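The order in which the walkthrough advances the three streams (a1, b1, a2, b2, c1, c2) coincides with a merge of the streams by preorder number. The sketch below reproduces only that order; it is not TwigStack itself, since it performs none of the usefulness checks or stack bookkeeping. The preorder values come from the example document slide.

```python
import heapq

# Each stream holds (name, preorder) pairs, already sorted by preorder.
TA = [("a1", 1), ("a2", 3)]
TB = [("b1", 2), ("b2", 4)]
TC = [("c1", 5), ("c2", 8)]

# heapq.merge repeatedly takes the smallest head among the streams,
# which is the role the anchors pA, pB and pC play in the walkthrough.
order = [name for name, _ in heapq.merge(TA, TB, TC, key=lambda item: item[1])]
print(order)  # ['a1', 'b1', 'a2', 'b2', 'c1', 'c2']
```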
![Page 18: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/18.jpg)
TwigStack By Examples
- Query: //A (/B, /C)
- Document: [Figure: the example document tree with nodes a1, a2, b1, b2, c1, c2]
- Streams: TA = [a1, a2], TB = [b1, b2], TC = [c1, c2]
![Page 19: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/19.jpg)
TwigStack By Examples
- Computation 1
  - pA -> a1, pB -> b1, pC -> c1
  - TA is advanced, pA -> a2; TB is advanced, pB -> b2
  - a2 is useful (a1 is discarded)
- Computation 2
  - TC is advanced, pC -> c2
  - a1 is useful
  - a2 is useless because c1 is discarded
[Figure: the example document tree for each computation]
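A brute-force enumeration of the solutions to //A (/B, /C) over the example tuples shows why the two computations pull in opposite directions. This is an offline check with every node in memory, not a streaming algorithm, and the names used are illustrative.

```python
from typing import Dict, Tuple

Node = Tuple[int, int, int, str]  # (preorder, postorder, depth, label)

NODES: Dict[str, Node] = {
    "a1": (1, 9, 1, "A"), "b1": (2, 7, 2, "B"), "a2": (3, 6, 3, "A"),
    "b2": (4, 4, 4, "B"), "c1": (5, 5, 4, "C"), "c2": (8, 8, 2, "C"),
}

def child(parent: Node, kid: Node) -> bool:
    """True if kid is a child of parent (Desc plus a depth difference of one)."""
    return (parent[0] < kid[0] and parent[1] > kid[1]
            and parent[2] + 1 == kid[2])

def by_label(label: str):
    return sorted(n for n, t in NODES.items() if t[3] == label)

solutions = [(a, b, c)
             for a in by_label("A")
             for b in by_label("B")
             for c in by_label("C")
             if child(NODES[a], NODES[b]) and child(NODES[a], NODES[c])]
print(solutions)  # [('a1', 'b1', 'c2'), ('a2', 'b2', 'c1')]
```

a1 pairs with c2, the last element of TC, while a2 pairs with c1, the first. With the anchors at a1, b1 and c1, keeping a1 means reading past c1, and keeping a2 means reading past a1 and b1.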
![Page 20: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/20.jpg)
TwigStack By Examples
[Figure: four panels, each showing a1 with one pair of nodes: (b1, c4), (b2, c3), (b3, c2), (b4, c1)]
The Extreme Case: O(stream size)
![Page 21: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/21.jpg)
TwigStack Pseudo Code
We’ve only walked through the red boxes
![Page 22: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/22.jpg)
Twig Queries over Streams
Theorem 1
- There is no optimal holistic twig join algorithm, no matter how the nodes are sorted.
  - Memory must be spent on possibly useless nodes.
- Given arbitrary streams, the memory requirement of exact algorithms is unbounded.
![Page 23: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/23.jpg)
Proof of Theorem 1 (Sketch)
- Fix a document
- Issue a few queries: //A//B, /A (/A, /A) and /A/A
- Optimality implies certain constraints on the streams
- No single stream can satisfy all the constraints
![Page 24: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/24.jpg)
Proof of Theorem 1 (cont’d)
- Reduce a twig query to an SPJ query
- The twig query is memory bounded iff the SPJ query is memory bounded.
- Babcock et al., PODS 02
![Page 26: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/26.jpg)
Variation 1: Offline Sorting
- Pre-compute some intermediate results and collect the results in a scan
- Allow offline sorting of the nodes and keep all the necessary sorted nodes
- Allow the algorithm to scan the nodes in the correct orderings
![Page 27: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/27.jpg)
Motivation
- The anchors are performing a depth-first traversal
- But why? How about an ordering in which recursions are removed?
[Figure: the example document tree, and a second arrangement with a1, a2 above b1, b2, c1, c2]
![Page 28: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/28.jpg)
The Lower Bound
- The number of necessary sortings performed offline is high
  - Data redundancy
- m is the number of structurally recursive labels in the document's DTD; d is the document depth.
- The lower bound is expressed in terms of m and d
- We identify a restricted case in which DTDs help to lower the lower bound
![Page 29: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/29.jpg)
Variation 2: Multiple Scans
- Massive storage (tapes, disks) naturally produces a stream of items.
  - Sequential scanning is a vital requirement of such storage
- Only a small number of scans can be allowed due to the high volume of data
![Page 30: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/30.jpg)
The Lower Bound
- Allow P scans on the data streams.
- The lower bound on P is high: it is expressed in terms of d and t, where d is the document depth and t is the number of simple child-edge queries in a twig query
![Page 31: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/31.jpg)
Discussion
- Bruno et al. assign memory to possibly useless nodes and illustrate by experiments that such a computation model is practical
- No work on approximating twig queries with provable guarantees
- Constraints expressed in DTDs
- Our work assumes a certain representation of the nodes: ancestor, descendant, parent and child relationships can be determined in O(1)
![Page 32: DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson](https://reader031.vdocuments.site/reader031/viewer/2022032015/56649c7c5503460f94931363/html5/thumbnails/32.jpg)
Conclusion
- The evaluation of twig queries in a data streaming context is tricky.
  - It is not memory bounded.
  - The optimal memory constraint cannot be satisfied in a single pass over the streams.
- Need to look for other solutions.