TAX: A Tree Algebra for XML
Reference: Jagadish et al. DBPL 2001.
Overview
- Why an algebra for XML?
- Main challenges
- Data model
- Patterns & Witnesses
- Tree Value Functions
- Some Example Operators
- Translation Example – XQuery
Overview (contd.)
- Main Results
- Optimization Examples
- Implementation
- Summary & Future Work
Why an Algebra (for XML)? (aka Related Work)
- Bulk algebra for tree manipulation: efficient implementation of XML queries
- Algebra for manipulating trees (has been attempted before)
  - Feature algebras (linguistics): efficient implementation?
  - Grammar-based algebra for trees [Tompa+ 87, Gyssens+ 89]
  - Aqua project [Zdonik+ 95]
Why XML algebra? [Related work] (contd.)
- GraphLog, Hy+ [Consens+ 90], GOOD [Paredaens+ 92]: cannot exploit special properties of trees (e.g., support for arbitrary recursion vs. descendants, order)
- Semistructured data: Lorel [Abiteboul+ 96], UnQL [Buneman+ 96]
- XML algebras: [Beech+ 99], [Fernandez+ 00] (mainly type system issues), [Christophides+ 00] (trees → tuples), [Ludascher+ 00] (nodes, not trees), SAL [Beeri+ 99] (ordered lists of nodes)
Why? (contd.)
The algebra should:
- be close to the relational model, but with direct support for (collections of) trees
- express at least RA + aggregation
- capture a substantial fragment of XQuery
- admit efficient implementation and effective query optimization, e.g., satisfy “natural” identities
Main Challenges
- Capture rich variety of manipulations in a simple algebra
- Handle heterogeneity in tree collections
  - structure
  - “schema” of nodes of the same “type”
- Handle order (documents are ordered)
  - sometimes important (e.g., author list, whether anesthesia preceded incision)
  - sometimes not (e.g., publisher vs. authors)
Data Model
- Data tree = rooted ordered tree
- Data in node = set of attr-val pairs
- Special attribute: pedigree (“where did I come from?”)
  - Node representation = (docId, startPos:endPos, level)
  - preserved for (copies of) original nodes through manipulations
  - plays an important role in grouping, sorting, etc.
  - null for new nodes
- Collections (of trees) are unordered
- IDREF(S) treated like other attributes
  - Possible alternative: treat them as pointers
  - One position: express pointer dereferencing as an IDREF=ID join (but implement as you will)
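The (docId, startPos:endPos, level) representation supports cheap structural tests between nodes. The following is a minimal sketch, assuming the common interval-containment reading of start and end positions (the slide does not spell this out). The names `Pedigree`, `is_ancestor`, and `is_parent` and the sample positions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pedigree:
    """Hypothetical encoding of (docId, startPos:endPos, level)."""
    doc_id: str
    start: int
    end: int
    level: int


def is_ancestor(a: Pedigree, d: Pedigree) -> bool:
    # Assumption: an ancestor's position interval strictly contains
    # the descendant's interval within the same document.
    return a.doc_id == d.doc_id and a.start < d.start and d.end < a.end


def is_parent(a: Pedigree, d: Pedigree) -> bool:
    # A parent is an ancestor exactly one level above.
    return is_ancestor(a, d) and d.level == a.level + 1


# Illustrative positions only; they are not taken from the slides.
bib = Pedigree("bib.xml", 1, 30, 0)
book = Pedigree("bib.xml", 2, 18, 1)
year = Pedigree("bib.xml", 3, 4, 2)
name = Pedigree("bib.xml", 6, 9, 3)
```

Under this reading, a pc edge in a pattern corresponds to `is_parent` and an ad edge to `is_ancestor`, and neither test needs to walk the tree.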
Patterns & Witnesses
- First challenge: how do you get at nodes and/or attributes?
- Notion of selection parameter: considerably more complex
- Our solution: patterns, which enable specification of parameters for most operations
- Only show parts of interest:
  - Need not know/care about entire structure of trees in collection
  - Analogy: in SchemaLog, you only specify what you care about
Patterns & Witnesses (contd.)
Example P1:
  Structural part:
          $1
      pc /  \ ad
        $2    $3
    (pc = direct, ad = transitive)
  Condition part:
    $1.tag = book & $2.tag = year & $2.content < 2000 & $3.tag = author
Additional parameters possible: e.g., selection/projection lists, grouping, ordering, etc.
Patterns & Witnesses (contd.)
What does a pattern do for you?
- Generates witnesses against the i/p collection
  - one for each matching of pattern against i/p
  - conditions must be respected
  - (sub)structure preserved in o/p
- E.g., witness trees for pattern P1: one tree for each author of each book published before 2000, showing year & author
  - book-author link may be transitive in i/p but is necessarily direct in o/p
- Source trees = trees that witnesses “came from”
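To make the notion of a witness concrete, here is a minimal sketch that enumerates the witness trees of pattern P1 over a small tree. The `Node` class, the helper names, the `authors` wrapper element, and the encoding of 560 BC as -560 are all assumptions made for illustration; this is not the matching algorithm used by any real system.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Node:
    tag: str
    content: object = None
    children: list = field(default_factory=list)


def descendants(node):
    """Yield all proper descendants of node in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def witnesses_p1(root):
    """Return one witness tree per embedding of P1 under root."""
    result = []
    for b in [root, *descendants(root)]:
        if b.tag != "book":                       # $1.tag = book
            continue
        years = [y for y in b.children            # pc edge: direct child
                 if y.tag == "year" and y.content < 2000]
        authors = [a for a in descendants(b)      # ad edge: any depth
                   if a.tag == "author"]
        for y, a in product(years, authors):
            # In the witness, author hangs directly under book.
            result.append(Node("book", children=[
                Node("year", y.content), Node("author", a.content)]))
    return result


# Hypothetical data: "authors" is a wrapper element added to exercise
# the ad edge; -560 stands in for 560 BC.
bib = Node("bib", children=[
    Node("book", children=[
        Node("year", 1910),
        Node("authors", children=[Node("author", "Whitehead"),
                                  Node("author", "Russell")]),
    ]),
    Node("book", children=[Node("year", -560), Node("author", "Panini")]),
    Node("book", children=[Node("year", 2005), Node("author", "Someone")]),
])
trees = witnesses_p1(bib)
```

The 1910 book contributes two witnesses (one per author), the 560 BC book contributes one, and the 2005 book contributes none because it fails the year condition.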
Example Database
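[Figure: data tree rooted at bib with two book subtrees.
 - First book: title “Principia Mathematica”, year 1910, and two authors, each with a name and a deg: Alfred North Whitehead (name = first, mid, last; deg = Sc.D., FRS) and Bertrand Russell (name = first, last; deg = M.A., FRS).
 - Second book: title “Ashtadhyayi (First book on Sanskrit Grammar)”, year 560 BC, and one author with name Panini.
 Nodes are labelled with position numbers. Only startPos shown.]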
What should selection return?
- Trees where a match occurred? Poor granularity when DB = one big document tree; e.g., “select books authored by John Grisham” returns the whole bib tree!
- Only distinguished nodes (as in XPath)? You don’t get all the info that you want.
- Witness trees: the right level of abstraction and info extraction.
  - May enhance: e.g., relatives of selected nodes might be of interest too. Descendants are the most useful case.
Example Operators – Selection
- Input: collection; parameters: pattern, selection list (pattern nodes)
- Example:
  - pattern P1 and empty SL: same witness trees as before
  - pattern P1 with SL = {$1}: whole book subtrees (i.e., retain $1’s descendants)
- One-to-zero/more: in general, zero or more o/p trees per i/p tree
- Could retain other “relatives” instead (e.g., siblings)
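Selection with P1 (empty SL)
[Figure: three witness trees, each a book root with a year child and an author child: two trees for the 1910 book (one per author) and one tree for the 560 BC book. Nodes are labelled with their startPos values.]
Whole author subtree included when SL = {$3}.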
What should projection do?
- Unlike in the relational model, selection is not purely “horizontal” (so we can’t expect project to be purely “vertical”).
- Can one op serve both roles?
  - Select finds match witnesses (localized)
  - Want project to retain all (named) nodes satisfying some predicates in a given source tree, no matter how you match the pattern
- The two ops are still orthogonal
Example operators – Projection
- Input: collection; parameters: pattern, projection list
- Example:
  - Pattern P1 w/ PL = {$1, $2, $3}: one tree for each book published before 2000, showing year and author(s)
  - Pattern P1 w/ PL = {$3}: one tree for each author of aforementioned books
- ‘*’ in PL causes descendants to be retained
- One-to-zero/more op (for reasons different from select)
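Projection with PL = {$1, $2, $3} keeps every matching node of a source tree together, so each qualifying book yields a single output tree rather than one tree per match. A minimal sketch under simplifying assumptions: the `Node` class, the helper names, the `authors` wrapper element, and the -560 encoding of 560 BC are hypothetical, and retained children are emitted as years followed by authors rather than in true document order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    content: object = None
    children: list = field(default_factory=list)


def descendants(node):
    """Yield all proper descendants of node in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def project_p1(root):
    """Project on P1 with PL = {$1, $2, $3}: one tree per matching book."""
    result = []
    for b in [root, *descendants(root)]:
        if b.tag != "book":
            continue
        years = [y for y in b.children
                 if y.tag == "year" and y.content < 2000]      # pc edge
        authors = [a for a in descendants(b)
                   if a.tag == "author"]                        # ad edge
        if years and authors:        # P1 must match at least once
            kept = [Node("year", y.content) for y in years]
            kept += [Node("author", a.content) for a in authors]
            result.append(Node("book", children=kept))
    return result


# Hypothetical data, shaped like the example database.
bib = Node("bib", children=[
    Node("book", children=[
        Node("year", 1910),
        Node("authors", children=[Node("author", "Whitehead"),
                                  Node("author", "Russell")]),
    ]),
    Node("book", children=[Node("year", -560), Node("author", "Panini")]),
])
projected = project_p1(bib)
```

The bib root is not in the projection list, so the single input tree falls apart into one output tree per book, which is why projection is also a one-to-zero/more operator.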
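Projection: P1 w/ PL = {$1,$2,$3}
[Figure: two output trees. The 1910 book has a year child and two author children; the 560 BC book has a year child and one author child. Nodes are labelled with their startPos values.]
With $3*, we can include whole author subtrees.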
Selection vs. Projection Example
Selection:
  FOR $b IN document("doc.xml")//book
  FOR $y IN $b/year[data() < 2000]
  FOR $a IN $b//author
  RETURN <book> {$y} {$a} </book>
versus projection:
  FOR $b IN document("doc.xml")//book[/year/data() < 2000 & author]
  RETURN <book> {$b/year} {$b/author} </book>
Tree Value Functions (TVF)
- What are they?
  - Primitive recursive functions on structure of source trees
  - Codomain must be ordered
- Where are they used? Grouping, ordering, aggregation, etc.
- Here is an example: f: T → value of author, number of authors, tuple of authors, {author tuple, title}, etc.
- Complete example coming up …
Example operators – grouping
- Input: collection; parameters: pattern, grouping TVF, ordering TVF
- Example:
  - input: collection of books
  - pattern:
            $1
        pc /  \ ad
          $2    $3
                | pc
                $4
    $1.tag = book & $2.tag = title & $3.tag = author & $4.tag = name
  - f_g(T) = “$4.content”
  - f_o(T) = “$2.content”
Grouping (contd.)
Here is what the o/p looks like (books ordered by title in each group):
  tax_group_root
    tax_group_basis
      author
        name
    tax_group_subroot
      book
      book
      …
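The shape of the grouping output can be sketched as follows. This is a rough illustration: witnesses are modelled as flat dictionaries rather than trees, the function `group_by` and the field names are hypothetical, and one title is an invented placeholder added so that the ordering TVF has something to sort.

```python
from collections import defaultdict


def group_by(witnesses, f_g, f_o):
    """Group witnesses by the grouping TVF f_g; order each group by f_o."""
    groups = defaultdict(list)
    for w in witnesses:
        groups[f_g(w)].append(w)
    return [
        {"tax_group_root": {
            "tax_group_basis": key,
            "tax_group_subroot": sorted(groups[key], key=f_o),
        }}
        # Groups are sorted by key only to make the output deterministic.
        for key in sorted(groups)
    ]


# One witness per (book, author) match of the grouping pattern.
witnesses = [
    {"title": "Principia Mathematica", "author_name": "Whitehead"},
    {"title": "Principia Mathematica", "author_name": "Russell"},
    {"title": "Ashtadhyayi", "author_name": "Panini"},
    {"title": "A Hypothetical Second Book", "author_name": "Russell"},
]

grouped = group_by(
    witnesses,
    f_g=lambda w: w["author_name"],   # plays the role of f_g(T) = $4.content
    f_o=lambda w: w["title"],         # plays the role of f_o(T) = $2.content
)
```

Because there is one witness per (book, author) match, a book with several authors shows up in several groups; the grouping is not a partition of the input books.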
Other operators
- Derived operators: various joins
- Set operations:
  - When are two data trees the “same”? Equality (shallow/deep) vs. isomorphism (include pedigree or not?)
  - Multiset versions of operators
- Aggregation, Reordering, Renaming
Joins
E = SELECT with pattern:
      $1
      |          $1.tag=book & $2.tag=publisher, SL=$2
      $2
F = SELECT with pattern:
      $3
      | ad       $3.tag=book & $4.tag=author, SL=$4
      $4
G = (F x E)
H = SELECT with pattern:
       $5
      /  \       $5.tag=tax_prod_root & $6.tag=book & $7.tag=book & $6.pedigree=$7.pedigree, SL=$6, $7
     $6   $7
- we joined on pedigrees.
- could have joined on publisher city = author city instead, if desired.
- can express a variety of outerjoins easily.
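The product-then-select construction can be sketched with plain Python data. This is an illustration under stated assumptions: witness trees are flattened to dictionaries, a pedigree is modelled as a (docId, startPos) pair, and the publisher names, positions, and function names are hypothetical.

```python
from itertools import product

# E: (book, publisher) witnesses; F: (book, author) witnesses.
E = [
    {"book": ("bib.xml", 2), "publisher": "Publisher A"},
    {"book": ("bib.xml", 19), "publisher": "Publisher B"},
]
F = [
    {"book": ("bib.xml", 2), "author": "Whitehead"},
    {"book": ("bib.xml", 2), "author": "Russell"},
    {"book": ("bib.xml", 19), "author": "Panini"},
]


def cartesian(left, right):
    """G = left x right: pair every left tree with every right tree."""
    return [{"tax_prod_root": (l, r)} for l, r in product(left, right)]


def select_same_book(prod):
    """H: keep pairs whose two book nodes have equal pedigrees."""
    return [t for t in prod
            if t["tax_prod_root"][0]["book"] == t["tax_prod_root"][1]["book"]]


G = cartesian(F, E)
H = select_same_book(G)
joined = [(t["tax_prod_root"][0]["author"], t["tax_prod_root"][1]["publisher"])
          for t in H]
```

Swapping the predicate in `select_same_book` for a comparison on other attributes gives a value-based join of the kind mentioned above (publisher city = author city).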
XQuery Translation Example 1
FOR $b IN document("doc.xml")//book[//author/@hobby=tennis]
RETURN
<sportydiveshbook>
  $b/title
  IF SOME $a IN $b//author SATISFIES $a/data() = "divesh"
  THEN $b//author
</sportydiveshbook>
Example 1 (contd.)
outer pattern tree:
    $1
    | ad      $1.tag=book & $2.tag=author & $2.hobby=tennis
    $2
inner pattern tree:
    $3
    | ad      $3.tag=book & $4.tag=author & $4.content=divesh
    $4
Example 1 (contd.)
- SELECT input DB w/ outer pattern and empty SL;
- Take Cartesian product with entire input DB;
- SELECT result w/ combined inner+outer pattern and join condition:
    pattern nodes: $5, $6, $7, $8, $9, $10
    $5.tag=tax_prod_root & $6.tag=book & $7.tag=author & $8.tag=book & $8.pedigree=$6.pedigree & $9.tag=author & $9.content=divesh & $10.tag=title
What is wrong with this translation?
Example 1 (contd.)
Pre-IF part E: select w/
    $1
    | ad      $1.tag=book & $2.tag=author & $2.hobby=tennis
    $2        SL = $1*
then project w/
    $3
    |         $3.tag=book & $4.tag=title
    $4        PL = $3, $4
Additional duplicate elimination needed if we don’t know title is unique per book.
Example 1 (contd.)
IF part F: select w/
    $5
    | ad      $5.tag=book & $6.tag=author & $6.content = divesh
    $6        SL = $5*
then project w/
    $7
    | ad      $7.tag=book & $8.tag=author
    $8        PL = $7, $8
Example 1 (contd.)
Do a left outerjoin of E with F w/ the condition $3 = $7 (What does this really entail?)
        tax_prod_root
         /        \
      book        book . . .
       |         /  ...  \
     title   author     author
Project w/ pattern node $9, $9.tag != book, PL = $9
Rename tax_prod_root → sportydiveshbook.
Example 2
FOR $a IN distinct(document("bib.xml")//author)
RETURN
<authorpubs>
  {$a}
  {FOR $b IN document("bib.xml")//article
   WHERE $a = $b/author
   RETURN $b/title}
</authorpubs>
Example 2 (contd.)
- select/project authors and dup-elim.
- join with books based on (pedigree-) equality of book nodes. (So, what should the selection pattern look like?)
- Group by author pedigree.
- Do a project, retaining only author and title.
- Do a final renaming, if needed.
Main Results
- Duplicate elimination by value can be expressed in TAX.
- The operators in TAX are independent.
- TAX is complete for relational algebra w/ aggregation.
- TAX can capture the fragment of XQuery FLWR expressions w/o function calls, recursion, w/ all path expressions using only constants, wildcards, and / & //, when no new ancestor-descendant relationships are created.
Optimization Examples
- Revisit translation example 1: E can be simplified to a single project w/
        $1
       /  \       $1.tag=book & $2.tag=author & $2.hobby=tennis & $3.tag=title
     $2    $3     PL = $1, $3
- Similar simplification applies to F
- Self-join can sometimes be eliminated
- Associativity, commutativity issues
Implementation
- TIMBER system at Univ. of Michigan
- Find pattern tree matches via
  - Index scans
  - Full scans
  - Twig joins
- Joins implemented on streams
- Pedigree: implemented as position of element within document
- Pedigrees similar to RID at impl. level
Summary & Future Work
- TAX: extension of RA for handling heterogeneous collections of ordered labeled trees
- Simplicity; few more operators
- Recognize selective importance of order and handle elegantly
- Bulk algebra for efficient implementation of XML querying
- Stay tuned for TIMBER release(s)
- Future:
  - Arbitrary restructuring: copy-and-paste
  - Updates: principled via operators
More translation examples – ex3
FOR $b IN document("mybib.xml")//book,
    $a IN $b/author
WHERE $b/year < 1990 AND $a/@hobby="tennis"
RETURN
<result>
  {$b//publisher}
  {$a/affiliation}
</result>
What’s a generic way to translate such queries into TAX?
Components: Eforwhere, Ereturn1, Ereturn2, Efinal
- Identify major components in the query statement and associate an expression with each.
- Expressions are developed in cascade.
- Each uses its own pattern (tree).
Pattern Pforwhere, used for creating Eforwhere:
       $b
      /  \
    $a    $y
    $b.tag=book & $a.tag=author & $y.tag=year & $a.hobby="tennis" & $y.content<1990
E0 = SELECT_{Pforwhere, {}}(mybib.xml);
E1 = PROJECT_{P’forwhere, {$b,$a}}(E0);
Eforwhere = DE_{P’forwhere, {$b,$a}}(E1);
P’forwhere: same as Pforwhere, except $y is dropped.
Why need project? Why need DE?
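One plausible reading of the two questions can be sketched as a three-step pipeline. The setup is hypothetical: witnesses are modelled as tuples of node identifiers, and the first book is given two year elements purely to show how duplicates can arise once $y is dropped. The function names are illustrative and not part of TAX.

```python
def select_forwhere(books):
    """E0: one witness (b, a, y) per embedding of Pforwhere."""
    return [(b["id"], a["id"], y)
            for b in books
            for a in b["authors"] if a["hobby"] == "tennis"
            for y in b["years"] if y < 1990]


def project_ba(e0):
    """E1: keep only $b and $a, since $y is not needed downstream."""
    return [(b, a) for (b, a, _y) in e0]


def dup_elim(e1):
    """Eforwhere: remove duplicate witnesses, preserving first-seen order."""
    return list(dict.fromkeys(e1))


# Hypothetical input: book b1 carries two year elements.
books = [
    {"id": "b1", "years": [1985, 1988],
     "authors": [{"id": "a1", "hobby": "tennis"},
                 {"id": "a2", "hobby": "chess"}]},
    {"id": "b2", "years": [1995],
     "authors": [{"id": "a3", "hobby": "tennis"}]},
]

e0 = select_forwhere(books)
e1 = project_ba(e0)
eforwhere = dup_elim(e1)
```

In this sketch the projection discards the year node that was needed only to test the WHERE condition, and duplicate elimination collapses the two witnesses that differ only in that discarded node.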
Pattern used for creating Ereturn1:
  pattern nodes: $tpr (root), $b, $a, $p, $b’, $a’ (one edge is marked ad)
  $tpr.tag=tax_prod_root & $b.tag=$b’.tag=book & $a.tag=author & $p.tag=publisher & $b’.pedigree=$b.pedigree & $a’.tag=author & $a’.pedigree=$a.pedigree
Why did we impose pedigree equality?
Ereturn1 is created via left outer-join, then Project and DE, followed by GROUP-BY.
E2 = LOJ_{P_LG, {$p}}(Eforwhere, mybib.xml);
E3 = PD_{P’_LG, {$b,$a,$p*}}(E2);
Ereturn1 = GP_{P’’_LG, {$b,$a}, {$p*}}(E3);
Efinal = PJ_{P_PJ, {$r, $p*, $l*}}(Ereturn1, Ereturn2);
General translation remarks
- LET clause handled as correlated subquery; E_LET is left outer-joined with E_FORWHERE (just like E_RETURNi).
- Ordering by pedigree (i.e., as in original input) already captured.
- Ordering by other means doable.
- Aggregation: straightforward.
- Nested queries (with correlated subqueries): handled by rewriting them so the query conforms to (FOR LET)* RETURN, where the WHERE clause and ORDER-BY are implicit.