MonetDB/XQuery:
Using a Relational DBMS for XML
Peter Boncz
CWI, The Netherlands
Outline
• Basic XML / XQuery
• Introduction of Pathfinder and MonetDB projects
• Relational XQuery
  – XPath steps in the pre/post plane
  – Translating for-loops, and beyond
• Optimizations
  – Order prevention
  – Loop-Lifted Staircase join
  – Join recognition
• Outlook
  – Conclusions
Peter Boncz, TU Delft, 10-5-2005. Pathfinder - MonetDB/XQuery
XML
• Standard, flexible syntax for data exchange
– Regular, structured data
Database content of all kinds: Inventory, billing, orders, …
“Small” typed values
– Irregular, unstructured text
Documents of all kinds: Transcripts, books, legal briefs, …
“Large” untyped values
• Lingua franca of B2B Applications…
– Increase access to products & services
– Integrate disparate data sources
– Automate business processes
• … and numerous other application domains
– Bio-informatics, library science, …
XML : A First Look
• XML document describing catalog of books
<?xml version="1.0" encoding="ISO-8859-1" ?>
<catalog>
  <book isbn="ISBN 1565114302">
    <title>No Such Thing as a Bad Day</title>
    <author>Hamilton Jordan</author>
    <publisher>Longstreet Press, Inc.</publisher>
    <price currency="USD">17.60</price>
    <review>
      <reviewer>Publisher</reviewer>: This book is the moving
      account of one man's successful battles against three cancers ...
      <title>No Such Thing as a Bad Day</title> is warmly recommended.
    </review>
  </book>
  <!-- more books and specifications -->
</catalog>
XQuery 1.0
• Functional, strongly-typed query language
• XQuery 1.0 =
  XPath 2.0 for navigation, selection, extraction
  + A few more expressions
    – For-Let-Where-Order By-Return (FLWOR)
    – XML construction
    – Operators on types
  + User-defined functions & modules
  + Strong typing
XSLT vs. XQuery
• XSLT 1.0: XML → XML, HTML, Text
  – Loosely-typed scripting language
  – Format XML in HTML for display in browser
  – Must be highly tolerant of variability/errors in data
• XQuery 1.0: XML → XML
  – Strongly-typed query language
  – Large-scale database access
  – Must guarantee safety/correctness of operations on data
• Over time, XSLT & XQuery may both serve needs of many application domains
• XQuery will become a hidden, commodity language
Navigation, Selection, Extraction
• Titles of all books published by Longstreet Press
  $cat/catalog/book[publisher="Longstreet Press"]/title
  => <title>No Such Thing As A Bad Day</title>
• Publications with Jerome Simeon as author or editor
  $cat//*[(author|editor) = "Jerome Simeon"]
  => <book><title>XQuery from the Experts</title>…</book>
     <spec><title>XQuery Formal Semantics</title>…</spec>
Transformation & Construction
• First author & title of books published by A/W
  for $b in $cat//book[publisher = "Addison Wesley"]
  return <awbook> { $b/author[1], $b/title } </awbook>
  =>
  <awbook>
    <author>Don Chamberlin</author>
    <title>XQuery from the Experts</title>
  </awbook>
Sequences & Iteration
• Sequence constructor
  Return all books followed by all W3C specifications
  ($cat/catalog/book, $cat/catalog/W3Cspec)
• XPath expression
  Return all books & W3C specifications in doc order
  $cat/catalog/(book|W3Cspec)
• For expression
  – Similar to map: apply function to each item in sequence
  Return number of authors in each book
  for $b in $cat/catalog/book return fn:count($b/author)
  => (3,1,2,…)
Conditional & Quantified
• Conditional
  if //show[year >= 2000] then "A-OK!" else "Error!"
• Existential quantification
  – Implicit meaning of predicate expressions
    //show[year >= 2000]
  – Explicit expression:
    //show[some $y in ./year satisfies $y >= 2000]
• Universal quantification
  //show[every $y in year satisfies $y >= 2000]
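The difference between the implicit existential reading and the explicit universal one can be mimicked outside XQuery. The following is only an illustrative sketch: the `shows` data and both helper names are made up for this example and do not come from the slides.

```python
# Hypothetical data: each show may carry several <year> values.
shows = [
    {"title": "A", "years": [1998, 2001]},
    {"title": "B", "years": [1995]},
    {"title": "C", "years": [2003, 2004]},
]

def some_year_recent(show):
    # //show[year >= 2000]: true if SOME year satisfies the comparison
    return any(y >= 2000 for y in show["years"])

def every_year_recent(show):
    # //show[every $y in year satisfies $y >= 2000]
    return all(y >= 2000 for y in show["years"])

existential = [s["title"] for s in shows if some_year_recent(s)]
universal = [s["title"] for s in shows if every_year_recent(s)]
```

Show "A" passes the existential test because one of its years qualifies, but fails the universal one.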
Putting It Together
• For each author, return number of books and receipts for books published in past 2 years, ordered by name
  let $cat := fn:doc("www.bn.com/catalog.xml"),                     (: Join :)
      $sales := fn:doc("www.publishersweekly.com/sales.xml")
  for $author in distinct-values($cat//author)                      (: Grouping :)
  let $books := $cat//book[@year >= 2000 and author = $author],     (: S.J. :)
      $receipts := $sales/book[@isbn = $books/@isbn]/receipts
  order by $author                                                  (: Ordering :)
  return
    <sales>                                                         (: XML Construction :)
      { $author }
      <count> { fn:count($books) } </count>                         (: Aggregation :)
      <total> { fn:sum($receipts) } </total>
    </sales>
Recursive Processing
• Recursive functions support recursive data
  Input:
  <part id="001">
    <part id="002">
      <part id="003"/>
    </part>
    <part id="004"/>
  </part>
  => Output:
  <partCt count="2" id="001">
    <partCt count="1" id="002">
      <partCt count="0" id="003"/>
    </partCt>
    <partCt count="0" id="004"/>
  </partCt>
declare function local:partCount($p as element(part))
  as element(partCt) {
  <partCt count="{ count($p/part) }">
    { $p/@id, for $p2 in $p/part return local:partCount($p2) }
  </partCt>
};
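For readers more at home in a general-purpose language, the same recursion can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an analogue of the XQuery function above; the name `part_count` is made up for this example.

```python
import xml.etree.ElementTree as ET

def part_count(p):
    # Count direct <part> children, copy @id, then recurse into each child.
    children = p.findall("part")
    out = ET.Element("partCt", {"count": str(len(children)), "id": p.get("id")})
    for child in children:
        out.append(part_count(child))
    return out

doc = ET.fromstring(
    '<part id="001"><part id="002"><part id="003"/></part><part id="004"/></part>'
)
result = part_count(doc)
serialized = ET.tostring(result, encoding="unicode")
```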
XML Schema Languages
• Many variants…
  – DTDs, XML Schema, RELAX NG, XDuce
• … with similar goals to define
  – Types of literal (terminal) data
  – Names of elements & attributes
• XQuery designed to support (all of) XML Schema
  – Structural & name constraints over types
  – Regular tree expressions over elements, attributes, atomic types
TeXQuery : Full-text extensions
• Text search & querying of structured content
• Limited support in XQuery 1.0
– String operators with collation sequences
$cat//book[contains(review/text(), "two thumbs up")]
• Stop words, proximity searching, ranking
Ex: “Tony Blair” within two words of “George Bush”
• Phrases that span tags and annotations
Ex: Match “Mr. English sponsored the bill” in
<sponsor> Mr. English </sponsor> <footnote> for himself and <co-sponsor> Mr. Coyne </co-sponsor> </footnote> sponsored the bill in the <committee-name> Committee for Financial Services </committee-name>
XQuery Systems: 2 Approaches
• Tree-based
  – Tree is basic data structure
    • Also on disk (if an XQuery DBMS)
  – Navigational Approach
    • Galax [Simeon..], Flux [Koch..], X-Hive
  – Tree Algebra Approach
    • TIMBER [Jagadish..]
• Relational
  – Data shredded in relational tables
  – XQuery translated into database query (e.g. SQL)
The Pathfinder Project
• Challenge / Goal:
  – Turn RDBMSs into efficient XQuery engines
• People:
  – Maurice van Keulen (University of Twente)
  – Torsten Grust, Jens Teubner (University of Konstanz)
  – Jan Rittinger (University of Konstanz & CWI)
• Task: generate code for MonetDB
MonetDB: Applied CS Research at CWI
• a decade of “query-intensive” application experience
• image retrieval: Peter Bosch (ImageSpotter)
• audio/video retrieval: Alex van Ballegooij (RAM)
• XML text retrieval: de Vries / Hiemstra (TIJAH)
• biological sequences: Arno Siebes (BRICKS)
• XML databases: Albrecht Schmidt (XMark); Grust / van Keulen (Pathfinder)
• GIS: Wilco Quak (MAGNUM)
• data warehousing / OLAP / data mining: SPSS (DataDistilleries); Univ. Massachusetts (PROXIMITY)
CWI research group successfully spun off DataDistilleries (now SPSS)
Pathfinder - MonetDB
[Figure: architecture diagram. Box labels: Parser, Sem. Analysis, Core Translation, Typechecking, Relational Algebra, SQL, Core to MIL Translation, MIL (Query Algebra), Database; grouped under Pathfinder and MonetDB]
Open Source
• MonetDB + Pathfinder on SourceForge
  – Mozilla License
• Project homepage
  – http://monetdb.cwi.nl
• Developers website
  – http://sf.net/projects/monetdb
Roadmap
• 14-apr-04: initial beta release MonetDB/SQL
• 30-sep-04: first official release MonetDB/SQL
• 30-may-05: beta release of MonetDB/XQuery (i.e. Pathfinder)
MonetDB
Binary Association Tables (BATs)
BAT storage as thin arrays
MonetDB Particulars
• Column-wise fragmentation
  – BAT: Binary Association Tables [oid,X]
  – Don’t touch what you don’t need
• Void (virtual-oid) columns
  – Contain dense sequence 0,1,2,3,4,…
  – Require no space
  – Positional access (nice for XPath skipping)
    • pre = void
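A rough sketch of why a void column costs no space and gives positional access. The class `VoidBAT` and the example relation are hypothetical illustrations, not MonetDB's actual data structures or API.

```python
# A relation person(name, age) fragmented column-wise into two BATs.
# With a void (virtual-oid) head the oids 0,1,2,... are implicit: only the
# tail values are stored, and the oid is simply the array position.
class VoidBAT:
    def __init__(self, tail, seqbase=0):
        self.seqbase = seqbase   # first oid of the dense sequence
        self.tail = list(tail)   # the only materialized column

    def find(self, oid):
        # positional access: O(1), no search on the head column
        return self.tail[oid - self.seqbase]

    def select(self, pred):
        # return qualifying oids; touches this column only
        return [self.seqbase + i for i, v in enumerate(self.tail) if pred(v)]

name = VoidBAT(["ann", "bob", "cem"])
age = VoidBAT([31, 27, 45])

oids = age.select(lambda a: a > 30)     # scan only the age column
names = [name.find(o) for o in oids]    # fetch names by position
```

The selection never reads the `name` column, which is the "don't touch what you don't need" point of the slide.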
DBMS Architecture
Monet: DBMS Microkernel
MonetDB: extensible architecture
Front-end/back-end:
• support multiple data models
• support multiple end-user languages
• support diverse application domains
Pathfinder: XQuery Frontend
Architecture
XPath on an RDBMS
Node-based relational encoding of XQuery's data model
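As a sketch of what a node-based encoding can look like, the snippet below assigns each element its preorder and postorder rank and evaluates two XPath axes as region queries in the pre/post plane. Function names and the tiny sample document are assumptions made for illustration; the actual Pathfinder schema carries more columns.

```python
import xml.etree.ElementTree as ET

def encode(root):
    """Return rows (pre, post, tag) from one depth-first traversal."""
    rows, counters = [], {"pre": 0, "post": 0}

    def visit(node):
        row = [counters["pre"], None, node.tag]
        counters["pre"] += 1
        rows.append(row)
        for child in node:
            visit(child)
        row[1] = counters["post"]      # post rank is known after all children
        counters["post"] += 1

    visit(root)
    return [tuple(r) for r in rows]

def descendants(table, ctx):
    # descendant axis: later in preorder, earlier in postorder
    pre_c, post_c, _ = ctx
    return [r for r in table if r[0] > pre_c and r[1] < post_c]

def ancestors(table, ctx):
    # ancestor axis: earlier in preorder, later in postorder
    pre_c, post_c, _ = ctx
    return [r for r in table if r[0] < pre_c and r[1] > post_c]

table = encode(ET.fromstring("<a><b><c/><d/></b><e/></a>"))
```

Each axis becomes a plain selection over two integer columns, which is what makes the encoding attractive for a relational engine.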
Tree Knowledge 1: pruning
Tree Knowledge 2: Partitioning
Staircase Join Algorithm
Tree Knowledge 3: Skipping
Pre/Post → Pre/Level/Size
done for better skipping and updates
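The three kinds of tree knowledge can be sketched for the descendant axis on a pre/size encoding. This is a simplified illustration, not the algorithm as published or as implemented in MonetDB; the function name and the sample `size` array are assumptions.

```python
def staircase_descendant(size, context):
    """size[pre] = number of descendants of node pre (pre is the array index).
    context = pre values of the context nodes. Returns result pre values."""
    result, covered_until = [], -1
    for c in sorted(set(context)):
        if c <= covered_until:
            continue                      # pruning: c lies inside an earlier subtree
        covered_until = c + size[c]       # partitioning: subtrees no longer overlap
        # skipping: the descendants are exactly the dense pre range after c
        result.extend(range(c + 1, covered_until + 1))
    return result

# Document <a><b><c/><d/></b><e/></a>: pre 0..4 with these subtree sizes
size = [4, 2, 0, 0, 0]
from_root = staircase_descendant(size, [1, 2, 0])
from_b = staircase_descendant(size, [2, 1, 4])
```

Because pruned context nodes cover disjoint pre ranges, the result is duplicate-free and already in document order, so no sort or duplicate elimination is needed afterwards.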
Updates
• Dense pre-numbers are nice for XPath
  – Positional skipping in Staircase join!
• But how to handle updates?
Planned Update Solution
XPath → XQuery
Sequence Representation
• sequence = table of items
• add pos column for maintaining order
• ignore polymorphism for the moment
(10, "x", <a/>, 10) →
pos | item
1   | 10
2   | "x"
3   | pre(a)
4   | 10
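A minimal sketch of this representation, with a sequence stored as `(pos, item)` rows. The helper names and the pre value 7 standing in for `pre(a)` are made up for the example.

```python
def to_table(seq):
    # pos starts at 1 and records the sequence order
    return [(pos, item) for pos, item in enumerate(seq, start=1)]

def concat(t1, t2):
    # sequence construction (s1, s2): union, with pos of t2 shifted behind t1
    offset = len(t1)
    return t1 + [(pos + offset, item) for pos, item in t2]

# a node item is represented by its pre value; 7 is a made-up pre(a)
NODE_A = ("node", 7)
table = to_table([10, "x", NODE_A, 10])
joined = concat(to_table([1, 2]), to_table([3]))
```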
For-loops: the iter column
Loop-lifting
Full Example
join calc project
Mapping Rules
XQuery construct → relational algebra (see VLDB’04 / TDM’04 [Grust, Teubner])
– Sequence construction → union
– If-Then-[Else] → select, [union]
– For loop → map with cartesian product (all combinations)
– Calculations → projection expressions
– List-functions (e.g. fn:first) → select(pos=1)
– Element Construction → updates using descendant
– Path steps → selections on the pre/post plane
  • Staircase join [VLDB03]:
    – Single-pass for a *set* of context nodes
    – elaborate skipping!
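To make the for-loop rule concrete, here is a hedged sketch that evaluates `for $x in (10, 20, 30) return $x * 2 + 1` on `(iter, pos, item)` tables, so that each operator handles all iterations at once. The operator names are invented for this example and only loosely follow the rules listed above.

```python
def bind_for(seq_table):
    # each item of the input sequence becomes its own iteration: iter = pos
    return [(pos, 1, item) for pos, item in seq_table]

def lift_const(loop, value):
    # a constant is replicated once per iteration (product with the loop relation)
    return [(it, 1, value) for it in loop]

def calc(op, left, right):
    # calculation = join on iter followed by a projection expression
    # (assumes one item per iteration on the right-hand side)
    rmap = {it: item for it, _, item in right}
    return [(it, pos, op(item, rmap[it])) for it, pos, item in left if it in rmap]

def back_map(body):
    # leaving the loop: iterations are concatenated in iter order
    ordered = sorted(body, key=lambda r: (r[0], r[1]))
    return [(new_pos, item) for new_pos, (_, _, item) in enumerate(ordered, start=1)]

x = bind_for([(1, 10), (2, 20), (3, 30)])
loop = [it for it, _, _ in x]
doubled = calc(lambda a, b: a * b, x, lift_const(loop, 2))
body = calc(lambda a, b: a + b, doubled, lift_const(loop, 1))
result = back_map(body)
```

The plan contains no per-item control flow: the loop has turned into the `iter` column, which is the essence of loop-lifting.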
XMark Query 2
XMark Query 2 (common subexpr)
Order Prevention
To encode order, we use the pos column
New pos columns are created using DENSE RANK (SQL) primitive
• Needs [pos] | [iter] order
• More commonly [iter,pos]
This requires a lot of sorting, which is often not necessary
[VLDB03 Wang&Cherniack] define:
• Order properties of relations
• Order propagation rules for relational operators
Decoration of physical plans with order properties → eliminate sort
New ideas:
• RefineSort: pipelined algorithm that extends sort order
• Order property [C1] | [C2]
  “for each equal value of [C2] in order of appearance, the values in [C1] are monotonically increasing”
• Hash-based DENSE RANK only requires [pos] | [iter]
  → sorts on [iter,pos] avoided
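A sketch of how a hash-based dense rank can renumber `pos` per `iter` when the input only satisfies [pos] | [iter], that is, without first sorting on [iter,pos]. The function name and sample rows are assumptions; this illustrates the idea, not MonetDB's operator.

```python
def dense_rank_hash(rows):
    """rows: (iter, pos, item) with property [pos] | [iter]:
    within each iter, pos is non-decreasing in order of appearance.
    Returns rows with pos renumbered densely 1,2,3,... per iter."""
    state = {}                       # iter -> (last pos seen, last rank given)
    out = []
    for it, pos, item in rows:
        last_pos, rank = state.get(it, (None, 0))
        if pos != last_pos:
            rank += 1                # a new pos value within this iter
        state[it] = (pos, rank)
        out.append((it, rank, item))
    return out

# iters are interleaved (not sorted on iter), but pos ascends within each iter
rows = [(2, 5, "a"), (1, 3, "b"), (2, 9, "c"), (1, 4, "d"), (1, 8, "e")]
ranked = dense_rank_hash(rows)
```

One pass and one hash table replace a full sort; the output keeps the input row order, so the [pos] | [iter] property still holds for the next operator.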
Join Recognition
– For loop → map with all combinations: O(N*N)
– If a `simple’ condition exists on two loop variables → join
  – Only make a map with the matching combinations
  – E.g. with a hash table: O(N)
Performed on the XCore tree
Recognize if-then expressions
Open question: where to optimize best?
Example:
for $p in $auction/site/people/person
for $t in $auction/site/closed_auctions/closed_auction
where $t/buyer/@person = $p/@id
return $t
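The gain from recognizing the join can be sketched with made-up person and auction records (the dictionaries and field names are hypothetical, and each attribute is assumed to hold a single value).

```python
def nested_loop(people, auctions):
    # literal translation: every ($p, $t) combination is built, then filtered
    return [t for p in people for t in auctions if t["buyer"] == p["id"]]

def hash_join(people, auctions):
    # recognized join: build a hash table on one side, probe with the other
    by_buyer = {}
    for t in auctions:
        by_buyer.setdefault(t["buyer"], []).append(t)
    # iterate persons in order so the result order matches the nested loops
    return [t for p in people for t in by_buyer.get(p["id"], [])]

people = [{"id": "p1"}, {"id": "p2"}, {"id": "p3"}]
auctions = [
    {"buyer": "p3", "item": "i1"},
    {"buyer": "p1", "item": "i2"},
    {"buyer": "p3", "item": "i3"},
]
slow = nested_loop(people, auctions)
fast = hash_join(people, auctions)
```

Both variants return the same sequence, but the second never materializes the non-matching combinations.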
Join Optimization
for $x in $foo
for $y in $bar
where $x/p1/@a < $y/p2/@a
return $x
[Figure: query plan diagrams with operators /p1, /p2, theta-join and project; the second plan adds Aggr(min) and Aggr(max)]
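One plausible reading of the Aggr(min) / Aggr(max) labels: since `<` on sequences is existentially quantified, a pair ($x, $y) qualifies exactly when the smallest value on the left is below the largest value on the right, so each side can be aggregated to a single value before the theta-join. The sketch below checks that equivalence on made-up values; it is an interpretation of the figure, not a transcription of the actual plan.

```python
def theta_pairs_naive(xs, ys):
    # existential semantics: some value of x is below some value of y
    return [(i, j) for i, a_vals in enumerate(xs) for j, b_vals in enumerate(ys)
            if any(a < b for a in a_vals for b in b_vals)]

def theta_pairs_minmax(xs, ys):
    # aggregate first: one min per $x, one max per $y (empty sequences drop out)
    mins = [(i, min(v)) for i, v in enumerate(xs) if v]
    maxs = [(j, max(v)) for j, v in enumerate(ys) if v]
    # the theta-join now compares a single value per side
    return [(i, j) for i, lo in mins for j, hi in maxs if lo < hi]

xs = [[5, 1], [9], []]        # made-up @a values under $x/p1, per $x
ys = [[2, 0], [7, 3], [10]]   # made-up @a values under $y/p2, per $y
naive = theta_pairs_naive(xs, ys)
optimized = theta_pairs_minmax(xs, ys)
```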
Loop-lifted staircase join
• Staircase join [VLDB03]:
  – Single-pass for a *set* of context nodes
  – elaborate skipping!
• Loop-lifting → multiple iters → multiple sets of context nodes
• Loop-Lifted Staircase Join
  – In a single pass: process multiple input context node lists
  – Use a stack
  – Exploit axis properties for pruning
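A hedged sketch of these three ingredients for the descendant axis on a pre/size encoding: one forward pass over the document, a stack of active context nodes, and pruning per iteration. It is a simplification written for this transcript; the function name, the `(iter, pre)` input layout and the sample data are assumptions, not the MonetDB implementation.

```python
def ll_staircase_descendant(size, context):
    """size[pre]: number of descendants of node pre.
    context: (iter, pre) pairs in any order. Returns (iter, pre) result
    pairs, duplicate-free, produced in one forward pass over the document."""
    ctx = sorted(context, key=lambda c: (c[1], c[0]))   # order by pre
    result, stack, active = [], [], set()               # stack holds (end_pre, iter)
    i, v, n = 0, 0, len(size)
    while v < n:
        while stack and stack[-1][0] < v:               # left the subtree of the top entry
            active.discard(stack.pop()[1])
        if not stack:
            if i == len(ctx):
                break                                   # no context nodes left
            v = max(v, ctx[i][1])                       # skipping: jump to next context node
        result.extend((it, v) for _, it in stack)       # v is a descendant for each active iter
        while i < len(ctx) and ctx[i][1] == v:
            it = ctx[i][0]
            if it not in active:                        # pruning: iter already covers this subtree
                stack.append((v + size[v], it))
                active.add(it)
            i += 1
        v += 1
    return result

# Document <a><b><c/><d/></b><e/></a>: pre 0..4 with these subtree sizes
size = [4, 2, 0, 0, 0]
# iter 1 starts at a; iter 2 at b and c; iter 3 at e
pairs = ll_staircase_descendant(size, [(2, 2), (1, 0), (3, 4), (2, 1)])
```

Subtrees nest, so the entry on top of the stack always ends first, which is why a plain stack is enough to track all iterations simultaneously.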
Staircase join
[Figure: document with a single list of context nodes]
Loop-lifted staircase join
[Figure: document with multiple lists of context nodes and an active stack]
Scalability
Test platform
• Opteron 1.6GHz, 8GB RAM, Red Hat Linux 64-bit
• Can process 11GB document!
Mostly linear scaling with document size
• Some swapping in the join queries
• Q11 + Q12 generate quadratic result
XMark 10MB: Pathfinder vs X-Hive & Galax
XMark 1GB: Pathfinder vs X-Hive
did not finish
Conclusions
• Relational approach can be scalable & fast
• Crucial optimizations
  – Join recognition
  – Loop-lifted XPath steps
  – Order awareness
Future Roadmap (beta: May 30, Holland Open)
• Algebraic Query Optimization
• Updates (not in release)