XML Databases
Zachary G. Ives
University of Pennsylvania
CIS 650 – Database & Information Systems
March 23, 2005
Administrivia
We’re moving beyond simple databases now…
For Monday – read & compare focus of:
  Hanson: Scalable Trigger Processing
  Stanford STREAM processor
For Wednesday: Retrospective on Aurora
Today’s Trivia Question
XML: What Makes It Hard?
It’s not normalized…
  It conceptually centers around some origin, meaning that navigation becomes central
    Contrast with E-R diagrams
  How to store the hierarchy?
  Complex navigation
  Updates, locking
  Optimization
Also, it’s ordered
  May restrict order of evaluation (or at least presentation)
  Makes updates more complex
Many of these issues aren’t unique to XML
  Semistructured databases, esp. with ordered collections, were similar
  But our efforts in that area basically failed…
XML: What’s It Good For?
Collections of text documents, e.g., the Web, doc DBs
  … How would we want to query those?
  IR/text queries, path queries, XQueries?
Interchanging data
  SOAP messages, RSS, XML streams
  Perhaps subsets of data from RDBMSs
Storing native, database-like XML data
  Caching
  Logging of XML messages
  …?
Lots of XML Research Out There
Text:
  Hybrids of database and IR techniques for search (e.g., Amer-Yahia & Shanmugasundaram, Weikum & Ramakrishnan, …)
Interchange:
  Web service verification
  XML stream processing
XML databases:
  Natix, TIMBER, …
  Tamino, DB2 UDB, Oracle, …
The Main Focal Points
XML with documents
  Inverted indices
  Integration of ranking into DBMS
  Interaction between structure and content
“Streaming XML”
  RDBMS XML export
  Partitioning of computation between source and mediator
  “Streaming XPath” engines
XML databases
  Hierarchical storage + locking (Natix, TIMBER, BerkeleyDB, Tamino, …)
  Query optimization
Text-Based XML
The fundamental questions:
1. How should we model ranking in query processing?
  Simply as another value (e.g., Amer-Yahia & Shanmugasundaram)
  Using a probabilistic model or as an undefined metric, e.g., Weikum and Ramakrishnan work-in-progress
2. How does structure affect ranking?
  PageRank-style (e.g., Shanmugasundaram et al.)
  Query relaxation (FleXPath)
  Other?
3. How do we achieve efficient pruning?
  A* search [Cohen 98]
  Fagin’s Threshold Algorithm
  Custom logic?
4. How do we integrate keyword indexing with structural indexing?
  Multiple indices (e.g., Lore, Natix, …)
  Integrated indices (e.g., ViST)
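To make the pruning question concrete, here is a minimal sketch of Fagin's Threshold Algorithm for top-k retrieval over several ranked lists. The function name `threshold_topk`, the dict-based input, the use of a sum as the monotone aggregate, and the non-negative scores are all assumptions made for this illustration; none of them comes from the systems cited above.

```python
import heapq

def threshold_topk(lists, k):
    """Top-k objects by summed score, using sorted and random access.

    `lists` is a list of dicts mapping object id -> non-negative score
    (a hypothetical input format); a missing object scores 0 in that list.
    """
    # Sorted access: each list in descending score order.
    ranked = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    seen = {}  # object id -> aggregated (summed) score
    max_depth = max((len(r) for r in ranked), default=0)
    for depth in range(max_depth):
        last_scores = []
        for r in ranked:
            if depth < len(r):
                obj, score = r[depth]
                last_scores.append(score)
                if obj not in seen:
                    # Random access: fetch this object's score from every list.
                    seen[obj] = sum(d.get(obj, 0) for d in lists)
            else:
                last_scores.append(0)  # this list is exhausted
        # Threshold: the best total any still-unseen object could reach.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # safe to stop early: nothing unseen can beat the k-th
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
```

The early exit is the pruning step: once the k-th best aggregate is at least the threshold, the remaining tails of the lists never need to be read.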
XML as a Wire Format
RDBMS XML export
  SilkRoute and Xperanto, outer unions
  Interaction with RDBMS optimization techniques
  Updates [Tatarinov+01]
    Cascading updates are already possible in RDBMSs
    Updating XML views
Streaming XML
  SAX-based XPath-matching engines [Ives+01][Altinel&Franklin00][Green+02][Diao&Franklin][Chen+] …
  Push-down of XPath matching as early as possible
  Query decomposition (still in need of a standard means of pushing XQuery to a source)
  Subsets of XQuery that are amenable to streaming
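As a rough illustration of SAX-based path matching, the sketch below evaluates a simple child-axis path (no predicates, wildcards, or `//`) in a single pass, keeping only a stack of open tags. The class `PathMatcher` and the helper `match_path` are hypothetical names for this example; the cited engines handle far richer XPath subsets.

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Collects the text of elements matching a path such as /feed/item/title."""

    def __init__(self, path):
        super().__init__()
        self.target = path.strip("/").split("/")
        self.stack = []      # tags of the currently open elements
        self.buffer = None   # text pieces of the current match, or None
        self.matches = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.target:
            self.buffer = []  # entering a matching element

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if self.stack == self.target and self.buffer is not None:
            self.matches.append("".join(self.buffer))
            self.buffer = None
        self.stack.pop()

def match_path(xml_text, path):
    """Run the matcher over a document held in a string."""
    handler = PathMatcher(path)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matches
```

Memory use is bounded by document depth plus the size of one match, which is the property that makes this style of evaluation attractive for streams.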
XML in a Database
Use a legacy RDBMS
  Shredding [Shanmugasundaram+99] and many others
  Path-based encodings [Cooper+01]
  Region-based encodings [Bruno+02][Chen+04]
  Order preservation in updates [Tatarinov+02], …
  What’s novel here? How does this relate to materialized views and warehousing?
Native XML databases
  Hierarchical storage (Natix, TIMBER, BerkeleyDB, Tamino, …)
  Updates and locking
  Query optimization (e.g., that on Galax)
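A region-based encoding can be sketched as follows: each element gets a (start, end, level) triple from a depth-first walk, so ancestor/descendant tests reduce to interval containment and can be run as relational joins. The tuple layout and the function names below are assumptions for illustration, not the exact schemes of the cited papers.

```python
import xml.etree.ElementTree as ET

def region_encode(xml_text):
    """Return one row (tag, start, end, level) per element, in document order."""
    rows = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        index = len(rows)
        rows.append(None)            # reserve the slot to keep document order
        for child in elem:
            visit(child, level + 1)
        counter += 1                 # the end position closes the region
        rows[index] = (elem.tag, start, counter, level)

    visit(ET.fromstring(xml_text), 0)
    return rows

def is_ancestor(a, d):
    """True if row a's region strictly contains row d's region."""
    return a[1] < d[1] and d[2] < a[2]

def is_parent(a, d):
    """Parent/child is containment plus a level difference of one."""
    return is_ancestor(a, d) and d[3] == a[3] + 1
```

For `<a><b><c/></b><d/></a>` the element `a` gets region (1, 8), `b` gets (2, 5), `c` gets (3, 4), and `d` gets (6, 7), so `c` lies inside `b` and `a` but not inside `d`.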
Query Processing for XML
Why is optimization harder?
  Hierarchy means many more joins (conceptually)
    “traverse”, “tree-match”, “x-scan”, “unnest”, “path”, … op
    Though typically parent-child relationships
    Often don’t have good measure of “fan-out”
    More ways of optimizing this
  Order preservation limits processing in many ways
  Nested content ~ left outer join
    Except that we need to cluster a collection with the parent
    Relationship with NF² approach
  Tags (don’t really add much complexity except in trying to encode efficiently)
  Complex functions and recursion
    Few real DB systems implement these fully
Why is storage harder?
  That’s the focus of Natix, really
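The "nested content ~ left outer join" point can be sketched in a few lines: the outer join keeps parents that have no children, and the nesting step only works if each parent's rows stay adjacent (clustered). The relation layouts and the names `left_outer_join` and `nest` are assumptions made for this example.

```python
from itertools import groupby

def left_outer_join(parents, children):
    """parents: [(pid, name)], children: [(pid, value)].

    Returns (pid, name, value) rows in parent order; a parent with no
    children yields a single row whose value is None.
    """
    by_parent = {}
    for pid, value in children:
        by_parent.setdefault(pid, []).append(value)
    rows = []
    for pid, name in parents:
        for value in by_parent.get(pid, [None]):
            rows.append((pid, name, value))
    return rows

def nest(rows):
    """Fold join rows into (name, [values]) per parent.

    Relies on all rows of one parent being adjacent, i.e. clustered.
    """
    nested = []
    for (_, name), group in groupby(rows, key=lambda r: (r[0], r[1])):
        nested.append((name, [v for _, _, v in group if v is not None]))
    return nested
```

If an operator between the join and the nesting step reorders rows, `groupby` would split one parent into several groups, which is one way to see why order and clustering constrain the optimizer.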
The Natix System
In contrast to many pieces of work on XML, focuses on the bottom layers, equivalent to System R’s RSS
  Physical layout
  Indexing
  Locking/concurrency control
  Logging/recovery
Physical Layout
What are our options in storing XML trees?
At some level, it’s all smoke-and-mirrors
  Need to map to “flat” byte sequences on disk
But several options:
  Shred completely, as in many RDBMS mappings
    Each path may get its own contiguous set of pages, e.g., vectorized XML [Buneman et al.]
    An element may get its 1:1 children, e.g., shared inlining [Shanmugasundaram+] and [Chen+]
    All content may be in one table, e.g., [Florescu/Kossmann] and most interval encoded XML
  We may embed a few items on the same page and “overflow” the rest
    How collections are often stored in ORDBMS
  We may try to cluster XML trees on the same page, as “interpreted BLOBs”
    This is Natix’s approach (and also IBM’s DB2)
Pros and cons of these approaches?
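For the "all content in one table" option, a simplified edge-table shredding can be sketched with an in-memory SQLite database. The table name `edge`, its columns, and both helper functions are assumptions for illustration; this is not the schema of any cited paper.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred_to_edge_table(xml_text):
    """Store every element as one row: (id, parent, ord, tag, text)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (id INTEGER PRIMARY KEY, parent INTEGER, "
        "ord INTEGER, tag TEXT, text TEXT)"
    )
    next_id = 0

    def visit(elem, parent_id, ordinal):
        nonlocal next_id
        next_id += 1
        node_id = next_id  # ids follow document (pre-order) order
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?, ?)",
            (node_id, parent_id, ordinal, elem.tag, (elem.text or "").strip()),
        )
        for i, child in enumerate(elem):
            visit(child, node_id, i)

    visit(ET.fromstring(xml_text), None, 0)
    return conn

def titles_of_books(conn):
    """Evaluate the path book/title as a self-join on the edge table."""
    sql = (
        "SELECT t.text FROM edge b JOIN edge t ON t.parent = b.id "
        "WHERE b.tag = 'book' AND t.tag = 'title' ORDER BY t.id"
    )
    return [row[0] for row in conn.execute(sql)]
```

Each additional path step costs another self-join, which is the usual argument against full shredding and in favor of clustering subtrees on a page.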
Challenges of the Page-per-Tree Approach
How big of a tree?
What happens if the XML overflows the tree?
Natix claims an adaptive approach to choosing the tree’s granularity
  Primarily based on balancing the tree, constraints on children that must appear with a parent
  What other possibilities make sense?
Natix uses a B+ Tree-like scheme for achieving balance and splitting a tree across pages
Example
Split point in parent page
Note “proxy” nodes
That Was Simple – But What about Updates?
Clearly, insertions and deletions can affect things
  Deletion may ultimately require us to rebalance
  Ditto with insertion
But insertion also may make us run out of space – what to do?
  Their approach: add another page; ultimately may need to split at multiple levels, as in B+ Tree
Others have studied this problem and used integer encoding schemes (plus B+ Trees) for the order
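One generic way to encode order with integers is sparse numbering: leave gaps between sibling keys so that most insertions need no renumbering. The sketch below is an assumption-laden illustration (the `GAP` constant and both function names are made up here) and is not the scheme of any particular cited paper.

```python
GAP = 100  # assumed spacing between consecutive order keys

def initial_keys(n):
    """Evenly spaced order keys for n siblings."""
    return [GAP * (i + 1) for i in range(n)]

def insert_at(keys, position):
    """Insert a new sibling before index `position`.

    Returns (new_keys, relabeled), where relabeled is True if the gap
    was exhausted and every key had to be reassigned.
    """
    lo = keys[position - 1] if position > 0 else 0
    hi = keys[position] if position < len(keys) else lo + 2 * GAP
    if hi - lo >= 2:
        mid = (lo + hi) // 2  # a free integer strictly between the neighbors
        return keys[:position] + [mid] + keys[position:], False
    # No free integer left between the neighbors: renumber all siblings.
    return initial_keys(len(keys) + 1), True
```

Storing these keys in a B+ Tree gives ordered scans, and the occasional relabeling is the price paid for cheap ordinary inserts.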
Does this Help?
According to general lore, yes
The Natix experiments in this paper were limited in their query and adaptivity loads
But the IBM guys say their approach, which is similar, works significantly better than Oracle’s shredded approach
There’s More to Updates than the Pages
What about concurrency control and recovery?
We already have a notion of hierarchical locks, but they claim:
  If we want to support IDREF traversal, and indexing directly to nodes, we need more
What’s the idea behind SPP locking?
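As background for the hierarchical-lock baseline, here is a minimal sketch of standard multi-granularity locking on a document tree: intention locks on the ancestors, the real lock on the target. It covers only the IS, IX, S, and X modes and is not SPP locking; the function names and the path-as-list representation are assumptions for this example.

```python
# Which held modes a requested mode can coexist with on the same node.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}

def locks_for(path, mode):
    """Locks needed to access the last node of a root-to-node path.

    path: node ids from the root down to the target; mode: "S" or "X".
    """
    intent = "IS" if mode == "S" else "IX"
    return [(node, intent) for node in path[:-1]] + [(path[-1], mode)]

def can_grant(held, requested):
    """True if every requested lock is compatible with locks held by others."""
    held_by_node = {}
    for node, mode in held:
        held_by_node.setdefault(node, []).append(mode)
    return all(
        held_mode in COMPATIBLE[req_mode]
        for node, req_mode in requested
        for held_mode in held_by_node.get(node, [])
    )
```

The scheme assumes every access walks down from the root. Jumping straight to a node through an IDREF or an index skips the ancestors, which is the gap the slide says needs something more.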
Logging
They claim ARIES needs some modifications – why?
Their changes:
  Need to make subtree updates more efficient – don’t want to write a log entry for each subtree insertion
    Use (a copy of) the page itself as a means of tracking what was inserted, then batch-apply to WAL
  “Annihilators”: if we undo a tree creation, then we probably don’t need to worry about undoing later changes to that tree
  A few minor tweaks to minimize undo/redo when only one transaction touches a page
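The annihilator idea can be illustrated with a toy undo planner: when a transaction's rollback will undo the creation of a subtree, updates to that subtree are skipped because removing the subtree wipes them out anyway. The log format, the operation names, and `plan_undo` are hypothetical and much simpler than Natix's actual log records.

```python
def plan_undo(log):
    """Pick the log records that rollback really has to undo.

    log: (lsn, op, subtree_id) tuples for one transaction, oldest first,
    with op either "create" or "update" (an assumed, simplified format).
    Returns the LSNs to undo, newest first.
    """
    # Subtrees created by this transaction: undoing the creation removes them.
    created = {subtree for _, op, subtree in log if op == "create"}
    to_undo = []
    for lsn, op, subtree in reversed(log):
        if op == "update" and subtree in created:
            continue  # annihilated by the later undo of the creation
        to_undo.append(lsn)
    return to_undo
```

For a log that creates subtree t1, updates it twice, and updates an older subtree t0 once, only the t0 update and the creation itself are undone.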
Annihilators
Assessment
Native XML storage isn’t really all that different from other means of storage
  There are probably some good reasons to make a few tweaks in locking
  Optimization stays harder
A real solution to materialized view creation would probably make RDBMSs come close to delivering the same performance, modulo locking
Questions
Where are the main challenges of XML processing at this point?
  Impact of BinaryXML?
Are we working on the right problems?
What’s XML going to be used for, anyway?