the storage and benchmarking of xml
DESCRIPTION
The Storage and Benchmarking of XML. Presenter: Kevin See ([email protected]) IBM Toronto Lab. DB2 SQL /Catalog Development Date: Nov 2, 2001. . Outline. Introduction Text file / OODBMS/ native DB approach Relational DB approaches Categories 2 latest proposals XML benchmarks - PowerPoint PPT PresentationTRANSCRIPT
Nov 2, 2001The Storage and Benchmarking of
XML 1
The Storage and Benchmarking of XML
Presenter: Kevin See([email protected])
IBM Toronto Lab. DB2 SQL /Catalog Development
Date: Nov 2, 2001
<?XML?>
Nov 2, 2001The Storage and Benchmarking of
XML 2
Outline Introduction Text file / OODBMS/ native DB
approach Relational DB approaches
Categories 2 latest proposals
XML benchmarks Conclusion
Nov 2, 2001The Storage and Benchmarking of
XML 3
Introduction XML is emerging to become the
standard for data exchanging Demand for storage and
management of the XML documents is growing
There are a few ways to manage the XML document
Nov 2, 2001The Storage and Benchmarking of
XML 4
Text Approach File system A separate query engine will need to be
implemented Parsing not possible Index strategies : (parent_offset, tag),
(child_offset, parent_offset), (tagname, value), (attribute_name, attribute_value)
not good for update
Nov 2, 2001The Storage and Benchmarking of
XML 5
Object-oriented Database Management System Michael R. Olson and Byung S. Lee
(1997) OO model fit well Immature technology: Hard to
scale Conclude the experiment without
any great success.
Nov 2, 2001The Storage and Benchmarking of
XML 6
Native Database Approach Prototypes: Lore (Stanford
University), Xyleme (INRIA, France).
Immature technology No optimization capabilities
Nov 2, 2001The Storage and Benchmarking of
XML 7
Relational Database Technology Very mature technology Query optimization techniques and
the processing mechanisms in relational databases have been studied for a quarter of a century
A very large percentage of the data are currently stored in RDMS
Nov 2, 2001The Storage and Benchmarking of
XML 8
Storage and Retrieval of XML Using Relational Database
XML
Table1
Table1
Relational to XMLConversion
XML to Relational Mapping
XML Query to SQL SQL
XML
XMLQueryLanguagesuch asXPath,XQuery,Quilt, XQL
Nov 2, 2001The Storage and Benchmarking of
XML 9
Classifications of Various Mapping Methods Structure-mapping
approach The XML document’s
logical structures (or DTDs if available) are represented by the database schemas
1 DTD : 1 set of generated schemas
Model-mapping approach Constructs of XML
model are represented by the database schemas
1 set of generated schemas for all/any DTD
Nov 2, 2001The Storage and Benchmarking of
XML 10
Relational Schema Prototype Tree Mapping Method M. Yan and A. Fu @ The Chinese
University of Hong Kong (2001) Structure-mapping Global Schema Extraction
Algorithm DTD Splitting Schema Extraction
Algorithm
Nov 2, 2001The Storage and Benchmarking of
XML 11
Relational Schema Prototype Tree Mapping Method (Cont’d) Relational Databases for Querying XML
Documents: Limitations and Opportunities (J. Shanmugasundaram)
Basic steps:1. Simplify DTD2. Construct schema prototype tree3. Generate relational schema prototypes4. Detect functional dependencies and
candidate keys5. Normalize the relational schema
prototypes
Nov 2, 2001The Storage and Benchmarking of
XML 12
DTD Splitting Schema Extraction Algorithm Step 1: Simplify DTDp|p' p, p'p+ p*(p, p') p, p'..., p,..., p*,... p*p? p(p, p’)* p*, p’*..., p,..., p,... p*
Nov 2, 2001The Storage and Benchmarking of
XML 13
<!ENTITY %txt “(#PCDATA)”>
<!ELEMENT book (booktitle, price?, author, authority*)>
<!ELEMENT authority (authname, country)>
<!ELEMENT authname %txt>
<!ELEMENT country %txt>
<!ELEMENT booktitle %txt>
<!ELEMENT price %txt>
<!ELEMENT monograph (title, author, editor)>
<!ELEMENT title %txt>
<!ELEMENT editor (monograph+)>
<!ATTLIST editor name CDATA #REQUIRED>
<!ELEMENT author (name, address)>
<!ATTLIST author id ID>
<!ELEMENT name (firstname, lastname)>
<!ELEMENT firstname %txt>
<!ELEMENT lastname %txt>
<!ELEMENT address %txt>
An Book DTD
Nov 2, 2001The Storage and Benchmarking of
XML 14
<!ELEMENT book (booktitle, price, author, authority*)><!ELEMENT authority (authname, country)><!ELEMENT authname (#PCDATA)><!ELEMENT country (#PCDATA)><!ELEMENT booktitle (#PCDATA)><!ELEMENT price (#PCDATA)><!ELEMENT monograph (title, author, editor)><!ELEMENT title (#PCDATA)><!ELEMENT editor (monograph*)><!ATTLIST editor name CDATA><!ELEMENT author (name, address)><!ATTLIST author id ID><!ELEMENT name (firstname, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address (#PCDATA)>
Transformed/ Simplified DTD
Nov 2, 2001The Storage and Benchmarking of
XML 15
Step 2: Construct Schema Prototypes Trees1. Only an element can become a root2. An element that is not nested inside other
elements can become the root3. A non-#PCDATA element that is nested in
more than 1 other element becomes the root4. If a non-#PCDATA element B is not the only
subelement of A and B only appears in A with a “*”, it becomes the root
5. One of the elements in the recursion is selected as root should recursion occurs in the DTD
Nov 2, 2001The Storage and Benchmarking of
XML 16
Roots for the Example DTD
Element book is selected as root – rule 2
Element author is selected as root – rule 3
Element authority is selected as root – rule 4
Element monograph is selected as root – rule 5
<!ELEMENT book (booktitle, price, author, authority*)><!ELEMENT authority (authname, country)><!ELEMENT authname (#PCDATA)><!ELEMENT country (#PCDATA)><!ELEMENT booktitle (#PCDATA)><!ELEMENT price (#PCDATA)><!ELEMENT monograph (title, author, editor)><!ELEMENT title (#PCDATA)><!ELEMENT editor (monograph*)><!ATTLIST editor name CDATA><!ELEMENT author (name, address)><!ATTLIST author id ID><!ELEMENT name (firstname, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address (#PCDATA)>
Nov 2, 2001The Storage and Benchmarking of
XML 17
Step 2: Construct Schema Prototypes Trees (Cont’d) Tree construction:
Depth-first scan on DTD for all selected root(s) starting from the subelements of the root
New nodes for each visited elements and attributes
A mixed element (element containing both #PCDATA and other subelement) will be marked with a “#” in the tree
Recursion – a new leaf node with label <node name>.A
Nov 2, 2001The Storage and Benchmarking of
XML 18
Schema Prototype Trees
Nov 2, 2001The Storage and Benchmarking of
XML 19
Step 3: Generate Relational Schema Prototype All necessary descendants are
inlined starting from the root except key nodes or foreign key nodes.
Nov 2, 2001The Storage and Benchmarking of
XML 20
Relational Schema Prototype
Book (booktitle, price)Authority (country, authname)Author (address, id, firstname, lastname)Monograph (title, name)
Nov 2, 2001The Storage and Benchmarking of
XML 21
Step 4: Discover FDs and Candidate Keys Functional dependencies (FDs) and
the candidate keys discovery by analyzing the XML data
TANE algorithm (http://www.cs.helsinki.fi/research/fdk/datamining/tane/)
Nov 2, 2001The Storage and Benchmarking of
XML 22
Candidate Keys
Book {booktitle}Authority {country, authname}Monograph {title}Author {id}, {lastname, address}
Nov 2, 2001The Storage and Benchmarking of
XML 23
Relational Schema Prototype With Candidate Keys
Book (booktitle, price, author.id)Authority (country, authname, assigned, book.booktitle)Author (address, id, firstname, lastname)Monograph (title, name, author.id, monograph.title)Book (booktitle, price)
Authority (country, authname)Author (address, id, firstname, lastname)Monograph (title, name)
Nov 2, 2001The Storage and Benchmarking of
XML 24
Step 5: Normalize the Relational Schema Prototypes The last step. Normalize the schema to 3NF
(third normal form) if possible. Structure mapping methods does
not handle order but leave it to metadata or user to handle.
Nov 2, 2001The Storage and Benchmarking of
XML 25
X-Rel
Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura and Shunsuke Uemura @ Nara Institute of Science and Technology, Japan (2001)
Model-mapping Data model: XPath (root node,
element nodes, attribute nodes, and text nodes)
The concept of region
Nov 2, 2001The Storage and Benchmarking of
XML 26
Definition of Region
The region of: An element node or a text node is a
pair of numbers representing the start and end positions of the node in the XML document
An attribute node is a pair of identical numbers equal to the start position of the parent element node plus one
Nov 2, 2001The Storage and Benchmarking of
XML 27
Simple Path Expressions
Path – an unit of decomposition of XML trees Store simple path expression (denoted by
SimplePathExpr) from the root node. Why? Path is appear in XML queries frequently
Nov 2, 2001The Storage and Benchmarking of
XML 28
Why “#” Is Added?
Look for family descendants of issue.1. WHERE p1.pathexp LIKE
‘/issue%/family’2. WHERE p1.pathexp LIKE ‘#/issue#
%/family’ /issuelist/family (WRONG) is match
for the first but not the second.
Nov 2, 2001The Storage and Benchmarking of
XML 29
<Paper Title = “The Suffix-Signature Method for Searching Phrases in Text”><Authors>
<FN> Mei </FN><LN> Zhou </LN>
<Affiliation> Open Text Corporation </Affiliation> </Authors>
<Authors> <FN> Frank </FN><LN> Tompa </LN>
<Affiliation> University of Waterloo </Affiliation> </Authors></Paper>
Example XML Document
Nov 2, 2001The Storage and Benchmarking of
XML 30
ROOT
Paper
Element
attribute
abc
text
string-valueTitle
The Suffix-Signature Method forSearching Phrases in Text
Authors Authors
FN FNLN LN AffiliationAffiliation
Mei Zhou OpenTextCorporation
Frank Tompa UniversityOFWaterloo
1
2
3 4 11
5 7 9 12 14 16
6 8 10 13 15 1711
XML Tree
Nov 2, 2001The Storage and Benchmarking of
XML 31
Simple Path Expressions /Regions
Node 3 #/Paper#/@Title (1,1)
Node 9 #/Paper#/
Authors#/affiliation
(99, 145)
ROOT
Paper
Element
attribute
abc
text
string-valueTitle
The Suffix-Signature Method forSearching Phrases in Text
Authors Authors
FN FNLN LN AffiliationAffiliation
Mei Zhou OpenTextCorporation
Frank Tompa UniversityOFWaterloo
1
2
3 4 11
5 7 9 12 14 16
6 8 10 13 15 1711
Nov 2, 2001The Storage and Benchmarking of
XML 32
Mapping Idea
A relational table per node type Simple path expression are normalized docID is introduced Basic XRel schema Element (docID, pathID, start, end, index,
reindex) Attribute (docID, pathID, start, end, value) Text (docID, pathID, start, end, value) Path (pathID, pathexp)
Nov 2, 2001The Storage and Benchmarking of
XML 33
Table - Element
docID pathID start end index reindex
1 1 0 257 1 1
1 3 66 155 1 2
1 4 75 86 1 2
1 5 87 99 1 2
1 6 100 145 1 2
1 3 156 249 2 1
1 4 165 178 2 1
1 5 179 192 2 1
1 6 193 239 2 1
Nov 2, 2001The Storage and Benchmarking of
XML 34
Table - Attribute
docID pathID start end value
1 2 1 1 The Suffix-Signature Method for Searching Phrases in Text
Nov 2, 2001The Storage and Benchmarking of
XML 35
Table - Text
docID
pathID
start
end value
1 4 79 81 Mei
1 5 91 94 Zhou
1 6 113 131 Open Text Corporation
1 4 169 173 Frank
1 5 183 187 Tompa
1 6 206 225 University of Waterloo
Nov 2, 2001The Storage and Benchmarking of
XML 36
Table - Path
pathID pathexpr
1 #/Paper
2 #/Paper/@Title
3 #/Paper#/Authors
4 #/Paper#/Authors#/FN
5 #/Paper#/Authors#/LN
6 #/Paper#/Authors#/Affiliation
Nov 2, 2001The Storage and Benchmarking of
XML 37
XML Benchmarking Desiderata
Bulk loading Reconstruction Path traversals Casting Missing elements
Nov 2, 2001The Storage and Benchmarking of
XML 38
XML Benchmarking Desiderata (continued) Ordered access References Joins Construction of large results Containment, full-text search
Nov 2, 2001The Storage and Benchmarking of
XML 39
XML Benchmarks
3 XML benchmark proposals. XMach-1.
University of Leipzig, Germany. XML benchmark project.
CWI, the Netherlands. Kanda et al. Proposal (unpublished).
University of Michigan, IBM Toronto lab center for advanced studies.
Nov 2, 2001The Storage and Benchmarking of
XML 40
Conclusion
From the different mapping approaches and experiments, there are a few places where relational database enhancement can help in coping with XML model differences. Support for sets. Flexible comparisons operators. Multi-predicate merge join.
Nov 2, 2001The Storage and Benchmarking of
XML 41
Questions & Answers
Nov 2, 2001The Storage and Benchmarking of
XML 42
Appendix A
Enhancing Structural Mappings Based on Statistics
Nov 2, 2001The Storage and Benchmarking of
XML 43
Optimal Hybrid Database Algorithm M. Klettke, and H. Meyer XML and object-relational database
systems - enhancing structural mappings based on statistics (2000)
An algorithm that finds a type of optimal mapping based on the statistics and the DTD
Nov 2, 2001The Storage and Benchmarking of
XML 44
Optimal Hybrid Database Algorithm
1. Build a graph representing the hierarchy of the elements and attributes of the DTD.
2. For every element/attribute of the graph, a measure of significance, w, is determined.
3. Derive the resulting database design from the graph.
Nov 2, 2001The Storage and Benchmarking of
XML 45
Nov 2, 2001The Storage and Benchmarking of
XML 46
Graph for an Example DTD
Nov 2, 2001The Storage and Benchmarking of
XML 47
Calculate the Weight (Step 2)
W = 1/6 (SQ + SA + SH) + ¼ (DA/DG) + ¼ (QA/QG) where SQ - exploitation of quantifiers SA - exploitation of alternatives SH - position in the hierarchy DA -number of documents containing the element/attribute DG - absolute number of XML documents QA - number of queries containing the element/attribute QG - absolute number of queries
Nov 2, 2001The Storage and Benchmarking of
XML 48
The Graph With the Colored Weight
Nov 2, 2001The Storage and Benchmarking of
XML 49
Step 3 - Deriving Hybrid Databases From the Graph First, specify a limit on which
attributes and/or elements is represented as attributes of the databases and which attributes and/or element are represented as XML attributes
Nov 2, 2001The Storage and Benchmarking of
XML 50
Step 3 - Deriving Hybrid Databases From the Graph Then, search for all nodes of the
graph that satisfy the following conditions:
The node is not a leaf of the graph The node and all its descendants
are below the limit given No predecessor that satisfies the
first two conditions exists.
Nov 2, 2001The Storage and Benchmarking of
XML 51
Step 3 - Deriving Hybrid Databases From the Graph The selected nodes and its
descendents (the whole sub-graph) will be replaced by an XML attribute. (A BLOB like attribute)
All other elements and attributes will be mapped to relational database using mapping.
Nov 2, 2001The Storage and Benchmarking of
XML 52
Resulting XML Attributes for the Example DTD
Nov 2, 2001The Storage and Benchmarking of
XML 53
Referenceshttp://www.geocities.com/ysee.geo/ml.htmlhttp://db-www.aist-nara.ac.jp/members/Yoshikawa/paper/TOIT2001.pdfhttp://citeseer.nj.nec.com/454884.htmlhttp://dol.uni-leipzig.de/pub/2001-1/enhttp://www.cs.wisc.edu/niagara/papers/xmlstore.pdfhttp://www.cwi.nl/htbin/ins1/publications?request=papers