the storage and benchmarking of xml

Nov 2, 2001The Storage and Benchmarking of

XML 1

The Storage and Benchmarking of XML

Presenter: Kevin See([email protected])

IBM Toronto Lab. DB2 SQL /Catalog Development

Date: Nov 2, 2001

<?XML?>


XML 2

Outline Introduction Text file / OODBMS/ native DB

approach Relational DB approaches

Categories 2 latest proposals

XML benchmarks Conclusion


XML 3

Introduction XML is emerging to become the

standard for data exchanging Demand for storage and

management of the XML documents is growing

There are a few ways to manage the XML document


XML 4

Text Approach File system A separate query engine will need to be

implemented Parsing not possible Index strategies : (parent_offset, tag),

(child_offset, parent_offset), (tagname, value), (attribute_name, attribute_value)

not good for update


XML 5

Object-oriented Database Management System Michael R. Olson and Byung S. Lee

(1997) OO model fit well Immature technology: Hard to

scale Conclude the experiment without

any great success.


XML 6

Native Database Approach Prototypes: Lore (Stanford

University), Xyleme (INRIA, France).

Immature technology No optimization capabilities


XML 7

Relational Database Technology Very mature technology Query optimization techniques and

the processing mechanisms in relational databases have been studied for a quarter of a century

A very large percentage of the data are currently stored in RDMS


XML 8

Storage and Retrieval of XML Using Relational Database

XML

Table1

Table1

Relational to XMLConversion

XML to Relational Mapping

XML Query to SQL SQL

XML

XMLQueryLanguagesuch asXPath,XQuery,Quilt, XQL


XML 9

Classifications of Various Mapping Methods Structure-mapping

approach The XML document’s

logical structures (or DTDs if available) are represented by the database schemas

1 DTD : 1 set of generated schemas

Model-mapping approach Constructs of XML

model are represented by the database schemas

1 set of generated schemas for all/any DTD


XML 10

Relational Schema Prototype Tree Mapping Method M. Yan and A. Fu @ The Chinese

University of Hong Kong (2001) Structure-mapping Global Schema Extraction

Algorithm DTD Splitting Schema Extraction

Algorithm


XML 11

Relational Schema Prototype Tree Mapping Method (Cont’d) Relational Databases for Querying XML

Documents: Limitations and Opportunities (J. Shanmugasundaram)

Basic steps:1. Simplify DTD2. Construct schema prototype tree3. Generate relational schema prototypes4. Detect functional dependencies and

candidate keys5. Normalize the relational schema

prototypes


XML 12

DTD Splitting Schema Extraction Algorithm Step 1: Simplify DTDp|p' p, p'p+ p*(p, p') p, p'..., p,..., p*,... p*p? p(p, p’)* p*, p’*..., p,..., p,... p*


XML 13

<!ENTITY %txt “(#PCDATA)”>

<!ELEMENT book (booktitle, price?, author, authority*)>

<!ELEMENT authority (authname, country)>

<!ELEMENT authname %txt>

<!ELEMENT country %txt>

<!ELEMENT booktitle %txt>

<!ELEMENT price %txt>

<!ELEMENT monograph (title, author, editor)>

<!ELEMENT title %txt>

<!ELEMENT editor (monograph+)>

<!ATTLIST editor name CDATA #REQUIRED>

<!ELEMENT author (name, address)>

<!ATTLIST author id ID>

<!ELEMENT name (firstname, lastname)>

<!ELEMENT firstname %txt>

<!ELEMENT lastname %txt>

<!ELEMENT address %txt>

An Book DTD


XML 14

<!ELEMENT book (booktitle, price, author, authority*)><!ELEMENT authority (authname, country)><!ELEMENT authname (#PCDATA)><!ELEMENT country (#PCDATA)><!ELEMENT booktitle (#PCDATA)><!ELEMENT price (#PCDATA)><!ELEMENT monograph (title, author, editor)><!ELEMENT title (#PCDATA)><!ELEMENT editor (monograph*)><!ATTLIST editor name CDATA><!ELEMENT author (name, address)><!ATTLIST author id ID><!ELEMENT name (firstname, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address (#PCDATA)>

Transformed/ Simplified DTD


XML 15

Step 2: Construct Schema Prototypes Trees1. Only an element can become a root2. An element that is not nested inside other

elements can become the root3. A non-#PCDATA element that is nested in

more than 1 other element becomes the root4. If a non-#PCDATA element B is not the only

subelement of A and B only appears in A with a “*”, it becomes the root

5. One of the elements in the recursion is selected as root should recursion occurs in the DTD


XML 16

Roots for the Example DTD

Element book is selected as root – rule 2

Element author is selected as root – rule 3

Element authority is selected as root – rule 4

Element monograph is selected as root – rule 5

<!ELEMENT book (booktitle, price, author, authority*)><!ELEMENT authority (authname, country)><!ELEMENT authname (#PCDATA)><!ELEMENT country (#PCDATA)><!ELEMENT booktitle (#PCDATA)><!ELEMENT price (#PCDATA)><!ELEMENT monograph (title, author, editor)><!ELEMENT title (#PCDATA)><!ELEMENT editor (monograph*)><!ATTLIST editor name CDATA><!ELEMENT author (name, address)><!ATTLIST author id ID><!ELEMENT name (firstname, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address (#PCDATA)>


XML 17

Step 2: Construct Schema Prototypes Trees (Cont’d) Tree construction:

Depth-first scan on DTD for all selected root(s) starting from the subelements of the root

New nodes for each visited elements and attributes

A mixed element (element containing both #PCDATA and other subelement) will be marked with a “#” in the tree

Recursion – a new leaf node with label <node name>.A


XML 18

Schema Prototype Trees


XML 19

Step 3: Generate Relational Schema Prototype All necessary descendants are

inlined starting from the root except key nodes or foreign key nodes.


XML 20

Relational Schema Prototype

Book (booktitle, price)Authority (country, authname)Author (address, id, firstname, lastname)Monograph (title, name)


XML 21

Step 4: Discover FDs and Candidate Keys Functional dependencies (FDs) and

the candidate keys discovery by analyzing the XML data

TANE algorithm (http://www.cs.helsinki.fi/research/fdk/datamining/tane/)


XML 22

Candidate Keys

Book {booktitle}Authority {country, authname}Monograph {title}Author {id}, {lastname, address}


XML 23

Relational Schema Prototype With Candidate Keys

Book (booktitle, price, author.id)Authority (country, authname, assigned, book.booktitle)Author (address, id, firstname, lastname)Monograph (title, name, author.id, monograph.title)Book (booktitle, price)

Authority (country, authname)Author (address, id, firstname, lastname)Monograph (title, name)


XML 24

Step 5: Normalize the Relational Schema Prototypes The last step. Normalize the schema to 3NF

(third normal form) if possible. Structure mapping methods does

not handle order but leave it to metadata or user to handle.


XML 25

X-Rel

Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura and Shunsuke Uemura @ Nara Institute of Science and Technology, Japan (2001)

Model-mapping Data model: XPath (root node,

element nodes, attribute nodes, and text nodes)

The concept of region


XML 26

Definition of Region

The region of: An element node or a text node is a

pair of numbers representing the start and end positions of the node in the XML document

An attribute node is a pair of identical numbers equal to the start position of the parent element node plus one


XML 27

Simple Path Expressions

Path – an unit of decomposition of XML trees Store simple path expression (denoted by

SimplePathExpr) from the root node. Why? Path is appear in XML queries frequently


XML 28

Why “#” Is Added?

Look for family descendants of issue.1. WHERE p1.pathexp LIKE

‘/issue%/family’2. WHERE p1.pathexp LIKE ‘#/issue#

%/family’ /issuelist/family (WRONG) is match

for the first but not the second.


XML 29

<Paper Title = “The Suffix-Signature Method for Searching Phrases in Text”><Authors>

<FN> Mei </FN><LN> Zhou </LN>

<Affiliation> Open Text Corporation </Affiliation> </Authors>

<Authors> <FN> Frank </FN><LN> Tompa </LN>

<Affiliation> University of Waterloo </Affiliation> </Authors></Paper>

Example XML Document


XML 30

ROOT

Paper

Element

attribute

abc

text

string-valueTitle

The Suffix-Signature Method forSearching Phrases in Text

Authors Authors

FN FNLN LN AffiliationAffiliation

Mei Zhou OpenTextCorporation

Frank Tompa UniversityOFWaterloo

1

2

3 4 11

5 7 9 12 14 16

6 8 10 13 15 1711

XML Tree


XML 31

Simple Path Expressions /Regions

Node 3 #/Paper#/@Title (1,1)

Node 9 #/Paper#/

Authors#/affiliation

(99, 145)

ROOT

Paper

Element

attribute

abc

text

string-valueTitle

The Suffix-Signature Method forSearching Phrases in Text

Authors Authors

FN FNLN LN AffiliationAffiliation

Mei Zhou OpenTextCorporation

Frank Tompa UniversityOFWaterloo

1

2

3 4 11

5 7 9 12 14 16

6 8 10 13 15 1711


XML 32

Mapping Idea

A relational table per node type Simple path expression are normalized docID is introduced Basic XRel schema Element (docID, pathID, start, end, index,

reindex) Attribute (docID, pathID, start, end, value) Text (docID, pathID, start, end, value) Path (pathID, pathexp)


XML 33

Table - Element

docID pathID start end index reindex

1 1 0 257 1 1

1 3 66 155 1 2

1 4 75 86 1 2

1 5 87 99 1 2

1 6 100 145 1 2

1 3 156 249 2 1

1 4 165 178 2 1

1 5 179 192 2 1

1 6 193 239 2 1


XML 34

Table - Attribute

docID pathID start end value

1 2 1 1 The Suffix-Signature Method for Searching Phrases in Text


XML 35

Table - Text

docID

pathID

start

end value

1 4 79 81 Mei

1 5 91 94 Zhou

1 6 113 131 Open Text Corporation

1 4 169 173 Frank

1 5 183 187 Tompa

1 6 206 225 University of Waterloo


XML 36

Table - Path

pathID pathexpr

1 #/Paper

2 #/Paper/@Title

3 #/Paper#/Authors

4 #/Paper#/Authors#/FN

5 #/Paper#/Authors#/LN

6 #/Paper#/Authors#/Affiliation


XML 37

XML Benchmarking Desiderata

Bulk loading Reconstruction Path traversals Casting Missing elements


XML 38

XML Benchmarking Desiderata (continued) Ordered access References Joins Construction of large results Containment, full-text search


XML 39

XML Benchmarks

3 XML benchmark proposals. XMach-1.

University of Leipzig, Germany. XML benchmark project.

CWI, the Netherlands. Kanda et al. Proposal (unpublished).

University of Michigan, IBM Toronto lab center for advanced studies.


XML 40

Conclusion

From the different mapping approaches and experiments, there are a few places where relational database enhancement can help in coping with XML model differences. Support for sets. Flexible comparisons operators. Multi-predicate merge join.


XML 41

Questions & Answers


XML 42

Appendix A

Enhancing Structural Mappings Based on Statistics


XML 43

Optimal Hybrid Database Algorithm M. Klettke, and H. Meyer XML and object-relational database

systems - enhancing structural mappings based on statistics (2000)

An algorithm that finds a type of optimal mapping based on the statistics and the DTD


XML 44

Optimal Hybrid Database Algorithm

1. Build a graph representing the hierarchy of the elements and attributes of the DTD.

2. For every element/attribute of the graph, a measure of significance, w, is determined.

3. Derive the resulting database design from the graph.


XML 45


XML 46

Graph for an Example DTD


XML 47

Calculate the Weight (Step 2)

W = 1/6 (SQ + SA + SH) + ¼ (DA/DG) + ¼ (QA/QG) where SQ - exploitation of quantifiers SA - exploitation of alternatives SH - position in the hierarchy DA -number of documents containing the element/attribute DG - absolute number of XML documents QA - number of queries containing the element/attribute QG - absolute number of queries


XML 48

The Graph With the Colored Weight


XML 49

Step 3 - Deriving Hybrid Databases From the Graph First, specify a limit on which

attributes and/or elements is represented as attributes of the databases and which attributes and/or element are represented as XML attributes


XML 50

Step 3 - Deriving Hybrid Databases From the Graph Then, search for all nodes of the

graph that satisfy the following conditions:

The node is not a leaf of the graph The node and all its descendants

are below the limit given No predecessor that satisfies the

first two conditions exists.


XML 51

Step 3 - Deriving Hybrid Databases From the Graph The selected nodes and its

descendents (the whole sub-graph) will be replaced by an XML attribute. (A BLOB like attribute)

All other elements and attributes will be mapped to relational database using mapping.


XML 52

Resulting XML Attributes for the Example DTD


XML 53

Referenceshttp://www.geocities.com/ysee.geo/ml.htmlhttp://db-www.aist-nara.ac.jp/members/Yoshikawa/paper/TOIT2001.pdfhttp://citeseer.nj.nec.com/454884.htmlhttp://dol.uni-leipzig.de/pub/2001-1/enhttp://www.cs.wisc.edu/niagara/papers/xmlstore.pdfhttp://www.cwi.nl/htbin/ins1/publications?request=papers

the storage and benchmarking of xml

Documents

retrieval of xml

querying xml documents

dtdpp p

benchmarking of xml

database schemas

relational mappingxml

pcdata element

rootan element