chapter 10: xml the world of xml. context the dawn of database technology 70s a dbms is a flexible...

Chapter 10: XML

The world of XML

Context

• The dawn of database technology 70s

• A DBMS is a flexible store-recall system for digital information

• It provides permanent memory for structured information

Context

• Database Managements technology for administrative settings ‘completed’ in the early 80s

• Search for demanding application areas that could benefit from a database approach

– A sound datamodel to structure the information and maintain integrity rules

– A high level programming language model to manipulate the data

– Separation of concerns between modelling and manipulation, and physical storage and order of execution thanks to query optimizer technology

Context

• Demanding areas of research in DBMS core technology:– Office Information systems, e.g. document modelling

and workflow– CAD/CAM, e.g. how to manage the design of an

airplane or nucleur power plant– GIS, e.g. managing remote sensing information– WWW, e.g. how to integrate heterogenous sources– Agent-based systems, e.g. reactive systems– Multimedia, e.g. video storage/retrieval– Datamining, e.g. discovery of client profiles– Sensor networks, e.g. small footprint and energy-wise

computing

Context

• Demanding areas of research in DBMS core technology:

– Office Information systems, Extensible DBMS, blobs

– CAD/CAM, Object-oriented DBMS, geometry

– GIS, GIS DBMS, geometry and images

– Agent-based systems, Active DBMS, triggers

– Multimedia, MM DBMS, feature analysis

– Datamining, Datawarehouse systems, cube, association rules

– Sensor networks, P2P databases, ad-hoc networking

Context

• Application interaction with DBMS– Proprietary application programming interface,

shielding the hardware distinctions– Use readable interfaces to improve monitoring and

development• Example: in Monetdb the interaction is based on ascii text with

the first character indicative for the message type‘>’ prompt, await for next request‘!’ error occurred, rest is the message‘[‘ start of a tuple answer

– Language embedding to remove the impedance mismatch, i.e. avoid cost of transforming data

• Effectively failed in the OO world

Context

• Learning points database perspective, – Database system should not be concerned with the

user-interaction technology, ‘they should be blind and deaf’

– The strong requirements on schema, integrity rules and processing is a harness

– Interaction with applications should be self-descriptive as much as possible, because, you can not a priori know a complete schema

– Need for semi-structured databases

Semi-structured data

• Properties of semistructured databases:– The schema is not given in advance and may be implicit in the

data

– The schema is relatively large and changes frequently

– The schema is descriptive rather than prescriptive, integrity rules may be violated

– The data is not strongly typed, the values of attributes may be of different type

• Stanford Lore system is the prototypical first attempt to support semi-structured databases

Context

Accidentally, in the world of digital publishing there is a need for a simple datamodel to structure information

SMGL HTML XML XHTML

XPATH XQUERY XSLT

By the end 90s, the document world meets the database world

Introduction

• XML: Extensible Markup Language

• Defined by the WWW Consortium (W3C)

• Originally intended as a document markup language not a database language– Documents have tags giving extra information about sections of the

document• E.g. <title> XML </title> <slide> Introduction …</slide>

– Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML

– Extensible, unlike HTML

• Users can add new tags, and separately specify how the tag should be handled for display

XML Introduction (Cont.)

• The ability to specify new tags, and to create nested tag structures made XML a great way to exchange data, not just documents.– Much of the use of XML has been in data exchange applications, not as a

replacement for HTML

• Tags make data (relatively) self-documenting – E.g.

<bank> <account>

<account-number> A-101 </account-number> <branch-name> Downtown </branch-name> <balance> 500 </balance>

</account> <depositor>

<account-number> A-101 </account-number> <customer-name> Johnson </customer-name>

</depositor> </bank>

XML: Motivation

• Data interchange is critical in today’s networked world

– Examples:• Banking: funds transfer

• Order processing (especially inter-company orders)

• Scientific data

– Chemistry: ChemML, …

– Genetics: BSML (Bio-Sequence Markup Language), …

– Paper flow of information between organizations is being replaced by electronic flow of information

• Each application area has its own set of standards for representing information (W3C maintains ca 30 standards)

• XML has become the basis for all new generation data interchange formats

XML Motivation (Cont.)• Each XML based standard defines what are valid elements,

using

– XML type specification languages to specify the syntax• DTD (Document Type Descriptors)

• XML Schema

– Plus textual descriptions of the semantics

• XML allows new tags to be defined as required

– However, this may be constrained by DTDs

• A wide variety of tools is available for parsing, browsing and querying XML documents/data

Motivation for Nesting

• Nesting of data is useful in data transfer

– Example: elements representing customer-id, customer name, and address nested within an order element

• Nesting is not supported, or discouraged, in relational databases

– With multiple orders, customer name and address are stored redundantly

– normalization replaces nested structures in each order by foreign key into table storing customer name and address information

– Nesting is supported in object-relational databases and NF2

• But nesting is appropriate when transferring data

– External application does not have direct access to data referenced by a foreign key

Example of Nested Elements

<bank-1> <customer>

<customer-name> Hayes </customer-name> <customer-street> Main </customer-street> <customer-city> Harrison </customer-city> <account>

<account-number> A-102 </account-number> <branch-name> Perryridge </branch-name> <balance> 400 </balance>

</account> <account> … </account>

</customer> . .

</bank-1>

Structure of XML Data (Cont.)

• Mixture of text with sub-elements is legal in XML. – Example: <account>

This account is seldom used any more. <account-number> A-102</account-number> <branch-name> Perryridge</branch-name> <balance>400 </balance></account>

– Useful for document markup, but discouraged for data representation

Attributes

• Elements can have attributes– <account acct-type = “checking” >

<account-number> A-102 </account-number> <branch-name> Perryridge </branch-name> <balance> 400 </balance>

</account>

• Attributes are specified by name=value pairs inside the starting tag of an element

• An element may have several attributes, but each attribute name can only occur once

• <account acct-type = “checking” monthly-fee=“5”>

Attributes Vs. Subelements

• Distinction between subelement and attribute

– In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents

– In the context of data representation, the difference is unclear and may be confusing

• Same information can be represented in two ways

– <account account-number = “A-101”> …. </account>

– <account> <account-number>A-101</account-number> … </account>

– Suggestion: use attributes for identifiers of elements, and use subelements for contents

XML Document Schema

• Database schemas constrain what information can be stored, and the data types of stored values

• XML documents are not required to have an associated schema

• However, schemas are very important for XML data exchange– Otherwise, a site cannot automatically interpret data

received from another site• Two mechanisms for specifying XML schema

– Document Type Definition (DTD)• Widely used

– XML Schema • Newer, not yet widely used

Attribute Specification in DTD

• Attribute specification : for each attribute – Name– Type of attribute

• CDATA• ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)

– more on this later

– Whether • mandatory (#REQUIRED)• has a default value (value), • or neither (#IMPLIED)

• Examples– <!ATTLIST account acct-type CDATA “checking”>– <!ATTLIST customer

customer-id ID # REQUIREDaccounts IDREFS # REQUIRED >

IDs and IDREFs

• An element can have at most one attribute of type ID

• The ID attribute value of each element in an XML document must be distinct

– Thus the ID attribute value is an object identifier

• An attribute of type IDREF must contain the ID value of an element in the same document

• An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must contain the ID value of an element in the same document

Limitations of DTDs

• No typing of text elements and attributes

– All values are strings, no integers, reals, etc.

• Difficult to specify unordered sets of subelements

– Order is usually irrelevant in databases

– (A | B)* allows specification of an unordered set, but• Cannot ensure that each of A and B occurs only once

• IDs and IDREFs are untyped

– The owners attribute of an account may contain a reference to another account, which is meaningless

• owners attribute should ideally be constrained to refer to customer elements

XML Schema

• XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports

– Typing of values• E.g. integer, string, etc

• Also, constraints on min/max values

– User defined types

– Is itself specified in XML syntax, unlike DTDs• More standard representation, but verbose

– Is integrated with namespaces

– Many more features• List types, uniqueness and foreign key constraints, inheritance ..

• BUT: significantly more complicated than DTDs, not yet widely used.

Storage of XML Data

• XML data can be stored in

– Non-relational data stores• Flat files

– Natural for storing XML– But has all problems discussed in Chapter 1 (no

concurrency, no recovery, …)• XML database

– Database built specifically for storing XML data, supporting DOM model and declarative querying

– Currently no commercial-grade scaleable system

– Relational databases• Data must be translated into relational form• Advantage: mature database systems• Disadvantages: overhead of translating data and queries

Storing XML in Relational Databases

• Store as string– E.g. store each top level element as a string field of a tuple in a

database• Use a single relation to store all elements, or• Use a separate relation for each top-level element type

– E.g. account, customer, depositor– Indexing:

» Store values of subelements/attributes to be indexed, such as customer-name and account-number as extra fields of the relation, and build indices

» Oracle 9 supports function indices which use the result of a function as the key value. Here, the function should return the value of the required subelement/attribute

» SQL server 2005 same

Storing XML in Relational Databases

• Store as string– E.g. store each top level element as a string field of a tuple in a

database– Benefits:

• Can store any XML data even without DTD• As long as there are many top-level elements in a document, strings are

small compared to full document, allowing faster access to individual elements.

– Drawback: Need to parse strings to access values inside the elements; parsing is slow.

OEM model

• Semi structured and XML databases can be modelled as graph-problems

• Early prototypes directly supported the graph model as the physical implementation scheme. Querying the graph model was implemented using graph traversals

• XML without IDREFS can be modelled as trees

Storing XML as Relations (Cont.)

• Tree representation: model XML data as tree and store using relations nodes(id, type, label, value) child (child-id, parent-id)– Each element/attribute is given a unique identifier– Type indicates element/attribute– Label specifies the tag name of the element/name of attribute– Value is the text value of the element/attribute– The relation child notes the parent-child relationships in the tree

• Can add an extra attribute to child to record ordering of children

– Benefit: Can store any XML data, even without DTD– Drawbacks:

• Data is broken up into too many pieces, increasing space overheads• Even simple queries require a large number of joins, which can be

slow

Storing XML in Relations (Cont.)• Map to relations • If DTD of document is known, you can map data to relations

– Bottom-level elements and attributes are mapped to attributes of relations

– A relation is created for each element type• An id attribute to store a unique id for each element• all element attributes become relation attributes• All subelements that occur only once become attributes

– For text-valued subelements, store the text as attribute value– For complex subelements, store the id of the subelement

• Subelements that can occur multiple times represented in a separate table

– Similar to handling of multivalued attributes when converting ER diagrams to tables

– Benefits: • Efficient storage• Can translate XML queries into SQL, execute efficiently, and then

translate SQL results back to XML

Alternative mappings

• Mapping the structure– The Edge approach– The Attribute approach– The Universal Table approach– The Normalized Universal approach– The Dataguide approach

• Mapping values– Separate value tables– Inlining

• Shredding

Edge approach

• Use a single Edge table to capture the graph structure

Edge(source, ordinal, name, flag, target)

Flag: {value, reference}

Keys: {source, ordinal)

Index: source, {name,target}

Attribute approach

• Group all attributes with the same name into one table

Aname(source,ordinal,flag, target)

Key: {source,ordinal}

Index:{target}

Universal approach

• Use the Universal Table, all attributes are stored as columnsUniversal(source, ord-1,flag-1,target-1, …,ord-n,flag-n,target-n)

Key: source, index: target-i

Normalized Universal

• Same as Universal, but factor out the repeating values

Universal(source, ord-1,flag-1,target-1, …,ord-n,flag-n,target-n)

Overflow_n(source,ord, flag,target)

Key: source, and {source,ord}

Index: target-i

Mapping values

• Separate value tables

– Use V_type(vid, value) tables, eg. int(vid,val), str(vid,val),….

Mapping values

• Inlining

– As illustrated in previous mappings, inline the values in the structure relations

Shredding

• Try to recognize repeating structures and map them to separate tables

• Handle the remainder through any of the previous methods

Evaluation

• Some results reported by Florescu, Kossmann using a commercial DBMS on documents of 100K objects in 1999

• Database storage overhead:

Evaluation


• Bulk loading:

Evaluation


• Reconstruction:

The Data

Semistructured data instance = a large graph

The indexing problem

• The storage problem

– Store the graph in a relational DBMS

– Develop a new database storage structure

• The indexing problem:

– Input: large, irregular data graph

– Output: index structure for evaluating (regular) path expressions, e.g.

bib.paper.author.firstname

XSet: a simple index for XML

• Part of the Ninja project at Berkeley• Example XML data:

XSet: a simple index for XML

Each node = a hashtable

Each entry = list of pointers to data nodes (not shown)

XSet: Efficient query evaluation

• SELECT X FROM part.name X -yes• SELECT X FROM part.supplier.name X -yes• SELECT X FROM part.*.subpart.name X -maybe• SELECT X FROM *.supplier.name X -maybe

Will gain when index fits in memory

Region Algebras

• structured text = text with tags (like XML)

• data = sequence of characters [c1c2c3 …]

• region = interval in the text

– representation (x,y) = [cx,cx+1, … cy]

– example: <section> … </section>• region set = a set of regions

– example all <section> regions (may be nested)

• region algebra = operators on region set, s1 op s2s1 op s2

Region algebra: some operators

• s1 intersect s2 = {r | r s1, r s2}

• s1 included s2 = {r | rs1, r’ s2, r r’}

• s1 including s2 = {r | r s1, r’ s2, r r’}

• s1 parent s2 = {r | r s1, r’ s2, r is a parent of r’}

• s1 child s2 = {r | r s1, r’ s2, r is child of r’}

Examples:

<subpart> included <part>

<part> including <subpart>

Efficient computation of Region Algebra Operators

Example: s1 included s2

s1 = {(x1,x1'), (x2,x2'), …}

s2 = {(y1,y1'), (y2,y2'), …}

(i.e. assume each consists of disjoint regions)

Algorithm:

if xi < yj then i := i + 1

if xi' > yj' then j := j + 1

otherwise: print (xi,xi'), do i := i + 1

Can do in sub-linear time when one region is very small

From path expressions to region expressions

part.name name child (part child root)

part.supplier.name name child (supplier child (part child root))

*.supplier.name name child supplier

part.*.subpart.name name child (subpart included (part child root))

Region expressions correspond to simple XPath expressions

Storage structures for region algebras

• Every node is characterised by an integer pair (x,y)

• This means we have a 2-d space

• Any 2-d space data structure can be used

• If you use a (pre-order,post-order) numbering you get triangular filling of 2-d

(to be discussed later)

Alternative mappings

• Mapping the structure to the relational world

– The Edge approach

– The Attribute approach

– The Universal Table approach

– The Normalized Universal approach

– The Monet/XML approach

– The Dataguide approach

• Mapping values

– Separate value tables

– Inlining

• Shredding

Dataguide approach

• Developed in the context of Lore, Lorel (Stanford Univ)

• Predecessor of the Monet/XML model

• Observation:

– queries in the graph-representation take a limited form

– they are partial walks from the root to an object of interest

– this behaviour was stressed by the query language Lorel, i.e. an SQL-based query language based on processing regular expressions

SELECT X

FROM (Bib.*.author).(lastname|firstname).Abiteboul X

DataGuides

Definition

given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:

- every path in DB also occurs in G

- every path in G occurs in DB

- every path in G is unique

Dataguides

Example:

DataGuides

• Multiple DataGuides for the same data:

DataGuides

DefinitionLet w, w’ be two words (I.e word queries) and G a graph

w G w’ if w(G) = w’(G)

Definition

G is a strong dataguide for a database DB if G is the same as DB

Example:- G1 is a strong dataguide- G2 is not strong

person.project !DB dept.project

person.project !G2 dept.project

DataGuides

• Constructing the strong DataGuide G:

Nodes(G)={{root}}

Edges(G)=while changes do

choose s in Nodes(G), a in Labels

add s’={y|x in s, (x -a->y) in Edges(DB)} to Nodes(G)

add (x -a->y) to Edges(G)

• Use hash table for Nodes(G)

• This is precisely the powerset automaton construction.

DataGuides

• How large are the dataguides ?

– if DB is a tree, then size(G) <= size(DB)• why? answer: every node is in exactly one extent of G

• here: dataguide = XSet

– How many nodes does the strong dataguide have for this DB ? 20 nodes (least common

multiple of 4 and 5)

Dataguides usually fail on data with cyclic schemas, like:

chapter 10: xml the world of xml. context the dawn of database technology 70s a dbms is a flexible...

Documents

extensible dbms

dbms core technology

objectoriented dbms

gis dbms

active dbms

database system

office information systems

database worldintroductionxml