XML + Databases = ? (DIMACS Workshop, 3/2000)

Mike Carey
Exploratory Database Systems Department
IBM Almaden Research Center
[email protected]

Plan for Today’s Talk

Thoughts on DB and web technologies
  – The web and web “querying”
  – Semistructured databases
  – Object-relational databases
  – XML and databases
XML/DB research at IBM Almaden
  – The XPERANTO project
      • Motivation and approach
      • Whirlwind tour of the system

The Web is Great at Supporting URL-Based Sharing

Ex: Online conference proceedings
Web browsers have given us
  – Universal file access (ftp++)
  – Universal document access (html)
  – Universal service access (forms)
What more could we navigational couch potatoes possibly want?
  – Universal platform for e-shopping!

The Web is Lousy at Supporting Parametric Searches

Ex: Find all the used Musicman Sterling bass guitars currently available for under $750 within a 50-mile radius of my San Jose home
This is hard for a number of reasons
  – Data buried in web pages, news groups, classified ads, store sites, auction sites, …
  – No schema (no metal fish, please!)
  – No data types (miles, US$, instruments)
  – No regularity within/across (good!) sites

Aren’t We Supposed to be the Experts on Data Management?

The DB community brought the world
  – Data models, schemas, and views
  – Query languages, optimizers, fast joins
  – Scalable parallel servers
  – Federated database systems
What do we have in our bag of tricks?
  – Semistructured databases
  – Object-relational database systems

Is Semistructured Database Technology the Answer?

Database characteristics
  – Collections of [name, value] pairs or maybe [name, type, value] triples
  – Collections typically set<any> or list<any>
System characteristics
  – “Typeloose” query languages
  – Indexes for nested, typeloose structures
  – Appropriate query processing techniques
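The [name, value] model and a "typeloose" lookup can be sketched in a few lines of Python. This is only an illustration: the `listing` data, the `find` helper and its path syntax are made up for this sketch and do not correspond to any particular semistructured system.

```python
from typing import Any, Iterator, Sequence

# Hypothetical data: an object is a list of (name, value) pairs, and a value
# may itself be a nested list of pairs (roughly list<any>).
listing = [
    ("item", "bass guitar"),
    ("price", 700),          # a number here...
    ("price", "call us"),    # ...and a string there: nothing forbids it
    ("seller", [("city", "San Jose"), ("phone", "unknown")]),
]

def is_object(value: Any) -> bool:
    """True if value looks like a nested list of (name, value) pairs."""
    return isinstance(value, list) and all(
        isinstance(pair, tuple) and len(pair) == 2 for pair in value
    )

def find(obj: list, path: Sequence[str]) -> Iterator[Any]:
    """Yield every value reached by following the names in path.

    The lookup is typeloose: it never checks what kind of value it finds,
    and missing names simply produce no results.
    """
    if not path:
        yield obj
        return
    head, rest = path[0], path[1:]
    for name, value in obj:
        if name != head:
            continue
        if not rest:
            yield value
        elif is_object(value):
            yield from find(value, rest)

prices = list(find(listing, ["price"]))
cities = list(find(listing, ["seller", "city"]))
```

Note that `prices` mixes a number and a string, which is exactly the flexibility (and the headache) that the next slides discuss.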

Are Semistructured Databases the Answer? (2)

No, because schemas are critical for
  – Data readers
      • What info is in a given collection?
      • Thus, what queries might make sense?
  – Data writers
      • What should I call this piece of info?
      • Is it okay to put this kind of data here?
  – Efficient/effective query processors
      • Indexing, statistics, ... (e.g., range queries)
      • Integration mappings (e.g., unit conversions)
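As a rough illustration of the last point, the sketch below shows how even a tiny schema that records a unit per field lets a query processor run one range query over two sources. The schema dictionaries, field names and the `within_radius` helper are all hypothetical, invented for this example.

```python
# Hypothetical schemas for two sources: field name -> (type, unit).
SCHEMA_A = {"price": (float, "USD"), "distance": (float, "miles")}
SCHEMA_B = {"price": (float, "USD"), "distance": (float, "km")}

KM_PER_MILE = 1.609344  # exact by definition of the international mile

def to_miles(value: float, unit: str) -> float:
    """Integration mapping: normalize a distance to miles."""
    if unit == "miles":
        return value
    if unit == "km":
        return value / KM_PER_MILE
    raise ValueError(f"unknown distance unit: {unit}")

def within_radius(record: dict, schema: dict, max_miles: float) -> bool:
    """Range predicate that is only meaningful because the schema names the unit."""
    _, unit = schema["distance"]
    return to_miles(record["distance"], unit) <= max_miles

# The same stored number means different things under the two schemas.
near_b = within_radius({"distance": 60.0}, SCHEMA_B, 50)  # 60 km is about 37 miles
near_a = within_radius({"distance": 60.0}, SCHEMA_A, 50)  # 60 miles
```

Without the unit in the schema, the two records are indistinguishable and the predicate cannot be evaluated correctly.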

Are Semistructured Databases the Answer? (3)

It has some nice features, though
  – Flexible, dynamic schemas
      • Forgiving w.r.t. variations and exceptions
      • Schema evolution is not a big deal
  – Richer data modeling (vs. relational)
      • Nested structures, ordered collections
  – More powerful query languages
      • Blurring of schema and data querying
      • Ordering, nesting, restructuring handled

Is Object-Relational Database Technology the Answer?

Database characteristics
  – Base types, user-defined structured types, inheritance, reference types, collections
  – Collections are well-typed
System characteristics
  – Extended SQL-based query languages
  – Support for methods (fenced/unfenced)
  – Also triggers, LOBs, extensible indexes

Are Object-Relational Databases the Answer? (2)

No, because most O-R DBMSs have
  – Overly rigid schemas
      • Every instance is of one (known) type
      • Evolving a type can be a major burden
      • Distributed type management is hard
  – Crufty old storage managers
      • Ragged or sparse records poorly supported
  – Insufficient power in extended SQL
      • Prehistoric assumptions get in the way
      • Weak on restructuring, schema-querying

Is XML the Answer? (Yes!! ...What Was the Question Again?)

Structured documents (for the web)

<book>
  <booktitle> Tables Are The Answer </booktitle>
  <author id="cdate">
    <name>
      <firstname> Chris </firstname>
      <lastname> Date </lastname>
    </name>
    <address>
      <city> Saratoga </city>
      <state> CA </state>
    </address>
  </author>
</book>
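To make the "document as data" point concrete, here is a minimal sketch that reads the book document above with Python's standard `xml.etree.ElementTree` module. The variable names are this sketch's own; the document text is the one from the slide, with whitespace stripped when values are read.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """
<book>
  <booktitle> Tables Are The Answer </booktitle>
  <author id="cdate">
    <name>
      <firstname> Chris </firstname>
      <lastname> Date </lastname>
    </name>
    <address>
      <city> Saratoga </city>
      <state> CA </state>
    </address>
  </author>
</book>
"""

book = ET.fromstring(BOOK_XML)

# Element content is reached by path; attributes are reached by name.
title = book.findtext("booktitle").strip()
author = book.find("author")
author_id = author.get("id")
last_name = author.findtext("name/lastname").strip()
city = author.findtext("address/city").strip()
```

Everything comes back as a string: without a schema, nothing says whether a value is a name, a number or a date.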

Is XML the Answer? (2)

W3C’s XML Schema working group
  – Typed elements, attributes, documents
  – Simple types and complex types
  – Derived types (extension, restriction)
  – Facets, anonymous types, groups, …
  – Uniqueness, keys and key references
W3C’s XML Query working group
  – XML-QL, XPath, XQL, XSL/T, XSQL, …
  – Recommendation due in late 2000 (?)

Is XML the Answer? (3)

XML Schema might help because
  – XML has achieved huge mindshare for data interchange on the web
  – DTD standardization is happening for documents within vertical industries, and XML Schemas should take over
  – When finished, XML Schema should be a widely used schema description tool
      • Similar to O-R schemas, but with more flexibility (and web-based sex appeal)

Some Useful XML+DB Topics

Publish documents with XML Schemas from O-R databases
  – B2B e-commerce messages
  – B2C comparison shopping (if permitted!)
  – Robust O-R DB-resident web sites with XML for page content generation
Use XML Schema as the central data model for data integration middleware
  – I.e., web information integration

Useful XML+DB Topics (2)

Build a “native” XML Repository on top of an O-R DBMS
  – Map from XML Schema model to O-R DBMS modeling constructs
  – Map from XML queries to O-R queries (including tag variables and loose typing)
  – Thereby provide XML document storage management with industrial-strength robustness, scalability, and performance

Useful XML+DB Topics (3)

Evolve XML-QL into a complete web data manipulation language
  – Typing a la XML Schema
  – Ordered/unordered collections
  – XPath-inspired expressions
  – Easier grouping and aggregation
  – Updates (insert/delete, modify)
  – Etc.

The XPERANTO Project

Middleware for publishing O-R (or plain relational) DB content on the web
  – Provides a virtual XML document view
  – Based on a “pure XML” approach
  – Using XML-QL (as W3C placeholder)
Born at Almaden in summer of 1999
  – Mike Carey, Dana Florescu, Zack Ives, Ying Lu, Jai Shanmugasundaram, Beau Shekita, Subbu Subramanian

The XPERANTO Belief System

Databases contain, and will continue to contain, the world’s “data jewels”
  – Transactional data (RDBMS)
  – Important multimedia assets (ORDBMS)
XML application developers of the future may not love SQL like we do
  – View databases as default XML documents
  – Let them define appropriate (query-able) views of these XML documents

XPERANTO Architecture

[Figure: architecture diagram. Labels recoverable from the slide: Views; XML Schema; Query Translation (XML-QL Parser, XQGM, Query Rewrite, XQGM, SQL Translation); XML Tagger; Metadata Services (View Services, Type & Table Services, XML Schema Generator); Catalog Info; Table & Type Info; SQL Queries; Data Tuples; O-R Database (SQL Query Processor, Stored Tables, System Catalog).]

XPERANTO Components

XML-QL Parser
  – Neutral query representation (XQGM)
Query Rewrite
  – View composition and other rewrites
SQL Translation
  – Produce SQL query(s) to get the required data from the underlying DBMS
XML Tagger
  – Tag and structure the tabular results

XPERANTO Components

View Services
  – Repository for XML view definitions
Type & Table Services
  – Interface (and cache) for DB catalog info
XML Schema Generator
  – Give DB catalog info in XML Schema form for default views
  – Infer XML Schema info for queries and non-default view definitions

Consider a Simple O-R Schema

CREATE TABLE book (
  bookID    CHAR(30),
  name      VARCHAR(255),
  publisher VARCHAR(30)
)

CREATE TABLE publisher (
  name    VARCHAR(30),
  address VARCHAR(255)
)

CREATE TYPE author_type AS (
  bookID CHAR(30),
  first  VARCHAR(30),
  last   VARCHAR(30)
)

CREATE TABLE author OF author_type (REF IS ssn USER GENERATED)

Part of the Default XML View

<simpleType name="string255" source="string">
  <maxLength value="255" />
</simpleType>

<simpleType name="string30" source="string">
  <maxLength value="30" />
</simpleType>

<complexType name="bookTupleType">
  <element name="bookID" type="string30" />
  <element name="name" type="string255" />
  <element name="publisher" type="string30" />
</complexType>

<complexType name="bookSetType">
  <element name="bookTuple" type="bookTupleType" maxOccurs="*" />
</complexType>

<element name="book" type="bookSetType" />
...
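The instance data that such a default view describes can be sketched as follows: every table becomes an element named after the table, and every row becomes a nested `<tableTuple>` element with one child per column. This is a guess at the shape implied by the schema fragment above, not XPERANTO's actual generator; the `default_view` function and the sample row are hypothetical.

```python
import xml.etree.ElementTree as ET

def default_view(table_name, columns, rows):
    """Build <table><tableTuple><col>value</col>...</tableTuple>...</table>."""
    table_el = ET.Element(table_name)
    for row in rows:
        tuple_el = ET.SubElement(table_el, table_name + "Tuple")
        for column, value in zip(columns, row):
            if value is None:
                continue  # assumption of this sketch: SQL NULLs are omitted
            ET.SubElement(tuple_el, column).text = str(value)
    return table_el

# Hypothetical row for the book table from the O-R schema slide.
book_el = default_view(
    "book",
    ["bookID", "name", "publisher"],
    [("b1", "An Example Book", "Kluwer")],
)
xml_text = ET.tostring(book_el, encoding="unicode")
```

The result is deliberately flat and table-shaped, which is why the talk treats default views as a starting point for better, user-defined views.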

XPERANTO’s Default Views

XPERANTO generates default O-R to XML Schema mappings
  – Each DB shown as an XML file
  – Subtyping handled via XML Schema’s refinement facilities
  – OIDs and references become ids/idrefs
“Don’t use this at home!”
  – Application developers are expected to define the real view(s) using XML-QL

Creating a Better XML View

WHERE <library.book.bookTuple>
        <bookID> $bid </>
        <name> $bname </>
        <publisher> $bpub </>
      </> IN "db2:xml:books/library",
      $bpub = "Kluwer"
CONSTRUCT
  <book id=$bid>
    <name> $bname </>
    { WHERE <library.publisher.publisherTuple>
              <name> $bpub </>
              <address> $addr </>
            </> IN "db2:xml:books/library"
      CONSTRUCT <publisher> <address> $addr </> </> }
    { WHERE <library.author.authorTuple>
              <bookID> $bid </>
              <first> $fname </>
              <last> $lname </>
            </> IN "db2:xml:books/library"
      CONSTRUCT <author first=$fname last=$lname/> }
  </>
...
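One way to read the view definition above is as a nested loop: the outer WHERE selects Kluwer books, and each inner WHERE/CONSTRUCT block adds the matching publisher and author elements. The Python sketch below mirrors that reading over in-memory tuples. It is an informal paraphrase, not XML-QL semantics in full; the sample rows and the `better_view` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of the three tables.
books = [("b1", "An Example Book", "Kluwer"), ("b2", "Another Book", "Other Press")]
publishers = [("Kluwer", "1 Example Street"), ("Other Press", "2 Sample Road")]
authors = [("b1", "Ann", "Author"), ("b1", "Bob", "Writer"), ("b2", "Cy", "Scribe")]

def better_view(books, publishers, authors, wanted_publisher="Kluwer"):
    """Return one nested <book> element per book of the wanted publisher."""
    result = []
    for bid, bname, bpub in books:            # outer WHERE
        if bpub != wanted_publisher:
            continue
        book = ET.Element("book", id=bid)     # outer CONSTRUCT
        ET.SubElement(book, "name").text = bname
        for pname, addr in publishers:        # first nested block
            if pname == bpub:
                pub = ET.SubElement(book, "publisher")
                ET.SubElement(pub, "address").text = addr
        for abid, fname, lname in authors:    # second nested block
            if abid == bid:
                ET.SubElement(book, "author", first=fname, last=lname)
        result.append(book)
    return result

view = better_view(books, publishers, authors)
```

A real system would not evaluate the view this way; the point of the following slides is that the nested query is rewritten and pushed down into SQL.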

XPERANTO Query Rewrite

XML-QL queries first translated into XQGM representation
  – Neutral, well-poised for more features
  – Easier to go from XML-QL to SQL
  – Borrow rewrites from DB2 UDB engine
XQGM is an extension of DB2’s QGM
  – XML data type for “columns”
  – Set of XML-specific functions

SQL Generation and XML Document Tagging/Structuring

Sorted Outer Union queries are used to obtain the data
  – Fetch the data in one query that brings it back in the appropriate order
  – Tag and nest it to create XML document
Advantages of this approach
  – Shown to be stable as well as fast
  – Simple (linear-space) tagging possible
      • Just watch for nesting-related changes

Outer Union Query Example

WITH OuterUnion (type, bookID, bookName, pubName, pubAddr, authFirst, authLast) AS (
  -- type '0': one row per book
  SELECT '0', b.bookID, b.name, NULL, NULL, NULL, NULL
  FROM book b
  WHERE b.publisher = 'Kluwer'
  UNION ALL
  -- type '1': the publisher of each book
  SELECT '1', b.bookID, NULL, p.name, p.address, NULL, NULL
  FROM book b, publisher p
  WHERE b.publisher = 'Kluwer' AND b.publisher = p.name
  UNION ALL
  -- type '2': one row per author of each book
  SELECT '2', b.bookID, NULL, NULL, NULL, a.first, a.last
  FROM book b, author a
  WHERE b.publisher = 'Kluwer' AND b.bookID = a.bookID
)
-- sorting on type within bookID puts each book's own row before its children
SELECT * FROM OuterUnion ORDER BY bookID, type
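The sorted outer union and the tagging pass can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite stands in for DB2, the sample rows are invented, the `tag` function is a hypothetical stand-in for the XML Tagger, and the secondary sort on `type` is this sketch's choice so that a book's name row always arrives before its publisher and author rows.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (bookID TEXT, name TEXT, publisher TEXT);
CREATE TABLE publisher (name TEXT, address TEXT);
CREATE TABLE author (bookID TEXT, first TEXT, last TEXT);
INSERT INTO book VALUES ('b1', 'Example Book One', 'Kluwer');
INSERT INTO book VALUES ('b2', 'Example Book Two', 'Kluwer');
INSERT INTO book VALUES ('b3', 'Unrelated Book', 'Other Press');
INSERT INTO publisher VALUES ('Kluwer', '1 Example Street');
INSERT INTO author VALUES ('b1', 'Ann', 'Author');
INSERT INTO author VALUES ('b1', 'Bob', 'Writer');
INSERT INTO author VALUES ('b2', 'Cy', 'Scribe');
""")

QUERY = """
WITH OuterUnion (type, bookID, bookName, pubName, pubAddr, authFirst, authLast) AS (
  SELECT '0', b.bookID, b.name, NULL, NULL, NULL, NULL
  FROM book b WHERE b.publisher = 'Kluwer'
  UNION ALL
  SELECT '1', b.bookID, NULL, p.name, p.address, NULL, NULL
  FROM book b, publisher p
  WHERE b.publisher = 'Kluwer' AND b.publisher = p.name
  UNION ALL
  SELECT '2', b.bookID, NULL, NULL, NULL, a.first, a.last
  FROM book b, author a
  WHERE b.publisher = 'Kluwer' AND b.bookID = a.bookID
)
SELECT * FROM OuterUnion ORDER BY bookID, type
"""

def tag(rows):
    """Single pass over the sorted rows; only the current book is held open."""
    current = None
    for kind, book_id, book_name, _pub_name, pub_addr, first, last in rows:
        # A change in bookID is the nesting-related change to watch for.
        if current is None or current.get("id") != book_id:
            if current is not None:
                yield current
            current = ET.Element("book", id=book_id)
        if kind == "0":
            ET.SubElement(current, "name").text = book_name
        elif kind == "1":
            pub = ET.SubElement(current, "publisher")
            ET.SubElement(pub, "address").text = pub_addr
        elif kind == "2":
            ET.SubElement(current, "author", first=first, last=last)
    if current is not None:
        yield current

tagged_books = list(tag(conn.execute(QUERY)))
```

Because the rows arrive already grouped and ordered, the tagger never has to look back or buffer more than one book, which is the property the previous slide points to.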

XPERANTO Project Summary

Goal is to publish O-R data in XML form
  – Default XML views
  – XML-QL for defining useful views
  – “Look Ma, no SQL!”
Currently (re)building our prototype
  – View composition is our first stop
  – Updates in addition to queries
  – Queries over both data and metadata
  – Other needs for XML web sites...?

A Few Closing Remarks

DB community must ensure that the web will support real queries…!
  – XML Schema and XML Query standards need ongoing input from DB researchers
  – Large-scale technologies needed for XML indexing, caching, querying, etc.
DB community should also work on important underlying technologies
  – Publishing XML both from and to RDBMSs and ORDBMSs, for example!
