
From Semistructured Data to XML

Dan Suciu

AT&T Labs

http://www.research.att.com/~suciu/vldb99-tutorial.pdf


How the Web is Today

• HTML documents

• all intended for human consumption

• many generated automatically by applications

Easy to fetch any Web page, from any server, any platform


Limits of the Web Today

• applications cannot consume HTML

• HTML wrapper technology is brittle
  – screen scraping

• OO technology (CORBA) requires a controlled environment

• companies merge, form partnerships; need interoperability fast

People are inventive: they send data by fax!


Paradigm Shift on the Web

• new Web standard XML:
  – XML generated by applications
  – XML consumed by applications

• data exchange
  – across platforms: enterprise interoperability
  – across enterprises

Web: from collection of documents to data and documents


Database Community Can Help

• query optimization, processing

• views, transformations

• data warehouses, data integration

• mediators, query rewriting

• secondary storage, indexes


But Needs a Paradigm Shift Too

• Web data differs from database data:
  – self-describing, schema-less
  – structure changes without notice
  – heterogeneous, deeply nested, irregular
  – documents and data mixed together

• designed by document experts, not database experts

• need Web data management


What This Tutorial is About

• what the database community has done:
  – semistructured data model
  – query languages, schemas

• what the Web community has done:
  – data formats/models: XML, RDF
  – transformation language (XSL), schemas

• where they meet and where they differ


Outline

• Semistructured data and XML

• Query languages

• Schemas

• Systems issues

• Conclusions


Part 1: Semistructured Data and XML


Semistructured Data

Origins:

• integration of heterogeneous sources

• data sources with non-rigid structure

• biological data

• Web data


The Semistructured Data Model

[Figure: the bibliography database Bib as an edge-labeled graph. The root &o1 has two paper edges and one book edge leading to &o12, &o24 and &o29; other edges are labeled author, title, year, http, publisher, page, references, firstname, lastname, first and last; the leaves hold atomic values such as “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122 and 133.]

Object Exchange Model (OEM)

Legend: complex object (inner node), atomic object (leaf holding a value)


Syntax for Semistructured Data

Bib: &o1 { paper: &o12 { … },
           book: &o24 { … },
           paper: &o29
             { author: &o52 “Abiteboul”,
               author: &o96 { firstname: &o243 “Victor”,
                              lastname: &o206 “Vianu” },
               title: &o93 “Regular path queries with constraints”,
               references: &o12,
               references: &o24,
               pages: &o25 { first: &o64 122, last: &o92 133 }
             }
         }


Syntax for Semistructured Data

May omit oids:

{ paper: { author: “Abiteboul”,
           author: { firstname: “Victor”,
                     lastname: “Vianu” },
           title: “Regular path queries …”,
           page: { first: 122, last: 133 }
         }
}


Characteristics of Semistructured Data

• missing or additional attributes

• multiple attributes

• different types in different objects

• heterogeneous collections

self-describing, irregular data, no a priori structure


Comparison with Relational Data

{ row: { name: “John”, phone: 3634 },
  row: { name: “Sue”, phone: 6343 },
  row: { name: “Dick”, phone: 6363 }
}

name    phone
John    3634
Sue     6343
Dick    6363

[Figure: the same relation as a tree: three row edges from the root, each leading to a node with a name edge and a phone edge to the values “John” 3634, “Sue” 6343 and “Dick” 6363.]
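The encoding of a relation as semistructured data can be sketched as follows, assuming a complex object is represented as a Python list of (label, value) pairs (an illustrative choice, not the tutorial's notation). The function name relation_to_ssd is hypothetical.

```python
def relation_to_ssd(columns, rows):
    """Encode a relation: one "row" edge per tuple, one edge per column."""
    return [("row", [(c, v) for c, v in zip(columns, row)]) for row in rows]


ssd = relation_to_ssd(
    ["name", "phone"],
    [("John", 3634), ("Sue", 6343), ("Dick", 6363)],
)
for label, obj in ssd:
    print(label, obj)
```

Every row repeats the column names, which is what "self-describing" costs when the data happens to be regular.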


XML

• a W3C standard to complement HTML

• origins: structured text SGML

• motivation:
  – HTML describes presentation
  – XML describes content

• http://www.w3.org/TR/REC-xml (2/98)

[Figure: relationship between SGML, XML and HTML 4.0]


From HTML to XML

HTML describes the presentation


HTML

<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
    Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
    Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999


XML

<bibliography>
  <book> <title> Foundations… </title>
         <author> Abiteboul </author>
         <author> Hull </author>
         <author> Vianu </author>
         <publisher> Addison Wesley </publisher>
         <year> 1995 </year>
  </book>
</bibliography>

XML describes the content


XML Terminology

• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements are nested
• empty element: <red></red>, abbreviated <red/>
• an XML document has a single root element

well-formed XML document: one whose tags match and nest properly
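As a rough illustration of well-formedness, the sketch below relies on Python's standard xml.etree.ElementTree parser, which rejects documents whose tags do not match or that lack a single root element. The helper name is_well_formed and the sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<book><title>Foundations</title></book>"))  # matching tags
print(is_well_formed("<book><title>Foundations</book>"))          # </title> missing
print(is_well_formed("<red/>"))                                   # empty element
print(is_well_formed("<a/><b/>"))                                 # two root elements
```

Well-formedness is purely syntactic; it says nothing about which tags are allowed, which is the job of a DTD or schema.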


More XML: Attributes

<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  <year> 1995 </year>
</book>

attributes are alternative ways to represent data


More XML: Oids and References

<person id="o555"> <name> Jane </name> </person>

<person id="o456"> <name> Mary </name>
  <children idref="o123 o555"/>
</person>

<person id="o123" mother="o456"> <name> John </name>
</person>

oids and references in XML are just syntax


XML Data Model

• does not exist

• Document Object Model (DOM):
  – http://www.w3.org/TR/REC-DOM-Level-1 (10/98)
  – class hierarchy (node, element, attribute, …)
  – objects have behavior
  – defines API to inspect/modify the document


XML Parsers

• traditional: return data structure (DOM?)

• event based: SAX (Simple API for XML)
  – http://www.megginson.com/SAX
  – write handler for start tag and for end tag
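The two parsing styles can be contrasted with Python's standard library, used here only as a convenient stand-in: xml.dom.minidom builds a DOM tree in memory, while xml.sax calls a handler for each start and end tag. The document string and the TagTracer class are illustrative assumptions.

```python
import xml.dom.minidom
import xml.sax

DOC = (
    "<bibliography><book><title>Foundations</title>"
    "<author>Abiteboul</author><author>Hull</author></book></bibliography>"
)

# Tree-based parsing: the whole document becomes a data structure first.
dom = xml.dom.minidom.parseString(DOC)
dom_authors = [n.firstChild.data for n in dom.getElementsByTagName("author")]


class TagTracer(xml.sax.ContentHandler):
    """Event-based parsing: react to start and end tags as they stream by."""

    def __init__(self):
        super().__init__()
        self.starts = []      # element names in document order
        self.depth = 0        # current nesting depth
        self.max_depth = 0    # deepest nesting seen so far

    def startElement(self, name, attrs):
        self.starts.append(name)
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def endElement(self, name):
        self.depth -= 1


handler = TagTracer()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(dom_authors)
print(handler.starts, handler.max_depth)
```

The SAX handler never holds more than its own counters, which is why event-based parsing suits documents too large to keep in memory.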


XML Namespaces

• http://www.w3.org/TR/REC-xml-names (1/99)

• name ::= [prefix:]localpart

<book xmlns:isbn="www.isbn-org.org/def">
  <title> … </title>
  <number> 15 </number>
  <isbn:number> …. </isbn:number>
</book>
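To see how a parser keeps the two number elements apart, here is a small sketch with Python's xml.etree.ElementTree, which expands each prefixed name to {namespace-URI}localpart. The element contents "T" and "0-000" are placeholders invented for the example, since the slide elides them.

```python
import xml.etree.ElementTree as ET

# Placeholder contents ("T", "0-000") are assumptions; the slide shows only "…".
DOC = """<book xmlns:isbn="www.isbn-org.org/def">
  <title>T</title>
  <number>15</number>
  <isbn:number>0-000</isbn:number>
</book>"""

root = ET.fromstring(DOC)
tags = [child.tag for child in root]
print(tags)

# Look up the qualified element through a prefix-to-URI mapping.
NS = {"isbn": "www.isbn-org.org/def"}
print(root.find("number").text, root.find("isbn:number", NS).text)
```

The prefix itself is only a local abbreviation; what identifies the name is the URI it is bound to.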


XML Namespaces

• syntactic: <number> , <isbn:number>

• semantic: provide URL for schema

<tag xmlns:mystyle="http://…">
  <mystyle:title> … </mystyle:title>
  <mystyle:number> …
</tag>

(the names prefixed with mystyle are defined at “http://…”)


XML vs. Semistructured Data

• both described best by a graph

• both are schema-less, self-describing


Similarities and Differences

<person id="o123">
  <name> Alan </name>
  <age> 42 </age>
  <email> ab@com </email>
</person>

{ person: &o123
    { name: “Alan”,
      age: 42,
      email: “ab@com” }
}

[Figure: both forms correspond to the same tree: a person node with name, age and email edges to Alan, 42 and ab@com.]

<person father="o123"> … </person>

{ person: { father: &o123 … } }

[Figure: the two trees drawn side by side, each extended with a father edge.]

similar on trees, different on graphs


More Differences

• XML is ordered, ssd is not

• XML can mix text and elements:

<talk> Making Java easier to type and easier to type
  <speaker> Phil Wadler </speaker>
</talk>

• XML has lots of other stuff: entities, processing instructions, comments


RDF

• http://www.w3.org/TR/REC-rdf-syntax (2/99)

• purpose: metadata for Web
  – help search engines

• syntax in XML

• semantics: edge-labeled graphs


RDF Syntax

<rdf:Description about="www.mypage.com">
  <about> birds, butterflies, snakes </about>
  <author> <rdf:Description>
             <firstname> John </firstname>
             <lastname> Smith </lastname>
           </rdf:Description>
  </author>
</rdf:Description>


RDF Data Model

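[Figure: the resource www.mypage.com has an about edge to “birds, butterflies, snakes” and an author edge to an unnamed node, which has firstname and lastname edges to John and Smith.]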

the RDF Data Model is very close to semistructured data


More RDF Examples

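[Figure: the graph for www.mypage.com (about, author, firstname, lastname) extended with a second resource, www.anotherpage.com, which has a related edge to www.mypage.com, an author edge to the same John Smith node, and an author edge to Joe Doe.]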


<rdf:Description about="www.mypage.com">
  <about> birds, butterflies, snakes </about>
  <author> <rdf:Description ID="&o55">
             <firstname> John </firstname>
             <lastname> Smith </lastname>
           </rdf:Description> </author>
</rdf:Description>

<rdf:Description about="www.anotherpage.com">
  <related> <rdf:Description about="www.mypage.com"/> </related>
  <author rdf:resource="&o55"/>
  <author> Joe Doe </author>
</rdf:Description>


RDF Terminology

[Figure: a statement drawn as a predicate edge from a subject node to an object node]

OEM                  RDF
node                 resource
label                property
source/label/dest    subject/predicate/object
edge                 statement
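The correspondence between edges and statements can be made concrete by writing the earlier www.mypage.com description as a list of (subject, predicate, object) triples. This is a sketch in plain Python; the identifier "_:a1" for the unnamed author node and the helper objects are assumptions made for the example.

```python
# Each edge of the graph is one RDF statement (subject, predicate, object).
# "_:a1" is a made-up identifier for the unnamed author node.
triples = [
    ("www.mypage.com", "about", "birds, butterflies, snakes"),
    ("www.mypage.com", "author", "_:a1"),
    ("_:a1", "firstname", "John"),
    ("_:a1", "lastname", "Smith"),
]


def objects(statements, subject, predicate):
    """Return the objects of all statements with this subject and predicate."""
    return [o for (s, p, o) in statements if s == subject and p == predicate]


author = objects(triples, "www.mypage.com", "author")[0]
print(objects(triples, author, "lastname"))
```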


More RDF: Containers

• bag, sequence, alternative

<rdf:Description>
  <a> <rdf:Bag>
        <rdf:li> s1 </rdf:li>
        <rdf:li> s2 </rdf:li>
      </rdf:Bag>
  </a>
</rdf:Description>


RDF Containers (cont’d)

[Figure: the container as a graph: an a edge leads to a node with an rdf:type edge to Bag and edges rdf_1 and rdf_2 to s1 and s2.]


More RDF: Higher Order Statements

“The author of www.thispage.com says: ‘the topic of www.thatpage.com is environment’”

[Figure: a topic edge from www.thatpage.com to environment; www.thispage.com has an author edge, and a says edge points at the topic statement itself.]

RDF uses reification


Summary of Data Models

• semistructured data, XML, RDF

• data is self-describing, irregular

• schema embedded in the data


Part 2: Query Languages

• Semistructured data and XML

• Query languages

• Schemas

• Systems issues

• Conclusions


Query Languages: Motivation

• granularity of the HTML Web: one file

• granularity of Web data varies:
  – single data item: “get John’s salary”
  – entire database: “get all salaries”
  – aggregates: “get average salary”

• need query language to define granularity


Query Languages: Outline

• for semistructured data:
  – Lorel
  – UnQL
  – StruQL

• for XML: XML-QL

• a different paradigm
  – structural recursion
  – XSL


Lorel

• part of the Lore system (Stanford)

• adapts OQL to semistructured data

example:

select X.title
from Bib.paper X
where X.year > 1995

abbreviated to:

select Bib.paper.title
from Bib.paper
where Bib.paper.year > 1995


Lorel vs. OQL

• implicit coercions: 1995 to “1995”

• missing attributes
  – empty answer vs. type error

• set-valued attributes
  – in X.year > 1995, X may have several years

• regular path expressions (next)
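These three differences can be mimicked in a few lines of Python. The sketch below is not Lorel; it assumes a list-of-pairs encoding of the data, a made-up five-entry bibliography, and an existential reading of the comparison (a paper qualifies if some year is greater than the bound). The names children, coerce and titles_after are hypothetical.

```python
def children(obj, label):
    """Objects reached from obj over edges named label (none for atoms)."""
    if not isinstance(obj, list):
        return []
    return [v for (lbl, v) in obj if lbl == label]


def coerce(value):
    """Implicit coercion: compare 1998 and "1998" alike; None if not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Made-up data illustrating coercion, a missing year, and two years.
bib = [
    ("paper", [("title", "A"), ("year", 1997)]),
    ("paper", [("title", "B"), ("year", "1998")]),            # year is a string
    ("paper", [("title", "C")]),                               # no year
    ("paper", [("title", "D"), ("year", 1990), ("year", 1996)]),
    ("book", [("title", "E"), ("year", 1999)]),
]


def titles_after(db, bound):
    """select X.title from Bib.paper X where X.year > bound (sketch)."""
    result = []
    for x in children(db, "paper"):
        years = [coerce(y) for y in children(x, "year")]
        if any(y is not None and y > bound for y in years):
            result.extend(children(x, "title"))
    return result


print(titles_after(bib, 1995))
```

Paper C is silently skipped rather than raising an error, which is the "empty answer vs. type error" point on the slide.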


Regular Path Expressions

Useful for:
• syntactic substitute for inheritance: paper|book
• navigating partially known structures: lastname?
• transitive closure: reference+

select X.title
from Bib.paper X, Bib.(paper|book) Y
where Y.author.lastname? = “Ullman” and Y.reference+ X
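The transitive-closure operator is the part that goes beyond simple navigation, so here is a sketch of how reference+ might be evaluated on a graph with cycles. The adjacency-list encoding, the node names p1 to p3 and the function reach_plus are assumptions for illustration only.

```python
def reach_plus(edges, start, label):
    """Nodes reachable from start by one or more edges named label (label+)."""
    seen = set()
    stack = [t for (lbl, t) in edges.get(start, []) if lbl == label]
    while stack:
        node = stack.pop()
        if node in seen:          # already visited: guards against cycles
            continue
        seen.add(node)
        stack.extend(t for (lbl, t) in edges.get(node, []) if lbl == label)
    return seen


# Hypothetical citation graph: node -> list of (label, target).
edges = {
    "p1": [("reference", "p2"), ("title", "t1")],
    "p2": [("reference", "p3"), ("reference", "p1")],   # cycle p1 -> p2 -> p1
    "p3": [],
}

print(sorted(reach_plus(edges, "p1", "reference")))
print(sorted(reach_plus(edges, "p3", "reference")))
```

Because of the visited set, the evaluation terminates even though p1 and p2 cite each other.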


UnQL

• Unstructured Query Language

• patterns, templates, structural recursion

• patterns:

select T
where Bib.paper: { title: T, year: Y, journal: “TODS”} and Y > 1995


UnQL: Templates

select result: { fn: F, ln: L, pub: { title: T, year: Y }}
where Bib.paper: { title: T, year: Y, journal: “TODS”} and Y > 1995

Result looks like:

{ result: { fn: “John”, ln: “Smith”, pub: { title: “P equals NP”, year: 2005 }},
  result: { fn: “Joe”, ln: “Doe”, pub: { title: “Errata to P=NP”, year: 2006 }},
  …
}


Skolem Functions

• Maier, 1986
  – in OO systems

• Kifer et al., 1989
  – F-logic

• Hull and Yoshikawa, 1990
  – deductive db (ILOG)

• Papakonstantinou et al., 1996
  – semistructured db (MSL)

• illustrate with Strudel (next)


Skolem Functions in StruQL

• Strudel: a Web Site Management System

• StruQL: its query language


Example: Bibliography Data

{Bib: { paper: { author: “Jones”,
                 author: “Smith”,
                 title: “The Comma”,
                 year: 1994
               }
      },
      { paper: ….. }
}


Example: A Complex Web Site

[Figure: the generated site graph. Root() has person edges to HomePage(“Smith”), HomePage(“Jones”) and HomePage(“Mark”); home pages have yearentry edges to YearPage(“Smith”,1994), YearPage(“Smith”,1996), YearPage(“Jones”,1994), YearPage(“Jones”,1998) and YearPage(“Mark”,1996); year pages have publication edges to publication pages such as PubPage(“The Comma”) and PubPage(“The Dot”); publication pages have title edges and author edges back to the home pages.]


Example: Skolem Functions in StruQL

where Root -> “Bib” -> X, X -> “paper” -> P,
      P -> “author” -> A, P -> “title” -> T, P -> “year” -> Y

create Root(), HomePage(A), YearPage(A,Y), PubPage(P)

link Root() -> “person” -> HomePage(A),
     HomePage(A) -> “yearentry” -> YearPage(A,Y),
     YearPage(A,Y) -> “publication” -> PubPage(P),
     PubPage(P) -> “author” -> HomePage(A),
     PubPage(P) -> “title” -> T
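The key property of a Skolem function is that equal arguments always yield the same node, so pages are created once and shared. A Python sketch of the query above, under several assumptions: nodes are modelled as (name, arguments) tuples, publication pages are keyed by title rather than by the paper object P, and the second paper in the data is invented to have something to share.

```python
def skolem(name):
    """A Skolem function: the same arguments always denote the same node."""
    return lambda *args: (name, args)


Root, HomePage, YearPage, PubPage = (
    skolem(n) for n in ("Root", "HomePage", "YearPage", "PubPage")
)

# Bindings for (authors, title, year); the second entry is made up.
papers = [
    {"authors": ["Jones", "Smith"], "title": "The Comma", "year": 1994},
    {"authors": ["Smith", "Mark"], "title": "The Dot", "year": 1996},
]

links = set()  # a set, so repeated links collapse just as repeated nodes do
for p in papers:
    t, y = p["title"], p["year"]
    for a in p["authors"]:
        links.add((Root(), "person", HomePage(a)))
        links.add((HomePage(a), "yearentry", YearPage(a, y)))
        links.add((YearPage(a, y), "publication", PubPage(t)))
        links.add((PubPage(t), "author", HomePage(a)))
        links.add((PubPage(t), "title", t))

people = {dst for (_, label, dst) in links if label == "person"}
print(sorted(people))
```

Smith appears in both papers but gets a single home page, with one year entry per distinct (author, year) pair.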


XML-QL: A Query Language for XML

• http://www.w3.org/TR/NOTE-xml-ql (8/98)

• features:
  – regular path expressions
  – patterns, templates
  – Skolem Functions

• based on OEM data model


Pattern Matching in XML-QL

where <book language="french">
        <publisher> <name> Morgan Kaufmann </name> </publisher>
        <author> $a </author>
      </book> in "www.a.b.c/bib.xml"
construct $a


Simple Constructors in XML-QL

where <book language = $l> <author> $a </> </> in "www.a.b.c/bib.xml"
construct <result> <author> $a </> <lang> $l </> </>

Note: </> abbreviates </book> or </result> or ...

Result:

<result> <author>Smith</author> <lang>English</lang> </result>
<result> <author>Smith</author> <lang>Mandarin</lang> </result>
<result> <author>Doe</author> <lang>English</lang> </result>


Skolem Functions in XML-QL

where <book language = $l> <author> $a </> </> in "www.a.b.c/bib.xml"
construct <result> <author id=F($a)> $a </> <lang> $l </> </>

Result:

<result> <author>Smith</author> <lang>English</lang> <lang>Mandarin</lang> </result>
<result> <author>Doe</author> <lang>English</lang> </result>
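The difference between the two construct clauses is essentially grouping. A rough Python analogy, assuming the where clause produced the three ($a, $l) bindings listed below; the dictionaries stand in for result elements and are not XML-QL semantics in any formal sense.

```python
# Variable bindings ($a, $l) assumed to come out of the where clause.
bindings = [("Smith", "English"), ("Smith", "Mandarin"), ("Doe", "English")]

# Simple constructor: one <result> per binding.
simple = [{"author": a, "lang": [l]} for (a, l) in bindings]

# Skolem function F($a): bindings with the same author share one element,
# so their <lang> children accumulate under it.
grouped = {}
for a, l in bindings:
    grouped.setdefault(a, {"author": a, "lang": []})["lang"].append(l)

print(len(simple), list(grouped.values()))
```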


A Different Paradigm: Structural Recursion

Data as sets with a union operator:

{a:3, a:{b:“one”, c:5}, b:4} =

{a:3} U {a:{b:“one”, c:5}} U {b:4}


Structural Recursion

Example: retrieve all integers in the data

f(T1 U T2) = f(T1) U f(T2)
f({L: T}) = f(T)
f({}) = {}
f(V) = if isInt(V) then {result: V} else {}

[Figure: the input tree {a:3, a:{b:“one”, c:5}, b:4} and the output tree {result:3, result:5, result:4}.]

standard textbook programming on trees
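The four equations translate almost line by line into a recursive Python function. This is a sketch assuming trees are encoded as lists of (label, subtree) pairs with strings and integers at the leaves; the union of the slide becomes list concatenation.

```python
def f(t):
    """Collect every integer in the tree as a (result, value) edge."""
    if isinstance(t, list):
        out = []                      # f({}) = {}
        for _label, sub in t:         # f(T1 U T2) = f(T1) U f(T2)
            out.extend(f(sub))        # f({L: T}) = f(T)
        return out
    # f(V) = if isInt(V) then {result: V} else {}
    return [("result", t)] if isinstance(t, int) else []


data = [("a", 3), ("a", [("b", "one"), ("c", 5)]), ("b", 4)]
print(f(data))
```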


Structural Recursion

Example: increase all engine prices by 10%

f(T1 U T2) = f(T1) U f(T2)
f({L: T}) = if L = engine then {L: g(T)} else {L: f(T)}
f({}) = {}
f(V) = V

g(T1 U T2) = g(T1) U g(T2)
g({L: T}) = if L = price then {L: 1.1*T} else {L: g(T)}
g({}) = {}
g(V) = V

[Figure: input tree with an engine subtree and a body subtree, each containing a part with a price of 100 and a price of 1000. In the output tree the prices under engine become 110 and 1100; the prices under body stay 100 and 1000.]
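The two mutually dependent functions can be sketched in Python as follows, assuming trees are lists of (label, subtree) pairs and that every price is an atomic number. Rounding to two decimals is an addition of this sketch to keep floating-point results tidy.

```python
def f(t):
    """Copy the tree; switch to g below any engine edge."""
    if not isinstance(t, list):
        return t                                   # f(V) = V
    return [(l, g(s) if l == "engine" else f(s)) for (l, s) in t]


def g(t):
    """Inside an engine: raise every price by 10%, copy everything else."""
    if not isinstance(t, list):
        return t                                   # g(V) = V
    return [(l, round(1.1 * s, 2) if l == "price" else g(s)) for (l, s) in t]


car = [
    ("engine", [("part", [("price", 100)]), ("price", 1000)]),
    ("body", [("part", [("price", 100)]), ("price", 1000)]),
]
print(f(car))
```

Using two functions encodes the context ("am I below an engine edge?") without any explicit path bookkeeping.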


XSL

• two W3C drafts: XSLT and XPath
  – http://www.w3.org/TR/xpath, 7/99
  – http://www.w3.org/TR/WD-xslt, 7/99

• in commercial products (e.g. IE5.0)

• purpose: stylesheet specification language:
  – stylesheet: XML -> HTML
  – in general: XML -> XML


XSL Templates and Rules

• query = collection of template rules

• template rule = match pattern + template

Retrieve all book titles:

<xsl:template> <xsl:apply-templates/> </xsl:template>

<xsl:template match="/bib/*/title">
  <result> <xsl:value-of/> </result>
</xsl:template>


XPath Expressions in Match Patterns

bib matches a bib element

* matches any element

/ matches the root element

/bib matches a bib element under root

bib/paper matches a paper in bib

bib//paper matches a paper in bib, at any depth

//paper matches a paper at any depth

paper|book matches a paper or a book

@price matches a price attribute

bib/book/@price matches price attribute in book, in bib
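Several of these patterns can be tried out with Python's xml.etree.ElementTree, with two caveats: it implements only a subset of XPath, and its paths are evaluated relative to a context element rather than used as match patterns. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up document; note the paper nested inside another paper's section.
DOC = """<bib>
  <book price="55"><title>Foundations of Databases</title></book>
  <paper><title>P1</title>
    <section><paper><title>P2</title></paper></section>
  </paper>
</bib>"""

bib = ET.fromstring(DOC)  # context node: the bib root element

any_child = [e.tag for e in bib.findall("*")]            # *
direct_papers = bib.findall("paper")                     # bib/paper
all_papers = bib.findall(".//paper")                     # bib//paper
priced = [b.get("price") for b in bib.findall("book[@price]")]  # bib/book/@price

print(any_child, len(direct_papers), len(all_papers), priced)
```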


Flow Control in XSL

<xsl:template> <xsl:apply-templates/> </xsl:template>

<xsl:template match="a">
  <A><xsl:apply-templates/></A>
</xsl:template>

<xsl:template match="b">
  <B><xsl:apply-templates/></B>
</xsl:template>

<xsl:template match="c">
  <C><xsl:value-of/></C>
</xsl:template>


Input:

<a> <e> <b> <c> 1 </c>
            <c> 2 </c>
        </b>
        <a> <c> 3 </c>
        </a>
    </e>
    <c> 4 </c>
</a>

Output:

<A> <B> <C> 1 </C>
        <C> 2 </C>
    </B>
    <A> <C> 3 </C>
    </A>
    <C> 4 </C>
</A>
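The flow of control in this example (rename a, b and c, skip over e but keep processing its children) can be imitated with a short recursive Python function over an ElementTree. This is a hand-written imitation of the four template rules, not an XSLT processor; the function names are hypothetical and whitespace has been stripped from the input for simplicity.

```python
import xml.etree.ElementTree as ET


def apply_templates(node):
    """Process all children of node and concatenate their results."""
    results = []
    for child in node:
        results.extend(transform(child))
    return results


def transform(node):
    """Pick the rule matching node.tag, as the template rules would."""
    if node.tag == "c":                       # <C><xsl:value-of/></C>
        out = ET.Element("C")
        out.text = "".join(node.itertext())
        return [out]
    if node.tag in ("a", "b"):                # <A>/<B> around apply-templates
        out = ET.Element(node.tag.upper())
        out.extend(apply_templates(node))
        return [out]
    return apply_templates(node)              # default rule: just recurse


SRC = "<a><e><b><c>1</c><c>2</c></b><a><c>3</c></a></e><c>4</c></a>"
result = transform(ET.fromstring(SRC))[0]
print(ET.tostring(result, encoding="unicode"))
```

The e element disappears from the output while everything beneath it survives, because the default rule contributes no element of its own.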


XSL is Structural Recursion

Equivalent to:

f(T1 U T2) = f(T1) U f(T2)
f({L: T}) = if L = c then {C: T}
            else if L = b then {B: f(T)}
            else if L = a then {A: f(T)}
            else f(T)
f({}) = {}
f(V) = V

XSL query = single function
XSL query with modes = multiple functions


XSL and Structural Recursion

XSL:
• trees only
• may loop

Structural Recursion:
• arbitrary graphs
• always terminates

Add the following rule:

<xsl:template match="e">
  <xsl:apply-templates select="/"/>
</xsl:template>

Result: stack overflow on IE 5.0


Summary of Query Languages

• studied extensively in semistructured data

• some quite powerful features

• no standard for XML QL yet (WG soon)

• XSL available today (for stylesheets)

• XSL = structural recursion


Part 3: Schemas

• Semistructured data and XML

• Query languages

• Schemas

• Systems issues

• Conclusions


Schemas

• why?
  – XML: to describe semantics
  – semistructured data: to improve processing

• what?
  – semistructured data: foundational (here lies our interest)
  – XML: several concrete proposals


Schemas

• when?
  – semistructured data, XML: a posteriori
  – RDBMS: a priori, to interpret binary data

• how?
  – semistructured data: schema is independent
  – XML: schema is hardwired with the data


Outline

• schemas for semistructured data:
  – foundations
  – schema extraction

• schemas for XML:
  – DTD
  – XML-Schema
  – RDF-Schema


Schemas: An Example

Some database:

[Figure: an edge-labeled data graph. The root &r1 has company edges to &c1 and &c2 and person edges to &p1, &p2 and &p3. Companies carry name, address and url edges (“Widget”, “Trenton”, “Gadget”, “www.gp.fr”, “Paris”); persons carry name, position and phone edges (“Smith”, “Manager”, “Jones”, “5552121”, “Dupont”, “Sales”). Persons and companies are linked by works-for, c.e.o., employee and manages edges. A further subgraph &a1 … &a7 hangs off description edges, with labels procurement, salesrep, contact, task and eval and values 1997, 1998, “on target” and “below target”.]


Lower-Bound Schemas

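[Figure: lower-bound schema graph with nodes Root, Company, Employee and string. Edges: Root -company-> Company, Root -person-> Employee, Employee -works-for-> Company, Company -c.e.o.-> Employee, Employee -managed-by-> Employee, Company -name-> string, Company -address-> string, Employee -name-> string.]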


Upper Bound Schemas

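[Figure: upper-bound schema graph with nodes Root, Company, Employee, string and Any. Edge labels: company, person, works-for, managed-by, “c.e.o. | employee”, “name | address | url”, “name | phone | position”, description (leading to Any) and the wildcard “-”.]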


The Two Questions to Ask

Conformance: does the data conform to the schema?

Classification: if so, which objects belong to which classes?


Graph Simulation

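Definition. Let G1, G2 be two edge-labeled graphs. A simulation is a relation R between the nodes of G1 and the nodes of G2 such that:

• if (x1, x2) in R and (x1, a, y1) in G1, then there exists (x2, a, y2) in G2 (same label a) such that (y1, y2) in R

[Figure: x1 and x2 related by R; an a-edge from x1 to y1 in G1 is matched by an a-edge from x2 to y2 in G2, with y1 and y2 again related by R.]

Note: a simulation can be computed efficiently [Henzinger et al. 1995]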
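The definition suggests a direct way to compute the largest simulation: start from all pairs of nodes and repeatedly delete pairs that violate the condition. The Python sketch below does exactly that; it is a naive fixpoint, not the efficient algorithm cited above, and the graph encoding (node -> list of (label, target)) and the tiny data and schema graphs are assumptions.

```python
def max_simulation(g1, g2):
    """Largest relation R such that every (x1, x2) in R satisfies the definition."""
    nodes1 = set(g1) | {t for es in g1.values() for (_, t) in es}
    nodes2 = set(g2) | {t for es in g2.values() for (_, t) in es}
    rel = {(x1, x2) for x1 in nodes1 for x2 in nodes2}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (a, y1) in g1.get(x1, []):
                # Need an a-edge out of x2 whose target still simulates y1.
                if not any(b == a and (y1, y2) in rel for (b, y2) in g2.get(x2, [])):
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel


# Hypothetical upper-bound schema and two small data graphs.
schema = {
    "Root": [("company", "Company")],
    "Company": [("name", "string"), ("address", "string"), ("url", "string")],
}
data_ok = {"r": [("company", "c1")], "c1": [("name", "s1"), ("url", "s2")]}
data_bad = {"r": [("company", "c1")], "c1": [("name", "s1"), ("fax", "s3")]}

print(("r", "Root") in max_simulation(data_ok, schema))
print(("r", "Root") in max_simulation(data_bad, schema))
```

The second data graph fails because the schema has no fax edge out of Company; removing (c1, Company) then forces (r, Root) out as well. Atomic values are not type-checked in this sketch.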


Using Simulation

Data graph D, schema S

• upper bound schema:
  – conformance: find simulation R from D to S
  – classification: check if (x,c) in R

• lower bound schema:
  – conformance: find simulation R from S to D
  – classification: check if (c,x) in R

[Buneman et al 1997]


Example

[Figure: three panels side by side, titled Database, Lower Bound and Upper Bound. They repeat the company/person data graph rooted at &r1, the lower-bound schema (Root, Company, Employee, string) and the upper-bound schema (Root, Company, Employee, string, Any).]

simulation: efficient technique for checking conformance to schema


Application 1: Improve Secondary Storage

Lower-bound schema

[Figure: the lower-bound schema graph with nodes Root, Company, Employee and string, and edges company, person, works-for, c.e.o., address, name and managed-by.]

Company
oid    name    address    c.e.o.
…      …       …          …
…      …       …          …

Employee
oid    name    managed-by    works-for
…      …       …             …
…      …       …             …

Store rest in overflow graph


Application 2: Query Optimization

Upper-bound schema [Fernandez, Suciu 1998]

[Figure: schema graph rooted at Bib with paper and book edges. The paper branch carries year (int), journal (string) and title (string); the book branch carries title (string), address with zip, city and street (strings), and author with lastname and firstname (strings).]

select X.title
from Bib._ X
where X.*.zip = “12345”

is rewritten to:

select X.title
from Bib.book X
where X.address.zip = “12345”


Schema Extraction (From Data)

Problem statement

• given data instance D

• find the “most specific” schema S for D

In practice: S too large, need to relax

[Nestorov et al. 1998]


Schema Extraction: Sample Data

[Figure: data graph with root &r, a company edge to &c, employee edges to &p1 … &p8, worksfor edges from the employees to &c, and manages / managedby edges among the employees.]


Lower Bound Schema Extraction

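[Figure: extracted lower-bound schema. Classes: Root = {&r}, Bosses = {&p1, &p4, &p6}, Regulars = {&p2, &p3, &p5, &p7, &p8}, Company = {&c}. Edges: Root -company-> Company; Root -employee-> Bosses and Regulars; Bosses -manages-> Regulars; Regulars -managedby-> Bosses; Bosses and Regulars -worksfor-> Company.]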


Upper Bound Schema Extraction: Data Guides

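[Figure: extracted data guide. Classes: Root = {&r}, Employees = {&p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8}, Bosses = {&p1, &p4, &p6}, Regulars = {&p2, &p3, &p5, &p7, &p8}, Company = {&c}. Edges: Root -company-> Company; Root -employee-> Employees; Employees -manages-> Regulars; Employees -managedby-> Bosses; Bosses -manages-> Regulars; Regulars -managedby-> Bosses; Employees, Bosses and Regulars -worksfor-> Company.]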
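A data guide can be obtained by a powerset-style construction: each schema node stands for the set of data nodes reachable by some label path, much like determinizing an automaton. The Python sketch below applies this to a scaled-down, invented version of the sample data (two employees instead of eight); the encoding and the function name dataguide are assumptions.

```python
def dataguide(graph, root):
    """Map each reachable set of data nodes to its labeled successor sets."""
    start = frozenset([root])
    guide = {}
    todo = [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue
        by_label = {}
        for node in state:
            for (label, target) in graph.get(node, []):
                by_label.setdefault(label, set()).add(target)
        guide[state] = {lbl: frozenset(ts) for (lbl, ts) in by_label.items()}
        todo.extend(guide[state].values())
    return guide


# Scaled-down sample data: p1 is a boss, p2 a regular employee.
graph = {
    "r": [("company", "c"), ("employee", "p1"), ("employee", "p2")],
    "p1": [("manages", "p2"), ("worksfor", "c")],
    "p2": [("managedby", "p1"), ("worksfor", "c")],
}

guide = dataguide(graph, "r")
for state, out in guide.items():
    print(sorted(state), {lbl: sorted(t) for (lbl, t) in out.items()})
```

The five resulting states play the roles of Root, Employees, Bosses, Regulars and Company; in the worst case the number of states can grow exponentially, which is one reason extraction may need to be relaxed in practice.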


Schemas in XML

• Document Type Definition (DTD)

• XML Schema

• RDF Schema


Document Type Definition: DTD

• part of the original XML specification

• an XML document may have a DTD

• terminology for XML:
  – well-formed: if tags are correctly closed
  – valid: if it has a DTD and conforms to it

• validation is useful in data exchange


DTDs as Grammars

<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>

<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> … </section>
    <section> … </section>
  </section>
</paper>
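Reading the DTD as a grammar means each content model is a regular expression over the sequence of child element names. The sketch below checks that by hand in Python: the content models are translated manually into regular expressions (it does not parse DTD syntax), and the names CONTENT and valid are hypothetical. Stray text inside paper or section elements is not checked.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: regular expressions over the child tag
# names, each name followed by a space. None stands for #PCDATA.
CONTENT = {
    "paper": r"(section )*",
    "section": r"(title (section )*|text )",
    "title": None,
    "text": None,
}


def valid(elem):
    """Check elem and its descendants against the content models."""
    if elem.tag not in CONTENT:
        return False
    model = CONTENT[elem.tag]
    kids = list(elem)
    if model is None:                 # #PCDATA: no child elements allowed
        return not kids
    sequence = "".join(k.tag + " " for k in kids)
    return re.fullmatch(model, sequence) is not None and all(valid(k) for k in kids)


good = ET.fromstring(
    "<paper><section><text>x</text></section>"
    "<section><title>t</title><section><text>y</text></section></section></paper>"
)
bad = ET.fromstring("<paper><section><text>x</text><title>t</title></section></paper>")
print(valid(good), valid(bad))
```

The second document is well-formed but not valid: a section may contain either a title followed by sections, or a single text, never both.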


DTDs as Schemas

Not so well suited:

• impose unwanted constraints on order:
  <!ELEMENT person (name,phone)>

• references cannot be constrained

• can be too vague:
  <!ELEMENT person ((name|phone|email)*)>


XML Schemas

• very recent proposal

• unifies previous schema proposals

• generalizes DTDs

• uses XML syntax

• two documents: structure and datatypes
  – http://www.w3.org/TR/xmlschema-1
  – http://www.w3.org/TR/xmlschema-2


XML Schemas

<elementType name="paper">
  <sequence>
    <elementTypeRef name="title"/>
    <elementTypeRef name="author" minOccurs="0"/>
    <elementTypeRef name="year"/>
    <choice> <elementTypeRef name="journal"/>
             <elementTypeRef name="conference"/>
    </choice>
  </sequence>
</elementType>

DTD: <!ELEMENT paper (title,author*,year, (journal|conference))>


RDF Schemas

• http://www.w3.org/TR/PR-rdf-schema (3/99)

• object-oriented flavor


RDF Schemas

• recall RDF data:
  – resources
  – properties

• RDF schema:
  – classes
  – properties

[Figure: an RDF statement, with its parts labeled subject, predicate, and object]

RDF Schemas

Data:

<rdf:Description ID="car001">
  <name> My Honda </name>
  <miles> 50000 </miles>
  <rdf:type resource="#MotorVehicle"/>
</rdf:Description>

RDF Schemas

Schema:

<rdf:Description ID="MotorVehicle">
  <rdf:type resource="#Class"/>
  <rdf:subClassOf resource="#Resource"/>
</rdf:Description>

<rdf:Description ID="Truck">
  <rdf:type resource="#Class"/>
  <rdf:subClassOf resource="#MotorVehicle"/>
</rdf:Description>
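
The data and schema descriptions above are both sets of (subject, predicate, object) statements, so they can be held in one collection and queried uniformly. The sketch below assumes a plain Python set of tuples and two hypothetical helpers, `superclasses` and `is_instance`; it is not an RDF library, and it drops the rdf: prefixes and the # in resource references for brevity.

```python
# Statements from the car001 data and the MotorVehicle/Truck schema above.
triples = {
    ("car001", "type", "MotorVehicle"),
    ("car001", "name", "My Honda"),
    ("car001", "miles", "50000"),
    ("MotorVehicle", "type", "Class"),
    ("MotorVehicle", "subClassOf", "Resource"),
    ("Truck", "type", "Class"),
    ("Truck", "subClassOf", "MotorVehicle"),
}

def superclasses(cls):
    """All classes reachable from cls by following subClassOf statements."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                todo.append(o)
    return found

def is_instance(resource, cls):
    """True if resource has type cls directly or through subClassOf."""
    direct = {o for s, p, o in triples if s == resource and p == "type"}
    return any(c == cls or cls in superclasses(c) for c in direct)

print(sorted(superclasses("Truck")))
print(is_instance("car001", "Resource"), is_instance("car001", "Truck"))
```

Because the schema statements live in the same set as the data, the same loop that reads car001's properties also reads the class hierarchy.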

RDF Schemas

[Figure: RDF graph. car001 has type MotorVehicle, name "My Honda", and miles 50000; Truck is a subClassOf MotorVehicle; Truck and MotorVehicle have type Class]

RDF Schemas

• different from object-oriented systems:
  – OO: define a class by its set of properties
  – RDF: define a property in terms of its classes

• metadata in RDF:
  – an RDF schema is itself described as RDF data

Summary of Schemas

• in SS data:
  – graph theoretic
  – data and schema are decoupled
  – used in data processing

• in XML:
  – from grammar to object-oriented
  – schema wired with the data
  – emphasis on semantics for exchange

Part 4: Systems Issues

• Semistructured data and XML

• Query languages

• Schemas

• Systems issues

• Conclusions

Systems Issues

• servers

• mediators

Servers for Semistructured Data / XML

• storage

• index

• query evaluation [McHugh, Widom 1999]

XML Storage

• text file (XML)

• store in ternary relation

• use DTD to derive schema

• mine data to derive schema

• build special purpose repository (Lore)

XML Storage: Text File

• advantages
  – simple
  – less space than one thinks
  – reasonable clustering

• disadvantages
  – no updates
  – requires a special-purpose query processor

Store XML in Ternary Relation

[Florescu, Kossmann 1999]

[Figure: data graph. &o1 has a paper edge to &o2; &o2 has title, author, author, and year edges to &o3, &o4, &o5, and &o6, whose values are "The Calculus", "…", "…", and "1986"]

Ref

Source  Label   Dest
&o1     paper   &o2
&o2     title   &o3
&o2     author  &o4
&o2     author  &o5
&o2     year    &o6

Val

Node  Value
&o3   The Calculus
&o4   …
&o5   …
&o6   1986
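
With the graph stored this way, a path expression such as paper.title turns into one self-join of Ref per step, plus a final join with Val. The sketch below loads the two tables above into an in-memory SQLite database to show the shape of such a query. SQLite and the column types are assumptions made for illustration, and the elided author values are kept as "…".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Ref (Source TEXT, Label TEXT, Dest TEXT);
    CREATE TABLE Val (Node TEXT, Value TEXT);
""")
conn.executemany("INSERT INTO Ref VALUES (?, ?, ?)", [
    ("&o1", "paper", "&o2"),
    ("&o2", "title", "&o3"),
    ("&o2", "author", "&o4"),
    ("&o2", "author", "&o5"),
    ("&o2", "year", "&o6"),
])
conn.executemany("INSERT INTO Val VALUES (?, ?)", [
    ("&o3", "The Calculus"),
    ("&o4", "…"),  # value elided on the slide
    ("&o5", "…"),  # value elided on the slide
    ("&o6", "1986"),
])

# paper.title: follow a paper edge, then a title edge, then read the value.
titles = conn.execute("""
    SELECT v.Value
    FROM Ref AS r1
    JOIN Ref AS r2 ON r2.Source = r1.Dest
    JOIN Val AS v  ON v.Node = r2.Dest
    WHERE r1.Label = 'paper' AND r2.Label = 'title'
""").fetchall()

print(titles)
```

Longer paths need more self-joins, which is the main cost of this generic layout.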

Use DTD to derive Schema

• DTD:

<!ELEMENT employee (name, address, project*)>
<!ELEMENT address (street, city, state, zip)>

• ODMG classes:

class Employee public type tuple (name:string, address:Address, project:List(Project))
class Address public type tuple (street:string, …)

• [Christophides et al. 1994, Shanmugasundaram et al. 1999]
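
The same derivation can be mimicked in a general-purpose language: each element with a fixed sequence of children becomes a record type, and a starred child becomes a list. The Python sketch below is only an analogy to the ODMG classes above. The Address fields come from the DTD; since the slide does not define project, it is modeled as plain text here as an assumption, and the sample employee is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET

@dataclass
class Address:
    # <!ELEMENT address (street, city, state, zip)>
    street: str
    city: str
    state: str
    zip: str

@dataclass
class Employee:
    # <!ELEMENT employee (name, address, project*)>
    name: str
    address: Address
    # project* becomes a list; each project is kept as text (assumption).
    project: List[str] = field(default_factory=list)

def load_employee(elem):
    """Build an Employee from an <employee> element that follows the DTD."""
    addr = elem.find("address")
    return Employee(
        name=elem.findtext("name"),
        address=Address(
            *(addr.findtext(tag) for tag in ("street", "city", "state", "zip"))
        ),
        project=[p.text for p in elem.findall("project")],
    )

# Hypothetical sample document.
doc = ET.fromstring(
    "<employee><name>Ann</name>"
    "<address><street>1 Main St</street><city>Springfield</city>"
    "<state>NJ</state><zip>00000</zip></address>"
    "<project>P1</project><project>P2</project></employee>"
)
emp = load_employee(doc)
print(emp)
```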

Mine Data to Derive Schema

[Figure: a data graph of paper nodes with author (fn, ln), title, and year children, and two relational tables, Paper1 and Paper2, mined from it]

fn1  ln1  fn2  ln2  title  year
X    X    X    X    X      -
X    X    -    -    X      X
X    X    -    -    X      -

author  title
X       X

[Deutsch et al. 1999]

Indexing Semistructured Data

• coercions: 1995 vs. "1995"

• regular path expressions
  – data guides [Goldman, Widom, 1997]
  – T-indexes [Milo, Suciu, 1999]

Indexing All Paths in the Data

[Figure, three panels: Semistructured Data (nodes 1 to 13; edge labels t, a, b, c, d); Data Guide; T-Index]
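
A data guide can be built much like determinizing an automaton: each state is the set of data nodes reached by one label path, so every label path in the data appears exactly once in the guide. The sketch below follows that subset-construction idea on a small hypothetical graph (not the one in the figure); the names `graph`, `build_dataguide`, and `lookup` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical edge-labeled data graph: node -> list of (label, child).
graph = {
    1: [("t", 2), ("t", 3), ("t", 4)],
    2: [("a", 5), ("b", 6)],
    3: [("a", 7)],
    4: [("a", 8), ("c", 9)],
}

def build_dataguide(root):
    """Map each state (a frozenset of data nodes) to {label: next state}."""
    start = frozenset([root])
    guide, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue
        # Group the children of all nodes in this state by edge label.
        by_label = defaultdict(set)
        for node in state:
            for label, child in graph.get(node, []):
                by_label[label].add(child)
        guide[state] = {lab: frozenset(kids) for lab, kids in by_label.items()}
        todo.extend(guide[state].values())
    return start, guide

START, GUIDE = build_dataguide(1)

def lookup(path):
    """Data nodes reached from the root by a label path, via the guide only."""
    state = START
    for label in path:
        state = GUIDE.get(state, {}).get(label, frozenset())
    return state

print(sorted(lookup(["t", "a"])), sorted(lookup(["t", "c"])), sorted(lookup(["t", "d"])))
```

Answering a path query then costs one guide lookup per label, independent of how many data nodes share that path.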

Mediators for Semistructured Data / XML

• XML = virtual view of Relational/OO/OR sources

• mediator = translation, integration

• issues:
  – query composition and rewriting [Papakonstantinou, et al. 1996]
  – limited source capabilities [Yerneni, et al. 1999]

Example: An XML Mediator

• relational database:

Store            SB             Book
sid   name       sid   bid      bid   title
…     …          …     …        …     …
…     …          …     …        …     …

• virtual XML view:

<store>
  <name> n1 </name>
  <book> ... </book>
  <book> ... </book>
  ...
</store>
<store>
  <name> n2 </name>
  <book> ... </book>
  <book> ... </book>
  …
</store>

Example: An XML Mediator

• specify mediator declaratively (a view):

from Store, SB, Book
where Store.sid=SB.sid and SB.bid=Book.bid
construct <store ID=f(Store.sid)>
            <name> Store.name </name>
            <book> Book.title </book>
          </store>
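
To see what this view denotes, the sketch below materializes it once over a tiny in-memory SQLite database. The sample rows, the wrapper element `stores`, and the use of a dictionary to play the role of the Skolem function f (one store element per sid) are assumptions made for illustration; a mediator would normally keep the view virtual rather than build it.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Store (sid INTEGER, name TEXT);
    CREATE TABLE SB (sid INTEGER, bid INTEGER);
    CREATE TABLE Book (bid INTEGER, title TEXT);
    INSERT INTO Store VALUES (1, 'n1'), (2, 'n2');
    INSERT INTO SB VALUES (1, 10), (1, 11), (2, 10);
    INSERT INTO Book VALUES (10, 'The Calculus'), (11, 'Algebra');
""")

# The from/where part of the view is an ordinary three-way join.
rows = conn.execute("""
    SELECT Store.sid, Store.name, Book.title
    FROM Store, SB, Book
    WHERE Store.sid = SB.sid AND SB.bid = Book.bid
    ORDER BY Store.sid, Book.bid
""").fetchall()

root = ET.Element("stores")  # hypothetical wrapper element
stores = {}                  # plays the role of f: one <store> per sid
for sid, name, title in rows:
    if sid not in stores:
        store = ET.SubElement(root, "store", ID=f"f({sid})")
        ET.SubElement(store, "name").text = name
        stores[sid] = store
    # Every joined row contributes one <book> to the store it belongs to.
    ET.SubElement(stores[sid], "book").text = title

print(ET.tostring(root, encoding="unicode"))
```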

Example: An XML Mediator

• users ask XML-QL queries:
  – find stores that sell "The Calculus"

where <store>
        <name> $n </name>
        <book> The Calculus </book>
      </store>
construct <result> $n </result>

Example: An XML Mediator

• system composes query with view:

from Store, SB, Book
where Store.sid=SB.sid and SB.bid=Book.bid and Book.title="The Calculus"
construct <result> Store.name </result>
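
The point of composition is that the composed query runs entirely at the relational source, so the XML view is never built. A minimal sketch of that last step, assuming an in-memory SQLite database with hypothetical sample rows, could look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Store (sid INTEGER, name TEXT);
    CREATE TABLE SB (sid INTEGER, bid INTEGER);
    CREATE TABLE Book (bid INTEGER, title TEXT);
    INSERT INTO Store VALUES (1, 'n1'), (2, 'n2'), (3, 'n3');
    INSERT INTO SB VALUES (1, 10), (1, 11), (2, 10), (3, 11);
    INSERT INTO Book VALUES (10, 'The Calculus'), (11, 'Algebra');
""")

# The composed query: the title condition is pushed into the join,
# so only matching store names leave the database.
names = conn.execute("""
    SELECT Store.name
    FROM Store, SB, Book
    WHERE Store.sid = SB.sid AND SB.bid = Book.bid AND Book.title = ?
    ORDER BY Store.sid
""", ("The Calculus",)).fetchall()

# The construct clause wraps each binding in a <result> element.
result = "".join(f"<result>{name}</result>" for (name,) in names)
print(result)
```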

Summary of Systems

• unclear today how XML will be used
  – materialized? Need servers
  – virtual? Need mediators

• most work is still ahead

Part 5: Conclusions

• Semistructured data and XML

• Query languages

• Schemas

• Systems issues

• Conclusions

Summary

• XML = what is out there

• semistructured data = what we can process

• paradigm shift, for both Web and db

• covered in tutorial:
  – data models, queries, schemas

Current and Future Technologies

• Web applications possible today:
  – export relational data to XML (e.g. Oracle)
  – import XML directly into applications

• Web applications in the future:
  – mediator technology (XML view)
  – store/process native XML data
  – compress XML
  – mine/analyze XML

Why This Is Cool for Database Researchers

• put to work what you teach in CS101!
  – tree traversals (structural recursion, XSL)
  – automata theory (DTDs, path expressions)
  – graph theory (simulation)

• adapt old DB tricks to new kind of data

• save the trees: from fax to XML

The End

Further Readings

www.w3.org/XML

www-db.stanford.edu/~widom

www-rocq.inria.fr/~abiteboul

db.cis.upenn.edu

www.research.att.com/~suciu

Abiteboul, Buneman, Suciu

Data on the Web: From Relational to Semistructured to XML

Morgan Kaufmann, 1999 (appears in October)